OPTIMIZATION OF LZ77 DATA COMPRESSION ALGORITHM

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 - 6375(Online), Volume 4, Issue 5, September - October (2013), © IAEME 42 OPTIMIZATION OF LZ77 DATA COMPRESSION ALGORITHM Hemraj Kumawat CSE,IIT Jodhpur, Jodhpur, India Jitendra Chaudhary CSE,IIT Jodhpur, Jodhpur, India ABSTRACT Data compression refers to reducing the amount of space needed to store data or reducing the amount of time needed to transmit data. Many data compression techniques allow encoding the compressed form of data with different compression ratio. In particular, in the case of LZ77 technique, it reduces the data concurrency of an input file. In the output of this technique it conveys more information that is actually not needed in practical. Removing the extra information from the encoded file that makes this algorithm more optimal. Our task is to identify how much extra information it conveys and how can we minimize it so that there is no trouble at the time of decoding. Basically the encoded output of LZ77 is the sequence of triplets (a structure of encoded output) that is in binary and having fix size. For making the triplets of fix size, sometimes we are creating unnecessary information. We present the method of variable triplet size as a way to improve LZ77 compression and demonstrate it through many experiments. In our optimization algorithm we are getting more compression ratio compare to the conventional LZ77 data compression algorithm. Keywords: Look-ahead Buffer: The look-ahead buffer contains characters yet to be encoded. This buffer starts where the Search buffer ends and during the algorithm the Search buffer extends into the look-ahead buffer. Match Length: The Match Length is the length of largest matching block in the look-ahead buffer. These pairs are called triplets, consisting of offset, matching length and code word of character. If the character is matching then next character code word is used, otherwise same character code word is used. Offset: The actual distance between the current position of the pointer and the look-ahead buffer is known as offset. Search Buffer: The Search Buffer represents the most recently encoded characters. Sliding Window: The Structure for Data manipulation, in which the Data is held The Sliding Window, is divided into two parts as Search buffer and look-ahead buffer. INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) ISSN 0976 – 6367(Print) ISSN 0976 – 6375(Online) Volume 4, Issue 5, September – October (2013), pp. 42-48 © IAEME: www.iaeme.com/ijcet.asp Journal Impact Factor (2013): 6.1302 (Calculated by GISI) www.jifactor.com IJCET © I A E M E

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 - 6375(Online), Volume 4, Issue 5, September - October (2013), © IAEME 43 1. INTRODUCTION LZ77 algorithm achieves compression by replacing repeated occurrences of data with references to a single copy of that data existing earlier in the input (uncompressed) data stream. A match is encoded by a pair of numbers called a length-distance pair. Some common convention and definition of the words that we are using in this paper. 2. CONVENTIONAL LZ77 ALGORITHM LZ77 compression algorithm exploits the fact that words and phrases within a text file are likely to be repeated. When there is repetition, they can be encoded as a pointer to an earlier occurrence, with the pointer accompanied by the number of characters to be matched. It is a very simple adaptive scheme that requires no prior knowledge of the source and seems to require no assumptions about the characteristics of the source. In the LZ77 approach the dictionary is simply a portion of the previously encoded sequence. The encoder examines the input sequence through a sliding window which consists of two parts: a search buffer that contains a portion of the recently encoded sequence and a look ahead buffer that contains the next portion of the sequence to be encoded. The algorithm searches the sliding window for the longest match with the beginning of the look-ahead buffer and outputs a reference (a pointer) to that match. It is possible that there is no match at all, so the output cannot contain just pointers. In LZ77 the reference is always represented as a triplet<o,l,c>, where ‘o’ is an offset to the match, ‘l’ is length of the match and ‘c’ is the next symbol after the match. If there is no match, the algorithm outputs a null-pointer (both the offset and the match length equal to 0) and the first symbol in the look-ahead buffer. The values of an offset to a match and length must be limited to some maximum constant. For this algorithm we have to define the length of the look-ahead buffer, search buffer. The symbol is usually encoded in 8 bit. More over the compression performance of LZ77 mainly depends on these values. Generally the search buffer length is more than the look-ahead-buffer size. So the total triplet size: While ( look-ahead Buffer not empty) { get a reference (position, length) to longest match; if (length > 0) { output (position, length, next symbol); shift the window length+1 positions along; } else { output (0, 0, first symbol in the look-ahead buffer); shift the window 1 character along; } } ST= [⌈log2(search buffer length)⌉] +[⌈log2(look-ahead buffer length)⌉]+8

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 - 6375(Online), Volume 4, Issue 5, September - October (2013), © IAEME 44 We can have better understanding with an example- “aacaacabcabaaac” For this example the size of look-ahead buffer is 6 and search buffer is 4. Triple Binary -- <0, 0, a> 00000011000001 -- <1, 1, c> 00100111000011 -- < 3, 4, b> 01110011000010 --< 3, 3, a> 01101111000001 -- <1, 2, c> 00101011000011 Sliding window( Size: 6 ) Longest match Next Character The triplet length for this example is 14(3+3+8). So here the encoded binary string of this example. 0000001100000100100111000011011100110000100110111100000100101011000011 triplet triplet triplet triplet triplet -------------------|-------------------|-----------------------|----------------------|--------------------| The decoding is much faster than the encoding in this process because we have to move our pointer with fixed length (triple length-14 for this example) and it is one of the important features of this process. From this way we get the triplets .Now we can easily decode to original data by reversing the encoding process. 3. OPTIMIZATION OF LZ77 As we have described earlier the triplets of LZ77 algorithm have fix size. In the case, when offset is equal to the matching length we can modify the structure of triplets and represent it with the new structure that have only <l,c> where l is the matching length and ‘c’ is the next symbol after the match .We called this new structure as “doublet” in this paper .All the things are same as the conventional LZ77 algorithm except replacing the triplet with doublet in the case of matching length equal to offset length. ࡿࡰ= log2(look-ahead buffer length)⌉] + 8 By replacing the triplet with doublet we are saving [⌈log2(search buffer length)⌉] number of bits per matching case. The exact algorithms is described in the block a a c a a c a b c a b a a a c a a c a a c a b c a b a a a c a a c a a c a b c a b a a a c a a c a a c a b c a b a a a c a a c a a c a b c a b a a a c

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 - 6375(Online), Volume 4, Issue 5, September - October (2013), © IAEME 45 While ( look-ahead Buffer not empty) { get a reference (position, length) to longest match; if (length > 0) { If (length== position){ output (length, next symbol); shift the window length+1 positions along; } else{ output (position, length, next symbol); shift the window length+1 positions along; }} else { output (0, first symbol in the look-ahead buffer); shift the window 1 character along; } } We will have better understanding of this optimized algorithm with the previous section example- “aacaacabcabaaac” Triplet/doublet Binary --<0, a> 00011000001 --<1, c> 00111000011 --< 3, 4, b> 01110011000010 --< 3, a> 01111000001 --<1, 2, c> 00101011000011 The length of triplet for this example is 14(3+3+8) and the doublet length is 11(3+8). So here the encoded binary string of this example 0001100000100111000011011100110000100111100000100101011000011 doublet doublet triplet doublet triplet -----------------|------------------|---------------|-----------------|-----------------------| The decoding process is slower than the conventional LZ77 Decoding. In this algorithm we have to move our pointer with the variable size (doublet and triplet length) .But the problem in decoding is how we will identify which one is doublet and which one is triplet. So the decoding process is described in the next section. Once we decode encoded file to triplets and doublet then we can easily get to original data by reversing the encoding (data to triplet/doublet) process. a a c a A c A b c a b a A a c a a c a A c A b c a b a A a c a a c a A c a b c a b a A a c a a c a A c A b c a b a a a c a a c a A c A b c a b a A a c

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 - 6375(Online), Volume 4, Issue 5, September - October (2013), © IAEME 46 4. DECODING PROCESS For decoding the binary string that contains doublets and triplets, first we have to identify which one is doublet’s binary string and which one is triplet binary string .To identify this we are using the concept of delimiter that will depend on the size of the sliding window length. We will put the delimiter before the doublet binary string. The effective doublet length is: doublet size + delimiter size. Now each binary substring is starting from the delimiter .Now for decoding we move the pointer from starting of the binary string and check whether the first consecutive k bits are equal to the delimiter or not where k is the length of delimiter .If it is equal then move the current pointer with the effective doublet length ahead and make this substring to the doublet substring otherwise move the pointer with the triplet length and make this substring to the triplet substring. So the decoding algorithm is given below Effective doublet length = doublet size + delimiter length //delimiter is a binary string that depends on the offset length Starting the moving pointer from 0 While(the moving pointer m<total binary length){ if((m to m+ delimiter’s length substring)==delimiter ) { Move the pointer m with effective doublet length ahead. Get the doublet binary substring form (m + delimiter’s length ) to (m + effective length) } else{ Move the pointer m with triplet length ahead. Get the triplet binary substring form m to (m + triplet length)} So here the above example’s binary string after using the delimiter before the doublets.(here we are using the delimiter=”1”) 1000110000011001110000110111001100001010111100000100101011000011 doublet double triplet doublet triplet ------------------|----------------|----------------|----------------------|----------------------| So the substring of doublet length just after the blue one is the doublet binary substring and the rest substrings of triple size are the triplet binary string .We can easily identify the delimiter (blue one for this example) by moving the pointer with appropriate length according the doublet and triplets. But what will happen when the offset’s first k bits are equal to the delimiter then this algorithm is not valid, where k is the length of delimiter .We cannot decompress the binary string from this delimiter. Let’s see this from an example: For maximum search buffer length=31 and look-ahead buffer=7 and delimiter =”111” Binary string 1110001100000111100011000010111010101100110111100111000111 |___|---doublet----||___|--doublet----||___|----triplet----------||___|--doublet--- From the starting we will see that first two strings are doublet string. The third substring is actually triplet substring but from our algorithm it reads out as the doublet substring because the first

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 - 6375(Online), Volume 4, Issue 5, September - October (2013), © IAEME 47 3 bits are equal to the delimiter. This will affect the rest decoding process .Eventually we will get the wrong decoding output. So we are bounding our delimiter size by: Delimiter size = no. of first consecutive 1’s in binary representation of sliding window length +1 All the bits in the delimiter are 1’s. From this Delimiter string we are getting the correct decoding output. 5. ALGORITHMS ANALYSIS C: Total no of Matching (offset=Matching Length) CR: Compression ratio Op LZ77: Optimized LZ77 Algorithm LZ77: Conventional LZ77 Algorithm D: % Difference between conventional LZ77 and Optimized LZ77= (Op LZ77 – LZ77) / LZ77 Max. Offset Value=maximum offset value= search buffer length 6. CONCLUSION As we can see from the above analysis that for maximum offset value (111) the improved compression ratio is 1.896 and for maximum offset value (223) the improved compression ratio is 1.154. From this analysis result we can see that maximum offset value is increasing, the improved compression ratio is decreasing. When we increase the search buffer length then the total number of matching (offset=matching length) will be decreased. The improved compression is directly proportional to the matching. The improved compression will be more when the matching is more. So generally this optimal algorithm is more effective, where either conventional LZ77 algorithm is not so optimal or the input has less repetition. Serial Number File size (in bytes) Comparison between conventional LZ77 and Optimized LZ77 Max. Offset Value=111 Max. Offset Value=223 Max. Offset Value=447 C CR C CR C CR LZ77 Op LZ77 D LZ77 Op LZ77 D LZ77 Op LZ77 D 1 61,305 2208 1.180 1.200 2.280 1099 1.293 1.313 1.547 621 1.400 1.417 1.130 2 4,21,144 11417 1.210 1.230 1.770 5514 1.318 1.333 1.153 3104 1.420 1.433 0.842 3 7,79,959 17044 1.197 1.213 1.370 6412 1.310 1.319 0.700 3710 1.386 1.393 0.521 4 73,246 2820 1.189 1.169 1.690 1398 1.265 1.275 0.800 742 1.334 1.338 0.341 5 1,09,684 3755 1.200 1.226 2.180 1857 1.317 1.336 1.469 1063 1.423 1.440 1.080 6 6,04,919 19281 1.264 1.291 2.160 8246 1.390 1.407 1.258 4053 1.507 1.519 0.800 7 1,51,610 3489 1.236 1.255 1.497 1430 1.349 1.360 0.830 779 1.453 1.462 0.591 8 1,30,725 3309 1.222 1.242 1.635 1494 1.347 1.360 1.020 750 1.465 1.475 0.660 9 4,27,180 14369 1.254 1.282 2.253 6673 1.389 1.409 1.437 3388 1.517 1.531 0.950 10 6,78,036 23537 1.157 1.182 2.130 10763 1.263 1.279 1.320 5683 1.362 1.374 0.903 Average 10123 1.211 1.229 1.896 4488 1.324 1.339 1.154 2389 1.426 1.438 0.783

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 - 6375(Online), Volume 4, Issue 5, September - October (2013), © IAEME 48 REFERENCES [1] IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. IT-23, NO. 3, MAY 1977 337A Universal Algorithm for Sequential Data Compression [2] International Symposium on Information Theory and its Applications, ISITA2006 Seoul, Korea, October 29–November 1, 2006 Improving LZ77 Data Compression using Bit Recycling [3] International Journal of Wisdom Based Computing, Vol. 1 (3), December 2011 68 A Comparative Study Of Text Compression Algorithms [4] Northwestern University Department of Electrical and Computer Engineering ECE 428: Information Theory Spring 2004 [5] http://www.stringology.org/DataCompression/lz77/index_en.html [6] http://www.zlib.net/feldspar.html • Links of the text files used in Analysis are given below (sorted by the index of the table) 1. http://www.gutenberg.org/cache/epub/28466/pg28466.txt 2. http://www.gutenberg.org/cache/epub/16728/pg16728.txt 3. http://www.gutenberg.org/cache/epub/9173/pg9173.txt 4. http://www.gutenberg.org/cache/epub/32482/pg32482.txt 5. http://www.gutenberg.org/files/25731/25731-0.txt 6. http://www.gutenberg.org/cache/epub/26598/pg26598.txt 7. http://www.gutenberg.org/cache/epub/28569/pg28569.txt 8. http://www.gutenberg.org/cache/epub/32962/pg32962.txt 9. http://www.gutenberg.org/cache/epub/23319/pg23319.txt 10. http://www.gutenberg.org/cache/epub/101/pg101.txt

OPTIMIZATION OF LZ77 DATA COMPRESSION ALGORITHM

More Related Content

What's hot

Viewers also liked

Similar to OPTIMIZATION OF LZ77 DATA COMPRESSION ALGORITHM

Recently uploaded

OPTIMIZATION OF LZ77 DATA COMPRESSION ALGORITHM