ACEEE Int. J. on Information Technology, Vol. 01, No. 03, Dec 2011

Simple Data Compression by Differential Analysis using Bit Reduction and Number System Theory

Debashish Chakraborty¹, Sandipan Bera², Anil Kumar Gupta², Soujit Mondal²
Department of Computer Science & Engineering¹ and Information Technology²
St. Thomas' College of Engineering & Technology, Kolkata-23, West Bengal, India
sunnydeba@gmail.com, sandipan.bera@gmail.com, anilgupta00749@gmail.com

Abstract— This is a simple compression algorithm based on number system theory and a file differential technique. It employs a technique which, unlike most others, is independent of the repetition and frequency of characters. In the algorithm the original file is broken into small sub-files using the differential technique, and each sub-file is then treated as an n-base number system, where n is the number of different characters in that sub-file. Compression is performed by converting this n-base number system to the binary number system. The main concept behind the algorithm is to save bits by converting a higher-base number system to a lower one. The compression and decompression processes are simple and computationally inexpensive.

Keywords— Number system, Bit saving, Differential technique.

I. INTRODUCTION

Data compression, or source coding, is the process of encoding information using fewer bits (or other information-bearing units) than a non-encoded representation would use, through specific use of encoding schemes [1,2,3,5]. It follows that the receiver must be aware of the encoding scheme in order to decode the data to its original form. The compression schemes that are designed are basically trade-offs among the degree of data compression, the amount of distortion introduced, and the resources (software and hardware) required to compress and decompress the data. The primary reason for compressing is to reduce the storage space required to save the data, or the bandwidth required to transmit it. Although storage technology has developed significantly over the past decade, the same cannot be said for transmission capacity. Data compression schemes may broadly be classified into (1) lossless compression and (2) lossy compression. Lossless compression algorithms usually exploit statistical redundancy in such a way as to represent the sender's data more concisely without error. Lossless compression is possible because most real-world data has statistical redundancy. Another kind of compression, called lossy data compression, is possible if some loss of fidelity is acceptable. It is important to note that in the case of lossy compression the original data cannot be reconstructed exactly from the compressed data, due to rounding off or removal of some parts of the data as a result of redundancies. These types of compression are widely used in image compression [10,11,12]. The theoretical background of compression is provided by information theory and by rate distortion theory [6,7,8,9].

This paper presents a new data compression algorithm. It does not depend heavily on character repetition. Although its compression ratio is not very high, the differential technique keeps the compression ratio at a nearly constant value. The main advantage of the technique is that it can further compress the output file produced after applying other compression techniques to a file, so the algorithm gives good results when hybridized with other algorithms.

II. ALGORITHM STRATEGY

At first we need to break the file into parts, or sub-files. The differential condition is that the file is broken whenever the number of distinct characters reaches n. Here n is an integer that may vary from file to file. Since the total number of different characters in a sub-file is n, we consider that sub-file as an n-base number system in which each character is one of the elements of the system. The requirement, then, is to index the different characters present in the sub-file from 0 to n-1. We pick g characters from the file at a time and, taking the index number of each character, calculate the decimal value of the group according to number system theory. This decimal value is then represented by s bits in binary format to obtain the compressed file. There is a definite relation between g and s, which is established by mathematical analysis in the next section. To get the best compression, proper selection of the parameters g and s is necessary.
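As a rough illustration of the breaking step described above, the following Python sketch (not the authors' code; the function name and interface are assumptions made for illustration) cuts an input string into sub-files, closing each one as soon as it contains n distinct characters:

    def split_by_distinct_chars(data: str, n: int) -> list[str]:
        """Break `data` into sub-files, each closed once it holds n distinct characters."""
        sub_files, current, seen = [], [], set()
        for ch in data:
            current.append(ch)
            seen.add(ch)
            if len(seen) == n:              # differential condition reached
                sub_files.append("".join(current))
                current, seen = [], set()   # start the next sub-file
        if current:                         # trailing sub-file with fewer than n symbols
            sub_files.append("".join(current))
        return sub_files

Each returned sub-file can then be encoded independently with its own character set, as formalized in Sections III and IV.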
III. CALCULATION

Let us take an n-base number system, where n can be any integer. The value of the maximum element of this number system is n-1. Now take g such elements of this number system. The maximum value that these g elements can represent is

Max_val = (n-1)*n^0 + (n-1)*n^1 + (n-1)*n^2 + ... + (n-1)*n^(g-1)
        = (n-1)*[(n^g - 1)/(n-1)]
        = n^g - 1.

Similarly, with s bits the maximum binary value is 2^s - 1. So, if we transfer this n-base number system to the binary number system, we require

n^g - 1 = 2^s - 1
=> n^g = 2^s
=> g*log(n) = s*log(2)
=> s/g = log(n)/log(2)
=> s/g = 3.322*log(n)   (logarithms taken to base 10).

Now, if we consider a file as a number system in which each character is an element of the system, then g bytes can actually be replaced by s bits; that is, 8g bits are replaced by s bits. Therefore

%compression = (1 - s/(8g))*100 %
             = (1 - 0.41525*log(n))*100 %,

where n is the number of different types of characters in the file.
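To make the relation between g, s and the expected saving concrete, the short Python sketch below (an illustration of the formulas above, not code from the paper; function names are assumptions) computes s and the resulting percentage saving:

    import math

    def bits_per_group(n: int, g: int) -> int:
        """s = ceil(3.322 * g * log10(n)): bits needed for one group of g symbols."""
        return math.ceil(3.322 * g * math.log10(n))

    def compression_percent(n: int, g: int) -> float:
        """Saving (1 - s/(8g))*100 using the rounded-up s; slightly below the
        ideal closed form (1 - 0.41525*log10(n))*100."""
        s = bits_per_group(n, g)
        return (1 - s / (8 * g)) * 100

    # Values used in the illustration of Section V: n = 8 distinct characters, g = 4
    print(bits_per_group(8, 4))        # 13
    print(compression_percent(8, 4))   # 59.375

For n = 8 and g = 4 this gives s = 13, the value used in Section V; the saving with the rounded-up s (about 59.4%) is slightly below the ideal 62.5% given by the closed formula.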
IV. ALGORITHM

Steps for Compression:
1. Begin
2. Set store[0] = (space)
3. Set count = 1
4. Take an empty one-dimensional array of size n. Let the array be store[ ].
5. Do until (end of file)
6. Read the file and pick the next character from the file.
7. If the character is not already present in the array store[ ] then go to step 8, else go to step 5.
8. Set store[count] = character
9. count = count + 1
10. If count = n then go to step 11, else go to step 5.
11. Break the file, including the last character which was read.
12. count = 1
13. Go to step 5
14. End of loop.
15. Apply steps 16 to 27 on each sub-file to get the compression.
16. Find out the different types of characters in the sub-file.
17. Store the different characters in an array: Sequence[ ] = {sorted character set}.
18. Count the number of different characters; call it n. The sub-file is now treated as an n-base number system.
19. g = a selected integer.
20. s = ceiling value of (3.322 * g * log(n)).
21. Do until (end of file)
22. Pick up g characters at a time from the file.
23. Find the position of each picked-up character within Sequence[ ] and store the positional values in another array pos[ ]. Let pos[ ] = {a0, a1, ..., a(g-2), a(g-1)}.
24. pos_val = a0*n^0 + a1*n^1 + ... + a(g-2)*n^(g-2) + a(g-1)*n^(g-1).
25. Represent pos_val with s bits in the binary system. Put these bits into a file Result.
26. Go to step 21.
27. End of loop.
28. The Result file is the compressed file.
29. End

Steps for Decompression:
1. Begin
2. Set j = 0
3. Sequence[ ] = {character set of the jth sub-file}
4. Do until end of the compressed string or file.
5. Take s bits at a time and calculate their decimal value.
6. Let the decimal value be val.
7. Divide val by n repeatedly until g remainders are stored in an array rem[ ]. Let rem[ ] = {c0, c1, c2, ..., c(g-2), c(g-1)}.
8. Set i = 0 and Flag = 0.
9. While (i < g and Flag = 0)
10. If rem[i] = Sequence[n-1] then go to step 11, else go to step 13.
11. Set Flag = 1.
12. Go to step 9.
13. i = i + 1
14. Go to step 9.
15. End of while loop.
16. If Flag = 1 then go to step 19, else go to step 17.
17. Map each element of the array rem[ ] with the Sequence[ ] built in the compression routine. With rem[ ] = {c0, c1, c2, ..., c(g-2), c(g-1)}, put Sequence[c0], Sequence[c1], ..., Sequence[c(g-1)] into a file Extract.
18. Go to step 4.
19. Map the 0th to ith elements of the array rem[ ] with the Sequence[ ] built in the compression routine, i.e. put Sequence[c0], Sequence[c1], ..., Sequence[ci] into the file Extract.
20. Set j = j + 1
21. Sequence[ ] = {character set of the jth sub-file}
22. Go to step 4.
23. End of loop.
24. Extract will be the decompressed file.
25. End
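The pseudocode above maps fairly directly onto the following Python sketch. It is a simplified reading of the routine, not the authors' implementation: it encodes and decodes a single sub-file whose length is a multiple of g, it uses the sorted character set of step 17 as the index order, and it ignores the sub-file boundary marker handled by steps 9-22 of the decompression routine. All names are illustrative.

    import math

    def compress_subfile(text: str, g: int) -> tuple[str, list[str]]:
        """Encode one sub-file as a bit string, g characters (base-n digits) at a time."""
        sequence = sorted(set(text))                 # Sequence[]: sorted character set
        n = len(sequence)
        s = math.ceil(3.322 * g * math.log10(n))     # bits per group of g characters
        index = {ch: i for i, ch in enumerate(sequence)}
        bits = ""
        for k in range(0, len(text), g):
            group = text[k:k + g]
            pos_val = sum(index[ch] * n**j for j, ch in enumerate(group))
            bits += format(pos_val, f"0{s}b")        # s-bit binary representation
        return bits, sequence

    def decompress_subfile(bits: str, sequence: list[str], g: int) -> str:
        """Reverse the encoding: split the bit string into s-bit groups and map back."""
        n = len(sequence)
        s = math.ceil(3.322 * g * math.log10(n))
        out = []
        for k in range(0, len(bits), s):
            val = int(bits[k:k + s], 2)
            for _ in range(g):                       # g remainders give the g indices
                out.append(sequence[val % n])
                val //= n
        return "".join(out)

For the string used in Section V, compress_subfile('Bob is a boy', 4) yields a 39-bit string that decompress_subfile maps back to the original text.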
V. ALGORITHM ILLUSTRATION

For example, consider a text like 'Bob is a boy'. The total number of characters is 12, but the character set is {B, o, b, i, s, a, y, space}. The characters can be indexed as {B=0, o=1, b=2, i=3, s=4, a=5, y=6, and 7 for space}. Since the number of characters in the character set is 8, n=8; take g=4. So s = 3.322*(log 8)*4 = 12.00243, and taking the ceiling value we get s=13. Now take 4 characters from the original string at a time and calculate the decimal value of their indices. The first 4 characters are {B, o, b, space}, their index positions are {0, 1, 2, 7}, and the decimal value is calculated as 0*8^0 + 1*8^1 + 2*8^2 + 7*8^3 = 3720. Converting 3720 into binary as a 13-bit value gives (0111010001000), which is put into the output file. Representing the text in this way, the total size of the compressed file is 13*3 = 39 bits, whereas the original size of the text was 8*12 = 96 bits. In the decompression process, the compressed bit string is converted back into its decimal value: (0111010001000)2 = (3720)10. Converting 3720 into the n=8 base number system gives the digits {7,2,1,0}; reversing them gives {0,1,2,7}. After mapping these back through the index, we recover the original characters {B, o, b, space}. In this way the whole decompression process is carried out.
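The arithmetic of this example can be checked with a few lines of Python (a throwaway sketch, not part of the paper; the index assignment below follows the paper's ordering B=0, o=1, b=2, ..., space=7 rather than a sorted one):

    # Index assignment used in the illustration above
    index = {'B': 0, 'o': 1, 'b': 2, 'i': 3, 's': 4, 'a': 5, 'y': 6, ' ': 7}
    n, g = 8, 4

    group = "Bob "                                   # first g characters of 'Bob is a boy'
    pos_val = sum(index[ch] * n**j for j, ch in enumerate(group))
    print(pos_val)                                   # 3720
    print(format(pos_val, "013b"))                   # 0111010001000

    # Decompression: recover the indices as successive remainders modulo n
    val, digits = pos_val, []
    for _ in range(g):
        digits.append(val % n)
        val //= n
    print(digits)                                    # [0, 1, 2, 7] -> 'B', 'o', 'b', ' '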
VI. ALGORITHM DISCUSSION

A. Character Set Size and Compression Gain
The graph in Fig. 1 shows how the compression gain varies with the size of the character set when the differential technique is not applied, i.e., with no breaking of files. The algorithm gives better compression when the character set is smaller. Here we can find some similarity with other well-known statistical data compression techniques such as the Huffman algorithm.

Fig. 1 Number of distinct characters in a file vs. compression gain, without the differential technique.

B. Differential Technique and Compression Gain
Table I and the graph in Fig. 2 show that when the differential technique is applied we obtain an essentially constant compression gain. The selection of the number of characters (the variable n in the compression routine) on which the breaking of the file depends is very important. This number is a trade-off between the compression gain and the number of file breakings. If the number is very small, the number of file breakings increases, which greatly increases the complexity, although it also improves the compression gain. Similarly, if the selected number is very large, the number of file breakings decreases, which reduces the complexity, but the compression gain deteriorates. So the selected number should be a value at which the algorithm performs with a moderate compression gain and complexity.

Fig. 2 Number of distinct characters in a file vs. compression gain, with the differential technique applied (differential condition: n=20).

TABLE I

Fig. 3 Compression gain of the proposed algorithm compared with existing algorithms.

VII. CONCLUSIONS

In this paper a new data compression algorithm is introduced. The unique feature of this algorithm is its simplicity. An entirely different technique, that of 'saving bits', is employed to reduce the size of text files. Since every character is accounted for, the output codes do not depend upon repetition, unlike most other compression algorithms. Different combinations of characters can be represented by fewer bits. After the code formation, ASCII replaces the binary numbers, which finally reduces the file size. The compression algorithm takes O(n) time, where n is the total number of characters in the file. Since the differential breaking follows a divide-and-conquer policy, it takes O(n log n) time. So the total computation time required for this algorithm is proportional to O(n log n). Quite a lot of research and study led us to the conclusion that there are no other data compression algorithms that lay emphasis on differential compression based on number theory and bit reduction.

ACKNOWLEDGMENT

First, we would like to thank Professor Subarna Bhattacharjee [4] for her valuable advice, support and constant encouragement. Her constant criticisms and reviews gave us conceptual clarity. We owe a significant lot to all faculty members of the Department of Computer Science and Engineering and the Department of Information Technology. It was in one such class of Design and Analysis of Algorithms that we first envisioned the above algorithm. We would also
like to thank our friends for patiently enduring our explanations. Their reviews and comments were extremely helpful. And of course, we owe our ability to complete this project to our families, whose love and encouragement have been our cornerstone.

REFERENCES

[1] J. Ziv and A. Lempel, "Compression of individual sequences via variable-rate coding", IEEE Transactions on Information Theory, Vol. 24, pp. 530-536, 1978.
[2] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression", IEEE Transactions on Information Theory, Vol. 23, pp. 337-343, May 1977.
[3] Gonzalo Navarro and Mathieu Raffinot, "A General Practical Approach to Pattern Matching over Ziv-Lempel Compressed Text", Proc. CPM'99, LNCS 1645, pp. 14-36.
[4] S. Bhattacharjee, J. Bhattacharya, U. Raghavendra, D. Saha, P. Pal Chaudhuri, "A VLSI architecture for cellular automata based parallel data compression", IEEE-2006, Bangalore, India, Jan 03-06.
[5] Khalid Sayood, "An Introduction to Data Compression", Academic Press, 1996.
[6] David Salomon, "Data Compression: The Complete Reference", Springer, 2000.
[7] M. Atallah and Y. Genin, "Pattern matching text compression: Algorithmic and empirical results", International Conference on Data Compression, Vol. II, pp. 349-352, Lausanne, 1996.
[8] Mark Nelson and Jean-Loup Gailly, "The Data Compression Book", Second Edition, M&T Books.
[9] Timothy C. Bell, "Text Compression", Prentice Hall, 1990.
[10] Ranjan Parekh, "Principles of Multimedia", Tata McGraw-Hill, 2006.
[11] Amiya Halder, Sourav Dey, Soumyodeep Mukherjee and Ayan Banerjee, "An Efficient Image Compression Algorithm Based on Block Optimization and Byte Compression", ICISA-2010, Chennai, Tamilnadu, India, pp. 14-18, Feb 6, 2010.
[12] Ayan Banerjee and Amiya Halder, "An Efficient Image Compression Algorithm Based on Block Optimization, Byte Compression and Run-Length Encoding along Y-axis", IEEE ICCSIT 2010, Chengdu, China, IEEE Computer Society Press, July 9-11, 2010.
[13] Debashis Chakraborty, Sandipan Bera, Anil Kumar Gupta and Soujit Mondal, "Efficient Data Compression using Character Replacement through Generated Code", IEEE NCETACS 2011, Shillong, India, March 4-5, 2011, pp. 31-34.

© 2011 ACEEE. DOI: 01.IJIT.01.03.537
