Data Compression Manish T I
• If a data item d occurs n consecutive times in the input stream, replace the n occurrences with the single pair nd. • The n consecutive occurrences of a data item are called a run length of n, and this approach to data compression is called run-length encoding or RLE. Run Length Encoding
We have to adopt the convention that only three or more repetitions of the same character will be replaced with a repetition factor. The main problems with this method are the following: • In plain English text there are not many repetitions. • There are many “doubles” but a “triple” is rare. • The most repetitive character is the space. • Dashes or asterisks may sometimes also repeat. • In mathematical texts, digits may repeat.
• Example Paragraph The abbott from Abruzzi accedes to the demands of all abbesses from Narragansett and Abbevilles from Abyssinia. He will accommodate them, abbreviate his sabbatical, and be an accomplished accessory. • The character “@” may be part of the text in the input stream, in which case a different escape character must be chosen. • Sometimes the input stream may contain every possible character in the alphabet. • Example An example is an object file, the result of compiling a program. Such a file contains machine instructions and can be considered a string of bytes that may have any values.
• Since the repetition count is written on the output stream as a byte, it is limited to counts of up to 255. • This limitation can be softened somewhat when we realize that the existence of a repetition count means that there is a repetition (at least three identical consecutive characters). • We may adopt the convention that a repeat count of 0 means three repeat characters, which implies that a repeat count of 255 means a run of 258 identical characters.
• The MNP class 5 method was used for data compression in old modems. • It has been developed by Microcom, Inc., a maker of modems (MNP stands for Microcom Network Protocol), and it uses a combination of run-length and adaptive frequency encoding.
Performance We assume that the string contains M repetitions of average length L each. Each of the M repetitions is replaced by 3 characters (escape, count, and data) Size of the compressed string is N − M × L +M ×3 = N −M(L − 3) Compression factor = N / N −M(L − 3)
Digram Encoding • A variant of run length encoding for text is digram encoding. • This method is suitable for cases where the data to be compressed consists only of certain characters, e.g., just letters, digits, and punctuation. • Good results can be obtained if the data can be analyzed beforehand. • “E”, “T”, “TH”, and “A”, occur often.
Pattern Substitution For compressing computer programs, where certain words, such as for, repeat, and print, occur often. Each such word is replaced with a control character or, if there are many such words, with an escape character followed by a code character. Assuming that code “a” is assigned to the word print, the text “m:print,b,a;” will be compressed to “m:@a,b,a;”.
Relative Encoding [Differencing] • Successive temperatures normally do not differ by much, so the sensor needs to send only the first temperature, followed by differences. The sequence of temperatures 70, 71, 72.5, 73.1, . . can be compressed to 70, 1, 1.5, 0.6, . . .. The differences are small and can be expressed in fewer bits.
The sequence 110, 115, 121, 119, 200, 202, . . . can be compressed to 110, 5, 6,−2, 200, 2, . . . . Now need to distinguish between a difference and an actual value. The compressor creating an extra bit (a flag) for each number sent, accumulating those bits, and sending them to the de compressor from time to time, as part of the transmission. Assuming that each difference is sent as a byte, the compressor should follow (or precede) a group of 8 bytes with a byte consisting of their 8 flags.
Another practical way to send differences mixed with actual values is to send pairs of bytes. Each pair is either an actual 16-bit measurement (positive or negative) or two 8-bit signed differences. Thus actual measurements can be between 0 and ±32K and differences can be between 0 and ±255. For each pair, the compressor creates a flag: 0 if the pair is an actual value, 1 if it is a pair of differences. After 16 pairs are sent, the compressor sends the 16 flags.
• The sequence of measurements 110, 115, 121, 119, 200, 202, . . . is sent as (110), (5, 6), (−2,−1), (200), (2, . . .), where each pair of parentheses indicates a pair of bytes. • The −1 has value 11111111 (binary) , which is ignored by the de-compressor (it indicates that there is only one difference in this pair).
Reference:-  Data Compression: The Complete Reference, David Salomon, Springer Science & Business Media, 2004 For any queries contact: Web: www.iprg.co.in E-mail: manishti2004@gmail.com Facebook: @ImageProcessingResearchGroup

Data Compression - Text Compression - Run Length Encoding

  • 1.
  • 2.
    • If adata item d occurs n consecutive times in the input stream, replace the n occurrences with the single pair nd. • The n consecutive occurrences of a data item are called a run length of n, and this approach to data compression is called run-length encoding or RLE. Run Length Encoding
  • 4.
    We have toadopt the convention that only three or more repetitions of the same character will be replaced with a repetition factor. The main problems with this method are the following: • In plain English text there are not many repetitions. • There are many “doubles” but a “triple” is rare. • The most repetitive character is the space. • Dashes or asterisks may sometimes also repeat. • In mathematical texts, digits may repeat.
  • 5.
    • Example Paragraph Theabbott from Abruzzi accedes to the demands of all abbesses from Narragansett and Abbevilles from Abyssinia. He will accommodate them, abbreviate his sabbatical, and be an accomplished accessory. • The character “@” may be part of the text in the input stream, in which case a different escape character must be chosen. • Sometimes the input stream may contain every possible character in the alphabet. • Example An example is an object file, the result of compiling a program. Such a file contains machine instructions and can be considered a string of bytes that may have any values.
  • 6.
    • Since therepetition count is written on the output stream as a byte, it is limited to counts of up to 255. • This limitation can be softened somewhat when we realize that the existence of a repetition count means that there is a repetition (at least three identical consecutive characters). • We may adopt the convention that a repeat count of 0 means three repeat characters, which implies that a repeat count of 255 means a run of 258 identical characters.
  • 7.
    • The MNPclass 5 method was used for data compression in old modems. • It has been developed by Microcom, Inc., a maker of modems (MNP stands for Microcom Network Protocol), and it uses a combination of run-length and adaptive frequency encoding.
  • 8.
    Performance We assume thatthe string contains M repetitions of average length L each. Each of the M repetitions is replaced by 3 characters (escape, count, and data) Size of the compressed string is N − M × L +M ×3 = N −M(L − 3) Compression factor = N / N −M(L − 3)
  • 9.
    Digram Encoding • Avariant of run length encoding for text is digram encoding. • This method is suitable for cases where the data to be compressed consists only of certain characters, e.g., just letters, digits, and punctuation. • Good results can be obtained if the data can be analyzed beforehand. • “E”, “T”, “TH”, and “A”, occur often.
  • 10.
    Pattern Substitution For compressingcomputer programs, where certain words, such as for, repeat, and print, occur often. Each such word is replaced with a control character or, if there are many such words, with an escape character followed by a code character. Assuming that code “a” is assigned to the word print, the text “m:print,b,a;” will be compressed to “m:@a,b,a;”.
  • 11.
    Relative Encoding [Differencing] •Successive temperatures normally do not differ by much, so the sensor needs to send only the first temperature, followed by differences. The sequence of temperatures 70, 71, 72.5, 73.1, . . can be compressed to 70, 1, 1.5, 0.6, . . .. The differences are small and can be expressed in fewer bits.
  • 12.
    The sequence 110,115, 121, 119, 200, 202, . . . can be compressed to 110, 5, 6,−2, 200, 2, . . . . Now need to distinguish between a difference and an actual value. The compressor creating an extra bit (a flag) for each number sent, accumulating those bits, and sending them to the de compressor from time to time, as part of the transmission. Assuming that each difference is sent as a byte, the compressor should follow (or precede) a group of 8 bytes with a byte consisting of their 8 flags.
  • 13.
    Another practical wayto send differences mixed with actual values is to send pairs of bytes. Each pair is either an actual 16-bit measurement (positive or negative) or two 8-bit signed differences. Thus actual measurements can be between 0 and ±32K and differences can be between 0 and ±255. For each pair, the compressor creates a flag: 0 if the pair is an actual value, 1 if it is a pair of differences. After 16 pairs are sent, the compressor sends the 16 flags.
  • 14.
    • The sequenceof measurements 110, 115, 121, 119, 200, 202, . . . is sent as (110), (5, 6), (−2,−1), (200), (2, . . .), where each pair of parentheses indicates a pair of bytes. • The −1 has value 11111111 (binary) , which is ignored by the de-compressor (it indicates that there is only one difference in this pair).
  • 15.
    Reference:-  Data Compression:The Complete Reference, David Salomon, Springer Science & Business Media, 2004 For any queries contact: Web: www.iprg.co.in E-mail: manishti2004@gmail.com Facebook: @ImageProcessingResearchGroup