Chapter 5. Compression

5.1. Television Listing

  • 8:00 PM 2 (IRS) Beverly Hills Model Patrol— New lip gloss introduced.
    • 5 (QUS) Cash Calliope: Musical Detective— Death with a Capital D-minor.
    • 9 (PVC) Northern Cops— Town council bans eccentrics at town meeting, but not for long.
    • 14 (TTV) Def N B— Beethoven raps for the Queen.
  • 9:00 PM 2 (IRS) Super Hero Bunch— Evil just keeps coming back for more.
    • 5 (QUS) Sniffmaster Spot— Spot discovers toxic waste at Acme Dog Food Plant.
    • 9 (PVC) Mom's a Klepto— Family stress as Mom plagiarizes daughter's English paper.
    • 14 (TTV) Easy Cheesy— Customer asks for Triple Anchovy pizza.
  • 10:00 PM 2 (IRS) X Knows Best— Alien stepdad shows love is not Earthbound.
    • 5 (QUS) Dum De Dum Dum— Detective Gump meets murdering publisher.
    • 9 (PVC) Betrayal Place— Bob betrays Jane.
    • 14 (TTV) Beverly Hills Astronaut— Buzz discovers there are no malls in Space!

5.2. Patterns and Compression

Life often reduces to formulas. At least it does on television, where the solutions appear every 30 or 60 minutes. When you know the formula, it is very easy to summarize information or compress it. A network executive reportedly commissioned the television show “Miami Vice” with a two-word memo to the producer reading, “MTV Cops.” You can easily specify an episode of “Gilligan's Island” with a single sentence like, “The one with the cosmonauts.” Anyone who's seen even one episode of the show will know that some cosmonauts appear on the island and offer some hope of rescue, but that this hope will be dashed at the end when Gilligan screws things up.

Compressing generic information is also just a matter of finding the right formula that describes the data. It is often quite easy to find a formula that works moderately well, but it can be maddeningly difficult to identify one that compresses the data very well. Finding a formula that works well for specific types of data like text or video is often economically valuable. People are always looking for good ways to cut their data storage and communications costs.

Compressing data is of great interest to anyone who wants to hide data for four reasons:

  • Less data is easier to handle. This speaks for itself. Basic text can easily be compressed by 50 to 70%. Images might be compressed by 90%.
  • Compressed data is usually whiter. Compression shouldn't destroy information in a signal. This means that the information per bit should increase if the size of the file decreases. More information per bit usually appears more random. Details about measuring information are on page 31.
  • Reversing compression can mimic data. Compression algorithms try to find a formula that fits the data and then return the specific details of the formula as compressed data. If you input random data into a compression function, it should spit out data that fits the formula. Page 183 shows how the JPEG algorithm can identify just how much space can be exploited in an image.
  • Compression algorithms identify noise. Good compression algorithms understand how human perception works. Algorithms like JPEG or MP3 can strip away extra information from a file that a human doesn't notice. This information can also be exploited by steganographers to locate places where a significant amount of noise might be replaced by a hidden message.

Of course, this approach is dangerous as well. If the JPEG algorithm strips away some information, then any information you insert in that location is just as liable to be stripped away. An attacker or a well-meaning programmer along the path could compress a file to save space and destroy your message.[1] This makes the technique dangerous for weakly protected data like watermarks, but potentially useful if you can be reasonably sure the file won't be compressed along the path.

1 WebTV's network, for instance, will strip away higher-order data from images if it won't show up on a television set.

One watermarking algorithm in Section 14.7.1 deliberately aims to hide information in the most important parts of the image to avoid being destroyed during compression.

Compression is an important tool for these reasons. Many good commercial compression programs already exist simply because of the first reason. Many good encryption programs use compression as an additional source of strength. Mimicking, though, is why compression is discussed in depth in this book. Some of the basic compression algorithms provide a good way to make information look like something else. This trick of flipping the algorithm on its head is discussed in Chapter 6.

A number of techniques for compressing data are in use today. The field has expanded wildly over the last several years because of the great economic value of such algorithms. A procedure that compresses data in half can double the storage area of a computer with no extra charge for hardware. People continue to come up with new and often surprisingly effective techniques for compressing data, but it all comes down to the basic process of identifying a formula that does a good job of fitting the data. The parameters that make the formula fit the data become the compressed surrogate. Some of the more popular techniques are:

  • Probability Methods These count up the occurrences of characters or bytes in a file. Then they assign a short code word to the most common characters and a long one to the least common ones. Morse code is a good example of a compression algorithm from this class. The letter “e”, which is the most common in the English language, is encoded as a dot. The letter “p”, which is less common, is encoded as dot-dash-dash-dot. The Huffman code is the best-known member of this class.
  • Dictionary Methods These algorithms compile a list of the most common words, phrases, or collections of bytes in a file, then number the words. If a word is on this list, then the compressed file simply contains the number pointing to the dictionary entry. If it isn't, the word is transmitted without change. This technique can be quite effective if the data file has a large amount of text. Some report compressing text to 10 to 20% of its original size. The Lempel-Ziv compression algorithm is the most commonly used version of this algorithm.
  • Run-Length Encoding Many images are just blocks of black pixels and white pixels. If you walk along a line, you might encounter 1000 white pixels followed by 42 black pixels followed by 12 white pixels, etc. Run-length encoding stores this as the sequence of numbers 1000, 42, 12, etc. This often saves plenty of space and works well for black-and-white line art; a short code sketch follows this list. Faxes use this technique extensively.
  • Wave Methods These algorithms use a collection of waves as the basic collection of formulas. Then they adjust the size and position of the waves to best fit the data. These work quite well with images that do not need to be reconstructed exactly. The new image only needs to approximate the original. The JPEG, JPEG2000 and MPEG image and video compression standards are three of the more famous examples of this technique. Chapter 14 investigates the information-hiding capabilities of wavelets.
  • Fractal Methods Fractal functions produce extremely complicated patterns from very simple formulas. This means that they can achieve extremely high compression if you can find the formula that fits your data. A good introduction to fractal compression can be found in [BS88, Bar88, Bar93].
  • Adaptive Compression Schemes Many compression schemes can be modified to adapt to the changing data patterns. Each of the types described here comes in versions that modify themselves in the middle of the data stream to adapt to new patterns.
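
Run-length encoding is simple enough to show in a few lines. Here is a minimal sketch in Python (the function names are my own, not part of any standard) that turns a line of black-and-white pixels into a list of run lengths and back again:

    def run_length_encode(pixels):
        """Collapse a line of 0/1 pixels into a list of run lengths.
        The first run is assumed to be white (0); a line that starts
        with a black pixel simply begins with a run of length zero."""
        runs = []
        current, count = 0, 0
        for p in pixels:
            if p == current:
                count += 1
            else:
                runs.append(count)
                current, count = p, 1
        runs.append(count)
        return runs

    def run_length_decode(runs):
        """Rebuild the pixel line from the list of run lengths."""
        pixels, color = [], 0
        for r in runs:
            pixels.extend([color] * r)
            color = 1 - color
        return pixels

    line = [0] * 1000 + [1] * 42 + [0] * 12
    print(run_length_encode(line))                        # [1000, 42, 12]
    assert run_length_decode(run_length_encode(line)) == line

The 1,054 pixels collapse into three small numbers, which is why the technique works so well for line art with long stretches of a single color.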

All of these compression schemes are useful in particular domains. There is no universal algorithm with a set of functions that adapts well to any kind of data. So people modify existing algorithms and come up with their own formulas.

Compression functions make good beginnings for people who want to hide data because the functions were constructed to describe patterns. There are two ways to use compression functions successfully to hide information. One is to mold the hidden data into the form of other data so it blends in. A compression function that works well on zebras can model black and white stripes and convert a set of stripes into a simple set of parameters. If you had such a function, it could be applied to some data in reverse, and it would expand the data into zebra stripes. The result would be bigger, but it would look like something else. The data could be recovered by compressing it again.

Compression techniques can also be used to identify the least important nooks and crannies of a file so that extra data can be snuck into them. Many image-compression functions are designed to be lossy. That means that the reconstructed image may look very similar to the original image, but it won't be exactly the same. If the functions that describe an image can be fitted more loosely, then the algorithms can use fewer of them and produce a smaller compressed output. For instance, an apple might be encoded as a blob of solid red instead of a smooth continuum of various shades of red. When the image is decompressed, much of the smaller detail is lost but the overall picture still looks good. These compression functions can easily compress an image to be one-fifth to one-tenth of its original size. This is why they are so popular.

The television listings from the beginning of the chapter are an example of lossy compression. The summaries are not enough to recreate an entire program; they are a surrogate that captures only the essentials.

5.2.1. Huffman Coding

A good way to understand basic compression is to examine a simple algorithm like Huffman coding. This technique analyzes the frequency with which each letter occurs in a file and then replaces each letter with a variable-length code word. Normally, each letter is stored as a byte, which takes up 8 bits of information. Some estimates of the entropy of standard English, though, show that it is just over 3 bits per letter. Obviously there is room to squeeze out almost 5/8ths of a file of English text. The trick is to assign the short code words to the common letters and the long code words to the least common letters. Although some of the long code words will end up being longer than 8 bits, the net result will still be shorter because the common letters have the greatest effect.

Table 5.1 shows the occurrences of letters in several different opinions from the United States Supreme Court. The space is the most common character, followed by the letter “E”. This table was constructed by mixing lower- and uppercase letters for simplicity. An actual compression function would keep separate entries for each form as well as an entry for every type of punctuation mark. In general, there would be 256 entries, one for each possible byte value.

Table 5.1. The frequency of occurrence of letters in a set of opinions generated by the U.S. Supreme Court.
Letter  Frequency    Letter  Frequency
space   26974        A       6538
B       1275         C       3115
D       2823         E       9917
F       1757         G       1326
H       3279         I       6430
J       152          K       317
L       3114         M       1799
N       5626         O       6261
P       2195         Q       113
R       5173         S       5784
T       8375         U       2360
V       928          W       987
X       369          Y       1104
Z       60

Table 5.2 shows a set of codes that were constructed for each letter using the data in Table 5.1. The most common character, the space, gets a code that is only 2 bits long: 01. Many of the other common characters get codes that are 4 bits long. The least common character, “Z”, gets an 11-bit code: 00011010001. If these codes were used to encode data, then it should be easy to reduce a file to less than one-half of its original size.

Table 5.2. The codes constructed from Table 5.1. A Huffman tree based on these codes is shown in Figure 5.2.
Letter  Code           Letter  Code
space   01             A       1000
B       111011         C       10110
D       11100          E       0000
F       001101         G       111010
H       00111          I       1001
J       0001101001     K       000110101
L       10111          M       001100
N       1101           O       1010
P       000101         Q       00011010000
R       1111           S       1100
T       0010           U       000100
V       0001111        W       0001110
X       0001110        Y       0001100
Z       00011010001

Here's a simple example that takes the 48 bits used to store the word “ARTHUR” in normal ASCII down to 27 bits in compressed form:

Letter: A R T H U R
ASCII: 01000001 01010010 01010100 01001000 01010101 01010010
Compressed: 1000 1111 0010 00111 000100 1111
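
As a quick check on the arithmetic, here is a small Python sketch that encodes a word with a table of code words. Only the letters needed for “ARTHUR” are included; a real coder would carry all 27 entries from Table 5.2.

    # A few of the code words from Table 5.2.
    codes = {"A": "1000", "R": "1111", "T": "0010",
             "H": "00111", "U": "000100"}

    word = "ARTHUR"
    compressed = "".join(codes[letter] for letter in word)
    print(compressed, len(compressed), "bits")   # 27 bits versus 48 bits of ASCII

Decompression walks the bits from left to right, peeling off a letter whenever the accumulated bits match a code word; because no code word is a prefix of another, there is never any ambiguity.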

The Huffman algorithm can be used to compress any type of data, but its effectiveness varies. For instance, it could be used on a photograph where the intensity at each pixel is stored as a byte. The algorithm would be very effective on a photograph dominated by a few basic values of black and white, but it wouldn't work well if the intensities were evenly distributed across many shades between dark and light. In general, the algorithm works best when a few values dominate.

More sophisticated versions of the Huffman code exist. It is common to construct second-order codes that aggregate pairs of letters. This can be done in two ways. The easiest way to do this is to simply treat each pair of letters as the basic atomic unit. Instead of constructing a frequency table of characters, you would construct a table of pairs. The table would be much larger, but it would generate even better compression because many of the pairs would rarely occur. Pairs like “ZF” are almost nonexistent.

Another way is to construct 26 different tables by analyzing which letters follow other letters. One table for the letter “T” would hold the frequencies with which the other letters follow a “T”. The letter “H” would be quite common in this table because “TH” occurs frequently in English. These 26 tables would produce even more compression because the more detailed analysis tunes the code words even further. The letter “U” would receive a very short code word after the letter “Q” because it invariably follows.
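
Here is a minimal Python sketch of the second approach, counting how often each letter follows each other letter. The names are my own; a real compressor would go on to build one Huffman code per table.

    from collections import Counter, defaultdict

    def conditional_tables(text):
        """Count how often each letter follows each other letter."""
        letters = [c for c in text.upper() if c.isalpha()]
        tables = defaultdict(Counter)
        for prev, cur in zip(letters, letters[1:]):
            tables[prev][cur] += 1
        return tables

    tables = conditional_tables("the theory of the threshold")
    print(tables["T"].most_common(3))   # 'H' dominates after 'T'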

This example has shown how a Huffman compression function works in practice. It didn't explain how the code words were constructed nor did it show why they worked so well. The next section in this chapter will do that.

Chapter 6 shows how to run Huffman codes in reverse.

5.3. Building Compression Algorithms

Creating a new compression algorithm has been one of the more lucrative areas of mathematics and computer science. A few smart ideas are enough to save people billions of dollars in storage space and communications time, and so many people have worked on the problem in depth. This chapter won't investigate the best work because it is beyond the scope of the book. Many of the simplest ideas turn out to hide information the best. Huffman codes are a perfect solution for basic text. Dictionary algorithms, like Lempel-Ziv, are less effective.

5.3.1. Huffman Compression

Huffman compression is easy to understand and construct. Let the set of characters be Σ and let ρ(c) be the probability that a particular character, c, occurs in a text file. Constructing such a frequency table is easily done by analyzing a source file. It is usually done on a file-by-file basis, with the table stored in the header of the compressed version, but it can also be done in advance and used again and again.

The basic idea is to construct a binary tree that contains all of the characters at the leaves. Each branch is labeled with either a zero or a one. The path between the root and a leaf specifies the code used for that letter. Figure 5.1 shows this for a small set of letters.

Figure 5.1. A small Huffman tree. The code for each letter is determined by following the path between the root of the tree and the leaf containing a particular letter. The letter “T”, for instance, receives the code 110.

The key is to construct the tree so that the most common letters occur near the top of the tree. This can be accomplished with a relatively easy process (a code sketch follows the list):

  1. Start with one node for each character. Each node is also a trivial tree. The weight of this tree is set to be the probability that the character associated with the tree occurs in the file. Call the trees ti and their weights w(ti). The value of i changes as the number of trees changes.
  2. Find the two trees with the smallest weight. Glue these into one tree by constructing a new node with two branches connected to the roots of the two trees. One branch will be labeled with a one and the other will get a zero. The weight of this new tree is set to be the sum of the old trees that were joined.
    Figure 5.2. The top of the tree built from the data in Table 5.1. This tree generated the codes shown in Table 5.2. Only the top part is shown here because of space considerations. Less common letters like “Z” are in the missing part of the tree corresponding to the prefix 0001.

  3. Repeat the previous step until there is only one tree left. The codes can be constructed by following the path between the root and the leaves.
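
Here is a minimal Python sketch of the gluing process using a priority queue. It follows the three steps above, but ties between equal weights may be broken differently than in the book's figures, so the exact bit patterns can differ from Table 5.2 while the average code length stays the same.

    import heapq

    def huffman_codes(freq):
        """Build Huffman code words from a {character: count} table."""
        # A heap entry is (weight, tie_breaker, tree); a tree is either a
        # single character or a pair (zero_subtree, one_subtree).
        heap = [(w, i, ch) for i, (ch, w) in enumerate(freq.items())]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            w0, _, t0 = heapq.heappop(heap)        # the two smallest trees
            w1, _, t1 = heapq.heappop(heap)
            heapq.heappush(heap, (w0 + w1, counter, (t0, t1)))
            counter += 1
        codes = {}
        def walk(tree, prefix):
            if isinstance(tree, tuple):            # an internal node
                walk(tree[0], prefix + "0")
                walk(tree[1], prefix + "1")
            else:                                  # a leaf character
                codes[tree] = prefix or "0"
        walk(heap[0][2], "")
        return codes

    freq = {" ": 26974, "E": 9917, "T": 8375, "A": 6538, "Q": 113, "Z": 60}
    print(huffman_codes(freq))

The rare letters “Q” and “Z” are glued together first, so they end up deepest in the tree with the longest code words.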

I know of a Greek labyrinth which is a single straight line. Along this line so many philosophers have lost themselves that a mere detective might well do so too. —Jorge Luis Borges in Death and the Compass

The characters with the smallest weights are joined together first. Each joining process adds another layer between the root and the leaves. So it is easy to see how the least common letters get pushed far away from the root where they have a longer code word. The most common letters aren't incorporated until the end, so they end up near the top.

The algorithm naturally balances the tree by always taking the smallest weights first. The weight for a tree represents the number of times that any of the characters in the tree will occur in a file. You can prove that the tree constructed by this algorithm is the best possible tree by imagining what happens if you mistakenly choose the wrong two trees to join at a step. More common characters get pushed farther from the root and get longer code words than less common characters do. The average compression drops.

Many other people have extended the theme of Huffman coding by creating other algorithms that use the addresses of nodes in a tree. One popular technique is to use splay trees, where the tree is modified every time a character is encoded. One version moves the letter to the top of the tree in a complex move that preserves much of the structure. The result is that the most common letters bubble up to the top. The constant rearrangement of the tree means that the tree adapts to the local conditions. This type of algorithm would be ideal for compressing a dictionary, where even the least popular letters like “j” or “z” are common in certain sections. When the algorithm moved through the “j” part of the dictionary, the node containing “j” would be pushed repeatedly to the top of the splay tree, where it would get a short code word. Later, when the compression function got to the “z” section, the node for “z” would end up near the top, consistently giving “z” a short code word. Obviously one major problem with this compression scheme is that the entire file must be processed from the beginning to keep an accurate description of the splay tree. You can't simply jump to the “z” section and begin decompressing.

A good basic reference on compression is [Sto88].

This basic Huffman algorithm has many different uses. It will be used in Chapter 6 to turn data into something that looks like English text. Huffman encoding is also used as a building block in Chapter 7 to make optimal weighted choices between different words. The same structure is as useful there as it is here.

5.3.2. Dictionary Compression

Compression schemes like the popular and patented Lempel-Ziv algorithm are called dictionary schemes because they build a big list of common words in the file.[2] This list can either be created in one swoop at the beginning of the compression or it could be built and changed adaptively as the algorithm processes the file. The algorithms succeed because a pointer describing the position in the dictionary takes up much less space than the common word itself.

2 The algorithms are not particularly good at compressing files like dictionaries used by humans. The fact that I used a regular dictionary as an example in the previous section is just a coincidence. Don't be confused.

The dictionary is just a list of words. It is almost always 2^n words long because that makes the pointer to a particular word take up n bits. Each word can either be a fixed length or a flexible length. Fixed lengths are easier to handle, but flexible lengths do a better job of approximating the English language and x86 machine code.

Compression follows easily. First, the file is analyzed to create a list of the 2^n most common words. Then the file is processed by scanning from beginning to end. If the current word is in the dictionary, then it is replaced by a tag, <InDict>, followed by the position in the dictionary. If it isn't in the dictionary, then it is replaced by a tag, <Verbatim>, followed by the word, which remains unchanged.

Obviously, the success of the algorithm depends on the size of the tags (<InDict> and <Verbatim>), the size of the dictionary, and the number of times something is found in the dictionary. One basic and usually effective solution is to make each tag one entire byte, B. If the value of the byte is zero, then the next n bits represent a word in the dictionary. If the value of the byte, B, is greater than zero, then B bytes are copied verbatim out of the original file. This scheme allows the program to use flexible word sizes that work well with English. There are many variations on this scheme, each more efficient than the others in certain cases.
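
Here is a minimal Python sketch of this byte-tag scheme, assuming a dictionary of at most 256 words so that an index fits in a single byte after the zero tag. The tokens are printed as tuples for readability; a real coder would pack them into a byte stream and ship the dictionary in the file header.

    def dict_compress(words, dictionary):
        """Encode a list of words with the tag scheme described above."""
        index = {w: i for i, w in enumerate(dictionary)}
        out = []
        for word in words:
            if word in index:
                out.append(("InDict", index[word]))       # tag byte 0, then the index
            else:
                raw = word.encode("utf-8")
                out.append(("Verbatim", len(raw), raw))   # tag byte = number of raw bytes
        return out

    dictionary = ["the", "of", "compression", "dictionary"]
    text = "the size of the compression dictionary matters".split()
    for token in dict_compress(text, dictionary):
        print(token)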

The indices into the dictionary do not need to be n-bit numbers. You can also count the occurrences of words in the dictionary and use a Huffman-like scheme to devise short code words for the most common of them. The tag for verbatim text is usually included as just another word in this case.

The dictionary can also adapt as the file is processed. One simple technique is to keep track of the last time each word in the dictionary was used. Whenever a section of verbatim text is encountered, the oldest word is swapped out of the dictionary and the newest verbatim text is swapped in. This is a great technique for adapting to the text because many words cluster in sections. For instance, the words “dictionary,” “Huffman,” and “compression” are common in this section but relatively rare in other parts of the book. An adaptive scheme would load these words into the dictionary at the beginning of the section when they were first encountered and not swap them out until they went unused for a while.

Dictionary schemes can be quite effective for compressing arbitrary text, but they are difficult to run in reverse to make data mimic something. Chapter 6 uses Huffman-like algorithms to generate real text, but it doesn't include a section on reversing dictionary algorithms. They are described in this chapter because compression is a good way to save space and whiten data. The algorithms don't work particularly well for mimicry because they require a well-constructed dictionary, and in practice there is no good automatic way that I know of for constructing one.

5.3.3. JPEG Compression

The Huffman encoding described in “Huffman Compression” (see page 80) and the dictionary schemes in “Dictionary Compression” (see page 82) are ideal for arbitrary collections of data. They can also work quite well on some types of image files, but they fail on others. If an image has a small number of colors that occur in a predictable pattern, then both of these algorithms may find enough structure to generate good compression. This often doesn't happen, because most images contain many shades of colors. The Japanese flag, for instance, has one red circle that is a constant color, but a realistically lit apple has many different shades of red.

The JPEG algorithm is a good example of how to tune an algorithm to a particular type of data. In this case, the algorithm fits a fixed set of cosine functions to the data and then stores their amplitudes. The number of functions kept and the precision of their amplitudes can be varied according to the amount of compression desired. A small number of functions produces a high amount of compression but a grainy image. More functions add accuracy but take up more space. This flexibility is possible because people don't always particularly care if they get exactly the same image back when it is decompressed. If it looks reasonably close, it is good enough.
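
The following Python sketch shows the underlying idea on a single row of eight pixel values: compute the amplitudes of a fixed set of cosine functions (a discrete cosine transform), throw away the smallest amplitudes, and rebuild an approximation. This is only a cartoon of one step; the real JPEG standard works on 8×8 blocks, quantizes the amplitudes, and then entropy-codes them.

    import math

    N = 8
    row = [52, 55, 61, 66, 70, 61, 64, 73]     # one row of pixel intensities

    def dct(values):
        """Amplitudes of the N cosine functions that fit the values (DCT-II)."""
        out = []
        for k in range(N):
            s = sum(v * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                    for n, v in enumerate(values))
            scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
            out.append(scale * s)
        return out

    def idct(coeffs):
        """Rebuild the pixel values from the cosine amplitudes."""
        out = []
        for n in range(N):
            s = 0.0
            for k, c in enumerate(coeffs):
                scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
                s += scale * c * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
            out.append(s)
        return out

    coeffs = dct(row)
    # The lossy step: keep only the three largest amplitudes.
    kept = sorted(range(N), key=lambda k: abs(coeffs[k]), reverse=True)[:3]
    approx = idct([c if k in kept else 0.0 for k, c in enumerate(coeffs)])
    print([round(v, 1) for v in approx])       # close to the original row, not exact

Storing three amplitudes instead of eight pixel values is where the compression comes from; keeping more amplitudes trades space for accuracy.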

This flexibility is what is so useful about JPEG encoding. The algorithm from Section 5.3.1 will be run in reverse to produce text that mimics English text. The JPEG algorithm is not well suited to that trick. However, it does have the ability to identify nooks and crannies in the image that might have space to hold information. This is described in detail in Chapter 9.

5.3.4. GZSteg

Many of the compression algorithms can be tweaked in clever ways to hide information. A simple but quite effective technique was used by Andrew Brown when he created the GZSteg algorithm to hide information in files produced with the popular GZIP compression algorithm. The format is used frequently throughout the Net, so it makes an ideal candidate for an innocuous hiding place.

Ordinarily, the GZIP algorithm will compress data by inserting tokens that point back to a previous location where the data was found. Here's a sample section of text:

  • The famous Baltimore Oriole, Cal Ripken Jr., is the son of Cal Ripken Sr. who coached for the Orioles in the past.

Here's a sample section that was compressed. The tokens are shown in italics.

  • The famous Baltimore Oriole, Cal Ripken Jr., is the son of (30,10) Sr. who coached for the (48,6)s in the past.

In this example, there are two tokens. The first one, (30,10), tells the algorithm to go back 30 characters and copy 10 characters to the current location. The compression technique works quite well for many kinds of text.

GZSteg hides information by changing the number of characters to copy. Every time it inserts a token that copies more than 5 characters, it hides one bit. If the bit is zero, then the token is left unchanged. If the bit is one, then the number of characters to be copied is shortened by one. Here's the same quote with the two bits 11 encoded:

  • The famous Baltimore Oriole, Cal Ripken Jr., is the son of (30,9)n Sr. who coached for the (46,5)es in the past.

In both cases, the size of the copying was cut by one. This does reduce the amount of compression to a small extent.
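
The trick is easy to model in a few lines of Python if we treat the compressed stream as an abstract list of literal runs and (distance, length) copy tokens. This is only a sketch of the idea, not GZSteg itself, which patches the actual DEFLATE coder; in particular, the sketch does not turn the dropped characters back into literals or adjust later distances, which a real implementation must do.

    def embed_bits(tokens, bits):
        """Hide bits in copy tokens, GZSteg-style.

        tokens: a list of ("lit", text) or ("copy", distance, length) entries.
        Only copy tokens longer than 5 can carry a bit; a 1 bit shortens
        the copy by one character."""
        bits = list(bits)
        out = []
        for token in tokens:
            if token[0] == "copy" and token[2] > 5 and bits:
                bit = bits.pop(0)
                _, dist, length = token
                out.append(("copy", dist, length - 1 if bit else length))
            else:
                out.append(token)
        return out

    tokens = [("lit", "...is the son of "), ("copy", 30, 10),
              ("lit", " Sr. who coached for the "), ("copy", 48, 6)]
    print(embed_bits(tokens, [1, 1]))    # both copies shrink by one character

Extraction is just as direct: scan the tokens, and each copy that could have been one character longer yields a 1, while each maximal copy yields a 0.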

The greatest advantage of this approach is that the file format is unchanged. A standard GZIP program will be able to decompress the data without noticing that information was hidden in the process. Information could be left around without attracting suspicion. A quick analysis, however, could also reveal that data was hidden in such a manner. If you scan the file and examine the tokens, you can easily determine which ones are one character shorter than they could be. There is no innocent reason for the program that did the GZIP compression to fall short in just this way.

5.4. Summary

Compression algorithms are normally used to reduce the size of a file without removing information. This can increase a file's entropy and make it appear more random because all of the possible bytes occur with more nearly equal frequency. The compression algorithms can also be used to produce mimicry by running the compression functions in reverse. This is described in Chapter 6.

  • The Disguise Compression algorithms generally produce data that looks more random. That is, there is a more even distribution of the data.
  • How Secure Is It? Not secure at all. Most compression algorithms transmit the table or dictionary at the beginning of the file. This may not be necessary because both parties could agree on such a table in advance. Although I haven't worked out how to recover the mapping between the letters and the bits in the Huffman algorithm, I don't believe it would be hard to do.
  • How to Use It Many compression programs are available for all computers. They often use proprietary algorithms that are better than the versions offered here, and they make an ideal first pass for any encryption program.

Further Reading

  • My book, Compression Algorithms for Real Programmers, is an introduction to some of the most common compression algorithms. [Way00]
  • The Mathematical Theory of Communication by Claude E. Shannon and Warren Weaver is still in print after almost 60 years and over 20 printings. [SW63]
  • Khalid Sayood's long book, Introduction to Data Compression, is an excellent, deep introduction. [Say00]
  • Jacob Seidelin suggests compressing text by turning it into an 8-bit PNG file. He provides the Javascript code on his blog, nihilogic. The result looks much like white noise. This may be a more practical way to hide information in the least significant bits of images. [Sei08]