Chapter 6. Basic Mimicry

6.1. Reading between the Lines

Here is the transcript from the mind of a cynic reading through the personals section of a newspaper:

Great. Eliza Doolittle. Literally. I come up with a dating strategy and she does little but ride along. This is not a good sign. She's probably a princess working as an executive assistant who wants to be rescued and catapulted into the upper class. Rules me out. I'm not going to work my butt off so she can relax in Spain trying to pronounce words differently. What's so romantic about Spain, anyway? She's probably read Hemingway too and I'll be forced to run in front of a bunch of bulls just so she'll think I'm dashing in an old-fashioned way. No thanks. I'll take a new-fashioned Range Rover like they drive around Africa. Those things can't be toppled by a bunch of bulls. And if it's raining, I won't get wet or slip all over the place. Geez.

Great. Poetry. She'll expect me to reciprocate. I just won't be able to say, “Yeah, let's grab a burger tonight.” Nope. I'll have to get some watercolors and paint a letter to her. In some ancient verse form. Rhyming really is the sign of an overactive mind. Who really cares if two words in different parts of a paragraph happen to end with the same sound? It's just a coincidence. She'll probably spend all of her time picking up patterns in our lives. I'll have to keep saying, “No. I still love you. I just want to watch the seventh game of the World Series. The Red Sox are in it this year. It's tied. They might actually win! This is not a sign of a bad relationship.” Geez.

Great. Has she ever fallen off of a fast horse? They're animals. They only tolerate us on their backs as long as the oats are fresh. Women are the same way. But they don't take to a rein as well. And they don't just want fresh oats. I bet fast food isn't on her list. She'll ride along and take me for whatever I've got. Then she'll grab a fast plane out of my life. No way. Her boat's sinking already. Geez.

6.2. Running in Reverse

The cynic looking for a date in the introduction to this chapter has the ability to take a simple advertisement and read between the lines until he's plotted the entire arc of the relationship and followed it to its doom. Personal ads have an elaborate shorthand system for compressing a person's dreams into fewer than 100 words. The shorthand evolved over the years as people learned to pick up the patterns in what other people wanted. “ISO” means “In Search Of”, for example. The cynic was just using his own view of what people want to expand the bits of data into a reality that has little to do with the incoming data.

This chapter is about creating an automatic way of taking small, innocuous bits of data and embellishing them with deep, embroidered details until the result mimics something completely different. The data is hidden as it assumes this costume. The effect is accomplished here by running the Huffman compression algorithm described in Chapter 5 in reverse. Ordinarily, the Huffman algorithm would approximate the statistical distribution of the text and then convert it into a digital shorthand. Running this in reverse can take normal data and form it into elaborate patterns.

Figure 6.1 is a good place to begin. The text in this figure was created using a fifth-order regular mimic function by analyzing an early draft of Chapter 5. The fifth-order statistical profile of the chapter was created by counting how often each set of five consecutive letters occurs. In the draft, the five letters ‘mpres’ occur together in that order 84 times. Given that these letters are part of the word ‘compression’, it is not surprising that the five letters ‘ompre’ and ‘press’ also occur 84 times.

Figure 6.1. Fifth-order random text generated by mimicking the statistical distribution of letters in an early draft of Chapter 5.

The text is generated in a process guided by these statistics. The text begins by selecting one group of five letters at random. In the figure, the first five letters are “The l”. Then it uses the statistics to dictate which letters can follow. In the draft of Chapter 5, the five letters “he la” occur 2 times, the letters “he le” occur 16 times, and the letters “he lo” occur 2 times. If the fifth-order text is going to mimic the statistical profile of Chapter 5, then there should be a 2 out of 20 chance that the letter “a” follows the random “The l”. Of course, there should also be a 16 out of 20 chance that it is an “e” and a 2 out of 20 chance that it is an “o”.

This process is repeated ad infinitum until enough text is generated. It is often amazing just how real the result sounds. To a large extent, this is caused by the smaller size of the sample text. If you assume that there are about 64 printable characters in a text file, then there are about 64⁵ different combinations of five letters. Obviously, many of them like “zqTuV” never occur in the English language, but a large number of them must make their way into the table if the algorithm is to have many choices. In the last example, there were three possible choices for a letter to follow “The l”. The phrase “The letter” is common in Chapter 5, but the phrase “The listerine” is not. In many cases, there is only one possible choice, dictated by the small number of words used in the sample. This is what gives the output such a real-sounding pattern.

Here's the algorithm for generating nth-order text called T given a source text S (a code sketch follows the list):

  1. Construct a list of all combinations of n letters that occur in S and keep track of how many times each of these occurs in S.
  2. Choose one at random to be a seed. This will be the first n letters of T.
  3. Repeat this loop until enough text is generated:
    • Take the last n – 1 letters of T.
    • Search through the statistical table and find all combinations of letters that begin with these n – 1 letters.
    • The last letters of these combinations form the set of possible choices for the next letter to be added to T.
    • Choose among these letters and use the frequency of their occurrences in S to weight your choice.
    • Add it to T.
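
Here is a minimal sketch of this algorithm in Python. The function names and every implementation detail are my own, not part of any distributed tool; the sketch simply builds the table of n-letter groups and then extends the output one letter at a time, weighting each choice by its frequency in S.

    import random
    from collections import defaultdict

    def build_table(source, n):
        """Count every run of n consecutive characters in the source text S."""
        table = defaultdict(int)
        for i in range(len(source) - n + 1):
            table[source[i:i + n]] += 1
        return table

    def generate(source, n, length, rng=random):
        """Produce 'length' characters that mimic the nth-order statistics of source."""
        table = build_table(source, n)
        # Step 2: pick one n-letter group at random as the seed.
        output = rng.choice(list(table.keys()))
        while len(output) < length:
            prefix = output[-(n - 1):]           # the last n - 1 letters of T
            # Find every n-letter group that begins with this prefix.
            choices = [(group[-1], count) for group, count in table.items()
                       if group.startswith(prefix)]
            if not choices:                      # dead end: reseed and keep going
                output += rng.choice(list(table.keys()))
                continue
            letters, weights = zip(*choices)
            # Weight the choice of the next letter by its frequency in S.
            output += rng.choices(letters, weights=weights)[0]
        return output[:length]

A call like generate(open('chapter5.txt').read(), 5, 500) would produce text along the lines of Figure 6.1; the filename is, of course, only a placeholder.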

The algorithm works for any n of two or larger. Obviously, the quality of the output depends heavily on the order. Here are some samples:

There is little doubt that the text gets more and more readable as the order increases. But who would this fool? What if the enemy designed a computer program that would flag suspicious electronic mail by identifying messages that don't have the right statistical mix of characters? Foreign languages could pop right out. French, for instance, has a greater number of apostrophes as well as a different distribution of letters. Russian has an entirely different alphabet, but even when it is transliterated the distribution is different. Each language and even each regional dialect has a different composition.

The texts generated here could fool such an automatic scanning device because the output is statistically equivalent to honest English text. For instance, the letter “e” is the most common and the letter “t” is next most common. Everything looks statistically correct at all of the different orders. If the scanning software was looking for statistical deviance, it wouldn't find it.

An automatic scanning program is also at a statistical disadvantage with relatively short text samples. Its statistical definition of what is normal must be loose enough to fit changes caused by the focus of the text. A document about zebras, for instance, would have many more “z”s than the average document, but this alone doesn't make it abnormal. Many documents might have a higher than average occurrence of “j”s or “q”s merely because the topic involves something like jails or quiz shows.
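
To make the scanning scenario concrete, here is a sketch of the kind of crude first-order test such a scanner might run: it compares the letter frequencies of a message against a reference profile with a chi-squared-style score. The reference fractions are the commonly quoted approximations for English, and the threshold is an arbitrary placeholder that, for exactly the reasons described above, would have to be set loosely.

    from collections import Counter

    # Approximate first-order profile of English (fractions of all letters).
    REFERENCE = {'e': 0.127, 't': 0.091, 'a': 0.082, 'o': 0.075, 'i': 0.070,
                 'n': 0.067, 's': 0.063, 'h': 0.061, 'r': 0.060}

    def deviance(text):
        """Chi-squared-style score: how far the text's letter mix is from the profile."""
        letters = [c for c in text.lower() if c.isalpha()]
        total = len(letters)
        if total == 0:
            return 0.0
        counts = Counter(letters)
        score = 0.0
        for letter, fraction in REFERENCE.items():
            expected = fraction * total
            observed = counts.get(letter, 0)
            score += (observed - expected) ** 2 / expected
        return score

    def looks_suspicious(text, threshold=50.0):   # the threshold is a placeholder
        return deviance(text) > threshold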

Of course, these texts wouldn't be able to fool a person. At least the first-, second-, or third-order texts wouldn't fool someone. But a fifth-order text based on a sample from an obscure and difficult jargon like legal writing might fool many people who aren't familiar with the structures of the genre.

More complicated statistical models can produce better mimicry, at least in the right cases. Markov models, for instance, are common in speech recognition, and genetic algorithms can do a good job predicting some patterns. In general, any of the algorithms designed to help a computer learn to recognize a pattern can be applied here to suss out a pattern before being run in reverse to imitate it.

More complicated grammatical analysis is certainly possible. There are grammar checkers that scan documents and identify bad sentence structure. These products are far from perfect. Many people write idiomatically and others stretch the bounds of what is considered correct grammar without breaking any of the rules. Although honest text generated by humans may set off many flags, the fifth-order text shown in this chapter would appear so much more wrong that it could be automatically detected. Any text that had, say, more wrong than right with it could be flagged as suspicious by an automatic process [KO84, Way85].

Chapter 7 offers an approach to defeating grammar checkers.

6.2.1. Choosing the Next Letter

The last section showed how statistically equivalent text could be generated by mimicking the statistical distribution of a source collection of text. The algorithm showed how to choose the next letter so it would be statistically correct, but it did not explain how to hide information in the process. Nor did it explain how to run Huffman compression in reverse.

The information is hidden by letting the data to be concealed dictate the choice of the next letter. In the example described above, either “a”, “e”, or “o” could follow the starting letters “The l”. It is easy to come up with a simple scheme for encoding information. If “a” stands for “1”, “e” stands for “2”, and “o” stands for “3”, then small numbers could be encoded in the choice of the letters. Someone at a distance could recover this value if they had a copy of the same source text, S, that generated the table of statistics. They could look up “The l” and discover that there are three letters that follow “he l” in the table. The letter “e” is the second choice in alphabetical order, so the letter “e” stands for the message “2”.
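
A minimal sketch of this indexing scheme, reusing the table built in the earlier sketch (the alphabetical-ordering convention is an assumption that both sides would have to share): the sender picks the letter whose position matches the number to send, and the receiver recovers the number by finding that letter's position.

    def next_letter_choices(table, prefix):
        """All letters that can follow the n - 1 letter prefix, in alphabetical order."""
        return sorted({group[-1] for group in table if group.startswith(prefix)})

    def encode_digit(table, prefix, digit):
        """Hide a small number (1, 2, 3, ...) in the choice of the next letter."""
        choices = next_letter_choices(table, prefix)
        return choices[digit - 1]              # digit must not exceed len(choices)

    def decode_digit(table, prefix, letter):
        """Recover the hidden number from the letter that was chosen."""
        return next_letter_choices(table, prefix).index(letter) + 1

With the example above, the choices following “he l” are “a”, “e”, and “o”, so choosing “e” conveys the value 2.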

A long text like the one shown in Figure 6.1 could hide a different number in each letter. If there were no choice about the next letter to be added to the output, though, then no information could be hidden. That letter would not hide anything.

Simply using a letter to encode a number is not an efficient or a flexible way to send data. What if you wanted to send the message “4” and there were only three choices? What if you wanted to send a long picture? What if you wanted to send the value “1”, but the corresponding letter was the least common choice? Would this destroy the statistical composition?

Running Huffman codes in reverse is the solution to all of these problems. Figure 6.2 shows a simple Huffman tree constructed from the three choices of letters to follow “The l”. The tree was constructed using the statistics showing that the letter “e” followed in 16 out of the 20 times while the letters “a” and “o” each followed twice.

Figure 6.2. A small Huffman tree built to hide bits in the choice of a new letter. Here, the letter “a” encodes “10”, the letter “e” encodes “0” and the letter “o” encodes “11”.

Messages are encoded with a Huffman tree like this with a variable number of bits. The choice of “e” encodes the bit “0”; the choice of “a” encodes “10”; and the choice of “o” encodes the message “11”. These bits can be recovered at the other end by reversing this choice. The number of bits that are hidden with each choice of a letter varies directly with the number of choices that are possible and the probabilities that govern the choice.
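
Here is a sketch of how a Huffman tree over the weighted choices turns hidden bits into a letter and back. The construction is the standard Huffman procedure; which branch gets the 0 and which gets the 1 is a convention, so the labels may come out as the complement of those in Figure 6.2, and the only real requirement is that encoder and decoder build the tree identically.

    import heapq

    def huffman_codes(weighted_choices):
        """Map each letter to a bit string; frequent letters get shorter codes.
        weighted_choices is a dict such as {'e': 16, 'a': 2, 'o': 2}."""
        heap = [(count, i, letter) for i, (letter, count) in
                enumerate(sorted(weighted_choices.items()))]
        heapq.heapify(heap)
        codes = {letter: '' for letter in weighted_choices}
        if len(codes) == 1:
            return codes                       # a forced choice hides no bits
        while len(heap) > 1:
            c1, i1, left = heapq.heappop(heap)
            c2, i2, right = heapq.heappop(heap)
            for letter in left:
                codes[letter] = '0' + codes[letter]
            for letter in right:
                codes[letter] = '1' + codes[letter]
            heapq.heappush(heap, (c1 + c2, min(i1, i2), left + right))
        return codes

    def pick_letter(weighted_choices, bits):
        """Consume bits from the front of 'bits'; return (letter, bits consumed)."""
        for letter, code in huffman_codes(weighted_choices).items():
            if code == '' or bits.startswith(code):
                return letter, len(code)
        raise ValueError('bit stream does not match any code')

    def recover_bits(weighted_choices, letter):
        """The receiver reverses the choice to get the hidden bits back."""
        return huffman_codes(weighted_choices)[letter]

For the example above, huffman_codes({'e': 16, 'a': 2, 'o': 2}) gives “e” a one-bit code and gives “a” and “o” two-bit codes, exactly the shape of Figure 6.2.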

There should generally be more than three choices available if the source text S is large enough to offer some variation, but there will rarely be a full 26 choices. This is only natural because English has plenty of redundancy built into the language. Shannon recognized this when he set up information theory. If the average entropy of English is about 3 bits per character, then this means that there should only be about 2³, or eight, choices that can be made for the next character. This value is weighted by the probabilities.

There are problems, of course, with this scheme. Still, this solution is the best way to hide the information so that it mimics the source text S, for the same reason that Huffman codes are the most efficient way to construct tree-like compression schemes. The same proof that shows Huffman codes are optimal also works in reverse.

Section 6.3.1 shows a more accurate approximation.

But even if it is the best, it falls short of being perfect. In the small example in Figure 6.2, the letter “e” is chosen if the next bit to be hidden is “0”, while either “a” or “o” will be chosen if the next bit is “1”. If the data to be hidden is purely random, then “e” will be chosen 50% of the time while “a” or “o” will be chosen the other 50% of the time. This does not mimic the statistics from the source text exactly. If it did, the letter “e” would be chosen 80% of the time and the other letters would each be chosen 10% of the time. This inaccuracy exists because of the binary structure of the Huffman tree and the number of choices available.

6.3. Implementing the Mimicry

There are two major problems in writing software that will generate regular nth-order mimicry. The first is acquiring and storing the statistics. The second is creating a tree structure to do the Huffman-like coding and decoding. The first problem requires a bit more finesse because there are several different ways to accomplish the same ends. The second problem is fairly straightforward.

Several people have approached a similar problem called generating a travesty. This was addressed in a series of Byte magazine articles [KO84, Way85] that described how to generate statistically equivalent text. The articles didn't use the effect to hide data, but they did concentrate on the most efficient way to generate it. This work ends up being quite similar in practice to the homophonic ciphers described by H. N. Jendal, Y. J. B. Kuhn, and J. L. Massey in [JKM90] and generalized by C. G. Gunther in [Gun88].

Here are four different approaches to storing the statistical tables needed to generate the data:

  • Giant Array Allocate an array with cⁿ boxes where c is the number of possible characters at each position and n is the order of the statistics being kept. Obviously c can be as low as 27 if only capital letters and spaces are kept. But it can also be 256 if all possible values of a byte are stored. This may be practical for small values of n, but the array quickly grows impossibly large as n increases.
  • Giant List Create an alphabetical list of all of the entries. There is one counter per node as well as a pointer and a string holding the value in question. This makes the nodes substantially less efficient than the array. This can still pay off, though, because many combinations never occur and so never need entries. If English text is being mimicked, there are many combinations of several letters that don't occur. A list is definitely more efficient in that case.
  • Giant Tree Build a big tree that contains one path from the root to a leaf for each letter combination found in the source text. This can contain substantially more pointers, but it is faster to use than the Giant List. Figure 6.3 illustrates an implementation of this.
    Figure 6.3. This tree stores the frequency data for a file with n layers of branching for nth-order statistics. Access is substantially faster. The dashed lines show where nodes are omitted. The only complete word shown here is “at”. It occurs three times in the sample.

  • Going Fishing Randomize the search. There is no statistical table produced at all because c and n are too large. The source file serves as a random source and it is consulted at random for each choice of a new letter. This can be extremely slow, but it may be the only choice if memory isn't available.

The first three solutions are fairly easy to implement for anyone with a standard programming background. The array is the easiest. The list is not hard. Anyone implementing the tree has a number of choices. Figure 6.3 shows that the new branches at each level are stored in a list. This could also be done in a binary tree to speed lookup.
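
As one illustration of the Giant Tree option, here is a sketch of an nth-order trie built from nested Python dictionaries, a stand-in for the linked nodes of Figure 6.3. Each level branches on one character; the '#' entry marking the layer of counters is purely my own convention.

    def build_trie(source, n):
        """One path from the root down to a counter for every n-letter group."""
        root = {}
        for i in range(len(source) - n + 1):
            node = root
            for ch in source[i:i + n - 1]:
                node = node.setdefault(ch, {})    # descend, creating branches as needed
            counts = node.setdefault('#', {})     # '#' marks the leaf level of counters
            last = source[i + n - 1]
            counts[last] = counts.get(last, 0) + 1
        return root

    def choices_after(trie, prefix):
        """The letters that can follow the n - 1 letter prefix, with their counts."""
        node = trie
        for ch in prefix:
            node = node.get(ch)
            if node is None:
                return {}
        return dict(node.get('#', {}))

The dictionary returned by choices_after is exactly the kind of weighted table that feeds the Huffman-tree sketch shown earlier.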

The fourth solution, going fishing, is a bit more complicated. The idea is to randomly select positions in the text and use them to randomize the search. Not all of the data can be kept in a table, so all of the choices won't be available at each juncture. Therefore, you must live with what you can find. The most extreme version of this algorithm simply searches the entire file and constructs the right table entry on the fly. Here is a more sensible approach (sketched in code after the list):

  1. Choose a location in the source file at random. Call this position i. This random source must be duplicated during decoding, so it must come from a pseudo-random number generator that is synchronized.
  2. If you are constructing an nth-order mimicry, search forward until you find the n – 1 characters in question. The next character may be the one you desire.
  3. Let there be k characters in the source file. Go to position i + k/2 mod k. Search forward until the right combination of n – 1 characters are found.
  4. If the next character suggested by both positions is the same, then nothing can be encoded here. Send out that character and repeat.
  5. If the characters are different, then one bit can be encoded with the choice. If you are hiding a 0 using this mimicry, then output the character found beginning at position i. If you are hiding a 1, then output the character found after the search began at i + k/2 mod k.
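
The following sketch follows these five steps under a few assumptions of my own: the synchronized pseudo-random stream is modeled with Python's random.Random seeded with a shared key, and a search that runs off the end of the file simply wraps around to the beginning.

    import random

    def find_after(source, start, prefix):
        """Search forward (wrapping around) from 'start' for the n - 1 letter prefix;
        return the character that follows it, or None if it never occurs."""
        doubled = source + source                  # cheap way to let the search wrap
        j = doubled.find(prefix, start)
        if j == -1 or j + len(prefix) >= len(doubled):
            return None
        return doubled[j + len(prefix)]

    def fish_encode(source, prefix, bit, rng):
        """Encode at most one bit; returns (next_character, bit_was_encoded)."""
        k = len(source)
        i = rng.randrange(k)                                 # step 1
        c1 = find_after(source, i, prefix)                   # step 2
        c2 = find_after(source, (i + k // 2) % k, prefix)    # step 3
        if c1 is None or c2 is None or c1 == c2:
            return (c1 or c2), False                         # step 4: nothing hidden
        return (c1 if bit == 0 else c2), True                # step 5

    # Both sides would create rng = random.Random(shared_key) so the i values match.

To read a bit back, the decoder repeats the two searches; a character matching the one found from position i means 0, and otherwise 1.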

This solution can be decoded. All of the information encoded here can be recovered as long as both the encoder and the decoder have access to the same source file and the same stream of i values coming from a pseudo-random source. The pseudo-random generator ensures that all possible combinations are uncovered. This does assume, however, that the candidates of n – 1 characters are evenly distributed throughout the text.

The solution can also be expanded to store more than one bit per output letter. You could begin the search at four different locations and hope that you uncover four different possible letters to output. If you do, then you can encode two bits. This approach can be extended still further, but each search does slow the output.

In general, the fishing solution is the slowest and most cumbersome of all the approaches. Looking up each new letter takes an amount of time proportional to the occurrence of the n – 1 character group in the data. The array has the fastest lookup, but it can be prohibitively large in many cases. The tree has the next fastest lookup and is probably the most generally desirable for text applications.

6.3.1. Goosing with Extra Data

Alas, statistical purity is often hard to generate. If the data to be hidden has maximum entropy, then the letters that emerge from the Huffman-tree-based mimicry will emerge with a probability distribution that seems a bit suspicious. Every letter will appear with a probability of the form 1/2ⁱ; that is, 50%, 25%, 12.5%, and so on. This may not be that significant, but it might be detected.

Music is also fair game. Many have experimented with using musical rules of composition to create new music from statistical models of existing music. One paper on the topic is [BJNW57].

Better results can be obtained by trading away some of the efficiency: a pseudo-random number generator supplies extra bits so that the choices better approximate the actual distribution in the source data.

This technique can best be explained by example. Imagine that there are three characters, “a”, “b”, and “c” that occur with probabilities of 50%, 37.5%, and 12.5% respectively. The ordinary Huffman tree would look like the one in Figure 6.4. The character “a” would occur in the output file 50% of the time. This would be fine. But “b” and “c” would both occur 25% of the time. “b” will occur as often as “c”, not three times as often as dictated by the source file.

Figure 6.4. An ordinary Huffman tree built for three characters, “a”, “b”, and “c” that occur with probabilities of 50%, 37.5%, and 12.5% respectively.

Figure 6.5 shows a new version of the Huffman tree designed to balance the distribution. There are now two extra layers added to the tree. The branching choices made in these extra two layers would use extra bits supplied by a pseudo-random generator. When they were recovered, these bits would be discarded. It should be easy to establish that “b” will emerge 37.5% of the time and “c” will be output 12.5% of the time if the data being hidden is perfectly distributed.

Figure 6.5. An expanded version of the tree shown in Figure 6.4. The decisions about which of the dashed branches to take are made by drawing bits from an extra pseudo-random source. Only the first decision is made using a bit from the data to be hidden.

The cost of this process is efficiency. The new tree produces output with the right distribution, but not all of the bits used to make each choice can be recovered. The letter “b” is produced from the leaves with addresses 100, 101, and 110. Since only the first bit remains constant with the tree in Figure 6.5, only one bit can be hidden with the letter “b”. The other two bits are drawn from the pseudo-random bit stream and are not recovered at the other end. The tree in Figure 6.4 would hide two bits with the letter “b”, but it would produce a “b” 25% of the time. This is the trade-off of efficiency versus accuracy.
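
The branching of Figure 6.5 can be sketched directly: the first decision uses a bit of real data, and the deeper decisions burn pseudo-random bits purely to push “b” and “c” toward their true 37.5% and 12.5% rates. The function shapes below are my own framing of the idea.

    import random

    def emit(data_bit, rng=random):
        """One letter from the expanded tree of Figure 6.5.
        Exactly one bit of real data is consumed per letter."""
        if data_bit == 0:
            return 'a'                     # 'a' still comes out 50% of the time
        # The lower layers are driven by throwaway pseudo-random bits: three of
        # the four random patterns give 'b' (37.5% overall), one gives 'c' (12.5%).
        padding = (rng.getrandbits(1), rng.getrandbits(1))
        return 'c' if padding == (1, 1) else 'b'

    def recover(letter):
        """The decoder keeps only the data bit; the padding bits are discarded."""
        return 0 if letter == 'a' else 1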

How many bits are hidden or encoded if a “c” is output? It could either be three bits, encoded when a 111 is found in the data to be hidden, or it could be one bit padded with two pseudo-random bits in the same manner as the letter “b”. Either choice is fine, as long as the encoder and decoder agree on it.

This technique can be extended significantly to support any amount of precision. The most important step is to make sure that there will be no ambiguity in the decoding process. If the same character exists on both branches, then no bit can be encoded using the subtree descending from this point.

This means that it is not possible to encode data in this way when the output is dominated by one character that appears more than 50% of the time. If “a”, “b”, and “c” were to emerge 75%, 20%, and 5% of the time respectively, then it would not be possible to encode information with this scheme and also produce the letter “a” 75% of the time.

One way around this problem is to produce pairs of characters. This is often feasible if one letter dominates the distribution. That is, produce the pairs “aa”, “ab”, “ac”, “ba”, “bb”, “bc”, “ca”, “cb”, and “cc” with probabilities of roughly 56%, 15%, 4%, 15%, 4%, 1%, 4%, 1%, and 0.25% respectively. Since no pair occurs more than 50% of the time, the tree technique applies again.

6.3.2. Regular Mimicry and Images

The regular mimicry algorithms described in this chapter are aimed at text and they do a good job in this domain. Adapting them to images is quite possible, if only because the digitized images are just patterns of the two letters “1” and “0”. But the success is somewhat diluted.

Chapter 9 shows how to flip the least significant bits of an image to store information, but it doesn't try to mimic the existing pattern of those bits; it just assumes that they fall into a standard even distribution. The regular mimicry algorithms can be used to tailor the distribution to some model.

The simplest solution is to group together the pixels into a regular set of groups. These groups might be 2 × 2 or 3 × 3 blocks or they might be broken into linear groups of pixels from the same row. Now the least significant bits of each of these pixels can be treated as characters. One image might be used as the model used to compute the distribution table that generates a Huffman tree. Then data can be hidden in another image by using this Huffman tree to generate blocks of bits to replace the least significant bits in the image.
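
A rough sketch of this idea, under several simplifying assumptions of my own: the image is just a list of rows of integer pixel values, the groups are runs of four pixels from the same row, and each group's least significant bits form one 16-symbol “character” whose distribution is measured from a model image.

    from collections import Counter

    def lsb_groups(image, group_size=4):
        """Treat each run of 'group_size' pixels in a row as one character:
        the tuple of their least significant bits."""
        for row in image:
            for i in range(0, len(row) - group_size + 1, group_size):
                yield tuple(pixel & 1 for pixel in row[i:i + group_size])

    def lsb_distribution(model_image, group_size=4):
        """Frequency table of the LSB characters in a model image; this table
        can feed the same Huffman-tree machinery used for text above."""
        return Counter(lsb_groups(model_image, group_size))

The resulting table plays the role of the letter statistics: a Huffman tree built from it chooses the replacement bit patterns for the cover image, so the doctored least significant bits follow the model image's quirks rather than a flat random distribution.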

More sophisticated solutions could be based on the color of the pixels themselves, but they are probably too complicated to be practical. The advantage of this system is that it could detect and imitate any statistical anomalies introduced when an image was created. Ordinarily, CCD arrays have slight imperfections that affect how each sensing cell reacts to light. High-quality arrays used by people like NASA are tested and corrected. Most civilian arrays never receive this individual treatment. The system might pick up any low-level incongruities if they happen to fall in a pattern that is reflected in the statistical distribution of the pixel groups.

6.4. Summary

This chapter described how to produce mimic text that looks statistically similar to the original text. The mechanisms in this chapter treat letters as the individual element, something that allows the data to pass some statistical tests but fail others. The count of letters like ‘e' and ‘t' might be consistent, but there are often large numbers of words that can't be found in a dictionary. Another approach taken by some experimenters is to treat words as the individual elements for the statistical models. This requires more text to create the model, but it provides excellent, if rambling, results. There are no misspelled words that aren't found in the source.

Chapter 7 describes how to use a more sophisticated grammar-based method to achieve a better result. Chapter 8 goes even further and shows how a Turing machine can be made to run backward and forward to produce the most complicated text.

  • The Disguise The text produced by these regular mimic functions can be quite realistic. The results are statistically equivalent. First-order text will have similar first-order statistics. Second-order text will have the same occurrence of pairs. This can be quite realistic in the higher orders, but it will rarely pass the reading test. Humans will quickly recognize it as gibberish.
  • How Secure Is It? There is no reason to assume that this system offers any security beyond the act of hiding the information itself. How hard it would be to break such a statistical system is an open question. I believe that it would be possible to examine the statistics and come up with a pretty good guess about the shape of the Huffman trees used to generate the text. There may only be a few thousand options, which can be tested quite quickly if some known plaintext is available. For that reason, this system should probably be used in low-grade applications that demand verisimilitude but not perfection.
  • How to Use It? No software is being distributed right now to handle this problem, but it should be easy to code it.

Further Reading

  • Krista Bennett offers a nice survey of textual steganographic methods in her report, “Linguistic Steganography: Survey, Analysis, and Robustness Concerns for Hiding Information in Text”. [Ben04]
  • Steganosaurus, from John Walker, is a C-based program that will use a dictionary to turn bits into gibberish; see fourmilab.ch/stego. [Wal94]