Chapter 1. Framing Information
On its face, information in computers seems perfectly defined and certain. A bank account either has $1,432,442 or it has
$8.32. The weather is either going to be 73 degrees or 74 degrees. The meeting is either going to be at 4 pm or 4:30 pm. Computers
deal only with numbers and numbers are very definite.
Life isn't so easy. Advertisers and electronic gadget manufacturers like to pretend that digital data is perfect and immutable,
freezing life in a crystalline mathematical amber; but the natural world is filled with noise and numbers that can only begin
to approximate what is happening. The digital information comes with much more precision than the world may provide.
Numbers themselves are strange beasts. All of their certainty can be scrambled by arithmetic, equations and numerical parlor
tricks designed to mislead and misdirect. Statisticians brag about lying with numbers. Car dealers and accountants can hide
a lifetime of sins in a balance sheet. Encryption can make one batch of numbers look like another with a snap of the fingers.
Language itself is often beyond the grasp of rational thought. Writers dance around topics and thoughts, relying on nuance,
inflection, allusion, metaphor, and dozens of other rhetorical techniques to deliver a message. None of these tools are perfect
and people seem to find a way to argue about the definition of the word “is”.
This book describes how to hide information by exploiting this uncertainty and imperfection. This book is about how to take
words, sounds, and images and hide them in digital data so they look like other words, sounds, or images. It is about converting
secrets into innocuous noise so that the secrets disappear in the ocean of bits flowing through the Net. It describes how
to make data mimic other data to disguise its origins and obscure its destination. It is about submerging a conversation in a flow of noise so that
no one can know if a conversation exists at all. It is about taking your being, dissolving it into nothingness, and then pulling
it out of the nothingness so it can live again.
Traditional cryptography succeeds by locking up a message in a mathematical safe. Hiding the information so it can't be found
is a similar but often distinct process called steganography. There are many historical examples of it, including hidden compartments and mechanical systems like microdots or burst transmissions
that make the message hard to find. Other techniques, like encoding the message in the first letters of words, disguise the
content and make it look like something else. All of these have been used again and again.
David Kahn's The Codebreakers provides a good history of the techniques. [Kah67]
Digital information offers wonderful opportunities to not only hide information, but also to develop a general theoretical
framework for hiding the data. It is possible to describe general algorithms and make some statements about how hard it will
be for someone who doesn't know the key to find the data. Some algorithms offer a good model of their strength. Others offer
none.
Some of the algorithms for hiding information use keys that control how they behave. Some of the algorithms in this book hide
information in such a way that it is impossible to recover the information without knowing the key. That sounds like cryptography,
even though it is accomplished at the same time as cloaking the information in a masquerade.
Is it better to think of these algorithms as “cryptography” or as “steganography”? Drawing a line between the two is both
arbitrary and dangerously confusing. Most good cryptographic tools also produce data that looks almost perfectly random. You
might say that they are trying to hide the information by disguising it as random noise. On the other hand, many steganographic
algorithms are not trivial to break even after you learn that there is hidden data to find. Placing an algorithm in one camp
often means forgetting why it could exist in the other. The best solution is to think of this book as a collection of tools
for massaging data. Each tool offers some amount of misdirection and some amount of security. The user can combine a number
of different tools to achieve their end.
The book was published under the title “Disappearing Cryptography” because few people knew the word “steganography”
when it appeared. I have kept the title for many of the same practical reasons, but this doesn't mean that the title is just a cute
mechanism for giving the buyer a cover text they can use to judge the book. Simply thinking of these algorithms as tools for disguising information is a mistake. Some offer cryptographic security at
the same time as an effective disguise. Some are deeply intertwined with cryptographic algorithms, while others act independently.
Some are difficult to break without the key while others offer only basic protection. Trying to classify the algorithms purely
as steganography or cryptography imposes only limitations. It may be digital information, but that doesn't mean there aren't
an infinite number of forms, shapes, and appearances the information may assume.
1.0.1. Reasons for Secrecy
There are many different reasons for using the techniques in this book and some are scurrilous. There is little doubt that
the Four Horsemen of the Infocalypse— the drug dealers, the terrorists, the child pornographers, and the money launderers—
will find a way to use the tools to their benefit in the same way that they've employed telephones, cars, airplanes, prescription
drugs, box cutters, knives, libraries, video cameras and many other common, everyday items. There's no need to explain how
people can hide behind the veils of anonymity and secrecy to commit heinous crimes.
But these tools and technologies can also protect the weak. In the book's defense, here's a list of some possible good uses:
- So you can seek counseling about deeply personal problems like suicide.
- So you can inform colleagues and friends about a problem with odor or personal hygiene.
- So you can meet potential romantic partners without danger.
- So you can play roles and act out different identities for fun.
- So you can explore job possibilities without revealing where you currently work and potentially losing your job.
- So you can turn a person in to the authorities anonymously without fear of recrimination.
- So you can leak information to the press about gross injustice or unlawful behavior.
- So you can take part in a contentious political debate about, say, abortion, without losing the friendship of those who happen
to be on the other side of the debate.
- So you can protect your personal information from being exploited by terrorists, drug dealers, child pornographers and money
launderers.
- So the police can communicate with undercover agents infiltrating the gangs of bad people.
Chapter 22 examines the promises and perils of this technology in more detail.
The Central Intelligence Agency, for instance, has been criticized for missing the collapse of the former Soviet Union. They
continued to issue pessimistic assessments of a burgeoning Soviet military while the country imploded. Some blame greed, power,
and politics. I blame the sheer inefficiency of keeping information secret. Spymaster Bob can't share the secret data he got
from Spymaster Fred because everything is compartmentalized. When people can't get new or solid information, they fall back
to their basic prejudices—which in this case was that the Soviet Union was a burgeoning empire. There will always be a need
for covert analysis for some problems, but it will usually be much more inefficient than overt analysis.
Anonymous dissemination of information is a grease for the squeaky wheel of society. As long as people question its validity
and recognize that its source is not willing to stand behind the text, then everyone should be able to function with the information.
When it comes right down to it, anonymous information is just information. It's just a torrent of bits, not a bullet, a bomb
or a broadside. Sharing information generally helps society pursue the interests of justice.
Secret communication is essential for security. The police and the defense department are not the only people who need the
ability to protect their schedules, plans, and business affairs. The algorithms in this book are like locks on doors and cars.
Giving this power to everyone gives everyone the power to protect themselves against crime and abuse. The police do not need
to be everywhere because people can protect themselves.
For all of these reasons and many more, these algorithms are powerful tools for the protection of people and their personal
data.
1.0.2. How It Is Done
There are a number of different ways to hide information. All of them offer some stealth, but not all of them are as strong
as the others. Some provide startling mimicry with some help from the user. Others are largely automatic. Some can be combined
with others to provide multiple layers of security. All of them exploit some bit of randomness, some bit of uncertainty, or
some bit of unspecified state in a file. Here is an abstract list of the techniques used in this book:
- Use the Noise The simplest technique is to replace the noise in an image or sound file with your message. The digital file consists of numbers
that represent the intensity of light or sound at a particular point of time or space. Often these numbers are computed with
extra precision that can't be detected effectively by humans. For instance, one spot in a picture might have 220 units of
blue on a scale that runs between 0 and 255 total units. An average eye would not notice if that one spot was converted to
having 219 units of blue. If this process is done systematically, it is possible to hide large volumes of information just
below the threshold of perception. A digital photo-CD image has 2048 by 3072 pixels that each contain 24 bits of information
about the colors of the image. About 768 kilobytes of data can be hidden by changing just one least significant bit in each pixel, and three times that by using the least significant bit of each of the three colors.
That's probably more than the text of this book. The human eye would not be able to detect the subtle variations but a computer
could reconstruct them all.
- Spread the Information Out Some of the more sophisticated mechanisms spread the information over a number of pixels or moments in the sound file. This
diffusion protects the data and also makes it less susceptible to detection, either by humans looking at the information or
by computers looking for statistical profiles. Many of the techniques that fall into this category came from the radio communication
arena where the engineers first created them to cut down on interference, reduce jamming, and add some secrecy. Adapting them
to digital communications is not difficult.
Spreading the information out often increases the resilience to destruction by either random or malicious forces. The spreading
algorithms often distribute the information in such a way that not all of the bits are required to reassemble the original
data. If some parts get destroyed, the message still gets through.
Many of these spreading techniques hide information in the noise of an image or sound file, but there is no reason why they
can't be used with other forms of data as well.
Many of the techniques are closely related to the process of generating cryptographically secure random numbers— that is,
a stream of random numbers that can't be predicted. Some algorithms use this number stream to choose locations, others blend
the random values with the hidden information, still others replace some of the random values with the message.
- Adopt a Statistical Profile Data often falls into a pattern and computers often try to make decisions about data by looking at the pattern. English text,
for instance, uses the letter ‘p’ far more often than the letter ‘q’ and this information can be useful for breaking ciphers.
If data can be reformulated so it adopts the statistical profile of the English language, then a computer program minding
ps and qs will be fooled.
- Adopt a Structural Profile Mimicking the statistics of a file is just the beginning. More sophisticated solutions rely on complex models of the underlying
data to better mimic it. Chapter 7, for instance, hides information by making it look like the transcript of a baseball game. The bits are hidden by using them
to choose between the nouns, verbs and other parts of the text. The data are recovered by sorting through the text and matching
up the words with the bits that selected them. This technique can produce startling results, although the content of the messages
often seems a bit loopy or directionless. This is often good enough to fool humans or computers that are programmed to algorithmically
scan for particular words or patterns.
- Replace Randomness Many software programs use random number generators to add realism to scenes, sounds, and games. Monsters look better if
a random number generator adds blotches, warts, moles, scars and gouges to a smooth skin defined by mathematical spheres.
Information can be hidden in the place of the random number. The location of the splotches and scars carries the message.
- Change the Order A grocery list may be just a list, but the order of the items can carry a surprisingly large amount of information.
- Split Information Data can be split into any number of packets that take different routes to their destination. Sophisticated algorithms can
also split the information so that any subset of k of the n parts is enough to reconstruct the entire message.
- Hide the Source Some algorithms allow people to broadcast information without revealing their identity. This is not the same as hiding the
information itself, but it is still a valuable tool. Chapters 10 and 11 show how to use anonymous remailers and more mathematically sophisticated Dining Cryptographers' solutions to distribute information anonymously.
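As a rough illustration of the “Use the Noise” idea from the list above, the following Python sketch hides a message in the lowest bit of each color value. It is a minimal example and not the method of any particular chapter. The names embed_lsb, extract_lsb, and capacity_bytes are made up for this example, and a flat list of values between 0 and 255 stands in for a real image file.

```python
def embed_lsb(samples, message):
    """Return a copy of samples with the message bits stored in each low bit."""
    bits = [(byte >> shift) & 1 for byte in message for shift in range(7, -1, -1)]
    if len(bits) > len(samples):
        raise ValueError("message is too long for this cover")
    stego = list(samples)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & ~1) | bit   # clear the low bit, then set it
    return stego


def extract_lsb(samples, length):
    """Read length bytes back out of the low bits."""
    out = bytearray()
    for i in range(length):
        byte = 0
        for value in samples[8 * i: 8 * i + 8]:
            byte = (byte << 1) | (value & 1)
        out.append(byte)
    return bytes(out)


def capacity_bytes(width, height, channels=3, bits_per_channel=1):
    """Bytes that fit when bits_per_channel low bits of every value are used."""
    return width * height * channels * bits_per_channel // 8


# Hypothetical cover data: 32 made-up color values.
cover = [220, 219, 17, 128, 255, 0, 64, 33] * 4
stego = embed_lsb(cover, b"Hi!")
recovered = extract_lsb(stego, 3)
largest_change = max(abs(a - b) for a, b in zip(cover, stego))
```

Here recovered equals the original three bytes, and no value in the cover moves by more than one unit. For a 2048 by 3072 image, capacity_bytes(2048, 3072, channels=1) is 786,432 bytes and capacity_bytes(2048, 3072) is 2,359,296 bytes. A real tool would also need to store the message length and would probably not write to the first values in order.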
These different techniques can be combined in many ways. First, information can be hidden in a list, then the
list can be hidden in the noise of a file that is then broadcast in a way that hides the source of the data.
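To make the “Change the Order” idea concrete: a list of n distinct items can be arranged in n! ways, so its order alone can carry about log2(n!) bits. The Python sketch below shows one simple way to do this with the factorial number system. It is an assumption made for illustration, not necessarily the scheme the book develops later, and the names encode_order and decode_order and the grocery items are invented.

```python
import math


def encode_order(items, number):
    """Arrange distinct items so their order encodes number (0 <= number < n!)."""
    pool = sorted(items)                 # both sides agree on a canonical order
    if not 0 <= number < math.factorial(len(pool)):
        raise ValueError("number does not fit in this many items")
    arranged = []
    for remaining in range(len(pool), 0, -1):
        block = math.factorial(remaining - 1)
        index, number = divmod(number, block)
        arranged.append(pool.pop(index))
    return arranged


def decode_order(arranged):
    """Recover the number hidden in the order of arranged."""
    pool = sorted(arranged)
    number = 0
    for item in arranged:
        index = pool.index(item)
        number += index * math.factorial(len(pool) - 1)
        pool.pop(index)
    return number


groceries = ["apples", "bread", "coffee", "eggs", "flour", "milk", "rice", "tea"]
secret = 31337                                   # must be below 8! = 40320
shopping_list = encode_order(groceries, secret)
capacity_bits = math.log2(math.factorial(len(groceries)))   # a little over 15 bits
```

Calling decode_order(shopping_list) returns 31337. Eight items hold only about 15 bits, but the capacity grows quickly with the length of the list.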
1.0.3. How Steganography Is Used
Hidden information has a variety of uses in products and protocols. Hiding slightly different information or combining the
various algorithms creates different tools with different uses. Here are some of the most interesting applications:
- Enhanced Data Structures Most programmers know that standard data structures get old over time. Eventually there comes a time when new, unplanned
information must be added to the format without breaking old software. Steganography is one solution. You can hide extra information
about the photos in the photos themselves. This information travels with the photo but will not disturb old software that
doesn't know of its existence.
A radiologist could embed comments in the background of a digitized x-ray. The file would still work with standard tools,
saving hospitals the cost of replacing all of their equipment.
- Strong Watermarks The creators of digital content like books, movies, and audio files want to add hidden information into the file to describe
the restrictions they place on the file. This message might be as simple as “This file copyright 2001 by Big Fun” or as complex
as “This file can only be played twice before 12/31/2002 unless you purchase three cases of soda and submit their bottle tops
for rebate. In which case you get 4 song plays for every bottle top.”
Digital Watermarking by Ingemar J. Cox, Matthew L. Miller and Jeffrey A. Bloom is a good introduction to watermarks and the challenges particular
to the subfield. [CMB01]
Some watermarks are meant to be found even after the file undergoes a great deal of distortion. Ideally, the watermark will
still be detectable even after someone crops, rotates, scales and compresses some document. The only way to truly destroy
it is to alter the document so much that it is no longer recognizable.
Other watermarks are deliberately made as fragile as possible. If someone tries to tamper with the file, the watermark will
disappear. Combining strong and weak watermarks is a good option when tampering is possible.
- Document-Tracking Tools Hidden information can identify the legitimate owner of the document. If it is leaked or distributed to unauthorized people,
it can be tracked back to the rightful owner. Adding individual tags to each document is an idea attractive to both content-generating
industries and government agencies with classified information.
- File Authentication The hidden information bundled with a file can also contain a digital signature certifying its authenticity. A regular software
program would simply display (or play) the document. If someone wanted some assurance, the digital signature embedded in the
document can verify that the right person signed it.
- Private Communications Steganography is also useful in political situations when communication is dangerous. There will always be moments when
two people can't exchange messages because their enemies are listening. Many governments continue to see the Internet, corporations
and electronic conversations as an opportunity for surveillance. In these situations, hidden channels offer the politically
weak a chance to elude the powerful who control the networks. [Sha01]
Not all uses for hidden information come classified as steganography or cryptography. Anyone who deals with old data formats
and old software knows that programmers don't always provide ideal data structures with full documentation. Many basic hacks
aren't much different from the steganographic tools in this book. Clever programmers find additional ways to stretch a data
format by packing extra information where it wasn't needed before. This kind of hacking is bound to yield more applications
than people imagined for steganography. Somewhere out there, a child's life may be saved thanks to clever data handling and
steganography!
1.0.4. Attacks on Steganography
Steganographic algorithms provide stealth, camouflage and security to information. How much, though, is hard to measure. As
data blends into the background, when does it effectively disappear? One way to judge the strength is to imagine different
attacks and then try to determine whether the algorithm can successfully withstand them. This approach is far from perfect,
but it is the best available. There's no way to anticipate all possible attacks, although you can try.
Attacking steganographic algorithms is very similar to attacking cryptographic algorithms and many of the same techniques
apply. Of course, steganographic algorithms promise stealth in addition to security, so they are also vulnerable
to additional attacks.
Here's a list of some possible attacks:
- File Only The attacker has access to the file and must determine if it holds a hidden message. This is the weakest form of attack,
but it is also the minimum threshold for successful steganography.
Many of these basic attacks rely on a statistical analysis of digital images or sound files to reveal the presence of a message
in the file. This type of attack is often more of an art than a science because the person hiding the message can try to counter
an attack by adjusting the statistics.
- File and Original Copy In some cases, the attacker may have a copy of the file with the encoded message and a copy of the original, pre-encoded
file. Clearly, detecting some hidden message is a trivial operation. If the two files are different, there must be some new
information hidden inside one of them.
The real question is what the attacker may try to do with the data. The attacker may try to destroy the hidden information,
something that can be accomplished by replacing it with the original. The attacker may try to extract the information or even
replace it with their own. The best algorithms try to defend against someone trying to forge hidden information so that
it looks like it was created by someone else. This is often imagined in the world of watermarks, where the hidden information
might identify the rightful owner. An attacker might try to remove the watermark from a legitimate owner and replace it with
a watermark giving themselves all of the rights and privileges associated with ownership.
- Multiple Encoded Files The attacker gets n different copies of the files with n different messages. One of them may or may not be the original unchanged file. This situation may occur if a company is inserting
different tracking information into each file and the attacker is able to gather a number of different versions. If music
companies sell digital sound files with personalized watermarks, then several fans with legitimate copies can get together
and compare their files.
Some attackers may try to destroy the tracking information or to replace it with their own version of the information. One
of the simplest attacks in this case is to blend the files together, either by averaging the individual elements of the file
or by creating a hybrid by taking different parts from each file.
- Access to the File and Algorithm An ideal steganographic algorithm can withstand scrutiny even if the attacker knows the algorithm itself. Clearly, basic
algorithms that hide and unveil information can't resist this attack. Anyone who knows the algorithm can use it to extract
the information.
But this can work if you keep some part of the algorithm secret and use it as the “key” to unlock the information. Many algorithms
in this book use a cryptographically secure random number generator to control how the information is blended into a file.
The seed value to this random number stream acts like a key. If you don't know it, you can't generate the random number stream
and you can't unblend the information.
- Destroy Everything Attack Some people argue that steganography is not particularly useful because an attacker could simply destroy the message by blurring
a photo or adding noise to a sound file. One common technique used against schemes built on block compression algorithms like JPEG
is to rotate an image 45 degrees, blur the image, sharpen it again, and then rotate it back. This mixes information from different
blocks of the image, effectively erasing messages hidden by some schemes like the ones in Chapter 14.
This technique is a problem, but it can be computationally prohibitive for many users and it introduces its own side effects.
A site like Flickr.com might consider doing this to all incoming images to deter communications, but it would require a fair amount of computation.
It is also not an artful attack. Anyone can destroy messages. Cryptography and many other protocols are also vulnerable to
it.
- Random Tweaking Attacks Some attackers may not try to determine the existence of a message with any certainty. An attacker could just add small,
random tweaks to all files in the hope of destroying whatever message may be there. During World War II, the government censors
would add small changes to numbers in telegrams in the hopes of destroying covert communications. This approach is not very
useful because it sacrifices overall accuracy for the hope of squelching a message. Many of the algorithms in this book can resist a limited attack by using error-correcting codes to recover from a limited number
of seemingly random changes.
- Add New Information Attack Attackers can use the same software to encode a new message in a file. Some algorithms are vulnerable to these attacks because
they overwrite the channel used to hide the information. The attack can be resisted with good error-correcting codes and by
using only a small fraction of the channel chosen at random.
- Reformat Attack One possible attack is to change the format of the file because many competing file formats don't store data in exactly the
same way. There are a number of different image formats, for instance, that use a variety of bits to store the individual
pixels. Many basic tools help the graphic artist deal with the different formats by converting one file format into another.
Many of these conversions can't be perfect. The hidden information is often destroyed in the process. Images can be stored
as either JPEG or GIF images, but converting from JPEG to GIF removes some of the extra information— the EXIF fields — embedded
in the file as part of the standard.
Many watermark algorithms for images try to resist this type of attack because reformatting is so common in the world of graphic
arts. An ideal audio watermark, for instance, would still be readable after someone plays the music on a stereo and records
it after it has traveled through the air.
Of course, there are limits to this. Reformatting can be quite damaging and it is difficult to anticipate all of the cropping,
rotating, scaling, and shearing that a file might undergo. Some of the best algorithms do come close.
- Compression Attack One of the easiest attacks is to compress the file. Compression algorithms try to remove the extraneous information from
a file and “hidden” is often equivalent to “extraneous”. The dangerous compression algorithms are the so-called lossy ones that do not reconstruct a file exactly during decompression. The JPEG image format, for instance, does a good job approximating
the original.
Some of the watermarking algorithms can resist compression by the most popular algorithms, but there are none that can resist
all of them.
The only algorithms that can resist all compression attacks hide the information in plain sight by changing the “perceptually salient” features of an image or sound file.
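The claim under “Random Tweaking Attacks” that error-correcting codes can absorb a limited number of changes can be illustrated with the simplest such code, repetition with majority voting. This Python sketch is only a toy, chosen for brevity; practical systems would likely use stronger codes. The names protect and recover are invented, and the simulated censor is an assumption of the example.

```python
import random


def protect(bits, copies=5):
    """Repeat every bit copies times (a simple repetition code)."""
    return [bit for bit in bits for _ in range(copies)]


def recover(noisy_bits, copies=5):
    """Take a majority vote over each group of copies bits."""
    groups = [noisy_bits[i:i + copies] for i in range(0, len(noisy_bits), copies)]
    return [1 if sum(group) > copies // 2 else 0 for group in groups]


message = [1, 0, 1, 1, 0, 0, 1, 0]
sent = protect(message)

# Simulate a censor who flips bits without knowing what they mean.
# Two flips in each group of five is the most this code can absorb.
rng = random.Random(2024)
tweaked = list(sent)
for start in range(0, len(tweaked), 5):
    for offset in rng.sample(range(5), 2):
        tweaked[start + offset] ^= 1

restored = recover(tweaked)
```

Even though 16 of the 40 transmitted bits were changed, restored matches message. The price is capacity: each hidden bit now occupies five places in the channel.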
Unfortunately, steganography is not a solid science, in part because there's no simple way to measure how well it is doing.
How hidden must the information be before no one can see it? Just how invisible is invisible? The models of human perception
are often too basic to measure what is happening.
The lack of a solid model means it is difficult to establish how well the algorithms resist attack. Many algorithms can survive
cursory scrutiny but fail if a highly trained or talented set of ears and eyes analyzes the results. Some people with so-called
“golden ears” can hear changes in an audio file that are supposedly inaudible to average humans. A watermark may be completely
inaudible to most of the buying public, but if the musicians can hear it the record company may not use it.
Our lack of understanding does not mean that the algorithms don't have practical value. A watermark heard by 1% of the population
is of no concern to the other 99%. An image with hidden information may be detectable, but this only matters if someone is
trying to detect it.
There is also little doubt that a watermark or a steganographic tool does not need to resist all attackers to have substantial
value. A watermark that lives on after cropping and basic compression still carries its message to many people. A hacker may
learn how to destroy it, but most people have better things to do with their time.
Our lack of understanding does not mean that the algorithms do not offer some security. Some of the algorithms insert their
information with mechanisms that offer cryptographic strength. Borrowing these ideas and incorporating them provides both
stealth and security.
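One way to picture such a mechanism is the seed-as-key idea described earlier in this section: a keyed stream of pseudorandom numbers decides which values in the file carry the hidden bits. The Python sketch below is an illustrative construction, not an algorithm taken from the book. It uses HMAC-SHA256 over a counter as a stand-in for a cryptographically secure random number generator, and the names keyed_positions, hide, and reveal are hypothetical.

```python
import hashlib
import hmac


def keyed_positions(key, cover_size, count):
    """Pick count distinct positions below cover_size, determined by key."""
    if count > cover_size:
        raise ValueError("not enough room in the cover")
    positions, seen, counter = [], set(), 0
    while len(positions) < count:
        block = hmac.new(key, counter.to_bytes(8, "big"), hashlib.sha256).digest()
        counter += 1
        # The modulo introduces a slight bias, which is ignored in this sketch.
        candidate = int.from_bytes(block[:8], "big") % cover_size
        if candidate not in seen:
            seen.add(candidate)
            positions.append(candidate)
    return positions


def hide(cover, bits, key):
    """Write each bit into the low bit of a key-selected position."""
    stego = list(cover)
    for pos, bit in zip(keyed_positions(key, len(cover), len(bits)), bits):
        stego[pos] = (stego[pos] & ~1) | bit
    return stego


def reveal(stego, n_bits, key):
    """Read the low bits back from the same key-selected positions."""
    return [stego[pos] & 1 for pos in keyed_positions(key, len(stego), n_bits)]


cover = [(37 * i + 11) % 256 for i in range(2000)]   # stand-in for image data
bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]
stego = hide(cover, bits, b"correct key")
right = reveal(stego, len(bits), b"correct key")
wrong = reveal(stego, len(bits), b"some other key")
```

With the correct key, right equals bits. With any other key the reader looks in different places and gets unrelated low bits, so knowing the algorithm is not enough to recover the message. This sketch alone does nothing to hide the statistical traces of the changes.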
1.1. Adding Context
One reviewer of the book who was asked for a back-cover blurb joked that the book should be “essential bedside reading
for every terrorist”. After a pause he added, “and every freedom fighter, Hollywood executive, police officer, abused spouse,
chief information officer, and anyone needing privacy anywhere.”
You may be a terrorist or you may be a freedom fighter. Who knows? This book is just about technology and technology is neutral.
It teaches you how to cast shape shifting spells that make data look like something completely different. You may have good
plans for these ideas. Perhaps you want to expose a local chemical company dumping toxic waste into the ground. Or you might be
filled with the proverbial malice aforethought and you can't wait to hatch a maniacal plan. You might be part of that cabal
of executives using these secret algorithms to plan where and when to dump the toxic waste. Technology is neutral.
There is some human impulse that would like to believe that all information is ordered, correct, structured, organized, and
above all true. We dream that computers and their vast collection of trivia about the world will keep us safe, secure, and
moving toward some glorious goal, even if we don't know what it is. We hope that the databases held by the government, the
banks, the insurance companies, the retail stores, the doctors, and practically everyone else will deliver unto us a perfectly
ordered world.
Alas, nothing could be farther from the truth. Even the bits can hide multiple meanings. They're supposed to be either on
or off, true or false, 0 or 1, but even the bits can conspire to carry secret messages and hidden truths. Information is not
as certain or as precise as it may seem to be. Sometimes a cigar carries a freight train load of meaning and sometimes it
is just a cigar. Sometimes it is close and no cigar at all.
Throughout it all, only a human can make sense of it. Only a human can determine the difference between an obscene allusion
to a cigar and a reference to an object for delivering nicotine. We keep hoping that artificial intelligence and database engines
will be able to parse all of the data, all of the facts, and all of the bits and identify the terrorists who need punishing,
the good people who need help, and the split ends that need another dose of special conditioner.
You, the reader, are the human who must decide how to use the information in this book. You can solve crimes, coordinate a
wedding, plan a love that will last forever, or concoct dastardly schemes. The technology is neutral. The book is just equations
on a page. You will determine what the equations mean for the world.