Chapter 21. Plain Sight

21.1. Laughtracks

To: Production Staff and Set Design

From: Jerry Brown, Asst. to the Executive Writing Arc Supervisor

Re: Meta-textual clues for audience.

Recent audience surveys show that the larger broadcast audience fails to grasp the humorous possibilities of the narrative arc designed by the office of the Writing Arc. To facilitate the absorption of our humor, we are asking that we construct a new set of neon signs that will cue the studio audience to inject the kind of meta-narrative instructions that will allow our broadcast audience to, for lack of a bigger term, laugh at the right places.

All future scripts from the Office of the Writing Arc Supervisor will include cues for when to activate these lighted signs.

  • Snicker: For use when a mild signal should be injected into the story stream.
  • Snort: We hope to limit this to the jokes told by our particularly curmudgeonly characters.
  • Please Groan: Best used for puns and other cheap forms of humor. This will indicate that we're not stooping to cheap tricks to get laughs but approaching them with an obviously ironic pose.
  • Knee Slapper: For older jokes that we've borrowed from the old school of comedy.
  • Through the Nose: For physical comedy.
  • Atomic Funny: This signals super-duper funniness. The atomic bomb of humor. Not to be confused with humor that just bombs. To be used sparingly, no more than twice per episode.

21.2. Hiding in the Open

In the middle of Michael Crichton's Jurassic Park, when the characters are coming to realize the depth of their predicament, the mathematician in the bunch, Ian Malcolm, asks the dinosaur curators to reprogram their computers. The original software began with an expected count of dinosaurs and then scanned the park looking to account for all of the dinosaurs on the list. If one was missing, it raised an alarm and triggered a search. It all seemed bullet-proof.

But when Malcolm asked them to raise the expected count, the computer came back and found even more dinosaurs than it did in the original count.

“Now you see the flaw in your procedures,” Malcolm said. “You only tracked the expected number of dinosaurs. You were worried about losing animals, and your procedures were designed to advise you instantly if you had less than the expected number. But that wasn't the problem. The problem was, you had more than the expected number.”[Cri90]

The dinosaurs were breeding and the elaborate control system couldn't account for them. Malcolm saw this weakness in the system as an example of a larger truth about the universe.

“[S]traight linearity, which we have come to take for granted in everything from physics to fiction, simply does not exist. Linearity is an artificial way of viewing the world. Real life isn't a series of interconnected events occurring one after another like beads strung on a necklace. Life is actually a series of encounters in which one event may change those that follow in a wholly unpredictable, even devastating way,” he explained.

All data formats for computers begin with expectations. While the formats may not be linear in the strictest sense of the word, they are still well defined and the strength of this definition leads to its weakness.

Much of this book involves tweaking the actual data stored in a file by introducing small changes like adding a bit more red to a pixel. These solutions are useful, but they can introduce distortions that can lead to detection. As Chapter 17 shows, many of the simplest algorithms distort the statistical profile of the files in subtle but often detectable ways. To make matters worse, the approach is in constant competition with compression algorithms that try to squeeze out all extraneous information and noise to save space. This is why many researchers suggest that the only long-term solution is to hide information in the salient features, the actual visible or audible parts of a file.

Here's a counter approach: Instead of hiding information in the data itself, hide information in the gaps in the data structures, in the places where the software won't look for it.

Wojciech Mazurczyk and Krzysztof Szczypiorski found that Voice Over IP calls could hide information because the algorithms work around missing packets, or packets replaced with hidden messages.[MS08]

“K is for Keeler, As fresh as green paint, The fastest and mostest To hit where they ain't.” -Ogden Nash [Nas49]

Consider how to join these two facts: (1) GIF files, like most files, begin at the beginning of the data with a few bytes that describe the size of the data, while (2) ZIP files begin at the end with a table that describes the location of data inside. This makes it possible to concatenate a GIF file and a ZIP file so that both are still decodable, at least in theory. Just type this on a Mac or UNIX box:

    cat somefile.zip >> somefile.gif

This will append somefile.zip to the end of somefile.gif. It essentially hides a GIF file at the beginning of a ZIP file or a ZIP file at the end of the GIF file. A program looking for a GIF file will start at the beginning, decode the header information describing the size of the image, and then unpack it beginning at the very beginning. A ZIP file decoder will do the same, but from the end. I assume that the ZIP file was designed this way to make it easier to add more files to a ZIP file by just appending them to the end.
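The same experiment can be tried without touching the file system. What follows is only a sketch under stated assumptions: the 43-byte `TINY_GIF` is an illustrative stand-in for somefile.gif, the names `make_zip` and `gif_size` are made up for this example, and the GIF side reads nothing more than the header. It relies on the fact that Python's standard `zipfile` module locates the archive from the end of the data and tolerates bytes prepended to it.

```python
import io
import struct
import zipfile

# A tiny 1x1 GIF written out in hex (illustrative stand-in for somefile.gif).
TINY_GIF = bytes.fromhex(
    "474946383961" "01000100" "800000" "000000ffffff"
    "21f9040100000000" "2c000000000100010000" "0202440100" "3b"
)

def make_zip(files):
    """Build a ZIP archive in memory from a {name: text} mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, text in files.items():
            zf.writestr(name, text)
    return buf.getvalue()

def gif_size(data):
    """Read width and height the way a GIF decoder starts: from the front."""
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF")
    return struct.unpack("<HH", data[6:10])

# The equivalent of: cat somefile.zip >> somefile.gif
combined = TINY_GIF + make_zip({"secret.txt": "meet at dawn"})

# The GIF reader starts at the beginning and never sees the archive.
width, height = gif_size(combined)

# The ZIP reader starts at the end and never sees the image.
with zipfile.ZipFile(io.BytesIO(combined)) as zf:
    names = zf.namelist()
    secret = zf.read("secret.txt").decode()
```

Each reader finds only the half it was built to find. Other ZIP tools may be stricter than `zipfile` about leading bytes, which is where the "at least in theory" caveat comes in.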

Neither software package will notice the other, unless the programmers made some unspoken assumptions. It's entirely possible that a clever programmer will treat two different values as interchangeable: (1) the length of the file as described by the header and (2) the length of the file as described by the operating system. Nonstandard implementations that make this mistake may crash when handed the combined file.
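The difference between those two lengths can also be measured. The sketch below walks the block structure of a GIF to find where the image data says it ends, then compares that with the number of bytes actually present. It is a rough illustration, not a full GIF parser: the function names are invented for this example, `TINY_GIF` is an illustrative 1x1 image, and truncated input will simply raise an `IndexError`.

```python
# A tiny 1x1 GIF written out in hex (illustrative test data).
TINY_GIF = bytes.fromhex(
    "474946383961" "01000100" "800000" "000000ffffff"
    "21f9040100000000" "2c000000000100010000" "0202440100" "3b"
)

def skip_sub_blocks(data, pos):
    """Skip a chain of length-prefixed sub-blocks that ends with a zero byte."""
    while data[pos] != 0:
        pos += data[pos] + 1
    return pos + 1

def color_table_bytes(flags):
    """Size in bytes of a color table, or 0 if the flag byte says there is none."""
    return 3 * 2 ** ((flags & 0x07) + 1) if flags & 0x80 else 0

def gif_end(data):
    """Return the offset just past the GIF trailer byte (0x3B)."""
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF")
    # 6-byte signature + 7-byte screen descriptor + optional global color table
    pos = 13 + color_table_bytes(data[10])
    while True:
        marker = data[pos]
        if marker == 0x3B:            # trailer: the image ends here
            return pos + 1
        if marker == 0x21:            # extension: introducer, label, sub-blocks
            pos = skip_sub_blocks(data, pos + 2)
        elif marker == 0x2C:          # image descriptor, optional local color table
            pos += 10 + color_table_bytes(data[pos + 9])
            pos = skip_sub_blocks(data, pos + 1)  # +1 skips the LZW code-size byte
        else:
            raise ValueError(f"unexpected block marker {marker:#x} at {pos}")

def trailing_bytes(data):
    """Bytes present in the file but not accounted for by the GIF structure."""
    return len(data) - gif_end(data)
```

A clean image should report zero trailing bytes, while a GIF with an archive glued to its tail reports the size of the stowaway. The same two numbers that might confuse a careless decoder are what a suspicious examiner would compare.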

21.3. Other Formats

Many formats make it easy to add freeloading data to a file. In fact, good programmers have been pushing this as a design feature for software because it makes it simpler to improve software without crashing older versions. A good file format will include a mechanism to add more data later in case the need arises.

Many of the modern tagged languages like XML (Extensible Markup Language) or its cousins like SGML (Standard Generalized Markup Language) or HTML (Hypertext Markup Language) are designed to let the programmer toss in additional information or create additional data as necessary.

Here's a common example:

Here's how easily it can be extended with some new tags and attributes:

The new version includes attributes describing the type of measurement used for the ingredients and a new tag giving credit to the creator. Most software tuned to the original package will ignore these extra tags and find only the data it expects to find.
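A small sketch can make the point concrete. The recipe documents below are invented for illustration, and the tag and attribute names (`recipe`, `ingredient`, `amount`, `unit`, `credit`) are assumptions rather than part of any standard. The reader function stands in for software written against the original layout, here built on Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical original document.
ORIGINAL = """<recipe>
  <title>Pancakes</title>
  <ingredient amount="2">flour</ingredient>
  <ingredient amount="1">milk</ingredient>
</recipe>"""

# Hypothetical extended document: a new <credit> tag and a new unit attribute.
EXTENDED = """<recipe>
  <title>Pancakes</title>
  <credit>A. Cook</credit>
  <ingredient amount="2" unit="cups">flour</ingredient>
  <ingredient amount="1" unit="cup">milk</ingredient>
</recipe>"""

def read_recipe(xml_text):
    """Read only the fields the original software knows about."""
    root = ET.fromstring(xml_text)
    title = root.findtext("title")
    ingredients = [(item.get("amount"), item.text)
                   for item in root.findall("ingredient")]
    return title, ingredients

old_view = read_recipe(ORIGINAL)
new_view = read_recipe(EXTENDED)
```

Both calls return the same title and ingredient list. The credit and the units ride along in the extended file without the older reader ever asking for them.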

This is usually the case with XML, but it is not always true. The specification includes a mechanism for defining the legitimate pattern for the tags and this specification can block the adding of extra tags. The Document Type Definition, or DTD, includes definitions for which tags should be found and where they can be found.

Many programmers report a fair amount of frustration with rigid DTDs because they cause incompatibilities between versions of the software. This is especially true if the software relies on a feature that allows a DTD to be downloaded from a web site. I know one open-source project that started crashing after some of the keepers updated the DTD on a distant web site.

21.3.1. Microformats

One of the more established attempts at building a regular mechanism for revising and extending HTML lives at the www.microformats.org web site. The Microformats project includes a number of additions to HTML that add more semantic meaning to the text. The extra meaning identifies the context or meaning of the text on a web page by identifying its role. One type of tag wraps around a zip or postal code. Another identifies a telephone number.

Here's an example of an address spelled out in the vcard format:

There are similar formats for calendars (hCalendar), opinions (hReview), social networks (XFN), geography (geo) and a few others. All are designed to add the information in a way that will be ignored by the browser. It will slip by the web browser like a dinosaur because the web browser isn't looking for it.
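To see how this extra layer rides along unnoticed, consider the following sketch. The page fragment is made up for illustration; the class names (`vcard`, `fn`, `adr`, `street-address`, `locality`, `postal-code`, `tel`) follow the hCard convention, and `read_hcard` is a hypothetical helper built on Python's standard `html.parser`. A browser would render the text and ignore the class names, while a reader that knows the convention can pull out the labeled fields.

```python
from html.parser import HTMLParser

# Illustrative fragment; the person and address are invented.
PAGE = """<div class="vcard">
  <span class="fn">Pat Example</span>
  <div class="adr">
    <span class="street-address">1 Main Street</span>,
    <span class="locality">Springfield</span>
    <span class="postal-code">00000</span>
  </div>
  <span class="tel">555-0100</span>
</div>"""

class HCardReader(HTMLParser):
    """Collect the text of elements whose class names one of a few known fields."""
    FIELDS = {"fn", "street-address", "locality", "postal-code", "tel"}

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.FIELDS:
                self._current = name

    def handle_data(self, data):
        # Store the first non-blank text that follows a recognized start tag.
        if self._current and data.strip():
            self.fields[self._current] = data.strip()
            self._current = None

    def handle_endtag(self, tag):
        self._current = None

def read_hcard(html_text):
    reader = HCardReader()
    reader.feed(html_text)
    return reader.fields

card = read_hcard(PAGE)
```

The same mechanism that carries a postal code could carry anything else: a class name or attribute that no one is looking for is invisible to every program that isn't looking for it.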

21.3.2. Rice's Theorem

One of the more interesting theorems from computer science theory suggests that it may be theoretically impossible for anyone to examine a program and determine whether it is packing extra information in a data file. This theorem is worth mentioning even if it may not have much practical use.

A casual version of the theorem, due to Henry Gordon Rice, states that a software program can't be counted on to detect whether another program is conforming to some standard for a file format. The theorem itself says that the problem is undecidable, a term that means that no program is guaranteed to halt and give a definitive answer for every input. A checker that doesn't halt could go on checking, rechecking or looking for some complex answer.
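For readers who want the formal version, the usual textbook statement of the theorem can be written as follows. The notation is a common convention rather than anything drawn from this chapter: $\varphi_e$ is the partial function computed by program $e$, and $\mathcal{P}$ is the property being tested.

```latex
Let $\varphi_e$ denote the partial computable function computed by program $e$,
and let $\mathcal{P}$ be a set of partial computable functions.
If $\mathcal{P}$ is non-trivial, meaning
\[
  \exists e \; (\varphi_e \in \mathcal{P})
  \quad\text{and}\quad
  \exists e' \; (\varphi_{e'} \notin \mathcal{P}),
\]
then the index set
\[
  I_{\mathcal{P}} = \{\, e : \varphi_e \in \mathcal{P} \,\}
\]
is undecidable.
```

In this chapter's terms, one could take $\mathcal{P}$ to be the functions whose every output is a well-formed file in some format. Some programs have that property and some do not, so the property is non-trivial, and no checker can sort all programs correctly.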

This theorem may help establish a theoretical limit to checking for secret messages in file formats, but it may not be of practical value because the theoretical result is based on asking a computer program to examine itself, a sort of logical tongue-twister that can have odd logical consequences. A less rigorous piece of software may be able to do a good job of testing for errant code.

The parsers for XML, for instance, can test to see whether XML conforms to a well-defined model. Extra tags and attributes are flagged and reported. Even if most software can't rely on these rules, they exist and do a good job of checking the data flowing along the wires. The XML standard, though, isn't Turing-complete and so it's possible to build a fairly straightforward testing tool.
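A toy version of such a testing tool is sketched below. It is not a DTD validator (Python's standard `xml.etree.ElementTree` does not validate against DTDs); it is a hand-written whitelist that stands in for one, and the `ALLOWED` table of tags and attributes is a made-up model for an illustrative recipe format.

```python
import xml.etree.ElementTree as ET

# Hypothetical model: each permitted tag maps to its permitted attributes.
ALLOWED = {
    "recipe": set(),
    "title": set(),
    "ingredient": {"amount"},
}

def find_extras(xml_text, allowed=ALLOWED):
    """Report tags and attributes that the model does not mention."""
    extras = []
    for elem in ET.fromstring(xml_text).iter():   # document order
        if elem.tag not in allowed:
            extras.append(f"unexpected tag <{elem.tag}>")
            continue
        for attr in elem.attrib:
            if attr not in allowed[elem.tag]:
                extras.append(f"unexpected attribute {attr!r} on <{elem.tag}>")
    return extras

clean = '<recipe><title>T</title><ingredient amount="2">flour</ingredient></recipe>'
padded = ('<recipe><credit>A. Cook</credit>'
          '<ingredient amount="2" unit="cups">flour</ingredient></recipe>')

clean_report = find_extras(clean)
padded_report = find_extras(padded)
```

The check always terminates because it makes a single pass over a finite tree. A checker like this catches freeloading tags and attributes, though it says nothing about data hidden inside the permitted fields themselves.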

21.4. Summary

Almost every data format has plenty of loopholes that can be used to add extra data. If the code reads the first n items on a line, you can stick more information after the nth item. If there's a special end of file marker, say a zero, then you can add more after the zero. This technique makes it easy to add information in many cases.
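Both loopholes fit in a few lines. The readers below are deliberately naive and the sample data is invented for illustration; real formats differ in detail, but the pattern of stopping early is the same.

```python
def read_first_n(line, n):
    """A reader that takes the first n comma-separated items and stops."""
    return line.split(",")[:n]

def read_until_zero(data):
    """A reader that treats the first zero byte as the end of the data."""
    return data.split(b"\x00", 1)[0]

# Three expected values, followed by items the reader never asks for.
line = "3.5,4.0,1.2,psst,over,here"
visible_items = read_first_n(line, 3)
stowaway_items = line.split(",")[3:]

# A zero-terminated record with more bytes after the terminator.
record = b"visible text\x00hidden text"
visible_bytes = read_until_zero(record)
stowaway_bytes = record.partition(b"\x00")[2]
```

Each reader returns exactly what it expected to find, and the extra material is only recovered by code that knows to look past the stopping point.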

A neat trick is mixing together two files with head-first and tail-first ordering of data, like the GIF and ZIP formats. If these two parts are glued together, decoding algorithms will frequently fail to notice the other half. This lets a GIF hitch a ride on a ZIP file and a ZIP file hitch a ride on a GIF.

  • The Disguise: Information is stored in the spare corners of data files, a surprisingly easy process.
  • How Secure Is It? It may be theoretically impossible to detect that a piece of software is capable of reading or hiding extra data in a file. This theoretical barrier, though, may not have much practical weight.
  • How to Use It? The simplest solution may be to glue together a ZIP and a GIF file. Or just add extra nodes to an XML file.

Further Reading

There are many data format books out there. It's impossible to list them all.
