Chapter 15. Doing Real Work Without Real Data

Peter Wayner

The largest privacy breaches are caused by data thieves stealing the contents of corporate or government databases. Imagine a database that can do useful work without having any useful information in it. For instance, imagine a server that can answer questions about the items you purchased, your schedule for next Thursday, your favorite movies, or countless other details like other databases hooked up to the Internet—but if someone snuck through the firewalls, cracked the password layer, or found some way to get superuser control on the machine, he would find nothing he could use. Even if the evil hacker/ninja snuck into the server room and hooked the hard disk up to a forensic analyzer, there would be no juicy info available.

A database like this sounds impossible. How could the database answer questions about next Thursday without knowing something about what’s going to happen next Thursday? It’s got to have the data there somewhere, right?

Others have suggested suboptimal solutions to protecting sensitive data. But even if it’s locked away inside some electronic safe hidden in a virtual stonewalled chamber buried inside a cyber castle wrapped by an impenetrable software moat filled with digital acid that dissolves any bad bits that come in contact with it, the data is present and remains vulnerable to someone smart enough to simulate a privileged user.

The solution I’ve developed in this chapter is unique. The data is present, I admit, but not in a form that’s useful to anyone but the end user. The data is scrambled by the end user or client before sending it to the server and unscrambled after the client fetches it back. While it lives on the database, the information remains inscrutable and virtually useless to everyone but the rightful owner.

This approach is a bit different from simply encrypting the database using the built-in encryption functions that are part of most databases. That encryption is handled by the database itself, leaving the server with a readable version of the data at both the storage and retrieval ends. The data might live in encrypted form, but the database itself will decrypt it to search it and make decisions. If the server knows what is going on, so can anyone who can hack into the server or the network carrying the traffic.

The trick is to stop thinking of the database as an all-seeing oracle and start imagining it as a mathematical switching network that knows nothing about the information flowing through it. Then you cloak the data selectively and store it translucently so the sensitive bits are hidden while the not-so-sensitive information is kept in the clear. In other words, stop thinking of the database as a fortress to be protected, and start imagining it as a diffuse, unattackable translucent cloud of data dispersed throughout space.

This approach does not require any revolutionary mathematics or sophisticated computer science. The tools for encrypting and decrypting information are well understood, reasonably efficient on modern CPUs, and relatively stable. Thirty-five years of public research has given us a wide range of encryption functions with a surprisingly varied set of options for creating databases that store information without knowing anything about it.

There are limitations and compromises to this system. If the database doesn’t know anything, it cannot offer the same trusted foundation, the same all-powerful, Oz-like stability to the programmers who tend it and the people who rely on it. If someone forgets her password (the key for unlocking all of the data), well, the data is gone forever. Pace Yeats, the center cannot hold, because the center doesn’t know anything.

Is this approach worthwhile? It is, more often than many expect. When I first started exploring the idea, a friend said something like, “That’s nice. But not for anything real. If the Internet is going to read our minds, it’s going to need to track our movements.” That didn’t seem right to me. So to test it, I sketched out the architecture for a big online store much like Amazon and found that such a store could offer almost all of the features of the best online stores without keeping any personalized information. They just have to store data translucently so some parts are visible and some parts are opaque.

How Data Translucency Works

What would such a store look like? The best way to explain some of the basic design patterns is with an example. Table 15-1 shows a traditional database table in a traditional store.

Table 15-1. Sample traditional database

Customer	Item purchased	Size	Color
James Morrison	Loafing Khakis	34W-32L	Moondust
James Morrison	Pure White Shirt	16–32	Supernova White
Richard James	Loafing Khakis	32W-32L	Moondust
Harrison James	Tread-On-Me Socks	L	Blackhole

This is the kind of information needed by the folks who stock the shelves and make decisions about marketing. They need to know what is being sold and, so they think, to whom.

People have different instinctual reactions to the existence of these databases. Some welcome them and see the efficiency as a net gain for society (Brin 1998). Others fret about how the information might be used for other purposes. We have laws protecting the movie rental records of people because the data became a focal point during the Senate confirmation hearings of Robert Bork (Etzioni 1999). Times and mores change, and no one knows what will be made of this information in the future. Cigarette smoking was once widespread and an accepted way to push the pause button on life. Now, companies fire people for smoking to save medical costs. What will the society of the future do with your current purchase records?

A store selling clothes might try to placate customer worries by announcing a privacy policy and pledging to keep the data locked up, a strategy that often works until some breach happens. The breach may be intentional (Wald 2004, Anon. 2003, Shenon 2003) or unintentional (Wald 2007, Zeller 2006, Pfanner 2007, and many more), but it’s impossible to prevent when you use the fortress model of security.

The translucent model can solve the problem by scrambling some of the columns with a one-way function such as SHA256, a well-known hash function that’s designed to be practically impossible to invert. That is, if you start with the string x, you can reliably create SHA256(x), but if you get your hands on SHA256(x), it’s infeasible to figure out x.^[106] If the customers hash their names, the plain-text name can be replaced by the 256 bits produced by the computation BASE64(SHA256(name)). The result might look like Table 15-2.^[107]

Table 15-2. Sample database with hashes in customer field

Customer	Item purchased	Size	Color
ick93LKJ0dkK	Loafing Khakis	34W-32L	Moondust
ick93LKJ0dkK	Pure White Shirt	16–32	Supernova White
98JOID30dsl0	Loafing Khakis	32W-32L	Moondust
8JidIklel09f	Tread-On-Me Socks	L	Blackhole

The stock department can still track how many pairs of khakis were sold. The marketing department can still try to look for patterns, like the fact that the same customer, ick93LKJ0dkK, bought pants and a shirt at the same time. The names, though, are obscured.
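The aggregate queries described above never need a real name. A minimal Python sketch, assuming the rows of Table 15-2 are held in memory as tuples (the variable and function names here are illustrative, not part of any real system):

```python
from collections import Counter, defaultdict

# Rows copied from Table 15-2: (customer pseudonym, item, size, color).
ROWS = [
    ("ick93LKJ0dkK", "Loafing Khakis", "34W-32L", "Moondust"),
    ("ick93LKJ0dkK", "Pure White Shirt", "16–32", "Supernova White"),
    ("98JOID30dsl0", "Loafing Khakis", "32W-32L", "Moondust"),
    ("8JidIklel09f", "Tread-On-Me Socks", "L", "Blackhole"),
]


def units_sold(rows):
    """Stock department view: how many of each item were sold."""
    return Counter(item for _, item, _, _ in rows)


def baskets(rows):
    """Marketing view: which items each pseudonymous customer bought."""
    by_customer = defaultdict(list)
    for customer, item, _, _ in rows:
        by_customer[customer].append(item)
    return dict(by_customer)
```

Both functions treat the customer column as an opaque string, so they would work unchanged on the clear-text names of Table 15-1.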

The value of a one-way function is that it combines the two seemingly incompatible goals of hiding information while making it reliably persistent. That is, James Morrison can give the store a hash of his name in the form ick93LKJ0dkK and the store can be sure that his records in the database belong to him, but they can’t find his real name. This can be useful with warranties, frequent shopper programs, and automated suggestion tools. Websites that compute suggestions for customers based on their past purchases can still compute them for someone with the digital pseudonym ick93LKJ0dkK.

When James Morrison returns, he can provide his name and look up these values. He doesn’t need to remember ick93LKJ0dkK, because the software on his computer can reconstruct it by calculating BASE64(SHA256(name)). After that, ick93LKJ0dkK acts just like a regular name.
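The client-side calculation can be sketched in a few lines of Python. This is one possible reading of BASE64(SHA256(name)), assuming the name is encoded as UTF-8 and the result is truncated to 12 characters to match the look of Table 15-2; the function name `pseudonym` and the default length are assumptions made for illustration. The values in the table are illustrative samples, so real output will differ from them.

```python
import base64
import hashlib


def pseudonym(name: str, length: int = 12) -> str:
    """Return the first `length` characters of BASE64(SHA256(name))."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()  # 32 bytes = 256 bits
    encoded = base64.b64encode(digest).decode("ascii")      # 44 characters with padding
    return encoded[:length]
```

Calling `pseudonym("James Morrison")` returns the same string on every machine and every visit, which is what lets the customer's software rebuild the pseudonym instead of remembering it.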

The raw data in the scrambled table isn’t much use. A drug fiend on a crime spree couldn’t use a translucent database from a pharmacy to find houses with narcotic prescriptions^[108] (Meier 2007, Harris 2004, Butterfield 2001).

There are still limitations. If James Morrison’s neighbor works at the store, that neighbor can still find out what James Morrison bought by just typing in the right name with the right spelling. This weakness can be reduced by forcing James Morrison to remember a secret password and computing a pseudonym from SHA256(name+password). This is a good solution until James Morrison forgets the password, effectively destroying all of the records.^[109] This is why this level of security may be best for more casual applications, such as online stores. It might not make sense for highly valuable records, such as certificates of deposit at banks.
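A minimal sketch of the password-protected pseudonym, again in Python. The first function follows the formula SHA256(name+password) literally. The second is a variant, not taken from the text's formula, that uses the HMAC construction mentioned in the chapter's footnotes, with the password as the key. Both function names are hypothetical.

```python
import base64
import hashlib
import hmac


def salted_pseudonym(name: str, password: str) -> str:
    """BASE64(SHA256(name+password)), following the formula in the text."""
    data = (name + password).encode("utf-8")
    return base64.b64encode(hashlib.sha256(data).digest()).decode("ascii")


def hmac_pseudonym(name: str, password: str) -> str:
    """Variant (an assumption): HMAC-SHA256 of the name, keyed by the password."""
    mac = hmac.new(password.encode("utf-8"), name.encode("utf-8"), hashlib.sha256)
    return base64.b64encode(mac.digest()).decode("ascii")
```

One reason to consider the keyed variant is that plain concatenation cannot tell where the name ends and the password begins: the pairs ("ab", "c") and ("a", "bc") produce the same input string and therefore the same pseudonym. The HMAC version keeps the two inputs separate.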

A Real-Life Example

There are a number of ways in which translucent databases can save system administrators the time and energy of worrying about the data under their care. Imagine a database that holds the schedule of babysitters so that parents can search for the babysitter who’s available on a particular evening, a service that can save parents and babysitters many annoying rounds of phone tag. I built a version of this system to explore the value of the idea; it required distributing a solid hash function to the client. The central server didn’t require any code change because the hashed identities were strings that could replace the names.

This information falls into a troubling limbo in terms of security motivation. The potential for malice is large and the downside is catastrophic, but there’s no great economic support for the system. The babysitters—unlike the U.S. government, for instance—don’t have the money to support the kind of system that protects the nuclear launch codes.

Translucent databases reduce the danger of aggregating so much information in one place protected by one root password. Instead of storing the name of each babysitter, the schedules can be stored under pseudonyms, as shown in Table 15-3.

Table 15-3. Babysitter availability table with hashed names

Name	Free/busy	Start time	End time
8a9d9b999d9da9s	Free	Saturday 15:30	Saturday 22:30
8a9d9b999d9da9s	Busy	Sunday 11:00	Sunday 23:00
Aab773783cc838	Free	Saturday 18:30	Saturday 23:00

How do parents get access to the schedules? The babysitters choose who can have access by computing SHA256(parent's name+parent's password). Then, they put an entry into an access table that looks like Table 15-4.

Table 15-4. Babysitter access table

Babysitter’s pseudonym	Parents’ pseudonym	Encrypted babysitter name
8a9d9b999d9da9s	4373A73CC83892	91Ab3638dc99390203
Aab773783cc838	4373A73CC83892	2Ab37838DDcc83922
Aab773783cc838	99919a9b9bbb933	AB888329394CC9324

The parents log in by typing in their name and password to generate their pseudonym. They can then search the access table to find the pseudonyms of the babysitters who have granted them access. This allows them to search the schedules and find a babysitter with an open schedule. If the babysitter keeps the parents’ pseudonym out of the access table, the parents are locked out. The babysitter is in control.
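The lookup can be sketched as a join between the two tables. This Python sketch copies the rows of Tables 15-3 and 15-4 into lists and assumes the parent's pseudonym has already been computed on the client; the function name `visible_free_slots` is made up for the example.

```python
# Rows copied from Table 15-3: (babysitter pseudonym, status, start, end).
SCHEDULE = [
    ("8a9d9b999d9da9s", "Free", "Saturday 15:30", "Saturday 22:30"),
    ("8a9d9b999d9da9s", "Busy", "Sunday 11:00", "Sunday 23:00"),
    ("Aab773783cc838", "Free", "Saturday 18:30", "Saturday 23:00"),
]

# Rows copied from Table 15-4: (babysitter pseudonym, parent pseudonym, encrypted name).
ACCESS = [
    ("8a9d9b999d9da9s", "4373A73CC83892", "91Ab3638dc99390203"),
    ("Aab773783cc838", "4373A73CC83892", "2Ab37838DDcc83922"),
    ("Aab773783cc838", "99919a9b9bbb933", "AB888329394CC9324"),
]


def visible_free_slots(parent_pseudonym, access=ACCESS, schedule=SCHEDULE):
    """Return the free slots of babysitters who granted this parent access."""
    allowed = {sitter for sitter, parent, _ in access if parent == parent_pseudonym}
    return [row for row in schedule if row[0] in allowed and row[1] == "Free"]
```

The server can run this query without learning who anyone is: it only compares opaque strings. A parent whose pseudonym appears in no row of the access table simply gets an empty list back.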

How do the parents locate the babysitter? The third column contains the babysitter’s name encrypted with the parents’ name and password. Thus, the babysitter’s real name in each row can be decrypted only by the parent indicated in that row. In this case, the second column with the pseudonym might be constructed from SHA256(parent's name+parent's password), as already described, while the key might be computed a bit differently with SHA256(parent's password+parent's name). Thus, parents need just one set of information (name and password) to calculate their pseudonyms and keys, but no one can guess the secret key from the public pseudonym.

Personal Data Stored As a Convenience

The great benefit of my system is in protecting data kept by a store as a convenience. Many stores, both brick-and-mortar and online, keep the customers’ shipping addresses, credit cards, and other personalized information on file to avoid the trouble of retyping these for every purchase.

One solution is to combine traditional encryption with the one-way function and use the one-way function to compute the key from a name and a password. That is, encrypt the addresses and credit card numbers with a key defined by SHA256(password+name+padding), where padding is some random value added to ensure that the key is different from the digital pseudonym, SHA256(name+password), used to replace the name.

A store needs to keep track of a customer’s shipping address until the package leaves the building. Then they can delete the data and keep only the encrypted version until the customer comes back again.
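A rough Python sketch of this lifecycle follows. It assumes the client derives the key and encrypts the address itself, so the server only ever stores an opaque blob plus a clear-text copy that it discards at shipment. The `Order` class, its field names, and `storage_key` are hypothetical, and the padding is assumed to be a fixed value the client can reproduce on its next visit.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


def storage_key(name: str, password: str, padding: str) -> bytes:
    """Client side: SHA256(password+name+padding), used only as an encryption key."""
    return hashlib.sha256((password + name + padding).encode("utf-8")).digest()


@dataclass
class Order:
    customer_pseudonym: str          # SHA256(name+password), computed by the client
    encrypted_address: bytes         # opaque blob, encrypted by the client
    shipping_address: Optional[str]  # clear text, held only until shipment

    def mark_shipped(self) -> None:
        """Drop the readable address once the package has left the building."""
        self.shipping_address = None
```

After `mark_shipped()` runs, the only copy of the address left on the server is the encrypted one, which the customer's software can decrypt on the next visit by recomputing the key.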

Trade-offs

Does it make sense to turn your database into an inscrutable pile of bits? Here are the advantages as I see them:

Lower security costs: The bits should be useless to anyone who gets root access. This system makes it even easier to distribute the entire database to several sites because less trust is placed in the individual sites. Betrayal isn’t a problem if the data is inscrutable.
Lower computation costs: The server doesn’t need to encrypt the data or support an SSL session. Clients, which usually have plenty of cycles to burn, do it instead.
More efficient database operations: This may be a minor effect, but a good one-way function should distribute the database keys evenly across the space. If you use something like a name, you’ll end up with clusters around common words or names, slowing the index.

There are, however, prices to be paid:

No superuser to the rescue: If the server administrator has no power to read the data, she has no power to fix the data either.
The center can’t initiate action: A store with a translucent database can’t help users until they initiate service by presenting the pseudonym. A store wouldn’t be able to send out email notices saying effectively, “You bought A, B, and C in the past, so we know you would love correlation_function(A,B,C).” But some see this limitation as a feature, reducing spam.
Clients become the target: If the server is neutered, hackers will attack the client instead with sophisticated key sniffers, other malware, and social engineering. The users may not be as adept at blocking the attack as a good system administrator. The payoff from attacking an individual client, however, is dramatically lower than that from hitting a centralized server.
Encrypted data is more fragile: A single spelling error in a name is often recoverable with clear text, but even a one-bit change can lead to big differences in the value of SHA256(x).^[110]
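The fragility in the last item, and the simplest of the remedies listed in the chapter's footnotes (removing whitespace and converting to uppercase), can be illustrated with a short Python sketch. The function names are made up for the example, and the normalization rules are one possible choice, not a prescribed standard.

```python
import hashlib


def normalize(name: str) -> str:
    """Remove all whitespace and convert to uppercase before hashing."""
    return "".join(name.split()).upper()


def name_hash(name: str, normalized: bool = True) -> str:
    """Hex SHA256 of the name, optionally normalized first."""
    text = normalize(name) if normalized else name
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

With normalization, "James Morrison" and " james  morrison " hash to the same value. Without it, a change of capitalization alone yields an unrelated hash. Normalization does nothing for a genuine misspelling such as "James Morrisson", which still produces a completely different value.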

These considerations suggest that this approach is best for protecting mildly sensitive data that can be lost without real consequences. A customer who loses access to the purchase history at a store is not seriously inconvenienced. A patient who can’t access medical records, however, may be pretty upset.

One of the best uses for this approach may be with multitiered databases. A hospital could keep its patient records intact in one centralized vault, but distribute depersonalized, translucent versions to researchers conducting statistical studies. These researchers could still work with real data without the worry of protecting personal data (Schneier 2007).

Going Deeper

One-way functions are just the beginning of the possibilities for working with protected data. Other traditional mathematical tools developed by cryptologists can be used with applications in the same way to preserve privacy. Steganographic applications, for instance, can mix a hidden layer of data in with the real layer, effectively creating a two-tier database. Only the users with the right keys for unscrambling the steganographic layer can extract the second tier. Public-key solutions can create functions that are one-way except for people who possess the secret key for unlocking the data. All of these solutions are explored in greater detail in Translucent Databases (see References).

The digital fortress is the traditional approach to storing sensitive information. The best database companies are filled with talented individuals who wrap adamantine layers around the data with access levels, security models, and plenty of different roles that allow the right queries while blocking the suspicious ones. These “unbreakable” systems have their place, but they are complicated, expensive, and fragile. A better solution is often keeping no real data at all and letting the user control the fate of his own information.

References

Anonymous. “Betraying One’s Passengers,” New York Times. September 23, 2003.

Brin, David. The Transparent Society. New York: Basic Books, 1998.

Butterfield, Fox. “Theft of Painkiller Reflects Its Popularity on the Street,” New York Times. July 7, 2001.

Etzioni, Amitai. “Privacy Ain’t Dead Yet,” New York Times. April 6, 1999.

Harris, Gardiner. “Tiny Antennas to Keep Tabs on U.S. Drugs,” New York Times. November 15, 2004.

Meier, Barry. “3 Executives Spared Prison in OxyContin Case,” New York Times. July 21, 2007.

Pfanner, Eric. “Data Leak in Britain Affects 25 Million,” New York Times. November 22, 2007.

Shenon, Philip with John Schwartz. “JetBlue Target Of Inquiries By 2 Agencies,” New York Times. September 23, 2003.

Schneier, Bruce. “Why ‘Anonymous’ Data Sometimes Isn’t,” Wired. December 13, 2007. http://www.wired.com/politics/security/commentary/securitymatters/2007/12/securitymatters_1213.

Wald, Matthew L. “U.S. Calls Release of JetBlue Data Improper,” New York Times. February 21, 2004.

Wald, Matthew L. “Randi A.J. v. Long Is. Surgi-Center, No. 2005-04976.” N.Y. App. Div, September 25, 2007.

Wayner, Peter. Translucent Databases. Flyzone, 2003. http://www.wayner.org/books/td/.

Zeller, Tom Jr. “U.S. Settles With Company on Leak of Consumers’ Data,” New York Times. January 27, 2006.

^[106]Users should pay close attention to research into how to break hash functions like SHA256, because older, similar functions such as MD5 and SHA-1 have proved to be far from perfect. It makes sense, for instance, to use the HMAC construction to add more security to the average hash function like SHA256.

^[107]At 6 bits per Base64 character, 256 bits work out to 42 and 2/3 characters, so the encoded digest is 43 characters long, or 44 with the standard padding character. Space can be saved by keeping only the first n characters of BASE64(SHA256(name)). A shorter string that is still a reasonable length is likely to still be unique, which is the goal.

^[108]Pharmacies have recordkeeping requirements that force them to track sales to individuals, a requirement that might limit the use of translucency. But it still might make sense to keep a translucent database at the local store while locking up the real names in a hardened vault at headquarters.

^[109]Errors caused by misspelling or forgetting parts of a key can be limited with various techniques, such as removing whitespace, converting everything to uppercase, or using a SOUNDEX code. Other solutions are described in Translucent Databases (see References).

^[110]There are a number of partial solutions described in Translucent Databases that can minimize these dangers and even eliminate them in some cases (see References).
