Chapter 2

Identification, Deidentification, and Reidentification

Many errors, of a truth, consist merely in the application of the wrong names of things.

Baruch Spinoza

Background

Data identification is certainly the most underappreciated and least understood Big Data issue. Measurements, annotations, properties, and classes of information have no informational meaning unless they are attached to an identifier that distinguishes one data object from all other data objects and that links together all of the information that has been or will be associated with the identified data object (see Glossary item, Annotation). The method of identification and the selection of objects and classes to be identified relate fundamentally to the organizational model of the Big Data resource. If data identification is ignored or implemented improperly, the Big Data resource cannot succeed.

This chapter will describe, in some detail, the available methods for data identification and the minimal properties of identified information (including uniqueness, exclusivity, completeness, authenticity, and harmonization). The dire consequences of inadequate identification will be discussed, along with real-world examples. Once data objects have been properly identified, they can be deidentified and, under some circumstances, reidentified (see Glossary item, Deidentification, Reidentification). The ability to deidentify data objects confers enormous advantages when issues of confidentiality, privacy, and intellectual property emerge (see Glossary items, Privacy and confidentiality, Intellectual property). The ability to reidentify deidentified data objects is required for error detection, error correction, and data validation.

A good information system is, at its heart, an identification system: a way of naming data objects so that they can be retrieved by their name and a way of distinguishing each object from every other object in the system. If data managers properly identified their data and did absolutely nothing else, they would be producing a collection of data objects with more informational value than many existing Big Data resources. Imagine this scenario. You show up for treatment in the hospital where you were born and in which you have been seen for various ailments over the past three decades. One of the following events transpires.

1. The hospital has a medical record of someone with your name, but it’s not you. After much effort, they find another medical record with your name. Once again, it’s the wrong person. After much time and effort, you are told that the hospital cannot produce your medical record. They deny losing your record, admitting only that they cannot retrieve the record from the information system.

2. The hospital has a medical record of someone with your name, but it’s not you. Neither you nor your doctor is aware of the identity error. The doctor provides inappropriate treatment based on information that is accurate for someone else, but not for you. As a result of this error, you die, but the hospital information system survives the ordeal, with no apparent injury.

3. The hospital has your medical record. After a few minutes with your doctor, it becomes obvious to both of you that the record is missing a great deal of information, relating to tests and procedures done recently and in the distant past. Nobody can find these missing records. You ask your doctor whether your records may have been inserted into the electronic chart of another patient or of multiple patients. The doctor shrugs his or her shoulders.

4. The hospital has your medical record, but after a few moments, it becomes obvious that the record includes a variety of tests done on patients other than yourself. Some of the other patients have your name. Others have a different name. Nobody seems to understand how these records pertaining to other patients got into your chart.

5. You are informed that the hospital has changed its hospital information system and your old electronic records are no longer available. You are asked to answer a long list of questions concerning your medical history. Your answers will be added to your new medical chart. Many of the questions refer to long-forgotten events.

6. You are told that your electronic record was transferred to the hospital information system of a large multihospital system. This occurred as a consequence of a complex acquisition and merger. The hospital in which you are seeking care has not yet been deployed within the information structure of the multihospital system and has no access to your records. You are assured that your records have not been lost and will be accessible within the decade.

7. You arrive at your hospital to find that the once-proud edifice has been demolished and replaced by a shopping mall. Your electronic records are gone forever, but you console yourself with the knowledge that J.C. Penney has a 40% off sale on jewelry.

Hospital information systems are prototypical Big Data resources. Like most Big Data resources, records need to be unique, accessible, complete, uncontaminated (with records of other individuals), permanent, and confidential. This cannot be accomplished without an adequate identifier system.

Features of an Identifier System

An object identifier is an alphanumeric string associated with the object. For many Big Data resources, the objects that are of greatest concern to data managers are human beings. One reason for this is that many Big Data resources are built to store and retrieve information about individual humans. Another reason for the data manager’s preoccupation with human identifiers relates to the paramount importance of establishing human identity, with absolute certainty (e.g., banking transactions, blood transfusions). We will see, in our discussion of immutability (see Chapter 6), that there are compelling reasons for storing all information contained in Big Data resources within data objects and providing an identifier for each data object (see Glossary items, Immutability, Mutability). Consequently, one of the most important tasks for data managers is the creation of a dependable identifier system.23

The properties of a good identifier system are the following:

1. Completeness. Every unique object in the Big Data resource must be assigned an identifier.

2. Uniqueness. Each identifier is a unique sequence.

3. Exclusivity. Each identifier is assigned to a unique object, and to no other object.

4. Authenticity. The objects that receive identification must be verified as the objects that they are intended to be. For example, if a young man walks into a bank and claims to be Richie Rich, then the bank must ensure that he is, in fact, who he says he is.

5. Aggregation. The Big Data resource must have a mechanism to aggregate all of the data that is properly associated with the identifier (i.e., to bundle all of the data that belong to the uniquely identified object). In the case of a bank, this might mean collecting all of the transactions associated with an account. In a hospital, this might mean collecting all of the data associated with a patient’s identifier: clinic visit reports, medication transactions, surgical procedures, and laboratory results. If the identifier system performs properly, aggregation methods will always collect all of the data associated with an object and will never collect any data that is associated with a different object.

6. Permanence. The identifiers and the associated data must be permanent. In the case of a hospital system, when the patient returns to the hospital after 30 years of absence, the record system must be able to access his identifier and aggregate his data. When a patient dies, the patient’s identifier must not perish.

7. Reconciliation. There should be a mechanism whereby the data associated with a unique, identified object in one Big Data resource can be merged with the data held in another resource, for the same unique object. This process, which requires comparison, authentication, and merging, is known as reconciliation. An example of reconciliation is found in health record portability. When a patient visits a hospital, it may be necessary to transfer her electronic medical record from another hospital (see Glossary item, Electronic medical record). Both hospitals need a way of confirming the identity of the patient and combining the records.

8. Immutability. In addition to being permanent (i.e., never destroyed or lost), the identifier must never change (see Chapter 6).24 In the event that two Big Data resources are merged, or that legacy data is merged into a Big Data resource, or that individual data objects from two different Big Data resources are merged, a single data object will be assigned two identifiers—one from each of the merging systems. In this case, the identifiers must be preserved as they are, without modification. The merged data object must be provided with annotative information specifying the origin of each identifier (i.e., clarifying which identifier came from which Big Data resource).

9. Security. The identifier system is vulnerable to malicious attack. A Big Data resource with an identifier system can be irreversibly corrupted if the identifiers are modified. In the case of human-based identifier systems, stolen identifiers can be used for a variety of malicious activities directed against the individuals whose records are included in the resource.

10. Documentation and quality assurance. A system should be in place to find and correct errors in the patient identifier system. Protocols must be written for establishing the identifier system, for assigning identifiers, for protecting the system, and for monitoring the system. Every problem and every corrective action taken must be documented and reviewed. Review procedures should determine whether the errors were corrected effectively, and measures should be taken to continually improve the identifier system. All procedures, all actions taken, and all modifications of the system should be thoroughly documented. This is a big job.

11. Centrality. Whether the information system belongs to a savings bank, an airline, a prison system, or a hospital, identifiers play a central role. You can think of information systems as a scaffold of identifiers to which data is attached. For example, in the case of a hospital information system, the patient identifier is the central key to which every transaction for the patient is attached.

12. Autonomy. An identifier system has a life of its own, independent of the data contained in the Big Data resource. The identifier system can persist, documenting and organizing existing and future data objects even if all of the data in the Big Data resource were to suddenly vanish (i.e., when all of the data contained in all of the data objects are deleted).
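
A few of the properties listed above (uniqueness, exclusivity, and aggregation) can be made concrete with a toy registry. The following is a minimal Python sketch under stated assumptions, not a production design: the class name IdentifierRegistry, its methods, and the notion of an "object_key" (standing in for whatever authenticated handle a real resource would use) are all hypothetical.

```python
import uuid


class IdentifierRegistry:
    """Hypothetical in-memory registry, for illustration only."""

    def __init__(self):
        self._object_to_id = {}   # object key -> identifier
        self._id_to_object = {}   # identifier -> object key
        self._data = {}           # identifier -> list of attached data

    def register(self, object_key):
        """Return the object's identifier, minting one only if none exists."""
        if object_key in self._object_to_id:
            # One object, one identifier: never mint a second one.
            return self._object_to_id[object_key]
        identifier = str(uuid.uuid4())
        while identifier in self._id_to_object:
            # Uniqueness: never reuse an identifier already assigned.
            identifier = str(uuid.uuid4())
        self._object_to_id[object_key] = identifier
        self._id_to_object[identifier] = object_key
        self._data[identifier] = []
        return identifier

    def attach(self, identifier, item):
        """Attach a piece of data to an already registered identifier."""
        if identifier not in self._id_to_object:
            raise KeyError(f"unknown identifier: {identifier}")
        self._data[identifier].append(item)

    def aggregate(self, identifier):
        """Collect everything attached to this identifier, and nothing else."""
        return list(self._data.get(identifier, []))


registry = IdentifierRegistry()
patient = registry.register("patient-A")
registry.attach(patient, "clinic visit report")
registry.attach(patient, "laboratory result")
print(registry.aggregate(patient))
```

The sketch deliberately says nothing about authenticity, permanence, or security; those properties depend on procedures and storage that a dozen lines of code cannot supply.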

Registered Unique Object Identifiers

Uniqueness is one of those concepts that everyone thoroughly understands; explanations would seem unnecessary. Actually, uniqueness in computational sciences is a somewhat different concept than uniqueness in the natural world. In computational sciences, uniqueness is achieved when a data object is associated with a unique identifier (i.e., a character string that has not been assigned to any other data object). Most of us, when we think of a data object, are probably thinking of a data record, which may consist of the name of a person followed by a list of feature values (height, weight, age, etc.) or a sample of blood followed by laboratory values (e.g., white blood cell count, red cell count, hematocrit, etc.). For computer scientists, a data object is a holder for data values (the so-called encapsulated data), descriptors of the data, and properties of the holder (i.e., the class of objects to which the instance belongs). Uniqueness is achieved when the data object is permanently bound to its own identifier sequence.

Unique objects have three properties:

1. A unique object can be distinguished from all other unique objects.

2. A unique object cannot be distinguished from itself.

3. Uniqueness may apply to collections of objects (i.e., a class of instances can be unique).

Registries are trusted services that provide unique identifiers to objects. The idea is that everyone using the object will use the identifier provided by the central registry. Unique object registries serve a very important purpose, particularly when the object identifiers are persistent. It makes sense to have a central authority for Web addresses, library acquisitions, and journal abstracts. Some organizations that issue identifiers are listed here:

DOI, Digital object identifier

PMID, PubMed identification number

LSID (Life Science Identifier)

HL7 OID (Health Level 7 Object Identifier)

DICOM (Digital Imaging and Communications in Medicine) identifiers

ISSN (International Standard Serial Numbers)

Social Security Numbers (for U.S. population)

NPI, National Provider Identifier, for physicians

Clinical Trials Protocol Registration System

Office of Human Research Protections Federal Wide Assurance number

Data Universal Numbering System (DUNS) number

International Geo Sample Number

DNS, Domain Name System

In some cases, the registry does not provide the full identifier for data objects. The registry may provide a general identifier sequence that will apply to every data object in the resource. Individual objects within the resource are provided with a registry number and a suffix sequence, appended locally. Life Science Identifiers serve as a typical example of a registered identifier. Every LSID is composed of the following five parts: Network Identifier, root DNS name of the issuing authority, name chosen by the issuing authority, a unique object identifier assigned locally, and an optional revision identifier for versioning information.

In the issued LSID identifier, the parts are separated by a colon, as shown: urn:lsid:pdb.org:1AFT:1. This identifies the first version of the 1AFT protein in the Protein Data Bank. Here are a few LSIDs:

urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434

This identifies a PubMed citation.

urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2

This refers to the second version of an entry in GenBank.
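
The colon-separated layout of an LSID lends itself to simple parsing. Here is a minimal Python sketch that splits an LSID into the five parts described above. The names LSID and parse_lsid are hypothetical, and the sketch assumes that the "urn:lsid" prefix is the network identifier and that the remaining parts arrive in the order listed in the text; it is not an implementation of the full LSID specification.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    network_id: str            # always "urn:lsid" in this sketch
    authority: str             # root DNS name of the issuing authority
    namespace: str             # name chosen by the issuing authority
    object_id: str             # identifier assigned locally
    revision: Optional[str]    # optional versioning information


def parse_lsid(text: str) -> LSID:
    """Split an LSID into its parts; raise ValueError if the shape is wrong."""
    parts = text.split(":")
    has_prefix = [p.lower() for p in parts[:2]] == ["urn", "lsid"]
    if not has_prefix or len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    revision = parts[5] if len(parts) == 6 else None
    return LSID("urn:lsid", authority, namespace, object_id, revision)


print(parse_lsid("urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434"))
print(parse_lsid("urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2"))
```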

An object identifier (OID) is a hierarchy of identifier prefixes. Successive numbers in the prefix identify the descending order of the hierarchy. Here is an example of an OID from HL7, an organization that deals with health data interchanges: 1.3.6.1.4.1.250.

Each node is separated from the successor by a dot. A sequence of finer registration details leads to the institutional code (the final node). In this case, the institution identified by the HL7 OID happens to be the University of Michigan.
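
The hierarchy encoded in an OID can be made visible by listing its successive prefixes, from the root node down to the institutional code. The helper below is a small Python sketch; the function name oid_prefixes is an assumption made for illustration, and the code checks only the dotted-decimal shape, not whether any node is actually registered.

```python
def oid_prefixes(oid):
    """Return every prefix of a dotted OID, from the root node downward."""
    nodes = oid.split(".")
    if not all(node.isdigit() for node in nodes):
        raise ValueError(f"not a dotted-decimal OID: {oid!r}")
    return [".".join(nodes[:depth]) for depth in range(1, len(nodes) + 1)]


# Each printed line is one step deeper in the registration hierarchy.
for prefix in oid_prefixes("1.3.6.1.4.1.250"):
    print(prefix)
```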

The final step in creating an OID for a data object involves placing a unique identifier number at the end of the registered prefix. OID organizations leave the final step to the institutional data managers. The problem with this approach is that the final within-institution data object identifier is sometimes prepared thoughtlessly, corrupting the OID system.25

Here is an example. Hospitals use an OID system for identifying images—part of the DICOM (Digital Imaging and Communications in Medicine) image standard. There is a prefix consisting of a permanent, registered code for the institution and the department and a suffix consisting of a number generated for an image, as it is created.

A hospital may assign consecutive numbers to its images, appending these numbers to an OID that is unique for the institution and the department within the institution. For example, the first image created with a computed tomography (CT) scanner might be assigned an identifier consisting of the OID (the assigned code for institution and department) followed by a separator such as a hyphen, followed by “1”.

In a worst-case scenario, different instruments may assign consecutive numbers to images, independently of one another. This means that the CT scanner in room A may be creating the same identifier (OID + image number) as the CT scanner in room B for images on different patients. This problem could be remedied by constraining each CT scanner to avoid using numbers assigned by any other CT scanner. This remedy can be defeated if there is a glitch anywhere in the system that accounts for image assignments (e.g., if the counters are reset, broken, replaced, or simply ignored).

When image counting is done properly and the scanners are constrained to assign unique numbers (not previously assigned by other scanners in the same institution), each image may indeed have a unique identifier (OID prefix + image number suffix). Nonetheless, the use of consecutive numbers for images will create havoc, over time. Problems arise when the image service is assigned to another department in the institution, when departments merge, or when institutions merge. Each of these shifts produces a change in the OID (the institutional and departmental prefix) assigned to the identifier. If a consecutive numbering system is used, then you can expect to create duplicate identifiers if institutional prefixes are replaced after the merge. The old records in both of the merging institutions will be assigned the same prefix and will contain replicate (consecutively numbered) suffixes (e.g., image 1, image 2, etc.).
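
The merger problem is easy to reproduce. In the following Python sketch, two institutions number their images consecutively under different prefixes, so their identifiers never clash. After a merger replaces both prefixes with a single new one, every suffix is duplicated. The prefixes shown are made up for the example, and the function names are hypothetical.

```python
def image_ids(prefix, count):
    """Identifiers of the form <OID prefix>-<consecutive number>."""
    return [f"{prefix}-{n}" for n in range(1, count + 1)]


def reprefix(identifiers, new_prefix):
    """Swap the prefix, keeping each image's consecutive-number suffix."""
    return [f"{new_prefix}-{ident.rsplit('-', 1)[1]}" for ident in identifiers]


# Made-up OID prefixes for two institutions and for the merged institution.
hospital_a = image_ids("1.2.3.100", 3)
hospital_b = image_ids("1.2.3.200", 3)
overlap_before = set(hospital_a) & set(hospital_b)

merged = reprefix(hospital_a, "1.2.3.300") + reprefix(hospital_b, "1.2.3.300")
duplicates_after = len(merged) - len(set(merged))

print(len(overlap_before))   # 0: no shared identifiers before the merge
print(duplicates_after)      # 3: every suffix now occurs twice
```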

Yet another problem may occur if one unique object is provided with multiple different unique identifiers. A software application may be designed to ignore any previously assigned unique identifier, and to generate its own identifier, using its own assignment method. Doing so provides software vendors with a strategy that insulates them from bad identifiers created by their competitor’s software and potentially nails the customer to their own software (and identifiers).

In the end, the OID systems provide a good set of identifiers for the institution, but the data objects created within the institution need to have their own identifier systems. Here is the HL7 statement on replicate OIDs: “Though HL7 shall exercise diligence before assigning an OID in the HL7 branch to third parties, given the lack of a global OID registry mechanism, one cannot make absolutely certain that there is no preexisting OID assignment for such third-party entity.” 26

There are occasions when it is impractical to obtain unique identifiers from a central registry. This is certainly the case for ephemeral transaction identifiers such as the tracking codes that follow a blood sample accessioned into a clinical laboratory.

The Network Working Group has issued a protocol for a Universally Unique IDentifier (UUID, also known as GUID, see Glossary item, UUID) that does not require a central registrar. A UUID is 128 bits long and reserves 60 bits for a string computed directly from a computer time stamp.27 UUIDs, if implemented properly, should provide uniqueness across space and time. UUIDs were originally used in the Apollo Network Computing System and were later adopted in the Open Software Foundation’s Distributed Computing Environment. Many computer languages (including Perl, Python, and Ruby) have built-in routines for generating UUIDs.19
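
As an illustration, Python's standard uuid module can generate UUIDs without consulting any registrar. The short sketch below creates a time-based (version 1) UUID and a random (version 4) UUID and inspects the 128-bit length and the 60-bit time stamp field mentioned above. The choice of versions is ours; the text does not prescribe one.

```python
import uuid

time_based = uuid.uuid1()    # version 1: built from a time stamp and a node ID
random_based = uuid.uuid4()  # version 4: built from random bits

print(time_based, "version", time_based.version)
print(random_based, "version", random_based.version)

# A UUID occupies 16 bytes, that is, 128 bits.
print(len(time_based.bytes) * 8)

# For version 1, the time stamp field fits in 60 bits.
print(time_based.time.bit_length() <= 60)
```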

There are enormous advantages to an identifier system that uses a long random number sequence, coupled to a time stamp. Suppose your system consists of a random sequence of 20 characters followed by a time stamp. For a time stamp, we will use the so-called Unix epoch time. This is the number of seconds that have elapsed since midnight, January 1, 1970. An example of an epoch time occurring on July 21, 2012, is 1342883791.

A unique identifier could be produced using a random character generator and an epoch time measurement, both of which are easily available routines built into most programming languages. Here is an example of such an identifier: mje03jdf8ctsSdkTEWfk-1342883791.
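
A minimal Python sketch of such a generator follows. It assumes an alphabet of letters and digits only (62 characters, matching the look of the example identifier rather than the full keyboard) and a hyphen as the separator; the function name make_identifier is hypothetical.

```python
import secrets
import string
import time

# Assumption for this sketch: letters and digits only (62 characters).
ALPHABET = string.ascii_letters + string.digits


def make_identifier(length=20):
    """Random prefix of the given length, a hyphen, then Unix epoch seconds."""
    prefix = "".join(secrets.choice(ALPHABET) for _ in range(length))
    return f"{prefix}-{int(time.time())}"


print(make_identifier())
```

The secrets module is used here, rather than random, so that the prefixes cannot be predicted from earlier output.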

The characters in the random sequence can be uppercase or lowercase letters, numerals, or any other standard keyboard characters. These are drawn from the 128 characters of the so-called seven-bit ASCII set (see Glossary item, ASCII). If each of the 20 positions is chosen independently and uniformly from 128 characters, the chance that two 20-character random sequences are identical is 128 to the -20 power (roughly 7 chances in 10 to the 43rd power). When we attach a time stamp to the random sequence, we place the added burden that the two sequences have the same random number prefix and that the two identifiers were created at the same moment in time (see Glossary item, Time stamp).
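
The arithmetic can be checked in a few lines of Python. The sketch assumes that every position is drawn independently and uniformly from the alphabet. For comparison, it repeats the calculation for the 95 printable ASCII characters, since the remaining 33 of the 128 seven-bit codes are control characters that cannot be typed into an identifier.

```python
from fractions import Fraction


def match_probability(alphabet_size, length):
    """Exact chance that two independent random sequences are identical."""
    return Fraction(1, alphabet_size ** length)


p_full_ascii = match_probability(128, 20)
p_printable = match_probability(95, 20)

# 128 is 2 to the 7th power, so 128 to the 20th power is 2 to the 140th.
print(p_full_ascii == Fraction(1, 2 ** 140))
print(float(p_full_ascii))
print(float(p_printable))
```

Either way, the chance of a matching prefix is vanishingly small, and the time stamp shrinks it further.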

A system that assigns identifiers using a long, randomly selected sequence followed by a time-stamp sequence can be used without worrying that two different objects will be assigned the same identifier.

Hypothetically, though, suppose you are working in a Big Data resource that creates trillions of identifiers every second. In all those trillions of data objects, might there not be a duplication of identifiers that might someday occur? Probably not, but if that is a concern for the data manager, there is a solution. Let’s assume that there are Big Data resources that are capable of assigning trillions of identifiers every single second that the resource operates. For each second that the resource operates, the data manager keeps a list of the new identifiers that are being created. As each new identifier is created, the list is checked to ensure that the new identifier has not already been assigned. In the nearly impossible circumstance that a duplicate exists, the system halts production for a fraction of a second, at which time a new epoch time sequence has been established and the identifier conflict resolves itself.
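
The bookkeeping described above might be sketched as follows in Python. The class name IdentifierMint is hypothetical, and the sketch takes one liberty with the text: on a collision it pauses briefly and tries again, so the conflict is resolved either by a fresh random prefix or by the arrival of a new epoch second, whichever comes first. The list of issued identifiers is discarded whenever the clock ticks over.

```python
import secrets
import string
import time

ALPHABET = string.ascii_letters + string.digits  # assumed alphabet


class IdentifierMint:
    """Hypothetical generator that remembers identifiers issued this second."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._second = None
        self._issued = set()

    def new_identifier(self):
        while True:
            now = int(self._clock())
            if now != self._second:
                # A new second has begun: older identifiers cannot collide.
                self._second = now
                self._issued = set()
            prefix = "".join(secrets.choice(ALPHABET) for _ in range(20))
            candidate = f"{prefix}-{now}"
            if candidate not in self._issued:
                self._issued.add(candidate)
                return candidate
            # Nearly impossible in practice: pause briefly, then try again.
            time.sleep(0.01)


mint = IdentifierMint()
batch = [mint.new_identifier() for _ in range(1000)]
print(len(batch) == len(set(batch)))
```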

Suppose two Big Data resources are being merged. What do you do if there are replications of assigned identifiers in the two resources? Again, the chances of identifier collisions are so remote that it would be reasonable to ignore the possibility. The faithfully obsessive data manager may select to compare identifiers prior to the merge. In the exceedingly unlikely event that there is a match, the replicate identifiers would require some sort of annotation describing the situation.

It is technically feasible to create an identifier system that guarantees uniqueness (i.e., no replicate identifiers in the system). Readers should keep in mind that uniqueness is just 1 of 12 design requirements for a good identifier system.

Really Bad Identifier Methods

I always wanted to be somebody, but now I realize I should have been more specific.

Lily Tomlin

Names are poor identifiers. Aside from the obvious fact that they are not unique (e.g., surnames such as Smith, Zhang, Garcia, Lo, and given names such as John and Susan), a single name can have many different representations. The sources for these variations are many. Here is a partial listing.

1. Modifiers to the surname (du Bois, DuBois, Du Bois, Dubois, Laplace, La Place, van de Wilde, Van DeWilde, etc.).

2. Accents that may or may not be transcribed onto records (e.g., acute accent, cedilla, diacritical comma, palatalized mark, hyphen, diphthong, umlaut, circumflex, and a host of obscure markings).

3. Special typographic characters (the combined “æ”).

4. Multiple “middle names” for an individual that may not be transcribed onto records, for example, individuals who replace their first name with their middle name for common usage while retaining the first name for legal documents.

5. Latinized and other versions of a single name (Carl Linnaeus, Carl von Linne, Carolus Linnaeus, Carolus a Linne).

6. Hyphenated names that are confused with first and middle names (e.g., Jean-Jacques Rousseau or Jean Jacques Rousseau; Louis-Victor-Pierre-Raymond, 7th duc de Broglie, or Louis Victor Pierre Raymond Seventh duc deBroglie).

7. Cultural variations in name order that are mistakenly rearranged when transcribed onto records. Many cultures do not adhere to the western European name order (e.g., given name, middle name, surname).

8. Name changes, through legal action, aliasing, pseudonymous posing, or insouciant whim.

Aside from the obvious consequences of using names as record identifiers (e.g., corrupt database records, impossible merges between data resources, impossibility of reconciling legacy records), there are nonobvious consequences that are worth considering. Take, for example, accented characters in names. These word decorations wreak havoc on orthography and on alphabetization. Where do you put a name that contains an umlauted character? Do you pretend the umlaut isn’t there and put it in alphabetic order with the plain characters? Do you order based on the ASCII-numeric assignment for the character, in which the umlauted letter may appear nowhere near the plain-lettered words in an alphabetized list? The same problem applies to every special character.

A similar problem exists for surnames with modifiers. Do you alphabetize de Broglie under “D,” under “d,” or “B?” If you choose B, then what do you do with the concatenated form of the name, “deBroglie?”

When it comes down to it, it is impossible to satisfactorily alphabetize a list of names. This means that searches based on proximity in the alphabet will always be prone to errors.
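
The ordering problem is easy to demonstrate. The Python sketch below sorts a short, made-up list of names in two ways: by raw character code, and by a key that strips accents and ignores case. Neither ordering is "correct"; the point is that the two disagree, and that variants of one surname end up scattered. The helper strip_accents is a hypothetical name, and the umlauted u is written as an escape sequence so that the example does not depend on how this file is encoded.

```python
import unicodedata

names = ["Muller", "M\u00fcller", "Mzuzu",
         "de Broglie", "deBroglie", "De Broglie", "Broglie"]


def strip_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Ordering by character code: the umlauted name lands after "Mzuzu",
# and the lowercase "de" forms are separated from "De Broglie".
by_code = sorted(names)

# Ordering that pretends the umlaut and the capitals are not there.
by_folded = sorted(names, key=lambda n: strip_accents(n).casefold())

print(by_code)
print(by_folded)
```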

I have had numerous conversations with intelligent professionals who are tasked with the responsibility of assigning identifiers to individuals. At some point in every conversation, they will find it necessary to explain that although an individual’s name cannot serve as an identifier, the combination of name plus date of birth provides accurate identification in almost every instance. They sometimes get carried away, insisting that the combination of name plus date of birth plus social security number provides perfect identification, as no two people will share all three identifiers: same name, same date of birth, and same social security number. This argument rises to the height of folly and completely misses the point of identification. As we will see, it is relatively easy to assign unique identifiers to individuals and to any data object, for that matter. For managers of Big Data resources, the larger problem is ensuring that each unique individual has only one identifier (i.e., denying one object multiple identifiers).

Let us see what happens when we create identifiers from the name plus birthdate. We will examine name + birthdate + social security number later in this section.

Consider this example. Mary Jessica Meagher, born June 7, 1912, decided to open a separate bank account in each of 10 different banks. Some of the banks had application forms, which she filled out accurately. Other banks registered her account through a teller, who asked her a series of questions and immediately transcribed her answers directly into a computer terminal. Ms. Meagher could not see the computer screen and could not review the entries for accuracy.

Here are the entries for her name plus date of birth:

1. Marie Jessica Meagher, June 7, 1912 (the teller mistook Marie for Mary).

2. Mary J. Meagher, June 7, 1912 (the form requested a middle initial, not name).

3. Mary Jessica Magher, June 7, 1912 (the teller misspelled the surname).

4. Mary Jessica Meagher, Jan 7, 1912 (the birth month was constrained, on the form, to three letters; Jun, entered on the form, was transcribed as Jan).

5. Mary Jessica Meagher, 6/7/12 (the form provided spaces for the final two digits of the birth year. Through the miracle of bank registration, Mary, born in 1912, was reborn a century later).

6. Mary Jessica Meagher, 7/6/2012 (the form asked for day, month, year, in that order, as is common in Europe).

7. Mary Jessica Meagher, June 1, 1912 (on the form, a 7 was mistaken for a 1).

8. Mary Jessie Meagher, June 7, 1912 (Mary, as a child, was called by the informal form of her middle name, which she provided to the teller).

9. Mary Jesse Meagher, June 7, 1912 (Mary, as a child, was called by the informal form of her middle name, which she provided to the teller and which the teller entered as the male variant of the name).

10. Marie Jesse Mahrer, 1/1/12 (an underzealous clerk combined all of the mistakes on the form and the computer transcript and added a new orthographic variant of the surname).

For each of these 10 examples, a unique individual (Mary Jessica Meagher) would be assigned a different identifier at each of 10 banks. Had Mary reregistered at one bank, 10 times, the results may have been the same.
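
To see the consequence in concrete terms, the following Python sketch builds a name-plus-birthdate identifier from each of the 10 entries and counts the distinct results. The entries are transcribed from the list above (entry 9 uses "Jesse," the male variant that its parenthetical note describes); joining the two fields with a vertical bar is an arbitrary choice made for the example.

```python
entries = [
    ("Marie Jessica Meagher", "June 7, 1912"),
    ("Mary J. Meagher", "June 7, 1912"),
    ("Mary Jessica Magher", "June 7, 1912"),
    ("Mary Jessica Meagher", "Jan 7, 1912"),
    ("Mary Jessica Meagher", "6/7/12"),
    ("Mary Jessica Meagher", "7/6/2012"),
    ("Mary Jessica Meagher", "June 1, 1912"),
    ("Mary Jessie Meagher", "June 7, 1912"),
    ("Mary Jesse Meagher", "June 7, 1912"),
    ("Marie Jesse Mahrer", "1/1/12"),
]

# Build one identifier per bank by joining name and birthdate.
identifiers = {f"{name}|{birthdate}" for name, birthdate in entries}
accurate = "Mary Jessica Meagher|June 7, 1912"

print(len(identifiers))          # 10 distinct identifiers for one person
print(accurate in identifiers)   # False: no bank recorded her accurately
```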

If you toss the social security number into the mix (name + birthdate + social security number), the problem is compounded. The social security number for an individual is anything but unique. Few of us carry our original social security cards. Our number changes due to false memory (“You mean I’ve been wrong all these years?”), data entry errors (“Character trasnpositoins, I mean transpositions, are very common”), intention to deceive (“I don’t want to give those people my real number”), desperation (“I don’t have a number, so I’ll invent one”), or impersonation (“I don’t have health insurance, so I’ll use my friend’s social security number”). Efforts to reduce errors by requiring patients to produce their social security cards have not been entirely beneficial.

Beginning in the late 1930s, the E. H. Ferree Company, a manufacturer of wallets, promoted their product’s card pocket by including a sample social security card with each wallet sold. The display card had the social security number of one of their employees. Many people found it convenient to use the card as their own social security number. Over time, the wallet display number was claimed by over 40,000 people. Today, few institutions require individuals to prove their identity by showing their original social security card. Doing so puts an unreasonable burden on the honest patient (who does not happen to carry his/her card) and provides an advantage to criminals (who can easily forge a card).

Entities that compel individuals to provide a social security number have dubious legal standing. The social security number was originally intended as a device for validating a person’s standing in the social security system. More recently, the purpose of the social security number has been expanded to track taxable transactions (i.e., bank accounts, salaries). Other uses of the social security number are not protected by law. The Social Security Act (Section 208 of Title 42 U.S. Code 408) prohibits most entities from compelling anyone to divulge his/her social security number.

Considering the unreliability of social security numbers in most transactional settings, and considering the tenuous legitimacy of requiring individuals to divulge their social security numbers, a prudently designed medical identifier system will limit its reliance on these numbers. Combining the social security number with name and date of birth will virtually guarantee that the identifier system violates the strict one-to-a-customer rule.

Embedding Information in an Identifier: Not Recommended

Most identifiers are not purely random numbers—they usually contain some embedded information that can be interpreted by anyone familiar with the identification system. For example, they may embed the first three letters of the individual’s family name in the identifier. Likewise, the last two digits of the birth year are commonly embedded in many types of identifiers. Such information is usually included as a crude “honesty” check by people “in the know.” For instance, the nine digits of a social security number are divided into an area code (first three digits), a group number (the next two digits), followed by a serial number (last four digits). People with expertise in the social security numbering system can pry considerable information from a social security number and can determine whether certain numbers are bogus based on the presence of excluded subsequences.

Seemingly inconsequential information included in an identifier can sometimes be used to discover confidential information about individuals. Here is an example. Suppose every client transaction in a retail store is accessioned under a unique number, consisting of the year of the accession, followed by the consecutive count of accessions, beginning with the first accession of the new year. For example, accession 2010-3518582 might represent the 3,518,582nd purchase transaction occurring in 2010. Because each number is unique, and because the number itself says nothing about the purchase, it may be assumed that inspection of the accession number would reveal nothing about the transaction.

Actually, the accession number tells you quite a lot. The prefix (2010) tells you the year of the purchase. If the accession number had been 2010-0000001, then you could safely say that the accession represented the first item sold on the first day of business in the year 2010. For any subsequent accession number in 2010, simply divide the suffix number (in this case 3,518,582) by the last accession number of the year, multiply by 365 (the number of days in a nonleap year), and you have the approximate day of the year on which the transaction occurred. This day can easily be converted to a calendar date.
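The arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name and the year-end total of 12,000,000 accessions are hypothetical (the text does not give a total), and the estimate assumes that transactions are spread evenly across the year.

```python
from datetime import date, timedelta

def estimate_accession_date(accession, last_accession_of_year):
    """Estimate the calendar date of a 'YYYY-NNNNNNN' accession number.

    Assumes accessions are spread evenly across the year, which is only
    an approximation for any real store or hospital.
    """
    year_text, count_text = accession.split("-")
    year, count = int(year_text), int(count_text)
    days_in_year = (date(year + 1, 1, 1) - date(year, 1, 1)).days
    # Fraction of the year elapsed, scaled to a 1-based day number.
    day_of_year = max(1, round(count / last_accession_of_year * days_in_year))
    return date(year, 1, 1) + timedelta(days=day_of_year - 1)

# Hypothetical year-end total of 12,000,000 accessions.
print(estimate_accession_date("2010-3518582", 12_000_000))
```

The better the attacker's knowledge of the year-end total and of seasonal variation in volume, the narrower the window of candidate dates becomes.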

Unimpressed? Consider this scenario. You know that a prominent member of the President’s staff had visited a Washington, DC, hospital on February 15, 2005, for the purpose of having a liver biopsy. You would like to know the results of that biopsy. You go to a Web site that lists the deidentified pathology records for the hospital for the years 2000 to 2010. Though no personal identifiers are included in these public records, the individual records are sorted by accession numbers. Using the aforementioned strategy, you collect all of the surgical biopsies performed on or about February 15, 2005. Of these biopsies, only three are liver biopsies. Of these three biopsies, only one was performed on a person whose gender and age matched the President’s staff member. The report provides the diagnosis. You managed to discover some very private information without access to any personal identifiers.

The alphanumeric character string composing the identifier should not expose the patient’s identity. For example, a character string consisting of a concatenation of the patient’s name, birthdate, and social security number might serve to uniquely identify an individual, but it could also be used to steal an individual’s identity. The safest identifiers are random character strings containing no information whatsoever.

One-Way Hashes

A one-way hash is an algorithm that transforms a string into another string in such a way that the original string cannot be calculated by operations on the hash value (hence the term “one-way” hash). Popular one-way hash algorithms are MD5 and the Secure Hash Algorithm (SHA). A one-way hash value can be calculated for any character string, including a person’s name, a document, or even another one-way hash. For a given input string, the resultant one-way hash will always be the same.

Here are a few examples of one-way hash outputs performed on a sequential list of input strings, followed by their one-way hash (MD5 algorithm) output.

Jules Berman ⇒ Ri0oaVTIAilwnS8+nvKhfA

“Whatever” ⇒ n2YtKKG6E4MyEZvUKyGWrw

Whatever ⇒ OkXaDVQFYjwkQ+MOC8dpOQ

jules berman ⇒ SlnuYpmyn8VXLsxBWwO57Q

Jules J. Berman ⇒ i74wZ/CsIbxt3goH2aCS+A

Jules J Berman ⇒ yZQfJmAf4dIYO6Bd0qGZ7g

Jules Berman ⇒ Ri0oaVTIAilwnS8+nvKhfA

The one-way hash values are a seemingly random sequence of ASCII characters (the characters available on a standard keyboard). Notice that a small variation among input strings (e.g., exchanging an uppercase for a lowercase character, adding a period or quotation mark) produces a completely different one-way hash output. The first and the last entry (Jules Berman) yield the same one-way hash output (Ri0oaVTIAilwnS8+nvKhfA) because the two input strings are identical. A given string will always yield the same hash value so long as the hashing algorithm is not altered. Each one-way hash has the same length (22 characters for this particular MD5 algorithm), regardless of the length of the input term. A one-way hash output of the same length (22 characters) could have been produced for a string, file, or document of any length.
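Readers who want to experiment can produce hashes of this shape with Python's standard library. The sketch below assumes that the 22-character values are MD5 digests written in base64 with the trailing padding removed; that encoding is an inference from the length and character set of the listed values, not something the text states. The exact outputs also depend on details such as character encoding and trailing newlines, so the values printed here may not match the listing.

```python
import base64
import hashlib

def md5_base64(text):
    """Return the MD5 digest of text as unpadded base64 (22 characters)."""
    digest = hashlib.md5(text.encode("utf-8")).digest()   # 16 raw bytes
    return base64.b64encode(digest).decode("ascii").rstrip("=")

for name in ["Jules Berman", "jules berman", "Jules J. Berman", "Jules Berman"]:
    print(name, "=>", md5_base64(name))
```

Substituting `hashlib.sha256` for `hashlib.md5` gives a longer digest from a newer algorithm; the properties described above (fixed length, repeatability, sensitivity to small changes) hold either way.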

One-way hash values can substitute for identifiers in individual data records. This permits Big Data resources to accrue data, over time, to a specific record, even when the record is deidentified. Here is how it works.28 A record identifier serves as the input value for a one-way hash. The primary identifier for the record is now a one-way hash sequence. The data manager of the resource, looking at such a record, cannot determine the individual associated with the record because the original identifier has been replaced with an unfamiliar sequence.

An identifier will always yield the same one-way hash sequence whenever the hash algorithm is performed. When the patient revisits the hospital at some future time, another transactional record is created, with the same patient identifier. The new record is deidentified, and the original patient identifier for the record is substituted with its one-way hash value. The recorded new deidentified record can now be combined with prior deidentified records that have the same one-way hash value. Using this method, deidentified records produced for an individual can be combined, without knowing the name of the individual whose records are being collected. Methods for record deidentification will be described in a later section in this chapter.
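A minimal sketch of this record-linking idea follows. The record layout, the field names, and the `MRN-` style identifiers are hypothetical, and SHA-256 is used here simply as one available one-way hash.

```python
import hashlib
from collections import defaultdict

def pseudo_id(patient_id):
    """One-way hash of the original identifier (SHA-256, hex encoded)."""
    return hashlib.sha256(patient_id.encode("utf-8")).hexdigest()

def deidentify(record):
    """Return a copy with the identifier hashed and the name dropped."""
    return {"pid": pseudo_id(record["patient_id"]), "finding": record["finding"]}

# Three visits by two patients (hypothetical data).
visits = [
    {"patient_id": "MRN-0001", "name": "A. Patient", "finding": "glucose 85"},
    {"patient_id": "MRN-0002", "name": "B. Patient", "finding": "glucose 140"},
    {"patient_id": "MRN-0001", "name": "A. Patient", "finding": "glucose 90"},
]

# Records accrue under the hash value, never under the name or identifier.
merged = defaultdict(list)
for visit in visits:
    clean = deidentify(visit)
    merged[clean["pid"]].append(clean["finding"])

print(len(merged), "deidentified patients")
```

Because the same identifier always hashes to the same value, the two visits by the first patient land under a single key even though neither the name nor the original identifier is stored.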

Implementation of one-way hashes carries certain practical problems. If anyone happens to have a complete listing of all of the original identifiers, then it would be a simple matter to perform one-way hashes on every listed identifier. This would produce a look-up table that can match deidentified records back to the original identifier, a strategy known as a dictionary attack. For deidentification to work, the original identifier sequences must be kept secret.
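The dictionary attack is easy to demonstrate. The sketch below also shows one common mitigation, a keyed hash (HMAC) whose secret key is held by the data manager. The keyed hash is not a method described in this chapter; it is included only as an assumption about how the attack might be blunted. The identifiers and the key are hypothetical.

```python
import hashlib
import hmac

def plain_hash(identifier):
    """Unkeyed one-way hash; anyone can recompute it."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()

def keyed_hash(identifier, secret_key):
    """HMAC-SHA256; cannot be recomputed without the secret key."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

known_identifiers = ["MRN-0001", "MRN-0002", "MRN-0003"]  # leaked list (hypothetical)
published = plain_hash("MRN-0002")       # value seen in a deidentified record

# Dictionary attack: hash every known identifier and look the value up.
lookup = {plain_hash(i): i for i in known_identifiers}
recovered = lookup.get(published)        # the original identifier

# With a keyed hash, the attacker's table no longer matches.
secret_key = b"kept-by-the-data-manager"  # hypothetical key
protected = keyed_hash("MRN-0002", secret_key)
still_found = lookup.get(protected)      # None

print(recovered, still_found)
```

The keyed variant moves the secret from the identifier list to the key, so the key must then be guarded as carefully as the identifiers themselves.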

Use Case: Hospital Registration

Imagine a hospital that maintains two separate registry systems: one for dermatology cases and another for psychiatry cases. The hospital would like to merge records from the two services under a carefully curated index of patients (the master patient index). Because of sloppy identifier practices, a sample patient has been registered 10 times in the dermatology system and 6 times in the psychiatry system, each time with different addresses, social security numbers, birthdates, and spellings of the name, producing 16 differently registered records. The data manager uses an algorithm designed to reconcile multiple registrations under a single identifier from a master patient index. One of the records from the dermatology service is matched positively against one of the records from the psychiatry service. Performance studies on the algorithm indicate that the two merged records have a 99.8% chance of belonging to the correct patient listed in the master patient index. Though the two merged records correctly point to the same patient, the patient still has 14 unmatched records, corresponding to the remaining 14 separate registrations. The patient’s merged record will not contain his complete set of records. If all of the patient’s records were deidentified, the set of one patient’s multiply registered records would produce a misleading total for the number of patients included in the data collection.

Consider these words, from the Healthcare Information and Management Systems Society,29 “A local system with a poorly maintained or ‘dirty’ master person index will only proliferate and contaminate all of the other systems to which it links.”

Here are just a few examples of the kinds of problems that can result when hospitals misidentify patients:

1. Bill sent to incorrectly identified person.

2. Blood transfusion provided to incorrectly identified person.

3. Correctly identified medication provided to incorrectly identified person.

4. Incorrectly identified dosage of correct medication provided to correctly identified person.

5. Incorrectly identified medication provided to correctly identified person.

6. Incorrectly identified patient treated for another patient’s illness.

7. Report identified with wrong person’s name.

8. Report provided with diagnosis intended for different person.

9. Report sent to incorrectly identified physician.

10. Wrong operation performed on incorrectly identified patient.30

Patient identification in hospitals is further confounded by a natural reluctance among some patients to comply with the registration process. A patient may be highly motivated to provide false information to a registrar, or to acquire several different registration identifiers, or to seek a false registration under another person’s identity (i.e., commit fraud), or to circumvent the registration process entirely. In addition, it is a mistake to believe that honest patients are able to fully comply with the registration process. Language and cultural barriers, poor memory, poor spelling, and a host of errors and misunderstandings can lead to duplicative or otherwise erroneous identifiers. It is the job of the registrar to follow hospital policies that overcome these difficulties.

Registration in hospitals should be conducted by a trained registrar who is well versed in the registration policies established by the institution. Registrars may require patients to provide a full legal name, any prior held names (e.g., maiden name), date of birth, and a government-issued photo ID card (e.g., driver’s license or photo ID card issued by the department of motor vehicles). In my opinion, registration should require a biometric identifier [e.g., fingerprints, retina scan, iris scan, voice recording, photograph, DNA markers31,32 (see Glossary item, CODIS)]. If you accept the premise that hospitals have the responsibility of knowing who it is that they are treating, then obtaining a sample of DNA from every patient, at the time of registration, is reasonable. That DNA can be used to create a unique patient profile from a chosen set of informative loci, a procedure used by the CODIS system developed for law enforcement agencies. The registrar should document any distinguishing and permanent physical features that are plainly visible (e.g., scars, eye color, colobomas, tattoos).

Neonatal and pediatric identifiers pose a special set of problems for registrars. It is quite possible that a patient born in a hospital and provided with an identifier will return, after a long hiatus, as an adult. An adult should not be given a new identifier when a pediatric identifier was issued in the remote past. Every patient who comes for registration should be matched against a database of biometric data that does not change from birth to death (e.g., fingerprints, DNA). Registration is a process that should occur only once per patient. Registration should be conducted by trained individuals who can authenticate the identity of patients.

Deidentification

Deidentification is the process of stripping information from a data record that might link the record to the public name of the record’s subject. In the case of a patient record, this would involve stripping any information from the record that would enable someone to connect the record to the name of the patient. The most obvious item to be removed in the deidentification process is the patient’s name. Other information that should be removed would be the patient’s address (which could be linked to the name), the patient’s date of birth (which narrows down the set of individuals to whom the data record might pertain), and the patient’s social security number. In the United States, patient privacy regulations include a detailed discussion of record deidentification, and this discussion recommends 18 patient record items for exclusion from deidentified records.33

Before going any further, it is important to clarify that deidentification is not achieved by removing an identifier from a data object. In point of fact, nothing good is ever achieved by simply removing an identifier from a data object; doing so simply invalidates the data object (i.e., every data object, identified or deidentified, must have an identifier). As discussed earlier in the chapter, identifiers can be substituted with a one-way hash value, thus preserving the uniqueness of the record. Deidentification involves removing information contained in the data object that reveals something about the publicly known name of the data object. This kind of information is often referred to as identifying information, but it would be much less confusing if we used another term for such data, such as “name-linking information.” The point here is that we do not want to confuse the identifier of a data object with the information contained in a data object that can link the object to its public name.

It may seem counterintuitive, but there is very little difference between an identifier and a deidentifier; under certain conditions the two concepts are equivalent. Here is how a dual identification/deidentification system might work:

1. Collect data on unique object: “Joe Ferguson’s bank account contains $100.”

2. Assign a unique identifier: “Joe Ferguson’s bank account is 754003894713.”

3. Substitute name of object with its assigned unique identifier: “754003894713 contains $100.”

4. Consistently use the identifier with data.

5. Do not let anyone know that Joe Ferguson owns account “754003894713.”

The dual use of an identifier/deidentifier is a tried-and-true technique. Swiss bank accounts are essentially unique numbers (identifiers) assigned to a person. You access the bank account by producing the identifier number. The identifier number does not provide information about the identity of the bank account holder (i.e., it is a deidentifier).

The purpose of an identifier is to tell you that whenever the identifier is encountered, it refers to the same unique object, and whenever two different identifiers are encountered, they refer to different objects. The identifier, by itself, contains no information that links the data object to its public name.

It is important to understand that the process of deidentification can succeed only when each record is properly identified (i.e., there can be no deidentification without identification). Attempts to deidentify a poorly identified data set of clinical information will result in replicative records (multiple records for one patient), mixed-in records (single records composed of information on multiple patients), and missing records (unidentified records lost in the deidentification process).

The process of deidentification is best understood as an algorithm performed on the fly, in response to a query from a data analyst. Here is how such an algorithm might proceed.

1. The data analyst submits a query requesting a record from a Big Data resource. The resource contains confidential records that must not be shared, unless the records are deidentified.

2. The Big Data resource receives the query and retrieves the record.

3. A copy of the record is parsed, and any of the information within the data record that might link the record to the public name of the subject of the record (usually the name of an individual) is deleted from the copy. This might include the aforementioned name, address, date of birth, social security number, and so on.

4. A pseudo-identifier sequence is prepared for the deidentified record. The pseudo-identifier sequence might be generated by a random number generator, or it might be generated by encrypting the original identifier, or through a one-way hash algorithm, or by other methods chosen by the Big Data manager.

5. A transaction record is attached to the original record that includes the pseudo-identifier, the deidentified record, the time of the transaction, and any information pertaining to the requesting entity (e.g., the data analyst who sent the query) that is deemed fit and necessary by the Big Data resource data manager.

6. A record is sent to the data analyst that consists of the deidentified record and the unique pseudo-identifier created for the record.

Because the deidentified record and its unique pseudo-identifier are stored with the original record, subsequent requests for the record can be returned with the prepared information, at the discretion of the Big Data manager. This general approach to data deidentification will apply to requests for a single record or a million records.
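The six steps can be sketched as a single function. Everything specific here is an assumption made for illustration: the in-memory `resource` dictionary, the field names, and the choice of a random pseudo-identifier (one of the several options named in step 4). The sketch also handles only structured fields; free text inside a record would need the scrubbing methods described later in this chapter.

```python
import secrets
from datetime import datetime, timezone

# Illustrative list of name-linking fields; a real list would be longer.
NAME_LINKING_FIELDS = {"name", "address", "date_of_birth", "ssn"}

def deidentify_on_request(resource, record_id, requester):
    """Retrieve a record, strip it, assign a pseudo-identifier, log, return."""
    original = resource[record_id]                               # step 2
    stripped = {k: v for k, v in original["data"].items()
                if k not in NAME_LINKING_FIELDS}                 # step 3 (on a copy)
    pseudo_id = secrets.token_hex(16)                            # step 4
    original.setdefault("transactions", []).append({             # step 5
        "pseudo_id": pseudo_id,
        "deidentified": stripped,
        "time": datetime.now(timezone.utc).isoformat(),
        "requester": requester,
    })
    return {"pseudo_id": pseudo_id, **stripped}                  # step 6

resource = {
    "MRN-0001": {"data": {"name": "A. Patient", "ssn": "000-00-0000",
                          "date_of_birth": "1970-01-01", "glucose": 85}},
}
reply = deidentify_on_request(resource, "MRN-0001", requester="analyst-7")
print(reply)
```

The original record is never altered; the transaction entry stored alongside it is what later makes controlled reidentification possible.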

At this point, you might be asking yourself the following question: “What gives the data manager the right to distribute parts of a confidential record, even if it happens to be deidentified?” You might think that if you tell someone a secret, under the strictest confidence, then you would not want any part of that secret to be shared with anyone else. The whole notion of sharing confidential information that has been deidentified may seem outrageous and unacceptable.

We will discuss the legal and ethical issues of Big Data in Chapters 13 and 14. For now, readers should know that there are several simple and elegant principles that justify sharing deidentified data.

Consider the statement “Jules Berman has a blood glucose level of 85.” This would be considered a confidential statement because it tells people something about my medical condition. Consider the phrase “blood glucose 85.” When the name “Jules Berman” is removed, we are left with a disembodied piece of data. “Blood glucose 85” is no different from “Temperature 98.6,” “Apples 2,” or “Terminator 3.” They are simply raw data belonging to nobody in particular.

The act of deidentification renders the data harmless by transforming information about a person or data object into information about nothing in particular. Because the use of deidentified data poses no harm to human subjects, U.S. regulations allow the unrestricted use of such data for research purposes.33,34 Other countries have similar provisions.

Data Scrubbing

Data scrubbing is sometimes used as a synonym for deidentification. It is best to think of data scrubbing as a process that begins where deidentification ends. A data scrubber will remove unwanted information from a data record, including information of a personal nature, and any information that is not directly related to the purpose of the data record. For example, in the case of a hospital record, a data scrubber might remove the names of physicians who treated the patient, the names of hospitals or medical insurance agencies, addresses, dates, and any textual comments that are inappropriate, incriminating, irrelevant, or potentially damaging.

In medical data records, there is a concept known as “minimal necessary” that applies to shared confidential data33 (see Glossary item, Minimal necessary). It holds that when records are shared, only the minimum necessary information should be released. Any information not directly relevant to the intended purposes of the data analyst should be withheld. The process of data scrubbing gives data managers the opportunity to render a data record free of information that would link the record to its subject and free of extraneous information that the data analyst does not actually require.

There are many methods for data scrubbing. Most of these methods require the data manager to develop an exception list of items that should not be included in shared records (e.g., cities, states, zip codes, names of people, and so on). The scrubbing application moves through the records, extracting unnecessary information along the way. The end product is cleaned, but not sterilized. Though many undesired items can be successfully removed, this approach never produces a perfectly scrubbed set of data. In a Big Data resource, it is simply impossible for the data manager to anticipate every objectionable item and to include it in an exception list. Nobody is that smart.

There is, however, a method whereby data records can be cleaned, without error. This method involves creating a list of data (often in the form of words and phrases) that is acceptable for inclusion in a scrubbed and deidentified data set. Any data that is not in the list of acceptable information is automatically deleted. Whatever is left is the scrubbed data. This method can be described as a reverse scrubbing method. Everything in the data set is automatically deleted, unless it is an approved “exception.”

This method of scrubbing is very fast and can produce an error-free deidentified and scrubbed output.19,35,36 An example of the kind of output produced by a reverse scrubber is shown:

“Since the time when * * * * * * * * his own * and the * * * *, the anomalous * * have been * and persistent * * *; and especially * true of the construction and functions of the human *, indeed, it was the anomalous that was * * * in the * the attention, * * that were * to develop into the body * * which we now * *. As by the aid * * * * * * * * * our vision into the * * * has emerged *, we find * * and even evidence of *. To the highest type of * * it is the * the ordinary * * * * *. * to such, no less than to the most *, * * * is of absorbing interest, and it is often * * that the * * the most * into the heart of the mystery of the ordinary. * * been said, * * * * *. * * dermoid cysts, for example, we seem to * * * the secret * of Nature, and * out into the * * of her clumsiness, and * of her * * * *, *, * tell us much of * * * used by the vital * * * * even the silent * * * upon the * * *.”

The reverse scrubber requires the preexistence of a set of approved terms. One of the simplest methods for generating acceptable terms involves extracting them from a nomenclature that comprehensively covers the terms used in a knowledge domain. For example, a comprehensive listing of living species will not contain dates or zip codes or any of the objectionable language or data that should be excluded from a scrubbed data set. In a method that I have published, a list of approved doublets (approximately 200,000 two-word phrases collected from standard nomenclatures) is automatically collected for the scrubbing application.19 The script is fast, and its speed is not significantly reduced by the size of the list of approved terms.
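The following is a simplified sketch of a doublet-based reverse scrubber, not the published script. The two-entry approved list and the sample sentence are hypothetical; a working list would be drawn from a standard nomenclature as described above.

```python
import string

def reverse_scrub(text, approved_doublets, mask="*"):
    """Keep only words that belong to an approved two-word phrase."""
    tokens = text.split()
    # Compare on lowercase words with surrounding punctuation removed.
    norm = [t.strip(string.punctuation).lower() for t in tokens]
    keep = [False] * len(tokens)
    for i in range(len(tokens) - 1):
        if (norm[i], norm[i + 1]) in approved_doublets:
            keep[i] = keep[i + 1] = True
    # Every word that was not part of an approved doublet becomes the mask.
    return " ".join(t if k else mask for t, k in zip(tokens, keep))

approved = {("dermoid", "cysts"), ("the", "ovary")}   # tiny hypothetical list
print(reverse_scrub("Jules Berman has dermoid cysts of the ovary", approved))
```

Storing the approved doublets in a set makes each lookup take roughly constant time, which is consistent with the observation that speed depends little on the size of the approved list.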

Reidentification

For scientists, deidentification serves two purposes:

1. To protect the confidentiality and the privacy of the individual (when the data concerns a particular human subject)

2. To remove information that might bias the experiment (e.g., to blind the experimentalist to patient identities)

Because confidentiality and privacy concerns always apply to human subject data and because issues of experimental bias always apply when analyzing data, it would seem imperative that deidentification should be an irreversible process (i.e., the names of the subjects and samples should be held a secret, forever).

Scientific integrity does not always accommodate irreversible deidentification. On occasion, experimental samples are mixed up; samples thought to come from a certain individual, tissue, record, or account may, in fact, come from another source. Sometimes major findings in science need to be retracted when a sample mix-up has been shown to occur.37–41 Even when samples are submitted without mix-up, the data is sometimes collected improperly. For example, reversing electrodes on an electrocardiogram may yield spurious and misleading results. Sometimes data is purposefully fabricated or otherwise corrupted to suit the personal agendas of dishonest scientists. When data errors occur, regardless of reason, it is important to retract the publications.42,43 To preserve scientific integrity, it is sometimes necessary to discover the identity of deidentified records.

In some cases, deidentification stops the data analyst from helping individuals whose confidentiality is being protected. Imagine you are conducting an analysis on a collection of deidentified data and you find patients with a genetic marker for a disease that is curable, if treated at an early stage; or you find a new biomarker that determines which patients would benefit from surgery and which patients would not. You would be compelled to contact the subjects in the database to give them information that could potentially save their lives. An irreversibly deidentified data set precludes any intervention with subjects—nobody knows their identities.

Deidentified records can, under strictly controlled circumstances, be reidentified. Reidentification is typically achieved by entrusting a third party with a confidential list that maps individuals to their deidentified records. Obviously, reidentification can only occur if the Big Data resource keeps a link connecting the identifiers of their data records to the identifiers of the corresponding deidentified record. The act of assigning a public name to the deidentified record must always involve strict oversight. The data manager must have in place a protocol that describes the process whereby approval for reidentification is obtained. Reidentification provides an opportunity whereby confidentiality can be breached and human subjects can be harmed. Consequently, stewarding the reidentification process is one of the most serious responsibilities of Big Data managers.
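One way to picture the third-party arrangement is as a small escrow object that holds the links, refuses to release them without an approval on file, and records every attempt. The class name, the approval codes, and the identifiers below are hypothetical; the approval protocol itself is an institutional process that code can only enforce, not define.

```python
class ReidentificationEscrow:
    """Held by a trusted third party, separate from the deidentified data."""

    def __init__(self):
        self._links = {}       # pseudo-identifier -> original identifier
        self.audit_log = []    # every attempt is recorded, approved or not

    def register(self, pseudo_id, original_id):
        self._links[pseudo_id] = original_id

    def reidentify(self, pseudo_id, approval_code, valid_approvals):
        approved = approval_code in valid_approvals
        self.audit_log.append((pseudo_id, approval_code, approved))
        if not approved:
            raise PermissionError("reidentification requires documented approval")
        return self._links[pseudo_id]

escrow = ReidentificationEscrow()
escrow.register("a1b2c3", "MRN-0001")
approvals = {"APPROVAL-001"}   # hypothetical codes issued by an oversight body

print(escrow.reidentify("a1b2c3", "APPROVAL-001", approvals))
```

Keeping the link table out of the Big Data resource itself means that a breach of the resource alone does not expose the mapping.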

Lessons Learned

Everything has been said before, but since nobody listens we have to keep going back and beginning all over again.

Andre Gide

Identification issues are often ignored by Big Data managers who are accustomed to working on small data projects. It is worthwhile to repeat the most important ideas described in this chapter, many of which are counterintuitive and strange to those whose lives are spent outside the confusing realm of Big Data.

1. All Big Data resources can be imagined as an identifier system for data objects and data-related events (i.e., timed transactions). The data in a Big Data resource can be imagined as character sequences that are attached to identifiers.

2. Without an adequate identification system, a Big Data resource has no value. The data within the resource cannot be trusted.

3. An identifier is a unique alphanumeric sequence assigned to a data object.

4. A data object is a collection of data that contains self-describing information, and one or more data values. Data objects should be associated with a unique identifier.

5. Deidentification is the process of stripping information from a data record that might link the record to the public name of the record’s subject.

6. Deidentification should not be confused with the act of stripping a record of an identifier. A deidentified record must have an associated identifier, just as an identified data record must have an identifier.

7. Where there is no identification, there can be no deidentification and no reidentification.

8. Reidentification is the assignment of the public name associated with a data record to the deidentified record. Reidentification is sometimes necessary to verify the contents of a record or to provide information that is necessary for the well-being of the subject of a deidentified data record. Reidentification always requires approval and oversight.

9. When a deidentified data set contains no unique records (i.e., every record has one or more additional records from which it cannot be distinguished, aside from its assigned identifier sequence), then it becomes impossible to maliciously uncover a deidentified record’s public name.

10. Data scrubbers remove unwanted information from a data record, including information of a personal nature, and any information that is not directly related to the purpose of the data record. Data deidentification is a process whereby links to the public name of the subject of the record are removed (see Glossary items, Data cleaning, Data scrubbing).

11. The fastest known method of data scrubbing involves preparing a list of approved words and phrases that can be retained in data records and removing every word or phrase that is not found in the approved list.
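The condition in item 9 can be checked mechanically before a data set is released. The sketch below flags every record whose content, ignoring the assigned identifier, matches no other record. The field names and sample records are hypothetical, and field values are assumed to be simple hashable values such as strings and numbers.

```python
from collections import Counter

def unique_records(records, id_field="pid"):
    """Return records whose content (ignoring the identifier) is one of a kind."""
    def content(rec):
        # Order-independent, hashable view of every field except the identifier.
        return tuple(sorted((k, v) for k, v in rec.items() if k != id_field))
    counts = Counter(content(r) for r in records)
    return [r for r in records if counts[content(r)] == 1]

records = [
    {"pid": "a", "age": 40, "dx": "nevus"},
    {"pid": "b", "age": 40, "dx": "nevus"},
    {"pid": "c", "age": 91, "dx": "melanoma"},
]
print(unique_records(records))   # records that stand alone in the data set
```

An empty result means that every record has at least one indistinguishable companion; any record returned is a candidate for further scrubbing or for withholding.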

References

19. Berman JJ. Methods in medical informatics: fundamentals of healthcare programming in Perl, Python, and Ruby. Boca Raton, FL: Chapman and Hall; 2010.

23. Reed DP. Naming and synchronization in a decentralized computer system. MIT 1978; Doctoral Thesis.

24. Joint NEMA/COCIR/JIRA Security and Privacy Committee (SPC). Identification and allocation of basic security rules in healthcare imaging systems. September, 2002. Available from: http://www.medicalimaging.org/wp-content/uploads/2011/02/Identification_and_Allocation_of_Basic_Security_Rules_In_Healthcare_Imaging_Systems-September_2002.pdf; viewed January 10, 2013.

25. Kuzmak P, Casertano A, Carozza D, Dayhoff R, Campbell K. Solving the problem of duplicate medical device unique identifiers. High Confidence Medical Device Software and Systems (HCMDSS) workshop, Philadelphia, PA, June 2-3, 2005. Available from: http://www.cis.upenn.edu/hcmdss/Papers/submissions/; viewed August 26, 2012.

26. Health Level 7 OID Registry. Available from: http://www.hl7.org/oid/frames.cfm; viewed August 26, 2012.

27. Leach P, Mealling M, Salz R. A Universally Unique IDentifier (UUID) URN namespace. Network Working Group, Request for Comment 4122, Standards Track. Available from: http://www.ietf.org/rfc/rfc4122.txt; viewed August 26, 2012.

28. Berman JJ. Confidentiality for medical data miners. Art Intell Med. 2002;26:25–36.

29. Patient Identity Integrity. A White Paper by the HIMSS Patient Identity Integrity Work Group, December 2009. Available from: http://www.himss.org/content/files/PrivacySecurity/PIIWhitePaper.pdf; viewed September 19, 2012.

30. Berman JJ. Biomedical informatics. Sudbury, MA: Jones and Bartlett; 2007.

31. Pakstis AJ, Speed WC, Fang R, et al. SNPs for a universal individual identification panel. Hum Genet. 2010;127:315–324.

32. Katsanis SH, Wagner JK. Characterization of the standard and recommended CODIS markers. J Foren Sci 2012; Aug 24.

33. Department of Health and Human Services. 45 CFR (Code of Federal Regulations), Parts 160 through 164 Standards for Privacy of Individually Identifiable Health Information (Final Rule). Fed Reg. 2000;65(250):82461–82510.

34. Department of Health and Human Services. 45 CFR (Code of Federal Regulations), 46 Protection of Human Subjects (Common Rule). Fed Reg. 1991;56:28003–28032.

35. Berman JJ. Concept-match medical data scrubbing: how pathology datasets can be used in research. Arch Pathol Lab Med. 2003;127:680–686.

36. Berman JJ. Comparing de-identification methods. Available from: http://www.biomedcentral.com/1472-6947/6/12/comments/comments.htm; March 31, 2006; viewed January 31, 2013.

37. Knight J. Agony for researchers as mix-up forces retraction of ecstasy study. Nature. 2003;425:109.

38. Sainani K. Error: what biomedical computing can learn from its mistakes. Biomed Comput Rev 2011;12–19 Fall.

39. Palanichamy MG, Zhang Y. Potential pitfalls in MitoChip detected tumor-specific somatic mutations: a call for caution when interpreting patient data. BMC Cancer. 2010;10:597.

40. Bandelt H, Salas A. Contamination and sample mix-up can best explain some patterns of mtDNA instabilities in buccal cells and oral squamous cell carcinoma. BMC Cancer. 2009;9:113.

41. Harris G. U.S. inaction lets look-alike tubes kill patients. The New York Times August 20, 2010.

42. Flores G. Science retracts highly cited paper: study on the causes of childhood illness retracted after author found guilty of falsifying data. The Scientist June 17, 2005.

43. Gowen LC, Avrutskaya AV, Latour AM, Koller BH, Leadon SA. Retraction of: Gowen LC, Avrutskaya AV, Latour AM, Koller BH, Leadon SA. Science. 1998 Aug 14;281(5379):1009-12. Science. 2003;300:1657.

