3

Identification, Deidentification, and Reidentification

Abstract

This chapter describes, in some detail, the available methods for data identification and the minimal properties of identified information (uniqueness, exclusivity, completeness, authenticity, and reconciliation). The dire consequences of faulty identification will be discussed, along with real-world examples. Once data objects have been properly identified, they can be properly deidentified and reidentified. The ability to deidentify data objects confers enormous advantages when issues of confidentiality, privacy, and intellectual property emerge.

Keywords

Identification; Identifier; Data uniqueness; Deidentification; Reidentification; Anonymization; One-way hash; Random character string; Hash collision

Section 3.1. What Are Identifiers?

Where is the ‘any' key?

Homer Simpson, in response to his computer's instruction to “Press any key”

Let us begin this chapter with a riddle. “Is the number 5 a data object?” If you are like most people, you will answer “yes” because “5” is an integer and therefore represents numeric data, and “5” is an object because it exists and is different from all the other numbers. Therefore “5” is a data object. This line of reasoning happens to be completely erroneous. Five is not a data object. As a pure abstraction with nothing binding it to a physical object (e.g., 5 pairs of shoes, 5 umbrellas), it barely qualifies as data.

When we speak of a data object, in computer science, we refer to something that is identified and described. Consider the following statements:

< f183136d-3051-4c95-9e32-66844971afc5 ><name ><Baltimore >
< f183136d-3051-4c95-9e32-66844971afc5 ><class ><city >
< f183136d-3051-4c95-9e32-66844971afc5 ><population ><620,961 >

Without knowing much about data objects (which we will be discussing in detail in Section 6.2), we can start to see that these three statements are providing information about Baltimore. They tell us that Baltimore is a city of population 620,961, and that Baltimore has been assigned an alphanumeric sequence, “f183136d-3051-4c95-9e32-66844971afc5,” to which all our available information about Baltimore has been attached. Peeking ahead into Chapter 6, we can now surmise that a data object consists of a unique alphanumeric sequence (the object identifier) plus the descriptive information associated with the identifier (e.g., name, population number, class). We will see that there are compelling reasons for storing all information contained in Big Data resources within uniquely identified data objects. Consequently, one of the most important tasks for data managers is the creation of a dependable identifier system [1]. In this chapter, we will be focusing our attention on the unique identifier and how it is created and utilized in the realm of Big Data.
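The three statements above can be sketched in code. This is a minimal illustration, not a prescribed implementation; the dictionary layout and variable names are my own, while the identifier, name, class, and population values come from the example:

```python
# A data object: a unique identifier bound to descriptive key-value pairs.
# Identifier and values are taken from the Baltimore example above.
data_object = {
    "identifier": "f183136d-3051-4c95-9e32-66844971afc5",
    "name": "Baltimore",
    "class": "city",
    "population": 620961,
}

# All available information about Baltimore is attached to, and retrieved
# through, its identifier.
objects = {data_object["identifier"]: data_object}
record = objects["f183136d-3051-4c95-9e32-66844971afc5"]
print(record["name"], record["class"], record["population"])
```

Every piece of information that accrues to Baltimore in the future would be attached to this same identifier, never to the name "Baltimore" itself.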

Identification issues are often ignored by data managers who are accustomed to working on small data projects. It is worthwhile to list, up front, the most important ideas described in this chapter, many of which are counterintuitive and strange to those whose careers are spent outside the confusing realm of Big Data.

  •   All Big Data resources can be imagined as identifier systems to which we attach our data.
  •   Without an adequate identification system, a Big Data resource has no value, because the data within the resource cannot be sensibly analyzed.
  •   Data deidentification is a process whereby links to the public name of the subject of the record are removed.
  •   Deidentification should not be confused with the act of stripping a record of an identifier. A deidentified record, like any valid data object, must always have an associated identifier.
  •   Deidentification should not be confused with data scrubbing. Data scrubbers remove unwanted information from a data record, including information of a personal nature, and any information that is not directly related to the purpose of the data record. [Glossary Data cleaning, Data scrubbing]
  •   Reidentification is a concept that specifically involves personal and private data records. It involves ascertaining the name of the individual who is associated with a deidentified record. Reidentification is sometimes necessary to verify the contents of a record, or to provide information that is necessary for the well-being of the subject of a deidentified data record. Ethical reidentification always requires approval and oversight.
  •   Where there is no identification, there can be no deidentification and no reidentification.
  •   When a deidentified data set contains no unique records (i.e., every record has one or more additional records from which it cannot be distinguished, aside from its assigned identifier sequence), then it becomes impossible to maliciously uncover a deidentified record's public name.
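The last bullet describes a testable property of a deidentified data set: every record, ignoring its identifier, must be indistinguishable from at least one other record. Here is a sketch of such a check; the record fields and the `"id"` key are hypothetical, and the function name is my own:

```python
from collections import Counter

def is_safely_deidentified(records, k=2):
    """Return True if every record, ignoring its identifier, is
    indistinguishable from at least k-1 other records in the set."""
    # Count how often each combination of non-identifier fields occurs.
    profiles = Counter(
        tuple(sorted((key, val) for key, val in rec.items() if key != "id"))
        for rec in records
    )
    return all(count >= k for count in profiles.values())

# Hypothetical deidentified records: the first two cannot be told apart,
# but the third is unique and could be maliciously reidentified.
records = [
    {"id": "a1", "age_group": "60-69", "diagnosis": "hepatitis"},
    {"id": "b2", "age_group": "60-69", "diagnosis": "hepatitis"},
    {"id": "c3", "age_group": "30-39", "diagnosis": "gout"},
]
print(is_safely_deidentified(records))
```

With the unique third record included, the check fails; restricted to the first two records, it passes.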

Section 3.2. Difference Between an Identifier and an Identifier System

Many errors, of a truth, consist merely in the application of the wrong names to things.

Baruch Spinoza

Data identification is among the most underappreciated and least understood Big Data issues. Measurements, annotations, properties, and classes of information have no informational meaning unless they are attached to an identifier that distinguishes one data object from all other data objects, and that links together all of the information that has been or will be associated with the identified data object. The method of identification and the selection of objects and classes to be identified relates fundamentally to the organizational model of the Big Data resource. If data identification is ignored or implemented improperly, the Big Data resource cannot succeed. [Glossary Annotation]

This chapter will describe, in some detail, the available methods for data identification, and the minimal properties of identified information (including uniqueness, exclusivity, completeness, authenticity, and reconciliation). The dire consequences of inadequate identification will be discussed, along with real-world examples. Once data objects have been properly identified, they can be deidentified and, under some circumstances, reidentified. The ability to deidentify data objects confers enormous advantages when issues of confidentiality, privacy, and intellectual property emerge. The ability to reidentify deidentified data objects is required for error detection, error correction, and data validation. [Glossary Deidentification, Re-identification, Privacy versus confidentiality, Intellectual property]

Returning to the title of this section, let us ask ourselves, “What is the difference between an identifier and an identifier system?” To answer, by analogy, it is like the difference between having a $100 bill in your pocket and having a savings account with $100 credited to the account. In the case of the $100 bill, anyone in possession of the bill can use it to purchase items. In the case of the $100 credit, there is a system in place for uniquely assigning the $100 to one individual, until such time as that individual conducts an account transaction that increases or decreases the account value. Likewise, an identifier system creates a permanent environment in which the identifiers are safely stored and used.

Every good information system is, at its heart, an identification system: a way of naming data objects so that they can be retrieved by their name, and a way of distinguishing each object from every other object in the system. If data managers properly identified their data, and did absolutely nothing else, they would be producing a collection of data objects with more informational value than many existing Big Data resources.

The properties of a good identifier system are the following:

  •   Completeness

Every unique object in the Big Data resource must be assigned an identifier.

  •   Uniqueness

Each identifier is a unique sequence.

  •   Exclusivity

Each identifier is assigned to a unique object, and to no other object.

  •   Authenticity

The objects that receive identification must be verified as the objects that they are intended to be. For example, if a young man walks into a bank and claims to be Richie Rich, then the bank must ensure that he is, in fact, who he says he is.

  •   Aggregation

The Big Data resource must have a mechanism to aggregate all of the data that is properly associated with the identifier (i.e., to bundle all of the data that belongs to the uniquely identified object). In the case of a bank, this might mean collecting all of the transactions associated with an account holder. In a hospital, this might mean collecting all of the data associated with a patient's identifier: clinic visit reports, medication transactions, surgical procedures, and laboratory results. If the identifier system performs properly, aggregation methods will always collect all of the data associated with an object and will never collect any data that is associated with a different object.

  •   Permanence

The identifiers and the associated data must be permanent. In the case of a hospital system, when the patient returns to the hospital after 30 years of absence, the record system must be able to access his identifier and aggregate his data. When a patient dies, the patient's identifier must not perish.

  •   Reconciliation

There should be a mechanism whereby the data associated with a unique, identified object in one Big Data resource can be merged with the data held in another resource, for the same unique object. This process, which requires comparison, authentication, and merging is known as reconciliation. An example of reconciliation is found in health record portability. When a patient visits a hospital, it may be necessary to transfer her electronic medical record from another hospital. Both hospitals need a way of confirming the identity of the patient and combining the records. [Glossary Electronic medical record]

  •   Immutability

In addition to being permanent (i.e., never destroyed or lost), the identifier must never change (see Chapter 6) [2]. In the event that two Big Data resources are merged, or that legacy data is merged into a Big Data resource, or that individual data objects from two different Big Data resources are merged, a single data object will be assigned two identifiers; one from each of the merging systems. In this case, the identifiers must be preserved as they are, without modification. The merged data object must be provided with annotative information specifying the origin of each identifier (i.e., clarifying which identifier came from which Big Data resource).

  •   Security

The identifier system is vulnerable to malicious attack. A Big Data resource with an identifier system can be irreversibly corrupted if the identifiers are modified. In the case of human-based identifier systems, stolen identifiers can be used for a variety of malicious activities directed against the individuals whose records are included in the resource.

  •   Documentation and Quality Assurance

A system should be in place to find and correct errors in the identifier system. Protocols must be written for establishing the identifier system, for assigning identifiers, for protecting the system, and for monitoring the system. Every problem and every corrective action taken must be documented and reviewed. Review procedures should determine whether the errors were corrected effectively; and measures should be taken to continually improve the identifier system. All procedures, all actions taken, and all modifications of the system should be thoroughly documented. This is a big job.

  •   Centrality

Whether the information system belongs to a savings bank, an airline, a prison system, or a hospital, identifiers play a central role. You can think of information systems as a scaffold of identifiers to which data is attached. For example, in the case of a hospital information system, the patient identifier is the central key to which every transaction for the patient is attached.

  •   Autonomy

An identifier system has a life of its own, independent of the data contained in the Big Data resource. The identifier system can persist, documenting and organizing existing and future data objects even if all of the data in the Big Data resource were to suddenly vanish (i.e., when all of the data contained in all of the data objects are deleted).

In theory, identifier systems are incredibly easy to implement. Here is exactly how it is done:

  1.  Generate a unique character sequence, such as a UUID, or a long random number. [Glossary UUID, Randomness]
  2.  Assign the unique character sequence (i.e., identifier) to each new object, at the moment that the object is created. In the case of a hospital, a patient chart is created at the moment he or she is registered into the hospital information system. In the case of a bank, a customer record is created at the moment that he or she is provided with an account number. In the case of an object-oriented programming language, such as Ruby, this would be the moment when the “new” method is sent to a class object, instructing the class object to create a class instance. [Glossary Object-oriented programming, Instance]
  3.  Preserve the identifier number and bind it to the object. In practical terms, this means that whenever the data object accrues new data, the new data is assigned to the identifier number. In the case of a hospital system, this would mean that all of the lab tests, billable clinical transactions, pharmacy orders, and so on, are linked to the patient's unique identifier number, as a service provided by the hospital information system. In the case of a banking system, this would mean that all of the customer's deposits and withdrawals and balances are attached to the customer's unique account number.
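The three steps can be sketched in a few lines of Python. The class name and bank scenario are illustrative only, a minimal model of generate-assign-bind rather than a production design:

```python
import uuid

class IdentifierSystem:
    """Sketch of the three steps: generate, assign, and bind."""

    def __init__(self):
        self.store = {}  # identifier -> all data bound to that identifier

    def register(self):
        # Steps 1 and 2: generate a unique sequence and assign it to a
        # new object at the moment the object is created.
        identifier = str(uuid.uuid4())
        self.store[identifier] = []
        return identifier

    def bind(self, identifier, data):
        # Step 3: whenever the object accrues new data, the new data is
        # attached to the object's identifier.
        self.store[identifier].append(data)

bank = IdentifierSystem()
account = bank.register()
bank.bind(account, {"deposit": 100})
bank.bind(account, {"withdrawal": 40})
print(account, bank.store[account])
```

Notice that the customer's transactions are never attached to a name or to any descriptive attribute; they accrue to the identifier alone.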

Section 3.3. Generating Unique Identifiers

A UUID is 128 bits long, and can guarantee uniqueness across space and time.

P. Leach, M. Mealling and R. Salz [3]

Uniqueness is one of those concepts that everyone intuitively understands; explanations would seem unnecessary. Actually, uniqueness in the computational sciences is a somewhat different concept than uniqueness in the natural world. In computational sciences, uniqueness is achieved when a data object is associated with a unique identifier (i.e., a character string that has not been assigned to any other data object). Most of us, when we think of a data object, are probably thinking of a data record, which may consist of the name of a person followed by a list of feature values (height, weight, and age), or a sample of blood followed by laboratory values (e.g., white blood cell count, red cell count, and hematocrit). For computer scientists, a data object is a holder for data values (the so-called encapsulated data), descriptors of the data, and properties of the holder (i.e., the class of objects to which the instance belongs). Uniqueness is achieved when the data object is permanently bound to its own identifier sequence. [Glossary Encapsulation]

Unique objects have three properties:

  •   A unique object can be distinguished from all other unique objects.
  •   A unique object cannot be distinguished from itself.
  •   Uniqueness may apply to collections of objects (i.e., a class of instances can be unique).

UUID (Universally Unique IDentifier) is an example of one type of algorithm that creates unique identifiers, on command, at the moment when new objects are created (i.e., during the run-time of a software application). A UUID is 128 bits long and reserves 60 bits for a string computed directly from a computer time stamp, and is usually represented by a sequence of alphanumeric ASCII characters [3]. UUIDs were originally used in the Apollo Network Computing System and were later adopted in the Open Software Foundation's Distributed Computing Environment [4]. [Glossary Time stamp, ASCII]

Linux systems have a built-in UUID utility, “uuidgen” (on some ports, “uuidgen.exe”), that can be called from the system prompt.

Here are a few examples of output values generated by the “uuidgen.exe” utility: [Glossary Command line utility, Utility]

$ uuidgen.exe
312e60c9-3d00-4e3f-a013-0d6cb1c9a9fe
$ uuidgen.exe
822df73c-8e54-45b5-9632-e2676d178664
$ uuidgen.exe
8f8633e1-8161-4364-9e98-fdf37205df2f
$ uuidgen.exe
83951b71-1e5e-4c56-bd28-c0c45f52cb8a
$ uuidgen -t
e6325fb6-5c65-11e5-b0e1-0ceee6e0b993
$ uuidgen -r
5d74e36a-4ccb-42f7-9223-84eed03291f9

Notice that each of the final two examples has a parameter added to the “uuidgen” command (i.e., “-t” and “-r”). There are several versions of the UUID algorithm that are available. The “-t” parameter instructs the utility to produce a UUID based on the time (measured in seconds elapsed since the first second of October 15, 1582, the start of the Gregorian calendar). The “-r” parameter instructs the utility to produce a UUID based on the generation of a pseudorandom number. In any circumstance, the UUID utility instantly produces a fixed length character string suitable as an object identifier. The UUID utility is trusted and widely used by computer scientists. Independent-minded readers can easily design their own unique object identifiers, using pseudorandom number generators, or with one-way hash generators. [Glossary One-way hash, Pseudorandom number generator]
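For readers inclined to roll their own identifiers, here is a sketch of the two approaches just mentioned: a cryptographic pseudorandom number generator and a one-way hash. The seed string is hypothetical; any string unique to the object would do:

```python
import hashlib
import secrets

# Identifier from a cryptographic pseudorandom number generator:
# 16 random bytes yield 128 bits, the same length as a UUID.
random_id = secrets.token_hex(16)

# Identifier from a one-way hash of a seed string (hypothetical seed).
# The same seed always yields the same digest, so seeds must themselves
# be unique to the object being identified.
seed = "Baltimore record, accession 1"
hash_id = hashlib.sha256(seed.encode()).hexdigest()

print(random_id)
print(hash_id)
```

The deterministic behavior of the one-way hash cuts both ways: it lets two parties independently derive the same identifier from the same source data, but it also means that a carelessly chosen seed can assign one identifier to two different objects.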

Python has its own UUID generator. The uuid module is included in the standard Python distribution and can be called directly from a script.

import uuid
print(uuid.uuid4())

When discussing UUIDs the question of duplicates (so-called collisions, in the computer science literature) always arises. How can we be certain that a UUID is unique? Isn't it possible that the algorithm that we use to create a UUID may, at some point, produce the same sequence on more than one occasion? Yes, but the odds are small. It has been estimated that duplicate UUIDs are produced, on average, once every 2.71 quintillion (i.e., 2.71 × 10^18) executions [5]. It seems that reports of UUID collisions, when investigated, have been attributed to defects in the implementation of the UUID algorithms. The general consensus seems to be that UUID collisions are not worth worrying about, even in the realm of Big Data.
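The odds can be checked with a back-of-envelope calculation. A version-4 UUID carries 122 random bits (128 bits minus the fixed version and variant bits), and the standard birthday-problem approximation gives the chance that any two of n such UUIDs collide. The function below is a sketch of that estimate, not an authoritative figure:

```python
import math

def uuid4_collision_probability(n):
    """Birthday-problem approximation of the probability that any two of
    n random version-4 UUIDs (122 random bits each) collide."""
    return 1.0 - math.exp(-n * (n - 1) / (2 * 2**122))

# Even after generating a trillion UUIDs, the collision probability
# remains vanishingly small.
print(uuid4_collision_probability(10**12))
```

For n = 10^12 the result is on the order of 10^-13, which is why collisions in practice point to implementation defects rather than to bad luck.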

Section 3.4. Really Bad Identifier Methods

I always wanted to be somebody, but now I realize I should have been more specific.

Lily Tomlin

Names are poor identifiers. First off, we can never assume that any name is unique. Surnames such as Smith, Zhang, Garcia, Lo, and given names such as John and Susan are very common. In Korea, five last names account for nearly 50% of the population [6]. Moreover, if we happened to find an individual with a truly unique name (e.g., Mr. Mxyzptlk), there would be no guarantee that some other individual might not one day bear the same name. Compounding the non-uniqueness of names, there is the problem of the many variant forms of a single name. The sources for these variations are many. Here is a partial listing:

  1.  Modifiers to the surname (du Bois, DuBois, Du Bois, Dubois, Laplace, La Place, van de Wilde, Van DeWilde, etc.).
  2.  Accents that may or may not be transcribed onto records (e.g., acute accent, cedilla, diacritical comma, palatalized mark, hyphen, diphthong, umlaut, circumflex, and a host of obscure markings).
  3.  Special typographic characters (e.g., the ligature “æ”).
  4.  Multiple “middle names” for an individual that may not be transcribed onto records, and individuals who replace their first name with their middle name for common usage, while retaining the first name for legal documents.
  5.  Latinized and other versions of a single name (Carl Linnaeus, Carl von Linne, Carolus Linnaeus, Carolus a Linne).
  6.  Hyphenated names that are confused with first and middle names (e.g., Jean-Jacques Rousseau, or Jean Jacques Rousseau; Louis-Victor-Pierre-Raymond, 7th duc de Broglie, or Louis Victor Pierre Raymond Seventh duc deBroglie).
  7.  Cultural variations in name order that are mistakenly rearranged when transcribed onto records. Many cultures do not adhere to the Western European name order (e.g., given name, middle name, surname).
  8.  Name changes, through marriage or other legal actions, aliasing, pseudonymous posing, or insouciant whim.

Aside from the obvious consequences of using names as record identifiers (e.g., corrupt database records, forced merges between incompatible data resources, impossibility of reconciling legacy records), there are non-obvious consequences that are worth considering. Take, for example, accented characters in names. These word decorations wreak havoc on orthography and on alphabetization. Where do you put a name that contains an umlauted character? Do you pretend the umlaut is not there, and alphabetize it according to its plain characters? Do you order based on the ASCII-numeric assignment for the character, in which case the umlauted letter may appear nowhere near the plain-lettered words in an alphabetized list? The same problem applies to every special character. [Glossary American Standard Code for Information Interchange, ASCII]

A similar problem exists for surnames with modifiers. Do you alphabetize de Broglie under “D” or under “d” or under “B”? If you choose B, then what do you do with the concatenated form of the name, “deBroglie”? When it comes down to it, it is impossible to satisfactorily alphabetize a list of names. This means that searches based on proximity in the alphabet will always be prone to errors.
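The alphabetization problem is easy to demonstrate. Python's default string comparison orders characters by Unicode code point, so an umlauted letter sorts after every plain letter; the names below are hypothetical. A common (and itself imperfect) workaround, stripping accents before comparing, is also sketched:

```python
import unicodedata

names = ["Mzyk", "Müller", "Mueller"]

# Default comparison is by code point: "ü" (U+00FC) outranks every plain
# letter, so "Müller" lands after "Mzyk" in the sorted list.
print(sorted(names))

def plain(text):
    """Strip combining accent marks after Unicode NFD decomposition."""
    return "".join(ch for ch in unicodedata.normalize("NFD", text)
                   if not unicodedata.combining(ch))

# Accent-stripped sorting puts "Müller" back among its plain neighbors.
print(sorted(names, key=plain))
```

Neither ordering is authoritative; locale-aware collation rules differ by language (German phone books, for instance, treat "ü" as "ue"), which is precisely why proximity-in-the-alphabet searches on names are error-prone.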

I have had numerous conversations with intelligent professionals who are tasked with the responsibility of assigning identifiers to individuals. At some point in every conversation, they will find it necessary to explain that although an individual's name cannot serve as an identifier, the combination of name plus date of birth provides accurate identification in almost every instance. They sometimes get carried away, insisting that the combination of name plus date of birth plus social security number provides perfect identification, as no two people will share all three identifiers: same name, same date of birth, same social security number. This argument rises to the height of folly and completely misses the point of identification. As we will see, it is relatively easy to assign unique identifiers to individuals and to any data object, for that matter. For managers of Big Data resources, the larger problem is ensuring that each unique individual has only one identifier (i.e., denying one object multiple identifiers). [Glossary Social Security Number]

Let us see what happens when we create identifiers from the name plus the birthdate. We will examine name + birthdate + social security number later in this section.

Consider this example. Mary Jessica Meagher, born June 7, 1912 decided to open a separate bank account in each of 10 different banks. Some of the banks had application forms, which she filled out accurately. Other banks registered her account through a teller, who asked her a series of questions and immediately transcribed her answers directly into a computer terminal. Ms. Meagher could not see the computer screen and could not review the entries for accuracy.

Here are the entries for her name plus date of birth:

  1. Marie Jessica Meagher, June 7, 1912 (the teller mistook Marie for Mary).
  2. Mary J. Meagher, June 7, 1912 (the form requested a middle initial, not a middle name).
  3. Mary Jessica Magher, June 7, 1912 (the teller misspelled the surname).
  4. Mary Jessica Meagher, Jan 7, 1912 (the birth month was constrained, on the form, to three letters; Jun, entered on the form, was transcribed as Jan).
  5. Mary Jessica Meagher, 6/7/12 (the form provided spaces for only the final two digits of the birth year; through a miracle of modern banking, Mary, born in 1912, was re-born a century later).
  6. Mary Jessica Meagher, 7/6/2012 (the form asked for day, month, year, in that order, as is common in Europe).
  7. Mary Jessica Meagher, June 1, 1912 (on the form, a 7 was mistaken for a 1).
  8. Mary Jessie Meagher, June 7, 1912 (Mary, as a child, was called by the informal form of her middle name, which she provided to the teller).
  9. Mary Jesse Meagher, June 7, 1912 (Mary, as a child, was called by the informal form of her middle name, which she provided to the teller, and which the teller entered as the male variant of the name).
  10. Marie Jesse Mahrer, 1/1/12 (an underzealous clerk combined all of the mistakes on the form and the computer transcript, and added a new orthographic variant of the surname).

For each of these ten examples, a unique individual (Mary Jessica Meagher) would be assigned a different identifier at each of 10 banks. Had Mary re-registered at one bank ten times, the outcome might have been the same.
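The failure is easy to show concretely. Building a naive identifier key from name plus birth date, the ten transcriptions above yield ten distinct "identifiers" for one unique person:

```python
# The ten bank entries for Mary Jessica Meagher, as transcribed above.
entries = [
    ("Marie Jessica Meagher", "June 7, 1912"),
    ("Mary J. Meagher", "June 7, 1912"),
    ("Mary Jessica Magher", "June 7, 1912"),
    ("Mary Jessica Meagher", "Jan 7, 1912"),
    ("Mary Jessica Meagher", "6/7/12"),
    ("Mary Jessica Meagher", "7/6/2012"),
    ("Mary Jessica Meagher", "June 1, 1912"),
    ("Mary Jessie Meagher", "June 7, 1912"),
    ("Mary Jesse Meagher", "June 7, 1912"),
    ("Marie Jesse Mahrer", "1/1/12"),
]

# A naive identifier built from name + birth date: every transcription
# variant produces a different key for the same individual.
keys = {f"{name}|{dob}" for name, dob in entries}
print(len(keys))
```

Ten entries, ten keys, one person: exactly the violation of the one-identifier-per-individual rule that the chapter warns against.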

If you toss the social security number into the mix (name + birth date + social security number) the problem is compounded. The social security number for an individual is anything but unique. Few of us carry our original social security cards. Our number changes due to false memory (“You mean I've been wrong all these years?”), data entry errors (“Character transpositoins, I mean transpositions, are very common”), intention to deceive (“I don't want to give those people my real number”), desperation (“I don't have a number, so I'll invent one”), or impersonation (“I don't have health insurance, so I'll use my friend's social security number”). Efforts to reduce errors by requiring patients to produce their social security cards have not been entirely beneficial.

Beginning in the late 1930s, the E. H. Ferree Company, a manufacturer of wallets, promoted their product's card pocket by including a sample social security card with each wallet sold. The display card had the social security number of one of their employees. Many people found it convenient to use the card as their own social security number. Over time, the wallet display number was claimed by over 40,000 people. Today, few institutions require individuals to prove their identity by showing their original social security card. Doing so puts an unreasonable burden on the honest patient (who does not happen to carry his/her card) and provides an advantage to criminals (who can easily forge a card).

Entities that compel individuals to provide a social security number have dubious legal standing. The social security number was originally intended as a device for validating a person's standing in the social security system. More recently, the purpose of the social security number has been expanded to track taxable transactions (i.e., bank accounts, salaries). Other uses of the social security number are not protected by law. The Social Security Act (Section 208 of Title 42 U.S. Code 408) prohibits most entities from compelling anyone to divulge his/her social security number.

Considering the unreliability of social security numbers in most transactional settings, and considering the tenuous legitimacy of requiring individuals to divulge their social security numbers, a prudently designed medical identifier system will limit its reliance on these numbers. Combining the social security number with name and date of birth virtually guarantees that the identifier system will violate the strict one-to-a-customer rule.

Most identifiers are not purely random numbers; they usually contain some embedded information that can be interpreted by anyone familiar with the identification system. For example, they may embed the first three letters of the individual's family name in the identifier. Likewise, the last two digits of the birth year are commonly embedded in many types of identifiers. Such information is usually included as a crude “honesty” check by people “in the know.” For instance, the nine digits of a social security number are divided into an area code (first three digits), a group number (the next two digits), followed by a serial number (last four digits). People with expertise in the social security numbering system can pry considerable information from a social security number, and can determine whether certain numbers are bogus, based on the presence of excluded sub-sequences.

Seemingly inconsequential information included in an identifier can sometimes be used to discover confidential information about individuals. Here is an example. Suppose every client transaction in a retail store is accessioned under a unique number, consisting of the year of the accession, followed by the consecutive count of accessions, beginning with the first accession of the new year. For example, accession 2010-3518582 might represent the 3,518,582nd purchase transaction in the year 2010. Because each number is unique, and because the number itself says nothing about the purchase, it may be assumed that inspection of the accession number would reveal nothing about the transaction.

Actually, the accession number tells you quite a lot. The prefix (2010) tells you the year of the purchase. If the accession number had been 2010-0000001, then you could safely say that accession represented the first item sold on the first day of business in the year 2010. For any subsequent accession number in 2010, simply divide the suffix number (in this case 3,518,582) by the last accession number of the year, and multiply by 365 (the number of days in a non-leap year), and you have the approximate day of the year that the transaction occurred. This day can easily be converted to a calendar date.
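The estimation strategy just described can be sketched directly. The accession count of 3,518,582 comes from the example above; the year-end total of 7,500,000 and the function name are hypothetical:

```python
import datetime

def approximate_purchase_date(accession, last_accession_of_year, year):
    """Estimate a transaction's calendar date from its accession number,
    assuming accessions accrue evenly through a non-leap year."""
    day_of_year = max(1, round(accession / last_accession_of_year * 365))
    return datetime.date(year, 1, 1) + datetime.timedelta(days=day_of_year - 1)

# Hypothetical: the 3,518,582nd of 7,500,000 total transactions in 2010
# lands a little before the midpoint of the year.
print(approximate_purchase_date(3518582, 7500000, 2010))
```

The estimate is rough (sales are rarely uniform through the year), but it is more than precise enough for the deanonymization scenario described next.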

Unimpressed? Consider this scenario. You know that a prominent member of the President's staff had visited a Washington, D.C. Hospital on February 15, 2010, for the purpose of having a liver biopsy. You would like to know the results of that biopsy. You go to a Web site that lists the deidentified pathology records for the hospital, for the years 2000–2010. Though no personal identifiers are included in these public records, the individual records are sorted by accession numbers. Using the aforementioned strategy, you collect all of the surgical biopsies performed on or about February 15, 2010. Of these biopsies, only three are liver biopsies. Of these three biopsies, only one was performed on a person whose gender and age matched the President's staff member. The report provides the diagnosis. You managed to discover some very private information without access to any personal identifiers.

The alphanumeric character string composing the identifier should not expose the patient's identity. For example, a character string consisting of a concatenation of the patient's name, birth date, and social security number might serve to uniquely identify an individual, but it could also be used to steal an individual's identity. The safest identifiers are random character strings containing no information whatsoever.
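The contrast between an information-bearing identifier and a safe one is stark in code. The name, date, and nine-digit number below are hypothetical placeholders:

```python
import secrets

# Unsafe: the identifier itself leaks the name, birth date, and social
# security number it was concatenated from (all values hypothetical).
unsafe_id = "MaryMeagher" + "19120607" + "123456789"

# Safe: a random character string containing no information whatsoever.
safe_id = secrets.token_urlsafe(16)

print(unsafe_id)
print(safe_id)
```

Anyone who intercepts `unsafe_id` can read the private data straight out of it; intercepting `safe_id` reveals nothing about the person or record it designates.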

Section 3.5. Registering Unique Object Identifiers

It isn't that they can't see the solution. It's that they can't see the problem.

G. K. Chesterton

Registries are trusted services that provide unique identifiers to objects. The idea is that everyone using the object will use the identifier provided by the central registry. Unique object registries serve a very important purpose, particularly when the object identifiers are persistent. It makes sense to have a central authority for Web addresses, library acquisitions, and journal abstracts. Such registries include:

  •   DOI, Digital object identifier
  •   PMID, PubMed identification number
  •   LSID (Life Science Identifier)
  •   HL7 OID (Health Level 7 Object Identifier)
  •   DICOM (Digital Imaging and Communications in Medicine) identifiers
  •   ISSN (International Standard Serial Numbers)
  •   Social Security Numbers (for United States population)
  •   NPI, National Provider Identifier, for physicians
  •   Clinical Trials Protocol Registration System
  •   Office of Human Research Protections FederalWide Assurance number
  •   Data Universal Numbering System (DUNS) number
  •   International Geo Sample Number
  •   DNS, Domain Name System
  •   URL, Uniform Resource Locator [Glossary URL]
  •   URN, Uniform Resource Name [Glossary URN]

In some cases the registry does not provide the full identifier for data objects. The registry may provide a general identifier sequence that will apply to every data object in the resource. Individual objects within the resource are provided with a non-unique registry number. A unique suffix sequence is appended locally (i.e., not by a central registrar). Life Science Identifiers (LSIDs) serve as a typical example of a registered identifier. Every LSID is composed of the following five parts: Network Identifier, root DNS name of the issuing authority, name chosen by the issuing authority, a unique object identifier assigned locally, and an optional revision identifier for versioning information.

In the issued LSID identifier, the parts are separated by a colon, as shown:

urn:lsid:pdb.org:1AFT:1

This identifies the first version of the 1AFT protein in the Protein Data Bank. Here are a few LSIDs:

urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434

This identifies a PubMed citation

urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2

This refers to the second version of an entry in GenBank
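Given the five-part layout described above, an LSID can be pulled apart with a simple colon split. This sketch assumes well-formed LSIDs of the kind just shown:

```python
def parse_lsid(lsid):
    # urn : lsid : issuing authority : namespace : object id [: revision]
    parts = lsid.split(":")
    return {
        "authority": parts[2],
        "namespace": parts[3],
        "object_id": parts[4],
        "revision": parts[5] if len(parts) > 5 else None,
    }

print(parse_lsid("urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2"))
# {'authority': 'ncbi.nlm.nih.gov', 'namespace': 'GenBank',
#  'object_id': 'T48601', 'revision': '2'}
```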

An OID, short for Object Identifier, is a hierarchy of identifier prefixes. Successive numbers in the prefix identify the descending order of the hierarchy. Here is an example of an OID from HL7, an organization that deals with health data interchanges:

1.3.6.1.4.1.250

Each node is separated from its successor by a dot. Successively finer registration detail leads to the institutional code (the final node). In this case the institution identified by the HL7 OID happens to be the University of Michigan.

The final step in creating an OID for a data object involves placing a unique identifier number at the end of the registered prefix. OID organizations leave the final step to the institutional data managers.

The problem with this approach is that the final within-institution data object identifier is sometimes prepared thoughtlessly, corrupting the OID system [7]. Here is an example. Hospitals use an OID system for identifying images, part of the DICOM (Digital Imaging and Communications in Medicine) image standard. There is a prefix consisting of a permanent, registered code for the institution and the department, and a suffix consisting of a number generated for an image as it is created.

A hospital may assign consecutive numbers to its images, appending these numbers to an OID that is unique for the institution and the department within the institution. For example, the first image created with a CT-scanner might be assigned an identifier consisting of the OID (the assigned code for institution and department) followed by a separator such as a hyphen, followed by “1.”
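That local assignment step might be sketched as follows; the OID prefix shown is fabricated for illustration and is not a registered code:

```python
class ImageIdAssigner:
    """Appends a locally generated serial number to a registered OID prefix."""
    def __init__(self, oid_prefix):
        self.oid_prefix = oid_prefix  # e.g., a fabricated "1.2.840.99999.1"
        self.counter = 0

    def next_id(self):
        self.counter += 1
        return self.oid_prefix + "-" + str(self.counter)

scanner_a = ImageIdAssigner("1.2.840.99999.1")
print(scanner_a.next_id())  # 1.2.840.99999.1-1
print(scanner_a.next_id())  # 1.2.840.99999.1-2
```

Note that a second, independently counting assigner built on the same prefix would emit exactly the same identifiers, which is precisely the collision problem described next.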

In a worst-case scenario, different instruments may assign consecutive numbers to images, independently of one another. This means that the CT-scanner in room A may be creating the same identifier (OID + image number) as the CT-scanner in Room B, for images of different patients. This problem could be remedied by constraining each CT-scanner to avoid using numbers assigned by any other CT-scanner. This remedy can be defeated if there is a glitch anywhere in the system that accounts for image assignments (e.g., if the counters are reset, broken, replaced or simply ignored).

When image counting is done properly, and the scanners are constrained to assign unique numbers (not previously assigned by other scanners in the same institution), each image may indeed have a unique identifier (OID prefix + image number suffix). Nonetheless, the use of consecutive numbers for images will create havoc over time. Problems arise when the image service is assigned to another department in the institution, or when departments or institutions merge. Each of these shifts produces a change in the OID (the institutional and departmental prefix) assigned to the identifier. If a consecutive numbering system is used, then you can expect to create duplicate identifiers if institutional prefixes are replaced after the merge. The old records in both of the merging institutions will be assigned the same prefix and will contain replicate (consecutively numbered) suffixes (e.g., image 1, image 2, etc.).

Yet another problem may occur if one unique object is provided with multiple different unique identifiers. A software application may be designed to ignore any previously assigned unique identifier and to generate its own identifier, using its own assignment method. Doing so provides software vendors with a strategy that insulates the vendors from bad identifiers created by their competitors' software, and locks the customer to a vendor's software, and identifiers, forever.

In the end, OID systems provide a good set of identifiers for the institution, but the data objects created within the institution need to have their own identifier systems. Here is the HL7 statement on replicate OIDs:

Though HL7 shall exercise diligence before assigning an OID in the HL7 branch to third parties, given the lack of a global OID registry mechanism, one cannot make absolutely certain that there is no preexisting OID assignment for such third-party entity [8].

It remains to be seen whether any of the registration identifier systems will be used and supported with any serious level of permanence (e.g., over decades and centuries).

Section 3.6. Deidentification and Reidentification

Never answer an anonymous letter.

Yogi Berra

For scientists, deidentification serves two purposes:

  •   To protect the confidentiality and the privacy of the individual (when the data concerns a particular human subject), and
  •   To remove information that might bias the experiment (e.g., to blind the experimentalist to patient identities).

Deidentification involves stripping information from a data record that might link the record to the public name of the record's subject. In the case of a patient record, this would involve stripping any information from the record that would enable someone to connect the record to the name of the patient. The most obvious item to be removed in the deidentification process is the patient's name. Other information that should be removed would be the patient's address (which could be linked to the name), the patient's date of birth (which narrows down the set of individuals to whom the data record might pertain), and the patient's social security number. In the United States, patient privacy regulations include a detailed discussion of record deidentification and this discussion recommends 18 patient record items for exclusion from deidentified records [9].

Before going any further, it is important to clarify that deidentification is not achieved by removing an identifier from a data object. In point of fact, nothing good is ever achieved by simply removing an identifier from a data object; doing so simply invalidates the data object (i.e., every data object, identified or deidentified, must have an identifier). Deidentification involves removing information contained in the data object that reveals something about the publicly known name of the data object. This kind of information is often referred to as identifying information, but it would be much less confusing if we used another term for such data, such as “name-linking information.” The point here is that we do not want to confuse the identifier of a data object with information contained in a data object that can link the object to its public name.

It may seem counterintuitive, but there is very little difference between an identifier and a deidentifier; under certain conditions the two concepts are equivalent. Here is how a dual identification/deidentification system might work:

  1.  Collect data on a unique object: “Joe Ferguson's bank account contains $100.”
  2.  Assign a unique identifier: “Joe Ferguson's bank account is 7540038947134.”
  3.  Substitute the name of the object with its assigned unique identifier: “7540038947134 contains $100.”
  4.  Consistently use the identifier with the data.
  5.  Do not let anyone know that Joe Ferguson owns account “7540038947134.”
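The five steps above can be condensed into a few lines; the dictionaries here are purely illustrative stand-ins for a bank's records:

```python
# Steps 1-2: a confidential map linking the public name to its unique identifier
secret_registry = {"Joe Ferguson": "7540038947134"}  # never shared (step 5)

# Steps 3-4: all working data is keyed to the identifier, never to the name
accounts = {"7540038947134": {"balance_usd": 100}}

# Anyone holding only 'accounts' sees data bound to an opaque number;
# without 'secret_registry' the record is effectively deidentified
print(accounts["7540038947134"]["balance_usd"])  # 100
```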

The dual use of an identifier/deidentifier is a tried and true technique. Swiss bank accounts are essentially unique numbers (identifiers) assigned to a person. You access the bank account by producing the identifier number. The identifier number does not provide information about the identity of the bank account holder (i.e., it is a deidentifier and an identifier).

The purpose of an identifier is to tell you that whenever the identifier is encountered, it refers to the same unique object, and whenever two different identifiers are encountered, they refer to different objects. The identifier, by itself, should contain no information that links the data object to its public name.

It is important to understand that the process of deidentification can succeed only when each record is properly identified (i.e., there can be no deidentification without identification). Attempts to deidentify a poorly identified data set of clinical information will result in replicative records (multiple records for one patient), mixed-in records (single records composed of information on multiple patients), and missing records (unidentified records lost in the deidentification process).

The process of deidentification is best understood as an algorithm performed on-the-fly, in response to a query from a data analyst. Here is how such an algorithm might proceed.

  1.  The data analyst submits a query requesting a record from a Big Data resource. The resource contains confidential records that must not be shared, unless the records are deidentified.
  2.  The Big Data resource receives the query and retrieves the record.
  3.  A copy of the record is parsed and any of the information within the data record that might link the record to the public name of the subject of the record (usually the name of an individual) is deleted from the copy. This might include the aforementioned name, address, date of birth, and social security number.
  4.  A pseudo-identifier sequence is prepared for the deidentified record. The pseudo-identifier sequence might be generated by a random number generator, by encrypting the original identifier, through a one-way hash algorithm, or by other methods chosen by the Big Data manager. [Glossary Encryption]
  5.  A transaction record is attached to the original record that includes the pseudo-identifier, the deidentified record, the time of the transaction, and any information pertaining to the requesting entity (e.g., the data analyst who sent the query) that is deemed fit and necessary by the Big Data resource data manager.
  6.  A record is sent to the data analyst that consists of the deidentified record (i.e., the record stripped of its true identifier and containing no data that links the record to a named person) and the unique pseudo-identifier created for the record.
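The six steps above might be sketched as follows. The field names, the use of uuid4 for the pseudo-identifier, and the in-memory transaction log are illustrative choices only; a one-way hash or an encrypted identifier would serve equally well in step 4:

```python
import time
import uuid

# Illustrative list of name-linking fields to strip (step 3)
NAME_LINKING_FIELDS = {"name", "address", "birth_date", "ssn"}

def deidentify(record, transaction_log, requester):
    # Step 3: copy the record, dropping the original identifier and any
    # fields that could link the copy to a public name
    scrubbed = {k: v for k, v in record.items()
                if k not in NAME_LINKING_FIELDS and k != "id"}
    # Step 4: prepare a pseudo-identifier for the deidentified copy
    pseudo_id = str(uuid.uuid4())
    scrubbed["pseudo_id"] = pseudo_id
    # Step 5: attach a transaction record tied to the original record
    transaction_log.append({"pseudo_id": pseudo_id, "orig_id": record["id"],
                            "time": time.time(), "requester": requester})
    # Step 6: return the deidentified record plus its pseudo-identifier
    return scrubbed

log = []
record = {"id": "a1", "name": "Joe Ferguson", "ssn": "000-00-0000", "glucose": 85}
out = deidentify(record, log, "analyst_7")
print(out)  # pseudo-identified record; no name, ssn, or original identifier
```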

Because the deidentified record and its unique pseudo-identifier are stored with the original record, subsequent requests for the pseudo-identified record can be retrieved and provided, at the discretion of the Big Data manager. This general approach to data deidentification will apply to requests for a single record or to millions of records.

At this point, you might be asking yourself the following question, “What gives the data manager the right to distribute parts of a confidential record, even if it happens to be deidentified?” You might think that if you tell someone a secret, under the strictest confidence, then you would not want any part of that secret to be shared with anyone else. The whole notion of sharing confidential information that has been deidentified may seem outrageous and unacceptable.

We will discuss the legal and ethical issues of Big Data in Chapters 18 and 19. For now, readers should know that there are several simple and elegant principles that justify sharing deidentified data.

Consider the statement “Jules Berman has a blood glucose level of 85.” This would be considered a confidential statement because it tells people something about my medical condition.

Consider the phrase, “Blood glucose 85.”

When the name “Jules Berman” is removed, we are left with a disembodied piece of data. “Blood glucose 85” is no different from “Temperature 98.6” or “Apples 2” or “Terminator 3.” They are simply raw data belonging to nobody in particular. The act of removing information linking data to a person renders the data harmless. Because the use of properly deidentified data poses no harm to human subjects, United States Regulations allow the unrestricted use of such data for research purposes [9,10]. Other countries have similar provisions.

  •   Reidentification

Because confidentiality and privacy concerns always apply to human subject data, it would seem imperative that deidentification should be an irreversible process (i.e., the names of the subjects and samples should be held a secret, forever).

Scientific integrity does not always accommodate irreversible deidentification. On occasion, experimental samples are mixed up; samples thought to come from a certain individual, tissue, record, or account may in fact come from another source. Sometimes major findings in science need to be retracted when a sample mix-up has been shown to occur [11,12,13,14,15]. Even when samples are submitted without mix-up, the data is sometimes collected improperly. For example, reversing electrodes on an electrocardiogram may yield spurious and misleading results. Sometimes data is purposefully fabricated or otherwise corrupted, to suit the personal agendas of dishonest scientists. When data errors occur, regardless of reason, it is important to retract the publications [16,17]. To preserve scientific integrity, it is sometimes necessary to discover the identity of deidentified records.

In some cases, deidentification stops the data analyst from helping individuals whose confidentiality is being protected. Imagine you are conducting an analysis on a collection of deidentified data, and you find patients with a genetic marker for a disease that is curable, if treated at an early stage; or you find a new biomarker that determines which patients would benefit from surgery and which patients would not. You would be compelled to contact the subjects in the database to give them information that could potentially save their lives. Having an irreversibly deidentified data set precludes any intervention with subjects; nobody knows their identities.

Deidentified records can, under strictly controlled circumstances, be reidentified. Reidentification is typically achieved by entrusting a third party with a confidential list that maps individuals to their deidentified records. Obviously, reidentification can only occur if the Big Data resource keeps a link connecting the identifiers of their data records to the identifiers of the corresponding deidentified record (what we've been calling pseudo-identifiers). The act of assigning a public name to the deidentified record must always involve strict oversight. The data manager must have in place a protocol that describes the process whereby approval for reidentification is obtained. Reidentification provides an opportunity whereby confidentiality can be breached and human subjects can be harmed. Consequently, stewarding the reidentification process is one of the most serious responsibilities of Big Data managers [18].

Section 3.7. Case Study: Data Scrubbing

It is a sin to believe evil of others but it is seldom a mistake.

Garrison Keillor

The term “data scrubbing” is sometimes used, mistakenly, as a synonym for deidentification. It is best to think of data scrubbing as a process that begins where deidentification ends. A data scrubber will remove unwanted information from a data record, including information of a personal nature and any information that is not directly related to the purpose of the data record. For example, in the case of a hospital record a data scrubber might remove the names of physicians who treated the patient; the names of hospitals or medical insurance agencies; addresses; dates; and any textual comments that are inappropriate, incriminating, irrelevant, or potentially damaging. [Glossary Data munging, Data scraping, Data wrangling]

In medical data records, there is a concept known as “minimal necessary” that applies to shared confidential data [9]. It holds that when records are shared, only the minimum necessary information should be released. Any information not directly relevant to the intended purposes of the data analyst should be withheld. The process of data scrubbing gives data managers the opportunity to render a data record that is free of information that would link the record to its subject and free of extraneous information that the data analyst does not actually require. [Glossary Minimal necessary]

There are many methods for data scrubbing. Most of these methods require that data managers develop an exception list of items that should not be included in shared records (e.g., cities, states, zip codes, and names of people). The scrubbing application moves through the records, extracting unnecessary information along the way. The end product is cleaned, but not sterilized. Though many undesired items can be successfully removed, this approach never produces a perfectly scrubbed set of data. In a Big Data resource, it is simply impossible for the data manager to anticipate every objectionable item and to include it in an exception list. Nobody is that smart.

There is, however, a method whereby data records can be cleaned, without error. This method involves creating a list of data (often in the form of words and phrases) that is acceptable for inclusion in a scrubbed and deidentified data set. Any data that is not in the list of acceptable information is automatically deleted. Whatever is left is the scrubbed data. This method can be described as a reverse scrubbing method. Everything in the data set is automatically deleted, unless it is an approved “exception.”

This method of scrubbing is very fast and can produce an error-free deidentified and scrubbed output [4,19,20]. An example of the kind of output produced by such a scrubber is shown:

Since the time when ⁎ ⁎ ⁎ ⁎ ⁎ ⁎ ⁎ ⁎ his own ⁎ and the ⁎ ⁎ ⁎ ⁎, the anomalous ⁎ ⁎ have been ⁎ and persistent ⁎ ⁎ ⁎; and especially ⁎ true of the construction and functions of the human ⁎, indeed, it was the anomalous that was ⁎ ⁎ ⁎ in the ⁎ the attention, ⁎ ⁎ that were ⁎ to develop into the body ⁎ ⁎ which we now ⁎ ⁎. As by the aid ⁎ ⁎ ⁎ ⁎ ⁎ ⁎ ⁎ ⁎ ⁎ our vision into the ⁎ ⁎ ⁎ has emerged ⁎, we find ⁎ ⁎ and even evidence of ⁎. To the highest type of ⁎ ⁎ it is the ⁎ the ordinary ⁎ ⁎ ⁎ ⁎ ⁎. ⁎ to such, no less than to the most ⁎, ⁎ ⁎ ⁎ is of absorbing interest, and it is often ⁎ ⁎ that the ⁎ ⁎ the most ⁎ into the heart of the mystery of the ordinary. ⁎ ⁎ been said, ⁎ ⁎ ⁎ ⁎ ⁎. ⁎ ⁎ dermoid cysts, for example, we seem to ⁎ ⁎ ⁎ the secret ⁎ of Nature, and ⁎ out into the ⁎ ⁎ of her clumsiness, and ⁎ of her ⁎ ⁎ ⁎ ⁎, ⁎, ⁎ tell us much of ⁎ ⁎ ⁎ used by the vital ⁎ ⁎ ⁎ ⁎ even the silent ⁎ ⁎ ⁎ upon the ⁎ ⁎ ⁎.

The reverse-scrubber requires the preexistence of a set of approved terms. One of the simplest methods for generating acceptable terms involves extracting them from a nomenclature that comprehensively covers the terms used in a knowledge domain. For example, a comprehensive listing of living species will not contain dates or zip codes or any of the objectionable language or data that should be excluded from a scrubbed data set. In a method that I have published, a list of approved doublets (approximately 200,000 two-word phrases collected from standard nomenclatures) is automatically collected for the scrubbing application [4]. The script is fast, and its speed is not significantly reduced by the size of the list of approved terms.

Here is a short Python script, scrub.py, that will take any line of text and produce a scrubbed output. It requires an external file, doublets.txt, containing an approved list of doublet terms.

import sys, re

# Build a dictionary of the approved doublet terms
doub_hash = {}
with open("doublets.txt", "r") as doub_file:
    for line in doub_file:
        doub_hash[line.rstrip()] = True

print("What would you like to scrub?")
line = sys.stdin.readline().lower().rstrip()
linearray = re.split(r' +', line)
lastword = "⁎"
for i in range(len(linearray)):
    doublet = " ".join(linearray[i:i + 2])
    if i + 1 < len(linearray) and doublet in doub_hash:
        # Both members of an approved doublet are retained in the output
        print(" " + linearray[i], end="")
        lastword = " " + linearray[i + 1]
    else:
        # Emit the pending word (or a "⁎" blackout token); the final pass
        # through the loop always lands here, flushing the last pending word
        print(lastword, end="")
        lastword = " ⁎"

Section 3.8. Case Study (Advanced): Identifiers in Image Headers

Plus ca change, plus c'est la meme chose.

Old French saying (“The more things change, the more things stay the same.”)

As it happens, nothing is ever as simple as it ought to be. When systems employ long sequence generators to produce unique identifiers, the most common implementation problem is the indiscriminate assignment of additional unique identifiers to the same data object, thus nullifying the potential benefits of the unique identifier system.

Let us look at an example wherein multiple identifiers are redundantly assigned to the same image, corrupting the identifier system. In Section 4.3, we discuss image headers, and we provide examples wherein the ImageMagick “identify” utility could extract the textual information included in the image header. One of the header properties created, inserted, and extracted by ImageMagick's “identify” is an image-specific unique string. [Glossary ImageMagick]

When ImageMagick is installed on our computer, we can extract any image's unique string, using the “identify” utility and the “-format” attribute, on the following system command line: [Glossary Command line]

c:\ftp>identify -verbose -format "%#" eqn.jpg

Here, the image file we are examining is “eqn.jpg”. The “%#” character string is ImageMagick's special syntax indicating that we would like to extract the image identifier from the image header. The output is shown.

219e41b4c761e4bb04fbd67f71cc84cd6ae53a26639d4bf33155a5f62ee36e33

We can repeat the command line whenever we like, for this image; and the same image-specific unique sequence of characters will be produced.

Using ImageMagick, we can insert text into the “comment” section of the header, using the “-set” attribute. Let us add the text, “I'm modifying myself”:

c:\ftp>convert eqn.jpg -set comment "I'm modifying myself" eqn.jpg

Now, let us extract the comment that we just added, to satisfy ourselves that the “-set” attribute operated as we had hoped. We do this using the “-format” attribute and the “%c” character string, which is ImageMagick's syntax for extracting the comment section of the header.

c:ftp > identify -verbose -format "%c" eqn.jpg

The output of the command line is:

I'm modifying myself

Now, let us run, one more time, the command line that produces the character string specific to the eqn.jpg image file:

c:ftp > identify -verbose -format "%#" eqn.jpg

The output is:

cb448260d6eeeb2e9f2dcb929fa421b474021584e266d486a6190067a278639f

What just happened? Why has the unique character string specific for the eqn.jpg image changed? Has our small modification of the file, which consisted of adding a text comment to the image header, resulted in the production of a new image object, worthy of a new unique identifier?

Before answering these very important questions, let us pose the following gedanken question. Imagine you have a tree. This tree, like every living organism, is unique. It has a unique history, a unique location, and a unique genome (i.e., a unique sequence of nucleotides composing its genetic material). In ten years, its leaves drop off and are replaced ten times. Its trunk expands in size and its height increases. In the ten years of its existence, has the identity of the tree changed? [Glossary Gedanken]

You would probably agree that the tree has changed, but that it has maintained its identity (i.e., it is still the same tree, containing the descendants of the same cells that grew within the younger version of itself).

In informatics, a newly created object is given an identifier, and this identifier is immutable (i.e., cannot be changed), regardless of how the object is modified. In the case of the unique string assigned to an image by ImageMagick, the string serves as an authenticator, not as an identifier. When the image is modified a new unique string is created. By comparing the so-called identifier string in copies of the image file, we can determine whether any modifications have been made. That is to say, we can authenticate the file.

Getting back to the image file in our example, when we modified the image by inserting a text comment, ImageMagick produced a new unique string for the image. The identity of the image had not changed, but the image was different from the original image (i.e., no longer authentic). It seems that the string that we thought to be an identifier string was actually an authenticator string. [Glossary Authentication]

If we want an image to have a unique identifier that does not change when the image is modified, we must create our own identifier that persists when the image is modified.

Here is a short Python script, image_id.py, that uses Python's standard UUID method to create an identifier, inserts it into the comment section of the image's header, and flanks the identifier with XML tags. [Glossary XML, HTML]

import os, uuid

# Create a unique identifier, flanked by XML tags
my_id = "<image_id>" + str(uuid.uuid4()) + "</image_id>"
in_command = 'convert leaf.jpg -set comment "' + my_id + '" leaf.jpg'
os.system(in_command)
out_command = 'identify -verbose -format "%c" leaf.jpg'
print("\nHere's the unique identifier:")
os.system(out_command)
print("\nHere's the unique authenticator:")
os.system('identify -verbose -format "%#" leaf.jpg')
# Modify the image; the authenticator will change, the identifier will not
os.system("convert leaf.jpg -resize 325x500! leaf.jpg")
print("\nHere's the new authenticator:")
os.system('identify -verbose -format "%#" leaf.jpg')
print("\nHere's the unique identifier:")
os.system(out_command)

Here is the output of the image_id.py script:

Here's the unique identifier:
<image_id>b0836a26-8f0e-4a6b-842d-9b0dde2b3f59</image_id>

Here's the unique authenticator:
98c9fe07e90ce43f49961ab6226cd1ccffee648edd1a456a9d06a53ad6d3215a

Here's the new authenticator:
017e401d80a41aafa289ae9c2a1adb7c00477f7a943143141912189499d69ad2

Here's the unique identifier:
<image_id>b0836a26-8f0e-4a6b-842d-9b0dde2b3f59</image_id>

What did the script do and what does it teach us? It employed the UUID utility to create a unique and permanent identifier for the image (leaf.jpg, in this case), and inserted the unique identifier into the image header. This identifier, “b0836a26-8f0e-4a6b-842d-9b0dde2b3f59,” did not change when the image was subsequently modified. A new authenticator string was automatically inserted into the image header, by ImageMagick, when the image was modified. Hence, we achieved what we needed to achieve: a unique identifier that never changes, and a unique authenticator that changes when the image is modified in any way.

If you have followed the logic of this section, then you are prepared for the following question posed as an exercise for Zen Buddhists. Imagine you have a hammer. Over the years, you have replaced its head, twice, and its handle, thrice. In this case, with nothing remaining of the original hammer, has it maintained its identity (i.e., is it still the same hammer?). The informatician would answer “Yes,” the hammer has maintained its unique identity, but it is no longer authentic (i.e., it is what it must always be, though it has become something different).

Section 3.9. Case Study: One-Way Hashes

I live on a one-way street that's also a dead end. I'm not sure how I got there.

Steven Wright

A one-way hash is an algorithm that transforms a string into another string in such a way that the original string cannot be calculated by operations on the hash value (hence the term “one-way” hash). Popular one-way hash algorithms are MD5 and the Secure Hash Algorithm (SHA). A one-way hash value can be calculated for any character string, including a person's name, or a document, or even another one-way hash. For a given input string, the resultant one-way hash will always be the same.

Here are a few examples of one-way hash outputs performed on a sequential list of input strings, followed by their one-way hash (md5 algorithm) output.

Jules Berman => Ri0oaVTIAilwnS8+nvKhfA
"Whatever" => n2YtKKG6E4MyEZvUKyGWrw
Whatever => OkXaDVQFYjwkQ+MOC8dpOQ
jules berman => SlnuYpmyn8VXLsxBWwO57Q
Jules J. Berman => i74wZ/CsIbxt3goH2aCS+A
Jules J Berman => yZQfJmAf4dIYO6Bd0qGZ7g
Jules Berman => Ri0oaVTIAilwnS8+nvKhfA

The one-way hash values are a seemingly random sequence of ASCII characters (the characters available on a standard keyboard). Notice that a small variation among input strings (e.g., exchanging an uppercase for a lowercase character, adding a period or quotation mark) produces a completely different one-way hash output. The first and the last entry (Jules Berman) yield the same one-way hash output (Ri0oaVTIAilwnS8+nvKhfA) because the two input strings are identical. A given string will always yield the same hash value, so long as the hashing algorithm is not altered. Each one-way hash has the same length (22 characters for this particular md5 algorithm) regardless of the length of the input term. A one-way hash output of the same length (22 characters) could have been produced for a string or file or document of any length. Once produced, there is no feasible mathematical algorithm that can reconstruct the input string from its one-way hash output. In our example, there is no way of examining the string “Ri0oaVTIAilwnS8+nvKhfA” and computing the name Jules Berman.
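The 22-character strings above have the look of base64-encoded md5 digests with their two trailing "=" padding characters removed. Under that assumption, similar values can be produced with Python's standard hashlib and base64 modules (the specific outputs printed by this sketch are not asserted to match the ones shown in the text):

```python
import base64
import hashlib

def md5_b64(text):
    # 16-byte md5 digest -> 24-character base64 string -> strip "==" padding,
    # leaving a 22-character string of the form shown above
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")

print(md5_b64("Jules Berman"))   # always the same 22 characters for this input
print(md5_b64("jules berman"))   # a change of case yields an unrelated hash
```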

We see that the key functional difference between a one-way hash and a UUID sequence is that the one-way hash algorithm, performed on a unique string, will always yield the same random-appearing alphanumeric sequence. A UUID algorithm has no input string; it simply produces unique alphanumeric output, and never (almost never) produces the same alphanumeric output twice.

One-way hash values can serve as ersatz identifiers, permitting Big Data resources to accrue data, over time, to a specific record, even when the record is deidentified (e.g., even when its UUID identifier has been stripped from the record). Here is how it works [18]:

  1.  A data record is chosen, before it is deidentified, and a one-way hash is performed on its unique identifier string.
  2.  The record is deidentified by removing the original unique identifier. The output of the one-way hash (from step 1) is substituted for the original unique identifier.
  3.  The record is deidentified because nobody can reconstruct the original identifier from the one-way hash that has replaced it.
  4.  The same process is done for every record in the database.
  5.  All of the data records that were associated with the original identifier will now have the same one-way hash identifier and can be collected under this substitute identifier, which cannot be computationally linked to the original identifier.
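Assuming each record carries its original identifier in an "id" field, the substitution described in the steps above can be sketched in a few lines (md5 is used here by way of example):

```python
import hashlib

def hash_id(identifier):
    # One-way hash of the original identifier; not feasibly reversible
    return hashlib.md5(identifier.encode("utf-8")).hexdigest()

def deidentify_records(records):
    # Steps 1-4: replace each original identifier with its one-way hash
    return [{**r, "id": hash_id(r["id"])} for r in records]

records = [{"id": "patient-77", "glucose": 85},
           {"id": "patient-77", "temperature": 98.6}]
out = deidentify_records(records)

# Step 5: both records still share one (hashed) identifier, so they can be
# accrued to the same subject without revealing "patient-77"
print(out[0]["id"] == out[1]["id"])  # True
```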

Implementation of one-way hashes carries certain practical problems. If anyone happens to have a complete listing of all of the original identifiers, then it would be a simple matter to perform one-way hashes on every listed identifier. This would produce a look-up table that can match deidentified records back to the original identifiers, a strategy known as a dictionary attack. For deidentification to work, the original identifier sequences must be kept secret.

One-way hash protocols have many practical uses in the field of information science [21,18,4]. It is very easy to implement one-way hashes, and most programming languages and operating systems come bundled with one or more implementations of one-way hash algorithms. The two most popular one-way hash algorithms are md5 (message digest version 5) and SHA (Secure Hash Algorithm). [Glossary HMAC, Digest, Message digest, Check digit]

Here we use Cygwin's own md5sum.exe utility on the command line to produce a one-way hash for an image file, named dash.png:

c:\ftp> c:\cygwin64\bin\md5sum.exe dash.png

Here is the output:

db50dc33800904ab5f4ac90597d7b4ea *dash.png

We could call the same command line from a Python script:

import os
# Call the md5sum utility on dash.png, exactly as on the command line
os.system("c:/cygwin64/bin/md5sum.exe dash.png")

The output will always be the same, as long as the input file, dash.png, does not change:

db50dc33800904ab5f4ac90597d7b4ea *dash.png

OpenSSL contains several one-way hash implementations, including both md5 and several variants of SHA.
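For readers who prefer to stay within Python, the standard hashlib module computes the same md5 and SHA values as the md5sum and openssl utilities; the file_hash wrapper below is our own illustrative sketch:

```python
import hashlib

def file_hash(path, algorithm="md5"):
    # Read the file in 8192-byte chunks so arbitrarily large files
    # can be hashed without loading them into memory all at once
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Example (assuming dash.png exists in the working directory):
# print(file_hash("dash.png"))            # md5, matching md5sum's output
# print(file_hash("dash.png", "sha256"))  # a SHA variant
```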

One-way hashes on files are commonly used as a quick and convenient authentication tool. When you download a file from a Web site, you are likely to see that the file distributor has posted the file's one-way hash value. When you receive the file, it is a good idea to calculate the one-way hash on the file that you have received. If the one-way hash value is equal to the posted one-way hash value, then you can be certain that the file received is an exact copy of the file that was intentionally sent. Of course, this does not ensure that the file that was intentionally sent was a legitimate file or that the website was an honest file broker. We will be using our knowledge of one-way hashes when we discuss trusted time stamps (Section 8.5), blockchains (Section 8.6) and data security protocols (Section 18.3).

Glossary

ASCII ASCII is the American Standard Code for Information Interchange, ISO-14962-1997. The ASCII standard is a way of assigning specific 8-bit strings (a string of 0s and 1s of length 8) to the alphanumeric characters and punctuation. Uppercase letters are assigned a different string of 0s and 1s than their matching lowercase letters. There are 256 ways of combining 0s and 1s in strings of length 8. This means that there are 256 different ASCII characters, and every ASCII character can be assigned a number-equivalent, in the range of 0–255. The familiar keyboard keys produce ASCII characters that happen to occupy ASCII values under 128. Hence, alphanumerics and common punctuation are represented as 8-bits, with the first bit, “0”, serving as padding. Consequently, keyboard characters are commonly referred to as 7-bit ASCII, and files composed exclusively of common keyboard characters are referred to as plain-text files or as 7-bit ASCII files.
These are the classic ASCII characters:
!"#$%&'()*+,-./0123456789:;<=>
?@ABCDEFGHIJKLMNOPQRSTUVWXYZ
[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
Python has several methods for removing non-printable characters from text, including the “printable” method, as shown in this short script, printable.py.
# -*- coding: iso-8859-15 -*-
import string
in_string = "prinüéâäàtable"
out_string = "".join(s for s in in_string if s in string.printable)
print(out_string)
output:
printable
It is notable that the first line of code seems to violate a fundamental law of Python programming: the pound sign signifies that a comment follows, and the Python interpreter will ignore the pound sign and any characters that follow it on the same line. For obscure reasons, the top line of the snippet is a permitted exception to the rule. In nonpythonic language, the top line conveys to the Python compiler that it may expect to find non-ASCII characters encoded in the iso-8859-15 standard.
The end result of this strange snippet is that non-ASCII characters are stripped from input strings; a handy script worth saving.

American Standard Code for Information Interchange Long form of the familiar acronym, ASCII.

Annotation Annotation involves describing data elements with metadata or attaching supplemental information to data objects.

Authentication A process for determining if the data object that is received (e.g., document, file, image) is the data object that was intended to be received. The simplest authentication protocol involves one-way hash operations on the data that needs to be authenticated. Suppose you happen to know that a certain file, named temp.txt will be arriving via email and that this file has an MD5 hash of “a0869a42609af6c712caeba454f47429”. You receive the temp.txt file, and you perform an MD5 one-way hash operation on the file.
In this example, we will use the md5 hash utility bundled into the CygWin distribution (i.e., the Linux emulator for Windows systems). Any md5 implementation would have sufficed.
c:\cygwin64\bin> openssl md5 temp.txt
MD5(temp.txt)= a0869a42609af6c712caeba454f47429

We see that the md5 hash value generated for the received file is identical to the md5 hash value produced on the file, by the file's creator, before the file was emailed. This tells us that the received file, temp.txt, is authentic (i.e., it is the file that you were intended to receive) because no other file would be expected to produce the same MD5 hash. Additional implementations of one-way hashes are described in Section 3.9. The authentication process, in this example, does not tell you who sent the file, the time that the file was created, or anything about the validity of the contents of the file. These would require a protocol that included signature, time stamp, and data validation, in addition to authentication. In common usage, authentication protocols often include entity authentication (i.e., some method by which the entity sending the file is verified). Consequently, authentication protocols are often confused with signature verification protocols. An ancient historical example serves to distinguish the concepts of authentication protocols and signature protocols. Since the earliest days of recorded history, fingerprints have been used as a method of authentication. When a scholar or artisan produced a product, he would press his thumb into the clay tablet, or the pot, or the wax seal closing a document. Anyone doubting the authenticity of the pot could ask the artisan for a thumbprint. If the new thumbprint matched the thumbprint on the tablet, pot, or document, then all knew that the person creating the new thumbprint and the person who had put his thumbprint into the object were the same individual. Hence, ancient pots were authenticated. Of course, this was not proof that the object was the creation of the person with the matching thumbprint. For all anyone knew, there may have been a hundred different pottery artisans, with one person pressing his thumb into every pot produced. You might argue that the thumbprint served as the signature of the artisan. In practical terms, no.
The thumbprint, by itself, does not tell you whose print was used. Thumbprints could not be read, at least not in the same way as a written signature. The ancients needed to compare the pot's thumbprint against the thumbprint of the living person who made the print. When the person died, civilization was left with a bunch of pots with the same thumbprint, but without any certain way of knowing whose thumb produced them. In essence, because there was no ancient database that permanently associated thumbprints with individuals, the process of establishing the identity of the pot-maker became very difficult once the artisan died. A good signature protocol permanently binds an authentication code to a unique entity (e.g., a person). Today, we can find a fingerprint at the scene of a crime; we can find a matching signature in a database; and we can link the fingerprint to one individual. Hence, in modern times, fingerprints are true “digital” signatures, no pun intended. Modern uses of fingerprints include keying (e.g., opening locked devices based on an authenticated fingerprint), tracking (e.g., establishing the path and whereabouts of an individual by following a trail of fingerprints or other identifiers), and body part identification (i.e., identifying the remains of individuals recovered from mass graves or from the sites of catastrophic events based on fingerprint matches). Over the past decade, flaws in the vaunted process of fingerprint identification have been documented, and the improvement of the science of identification is an active area of investigation [22].

Check digit A checksum that produces a single digit as output is referred to as a check digit. Some of the common identification codes in use today, such as ISBN numbers for books, come with a built-in check digit. Of course, when using a single digit as a check value, you can expect that some transmitted errors will escape the check, but the check digit is useful in systems wherein occasional mistakes are tolerated; or wherein the purpose of the check digit is to find a specific type of error (e.g., an error produced by a substitution in a single character or digit), and wherein the check digit itself is rarely transmitted in error.
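As an illustration, the check digit built into an ISBN-10 makes the weighted sum of all ten characters divisible by 11; a short sketch (the function name is ours):

```python
def isbn10_check_digit(first_nine):
    # Weights run 10 down to 2 over the first nine digits; the check
    # digit completes the sum to a multiple of 11 ("X" stands for 10)
    total = sum(int(d) * w for d, w in zip(first_nine, range(10, 1, -1)))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)

print(isbn10_check_digit("030640615"))  # 2 (the check digit of ISBN 0-306-40615-2)
```

A single-character substitution anywhere in the first nine digits changes the weighted sum, and hence fails the check, which is exactly the error type this scheme was designed to catch.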

Command line Instructions to the operating system that can be directly entered as a line of text from the system prompt (e.g., the so-called C prompt, “c:>”, in Windows and DOS operating systems; the so-called shell prompt, “$”, in Linux-like systems).

Command line utility Programs lacking graphic user interfaces that are executed via command line instructions. The instructions for a utility are typically couched as a series of arguments, on the command line, following the name of the executable file that contains the utility.

Data cleaning More correctly, data cleansing, and synonymous with data fixing or data correcting. Data cleaning is the process by which errors, spurious anomalies, and missing values are somehow handled. The options for data cleaning are: correcting the error, deleting the error, leaving the error unchanged, or imputing a different value [23]. Data cleaning should not be confused with data scrubbing.

Data munging Refers to a multitude of tasks involved in preparing data for some intended purpose (e.g., data cleaning, data scrubbing, and data transformation). Synonymous with data wrangling.

Data scraping Pulling together desired sections of a data set or text by using software.

Data scrubbing A term that is very similar to data deidentification and is sometimes used improperly as a synonym for data deidentification. Data scrubbing refers to the removal of unwanted information from data records. This may include identifiers, private information, or any incriminating or otherwise objectionable language contained in data records, as well as any information deemed irrelevant to the purpose served by the record.

Data wrangling Jargon referring to a multitude of tasks involved in preparing data for eventual analysis. Synonymous with data munging [24].

Deidentification The process of removing all of the links in a data record that can connect the information in the record to an individual. This usually includes the record identifier, demographic information (e.g., place of birth), personal information (e.g., birthdate), and biometrics (e.g., fingerprints). The deidentification strategy will vary based on the type of records examined. Deidentifying protocols exist wherein deidentified records can be reidentified, when necessary.

Digest As used herein, “digest” is equivalent to a one-way hash algorithm. The word “digest” also refers to the output string produced by a one-way hash algorithm.

Electronic medical record Abbreviated as EMR, or as EHR (Electronic Health Record). The EMR is the digital equivalent of a patient's medical chart. Central to the idea of the EMR is the notion that all of the documents, transactions, and all packets of information containing test results and other information on a patient are linked to the patient's unique identifier. By retrieving all data linked to the patient's identifier, the EMR (i.e., the entire patient's chart) can be assembled instantly.

Encapsulation The concept, from object oriented programming, that a data object contains its associated data. Encapsulation is tightly linked to the concept of introspection, the process of accessing the data encapsulated within a data object. Encapsulation, Inheritance, and Polymorphism are available features of all object-oriented languages.

Encryption A common definition of encryption involves an algorithm that takes some text or data and transforms it, bit-by-bit, into an output that cannot be interpreted (i.e., from which the contents of the source file cannot be determined). Encryption comes with the implied understanding that there exists some reverse transform that can be applied to the encrypted data, to reconstitute the original source. As used herein, the definition of encryption is expanded to include any protocols by which files can be shared, in such a way that only the intended recipients can make sense of the received documents. This would include protocols that divide files into pieces that can only be reassembled into the original file using a password. Encryption would also include protocols that alter parts of a file while retaining the original text in other parts of the file. As described in Chapter 5, there are instances when some data in a file should be shared, while only specific parts need to be encrypted. The protocols that accomplish these kinds of file transformations need not always employ classic encryption algorithms (e.g., Winnowing and Chaffing [25], threshold protocols [21]).

Gedanken Gedanken is the German word for “thought.” A gedanken experiment is one in which the scientist imagines a situation and its outcome, without resorting to any physical construction of a scientific trial. Albert Einstein, a consummate theoretician, was fond of inventing imaginary scenarios, and his use of the term “gedanken trials” has done much to popularize the concept. The scientific literature contains multiple descriptions of gedanken trials that have led to fundamental breakthroughs in our understanding of the natural world and of the universe [26].

HMAC Hashed Message Authentication Code. When a one-way hash is employed in an authentication protocol, it is often referred to as an HMAC.

HTML HyperText Markup Language is an ASCII-based set of formatting instructions for web pages. HTML formatting instructions, known as tags, are embedded in the document, and double-bracketed, indicating the start point and end points for instruction. Here is an example of an HTML tag instructing the web browser to display the word “Hello” in italics: <i>Hello</i>. All web browsers conforming to the HTML specification must contain software routines that recognize and implement the HTML instructions embedded within web documents. In addition to formatting instructions, HTML also includes linkage instructions, in which the web browsers must retrieve and display a listed web page, or a web resource, such as an image. The protocol whereby web browsers, following HTML instructions, retrieve web pages from other Internet sites, is known as HTTP (HyperText Transfer Protocol).

ImageMagick An open source utility that supports a huge selection of robust and sophisticated image editing methods. ImageMagick is available for download at: https://www.imagemagick.org/script/download.php

Instance An instance is a specific example of an object that is not itself a class or group of objects. For example, Tony the Tiger is an instance of the tiger species. Tony the Tiger is a unique animal and is not itself a group of animals or a class of animals. The terms instance, instance object, and object are sometimes used interchangeably, but the special value of the “instance” concept, in a system wherein everything is an object, is that it distinguishes members of classes (i.e., the instances) from the classes to which they belong.

Intellectual property Data, software, algorithms, and applications that are created by an entity capable of ownership (e.g., humans, corporations, and universities). The entity holds rights over the manner in which the intellectual property can be used and distributed. Protections for intellectual property may come in the form of copyrights and patents. Copyright applies to published information. Patents apply to novel processes and inventions. Certain types of intellectual property can only be protected by being secretive. For example, magic tricks cannot be copyrighted or patented; this is why magicians guard their intellectual property so closely. Intellectual property can be sold outright, essentially transferring ownership to another entity; but this would be a rare event. In other cases, intellectual property is retained by the creator who permits its limited use to others via a legal contrivance (e.g., license, contract, transfer agreement, royalty, and usage fee). In some cases, ownership of the intellectual property is retained, but the property is freely shared with the world (e.g., open source license, GNU license, FOSS license, and Creative Commons license).

Message digest Within the context of this book, “message digest”, “digest”, “HMAC”, and “one-way hash” are equivalent terms.

Minimal necessary In the field of medical informatics, there is a concept known as “minimal necessary” that applies to shared confidential data [9]. It holds that when records are shared, only the minimum necessary information should be released. Information not directly relevant to the intended purposes of the study should be withheld.

Object-oriented programming In object-oriented programming, all data objects must belong to one of the classes built into the language or to a class created by the programmer. Class methods are subroutines that belong to a class. The members of a class have access to the methods for the class. There is a hierarchy of classes (with superclasses and subclasses). A data object can access any method from any superclass of its class. All object-oriented programming languages operate under this general strategy. The two most important differences among the object oriented programming languages relate to syntax (i.e., the required style in which data objects call their available methods) and content (the built-in classes and methods available to objects). Various esoteric issues, such as types of polymorphism offered by the language, multi-parental inheritance, and non-Boolean logic operations may play a role in how expert programmers choose a specific object-oriented language for the job at hand.

One-way hash A one-way hash is an algorithm that transforms one string into another string (a fixed-length sequence of seemingly random characters) in such a way that the original string cannot be calculated by operations on the one-way hash value (i.e., the calculation is one-way only). One-way hash values can be calculated for any string, including a person's name, a document, or an image. For any given input string, the resultant one-way hash will always be the same. If a single byte of the input string is modified, the resulting one-way hash will be changed, and will have a totally different sequence than the one-way hash sequence calculated for the unmodified string.
Most modern programming languages have several methods for generating one-way hash values. Regardless of the language we choose to implement a one-way hash algorithm (e.g., md5, SHA), the output value will be identical. One-way hash values are designed to produce long fixed-length output strings (e.g., 256 bits in length). When the output of a one-way hash algorithm is very long, the chance of a hash string collision (i.e., the occurrence of two different input strings generating the same one-way hash output value) is negligible. Clever variations on one-way hash algorithms have been repurposed as identifier systems [27,28,29,30]. A detailed discussion of one-way hash algorithms can be found in Section 3.9, “Case Study: One-Way Hashes.”

Privacy versus confidentiality The concepts of confidentiality and of privacy are often confused, and it is useful to clarify their separate meanings. Confidentiality is the process of keeping a secret with which you have been entrusted. You break confidentiality if you reveal the secret to another person. You violate privacy when you use the secret to annoy the person whose confidential information was acquired. If you give a friend your unlisted telephone number in confidence, then your friend is expected to protect this confidentiality by never revealing the number to other persons. In addition, your friend may be expected to protect your privacy by resisting the temptation to call you in the middle of the night to complain about a mutual acquaintance. In this case, the same information object (unlisted telephone number) is encumbered by separable confidentiality and privacy obligations.

Pseudorandom number generator It is impossible for computers to produce an endless collection of truly random numbers. Eventually, algorithms will cycle through their available variations and begin to repeat themselves, producing the same set of “random” numbers, in the same order; a phenomenon referred to as the generator's period. Because algorithms that produce seemingly random numbers are imperfect, they are known as pseudorandom number generators. The Mersenne Twister algorithm, which has an extremely long period, is used as the default random number generator in Python. This algorithm performs well on most of the tests that mathematicians have devised to test randomness.
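The deterministic character of a pseudorandom generator is easy to demonstrate in Python: seeding the Mersenne Twister with the same value reproduces the identical "random" sequence.

```python
import random

# Seed the Mersenne Twister and draw five numbers
random.seed(42)
first_run = [random.randint(0, 99) for _ in range(5)]

# Re-seed with the same value: the identical sequence reappears
random.seed(42)
second_run = [random.randint(0, 99) for _ in range(5)]

print(first_run == second_run)  # True: same seed, same sequence
```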

Randomness Various tests of randomness are available [31]. One of the easiest to implement takes advantage of the property that random strings are uncompressible. If you can show that a character string, a series of numbers, or a column of data cannot be compressed by gzip, then it is pretty safe to conclude that the data is randomly distributed, and without any informational value.
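The compressibility test can be sketched with Python's zlib module (which implements the same DEFLATE compression used by gzip):

```python
import os
import zlib

# A highly patterned string compresses dramatically
patterned = b"abcabcabc" * 1000          # 9000 bytes
# Random bytes do not compress at all
random_bytes = os.urandom(9000)          # 9000 bytes

print(len(zlib.compress(patterned)))     # far smaller than 9000
print(len(zlib.compress(random_bytes)))  # about 9000, or slightly larger
```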

Reidentification A term casually applied to any instance whereby information can be linked to a specific person after the links between the information and the person associated with the information were removed. Used this way, the term reidentification connotes an insufficient deidentification process. In the healthcare industry, the term “reidentification” means something else entirely. In the United States, regulations define “reidentification” under the “Standards for Privacy of Individually Identifiable Health Information”. Reidentification is defined therein as a legally valid process whereby deidentified records can be linked back to the respective human subjects, under circumstances deemed compelling by a privacy board. Reidentification is typically accomplished via a confidential list of links between human subject names and deidentified records, held by a trusted party. As used by the healthcare industry, reidentification only applies to the approved process of re-establishing the identity of a deidentified record. When a human subject is identified through fraud, trickery, or through the deliberate use of computational methods to break the confidentiality of insufficiently deidentified records, the term “reidentification” would not apply.

Social Security Number The common strategy, in the United States, of employing social security numbers as identifiers is often counterproductive, owing to entry error, mistaken memory, or the intention to deceive. Efforts to reduce errors by requiring individuals to produce their original social security cards puts an unreasonable burden on honest individuals, who rarely carry their cards, and provides an advantage to dishonest individuals, who can easily forge social security cards. Institutions that compel patients to provide a social security number have dubious legal standing. The social security number was originally intended as a device for validating a person's standing in the social security system. More recently, the purpose of the social security number has been expanded to track taxable transactions (i.e., bank accounts, salaries). Other uses of the social security number are not protected by law. The Social Security Act (Section 208 of Title 42 U.S. Code 408) prohibits most entities from compelling anyone to divulge his/her social security number. Legislation or judicial action may one day stop healthcare institutions from compelling patients to divulge their social security numbers as a condition for providing medical care. Prudent and forward-thinking institutions will limit their reliance on social security numbers as personal identifiers.

Time stamp Many data objects are temporal events and all temporal events must be given a time stamp indicating the time that the event occurred, using a standard measurement for time. The time stamp must be accurate, persistent, and immutable. The Unix epoch time (equivalent to the Posix epoch time) is available for most operating systems and consists of the number of seconds that have elapsed since January 1, 1970, midnight, Greenwich mean time. The Unix epoch time can easily be converted into any other standard representation of time. The duration of any event can be easily calculated by subtracting the beginning time from the ending time. Because the timing of events can be maliciously altered, scrupulous data managers employ a trusted time stamp protocol by which a time stamp can be verified. A trusted time stamp must be accurate, persistent, and immutable. Trusted time stamp protocols are discussed in Section 8.5, “Case Study: The Trusted Time stamp.”
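Python's time and datetime modules illustrate the epoch-time conventions described here:

```python
import datetime
import time

# Unix (Posix) epoch time: seconds elapsed since
# January 1, 1970, midnight, Greenwich mean time
now = time.time()

# An epoch value converts easily into other standard representations
stamp = datetime.datetime.fromtimestamp(now, tz=datetime.timezone.utc)
print(stamp.isoformat())

# Durations are simple subtractions of epoch values
start = time.time()
duration = time.time() - start
print(duration)  # elapsed seconds
```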

URL Uniform Resource Locator. The Web is a collection of resources, each having a unique address, the URL. When you click on a link that specifies a URL, your browser fetches the page located at the unique location specified in the URL name. If the Web were designed otherwise (i.e., if several different web pages had the same web address, or if one web address were located at several different locations), then the web could not function with any reliability.

URN Uniform Resource Name. Whereas the URL identifies objects based on the object's unique location in the Web, the URN is a system of object identifiers that are location-independent. In the URN system, data objects are provided with identifiers, and the identifiers are registered with, and subsumed by, the URN.
For example:
urn:isbn-13:9780128028827
Refers to the unique book, “Repurposing Legacy Data: Innovative Case Studies,” by Jules Berman
urn:uuid:e29d0078-f7f6-11e4-8ef1-e808e19e18e5
Refers to a data object tied to the UUID identifier e29d0078-f7f6-11e4-8ef1-e808e19e18e5.
In theory, if every data object were assigned a registered URN, and if the system were implemented as intended, the entire universe of information could be tracked and searched.

UUID UUID, the abbreviation for Universally Unique IDentifiers, is a protocol for assigning identifiers to data objects, without using a central registry. UUIDs were originally used in the Apollo Network Computing System [3].

Utility In the context of software, a utility is an application that is dedicated to performing one specific task, very well, and very fast. In most instances, utilities are short programs, often running from the command line, and thus lacking any graphic user interface. Many utilities are available at no cost, with open source code. In general, simple utilities are preferable to multi-purpose software applications [32]. Remember, an application that claims to do everything for the user is, most often, an application that requires the user to do everything for the application.

XML Abbreviation for eXtensible Markup Language. A syntax for marking data values with descriptors (metadata). The descriptors are commonly known as tags. In XML, every data value is enclosed by a start-tag, indicating that a value will follow, and an end-tag, indicating that the value had preceded the tag. For example: <name>Tara Raboomdeay</name>. The enclosing angle brackets, “<>”, and the end-tag marker, “/”, are hallmarks of XML markup. This simple but powerful relationship between metadata and data allows us to employ each metadata/data pair as though it were a small database that can be combined with related metadata/data pairs from any other XML document. The full value of metadata/data pairs comes when we can associate the pair with a unique object, forming a so-called triple.

References

[1] Reed D.P. Naming and synchronization in a decentralized computer system. Doctoral Thesis MIT; 1978.

[2] Joint NEMA/COCIR/JIRA Security and Privacy Committee (SPC). Identification and allocation of basic security rules in healthcare imaging systems. Available from: http://www.medicalimaging.org/wp-content/uploads/2011/02/Identification_and_Allocation_of_Basic_Security_Rules_In_Healthcare_Imaging_Systems-September_2002.pdf. September 2002 [viewed January 10, 2013].

[3] Leach P., Mealling M., Salz R. A Universally Unique IDentifier (UUID) URN Namespace. Network Working Group, Request for Comment 4122, Standards Track. Available from: http://www.ietf.org/rfc/rfc4122.txt [viewed November 7, 2017].

[4] Berman J.J. Methods in medical informatics: fundamentals of healthcare programming in Perl, Python, and Ruby. Boca Raton: Chapman and Hall; 2010.

[5] Mathis F.H. A generalized birthday problem. SIAM Rev. 1991;33:265–270.

[6] Dimitropoulos LL. Privacy and security solutions for interoperable health information exchange perspectives on patient matching: approaches, findings, and challenges. RTI International, Indianapolis, June 30, 2009.

[7] Kuzmak P., Casertano A., Carozza D., Dayhoff R., Campbell K. Solving the Problem of Duplicate Medical Device Unique Identifiers High Confidence Medical Device Software and Systems (HCMDSS) Workshop, Philadelphia, PA, June 2–3. Available from: http://www.cis.upenn.edu/hcmdss/Papers/submissions/. 2005 [viewed August 26, 2012].

[8] Health Level 7 OID Registry. Available from: http://www.hl7.org/oid/frames.cfm [viewed August 26, 2012].

[9] Department of Health and Human Services. 45 CFR (code of federal regulations), parts 160 through 164. Standards for privacy of individually identifiable health information (final rule). Fed Regist. 2000;65(250):82461–82510 December 28.

[10] Department of Health and Human Services. 45 CFR (Code of Federal Regulations), 46. Protection of Human Subjects (Common Rule). Fed Regist. 1991;56:28003–28032 June 18.

[11] Knight J. Agony for researchers as mix-up forces retraction of ecstasy study. Nature. 2003;425:109.

[12] Sainani K. Error: What biomedical computing can learn from its mistakes. Biomed Comput Rev. 2011;12–19 Fall.

[13] Palanichamy M.G., Zhang Y. Potential pitfalls in MitoChip detected tumor-specific somatic mutations: a call for caution when interpreting patient data. BMC Cancer. 2010;10:597.

[14] Bandelt H., Salas A. Contamination and sample mix-up can best explain some patterns of mtDNA instabilities in buccal cells and oral squamous cell carcinoma. BMC Cancer. 2009;9:113.

[15] Harris G. U.S. inaction lets look-alike tubes kill patients. The New York Times; 2010 August 20.

[16] Flores G. Science retracts highly cited paper: study on the causes of childhood illness retracted after author found guilty of falsifying data. The Scientist; 2005 June 17.

[17] Gowen L.C., Avrutskaya A.V., Latour A.M., Koller B.H., Leadon S.A. Retraction of: Gowen LC, Avrutskaya AV, Latour AM, Koller BH, Leadon SA. Science. 1998 Aug 14;281(5379):1009–12. Science. 2003;300:1657.

[18] Berman J.J. Confidentiality issues for medical data miners. Artif Intell Med. 2002;26:25–36.

[19] Berman J.J. Concept-match medical data scrubbing: how pathology datasets can be used in research. Arch Pathol Lab Med. 2003;127:680–686.

[20] Berman J.J. Comparing de-identification methods. March 31. Available from: http://www.biomedcentral.com/1472-6947/6/12/comments/comments.htm. 2006 [viewed Jan. 1, 2015].

[21] Berman J.J. Threshold protocol for the exchange of confidential medical data. BMC Med Res Methodol. 2002;2:12.

[22] A review of the FBI's handling of the Brandon Mayfield case. U. S. Department of Justice, Office of the Inspector General, Oversight and Review Division; March 2006.

[23] Van den Broeck J., Cunningham S.A., Eeckels R., Herbst K. Data cleaning: detecting, diagnosing, and editing data abnormalities. PLoS Med. 2005;2:e267.

[24] Lohr S. For big-data scientists, ‘janitor work' is key hurdle to insights. The New York Times; 2014 August 17.

[25] Rivest R.L. Chaffing and winnowing: confidentiality without encryption. MIT Lab for Computer Science; 1998. March 18, (rev. April 24, 1998). Available from: http://people.csail.mit.edu/rivest/chaffing-980701.txt [viewed January 10, 2017].

[26] Berman J.J. Armchair science: no experiments, just deduction. Amazon Digital Services, Inc. (Kindle Book); 2014.

[27] Faldum A., Pommerening K. An optimal code for patient identifiers. Comput Methods Prog Biomed. 2005;79:81–88.

[28] Rivest R, Request for comments: 1321, the MD5 message-digest algorithm. Network Working Group. https://www.ietf.org/rfc/rfc1321.txt [viewed January 1, 2015].

[29] Bouzelat H., Quantin C., Dusserre L. Extraction and anonymity protocol of medical file. Proc AMIA Annu Fall Symp. 1996;1996:323–327.

[30] Quantin C.H., Bouzelat F.A., Allaert A.M., Benhamiche J., Faivre J., Dusserre L. Automatic record hash coding and linkage for epidemiological followup data confidentiality. Methods Inf Med. 1998;37:271–277.

[31] Marsaglia G., Tsang W.W. Some difficult-to-pass tests of randomness. J Stat Softw. 2002;7:1–8.

[32] Brooks F.P. No silver bullet: essence and accidents of software engineering. Computer. 1987;20:10–19.
