8

Immutability and Immortality

Abstract

One of the most important features of serious Big Data resources is immutability. The rule is simple. Data is immortal and cannot change. You can add data to the system, but you can never alter data and you can never erase data. Immutability is a counterintuitive concept for many people, including some data analysts. If a patient has a glucose level of 100 on Monday, and the same patient has a glucose level of 115 on Tuesday, then it would seem obvious that his glucose level changed between Monday and Tuesday. Not so. Until the end of time, Monday's glucose level will always be 100. On Tuesday, another glucose level was added to the record for the patient. Nothing that existed prior to Tuesday was changed. This chapter discusses immutability as it applies to incorporating legacy data into Big Data resources, correcting and updating data and records, reconciling data object identifiers across Big Data resources, and merging data between resources.

Keywords

Legacy data; Retrospective data; Abandoned data; Integration of information systems; Immutability; Serious Big Data

Section 8.1. The Importance of Data That Cannot Change

Cheese is milk's leap toward immortality

Clifton Fadiman (editor of Mathematical Magpie)

Immutability is one of those issues, like identifiers and introspection, that seem unimportant, until something goes terribly wrong. Then, in the midst of the problem, you realize that your entire information system was designed incorrectly, and there really is nothing you can do to cope.

Here is an example of an immutability problem. You are a pathologist working in a university hospital that has just installed a new, $600 million information system. On Tuesday, you released a report on a surgical biopsy, indicating that it contained cancer. On Friday morning, you showed the same biopsy to your colleagues, who all agreed that the biopsy was not malignant, and contained a benign condition that simulated malignancy (looked a little like a cancer, but was not). Your original diagnosis was wrong, and now you must rectify the error. You return to the computer, and access the prior report, changing the wording of the diagnosis to indicate that the biopsy is benign. You can do this, because pathologists are granted “edit” access for pathology reports. Now, everything seems to have been set right. The report has been corrected, and the final report in the computer is the official diagnosis.

Unknown to you, the patient's doctor read the incorrect report on Wednesday, the day after the incorrect report was issued, and two days before the correct report replaced the incorrect report. Major surgery was scheduled for the following Wednesday (five days after the corrected report was issued). Most of the patient's liver was removed. No cancer was found in the excised liver. Eventually, the surgeon and patient learned that the original report had been altered. The patient sued the surgeon, the pathologist, and the hospital.

You, the pathologist, argued in court that the computer held one report issued by the pathologist (following the deletion of the earlier, incorrect report) and that report was correct and available to the surgeon prior to the surgery date. Therefore, you said, you made no error. The patient's lawyer had access to a medical chart in which paper versions of the diagnosis had been kept. The lawyer produced, for the edification of the jury, two reports from the same pathologist, on the same biopsy: one positive for cancer, the other negative for cancer. The hospital, conceding that they had no credible defense, settled out of court for a very large quantity of money. Meanwhile, back in the hospital, a fastidious intern is deleting an erroneous diagnosis, and substituting her improved rendition.

One of the most important features of serious Big Data resources (such as the data collected in hospital information systems) is immutability. The rule is simple. Data is immortal and cannot change. You can add data to the system, but you can never alter data and you can never erase data. Immutability is counterintuitive to most people, including most data analysts. If a patient has a glucose level of 100 on Monday, and the same patient has a glucose level of 115 on Tuesday, then it would seem obvious that his glucose level changed between Monday and Tuesday. Not so. Monday's glucose level remains at 100. Until the end of time, Monday's glucose level will always be 100. On Tuesday, another glucose level was added to the record for the patient. Nothing that existed prior to Tuesday was changed. [Glossary Serious Big Data]

Section 8.2. Immutability and Identifiers

People change and forget to tell each other.

Lillian Hellman

Immutability applies to identifiers. In a serious Big Data resource, data objects never change their identity (i.e., their identifier sequences). Individuals never change their names. A person might add a married name, but the married name does not change the maiden name. The addition of a married name might occur as follows:

18843056488 is_a     patient
18843056488 has_a     maiden_name
18843056488 has_a     married_name
9937564783  is_a     maiden_name
4401835284  is_a     married_name
18843056488 maiden_name Karen Sally Smith
18843056488 married_name Karen Sally Smythe

Here, we have a woman named Karen Sally Smith. She has a unique, immutable identifier, “18843056488.” Her patient record has various metadata/data pairs associated with her unique identifier. Karen is a patient, Karen has a maiden name, and Karen has a married name. The metadata tags that describe the data that is associated with Karen include “maiden_name” and “married_name.” These metadata tags are themselves data objects. Hence, they must be provided with unique, immutable identifiers. Though metadata tags are themselves unique data objects, each metadata tag can be applied to many other data objects. In the following example, the unique maiden_name and married_name tags are associated with two different patients.

9937564783 is_a     maiden_name
4401835284 is_a     married_name
18843056488 is_a     patient
18843056488 has_a     maiden_name
18843056488 has_a     married_name
18843056488 maiden_name Karen Sally Smith
18843056488 married_name Karen Sally Smythe
73994611839 is_a     patient
73994611839 has_a     maiden_name
73994611839 has_a     married_name
73994611839 maiden_name Barbara Hay Wire
73994611839 married_name Barbara Haywire

The point here is that patients may acquire any number of names over the course of their lives, but the Big Data resource must have a method for storing and describing each of those names, and for associating them with the same unique patient identifier. Everyone who uses a Big Data resource must be confident that all the data objects in the resource are unique, identified, and immutable.

By now, you should be comfortable with the problem confronted by the pathologist who changed his mind. Rather than simply replacing one report with another, the pathologist might have issued a modification report, indicating that the new report supersedes the earlier report. In this case, the information system does not destroy or replace the earlier report, but creates a companion report. As a further precaution the information system might flag the earlier report with a link to the subsequent entry. Alternately, the information system might allow the pathologist to issue an addendum (i.e., add-on text) to the original report. The addendum could have clarified that the original diagnosis was incorrect, and that the final diagnosis is the diagnosis provided in the addendum. Another addendum might indicate that the staff involved in the patient's care was notified of the updated diagnosis. The parts of the report (including any addenda) could be dated and authenticated with the electronic signature of the pathologist. Not one byte in the original report is ever changed. Had these procedures been implemented, the unnecessary surgery, the harm inflicted on the patient, the lawsuit, and the settlement, might have all been avoided. [Glossary Digital signature]
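
To make the append-only approach concrete, here is a minimal sketch, in Python, of a report object that can only be corrected by addenda; the class and method names (Report, add_addendum) are hypothetical, invented for illustration.

import datetime

class Report:
    def __init__(self, report_id, diagnosis, author):
        self.report_id = report_id
        self.entries = [self._entry("original", diagnosis, author)]

    def _entry(self, kind, text, author):
        return {"kind": kind,
                "text": text,
                "author": author,
                "timestamp": datetime.datetime.utcnow().isoformat()}

    def add_addendum(self, text, author):
        # nothing already in self.entries is modified; we only append
        self.entries.append(self._entry("addendum", text, author))

    def current_diagnosis(self):
        # the most recent entry carries the operative diagnosis,
        # but every earlier entry remains readable forever
        return self.entries[-1]["text"]

report = Report("S-2017-0042", "Malignant (adenocarcinoma)", "Dr. Smith")
report.add_addendum("Amended diagnosis: benign lesion simulating malignancy. "
                    "This addendum supersedes the original diagnosis.", "Dr. Smith")
print(report.current_diagnosis())
print(len(report.entries))   # 2; the original, incorrect report is still on record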

The problem of updating diagnoses may seem like a problem that is specific for the healthcare industry. It is not. The content of Big Data resources is constantly changing; the trick is to accommodate all changes by the addition of data, not by the deletion or modification of data. For example, suppose a resource uses an industry standard for catalog order numbers assigned to parts of an automobile. These 7-digit numbers are used whenever a part needs to be purchased. The resource may inventory millions of different parts, each with an order number annotation. What happens when the standard suddenly changes, and 12-digit numbers replace all of the existing 7-digit numbers? A well-managed resource will preserve all of the currently held information, including the metadata tag that describes the 7-digit standard and the 7-digit order number for each part in the resource inventory. The new standard, containing 12-digit numbers, will have a different metadata tag from the prior standard, and the new metadata/data pair will be attached to the internal identifier for the part. This operation will work if the resource maintains its own unique identifiers for every data object held in the resource and if the data objects in the resource are associated with metadata/data pairs. All of these actions involve adding information to data objects, not deleting information.
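
As a concrete illustration, here is a minimal sketch, in Python, of how the new standard could be accommodated by adding a second metadata/data pair; the tag names and part numbers are hypothetical.

# each annotation is a triple: (identifier, metadata_tag, value)
triples = [("part_0000001", "order_number_7digit_standard", "0473829")]

# when the 12-digit standard arrives, a new metadata/data pair is added
# under a new tag; the 7-digit annotation is never deleted or altered
triples.append(("part_0000001", "order_number_12digit_standard", "000000473829"))

# both annotations remain available to queries against either standard
for identifier, tag, value in triples:
    if identifier == "part_0000001":
        print(tag, value)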

In the days of small data, this was not much of a problem. The typical small data scenario would involve creating a set of data, all at once, followed soon thereafter by a sweeping analytic procedure applied against the set of data, culminating in a report that summarized the conclusions. If there was some problem with the study, a correction would be made, and everything would be repeated. A second analysis would be performed on the new and improved data set. It was all so simple.

A procedure for replicative annotations to accommodate the introduction of new standards and nomenclatures as well as new versions of old standards and nomenclatures is one of the more onerous jobs of the Big Data curator. Over the years, dozens of new or additional annotations could be required. It should be stressed that replicative annotations for nomenclatures and standards can be avoided if the data objects in the resource are not tied to any specific standard. If the data objects are well specified (i.e., providing adequate and uniform descriptions), queries can be matched against any standard nomenclature on-the-fly (i.e., as needed, in response to queries), as previously discussed in Section 2.5, “Autocoding” [1]. [Glossary Curator]

Why is it always bad to change the data objects held in a Big Data resource? Though there are many possible negative repercussions to deleting and modifying data, most of the problems come down to data verification and time stamping. All Big Data resources must be able to verify that the data held in the resource conforms to a set of protocols for preparing data objects and measuring data values. When you change pre-existing data, all of your efforts at resource verification are wasted, because the resource that you once verified no longer exists. The resource has become something else. Aside from producing an unverifiable resource, you put the resource user into the untenable position of deciding which data to believe: the old data or the new data. Time stamping is another component of data objects. Events (e.g., a part purchased, a report issued, a file opened) have no meaning unless you know when they occurred. Timestamps applied to data objects must be unique and immutable. A single event cannot occur at two different times. [Glossary Time stamp, Verification and validation]

  •  Immortal Data Objects

In Section 6.2, we defined the term “data object.” To review, a data object is a collection of triples that have the same identifier. A respectable data object should always encapsulate two very specific triples: one that tells us the class to which the data object holds membership, and another that tells us the name of the parent class from which the data object descends. When these two triples are included in the data object, we can apply the logic and the methods of object-oriented programming to Big Data objects.

In addition, we should note that if the identifier and the associated metadata/data pairs held by the data object are immutable (as they must be, vide supra), and if all the data held in the Big Data resource is preserved indefinitely (as it should be), then the data objects achieve immortality. If every data object has metadata/data pairs specifying its class and parent class, then all of the relationships among every data object in the Big Data resource will apply forever. In addition, all of the class-specific methods can always be applied to the objects belonging to the class and to its subclass descendants, and all of the encapsulated data can always be reconstructed. This would hold true, even if the data objects were reduced to their individual triples, scattered across the planet, and deposited into countless data clouds. The triples could, in theory, reassemble into data objects under their immortal identifiers.
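
Here is a minimal sketch, in Python, of how scattered triples could be reassembled into data objects by grouping on their immutable identifiers; the triples are drawn from the examples in this chapter.

from collections import defaultdict

scattered_triples = [("98495efc", "object_name", "Andy Muzeack"),
                     ("a0ce8ec6", "object_name", "Homo"),
                     ("98495efc", "instance_of", "Homo sapiens"),
                     ("a1648579", "subclass_of", "Homo"),
                     ("98495efc", "dob", "1 January, 2001"),
                     ("a1648579", "object_name", "Homo sapiens"),
                     ("a0ce8ec6", "subclass_of", "Hominidae")]

data_objects = defaultdict(dict)
for identifier, metadata_tag, value in scattered_triples:
    # every triple carrying the same identifier belongs to the same data object
    data_objects[identifier][metadata_tag] = value

print(data_objects["98495efc"])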

Big Data should be designed to last forever. Hence, Big Data managers must do what seems to be impossible; they must learn how to modify data without altering the original content. The rewards are great.

Section 8.3. Coping With the Data That Data Creates

The chief problem in historical honesty isn't outright lying. It is omission or de-emphasis of important data.

Howard Zinn

Imagine this scenario. A data analyst extracts a large set of data from a Big Data resource. After subjecting the data to several cycles of the usual operations (data cleaning, data reduction, data filtering, data transformation, and the creation of customized data metrics), the data analyst is left with a new set of data, derived from the original set. The data analyst has imbued this new set of data with some added value, not apparent in the original set of data.

The question becomes, “How does the data analyst insert her new set of derived data back into the original Big Data resource, without violating immutability?” The answer is simple but disappointing; re-inserting the derived data is impossible, and should not be attempted. The transformed data set is not a collection of original measurements; the data manager of the Big Data resource can seldom verify it. Data derived from other data (e.g., age-adjustments, normalized data, averaged data values, and filtered data) will not sensibly fit into the data object model upon which the resource was created. There simply is no substitute for the original and primary data.

The data analyst should make her methods and her transformed data available for review by others. Every step involved in creating the new data set needs to be carefully recorded and explained, but the transformed set of data should not be absorbed back into the resource. The Big Data resource may provide a link to sources that hold the modified data sets. Doing so provides the public with an information trail leading from the original data to the transformed data prepared by the data analyst. [Glossary Raw data]

Section 8.4. Reconciling Identifiers Across Institutions

Mathematics is the art of giving the same name to different things.

Henri Poincaré

In math, we are taught that variables are named “x” or “y,” or sometimes “n” (if you are sure the variable is an integer). Using other variable names, such as “h” or “s,” is just asking for trouble. Computer scientists have enlarged their list of familiar variables to include “foo” and “bar.” A long program with hundreds of different local variables, all named “foo,” is unreadable, even to the person who wrote the code. The sloppiness with which mathematicians and programmers assign names has carried over into the realm of Big Data. Sometimes, it seems that data professionals just don't care much about how we name our data records, just so long as we have lots of them to play with. Consequently, we must deal with the annoying problem that arises when multiple data records, for one unique object, are assigned different identifiers (e.g., when identifier x and identifier y and identifier foo all refer to the same unique data object). The process of resolving identifier replications is known as reconciliation. [Glossary Metasyntactic variable]

In many cases, the biggest obstacle to achieving Big Data immutability is data record reconciliation [2]. When different institutions merge their data systems, it is crucial that no data is lost, and all identifiers are sensibly preserved. Cross-institutional identifier reconciliation is the process whereby institutions determine which data objects, held in different resources, are identical (i.e., the same data object). The data held in reconciled identical data objects can be combined in search results, and the identical data objects themselves can be merged (i.e., all of the encapsulated data can be combined into one data object), when Big Data resources are integrated, or when legacy data is absorbed into a Big Data resource.

In the absence of successful reconciliation, there is no way to determine the unique identity of records (i.e., duplicate data objects may exist across institutions and data users will be unable to rationally analyze data that relates to or is dependent upon the distinctions among objects in a data set). For all practical purposes, without data object reconciliation, there is no way to understand data received from multiple sources.

Reconciliation is particularly important for healthcare agencies. Some countries provide citizens with a personal medical identifier that is used in every medical facility in the nation. Hospital A can send a query to Hospital B for medical records pertaining to a patient sitting in Hospital A's emergency room. The national patient identifier ensures that the cross-institutional query will yield all of Hospital B's data on the patient, and will not include data on other patients. [Glossary National Patient Identifier]

Consider the common problem of two institutions trying to reconcile personal records (e.g., banking records, medical charts, dating service records, credit card information). When both institutions are using the same identifiers for individuals in their resources, then reconciliation is effortless. Searches on an identifier will retrieve all the information attached to the identifier, if the search query is granted access to the information systems in both institutions. However, universal identifier systems are rare. If any of the institutions lack an adequate identifier system, the data from the systems cannot be sensibly reconciled. Data pertaining to a single individual may be unattached to any identifier, attached to one or more of several different identifiers, or mixed into the records of other individuals. The merging process would fail at this point.

Assuming both institutions have adequate identifiers, then the two institutions must devise a method whereby a new identifier is created, for each record, that will be identical to the new identifier created for the same individual's record, in the other institution. For example, if each institution happens to store biometric data (e.g., retinal scans, DNA sequences, fingerprints), then the institutions might agree on a way to create a new identifier validated against these unique markers. With some testing, they could determine whether the new identifier works as specified (i.e., either institution will always create the same identifier for the same individual, and the identifier will never apply to any other individual). Once testing is finished, the new identifiers can be used for cross-institutional searches.
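
Here is a minimal sketch, in Python, of how such an identifier might be computed; the normalization step and the agreed-upon salt are assumptions, and a real implementation would be negotiated between the institutions.

import hashlib

AGREED_SALT = "cross-institution-2017"   # negotiated once, held by both institutions

def cross_institution_identifier(biometric_string):
    normalized = biometric_string.strip().upper()
    digest = hashlib.sha256((AGREED_SALT + normalized).encode("utf-8"))
    return digest.hexdigest()

# run independently at either institution, the same biometric marker
# always yields the same identifier
print(cross_institution_identifier("gattaca...ttgacca"))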

Lacking a unique biometric for individuals, reconciliation between institutions is feasible, but difficult. Some combination of identifiers (e.g., date of birth, social security number, name) might be developed. Producing an identifier from a combination of imperfect attributes has its limitations (as discussed in detail in Section 3.4, “Really Bad Identifier Methods”), but it has the advantage that if all the pre-conditions of the identifier are met, errors in reconciliation will be uncommon. In this case, both institutions will need to decide how they will handle the set of records for which there is no identifier match in the other institution. They may assume that some individuals will have records in both institutions, but their records were not successfully reconciled by the new identifier. They may also assume that the unmatched group contains individuals who actually have no records in the other institution. Dealing with unreconciled records is a nasty problem. In most cases, it requires a curator to slog through individual records, using additional data from records or new data supplied by individuals, to make adjustments, as needed. This issue will be explored further in Section 18.5, “Case Study: Personal Identifiers.”

Section 8.5. Case Study: The Trusted Timestamp

Time is what keeps everything from happening at once.

Ray Cummings in his 1922 novel, “The Girl in the Golden Atom”

Time stamps are not tamper-proof. In many instances, changing a recorded time residing in a file or data set requires nothing more than viewing the data on your computer screen and substituting one date and time for another. Dates that are automatically recorded, by your computer system, can also be altered. Operating systems permit users to reset the system date and time. Because the timing of events can be altered, scrupulous data managers employ a trusted timestamp protocol by which a timestamp can be verified.

Here is a description of how a trusted time stamp protocol might work. You have just created a message, and you need to document that the message existed on the current date. You create a one-way hash on the message (a fixed-length sequence of seemingly random alphanumeric characters). You send the one-way hash sequence to your city's newspaper, with instructions to publish the sequence in the classified section of that day's late edition. You are done. Anyone questioning whether the message really existed on that particular date can perform their own one-way hash on the message and compare the sequence with the sequence that was published in the city newspaper on that date. The sequences will be identical to each other. [Glossary One-way hash]
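
Here is a minimal sketch, in Python, of the hashing step; the message text is invented for illustration.

import hashlib

message = b"The biopsy from patient 18843056488 is benign. Report issued today."
fingerprint = hashlib.sha256(message).hexdigest()

# publish the fingerprint in any public, dated medium; anyone holding the
# message can later recompute the hash and compare it to the published value
print(fingerprint)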

Today, newspapers are seldom used in trusted time stamp protocols. A time authority typically receives the one-way hash value on the document, appends a time, and encrypts a message containing the one-way hash value and the appended time, using a private key. Anyone receiving this encrypted message can decrypt it using the time authority's public key. The only messages that can be decrypted with the time authority's public key are messages that were encrypted using the time authority's private key; hence establishing that the message had been sent by the time authority. The decrypted message will contain the one-way hash (specific for the document) and the time that the authority received the document. This time stamp protocol does not tell you when the message was created; it tells you when the message was stamped.
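
Here is a minimal sketch, in Python, of a time authority, using the third-party cryptography package. Modern libraries expose sign and verify operations rather than raw encryption with a private key, but the effect is the one described above: only the authority's private key could have produced the signature, and anyone holding the public key can check it.

import datetime
import hashlib
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# the authority's long-lived key pair; the public key is published
authority_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
authority_public_key = authority_key.public_key()

def stamp(document_hash):
    # append the time of receipt to the submitted hash and sign the result
    statement = (document_hash + "|" + datetime.datetime.utcnow().isoformat()).encode()
    signature = authority_key.sign(statement, padding.PKCS1v15(), hashes.SHA256())
    return statement, signature

# client side: submit a one-way hash of the document, never the document itself
doc_hash = hashlib.sha256(b"the confidential document").hexdigest()
statement, signature = stamp(doc_hash)

# anyone can verify the stamp with the authority's public key;
# verify() raises InvalidSignature if the statement was tampered with
authority_public_key.verify(signature, statement, padding.PKCS1v15(), hashes.SHA256())
print("verified:", statement.decode())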

Section 8.6. Case Study: Blockchains and Distributed Ledgers

It's worse than tulip bulbs.

JP Morgan CEO Jamie Dimon, referring to Bitcoin, a currency exchange system based on blockchains

Today, no book on the subject of Big Data would be complete without some mention of blockchains, which are likely to play an important role in the documentation and management of data transactions for at least the next decade, or until something better comes along. Fortunately, blockchains are built with two data structures that we have already introduced: one-way hashes and triples. All else is mere detail, determined by the user's choice of implementation.

At its simplest, a blockchain is a collection of short data records, with each record consisting of some variation on the following:

<head>-<message>-<tail>

Here are the conditions that the blockchain must accommodate:

  1. The head (i.e., first field) in each blockchain record consists of the tail of the preceding data record.
  2. The tail of each data record consists of a one-way hash of the head of the record concatenated with the record message.
  3. Live copies of the blockchain (i.e., a copy that grows as additional blocks are added) are maintained on multiple servers.
  4. A mechanism is put in place to ensure that every copy of the blockchain is equivalent to one another, and that when a blockchain record is added, it is added to every copy of the blockchain, in the same sequential order, and with the same record contents.

We will soon see that conditions 1 through 3 are easy to achieve. Condition 4 can be problematic, and numerous protocols have been devised, with varying degrees of success, to ensure that the blockchain is updated identically, at every site. Most malicious attacks on blockchains are targeted against condition 4, which is considered to be the most vulnerable point in every blockchain enterprise.

By convention, records are real-time transactions, acquired sequentially, so that we can usually assume that the nth record was created at a moment in time prior to the creation of the n + 1th record.

Let us assume that the string that lies between the head and the tail of each record is a triple. This assumption is justified because all meaningful information can be represented as a triple or as a collection of triples.

Here is our list of triples that we will be blockchaining.

a0ce8ec6^^object_name^^Homo
a0ce8ec6^^subclass_of^^Hominidae
a0ce8ec6^^property^^glucose_at_time
a1648579^^object_name^^Homo sapiens
a1648579^^subclass_of^^Homo
98495efc^^object_name^^Andy Muzeack
98495efc^^instance_of^^Homo sapiens
98495efc^^dob^^1 January, 2001
98495efc^^glucose_at_time^^87, 02-12-2014 17:33:09

Let us create our own blockchain using these nine triples as our messages.

Each blockchain record will be of the form:

<tail of prior blockchain link----the current record's triple----md5 hash of the current triple concatenated with the header>

For example, to compute the tail of the second link, we would perform an md5 hash on:

ufxOaEaKfw7QBrgsmDYtIw----a0ce8ec6^^subclass_of^^Hominidae

Which yields:

=> PhjBvwGf6dk9oUK/+yxrCA

The resulting blockchain is shown here.

       a0ce8ec6^^object_name^^Homo----ufxOaEaKfw7QBrgsmDYtIw
ufxOaEaKfw7QBrgsmDYtIw----a0ce8ec6^^subclass_of^^Hominidae----PhjBvwGf6dk9oUK/+yxrCA
PhjBvwGf6dk9oUK/+yxrCA----a0ce8ec6^^property^^glucose_at_time----P40p5GHp4hE1gsstKbrFPQ
P40p5GHp4hE1gsstKbrFPQ----a1648579^^object_name^^Homo sapiens----2wAF1kWPFi35f6jnGOecYw
2wAF1kWPFi35f6jnGOecYw----a1648579^^subclass_of^^Homo----N2y3fZgiOgRcqfx86rcpwg
N2y3fZgiOgRcqfx86rcpwg----98495efc^^object_name^^Andy Muzeack----UXSrchXFR457g4JreErKiA
UXSrchXFR457g4JreErKiA----98495efc^^instance_of^^Homo sapiens----5wDuJUTLWBJjQIu0Av1guw
5wDuJUTLWBJjQIu0Av1guw----98495efc^^glucose_at_time^^87, 02-12-2014 17:33:09----Y1jCYB7YyRBVIhm4PUUbaA

Whether you begin with a list of triples that you would like to convert into a blockchain data structure, or whether you are creating a blockchain one record at a time, through transactions that occur over time, it is easy to write a short script that will generate the one-way hashes and attach them to the end of the nth triple and the beginning of the n + 1th triple, as needed.
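
Here is a sketch of such a script, in Python, using md5 hashes that are base64-encoded and stripped of their trailing padding, which yields 22-character tails like those shown above; the exact strings depend on how the "^^" separators are encoded, and the handling of the headerless first record is one reasonable choice among several.

import base64
import hashlib

def tail(text):
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")

triples = ["a0ce8ec6^^object_name^^Homo",
           "a0ce8ec6^^subclass_of^^Hominidae",
           "a0ce8ec6^^property^^glucose_at_time"]
# ... the remaining triples from the list above would be appended here

blockchain = []
previous_tail = ""                  # the root record has no antecedent header
for triple in triples:
    head = previous_tail
    record_tail = tail(head + "----" + triple) if head else tail(triple)
    blockchain.append((head, triple, record_tail))
    previous_tail = record_tail

def is_valid(chain):
    # recompute every tail; any altered, dropped, or inserted record breaks the chain
    prev = ""
    for head, triple, record_tail in chain:
        if head != prev:
            return False
        expected = tail(head + "----" + triple) if head else tail(triple)
        if record_tail != expected:
            return False
        prev = record_tail
    return True

print(is_valid(blockchain))         # True for an untampered chain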

Looking back at our blockchain, we can instantly spot an anomaly in the very first record: its header is missing. Whenever we begin to construct a new blockchain, the first record will have no antecedent record from which a header can be extracted. This poses another computational bootstrap paradox. In this instance, we cannot begin until there is a beginning. The bootstrap paradox is typically resolved with the construction of a root record (record 0). The root record is permitted to break the rules.

Now that we have a small blockchain, what have we achieved? Here are the properties of a blockchain:

  •   Every blockchain header is built from the values in the entire succession of preceding blockchain links.
  •   The blockchain is immutable. Changing any of the messages contained in any of the blockchain links would produce a totally different blockchain. Dropping any of the links of the blockchain, or inserting any new links (anywhere other than as an attachment to the last validated link), will produce an invalid blockchain.
  •   The blockchain is recomputable. Given the same message content, the entire blockchain, with all its headers and tails, can be rebuilt. If it cannot be recomputed, then the blockchain is invalid.
  •   The blockchain, in its simplest form, is a trusted “relative time” stamp. Our blockchain does not tell us the exact time that a record was created, but it gives its relative time of creation compared with the preceding and succeeding records.

With a little imagination, we can see that a blockchain can be used as a true time stamp authority, if the exact time were appended to each of the records in the container at the moment when the record was added to the blockchain. The messages contained in blockchain records could be authenticated by including data encrypted with a private key. Tampering with the blockchain data records could be prevented by having multiple copies of the blockchain at multiple sites, and routinely checking for discrepancies among the different copies of the data.

We might also see that the blockchain could be used as a trusted record of documents, legal transactions (e.g., property deals), and monetary exchanges (e.g., Bitcoin). Blockchains may also be used for authenticating voters, casting votes, and verifying the count. The potential value of blockchains in the era of Big Data is enormous, but the devil hides in the details. Every implementation of a blockchain comes with its own vulnerabilities and much has been written on this subject [3,4].

Section 8.7. Case Study (Advanced): Zero-Knowledge Reconciliation

Experience is what you have after you've forgotten her name.

Milton Berle

Though record reconciliation across institutions is always difficult, the task becomes truly Herculean when it must be done blindly, without directly comparing records. This awkward situation occurs quite commonly whenever confidential data records from different institutions must be checked to see if they belong to the same person. In this case, neither institution is permitted to learn anything about the contents of records held by the other institution. Reconciliation, if it is to occur, must implement a zero-knowledge protocol: a protocol that does not reveal any information concerning the reconciled records [5].

We will be describing a protocol for reconciling identifiers without exchanging information about the contents of data records. Because the protocol is somewhat abstract and unintuitive, a physical analogy may clarify the methodology. Imagine two people each holding a box containing an item. Neither person knows the contents of the box that they are holding or of the box that the other person is holding. They want to determine whether they are holding identical items, but they don't want to know anything about the items. They work together to create two identical imprint stamps, each covered by a complex random collection of raised ridges. With eyes closed, each one pushes his imprint stamp against his item. By doing so, the randomly placed ridges in the stamp are compressed in a manner characteristic of the object's surface. The stamps are next examined to determine if the compression marks on the ridges are distributed identically in both stamps. If so, the items in the two boxes, whatever they may be, are considered to be identical. Not all of the random ridges need to be examined; just enough of them to reach a high level of certainty. It is theoretically possible for two different items to produce the same pattern of compression marks, but it is highly unlikely. After the comparison is made, the stamps are discarded.

The physical analogy demonstrates the power of a zero-knowledge protocol. Neither party knows the identity of his own item. Neither party learns anything about his item or the other party's item during the transaction. Yet, somehow, the parties can determine whether the two items are identical.

Here is how a zero-knowledge protocol can be used to reconcile confidential records across institutions [5]:

  1. Both institutions generate a random number of a pre-determined length and each institution sends the random number to the other institution.
  2. Each institution sums its own random number with the random number provided by the other institution. We will refer to this number as Random_A. In this way, both institutions have the same final random number and neither institution has actually transmitted this final random number. The splitting of the random number was arranged as a security precaution.
  3. Both institutions agree to create a composite representation of information contained in the record that could establish the human subject of the record. The composite might be a concatenation of the social security number, the date of birth, and the first initial of the surname.
  4. Both institutions create a program that automatically creates the composite numeric representation of the record (which we will refer to as the record signature) and immediately sums the signature with Random_A, the random number that was negotiated between the two institutions (steps 1 and 2). The sum of the composite representation of the record plus Random_A is a random number that we will call Random_B.
  5. If the two records being compared across institutions belong to the same human subject, then Random_B will be identical in both institutions. At this point, the two institutions must compare their respective versions of Random_B in such a way that they do not actually transmit Random_B to the other institution. If they were to transmit Random_B to the other institution, then the receiving institution could subtract Random_A from Random_B and produce the signature string for a confidential record contained in the other institution. This would be a violation of the requirement to share zero knowledge during the transaction.
  6. The institutions take turns sending consecutive characters of their versions of Random_B. For example, the first institution sends the first character to the second institution. The second institution sends the second character to the first institution. The first institution sends the third character to the second institution. The exchange of characters proceeds until the first discrepancy occurs, or until the first 8 characters of the string match successfully. If any of the characters do not match, both institutions can assume that the records belong to different human subjects (i.e., reconciliation failed). If the first 8 characters match, then it is assumed that both institutions are holding the same Random_B string, and that the records are reconciled.

At the end, both institutions learn whether their respective records belong to the same individual; but neither institution has learned anything about the records held in the other institution. Anyone eavesdropping on the exchange would be treated to a succession of meaningless random numbers.
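
Here is a minimal sketch, in Python, that simulates both institutions within a single script; the composite signature fields and the hashing step used to convert the signature into a number are assumptions, chosen only to make the example concrete.

import hashlib
import secrets

# steps 1 and 2: each institution contributes half of a shared random value
half_from_institution_1 = secrets.randbelow(10**16)
half_from_institution_2 = secrets.randbelow(10**16)
random_A = half_from_institution_1 + half_from_institution_2

# steps 3 and 4: build the record signature from agreed-upon fields, convert
# it to a number, and add Random_A to produce Random_B
def random_B(ssn, date_of_birth, surname_initial):
    signature = ssn + "|" + date_of_birth + "|" + surname_initial
    signature_number = int(hashlib.sha256(signature.encode()).hexdigest(), 16)
    return str(signature_number + random_A)

rb_institution_1 = random_B("123-45-6789", "2001-01-01", "M")
rb_institution_2 = random_B("123-45-6789", "2001-01-01", "M")

# steps 5 and 6: exchange Random_B one character at a time, alternating sender,
# stopping at the first mismatch or after eight matching characters
def reconciled(rb1, rb2, characters_to_check=8):
    for i in range(characters_to_check):
        if rb1[i] != rb2[i]:         # neither full string is ever transmitted
            return False
    return True

print(reconciled(rb_institution_1, rb_institution_2))   # True: same individual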

Glossary

Curator The word “curator” derives from the Latin “curatus,” the same root as “curative,” indicating that curators “take care of” things. A data curator collects, annotates, indexes, updates, archives, searches, retrieves and distributes data. Curator is another of those somewhat arcane terms (e.g., indexer, data archivist, lexicographer) that are being rejuvenated in the new millennium. It seems that if we want to enjoy the benefits of a data-centric world, we will need the assistance of curators, trained in data organization.

Digital signature As it is used in the field of data privacy, a digital signature is an alphanumeric sequence that could only have been produced by a private key owned by one particular person. Operationally, a message digest (e.g., a one-way hash value) is produced from the document that is to be signed. The person “signing” the document encrypts the message digest using her private key, and submits the document and the encrypted message digest to the person who intends to verify that the document has been signed. This person decrypts the encrypted message digest with the signer's public key (i.e., the public key complement to the private key) to produce the original one-way hash value. Next, a one-way hash is performed on the received document. If the resulting one-way hash is the same as the decrypted one-way hash, then several statements hold true: the document received is the same document as the document that had been “signed.” The signer of the document had access to the private key that complemented the public key that was used to decrypt the encrypted one-way hash. The assumption here is that the signer was the only individual with access to the private key. Digital signature protocols, in general, have a private method for encrypting a hash, and a public method for verifying the signature. Such protocols operate under the assumption that only one person can encrypt the hash for the message, and that the name of that person is known; hence, the protocol establishes a verified signature. It should be emphasized that a digital signature is quite different from a written signature; the latter usually indicates that the signer wrote the document or somehow attests to agreement with the contents of the document. The digital signature merely indicates that the document was received from a particular person, contingent on the assumption that the private key was available only to that person. To understand how a digital signature protocol may be maliciously deployed, imagine the following scenario: I contact you and tell you that I am Elvis Presley and would like you to have a copy of my public key plus a file that I have encrypted using my private key. You receive the file and the public key; and you use the public key to decrypt the file. You conclude that the file was indeed sent by Elvis Presley. You read the decrypted file and learn that Elvis advises you to invest all your money in a company that manufactures concrete guitars; which, of course, you do. Elvis knows guitars. The problem here is that the signature was valid, but the valid signature was not authentic.

Metasyntactic variable A variable name that imports no specific meaning. Popular metasyntactic variables are x, y, n, foo, bar, foobar, spam, eggs, norf, wubble, and blah. Dummy variables are often used in iterating loops. For example:
for ($i = 0; $i < 1000; $i++)
Good form dictates against the liberal use of metasyntactic variables. In most cases, programmers should create variable names that describe the purpose of the variable (e.g., time_of_day, column_sum, current_line_from_file).

National Patient Identifier Many countries employ a National Patient Identifier (NPI) system. In these cases, when a citizen receives treatment at any medical facility in the country, the transaction is recorded under the same permanent and unique identifier. Doing so enables the data collected on individuals, from multiple hospitals, to be merged. Hence, physicians can retrieve patient data that was collected anywhere in the nation. In countries with NPIs, data scientists have access to complete patient records and can perform healthcare studies that would be impossible to perform in countries that lack NPI systems. In the United States, where a system of NPIs has not been adopted, there is a perception that such a system would constitute an invasion of privacy and would harm citizens.

One-way hash A one-way hash is an algorithm that transforms one string into another string (a fixed-length sequence of seemingly random characters) in such a way that the original string cannot be calculated by operations on the one-way hash value (i.e., the calculation is one-way only). One-way hash values can be calculated for any string, including a person's name, a document, or an image. For any given input string, the resultant one-way hash will always be the same. If a single byte of the input string is modified, the resulting one-way hash will be changed, and will have a totally different sequence than the one-way hash sequence calculated for the unmodified string.
Most modern programming languages have several methods for generating one-way hash values. Regardless of the language we choose to implement a one-way hash algorithm (e.g., md5, SHA), the output value will be identical. One-way hash values are designed to produce long fixed-length output strings (e.g., 256 bits in length). When the output of a one-way hash algorithm is very long, the chance of a hash string collision (i.e., the occurrence of two different input strings generating the same one-way hash output value) is negligible. Clever variations on one-way hash algorithms have been repurposed as identifier systems [69]. A detailed discussion of one-way hash algorithms can be found in Section 3.9, “Case Study: One-Way Hashes.”
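For example, a minimal demonstration in Python, using md5 from the standard hashlib module (the input strings are invented for illustration):
import hashlib
print(hashlib.md5(b"Karen Sally Smith").hexdigest())
print(hashlib.md5(b"Karen Sally Smitt").hexdigest())
# changing a single byte of the input produces a completely different
# hash value, and neither value can be reversed to recover the name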

Raw data Raw data is the unprocessed, original data measurement, coming straight from the instrument to the database, with no intervening interference or modification. In reality, scientists seldom, if ever, work with raw data. When an instrument registers the amount of fluorescence emitted by a hybridization spot on a gene array, or the concentration of sodium in the blood, or virtually any of the measurements that we receive as numeric quantities, an algorithm executed by the measurement instrument produces the output. Pre-processing of data is commonplace in the universe of Big Data, and data managers should not labor under the false impression that the data received is “raw,” simply because the data has not been modified by the person who submits the data.

Serious Big Data 3 V's (data volume, data variety and data velocity) plus “seriousness.” Seriousness is a tongue-in-cheek term that the author applies to Big Data resources whose objects are provided with an adequate identifier and a trusted timestamp and provide data users with introspection, including pointers to the protocols that produced the data objects. The metadata in Big Data resources are appended with namespaces. Serious Big Data resources can be merged with other serious Big Data resources. In the opinion of the author, Big Data resources that lack seriousness should not be used in science, legal work, banking, and in the realm of public policy.

Time stamp Many data objects are temporal events and all temporal events must be given a time stamp indicating the time that the event occurred, using a standard measurement for time. The time stamp must be accurate, persistent, and immutable. The Unix epoch time (equivalent to the POSIX epoch time) is available for most operating systems and consists of the number of seconds that have elapsed since January 1, 1970, midnight, Greenwich Mean Time. The Unix epoch time can easily be converted into any other standard representation of time. The duration of any event can be easily calculated by subtracting the beginning time from the ending time. Because the timing of events can be maliciously altered, scrupulous data managers employ a trusted time stamp protocol by which a time stamp can be verified. A trusted time stamp must be accurate, persistent, and immutable. Trusted time stamp protocols are discussed in Section 8.5, “Case Study: The Trusted Time stamp.”
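For example, a minimal sketch in Python:
import time
from datetime import datetime, timezone
epoch_seconds = time.time()            # seconds since January 1, 1970, UTC
print(epoch_seconds)
print(datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat())
# the duration of an event is a simple subtraction of epoch times
start = time.time()
# ... the event occurs here ...
print(time.time() - start, "seconds elapsed")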

Verification and validation As applied to data resources, verification is the process that ensures that data conforms to a set of specifications. Validation is the process that checks whether the data can be applied in a manner that fulfills its intended purpose. This often involves showing that correct conclusions can be obtained from a competent analysis of the data. For example, a Big Data resource might contain position, velocity, direction, and mass data for the earth and for a meteor that is traveling sunwards. The data may meet all specifications for measurement, error tolerance, data typing, and data completeness. A competent analysis of the data indicates that the meteor will miss the earth by a safe 50,000 miles, plus or minus 10,000 miles. If the asteroid smashes into the earth, destroying all planetary life, then an extraterrestrial observer might conclude that the data was verified, but not validated.
