18

Data Sharing and Data Security

Abstract

Without data sharing, we do not have science. We just have people with their own agendas asking us to believe their conclusions, without evidence. A seemingly infinite list of official position papers, urging researchers to share their research data, has been published, but progress has been slow. To be sure, there are technical obstacles to data sharing, but for every technical obstacle, there is a wealth of literature offering solutions. In this section, we will be looking at all of the impediments to data sharing, and suggesting some practical solutions.

Keywords

Data sharing; Open access; Data hoarding; Data encumbrances; Usage agreements; National Patient Identifier

Section 18.1. What Is Data Sharing, and Why Don't We Do More of It?

It's antithetical to the basic norms of science to make claims that cannot be validated because the necessary data are proprietary.

Michael Eisen [1]

Without data sharing, there can be very little progress in the field of Big Data. The reasons for this are simple:

  •  Research findings have limited value unless they are correlated with data contained in other databases.
  •  All findings, even those based on verified data, are tentative and must be validated against data contained in multiple datasets.
  •  Unless data is shared, scientists cannot build upon the work of others, and science devolves into a collection of research laboratories working in isolation from one another, leading to intellectual stagnation [1,2].
  •  Scientific conclusions have no credibility when the research community, oversight agencies, and the interested public cannot review the data upon which the findings were based and the details of how that data was measured.

Without data sharing, we do not have science. We just have people with their own agendas asking us to believe their conclusions. A long list of anguished position papers, urging researchers to share their data, has been published [3–8]. To be sure, there are technical obstacles to data sharing, but for every technical obstacle, there is a wealth of literature offering solutions [9–20].

Despite the imperatives of data sharing, scientists have been slow to adopt data sharing policies [6]. Because the issue of data sharing is so important to the field of Big Data, it is worth reviewing the impediments to its successful implementation.

Section 18.2. Common Complaints

Science advances funeral by funeral.

Folk wisdom

Here is a listing of the commonly heard reasons for withholding data from the public, along with suggested remedies.

  •  To protect scientists from “research parasites”

A recent essay by two editors of the New England Journal of Medicine, entitled “Data Sharing,” expressed concern that a new brand of researcher uses data generated by others, for his or her own ends. The editors indicated that some front-line researchers characterize such individuals as “research parasites” [21]. The essayists suggested that researchers who want to use the data produced by others should do so by forming collaborative partnerships with the group that produced the original data [21].

The idea of collaborative partnerships may have been a reasonable strategy 30 years ago, before the emergence of enormous datasets, built from the work of hundreds of data contributors. Negotiating for data, with the promise of developing a mutually beneficial collaboration, is no longer a feasible option. Today, scientific projects may involve dozens or hundreds of scientists, with no single individual claiming ownership or responsibility for the aggregate data set. The individual contributors may have only the dimmest awareness of their own role in the effort. Under these circumstances, an outside investigator is unlikely to find an identifiable individual or group of individuals with the technical expertise, the scientific judgment, the legal standing, the ethical authority, and the strength of will to negotiate a new collaboration and to surrender a large set of data.

Today, well-designed data sets can be merged with other sources of data, and repurposed for studies that were never contemplated by the original data contributors [22]. The goal of Big Data is to create data sets that can be utilized by the entire scientific community, with minimal barriers to access. Characterizing data users as “research parasites,” as witnessed in the New England Journal of Medicine article, misses the whole point of Big Data science [21].

  •  To avoid data misinterpretation

Every scientist who releases data to the public must contend with the fear that a member of the public will misinterpret the data, reach an opposite conclusion, publish their false interpretation, and destroy the career of the trusting soul who was kind enough to provide the ammunition for his own execution. Teams of scientists developing a new drug or treatment protocol, may fear that if their data is released to the public, their competitors may seize the opportunity to unjustly critique their work and thus jeopardize their project.

Examples of such injustices have been sought, but not found [6]. There is no evidence that would lead anyone to believe that a misinterpretation of data has ever overshadowed a correct interpretation of data. Scientists have endured the withering criticisms of their colleagues since time immemorial. As they say, it comes with the territory. Hiding data for the purpose of avoiding criticism is unprofessional.

  •  To limit access to responsible professionals

Some researchers believe that data sharing must be a conditional process wherein investigators submit their data requests to a committee of scientists who decide whether the request is justified [23]. In some cases, the committee retains the right to review any results predicated on the shared data, with the intention of disallowing publication of results that they consider to be objectionable.

There are serious drawbacks to subjecting scientists to a committee approval process. The public needs unfettered access to the original data upon which published research results are based. Anything less makes it impossible to validate the conclusions drawn from the data, and invites all manner of scientific fraud. In the United States, this opinion is codified by law. The Data Quality Act of 2002 restrains government agencies from developing policies based on data that is unavailable for public review [24–27]. [Glossary Data Quality Act]

  •  To sustain the traditional role of data protector

Lawyers, bankers, healthcare workers, and civil servants are trained to preserve confidentiality, much as priests protect the confessional. It is understandable that many professionals are reluctant to share their data with the world-at-large. Nonetheless, it is unreasonable to hide data that has scientific value.

Before confidential information is released, data holders must be convinced that data sharing can be accomplished without breaching confidentiality, and that the effort spent in the process will yield some benefit to the individuals whose data is being appropriated. A large literature on the subject of safely sharing confidential data is readily available [28–33].

  •  To await forthcoming universal data standards

Trying to merge data sets that are disorganized is impossible, as is merging data sets wherein equivalent types of data are discordantly annotated. Because researchers and other data collectors use a variety of different types of software to collect and organize their data, obstacles raised by data incompatibility have been a major impediment to data sharing. The knee-jerk solution to the problem has always been to create new data standards.

In the past few decades, the standards-making process has evolved into a major industry [34]. There are standards for describing, organizing, and transmitting data. There are dozens, maybe hundreds, of standards, nomenclatures, classifications, and ontologies for the various domains of Big Data information, and all of these intellectual products are subject to multiple revisions. [Glossary Classification versus ontology]

The hunger for standards is insatiable [35]. The calls for new standards and new ontologies never seem to end. The many shortcomings of standards were discussed at length in Chapter 7, “Standards and Data Integration.” It must be noted that the proliferation of standards, many of which are abandoned soon after they are created, has served to increase the complexity and decrease the permanence of Big Data resources [28,36].

Despite all the effort devoted to data standards, there is no widely adopted system for organizing and sharing all the different types of data encountered in Big Data resources. Perhaps the “standard” answer is not the correct answer, in this instance. Specifications, discussed in detail in Section 7.2, are a possible alternative to standards, and should be considered an option for those who are open to suggestion in this matter.

  •  To protect legal ownership

Ownership is a mercantile concept conferring the right to sell. If someone owns a cow, that means that they have the right to sell the cow. If you own a house, even a mortgaged house, then you have the right to sell the house. Let's focus on one particular type of confidential record that pertains to virtually everyone: the medical record. Who owns your confidential medical record? Is it owned by the patient? Is it owned by the medical center? Is it owned by anyone?

In law, there does not seem to be anyone who has the right to sell medical records; hence, it is likely that nobody can claim ownership. Still, medical institutions have a fiduciary responsibility to maintain medical records for their patients, and this entitles both patients and healthcare providers to use the records, as needed. Patients have the right to ask hospitals to send their medical records to other medical centers or to themselves. Hospitals are expected to archive tissues, medical reports and patient charts to serve the patient and society. In the United States, State health departments, the Centers for Disease Control and Prevention (CDC), and cancer registries all expect medical centers to deliver medical reports on request.

In all the aforementioned examples, data sharing is conducted without jeopardizing, or otherwise influencing, the data holder's claim of ownership.

  •  To comply with rules issued from above

It is not uncommon for researchers to claim that they would love to share their data, but they are forbidden from doing so by the lawyers and administrators at their institutions. Two issues tend to dissuade administrators from data sharing. The first is legal liability. Institutions have a responsibility to avoid punitive tort claims, such as those that may arise if human subjects complain that their privacy has been violated when their confidential information is shared. From the viewpoint of the institution, the simplest remedy is to forbid scientists from sharing their data. Secondly, institutions want to protect their own intellectual property, and this would include patents, drug discoveries, manufacturing processes, and even data generated by their staff scientists. Institutions may sometimes equate data sharing with poor stewardship of intellectual property. [Glossary Intellectual property]

When an institution forbids data sharing, as a matter of policy, scientists should argue that the policy undermines the institution's ability to facilitate or promote its own research. Simply put, if Institution A does not share its data with Institution B, then Institution B will not share its data with Institution A. In addition, when Institution A publishes a scientific breakthrough, Institution B will not find those claims credible until its own researchers can review the primary data.

It is easy to forget that society, as a whole, benefits when scientific projects lead to discoveries. Without data sharing, those benefits will come at a glacial pace, and at great expense. Institutions have a societal obligation to advance science through data sharing.

  •  To demand reimbursement

Professionals are, by definition, people who are paid for their services. Professionals who go to the trouble of providing data to the public will want to be reimbursed. Data holders can be reassured that if they have created data at their own expense, for their own private use, then that data is theirs to keep. Nobody will take that data from them. But if the data holders have made public assertions based on their data, then they should understand that the scientific community will not give those assertions any credence, without having the data available for review.

The price that researchers pay for withholding their data (i.e., lack of validation of conclusions and professional obscurity) far exceeds the negligible costs of data sharing. Contrariwise, if the data is shared, and their results are validated, then their payment may come in the form of future grants, patents, prestige, and successful collaborations.

  •  To avoid distributing flawed data

Scientists are reluctant to release data that is full of errors. In particular, data curators may fear that if such data is released to the public, they will be inundated with complaints from angry data analysts, demanding that every error be corrected.

A 2011 study showed that researchers with high-quality data were, generally, willing to share their data [37]. Researchers who had weak data, which might support various interpretations and contrasting conclusions, were less willing to share. It is important to convince the scientists who create and hold data that the researchers who use their data, without asking permission and without forming collaborations, are not “research parasites”; they are the people who will validate good work, improve upon imperfect work, and create new work from otherwise abandoned data. The societal push toward data sharing should provide a strong impetus for scientists to improve the value of their data, so that they will have something worth sharing.

Aside from corrections, all data sets need constant updating, and there are proper and improper ways of revising data (discussed in Section 8.1, “The Importance of Data that Cannot Change”). Dealing with change, in the form of revised systems of annotations, and revised data elements, is part of the job of the data curator.

Institutions cannot refuse to share their data simply because their data contains errors or is awaiting revisions. Flawed data is common, and it's a safe bet that every large data set contains errors [38]. Institutions and scientific teams should hire professionals with the requisite skills to properly prepare, correct, and improve upon their data collections.

  •  To protect against data hackers

Even properly deidentified Big Data records may contain information that, when combined with data held in other databases, uniquely identifies patients [39]. As an obvious example, if a medical record contains an un-named patient's birth date, gender, and zip code, and a public database lists the names of people in that zip code, along with their birth dates and genders, it is a simple step to ascertain the identity of the “deidentified” patients.
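To make the linkage concrete, here is a minimal Python sketch, using entirely fabricated records, that joins a “deidentified” table to a hypothetical public roster on the three quasi-identifiers just mentioned (birth date, gender, and zip code):

deidentified_records = [
    {"birth_date": "1961-03-02", "gender": "F", "zip": "21044", "diagnosis": "melanoma"},
    {"birth_date": "1978-11-19", "gender": "M", "zip": "90210", "diagnosis": "asthma"}]
public_roster = [
    {"name": "Jane Doe", "birth_date": "1961-03-02", "gender": "F", "zip": "21044"},
    {"name": "John Roe", "birth_date": "1990-06-30", "gender": "M", "zip": "21044"}]
for record in deidentified_records:
    matches = [p for p in public_roster
               if (p["birth_date"], p["gender"], p["zip"]) ==
                  (record["birth_date"], record["gender"], record["zip"])]
    if len(matches) == 1:
        # exactly one roster entry shares the quasi-identifiers; the record is re-identified
        print(matches[0]["name"], "->", record["diagnosis"])

In a real attack, the public roster might be a voter registration list or a commercial marketing database.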

A specific instance, making national news headlines, may serve to clarify just how this may happen [40]. A 15-year-old boy had been fathered using anonymously donated sperm. The boy wanted to know the identity of his biological father. A private company had created a DNA database from 45,000 DNA samples. The purpose of the database was to allow clients to discover kin by having their DNA compared with all the DNA samples in the database. The boy sent his DNA sample to the company, along with a fee. The boy's Y chromosome DNA (inherited exclusively from the father) was compared with the Y chromosome DNA in the database. The names of two men with close matches to the boy's Y chromosome were found.

The boy's mother had been provided (from the sperm bank) with the sperm donor's date of birth and birthplace. The boy used an online service to obtain the name of every man born in the sperm donor's place of birth on the sperm donor's date of birth. Among those names, one name matched one of the two Y-chromosome matches from the DNA database search. This name, according to newspaper reports, identified the child's father. [Glossary Y-chromosome]

In this case, the boy had access to his own uniquely identifying information (i.e., his DNA and specifically his Y chromosome DNA), and he was lucky to be provided with the date of birth and birthplace of his biological father. He was also extremely lucky that the biological father had registered his DNA in a database of 45,000 samples. And he was lucky that the DNA database revealed the names of its human subjects. The boy's success in identifying his father required a string of unlikely events, and a lax attitude toward subject privacy on the part of the personnel at the sperm bank and the personnel at the DNA database.

Regardless of theoretical security flaws, the criminal or malicious identification of human subjects included in deidentified research data sets is extremely rare. More commonly, confidential records (e.g., personnel records, credit records, fully identified medical records) are stolen wholesale, relieving thieves from the intellectually challenging task of finding obscure information that may link a deidentified record to the identity of its subject.

  •  To preserve compartmentalization of data

Most data created by modern laboratories has not been prepared in a manner that permits its meaningful use in other laboratories. In many cases the data has been compartmentalized, so that the data is dispersed among different laboratories. It is par for the course that no single individual has taken the responsibility of collecting and reviewing all of the data that has been used to support the published conclusions of a multi-institutional project.

In the late 1990s, Dr. Wu Suk Hwang was a world-famous cloning researcher. The government of South Korea was so proud of Dr. Hwang that they issued a commemorative stamp to celebrate his laboratory's achievements. Dr. Hwang's status drastically changed when fabrications were discovered in a number of the manuscripts produced by his laboratory. Dr. Hwang had a habit of placing respected scientists as co-authors on his papers [41]. When the news broke, Hwang pointed a finger at several of his collaborators.

A remarkable aspect of Dr. Hwang's publications was his ability to deceive the coworkers in his own laboratory, and the co-authors located in laboratories around the world, for a very long time. Dr. Hwang used a technique known as compartmentalization; dividing his projects into tasks distributed to small groups of scientists who specialized in one step of the research process. By so doing, his coworkers never had access to the entire project's data. The data required to validate the final achievement of the research was not examined by his co-workers [42,41].

For several years, South Korean politicians defended the scientist, to the extent of questioning the patriotism of his critics. Over time, additional violations committed by Dr. Hwang were brought to light. In 2009, Hwang was sentenced in Seoul, S. Korea, to a two-year suspended prison sentence for embezzlement and bioethical violations; but he was never found guilty of fabrication [43].

Large data projects are almost always compartmentalized. When you have dozens or even hundreds of individuals contributing to a project, compartmentalization occurs quite naturally. In fact, what would you do without compartmentalization? Wait for every scientist involved in the project to review and approve one another's data? Today, large scientific projects may involve hundreds of scientists. Without compartmentalization, nothing would ever get published. The lesson here is that at the end of every research project, all of the data that contributed to the results must be gathered together as an organized and well-annotated dataset for public review.

  •  To guard research protocols

In every scientific study, the measurements included in the data must be linked to the study protocols (e.g., laboratory procedures) that produced the data. In some cases, the protocols are not well documented. In other cases, the protocols are well documented, but the researchers may have failed to follow the recommended protocols; thus rendering the data irreproducible. Occasionally, the protocols are the intellectual property of an entity other than the persons who created the data. In all these instances, the data holders may be reluctant to share their protocols with the public.

  •  To conceal instances of missing data

It is almost inevitable, when data sets are large and complex, that there will be some missing data points. In this case, data may be added “by imputation.” This involves computing a statistical best bet on what the missing data element value might have been, and inserting the calculated number into the data set. A data manager may be reluctant to release to the public a database with “fudged” data.

It is perfectly legitimate to include imputed data points, on the condition that all the data is properly annotated, so that reviewers are aware of imputed values, and of the methods used to generate such values.
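As a minimal sketch of how this might be done (the field names and values here are purely illustrative), a missing value can be imputed from the mean of the observed values, with each imputed value flagged so that reviewers can distinguish measured data from calculated data:

records = [
    {"subject": "S01", "serum_glucose": 5.1},
    {"subject": "S02", "serum_glucose": None},    # missing data point
    {"subject": "S03", "serum_glucose": 6.3}]
observed = [r["serum_glucose"] for r in records if r["serum_glucose"] is not None]
mean_value = sum(observed) / len(observed)
for r in records:
    if r["serum_glucose"] is None:
        r["serum_glucose"] = round(mean_value, 2)
        r["imputed"] = True                        # flag the imputed value
        r["imputation_method"] = "mean of observed values"
    else:
        r["imputed"] = False
print(records)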

  •  To avoid bureaucratic hurdles

As discussed in Section 9.3, “Data that Comes with Conditions,” institutions may resort to Kafkaesque measures to ensure that only qualified and trusted individuals gain access to research data. It should come as no surprise that formal requests for data may take two years or longer to review and approve [44]. The approval process is so cumbersome that it cannot be implemented without creating major inconveniences and delays for everyone involved (i.e., data manager and data supplicant).

In the United States, federal agencies often seek to share data with one another. Such transactions require Memoranda of Understanding between agencies, and these memoranda can take months to negotiate and finalize [44]. In some cases, try as they might, agencies cannot share data with one another, due to a lack of regulatory authorization that cannot be resolved in anyone's favor [44].

Hypervigilance, on the part of U.S. Federal agencies, may stem from unfortunate incidents from the past that cannot be easily forgotten. One such incident, which attracted international attention, occurred when the United States accidentally released details of hundreds of its nuclear sites and programs, including the exact locations of nuclear stockpiles [45]. Despite their reluctance to share some forms of data, U.S. agencies have been remarkably generous with bioinformatics data, and the National Institutes of Health commonly attaches data sharing requirements to grants and other awards.

In the U.S., Federal regulations impose strict controls on sharing identified medical data. Those same regulations specify that deidentified human subject data is exempted from those controls, and can be freely shared [32,33]. Data holders must learn the proper methods for deidentifying or anonymizing private and confidential medical data.

Whew! Where does this leave us? Data sharing is not easy. Nonetheless, published claims cannot be validated unless the data is made available to the public for review, and science cannot advance if scientists cannot build upon the data produced by their colleagues. Research institutions, both public and private, must find ways to deal with the problem, despite the difficulties. They might start by hiring scientists who are steeped in the craft of data sharing.

Section 18.3. Data Security and Cryptographic Protocols

No matter how cynical you become, it's never enough to keep up.

Lily Tomlin

Let us be practical. Nearly everyone has confidential information on their computers. Often, this information resides in a few very private files. If those files fell into the hands of the wrong people, the results would be calamitous. For myself, I encrypt my sensitive files. When I need to work with those files, I decrypt them. When I'm finished working with them, I encrypt them again. These files are important to me, so I keep copies of the encrypted files on thumb drives and on an external server. I don't care if my thumb drives are lost or stolen. I don't care if a hacker gets access to the server that stores my files. The files are encrypted, and only I know how to decrypt them.

Anyone in the data sciences will tell you that it is important to encrypt your data files, particularly when you are transferring files via the internet. Very few data scientists follow their own advice. Scientists, despite what you may believe, are not a particularly disciplined group of individuals. Few scientists get into the habit of encrypting their files. Perhaps they perceive the process as being too complex.

For serious encryption, you will want to step up to OpenSSL. OpenSSL is an open source collection of message digest protocols (i.e., protocols that yield one-way hashes) and encryption protocols. This useful set of utilities, with implementations for various operating systems, is available at no cost from the OpenSSL project website, https://www.openssl.org/.

Encryption algorithms and suites of cipher strategies available through OpenSSL include: RSA, DH (numerous protocols), DSS, ECDH, TLS, AES (including 128 and 256 bit keys), CAMELLIA, DES (including triple DES), RC4, IDEA, SEED, PSK, and numerous GOST protocols. In addition, implementations of popular one-way hash algorithms are provided (i.e., MD5 and SHA, including SHA384). OpenSSL comes with an Apache-style open source license. [Glossary AES]

For Windows users, the OpenSSL download contains three files that are necessary for file encryption: openssl.exe, ssleay32.dll, and libeay32.dll. If these three files are located in your current directory, you can encrypt any file, directly from the command prompt, as shown:

openssl aes128 -in public.txt -out secret.aes -pass pass:abcdefgh

The command line provides your chosen password, “abcdefgh” to the aes128 encryption algorithm, which takes the file public.txt and produces an AES-encrypted output file, secret.aes. Of course, once you've encrypted a file, you will need a decryption method. Here's a short command line that decrypts the encrypted file created by the preceding command line:

openssl aes128 -d -in secret.aes -out decrypted.txt -pass pass:abcdefgh

We see that decryption involves inserting the “-d” option into the command line. AES is an example of a symmetric encryption algorithm, which means that the encryption password also serves as the decryption password.

Encrypting and decrypting individual strings, files, groups of files, and directory contents is extremely simple and can provide a level of security that is likely to be commensurate with your personal needs.

Here is a short Python script, aes.py, that encrypts all the files included in a list, and deposits the encrypted files in a thumb drive sitting in the “f:” drive.

import os, re

filelist = ['diener.txt', 'simplify.txt', 're-ana.txt', 'phenocop.txt',
            'mystery.txt', 'disaster.txt', 'factnote.txt', 'perlbig.txt',
            'referen.txt', 'create.txt', 'exploreo.txt']
pattern = re.compile("txt")
for filename in filelist:
    # name the encrypted output file with an "enc" extension
    out_filename = pattern.sub('enc', filename)
    # deposit the encrypted file on the thumb drive in the f: drive
    # (the backslash must be escaped inside a Python string)
    out_filename = "f:\\" + out_filename
    print(out_filename)
    cmdstring = ("openssl aes128 -in " + filename + " -out " + out_filename +
                 " -pass pass:abcdefgh")
    os.system(cmdstring)
  •   Public and private key cryptographic protocols

Many cryptographic algorithms are symmetric; the password used to encrypt a file is the same as the password used to decrypt the file. In an asymmetric cryptographic algorithm, the password that is used to encrypt a file is different from the password that is used to decrypt the file. The encrypting password is referred to as the public key, by convention. The public key can be distributed to friends or posted on a public web site. The decrypting password is referred to as the private key, and it must never be shared.

How is a public/private key system used? If Alice were to encrypt a file with Bob's public key, only Bob's private key could decrypt the file. If Bob does not lose his private key, and if Bob does not allow his private key to be shared or stolen, then only Bob can decrypt files encrypted with his public key. Alice can send the encrypted file, without worrying that the encrypted file could be intercepted and opened by someone other than Bob.

As discussed, openssl can be run via command lines, from the system prompt (e.g., c:> in Windows systems). Let's generate a public/private key pair that we'll use for RSA encryption.

openssl genpkey -algorithm RSA -out private_key.pem -pkeyopt rsa_keygen_bits:2048
openssl rsa -pubout -in private_key.pem -out public_key.pem

These two commands produce two files, each containing a cryptographic key. The private key file is private_key.pem. The public key file is public_key.pem. Let's encrypt a file (sample.txt) using RSA encryption and the public key we just created (public_key.pem):

openssl rsautl -encrypt -inkey public_key.pem -pubin -in sample.txt -out sample.ssl

This produces an encrypted file, sample.ssl. Let's decrypt the encrypted file (sample.ssl) using the private key that we created earlier (private_key.pem):

openssl rsautl -decrypt -inkey private_key.pem -in sample.ssl -out decrypted.txt

In common usage, this protocol only transmits small message files, such as passwords. Alice could send a large file, strongly encrypted with AES. In a separate exchange, Alice and Bob would use public and private keys to transmit the password. First, Alice would encrypt the password with Bob's public key. The encrypted message would be sent to Bob. Bob would decrypt the message with his private key, thus producing the password. Bob would use the password to decrypt the large AES-encrypted file. A large variety of security protocols have been devised, utilizing public/private key pairs, suiting a variety of purposes.
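Here is a minimal sketch of that exchange, in the same command-line style used above, assuming the key pair generated earlier (public_key.pem and private_key.pem) stands in for Bob's keys; the file names are illustrative:

import os

# Alice's side: encrypt the large file with AES, using a random password,
# then encrypt the password itself with Bob's RSA public key.
os.system("openssl rand -hex -out password.txt 16")
os.system("openssl aes128 -in bigfile.txt -out bigfile.aes -pass file:password.txt")
os.system("openssl rsautl -encrypt -inkey public_key.pem -pubin -in password.txt -out password.rsa")
# Alice sends bigfile.aes and password.rsa to Bob.

# Bob's side: recover the password with his private key, then decrypt the file.
os.system("openssl rsautl -decrypt -inkey private_key.pem -in password.rsa -out password_received.txt")
os.system("openssl aes128 -d -in bigfile.aes -out bigfile_decrypted.txt -pass file:password_received.txt")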

The public/private keys can also be used to provide the so-called digital signature of the individual holding the private key.

To sign and authenticate a transferred data file (e.g., mydata.txt), the following steps must be taken:

  1. Alice creates a one-way hash of her data file, mydata.txt, and creates a new file, called hashfile, to hold the one-way hash value.

    openssl dgst -sha256 mydata.txt > hashfile

  2. Alice signs the hashfile with her private key, to produce a digital signature in the file “signaturefile”:

    openssl rsautl -sign -inkey private_key.pem -keyform PEM -in hashfile > signaturefile

  3. Alice sends mydata.txt and the signature file (“signaturefile”) to Bob.
  4. Bob verifies the signature with Alice's public key.

    openssl rsautl -verify -inkey public_key.pem -pubin -keyform PEM -in signaturefile


    The verified content of the signature file is the original hash that Alice created from mydata.txt:

    SHA256(mydata.txt)= 6e2a1dbf9ea8cbf2accb64f33ff83c7040413963e69c736accdf47de0bc16b1a


    This verifies Alice's signature and yields Alice's hash of her mydata.txt file
  5. Bob computes his own one-way hash of the received file, mydata.txt.

    openssl dgst -sha256 mydata.txt

This produces the following hash value, which is the same value that was decrypted from Alice's signature

SHA256(mydata.txt)= 6e2a1dbf9ea8cbf2accb64f33ff83c7040413963e69c736accdf47de0bc16b1a

Because the received mydata.txt file has the same one-way hash value as the sent mydata.txt file, and because Bob has verified that the sent mydata.txt file was signed by Alice, then Bob has taken all steps necessary to authenticate the file (i.e., to show that the received file is the file that Alice sent).

There are some limitations to this protocol. Anyone in possession of Alice's private key can “sign” for Alice. Hence the signature is not equivalent to a hand-written signature or to a biometric that uniquely identifies Alice (e.g., iris image, CODIS gene sequences, full set of fingerprints). Really, all the process tells us is that a document was sent by someone in possession of Alice's private key. We never really know who sent the document.

The signature does not attest to anything, other than that a person with Alice's key actually sent the document. There is nothing about the process to indicate that Alice personally vouches for the accuracy of the transmitted material, that she created the content, or that she agrees with the contents.

Cryptography is fascinating, but experts who work in the field of data security will tell you that cryptographic algorithms and protocols can never substitute for a thoughtful data security plan that is implemented with the participation and cooperation of the staff of an organization [46,47]. In many instances, security breaches occur when individuals, often trusted employees, violate protocol and/or behave recklessly. Hence, data security is more often a “people thing” than a “computer thing”. Nonetheless, if you are not a multi-million dollar institution, and simply want to keep some of your data private, here are a few tips that you might find helpful. If you have really important data, the kind that could hurt yourself or others if the data were to fall into the wrong hands, then you should totally disregard the advice that follows and seek the help of a professional security agent.

  •  Save yourself a lot of grief by settling for a level of security that is appropriate and reasonable for your own needs.

Don't use a bank vault when a padlock will suffice.

  •  Avail yourself of no-cost solutions.

Some of the finest encryption algorithms, and their implementations, are publicly available in OpenSSL and other sources.

  •  The likelihood that you will lose your passwords is much higher than the likelihood that someone will steal your passwords.

Develop a good system for passkey management that is suited to your own needs.

  •  The security of the system should not be based on hiding your encrypted files or keeping the encryption algorithm secret.

The greatest value of modern encryption protocols is that it makes no difference whether anyone steals or copies your encrypted files, or learns your encryption algorithm.

  •  File encryption and decryption should be computationally fast.

Fast, open source protocols are readily available.

  •  File encryption should be done automatically, as part of some computer routine (e.g., a backup routine), or as a cron job (i.e., a process that runs at a predetermined time).

You should be able to batch-encrypt and batch-decrypt any number of files all at once (i.e., from a command loop within a script), and you should be able to combine encryption with other file maintenance activities. For example, you should be able to implement a simple script that loops through every file in a directory, or a directory tree (i.e., all the files in all of the subdirectories under the directory), all at once, adding file header and metadata information into the file, scrubbing data as appropriate, calculating a one-way hash (i.e., message digest) of the finished file, and producing an encrypted file output.
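For example, the following sketch, modeled on the aes.py script shown earlier, walks a directory tree, records a one-way hash (message digest) of every file, and deposits an AES-encrypted copy of each file in an output directory. The paths and password are illustrative, and the header-insertion and data-scrubbing steps are left out:

import os, hashlib

source_directory = "c:/mydata"       # illustrative source tree
output_directory = "f:/backup"       # illustrative destination (e.g., a thumb drive)
os.makedirs(output_directory, exist_ok=True)
for dirpath, dirnames, filenames in os.walk(source_directory):
    for filename in filenames:
        in_path = os.path.join(dirpath, filename)
        out_path = os.path.join(output_directory, filename + ".aes")
        with open(in_path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()   # one-way hash of the file
        print(in_path, digest)
        cmdstring = ("openssl aes128 -in " + in_path +
                     " -out " + out_path + " -pass pass:abcdefgh")
        os.system(cmdstring)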

  •  You should never implement an encryption system that is more complex than you can understand [48].

Your data may be important to you, and to a few of your colleagues, but the remainder of the world looks upon your output with glazed eyes. If you are the type of person who would protect your valuables with a padlock, rather than a safety deposit box, then you should probably be thinking less about encryption strength and more about encryption operability. Ask yourself whether the encryption protocols that you use today will still be widely available, platform-independent, and vendor-independent 5, 10, or 50 years from now. Will you always be able to decrypt your own encrypted files?

  •  Don't depend on redundancy

At first blush, it would be hard to argue that redundancy, in the context of information systems, is a bad thing. With redundancy, when one server fails, another picks up the slack; if a software system crashes, its duplicate takes over; when one file is lost, it is replaced by its back-up copy. It all seems good.

The problem with redundancy is that it makes the system more complex. The operators of a Big Data resource with built-in redundancies must maintain the operability of the redundant systems in addition to the primary systems. More importantly, the introduction of redundancies introduces a new set of interdependencies (i.e., how the parts of the system interact), and the consequences of those interdependencies may be difficult to anticipate.

In recent memory, the most dramatic example of a failed redundant system involved the Japanese nuclear power plant at Fukushima. The plant was designed with redundant systems. If the power failed, a secondary power generator would kick in. On March 11, 2011, a powerful earthquake off the coast of Japan cut the nuclear plant's access to the electric power grid, and the tsunami that followed flooded the back-up generators. The same tsunami cut the nuclear facilities off from outside emergency assistance. Subsequent meltdowns and radiation leaks produced the worst nuclear disaster since Chernobyl.

As discussed previously in this chapter, on June 4, 1996, the first flight of the Ariane 5 rocket self-destructed, 37 seconds after launch. There was a bug in the software, but the Ariane had been fitted with a back-up computer. The back-up was no help; the same bug that crippled the primary computer put the back-up computer out of business [49]. The lesson here, and from the Fukushima nuclear disaster, is that redundant systems are often ineffective if they are susceptible to the same destructive events that caused failure in the primary systems.

Computer software and hardware problems may occur due to unanticipated interactions among software and hardware components. Redundancy, by contributing to system complexity, and by providing an additional opportunity for components to interact in an unpredictable manner, may actually increase the likelihood of a system-wide crash. Cases have been documented wherein system-wide software problems arose due to bugs in the systems that controlled the redundant subsystems [49].

A common security measure involves backing up files and storing the back-up files off-site. If there is a fire, flood, or natural catastrophe at the computer center, or if the computer center is sabotaged, then the back-up files can be retrieved from the external site and eventually restored. The drawback of this approach is that the back-up files create a security risk. In Portland, Oregon, in 2006, 365,000 medical records were stolen from Providence Home Services, a division of Seattle-based Providence Health Systems [50]. The thief was an employee who was handed the back-up files and instructed to store them in his home, as a security measure. In this case, the theft of identified medical records was a command performance. The thief complied with the victim's request to be robbed, as a condition of his employment. At the very least, the employer should have encrypted the back-up files before handing them over to an employee.

Nature takes a middle-of-the-road approach on redundancy. Humans evolved to have two eyes, two arms, two legs, two kidneys, and so on. Not every organ comes in duplicate. We have one heart, one brain, one liver, one spleen. There are no organs that come in triplicate. Human redundancy is subject to some of the same vulnerabilities as computer redundancy. A systemic poison that causes toxicity in one organ will cause equivalent toxicity in its contralateral twin.

  •  Save time and money; don't protect data that does not need protection

Big Data managers tend to be overprotective of the data held in their resources, a professional habit that can work in their favor. In many cases, though, when data is of a purely academic nature, containing no private information, and is generally accessible from alternate sources, there really is no reason to erect elaborate security barriers.

Security planning always depends on the perception of the value of the data held in the resource (i.e., Is the data in the Big Data resource worth anything?), and on the risk that the data might be used to harm individuals (e.g., through identity theft). In many cases, the data held in Big Data resources has no intrinsic monetary value and poses no risk to individuals. The value of most Big Data resources is closely tied to their popularity. A resource used by millions of people provides opportunities for advertising and attracts funders and investors.

Regarding the release of potentially harmful data, it seems prudent to assess, from the outset, whether there is a simple method by which the data can be rendered harmless. In many cases, deidentification can be achieved through a combination of data scrubbing, and expunging data fields that might conceivably tie a record to an individual. If your data set contains no unique records (i.e., if every record in the system can be matched with another record, from another individual, for which every data field is identical), then it is impossible to link any given record to an individual, with certainty. In many cases, it is a simple matter to create an enormous data set wherein every record is matched by many other records that contain the same informational fields. This process is sometimes referred to as record ambiguation [51].
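As a simple illustration of the idea, the following sketch (with fabricated records, reduced to three illustrative informational fields) checks whether every record in a data set is matched by at least one other record with identical fields; any unique record is a candidate for further ambiguation before release:

from collections import Counter

records = [
    ("1961", "F", "21044"),
    ("1961", "F", "21044"),
    ("1978", "M", "90210")]          # a unique record
counts = Counter(records)
unique_records = [r for r, n in counts.items() if n == 1]
if unique_records:
    print("Records needing ambiguation:", unique_records)
else:
    print("No record is unique; no record can be linked to an individual with certainty.")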

Sometimes a Big Data team is compelled to yield to the demands of their data contributors, even when those demands are unreasonable. An individual who contributes data to a resource may insist upon assurances that a portion of any profit resulting from the use of their contributed data will be returned as royalties, shares in the company, or some other form of remuneration. In this case, the onus of security shifts from protecting the data to protecting the financial interests of the data providers. When every piece of data is a source of profit, measures must be put into place to track how each piece of data is used, and by whom. Such measures are often impractical, and have great nuisance value for data managers and data users. The custom of capitalizing on every conceivable opportunity for profit is a cultural phenomenon, not a scientific imperative.

Section 18.4. Case Study: Life on Mars

You must accept one of two basic premises: Either we are alone in the universe, or we are not alone in the universe. And either way, the implications are staggering.

Wernher von Braun

On September 3, 1976, the Viking Lander 2 touched down upon the planet Mars, where it remained operational for the next 3 years, 7 months, and 8 days. Soon after landing, it performed an interesting remote-controlled experiment. Using samples of Martian dirt, astrobiologists measured the conversion of radioactively labeled precursors into more complex carbon-based molecules, in the so-called Labeled-Release study. For this study, control samples of dirt were heated to a high temperature (i.e., sterilized), and likewise exposed to radioactively labeled precursors, without producing complex carbon-containing molecules. The tentative conclusion, published soon thereafter, was that Martian organisms in the samples of dirt had built carbon-based molecules through a metabolic pathway [52]. As you might expect, the conclusion was immediately challenged, and it remains controversial to this day [22].

In the years since 1976, long after the initial paper was published, the data from the Labeled-Release study has been available to scientists for re-analysis. New analytic techniques have been applied to the data, and new interpretations have been published [52]. As additional missions have reached Mars, more data has emerged (i.e., the detection of water and methane), also supporting the conclusion that there is life on Mars. None of the data is conclusive; Martian organisms have not been isolated. The point made here is that the shared Labeled-Release data is accessible and permanent, and can be studied again and again, compared or combined with new data, and argued ad nauseam [22].

Section 18.5. Case Study: Personal Identifiers

Secret agent man, secret agent man

They've given you a number and taken away your name

Theme from the television show “Secret Agent,” airing in the United States from 1964–66; song written by P. F. Sloan and Steve Barri

We came to a conclusion in 2002. I don't think you can do it (create an electronic health record) without a national identifier.

Peter Drury [53]

An awful lot of the data collected by scientists concerns people (e.g., financial data, marketing data, medical data). Given everything discussed so far in this book regarding the importance of providing data object uniqueness, you would think that we would all be assigned our own personal identifiers by now.

Of course, nothing could be further from the truth. Each of us is associated with dozens, if not hundreds, of irreconcilable identifiers, each intended to serve a particular need at a particular moment in time. These include bank accounts, credit cards, loan applications, brokerage and other investment accounts, library cards, and voter IDs. In the United States, a patient may be assigned separate identifiers for the various doctors' offices, clinics, and hospitals that she visits over the course of her life. As mentioned, a single hospital may assign a patient many different “unique” identifiers, one for each department visited, for each newly installed hospital information system, and whenever the admission clerk forgets to ask the patient if he or she has been previously registered. U.S. hospitals try to reconcile the different identifiers for a patient under a so-called Enterprise Master Patient Index, but experience has shown that the problems encountered are insurmountable. As one example, in Houston's patient index system, which includes 3.5 million patients, there are about 250,000 patients who have a first and last name in common with at least one other registrant, and there are 70,000 instances wherein two people share the same first name, last name, and birthdate [54]. There is a growing awareness that efforts at reconciling systems wherein individual patients are registered multiple times are never entirely satisfactory [53].

The subject of data security cannot be closed without mention of the National Patient Identifier. Some countries employ a National Patient Identifier (NPI) system. In these cases, when a citizen receives treatment at any medical facility in the country, the transaction is recorded under the same permanent and unique identifier. Doing so enables the data collected on individuals, from multiple hospitals, to be merged. Hence, physicians can retrieve patient data that was collected anywhere in the nation. In countries with NPIs, data scientists have access to complete patient records and can perform healthcare studies that would be impossible to perform in countries that lack NPI systems. In the United States, where a system of NPIs has not been adopted, there is a perception that such a system would constitute an invasion of privacy. Readers from outside the United States are probably wondering why the United States is so insecure on this issue.

In the United States, the call for a national patient identification system is raised, from time to time. The benefits to patients and to society are many. Aside from its absolutely necessary role in creating data that can be sensibly aggregated and meaningfully analyzed, it also serves to protect the privacy of individuals by eliminating the need for less secure forms of identification (e.g., credit cards, drivers licenses).

Regardless, U.S. citizens are reluctant to have an identifying number that is associated with a federally controlled electronic record of their private medical information. To show its disdain for personal identifiers, the U.S. Congress passed Public Law 105-277, in 1998, prohibiting the Department of Health and Human Services from using its funds to develop personal health identifiers without first obtaining congressional approval [55].

In part, this distrust results from the lack of any national insurance system in the United States. Most health insurance in the United States is private, and private insurers have wide discretion over the fees and services provided to enrollees. There is a fear that if there were a national patient identifier with centralized electronic medical records, insurers may withhold reimbursements or raise premiums or otherwise endanger the health of patients. Because the cost of U.S. medical care is the highest in the world, medical bills for uninsured patients can quickly mount, impoverishing individuals and families [56].

Realistically, no data is totally safe. Data breaches today may involve hundreds of millions of confidential records. The majority of Americans have had social security numbers, credit card information, and private identifiers (e.g., birth dates, city of birth, names of relatives) misappropriated or stolen. Medical records have been stolen in large numbers. Furthermore, governments demand and receive access to our confidential medical records when they deem it necessary [57]. Forbidding National Patient Identifiers has not made us safe. [Glossary Social Security Number]

Maybe we should ask ourselves the following: “Is it rational to forfeit the very real opportunity of developing new safe and effective treatments for serious diseases, for the very small likelihood that someone will crack my deidentified research record and somehow leverage this information to my disadvantage?”

Suppose everyone in the United States were given a choice: you can be included in a national patient identifier system, or you can opt out. Most likely, there would be many millions of citizens who would opt out of the offer, seeing no particular advantage in having a national patient identifier, and sensing some potential harm. Now, suppose you were told that if you chose to opt out, you would not be permitted to enjoy any of the health benefits coming from studies performed with data collected through the national patient identifier system. New safe and effective drugs, warnings of emerging epidemics, information on side effects associated with your medications, biomarker tests for preventable illnesses, and so on, would be reserved for individuals with national patient identifiers. Those who made no effort to help the system would be barred from any of the benefits that the system provided. Would you reconsider your refusal to accept a national patient identifier, if you knew the consequences? Of course, this is a fanciful scenario, but it makes a point.

Glossary

AES The Advanced Encryption Standard (AES) is the cryptographic standard endorsed by the U.S. government as a replacement for the old government standard, DES (Data Encryption Standard). AES was chosen from among many different encryption protocols submitted in a cryptographic contest conducted by the U.S. National Institute of Standards and Technology, in 2001. AES is also known as Rijndael, after its developer. It is a symmetric encryption standard, meaning that the same password used for encryption is also used for decryption.

Classification versus ontology A classification is a system in which every object in a knowledge domain is assigned to a class within a hierarchy of classes. The properties of superclasses are inherited by the subclasses. Every class has one immediate superclass (i.e., parent class), although a parent class may have more than one immediate subclass (i.e., child class). Objects do not change their class assignment in a classification, unless there was a mistake in the assignment. For example, a rabbit is always a rabbit, and does not change into a tiger. Classifications can be thought of as the simplest and most restrictive type of ontology, and serve to reduce the complexity of a knowledge domain [58]. Classifications can be easily modeled in an object-oriented programming language and are non-chaotic (i.e., calculations performed on the members and classes of a classification should yield the same output, each time the calculation is performed). A classification should be distinguished from an ontology. In an ontology, a class may have more than one parent class and an object may be a member of more than one class. A classification can be considered a special type of ontology wherein each class is limited to a single parent class and each object has membership in one and only one class.

Data Quality Act In the United States, the data upon which public policy is based must be of high quality and must be available for review by the public. Simply put, public policy must be based on verifiable data. The Data Quality Act of 2002 requires the Office of Management and Budget to develop government-wide standards for data quality [24].

Intellectual property Data, software, algorithms, and applications that are created by an entity capable of ownership (e.g., humans, corporations, universities). The entity holds rights over the manner in which the intellectual property can be used and distributed. Protections for intellectual property may come in the form of copyrights and patents. Copyright applies to published information. Patents apply to novel processes and inventions. Certain types of intellectual property can only be protected by being secretive. For example, magic tricks cannot be copyrighted or patented; this is why magicians guard their intellectual property so closely. Intellectual property can be sold outright, essentially transferring ownership to another entity; but this would be a rare event. In other cases, intellectual property is retained by the creator, who permits its limited use by others via a legal contrivance (e.g., license, contract, transfer agreement, royalty, usage fee, and so on). In some cases, ownership of the intellectual property is retained, but the property is freely shared with the world (e.g., open source license, GNU license, FOSS license, Creative Commons license).

Social Security Number The common strategy, in the United States, of employing social security numbers as identifiers is often counterproductive, owing to entry error, mistaken memory, or the intention to deceive. Efforts to reduce errors by requiring individuals to produce their original social security cards puts an unreasonable burden on honest individuals, who rarely carry their cards, and provides an advantage to dishonest individuals, who can easily forge social security cards. Institutions that compel patients to provide a social security number have dubious legal standing. The social security number was originally intended as a device for validating a person's standing in the social security system. More recently, the purpose of the social security number has been expanded to track taxable transactions (i.e., bank accounts, salaries). Other uses of the social security number are not protected by law. The Social Security Act (Section 208 of Title 42 U.S. Code 408) prohibits most entities from compelling anyone to divulge his/her social security number. Legislation or judicial action may one day stop healthcare institutions from compelling patients to divulge their social security numbers as a condition for providing medical care. Prudent and forward-thinking institutions will limit their reliance on social security numbers as personal identifiers.

Y-chromosome A small chromosome present in males and inherited from the father. The normal complement of chromosomes in male cells has one Y chromosome and one X chromosome. The normal complement of chromosomes in female cells has two X chromosomes and no Y chromosomes. Analysis of the Y chromosome is useful for determining paternal lineage.

References

[1] Markoff J. Troves of personal data, forbidden to researchers. The New York Times; 2012 May 21.

[2] Markoff J. A deluge of data shapes a new era in computing. The New York Times; 2009 December 15.

[3] Guidance for sharing of data and resources generated by the molecular libraries screening centers network (MLSCN)—addendum to RFA RM-04-017. NIH notice NOT-RM-04-014; July 22, 2004. Available from: http://grants.nih.gov/grants/guide/notice-files/NOT-RM-04-014.html [viewed September 19, 2012].

[4] Sharing publication-related data and materials: responsibilities of authorship in the life sciences. Washington, DC: The National Academies Press; 2003. Available from: http://www.nap.edu/openbook.php?isbn=0309088593 [viewed September 10, 2012].

[5] NIH policy on data sharing. Available from: http://grants.nih.gov/grants/guide/notice-files/NOT-OD-03-032.html. 2003 [viewed September 13, 2015].

[6] Fienberg S.E., Martin M.E., Straf M.L., eds. Sharing research data. Washington, DC: Committee on National Statistics, Commission on Behavioral and Social Sciences and Education, National Research Council. National Academy Press; 1985.

[7] Policy on Enhancing Public Access to Archived Publications Resulting from NIH-Funded Research. Notice Number: NOT-OD-05-022, 2005.

[8] Revised Policy on Enhancing Public Access to Archived Publications Resulting from NIH-Funded Research. Notice Number: NOT-OD-08-033. Release date: January 11, 2008. Effective date: April 7, 2008. Available from: http://grants.nih.gov/grants/guide/notice-files/not-od-08-033.html. 2008 [viewed December 28, 2009].

[9] Berman J.J. A tool for sharing annotated research data: the “Category 0” UMLS (unified medical language system) vocabularies. BMC Med Inform Decis Mak. 2003;3:6.

[10] Berman J.J., Edgerton M.E., Friedman B. The tissue microarray data exchange specification: a community-based, open source tool for sharing tissue microarray data. BMC Med Inform Decis Mak. 2003;3:5.

[11] Berman J.J. De-identification. Washington, DC: U.S. Office of Civil Rights (HHS), Workshop on the HIPAA Privacy Rule's De-identification Standard; 2010 March 8–9.

[12] Berman J.J. Racing to share pathology data. Am J Clin Pathol. 2004;121:169–171.

[13] Berman J.J. Principles of big data: preparing, sharing, and analyzing complex information. Waltham, MA: Morgan Kaufmann; 2013.

[14] Berman J.J. Data simplification: taming information with open source tools. Waltham, MA: Morgan Kaufmann; 2016.

[15] de Bruijn J. Using ontologies: enabling knowledge sharing and reuse on the Semantic Web. Digital Enterprise Research Institute Technical Report DERI-2003-10-29. Available from: http://www.deri.org/fileadmin/documents/DERI-TR-2003-10-29.pdf. October 2003 [viewed August 14, 2012].

[16] Drake T.A., Braun J., Marchevsky A., Kohane I.S., Fletcher C., Chueh H., et al. A system for sharing routine surgical pathology specimens across institutions: the Shared Pathology Informatics Network (SPIN). Hum Pathol. 2007;38:1212–1225.

[17] Sweeney L. Guaranteeing anonymity when sharing medical data, the Datafly system. In: Proc American Medical Informatics Association; 1997:51–55.

[18] Sweeney L. Three computational systems for disclosing medical data in the year 1999. Medinfo. 1998;9(Pt 2):1124–1129.

[19] Malin B., Sweeney L. How (not) to protect genomic data privacy in a distributed network: using trail re-identification to evaluate and design anonymity protection systems. J Biomed Inform. 2004;37:179–192.

[20] Neamatullah I., Douglass M.M., Lehman L.W., Reisner A., Villarroel M., Long W.J., et al. Automated de-identification of free-text medical records. BMC Med Inform Decis Mak. 2008;8:32.

[21] Longo D.L., Drazen J.M. Data sharing. New Engl J Med. 2016;374:276–277.

[22] Berman J.J. Repurposing legacy data: innovative case studies. Waltham, MA: Morgan Kaufmann; 2015.

[23] Frellick M. Models for sharing trial data abound, but with little consensus. Medscape; 2016 August 3. Available from: www.medscape.com.

[24] Data Quality Act. 67 Fed. Reg. 8,452, February 22, 2002, addition to FY 2001 Consolidated Appropriations Act (Pub. L. No. 106-554 codified at 44 U.S.C. 3516).

[25] Tozzi J.J., Kelly Jr W.G., Slaughter S. Correspondence: data quality act: response from the Center for Regulatory Effectiveness. Environ Health Perspect. 2004;112:A18–A19.

[26] Guidelines for ensuring and maximizing the quality, objectivity, utility, and integrity of information disseminated by federal agencies. Fed Regist. 2002;67(36) February 22.

[27] Sass J.B., Devine Jr. J.P. The Center for Regulatory Effectiveness invokes the Data Quality Act to reject published studies on atrazine toxicity. Environ Health Perspect. 2004;112:A18.

[28] Berman J.J. Biomedical informatics. Sudbury, MA: Jones and Bartlett; 2007.

[29] Berman J.J. Threshold protocol for the exchange of confidential medical data. BMC Med Res Methodol. 2002;2:12.

[30] Berman J.J. Confidentiality issues for Medical Data Miners. Artif Intell Med. 2002;26:25–36.

[31] Quantin C.H., Bouzelat F.A., Allaert A.M., Benhamiche J., Faivre J., Dusserre L. Automatic record hash coding and linkage for epidemiological followup data confidentiality. Methods Inf Med. 1998;37:271–277.

[32] Department of Health and Human Services. 45 CFR (Code of Federal Regulations), Parts 160 through 164. Standards for Privacy of Individually Identifiable Health Information (Final Rule). Fed Regist. 2000;65(250):82461–82510 December 28.

[33] Department of Health and Human Services. 45 CFR (Code of Federal Regulations), 46. Protection of Human Subjects (Common Rule). Fed Regist. 1991;56:28003–28032 June 18.

[34] Kammer R.G. The Role of Standards in Today's Society and in the Future. Statement of Raymond G. Kammer, Director, National Institute of Standards and Technology, Technology Administration, Department of Commerce, Before the House Committee on Science Subcommittee on Technology; 2000 September 13.

[35] Berman H.M., Westbrook J. The need for dictionaries, ontologies and controlled vocabularies. OMICS. 2003;7:9–10.

[36] Robinson D., Frosdick P., Briscoe E. HL7 Version 3: an impact assessment. NHS Information Authority; 2001 March 23.

[37] Wicherts J.M., Bakker M., Molenaar D. Willingness to share research data is related to the strength of the evidence and the quality of reporting of statistical results. PLoS ONE. 2011;6:e26828.

[38] Goldberg S.I., Niemierko A., Turchin A. Analysis of data errors in clinical research databases. AMIA Annu Symp Proc. 2008;2008:242–246.

[39] Behlen F.M., Johnson S.B. Multicenter patient records research: security policies and tools. J Am Med Inform Assoc. 1999;6:435–443.

[40] Stein R. Found on the Web, with DNA: a boy's father. The Washington Post; 2005 Sunday, November 13.

[41] Hwang W.S., Roh S.I., Lee B.C., Kang S.K., Kwon D.K., Kim S., et al. Patient-specific embryonic stem cells derived from human SCNT blastocysts. Science. 2005;308:1777–1783.

[42] Wade N. Clone scientist relied on peers and Korean pride. The New York Times; 2005 December 25.

[43] Berman J.J. Machiavelli's Laboratory. Amazon Digital Services Inc.; 2010.

[44] National Academies of Sciences, Engineering, and Medicine. Innovations in federal statistics: combining data sources while protecting privacy. Washington, DC: The National Academies Press; 2017.

[45] Broad W.J. U.S. accidentally releases list of nuclear sites. The New York Times; 2009 June 3.

[46] Schneier B. Applied cryptography: protocols, algorithms and source code in C. New York: Wiley; 1994.

[47] Schneier B. Secrets and lies: digital security in a networked world. 1st ed. New York: Wiley; 2004.

[48] Schneier B. A plea for simplicity: you can't secure what you don't understand. Information Security; 1999 November 19. Available from: http://www.schneier.com/essay-018.html [viewed July 1, 2015].

[49] Leveson N.G. System safety engineering: back to the future. Self-published ebook; 2002. Available at: http://sunnyday.mit.edu/book2.pdf [viewed September 22, 2016].

[50] Weiss T.R. Thief nabs backup data on 365,000 patients. Computerworld; 2006 January 26. Available from: http://www.computerworld.com/s/article/108101/Update_Thief_nabs_backup_data_on_365_000_patients [viewed August 21, 2012].

[51] Noumeir R., Lemay A., Lina J. Pseudonymization of radiology data for research purposes. J Digit Imaging. 2007;20:284–295.

[52] Bianciardi G., Miller J.D., Straat P.A., Levin G.V. Complexity analysis of the Viking Labeled Release experiments. Intl J Aeronaut Space Sci. 2012;13:14–26.

[53] Beaudoin J. National experts at odds over patient identifiers. Healthcare IT News; 2004. October 18. Available at: http://www.healthcareitnews.com/news/national-experts-odds-over-patient-identifiers [viewed October 23, 2017].

[54] McCann E. The patient identifier debate: will a national patient ID system ever materialize? Should it? Healthcare IT News; 2013 February 18.

[55] Dimitropoulos L.L. Privacy and security solutions for interoperable health information exchange. Perspectives on patient matching: approaches, findings, and challenges. Indianapolis: RTI International; 2009 June 30.

[56] Dalen J.E. Only in America: bankruptcy due to health care costs. Am J Med. 2009;122:699.

[57] Lewin T. Texas orders health clinics to turn over patient data. The New York Times; 2015 October 23.

[58] Patil N., Berno A.J., Hinds D.A., Barrett W.A., Doshi J.M., Hacker C.R., et al. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science. 2001;294:1719–1723.
