9

Assessing the Adequacy of a Big Data Resource

Abstract

Before the data analyst devotes time and energy to a data resource, he or she must determine whether the data is likely to be accurate, comprehensive, and representative; whether it has been organized sensibly; and whether it has been provided with adequate annotation. The first step usually involves looking at all of the data, or at least at a large number of data samples. This chapter describes some basic approaches to examining Big Data resources.

Keywords

ASCII editor; Data assessment; Comprehensive data; Representative data; Data flattening; Full access to data

Section 9.1. Looking at the Data

Discovery is “...seeing what others have seen, but thinking what others have not.”

Albert Szent-Gyorgyi

Big Data must not be a Big Waste of time. Looking at the data will tell you immediately if you can use the data. Moving forward with calculations before looking at the data is inexcusable. Before you choose and apply analytic methods to data sets, you should spend time studying your raw data. The following steps may be helpful:

  1.  Find a free ASCII editor.

When I encounter a large data file, in plain ASCII format, the first thing I do is open the file and take a look at its contents. Unless the file is small (i.e., under about 20 megabytes), most commercial word processors will fail at this task. They simply cannot open really large files (in the Gigabyte range). You will want to use an editor designed to work with large ASCII files. Two of the more popular, freely available editors are Emacs and vi (also available under the name vim). Downloadable versions are available for Linux, Windows, and Macintosh systems. On most computers, these editors will open files in the range of a Gigabyte. For even larger files, there are operating system utilities that can do the job. These will be discussed in Section 9.4, “Case Study: Utilities for Viewing and Searching Large Files.” [Glossary Text editor]

  2.  Download and study the “readme” or index files, or their equivalent.

In prior decades, large collections of data were often assembled as files within subdirectories and these files could be downloaded in part or in toto, via ftp (file transfer protocol). Traditionally, a “readme” file would be included with the files, and the “readme” file would explain the purpose, contents, and organization of all the files. In some cases, an index file might be available, providing a list of terms covered in the files and their locations in the various files. When such files are prepared thoughtfully, they are of great value to the data analyst. It is always worth a few minutes time to open and browse the “readme” file. I think of “readme” files as treasure maps. The data files contain great treasure, but you are unlikely to find anything of value unless you study and follow the map.

In the past few years, data resources have grown in size and complexity. Today, Big Data resources are often collections of resources, housed on multiple servers. New and innovative access protocols are continually being developed, tested, released, updated, and replaced. Still, some things remain the same. There will always be documents to explain how the Big Data resource “works” for the user. It behooves the data analyst to take the time to read and understand this prepared material. If there is no prepared material, or if the prepared material is unhelpful, then you may want to reconsider using the resource.

  3.  Assess the number of records in the Big Data resource.

There is a tendency among some data managers to withhold information related to the number of records held in the resource. In many cases, the number of records says a lot about the inadequacies of the resource. If the total number of records is much smaller than the typical user might have expected or desired, then the user might seek their data elsewhere. Data managers, unlike data users, sometimes dwell in a perpetual future that never merges into the here and now. They think in terms of the number of records they will acquire in the next 24 hours, the next year, or the next decade. To the data manager, limitations in the present are often irrelevant.

Data managers may be reluctant to divulge the number of records held in the Big Data resource when the number is so large as to defy credibility. Consider this example. There are about 5700 hospitals in the United States serving a population of about 313 million people. If each hospital served a specific subset of the population with no overlap in service between neighboring hospitals, then each would provide care for about 54,000 people. In practice, there is always some overlap in catchment population and a popular estimate for the average (overlapping) catchment for United States hospitals is 100,000. The catchment population for any particular hospital can be estimated by factoring in a parameter related to its size. For example, if a hospital hosts twice the number of beds than the average United States hospital, then one would guess that its catchment population would be about 200,000. The catchment population represents the approximate number of electronic medical records for living patients served by the hospital (one living individual, one hospital record). If you are informed that a hospital, of average size, contains 10 million records (when you are expecting about 100,000), then you can infer that something is very wrong. Most likely, the hospital is creating multiple records for individual patients. In general, institutions do not voluntarily provide users with information that casts doubt on the quality of their information systems. Hence, the data analyst, ignorant of the total number of records in the system, might proceed under the false assumption that each patient is assigned one and only one hospital record. Suffice it to say that the data user must know the number of records available in a resource, and the manner in which records are identified and internally organized.
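The back-of-envelope check described above can be written down in a few lines. The following is a minimal sketch, in Python, using the figures quoted in this section (5700 hospitals, 313 million people, an average overlapping catchment of 100,000); the reported record count and the relative bed count are hypothetical values supplied for illustration.

# Sanity check: compare a resource's reported record count against a
# rough catchment estimate. The reported count below is hypothetical.
us_population = 313_000_000
us_hospitals = 5_700

non_overlapping_catchment = us_population / us_hospitals   # about 54,900 people
average_catchment = 100_000        # popular estimate, allowing for overlap
relative_bed_count = 1.0           # 2.0 would mean twice the average hospital size

expected_records = average_catchment * relative_bed_count
reported_records = 10_000_000      # hypothetical figure quoted by the resource

ratio = reported_records / expected_records
print(f"Expected about {expected_records:,.0f} records; "
      f"resource reports {reported_records:,}; ratio {ratio:.0f}x")
if ratio > 5:
    print("Reported record count far exceeds the plausible catchment;"
          " suspect multiple records per patient.")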

A related issue of particular importance is the sample number/sample dimension dichotomy. Some resources with enormous amounts of data may have very few data records. This occurs when individual records contain mountains of data (e.g., sequences, molecular species, images), but the number of individual records is woefully low (e.g., hundreds or thousands). This problem, falling under the curse of dimensionality, will be further discussed in Section 14.6, “Case Study (Advanced): Curse of Dimensionality.”

  4.  Determine how data objects are identified and classified.

As discussed in previous chapters, if you know the identifier for a data object, then you can collect all of the information associated with the object, regardless of its location in the resource. If other Big Data resources use the same identifier for the data object, you can integrate all of the data associated with the data object, regardless of its location in external resources. Furthermore, if you know the class that holds a data object, you can combine objects of a class and study all of the members of the class. Consider the following example.

Big Data resource 1

75898039563441  name        G. Willikers
75898039563441  gender       male

Big Data resource 2

75898039563441  age         35
75898039563441  is_a_class_member cowboy
94590439540089  name        Hopalong Tagalong
94590439540089  is_a_class_member cowboy

Merged Big Data Resource 1 + 2

75898039563441  name        G. Willikers
75898039563441  gender       male
75898039563441  is_a_class_member cowboy
75898039563441  age         35
94590439540089  name        Hopalong Tagalong
94590439540089  is_a_class_member cowboy

The merge of two Big Data resources combines data related to identifier 75898039563441 from both resources. We now know a few things about this data object that we did not know before the merge. The merge also tells us that the two data objects identified as 75898039563441 and 94590439540089 are both members of class cowboy. We now have two instance members from the same class, and this gives us information related to the types of instances contained in the class.
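The merge itself is trivial once every triple carries a persistent object identifier. The following is a minimal sketch, in Python, that pools the triples shown above under their shared identifiers and then collects the instances of class cowboy; the data values are taken from the example, and the dictionary layout is only one of many reasonable representations.

# Pool identifier-metadata-value triples from two resources.
resource_1 = [
    ("75898039563441", "name", "G. Willikers"),
    ("75898039563441", "gender", "male"),
]
resource_2 = [
    ("75898039563441", "age", "35"),
    ("75898039563441", "is_a_class_member", "cowboy"),
    ("94590439540089", "name", "Hopalong Tagalong"),
    ("94590439540089", "is_a_class_member", "cowboy"),
]

merged = {}
for identifier, metadata, value in resource_1 + resource_2:
    merged.setdefault(identifier, {})[metadata] = value

# Every data/metadata pair belonging to an object is now gathered under its identifier.
for identifier, pairs in merged.items():
    for metadata, value in pairs.items():
        print(identifier, metadata, value)

# Class membership lets us pull together all instances of class "cowboy".
cowboys = [i for i, p in merged.items() if p.get("is_a_class_member") == "cowboy"]
print("Members of class cowboy:", cowboys)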

The consistent application of standard methods for object identification and for class assignments, using a standard classification or ontology, greatly enhances the value of a Big Data resource. A savvy data analyst will quickly determine whether the resource provides these important features. [Glossary Identification]

  5.  Determine whether data objects contain self-descriptive information.

Data objects should be well specified. All values should be described with metadata, all metadata should be defined, and the definitions for the metadata should be found in documents whose unique names and locations are provided. The data should be linked to protocols describing how the data was obtained and measured. [Glossary ISO metadata standard]

  6.  Assess whether the data is complete and representative.

You must be prepared to spend hours reading through the records; otherwise, you will never really understand the data. After you have spent a few weeks of your life browsing through Big Data resources, you will start to appreciate the value of the process. Nothing comes easy. Just as the best musicians spend thousands of hours practicing and rehearsing their music, the best data analysts must devote thousands of hours to studying their data sources. It is always possible to run sets of data through analytic routines that summarize the data, but drawing insightful observations from the data requires thoughtful study.

An immense Big Data resource may contain spotty data. On one occasion, I was given a large hospital-based data set, with assurances that the data was complete (i.e., containing all necessary data relevant to the project). After determining how the records and the fields were structured, I looked at the distribution frequency of diagnostic entities contained in the data set. Within a few minutes I had the frequencies of occurrence of the different diseases, categorized under broad diagnostic categories. I spent another few hours browsing through the list, and before long I noticed that there were very few skin diseases included in the data. I am not a dermatologist, but I knew that skin diseases are among the most common conditions encountered in medical clinics. Where were the missing skin diseases? I asked one of the staff clinicians assigned to the project. He explained that the skin clinic operated somewhat autonomously from the other hospital departments. The dermatologists maintained their own information system, and their cases were not integrated into the general disease data set. I inquired as to why I had been assured that the data set was complete, when everyone other than myself knew full well that the data set lacked skin cases. Apparently, the staff had become so accustomed to ignoring the field of dermatology that it never crossed their minds to mention the matter.

It is a quirk of human nature to ignore anything outside one's own zone of comfort and experience. Otherwise fastidious individuals will blithely omit relevant information from Big Data resources if they consider the information to be inconsequential, irrelevant, or insubstantial. I have had conversations with groups of clinicians who requested that the free-text information in radiology and pathology reports (the part of the report containing descriptions of findings and other comments) be omitted from the compiled electronic records on the grounds that it is all unnecessary junk. Aside from the fact that “junk” text can serve as important analytic clues (e.g., measurements of accuracy, thoroughness, methodological trends), the systematic removal of parts of data records produces a biased and incomplete Big Data resource. In general, data managers should not censor data. It is the job of the data analyst to determine what data should be included or excluded from analysis; and to justify his or her decision. If the data is not available to the data analyst, then there is no opportunity to reach a thoughtful and justifiable determination.

On another occasion, I was given an anonymized set of clinical data from an undisclosed hospital. As I always do, I looked at the frequency distributions of items on the reports. In a few minutes, I noticed that germ cell tumors, rare tumors that arise from a cell lineage that includes oocytes and spermatocytes, were occurring in high numbers. At first, I thought that I might have discovered an epidemic of germ cell tumors in the hospital's catchment population. When I looked more closely at the data, I noticed that the increased incidence occurred in virtually every type of germ cell tumor, and there did not seem to be any particular increase associated with gender, age, or ethnicity. Cancer epidemics raise the incidence of one or maybe two types of cancer and may involve a particular at-risk population. A cancer epidemic would not be expected to raise the incidence of all types of germ cell tumors, across ages and genders. It seemed more likely that the high numbers of germ cell tumors were explained by a physician or specialized care unit that concentrated on treating patients with germ cell tumors, receiving referrals from across the nation. Based on the demographics of the data set (the numbers of patients of different ethnicities), I could guess the geographic region of the hospital. With this information and knowing that the institution probably had a prestigious germ cell clinic, I guessed the name of the “undisclosed” hospital. My suspicions were eventually confirmed. [Glossary Anonymization versus deidentification]

It sometimes helps to compare the distribution of data in a new collection against the distribution of data in a known and trusted population. For example, you may want to stratify data records by the age of the individuals and compare the result with the distribution of ages in a control or normal population. You might also create a word list or index of terms extracted from the data to determine whether the frequency of occurrence of the included words or terms is similar to what you have come to expect from comparable data sets. If too many kinds of data are missing from your new collection, then you may need to abandon the project. You may find that the information contained in the new collection is similar in kind, but dissimilar in frequency, to other populations. For example, if you encounter a population of men and women of all ages, but with a female:male ratio of 5:1 and with very few men over the age of 70, then you might want to normalize your population against a control population. [Glossary Age-adjusted incidence]
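A comparison of this kind can be done with nothing more than frequency counts. The following is a minimal sketch, in Python, that compares the age strata of a new collection against a trusted control population; the strata and the counts are hypothetical, and the 10-percentage-point flag is an arbitrary threshold chosen for illustration.

# Compare the age distribution of a new collection against a control population.
from collections import Counter

new_collection = Counter({"0-19": 120, "20-39": 900, "40-59": 850,
                          "60-79": 400, "80+": 30})
control_population = Counter({"0-19": 250, "20-39": 280, "40-59": 260,
                              "60-79": 160, "80+": 50})

new_total = sum(new_collection.values())
control_total = sum(control_population.values())

for stratum in sorted(control_population):
    new_frac = new_collection.get(stratum, 0) / new_total
    control_frac = control_population[stratum] / control_total
    flag = "  <-- check" if abs(new_frac - control_frac) > 0.10 else ""
    print(f"{stratum:>6}: collection {new_frac:5.1%} vs control {control_frac:5.1%}{flag}")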

The point here is that if you take the time to study raw data, you can spot systemic deficiencies or excesses in the data, if they exist, and you may gain deep insights that would not be obtained by mathematical techniques.

  7.  Plot some of the data.

Plotting data is quick, easy, and surprisingly productive. Within minutes, the data analyst can assess long-term trends, short-term and periodic trends, the general shape of data distribution and general notions of the kinds of functions that might represent the data (e.g., linear, exponential, power series). Simply knowing that the data can be expressed as a graph is immeasurably reassuring to the data analyst.

There are many excellent data visualization tools that are widely available. Without making any recommendation, I mention that graphs produced for this book were made with Matplotlib, a plotting library for the Python programming language; and Gnuplot, a graphing utility available for a variety of operating systems. Both Matplotlib and Gnuplot are open source applications that can be downloaded, at no cost, and are available at sourceforge.net. [Glossary Open source]

Gnuplot is extremely easy to use, either as stand-alone scripts containing gnuplot commands, or from the system command line. Most types of plots can be created with a single gnuplot command line. Gnuplot can fit a mathematically expressed curve to a set of data using the nonlinear least-squares Marquardt-Levenberg algorithm [1,2]. Gnuplot can also provide a set of statistical descriptors (e.g., median, mean, and standard deviation) for plotted sets of data.

Gnuplot operates from data held in tab-delimited ASCII files. Typically, data extracted from a Big Data resource is ported into a separate ASCII file, with column fields separated by a tab character and rows separated by a newline character. In most cases, you will want to modify your raw data, readying it for plotting. Use your favorite programming language to normalize, shift, transform, convert, filter, translate, or munge your raw data, as you see fit. Export the data as a tab-delimited file, named with a .dat suffix.
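Any scripting language will do for this step. The following is a minimal sketch, in Python, that reads a raw file, normalizes one column to the range 0-1, and writes a tab-delimited .dat file; the file names and the two-column layout are hypothetical.

# Ready raw data for Gnuplot: normalize the second column and export as .dat.
with open("raw_measurements.txt") as infile, open("measurements.dat", "w") as outfile:
    rows = [line.split() for line in infile if line.strip()]
    values = [float(r[1]) for r in rows]
    top = max(values)
    for row, value in zip(rows, values):
        # first column unchanged, second column normalized to the range 0-1
        outfile.write(f"{row[0]}\t{value / top}\n")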

It takes about a second to generate a plot for 10,000 data points (Fig. 9.1).

Fig. 9.1
Fig. 9.1 A plot of 10,000 random data points, in three coordinates. The data for this figure was created with a 7 line script using the Perl programming language, but any scripting language would have been sufficient [3]. Ten thousand data points were created, with the x, y, and z coordinates for each point produced by a random number generator. The point coordinates were put into a file named xyz_rand.dat.

One command line in Gnuplot produced the graph, from the data.

splot 'c:\ftp\xyz_rand.dat'
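The data file itself takes only a few lines of code to produce. The following is a sketch, in Python, along the lines of the short Perl script described in the figure legend: 10,000 points with random x, y, and z coordinates, written to xyz_rand.dat as a tab-delimited file that the splot command above can read.

# Generate 10,000 random (x, y, z) points for plotting.
import random

with open("xyz_rand.dat", "w") as outfile:
    for _ in range(10000):
        x, y, z = random.random(), random.random(), random.random()
        outfile.write(f"{x}\t{y}\t{z}\n")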

It is very easy to plot data, but one of the most common mistakes of the data analyst is to assume that the available data actually represents the full range of data that may occur. If the data under study does not include the full range of the data, the data analyst will often reach a completely erroneous explanation for the observed data distribution.

Data distributions will almost always appear to be linear at various segments of their range. An oscillating curve that reaches equilibrium may look like a sine wave early in its course, and a flat-line later on. In the larger oscillations, it may appear linear along the length of a half-cycle. Any of these segmental interpretations of the data will miss observations that would lead to a full explanation of the data (Fig. 9.2).

Fig. 9.2
Fig. 9.2 An oscillating wave reaching equilibrium. The top graph uses circle-points to emphasize a linear segment for a half-cycle oscillation. The bottom graph of the same data emphasizes a linear segment occurring at equilibrium.

An adept data analyst can eyeball a data distribution and guess the kind of function that might model the data. For example, a symmetric bell-shaped curve is probably a normal or Gaussian distribution. A curve with an early peak and a long, flat tail is often a power law distribution. Curves that are simple exponential or linear can also be assayed by visual inspection. Distributions that may be described by a Fourier series or a power series, or that can be segmented into several different distributions, can also be assessed. [Glossary Power law, Power series, Fourier series]

  8.  Estimate the solution to your multi-million dollar data project, on day 1.

This may seem difficult to accept, and there will certainly be exceptions to the rule, but the solution to almost every multi-million dollar analytic problem can usually be estimated in just a few hours, sometimes minutes, at the outset of the project. If an estimate cannot be attained fairly quickly, then there is a good chance that the project will fail. If you do not have the data for a quick and dirty estimate, then you will probably not have the data needed to make a precise determination.

The past several decades have witnessed a profusion of advanced mathematical techniques for analyzing large data sets. It is important that we have these methods, but in most cases, newer methods serve to refine and incrementally improve older methods that do not rely on powerful computational techniques or sophisticated mathematical algorithms. As someone who was raised prior to the age of hand-held calculators and personal computers, I was taught quick-and-dirty estimation methods for adding, subtracting, multiplying, and dividing lists of numbers. The purpose of the estimation was to provide a good idea of the final answer, before much time was spent on a precise solution. If no mistake was introduced in either the estimate or the long calculation, then the two numbers would come close to one another. Conversely, mistakes in the long calculations could be detected if the two calculations yielded different numbers.

If data analysts go straight to the complex calculations before they perform a simple estimation, they will find themselves accepting wildly ridiculous calculations. For comparison purposes, there is nothing quite like a simple, intuitive estimate to pull an overly eager analyst back to reality. Often, the simple act of looking at a stripped-down version of the problem opens a new approach that can drastically reduce computation time [4]. In some situations, analysts will find that a point is reached when higher refinements in methods yield diminishing returns. When everyone has used their most advanced algorithms to make an accurate prediction, they may sometimes find that their best effort offers little improvement over a simple estimator.

Section 9.2. The Minimal Necessary Properties of Big Data

In God we trust, all others bring data.

William Edwards Deming (1900–1993)

Many of today's statisticians and scientists came of age in the world of small data. When you are working with a few hundred measurements, most of the issues discussed in this book have almost no relevance. Small data does not need to be dressed up with identifiers and metadata. Scientists did not worry very much about creating self-explanatory data; each scientist understood their own data, and that was usually good enough.

Big Data, with its volume, complexity, velocity, and permanence, requires a remarkable amount of annotation and curation. For the most part, the issues raised in this book are unknown to the bulk of individuals who collect Big Data. Hence, most of the Big Data that has been collected and stored has no scientific value; it is simply incomprehensible and unusable [5-8]. This may seem like an outrageous claim, particularly when you consider how much of the world's activities are data-driven. If you speak with scientists who collect and analyze data, and that would include just about every scientist you are likely to encounter, you will hear them tell you that their data is just fine, and perfectly suitable for their own scientific studies. The point that must be made is that the scientists who collected and analyzed the data cannot judge the value of scientific data. The true value of data must be assessed by the scientists who verify, validate, and re-analyze the data that was collected by other scientists. If the original data cannot be obtained and analyzed by the scientific community, now and in the future, then the original assertions cannot be confirmed, and the data cannot be usefully merged with other data sets, extended, and repurposed. [Glossary Abandonware, Dark data, Universal and perpetual, Data versus datum, Identifier, Data repurposing]

For data to be useful to the scientific community, it must have a set of basic properties, and, unfortunately, these properties are seldom taught or utilized. Here are the universal properties of good data that has lasting scientific value.

  •   Data that has been annotated with metadata
  •   Data that establishes uniqueness or identity
  •   Time stamped data that accrues over time [Glossary Time, Time stamp]
  •   Data that resides within a data object
  •   Data that has membership in a defined class
  •   Introspective data—data that explains itself
  •   Immutable data
  •   Data that has been simplified

Let us take a moment to examine each of these data features:

  •   Data that has been annotated with metadata

Metadata, the data that explains data, was discussed in Sections 4.1 through 4.3. The modern specification for metadata is the eXtensible Markup Language (XML). The importance of XML to data scientists cannot be overstated. As a data-organizing technology, it is as important as the invention of written language (circa 3000 BC) or the appearance of mass-printed books (circa 1450 AD). Markup allows us to convey any message as XML (a pathology report, a radiology image, a genome database, a workflow process, a software program, or an e-mail). [Glossary Data annotation, Annotation, Data sharing]

  •   Data that establishes uniqueness or identity

The most useful data establishes the identity of objects. In many cases, objects have their own, natural identifiers that come very close to establishing uniqueness. Examples include fingerprints, iris patterns, and the sequence of nucleotides in an organism's genetic material.

In regard to identifying data objects, we need not depend on each data object having its own naturally occurring identifier. As discussed in Section 3.1, we can simply generate and assign unique identifiers to our data objects [3,8-10]. Identifiers are data simplifiers, when implemented properly. They allow us to collect all of the data associated with a unique object, while ensuring that we exclude any data that belongs to some other object.
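Generating an identifier is a one-line operation in most languages. The following is a minimal sketch, in Python, that assigns a generated identifier to a record using the standard uuid module; the record contents are hypothetical, and uuid4 is only one of many reasonable choices for producing effectively unique strings.

# Assign a generated identifier to a data object.
import uuid

record = {"name": "G. Willikers", "gender": "male"}
identifier = str(uuid.uuid4())        # a long random string, effectively unique

data_object = {identifier: record}
print(identifier, data_object[identifier])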

  •   Time stamped data that accrues over time

When a data set contains data records that collect over time, it becomes possible to measure how the attributes of data records change as the data accumulates. Signal analysts use the term time series to refer to attribute measurements that change over time. The shape of a time series can be periodic (i.e., repeating over specific intervals), linear, non-linear, Gaussian, multimodal (i.e., having multiple peaks and troughs), or chaotic. A large part of data science is devoted to finding trends in data, determining simple functions that model the variation of data over time, or predicting how data will change in the future. All these analytic activities require data that is annotated with the time that a measurement is made, the time that a record is prepared, or the time that an event has occurred. [Glossary Data science, Waveform]

You may be shocked to learn that many, if not most, web pages lack a time stamp to signify the date and time when the page's textual content was created. This oversight applies to news reports, announcements from organizations and governments, and even scientific papers; all being instances for which a time stamp would seem to be an absolute necessity. When a scientist publishes an undated manuscript, how would anyone know if the results are novel? If a news article describes an undated event, how would anyone know whether the report is current? For the purposes of data analysis, undated documents and data records are useless [5].

Whereas undated documents have very little value, all transactions, statements, documents and data points that are annotated with reliable time stamps will always have some value, particularly if the information continues to collect over time. Today, anyone with a computer can easily time stamp his or her data, with the date and the time, accurate to within a second. As discussed in Section 6.4, “Case Study: Time stamping Data,” every operating system and every programming language has access to the time, and can easily annotate any data point with the time that it was created. Time data can be formatted in any of dozens of ways, all of which can be instantly converted to an international standard [11].
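Attaching a time stamp at the moment a data point is created is effortless. The following is a minimal sketch, in Python, that annotates a measurement with both an ISO 8601 time stamp and the epoch time; the measurement itself is hypothetical.

# Time stamp a data point at the moment of creation.
from datetime import datetime, timezone

measurement = {"temperature_c": 36.8}
now = datetime.now(timezone.utc)
measurement["timestamp"] = now.isoformat()        # e.g., '2017-05-02T21:46:47+00:00'
measurement["epoch_seconds"] = now.timestamp()    # seconds since January 1, 1970

print(measurement)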

It's human nature to value newly collected data and to dismiss old data as being outdated or irrelevant. Nothing could be further from the truth. New data, in the absence of old data, has little value. All historical events develop through time, and the observations made at any given moment in time are always influenced by events that transpired at earlier times. Whenever we speak of “new” data, alternately known as prospectively acquired data, we must think in terms that relate the new data to the “old” data that preceded it. Old data can be used to analyze trends over time and to predict the data values into the future. Essentially, old data provides the opportunity to see the past, the present, and the future. The dependence of new data on old data can be approached computationally. The autocorrelation function is a method for producing a type of measurement indicating the dependence of data elements on prior data elements. Long-range dependence occurs when a value is dependent on many prior values. Long-range dependence is determined when the serial correlation (i.e., the autocorrelation over multiple data elements) is high when the number of sequential elements is large [12]. These are nifty tools for data analysis, but they cannot be employed if the data is not time stamped [6]. [Glossary Correlation distance]
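For readers who want to see the dependence of new data on old data made concrete, the following is a minimal sketch, in Python, of a lag-k serial (auto)correlation computed directly from its textbook definition; the series values are hypothetical, and production work would ordinarily use a numerical library rather than this hand-rolled version.

# Lag-k autocorrelation of a time stamped series.
def autocorrelation(values, lag):
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    covariance = sum((values[i] - mean) * (values[i + lag] - mean)
                     for i in range(n - lag)) / n
    return covariance / variance

series = [0.1, 0.4, 0.5, 0.9, 0.8, 1.1, 1.3, 1.2, 1.6, 1.7]   # hypothetical values
for lag in (1, 2, 3):
    print(lag, round(autocorrelation(series, lag), 3))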

  •   Data that is held in a data object

In Section 6.2, “Data Objects: The Essential Ingredient of Every Big Data Collection,” we defined a data object as an object identifier plus all of the data/metadata pairs that rightly belong to the object identifier, including a data/metadata pair that tells us the object's class. Lucky for us, some of the most common data creations (e.g., emails and photographic images) are automatically composed as data objects by our software (i.e., email clients and digital cameras).

When you send a message, your email client automatically creates a data object that holds the contents of your message, descriptive information about the message, a message identifier, and a time stamp. Here is a sample email header, obtained by selecting the email client's long or detailed version of the message. The actual message contents would normally follow, but are omitted here for brevity.

  •   MIME-Version: 1.0
  •   Received: by 10.36.165.75 with HTTP; Tue, 2 May 2017 14:46:47 -0700 (PDT)
  •   Date: Tue, 2 May 2017 17:46:47 -0400
  •   Delivered-To: [email protected]
  •   Message-ID: < CALVNVe-kk7fqYJ82MfsV6a4kFKW4v57c4y9BLp0UYf1cBHq9pQ@mail.gmail.com >
  •   Subject: tiny fasts
  •   From: Anybody < [email protected] >
  •   To: Anybody Else < [email protected] >
  •   Content-Type: multipart/alternative; boundary = 94eb2c07ab4c054062054e917a03

Notice that each line of the header consists of a colon “:” flanked on the left by metadata (e.g., Subject, From, To) and on the right by the data being described. There is a line for a time stamp and a line for an identifier assigned by the email client.

  •   Date: Tue, 2 May 2017 17:46:47 -0400
  •   Message-ID: < CALVNVe-kk7fqYJ82MfsV6a4kFKW4v57c4y9BLp0UYf1cBHq9pQ@mail.gmail.com >

Email messages are an example of data objects that are automatically created when you push the “send” button. When we read about the remarkable results achieved by forensic data analysts, who gather time stamped, immutable, and identified evidence from millions of stored messages, we must give credit to the power of data objects.
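Because the header is a set of metadata/data pairs, any program can treat a saved message as a data object. The following is a minimal sketch, in Python, using the standard library email module; the header text is abridged from the example above (the addresses and body are omitted), and the field names shown are ordinary RFC 822 header names rather than anything specific to one email client.

# Interrogate an email message as a data object: identifier, time stamp,
# and metadata/data pairs.
from email import message_from_string
from email.utils import parsedate_to_datetime

raw = """Date: Tue, 2 May 2017 17:46:47 -0400
Message-ID: <CALVNVe-kk7fqYJ82MfsV6a4kFKW4v57c4y9BLp0UYf1cBHq9pQ@mail.gmail.com>
Subject: tiny fasts

(message body omitted)
"""

msg = message_from_string(raw)
print("identifier:", msg["Message-ID"])
print("time stamp:", parsedate_to_datetime(msg["Date"]))
print("metadata/data pairs:", dict(msg.items()))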

  •   Data that has membership in a defined class

In Chapter 5, we discussed classifications and ontologies and explained the importance of assigning instances (e.g., diseases, trucks, investments) to classes wherein every instance shares a set of features typical of the class. All good classifications have a feature known as competence: the ability to draw inferences about data objects, and their relationships to other data objects, based on class definitions. Data that is unclassified may have some immediate observational or experimental value to scientists, but such data cannot be used to draw inferences from classes of data objects obtained from Big Data resources.
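A toy example may make class competence concrete. The following is a minimal sketch, in Python, in which properties attached to a class (and to its ancestor classes) can be inferred for every instance assigned to that class; the class names and properties are hypothetical and echo the cowboy example earlier in this chapter.

# Infer properties for an instance from its class and the class lineage.
class_definitions = {
    "cowboy": {"is_a": "person", "typically_owns": "horse"},
    "person": {"is_a": "living_organism", "has": "birth_date"},
}

def inferred_properties(class_name):
    """Collect properties from the class and all of its ancestor classes."""
    properties = {}
    while class_name in class_definitions:
        properties.update(class_definitions[class_name])
        class_name = class_definitions[class_name].get("is_a")
    return properties

instance = {"id": "75898039563441", "is_a_class_member": "cowboy"}
print(inferred_properties(instance["is_a_class_member"]))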

  •   Introspective data (data that explains itself)

Introspection, as previously discussed in Chapter 6, refers to the ability of data (e.g., data records, documents, and all types of data objects) to describe itself when interrogated. Introspection gives data users the opportunity to see relationships among the individual data records that are distributed in different data sets, and is one of the most useful features of data objects, when implemented properly.

Modern programming languages allow us to interrogate data and learn everything there is to know about the information contained in data objects. Information about data objects, acquired during the execution of a program, can be used to modify the program's instructions at run-time, a useful feature known as “reflection”. Detailed information about every piece of data in a data set (e.g., the identifier associated with the data object, the class of objects to which the data object belongs, the metadata and the data values that are associated with the data object) permits data scientists to integrate data objects collected from multiple Big Data resources.

It should be noted that the ability to perform introspection is not limited to object oriented programming languages. Introspection is provided by the data, and any programming language will suffice, so long as the data itself is organized as data objects assigned to classes within a sensibly structured classification.
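The following is a minimal sketch, in Python, of the kind of interrogation described above: a data object can be asked for its class, its class lineage, and the data/metadata pairs it holds. The Cowboy class and its values are hypothetical.

# Introspection: ask a data object about itself at run-time.
class Cowboy(dict):
    """A trivial data-object class; instances carry their own data/metadata pairs."""

hoppy = Cowboy(identifier="94590439540089", name="Hopalong Tagalong")

print(type(hoppy).__name__)     # class of the object: Cowboy
print(Cowboy.__mro__)           # class lineage: (Cowboy, dict, object)
print(isinstance(hoppy, dict))  # membership in an ancestor class: True
print(dict(hoppy))              # the data/metadata pairs held by the object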

  •   Immutable data

When you are permitted to change preexisting data, all of your collected data becomes tainted. None of the analyses performed on the data in the database can be verified, because the data that was originally analyzed no longer exists. It has become something else, which you cannot fully understand. Aside from producing an unverifiable data collection, you put the data analyst in the impossible position of deciding which data to believe; the old data or the new data.

  •   Data that has been simplified

Big Data is complex data, and complex data is difficult to understand and analyze. As it happens, all of the properties that we consider minimally necessary for Big Data preparation are simplifying. Metadata, identifiers, data objects, and classifications all work to drive down the complexity of data and render the data understandable to man or machine.

It is easy for data managers to shrug off the data requirements described in this section as high-tech nuisances. Big Data requires an enormous amount of fussy work that was simply not necessary when data was small. Nonetheless, it is necessary, if we hope to use more than an insignificant fraction of the data that is being collected every day.

Section 9.3. Data That Comes With Conditions

This site has been moved.

We'd tell you where, but then we'd

have to delete you.

Computer-inspired haiku by Charles Matthews

I was involved in one project where the data holders could not be deterred from instituting a security policy wherein data access would be restricted to pre-approved users. Anyone wishing to query the database would first submit an application, which would include detailed information about themselves and their employer. The application required users to explain how they intended to use the resource, providing a description of their data project. Supplying this information was a warm-up exercise for the next step.

A screening committee composed primarily of members of the Big Data team would review the submitted application. A statistician would be consulted to determine if the applicant's plan was feasible. The committee would present their findings to an executive committee that would compare each application's merits against those of the other applicants. The very best applications would be approved for data access.

The data team could not seem to restrain their enthusiasm for adding layers of complexity to the security system. They decided that access to data would be tiered. Some users would be given less access to data than other users. No users would be given free access to the entire set of data. No user would have access to individual deidentified records; only aggregate views of record data would be released. A system would be designed to identify users and to restrict data access based on the identity and assigned access status.

These security measures were unnecessary. The data in the system had been rendered harmless via deidentification and could be distributed without posing any risk to the data subjects or to the data providers. The team seemed oblivious to the complexities engendered by a tiered access system. Bruce Schneier, a widely cited security expert, wrote an essay entitled, “A plea for simplicity: you can't secure what you don't understand” [13]. In this essay, he explained that as you add complexity to a system, the system becomes increasingly difficult to secure. I doubted that the team had the resources or the expertise to implement a complex, multi-tiered access system for a Big Data resource. I suspected that if the multi-tiered access system were actually put into place, the complexity of the system would render the resource particularly vulnerable to attack. In addition, the difficulty of accessing the system would discourage potential users and diminish the scientific value of the Big Data resource.

Many data holders believe that their job, as responsible stewards of data, is to deny data access to undeserving individuals and to ensure that any incorrect conclusions drawn from their data will never see the light of day. I have seen examples wherein the data holders require data users to sign an agreement indicating that the results of their analyses must be submitted back to the data holders before being released to the public in the form of manuscripts, public announcements, or conference presentations. The data holders typically reserve the right to forbid releasing results with which they disapprove. It is easy to see that a less-than-saintly committee might disapprove results that cast their Big Data resource in a bad light, or results that compete in any way with the products of their own research, or results that they hold in disfavor for any capricious reason whatsoever.

Aside from putting strict restrictions on who gets access to data, and which results are permitted to be published, it is commonplace to impose strict restrictions on how the data can be viewed. Anyone who has visited online databases is familiar with the query box. The idea is that the user enters a query and waits for some output to appear on the screen. The assumption here is that the user knows how the query must be composed to produce the most complete output. Of course, this is never the case. When a user enters a query, she cannot know, in advance, whether some other query term might have yielded a better output. Such query boxes almost never return details about the data set or the algorithm employed in responding to the query. It is difficult, under these circumstances, to imagine any scenario wherein these kinds of queries have any scientific merit.

If Big Data resources are to add significantly to the advancement of science, the kinds of complex and stingy data sharing practices that have evolved over the past few decades must face extinction.

Section 9.4. Case Study: Utilities for Viewing and Searching Large Files

It isn't that they can't see the solution. It's that they can't see the problem.

G. K. Chesterton

In Section 9.1, we discussed the importance of looking at your data, and free and open source text editors (vi/vim and Emacs) were suggested. These text editors can open immense files (gigabytes in length and longer), but they have their limits. Files much larger than a gigabyte may be slow to load, or may be unloadable on systems with small memory capacity. In such cases, your computer's operating system may offer a convenient alternative to text editors.

In the Windows operating system, you can read any text file, one screen at a time, with the “more” command.

For example, on Windows systems, at the prompt:

  • c:\>type huge_file.txt | more

The first lines from the huge_file.txt file will fill the screen, and you can proceed through the file by pressing and holding the < Enter > key. [Glossary Line]

Using this simple command, you can assess the format and general organization of any file. For the uninitiated, ASCII data files are inscrutable puzzles. For those who take a few moments to learn the layout of the record items, ASCII records can be read and understood, much like any book.

In contrast, Unix and Linux systems offer the “less” command, which functions much like the Windows “more” command but provides many additional options. At the Unix (or Linux) system prompt, type the following command (substituting your preferred file for “huge_file.txt”):

$ less huge_file.txt

This will load a screen-sized chunk of huge_file.txt onto your monitor. Pressing the “Enter” key or the “down arrow” key scrolls additional lines onto the monitor, one line at a time. For fast screen scrolls, keep your finger on the “Page Down” key. The “Page Up” key lets you page back through the file.

The less command accommodates various options.

$ less -S huge_file.txt

The -S switch cuts off line wrap so that the lines are truncated at the edge of the screen. In general, this speeds up the display.

When you use the Unix “less” command, you will find that the last line at the bottom of the screen is a “:”. The “:” is a prompt for additional instructions. If you were to enter a slash character (“/”) followed by a word or phrase or regex pattern, you would immediately see the line in which the first occurrence of your search term appeared. If you typed “&” and the pattern, at the “:” prompt, you would see all the lines from the file, in which your search pattern appears.

The Unix “less” command is a versatile and fast utility for viewing and searching very large files. If you do not use Unix systems, do not despair. Windows users can install Cygwin, a free Unix-like interface. Cygwin, and supporting documentation, can be downloaded from https://www.cygwin.com.

Cygwin opens in a window that produces a shell prompt (equivalent to Windows C prompt) from which Unix programs can be launched. For myself, I use Cygwin primarily as a source of Unix and Linux utilities, of which there are hundreds. In addition, Cygwin comes bundled with some extremely useful applications, such as Perl, Python, OpenSSL, and Gnuplot.

Windows users are not restricted to launching Unix and Linux applications from within the Cygwin shell prompt. A command line from the Windows C prompt will launch Cygwin utilities. For example:

c:\cygwin64\bin> wc temp.txt
 11587 217902 1422378 temp.txt

The command “wc temp.txt” launched the Unix/Linux word count utility (“wc”) from the Windows C prompt, yielding a count of the lines, words, and bytes in the temp.txt file. Likewise, a system call from a Python script can invoke Cygwin utilities and applications.
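The following is a minimal sketch, in Python, of such a system call; it assumes that “wc” is on the system path (natively on Unix/Linux, or via Cygwin on Windows) and that a file named temp.txt exists in the working directory.

# Invoke a command line utility from a Python script.
import subprocess

result = subprocess.run(["wc", "temp.txt"], capture_output=True, text=True)
print(result.stdout)     # lines, words, and bytes in temp.txt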

Big Data scientists eventually learn that there are some tasks that are best left to Unix/Linux. Having Cygwin installed on your Windows system will make life easier for you, and for your collaborators, who may prefer to work in Linux.

Section 9.5. Case Study: Flattened Data

Everything should be made as simple as possible, but not simpler.

Albert Einstein

Data flattening is a term that is used differently by data analysts, database experts, and informaticians. Though the precise meaning changes from subfield to subfield, the term always seems to connote a simplification of the data and the elimination of unnecessary structural restraints.

In the field of informatics, data flattening is a popular but ultimately counter-productive method of data organization and data reduction. Data flattening involves removing data annotations that are not needed for the interpretation of data [5].

Imagine, for the sake of illustration, a drastic option that was seriously considered by a large medical institution. This institution, which shall remain nameless, had established an excellent Electronic Medical Record (EMR) system. The EMR assigns a unique and permanent identifier string to each patient and attaches the identifier string to every hospital transaction involving the patient (e.g., biopsy reports, pharmacy reports, nursing notes, laboratory reports). All of the data relevant to a patient, produced anywhere within the hospital system, is linked by the patient's unique identifier. The patient's EMR can be assembled, instantly, whenever needed, via a database query.

Over time, the patient records in well-designed information systems accrue a huge number of annotations (e.g., time stamped data elements, object identifiers, linking elements, metadata). The database manager is saddled with the responsibility of maintaining the associations among all of the annotations. For example, an individual with a particular test, conducted at a particular time, on a particular day, will have annotations that link the test to a test procedure protocol, an instrument identifier, a test code, a laboratory name, a test sample, a sample accession time, and so on. If data objects could be stripped of most of their annotations, after some interval of time, then it would reduce the overall data management burden on the hospital information system. This can be achieved by composing simplified reports and deleting the internal annotations. For example, all of the data relevant to a patient's laboratory test could be reduced to the patient's name, the date, the name of the test, and the test result. All of the other annotations can be deleted. This process is called data flattening.
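To make the trade-off concrete, the following is a minimal sketch, in Python, that contrasts a fully annotated laboratory record with its flattened counterpart; the field names and values are hypothetical.

# Data flattening: the readable report survives, but the annotations that
# link the result to its protocol, instrument, and laboratory are lost.
annotated_record = {
    "patient_id": "75898039563441",
    "test_code": "GLU-2784",
    "test_name": "serum glucose",
    "result": "95 mg/dL",
    "timestamp": "2017-05-02T17:46:47-04:00",
    "instrument_id": "CHEM-ANALYZER-07",
    "protocol_id": "PROT-GLU-v3",
    "laboratory": "Main campus chemistry lab",
    "accession_time": "2017-05-02T16:02:11-04:00",
}

flattened_record = {key: annotated_record[key]
                    for key in ("patient_id", "timestamp", "test_name", "result")}

print(flattened_record)   # the flattened record can no longer be verified or merged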

Should a medical center, or any entity that collects data, flatten their data? The positive result would be a streamlining of the system, with a huge reduction in annotation overhead. The negative result would be the loss of the information that connects well-defined data objects (e.g., test result with test protocol, test instrument with test result, name of laboratory technician with test sample, name of clinician with name of patient). Because the fundamental activity of the data scientist is to find relationships among data objects, data flattening will reduce the scope and value of data repurposing projects. Without annotations and metadata, the data from different information systems cannot be sensibly merged. Furthermore, if there is a desire or a need to reanalyze flattened data, then the data scientist will not be able to verify the data and validate the conclusions drawn from the data [5]. [Glossary Verification and validation, Validation]

Glossary

Abandonware Software that is abandoned (e.g., no longer updated, supported, distributed, or sold) after its economic value is depleted. In academic circles, the term is often applied to software that is developed under a research grant. When the grant expires, so does the software. Most of the software in existence today is abandonware.

Age-adjusted incidence An age-adjusted incidence is the crude incidence of disease occurrence within an age category (e.g., age 0–10 years, age 70–80 years), weighted against the proportion of persons in the age groups of a standard population. When we age-adjust incidence, we cancel out the changes in the incidence of disease occurrence, in different populations, that result from differences in the proportion of people in different age groups. For example, suppose you were comparing the incidence of childhood leukemia in two populations. If the first population has a large proportion of children, then it will likely have a higher number of childhood leukemia cases, compared with another population with a low proportion of children. To determine whether the first population has a true, increased rate of leukemia, we need to adjust for the differences in the proportion of young people in the two populations [14].

Annotation Annotation involves describing data elements with metadata or attaching supplemental information to data objects.

Anonymization versus deidentification Anonymization is a process whereby all the links between an individual and the individual's data record are irreversibly removed. The difference between anonymization and deidentification is that anonymization is irreversible. There is no method for re-establishing the identity of the patient from anonymized records. Deidentified records can, under strictly controlled circumstances, be reidentified. Reidentification is typically achieved by entrusting a third party with a confidential list that maps individuals to deidentified records. Obviously, reidentification opens another opportunity for harming individuals, if the confidentiality of the reidentification list is breached. The advantage of reidentification is that suspected errors in a deidentified database can be found, and corrected, if permission is obtained to reidentify individuals. For example, if the results of a study based on blood sample measurements indicate that the original samples were mislabeled, it might be important to reidentify the samples and conduct further tests to resolve the issue. In a fully anonymized data set, the opportunities for verifying the quality of data are highly limited.

Correlation distance Also known as correlation score. The correlation distance provides a measure of similarity between two variables. Two similar variables will rise and fall together [15,16]. The Pearson correlation score is popular, and can be easily implemented [3,17]. It produces a score that varies from − 1 to 1. A score of 1 indicates perfect correlation; a score of − 1 indicates perfect anti-correlation (i.e., one variable rises while the other falls). A Pearson score of 0 indicates lack of correlation. Other correlation measures can be applied to Big Data sets [15,16].

Dark data Unstructured and ignored legacy data, presumed to account for most of the data in the “infoverse”. The term gets its name from “dark matter” which is the invisible stuff that accounts for most of the gravitational attraction in the physical universe.

Data annotation The process of supplementing data objects with additional data, often providing descriptive information about the data (i.e., metadata, identifiers, time information, and other forms of information that enhance the utility of the data object).

Data repurposing Involves using old data in new ways that were not foreseen by the people who originally collected the data. Data repurposing comes in the following categories: (1) Using the preexisting data to ask and answer questions that were not contemplated by the people who designed and collected the data; (2) Combining preexisting data with additional data, of the same kind, to produce aggregate data that suits a new set of questions that could not have been answered with any one of the component data sources; (3) Reanalyzing data to validate assertions, theories, or conclusions drawn from the original studies; (4) Reanalyzing the original data set using alternate or improved methods to attain outcomes of greater precision or reliability than the outcomes produced in the original analysis; (5) Integrating heterogeneous data sets (i.e., data sets with seemingly unrelated types of information), for the purpose of answering questions or developing concepts that span diverse scientific disciplines; (6) Finding subsets in a population once thought to be homogeneous; (7) Seeking new relationships among data objects; (8) Creating, on-the-fly, novel data sets through data file linkages; (9) Creating new concepts or ways of thinking about old concepts, based on a reexamination of data; (10) Fine-tuning existing data models; and (11) Starting over and remodeling systems [5].

Data science A vague term encompassing all aspects of data collection, organization, archiving, distribution, and analysis. The term has been used to subsume the closely related fields of informatics, statistics, data analysis, programming, and computer science.

Data sharing Providing one's own data to another person or entity. This process may involve free or purchased data, and it may be done willingly, or under coercion, as in compliance with regulations, laws, or court orders.

Data versus datum The singular form of data is datum, but the word “datum” has virtually disappeared from the computer science literature. The word “data” has assumed both a singular and plural form. In its singular form, it is a collective noun that refers to a single aggregation of many data points. Hence, current usage would be “The data is enormous,” rather than “These data are enormous.”

Fourier series Periodic functions (i.e., functions with repeating trends in the data, including waveforms and periodic time series data) can be represented as the sum of oscillating functions (i.e., functions involving sines, cosines, or complex exponentials). The summation function is the Fourier series.

ISO metadata standard ISO 11179 is the standard produced by the International Standards Organization (ISO) for defining metadata, such as XML tags. The standard requires that the definitions for metadata used in XML (the so-called tags) be accessible and should include the following information for each tag: Name (the label assigned to the tag), Identifier (the unique identifier assigned to the tag), Version (the version of the tag), Registration Authority (the entity authorized to register the tag), Language (the language in which the tag is specified), Definition (a statement that clearly represents the concept and essential nature of the tag), Obligation (indicating whether the tag is required), Datatype (indicating the type of data that can be represented in the value of the tag), Maximum Occurrence (indicating any limit to the repeatability of the tag), and Comment (a remark describing how the tag might be used).

Identification The process of providing a data object with an identifier, or the process of distinguishing one data object from all other data objects on the basis of its associated identifier.

Identifier A string that is associated with a particular thing (e.g., person, document, transaction, data object), and not associated with any other thing [18]. In the context of Big Data, identification usually involves permanently assigning a seemingly random sequence of numeric digits (0–9) and alphabet characters (a–z and A–Z) to a data object. The data object can be a class of objects.

Line A line in a non-binary file is a sequence of characters that terminates with an end-of-line character. The end-of-line character may differ among operating systems. For example, the DOS end-of-line character is ASCII 13 (i.e., the carriage return character) followed by ASCII 10 (i.e., the line feed character), simulating the new line movement in manual typewriters. The Linux end-of-line character is ASCII 10 (i.e., the line feed character only). When programming in Perl, Python, or Ruby, the newline character is represented by "\n" regardless of which operating system or file system is used. For most purposes, use of "\n" seamlessly compensates for discrepancies among operating systems with regard to their preferences for end-of-line characters. Binary files, such as image files or telemetry files, have no designated end-of-line characters. When a file is opened as a binary file, any end-of-line characters that happen to be included in the file are simply ignored as such, by the operating system.

Open source Software is open source if the source code is available to anyone who has access to the software.

Power law A mathematical formula wherein a particular value of some quantity varies as an inverse power of some other quantity [19,20]. The power law applies to many natural phenomena and describes the Zipf distribution or Pareto's principle. The power law is unrelated to the power of a statistical test.

Power series A power series of a single variable is an infinite sum of increasing powers of x, multiplied by constants. Power series are very useful because it is easy to calculate the derivative or the integral of a power series, and because different power series can be added and multiplied together. When the high exponent terms of a power series are small, as happens when x is less than one, or when the constants associated with the higher exponents all equal 0, the series can be approximated by summing only the first few terms. Many different kinds of distributions can be represented as a power series. Distributions that cannot be wholly represented by a power series may sometimes be segmented by ranges of x. Within a segment, the distribution might be representable as a power series. A power series should not be confused with a power law distribution.

Text editor A text editor (also called ASCII editor) is a software application designed to create, modify, and display simple unformatted text files. Text editors are different from word processors, which are designed to include style, font, and other formatting symbols. Text editors are much faster than word processors because they display the contents of files without having to interpret and execute formatting instructions. Unlike word processors, text editors can open files of enormous size (e.g., gigabyte range).

Time A large portion of data analysis is concerned, in one way or another, with the times that events occur, the times that observations are made, or the times that signals are sampled. Here are three examples that demonstrate why this is so: (1) most scientific and predictive assertions relate how variables change with respect to one another, over time; (2) a single data object may have many different data values, over time, and only timing data will tell us how to distinguish one observation from another; and (3) computer transactions are tracked in logs, and logs are composed of time-annotated descriptions of the transactions. Data objects often lose their significance if they are not associated with an accurate time measurement. Because modern computers easily capture accurate time data, there is no excuse for not annotating every data point with the time at which it was measured.

Time stamp Many data objects are temporal events and all temporal events must be given a time stamp indicating the time that the event occurred, using a standard measurement for time. The time stamp must be accurate, persistent, and immutable. The Unix epoch time (equivalent to the Posix epoch time) is available for most operating systems and consists of the number of seconds that have elapsed since January 1, 1970, midnight, Greenwich mean time. The Unix epoch time can easily be converted into any other standard representation of time. The duration of any event can be easily calculated by subtracting the beginning time from the ending time. Because the timing of events can be maliciously altered, scrupulous data managers employ a trusted time stamp protocol by which a time stamp can be verified. Trusted time stamp protocols are discussed in Section 8.5, “Case Study: The Trusted Time Stamp.”
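As a minimal sketch (using only Python's standard time module; the one-second pause stands in for whatever event is being timed), the Unix epoch time can be captured, converted to a readable representation, and used to compute a duration:
import time

# Capture the Unix epoch time: seconds elapsed since
# January 1, 1970, midnight, Greenwich mean time.
start = time.time()

time.sleep(1)  # stand-in for the event being timed
end = time.time()

# Convert the epoch time to a standard, human-readable representation.
print(time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(start)))

# The duration of the event is the ending time minus the beginning time.
print("duration in seconds:", end - start)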

Universal and perpetual Wherein a set of data or methods can be understood and utilized by anyone, from any discipline, at any time. It is a tall order, but a worthy goal. Much of the data collected over the centuries of recorded history is of little value because it was never adequately described when it was recorded (e.g., unknown time of recording, unknown source, unfamiliar measurements, unwritten protocols). Efforts to resuscitate large collections of painstakingly collected data are often abandoned simply because there is no way of verifying, or even understanding, the original data [5]. Data scientists who want their data to serve for posterity should use simple specifications, and should include general document annotations such as the Dublin Core. The importance of creating permanent data is discussed elsewhere [6].

Validation Involves demonstrating that the conclusions that come from data analyses fulfill their intended purpose and are consistent [21]. You validate a conclusion (which may appear in the form of a hypothesis, a statement about the value of a new laboratory test, or a therapeutic protocol) by showing that you draw the same conclusion repeatedly whenever you analyze relevant data sets, and that the conclusion satisfies some criteria for correctness or suitability. Validation is somewhat different from reproducibility. Reproducibility involves getting the same measurement over and over when you perform the test. Validation involves drawing the same conclusion over and over.

Verification and validation As applied to data resources, verification is the process that ensures that data conforms to a set of specifications. Validation is the process that checks whether the data can be applied in a manner that fulfills its intended purpose. This often involves showing that correct conclusions can be obtained from a competent analysis of the data. For example, a Big Data resource might contain position, velocity, direction, and mass data for the earth and for a meteor that is traveling sunwards. The data may meet all specifications for measurement, error tolerance, data typing, and data completeness. A competent analysis of the data indicates that the meteor will miss the earth by a safe 50,000 miles, plus or minus 10,000 miles. If the meteor nonetheless smashes into the earth, destroying all planetary life, then an extraterrestrial observer might conclude that the data was verified, but not validated.

Waveform A graph showing a signal's amplitude over time. By convention, the amplitude of the signal is shown on the y-axis, while the time is shown on the x-axis. A .wav file can be easily graphed as a waveform, in Python.
The waveform.py script graphs a sample .wav file, alert.wav, but any handy .wav file should suffice (Fig. 9.3).
from scipy.io.wavfile import read
import matplotlib.pyplot as plt

# read() returns a two-item tuple, with the sampling rate as
# the 0th item and the audio samples as the 1st item
input_data = read("alert.wav")
audio = input_data[1]

# plot the first 4096 samples
plt.plot(audio[0:4096])
plt.xlabel("time (samples) at rate " + str(input_data[0]))
plt.show()

Fig. 9.3 The plotted waveform of a .wav file, alert.wav.

References

[1] Levenberg K. A method for the solution of certain non-linear problems in least squares. Q Appl Math. 1944;2:164–168.

[2] Marquardt D.W. An algorithm for the least-squares estimation of nonlinear parameters. J SIAM Appl Math. 1963;11:431–441.

[3] Berman J.J. Methods in medical informatics: fundamentals of healthcare programming in Perl, Python, and Ruby. Boca Raton: Chapman and Hall; 2010.

[4] Lee J., Pham M., Lee J., Han W., Cho H., Yu H., et al. Processing SPARQL queries with regular expressions in RDF databases. BMC Bioinf. 2011;12:S6.

[5] Berman J.J. Repurposing legacy data: innovative case studies. Waltham, MA: Morgan Kaufmann; 2015.

[6] Berman J.J. Data simplification: taming information with open source tools. Waltham, MA: Morgan Kaufmann; 2016.

[7] Berman J.J. Biomedical informatics. Sudbury, MA: Jones and Bartlett; 2007.

[8] Berman J.J. Principles of big data: preparing, sharing, and analyzing complex information. Waltham, MA: Morgan Kaufmann; 2013.

[9] Leach P., Mealling M., Salz R. A universally unique identifier (UUID) URN namespace. Network Working Group, Request for Comment 4122, Standards Track. Available from: http://www.ietf.org/rfc/rfc4122.txt [viewed November 7, 2017].

[10] Mealling M. RFC 3061. A URN namespace of object identifiers. Network Working Group; 2001. Available from: https://www.ietf.org/rfc/rfc3061.txt [viewed January 1, 2015].

[11] Klyne G., Newman C. Date and time on the internet: timestamps. Network Working Group, Request for Comments RFC 3339. Available from: http://tools.ietf.org/html/rfc3339 [viewed September 15, 2015].

[12] Downey A.B. Think DSP: digital signal processing in Python, version 0.9.8. Needham, MA: Green Tea Press; 2014.

[13] Schneier B. A plea for simplicity: you can't secure what you don't understand. Information Security; November 19, 1999. Available from: http://www.schneier.com/essay-018.html [viewed July 1, 2015].

[14] Berman J.J. Rare diseases and orphan drugs: keys to understanding and treating common diseases. Cambridge, MD: Academic Press; 2014.

[15] Reshef D.N., Reshef Y.A., Finucane H.K., Grossman S.R., McVean G., Turnbaugh P.J., et al. Detecting novel associations in large data sets. Science. 2011;334:1518–1524.

[16] Szekely G.J., Rizzo M.L. Brownian distance covariance. Ann Appl Stat. 2009;3:1236–1265.

[17] Lewis P.D. R for medicine and biology. Sudbury: Jones and Bartlett Publishers; 2009.

[18] Paskin N. Identifier interoperability: a report on two recent ISO activities. D-Lib Mag. 2006;12:1–23.

[19] Newman M.E.J. Power laws, Pareto distributions and Zipf's law. Contemp Phys. 2005;46:323–351.

[20] Clauset A., Shalizi C.R., Newman M.E.J. Power-law distributions in empirical data. SIAM Rev. 2009;51:661–703.

[21] Committee on Mathematical Foundations of Verification, Validation, and Uncertainty Quantification; Board on Mathematical Sciences and Their Applications, Division on Engineering and Physical Sciences, National Research Council. Assessing the reliability of complex models: mathematical and statistical foundations of verification, validation, and uncertainty quantification. National Academy Press; 2012. Available from: http://www.nap.edu/catalog.php?record_id=13395 [viewed January 1, 2015].
