14

Special Considerations in Big Data Analysis

Abstract

Big Data statistics are plagued by several intrinsic and intractable problems. When the amount of data is sufficiently large, you can find almost anything you seek lurking somewhere within. Such findings may have statistical significance without having any practical significance. Also, whenever you select a subset of data from an enormous collection, you may have no way of knowing the relevance of the data that you excluded. Most importantly, Big Data resources cannot be designed to examine every conceivable hypothesis. Many types of analytic errors ensue when a Big Data resource is forced to respond to questions that it cannot possibly answer. The purpose of this chapter is to provide general recommendations for the responsible use of analytic methods while avoiding some of the pitfalls in Big Data analysis.

Keywords

Curse of dimensionality; Bigness bias; Analysis pitfalls; Fixing data errors; Combining data; Overfitting

Section 14.1. Theory in Search of Data

If triangles had a god, they would give him three sides.

Voltaire

Here is a riddle: “Which came first, the data, or the data analyst?” The intuitive answer would be that data precedes the data analyst. Without data, there really is no reason for the data analyst to exist. In the Big Data universe nothing is as it seems, and the data analyst commonly precedes the data. All too often the analyst develops a question or a hypothesis or a notion of what the facts “should be,” and then goes about rummaging through the Big Data resource until he or she has created a data set that proves the point.

Several intrinsic flaws plague Big Data statistics. When the amount of data is sufficiently large, you can find almost anything you seek lurking somewhere within. Such findings may have statistical significance without having any practical significance. Also, whenever you select a subset of data from an enormous collection, you may have no way of knowing the relevance of the data that you excluded. Most importantly, Big Data resources cannot be designed to examine every conceivable hypothesis. Many types of analytic errors ensue when a Big Data resource is forced to respond to questions that it cannot possibly answer. The purpose of this chapter is to provide general recommendations for the responsible use of analytic methods, while avoiding some of the pitfalls in Big Data analysis.

We cannot escape the dangerous practice of imposing models on selected sets of data. Historians, who have the whole of human history to study, are just as guilty as technical data analysts in this regard. Consider this hypothetical example: the United States is on the brink of a military intervention against entrenched and hostile revolutionaries on the other side of the globe. Two historians are asked to analyze the situation and render their opinions. The first historian compares the current crisis to the entrance of the United States into World War II. World War II worked out well for the United States. The first historian insists that World War II is a useful model for today's emergency and that we should engage our military against the current threat. The second historian says that the current crisis is very much like the crisis that preceded the Vietnam War. The Vietnam War did not turn out well for United States interests, and it would be best if we avoided direct military involvement in the current emergency. When you have all of history from which to choose, you can select any set of data that supports your biases. As humans, we do this all the time, whenever we make decisions.

Scientists have accused their peers of developing models for the purpose of reinforcing belief in their own favorite paradigms [1]. Big Data will not help advance science if analysts preferentially draw data to support their previously held biases. One of the important tasks for Big Data analysts will involve developing methods for creating unbiased models from Big Data. In the meantime, there is no practical way to validate conclusions drawn from Big Data, other than to test the hypothesis on additional data sets.

Section 14.2. Data in Search of Theory

Without highly specified a-priori hypotheses, there are hundreds of ways to analyse the dullest data set.

John P A Ioannidis [2]

In the prior section the point was made that data analysts can abuse Big Data if data is selected to confirm a hypothesis. In this section the point is made that scientists must approach their analysis with some theory or model in mind; otherwise, they will choose a hypothesis to fit their data, even if the hypothesis makes no sense. [Glossary Multiple comparisons bias]

Here is a good example. Suppose I am at a shooting range and shoot ten shots at a bull's eye target. I can measure the distance of each bullet from the center of the target, from which I would develop some type of score with which I could compare my marksmanship against that of others. Now, imagine shooting ten shots at a wall that has no target. I may find that six of the bullets clustered very close together. I could then superimpose a bull's eye target over the bullet holes, placing the center of the target over the center of the tight cluster of bullet holes. A statistician analyzing the data might find that the six tightly clustered bullet holes at the center of the bull's eye indicated that I scored very well and that it was highly likely that I had better aim than others (who had actually aimed at the target). Scientists who troll large data sets will always find clusters of data upon which they can hang a bull's eye. Statisticians provided with such data can be tricked into confirming a ridiculous hypothesis that was contrived to fit the data. This deceptive practice is referred to as moving the target to the bullet hole.

Big Data analysts walk a thin line. If they start their project with a preconceived theory, then they run the risk of choosing a data set that confirms their bias. If they start their project without a theory, then they run the risk of favoring a false hypothesis that happens to fit their data. [Glossary Type errors]

Is there a correct approach to Big Data analysis? It is important to remember that a scientific theory is a plausible explanation of observations. Theories are always based on some set of pre-existing principles that are generally accepted as truth. When a scientist approaches a large set of data, he or she asks whether a set of commonly held principles will extend to the observations in the current set of data. Reconciling what is known with what is observed accounts for much of the activity of scientists.

For Big Data projects, holding an a priori theory or model is almost always necessary; otherwise, the scientist is overwhelmed by the options. Adequate analysis can be ensured if four conditions are met:

  1. All of the available data is examined, or a subset is prepared from a random sampling (i.e., no cherry-picking data to fit the theory).
  2. The analyst must be willing to modify or abandon the theory, if it does not fit the data.
  3. The analyst must not believe that fitting the theory to the data validates the theory. Theories must be tested against multiple sets of data.
  4. The analyst must accept that the theory may be wrong, even if it is validated. Validation is not proof that a theory is correct. It is proof that the theory is consistent with all of the observed data. A better theory may also be consistent with the observed data and may provide a true explanation of the observations.

One of the greatest errors of Big Data analysts is to believe that data models are tethered to reality; they seldom are. Models are made to express data sets as formulas or as a system that operates under a set of rules. When the data are numeric representations of physical phenomena, it may sometimes be possible to link the model to a physical law. For example, repeated measurements of force, mass, and acceleration observed on moving bodies might produce a formula that applies consistently, at any time, any place, and with any object (i.e., f = ma). Most mathematical models are abstract, and cannot be ranked as physical laws. At best, they provide a quick glimpse of an ephemeral reality.

Section 14.3. Bigness Biases

Every increased possession loads us with new weariness.

John Ruskin

Because Big Data methods use enormous sets of data, there is a tendency to give the results more credence than would be given to a set of results produced from a small set of data. This is almost always a mistaken belief. In fact, Big Data is seldom a complete or accurate data collection. You can expect most Big Data resources to be selective, intentionally or not, for the data that is included and excluded from the resource. When dealing with Big Data, expect missing values, missing records, “noisy” data, huge variations in the quality of records, plus any and all of the inadequacies found in small data resources. Nevertheless, the belief that Big Data is somehow more reliable, and more useful than smaller data is pervasive in the science community.

When a study is done on a very large number of human subjects (or with a very large number of samples), each annotated with a large number of observations, there is a tendency to accept the results, even when the results defy intuition. In 2007, a study using the enormous patient data set held by the U.S. Veterans Administration Medical Centers reported that the use of statins reduced the risk of developing lung cancer by about half [3]. The study, which involved nearly half a million patients, showed that the reduction in cancer risk held whether patients were smokers or non-smokers. The highest reduction in lung cancers (77%) occurred in people who had taken statins for four years or longer [3].

The potential importance of this study cannot be overestimated. Lung cancer is the most common cause of cancer deaths in the United States. A 77% reduction in lung cancer incidence would prevent the cancer deaths of about 123,000 U.S. residents each year. This number is equivalent to the total number of cancer deaths attributed each year to prostate cancer, breast cancer and colon cancer combined [4]!

As it happens, these marvelous findings were as unintuitive as they were exciting. Statins are widely used drugs that reduce the blood levels of cholesterol and various other blood lipids. There is absolutely nothing known about the biology of statins that would lead anyone to suspect that these drugs would lower the incidence of lung cancer, or any other cancer, for that matter. It is always risky to accept a scientific conclusion without some sort of biological mechanism to explain the results.

In 2011, a second study, by another group of researchers, was published on the effect of statins on lung cancer incidence. This study was also big, using about 133,000 patients. The results failed to show any effect of statins on lung cancer incidence [5]. That same year, a third study, using a population of about 365,000 people, also failed to find any influence of statins on the incidence of lung cancer [6]. The authors of the negative studies blamed the misleading results of the first study on time-window bias.

To understand time-window bias, consider the undisputed observation that Nobel prize laureates live longer than other scientists. It would seem that scientists who want to live a long life should try their utmost to win a Nobel prize. Likewise, Popes live longer than other clergymen. If you are a priest, and you want to live long, aim for the Papacy. Both these biases are based on time-window conditions. The Nobel prize committee typically waits decades to determine whether a scientific work is worthy of the Nobel prize, and the prize is only awarded to living scientists. Would-be Nobelists who die before their scientific career begins, and accomplished scientists who die before their works are deemed Nobel-worthy, are omitted from the population of potential winners. Similarly, the Vatican seldom confers the Papacy on its junior clergy. The time-window surrounding Nobel winners and Popes skews their observed longevities upwards. Time-window bias is just one of a general class of biases wherein studies are invalidated by the pre-conditions imposed on the studies [7]. [Glossary Time-window bias]

Time-window bias affected the original large patient-based study because a population that had taken a statin for four years, without dying in the interim, was compared to a general population. Basically, the study imposed a cancer-free 4-year window for the treated population, artifactually conferring a lower cancer incidence on the statin-treated group.
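To see how the artifact arises, here is a minimal Python simulation using hypothetical risk figures (not the actual study data). Every subject carries the same annual cancer risk, but one group is only counted after remaining cancer-free for a 4-year window, and its apparent incidence drops even though nothing about its underlying risk differs.

import random

random.seed(1)
ANNUAL_RISK = 0.002        # hypothetical yearly probability of a lung cancer diagnosis
FOLLOW_UP_YEARS = 10
WINDOW_YEARS = 4           # the "treated" group is counted only after 4 cancer-free years

def year_of_diagnosis():
    """Return the year of diagnosis, or None if no cancer occurs during follow-up."""
    for year in range(FOLLOW_UP_YEARS):
        if random.random() < ANNUAL_RISK:
            return year
    return None

N = 500_000
outcomes = [year_of_diagnosis() for _ in range(N)]

# General population: every diagnosis during follow-up counts.
general = sum(d is not None for d in outcomes) / N

# Time-window group: same underlying risk, but anyone diagnosed in the first
# 4 years is excluded from the denominator, as the windowed study design requires.
windowed = [d for d in outcomes if d is None or d >= WINDOW_YEARS]
window_rate = sum(d is not None for d in windowed) / len(windowed)

print(f"incidence, general population: {general:.4f}")
print(f"incidence, 4-year-window group: {window_rate:.4f}")
# The windowed group shows a markedly lower incidence, purely as an artifact
# of the imposed cancer-free window.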

The point here is simply that analytic errors occur just as easily in studies of large populations as they do in studies involving a small number of individuals. Because Big Data analysis tends to be complex, and difficult for anyone to thoroughly review, the chances of introducing Big Data errors are greater than the chances of introducing small data errors.

In the late 1990s, interest was growing in the medical research community in biomarkers for cancers. It was believed then, as it is believed now, that the different types of cancers must contain biological markers to tell us everything we need to know about their clinical behaviors. Such biomarkers would be used to establish the presence of a cancer, the precise diagnosis of the cancer, the stage of the cancer (i.e., the size and the extent of spread of the cancer), and to predict the response of the cancer to any type of treatment. By the turn of the century, there was a sense that useful cancer biomarkers were not forthcoming; the pipeline for new biomarkers had apparently dried up [8–11]. What was the problem? A gnawing suspicion held that biomarkers failed because we weren't collecting enough data. A consensus had grown that we were wasting cancer research funds on small-scale studies that were irreproducible. What we needed, or so everyone thought, were Big Data studies, producing lots of data, yielding trustworthy results based on many observations. If researchers abandoned their small studies, in favor of large studies, then the field would surely move forward at a rapid pace.

In the past two decades, biomarker studies have seen enormous successes. Surprisingly, though, much of the recent progress has come from relatively small genomic studies, on very rare cancers, with limited numbers of specimens [2,12–14]. Why has Big Data not yielded the kind of progress that nearly everyone expected?

When a new potential biomarker is discovered using large and complex sets of data and advanced analytic tools, it needs to be validated; and validation involves repeating the original study, and drawing the same set of conclusions [15,16]. As a general rule the more complex the experiment, the data, and the analysis, the less likely that it can be reproduced. In addition to these basic limitations on conclusions drawn from Big Data, we must remember that it can be very difficult to analyze systems whose complexity exceeds our comprehension. We assume, quite incorrectly, that given sufficient data, we can understand complex systems. There is nothing to support this kind of self-confidence. Biological systems are highly complex, and we do not, at this time, have a deep understanding of their workings. For that matter, we have very little understanding of the kinds of data that ought to be collected. We are slowly learning that it seldom helps to throw Big Data at a problem, before we have a thorough understanding of what we need to find. In the case of cancer biomarkers, it was much easier to find the key mutations that accounted for rare tumors than it was to find common biomarkers in a general population [12,13].

Still unconvinced that Bigness bias is a real concern for Big Data studies? In the United States, our knowledge of the causes of death in the population is based on death certificate data collected by the Vital Statistics Program of the National Center for Health Statistics. Death certificate data is notoriously faulty [17–19]. In most cases, the data in death certificates is supplied by clinicians, at or near the time of the patient's death, without benefit of autopsy results. In many cases, the clinicians who fill out the death certificate are not well trained for the task, often mistaking the mode of death (e.g., cardiac arrest, cardiopulmonary arrest) for the cause of death (e.g., the disease process leading to cardiac arrest or cardiopulmonary arrest), thus nullifying the intended purpose of the death certificate. Thousands of instructional pages have been written on the proper way to complete a death certificate. Nonetheless, these certificates are seldom completed in a consistent manner. Clinicians become confused when there are multiple, sometimes unrelated, conditions that contribute to the patient's death. Though death certificates are standardized throughout the United States, there are wide variations from state to state in the level of detail provided on the forms [20]. Despite all this, the venerable death certificate is the bedrock of vital statistics. What we know, or think we know, about the causes of death in the United States population, is based on an enormous repository, collected since 1935, of many millions of death certificates.

Why do we believe death certificate data when we know that death certificates are highly flawed? Again, it is the bigness factor that prevails. There seems to be a belief, based on nothing but wishful thinking, that if you have a very large data set, bad measurements will cancel themselves out, leaving a final result that comes close to a fair representation of reality. For example, if a clinician forgets to list a particular condition as a cause of death, another physician will mistakenly include the condition on another death certificate, thus rectifying the error.

The cancel-out hypothesis puts forward the delightful idea that whenever you have huge amounts of data, systemic errors cancel out in the long run, yielding conclusions that are accurate. Sadly, there is neither evidence nor serious theory to support this hypothesis. If you think about it, you will see that it makes no sense. One of the most flagrant weaknesses is the fact that it is impossible to balance something that must always be positive. Every death certificate contains a cause of death. You cannot balance a false positive cause of death with a false negative cause of death (i.e., there is no such thing as a negative cause of death). The same applies to numeric databases. An incorrect entry for 5000 pairs of shoes cannot be balanced by a separate incorrect entry for negative 5000 pairs of shoes; there is no such thing as a negative shoe. [Glossary Negative study bias]
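A small simulation makes the point concrete. In the sketch below (the causes and error rate are hypothetical), a systematic misclassification skews the recorded cause-of-death distribution, and collecting more certificates only sharpens the wrong answer; it never cancels the error.

import random

random.seed(0)
TRUE_MALARIA_FRACTION = 0.3   # hypothetical true fraction of deaths due to "malaria"
MISRECORD_RATE = 0.2          # systematic error: 20% of malaria deaths recorded as "influenza"

def recorded_fractions(n_certificates):
    counts = {"malaria": 0, "influenza": 0}
    for _ in range(n_certificates):
        true_cause = "malaria" if random.random() < TRUE_MALARIA_FRACTION else "influenza"
        recorded = true_cause
        if true_cause == "malaria" and random.random() < MISRECORD_RATE:
            recorded = "influenza"   # the error only ever moves counts in one direction
        counts[recorded] += 1
    return {cause: count / n_certificates for cause, count in counts.items()}

for n in (1_000, 1_000_000):
    print(n, recorded_fractions(n))
# The recorded malaria fraction converges to about 0.24, not the true 0.30.
# A bigger repository makes the biased estimate more precise, not more accurate.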

Perhaps the most prevalent type of bigness bias relates to the misplaced faith that complete data is representative data. Certainly, you might think that if a Big Data resource contains every measurement for a data domain, then biases imposed by insufficient sampling are eliminated. Danah Boyd, a social media researcher, draws a sharp distinction between Big-ness and Whole-ness [21]. She gives the example of a scientist who is exploring a huge data set of tweets collected by Twitter. If Twitter removes tweets containing expletives, or tweets composed of non-word character strings, or containing certain types of private information, then the resulting data set, no matter how large it may be, is not representative of the population of senders. If the tweets are available as a stripped-down set of messages, without any identifier for senders, then the compulsive tweeters (those who send hundreds or thousands of tweets) will be over-represented, and the one-time tweeters will be under-represented. If each tweet were associated with an account and all the tweets from a single account were collected as a unique record, then there would still be the problem created by tweeters who maintain multiple accounts. Basically, when you have a Big Data resource, the issue of sample representation does not disappear; it becomes more complex and less controlled. For Big Data resources lacking introspection and identifiers, data representation becomes an intractable problem.

  •   Too Much Data

Intuitively, you might think that the more data we have at our disposal, the more we can learn about the system that we are studying. This is not always the case. There are circumstances when more data simply takes you further and further from the solution you seek. As a trivial example, consider the perennial task of finding a needle in a haystack. As you add more hay, you make the problem harder to solve. You would be much better off if the haystack were small, consisting of a single straw, behind which lies your sought-after needle [22].

In the field of molecular biology the acquisition of whole genome sequencing on many individual organisms, representing hundreds of different species, has brought a flood of data, but many of the most fundamental questions cannot be answered when the data is complex and massive. Evolutionary biologists have invented a new term for a certain type of sequence data: “non-phylogenetic signal.” The term applies to DNA sequences that cannot yield any useful conclusions related to the classification of an organism, or its evolutionary relationships to other organisms.

Evolutionary geneticists draw conclusions by comparing DNA sequences in organisms, looking for similar, homologous regions (i.e., sequences that were inherited from a common ancestor). Because DNA mutations arise stochastically over time (i.e., at random locations in the gene, and at random times), unrelated organisms may attain the same sequence in a chosen stretch of DNA, without inheritance through a common ancestor. Such occurrences could lead to false inferences about the relatedness of different organisms. When mathematical phylogeneticists began modeling inferences for gene data sets, they assumed that most class assignment errors would be restricted to a narrow range of situations. This turned out not to be the case. In practice, errors due to non-phylogenetic signal arise from just about any mechanism that causes DNA to change over time (e.g., random mutations, adaptive convergence) [23,24]. At the moment, there seems to be an excess of genetic information. The practical solution seems to involve moving away from purely automated data analyses and using a step-by-step approach involving human experts who take into account independently acquired knowledge concerning the relationships among organisms and their genes.

  •   Overfitting

Overfitting occurs when a formula describes a set of data very closely, but does not predict the behavior of comparable data sets. In overfitting, the formula is said to describe the noise of the system, rather than the characteristic behavior of the system. Overfitting commonly occurs with models that perform iterative approximations on training data. Neural networks are an example of a data modeling strategy that is prone to overfitting. In general, the bigger the data set, the easier it is to overfit the model.

Overfitting is discovered by testing your predictor or model on one or several new sets of data [25]. If the data is overfitted the model will fail with the new data. It can be heartbreaking to spend months or years developing a model that works like a charm for your training data and for your first set of test data (collected from the same data set as your training data), but fails completely for a new set of data.
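The sketch below illustrates the idea with synthetic data (the data set and the choice of polynomial degrees are hypothetical). A degree-15 polynomial hugs the training points more closely than a straight line does, but it performs worse when scored against a fresh sample drawn from the same source.

import numpy as np

rng = np.random.default_rng(42)

def noisy_samples(n):
    """Synthetic data: a simple linear trend plus noise."""
    x = rng.uniform(-1, 1, n)
    y = 2.0 * x + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = noisy_samples(30)   # training data
x_test, y_test = noisy_samples(30)     # new data from the same source

for degree in (1, 15):
    coeffs = np.polyfit(x_train, y_train, degree)      # fit a polynomial of the given degree
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: training MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
# With this setup the degree-15 model typically shows a lower training error but a
# higher test error than the straight line; it has described the noise, not the system.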

Overfitting can sometimes be avoided by evaluating the model before it has been fitted to a mathematical formula, often during the data reduction stage. There are a variety of techniques that will produce a complex formula fitted to all your variables. It might be better to select just a few variables from your data that you think are most relevant to the model. You might try a few mathematical relationships that seem to describe the data plotted for the subset of variables. A formula built from an intuitive understanding of the relationships among variables may sometimes serve much better than a formula built to fit a multi-dimensional data set. [Glossary Data reduction]

Section 14.4. Data Subsets in Big Data: Neither Additive Nor Transitive

If you're told that a room has 3 people inside, and you count 5 people exiting the room, a mathematician would feel compelled to send in 2 people to empty it out.

Anon

It is often assumed that Big Data has one enormous advantage over small data: that sets of Big Data can be merged to create large populations that reinforce or validate conclusions drawn from small studies. This assumption is simply incorrect. In point of fact, it is possible to draw the same conclusion from two sets of data, only to draw an opposite conclusion when the two sets of data are combined. This phenomenon, well known to statisticians as Simpson's paradox, has particular significance when Big Data resources combine observations collected from multiple populations.

One of the most famous examples of Simpson's paradox was demonstrated in the 1973 Berkeley gender bias study [26]. A preliminary review of admissions data indicated that women had a lower admissions rate than men:

Men: 8,442 applicants, 44% admitted
Women: 4,321 applicants, 35% admitted

A nearly 10% lower overall admission rate for women, compared with men, seemed significant, but what did it mean? Was the admissions office guilty of gender bias?

A look at admissions department-by-department (in distinction to admissions for the total number of applicants to the university, by gender) showed a very different story. Women were being admitted at higher rates than men, in almost every department. The department-by-department data seemed incompatible with the data obtained when the admissions from all the departments were combined.

The explanation was simple. Women tended to apply to the most popular and oversubscribed departments, such as English and History, that had a high rate of admission denials. Men tended to apply to departments that the women of 1973 avoided, such as mathematics, engineering, and physics, that had relatively few applicants and high acceptance rates. Though women had an equal footing with men in departmental admissions, the high rate of rejections in the large departments accounted for an overall lower acceptance rate for women at Berkeley.
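The arithmetic behind the paradox is easy to reproduce. The short Python sketch below uses illustrative counts (not the actual 1973 Berkeley figures): women are admitted at a higher rate in each department, yet at a lower rate overall, because most of the women applied to the department that rejects most of its applicants.

# Hypothetical admissions counts: (applicants, admitted) per department and sex.
departments = {
    "low_rejection_dept":  {"men": (800, 600), "women": (100, 80)},
    "high_rejection_dept": {"men": (100, 20),  "women": (800, 200)},
}

totals = {"men": [0, 0], "women": [0, 0]}
for dept, groups in departments.items():
    for sex, (applied, admitted) in groups.items():
        totals[sex][0] += applied
        totals[sex][1] += admitted
        print(f"{dept:20s} {sex:6s} admission rate: {admitted / applied:.0%}")

for sex, (applied, admitted) in totals.items():
    print(f"overall {sex:6s} admission rate: {admitted / applied:.0%}")
# Women: 80% and 25% within departments, but only 31% overall.
# Men:   75% and 20% within departments, but 69% overall.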

Simpson's paradox demonstrates that data is not additive. It also shows us that data is not transitive; you cannot make inferences based on subset comparisons. For example in randomized drug trials, you cannot assume that if drug A tests better than drug B, and drug B tests better than drug C, then drug A will test better than drug C [27]. When drugs are tested, even in well-designed trials, the test populations are drawn from a general population specific for the trial. When you compare results from different trials, you can never be sure whether the different sets of subjects are comparable. Each set may contain individuals whose responses to a third drug are unpredictable. Transitive inferences (i.e., if A is better than B, and B is better than C, then A is better than C), are unreliable.

Simpson's paradox has particular significance for Big Data research, wherein data samples are variously recombined and reanalyzed at different stages of the analytic process.

Section 14.5. Additional Big Data Pitfalls

Any problem in Computer Science can be solved with another level of indirection.

Butler Lampson

...except the problem of indirection complexity.

Bob Morgan

There is a large literature devoted to the pitfalls of data analysis. It would seem that all of the errors associated with small data analysis will apply to Big Data analysis. There is, however, a collection of Big Data errors that does not apply to small data, such as:

  •   The misguided belief that Big Data is good data

For decades, it was common for scientists to blame their failures on the paucity of their data. You would often hear, at public meetings and in private, statements such as “It was a small study, using just a few samples and a limited number of measurements on each sample. We really should not generalize at the moment. Let us wait for a definitive study based on a large group of samples.”

There has always been the sense, based on nothing in particular, that a small study cannot be validated by another small study. A small study must be validated by a big study. Anyone who has ever worked on a project that collects large, complex, quickly streaming data knows that such efforts are much more prone to systemic flaws in data collection than are smaller projects. In Section 16.1, “First Analysis (Nearly) Always Wrong,” we will see why conclusions drawn from Big Data are notoriously misleading.

Big Data comes from many different sources, produced by many different protocols, and must undergo a series of tricky normalizations, transformations, and annotations, before it has any value whatsoever. Data analysts can never assume that Big Data is accurate. Competent analysts will always validate their conclusions against alternate, independently collected data; big or small.

  •   Blending bias

If you are studying the properties of a class of records (e.g., records of individuals with a specific disease or data collected on a particular species of fish), then any analysis of the data, no matter how large the data set, will be biased if your class assignments are erroneous (e.g., if the disease was misdiagnosed, or if you mistakenly included other species of fish in your collection). Classifications can be deeply flawed when individual classes are poorly defined, or not based on a well-understood set of scientific principles, or are assembled through the use of poor analytic techniques.

Let us look at one example in some depth. Suppose you are a physician living in Southern Italy, in the year 1640, where people are dying in great number, from a mysterious disease characterized by recurring fevers, delirium, and pain. You are approached by an explorer who has just returned from a voyage to South America, in an area corresponding to modern-day Brazil. He holds a bag containing an herbal extract, and says “Give this to your patients, and they will quickly recover.”

It happens that the drug is extracted from the bark of the Cinchona tree. It is a sure-fire cure for malaria. Unknown to you, many of your patients are suffering from malaria, and would benefit greatly from this miraculous drug. Nonetheless, you are skeptical and would like to test this new drug before subjecting your patients to any unanticipated horrors. Though you are not a statistician, you do know something about designing clinical trials. In short order, you collect 100 patients, all of whom have the symptoms of fever and delirium. You administer the cinchona powder, also known as quinine, to all the patients. A few improve, but most do not. Knowing that some patients recover without any medical assistance, you call the trial a wash-out. In the end, you decide not to administer quinine to your patients.

What happened? We know that quinine arrived as a miracle cure for malaria. It should have been effective in a population of 100 malarial patients. The problem with this hypothetical clinical trial is that the patients under study were assembled based on their mutual symptoms: fever and delirium. These same symptoms could have been accounted for by any of hundreds of other diseases that were prevalent in Italy at the time. The criterion employed at the time to classify diseases was imprecise, and the trial population was diluted with non-malarial patients who were guaranteed to be non-responders. Consequently, the trial failed, and you missed a golden opportunity to treat your malaria patients with quinine, a new, highly effective, miracle drug.

Back in Section 5.5, we discussed Class Blending, an insidious flaw found in many classifications that virtually guarantees that any analysis will yield misleading results. Having lots and lots of data will not help you. The only way to overcome the bias introduced by class blending is to constantly test and refine your classification.

  •   Complexity bias

The data in Big Data resources comes from many different sources. Data from one source may not be strictly comparable to data from another source. The steps in data selection, including data filtering, and data transformation, will vary among analysts. Together, these factors create an error-prone analytic environment for all Big Data studies that does not apply to small data studies.

  •   Statistical method bias

Statisticians can apply different statistical methods to one set of data, and arrive at any of several different, even contradictory, conclusions. Statistical method biases are particularly dangerous for Big Data. The standard statistical tests that apply to small data and to data collected in controlled experiments, may not apply to Big Data. Analysts are faced with the unsatisfying option of applying standard methods to non-standard data, or of developing their own methodologies for their Big Data project. History suggests that given a choice, scientists will adhere to the analysis that reinforces their own scientific prejudices [28].

  •   Ambiguity of system elements

Big Data analysts want to believe that complex systems are composed of simple elements, having well-defined attributes and functions. Clever systems analysts, using advanced techniques, enjoy believing that algorithms can predict the behavior of complex systems, when the elements of the system are understood. We learn from biological systems that the components of complex systems have ambiguous functionalities, changing from one moment to the next, rendering our best predictions tentative, at best. [Glossary Deep analytics]

For example, living cells are complex systems in which many different metabolic pathways operate simultaneously. A metabolic pathway is a multi-step chemical process involving more than one enzyme and various additional substrate and non-substrate chemicals. Depending on the conditions within a cell, a single enzyme may participate in several different metabolic pathways; and any given pathway may exert any of a number of different biological effects [29–32]. As we learn more and more about cells, we are stunned by their complexities [33,13]. Big Data analysts, working with highly complex systems, cannot assume that any of the elements of their system have a single, defined function. This tells us that all Big Data analyses on living systems (e.g., all biomedical systems and all non-biomedical data that depends in any way on the predictability or reproducibility of biomedical data) may be intractable to the kinds of systems analysis techniques that we have come to understand.

Despite all the potential biases, at the very least Big Data offers us an opportunity to validate predictions based on small data studies. As a ready-made source of observations, Big Data resources may provide the fastest, most economical, and easiest method to “reality test” limited experimental studies. Testing against large, external data sets, on independently collected data, and coming up with the equivalent conclusions, is a reasonable way to validate scientific assertions [34,35].

Section 14.6. Case Study (Advanced): Curse of Dimensionality

As the number of spatial dimensions goes up, finding things or measuring their size and shape gets harder.

The Curse of Dimensionality, attributed to Richard Bellman, and sometimes called Bellman's curse

Any serious student of Big Data will eventually fall prey to the dreadful Curse of Dimensionality. This curse cannot be reversed, and cannot be fully fathomed by 3-dimensional entities. Luckily, we can see the tell-tale signs that indicate where the curse is strongest, and thus avoid the full force of its evil power.

First, let's understand what we mean when we talk about n-dimensional data objects. Each attribute of an object is a dimension. The object might have three attributes: height, width, and depth; and these three attributes would correspond to the familiar three dimensional measurements that we are taught in geometry. The object in a Big Data collection might have attributes of age, length of left foot, width of right foot, hearing acuity, time required to sprint 50 yards, and yearly income. In this case the object is described by 6 attributes and would occupy 6 dimensions of Big Data space.

Let us say that we have normalized the values of every attribute so that each attribute value lies between zero and two (i.e., the age is between 0 and 2; the length of the left foot is between 0 and 2; the width of the right foot is between 0 and 2, and so on for every dimension in the object).

The 6-dimensional cube that encloses the set of data objects with attributes measuring between 0 and 2 will have sides measuring 2 units in length. The general formula for the volume of an n-dimensional cube is the length of a side raised to the nth power. In the case of a 260-dimensional cube, this would give us a volume of 2^260. Just to give you some idea of the size of this number, 2^260 is roughly the estimated number of atoms contained in our universe. So the volume of the 260-dimensional cube, of side 2 units, is large enough to hold the total number of atoms in the universe, spaced one unit apart in every dimension. Because there are many more atoms in the universe than there are data objects in our Big Data resources, we can infer that all high-dimensional volumes of data will be sparsely populated (i.e., lots of space separating data objects from one another). In our physical universe, there is much more empty space than there is matter; in the infoverse, it's much the same thing, only more so. [Glossary Euclidean distance]

So what? What does it matter that n-dimensional data space is mostly empty, so long as every data object has an n-coordinate location somewhere within the hypervolume?

Let us consider the problem of finding a data object that lies within one unit of a reference object located in the exact center of the data space. As an example, we will continue to use an n-dimensional data object composed of attributes with normalized values between 0 and 2. We will begin by looking at a two dimensional data space.

If the data objects in the 2-dimensional data space are uniformly distributed in the space, then the chances of finding a data object within one unit of the center of the space (i.e., at coordinate 1,1) will be the ratio of the area of the circle of radius one unit around the center to the area of the square that contains the data space (i.e., a square whose sides have length of 2). This works out to pi/4, or 0.785. This tells us that in two dimensions, we'll have an excellent chance of finding an object within 1 unit of the center (Fig. 14.1).

Fig. 14.1
Fig. 14.1 A two-dimensional representation of a circle, of radius 1, inscribed in a square, of side length 2. The fraction of the square's area occupied by the circle is (pi × r^2)/4, or 3.1416/4, or 0.7854.

We can easily imagine that as the number of the dimensions of our data space increases, with an exponentially increasing n-dimensional volume, so too will the volume of the hypersphere that accounts for all the objects lying within 1 radial unit from the center. Regardless of how fast the volume of the space is growing, our hypersphere will keep pace, and we will always be able to find data objects within a 1-radial-unit vicinity. Actually, no. Here is where the Curse of Dimensionality truly kicks in.

The general formula for the volume of an n-dimensional sphere is shown in Fig. 14.2.

Fig. 14.2
Fig. 14.2 General formula for the volume of a sphere of radius R, in n dimensions: V_n(R) = pi^(n/2) R^n / Gamma(n/2 + 1).

Let's not get distracted by the gamma function in the denominator. It suffices to know that the volume of a hypersphere in n dimensions is easily computable. Using the formula, here are the volumes of a 1 radial unit sphere in multiple dimensions [36].

Hypersphere volumes when radius = 1, in higher dimensions
n = 1, V = 2
n = 2, V = 3.1416
n = 3, V = 4.1888
n = 4, V = 4.9348
n = 5, V = 5.2638
n = 6, V = 5.1677
n = 7, V = 4.7248
n = 8, V = 4.0587
n = 9, V = 3.2985
n = 10, V = 2.5502

As the dimensionality increases, the volume of the sphere increases until we reach the fifth dimension. After that, the volumes of the 1-unit radius sphere begin to shrink. At 10 dimensions, the volume is down to 2.5502. From there on, the volume decreases faster and faster. The 20-dimension 1-radial-unit sphere has a volume of only about 0.026, while the volume of the sphere in 100 dimensions is on the order of 10^-40 [36].
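You can reproduce these volumes with a few lines of Python, using the gamma function from the standard library:

import math

def hypersphere_volume(n, r=1.0):
    """Volume of an n-dimensional ball of radius r: pi^(n/2) * r^n / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2) * r ** n / math.gamma(n / 2 + 1)

for n in (1, 2, 3, 5, 10, 20, 100):
    print(f"n = {n:3d}   unit-sphere volume = {hypersphere_volume(n):.4g}")
# Reproduces the values listed above; the volume peaks near n = 5 and then
# collapses, falling to roughly 2.4e-40 by n = 100.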

How is this possible? If the central hypersphere has a radius of one unit, and the coordinate space is a hypercube that is 2 units on each side, then we know that, for any dimension, the hypersphere touches each and every face of the hypercube at one point. In the two dimensional example shown above, the inside circle touches the enclosing square on all four sides: at points (1,0), (1,2), (0,1), and (2,1). If an n-dimensional sphere touches one point on every face of the enclosing hypercube, then how could the sphere be infinitesimally small while the hypercube is immensely large?

The secret of the curse is that as the dimensionality of the space increases, most of the volume of the hypercube comes to lie in the corners, outside the central hypersphere. The hypersphere misses the corners, just like the 2-dimensional circle misses the corners of the square. This means that as the dimensionality of data objects increases, the likelihood of finding similar objects (i.e., objects at close n-dimensional proximity to one another) drops to about zero. When you have thousands of dimensions, the space that holds the objects is so large that distances between objects become difficult or impossible to compute. Basically, you can't find similar objects if the likelihood of finding two objects in close proximity is always zero.
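A Monte Carlo sketch shows the corners taking over. Points are drawn uniformly from the [0, 2]^n hypercube, and we count the fraction that fall within one unit of the center; that fraction collapses toward zero as the dimensionality grows.

import random

def fraction_near_center(dim, trials=100_000):
    """Estimate the fraction of the [0, 2]^dim hypercube lying within 1 unit of its center."""
    hits = 0
    for _ in range(trials):
        point = [random.uniform(0.0, 2.0) for _ in range(dim)]
        if sum((coordinate - 1.0) ** 2 for coordinate in point) <= 1.0:
            hits += 1
    return hits / trials

random.seed(0)
for dim in (2, 3, 6, 10, 20):
    print(f"{dim:2d} dimensions: {fraction_near_center(dim):.5f}")
# Roughly 0.785 in 2 dimensions, 0.52 in 3, 0.08 in 6, 0.002 in 10, and
# essentially 0 in 20 dimensions: nearly all of the hypercube's volume lies
# in the corners, outside the central hypersphere.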

Glossary

Data reduction When a very large data set is analyzed, it may be impractical or counterproductive to work with every element of the collected data. In such cases, the data analyst may choose to eliminate some of the data, or develop methods whereby the data is approximated. Some data scientists reserve the term “data reduction” for methods that reduce the dimensionality of multivariate data sets.

Deep analytics Insipid jargon occasionally applied to the skill set needed for Big Data analysis. Statistics and machine learning are often cited as two of the most important areas of deep analytic expertise. In a recent McKinsey report, entitled “Big data: The next frontier for innovation, competition, and productivity,” the authors asserted that the United States “faces a shortage of 140,000 to 190,000 people with deep analytical skills” [37].

Euclidean distance Two points, (x1, y1), (x2, y2), in Cartesian coordinates are separated by a hypotenuse distance, that being the square root of the sum of the squares of the differences between the respective x-axis and y-axis coordinates. In n-dimensional space, the Euclidean distance between two points is the square root of the sum of the squares of the differences in coordinates for each of the n dimensional coordinates. The significance of the Euclidean distance for Big Data is that data objects are often characterized by multiple feature values, and these feature values can be listed as though they were coordinate values for an n-dimensional object. The smaller the Euclidean distance between two objects, the higher their similarity to each other. Several of the most popular correlation and clustering algorithms involve pairwise comparisons of the Euclidean distances between data objects in a data collection.
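A minimal Python illustration, using made-up feature vectors:

import math

def euclidean_distance(a, b):
    """Distance between two n-dimensional feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical data objects, each described by four normalized features.
print(euclidean_distance([0.2, 1.5, 0.9, 1.1], [0.3, 1.4, 1.2, 0.8]))  # about 0.447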

Multiple comparisons bias When you compare a control group against a treated group using multiple hypotheses based on the effects of many different measured parameters, you will eventually encounter statistical significance, based on chance alone. For example, if you are trying to determine whether a population that has been treated with a particular drug is likely to suffer a serious clinical symptom, and you start looking for statistically significant associations (e.g., liver disease, kidney disease, prostate disease, heart disease, etc.), then eventually you will find an organ in which disease is more likely to occur in the treated group than in the untreated group. Because Big Data tends to have high dimensionality, biases associated with multiple comparisons must be carefully avoided. Methods for reducing multiple comparison bias are available to Big Data analysts. They include the Bonferroni correction, the Sidak correction and the Holm-Bonferroni correction.
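As a simple illustration, here is a minimal sketch of the Bonferroni correction, which divides the significance threshold by the number of comparisons (the p-values below are hypothetical):

def bonferroni_significant(p_values, alpha=0.05):
    """Flag the p-values that remain significant after the Bonferroni correction."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

# Twenty hypothetical comparisons: one genuine effect, nineteen chance findings
# that would each look "significant" at the uncorrected 0.05 level.
p_values = [0.0001] + [0.03] * 19
print(bonferroni_significant(p_values))  # only the first comparison survives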

Negative study bias When a project produces negative results (fails to confirm a hypothesis), there may be little enthusiasm to publish the work [38]. When statisticians analyze the results from many different published manuscripts (i.e., perform a meta-analysis), their work is biased by the pervasive absence of negative studies [39]. In the field of medicine, negative study bias creates a false sense that every kind of treatment yields positive results.

Time-window bias A bias produced by the choice of a time measurement. In medicine, survival is measured as the interval between diagnosis and death. Suppose a test is introduced that provides early diagnoses. Patients given the test will be diagnosed at a younger age than patients who are not given the test. Such a test will always produce improved survival simply because the interval between diagnosis and death will be lengthened. Assuming the test does not lead to any improved treatment, the age at which the patient dies is unchanged by the testing procedure. The bias is caused by the choice of timing interval (i.e., time from diagnosis to death). Survival is improved without a prolongation of life beyond what would be expected without the test. Some of the touted advantages of early diagnosis are the direct result of timing bias.

Type errors Statistical tests should not be confused with mathematical truths. Every statistician understands that conclusions drawn from statistical analyses are occasionally wrong. Statisticians, resigned to accept their own fallibilities, have classified their errors into five types: Type 1 error—Rejecting the null hypothesis when the null hypothesis is correct (i.e., seeing an effect when there was none). Type 2 error—Accepting the null hypotheses when the null hypothesis is false (i.e., seeing no effect when there was one). Type 3 error—Rejecting the null hypothesis correctly, but for the wrong reason, leading to an erroneous interpretation of the data in favor of an incorrect affirmative statement. Type 4 error—Erroneous conclusion based on performing the wrong statistical test. Type 5 error—Erroneous conclusion based on bad data.

References

[1] Sainani K. Meet the skeptics: why some doubt biomedical models, and what it takes to win them over. Biomed Comput Rev. 2012 June 5.

[2] Ioannidis J.P. Microarrays and molecular research: noise discovery? Lancet. 2005;365:454–455.

[3] Khurana V., Bejjanki H.R., Caldito G., Owens M.W. Statins reduce the risk of lung cancer in humans: a large case-control study of US veterans. Chest. 2007;131:1282–1288.

[4] Jemal A., Murray T., Ward E., et al. Cancer statistics, 2005. CA Cancer J Clin. 2005;55:10–30.

[5] Jacobs E.J., Newton C.C., Thun M.J., Gapstur S.M. Long-term use of cholesterol-lowering drugs and cancer incidence in a large United States cohort. Cancer Res. 2011;71:1763–1771.

[6] Suissa S., Dellaniello S., Vahey S., Renoux C. Time-window bias in case-control studies: statins and lung cancer. Epidemiology. 2011;22:228–231.

[7] Shariff S.Z., Cuerden M.S., Jain A.K., Garg A.X. The secret of immortal time bias in epidemiologic studies. J Am Soc Nephrol. 2008;19:841–843.

[8] Innovation or stagnation: challenge and opportunity on the critical path to new medical products. U.S. Department of Health and Human Services, Food and Drug Administration; 2004.

[9] Wurtman R.J., Bettiker R.L. The slowing of treatment discovery, 1965–1995. Nat Med. 1996;2:5–6.

[10] Saul S. Prone to error: earliest steps to find cancer. New York Times; 2010 July 19.

[11] Benowitz S. Biomarker boom slowed by validation concerns. J Natl Cancer Inst. 2004;96:1356–1357.

[12] Berman J.J. Rare diseases and orphan drugs: keys to understanding and treating common diseases. Cambridge, MD: Academic Press; 2014.

[13] Berman J. Precision medicine, and the reinvention of human disease. Cambridge, MA: Academic Press; 2018.

[14] Weigelt B., Reis-Filho J.S. Molecular profiling currently offers no more than tumour morphology and basic immunohistochemistry. Breast Cancer Res. 2010;12:S5.

[15] Begley S. In cancer science, many ‘discoveries' don't hold up. Reuters; 2012 March 28.

[16] Abu-Asab M.S., Chaouchi M., Alesci S., Galli S., Laassri M., Cheema A.K., et al. Biomarkers in the age of omics: time for a systems biology approach. OMICS. 2011;15:105–112.

[17] Ashworth T.G. Inadequacy of death certification: proposal for change. J Clin Pathol. 1991;44:265.

[18] Kircher T., Anderson R.E. Cause of death: proper completion of the death certificate. JAMA. 1987;258:349–352.

[19] Walter S.D., Birnie S.E. Mapping mortality and morbidity patterns: an international comparison. Int J Epidemiol. 1991;20:678–689.

[20] Berman J.J. Methods in medical informatics: fundamentals of healthcare programming in Perl, Python, and Ruby. Boca Raton: Chapman and Hall; 2010.

[21] Boyd D. Privacy and publicity in the context of big data. Raleigh, North Carolina: Open Government and the World Wide Web (WWW2010); 2010. April 29, 2010. Available from: http://www.danah.org/papers/talks/2010/WWW2010.html [viewed August 26, 2012].

[22] Li W. The more-the-better and the less-the-better. Bioinformatics. 2006;22:2187–2188.

[23] Philippe H., Brinkmann H., Lavrov D.V., Littlewood D.T., Manuel M., Worheide G., et al. Resolving difficult phylogenetic questions: why more sequences are not enough. PLoS Biol. 2011;9:e1000602.

[24] Bergsten J. A review of long-branch attraction. Cladistics. 2005;21:163–193.

[25] Ransohoff D.F. Rules of evidence for cancer molecular-marker discovery and validation. Nat Rev Cancer. 2004;4:309–314.

[26] Bickel P.J., Hammel E.A., O'Connell J.W. Sex bias in graduate admissions: data from Berkeley. Science. 1975;187:398–404.

[27] Baker S.G., Kramer B.S. The transitive fallacy for randomized trials: If A bests B and B bests C in separate trials, is A better than C? BMC Med Res Methodol. 2002;2:13.

[28] Tatsioni A., Bonitsis N.G., Ioannidis J.P. Persistence of contradicted claims in the literature. JAMA. 2007;298:2517–2526.

[29] Wistow G. Evolution of a protein superfamily: relationships between vertebrate lens crystallins and microorganism dormancy proteins. J Mol Evol. 1990;30:140–145.

[30] Waterham H.R., Koster J., Mooyer P., van Noort G., Kelley R.I., Wilcox W.R., et al. Autosomal recessive HEM/Greenberg skeletal dysplasia is caused by 3-beta-hydroxysterol delta(14)-reductase deficiency due to mutations in the lamin B receptor gene. Am J Hum Genet. 2003;72:1013–1017.

[31] Madar S., Goldstein I., Rotter V. Did experimental biology die? Lessons from 30 years of p53 research. Cancer Res. 2009;69:6378–6380.

[32] Zilfou J.T., Lowe S.W. Tumor suppressive functions of p53. Cold Spring Harb Perspect Biol. 2009;00:a001883.

[33] Rosen J.M., Jordan C.T. The increasing complexity of the cancer stem cell paradigm. Science. 2009;324:1670–1673.

[34] Mallett S., Royston P., Waters R., Dutton S., Altman D.G. Reporting performance of prognostic models in cancer: a review. BMC Med. 2010;30:21.

[35] Ioannidis J.P. Is molecular profiling ready for use in clinical decision making? Oncologist. 2007;12:301–311.

[36] Hayes B. An adventure in the nth dimension. Am Sci. 2011;99:442–446.

[37] Manyika J., Chui M., Brown B., Bughin J., Dobbs R., Roxburgh C., et al. Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute; June 2011.

[38] McGauran N., Wieseler B., Kreis J., Schuler Y., Kolsch H., Kaiser T. Reporting bias in medical research—a narrative review. Trials. 2010;11:37.

[39] Dickersin K., Rennie D. Registering clinical trials. JAMA. 2003;290:51.
