Chapter 9

Analysis


At this very moment, there’s an odds-on chance that someone in your organization is making a poor decision on the basis of information that was enormously expensive to collect.

Shvetank Shah, Andrew Horne, and Jaime Capella101

Background

Here are the instructions for a foolproof fly swatter:

1. Place fly on flat surface.

2. Squash fly with stick, using a rapid downward force.

If you find these instructions amusing, then you will appreciate the fundamental challenge in every Big Data analysis project: collecting the data and setting it up for analysis. The analysis step itself is easy; preanalysis is the tricky part.

Data analysis is a well-established field, with thousands of books devoted to the topic. Many currently available analysis books are superb works, with each new book building upon a rich legacy of prior works.102–106 Books devoted to preanalytic and postanalytic tasks are scarce. In the days of small data, the acquisition of data and the disposition of statistical results (i.e., submitting abstracts, writing grant reports, moving through the product development cycle) were all performed by professionals other than the mathematicians, computer scientists, and statisticians who analyzed the data. With the advent of Big Data, the data analyst often stands alone, and must somehow learn for herself how the data was obtained and how the results will be interpreted.

Analysis in the Big Data realm is fundamentally different from analysis for small data. This chapter focuses on how data must be prepared for analysis and what we learn about our data when we apply analytic methods.

Analytic Tasks

Big Data performs many computational tasks, some of which are not directly involved in analytics (e.g., searching for specific data, searching for patterns of data, retrieving data, organizing the retrieved data by a ranking system). Data analysis involves drawing conclusions based on reviewing sets of data. These tasks can be somewhat arbitrarily divided into three areas: statistical analysis, modeling, and predictive analysis.

Statistical analysis is closely associated with hypothesis testing. The prototypical hypothesis question is, “If I have a control group and a treated group, how do I know when the treatment results in an effect that makes the treatment group measurably different from the control group?” Though statisticians cannot answer this fundamental question definitively, they provide some measure of confidence in an answer.

Modeling, as used in this book, is the use of a mathematical equation or some logical language to describe the behavior of a system or its objects. Examples would be a model for planetary motion, a model for interstate traffic, a model for enzymatic reactions, a model for cancer growth, a model for an ontology, and so on. In many cases, modeling equations will need to describe how different variables change, in relation to one another, over increments of time. Hence, many modeling equations involve differential calculus. Every computer scientist seems to have his or her own definition of modeling. Readers should not assume that the definition provided here will apply everywhere today, or anywhere, 10 years from today.

Predictive analysis is concerned with guessing how an individual, group, or data object will behave based on past outcomes or on the observed behaviors of similar individuals or groups. Predictive analytics relies on one concept: if thing A is similar to thing B, then thing A will most likely behave like thing B. Predictive analytics includes three types of algorithms: recommenders, classifiers, and clustering.103

Clustering, Classifying, Recommending, and Modeling

Reality is merely an illusion, albeit a very persistent one.

Albert Einstein

Clustering Algorithms

Clustering algorithms are currently very popular. They provide a way of taking a large set of data objects that seem to have no relationship to one another and producing a visually simple collection of clusters wherein each cluster member is similar to every other member of the same cluster.

The algorithmic methods for clustering are simple. One of the most popular clustering algorithms is the k-means algorithm, which assigns any number of data objects to one of k clusters.107 The number k of clusters is provided by the user. The algorithm is easy to describe and to understand, but the computational task of completing the algorithm can be difficult when the number of dimensions in the object (i.e., the number of attributes associated with the object) is large.

Here is how the algorithm works for sets of quantitative data; a minimal code sketch follows the list of steps.

1. The program randomly chooses k objects from the collection of objects to be clustered. We’ll call each of these k objects a focus.

2. For every object in the collection, the distance between the object and all of the randomly chosen k objects (chosen in step 1) is computed.

3. A round of k clusters is computed by assigning every object to its nearest focus.

4. The centroid focus for each of the k clusters is calculated. The centroid is the mean position of the objects within the cluster. Another way of saying this is that if you sum the squared distances between the centroid and all of the objects in the cluster, this sum will be smaller than the corresponding sum computed from any other point in space.

5. Steps 2, 3, and 4 are repeated, using the k centroid foci as the points for which all distances are computed.

6. Step 5 is repeated until the k centroid foci converge on a nonchanging set of k centroid foci (or until the program slows to an interminable crawl).
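To make the preceding steps concrete, here is a minimal sketch of the k-means procedure in Python. The two-dimensional sample points, the choice of k, and the convergence test are purely illustrative; a production implementation would add safeguards and smarter initialization.

```python
import random

def k_means(points, k, max_rounds=100):
    # Step 1: choose k objects at random to serve as the initial foci.
    foci = random.sample(points, k)
    clusters = []
    for _ in range(max_rounds):
        # Steps 2 and 3: assign every object to its nearest focus.
        # (Squared Euclidean distance suffices for finding the nearest focus.)
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, f)) for f in foci]
            clusters[distances.index(min(distances))].append(p)
        # Step 4: recompute each focus as the centroid (mean) of its cluster.
        new_foci = []
        for members, old_focus in zip(clusters, foci):
            if members:
                new_foci.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_foci.append(old_focus)  # keep an empty cluster's focus unchanged
        # Steps 5 and 6: repeat until the foci stop moving.
        if new_foci == foci:
            break
        foci = new_foci
    return foci, clusters

if __name__ == "__main__":
    data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
    centroids, clusters = k_means(data, k=2)
    print(centroids)
    print(clusters)
```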

There are some serious drawbacks to the algorithm.

1. The final set of clusters will sometimes depend on the initial, random choice of k data objects. This means that multiple runs of the algorithm may produce different outcomes.

2. The algorithms are not guaranteed to succeed. Sometimes, the algorithm does not converge to a final, stable set of clusters.

3. When the dimensionality is very high, distances between data objects (i.e., the square root of the sum of squares of the measured differences between corresponding attributes of two objects) can be ridiculously large and of no practical meaning (see Glossary item, Curse of dimensionality). Computations may bog down, cease altogether, or produce meaningless results. In this case, the only recourse may require eliminating some of the attributes (i.e., reducing dimensionality of the data objects). Subspace clustering is a method wherein clusters are found for computationally manageable subsets of attributes. If useful clusters are found using this method, additional attributes can be added to the mix to see if the clustering can be improved.

4. The clustering algorithm may succeed, producing a set of clusters of similar objects, but the clusters may have no practical value. They may miss important relationships among the objects or might group together objects whose similarities are totally noninformative.

At best, clustering algorithms should be considered a first step toward understanding how attributes account for the behavior of data objects.

Classifier Algorithms

These algorithms assign a class (from a preexisting classification) to an object whose class is unknown.107 The k-nearest neighbor algorithm is a simple and popular classifier algorithm. From a collection of data objects whose class is known (the training set), the algorithm computes the distances from the object of unknown class to every object of known class. The distance measure uses the set of attributes that are associated with each object. After the distances are computed, the k classed objects with the smallest distances to the object of unknown class are collected. The most common class among these k nearest classed objects is assigned to the object of unknown class. If the chosen value of k is 1, then the object of unknown class is assigned the class of its closest classed object (i.e., the nearest neighbor).
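A minimal sketch of the k-nearest neighbor procedure, in Python, follows. The training examples, attribute values, and choice of k are invented for illustration only.

```python
import math
from collections import Counter

def knn_classify(unknown, training_set, k=3):
    # training_set is a list of (feature_vector, class_label) pairs.
    # Compute the Euclidean distance from the unknown object to every
    # object of known class.
    distances = []
    for features, label in training_set:
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(unknown, features)))
        distances.append((d, label))
    # Collect the k classed objects nearest to the unknown object and
    # assign the most common class among them.
    nearest = sorted(distances)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    # Hypothetical animals described by (weight in kg, tail length in feet).
    training = [((4, 1), "cat"), ((5, 1), "cat"), ((30, 3), "dog"), ((28, 2), "dog")]
    print(knn_classify((6, 1), training, k=3))   # expected: "cat"
```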

The k-nearest neighbor algorithm is just one among many excellent classifier algorithms, and analysts have the luxury of choosing algorithms that match their data (e.g., sample size, dimensionality) and purposes.108 Classifier algorithms differ fundamentally from clustering algorithms and from recommender algorithms in that they begin with an existing classification. Their task is very simple: assign an object to its proper class within the classification. Classifier algorithms carry the assumption that similarities among class objects determine class membership. This may not be the case. For example, a classifier algorithm might place cats into the class of small dogs because of the similarities among several attributes of cats and dogs (e.g., four legs, one tail, pointy ears, average weight 8 pounds, furry, carnivorous, etc.). The similarities are impressive, but irrelevant. No matter how much you try to make it so, a cat is not a type of dog. The fundamental difference between grouping by similarity and grouping by relationship will be discussed again later in this chapter.

Like clustering techniques, classifier techniques are computationally intensive when the dimension is high and can produce misleading results when the attributes are noisy (i.e., contain randomly distributed attribute values) or noninformative (i.e., unrelated to correct class assignment).

Recommender Algorithms

When a data object is very similar to another data object, the two objects are likely to behave similarly. Recommender techniques typically measure the distance between one data object (e.g., a potential customer) and other data items (e.g., customers who have bought a particular product or who have indicated a product preference). The data objects that are closest to one another will tend to have the same preferences and can serve as recommenders. Distances between data objects are measured with feature-by-feature comparisons using the Euclidean distance, the Mahalanobis distance, or whichever type of distance measurement seems appropriate (see Glossary items, Euclidean distance, Mahalanobis distance).
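As a rough illustration of the idea, the following Python sketch ranks hypothetical customers by their Euclidean distance to a target customer; the preference vectors and customer names are invented.

```python
import math

def euclidean(a, b):
    # Feature-by-feature comparison of two data objects.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_customers(target, customers, n=2):
    # Rank known customers by their distance to the target customer;
    # the closest customers serve as recommenders.
    ranked = sorted(customers.items(), key=lambda kv: euclidean(target, kv[1]))
    return [name for name, _ in ranked[:n]]

if __name__ == "__main__":
    # Hypothetical preference scores (e.g., ratings of three products on a 0-5 scale).
    customers = {"alice": (5, 1, 0), "bob": (4, 2, 0), "carol": (0, 5, 4)}
    print(nearest_customers((5, 2, 0), customers))   # alice and bob are the closest matches
```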

Modeling Algorithms

Modeling involves explaining the behavior of a system, often with a formula, sometimes with descriptive language. The formula for the data describes the distribution of the data and often predicts how the different variables will change with one another. Consequently, modeling comes closer than other Big Data techniques to explaining the behavior of data objects and of the system in which the data objects interact. Most of the great milestones in the physical sciences have arisen from a bit of data modeling supplemented by scientific genius (e.g., Newton’s laws of mechanics and optics, Kepler’s laws of planetary orbits, quantum mechanics). The occasional ability to relate observation with causality endows modeling with greater versatility and greater scientific impact than the predictive techniques (recommenders, classifiers, and clustering methods).

Unlike the methods of predictive analytics, which tend to rest on a few basic assumptions about measuring similarities among data objects, the methods of data modeling are selected from every field of mathematics and are based on an intuitive approach to data analysis. In many cases, the modeler simply plots the data and looks for familiar shapes and patterns that suggest a particular type of function (e.g., logarithmic, linear, normal, Fourier series, Power law, etc.). The modeler has various means of testing whether the data closely fits the model.

As one example of data modeling, let’s look at the Fourier series. Periodic functions (i.e., functions with repeating trends in the data, including waveforms and periodic time series data) can be represented as the sum of oscillating functions (i.e., functions involving sines, cosines, or complex exponentials). This applies even to periodic functions that are not smooth curves, such as square waves. In Figure 9.1, a square wave is represented as a single sine wave, a sum of 2 sine waves, 3 sine waves, up to 10 sine waves. A simple sine wave captures the periodicity and magnitude of the square wave, but it misses the right-angle corners. As the number of contributing sine waves increases, the approximation to the square wave improves. This indicates that when a set of data in periodic form is decomposed to a Fourier series, the form of the function can be approximated with a finite number of oscillating functions. Depending on the purposes of the data analyst, the first few components of the Fourier series may suffice to produce an adequate model of the data.


Figure 9.1 A square wave is approximated by a single sine wave, the sum of two sine waves, three sine waves, and so on. As more components are added, the representation of the original signal or periodic set of data is more closely approximated.
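The partial sums pictured in Figure 9.1 can be generated from the standard Fourier series of an ideal square wave; the following Python sketch, with an arbitrary choice of sample points, shows how the approximation is built from odd harmonics.

```python
import math

def square_wave_partial_sum(t, n_terms):
    # Fourier series of an ideal square wave of period 2*pi:
    #   (4/pi) * sum over odd n of sin(n*t)/n
    total = 0.0
    for i in range(n_terms):
        n = 2 * i + 1          # odd harmonics only: 1, 3, 5, ...
        total += math.sin(n * t) / n
    return 4.0 / math.pi * total

if __name__ == "__main__":
    # The approximation to the square wave sharpens as more sine terms are added.
    for terms in (1, 3, 10):
        samples = [round(square_wave_partial_sum(t / 10.0, terms), 2) for t in range(0, 32, 8)]
        print(terms, "terms:", samples)
```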

A transform is a mathematical operation that takes a function or a time series (e.g., values obtained at intervals of time) and transforms it into something else. An inverse transform will take the product of the transform and produce the original function. Transforms are useful when there are operations that can be performed on the transformed function, but not on the original function.

How can we take what we know about the Fourier series and use it as a transforming operation? Possibly the most popular and useful transform available to data analysts is the Fourier transform. The Fourier transform can be computed with great speed on modern computers, using a modified form known as the fast Fourier transform. Periodic functions and waveforms (periodic time series) can be transformed using this computational method. Operations on the transformed function can sometimes eliminate periodic artifacts or frequencies that occur below a selected threshold (e.g., noise). The transform can also separate signals into tracks (e.g., voice and instrumental tracks) and can find similarities between two signals. When the operations are complete, the inverse of the transform can be calculated and substituted for the original set of data.
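A minimal sketch of this workflow, using NumPy’s FFT routines, follows; the signal, sampling rate, and frequency cutoff are invented for illustration.

```python
import numpy as np

# Build a noisy periodic signal: a 5 Hz sine wave plus random noise.
rate = 1000                       # samples per second
t = np.arange(0, 1, 1.0 / rate)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.random.randn(t.size)

# Forward transform: move from the time domain to the frequency domain.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(t.size, d=1.0 / rate)

# Operate on the transformed function: zero out everything above 10 Hz (noise).
spectrum[freqs > 10] = 0

# Inverse transform: recover a de-noised version of the original time series.
cleaned = np.fft.irfft(spectrum, n=t.size)
print(cleaned[:5])
```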

Data Reduction

There is something fascinating about science. One gets such a wholesale return of conjecture out of such a trifling investment of fact.

Mark Twain

At first glance, it seems obvious that gravitational attraction is a Big Data problem. We know that gravitation between any two bodies is proportional to the product of their masses and inversely proportional to the square of the distance between them. If we want to predict the gravitational forces on an object, we would need to know the position and mass of every body in the universe. With this data, we would compute a force vector, from which we could determine the net gravitational influence of the universe upon the mass. Of course, this is absurd. If we needed all that data for our computation, physicists would be forever computing the orbit of the earth around the sun. We are lucky to live in a universe wherein gravity follows an inverse square distance rule, as this allows us to neglect the influences of heavenly bodies that are far away from earth and sun and of nearby bodies that have small masses compared with the sun. Any high school student can compute the orbits of the earth around the sun, predicting their relative positions millennia into the future.

Likewise, if we can see two galaxies in space and we notice that they are similar in shape and size and have a similar star density, then we can assume that they both produce about the same amount of light. If the light received on earth from one of those galaxies is four times that received from the other galaxy, we can apply the inverse square law for light intensity and infer that the fainter galaxy is probably twice as far from earth as the brighter galaxy. In this short analysis, we start with our observations on every visible galaxy in the universe. Next, we compare just two galaxies, and from this comparison we can develop general methods of analysis that may apply to the larger set of data.

The point here is that when Big Data is analyzed, it is seldom necessary to include every point of data in your system model. In the Big Data field, the most successful analysts will often be those individuals who are adept at simplifying the system model, thus eliminating unnecessary calculations.

Because Big Data is complex, you will often find that your data objects have high dimensionality; each data object is annotated with a large number of values. The types of values that are shared among all the different data objects are usually referred to as parameters. It is very difficult to make much sense of high dimensional data. It is always best to develop a filtering mechanism that expunges useless parameters. A useless parameter will often have one of these two properties.

1. Redundancy. If a parameter correlates perfectly with some other parameter, you know that you can safely drop one of the two parameters. For example, you may have some physiologic data on a collection of people, and the data may include weight, waist size, body fat index, weight adjusted by height, density, and so on. These measurements seem to be measuring about the same thing; are they all necessary? If several attributes closely correlate with one another, you might want to drop a few.

Association scores provide a measure of similarity between two variables (see Glossary item, Correlation distance). Two similar variables will rise and fall together. The Pearson correlation score is popular and can be easily implemented.19,105 It produces a score that varies from -1 to 1. A score of 1 indicates perfect correlation; a score of -1 indicates perfect anticorrelation (i.e., one variable rises while the other falls). A Pearson score of 0 indicates lack of correlation. Other correlation measures are readily available.109,110 Big Data analysts should not demur from developing their own correlation scores, as needed, to ensure enhanced speed or to provide a scoring measure that best serves their particular goals.
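The Pearson score can be computed directly from its definition; here is a minimal Python sketch, with invented weight and waist measurements, for screening pairs of parameters for redundancy.

```python
import math

def pearson(x, y):
    # Pearson correlation: covariance of x and y divided by the product
    # of their standard deviations; the score ranges from -1 to 1.
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

if __name__ == "__main__":
    # Hypothetical measurements on four individuals.
    weight = [60, 72, 81, 95]
    waist = [30, 34, 38, 44]
    print(round(pearson(weight, waist), 3))   # close to 1: a candidate for dropping one parameter
```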

2. Randomness. If a parameter is totally random, then it cannot tell you anything meaningful about the data object and you can drop the parameter. There are many tests that measure randomness; most were designed to measure the quality of random number generators.111 They can also be used to determine the randomness of data sets.

A simple but useful test for randomness can be achieved by putting your set of parameter values into a file and compressing the file. If the values of the parameter are distributed randomly, the file will not compress well, whereas a set of data that has a regular distribution (e.g., a simple curve, a Zipf-like distribution, or a distribution with a sharp peak) will compress down into a very small file.

As a simple illustration, I wrote a short program that created three files, each 10,000 bytes in length. The first file consisted of the number 1, repeated 10,000 times (i.e., 11111111…). The second file consisted of the numbers 0 through 9, distributed as a sequence of 1000 zeros followed by 1000 ones, followed by 1000 twos, and so on, up to 1000 nines. The final file consisted of the numbers 0 through 9 repeated in a purely random sequence (e.g., 285963222202186026084095527364317), extended to fill a file of 10,000 bytes. Each file was compressed with gzip, which uses the DEFLATE compression algorithm, combining LZ77 and Huffman coding.

The uncompressed files (10,000 bytes) were compressed into the following file sizes:

compressed file size: 58 bytes for 10,000 consecutive “1”

compressed file size: 75 bytes for 1000 consecutive values of 0 through 9

compressed file size: 5092 bytes for a random sequence of 10,000 digits

In the third file, which consisted of a random sequence of digits, a small compression was achieved simply through the conversion from ASCII to binary representation. In general, though, a purely random sequence cannot be compressed. A data analyst can compare the compressibility of data values, parameter by parameter, to determine which parameters might be expunged, at least during the preliminary analysis of a large, multidimensional data set.
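The experiment described above is easy to reproduce with Python’s gzip module, which implements the DEFLATE algorithm; the exact byte counts will differ somewhat from those reported here.

```python
import gzip
import random

# Three 10,000-byte sequences, as in the example above.
repeated = b"1" * 10000
stepped = b"".join(str(d).encode() * 1000 for d in range(10))
noise = bytes(random.choice(b"0123456789") for _ in range(10000))

for label, data in (("repeated", repeated), ("stepped", stepped), ("random", noise)):
    compressed = gzip.compress(data)
    # The regular sequences shrink to a few dozen bytes; the random digits barely compress.
    print(label, len(data), "->", len(compressed), "bytes")
```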

When random data are not omitted, the unwary analyst may actually develop predictive models and classifiers based entirely on noise. This can occur because clustering algorithms and predictive methods, including neural networks, will produce an outcome from random input data (see Glossary item, Neural network). It has been reported that some published diagnostic tests have been developed from random data.90

Aside from eliminating redundant or random parameters, you might want to inspect the data and eliminate parameters that do not contribute in any useful way towards your analysis. For example, if you have the zip code for an individual, you will probably not need to retain the street address. If you have the radiologic diagnosis for a patient’s chest X-ray, you might not need to retain the file containing the X-ray image unless you are conducting an image analysis project.

The process of reducing parameters applies to virtually all of the fields of data analysis, including standard statistical analysis. Names for this activity include feature reduction or selection, variable reduction and variable subset reduction, and attribute selection. There is sometimes a fine line between eliminating useless data parameters and cherry-picking your test set (see Glossary items, Cherry-picking, Second trial bias). It is important to document the data attributes you have expunged, and your reason for doing so. Your colleagues should be given the opportunity of reviewing all of your data, including the expunged parameters.

An example of a data elimination method is found in the Apriori algorithm. At its simplest, it expresses the observation that a collection of items cannot occur frequently unless each item in the collection also occurs frequently. To understand the algorithm and its significance, consider the items placed together in a grocery checkout cart. If the most popular combination of purchase items is a sack of flour, a stick of butter, and a quart of milk, then you can be certain that collections of each of these items individually, and all pairs of items from the list of three, must also occur frequently. In fact, they must occur at least as often as the combination of all three, because each of these smaller combinations is a subset of the larger set and will occur with the frequency of the larger set plus the frequency of its occurrences in any other item sets. The importance of the Apriori algorithm to Big Data relates to data reduction. If the goal of the analysis is to find association rules for frequently occurring combinations of items, then you can restrict your analysis to combinations composed of single items that occur frequently.104,107
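The pruning idea behind the Apriori algorithm can be sketched in a few lines of Python; the market baskets and support threshold below are invented, and a full implementation would extend the pruning to triples and larger item sets.

```python
from collections import Counter
from itertools import combinations

baskets = [
    {"flour", "butter", "milk"},
    {"flour", "butter", "milk", "eggs"},
    {"flour", "milk"},
    {"eggs", "soap"},
]
min_support = 2   # an item set must appear in at least 2 baskets

# Pass 1: count single items; an infrequent item cannot belong to any
# frequent pair (the Apriori observation).
item_counts = Counter(item for basket in baskets for item in basket)
frequent_items = {item for item, n in item_counts.items() if n >= min_support}

# Pass 2: count only those pairs built from frequent single items.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket & frequent_items), 2):
        pair_counts[pair] += 1

frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= min_support}
print(frequent_items)
print(frequent_pairs)
```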

After a reduced data set has been collected, it is often useful to transform the data by any of a variety of methods that enhance your ability to find trends, patterns, clusters, or relational properties that might be computationally invisible in the untransformed data set. The first step is data normalization, described in the next section. It is critical that data be expressed in a comparable form and measure. After the data is normalized, you can further reduce your data by advanced transformative methods.

One popular method for transforming data to reduce the dimensionality of data objects is multidimensional scaling, which employs principal component analysis104 (see Glossary item, Principal component analysis). Without going into the mathematics, principal component analysis takes a list of parameters and reduces it to a smaller list, with each component of the smaller list constructed from combinations of variables in the longer list. Furthermore, principal component analysis provides an indication of which variables in both the original and the new list are least correlated with the other variables. Principal component analysis requires operations on large matrices. Such operations are computationally intensive and can easily exceed the capacity of most computers.
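For readers who want to see the mechanics, here is a minimal principal component analysis sketch using NumPy on synthetic data; eigendecomposition of the covariance matrix is one of several equivalent ways to compute the components.

```python
import numpy as np

# Synthetic data: 100 objects, 5 correlated parameters (only 2 are independent).
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
data = np.hstack([base, base @ rng.normal(size=(2, 3))])

# Center each parameter, then diagonalize the covariance matrix.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Keep the components with the largest eigenvalues (most variance explained).
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order[:2]]
reduced = centered @ components            # each object now has 2 values, not 5

print(eigenvalues[order] / eigenvalues.sum())   # fraction of variance per component
print(reduced.shape)                            # (100, 2)
```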

As a final caveat, data analysts should be prepared to learn that there is never any guarantee that a collection of data will be helpful, even if it meets every criterion for accuracy and reproducibility. Sometimes the data you have is not the data you need. Data analysts should be aware that many of the advanced analytic methods, including clustering, neural networks, Bayesian methods, and support vector machines, may produce a result that does not take you any closer to a meaningful answer (see Glossary item, Support vector machine). The data analyst must understand that there is an important difference between a result and an answer.

Normalizing and Adjusting Data

When extracting data from multiple sources, recorded at different times, and collected for different purposes, the data values may not be directly comparable. The Big Data analyst must contrive a method to normalize or harmonize the data values.

1. Adjusting for population differences. Epidemiologists are constantly reviewing large data sets on large populations (e.g., local, national, and global data). If epidemiologists did not normalize their data, they would be in a constant state of panic. Suppose you are following long-term data on the incidence of a rare childhood disease in a state population. You notice that the number of people with the disease has doubled in the past decade. You are about to call the New York Times with the news when one of your colleagues taps you on the shoulder and explains that the population of the state has doubled in the same time period. The incidence, described as cases per 100,000 population, has remained unchanged. You calm yourself down and continue your analysis to find that the reported cases of the disease have doubled in a different state that has had no corresponding increase in state population. You are about to call the White House with the news when your colleague taps you on the shoulder and explains that the overall population of the state has remained unchanged, but the population of children in the state has doubled. The incidence, expressed as cases occurring in the target population, has remained unchanged.

An age-adjusted rate is the rate of a disease within an age category, weighted against the proportion of persons in the age groups of a standard population. When we age adjust rates, we cancel out the changes in the rates of disease that result from differences in the proportion of people in different age groups.

Some of the most notorious observations on nonadjusted data come from the field of baseball. In 1930, Bill Terry maintained a batting average of 0.401, the best batting average in the National League. In 1968, Carl Yastrzemski led his league with a batting average of 0.301. You would think that the facts prove that Terry’s lead over his fellow players was greater than Yastrzemski’s. Actually, both had averages that were 27% higher than the average of their fellow ballplayers of the year. Normalized against all the players for the year in which the data was collected, Terry and Yastrzemski tied.
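The age adjustment described above amounts to weighting age-specific rates by the age distribution of a standard population. Here is a minimal sketch of direct age adjustment in Python; the age groups, rates, and standard population are invented.

```python
# Age-specific disease rates (cases per 100,000) in the observed population,
# and the age distribution of a hypothetical standard population.
observed_rates = {"0-19": 5.0, "20-64": 20.0, "65+": 80.0}
standard_population = {"0-19": 25000, "20-64": 60000, "65+": 15000}

total = sum(standard_population.values())
age_adjusted_rate = sum(
    observed_rates[group] * standard_population[group] / total
    for group in observed_rates
)
print(round(age_adjusted_rate, 1))   # cases per 100,000, standardized for age
```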

2. Rendering data values dimensionless. Histograms express data distributions by binning data into groups and displaying the bins in a bar graph (see Figure 9.2). A histogram of an image may have bins (bars) whose heights consist of the number of pixels in a black-and-white image that fall within a certain gray-scale range.


Figure 9.2 An image of the author (left) converted into a histogram representing the number of pixels that have a gray-scale value of 0, 1, 2, 3, and so on up to the top gray-scale value of 255. Each gray-scale value is a bin.

When comparing images of different sizes, the total number of pixels in the images is different, making it impossible to usefully compare the heights of bins. In this case, the number of pixels in each bin can be divided by the total number of pixels in the image to produce a number that corresponds to the fraction of the total image pixels that are found in the bin. The normalized value (now represented as a fraction) can be compared between two images. Notice that by representing the bin size as a fraction, we have stripped the dimension from the data (i.e., a number expressed as pixels) and rendered a dimensionless data item (i.e., a purely numeric fraction).

3. Converting one data type to another, more useful data type. A zip code is an example of data formed by numeric digits that lack numeric properties. You cannot add two zip codes and expect to get any useful information from the process. However, every zip code has been mapped to a specific latitude and longitude at the center of the zip code region, and these values can be used as spherical coordinates from which distances between locations can be computed. It is often useful to assign geographic coordinates to every zip code entry in a database.
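Once zip codes have been mapped to coordinates, distances between locations can be computed with the haversine formula for great-circle distance; the following Python sketch uses hypothetical zip code centroids.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two points on a sphere of Earth's radius.
    radius = 6371.0  # kilometers
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))

if __name__ == "__main__":
    # Hypothetical zip code centroids (latitude, longitude).
    zip_a = (39.00, -77.10)
    zip_b = (40.75, -73.99)
    print(round(haversine_km(*zip_a, *zip_b), 1), "km")
```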

4. Converting to a (0,1) interval. Any set of data values can be converted into an interval between 0 and 1, wherein the original data values maintain their relative positions in the new interval. There are several simple ways to achieve the result. The most straightforward is to compute the range of the data by subtracting the smallest data value in your data set from the largest data value. To determine the location of any data value in the 0,1 range, simply subtract from it the smallest value in the data set and then divide the result by the range of the data. This tells you where your value is located, in a 0,1 interval, as a fraction of the range of the data.

Another popular method for converting data sets to a standard interval is to subtract the mean from any data value and divide by the standard deviation. This gives you the position of the data value expressed as its deviation from the mean as a fraction of the standard deviation. The resulting value is called the z score.

When comparing different data sets, it is important to normalize all of the data points to a common interval. In the case of multidimensional data, it is usually necessary to normalize the data in every dimension using some sensible scaling computation. This may include the methods just described (i.e., dividing by range or standard deviation or by substituting data with a dimensionless transformed value, such as a correlation measure).
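Both conversions, rescaling to the (0,1) interval and computing z scores, can be written in a few lines of Python; the sample values are invented.

```python
def rescale_to_unit_interval(values):
    # Subtract the minimum and divide by the range (max - min).
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def z_scores(values):
    # Subtract the mean and divide by the standard deviation.
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

if __name__ == "__main__":
    ages = [2, 5, 9, 12, 16]
    print([round(v, 2) for v in rescale_to_unit_interval(ages)])
    print([round(v, 2) for v in z_scores(ages)])
```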

5. Weighting. Weighting is a method whereby the influence of a value is moderated by some factor intended to yield an improved value. In general, when a data value is replaced by the sum of weighted factors, the weights are chosen to add to 1. For example, if you are writing your own smoothing function, in which each value in a data set is replaced by a value computed by summing contributions from itself and its immediate neighbors on the left and the right, you might multiply each number by one-third so that the final number is scaled to a magnitude similar to your original number. Alternately, you might multiply the number to the left and to the right by one-quarter and the original by one-half to provide a summed number weighted to favor the original number.
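The two weighting schemes just described (equal thirds versus quarter-half-quarter) can be sketched as a simple smoothing function in Python; the data values are invented.

```python
def smooth(values, weights=(0.25, 0.5, 0.25)):
    # Replace each interior value with a weighted sum of its left neighbor,
    # itself, and its right neighbor; the weights add to 1.
    left_w, center_w, right_w = weights
    smoothed = [values[0]]                       # end points are left unchanged
    for i in range(1, len(values) - 1):
        smoothed.append(left_w * values[i - 1] + center_w * values[i] + right_w * values[i + 1])
    smoothed.append(values[-1])
    return smoothed

if __name__ == "__main__":
    data = [3, 9, 4, 8, 5, 10]
    print(smooth(data))                            # weighted to favor the original value
    print(smooth(data, weights=(1/3, 1/3, 1/3)))   # equal contributions from all three
```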

It is a shame that Big Data never comes with instructions. Data analysts are constantly required to choose a normalization method, and the choice will always depend on their intended use of the data. Here is an example. Three sources of data provide records on children that include an age attribute. Each source measures age in the same dimension, years. You would think that because the ages are all recorded in years, not months or decades, you can omit a normalization step. When you study the data, you notice that one source contains children up to the year 14, while another is cut off at age 12 and another stops at age 16. Suddenly, you are left with a difficult problem. Can you ignore the differences in the cut-off age in the three data sets? Should you truncate all of the data above age 12? Should you use all of the data, but weigh the data differently for the different sources? Should you divide by the available ranges for the data? Should you compute z scores? It all depends on what you are trying to learn from the data.

Big Data Software: Speed and Scalability

It’s hardware that makes a machine fast. It’s software that makes a fast machine slow.

Craig Bruce

The Cleveland Clinic developed software that predicts disease survival outcomes from a large number of genetic variables. Unfortunately, the time required for these computations was unacceptable. As a result, the Cleveland Clinic issued a challenge “to deliver an efficient computer program that predicts cancer survival outcomes with accuracy equal or better than the reference algorithm, including 10-fold validation, in less than 15 hours of real world (wall clock) time.”112 The Cleveland Clinic had a working algorithm, but it was not scalable to the number of variables analyzed.

The typical computer user has no patience for computational tardiness. When we submit a Google query, the output is collected from the totality of Web pages and we see the results immediately. We have come to expect every algorithm to operate in the blink of an eye. With Big Data, software may not scale to the size and complexity of the job.

Here are a few suggestions for achieving speed and scalability.

1. Never assume that software is scalable. A software application that works fine for a few megabytes of data may crawl or crash when confronted with a terabyte-sized file, or with millions of variables per record.

2. Avoid turn-key applications. A turn-key application may work well until the Big Data reaches a critical size, at which time the system slows or crashes. Turn-key applications tend to be monolithic and nonmodular, producing a “black box” mentality among users (see Glossary item, Black box). When one of the subsystems fails, it can be difficult or impossible to fix.

3. Use small, efficient, and fast utilities. Utilities are written to perform one task, optimally. For data analysts, utilities fit perfectly with the multistep paradigm of Big Data projects. A savvy data analyst will have hundreds of small utilities, most being free and open source products, that can be pulled, as needed. A utility can be tested with data sets of increasing size and complexity. If the utility scales to the job, then it can be plugged into the project. Otherwise, it can be replaced with an alternate utility or modified to suit your needs.

4. Avoid unpredictable software. Programs that operate with complex sets of input may behave unpredictably. It is important to use software that has been extensively tested by yourself and by your colleagues. After testing, it is important to constantly monitor the performance of software. This is especially important when using new sources of data, which may contain data values that have been formatted differently than prior data.

5. When designing your own software, understand your algorithms and learn their limitations. An algorithm that produces an inexact answer is always preferable to an algorithm that crashes under the load of a gigabyte of data. When your program runs too slowly, do not immediately look for a faster computer; most speed problems are due to design issues or to the selection of an algorithm that cannot scale. Do not look for a faster programming language. The difference in speed among the popular programming languages is not as great as their adherents might have you believe. Thinking deeply about what you really need to accomplish with your software will often lead you to a solution that will work with readily available hardware and programming languages.

6. If you are conducting scientific research, you should avoid using proprietary software, even if it is reliable and scalable. Proprietary software applications have their place in the realm of Big Data. They can be used to operate the servers, databases, and communications involved in building and maintaining the resource. They are often well tested and dependable. They do the job they were designed to do. However, in the analysis phase, it may be impossible to fully explain your research methods if one of the steps is “Install the vendor’s software and click on the ‘Run’ button.” Data analysts should use open source software unless their proprietary software fully documents the algorithms and methods therein.

Find Relationships, Not Similarities

The creative act is the defeat of habit by originality.

George Lois

The distinction between relationships among objects and similarities among objects is one of the most fundamental concepts in data analysis, yet it is commonly misunderstood. Relationships are the fundamental properties of an object that place it within a class of objects, each member of which has the same fundamental properties. A common set of relational properties distinguishes the ancestral classes of an object and the descendant classes of an object. Related objects tend to be similar to one another, but these similarities occur as the consequence of their relationships, not vice versa. For example, you may have many similarities to your father. You are similar to your father because you are related to him; you are not related to him because you are similar to him.

Here are two additional examples that stress the difference between relationship and similarity.

1. You look up at the clouds and you begin to see the shape of a lion. The cloud has a tail, like a lion’s tail, and a fluffy head, like a lion’s mane. With a little imagination, the mouth of the lion seems to roar down from the sky. You have succeeded in finding similarities between the cloud and a lion. If you look at a cloud and you imagine a tea kettle producing a head of steam, then you are establishing a relationship between the physical forces that create a cloud and the physical forces that produce steam from a heated kettle, and you understand that clouds are composed of water vapor.

2. You look up at the stars and you see the outline of a flying horse, Pegasus, or the soup ladle, the Big Dipper. You have found similarities upon which to base the names of celestial landmarks, the constellations. The constellations help you orient yourself to the night sky, but they do not tell you much about the physical nature of the twinkling objects. If you look at the stars and you see the relationship between the twinkling stars in the night sky and the round sun in the daylight sky, then you can begin to understand how the universe operates.

The distinction between grouping data objects by similarity and grouping data objects by relationship is sometimes lost on computer scientists. I have had numerous conversations with intelligent scientists who refuse to accept that grouping by similarity (e.g., clustering) is fundamentally different from grouping by relationship (i.e., building a classification).

Consider Figure 9.3, which contains 300 objects. Each object belongs to one of two classes, marked by an asterisk or by an empty box. The 300 objects naturally cluster into three groups. It is tempting to conclude that the graph shows three classes of objects that can be defined by their similarities, but we know from the outset that the objects fall into two classes, and we see from the graph that objects from both classes are distributed in all three clusters.


Figure 9.3 Spatial distribution of 300 objects represented by data points in three dimensions. Each data object falls into one of two classes, represented by an asterisk or an empty box. The data naturally segregates into three clusters. Objects of type asterisk and type box are distributed throughout each cluster.

Is this graph far-fetched? Not really. Suppose you have a collection of felines and canines. The collection of dogs might include Chihuahuas, St. Bernards, and other breeds. The collection of cats might include housecats, lions, and other species, and the data collected on each animal might include weight, age, and hair length. We do not know what to expect when we cluster the animals by similarities (i.e., weight, age, hair length), but we can be sure that short-haired cats and short-haired Chihuahuas of the same age will probably fall into one cluster. Cheetahs and greyhounds, having similar size and build, might fall into another cluster. The similarity clusters will mix together unrelated animals and will separate related animals.

For taxonomists, the importance of grouping by relationship, not by similarity, is a lesson learned the hard way. Literally 2000 years of misclassifications, erroneous biological theorizations, and impediments to progress in medicine and agriculture have occurred whenever similarities were confused with relationships. Early classifications of animals were based on similarities (e.g., beak shape, color of coat, number of toes). These kinds of classifications led to the erroneous conclusions that dolphins were fish and that the various juvenile forms of insects, nematodes, and thousands of species of animals were distinct organisms, unrelated to the adult form into which they would mature. The vast field of animal taxonomy was a hopeless and useless mess until taxonomists began to think very deeply about classes of organisms and the fundamental properties that accounted for the relationships among the classes.

Geneticists have also learned that sequence similarities among genes may bear no relationship to their functionalities, their inheritance from higher organisms, their physical locations, or to any biological process whatsoever. Geneticists use the term homology to describe the relationship among sequences that can be credited to descent from a common ancestral sequence. Similarity among different sequences can be nonhomologous, developing randomly in nonrelated organisms, or developing by convergence, through selection for genes that have common functionality. Sequence similarity that is not acquired from a common ancestral sequence seldom relates to the shared fundamental cellular properties that characterize inherited relationships. Biological inferences drawn from gene analyses are more useful when they are built upon phylogenetic relationships rather than on superficial genetic or physiologic similarities.113

It is my perception that the distinction between classification by similarity and classification by relationship is vitally important to the field of computer science and to the future of Big Data analysis. I have discussed this point with many of my colleagues, who hold the opposite view: that the distinction between similarity classification and relationship classification is purely semantic, and that there is no practical difference between the two methods. Regardless of which side you may choose, the issue is worth pondering for a few moments.

Two arguments support the opinion that classification is based on similarity measures. The first argument is that classification by similarity is the standard method by which relational classifications are built. The second argument is that relational properties are always unknown at the time that the classification is built. The foundation of every classification must be built on measurable features, and the only comparison we have for measurable features is similarity.

The first argument, based on common practice, has no scientific merit. It assumes that common practice must be correct, but determining the correctness of common practice is what we are trying to settle.

The second argument, that classification by relationship requires access to unobtainable knowledge, is a clever observation that hits on a weakness in the relational theory of classification. To build a classification, you must first know the relational properties that define classes, superclasses, and subclasses, but if you want to know the relationships among the classes, you must refer to the classification. It’s another bootstrapping problem.

Building a classification is an iterative process wherein you hope that your tentative selection of relational properties and your class assignments will be validated over time. You build a classification by guessing which properties are fundamental and relational and by guessing which system of classes will make sense when all of the instances of the classes are assigned. A classification is often likened to a hypothesis that must be tested again and again as the classification grows.

Is it ever possible to build a classification by a hierarchical clustering algorithm based on measuring similarities among objects? The answer is a qualified yes, assuming that the object features that you have measured happen to be the relational properties that define the classes. A good example of this process is demonstrated by the work of Carl Woese and coworkers in the field of the classification of terrestrial organisms.114 Woese compared ribosomal RNA sequences among organisms. Ribosomal RNA is involved in the precise synthesis of proteins according to instructions coded in genes. According to Woese, ribosomal RNA mutates slowly because it must be present in every terrestrial organism and it has scarcely any leeway in its functionality. Changes in the sequence of ribosomal RNA act like a chronometer for evolution. Using sequence similarities, Woese developed a brilliant classification of living organisms that has revolutionized evolutionary genetics. Woese’s analysis is not perfect, and where there are apparent mistakes in his classification, disputations focus on the limitations of using similarity as a substitute for fundamental relational properties.115,116

The field of medical genetics has been embroiled in a debate, lasting well over a decade, on the place of race in science. Some would argue that when the genomes of humans from different races are compared, there is no sensible way to tell one genome from another on the basis of assigned race. The genes of a tall man and a short man are more different than the genes of an African-American man and a white man. Judged by genetic similarity, race has no scientific meaning.117 On the other hand, every clinician understands that various diseases, congenital and acquired, occur at different rates in the African-American population than in the white population. Furthermore, the clinical symptoms, clinical outcome, and even the treatment of these diseases in African-American and white individuals will sometimes differ among ethnic or racial groups. Hence, many medical epidemiologists and physicians perceive race as a clinical reality.118 The discord stems from a misunderstanding of the meanings of similarity and of relationship. It is quite possible to have a situation wherein similarities are absent, while relationships pertain. The lack of informative genetic similarities that distinguish one race from another does not imply that race does not exist. The basis for race is the relationship created by shared ancestry. The morphologic and clinical by-product of the ancestry relationship may occur as various physical features and epidemiologic patterns found by clinicians.

In the cancer field, measurements of tumor cells always demonstrate the same phenomenon—cancer cells differ from normal cells in virtually every measurable parameter. In this case, there is an overabundance of biological differences and a paucity of well-established cellular relationships with which to understand the biological differences. When we cluster different types of cancers based on similarity measures, we usually find that these clusterings are seldom repeatable. The problem has been that the similarities, when present, are not fundamentally related to the processes that account for the development of cancer cells or to the morphologic features observed by pathologists who use traditional microscopic techniques to examine cancerous tissues.87,89,90,119,120

Computer scientists may scoff at examples selected from the realm of biology and medicine. They might remark that biological systems are always messy; by using clean mathematical constructs, the distinction between relationships and similarities would simply disappear. Maybe not. In the past decade, I have programmed with object-oriented languages, particularly the Ruby programming language.65,19 Everything in Ruby is an object, and every object belongs to a class. Methods are inherited through class lineage. Classes are defined by the methods they provide and the methods they inherit. No adept Ruby programmer would have any difficulty distinguishing between methods that apply to a class (i.e., available to every instance of the class) versus methods that apply to individual objects (so-called instance methods in Ruby). The idea of grouping objects belonging to different classes, based on shared instance methods, would be anathema to Ruby programmers. Doing so would defeat the purpose of object-oriented programming and would create chaos.

Fundamentally, all analysis is devoted to finding relationships among objects or classes of objects. All we know about the universe and the processes that play out in our universe can be reduced to simple relationships. In many cases, the process of finding and establishing relationships often begins with finding similarities, but it must never end there.

References

19. Berman JJ. Methods in medical informatics: fundamentals of healthcare programming in Perl, Python, and Ruby. Boca Raton, FL: Chapman and Hall; 2010.

65. Berman JJ. Ruby programming for medicine and biology. Sudbury, MA: Jones and Bartlett; 2008.

87. Kolata G. Cancer fight: unclear tests for new drug. The New York Times April 19, 2010.

89. Begley S. In cancer science, many ‘discoveries’ don’t hold up. Reuters Mar 28, 2012.

90. Venet D, Dumont JE, Detours V. Most random gene expression signatures are significantly associated with breast cancer outcome. PLoS Comput Biol. 2011;7:e1002240.

101. Shah S, Horne A, Capella J. Good data won’t guarantee good decisions. Harv Bus Rev. April, 2012.

102. White T. Hadoop: the definitive guide. O’Reilly Media 2009.

103. Owen S, Anil R, Dunning T, Friedman E. Mahout in action. Shelter Island, NY: Manning Publications Co; 2012.

104. Janert PK. Data analysis with open source tools. O’Reilly Media 2010.

105. Lewis PD. R for medicine and biology. Sudbury: Jones and Bartlett Publishers; 2009.

106. Segaran T. Programming collective intelligence: building smart Web 2.0 applications. O’Reilly Media 2007.

107. Wu X, Kumar V, Quinlan JR, et al. Top 10 algorithms in data mining. Knowl Inf Syst. 2008;14:1–37.

108. Zhang L, Lin X. Some considerations of classification for high dimension low-sample size data. Stat Methods Med Res 2011 Nov 23. Available from: http://smm.sagepub.com/content/early/2011/11/22/0962280211428387.long; viewed January 26, 2013.

109. Szekely GJ, Rizzo ML. Brownian distance covariance. Ann Appl Stat. 2009;3:1236–1265.

110. Reshef DN, Reshef YA, Finucane HK, et al. Detecting novel associations in large data sets. Science. 2011;334:1518–1524.

111. Marsaglia G, Tsang WW. Some difficult-to-pass tests of randomness. J Stat Software. 2002;7:1–8. Available from: http://www.jstatsoft.org/v07/i03/paper; viewed September 25, 2012.

112. Cleveland Clinic: build an efficient pipeline to find the most powerful predictors. Innocentive; September 8, 2011. Available from: https://www.innocentive.com/ar/challenge/9932794; viewed September 25, 2012.

113. Wu D, Hugenholtz P, Mavromatis K, et al. A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature. 2009;462:1056–1060.

114. Woese CR, Fox GE. Phylogenetic structure of the prokaryotic domain: the primary kingdoms. PNAS. 1977;74:5088–5090.

115. Mayr E. Two empires or three? PNAS. 1998;95:9720–9723.

116. Woese CR. Default taxonomy: Ernst Mayr’s view of the microbial world. PNAS. 1998;95:11043–11046.

117. Bamshad MJ, Olson SE. Does race exist? Sci Am. December 2003:78–85.

118. Wadman M. Geneticists struggle towards consensus on place for ‘race’. Nature. 2004;431:1026.

119. Gerlinger M, Rowan AJ, Horswell S, et al. Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. N Engl J Med. 2012;366:883–892.

120. Molyneux G, Smalley MJ. The cell of origin of BRCA1 mutation-associated breast cancer: a cautionary tale of gene expression profiling. J Mammary Gland Biol Neoplasia. 2011;16:51–55.

