Chapter 11

Stepwise Approach to Big Data Analysis

Outline

Background
Step 1. A Question Is Formulated
Step 2. Resource Evaluation
Step 3. A Question Is Reformulated
Step 4. Query Output Adequacy
Step 5. Data Description
Step 6. Data Reduction
Step 7. Algorithms Are Selected, If Absolutely Necessary
Step 8. Results Are Reviewed and Conclusions Are Asserted
Step 9. Conclusions Are Examined and Subjected to Validation
References

The purpose of computing is insight, not numbers.

Richard Hamming

Background

At this point, you may feel completely overwhelmed by the complexities of Big Data resources. It may seem that analysis is humanly impossible. The best way to tackle a large and complex project is to divide it into smaller, less intimidating tasks. Of course, every analysis project is unique, and the steps involved in a successful project will vary. Nonetheless, a manageable process, built on techniques introduced in preceding chapters, might be helpful. My hope is that as Big Data resources mature and the methods for creating meaningful, well-annotated, and verified data become commonplace, some of the steps listed in this chapter can be eliminated. Realistically though, it is best to assume that the opposite will occur—more steps will be added.

Step 1. A Question Is Formulated

It takes a certain talent to ask a good question. Sometimes, a question, even a brilliant question, cannot be answered until it is phrased in a manner that clarifies how an answer might be obtained. For example, suppose I am interested in how much money is spent, each year, on military defense in the United States. I could probably search the Internet and find the budget for the Department of Defense in the year 2011. The budget for the Department of Defense would not reflect the costs associated with other agencies that have a close relationship with the military, such as intelligence agencies and the State Department. The Department of Defense budget would not reflect the budget of the Veterans Administration (an agency that is separate from the Department of Defense). The budget for the Department of Defense might include various items that have no obvious relationship to military defense. Because I am asking for the “annual” budget, I might need to know how to deal with projects whose costs are annualized over 5, 10, or 15 years. If large commitments were made, in 2005, to pay for long-term projects, with increasing sums of money paid out over the next decade, then the 2011 annual budget may reflect payouts on 2005 commitments. A 2011 budget may not provide a meaningful assessment of costs incurred by 2011 activities. After a little thought, it becomes obvious that the question “How much money is spent, each year, on military defense in the United States?” is complex and probably cannot be answered by any straightforward method.
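
To see why a single year's figure can mislead, consider a toy calculation. In the sketch below, every number is hypothetical (none of it is actual budget data); the point is simply that payouts from earlier multi-year commitments can dominate a given year's budget line, drowning out the cost of activities that actually began in that year.

# Toy illustration: how multi-year commitments distort a single year's budget.
# All figures are hypothetical; they are not actual Department of Defense data.

commitments = [
    # (year committed, total cost, number of years over which cost is annualized)
    (2005, 50.0, 10),   # long-term project committed in 2005, paid out over 10 years
    (2008, 12.0, 5),
    (2011, 9.0, 3),
]

def payout_in_year(commitments, year):
    """Sum the annualized payouts that fall due in a given budget year."""
    total = 0.0
    for start, cost, span in commitments:
        if start <= year < start + span:
            total += cost / span       # equal annual installments
    return total

new_2011_activity = 4.0                # hypothetical cost of activities begun in 2011
budget_2011 = payout_in_year(commitments, 2011) + new_2011_activity
print(f"2011 budget line: {budget_2011:.1f} (of which only {new_2011_activity:.1f} "
      f"reflects activities that actually began in 2011)")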

At this point, it may be best to table the question for a while and to think deeply about what you can reasonably expect from Big Data. Many analysts start with the following general question: “How can this Big Data resource provide the answer to my question?” A more fruitful approach may be “What is the data in this resource trying to tell me?” The two approaches are quite different, and I would suggest that data analysts begin their analyses with the second question.

Step 2. Resource Evaluation

Every good Big Data resource provides users with a detailed description of its data contents. This might be done through a table of contents or an index, or through a detailed “readme” file, or a detailed user license. It all depends on the type of resource and its intended purposes. Resources should provide detailed information on their methods for collecting and verifying data, and their protocols supporting outsider queries and data extractions. Big Data resources that do not provide such information generally fall into two categories: (1) highly specialized resources with a small and devoted user base who are thoroughly familiar with every aspect of the resource and who do not require guidance, or (2) bad resources.

Before developing specific queries related to your research interest, data analysts should develop queries designed to evaluate the range of information contained in the resource. For example, the Surveillance, Epidemiology, and End Results (SEER) database contains deidentified records on millions of cancers occurring in the United States since the mid-1970s.76,153 When you query the database for the types of cancers included, you find that the two most common cancers of humans, basal cell carcinoma and squamous cell carcinoma of skin, are not included in the data. Together, the occurrences of these two cancers are equal to the occurrences of every other type of cancer combined. The SEER program does not collect data on these cancers because there are just too many of them; individuals may develop several such cancers on sun-exposed skin. In addition, these cancers are seldom lethal and are not generally recorded by the cancer registries that feed their data to the SEER database. A data analyst cannot draw conclusions about the totality of cancers when he or she uses a Big Data resource that omits the two most common cancers of humans. In fact, the SEER database excludes a great deal of information that might be of interest to data analysts. For example, the SEER database does not include most precancers (i.e., early stage cancers), thus limiting studies examining the progression rates of precancerous stages to fully invasive cancers.
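
A minimal sketch of this kind of exploratory evaluation is shown below. It assumes a hypothetical tab-delimited extract with a "diagnosis" column; the file name and column name are placeholders, not the actual SEER file format. The point is simply to profile what the resource contains, and to check explicitly for categories you expect to find.

# Minimal sketch: profile what a resource actually contains before posing
# research questions to it. Assumes a hypothetical tab-delimited extract,
# "registry_extract.txt", with a "diagnosis" column; this is not the actual
# SEER file format.
import csv
from collections import Counter

diagnosis_counts = Counter()
with open("registry_extract.txt", newline="") as f:
    for record in csv.DictReader(f, delimiter="\t"):
        diagnosis_counts[record["diagnosis"].strip().lower()] += 1

# List every diagnosis category and its frequency, most common first.
for diagnosis, n in diagnosis_counts.most_common():
    print(f"{n:>8}  {diagnosis}")

# Check explicitly for categories you expect; their absence is itself a finding.
for expected in ("basal cell carcinoma", "squamous cell carcinoma of skin"):
    if expected not in diagnosis_counts:
        print("NOT REPRESENTED in this resource:", expected)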

Big Data resources may contain systemic biases. For example, PubMed contains abstracted data on about 20 million research articles. Research articles are preferentially published on positive findings; it is very difficult for a scientist to publish a paper that reports on the absence of an effect or the nonoccurrence of a biological phenomenon. PubMed therefore has a positive result bias. The preferential exclusion or inclusion of specific types of data is very common, and data analysts must try to identify such biases.

Every Big Data resource has its blind spots—areas in which data is missing, scarce, or otherwise unrepresentative of the data domain. Often, the Big Data managers are unaware of such deficiencies. In some cases, Big Data managers blame the data analyst for “inventing” a deficiency that pertains exclusively to unauthorized uses of the resource. When a data analyst wishes to use a Big Data resource for something other than its intended purposes (e.g., using PubMed to predict National Institutes of Health funding priorities over the next decade, using the Netflix query box to determine what kinds of actors appear in zombie movies), then the Big Data manager may be reluctant to respond to the analyst’s complaints.

Simply because you have access to large amounts of data does not imply that you have all the data you would need to draw a correct conclusion.

Step 3. A Question Is Reformulated

If you can dream—and not make dreams your master.

from If (poem), by Rudyard Kipling

Data does not always answer the exact question you started with. After you have assessed the content and design of your Big Data resource(s), you will want to calibrate your question to your available data sources. In the case of our original question, from Step 1, we wanted to know how much money is spent, each year, on military defense in the United States; if we are unable to answer this question, we may be able to answer questions related to the budget sizes of individual government agencies that contribute to military spending. If we knew the approximate portion of each agency budget that is devoted to military spending, we might be able to produce a credible total for the amount devoted to military activities, without actually finding the exact answer.
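
A rough sketch of this reformulated approach follows. Every budget figure and every "fraction devoted to military activity" below is a hypothetical placeholder; the sketch only illustrates how a credible total, and an honest range around it, might be assembled from agency-level estimates.

# Sketch of the reformulated question: approximate total military spending as a
# weighted sum of agency budgets. All figures below are hypothetical
# placeholders, not real data.

agency_budgets = {
    # agency: (annual budget in billions, estimated fraction devoted to military activity)
    "Department of Defense":   (650.0, 0.95),
    "Intelligence agencies":   ( 80.0, 0.70),
    "Department of State":     ( 50.0, 0.15),
    "Veterans Administration": (120.0, 1.00),
}

estimated_total = sum(budget * fraction for budget, fraction in agency_budgets.values())
print(f"Credible estimate of annual military spending: {estimated_total:.1f} billion")

# A range is more honest than a point estimate; vary each fraction by +/- 20%.
low  = sum(b * max(f * 0.8, 0.0) for b, f in agency_budgets.values())
high = sum(b * min(f * 1.2, 1.0) for b, f in agency_budgets.values())
print(f"Plausible range: {low:.1f} to {high:.1f} billion")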

After exploring the resource, the data analyst learns the kinds of questions that can best be answered with the available data. With this insight, he or she can reformulate the original set of questions.

Step 4. Query Output Adequacy

Big Data resources can often produce an enormous output in response to a data query. When a data analyst receives a vast query output, he or she is likely to assume that the output is complete and valid. A query output is complete when it contains all of the data held in the Big Data resource that answers the query, and a query output is valid if the data in the query output yields a correct and repeatable answer.

A Google query is a familiar example of query output that is seldom examined seriously. When you enter a search term and receive millions of “hits,” you may tend to assume that your query output is adequate. When you’re looking for a particular Web page or an answer to a specific question, the first output page on your initial Google query may meet your needs. A thoughtful data analyst will want to submit many related queries to see which queries produce the best results. The analyst may want to combine the query outputs from multiple related queries and will almost certainly want to filter the combined outputs to discard response items that are irrelevant. The process of query output examination is often arduous, requiring many aggregation and filtering steps.
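
The following sketch shows one way to aggregate, deduplicate, and filter the outputs of several related queries. The run_query function and its canned records are hypothetical stand-ins for whatever query interface the resource actually provides.

# Sketch: combine outputs from several related queries, deduplicate, and filter.
# run_query() is a hypothetical stand-in for the resource's real query
# interface; here it returns canned records purely for illustration.

_FAKE_RESOURCE = {
    "mushroom poisoning": [("r1", "Amanita phalloides poisoning in campers"),
                           ("r2", "Outbreak of staphylococcal food poisoning")],
    "mushroom toxicity":  [("r1", "Amanita phalloides poisoning in campers"),
                           ("r3", "Hepatotoxicity of amatoxin-containing mushrooms")],
    "fungal toxins":      [("r4", "Aflatoxin contamination of stored grain")],
}

def run_query(phrase):
    """Placeholder query interface: returns (record_id, text) hits for a phrase."""
    return _FAKE_RESOURCE.get(phrase, [])

related_phrases = ["mushroom poisoning", "mushroom poisons", "mushroom poison",
                   "mushroom toxicity", "fungal toxins"]

combined = {}                                   # record_id -> text, deduplicated
for phrase in related_phrases:
    for record_id, text in run_query(phrase):
        combined.setdefault(record_id, text)

# Crude relevance filter: keep only records that mention mushrooms or mushroom toxins.
relevant = {rid: txt for rid, txt in combined.items()
            if "mushroom" in txt.lower() or "amanita" in txt.lower()
            or "amatoxin" in txt.lower()}

print(f"{len(combined)} unique records from {len(related_phrases)} queries; "
      f"{len(relevant)} retained after filtering")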

After satisfying yourself that you’ve taken reasonable measures to collect a complete query output, you will still need to determine whether the output you have obtained is fully representative of the data domain you wish to analyze. For example, you may have a large query output file related to the topic of poisonous mushrooms. You’ve aggregated query outputs on phrases such as “mushroom poisoning,” “mushroom poisons,” “mushroom poison,” “mushroom toxicity,” and “fungal toxins.” You pared down queries on “food poisoning” to include only mushroom-related entries. Now you want to test the output file to see if it has a comprehensive collection of information related to your topic of interest. You find a nomenclature of mushrooms, and you look for the occurrence of each nomenclature term in your aggregated and filtered output file. You find that there are no occurrences in your output of many of the types of mushrooms listed in the mushroom nomenclature, including mushrooms known to be toxic. In all likelihood, this means that the Big Data resource simply does not contain the level of detail you will need to support a thorough data analysis on topics related to poisonous mushrooms.
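
A coverage test of this kind can be scripted in a few lines. The sketch below assumes two plain-text files, a nomenclature with one term per line and your aggregated query output; both file names are hypothetical.

# Sketch: test whether an aggregated query output covers the data domain by
# checking it against an external nomenclature. Both file names below are
# hypothetical placeholders.

with open("mushroom_nomenclature.txt") as f:       # one mushroom name per line
    nomenclature = {line.strip().lower() for line in f if line.strip()}

with open("aggregated_output.txt") as f:           # your filtered query output
    output_text = f.read().lower()

missing = sorted(term for term in nomenclature if term not in output_text)
coverage = 1.0 - len(missing) / max(len(nomenclature), 1)

print(f"Coverage: {coverage:.1%} of {len(nomenclature)} nomenclature terms")
for term in missing[:25]:
    print("absent from output:", term)
# A long list of absent terms, especially known toxic species, suggests the
# resource lacks the detail needed for a thorough analysis of the topic.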

There is no standard way of measuring the adequacy of a query output; it depends on the questions you want to answer and the analytic methods you will employ. In some cases, a query output will be inadequate because the Big Data resource simply does not contain the information you need, at least not at the level of detail you need. In other cases, the Big Data resource contains the information you need, but does not provide a useful pathway by which your query can access the data. Queries cannot thoroughly access data that is not fully annotated, assigned to classes, and constructed as identified data objects.

Data analysts must be prepared to uncover major flaws in the organization, annotation, and content of Big Data resources. When a flaw is found, it should be promptly reported to the data manager for the resource. A good data manager will have a policy for accepting error reports, conducting investigations, instituting corrections as necessary, and documenting every step in the process.

Step 5. Data Description

Is the output data numeric or is it categorical? If it is numeric, is it quantitative? For example, telephone numbers are numeric, but not quantitative. If the data is numeric and quantitative, then your analytic options are many. If the data is categorical information (e.g., male or female, true or false), then the analytic options are limited. The analysis of categorical data is first and foremost an exercise in counting; comparisons and predictions are based on the number of occurrences of features.

Are all of your data objects comparable? Big Data resources collect data objects from many different sources, and the different data objects may not be directly comparable. The objects themselves may be annotated with incompatible class hierarchies (e.g., one data object described as a “chicken” may be classed as “Aves,” while another “chicken” object may be classed as “food”). One data object described as a “child” may have the “age” property divided into 3-year increments up to age 21. Another “child” object may have “age” divided into 4-year increments up to age 16. The data analyst must be prepared to normalize assigned classes, ranges of data, subpopulations of wildly different sizes, different nomenclature codes, and so on.
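
As a small illustration, the sketch below normalizes the two incompatible age binnings from the example above into the coarsest categories that both sources can fill unambiguously. The record counts are invented; only the normalization step is the point.

# Sketch: two sources bin the "age" property incompatibly (3-year vs. 4-year
# increments), so the bins do not nest. One defensible normalization maps both
# into the coarsest categories that each source fills unambiguously; here,
# "0-11" and "12+". The record counts are invented for illustration.
from collections import Counter

source_a = {"0-2": 40, "3-5": 35, "6-8": 30, "9-11": 25,        # 3-year bins
            "12-14": 20, "15-17": 15, "18-20": 10}
source_b = {"0-3": 50, "4-7": 45, "8-11": 40,                   # 4-year bins
            "12-15": 30, "16+": 20}

def coarse_bin(label):
    """Map a source-specific age bin label to a common coarse bin."""
    upper = label.rstrip("+").split("-")[-1]
    return "0-11" if int(upper) <= 11 else "12+"

normalized = Counter()
for source in (source_a, source_b):
    for label, count in source.items():
        normalized[coarse_bin(label)] += count

for label in ("0-11", "12+"):
    print(label, normalized[label])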

After the data is normalized and corrected for missing data and false data, you will need to visualize data distributions. Be prepared to divide your data into many different groupings and to plot and replot your data with many different techniques (e.g., histograms, smoothing convolutions, cumulative plots, etc.). Look for general features (e.g., linear curves, nonlinear curves, Gaussian distributions, multimodal curves, convergences, nonconvergences, Zipf-like distributions). Visualizing your data with numerous alternate plotting methods may provide fresh insights and will reduce the likelihood that any one method will bias your objectivity.
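
A brief sketch of this multi-view habit, using synthetic bimodal data and standard Python plotting tools (numpy and matplotlib), is shown below; a real analysis would substitute its own normalized data.

# Sketch: plot the same distribution several ways before drawing conclusions.
# The data here is synthetic (a bimodal mixture), purely for illustration.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(10, 2, 5000), rng.normal(25, 4, 2000)])

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].hist(data, bins=60)                       # histogram: reveals the two modes
axes[0].set_title("Histogram")

sorted_data = np.sort(data)                       # cumulative plot: no binning bias
axes[1].plot(sorted_data, np.linspace(0, 1, len(sorted_data)))
axes[1].set_title("Cumulative distribution")

counts, edges = np.histogram(data, bins=60)       # crude smoothing convolution
smooth = np.convolve(counts, np.ones(5) / 5, mode="same")
axes[2].plot(edges[:-1], smooth)
axes[2].set_title("Smoothed histogram")

plt.tight_layout()
plt.show()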

Step 6. Data Reduction

An irony of Big Data analysis is that the data analyst must make every effort to gather all of the data related to a project, followed by an equally arduous phase during which the data analyst must cull the data down to its bare essentials.

There are very few situations wherein all of the data contained in a Big Data resource is subjected to analysis. Aside from the computational impracticalities of analyzing massive amounts of data, most real-life problems are focused on a relatively small set of local observations drawn from a large number of events that are irrelevant to the problem at hand. The process of extracting a small set of relevant data from a Big Data resource is referred to by a variety of names, including data reduction, data filtering, and data selection. The reduced data set that you will use in your project should obey the courtroom oath “the whole truth, and nothing but the truth.”

Methods for reducing the dimensionality of data are described in Chapter 9. As a practical point, when the random and redundant variables have been expunged, the remaining data set may still be too large for a frontal computational attack. A good data analyst knows when to retreat and regroup. If something cannot be calculated to great precision on a large number of variables and data points, then it should be calculated with somewhat less precision with somewhat fewer variables and fewer data points. Why not try the small job first and see what it tells you?
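
Two blunt but effective reduction passes, dropping uninformative or duplicated variables and then working on a small random sample first, are sketched below with pandas. The input file name is a hypothetical placeholder.

# Sketch: two blunt data-reduction passes before any heavy analysis.
# The file name "big_extract.csv" is a hypothetical placeholder.
import pandas as pd

df = pd.read_csv("big_extract.csv")

# (1) Drop columns that carry no information (a single value throughout)
#     and columns that exactly duplicate an earlier column.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
df = df.drop(columns=constant_cols)
df = df.loc[:, ~df.T.duplicated()]      # adequate for a modest number of columns

# (2) Try the small job first: a reproducible 1% random sample of the rows.
sample = df.sample(frac=0.01, random_state=42)

print(f"dropped {len(constant_cols)} uninformative columns; "
      f"working sample: {sample.shape[0]} rows x {sample.shape[1]} columns")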

Step 7. Algorithms Are Selected, If Absolutely Necessary

Algorithms are perfect machines. They work to produce consistent solutions; they never make mistakes; they need no fuel; they never wear down; they are spiritual, not physical. Every computer scientist loves algorithms.

If you peruse the titles of books in the Big Data field, you will find that most of these books are focused on data analysis. They focus on parallel processing, cloud computing (see Glossary item, Cloud computing), high-power predictive analytics, combinatorics methods, and the like. It is very easy to believe that the essential feature of Big Data that separates it from small data relates to algorithms. My experience is that we have a great many brilliant algorithms that will serve most of our Big Data analytic needs. The typical Big Data analyst will spend most of his or her career trying to collect and organize data. Algorithms are a small piece of the puzzle.

As algorithms become more and more clever, they become more and more enigmatic. Fewer and fewer people truly understand how they work. Some of the most popular statistical methods defy simple explanation, including p values and linear regression104,154 (see Glossary items, Linear regression, p value). When a scientist submits an article to a journal, he or she can expect the journal editor to insist that a statistician be included as a coauthor. It is so easy to use the wrong statistical method that editors and reviewers do not trust nonstatisticians to conduct their own analyses. The field of Big Data comes with a dazzling assortment of analytic options. Who will judge that the correct method is chosen, that the method is implemented properly, and that the results are interpreted correctly?

When a project does not truly require advanced algorithms, avoiding them may be a reasonable recourse. Analysts should consider the following options.

1. Stick with simple estimates. If you have taken to heart the suggestion in Chapter 8, to estimate your answers early in project development, then you have already found simple estimators for your data. Consider this option: keep the estimators and forget about advanced algorithms. For many projects, estimators can be easily understood by project staff and will provide a practical alternative to exact solutions that are difficult to calculate and impossible to comprehend.

2. Pick better metrics, not better algorithms. Sabermetrics is a sterling example of analysis using simple metrics that are chosen to correlate well with a specific outcome—a winning ballgame. In the past several decades, baseball analysts have developed a wide variety of new performance measurements for baseball players. These include base runs, batting average on balls in play, defense-independent pitching statistics, defense-independent earned run average, fielding independent pitching, total player rating, batter-fielder wins, total pitcher index, and ultimate zone rating. Most of these metrics were developed empirically, tested in the field, literally, and optimized as needed. They are all simple linear metrics that use combinations of weighted measures on data collected during ballgames. Though sabermetrics has its detractors, everyone would agree that it represents a fascinating and largely successful effort to bring objective numeric techniques to the field of baseball. Nothing in sabermetrics involves advanced algorithms. It is all based on using a deep understanding of the game of baseball to develop a set of simple metrics that can be easily calculated and validated (a minimal sketch of such a weighted metric appears after this list).

3. Micromanage your macrodata. Much of the success of Big Data is attained by making incremental, frequent changes to your system in response to your metrics. An example of successful micromanagement for Big Data is the municipal CompStat model, used by police departments and other government agencies.155,156 A promising metric is chosen, such as emergency 911 call response time, and a team closely monitors the data on a frequent basis, sometimes several times a day. Slow 911 response times are investigated, and the results of these investigations typically generate action items intended to correct systemic errors. When implemented successfully, the metric improves (e.g., the 911 response time is shortened) and a wide range of systemic problems are solved. Micromanaging a single metric can improve the overall performance of a department.

Departments with imagination can choose very clever metrics upon which to build an improvement model (e.g., time from license application to license issuance, number of full garbage cans sitting on curbs, length of toll booth lines, numbers of broken street lights, etc.). It is useful to choose the best metrics, but the choice of the metric is not as important as the ability to effectively monitor and improve the metric.

As a personal aside, I have used this technique in the medical setting and found it immensely effective. During a period of about 5 years at the Baltimore VA Medical Center, I had access to all the data generated in our pathology department. Using a variety of metrics, such as case turnaround time, cases requiring notification of a clinician, cases positive for malignancy, and diagnostic errors, our pathologists were able to improve the measured outcomes. More importantly, the process of closely monitoring for deficiencies, quickly addressing the problem, and reporting on the outcome of each correction produced a staff that was sensitized to the performance of the department. There was an overall performance improvement.

Like anything in the Big Data realm, the data micromanagement approach may not work for everyone, but it serves to show that great things may come when you carefully monitor your Big Data resource.

4. Let someone else find an algorithm for you; crowd source your project. There is a lot of analytic talent in this world. Broadcasting your project via the Web may attract the attention of individuals or teams of researchers who have already solved a problem isomorphic to your own or who can rapidly apply their expertise to your specific problem.157

5. Offer a reward. Funding entities have recently discovered that they can solicit algorithmic solutions, offering cash awards as an incentive. For example, the InnoCentive organization issues challenges regularly, and various sponsors pay awards for successful implementations112 (see Glossary item, Predictive modeling contests).

6. Develop your own algorithm that you fully understand. You should know your data better than anyone else. With a little self-confidence and imagination, you can develop an analytic algorithm tailored to your needs.
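
The sketch below, referenced in option 2, shows how little machinery a simple weighted linear metric requires. The component statistics and weights are invented for illustration; they are not an actual published sabermetric formula.

# Sketch: a simple weighted linear metric in the sabermetric style. The
# statistics and weights are hypothetical, chosen only to show that such a
# metric is trivial to compute, inspect, and recalibrate.

weights = {"walks": 0.7, "singles": 0.9, "doubles": 1.25, "triples": 1.6, "home_runs": 2.0}

players = {
    "player A": {"walks": 60, "singles": 110, "doubles": 30, "triples": 3, "home_runs": 25},
    "player B": {"walks": 85, "singles": 95, "doubles": 40, "triples": 1, "home_runs": 18},
}

def linear_metric(stats, weights):
    """Weighted sum of counting statistics; easy to explain and to validate."""
    return sum(weights[k] * stats.get(k, 0) for k in weights)

for name, stats in players.items():
    print(name, round(linear_metric(stats, weights), 1))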

Step 8. Results Are Reviewed and Conclusions Are Asserted

When the weather forecaster discusses the projected path of a hurricane, he or she will typically show the different paths projected by different models. The forecaster might draw a cone-shaped swath bounded by the paths predicted by the several different forecasting models. A central line in the cone might represent the composite path produced by averaging the forecasts from the different models. The point here is that Big Data analyses never produce a single, undisputed answer. There are many ways of analyzing Big Data, and they all produce different solutions.
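
In miniature, the habit looks like this: report every model's output, the composite, and the spread, rather than a single number. The model outputs below are made-up values.

# Sketch: report a composite of several models along with their spread,
# instead of presenting one model's answer as definitive. Values are invented.

model_predictions = {
    "model 1": 42.0,
    "model 2": 47.5,
    "model 3": 39.0,
    "model 4": 44.2,
}

values = list(model_predictions.values())
composite = sum(values) / len(values)
print(f"composite (mean of models): {composite:.1f}")
print(f"spread across models: {min(values):.1f} to {max(values):.1f}")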

A good data analyst should interpret results conservatively. Here are a few habits that will keep you honest and will reduce the chances that your results will be discredited.

1. Never assert that your analysis is definitive. If you have analyzed the Big Data with several models, include your results for each model. It is perfectly reasonable to express your preference for one model over another. It is not acceptable to selectively withhold results that could undermine your theory.

2. Avoid indicating that your analysis provides a causal explanation of a physical process. In most cases, Big Data conclusions are descriptive and cannot establish physical causality. This situation may improve as we develop better methods to make reasonable assertions for causality based on analyses of large, retrospective data sets.158,159 In the meantime, the primary purpose of Big Data analysis is to provide a hypothesis that can be subsequently tested, usually through experimentation, and validated.

3. Disclose your biases. It can be hard to resist choosing an analytic model that supports your preexisting opinion. When your results advance your own agenda, it is important to explain that you have a personal stake in the outcome or hypothesis. It is wise to indicate that the data can be interpreted by other methods and that you would be willing to cooperate with colleagues who might prefer to conduct an independent analysis of your data. When you offer your data for reanalysis, be sure to include all of your data: the raw data, the processed data, and step-by-step protocols for filtering, transforming, and analyzing the data.

4. Do not try to dazzle the public with the large number of records in your Big Data project. Large studies are not necessarily good studies, and the honest data analyst will present the facts and the analysis without using the number of data records as a substitute for analytic rigor.

Step 9. Conclusions Are Examined and Subjected to Validation

Sometimes you gotta lose ’til you win.

from Little Miss (song) by Sugarland

Validation involves demonstrating that the assertions that come from data analyses are reliable. You validate an assertion (which may appear in the form of a hypothesis, a statement about the value of a new laboratory test, or a therapeutic protocol) by showing that you draw the same conclusion repeatedly in comparable data sets.
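
The sketch below illustrates the idea with synthetic data: an assertion (here, that two variables are positively correlated) is checked in several comparable data sets, and it is considered validated only if the same conclusion is drawn every time.

# Sketch: validate an assertion by testing whether the same conclusion emerges
# repeatedly from comparable data sets. The data sets here are synthetic
# stand-ins for independent, comparable collections.
import numpy as np

rng = np.random.default_rng(1)

def comparable_dataset(n=500):
    x = rng.normal(0, 1, n)
    y = 0.4 * x + rng.normal(0, 1, n)     # a true positive association is built in
    return x, y

conclusions = []
for i in range(5):                         # five comparable data sets
    x, y = comparable_dataset()
    r = np.corrcoef(x, y)[0, 1]
    conclusions.append(r > 0)
    print(f"data set {i + 1}: correlation = {r:+.2f}")

print("assertion replicated in every data set" if all(conclusions)
      else "assertion failed to replicate; do not consider it validated")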

Real science can be validated, if true, and invalidated, if false. Pseudoscience is a pejorative term that applies to scientific conclusions that are consistent with some observations, but which cannot be confirmed or tested with additional data. For example, there is a large body of information that would suggest that the earth has been visited by flying saucers. The evidence comes in the form of eyewitness accounts, numerous photographs, and vociferous official denials of these events suggesting some form of cover-up. Without commenting on the validity of UFO claims, it is fair to say that these assertions fall into the realm of pseudoscience because they are untestable (i.e., there is no way to prove that flying saucers do not exist) and there is no definitive data to prove their existence (i.e., the “little green men” have not been forthcoming).

Big Data analysis always stands on the brink of becoming a pseudoscience. Our finest Big Data analyses are only valid to the extent that they have not been disproven. A good example of a tentative and clever conclusion drawn from data is the Titius-Bode law. Titius and Bode developed a simple formula that predicted the locations of planets orbiting a star. It was based on data collected on all of the planets known to Johann Daniel Titius and Johann Elert Bode, two 18th-century scientists. These planets included Mercury through Saturn. In 1781, Uranus was discovered. Its position fit almost perfectly into the Titius-Bode series, thus vindicating the predictive power of their formula. The law predicted a fifth planet, between Mars and Jupiter. Though no fifth planet was found, astronomers found a very large solar-orbiting asteroid, Ceres, at the location predicted by Titius and Bode. By this time, the Titius-Bode law was beyond rational disputation. Then came the discoveries of Neptune and Pluto, neither of which remotely obeyed the law. The data had finally caught up to the assertion. The Titius-Bode law was purely descriptive—not based on any universal physical principles. It served well for the limited set of data to which it was fitted. Today, few scientists remember the discredited Titius-Bode law.
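
The formula itself, roughly distance = 0.4 + 0.3 × k astronomical units with k = 0, 1, 2, 4, 8, and so on, is easy to check against approximate modern planetary distances, and the check shows exactly where the law breaks down.

# Sketch: the Titius-Bode formula, distance ~ 0.4 + 0.3 * k AU with
# k = 0, 1, 2, 4, 8, ..., compared against approximate modern semimajor axes
# (in astronomical units). The fit is good for the bodies known in the 18th
# century, then fails badly at Neptune and Pluto.

bodies = [  # (name, k, approximate actual distance in AU)
    ("Mercury", 0, 0.39), ("Venus", 1, 0.72), ("Earth", 2, 1.00),
    ("Mars", 4, 1.52), ("Ceres", 8, 2.77), ("Jupiter", 16, 5.20),
    ("Saturn", 32, 9.58), ("Uranus", 64, 19.2),
    ("Neptune", 128, 30.1), ("Pluto", 256, 39.5),
]

for name, k, actual in bodies:
    predicted = 0.4 + 0.3 * k
    error = 100 * (predicted - actual) / actual
    print(f"{name:8} predicted {predicted:6.1f} AU   actual {actual:6.2f} AU   "
          f"error {error:+6.1f}%")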

Let us look at a few counterexamples. Natural selection is an interesting theory, published by Charles Darwin in 1859. It was just one among many interesting theories aimed at explaining evolution and the origin of species. The Lamarckian theory of evolution preceded Darwin’s natural selection by nearly 60 years. The key difference between Darwin’s theory and Lamarck’s theory comes down to validation. Darwin’s theory has withstood every test posed by scientists in the fields of geology, paleontology, bacteriology, mycology, zoology, botany, medicine, and genetics. Predictions based on Darwinian evolution dovetail perfectly with observations from diverse fields. The Lamarckian theory of evolution, proposed well before DNA was established as the genetic template for living organisms, held that animals passed experiences to succeeding generations through germ cells, thus strengthening intergenerational reliance on successful behaviors of the parent. This theory was groundbreaking in its day, but subsequent findings failed to validate the theory. Neither Darwin’s theory nor Lamarck’s theory could be accepted on its own merits. Darwin’s theory is correct, as far as we can know, because it was validated by scientific progress that occurred over the ensuing 150 years. The validation process was not rewarding for Lamarck.

The value of Big Data is not so much to make predictions, but to test predictions on a vast number of data objects. Scientists should not be afraid to create and test their prediction models in a Big Data environment. Sometimes a prediction is invalidated, but an important conclusion can be drawn from the data anyway. Failed predictions often lead to new, more successful predictions.

References

76. Frey CM, McMillen MM, Cowan CD, Horm JW, Kessler LG. Representativeness of the surveillance, epidemiology, and end results program data: recent trends in cancer mortality rate. JNCI. 1992;84:872.

104. Janert PK. Data analysis with open source tools. O’Reilly Media 2010.

112. Cleveland Clinic: build an efficient pipeline to find the most powerful predictors. InnoCentive challenge, September 8, 2011. Available from: https://www.innocentive.com/ar/challenge/9932794; viewed September 25, 2012.

153. Fifty-six year trends in U.S. cancer death rates. In: SEER Cancer Statistics Review 1975-2005. National Cancer Institute. Available from: http://seer.cancer.gov/csr/1975_2005/results_merged/topic_historical_mort_trends.pdf; viewed September 19, 2012.

154. Cohen J. The earth is round (p < .05). Am Psychol. 1994;49:997–1003.

155. Rosenberg T. Opinionator: armed with data, fighting more than crime. The New York Times May 2, 2012.

156. Hoover JN. Data, analysis drive Maryland government. Information Week March 15, 2010.

157. Howe J. The rise of crowdsourcing. Wired. 2006;14:06.

158. Robins JM. The control of confounding by intermediate variables. Stat Med. 1989;8:679–701.

159. Robins JM. Correcting for non-compliance in randomized trials using structural nested mean models. Commun Stat Theory Methods. 1994;23:2379–2412.

