12

Finding the Clues in Large Collections of Data

Abstract

It is often assumed that Big Data resources are too large and complex for human comprehension, and that their analysis is best left to software programs. Not so. When data analysts go straight to complex calculations, before they perform simple estimations, they will find themselves accepting wildly ridiculous conclusions. There is nothing quite like a simple, intuitive estimate to yank an overly-eager analyst back to reality. Often the simple act of looking at a stripped-down version of the problem opens a new approach that can drastically reduce computation time. In some situations analysts will find that a point is reached when higher refinements in methods yield diminishing returns. Advanced predictive algorithms may offer little improvement over a thoughtfully chosen, simple estimator. This chapter suggests a few fast and simple methods for exploring large and complex data sets.

Keywords

Missing data; Data trends; Frequency distributions; Data peaks; Multimodal data

Section 12.1. Denominators

The question is not what you look at, but what you see.

Henry David Thoreau in Journal, 5 August 1851

Denominators are the numbers that provide perspective to other numbers. If you are informed that 10,000 persons die each year in the United States from a particular disease, then you might want to know the total number of deaths from all causes, or the total population at risk. When you compare the deaths from a particular disease against such a denominator, you learn something about the relative importance of your original count (e.g., an incidence of 10,000 deaths/350 million persons). Epidemiologists typically represent incidences as numbers per 100,000 population. An incidence of 10,000/350 million is equivalent to an incidence of 2.9 per 100,000.
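The arithmetic is simple enough to express in a few lines of Python; here is a minimal sketch using the approximate figures from the example above.

# Express a disease-specific death count as an incidence per 100,000,
# using the approximate figures from the example above.
deaths_from_disease = 10000
population = 350000000            # denominator: total United States population
incidence_per_100k = deaths_from_disease / population * 100000
print(round(incidence_per_100k, 1))   # 2.9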

Denominators are not always easy to find. In most cases the denominator is computed by tallying every data object in a Big Data resource. If you have a very large number of data objects, then the time required to reach a global tally may be quite long. In many cases a Big Data resource will permit data analysts to extract subsets of data, but analysts will be forbidden to examine the entire resource. In such cases the denominator will be computed for the subset of extracted data and will not accurately represent all of the data objects available to the resource.

If you are using Big Data collected from multiple sources, your histograms (i.e., graphic representations of the distribution of objects by some measurable attribute such as age, frequency, size) will need to be represented as fractional distributions for each source's data; not as value counts. The reason for this is that a histogram from one source may not have the same total number of distributed values compared with the histogram created from another source. As discussed in Section 10.4, “Normalizing and Transforming Your Data,” we achieve comparability among histograms by dividing the binned values by the total number of values in a distribution, for each data source. Doing so renders the bin value as a percentage of total, rather than a sum of data values.
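Here is a minimal sketch of the normalization described above, using hypothetical bin counts from two sources of different sizes.

# Convert raw bin counts into fractional distributions, so that histograms
# from sources of different sizes can be compared bin-by-bin.
source_a = [120, 300, 95, 40]          # hypothetical bin counts from one source
source_b = [1200, 2800, 1100, 500]     # hypothetical bin counts from a larger source

def to_fractions(bins):
    total = sum(bins)
    return [count / total for count in bins]

print(to_fractions(source_a))
print(to_fractions(source_b))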

Big Data managers should make an effort to supply information that summarizes the total set of data available at any moment in time, and should also provide information on the sources of data that contribute to the total collection. Here are some of the numbers that should be available to analysts: the number of records in the resource, the number of classes of data objects in the resource, the number of data objects belonging to each class in the resource, and the number of data values (preferably expressed as metadata/data pairs) that belong to data objects.
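As a minimal sketch, assuming a toy resource in which every data object is represented as a dictionary carrying a class assignment and its metadata/data pairs, these summary numbers can be tallied in a few lines of Python.

from collections import Counter

# Hypothetical toy resource: each data object is a dictionary holding a
# class assignment and a set of metadata/data pairs.
resource = [
    {"class": "patient", "age": 42, "sex": "F"},
    {"class": "patient", "age": 67, "sex": "M"},
    {"class": "specimen", "tissue": "lung"},
]

print("records:", len(resource))
class_counts = Counter(obj["class"] for obj in resource)
print("classes:", len(class_counts))
print("objects per class:", dict(class_counts))
print("metadata/data pairs:", sum(len(obj) - 1 for obj in resource))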

Section 12.2. Word Frequency Distributions

Poetry is everywhere; it just needs editing.

James Tate

There are two general types of data: quantitative and categorical. Quantitative data refers to measurements. Categorical data refers to counts; i.e., the number of items that share a particular feature. For most purposes the analysis of categorical data reduces to counting and binning.

Categorical data typically conforms to a Zipf distribution. George Kingsley Zipf (1902–50) was an American linguist who demonstrated that, for most languages, a small number of words account for the majority of occurrences of all the words found in prose. Specifically, he found that the frequency of any word is inversely proportional to its placement in a list of words, ordered by their decreasing frequencies in text. The first word in the frequency list will occur about twice as often as the second word in the list and three times as often as the third word in the list. [Glossary Word lists]

The Zipf distribution applied to languages is a special form of Pareto's principle, or the 80/20 rule. Pareto's principle holds that a small number of causes may account for the vast majority of observed instances. For example, a small number of rich people account for the majority of wealth. Likewise, a small number of diseases account for the vast majority of human illnesses. A small number of children account for the majority of the behavioral problems encountered in a school. A small number of states hold the majority of the population of the United States. A small number of book titles, compared with the total number of publications, account for the majority of book sales. Much of Big Data is categorical and obeys the Pareto principle. Mathematicians often refer to Zipf distributions as Power law distributions. A short Python script for producing Zipf distributions is found under its Glossary item. [Glossary Power law, Pareto's principle, Zipf distribution]

Let us take a look at the frequency distribution of words appearing in a book. Here is the list of the 30 most frequent words in a sample book and the number of occurrences of each word.

01 003977 the
02 001680 and
03 001091 class
04 000946 are
05 000925 chapter
06 000919 that
07 000884 species
08 000580 virus
09 000570 with
10 000503 disease
11 000434 for
12 000427 organisms
13 000414 from
14 000412 hierarchy
15 000335 not
16 000329 humans
17 000320 have
18 000319 proteobacteria
19 000309 human
20 000300 can
21 000264 fever
22 000263 group
23 000248 most
24 000225 infections
25 000219 viruses
26 000219 infectious
27 000216 organism
28 000216 host
29 000215 this
30 000211 all

As Zipf would predict, the most frequent word, “the” occurs 3977 times, roughly twice as often as the second most frequently occurring word, “and,” which occurs 1680 times. The third most frequently occurring word “class” occurs 1091 times, or very roughly one-third as frequently as the most frequently occurring word.

What can we learn about the text from which these word frequencies were calculated? As discussed in Chapter 1, “stop” words are high-frequency words that separate terms and tell us little or nothing about the informational content of text. Let us look at this same list with the “stop” words removed:

03 001091 class
05 000925 chapter
07 000884 species
08 000580 virus
10 000503 disease
12 000427 organisms
14 000412 hierarchy
16 000329 humans
18 000319 proteobacteria
19 000309 human
21 000264 fever
22 000263 group
24 000225 infections
25 000219 viruses
26 000219 infectious
27 000216 organism
28 000216 host

What kind of text could have produced this list? Could there be any doubt that the list of words and frequencies shown here came from a book whose subject is related to microbiology? As it happens, this word-frequency list came from a book that I previously wrote entitled “Taxonomic Guide to Infectious Diseases: Understanding the Biologic Classes of Pathogenic Organisms” [1]. By glancing at a few words from a large text file, we gain a deep understanding of the subject matter of the text. The words with the top occurrence frequencies told us the most about the book, because these words are low-frequency in most books (e.g., words such as hierarchy, proteobacteria, organism). They occurred in high frequency because the text was focused on a narrow subject (e.g., infectious diseases).

A clever analyst will always produce a Zipf distribution for categorical data. A glance at the output reveals a great deal about the contents of the data.
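Here is a minimal sketch of how such a frequency listing might be produced from a plain-text file. The filename and the short stop-word list are illustrative assumptions, not part of the original analysis.

import re
from collections import Counter

# An illustrative, deliberately short stop-word list.
stop_words = {"the", "and", "are", "that", "with", "for", "from",
              "not", "have", "can", "most", "this", "all"}

# Read a plain-text file (hypothetical filename) and tally the words.
with open("sample_book.txt") as infile:
    words = re.findall(r"[a-z]+", infile.read().lower())

freq = Counter(word for word in words if word not in stop_words)
for rank, (word, count) in enumerate(freq.most_common(30), start=1):
    print("%02d %06d %s" % (rank, count, word))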

Let us go one more step, and produce a cumulative index for the occurrence of words in the text, arranging them in order of descending frequency of occurrence.

01 003977 0.0559054232618291 the
02 001680 0.0795214934352948 and
03 001091 0.0948578818634204 class
04 000946 0.108155978520622 are
05 000925 0.121158874300655 chapter
06 000919 0.134077426972926 that
07 000884 0.146503978183249 species
08 000580 0.154657145266946 virus
09 000570 0.162669740504372 with
10 000503 0.169740504371784 disease
11 000434 0.175841322499930 for
12 000427 0.181843740335686 organisms
13 000414 0.187663414771290 from
14 000412 0.193454974837640 hierarchy
15 000335 0.198164131687706 not
16 000329 0.202788945430009 humans
17 000320 0.207287244510669 have
18 000319 0.211771486406702 proteobacteria
19 000309 0.216115156456465 human
20 000300 0.220332311844584 can
21 000264 0.224043408586128 fever
22 000263 0.227740448143046 group
23 000248 0.231226629930558 most
24 000225 0.234389496471647 infections
25 000219 0.237468019904973 viruses
26 000219 0.240546543338300 infectious
27 000216 0.243582895217746 organism
28 000216 0.246619247097191 host
29 000215 0.249641541792010 this
30 000211 0.252607607748320 all
.
.
.
.
.
8957 000001 0.999873485338356 acanthaemoeba
8958 000001 0.999887542522984 acalculous
8959 000001 0.999901599707611 academic
8960 000001 0.999915656892238 absurd
8961 000001 0.999929714076865 abstract
8962 000001 0.999943771261492 absorbing
8963 000001 0.999957828446119 absorbed
8964 000001 0.999971885630746 abrasion
8965 000001 0.999985942815373 abnormalities
8966 000001 1.000000000000000 abasence

In this cumulative listing, the third column is the cumulative fraction: the number of occurrences of the word, plus the occurrences of all preceding words in the list, divided by the total number of word occurrences in the text.

The list is truncated after the thirtieth entry and picks up again at entry number 8957. There are a total of 8966 different, sometimes called unique, words in the text. The total number of words in the text happens to be 71,138. The last word on the list, “abasence,” has a cumulative fraction of 1.0, as all of the preceding words, plus the last word, account for 100% of word occurrences. The cumulative frequency distribution for the different words in the text is shown (Fig. 12.1). As an aside, the items in the tail of the Zipf distribution, which typically occur only once in a large data collection, are often “mistakes.” In the case of text distributions, typographic errors can be found in the farthest and thinnest part of the tail. In this case the word “abasence” occurs just once, as the last item in the distribution. It is a misspelling of the word “absence.”

Fig. 12.1 A frequency distribution of word occurrences from a sample text. The bottom coordinates indicate that the entire text is accounted for by a list of about 9000 different words. The steep and early rise indicates that a few words account for the bulk of word occurrences. Graphs with this shape are sometimes referred to as Zipf distributions.

Notice that though there are a total of 8966 unique words in the text, the first thirty words account for more than 25% of all word occurrences. The final ten words on the list occurred only once in the text. Common statistical measurements, such as the average of a population or the standard deviation, fail to provide any useful description of Zipf distributions. [Glossary Nonparametric statistics]
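The cumulative index itself is produced with one running sum. Here is a minimal sketch, applied to a short illustrative text rather than the full book.

from collections import Counter

# Compute the cumulative fraction of word occurrences, in descending
# order of frequency, for a short illustrative text.
text = "the cat sat on the mat and the dog sat on the rug"
freq = Counter(text.split())

total_occurrences = sum(freq.values())
running_total = 0
for rank, (word, count) in enumerate(freq.most_common(), start=1):
    running_total += count
    print("%02d %06d %.15f %s" % (rank, count, running_total / total_occurrences, word))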

Section 12.3. Outliers and Anomalies

The mere formulation of a problem is far more essential than its solution, which may be merely a matter of mathematical or experimental skills. To raise new questions, new possibilities, to regard old problems from a new angle requires creative imagination and marks real advances in science.

Albert Einstein

On occasion the maximum or minimum of a set of data will be determined by an outlier value; a value lying nowhere near any of the other values in the data set. If you could just eliminate the outlier, then you might enjoy a maximum and minimum that were somewhat close to your other data values (i.e., the second-highest and second-lowest data values would be close to the maximum and the minimum, respectively). In these cases the data analyst must come to a decision: to drop or not to drop the outlier. There is no simple guideline for dealing with outliers, but it is sometimes helpful to know something about the dynamic range of the measurements. If a thermometer can measure temperature from − 20 to 140°F, and your data outlier has a temperature of 390°F, then you know that the outlier must be an error; the thermometer does not measure above 140 degrees. The data analyst can drop the outlier, but it would be prudent to determine why the outlier was generated. [Glossary Dynamic range, Outlier, Case report, Dimensionality]

Outliers are extreme data values. The occurrence of outliers hinders the task of developing models, equations, or curves that closely fit all the available data. In some cases, outliers are simply mistakes; while in other cases, outliers may be the most important data in the data set. Examples of outliers that have advanced science are many, including: the observation of wobbling stars leading to the discovery of exoplanets; anomalous X-ray bursts from space leading to the discovery of magnetars, highly magnetized neutron stars; individuals with unusual physiological measurements leading to the discovery of rare diseases. The special importance of outliers to Big Data is that as the size of data sets increases, the number of outliers also increases.

True outliers (i.e., outliers not caused by experimental design error or errors in observation and measurement) obey the same physical laws as everything else in the universe. Therefore a valid outlier will always reveal something that is generally true about reality. Put another way, outliers are not exceptions to the general rules; outliers are the exceptions upon which the general rules are based. This assertion brings us to the sadly underappreciated and underutilized creation known as “the case report.”

The case report, also known as the case study, is a detailed description of a single event or situation, often focused on a particular outlier, detail, or unique event. The concept of the case study is important in the field of Big Data because it highlights the utility of seeking general truths based on observations of outliers that can only be found in Big Data resources. Case reports are common in the medical literature, often beginning with a comment regarding the extreme rarity of the featured disease. You can expect to see phrases such as “fewer than a dozen have been reported in the literature” or “the authors have encountered no other cases of this lesion,” or such and such a finding makes this lesion particularly uncommon and difficult to diagnose. The point that the authors are trying to convey is that the case report is worthy of publication specifically because the observation departs from normal experience. This is just wrong.

Too often, case reports serve merely as cautionary exercises, intended to ward against misdiagnosis. The “beware this lesion” approach to case reporting misses the most important aspect of this type of publication; namely that science, and most aspects of human understanding, involve generalizing from single observations to general rules. When Isaac Newton saw an apple falling, he was not thinking that he could write a case report about how he once saw an apple drop, thus warning others not to stand under apple trees lest a rare apple might thump them upon the head. Newton generalized from the apple to all objects, addressing universal properties of gravity, and discovering the laws by which gravity interacts with matter. Case reports give us an opportunity to clarify the general way things work, by isolating one specific and rarely observed factor [2,3].

Section 12.4. Back-of-Envelope Analyses

Couldn't Prove, Had to Promise.

Book title, Wyatt Prunty

It is often assumed that Big Data resources are too large and complex for human comprehension, and that their analysis is best left to software programs. Not so. When data analysts go straight to the complex calculations, before they perform a simple estimation, they will find themselves accepting wildly ridiculous conclusions. For comparison purposes, there is nothing quite like a simple and intuitive estimate to pull an overly-eager analyst back to reality. Often, the simple act of looking at a stripped-down version of the problem opens a new approach that can drastically reduce computation time. In some situations, analysts will find that a point is reached when higher refinements in methods yield diminishing returns. After the numerati have used their most advanced algorithms to make an accurate prediction, they may find that their best efforts offer little improvement over a simple estimator. This section reviews simple methods for analyzing big and complex data.

  •   Estimation-only analyses

The sun is about 93 million miles from the Earth. At this enormous distance, the light hitting Earth arrives as near-parallel rays and the shadow produced by the Earth is nearly cylindrical. This means that the shadow of the Earth is approximately the same size as the Earth itself. If the Earth's circular shadow on the moon, as observed during a lunar eclipse, appears to be about 2.5 times the diameter of the moon itself, then the moon must have a diameter approximately 1/2.5 times that of the Earth. The diameter of the Earth is about 8000 miles, so the diameter of the moon must be about 8000/2.5, or roughly 3200 miles.

The true diameter of the moon is smaller, about 2160 miles. Our estimate is inaccurate because the Earth's shadow is actually conical, not cylindrical. If we wanted to use a bit more trigonometry, we would arrive at a closer approximation. Still, we arrived at a fair approximation of the moon's size from one simple division, based on a casual observation made during a lunar eclipse. The distance was not measured; it was estimated from a simple observation. Credit for this estimation goes to the Greek astronomer Aristarchus of Samos (310 BCE–230 BCE). In this particular case, a direct measurement of the moon's distance was impossible. Aristarchus' only option was the rough estimate. His predicament was not unique. Sometimes estimation is the only recourse for data analysts.
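The entire estimate reduces to a single division; here it is, with the approximate figures used above.

# Back-of-envelope estimate of the moon's diameter, using the rough
# figures given in the text.
earth_diameter_miles = 8000        # approximate diameter of the Earth
shadow_to_moon_ratio = 2.5         # Earth's shadow appears about 2.5 moon-diameters wide
moon_diameter_estimate = earth_diameter_miles / shadow_to_moon_ratio
print(moon_diameter_estimate)      # 3200.0 miles; the true value is about 2160 miles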

A modern-day example wherein measurements failed to help the data analyst is the calculation of deaths caused by heat waves. People suffer during heat waves, and municipalities need to determine whether people are dying from heat-related conditions. If heat-related deaths occur, then the municipality can justifiably budget for supportive services such as municipal cooling stations, the free delivery of ice, and increased staffing for emergency personnel. If the number of heat-related deaths is high, the governor may justifiably call a state of emergency.

Medical examiners perform autopsies to determine causes of death. During a heat wave the number of deceased individuals with a heat-related cause of death seldom rises as much as anyone would expect [4]. The reason for this is that stresses produced by heat cause death by exacerbating pre-existing non-heat-related conditions. The cause of death can seldom be pinned on heat. The paucity of autopsy-proven heat deaths can be relieved, somewhat, by permitting pathologists to declare a heat-related death when the environmental conditions at the site of death are consistent with hyperthermia (e.g., a high temperature at the site of death, and a high body temperature of the deceased measured shortly after death). Adjusting the criteria for declaring heat-related deaths is a poor remedy. In many cases the body is not discovered anytime near the time of death, invalidating the use of body temperature. More importantly, different municipalities may develop their own criteria for heat-related deaths (e.g., different temperature threshold measures, different ways of factoring night-time temperatures and humidity measurements). Basically, there is no accurate, reliable, or standard way to determine heat-related deaths at autopsy [4].

How would you, a data estimator, handle this problem? It is simple. You take the total number of deaths that occurred during the heat wave. Then you go back over your records of deaths occurring in the same period, in the same geographic region, over a series of years in which a heat wave did not occur. You average those numbers, giving you the expected number of deaths in a normal (i.e., without heat wave) period. You subtract that number from the number of deaths that occurred during the heat wave, and that gives you an estimate of the number of people who died from heat-related mortality. This strategy, applied to the 1995 Chicago heat wave, raised the estimated number of heat-related deaths from 485 to 739 [5].
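Here is a minimal sketch of the estimation, using hypothetical death counts; the Chicago figures cited above came from the actual municipal records.

# Estimate heat-related deaths as the excess over the baseline expected
# for the same period in years without a heat wave (hypothetical counts).
deaths_during_heat_wave = 1800
deaths_same_period_other_years = [1050, 1120, 1080, 1095]

baseline = sum(deaths_same_period_other_years) / len(deaths_same_period_other_years)
excess_deaths = deaths_during_heat_wave - baseline
print(round(excess_deaths))    # estimated heat-related deaths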

  •   Mean-field averaging

The average behavior of a collection of objects can be applied toward calculations that would exceed computational feasibility if applied to individual objects. Here is an example. Years ago, I worked on a project that involved simulating cell colony growth, using a Monte Carlo method [6]. Each simulation began with a single cell that divided, producing two cells, unless the cell happened to die prior to cell division. Each simulation applied a certain chance of cell death, somewhere around 0.5, for each cell, at each cell division. When you simulate colony growth, beginning with a single cell, the chance that the first cell will die on the first cell division would be about 0.5; hence, there is about a 50% chance that the colony will die out on the first cell division. If the cell survives the first cell division, the cell might go through several additional cell divisions before it dies, by chance. By that time, there are other progeny that are dividing, and these progeny cells might successfully divide, thus enlarging the size of the colony. A Monte Carlo simulation randomly assigned death or life at each cell division, for each cell in the colony. When the colony manages to reach a large size (e.g., ten million cells), the simulation slows down, as the Monte Carlo algorithm must parse through ten million cells, calculating whether each cell will live or die, assigning two offspring cells for each simulated division, and removing cells that were assigned a simulated “death.” When the computer simulation slowed to a crawl, I found that the whole population displayed an “average” behavior. There was no longer any need to perform a Monte Carlo simulation on every cell in the population. I could simply multiply the total number of cells by the cell death probability (for the entire population), and this would tell me the total number of cells that survived the cycle. For a large colony of cells, with a death probability of 0.5 for each cell, half the cells will die at each cell cycle, and the other half will live and divide, producing two progeny cells; hence the population of the colony will remain stable. When dealing with large numbers, it becomes possible to dispense with the Monte Carlo simulation and to predict each generational outcome with a pencil and paper. [Glossary Monte Carlo simulation]

Substituting the average behavior for a population of objects, rather than calculating the behavior of every single object, is called mean-field approximation. It uses a physical law telling us that large collections of objects can be characterized by their average behavior. Mean-field approximation has been used with great success to understand the behavior of gases, epidemics, crystals, viruses, and all manner of large population problems. [Glossary Mean-field approximation]
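Here is a minimal sketch contrasting the two approaches for a single cell cycle: a cell-by-cell Monte Carlo pass, and the mean-field shortcut that multiplies the whole population by its average behavior. The colony size and death probability are illustrative.

import random

# One simulated cell cycle: each cell either dies (with probability
# death_prob) or divides into two progeny cells.
death_prob = 0.5
colony_size = 1000000              # illustrative large colony

# Monte Carlo: decide the fate of every cell individually.
survivors = sum(1 for _ in range(colony_size) if random.random() > death_prob)
monte_carlo_next_generation = survivors * 2

# Mean-field approximation: apply the average behavior to the whole population.
mean_field_next_generation = int(colony_size * (1 - death_prob) * 2)

print(monte_carlo_next_generation, mean_field_next_generation)
# For large colonies the two figures converge, and the simulation can be
# replaced by a pencil-and-paper calculation.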

Section 12.5. Case Study: Predicting User Preferences

He has no enemies, but is intensely disliked by his friends.

Oscar Wilde

Imagine you have all the preference data for every user of a large movie subscriber service, such as Netflix. You want to develop a system whereby the preference of any subscriber, for any movie, can be predicted. Here are some analytic options, listed in order of increasing complexity, omitting methods that require advanced mathematical skills.

  1.  Ignore your data and use experts.

Movie reviewers are fairly good predictors of a movie's appeal. If they were not good predictors, they would have been replaced by people with better predictive skills. For any movie, go to the newspapers and magazines and collect about ten movie reviews. Average the review scores and use the average as the predictor for all of your subscribers.

You can refine this method a bit by looking at the average subscriber scores, after the movie has been released. You can compare the scores of the individual experts to the average score of the subscribers. Scores from experts that closely matched the scores from the subscribers can be weighted a bit more heavily than experts whose scores were nothing like the average subscriber score.

  2.  Use all of your data, as it comes in, to produce an average subscriber score.

Skip the experts; go straight to your own data. In most instances, you would expect that a particular user's preference will come close to the average preference of the entire population in the data set for any given movie.

  3.  Lump people into preference groups based on shared favorites.

If Ann's personal list of top-favored movies is the same as Fred's top-favored list, then it is likely that their preferences will coincide. For movies that Ann has seen but Fred has not, use Ann's score as a predictor.

In a large data set, find an individual's top ten movie choices and add the individual to a group of individuals who share the same top-ten list. Use the average score for members of the group, for any particular movie as that movie's predictor for each of the members of the group.

As a refinement, find a group of people who share the top-ten and the bottom-ten scoring movies. Everyone in this group shares a love of the same top movies and a loathing for the same bottom movies.

  4.  Focus your refined predictions.
    For many movies, there really is not much of a spread in ratings. If just about everyone loves “Star Wars” and “Raiders of the Lost Ark” and “It's a Wonderful Life,” then there really is no need to provide an individual prediction for such movies. Likewise, if a movie is universally loathed, or universally accepted as an “average” flick, then why would you want to use computationally intensive models for these movies?

Most data sets have a mixture of easy and difficult data. There is seldom any good reason to develop predictors for the easy data. In the case of movie predictors, if there is very little spread in a movie's score, then you can safely use the average rating as the predicted rating for all individuals. By removing all of the “easy” movies from your group-specific calculations, you reduce the total number of calculations for the data collection.
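Here is a minimal sketch of this approach, using hypothetical ratings. Movies with little spread in their scores are handled with the plain average; only the remaining movies receive a prediction based on the user's preference group.

from statistics import mean, pstdev

# Hypothetical data: all subscribers' scores per movie, and the scores
# from the members of one user's preference group.
all_scores = {
    "Star Wars": [5, 5, 4, 5, 5, 4, 5],      # little spread: an "easy" movie
    "Divisive Film": [1, 5, 2, 5, 1, 4, 2],  # wide spread: worth a group prediction
}
group_scores = {
    "Star Wars": [5, 4, 5],
    "Divisive Film": [5, 5, 4],
}

SPREAD_CUTOFF = 0.75   # illustrative threshold separating the "easy" movies

def predict(movie):
    scores = all_scores[movie]
    if pstdev(scores) < SPREAD_CUTOFF:
        # Easy movie: nearly everyone agrees, so the global average suffices.
        return mean(scores)
    # Otherwise use the average among the user's preference group.
    return mean(group_scores[movie])

for movie in all_scores:
    print(movie, round(predict(movie), 2))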

This method of eliminating the obvious has application in many different fields. As a program director at the National Cancer Institute, I was peripherally involved in efforts to predict cancer treatment options for patients diagnosed in different stages of disease. Traditionally, large numbers of patients, at every stage of disease, were included in a prediction model that employed a list of measurable clinical and pathological parameters (e.g., age and gender of patient, size of tumor, the presence of local or distant metastases, and so on). It turned out that early models produced predictions where none were necessary. If a patient had a tumor that was small, confined to its primary site of growth, and minimally invasive at its origin, then the treatment was always limited to surgical excision; there were no options for treatment, and hence no reason to predict the best option for treatment. If a tumor was widely metastatic to distant organs at the time of diagnosis, then there were no available treatments known, at that time, that could cure the patient. By focusing their analyses on the subset of patients who could benefit from treatment and for whom the best treatment option was not predetermined, the data analysts reduced the size and complexity of the data and simplified the problem.

The take-away lesson from this section is that predictor algorithms, so popular now among marketers, are just one of many different ways of determining how individuals and subgroups may behave, under different circumstances. Big Data analysts should not be reluctant to try several different analytic approaches, including approaches of their own invention. Sometimes the simplest algorithms, involving nothing more than arithmetic, are the best.

Section 12.6. Case Study: Multimodality in Population Data

What is essential is invisible to the eye.

Antoine de Saint-Exupéry

Big Data distributions are sometimes multi-modal with several peaks and troughs. Multi-modality always says something about the data under study. It tells us that the population is somehow non-homogeneous. Hodgkin lymphoma is an example of a cancer with a bimodal age distribution. There is a peak in occurrences at a young age, and another peak of occurrences at a more advanced age. This two-peak phenomenon can be found whenever Hodgkin Lymphoma is studied in large populations [7,8].

In the case of Hodgkin lymphoma, lymphomas occurring in the young may share diagnostic features with the lymphomas occurring in the older population, but the occurrence of lymphomas in two separable populations may indicate that some important distinction may have been overlooked: a different environmental cause, or different genetic alterations of lymphomas in the two age sets, or two different types of lymphomas that were mistakenly classified under one name, or there may be something wrong with the data (i.e., misdiagnoses, mix-ups during data collection). Big Data, by providing large numbers of cases, makes it easy to detect data incongruities (such as multi-modality), when they are present. Explaining the causes for data incongruities is always a scientific challenge.
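As a minimal sketch of how multimodality might be noticed, the ages below are synthetic (not real Hodgkin lymphoma data); the script bins them into decades and reports every bin that is taller than both of its neighbors.

from collections import Counter

# Synthetic ages: one cluster of young individuals and one cluster of
# older individuals.
ages = [22, 25, 27, 24, 26, 23, 28, 64, 67, 70, 66, 69, 65, 71]

# Bin the ages into decades and count the occupants of each bin.
bins = Counter((age // 10) * 10 for age in ages)
counts = [bins.get(decade, 0) for decade in range(0, 100, 10)]

# A local peak is a bin taller than both of its neighbors.
peaks = [i * 10 for i in range(1, len(counts) - 1)
         if counts[i] > counts[i - 1] and counts[i] > counts[i + 1]]
print("peak decades:", peaks)   # more than one peak suggests a non-homogeneous population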

Multimodality in the age distribution of human diseases is an uncommon but well-known phenomenon. In the case of deaths resulting from the Spanish flu of 1918, a tri-modal distribution was noticed (i.e., a high death rate in young, middle-aged, and old individuals). In such cases, the observation of multimodality has provoked scientific interest, leading to fundamental discoveries in disease biology [9].

Section 12.7. Case Study: Big and Small Black Holes

If I didn't believe it, I would never have seen it.

Anon

The importance of inspecting data for multi-modality also applies to black holes. Most black holes have mass equivalents under 33 solar masses. Another set of black holes are supermassive, with mass equivalents of 10 or 20 billion solar masses. When there are objects of the same type, whose masses differ by a factor of a billion, scientists infer that there is something fundamentally different in the origin or development of these two variant forms of the same object. Black hole formation is an active area of interest, but current theory suggests that lower-mass black holes arise from pre-existing heavy stars. The supermassive black holes presumably grow from large quantities of matter available at the center of galaxies. The observation of bimodality inspired astronomers to search for black holes whose masses are intermediate between black holes with near-solar masses and the supermassive black holes. Intermediates have been found, and, not surprisingly, they come with a set of fascinating properties that distinguish them from other types of black holes. Fundamental advances in our understanding of the universe may sometimes follow from simple observations of multimodal data distributions.

Glossary

Case report The case report, also known as the case study, is a detailed description of a single event or situation, often devoted to an outlier, or a detail, or a unique occurrence of an observation. The concept of the case study is important in the field of data simplification because it highlights the utility of seeking general truths based on observations of rare events. Case reports are common in the biomedical literature, often beginning with a comment regarding the extreme rarity of the featured disease. You can expect to see phrases such as “fewer than a dozen have been reported in the literature” or “the authors have encountered no other cases of this lesion,” or such and such a finding makes this lesion particularly uncommon and difficult to diagnose; and so on. The point that the authors are trying to convey is that the case report is worthy of publication specifically because the observation is rare. Too often, case reports serve merely as a cautionary exercise, intended to ward against misdiagnosis. The “beware this lesion” approach to case reporting misses the most important aspect of this type of publication; namely that science, and most aspects of human understanding, involve generalizing from the specific. When Isaac Newton saw an apple falling, he was not thinking that he could write a case report about how he once saw an apple drop, thus warning others not to stand under apple trees lest a rare apple might thump them upon the head. Newton generalized from the apple to all objects, addressing universal properties of gravity, and discovering the laws by which gravity interacts with matter. The case report gives us an opportunity to clarify the general way things work, by isolating one specific and rarely observed factor [2]. Data scientists must understand that rare cases are not exceptions to the general laws of reality; they are the exceptions upon which the general laws of reality are based.

Dimensionality The dimensionality of a data object consists of the number of attributes that describe the object. Depending on the design and content of the data structure that contains the data object (i.e., database, array, list of records, object instance, etc.), the attributes will be called by different names, including field, variable, parameter, feature, or property. Data objects with high dimensionality create computational challenges, and data analysts typically reduce the dimensionality of data objects wherever possible.

Dynamic range Every measuring device has a dynamic range beyond which its measurements are without meaning. A bathroom scale may be accurate for weights that vary from 50 to 250 pounds, but you would not expect it to produce a sensible measurement for the weight of a mustard seed or an elephant.

Mean-field approximation A method whereby the average behavior for a population of objects substitutes for the behavior of each and every object in the population. This method greatly simplifies calculations. It is based on the observation that large collections of objects can be characterized by their average behavior. Mean-field approximation has been used with great success to understand the behavior of gases, epidemics, crystals, viruses, and all manner of large population phenomena.

Monte Carlo simulation This technique was introduced in 1946 by John von Neumann, Stan Ulam, and Nick Metropolis [10]. For this technique, the computer generates random numbers and uses the resultant values to simulate repeated trials of a probabilistic event. Monte Carlo simulations can easily simulate various processes (e.g., Markov models and Poisson processes) and can be used to solve a wide range of problems [6,11]. The Achilles heel of the Monte Carlo simulation, when applied to enormous sets of data, is that so-called random number generators may introduce periodic (non-random) repeats over large stretches of data [12]. What you thought was a fine Monte Carlo simulation, based on small data test cases, may produce misleading results for large data sets. The wise Big Data analyst will avail himself of the best possible random number generators, and will test his outputs for randomness. Various tests of randomness are available [13].

Nonparametric statistics Statistical methods that are not based on assumptions about the distribution of the sample population (e.g., not based on the assumption that the sample population fits a Gaussian distribution). Median, mode, and range are examples of common nonparametric statistics.

Outlier The term refers to a data point that lies far outside the value of the other data points in a distribution. The outlier may occur as the result of an error, or it may represent a true value that needs to be explained. When computing a line that is the “best fit” to the data, it is usually prudent to omit the outliers; otherwise, the best fit line may miss most of the data in your distribution. There is no strict rule for identifying outliers, but by convention, statisticians often flag values that lie more than 1.5 times the interquartile range below the lower quartile (small outliers) or more than 1.5 times the interquartile range above the upper quartile (large outliers).
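A minimal sketch of this quartile convention, applied to illustrative data with Python's statistics module (Python 3.8 or later):

from statistics import quantiles

# Flag outliers by the common 1.5-times-interquartile-range convention.
data = [7, 9, 10, 11, 12, 12, 13, 14, 15, 16, 95]   # 95 is the obvious outlier

q1, _, q3 = quantiles(data, n=4)     # lower quartile, median, upper quartile
iqr = q3 - q1
low_cutoff = q1 - 1.5 * iqr
high_cutoff = q3 + 1.5 * iqr

outliers = [value for value in data if value < low_cutoff or value > high_cutoff]
print(outliers)    # [95]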

Pareto's principle Also known as the 80/20 rule, Pareto's principle holds that a small number of items account for the vast majority of observations. For example, a small number of rich people account for the majority of wealth. Just two countries, India and China, account for 37% of the world population. Within most countries, a small number of provinces or geographic areas contain the majority of the population of a country (e.g., East and West coastlines of the United States). A small number of books, compared with the total number of published books, account for the majority of book sales. Likewise, a small number of diseases account for the bulk of human morbidity and mortality. For example, two common types of cancer, basal cell carcinoma of skin and squamous cell carcinoma of skin, account for about 1 million new cases of cancer each year in the United States. This is approximately equal to the number of new cases for all other types of cancer combined. We see a similar phenomenon when we count causes of death. About 2.6 million people die each year in the United States [14]. The top two causes of death account for 1,171,652 deaths (596,339 deaths from heart disease and 575,313 deaths from cancer [15]), or about 45% of all United States deaths. All of the remaining deaths are accounted for by more than 7000 conditions. Sets of data that follow Pareto's principle are often said to follow a Zipf distribution, or a power law distribution. These types of distributions are not tractable by standard statistical descriptors because they do not produce a symmetric bell-shaped curve. Simple measurements such as average and standard deviation have virtually no practical meaning when applied to Zipf distributions. Furthermore, the Gaussian distribution does not apply, and none of the statistical inferences built upon an assumption of a Gaussian distribution will hold on data sets that observe Pareto's principle [16].

Power law A mathematical formula wherein a particular value of some quantity varies as an inverse power of some other quantity [17,18]. The power law applies to many natural phenomena and describes the Zipf distribution or Pareto's principle. The power law is unrelated to the power of a statistical test.

Word lists Word lists are collections, usually in alphabetic order, of the different words that might appear in a corpus of text or a language dictionary. Such lists are easy to create. Here is a short Python script, words.py, that prepares an alphabetized list of the words occurring in a line of text. This script can be easily modified to create word lists from plain-text files.

# Split the line into words, remove the duplicates, and sort alphabetically.
line = "a way a lone a last a loved a long the riverrun past eve and adam's from swerve of shore to bend of bay brings us by a commodius vicus"
linearray = sorted(set(line.split(" ")))
for item in linearray:
    print(item)

Here is the output:

a
adam's
and
bay
bend
brings
by
commodius
eve
from
last
lone
long
loved
of
past
riverrun
shore
swerve
the
to
us
vicus
way

Aside from word lists you create for yourself, there are a wide variety of specialized knowledge domain nomenclatures that are available to the public [19–24]. Linux distributions often bundle a wordlist, under filename “words,” that is useful for parsing and natural language processing applications. A copy of the Linux wordlist is available at: http://www.cs.duke.edu/~ola/ap/linuxwords
Curated lists of terms, either generalized or restricted to a specific knowledge domain, are indispensable for a variety of applications (e.g., spell-checkers, natural language processors, machine translation, coding by term, indexing). Personally, I have spent an inexcusable amount of time creating my own lists, when no equivalent public domain resource was available.

Zipf distribution George Kingsley Zipf (1902–50) was an American linguist who demonstrated that, for most languages, a small number of words account for the majority of occurrences of all the words found in prose. Specifically, he found that the frequency of any word is inversely proportional to its placement in a list of words, ordered by their decreasing frequencies in text. The first word in the frequency list will occur about twice as often as the second word in the list, three times as often as the third word in the list, and so on. Many Big Data collections follow a Zipf distribution (income distribution in a population, energy consumption by country, and so on). Zipf distributions within Big Data cannot be sensibly described by the standard statistical measures that apply to normal distributions. Zipf distributions are instances of Pareto's principle.
Here is a short Python script, zipf.py, that produces a Zipf distribution for a few lines of text.

import re

# Tally the occurrences of each word in a few lines of text and list the
# words in descending order of frequency.
format_list = []
freq = {}
my_string = ("Peter Piper picked a peck of pickled peppers. "
             "A peck of pickled peppers Peter Piper picked. "
             "If Peter Piper picked a peck of pickled peppers, "
             "Where is the peck of pickled peppers that Peter Piper "
             "picked?").lower()
word_list = re.findall(r'[a-z]+', my_string)
for item in word_list:
    freq[item] = freq.get(item, 0) + 1
# Zero-pad the counts so that a plain text sort orders them numerically.
for key, value in freq.items():
    format_list.append(("000000" + str(value))[-6:] + " " + key)
for entry in reversed(sorted(format_list)):
    print(entry)

Here is the output of the zipf.py script:

000004 piper
000004 pickled
000004 picked
000004 peter
000004 peppers
000004 peck
000004 of
000003 a
000001 where
000001 the
000001 that
000001 is
000001 if

References

[1] Berman J.J. Taxonomic guide to infectious diseases: understanding the biologic classes of pathogenic organisms. Cambridge, MA: Academic Press; 2012.

[2] Brannon A.R., Sawyers C.L. N of 1 case reports in the era of whole-genome sequencing. J Clin Invest. 2013;123:4568–4570.

[3] Subbiah I.M., Subbiah V. Exceptional responders: in search of the science behind the miracle cancer cures. Future Oncol. 2015;11:1–4.

[4] Perez-Pena R. New York's tally of heat deaths draws scrutiny. The New York Times; 2006 August 18.

[5] Chiang S. Heat waves, the “other” natural disaster: perspectives on an often ignored epidemic. Global Pulse, American Medical Student Association; 2006.

[6] Berman J.J., Moore G.W. The role of cell death in the growth of preneoplastic lesions: a Monte Carlo simulation model. Cell Prolif. 1992;25:549–557.

[7] Berman J.J. Methods in medical informatics: fundamentals of healthcare programming in Perl, Python, and Ruby. Boca Raton: Chapman and Hall; 2010.

[8] SEER. Surveillance Epidemiology End Results. National Cancer Institute. Available from: http://seer.cancer.gov/

[9] Berman J. Precision medicine, and the reinvention of human disease. Cambridge, MA: Academic Press; 2018.

[10] Cipra B.A. The Best of the 20th century: editors name top 10 algorithms. SIAM News. May 2000;33(4).

[11] Berman J.J., Moore G.W. Spontaneous regression of residual tumor burden: prediction by Monte Carlo simulation. Anal Cell Pathol. 1992;4:359–368.

[12] Sainani K. Error: What biomedical computing can learn from its mistakes. Biomed Comput Rev. Fall 2011;12–19.

[13] Marsaglia G., Tsang W.W. Some difficult-to-pass tests of randomness. J Stat Softw. 2002;7:1–8.

[14] The World Factbook. Washington, DC: Central Intelligence Agency; 2009.

[15] Hoyert D.L., Heron M.P., Murphy S.L., Kung H.-C. Final data for 2003. Natl Vital Stat Rep. 2006;54(13) April 19.

[16] Berman J.J. Rare diseases and orphan drugs: keys to understanding and treating common diseases. Cambridge, MD: Academic Press; 2014.

[17] Newman M.E.J. Power laws, Pareto distributions and Zipf's law. Contemp Phys. 2005;46:323–351.

[18] Clauset A., Shalizi C.R., Newman M.E.J. Power-law distributions in empirical data. SIAM Rev. 2009;51:661–703.

[19] Medical Subject Headings. U.S. National Library of Medicine. Available at: https://www.nlm.nih.gov/mesh/filelist.html [viewed on July 29, 2015].

[20] Berman J.J. A tool for sharing annotated research data: the “Category 0” UMLS (unified medical language system) vocabularies. BMC Med Inform Decis Mak. 2003;3:6.

[21] Berman J.J. Tumor taxonomy for the developmental lineage classification of neoplasms. BMC Cancer. 2004;4:88.

[22] Hayes C.F., O'Connor J.C. English-Esperanto dictionary. London: Review of Reviews Office; 1906. Available at: http://www.gutenberg.org/ebooks/16967 [viewed July 29, 2015].

[23] Sioutos N., de Coronado S., Haber M.W., Hartel F.W., Shaiu W.L., Wright L.W. NCI Thesaurus: a semantic model integrating cancer-related clinical and molecular information. J Biomed Inform. 2007;40:30–43.

[24] NCI Thesaurus. National Cancer Institute, U.S. National Institutes of Health. Bethesda, MD. Available at: ftp://ftp1.nci.nih.gov/pub/cacore/EVS/NCI_Thesaurus/ [viewed on July 29, 2015].
