10

Measurement

Abstract

Engineers have a saying: “You can only control what you can measure.” When a scientist or engineer performs a small-scale experiment in her own laboratory, she can calibrate all of her instruments, make her own measurements, check and cross-check her measurements with different instruments, compare her results with the results from other laboratories, and repeat her measurements until she is satisfied that the results are accurate and valid. When a scientist or engineer draws information from a Big Data resource, none of her customary safeguards apply. In almost all cases the experiment is too big and too complex to repeat. This chapter discusses a variety of techniques that Big Data analysts use to achieve some level of control over their data, including normalization, data transformations, and dimensional reduction.

Keywords

Accuracy; Precision; Normalization; Verification; Validity; Estimation; Variable reduction

Section 10.1. Accuracy and Precision

Get your facts first, then you can distort them as you please.

Mark Twain

Precision is the degree of exactitude of a measurement and is verified by its reproducibility (i.e., whether repeated measurements of the same quantity produce the same result). Accuracy measures how close your data comes to being correct. Data can be accurate but imprecise or precise but inaccurate. If you have a 10-pound object, and you report its weight as 7.2376 pounds, every time you weigh the object, then your precision is remarkable, but your accuracy is dismal.

What are the practical limits of precision measurements? Let us stretch our imaginations, for a moment, and pretend that we have just found an artifact left by an alien race that excelled in the science of measurement. As a sort of time capsule for the universe, their top scientists decided to collect the history of their civilization, and encoded it in binary. Their story looked something like “001011011101000...” extended to about 5 million places. Rather than print the sequence out on a piece of paper or a computer disc, these aliens simply converted the sequence to a decimal length (i.e., 0.001011011101000...) and marked the length on a bar composed of a substance that would never change its size. To decode the bar and recover the history of the alien race, one would simply need a highly precise measuring instrument that would yield the original binary sequence. Computational linguists could translate the sequence to text, and the recorded history of the alien race would be revealed! Of course, the whole concept is built on an impossible premise. Nothing can be measured accurately to 5 million places. We live in a universe with practical limits (i.e., the sizes of atomic particles, the speed of light, the Heisenberg uncertainty principle, the maximum mass of a star, the second law of thermodynamics, the unpredictability of highly complex systems). There are many things that we simply cannot do, no matter how hard we try. The most precise measurement achieved by modern science has been in the realm of atomic clocks, where accuracy of 18 decimal places has been claimed [1]. Nonetheless, many scientific disasters are caused by our ignorance of our own limitations, and our persistent gullibility, leading us to believe that precision claimed is precision obtained.

It is quite common for scientists to pursue precision when they should be seeking accuracy. For an example, we need look no further than the data-intensive field of “Precision Medicine” [2]. One of the goals of precision medicine is to determine the specific genetic alterations that account for human disease. In many cases, this means finding a change in a single nucleotide (from among the 3 billion nucleotides in the DNA sequence that accounts for the human genome) responsible for the development of a disease. Precision Medicine has had tremendous success for a variety of rare diseases and for rare subtypes of common diseases, but has had less luck with common diseases such as type 2 diabetes and adult onset hypertension. Why is this? The diagnoses of diabetes and hypertension are based on cut-off measurements. Above a certain glucose level in the blood, the patient is said to have diabetes. Above a certain pressure, the patient is said to have hypertension. It is not much different from being overweight (i.e., above a certain weight) or tall (i.e., above a certain height). Theory, strengthened by empiric observations, informs us that quantitative traits have multiple genetic and environmental influences, a phenomenon recognized since the early studies of RA Fisher, in 1919 [3–5]. Hence, we would expect that hypertension and diabetes would not be amenable to precise diagnosis [2].

At this point, we can determine, with credible accuracy, whether a person is diabetic, hypertensive, obese, tall, able to hold his breath for a long time, or able to run 100 m in record time. We cannot determine, with any precision, the specific genes that are necessary for the development of any of these acquired human traits. Should we be devoting time and money to attaining higher and higher precision in the genetic diagnosis of common, polygenic diseases, if increasing precision brings us no closer to a practical cure for these diseases? Put plainly, shouldn't we be opting for “Accurate Medicine” rather than “Precision Medicine”?

The conflict between seeking accuracy and seeking precision is a common dilemma in the universe of Big Data, wherein access to highly precise measurements is ridiculously abundant.

  •   Steganography: using imprecision to your advantage

You look at them every day, the ones that others create, and the ones that you create on your own to share with your friends or with the world. They are part of your life, and you would feel a deep sense of loss if you lost them. I am referring to high resolution digital images. We love them, but we give them more credit than they deserve. When you download a 16-megapixel image of your sister's lasagna, you can be certain that most of the pixel information is padded with so-called empty resolution: pixel precision that is probably inaccurate and that certainly exceeds the eye's ability to meaningfully resolve. Most images in the megabyte size range can safely be reduced to the kilobyte size range, without loss of visual information. Steganography is an encryption technique that takes advantage of the empty precision in pixel data by inserting secret text messages into otherwise useless bits of pseudodata.

Steganography is one of several general techniques in which a message is hidden within an object, such as a book or a painting. The forerunners of modern steganography have been around for centuries and were described as early as AD 1500 by Trithemius [6]. Watermarking is closely related to steganography. Digital watermarking is a way of secretly insinuating the name of the owner or creator of a digital object into the object, as a mechanism of rights management [7]. [Glossary Steghide]

Section 10.2. Data Range

Many an object is not seen, though it falls within the range of our visual ray, because it does not come within the range of our intellectual ray, i.e., we are not looking for it. So, in the largest sense, we find only the world we look for.

Henry David Thoreau

Always determine the highest and the lowest observed values in your data collection. These two numbers are often the most important numbers in any set of data; even more important than determining the average or the standard deviation. There is always a compelling reason, relating to the measurement of the data or to the intrinsic properties of the data set, to explain the high and the low of data.

Here is an example. You are looking at human subject data that includes weights. The minimum weight is a pound (the round-off weight of a viable but premature newborn infant). You find that the maximum weight in the data set is 300 pounds, exactly. There are many individuals in the data set who have a weight of 300 pounds, but no individuals with a weight exceeding 300 pounds. You also find that the number of individuals weighing 300 pounds is much greater than the number of individuals weighing 290 pounds or 280 pounds. What does this tell you? Obviously, the people included in the data set have been weighed on a scale that tops off at 300 pounds. Most of the people whose weight was recorded as 300 will have a false weight measurement. Had we not looked for the maximum value in the data set, we would have assumed, incorrectly, that the weights were always accurate.

It would be useful to get some idea of how weights are distributed in the population exceeding 300 pounds. One way of estimating the error is to look at the number of people weighing 295 pounds, 290 pounds, 285 pounds, etc. By observing the trend, and knowing the total number of individuals whose weight is 300 pounds or higher, you can estimate the number of people falling into weight categories exceeding 300 pounds.
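To make the extrapolation concrete, here is a minimal sketch in Python, using invented bin counts and a simple average decay ratio; a real analysis would fit a distribution model appropriate to the data.

# A minimal sketch of the extrapolation described above. The bin counts
# below the 300-pound cap, and the number of records pinned at 300, are
# invented; a real analysis would fit a proper distribution model.

bins = {280: 140, 285: 110, 290: 85, 295: 65}   # weight bin: count (invented)
records_pinned_at_300 = 260                     # invented count of capped records

keys = sorted(bins)
ratios = [bins[b] / bins[a] for a, b in zip(keys, keys[1:])]
decay = sum(ratios) / len(ratios)               # average bin-to-bin decay (~0.77)

# extrapolate the observed trend into the censored region at and above 300 pounds
estimate = {}
count = bins[keys[-1]] * decay
weight = 300
while count >= 1:
    estimate[weight] = round(count)
    count *= decay
    weight += 5

print("estimated counts above the cap:", estimate)
print("extrapolated total above cap:", sum(estimate.values()),
      "vs.", records_pinned_at_300, "records recorded as exactly 300")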

Here is another example where knowing the maxima for a data set measurement is useful. You are looking at a collection of data on meteorites. The measurements include weights. You notice that the largest meteorite in the large collection weighs 66 tons (equivalent to about 60,000 kg), and has a diameter of about 3 m. Small meteorites are more numerous than large meteorites, but one or more meteorites account for almost every weight category up to 66 tons. There are no meteorites weighing more than 66 tons. Why do meteorites have a maximum size of about 66 tons?

A little checking tells you that meteors in space can come in just about any size, from a speck of dust to a moon-sized rock. Collisions with earth have involved meteorites much larger than 3 m. You check the astronomical records and you find that the meteor that may have caused the extinction of large dinosaurs about 65 million years ago was estimated at 6–10 km (at least 2000 times the diameter of the largest meteorite found on earth).

There is a very simple reason why the largest meteorite found on earth weighs about 66 tons, while the largest meteorites to impact the earth are known to be thousands of times heavier. When meteorites exceed 66 tons, the impact energy can exceed the energy produced by an atom bomb blast. Meteorites larger than 66 tons leave an impact crater, but the meteor itself disintegrates on impact.

As it turns out, much is known about meteorite impacts. The kinetic energy of the impact is determined by the mass of the meteor and the square of the velocity. The minimum velocity of a meteor at impact is about 11 km/s (equivalent to the minimum escape velocity for sending an object from earth into space). The fastest impacts occur at about 70 km/s. From this data, the energy released by meteors, on impact with the earth, can be easily calculated.
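As a simple illustration of that calculation, the following sketch computes the kinetic energy of a meteorite of roughly the maximum recovered mass, at the minimum and maximum impact velocities given above; the TNT-equivalent conversion factor is a standard constant included only for scale.

# A minimal sketch of the impact-energy calculation: KE = 1/2 * m * v^2.
# The ~60,000 kg mass and the velocity range come from the text; the TNT
# conversion (4.184e9 joules per ton of TNT) is a standard constant.

mass_kg = 60_000                      # roughly the 66-ton meteorite
for v_km_s in (11, 70):               # minimum and maximum impact velocities
    v = v_km_s * 1000                 # convert km/s to m/s
    ke_joules = 0.5 * mass_kg * v ** 2
    tons_tnt = ke_joules / 4.184e9
    print(f"{v_km_s:>3} km/s -> {ke_joules:.2e} J (~{tons_tnt:,.0f} tons of TNT)")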

By observing the maximum weight of meteors found on earth we learn a great deal about meteoric impacts. When we look at the distribution of weights, we can see that small meteorites are more numerous than larger meteorites. If we develop a simple formula that relates the size of a meteorite with its frequency of occurrence, we can predict the likelihood of the arrival of a meteorite on earth, for every weight of meteorite, including those weighing more than 66 tons, and for any interval of time.

Here is another profound example of the value of knowing the maximum value in a data distribution. If you look at the distance from the earth to various cosmic objects (e.g., stars, black holes, nebulae) you will quickly find that there is a limit for the distance of objects from earth. Of the many thousands of cataloged stars and galaxies, none of them have a distance that is greater than 13 billion light years. Why? If astronomers could see a star that is 15 billion light years from earth, the light that is received here on earth must have traveled 15 billion light years to reach us. The time required for light to travel 15 billion light years is 15 billion years, by definition. The universe was born in a big bang about 14 billion years ago. This would imply that the light from a star located 15 billion light years from earth must have begun its journey about a billion years before the universe came into existence. Impossible!

By looking at the distribution of distances of observed stars and noting that the distances never exceed about 13 billion light years, we can infer that the universe must be at least 13 billion years old. You can also infer that the universe does not have an infinite age and size; otherwise, we would see stars at a greater distance than 13 billion light years. If you assume that stars popped into the universe not long after its creation, then you can infer that the universe has an age of about 13 or 14 billion years. All of these deductions, confirmed independently by theoreticians and cosmologists, were made without statistical analysis, simply by noting the maximum number in a distribution of numbers.

Section 10.3. Counting

On two occasions I have been asked, ‘If you put into the machine wrong figures, will the right answers come out?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.

Charles Babbage

For the bulk of Big Data projects, analysis begins with counting. If you cannot count the data held in a Big Data resource, then you will derive little benefit from the resource. Systemic counting errors account for irreproducible or misleading results. Surprisingly, there is very little written about this issue in the Big Data literature. Presumably, the subject is considered too trivial for serious study. To rectify this oversight, this section describes, in some depth, the surprising intellectual challenges of Big Data counting.

Most people would agree that the simple act of counting data is something that can be done accurately and reproducibly, from laboratory to laboratory. Actually, this is not the case. Counting is fraught with the kinds of errors previously described in this chapter, plus many other hidden pitfalls. Consider the problem of counting words in a paragraph. It seems straightforward, until you start asking yourself how you might deal with hyphenated words. “Deidentified” is certainly one word. “Under-represented” is probably one word, but sometimes the hyphen is replaced by a space, and then it is certainly two words. How about the term “military-industrial,” which seems as though it should be two words? When a hyphen occurs at the end of a line, should we force a concatenation between the syllables at the end of one line and the start of the next?

Slashes are a tougher nut to crack than hyphens. How should we count terms that combine two related words by a slash, such as “medical/pharmaceutical”; one word or two words? If we believe that the slash is a word separator (i.e., slashes mark the end of one word and the beginning of another), then we would need to parse Web addresses into individual words. For example:

www.science.com/stuff/neat_stuff/super_neat_stuff/balloons.htm

The Web address could be broken into a string of words, if the “.” and “_” characters could be considered valid word separators. In that case, the single Web address would consist of 11 words: www, science, com, stuff, neat, stuff, super, neat, stuff, balloons, htm. If you were only counting words that match entries in a standard dictionary, then the split Web address would contain 8 words: science, stuff, neat, stuff, super, neat, stuff, balloons. If we defined a word as a string bounded by a space or a part-of-sentence separator (e.g., period, comma, colon, semicolon, question mark, exclamation mark, end of line character), then the unsplit Web address would count as 1 word. If the word must match a dictionary term, then the unsplit Web address would count as zero words. So, which is it: 11 words, 8 words, 1 word, or 0 words?
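The following minimal sketch makes the point explicit, counting the same Web address under each of the rules just described; the tiny dictionary is invented for illustration.

# A minimal sketch showing how the word count of one string depends on the
# counting rule. The rules and the toy "dictionary" are assumptions chosen
# only to mirror the discussion above.

import re

text = "www.science.com/stuff/neat_stuff/super_neat_stuff/balloons.htm"
dictionary = {"science", "stuff", "neat", "super", "balloons"}   # toy dictionary

# Rule 1: treat '.', '_', and '/' (the slash already being a separator) as word separators
tokens = [t for t in re.split(r"[./_]", text) if t]
print("split on . _ / :", len(tokens), "words")                       # 11

# Rule 2: as above, but count only tokens found in the dictionary
print("dictionary terms only:", sum(t in dictionary for t in tokens))  # 8

# Rule 3: a word is any run of characters bounded by whitespace or sentence
# punctuation; the unsplit address is then a single word
print("whitespace-bounded:", len(text.split()))                        # 1

# Rule 4: whitespace-bounded AND present in the dictionary
print("whitespace-bounded dictionary terms:",
      sum(t in dictionary for t in text.split()))                      # 0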

This is just the start of the problem. How shall we deal with abbreviations [8,9]? Should all abbreviations be counted as one word, or as the sum of words represented by the abbreviation? Is “U.S.” one word or two words? Suppose, before counting words, the text is pre-processed to expand all the abbreviated terms (i.e., every instance of “U.S.” becomes an instance of “United States,” and “UCLA” would count as 4 words). This would yield an artificial increase in the number of words in the document. How would a word counter deal with abbreviations that look like words, such as “mumps,” which could be the name of a viral disease of childhood, or an abbreviation for a computer language used by medical informaticians, expanded as “Massachusetts General Hospital Utility Multi-Programming System”?

How would we deal with numeric sequences appearing in the text? Should each numeric sequence be counted as a word? If not, how do we handle Roman numbers? Should “IV” be counted as a word, because it is composed of alphabetic characters, or should it be omitted as a word, because it is equivalent to the numeric value, “4”? When we encounter “IV,” how can we be certain that we are parsing a Roman numeral? Could “IV,” within the context of our document, represent the abbreviation for “intravenous”?

It is obvious that the number of words in a document will depend on the particular method used to count the words. If we use a commercial word counting application, how can we know which word counting rules are applied? In the field of informatics, the total number of words is an important feature of a document. The total word count often appears in the denominator of common statistical measurements. Counting words seems to be a highly specialized task. My favorite estimator of the number of words in any text file is simply the size of the file divided by 6.5, the average number of characters in a word plus one separator character.
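A minimal sketch of that estimator, assuming a hypothetical plain-text file, looks like this:

# A minimal sketch of the file-size estimator mentioned above: the number of
# words in a plain-text file is roughly its size in bytes divided by 6.5
# (average word length plus one separator). The file name is hypothetical.

import os

def estimated_word_count(filename):
    return int(os.path.getsize(filename) / 6.5)

# print(estimated_word_count("manuscript.txt"))   # hypothetical file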

The point here is that a simple counting task, such as word counting, can easily become complex. A complex counting task, involving subjective assessments of observations, seldom yields accurate results. When the criteria for counting change over time, then results that were merely inaccurate may devolve even further, into irreproducibility. An example of a counting task that is both complex and subjective is the counting of hits and errors in baseball. The rules for counting errors are subjective and based on the scorer's judgment of the intended purpose of the hit (e.g., sacrifice fly) and the expected number of bases reached in the absence of the error. The determination of an error sometimes depends on the outcome of the play after the presumptive error has occurred (i.e., on events that are not controlled or influenced by the error). Counting is also complex, with rules covering specific instances of play. For example, passed balls and wild pitches are not scored as errors; they are assigned to another category of play. Plays involving catchers are exempt from certain rules for errors that apply to fielders. It would be difficult to find an example of a counting task that is more complex than counting baseball errors.

Sometimes counting criteria inadvertently exclude categories of items that should be counted. The diagnoses that appear on death certificates are chosen from a list of causes of death included in the International Classification of Diseases (ICD). Diagnoses collected from all of the death certificates issued in the United States are aggregated by the CDC (Centers for Disease Control and Prevention) and published in the National Vital Statistics Report [10]. As it happens, “medical error” is not included as a cause of death in the ICD; hence, United States casualties of medical errors are not counted as such in the official records. Official tally notwithstanding, it is estimated that about one of every six deaths in the United States results from medical error [10].

Big Data is particularly prone to counting errors, as data is typically collected from multiple sources, each with its own method for annotating data. In addition, Big Data may extend forwards and backwards in time; constantly adding new data and merging with legacy data sets. The criteria for counting data may change over time, producing misleading results. Here are a few examples of counts that changed radically when the rules for counting changed. [Glossary Meta-analysis]

  •   Beachy Head is a cliff in England with a straight vertical drop and a beautiful sea-view. It is a favorite jumping off point for suicides. The suicide rate at Beachy Head dropped as sharply as the cliff when the medical examiner made a small policy change. From a certain moment onward, bodies found at the cliff bottom would be counted as suicides only if their post-mortem toxicology screen was negative for alcohol. Intoxicated subjects were pronounced dead by virtue of accident (i.e., not suicide) [11].
  •   Sudden Infant Death Syndrome (SIDS, also known as crib death) was formerly considered to be a disease of unknown etiology that caused infants to stop breathing, and die, often during sleep. Today, most SIDS deaths are presumed to be due to unintentional suffocation from bedclothes, often in an overheated environment, and aggravated by a prone (i.e., face down) sleeping position. Consequently, some infant deaths that may have been diagnosed as SIDS in past decades are now diagnosed as unintentional suffocations. This diagnostic switch has resulted in a trend characterized by increasing numbers of infant suffocations and a decreasing number of SIDS cases [12]. This trend is, in part, artifactual, arising from changes in reporting criteria.
  •   In the year 2000, nearly a half-century after the Korean War, the United States Department of State downsized its long-standing count of United States military war deaths; to 36,616 down from an earlier figure of about 54,000. The drop of 17,000 deaths resulted from the exclusion of United States military deaths that occurred during the Korean War, in countries outside Korea [13]. The old numbers reflected deaths during the Korean War; the newer number reflects deaths occurring due to the Korean War. Aside from historical interest, the alteration indicates how collected counts may change retroactively.
  •   Human life is flanked by two events, birth and death; both events are commemorated with a certificate. Death certificates are the single most important gauge of public health. They tell us the ages at which deaths occur, and the causes of those deaths. With this information, we can determine the most common causes of death in the population, changes in the frequency of occurrences of the different causes of death, and the effect of interventions intended to increase overall life expectancy and reduce deaths caused by particular causes. Death certificates are collected from greater than 99% of individuals who die in the United States [14]. This data, vital to the health of every nation, is highly error prone, and the problems encountered in the U.S. seem to apply everywhere [15,16]. A survey of 49 national and international health atlases has shown that there is virtually no consistency in the way that death data are prepared [17]. Within the United States there is little consistency among states in the manner in which the causes of death are listed [18]. Death data is Big Data, as it is complex (i.e., containing detailed, non-standard information within death certificates), comes from many sources (i.e., every municipality), arrives continually (i.e., deaths occur every minute), with many records (i.e., everyone dies eventually). The rules for annotating the data change regularly (i.e., new versions of the International Classification of Diseases contain different new terms and codes). The consistency of the data decreases as the Big Data grows in size and in time. Our basic understanding of how humans die, and our ability to measure the effect of potentially life-saving public health interventions, is jeopardized by our inability to count the causes of death.
  •   Dealing with Negations

A large portion of Big Data is categorical, not quantitative. Whenever counting categorical features, you need to know whether a feature is present or absent. Unstructured text has no specific format for negative assertions (i.e., statements indicating that a feature is absent or that an assertion is false). Negations in unstructured data come into play during parsing routines wherein various features need to be counted.

If a computer program is seeking to collect, count, or annotate occurrences of a particular diagnosis included in a pathology report, or a particular type of “buy” order on a financial transaction, or the mention of a particular species of frog in a field report, there should be some way to distinguish a positive occurrence of the term (e.g., “Amalgamated Widget is traded”) from a negation statement (e.g., “Amalgamated Widget is not traded”). Otherwise, counts of the positive occurrences of trades would include cases that are demonstrably negative. Informaticians have developed a variety of techniques that deal with negations occurring in textual data [19].

In general, negation techniques rely on finding a negation term (e.g., not present, not found, not seen) in proximity with an annotation term (e.g., a term that matches some term in a standard nomenclature, or a term that has been cataloged or otherwise indexed for the data set, onto which a markup tag is applied). A negated term would not be collected or counted as a positive occurrence of the annotation.

Examples of negation terms included in sentences are shown here:

  •   He cannot find evidence for the presence of a black hole.
  •   We cannot find support for the claim.
  •   A viral infection is not present.
  •   No presence of Waldo is noted.
  •   Bigfoot is not in evidence in this footprint analysis.

It is easy to exclude terms that are accompanied by an obvious negation term. When terms are negated or otherwise nullified by terms that are not consistently characterized by a negative inference, the problem becomes complex.
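A minimal sketch of the proximity approach is shown below; it is not any published negation algorithm, and the cue list, target terms, and window size are arbitrary choices made for illustration.

# A minimal sketch of proximity-based negation detection, in the spirit of
# the approach described above. The negation cues, target terms, and window
# size are invented for illustration.

import re

NEGATION_CUES = {"no", "not", "cannot", "without", "absent"}
TARGET_TERMS = {"fungus", "black hole", "waldo", "viral infection"}
WINDOW = 5   # number of nearby words searched for a negation cue

def is_negated(sentence, term):
    words = re.findall(r"[a-z]+", sentence.lower())
    if term not in " ".join(words):
        return None                              # the term is not mentioned
    idx = words.index(term.split()[0])           # first word of the matched term
    nearby = words[max(0, idx - WINDOW): idx + len(term.split()) + WINDOW]
    return any(w in NEGATION_CUES for w in nearby)

sentences = ["A viral infection is not present.",
             "No presence of Waldo is noted.",
             "Waldo was seen near the bookstore."]
for s in sentences:
    for t in TARGET_TERMS:
        verdict = is_negated(s, t)
        if verdict is not None:
            print(f"{t!r} in {s!r}: negated = {verdict}")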

Here is a short list of implied negations, each lacking an unambiguous negation term, followed by the re-written sentence that contains an unambiguous negation term (i.e., “not”).

  •   “Had this been a tin control processor, the satellite would have failed.”—The satellite did not fail.
  •   “There is a complete absence of fungus.”—Fungus is not present.
  •   “We can rule out the presence of invasive cancer.”—Invasive cancer is not present.
  •   “Hurricanes failed to show.”—Hurricanes did not show.
  •   “The witness fails to make an appearance.”—The witness did not appear.
  •   “The diagnosis is incompatible with psoriasis.”—Psoriasis is not present.
  •   “Drenching rain is inconsistent with drought.”—Drought does not occur with drenching rain.
  •   “There is insufficient evidence for a guilty verdict.”—Not guilty.
  •   “Accidental death is excluded.”—Not an accidental death.
  •   “A drug overdose is ruled out.”—Not a drug overdose.
  •   “Neither fish nor fowl.”—Not fish. Not fowl.
  •   “There is zero evidence for aliens in Hoboken.”—Aliens have not been found in Hoboken.

In addition to lacking outright negations, sentences may contain purposefully ambiguous terms, intended to prevent readers from drawing any conclusion, positive or negative. For example, “The report is inconclusive for malignancy.” How would this report be counted? Was a malignancy present, or was it not?

The point here is that, like everything else in the field of Big Data, the individuals who prepare and use resources must have a deep understanding of the contained data. They must also have a realistic understanding of the kinds of questions that can be sensibly asked and answered with the available data. They must have an understanding of the limits of their own precision.

Section 10.4. Normalizing and Transforming Your Data

Errors have occurred.

We won't tell you where or why.

Lazy programmers.

Computer-inspired haiku by Charlie Gibbs

When extracting data from multiple sources, recorded at different times, and collected for different purposes, the data values may not be directly comparable. The Big Data analyst must contrive a method to normalize or harmonize the data values.

  •   Adjusting for population differences.

Epidemiologists are constantly reviewing large data sets on large populations (e.g., local, national, and global data). If epidemiologists did not normalize their data, they would be in a constant state of panic. Suppose you are following long-term data on the incidence of a rare childhood disease in a state population. You notice that the number of people with the disease has doubled in the past decade. You are about to call the New York Times with the shocking news when one of your colleagues taps you on the shoulder and explains that the population of the state has doubled in the same time period. The incidence, described as cases per 100,000 population, has remained unchanged. You calm yourself down and continue your analysis, only to find that the reported cases of the disease have doubled in a different state that has had no corresponding increase in state population. You are about to alert the White House with the news when your colleague taps you on the shoulder and explains that the overall population of the state has remained unchanged, but the population of children in the state has doubled. The incidence, expressed as cases occurring in the affected population, has remained unchanged.

An age-adjusted rate is the rate of a disease within an age category, weighted against the proportion of persons in the age groups of a standard population. When we age-adjust rates, we cancel out the changes in the rates of disease that result from differences in the proportion of people in different age groups.
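Here is a minimal sketch of direct age adjustment; the age strata, case counts, populations, and standard-population weights are all invented for illustration.

# A minimal sketch of direct age adjustment: weight each age-specific rate
# by the proportion of that age group in a standard population. All numbers
# below are invented.

observed = {                         # age group: (cases, population at risk)
    "0-19":  (10, 200_000),
    "20-64": (90, 500_000),
    "65+":   (400, 100_000),
}
standard_weights = {"0-19": 0.25, "20-64": 0.55, "65+": 0.20}   # standard population proportions

total_cases = sum(cases for cases, _ in observed.values())
total_pop = sum(pop for _, pop in observed.values())
crude_rate = total_cases / total_pop * 100_000

adjusted_rate = 100_000 * sum(
    (cases / pop) * standard_weights[age] for age, (cases, pop) in observed.items()
)

print(f"crude rate:        {crude_rate:.1f} cases per 100,000")
print(f"age-adjusted rate: {adjusted_rate:.1f} cases per 100,000")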

Some of the most notorious observations on non-adjusted data come from the field of baseball. In 1930 Bill Terry maintained a batting average of 0.401, the best batting average in the National League. In 1968 Carl Yastrzemski led his league with a batting average of 0.301. You would think that the facts prove that Terry's lead over his fellow players was greater than Yastrzemski's. Actually, both had averages that were 27% higher than the average of their fellow ballplayers of the year. Normalized against all the players for the year in which the data was collected, Terry and Yastrzemski tied.

  •   Rendering data values dimensionless.

Histograms express data distributions by binning data into groups and displaying the bins in a bar graph. A histogram of an image may have bins (bars) whose heights consist of the number of pixels in a black and white image that fall within a certain gray-scale range (Fig. 10.1).

Fig. 10.1
Fig. 10.1 An image of the author, left, converted into a histogram representing the number of pixels that have a gray-scale value of 0, 1, 2, 3 and so on, up to the top gray-scale value of 255. Each gray-scale value is a bin.

When comparing images of different sizes, the total number of pixels in the images is different, making it impossible to usefully compare the heights of bins. In this case, the number of pixels in each bin can be divided by the total number of pixels in the image, to produce a number that corresponds to the fraction of the total image pixels that are found in the bin. The normalized value (now represented as a fraction) can be compared between two images. Notice that by representing the bin size as a fraction, we have stripped the dimension from the data (i.e., a number expressed as pixels), and rendered a dimensionless data item (i.e., a purely numeric fraction).
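A minimal sketch of the normalization, using two invented gray-scale “images” of different sizes:

# A minimal sketch of rendering two histograms comparable by converting raw
# pixel counts into fractions of total pixels. The two "images" are invented
# gray-scale arrays of different sizes.

from collections import Counter

small_image = [0, 0, 1, 1, 1, 2, 3, 3]                   # 8 pixels (invented)
large_image = [0] * 40 + [1] * 60 + [2] * 20 + [3] * 40  # 160 pixels (invented)

def normalized_histogram(pixels):
    counts = Counter(pixels)
    total = len(pixels)
    return {level: counts[level] / total for level in sorted(counts)}

# The raw bin heights (counts) cannot be compared across the two images, but
# the normalized, dimensionless fractions can; here the two come out identical.
print(normalized_histogram(small_image))
print(normalized_histogram(large_image))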

  •   Converting one data type to another, more useful, data type.

A zip code is an example of data formed by numeric digits that lack numeric properties. You cannot add two zip codes and expect to get any useful information from the process. However, every zip code has been mapped to a specific latitude and longitude at the center of the zip code region, and these values can be used as spherical coordinates from which distances between locations can be computed. It is often useful to assign geographic coordinates to every zip code entry in a database.
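As a minimal sketch, assuming a small, hypothetical zip-code-to-centroid lookup table, the distance between two zip codes can be computed with the haversine formula:

# A minimal sketch of converting zip codes to coordinates and then to a
# distance. In practice the zip-to-centroid table would come from a lookup
# file; the two entries and their coordinates below are hypothetical.

from math import radians, sin, cos, asin, sqrt

zip_centroids = {                      # zip code: (latitude, longitude) - invented
    "20892": (39.00, -77.10),
    "21201": (39.29, -76.62),
}

def haversine_km(p1, p2):
    lat1, lon1, lat2, lon2 = map(radians, (*p1, *p2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))    # mean earth radius ~6371 km

d = haversine_km(zip_centroids["20892"], zip_centroids["21201"])
print(f"approximate distance between the two zip-code centroids: {d:.1f} km")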

  •   Converting to a (0, 1) interval.

Any set of data values can be converted into an interval between 0 and 1, wherein the original data values maintain their relative positions in the new interval. There are several simple ways to achieve the result. The most straightforward is to compute the range of the data by subtracting the smallest data value in your data set from the largest data value. To determine the location of any data value in the 0, 1 range, simply subtract from it the smallest value in the data set and then divide the result by the range of the data (Fig. 10.2). This tells you where your value is located, in a 0, 1 interval, as a fraction of the range of the data.

Fig. 10.2
Fig. 10.2 A formula that will convert any value to a fraction between 0 and 1 by dividing the distance of the value from the smallest value of the attribute in the population by the full data range of the value in the population [20].

Another popular method for converting data sets to a standard interval is to subtract the mean from any data value and divide by the standard deviation. This gives you the position of the data value expressed as its deviation from the mean as a fraction of the standard deviation. The resulting value is called the z-score.
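Both conversions are easy to sketch; the sample values below are invented.

# A minimal sketch of the two normalizations just described: mapping values
# onto the (0,1) interval by range, and computing z-scores.

from statistics import mean, pstdev

values = [12.0, 15.5, 9.0, 22.0, 18.5]                     # invented measurements

lo, hi = min(values), max(values)
range_scaled = [(v - lo) / (hi - lo) for v in values]      # each value now in [0, 1]

mu, sigma = mean(values), pstdev(values)
z_scores = [(v - mu) / sigma for v in values]              # deviations from the mean

print("range-scaled:", [round(v, 3) for v in range_scaled])
print("z-scores:    ", [round(v, 3) for v in z_scores])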

When comparing different data sets, it is frequently important to normalize all of the data points to a common interval. In the case of multi-dimensional data it is usually necessary to normalize the data in every dimension using some sensible scaling computation. This may include the methods just described (i.e., dividing by range or by standard deviation, or by substituting data with a dimensionless transformed value, such as a correlation measure).

  •   Weighting.

Weighting is a method whereby the influence of a value is moderated by some factor intended to yield an improved value. In general, when a data value is replaced by the sum of weighted factors, the weights are chosen to add to 1. For example, if you are writing your own smoothing function, in which each value in a data set is replaced by a value computed by summing contributions from itself and its immediate neighbor on the left and the right, you might multiply each number by one-third, so that the final number is scaled to a magnitude similar to your original number. Alternately, you might multiply the number to the left and to the right by one-quarter and the original by one-half, to provide a summed number weighted to favor the original number.
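A minimal sketch of that weighted smoothing, applied to an invented series (the endpoints are simply left unsmoothed):

# A minimal sketch of the weighted smoothing described above: each interior
# value is replaced by one-quarter of its left neighbor, one-half of itself,
# and one-quarter of its right neighbor. The series is invented.

def smooth(series, weights=(0.25, 0.5, 0.25)):
    wl, wc, wr = weights                       # weights sum to 1
    out = [series[0]]                          # leave the endpoints unsmoothed
    for i in range(1, len(series) - 1):
        out.append(wl * series[i - 1] + wc * series[i] + wr * series[i + 1])
    out.append(series[-1])
    return out

data = [3, 7, 4, 9, 12, 8, 5]                  # invented noisy series
print(smooth(data))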

It is a shame that Big Data never comes with instructions. Data analysts are constantly required to choose a normalization method, and the choice will always depend on their intended use of the data. Here is an example. Three sources of data provide records on children that include an age attribute. Each source measures age in the same dimension: years. You would think that because the ages are all recorded in years, not months or decades, you can omit a normalization step. When you study the data, you notice that one source contains children up to the year 14, while another is cut off at age 12, and another stops at age 16. Suddenly, you are left with a difficult problem. Can you ignore the differences in the cut-off age in the three data sets? Should you truncate all of the data above age 12? Should you use all of the data, but weight the data differently for the different sources? Should you divide by the available ranges for the data? Should you compute z-scores? It all depends on what you are trying to learn from the data.

Section 10.5. Reducing Your Data

There is something fascinating about science. One gets such a wholesale return of conjecture out of such a trifling investment of fact.

Mark Twain

At first glance, it seems obvious that gravitational attraction is a Big Data problem. We know that gravitation between any two bodies is proportional to the product of their masses and inversely proportional to the square of the distance between them. If we wanted to predict the gravitational forces on an object, we would need to know the position and mass of every body in the universe. With this data, we would compute a force vector, from which we could determine the net gravitational influence of the universe upon the mass. Of course, this is absurd. If we needed all that data for our computation, physicists would be forever computing the orbit of the earth around the sun. We are lucky to live in a universe wherein gravity follows an inverse square distance rule, as this allows us to neglect the influences of heavenly bodies that are far away from earth and sun, and of nearby bodies that have small masses compared with the sun. Any high school student can compute the orbit of the earth around the sun, predicting the relative positions of the two bodies millennia into the future.

Likewise, if we can see two galaxies in space and we notice that they are similar in shape, size, and have a similar star density, then we can assume that they both produce about the same amount of light. If the light received on Earth from one of those galaxies is four times that received by the other galaxy, we can apply the inverse square law for light intensity and infer that the dimmer galaxy is probably twice as far from earth as the brighter galaxy. In this short analysis, we start with our observations on every visible galaxy in the universe. Next, we compare just two galaxies and from this comparison we can develop general methods of analysis that may apply to the larger set of data.

The point here is that when Big Data is analyzed it is seldom necessary to include every point of data in your system model. In the Big Data field the most successful analysts will often be those individuals who are adept at simplifying the system model; thus eliminating unnecessary calculations.

Because Big Data is complex, you will often find that your data objects have high dimensionality; each data object is annotated with a large number of values. The types of values that are shared among all the different data objects are usually referred to as parameters. It is very difficult to make much sense of high dimensional data. It is always best to develop a filtering mechanism that expunges useless parameters. A useless parameter will often have one of these two properties:

1. Redundancy. If a parameter correlates perfectly with some other parameter, you know that you can safely drop one of the two parameters. For example, you may have some physiologic data on a collection of people, and the data may include weight, waist size, body fat index, weight adjusted by height, and density. These measurements seem to be measuring about the same thing; are they all necessary? If several attributes closely correlate with one another, you might want to drop a few.

Association scores provide a measure of similarity between two variables. Two similar variables will rise and fall together. The Pearson correlation score is popular and can be easily implemented [18,21]. It produces a score that varies from − 1 to 1. A score of 1 indicates perfect correlation; a score of − 1 indicates perfect anti-correlation (i.e., one variable rises while the other falls). A Pearson score of 0 indicates lack of correlation. Other correlation measures are readily available, as discussed in Section 11.3, “The Dot Product, a Simple and Fast Correlation Method” [22,23]. Big Data analysts should not demur from developing their own correlation scores, as needed, to ensure enhanced speed, or to provide a scoring measure that best serves their particular goals.
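A minimal sketch of the redundancy check, using two invented attributes (the same weight recorded in pounds and in kilograms) that correlate almost perfectly:

# A minimal sketch of flagging redundant parameters with a Pearson score.
# The two toy attributes are invented to show a near-perfect correlation.

from statistics import mean

def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

weight_lb = [120, 135, 150, 180, 210, 240]
weight_kg = [54.6, 61.0, 68.4, 81.5, 95.7, 108.5]     # ~lb / 2.205, with a little noise

r = pearson(weight_lb, weight_kg)
print(f"Pearson r = {r:.4f}")
if abs(r) > 0.95:
    print("near-perfect correlation; one of the two parameters can be dropped")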

2. Randomness. If a parameter is totally random, then it cannot tell you anything meaningful about the data object, and you can drop the parameter. There are many tests that measure randomness; most were designed to measure the quality of random number generators [24]. They can also be used to determine the randomness of data sets.

A simple but useful test for randomness can be achieved by putting your set of parameter values into a file and compressing the file. If the values of the parameter are distributed randomly, the file will not compress well, whereas a set of data that has a regular distribution (e.g., a simple curve, or a Zipf-like distribution, or a distribution with a sharp peak) will compress down into a very small file.

As a small illustration, I wrote a short program that created three files, each 10,000 bytes in length. The first file consisted of the number 1, repeated 10,000 times (i.e., 11111111...). The second file consisted of the numbers 0 through 9, distributed as a sequence of 1000 zeros followed by 1000 ones, followed by 1000 twos, and so on, up to 1000 nines. The final file consisted of the numbers 0 through 9 repeated in a purely random sequence (e.g., 285963222202186026084095527364317), extended to fill a file of 10,000 bytes. Each file was compressed with gzip, which uses the DEFLATE compression algorithm, combining LZ77 and Huffman coding.

The uncompressed files (10,000 bytes) were compressed into the following file sizes:

compressed file size: 58 bytes for 10,000 consecutive "1"
compressed file size: 75 bytes for 1,000 consecutive values of 0 through 9
compressed file size: 5,092 bytes for a random sequence of 10,000 digits

In the third file, which consisted of a random sequence of digits, a small compression was achieved simply through the conversion from ASCII to binary representation. In general, though, a purely random sequence cannot be compressed. A data analyst can compare the compressibility of data values, parameter by parameter, to determine which parameters might be expunged, at least during the preliminary analysis of a large, multi-dimensional data set.
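For readers who want to try the compression test themselves, here is a minimal sketch along the same lines, using Python's zlib module (also DEFLATE-based) rather than the gzip command line tool; the exact byte counts will differ somewhat from those listed above.

# A minimal sketch of the compression test for randomness: highly regular
# data compresses to almost nothing, while random digits barely compress.

import random
import zlib

ones = b"1" * 10_000
blocks = b"".join(str(d).encode() * 1_000 for d in range(10))
randoms = bytes(random.choice(b"0123456789") for _ in range(10_000))

for label, data in [("10,000 ones", ones),
                    ("ten 1,000-digit blocks", blocks),
                    ("random digits", randoms)]:
    print(f"{label:>24}: {len(zlib.compress(data))} bytes compressed")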

When random data are not omitted from the data parameters the unwary analyst may actually develop predictive models and classifiers based entirely on noise. This can occur because clustering algorithms and predictive methods, including neural networks, will produce an outcome from random input data. It has been reported that some published diagnostic tests have been developed from random data [25]. [Glossary Classifier, Neural network]

Aside from eliminating redundant or random parameters, you might want to review the data and eliminate parameters that do not contribute in any useful way toward your analysis. For example, if you have the zip code for an individual, you will probably not need to retain the street address. If you have the radiologic diagnosis for a patient's chest X-ray, you might not need to retain the file containing the X-ray image unless you are conducting an image analysis project.

The process of reducing parameters applies to virtually all of the fields of data analysis, including standard statistical analysis. Names for this activity include feature reduction or selection, variable reduction and variable subset reduction, and attribute selection. There is sometimes a fine line between eliminating useless data parameters and cherry-picking your test set. It is important to document the data attributes you have expunged and your reason for doing so. Your colleagues should be given the opportunity of reviewing all of your data, including the expunged parameters. [Glossary Cherry-picking, Second trial bias]

An example of a data elimination method is found in the Apriori algorithm. At its simplest, it expresses the observation that a collection of items cannot occur frequently unless each item in the collection also occurs frequently. To understand the algorithm and its significance, consider the items placed together in a grocery checkout cart. If the most popular combination of purchase items is a sack of flour, a stick of butter, and a quart of milk, then you can be certain that collections of each of these items individually, and all pairs of items from the list of 3, must also occur frequently. In fact, they must occur at least as often as the combination of all three, because each of these smaller combinations is a subset of the larger set and will occur with the frequency of the larger set plus the frequency of its occurrences in any other item sets. The importance of the Apriori algorithm to Big Data relates to data reduction. If the goal of the analysis is to find association rules for frequently occurring combinations of items, then you can restrict your analysis to combinations composed of single items that occur frequently [26,20].
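A minimal sketch of the pruning principle, using an invented market-basket collection and an arbitrary support threshold (this is not a full association-rule miner):

# A minimal sketch of Apriori-style pruning: a candidate pair is counted
# only if every one of its single items is itself frequent. The baskets and
# the support threshold are invented.

from collections import Counter
from itertools import combinations

baskets = [
    {"flour", "butter", "milk"},
    {"flour", "butter", "milk", "eggs"},
    {"flour", "milk"},
    {"butter", "milk"},
    {"caviar", "flour"},
]
min_support = 3   # an item or itemset must appear in at least 3 baskets

item_counts = Counter(item for b in baskets for item in b)
frequent_items = {i for i, c in item_counts.items() if c >= min_support}
print("frequent single items:", frequent_items)

# only pairs built from frequent single items need to be counted at all
candidate_pairs = list(combinations(sorted(frequent_items), 2))
pair_counts = {p: sum(set(p) <= b for b in baskets) for p in candidate_pairs}
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print("frequent pairs:", frequent_pairs)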

After a reduced data set has been collected, it is often useful to transform the data by any of a variety of methods that enhance our ability to find trends, patterns, clusters or relational properties that might be computationally invisible in the untransformed data set. The first step is data normalization, described in the preceding section. It is critical that data be expressed in a comparable form and measure. After the data is normalized, you can further reduce your data by advanced transformative methods.

As a final caveat, data analysts should be prepared to learn that there is never any guarantee that a collection of data will be helpful, even if it meets every criterion for accuracy and reproducibility. Sometimes the data you have is not the data you need. Data analysts should be aware that advanced analytic methods may produce a result that does not take you any closer to a meaningful answer. The data analyst must understand that there is an important difference between a result and an answer. [Glossary Support vector machine]

Section 10.6. Understanding Your Control

The purpose of computing is insight, not numbers.

Richard Hamming

In the small data realm the concept of “control” is easily defined and grasped. Typically, a group is divided into treatment and control sub-groups. Heterogeneity in the population (e.g., gender, age, health status) is randomly distributed into both groups, so that the treatment and the control subgroups are, to the extent possible, indistinguishable from one another. If the treatment group receives a drug administered by syringe suspended in a saline solution, then the control group might receive an injection of saline solution by syringe, without the drug. The idea is to control the experimental groups so that they are identical in every way, save for one isolated factor. Measurable differences in the control and the treatment groups that arise after treatment are potentially due to the action of the one treatment factor.

The concept of “control” does not strictly apply to Big Data; the data analyst never actually “controls” the data. We resign ourselves to doing our best with the “uncontrolled” data that is provided. In the absence of controlling an experiment, what can the data analyst do to exert some kind of data selection that simulates a controlled experiment? It often comes down to extracting two populations, from the Big Data resource, that are alike in every respect, but one: the treatment.

Let me relate a hypothetical situation that illustrates the special skills that Big Data analysts must master. An analyst is charged with developing a method for distinguishing endometriosis from non-diseased (control) tissue using gene expression data. By way of background, endometriosis is a gynecologic condition wherein endometrial tissue that is usually confined to the endometrium (the tissue that lines the inside cavity of the uterus) is found growing outside the uterus, on the surfaces of the ovaries, pelvis, and other organs found in the pelvis. He finds a public data collection that provides gene expression data on endometriosis tissue (five samples) and on control tissues (five samples). By comparing the endometriosis samples with the control samples, he finds a set of 1000 genes that are biomarkers for endometriosis (i.e., that have “significantly” different expression in the disease samples compared with the control samples).

Let us set aside the natural skepticism reserved for studies that generate 1000 new biomarkers from an analysis of 10 tissue samples. The analyst is asked the question, “What was your control tissue, and how was it selected and prepared for analysis?” The analyst indicates that he does not know anything about the control tissues. He points out that the selection and preparation of control tissues is a pre-analytic task (i.e., outside the realm of influence of the data analyst). In this case, the choice of the control tissue was not at all obvious. If the control tissue were non-uterine tissue, taken from the area immediately adjacent to the area from which the endometriosis was sampled, then the analysis would have been comparing endometriosis with the normal tissue that covers the surface of pelvic organs (i.e., a mixture of various types of connective tissue cells unlike endometrial cells). If the control consisted of samples of normal endometrial tissue (i.e., the epithelium lining the endometrial canal), then the analysis would have been comparing endometriosis with its normal counterpart. In either case, the significance and rationale for the study would have been very different, depending on the choice of controls.

In this case, as in every case, the choice and preparation of the control is of the utmost importance to the analysis that will follow. In a “small data” controlled study, every system variable but one, the variable studied in the experiment, is “frozen”; an experimental luxury lacking in Big Data. The Big Data analyst must somehow invent a plausible control from the available data. This means that the data analyst, and his close co-workers, must delve into the details of data preparation and have a profound understanding of the kinds of questions that the data can answer. Finding the most sensible control and treatment groups from a Big Data resource can require a particular type of analytic mind that has been trained to cope with data drawn from many different scientific disciplines.

Section 10.7. Statistical Significance Without Practical Significance

The most savage controversies are those about matters as to which there is no good evidence either way.

Bertrand Russell

Big Data provides statistical significance without necessarily providing any practical significance. Here is an example. Suppose you have two populations of people and you suspect that the adult males in the first population are taller than the adult males in the second population. To test your hypothesis, you measure the heights of a random sampling (100 subjects) from both groups. You find that the average height of group 1 is 172.7 cm, while the average height of the second group is 172.5 cm. You calculate the standard error of the mean (the standard deviation divided by the square root of the number of subjects in the sampled population), and you use this statistic to determine the range in which the mean is expected to fall. You find that the difference in the average height in the two sampled populations is not significant, and you cannot exclude the null hypothesis (i.e., that the two sampled groups are equivalent, height-wise).

This outcome really bugs you! You have demonstrated a 2 mm difference in the average heights of the two groups, but the statistical tests do not seem to care. You decide to up the ante. You use a sampling of one million individuals from the two populations and recalculate the averages and the standard errors of the means. This time, you get a slightly smaller difference in the heights (172.65 for group 1 and 172.51 for group 2). When you calculate the standard error of the mean for each population, you find a much smaller number, because you are dividing the standard deviation by the square root of one million (i.e., one thousand); not by the square root of 100 (i.e., 10) that you used for the first calculation. The confidence interval for the ranges of the averages is much smaller now, and you find that the differences in heights between group 1 and group 2 are sufficient to exclude the null hypothesis with reasonable confidence.

Your Big Data project was a stunning success; you have shown that group 1 is taller than group 2, with reasonable confidence. However, the average difference in their heights seems to be about a millimeter. There are no real life situations where a difference of this small magnitude would have any practical significance. You could not use height to distinguish individual members of group 1 from individual members of group 2; there is too much overlap among the groups, and height cannot be accurately measured to within a millimeter tolerance. You have used Big Data to achieve statistical significance, without any practical significance.
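The arithmetic behind this example is easy to sketch. Assuming a standard deviation of about 7 cm for adult male height (an assumption made only for illustration) and the fixed 2 mm difference between the group means, the standard error of the difference shrinks with the square root of the sample size until the difference becomes “statistically significant”:

# A minimal sketch of statistical significance without practical significance.
# The 7 cm standard deviation and the fixed 2 mm true difference are assumptions.

from math import sqrt

sd_cm = 7.0            # assumed standard deviation of adult male height
diff_cm = 0.2          # the 2 mm difference between the two group means

for n in (100, 10_000, 1_000_000):
    sem_diff = sd_cm * sqrt(2.0 / n)      # standard error of the difference in means
    z = diff_cm / sem_diff
    verdict = "significant" if z > 1.96 else "not significant"
    print(f"n = {n:>9,}: z = {z:5.2f}  ({verdict} at the 5% level)")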

There is a tendency among Big Data enthusiasts to promote large data sets as a cure for the limited statistical power and frequent irreproducibility of small data studies. In general, if an effect is large, it can be evaluated in a small data project. If an effect is too small to confirm in a small data study, statistical analysis may benefit from a Big Data study, by increasing the sample size and reducing variances. Nonetheless, the final results may have no practical significance, or the results may be unrepeatable in a small-scale (i.e., real life) setting, or may be invalidated due to the persistence of biases that were not eliminated when the sample size was increased.

Section 10.8. Case Study: Gene Counting

There is a chasm

of carbon and silicon

the software can't bridge.

Computer-inspired haiku by Rahul Sonnad

The Human Genome Project is a massive bioinformatics project in which multiple laboratories helped to sequence the 3 billion base pair haploid human genome. The project began its work in 1990, a draft human genome was prepared in 2000, and a finished genome was announced in 2003, marking the start of the so-called post-genomics era. There are about 2 million species of proteins synthesized by human cells. If every protein had its own private gene containing its specific genetic code, then there would be about two million protein-coding genes contained in the human genome. As it turns out, this estimate is completely erroneous. Analysis of the human genome indicates that there are somewhere between 20,000 and 150,000 genes. The majority of estimates come in at the low end (about 25,000 genes). Why are the current estimates so much lower than the number of proteins, and why is there such a large variation in the lower and upper estimates (20,000 to 150,000)? [Glossary Human Genome Project]

Counting is difficult when you do not fully understand the object that you are counting. The reason that you are counting objects is to learn more about the object, but you cannot fully understand an object until you have learned what you need to know about the object. Perceived this way, counting is a bootstrapping problem. In the case of proteins, a small number of genes can account for a much larger number of protein species, because proteins can be assembled from combinations of genes, and the final form of a unique protein can be modified by so-called post-translational events (folding variations, chemical modifications, sequence shortening, clustering by fragments, etc.). The methods used to count protein-coding genes can vary [27]. One technique might look for sequences that mark the beginning and the end of a coding sequence; another method might look for segments containing base triplets that correspond to amino acid codons. The former method might count genes that code for cellular components other than proteins, and the latter might miss fragments whose triplet sequences do not match known protein sequences [28]. Improved counting methods are being developed to replace the older methods, but a final number evades our grasp.

The take-home lesson is that the most sophisticated and successful Big Data projects can be stumped by the simple act of counting.

Section 10.9. Case Study: Early Biometrics, and the Significance of Narrow Data Ranges

The proper study of Mankind is Man.

Alexander Pope in “An Essay on Man,” 1734.

It is difficult to determine the moment in history when we seriously began collecting biometric data. Perhaps it started with the invention of the stethoscope. Rene-Theophile-Hyacinthe Laennec (1781–1826) is credited with inventing this device, which provided us with the opportunity to listen to the sounds generated within our bodies. Laennec's 1816 invention was soon followed by his 900-page analysis of sounds, heard in health and disease, the Traite de l'Auscultation Mediate (1819). Laennec's meticulous observations were an early effort in Big Data medical science. A few decades later, in 1854, Karl Vierordt's sphygmograph was employed to routinely monitor the pulse of patients. Perhaps the first large monitoring project came in 1868 when Carl Wunderlich published Das Verhalten der Eigenwarme in Krankheiten, which collected body temperature data on approximately 25,000 patients [29]. Wunderlich associated peaks and fluctuations of body temperature with 32 different diseases. Not only did this work result in a large collection of patient data, it also sparked considerable debate over the best way to visualize datasets. Competing suggestions for the representation of thermometric data (as it was called) included time interval (discontinuous) graphs and oscillating real-time (continuous) charts. Soon thereafter, sphygmomanometry (blood pressure recording) was invented (1896). With bedside recordings of pulse, blood pressure, respirations, and temperature (the so-called vital signs), the foundations of modern medical data collection were laid.

At the same time that surveillance of vital signs became commonplace, a vast array of chemical assays of blood and body fluids was being developed. By the third decade of the twentieth century, physicians had at their disposal most of the common blood tests known to modern medicine (e.g., electrolytes, blood cells, lipids, glucose, nitrogenous compounds). What early twentieth-century physicians lacked was any sensible way to interpret the test results. Learning how to interpret blood tests required examination of old data collected on many thousands of individuals, and it took considerable time and effort to understand the aggregated results.

The results of blood tests, measured under a wide range of physiologic and pathologic circumstances, produced a stunning conclusion. Nearly every blood test conducted on healthy individuals fell into a very narrow range, with very little variation among individuals. This was particularly true for electrolytes (e.g., sodium and calcium) and, to a somewhat lesser extent, for blood cells (e.g., white blood cells, red blood cells). Furthermore, for any individual, multiple recordings at different times of the day and on different days tended to produce consistent results (e.g., the sodium concentration in the morning was equivalent to the sodium concentration in the evening). These findings were totally unexpected at the time [30].

Analysis of the data also showed that significant deviations from the normal concentration of any one of these blood chemicals are always an indicator of disease. Backed by data, but lacking any deep understanding of the physiologic roles of blood components, physicians learned to associate deviations from the normal range with specific disease processes. The discovery of the "normal range" revolutionized the field of physiology. Thereafter, physiologists concentrated their efforts on understanding how the body regulates its blood constituents. These early studies led to nearly everything we now know about homeostatic control mechanisms and the diseases thereof.
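
The idea of a normal range can be illustrated with a few lines of code. A common convention, assumed here purely for illustration, takes the reference interval to be the central 95% of values measured in healthy individuals, roughly the mean plus or minus two standard deviations when the values are normally distributed; the serum sodium values below are invented.

# toy_reference_range.py -- a sketch of how a "normal range" might be computed
# from measurements on healthy individuals (the values below are invented)
import statistics

healthy_sodium = [138, 140, 141, 139, 142, 137, 140, 141, 139, 138,
                  143, 140, 139, 141, 140, 138, 142, 139, 140, 141]  # mmol/L

mean = statistics.mean(healthy_sodium)
sd = statistics.stdev(healthy_sodium)

# assume an approximately normal distribution: central 95% is roughly mean +/- 2 SD
low, high = mean - 2 * sd, mean + 2 * sd
print("reference interval: %.1f to %.1f mmol/L" % (low, high))

def flag(value):
    """Flag a patient's result that falls outside the reference interval."""
    return "within normal range" if low <= value <= high else "abnormal"

print(120, flag(120))   # a value far below the interval is flagged as abnormal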

To this day, much of medicine consists of monitoring vital signs, blood chemistries, and hematologic cell indices (i.e., the so-called complete blood count), and seeking to find a cause and a remedy for deviations from the normal range.

Glossary

Cherry-picking The process whereby data objects are chosen for some quality that is intended to boost the likelihood that an experiment is successful, but which biases the study. For example, a clinical trial manager might prefer patients who seem intelligent and dependable, and thus more likely to comply with the rigors of a long and complex treatment plan. By picking trial candidates with a set of desirable attributes, the trial manager is biasing the results of the trial, which may no longer apply to a real-world patient population.

Classifier As used herein, refers to algorithms that assign a class (from a pre-existing classification) to an object whose class is unknown [26]. It is unfortunate that the term classifier, as used by data scientists, is often misapplied to the practice of classifying, in the context of building a classification. Classifier algorithms cannot be used to build a classification, because they assign class membership by similarity to other members of the class, not by relationships. For example, a classifier algorithm might assign a terrier to the same class as a housecat, because both animals have many phenotypic features in common (e.g., similar size and weight, presence of a furry tail, four legs, tendency to snuggle in a lap). A terrier is dissimilar to a wolf, and a housecat is dissimilar to a lion, but the terrier and the wolf are directly related to one another, as are the housecat and the lion. For the purposes of creating a classification, relationships are all that matter. Similarities, when they occur, arise as a consequence of relationships, not the other way around. At best, classifier algorithms provide a clue to classification, by sorting objects into groups that may contain related individuals. Like clustering techniques, classifier algorithms are computationally intensive when the dimensionality of the data is high, and can produce misleading results when the attributes are noisy (i.e., contain randomly distributed attribute values) or non-informative (i.e., unrelated to correct class assignment).
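For readers who want to see what a classifier algorithm does, here is a minimal sketch of one such algorithm, a nearest-neighbor classifier, in Python. The animals, attribute values (weight in kilograms, tail length in centimeters), and distance measure are invented for illustration; the point to notice is that class assignment is driven entirely by similarity of attributes, not by relationships.

# toy_classifier.py -- a nearest-neighbor classifier: class assignment by similarity
# attribute values (weight in kg, tail length in cm) and labels are invented examples
import math

training_set = [
    ((4.0, 25.0), "housecat"),
    ((5.0, 28.0), "housecat"),
    ((6.5, 18.0), "terrier"),
    ((7.5, 20.0), "terrier"),
]

def classify(unknown):
    """Assign the unknown object to the class of its most similar training object."""
    distance = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])  # Euclidean distance
    attributes, label = min(training_set, key=lambda pair: distance(pair[0], unknown))
    return label

print(classify((6.0, 19.0)))   # most similar to the terriers, so it is labeled "terrier"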

Human Genome Project The Human Genome Project is a massive bioinformatics project in which multiple laboratories contributed to sequencing the 3 billion base pair haploid human genome (i.e., the full sequence of human DNA). The project began its work in 1990, a draft human genome was prepared in 2000, and a completed genome was finished in 2003, marking the start of the so-called post-genomics era. All of the data produced for the Human Genome Project is freely available to the public.

Meta-analysis Meta-analysis involves combining data from multiple similar and comparable studies to produce a summary result. The hope is that by combining individual studies, the meta-analysis will carry greater credibility and accuracy than any single study. Three of the most recurring flaws in meta-analysis studies are selection bias (e.g., negative studies are often omitted from the literature), inadequate knowledge of the included sets of data (e.g., incomplete methods sections in the original articles), and non-representative data (e.g., when the published data are non-representative samples of the original data sets).
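As a minimal illustration of how a summary result can be computed, the sketch below applies one common approach, fixed-effect inverse-variance weighting, in which each study's effect estimate is weighted by the reciprocal of its variance. The effect sizes and variances are invented, and none of the flaws listed above are remedied by the arithmetic.

# toy_meta_analysis.py -- fixed-effect, inverse-variance weighted meta-analysis
# each study contributes an effect estimate and its variance (numbers are invented)
studies = [
    {"effect": 0.30, "variance": 0.04},
    {"effect": 0.10, "variance": 0.09},
    {"effect": 0.25, "variance": 0.02},
]

weights = [1.0 / s["variance"] for s in studies]     # more precise studies get more weight
pooled = sum(w * s["effect"] for w, s in zip(weights, studies)) / sum(weights)
pooled_variance = 1.0 / sum(weights)

print("pooled effect: %.3f (variance %.4f)" % (pooled, pooled_variance))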

Neural network A dynamic system in which outputs are calculated by a summation of weighted functions operating on inputs. The weights for the individual functions are determined by a learning process, simulating the learning process hypothesized for human neurons. In the computer model, individual functions that contribute to a correct output (based on the training data) have their weights increased, strengthening their influence on the calculated output. Over the past ten or fifteen years, neural networks have lost some favor in the artificial intelligence community. They can become computationally complex for very large sets of multidimensional input data. More importantly, complex neural networks cannot be understood or explained by humans, endowing these systems with a "magical" quality that some scientists find unacceptable.
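The essential arithmetic can be shown in a few lines: a weighted sum of inputs, followed by an error-driven adjustment of the weights. The sketch below is a single artificial neuron trained on an invented task (reproducing logical OR); it is a toy, not a description of any particular neural network implementation.

# toy_neuron.py -- a single artificial neuron: a weighted sum of inputs, with weights
# adjusted by a simple error-driven learning rule (the OR training task is invented)
training_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights = [0.0, 0.0]
bias = 0.0
learning_rate = 0.1

def output(inputs):
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if weighted_sum > 0 else 0

for epoch in range(20):                      # repeated passes over the training data
    for inputs, target in training_data:
        error = target - output(inputs)      # a positive error strengthens the contributing weights
        for i, x in enumerate(inputs):
            weights[i] += learning_rate * error * x
        bias += learning_rate * error

print(weights, bias)
print([output(inputs) for inputs, target in training_data])   # should reproduce 0, 1, 1, 1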

Second trial bias Can occur when a clinical trial yields a greatly improved outcome when it is repeated with a second group of subjects. In the medical field, second trial bias arises when trialists find subsets of patients from the first trial who do not respond well to treatment, thereby learning which clinical features are associated with poor trial response (e.g., certain pre-existing medical conditions, lack of a good home support system, obesity, nationality). During the accrual process for the second trial, potential subjects who profile as non-responders are excluded. Trialists may justify this practice by asserting that the purpose of the second trial is to find a set of subjects who will benefit from treatment. With a population enriched with good responders, the second trial may yield results that look much better than the first trial. Second trial bias can be considered a type of cherry-picking that is often justifiable.

Steghide Steghide is an open source utility, distributed under a GNU license, that invisibly embeds data in image or audio files. Windows and Linux versions are available for download from SourceForge, at:
http://steghide.sourceforge.net/download.php
A Steghide manual is available at:
http://steghide.sourceforge.net/documentation/manpage.php
After installing, you can invoke steghide at the system prompt as a command line launched from the subdirectory in which steghide.exe resides.
Here is an example of a command line invocation of Steghide; the chosen password is inserted directly into the command line:
steghide embed -cf myphoto.jpg -ef mytext.txt -p hideme
The command line was launched from the subdirectory that holds the steghide executable files on my computer. The command instructs steghide to embed the text file, mytext.txt, into the image file, myphoto.jpg, under the password "hideme".
That is all there is to it. The image file, a photo of myself, now carries an embedded text file containing my short biography. No longer need I keep track of both files. I can regenerate my biography file from my image file, but I must remember the password.
I could have called Steghide from a script. Here is an example of an equivalent Python script that invokes steghide from a system call.
import os

# build the same steghide command shown above and hand it to the operating system
command_string = "steghide embed -cf myphoto.jpg -ef mytext.txt -p hideme"
os.system(command_string)
You can see how powerful this method can be. With a bit of tweaking, you can write a short script that uses the Steghide utility to embed a hidden text message in thousands of images, all at once. Anyone viewing those images would have no idea that they contained a hidden message, unless and until you told them so.
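As a sketch of how such a batch job might look, assuming a directory of JPEG images, a message file named mytext.txt, and the same password as above, something like the following would do:

# batch_steghide.py -- embed the same text file into every JPEG in a directory
# the directory name, message file, and password are assumptions for this sketch
import glob
import subprocess

for image_file in glob.glob("images/*.jpg"):
    subprocess.run(["steghide", "embed",
                    "-cf", image_file,      # cover file (the image)
                    "-ef", "mytext.txt",    # file to embed
                    "-p", "hideme"])        # password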

Support vector machine A machine learning technique that classifies objects. The method starts with a training set consisting of two classes of objects as input. The support vector machine computes a hyperplane, in a multidimensional space, that separates objects of the two classes. The dimension of the hyperspace is determined by the number of dimensions or attributes associated with the objects. Additional objects (i.e., test set objects) are assigned membership in one class or the other, depending on which side of the hyperplane they reside.
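For readers who wish to experiment, the scikit-learn package for Python (an assumption; it is not part of the standard Python distribution) provides a ready-made support vector machine. A minimal sketch, using invented two-attribute training objects, might look like this:

# toy_svm.py -- a support vector machine separating two invented classes of objects
# requires the scikit-learn package (pip install scikit-learn)
from sklearn import svm

# training set: each object has two attributes; labels 0 and 1 mark the two classes
training_objects = [[1.0, 1.2], [1.5, 0.8], [1.2, 1.1], [4.0, 4.2], [4.5, 3.8], [4.2, 4.4]]
class_labels = [0, 0, 0, 1, 1, 1]

classifier = svm.SVC(kernel="linear")   # compute a separating hyperplane (a line, in two dimensions)
classifier.fit(training_objects, class_labels)

# test objects are assigned to one class or the other, depending on their side of the hyperplane
print(classifier.predict([[1.3, 1.0], [4.1, 4.0]]))   # expected output: [0 1]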

References

[1] Bloom B.J., Nicholson T.L., Williams J.R., Campbell S.L., Bishof M., Zhang X., et al. An optical lattice clock with accuracy and stability at the 10^-18 level. Nature. 2014;506:71–75.

[2] Berman J. Precision medicine, and the reinvention of human disease. Cambridge, MA: Academic Press; 2018.

[3] Fisher R.A. The correlation between relatives on the supposition of Mendelian inheritance. Trans R Soc Edinb. 1918;52:399–433.

[4] Ward L.D., Kellis M. Interpreting noncoding genetic variation in complex traits and human disease. Nat Biotechnol. 2012;30:1095–1106.

[5] Visscher P.M., McEvoy B., Yang J. From Galton to GWAS: quantitative genetics of human height. Genet Res. 2010;92:371–379.

[6] Trithemius J. Steganographia (Secret Writing). 1500.

[7] Berman J.J. Biomedical informatics. Sudbury, MA: Jones and Bartlett; 2007.

[8] Booker D., Berman J.J. Dangerous abbreviations. Hum Pathol. 2004;35:529–531.

[9] Berman J.J. Pathology abbreviated: a long review of short terms. Arch Pathol Lab Med. 2004;128:347–352.

[10] Patient Safety in American Hospitals. HealthGrades; July 2004. Available from: http://www.healthgrades.com/media/english/pdf/hg_patient_safety_study_final.pdf [viewed September 9, 2012].

[11] Gordon R. Great medical disasters. New York: Dorset Press; 1986. p. 155–160.

[12] Vital Signs: unintentional injury deaths among persons aged 0–19 years; United States, 2000–2009. Morbidity and Mortality Weekly Report (MMWR). Centers for Disease Control and Prevention; April 16, 2012;61:1–7.

[13] Rigler T. DOD discloses new figures on Korean War dead. Army News Service; 2000 May 30.

[14] Frey C.M., McMillen M.M., Cowan C.D., Horm J.W., Kessler L.G. Representativeness of the surveillance, epidemiology, and end results program data: recent trends in cancer mortality rate. JNCI. 1992;84:872.

[15] Ashworth T.G. Inadequacy of death certification: proposal for change. J Clin Pathol. 1991;44:265.

[16] Kircher T., Anderson R.E. Cause of death: proper completion of the death certificate. JAMA. 1987;258:349–352.

[17] Walter S.D., Birnie S.E. Mapping mortality and morbidity patterns: an international comparison. Int J Epidemiol. 1991;20:678–689.

[18] Berman J.J. Methods in medical informatics: fundamentals of healthcare programming in Perl, Python, and Ruby. Boca Raton: Chapman and Hall; 2010.

[19] Mitchell K.J., Becich M.J., Berman J.J., Chapman W.W., Gilbertson J., Gupta D., et al. Implementation and evaluation of a negation tagger in a pipeline-based system for information extraction from pathology reports. Medinfo. 2004;2004:663–667.

[20] Janert P.K. Data Analysis with Open Source Tools. O'Reilly Media; 2010.

[21] Lewis P.D. R for Medicine and Biology. Sudbury: Jones and Bartlett Publishers; 2009.

[22] Szekely G.J., Rizzo M.L. Brownian distance covariance. Ann Appl Stat. 2009;3:1236–1265.

[23] Reshef D.N., Reshef Y.A., Finucane H.K., Grossman S.R., McVean G., Turnbaugh P.J., et al. Detecting novel associations in large data sets. Science. 2011;334:1518–1524.

[24] Marsaglia G., Tsang W.W. Some difficult-to-pass tests of randomness. J Stat Softw. 2002;7:1–8.

[25] Venet D., Dumont J.E., Detours V. Most random gene expression signatures are significantly associated with breast cancer outcome. PLoS Comput Biol. 2011;7:e1002240.

[26] Wu X., Kumar V., Quinlan J.R., Ghosh J., Yang Q., Motoda H., et al. Top 10 algorithms in data mining. Knowl Inf Syst. 2008;14:1–37.

[27] Pennisi E. Gene counters struggle to get the right answer. Science. 2003;301:1040–1041.

[28] How Many Genes Are in the Human Genome? Human Genome Project Information. Available from: http://www.ornl.gov/sci/techresources/Human_Genome/faq/genenumber.shtml [viewed June 10, 2012].

[29] Wunderlich C.R. Das Verhalten der Eigenwärme in Krankheiten (The behavior of body temperature in diseases). Leipzig: O. Wigand; 1868.

[30] Berman J.J. Repurposing legacy data: innovative case studies. Waltham, MA: Morgan Kaufmann; 2015.
