7.1 The Role of Statistics in Data Analysis

Anyone who has tried to choose one cookie from a tray that comes fresh from the oven knows how difficult it can be. Some cookies are bigger than others, some have a nicer color and some have more raisins in them. We are all vaguely aware that this sort of variation exists everywhere in life. The distance we drive on a tank of fuel varies from time to time, we don't arrive at work at the same time every morning, our hair does not look the same every day – everything varies.

In many cases we do not bother to cope with this variation. If the fuel tank is empty we just fill it up, regardless of the distance we have driven. In other cases we need to take the variation into consideration. If we are required to be at work at a certain time, for example, we probably get up a little earlier in the morning to allow for some unforeseen delays along the way. If we are analyzing data from an experiment it is often necessary to account for the variation in much greater detail. The variation in such data comes both from the measurement setup and the process that we are studying. It is crucially important to find ways to cope with this variation in order to draw conclusions that are correct. For example, if the experiment is aimed at detecting a subtle effect it is important to know how much of the variation comes from the measurement setup. If that variation is larger than the one we create by our deliberate experimentation, something must be done to increase the quality of the measurements. We obviously want to be sure that the effects we measure can be attributed to the factors that are varied in the experiment, and not to the noise in the measurements.
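The point about measurement noise can be illustrated with a small simulation. What follows is a minimal sketch rather than a method from this chapter. The true effect, the two noise levels, the five measurements per group, and the function names are all assumptions chosen for illustration. It repeats an imaginary two-group experiment many times and counts how often the observed difference points in the wrong direction.

```python
import random
import statistics


def observed_difference(effect, noise_sd, n, rng):
    """Difference in sample means between a treated and a control group.

    Both groups share the same Gaussian measurement noise (an assumption);
    the treated group is shifted by the true effect.
    """
    control = [rng.gauss(0.0, noise_sd) for _ in range(n)]
    treated = [rng.gauss(effect, noise_sd) for _ in range(n)]
    return statistics.mean(treated) - statistics.mean(control)


def wrong_sign_fraction(effect, noise_sd, n=5, repeats=2000, seed=0):
    """Fraction of repeated experiments whose observed difference is negative
    even though the true effect is positive."""
    rng = random.Random(seed)
    wrong = sum(
        observed_difference(effect, noise_sd, n, rng) < 0
        for _ in range(repeats)
    )
    return wrong / repeats


TRUE_EFFECT = 1.0  # hypothetical effect size, in arbitrary units

# Noise much smaller than the effect, then noise much larger than the effect.
quiet = wrong_sign_fraction(TRUE_EFFECT, noise_sd=0.2)
noisy = wrong_sign_fraction(TRUE_EFFECT, noise_sd=5.0)

print(f"wrong-sign experiments, low noise:  {quiet:.1%}")
print(f"wrong-sign experiments, high noise: {noisy:.1%}")
```

Under these assumed numbers, the low-noise setup almost never misleads, while the high-noise setup reports an effect in the wrong direction in a substantial share of the experiments. This is the situation in which the quality of the measurements has to be improved before any conclusion can be drawn.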

In statistics, variation is often called error. The word is not to be associated with mistakes. It's just a word for the inevitable, natural variation that occurs in real-world data. In an experiment, any variation that occurs through unknown influences is called experimental error and the measurement error is often only a small part of it. Variation in raw materials, environmental conditions, the sampling or the process under study may be larger components. With statistics we can separate such sources of error from the effects. We can also quantify the error. Statistics could be said to be a means of stating, exactly, how uncertain we are about our conclusions.

Statistics is both a toolbox of practical techniques and a mindset that helps us understand the world. Statistical thinking helps us interpret natural phenomena that involve random processes, such as chemical reactions, quantum mechanical systems, evolution in biology, and turbulence in fluid mechanics. Likewise, the absence of proper statistical thinking can be an obstacle to proper understanding of what goes on in the world. We often see examples of this in the news media. Newspapers have some favorite subjects, one of which is to constantly warn us that we risk contracting a terminal disease. The headlines may tell us that the incidence of a certain type of cancer is above average in a particular area, often close to a factory, mobile phone base station or something equally unreassuring. They want us to read between the lines that there is a causal connection between the two. Health risks should, of course, be taken seriously. If many people are exposed to a risk, even a small increase in the incidence of health problems could have a large effect on public health. But that is no excuse for careless interpretation of data. It is quite normal for the incidence of a form of cancer to be above average in some areas: it is an inevitable consequence of variation that things are not the same everywhere. Let us consider an imaginary example to illustrate this:

Say that one day the news billboards proclaim that the caries rate among the citizens of Caramel is above the national average. The news story points to the local candy factory as the suspected culprit. It allegedly discharges byproducts from sweet manufacture into the ground water. Even if it did, does the story prove that the factory caused this “outbreak” of caries? The risk of caries can be expected to vary with several factors, such as eating habits, oral hygiene, hereditary factors, social factors, and so on. This means that the risk varies between people and that there will be a natural variation in the incidence. As a consequence, the cases will not be evenly distributed across the country. Depending on where we look we may find a frequency of caries that is below or above average. As we will see when we discuss the central limit effect, it is often reasonable to assume that approximately half of the places will be above average. In fact, finding a value exactly on the average is highly unlikely. Before we can say anything about how extraordinary the situation is in Caramel, we must know something about how large the variation is in the country as a whole.
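The Caramel argument can be checked with a short simulation. This is a hedged sketch, not an analysis from the text: it assumes an imaginary country of 1000 towns with 500 residents each, in which every resident has the same 30% risk of caries. All of these numbers and names are invented for illustration. No town has a candy factory or any other local cause, so any difference between towns is pure natural variation.

```python
import random

N_TOWNS = 1000     # hypothetical number of towns
RESIDENTS = 500    # hypothetical residents per town
RISK = 0.30        # hypothetical risk, identical for every resident


def simulate_town_rates(n_towns, residents, risk, seed=1):
    """Return the observed caries rate in each town when the risk is the same everywhere."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_towns):
        # Each resident independently gets caries with probability `risk`.
        cases = sum(rng.random() < risk for _ in range(residents))
        rates.append(cases / residents)
    return rates


rates = simulate_town_rates(N_TOWNS, RESIDENTS, RISK)
national_average = sum(rates) / len(rates)
fraction_above = sum(r > national_average for r in rates) / len(rates)

print(f"national average rate: {national_average:.3f}")
print(f"lowest town rate:      {min(rates):.3f}")
print(f"highest town rate:     {max(rates):.3f}")
print(f"towns above average:   {fraction_above:.1%}")
```

Even with identical risk everywhere, roughly half of the simulated towns end up above the national average, and the highest town rate lies well above it. The spread between the lowest and highest rates is the kind of information needed before a single town can be called extraordinary.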

As experimenters, we need statistical thinking to understand that there is a natural variation everywhere we look and what that entails when we interpret our measurement data. We also need a toolbox of statistical methods to analyze data. To understand the basis of those tools it is necessary to grasp the concepts introduced in this chapter.
