CHAPTER 8


Descriptive Statistics and Exploratory Data Analysis

In this chapter, we look at both numerical summaries (what are known as descriptive statistics) and graphical summaries related to exploratory data analysis (EDA). We discuss the topic of graphics more generally in Chapter 9, and the related topic of data visualization later in the text.

Statistician John Tukey popularized exploratory data analysis in his 1977 book of the same name. To Tukey, there was exploratory data analysis (EDA) and confirmatory data analysis (CDA), much as we talk about exploratory and confirmatory factor analysis today. Before we do any kind of serious analysis, we should understand our data. Understanding the data involves letting the data tell their own story. Tukey presented a number of graphical and semi-graphical techniques for displaying important characteristics of data distributions. He once said that “An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem,” a sentiment with which I find myself in total agreement. Tukey pointed out that numerical summaries of data focus on the expected values, while graphical summaries focus on the unexpected.

Tukey wrote something that sounds quite prescient almost 40 years later: “Even when every household has access to a computing system, it is unlikely that ‘just what we would like to work with’ will be easily enough available” (Tukey, 1977, p. 663).

8.1 Central Tendency

The three commonly reported measures of central tendency are the mean, the median, and the mode. R provides built-in functions for the mean and the median. The built-in R function called mode returns the storage mode of an object, not a measure of central tendency. The prettyR package provides a function called Mode that returns the modal value of a dataset, if there is one. If there are multiple modes, prettyR will inform you of that fact, but will not identify the actual values. A table can be used for that purpose, however.

8.1.1 The Mean

The mean is technically appropriate only for scale (interval or ratio) data, because the unequal intervals of ordinal data make the mean misleading. In some cases, we do average ranks, as in certain nonparametric tests, but as a general rule, we should use only the mode or the median to describe the center of ordinal data. The mean for a population or a sample of data is defined as the sum of the data values divided by the number of values. When distinctions are necessary, we will use N to refer to the number of values in a population and n to refer to the number of values in a sample. If we need to identify sub-samples, we will use subscripts, as in n1 and n2. The built-in function mean will determine the mean for a dataset. Remember that if you have missing data, you must set na.rm = TRUE in order for the mean to be calculated.
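
To illustrate, here is a minimal sketch with a toy vector (the values are invented for illustration). With a missing value present, mean returns NA unless na.rm = TRUE is set:

> x <- c(3, 5, NA, 8)
> mean(x)
[1] NA
> mean(x, na.rm = TRUE)
[1] 5.333333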

In addition to being the most obvious measure of central tendency to most people, the mean has several statistical advantages. It uses information from every value in the dataset. It is used in the calculation of additional measures such as standard scores, the variance, and the standard deviation. Perhaps most important, the mean from a sample is an unbiased estimate of the population mean. We examined the distribution of sample means in Chapter 6, and found that with larger samples, the distribution of sample means becomes more normal in shape. The built-in function for the mean is simply mean(). For example, the mean reading score of the 200 students in the hsb dataset is 52.23:

> mean(hsb$read)
[1] 52.23

One disadvantage of the mean is its sensitivity to extreme values. Because every value in the dataset contributes to the value of the mean, extreme values “pull” the mean in their direction. Consider a very small sample of n = 3 where one person is unemployed, one earns in the mid five figures, and the third is a billionaire. The mean income in that case would be hundreds of millions of dollars, a value that describes none of the three people. The next measure we will discuss, the median, is more robust than the mean in this regard.
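
A quick sketch with invented incomes makes the point; the median, by contrast, is unaffected by the billionaire:

> incomes <- c(0, 55000, 1000000000)
> mean(incomes)
[1] 333351667
> median(incomes)
[1] 55000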

8.1.2 The Median

The median can be used with ordinal, interval, or ratio data. It is the value separating the dataset into halves. The upper half of the data contains values greater than the median, and the lower half contains values lower than the median. The median is also called the second quartile and the 50th percentile. There is intuitive appeal in a middle value dividing the distribution, and the median may be a better index of central tendency than the mean when the data are skewed.

We can locate the median in any set of data by sorting the data from lowest to highest, and finding the value located at the position (n + 1)/2 in the ordered data. If there is an odd number of values in the data, the median will be the observed middle value. If there is an even number of data values, the median is computed as the mean of the two middle values. Depending on the actual values in the data, the median may thus be either an observed value or an imputed value. In either case, the median is always the midpoint of the dataset.
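
A small sketch with made-up values shows both cases. With an odd number of values, the median is the observed middle value; with an even number, it is the mean of the two middle values:

> median(c(1, 3, 5))
[1] 3
> median(c(1, 3, 5, 7))
[1] 4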

As mentioned earlier, the median is insensitive to extreme values. We base its value only on the middle one or two data points. When the data distribution is skewed, the median is often more appropriate than the mean in describing the center of the data. The built-in function is median. Let us find the median value of the reading scores.

> median(hsb$read)
[1] 50

The fact that the median is lower than the mean of these data indicates that the data are likely to be positively skewed. Let’s create a histogram of the reading score data, and use the abline function to draw vertical lines at the positions of the mean and the median to demonstrate this. We will represent the mean with a heavy dashed line, using the lwd = 2 argument to control the line width and the lty = 2 argument to make the line dashed. We represent the median with a heavy solid blue line using the col = "blue" argument.

> hist(hsb$read) 
> abline(v = mean(hsb$read), lty = 2, lwd = 2)
> abline(v = median(hsb$read), col = "blue", lwd = 2)

The completed histogram with the vertical lines added is shown in Figure 8-1. As expected, the data are positively skewed, and the high scores exert an influence on the mean, which is pulled in the direction of the skew.


Figure 8-1. The mean is influenced by extreme values

8.1.3 The Mode

The mode can be found simply by identifying the most frequently occurring value or values in a dataset. Some datasets have no value that repeats, while others have multiple modes. Recall that the geyser eruption data we used in Chapter 6 to illustrate the central limit theorem were bimodal.

R’s built-in mode function returns the storage mode of the R object. As we mentioned earlier, the prettyR function Mode will return the mode of a dataset if there is one, but will simply inform you if there are multiple modes. See the following code listing for more details on mode versus Mode.

> install.packages("prettyR")
> library(prettyR)
> mode(hsb$read)
[1] "numeric"
> Mode(hsb$read)
[1] "47"

> Mode(mtcars$hp)
[1] ">1 mode"

Among many other uses, the table function can be used to identify the values of multiple modes. Using the sort function makes it more obvious which values are the modes. It is convenient to sort the table in descending order. Note that the top row shows the raw data values and the bottom row shows how many times each value appears. Here we have three modes (each appearing three times): 110, 175, and 180. I’ve added some spaces for better readability:

> sort(table(mtcars$hp), decreasing = TRUE)

110 175 180  66 123 150 245  52  62  65  91  93  95  97 105 109 113 205 215 230
  3   3   3   2   2   2   2   1   1   1   1   1   1   1   1   1   1   1   1   1

264 335
  1   1
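
If you want to extract the modal values programmatically rather than by inspection, a minimal sketch is to keep the table entries whose counts equal the maximum count:

> counts <- table(mtcars$hp)
> as.numeric(names(counts)[counts == max(counts)])
[1] 110 175 180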

8.2 Variability

Three common measures of variability are the range, the variance, and the standard deviation. Each has a built-in function in R. The range function returns the minimum and maximum values, so if you want the actual range, you must subtract the minimum from the maximum. The variance function is var, and the standard deviation function is sd.

8.2.1 The Range

The range is easy to compute, requiring the identification of only the maximum and minimum values. It is also intuitively easy to grasp as a measure of how closely together or widely apart the data values are. However, the range is less informative than other measures of variability: it tells us nothing about the shape of the distribution or how the values are dispersed between the extremes. Let us examine the range of the students’ reading scores:

> range(hsb$read)
[1] 28 76
> max(hsb$read) - min(hsb$read)
[1] 48
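
Equivalently, the diff function computes successive differences, so applying it to the output of range yields the actual range in one step:

> diff(range(hsb$read))
[1] 48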

8.2.2 The Variance and Standard Deviation

The population variance is defined as the average of the squared deviations of the raw data from the population mean:

$$\sigma^2 = \frac{\sum_{i=1}^{N}\left(x_i - \mu\right)^2}{N} \qquad (8.1)$$

When dealing with sample data, we calculate the variance as shown in equation (8.2). The n − 1 correction makes the sample value an unbiased estimate of the population variance:

$$s^2 = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}{n-1} \qquad (8.2)$$

where n represents the size of the sample. R’s var function returns the variance treating the dataset as a sample, and the sd function returns the sample standard deviation. If you want to treat the dataset as a population, you must adjust these estimates accordingly.
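
For example, multiplying the sample variance by (n − 1)/n converts it to the population variance, and taking the square root of that result gives the population standard deviation. A minimal sketch using the reading scores:

> n <- length(hsb$read)
> var(hsb$read) * (n - 1) / n
[1] 104.5971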

The variance expresses the dispersion in a dataset in squared terms, and as such changes the units of measurement from the original units. Taking the square root of the variance returns the index to the original units of measure. The standard deviation was conceived by Francis Galton in the late 1860s as a standardized index of normal variability.

Examine the use of the built-in functions for the variance and standard deviation of the students’ reading and mathematics scores:

> var(hsb$read)
[1] 105.1227
> sd(hsb$read)
[1] 10.25294

> var(hsb$math)
[1] 87.76781
> sd(hsb$math)
[1] 9.368448
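
As a quick check, taking the square root of the variance reproduces the standard deviation:

> sqrt(var(hsb$read))
[1] 10.25294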

The coefficient of variation (cv) is the ratio of the standard deviation to the mean. For ratio-level measures, it provides a standardized (unitless) measure of dispersion for a probability or frequency distribution. We can write a simple function to calculate the cv.

> cv <- function(x) {
+   sd(x) / mean(x)
+ }

> cv(mtcars$wt)
[1] 0.3041285

> cv(mtcars$qsec)
[1] 0.1001159

8.3 Boxplots and Stem-and-Leaf Displays

Tukey popularized the five-number summary of a dataset. We used the summary() function in Chapter 2 as a way to summarize sample data. As you may recall, this summary adds the mean to the Tukey five-number summary, providing the values of the minimum, the first quartile, the median, the mean, the third quartile, and the maximum. As a refresher, here is the six-number summary of the math scores for the 200 students in the hsb dataset.

> summary(hsb$math)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  33.00   45.00   52.00   52.64   59.00   75.00
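
Tukey’s original five-number summary is also available directly through the built-in fivenum function, which reports the minimum, the lower hinge, the median, the upper hinge, and the maximum. Note that the hinges are computed slightly differently from the quartiles reported by summary, so for some datasets the two displays will not match exactly:

> fivenum(hsb$math)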

Tukey conceived of a graphical presentation of the five-number summary that he called the box-and-whiskers plot. Today this is more commonly known as the boxplot. The boxplot function in the base R graphics package is quite adequate. The box is drawn around the middle 50% of the data, from the first quartile to the third quartile. The whiskers extend from the first and third quartiles toward the minimum and the maximum, respectively. Values more than 1.5 times the interquartile range (the difference between the third and first quartiles) beyond the ends of the box are considered outliers, and are represented by circles. The command to produce the boxplot is boxplot(hsb$math). The completed boxplot is shown in Figure 8-2.


Figure 8-2. Boxplot of the math scores of 200 students
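
The outlier fences can be verified directly from the quartiles. Here is a minimal sketch using the quantile and IQR functions; with the first quartile at 45 and the third at 59, the interquartile range is 14, placing the fences at 24 and 80:

> q <- quantile(hsb$math, c(0.25, 0.75))
> q + c(-1.5, 1.5) * IQR(hsb$math)
25% 75% 
 24  80 

Because the minimum of 33 and the maximum of 75 both lie inside the fences, no points are flagged as outliers in Figure 8-2.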

Examining the graphical representation of the five-number summary tells a good bit about the data. The location of the median in the box and the relative size of the whiskers tell us whether the distribution is more symmetrical or skewed. When the median is close to the center of the box, and the whiskers are roughly equal in size, the data distribution is more likely to be symmetrical, although these data are not quite symmetrical (the whiskers are not the same length). In Chapter 9, we will begin to use the excellent ggplot2 package written by Hadley Wickham. One of the nice features of ggplot2 is the ability to produce very attractive side-by-side boxplots.
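
Until then, base R’s boxplot function can also produce side-by-side boxplots through its formula interface. A minimal sketch, assuming hsb includes a grouping factor such as the program type prog (an assumption; substitute any factor in your own data):

> boxplot(math ~ prog, data = hsb)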

The stem-and-leaf display is a semi-graphical technique suitable for smaller datasets. Stem-and-leaf displays, also called stem plots, are actually displayed in the R console rather than in the R Graphics device. The stems are the leading digits and the leaves are the trailing digits. Using the mtcars dataset provided with R, let us develop a stem-and-leaf display of the miles per gallon:

> data(mtcars)
> stem(mtcars$mpg)

The decimal point is at the |

10 | 44
12 | 3
14 | 3702258
16 | 438
18 | 17227
20 | 00445
22 | 88
24 | 4
26 | 03
28 |
30 | 44
32 | 49

The stem-and-leaf display has the advantage that every data point is shown. The display resembles a simple frequency distribution, but provides additional information. For larger datasets, such displays are less helpful than histograms or other visual representations of the data.

8.4 Using the fBasics Package for Summary Statistics

The contributed package fBasics provides a particularly thorough function for descriptive statistics called basicStats. Apart from the mode, this function provides an excellent statistical summary of a vector of data. Here is the use of the basicStats function with the students’ math scores.

> install.packages("fBasics")
> library(fBasics)
> basicStats(hsb$math)
             X..hsb.math
nobs          200.000000
NAs             0.000000
Minimum        33.000000
Maximum        75.000000
1. Quartile    45.000000
3. Quartile    59.000000
Mean           52.645000
Median         52.000000
Sum         10529.000000
SE Mean         0.662449
LCL Mean       51.338679
UCL Mean       53.951321
Variance       87.767814
Stdev           9.368448
Skewness        0.282281
Kurtosis       -0.685995

References

Chambers, J. M. (2008). Software for data analysis: Programming in R. New York, NY: Springer.

Hunt, A., & Thomas, D. (1999). The pragmatic programmer: From journeyman to master. Reading, MA: Addison Wesley.

Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105, 156-166.

Ohri, A. (2014). R for cloud computing: An approach for data scientists. New York, NY: Springer.

Pace, L. A. (2012). Beginning R: An introduction to statistical programming. New York, NY: Apress.

Roscoe, J. T. (1975). Fundamental research statistics for the behavioral sciences (2nd ed.). New York, NY: Holt, Rinehart and Winston.

Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison Wesley.

University of California, Los Angeles. (2015). Resources to help you learn and use R. Retrieved from http://www.ats.ucla.edu/stat/r/

Wilkinson, L. (2005). The grammar of graphics (2nd ed.). New York, NY: Springer.
