Chapter 4
Exploratory Data Analysis

Package(s): LearnEDA, e1071, sfsmisc, qcc, aplpack, RSADBE

Dataset(s): memory, morley, InsectSprays, yb, sample, galton, chest, sleep, cloud, octane, AirPassengers, insurance, somesamples, girder

4.1 Introduction: The Tukey's School of Statistics

Exploratory Data Analysis, commonly abbreviated as EDA, combines very powerful and naturally intuitive graphical methods with insightful quantitative techniques for the analysis of data arising from random experiments. The direction for EDA was probably laid down in Tukey's (1962) expository article, “The Future of Data Analysis”. To “explore” means to search or travel with the intent of some kind of useful discovery, and in the same spirit EDA carries out a search in the data to provide useful insights. EDA has been developed to a very large extent by the Tukey school of statisticians.

We can probably refer to EDA as a no-assumptions paradigm. To understand this, recall how the model-based statistical approaches work. We include both the classical and Bayesian schools in the model-based framework; see Chapters 7 to 9. There, we assume that the data is plausibly generated by a certain probability distribution, a few of whose parameters are unknown. EDA, in contrast, places no assumptions on the data-generating mechanism. This approach also gives the analyst the advantage of making an appropriate guess at the underlying true hypothesis rather than speculating on it. The classical methods are referred to by Tukey as “Confirmatory Data Analysis”. EDA is more about attitude than simply a bundle of techniques, and these are not our words. More precisely, Tukey (1980) explains: “Exploratory data analysis is an attitude, a flexibility, and a reliance on display, NOT a bundle of techniques, and should be so taught.”

The major work of EDA has been compiled in the beautiful book of Tukey (1977). The enthusiastic reader must read the thought-provoking sections titled “How far have we come?”, which appear in almost all of its chapters. Mosteller and Tukey (1977) further developed regression methods in this domain. Most of the concepts of EDA detailed in this chapter have been drawn from Velleman and Hoaglin (1984). Hoaglin et al. (1991) extend the Analysis of Variance method in this school of thought. Albert's R package LearnEDA is useful for the beginner; further details about this package can be found at http://bayes.bgsu.edu/EDA/R/Reda.html. From the regression modeling perspective of Part IV, Rousseeuw and Leroy (1987) offer very useful extensions.

In this chapter, we focus on two approaches of EDA. The preliminary aspects are covered in Section 4.2. The first approach consists of graphical methods for visualizing the data, addressed in Section 4.3; the graphical methods omitted there are left out primarily due to space restrictions and the author's limitations. The second approach consists of the quantitative methods of EDA, taken up in Section 4.4. Finally, exploratory regression models are considered in Section 4.5.

4.2 Essential Summaries of EDA

A reason for writing this section is that the summary statistics of EDA often differ from the basic statistics. The emphasis is, more often than not, on summaries such as the median, quartiles, percentiles, etc. We also define here a few summaries which we believe are useful for gaining insight in exploratory analyses. The concepts of the median, quartiles, and Tukey's five numbers have already been illustrated in Section 2.3. A useful concept associated with each datum is its depth, which is defined next.

The depth of a datum x is denoted by d(x). We will denote the median by M. By definition, the depth of the median in a sample of size n is d(M) = (n + 1)/2. Though in general the letter d stands for the derivative, using it for depth as well leaves no real room for confusion. The ideas are illustrated using a simple program:

> x <- c(13,17,11,115,12,7,24)
> tab <- cbind(order(x),x[order(x)],c(1:7),c(7:1),pmin(c(1:7), c(7:1)))
> colnames(tab) <- c("x_label","x_order","Position_from_min",
+ "Position_from_max","depth")
> tab
     x_label x_order Position_from_min Position_from_max depth
[1,]       6       7                 1                 7     1
[2,]       3      11                 2                 6     2
[3,]       5      12                 3                 5     3
[4,]       1      13                 4                 4     4
[5,]       2      17                 5                 3     3
[6,]       7      24                 6                 2     2
[7,]       4     115                 7                 1     1

In the above output, the second line of R code arranges the sample in increasing order, with the first column giving the positions of the sorted values in the original sample via the order function. The third and fourth columns give the positions from the minimum and the maximum respectively. The fifth and last column obtains, through the pmin function, the minimum of the positions from the minimum and maximum values, and thus returns the depth of each sample value.

We begin with an explanation of hinges. Hinges are what everyone sees as the connectors between a door and its frame. In the past there would be three hinges fixing the door: the center one at the middle of the height of the frame, and the other two at one- and three-quarters of the height of the frame. Thus, if we imagine the data arranged in increasing (or decreasing) order along the height of the frame, the median is the middle hinge, while 25% of the observations lie below the lower hinge and 25% above the upper hinge. We naturally ask for the difference between quartiles and hinges: hinges are technically calculated from the depth of the median, whereas quartiles are not. Throughout this chapter, the lower, middle, and upper hinges will be respectively denoted by H_L, M, and H_U. From the output of the previous table, it is clear that the lower hinge is the average of 11 and 12, equal to 11.5, and the upper hinge is the average of 17 and 24, equal to 20.5.

Tukey's five numbers form one of the most important summaries in EDA. These five numbers are the minimum, lower hinge, median, upper hinge, and maximum. The five numbers are computed using the fivenum function, as seen earlier in Section 2.3.

Five number inter-difference, abbreviated as fnid, is defined as the set of consecutive differences of the five numbers, viz., {lower hinge − minimum}, {median − lower hinge}, {upper hinge − median}, and {maximum − upper hinge}. The five number inter-difference gives fair insight into how the sample is spread out. It is easy to define a new function returning the fnid using the diff operator: fnid <- function(x) diff(fivenum(x)).
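As a quick check on the sample used in the depth illustration above, whose five numbers are 7, 11.5, 13, 20.5, and 115:

> fnid <- function(x) diff(fivenum(x))
> fnid(c(13,17,11,115,12,7,24))
[1]  4.5  1.5  7.5 94.5

The large final inter-difference immediately flags the stretched right tail caused by the value 115.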

As a quantitative measure of skewness, we introduce Bowley's relative measure of skewness based on Tukey's five numbers, to be called the Bowley-Tukey measure of skewness:

4.1   Skew_BT = [(H_U − M) − (M − H_L)] / (H_U − H_L)
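A one-line R version of this measure follows immediately from fivenum; the function name skewBT below is ours, purely for illustration:

> skewBT <- function(x) {
+   fn <- fivenum(x) # min, H_L, M, H_U, max
+   ((fn[4] - fn[3]) - (fn[3] - fn[2]))/(fn[4] - fn[2])
+ }
> skewBT(c(13,17,11,115,12,7,24))
[1] 0.6666667

The positive value confirms the right skewness already indicated by the fnid of this sample.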

These concepts are illustrated in the next example of Memory Recall Times.


4.3 Graphical Techniques in EDA

4.3.1 Boxplot

The boxplot, sometimes known as the box-and-whisker plot, is essentially a one-dimensional plot. It may be displayed vertically or horizontally without any change in the information conveyed. The box and the whiskers are its two important parts. The box is always based on three quantities: the top and bottom of the box are determined by the upper and lower quartiles, and the band inside the box is the median. The whiskers are created according to the purpose of the analysis and defined according to the convenience of the experimenter, in line with the goals of the experiment. If a complete representation of the data is required, the whiskers are produced by connecting the end points of the box with the minimum and maximum values of the data. The rationale behind the box and whiskers is that the quartiles divide the dataset into four parts, each containing one-quarter of the sample. The middle line of the box, the box, and the whiskers hence give an appropriate visual representation of the data.

If the goal is to find the outliers, also known as extreme values, below the 100α% and above the 100(1 − α)% percentiles, the ends of the whiskers may be defined as the data points which respectively give us these percentile points. All observations below the lower whisker and above the upper whisker may then be treated as outliers. Common choices of the cut-off percentiles are α = 0.02 and 0.09, corresponding to 1 − α = 0.98 and 0.91 respectively. Sometimes such cut-offs are instead based on the inter-quartile range IQR.

The role of NOTCHES. Inference for a significant difference between medians can be made from boxplots which exhibit notches. The top and bottom notches for a dataset are defined by

4.2   M ± 1.58 × IQR / √n

A useful interpretation of notched boxplots is the following: if the notches of two boxplots do not overlap, we may take this as strong evidence that the medians of the two samples are significantly different. For more details about notches, we refer the reader to Section 3.4 of Chambers et al. (1983).
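As a minimal sketch of notched boxplots in R, consider the InsectSprays data from the datasets package:

> # non-overlapping notches suggest significantly different medians
> boxplot(count ~ spray, data = InsectSprays, notch = TRUE)
> # the notch limits of Equation 4.2 for a single sample are
> # available as the conf component of boxplot.stats
> boxplot.stats(InsectSprays$count)$conf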


4.3.2 Histogram

The histogram was invented by the eminent statistician Karl Pearson and is one of the earliest types of graphical display. Its origin clearly predates EDA, at least the EDA envisioned by Tukey, and yet it is considered by many EDA experts to be a very useful graphical technique, making it onto the list of the very useful practices of EDA. The basic idea is to plot over each interval a bar whose height is proportional to the frequency of the observations lying in that interval. If the sample size is moderately large and the sample is a true representation of the population, the histogram reveals the shape of the true underlying uncertainty curve. Though histograms are drawn in two dimensions, they are essentially one-dimensional plots in the sense that the shape of the uncertainty curve is revealed without even looking at the range of the y-axis. Furthermore, the Pareto chart, the stem-and-leaf plot, and a few others may be shown to be special cases of the histogram. We begin with a “cooked” dataset for understanding a range of uncertainty curves.

A few fundamental questions related to the creation of histograms need to be asked at this point. The central idea is to plot a bar over an interval, and all the intervals together need to cover the range of the variable. For example, the numbers of intervals for the five histograms above are respectively 11, 6, 11, 6, and 10. How did R decide the number of intervals? The interval widths for these histograms are respectively 0.5, 50, 1, 1, and 0.1. What is the basis for the width of the intervals? The reader may check these with length(hg$counts) and diff(hg$breaks), where hg is the object returned by the hist function. The intervals are also known as bins. Let us denote the number of intervals by m and the width of a bin by h. Now, if we know either the number of intervals or the bin width, the other quantity is easily obtained from the formulas:

4.3   m = ⌈(max(x) − min(x)) / h⌉
4.4   h = (max(x) − min(x)) / m

where ⌈x⌉ denotes the ceiling of x. However, in practice we know neither the number of bins nor their width. The hist function offers three options for the bin number/width, based on the formulas of Sturges, Scott, and Freedman-Diaconis:

4.5   m = ⌈log2(n) + 1⌉   (Sturges)

4.6   h = 3.5 s / n^(1/3)   (Scott)

4.7   h = 2 IQR(x) / n^(1/3)   (Freedman-Diaconis)

where n is the number of observations and s is the sample standard deviation. Formulas 4.5-4.7 are specified to the hist function with breaks="Sturges", breaks="Scott", and breaks="FD" respectively. The other options include directly specifying the number of breaks with a numeric, say breaks=10, or the breakpoints themselves through a vector, say breaks=seq(-10,10,0.5).
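The three rules are also exposed directly through the grDevices functions nclass.Sturges, nclass.scott, and nclass.FD, so the suggested number of classes can be inspected before plotting; note that hist passes the suggestion through pretty, and hence the final count may differ slightly:

> x <- morley$Speed
> # suggested class counts under the three rules
> nclass.Sturges(x); nclass.scott(x); nclass.FD(x)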

4.3.3 Histogram Extensions and the Rootogram

The histogram displays the frequencies over the intervals, and for a moderately large number of observations it reflects the underlying probability distribution. The boxplot shows how the data is spread across the five summary measures and identifies outliers more apparently than the histogram, although it cannot reveal the shape of the probability distribution as well as the histogram does. Hence, it would be very useful to bring these two ideas together rather than consult them separately for outliers and probability distributions. An effective way of obtaining such a display is to place the boxplot along the x-axis of the histogram, which helps in clearly identifying outliers as well as an appropriate probability distribution.

The R package sfsmisc contains a function histBxp, which nicely places the boxplot along the x-axis of the histogram.
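A minimal call, assuming the package is installed, is the following:

> library(sfsmisc)
> histBxp(morley$Speed)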

Generally, in histograms, bar height varies more among bins with long bars than among bins with short bars. In frequency terms, the variability of the counts increases as their typical size increases. Hence, a re-expression can approximately remove the tendency for the variability of a count to increase with its typical size. The rootogram arises on taking the re-expression to be the square root of the frequencies. This observation is important for an understanding of transformations.
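A bare-bones rootogram can thus be sketched from any histogram object by re-plotting the square roots of its counts; the following minimal version is ours, not a packaged function:

> h <- hist(morley$Speed, plot = FALSE)
> # bar heights on the square-root scale stabilize the count variability
> barplot(sqrt(h$counts), names.arg = h$mids)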

4.3.4 Pareto Chart

The Pareto chart has been designed to address the implicit questions answered by the Pareto law. The common understanding of the Pareto law is that the “majority of resources” is consumed by a “minority of users”. The most common of the percentages is the 80-20 rule, implying that 80% of the effects come from 20% of the causes; the Pareto law is hence also known as the law of the vital few. The Pareto chart gives very smart answers by completely answering how much is owned by how many. Montgomery (2005), page 148, lists the Pareto chart as one of the seven major tools of Statistical Process Control.

Until 2004, R had no function for this plot, nor was there an add-on package providing one. However, an R user posed this question to the r-help mailing list, and an expert on the software promptly prepared exhaustive code over a period of about two weeks; the code is available at https://stat.ethz.ch/pipermail/r-help/2002-January/018406.html. The Pareto chart can also be plotted using pareto.chart from the R package qcc. We will use Wingate's program and assume here that the reader has copied the code from the web page mentioned above and compiled it in the R session.

The Pareto chart contains three axes on a two-dimensional plot. Generally, the causes/users are arranged along the horizontal axis and the frequencies of these categories are conveyed through a bar for each of them, with the bars arranged in decreasing order of frequency. The left-hand vertical axis denotes the frequencies. The cumulative frequency curve across the causes is plotted simultaneously, with the right-hand vertical axis giving the cumulative counts. Thus, at each cause we know precisely its frequency and also the cumulative count up to that cause.
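A minimal sketch with pareto.chart follows, using hypothetical complaint counts for five causes:

> library(qcc)
> complaints <- c(Billing = 42, Delay = 27, Damage = 13,
+                 Courtesy = 6, Other = 4)
> pareto.chart(complaints)

The function draws the sorted bars together with the cumulative percentage curve, and also prints the underlying frequency table.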

4.3.5 Stem-and-Leaf Plot

Velleman and Hoaglin (1984) describe the basic idea of the stem-and-leaf display as letting the digits of the data values do the sorting into numerical order, and then displaying them. The steps for constructing a stem-and-leaf display are as follows:

  1. Select an appropriate pair of adjacent digit positions in the data and split each observation between those adjacent digits. The digits on the left-hand side of the split are called the leading digits.
  2. Sort all possible leading digits in ascending order. These possible leading digits are called stems.
  3. Write the first trailing digit of each data value beside its stem value. This trailing digit is referred to as the leaf.

In Step 2, all possible stems are listed, irrespective of whether or not they occur in the given dataset. The stem function, available in base R, will be useful for obtaining stem-and-leaf plots.
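For instance, with the morley data:

> stem(morley$Speed)
> # the scale argument controls the length of the display
> stem(morley$Speed, scale = 2)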

Multiple histograms and boxplots were earlier obtained on the same graphical device, so a similar display arrangement is needed to compare two stem-and-leaf plots. Tukey enriched EDA in ways beyond the discussion thus far, and an important technique he invented is the back-to-back modification of the stem-and-leaf plot, available through the stem.leaf.backback function of the aplpack package. This technique will be illustrated through the two examples discussed previously.

If the leaves on each stem are few, the interpretation of the stem-and-leaf plot is simple. However, if a stem carries a large number of trailing digits, the display becomes inconvenient to interpret. Suppose that the stem is the integer 1 and there are nearly 15 observations contributing trailing digits: the stem will then have 15 leaves beside it, which obscures the display. In such cases, Prof Tukey suggests that the stems be further divided into sub-stems. The question is then how to label those sub-stems, which belong to neither the leading nor the trailing digits. Typically the leaf digits are the ten integers 0 to 9, and the notation suggested by Prof Tukey identifies pairs of digits by the first letter of their spellings: 2 (two) and 3 (three) are denoted by t, 4 (four) and 5 (five) by f, and 6 (six) and 7 (seven) by s. The convention for the remaining digits is to denote 0 (zero) and 1 (one) by the star symbol *, and 8 (eight) and 9 (nine) by the period “.”.
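This lettered display is available through the stem.leaf function of aplpack when each stem is split five ways; a minimal sketch, assuming the m and style arguments of the current aplpack release:

> library(aplpack)
> # m = 5 splits every stem into the five sub-stems *, t, f, s, .
> stem.leaf(morley$Speed, m = 5, style = "Tukey")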

4.3.6 Run Chart

The run chart is also known as the run-sequence plot. In the run chart, each data value is simply plotted against its index number: if x_1, …, x_n is the data, we plot x_i against the index i. Run charts can be plotted in R using the function plot.ts, which is very commonly used in time-series analysis.
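For example, the AirPassengers series listed among this chapter's datasets gives an immediate run chart, and a plain numeric vector works just as well since the index supplies the horizontal axis:

> plot.ts(AirPassengers)
> plot.ts(morley$Speed)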

In certain ways, the graphical methods are useful for univariate data. The next technique is more useful for dealing with paired/multivariate data.

4.3.7 Scatter Plot

The reader is most certainly familiar with this very basic format of plot. Whenever we have paired data and there is a belief that the variables are related, it is only natural to plot them against one another. Such a display is, of course, known as the scatter plot or the x-y plot; there is a subtle difference between the two, see Velleman and Hoaglin (1984). We will straightaway start with examples.

The scatter plot will later be extended to multivariate data with more than two variables through an idea known as the matrix of scatter plots; see Sections 12.4 and 14.2.


4.4 Quantitative Techniques in EDA

We discuss here two important quantitative techniques of EDA. For advanced concepts of the quantitative techniques, refer to Hoaglin et al. (1991); the methods described and demonstrated here lay a firm foundation for the methods of that book. The first method is fairly simple, while the second is more detailed.

4.4.1 Trimean

Trimean is a measure of location defined as a weighted average of the median and the two other quartiles. Since the median and the quartiles are resistant to extreme observations, we may intuitively expect the trimean to be more robust than the mean. If Q_1, Q_2, Q_3 denote the lower, middle (median), and upper quartiles, the trimean is defined by

4.8   TM = (Q_1 + 2Q_2 + Q_3) / 4 = (Q_2 + (Q_1 + Q_3)/2) / 2

The last part of the above equation shows that the trimean can be viewed as the average of the median and the average of the lower and upper quartiles. Weisberg (1992) summarized the trimean as “a measure of the center (of a distribution) in that it combines the median's center values with the mid-hinge's attention to the extremes.” In fact, we can even replace the lower and upper quartiles in the above expression by the corresponding hinges and then derive the trimean as a weighted average of the median and the hinges. That is,

TM_H = (H_L + 2M + H_U) / 4

As the hinges are obtained using Tukey's five-number function fivenum, it is straightforward to obtain the trimean using either the quartiles (through the quantile function) or the hinges. The next small R session defines the required functions TM and TMH, and we also show that hinges and quartiles need not be equal.

> TM <- function(x) {
+   qs <- quantile(x, c(0.25, 0.5, 0.75))
+   return(as.numeric((qs[2] + (qs[1] + qs[3])/2)/2))
+ }
> TMH <- function(x) {
+   qh <- fivenum(x)  # min, lower hinge, median, upper hinge, max
+   return((qh[3] + (qh[2] + qh[4])/2)/2)
+ }
> TM(iris[,2]); TMH(iris[,2])
[1] 3.025
[1] 3.025
> ji4 <- jitter(iris[,4])
> quantile(ji4, c(0.25, 0.75))
  25%   75% 
 0.29  1.80 
> fivenum(ji4)[c(2,4)]
[1] 0.289 1.797

The functions are simple to follow. For iris[,2] the hinges happen to coincide with the quartiles, and hence TM and TMH agree; the jittered fourth variable shows that quartiles and hinges need not be equal.

4.4.2 Letter Values

We have mentioned how EDA is about attitude, data-driven stories, etc. Since the emphasis in EDA is on the data, it makes a whole lot of sense to understand each “datum” as much as possible. Natural examples of useful data points are the minimum, the maximum, and, when the size of the data is odd, the median. Recall from Section 4.2 that the depth of a datum is the minimum of its positions from either end of the batch. We can see from the outputs of Section 4.2 that the minimum and maximum values, also called the extremes, have a depth of 1, the second largest and second smallest have a depth of 2, and so on. Thus, the two observations at the i-th and (n − i + 1)-th positions in ascending order both have depth i.

The median splits the data into two equal halves. The upper and lower hinges do to the upper and lower halves of the dataset what the median does to the entire collection; that is, the hinges give us quarters of the dataset. We have seen earlier that the depth of the median for a sample of size n is d(M) = (n + 1)/2. It is fairly easy to see that the depth of the hinges is therefore

4.9   d(H) = (⌊d(M)⌋ + 1) / 2

where ⌊x⌋ indicates the integer part of x.

The next step is to define the half-dividers of the quarters, which results in eight equal divisions of the dataset. These are referred to as eighths for simplicity, and we denote the eighths by E. Furthermore, the depth of the eighths is given by

4.10   d(E) = (⌊d(H)⌋ + 1) / 2
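Equations 4.9 and 4.10 apply the same halving rule repeatedly, which is easily coded; the function name lv.depths below is ours, purely for illustration:

> lv.depths <- function(n) {
+   d <- (n + 1)/2 # depth of the median
+   depths <- d
+   while (d > 1) {
+     d <- (floor(d) + 1)/2 # the halving rule of Equations 4.9 and 4.10
+     depths <- c(depths, d)
+   }
+   depths
+ }
> lv.depths(13) # depths of M, H, E, D, and the extremes
[1] 7.0 4.0 2.5 1.5 1.0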

The concept will now be illustrated with an example from Velleman and Hoaglin (1984).

Note that measures such as the hinges and eighths are not the depth values themselves, but the values of the variable associated with the corresponding depth. Median, hinges, eighths! Where exactly to stop is a very legitimate question for any practitioner; moreover, depending on the size of the dataset, the depth of the eighths may be 3.5 or even 3000. This question is precisely what letter values answer.

Letter values continue the division process further into sixteenths, thirty-seconds, and so on, until we reach a depth equal to 1; we thus arrive at a most meaningful data-division process. Velleman and Hoaglin (1984) suggest denoting the letters beyond the eighths E by D, C, B, and so on. We also clarify here what to do with these eighths, sixteenths, etc. Recall that in Section 4.2 and in Subsection 4.3.1 we suggested extending measures of central tendency, dispersion, and skewness based on hinges. Similarly, we can generalize such measures based on each of the letter values. We thus have further concepts such as midhinges, mideighths, etc.; these measures are referred to as midsummaries. In the same way, we can define measures of dispersion based on the ranges between the letter values, which leads to the H-spread, E-spread, D-spread, etc.

We close this subsection with the use of the function lval from the LearnEDA package developed by Prof Jim Albert.
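A typical call, assuming LearnEDA has been installed from the site mentioned in Section 4.1, is simply:

> library(LearnEDA)
> lval(morley$Speed)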

Now, we are prepared for exploratory regression models!

4.5 Exploratory Regression Models

The scatter plot helps to identify the relationship between two variables, and if it indicates a linear relationship, we would like to quantify it. A rich class of related confirmatory models will be taken up in Part IV. In this section we develop the exploratory approach to quantifying the relationship, and hence we call these models Exploratory Regression Models. For the case of a single input variable, also known as a covariate, the output can be modeled through a resistant line, developed in Subsection 4.5.1; the extension to two variables is taken up in Subsection 4.5.2.

4.5.1 Resistant Line

We have so far seen EDA techniques which provide reliable summaries in the form of the median, midsummaries, etc., and very powerful graphical displays such as the histogram and the Pareto chart. We also saw how x-y plots help in understanding the relationship between two variables. The reader would thus appreciate an EDA technique which models the relationship between two variables. In particular, regression models of the form

4.11   y = a + bx + e

are of great interest. Equation 4.11 is our first regression model, and the answer to problems of this kind is provided by the resistant line. Here, the term a is referred to as the intercept, b as the slope, and e as the error or noise.

The motivation and development of the resistant line is very intuitive and extends in a natural way the use of the median, quartiles, hinges, etc. From a mathematical point of view, the slope b measures the change in the output y for a unit change in the input x, whereas a is the intercept. This rate of change is obtained by dividing the data into three regions.

We describe the resistant line mechanism through the following steps, which the reader should follow in Figure 4.13 too. A useful figure explaining the steps of resistant line modeling may also be found in Figure 6, titled “Understanding the resistant line”, of Tattar (2013). The initial estimates of the parameters are obtained as follows.

  • The x-y plot is divided into three regions, each containing an equal number of data points, according to the x-values only.
  • In the right-hand region, find the median of the x values and that of the y values, denote them by x_R and y_R, and obtain the pair (x_R, y_R).
  • Repeat the exercise for the middle and left regions to obtain the points (x_M, y_M) and (x_L, y_L).

We note from the construction in Figure 4.13 that (x_L, y_L), (x_M, y_M), and (x_R, y_R) need not correspond to any of the paired data (x_i, y_i). Refer to Chapter 5 of Velleman and Hoaglin (1984) for more details.


Figure 4.13 Understanding the Construction of the Resistant Line

The purpose of obtaining the triplet of ordered pairs (x_L, y_L), (x_M, y_M), and (x_R, y_R) is to put ourselves in a position to estimate the slope and intercept of the model given by Equation 4.11. We first estimate the slope, denoted by b_0, using the pair of points (x_L, y_L) and (x_R, y_R). Define

4.12   b_0 = (y_R − y_L) / (x_R − x_L)

We then use the estimated value b_0 in the model and average over the three vital points to obtain an estimate of the intercept, denoted by a_0. Thus,

4.13   a_0 = [(y_L − b_0 x_L) + (y_M − b_0 x_M) + (y_R − b_0 x_R)] / 3

The initial estimates a_0 and b_0 need refinement. Using the initial estimates, the residuals of the fitted model are obtained:

r_i^(0) = y_i − (a_0 + b_0 x_i),   i = 1, …, n.

The slope and intercept terms are now obtained for the paired data (x_i, r_i^(0)) and denoted by (a_1', b_1'). In general, the residuals after the j-th update of the slope and intercept are denoted by r_i^(j), and for j = 1, 2, …, the slope and intercept terms are updated with b_j = b_{j−1} + b_j' and a_j = a_{j−1} + a_j', where (a_j', b_j') is the fit obtained from the pairs (x_i, r_i^(j−1)).
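The initial estimates translate into a few lines of R; the function name rline0 is ours, and the split into thirds by ranks is one reasonable choice among several:

> rline0 <- function(x, y) {
+   # divide the points into three regions by the x-values only
+   thirds <- cut(rank(x, ties.method = "first"), breaks = 3,
+                 labels = c("L", "M", "R"))
+   xs <- tapply(x, thirds, median) # x_L, x_M, x_R
+   ys <- tapply(y, thirds, median) # y_L, y_M, y_R
+   b0 <- (ys["R"] - ys["L"])/(xs["R"] - xs["L"]) # Equation 4.12
+   a0 <- mean(ys - b0*xs)                        # Equation 4.13
+   c(a0 = unname(a0), b0 = unname(b0))
+ }

For the fully iterated fit, note that base R already implements Tukey's resistant line in the line function, so the initial estimates of rline0(x, y) may be compared against coef(line(x, y)).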

Let us put the theory of the resistant line behind us and see it in action during the following examples.

An extension of the resistant line model 4.11 to two factors, or covariates, will be considered next.

4.5.2 Median Polish

For the AD5 dataset, we had the height of the parent as the input variable and the height of the child as the output. Under the hypothetical case where we have groups for the heights of the father and the mother as two different treatment variables, the resistant line model 4.11, in a very different technical sense, needs to be extended as follows:

4.14   y_ij = μ + α_i + β_j + ε_ij

where α_i and β_j represent the effects of the groups of heights of the father and the mother respectively. The groups here may be something along the lines of Low, Medium, and High. In the study of Experimental Designs, Chapter 13, this model is known as the two-way model. In EDA, the solution for obtaining the parameters μ, α_i, and β_j is given by the Median Polish algorithm. A slight technical difference between the median polish and resistant line models needs to be pointed out: here, the input variables are categorical in nature and not continuous. This means that if we still wish to model the height of the child as dependent on the heights of the mother and the father, the latter two variables need to be categorized into bins, say short (less than 5 feet), average (5-6 feet), and tall (greater than 6 feet), as sketched next.
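Such binning is a one-liner with the cut function; the break points below simply encode the 5-foot and 6-foot limits in inches, with hypothetical heights:

> height <- c(58, 63, 67, 70, 74, 77) # hypothetical heights in inches
> cut(height, breaks = c(-Inf, 60, 72, Inf),
+     labels = c("Short", "Average", "Tall"))
[1] Short   Average Average Average Tall    Tall   
Levels: Short Average Tall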

To explain the median polish algorithm, we need to examine the dataset first.

Now that we know the data structure, called the two-way table, the median polish algorithm is given next, which will help in estimating the row and column effects.

  1. Compute the row medians of the two-way table and augment them on the right-hand side of the table. Subtract from each row of the table its row median.
  2. Take the median of the row medians as the initial value of the total effect. As with the original elements of the table, subtract this initial total effect from the row medians.
  3. Compute the column medians of the matrix obtained in the previous step and append them at the bottom. Subtract from the data matrix the corresponding column medians.
  4. Similar to Step 2, obtain the median of the column medians and add it to the total effect. Remove this median from each element of the column medians.
  5. Repeat the four steps above until convergence of either the row or the column medians.

The medpolish function from the stats package can be used to fit the median polish model. This will be illustrated in a continuation of the girder experiment.
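Before turning to that illustration, here is a minimal sketch of medpolish on a hypothetical three-by-three table of child heights (in inches):

> tw <- matrix(c(62, 66, 70, 64, 69, 73, 66, 71, 76), nrow = 3,
+              dimnames = list(Father = c("Short", "Average", "Tall"),
+                              Mother = c("Short", "Average", "Tall")))
> mp <- medpolish(tw)
> mp$overall; mp$row; mp$col # common effect, row effects, column effects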

4.6 Further Reading

We mentioned Tukey's (1962) article “The Future of Data Analysis” as one of the starting points which may have led to the beginning of EDA. Tukey strongly believed that data analysis must not be overwhelmed by model assumptions, and that the data themselves should drive how we describe them. This belief and further work at the Bell Telephone Laboratories culminated in Tukey (1977). There is a lot of simplicity in Tukey's work: for small datasets we do not even need a calculator, as paper and pencil take us to great depth. An advanced sequel to Tukey (1977) is available in Mosteller and Tukey (1977), in which EDA techniques for regression problems are discussed.

In 1991, Hoaglin et al. produced a volume with EDA methods for Analysis of Variance (ANOVA). Hoaglin et al. (1985) is another edited volume, useful for exploring tables, shapes, and trends. In fact, many such ideas are described in Rousseeuw and Leroy (1987) for robust regression. We have also made good use of Velleman and Hoaglin (1984), which contains many Fortran programs for EDA techniques; we restate this because a user can import Fortran programs into R and reuse them easily.

EDA encompasses any method which is exploratory in nature. Thus, many of the multivariate statistical analysis techniques are considered EDA techniques; as an example, many experts consider Principal Component Analysis, Factor Analysis, etc., as EDA techniques. Martinez and Martinez (2005) and Myatt (2007) are two recent books which take this point of view. We will see the multivariate techniques in Chapters 14 and 15.

We should also mention that Frieden and Gatenby (2007) have developed EDA methods using Fisher information. This is an important facet, as Fisher information is a central concept, which we introduce in Chapter 7.

4.7 Complements, Problems, and Programs

  1. Problem 4.1 Let x be a numeric vector. Create a new function, say depth, which accepts a serial number between 1 and length(x) as an argument and returns the depth of the corresponding datum.

  2. Problem 4.2 Obtain the EDA summaries as in fivenum, IQR, fnid, and mad for the datasets considered in Section 4.3. Note your observations based on the summaries and then investigate whether or not these notes are visible in the corresponding figures.

  3. Problem 4.3 Part B of Figure 4.4, see Example 4.5, clearly shows the presence of outliers in the number of dead insects for insecticides C and D. Identify the outlying data points, remove them, and then check whether any more potential outliers are present.

  4. Problem 4.4 Provide summary and descriptive statistics for the cooked dataset in Example 4.6, and interpret the results as provided by the histograms.

  5. Problem 4.5 The number of intervals for the five histograms in Figure 4.5 can be seen as 11, 6, 11, 6, and 10. How do you obtain these numbers through R?

  6. Problem 4.6 Create a function which generates a histogram with the intervals according to the percentiles of the data vector.

  7. Problem 4.7 The histograms seen in Section 4.3 give a horizontal display. At times, a vertical display is preferable. Using the tips from the web http://stackoverflow.com/questions/11022675/rotate-histogram-in-r-or-overlay-a-density-in-a-barplot, obtain the vertical display of a histogram.

  8. Problem 4.8 Imposing histograms on each other helps in comparison of similar datasets, as seen in Example 4.7. Repeating the technique for the Youden-Beale experiment, what will be the conclusion for the two virus extracts?

  9. Problem 4.9 Explore the different choices of breaks given in Formulas 4.5-4.7 for the different histogram examples.

  10. Problem 4.10 Using the R function pareto.chart from the qcc package, obtain the Pareto chart for the causes and frequencies, as in Example 4.10, and compare the results with Figure 4.9.

  11. Problem 4.11 Using the stem.leaf.backback function, compare the averages of the two virus extracts for the Youden-Beale experiment, as discussed in Example 4.2. Similarly, compare the stem-and-leaf displays for the recall of pleasant and unpleasant memories of Example 4.4.

  12. Problem 4.12 Create an R function, say trimean, for computing trimean, as given in Equation 4.8. Apply the new function for datasets of your choices considered in the chapter.

  13. Problem 4.13 Obtain the letter values for three datasets in Example 4.13 using lval and check if the median comparisons can be extended through them.

  14. Problem 4.14 Fit resistant line models for the six pairs of data discussed in Example 4.17. Validate the correlations as implied by the scatter plots in Figure 4.12.

  15. Problem 4.15 For the datasets available in the files rocket_propellant.csv and toluca_company.dat, build the resistant line models. In the former file, the input variable is Age_of_Propellant, while in the latter file it is Lot_Size. The output variables in these respective files are Shear_Strength and Labour_Hours.
