Diagrams prove nothing, but bring outstanding features readily to the eye.
R. A. Fisher [Fisher, 1925]
Chapter 8 discusses getting an initial overall view of a dataset.
When you first start work on a dataset it is important to learn what variables it includes and what the data are like. There will usually be some initial analysis goals, but it is still necessary to look over the dataset to ensure that you know what you are dealing with. There could be issues of data quality, perhaps missing values and outliers (discussed in the next chapter), and there could just be some surprising basic statistics.
There are several different functions in R for showing what is in a dataset. You can show the whole dataset (inadvisable for even moderately sized datasets), display only the first few lines (using the function head), or just list the variables, each with their type and a few values (using the function str). You can summarise the dataset using the function summary (base R) or the function describe (Hmisc) or the function whatis (YaleToolkit). There are doubtless other statistical alternatives. Plotting the dataset is the alternative, complementary approach for getting an overview and the one we will be concentrating on in this chapter.
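These first-look functions can be tried on any data frame; the sketch below uses the built-in mtcars data purely for illustration:

```r
# A quick first look at a data frame, here the built-in mtcars dataset
head(mtcars)     # the first six rows
str(mtcars)      # each variable's type and a few values
summary(mtcars)  # basic summary statistics for every variable
```

For a package-based summary, Hmisc::describe(mtcars) gives a similar but richer overview.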
As an example, consider the dataset HI from the package Ecdat on the effect of health insurance on women’s working hours. The data were analysed in [Olsen, 1998]. Initial information can be gained with str (the results are shown below) and some simple graphics (Figures 8.1 and 8.2).
# 'data.frame': 22272 obs. of 13 variables:
# $ whrswk : int 0 50 40 40 0 ...
# $ hhi : Factor w/ 2 levels "no","yes": 1 1 2 1 2 ...
# $ whi : Factor w/ 2 levels "no","yes": 1 2 1 2 1 ...
# $ hhi2 : Factor w/ 2 levels "no","yes": 1 1 2 2 2 ...
# $ education : Ord.factor w/ 6 levels
# "<9years"<"9-11years"<..: 4 4 3 4 2 ...
# $ race : Factor w/ 3 levels "white","black",..: 1 1 1 1 1
# ...
# $ hispanic : Factor w/ 2 levels "no","yes": 1 1 1 1 1 ...
# $ experience: num 13 24 43 17 44.5 ...
# $ kidslt6 : int 2 0 0 0 0 ...
# $ kids618 : int 1 1 0 1 0 ...
# $ husby : num 12 1.2 ...
# $ region : Factor w/ 4 levels "other","northcentral",..: 2
# 2 2 2 2 ...
# $ wght : int 214986 210119 219955 210317 219955 ...
There are seven factors (one of which, education, is ordered), two numeric variables and four integer ones. The two kids variables must be discrete and limited, as drawing up tables or plotting them would confirm.
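Drawing up those tables is a one-liner; a minimal sketch (assuming the Ecdat package is installed):

```r
# Tabulate the two kids variables to confirm they are discrete
# with only a small number of values
data(HI, package = "Ecdat")
table(HI$kidslt6)
table(HI$kids618)
```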
Figures 8.1 and 8.2 show histograms of the four continuous variables and barcharts of the remaining variables respectively. The histogram display is drawn by forming a new dataset in which the continuous variables are stacked, so that all the plots can be drawn together with one line of code. (This could be done for the barcharts too, but then any category-ordering information for nominal variables would be lost.)
The hours worked by the women have two distinct modes, which on closer inspection of the data turn out to be 0 hours (i.e., not working) and 40 hours. The odd shape of the histogram of experience is due to the default binwidth, but the overall shape is clear. The extended axis below zero is surprising, and enlarging the window vertically would show that there are a very few cases with a value of -1. (These arise because the variable is defined as years of potential work experience = age - years of education - 5.) The last two variables, husband’s income and sampling weight, are both skewed to the right. Sampling weight distributions are often skew, with a few cases having exceptionally large weights. The mode in the salary histogram at 100 turns out to be real, with 478 husbands’ incomes reported as $99,999 a year! This can be checked with something like
with(HI[HI$husby > 98 & HI$husby < 102, ], table(husby))
library(reshape2)
library(ggplot2)
HIvs <- c("whrswk", "experience", "husby", "wght")
HIs <- melt(HI[, HIvs], value.name = "HIx",
            variable.name = "HIvars")
ggplot(HIs, aes(HIx)) + geom_histogram() +
  facet_wrap(~ HIvars, scales = "free") +
  xlab("") + ylab("")
The barcharts in Figure 8.2 show the insurance variables, information on education and race, the distributions of young and older children, and the distribution by region. The variables are selected by finding out which ones have a limited number of unique values. The weighting variable has been converted to a percentage to make the scales readable.
uniqv <- function(x) length(unique(x)) < 20
vcs <- names(HI)[sapply(HI, uniqv)]
par(mfrow = n2mfrow(length(vcs)))
relativeWeight <- with(HI, wght/sum(as.numeric(wght))*100)
for(v in vcs)
barplot(tapply(relativeWeight, HI[[v]], sum), main = v)
Sometimes it is obvious which variables should be treated as continuous and which as categorical or discrete, sometimes it is ambiguous. If an age variable takes values from 0 to 100, then it makes sense to treat it as continuous, while if there are only integer values between 20 and 30 it might be better to treat it as discrete. If you guess wrongly, then draw another plot. The whole point of initial overviews is to get a feel for the data, not to draw perfect pictures. They can come later when you have learned what information in the data is worth presenting.
Instead of jumping in and producing lists of summaries (as with summary) or a large matrix of primarily scatterplots (as with plot), another approach is to begin with the basics and work up step by step. Knowing what variables of what kinds there are and how many cases is a pretty good start and can be achieved using str, as was done in §8.1. The GDA approach suggested here is to split the variables into two groups, plotting categorical and discrete variables as barcharts, while plotting the other variables, where possible, as histograms. Any special variable types left over (e.g., dates) should be dealt with separately.
Plots of individual variables give a quick view of the variable distributions and of any features that stand out. Plotting variables in groups, all histograms together and all barcharts together, is quicker than plotting them one by one and organises them neatly. The Boston housing dataset was already examined in Figure 3.9, treating all variables as continuous. In fact the dataset’s 14 variables include one binary, one discrete, and twelve numeric. In Figure 8.3 the binary and discrete variables have been plotted and in Figure 8.4 all the continuous variables.
data(Boston, package="MASS")
par(mfrow=c(1,2))
for (i in c("chas", "rad")) {
  barplot(table(Boston[, i]),
          main = paste("Barchart of", i))
}
Although neither of these is ideal for the individual variables (for instance the histogram of medv misses the collection of areas with medv = 50, which was identified in §3.3), they do offer some direct insights (variable definitions can be found on the dataset’s R help page):
vs1 <- !(names(Boston) %in% c("chas", "rad"))
grs <- n2mfrow(sum(vs1))
par(mfrow = grs)
for (i in names(Boston)[vs1]) {
  hist(Boston[, i], col = "grey70", xlab = "", ylab = "",
       main = paste("Histogram of", i))
}
Scatterplot matrices have already been mentioned as a way of studying relationships between variables and they can be very effective. There are other possibilities as well. Apart from parallel coordinate plots and trellis plots there are table lens plots, heatmaps, and glyphs. The following subsections discuss examples of several of these alternatives for the Boston dataset (variable definitions can be found on the dataset’s R help page).
The default scatterplot matrix (well, default except for using filled points rather than open circles) shown in Figure 8.5 is surprisingly informative, even with 14 variables. A number of features stand out:
As a next step you could consider adding univariate displays of the variables down the diagonal or colouring cases by membership of some subgroup. Figure 8.5 is just intended to provide a first quick look to help you to decide how to proceed. It may be a complex graphic, but it has a straightforward structure and is easy to draw. We should take advantage of the power that software can offer us nowadays.
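The plot is indeed easy to draw; a sketch of the kind of call behind Figure 8.5 (the exact options used there may differ):

```r
# Default scatterplot matrix of all 14 Boston variables,
# with filled points rather than open circles
data(Boston, package = "MASS")
pairs(Boston, pch = 16)
```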
Figure 8.6 is the default parallel coordinates plot using the function parcoord from the package MASS. Some of the features can be seen that were identified in the scatterplot matrix display, at least for those variables with adjacent axes. There is quite a lot of information on the distributions of the individual variables, such as the skewness of crim and the gaps in rad, tax, and possibly black.
Parallel coordinates are most effective used interactively, when groups of points can be selected across all axes and be compared with the rest. This can be done in the package iplots and its possible successor, the package Acinonyx, which is in development. Figure 6.19 shows the same plot, giving some of the flavour of interaction, with the points where rad= 24 highlighted in blue and with the variable axes ordered by differences between those cases and the rest.
data(Boston, package="MASS")
par(mfrow=c(1,1), mar=c(2.1, 1.1, 1.1, 1.1))
MASS::parcoord(Boston)
With a heatmap each case is represented by a row and each variable by a column. The individual cells are coloured according to the case value relative to the other values in the column. For this purpose the variables are standardised individually, either with a normal transformation to z scores or adjusted to a scale from minimum to maximum. (It is possible to colour according to all values in the dataset, although that is unwise, as it emphasises differences between the levels of different variables rather than differences between individual cases.)
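The two column-wise standardisations can be sketched like this, using the Boston data as an example:

```r
# Per-column standardisation before drawing a heatmap
data(Boston, package = "MASS")
z  <- scale(as.matrix(Boston))   # z-scores: mean 0, sd 1 in each column
mm <- apply(as.matrix(Boston), 2,
            function(x) (x - min(x)) / (max(x) - min(x)))  # min-max: [0, 1]
```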
As the orderings of cases and variables may be freely chosen, it is helpful to try to order them in an informative manner. Clustering or seriation methods can work well and you just have to bear in mind that each method will give different results. Figure 8.7 shows a heatmap of the Boston data using the package gplots.
library(gplots)
heatmap.2(as.matrix(Boston), scale="column", trace="none")
It is difficult to see much, although certain patterns are apparent, such as the group of relatively high values for the variable black (a consequence of the shape of that variable’s distribution) and the blocks of equally shaded values on some variables in the lower section of the plot. The colour legend top left with the superimposed histogram of values is useful: it shows that although there are many possible shades, most of the data values lie in the centre of the scale and only a few shades have actually been used. Using a different colour palette and possibly a nonlinear scale could make the display more enlightening, and experimenting with various colour schemes and clustering methods might reveal additional information.
All this makes heatmaps a fairly subjective tool and it is one of those graphic displays which can be effective for particular structures in some datasets, but which cannot be relied upon to produce good results in general.
With glyphs each case is represented by a multivariate symbol reflecting the case’s variable values. As with heatmaps, each variable must first be standardised in some way, and this can strongly influence how the display looks. The type of symbol used also makes a big difference: it could be the oft-discussed Chernoff faces, profile charts, star shapes, or some other form. Whatever is used must have at least as many dimensions as there are variables in the dataset, and each variable is allocated to one (or more) of those dimensions.
In Figure 8.8 only glyphs for the first four cases have been drawn to show some of the details of the plot. The stars function has been used and the segments have been coloured for better effect using a rainbow palette. You really need a big screen or zooming capability to appreciate the display of the full dataset.
par(mar=c(1.1, 1.1, 1.1, 1.1))
palette(rainbow(14, s = 0.6, v = 0.75))
stars(Boston[1:4,], labels=NULL, draw.segments = TRUE)
Figure 8.9 shows the result for the whole Boston dataset. It looks as if there are several distinct groups in the dataset, as groups of different shapes can be seen; surprisingly, the data already show some evidence of grouping. As with heatmaps, the allocation of the variables to the dimensions (in Figure 8.9 this is the ordering of the variables round the star), the scales used, and the ordering of the cases can strongly influence what information can be detected.
stars(Boston, labels=NULL, draw.segments = TRUE)
All the displays in the previous section are primarily for continuous variables, although they can sometimes be useful for categorical variables too. If you want to look at a small group of categorical variables together, then some kind of mosaicplot is best. This can be useful in checking experiments to see whether a study is unbalanced, and, if so, how.
The famous barley dataset, which Cleveland reanalysed in his book [Cleveland, 1993], has three categorical variables and one yield measurement for each combination. You can immediately see that the experiment is balanced by drawing a mosaicplot of the categorical variables and observing the resulting regular pattern.
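The balance check can be sketched directly (the barley data ship with the lattice package; base R’s mosaicplot is used here rather than a particular mosaic function):

```r
# A balanced design: every site-by-year cell contains all ten varieties
data(barley, package = "lattice")
table(barley$site, barley$year)            # every cell count is the same
mosaicplot(~ site + year, data = barley)   # a regular grid of equal rectangles
```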
The dataset foster in the package HSAUR2 has two categorical variables, the mother’s genotype and a litter genotype. Figure 8.10 shows that the structure is unbalanced, as the rectangles representing the variable combinations have different sizes.
data(foster, package="HSAUR2")
library(vcd)
mosaic(~ litgen + motgen, data = foster)
It is more informative to use a multiple barcharts version of a mosaicplot, which can be drawn using ggplot2’s functionality, as in Figure 8.11. Then it is easier to see just which groups are smaller or bigger than average.
As was mentioned in Chapter 7, there are distinct limits to the numbers of categorical variable combinations that can reasonably be displayed and understood. This is fine for monitoring experimental designs, as experiments usually only have a restricted number of combinations. It can be an issue in large surveys where there may be many classifying variables to be taken into account. The dataset HI in Ecdat includes information for 22,272 married women on region (4 categories), race (3), education (6), as well as three variables on insurance status amounting to 6 different categorisations. This gives 432 combinations in total that might be of interest. There is also a weighting for each case.
ggplot(data=foster, aes(motgen)) + geom_bar() +
  facet_grid(litgen ~ .) + xlab("") + ylab("") +
  scale_y_continuous(breaks=seq(0,6,3)) +
  labs(title="litter genotype by mother's genotype")
Sometimes there are known groups in a dataset and it is important to get an overview of variable values split by group. There are two typical situations that arise: you can have several grouping or conditioning variables and just a couple of variables to display or you can have a single grouping variable and many variables to display.
Trellis graphics [Becker et al., 1996] are ideal for the first case. They were introduced some twenty years ago to effectively display data for large numbers of subsets. Each component plot or panel shows the same basic display for a different subset of the data, but each has the same scaling to allow comparison. Subsets can be defined by categorical variables, by discretisations of continuous variables and by combinations of variables. Each of the individual panels is a conditional view of the data.
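Conditioning on a discretised continuous variable can be sketched with lattice’s equal.count shingles; the Boston variables chosen here are purely illustrative:

```r
# Panels conditioned on overlapping intervals (shingles) of a
# continuous variable: medv against lstat, split by crime rate
library(lattice)
data(Boston, package = "MASS")
xyplot(medv ~ lstat | equal.count(crim, 4), data = Boston)
```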
In R you can use the packages lattice or ggplot2 for this. There was an excellent comparison of the two approaches on the blog Learning R [rlearnr, 2009] in 2009. Trellis graphics can be very effective and some people use them a lot. There is extensive information about the lattice package in [Sarkar, 2008] and on the accompanying webpage. Information on ggplot2 is available in [Wickham, 2009] and on the ggplot2 webpages.
Figure 8.12 shows a lattice display for the barley dataset. Apart from using filled circles for the points rather than open circles, this plot just uses the graphics defaults. It shows that the increasing yields across the sites from Grand Rapids to Waseca hold for all varieties except for Peatland. You can also verify Cleveland’s observation, that the 1932 yields are almost always less than the 1931 yields, except for Morris where it is the other way round. This is easier to see with Cleveland’s plot, where there are six panel plots, one for each site, rather than this plot where there are ten panel plots, one for each variety. On the other hand, the different pattern for the Peatland variety is easier to see with this plot. As always it is best to look at a selection of graphics. Cleveland concluded that the Morris data for the two years must have been switched. Recently Wright has re-examined the dataset using additional sources of supplementary data and suggests that in the light of the variability in the data he has found, the Morris data are quite plausible as they are [Wright, 2013].
Trellis graphics may be drawn in many different ways, depending on the choice of panel variables and panel plot, on the conditioning variables and what order they are in, on the order of categories within a conditioning variable (lattice plots the sites and varieties for barley in increasing order of median yield, as suggested by Cleveland, because that is how the factors are ordered in the R dataset), and on how the individual plots are arranged on the page. How good the plots look and how informative they are also depend very much on the size and aspect ratio of the overall display. Draw Figure 8.12 yourself and experiment with growing and shrinking the window in both directions.
library(lattice)
data(barley, package="lattice")
dotplot(site ~ yield | variety, data = barley,
        groups = year, columns = 2, pch = 16, col = c("red", "blue"),
        key = list(text = list(levels(barley$year)),
                   points = list(pch = 16, col = c("red", "blue"))),
        xlab = "Barley Yield (bushels/acre)", ylab = NULL,
        main = "Barley Yields by Site for ten Varieties")
Like mosaicplots, trellis graphics can in theory include unlimited numbers of combinations; in practice, the individual plots become too small if you try to get everything on one page. When trellis graphics were first introduced, applications were described using hundreds of plots printed on many, many pages. This is a sensible approach if you are looking for individual plots which stand out, although it is difficult to get an overview and an idea of overall structure. For designed experiments and other structured datasets it will often be possible to organise all plots on one page, as the number of combinations usually remains limited.
When there is only one grouping variable, it is interesting to look at how all the other variables vary in parallel, and for this a group plot can be drawn. Either there is a column for each variable to be displayed and a row for each group, or the other way round. Mostly columns are better for comparisons, unless boxplots are used. Continuous variables may be plotted as histograms (density estimates or other displays could be used), whereas categorical and ordinal variables may be plotted as barcharts. In group plots all plots for the same variables are drawn to the same scale to aid comparison.
Figure 8.13 shows an example for the uniranks dataset for UK universities from the GDAdata package, which was discussed in §6.6. Histograms have been drawn for all nine variables for each of the six groups of universities. Rather than loop through the variables, the code constructs a long version of the dataset and then uses facetting to arrange the plots. The scales = "free_x" option allows each column to have its own x-axis scale.
data(uniranks, package="GDAdata")
names(uniranks)[c(5, 6, 8, 9, 10, 11, 13)] <- c("AvTeach",
    "NSSTeach", "SpendperSt", "StudentStaffR",
    "Careers", "VAddScore", "NSSFeedb")
ur2 <- melt(uniranks[, c(3, 5:13)], id.vars = "UniGroup",
            variable.name = "uniV", value.name = "uniX")
ggplot(ur2, aes(uniX)) + geom_histogram() + xlab("") + ylab("") +
  facet_grid(UniGroup ~ uniV, scales = "free_x")
Several interesting features can be seen: the top performance of the Russell group, the good performance of the 1994 group, the range of performances for the universities which do not belong to any group, and the roughly equally poor showing of the other three groups. Inspecting individual columns more closely, you can see the sharp divisions on EntryTariff between the groups and to a lesser extent a similar effect for AvTeachScore.
Both trellis graphics and group plots are made up of lots of individual plots. With trellis displays every individual plot is of the same type, shows the same variables, and has the same scale, they just each show a different subset of the data. With group plots each column is like a trellis plot, but different columns show different variables and have different scales while each row shows the same subset. For all continuous variables or all categorical variables each row is like a set of histograms or barcharts respectively, treating that subset as a dataset in its own right, as described in Section 8.2. The only differences are that the plots are drawn in one row and the scaling of the individual plots depends on the scaling of the other plots in the same column for the other subsets.
Group plots are not seen very often, which is perhaps surprising. They are simple to understand and they offer effective multivariate comparisons. Since their main advantage lies in comparisons down columns there is no drawback in drawing several of them if there are many variables involved. Figure 8.13 is just about large enough to include all nine variables. Had there been more variables, then they could have been drawn in several displays over more than one page.
If a first look shows that distributions of variables are highly skewed, it may be constructive to transform the variables. The Box-Cox family of transformations is a good choice. If nominal variables have many categories with small frequencies, it may be helpful to combine or delete some. Otherwise, tests mentioned in Chapters 3 and 4 may be appropriate.
When variables show associations in a scatterplot matrix, then linear or non-linear models, depending on the form of association, should be considered.
If one or more panel plots stand out as being different in a trellis display, then a linear model based on the conditioning factors might be used to confirm this.
Discriminant analysis could be used in conjunction with group plots, as could other supervised learning methods, including Support Vector Machines.
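Some of these follow-up steps are one-liners in R. For instance, a Box-Cox transformation can be profiled with MASS::boxcox; the model and variables below (from the Boston data) are only illustrative:

```r
# Profile the Box-Cox power parameter for a positive response
library(MASS)
data(Boston, package = "MASS")
bc <- boxcox(medv ~ lstat, data = Boston, plotit = FALSE)
bc$x[which.max(bc$y)]  # lambda with the highest profile log-likelihood
```

Values of lambda near 0 suggest a log transformation, values near 1 suggest leaving the variable as it is.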
What kinds of variables are there in this dataset? What plots would you recommend to help people get to know the dataset?
Longley’s dataset is well known as an example for highly collinear regression. Can you see this from the scatterplot matrix? Are there other features worth noting?
What kinds of variable are there? Is there anything interesting or unusual in the univariate distributions? Compare what information you might get from each of the multivariate displays discussed in this chapter: a scatterplot matrix, a parallel coordinates plot, a heatmap, and a collection of glyphs.
[Venables and Ripley, 2002] uses this dataset in discussing classification and discrimination. The authors initially transform to a log scale and then write that “The data are very highly correlated and scatterplot matrices and brush plots [i.e. interactive graphics] are none too revealing.” Using graphical overviews, comment on whether the log transform was a good idea and whether you agree with their statement on the correlations.
Data about diabetes amongst adult females of the Pima Indians is available in R, in the packages MASS and MMST. Both use the 532 cases with complete records. A larger version of the dataset with 768 cases is available from the UCI Machine Learning Repository [Bache and Lichman, 2013]. Download the larger dataset and give an overview of the differences between the cases available in R and the rest using two groups plots, one for the continuous variables and one for the variables npreg and type.
The dataset Exam in mlmRev is used to illustrate multilevel modelling. Prepare an initial graphical summary of the data and summarise your results in three main conclusions.
The dataset CollegeDistance in AER is from a survey of high school students. If you had to prepare a one-page summary of the main information you can find by exploring the data, what graphics would you use?
Numbers of various kinds of traffic fatalities for each of six years are given for each state in the contiguous United States in the dataset Fatalities from the package AER. There are 32 variable values for each state and year. Carry out an initial exploratory analysis to decide what information you would present to give people a first impression of the data.
The painting Britain at Play by L.S. Lowry hangs in the Usher Gallery in Lincoln. Does the title match the picture well? Can you see how the British ‘played’ in the early 1940s?