Chapter 8

Getting an Overview

Diagrams prove nothing, but bring outstanding features readily to the eye.

R. A. Fisher ([Fisher, 1925])

Summary

Chapter 8 discusses getting an initial overall view of a dataset.

8.1 Introduction

When you first start work on a dataset it is important to learn what variables it includes and what the data are like. There will usually be some initial analysis goals, but it is still necessary to look over the dataset to ensure that you know what you are dealing with. There could be issues of data quality, perhaps missing values and outliers (discussed in the next chapter), and there could just be some surprising basic statistics.

There are several different functions in R for showing what is in a dataset. You can show the whole dataset (inadvisable for even moderately sized datasets), display only the first few lines (using the function head), or just list the variables, each with their type and a few values (using the function str). You can summarise the dataset using the function summary (base R) or the function describe (Hmisc) or the function whatis (YaleToolkit). There are doubtless other statistical alternatives. Plotting the dataset is the alternative, complementary approach for getting an overview and the one we will be concentrating on in this chapter.
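As a minimal sketch of these text-based first looks, the following uses the built-in airquality data frame as a stand-in. That choice is an assumption made so that the example runs without installing extra packages; airquality is not one of the datasets discussed in this chapter.

```r
# Stand-in dataset (assumption): airquality ships with base R in package datasets
df <- airquality

dim(df)        # number of cases and number of variables
head(df)       # the first six rows
str(df)        # each variable with its type and a few values
summary(df)    # basic statistics, including counts of missing values

# Count the missing values per variable explicitly
n_missing <- colSums(is.na(df))
n_missing
```

None of these calls draws anything, which is why the graphical overviews that follow are a useful complement.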

As an example, consider the dataset HI from the package Ecdat on the effect of health insurance on women’s working hours. The data were analysed in [Olsen, 1998]. First information can be gained with str (the results are shown on the next page) and some simple graphics (Figures 8.1 and 8.2).

data(HI, package="Ecdat")
str(HI)
#  'data.frame': 22272 obs. of 13 variables:
#  $ whrswk : int 0 50 40 40 0 ...
#  $ hhi : Factor w/ 2 levels "no","yes": 1 1 2 1 2 ...
#  $ whi : Factor w/ 2 levels "no","yes": 1 2 1 2 1 ...
#  $ hhi2 : Factor w/ 2 levels "no","yes": 1 1 2 2 2 ...
#  $ education : Ord.factor w/ 6 levels
#  "<9years"<"9-11years"<..: 4 4 3 4 2 ...
#  $ race : Factor w/ 3 levels "white","black",..: 1 1 1 1 1
#  ...
#  $ hispanic : Factor w/ 2 levels "no","yes": 1 1 1 1 1 ...
#  $ experience: num 13 24 43 17 44.5 ...
#  $ kidslt6 : int 2 0 0 0 0 ...
#  $ kids618 : int 1 1 0 1 0 ...
#  $ husby : num 12 1.2 ...
#  $ region : Factor w/ 4 levels "other","northcentral",..: 2
#  2 2 2 2 ...
#  $ wght : int 214986 210119 219955 210317 219955 ...

Figure 8.1

Histograms of the four continuous variables in the HI dataset from Ecdat. Women mostly work 40 hours per week or do not work at all. Potential years of experience range from 0 to around 50. A large group of husbands have little or no reported income. The distribution of case weights is skewed to the right.

There are seven factors (one of which, education, is ordered), two numeric variables and four integer ones. The two kids variables must be discrete and limited, as drawing up tables or plotting them would confirm.

Figures 8.1 and 8.2 show histograms of the four continuous variables and barcharts of the remaining variables respectively. The histograms’ display is drawn by forming a new dataset, stacking the continuous variables, so that all the plots can be drawn together with one line of code. (This could be done for the barcharts too, but then any category-ordering information for nominal variables would be lost.)

The hours worked by the women have two distinct modes, which on closer inspection of the data turn out to be 0 hours (i.e., not working) and 40 hours. The odd shape of the histogram of experience is due to the default binwidth, but the overall shape is clear. The extended axis below zero is surprising and growing the window vertically would show that there are a very few cases with a value of -1. (These arise because the variable is defined as years of potential work experience = age - years of education - 5.) Both the last two variables, husband’s income and sampling weight, are skewed to the right. Sampling weight distributions are often skewed, with a few cases having exceptionally large weights. The mode in the salary histogram at 100 turns out to be real, with 478 husbands’ incomes reported as $99,999 a year! This can be checked with something like

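# Check the mode near 100: tabulate the husby values between 98 and 102
with(HI[HI$husby > 98 & HI$husby < 102, ], table(husby))

# Code for Figure 8.1: stack the continuous variables, then facet the histograms
library(ggplot2)
library(reshape2)
HIvs <- c("whrswk", "experience", "husby", "wght")
HIs <- melt(HI[, HIvs], value.name = "HIx",
   variable.name = "HIvars")
ggplot(HIs, aes(HIx)) + geom_histogram() +
   facet_wrap(~ HIvars, scales = "free") +
   xlab("") + ylab("")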

The barcharts in Figure 8.2 show the insurance variables, information on education and race, the distributions of young and older children, and the distribution by region. The variables are selected by finding out which ones have a limited number of unique values. The weighting variable has been converted to a percentage to make the scales readable.

uniqv <- function(x) length(unique(x)) < 20
vcs <- names(HI)[sapply(HI, uniqv)]
par(mfrow = n2mfrow(length(vcs)))
relativeWeight <- with(HI, wght/sum(as.numeric(wght))*100)
for(v in vcs)
 barplot(tapply(relativeWeight, HI[[v]], sum), main = v)

Figure 8.2

Barcharts of the nine other variables in the HI dataset in Ecdat. The top row shows that about half the women are covered by their husband’s insurance and a minority by their own, and that an equivalent minority of husbands have no job health insurance. The middle row shows that most women have twelve or more years of education and that there are few blacks and Hispanics in the study. The final plot at the bottom right shows that more participants were from the South.

Sometimes it is obvious which variables should be treated as continuous and which as categorical or discrete; sometimes it is ambiguous. If an age variable takes values from 0 to 100, then it makes sense to treat it as continuous, while if there are only integer values between 20 and 30 it might be better to treat it as discrete. If you guess wrongly, then draw another plot. The whole point of initial overviews is to get a feel for the data, not to draw perfect pictures. Those can come later when you have learned what information in the data is worth presenting.

8.2 Many individual displays

Instead of jumping in and producing lists of summaries (as with summary) or a large matrix of primarily scatterplots (as with plot), another approach is to begin with the basics and work up step by step. Knowing what variables of what kinds there are and how many cases is a pretty good start and can be achieved using str, as was done in §8.1. The GDA approach suggested here is to split the variables into two groups, plotting categorical and discrete variables as barcharts, while plotting the other variables, where possible, as histograms. Any special variable types left over (e.g., dates) should be dealt with separately.
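The split into two groups can be automated in a rough way. The sketch below is one possible approach, not a fixed rule: split_vars is a hypothetical helper, and its cut-off of 20 unique values is an assumption borrowed from the uniqv function in §8.1.

```r
# Hypothetical helper: sort the variables of a data frame into three groups
# for a first overview. The threshold max_levels = 20 is an assumption.
split_vars <- function(df, max_levels = 20) {
  few <- sapply(df, function(x) is.factor(x) || length(unique(x)) < max_levels)
  num <- sapply(df, is.numeric)
  list(
    barcharts  = names(df)[few],          # categorical and discrete variables
    histograms = names(df)[!few & num],   # remaining numeric variables
    other      = names(df)[!few & !num]   # e.g. dates, to be handled separately
  )
}

data(Boston, package = "MASS")
groups <- split_vars(Boston)
groups$barcharts
groups$histograms
```

If the automatic split puts a variable in the wrong group, move it by hand and draw the plot again.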

Plots of individual variables give a quick view of the variable distributions and of any features that stand out. Plotting variables in groups, all histograms together and all barcharts together, is quicker than plotting them one by one and organises them neatly. The Boston housing dataset was already examined in Figure 3.9, treating all variables as continuous. In fact the dataset’s 14 variables include one binary, one discrete, and twelve numeric. In Figure 8.3 the binary and discrete variables have been plotted and in Figure 8.4 all the continuous variables.

data(Boston, package="MASS")
par(mfrow=c(1,2))
for (i in c("chas", "rad")) {
 barplot(table(Boston[, i]),
 main=(paste("Barchart of", i)))
}

Figure 8.3

Barcharts of the variables chas (whether the tract bounds the Charles River or not) and rad (index of accessibility to radial highways) from the Boston dataset.

Figure 8.4

Histograms of the twelve continuous variables from the Boston dataset.

Although neither of these is ideal for the individual variables (for instance the histogram of medv misses the collection of areas with medv = 50, which was identified in §3.3), they do offer some direct insights (variable definitions can be found on the dataset’s R help page):

  1. There are few areas with chas = 1.
  2. Three values, 4, 5, and 24, dominate the variable rad. The value 24 is quite separate from the others in value (the plot fails to show that clearly because barcharts treat a discrete variable’s numbers as category names).
  3. The variables tax and indus have gaps that could be investigated.
  4. Three variables are highly skewed (crim, zn, and black).
  5. And there are suggestive features in other variables to be studied more closely.

# Code for Figure 8.4: histograms of all variables except chas and rad
vs1 <- !(names(Boston) %in% c("chas", "rad"))
grs <- n2mfrow(sum(vs1))
par(mfrow = grs)
for (i in names(Boston)[vs1]) {
  hist(Boston[, i], col = "grey70", xlab = "", ylab = "",
       main = paste("Histogram of", i))
}
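The remark in point 2, that a barchart hides the gap between rad = 24 and the other values, can be checked with a short sketch. It relies on plot() drawing a one-way table with numeric labels as vertical lines at their numeric positions; the side-by-side layout is just a choice made for this illustration.

```r
data(Boston, package = "MASS")
rad_counts <- table(Boston$rad)

par(mfrow = c(1, 2))
# Barchart: the values of rad are treated as category names, equally spaced
barplot(rad_counts, main = "rad as categories")
# plot() on a one-way table with numeric labels keeps the numeric spacing
plot(rad_counts, main = "rad on a numeric axis", xlab = "", ylab = "")
par(mfrow = c(1, 1))

# The pile-up at medv = 50 that the default histogram hides
sum(Boston$medv == 50)
```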

8.3 Multivariate overviews

Scatterplot matrices have already been mentioned as a way of studying relationships between variables and they can be very effective. There are other possibilities as well. Apart from parallel coordinate plots and trellis plots there are tablelens plots, heatmaps, and glyphs. The following subsections discuss examples of several of these alternatives for the Boston dataset (variable definitions can be found on the dataset’s R help page).

Scatterplot matrices

The default scatterplot matrix (well, default except for using points rather than open circles) shown in Figure 8.5 is surprisingly informative, even with 14 variables. A number of features stand out:

  1. The curious either/or relationship between crim and zn.
  2. The dependencies of medv on lstat and rm.
  3. The fact that tax, ptratio, zn, and indus have only single values for the extreme value 24 of rad.
  4. The associations between age and dis and nox.
  5. And some potential bivariate outliers such as the point in lstat and age or the set of high values of nox, which have the same value for rad, tax, and ptratio.

Figure 8.5

A scatterplot matrix for all fourteen variables in the Boston dataset.

As a next step you could consider adding univariate displays of the variables down the diagonal or colouring cases by membership of some subgroup. Figure 8.5 is just intended to provide a first quick look to help you to decide how to proceed. It may be a complex graphic, but it has a straightforward structure and is easy to draw. We should take advantage of the power that software can offer us nowadays.

plot(Boston, pch = 16)
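The two suggested next steps, univariate displays on the diagonal and colouring by subgroup, might be sketched as follows. The helper panel_hist is modelled on the example in the help page for pairs; the choice of four variables and of rad = 24 as the highlighted subgroup are assumptions made for illustration.

```r
data(Boston, package = "MASS")

# Diagonal panel: a histogram rescaled to fit the panel
# (modelled on the panel.hist example in ?pairs)
panel_hist <- function(x, ...) {
  usr <- par("usr")
  on.exit(par(usr = usr))              # restore the panel coordinates afterwards
  par(usr = c(usr[1:2], 0, 1.5))
  h <- hist(x, plot = FALSE)
  nB <- length(h$breaks)
  rect(h$breaks[-nB], 0, h$breaks[-1], h$counts / max(h$counts), col = "grey70")
}

# Colour the cases by membership of a subgroup (here: rad equal to 24)
case_col <- ifelse(Boston$rad == 24, "red", "grey40")

vars <- c("medv", "lstat", "rm", "crim")   # a small subset keeps the panels legible
pairs(Boston[, vars], pch = 16, cex = 0.5, col = case_col,
      diag.panel = panel_hist)
```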

Parallel coordinate plots

Figure 8.6 is the default parallel coordinates plot using the function parcoord from the package MASS. Some of the features can be seen that were identified in the scatterplot matrix display, at least for those variables with adjacent axes. There is quite a lot of information on the distributions of the individual variables, such as the skewness of crim and the gaps in rad, tax, and possibly black.

Figure 8.6

A parallel coordinates plot for all fourteen variables in the Boston dataset. Details are discussed in the text.

Parallel coordinates are most effective used interactively, when groups of points can be selected across all axes and be compared with the rest. This can be done in the package iplots and its possible successor, the package Acinonyx, which is in development. Figure 6.19 shows the same plot, giving some of the flavour of interaction, with the points where rad = 24 highlighted in blue and with the variable axes ordered by differences between those cases and the rest.

data(Boston, package="MASS")
par(mfrow=c(1,1), mar=c(2.1, 1.1, 1.1, 1.1))
MASS::parcoord(Boston)
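A static approximation to selecting a group interactively is to colour the selected cases and draw them last. This is only a sketch of the idea: the axes are left in their original order, and the colours are arbitrary choices.

```r
data(Boston, package = "MASS")

selected <- Boston$rad == 24                  # the group to highlight
ord <- order(selected)                        # unselected cases first, selected last
case_col <- ifelse(selected, "blue", "grey80")

# parcoord draws one line per row, so reordering the rows puts the
# highlighted lines on top of the others
MASS::parcoord(Boston[ord, ], col = case_col[ord])
```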

Heatmaps

With a heatmap each case is represented by a row and each variable by a column. The individual cells are coloured according to the case value relative to the other values in the column. For this purpose the variables are standardised individually, either with a normal transformation to z scores or adjusted to a scale from minimum to maximum. (It is possible to colour according to all values in the dataset, although that is unwise, as it emphasises differences between the levels of different variables rather than differences between individual cases.)

As the orderings of cases and variables may be freely chosen, it is helpful to try to order them in an informative manner. Clustering or seriation methods can work well and you just have to bear in mind that each method will give different results. Figure 8.7 shows a heatmap of the Boston data using the package gplots.

library(gplots)
heatmap.2(as.matrix(Boston), scale="column", trace="none")
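The two column-wise standardisations mentioned above can be written out directly. This is a minimal sketch for comparison only; the option scale="column" used in the heatmap call corresponds, as far as we are aware, to the z score version.

```r
data(Boston, package = "MASS")
X <- as.matrix(Boston)

# z scores: subtract the column mean and divide by the column standard deviation
z_scores <- scale(X)

# Minimum to maximum: rescale every column to the interval [0, 1]
min_max <- apply(X, 2, function(x) (x - min(x)) / (max(x) - min(x)))

round(colMeans(z_scores), 10)   # all zero after centring
range(min_max)                  # 0 and 1
```

A skewed variable such as crim looks very different under the two scalings, which is one reason why heatmaps of the same data can differ so much.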

Figure 8.7

A heatmap for the Boston dataset. Some features can be discerned with a little effort (see text).

It is difficult to see much, although certain patterns are apparent, such as the group of relatively high values for the variable black (a consequence of the shape of its distribution) and the blocks of equally shaded values on some variables in the lower section of the plot. The colour legend top left with the superimposed histogram of values is useful, because we can see that although there are many different possible shades, most of the data values lie in the centre of the scale and only a few shades have been used. Using a different colour palette and possibly a nonlinear scale could make the display more enlightening. Experimenting with various colour schemes and clustering methods might reveal additional information.
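One such experiment is sketched below using the base R function heatmap rather than heatmap.2, so that no extra package is needed. The Ward clustering and the sequential palette are arbitrary choices for illustration, not recommendations.

```r
data(Boston, package = "MASS")
X <- as.matrix(Boston)

# An alternative clustering method for ordering rows and columns (assumption)
ward <- function(d) hclust(d, method = "ward.D2")

# heatmap() returns, invisibly, the row and column orderings it used
hm <- heatmap(X, scale = "column", hclustfun = ward,
              col = hcl.colors(25, "YlOrRd", rev = TRUE),
              labRow = NA)    # suppress the 506 row labels

head(hm$rowInd)   # row order chosen by the clustering
hm$colInd         # column order chosen by the clustering
```

Comparing the orderings returned by different clustering methods is a quick way to see how much the picture depends on them.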

All this makes heatmaps a fairly subjective tool and it is one of those graphic displays which can be effective for particular structures in some datasets, but which cannot be relied upon to produce good results in general.

Glyphs

With glyphs each case is represented by a multivariate symbol reflecting the case’s variable values. As for heatmaps, each variable must be standardised in some way first and this can influence the way the display looks a lot. The type of symbol used is also relevant and makes a big difference. It could be the oft discussed Chernoff faces, profile charts, star shapes, or some other form. Whatever is used must have at least the same number of dimensions as the number of variables in the dataset and each variable is allocated to one (or more) dimensions.

In Figure 8.8 only glyphs for the first four cases have been drawn to show some of the details of the plot. The stars function has been used and the segments have been coloured for better effect using a rainbow palette. You really need a big screen or zooming capability to appreciate the display of the full dataset.

par(mar=c(1.1, 1.1, 1.1, 1.1))
palette(rainbow(14, s = 0.6, v = 0.75))
stars(Boston[1:4,], labels=NULL, draw.segments = TRUE)

Figure 8.8

Glyphs (stars) for the first four cases of the Boston dataset.

Figure 8.9 shows the result for the whole Boston dataset. It looks as if there are several distinct groups in the dataset, as we can see groups of different shapes. Surprisingly, the data already show some evidence of grouping. As with heatmaps, the allocation of the variables to the dimensions (in Figure 8.9 this is the ordering of the variables round the star), the scales used, and the ordering of the cases can strongly influence what information can be detected.

stars(Boston, labels=NULL, draw.segments = TRUE)

Figure 8.9

Glyphs (stars) for the Boston dataset. The bigger glyphs towards the foot of the plot are the cases where the variable rad takes the value 24.

8.4 Multivariate overviews for categorical variables

All the displays in the previous section are primarily for continuous variables, although they can sometimes be useful for categorical variables too. If you want to look at a small group of categorical variables together, then some kind of mosaicplot is best. This can be useful in checking experiments to see whether a study is unbalanced, and, if so, how.

The famous barley dataset, which Cleveland reanalysed in his book [Cleveland, 1993], has three categorical variables and one yield measurement for each combination. You can immediately see that the experiment is balanced by drawing a mosaicplot of the categorical variables and observing the resulting regular pattern.
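A minimal sketch of this check, using the copy of the barley data in the lattice package and the base graphics function mosaicplot, could look like this. The numerical test is an addition for illustration.

```r
data(barley, package = "lattice")

# Number of observations for each combination of the three factors
cell_counts <- table(barley$variety, barley$site, barley$year)
all(cell_counts == 1)   # TRUE when every combination occurs exactly once

# A balanced design gives a completely regular pattern of equal rectangles
mosaicplot(~ site + variety + year, data = barley, main = "", las = 2)
```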

The dataset foster in the package HSAUR2 has two categorical variables, the mother’s genotype and a litter genotype. Figure 8.10 shows that the structure is unbalanced, as the rectangles representing the variable combinations have different sizes.

library(vcd)   # mosaic() with a formula interface is provided by the vcd package
data(foster, package="HSAUR2")
mosaic(~litgen+motgen, data=foster)

Figure 8.10

Genotype groups in the foster dataset using a mosaicplot. There are not equal numbers in the different combinations.

It is more informative to use a multiple barcharts version of a mosaicplot, which can be drawn using ggplot2’s functionality, as in Figure 8.11. Then it is easier to see just which groups are smaller or bigger than average.

Figure 8.11

Genotype groups in the foster dataset using a multiple barchart. Numbers of rats vary across the different genotype combinations.

As was mentioned in Chapter 7, there are distinct limits to the numbers of categorical variable combinations that can reasonably be displayed and understood. This is fine for monitoring experimental designs, as experiments usually only have a restricted number of combinations. It can be an issue in large surveys where there may be many classifying variables to be taken into account. The dataset HI in Ecdat includes information for 22,272 married women on region (4 categories), race (3), education (6), as well as three variables on insurance status amounting to 6 different categorisations. This gives 432 combinations in total that might be of interest. There is also a weighting for each case.

# Code for Figure 8.11: one barchart of mother's genotype per litter genotype
library(ggplot2)
ggplot(data=foster, aes(motgen)) + geom_bar() +
  facet_grid(litgen~ .) + xlab("") + ylab("") +
  scale_y_continuous(breaks=seq(0,6,3)) +
  labs(title="litter genotype by mother’s genotype")
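The figure of 432 combinations quoted for the HI dataset is just the product of the numbers of categories. The sketch below reproduces that arithmetic and adds a hypothetical helper, count_combinations, for comparing possible with observed combinations; it is demonstrated on the built-in mtcars data, an assumption made so that the example runs without extra packages.

```r
# Categories per classifying variable, as given in the text
n_levels <- c(region = 4, race = 3, education = 6, insurance = 6)
prod(n_levels)   # 432

# Hypothetical helper: how many combinations are possible and how many occur?
count_combinations <- function(df, vars) {
  possible <- prod(sapply(df[vars], function(x) length(unique(x))))
  observed <- nlevels(interaction(df[vars], drop = TRUE))
  list(possible = possible, observed = observed)
}

# Stand-in example (assumption): three discrete variables from mtcars
res <- count_combinations(mtcars, c("cyl", "gear", "am"))
res
```

When far fewer combinations are observed than are possible, many cells of a mosaicplot or trellis display will be empty.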

8.5 Graphics by group

Sometimes there are known groups in a dataset and it is important to get an overview of variable values split by group. There are two typical situations that arise: you can have several grouping or conditioning variables and just a couple of variables to display or you can have a single grouping variable and many variables to display.

Trellis graphics

Trellis graphics [Becker et al., 1996] are ideal for the first case. They were introduced some twenty years ago to effectively display data for large numbers of subsets. Each component plot or panel shows the same basic display for a different subset of the data, but each has the same scaling to allow comparison. Subsets can be defined by categorical variables, by discretisations of continuous variables and by combinations of variables. Each of the individual panels is a conditional view of the data.

In R you can use the packages lattice or ggplot2 for this. There was an excellent comparison of the two approaches on the blog Learning R [rlearnr, 2009] in 2009. Trellis graphics can be very effective and some people use them a lot. There is extensive information about the lattice package in [Sarkar, 2008] and on the accompanying webpage. Information on ggplot2 is available in [Wickham, 2009] and on the ggplot2 webpages.

Figure 8.12 shows a lattice display for the barley dataset. Apart from using filled circles for the points rather than open circles, this plot just uses the graphics defaults. It shows that the increasing yields across the sites from Grand Rapids to Waseca hold for all varieties except for Peatland. You can also verify Cleveland’s observation, that the 1932 yields are almost always less than the 1931 yields, except for Morris where it is the other way round. This is easier to see with Cleveland’s plot, where there are six panel plots, one for each site, rather than this plot where there are ten panel plots, one for each variety. On the other hand, the different pattern for the Peatland variety is easier to see with this plot. As always it is best to look at a selection of graphics. Cleveland concluded that the Morris data for the two years must have been switched. Recently Wright has re-examined the dataset using additional sources of supplementary data and suggests that in the light of the variability in the data he has found, the Morris data are quite plausible as they are [Wright, 2013].

Figure 8.12

A lattice plot of the barley data. Yields at six different sites for two years are shown in ten plots, one for each variety. Yields in 1932 were generally lower than in 1931 and there is a common pattern across sites for most varieties.

Trellis graphics may be drawn in many different ways, depending on the choice of panel variables and panel plot, depending on the conditioning variables and what order they are in, depending on the order of categories within a conditioning variable (lattice plots the sites and varieties for barley in increasing order of median yield, as suggested by Cleveland, because that is how the factors are ordered in the R dataset), and depending on how the individual plots are arranged on the page. How good the plots look and how informative they are also depend very much on the size and aspect ratio of the overall display. Draw Figure 8.12 yourself and experiment with growing and shrinking the window in both directions.

library(lattice)
data(barley, package="lattice")
dotplot(site ~ yield |variety , data = barley,
   groups = year, columns=2, pch=16, col=c("red", "blue"),
   key = list(text=list(levels(barley$year)),
   points = list(pch=16, col=c("red", "blue"))),
   xlab = "Barley Yield (bushels/acre) ", ylab=NULL,
   main="Barley Yields by Site for ten Varieties")
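For comparison, the alternative arrangement described above, with one panel for each site rather than one for each variety, can be sketched by swapping the roles of site and variety. The single-column layout and the colours are choices made for this sketch, not a reproduction of Cleveland's figure.

```r
library(lattice)
data(barley, package = "lattice")

# One panel per site, varieties on the vertical axis, years as groups
p_site <- dotplot(variety ~ yield | site, data = barley,
                  groups = year, layout = c(1, 6),
                  pch = 16, col = c("red", "blue"),
                  key = list(text = list(levels(barley$year)),
                             points = list(pch = 16, col = c("red", "blue"))),
                  xlab = "Barley Yield (bushels/acre)")
print(p_site)
```

Drawing both versions side by side makes it easier to judge which features each arrangement emphasises.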

Like mosaicplots, trellis graphics can in theory include unlimited numbers of combinations, but in practice the individual plots become too small if you try to get everything on one page. When trellis graphics were first introduced, applications were described using hundreds of plots printed on many, many pages. This is a sensible approach if you are looking for individual plots which stand out, although it is difficult to get an overview and an idea of overall structure. For designed experiments and other structured datasets it will often be possible to organise all plots on one page, as the number of combinations usually remains limited.

Group plots

When there is only one grouping variable, it is interesting to look at how all other variables vary in parallel and for this a group plot can be drawn. Either there is a column for each variable to be displayed and a row for each group or the other way round. Mostly columns are better for comparisons, unless boxplots are used. Continuous variables may be plotted as histograms (density estimates or other displays could be used), whereas categorical and ordinal variables may be plotted as barcharts. In group plots all plots for the same variables are drawn to the same scale to aid comparison.

Figure 8.13 shows an example for the uniranks dataset for UK universities from the GDAdata package, which was discussed in §6.6. Histograms have been drawn for all the nine variables for each of the six groups of universities. Rather than loop through the variables, the code constructs a long version of the dataset and then uses facetting to arrange the plots. The scales = "free_x" option allows each column to have its own x-axis scale.

library(ggplot2)
library(reshape2)
data(uniranks, package="GDAdata")
# Shorten seven of the variable names so that the facet labels fit
names(uniranks)[c(5, 6, 8, 9, 10, 11, 13)] <- c("AvTeach", "NSSTeach",
           "SpendperSt", "StudentStaffR", "Careers", "VAddScore", "NSSFeedb")
# Long format: one row for each university and variable
ur2 <- melt(uniranks[, c(3, 5:13)], id.vars="UniGroup",
            variable.name="uniV", value.name="uniX")
ggplot(ur2, aes(uniX)) + geom_histogram() + xlab("") + ylab("") +
  facet_grid(UniGroup~uniV, scales = "free_x")

Figure 8.13

A group plot of the uniranks dataset. The Russell universities tend to get top scores on all criteria (it is better to have low values for StudentStaffRatio) except for NSSFeedback. The top row representing universities who are not members of any group have a spread of values on all criteria. The 1994 universities look to be second best to the Russell universities.

Several interesting features can be seen: the top performance of the Russell group, the good performance of the 1994 group, the range of performances for the universities which do not belong to any group, and the roughly equally poor showing of the other three groups. Inspecting individual columns more closely, you can see the sharp divisions on EntryTariff between the groups and to a lesser extent a similar effect for AvTeachScore.

Both trellis graphics and group plots are made up of lots of individual plots. With trellis displays every individual plot is of the same type, shows the same variables, and has the same scale; they just each show a different subset of the data. With group plots each column is like a trellis plot, but different columns show different variables and have different scales while each row shows the same subset. For all continuous variables or all categorical variables each row is like a set of histograms or barcharts respectively, treating that subset as a dataset in its own right, as described in Section 8.2. The only differences are that the plots are drawn in one row and the scaling of the individual plots depends on the scaling of the other plots in the same column for the other subsets.
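The column-wise scaling can be made explicit in a base graphics sketch. It uses the built-in iris data as a stand-in (an assumption, so that no extra packages are needed) and shares only the horizontal scale within each column; the vertical axes are left free, which is a simplification.

```r
vars <- names(iris)[1:4]          # one column of plots per variable
groups <- levels(iris$Species)    # one row of plots per group

# Histogram breaks computed from ALL cases, so that every plot in a
# column uses the same bins and the same horizontal scale
shared_breaks <- lapply(iris[vars], function(x) pretty(range(x), n = 15))

op <- par(mfrow = c(length(groups), length(vars)), mar = c(2, 2, 2, 1))
for (g in groups) {
  for (v in vars) {
    hist(iris[iris$Species == g, v], breaks = shared_breaks[[v]],
         col = "grey70", main = paste(g, v), xlab = "", ylab = "")
  }
}
par(op)
```

Reading down a column compares the groups on one variable; reading along a row gives the overview of one group.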

Group plots are not seen very often, which is perhaps surprising. They are simple to understand and they offer effective multivariate comparisons. Since their main advantage lies in comparisons down columns there is no drawback in drawing several of them if there are many variables involved. Figure 8.13 is just about large enough to include all nine variables. Had there been more variables, then they could have been drawn in several displays over more than one page.

8.6 Modelling and testing for overviews

  1. Transformations

    If a first look shows that distributions of variables are highly skewed, it may be constructive to transform the variables. The Box-Cox family of transformations is a good choice. If nominal variables have many categories with small frequencies, it may be helpful to combine or delete some. Otherwise, tests mentioned in Chapters 3 and 4 may be appropriate.

  2. Associations (and causality)

    When variables show associations in a scatterplot matrix, then linear or non-linear models, depending on the form of association, should be considered.

  3. Linear models

    If one or more panel plots stand out as being different in a trellis display, then a linear model based on the conditioning factors might be used to confirm this.

  4. Discrimination between groups

    Discriminant analysis could be used in conjunction with group plots, as could other supervised learning methods, including Support Vector Machines.
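As an illustration of the first item, the following sketch uses the function boxcox from the package MASS to look for a suitable power transformation of the skewed variable crim in the Boston data. Fitting an intercept-only model and searching over the grid from -1 to 1 are assumptions made for this example.

```r
library(MASS)
data(Boston, package = "MASS")

# Profile log-likelihood of the Box-Cox parameter for an intercept-only model
bc <- boxcox(lm(crim ~ 1, data = Boston),
             lambda = seq(-1, 1, by = 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat    # a value near 0 would point towards a log transformation

# Compare the raw and the log-transformed distributions
par(mfrow = c(1, 2))
hist(Boston$crim, col = "grey70", main = "crim", xlab = "", ylab = "")
hist(log(Boston$crim), col = "grey70", main = "log(crim)", xlab = "", ylab = "")
par(mfrow = c(1, 1))
```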

Main points

  1. Sets of univariate graphics are good for giving a quick overview of variable values and distribution patterns (e.g., Figures 8.1 and 8.2).
  2. Scatterplot matrices are valuable for identifying bivariate patterns even with quite a few variables (cf. Figure 8.5).
  3. Parallel coordinate plots are useful for studying groups of cases and are most effective when they are interactive (§8.3).
  4. Trellis displays are excellent for comparing data on one or two variables by subsets, while group plots provide a useful overview across many variables for a small set of subgroups (cf. Figures 8.12 and 8.13).
  5. Other multivariate displays look interesting, but scaling, colours and orderings all have to be suitably chosen and the resulting graphics can still be difficult to decode (§8.3).

Exercises

  1. Cloud seeding (clouds from the package HSAUR2)

    What kinds of variables are there in this dataset? What plots would you recommend to help people get to know the dataset?

  2. Longley’s Data (longley from the package datasets)

    Longley’s dataset is well known as an example for highly collinear regression. Can you see this from the scatterplot matrix? Are there other features worth noting?

  3. US States (state.x77 from the package datasets)

    What kinds of variable are there? Is there anything interesting or unusual in the univariate distributions? Compare what information you might get from each of the multivariate displays discussed in this chapter: a scatterplot matrix, a parallel coordinates plot, a heatmap, and a collection of glyphs.

  4. Crabs (crabs from the package MASS)

    [Venables and Ripley, 2002] uses this dataset in discussing classification and discrimination. The authors initially transform to a log scale and then write that “The data are very highly correlated and scatterplot matrices and brush plots [i.e. interactive graphics] are none too revealing.” Using graphical overviews, comment on whether the log transform was a good idea and whether you agree with their statement on the correlations.

  5. Pima Indians

    Data about diabetes amongst adult females of the Pima Indians is available in R, in the packages MASS and MMST. Both use the 532 cases with complete records. A larger version of the dataset with 768 cases is available from the UCI Machine Learning Repository [Bache and Lichman, 2013]. Download the larger dataset and give an overview of the differences between the cases available in R and the rest using two group plots, one for the continuous variables and one for the variables npreg and type.

  6. Exam results in London

    The dataset Exam in mlmRev is used to illustrate multilevel modelling. Prepare an initial graphical summary of the data and summarise your results in three main conclusions.

  7. Distance to college

    The dataset CollegeDistance in AER is from a survey of high school students. If you had to prepare a one-page summary of the main information you can find by exploring the data, what graphics would you use?

  8. US traffic fatalities

    Numbers of various kinds of traffic fatalities for each of six years are given for each state in the contiguous United States in the dataset Fatalities from the package AER. There are 32 variable values for each state and year. Carry out an initial exploratory analysis to decide what information you would present to give people a first impression of the data.

  9. Intermission

    The painting Britain at Play by L.S. Lowry hangs in the Usher Gallery in Lincoln. Does the title match the picture well? Can you see how the British ‘played’ in the early 1940s?
