Chapter 4

Displaying Categorical Data

It’s not what you look at that matters, it’s what you see.

Thoreau

Summary

Chapter 4 looks at visualising single categorical variables, nominal, ordinal, or discrete.

4.1 Introduction

Compared to the range of plots proposed for visualising single continuous variable data, there have been few suggestions for visualising single categorical variable data. Barcharts (horizontal or vertical) and piecharts are the commonest options. You could argue that both are area plots where cases are represented by an area proportional to group size rather than being plotted individually. However, since the bars in a barchart always have the same width, you compare lengths, not areas, which is much easier.

Occasionally there are suggestions for using individual points for cases and jittering to keep them apart. This does not work well for high-frequency groups, as it is hard to assess their densities, and the displays for low-frequency groups may exhibit non-existent patterns due to the random jittering. Nevertheless, as always with exploratory graphics, if a graphic helps to uncover information, it is worth using.

Whether the graphic that is used to find information is the best way of presenting it to others is another matter. Single categorical variables usually contain little information, but it is still worth looking at some examples before moving on to consider multivariate categorical data and categorical variables together with continuous variables. There is often information in the simplest plots that is obscured in more complicated displays.

There are sixteen states (Bundesländer) in the German Federal Republic. Figure 4.1 shows three displays of the states using the number of eligible voters in the 2009 election as a proxy for size. The first display uses alphabetical order. Some states are much bigger than others, with Nordrhein-Westfalen being the biggest and Bremen (HB is Hansestadt Bremen) the smallest. The second display orders by size, and it is easier to judge the full ordering and see that Schleswig-Holstein (SH) is bigger than Brandenburg (BB). The final display orders by size within East/West groupings, where Berlin (BE) has been classed as ‘East’, as it is geographically. In fact, part of it, West Berlin, used to be part of West Germany. It is now clearer that the states in the East are all roughly the same size, with Sachsen being a bit bigger. Both the biggest and smallest states are in the West.

Figure 4.1

Three alternative barcharts of the numbers of eligible voters in the German Bundesländer. The top one is the default alphabetical order, the middle one is ordered by decreasing number of voters, and the bottom one is ordered by voter numbers within the former East/West Germanies. There is great variation in size, and the biggest Bundesländer are in the area that was formerly West Germany.

The data need some preparation: a new variable is created with abbreviated names for the Bundesländer; numbers of eligible voters are aggregated by Bundesland (the data are provided for the 299 constituencies); and a new variable is created according to whether the Bundesländer were part of the old West or East Germany.

data(btw2009, package = "flexclust")
# Copy the state variable and replace the full names with abbreviations
btw2009 <- within(btw2009, stateA <- state)
btw2009 <- within(btw2009,
    levels(stateA) <- c("BW", "BY", "BE",
    "BB", "HB", "HH", "HE", "MV", "NI", "NW",
    "RP", "SL", "SN", "ST", "SH", "TH"))
# Aggregate eligible voters over the constituencies of each state
Voters <- with(btw2009, tapply(eligible, stateA, sum))
Bundesland <- rownames(Voters)
btw9s <- data.frame(Bundesland, Voters)
# Label each state as former West or East
btw9s$EW <- c("West")
btw9s[c("BB", "BE", "MV", "SN", "ST", "TH"), "EW"] <- "East"
# Factor ordered by size within East/West
ls <- with(btw9s, Bundesland[order(EW, -Voters)])
btw9s <- within(btw9s, State1 <- factor(Bundesland, levels=ls))

The first version is best if you are used to that default order and know that, for instance, Berlin is always in second position. Except that it is not. With the full names of the states, rather than the standard abbreviations, you get a different ordering and Berlin is in third position. Consistent orderings are helpful for looking at displays of several different variables; you just have to agree on the ordering. The other versions are better for specific information relating to a single variable. Whichever one you use, they all give a quick overview of the relative sizes of the sixteen states. Barcharts are good for that. For those who would like an older example, take a look at the top plot on page 11 of [Gannett, 1898]. It shows a horizontal barchart of the “population of each state and territory” of the United States in 1890.

The German election data are an example of categorical data, counts by category. The categories may be nominal with no standard order (like a group of instant coffee brands) or ordinal (for instance when age is recorded as ‘young’, ‘middle-aged’, or ‘old’) or discrete (such as the number of children in a family). When the data are ordinal or discrete, the order must be preserved. When the data are nominal, reordering the categories becomes an important tool for gaining insights from the data.

library(ggplot2)    # ggplot(), geom_bar()
library(gridExtra)  # grid.arrange()
b1 <- ggplot(btw9s, aes(Bundesland, Voters/1000000)) +
    geom_bar(stat="identity") +
    ylab("Voters (millions)")
b2 <- ggplot(btw9s, aes(reorder(Bundesland, -Voters),
    Voters/1000000)) + geom_bar(stat="identity") +
    xlab("Bundesland") + ylab("Voters (millions)")
b3 <- ggplot(btw9s, aes(State1, Voters/1000000)) +
    geom_bar(stat="identity") + xlab("Bundesland") +
    ylab("Voters (millions)")
grid.arrange(b1, b2, b3)

4.2 What features might categorical variables have?

There might be

Unexpected patterns of results: There may be many more of some categories than others. Some categories may be missing completely.

Uneven distributions: Observational studies may exhibit some form of bias, perhaps too many males. In medical meta analyses many trials are analysed together, although it can turn out that most of the trials were small and that the results are dominated by the one or two major trials.

Extra categories: Gender may be recorded as ‘M’ and ‘F’, but also as ‘m’ and ‘f’, ‘male’ and ‘female’. In a study of patients with two medical conditions, there may be some patients diagnosed with a third condition.

Unbalanced experiments: Although experiments are usually carefully designed and carried out, there is always the chance that some data are missing or unusable. It is important to know if this occurs and leads to unequal group sizes.

Large numbers of categories: In studies including open-ended questions (e.g., “Who is your favourite politician?”) there may be far more names mentioned than you expected.

Don’t knows, refusals, errors, missings, ...: Data may not be available for a wide variety of reasons, and plots summarising how many cases of each type have arisen can be helpful both in deciding how to handle the data and in properly qualifying the results from the data that are available. Opinion polls are an obvious example.

4.3 Nominal data—no fixed category order

Meta analyses—how big was each study?

The package meta includes three meta analysis datasets. The dataset Fleiss93 has details of seven studies on the use of aspirin after myocardial infarction. Figure 4.2 plots a barchart of the study sizes, with the studies ordered by the total numbers of patients (control group and experimental group together). The differences in size are striking, with one study being much bigger than all the others.

data(Fleiss93, package="meta")
Fleiss93 <- within(Fleiss93, {
    total <- n.e + n.c
    st <- reorder(study, -(total))})
ggplot(Fleiss93, aes(st, total)) + geom_bar(stat="identity") +
  xlab("") + ylab("") + ylim(0,20000)

Figure 4.2

A barchart of the sizes of the seven studies in the Fleiss93 meta analysis dataset. The ISIS-2 study had more patients than all the others put together.

If there were many more than seven studies, and that is the usual situation with meta analyses, it could make sense to plot individual bars for the biggest studies and a combined bar for the total of the smaller studies. The following code would draw the corresponding plot for Fleiss93, combining all studies with fewer than 2000 cases:

Fleiss93 <- within(Fleiss93, st1 <- as.character(study))
# Relabel every study with fewer than 2000 cases; their bars are stacked
Fleiss93$st1[Fleiss93$total < 2000] <- "REST"
ggplot(Fleiss93, aes(st1, total)) + geom_bar(stat="identity") +
  xlab("") + ylab("") + ylim(0,20000)

Anorexia

In the anorexia dataset in the MASS package, two treatment groups are compared with a control group. There are 72 cases and you might assume that they were split equally into three groups of 24 each. Figure 4.3 shows the curiously uneven distribution (29 and 17 for the two treatment groups and 26 for the control group). Needless to say, there were probably very good reasons for the split, but it could be useful to know. What if the groups were initially of the same size and there were dropouts?

data(anorexia, package="MASS")
ggplot(anorexia, aes(Treat)) + geom_bar() + xlab("Treatment")

Figure 4.3

A barchart of the group sizes in the anorexia dataset drawn with barplot(table(anorexia$Treat)). The groups are not of equal size.

A table would show this just as well in principle

with(anorexia, table(Treat))

but a graphic is more convincing.

Who sailed on the Titanic?

The tragic sinking of the Titanic has been discussed endlessly and there are many films, books, and webpages about the disaster. Despite all this attention the information on those who sailed on the ship is incomplete. A lot is known about some passengers and members of the crew, especially about the first-class passengers and the officers; less is known about some of the others on the ship.

Although there is not full agreement on all details, the dataset Titanic is generally thought to be an accurate summary. It contains four pieces of information on 2201 people who were on board: Class (1st, 2nd, 3rd, or crew), Sex, Age (a binary variable, Child or Adult), and Survived (whether they survived or died). Interest centres on issues such as survival by Class and Sex (which will be discussed in §7.2), but it is sensible to have a look at the simple univariate barcharts first, as in Figure 4.4.

Titanic1 <- data.frame(Titanic)
p <- ggplot(Titanic1, aes(weight=Freq)) +
   ylab("") + ylim(0,2250)
cs <- p + aes(Class) + geom_bar(fill="blue")
sx <- p + aes(Sex) + geom_bar(fill="green")
ag <- p + aes(Age) + geom_bar(fill="tan2")
su <- p + aes(Survived) + geom_bar(fill="red")
grid.arrange(cs, sx, ag, su, nrow=1, widths=c(3, 2, 2, 2))

Figure 4.4

Barcharts of the four variables in the Titanic dataset. That the majority of passengers and crew died and that there were far more men than women on board is well known, the numbers in the different classes less so.

You should always think about what you expect graphics to show, before you draw them. That way you can be surprised by what you see and value more the information presented. Once you have seen Figure 4.4, it seems obvious that there were more people in the crew than in any of the other classes and that there were fewest in the second class. That there were more males than females will have surprised no one, but that there were over three times as many does seem high, probably because the crew were almost all men. The small number of children may raise questions about the quality of the data for the Age variable. The fact that overall twice as many died as survived is well known.

There are technical points worth mentioning. The dataset is supplied as four conditional contingency tables, so it has been converted to a dataframe. A common vertical scale has been used for all four plots to make comparisons across plots possible. This is fine for the three binary variables, but not so good for the variable Class, a typical issue when categorical variables have different numbers of categories. (The upper limit was chosen after drawing default plots. More formally, the maximum should be determined by calculation.)
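The calculation mentioned in the parenthesis can be sketched as follows. This is a minimal example using the built-in Titanic table; the names vars, maxcount, and upper, and the rounding step of 250, are illustrative assumptions rather than part of the original code.

```r
# Largest category count over the four variables of the built-in Titanic table
Titanic1 <- as.data.frame(Titanic)
vars <- c("Class", "Sex", "Age", "Survived")
maxcount <- max(sapply(vars, function(v)
    max(tapply(Titanic1$Freq, Titanic1[[v]], sum))))
# Round up to the next multiple of 250 (an arbitrary, illustrative step)
upper <- ceiling(maxcount / 250) * 250
upper
```

The value of upper could then be passed to ylim(0, upper) instead of a hand-picked limit. For these data it reproduces the limit of 2250 used for Figure 4.4.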

The Class barchart has been drawn wider than the others, as it has more categories. This makes the code more complicated, but it is good to have the flexibility to do it. For presentation purposes it would have been more elegant to make all bars the same width. For exploratory purposes this would be gilding the lily.

Opinion polls

Nowadays we are bombarded with results of opinion polls on all kinds of subjects, and political opinion polls get special emphasis for obvious reasons. Although there are always people who fail to understand the question, who do not want to choose any of the options offered, or who just refuse to answer pollsters’ questions, the numbers of ‘Don’t knows’, or ‘Undecideds’ or whatever you choose to call them, are seldom reported. For political polls this group can be important, as they could sway the election if they did indeed decide to vote.

The referendum on Scottish independence held in 2014 looked like it would be very close just before voting took place. In the end the margin was fairly substantial (55.3% to 44.7%). Afterwards some commentators suggested that this should have been predictable, as the ‘Don’t knows’ in opinion polls in referenda have tended to vote for the status quo [FiveThirtyEight, 2014].

A poll taken in Ireland in August 2013 shows how much difference neglecting the ‘Don’t knows’ can make. Unusually, perhaps because of the relatively high proportion of ‘Don’t knows’, the Sunday Independent newspaper presented the results both with and without the ‘Don’t knows’ [Sunday Independent, 2013]. They used piecharts, and Figure 4.5 shows something similar drawn with R using the reported percentages. Altogether, 985 people were questioned, 4 of whom said they would vote for the Green party. Rounding the percentages, as was done by the newspaper, meant that the Green party was omitted completely from the published piechart with the ‘Don’t knows’, but was shown in the second piechart.

Figure 4.5

Piecharts of an Irish political opinion poll from August 2013, one including the ‘Don’t knows’ and one not. The top plot emphasizes the large number of voters who have not decided how they might vote, while the second plot suggests that a Fianna Fail and Sinn Fein coalition, were such an agreement possible, could almost have a majority.

The top plot, including the ‘Don’t knows’, emphasizes how open the election was likely to be, whereas the lower plot suggests that Fianna Fail and Sinn Fein might have been able to form a government together (should such a coalition have been in the realms of possibility). Piecharts are good for emphasizing when data are shares of some whole, and can be useful, as here, for estimating possible coalition shares. You have to know what the shares are shares of. In the upper plot it is shares of all respondents, in the lower one it is shares of respondents who said whom they would support. Piecharts are poor for comparing values. Can you tell whether Fine Gael or Fianna Fail had more support in the survey? Since these have historically been the two major parties, it is very important to them which one is ahead. In this case, Fine Gael edged it by around 1%.

The piecharts could probably be improved with a better choice of colour shades; something close to the standard party colours should be chosen. It could be helpful to also report the values, either on the plots or in accompanying tables. Piecharts are not generally recommended (see, for example, the R help page for piecharts, where Bill Cleveland is quoted on the comparatively poor accuracy of judgements of angles) and should mostly be avoided. Piecharts in 3D should most definitely always be avoided.
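A small table of rounded percentages shows why the Green party disappears from one chart but not the other. The sketch below uses the counts reported in the text; the object names are illustrative, and the party labels Labour and Independents are assumed spellings.

```r
# Poll counts as reported in the text; the "don't know" group is the last entry
nos <- c(FineGael = 181, Labour = 51, FiannaFail = 171, SinnFein = 119,
         Independents = 91, Green = 4, DontKnow = 368)
# Shares of all respondents, rounded to whole percentages
with_dk <- round(100 * nos / sum(nos))
# Shares of respondents who named a party
without_dk <- round(100 * nos[-7] / sum(nos[-7]))
rbind(with_dk = with_dk[-7], without_dk)
```

The Green share rounds to 0% of all respondents but to 1% of those who named a party, so a chart built from rounded values only shows the party in the second case.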

Party <- c("Fine Gael", "Labour", "Fianna Fail",
   "Sinn Fein", "Independents", "Green", "Don't know")
nos <- c(181, 51, 171, 119, 91, 4, 368)
IrOP <- data.frame(Party, nos)
# Shares with and without the "Don't know" group (element 7)
IrOP <- within(IrOP, {
   percwith <- nos/sum(nos)
   percnot <- nos/sum(nos[-7])})
par(mfrow=c(2,1), mar = c(2.1, 2.1, 2.1, 2.1))
with(IrOP, pie(percwith, labels=Party, clockwise=TRUE,
   col=c("blue", "red", "darkgreen", "black",
   "grey", "lightgreen", "white"), radius=1))
with(IrOP, pie(percnot[-7], labels=Party[-7], clockwise=TRUE,
   col=c("blue", "red", "darkgreen", "black",
   "grey", "lightgreen"), radius=1))

4.4 Ordinal data—fixed category order

Surveys

In many surveys there is a string of questions to which respondents give answers on integer scales from, say, 1 to 5. Sometimes descriptive terms are attached to each number on the scale, although not always.

The BEPS dataset in the effects package includes seven such questions put to 1525 voters in the British Election Panel Study for 1997-2001. One question was on a scale from 1 to 11, one from 0 to 3, and the rest were from 1 to 5. Figure 4.6 shows the barcharts for the voters’ assessment of the leaders of the three main parties at the time, where a higher value means a better assessment. Few voters used the extremes of 1 or 5. The leaders of the two main parties, Tony Blair and William Hague, were either liked (a value of 4 or 5) or disliked (a value of 1 or 2). The middle value of 3 was hardly used for them, although it was used for Charles Kennedy, the leader of the Liberals. Once you have seen the patterns, there is no difficulty in describing them and suggesting possible explanations, but would you have expected them?

data("BEPS", package="effects")
a1 <- ggplot(BEPS, aes(factor(Hague))) +
    geom_bar(fill="blue") + ylab("") +
    xlab("Hague (Conservative)") + ylim(0, 900)
a2 <- ggplot(BEPS, aes(factor(Blair))) +
    geom_bar(fill="red") + ylab("") +
    xlab("Blair (Labour)") + ylim(0, 900)
a3 <- ggplot(BEPS, aes(factor(Kennedy))) +
    geom_bar(fill="yellow") + ylab("") +
    xlab("Kennedy (Liberal)") + ylim(0, 900)
grid.arrange(a1, a2, a3, nrow=1)

Figure 4.6

Assessment of UK party leaders by voters in 2001. Higher values mean a higher rating. The bars have been filled with the party colours. Opinion seems to be divided on the leaders of the two main parties, Hague and Blair.

Two of the variables concerned Europe, one the respondents’ knowledge of the parties’ policies on European integration and one the respondents’ attitudes towards European integration. What patterns of results might be expected here? Figure 4.7 displays the two barcharts.

Figure 4.7

UK voters’ attitudes to European integration are shown in the plot on the right. There is a strongly negative group (those with the value ‘11’), but otherwise the respondents are roughly equally spread across the scale. The barchart on the left referring to respondents’ knowledge has been drawn horizontally to differentiate it from the barchart on the right referring to respondents’ opinions.

Well over half the respondents thought they had good or very good knowledge of party policies (options 2 and 3), while a sizeable minority thought they had low knowledge (option 0). Very few chose the option 1. The second barchart plots a scale based on combining answers to several questions, so it is reasonable to find that there is a spread of scale values, although that they are represented fairly equally seems a little surprising. The biggest group was the over 20% who were strongly against European integration. Given the tenor of the discussions about membership of the European Union going on in the UK in 2014, the proportion strongly against European integration now is probably considerably higher than it was at the turn of the century.

b1 <- ggplot(BEPS, aes(factor(political.knowledge))) +
    geom_bar(fill="tan2") + coord_flip() + ylab("") +
    xlab("Knowledge of policies on Europe")
b2 <- ggplot(BEPS, aes(factor(Europe))) +
    geom_bar(fill="lightgreen") + ylab("") +
    xlab("Attitudes to European integration")
grid.arrange(b1, b2, nrow=1, widths=c(4, 8))

And more surveys

The survey dataset from the MASS package gives the results of a survey of 237 Australian statistics students. (It is also used in Exercise 3 of Chapter 3.) Figure 4.8 displays the barcharts for the seven categorical variables and gives an idea of the range of barcharts that you can find.

Figure 4.8

Barcharts of the categorical variables in the survey dataset of 237 statistics students. The individual plots are discussed in the chapter. Barcharts may be of many different shapes. Each plot has its own vertical axis scale.

For most of the variables a simple line of code suffices, adding informative labels instead of the variable names. The variables for exercising and smoking have had to be reordered for the plots and this is something that always has to be watched out for. Note that the vertical scales vary from about 120 to just over 200.

Interestingly the sexes look exactly equal (a table shows they were), and one person did not give that information. Most students are right-handed (again one person did not answer the question, not the same one as before, as it turned out). More students fold their arms right over left, and a non-negligible minority say that neither right nor left is on top. With clapping, the majority for the right side is higher and the ‘neither’ group is bigger. As you might expect, most students do at least some exercising and few smoke. Many people in Australia still use Imperial units (feet and inches) to measure height, or rather they did at the time the survey was carried out in the 1970s.

data(survey, package="MASS")
s1 <- ggplot(survey, aes(Sex)) + geom_bar() + ylab("")
s2 <- ggplot(survey, aes(W.Hnd)) + geom_bar() +
   xlab("Writing hand") + ylab("")
s3 <- ggplot(survey, aes(Fold)) + geom_bar() +
   xlab("Folding arms: arm on top") + ylab("")
s4 <- ggplot(survey, aes(Clap)) + geom_bar() +
   xlab("Clapping: hand on top") + ylab("")
# Reorder the exercise levels from least to most
survey <- within(survey,
    ExerN <- factor(Exer,
    levels=c("None", "Some", "Freq")))
s5 <- ggplot(survey, aes(ExerN)) + geom_bar() +
   xlab("Exercise") + ylab("")
s6 <- ggplot(survey, aes(M.I)) + geom_bar() +
   xlab("Height units") + ylab("")
# Reorder the smoking levels from least to most
survey <- within(survey,
    SmokeN <- factor(Smoke,
    levels=c("Never", "Occas", "Regul", "Heavy")))
s7 <- ggplot(survey, aes(SmokeN)) + geom_bar() +
   xlab("Smoking") + ylab("")
grid.arrange(s1, s2, s3, s4, s5, s6, s7, ncol=3)

4.5 Discrete data—counts and integers

Deaths by horsekicks

Von Bortkiewicz’s dataset about deaths by horsekick in the Prussian army over twenty years is included in vcd under the name VonBort. Figure 4.9 shows the distribution of the numbers of deaths by corps and year. There is no information provided on the sizes of the corps over the years, though it is believed that most had four regiments, while three had six and one had eight. There may have been around 750 men (and horses) in each cavalry regiment.

data(VonBort, package="vcd")
ggplot(VonBort, aes(x=factor(deaths))) + geom_bar() +
  xlab("Deaths by horse kick")

Figure 4.9

Number of soldiers killed in one of 14 Prussian army corps by year for twenty years from 1875 to 1894. The distribution is close to a Poisson distribution with parameter 0.7, suggesting that these might be considered unfortunate random accidents. The dataset first appeared in [von Bortkiewicz, 1898].

A test of whether a Poisson distribution would be a good fit is not significant.

library(vcd)  # provides goodfit()
gf <- goodfit(table(VonBort$deaths))
summary(gf)
#
# Goodness-of-fit test for poisson distribution
#
#    X^2 df P(> X^2)
# Likelihood Ratio 2.442786 3 0.4857196

Looking at the distributions for individual years illustrates a problem that can arise with data like these. Both plots in Figure 4.10 show the distribution of deaths by corps for the year 1891. No corps suffered either 2 or 4 deaths and so the plot on the left ignores those categories. This can be misleading for categories outside the range of the subset (the value 4 in this case) and is more serious for a missing category inside the range (the value 2 here). The plot on the right solves the problem by specifying the levels of the factor deaths.

data(VonBort, package="vcd")
h1 <- ggplot(VonBort[VonBort$year=="1891",],
    aes(x=factor(deaths))) + geom_bar()
h2 <- ggplot(VonBort[VonBort$year=="1891",],
    aes(x=factor(deaths, levels=seq(0, 4)))) +
    geom_bar() + scale_x_discrete(drop=FALSE)
grid.arrange(h1, h2, nrow=1)
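The same issue arises in tables, where it may be easier to spot. The sketch below uses made-up counts for 14 corps (not the real 1891 values) in which the values 2 and 4 do not occur.

```r
# Hypothetical numbers of deaths for 14 corps in one year (invented for illustration)
deaths <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 3, 3)
# Default table: only the values that occur (0, 1, 3) are listed
t_default <- table(deaths)
# Specifying the factor levels keeps 2 and 4 with zero counts
t_full <- table(factor(deaths, levels = 0:4))
t_default
t_full
```

Fixing the levels once, for the whole dataset, means every subset is tabulated and plotted on the same set of categories.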

Figure 4.10

Numbers of soldiers killed in one of 14 Prussian army corps in 1891. The plot on the left leaves out values with zero counts, the plot on the right is a more complete display. With only 14 pieces of data, it is not surprising that there can be two unusual values (unusual for the year 1891).

Goals in soccer

Scores in many sports lead to discrete data, whether goals in soccer, runs in cricket, points in rugby, or shots in golf. The dataset UKSoccer in vcd summarizes the results of season 1995-96 of the English Premier League in a table giving the frequencies with which specific results occurred. The second line of code converts the table into a data frame and Figure 4.11 shows the distributions of goals scored by the home and away teams. If a team scored more than 4 goals, 4 was recorded in the dataset, which is why the labels have been changed in the code. The vertical scales have been made the same for comparative purposes.

data(UKSoccer, package="vcd")
PL <- data.frame(UKSoccer)
# Axis labels: the top category covers 4 or more goals
lx <- c("0", "1", "2", "3", "4 or more")
b1 <- ggplot(PL, aes(x=factor(Home), weight=Freq)) +
    geom_bar(fill="firebrick1") +
    ylab("") + xlab("Home Goals") +
    scale_x_discrete(labels=lx) + ylim(0,150)
b2 <- ggplot(PL, aes(x=factor(Away), weight=Freq)) +
    geom_bar(fill="cyan1") +
    ylab("") + xlab("Away Goals") +
    scale_x_discrete(labels=lx) + ylim(0,150)
grid.arrange(b1, b2, nrow=1)

Figure 4.11

Distributions of the numbers of goals scored by the home and away teams in the 1995-96 English Premier League season. As you would expect, the home teams scored more goals. Perhaps it is unexpected that there are relatively few matches in which either team scores more than two goals.

Benford’s Law

The observation that the first or leading digits of many sets of numbers do not follow a uniform distribution was known a long time ago. As Simon Newcomb wrote in his 1881 article [Newcomb, 1881]: “That the ten digits do not occur with equal frequency must be evident to any one making much use of logarithmic tables, and noticing how much faster the first pages wear out than the last ones.” However, the result is known as Benford’s Law, because of his article [Benford, 1938], which described the checking of many real datasets he had collected.

There have been several interesting applications in recent years, including studies of financial fraud and voting irregularities, and references can be found in the relevant Wikipedia article. The package benford.analysis provides six datasets, including census.2009, which gives the populations of 19,509 U.S. towns and cities.

If we only consider the first digit, then there are only nine possible values and the probability distribution is shown in Figure 4.12:

xx <- 1:9
Ben <- data.frame(xx, pdf=log10(1+1/xx))
ggplot(Ben, aes(factor(xx), weight=pdf)) + geom_bar() +
  xlab("") + ylab("") + ylim(0,0.35)
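The formula used here, P(d) = log10(1 + 1/d) for d = 1, ..., 9, can be checked numerically, and a small helper makes it possible to compare a dataset's leading digits with it. The sketch below uses base R only; first_digit is a hypothetical helper written for this illustration (it is not part of benford.analysis) and assumes positive numbers.

```r
d <- 1:9
p <- log10(1 + 1/d)
# The probabilities telescope to log10(10) = 1
sum(p)
round(p, 3)

# Hypothetical helper: leading digit of positive numbers via scientific notation
first_digit <- function(x) {
    as.integer(substr(formatC(x, format = "e", digits = 6), 1, 1))
}
first_digit(c(1234, 0.052, 98))
```

Observed proportions could then be obtained with table(first_digit(x)) / length(x) for a vector x of positive values and set beside p.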

Figure 4.12

The distribution of first digits according to Benford’s law.

4.6 Formats, factors, estimates, and barcharts

Attentive readers will have noticed that drawing graphics of categorical data often involves preparatory reformatting of the data. Working with categorical data is not always straightforward and you have to watch both for how the data are stored and how the categories are coded.

Shape of the dataset

The data may be provided as a collection of individual cases, as a list of the possible combinations of the categorical variables with their associated frequencies, or as a set of conditional tables. Depending on which form the data have, some restructuring may be necessary before drawing particular plots.
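The three shapes can be converted into one another with a few base R calls. The following sketch uses the built-in Titanic table; the object names freq, cases, and tab are illustrative.

```r
# Table form -> one row per combination of categories, with a Freq column
freq <- as.data.frame(Titanic)
# Frequency form -> one row per individual case (repeat each row Freq times)
cases <- freq[rep(seq_len(nrow(freq)), freq$Freq), 1:4]
rownames(cases) <- NULL
# Frequency form -> back to a (two-way) table
tab <- xtabs(Freq ~ Class + Survived, data = freq)
nrow(freq)
nrow(cases)
tab
```

With ggplot2, the frequency form needs a weight aesthetic (as in the Titanic code earlier in the chapter), while the case form can be plotted with geom_bar() directly.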

Coding of variables

Categorical variable data may be recorded in different ways. Numeric codings were common in the distant past when computer storage was expensive and limited. Numeric codings have their uses, but there is always the danger of inappropriate calculations being carried out (e.g., what is the average marital status when ‘1’ = single, ‘2’ = married, ‘3’ = separated, ... ?), and the implied ordering may need to be changed too. You can convert the variable to a factor to change the order, but if you are going to do that, it is probably best to define a new factor variable and replace the codes with descriptive text. This has the advantage of keeping the original variable: you never know when you might need it.

If the data are stored as character strings, R's plotting and tabulating functions treat the variable as a factor, and the default ordering is alphabetical. (Before R 4.0.0, data.frame() and read.table() also converted character strings to factors by default.) Non-alphabetical orderings can be specified using the levels argument of the factor function. Sometimes this has been done already for datasets available in R, and you may have to change the ordering if the default is not what you need. Reordering should always be carried out with care, since there are many ways for things to go wrong.
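A minimal sketch of recoding and reordering, using small made-up vectors (the values and object names are illustrative). It also shows one of the ways things can go wrong: assigning to levels() relabels the categories instead of reordering them.

```r
# Numeric codes -> a new factor with descriptive labels (original codes are kept)
codes <- c(1, 2, 3, 2, 1)
marital <- factor(codes, levels = 1:3,
                  labels = c("single", "married", "separated"))

# Character data: default levels are alphabetical
exer <- c("Some", "None", "Freq", "Some")
f_alpha <- factor(exer)
levels(f_alpha)
# Reorder safely by naming the levels in the order wanted
f_ord <- factor(exer, levels = c("None", "Some", "Freq"))
levels(f_ord)

# Caution: this renames the existing levels, so the data values change
f_bad <- f_alpha
levels(f_bad) <- c("None", "Some", "Freq")
data.frame(exer, f_ord, f_bad)
```

A quick cross-tabulation of the old and new variables, for example table(exer, f_ord), is a cheap check that a reordering has not altered any values.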

Estimates shown as bars

In some fields of application estimates are displayed as bars. When confidence intervals are shown as well, the lower part can be hidden by the colouring of the bar and the upper part looks like the plunger on a dynamite box, which gives rise to the name ‘dynamite plot’. Most statisticians disapprove and recommend using dotplots with confidence intervals instead [Cleveland, 1993]. In this book there are no dynamite plots, only barcharts where the bars represent counts or probabilities.
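For comparison, here is a sketch of the recommended alternative in base R: a dotplot of estimates with interval bars. The estimates and standard errors are invented, and the intervals are simple normal approximations, so treat this as an illustration rather than a recipe.

```r
# Hypothetical group estimates with standard errors
est <- data.frame(
  group = c("A", "B", "C", "D"),
  mean  = c(4.2, 5.1, 3.8, 6.0),
  se    = c(0.30, 0.45, 0.25, 0.50)
)

# Approximate 95% intervals using the normal quantile (an assumption)
z <- qnorm(0.975)
est$lower <- est$mean - z * est$se
est$upper <- est$mean + z * est$se

# Order the groups by estimate to make comparisons easier
est <- est[order(est$mean), ]

# Points for the estimates, horizontal segments for the intervals
dotchart(est$mean, labels = est$group, pch = 19,
         xlim = range(est$lower, est$upper),
         xlab = "Estimate with approximate 95% interval")
segments(est$lower, seq_len(nrow(est)), est$upper, seq_len(nrow(est)))
```

Both ends of every interval are visible, and there is no bar anchored at zero to distract from the estimates.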

4.7 Modelling and testing for categorical variables

  1. Testing by simulation

    χ² tests in one form or another are standard. An alternative to relying on the asymptotic distribution of the test statistic is to simulate many datasets and compare the actual test statistic value with the distribution of simulated values. This option is available in the R function chisq.test.

  2. Evenness of distribution

    If numbers are supposed to have been drawn at random, for instance in a lottery, or if a random number generator is to be checked directly, then the null hypothesis of equally likely outcomes can be tested with a χ² test.

  3. Fitting discrete distributions

    As always with categorical data, χ² tests are important, but it is useful to inspect any lack of fit visually. Tukey’s hanging rootograms are one approach; plotting components of χ²-statistics is another.
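The three points above can be sketched together in a few lines of R. All the counts are invented for illustration. The first part tests evenness with both the asymptotic and the simulated p-value from chisq.test; the second part fits a Poisson distribution and computes the signed components of the χ² statistic and the square-root differences that a hanging rootogram displays. Treating the ‘4+’ group as exactly 4 when estimating the mean is a simplifying assumption.

```r
set.seed(2015)  # arbitrary seed, only so the simulated p-value is reproducible

# Hypothetical counts of the digits 0 to 9 in 200 supposedly random draws
digits <- c(18, 23, 17, 21, 25, 16, 19, 22, 20, 19)
names(digits) <- 0:9

# Evenness: with no p argument, chisq.test() tests equal probabilities
asym <- chisq.test(digits)
# Same test, but with the p-value obtained by simulation
sim <- chisq.test(digits, simulate.p.value = TRUE, B = 5000)
round(c(X2 = unname(asym$statistic),
        p_asymptotic = asym$p.value,
        p_simulated = sim$p.value), 3)

# Hypothetical counts of units with 0, 1, 2, 3, and 4 or more events
obs <- c("0" = 110, "1" = 62, "2" = 21, "3" = 5, "4+" = 2)
n <- sum(obs)

# Estimated Poisson mean (assumption: the 4+ group is counted as 4)
lambda_hat <- sum(c(0, 1, 2, 3, 4) * obs) / n

# Expected counts, with the upper tail collected in the last category
p <- c(dpois(0:3, lambda_hat), ppois(3, lambda_hat, lower.tail = FALSE))
expected <- n * p

# Signed components of the chi-squared statistic
chi_components <- (obs - expected) / sqrt(expected)
# Differences on the square-root scale, as displayed in a hanging rootogram
root_diff <- sqrt(obs) - sqrt(expected)

# Plot the components to see where any lack of fit arises
barplot(chi_components, xlab = "Number of events",
        ylab = "Chi-squared component")
```

With small expected counts in the upper categories, as here, the asymptotic approximation is doubtful, which is one more reason to look at the components rather than rely on a single p-value.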

Main points

  1. Barcharts are a simple form of display, yet they can provide much information and often surprise. Errors may become apparent, there may be more categories than anticipated, or category counts may be unexpected (e.g., Figures 4.2 and 4.6).
  2. Barcharts can be used for nominal, ordinal, or discrete variables.
  3. The order of categories affects the way a barchart of a nominal variable looks, and different orderings can emphasise different features, as was shown in Figure 4.1.
  4. People have strong opinions about piecharts. Don’t let that stop you using them if you like them, just make sure they show clearly what you want them to. They are useful for displaying shares, as in Figure 4.5.

Exercises

  1. Gastrointestinal damage

    The dataset is called Lanza from the package HSAUR2.

    (a) Data on four studies are reported. Draw a plot to show whether all four studies are equally large.
    (b) The outcome is measured by the variable classification with scores of 1 (best) to 5 (worst). How would you describe the distribution?
  2. Alzheimer’s

    Three groups of patients (one with Alzheimer’s, one with other forms of dementia, and a control group with other diagnoses) were studied. Counts are given in the dataset alzheimer in the package coin. Prepare a graphical summary of plots of each of the three variables, smoking, disease, and gender, in a single window.

    (a) Are the disease groups very different in size?
    (b) Are there more men or women in the study?
    (c) How would you describe the distribution of the smoking variable? Do you think the smoking data are likely to be reliable?
  3. Slot machines

    According to the R help entry, this dataset (vlt in package DAAG) was collected from the three windows of a video lottery terminal while playing the game ‘Double Diamond’. There are seven possible symbols that may appear in each window. Draw equally scaled barcharts to see if the distributions of frequencies are the same for each window. Describe any important features.

  4. Multiple sclerosis

    The dataset MSPatients in vcd provides information on the diagnoses of two neurologists from two different cities on two groups of patients, one from each city. How do the distributions of the ratings of the neurologists compare? How would you describe their rating patterns? Draw two barcharts with common scaling. Before drawing the barcharts, reorder the categories into a sensible ordinal scale instead of the default alphabetical order. (Hint: initially using data.frame(as.table(MSPatients)) will put the dataset into a form that is easier to work with.)

  5. Occupational mobility

    According to the R help page, the Yamaguchi87 dataset in vcdExtra has become a classic for models comparing two-way mobility tables.

    (a) How do the distributions of occupations of the sons in the three countries compare?
    (b) How do the distributions of the sons’ and fathers’ occupations in the UK compare?
    (c) Are you surprised by the results or are they what you would have expected?
  6. Whisky

    The package bayesm includes the dataset Scotch, which reports which brands of whisky 2218 respondents consumed in the previous year.

    (a) Draw a barchart of the number of respondents per brand. What ordering of the brands do you think is best?
    (b) There are 20 named brands and a further category Other.brands. That entails drawing a lot of bars. If you decided to plot only the biggest brands individually and group the rest all together in the ‘Other’ group, what cutoff would you use for defining a big brand?
    (c) Another version of the dataset called whiskey is given in the package flexmix. It is made up of two data frames, whiskey with the basic data, and whiskey_brands with information on whether the whiskeys are blends or single malts. How would you incorporate this information in your graphics, by using colour, by using a different ordering, or by drawing two graphics rather than one?
    (d) Which of the spellings, ‘whisky’ or ‘whiskey’, is more appropriate for this dataset?
  7. Choice of school

    The dataset GSOEP9402 in the package AER provides data on 675 14-year-old children in Germany. The data come from the German Socio-Economic Panel for the years 1994 to 2002.

    (a) Which variables are nominal, ordinal, or discrete?
    (b) Draw barcharts for the variables. Are any similar in form, and what explanations would you suggest for these similarities?
    (c) The variable meducation refers to the mother’s educational level in years. Would you describe it as ordinal or discrete, and how should it be displayed?
    (d) Summarise briefly the main information shown by your graphics.
  8. Election results

    The Bavarian election of Autumn 2013 was a triumph for the CSU party, as they obtained an absolute majority for the first time for several years. This was partly because some parties failed to get 5% of the votes or more and could therefore not win any seats in parliament. A few observers commented that the real winners were the group who did not vote at all, as they made up a higher proportion of the electorate than the CSU supporters. The percentages reported were as follows (where the party percentages refer to their share of the actual vote): ‘didn’t vote’ 36.7%, CSU 47.7%, SPD 20.6%, FW 9.0%, Grüne 8.6%, FDP 3.3%, BP 2.1%, Linke 2.1%, ÖDP 2.0%, Pirates 2.0%, Rest 2.5%.

    (a) Draw graphics to present these results both with and without the group of non-voters. What headlines would you give your graphics if you were to publish them in a newspaper?
    (b) It is, of course, not necessary to list all the parties, especially when several of them were excluded by the 5% rule. Which ones would you leave out and why?
    (c) Seats are won by the parties with over 5% of the vote, based on their share of the total votes of qualifying parties. The rules are actually more complicated than that, but these complications can be neglected here. There are 180 seats in all. Draw a graphic showing the seat distribution, using the available figures. What headline would you suggest for this graphic?
    (d) You have probably noticed that the percentages do not add up to 100%. Is this a problem and what might you do about it?
  9. Horsekicks

    In his discussion of the deaths due to horsekicks in the Prussian army, von Bortkiewicz pointed out that four of the corps (G, I, VI, and XI) were bigger than the others.

    (a) Draw two plots, one for the numbers of deaths each year for the four bigger corps and one for the other corps. What vertical scale do you think is appropriate for comparing the two plots?
    (b) Estimate the parameter λ of a Poisson distribution for the other corps. Carry out a test to see if a Poisson distribution with that λ would be an acceptable fit for the four bigger corps.
  10. Intermission

    Edward Hopper’s Nighthawks hangs in the Art Institute of Chicago. What are the main features of this painting and is it typical of Hopper’s work?
