Tabular Data
This chapter explains how to summarize and calculate statistics for categorical variables. Categorical data is normally summarized in a frequency table (for one variable) or contingency table (for two or more variables), which is a table showing the number of values in each category or in each combination of categories. Data summarized in this form is known as tabular data.
As well as summarizing your categorical data in a table, you may need to compare it with a hypothesized distribution using the chi-square goodness-of-fit test. You may also want to determine whether there is any association between two or more categorical variables. The chi-square test of association and Fisher’s exact test are two methods that can help you with this.
You will learn how to:
- summarize categorical variables with frequency and contingency tables
- display tables with proportions, percentages, and marginal totals
- create tables from count data and enter tables directly
- compare categorical data with a hypothesized distribution using the chi-square goodness-of-fit test
- check for association between categorical variables using the chi-square test of association and Fisher’s exact test
- compare proportions between groups with a test of proportions
This chapter uses the warpbreaks and esoph datasets and the Titanic table object, which are included with R. You can view more information about them by entering help(datasetname). It also uses the people2 dataset (which is a modified version of the people dataset introduced in Chapter 3), and the apartments dataset. These are both available with the downloads for this book and described in Appendix C.
Frequency Tables
Frequency tables summarize a categorical variable by displaying the number of observations belonging to each category. Frequency tables for two or more categorical variables (known as contingency tables or cross tabs) summarize the relationship between two or more categorical variables by displaying the number of observations that fall into each combination of categories.
In R, a table is also a type of object that holds tabulated data. There are some example table objects included with R, such as HairEyeColor and Titanic.
To create a one-dimensional frequency table showing the number of observations for each level of a factor variable, use the table function:
> table(people2$Eye.Color)
Blue Brown Green
7 6 3
Although R allows you to create tables from any type of variable, they are only really meaningful for factor variables with a relatively small number of levels. If you want to include a continuous variable, first divide it into categories with the cut function, as explained in Chapter 3 under “Dividing a Continuous Variable into Categories.” To add an additional column to the table showing the number of missing values (if there are any), set the useNA argument:
> table(dataset$factor1, useNA="ifany")
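Putting these together, you could summarize a continuous variable such as height by first cutting it into bands and then tabulating the result. This is a quick sketch that assumes people2 contains a numeric Height variable measured in centimetres (as used in Chapter 3):
> heightbands<-cut(people2$Height, breaks=c(150, 165, 180, 195))
> table(heightbands, useNA="ifany")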
To create a two-dimensional contingency table, give two variables as input:
> table(people2$Eye.Color, people2$Sex)
Male Female
Blue 4 3
Brown 5 1
Green 1 2
Similarly, you can create contingency tables for three or more variables:
> table(people2$Eye.Color, people2$Height.Cat, people2$Sex)
, , = Male
Medium Short Tall
Blue 2 0 2
Brown 4 0 1
Green 0 0 1
, , = Female
Medium Short Tall
Blue 3 0 0
Brown 0 0 0
Green 0 2 0
To save a table as a table object, assign the output of the table function to a new object name:
> sexeyetable<-table(people2$Eye.Color, people2$Sex)
Once you have created a table object, you can apply further functions to the object to create output that is relevant to tabular data. For example, you could use the pie function to create a pie chart showing the proportion of people with each eye color, or use the summary function to test for association between eye color and sex. You will learn about some of these functions later in this chapter. Save the sexeyetable object, as it will be useful later on.
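For instance, a one-dimensional table can be passed directly to the pie function to produce a quick pie chart of the eye color distribution:
> pie(table(people2$Eye.Color))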
Displaying Tables
There are a few functions that allow you to present table objects in different ways. These include ftable, prop.table, and addmargins.
The ftable function displays your table in a more compact way, which is useful for tables with three or more dimensions:
> ftable(Titanic)
Survived No Yes
Class Sex Age
1st Male Child 0 5
Adult 118 57
Female Child 0 1
Adult 4 140
2nd Male Child 0 11
Adult 154 14
Female Child 0 13
Adult 13 80
3rd Male Child 35 13
Adult 387 75
Female Child 17 14
Adult 89 76
Crew Male Child 0 0
Adult 670 192
Female Child 0 0
Adult 3 20
The prop.table function displays the table with each cell count expressed as a proportion of the total count:
> prop.table(sexeyetable)
Male Female
Blue 0.2500 0.1875
Brown 0.3125 0.0625
Green 0.0625 0.1250
To display the cell counts expressed as a proportion of the row or column totals instead of the grand total, set the margin argument to 1 for rows, 2 for columns, and so on for higher dimensions:
> prop.table(sexeyetable, margin=2)
Male Female
Blue 0.4000000 0.5000000
Brown 0.5000000 0.1666667
Green 0.1000000 0.3333333
To display percentages, multiply the whole table by 100. You can also use the round function to round all of the numbers in the table:
> round(prop.table(sexeyetable)*100)
Male Female
Blue 25 19
Brown 31 6
Green 6 12
The addmargins function displays your table with row and column totals:
> addmargins(sexeyetable)
Male Female Sum
Blue 4 3 7
Brown 5 1 6
Green 1 2 3
Sum 10 6 16
To add margins to just one dimension of the table, set the margin argument to 1 for rows, 2 for columns, and so on for higher dimensions:
> addmargins(sexeyetable, margin=1)
Male Female
Blue 4 3
Brown 5 1
Green 1 2
Sum 10 6
Creating Tables from Count Data
So far, you have created tables by counting each row in the dataset as one observation. However, occasionally you may have a dataset in which count data has already been aggregated, such as the warpbreaks dataset (included with R).
To create a table from a data frame containing count data, use the xtabs function:
> xtabs(breaks~wool+tension, warpbreaks)
tension
wool L M H
A 401 216 221
B 254 259 169
If the data frame has more than one column of count data, combine them with the cbind function:
> xtabs(cbind(counts1, counts2)~factor1, dataset)
You can create a table object by assigning the output of the xtabs function to a new object name:
> warpbreakstable<-xtabs(breaks~wool+tension, warpbreaks)
To create a data frame of count data from a table object, use the as.data.frame function:
> as.data.frame(Titanic)
Class Sex Age Survived Freq
1 1st Male Child No 0
2 2nd Male Child No 0
3 3rd Male Child No 35
4 Crew Male Child No 0
5 1st Female Child No 0
6 2nd Female Child No 0
7 3rd Female Child No 17
8 Crew Female Child No 0
9 1st Male Adult No 118
10 2nd Male Adult No 154
11 3rd Male Adult No 387
12 Crew Male Adult No 670
13 1st Female Adult No 4
14 2nd Female Adult No 13
15 3rd Female Adult No 89
16 Crew Female Adult No 3
17 1st Male Child Yes 5
18 2nd Male Child Yes 11
19 3rd Male Child Yes 13
20 Crew Male Child Yes 0
21 1st Female Child Yes 1
22 2nd Female Child Yes 13
23 3rd Female Child Yes 14
24 Crew Female Child Yes 0
25 1st Male Adult Yes 57
26 2nd Male Adult Yes 14
27 3rd Male Adult Yes 75
28 Crew Male Adult Yes 192
29 1st Female Adult Yes 140
30 2nd Female Adult Yes 80
31 3rd Female Adult Yes 76
32 Crew Female Adult Yes 20
Creating a Table Directly
Sometimes you may not have a dataset at all, only a table of counts. In order to be able to perform analysis such as the chi-square test (covered later in this chapter), you will need to enter your data into R as a table object.
You can create a one-dimensional table object with the as.table function. Enter the counts for each of the categories as shown here:
> table1D<-as.table(c(5, 21, 17, 3, 1))
When you view the table, you can see that R has given the categories default names of A, B, and so on.
> table1D
A B C D E
5 21 17 3 1
To overwrite the default names with the correct category names, use the row.names function:
> row.names(table1D)<-c("Category 1", "Category 2", "Category 3", "Category 4", "Category 5")
You can also enter a two-dimensional table into R in a similar way. Suppose that you want to enter the data shown in Table 6-1.
Table 6-1. Number of subjects with and without a disease after treatment with either an active treatment or a control

          Resolved  Unresolved
Active          15          11
Control          9          16
To create a two-dimensional table, you will first have to create a matrix object and then convert it to a table object. A matrix is a type of object that holds a rectangular grid of data, where all the data is of the same type (either character strings or numbers). It is a bit like a vector, except with two dimensions. To create a matrix, use the matrix function:
> matrix1<-matrix(c(15, 9, 11, 16), nrow=2)
This creates the matrix shown here:
> matrix1
[,1] [,2]
[1,] 15 11
[2,] 9 16
As you can see, R has arranged the list of values into a two-by-two matrix. R fills the first column of the matrix from top to bottom before moving on to the second column, and so on. The nrow argument tells R how many rows the matrix should have, and the number of columns required is calculated automatically. Alternatively, you can use the ncol argument to specify the number of columns.
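For example (a quick sketch), specifying ncol instead of nrow builds the same matrix:
> matrix1<-matrix(c(15, 9, 11, 16), ncol=2)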
To give proper labels to the table dimensions and levels, use the dimnames argument:
> matrix1<-matrix(c(15, 9, 11, 16), nrow=2, dimnames=list(Group=c("Active", "Control"),
Disease=c("Resolved", "Unresolved")))
Once you have created the matrix, use the as.table function to convert it to a table object:
> treattable<-as.table(matrix1)
Alternatively, you can create the table in one step by nesting the matrix function inside the as.table function:
> treattable<-as.table(matrix(c(15, 9, 11, 16), nrow=2, dimnames=list(Group=c("Active", "Control"),
Disease=c("Resolved", "Unresolved"))))
Chi-Square Goodness-of-Fit Test
The chi-square goodness-of-fit test (a form of Pearson's chi-square, or χ², test) allows you to compare categorical data with a theoretical distribution. It has the null hypothesis that the data follows the specified distribution, and the alternative hypothesis that it does not. The test is only suitable if sufficient data is available, which is commonly defined as each category having an expected frequency (under the null hypothesis) of at least five.
The test should not be confused with the chi-square test of association (see the next section in this chapter), which helps to determine whether two or more categorical variables are associated. The chi-square goodness-of-fit test and the chi-square test of association are both forms of the Pearson chi-square test, but they answer different questions about the data. To illustrate the difference, if you had recorded the results of a series of six-sided die rolls and wanted to use this data to determine whether your die was fair, you would use the chi-square goodness-of-fit test. The null hypothesis for the test would be that each of the six sides is equally likely to be rolled, with probability 1/6. However, if you had recorded the name of the person rolling the die as well as the result of the die roll and you wanted to determine whether there was any relationship between the result of the roll and the person that rolled the die, you would use the chi-square test of association.
Note For more details about hypothesis testing, see Chapter 10.
You can perform a chi-square goodness-of-fit test with the chisq.test function. If you have a one-dimensional table object, you can test it against the uniform distribution (i.e., against the null hypothesis that all categories are equally likely to occur), as shown here:
> chisq.test(tableobject)
If you are using raw data, nest the table function inside the chisq.test function:
> chisq.test(table(dataset$factor1))
To test the data against a different theoretical distribution, use the p argument to give a list of expected relative frequencies under the null hypothesis. You must give the same number of relative frequencies as there are levels in your table, and they must sum to one.
For example, to test the hypothesis that 10 percent of the population belong to the first category, 40 percent to the second category, 40 percent to the third category, and 10 percent to the fourth category, use the command:
> chisq.test(tableobject, p=c(0.1, 0.4, 0.4, 0.1))
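If you want to check the expected-frequency condition mentioned earlier, note that the object returned by chisq.test includes the expected counts. A minimal sketch, assuming a four-level table object:
> gof<-chisq.test(tableobject, p=c(0.1, 0.4, 0.4, 0.1))
> gof$expected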
EXAMPLE 6-1.
CHI-SQUARE GOODNESS-OF-FIT TEST USING THE APARTMENTS DATA
Consider the apartments dataset (available with the downloads for this book and described in Appendix C), which gives details of thirty-nine one-bedroom apartments advertised for rent in a particular area of the United Kingdom in October 2012. The Price.Cat variable gives the rental price category for the apartment.
To create a table showing the number of apartments in each price category, use the command:
> table(apartments$Price.Cat)
£500-550 £551-600 £601-650 £651+
5 13 8 6
It is believed that 20 percent of the one-bedroom apartments in this area have a rental price of £550 or less, 30 percent have a price between £551 and £600, 30 percent have a price between £601 and £650, and 20 percent have a rental price greater than £650.
Suppose that you want to use a chi-square goodness-of-fit test to determine whether the data is consistent with the hypothesized price distribution. The test has the null hypothesis that the described price distribution is correct, and the alternative hypothesis that it is not. A significance level of 0.05 is used.
To perform the test, use the command:
> chisq.test(table(apartments$Price.Cat), p=c(0.2, 0.3, 0.3, 0.2))
This gives the output:
Chi-squared test for given probabilities
data: table(apartments$Price.Cat)
X-squared = 1.8021, df = 3, p-value = 0.6145
The p-value of 0.6145 is not less than the significance level of 0.05, so we cannot reject the null hypothesis. This means that the data is consistent with the hypothesized price distribution.
Tests of Association Between Categorical Variables
In the same way that we can look at the association between continuous variables using statistics such as covariance and correlation, there are methods available to help you to determine whether there is an association between categorical variables. Two of these are the chi-square test of association and Fisher’s exact test. These tests answer the same question but are suitable in different situations.
The chi-square test of association is used to test for association between two or more variables, and can be used regardless of how many levels there are in each variable. However, the test is only suitable when there is plenty of data available.
By contrast, Fisher’s exact test can only be used to look for association between two categorical variables which each have two levels. However, it is suitable even when very little data is available.
The following sections explain how to perform these tests in R.
Chi-Square Test of Association
The chi-square test of association (sometimes called the chi-square test of independence) helps to determine whether two or more categorical variables are associated. The test has the null hypothesis that the variables are independent and the alternative hypothesis that they are not independent (i.e., at least two of the variables are associated). The test is only suitable if there is sufficient data, which is commonly defined as all table cells having expected counts (under the null hypothesis) of at least five.
The summary function performs a chi-square test of association when given a table object as input. If you have already created a table object, use the summary function directly:
> summary(tableobject)
If you have raw data, nest the table function inside the summary function:
> summary(table(dataset$var1,dataset$var2,dataset$var3))
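For a two-dimensional table, the chisq.test function performs the same test and can be used as an alternative (note that for two-by-two tables it applies a continuity correction by default, so its result may differ slightly from the summary output):
> chisq.test(tableobject)
> chisq.test(table(dataset$var1, dataset$var2))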
EXAMPLE 6-2.
CHI-SQUARE TEST OF ASSOCIATION USING PEOPLE2 DATA
Suppose that you want to use a chi-square test of association to determine whether sex and eye color are associated, using the people2 dataset.
If you still have the sexeyetable object created earlier in this chapter in the “Frequency Tables” section, use the command:
> summary(sexeyetable)
Alternatively, you can perform the test using the raw data:
> summary(table(people2$Sex, people2$Eye.Color))
Number of cases in table: 16
Number of factors: 2
Test for independence of all factors:
Chisq = 2.2857, df = 2, p-value = 0.3189
Chi-squared approximation may be incorrect
As the p-value of 0.3189 is not less than the significance level of 0.05, we cannot reject the null hypothesis that sex and eye color are independent. This means that there is no evidence of an association between the two.
The warning Chi-squared approximation may be incorrect appears because some cells have expected counts of less than five. This means that the results may be unreliable and should be interpreted with caution. A discussion of chi-square tests with small expected counts can be found on page 39 of The Analysis of Contingency Tables, Second Edition, by B. S. Everitt (Chapman & Hall/CRC, 1992).
EXAMPLE 6-3.
CHI-SQUARE TEST OF ASSOCIATION USING THE ESOPH DATA
Consider the esoph dataset, which is included with R. The dataset gives the results of a case-control study of oesophageal cancer. You can view more details by entering help(esoph). The agegp, alcgp and tobgp variables list categories for the subjects’ age, alcohol consumption, and smoking habits. The variables ncases and ncontrols give the number of subjects with and without oesophageal cancer that fall into each of these categories.
Suppose that you want to use a chi-square test of association to determine whether there is any association between smoking habits and oesophageal cancer. The test has the null hypothesis that smoking habits and oesophageal cancer are independent, and the alternative hypothesis that they are associated. A significance level of 0.05 will be used.
As the dataset contains data that is already expressed as counts, you can use the xtabs function to create a table giving the number of subjects with and without oesophageal cancer that fall into each category of smoking habits:
> tobacco<-xtabs(cbind(ncases, ncontrols)~tobgp, esoph)
> tobacco
tobgp ncases ncontrols
0-9g/day 78 525
10-19 58 236
20-29 33 132
30+ 31 82
To perform the test, use the command:
> summary(tobacco)
Call: xtabs(formula = cbind(ncases, ncontrols) ~ tobgp, data = esoph)
Number of cases in table: 1175
Number of factors: 2
Test for independence of all factors:
Chisq = 18.363, df = 3, p-value = 0.0003702
As the p-value of 0.0003702 is less than the significance level of 0.05, we can reject the null hypothesis that smoking category and oesophageal cancer are independent, in favour of the alternative hypothesis that the two are associated. This means that there is evidence of a relationship between smoking habits and oesophageal cancer.
Fisher’s Exact Test
Fisher’s exact test is used to test for association between two categorical variables that each have two levels. Unlike the chi-square test of association, it can be used even when very little data is available. The test has the null hypothesis that the two variables are independent and the alternative hypothesis that they are not independent.
You can perform a Fisher's exact test with the fisher.test function. You can use the function with a two-by-two table object:
> fisher.test(tableobject)
You can also use raw data:
> fisher.test(dataset$var1,dataset$var2)
The test results are accompanied by a 95 percent confidence interval for the odds ratio. You can change the size of the interval with the conf.level argument:
> fisher.test(dataset$var1,dataset$var2, conf.level=0.99)
EXAMPLE 6-4.
FISHER'S EXACT TEST USING PEOPLE2 DATA
Using the table function, we can see there are more left-handed women than left-handed men in the people2 dataset:
> table(people2$Sex, people2$Handedness)
Left Right
Male 0 9
Female 2 2
Suppose you want to perform a Fisher's exact test to determine whether there is any statistically significant relationship between sex and handedness. The test has the null hypothesis that sex and handedness are independent, and the alternative hypothesis that they are associated. A significance level of 0.05 is used.
To perform the test, use the command:
> fisher.test(people2$Sex, people2$Handedness)
This gives the output:
Fisher's Exact Test for Count Data
data: people2$Sex and people2$Handedness
p-value = 0.07692
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.000000 2.098087
sample estimates:
odds ratio
0
Because the p-value of 0.07692 is not less than the significance level of 0.05, we cannot reject the null hypothesis that sex and handedness are independent. This means that there is no evidence of a relationship between sex and handedness.
Proportions Test
A test of proportions allows you to compare the proportion of observations with a given outcome or attribute (referred to as a success) across two or more groups of observations, to determine whether the proportions differ significantly. The null hypothesis for the test is that the probability of a success is the same for all of the groups, and the alternative hypothesis is that it is not the same for all of the groups.
The test can be applied to an n-by-two contingency table, where the two columns give the number of successes and failures, and the n rows give the number that fall into each group.
You can perform a test of proportions with the prop.test function. If your data is already saved in a table object, you can use the command:
> prop.test(tableobject)
For the command to work, your table must have exactly two columns, and the rows of the table should show the different groups that you want to compare. If your table is the wrong way around, so that the columns give groups and the rows give successes and failures, you can use the t function to transpose the table:
> prop.test(t(tableobject))
If you don’t have a table object, you can perform the test using raw data by nesting the table function inside the prop.test function:
> prop.test(table(dataset$groups, dataset$outcome))
Alternatively, you can perform the test using raw count data:
> prop.test(c(success1,success2), c(n1,n2))
where success1 and success2 are the number of successes in Group 1 and Group 2, respectively, and n1 and n2 are the total number of observations in Group 1 and Group 2, respectively.
EXAMPLE 6-5.
PROPORTIONS TEST
Consider the data shown in Table 6-1 (under “Creating a Table Directly”), which gives the number of patients with resolved and unresolved disease after treatment with either an active treatment or a control. If you still have the treattable table object, you can view the contents:
> treattable
Disease
Group Resolved Unresolved
Active 15 11
Control 9 16
To see the proportion with resolved and unresolved disease for each treatment group, use the command:
> prop.table(treattable, margin=1)
Disease
Group Resolved Unresolved
Active 0.5769231 0.4230769
Control 0.3600000 0.6400000
From the output, you can see that the proportion of subjects with resolved disease is 58 percent in the active treatment group and 36 percent in the control group. Suppose that you want to use a test of proportions to help determine whether this difference is significant or whether it is just the result of random variation. A significance level of 0.05 will be used.
To perform the test, enter the command:
> prop.test(treattable)
Alternatively, you can perform the test using the raw data:
> prop.test(c(15, 9), c(26, 25))
Either command performs the same test; the output for the table object is shown here:
2-sample test for equality of proportions with continuity correction
data: treattable
X-squared = 1.6153, df = 1, p-value = 0.2038
alternative hypothesis: two.sided
95 percent confidence interval:
-0.08963846 0.52348461
sample estimates:
prop 1 prop 2
0.5769231 0.3600000
From the output, you can see that the p-value of 0.2038 is not less than the significance level of 0.05, so we cannot reject the null hypothesis that the proportion of patients with resolved disease is the same in both treatment groups. This means that there is no evidence of a difference in the probability of disease resolution for the active treatment and control.
Summary
You should be able to create frequency and contingency tables to summarize categorical data and be able to present them with marginal sums or as percentages if required. You should be able to compare your categorical data to a hypothesized distribution using the chi-square goodness-of-fit test. You should also be able to use the chi-square test of association or Fisher’s exact test to look for association between two categorical variables. Finally, you should be able to use a test to compare two or more proportions.
This table summarizes the most important commands covered.
Task | Command
---|---
Contingency table | table(dataset$factor1, dataset$factor2)
Compact table | ftable(tableobject)
Proportions table | prop.table(tableobject)
Table with margins | addmargins(tableobject)
Chi-square goodness-of-fit test | chisq.test(tableobject, p=c(p1, p2, pN)) or chisq.test(table(dataset$factor1), p=c(p1, p2, pN))
Chi-square test of association | summary(tableobject) or summary(table(dataset$factor1, dataset$factor2))
Fisher’s exact test | fisher.test(tableobject) or fisher.test(dataset$factor1, dataset$factor2)
Proportions test | prop.test(tableobject) or prop.test(table(dataset$groups, dataset$outcome))
Now that you have learned how to summarize continuous and categorical variables, we can move on to the next chapter, in which you will learn about probability distributions.