Contingency tables are frequency tables formed from two or more categorical variables, showing the count for each combination of categories. A frequency table summarizes a single categorical variable, whereas a contingency table cross-tabulates two (or more) categorical variables.
Let's see an example to understand contingency tables, bivariate statistics, and data normality using the Cars93 dataset:
> table(Cars93$Type)

Compact   Large Midsize   Small  Sporty     Van 
     16      11      22      21      14       9 

> table(Cars93$AirBags)

Driver & Passenger        Driver only               None 
                16                 43                 34 
The individual frequency tables for the two categorical variables, AirBags and Type of car, are shown above.
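These one-way counts can also be expressed as proportions by wrapping base R's prop.table() around table(); a quick sketch:

> prop.table(table(Cars93$Type))    # share of each car type
> prop.table(table(Cars93$AirBags)) # share of each airbag configuration

Next, we cross-tabulate the two variables to form the contingency table: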
> contTable <- table(Cars93$Type, Cars93$AirBags)
> contTable
          Driver & Passenger Driver only None
  Compact                  2           9    5
  Large                    4           7    0
  Midsize                  7          11    4
  Small                    0           5   16
  Sporty                   3           8    3
  Van                      0           3    6
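The same table can also be built with the formula interface of base R's xtabs(), which reads naturally for data frames; a minimal equivalent sketch:

> xtabs(~ Type + AirBags, data = Cars93)  # same contingency table via the formula interface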
The contTable object holds the cross-tabulation of the two variables. The proportion of each cell relative to the grand total is shown in the following output. If we need to compute row proportions or column proportions, the margin must be specified as an argument:
> prop.table(contTable)
          Driver & Passenger Driver only  None
  Compact              0.022       0.097 0.054
  Large                0.043       0.075 0.000
  Midsize              0.075       0.118 0.043
  Small                0.000       0.054 0.172
  Sporty               0.032       0.086 0.032
  Van                  0.000       0.032 0.065
For row proportions, the margin argument needs to be 1; for column proportions, it needs to be 2, as in the following commands:
> prop.table(contTable, 1)
          Driver & Passenger Driver only None
  Compact               0.12        0.56 0.31
  Large                 0.36        0.64 0.00
  Midsize               0.32        0.50 0.18
  Small                 0.00        0.24 0.76
  Sporty                0.21        0.57 0.21
  Van                   0.00        0.33 0.67

> prop.table(contTable, 2)
          Driver & Passenger Driver only  None
  Compact              0.125       0.209 0.147
  Large                0.250       0.163 0.000
  Midsize              0.438       0.256 0.118
  Small                0.000       0.116 0.471
  Sporty               0.188       0.186 0.088
  Van                  0.000       0.070 0.176
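As a sanity check, base R's addmargins() can append the marginal sums, confirming that each row of the row-proportion table sums to 1 and each column of the column-proportion table sums to 1; a small sketch:

> addmargins(prop.table(contTable, 1), 2)  # adds a Sum column; each row totals 1
> addmargins(prop.table(contTable, 2), 1)  # adds a Sum row; each column totals 1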
The summary of the contingency table performs a chi-square test of independence between the two categorical variables:
> summary(contTable)
Number of cases in table: 93 
Number of factors: 2 
Test for independence of all factors:
        Chisq = 33, df = 10, p-value = 3e-04
        Chi-squared approximation may be incorrect
The chi-square test of independence for all factors is shown above. The message that the chi-squared approximation may be incorrect arises because some cells of the contingency table contain zeros or have expected counts below 5. As in the preceding case, two random variables, car type and airbags, are independent if the probability distribution of one variable does not affect the probability distribution of the other. The null hypothesis of the chi-square test of independence is that the two variables are independent of each other. Since the p-value of the test is less than 0.05, we can reject the null hypothesis at the 5% level of significance. Hence, the conclusion is that car type and airbags are not independent of each other; the two variables are related.
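To see exactly where the warning comes from, the same test can be run through base R's chisq.test(), whose result object exposes the expected counts; a Monte Carlo p-value sidesteps the small-cell issue. A minimal sketch:

> chi <- chisq.test(contTable)
> round(chi$expected, 1)   # expected counts under independence; cells below 5 trigger the warning
> chisq.test(contTable, simulate.p.value = TRUE)  # Monte Carlo p-value avoids the approximation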
Instead of two variables, what if we add one more dimension to the contingency table? Let's take Origin, and then the table would look as follows:
> contTable <- table(Cars93$Type, Cars93$AirBags, Cars93$Origin)
> contTable
, ,  = USA

          Driver & Passenger Driver only None
  Compact                  1           2    4
  Large                    4           7    0
  Midsize                  2           5    3
  Small                    0           2    5
  Sporty                   2           5    1
  Van                      0           2    3

, ,  = non-USA

          Driver & Passenger Driver only None
  Compact                  1           7    1
  Large                    0           0    0
  Midsize                  5           6    1
  Small                    0           3   11
  Sporty                   1           3    2
  Van                      0           1    3
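The layered printout of a three-way table can be hard to scan; base R's ftable() flattens it into a single display:

> ftable(contTable)  # flat display; by default the last dimension (Origin) spans the columns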
The summary command for the test of independence of all factors can be used to test the null hypothesis:
> summary(contTable)
Number of cases in table: 93 
Number of factors: 3 
Test for independence of all factors:
        Chisq = 65, df = 27, p-value = 5e-05
        Chi-squared approximation may be incorrect
Apart from the graphical methods discussed previously, there are numerical statistical tests that can be used to check whether a variable is normally distributed. The nortest and normtest libraries provide such tests; the tests available across these two libraries are listed as follows:
- Adjusted Jarque-Bera test for normality
- Frosini test for normality
- Geary test for normality
- Hegazy-Green test for normality (first variant)
- Hegazy-Green test for normality (second variant)
- Jarque-Bera test for normality
- Kurtosis test for normality
- Skewness test for normality
- Spiegelhalter test for normality
- Weisberg-Bingham test for normality
- Anderson-Darling test for normality
- Cramér-von Mises test for normality
- Lilliefors (Kolmogorov-Smirnov) test for normality
- Pearson chi-square test for normality
- Shapiro-Francia test for normality
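The first ten tests in this list come from the normtest package and the last five from nortest. Assuming normtest is installed (it is separate from nortest), a couple of its tests can be run in the same fashion; a brief sketch:

> library(normtest)
> jb.norm.test(Cars93$Price)             # Jarque-Bera test
> spiegelhalter.norm.test(Cars93$Price)  # Spiegelhalter test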
Let's apply the normality tests on the Price variable from the Cars93 dataset:
> library(nortest)
> ad.test(Cars93$Price)        # Anderson-Darling test

        Anderson-Darling normality test

data:  Cars93$Price
A = 3, p-value = 9e-07

> cvm.test(Cars93$Price)       # Cramer-von Mises test

        Cramer-von Mises normality test

data:  Cars93$Price
W = 0.5, p-value = 6e-06

> lillie.test(Cars93$Price)    # Lilliefors (KS) test

        Lilliefors (Kolmogorov-Smirnov) normality test

data:  Cars93$Price
D = 0.2, p-value = 1e-05

> pearson.test(Cars93$Price)   # Pearson chi-square test

        Pearson chi-square normality test

data:  Cars93$Price
P = 30, p-value = 3e-04

> sf.test(Cars93$Price)        # Shapiro-Francia test

        Shapiro-Francia normality test

data:  Cars93$Price
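As a cross-check that needs no extra package, base R offers the Shapiro-Wilk test and a normal Q-Q plot; a minimal sketch:

> shapiro.test(Cars93$Price)                  # base R Shapiro-Wilk test
> qqnorm(Cars93$Price); qqline(Cars93$Price)  # points bending away from the line suggest non-normality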
From the previously mentioned tests, it is evident that the Price variable is not normally distributed, as the p-values from all the statistical tests are less than 0.05. If we add more dimensions to the bivariate relationship, it becomes multivariate analysis. Let's try to understand the relationship between the horsepower and length of a car from the Cars93 dataset:
> library(corrplot)
> o <- cor(Cars93[, c("Horsepower", "Length")])
> corrplot(o, method = "circle", main = "Correlation Plot")
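To attach a significance test to this pairwise relationship, base R's cor.test() reports the coefficient together with a confidence interval and p-value; a small sketch:

> o                                            # print the 2 x 2 correlation matrix
> cor.test(Cars93$Horsepower, Cars93$Length)   # Pearson correlation with CI and p-value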
When we include more variables, it becomes a multivariate relationship. Let's plot the multivariate relationships between various variables from the Cars93 dataset:
> library(corrplot)
> t <- cor(Cars93[, c("Price", "MPG.city", "RPM", "Rev.per.mile", "Width", "Weight", "Horsepower", "Length")])
> corrplot(t, method = "ellipse")
There are various methods that can be passed as an argument to the correlation plot: "circle", "square", "ellipse", "number", "shade", "color", and "pie".
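For example, to print the coefficients themselves or to reorder variables by similarity, the method and order arguments can be combined; a short usage sketch with the matrix t computed above:

> corrplot(t, method = "number")                     # display numeric correlation coefficients
> corrplot(t, method = "ellipse", order = "hclust")  # group similar variables by hierarchical clustering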