Contingency tables, bivariate statistics, and checking for data normality

Contingency tables are frequency tables for two or more categorical variables, along with the proportion of observations that each class represents as a group. A frequency table summarizes a single categorical variable, whereas a contingency table cross-tabulates two (or more) categorical variables.

Let's see an example to understand contingency tables, bivariate statistics, and data normality using the Cars93 dataset:

> table(Cars93$Type)

Compact   Large Midsize   Small  Sporty     Van
     16      11      22      21      14       9

> table(Cars93$AirBags)

Driver & Passenger        Driver only               None
                16                 43                 34

The individual frequency tables for the two categorical variables, AirBags and Type, are shown above. Cross-tabulating them produces the contingency table:

> contTable <- table(Cars93$Type, Cars93$AirBags)
> contTable

          Driver & Passenger Driver only None
  Compact                  2           9    5
  Large                    4           7    0
  Midsize                  7          11    4
  Small                    0           5   16
  Sporty                   3           8    3
  Van                      0           3    6

The contTable object holds the cross-tabulation of the two variables. The proportion of each cell relative to the overall total is reflected in the following table. If we need row or column proportions instead, the margin must be specified as an argument:

> prop.table(contTable)

          Driver & Passenger Driver only  None
  Compact              0.022       0.097 0.054
  Large                0.043       0.075 0.000
  Midsize              0.075       0.118 0.043
  Small                0.000       0.054 0.172
  Sporty               0.032       0.086 0.032
  Van                  0.000       0.032 0.065

For row proportions, the margin argument needs to be 1, and for column proportions it needs to be 2 in the preceding command:

> prop.table(contTable, 1)

          Driver & Passenger Driver only None
  Compact               0.12        0.56 0.31
  Large                 0.36        0.64 0.00
  Midsize               0.32        0.50 0.18
  Small                 0.00        0.24 0.76
  Sporty                0.21        0.57 0.21
  Van                   0.00        0.33 0.67

> prop.table(contTable, 2)

          Driver & Passenger Driver only  None
  Compact              0.125       0.209 0.147
  Large                0.250       0.163 0.000
  Midsize              0.438       0.256 0.118
  Small                0.000       0.116 0.471
  Sporty               0.188       0.186 0.088
  Van                  0.000       0.070 0.176

The summary of the contingency table performs a chi-square test of independence between the two categorical variables:

> summary(contTable)

Number of cases in table: 93
Number of factors: 2
Test for independence of all factors:
        Chisq = 33, df = 10, p-value = 3e-04
        Chi-squared approximation may be incorrect

The chi-square test of independence for all factors is shown above. The message that the chi-squared approximation may be incorrect appears because some cells of the contingency table are empty or have expected counts below 5. As in the preceding case, two random variables, car type and airbags, are independent if the probability distribution of one variable does not affect the probability distribution of the other. The null hypothesis of the chi-square test of independence is that the two variables are independent. Since the p-value from the test is less than 0.05, we can reject the null hypothesis at the 5% level of significance. Hence, the conclusion is that car type and airbags are not independent of each other; they are related.
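To see where that warning comes from, we can inspect the expected counts directly. The following is a minimal sketch using base R's chisq.test(), which is the same test that summary() runs; it assumes the MASS package (which ships the Cars93 data) is loaded:

library(MASS)                 # provides the Cars93 dataset
tab <- table(Cars93$Type, Cars93$AirBags)
indep <- chisq.test(tab)      # chi-square test of independence
indep$expected                # expected counts under independence
# Cells with expected counts below 5 trigger the approximation warning;
# chisq.test(tab, simulate.p.value = TRUE) sidesteps the approximation
# by simulating the null distribution instead.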

Instead of two variables, what if we add one more dimension to the contingency table? Let's add Origin; the table would then look as follows:

> contTable <- table(Cars93$Type, Cars93$AirBags, Cars93$Origin)
> contTable

, , = USA

          Driver & Passenger Driver only None
  Compact                  1           2    4
  Large                    4           7    0
  Midsize                  2           5    3
  Small                    0           2    5
  Sporty                   2           5    1
  Van                      0           2    3

, , = non-USA

          Driver & Passenger Driver only None
  Compact                  1           7    1
  Large                    0           0    0
  Midsize                  5           6    1
  Small                    0           3   11
  Sporty                   1           3    2
  Van                      0           1    3

The summary command for the test of independence of all factors can be used to test the null hypothesis:

> summary(contTable)

Number of cases in table: 93
Number of factors: 3
Test for independence of all factors:
        Chisq = 65, df = 27, p-value = 5e-05
        Chi-squared approximation may be incorrect
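Higher-dimensional tables quickly become hard to read in this layered layout. One option is base R's ftable(), which flattens the array into a single compact cross-tabulation; a minimal sketch:

# Flatten the three-way Type x AirBags x Origin table into a single
# cross-tabulation that is easier to scan
ftable(contTable)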

Apart from the graphical methods discussed previously, there are some numerical statistical tests that can be used to check whether a variable is normally distributed. The normtest and nortest packages provide such tests; the functions from these two libraries that help in assessing data normality are listed as follows (a short usage example appears after the list):

From the normtest package:

ajb.norm.test            Adjusted Jarque-Bera test for normality
frosini.norm.test        Frosini test for normality
geary.norm.test          Geary test for normality
hegazy1.norm.test        Hegazy-Green test for normality
hegazy2.norm.test        Hegazy-Green test for normality
jb.norm.test             Jarque-Bera test for normality
kurtosis.norm.test       Kurtosis test for normality
skewness.norm.test       Skewness test for normality
spiegelhalter.norm.test  Spiegelhalter test for normality
wb.norm.test             Weisberg-Bingham test for normality

From the nortest package:

ad.test                  Anderson-Darling test for normality
cvm.test                 Cramér-von Mises test for normality
lillie.test              Lilliefors (Kolmogorov-Smirnov) test for normality
pearson.test             Pearson chi-square test for normality
sf.test                  Shapiro-Francia test for normality
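The first ten functions come from the normtest package. A minimal sketch of how they are called, assuming normtest is installed (it has at times been archived from CRAN, so it may need to be installed from a source archive):

library(MASS)       # provides the Cars93 dataset
library(normtest)   # assumed installed; provides the *.norm.test functions
jb.norm.test(Cars93$Price)        # Jarque-Bera test
skewness.norm.test(Cars93$Price)  # test based on sample skewness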

Let's now apply the nortest functions to the Price variable from the Cars93 dataset:

> library(nortest)

> ad.test(Cars93$Price)   # Anderson-Darling test

        Anderson-Darling normality test

data:  Cars93$Price
A = 3, p-value = 9e-07

> cvm.test(Cars93$Price)   # Cramer-von Mises test

        Cramer-von Mises normality test

data:  Cars93$Price
W = 0.5, p-value = 6e-06

> lillie.test(Cars93$Price)   # Lilliefors (KS) test

        Lilliefors (Kolmogorov-Smirnov) normality test

data:  Cars93$Price
D = 0.2, p-value = 1e-05

> pearson.test(Cars93$Price)   # Pearson chi-square

        Pearson chi-square normality test

data:  Cars93$Price
P = 30, p-value = 3e-04

> sf.test(Cars93$Price)   # Shapiro-Francia test

        Shapiro-Francia normality test

data:  Cars93$Price

From the preceding tests, it is evident that the Price variable is not normally distributed, as the p-values from all the statistical tests are less than 0.05. If we add more dimensions to a bivariate relationship, it becomes multivariate analysis. Let's first look at the bivariate relationship between the horsepower and the length of a car from the Cars93 dataset:

> library(corrplot)
> o <- cor(Cars93[, c("Horsepower", "Length")])
> corrplot(o, method = "circle", main = "Correlation Plot")

[Correlation plot of Horsepower versus Length]

When we include more variables, it becomes a multivariate relationship. Let's plot the multivariate relationships between various variables from the Cars93 dataset:

> library(corrplot)
> t <- cor(Cars93[, c("Price", "MPG.city", "RPM", "Rev.per.mile", "Width", "Weight", "Horsepower", "Length")])
> corrplot(t, method = "ellipse")

[Correlation plot of the selected Cars93 variables]

There are various methods that can be passed as an argument to the correlation plot: "circle", "square", "ellipse", "number", "shade", "color", and "pie".
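For instance, the same correlation matrix t can be redrawn with any of these methods; a quick sketch:

# Redraw the correlation matrix with numeric correlation values...
corrplot(t, method = "number")
# ...or with pie glyphs, where the filled fraction encodes the strength
corrplot(t, method = "pie")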
