Continuous and discrete variables

Finally, it is time to check for linear dependencies between a continuous and a discrete variable. You can do this by measuring the variance between means of the continuous variable in different groups of the discrete variable, for example measuring the difference in salary (continuous variable) between different genders (discrete variable). The null hypothesis here is that all variance between means is a result of the variance within each group; for example, there is one woman with a very large salary, which raises the mean salary in the female group and creates a difference from the male group. The variance within groups is also known as the residual, or unexplained, variance. If you reject the null hypothesis, this means that there is some significant variance in the means between groups. You are analyzing the variance of the means, so this analysis is called the analysis of variance, or ANOVA. A simpler test is the Student's t-test, which you can use to test for the differences between means in two groups only.

For a simple one-way ANOVA, which tests the means (averages) of a continuous variable across the groups of one independent discrete variable, you calculate the variance between groups, MSA, as the sum of squares of deviations of the group means from the overall mean, each multiplied by the number of cases in the group, divided by the degrees of freedom, which equal the number of groups minus one. The formula is:

$$MSA = \frac{SSA}{DF_A}, \qquad SSA = \sum_{i=1}^{a} n_i (\mu_i - \mu)^2, \qquad DF_A = a - 1$$

The discrete variable has a discrete states, ni is the number of cases in the ith group, µ is the overall mean of the continuous variable, and µi is the mean of the continuous variable in the ith group of the discrete variable.

You calculate the variance within groups, MSE, as the sum over groups of the sum of squares of deviations of individual values from the group mean, divided by the degrees of freedom, which equal the sum over groups of the number of rows in each group minus one:

$$MSE = \frac{SSE}{DF_E}, \qquad SSE = \sum_{i=1}^{a} \sum_{j=1}^{n_i} (v_{ij} - \mu_i)^2, \qquad DF_E = \sum_{i=1}^{a} (n_i - 1)$$

The individual value of the continuous variable is denoted as vij (the jth value in the ith group), µi is the mean of the continuous variable in the ith group of the discrete variable, ni is the number of cases in the ith group of the discrete variable, and a is the number of groups.

Once you have both variances, you calculate the so-called F ratio as the ratio between the variance between groups and the variance within groups:

$$F = \frac{MSA}{MSE}$$

A large F value means you can reject the null hypothesis. Tables of the cumulative distribution under the tails of F distributions for different degrees of freedom are already calculated. For given degrees of freedom between groups and within groups, you can look up the critical point beyond which, for example, less than 5% of the area under the F distribution curve lies. If your F value exceeds that point, there is less than a 5% probability of getting such a large F value when the null hypothesis is true (that is, when there is no association between the means and the groups), so you reject the null hypothesis.
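The calculation described above can be sketched by hand in R. This is a minimal illustration on made-up toy data, not the data used in this chapter; the variable names (value, group, msa, mse, and so on) are assumptions chosen for the example. It computes the between-group and within-group variances, the F ratio, and the upper-tail probability with pf(), and then cross-checks the result against aov():

```r
# Toy data (hypothetical): a continuous value measured in three groups
value <- c(10, 12, 11, 15, 17, 16, 20, 22, 21)
group <- factor(rep(c("A", "B", "C"), each = 3))

grand_mean  <- mean(value)                   # overall mean
group_means <- tapply(value, group, mean)    # mean of each group
group_sizes <- tapply(value, group, length)  # number of cases in each group

# Variance between groups: sum of squares divided by (number of groups - 1)
ssa <- sum(group_sizes * (group_means - grand_mean)^2)
dfa <- nlevels(group) - 1
msa <- ssa / dfa

# Variance within groups: sum of squares divided by sum of (group size - 1)
sse <- sum((value - group_means[as.character(group)])^2)
dfe <- sum(group_sizes - 1)
mse <- sse / dfe

# F ratio and the area under the F distribution beyond it
f_ratio <- msa / mse
p_value <- pf(f_ratio, dfa, dfe, lower.tail = FALSE)

# Cross-check against the table produced by aov()
aov_table <- summary(aov(value ~ group))[[1]]
print(c(F = f_ratio, p = p_value))
print(aov_table)
```

The F value and Pr(>F) columns of the aov() table should match the hand-calculated values, which shows that aov() performs the same decomposition into between-group and within-group variance.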

The following code checks for the differences in mean between two groups: it checks the YearlyIncome mean in the groups of Gender and HouseOwnerFlag variables. Note that the last line, after the comment, produces an error, because you can't use t.test for more than two groups:

t.test(YearlyIncome ~ Gender); 
t.test(YearlyIncome ~ HouseOwnerFlag); 
# Error - t-test supports only two groups 
t.test(YearlyIncome ~ Education); 
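One way to avoid the error is to check the number of groups before choosing the test. The following sketch uses made-up toy data, and compare_means() is a hypothetical helper written for this illustration, not a function from R or from this chapter:

```r
# Toy data (hypothetical): income in thousands, grouped two different ways
income    <- c(30, 35, 40, 50, 55, 60)
two_grp   <- factor(c("F", "F", "F", "M", "M", "M"))
three_grp <- factor(c("a", "a", "b", "b", "c", "c"))

# Hypothetical helper: t.test() for exactly two groups, aov() otherwise
compare_means <- function(y, g) {
  g <- droplevels(factor(g))      # ignore unused levels when counting groups
  if (nlevels(g) == 2) {
    t.test(y ~ g)                 # two groups: t-test
  } else {
    summary(aov(y ~ g))           # more than two groups: one-way ANOVA
  }
}

print(compare_means(income, two_grp))
print(compare_means(income, three_grp))
```

The first call returns the t-test result and the second returns an ANOVA table, so the same call works regardless of how many groups the discrete variable has.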

Instead of using the t.test() function, you can use the aov() function to check for the variance of the YearlyIncome means in the five groups of Education, as the following code shows. Note that the code first correctly orders the Education variable:

Education = factor(Education, ordered = TRUE,
                   levels = c("Partial High School",
                              "High School", "Partial College",
                              "Bachelors", "Graduate Degree"));
AssocTest <- aov(YearlyIncome ~ Education);
summary(AssocTest);

If you execute the preceding code and check the results, you can conclude that yearly income is associated with the level of education. You can see that the F value (324.7) is quite high and the probability of getting such a high F value by chance is very low (<2e-16). You can also visualize the differences in the distribution of a continuous variable in groups of a discrete variable with the boxplot() function:

boxplot(YearlyIncome ~ Education, 
        main = "Yearly Income in Groups", 
        notch = TRUE, 
        varwidth = TRUE, 
        col = "orange", 
        ylab = "Yearly Income", 
        xlab = "Education"); 

The results of the box plot are shown in the following screenshot:

Variability of means in groups