Describing the relationships between two categorical variables is done somewhat less often than the other two broad types of bivariate analyses, but it is just as fun (and useful)!
To explore this technique, we will be using the dataset UCBAdmissions,
which contains the data on graduate school applicants to the University of California Berkeley in 1973.
Before we get started, we have to wrap the dataset in a call to data.frame
to coerce it into a data frame. I'll explain why soon.
> ucba <- data.frame(UCBAdmissions)
> head(ucba)
     Admit Gender Dept Freq
1 Admitted   Male    A  512
2 Rejected   Male    A  313
3 Admitted Female    A   89
4 Rejected Female    A   19
5 Admitted   Male    B  353
6 Rejected   Male    B  207
Now, what we want is a count of the number of students in each of the four combinations of gender and admission status: admitted males, rejected males, admitted females, and rejected females.
Do you remember the frequency tabulation at the beginning of the last chapter? This is similar, except that now we are dividing the set by one more variable. This is known as cross-tabulation, or cross tab. The resulting table is also sometimes referred to as a contingency table. We had to coerce UCBAdmissions
into a data frame because it was already in the form of a cross-tabulation (except that it further broke the data down by the different departments of the grad school). Check it out by typing UCBAdmissions
at the prompt.
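As a quick way to see that structure without reading the full printout, the following sketch inspects the object's class and dimensions. It uses only base R functions; the comments describe what I would expect for the dataset as it ships with R.

```r
# UCBAdmissions ships with base R (package 'datasets') as a 3-way table
class(UCBAdmissions)      # "table", not "data.frame"
dim(UCBAdmissions)        # 2 x 2 x 6: Admit x Gender x Dept
dimnames(UCBAdmissions)   # the levels along each dimension

# Coercing it yields one row per cell, with the cell count in Freq
ucba <- data.frame(UCBAdmissions)
nrow(ucba)                # 24 rows = 2 * 2 * 6 cells
sum(ucba$Freq)            # total number of applicants in the table
```

The long, one-row-per-cell layout is what lets us re-tabulate the counts by whichever variables we choose.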
We can use the xtabs
function in R to make our own cross-tabulations:
# the first argument to xtabs (the formula) should
# be read as: frequency *by* Gender and Admission
> cross <- xtabs(Freq ~ Gender+Admit, data=ucba)
> cross
        Admit
Gender   Admitted Rejected
  Male       1198     1493
  Female      557     1278
Here, at a glance, we can see that 1198 males were admitted, 557 females were admitted, and so on.
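Because UCBAdmissions is already a table, the same 2x2 counts can also be reached by summing over the department dimension. The sketch below is an alternative route rather than the approach used in this chapter; it relies on base R's margin.table and addmargins.

```r
ucba  <- data.frame(UCBAdmissions)
cross <- xtabs(Freq ~ Gender + Admit, data = ucba)

# Collapse the original 3-way table over Dept, keeping
# Gender (dimension 2) as rows and Admit (dimension 1) as columns
collapsed <- margin.table(UCBAdmissions, c(2, 1))
all(collapsed == cross)   # TRUE if both routes agree

# addmargins() appends row and column totals, labelled "Sum"
addmargins(cross)
```

The totals make it easy to see how many males and females applied overall, which matters for the comparison that follows.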
Is there a gender bias in UCB's graduate admissions process? Perhaps, but it's hard to tell from just looking at the 2x2 contingency table. Sure, there are fewer females accepted than males, but there are also, unfortunately, far fewer females that applied to UCB in the first place.
To help us either implicate UCB in running a sexist admissions machine or exonerate it, it helps to look at a proportions table. Using a proportions table, we can easily compare the proportion of the total number of males who were accepted with the proportion of the total number of females who were accepted. If the proportions are more or less equal, that suggests gender is not a factor in UCB's admissions process. If this is the case, gender and admission status are said to be independent.
> prop.table(cross, 1)
        Admit
Gender    Admitted  Rejected
  Male   0.4451877 0.5548123
  Female 0.3035422 0.6964578
Here, we can see that while 45 percent of the males who applied were accepted, only 30 percent of the females who applied were accepted. This is evidence that the admissions department is sexist, right? Not so fast, my friend!
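The second argument to prop.table picks the margin to normalize by. The sketch below reproduces the row proportions by hand, to make clear what margin 1 does. The variable name by_hand is only illustrative.

```r
ucba  <- data.frame(UCBAdmissions)
cross <- xtabs(Freq ~ Gender + Admit, data = ucba)

# margin = 1 divides each cell by its row total, so each row sums to 1
by_hand <- cross / rowSums(cross)
all.equal(as.vector(by_hand), as.vector(prop.table(cross, 1)))
rowSums(prop.table(cross, 1))

# margin = 2 normalizes by column instead: of everyone admitted
# (or rejected), what share was male and what share was female?
prop.table(cross, 2)
```

For the question asked here, the row version is the relevant one, because it compares admission rates within each gender.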
This is precisely what a lawsuit lodged against UCB purported. When the issue was looked into further, it was discovered that, at the department level, women and men actually had similar admissions rates. In fact, some of the departments appeared to have a small but significant bias in favor of women. Check out department A's proportion table, for example:
> cross2 <- xtabs(Freq ~ Gender + Admit, data=ucba[ucba$Dept=="A",])
> prop.table(cross2, 1)
        Admit
Gender    Admitted  Rejected
  Male   0.6206061 0.3793939
  Female 0.8240741 0.1759259
If there were any bias in admissions, these data didn't prove it. This phenomenon, where a trend that appears in combined groups of data disappears or reverses when broken down into groups, is known as Simpson's Paradox. In this case, it was caused by the fact that women tended to apply to departments that were far more selective.
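To look at every department at once rather than one at a time, the following sketch works directly on the original three-way table. It is one possible way to lay out the comparison, not the method used above; the names rates and applicants are illustrative.

```r
# Admission rate within each Gender x Dept cell: normalizing over
# margins c(2, 3) makes Admitted + Rejected sum to 1 in every cell
rates <- prop.table(UCBAdmissions, c(2, 3))["Admitted", , ]
round(rates, 2)

# Number of applicants per Gender x Dept
applicants <- margin.table(UCBAdmissions, c(2, 3))
applicants

# Share of each gender's applications that went to each department
round(prop.table(applicants, 1), 2)
```

Reading the first table column by column compares men and women within a department, while the last table shows where each gender's applications were concentrated.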
This is probably the most famous case of Simpson's Paradox, and it is also why this dataset is built into R. The lesson here is to be careful when using pooled data, and look out for hidden variables.