Relationships between two categorical variables

Describing the relationships between two categorical variables is done somewhat less often than the other two broad types of bivariate analyses, but it is just as fun (and useful)!

To explore this technique, we will be using the dataset UCBAdmissions, which contains data on graduate school applicants to the University of California, Berkeley in 1973.

Before we get started, we have to wrap the dataset in a call to data.frame to coerce it into a data frame. I'll explain why soon.

  > ucba <- data.frame(UCBAdmissions)
  > head(ucba)
       Admit Gender Dept Freq
  1 Admitted   Male    A  512
  2 Rejected   Male    A  313
  3 Admitted Female    A   89
  4 Rejected Female    A   19
  5 Admitted   Male    B  353
  6 Rejected   Male    B  207

Now, what we want is a count of the number of applicants in each of the following four categories:

  • Admitted female
  • Rejected female
  • Admitted male
  • Rejected male

Do you remember the frequency tabulation at the beginning of the last chapter? This is similar, except that now we are dividing the set by one more variable. This is known as a cross-tabulation, or cross tab; it is also sometimes referred to as a contingency table. We had to coerce UCBAdmissions into a data frame because it was already in the form of a cross-tabulation (one that further breaks the data down by department of the grad school). Check it out by typing UCBAdmissions at the prompt.
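If printing the whole object is too noisy, the following sketch inspects its structure instead. It uses only base R; the variable name from_table is just illustrative.

```r
# UCBAdmissions ships with R (package 'datasets') as a three-way table
class(UCBAdmissions)             # "table"
dim(UCBAdmissions)               # 2 2 6
names(dimnames(UCBAdmissions))   # "Admit" "Gender" "Dept"

# Coercing the table to a data frame gives one row per cell,
# with the cell count stored in the Freq column
from_table <- as.data.frame(UCBAdmissions)
nrow(from_table)                 # 24 rows = 2 * 2 * 6 cells
sum(from_table$Freq)             # 4526 applicants in total
```

The 24 rows are the 2 x 2 x 6 cells of the original table, which is why each row of the data frame carries a count rather than describing a single applicant.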

We can use the xtabs function in R to make our own cross-tabulations:

  # the first argument to xtabs (the formula) should
  # be read as: frequency *by* Gender and Admission
  > cross <- xtabs(Freq ~ Gender+Admit, data=ucba)
  > cross
          Admit
  Gender   Admitted Rejected
    Male       1198     1493
    Female      557     1278

Here, at a glance, we can see that 1198 males were admitted, 557 females were admitted, and so on.

Is there a gender bias in UCB's graduate admissions process? Perhaps, but it's hard to tell from just looking at the 2x2 contingency table. Sure, fewer females were accepted than males, but, unfortunately, far fewer females applied to UCB in the first place.
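To put numbers on how many people of each gender applied, one option is to sum the cross-tabulation over the admission outcome. This is a small sketch using base R's margin.table and addmargins; the name applicants is only illustrative.

```r
ucba  <- data.frame(UCBAdmissions)
cross <- xtabs(Freq ~ Gender + Admit, data = ucba)

# Row totals: the number of applicants of each gender
applicants <- margin.table(cross, 1)
applicants            # Male 2691, Female 1835

# The same cross-tabulation with row and column totals appended
addmargins(cross)
```

With 2691 male and 1835 female applicants, raw counts of admitted students are not directly comparable across the two groups.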

To help us either implicate UCB in running a sexist admissions machine or exonerate them, it would help to look at a proportions table. Using a proportions table, we can easily compare the proportion of male applicants who were accepted with the proportion of female applicants who were accepted. If the proportions are more or less equal, the pooled data give no evidence that gender is a factor in UCB's admissions process. If this is the case, gender and admission status are said to be independent.

  > prop.table(cross, 1)
          Admit
  Gender    Admitted  Rejected
    Male   0.4451877 0.5548123
    Female 0.3035422 0.6964578

Note

Why did we supply 1 as an argument to prop.table? Look up the documentation at the R prompt. When would we want to use prop.table(cross, 2)?

Here, we can see that while 45 percent of the males who applied were accepted, only 30 percent of the females who applied were accepted. This is evidence that the admissions department is sexist, right? Not so fast, my friend!

This is precisely what a lawsuit lodged against UCB purported. When the issue was looked into further, it was discovered that, at the department level, women and men actually had similar admissions rates. In fact, some of the departments appeared to have a small but significant bias in favor of women. Check out department A's proportion table, for example:

  > cross2 <- xtabs(Freq ~ Gender + Admit, data=ucba[ucba$Dept=="A",])
  > prop.table(cross2, 1)
          Admit
  Gender    Admitted  Rejected
    Male   0.6206061 0.3793939
    Female 0.8240741 0.1759259

If there were any bias in admissions, these data didn't prove it. This phenomenon, where a trend that appears in combined groups of data disappears or reverses when the data are broken down into groups, is known as Simpson's Paradox. In this case, it was caused by the fact that women tended to apply to departments that were far more selective.
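Department A is only one of six departments. As a rough sketch of how the same comparison could be made for all of them at once, the following builds a three-way cross-tabulation and takes proportions within each gender and department combination. The names cross3, admit_rate, applicants, and dept_rate are illustrative.

```r
ucba <- data.frame(UCBAdmissions)

# Three-way cross-tabulation: Gender x Admit x Dept
cross3 <- xtabs(Freq ~ Gender + Admit + Dept, data = ucba)

# Proportions within each Gender/Dept combination (margins 1 and 3),
# keeping only the "Admitted" slice: rows are genders, columns are departments
admit_rate <- prop.table(cross3, c(1, 3))[, "Admitted", ]
round(admit_rate, 2)

# Number of applicants per gender and department
applicants <- margin.table(cross3, c(1, 3))
applicants

# Overall admission rate per department, a rough measure of selectivity
dept_rate <- prop.table(margin.table(cross3, c(3, 2)), 1)[, "Admitted"]
round(dept_rate, 2)
```

Reading admit_rate alongside applicants and dept_rate shows where each gender's applications were concentrated and how selective those departments were, which is the comparison the pooled 2x2 table hides.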

This is probably the most famous case of Simpson's Paradox, and it is also why this dataset is built into R. The lesson here is to be careful when using pooled data, and look out for hidden variables.
