3.12. Sampling on the Dependent Variable

The logit model has a unique sampling property that is extremely useful in a variety of situations. In the analysis of linear models, sampling on the dependent variable is widely known to be potentially dangerous. In fact, much of the literature on selection bias in linear models has to do with fixing the problems that arise from such sampling (Heckman 1979). That’s not true for the logit model, however. You can do disproportionate stratified random sampling on the dependent variable without biasing the coefficient estimates.

Here’s a simple example. Table 3.3 shows a hypothetical table for employment status by high school graduation. The odds ratio for this table is 570×52/(360×22)=3.74. If we estimated a logit model predicting employment status from high school graduation, the coefficient would be log(3.74)=1.32. (If we reversed the independent and dependent variables, the logit coefficient would still be 1.32.) Now suppose we take a 10% random sample from the employed column and delete the other 90% of employed persons from the table. Ignoring sampling variability, the numbers in the employed column would change to 57 and 36. The new odds ratio is 57×52/(36×22)=3.74 and, of course, the logit coefficient is still 1.32. We see, then, that sampling on the dependent variable doesn’t change odds ratios.

Table 3.3. Employment Status by High School Graduation

                         Employment Status
High School Graduate     Employed     Unemployed
Yes                         570           22
No                          360           52
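
The arithmetic in this example can be checked with a short Python sketch. The function name `odds_ratio` and the variable names are illustrative choices, not part of the original text; the counts come from Table 3.3, and the 10% subsample is taken as exact, ignoring sampling variability as the text does.

```python
import math

def odds_ratio(table):
    """Cross-product odds ratio for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

# Table 3.3: rows are graduate yes/no, columns are employed/unemployed.
full = [[570, 22], [360, 52]]

# Keep 10% of the employed column and all of the unemployed column.
sampled = [[employed // 10, unemployed] for employed, unemployed in full]

or_full = odds_ratio(full)
or_sampled = odds_ratio(sampled)

print(f"full table:    OR = {or_full:.2f}, logit coefficient = {math.log(or_full):.2f}")
print(f"sampled table: OR = {or_sampled:.2f}, logit coefficient = {math.log(or_sampled):.2f}")
```

Both lines should report the same odds ratio and coefficient, because multiplying one column by a constant cancels in the cross-product ratio.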

This property of contingency tables has been known for decades. More recently, it’s been extended to logit models with continuous independent variables (Prentice and Pyke 1979); the slope coefficients are not biased by disproportionate random sampling on the dependent variable. The intercept does change under such sampling schemes, but ordinarily we don’t care about the intercept anyway.

This property has a couple of important practical implications. Suppose you have a census tape with a million observations. You want to analyze the determinants of employment status, but you don’t want to stare at your computer screen while LOGISTIC chugs through a million cases. So you take a 1% simple random sample, giving you 10,000 cases. If the unemployment rate is 5%, you would expect to get about 9,500 employed cases and 500 unemployed cases. Not bad, but you can do better. For a given sample size, the standard errors of the coefficients depend heavily on the split on the dependent variable. As a general rule, you’re much better off with a 50-50 split than with a 95-5 split. The solution is to take a 10% sample of the unemployed cases and a 0.5% sample of the employed cases. That way you end up with about 5,000 unemployed cases and 4,750 employed cases, a nearly even split that will give you much smaller standard errors for the coefficients.
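
As a rough check on these sample sizes, the following Python sketch computes the expected number of cases under each sampling scheme. The function `expected_counts` and its parameter names are hypothetical helpers introduced here for illustration; the population size, unemployment rate, and sampling fractions are the ones given in the text.

```python
def expected_counts(n_population, event_rate, frac_event, frac_nonevent):
    """Expected sample counts (events, non-events) when the two outcome
    groups are sampled at different rates."""
    n_event = n_population * event_rate
    n_nonevent = n_population - n_event
    return round(n_event * frac_event), round(n_nonevent * frac_nonevent)

# Simple random sample: 1% of everyone.
simple = expected_counts(1_000_000, 0.05, 0.01, 0.01)

# Disproportionate stratified sample: 10% of unemployed, 0.5% of employed.
stratified = expected_counts(1_000_000, 0.05, 0.10, 0.005)

print("simple random sample (unemployed, employed):", simple)
print("stratified sample    (unemployed, employed):", stratified)
```

The two designs yield nearly the same total number of cases, but the stratified design has ten times as many unemployed cases.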

After you’ve estimated the logit model from the disproportionately stratified sample, you can easily correct the intercept to reflect the true population proportions (Hosmer and Lemeshow 1989, p. 180). For example, suppose you estimated a model predicting the probability of unemployment and you get an intercept of b0. Suppose, further, that the sampling fraction for unemployed persons was pu, and the sampling fraction for employed persons was pe. Then, the corrected intercept is b0 - log(pu/pe).
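
A minimal Python sketch of this correction follows, applied to the counts in Table 3.3 with unemployment as the event. The function name `corrected_intercept` is a hypothetical helper, not a SAS or library routine. The sketch assumes high school graduation is coded 1 for graduates and 0 for non-graduates, so that the intercept of the one-predictor model equals the log odds of unemployment among non-graduates.

```python
import math

def corrected_intercept(b0, p_event, p_nonevent):
    """Adjust a sample intercept b0 for unequal sampling fractions.
    p_event is the sampling fraction for the event group (pu),
    p_nonevent the fraction for the non-event group (pe)."""
    return b0 - math.log(p_event / p_nonevent)

# Full table: 52 unemployed and 360 employed non-graduates.
b0_full = math.log(52 / 360)

# After keeping 10% of employed persons: 52 unemployed, 36 employed.
b0_sampled = math.log(52 / 36)

# All unemployed kept (pu = 1.0), 10% of employed kept (pe = 0.10).
b0_fixed = corrected_intercept(b0_sampled, p_event=1.0, p_nonevent=0.10)

print(f"intercept, full table:       {b0_full:.3f}")
print(f"intercept, sampled table:    {b0_sampled:.3f}")
print(f"intercept, after correction: {b0_fixed:.3f}")
```

The corrected value should match the full-table intercept, since the sampling only shifts the log odds by the constant log(pu/pe).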

The other application of sampling on the dependent variable is the case-control study, widely used in the biomedical sciences (Breslow and Day 1980) but occasionally found in the social sciences as well. Here’s how it works. You want to study the determinants of some rare disease. By surveying doctors and hospitals, you manage to identify all 75 persons diagnosed with that disease (the cases) in your metropolitan area. To get a comparison group, you recruit a random sample of healthy persons in the general population of the metropolitan area (the controls). You pool the cases and the controls into a single sample and do a logit regression predicting case vs. control, based on various background characteristics. Often the controls are matched to the cases on one or more criteria, but that’s not an essential feature of the design. In Chapter 8, we’ll see how to analyze matched case-control data.
