2.3. Problems with Ordinary Linear Regression

Not so long ago, it was common to see published research that used ordinary least squares (OLS) linear regression to analyze dichotomous dependent variables. Some people didn’t know any better. Others knew better but didn’t have access to good software for alternative methods. Now, virtually every major statistical package includes procedures for logit or probit analysis, so there’s no excuse for applying inferior methods. No reputable social science journal would publish an article that used OLS regression with a dichotomous dependent variable.

Should all the earlier literature that violated this prohibition be dismissed? Actually, most applications of OLS regression to dichotomous variables give results that are qualitatively quite similar to results obtained using logit regression. There are exceptions, of course, so I certainly wouldn’t claim that there’s no need for logit analysis. But as an approximate method, OLS linear regression does a surprisingly good job with dichotomous variables, despite clear-cut violations of assumptions.

What are the assumptions that underlie OLS regression? While there’s no single set of assumptions that justifies linear regression, the list in the box is fairly standard. To keep things simple, I’ve included only a single independent variable x, and I’ve presumed that x is “fixed” across repeated samples (which means that every sample has the same set of x values). The i subscript distinguishes different members of the sample.

Assumptions of the Linear Regression Model
  1. yi = α + βxi + εi

  2. E(εi) = 0

  3. var(εi) = σ²

  4. cov(εi, εj) = 0

  5. εi ~ Normal


Assumption 1 says that y is a linear function of x plus a random disturbance term ε, for all members of the sample. The remaining assumptions all say something about the distribution of ε. What’s important about assumption 2 is that E(ε) (the expected value of ε) does not vary with x, implying that x and ε are uncorrelated. Assumption 3, often called the homoscedasticity assumption, says that the variance of ε is the same for all observations. Assumption 4 says that the random disturbance for one observation is uncorrelated with the random disturbance for any other observation. Finally, assumption 5 says that the random disturbance is normally distributed. If all five assumptions are satisfied, ordinary least squares estimates of α and β are unbiased and have minimum sampling variance (minimum variability across repeated samples).
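If these assumptions seem abstract, a small simulation can make them concrete. The Python sketch below, using made-up values for α, β, and σ, generates repeated samples that satisfy all five assumptions and checks that the OLS estimates average out to the true coefficients.

```python
import numpy as np

# Minimal sketch, assuming made-up parameter values: under assumptions 1-5,
# OLS estimates of alpha and beta are unbiased across repeated samples.
rng = np.random.default_rng(0)
alpha, beta, sigma = 1.0, 0.5, 2.0
x = np.linspace(0, 10, 100)                      # x is "fixed" across samples
X = np.column_stack([np.ones_like(x), x])

estimates = []
for _ in range(2000):
    eps = rng.normal(0.0, sigma, size=x.size)    # assumptions 2-5
    y = alpha + beta * x + eps                   # assumption 1
    estimates.append(np.linalg.lstsq(X, y, rcond=None)[0])

print(np.mean(estimates, axis=0))                # close to (1.0, 0.5)
```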

Now suppose that y is a dichotomy with possible values of 1 or 0. It’s still reasonable to claim that assumptions 1, 2, and 4 are true. But if 1 and 2 are true for a dichotomy, then 3 and 5 are necessarily false. First, let’s consider assumption 5. Suppose that yi=1. Then assumption 1 implies that εi = 1 – α – βxi. On the other hand, if yi=0, we have εi = –α – βxi. Because εi can only take on two values, it’s impossible for it to have a normal distribution (which has a continuum of values and no upper or lower bound). So assumption 5 must be rejected.
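A quick numeric illustration, using arbitrary values for α, β, and xi, shows the two possible values of the disturbance:

```python
# Arbitrary illustrative values: alpha = .1, beta = .05, xi = 4
alpha, beta, xi = 0.1, 0.05, 4.0
for yi in (0, 1):
    eps = yi - alpha - beta * xi
    print(yi, eps)    # epsilon is -0.3 when yi = 0 and 0.7 when yi = 1
```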

To evaluate assumption 3, it’s helpful to do a little preliminary algebra. The expected value of yi is, by definition,

E(yi) = 1 × Pr(yi = 1) + 0 × Pr(yi = 0).

If we define pi = Pr(yi=1), this reduces to

E(yi) = pi

In general, for any dummy variable, its expected value is just the probability that it is equal to 1. But assumptions 1 and 2 also imply another expression for this expectation. Taking the expected values of both sides of the equation in assumption 1, we get

E(yi) = α + βxi + E(εi) = α + βxi.

Putting these two results together, we get

pi = α + βxi,     (Equation 2.1)

which is sometimes called the linear probability model. As the name suggests, this model says that the probability that y=1 is a linear function of x. Regression coefficients have a straightforward interpretation under this model. A 1-unit change in x produces a change of β in the probability that y=1. In Output 2.1, the coefficient for SERIOUS was .038. So we can say that each 1-point increase in the SERIOUS scale (which ranges from 1 to 15) is associated with an increase of .038 in the probability of a death sentence, controlling for the other variables in the model. The BLACKD coefficient of .12 tells us that the estimated probability of a death sentence for black defendants is .12 higher than for non-black defendants, controlling for other variables.
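For readers who want to see the mechanics, here is a minimal Python sketch of a linear probability model fit by OLS. The data are simulated rather than the death-penalty data behind Output 2.1, and the generating coefficients are only loosely patterned on the estimates quoted above; the names SERIOUS and BLACKD simply echo the text.

```python
import numpy as np

# Hypothetical illustration: a linear probability model fit by OLS on
# simulated data (not the data behind Output 2.1).
rng = np.random.default_rng(1)
n = 147
serious = rng.integers(1, 16, size=n)            # 1-15 severity scale
blackd = rng.integers(0, 2, size=n)              # 1 = black defendant
p_true = 0.038 * serious + 0.12 * blackd         # made-up generating model
death = rng.binomial(1, p_true)                  # 0/1 death-sentence outcome

X = np.column_stack([np.ones(n), serious, blackd])
coefs, *_ = np.linalg.lstsq(X, death, rcond=None)
# Each slope estimates the change in Pr(death = 1) per 1-unit change
# in its predictor, holding the other predictor constant.
print(coefs)
```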

Now let’s consider the variance of εi. Because x is treated as fixed, the variance of εi is the same as the variance of yi. In general, the variance of a dummy variable is pi(1–pi).

Therefore, we have

var(εi) = pi(1 – pi) = (α + βxi)(1 – α – βxi).

We see, then, that the variance of εi must be different for different observations and, in particular, it varies as a function of x. The disturbance variance is at a maximum when pi=.5 and gets small when pi is near 1 or 0.
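A one-line check in Python, with arbitrary probability values, confirms the pattern:

```python
import numpy as np

# The Bernoulli variance p*(1 - p) peaks at p = .5 and shrinks toward 0
# as p approaches 0 or 1 (arbitrary illustrative probabilities).
p = np.array([0.05, 0.25, 0.50, 0.75, 0.95])
print(p * (1 - p))    # 0.0475, 0.1875, 0.25, 0.1875, 0.0475
```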

We’ve just shown that a dichotomous dependent variable in a linear regression model necessarily violates assumptions of homoscedasticity (assumption 3) and normality (assumption 5) of the error term. What are the consequences? Not as serious as you might think. First of all, we don’t need these assumptions to get unbiased estimates. If just assumptions 1 and 2 hold, ordinary least squares will produce unbiased estimates of α and β. Second, the normality assumption is not needed if the sample is reasonably large. The central limit theorem assures us that coefficient estimates will have a distribution that is approximately normal even when ε is not normally distributed. That means that we can still use a normal table to calculate p-values and confidence intervals. If the sample is small, however, these approximations could be poor.

Violation of the homoscedasticity assumption has two undesirable consequences. First, the coefficient estimates are no longer efficient. In statistical terminology, this means that there are alternative methods of estimation with smaller standard errors. Second, and more serious, the standard error estimates are no longer consistent estimates of the true standard errors. That means that the estimated standard errors could be biased (either upward or downward) to unknown degrees. And because the standard errors are used in calculating test statistics, the test statistics could also be biased.
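The standard-error problem can be seen in a short simulation sketch, again with made-up parameter values: across repeated samples of a 0/1 outcome generated from a linear probability model, the conventional OLS standard error of the slope need not match the slope's actual sampling variability.

```python
import numpy as np

# Sketch with simulated data: compare the conventional OLS standard error
# of the slope with the slope's true sampling variability when the outcome
# is a 0/1 variable generated from a linear probability model.
rng = np.random.default_rng(2)
x = np.linspace(0, 10, 200)
X = np.column_stack([np.ones_like(x), x])
p_true = 0.05 + 0.08 * x                          # valid probabilities in [.05, .85]

slopes, reported_se = [], []
for _ in range(2000):
    y = rng.binomial(1, p_true)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    s2 = resid @ resid / (len(y) - 2)             # usual homoscedastic error variance
    cov = s2 * np.linalg.inv(X.T @ X)             # conventional OLS covariance matrix
    slopes.append(b[1])
    reported_se.append(np.sqrt(cov[1, 1]))

print(np.std(slopes))                             # actual sampling SD of the slope
print(np.mean(reported_se))                       # average conventional SE (may differ)
```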

In addition to these technical difficulties, there is a more fundamental problem with the assumptions of the linear model. We’ve seen that when the dependent variable is a dichotomy, assumptions 1 and 2 imply the linear probability model

pi = α + βxi.

While there’s nothing intrinsically wrong with this model, it’s a bit implausible, especially if x is measured on a continuum. If x has no upper or lower bound, then for any value of β there are values of x for which pi is either greater than 1 or less than 0. In fact, when estimating a linear probability model by OLS, it’s quite common for predicted values generated by the model to be outside the (0, 1) interval. (That wasn’t a problem with the regression in Output 2.1, which implied predicted probabilities ranging from .03 to .65). Of course, it’s impossible for the true values (which are probabilities) to be greater than 1 or less than 0. So the only way the model could be true is if a ceiling and floor are somehow imposed on pi, leading to considerable awkwardness both theoretically and computationally.
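A brief simulated example shows how readily out-of-range predictions arise when a linear probability model is fit by OLS; all data-generating values here are arbitrary.

```python
import numpy as np

# Sketch with simulated data: OLS predictions from a linear probability
# model are not constrained to the (0, 1) interval.
rng = np.random.default_rng(3)
x = rng.normal(0, 2, size=500)
p_true = np.clip(0.5 + 0.3 * x, 0, 1)            # true probabilities, bounded
y = rng.binomial(1, p_true)

X = np.column_stack([np.ones_like(x), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
yhat = X @ b                                     # fitted "probabilities"
print(((yhat < 0) | (yhat > 1)).sum(), "of", len(yhat),
      "fitted values fall outside (0, 1)")
```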

These problems with the linear model led statisticians to develop alternative approaches that make more sense conceptually and also have better statistical properties. The most popular of these approaches is the logit model, which is estimated by maximum likelihood. Before considering the full model, let’s examine one of its components—the odds of an event.
