2.6. Estimation of the Logit Model: General Principles

Now that we have a model for dichotomous dependent variables, the next step is to use sample data to estimate the coefficients. How that’s done depends on the type of data you’re working with. If you have grouped data, there are three readily available methods: ordinary least squares, weighted least squares, and maximum likelihood.

Grouped data occurs when the explanatory variables are all categorical and the data is arrayed in the form of a contingency table. We’ll see several examples of grouped data in Chapter 4. Grouped data can also occur when data is collected from naturally occurring groups. For example, suppose that the units of analysis are business firms and the dependent variable is the probability that an employee is a full-time worker. Let Pi be the observed proportion of employees who work full-time in firm i. To estimate a logit model by OLS, we would simply take the logit transformation of Pi, which is log[Pi/(1-Pi)], and regress that transformation on characteristics of the firm and on the average characteristics of the employees. A weighted least squares (WLS) analysis would be similar, except that the data would be weighted to adjust for heteroscedasticity. The SAS procedure CATMOD does WLS estimation for grouped data (in addition to maximum likelihood).
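Here is a minimal SAS sketch of the OLS approach, assuming a hypothetical data set FIRMS with one record per firm, the observed proportion P, and two illustrative firm characteristics, SIZE and UNION (these names are placeholders, not taken from the text):

   data firms2;
      set firms;                      /* hypothetical firm-level data set           */
      logitp = log(p / (1 - p));      /* logit transformation of the proportion P   */
   run;

   proc reg data=firms2;
      model logitp = size union;      /* OLS regression of the logit on firm        */
   run;                               /*   characteristics                          */

A WLS analysis would instead weight each firm; a standard choice of weight is ni×Pi×(1-Pi), where ni is the number of employees in firm i, because 1/[ni×Pi×(1-Pi)] approximates the sampling variance of the observed logit.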

Maximum likelihood (ML) is the third method for estimating the logit model for grouped data and the only method in general use for individual-level data. This is data for which we simply observe a dichotomous dependent variable for each individual along with measured characteristics of the individual. OLS and WLS can’t be used with this kind of data unless the data can be grouped in some way. If yi can only have values of 1 and 0, it’s impossible to apply the logit transformation—you get either minus infinity or plus infinity. To put it another way, any transformation of a dichotomy is still a dichotomy.

Maximum likelihood is a very general approach to estimation that is widely used for all sorts of statistical models. You may have encountered it before with loglinear models, latent variable models, or event history models. There are two reasons for this popularity. First, ML estimators are known to have good properties in large samples. Under fairly general conditions, ML estimators are consistent, asymptotically efficient, and asymptotically normal. Consistency means that, as the sample size gets larger, the probability that the estimate is within some small distance of the true value also gets larger. No matter how small the distance or how high the specified probability, there is always a sample size large enough that the estimate falls within that distance of the true value with at least that probability. One implication of consistency is that the ML estimator is approximately unbiased in large samples. Asymptotic efficiency means that, in large samples, the estimates will have standard errors that are, approximately, at least as small as those for any other estimation method. And, finally, the sampling distribution of the estimates will be approximately normal in large samples, which means that you can use the normal and chi-square distributions to compute confidence intervals and p-values.

All these approximations get better as the sample size gets larger. The fact that these desirable properties have only been proven for large samples does not mean that ML has bad properties for small samples. It simply means that we usually don’t know what the small-sample properties are. And in the absence of attractive alternatives, researchers routinely use ML estimation for both large and small samples. Although I won’t argue against that practice, I do urge caution in interpreting p-values and confidence intervals when samples are small. Despite the temptation to accept larger p-values as evidence against the null hypothesis in small samples, it is actually more reasonable to demand smaller values to compensate for the fact that the approximation to the normal or chi-square distributions may be poor.

The other reason for ML’s popularity is that it is often straightforward to derive ML estimators when there are no other obvious possibilities. One case that ML handles very nicely is data with categorical dependent variables.

The basic principle of ML is to choose as estimates those parameter values which, if true, would maximize the probability of observing what we have, in fact, observed. There are two steps to this: (1) write down an expression for the probability of the data as a function of the unknown parameters, and (2) find the values of the unknown parameters that make the value of this expression as large as possible.

The first step is known as constructing the likelihood function. To accomplish this, you must specify a model, which amounts to choosing a probability distribution for the dependent variable and choosing a functional form that relates the parameters of this distribution to the values of the explanatory variables. In the case of the logit model, the dichotomous dependent variable is presumed to have a binomial distribution with a single “trial” and parameter pi. Then pi is assumed to depend on the explanatory variables according to equation (2.3), the logit model. Finally, we assume that the observations are independent across individuals.
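To give a more concrete sense of step (1), here is a sketch of the resulting likelihood, written in generic notation (the formal derivation is in Chapter 3, and β0, β1, ..., βk stand for the coefficients in equation (2.3)). With yi coded 1 or 0 and n independent observations,

   \[
   L \;=\; \prod_{i=1}^{n} p_i^{\,y_i}\,(1-p_i)^{\,1-y_i},
   \qquad \text{where} \quad
   \log\!\left[\frac{p_i}{1-p_i}\right] \;=\; \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}.
   \]

The ML estimates are the values of the coefficients that make L, or equivalently its logarithm, as large as possible.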

The second step—maximization—typically requires an iterative numerical method, which means that it involves successive approximations. Such methods are often computationally demanding, which explains why ML estimation has become popular only in the last two decades. For those who are interested, I work through the basic mathematics of constructing and maximizing the likelihood function in Chapter 3. Here I focus on the practical aspects of ML estimation with SAS.
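As a preview of that practical side, here is a minimal sketch of ML estimation for individual-level data, assuming a hypothetical data set MYDATA with a 0/1 dependent variable Y and explanatory variables X1 and X2 (again, these names are placeholders, not taken from the text):

   proc logistic data=mydata descending;   /* DESCENDING makes Y=1 the modeled event  */
      model y = x1 x2;                     /* ML estimates of the logit coefficients  */
   run;

The maximization itself is handled internally: the procedure iterates until the coefficient estimates converge, then reports the estimates, their standard errors, and the associated test statistics.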
