In the postdoctoral example, the data was clustered into naturally occurring groups. Matching is another form of clustering in which individuals are grouped together by design. Matching was once commonly used in the social sciences to control for potentially confounding variables. Now, most researchers use some kind of regression procedure, primarily because of the difficulty of matching on several variables. However, with the recent development of the propensity score method, that objection to matching is largely obsolete (Rosenbaum and Rubin 1983, Smith 1997).
Here’s a typical application of the propensity score method. Imagine that your goal is to compare academic achievement of students in public and private schools, controlling for several measures of family background. You have measures on all the relevant variables for a large sample of eighth grade students, 10% of whom are in private schools. The first step in a propensity score analysis is to do a logit regression in which the dependent variable is the type of school and the independent variables are the family background characteristics. Based on that regression, the propensity score is the predicted probability of being in a private school. Each private school student is then matched to one or more public school students according to their closeness on the propensity score. In most cases, this method produces two groups that have nearly equal means on all the variables in the propensity score regression. One can then do a simple bivariate analysis of achievement versus school type. Alternatively, other variables not in the propensity score regression could be included in some kind of regression analysis. In many situations, this method could have important advantages over conventional regression analysis in reducing both bias and sampling variability (Smith 1997).
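The first step described above (a logit regression for school type, followed by saving the predicted probabilities) can be sketched in SAS as follows. This is only an illustration under stated assumptions: the data set STUDENTS, the variables PRIVATE, FAMINC, and PARED, and all the simulated values are hypothetical and are not taken from any study cited here.

```sas
/* Hypothetical data: STUDENTS, PRIVATE, FAMINC, and PARED are invented */
/* names, and the values are simulated purely for illustration.        */
data students;
  call streaminit(1997);
  do id = 1 to 5000;
    faminc = rand('NORMAL', 50, 15);    /* family income in $1000s      */
    pared  = rand('NORMAL', 13, 2.5);   /* parental education in years  */
    /* simulated selection into private school (roughly one in ten)     */
    eta = -2.4 + 0.03*(faminc - 50) + 0.20*(pared - 13);
    private = rand('BERNOULLI', 1/(1 + exp(-eta)));
    output;
  end;
  drop eta;
run;

/* Step 1: logit regression of school type on family background.        */
/* DESCENDING makes LOGISTIC model the probability that PRIVATE=1.      */
proc logistic data=students descending;
  model private = faminc pared;
  output out=pscores pred=pscore;       /* PSCORE = propensity score    */
run;

/* Compare the distribution of the score across the two school types.   */
proc means data=pscores n mean min max;
  class private;
  var pscore;
run;
```

The data set PSCORES would then be the input to the matching step, in which each private school student is paired with the public school students whose PSCORE values are closest.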
The propensity score method is an example of treatment-control matching. In this kind of matching, the individuals within each match group necessarily differ on the explanatory variable of central interest. Another sort of matching is case-control matching in which individuals within each match group necessarily differ on the dependent variable. Case-control studies have long been popular in the biomedical sciences for reasons I explained in Section 3.12. In a case-control study, the aim is to model the determinants of some dichotomous outcome, for instance, a disease condition. People who have the condition are called cases; people who do not are called controls. In Section 3.12, we saw that it’s legitimate to take all the available cases and a random subsample of the controls, pool the two groups into a single sample, and do a conventional logit analysis for the dichotomous outcome. Although not an essential feature of the method, each case is often matched to one or more controls on variables—such as age—that are known to affect the outcome but are not of direct interest.
Although it’s usually desirable to adjust for matching in the analysis, the appropriate adjustment methods are quite different for treatment-control and case-control matching. In brief, there are several ways to do it for treatment-control matching but only one way for case-control matching. Let’s begin with a treatment-control example. Metraux and Culhane (1997) constructed a data set of 8,402 women who stayed in family shelters in New York City for at least one 7-day period during 1992. The data contained information on several pre-stay characteristics, events that occurred during the stay, and housing type subsequent to the stay. As our dependent variable, we’ll focus on whether or not the woman exited to public housing (PUBHOUSE), which was the destination for 48% of the sample. Our principal independent variable is STAYBABY, equal to 1 if a woman gave birth during her stay at the shelter and 0 otherwise. Nine percent of the sample had a birth during the stay. Other independent variables are:
BLACK | 1=black race, 0=nonblack |
KIDS | Number of children in the household |
DOUBLEUP | 1=living with another family prior to shelter stay, 0 otherwise |
AGE | Age of woman at beginning of shelter stay |
DAYS | Number of days in shelter stay |
The conventional approach to analysis would be to estimate a logit regression model for the entire sample of 8,402 women. Results from doing that with PROC GENMOD are shown in Output 8.10. We see that although the odds of exiting to public housing increase with the number of children in the household, a birth during the stay reduces the odds by about 33%, because 100(exp(–.40) – 1) = –33. Both of these effects are overshadowed by the enormous impact of the length of stay: each additional day increases the odds of exiting to public housing by about 1%. (Not surprisingly, DAYS also has a correlation of .25 with STAYBABY; the longer a woman’s stay, the more likely it is that she gave birth during the stay.)
Output 8.10

Analysis Of Parameter Estimates

Parameter    DF    Estimate    Std Err    ChiSquare    Pr>Chi
INTERCEPT     1     -1.7441     0.1543     127.7919    0.0001
STAYBABY      1     -0.4030     0.0990      16.5794    0.0001
BLACK         1     -0.1606     0.0586       7.5141    0.0061
KIDS          1      0.1835     0.0219      70.0112    0.0001
DOUBLEUP      1     -0.1904     0.0601      10.0156    0.0016
AGE           1     -0.0227     0.0052      18.8247    0.0001
DAYS          1      0.0114     0.0003    1973.0827    0.0001
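The percentage effects quoted for Output 8.10 follow from the usual transformation of a logit coefficient b into a percentage change in the odds, 100(exp(b) – 1). As a worked check using the reported estimates (exponentials rounded to three or four decimals):

```latex
\begin{align*}
\text{STAYBABY:}\quad & 100\,(e^{-0.4030} - 1) \approx 100\,(0.668 - 1) = -33.2\% \\
\text{KIDS:}\quad     & 100\,(e^{0.1835} - 1) \approx 100\,(1.201 - 1) = 20.1\% \\
\text{DAYS:}\quad     & 100\,(e^{0.0114} - 1) \approx 100\,(1.0115 - 1) = 1.15\%
\end{align*}
```

The DAYS figure is a change per additional day, which is why a coefficient this small can still dominate the model over stays that differ by many days.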
Now let’s estimate the effect of STAYBABY in a matched sample. I compared all 791 women who had a baby during the stay with an equal number of women who did not. To control for the effect of DAYS, I matched each woman who had a baby with a random draw from among those women whose length of stay was identical (or as close as possible). In this subsample, then, the correlation between STAYBABY and DAYS is necessarily .00. Rosenbaum and Rubin (1983) argue that adjustment by matching is “usually more robust to departures from the assumed form of the underlying model than model-based adjustment on random samples ... primarily because of reduced reliance on the model’s extrapolations.” Despite the fact that 6,820 cases are discarded in the matched analysis, we’ll see that very little precision is lost in the estimation of the STAYBABY coefficient. I could have used the propensity score method to control for all the variables in Output 8.10, but I wanted to keep things simple, and the adjustment for DAYS would have dominated anyway.
Next, I estimated a logit model for this matched-pair subsample without adjusting for the matching (Output 8.11). Although the level of significance declines greatly for most of the variables, the loss of precision is quite small for the STAYBABY coefficient; its standard error only increases by about 18% relative to that in Output 8.10. The coefficient declines somewhat in magnitude, but the matched subsample may be less prone to bias in this estimate than that in the full sample. Note that if we deleted DAYS from this model, the results for STAYBABY would hardly change at all because the two variables are uncorrelated, by design.
Output 8.11

Analysis Of Parameter Estimates

Parameter    DF    Estimate    Std Err    ChiSquare    Pr>Chi
INTERCEPT     1     -0.0317     0.3357       0.0089    0.9248
STAYBABY      1     -0.3107     0.1167       7.0914    0.0077
BLACK         1     -0.1322     0.1253       1.1139    0.2912
KIDS          1      0.1789     0.0465      14.8089    0.0001
DOUBLEUP      1     -0.1046     0.1216       0.7397    0.3898
AGE           1     -0.0188     0.0110       2.9045    0.0883
DAYS          1      0.0043     0.0004      95.4934    0.0001
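The two figures used in comparing the matched subsample with the full sample can be checked directly from the reported counts and standard errors:

```latex
\begin{align*}
\text{cases discarded:}\quad & 8{,}402 - 2 \times 791 = 6{,}820 \\[4pt]
\text{STAYBABY standard error ratio:}\quad & \frac{0.1167}{0.0990} \approx 1.18
\end{align*}
```

So roughly four-fifths of the sample is set aside, yet the standard error of the STAYBABY coefficient rises by only about 18%.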
The problem with the analysis in Output 8.11 is that the matched pairs are not independent, which could lead to bias in the standard error estimates. One way to adjust for the matching is to use the GEE method discussed in Section 8.5. Here’s how:
PROC GENMOD DATA=my.casecont;
  CLASS casenum;
  MODEL pubhouse=staybaby black kids doubleup age days / D=B;
  REPEATED SUBJECT=casenum / TYPE=UN CORRW;
RUN;
CASENUM is the unique identifier for each matched pair. Although I specified an unstructured correlation matrix (TYPE=UN), all types of correlation structure are equivalent when there are only two observations per cluster.
Output 8.12

GEE Model Information

Description                     Value
Correlation Structure           Unstructured
Subject Effect                  CASENUM (791 levels)
Number of Clusters              791
Correlation Matrix Dimension    2
Maximum Cluster Size            2
Minimum Cluster Size            2

Working Correlation Matrix

          COL1      COL2
ROW1    1.0000    0.0705
ROW2    0.0705    1.0000

Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates

                          Empirical    95% Confidence Limits
Parameter     Estimate      Std Err      Lower       Upper          Z    Pr>|Z|
INTERCEPT      -0.0125       0.3487    -0.6959      0.6709     -.0359    0.9714
STAYBABY       -0.3100       0.1074    -0.5206     -0.0995     -2.886    0.0039
BLACK          -0.1462       0.1197    -0.3808      0.0884     -1.221    0.2220
KIDS            0.1790       0.0469     0.0871      0.2709     3.8186    0.0001
DOUBLEUP       -0.1138       0.1168    -0.3427      0.1151     -.9742    0.3300
AGE            -0.0188       0.0108    -0.0400      0.0024     -1.740    0.0819
DAYS            0.0043       0.0007     0.0030      0.0056     6.5214    0.0000
Scale           1.2250            .          .           .          .         .
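The equivalence of correlation structures for matched pairs can be seen by writing out the working correlation matrix. With two observations per cluster there is a single off-diagonal element, so the exchangeable, first-order autoregressive, and unstructured forms all have the same shape (the independence structure, which fixes the correlation at zero, is the exception). The symbol ρ is notation introduced here for that single parameter:

```latex
\[
R_{\text{exchangeable}} = R_{\text{AR(1)}} = R_{\text{unstructured}}
= \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix},
\qquad \hat{\rho} = 0.0705 .
\]
```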
Results in Output 8.12 differ only slightly from those in Output 8.11, which were not adjusted for matching. In particular, the coefficient for STAYBABY is about the same, while its estimated standard error is slightly reduced, from .1167 to .1074. Consistent with this small change, the estimated “residual” correlation between the two observations in a pair is only .07. One reason why the results don’t differ more is that the inclusion of DAYS in Output 8.11 is itself a partial adjustment for matching. When DAYS is omitted from the model, the residual correlation is .20.
In my opinion, GEE is usually the best method for adjusting for treatment-control matching. Although the example consisted of matched pairs, the method is identical for one-to-many matching or many-to-many matching. One alternative to GEE is the mixed model discussed in Section 8.7, but this is likely to give very similar results in most applications. Another widely recommended alternative is the conditional logit (fixed effects) model of Section 8.4. Unfortunately, this method often involves a substantial loss of data with concomitant increases in standard errors. The shelter stay data provides a good example of this loss. The PHREG program for the fixed-effects model with matched-pair data is:
DATA b;
  SET my.casecont;
  pubhouse=1-pubhouse;
RUN;
PROC PHREG DATA=b NOSUMMARY;
  MODEL pubhouse=staybaby black kids doubleup age / TIES=DISCRETE;
  STRATA casenum;
RUN;
Because PHREG predicts the probability of the smaller value, the DATA step is necessary to make the coefficients have the correct sign. There is no need to include DAYS in the model because it is the same (or almost the same) for both members of every matched pair.
Output 8.13

Analysis of Maximum Likelihood Estimates

                  Parameter    Standard          Wald          Pr >     Risk
Variable    DF     Estimate       Error    Chi-Square    Chi-Square    Ratio
STAYBABY     1    -0.384991     0.13665       7.93701        0.0048    0.680
BLACK        1    -0.327532     0.19385       2.85470        0.0911    0.721
KIDS         1     0.201547     0.07141       7.96579        0.0048    1.223
DOUBLEUP     1    -0.328253     0.20425       2.58269        0.1080    0.720
AGE          1    -0.028323     0.01806       2.45837        0.1169    0.972
The results in Output 8.13 are quite similar to those in Output 8.12, but the standard error of STAYBABY is about 27% higher than it was before. The increase is attributable to the fact that the conditional logit method essentially discards all matched pairs in which both members have the same value of the dependent variable PUBHOUSE. In Section 8.4, I argued that this loss of information must be balanced by the potential decrease in bias that comes from controlling all stable characteristics of the cluster that might be correlated with the treatment variable, in this case, STAYBABY. Because the matching is balanced in this application (with one treatment and one control per cluster), it’s impossible for STAYBABY to be correlated with cluster characteristics. So, there’s no potential benefit from the conditional logit method.
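To see why pairs with identical outcomes contribute nothing, it may help to sketch the conditional likelihood for a matched pair. The notation is introduced here for illustration and may differ from that of Section 8.4: α_i is a pair-specific intercept and x_ij is the covariate vector for member j of pair i.

```latex
\begin{align*}
\operatorname{logit} \Pr(y_{ij} = 1) &= \alpha_i + \beta' x_{ij}, \qquad j = 1, 2 \\[4pt]
\Pr(y_{i1} = 1,\; y_{i2} = 0 \mid y_{i1} + y_{i2} = 1)
  &= \frac{e^{\beta' x_{i1}}}{e^{\beta' x_{i1}} + e^{\beta' x_{i2}}} \\[4pt]
\Pr(y_{i1} = y_{i2} = 1 \mid y_{i1} + y_{i2} = 2) &= 1 \\
\Pr(y_{i1} = y_{i2} = 0 \mid y_{i1} + y_{i2} = 0) &= 1
\end{align*}
```

The last two probabilities do not involve β, so pairs in which both women have the same value of PUBHOUSE add nothing to the estimation. The cost shows up in the standard error of STAYBABY: 0.13665 / 0.1074 ≈ 1.27, the 27% increase relative to the GEE estimate.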
The situation is quite different for case-control designs. In that setting, every matched pair has both of the two values of the dependent variable. If you try to apply GEE to such data, the estimated working residual correlation is –1 and the method breaks down. On the other hand, the conditional logit method suffers no loss of data because there are no clusters in which both members have the same value on the dependent variable. So conditional logit is the only way to go for case-control matching.
As an example of the case-control design, we again use the data on shelter stays, but with STAYBABY as the dependent variable rather than an independent variable. Output 8.14 shows the results from estimating a logit model with LOGISTIC for the full data set of 8,402 women. We see evidence that the probability of having a baby during the shelter stay is higher for blacks, women with more children in the household, younger women, and those with longer shelter stays.
Output 8.14

Response Profile

Ordered
  Value    STAYBABY    Count
      1           1      791
      2           0     7611

Model Fitting Information and Testing Global Null Hypothesis BETA=0

                             Intercept
              Intercept         and
Criterion       Only        Covariates    Chi-Square for Covariates
AIC           5245.229       4728.880     .
SC            5252.265       4771.097     .
-2 LOG L      5243.229       4716.880     526.349 with 5 DF (p=0.0001)
Score             .              .        592.384 with 5 DF (p=0.0001)

Analysis of Maximum Likelihood Estimates

                  Parameter    Standard          Wald          Pr >    Standardized     Odds
Variable    DF     Estimate       Error    Chi-Square    Chi-Square        Estimate    Ratio
INTERCPT     1      -2.6802      0.2204      147.8532        0.0001               .        .
BLACK        1       0.2857      0.0860       11.0466        0.0009        0.074728    1.331
KIDS         1       0.1558      0.0287       29.5257        0.0001        0.115842    1.169
DOUBLEUP     1       0.1497      0.0832        3.2351        0.0721        0.040090    1.161
AGE          1      -0.0467     0.00772       36.6776        0.0001       -0.146754    0.954
DAYS         1      0.00447    0.000224      399.7093        0.0001        0.404779    1.004
Now let’s re-estimate the model for the 791 pairs of women who are matched by number of days of shelter stay. The LOGISTIC output without adjustment for matching is shown in Output 8.15.
Output 8.15

Response Profile

Ordered
  Value    STAYBABY    Count
      1           1      791
      2           0      791

Model Fitting Information and Testing Global Null Hypothesis BETA=0

                             Intercept
              Intercept         and
Criterion       Only        Covariates    Chi-Square for Covariates
AIC           2195.118       2154.347     .
SC            2200.484       2186.546     .
-2 LOG L      2193.118       2142.347     50.771 with 5 DF (p=0.0001)
Score             .              .        49.882 with 5 DF (p=0.0001)

Analysis of Maximum Likelihood Estimates

                  Parameter    Standard          Wald          Pr >    Standardized     Odds
Variable    DF     Estimate       Error    Chi-Square    Chi-Square        Estimate    Ratio
INTERCPT     1       0.6315      0.2870        4.8418        0.0278               .        .
BLACK        1       0.3436      0.1101        9.7329        0.0018       -0.088556    1.410
KIDS         1       0.1717      0.0386       19.8173        0.0001       -0.135385    1.188
DOUBLEUP     1       0.1638      0.1094        2.2430        0.1342       -0.043971    1.178
AGE          1      -0.0464     0.00977       22.5180        0.0001        0.143084    0.955
DAYS         1    -0.000044      0.0003        0.0212        0.8842        0.004313    1.000
I included DAYS in the model just to demonstrate that matching on this variable eliminates its effect on the dependent variable. The other variables have coefficient estimates that are quite similar to those for the full sample. The standard errors are about 30% larger than in the full sample, but that’s not bad considering that we have discarded 81% of the cases.
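The comparison of standard errors between the full sample and the matched sample can be reproduced with a short DATA step. The standard errors below are copied from Outputs 8.14 and 8.15; the data set name SE_COMPARE and its variable names are made up for this sketch.

```sas
/* Standard errors copied from Output 8.14 (full sample) and            */
/* Output 8.15 (matched sample). Data set and variable names are        */
/* hypothetical.                                                         */
data se_compare;
  input variable $ se_full se_matched;
  pct_increase = 100*(se_matched/se_full - 1);  /* % increase in SE     */
  datalines;
BLACK    0.0860  0.1101
KIDS     0.0287  0.0386
DOUBLEUP 0.0832  0.1094
AGE      0.00772 0.00977
;
run;

proc print data=se_compare noobs;
  format pct_increase 5.1;
run;

/* Share of the full sample that is discarded by the matching           */
data _null_;
  pct_discarded = 100*(8402 - 2*791)/8402;
  put pct_discarded= 5.1;
run;
```

The increases for the four covariates fall between roughly 27% and 35%, which is the basis for the "about 30%" summary, and the discarded share works out to about 81%.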
We still haven’t adjusted for matching, however. To do that, we use PROC PHREG to estimate a conditional logit model:
DATA c;
  SET my.casecont;
  staybaby=1-staybaby;
RUN;
PROC PHREG DATA=c NOSUMMARY;
  MODEL staybaby=black kids doubleup age;
  STRATA casenum;
RUN;
The DATA step reverses the coding of the dependent variable so that the signs of the coefficients are correct. Notice that the MODEL statement does not contain the TIES=DISCRETE option that was used in earlier examples of conditional logit analysis. That option is unnecessary when the data consists of matched pairs with each pair containing a 1 and a 0 on the dependent variable. Under any other matching design (for example, one-to-many or many-to-many matching), the DISCRETE option is essential. The results in Output 8.16 are very close to those in Output 8.15, which did not adjust for matching.
Output 8.16

Testing Global Null Hypothesis: BETA=0

               Without          With
Criterion     Covariates     Covariates    Model Chi-Square
-2 LOG L        1096.559       1043.201    53.358 with 4 DF (p=0.0001)
Score               .              .       51.408 with 4 DF (p=0.0001)
Wald                .              .       47.919 with 4 DF (p=0.0001)

Analysis of Maximum Likelihood Estimates

                  Parameter    Standard          Wald          Pr >     Risk
Variable    DF     Estimate       Error    Chi-Square    Chi-Square    Ratio
BLACK        1     0.353492     0.10921      10.47629        0.0012    1.424
KIDS         1     0.186585     0.03994      21.82912        0.0001    1.205
DOUBLEUP     1     0.186489     0.11141       2.80200        0.0941    1.205
AGE          1    -0.049648     0.01019      23.72808        0.0001    0.952
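One way to see why TIES=DISCRETE is unnecessary for one-to-one case-control pairs is to write out the partial likelihood that PHREG maximizes. This is a sketch under the assumption that, after the recoding in the DATA step, PHREG treats the response as an event time with no censoring, so that within each pair the case (response 0) has its event first and the control (response 1) second. The subscripts are notation introduced here.

```latex
\[
L_i(\beta)
= \underbrace{\frac{e^{\beta' x_{i,\text{case}}}}
       {e^{\beta' x_{i,\text{case}}} + e^{\beta' x_{i,\text{control}}}}}_{\text{risk set: both members}}
\times
\underbrace{\frac{e^{\beta' x_{i,\text{control}}}}
       {e^{\beta' x_{i,\text{control}}}}}_{\text{risk set: control only}}
= \frac{e^{\beta' x_{i,\text{case}}}}
       {e^{\beta' x_{i,\text{case}}} + e^{\beta' x_{i,\text{control}}}} .
\]
```

Because the two members of a pair never share an event time, the method for handling ties never comes into play, and the expression is the conditional logit likelihood for a pair. As a consistency check, setting β = 0 gives each pair a contribution of 1/2, so –2 log L = 2 × 791 × ln 2 ≈ 1096.56, which agrees with the "Without Covariates" value in Output 8.16.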