8.6. Matching

In the postdoctoral example, the data was clustered into naturally occurring groups. Matching is another form of clustering in which individuals are grouped together by design. Matching was once commonly used in the social sciences to control for potentially confounding variables. Now, most researchers use some kind of regression procedure, primarily because of the difficulty of matching on several variables. However, with the recent development of the propensity score method, that objection to matching is largely obsolete (Rosenbaum and Rubin 1983, Smith 1997).

Here’s a typical application of the propensity score method. Imagine that your goal is to compare academic achievement of students in public and private schools, controlling for several measures of family background. You have measures on all the relevant variables for a large sample of eighth grade students, 10% of whom are in private schools. The first step in a propensity score analysis is to do a logit regression in which the dependent variable is the type of school and the independent variables are the family background characteristics. Based on that regression, the propensity score is the predicted probability of being in a private school. Each private school student is then matched to one or more public school students according to their closeness on the propensity score. In most cases, this method produces two groups that have nearly equal means on all the variables in the propensity score regression. One can then do a simple bivariate analysis of achievement versus school type. Alternatively, other variables not in the propensity score regression could be included in some kind of regression analysis. In many situations, this method could have important advantages over conventional regression analysis in reducing both bias and sampling variability (Smith 1997).
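The first step described above can be sketched in SAS as follows. This is an illustration under assumptions, not a program from an actual analysis: the data set STUDENTS and the variables PRIVATE, FAMINC, PARED, and NSIBS are hypothetical names. The OUTPUT statement saves the predicted probability of private school attendance as PSCORE, which would then serve as the matching variable.

```sas
/* Hypothetical data set and variable names, for illustration only */
PROC LOGISTIC DATA=students DESCENDING;   /* DESCENDING: model Pr(private=1) */
  MODEL private = faminc pared nsibs;     /* family background measures */
  OUTPUT OUT=ps PRED=pscore;              /* pscore = estimated propensity score */
RUN;
```

Matching private school students to public school students on PSCORE is a separate step that is not shown here.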

The propensity score method is an example of treatment-control matching. In this kind of matching, the individuals within each match group necessarily differ on the explanatory variable of central interest. Another sort of matching is case-control matching in which individuals within each match group necessarily differ on the dependent variable. Case-control studies have long been popular in the biomedical sciences for reasons I explained in Section 3.12. In a case-control study, the aim is to model the determinants of some dichotomous outcome, for instance, a disease condition. People who have the condition are called cases; people who do not are called controls. In Section 3.12, we saw that it’s legitimate to take all the available cases and a random subsample of the controls, pool the two groups into a single sample, and do a conventional logit analysis for the dichotomous outcome. Although not an essential feature of the method, each case is often matched to one or more controls on variables—such as age—that are known to affect the outcome but are not of direct interest.

Although it’s usually desirable to adjust for matching in the analysis, the appropriate adjustment methods are quite different for treatment-control and case-control matching. In brief, there are several ways to do it for treatment-control matching but only one way for case-control matching. Let’s begin with a treatment-control example. Metraux and Culhane (1997) constructed a data set of 8,402 women who stayed in family shelters in New York City for at least one 7-day period during 1992. The data contained information on several pre-stay characteristics, events that occurred during the stay, and housing type subsequent to the stay. As our dependent variable, we’ll focus on whether or not the woman exited to public housing (PUBHOUSE), which was the destination for 48% of the sample. Our principal independent variable is STAYBABY, equal to 1 if a woman gave birth during her stay at the shelter and 0 otherwise. Nine percent of the sample had a birth during the stay. Other independent variables are:

BLACK       1=black race, 0=nonblack
KIDS        Number of children in the household
DOUBLEUP    1=living with another family prior to shelter stay, 0 otherwise
AGE         Age of woman at beginning of shelter stay
DAYS        Number of days in shelter stay

The conventional approach to analysis would be to estimate a logit regression model for the entire sample of 8,402 women. Results for doing that with PROC GENMOD are shown in Output 8.10. We see that although the odds of exiting to public housing increase with the number of children in the household, a birth during the stay reduces the odds by about 100(exp(–.40)–1)=–33%. Both of these effects are overshadowed by the enormous impact of the length of stay. Each additional day increases the odds of exiting to public housing by about 1%. (Not surprisingly, DAYS also has a correlation of .25 with STAYBABY—the longer a woman’s stay, the more likely it is that she gave birth during the stay.)
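The percentage changes in the odds quoted above come from the usual transformation 100(exp(b)–1). As a worked check, using the estimates reported in Output 8.10 (rounded to three decimals):

```latex
\begin{align*}
\text{STAYBABY:}\quad & 100\,(e^{-0.4030}-1) \approx 100\,(0.668-1) \approx -33\% \\
\text{KIDS:}\quad & 100\,(e^{0.1835}-1) \approx 100\,(1.201-1) \approx 20\% \\
\text{DAYS:}\quad & 100\,(e^{0.0114}-1) \approx 100\,(1.011-1) \approx 1\%
\end{align*}
```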

Output 8.10. Logistic Regression of Exit to Public Housing with GENMOD
              Analysis Of Parameter Estimates

Parameter    DF    Estimate     Std Err   ChiSquare  Pr>Chi

INTERCEPT     1     -1.7441      0.1543    127.7919  0.0001
STAYBABY      1     -0.4030      0.0990     16.5794  0.0001
BLACK         1     -0.1606      0.0586      7.5141  0.0061
KIDS          1      0.1835      0.0219     70.0112  0.0001
DOUBLEUP      1     -0.1904      0.0601     10.0156  0.0016
AGE           1     -0.0227      0.0052     18.8247  0.0001
DAYS          1      0.0114      0.0003   1973.0827  0.0001

Now let’s estimate the effect of STAYBABY in a matched sample. I compared all 791 women who had a baby during the stay with an equal number of women who did not. To control for the effect of DAYS, I matched each woman who had a baby with a random draw from among those women whose length of stay was identical (or as close as possible). In this subsample, then, the correlation between STAYBABY and DAYS is necessarily .00. Rosenbaum and Rubin (1983) argue that adjustment by matching is “usually more robust to departures from the assumed form of the underlying model than model-based adjustment on random samples ... primarily because of reduced reliance on the model’s extrapolations.” Despite the fact that 6,820 cases are discarded in the matched analysis, we’ll see that very little precision is lost in the estimation of the STAYBABY coefficient. I could have used the propensity score method to control for all the variables in Output 8.10, but I wanted to keep things simple, and the adjustment for DAYS would have dominated anyway.
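One way to confirm that the matching has removed the association between STAYBABY and DAYS is to compute the correlation directly in the matched file. A minimal sketch, assuming the matched pairs are stored in a data set named MY.CASECONT:

```sas
/* Assumes MY.CASECONT holds the 1,582 matched observations */
PROC CORR DATA=my.casecont;
  VAR staybaby days;    /* correlation should be at or very near zero */
RUN;
```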

Next, I estimated a logit model for this matched-pair subsample without adjusting for the matching (Output 8.11). Although the level of significance declines greatly for most of the variables, the loss of precision is quite small for the STAYBABY coefficient; its standard error only increases by about 18% relative to that in Output 8.10. The coefficient declines somewhat in magnitude, but the matched subsample may be less prone to bias in this estimate than that in the full sample. Note that if we deleted DAYS from this model, the results for STAYBABY would hardly change at all because the two variables are uncorrelated, by design.

Output 8.11. Regression of PUBHOUSE with Matched-Pair Data without Adjustment for Matching
             Analysis Of Parameter Estimates

Parameter    DF    Estimate     Std Err   ChiSquare  Pr>Chi

INTERCEPT     1     -0.0317      0.3357      0.0089  0.9248
STAYBABY      1     -0.3107      0.1167      7.0914  0.0077
BLACK         1     -0.1322      0.1253      1.1139  0.2912
KIDS          1      0.1789      0.0465     14.8089  0.0001
DOUBLEUP      1     -0.1046      0.1216      0.7397  0.3898
AGE           1     -0.0188      0.0110      2.9045  0.0883
DAYS          1      0.0043      0.0004     95.4934  0.0001

The problem with the analysis in Output 8.11 is that the matched pairs are not independent, which could lead to bias in the standard error estimates. One way to adjust for the matching is to use the GEE method discussed in Section 8.5. Here’s how:

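PROC GENMOD DATA=my.casecont;
  CLASS casenum;                              /* matched-pair identifier */
  MODEL pubhouse=staybaby black kids doubleup age days / D=B;   /* D=B: binomial distribution, logit link by default */
  REPEATED SUBJECT=casenum / TYPE=UN CORRW;   /* GEE; CORRW prints the working correlation matrix */
RUN;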

CASENUM is the unique identifier for each matched pair. Although I specified an unstructured correlation matrix (TYPE=UN), all types of correlation structure are equivalent when there are only two observations per cluster.

Output 8.12. Regression of PUBHOUSE with Matched-Pair Data with GEE Adjustment for Matching
                         GEE Model Information

           Description                   Value

           Correlation Structure         Unstructured
           Subject Effect                CASENUM (791 levels)
           Number of Clusters            791
           Correlation Matrix Dimension  2
           Maximum Cluster Size          2
           Minimum Cluster Size          2


                       Working Correlation Matrix

                                      COL1     COL2

                     ROW1           1.0000   0.0705
                     ROW2           0.0705   1.0000


                   Analysis Of GEE Parameter Estimates
                   Empirical Standard Error Estimates

                        Empirical  95% Confidence Limits
Parameter    Estimate     Std Err       Lower       Upper       Z  Pr>|Z|

INTERCEPT     -0.0125      0.3487     -0.6959      0.6709  -.0359  0.9714
STAYBABY      -0.3100      0.1074     -0.5206     -0.0995  -2.886  0.0039
BLACK         -0.1462      0.1197     -0.3808      0.0884  -1.221  0.2220
KIDS           0.1790      0.0469      0.0871      0.2709  3.8186  0.0001
DOUBLEUP      -0.1138      0.1168     -0.3427      0.1151  -.9742  0.3300
AGE           -0.0188      0.0108     -0.0400      0.0024  -1.740  0.0819
DAYS           0.0043      0.0007      0.0030      0.0056  6.5214  0.0000
Scale          1.2250           .           .           .       .      .

Results in Output 8.12 differ only slightly from those in Output 8.11, which were not adjusted for matching. In particular, the coefficient for STAYBABY is about the same, while its estimated standard error is slightly reduced, from .1167 to .1074. Consistent with the small change is the estimated “residual” correlation between within-pair observations of only .07. One reason why the results don’t differ more is that the inclusion of DAYS in Output 8.11 is itself a partial adjustment for matching. When DAYS is omitted from the model, the residual correlation is .20.
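As noted earlier, the choice of correlation structure should not matter when every cluster has exactly two observations. One way to check this is to refit the model with an exchangeable structure. The sketch below changes only the TYPE= option of the GEE program; I have not shown output for it, but it should give essentially the same estimates as TYPE=UN.

```sas
PROC GENMOD DATA=my.casecont;
  CLASS casenum;
  MODEL pubhouse=staybaby black kids doubleup age days / D=B;
  REPEATED SUBJECT=casenum / TYPE=EXCH CORRW;   /* exchangeable instead of unstructured */
RUN;
```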

In my opinion, GEE is usually the best method for adjusting for treatment-control matching. Although the example consisted of matched pairs, the method is identical for one-to-many matching or many-to-many matching. One alternative to GEE is the mixed model discussed in Section 8.7, but this is likely to give very similar results in most applications. Another widely recommended alternative is the conditional logit (fixed effects) model of Section 8.4. Unfortunately, this method often involves a substantial loss of data with concomitant increases in standard errors. The shelter stay data provides a good example of this loss. The PHREG program for the fixed-effects model with matched-pair data is:

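/* Reverse the coding of PUBHOUSE so the PHREG coefficients have the intended sign */
DATA b;
  SET my.casecont;
  pubhouse=1-pubhouse;
RUN;
/* Conditional logit: one stratum per matched pair */
PROC PHREG DATA=b NOSUMMARY;
  MODEL pubhouse=staybaby black kids doubleup age /
        TIES=DISCRETE;
  STRATA casenum;
RUN;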

Because PHREG predicts the probability of the smaller value, the DATA step is necessary to make the coefficients have the correct sign. There is no need to include DAYS in the model because it is the same (or almost the same) for both members of every matched pair.

Output 8.13. PHREG Output for Treatment-Control Matched Pairs
                Analysis of Maximum Likelihood Estimates

                 Parameter    Standard     Wald        Pr >         Risk
Variable  DF      Estimate      Error   Chi-Square  Chi-Square     Ratio

STAYBABY   1     -0.384991     0.13665     7.93701      0.0048     0.680
BLACK      1     -0.327532     0.19385     2.85470      0.0911     0.721
KIDS       1      0.201547     0.07141     7.96579      0.0048     1.223
DOUBLEUP   1     -0.328253     0.20425     2.58269      0.1080     0.720
AGE        1     -0.028323     0.01806     2.45837      0.1169     0.972

The results in Output 8.13 are quite similar to those in Output 8.12, but the standard error of STAYBABY is about 27% higher than it was before. The increase is attributable to the fact that the conditional logit method essentially discards all matched pairs in which both members have the same value of the dependent variable PUBHOUSE. In Section 8.4, I argued that this loss of information must be balanced by the potential decrease in bias that comes from controlling all stable characteristics of the cluster that might be correlated with the treatment variable, in this case, STAYBABY. Because the matching is balanced in this application (with one treatment and one control per cluster), it’s impossible for STAYBABY to be correlated with cluster characteristics. So, there’s no potential benefit from the conditional logit method.
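The reason concordant pairs are discarded can be seen from the conditional likelihood for matched pairs. As a sketch, in notation introduced here only for illustration, let x_{i1} and x_{i2} be the covariate vectors for the two members of pair i, y_{i1} and y_{i2} their values on the dependent variable, and β the coefficient vector. Under a logit model with a pair-specific intercept, that intercept cancels when we condition on the number of 1s in the pair:

```latex
\[
\Pr(y_{i1}=1,\,y_{i2}=0 \mid y_{i1}+y_{i2}=1)
  = \frac{e^{\beta' x_{i1}}}{e^{\beta' x_{i1}}+e^{\beta' x_{i2}}}
  = \frac{1}{1+e^{-\beta'(x_{i1}-x_{i2})}}
\]
\[
\Pr(y_{i1},\,y_{i2} \mid y_{i1}+y_{i2}=0 \text{ or } 2) = 1
\]
```

A pair whose members share the same value of the dependent variable contributes a factor of 1 that does not depend on β, so it carries no information about the coefficients. Only the discordant pairs affect the estimates.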

The situation is quite different for case-control designs. In that setting, every matched pair has both of the two values of the dependent variable. If you try to apply GEE to such data, the estimated working residual correlation is –1 and the method breaks down. On the other hand, the conditional logit method suffers no loss of data because there are no clusters in which both members have the same value on the dependent variable. So conditional logit is the only way to go for case-control matching.

As an example of the case-control design, we again use the data on shelter stays, but with STAYBABY as the dependent variable rather than an independent variable. Output 8.14 shows the results from estimating a logit model with LOGISTIC for the full data set of 8,402 women. We see evidence that the probability of having a baby during the shelter stay is higher for blacks, women with more children in the household, younger women, and those with longer shelter stays.
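A call of roughly the following form would fit such a model. This is a sketch, not necessarily the exact program behind the output: MY.SHELTER is a hypothetical name for the full data set, and the DESCENDING option is assumed so that the model predicts STAYBABY=1 rather than STAYBABY=0.

```sas
/* MY.SHELTER is a hypothetical name for the full data set of 8,402 women */
PROC LOGISTIC DATA=my.shelter DESCENDING;   /* DESCENDING: model Pr(staybaby=1) */
  MODEL staybaby = black kids doubleup age days;
RUN;
```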

Output 8.14. Logit Regression of STAYBABY for Full Sample
                             Response Profile

                        Ordered
                          Value  STAYBABY     Count

                              1         1       791
                              2         0      7611

    Model Fitting Information and Testing Global Null Hypothesis BETA=0

                             Intercept
               Intercept        and
 Criterion       Only       Covariates    Chi-Square for Covariates

 AIC            5245.229      4728.880         .
 SC             5252.265      4771.097         .
 -2 LOG L       5243.229      4716.880      526.349 with 5 DF (p=0.0001)
 Score              .             .         592.384 with 5 DF (p=0.0001)

                 Analysis of Maximum Likelihood Estimates

            Parameter Standard    Wald       Pr >    Standardized     Odds
Variable DF  Estimate   Error  Chi-Square Chi-Square   Estimate      Ratio

INTERCPT 1    -2.6802   0.2204   147.8532     0.0001            .     .
BLACK    1     0.2857   0.0860    11.0466     0.0009     0.074728    1.331
KIDS     1     0.1558   0.0287    29.5257     0.0001     0.115842    1.169
DOUBLEUP 1     0.1497   0.0832     3.2351     0.0721     0.040090    1.161
AGE      1    -0.0467  0.00772    36.6776     0.0001    -0.146754    0.954
DAYS     1    0.00447 0.000224   399.7093     0.0001     0.404779    1.004

Now let’s re-estimate the model for the 791 pairs of women who are matched by number of days of shelter stay. The LOGISTIC output without adjustment for matching is shown in Output 8.15.

Output 8.15. Logit Regression of STAYBABY for Matched Pairs, No Adjustment
                              Response Profile

                        Ordered
                          Value  STAYBABY     Count

                              1         1       791
                              2         0       791


    Model Fitting Information and Testing Global Null Hypothesis BETA=0

                             Intercept
               Intercept        and
 Criterion       Only       Covariates    Chi-Square for Covariates

 AIC            2195.118      2154.347         .
 SC             2200.484      2186.546         .
 -2 LOG L       2193.118      2142.347       50.771 with 5 DF (p=0.0001)
 Score              .             .          49.882 with 5 DF (p=0.0001)

                 Analysis of Maximum Likelihood Estimates

            Parameter Standard    Wald       Pr >    Standardized     Odds
Variable DF  Estimate   Error  Chi-Square Chi-Square   Estimate      Ratio

INTERCPT 1     0.6315   0.2870     4.8418     0.0278            .     .
BLACK    1     0.3436   0.1101     9.7329     0.0018    -0.088556    1.410
KIDS     1     0.1717   0.0386    19.8173     0.0001    -0.135385    1.188
DOUBLEUP 1     0.1638   0.1094     2.2430     0.1342    -0.043971    1.178
AGE      1    -0.0464  0.00977    22.5180     0.0001     0.143084    0.955
DAYS     1  -0.000044   0.0003     0.0212     0.8842     0.004313    1.000

I included DAYS in the model just to demonstrate that matching on this variable eliminates its effect on the dependent variable. The other variables have coefficient estimates that are quite similar to those for the full sample. The standard errors are about 30% larger than in the full sample, but that’s not bad considering that we have discarded 81% of the cases.
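These figures can be checked against the two tables. The share of cases discarded follows from the sample sizes, and the loss of precision from the ratio of each standard error in Output 8.15 to the corresponding one in Output 8.14:

```latex
\begin{align*}
\text{cases discarded:}\quad & 8402 - 2(791) = 6820, \qquad 6820/8402 \approx 0.81 \\
\text{BLACK:}\quad & 0.1101/0.0860 \approx 1.28 \\
\text{KIDS:}\quad & 0.0386/0.0287 \approx 1.34 \\
\text{DOUBLEUP:}\quad & 0.1094/0.0832 \approx 1.31 \\
\text{AGE:}\quad & 0.00977/0.00772 \approx 1.27
\end{align*}
```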

We still haven’t adjusted for matching, however. To do that, we use PROC PHREG to estimate a conditional logit model:

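/* Reverse the coding of STAYBABY so the PHREG coefficients have the intended sign */
DATA c;
  SET my.casecont;
  staybaby=1-staybaby;
RUN;
/* Conditional logit: one stratum per matched pair */
PROC PHREG DATA=c NOSUMMARY;
  MODEL staybaby=black kids doubleup age;
  STRATA casenum;
RUN;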

The DATA step reverses the coding of the dependent variable so that the signs of the coefficients are correct. Notice that the MODEL statement does not contain the TIES=DISCRETE option that was used in earlier examples of conditional logit analysis. That option is unnecessary when the data consists of matched pairs with each pair containing a 1 and a 0 on the dependent variable. Under any other matching design (for example, one-to-many or many-to-many matching), the DISCRETE option is essential. The results in Output 8.16 are very close to those in Output 8.15, which did not adjust for matching.

Output 8.16. Logit Regression of STAYBABY for Matched Pairs, Adjusted for Matching
                 Testing Global Null Hypothesis: BETA=0

               Without        With
Criterion    Covariates    Covariates    Model Chi-Square

-2 LOG L       1096.559      1043.201      53.358 with 4 DF (p=0.0001)
Score              .             .         51.408 with 4 DF (p=0.0001)
Wald               .             .         47.919 with 4 DF (p=0.0001)


                Analysis of Maximum Likelihood Estimates

                 Parameter    Standard     Wald        Pr >         Risk
Variable   DF     Estimate      Error   Chi-Square  Chi-Square     Ratio

BLACK       1     0.353492     0.10921    10.47629      0.0012     1.424
KIDS        1     0.186585     0.03994    21.82912      0.0001     1.205
DOUBLEUP    1     0.186489     0.11141     2.80200      0.0941     1.205
AGE         1    -0.049648     0.01019    23.72808      0.0001     0.952
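In releases of SAS/STAT in which PROC LOGISTIC supports a STRATA statement, the same conditional logit model can be requested without PHREG and without reversing the coding of the dependent variable. The following is a sketch under that assumption. It is not the program that produced Output 8.16, although the coefficient estimates should agree with it.

```sas
/* Assumes a SAS/STAT release in which PROC LOGISTIC accepts STRATA */
PROC LOGISTIC DATA=my.casecont;
  STRATA casenum;                                        /* one stratum per matched pair */
  MODEL staybaby(EVENT='1') = black kids doubleup age;   /* model Pr(staybaby=1) */
RUN;
```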
