Estimating the Full Multiple Regression Equation with PROC REG

There are several SAS procedures that allow you to perform multiple regression analyses. The basic multiple regression procedure, however, is PROC REG. This procedure estimates multiple regression coefficients for the various predictors, calculates R2 and tests it for significance, and prints additional information relevant to the analysis.

Writing the Program

This is the general form for using PROC REG to request a basic multiple regression analysis with standardized multiple regression coefficients:

PROC REG   DATA=dataset-name     options;
   MODEL  criterion  =  predictor-variables  /  STB   options;
RUN;

Predictor variables in the preceding MODEL statement should be separated by at least one space. The name of the last predictor variable should be followed by a slash and a list of options, if any are desired. You should always specify “STB” in the options field of the MODEL statement since this requests that the standardized multiple regression coefficients be printed.

Here are some options for the REG statement that can be particularly useful in social science research; additional options can be found in “The REG Procedure” in the SAS/STAT User’s Guide:

CORR

requests that the correlation matrix for all variables in the MODEL statement be printed.

SIMPLE

requests simple statistics for all variables in the analysis (mean, variance, standard deviation, and uncorrected sum of squares).

These are some options for the MODEL statement that might be particularly useful; more are listed in “The REG Procedure” in the SAS/STAT User’s Guide.

COLLIN

prints diagnostics regarding collinearity among the predictor variables. See the section on collinearity diagnostics in the PROC REG chapter of the SAS/STAT User’s Guide for details.

INFLUENCE

prints diagnostics regarding the influence of each observation on the parameter estimates and on the predicted Y values. See the section on influence diagnostics in the PROC REG chapter of the SAS/STAT User’s Guide for details.

P

prints actual Y scores, predicted Y scores, and residual scores (errors of prediction) for each observation. See the section on predicted and residual values in the PROC REG chapter of the SAS/STAT User’s Guide for details.

R

requests a detailed analysis of the residuals, including Cook’s D statistic, which assesses the influence of each observation on parameter estimates. See the section on predicted and residual values in the PROC REG chapter of the SAS/STAT User’s Guide for details.

SELECTION=model-selection-method

requests a specific model-selection method. Model-selection methods are used to select an “optimal” group of predictor variables from a larger set. Keywords for available methods are FORWARD, BACKWARD, STEPWISE, MAXR, MINR, RSQUARE, ADJRSQ, CP, and NONE. This chapter shows how to use the SELECTION=RSQUARE option to obtain information needed to compute uniqueness indices.

STB

requests the printing of the standardized multiple regression coefficient (beta weight) for each predictor variable.

In the present analysis, it is necessary to estimate a multiple regression equation in which the criterion variable is commitment and the predictor variables are rewards, costs, investment size, and alternative value. Here are the statements that request this model. The STB option in the MODEL statement requests the standardized regression coefficients:

PROC REG   DATA=D1;
   MODEL COMMITMENT = REWARD COST INVESTMENT ALTERNATIVES / STB;
RUN;

The SAS output created by these statements is reproduced as Output 14.2:

Output 14.2. Results of the REG Procedure with the STB Option
                                      The SAS System

                                    The REG Procedure
                                     Model: MODEL1
                            Dependent Variable: COMMITMENT

                                  Analysis of Variance

                                         Sum of           Mean
     Source                   DF        Squares         Square    F Value    Pr > F

     Model                     4     3154.83928      788.70982      19.59    <.0001
     Error                    43     1731.07738       40.25761
     Corrected Total          47     4885.91667


                  Root MSE              6.34489    R-Square      0.6457
                  Dependent Mean       27.70833    Adj R-Sq      0.6127
                  Coeff Var            22.89885


                                  Parameter Estimates

                        Parameter       Standard                           Standardized
Variable        DF       Estimate          Error    t Value    Pr > |t|        Estimate

Intercept        1       20.03813        8.73671       2.29      0.0268               0
REWARD           1        0.27937        0.27331       1.02      0.3124         0.13839
COST             1       -0.10575        0.20800      -0.51      0.6138        -0.05700
INVESTMENT       1        0.52347        0.21037       2.49      0.0168         0.31226
ALTERNATIVES     1       -0.67943        0.14670      -4.63      <.0001        -0.50014

Interpreting the Results of PROC REG

1. Make Sure That Everything Looks Right

At the top of the page, verify that the name of the criterion variable is listed to the right of “Dependent Variable.” In this case, the criterion is COMMITMENT. In the “Analysis of Variance” section, one of the headings is “DF” for degrees of freedom. Where this DF column intersects with the row headed “Corrected Total,” you find the corrected total degrees of freedom. Verify that this number is equal to N – 1, where N = the total number of participants who provided usable data for the analysis. In the present case, the total size of the sample was 50, but two of these participants did not provide complete data (as discussed in the section on PROC CORR), leaving usable data from 48 participants. Output 14.2 shows that corrected total degrees of freedom are 47, which is equal to N – 1 (where N = 48). These degrees of freedom therefore appear to be correct.
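These degrees-of-freedom relationships can be verified with a few lines of arithmetic (a minimal sketch in Python, using the sample size and predictor count from the present analysis):

```python
# Degrees-of-freedom check for the analysis in Output 14.2.
n_usable = 48   # 50 participants minus 2 with incomplete data
k = 4           # predictors: REWARD, COST, INVESTMENT, ALTERNATIVES

df_model = k                   # one df per predictor
df_error = n_usable - k - 1    # N - k - 1
df_total = n_usable - 1        # corrected total, N - 1

print(df_model, df_error, df_total)  # 4 43 47
```

These are exactly the Model, Error, and Corrected Total df shown in the “Analysis of Variance” section.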

2. Review the Obtained Value of R2

Toward the center of the page, you find the heading “R-Square.” The value to its right is the observed R2 for this multiple regression equation. Earlier, it was noted that this R2 value indicates the percentage of variance in the criterion variable that is accounted for by the linear combination of predictor variables. In the present case, R2 = .65. This indicates that the linear combination of REWARD, COST, INVESTMENT, and ALTERNATIVES accounts for about 65% of the observed variance in COMMITMENT.

There is a significance test associated with this R2; it tests the null hypothesis that R2 = 0 in the population. To test this null hypothesis, look in the “Analysis of Variance” section, under “F Value.” In this case, you see an F value of 19.59. Under the heading “Pr > F” is the p value associated with this F. Remember that the p value gives the probability of obtaining an F value this large or larger if the null hypothesis were true. In this case, the p value is very small (< .0001), so you reject the null hypothesis and conclude that the obtained value of R2 is statistically significant.
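The F value reported in the output can be recomputed either from the sums of squares or from R2 itself; the two routes are algebraically equivalent (a minimal sketch in Python, using the values from Output 14.2):

```python
# Recompute the F statistic of Output 14.2 in two equivalent ways.
ss_model, ss_error = 3154.83928, 1731.07738
df_model, df_error = 4, 43

# Route 1: ratio of mean squares from the ANOVA table.
f_from_ss = (ss_model / df_model) / (ss_error / df_error)

# Route 2: directly from R-square.
r_square = ss_model / (ss_model + ss_error)                  # about 0.6457
f_from_r2 = (r_square / df_model) / ((1 - r_square) / df_error)

print(round(f_from_ss, 2), round(f_from_r2, 2))  # both 19.59
```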

The preceding analysis determines whether the linear combination of predictor variables accounts for a significant amount of variance in the criterion; this test should be reviewed each time that you conduct a multiple regression. In addition to determining whether the predictors account for a significant amount of variance, you should also determine whether your predictors account for a meaningful amount of variance (i.e., a relatively large amount of variance). How large must an R2 value be to be considered meaningful? That depends, in part, on what has been found in prior research concerning the criterion variable being investigated. If, for example, predictor variables in earlier investigations have routinely accounted for 50% of variance in the criterion but the predictors in your study account for only 10%, your findings might not be viewed as being very important. On the other hand, if the predictors of earlier studies have routinely accounted for only 5% of variance but the variables of your study have accounted for 10%, this might be considered a meaningful amount of variance.

This issue of “statistical significance” versus “percentage of variance accounted for” is important because it is possible to obtain an R2 value that is very small (say, .03), but is still statistically significant. This often occurs when analyzing data from very large samples. Therefore, always review both the statistical significance of the equation, as well as the total amount of variance accounted for, in assessing the substantive importance of your findings.
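To see how a trivially small R2 can still reach significance, consider a hypothetical study; the sample size and R2 below are invented for illustration and are not taken from the present analysis:

```python
# Hypothetical illustration: R-square = .03 with N = 500 and 4 predictors.
r_square = 0.03
n, k = 500, 4
df_error = n - k - 1

f_value = (r_square / k) / ((1 - r_square) / df_error)
print(round(f_value, 2))  # 3.83, which exceeds the .05 critical value of
                          # roughly 2.4 for F(4, 495), hence significant
```

The equation is statistically significant even though the predictors account for only 3% of the variance in the criterion.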


3. Review the Adjusted Value of R2

Below the heading “R-square” is the heading “Adj R-Sq”; it stands for “adjusted R2.” To the right of “Adj R-Sq” is a version of R2 that is adjusted for degrees of freedom. In other words, this statistic adjusts for complexity of the regression model (i.e., favors more parsimonious solutions). This is provided because the actual value of R2 obtained with a given sample often overestimates the population value of R2. The adjusted R2, however, is adjusted to more closely approximate the population value. For this reason, the “Adj R-Sq” value is normally smaller than the “R-square” value.
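The adjustment is a simple function of R2, the sample size, and the number of predictors, and the “Adj R-Sq” value in Output 14.2 can be reproduced directly (a minimal sketch in Python):

```python
# Reproduce the Adj R-Sq value of Output 14.2 from R-square and the df.
r_square = 0.6457
n, k = 48, 4

adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - k - 1)
print(round(adj_r_square, 4))  # 0.6127, matching the output
```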

4. Review the Intercept and Nonstandardized Regression Coefficients

The bottom half of the output page provides information about the parameter estimates. These parameter estimates are the terms that constitute the multiple regression equation (i.e., the intercept and the nonstandardized multiple regression coefficients for the predictor variables).

To begin, notice that the first column of information is headed “Variable.” Below this heading are the terms in the regression equation: the intercept and the names of the four predictor variables (REWARD, COST, INVESTMENT, and ALTERNATIVES). The third column from the left is headed “Parameter Estimate.” This provides the intercept estimate along with the nonstandardized multiple regression coefficients for each predictor. In this case, the intercept is approximately 20.04, the nonstandardized regression coefficient for REWARD is .28, the nonstandardized coefficient for COST is –.11, and so forth. Based on these estimates, you can write the multiple regression equation in this way:

Y′ = 0.28(REWARD) – 0.11(COST) + 0.52(INVESTMENT)
     – 0.68(ALTERNATIVES) + 20.04

Remember that the multiple regression coefficient for a given predictor indicates the amount of change in Y that is associated with a one-unit change in that predictor while holding the remaining predictors constant. Nonstandardized coefficients represent the change that would be observed when the variables are in nonstandardized, “raw score” form (i.e., the different variables have different means and standard deviations). The nonstandardized regression equation would be used to predict participants’ scores on COMMITMENT so that the resulting scores would be on the same scale of magnitude as observed with the raw data. However, the coefficients in this equation cannot be used to assess the relative importance of predictor variables.
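The nonstandardized equation can be applied directly to raw scores to obtain a predicted COMMITMENT score; the predictor values in this sketch are hypothetical, chosen only for illustration:

```python
# Apply the nonstandardized regression equation to one set of
# hypothetical raw scores (the predictor values below are made up).
def predict_commitment(reward, cost, investment, alternatives):
    return (0.28 * reward - 0.11 * cost
            + 0.52 * investment - 0.68 * alternatives + 20.04)

y_hat = predict_commitment(reward=30, cost=20, investment=25, alternatives=15)
print(round(y_hat, 2))  # 29.04
```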

5. Review the Significance of the Regression Coefficients

Researchers usually want to determine whether regression coefficients for the various predictor variables are significantly different from zero. When given coefficients are statistically significant, this suggests that the corresponding predictor variable is a relatively important predictor of the criterion.

For each predictor variable, the output of PROC REG provides a t test that tests the null hypothesis that the regression coefficient is equal to zero. The obtained t value can be found in the column headed “t Value.” The p value corresponding to this value of t is in the next column, headed “Pr > |t|.”

For example, in the present case, the nonstandardized regression coefficient for REWARD is approximately 0.28. When testing the significance of this coefficient, the obtained value of t is 1.02; it has a corresponding p value of .31. Because this p value is greater than .05, you cannot reject the null hypothesis and must conclude that the regression coefficient for REWARD is not significantly different from zero. A different finding is obtained for the predictor INVESTMENT, however, which has a nonstandardized regression coefficient of .52. The t value for this coefficient is 2.49, with a corresponding p value of .02. Because this p value is less than .05, you reject the null hypothesis and tentatively conclude that the coefficient for INVESTMENT is significantly different from zero.
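Each t value in Output 14.2 is simply the parameter estimate divided by its standard error, which you can verify from the printed estimates (a minimal sketch in Python):

```python
# Each t value equals the parameter estimate divided by its
# standard error; verify for two of the predictors in Output 14.2.
t_reward = 0.27937 / 0.27331
t_investment = 0.52347 / 0.21037

print(round(t_reward, 2), round(t_investment, 2))  # 1.02 2.49
```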

The first paragraph of this subsection indicated that the statistical significance of a regression coefficient suggests that the corresponding variable is an important predictor of the criterion. That statement was qualified to emphasize that caution must be used in interpreting statistical significance as evidence of a predictor’s importance. There are at least two reasons for this.

First, an earlier section in the chapter stated that multiple regression coefficients are often unreliable, especially under the conditions that are often encountered in social science research. Second, a multiple regression coefficient can prove to be statistically significant even when the standardized coefficient is relatively small in absolute magnitude and hence is of little predictive value. This is likely to be the case especially when sample sizes are very large. For these reasons, the statistical significance of regression coefficients should be viewed as only one indicator of a variable’s importance and should always be combined with additional information such as the size of the standardized regression coefficients and uniqueness indices.

6. Review the Standardized Regression Coefficients (Beta Weights)

A previous section in this chapter indicated that nonstandardized regression coefficients generally indicate little about the relative importance of predictor variables. This is because the different predictors normally have different standard deviations, and these differences affect the size of the nonstandardized coefficients. To avoid this difficulty, it is necessary to review the standardized multiple regression coefficients or beta weights. Beta weights are the regression coefficients that would be obtained if all the variables were standardized so that they had the same standard deviations. It is therefore more appropriate to review the beta weights when you want to compare the relative importance of predictor variables. (In many textbooks, beta weights are represented by the Greek letter β while nonstandardized regression coefficients are represented by the letter B.)

In the preceding program, you requested standardized regression coefficients (beta weights) by specifying STB in the options section of the MODEL statement. These beta weights appear toward the bottom of Output 14.2, below the “Standardized Estimate” heading. Note that the intercept for this equation is zero as is always the case with a standardized regression equation.

The results of Output 14.2 show that the beta weight for REWARD is approximately .14, the beta for COST is –.06, the beta for INVESTMENT is .31, and the beta for ALTERNATIVES is –.50. Based on these findings, you could rank the predictors from most important to least important as follows: ALTERNATIVES, INVESTMENT, REWARD, and COST.
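Each beta weight is related to its nonstandardized coefficient by β = B(sx / sy), where sx and sy are the standard deviations of the predictor and the criterion. Output 14.2 does not report the standard deviations themselves, but the implied SD ratios can be recovered by dividing each standardized estimate by its nonstandardized estimate (a minimal sketch in Python):

```python
# beta = B * (SD of predictor / SD of criterion), so beta / B recovers
# the implied SD ratio for each predictor in Output 14.2.
estimates = {                      # (nonstandardized B, standardized beta)
    "REWARD":       (0.27937,  0.13839),
    "COST":         (-0.10575, -0.05700),
    "INVESTMENT":   (0.52347,  0.31226),
    "ALTERNATIVES": (-0.67943, -0.50014),
}

sd_ratios = {name: beta / b for name, (b, beta) in estimates.items()}
print({name: round(r, 3) for name, r in sd_ratios.items()})
```

Because the predictors have different SD ratios, ranking them by their nonstandardized coefficients would be misleading; the beta weights correct for this.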

This interpretation should be made with caution, however, because multiple regression coefficients, whether standardized or nonstandardized, tend to be somewhat unreliable under the conditions normally encountered in social science research. A more cautious approach to understanding the relative importance of predictor variables would involve combining information from a variety of sources, including bivariate correlations, standardized regression coefficients, and uniqueness indices.

You can see that Output 14.2 does not report significance tests for the standardized regression coefficients. This is because the significance of the nonstandardized coefficients has already been reported, and a separate test for the standardized coefficients is not necessary. More precisely, if the t test for the nonstandardized coefficient is significant, then the corresponding standardized coefficient for that variable is also significantly different from zero.
