3.5. Statistical Procedures for Model Building

In this section we review a number of procedures that can be used to build statistical models. The model builder needs a substantial toolbox, since no single statistical tool will work on every set of data; as a corollary, the nature of the data helps determine the appropriate model-building tool. See Two Crows Corporation (1999) for a review of model-building procedures used to extract predictive patterns in data.

3.5.1. Multiple Linear Regression

Multiple linear regression has certainly been the workhorse for building predictive models for many years. It is appropriate when there are many more observations than descriptors and the dependent variable is continuous. Models take the form

y = β0 + β1x1 + β2x2 + ... + βpxp + e,

where y is the dependent variable, x1,...,xp are the independent variables, and e is the random error. If a linear combination of the descriptors can approximate the response, this formulation works well and is simple to perform. Higher-order terms such as xi^n and cross-product terms such as xixj are also allowed. PROC REG is appropriate here and supports variations of this simple procedure, such as adding or deleting terms in a stepwise fashion. Program 3.3 illustrated the use of PROC REG with forward selection; a minimal sketch of such a fit is given below.
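The following sketch illustrates forward selection with PROC REG on the solubility training set. It is written in the spirit of Program 3.3 rather than reproducing it, and the SLENTRY= entry threshold of 0.05 is an assumed choice.

proc reg data=soltruse;
    /* Forward selection: descriptors enter one at a time while their */
    /* entry p-value is below the (assumed) SLENTRY threshold of 0.05 */
    model y=x1-x14/selection=forward slentry=0.05;
run;
quit;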

PROC REG can also be used interactively. After you specify a model in PROC REG and submit a RUN statement (but not a QUIT statement), you can interactively use a variety of other statements, such as ADD, DELETE, and PRINT. See the PROC REG documentation for more information. To demonstrate the utility of this feature, Program 3.5 provides the statements for running PROC REG interactively on the solubility training set. In Program 3.5, the initial model contains the five most important variables, x1,...,x5 (they are included in the MODEL statement), and the VAR statement lists the other variables that might be added to the model. The subsequent ADD statements add variables to the initial model, and the QUIT statement ends PROC REG. The results for the final model are included in Output 3.5. For the intercept and each of the 11 variables in the final model, the output contains the parameter estimate, the standard error of the estimate, and the one-degree-of-freedom t test.

Program 3.5 Using PROC REG interactively on the solubility data
proc reg data=soltruse;
    /* Initial model with the five most important descriptors */
    model y = x1 x2 x3 x4 x5;
    /* Candidate descriptors that may be added interactively */
    var x6 x7 x8 x9 x10 x11 x12 x13 x14;
    run;
    /* Add descriptors two at a time and print the updated fit */
    add x6 x7;
    print;
    run;
    add x8 x9;
    print;
    run;
    add x10 x11;
    print;
    run;
    quit;

Output 3.5 Parameter estimates for the final model in Program 3.5
Parameter Estimates

                                Parameter       Standard
           Variable     DF       Estimate          Error    t Value    Pr > |t|

           Intercept     1       −3.40278        3.57476      −0.95      0.3433
           x1            1       −0.00771        0.00198      −3.89      0.0002
           x2            1       −0.00851        0.00162      −5.25      <.0001
           x3            1        0.77612        0.22350       3.47      0.0007
           x4            1        0.08929        0.04636       1.93      0.0567
           x5            1        0.23927        0.11684       2.05      0.0430
           x6            1        0.07884        0.03048       2.59      0.0110
           x7            1        7.36932        3.37147       2.19      0.0310
           x8            1       −0.00510        0.00126      −4.06      <.0001
           x9            1       −0.40608        0.12531      −3.24      0.0016
           x10           1        0.63875        0.29842       2.14      0.0345
           x11           1        0.88649        0.20612       4.30      <.0001

3.5.2. Logistic Regression

Logistic regression is a generalization of linear regression formulated for a binary (0 or 1) dependent variable. It uses the log odds or logit transformation:

logit(Pi) = log[Pi/(1 − Pi)] = β0 + β1xi1 + β2xi2 + ... + βpxip,

where Pi is the probability of the event occurring for the ith observation. The model assumes that the log odds are a linear function of the predictors, so logistic regression shares many of the pros and cons of the multiple linear regression discussed earlier.

There are several SAS procedures that you can use to perform logistic regression, such as the LOGISTIC, CATMOD, and GENMOD procedures.
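As a minimal sketch, assume a hypothetical data set solclass containing a binary activity flag active along with the descriptors x1-x5 (the data set and response names are illustrative, not taken from the solubility example):

proc logistic data=solclass;
    /* EVENT='1' models the probability that active=1 */
    model active(event='1') = x1 x2 x3 x4 x5;
run;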

3.5.3. Discriminant Analysis

Discriminant analysis is a classical statistical technique developed by R. A. Fisher in the 1930s and used to classify the famous iris data set. The procedure determines the hyperplanes that separate, or discriminate between, the various classes in the data. It is a very simple technique to use and interpret, since an observation falls on one side of a hyperplane or the other. However, its assumptions, such as normality of the descriptors, can be problematic, and the fact that the boundaries separating the different classes are linear may reduce its discriminating ability.

There are several SAS procedures that you can use to perform discriminant analysis, such as the DISCRIM, CANDISC, and STEPDISC procedures.
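For illustration, here is a minimal PROC DISCRIM sketch, again using the hypothetical solclass data set with class variable active and descriptors x1-x5:

proc discrim data=solclass crossvalidate;
    /* CLASS names the grouping variable to discriminate between */
    class active;
    /* VAR lists the descriptors that define the separating hyperplanes */
    var x1 x2 x3 x4 x5;
run;

By default, PROC DISCRIM fits a linear discriminant function under the normality assumption discussed above; the CROSSVALIDATE option requests cross-validated error-rate estimates.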

3.5.4. Generalized Additive Regression

The standard linear regression model is used quite often for building models. It is extremely easy to use but suffers from the fact that, in real life, many effects are not linear. The generalized additive regression model offers a flexible statistical technique that can accommodate nonlinear effects. Generalized additive regression is also an attractive tool because it allows both continuous and discrete responses.

Recall that the standard linear regression model assumes the expected value of y has the linear form:

E(y|X) = f(x1, ..., xp) = β0 + β1x1 + ... + βpxp.

The additive model generalizes the linear model by modeling the expected value of y as

E(y|X) = s0 + s1(x1) + s2(x2) + ... + sp(xp),

where si, i = 1,...,p, are smooth functions estimated in a nonparametric fashion. Generalized additive models are able to fit both discrete and continuous responses by allowing a link function between f(x1,...,xp) and the expected value of y; a sketch for a binary response is given below. Additional details may be found in the GAM procedure documentation.
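For instance, with a binary response an additive logistic model can be requested through the DIST= option in the MODEL statement. A minimal sketch, again assuming the hypothetical solclass data set with binary response active; note that the DIST= keyword for binary data varies across SAS releases (e.g., DIST=BINARY or DIST=BINOMIAL):

proc gam data=solclass;
    /* Logit link for the binary response (keyword may vary by release) */
    model active = spline(x1) spline(x2) spline(x3)/dist=binary;
    run;
    quit;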

This procedure has a lot of appeal and can be very useful. Since it is new to SAS, we will illustrate its use on the solubility training data set. Program 3.6 contains the SAS code to fit the model using five variables from the solubility training set. The GAM procedure determines the value of the smoothing parameter for each variable. For the solubility example, the SPLINE smoothing effect was chosen for each variable. The other available options are LOESS, SPLINE2, and PARAM: the LOESS option fits a local regression to the variable, the SPLINE2 option fits a bivariate thin-plate spline to two variables, and the PARAM option specifies a parametric variable to which no smoothing function is applied.

Program 3.6 Analysis of the solubility data using PROC GAM
proc gam data=soltruse;
    /* Fit an additive model with a spline smoother for each descriptor */
    model y=spline(x1) spline(x2) spline(x3) spline(x4) spline(x5)/dist=normal;
    run;
    quit;

Output 3.6 Regression and smoothing model analyses from Program 3.6
Regression Model Analysis
                                   Parameter Estimates

                             Parameter       Standard
             Parameter        Estimate          Error    t Value    Pr > |t|

             Intercept        −0.73649        0.36152      −2.04      0.0442
             Linear(x1)       −0.00741        0.00118      −6.31      <.0001
             Linear(x2)       −0.00677        0.00158      −4.30      <.0001
             Linear(x3)        0.79513        0.19670       4.04      0.0001
             Linear(x4)        0.01799        0.03777       0.48      0.6349
             Linear(x5)        0.17723        0.10212       1.74      0.0857

                                Smoothing Model Analysis
                                  Analysis of Deviance

                                             Sum of
          Source                 DF         Squares    Chi-Square    Pr > ChiSq

          Spline(x1)        3.00000       14.958834       23.9431        <.0001
          Spline(x2)        3.00000        5.345832        8.5565        0.0358
          Spline(x3)        1.00000        0.038719        0.0620        0.8034
          Spline(x4)        3.00000        1.252530        2.0048        0.5714
          Spline(x5)        3.00000        7.541784       12.0714        0.0071

Output 3.6 provides the output from Program 3.6. From the regression model analysis we can see that x1, x2, and x3 have significant linear trends; the p-values for the corresponding linear tests are significant at the 5% level. From the analysis of deviance table, the chi-square tests are significant for x1, x2, and x5, indicating that these variables need to be smoothed.

A plot of the smoothing components with 95% confidence bands for each of the five descriptors can be created using the ODS GRAPHICS statement, as shown below:

ods html;
ods graphics on;
proc gam data=soltruse plots(clm commonaxes);
    model y=spline(x1) spline(x2) spline(x3) spline(x4) spline(x5)/dist=normal;
    run;
    quit;
ods graphics off;
ods html close;

The plot generated using ODS GRAPHICS is not reproduced here because it is not of publication quality and, due to the experimental nature of PROC GAM, the option to save the smoothed components to a SAS data set is not yet available.
