Chapter 8: Modeling Continuous Variables

Linear Regressions

Extensions of Ordinary Linear Regression

Generalized Linear Models

Regression Trees

Conclusion

In this chapter, we explore several commonly used predictive models for a continuous dependent variable, including linear regressions, generalized linear models, and regression trees. The data set that is used in this chapter is the Cars data set available in the sas-viya-programming repository of the sassoftware account on GitHub.  Upload the data set directly from GitHub using the upload method on the CAS connection, as follows:

In [1]: cars = conn.upload('https://raw.githubusercontent.com/sassoftware/'

                            'sas-viya-programming/master/data/cars.csv').casTable

 

In [1]: cars.tableinfo()

Out[1]:

[TableInfo]

 

    Name  Rows  Columns Encoding CreateTimeFormatted  

 0  CARS   428       15    utf-8  09Nov2016:10:32:02   

 

      ModTimeFormatted JavaCharSet    CreateTime       ModTime  

 0  09Nov2016:10:32:02        UTF8  1.794307e+09  1.794307e+09   

 

    Global  Repeated  View SourceName SourceCaslib  Compressed  

 0       1         0     0                                   0   

 

   Creator Modifier  

 0  username           

In [2]: cars.columninfo()

Out[2]:

[ColumnInfo]

 

          Column            Label  ID    Type  RawLength  

 0          Make                    1    char         13   

 1         Model                    2    char         40   

 2          Type                    3    char          8   

 3        Origin                    4    char          6   

 4    DriveTrain                    5    char          5   

 5          MSRP                    6  double          8   

 6       Invoice                    7  double          8   

 7    EngineSize  Engine Size (L)   8  double          8   

 8     Cylinders                    9  double          8   

 9    Horsepower                   10  double          8   

 10     MPG_City       MPG (City)  11  double          8   

 11  MPG_Highway    MPG (Highway)  12  double          8   

 12       Weight     Weight (LBS)  13  double          8   

 13    Wheelbase   Wheelbase (IN)  14  double          8   

 14       Length      Length (IN)  15  double          8   

 

     FormattedLength  Format  NFL  NFD  

 0                13            0    0  

 1                40            0    0  

 2                 8            0    0  

 3                 6            0    0  

 4                 5            0    0  

 5                 8  DOLLAR    8    0  

 6                 8  DOLLAR    8    0  

 7                12            0    0  

 8                12            0    0  

 9                12            0    0  

 10               12            0    0  

 11               12            0    0  

 12               12            0    0  

 13               12            0    0  

 14               12            0    0  

Linear Regressions

Linear regression is one of the most widely used statistical models for predictive modeling. The basic idea of a predictive model is to establish a function y = f(x1, x2, …, xK) to predict the value of the dependent variable y based on the values of the predictors X1, X2, …, XK. Linear regression assumes that the function f is a linear combination of the predictors plus an error term ε.

y = a + b1x1 + b2x2 + … + bKxK + ε

Usually, we assume that ε follows a normal distribution with mean zero and variance σ². The parameters to be estimated in a linear model include the intercept a, the slopes b1, b2, …, bK, and the variance of the error term σ². These parameters are estimated using the method of least squares.
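
As a quick illustration of the method of least squares, here is a minimal sketch in plain Python with made-up data (the glm action below performs this estimation in CAS):

import numpy as np

# made-up data: one predictor and a dependent variable
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# design matrix with an intercept column; lstsq minimizes the sum of squared errors
X = np.column_stack([np.ones_like(x), x])
coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
a, b1 = coef

# estimate the error variance from the residuals (two parameters were estimated)
resid = y - (a + b1 * x)
sigma2 = resid @ resid / (len(y) - 2)
print(a, b1, sigma2)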

The regression action set provides a glm action, which fits linear regression models using the least squares method. You must load the action set before you use the glm action.

In [3]: conn.loadactionset('regression')

   ...: conn.help(actionset='regression')

NOTE: Added action set 'regression'.

Out[3]:

Regression

                              regression

  Name                            Description

  0 glm      Fits linear regression models using the method of least squares

  1 genmod   Fits generalized linear regression models

  2 logistic Fits logistic regression models

Let’s build a simple regression model on the Cars data to predict MSRP from the city miles per gallon (MPG_City) of the cars:

In [4]: cars.glm(

   ...:     target = 'MSRP',

   ...:     inputs = ['MPG_City']

   ...: )

Out[4]:

[ModelInfo]

 

 Model Information

 

          RowId        Description Value

 0         DATA        Data Source  CARS

 1  RESPONSEVAR  Response Variable  MSRP

 

[NObs]

 

 Number of Observations

 

    RowId                  Description  Value

 0  NREAD  Number of Observations Read  428.0

 1  NUSED  Number of Observations Used  428.0

 

[Dimensions]

 

 Dimensions

 

       RowId           Description  Value

 0  NEFFECTS     Number of Effects      2

 1    NPARMS  Number of Parameters      2

 

[ANOVA]

 

 Analysis of Variance

 

    RowId           Source     DF            SS            MS  

 0  MODEL            Model    1.0  3.638090e+10  3.638090e+10   

 1  ERROR            Error  426.0  1.248507e+11  2.930768e+08   

 2  TOTAL  Corrected Total  427.0  1.612316e+11           NaN   

 

       FValue         ProbF  

 0  124.13436  1.783404e-25  

 1        NaN           NaN  

 2        NaN           NaN  

 

[FitStatistics]

 

 Fit Statistics

 

        RowId Description         Value

 0       RMSE    Root MSE  1.711949e+04

 1    RSQUARE    R-Square  2.256437e-01

 2     ADJRSQ    Adj R-Sq  2.238260e-01

 3        AIC         AIC  8.776260e+03

 4       AICC        AICC  8.776316e+03

 5        SBC         SBC  8.354378e+03

 6  TRAIN_ASE         ASE  2.917073e+08

 

[ParameterEstimates]

 

 Parameter Estimates

 

       Effect  Parameter  DF      Estimate       StdErr     tValue  

 0  Intercept  Intercept   1  68124.606698  3278.919093  20.776544   

 1   MPG_City   MPG_City   1  -1762.135298   158.158758 -11.141560   

 

           Probt  

 0  1.006169e-66  

 1  1.783404e-25  

 

[Timing]

 

 Task Timing

 

             RowId                  Task      Time   RelTime

 0           SETUP     Setup and Parsing  0.027544  0.626975

 1    LEVELIZATION          Levelization  0.007532  0.171444

 2  INITIALIZATION  Model Initialization  0.000371  0.008444

 3            SSCP      SSCP Computation  0.003291  0.074909

 4         FITTING         Model Fitting  0.000367  0.008352

 5         CLEANUP               Cleanup  0.002385  0.054286

 6           TOTAL                 Total  0.043932  1.000000

The ParameterEstimates table contains the estimation of parameters for the linear regression model. In the preceding example, the model returned by the glm action is shown as follows:

   MSRP = 68124.606698 − 1762.135298 × MPG_City
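
Using these estimates, you can score a new observation by hand. For example, for a hypothetical car with MPG_City = 25 (an assumed value for illustration):

# coefficients copied from the ParameterEstimates table above
intercept = 68124.606698
slope = -1762.135298

mpg_city = 25  # a hypothetical new observation
predicted_msrp = intercept + slope * mpg_city
print(predicted_msrp)  # about 24071.22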

The result tables also contain useful information about the model definition and model fitting. The following table summarizes the result tables.

Table Name Description
NObs The number of observations that are read and used. Missing values are excluded, by default.
Dimensions The dimension of the model, including the number of effects and the number of parameters.
ANOVA The Analysis of Variance table that measures the overall model fitting.
FitStatistics The fit statistics of the model such as R-Square and root mean square error.
ParameterEstimates The estimation of the regression parameters.
Timing A timing of the subtasks of the glm action call.
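
The action call returns these tables in a dictionary-like CASResults object, so you can also pick out an individual table by name. For example, a sketch that reuses the glm call from above:

out = cars.glm(target='MSRP', inputs=['MPG_City'])
print(list(out.keys()))           # names of the returned result tables
print(out['ParameterEstimates'])  # a single table as a DataFrame-like object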

Compared to the data exploration actions introduced in Chapter 7, the glm action requires a more complex and deeper parameter structure. In this case, it might be more convenient to define a new model first and then specify the model parameters, step-by-step. In other words, the first linear regression shown in the preceding example can be rewritten as follows:

linear1 = cars.Glm()

linear1.target = 'MSRP'

linear1.inputs = ['MPG_City']

linear1()

This approach enables you to reuse the code when you need to change only a few options of the glm action. For example, to display only the parameter estimation table, you specify the names of the output table in the display option and rerun the linear1 model:

In [4]: linear1.display.names = ['ParameterEstimates']

   ...: linear1()

Out[4]:

ParameterEstimates

                                Parameter Estimates

    Effect   Parameter   DF   Estimate     StdErr     tValue   Probt

0 Intercept Intercept 1 68124.606698 3278.9190925 20.776543969 1.006169E-66

1 MPG_City MPG_City 1 -1762.135298 158.15875785 -11.14156005 1.783404E-25

So far, we have used the glm action only to estimate the parameters of the linear regression model. We haven’t used the model to predict the MSRP values of the cars. For prediction, you must specify an output table using the output option. You can also delete the display.names option in order to request all result tables:

In [5]: del linear1.display.names

   ...: result1 = conn.CASTable('cas.MSRPPrediction')

   ...: result1.replace = True

   ...: linear1.output.casout = result1

   ...: linear1.output.copyvars = 'all';

   ...: linear1()

 

Out[5]:

 

…output clipped…

 

[OutputCasTables]

 

                 casLib            Name Label  Rows  Columns  

 0  CASUSERHDFS(username)  MSRPPrediction         428       16   

 

                                   casTable  

 0  CASTable('MSRPPrediction', caslib='C...  

In the preceding example, a new CAS table MSRPPrediction is defined and then used in the output option. When you submit the code, the glm action first fits a linear regression model, and then it uses the fitted model to score the input data. It also creates the new CAS table MSRPPrediction that contains the predicted MSRP values. The copyvars='all' option requests that the glm action copy all columns from the CARS table to the MSRPPrediction table.

In the preceding example, the output column name for predicted MSRP is not specified. The glm action automatically chooses Pred as the name.

You can summarize the predicted values using the summary action from the simple action set.

In [6]: result1[['pred']].summary()

Out[6]:

[Summary]

 

 Descriptive Statistics for MSRPPREDICTION

 

   Column           Min          Max      N  NMiss         Mean  

 0   Pred -37603.511169  50503.25372  428.0    0.0  32774.85514   

 

           Sum          Std      StdErr           Var           USS  

 0  14027638.0  9230.448198  446.170554  8.520117e+07  4.961347e+11   

 

             CSS         CV     TValue          ProbT  

 0  3.638090e+10  28.163201  73.458131  2.182203e-244  

The glm action can generate additional columns besides the predicted values. The following table summarizes the statistical outputs that the glm action can generate.

Option Description
pred The predicted value. If you do not specify any output statistics, the predicted value is named Pred, by default.
resid The residual, which is calculated as ACTUAL minus PREDICTED.
cooksd The Cook's D influence statistic.
covratio The standard influence of the observation on covariance of betas. The COVRATIO statistic measures the change in the determinant of the covariance matrix of the estimates by deleting the current observation.
dffits The scaled measure of the change in the predicted value of the ith observation, calculated by deleting the ith observation. A large value indicates that the observation is very influential in its neighborhood of the X space.
h The leverage of the observation.
lcl The lower bound of a confidence interval for an individual prediction.
ucl The upper bound of a confidence interval for an individual prediction.
lclm The lower bound of a confidence interval for the expected value of the dependent variable.
uclm The upper bound of a confidence interval for the expected value of the dependent variable.
likedist The likelihood displacement.
press The ith residual divided by 1 − h, where h is the leverage and where the model has been refit without the ith observation.
rstudent The studentized residual with the current observation deleted.
stdi The standard error of the individual predicted value.
stdp The standard error of the mean predicted value.
stdr The standard error of the residual.
student The studentized residuals, which are the residuals divided by their standard errors.

The following example adds the residual values (observed MSRP values minus predicted MSRP values) and the confidence intervals of the prediction to the output table:

In [7]: result2 = conn.CASTable('cas.MSRPPrediction2', replace=True)

   ...: linear1.output.casout = result2

   ...: linear1.output.pred  = 'Predicted_MSRP'

   ...: linear1.output.resid = 'Residual_MSRP'

   ...: linear1.output.lcl = 'LCL_MSRP'

   ...: linear1.output.ucl = 'UCL_MSRP'

   ...: linear1()

The output table MSRPPrediction2 is a CAS table that is saved on the CAS server. You have several ways to fetch or download a table from the CAS server. Because the CARS data is relatively small, you can pull all observations from MSRPPrediction2 directly to the Python client using the to_frame method. Then you can use a visualization package in Python such as Bokeh to observe and understand the model outcome.
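
Because to_frame downloads every row, it is best suited to small tables like this one. For larger tables, a sketch of how you might preview just a few rows instead:

result2.fetch(to=5)   # fetch action: return only the first 5 rows
result2.head()        # DataFrame-style: first few rows as a client-side DataFrame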

In [8]: from bokeh.charts import Scatter, output_file, show

   ...: out1 = result2.to_frame()

   ...: p = Scatter(out1, x='Residual_MSRP', y='Predicted_MSRP',  

                    color='Origin', marker='Origin')

   ...: output_file('scatter.html')

   ...: show(p)

The following figure shows a scatter plot of the predicted MSRP values and residuals.

[Figure: scatter plot of predicted MSRP values and residuals]

You can see that the model predicted large negative MSRP values for three observations. Let’s print out these observations to find out what happens:

In [9]: result2[['Predicted_MSRP', 'MSRP', 'MPG_City', 'Make',

                  'Model']].query('Predicted_MSRP < 0').to_frame()

Out[9]:

Selected Rows from Table MSRPPREDICTION2

 

   Predicted_MSRP     MSRP  MPG_City    Make  

0   -12933.617000  20140.0      46.0   Honda   

1   -35841.375871  20510.0      59.0  Toyota   

2   -37603.511169  19110.0      60.0   Honda   

 

                                     Model  

0   Civic Hybrid 4dr manual (gas/electric)  

1                 Prius 4dr (gas/electric)  

2               Insight 2dr (gas/electric)  

All of these cars are fuel efficient with relatively high MPG_City. If you generate a scatter plot of the dependent variable MSRP and the predictor variable MPG_City, you can see that the data has some extreme outliers with high MSRP values or high city MPG that might not fit the assumed linear relationship between these two variables.

In [10]: p = Scatter(out1, x='MPG_City', y='MSRP',

                     color='Origin', marker='Origin')

    ...: output_file('scatter.html')

    ...: show(p)

The following figure shows a scatter plot of MPG_City and MSRP.

[Figure: scatter plot of MPG_City and MSRP]

Outliers are observed when MSRP values are higher than $100,000 or MPG_City is higher than 40.

Linear regression models are sensitive to outliers. There are several treatments for outliers that improve the prediction accuracy of a linear regression model. The simplest approach is to remove the outliers. In the next example, let’s add a filter to remove cars with MPG_City greater than 40 or MSRP greater than $100,000. The R-Square of the model improves from 0.22 to 0.34.

In [11]: cars.where = 'MSRP < 100000 and MPG_City < 40'

    ...:

    ...: result2 = conn.CASTable('cas.MSRPPrediction2')

    ...: result2.replace = True

    ...:

    ...: linear2 = cars.Glm()

    ...: linear2.target = 'MSRP'

    ...: linear2.inputs = ['MPG_City']

    ...: linear2.output.casout = result2

    ...: linear2.output.copyVars = 'ALL';

    ...: linear2.output.pred = 'Predicted_MSRP'

    ...: linear2.output.resid = 'Residual_MSRP'

    ...: linear2.output.lcl = 'LCL_MSRP'

    ...: linear2.output.ucl = 'UCL_MSRP'

    ...: linear2()

You can also use the DataFrame API of the CASTable cars to apply a filter. The preceding model can also be defined using the query method of the CASTable:

linear2 = cars.query('MSRP < 100000 and MPG_City < 40').Glm()

You can see that we have a better residual plot after the outliers are removed from the model.

In [12]: out2 = result2.to_frame()

    ...: p = Scatter(out2, x='Predicted_MSRP', y='Residual_MSRP',

                     color='Origin', marker='Origin')

    ...: output_file('scatter.html')

    ...: show(p)

The following figure shows a scatter plot of predicted MSRP values and residuals, after excluding outliers.

[Figure: scatter plot of predicted MSRP values and residuals, after excluding outliers]

Let’s continue to improve the linear regression model for predicting MSRP by adding more predictors to the model. In this example, we add three categorical predictors (Origin, Type, and DriveTrain) and two more continuous predictors (Weight and Length). The categorical predictors must be specified in both the inputs and the nominals parameters. The R-Square statistic improves again (0.63).

In [13]: nomList = ['Origin','Type','DriveTrain']

    ...: contList = ['MPG_City','Weight','Length']

    ...:

    ...: linear3 = conn.CASTable('cars').Glm()

    ...: linear3.target = 'MSRP'

    ...: linear3.inputs = nomList + contList

    ...: linear3.nominals = nomList

    ...: linear3.display.names = ['FitStatistics','ParameterEstimates']

    ...: linear3()

 

Out [13]:

[FitStatistics]

 

 Fit Statistics

 

        RowId Description         Value

 0       RMSE    Root MSE  1.201514e+04

 1    RSQUARE    R-Square  6.284172e-01

 2     ADJRSQ    Adj R-Sq  6.176727e-01

 3        AIC         AIC  8.483996e+03

 4       AICC        AICC  8.485013e+03

 5        SBC         SBC  8.106765e+03

 6  TRAIN_ASE         ASE  1.399787e+08

 

[ParameterEstimates]

 

 Parameter Estimates

 

         Effect  Origin    Type DriveTrain         Parameter  DF  

 0    Intercept                                    Intercept   1   

 1       Origin    Asia                          Origin Asia   1   

 2       Origin  Europe                        Origin Europe   1   

 3       Origin     USA                           Origin USA   0   

 4         Type          Hybrid                  Type Hybrid   1   

 5         Type             SUV                     Type SUV   1   

 6         Type           Sedan                   Type Sedan   1   

 7         Type          Sports                  Type Sports   1   

 8         Type           Truck                   Type Truck   1   

 9         Type           Wagon                   Type Wagon   0   

 10  DriveTrain                        All    DriveTrain All   1   

 11  DriveTrain                      Front  DriveTrain Front   1   

 12  DriveTrain                       Rear   DriveTrain Rear   0   

 13    MPG_City                                     MPG_City   1   

 14      Weight                                       Weight   1   

 15      Length                                       Length   1   

 

         Estimate        StdErr    tValue         Probt  

 0  -23692.980669  16261.000069 -1.457043  1.458607e-01  

 1    2191.206289   1479.756700  1.480788  1.394218e-01  

 2   17100.937866   1779.533025  9.609790  7.112423e-20  

 3       0.000000           NaN       NaN           NaN  

 4   26154.719438  10602.173003  2.466921  1.403098e-02  

 5   -1016.065543   3083.503255 -0.329517  7.419315e-01  

 6    2481.367175   2359.814614  1.051509  2.936366e-01  

 7   21015.571095   3065.180416  6.856226  2.572647e-11  

 8  -12891.562541   3592.436933 -3.588529  3.722560e-04  

 9       0.000000           NaN       NaN           NaN  

 10  -7669.535987   1987.120548 -3.859623  1.316520e-04  

 11  -7699.083608   1722.883863 -4.468719  1.016332e-05  

 12      0.000000           NaN       NaN           NaN  

 13   -496.308946    266.606023 -1.861582  6.336879e-02  

 14      9.171343      1.914672  4.790032  2.325015e-06  

 15    162.893307     82.088679  1.984358  4.787420e-02  

A linear regression model with categorical effects still assumes homogeneity of variance (that is, the random errors for all observations have the same variance). In the preceding example, this means that the variation of MSRP values is approximately the same across different origins, types, or drive trains. Sometimes data can be heterogeneous. For example, cars from different origins might have different MSRP values as well as different variation in MSRP values.

In [14]: cars = conn.CASTable('cars')

    ...: out = cars.groupby('Origin')[['MSRP']].summary()

    ...: out.concat_bygroups()['Summary'][['Column','Mean','Var','Std']]

    ...:

Out[14]:

Descriptive Statistics for CARS

 

       Column          Mean           Var           Std

Origin                                                 

Asia     MSRP  24741.322785  1.281666e+08  11321.069675

Europe   MSRP  48349.796748  6.410315e+08  25318.600464

USA      MSRP  28377.442177  1.371705e+08  11711.982506

The output from the summary action shows that cars from Europe not only have higher MSRP values but also a greater variance. The sample standard deviation of MSRP values for the European cars is more than double that of the cars from Asia and the USA. One easy remedy for variance heterogeneity is to fit multiple models, one for each segment of the data. For the glm action, you can fit multiple models with the groupby option:

In [15]: cars = conn.CASTable('cars')

    ...: cars.groupby = ['Origin']

    ...: cars.where = 'MSRP < 100000 and MPG_City < 40'

    ...: nomList = ['Type','DriveTrain']

    ...: contList = ['MPG_City','Weight','Length']

    ...: groupBYResult = conn.CASTable('MSRPPredictionGroupBy')

    ...:

    ...: linear4 = cars.Glm()

    ...: linear4.target = 'MSRP'

    ...: linear4.inputs = nomList + contList

    ...: linear4.nominals = nomList

    ...: linear4.display.names = ['FitStatistics','ParameterEstimates']

    ...: linear4.output.casout = groupBYResult

    ...: linear4.output.copyVars = 'ALL';

    ...: linear4.output.pred = 'Predicted_MSRP'

    ...: linear4.output.resid = 'Residual_MSRP'

    ...: linear4.output.lcl = 'LCL_MSRP'

    ...: linear4.output.ucl = 'UCL_MSRP'

    ...: linear4()

    ...:

    ...: out = groupBYResult.to_frame()

    ...: p = Scatter(out, x='Predicted_MSRP', y='Residual_MSRP',

                     color='Origin', marker='Origin')

    ...: output_file('scatter.html')

    ...: show(p)

The following figure shows a scatter plot of predicted MSRP values and residuals with three linear regression models fit for each origin (Asia, Europe, and USA):

[Figure: scatter plot of predicted MSRP values and residuals for the three by-group models (Asia, Europe, and USA)]

Extensions of Ordinary Linear Regression

The key assumptions of an ordinary linear regression are 1) the expected value of the dependent variable can be predicted using a linear combination of a set of predictors, and 2) the error term ε follows a normal distribution. The second assumption might not be valid for some applications, such as estimating the number of calls received within a time interval in a call center. Generalized linear models (GLMs) and regression trees are popular generalizations of ordinary linear regression for fitting data that does not follow a normal distribution.

Generalized Linear Models

In generalized linear models, the dependent variable Y is assumed to follow a particular probability distribution, and the expected value of Y is predicted by a function of the linear combination of the predictors,

E(y) = f(x1, x2, …, xK) = f(a + b1x1 + b2x2 + … + bKxK)

where E(y) is the expected value of Y; x1, x2, …, xK are the observed values of the predictors X1, X2, …, XK; and b1, b2, …, bK are the unknown parameters. It is more common to express a GLM by linking the expected value of Y to the linear combination of the predictors,

g[E(y)] = a + b1x1 + b2x2 + … + bKxK

The link function g() is the inverse function of f(). The probability distribution of Y is usually from the exponential family of distributions, such as the normal, binomial, exponential, gamma, Poisson, and zero-inflated distributions. The choice of the link function usually depends on the assumed probability distribution. For example, for call center data, it is common to assume that the number of calls within a time interval follows a Poisson distribution

P(k calls in interval) = λ^k e^(−λ) / k!,   k = 0, 1, 2, …

and to use a log link function that relates the expected number of calls λ to the linear combination of the predictors:

log(λ) = a + b1x1 + b2x2 + … + bKxK

It is also worth mentioning that ordinary linear regression is a special type of GLM, in which the target variable follows a normal distribution and the link function is the identity function: E(y) = a + b1x1 + b2x2 + … + bKxK. For more details about generalized linear models, see [1] and [2].
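
As a small sketch of the link relationship (plain Python, independent of the CAS actions; the coefficients are assumed values for illustration): with a log link, the linear predictor can be any real number, but the implied mean is always positive, which suits a count response.

import numpy as np

a, b1 = -0.5, 0.8   # assumed coefficients for illustration
x1 = np.array([0.0, 1.0, 2.0, 3.0])

eta = a + b1 * x1   # linear predictor: g[E(y)] = log(lambda)
lam = np.exp(eta)   # inverse link: expected counts, always positive
print(lam)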

Generalized linear models are available in the regression CAS action set. Let’s continue to use the Cars data example and build a simple generalized linear model to predict MSRP values of cars using MPG_City.

In [1]: cars = conn.CASTable('cars')

   ...: genmodModel1 = cars.Genmod()

   ...: genmodModel1.model.depvars = 'MSRP'

   ...: genmodModel1.model.effects = ['MPG_City']

   ...: genmodModel1.model.dist = 'gamma'

   ...: genmodModel1.model.link = 'log'

   ...: genmodModel1()

 

NOTE: Convergence criterion (GCONV=1E-8) satisfied.

Out[1]:

[ModelInfo]

 

 Model Information

 

          RowId             Description                        Value

 0         DATA             Data Source                         CARS

 1  RESPONSEVAR       Response Variable                         MSRP

 2         DIST            Distribution                        Gamma

 3         LINK           Link Function                          Log

 4         TECH  Optimization Technique  Newton-Raphson with Ridging

 

[NObs]

 

 Number of Observations

 

    RowId                  Description  Value

 0  NREAD  Number of Observations Read  428.0

 1  NUSED  Number of Observations Used  428.0

 

[ConvergenceStatus]

 

 Convergence Status

 

                                     Reason  Status   MaxGradient

 0  Convergence criterion (GCONV=1E-8) s...       0  1.068483e-09

 

[Dimensions]

 

 Dimensions

 

          RowId                 Description  Value

 0  NDESIGNCOLS           Columns in Design      2

 1     NEFFECTS           Number of Effects      2

 2    MAXEFCOLS          Max Effect Columns      1

 3   DESIGNRANK              Rank of Design      2

 4      OPTPARM  Parameters in Optimization      3

 

[FitStatistics]

 

 Fit Statistics

 

   RowId               Description        Value

 0  M2LL         -2 Log Likelihood  9270.853164

 1   AIC   AIC (smaller is better)  9276.853164

 2  AICC  AICC (smaller is better)  9276.909768

 3   SBC   SBC (smaller is better)  9289.030533

 

[ParameterEstimates]

 

 Parameter Estimates

 

        Effect   Parameter    ParmName  DF   Estimate    StdErr  

 0   Intercept   Intercept   Intercept   1  11.307790  0.059611   

 1    MPG_City    MPG_City    MPG_City   1  -0.047400  0.002801   

 2  Dispersion  Dispersion  Dispersion   1   5.886574  0.391526   

 

           ChiSq     ProbChiSq  

 0  35983.929066  0.000000e+00  

 1    286.445370  2.958655e-64  

 2           NaN           NaN  

 

[Timing]

 

 Task Timing

 

             RowId                  Task      Time   RelTime

 0           SETUP     Setup and Parsing  0.008891  0.257967

 1    LEVELIZATION          Levelization  0.005668  0.164454

 2  INITIALIZATION  Model Initialization  0.000360  0.010446

 3            SSCP      SSCP Computation  0.001076  0.031220

 4         FITTING         Model Fitting  0.014661  0.425389

 5         CLEANUP               Cleanup  0.002235  0.064853

 6           TOTAL                 Total  0.034465  1.000000

In the preceding example, we fit a generalized linear model with a gamma distribution and the log link function because the MSRP values of cars are nonnegative continuous values. The type of distribution is usually determined by the range and the shape of the data. For example, the exponential, gamma, and inverse Gaussian distributions are popular choices for fitting nonnegative continuous values such as MSRP values of cars, sales revenue, insurance claim amounts, and so on. The binomial and multinomial distributions are valid distribution assumptions for fitting an integer response. In the next example, we fit a multinomial regression model to predict the number of cylinders of a vehicle based on MPG_City.

In [2]: genmodModel1.model.depvars = 'Cylinders'

   ...: genmodModel1.model.dist = 'multinomial'

   ...: genmodModel1.model.link = 'logit'

   ...: genmodModel1.model.effects = ['MPG_City']

   ...: genmodModel1.display.names = ['ModelInfo', 'ParameterEstimates']

   ...: genmodModel1()

   ...:

NOTE: Convergence criterion (GCONV=1E-8) satisfied.

Out[2]:

 

[ModelInfo]

 

 Model Information

 

          RowId                Description  

 0         DATA                Data Source   

 1  RESPONSEVAR          Response Variable   

 2      NLEVELS  Number of Response Levels   

 3         DIST               Distribution   

 4     LINKTYPE                  Link Type   

 5         LINK              Link Function   

 6         TECH     Optimization Technique   

 

                          Value  

 0                         CARS  

 1                    Cylinders  

 2                            7  

 3                  Multinomial  

 4                   Cumulative  

 5                        Logit  

 6  Newton-Raphson with Ridging  

 

[ParameterEstimates]

 

 Parameter Estimates

 

       Effect  Parameter      ParmName Outcome  Cylinders  DF  

 0  Intercept  Intercept   Intercept_3       3        3.0   1   

 1  Intercept  Intercept   Intercept_4       4        4.0   1   

 2  Intercept  Intercept   Intercept_5       5        5.0   1   

 3  Intercept  Intercept   Intercept_6       6        6.0   1   

 4  Intercept  Intercept   Intercept_8       8        8.0   1   

 5  Intercept  Intercept  Intercept_10      10       10.0   1   

 6   MPG_City   MPG_City      MPG_City                NaN   1   

 

     Estimate    StdErr       ChiSq     ProbChiSq  

 0 -60.329075  4.829533  156.042532  8.286542e-36  

 1 -21.461149  1.584887  183.361936  8.941488e-42  

 2 -21.233691  1.575766  181.579751  2.190306e-41  

 3 -16.632445  1.337275  154.693103  1.634032e-35  

 4 -10.988487  1.139470   92.997190  5.236863e-22  

 5 -10.314220  1.186541   75.562638  3.539969e-18  

 6   1.013934  0.077371  171.734698  3.092446e-39  

In the preceding example, the cumulative logit link function is used for the multinomial model. For a generalized linear model using the multinomial distribution, we estimate the probability that an observation (a car) falls into each of the seven possible numbers of cylinders (Cylinders = 3, 4, 5, 6, 8, 10, 12). In this case, the cumulative logit link assumes a logit link function between the cumulative probabilities and the linear combination of the predictors:

Pr(Cylinders = 3) = f(−60.329075 + 1.013934 × MPG_City)

Pr(Cylinders = 3) + Pr(Cylinders = 4) = f(−21.461149 + 1.013934 × MPG_City)

…

Pr(Cylinders = 3) + … + Pr(Cylinders = 10) = f(−10.314220 + 1.013934 × MPG_City)

where f(u) = exp(u) / (1 + exp(u)) is the standard inverse logit link function. You can use the parameter estimates to score new observations directly. For example, when a car has MPG_City = 20, the chance that it is a 4-cylinder car is about 23.5%:

Pr(Cylinders = 4 | MPG_City = 20) = f(−21.461149 + 1.013934 × 20) − f(−60.329075 + 1.013934 × 20) ≈ 0.235
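
You can verify this arithmetic with a few lines of Python (the intercepts are copied from the ParameterEstimates table above):

import numpy as np

def inv_logit(u):
    # standard inverse logit link: f(u) = exp(u) / (1 + exp(u))
    return np.exp(u) / (1 + np.exp(u))

mpg_city = 20
p_le_4 = inv_logit(-21.461149 + 1.013934 * mpg_city)  # Pr(Cylinders <= 4)
p_le_3 = inv_logit(-60.329075 + 1.013934 * mpg_city)  # Pr(Cylinders <= 3)
print(p_le_4 - p_le_3)  # about 0.235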

Similar to the glm action in the same regression action set, the genmod action can generate predictions from the generalized linear model. Instead of using the parameter estimates from the preceding formulas to compute the predictions manually, you can use genmod to score a data set using the output options.

In [3]: result = conn.CASTable('CylinderPredicted', replace=True)

    ...: genmodModel1.output.casout = result

    ...: genmodModel1.output.copyVars = 'ALL';

    ...: genmodModel1.output.pred = 'Prob_Cylinders'

    ...: genmodModel1()

    ...: result[['Prob_Cylinders','_level_','Cylinders','MPG_City']].head(24)

    ...:

NOTE: Convergence criterion (GCONV=1E-8) satisfied.

Out[3]:

 

 Selected Rows from Table CYLINDERPREDICTED

 

     Prob_Cylinders  _LEVEL_  Cylinders  MPG_City

 0     1.928842e-19      3.0        6.0      17.0

 1     1.442488e-02      4.0        6.0      17.0

 2     1.804258e-02      5.0        6.0      17.0

 3     6.466697e-01      6.0        6.0      17.0

 4     9.980702e-01      8.0        6.0      17.0

 5     9.990158e-01     10.0        6.0      17.0

 6     5.316706e-19      3.0        6.0      18.0

 7     3.877857e-02      4.0        6.0      18.0

 8     4.820533e-02      5.0        6.0      18.0

 9     8.345697e-01      6.0        6.0      18.0

 10    9.992990e-01      8.0        6.0      18.0

 11    9.996427e-01     10.0        6.0      18.0

 12    8.460038e-17      3.0        4.0      23.0

 13    8.652192e-01      4.0        4.0      23.0

 14    8.896126e-01      5.0        4.0      23.0

 15    9.987558e-01      6.0        4.0      23.0

 16    9.999956e-01      8.0        4.0      23.0

 17    9.999978e-01     10.0        4.0      23.0

 18    4.039564e-18      3.0        6.0      20.0

 19    2.346085e-01      4.0        6.0      20.0

 20    2.778780e-01      5.0        6.0      20.0

 21    9.745742e-01      6.0        6.0      20.0

 22    9.999077e-01      8.0        6.0      20.0

 23    9.999530e-01     10.0        6.0      20.0

You can read off the predictions for a car with MPG_City = 20 from the last six rows of the preceding output table:

Pr(Cylinders = 3 | MPG_City = 20) = 4.039564e−18 ≈ 0

Pr(Cylinders = 4 | MPG_City = 20) = 0.2346085 − 0 = 0.2346085

Pr(Cylinders = 5 | MPG_City = 20) = 0.2778780 − 0.2346085 = 0.0432695

Pr(Cylinders = 6 | MPG_City = 20) = 0.9745742 − 0.2778780 = 0.6966962

Pr(Cylinders > 6 | MPG_City = 20) = 1 − 0.9745742 < 0.03

Using the multinomial model, a car with MPG_City = 20 is most likely a 6-cylinder (69.7%) or 4-cylinder (23.5%) car, and the chance that it has more than 6 cylinders is less than 3%.
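
If you prefer not to difference the cumulative probabilities by hand, a small client-side sketch can do it for you (this assumes each scored car occupies six consecutive rows, as in the output above):

df = result[['Prob_Cylinders', '_LEVEL_', 'Cylinders', 'MPG_City']].head(24)

# difference the cumulative column within each block of six rows (one car each);
# the first level keeps its cumulative value, which equals its own probability
obs = df.index // 6
df['Prob_Level'] = (df.groupby(obs)['Prob_Cylinders']
                      .diff()
                      .fillna(df['Prob_Cylinders']))
print(df)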

The following table lists the output result tables from the genmod action. You can use the display option to include or exclude a result table when fitting a generalized linear model using the genmod action.

Table Name Description
ModelInfo Basic information about the model such as the data source, the response variable, the distribution and link functions, and the optimization technique.
NObs The number of observations that are read and used. Missing values are excluded, by default.
ConvergenceStatus The convergence status of the parameter estimation.
Dimensions The dimension of the model, including the number of effects and the number of parameters.
FitStatistics The fit statistics of the model such as log likelihood (multiplied by -2), AIC, AICC, and SBC.
ParameterEstimates The estimation of the model parameters.
Timing A timing of the subtasks of the genmod action call.

Regression Trees

Similar to decision tree models, regression trees are machine learning algorithms that use recursive partitioning to segment the input data set and to make predictions within each segment of the data. In this section, we briefly introduce how to use the tree models that are available in the decisiontree action set to build predictive models for a continuous response. For information about decision trees and tree family models, see Chapter 9.

A regression tree usually follows these steps:

1.   Grows a tree as deep as possible based on a splitting criterion and the training data.

2.   Prunes back some nodes on the tree based on an error function on the validation data.

3.   For each leaf (terminal node), builds a simple model to predict the continuous dependent variable. The most common approach is to use the local sample average of the dependent variable.

Let’s first load the decisiontree action set:

In [4]: conn.loadactionset('decisiontree')

   ...: conn.help(actionset='decisiontree')

NOTE: Added action set 'decisiontree'.

Out[4]:

decisionTree

                          decisionTree

        Name                       Description

 0  dtreeTrain   Train Decision Tree

 1  dtreeScore   Score A Table Using Decision Tree

 2  dtreeSplit   Split Tree Nodes

 3  dtreePrune   Prune Decision Tree

 4  dtreeMerge   Merge Tree Nodes

 5  dtreeCode    Generate score code for Decision Tree

 6  forestTrain  Train Forest

 7  forestScore  Score A Table Using Forest

 8  forestCode   Generate score code for Forest

 9  gbtreeTrain  Train Gradient Boosting Tree

10  gbtreeScore  Score A Table Using Gradient Boosting Tree

11  gbtreecode   Generate score code for Gradient Boosting Trees

The dtreetrain action fits a decision tree model for a categorical dependent variable or a regression tree model for a continuous dependent variable. Let’s build a regression tree model that uses MPG_City to predict the MSRP values of cars. In this example, we create the simplest regression tree, which splits the root node into only two leaves (splitting the entire data set into two partitions).

In [5]: cars = conn.CASTable('cars')

   ...:

   ...: output1 = conn.CASTable('treeModel1')

   ...: output1.replace = True;

   ...:

   ...: tree1 = cars.Dtreetrain()

   ...: tree1.target = 'MSRP'

   ...: tree1.inputs = ['MPG_City']

   ...: tree1.casout = output1

   ...: tree1.maxlevel = 2

   ...: tree1()

   ...:

   ...: output1[['_NodeID_', '_Parent_','_Mean_','_NodeName_','_PBLower0_',

                 '_PBUpper0_']].head()

   ...:

Out[5]:

[Fetch]

 Selected Rows from Table TREEMODEL1

 

    _NodeID_  _Parent_        _Mean_ _NodeName_  _PBLower0_  

 0       0.0      -1.0  32774.855140   MPG_City         NaN   

 1       1.0       0.0  22875.341584       MSRP        20.0   

 2       2.0       0.0  41623.092920       MSRP        10.0   

 

    _PBUpper0_  

 0         NaN  

 1        60.0  

 2        20.0  

In the preceding example, the decision tree model is saved to the CAS table treeModel1. This table is stored on the CAS server, so you must use the fetch or head method to download it to the Python client. The tree model table contains three observations, one for each node in the tree, including the root node. We fetch only some of the information from the tree model table, such as the unique IDs of the nodes and their parents (_NodeID_ and _Parent_), the local sample means of the dependent variable (_Mean_), the splitting variable (_NodeName_), and the splitting points (_PBLower0_ and _PBUpper0_). The tree model table treeModel1 also contains other useful information, such as the size of each node, the splitting criterion, and so on.

The preceding table shows how the root node is split. The value of the _NodeName_ column in the first row shows that the root node is split by MPG_City, and the values of the _PBLower0_ and _PBUpper0_ columns show that Node 1 contains the observations with MPG_City in (20, 60], and Node 2 contains the observations with MPG_City in [10, 20]. Note that in CAS tree models, a splitting point (MPG_City = 20 in this case) is assigned to the child node with the smaller values of the splitting variable.
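
Written out as a scoring rule, this fitted tree is just the following lookup (a sketch; the node means and the split point are taken from the model table above):

def score_msrp(mpg_city):
    # the split point MPG_City = 20 goes to the child with smaller MPG_City values
    if mpg_city <= 20:
        return 41623.092920   # Node 2: MPG_City in [10, 20]
    else:
        return 22875.341584   # Node 1: MPG_City in (20, 60]

print(score_msrp(18), score_msrp(25))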

Based on this information, a tree structure follows for the first decision tree example:

[Figure: tree diagram for the first regression tree example]

The splitting criteria of the CAS decision tree model are listed in the following table:

Parameter Description
crit = 'ftest' Uses the p-values of an F test to select the best split.
crit = 'variance' Uses the split that produces the largest reduction in the sum of squared errors. For regression trees, the sum of squared errors is equivalent to the variance of the data.
crit = 'chaid' Uses adjusted significance testing (Bonferroni testing) to select the best split.

Pruning a tree model is a necessary step to avoid overfitting when you use the regression tree model to score new data (validation data). In the preceding example, we grow only the simplest tree structure, so there is no need to prune it back. If you have created a deeper tree and you have validation data, you can use the dtreeprune action to prune a given tree model.

data2 = conn.CASTable('your_validation_data')

output2 = conn.CASTable('pruned_tree')

data2.dtreeprune(modelTable=output1, casout=output2)

The last step for completing a regression tree model is to use the pruned tree model to score your data. This is done by the dtreescore action.

data2.dtreescore(modelTable=output2)

Conclusion

In this chapter, we first introduced the linear regression model and discussed some best practices for improving the model fit of a linear regression. Then we introduced the generalized linear models that are available in the regression action set and the actions in the decisiontree action set for building a regression tree model. For more information about these topics, see the following references.

[1] Neter, John, et al. 1996. Applied Linear Statistical Models. Vol. 4. Chicago: Irwin.

[2] McCullagh, Peter, and John A. Nelder. 1989. Generalized Linear Models. Vol. 37. CRC Press.
