Extensions of Ordinary Linear Regression
In this chapter, we explore several commonly used predictive models for a continuous dependent variable, including linear regression, generalized linear models, and regression trees. The data set that is used in this chapter is the Cars data set available in the sas-viya-programming repository of the sassoftware account on GitHub. Upload the data set directly from GitHub using the upload method on the CAS connection, as follows:
In [1]: cars = conn.upload('https://raw.githubusercontent.com/sassoftware/'
'sas-viya-programming/master/data/cars.csv').casTable
In [1]: cars.tableinfo()
Out[1]:
[TableInfo]
Name Rows Columns Encoding CreateTimeFormatted
0 CARS 428 15 utf-8 09Nov2016:10:32:02
ModTimeFormatted JavaCharSet CreateTime ModTime
0 09Nov2016:10:32:02 UTF8 1.794307e+09 1.794307e+09
Global Repeated View SourceName SourceCaslib Compressed
0 1 0 0 0
Creator Modifier
0 username
In [2]: cars.columninfo()
Out[2]:
[ColumnInfo]
Column Label ID Type RawLength
0 Make 1 char 13
1 Model 2 char 40
2 Type 3 char 8
3 Origin 4 char 6
4 DriveTrain 5 char 5
5 MSRP 6 double 8
6 Invoice 7 double 8
7 EngineSize Engine Size (L) 8 double 8
8 Cylinders 9 double 8
9 Horsepower 10 double 8
10 MPG_City MPG (City) 11 double 8
11 MPG_Highway MPG (Highway) 12 double 8
12 Weight Weight (LBS) 13 double 8
13 Wheelbase Wheelbase (IN) 14 double 8
14 Length Length (IN) 15 double 8
FormattedLength Format NFL NFD
0 13 0 0
1 40 0 0
2 8 0 0
3 6 0 0
4 5 0 0
5 8 DOLLAR 8 0
6 8 DOLLAR 8 0
7 12 0 0
8 12 0 0
9 12 0 0
10 12 0 0
11 12 0 0
12 12 0 0
13 12 0 0
14 12 0 0
Linear regression is one of the most widely used statistical models for predictive modeling. The basic idea of a predictive model is to establish a function y=f(x1,x2,…,xK) to predict the value of the dependent variable y that is based on the values of the predictors X1, X2, …, XK. Linear regression assumes that the function f is a linear combination of the predictors and an error term ε.
y=a+b1x1+b2x2+…+bKxK+ε
Usually, we assume that ε follows a normal distribution with mean zero and variance σ². The parameters to be estimated in a linear model include the intercept a, the slopes b1, b2, …, bK, and the variance of the error term σ². These parameters are estimated using the least squares method.
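For simple regression with a single predictor, the least squares estimates have a closed form that is easy to verify in plain Python. The following is a minimal sketch (the function name and toy data are illustrative, not from the chapter):

```python
# Closed-form least squares for y = a + b*x (simple linear regression).
def fit_simple_ols(x, y):
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    # Slope: covariance(x, y) divided by variance(x);
    # the intercept then follows from the sample means.
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return a, b

# A perfectly linear toy data set: y = 3 + 2*x.
a, b = fit_simple_ols([1, 2, 3, 4], [5, 7, 9, 11])
```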
The regression action set provides a glm action, which fits linear regression models using the least squares method. You must load the action set before you use the glm action.
In [3]: conn.loadactionset('regression')
...: conn.help(actionset='regression')
NOTE: Added action set 'regression'.
Out[3]:
Regression
regression
Name Description
0 glm Fits linear regression models using the method of least squares
1 genmod Fits generalized linear regression models
2 logistic Fits logistic regression models
Let’s build a simple regression model using the Cars data to predict MSRP using the city miles per gallon (MPG) of the cars:
In [4]: cars.glm(
...: target = 'MSRP',
...: inputs = ['MPG_City']
...: )
Out[4]:
[ModelInfo]
Model Information
RowId Description Value
0 DATA Data Source CARS
1 RESPONSEVAR Response Variable MSRP
[NObs]
Number of Observations
RowId Description Value
0 NREAD Number of Observations Read 428.0
1 NUSED Number of Observations Used 428.0
[Dimensions]
Dimensions
RowId Description Value
0 NEFFECTS Number of Effects 2
1 NPARMS Number of Parameters 2
[ANOVA]
Analysis of Variance
RowId Source DF SS MS
0 MODEL Model 1.0 3.638090e+10 3.638090e+10
1 ERROR Error 426.0 1.248507e+11 2.930768e+08
2 TOTAL Corrected Total 427.0 1.612316e+11 NaN
FValue ProbF
0 124.13436 1.783404e-25
1 NaN NaN
2 NaN NaN
[FitStatistics]
Fit Statistics
RowId Description Value
0 RMSE Root MSE 1.711949e+04
1 RSQUARE R-Square 2.256437e-01
2 ADJRSQ Adj R-Sq 2.238260e-01
3 AIC AIC 8.776260e+03
4 AICC AICC 8.776316e+03
5 SBC SBC 8.354378e+03
6 TRAIN_ASE ASE 2.917073e+08
[ParameterEstimates]
Parameter Estimates
Effect Parameter DF Estimate StdErr tValue
0 Intercept Intercept 1 68124.606698 3278.919093 20.776544
1 MPG_City MPG_City 1 -1762.135298 158.158758 -11.141560
Probt
0 1.006169e-66
1 1.783404e-25
[Timing]
Task Timing
RowId Task Time RelTime
0 SETUP Setup and Parsing 0.027544 0.626975
1 LEVELIZATION Levelization 0.007532 0.171444
2 INITIALIZATION Model Initialization 0.000371 0.008444
3 SSCP SSCP Computation 0.003291 0.074909
4 FITTING Model Fitting 0.000367 0.008352
5 CLEANUP Cleanup 0.002385 0.054286
6 TOTAL Total 0.043932 1.000000
The ParameterEstimates table contains the parameter estimates for the linear regression model. In the preceding example, the model returned by the glm action is as follows:
MSRP = 68124.606698 − 1762.135298 × MPG_City
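Using the estimates in the ParameterEstimates table, you can score a new observation by hand. The following quick check (not a substitute for the action's own scoring) predicts MSRP for a car with MPG_City = 20:

```python
# Predicted MSRP from the fitted simple regression above.
intercept = 68124.606698
slope = -1762.135298   # coefficient of MPG_City

def predict_msrp(mpg_city):
    return intercept + slope * mpg_city

pred = predict_msrp(20)   # about 32,882
```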
The result tables also contain useful information about the model definition and model fitting. The following table summarizes the result tables.
Table Name | Description |
NObs | The number of observations that are read and used. Missing values are excluded, by default. |
Dimensions | The dimensions of the model, including the number of effects and the number of parameters. |
ANOVA | The Analysis of Variance table that measures the overall model fitting. |
FitStatistics | The fit statistics of the model such as R-Square and root mean square error. |
ParameterEstimates | The estimation of the regression parameters. |
Timing | A timing of the subtasks of the glm action call. |
Compared to the data exploration actions introduced in Chapter 7, the glm action requires a more complex and deeper parameter structure. In this case, it might be more convenient to define a new model first and then specify the model parameters, step-by-step. In other words, the first linear regression shown in the preceding example can be rewritten as follows:
linear1 = cars.Glm()
linear1.target = 'MSRP'
linear1.inputs = ['MPG_City']
linear1()
This approach enables you to reuse the code when you need to change only a few options of the glm action. For example, to display only the parameter estimates table, you specify the table name in the display.names option and rerun the linear1 model:
In [4]: linear1.display.names = ['ParameterEstimates']
...: linear1()
Out[4]:
ParameterEstimates
Parameter Estimates
Effect Parameter DF Estimate StdErr tValue Probt
0 Intercept Intercept 1 68124.606698 3278.9190925 20.776543969 1.006169E-66
1 MPG_City MPG_City 1 -1762.135298 158.15875785 -11.14156005 1.783404E-25
So far, we have used the glm action only to estimate the parameters of the linear regression model. We haven’t used the model to predict MSRP values of the cars. For prediction, you must specify an output table using the output option. You can also delete the display.names option in order to request all result tables:
In [5]: del linear1.display.names
...: result1 = conn.CASTable('cas.MSRPPrediction')
...: result1.replace = True
...: linear1.output.casout = result1
...: linear1.output.copyvars = 'all';
...: linear1()
Out[5]:
…output clipped…
[OutputCasTables]
casLib Name Label Rows Columns
0 CASUSERHDFS(username) MSRPPrediction 428 16
casTable
0 CASTable('MSRPPrediction', caslib='C...
In the preceding example, a new CAS table MSRPPrediction is defined and then used in the output option. When you submit the code, the glm action first fits a linear regression model, and then it uses the fitted model to score the input data. Also, it creates the new CAS table MSRPPrediction that contains the predicted MSRP values. The copyvars='all' option requests that the glm action copy all columns from the CARS table to the MSRPPrediction table.
In the preceding example, the output column name for predicted MSRP is not specified. The glm action automatically chooses Pred as the name.
You can summarize the predicted values using the summary action from the simple action set.
In [6]: result1[['pred']].summary()
Out[6]:
[Summary]
Descriptive Statistics for MSRPPREDICTION
Column Min Max N NMiss Mean
0 Pred -37603.511169 50503.25372 428.0 0.0 32774.85514
Sum Std StdErr Var USS
0 14027638.0 9230.448198 446.170554 8.520117e+07 4.961347e+11
CSS CV TValue ProbT
0 3.638090e+10 28.163201 73.458131 2.182203e-244
The glm action can generate additional columns besides the predicted values. The following table summarizes the statistical outputs that the glm action can generate.
Option | Description |
pred | The predicted value. If you do not specify any output statistics, the predicted value is named Pred, by default. |
resid | The residual, which is calculated as ACTUAL minus PREDICTED. |
cooksd | The Cook's D influence statistic. |
covratio | The standard influence of the observation on covariance of betas. The COVRATIO statistic measures the change in the determinant of the covariance matrix of the estimates by deleting the current observation. |
dffits | The scaled measure of the change in the predicted value for the ith observation and is calculated by deleting the ith observation. A large value indicates that the observation is very influential in its neighborhood of the X space. |
h | The leverage of the observation. |
lcl | The lower bound of a confidence interval for an individual prediction. |
ucl | The upper bound of a confidence interval for an individual prediction. |
lclm | The lower bound of a confidence interval for the expected value of the dependent variable. |
uclm | The upper bound of a confidence interval for the expected value of the dependent variable. |
likedist | The likelihood displacement. |
press | The ith residual divided by 1 - h, where h is the leverage, and where the model has been refit without the ith observation. |
rstudent | The studentized residual with the current observation deleted. |
stdi | The standard error of the individual predicted value. |
stdp | The standard error of the mean predicted value. |
stdr | The standard error of the residual. |
student | The studentized residuals, which are the residuals divided by their standard errors. |
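Several of these statistics are simple functions of the residual and the leverage h. For example, the PRESS residual divides the residual by 1 - h. The following numeric illustration uses made-up values for one observation:

```python
# PRESS residual: the residual divided by (1 - leverage).
resid = 1200.0   # hypothetical residual for one observation
h = 0.04         # hypothetical leverage of the same observation

press = resid / (1.0 - h)   # about 1250.0
```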
The following example adds the residual values (observed MSRP values minus predicted MSRP values) and the confidence intervals of the prediction to the output table:
In [7]: result2 = conn.CASTable('cas.MSRPPrediction2', replace=True)
...: linear1.output.casout = result2
...: linear1.output.pred = 'Predicted_MSRP'
...: linear1.output.resid = 'Residual_MSRP'
...: linear1.output.lcl = 'LCL_MSRP'
...: linear1.output.ucl = 'UCL_MSRP'
...: linear1()
The output table MSRPPrediction2 is a CAS table that is saved on the CAS server. You have several ways to fetch or download a table from the CAS server. Since the CARS data is relatively small, you can pull all observations from MSRPPrediction2 directly to the Python client using the to_frame method. Then you can use a visualization package in Python such as Bokeh to observe and understand the model outcome.
In [8]: from bokeh.charts import Scatter, output_file, show
...: out1 = result2.to_frame()
...: p = Scatter(out1, x='Residual_MSRP', y='Predicted_MSRP',
color='Origin', marker='Origin')
...: output_file('scatter.html')
...: show(p)
The following figure shows a scatter plot of the predicted MSRP values and residuals.
You can see that the model predicted large negative MSRP values for three observations. Let’s print out these observations to find out what happened:
In [9]: result2[['Predicted_MSRP', 'MSRP', 'MPG_City', 'Make',
'Model']].query('Predicted_MSRP < 0').to_frame()
Out[9]:
Selected Rows from Table MSRPPREDICTION2
Predicted_MSRP MSRP MPG_City Make
0 -12933.617000 20140.0 46.0 Honda
1 -35841.375871 20510.0 59.0 Toyota
2 -37603.511169 19110.0 60.0 Honda
Model
0 Civic Hybrid 4dr manual (gas/electric)
1 Prius 4dr (gas/electric)
2 Insight 2dr (gas/electric)
All of these cars are fuel efficient with relatively high MPG_City. If you generate a scatter plot of the dependent variable MSRP and the predictor variable MPG_City, you can see that the data has some extreme outliers with high MSRP values or high city MPG that might not fit into the linear relationship assumption between these two variables.
In [10]: p = Scatter(out1, x='MPG_City', y='MSRP',
color='Origin', marker='Origin')
...: output_file('scatter.html')
...: show(p)
The following figure shows a scatter plot of MPG_City and MSRP.
Outliers are observed when MSRP values are higher than $100,000 or MPG City is higher than 40.
Linear regression models are sensitive to outliers. There are several treatments for outliers to improve the prediction accuracy of a linear regression model. The simplest approach is to remove the outliers. In the next example, let’s add a filter to remove cars with MPG_City that is greater than 40 or MSRP that is greater than 100,000. The R-Square of the model is actually improved from 0.22 to 0.34.
In [11]: cars.where = 'MSRP < 100000 and MPG_City < 40'
...:
...: result2 = conn.CASTable('cas.MSRPPrediction2')
...: result2.replace = True
...:
...: linear2 = cars.Glm()
...: linear2.target = 'MSRP'
...: linear2.inputs = ['MPG_City']
...: linear2.output.casout = result2
...: linear2.output.copyVars = 'ALL';
...: linear2.output.pred = 'Predicted_MSRP'
...: linear2.output.resid = 'Residual_MSRP'
...: linear2.output.lcl = 'LCL_MSRP'
...: linear2.output.ucl = 'UCL_MSRP'
...: linear2()
You can also use the DataFrame API of the CASTable cars to apply a filter. The preceding model can also be defined using the query method of the CASTable:
linear2 = cars.query('MSRP < 100000 and MPG_City < 40').Glm()
You can see that we have a better residual plot after the outliers are removed from the model.
In [12]: out2 = result2.to_frame()
...: p = Scatter(out2, x='Predicted_MSRP', y='Residual_MSRP',
color='Origin', marker='Origin')
...: output_file('scatter.html')
...: show(p)
The following figure shows a scatter plot of predicted MSRP values and residuals, after excluding outliers.
Let’s continue to improve the linear regression model for predicting MSRP by adding more predictors to the model. In this example, we add three categorical predictors Origin, Type, and DriveTrain, and two more continuous predictors Weight and Length. The categorical predictors must be specified in both the inputs and the nominals parameters. The R-Square statistic is improved again (0.63).
In [13]: nomList = ['Origin','Type','DriveTrain']
...: contList = ['MPG_City','Weight','Length']
...:
...: linear3 = conn.CASTable('cars').Glm()
...: linear3.target = 'MSRP'
...: linear3.inputs = nomList + contList
...: linear3.nominals = nomList
...: linear3.display.names = ['FitStatistics','ParameterEstimates']
...: linear3()
Out [13]:
[FitStatistics]
Fit Statistics
RowId Description Value
0 RMSE Root MSE 1.201514e+04
1 RSQUARE R-Square 6.284172e-01
2 ADJRSQ Adj R-Sq 6.176727e-01
3 AIC AIC 8.483996e+03
4 AICC AICC 8.485013e+03
5 SBC SBC 8.106765e+03
6 TRAIN_ASE ASE 1.399787e+08
[ParameterEstimates]
Parameter Estimates
Effect Origin Type DriveTrain Parameter DF
0 Intercept Intercept 1
1 Origin Asia Origin Asia 1
2 Origin Europe Origin Europe 1
3 Origin USA Origin USA 0
4 Type Hybrid Type Hybrid 1
5 Type SUV Type SUV 1
6 Type Sedan Type Sedan 1
7 Type Sports Type Sports 1
8 Type Truck Type Truck 1
9 Type Wagon Type Wagon 0
10 DriveTrain All DriveTrain All 1
11 DriveTrain Front DriveTrain Front 1
12 DriveTrain Rear DriveTrain Rear 0
13 MPG_City MPG_City 1
14 Weight Weight 1
15 Length Length 1
Estimate StdErr tValue Probt
0 -23692.980669 16261.000069 -1.457043 1.458607e-01
1 2191.206289 1479.756700 1.480788 1.394218e-01
2 17100.937866 1779.533025 9.609790 7.112423e-20
3 0.000000 NaN NaN NaN
4 26154.719438 10602.173003 2.466921 1.403098e-02
5 -1016.065543 3083.503255 -0.329517 7.419315e-01
6 2481.367175 2359.814614 1.051509 2.936366e-01
7 21015.571095 3065.180416 6.856226 2.572647e-11
8 -12891.562541 3592.436933 -3.588529 3.722560e-04
9 0.000000 NaN NaN NaN
10 -7669.535987 1987.120548 -3.859623 1.316520e-04
11 -7699.083608 1722.883863 -4.468719 1.016332e-05
12 0.000000 NaN NaN NaN
13 -496.308946 266.606023 -1.861582 6.336879e-02
14 9.171343 1.914672 4.790032 2.325015e-06
15 162.893307 82.088679 1.984358 4.787420e-02
A linear regression model with categorical effects still assumes homogeneity of variance (that is, the random errors for all observations have the same variance). In the preceding example, this means that the variation of MSRP values is approximately the same across different origins, types, or drive trains. Sometimes data can be heterogeneous. For example, cars from different origins might have different MSRP values as well as different variation in MSRP values.
In [14]: cars = conn.CASTable('cars')
...: out = cars.groupby('Origin')[['MSRP']].summary()
...: out.concat_bygroups()['Summary'][['Column','Mean','Var','Std']]
...:
Out[14]:
Descriptive Statistics for CARS
Column Mean Var Std
Origin
Asia MSRP 24741.322785 1.281666e+08 11321.069675
Europe MSRP 48349.796748 6.410315e+08 25318.600464
USA MSRP 28377.442177 1.371705e+08 11711.982506
The output from the summary action shows that cars from Europe not only have higher MSRP values but also a greater variance. The sample standard deviation of MSRP values for the European cars is more than double that of the cars from Asia and the USA. One easy remedy for variance heterogeneity is to fit multiple models, one for each segment of the data. For the glm action, you can fit multiple models with the groupby option:
In [15]: cars = conn.CASTable('cars')
...: cars.groupby = ['Origin']
...: cars.where = 'MSRP < 100000 and MPG_City < 40'
...: nomList = ['Type','DriveTrain']
...: contList = ['MPG_City','Weight','Length']
...: groupBYResult = conn.CASTable('MSRPPredictionGroupBy')
...:
...: linear4 = cars.Glm()
...: linear4.target = 'MSRP'
...: linear4.inputs = nomList + contList
...: linear4.nominals = nomList
...: linear4.display.names = ['FitStatistics','ParameterEstimates']
...: linear4.output.casout = groupBYResult
...: linear4.output.copyVars = 'ALL';
...: linear4.output.pred = 'Predicted_MSRP'
...: linear4.output.resid = 'Residual_MSRP'
...: linear4.output.lcl = 'LCL_MSRP'
...: linear4.output.ucl = 'UCL_MSRP'
...: linear4()
...:
...: out = groupBYResult.to_frame()
...: p = Scatter(out, x='Predicted_MSRP', y='Residual_MSRP',
color='Origin', marker='Origin')
...: output_file('scatter.html')
...: show(p)
The following figure shows a scatter plot of predicted MSRP values and residuals with three linear regression models fit for each origin (Asia, Europe, and USA):
The key assumptions of an ordinary linear regression are 1) the expected value of the dependent variable can be predicted using a linear combination of a set of predictors, and 2) the error term ε follows a normal distribution. The second assumption might not be valid for some applications such as estimating the number of calls received within a time interval in a call center. Generalized linear models (GLM) and regression trees are popular generalizations of ordinary linear regression for fitting data that does not follow a normal distribution.
In generalized linear models, the dependent variable Y is assumed to follow a particular probability distribution, and the expected value of Y is predicted by a function of the linear combination of the predictors,
E(y)=f(x1,x2,…,xK)=f(a+b1x1+b2x2+…+bKxK)
where E(y) is the expected value of Y, x1, x2, …, xK are the observed values of the predictors X1, X2, …, XK, and b1, b2, …, bK are the unknown parameters. It is more common to express a GLM model by linking the expected value of Y to the linear combination of predictors,
g[E(y)]=a+b1x1+b2x2+…+bKxK
The link function g() is the inverse function of f(). The probability distribution of Y is usually from the exponential family of distributions, such as normal, binomial, exponential, gamma, Poisson, and zero-inflated distributions. The choice of the link function usually depends on the assumption of the probability distribution. For example, for call center data, it is common to assume that the number of calls within a time interval follows a Poisson distribution
P(k calls in interval) = λ^k e^(−λ) / k!,  k = 0, 1, 2, …
and a log link function that relates the expected number of calls λ to the linear combination of the predictors:
log(λ)=a+b1x1+b2x2+…+bKxK
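As a quick numeric check of the Poisson formula, the probability mass function can be computed directly. The rate λ = 3 calls per interval below is a hypothetical value, not an estimate from the chapter:

```python
import math

def poisson_pmf(k, lam):
    # P(k calls in interval) = lam**k * exp(-lam) / k!
    return lam ** k * math.exp(-lam) / math.factorial(k)

p2 = poisson_pmf(2, 3.0)   # probability of exactly 2 calls when lam = 3
total = sum(poisson_pmf(k, 3.0) for k in range(100))   # pmf sums to ~1
```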
It is also worthwhile to mention that ordinary linear regression is a special type of GLM, where the target variable follows a normal distribution and the link function is the identity function, so that E(y)=a+b1x1+b2x2+…+bKxK. For more details about generalized linear models, see [1] and [2].
Generalized linear models are available in the regression CAS action set. Let’s continue to use the Cars data example and build a simple generalized linear model to predict MSRP values of cars using MPG_City.
In [1]: cars = conn.CASTable('cars')
...: genmodModel1 = cars.Genmod()
...: genmodModel1.model.depvars = 'MSRP'
...: genmodModel1.model.effects = ['MPG_City']
...: genmodModel1.model.dist = 'gamma'
...: genmodModel1.model.link = 'log'
...: genmodModel1()
NOTE: Convergence criterion (GCONV=1E-8) satisfied.
Out[1]:
[ModelInfo]
Model Information
RowId Description Value
0 DATA Data Source CARS
1 RESPONSEVAR Response Variable MSRP
2 DIST Distribution Gamma
3 LINK Link Function Log
4 TECH Optimization Technique Newton-Raphson with Ridging
[NObs]
Number of Observations
RowId Description Value
0 NREAD Number of Observations Read 428.0
1 NUSED Number of Observations Used 428.0
[ConvergenceStatus]
Convergence Status
Reason Status MaxGradient
0 Convergence criterion (GCONV=1E-8) s... 0 1.068483e-09
[Dimensions]
Dimensions
RowId Description Value
0 NDESIGNCOLS Columns in Design 2
1 NEFFECTS Number of Effects 2
2 MAXEFCOLS Max Effect Columns 1
3 DESIGNRANK Rank of Design 2
4 OPTPARM Parameters in Optimization 3
[FitStatistics]
Fit Statistics
RowId Description Value
0 M2LL -2 Log Likelihood 9270.853164
1 AIC AIC (smaller is better) 9276.853164
2 AICC AICC (smaller is better) 9276.909768
3 SBC SBC (smaller is better) 9289.030533
[ParameterEstimates]
Parameter Estimates
Effect Parameter ParmName DF Estimate StdErr
0 Intercept Intercept Intercept 1 11.307790 0.059611
1 MPG_City MPG_City MPG_City 1 -0.047400 0.002801
2 Dispersion Dispersion Dispersion 1 5.886574 0.391526
ChiSq ProbChiSq
0 35983.929066 0.000000e+00
1 286.445370 2.958655e-64
2 NaN NaN
[Timing]
Task Timing
RowId Task Time RelTime
0 SETUP Setup and Parsing 0.008891 0.257967
1 LEVELIZATION Levelization 0.005668 0.164454
2 INITIALIZATION Model Initialization 0.000360 0.010446
3 SSCP SSCP Computation 0.001076 0.031220
4 FITTING Model Fitting 0.014661 0.425389
5 CLEANUP Cleanup 0.002235 0.064853
6 TOTAL Total 0.034465 1.000000
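With the log link, a prediction is obtained by exponentiating the linear predictor. The following hand check uses the intercept and MPG_City estimates from the ParameterEstimates table above (it mirrors, but does not replace, the scoring that the genmod action performs):

```python
import math

# Parameter estimates from the fitted gamma model with log link.
intercept = 11.307790
slope = -0.047400   # coefficient of MPG_City

def predict_mean_msrp(mpg_city):
    # E(MSRP) = exp(a + b * MPG_City) because the link g is log.
    return math.exp(intercept + slope * mpg_city)

pred = predict_mean_msrp(20)   # roughly 31,500
```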
In the preceding example, we fit a generalized linear model using the gamma distribution and the log link function because MSRP values of cars are nonnegative continuous values. The type of distribution is usually determined by the range and the shape of the data. For example, exponential, gamma, and inverse Gaussian distributions are popular choices for fitting nonnegative continuous values such as MSRP values of cars, sales revenue, insurance claim amounts, and so on. Binomial and multinomial distributions are valid choices for fitting an integer-valued response. In the next example, we fit a multinomial regression model to predict the number of cylinders of the vehicles based on MPG_City.
In [2]: genmodModel1.model.depvars = 'Cylinders'
...: genmodModel1.model.dist = 'multinomial'
...: genmodModel1.model.link = 'logit'
...: genmodModel1.model.effects = ['MPG_City']
...: genmodModel1.display.names = ['ModelInfo', 'ParameterEstimates']
...: genmodModel1()
...:
NOTE: Convergence criterion (GCONV=1E-8) satisfied.
Out[2]:
[ModelInfo]
Model Information
RowId Description
0 DATA Data Source
1 RESPONSEVAR Response Variable
2 NLEVELS Number of Response Levels
3 DIST Distribution
4 LINKTYPE Link Type
5 LINK Link Function
6 TECH Optimization Technique
Value
0 CARS
1 Cylinders
2 7
3 Multinomial
4 Cumulative
5 Logit
6 Newton-Raphson with Ridging
[ParameterEstimates]
Parameter Estimates
Effect Parameter ParmName Outcome Cylinders DF
0 Intercept Intercept Intercept_3 3 3.0 1
1 Intercept Intercept Intercept_4 4 4.0 1
2 Intercept Intercept Intercept_5 5 5.0 1
3 Intercept Intercept Intercept_6 6 6.0 1
4 Intercept Intercept Intercept_8 8 8.0 1
5 Intercept Intercept Intercept_10 10 10.0 1
6 MPG_City MPG_City MPG_City NaN 1
Estimate StdErr ChiSq ProbChiSq
0 -60.329075 4.829533 156.042532 8.286542e-36
1 -21.461149 1.584887 183.361936 8.941488e-42
2 -21.233691 1.575766 181.579751 2.190306e-41
3 -16.632445 1.337275 154.693103 1.634032e-35
4 -10.988487 1.139470 92.997190 5.236863e-22
5 -10.314220 1.186541 75.562638 3.539969e-18
6 1.013934 0.077371 171.734698 3.092446e-39
In the preceding example, the cumulative logit link function is used for the multinomial model. For a generalized linear model using the multinomial distribution, we estimate the probability that an observation (a car) falls into one of the seven possible numbers of cylinders (Cylinders = 3, 4, 5, 6, 8, 10, 12). In this case, the cumulative logit link assumes a logit link function between the cumulative probabilities and the linear combination of the predictors:
Pr(Cylinders=3) = f(−60.329075 + 1.013934 × MPG_City)
Pr(Cylinders=3) + Pr(Cylinders=4) = f(−21.461149 + 1.013934 × MPG_City)
……
Pr(Cylinders=3) + … + Pr(Cylinders=10) = f(−10.314220 + 1.013934 × MPG_City)
where f(u) = exp(u)/(1 + exp(u)) is the standard inverse logit link function. You can use the parameter estimates to score new observations directly. For example, when a car has MPG_City = 20, the chance that this car is 4-cylinder is about 23.5%:
Pr(Cylinders=4 | MPG_City=20) = f(−21.461149 + 1.013934 × 20) − f(−60.329075 + 1.013934 × 20) = 0.235
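You can verify this arithmetic with a few lines of Python using the inverse logit function and the parameter estimates shown above:

```python
import math

def inv_logit(u):
    # f(u) = exp(u) / (1 + exp(u)), the standard inverse logit.
    return math.exp(u) / (1.0 + math.exp(u))

# Cumulative probabilities at MPG_City = 20 from the parameter estimates.
cum_le_3 = inv_logit(-60.329075 + 1.013934 * 20)   # essentially zero
cum_le_4 = inv_logit(-21.461149 + 1.013934 * 20)

# Pr(Cylinders = 4) is the difference of consecutive cumulatives.
p4 = cum_le_4 - cum_le_3   # about 0.235
```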
Similar to the glm action in the same regression action set, the genmod action generates the predictions from the generalized linear model. Instead of using the parameter estimates from the preceding formulas to manually compute the predictions, you can use genmod to score a data set using the output options.
In [3]: result = conn.CASTable('CylinderPredicted', replace=True)
 ...: genmodModel1.output.casout = result
...: genmodModel1.output.copyVars = 'ALL';
...: genmodModel1.output.pred = 'Prob_Cylinders'
...: genmodModel1()
...: result[['Prob_Cylinders','_level_','Cylinders','MPG_City']].head(24)
...:
NOTE: Convergence criterion (GCONV=1E-8) satisfied.
Out[3]:
Selected Rows from Table CYLINDERPREDICTED
Prob_Cylinders _LEVEL_ Cylinders MPG_City
0 1.928842e-19 3.0 6.0 17.0
1 1.442488e-02 4.0 6.0 17.0
2 1.804258e-02 5.0 6.0 17.0
3 6.466697e-01 6.0 6.0 17.0
4 9.980702e-01 8.0 6.0 17.0
5 9.990158e-01 10.0 6.0 17.0
6 5.316706e-19 3.0 6.0 18.0
7 3.877857e-02 4.0 6.0 18.0
8 4.820533e-02 5.0 6.0 18.0
9 8.345697e-01 6.0 6.0 18.0
10 9.992990e-01 8.0 6.0 18.0
11 9.996427e-01 10.0 6.0 18.0
12 8.460038e-17 3.0 4.0 23.0
13 8.652192e-01 4.0 4.0 23.0
14 8.896126e-01 5.0 4.0 23.0
15 9.987558e-01 6.0 4.0 23.0
16 9.999956e-01 8.0 4.0 23.0
17 9.999978e-01 10.0 4.0 23.0
18 4.039564e-18 3.0 6.0 20.0
19 2.346085e-01 4.0 6.0 20.0
20 2.778780e-01 5.0 6.0 20.0
21 9.745742e-01 6.0 6.0 20.0
22 9.999077e-01 8.0 6.0 20.0
23 9.999530e-01 10.0 6.0 20.0
You can generate the predictions for a car with MPG_CITY = 20 from the last six rows of the preceding output table:
Pr(Cylinders=3 | MPG_CITY=20) = 4.039564e−18 ≈ 0
Pr(Cylinders=4 | MPG_CITY=20) = 0.2346085 − 0 = 0.2346085
Pr(Cylinders=5 | MPG_CITY=20) = 0.2778780 − 0.2346085 = 0.04327
Pr(Cylinders=6 | MPG_CITY=20) = 0.9745742 − 0.2778780 = 0.696696
Pr(Cylinders>6 | MPG_CITY=20) = 1 − 0.9745742 < 0.03
Using the multinomial model, a car with MPG_CITY = 20 is most likely a 6-cylinder (69.7%) or 4-cylinder car (23.5%), and the chance that it has more than 6 cylinders is less than 3%.
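The differencing of cumulative probabilities is mechanical, so it is easy to script. The following sketch uses the cumulative values fetched above for MPG_City = 20:

```python
# Cumulative predicted probabilities Pr(Cylinders <= level) at MPG_City = 20,
# taken from the last six rows of the scored table.
cumulative = [
    (3, 4.039564e-18),
    (4, 0.2346085),
    (5, 0.2778780),
    (6, 0.9745742),
    (8, 0.9999077),
    (10, 0.9999530),
]

# Individual probabilities are successive differences of the cumulatives.
individual = {}
prev = 0.0
for level, cum in cumulative:
    individual[level] = cum - prev
    prev = cum

most_likely = max(individual, key=individual.get)   # 6 cylinders
```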
The following table lists the output result tables from the genmod action. You can use the display option to include or exclude a result table when fitting a generalized linear model using the genmod action.
Table Name | Description |
ModelInfo | Basic information about the model such as the data source, the response variable, the distribution and link functions, and the optimization technique. |
NObs | The number of observations that are read and used. Missing values are excluded, by default. |
ConvergenceStatus | The convergence status of the parameter estimation. |
Dimensions | The dimensions of the model, including the number of effects and the number of parameters. |
FitStatistics | The fit statistics of the model such as log likelihood (multiplied by -2), AIC, AICC, and SBC. |
ParameterEstimates | The estimation of the model parameters. |
Timing | A timing of the subtasks of the genmod action call. |
Similar to decision tree models, regression trees are machine learning algorithms that use recursive partitioning to segment the input data set and to make predictions within each segment of the data. In this section, we briefly introduce how to use the tree models that are available in the decisiontree action set to build predictive models for a continuous response. For information about decision trees and tree family models, see Chapter 9.
A regression tree usually follows these steps:
1. Grows a tree as deep as possible based on a splitting criterion and the training data.
2. Prunes back some nodes on the tree based on an error function on the validation data.
3. Builds a simple model for each leaf (terminal node) to predict the continuous dependent variable. The most common approach is to use the local sample average of the dependent variable.
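The heart of step 1, choosing the split that most reduces the squared error, can be sketched for a single predictor. This is an illustrative toy in pure Python, not the action's implementation:

```python
def sse(values):
    # Sum of squared errors around the mean (the 'variance' criterion).
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(x, y):
    # Try each midpoint between sorted distinct x values; keep the split
    # with the largest reduction in SSE relative to no split at all.
    pairs = sorted(zip(x, y))
    best_cut, best_gain = None, -1.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [yy for xx, yy in pairs if xx <= cut]
        right = [yy for xx, yy in pairs if xx > cut]
        gain = sse(y) - sse(left) - sse(right)
        if gain > best_gain:
            best_cut, best_gain = cut, gain
    return best_cut

# Two clear clusters: the best cut separates them.
cut = best_split([1, 2, 3, 10, 11, 12], [5.0, 6.0, 5.5, 50.0, 52.0, 51.0])
```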
Let’s first load the decisiontree action set:
In [4]: conn.loadactionset('decisiontree')
...: conn.help(actionset='decisiontree')
NOTE: Added action set 'decisiontree'.
Out[4]:
decisionTree
decisionTree
Name Description
0 dtreeTrain Train Decision Tree
1 dtreeScore Score A Table Using Decision Tree
2 dtreeSplit Split Tree Nodes
3 dtreePrune Prune Decision Tree
4 dtreeMerge Merge Tree Nodes
5 dtreeCode Generate score code for Decision Tree
6 forestTrain Train Forest
7 forestScore Score A Table Using Forest
8 forestCode Generate score code for Forest
9 gbtreeTrain Train Gradient Boosting Tree
10 gbtreeScore Score A Table Using Gradient Boosting Tree
11 gbtreecode Generate score code for Gradient Boosting Trees
The dtreetrain action fits a decision tree model for a categorical dependent variable or a regression tree model for a continuous dependent variable. Let’s build a regression tree model that uses MPG_City to predict MSRP values of cars. In this example, we create the simplest regression tree, which splits the root node into only two leaves (splitting the entire data into two partitions).
In [5]: cars = conn.CASTable('cars')
...:
...: output1 = conn.CASTable('treeModel1')
...: output1.replace = True;
...:
...: tree1 = cars.Dtreetrain()
...: tree1.target = 'MSRP'
...: tree1.inputs = ['MPG_City']
...: tree1.casout = output1
...: tree1.maxlevel = 2
...: tree1()
...:
...: output1[['_NodeID_', '_Parent_','_Mean_','_NodeName_','_PBLower0_',
'_PBUpper0_']].head()
...:
Out[5]:
[Fetch]
Selected Rows from Table TREEMODEL1
_NodeID_ _Parent_ _Mean_ _NodeName_ _PBLower0_
0 0.0 -1.0 32774.855140 MPG_City NaN
1 1.0 0.0 22875.341584 MSRP 20.0
2 2.0 0.0 41623.092920 MSRP 10.0
_PBUpper0_
0 NaN
1 60.0
2 20.0
In the preceding example, the regression tree model is saved to the CAS table treeModel1. This table is stored on the CAS server, so you must use the fetch or head method to download it to the Python client. The tree model table contains three observations, one for each node in the tree, including the root node. We fetch only selected information from the tree model table: the unique IDs of the nodes and their parents (_NodeID_ and _Parent_), the local sample means of the dependent variable (_Mean_), the splitting variable (_NodeName_), and the splitting points (_PBLower0_ and _PBUpper0_). The tree model table treeModel1 also contains other useful information such as the size of each node, the splitting criterion, and so on.
We can read how the root node is split from the preceding table. The value of the _NodeName_ column in the first row shows that the root node is split by MPG_City, and the values of the _PBLower0_ and _PBUpper0_ columns show that Node 1 contains the observations with MPG_City in (20, 60], and Node 2 contains the observations with MPG_City in [10, 20]. Note that in the CAS tree models, the splitting point (MPG_City = 20 in this case) is assigned to the child node with the smaller values of MPG_City.
Based on this information, a tree structure follows for the first decision tree example:
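The fitted two-leaf tree amounts to a simple scoring rule, which you can state directly. The leaf means come from the _Mean_ column above, and the split point MPG_City = 20 goes to the lower node:

```python
# Score with the two-leaf regression tree: each leaf predicts its sample mean.
NODE1_MEAN = 22875.341584   # leaf for MPG_City in (20, 60]
NODE2_MEAN = 41623.092920   # leaf for MPG_City in [10, 20]

def tree_predict(mpg_city):
    return NODE2_MEAN if mpg_city <= 20 else NODE1_MEAN

low_mpg_pred = tree_predict(15)    # low-MPG cars predict a higher MSRP
high_mpg_pred = tree_predict(35)
```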
The splitting criteria of the CAS decision tree model are listed in the following table:
Parameter | Description |
crit = ‘ftest’ | Uses the p-values of F test to select the best split. |
crit = ‘variance’ | Uses the best split that produces the largest reduction of the sum of squared errors. For regression trees, the sum of squared errors is equivalent to the variance of the data. |
crit = ‘chaid’ | Uses adjusted significance testing (Bonferroni testing) to select the best split. |
Pruning a tree model is a necessary step to avoid overfitting when you use the regression tree model to score new data (validation data). In the preceding example, we grow only the simplest tree structure, so there is no need to prune it back. If you have created a deeper tree and you have validation data, you can use the dtreeprune action to prune a given tree model.
data2 = conn.CASTable('your_validation_data')
output2 = conn.CASTable('pruned_tree')
data2.dtreeprune(modelTable=output1, casout=output2)
The last step for completing a regression tree model is to use the pruned tree model to score your data. This is done by the dtreescore action.
data2.dtreescore(modelTable=output2)
In this chapter, we first introduced the linear regression model and discussed some best practices to improve the model fitting of a linear regression. Then we introduced the generalized linear models that are available in the regression action set and the actions in the decisiontree action set for building a regression tree model. For more information, see the following references.
[1] Neter, John, et al. 1996. Applied Linear Statistical Models. Vol. 4. Chicago: Irwin.
[2] McCullagh, Peter, and John A. Nelder. 1989. Generalized Linear Models. Vol. 37. CRC Press.