Extensions of Ordinary Linear Regression
In this chapter, we explore several commonly used predictive models for a continuous dependent variable, including linear regression, generalized linear models, and regression trees. The data set that is used in this chapter is the Cars data set available in the sas-viya-programming repository of the sassoftware account on GitHub. Upload the data set directly from GitHub using the upload method on the CAS connection, as follows:
In [1]: cars = conn.upload('https://raw.githubusercontent.com/sassoftware/'
'sas-viya-programming/master/data/cars.csv').casTable
In [1]: cars.tableinfo()
Out[1]:
[TableInfo]
Name Rows Columns Encoding CreateTimeFormatted
0 CARS 428 15 utf-8 09Nov2016:10:32:02
ModTimeFormatted JavaCharSet CreateTime ModTime
0 09Nov2016:10:32:02 UTF8 1.794307e+09 1.794307e+09
Global Repeated View SourceName SourceCaslib Compressed
0 1 0 0 0
Creator Modifier
0 username
In [2]: cars.columninfo()
Out[2]:
[ColumnInfo]
Column Label ID Type RawLength
0 Make 1 char 13
1 Model 2 char 40
2 Type 3 char 8
3 Origin 4 char 6
4 DriveTrain 5 char 5
5 MSRP 6 double 8
6 Invoice 7 double 8
7 EngineSize Engine Size (L) 8 double 8
8 Cylinders 9 double 8
9 Horsepower 10 double 8
10 MPG_City MPG (City) 11 double 8
11 MPG_Highway MPG (Highway) 12 double 8
12 Weight Weight (LBS) 13 double 8
13 Wheelbase Wheelbase (IN) 14 double 8
14 Length Length (IN) 15 double 8
FormattedLength Format NFL NFD
0 13 0 0
1 40 0 0
2 8 0 0
3 6 0 0
4 5 0 0
5 8 DOLLAR 8 0
6 8 DOLLAR 8 0
7 12 0 0
8 12 0 0
9 12 0 0
10 12 0 0
11 12 0 0
12 12 0 0
13 12 0 0
14 12 0 0
Linear regression is one of the most widely used statistical models for predictive modeling. The basic idea of a predictive model is to establish a function y=f(x1,x2,…,xK) to predict the value of the dependent variable y that is based on the values of the predictors X1, X2, …, XK. Linear regression assumes that the function f is a linear combination of the predictors and an error term ε.
y=a+b1x1+b2x2+…+bKxK+ε
Usually, we assume that ε follows a normal distribution with mean zero and variance σ². The parameters to be estimated in a linear model include the intercept a, the slopes b1, b2, …, bK, and the variance of the error term σ². These parameters are estimated using the least squares method.
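For simple regression with a single predictor, the least squares estimates have a closed form that is easy to verify in plain Python. The following is a minimal sketch (the function name and toy data are illustrative, not from the chapter):

```python
# Closed-form least squares for y = a + b*x (simple linear regression).
def fit_simple_ols(x, y):
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    # Slope: covariance(x, y) divided by variance(x);
    # the intercept then follows from the sample means.
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return a, b

# A perfectly linear toy data set: y = 3 + 2*x.
a, b = fit_simple_ols([1, 2, 3, 4], [5, 7, 9, 11])
```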
The regression action set provides a glm action, which fits linear regression models using the least squares method. You must load the action set before you use the glm action.
In [3]: conn.loadactionset('regression')
...: conn.help(actionset='regression')
NOTE: Added action set 'regression'.
Out[3]:
Regression
regression
Name Description
0 glm Fits linear regression models using the method of least squares
1 genmod Fits generalized linear regression models
2 logistic Fits logistic regression models
Let’s build a simple regression model using the Cars data to predict MSRP using the city miles per gallon (MPG) of the cars:
In [4]: cars.glm(
...: target = 'MSRP',
...: inputs = ['MPG_City']
...: )
Out[4]:
[ModelInfo]
Model Information
RowId Description Value
0 DATA Data Source CARS
1 RESPONSEVAR Response Variable MSRP
[NObs]
Number of Observations
RowId Description Value
0 NREAD Number of Observations Read 428.0
1 NUSED Number of Observations Used 428.0
[Dimensions]
Dimensions
RowId Description Value
0 NEFFECTS Number of Effects 2
1 NPARMS Number of Parameters 2
[ANOVA]
Analysis of Variance
RowId Source DF SS MS
0 MODEL Model 1.0 3.638090e+10 3.638090e+10
1 ERROR Error 426.0 1.248507e+11 2.930768e+08
2 TOTAL Corrected Total 427.0 1.612316e+11 NaN
FValue ProbF
0 124.13436 1.783404e-25
1 NaN NaN
2 NaN NaN
[FitStatistics]
Fit Statistics
RowId Description Value
0 RMSE Root MSE 1.711949e+04
1 RSQUARE R-Square 2.256437e-01
2 ADJRSQ Adj R-Sq 2.238260e-01
3 AIC AIC 8.776260e+03
4 AICC AICC 8.776316e+03
5 SBC SBC 8.354378e+03
6 TRAIN_ASE ASE 2.917073e+08
[ParameterEstimates]
Parameter Estimates
Effect Parameter DF Estimate StdErr tValue
0 Intercept Intercept 1 68124.606698 3278.919093 20.776544
1 MPG_City MPG_City 1 -1762.135298 158.158758 -11.141560
Probt
0 1.006169e-66
1 1.783404e-25
[Timing]
Task Timing
RowId Task Time RelTime
0 SETUP Setup and Parsing 0.027544 0.626975
1 LEVELIZATION Levelization 0.007532 0.171444
2 INITIALIZATION Model Initialization 0.000371 0.008444
3 SSCP SSCP Computation 0.003291 0.074909
4 FITTING Model Fitting 0.000367 0.008352
5 CLEANUP Cleanup 0.002385 0.054286
6 TOTAL Total 0.043932 1.000000
The ParameterEstimates table contains the parameter estimates for the linear regression model. In the preceding example, the model returned by the glm action is as follows:
MSRP = 68124.606698 − 1762.135298 × MPG_City
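Using the estimates in the ParameterEstimates table, you can score a new observation by hand. The following quick check (not a substitute for the action's own scoring) predicts MSRP for a car with MPG_City = 20:

```python
# Predicted MSRP from the fitted simple regression above.
intercept = 68124.606698
slope = -1762.135298   # coefficient of MPG_City

def predict_msrp(mpg_city):
    return intercept + slope * mpg_city

pred = predict_msrp(20)   # about 32,882
```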
The result tables also contain useful information about the model definition and model fitting. The following table summarizes the result tables.
Table Name | Description |
NObs | The number of observations that are read and used. Missing values are excluded, by default. |
Dimensions | The dimensions of the model, including the number of effects and the number of parameters. |
ANOVA | The Analysis of Variance table that measures the overall model fitting. |
FitStatistics | The fit statistics of the model such as R-Square and root mean square error. |
ParameterEstimates | The estimation of the regression parameters. |
Timing | A timing of the subtasks of the glm action call. |
Compared to the data exploration actions introduced in Chapter 7, the glm action requires a more complex and deeper parameter structure. In this case, it might be more convenient to define a new model first and then specify the model parameters, step-by-step. In other words, the first linear regression shown in the preceding example can be rewritten as follows:
linear1 = cars.Glm()
linear1.target = 'MSRP'
linear1.inputs = ['MPG_City']
linear1()
This approach enables you to reuse the code when you need to change only a few options of the glm action. For example, to display only the parameter estimates table, you specify the table name in the display.names option and rerun the linear1 model:
In [4]: linear1.display.names = ['ParameterEstimates']
...: linear1()
Out[4]:
ParameterEstimates
Parameter Estimates
Effect Parameter DF Estimate StdErr tValue Probt
0 Intercept Intercept 1 68124.606698 3278.9190925 20.776543969 1.006169E-66
1 MPG_City MPG_City 1 -1762.135298 158.15875785 -11.14156005 1.783404E-25
So far, we have used the glm action only to estimate the parameters of the linear regression model. We haven’t used the model to predict MSRP values of the cars. For prediction, you must specify an output table using the output option. You can also delete the display.names option in order to request all result tables:
In [5]: del linear1.display.names
...: result1 = conn.CASTable('cas.MSRPPrediction')
...: result1.replace = True
...: linear1.output.casout = result1
...: linear1.output.copyvars = 'all';
...: linear1()
Out[5]:
…output clipped…
[OutputCasTables]
casLib Name Label Rows Columns
0 CASUSERHDFS(username) MSRPPrediction 428 16
casTable
0 CASTable('MSRPPrediction', caslib='C...
In the preceding example, a new CAS table MSRPPrediction is defined and then used in the output option. When you submit the code, the glm action first fits a linear regression model, and then it uses the fitted model to score the input data. Also, it creates the new CAS table MSRPPrediction that contains the predicted MSRP values. The copyvars='all' option requests that the glm action copy all columns from the CARS table to the MSRPPrediction table.
In the preceding example, the output column name for predicted MSRP is not specified. The glm action automatically chooses Pred as the name.
You can summarize the predicted values using the summary action from the simple action set.
In [6]: result1[['pred']].summary()
Out[6]:
[Summary]
Descriptive Statistics for MSRPPREDICTION
Column Min Max N NMiss Mean
0 Pred -37603.511169 50503.25372 428.0 0.0 32774.85514
Sum Std StdErr Var USS
0 14027638.0 9230.448198 446.170554 8.520117e+07 4.961347e+11
CSS CV TValue ProbT
0 3.638090e+10 28.163201 73.458131 2.182203e-244
The glm action can generate additional columns besides the predicted values. The following table summarizes the statistical outputs that the glm action can generate.
Option | Description |
pred | The predicted value. If you do not specify any output statistics, the predicted value is named Pred, by default. |
resid | The residual, which is calculated as ACTUAL minus PREDICTED. |
cooksd | The Cook's D influence statistic. |
covratio | The standard influence of the observation on covariance of betas. The COVRATIO statistic measures the change in the determinant of the covariance matrix of the estimates by deleting the current observation. |
dffits | The scaled measure of the change in the predicted value for the ith observation and is calculated by deleting the ith observation. A large value indicates that the observation is very influential in its neighborhood of the X space. |
h | The leverage of the observation. |
lcl | The lower bound of a confidence interval for an individual prediction. |
ucl | The upper bound of a confidence interval for an individual prediction. |
lclm | The lower bound of a confidence interval for the expected value of the dependent variable. |
uclm | The upper bound of a confidence interval for the expected value of the dependent variable. |
likedist | The likelihood displacement. |
press | The ith residual divided by 1 - h, where h is the leverage, and where the model has been refit without the ith observation. |
rstudent | The studentized residual with the current observation deleted. |
stdi | The standard error of the individual predicted value. |
stdp | The standard error of the mean predicted value. |
stdr | The standard error of the residual. |
student | The studentized residuals, which are the residuals divided by their standard errors. |
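Several of these statistics are simple functions of the residual and the leverage h. For example, the PRESS residual divides the residual by 1 - h. The following numeric illustration uses made-up values for one observation:

```python
# PRESS residual: the residual divided by (1 - leverage).
resid = 1200.0   # hypothetical residual for one observation
h = 0.04         # hypothetical leverage of the same observation

press = resid / (1.0 - h)   # about 1250.0
```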
The following example adds the residual values (observed MSRP values minus predicted MSRP values) and the confidence intervals of the prediction to the output table:
In [7]: result2 = conn.CASTable('cas.MSRPPrediction2', replace=True)
...: linear1.output.casout = result2
...: linear1.output.pred = 'Predicted_MSRP'
...: linear1.output.resid = 'Residual_MSRP'
...: linear1.output.lcl = 'LCL_MSRP'
...: linear1.output.ucl = 'UCL_MSRP'
...: linear1()
The output table MSRPPrediction2 is a CAS table that is saved on the CAS server. You have several ways to fetch or download a table from the CAS server. Since the CARS data is relatively small, you can pull all observations from MSRPPrediction2 directly to the Python client using the to_frame method. Then you can use a visualization package in Python such as Bokeh to observe and understand the model outcome.
In [8]: from bokeh.charts import Scatter, output_file, show
...: out1 = result2.to_frame()
...: p = Scatter(out1, x='Residual_MSRP', y='Predicted_MSRP',
color='Origin', marker='Origin')
...: output_file('scatter.html')
...: show(p)
The following figure shows a scatter plot of the predicted MSRP values and residuals.
You can see that the model predicted large negative MSRP values for three observations. Let’s print out these observations to find out what happened:
In [9]: result2[['Predicted_MSRP', 'MSRP', 'MPG_City', 'Make',
'Model']].query('Predicted_MSRP < 0').to_frame()
Out[9]:
Selected Rows from Table MSRPPREDICTION2
Predicted_MSRP MSRP MPG_City Make
0 -12933.617000 20140.0 46.0 Honda
1 -35841.375871 20510.0 59.0 Toyota
2 -37603.511169 19110.0 60.0 Honda
Model
0 Civic Hybrid 4dr manual (gas/electric)
1 Prius 4dr (gas/electric)
2 Insight 2dr (gas/electric)
All of these cars are fuel efficient with relatively high MPG_City. If you generate a scatter plot of the dependent variable MSRP and the predictor variable MPG_City, you can see that the data has some extreme outliers with high MSRP values or high city MPG that might not fit into the linear relationship assumption between these two variables.
In [10]: p = Scatter(out1, x='MPG_City', y='MSRP',
color='Origin', marker='Origin')
...: output_file('scatter.html')
...: show(p)
The following figure shows a scatter plot of MPG_City and MSRP.
Outliers are observed when MSRP values are higher than $100,000 or MPG City is higher than 40.
Linear regression models are sensitive to outliers. There are several treatments for outliers to improve the prediction accuracy of a linear regression model. The simplest approach is to remove the outliers. In the next example, let’s add a filter to remove cars with MPG_City that is greater than 40 or MSRP that is greater than 100,000. The R-Square of the model is actually improved from 0.22 to 0.34.
In [11]: cars.where = 'MSRP < 100000 and MPG_City < 40'
...:
...: result2 = conn.CASTable('cas.MSRPPrediction2')
...: result2.replace = True
...:
...: linear2 = cars.Glm()
...: linear2.target = 'MSRP'
...: linear2.inputs = ['MPG_City']
...: linear2.output.casout = result2
...: linear2.output.copyVars = 'ALL';
...: linear2.output.pred = 'Predicted_MSRP'
...: linear2.output.resid = 'Residual_MSRP'
...: linear2.output.lcl = 'LCL_MSRP'
...: linear2.output.ucl = 'UCL_MSRP'
...: linear2()
You can also use the DataFrame API of the CASTable cars to apply a filter. The preceding model can also be defined using the query method of the CASTable:
linear2 = cars.query('MSRP < 100000 and MPG_City < 40').Glm()
You can see that we have a better residual plot after the outliers are removed from the model.
In [12]: out2 = result2.to_frame()
...: p = Scatter(out2, x='Predicted_MSRP', y='Residual_MSRP',
color='Origin', marker='Origin')
...: output_file('scatter.html')
...: show(p)
The following figure shows a scatter plot of predicted MSRP values and residuals, after excluding outliers.
Let’s continue to improve the linear regression model for predicting MSRP by adding more predictors to the model. In this example, we add three categorical predictors Origin, Type, and DriveTrain, and two more continuous predictors Weight and Length. The categorical predictors must be specified in both the inputs and the nominals parameters. The R-Square statistic is improved again (0.63).
In [13]: nomList = ['Origin','Type','DriveTrain']
...: contList = ['MPG_City','Weight','Length']
...:
...: linear3 = conn.CASTable('cars').Glm()
...: linear3.target = 'MSRP'
...: linear3.inputs = nomList + contList
...: linear3.nominals = nomList
...: linear3.display.names = ['FitStatistics','ParameterEstimates']
...: linear3()
Out [13]:
[FitStatistics]
Fit Statistics
RowId Description Value
0 RMSE Root MSE 1.201514e+04
1 RSQUARE R-Square 6.284172e-01
2 ADJRSQ Adj R-Sq 6.176727e-01
3 AIC AIC 8.483996e+03
4 AICC AICC 8.485013e+03
5 SBC SBC 8.106765e+03
6 TRAIN_ASE ASE 1.399787e+08
[ParameterEstimates]
Parameter Estimates
Effect Origin Type DriveTrain Parameter DF
0 Intercept Intercept 1
1 Origin Asia Origin Asia 1
2 Origin Europe Origin Europe 1
3 Origin USA Origin USA 0
4 Type Hybrid Type Hybrid 1
5 Type SUV Type SUV 1
6 Type Sedan Type Sedan 1
7 Type Sports Type Sports 1
8 Type Truck Type Truck 1
9 Type Wagon Type Wagon 0
10 DriveTrain All DriveTrain All 1
11 DriveTrain Front DriveTrain Front 1
12 DriveTrain Rear DriveTrain Rear 0
13 MPG_City MPG_City 1
14 Weight Weight 1
15 Length Length 1
Estimate StdErr tValue Probt
0 -23692.980669 16261.000069 -1.457043 1.458607e-01
1 2191.206289 1479.756700 1.480788 1.394218e-01
2 17100.937866 1779.533025 9.609790 7.112423e-20
3 0.000000 NaN NaN NaN
4 26154.719438 10602.173003 2.466921 1.403098e-02
5 -1016.065543 3083.503255 -0.329517 7.419315e-01
6 2481.367175 2359.814614 1.051509 2.936366e-01
7 21015.571095 3065.180416 6.856226 2.572647e-11
8 -12891.562541 3592.436933 -3.588529 3.722560e-04
9 0.000000 NaN NaN NaN
10 -7669.535987 1987.120548 -3.859623 1.316520e-04
11 -7699.083608 1722.883863 -4.468719 1.016332e-05
12 0.000000 NaN NaN NaN
13 -496.308946 266.606023 -1.861582 6.336879e-02
14 9.171343 1.914672 4.790032 2.325015e-06
15 162.893307 82.088679 1.984358 4.787420e-02
A linear regression model with categorical effects still assumes homogeneity of variance (that is, the random errors for all observations have the same variance). In the preceding example, this means that the variation of MSRP values is approximately the same across different origins, types, or drive trains. Sometimes data can be heterogeneous. For example, cars from different origins might have different MSRP values as well as different variation in MSRP values.
In [14]: cars = conn.CASTable('cars')
...: out = cars.groupby('Origin')[['MSRP']].summary()
...: out.concat_bygroups()['Summary'][['Column','Mean','Var','Std']]
...:
Out[14]:
Descriptive Statistics for CARS
Column Mean Var Std
Origin
Asia MSRP 24741.322785 1.281666e+08 11321.069675
Europe MSRP 48349.796748 6.410315e+08 25318.600464
USA MSRP 28377.442177 1.371705e+08 11711.982506
The output from the summary action shows that cars from Europe not only have higher MSRP values but also a greater variance. The sample standard deviation of MSRP values for the European cars is more than double that of the cars from Asia and the USA. One easy remedy for variance heterogeneity is to fit multiple models, one for each segment of the data. For the glm action, you can fit multiple models with the groupby option:
In [15]: cars = conn.CASTable('cars')
...: cars.groupby = ['Origin']
...: cars.where = 'MSRP < 100000 and MPG_City < 40'
...: nomList = ['Type','DriveTrain']
...: contList = ['MPG_City','Weight','Length']
...: groupBYResult = conn.CASTable('MSRPPredictionGroupBy')
...:
...: linear4 = cars.Glm()
...: linear4.target = 'MSRP'
...: linear4.inputs = nomList + contList
...: linear4.nominals = nomList
...: linear4.display.names = ['FitStatistics','ParameterEstimates']
...: linear4.output.casout = groupBYResult
...: linear4.output.copyVars = 'ALL';
...: linear4.output.pred = 'Predicted_MSRP'
...: linear4.output.resid = 'Residual_MSRP'
...: linear4.output.lcl = 'LCL_MSRP'
...: linear4.output.ucl = 'UCL_MSRP'
...: linear4()
...:
...: out = groupBYResult.to_frame()
...: p = Scatter(out, x='Predicted_MSRP', y='Residual_MSRP',
color='Origin', marker='Origin')
...: output_file('scatter.html')
...: show(p)
The following figure shows a scatter plot of predicted MSRP values and residuals with three linear regression models fit for each origin (Asia, Europe, and USA):
The key assumptions of an ordinary linear regression are 1) the expected value of the dependent variable can be predicted using a linear combination of a set of predictors, and 2) the error term ε follows a normal distribution. The second assumption might not be valid for some applications such as estimating the number of calls received within a time interval in a call center. Generalized linear models (GLM) and regression trees are popular generalizations of ordinary linear regression for fitting data that does not follow a normal distribution.
In generalized linear models, the dependent variable Y is assumed to follow a particular probability distribution, and the expected value of Y is predicted by a function of the linear combination of the predictors,
E(y)=f(x1,x2,…,xK)=f(a+b1x1+b2x2+…+bKxK)
where E(y) is the expected value of Y, x1, x2, …, xK are the observed values of the predictors X1, X2, …, XK, and b1, b2, …, bK are the unknown parameters. It is more common to express a GLM model by linking the expected value of Y to the linear combination of predictors,
g[E(y)]=a+b1x1+b2x2+…+bKxK
The link function g() is the inverse function of f(). The probability distribution of Y is usually from the exponential family of distributions, such as normal, binomial, exponential, gamma, Poisson, and zero-inflated distributions. The choice of the link function usually depends on the assumption of the probability distribution. For example, for call center data, it is common to assume that the number of calls within a time interval follows a Poisson distribution
P(k calls in interval) = λ^k e^(−λ) / k!,  k = 0, 1, 2, …
and a log link function that relates the expected number of calls λ to the linear combination of the predictors:
log(λ)=a+b1x1+b2x2+…+bKxK
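As a quick numeric check of the Poisson formula, the probability mass function can be computed directly. The rate λ = 3 calls per interval below is a hypothetical value, not an estimate from the chapter:

```python
import math

def poisson_pmf(k, lam):
    # P(k calls in interval) = lam**k * exp(-lam) / k!
    return lam ** k * math.exp(-lam) / math.factorial(k)

p2 = poisson_pmf(2, 3.0)   # probability of exactly 2 calls when lam = 3
total = sum(poisson_pmf(k, 3.0) for k in range(100))   # pmf sums to ~1
```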
It is also worthwhile to mention that ordinary linear regression is a special type of GLM, where the target variable follows a normal distribution and the link function is the identity function, so that E(y)=a+b1x1+b2x2+…+bKxK. For more details about generalized linear models, see [1] and [2].
Generalized linear models are available in the regression CAS action set. Let’s continue to use the Cars data example and build a simple generalized linear model to predict MSRP values of cars using MPG_City.
In [1]: cars = conn.CASTable('cars')
...: genmodModel1 = cars.Genmod()
...: genmodModel1.model.depvars = 'MSRP'
...: genmodModel1.model.effects = ['MPG_City']
...: genmodModel1.model.dist = 'gamma'
...: genmodModel1.model.link = 'log'
...: genmodModel1()
NOTE: Convergence criterion (GCONV=1E-8) satisfied.
Out[1]:
[ModelInfo]
Model Information
RowId Description Value
0 DATA Data Source CARS
1 RESPONSEVAR Response Variable MSRP
2 DIST Distribution Gamma
3 LINK Link Function Log
4 TECH Optimization Technique Newton-Raphson with Ridging
[NObs]
Number of Observations
RowId Description Value
0 NREAD Number of Observations Read 428.0
1 NUSED Number of Observations Used 428.0
[ConvergenceStatus]
Convergence Status
Reason Status MaxGradient
0 Convergence criterion (GCONV=1E-8) s... 0 1.068483e-09
[Dimensions]
Dimensions
RowId Description Value
0 NDESIGNCOLS Columns in Design 2
1 NEFFECTS Number of Effects 2
2 MAXEFCOLS Max Effect Columns 1
3 DESIGNRANK Rank of Design 2
4 OPTPARM Parameters in Optimization 3
[FitStatistics]
Fit Statistics
RowId Description Value
0 M2LL -2 Log Likelihood 9270.853164
1 AIC AIC (smaller is better) 9276.853164
2 AICC AICC (smaller is better) 9276.909768
3 SBC SBC (smaller is better) 9289.030533
[ParameterEstimates]
Parameter Estimates
Effect Parameter ParmName DF Estimate StdErr
0 Intercept Intercept Intercept 1 11.307790 0.059611
1 MPG_City MPG_City MPG_City 1 -0.047400 0.002801
2 Dispersion Dispersion Dispersion 1 5.886574 0.391526
ChiSq ProbChiSq
0 35983.929066 0.000000e+00
1 286.445370 2.958655e-64
2 NaN NaN
[Timing]
Task Timing
RowId Task Time RelTime
0 SETUP Setup and Parsing 0.008891 0.257967
1 LEVELIZATION Levelization 0.005668 0.164454
2 INITIALIZATION Model Initialization 0.000360 0.010446
3 SSCP SSCP Computation 0.001076 0.031220
4 FITTING Model Fitting 0.014661 0.425389
5 CLEANUP Cleanup 0.002235 0.064853
6 TOTAL Total 0.034465 1.000000
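With the log link, a prediction is obtained by exponentiating the linear predictor. The following hand check uses the intercept and MPG_City estimates from the ParameterEstimates table above (it mirrors, but does not replace, the scoring that the genmod action performs):

```python
import math

# Parameter estimates from the fitted gamma model with log link.
intercept = 11.307790
slope = -0.047400   # coefficient of MPG_City

def predict_mean_msrp(mpg_city):
    # E(MSRP) = exp(a + b * MPG_City) because the link g is log.
    return math.exp(intercept + slope * mpg_city)

pred = predict_mean_msrp(20)   # roughly 31,500
```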
In the preceding example, we fit a generalized linear model using the gamma distribution and the log link function because MSRP values of cars are nonnegative continuous values. The type of distribution is usually determined by the range and the shape of the data. For example, exponential, gamma, and inverse Gaussian distributions are popular choices for fitting nonnegative continuous values such as MSRP values of cars, sales revenue, insurance claim amounts, and so on. Binomial and multinomial distributions are valid choices for fitting an integer-valued response. In the next example, we fit a multinomial regression model to predict the number of cylinders of the vehicles based on MPG_City.
In [2]: genmodModel1.model.depvars = 'Cylinders'
...: genmodModel1.model.dist = 'multinomial'
...: genmodModel1.model.link = 'logit'
...: genmodModel1.model.effects = ['MPG_City']
...: genmodModel1.display.names = ['ModelInfo', 'ParameterEstimates']
...: genmodModel1()
...:
NOTE: Convergence criterion (GCONV=1E-8) satisfied.
Out[2]:
[ModelInfo]
Model Information
RowId Description
0 DATA Data Source
1 RESPONSEVAR Response Variable
2 NLEVELS Number of Response Levels
3 DIST Distribution
4 LINKTYPE Link Type
5 LINK Link Function
6 TECH Optimization Technique
Value
0 CARS
1 Cylinders
2 7
3 Multinomial
4 Cumulative
5 Logit
6 Newton-Raphson with Ridging
[ParameterEstimates]
Parameter Estimates
Effect Parameter ParmName Outcome Cylinders DF
0 Intercept Intercept Intercept_3 3 3.0 1
1 Intercept Intercept Intercept_4 4 4.0 1
2 Intercept Intercept Intercept_5 5 5.0 1
3 Intercept Intercept Intercept_6 6 6.0 1
4 Intercept Intercept Intercept_8 8 8.0 1
5 Intercept Intercept Intercept_10 10 10.0 1
6 MPG_City MPG_City MPG_City NaN 1
Estimate StdErr ChiSq ProbChiSq
0 -60.329075 4.829533 156.042532 8.286542e-36
1 -21.461149 1.584887 183.361936 8.941488e-42
2 -21.233691 1.575766 181.579751 2.190306e-41
3 -16.632445 1.337275 154.693103 1.634032e-35
4 -10.988487 1.139470 92.997190 5.236863e-22
5 -10.314220 1.186541 75.562638 3.539969e-18
6 1.013934 0.077371 171.734698 3.092446e-39
In the preceding example, the cumulative logit link function is used for the multinomial model. For a generalized linear model using the multinomial distribution, we estimate the probability that an observation (a car) falls into one of the seven possible numbers of cylinders (Cylinders = 3, 4, 5, 6, 8, 10, 12). In this case, the cumulative logit link assumes a logit link function between the cumulative probabilities and the linear combination of the predictors:
Pr(Cylinders=3) = f(−60.329075 + 1.013934 × MPG_City)
Pr(Cylinders=3) + Pr(Cylinders=4) = f(−21.461149 + 1.013934 × MPG_City)
……
Pr(Cylinders=3) + … + Pr(Cylinders=10) = f(−10.314220 + 1.013934 × MPG_City)
where f(u) = exp(u)/(1 + exp(u)) is the standard inverse logit link function. You can use the parameter estimates to score new observations directly. For example, when a car has MPG_City = 20, the chance that this car is 4-cylinder is about 23.5%:
Pr(Cylinders=4 | MPG_City=20) = f(−21.461149 + 1.013934 × 20) − f(−60.329075 + 1.013934 × 20) = 0.235
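You can verify this arithmetic with a few lines of Python using the inverse logit function and the parameter estimates shown above:

```python
import math

def inv_logit(u):
    # f(u) = exp(u) / (1 + exp(u)), the standard inverse logit.
    return math.exp(u) / (1.0 + math.exp(u))

# Cumulative probabilities at MPG_City = 20 from the parameter estimates.
cum_le_3 = inv_logit(-60.329075 + 1.013934 * 20)   # essentially zero
cum_le_4 = inv_logit(-21.461149 + 1.013934 * 20)

# Pr(Cylinders = 4) is the difference of consecutive cumulatives.
p4 = cum_le_4 - cum_le_3   # about 0.235
```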
Similar to the glm action in the same regression action set, the genmod action generates the predictions from the generalized linear model. Instead of using the parameter estimates from the preceding formulas to manually compute the predictions, you can use genmod to score a data set using the output options.
In [3]: result = conn.CASTable('CylinderPredicted', replace=True)
 ...: genmodModel1.output.casout = result
...: genmodModel1.output.copyVars = 'ALL';
...: genmodModel1.output.pred = 'Prob_Cylinders'
...: genmodModel1()
...: result[['Prob_Cylinders','_level_','Cylinders','MPG_City']].head(24)
...:
NOTE: Convergence criterion (GCONV=1E-8) satisfied.
Out[3]:
Selected Rows from Table CYLINDERPREDICTED
Prob_Cylinders _LEVEL_ Cylinders MPG_City
0 1.928842e-19 3.0 6.0 17.0
1 1.442488e-02 4.0 6.0 17.0
2 1.804258e-02 5.0 6.0 17.0
3 6.466697e-01 6.0 6.0 17.0
4 9.980702e-01 8.0 6.0 17.0
5 9.990158e-01 10.0 6.0 17.0
6 5.316706e-19 3.0 6.0 18.0
7 3.877857e-02 4.0 6.0 18.0
8 4.820533e-02 5.0 6.0 18.0
9 8.345697e-01 6.0 6.0 18.0
10 9.992990e-01 8.0 6.0 18.0
11 9.996427e-01 10.0 6.0 18.0
12 8.460038e-17 3.0 4.0 23.0
13 8.652192e-01 4.0 4.0 23.0
14 8.896126e-01 5.0 4.0 23.0
15 9.987558e-01 6.0 4.0 23.0
16 9.999956e-01 8.0 4.0 23.0
17 9.999978e-01 10.0 4.0 23.0
18 4.039564e-18 3.0 6.0 20.0
19 2.346085e-01 4.0 6.0 20.0
20 2.778780e-01 5.0 6.0 20.0
21 9.745742e-01 6.0 6.0 20.0
22 9.999077e-01 8.0 6.0 20.0
23 9.999530e-01 10.0 6.0 20.0
You can generate the predictions for a car with MPG_CITY = 20 from the last six rows of the preceding output table:
Pr(Cylinders=3 | MPG_CITY=20) = 4.039564e−18 ≈ 0
Pr(Cylinders=4 | MPG_CITY=20) = 0.2346085 − 0 = 0.2346085
Pr(Cylinders=5 | MPG_CITY=20) = 0.2778780 − 0.2346085 = 0.04327
Pr(Cylinders=6 | MPG_CITY=20) = 0.9745742 − 0.2778780 = 0.696696
Pr(Cylinders>6 | MPG_CITY=20) = 1 − 0.9745742 < 0.03
Using the multinomial model, a car with MPG_CITY = 20 is most likely a 6-cylinder (69.7%) or 4-cylinder car (23.5%), and the chance that it has more than 6 cylinders is less than 3%.
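The differencing of cumulative probabilities is mechanical, so it is easy to script. The following sketch uses the cumulative values fetched above for MPG_City = 20:

```python
# Cumulative predicted probabilities Pr(Cylinders <= level) at MPG_City = 20,
# taken from the last six rows of the scored table.
cumulative = [
    (3, 4.039564e-18),
    (4, 0.2346085),
    (5, 0.2778780),
    (6, 0.9745742),
    (8, 0.9999077),
    (10, 0.9999530),
]

# Individual probabilities are successive differences of the cumulatives.
individual = {}
prev = 0.0
for level, cum in cumulative:
    individual[level] = cum - prev
    prev = cum

most_likely = max(individual, key=individual.get)   # 6 cylinders
```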
The following table lists the output result tables from the genmod action. You can use the display option to include or exclude a result table when fitting a generalized linear model using the genmod action.
Table Name | Description |
ModelInfo | Basic information about the model such as the data source, the response variable, the distribution and link functions, and the optimization technique. |
NObs | The number of observations that are read and used. Missing values are excluded, by default. |
ConvergenceStatus | The convergence status of the parameter estimation. |
Dimensions | The dimensions of the model, including the number of effects and the number of parameters. |
FitStatistics | The fit statistics of the model such as log likelihood (multiplied by -2), AIC, AICC, and SBC. |
ParameterEstimates | The estimation of the model parameters. |
Timing | A timing of the subtasks of the genmod action call. |
Similar to decision tree models, regression trees are machine learning algorithms that use recursive partitioning to segment the input data set and to make predictions within each segment of the data. In this section, we briefly introduce how to use the tree models that are available in the decisiontree action set to build predictive models for a continuous response. For information about decision trees and tree family models, see Chapter 9.
A regression tree usually follows these steps:
1. Grows a tree as deep as possible based on a splitting criterion and the training data.
2. Prunes back some nodes on the tree based on an error function on the validation data.
3. Builds a simple model for each leaf (terminal node) to predict the continuous dependent variable. The most common approach is to use the local sample average of the dependent variable.
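The heart of step 1, choosing the split that most reduces the squared error, can be sketched for a single predictor. This is an illustrative toy in pure Python, not the action's implementation:

```python
def sse(values):
    # Sum of squared errors around the mean (the 'variance' criterion).
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(x, y):
    # Try each midpoint between sorted distinct x values; keep the split
    # with the largest reduction in SSE relative to no split at all.
    pairs = sorted(zip(x, y))
    best_cut, best_gain = None, -1.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [yy for xx, yy in pairs if xx <= cut]
        right = [yy for xx, yy in pairs if xx > cut]
        gain = sse(y) - sse(left) - sse(right)
        if gain > best_gain:
            best_cut, best_gain = cut, gain
    return best_cut

# Two clear clusters: the best cut separates them.
cut = best_split([1, 2, 3, 10, 11, 12], [5.0, 6.0, 5.5, 50.0, 52.0, 51.0])
```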
Let’s first load the decisiontree action set:
In [4]: conn.loadactionset('decisiontree')
...: conn.help(actionset='decisiontree')
NOTE: Added action set 'decisiontree'.
Out[4]:
decisionTree
decisionTree
Name Description
0 dtreeTrain Train Decision Tree
1 dtreeScore Score A Table Using Decision Tree
2 dtreeSplit Split Tree Nodes
3 dtreePrune Prune Decision Tree
4 dtreeMerge Merge Tree Nodes
5 dtreeCode Generate score code for Decision Tree
6 forestTrain Train Forest
7 forestScore Score A Table Using Forest
8 forestCode Generate score code for Forest
9 gbtreeTrain Train Gradient Boosting Tree
10 gbtreeScore Score A Table Using Gradient Boosting Tree
11 gbtreecode Generate score code for Gradient Boosting Trees
The dtreetrain action fits a decision tree model for a categorical dependent variable or a regression tree model for a continuous dependent variable. Let’s build a regression tree model that uses MPG_City to predict MSRP values of cars. In this example, we create the simplest regression tree, which splits the root node into only two leaves (splitting the entire data into two partitions).
In [5]: cars = conn.CASTable('cars')
...:
...: output1 = conn.CASTable('treeModel1')
...: output1.replace = True;
...:
...: tree1 = cars.Dtreetrain()
...: tree1.target = 'MSRP'
...: tree1.inputs = ['MPG_City']
...: tree1.casout = output1
...: tree1.maxlevel = 2
...: tree1()
...:
...: output1[['_NodeID_', '_Parent_','_Mean_','_NodeName_','_PBLower0_',
'_PBUpper0_']].head()
...:
Out[5]:
[Fetch]
Selected Rows from Table TREEMODEL1
_NodeID_ _Parent_ _Mean_ _NodeName_ _PBLower0_
0 0.0 -1.0 32774.855140 MPG_City NaN
1 1.0 0.0 22875.341584 MSRP 20.0
2 2.0 0.0 41623.092920 MSRP 10.0
_PBUpper0_
0 NaN
1 60.0
2 20.0
In the preceding example, the regression tree model is saved to the CAS table treeModel1. This table is stored on the CAS server, so you must use the fetch or head method to download it to the Python client. The tree model table contains three observations, one for each node in the tree, including the root node. We fetch only selected information from the tree model table: the unique IDs of the nodes and their parents (_NodeID_ and _Parent_), the local sample means of the dependent variable (_Mean_), the splitting variable (_NodeName_), and the splitting points (_PBLower0_ and _PBUpper0_). The tree model table treeModel1 also contains other useful information such as the size of each node, the splitting criterion, and so on.
We can read how the root node is split from the preceding table. The value of the _NodeName_ column in the first row shows that the root node is split by MPG_City, and the values of the _PBLower0_ and _PBUpper0_ columns show that Node 1 contains the observations with MPG_City in (20, 60], and Node 2 contains the observations with MPG_City in [10, 20]. Note that in the CAS tree models, the splitting point (MPG_City = 20 in this case) is assigned to the child node with the smaller values of MPG_City.
Based on this information, a tree structure follows for the first decision tree example:
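The fitted two-leaf tree amounts to a simple scoring rule, which you can state directly. The leaf means come from the _Mean_ column above, and the split point MPG_City = 20 goes to the lower node:

```python
# Score with the two-leaf regression tree: each leaf predicts its sample mean.
NODE1_MEAN = 22875.341584   # leaf for MPG_City in (20, 60]
NODE2_MEAN = 41623.092920   # leaf for MPG_City in [10, 20]

def tree_predict(mpg_city):
    return NODE2_MEAN if mpg_city <= 20 else NODE1_MEAN

low_mpg_pred = tree_predict(15)    # low-MPG cars predict a higher MSRP
high_mpg_pred = tree_predict(35)
```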
The splitting criteria of the CAS decision tree model are listed in the following table:
Parameter | Description |
crit = ‘ftest’ | Uses the p-values of F test to select the best split. |
crit = ‘variance’ | Uses the best split that produces the largest reduction of the sum of squared errors. For regression trees, the sum of squared errors is equivalent to the variance of the data. |
crit = ‘chaid’ | Uses adjusted significance testing (Bonferroni testing) to select the best split. |
Pruning a tree model is a necessary step to avoid overfitting when you use the regression tree model to score new data (validation data). In the preceding example, we grow only the simplest tree structure, so there is no need to prune it back. If you have created a deeper tree and you have validation data, you can use the dtreeprune action to prune a given tree model.
data2 = conn.CASTable('your_validation_data')
output2 = conn.CASTable('pruned_tree')
data2.dtreeprune(modelTable=output1, casout=output2)
The last step for completing a regression tree model is to use the pruned tree model to score your data. This is done by the dtreescore action.
data2.dtreescore(modelTable=output2)
In this chapter, we first introduced the linear regression model and discussed some best practices to improve the model fitting of a linear regression. Then we introduced the generalized linear models that are available in the regression action set and the actions in the decisiontree action set for building a regression tree model. For more information, see the following references.
[1] Neter, John, et al. 1996. Applied Linear Statistical Models. Vol. 4. Chicago: Irwin.
[2] McCullagh, Peter, and John A. Nelder. 1989. Generalized Linear Models. Vol. 37. CRC Press.