Chapter 9: Modeling Categorical Variables

Logistic Regression

Decision Trees

Gradient Boosting, Forests, and Neural Networks

Conclusion

In Chapter 8, we introduced linear regressions, generalized linear models, and regression trees for modeling continuous response variables. In this chapter, we focus on applications in which the response variable is categorical, such as whether organic food is purchased in a supermarket (Bought, Not), blood pressure status (High, Normal, Low), and credit card application status (Accepted, Rejected). Logistic regressions and decision trees are introduced in the first two sections for a binary response variable, which is a response variable with only two qualitative outcomes. In the last section, we introduce random forests, gradient boosting trees, and neural networks, which can fit a categorical response variable with more than two distinct outcomes.

We continue to use the Organics data set for this chapter. Again, we assume that a CAS server is already set up and that the data sets have been loaded onto the CAS server:

In [1]: organics.tableinfo()

Out[1]:

[TableInfo]

 

        Name     Rows  Columns Encoding CreateTimeFormatted  

 0  ORGANICS  1688948       36    utf-8  09Nov2016:10:32:06   

 

      ModTimeFormatted JavaCharSet    CreateTime       ModTime  

 0  09Nov2016:10:32:06        UTF8  1.794307e+09  1.794307e+09   

 

    Global  Repeated  View SourceName SourceCaslib  Compressed  

 0       1         0     0                                   0   

 

   Creator Modifier  

 0  username            

Logistic Regression

Similar to the linear regression models introduced in Chapter 8, logistic regression uses the linear combination of one or more predictors to build a predictive model for the response outcomes. Unlike linear regressions, the prediction of interest in a logistic regression is not a continuous outcome with a range from negative to positive infinity. Instead, we are interested in predicting the probabilities of the outcome levels of the response variable. Examples are the probability of buying or not buying organic food from a supermarket or the probability of a credit card application that is accepted or rejected. For a binary response variable, the two distinct outcome levels are usually called the event level and the non-event level.

$P_{\text{event}} + P_{\text{nonevent}} = 1$

We simply need to predict $P_{\text{event}}$, the event probability for the response variable, which always takes values in [0, 1]. Logistic regression originates from the logistic transformation, which maps a continuous input from (−∞, ∞) to an output between 0 and 1.

$f(x) = \frac{1}{1 + e^{-(x - x_0)}}$

[Figure: the logistic transformation curve]

Similarly, a logistic regression model uses the logistic transformation to link the linear combination of one or more predictors to the probability of event level as follows:

$P_{\text{event}} = \frac{1}{1 + e^{-t}}$

where t is a linear combination of the predictors:

$t = a + b_1 x_1 + b_2 x_2 + \cdots + b_K x_K$

The inverse form of the transformation also indicates that a logistic regression is a non-linear model:

$G(P_{\text{event}}) = \log\!\left(\frac{P_{\text{event}}}{1 - P_{\text{event}}}\right) = a + b_1 x_1 + b_2 x_2 + \cdots + b_K x_K$

where the function $G(P_{\text{event}})$ is often called the logit link function. The ratio $P_{\text{event}} / (1 - P_{\text{event}})$ is also known as the odds. When a logit link function is used, we assume that the predictors have a linear relationship with the log of the odds. Logistic regression can be extended by using other link functions as well, such as probit, cloglog, and negative cloglog.
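To make the transformation concrete, the following small pure-Python sketch (no CAS connection needed) computes the event probability from a linear predictor and confirms that the logit of that probability recovers the linear predictor. The coefficient and predictor values are made up purely for illustration.

import math

def logistic(t):
    # Map a linear predictor t in (-inf, inf) to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    # Inverse transformation: the log of the odds p / (1 - p)
    return math.log(p / (1.0 - p))

# Hypothetical coefficients and predictor values, for illustration only
a, b1, b2 = 0.35, -0.056, 0.0004
x1, x2 = 30, 500

t = a + b1 * x1 + b2 * x2          # linear predictor
p_event = logistic(t)              # probability of the event level

print(round(p_event, 4))           # strictly between 0 and 1
print(round(logit(p_event), 4))    # recovers t (up to rounding)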

Logistic regression models are available in the regression action set. First, you must load the action set.

In [2]: conn.loadactionset('regression')

Now let’s build a simple logistic regression for the Organics data to predict whether consumers would like to buy organic food. The response variable in this data is called TargetBuy.

In [2]: organics.logistic(

   ...:     target = 'TargetBuy',

   ...:     inputs = ['DemAge', 'Purchase_3mon', 'Purchase_6mon']

   ...: )

 

NOTE: Convergence criterion (GCONV=1E-8) satisfied.

Out[2]:

[ModelInfo]

 

 Model Information

 

          RowId             Description                        Value

 0         DATA             Data Source                     ORGANICS

 1  RESPONSEVAR       Response Variable                    TargetBuy

 2         DIST            Distribution                       Binary

 3         LINK           Link Function                        Logit

 4         TECH  Optimization Technique  Newton-Raphson with Ridging

 

[NObs]

 

 Number of Observations

 

    RowId                  Description      Value

 0  NREAD  Number of Observations Read  1688948.0

 1  NUSED  Number of Observations Used  1574340.0

 

[ResponseProfile]

 

 Response Profile

 

    OrderedValue Outcome TargetBuy       Freq Modeled

 0             1  Bought    Bought   387600.0       *

 1             2      No        No  1186740.0        

 

[ConvergenceStatus]

 

 Convergence Status

 

                                     Reason  Status   MaxGradient

 0  Convergence criterion (GCONV=1E-8) s...       0  4.419888e-08

 

[Dimensions]

 

 Dimensions

 

          RowId                 Description  Value

 0  NDESIGNCOLS           Columns in Design      4

 1     NEFFECTS           Number of Effects      4

 2    MAXEFCOLS          Max Effect Columns      1

 3   DESIGNRANK              Rank of Design      4

 4      OPTPARM  Parameters in Optimization      4

 

[GlobalTest]

 

 Testing Global Null Hypothesis: BETA=0

 

                Test  DF          ChiSq  ProbChiSq

 0  Likelihood Ratio   3  149251.090674        0.0

 

[FitStatistics]

 

 Fit Statistics

 

   RowId               Description         Value

 0  M2LL         -2 Log Likelihood  1.608090e+06

 1   AIC   AIC (smaller is better)  1.608098e+06

 2  AICC  AICC (smaller is better)  1.608098e+06

 3   SBC   SBC (smaller is better)  1.608147e+06

 

[ParameterEstimates]

 

 Parameter Estimates

 

           Effect      Parameter       ParmName  DF  Estimate  

 0      Intercept      Intercept      Intercept   1  1.755274   

 1         DemAge         DemAge         DemAge   1 -0.057438   

 2  purchase_3mon  purchase_3mon  purchase_3mon   1 -0.000002   

 3  purchase_6mon  purchase_6mon  purchase_6mon   1  0.000039   

 

      StdErr          ChiSq      ProbChiSq  

 0  0.057092     945.232627  1.442299e-207  

 1  0.000158  131358.158671   0.000000e+00  

 2  0.000055       0.000866   9.765250e-01  

 3  0.000039       0.990883   3.195267e-01  

 

[Timing]

 

 Task Timing

 

             RowId                  Task      Time   RelTime

 0           SETUP     Setup and Parsing  0.017694  0.058605

 1    LEVELIZATION          Levelization  0.027966  0.092627

 2  INITIALIZATION  Model Initialization  0.000611  0.002024

 3            SSCP      SSCP Computation  0.017766  0.058843

 4         FITTING         Model Fitting  0.233353  0.772892

 5         CLEANUP               Cleanup  0.002567  0.008502

 6           TOTAL                 Total  0.301922  1.000000

In the preceding example, we build a logistic regression to predict the probability of buying organic food (Pr(TargetBuy = Bought)) using three continuous predictors: Age, Recent 3 Month Purchase Amount, and Recent 6 Month Purchase Amount. You can also add categorical predictors to the logistic regression. They must be specified in both the inputs argument and the nominals argument.

In [3]: organics.logistic(

   ...:     target = 'TargetBuy',

   ...:     inputs = ['DemAge', 'Purchase_3mon', 'Purchase_6mon', 'DemGender',

                      'DemHomeowner'],

   ...:     nominals = ['DemGender', 'DemHomeowner'],

   ...:     display = {'names': ['ParameterEstimates']}

   ...: )

 

NOTE: Convergence criterion (GCONV=1E-8) satisfied.

Out[3]:

 

[ParameterEstimates]

 

 Parameter Estimates

 

           Effect DemGender DemHomeowner         Parameter  

 0      Intercept                                Intercept   

 1         DemAge                                   DemAge   

 2  purchase_3mon                            purchase_3mon   

 3  purchase_6mon                            purchase_6mon   

 4      DemGender         F                    DemGender F   

 5      DemGender         M                    DemGender M   

 6      DemGender         U                    DemGender U   

 7   DemHomeowner                     No   DemHomeowner No   

 8   DemHomeowner                    Yes  DemHomeowner Yes   

 

            ParmName  DF  Estimate    StdErr          ChiSq  

 0         Intercept   1  0.353349  0.059586      35.165819   

 1            DemAge   1 -0.056478  0.000162  120824.157416   

 2     purchase_3mon   1  0.000009  0.000057       0.024229   

 3     purchase_6mon   1  0.000037  0.000040       0.819283   

 4       DemGender_F   1  1.817158  0.007381   60608.910532   

 5       DemGender_M   1  0.857905  0.008216   10904.528993   

 6       DemGender_U   0  0.000000       NaN            NaN   

 7   DemHomeowner_No   1  0.000320  0.004226       0.005725   

 8  DemHomeowner_Yes   0  0.000000       NaN            NaN   

 

       ProbChiSq  

 0  3.027918e-09  

 1  0.000000e+00  

 2  8.763032e-01  

 3  3.653900e-01  

 4  0.000000e+00  

 5  0.000000e+00  

 6           NaN  

 7  9.396871e-01  

 8           NaN  

In the preceding example, we also use the display option to select only the parameter estimate output table. For each categorical predictor, the logistic action constructs dummy indicators for the distinct levels of the predictor, and uses them as linear terms. For example, if a consumer is 30 years old, has purchased $1,200 and $500 in the last six months and three months, respectively, is male, and currently does not own a home, the predicted probability of buying organic food is 39.65%.

$t = 0.3533 - 0.056 \times 30 + 0.000008864 \times 500 + 0.00003651 \times 1200 + 0.8579 + 0.0003197 = -0.42024$

$P_{\text{event}} = \frac{1}{1 + e^{-t}} = 39.65\%$
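As a quick check of this hand calculation, the same arithmetic can be reproduced in Python with the coefficients from the ParameterEstimates table above. The small difference from −0.42024 comes from the rounding of the coefficients in the text.

import math

# Parameter estimates copied from the ParameterEstimates table above
intercept       = 0.353349
b_age           = -0.056478
b_purchase_3mon = 0.000009
b_purchase_6mon = 0.000037
b_gender_m      = 0.857905   # dummy indicator for DemGender = M
b_homeowner_no  = 0.000320   # dummy indicator for DemHomeowner = No

# 30 years old, $500 in the last three months, $1,200 in the last six
# months, male, and does not own a home
t = (intercept + b_age * 30 + b_purchase_3mon * 500
     + b_purchase_6mon * 1200 + b_gender_m + b_homeowner_no)

p_event = 1.0 / (1.0 + math.exp(-t))
print(round(p_event, 4))   # approximately 0.39, in line with the 39.65% above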

The following table summarizes the output tables that are returned by the logistic action.

Table Name Description
NObs The number of observations that are read and used. Missing values are excluded, by default.
ResponseProfile The frequency distribution of the response variable. The event level is marked with an asterisk (*).
ConvergenceStatus The convergence status of the parameter estimation.
Dimensions The dimensions of the model, including the number of effects and the number of parameters.
GlobalTest A likelihood ratio test that measures the overall model fitting.
FitStatistics The fit statistics of the model, such as the log likelihood (multiplied by -2), AIC, AICC, and SBC.
ParameterEstimates The estimation of the logistic regression parameters.
Timing The timing of the subtasks of the logistic action call.

In Python, you can define the logistic model first and add options, step-by-step, which enables you to reuse a lot of code when you need to change only a few options in a logistic model. The preceding logistic action code can be rewritten as follows:

In [4]: all_preds = ['DemAge', 'Purchase_3mon', 'Purchase_9mon', 'DemGender',

                     'DemHomeowner']

   ...: all_cats = ['DemGender', 'DemHomeowner']

   ...:

   ...: model1 = organics.Logistic()

   ...: model1.nominals = all_cats

   ...: model1.target = 'TargetBuy'

   ...: model1.inputs = all_preds

   ...: model1.display.names = ['ParameterEstimates']

   ...:

   ...: model1()

In the preceding example, we first define a new logistic model model1, and then add the class effect list (model1.nominals), the response variable (model1.target), and the effect list (model1.inputs), step by step. The logistic action uses the logit link function by default, but you can always override it with your own selection. In the following example, we reuse the model1 definition from the previous example and change the link function to PROBIT:

In [5]: model1.link = 'PROBIT'

   ...: model1.display.names = ['ResponseProfile', 'ParameterEstimates']

   ...: model1()

 

NOTE: Convergence criterion (GCONV=1E-8) satisfied.

Out[5]:

[ResponseProfile]

 

 Response Profile

 

    OrderedValue Outcome TargetBuy       Freq Modeled

 0             1  Bought    Bought   387600.0       *

 1             2      No        No  1186740.0        

 

[ParameterEstimates]

 

 Parameter Estimates

 

           Effect DemGender DemHomeowner         Parameter  

 0      Intercept                                Intercept   

 1         DemAge                                   DemAge   

 2  purchase_3mon                            purchase_3mon   

 3  purchase_9mon                            purchase_9mon   

 4      DemGender         F                    DemGender F   

 5      DemGender         M                    DemGender M   

 6      DemGender         U                    DemGender U   

 7   DemHomeowner                     No   DemHomeowner No   

 8   DemHomeowner                    Yes  DemHomeowner Yes   

 

            ParmName  DF  Estimate    StdErr          ChiSq  

 0         Intercept   1  0.447991  0.072359      38.330834   

 1            DemAge   1 -0.056478  0.000162  120823.996353   

 2     purchase_3mon   1  0.000071  0.000049       2.074100   

 3     purchase_9mon   1 -0.000026  0.000029       0.811013   

 4       DemGender_F   1  1.817162  0.007381   60609.158117   

 5       DemGender_M   1  0.857905  0.008216   10904.528278   

 6       DemGender_U   0  0.000000       NaN            NaN   

 7   DemHomeowner_No   1  0.000318  0.004226       0.005669   

 8  DemHomeowner_Yes   0  0.000000       NaN            NaN   

 

       ProbChiSq  

 0  5.971168e-10  

 1  0.000000e+00  

 2  1.498183e-01  

 3  3.678210e-01  

 4  0.000000e+00  

 5  0.000000e+00  

 6           NaN  

 7  9.399814e-01  

 8           NaN  

The logistic regression models that we have built so far only train the model; they do not actually use the model to make predictions (that is, to estimate $P_{\text{event}}$). You must specify an output table for the logistic action in order to score the observations in the input table. By default, the casout table contains only the predictions, but you can use the copyvars option to copy some columns from the input table to the casout table. You can use copyvars='all' to copy all columns to the casout table.

In [6]: result1 = conn.CASTable('predicted', replace=True)

   ...: model1.output.casout = result1

   ...: model1.output.copyvars = 'all'

   ...: del model1.display

   ...: model1()

The logistic action generates one more result table when a casout table is created. This table contains basic information about the casout table such as the name, the library, and the dimensions of the table.

[OutputCasTables]

 

                 casLib       Name Label     Rows  Columns  

 0  CASUSERHDFS(username)  predicted        1688948       37   

 

                                   casTable  

 0  CASTable('predicted', caslib='CASUSE...  

You can print out the column names of the output table to see what new columns have been added. By default, the logistic action creates a new column _PRED_ for the estimated event probability.

In [7]: result1.columns

Out[7]:

Index(['_PRED_', 'ID', 'DemAffl', 'DemAge', 'DemGender',

       'DemHomeowner', 'DemAgeGroup', 'DemCluster', 'DemReg',

       'DemTVReg', 'DemFlag1', 'DemFlag2', 'DemFlag3', 'DemFlag4',

       'DemFlag5', 'DemFlag6', 'DemFlag7', 'DemFlag8', 'PromClass',

       'PromTime', 'TargetBuy', 'Bought_Beverages', 'Bought_Bakery',

       'Bought_Canned', 'Bought_Dairy', 'Bought_Baking',

       'Bought_Frozen', 'Bought_Meat', 'Bought_Fruits',

       'Bought_Vegetables', 'Bought_Cleaners', 'Bought_PaperGoods',

       'Bought_Others', 'purchase_3mon', 'purchase_6mon',

       'purchase_9mon', 'purchase_12mon'],

      dtype='object')

You can use the data summary actions in the simple action set to look at the model output. For example, let’s compare the average predicted probability of buying organic foods across different levels of DemGender:

In [8]: result1.crosstab(row='DemGender', weight='_PRED_', aggregators='mean')

 

Out[8]:

[Crosstab]

 

   DemGender      Col1

 0         F  0.343891

 1         M  0.165559

 2         U  0.076981

From the preceding output, it looks like female customers are more likely to purchase organic food, and the customers who didn't provide gender information are the least interested in buying organic food. The ratio between these two groups (0.343891/0.076981) is approximately 4.5, which means that, on average, female customers are about four and a half times as likely to purchase organic foods as the customers who didn't provide gender information.
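Because the crosstab result is returned as a DataFrame keyed by 'Crosstab', the same group comparison can also be computed programmatically. A minimal sketch:

ct = result1.crosstab(row='DemGender', weight='_PRED_',
                      aggregators='mean')['Crosstab']

means = ct.set_index('DemGender')['Col1']
print(means['F'] / means['U'])     # roughly 4.5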

The logistic action can generate additional model diagnostic outputs. The following table summarizes the output statistics that the logistic action can add to the casout table.

Option Description
pred The predicted value. If you do not specify any output statistics, the predicted value is named _PRED_, by default.
resraw The raw residual.
xbeta The linear predictor.
stdxbeta The standard error of the linear predictor.
lcl The lower bound of a confidence interval for the linear predictor.
ucl The upper bound of a confidence interval for the linear predictor.
lclm The lower bound of a confidence interval for the mean.
uclm The upper bound of a confidence interval for the mean.
h The leverage of the observation.
reschi The Pearson chi-square residual.
stdreschi The standardized Pearson chi-square residual.
resdev The deviance residual.
reslik The likelihood residual (likelihood displacement).
reswork The working residual.
difdev The change in the deviance that is attributable to deleting the individual observation.
difchisq The change in the Pearson chi-square statistic that is attributable to deleting the individual observation.
cbar The confidence interval displacement, which measures the overall change in the global regression estimates due to deleting the individual observation.
alpha The significance level used for the construction of confidence intervals.
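Several of these statistics can be requested through the same output parameter that was used earlier for casout and copyvars. The exact form of each option can vary, so treat the following as a sketch rather than a definitive call; it assumes that each statistic option accepts the name of the column to create, and predicted_diag is just a placeholder table name.

# A sketch only: the option names come from the table above; we assume
# that each option takes the desired output column name
result2 = conn.CASTable('predicted_diag', replace=True)
model1.output.casout = result2
model1.output.copyvars = ['TargetBuy', 'DemGender']
model1.output.pred = 'P_TargetBuy'       # predicted event probability
model1.output.xbeta = 'XBeta'            # linear predictor
model1.output.resdev = 'ResDev'          # deviance residual
model1()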

Similar to the code argument in the glm action for linear regression, the code argument of the logistic action generates SAS DATA step code that saves the logistic model. You can score new data sets using this DATA step code in Python (using the runcode action from the datastep action set) or in a SAS language environment such as Base SAS or SAS Studio.

In [9]: # example 4 score code

   ...: result = organics.logistic(

   ...:     target = 'TargetBuy',

   ...:     inputs = ['DemAge', 'Purchase_3mon', 'Purchase_6mon'],

   ...:     code = {}

   ...: )  

NOTE: Convergence criterion (GCONV=1E-8) satisfied.

 

In [10]: result['_code_']

Out[10]:

Score Code

 

                                     SASCode

 0      /*-------------------------------...

 1                Generated SAS Scoring Code

 2                  Date: 09Nov2016:13:33:15

 3        -------------------------------...

 4                                          

 5      drop _badval_ _linp_ _temp_ _i_ _j_;

 6                             _badval_ = 0;

 7                             _linp_   = 0;

 8                             _temp_   = 0;

 9                             _i_      = 0;

 10                            _j_      = 0;

 11                                         

 12         array _xrow_0_0_{4} _temporary_;

 13     array _beta_0_0_{4} _temporary_ (...

 14                        -0.05743797902413

 15                      -1.6172314911858E-6

 16                       0.00003872414134);

 17                                         

 18                if missing(purchase_3mon)

 19                       or missing(DemAge)

 20                or missing(purchase_6mon)

 21                                 then do;

 22                            _badval_ = 1;

 23                           goto skip_0_0;

 24                                     end;

 25                                         

 26     do _i_=1 to 4; _xrow_0_0_{_i_} = ...

 27                                         

 28                       _xrow_0_0_[1] = 1;

 29                                         

 30                  _xrow_0_0_[2] = DemAge;

 31                                         

 32           _xrow_0_0_[3] = purchase_3mon;

 33                                         

 34           _xrow_0_0_[4] = purchase_6mon;

 35                                         

 36                           do _i_=1 to 4;

 37        _linp_ + _xrow_0_0_{_i_} * _be...

 38                                     end;

 39                                         

 40                                skip_0_0:

 41     label P_TargetBuy = 'Predicted: T...

 42     if (_badval_ eq 0) and not missin...

 43                 if (_linp_ > 0) then do;

 44           P_TargetBuy = 1 / (1+exp(-_...

 45                            end; else do;

 46           P_TargetBuy = exp(_linp_) /...

 47                                     end;

 48                            end; else do;

 49                              _linp_ = .;

 50                         P_TargetBuy = .;

 51                                     end;

 52                                         
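If you prefer to apply the generated code inside CAS itself, one option is the runcode action in the datastep action set. The following sketch assumes that the rows of result['_code_'] can be concatenated into a single block of scoring statements; newdata and scored_new are hypothetical table names, and newdata must already be loaded on the CAS server.

conn.loadactionset('datastep')

# Concatenate the generated scoring statements into one string
score_code = '\n'.join(result['_code_']['SASCode'])

# Wrap the score code in a DATA step that reads a new table and writes
# the predictions; 'newdata' and 'scored_new' are placeholder names
ds_code = ('data scored_new;\n'
           '    set newdata;\n'
           + score_code + '\n'
           'run;')

conn.datastep.runcode(code=ds_code)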

CAS also provides several useful data set options that interact with analytic actions. For example, to build multiple logistic regressions (one for each level of a categorical variable), you can simply set the groupby option on the Organics data set.

In [11]: organics.groupby = ['DemGender']

    ...: result = organics.logistic(

    ...:     target = 'TargetBuy',

    ...:     inputs = ['DemAge', 'Purchase_3mon', 'Purchase_6mon'],

    ...: )

NOTE: Convergence criterion (GCONV=1E-8) satisfied.

NOTE: Convergence criterion (GCONV=1E-8) satisfied.

NOTE: Convergence criterion (GCONV=1E-8) satisfied.

There are three convergence messages for the three logistic models that are trained for the female customers, the male customers, and the customers who didn't provide gender information. You can loop through the results and print the parameter estimates:

In [12]: for df in result:

    ...:     if 'ParameterEstimates' in df:

    ...:         print(result[df][['Effect','Parameter','Estimate']])

    ...:         print('')

    ...:

 

Parameter Estimates

 

                  Effect      Parameter  Estimate

DemGender                                        

F              Intercept      Intercept  2.181055

F                 DemAge         DemAge -0.057016

F          purchase_3mon  purchase_3mon  0.000023

F          purchase_6mon  purchase_6mon  0.000038

 

Parameter Estimates

 

                  Effect      Parameter  Estimate

DemGender                                        

M              Intercept      Intercept  1.230455

M                 DemAge         DemAge -0.056114

M          purchase_3mon  purchase_3mon -0.000091

M          purchase_6mon  purchase_6mon  0.000065

 

Parameter Estimates

 

                  Effect      Parameter  Estimate

DemGender                                        

U              Intercept      Intercept  0.211336

U                 DemAge         DemAge -0.052621

U          purchase_3mon  purchase_3mon  0.000139

U          purchase_6mon  purchase_6mon -0.000046

Note that the groupby option in Python is very similar to the BY statement in the SAS language, which repeats an analysis for each BY group. However, the BY statement requires the data set to be sorted by the variables that are specified in the BY statement. The groupby option does not require pre-sorted data. When the data set is distributed, the groupby option does not introduce additional data shuffling either.

Decision Trees

A decision tree is a machine learning algorithm that uses recursive partitioning to segment the input data set and to make predictions within each segment of the data. Decision tree models have been widely used for predictive modeling, data stratification, missing value imputation, outlier detection and description, variable selection, and other areas of machine learning and statistical modeling. In this section, we focus on using decision trees to build predictive models.

The following figure illustrates a simple decision tree model on data that is collected in a city park regarding whether people purchase ice cream. The response variable is binary: 1 for the people who bought ice cream and 0 for the people who didn't buy ice cream. There are three predictors in this example: Sunny and Hot?, Have Extra Money?, and Crave Ice Cream?, all of which are binary with values YES and NO. In the example, the chance is 80% that people who have extra money on a sunny and hot day will buy ice cream.

 

[Figure: decision tree for the ice cream example]

Building a typical decision tree model for a categorical response variable involves three steps:

1.   Grow a decision tree as deep as possible based on a splitting criterion and the training data.

2.   Prune back some nodes on the decision tree based on an error function on the validation data.

3.   For each leaf, estimate the probability of each response outcome using the sample frequency table.

The CAS decision tree supports a variety of decision tree splitting and pruning criteria. The decision tree action set contains machine learning models that are in the tree family: decision trees, random forests, and gradient boosting. Let’s first load the decisiontree action set.

In [1]: conn.loadactionset('decisiontree')

   ...: conn.help(actionset='decisiontree')

NOTE: Added action set 'decisiontree'.

Out[1]:

decisionTree

                           decisionTree

         Name                       Description

 0  dtreeTrain    Train Decision Tree

 1  dtreeScore    Score A Table Using Decision Tree

 2  dtreeSplit    Split Tree Nodes

 3  dtreePrune    Prune Decision Tree

 4  dtreeMerge    Merge Tree Nodes

 5  dtreeCode     Generate score code for Decision Tree

 6  forestTrain   Train Forest

 7  forestScore   Score A Table Using Forest

 8  forestCode    Generate score code for Forest

 9  gbtreeTrain   Train Gradient Boosting Tree

10  gbtreeScore    Score A Table Using Gradient Boosting Tree

The actions with prefix dtree are used for building a decision tree model. The dtreetrain action is used for training a decision tree. The dtreescore action scores data using a decision tree, and the dtreecode action generates the decision tree model as SAS DATA step code. The other actions with prefix dtree are used in interactive decision tree modifications, where a user can manually split, prune, or merge tree nodes.

Let’s get started with a simple decision tree model that contains only one response variable and one predictor.

In [2]: output1 = conn.CASTable('treeModel1', replace=True)

   ...: tree1 = organics.Dtreetrain()

   ...: tree1.target = 'TargetBuy'

   ...: tree1.inputs = ['DemGender']

   ...: tree1.casout = output1

   ...: tree1()

   ...:

 

Out[2]:

[ModelInfo]

 

 Decision Tree for ORGANICS

 

                            Descr         Value

 0           Number of Tree Nodes  5.000000e+00

 1         Max Number of Branches  2.000000e+00

 2               Number of Levels  3.000000e+00

 3               Number of Leaves  3.000000e+00

 4                 Number of Bins  2.000000e+01

 5         Minimum Size of Leaves  3.236840e+05

 6         Maximum Size of Leaves  9.233240e+05

 7            Number of Variables  1.000000e+00

 8   Confidence Level for Pruning  2.500000e-01

 9    Number of Observations Used  1.688948e+06

 10   Misclassification Error (%)  2.477163e+01

 

[OutputCasTables]

 

                 casLib        Name  Rows  Columns  

 0  CASUSERHDFS(username)  treeModel1     5       24   

 

                                   casTable  

 0  CASTable('treeModel1', caslib='CASUS...  

The dtreetrain action trains a decision tree model and saves it to the casout table treemodel1. The action also generates a model information table (ModelInfo) that contains basic descriptions of the trained decision tree model. Note that ModelInfo is stored at the local Python client and contains only basic information about the tree model, whereas the treemodel1 table is a CAS table that is stored on the CAS server. Let’s look at what is included in the tree model output:

In [3]: output1.columns

Out[3]:

Index(['_Target_', '_NumTargetLevel_', '_TargetValL_',

       '_TargetVal0_', '_TargetVal1_', '_CI0_', '_CI1_', '_NodeID_',

       '_TreeLevel_', '_NodeName_', '_Parent_', '_ParentName_',

       '_NodeType_', '_Gain_', '_NumObs_', '_TargetValue_',

       '_NumChild_', '_ChildID0_', '_ChildID1_', '_PBranches_',

       '_PBNameL0_', '_PBNameL1_', '_PBName0_', '_PBName1_'],

      dtype='object')

The decision tree output contains one row per node to summarize the tree structure and the model fit. Let’s first fetch some columns that are related to the tree structure and the splitting values.

In [4]: output1[['_TreeLevel_', '_NodeID_', '_Parent_', '_ParentName_',

                 '_NodeType_', '_PBName0_',

                 '_PBName1_']].sort_values('_NodeID_').head(20)

 

Out[4]:

Selected Rows from Table TREEMODEL1

 

   _TreeLevel_  _NodeID_  _Parent_ _ParentName_  _NodeType_  

0          0.0       0.0      -1.0                      1.0   

1          1.0       1.0       0.0    DemGender         1.0   

2          1.0       2.0       0.0    DemGender         3.0   

 

3          2.0       3.0       1.0    DemGender         3.0   

4          2.0       4.0       1.0    DemGender         3.0   

 

  _PBName0_ _PBName1_  

0                      

1         M         U  

2         F            

3         U            

4         M            

The first column _TreeLevel_ identifies the depth of the tree nodes, where depth 0 indicates the root node. Each node has a unique node ID (_NodeID_) that is based on the order in which the nodes are inserted into the decision tree. By default, the CAS decision tree splits a node into two branches, which is why the nodes come in pairs that share the same parent (_Parent_).

The _ParentName_ column indicates which predictor is chosen to split the parent node, and the columns with prefix _PBName specify the splitting rule, which explains how the node is split from its parent node. For example, the second row and the third row show that the first split of the decision tree separates observations with DemGender = M or U from observations with DemGender = F, creating the two nodes with _NodeID_ 1 and 2. The _NodeType_ column in this table indicates whether a node is an internal node or a terminal node (leaf).

The following figure reconstructs the decision tree structure from the preceding fetched output:

[Figure: decision tree structure reconstructed from the fetched node table]

Next let’s look at the distribution of the target variables and the measurement of the splits.

In [5]: output1[['_TreeLevel_', '_NodeID_', '_Parent_',

              '_TargetVal0_', '_TargetVal1_', '_CI0_', '_CI1_',

              '_Gain_', '_NumObs_']].sort_values('_NodeID_').head(20)

Out[5]:

Selected Rows from Table TREEMODEL1

 

   _TreeLevel_  _NodeID_  _Parent_ _TargetVal0_ _TargetVal1_  

0          0.0       0.0      -1.0           No       Bought   

1          1.0       1.0       0.0           No       Bought   

2          1.0       2.0       0.0           No       Bought   

3          2.0       3.0       1.0           No       Bought   

4          2.0       4.0       1.0           No       Bought   

 

      _CI0_     _CI1_    _Gain_   _NumObs_  

0  0.752284  0.247716  0.047713  1688948.0  

1  0.870955  0.129045  0.012886   765624.0  

2  0.653881  0.346119  0.000000   923324.0  

3  0.921578  0.078422  0.000000   323684.0  

4  0.833878  0.166122  0.000000   441940.0  

The first row returns the overall frequency distribution of the response variable. In this case, the overall sample percentage of those who buy organic food is 24.77%. The decision tree algorithm then evaluates all possible ways to cut the entire data into two segments. Since we have only one predictor with three levels, there are only three ways to split the data into two segments:

   Gender = F versus Gender = M, U

   Gender = M versus Gender = F, U

   Gender = U versus Gender = F, M

The dtreetrain action evaluates all three possible splits and selects the best one based on a certain criterion. The default criterion used by dtreetrain is information gain, which is the difference between the entropy of the parent node and the weighted average of the entropies of the child nodes. In other words, information gain tries to find the split that maximizes the difference in the target distribution between the parent node and the child nodes. The best split chosen by dtreetrain is Gender = F versus Gender = M, U, with an information gain of 0.0477. The sample percentages of those who buy organic food in the two children of node 0 are 12.90% and 34.61%, respectively.
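As a rough check of the reported gain, the entropy calculation can be reproduced from the node proportions and sizes fetched above. This is a back-of-the-envelope sketch using base-2 entropy; the action's exact value can differ slightly because of internal binning and rounding.

import math

def entropy(p):
    # Binary entropy (base 2) for an event proportion p
    q = 1.0 - p
    return -(p * math.log2(p) + q * math.log2(q))

# Event proportions (_CI1_) and node sizes (_NumObs_) from the table above
parent_p, parent_n = 0.247716, 1688948.0
child_p = [0.129045, 0.346119]
child_n = [765624.0, 923324.0]

weighted_children = sum(n / parent_n * entropy(p)
                        for p, n in zip(child_p, child_n))
gain = entropy(parent_p) - weighted_children
print(round(gain, 4))   # about 0.047, in line with the reported _Gain_ of 0.0477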

The tree model continues to evaluate splits on the two segments to determine whether they can be split further. This continued evaluation is why building a decision tree is also known as recursive partitioning. The tree is grown using recursive partitioning until one of the following criteria is met:

   The current partition cannot be split anymore.

   The minimum leaf size has been reached.

   The maximum depth of the tree has been reached.

The following figure shows the final decision tree model that we have built:

[Figure: the final decision tree model for the DemGender example]

The following table lists some important parameters for training a decision tree model:

Parameter Description Default
maxbranch Specifies the maximum number of children (branches) that are allowed for each level of the tree. 2
maxlevel Specifies the maximum number of tree levels. 6
leafsize Specifies the minimum number of observations on each node. 1
prune=True | False Specifies whether to use a C4.5 pruning method for classification trees or minimal cost-complexity pruning for regression trees. False
varimp=True | False Specifies whether the variable importance information is generated. The importance value is determined by the total Gini reduction. False
cflev Specifies the aggressiveness of tree pruning according to the C4.5 method.  

The dtreetrain action supports the following criteria for evaluating the splits in a decision tree growing process:

Parameter Description
crit=chaid CHAID (Chi-square Automatic Interaction Detector) technique
crit=chisquare Chi-square test
crit=gain Information gain
crit=gainratio Information gain ratio
crit=gini Per-leaf Gini statistic
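The parameters in the two preceding tables are passed to dtreetrain just like target and inputs. For example, the following sketch requests the Gini split criterion and variable importance for the single-predictor tree; only parameter names from the tables are used, and treeModel1_gini is a placeholder table name.

organics.dtreetrain(
    target   = 'TargetBuy',
    inputs   = ['DemGender'],
    casout   = conn.CASTable('treeModel1_gini', replace=True),
    crit     = 'gini',       # per-leaf Gini statistic as the split criterion
    varimp   = True,         # variable importance (total Gini reduction)
    leafSize = 1000,         # do not create nodes smaller than 1,000 observations
)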

Building a decision tree model usually involves a pruning step after the tree is grown. You must enable the prune option to request that dtreetrain include this additional pruning step.

In [6]: tree1.prune = True

   ...: tree1()

Out[6]:

[ModelInfo]

 

 Decision Tree for ORGANICS

 

                            Descr         Value

 0           Number of Tree Nodes  3.000000e+00

 1         Max Number of Branches  2.000000e+00

 2               Number of Levels  2.000000e+00

 3               Number of Leaves  2.000000e+00

 4                 Number of Bins  2.000000e+01

 5         Minimum Size of Leaves  7.656240e+05

 6         Maximum Size of Leaves  9.233240e+05

 7            Number of Variables  1.000000e+00

 8   Confidence Level for Pruning  2.500000e-01

 9    Number of Observations Used  1.688948e+06

 10   Misclassification Error (%)  2.477163e+01

 

[OutputCasTables]

 

                 casLib        Name  Rows  Columns  

 0  CASUSERHDFS(username)  treeModel1     3       24   

 

                                   casTable  

 0  CASTable('treeModel1', caslib='CASUS...  

 

In [7]: output1[['_TreeLevel_', '_NodeID_', '_Parent_', '_ParentName_',

                 '_NodeType_', '_PBName0_',

                 '_PBName1_']].sort_values('_NodeID_').head(20)

Out[7]:

Selected Rows from Table TREEMODEL1

 

   _TreeLevel_  _NodeID_  _Parent_ _ParentName_  _NodeType_  

0          0.0       0.0      -1.0                      1.0   

1          1.0       1.0       0.0    DemGender         3.0   

2          1.0       2.0       0.0    DemGender         3.0   

 

  _PBName0_ _PBName1_  

0                      

1         M         U  

2         F            

Compared to the tree from In [2], Node 3 and Node 4 have been pruned according to the pruning criterion. The default pruning criterion is a modified C4.5 method, which estimates the error rate on validation data based on the classification error rate of the training data. The pruning criterion itself is not reported by the dtreetrain action. However, you can use the information gain output that is generated by the unpruned tree model to understand why this split is not significant (its information gain is 0.0129) compared to the first split (information gain 0.0477).

Pruning a decision tree is used to avoid model overfitting when you use the decision tree to score new data (validation data). The C4.5 method does not require validation data in order to prune a tree. If you do have holdout data, you can use the dtreeprune action to prune a given tree model.

data2 = conn.CASTable('your_validation_data')

output2 = conn.CASTable('pruned_tree')

data2.dtreeprune(model=output1, casout=output2)

 

Next let’s add more variables to the decision tree models.

In [8]: varlist = ['DemGender', 'DemHomeowner', 'DemAgeGroup', 'DemCluster', 'DemReg',

                   'DemTVReg', 'DemFlag1', 'DemFlag2', 'DemFlag3', 'DemFlag4', 'DemFlag5',

                   'DemFlag6', 'DemFlag7', 'DemFlag8', 'PromClass']

   ...:

   ...: output2 = conn.CASTable('treeModel2', replace=True)

   ...:

   ...: tree2 = organics.Dtreetrain()

   ...: tree2.target = 'TargetBuy'

   ...: tree2.inputs = varlist

   ...: tree2.casout = output2

   ...: tree2()

Out[8]:

[ModelInfo]

 

 Decision Tree for ORGANICS

 

                            Descr         Value

 0           Number of Tree Nodes  4.300000e+01

 1         Max Number of Branches  2.000000e+00

 2               Number of Levels  6.000000e+00

 3               Number of Leaves  2.200000e+01

 4                 Number of Bins  2.000000e+01

 5         Minimum Size of Leaves  7.600000e+01

 6         Maximum Size of Leaves  5.972840e+05

 7            Number of Variables  1.500000e+01

 8   Confidence Level for Pruning  2.500000e-01

 9    Number of Observations Used  1.688948e+06

 10   Misclassification Error (%)  2.389866e+01

 

[OutputCasTables]

 

                 casLib        Name  Rows  Columns  

 0  CASUSERHDFS(username)  treeModel2    43      130   

 

                                   casTable  

 0  CASTable('treeModel2', caslib='CASUS...  

The ModelInfo output in the preceding example indicates that a larger decision tree has been grown. The new tree has 43 nodes, and 22 of them are terminal nodes (leaves). The tree has 6 levels, and the smallest leaf has 76 observations. For data with 1.7 million observations, we might not be interested in segments that contain only 76 observations. You can control the minimum segment size with the leafSize parameter. For example, you can set leafSize to 1000 (and limit the depth with maxLevel) to grow a smaller tree, as follows:

In [9]: organics.dtreetrain(

   ...:     target   = 'TargetBuy',

   ...:     inputs   = varlist,

   ...:     casout   = output2,

   ...:     leafSize = 1000,

   ...:     maxLevel = 4,

   ...: )

Out[9]:

[ModelInfo]

 

 Decision Tree for ORGANICS

 

                            Descr         Value

 0           Number of Tree Nodes  1.500000e+01

 1         Max Number of Branches  2.000000e+00

 2               Number of Levels  4.000000e+00

 3               Number of Leaves  8.000000e+00

 4                 Number of Bins  2.000000e+01

 5         Minimum Size of Leaves  1.216000e+03

 6         Maximum Size of Leaves  8.891240e+05

 7            Number of Variables  1.500000e+01

 8   Confidence Level for Pruning  2.500000e-01

 9    Number of Observations Used  1.688948e+06

 10   Misclassification Error (%)  2.393466e+01

[OutputCasTables]

 

                 casLib        Name  Rows  Columns  

 0  CASUSERHDFS(username)  treeModel2    15      130   

 

                                   casTable  

 0  CASTable('treeModel2', caslib='CASUS...  

In the preceding example, the new decision tree has only 15 nodes, and 8 of them are leaves. Using this tree model to score a new observation is straightforward: you first identify the leaf that contains the observation, and then use the sample event probability of that leaf as the estimated event probability for the observation. The decisiontree action set also provides the dtreescore action for scoring an entire data set.

In [10]: organics.dtreescore(modelTable=conn.CASTable('treeModel2'))

Out[10]:

[ScoreInfo]

 

                          Descr         Value

 0  Number of Observations Read       1688948

 1  Number of Observations Used       1688948

 2  Misclassification Error (%)  23.934662287

The modelTable option in dtreescore points to the tree model that we trained and stored on the CAS server, and the action uses that model to score the ORGANICS CAS table. The dtreescore action in this example does not actually save the predictions to a new CAS table. You must use the casout option to save the predictions.

In [11]: output3 = conn.CASTable('predicted', replace=True)

    ...: organics.dtreescore(modelTable=output2, casout=output3)

    ...: output3.columns

Out[11]: Index(['_DT_PredName_', '_DT_PredP_', '_DT_PredLevel_', '_LeafID_', '_MissIt_', '_NumNodes_', '_NodeList0_', '_NodeList1_', '_NodeList2_', '_NodeList3_'], dtype='object')

 

In [12]: output3.head(10)

Out[12]:

Selected Rows from Table PREDICTED

 

  _DT_PredName_  _DT_PredP_  _DT_PredLevel_  _LeafID_  _MissIt_  

0            No    0.840345             0.0       8.0       1.0   

1            No    0.667749             0.0       9.0       0.0   

2            No    0.840345             0.0       8.0       0.0   

3            No    0.667749             0.0       9.0       1.0   

4            No    0.667749             0.0       9.0       1.0   

5            No    0.559701             0.0      11.0       1.0   

6            No    0.667749             0.0       9.0       1.0   

7            No    0.840345             0.0       8.0       1.0   

8            No    0.667749             0.0       9.0       0.0   

9            No    0.923686             0.0       7.0       0.0   

 

   _NumNodes_  _NodeList0_  _NodeList1_  _NodeList2_  _NodeList3_  

0         4.0          0.0          1.0          3.0          8.0  

1         4.0          0.0          1.0          4.0          9.0  

2         4.0          0.0          1.0          3.0          8.0  

3         4.0          0.0          1.0          4.0          9.0  

4         4.0          0.0          1.0          4.0          9.0  

5         4.0          0.0          2.0          5.0         11.0  

6         4.0          0.0          1.0          4.0          9.0  

7         4.0          0.0          1.0          3.0          8.0  

8         4.0          0.0          1.0          4.0          9.0  

9         4.0          0.0          1.0          3.0          7.0  

The following table summarizes the columns that are generated as output by a dtreescore action.

Output Column Description
_DT_PredName_ The predicted value, which is the most frequent level of the leaf that the observation is assigned to.
_DT_PredP_ The predicted probability, which is equal to the sample frequency of the predicted value in the leaf that the observation is assigned to.
_DT_PredLevel_ The index of the predicted value. If the response variable has k levels, this column takes values 0, 1, ..., k − 1.
_LeafID_ The node ID of the leaf that the observation is assigned to.
_MissIt_ The indicator for misclassification.
_NumNodes_, _NodeListK_ _NumNodes_ indicates the depth of the leaf. The _NodeListK_ columns store the path from the root node to the leaf.
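Because the scored table is an ordinary CAS table, the columns above can be summarized with the same pandas-style techniques used earlier in this chapter. A small sketch (assuming the mean and value_counts column methods provided by SWAT) that re-derives the misclassification rate and shows how the observations are distributed across the leaves:

# Mean of the misclassification indicator = misclassification rate
print(output3['_MissIt_'].mean())        # close to the 23.9% reported above

# Number of observations assigned to each leaf
print(output3['_LeafID_'].value_counts())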

Gradient Boosting, Forests, and Neural Networks

The decisiontree action set also contains the actions for building gradient boosting and random forest models. Unlike decision trees, gradient boosting and random forests are machine-learning techniques that produce predictions based on an ensemble of trees. Gradient boosting models are usually based on a set of weak prediction models (small decision trees or even tree stumps). By contrast, random forests are usually based on a set of fully grown trees (deep trees) that are trained on subsamples of the data. Another major difference is that gradient boosting grows decision trees sequentially, whereas a random forest grows decision trees in parallel. For more details about tree models, refer to [1] and [2] at the end of the chapter.

Both gradient boosting and random forest models are available in the decisiontree action set through three distinct actions that cover basic steps of a machine learning pipeline: model training, scoring, and delivery (score code generation).

In [1]: conn.help(actionset='decisiontree')

NOTE: Added action set 'decisiontree'.

Out[1]:

decisionTree

                           decisionTree

         Name                       Description

 0  dtreeTrain    Train Decision Tree

 1  dtreeScore    Score A Table Using Decision Tree

 2  dtreeSplit    Split Tree Nodes

 3  dtreePrune    Prune Decision Tree

 4  dtreeMerge    Merge Tree Nodes

 5  dtreeCode     Generate score code for Decision Tree

 6  forestTrain   Train Forest

 7  forestScore   Score A Table Using Forest

 8  forestCode    Generate score code for Forest

 9  gbtreeTrain   Train Gradient Boosting Tree

10  gbtreeScore   Score A Table Using Gradient Boosting Tree

11  gbtreecode    Generate score code for Gradient Boosting Trees

Let’s first train a simple random forest model using the Organics data set to predict the probability of buying organic food.

In[2]: varlist = ['DemGender', 'DemHomeowner', 'DemAgeGroup',

                  'DemCluster', 'DemReg', 'DemTVReg', 'DemFlag1',

                  'DemFlag2', 'DemFlag3', 'DemFlag4', 'DemFlag5',

                  'DemFlag6', 'DemFlag7', 'DemFlag8', 'PromClass']

  ...:

  ...: output = conn.CASTable('forest1', replace=True)

  ...:

  ...: forest1 = organics.Foresttrain()

  ...: forest1.target = 'TargetBuy'

  ...: forest1.inputs = varlist

  ...: forest1.casout = output

  ...: forest1()

 

Out[2]:

[ModelInfo]

 

 Forest for ORGANICS

 

                                Descr          Value

 0                    Number of Trees      50.000000

 1   Number of Selected Variables (M)       4.000000

 2                 Random Number Seed       0.000000

 3           Bootstrap Percentage (%)      63.212056

 4                     Number of Bins      20.000000

 5                Number of Variables      15.000000

 6       Confidence Level for Pruning       0.250000

 7           Max Number of Tree Nodes      57.000000

 8           Min Number of Tree Nodes      19.000000

 9             Max Number of Branches       2.000000

 10            Min Number of Branches       2.000000

 11              Max Number of Levels       6.000000

 12              Min Number of Levels       6.000000

 13              Max Number of Leaves      29.000000

 14              Min Number of Leaves      10.000000

 15            Maximum Size of Leaves  927422.000000

 16            Minimum Size of Leaves      10.000000

 17                Out-of-Bag MCR (%)            NaN

 

[OutputCasTables]

 

                 casLib     Name  Rows  Columns  

 0  CASUSERHDFS(username)  forest1  2280      132   

 

                                   casTable  

 0  CASTable('forest1', caslib='CASUSERH...  

The foresttrain action returns two result tables to the client: ModelInfo and OutputCasTables. The first table contains parameters that define the forest, parameters that define each individual tree, and tree statistics such as the minimum and maximum numbers of branches and levels. A forest model is an ensemble of homogeneous trees, with each tree grown on a different subset of the data (usually from bootstrap sampling). Therefore, the sizes and depths of the trees can differ from each other even though the tree parameters that you define are the same. The idea of a forest is to grow a deep tree on each subsample in order to produce "perfect" predictions for the local data (low bias but high variance) and then use the ensemble technique to reduce the overall variance.

Some key parameters that define a random forest model are listed in the following table; a short example that sets several of them follows the table. The foresttrain action also enables you to configure the individual trees, and those parameters are identical to the ones in the dtreetrain action. In general, you grow a deep tree for each bootstrap sample, and pruning is often unnecessary.

Parameter Description
ntree The number of trees in the forest ensemble, which is 50, by default.
m The number of input variables to consider for splitting on a node. The variables are selected at random from the input variables. By default, forest uses the square root of the number of input variables that are used, rounded up to the nearest integer.
vote='majority' Uses majority voting to collect the individual trees into an ensemble. This is the default ensemble for classification models.
vote='prob' Uses the average of predicted probabilities or values to collect the individual trees into an ensemble.
seed The seed that is used for the random number generator in bootstrapping.
bootstrap Specifies the fraction of the data for the bootstrap sample. The default value is 0.63212055882.
oob The Boolean value to control whether the out-of-bag error is computed when building a forest.
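These options are set on the action object in the same way as target and inputs. The sketch below grows a larger forest with a fixed seed and requests the out-of-bag error; only parameters named in the table are used, and forest2 is a placeholder table name.

output_f2 = conn.CASTable('forest2', replace=True)

forest2 = organics.Foresttrain()
forest2.target = 'TargetBuy'
forest2.inputs = varlist
forest2.casout = output_f2
forest2.ntree = 100        # grow 100 trees instead of the default 50
forest2.m = 5              # consider 5 randomly chosen inputs per split
forest2.seed = 1234        # reproducible bootstrap sampling
forest2.oob = True         # compute the out-of-bag error
forest2()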

Random forest models are also commonly used in variable selection, which is usually determined by the variable importance of the predictors in training the forest model. The importance of a predictor to the target variable is a measure of its overall contribution to all the individual trees. In the foresttrain action, this contribution is defined as the total Gini reduction from all of the splits that use this predictor.

In [3]: forest1.varimp = True

   ...: result = forest1()

   ...: result['DTreeVarImpInfo']

   ...:

Out[3]:

Forest for ORGANICS

 

        Variable    Importance          Std

0      DemGender  16191.631365  7820.237986

1    DemAgeGroup   7006.819480  2827.738946

2      PromClass   2235.407366  1199.868546

3       DemTVReg    288.048039    68.934845

4       DemFlag2    249.873589   253.060732

5       DemFlag6    226.131347   309.079859

6     DemCluster    222.589229    70.478049

7       DemFlag1    162.256557   177.508653

8         DemReg    114.090856    50.533349

9       DemFlag7     40.931939    35.704581

10      DemFlag5      8.986376    23.494624

11      DemFlag4      8.583654    17.210787

12      DemFlag8      6.291225    12.721568

13      DemFlag3      5.888663    23.692049

14  DemHomeowner      0.446454     0.812748

The foresttrain action also produces an OutputCasTables result table, which contains the name of the CAS table that stores the actual forest model. That CAS table, which is stored on the CAS server, describes all of the individual trees: each row contains the information about a single node of an individual tree. When you have a large number of trees, this table can be large, which is why it is stored on the CAS server instead of being returned to the client. In the preceding example, the forest model contains 50 trees and 2,278 nodes in total.

In [4]: result['OutputCasTables']

Out[4]:

                casLib     Name  Rows  Columns  

0  CASUSERHDFS(username)  forest1  2278      132   

 

                                  casTable  

0  CASTable('forest1', caslib='CASUSERH...  

 


To score the training data or the holdout data using the forest model, you can use the forestscore action.

scored_data = conn.CASTable('scored_output', replace=True)

organics.forestscore(modelTable=conn.CASTable('forest1'), casout=scored_data)

Unlike the random forest model, gradient boosting grows trees sequentially, whereby each tree is grown based on the residuals from the previous tree. The following example shows how to build a gradient boosting model using the same target and predictors as in the forest model.

In [5]: varlist = ['DemGender', 'DemHomeowner', 'DemAgeGroup', 'DemCluster',

                   'DemReg', 'DemTVReg', 'DemFlag1', 'DemFlag2', 'DemFlag3',

                   'DemFlag4', 'DemFlag5', 'DemFlag6', 'DemFlag7', 'DemFlag8',

                   'PromClass']

   ...:

   ...: output = conn.CASTable('gbtree1', replace=True)

   ...:

   ...: gbtree1 = organics.Gbtreetrain()

   ...: gbtree1.target = 'TargetBuy'

   ...: gbtree1.inputs = varlist

   ...: gbtree1.casout = output

   ...: gbtree1()

   ...:

   ...:

Out[5]:

[ModelInfo]

 

 Gradient Boosting Tree for ORGANICS

 

                                Descr     Value

 0                    Number of Trees      50.0

 1                       Distribution       2.0

 2                      Learning Rate       0.1

 3                   Subsampling Rate       0.5

 4   Number of Selected Variables (M)      15.0

 5                     Number of Bins      20.0

 6                Number of Variables      15.0

 7           Max Number of Tree Nodes      63.0

 8           Min Number of Tree Nodes      61.0

 9             Max Number of Branches       2.0

 10            Min Number of Branches       2.0

 11              Max Number of Levels       6.0

 12              Min Number of Levels       6.0

 13              Max Number of Leaves      32.0

 14              Min Number of Leaves      31.0

 15            Maximum Size of Leaves  512468.0

 16            Minimum Size of Leaves      33.0

 17                Random Number Seed       0.0

 

[OutputCasTables]

 

                 casLib     Name  Rows  Columns  

 0  CASUSERHDFS(username)  gbtree1  3146      119   

 

                                   casTable  

 0  CASTable('gbtree1', caslib='CASUSERH...  

The following table lists the key parameters for growing the sequence of trees in a gradient boosting model. The gbtreetrain action also enables you to configure the individual trees; these parameters are identical to those in the dtreetrain action.

Parameter Description
ntree The number of trees in the gradient boosting model. The default is 50.
m The number of input variables to consider for splitting on a node. The variables are selected at random from the input variables. By default, all input variables are used.
seed The seed for the random number generator for sampling.
subsamplerate The fraction of the subsample data to build each tree. The default is 0.5.
distribution The type of gradient boosting tree to build. The value 0 is used for a regression tree, and 1 is used for a binary classification tree.
lasso The L1 norm regularization on prediction. The default is 0.
ridge The L2 norm regularization on prediction. The default is 0.
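As with the forest model, these parameters are set directly on the action object. The following sketch grows a longer boosting sequence with a larger subsample and a fixed seed; only parameters named in the table are used, and gbtree2 is a placeholder table name.

gbtree2 = organics.Gbtreetrain()
gbtree2.target = 'TargetBuy'
gbtree2.inputs = varlist
gbtree2.casout = conn.CASTable('gbtree2', replace=True)
gbtree2.ntree = 100            # number of trees in the boosting sequence
gbtree2.subsamplerate = 0.8    # fraction of the data sampled for each tree
gbtree2.seed = 1234            # reproducible subsampling
gbtree2()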

To score the training data or the holdout data using the gradient boosting model, you can use the gbtreescore action.

scored_data = conn.CASTable('scored_output', replace=True)

organics.gbtreescore(modelTable=conn.CASTable('gbtree1'), casout=scored_data)

Neural networks are machine learning models that derive hidden features (neurons) as nonlinear functions of linear combinations of the predictors, and then model the target variable as a nonlinear function of the hidden features. These nonlinear transformations are often called activation functions in neural networks. The following figure illustrates a simple neural network between three predictors and a binary target that takes the levels 0 and 1. In this case, there are four hidden features, each a nonlinear function of a linear combination of the three predictors. The probabilities of target = 1 and target = 0 are modeled as a nonlinear function of the hidden features. In neural networks, these hidden features are called neurons, and it is common to have more than one layer of hidden neurons.

[Figure: a neural network with three input nodes, one hidden layer of four neurons, and two output nodes for the binary target]
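To make the idea of hidden neurons concrete, the following pure-NumPy sketch performs a single forward pass through a network like the one in the figure: three predictors, one hidden layer of four tanh neurons, and a softmax output for the two target levels. The weights are random placeholders, not a trained model.

import numpy as np

rng = np.random.default_rng(0)

x = np.array([0.3, -1.2, 0.8])      # three (standardized) predictor values

# Hidden layer: four neurons, each a nonlinear (tanh) function of a
# linear combination of the predictors
W1 = rng.normal(size=(4, 3))
b1 = rng.normal(size=4)
hidden = np.tanh(W1 @ x + b1)

# Output layer: softmax over the two target levels
W2 = rng.normal(size=(2, 4))
b2 = rng.normal(size=2)
z = W2 @ hidden + b2
prob = np.exp(z) / np.exp(z).sum()

print(prob)                          # P(target = 0), P(target = 1); sums to 1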

The neural network actions are available in the neuralnet action set. Let's load the action set and continue to use the Organics data set to build a simple neural network.

In [6]: conn.loadactionset('neuralnet')

   ...: conn.help(actionset='neuralnet')

   ...:

NOTE: Added action set 'neuralNet'.

NOTE: Information for action set 'neuralNet':

NOTE: neuralNet

NOTE: annTrain - Train an artificial neural network

NOTE: annScore - Score a table using an artificial neural network model

NOTE: annCode - Generate DATA step scoring code from an artificial neural network model

 

In [7]: output = conn.CASTable('neuralModel1', replace=True)

   ...: neural1 = organics.Anntrain()

   ...: neural1.target = 'TargetBuy'

   ...: neural1.inputs = ['DemAge','DemAffl','DemGender']

   ...: neural1.casout = output

   ...: neural1.hiddens = [4,2]

   ...: neural1.maxIter = 500

   ...: result = neural1()

   ...: list(result.keys())

Out[7]: ['OptIterHistory', 'ConvergenceStatus', 'ModelInfo', 'OutputCasTables']

In this case, we built a neural network with two hidden layers. The first hidden layer has four neurons and the second has two. The maximum number of iterations for training the neural network is set to 500. The iteration history and the convergence status of the model are reported in the OptIterHistory and ConvergenceStatus result tables, respectively. The ModelInfo result table contains basic information about the neural network model:

In [8]: result['ModelInfo']

Out[8]:

Neural Net Model Info for ORGANICS

 

                          Descr         Value

0                         Model    Neural Net

1   Number of Observations Used       1498264

2   Number of Observations Read       1688948

3      Target/Response Variable     TargetBuy

4               Number of Nodes            13

5         Number of Input Nodes             5

6        Number of Output Nodes             2

7        Number of Hidden Nodes             6

8       Number of Hidden Layers             2

9   Number of Weight Parameters            30

10    Number of Bias Parameters             8

11                 Architecture           MLP

12        Number of Neural Nets             1

13              Objective Value  1.7011968247

The default error function of the anntrain action is NORMAL for a continuous target and ENTROPY for a categorical target. The anntrain action also provides parameters for customizing the neural network, such as the error function, the activation function, and the target activation function. Some key parameters of the anntrain action are listed in the following table:

Parameter Description
arch Specifies the architecture of the network.
arch = 'MLP': The standard multilayer perceptron network.
arch = 'GLIM': The neural network with no hidden layer.
arch = 'DIRECT': The MLP network with additional direct links from input nodes to the target nodes.
errorfunc Specifies the error function for training the neural network. ENTROPY is available for categorical target. GAMMA, NORMAL, and POISSON are available for continuous targets.
targetact Specifies the target activation function that links the hidden neurons at the last layer to the target nodes. LOGISTIC and SOFTMAX (default) are available for categorical targets. EXP, IDENTITY, SIN, and TANH (default) are available for continuous targets.
act Specifies the activation function that links the input nodes to the hidden neurons at the first layer, or neurons from one layer to the next layer. Available activation functions include EXP, IDENTITY, LOGISTIC, RECTIFIER, SIN, SOFTPLUS, and TANH (default).
targetcomb Specifies the way to combine neurons in the target activation function. Linear combination (LINEAR) is the default. Other combinations are additive (ADD) and radial (RADIAL).
comb Specifies the way to combine neurons or read as input in the activation functions. Linear combination (LINEAR) is the default. Other combinations are Additive (ADD) and radial (RADIAL).
ntries The number of tries for random initial values of the weight parameters.
includebias Indicates whether to include the intercept term (usually called the bias) in the combination function. This parameter is ignored if an additive combination is used for the combination function (comb) or the target combination function (targetcomb).
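The anntrain parameters above are set on the action object in the same way as hiddens and maxIter. The following sketch builds a second network with a rectifier activation and several random restarts; only parameter names listed in the table are used, and neural2 is a placeholder table name.

neural2 = organics.Anntrain()
neural2.target = 'TargetBuy'
neural2.inputs = ['DemAge', 'DemAffl', 'DemGender']
neural2.casout = conn.CASTable('neural2', replace=True)
neural2.hiddens = [8]              # a single hidden layer with 8 neurons
neural2.act = 'RECTIFIER'          # hidden-layer activation function
neural2.errorfunc = 'ENTROPY'      # error function for a categorical target
neural2.ntries = 3                 # number of random weight initializations
result2 = neural2()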

Similar to the tree models in CAS, a CAS table on the server is used for storing the neural network model. This is convenient when you build a large neural network because it avoids I/O traffic between the Python client and the CAS server. To score a data set using the CAS table that contains the neural network model, you can use the annscore action.

In [9]: organics.annscore(modelTable=output)

   ...:

Out[9]:

[ScoreInfo]

 

                          Descr         Value

 0  Number of Observations Read       1688948

 1  Number of Observations Used       1498264

 2  Misclassification Error (%)  18.448818099

Conclusion

In this chapter, we introduced several analytic models that are available on the CAS server for modeling categorical variables. These include logistic regression in the regression action set, the tree family (decision trees, random forests, and gradient boosting) in the decisiontree action set, and simple neural networks in the neuralnet action set. For more information about the models, see the following references.

[1] Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2001. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics. Berlin: Springer.

[2] Breiman, Leo. 2001. "Random Forests." Machine Learning 45 (1): 5–32.
