Regularized methods

There are three common regularization approaches:

  • Lasso
  • Ridge
  • Elastic net

In this section, we will see how these methods can be implemented in R. For these models, we will use the h2o package, which provides an open source, in-memory, distributed machine learning platform that is fast and scalable. It is designed for building models on big data and is well suited to enterprise, production-grade applications.

For more information on the h2o package, please visit its documentation at https://cran.r-project.org/web/packages/h2o/index.html.

This package is very useful because it brings together several common machine learning algorithms in a single package. Moreover, these algorithms are fast and can be executed in parallel on our own computer. The package includes generalized linear models, naïve Bayes, distributed random forest, gradient boosting, and deep learning, among others.
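Each of these algorithms is exposed as a separate function in the h2o R package. The following names are listed only for orientation; in this section we will only use the GLM-related functions:

# h2o.glm()           generalized linear models
# h2o.naiveBayes()    naïve Bayes classifier
# h2o.randomForest()  distributed random forest
# h2o.gbm()           gradient boosting machines
# h2o.deeplearning()  deep learning (multilayer neural networks)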

It is not necessary to have a high level of programming knowledge, because the package comes with a user interface.

Let's see how the package works. First, the package should be loaded:

library(h2o)

Use the h2o.init method to initialize H2O. This method accepts other options that can be found in the package documentation:

h2o.init()
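For example, the number of CPU threads and the maximum memory of the local cluster can be set explicitly; the values below are only illustrative:

h2o.init(nthreads = -1,         # use all available CPU cores
         max_mem_size = "4g")   # maximum memory allocated to the H2O cluster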

The first step toward building our model involves placing our data in the H2O cluster/Java process. Before this step, we will ensure that our target is considered as a factor variable:

train$Default<-as.factor(train$Default)

test$Default<-as.factor(test$Default)

Now, let's upload our data to the h2o cluster, keeping a reference to each uploaded frame (we will use these references later):

training<-as.h2o(train[,2:ncol(train)],destination_frame="train")

validation<-as.h2o(test[,2:ncol(test)],destination_frame="test")

If you close R and restart it later, you will need to upload the datasets again, as in the preceding code.

We can check that the data has been uploaded correctly with the following command:

h2o.ls()

## key
## 1 test
## 2 train

The package also includes an easy-to-use web interface that allows us to create different models from the browser. In general, the interface can be opened by typing the following address into a web browser: http://localhost:54321/flow/index.html. You will see a page like the one shown in the following screenshot. In the Model tab, we can see a list of all of the models implemented in this package:
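As a side note, the same page can be opened directly from the R console; a minimal sketch using base R:

# Open the H2O Flow web interface in the default browser
browseURL("http://localhost:54321/flow/index.html")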

First, we are going to develop regularization models. For that, Generalized Linear Modelling… must be selected. This module includes the following:

  • Gaussian regression
  • Poisson regression
  • Binomial regression (classification)
  • Multinomial classification
  • Gamma regression
  • Ordinal regression

As shown in the following screenshot, we should fill in the necessary parameters to train our model:

We will fill in the following fields:

  • model_id: Here, we can specify the name that will be used to reference the model.
  • training_frame: The dataset used to build and train the model; in our case, this is our training dataset.
  • validation_frame: The dataset that will be used to check the accuracy of the model.
  • nfolds: The number of folds used for cross-validation. In our case, the nfolds value is 5.
  • seed: This specifies the seed for the random number generator (RNG), which is used by the components of the algorithm that require random numbers.
  • response_column: The column to use as the dependent variable. In our case, the column is named Default.
  • ignored_columns: Here, it is possible to exclude variables from the training process. In our case, all of the variables are considered relevant.
  • ignore_const_cols: A flag that tells the package to ignore constant columns.
  • family: This specifies the model type. Our target variable has only two possible values, so the family should be fixed as binomial.
  • solver: This specifies the solver to use. We don't change this value because no significant differences have been observed between solvers, so we keep the default.
  • alpha: Here, you choose how the regularization is distributed between L1 and L2. If you select 1, it will be a Lasso regression; if you select 0, it will be a Ridge regression. Any value between 0 and 1 gives a mixture of Lasso and Ridge (an elastic net). In our case, we will select 1. One of the main advantages of the Lasso model is the reduction of the number of variables, because the trained model sets the coefficients of non-relevant variables to zero, resulting in models that are simple but accurate at the same time.
  • lambda_search: This parameter enables a search over the regularization strength (lambda).
  • standardize: If this flag is marked, numeric columns will be transformed to have zero mean and unit variance.

Finally, the Build model button trains the model. Although other options can be selected, the preceding specifications are sufficient:
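Incidentally, the same specification can be reproduced from the R console with the h2o.glm() function. The following is a minimal sketch, assuming the training and validation frames uploaded earlier; the model_id is only illustrative:

lasso_model <- h2o.glm(x = 2:ncol(training),        # predictor columns
                       y = 1,                       # response column (Default)
                       training_frame = training,
                       validation_frame = validation,
                       model_id = "lasso_model",
                       nfolds = 5,
                       seed = 1234,
                       family = "binomial",
                       alpha = 1,                   # alpha = 1 corresponds to Lasso
                       lambda_search = TRUE,
                       standardize = TRUE)

In what follows, however, we continue with the model trained through the interface.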

We can see that the model has been trained quickly. The View button provides us with some interesting details about the model:

  • Model parameters
  • Scoring history
  • The Receiver Operating Characteristic (ROC) curve for training and validation samples
  • Standardized coefficient magnitudes
  • Gains/Lift table for cross-validation, training, and validation samples
  • Cross-validation models
  • Metrics
  • Coefficients

Let's see some of the main results:

As we can see, our Lasso model is trained with 108 different variables, but only 56 of them end up with a non-zero coefficient in the model.
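If we prefer to inspect these results from the R console rather than from Flow, the model can be retrieved by its model_id and its coefficients examined; a minimal sketch, assuming the model was created under the hypothetical id lasso_model:

lasso_model <- h2o.getModel("lasso_model")
coefs <- h2o.coef(lasso_model)                         # coefficients of the trained model
sum(coefs[-1] != 0)                                    # variables with a non-zero coefficient
h2o.std_coef_plot(lasso_model, num_of_features = 10)   # standardized coefficient magnitudes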

The model provides an almost perfect classification. In the training sample, the area under the curve (AUC) reaches 99.51%. This value is slightly lower in the validation sample, at 98.65%. The plot of standardized coefficient magnitudes is also informative:

If a variable is shown in blue, its coefficient is positive; if the coefficient is negative, the bar is shown in orange.

As we can see, UBPRE626 looks like an important variable. It measures how many times total loans and lease-financing receivables exceed total equity capital. Its positive sign means that the higher this ratio, the higher the probability that a bank will fail.

The top five relevant variables according to this figure are as follows:

  1. UBPRE626: The number of times net loans and lease-financing receivables exceed the total equity capital
  2. UBPRE545: The total of the due and non-accrual loans and leases, divided by the allowance for the loan and lease losses
  3. UBPRE170: The total equity capital
  4. UBPRE394: Other construction and land development loans, divided by the average gross loans and leases
  5. UBPRE672: One quarter of the annualized realized gains (or losses) of the securities, divided by average assets

When looking at credit risk, it is important to understand which variables are the most significant and whether their effect makes economic sense. For example, it would not make sense for a bank with a higher level of non-performing or problem loans to appear more solvent. We aren't concerned about the economic sense of the variables in our model here, but this is a key issue for models developed in financial institutions: if a variable doesn't have the expected sign, it has to be removed from the model.
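A simple way of performing this sign check from R is to inspect the sign of each estimated coefficient; continuing the hypothetical lasso_model sketch from above:

coefs <- h2o.coef(lasso_model)
names(coefs)[coefs > 0]   # variables that increase the predicted probability of default
names(coefs)[coefs < 0]   # variables that decrease it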

In some cases, it is necessary to test different combinations of parameters until you obtain the best model. For example, in the recently trained regularized model, we could have tried different values of the alpha parameter. To test several parameter values at the same time, you need to execute the algorithms using code. Let's have a look at how to do this. We will train the regularized models again, but this time using code. First, we remove all objects, including the recently created model, from the h2o cluster:

h2o.removeAll()
## [1] 0

Then, we upload our training and validation samples again:

training<-as.h2o(train[,2:ncol(train)],destination_frame="train")
validation<-as.h2o(test[,2:ncol(test)],destination_frame="test")

Let's code our model. First, an identifier for the grid is defined as follows:

grid_id <- 'glm_grid'

Then, we assign different parameters to be tested in this grid:

hyper_parameters <- list(alpha = c(0, .5, 1))
stopping_metric <- 'auc'

glm_grid <- h2o.grid(algorithm = "glm",
                     grid_id = grid_id,
                     hyper_params = hyper_parameters,
                     training_frame = training,
                     nfolds = 5,
                     x = 2:110,
                     y = 1,
                     lambda_search = TRUE,
                     family = "binomial",
                     seed = 1234)

As we can see, the parameters are almost exactly the same as those we used to train the previous model. The only difference is that we now test several alpha values at the same time, which correspond to a Ridge regression (alpha = 0), an elastic net (alpha = 0.5), and a Lasso (alpha = 1). The h2o.grid call trains one model per alpha value; the resulting grid is then retrieved and sorted using the following code:

results_glm <- h2o.getGrid(grid_id = grid_id,
                           sort_by = stopping_metric,
                           decreasing = TRUE)

According to the previous code, the models in the grid are sorted by the AUC metric in decreasing order. Thus, we are interested in the first model:

best_GLM <- h2o.getModel(results_glm@model_ids[[1]])

Let's take a look at some details about this model:

best_GLM@model$model_summary$regularization
## [1] "Ridge ( lambda = 0.006918 )"

The model with the best performance is a Ridge model. The performance of the model can be obtained as follows:

perf_train<-h2o.performance(model = best_GLM,newdata = training)
perf_train
## H2OBinomialMetrics: glm
##
## MSE: 0.006359316
## RMSE: 0.07974532
## LogLoss: 0.02561085
## Mean Per-Class Error: 0.06116986
## AUC: 0.9953735
## Gini: 0.990747
## R^2: 0.8579102
## Residual Deviance: 363.213
## AIC: 581.213
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##              0    1     Error       Rate
## 0         6743   15  0.002220   =15/6758
## 1           40  293  0.120120    =40/333
## Totals    6783  308  0.007756   =55/7091
##
## Maximum Metrics: Maximum metrics at their respective thresholds
##                          metric  threshold     value  idx
## 1                        max f1   0.540987  0.914197  144
## 2                        max f2   0.157131  0.931659  206
## 3                  max f0point5   0.617239  0.941021  132
## 4                  max accuracy   0.547359  0.992244  143
## 5                 max precision   0.999897  1.000000    0
## 6                    max recall   0.001351  1.000000  383
## 7               max specificity   0.999897  1.000000    0
## 8              max absolute_mcc   0.540987  0.910901  144
## 9     max min_per_class_accuracy  0.056411  0.972973  265
## 10   max mean_per_class_accuracy  0.087402  0.977216  239
##

The AUC and the Gini index, which are the main metrics of performance, are only slightly higher than in the Lasso that we trained initially—at least in the training sample.

The performance of the model in the test sample is also high:

perf_test<-h2o.performance(model = best_GLM,newdata = as.h2o(test))
perf_test
## H2OBinomialMetrics: glm
##
## MSE: 0.01070733
## RMSE: 0.1034762
## LogLoss: 0.04052454
## Mean Per-Class Error: 0.0467923
## AUC: 0.9875425
## Gini: 0.975085
## R^2: 0.7612146
## Residual Deviance: 246.3081
## AIC: 464.3081
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##              0    1     Error       Rate
## 0         2868   28  0.009669   =28/2896
## 1           12  131  0.083916    =12/143
## Totals    2880  159  0.013162   =40/3039
##
## Maximum Metrics: Maximum metrics at their respective thresholds
##                          metric  threshold     value  idx
## 1                        max f1   0.174545  0.867550  125
## 2                        max f2   0.102341  0.904826  138
## 3                  max f0point5   0.586261  0.885167   89
## 4                  max accuracy   0.309187  0.987167  107
## 5                 max precision   0.999961  1.000000    0
## 6                    max recall   0.000386  1.000000  388
## 7               max specificity   0.999961  1.000000    0
## 8              max absolute_mcc   0.174545  0.861985  125
## 9     max min_per_class_accuracy  0.027830  0.955456  210
## 10   max mean_per_class_accuracy  0.102341  0.965295  138

Again, the results do not differ significantly in comparison with the Lasso model. Nevertheless, the number of coefficients in the Lasso model is lower, which makes it easier to interpret and more parsimonious.

The total number of coefficients in the Ridge regression is equal to the number of variables in the dataset plus the intercept of the model:

head(best_GLM@model$coefficients)
##    Intercept     UBPRE395     UBPRE543     UBPRE586     UBPRFB60
## -8.448270911 -0.004167366 -0.003376142 -0.001531582  0.027969152
##     UBPRE389
## -0.004031844
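We can check this count directly; it should equal the number of predictors plus one (the intercept):

length(best_GLM@model$coefficients)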

Now, we will store the predictions of each model in a new data frame. We can combine the results of the different models to obtain an additional model. Initially, our data frame will contain only the ID of each bank and the target variable:

summary_models_train<-train[,c("ID_RSSD","Default")]
summary_models_test<-test[,c("ID_RSSD","Default")]

Let's calculate the model predictions and store them in the summary data frame:

summary_models_train$GLM<-as.vector(h2o.predict(best_GLM,training)[3])
summary_models_test$GLM<-as.vector(h2o.predict(best_GLM,validation)[3])

When we run the previous code to calculate the performance of the model, we also obtain a confusion matrix. For example, in the test sample, we obtain the following:

perf_test@metrics$cm$table
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
##              0    1   Error           Rate
## 0         2868   28  0.0097   = 28 / 2,896
## 1           12  131  0.0839     = 12 / 143
## Totals    2880  159  0.0132   = 40 / 3,039

This package classifies a bank as a failed bank if its predicted probability of default is higher than 0.5, and as a non-failed bank otherwise.

According to this assumption, 40 banks are misclassified (28 + 12). Nevertheless, the cutoff of 0.5 is not really appropriate, because the proportions of failed and non-failed banks in the sample are very different.

The proportion of failed banks is actually only 4.696%, as shown in the following code:

mean(as.numeric(as.character(train$Default)))
## [1] 0.04696094

Hence, it is more appropriate to consider a bank as failed if the probability of a bank defaulting is higher than this proportion:

aux<-summary_models_test
aux$pred<-ifelse(summary_models_test$GLM>0.04696094,1,0)

Thus, the new confusion table for the test sample is as follows:

table(aux$Default,aux$pred)
##
##        0    1
##   0 2818   78
##   1    8  135

According to this table, the model misclassifies 86 banks (78+8). Almost all of the failed banks have been correctly classified. It will be difficult to obtain a better algorithm than this.
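As a quick check, the overall error rate and the proportion of failed banks that are detected at this cutoff can be computed from the aux data frame created above:

cm <- table(aux$Default, aux$pred)
1 - sum(diag(cm))/sum(cm)        # overall misclassification rate (86/3,039)
cm["1","1"]/sum(cm["1",])        # failed banks correctly detected (135/143)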

The model can be saved locally using h2o.saveModel:

h2o.saveModel(object= best_GLM, path=getwd(), force=TRUE)
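h2o.saveModel() returns the path where the model has been written; if we capture it, the model can be restored in a later session with h2o.loadModel(). A minimal sketch:

model_path <- h2o.saveModel(object = best_GLM, path = getwd(), force = TRUE)
# In a later session, after h2o.init(), the model can be reloaded with:
# best_GLM <- h2o.loadModel(model_path)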

We remove the irrelevant objects in the workspace and save it as follows:

rm(list=setdiff(ls(), c("Model_database","train","test","summary_models_train","summary_models_test","training","validation")))

save.image("Data13.RData")

Remember that if you close R and load this workspace again, you should convert your train and test samples into h2o format again:

training<-as.h2o(train[,2:ncol(train)],destination_frame="train")
validation<-as.h2o(test[,2:ncol(test)],destination_frame="test")