Ensembles

At this point, we have trained five different models. The predictions are stored in two data frames, one for training and the other for the validation samples:

head(summary_models_train)
## ID_RSSD Default GLM RF GBM deep
## 4 37 0 0.0013554364 0 0.000005755001 0.000000018217172
## 21 242 0 0.0006967876 0 0.000005755001 0.000000002088871
## 38 279 0 0.0028306028 0 0.000005240935 0.000003555978680
## 52 354 0 0.0013898732 0 0.000005707480 0.000000782777042
## 78 457 0 0.0021731695 0 0.000005755001 0.000000012535539
## 81 505 0 0.0011344433 0 0.000005461855 0.000000012267744
## SVM
## 4 0.0006227083
## 21 0.0002813123
## 38 0.0010763298
## 52 0.0009740568
## 78 0.0021555739
## 81 0.0005557417

Let's summarize the accuracy of the previously trained models. First, the predictive power of each classifier is measured using the Gini index, which the following code computes for both the training and validation samples (the rcorr.cens function comes from the Hmisc package):

library(Hmisc)  # provides rcorr.cens

gini_models <- as.data.frame(names(summary_models_train[, 3:ncol(summary_models_train)]))
colnames(gini_models) <- "Char"

for (i in 3:ncol(summary_models_train))
{
  # Gini = |2C - 1|, where C is the concordance index returned by rcorr.cens
  gini_models$Gini_train[i-2] <- abs(as.numeric(2*rcorr.cens(summary_models_train[,i], summary_models_train$Default)[1]-1))
  gini_models$Gini_test[i-2] <- abs(as.numeric(2*rcorr.cens(summary_models_test[,i], summary_models_test$Default)[1]-1))
}
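To see what the loop computes, here is a toy example with invented numbers, not the book's data. rcorr.cens() returns the concordance index C as its first element, and the Gini index is simply |2C - 1|; a model that ranks every failed bank above every solvent one reaches a Gini of 1:

```r
library(Hmisc)

# Invented example: five banks, two of which failed
observed <- c(0, 0, 0, 1, 1)
predicted <- c(0.01, 0.02, 0.10, 0.60, 0.90)

# The first element of rcorr.cens() is the concordance index C
gini <- abs(as.numeric(2 * rcorr.cens(predicted, observed)[1] - 1))
print(gini)  # perfect ranking, so the Gini index is 1
```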

The results are stored in a data frame called gini_models. The relative drop in predictive power between the training and test samples is also calculated:

gini_models$var_train_test<-(gini_models$Gini_train-gini_models$Gini_test)/gini_models$Gini_train
print(gini_models)

## Char Gini_train Gini_test var_train_test
## 1 GLM 0.9906977 0.9748967 0.01594943
## 2 RF 1.0000000 0.9764276 0.02357242
## 3 GBM 1.0000000 0.9754665 0.02453348
## 4 deep 0.9855324 0.9589837 0.02693848
## 5 SVM 0.9920815 0.9766884 0.01551595

The differences between the models are small. The SVM achieves the highest predictive power in the test sample, while the deep learning model obtains the worst results.

These results suggest that it is not especially difficult to identify banks that will fail within a year of their most recent financial statement, which is how our target variable was defined.

We can also evaluate each model by the number of banks that it classifies correctly. First, we copy the prediction data frames:

decisions_train <- summary_models_train
decisions_test <- summary_models_test

Now, in these new data frames, each bank is classified as solvent or failed by applying the same cutoff to every model's predicted probability:

for (m in 3:ncol(decisions_train))
{
  decisions_train[,m] <- ifelse(decisions_train[,m] > 0.04696094, 1, 0)
  decisions_test[,m] <- ifelse(decisions_test[,m] > 0.04696094, 1, 0)
}
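A note on the cutoff: 0.04696094 was derived earlier in the book, and it coincides with the proportion of failed banks in the training sample (333 failures among the 7,091 banks that appear in the confusion tables of this section), that is, the observed default rate:

```r
# 333 failed banks out of 7,091 training banks reproduces the cutoff used above
cutoff <- 333 / 7091
print(round(cutoff, 8))  # 0.04696094
```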

Next, a function is created that counts the correctly and incorrectly classified banks:

accuracy_function <- function(dataframe, observed, predicted)
{
  # Cross-tabulate predictions against observations; with the classes
  # ordered 0, 1, the flattened table reads TN, FP, FN, TP
  y <- as.vector(table(dataframe[,predicted], dataframe[,observed]))
  names(y) <- c("TN", "FP", "FN", "TP")
  return(y)
}

Running this function summarizes the performance of each model. First, it is applied to the training sample:

print("Accuracy GLM model:")
## [1] "Accuracy GLM model:"
accuracy_function(decisions_train,"Default","GLM")
## TN FP FN TP
## 6584 174 9 324

print("Accuracy RF model:")
## [1] "Accuracy RF model:"
accuracy_function(decisions_train,"Default","RF")
## TN FP FN TP
## 6608 150 0 333

print("Accuracy GBM model:")
## [1] "Accuracy GBM model:"
accuracy_function(decisions_train,"Default","GBM")
## TN FP FN TP
## 6758 0 0 333

print("Accuracy deep model:")
## [1] "Accuracy deep model:"
accuracy_function(decisions_train,"Default","deep")
## TN FP FN TP
## 6747 11 104 229

print("Accuracy SVM model:")
## [1] "Accuracy SVM model:"
accuracy_function(decisions_train,"Default","SVM")
## TN FP FN TP
## 6614 144 7 326

Then, we can see the results of the different models in our test sample:

print("Accuracy GLM model:")
## [1] "Accuracy GLM model:"
accuracy_function(decisions_test,"Default","GLM")
## TN FP FN TP
## 2818 78 8 135

print("Accuracy RF model:")
## [1] "Accuracy RF model:"
accuracy_function(decisions_test,"Default","RF")
## TN FP FN TP
## 2753 143 5 138

print("Accuracy GBM model:")
## [1] "Accuracy GBM model:"
accuracy_function(decisions_test,"Default","GBM")
## TN FP FN TP
## 2876 20 15 128

print("Accuracy deep model:")
## [1] "Accuracy deep model:"
accuracy_function(decisions_test,"Default","deep")
## TN FP FN TP
## 2886 10 61 82

print("Accuracy SVM model:")
## [1] "Accuracy SVM model:"
accuracy_function(decisions_test,"Default","SVM")
## TN FP FN TP
## 2828 68 8 135
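Because accuracy_function() returns a named TN/FP/FN/TP vector, the usual classification metrics can be derived directly from it. The following helper is not part of the book's code, just an illustrative sketch applied to the RF test-sample counts above:

```r
# Illustrative helper (not from the book): derive metrics from the
# TN/FP/FN/TP vector produced by accuracy_function()
classification_metrics <- function(counts)
{
  with(as.list(counts), c(
    accuracy = (TP + TN) / (TP + TN + FP + FN),
    sensitivity = TP / (TP + FN),  # share of failed banks detected
    precision = TP / (TP + FP)     # share of alerts that are real failures
  ))
}

# RF confusion counts on the test sample
round(classification_metrics(c(TN = 2753, FP = 143, FN = 5, TP = 138)), 3)
# accuracy 0.951, sensitivity 0.965, precision 0.491
```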

According to the tables for the test sample, RF detects the most failed banks (138 true positives), but it also misclassifies 143 solvent banks as failed, producing the most false alerts.

Let's also measure how correlated the predictions of the different models are:

correlations<-cor(summary_models_train[,3:ncol(summary_models_train)], use="pairwise", method="pearson")

print(correlations)
## GLM RF GBM deep SVM
## GLM 1.0000000 0.9616688 0.9270350 0.8010252 0.9910695
## RF 0.9616688 1.0000000 0.9876728 0.7603979 0.9719735
## GBM 0.9270350 0.9876728 1.0000000 0.7283464 0.9457436
## deep 0.8010252 0.7603979 0.7283464 1.0000000 0.7879191
## SVM 0.9910695 0.9719735 0.9457436 0.7879191 1.0000000

It might be interesting to combine the results of the different models to obtain a better one. This is where the concept of ensembles comes in handy. An ensemble is a technique that combines different algorithms into a more robust model, one that incorporates the predictions of all the base learners and can achieve higher accuracy than any of the individual models on its own. In fact, some of the models we have already developed, such as random forest and the Gradient Boosting Machine (GBM), are themselves ensembles. There are many ways to build an ensemble; in this section, we will look at several alternatives, from the simplest to the more complex.
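As a first taste, the simplest possible ensemble just averages the predicted probabilities of the base learners and applies the same cutoff as before. A minimal sketch (the helper name is ours, not the book's):

```r
# Simple averaging ensemble: mean of the base learners' probabilities,
# thresholded with the same cutoff used for the individual models
average_ensemble <- function(preds, cutoff = 0.04696094)
{
  ifelse(rowMeans(preds) > cutoff, 1, 0)
}

# Applied to the book's data frames, the model columns start at column 3:
# ensemble_test <- average_ensemble(summary_models_test[, 3:ncol(summary_models_test)])
```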
