Having described our business use case and prepared our Apache Spark computing platform, in this section we select the analytical methods, or predictive models, for this risk-scoring machine learning project. That is, we map our risk modelling case to machine learning methods.
Logistic regression and decision trees are among the most widely used methods for modelling and predicting loan defaults. For our exercise, we will use both, but we will focus on logistic regression because, when well developed in combination with decision trees, it can outperform most other methods.
As always, once we finalize our choice of analytical methods or models, we need to prepare our code, which in this chapter will be written in R.
Logistic regression measures the relationship between a categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function, the cumulative distribution function of the logistic distribution. Logistic regression can be seen as a special case of the Generalized Linear Model (GLM), and is thus analogous to linear regression.
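As a minimal sketch of fitting logistic regression as a GLM in R (assuming a data frame train with a binary default column, as used in the model fits later in this section, and a hypothetical test data frame for scoring):

```r
# Fit a logistic regression of default on all other columns.
# family = binomial with the logit link is the standard GLM form
# of logistic regression.
Model1 <- glm(default ~ ., data = train, family = binomial(link = "logit"))

# Inspect estimated coefficients (log-odds scale).
summary(Model1)

# type = "response" returns predicted default probabilities
# rather than log-odds; these probabilities can serve as risk scores.
probs <- predict(Model1, newdata = test, type = "response")
```

The predicted probabilities are what a score-based risk solution would threshold or bin into score bands.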
We have chosen to focus on logistic regression for this real-life use case mainly for two reasons, besides the performance advantage mentioned previously:
Random forest is an ensemble learning method for classification and regression that builds hundreds or more decision trees at the training stage, and then combines their outputs for the final prediction.
Random forest is quite a popular machine learning method because its interpretation is intuitive and it usually produces good results with less effort than logistic regression requires. There are many implementations of random forest in R, Java, and other languages, so preparation is relatively easy.
As our focus for this project is on logistic regression, random forest is brought in to assist with feature selection and to calculate the importance of features.
As mentioned before, decision trees in combination with logistic regression often provide good results. So we bring in decision tree modelling here, and also use the decision tree model to let our client test rule-based solutions and compare them against our score-based solutions.
In R, we need to use the randomForest package, originally developed by Leo Breiman and Adele Cutler. To estimate a random forest model, we can use the following R code, where we use the training data and 2000 trees:
library(randomForest)
Model2 <- randomForest(default ~ ., data = train, importance = TRUE, ntree = 2000)
Once the model is estimated, we can use the functions getTree and importance to obtain the results.
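As a brief sketch of how those two functions might be used, continuing from the Model2 fit above (the exact columns returned depend on the model type and the randomForest package version):

```r
library(randomForest)

# Per-variable importance measures; because the model was fitted
# with importance = TRUE, this includes the permutation-based
# mean decrease in accuracy as well as the mean decrease in Gini.
imp <- importance(Model2)
print(imp)
varImpPlot(Model2)  # quick visual ranking of features

# Extract the structure of the first tree in the forest as a matrix:
# split variables, split points, and terminal-node predictions.
tree1 <- getTree(Model2, k = 1, labelVar = TRUE)
head(tree1)
```

The importance ranking is what feeds back into feature selection for the logistic regression model.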
For decision trees, there are a few ways of coding them in R; one common option is the rpart package:
library(rpart)
Model3 <- rpart(default ~ ., data = train)
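To let the client read the tree as explicit rules and compare rule-based decisions with score-based ones, a minimal sketch (assuming the Model3 fit above, a hypothetical test data frame, and that default is a factor so rpart fits a classification tree) might look like this:

```r
library(rpart)

# Print the fitted tree; each line is a split rule, so the output
# can be read directly as a rule-based scoring policy.
print(Model3)

# Class predictions from the tree (rule-based decisions)...
tree_pred <- predict(Model3, newdata = test, type = "class")

# ...tabulated against the actual outcomes, for comparison with
# the score-based logistic regression solution.
table(predicted = tree_pred, actual = test$default)
```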
For a good example of running Random forest on Spark, please go to https://spark-summit.org/2014/wp-content/uploads/2014/07/Sequoia-Forest-Random-Forest-of-Humongous-Trees-Sung-Chung.pdf.