Methods of risk scoring

Having described our business use case and prepared our Apache Spark computing platform, in this section we select the analytical methods, or predictive models, for this risk-scoring machine learning project; that is, we map our risk modelling case to suitable machine learning methods.

To model and predict loan defaults, logistic regression and decision trees are among the most widely used methods. For our exercise, we will use both, but we will focus on logistic regression because, when well developed in combination with decision trees, it can outperform most other methods.

As always, once we have finalized our choice of analytical methods or models, we need to prepare our code, which will be written in R for this chapter.

Logistic regression

Logistic regression measures the relationship between one categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function, which is the cumulative distribution function of the logistic distribution. Logistic regression can be seen as a special case of the Generalized Linear Model (GLM), and thus it is analogous to linear regression.
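As a minimal sketch of the idea, the logistic function maps any real-valued linear predictor onto a probability in (0, 1):

```r
# The logistic (sigmoid) function: the probability implied by a
# linear predictor z = b0 + b1*x1 + ... on the log-odds scale
logistic <- function(z) 1 / (1 + exp(-z))

logistic(0)    # 0.5  -- a linear predictor of 0 gives even odds
logistic(2)    # about 0.88
logistic(-2)   # about 0.12
```

This is why fitted values from a logistic regression can be read directly as estimated default probabilities.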

We have chosen to focus on logistic regression for this real-life use case mainly for two reasons, besides the performance advantage mentioned previously:

  • Logistic regression can be interpreted easily with some simple calculations
  • Most financial corporations have implemented logistic regression in the past, so it is easy for our clients to compare our results against those they have obtained before
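To illustrate the first point, the "simple calculations" are mostly exponentiations of the fitted coefficients. A hedged sketch, assuming a fitted glm object such as the `Model1` used later in this section:

```r
# Interpreting a fitted logistic regression via odds ratios.
# exp() of a coefficient gives the multiplicative change in the
# odds of the outcome for a one-unit increase in that predictor.
odds_ratios <- exp(coef(Model1))
round(odds_ratios, 3)

# For example, an odds ratio of 1.25 for a predictor means each
# one-unit increase multiplies the odds of the modelled outcome
# by 1.25, holding the other predictors fixed.
```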

Preparing coding in R

With R, there are many ways to code for logistic regression.

In the previous chapters, we used the R function glm with the following code:

Model1 <- glm(good_bad ~ ., data = train, family = binomial())

For consistency, we will continue to use the glm function here.
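Once fitted, the model can be turned into risk scores with `predict()`. A minimal sketch, assuming the `train` data frame used above and a hypothetical held-out `test` data frame with the same columns:

```r
# Fit the logistic regression on the training data
Model1 <- glm(good_bad ~ ., data = train, family = binomial())

# type = "response" returns fitted probabilities rather than
# log-odds, which is the natural basis for a risk score
pred_prob <- predict(Model1, newdata = test, type = "response")
head(pred_prob)
```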

Random forest and decision trees

Random forest is an ensemble learning method for classification and regression that builds hundreds or thousands of decision trees at the training stage, and then combines their outputs for the final prediction.

Random forest is quite a popular machine learning method, because its interpretation is intuitive and it usually yields good results with less effort than logistic regression requires. There are many implementations in R, Java, and other languages, so preparation is relatively easy.

As our focus for this project is on logistic regression, Random forest comes in to assist our logistic regression for feature selection and for calculating the importance of features.

As mentioned before, decision trees, in combination with logistic regression, often provide good results. So we bring in decision tree modeling here, and also use the decision tree model for our client to test rule-based solutions, and compare them to our score-based solutions.

Preparing coding

In R, we need to use the R package randomForest, as originally developed by Leo Breiman and Adele Cutler.

To estimate a random forest model, we can use the following R code, which uses the training data and 2,000 trees:

library(randomForest)
Model2 <- randomForest(default ~ ., data=train, importance=TRUE, ntree=2000)

Once the model is estimated, we can use the getTree and importance functions to inspect the results.
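A short sketch of that inspection step, assuming the `Model2` object fitted above:

```r
library(randomForest)

# importance() returns per-variable importance measures, such as
# mean decrease in accuracy and mean decrease in Gini impurity;
# these are what we use to guide feature selection
imp <- importance(Model2)
head(imp)

# varImpPlot() draws the same information as a dot chart
varImpPlot(Model2)

# getTree() extracts a single tree from the ensemble, e.g. the first
head(getTree(Model2, k = 1, labelVar = TRUE))
```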

For decision trees, there are a few ways of coding in R:

library(rpart)
Model3 <- rpart(default ~ ., data = train)
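For the client's rule-based comparison, the fitted tree can be rendered as human-readable split rules. A sketch, assuming a `Model3` fitted with rpart as above:

```r
library(rpart)

# Text listing of the splits and node predictions -- each root-to-leaf
# path reads as an if-then rule the client can test directly
print(Model3)

# Complexity-parameter table, useful for deciding how far to prune
printcp(Model3)

# Plot the tree structure with labelled splits and node counts
plot(Model3)
text(Model3, use.n = TRUE)
```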