Predicting credit scores

In this section, we will explore another data set; this time, in the field of banking and finance. The particular data set in question is known as the German Credit Dataset and is also hosted by the UCI Machine Learning Repository. The link to the data is https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29.

The observations in the data set are loan applications made by individuals at a bank. The goal is to determine whether an application constitutes a high credit risk. The features are described in the following table:

Column name          Type         Definition
checking             Categorical  The status of the existing checking account
duration             Numerical    The duration in months
creditHistory        Categorical  The applicant's credit history
purpose              Categorical  The purpose of the loan
credit               Numerical    The credit amount
savings              Categorical  Savings account/bonds
employment           Categorical  Present employment since
installmentRate      Numerical    The installment rate (as a percentage of disposable income)
personal             Categorical  Personal status and gender
debtors              Categorical  Other debtors/guarantors
presentResidence     Numerical    Present residence since
property             Categorical  The type of property
age                  Numerical    The applicant's age in years
otherPlans           Categorical  Other installment plans
housing              Categorical  The applicant's housing situation
existingBankCredits  Numerical    The number of existing credits at this bank
job                  Categorical  The applicant's job situation
dependents           Numerical    The number of dependents
telephone            Categorical  The status of the applicant's telephone
foreign              Categorical  Foreign worker
risk                 Binary       Credit risk (1 = good, 2 = bad)

First, we will load the data into a data frame called german_raw and provide it with column names that match the previous table:

> german_raw <- read.table("german.data", quote = "\"")
> names(german_raw) <- c("checking", "duration", "creditHistory", "purpose", "credit", "savings", "employment", "installmentRate", "personal", "debtors", "presentResidence", "property", "age", "otherPlans", "housing", "existingBankCredits", "job", "dependents", "telephone", "foreign", "risk")

Note from the table describing the features that we have a lot of categorical features to deal with. For this reason, we will employ dummyVars() once again to create dummy binary variables for these. In addition, we will recode the risk variable, our output, as a factor with level 0 for good credit and level 1 for bad credit:

> library(caret)
> dummies <- dummyVars(risk ~ ., data = german_raw)
> german <- data.frame(predict(dummies, newdata = german_raw), 
                       risk = factor((german_raw$risk - 1)))
> dim(german)
[1] 1000   62
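
The jump from 20 raw input columns to 61 features (plus the risk output) comes from dummy encoding: each level of a factor becomes its own binary column. A small base-R analogue of what dummyVars() does, using a hypothetical toy data frame rather than the German data, illustrates this:

```r
# Toy data frame (hypothetical, not part of the German data set)
toy <- data.frame(purpose = factor(c("car", "education", "car", "business")),
                  credit = c(1000, 2500, 1800, 4000))

# Removing the intercept makes model.matrix() emit one 0/1 indicator
# column per factor level, mirroring dummyVars() with its defaults
expanded <- model.matrix(~ purpose + credit - 1, data = toy)
colnames(expanded)
# A factor with k levels contributes k indicator columns, which is why
# the 20 raw inputs expand to 61 features after dummy encoding
```

Exactly one of the three purpose indicator columns is set to 1 in each row, while the numerical credit column passes through unchanged.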

As a result of this processing, we now have a data frame with 61 features because several of the categorical input features had many levels. Next, we will partition our data into training and test sets:

> set.seed(977)
> german_sampling_vector <- createDataPartition(german$risk, 
                                      p = 0.80, list = FALSE)
> german_train <- german[german_sampling_vector,]
> german_test <- german[-german_sampling_vector,]

One particularity of this data set, mentioned on the website, is that the data comes from a scenario in which the two types of error have different costs. Specifically, misclassifying a high-risk customer as a low-risk customer is five times more costly for the bank than misclassifying a low-risk customer as a high-risk customer. This is understandable: in the first case, the bank stands to lose a lot of money on a loan that cannot be repaid, whereas in the second case, the bank merely misses out on the interest a repaid loan would have yielded.

The svm() function has a class.weights parameter, which we can use to specify the cost of misclassifying observations belonging to each class. This is how we will incorporate our asymmetric error cost information into our model. First, we'll create a vector of class weights, noting that we need to specify names that correspond to the output factor levels. Then, we will use the tune() function to train various SVM models with a radial kernel:

> class_weights <- c(1, 5)
> names(class_weights) <- c("0", "1")
> class_weights
0 1
1 5
> set.seed(2423)
> german_radial_tune <- tune(svm, risk ~ ., data = german_train, 
  kernel = "radial", ranges = list(cost = c(0.01, 0.1, 1, 10, 100), 
  gamma = c(0.01, 0.05, 0.1, 0.5, 1)), class.weights = class_weights)
> german_radial_tune$best.parameters
  cost gamma
9   10  0.05
> german_radial_tune$best.performance
[1] 0.26

The suggested best model has cost = 10 and gamma = 0.05, with a cross-validation error of 0.26, that is, 74 percent accuracy. Let's see how this model fares on our test data set:

> german_model <- german_radial_tune$best.model
> test_predictions <- predict(german_model, german_test[,1:61])
> mean(test_predictions == german_test[,62])
[1] 0.735
> table(predicted = test_predictions, actual = german_test[,62])
         actual
predicted   0   1
        0 134  47
        1   6  13
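
Given the 5:1 cost assumption, we can translate this confusion matrix into an overall misclassification cost. The following is a rough sketch; the cost matrix entries follow the data set's documentation:

```r
# The test-set confusion matrix above: rows are predictions, columns
# are actual classes (0 = low risk, 1 = high risk)
confusion <- matrix(c(134, 6, 47, 13), nrow = 2,
                    dimnames = list(predicted = c("0", "1"),
                                    actual = c("0", "1")))

# Predicting 0 for an actual 1 (a missed high-risk customer) costs 5;
# the reverse error costs 1; correct predictions cost nothing
costs <- matrix(c(0, 1, 5, 0), nrow = 2)
sum(confusion * costs)  # 47 * 5 + 6 * 1 = 241
```

Dividing by the 200 test observations gives an average cost of about 1.2 per application, a quantity we could use to compare competing models directly in the bank's own units.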

The performance on our test set is 73.5 percent, very close to what we saw in training. Note from the confusion matrix that most of our errors still involve misclassifying high-risk customers as low-risk customers (47), while the reverse error is far rarer (6). These errors take a toll on the overall classification accuracy, which simply computes the ratio of correctly classified observations to the total number of observations. In fact, were we to remove the cost imbalance, we would select a different set of parameters for our model, and our performance, from the perspective of unbiased classification accuracy, would be better:

> set.seed(2423)
> german_radial_tune_unbiased <- tune(svm, risk ~ ., 
  data = german_train, kernel = "radial", ranges = list( 
  cost = c(0.01, 0.1, 1, 10, 100), gamma = c(0.01, 0.05, 0.1, 0.5, 1)))
> german_radial_tune_unbiased$best.parameters
  cost gamma
3    1  0.01
> german_radial_tune_unbiased$best.performance
[1] 0.23875

Of course, this last model will tend to make a greater number of the costly misclassifications of high-risk customers as low-risk customers, which we know is very undesirable. We'll conclude this section with two final thoughts. Firstly, we have used relatively small ranges for the gamma and cost parameters. It is left as an exercise for the reader to rerun our analysis with a greater spread of values for these two in order to see whether we can get even better performance; this will, however, necessarily result in longer training times. Secondly, this particular data set is quite challenging in that its baseline accuracy is 70 percent, because 70 percent of the customers in the data are low-risk (the two output classes are imbalanced). For this reason, the Kappa statistic, which we saw in Chapter 1, Gearing Up for Predictive Modeling, might be a better metric than classification accuracy.
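
Since we already have the test-set confusion matrix, Cohen's Kappa can be sketched directly in a few lines (caret's confusionMatrix() function also reports this statistic):

```r
# Cohen's Kappa from the test-set confusion matrix shown earlier
confusion <- matrix(c(134, 6, 47, 13), nrow = 2)
n <- sum(confusion)

# Observed accuracy versus the accuracy expected by chance given the
# marginal distributions of predictions and actual classes
observed <- sum(diag(confusion)) / n          # 0.735
expected <- sum(rowSums(confusion) * colSums(confusion)) / n ^ 2
kappa <- (observed - expected) / (1 - expected)
kappa  # roughly 0.22, far less flattering than 73.5 percent accuracy
```

A Kappa this low confirms that much of the raw accuracy comes from the 70/30 class imbalance rather than from the model's discriminative power.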
