Predicting credit scores

In this section, we will explore another data set; this time, in the field of banking and finance. The particular data set in question is known as the German Credit Dataset and is also hosted by the UCI Machine Learning Repository. The link to the data is https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29.

The observations in the data set are loan applications made by individuals at a bank. The goal is to determine whether an application constitutes a high credit risk. The features are described in the following table:

Column name          Type         Definition
checking             Categorical  The status of the existing checking account
duration             Numerical    The duration in months
creditHistory        Categorical  The applicant's credit history
purpose              Categorical  The purpose of the loan
credit               Numerical    The credit amount
savings              Categorical  Savings account/bonds
employment           Categorical  Present employment since
installmentRate      Numerical    The installment rate (as a percentage of disposable income)
personal             Categorical  Personal status and gender
debtors              Categorical  Other debtors/guarantors
presentResidence     Numerical    Present residence since
property             Categorical  The type of property
age                  Numerical    The applicant's age in years
otherPlans           Categorical  Other installment plans
housing              Categorical  The applicant's housing situation
existingBankCredits  Numerical    The number of existing credits at this bank
job                  Categorical  The applicant's job situation
dependents           Numerical    The number of dependents
telephone            Categorical  The status of the applicant's telephone
foreign              Categorical  Foreign worker
risk                 Binary       Credit risk (1 = good, 2 = bad)

First, we will load the data into a data frame called german_raw and provide it with column names that match the previous table:

> german_raw <- read.table("german.data", quote = "\"")
> names(german_raw) <- c("checking", "duration", "creditHistory", "purpose", "credit", "savings", "employment", "installmentRate", "personal", "debtors", "presentResidence", "property", "age", "otherPlans", "housing", "existingBankCredits", "job", "dependents", "telephone", "foreign", "risk")

Note from the table describing the features that we have a lot of categorical features to deal with. For this reason, we will employ dummyVars() once again to create dummy binary variables for these. In addition, we will recode the risk variable, our output, as a factor with level 0 for good credit and level 1 for bad credit:

> library(caret)
> dummies <- dummyVars(risk ~ ., data = german_raw)
> german <- data.frame(predict(dummies, newdata = german_raw), 
                       risk = factor((german_raw$risk - 1)))
> dim(german)
[1] 1000   62
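
The jump from 20 raw input columns to 61 features (plus the risk output) comes from dummy encoding: each level of a factor becomes its own binary column. A small base-R analogue of what dummyVars() does, using a hypothetical toy data frame rather than the German data, illustrates this:

```r
# Toy data frame (hypothetical, not part of the German data set)
toy <- data.frame(purpose = factor(c("car", "education", "car", "business")),
                  credit = c(1000, 2500, 1800, 4000))

# Removing the intercept makes model.matrix() emit one 0/1 indicator
# column per factor level, mirroring dummyVars() with its defaults
expanded <- model.matrix(~ purpose + credit - 1, data = toy)
colnames(expanded)
# A factor with k levels contributes k indicator columns, which is why
# the 20 raw inputs expand to 61 features after dummy encoding
```

Exactly one of the three purpose indicator columns is set to 1 in each row, while the numerical credit column passes through unchanged.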

As a result of this processing, we now have a data frame with 61 features because several of the categorical input features had many levels. Next, we will partition our data into training and test sets:

> set.seed(977)
> german_sampling_vector <- createDataPartition(german$risk, 
                                      p = 0.80, list = FALSE)
> german_train <- german[german_sampling_vector,]
> german_test <- german[-german_sampling_vector,]

One particularity of this data set, mentioned on the website, is that the data comes from a scenario in which the two types of error have different costs. Specifically, misclassifying a high-risk customer as a low-risk customer is five times more costly for the bank than misclassifying a low-risk customer as a high-risk customer. This is understandable: in the first case, the bank stands to lose a lot of money on a loan that cannot be repaid, whereas in the second case, the bank merely misses out on the interest a repaid loan would have yielded.

The svm() function has a class.weights parameter, which we can use to specify the cost of misclassifying observations belonging to each class. This is how we will incorporate our asymmetric error cost information into our model. First, we'll create a vector of class weights, noting that we need to specify names that correspond to the output factor levels. Then, we will use the tune() function to train various SVM models with a radial kernel:

> class_weights <- c(1, 5)
> names(class_weights) <- c("0", "1")
> class_weights
0 1
1 5
> set.seed(2423)
> german_radial_tune <- tune(svm, risk ~ ., data = german_train, 
  kernel = "radial", ranges = list(cost = c(0.01, 0.1, 1, 10, 100), 
  gamma = c(0.01, 0.05, 0.1, 0.5, 1)), class.weights = class_weights)
> german_radial_tune$best.parameters
  cost gamma
9   10  0.05
> german_radial_tune$best.performance
[1] 0.26

The suggested best model has cost = 10 and gamma = 0.05, with a cross-validation error of 0.26, that is, 74 percent accuracy. Let's see how this model fares on our test data set:

> german_model <- german_radial_tune$best.model
> test_predictions <- predict(german_model, german_test[,1:61])
> mean(test_predictions == german_test[,62])
[1] 0.735
> table(predicted = test_predictions, actual = german_test[,62])
         actual
predicted   0   1
        0 134  47
        1   6  13
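
Given the 5:1 cost assumption, we can translate this confusion matrix into an overall misclassification cost. The following is a rough sketch; the cost matrix entries follow the data set's documentation:

```r
# The test-set confusion matrix above: rows are predictions, columns
# are actual classes (0 = low risk, 1 = high risk)
confusion <- matrix(c(134, 6, 47, 13), nrow = 2,
                    dimnames = list(predicted = c("0", "1"),
                                    actual = c("0", "1")))

# Predicting 0 for an actual 1 (a missed high-risk customer) costs 5;
# the reverse error costs 1; correct predictions cost nothing
costs <- matrix(c(0, 1, 5, 0), nrow = 2)
sum(confusion * costs)  # 47 * 5 + 6 * 1 = 241
```

Dividing by the 200 test observations gives an average cost of about 1.2 per application, a quantity we could use to compare competing models directly in the bank's own units.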

The performance on our test set is 73.5 percent, very close to what we saw in training. Note from the confusion matrix that most of our errors still involve misclassifying high-risk customers as low-risk customers (47), while the reverse error is far rarer (6). These errors take a toll on the overall classification accuracy, which simply computes the ratio of correctly classified observations to the total number of observations. In fact, were we to remove the cost imbalance, we would select a different set of parameters for our model, and our performance, from the perspective of unbiased classification accuracy, would be better:

> set.seed(2423)
> german_radial_tune_unbiased <- tune(svm, risk ~ ., 
  data = german_train, kernel = "radial", ranges = list( 
  cost = c(0.01, 0.1, 1, 10, 100), gamma = c(0.01, 0.05, 0.1, 0.5, 1)))
> german_radial_tune_unbiased$best.parameters
  cost gamma
3    1  0.01
> german_radial_tune_unbiased$best.performance
[1] 0.23875

Of course, this last model will tend to make a greater number of the costly misclassifications of high-risk customers as low-risk customers, which we know is very undesirable. We'll conclude this section with two final thoughts. Firstly, we have used relatively small ranges for the gamma and cost parameters. It is left as an exercise for the reader to rerun our analysis with a greater spread of values for these two in order to see whether we can get even better performance; this will, however, necessarily result in longer training times. Secondly, this particular data set is quite challenging in that its baseline accuracy is 70 percent, because 70 percent of the customers in the data are low-risk (the two output classes are imbalanced). For this reason, the Kappa statistic, which we saw in Chapter 1, Gearing Up for Predictive Modeling, might be a better metric than classification accuracy.
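
Since we already have the test-set confusion matrix, Cohen's Kappa can be sketched directly in a few lines (caret's confusionMatrix() function also reports this statistic):

```r
# Cohen's Kappa from the test-set confusion matrix shown earlier
confusion <- matrix(c(134, 6, 47, 13), nrow = 2)
n <- sum(confusion)

# Observed accuracy versus the accuracy expected by chance given the
# marginal distributions of predictions and actual classes
observed <- sum(diag(confusion)) / n          # 0.735
expected <- sum(rowSums(confusion) * colSums(confusion)) / n ^ 2
kappa <- (observed - expected) / (1 - expected)
kappa  # roughly 0.22, far less flattering than 73.5 percent accuracy
```

A Kappa this low confirms that much of the raw accuracy comes from the 70/30 class imbalance rather than from the model's discriminative power.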
