Identifying Credit Default with Machine Learning

In recent years, machine learning has become increasingly popular for solving traditional business problems. New algorithms that beat the current state of the art are published regularly. It is only natural for businesses in all industries to try to leverage machine learning in their core operations.

Before specifying the problem, we provide a brief introduction to the field of machine learning. Machine learning can be broken down into two main areas: supervised learning and unsupervised learning. In the former, we have a target variable (label), which we try to predict as accurately as possible. In the latter, there is no target, and we use different techniques to draw insights from the data. An example of unsupervised learning is clustering, which is often used for customer segmentation.

We can further break down supervised problems into regression problems (where a target variable is a continuous number, such as income or the price of a house) and classification problems (where the target is a class, either binary or multi-class).
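The taxonomy above can be sketched in a few lines of Python. This is only an illustration: the helper name infer_task_type and its rule of thumb (floating-point targets mean regression, anything else is treated as class labels) are assumptions made for this sketch, not part of any library.

```python
def infer_task_type(target=None):
    """Name the kind of learning problem implied by a target (hypothetical helper).

    Crude heuristic used only for illustration:
    - no target at all      -> unsupervised learning
    - any float in target   -> regression (continuous target)
    - exactly two classes   -> binary classification
    - any other label set   -> multi-class classification
    """
    if target is None:
        return "unsupervised"
    values = list(target)
    if any(isinstance(value, float) for value in values):
        return "regression"
    if len(set(values)) == 2:
        return "binary classification"
    return "multi-class classification"


# Example targets (made-up values).
print(infer_task_type())                          # clustering: no label
print(infer_task_type([250000.0, 310500.5]))      # house prices
print(infer_task_type([0, 1, 1, 0]))              # default / no default
print(infer_task_type(["low", "mid", "high"]))    # three classes
```

In practice, the task type follows from the business question rather than from the data type of the label, so a heuristic like this is only a memory aid.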

In this chapter, we tackle a binary classification problem set in the financial industry. We work with a dataset contributed to the UCI Machine Learning Repository (a very popular data repository).

The dataset used in this chapter was collected at a Taiwanese bank in October 2005. The study was motivated by the fact that, at that time, more and more banks were extending cash and credit card credit to willing customers. On top of that, a growing number of people, regardless of their repayment capabilities, accumulated significant amounts of debt, which in turn resulted in defaults.

The goal of the study was to use some basic information about customers (such as gender, age, and education level), together with their past repayment history, to predict who was likely to default. The setting can be described as follows: using the previous 6 months of repayment history (April to September 2005), we try to predict whether the customer will default in October 2005.
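The setting described above can be sketched as a split between features (everything known by the end of September) and the target (default in October). All field names below, such as repayment_status and defaulted_oct, are hypothetical and chosen for this sketch; they are not the column names of the actual dataset.

```python
# Months whose repayment history may be used as features (assumed labels).
FEATURE_MONTHS = ("apr", "may", "jun", "jul", "aug", "sep")


def to_features_and_target(customer):
    """Split one customer record into (features, target).

    Only information available before October goes into the features,
    so the target month never leaks into the model inputs.
    """
    features = {
        "age": customer["age"],
        "education": customer["education"],
    }
    for month in FEATURE_MONTHS:
        features[f"repayment_status_{month}"] = customer["repayment_status"][month]
    target = int(customer["defaulted_oct"])  # 1 = default, 0 = no default
    return features, target


# A made-up customer record for illustration.
example_customer = {
    "age": 34,
    "education": "university",
    "repayment_status": {"apr": 0, "may": 0, "jun": 1, "jul": 2, "aug": 2, "sep": 3},
    "defaulted_oct": True,
}

features, target = to_features_and_target(example_customer)
```

Keeping the October outcome out of the feature dictionary is the point of the sketch: the model may only see what the bank would have known at prediction time.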

By the end of this chapter, you will be familiar with a real-life approach to a machine learning task, from gathering and cleaning data to building and tuning a classifier. Another takeaway is understanding the general approach to machine learning projects, which can then be applied to many different tasks, be it churn prediction or estimating the price of new real estate in a neighborhood.

In this chapter, we focus on the following recipes:

  • Loading data and managing data types
  • Exploratory data analysis
  • Splitting data into training and test sets
  • Dealing with missing values
  • Encoding categorical variables
  • Fitting a decision tree classifier
  • Implementing scikit-learn's pipelines
  • Tuning hyperparameters using grid search and cross-validation 
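
As a rough preview of how these recipes fit together, the following sketch chains them on a small synthetic dataset using scikit-learn. The data, the column names, and the parameter grid are assumptions made up for this example; the recipes themselves work with the real dataset and may make different choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the credit data (hypothetical columns).
rng = np.random.default_rng(42)
n_rows = 200
X = pd.DataFrame({
    "age": rng.integers(21, 70, size=n_rows).astype(float),
    "education": rng.choice(["high_school", "university", "graduate"], size=n_rows),
    "months_late": rng.integers(0, 4, size=n_rows).astype(float),
})
# Knock out a few ages to mimic missing values.
X.loc[rng.choice(n_rows, size=10, replace=False), "age"] = np.nan
# Binary target: customers who are often late tend to default.
y = (X["months_late"] + rng.normal(0, 0.5, size=n_rows) > 1.5).astype(int)

# Split into training and test sets, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Impute numeric columns, one-hot encode the categorical column.
preprocessor = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "months_late"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["education"]),
])

# Preprocessing and the decision tree live in a single pipeline.
pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# Tune the tree depth with grid search and 5-fold cross-validation.
search = GridSearchCV(pipeline, {"tree__max_depth": [2, 3, 5]}, cv=5)
search.fit(X_train, y_train)

test_score = search.score(X_test, y_test)  # accuracy on the held-out test set
print(search.best_params_, round(test_score, 3))
```

Wrapping the preprocessing steps in the pipeline means the imputer and encoder are refit on each cross-validation fold, so no information from the validation folds leaks into training.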