How it works...

In Step 2, we used the typical scikit-learn approach to train a machine learning model. First, we created an instance of the DecisionTreeClassifier class (using all the default settings). Then, we fitted the model to the training data (passing both the features and the target) using the fit method. Lastly, we obtained the predictions using the predict method.
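A minimal sketch of that flow is shown below; the X_train, y_train, and X_test objects are assumed to come from an earlier train-test split that is not part of this snippet:

    from sklearn.tree import DecisionTreeClassifier

    # create the classifier with the default settings
    decision_tree = DecisionTreeClassifier()

    # fit the model using the training features and target
    # (X_train, y_train, X_test are assumed to exist from an earlier step)
    decision_tree.fit(X_train, y_train)

    # obtain class predictions for the test set
    y_pred = decision_tree.predict(X_test)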

Using the predict method results in an array of predicted classes (in this case, either 0 or 1). However, there are cases in which we are interested in the assigned probabilities instead. To obtain them, we can use the predict_proba method, which returns an n_test_observations x n_classes array in which each row contains all the possible class probabilities (they sum up to 1). In the case of binary classification, the predict method automatically assigns the positive class whenever its probability exceeds 50%.
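The relationship between the two methods can be sketched as follows, reusing the fitted decision_tree and X_test from the previous snippet:

    import numpy as np

    # each row holds the probabilities of class 0 and class 1 (they sum to 1)
    y_pred_prob = decision_tree.predict_proba(X_test)

    # probability of the positive class (second column)
    positive_prob = y_pred_prob[:, 1]

    # for binary classification, predict is equivalent to applying a 50% threshold
    y_pred_manual = (positive_prob > 0.5).astype(int)
    np.array_equal(y_pred_manual, decision_tree.predict(X_test))  # expected: True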

In Step 3, we evaluated the performance of the model. We used a custom function to display all the results. We will not go deeper into its specifics, as it is quite standard and is built using functions from the metrics module of scikit-learn. For a detailed description of the function, please refer to the accompanying GitHub repository.

The confusion matrix summarizes all possible combinations of the predicted values versus the actual values of the target. It has the following structure:

TN | FP
-------
FN | TP

The values are as follows:

  • True positive (TP): The model predicts a default, and the client defaulted.
  • False positive (FP): The model predicts a default, but the client did not default.
  • True negative (TN): The model predicts a good customer, and the client did not default.
  • False negative (FN): The model predicts a good customer, but the client defaulted.

Using these values, we can further build multiple evaluation criteria (a code sketch follows the list):

  • Accuracy ((TP + TN) / (TP + FP + TN + FN))—Measures the model's overall ability to correctly predict the class of the observation.
  • Precision (TP / (TP + FP))—Out of all predictions of the positive class (in our case, the default), how many observations indeed defaulted.
  • Recall (TP / (TP + FN))—Out of all positive cases, how many were predicted correctly. Also called sensitivity or the true positive rate.
  • F-1 Score (2 * precision * recall / (precision + recall))—The harmonic mean of precision and recall. The reason for using the harmonic mean instead of the arithmetic mean is that it punishes extreme outcomes, such as precision = 1 and recall = 0, or vice versa.
  • Specificity (TN / (TN + FP))—Measures what fraction of negative cases (clients without a default) actually did not default.
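These metrics do not have to be computed by hand. A minimal sketch using scikit-learn's metrics module, assuming the true test labels are stored in y_test and reusing y_pred from the earlier snippet:

    from sklearn.metrics import (confusion_matrix, accuracy_score,
                                 precision_score, recall_score, f1_score)

    # scikit-learn returns the confusion matrix as [[TN, FP], [FN, TP]]
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

    print(f"Accuracy:    {accuracy_score(y_test, y_pred):.4f}")
    print(f"Precision:   {precision_score(y_test, y_pred):.4f}")
    print(f"Recall:      {recall_score(y_test, y_pred):.4f}")
    print(f"F-1 Score:   {f1_score(y_test, y_pred):.4f}")
    print(f"Specificity: {tn / (tn + fp):.4f}")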

Understanding the subtleties behind these metrics is very important for a correct evaluation of the model's performance. Accuracy can be highly misleading in the case of class imbalance. Imagine a case in which 99% of the data is not fraudulent and only 1% is fraudulent. Then, a naïve model classifying every observation as non-fraudulent achieves 99% accuracy while actually being worthless. That is why, in such cases, we should refer to precision or recall, as illustrated below. When we try to achieve as high a precision as possible, we get fewer false positives, at the cost of more false negatives. When optimizing for recall, we achieve fewer false negatives, at the cost of more false positives. The metric we optimize for should be selected based on the use case.
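A quick numerical illustration of this pitfall, using a toy 99/1 class split rather than the recipe's dataset:

    import numpy as np
    from sklearn.metrics import accuracy_score, recall_score

    # 990 legitimate (0) and 10 fraudulent (1) observations
    y_true = np.array([0] * 990 + [1] * 10)

    # a naive "model" that always predicts the majority class
    y_naive = np.zeros_like(y_true)

    accuracy_score(y_true, y_naive)  # 0.99, despite the model being worthless
    recall_score(y_true, y_naive)    # 0.0 - it catches none of the fraud cases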

The second plot contains the Receiver Operating Characteristic (ROC) curve. The ROC curve presents a trade-off between the true positive rate and the false positive rate for different probability thresholds. A probability threshold determines the predicted probability above which we decide that the observation belongs to the positive class (by default, it is 50%). The ideal point is (0, 1), and a skillful model's curve would be as close to it as possible. On the other hand, a model with no skill will have a line close to the diagonal (45°) line.
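The curve itself can be recreated with scikit-learn; a minimal sketch, reusing the positive-class probabilities (positive_prob) from the earlier predict_proba snippet:

    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve

    # FPR and TPR for a range of probability thresholds
    fpr, tpr, thresholds = roc_curve(y_test, positive_prob)

    fig, ax = plt.subplots()
    ax.plot(fpr, tpr, label="decision tree")
    ax.plot([0, 1], [0, 1], linestyle="--", label="no skill")
    ax.set(xlabel="False Positive Rate", ylabel="True Positive Rate",
           title="ROC curve")
    ax.legend()
    plt.show()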

A model with a curve below the diagonal line is actually possible, and it is still better than the "no-skill" one, as its predictions can simply be inverted to obtain better performance.

To summarize the performance of a model with a single number, we can look at the area under the ROC curve (AUC). It is a metric with values between 0.5 (no skill) and 1 (ideal model). We can interpret the AUC in probabilistic terms: an AUC of 75% means that, if we randomly pick one positive and one negative observation, there is a 75% probability that the model ranks them correctly, that is, assigns a higher predicted probability to the positive one. This also explains why the lower bound is 0.5, as a random model ranks such a pair correctly only 50% of the time.

Lastly, we may use the ROC curve to select a threshold that results in an appropriate balance between false positives and false negatives.
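The following sketch shows both the AUC calculation and one way of choosing a threshold from the ROC curve; maximizing TPR minus FPR (Youden's J statistic) is only one common heuristic, not something prescribed by the recipe:

    import numpy as np
    from sklearn.metrics import roc_auc_score, roc_curve

    # area under the ROC curve, based on the positive-class probabilities
    auc = roc_auc_score(y_test, positive_prob)

    # pick the threshold that maximizes TPR - FPR (Youden's J statistic)
    fpr, tpr, thresholds = roc_curve(y_test, positive_prob)
    best_threshold = thresholds[np.argmax(tpr - fpr)]

    # apply the chosen threshold instead of the default 50%
    y_pred_tuned = (positive_prob >= best_threshold).astype(int)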

In Step 4, we visualized the decision tree. We started by retraining the tree, this time with a maximum depth of 3, to make sure the image was readable. We used the export_graphviz function to create a dot file containing the visualized tree and captured it using StringIO. This way, we avoided storing a copy of the file on disk. Then, we recovered the file's contents using the getvalue method and created the graph inside the Jupyter Notebook.
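This step roughly corresponds to the sketch below; the feature and class names are placeholders, X_train is assumed to be a pandas DataFrame, and the full version can be found in the accompanying repository:

    from io import StringIO
    from sklearn.tree import DecisionTreeClassifier, export_graphviz
    from IPython.display import Image
    import pydotplus

    # retrain a shallower tree so the plot stays readable
    small_tree = DecisionTreeClassifier(max_depth=3)
    small_tree.fit(X_train, y_train)

    # write the dot representation into an in-memory buffer instead of a file on disk
    dot_data = StringIO()
    export_graphviz(small_tree, out_file=dot_data,
                    feature_names=X_train.columns,        # placeholder feature names
                    class_names=["no default", "default"],  # placeholder class names
                    filled=True, rounded=True)

    # recover the buffer's contents and render the graph inside the notebook
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
    Image(graph.create_png())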
