How it works...

In Step 2, we loaded the dataset into Python using the read_csv function. We only indicated which column contains the index and what symbol represents the missing values.
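For reference, the call looks roughly like the following. The column names, index column, and missing-value symbol here are illustrative assumptions, demonstrated on a small in-memory CSV rather than the recipe's actual file:

```python
import io
import pandas as pd

# Hypothetical data: the real file, index column, and NA symbol
# used in the recipe may differ.
raw = io.StringIO(
    "ID,limit_bal,sex,default\n"
    "1,20000,F,0\n"
    "2,?,M,1\n"
)

# index_col selects the index column; na_values tells pandas
# which symbol should be interpreted as a missing value
df = pd.read_csv(raw, index_col="ID", na_values="?")
```

After loading, the `?` entry appears as `NaN`, which the later preprocessing steps can then handle.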

In Step 3, we identified the dependent variable (the target), as well as both numerical and categorical features. To do so, we used the select_dtypes method and indicated which data type we wanted to extract. We stored the features in lists. We also had to remove the dependent variable from the list containing the numerical features. Lastly, we created a list containing all the transformations we wanted to apply to the data. We selected the following:

  • FillMissing: Missing values will be filled using the median of the feature's values. In the case of categorical variables, missing values become a separate category.
  • Categorify: Converts categorical features to pandas' category data type, so each category is represented by an integer code.
  • Normalize: Features' values are transformed such that they have zero mean and unit variance. This makes training neural networks easier.

It is important to note that the same transformations will be applied to both the training and validation set. Also, to prevent data leakage, the transformations are based solely on the training set.
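The logic of fitting the transformations on the training set only, and then applying them to both sets, can be sketched in plain Python (toy data, no fastai):

```python
from statistics import median, mean, stdev

# Toy data; None marks a missing value
train_age = [25, 30, None, 40]
valid_age = [None, 35]

# FillMissing: the TRAINING median fills gaps in both sets
med = median(v for v in train_age if v is not None)
train_filled = [med if v is None else v for v in train_age]
valid_filled = [med if v is None else v for v in valid_age]

# Categorify: map categories to integer codes using the training vocabulary
train_sex = ["F", "M", "M", "F"]
codes = {c: i for i, c in enumerate(sorted(set(train_sex)))}
valid_sex_codes = [codes.get(c) for c in ["M", "F"]]

# Normalize: zero mean, unit variance, using TRAINING mean/std only
mu, sd = mean(train_filled), stdev(train_filled)
train_norm = [(v - mu) / sd for v in train_filled]
valid_norm = [(v - mu) / sd for v in valid_filled]
```

Because the validation set never contributes to the median, the category vocabulary, or the mean and standard deviation, no information leaks from it into training.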

In Step 4, we created the TabularDataBunch, which handles all the preprocessing, splitting, and batching of the input data. To do so, we chained multiple methods of TabularList. First, we loaded the data from a pandas DataFrame using the from_df method. While doing so, we passed the source DataFrame, numerical/categorical features, and the preprocessing steps we wanted to carry out. Second, we split the data into training and validation sets. For this case, we used the split_by_rand_pct method, indicated the percentage of all observations intended for the validation set, and set the seed (for reproducibility). For cases with severe class imbalance, please refer to tips in the There's more section. Third, we indicated the label using the label_from_df method. Lastly, we created the DataBunch using the databunch method. By default, it will create batches containing 64 observations.
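The splitting and batching behavior can be mimicked in plain Python. fastai's internals differ; this sketch only mirrors the idea of a seeded random split followed by batches of 64:

```python
import random

# Assumed sizes for illustration; the recipe's actual dataset differs
n_obs, valid_pct, seed, batch_size = 1000, 0.2, 42, 64

# Seeded shuffle makes the split reproducible
rng = random.Random(seed)
idx = list(range(n_obs))
rng.shuffle(idx)

# First valid_pct of the shuffled indices form the validation set
cut = int(n_obs * valid_pct)
valid_idx, train_idx = idx[:cut], idx[cut:]

# The DataBunch then serves the training set in batches of 64
batches = [train_idx[i:i + batch_size]
           for i in range(0, len(train_idx), batch_size)]
```

Note that the last batch may be smaller than 64 when the training set size is not a multiple of the batch size.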

In Step 5, we defined the learner using tabular_learner. This is the place where we defined the network's architecture. We decided to use a network with two hidden layers, with 1,000 and 500 neurons, respectively. Choosing the network's architecture can often be considered an art and may require a significant amount of trial and error. Another popular approach is to use the architecture that worked before for someone else.
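To get a sense of the model's size, we can count the trainable parameters of the fully connected part. The 63 input features come from the model summary discussed in Step 6; the 2 output classes are an assumption for this binary classification task, and embedding parameters are excluded:

```python
# Layer widths: input -> hidden 1 -> hidden 2 -> output
layers = [63, 1000, 500, 2]

# Each linear layer has n_in * n_out weights plus n_out biases
params = sum(n_in * n_out + n_out
             for n_in, n_out in zip(layers, layers[1:]))
```

Even this modest architecture has over half a million parameters, which is one reason regularization (discussed next) matters.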

As in the case of machine learning, it is crucial to prevent overfitting with neural networks. We want the networks to be able to generalize to new data. Some of the popular techniques used for fighting overfitting include the following:

  • Weight decay: Each time the weights are updated, they are multiplied by a factor smaller than 1 (a rule of thumb is to use values between 0.01 and 0.1).
  • Dropout: While training the NN, some activations (a fraction indicated by the ps hyperparameter) are randomly dropped for each mini-batch. Dropout can also be used for the concatenated vector of embeddings of categorical features (controlled via the emb_drop hyperparameter).
  • Batch normalization: This technique reduces overfitting by making sure that a small number of outlying inputs cannot have too much impact on the trained network.
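The weight decay rule from the first bullet can be illustrated in a few lines; the gradient is set to zero here purely to isolate the decay effect:

```python
# On every update the weight is additionally shrunk
# by a factor (1 - lr * wd) < 1
lr, wd = 0.01, 0.1
w, grad = 5.0, 0.0   # zero gradient isolates the decay

for _ in range(100):
    w = w * (1 - lr * wd) - lr * grad
```

With no gradient signal, the weight steadily shrinks toward zero, which is exactly the pressure that discourages large weights during real training.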

While defining the learner, we specified the percentage of neurons to be dropped out. Additionally, we provided a selection of metrics we wanted to inspect. We chose recall, F1-score (FBeta(beta=1)), and the F-beta score (FBeta(beta=5)).

While the F1 score is the harmonic mean of precision and recall, the F-beta score is a weighted harmonic mean of the same metrics. The beta parameter determines the weight of recall in the total score. Values of beta lower than 1 place more importance on precision, while beta > 1 gives more weight to recall. The best value of the F-beta score is 1, and the worst is 0.
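The formula can be expressed as a small helper function:

```python
def fbeta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With beta = 1 this reduces to the ordinary F1 score; with beta = 5, a model with high recall scores much better than one with the same values of precision and recall swapped.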

In Step 6, we inspected the model's architecture. In the output, we first saw the categorical embeddings and the corresponding dropout. Then, in the (layers) section, we saw the input layer (63 input, 1,000 output features), followed by the ReLU (Rectified Linear Unit) activation function, batch normalization, and dropout. The same steps were repeated for the second hidden layer and then the last linear layer produced the class probabilities.

In Step 7, we tried to determine the "good" learning rate. fastai provides a convenience method, lr_find, which facilitates the process. It begins to train the network while increasing the learning rate – it starts from a very low one and increases to a very large one. Then, we ran learn.recorder.plot(suggestion=True) to plot the losses against the learning rates, together with the suggested value. We should aim for a value that is before the minimum, but where the loss still improves (decreases). 
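The learning rates tried by lr_find grow exponentially between two bounds. A sketch of such a schedule follows; the start/end values and step count here are assumptions, not fastai's exact defaults:

```python
# Exponentially spaced learning rates from start_lr to end_lr
start_lr, end_lr, n_steps = 1e-7, 10.0, 100

lrs = [start_lr * (end_lr / start_lr) ** (i / (n_steps - 1))
       for i in range(n_steps)]
```

Plotting the loss recorded at each of these rates is what produces the curve from which the suggestion is read.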

In Step 8, we trained the neural network using the fit method of the learner. We briefly describe the training algorithm. The entire training set is divided into batches (default size of 64). For each batch, the network is used to make predictions, which are compared to the target values and used to calculate the error. Then, the error is used to update the weights in the network. An epoch is a complete run through all the batches, in other words, using the entire dataset for training.
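The epoch/batch structure can be mocked with a one-parameter model fitted by mini-batch gradient descent on squared error (this is not fastai's optimizer, only the shape of the loop):

```python
# Toy task: learn w in y = w * x; the true weight is 3.0
data = [(x, 3.0 * x) for x in range(1, 9)]
batch_size, lr, w = 4, 0.01, 0.0

for epoch in range(50):                        # one epoch = all batches
    for i in range(0, len(data), batch_size):  # one update per batch
        batch = data[i:i + batch_size]
        # gradient of mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad
```

After enough epochs the estimate converges to the true weight, mirroring how repeated passes over the batches gradually reduce the loss.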

Without going into too much detail, by default, fastai uses the cross-entropy loss function (for classification tasks) and Adam (Adaptive Moment Estimation) as the optimizer. The reported training and validation losses come from the loss function and the evaluation metrics (such as recall) are not used in the training procedure.
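For a single observation, cross-entropy reduces to the negative log-probability the model assigned to the true class:

```python
from math import log

def cross_entropy(probs, target):
    """Negative log-probability assigned to the true class."""
    return -log(probs[target])
```

A confident correct prediction yields a loss near zero, while a confident wrong one is heavily penalized, which is why the loss curves fall as the network improves.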

In our case, we trained the network for 25 epochs. We additionally specified the learning rate and weight decay. Then, we plotted the training and validation loss over batches. 

In Step 10, we used the get_preds method to obtain the validation set predictions (preds_valid). As preds_valid contains predicted class probabilities, we had to use the argmax method to convert them into class labels.

To extract the predictions for the validation set, we passed DatasetType.Valid as the ds_type argument in the get_preds method. Using this approach, we can also extract the predictions for the training set (DatasetType.Train) and test set (by passing DatasetType.Test if it was specified previously).
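The conversion from per-class probabilities to class labels is a row-wise argmax, illustrated here on made-up probabilities:

```python
# Each row holds the predicted probability of class 0 and class 1
preds_valid = [[0.8, 0.2],
               [0.3, 0.7],
               [0.6, 0.4]]

# argmax: pick the index of the largest probability in each row
labels = [max(range(len(p)), key=p.__getitem__) for p in preds_valid]
```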

In Step 11, we used the ClassificationInterpretation class to extract performance evaluation metrics from the trained network. We plotted the confusion matrix using the plot_confusion_matrix method. With the created object, we can also look at the observations with the highest loss – the plot_tab_top_losses method.
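A confusion matrix like the one plotted here is just a table of counts, with rows for actual classes and columns for predicted ones:

```python
# Made-up labels for a binary problem
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# cm[actual][predicted] counts each (actual, predicted) pair
cm = [[0, 0], [0, 0]]
for t, p in zip(y_true, y_pred):
    cm[t][p] += 1
```

The diagonal holds the correctly classified observations; the off-diagonal cells are the false positives and false negatives.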

In the last step, we used the slightly modified performance_evaluation_report function (the convenience function defined in Chapter 8, Identifying Credit Default with Machine Learning) to recover evaluation metrics such as precision and recall.
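For completeness, precision and recall can be recovered directly from raw prediction counts in the style such a helper function uses (made-up labels, with class 1 treated as the positive "default" class):

```python
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 1, 1, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)  # of predicted defaults, how many were real
recall = tp / (tp + fn)     # of real defaults, how many were caught
```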
