Deep learning for tabular data

Deep learning is not often associated with tabular data, because this kind of data raises several challenges:

  • How to represent features in a way that a neural network can understand? In tabular data, we often deal with both numerical and categorical features, so we need to represent both types of inputs correctly.
  • How to capture feature interactions, both among the features themselves and between the features and the target?
  • How to effectively sample the data? Tabular datasets tend to be smaller than the datasets typically used for computer vision or NLP problems. There is no easy way to augment them, as there is with random cropping or rotation of images. There is also no large, general-purpose tabular dataset with universal properties that could serve as a basis for transfer learning.
  • How to interpret the neural network's decisions?

That is why practitioners tend to use traditional machine learning approaches (often based on some kind of gradient boosted trees) for such tasks. In this recipe, we show how to use deep learning for tabular data. To do so, we use the popular fastai library, which is built on top of PyTorch.

Some of the benefits of working with the fastai library:

  • It provides a selection of APIs that greatly simplify working with Artificial Neural Networks (ANNs), from loading and batching the data to training the model.
  • It incorporates best practices for applying deep learning to various tasks such as image classification, NLP, and tabular data (both classification and regression problems).
  • It handles the data preprocessing automatically – we just need to define which operations we want to apply.
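
To make the last point more concrete, the following is a minimal sketch of the kind of operations we would typically ask for: filling missing values, encoding categories as integers, and normalizing numerical columns (fastai exposes such steps as the FillMissing, Categorify, and Normalize transforms). The sketch uses plain pandas rather than fastai, and the toy DataFrame, its column names, and the exact conventions (median imputation, index 0 reserved for missing categories) are assumptions made for illustration, not a description of fastai's internals.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are invented for illustration.
df = pd.DataFrame({
    "education": ["university", "high school", None, "university"],
    "age": [25.0, 40.0, np.nan, 31.0],
})

# 1. Fill missing numerical values with the median and keep an indicator column.
df["age_na"] = df["age"].isna()
df["age"] = df["age"].fillna(df["age"].median())

# 2. Encode categories as integers. pandas assigns -1 to missing values,
#    so adding 1 reserves index 0 for "missing".
df["education"] = df["education"].astype("category").cat.codes + 1

# 3. Standardize the numerical column to zero mean and unit variance.
df["age"] = (df["age"] - df["age"].mean()) / df["age"].std(ddof=0)
```

With a library such as fastai, we would only list the desired operations and let the library apply them consistently to the training and validation sets.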

Another advantage of using a deep learning approach is that it typically requires less manual feature engineering and domain knowledge.

What makes fastai stand out is the use of Entity Embedding (embedding layers) for categorical data. By using it, the model can learn potentially meaningful relationships between the values of a categorical feature. You can think of the embeddings as latent features. For each categorical column, there is a trainable embedding matrix in which each unique value is mapped to its own vector. Thankfully, fastai handles all of this for us.
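
The lookup mechanics behind an embedding layer can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the education column, its values, and the embedding size of 3 are hypothetical, and the matrix here is filled with random numbers and never trained. In an actual network, the matrix is a trainable parameter updated by backpropagation together with the other weights.

```python
import numpy as np

# Hypothetical categorical column (values invented for illustration).
education = ["university", "high school", "graduate school", "university", "other"]

# Map each unique category to an integer index.
categories = sorted(set(education))
cat_to_idx = {cat: idx for idx, cat in enumerate(categories)}
indices = np.array([cat_to_idx[value] for value in education])

# One embedding matrix per categorical column, with one row per unique value.
# The embedding size (3) is an arbitrary choice for this sketch.
rng = np.random.default_rng(seed=42)
embedding_dim = 3
embedding_matrix = rng.normal(size=(len(categories), embedding_dim))

# The lookup replaces each category with its vector of latent features.
embedded = embedding_matrix[indices]
print(embedded.shape)
```

Rows that share a category (the first and fourth observations here) receive the same vector, so whatever the network learns about a category applies to every observation in which it appears.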

In this recipe, we apply deep learning to a classification problem based on the credit card default dataset. We have already used this dataset in Chapter 8, Identifying Credit Card Default with Machine Learning.
