Andre Ye and Zian Wang

Modern Deep Learning for Tabular Data

Novel Approaches to Common Modeling Problems

Andre Ye
Seattle, WA, USA
Zian Wang
Redmond, WA, USA
ISBN 978-1-4842-8691-3
e-ISBN 978-1-4842-8692-0
© Andre Ye and Zian Wang 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Apress imprint is published by the registered company APress Media, LLC, part of Springer Nature.

The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.

To the skeptics and the adventurers alike

Foreword 1

Tabular data is the most common data type in real-world AI, but until recently, tabular data problems were almost solely tackled with tree-based models. This is because tree-based models are representationally efficient for tabular data, inherently interpretable, and fast to train. In addition, traditional deep neural network (DNN) architectures are not suitable for tabular data – they are vastly overparameterized, causing them to rarely find optimal solutions for tabular data manifolds.

What, then, does deep learning have to offer tabular data learning? The main attraction is significant expected performance improvements, particularly for large datasets – as has been demonstrated in the last decade in images, text, and speech. In addition, deep learning offers many other benefits, such as multimodal learning, removal of the need for feature engineering, transfer learning across different domains, semi-supervised learning, data generation, and multitask learning – all of which have led to significant AI advances in other data types.

This led us to develop TabNet, one of the first DNN architectures designed specifically for tabular data. Similar to the canonical architectures in other data types, TabNet is trained end to end without feature engineering. It uses sequential attention to choose which features to reason from at each decision step, enabling both interpretability and better learning as the learning capacity is used for the most salient features. Importantly, we were able to show that TabNet outperforms or is on par with tree-based models and is able to achieve significant performance improvements with unsupervised pretraining. This has led to an increased amount of work on deep learning for tabular data.

Importantly, the improvement from deep learning on tabular data extends from academic datasets to large-scale problems in the real world. As part of my work at Google Cloud, I have observed TabNet achieving significant performance gains on billion-scale real-world datasets for many organizations.

When Andre reached out to share this book, I was delighted to write a foreword. It is the first book I have come across that is dedicated to the application of deep learning to tabular data. The book is approachable to anyone who knows only a bit of Python, with helpful, extensive code samples and notebooks. It strikes a good balance between covering the basics and discussing more cutting-edge research, including interpretability, data generation, robustness, meta-optimization, and ensembling. Overall, it is a great book for data scientist–type personas to get started on deep learning for tabular data. I hope it will encourage and enable even more successful real-world applications of deep learning to tabular data.

Tomas Pfister

Head of AI Research, Google Cloud

Author of the TabNet paper (covered in Chapter 6)

Foreword 2

Almost three decades ago, the inception of support vector machines (SVMs) brought a storm of papers to the machine learning community. The impact of SVMs was far-reaching and touched many fields of research. Since then, both the complexity of data and hardware technologies have expanded multifold, demanding more sophisticated algorithms. This need led to the development of state-of-the-art deep learning networks such as convolutional neural networks (CNNs).

Artificial intelligence (AI) tools like deep learning gained momentum and spread into industries such as autonomous driving. High-performance graphics processing units (GPUs) have accelerated the computation of deep learning models. Consequently, deeper networks have evolved, pushing both industrial innovation and academic research.

Generally, visual data has occupied the space of deep learning technologies, while tabular (nonvisual) data has been largely ignored in this domain. Yet many areas of research, such as the biological sciences, material science, medicine, energy, and agriculture, generate large amounts of tabular data.

This book attempts to explore the less examined domain of tabular data analysis using deep learning technologies. Very recently, some exciting research has been done that could drive us into an era of extensively applying deep learning technologies to tabular data.

Andre and Andy have undoubtedly touched this dimension and comprehensively put forward essential and relevant works for both novice and expert readers to apprehend. They describe the core models that bridge tabular data and deep learning technologies. Moreover, they have built detailed and easy-to-follow code examples. Links to Python scripts and the necessary data are provided so that anyone can reimplement the models.

Although even the most modern deep learning networks perform very promisingly on many applications, they have their limitations, particularly when handling tabular data. For instance, in the biological sciences, tabular data such as multi-omics data have huge dimensionality with very few samples. Deep learning models offer transfer learning capability, which could be used at scale on tabular data to tap hidden information.

Andre and Andy's work is commendable in bringing this crucial information together in a single volume. This is a must-read for AI adventurers!

Alok Sharma

RIKEN Center for Integrative Medical Sciences, Japan

Author of the DeepInsight paper (covered in Chapter 4)

Introduction

Deep learning has become the public and private face of artificial intelligence. When one talks casually about artificial intelligence with friends at a party, strangers on the street, or colleagues at work, the conversation almost always centers on the exciting models that generate language, create art, synthesize music, and so on. Massive and intricately designed deep learning models power most of these exciting machine capabilities. Many practitioners, however, are rightfully pushing back against the technological sensationalism of deep learning. While deep learning is “what’s cool,” it certainly is not the be-all and end-all of modeling.

While deep learning has undoubtedly dominated specialized, high-dimensional data forms such as images, text, and audio, the general consensus is that it performs comparatively worse on tabular data. It is therefore tabular data where those with some distaste, or even resentment, toward deep learning stake out their argument. It was and still is fashionable to publish accepted deep learning papers that make seemingly trivial or even scientifically dubious modifications – this being one of the gripes against deep learning research culture – but it is now also fashionable within this minority to bash the “freshly minted new-generation data scientists” for being too enamored with deep learning and to tout the comparatively more classic tree-based methods as the “best” models for tabular data. You will find this perspective everywhere – in bold research papers, AI-oriented social media, research forums, and blog posts. Indeed, the counterculture is often as fashionable as the mainstream culture, whether it is with hippies or deep learning.

This is not to say that there is no good research pointing in favor of tree-based methods over deep learning – there certainly is.1 But too often this nuanced research is mistakenly taken as a general rule, and those with a distaste for deep learning often commit to the same problematic doctrine as many advancing the state of deep learning: taking results obtained within a generally well-defined set of limitations and willfully extrapolating them beyond said limitations.

The most obvious shortsightedness of those who advocate for tree-based models over deep learning models lies in the problem domain space. A common criticism of tabular deep learning approaches is that they seem like “tricks” – one-off methods that work sporadically – as opposed to reliably high-performing tree methods. Whenever we encounter claims of universal superiority, whether in performance, consistency, or another metric, Wolpert and Macready’s classic No Free Lunch Theorem prompts the intellectual question: Universal across what subset of the problem space?

The datasets used by the well-publicized research surveys and more informal investigations evaluating the performance of deep learning on tabular data are common benchmark datasets – the Forest Cover dataset, the Higgs Boson dataset, the California Housing dataset, the Wine Quality dataset, and so on. These datasets, even when evaluated in the dozens, are undoubtedly limited. It would not be unreasonable to suggest that of all data forms, tabular data is the most varied. Of course, we must acknowledge that it is much more difficult to perform an evaluative survey study with poorly behaved, diverse datasets than with more homogeneous benchmark datasets. Yet those who tout the findings of such studies as a broad verdict on the capabilities of neural networks on tabular data overlook the sheer breadth of tabular data domains in which machine learning models are applied.

With the increase of data signals acquirable from biological systems, biological datasets have grown significantly in feature richness from their state just one or two decades ago. The richness of these tabular datasets exposes the immense complexity of biological phenomena – intricate patterns across a multitude of scales, ranging from the local to the global, interacting with each other in innumerable ways. Deep neural networks are almost always used to model modern tabular datasets representing complex biological phenomena. Likewise, content recommendation, an intricate domain requiring careful and high-capacity modeling power, more or less universally employs deep learning solutions. Netflix, for instance, reported “large improvements to our recommendations as measured by both offline and online metrics” when implementing deep learning.2 Similarly, a Google paper describing the restructuring of YouTube recommendations around deep learning writes that “In conjunction with other product areas across Google, YouTube has undergone a fundamental paradigm shift towards using deep learning as a general-purpose solution for nearly all learning problems.”3

We can find many more examples, if only we look. Many tabular datasets contain text attributes, such as an online product reviews dataset that contains a textual review as well as user and product information represented in tabular fashion. Recent house listing datasets contain images associated with standard tabular information such as the square footage, number of bathrooms, and so on. Alternatively, consider stock price data that captures time-series data in addition to company information in tabular form. What if we also want to add the top ten financial headlines to the tabular and time-series data to forecast stock prices? Tree-based models, to our knowledge, cannot effectively address any of these multimodal problems. Deep learning, on the other hand, can be used to solve all of them. (All three of these problems, and more, will be explored in the book.)
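To make the multimodal claim concrete, below is a minimal sketch – in the TensorFlow/Keras functional API covered in Chapter 3 – of a two-branch network that jointly ingests tabular features and tokenized review text, in the spirit of the product reviews example. The input shapes, vocabulary size, and layer widths here are hypothetical placeholders rather than recommendations.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Tabular branch: dense layers over (hypothetically) 16 numeric features.
tabular_in = layers.Input(shape=(16,), name="tabular_features")
x = layers.Dense(64, activation="relu")(tabular_in)

# Text branch: embed and pool a (hypothetical) 100-token review.
text_in = layers.Input(shape=(100,), dtype="int32", name="review_tokens")
e = layers.Embedding(input_dim=10_000, output_dim=32)(text_in)
e = layers.GlobalAveragePooling1D()(e)

# Fuse the two modalities and predict a single binary target.
merged = layers.Concatenate()([x, e])
merged = layers.Dense(32, activation="relu")(merged)
out = layers.Dense(1, activation="sigmoid")(merged)

model = tf.keras.Model(inputs=[tabular_in, text_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```

A tree ensemble has no analogous mechanism for learning the text representation jointly with the tabular branch; here, the gradient flows through both at once.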

The fact is that data has changed since the 2000s and early 2010s, which is when many of the benchmark datasets were used in studies that investigated performance discrepancies between deep learning and tree-based models. Tabular data is more fine-grained and complex than ever, capturing a wide range of incredibly complex phenomena. It is decidedly not true that deep learning functions as an unstructured, sparsely and randomly successful method in the context of tabular data.

Moreover, raw supervised learning is not the only problem in modeling tabular data. Tabular data is often noisy, and we need methods to denoise it or to otherwise develop ways to be robust to noise. Tabular data is also often changing, so we need models that can structurally adapt to new data easily. We often encounter many different datasets that share a fundamentally similar structure, so we would like to be able to transfer knowledge from one model to another. Sometimes tabular data is scarce, and we need to generate realistic new data. Alternatively, we would like to be able to develop robust and well-generalized models from a very limited dataset. Again, as far as we are aware, tree-based models either cannot do these tasks or have difficulty doing them. Neural networks, on the other hand, can do all of them successfully, following adaptations to tabular data from the computer vision and natural language processing domains.

Of course, there are important legitimate general objections to neural networks.

One such objection concerns interpretability – the contention that deep neural networks are less interpretable than tree models. Interpretability is a particularly interesting idea because it is more an attribute of the human observer than of the model itself. Is a Gradient Boosting model operating on hundreds of features really more intrinsically interpretable than a multilayer perceptron (MLP) trained on the same dataset? Tree models do indeed build easily understandable single-feature split conditions, but this is not especially valuable in and of itself. Moreover, many or most models in popular tree ensembles like Gradient Boosting systems do not directly model the target but rather the residual, which makes direct interpretability more difficult. What we care about more is the interaction between features. To effectively capture this, tree models and neural networks alike need external interpretability methods to collapse the complexity of decision making into key directions and forces. Thus, it is not clear that tree models are more inherently interpretable than neural network models4 on complex datasets.5
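As a small illustration that external interpretability tooling treats both model families identically, the sketch below applies scikit-learn's model-agnostic permutation importance to a Gradient Boosting model and an MLP through the very same interface; the synthetic dataset and settings are stand-ins, not a benchmark.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.inspection import permutation_importance

# A synthetic stand-in dataset: 1,000 rows, 20 features.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# The same model-agnostic procedure applies to either model family.
for model in (GradientBoostingClassifier(random_state=0),
              MLPClassifier(max_iter=500, random_state=0)):
    model.fit(X, y)
    result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
    print(type(model).__name__, result.importances_mean.round(3))
```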

The second primary objection is the laboriousness of tuning the meta-parameters of neural networks. This is, to a degree, an inherent attribute of neural networks: they are fundamentally closer to ideas than to concrete algorithms, given the sheer diversity of possible configurations and approaches to the architecture, training curriculum, optimization process, and so on. It should be noted that this problem is even more pronounced in the computer vision and natural language processing domains than in tabular data and that approaches have been proposed to mitigate it. Moreover, tree-based models also have a large number of meta-parameters, which often require systematic meta-optimization.
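Chapter 10 treats meta-optimization in depth; as a preview of how this laboriousness can be systematized, here is a minimal Bayesian optimization loop with Hyperopt. The two meta-parameters and the analytic stand-in objective are assumptions for illustration – a real workflow would replace the objective with a train-and-validate routine.

```python
from hyperopt import fmin, tpe, hp

# Hypothetical search space over two meta-parameters.
space = {
    "learning_rate": hp.loguniform("learning_rate", -7, -2),
    "num_layers": hp.choice("num_layers", [1, 2, 3, 4]),
}

def objective(params):
    # Stand-in for building a network with `params`, training it,
    # and returning the validation loss.
    return (params["learning_rate"] - 0.01) ** 2 + 0.05 * params["num_layers"]

# Tree-structured Parzen Estimators propose promising configurations.
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
print(best)
```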

A third objection is the purported inability of neural networks to effectively preprocess data in a way that reflects the meaning of the features. Popular tree-based models are able to interpret features in a way that is argued to be more effective for heterogeneous data. However, this does not bar the conjoined application of deep learning with preprocessing schemes, which can put neural networks on equal footing with tree-based models with respect to access to expressive features. We spend a significant amount of space in the book covering different preprocessing schemes.
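As a minimal sketch of such a conjoined scheme – with hypothetical column names and a deliberately tiny dataset – the pipeline below routes numeric and categorical columns through appropriate encodings before fitting a small neural network. Chapter 2 covers far richer encoding schemes; the pattern is the same.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neural_network import MLPClassifier

# Hypothetical heterogeneous tabular data.
df = pd.DataFrame({
    "age": [34, 51, 27, 45],
    "income": [52_000, 88_000, 39_000, 61_000],
    "city": ["Seattle", "Redmond", "Seattle", "Tacoma"],
    "label": [0, 1, 0, 1],
})

# Route each column type through an appropriate encoding so the
# network receives expressive, well-scaled features.
preprocess = ColumnTransformer([
    ("numeric", StandardScaler(), ["age", "income"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

clf = Pipeline([
    ("preprocess", preprocess),
    ("net", MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)),
])
clf.fit(df.drop(columns="label"), df["label"])
```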

All of this is to challenge the notion that tree-based models are superior to deep learning models, or even that tree-based models are consistently or generally superior to deep learning models. It is not to suggest that deep learning is generally superior to tree-based models, either. To be clear, the following claims are being made:
  • Deep learning works successfully on a certain well-defined domain of problems, just as tree-based methods work successfully on another well-defined problem domain.

  • Deep learning can solve many problems beyond raw tabular supervised learning that tree-based methods cannot, such as modeling multimodal data (image, text, audio, and other data in addition to tabular data), denoising noisy data, transferring knowledge across datasets, successfully training on limited datasets, and generating data.

  • Deep learning is indeed prone to difficulties, such as interpretability and meta-optimization. In both cases, however, tree-based models suffer from the same problems to a somewhat similar degree (on the sufficiently complex cases where deep learning would be at least somewhat successful). Moreover, we can attempt to reconcile weaknesses of neural networks with measures such as employing preprocessing pipelines in conjunction with deep models.

The objective of this book is to substantiate these claims by providing the theory and tools to apply deep learning to tabular data problems. The approach is tentative and explorative, especially given the novel nature of much of this work. You are not expected to accept the validity or success of everything or even most things presented. This book is rather a treatment of a wide range of ideas, with the primary pedagogical goal of exposure.

This book is written toward two audiences (although you are more than certainly welcome if you feel you do not belong to either): experienced tabular data skeptics and domain specialists seeking to potentially apply deep learning to their field. To the former, we hope that you will find the methods and discussion presented in this book, both original and synthesized from research, at least interesting and worthy of thought, if not convincing. To the latter, we have structured the book in a way that provides sufficient introduction to necessary tools and concepts, and we hope our discussion will assist modeling your domain of specialty.

- - -

Organization of the Book

Across 12 chapters, the book is organized into three parts.

Part 1, “Machine Learning and Tabular Data,” contains Chapter 1, “Classical Machine Learning Principles and Methods,” and Chapter 2, “Data Preparation and Engineering.” This part introduces machine learning and data concepts relevant for success in the remainder of the book.
  • Chapter 1, “Classical Machine Learning Principles and Methods,” covers important machine learning concepts and algorithms. The chapter demonstrates the theory and implementation of several foundational machine learning models and tabular deep learning competitors, including Gradient Boosting models. It also discusses the bridge between classical and deep learning.

  • Chapter 2, “Data Preparation and Engineering,” is a wide exposition of methods to manipulate, manage, transform, and store tabular data (as well as other forms of data you may need for multimodal learning). The chapter discusses NumPy, Pandas, and TensorFlow Datasets (native and custom); encoding methods for categorical, text, time, and geographical data; normalization and standardization (and variants); feature transformations, including through dimensionality reduction; and feature selection.

Part 2, “Applied Deep Learning Architectures,” contains Chapter 3, “Neural Networks and Tabular Data”; Chapter 4, “Applying Convolutional Structures to Tabular Data”; Chapter 5, “Applying Recurrent Structures to Tabular Data”; Chapter 6, “Applying Attention to Tabular Data”; and Chapter 7, “Tree-Based Deep Learning Approaches.” This part constitutes the majority of the book and demonstrates both how various neural network architectures function in their “native application” and how they can be appropriated for tabular data in both intuitive and unintuitive ways. Chapters 3, 4, and 5 each center on one of the three well-established (even “traditional”) areas of deep learning – artificial neural networks (ANNs), convolutional neural networks, and recurrent neural networks – and their relevance to tabular data. Chapters 6 and 7 collectively cover two of the most prominent modern research directions in applying deep learning to tabular data – attention/transformer methods, which take inspiration from the similarity between modeling cross-word/token relationships and modeling cross-feature relationships, and tree-based neural network methods, which attempt to emulate, in some way or another, the structure or capabilities of tree-based models in neural network form.
  • Chapter 3, “Neural Networks and Tabular Data,” covers the fundamentals of neural network theory – the multilayer perceptron, the backpropagation derivation, activation functions, loss functions, and optimizers – and the TensorFlow/Keras API for implementing neural networks. Comparatively advanced neural network methods such as callbacks, batch normalization, dropout, nonlinear architectures, and multi-input/multi-output models are also discussed. The objective of the chapter is to provide both an important theoretical foundation to understand neural networks and the tools to begin implementing functional neural networks to model tabular data.

  • Chapter 4, “Applying Convolutional Structures to Tabular Data,” begins by demonstrating the low-level mechanics of convolution and pooling operations, followed by the construction and application of standard convolutional neural networks for image data. The application of convolutional structures to tabular data is explored in three ways: multimodal image-and-tabular datasets, one-dimensional convolutions, and two-dimensional convolutions. This chapter is especially relevant to biological applications, which often employ methods in this chapter.

  • Chapter 5, “Applying Recurrent Structures to Tabular Data,” like Chapter 4, begins by demonstrating how three variants of recurrent models – the “vanilla” recurrent layer, the Long Short-Term Memory (LSTM) layer, and the Gated Recurrent Unit layer – capture sequential properties in the input. Recurrent models are applied to text, time-series, and multimodal data. Finally, speculative methods for directly applying recurrent layers to tabular data are proposed.

  • Chapter 6, “Applying Attention to Tabular Data,” introduces the attention mechanism and the transformer family of models. The attention mechanism is applied to text, multimodal text-and-tabular data, and tabular contexts. Four research papers – TabTransformer, TabNet, SAINT (Self-Attention and Intersample Attention Transformer), and ARM-Net – are discussed in detail and implemented.

  • Chapter 7, “Tree-Based Deep Learning Approaches,” is predominantly research-focused and covers three primary classes of tree-based neural networks: tree-inspired/emulative neural networks, which attempt to replicate the character of tree models in the architecture or structure of a neural network; stacking and boosting neural networks; and distillation, which transfers tree-structured knowledge into a neural network.

Part 3, “Deep Learning Design and Tools,” contains Chapter 8, “Autoencoders”; Chapter 9, “Data Generation”; Chapter 10, “Meta-optimization”; Chapter 11, “Multi-model Arrangement”; and Chapter 12, “Neural Network Interpretability.” This part demonstrates how neural networks can be used and understood beyond the raw task of supervised modeling in somewhat shorter chapters.
  • Chapter 8, “Autoencoders,” introduces the properties of autoencoder architectures and demonstrates how they can be used for pretraining, multitask learning, sparse/robust learning, and denoising.

  • Chapter 9, “Data Generation,” shows how Variational Autoencoders and Generative Adversarial Networks can be applied to the generation of tabular data in limited-data contexts.

  • Chapter 10, “Meta-optimization,” demonstrates how Bayesian optimization with Hyperopt can be employed to automate the optimization of meta-parameters including the data encoding pipeline and the model architecture, as well as the basics of Neural Architecture Search.

  • Chapter 11, “Multi-model Arrangement,” shows how neural network models can be dynamically ensembled and stacked together to boost performance or to evaluate live/“real-time” model prediction quality.

  • Chapter 12, “Neural Network Interpretability,” presents three methods, spanning both model-agnostic and model-specific approaches, for interpreting the predictions of neural networks.

All the code for the book is available in a repository on Apress’s GitHub here: https://github.com/apress/modern-deep-learning-tabular-data.

We are more than happy to discuss the book and other topics with you. You can reach Andre at [email protected] and Andy at [email protected].

We hope that this book is thought-provoking and interesting and – most importantly – inspires you to think critically about the relationship between deep learning and tabular data. Happy reading, and thanks for joining us on this adventure!

Best,

Andre and Andy

Acknowledgments

This book would not have been possible without the professional help and support of so many. We want to express our greatest gratitude to Mark Powers, the awesome coordinating editor powering the logistics of the book’s development, and all the other amazing staff at Apress whose tireless work allowed us to focus on writing the best book we could. We also would like to thank Bharath Kumar Bolla, our technical reviewer, as well as Kalamendu Das, Andrew Steckley, Santi Adavani, and Aditya Battacharya for serving as our third through seventh pairs of eyes. Last but not least, we are honored to have had Tomas Pfister and Alok Sharma each contribute a foreword.

Many have also supported us in our personal lives – we greatly appreciate our friends and family for their unwavering support and motivation. Although these incredible individuals might not be as involved in the technical aspect as those mentioned previously (we have fielded the question “So exactly what is your book about?” at the dinner table many a time), we would not have been able to accomplish what we set out to do – in this book and in life – without their presence.

About the Authors
Andre Ye


is a deep learning (DL) researcher with a focus on building and training robust medical deep computer vision systems for uncertain, ambiguous, and unusual contexts. He has published another book with Apress, Modern Deep Learning Design and Application Development, and writes short-form data science articles on his blog. In his spare time, Andre enjoys keeping up with current deep learning research and jamming to hard metal.
 
Zian “Andy” Wang


is a researcher and technical writer passionate about data science and machine learning (ML). With extensive experience in modern artificial intelligence (AI) tools and applications, he has competed in various professional data science competitions and has gained hundreds of thousands of views across his published articles. His main focus lies in building versatile model pipelines for different problem settings, including tabular and computer vision–related tasks. When not writing or programming, Andy has a passion for piano and swimming.
 
About the Technical Reviewer
Bharath Kumar Bolla


has over 10 years of experience and is currently working as a senior data science engineer consultant at Verizon, Bengaluru. He has a PG diploma in data science from Praxis Business School and an MS in life sciences from Mississippi State University, USA. He previously worked as a data scientist at the University of Georgia, Emory University, Eurofins LLC, and Happiest Minds. At Happiest Minds, he worked on AI-based digital marketing products and NLP-based solutions in the education domain. Along with his day-to-day responsibilities, Bharath is a mentor and an active researcher. To date, he has published around ten articles in journals and peer-reviewed conferences. He is particularly interested in unsupervised and semi-supervised learning and efficient deep learning architectures in NLP and computer vision.
 