Machine learning fits mathematical models to data in order to derive insights or make predictions. These models take features as input. A feature is a numeric representation of an aspect of raw data. Features sit between data and models in the machine learning pipeline. Feature engineering is the act of extracting features from raw data and transforming them into formats that are suitable for the machine learning model. It is a crucial step in the machine learning pipeline, because the right features can ease the difficulty of modeling, and therefore enable the pipeline to output results of higher quality. Practitioners agree that the vast majority of time in building a machine learning pipeline is spent on feature engineering and data cleaning. Yet, despite its importance, the topic is rarely discussed on its own. Perhaps this is because the right features can only be defined in the context of both the model and the data; since data and models are so diverse, it’s difficult to generalize the practice of feature engineering across projects.
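To make the definition concrete, here is a tiny, hypothetical example (not taken from the book's own examples) of turning a piece of raw data, a single sentence of text, into numeric features in Python:

    # A hypothetical example: raw data (one sentence) turned into numeric features.
    raw = "Feature engineering turns raw data into numbers."
    features = {
        "num_words": len(raw.split()),           # word count
        "num_chars": len(raw),                    # character count
        "ends_with_period": int(raw.endswith(".")),
    }
    print(features)  # {'num_words': 7, 'num_chars': 48, 'ends_with_period': 1}

These three numbers are features; the model never sees the sentence itself, only numeric representations of it. Exactly which numbers to extract, of course, depends on the data and on the model that will consume them.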
Nevertheless, feature engineering is not just an ad hoc practice. There are deeper principles at work, and they are best illustrated in situ. Each chapter of this book addresses one data problem: how to represent text data or image data, how to reduce the dimensionality of autogenerated features, when and how to normalize, etc. Think of this as a collection of interconnected short stories, as opposed to a single long novel. Each chapter provides a vignette into the vast array of existing feature engineering techniques. Together, they illustrate the overarching principles.
Mastering a subject is not just about knowing the definitions and being able to derive the formulas. It is not enough to know how the mechanism works and what it can do—one must also understand why it is designed that way, how it relates to other techniques, and what the pros and cons of each approach are. Mastery is about knowing precisely how something is done, having an intuition for the underlying principles, and integrating it into one’s existing web of knowledge. One does not become a master of something by simply reading a book, though a good book can open new doors. It has to involve practice—putting the ideas to use, which is an iterative process. With every iteration, we know the ideas better and become increasingly adept and creative at applying them. The goal of this book is to facilitate the application of its ideas.
This book tries to teach the reason first, and the mathematics second. Instead of only discussing how something is done, we try to teach why. Our goal is to provide the intuition behind the ideas, so that the reader may understand how and when to apply them. There are tons of descriptions and pictures for folks who learn in different ways. Mathematical formulas are presented in order to make the intuition precise, and also to bridge this book with other existing offerings.
Code examples in this book are given in Python, using a variety of free and open source packages. The NumPy library provides numeric vector and matrix operations. Pandas provides the DataFrame that is the building block of data science in Python. Scikit-learn is a general-purpose machine learning package with extensive coverage of models and feature transformers. Matplotlib and the styling library Seaborn provide plotting and visualization support. You can find these examples as Jupyter notebooks in our GitHub repo.
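The book's own notebooks live in the GitHub repo; as a minimal, self-contained sketch (not drawn from those notebooks) of how these packages typically fit together, a Pandas DataFrame holds the data, a scikit-learn transformer produces a feature, and Seaborn/Matplotlib visualize the result:

    # A minimal sketch (not from the book's repo) of the tool stack working together.
    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import StandardScaler

    # Pandas DataFrame holding a small synthetic numeric column.
    df = pd.DataFrame({"x": np.random.randn(100) * 10 + 50})

    # Scikit-learn feature transformer: rescale to zero mean and unit variance.
    df["x_scaled"] = StandardScaler().fit_transform(df[["x"]]).ravel()

    # Seaborn/Matplotlib for plotting the transformed feature.
    sns.histplot(df["x_scaled"])
    plt.show()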
The first few chapters start out slow in order to provide a bridge for folks who are just getting started with data science and machine learning. Chapter 1 introduces the fundamental concepts in the machine learning pipeline (data, models, features, etc.). In Chapter 2, we explore basic feature engineering for numeric data: filtering, binning, scaling, log transforms and power transforms, and interaction features. Chapter 3 dives into feature engineering for natural text, exploring techniques like bag-of-words, n-grams, and phrase detection. Chapter 4 examines tf-idf (term frequency–inverse document frequency) as an example of feature scaling and discusses why it works. The pace starts to pick up around Chapter 5, where we talk about efficient encoding techniques for categorical variables, including feature hashing and bin counting. By the time we get to principal component analysis (PCA) in Chapter 6, we are deep in the land of machine learning. Chapter 7 looks at k-means as a featurization technique, which illustrates the useful concept of model stacking. Chapter 8 is all about images, which are much more challenging in terms of feature extraction than text data. We look at two manual feature extraction techniques, SIFT and HOG, before concluding with an explanation of deep learning as the latest feature extraction technique for images. We finish up in Chapter 9 by showing a few different techniques in an end-to-end example, creating a recommender for a dataset of academic papers.
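As a small, hypothetical preview of what is to come (the worked examples are in the chapters themselves), two of the techniques mentioned above look like this in NumPy and scikit-learn:

    # Hypothetical preview: a log transform for heavy-tailed counts (Chapter 2)
    # and bag-of-words counts for text (Chapter 3).
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    counts = np.array([1, 10, 100, 1000, 10000])
    log_counts = np.log1p(counts)  # log(1 + x) compresses the heavy tail

    docs = ["the quick brown fox", "the lazy brown dog"]
    bow = CountVectorizer().fit_transform(docs)  # sparse document-term matrix
    print(log_counts)
    print(bow.toarray())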
The illustrations in this book are best viewed in color. Really, you should print out the color versions of the Swiss roll in Chapter 7 and paste them into your book. Your aesthetic sense will thank us.
Feature engineering is a vast topic, and more methods are being invented every day, particularly in the area of automatic feature learning. In order to limit the book to a manageable size, we’ve had to make some cuts. This book does not discuss Fourier analysis for audio data, though it is a beautiful subject that is closely related to eigen analysis in linear algebra (which we touch upon in Chapters 4 and 6). We also skip a discussion of random features, which are intimately related to Fourier analysis. We provide an introduction to feature learning via deep learning for image data, but do not go into depth on the numerous deep learning models under active development. Also out of scope are advanced research ideas like random projections, complex text featurization models such as word2vec and Brown clustering, and latent space models like Latent Dirichlet allocation and matrix factorization. If those words mean nothing to you, then you are in luck. If the frontiers of feature learning are where your interest lies, then this is probably not the book for you.
The book assumes knowledge of basic machine learning concepts, such as what a model is and what a vector is, though a refresher is provided so we’re all on the same page. Experience with linear algebra, probability distributions, and optimization is helpful, but not necessary.
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.
The book also contains numerous linear algebra equations. We use the following conventions with regard to notation: scalars are shown in lowercase italic (e.g., a), vectors in lowercase bold (e.g., v), and matrices in uppercase bold and italic (e.g., U).
This element signifies a tip or suggestion.
This element signifies a general note.
This element indicates a warning or caution.
Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/alicezheng/feature-engineering-book.
This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Feature Engineering for Machine Learning by Alice Zheng and Amanda Casari (O’Reilly). Copyright 2018 Alice Zheng and Amanda Casari, 978-1-491-95324-2.”
If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at [email protected].
Safari (formerly Safari Books Online) is a membership-based training and reference platform for enterprise, government, educators, and individuals.
Members have access to thousands of books, training videos, Learning Paths, interactive tutorials, and curated playlists from over 250 publishers, including O’Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others.
For more information, please visit http://oreilly.com/safari.
Please address comments and questions concerning this book to the publisher.
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/featureEngineering_for_ML.
To comment or ask technical questions about this book, send email to [email protected].
For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
First and foremost, we want to thank our editors, Shannon Cutt and Jeff Bleiel, for shepherding two first-time authors through the (unknown to us) long marathon of book publishing. Without your many check-ins, this book would not have seen the light of day. Thank you also to Ben Lorica, O’Reilly Mastermind, whose encouragement and affirmation turned this from a crazy idea into an actual product. Thank you to Kristen Brown and the O’Reilly production team for their superb attention to detail and extreme patience in waiting for our responses.
If it takes a village to raise a child, it takes a parliament of data scientists to publish a book. We greatly appreciate every hashtag suggestion, note on room for improvement, and call for clarification. Andreas Müller, Sethu Raman, and Antoine Atallah took precious time out of their busy days to provide technical reviews. Antoine not only did so at lightning speed, but also made his beefy machines available for experiments. Ted Dunning’s statistical fluency and mastery of applied machine learning are legendary. He is also incredibly generous with his time and his ideas, and he literally gave us the method and the example described in the k-means chapter. Owen Zhang revealed his cache of Kaggle nuggets on using response rate features, which were added to the machine learning folklore on bin-counting collected by Misha Bilenko. Thank you also to Alex Ott, Francisco Martin, and David Garrison for additional feedback.
I would like to thank the GraphLab/Dato/Turi family for their generous support in the first phase of this project. The idea germinated from interactions with our users. In the process of building a brand new machine learning platform for data scientists, we discovered that the world needs a more systematic understanding of feature engineering. Thank you to Carlos Guestrin for granting me leave from busy startup life to focus on writing.
Thank you to Amanda, who started out as technical reviewer and later pitched in to help bring this book to life. You are the best finisher! Now that this book is done, we’ll have to find another project, if only to keep doing our editing sessions over tea and coffee and sandwiches and takeout food.
Special thanks to my friend and healer, Daisy Thompson, for her unwavering support throughout all phases of this project. Without your help, I would have taken much longer to take the plunge, and would have resented the marathon. You brought light and relief to this project, as you do with all your work.
As this is a book and not a lifetime achievement award, I will attempt to scope my thanks to the project at hand.
Many thanks to Alice for bringing me in as a technical editor and then coauthor. I continue to learn so much from you, including how to write better math jokes and explain complex concepts clearly.
Last in order only, special thanks to my husband, Matthew, for mastering the nearly impossible role of grounding me, encouraging me towards my next goal, and never allowing a concept to be hand-waved away. You are the best partner and my favorite partner in crime. To the biggest and littlest sunshines, you inspire me to make you proud.