front matter

preface

Like many data scientists and machine learning engineers out there, I got most of my professional training and education from real-world experience rather than classical education. I earned all my degrees from Johns Hopkins in theoretical mathematics and never once learned about regression and classification models. Once I received my master’s degree, I decided to switch from pursuing my PhD to joining startups in Silicon Valley and teaching myself the basics of ML and AI.

I used free online resources and read reference books to begin my data science education and started a company focusing on creating enterprise AIs for large corporations. Nearly all of the material I picked up focused on the types of models and algorithms used to model data and make predictions. I used books to learn the theory and read online posts on sites like Medium to see how people would apply that theory to real-life applications.

It wasn’t until a few years later that I started to realize that I could only go so far learning about topics like models, training, and parameter tuning. I was working with raw text data at the time, building enterprise-grade chatbots, and I noticed a big difference in the tone of the books and articles about natural language processing (NLP). They focused a lot on the classification and regression models I could use, but they focused equally, if not more, on how to process the raw text for those models. They talked about tuning parameters for the data more than tuning parameters for the models themselves.

I wondered why this wasn’t the case for other branches of ML and AI. Why weren’t people transforming tabular data with the same rigor as text data? It couldn’t be that it wasn’t necessary or helpful because pretty much every survey asking about time spent in the data science process revealed that people spent a majority of time getting and cleaning data. I decided to take this gap and turn it into a book.

Funnily enough, that wasn’t this book. I wrote another book on feature engineering a few years prior to this one. That first book covered the basics of feature engineering, with an emphasis on explaining the tools and algorithms over showcasing how to use them day to day. This book takes a more practical approach: every chapter is dedicated to a use case in a particular field, with a dataset that invites different feature engineering techniques.

I tried to outline my own thinking process when it came to feature engineering in an easy-to-follow and concise format. I’ve made a career out of data science and machine learning, and feature engineering has been a huge part of that. I hope that this book will open your eyes and your conversations with colleagues about working with data and give you the tools and tricks to know which feature engineering techniques to apply and when.

acknowledgments

This book required a lot of work, but I believe that all the time and effort resulted in a great book. I sure hope that you think so as well! There are many people I’d like to thank for encouraging me and helping me along the way.

First and foremost, I want to thank my partner, Elizabeth. You’ve supported me, listened to me as I paced around our kitchen trying to figure out the best analogy for a complex topic, and walked the dog when it was my turn, but I was so engrossed in my writing that it totally slipped my mind. I love you more than anything.

Next, I’d like to acknowledge everyone at Manning who made this text possible. I know it took a while, but your constant support and belief in the topic kept me going when things were rough. Your commitment to the quality of this book has made it better for everyone who will read it.

I’d also like to thank all the reviewers, who took the time to read my manuscript at various stages during its development. To Aleksei Agarkov, Alexander Klyanchin, Amaresh Rajasekharan, Bhagvan Kommadi, Bob Quintus, Harveen Singh, Igor Dudchenko, Jim Amrhein, Jiri Pik, John Williams, Joshua A. McAdams, Krzysztof Jędrzejewski, Krzysztof Kamyczek, Lavanya Mysuru Krishnamurthy, Lokesh Kumar, Maria Ana, Maxim Volgin, Mikael Dautrey, Oliver Korten, Prashant Nair, Richard Vaughan, Sadhana Ganapathiraju, Satej Kumar Sahu, Seongjin Kim, Sergio Govoni, Shaksham Kapoor, Shweta Mohan Joshi, Subhash Talluri, Swapna Yeleswarapu, and Vishwesh Ravi Shrimaland: your suggestions helped make this a better book.

Finally, a special thank you goes to my technical proofreaders, who made sure that I crossed my t’s, dotted my i’s, and commented on my code!

All in all, many people made this book possible. Thank you all so much!

about this book

Feature Engineering Bookcamp was written both to give the reader an overview of popular feature engineering techniques and to provide a framework for thinking about when and how to use certain techniques. I have found that books that focus on one or the other can sometimes fall a bit flat. The book that focuses only on overviews tends to ignore the practical application side of things, whereas the book that focuses on the frameworks can leave readers asking themselves, “Sure, but why does it work?” I want readers to walk away confident in both understanding and applying these techniques.

Who should read this book?

Feature Engineering Bookcamp is for machine learning engineers and data scientists who have already entered the space and are looking for a boost in their abilities and skill sets. I assume that the reader already has functional knowledge of machine learning, cross-validation, parameter tuning, and model training using Python and scikit-learn. This book builds on that knowledge by incorporating feature engineering pipelines directly into existing machine learning frameworks.

How this book is organized: A roadmap

This book has two introductory chapters that cover the basics of feature engineering, including how to recognize different types of data and the different categories of feature engineering. Each of chapters 3 through 8 focuses on a specific case study with a different dataset and a different goal. Each chapter gives the reader a new perspective, a new dataset, and new feature engineering techniques that are specific to the type of data we are working with. The goal is to provide a broad and comprehensive view of the types of feature engineering techniques, while showcasing a variety of datasets and data types.

About the code

This book contains many examples of source code both in numbered listings and in line with normal text. In both cases, source code is formatted in a fixed-width font like this to separate it from ordinary text. Sometimes code is also in bold to highlight code that has changed from previous steps in the chapter, such as when a new feature adds to an existing line of code.

In many cases, the original source code has been reformatted; we’ve added line breaks and reworked indentation to accommodate the available page space in the book. In some cases, even this was not enough, and listings include line-continuation markers. Additionally, comments in the source code have often been removed from the listings when the code is described in the text. Code annotations accompany many of the listings, highlighting important concepts.

You can get executable snippets of code from the liveBook (online) version of this book at https://livebook.manning.com/book/feature-engineering-bookcamp. The complete code for the examples in the book is available for download from my personal GitHub at https://github.com/sinanuozdemir/feature_engineering_bookcamp.

liveBook discussion forum

Purchase of Feature Engineering Bookcamp includes free access to liveBook, Manning’s online reading platform. Using liveBook’s exclusive discussion features, you can attach comments to the book globally or to specific sections or paragraphs. It’s a snap to make notes for yourself, ask and answer technical questions, and receive help from the author and other users. To access the forum, go to https://livebook.manning.com/book/feature-engineering-bookcamp/discussion. You can also learn more about Manning’s forums and the rules of conduct at https://livebook.manning.com/discussion.

Manning’s commitment to our readers is to provide a venue where a meaningful dialogue between individual readers and between readers and the author can take place. It is not a commitment to any specific amount of participation on the part of the author, whose contribution to the forum remains voluntary (and unpaid). We suggest you try asking the author some challenging questions lest his interest stray! The forum and the archives of previous discussions will be accessible from the publisher’s website for as long as the book is in print.

about the author

Sinan Ozdemir is the founder and CTO of Shiba and is currently managing the Web3 components and machine learning models that power the company’s social commerce platform. Sinan is a former lecturer of data science at Johns Hopkins University and the author of multiple textbooks on data science and machine learning. Additionally, he is the founder of the acquired Kylie.ai, an enterprise-grade conversational AI platform with robotic process automation (RPA) capabilities. He holds a master’s degree in pure mathematics from Johns Hopkins University and is based in San Francisco, CA.

about the cover illustration

The figure on the cover of Feature Engineering Bookcamp is captioned “Homme du Thibet,” or “Man from Tibet,” taken from a collection by Jacques Grasset de Saint-Sauveur, published in 1797. Each illustration is finely drawn and colored by hand.

In those days, it was easy to identify where people lived and what their trade or station in life was just by their dress. Manning celebrates the inventiveness and initiative of the computer business with book covers based on the rich diversity of regional culture centuries ago, brought back to life by pictures from collections such as this one.
