Pramod Singh

Machine Learning with PySpark

With Natural Language Processing and Recommender Systems

2nd ed.
Pramod Singh
Bangalore, Karnataka, India
ISBN 978-1-4842-7776-8
e-ISBN 978-1-4842-7777-5
© Pramod Singh 2022
Apress Standard
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Apress imprint is published by the registered company APress Media, LLC, part of Springer Nature.

The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.

I dedicate this book to my wife Neha, my son Ziaan, and my parents. Without you guys, this book wouldn’t have been possible. You complete my world and are the source of my strength.

Foreword

Businesses are swimming in data, yet many organizations are struggling to remain afloat in a rapidly expanding sea of data. Ever-increasing connectivity of data-transmitting devices is driving growth in how much and how fast data is being generated. This explosion in digitalization has been accompanied by a proliferation of applications and vendors. Companies often use multiple vendors for the same use cases and store data across multiple systems. Digital data lives in more and more formats and across fragmented data architecture layers.

In a world where data lives everywhere, many organizations are on a race to be more data-driven in their decision-making to get and stay ahead of the competition. The winners in this race proactively manage their data needs and leverage their data and analytics capabilities to drive tangible business outcomes.

The starting point for using data and analytics as a strategic tool is having good data. Take a B2B company that was looking to improve how it forecasted next month’s sales, for example. Each month this company would aggregate individual sales reps’ “hot leads” into a sales forecast that might be wide of the mark. Improving the underlying data quality required four changes to how the company captured customer insights from the front line. First, the company clearly defined at which stage in the sales funnel a lead should be tagged to the sales forecast. Second, it reduced the granularity of information that the sales force had to enter along with each sales lead. Third, it gave more time to the sales reps to complete the data entry. Fourth, it secured commitment from the sales force to improve the data quality. Along with ensuring a common understanding of the changes, the company was able to improve how it translated frontline customer insights into better data.

With better data in its arsenal, a business then needs to orchestrate that data across a complex web of systems, processes, and tools. The same B2B company had a state-of-the-art CRM (customer relationship management) system, yet numerous spreadsheets were metronomically circulated across the organization to syndicate the company's sales forecast. The company improved how it orchestrated the sales forecast data in four ways. First, it consolidated disparate planning and tracking tools into a consistent view in the CRM system. Second, it secured commitment from the organization to use the CRM forecast as the one source of truth. Third, it configured access for all relevant stakeholders to view the sales forecast. Fourth, it compressed the organizational layers that could make adjustments, as fewer layers brought fewer agendas to managing the sales forecast.

The effective use of data and analytics can help businesses make better decisions and drive successful outcomes. For instance, the company mentioned previously developed a system of sanity checks to assure the sales forecast. It used machine learning to classify individual sales reps into groups of optimistic and conservative forecasters. Machine learning helped predict expected conversion rates at an individual level and on an aggregate basis, which the company could use to sense-check deviations from the forecast envelope, either to confirm adjustments to the sales forecast or to trigger a review of specific forecast data. Better visibility into the sales forecast enabled the supply chain function to proactively move the company's products to the branches where sales were forecast to increase and to place replenishment orders with its suppliers. As a result, the company incurred fewer lost sales due to stock-outs and achieved a better balance between product availability and working capital requirements.

Building data and analytics capabilities is a journey that comes in different waves. Going back to the example, the company first concentrated on a few use cases – such as improved sales forecasting – that could result in better visibility, more valuable insights, and automation of work. It defined and aligned on changes to its operating model and data governance. It set up a center of excellence to help accelerate use cases and capability building. It built a road map for further use case deployment and capability development – upskilling current employees, acquiring new talent, and exploring data and analytics partnerships – to get the full value of its proprietary data assets.

Great data and analytics capabilities can eventually lead to data monetization. Companies should start with use cases that address a raw customer need. The most successful use cases typically center on improving customer experience (as well as supplier and frontline worker experience). Taking the plunge to build data and analytics capabilities is hard, yet successful companies are able to muster the courage to ride the waves. They start with small wins and then scale up and amplify use cases to capture value and drive tangible business outcomes through data and analytics.

Foreword by Dominik Utama

Introduction

I am going to be very honest with you. When I signed the contract to write this second edition, I thought it would be a bit easier to write, but I couldn't have been more wrong. It has taken me a significant amount of time to complete the chapters. What I have come to realize is that it's never easy to break down a thought process and put it on paper in a convincing manner. There are many retries in that process, but what helped was the foundation, the blueprint, already established in the first edition of this book. The main challenge was to figure out how I could make this book more relevant and useful for readers. There are literally thousands of books on this subject already, so this one might just end up as another book on the shelf.

To find the answer, I spent a lot of time thinking and going through the messages I received from the many people who read the first edition of the book. After a while, a few patterns started to emerge. The first realization was that data continues to be generated at an ever faster pace. The basic premise of the first edition was that a data scientist needs to be familiar with at least one big data framework in order to handle scalable ML engagements. That requires gradually moving away from libraries like sklearn, which have limitations in handling large datasets. This premise is still highly relevant today, as businesses want to leverage as much data as possible to build powerful and significant insights. Hence, readers would still be excited to learn new things about the Spark framework.

Most of the books that have been published on this subject are either too detailed or lack a high-level overview. Readers start really easily but, after a couple of chapters, begin to feel overwhelmed as the content becomes too technical. As a result, they give up without getting enough out of the book. That's why I wanted to write a book that demonstrates the different ways of using machine learning without going too deep, yet captures the complete methodology to build an ML model from scratch.

Another issue that I wanted to address in this edition is the development environment. It was evident that many people struggled to set up the right environment on their local machines to install Spark properly and ran into a lot of issues. Hence, I wrote this edition using Databricks as the core development platform, which is easy to access, so one doesn't have to worry about setting up anything on the local system. The best thing about using Databricks is that it provides a platform to code in multiple languages such as Python, R, and Scala. Another extension in this edition is that the codebase demonstrates end-to-end development of ML models, including automating the intermediate steps using Spark pipelines. The libraries used are from the latest Spark version.

This book is divided into three sections. The first section covers how to access Databricks and alternate ways to use Spark. It goes into the architecture details of the Spark framework, along with an introduction to machine learning. The second section focuses on the details of different machine learning algorithms and on executing end-to-end pipelines for different use cases in PySpark. The algorithms are explained in simple terms so that anyone can read and understand the details. The datasets used in the book are relatively small in scale, but the overall process and steps remain the same on big data as well. The third and final section showcases how to build a distributed recommender system and Natural Language Processing applications in PySpark. The bonus part covers creating and visualizing sequence embeddings in PySpark. This book may also be relevant for data analysts and data engineers, as it covers the steps of big data processing using PySpark. Readers who want to make a transition to the data science and machine learning field will find this book an easy place to start and can gradually take up more complicated topics later. The case studies and examples given in the book make it easy to follow along and understand the fundamental concepts. Moreover, there are few books available on PySpark, and this book will certainly add value toward upskilling its readers. The strength of this book lies in explaining machine learning algorithms in the simplest way and taking a practical approach toward building and training them using PySpark.

I have put my entire experience and learnings into this book and feel it is precisely relevant to what readers are seeking either to upskill or to solve ML problems. I hope you have some useful takeaways from this book.

Acknowledgments

I am really grateful to Apress for providing me with the opportunity to author the second edition of this book. It's a testament that readers found the first edition really useful. This book wouldn’t have seen the light of day if a few people were not there with me during this journey. First and foremost, I want to thank the most important person in my life, my beloved wife, Neha, who selflessly supported me throughout this time and sacrificed so much just to ensure that I complete this book.

I want to thank Dominik Utama (partner at Bain) for taking time out of his busy schedule to write the foreword for the readers of this book. Thanks, Dom. I really appreciate it. I would also like to thank Suresh John Celestin, who believed in me and offered me the opportunity to write the second edition. Aditee Mirashi is one of the best editors to work with; this is my fourth book with her. She is extremely supportive and responsive. I would especially like to thank Mark Powers, who dedicated his time to reading every single chapter and gave so many useful suggestions. Another person I want to thank is Joos Korstanje, who had the patience to review every single line of code and check the appropriateness of each example. Thank you, Joos, for your feedback and your encouragement. It really made a difference to me and to the book as well.

I also want to thank my mentors who have constantly guided me in the right direction. Thank you, Dominik Utama, Janani Sriram, Barron Berenjan, Ilker Carikcioglu, Dr. Vijay Agneeswaran, Sreenivas Venkatraman, Shoaib Ahmed, and Abhishek Kumar, for your time.

Finally, I am grateful to my son, Ziaan, and my parents for the endless love and support they give me irrespective of circumstances. You guys remind me that life is beautiful.

Table of Contents
About the Author
Pramod Singh
works at Bain & Company as a Senior Manager, Data Science, in its Advanced Analytics group. He has over 13 years of industry experience in machine learning (ML) and AI at scale, data analytics, and application development. He has authored four other books, including a book on machine learning operations. He is also a regular speaker at major international conferences such as Databricks AI, O’Reilly’s Strata, and similar events. He holds an MBA from Symbiosis International University and is a certified data analytics professional from IIM Calcutta. He lives in Gurgaon with his wife and five-year-old son. In his spare time, he enjoys playing guitar, coding, reading, and watching football.
 
About the Technical Reviewer
Joos Korstanje

is a data scientist with over five years of industry experience in developing machine learning tools, a large part of which are forecasting models. He currently works at Disneyland Paris, where he develops machine learning models for a variety of tools.
