Pramod Singh
Machine Learning with PySpark: With Natural Language Processing and Recommender Systems
Pramod Singh
Bangalore, Karnataka, India
ISBN 978-1-4842-4130-1 e-ISBN 978-1-4842-4131-8
Library of Congress Control Number: 2018966519
© Pramod Singh 2019
Apress Standard
Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail [email protected], or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.

I dedicate this book to my wife, Neha; my son, Ziaan; and my parents. Without you guys, this book wouldn’t have been possible. You complete my world and are the source of my strength.

Introduction

Before even starting to write this book, I asked myself a question: Is there a need for another book on Machine Learning? So many books have already been written on this subject that this might end up as just another book on the shelf. To find the answer, I spent a lot of time thinking, and after a while, a few patterns started to emerge. The books that had been written on Machine Learning were too detailed and lacked a high-level overview. Most of them would start out easy, but after a couple of chapters, it felt overwhelming to continue as the content became too deep. As a result, readers would give up without getting enough out of the book. That’s why I wanted to write this book, which demonstrates the different ways of using Machine Learning without getting too deep, yet captures the complete methodology to build an ML model from scratch.

The next obvious question was this: Why Machine Learning using PySpark? The answer to this question did not take too long, since I am a practicing Data Scientist and well aware of the challenges faced by people dealing with data. Most packages or modules are limited because they process data on a single machine. Moving from a development to a production environment becomes a nightmare if ML models are not meant to handle Big Data, and the processing of data itself needs to be fast and scalable. For all these reasons, it made complete sense to write this book on Machine Learning using PySpark to understand the process of using Machine Learning from a Big Data standpoint.

Now we come to the core of the book Machine Learning with PySpark. This book is divided into three sections. The first section gives an introduction to Machine Learning and Spark, the second section talks about Machine Learning in detail using Big Data, and the third section showcases Recommender Systems and NLP using PySpark. This book might also be relevant for Data Analysts and Data Engineers, as it covers the steps of Big Data processing using PySpark as well. Readers who want to make a transition to Data Science and the Machine Learning field would also find this book easier to start with and can gradually take up more complicated material later. The case studies and examples given in the book make it easy to follow along and understand the fundamental concepts. Moreover, there are very few books available on PySpark, and this book would certainly add some value to the knowledge of the readers. The strength of this book lies in explaining the Machine Learning algorithms in the simplest possible way and in taking a practical approach toward building them using PySpark.

I have put my entire experience and learning into this book and feel it is precisely relevant to what businesses are seeking to solve real challenges. I hope you have some useful takeaways from this book.

Acknowledgments

This book wouldn’t have seen the light of day if a few people had not been there with me during this journey. I had heard the quote “Easier said than done” so many times in my life, but I had the privilege to experience it truly while writing this book. To be honest, I was extremely confident about writing this book initially, but as I progressed into writing it, things started becoming difficult. It’s quite ironic because when you think about the content, you are crystal clear in your mind, but when you go on to write it on a piece of paper, it suddenly starts becoming confusing. I struggled a lot, yet this period has been revolutionary for me personally. First, I must thank the most important person in my life, my beloved wife, Neha, who selflessly supported me throughout this time and sacrificed so much just to ensure that I completed this book.

I need to thank Suresh John Celestin, who believed in me and offered me this break to write this book. Aditee Mirashi is one of the best editors to start your work with. She was extremely supportive and always there to respond to all my queries. You can imagine the number of questions that a person writing his first book must have had. I would like to especially thank Matthew Moodie, who dedicated his time to reading every single chapter and giving so many useful suggestions. Thanks, Matthew; I really appreciate it. Another person that I want to thank is Leonardo De Marchi, who had the patience to review every single line of code and check the appropriateness of each example. Thank you, Leo, for your feedback and your encouragement. It really made a difference to me and the book as well. I also want to thank my mentors, who have constantly pushed me to chase my dreams. Thank you, Alan Wexler, Dr. Vijay Agneeswaran, Sreenivas Venkatraman, Shoaib Ahmed, and Abhishek Kumar for your time.

Finally, I am infinitely grateful to my son, Ziaan, and my parents for the endless love and support irrespective of circumstances. You guys remind me that life is beautiful.



About the Author

Pramod Singh is a Manager, Data Science at Publicis.Sapient and works as a Data Science track lead for a project with Mercedes Benz. He has extensive hands-on experience in Machine Learning, Data Engineering, programming, and designing algorithms for various business requirements in domains such as retail, telecom, automobile, and consumer goods. He drives a lot of strategic initiatives that deal with Machine Learning and AI at Publicis.Sapient. He received his Bachelor’s degree in Electrical and Electronics Engineering from Mumbai University and an MBA (Operations & Finance) from Symbiosis International University, along with a Data Analytics Certification from IIM – Calcutta. He has spent the last eight-plus years working on multiple Data projects. He has used Machine Learning and Deep Learning techniques in numerous client projects using R, Python, Spark, and TensorFlow. He has also been a regular speaker at major conferences and universities. He conducts Data Science meetups at Publicis.Sapient and regularly presents webinars on ML and AI. He lives in Bangalore with his wife and two-year-old son. In his spare time, he enjoys playing guitar, coding, reading, and watching football.

About the Technical Reviewer

Leonardo De Marchi holds a Master’s in Artificial Intelligence and has worked as a Data Scientist in the sports world, with clients such as the New York Knicks and Manchester United, and with large social networks such as Justgiving.

He now works as Lead Data Scientist at Badoo, the largest dating site with over 360 million users. He is also the lead instructor at ideai.io, a company specializing in Deep Learning and Machine Learning training, and is a contractor for the European Commission.

 