Raju Kumar Mishra
PySpark Recipes: A Problem-Solution Approach with PySpark2
Raju Kumar Mishra
Bangalore, Karnataka, India
Any source code or other supplementary material referenced by the author in this book is available to readers on GitHub via the book’s product page, located at www.apress.com. For more detailed information, please visit www.apress.com/source-code.
ISBN 978-1-4842-3140-1
e-ISBN 978-1-4842-3141-8
Library of Congress Control Number: 2017962438
© Raju Kumar Mishra 2018
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image, we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Printed on acid-free paper
Distributed to the book trade worldwide by Springer Science + Business Media New York, 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail [email protected], or visit www.springeronline.com. Apress Media, LLC is a California LLC, and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.
To the Almighty, who guides me in every aspect of my life. And to my mother, Smt. Savitri Mishra, and my lovely wife, Smt. Smita Rani Pathak.
Introduction
This book will take you on an interesting journey to learn about PySpark and big data through a problem-solution approach. Every problem is followed by a detailed, step-by-step answer, which will improve your thought process for solving big data problems with PySpark. This book is divided into nine chapters. Here’s a brief description of each chapter:
Chapter 1, “The Era of Big Data, Hadoop, and Other Big Data Processing Frameworks,” covers many big data processing tools such as Apache Hadoop, Apache Pig, Apache Hive, and Apache Spark. The shortcomings of Hadoop and the evolution of Spark are discussed. Apache Kafka is explained as a publish-subscribe system. This chapter also sheds light on HBase, a NoSQL database.
Chapter 2, “Installation,” will take you to the real battleground. You’ll learn how to install many big data processing tools such as Hadoop, Hive, Spark, Apache Mesos, and Apache HBase.
Chapter 3, “Introduction to Python and NumPy,” is for newcomers to Python. You will learn about the basics of Python and NumPy by following a problem-solution approach. Problems in this chapter are data-science oriented.
Chapter 4, “Spark Architecture and the Resilient Distributed Dataset,” explains the architecture of Spark and introduces resilient distributed datasets. You’ll learn about creating RDDs and using data-analysis algorithms for data aggregation, data filtering, and set operations on RDDs.
Chapter 5, “The Power of Pairs: Paired RDD,” shows how to create paired RDDs and how to perform data aggregation, data joining, and other algorithms on these paired RDDs.
Chapter 6, “I/O in PySpark,” will teach you how to read data from various types of files and save the result as an RDD.
Chapter 7, “Optimizing PySpark and PySpark Streaming,” is one of the most important chapters. You will start by optimizing a page-rank algorithm. Then you’ll implement a k-nearest neighbors algorithm and optimize it by using broadcast variables provided by the PySpark framework. Learning PySpark Streaming will finally lead us into integrating Apache Kafka with the PySpark Streaming framework.
Chapter 8, “PySparkSQL,” is paradise for readers who use SQL. But newcomers will also learn PySparkSQL in order to write SQL-like queries on DataFrames by using a problem-solution approach. Apart from DataFrames, we will also implement the graph algorithms breadth-first search and page rank by using the GraphFrames library.
Chapter 9, “PySpark MLlib and Linear Regression,” describes PySpark’s machine-learning library, MLlib. You will see many recipes on various data structures provided by PySpark MLlib. You’ll also implement linear regression. Recipes on lasso and ridge regression are included in the chapter.
Acknowledgments
My heartiest thanks to the Almighty. I also would like to thank my mother, Smt. Savitri Mishra; my sisters, Mitan and Priya; my cousins, Suchitra and Chandni; and my maternal uncle, Shyam Bihari Pandey, for their support and encouragement. I am very grateful to my sweet and beautiful wife, Smt. Smita Rani Pathak, for her continuous encouragement and love while I was writing this book. I thank my brother-in-law, Mr. Prafull Chandra Pandey, for his encouragement to write this book. I am very thankful to my sisters-in-law, Rinky, Reena, Kshama, Charu, Dhriti, Kriti, and Jyoti, for their encouragement as well. I am grateful to Anurag Pal Sehgal, Saurabh Gupta, Devendra Mani Tripathi, and all my friends. Last but not least, thanks to Coordinating Editor Sanchita Mandal, Acquisitions Editor Celestin Suresh John, and Development Editor Laura Berendson at Apress; without them, this book would not have been possible.
Contents
Index
About the Author and About the Technical Reviewer
About the Author
Raju Kumar Mishra has a strong interest in data science and systems that have the capability of handling large amounts of data and operating complex mathematical models through computational programming. He was inspired to pursue a Master of Technology degree in computational sciences from the Indian Institute of Science in Bangalore, India. Raju primarily works in the areas of data science and its various applications. Working as a corporate trainer, he has developed unique insights that help him in teaching and explaining complex ideas with ease. Raju is also a data science consultant who solves complex industrial problems. He works on programming tools such as R, Python, scikit-learn, Statsmodels, Hadoop, Hive, Pig, Spark, and many others.
About the Technical Reviewer
Sundar Rajan Raman is an artificial intelligence practitioner currently working for Bank of America. He holds a Bachelor of Technology degree from the National Institute of Technology in India. Being a seasoned Java and J2EE programmer, he has worked at companies such as AT&T, Singtel, and Deutsche Bank. He is a messaging platform specialist with vast experience on SonicMQ, WebSphere MQ, and TIBCO software, with respective certifications. His current focus is on artificial intelligence, including machine learning and neural networks. More information is available at https://in.linkedin.com/pub/sundar-rajan-raman/7/905/488.
I would like to thank my wife, Hema, and my daughter, Shriya, for their patience during the review process.