
Hands-On Big Data Analytics with PySpark

Use PySpark to easily crunch messy data at scale, and discover proven techniques to create testable, immutable, and easily parallelizable Spark jobs

Key Features

  • Work with large amounts of agile data using distributed datasets and in-memory caching
  • Source data from popular platforms and formats, such as HDFS, Hive, JSON, and S3 (a short loading sketch follows this list)
  • Employ the easy-to-use PySpark API to deploy big data analytics for production
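
As a quick illustration of the data-sourcing point above, here is a minimal sketch using PySpark's standard DataFrame reader. It is not code from the book; the application name and file paths are placeholders.

    from pyspark.sql import SparkSession

    # Placeholder application name; add .enableHiveSupport() to the builder
    # if you also want to query Hive tables through this session.
    spark = SparkSession.builder.appName("data-sourcing-sketch").getOrCreate()

    # JSON file on HDFS (hypothetical path).
    events = spark.read.json("hdfs:///data/events.json")

    # CSV file on S3 (hypothetical bucket; requires the Hadoop S3A connector
    # on the classpath).
    sales = spark.read.csv("s3a://my-bucket/sales.csv", header=True, inferSchema=True)

    events.printSchema()
    sales.show(5)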

Book Description

Apache Spark is a mature, open source parallel-processing framework, widely used for data analytics applications across clusters of computers. In this book, you will not only learn how to use Spark and the Python API to create high-performance analytics with big data, but also discover techniques for testing, immutable design, and parallelizing Spark jobs.

You will learn how to source data from all popular data hosting platforms, including HDFS, Hive, JSON, and S3, and use PySpark to work with large datasets, gaining practical big data experience. This book will help you develop prototypes on local machines and then go on to handle messy data in production and at scale. It covers installing and setting up PySpark, RDD operations, big data cleaning and wrangling, and aggregating and summarizing data into useful reports. You will also learn practical, proven techniques to improve programming and administration in Apache Spark.

By the end of the book, you will be able to build big data analytical solutions using the various PySpark offerings and also optimize them effectively.

What you will learn

  • Get practical big data experience while working on messy datasets
  • Analyze patterns with Spark SQL to improve your business intelligence
  • Use PySpark's interactive shell to speed up development time
  • Create highly concurrent Spark programs by leveraging immutability
  • Discover ways to avoid the most expensive operation in the Spark API: the shuffle operation
  • Re-design your jobs to use reduceByKey instead of groupBy (a short sketch follows this list)
  • Create robust processing pipelines by testing Apache Spark jobs
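
To make the reduceByKey point above concrete, here is a minimal sketch, not taken from the book, of computing per-key averages; the data and names are purely illustrative.

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local[*]").setAppName("reduce-by-key-sketch")
    sc = SparkContext(conf=conf)

    # Illustrative (key, value) pairs, e.g. (user, score).
    pairs = sc.parallelize([("a", 1.0), ("a", 3.0), ("b", 2.0), ("b", 4.0), ("b", 6.0)])

    # Map each value to (sum, count), then merge partial results per key.
    # reduceByKey pre-aggregates on each partition, so far less data is
    # shuffled than with groupByKey followed by a per-key average.
    sums_counts = (pairs
                   .mapValues(lambda v: (v, 1))
                   .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1])))

    averages = sums_counts.mapValues(lambda t: t[0] / t[1])
    print(sorted(averages.collect()))  # [('a', 2.0), ('b', 4.0)]

    sc.stop()

The same pattern generalizes to any combine step that is associative and commutative, which is what reduceByKey requires.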

Who this book is for

This book is for developers, data scientists, business analysts, and anyone who needs to reliably analyze large amounts of real-world data. Whether you're tasked with creating your company's business intelligence function, building data platforms for your machine learning models, or looking to use code to magnify the impact of your business, this book is for you.

Table of Contents

  1. Title Page
  2. Copyright and Credits
    1. Hands-On Big Data Analytics with PySpark
  3. About Packt
    1. Why subscribe?
    2. Packt.com
  4. Contributors
    1. About the authors
    2. Packt is searching for authors like you
  5. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
      1. Download the example code files
      2. Download the color images
      3. Conventions used
    4. Get in touch
      1. Reviews
  6. Installing PySpark and Setting up Your Development Environment
    1. An overview of PySpark
      1. Spark SQL
    2. Setting up Spark on Windows and PySpark
    3. Core concepts in Spark and PySpark
      1. SparkContext
      2. Spark shell
      3. SparkConf
    4. Summary
  7. Getting Your Big Data into the Spark Environment Using RDDs
    1. Loading data on to Spark RDDs
      1. The UCI machine learning repository
      2. Getting the data from the repository to Spark
      3. Getting data into Spark
    2. Parallelization with Spark RDDs
      1. What is parallelization?
    3. Basics of RDD operation
    4. Summary
  8. Big Data Cleaning and Wrangling with Spark Notebooks
    1. Using Spark Notebooks for quick iteration of ideas
    2. Sampling/filtering RDDs to pick out relevant data points
    3. Splitting datasets and creating some new combinations
    4. Summary
  9. Aggregating and Summarizing Data into Useful Reports
    1. Calculating averages with map and reduce
    2. Faster average computations with aggregate
    3. Pivot tabling with key-value paired data points
    4. Summary
  10. Powerful Exploratory Data Analysis with MLlib
    1. Computing summary statistics with MLlib
    2. Using Pearson and Spearman correlations to discover correlations
      1. The Pearson correlation
      2. The Spearman correlation
      3. Computing Pearson and Spearman correlations
    3. Testing our hypotheses on large datasets
    4. Summary
  11. Putting Structure on Your Big Data with SparkSQL
    1. Manipulating DataFrames with Spark SQL schemas
    2. Using Spark DSL to build queries
    3. Summary
  12. Transformations and Actions
    1. Using Spark transformations to defer computations to a later time
    2. Avoiding transformations
    3. Using the reduce and reduceByKey methods to calculate the results
    4. Performing actions that trigger computations
    5. Reusing the same RDD for different actions
    6. Summary
  13. Immutable Design
    1. Delving into the Spark RDD's parent/child chain
      1. Extending an RDD
      2. Chaining a new RDD with the parent
      3. Testing our custom RDD
    2. Using RDD in an immutable way
    3. Using DataFrame operations to transform
    4. Immutability in the highly concurrent environment
    5. Using the Dataset API in an immutable way
    6. Summary
  14. Avoiding Shuffle and Reducing Operational Expenses
    1. Detecting a shuffle in a process
    2. Testing operations that cause a shuffle in Apache Spark
    3. Changing the design of jobs with wide dependencies
    4. Using keyBy() operations to reduce shuffle
    5. Using a custom partitioner to reduce shuffle
    6. Summary
  15. Saving Data in the Correct Format
    1. Saving data in plain text format
    2. Leveraging JSON as a data format
    3. Tabular formats – CSV
    4. Using Avro with Spark
    5. Columnar formats – Parquet
    6. Summary
  16. Working with the Spark Key/Value API
    1. Available actions on key/value pairs
    2. Using aggregateByKey instead of groupBy()
    3. Actions on key/value pairs
    4. Available partitioners on key/value data
    5. Implementing a custom partitioner
    6. Summary
  17. Testing Apache Spark Jobs
    1. Separating logic from the Spark engine for unit testing
    2. Integration testing using SparkSession
    3. Mocking data sources using partial functions
    4. Using ScalaCheck for property-based testing
    5. Testing in different versions of Spark
    6. Summary
  18. Leveraging the Spark GraphX API
    1. Creating a graph from a data source
      1. Creating the loader component
      2. Revisiting the graph format
      3. Loading Spark from file
    2. Using the Vertex API
      1. Constructing a graph using the vertex
      2. Creating couple relationships
    3. Using the Edge API
      1. Constructing the graph using edge
    4. Calculating the degree of the vertex
      1. The in-degree
      2. The out-degree
    5. Calculating PageRank
      1. Loading and reloading data about users and followers
    6. Summary
  19. Other Books You May Enjoy
    1. Leave a review - let other readers know what you think