Book Description

Apache Spark’s speed, ease of use, sophisticated analytics, and multilanguage support make practical knowledge of this cluster-computing framework a required skill for data engineers and data scientists. With this hands-on guide, anyone looking for an introduction to Spark will learn practical algorithms and examples for this framework using PySpark.

In each chapter, author Mahmoud Parsian shows you how to solve a data problem with a set of Spark transformations and algorithms. You’ll learn how to tackle problems involving ETL, design patterns, machine learning algorithms, data partitioning, and genomics analysis. Each detailed recipe includes PySpark algorithms that you can run with the PySpark driver and shell scripts.

With this book, you will:

  • Learn how to select Spark transformations for optimized solutions
  • Explore powerful transformations and reductions including reduceByKey(), combineByKey(), and mapPartitions()
  • Understand data partitioning for optimized queries
  • Design machine learning algorithms including Naive Bayes, linear regression, and logistic regression
  • Build and apply a model using PySpark design patterns
  • Apply motif finding algorithms to graph data
  • Analyze graph data by using the GraphFrames API
  • Apply PySpark algorithms to clinical and genomics data (such as DNA-Seq)
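As a taste of the reduction pattern the first chapter develops, the average-by-key computation (map each rating to a (sum, count) pair, combine pairs with an associative monoid operation, then divide) can be sketched in plain Python. This is a Spark-free illustration with made-up names; in a real PySpark job, Step 2 would be expressed with `reduceByKey()`.

```python
def add_pairs(a, b):
    # Associative, commutative combiner with identity (0, 0): a monoid,
    # which is what makes it safe for Spark to apply it per partition.
    return (a[0] + b[0], a[1] + b[1])

# Hypothetical sample data: (movie, rating) records.
ratings = [("movie1", 4.0), ("movie2", 3.0), ("movie1", 5.0)]

# Step 1: map each rating to (key, (rating, 1)).
pairs = [(movie, (rating, 1)) for movie, rating in ratings]

# Step 2: reduce by key, mimicking what reduceByKey(add_pairs) computes.
totals = {}
for movie, pair in pairs:
    totals[movie] = add_pairs(totals.get(movie, (0, 0)), pair)

# Step 3: derive the average from each (sum, count) pair.
averages = {movie: s / c for movie, (s, c) in totals.items()}
print(averages)  # {'movie1': 4.5, 'movie2': 3.0}
```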

Table of Contents

  1. Reductions in Spark
    1. Creating Pair RDDs
      1. Example: Using Collections
      2. Example: Using map() Transformation
    2. Reducer Transformations
    3. Spark’s Reductions
    4. What is a Reduction?
    5. Spark’s Reduction Transformations
    6. Simple Warmup Example
    7. Solution by reduceByKey()
      1. Using Lambda Expressions
      2. Using Functions
    8. Solution by groupByKey()
    9. Solution by aggregateByKey()
    10. Solution by combineByKey()
    11. What is a Monoid?
    12. Monoid Examples
    13. Non-Monoid Examples
    14. Movie Problem
    15. Input Data Set to Analyze
    16. Ratings Data File Structure (ratings.csv)
    17. Solution Using aggregateByKey()
    18. Results
    19. How does aggregateByKey() work?
    20. PySpark Solution using aggregateByKey()
      1. Step 1: Read Data and Create Pairs
      2. Step 2: Use aggregateByKey() to Sum Up Ratings
      3. Step 3: Find Average Rating
    21. Complete PySpark Solution by groupByKey()
    22. PySpark Solution using groupByKey()
      1. Step 1: Read Data and Create Pairs
      2. Step 2: Use groupByKey() to Group Ratings
      3. Step 3: Find Average Rating
    23. Shuffle Step in Reductions
    24. Shuffle Step for groupByKey()
    25. Shuffle Step for reduceByKey()
    26. Complete PySpark Solution using reduceByKey()
      1. Step 1: Read Data and Create Pairs
      2. Step 2: Use reduceByKey() to Sum up Ratings
      3. Step 3: Find Average Rating
    27. Complete PySpark Solution using combineByKey()
    28. PySpark Solution using combineByKey()
      1. Step 1: Read Data and Create Pairs
      2. Step 2: Use combineByKey() to Sum up Ratings
      3. Step 3: Find Average Rating
    29. Comparison of Reductions
    30. Summary
  2. Data Design Patterns
    1. InMapper Combining
    2. Basic MapReduce Design Pattern
    3. InMapper Combining Per Record
    4. InMapper Combiner Per Partition
    5. Top-10
    6. Top-N Formalized
    7. PySpark Solution
    8. Implementation in PySpark
    9. How to Find Bottom-10
    10. MinMax
      1. Solution-1: Classic MapReduce
      2. Solution-2: Sorting
      3. Solution-3: Spark’s mapPartitions()
      4. Solution-3 Input
      5. PySpark Solution
    11. The Composite Pattern and Monoids
      1. Composite Pattern
    12. Monoids
      1. Definition of Monoid
      2. How to form a Monoid?
      3. Monoidic and Non-Monoidic Examples
      4. Non-Commutative Example
      5. Median over Set of Integers
      6. Concatenation over Lists
      7. Union and Intersection over Integers
      8. Matrix Example
    13. Not a Monoid Example
    14. Monoid Example
    15. PySpark Implementation of Monoidized Mean
      1. Input
      2. PySpark Solution
    16. Conclusion on Using Monoids
      1. Functors and Monoids
    17. Map-Side Join
    18. Efficient Joins using Bloom filters
    19. Bloom filter
      1. A Simple Bloom Filter Example
      2. Bloom Filter in Python
      3. Using Bloom Filter in PySpark
    20. Summary