The amount of data being generated today is staggering--and growing. Apache Spark has emerged as the de facto tool for analyzing big data and is now a critical part of the data science toolbox. Updated for Spark 3.0, this practical guide brings together Spark, statistical methods, and real-world datasets to teach you how to approach analytics problems using PySpark, Spark's Python API, along with best practices in Spark programming.

Data scientists Akash Tandon, Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills offer an introduction to the Spark ecosystem, then dive into patterns that apply common techniques--including classification, clustering, collaborative filtering, and anomaly detection--to fields such as genomics, security, and finance. This updated edition also covers NLP and image processing.

If you have a basic understanding of machine learning and statistics and you program in Python, this book will get you started with large-scale data analysis.

  • Familiarize yourself with Spark's programming model and ecosystem
  • Learn general approaches in data science
  • Examine complete implementations that analyze large public datasets
  • Discover which machine learning tools make sense for particular problems
  • Explore code that can be adapted to many uses

Table of Contents

  1. Analyzing Big Data
    1. Working with Big Data
    2. Introducing Apache Spark and PySpark
    3. Components
    4. PySpark
    5. Ecosystem
    6. Spark 3.0
    7. PySpark Addresses Challenges of Data Science
    8. Where to Go from Here
  2. Introduction to Data Analysis with PySpark
    1. Spark Architecture
    2. Installing PySpark
    3. Setting Up Our Data
    4. Analyzing Data with the DataFrame API
    5. Fast Summary Statistics for DataFrames
    6. Pivoting and Reshaping DataFrames
    7. Joining DataFrames and Selecting Features
    8. Scoring and Model Evaluation
    9. Where to Go from Here
  3. Recommending Music and the Audioscrobbler Data Set
    1. Setting Up the Data
    2. Our Requirements from a Recommender System
    3. Alternating Least Squares Algorithm
    4. Preparing the Data
    5. Building a First Model
    6. Spot Checking Recommendations
    7. Evaluating Recommendation Quality
    8. Computing AUC
    9. Hyperparameter Selection
    10. Making Recommendations
    11. Where to Go from Here
  4. Predicting Forest Cover with Decision Trees
    1. Decision Trees and Forests
    2. Preparing the Data
    3. Our First Decision Tree
    4. Decision Tree Hyperparameters
    5. Tuning Decision Trees
    6. Categorical Features Revisited
    7. Random Forests
    8. Making Predictions
    9. Where to Go from Here
  5. Anomaly Detection in Network Traffic with K-means Clustering
    1. Anomaly Detection
    2. K-means Clustering
    3. Network Intrusion
    4. KDD Cup 1999 Data Set
    5. A First Take on Clustering
    6. Choosing k
    7. Visualization with SparkR
    8. Feature Normalization
    9. Categorical Variables
    10. Using Labels with Entropy
    11. Clustering in Action
    12. Where to Go from Here
  6. Estimating Financial Risk
    1. Terminology
    2. Methods for Calculating VaR
    3. Variance-Covariance
    4. Historical Simulation
    5. Monte Carlo Simulation
    6. Our Model
    7. Getting the Data
    8. Preprocessing
    9. Determining the Factor Weights
    10. Sampling
    11. The Multivariate Normal Distribution
    12. Running the Trials
    13. Visualizing the Distribution of Returns
    14. Evaluating Our Results
    15. Where to Go from Here