Preface
    What this book covers
    What you need for this book
    Who this book is for
    Conventions
    Reader feedback
    Customer support
        Downloading the example code
        Downloading the color images of this book
        Errata
        Piracy
        Questions

A First Taste and What's New in Apache Spark V2
    Spark machine learning
    Spark Streaming
    Spark SQL
    Spark graph processing
    Extended ecosystem
    What's new in Apache Spark V2?
    Cluster design
    Cluster management
        Local
        Standalone
        Apache YARN
        Apache Mesos
        Cloud-based deployments
    Performance
        The cluster structure
        Hadoop Distributed File System
        Data locality
        Memory
        Coding
    Cloud
    Summary

Apache Spark SQL
    The SparkSession--your gateway to structured data processing
    Importing and saving data
        Processing the text files
        Processing JSON files
        Processing the Parquet files
    Understanding the DataSource API
        Implicit schema discovery
        Predicate push-down on smart data sources
    DataFrames
    Using SQL
        Defining schemas manually
        Using SQL subqueries
        Applying SQL table joins
    Using Datasets
        The Dataset API in action
    User-defined functions
    RDDs versus DataFrames versus Datasets
    Summary

The Catalyst Optimizer
    Understanding the workings of the Catalyst Optimizer
    Managing temporary views with the catalog API
    The SQL abstract syntax tree
    How to go from Unresolved Logical Execution Plan to Resolved Logical Execution Plan
    Internal class and object representations of LEPs
    How to optimize the Resolved Logical Execution Plan
    Physical Execution Plan generation and selection
    Code generation
    Practical examples
        Using the explain method to obtain the PEP
        How smart data sources work internally
    Summary

Project Tungsten
    Memory management beyond the Java Virtual Machine Garbage Collector
    Understanding the UnsafeRow object
        The null bit set region
        The fixed length values region
        The variable length values region
    Understanding the BytesToBytesMap
    A practical example on memory usage and performance
    Cache-friendly layout of data in memory
        Cache eviction strategies and pre-fetching
    Code generation
        Understanding columnar storage
        Understanding whole stage code generation
        A practical example on whole stage code generation performance
        Operator fusing versus the volcano iterator model
    Summary

Apache Spark Streaming
    Overview
    Errors and recovery
        Checkpointing
    Streaming sources
        TCP stream
        File streams
        Flume
        Kafka
    Summary

Structured Streaming
    The concept of continuous applications
        True unification - same code, same engine
    Windowing
        How streaming engines use windowing
        How Apache Spark improves windowing
    Increased performance with good old friends
    How transparent fault tolerance and exactly-once delivery guarantee is achieved
        Replayable sources can replay streams from a given offset
        Idempotent sinks prevent data duplication
        State versioning guarantees consistent results after reruns
    Example - connection to a MQTT message broker
        Controlling continuous applications
        More on stream life cycle management
    Summary

Apache Spark MLlib
    Architecture
    The development environment
    Classification with Naive Bayes
        Theory on Classification
        Naive Bayes in practice
    Clustering with K-Means
        Theory on Clustering
        K-Means in practice
    Artificial neural networks
        ANN in practice
    Summary

Apache SparkML
    What does the new API look like?
    The concept of pipelines
        Transformers
            String indexer
            OneHotEncoder
            VectorAssembler
            Pipelines
        Estimators
            RandomForestClassifier
    Model evaluation
    CrossValidation and hyperparameter tuning
        CrossValidation
        Hyperparameter tuning
    Winning a Kaggle competition with Apache SparkML
        Data preparation
        Feature engineering
        Testing the feature engineering pipeline
        Training the machine learning model
        Model evaluation
        CrossValidation and hyperparameter tuning
        Using the evaluator to assess the quality of the cross-validated and tuned model
    Summary

Apache SystemML
    Why do we need just another library?
        Why on Apache Spark?
    The history of Apache SystemML
    A cost-based optimizer for machine learning algorithms
        An example - alternating least squares
    Apache SystemML architecture
        Language parsing
        High-level operators are generated
        How low-level operators are optimized
    Performance measurements
    Apache SystemML in action
    Summary

Deep Learning on Apache Spark with DeepLearning4j and H2O
    H2O
        Overview
        The build environment
        Architecture
        Sourcing the data
        Data quality
        Performance tuning
        Deep Learning
        Example code - income
        The example code - MNIST
        H2O Flow
    Deeplearning4j
        ND4J - high performance linear algebra for the JVM
        Deeplearning4j
        Example: an IoT real-time anomaly detector
        Mastering chaos: the Lorenz attractor model
        Deploying the test data generator
            Deploying the Node-RED IoT Starter Boilerplate to the IBM Cloud
            Deploying the test data generator flow
            Testing the test data generator
        Installing the Deeplearning4j example within Eclipse
        Running the examples in Eclipse
        Running the examples in Apache Spark
    Summary

Apache Spark GraphX
    Overview
    Graph analytics/processing with GraphX
        The raw data
        Creating a graph
        Example 1 - counting
        Example 2 - filtering
        Example 3 - PageRank
        Example 4 - triangle counting
        Example 5 - connected components
    Summary

Apache Spark GraphFrames
    Architecture
        Graph-relational translation
        Materialized views
        Join elimination
        Join reordering
    Examples
        Example 1 - counting
        Example 2 - filtering
        Example 3 - page rank
        Example 4 - triangle counting
        Example 5 - connected components
    Summary

Apache Spark with Jupyter Notebooks on IBM DataScience Experience
    Why notebooks are the new standard
    Learning by example
        The IEEE PHM 2012 data challenge bearing dataset
        ETL with Scala
        Interactive, exploratory analysis using Python and Pixiedust
        Real data science work with SparkR
    Summary

Apache Spark on Kubernetes
    Bare metal, virtual machines, and containers
        Containerization
            Namespaces
            Control groups
            Linux containers
    Understanding the core concepts of Docker
    Understanding Kubernetes
    Using Kubernetes for provisioning containerized Spark applications
    Example--Apache Spark on Kubernetes
        Prerequisites
        Deploying the Apache Spark master
        Deploying the Apache Spark workers
        Deploying the Zeppelin notebooks
    Summary