Part 1. The theory crippled by awesome examples
1.1 The big picture: What Spark is and what it does
Spark in a data processing/engineering scenario
Spark in a data science scenario
1.3 What can you do with Spark?
Spark predicts restaurant quality at NC eateries
Spark allows fast data transfer for Lumeris
Spark analyzes equipment logs for CERN
1.4 Why you will love the dataframe
The dataframe from a Java perspective
The dataframe from an RDBMS perspective
A graphical representation of the dataframe
Downloading the code
Running your first application
2.1 Building your mental model
2.2 Using Java code to build your mental model
2.3 Walking through your application
Loading, or ingesting, the CSV file
Saving the work done in your dataframe to a database
3. The majestic role of the dataframe
3.1 The essential role of the dataframe in Spark
Immutability is not a swear word
3.2 Using dataframes through examples
A dataframe after a simple CSV ingestion
A dataframe after a JSON ingestion
3.3 The dataframe is a Dataset<Row>
Creating a dataset of strings
Converting back and forth
3.4 Dataframe’s ancestor: the RDD
4.1 A real-life example of efficient laziness
4.2 A Spark example of efficient laziness
Looking at the results of transformations and actions
The transformation process, step by step
The code behind the transformation/action process
The mystery behind the creation of 7 million datapoints in 182 ms
The mystery behind the timing of actions
4.3 Comparing to RDBMS and traditional applications
Working with the teen birth rates dataset
Analyzing differences between a traditional app and a Spark app
4.4 Spark is amazing for data-focused applications
4.5 Catalyst is your app catalyzer
5. Building a simple app for deployment
What are lambda functions in Java?
Approximating π by using lambda functions
Interactive mode in Scala and Python
6.1 Beyond the example: The role of the components
Quick overview of the components and their interactions
Troubleshooting tips for the Spark architecture
Building a cluster that works for you
6.3 Building your application to run on the cluster
Building your application’s uber JAR
Building your application by using Git and Maven
6.4 Running your application on the cluster
Running the application
Analyzing the Spark user interface
7.1 Common behaviors of parsers
7.2 Complex ingestion from CSV
7.3 Ingesting a CSV with a known schema
7.5 Ingesting a multiline JSON file
The problem with traditional file formats
Avro is a schema-based serialization format
ORC is a columnar storage format
Parquet is also a columnar storage format
Comparing Avro, ORC, and Parquet
7.9 Ingesting Avro, ORC, and Parquet files
Reference table for ingesting Avro, ORC, or Parquet
8.1 Ingestion from relational databases
Understanding the data used in the examples
JDBC dialects provided with Spark
8.3 Advanced queries and ingestion
Filtering by using a WHERE clause
Performing ingestion and partitioning
Summary of advanced features
8.4 Ingestion from Elasticsearch
The New York restaurants dataset digested by Spark
Code to ingest the restaurant dataset from Elasticsearch
9. Advanced ingestion: Finding data sources and building your own
9.2 Benefits of a direct connection to a data source
9.3 Finding data sources at Spark Packages
9.4 Building your own data source
Your data source API and options
9.5 Behind the scenes: Building the data source itself
9.6 Using the register file and the advertiser class
9.7 Understanding the relationship between the data and schema
The data source builds the relation
9.8 Building the schema from a JavaBean
9.9 Building the dataframe is magic with the utilities
10. Ingestion through structured streaming
10.2 Creating your first stream
Consuming the records
Getting records, not lines
10.3 Ingesting data from network streams
10.4 Dealing with multiple streams
10.5 Differentiating discretized and structured streaming
Part 3. Transforming your data
11.2 The difference between local and global views
11.3 Mixing the dataframe API and Spark SQL
12.1 What is data transformation?
12.2 Process and example of record-level transformation
Data discovery to understand the complexity
Data mapping to draw the process
Wrapping up your first Spark transformation
A closer look at the datasets to join
Building the list of higher education institutions per county
12.4 Performing more transformations
13. Transforming entire documents
13.1 Transforming entire documents and their structure
Building nested documents for transfer and storage
13.2 The magic behind static functions
13.3 Performing more transformations
14. Extending transformations with user-defined functions
14.2 Registering and calling a UDF
Registering the UDF with Spark
Using the UDF with the dataframe API
Manipulating UDFs with SQL
Implementing the UDF
14.3 Using UDFs to ensure a high level of data quality
14.4 Considering UDFs’ constraints
15.1 Aggregating data with Spark
A quick reminder on aggregations
Performing basic aggregations with Spark
15.2 Performing aggregations with live data
Aggregating data to better understand the schools
15.3 Building custom aggregations with UDAFs
16. Cache and checkpoint: Enhancing Spark’s performance
16.1 Caching and checkpointing can increase performance
The usefulness of Spark caching
The subtle effectiveness of Spark checkpointing
Using caching and checkpointing
16.3 Going further in performance optimization
17. Exporting data and building full data pipelines
Building a pipeline with NASA datasets
Transforming columns to datetime
Transforming the confidence percentage to confidence level
Exporting the data
Exporting the data: What really happened?
17.2 Delta Lake: Enjoying a database close to your system
Understanding why a database is needed
Using Delta Lake in your data pipeline
Consuming data from Delta Lake
17.3 Accessing cloud storage services from Spark
18. Exploring deployment constraints: Understanding the ecosystem
18.1 Managing resources with YARN, Mesos, and Kubernetes
The built-in standalone mode manages resources
YARN manages resources in a Hadoop environment
Mesos is a standalone resource manager
Kubernetes orchestrates containers
Choosing the right resource manager
Accessing the data contained in files
Sharing files through distributed filesystems
Accessing files on shared drives or a file server
Using file-sharing services to distribute files
Other options for accessing files in Spark
Hybrid solution for sharing files with Spark
18.3 Making sure your Spark application is secure
Securing the network components of your infrastructure
Securing Spark’s disk usage
appendix D Downloading the code and getting started with Eclipse
appendix E A history of enterprise data
appendix F Getting help with relational databases
appendix G Static functions ease your transformations
appendix H Maven quick cheat sheet
appendix I Reference for transformations and actions
appendix K Installing Spark in production and a few tips
appendix L Reference for ingestion
appendix M Reference for joins
appendix N Installing Elasticsearch and sample data
appendix O Generating streaming data
appendix P Reference for streaming
appendix Q Reference for exporting data
appendix R Finding help when you’re stuck