contents

foreword

preface

acknowledgments

about this book 

about the author 

about the cover illustration

Part 1. The theory crippled by awesome examples

1. So, what is Spark, anyway?

1.1 The big picture: What Spark is and what it does

What is Spark?

The four pillars of mana

1.2 How can you use Spark?

Spark in a data processing/engineering scenario

Spark in a data science scenario

1.3 What can you do with Spark?

Spark predicts restaurant quality at NC eateries

Spark allows fast data transfer for Lumeris

Spark analyzes equipment logs for CERN

Other use cases

1.4 Why you will love the dataframe

The dataframe from a Java perspective

The dataframe from an RDBMS perspective

A graphical representation of the dataframe

1.5 Your first example

Recommended software

Downloading the code

Running your first application

Your first code

2. Architecture and flow

2.1 Building your mental model

2.2 Using Java code to build your mental model

2.3 Walking through your application

Connecting to a master

Loading, or ingesting, the CSV file

Transforming your data

Saving the work done in your dataframe to a database

3. The majestic role of the dataframe

3.1 The essential role of the dataframe in Spark

Organization of a dataframe

Immutability is not a swear word

3.2 Using dataframes through examples

A dataframe after a simple CSV ingestion

Data is stored in partitions

Digging in the schema

A dataframe after a JSON ingestion

Combining two dataframes

3.3 The dataframe is a Dataset<Row>

Reusing your POJOs

Creating a dataset of strings

Converting back and forth

3.4 Dataframe’s ancestor: the RDD

4. Fundamentally lazy

4.1 A real-life example of efficient laziness

4.2 A Spark example of efficient laziness

Looking at the results of transformations and actions

The transformation process, step by step

The code behind the transformation/action process

The mystery behind the creation of 7 million datapoints in 182 ms

The mystery behind the timing of actions

4.3 Comparing to RDBMS and traditional applications

Working with the teen birth rates dataset

Analyzing differences between a traditional app and a Spark app

4.4 Spark is amazing for data-focused applications

4.5 Catalyst is your app catalyzer

5. Building a simple app for deployment

5.1 An ingestionless example

Calculating π

The code to approximate π

What are lambda functions in Java?

Approximating π by using lambda functions

5.2 Interacting with Spark

Local mode

Cluster mode

Interactive mode in Scala and Python

6. Deploying your simple app

6.1 Beyond the example: The role of the components

Quick overview of the components and their interactions

Troubleshooting tips for the Spark architecture

Going further

6.2 Building a cluster

Building a cluster that works for you

Setting up the environment

6.3 Building your application to run on the cluster

Building your application’s uber JAR

Building your application by using Git and Maven

6.4 Running your application on the cluster

Submitting the uber JAR

Running the application

Analyzing the Spark user interface

Part 2. Ingestion

7. Ingestion from files

7.1 Common behaviors of parsers

7.2 Complex ingestion from CSV

Desired output

Code

7.3 Ingesting a CSV with a known schema

Desired output

Code

7.4 Ingesting a JSON file

Desired output

Code

7.5 Ingesting a multiline JSON file

Desired output

Code

7.6 Ingesting an XML file

Desired output

Code

7.7 Ingesting a text file

Desired output

Code

7.8 File formats for big data

The problem with traditional file formats

Avro is a schema-based serialization format

ORC is a columnar storage format

Parquet is also a columnar storage format

Comparing Avro, ORC, and Parquet

7.9 Ingesting Avro, ORC, and Parquet files

Ingesting Avro

Ingesting ORC

Ingesting Parquet

Reference table for ingesting Avro, ORC, or Parquet

8. Ingestion from databases

8.1 Ingestion from relational databases

Database connection checklist

Understanding the data used in the examples

Desired output

Code

Alternative code

8.2 The role of the dialect

What is a dialect, anyway?

JDBC dialects provided with Spark

Building your own dialect

8.3 Advanced queries and ingestion

Filtering by using a WHERE clause

Joining data in the database

Performing ingestion and partitioning

Summary of advanced features

8.4 Ingestion from Elasticsearch

Data flow

The New York restaurants dataset digested by Spark

Code to ingest the restaurant dataset from Elasticsearch

9. Advanced ingestion: Finding data sources and building your own

9.1 What is a data source?

9.2 Benefits of a direct connection to a data source

Temporary files

Data quality scripts

Data on demand

9.3 Finding data sources at Spark Packages

9.4 Building your own data source

Scope of the example project

Your data source API and options

9.5 Behind the scenes: Building the data source itself

9.6 Using the register file and the advertiser class

9.7 Understanding the relationship between the data and schema

The data source builds the relation

Inside the relation

9.8 Building the schema from a JavaBean

9.9 Building the dataframe is magic with the utilities

9.10 The other classes

10. Ingestion through structured streaming

10.1 What’s streaming?

10.2 Creating your first stream

Generating a file stream

Consuming the records

Getting records, not lines

10.3 Ingesting data from network streams

10.4 Dealing with multiple streams

10.5 Differentiating discretized and structured streaming

Part 3. Transforming your data

11. Working with SQL

11.1 Working with Spark SQL

11.2 The difference between local and global views

11.3 Mixing the dataframe API and Spark SQL

11.4 Don’t DELETE it!

11.5 Going further with SQL

12. Transforming your data

12.1 What is data transformation?

12.2 Process and example of record-level transformation

Data discovery to understand the complexity

Data mapping to draw the process

Writing the transformation code

Reviewing your data transformation to ensure a quality process

What about sorting?

Wrapping up your first Spark transformation

12.3 Joining datasets

A closer look at the datasets to join

Building the list of higher education institutions per county

Performing the joins

12.4 Performing more transformations

13. Transforming entire documents

13.1 Transforming entire documents and their structure

Flattening your JSON document

Building nested documents for transfer and storage

13.2 The magic behind static functions

13.3 Performing more transformations

13.4 Summary

14. Extending transformations with user-defined functions

14.1 Extending Apache Spark

14.2 Registering and calling a UDF

Registering the UDF with Spark

Using the UDF with the dataframe API

Manipulating UDFs with SQL

Implementing the UDF

Writing the service itself

14.3 Using UDFs to ensure a high level of data quality

14.4 Considering UDFs’ constraints

15. Aggregating your data

15.1 Aggregating data with Spark

A quick reminder on aggregations

Performing basic aggregations with Spark

15.2 Performing aggregations with live data

Preparing your dataset

Aggregating data to better understand the schools

15.3 Building custom aggregations with UDAFs

Part 4. Going further

16. Cache and checkpoint: Enhancing Spark’s performances

16.1 Caching and checkpointing can increase performance

The usefulness of Spark caching

The subtle effectiveness of Spark checkpointing

Using caching and checkpointing

16.2 Caching in action

16.3 Going further in performance optimization

17. Exporting data and building full data pipelines

17.1 Exporting data

Building a pipeline with NASA datasets

Transforming columns to datetime

Transforming the confidence percentage to confidence level

Exporting the data

Exporting the data: What really happened?

17.2 Delta Lake: Enjoying a database close to your system

Understanding why a database is needed

Using Delta Lake in your data pipeline

Consuming data from Delta Lake

17.3 Accessing cloud storage services from Spark

18. Exploring deployment constraints: Understanding the ecosystem

18.1 Managing resources with YARN, Mesos, and Kubernetes

The built-in standalone mode manages resources

YARN manages resources in a Hadoop environment

Mesos is a standalone resource manager

Kubernetes orchestrates containers

Choosing the right resource manager

18.2 Sharing files with Spark

Accessing the data contained in files

Sharing files through distributed filesystems

Accessing files on shared drives or a file server

Using file-sharing services to distribute files

Other options for accessing files in Spark

Hybrid solution for sharing files with Spark

18.3 Making sure your Spark application is secure

Securing the network components of your infrastructure

Securing Spark’s disk usage

Appendixes

appendix A Installing Eclipse

appendix B Installing Maven

appendix C Installing Git

appendix D Downloading the code and getting started with Eclipse

appendix E A history of enterprise data

appendix F Getting help with relational databases

appendix G Static functions ease your transformations

appendix H Maven quick cheat sheet

appendix I Reference for transformations and actions

appendix J Enough Scala

appendix K Installing Spark in production and a few tips

appendix L Reference for ingestion

appendix M Reference for joins

appendix N Installing Elasticsearch and sample data

appendix O Generating streaming data

appendix P Reference for streaming

appendix Q Reference for exporting data

appendix R Finding help when you’re stuck

index
