
Spark in Action, Second Edition, teaches you to create end-to-end analytics applications. In this entirely new book, you’ll learn from practical Java-based examples, including a complete data pipeline for processing NASA satellite data. You’ll also find Java, Python, and Scala code samples hosted on GitHub that you can explore and adapt, plus appendixes with cheat sheets for installing tools and understanding Spark-specific terms.

Table of Contents

  1. Copyright
  2. brief contents
  3. contents
  4. front matter
    1. foreword
    2. The analytics operating system
    3. preface
    4. acknowledgments
    5. about this book
    6. Who should read this book
    7. What will you learn in this book?
    8. How this book is organized
    9. About the code
    10. liveBook discussion forum
    11. about the author
    12. about the cover illustration
  5. Part 1. The theory crippled by awesome examples
  6. 1. So, what is Spark, anyway?
    1. 1.1 The big picture: What Spark is and what it does
    2. 1.1.1 What is Spark?
    3. 1.1.2 The four pillars of mana
    4. 1.2 How can you use Spark?
    5. 1.2.1 Spark in a data processing/engineering scenario
    6. 1.2.2 Spark in a data science scenario
    7. 1.3 What can you do with Spark?
    8. 1.3.1 Spark predicts restaurant quality at NC eateries
    9. 1.3.2 Spark allows fast data transfer for Lumeris
    10. 1.3.3 Spark analyzes equipment logs for CERN
    11. 1.3.4 Other use cases
    12. 1.4 Why you will love the dataframe
    13. 1.4.1 The dataframe from a Java perspective
    14. 1.4.2 The dataframe from an RDBMS perspective
    15. 1.4.3 A graphical representation of the dataframe
    16. 1.5 Your first example
    17. 1.5.1 Recommended software
    18. 1.5.2 Downloading the code
    19. 1.5.3 Running your first application
    20. Command line
    21. Eclipse
    22. 1.5.4 Your first code
    23. Summary
  7. 2. Architecture and flow
    1. 2.1 Building your mental model
    2. 2.2 Using Java code to build your mental model
    3. 2.3 Walking through your application
    4. 2.3.1 Connecting to a master
    5. 2.3.2 Loading, or ingesting, the CSV file
    6. 2.3.3 Transforming your data
    7. 2.3.4 Saving the work done in your dataframe to a database
    8. Summary
  8. 3. The majestic role of the dataframe
    1. 3.1 The essential role of the dataframe in Spark
    2. 3.1.1 Organization of a dataframe
    3. 3.1.2 Immutability is not a swear word
    4. 3.2 Using dataframes through examples
    5. 3.2.1 A dataframe after a simple CSV ingestion
    6. 3.2.2 Data is stored in partitions
    7. 3.2.3 Digging in the schema
    8. 3.2.4 A dataframe after a JSON ingestion
    9. 3.2.5 Combining two dataframes
    10. 3.3 The dataframe is a Dataset<Row>
    11. 3.3.1 Reusing your POJOs
    12. 3.3.2 Creating a dataset of strings
    13. 3.3.3 Converting back and forth
    14. Create the dataset
    15. Create the dataframe
    16. 3.4 Dataframe’s ancestor: the RDD
    17. Summary
  9. 4. Fundamentally lazy
    1. 4.1 A real-life example of efficient laziness
    2. 4.2 A Spark example of efficient laziness
    3. 4.2.1 Looking at the results of transformations and actions
    4. 4.2.2 The transformation process, step by step
    5. 4.2.3 The code behind the transformation/action process
    6. 4.2.4 The mystery behind the creation of 7 million datapoints in 182 ms
    7. The mystery behind the timing of actions
    8. 4.3 Comparing to RDBMS and traditional applications
    9. 4.3.1 Working with the teen birth rates dataset
    10. 4.3.2 Analyzing differences between a traditional app and a Spark app
    11. 4.4 Spark is amazing for data-focused applications
    12. 4.5 Catalyst is your app catalyzer
    13. Summary
  10. 5. Building a simple app for deployment
    1. 5.1 An ingestionless example
    2. 5.1.1 Calculating π
    3. 5.1.2 The code to approximate π
    4. 5.1.3 What are lambda functions in Java?
    5. 5.1.4 Approximating π by using lambda functions
    6. 5.2 Interacting with Spark
    7. 5.2.1 Local mode
    8. 5.2.2 Cluster mode
    9. Submitting a job to Spark
    10. Setting the cluster’s master in your application
    11. 5.2.3 Interactive mode in Scala and Python
    12. Scala shell
    13. Python shell
    14. Summary
  11. 6. Deploying your simple app
    1. 6.1 Beyond the example: The role of the components
    2. 6.1.1 Quick overview of the components and their interactions
    3. 6.1.2 Troubleshooting tips for the Spark architecture
    4. 6.1.3 Going further
    5. 6.2 Building a cluster
    6. 6.2.1 Building a cluster that works for you
    7. 6.2.2 Setting up the environment
    8. 6.3 Building your application to run on the cluster
    9. 6.3.1 Building your application’s uber JAR
    10. 6.3.2 Building your application by using Git and Maven
    11. 6.4 Running your application on the cluster
    12. 6.4.1 Submitting the uber JAR
    13. 6.4.2 Running the application
    14. 6.4.3 The Spark user interface
    15. Summary
  12. Part 2. Ingestion
  13. 7. Ingestion from files
    1. 7.1 Common behaviors of parsers
    2. 7.2 Complex ingestion from CSV
    3. 7.2.1 Desired output
    4. 7.2.2 Code
    5. 7.3 Ingesting a CSV with a known schema
    6. 7.3.1 Desired output
    7. 7.3.2 Code
    8. 7.4 Ingesting a JSON file
    9. 7.4.1 Desired output
    10. 7.4.2 Code
    11. 7.5 Ingesting a multiline JSON file
    12. 7.5.1 Desired output
    13. 7.5.2 Code
    14. 7.6 Ingesting an XML file
    15. 7.6.1 Desired output
    16. 7.6.2 Code
    17. 7.7 Ingesting a text file
    18. 7.7.1 Desired output
    19. 7.7.2 Code
    20. 7.8 File formats for big data
    21. 7.8.1 The problem with traditional file formats
    22. 7.8.2 Avro is a schema-based serialization format
    23. 7.8.3 ORC is a columnar storage format
    24. 7.8.4 Parquet is also a columnar storage format
    25. 7.8.5 Comparing Avro, ORC, and Parquet
    26. 7.9 Ingesting Avro, ORC, and Parquet files
    27. 7.9.1 Ingesting Avro
    28. 7.9.2 Ingesting ORC
    29. 7.9.3 Ingesting Parquet
    30. 7.9.4 Reference table for ingesting Avro, ORC, or Parquet
    31. Summary
  14. 8. Ingestion from databases
    1. 8.1 Ingestion from relational databases
    2. 8.1.1 Database connection checklist
    3. 8.1.2 Understanding the data used in the examples
    4. 8.1.3 Desired output
    5. 8.1.4 Code
    6. 8.1.5 Alternative code
    7. 8.2 The role of the dialect
    8. 8.2.1 What is a dialect, anyway?
    9. 8.2.2 JDBC dialects provided with Spark
    10. 8.2.3 Building your own dialect
    11. 8.3 Advanced queries and ingestion
    12. 8.3.1 Filtering by using a WHERE clause
    13. 8.3.2 Joining data in the database
    14. 8.3.3 Performing ingestion and partitioning
    15. 8.3.4 Summary of advanced features
    16. 8.4 Ingestion from Elasticsearch
    17. 8.4.1 Data flow
    18. 8.4.2 The New York restaurants dataset digested by Spark
    19. 8.4.3 Code to ingest the restaurant dataset from Elasticsearch
    20. Summary
  15. 9. Advanced ingestion: Finding data sources and building your own
    1. 9.1 What is a data source?
    2. 9.2 Benefits of a direct connection to a data source
    3. 9.2.1 Temporary files
    4. 9.2.2 Data quality scripts
    5. 9.2.3 Data on demand
    6. 9.3 Finding data sources at Spark Packages
    7. 9.4 Building your own data source
    8. 9.4.1 Scope of the example project
    9. 9.4.2 Your data source API and options
    10. 9.5 Behind the scenes: Building the data source itself
    11. 9.6 Using the register file and the advertiser class
    12. 9.7 Understanding the relationship between the data and schema
    13. 9.7.1 The data source builds the relation
    14. 9.7.2 Inside the relation
    15. 9.8 Building the schema from a JavaBean
    16. 9.9 Building the dataframe is magic with the utilities
    17. 9.10 The other classes
    18. Summary
  16. 10. Ingestion through structured streaming
    1. 10.1 What’s streaming?
    2. 10.2 Creating your first stream
    3. 10.2.1 Generating a file stream
    4. 10.2.2 Consuming the records
    5. 10.2.3 Getting records, not lines
    6. 10.3 Ingesting data from network streams
    7. 10.4 Dealing with multiple streams
    8. 10.5 Differentiating discretized and structured streaming
    9. Summary
  17. Part 3. Transforming your data
  18. 11. Working with SQL
    1. 11.1 Working with Spark SQL
    2. 11.2 The difference between local and global views
    3. 11.3 Mixing the dataframe API and Spark SQL
    4. 11.4 Don’t DELETE it!
    5. 11.5 Going further with SQL
    6. Summary
  19. 12. Transforming your data
    1. 12.1 What is data transformation?
    2. 12.2 Process and example of record-level transformation
    3. 12.2.1 Data discovery to understand the complexity
    4. 12.2.2 Data mapping to draw the process
    5. 12.2.3 Writing the transformation code
    6. 12.2.4 Reviewing your data transformation to ensure a quality process
    7. What about sorting?
    8. Wrapping up your first Spark transformation
    9. 12.3 Joining datasets
    10. 12.3.1 A closer look at the datasets to join
    11. 12.3.2 Building the list of higher education institutions per county
    12. Initialization of Spark
    13. Loading and preparing the data
    14. 12.3.3 Performing the joins
    15. Joining the FIPS county identifier with the higher ed dataset using a join
    16. Joining the census data to get the county name
    17. 12.4 Performing more transformations
    18. Summary
  20. 13. Transforming entire documents
    1. 13.1 Transforming entire documents and their structure
    2. 13.1.1 Flattening your JSON document
    3. 13.1.2 Building nested documents for transfer and storage
    4. 13.2 The magic behind static functions
    5. 13.3 Performing more transformations
    6. Summary
  21. 14. Extending transformations with user-defined functions
    1. 14.1 Extending Apache Spark
    2. 14.2 Registering and calling a UDF
    3. 14.2.1 Registering the UDF with Spark
    4. 14.2.2 Using the UDF with the dataframe API
    5. 14.2.3 Manipulating UDFs with SQL
    6. 14.2.4 Implementing the UDF
    7. 14.2.5 Writing the service itself
    8. 14.3 Using UDFs to ensure a high level of data quality
    9. 14.4 Considering UDFs’ constraints
    10. Summary
  22. 15. Aggregating your data
    1. 15.1 Aggregating data with Spark
    2. 15.1.1 A quick reminder on aggregations
    3. 15.1.2 Performing basic aggregations with Spark
    4. Performing an aggregation using the dataframe API
    5. Performing an aggregation using Spark SQL
    6. 15.2 Performing aggregations with live data
    7. 15.2.1 Preparing your dataset
    8. 15.2.2 Aggregating data to better understand the schools
    9. What is the average enrollment for each school?
    10. What is the evolution of the number of students?
    11. What is the highest enrollment per school and year?
    12. What is the minimum absenteeism per school?
    13. Which are the five schools with the least and most absenteeism?
    14. 15.3 Building custom aggregations with UDAFs
    15. Summary
  23. Part 4. Going further
  24. 16. Cache and checkpoint: Enhancing Spark’s performance
    1. 16.1 Caching and checkpointing can increase performance
    2. 16.1.1 The usefulness of Spark caching
    3. 16.1.2 The subtle effectiveness of Spark checkpointing
    4. 16.1.3 Using caching and checkpointing
    5. 16.2 Caching in action
    6. 16.3 Going further in performance optimization
    7. Summary
  25. 17. Exporting data and building full data pipelines
    1. 17.1 Exporting data
    2. 17.1.1 Building a pipeline with NASA datasets
    3. 17.1.2 Transforming columns to datetime
    4. 17.1.3 Transforming the confidence percentage to confidence level
    5. 17.1.4 Exporting the data
    6. 17.1.5 Exporting the data: What really happened?
    7. 17.2 Delta Lake: Enjoying a database close to your system
    8. 17.2.1 Understanding why a database is needed
    9. 17.2.2 Using Delta Lake in your data pipeline
    10. 17.2.3 Consuming data from Delta Lake
    11. Number of meetings per department
    12. Number of meetings per type of organizer
    13. 17.3 Accessing cloud storage services from Spark
    14. Amazon S3
    15. Google Cloud Storage
    16. IBM COS
    17. Microsoft Azure Blob Storage
    18. OVH Object Storage
    19. Summary
  26. 18. Exploring deployment constraints: Understanding the ecosystem
    1. 18.1 Managing resources with YARN, Mesos, and Kubernetes
    2. 18.1.1 The built-in standalone mode manages resources
    3. 18.1.2 YARN manages resources in a Hadoop environment
    4. 18.1.3 Mesos is a standalone resource manager
    5. 18.1.4 Kubernetes orchestrates containers
    6. 18.1.5 Choosing the right resource manager
    7. 18.2 Sharing files with Spark
    8. 18.2.1 Accessing the data contained in files
    9. 18.2.2 Sharing files through distributed filesystems
    10. 18.2.3 Accessing files on shared drives or file server
    11. 18.2.4 Using file-sharing services to distribute files
    12. 18.2.5 Other options for accessing files in Spark
    13. 18.2.6 Hybrid solution for sharing files with Spark
    14. 18.3 Making sure your Spark application is secure
    15. 18.3.1 Securing the network components of your infrastructure
    16. 18.3.2 Securing Spark’s disk usage
    17. Summary
  27. Appendixes
  28. Appendix A. Installing Eclipse
    1. A.1 Eclipse
    2. A.2 Running Eclipse for the first time
  29. Appendix B. Installing Maven
    1. B.1 Installation on Windows
    2. B.2 Installation on macOS
  30. Appendix C. Installing Git
    1. C.1 Installing Git on Windows
    2. C.2 Installing Git on macOS
    3. C.3 Installing Git on Ubuntu
    4. $ sudo apt install git
    5. C.4 Installing Git on RHEL / Amazon EMR
    6. $ sudo yum install -y git
    7. C.5 Other tools to consider
  31. Appendix D. Downloading the code and getting started with Eclipse
    1. D.1 Downloading the source code from the command line
    2. D.2 Getting started in Eclipse
  32. Appendix E. A history of enterprise data
    1. E.1 The enterprise problem
    2. E.2 The solution is--hmmm, was--the data warehouse
    3. E.3 The ephemeral data lake
    4. E.4 Lightning-fast cluster computing
    5. E.5 Java rules, but we’re okay with Python
  33. Appendix F. Getting help with relational databases
    1. F.1 IBM Informix
    2. F.1.1 Installing Informix on macOS
    3. F.1.2 Installing Informix on Windows
    4. F.2 MariaDB
    5. F.2.1 Installing MariaDB on macOS
    6. F.2.2 Installing MariaDB on Windows
    7. F.3 MySQL (Oracle)
    8. F.3.1 Installing MySQL on macOS
    9. F.3.2 Installing MySQL on Windows
    10. F.3.3 Loading the Sakila database
    11. F.4 PostgreSQL
    12. F.4.1 Installing PostgreSQL on macOS and Windows
    13. F.4.2 Installing PostgreSQL on Linux
    14. F.4.3 GUI clients for PostgreSQL
  34. Appendix G. Static functions ease your transformations
    1. G.1 Functions per category
    2. G.1.1 Popular functions
    3. G.1.2 Aggregate functions
    4. G.1.3 Arithmetical functions
    5. G.1.4 Array manipulation functions
    6. G.1.5 Binary operations
    7. G.1.6 Byte functions
    8. G.1.7 Comparison functions
    9. G.1.8 Compute function
    10. G.1.9 Conditional operations
    11. G.1.10 Conversion functions
    12. G.1.11 Data shape functions
    13. G.1.12 Date and time functions
    14. G.1.13 Digest functions
    15. G.1.14 Encoding functions
    16. G.1.15 Formatting functions
    17. G.1.16 JSON functions
    18. G.1.17 List functions
    19. G.1.18 Map functions
    20. G.1.19 Mathematical functions
    21. G.1.20 Navigation functions
    22. G.1.21 Parsing functions
    23. G.1.22 Partition functions
    24. G.1.23 Rounding functions
    25. G.1.24 Sorting functions
    26. G.1.25 Statistical functions
    27. G.1.26 Streaming functions
    28. G.1.27 String functions
    29. G.1.28 Technical functions
    30. G.1.29 Trigonometry functions
    31. G.1.30 UDF helpers
    32. G.1.31 Validation functions
    33. G.1.32 Deprecated functions
    34. G.2 Function appearance per version of Spark
    35. G.2.1 Functions in Spark v3.0.0
    36. G.2.2 Functions in Spark v2.4.0
    37. G.2.3 Functions in Spark v2.3.0
    38. G.2.4 Functions in Spark v2.2.0
    39. G.2.5 Functions in Spark v2.1.0
    40. G.2.6 Functions in Spark v2.0.0
    41. G.2.7 Functions in Spark v1.6.0
    42. G.2.8 Functions in Spark v1.5.0
    43. G.2.9 Functions in Spark v1.4.0
    44. G.2.10 Functions in Spark v1.3.0
  35. Appendix H. Maven quick cheat sheet
    1. H.1 Source of packages
    2. H.2 Useful commands
    3. H.3 Typical Maven life cycle
    4. H.4 Useful configuration
    5. H.4.1 Built-in properties
    6. H.4.2 Building an uber JAR
    7. H.4.3 Including the source code
    8. H.4.4 Executing from Maven
  36. Appendix I. Reference for transformations and actions
    1. I.1 Transformations
    2. I.2 Actions
  37. Appendix J. Enough Scala
    1. J.1 What is Scala
    2. J.2 Scala to Java conversion
    3. J.2.1 General conversions
    4. J.2.2 Maps: Conversion from Scala to Java
  38. Appendix K. Installing Spark in production and a few tips
    1. K.1 Installation
    2. K.1.1 Installing Spark on Windows
    3. K.1.2 Installing Spark on macOS
    4. K.1.3 Installing Spark on Ubuntu
    5. Figure K.1 Getting the real download URL for Apache Spark so you can copy it to your command line
    6. K.1.4 Installing Spark on AWS EMR
    7. K.2 Understanding the installation
    8. K.3 Configuration
    9. K.3.1 Properties syntax
    10. K.3.2 Application configuration
    11. K.3.3 Runtime configuration
    12. K.3.4 Other configuration points
  39. Appendix L. Reference for ingestion
    1. L.1 Spark datatypes
    2. L.2 Options for CSV ingestion
    3. L.3 Options for JSON ingestion
    4. L.4 Options for XML ingestion
    5. L.5 Methods for building a full dialect
    6. L.6 Options for ingesting and writing data from/to a database
    7. L.7 Options for ingesting and writing data from/to Elasticsearch
  40. Appendix M. Reference for joins
    1. M.1 Setting up the decorum
    2. M.2 Performing an inner join
    3. M.3 Performing an outer join
    4. M.4 Performing a left, or left-outer, join
    5. M.5 Performing a right, or right-outer, join
    6. M.6 Performing a left-semi join
    7. M.7 Performing a left-anti join
    8. M.8 Performing a cross-join
  41. Appendix N. Installing Elasticsearch and sample data
    1. N.1 Installing the software
    2. N.1.1 All platforms
    3. N.1.2 macOS with Homebrew
    4. N.2 Installing the NYC restaurant dataset
    5. N.3 Understanding Elasticsearch terminology
    6. N.4 Working with useful commands
    7. N.4.1 Get the server status
    8. N.4.2 Display the structure
    9. N.4.3 Count documents
  42. Appendix O. Generating streaming data
    1. O.1 Need for generating streaming data
    2. O.2 A simple stream
    3. O.3 Joined data
    4. O.4 Types of fields
  43. Appendix P. Reference for streaming
    1. P.1 Output mode
    2. P.2 Sinks
    3. P.3 Sinks, output modes, and options
    4. P.4 Examples of using the various sinks
    5. P.4.1 Output in a file
    6. P.4.2 Output to a Kafka topic
    7. P.4.3 Processing streamed records through foreach
    8. P.4.4 Output in memory and processing from memory
  44. Appendix Q. Reference for exporting data
    1. Q.1 Specifying the way to save data
    2. Q.2 Spark export formats
    3. Q.3 Options for the main formats
    4. Q.3.1 Exporting as CSV
    5. Q.3.2 Exporting as JSON
    6. Q.3.3 Exporting as Parquet
    7. Q.3.4 Exporting as ORC
    8. Q.3.5 Exporting as XML
    9. Q.3.6 Exporting as text
    10. Q.4 Exporting data to datastores
    11. Q.4.1 Exporting data to a database via JDBC
    12. Q.4.2 Exporting data to Elasticsearch
    13. Q.4.3 Exporting data to Delta Lake
  45. Appendix R. Finding help when you’re stuck
    1. R.1 Small annoyances here and there
    2. R.1.1 Service sparkDriver failed after 16 retries . . .
    3. R.1.2 Requirement failed
    4. R.1.3 Class cast exception
    5. R.1.4 Corrupt record in ingestion
    6. R.1.5 Cannot find winutils.exe
    7. R.2 Help in the outside world
    8. R.2.1 User mailing list
    9. R.2.2 Stack Overflow
  46. index
    1. Numerics
    2. A
    3. B
    4. C
    5. D
    6. E
    7. F
    8. G
    9. H
    10. I
    11. J
    12. K
    13. L
    14. M
    15. N
    16. O
    17. P
    18. Q
    19. R
    20. S
    21. T
    22. U
    23. V
    24. W
    25. X
    26. Y
    27. Z