by Mike Frampton
Mastering Apache Spark
Table of Contents
Mastering Apache Spark
Credits
Foreword
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Apache Spark
Overview
Spark Machine Learning
Spark Streaming
Spark SQL
Spark graph processing
Extended ecosystem
The future of Spark
Cluster design
Cluster management
Local
Standalone
Apache YARN
Apache Mesos
Amazon EC2
Performance
The cluster structure
The Hadoop file system
Data locality
Memory
Coding
Cloud
Summary
2. Apache Spark MLlib
The environment configuration
Architecture
The development environment
Installing Spark
Classification with Naïve Bayes
Theory
Naïve Bayes in practice
Clustering with K-Means
Theory
K-Means in practice
ANN – Artificial Neural Networks
Theory
Building the Spark server
ANN in practice
Summary
3. Apache Spark Streaming
Overview
Errors and recovery
Checkpointing
Streaming sources
TCP stream
File streams
Flume
Kafka
Summary
4. Apache Spark SQL
The SQL context
Importing and saving data
Processing the Text files
Processing the JSON files
Processing the Parquet files
DataFrames
Using SQL
User-defined functions
Using Hive
Local Hive Metastore server
A Hive-based Metastore server
Summary
5. Apache Spark GraphX
Overview
GraphX coding
Environment
Creating a graph
Example 1 – counting
Example 2 – filtering
Example 3 – PageRank
Example 4 – triangle counting
Example 5 – connected components
Mazerunner for Neo4j
Installing Docker
The Neo4j browser
The Mazerunner algorithms
The PageRank algorithm
The closeness centrality algorithm
The triangle count algorithm
The connected components algorithm
The strongly connected components algorithm
Summary
6. Graph-based Storage
Titan
TinkerPop
Installing Titan
Titan with HBase
The HBase cluster
The Gremlin HBase script
Spark on HBase
Accessing HBase with Spark
Titan with Cassandra
Installing Cassandra
The Gremlin Cassandra script
The Spark Cassandra connector
Accessing Cassandra with Spark
Accessing Titan with Spark
Gremlin and Groovy
TinkerPop's Hadoop Gremlin
Alternative Groovy configuration
Using Cassandra
Using HBase
Using the filesystem
Summary
7. Extending Spark with H2O
Overview
The processing environment
Installing H2O
The build environment
Architecture
Sourcing the data
Data Quality
Performance tuning
Deep learning
Example code – income
The example code – MNIST
H2O Flow
Summary
8. Spark Databricks
Overview
Installing Databricks
AWS billing
Databricks menus
Account management
Cluster management
Notebooks and folders
Jobs and libraries
Development environments
Databricks tables
Data import
External tables
The DbUtils package
Databricks file system
Dbutils fsutils
The DbUtils cache
The DbUtils mount
Summary
9. Databricks Visualization
Data visualization
Dashboards
An RDD-based report
A stream-based report
REST interface
Configuration
Cluster management
The execution context
Command execution
Libraries
Moving data
The table data
Folder import
Library import
Further reading
Summary
Index