Table of Contents for Mastering Spark for Data Science
Mastering Spark for Data Science
by Matthew Hallett, David George, Antoine Amend, Andrew Morgan
Credits
Foreword
About the Authors
About the Reviewer
www.PacktPub.com
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. The Big Data Science Ecosystem
Introducing the Big Data ecosystem
Data management
Data management responsibilities
The right tool for the job
Overall architecture
Data Ingestion
Data Lake
Reliable storage
Scalable data processing capability
Data science platform
Data Access
Data technologies
The role of Apache Spark
Companion tools
Apache HDFS
Advantages
Disadvantages
Installation
Amazon S3
Advantages
Disadvantages
Installation
Apache Kafka
Advantages
Disadvantages
Installation
Apache Parquet
Advantages
Disadvantages
Installation
Apache Avro
Advantages
Disadvantages
Installation
Apache NiFi
Advantages
Disadvantages
Installation
Apache YARN
Advantages
Disadvantages
Installation
Apache Lucene
Advantages
Disadvantages
Installation
Kibana
Advantages
Disadvantages
Installation
Elasticsearch
Advantages
Disadvantages
Installation
Accumulo
Advantages
Disadvantages
Installation
Summary
2. Data Acquisition
Data pipelines
Universal ingestion framework
Introducing the GDELT news stream
Discovering GDELT in real-time
Our first GDELT feed
Improving with publish and subscribe
Content registry
Choices and more choices
Going with the flow
Metadata model
Kibana dashboard
Quality assurance
Example 1 - Basic quality checking, no contending users
Example 2 - Advanced quality checking, no contending users
Example 3 - Basic quality checking, 50% utility due to contending users
Summary
3. Input Formats and Schema
A structured life is a good life
GDELT dimensional modeling
GDELT model
First look at the data
Core global knowledge graph model
Hidden complexity
Denormalized models
Challenges with flattened data
Issue 1 - Loss of contextual information
Issue 2: Re-establishing dimensions
Issue 3: Including reference data
Loading your data
Schema agility
Reality check
GKG ELT
Position matters
Avro
Spark-Avro method
Pedagogical method
When to perform Avro transformation
Parquet
Summary
4. Exploratory Data Analysis
The problem, principles and planning
Understanding the EDA problem
Design principles
General plan of exploration
Preparation
Introducing mask based data profiling
Introducing character class masks
Building a mask based profiler
Setting up Apache Zeppelin
Constructing a reusable notebook
Exploring GDELT
GDELT GKG datasets
The files
Special collections
Reference data
Exploring the GKG v2.1
The Translingual files
A configurable GCAM time series EDA
Plot.ly charting on Apache Zeppelin
Exploring translation sourced GCAM sentiment with plot.ly
Concluding remarks
A configurable GCAM Spatio-Temporal EDA
Introducing GeoGCAM
Does our spatial pivot work?
Summary
5. Spark for Geographic Analysis
GDELT and oil
GDELT events
GDELT GKG
Formulating a plan of action
GeoMesa
Installing
GDELT Ingest
GeoMesa Ingest
MapReduce to Spark
Geohash
GeoServer
Map layers
CQL
Gauging oil prices
Using the GeoMesa query API
Data preparation
Machine learning
Naive Bayes
Results
Analysis
Summary
6. Scraping Link-Based External Data
Building a web scale news scanner
Accessing the web content
The Goose library
Integration with Spark
Scala compatibility
Serialization issues
Creating a scalable, production-ready library
Build once, read many
Exception handling
Performance tuning
Named entity recognition
Scala libraries
NLP walkthrough
Extracting entities
Abstracting methods
Building a scalable code
Build once, read many
Scalability is also a state of mind
Performance tuning
GIS lookup
GeoNames dataset
Building an efficient join
Offline strategy - Bloom filtering
Online strategy - Hash partitioning
Content deduplication
Context learning
Location scoring
Names de-duplication
Functional programming with Scalaz
Our de-duplication strategy
Using the mappend operator
Simple clean
DoubleMetaphone
News index dashboard
Summary
7. Building Communities
Building a graph of persons
Contact chaining
Extracting data from Elasticsearch
Using the Accumulo database
Setup Accumulo
Cell security
Iterators
Elasticsearch to Accumulo
A graph data model in Accumulo
Hadoop input and output formats
Reading from Accumulo
AccumuloGraphxInputFormat and EdgeWritable
Building a graph
Community detection algorithm
Louvain algorithm
Weighted Community Clustering (WCC)
Description
Preprocessing stage
Initial communities
Message passing
Community back propagation
WCC iteration
Gathering community statistics
WCC Computation
WCC iteration
GDELT dataset
The Bowie effect
Smaller communities
Using Accumulo cell level security
Summary
8. Building a Recommendation System
Different approaches
Collaborative filtering
Content-based filtering
Custom approach
Uninformed data
Processing bytes
Creating a scalable code
From time to frequency domain
Fast Fourier transform
Sampling by time window
Extracting audio signatures
Building a song analyzer
Selling data science is all about selling cupcakes
Using Cassandra
Using the Play framework
Building a recommender
The PageRank algorithm
Building a Graph of Frequency Co-occurrence
Running PageRank
Building personalized playlists
Expanding our cupcake factory
Building a playlist service
Leveraging the Spark job server
User interface
Summary
9. News Dictionary and Real-Time Tagging System
The mechanical Turk
Human intelligence tasks
Bootstrapping a classification model
Learning from Stack Exchange
Building text features
Training a Naive Bayes model
Laziness, impatience, and hubris
Designing a Spark Streaming application
A tale of two architectures
The CAP theorem
The Greeks are here to help
Importance of the Lambda architecture
Importance of the Kappa architecture
Consuming data streams
Creating a GDELT data stream
Creating a Kafka topic
Publishing content to a Kafka topic
Consuming Kafka from Spark Streaming
Creating a Twitter data stream
Processing Twitter data
Extracting URLs and hashtags
Keeping popular hashtags
Expanding shortened URLs
Fetching HTML content
Using Elasticsearch as a caching layer
Classifying data
Training a Naive Bayes model
Thread safety
Predict the GDELT data
Our Twitter mechanical Turk
Summary
10. Story De-duplication and Mutation
Detecting near duplicates
First steps with hashing
Standing on the shoulders of the Internet giants
Simhashing
The hamming weight
Detecting near duplicates in GDELT
Indexing the GDELT database
Persisting our RDDs
Building a REST API
Area of improvement
Building stories
Building term frequency vectors
The curse of dimensionality, the data science plague
Optimizing KMeans
Story mutation
The Equilibrium state
Tracking stories over time
Building a streaming application
Streaming KMeans
Visualization
Building story connections
Summary
11. Anomaly Detection on Sentiment Analysis
Following the US elections on Twitter
Acquiring data in stream
Acquiring data in batch
The search API
Rate limit
Analysing sentiment
Massaging Twitter data
Using the Stanford NLP
Building the Pipeline
Using Timely as a time series database
Storing data
Using Grafana to visualize sentiment
Number of processed tweets
Give me my Twitter account back
Identifying the swing states
Twitter and the Godwin point
Learning context
Visualizing our model
Word2Graph and Godwin point
Building a Word2Graph
Random walks
A Small Step into sarcasm detection
Building features
#LoveTrumpsHates
Scoring Emojis
Training a KMeans model
Detecting anomalies
Summary
12. TrendCalculus
Studying trends
The TrendCalculus algorithm
Trend windows
Simple trend
User Defined Aggregate Functions
Simple trend calculation
Reversal rule
Introducing the FHLS bar structure
Visualize the data
FHLS with reversals
Edge cases
Zero values
Completing the gaps
Stackable processing
Practical applications
Algorithm characteristics
Advantages
Disadvantages
Possible use cases
Chart annotation
Co-trending
Data reduction
Indexing
Fractal dimension
Streaming proxy for piecewise linear regression
Summary
13. Secure Data
Data security
The problem
The basics
Authentication and authorization
Access control lists (ACL)
Role-based access control (RBAC)
Access
Encryption
Data at rest
Java KeyStore
S3 encryption
Data in transit
Obfuscation/Anonymizing
Masking
Tokenization
Using a Hybrid approach
Data disposal
Kerberos authentication
Use case 1: Apache Spark accessing data in secure HDFS
Use case 2: extending to automated authentication
Use case 3: connecting to secure databases from Spark
Security ecosystem
Apache Sentry
RecordService
Apache Ranger
Apache Knox
Your Secure Responsibility
Summary
14. Scalable Algorithms
General principles
Spark architecture
History of Spark
Moving parts
Driver
SparkSession
Resilient distributed datasets (RDDs)
Executor
Shuffle operation
Cluster Manager
Task
DAG
DAG scheduler
Transformations
Stages
Actions
Task scheduler
Challenges
Algorithmic complexity
Numerical anomalies
Shuffle
Data schemes
Plotting your course
Be iterative
Data preparation
Scale up slowly
Estimate performance
Step through carefully
Tune your analytic
Design patterns and techniques
Spark APIs
Problem
Solution
Example
Summary pattern
Problem
Solution
Example
Expand and Conquer Pattern
Problem
Solution
Lightweight Shuffle
Problem
Solution
Wide Table pattern
Problem
Solution
Example
Broadcast variables pattern
Problem
Solution
Creating a broadcast variable
Accessing a broadcast variable
Removing a broadcast variable
Example
Combiner pattern
Problem
Solution
Example
Optimized cluster
Problem
Solution
Redistribution pattern
Problem
Solution
Example
Salting key pattern
Problem
Solution
Secondary sort pattern
Problem
Solution
Example
Filter overkill pattern
Problem
Solution
Probabilistic algorithms
Problem
Solution
Example
Selective caching
Problem
Solution
Garbage collection
Problem
Solution
Graph traversal
Problem
Solution
Example
Summary