Table of Contents for Mastering Machine Learning with Spark
by Michal Malohlava, Max Pumperla, Alex Tellez
Mastering Machine Learning with Spark 2.x
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
Introduction to Large-Scale Machine Learning and Spark
Data science
The sexiest role of the 21st century – data scientist?
A day in the life of a data scientist
Working with big data
The machine learning algorithm using a distributed environment
Splitting of data into multiple machines
From Hadoop MapReduce to Spark
What is Databricks?
Inside the box
Introducing H2O.ai
Design of Sparkling Water
What's the difference between H2O and Spark's MLlib?
Data munging
Data science - an iterative process
Summary
Detecting Dark Matter - The Higgs-Boson Particle
Type I versus type II error
Finding the Higgs-Boson particle
The LHC and data creation
The theory behind the Higgs-Boson
Measuring for the Higgs-Boson
The dataset
Spark start and data load
Labeled point vector
Data caching
Creating a training and testing set
What about cross-validation?
Our first model – decision tree
Gini versus Entropy
Next model – tree ensembles
Random forest model
Grid search
Gradient boosting machine
Last model - H2O deep learning
Build a 3-layer DNN
Adding more layers
Building models and inspecting results
Summary
Ensemble Methods for Multi-Class Classification
Data
Modeling goal
Challenges
Machine learning workflow
Starting Spark shell
Exploring data
Missing data
Summary of missing value analysis
Data unification
Missing values
Categorical values
Final transformation
Modelling data with Random Forest
Building a classification model using Spark RandomForest
Classification model evaluation
Spark model metrics
Building a classification model using H2O RandomForest
Summary
Predicting Movie Reviews Using NLP and Spark Streaming
NLP - a brief primer
The dataset
Dataset preparation
Feature extraction
Feature extraction method – bag-of-words model
Text tokenization
Declaring our stopwords list
Stemming and lemmatization
Featurization - feature hashing
Term Frequency - Inverse Document Frequency (TF-IDF) weighting scheme
Let's do some (model) training!
Spark decision tree model
Spark Naive Bayes model
Spark random forest model
Spark GBM model
Super-learner model
Super learner
Composing all transformations together
Using the super-learner model
Summary
Word2vec for Prediction and Clustering
Motivation of word vectors
Word2vec explained
What is a word vector?
The CBOW model
The skip-gram model
Fun with word vectors
Cosine similarity
Doc2vec explained
The distributed-memory model
The distributed bag-of-words model
Applying word2vec and exploring our data with vectors
Creating document vectors
Supervised learning task
Summary
Extracting Patterns from Clickstream Data
Frequent pattern mining
Pattern mining terminology
Frequent pattern mining problem
The association rule mining problem
The sequential pattern mining problem
Pattern mining with Spark MLlib
Frequent pattern mining with FP-growth
Association rule mining
Sequential pattern mining with prefix span
Pattern mining on MSNBC clickstream data
Deploying a pattern mining application
The Spark Streaming module
Summary
Graph Analytics with GraphX
Basic graph theory
Graphs
Directed and undirected graphs
Order and degree
Directed acyclic graphs
Connected components
Trees
Multigraphs
Property graphs
GraphX distributed graph processing engine
Graph representation in GraphX
Graph properties and operations
Building and loading graphs
Visualizing graphs with Gephi
Gephi
Creating GEXF files from GraphX graphs
Advanced graph processing
Aggregating messages
Pregel
GraphFrames
Graph algorithms and applications
Clustering
Vertex importance
GraphX in context
Summary
Lending Club Loan Prediction
Motivation
Goal
Data
Data dictionary
Preparation of the environment
Data load
Exploration – data analysis
Basic clean up
Useless columns
String columns
Loan progress columns
Categorical columns
Text columns
Missing data
Prediction targets
Loan status model
Base model
The emp_title column transformation
The desc column transformation
Interest rate model
Using models for scoring
Model deployment
Stream creation
Stream transformation
Stream output
Summary