Preface
    What this book covers
    What you need for this book
    Who this book is for
    Conventions
    Reader feedback
    Customer support
    Downloading the example code
    Downloading the color images of this book
    Errata
    Piracy
    Questions

Introduction to Large-Scale Machine Learning and Spark
    Data science
    The sexiest role of the 21st century – data scientist?
    A day in the life of a data scientist
    Working with big data
    The machine learning algorithm using a distributed environment
    Splitting of data into multiple machines
    From Hadoop MapReduce to Spark
    What is Databricks?
    Inside the box
    Introducing H2O.ai
    Design of Sparkling Water
    What's the difference between H2O and Spark's MLlib?
    Data munging
    Data science - an iterative process
    Summary

Detecting Dark Matter - The Higgs-Boson Particle
    Type I versus type II error
    Finding the Higgs-Boson particle
    The LHC and data creation
    The theory behind the Higgs-Boson
    Measuring for the Higgs-Boson
    The dataset
    Spark start and data load
    Labeled point vector
    Data caching
    Creating a training and testing set
    What about cross-validation?
    Our first model – decision tree
    Gini versus Entropy
    Next model – tree ensembles
    Random forest model
    Grid search
    Gradient boosting machine
    Last model - H2O deep learning
    Build a 3-layer DNN
    Adding more layers
    Building models and inspecting results
    Summary

Ensemble Methods for Multi-Class Classification
    Data
    Modeling goal
    Challenges
    Machine learning workflow
    Starting Spark shell
    Exploring data
    Missing data
    Summary of missing value analysis
    Data unification
    Missing values
    Categorical values
    Final transformation
    Modelling data with Random Forest
    Building a classification model using Spark RandomForest
    Classification model evaluation
    Spark model metrics
    Building a classification model using H2O RandomForest
    Summary

Predicting Movie Reviews Using NLP and Spark Streaming
    NLP - a brief primer
    The dataset
    Dataset preparation
    Feature extraction
    Feature extraction method – bag-of-words model
    Text tokenization
    Declaring our stopwords list
    Stemming and lemmatization
    Featurization - feature hashing
    Term Frequency - Inverse Document Frequency (TF-IDF) weighting scheme
    Let's do some (model) training!
    Spark decision tree model
    Spark Naive Bayes model
    Spark random forest model
    Spark GBM model
    Super-learner model
    Super learner
    Composing all transformations together
    Using the super-learner model
    Summary

Word2vec for Prediction and Clustering
    Motivation of word vectors
    Word2vec explained
    What is a word vector?
    The CBOW model
    The skip-gram model
    Fun with word vectors
    Cosine similarity
    Doc2vec explained
    The distributed-memory model
    The distributed bag-of-words model
    Applying word2vec and exploring our data with vectors
    Creating document vectors
    Supervised learning task
    Summary

Extracting Patterns from Clickstream Data
    Frequent pattern mining
    Pattern mining terminology
    Frequent pattern mining problem
    The association rule mining problem
    The sequential pattern mining problem
    Pattern mining with Spark MLlib
    Frequent pattern mining with FP-growth
    Association rule mining
    Sequential pattern mining with prefix span
    Pattern mining on MSNBC clickstream data
    Deploying a pattern mining application
    The Spark Streaming module
    Summary

Graph Analytics with GraphX
    Basic graph theory
    Graphs
    Directed and undirected graphs
    Order and degree
    Directed acyclic graphs
    Connected components
    Trees
    Multigraphs
    Property graphs
    GraphX distributed graph processing engine
    Graph representation in GraphX
    Graph properties and operations
    Building and loading graphs
    Visualizing graphs with Gephi
    Gephi
    Creating GEXF files from GraphX graphs
    Advanced graph processing
    Aggregating messages
    Pregel
    GraphFrames
    Graph algorithms and applications
    Clustering
    Vertex importance
    GraphX in context
    Summary

Lending Club Loan Prediction
    Motivation
    Goal
    Data
    Data dictionary
    Preparation of the environment
    Data load
    Exploration – data analysis
    Basic clean up
    Useless columns
    String columns
    Loan progress columns
    Categorical columns
    Text columns
    Missing data
    Prediction targets
    Loan status model
    Base model
    The emp_title column transformation
    The desc column transformation
    Interest Rate Model
    Using models for scoring
    Model deployment
    Stream creation
    Stream transformation
    Stream output
    Summary