As this ebook edition doesn't have fixed pagination, the page numbers below are hyperlinked for reference only, based on the printed edition of this book.
Symbols
.ascharacter() 107
.asfactor() 107
.asnumeric() 107
.isfactor() 107
.isnumeric() 107
.isstring() 107
A
absolute Matthews correlation coefficient
accuracy 192
actions, RDD operations 309
activation function 183
Anaconda 266
Apache Software Foundation
reference link 316
Apache Spark
about 304
exploring 304
implementing, with H2O Sparkling Water 319
URL 310
Apache Spark architecture
about 306
Spark cluster manager 306
Spark Context 307
Spark driver 307
Spark executors 307
Apache Spark, components
GraphX 305
Spark core 305
Spark MLlib 305
Spark SQL 305
Spark streaming 305
Apache Storm
about 343
architecture components 349
architecture, designing 348
implementation, working with 349-358
installing 346
master node 344
problem statement 347
streams 345
topologies 345
tuples 345
URL 346
using 343
worker node 344
Apache Storm repo
reference link 346
Apache Thrift
about 345
reference link 345
area under the precision-recall curve (AUC-PR)
about 254
area under the ROC curve (AUC-ROC)
Artificial Neural Networks (ANNs)
about 179
association problem 147
association rule learning 148
Atomicity, Consistency, Isolation, and Durability (ACID) 126
Automated Machine Learning (AutoML)
triggering, on dataset 54
used, for model training 62
AutoML, model operations
Model Management options 56
Run AutoML 55
Run Specific Models 55
AutoML parameters
expert parameters 61
AutoML training
execution in Sparkling Water 320-327
B
bagging 172
Bayesian optimization 138
binary classification 147
binary wheels 268
bolt 345
boosting 179
C
Cartesian Hyperparameter Search 140
categorical data 111
cells 31
chunking 48
class balancing parameters
classification problems
binary classification 147
multiclass/polynomial classification 147
client layer
about 122
cluster 127
cluster computing 127
clustering
about 147
hard clustering 147
soft clustering 147
coefficient of determination 211
cold scoring 291
columns
combining, from two dataframes 81-83
column summary 49
column types
Comprehensive R Archive Network (CRAN)
about 11
URL 11
computational statistics 145
Concrete Compressive Strength dataset
reference link 319
conda environment 268
confusion matrix
Control Theory 147
Convolutional Neural Networks (CNNs) 181
cross-validation parameters
experimenting with 256
curl operation 131
curse of dimensionality 141
cut-off line 195
Cython 267
D
data
encoding, with target encoding 111-119
DataFrames
feature columns, manipulating 102
feature_test 269
feature_train 269
label_test 269
label_train 269
missing values, handling 89
reframing 80
data functions, H2O Flow
working with 37
data ingestion interaction
flow 128
data manipulation
reference link 97
data processing 79
data sharing
in external Sparkling Water backend 314
in internal Sparkling Water backend 313
datum 93
decision trees
components 160
deep learning (DL)
reference link 185
Deep Neural Network (DNN) 181
Deep Q Network 149
dependent variables 146
deviance, ranking metric 188
Directed Acyclic Graph (DAG) 345
Distributed Random Forest (DRF) algorithm
Extremely Randomized Trees (XRT) 172
E
early stopping parameters
ensemble ML 179
Enumerated Types (Enums) 44
epoch 252
error metrics 18
Estimator 265
event logging, H2O AutoML 275
event log levels
DEBUG 277
ERROR 276
FATAL 276
INFO 276
TRACE 277
WARN 276
event logs
about 275
level 276
message 277
name 277
stage 277
timestamp 276
value 277
viewing, in R 276
exhaustive search 140
explainability features
confusion matrix 221
exploring 221
feature importance heatmaps 229, 230
individual conditional expectation plots 222-241
leaderboard 221
model correlation heatmap 222-232
partial dependency plots 222-236
variable importance heatmap 221
explainability function, parameters
columns 217
exclude_explanations 217
include_explanations 217
newdata/frame 217
object 218
plot_overrides 218
top_n_features 217
external Sparkling Water backend
data sharing 314
Extract-Transform-Load (ETL) 344
Extremely Randomized Trees (XRT) 172
F
F1 score performance metric 206-208
feature columns of dataframe
manipulating 102
feature importance
feedforward neural network 180
Fisher’s Iris dataset 14
Flow notebook/Flow 28
Fluid Vector 126
G
Gaussian distribution 154
Gedeon method 227
Generalized Linear Model (GLM) algorithm
Generalized Linear Model (GLM) algorithm, components
link function 157
random component 157
systematic component 157
generative models 148
global explanation 216
gradient 151
gradient-based optimization 138
Gradient Boosting Machine (GBM)
Graphviz
about 295
installing, on Linux 295
installing, on Mac 295
installing, on Windows 295
grid search hyperparameter optimization 136-141
grid search optimization 138
H
H2O
implementing, with Python 8
implementing, with R 11
H2O-3 MOJO model, loading and usage
reference link 330
H2O AI high-level architecture
about 122
client layer, observing 123, 124
Java Virtual Machine (JVM) component layer, observing 124-126
H2O AutoML
about 6
implementing, with H2O Sparkling Water 319
integration, with scikit-learn 264
minimum standard requirements 7, 8
used, for training ML model 14
used, for training model in Python 16-21
used, for training model in R 21-25
using, in scikit-learn 270
H2OAutoMLClassifier function
about 265
H2O AutoML leaderboard performance metrics
AUC-PR metric, calculating 200-202
exploring 188
Mean Squared Error (MSE) 188, 189
ROC-AUC metric, calculating 193-200
Root Mean Squared Error (RMSE) 189
H2O AutoML model MOJOs
used, for making predictions 297-300
H2OAutoMLRegressor function
about 265
H2O client-server interactions
during ingestion of data 127-129
H2ODataSpout file
h2o.explain() function 216
h2o.explain_row() function 216
H2O Flow
basics 28
data functions, working with 37
downloading 29
download link 29
H2O models, downloading as POJOs 283, 284
launching 29
model training functions, working with 54
prediction functions, working with 68
H2OFrame
converting, into RDD 313
H2O models
download, as POJOs in H2O Flow 283, 284
download, as POJOs in Python 281, 282
download, as POJOs in R 282, 283
extracting, as MOJOs 291
extracting, as MOJOs in H2O Flow 294
extracting, as MOJOs in Python 291-293
extracting, as MOJOs in R programming language 293
extracting, as POJOs 281
H2O server
sequence of interactions, during model training 129
h2o.sklearn module 265
H2O Sparkling Water
AutoML training, execution 320-327
benefits 310
download link 316
installation requirements 315
predictions, making with model MOJOs 327-330
used, for implementing H2O AutoML 319
used, for implementing Spark 319
H2O Sparkling Water, backends
external backend 312
internal backend 311
H2O Sparkling Water, supported platforms
reference link 314
H2OStormStarter file
Hadoop Distributed File System (HDFS) 128, 307
hard clustering 147
harmonic mean 207
Heart Failure Clinical dataset
features 347
reference link 347
hex file 42
hot scoring 291
hyperparameter
values 158
hyperparameter optimization/hyperparameter tuning 5, 137, 138
I
imbalanced dataset
about 248
oversampling 249
undersampling 248
imputation function
about 98
parameters 98
using, to fill missing values 98-102
imputation strategy 98
individual conditional expectation plots 217-241
IntelliJ IDEA version 336
internal parameters 137
internal Sparkling Water backend
data sharing 313
Iris flower dataset
reference link 14
J
Java
installation link 8
installing 8
Java Archive (JAR) file 29
Java Database Connectivity (JDBC) 6
Java Runtime Environment (JRE) 8
JavaScript 123
Java Virtual Machine (JVM) 10, 289, 313
Java Virtual Machine (JVM) component layer
about 122
job manager 131
jsr166y 126
Jupyter Notebook 217
JVM nodes
about 124
algorithm layer 125
language layer 125
resource management layer 125
JVM processes
about 124
Distributed key-value store 126
Fluid Vector Frame 126
Fork/Join 126
job 126
MapReduce Task 126
NonBlockingHashMap 126
K
key-value pairs
data 268
target 269
target_names 269
key-value store 126
K-fold cross-validation 257
K-means clustering 148
Kubernetes 307
L
label encoding 48
Laplacian regularization 148
lazy evaluation 309
leaf nodes 160
learning rate 176
linear equation 151
linear regression
link function 157
local explanation 216
logistic regression algorithm 195, 269
log loss
about 202
log system 343
M
Machine Learning Control (MLC) 147
Machine Learning (ML) 3, 27, 79, 289
Machine Learning (ML), types
reinforcement learning 149
semi-supervised learning 148
supervised learning 148
unsupervised learning 148
majority class 248
Market Basket Analysis 147
Matplotlib
about 265
URL 265
Maven project 316
Maven repository
reference link 298
Maven repository for Spark
reference link 316
mean 137
Mean Absolute Error (MAE) 254
mean per-class error 188
Mean Squared Deviation (MSD) 188
Mean Squared Error (MSE) 19, 59, 188, 189, 254
merge key 86
Microsoft Excel 123
Miniconda 266
minority class 248
missing values
handling, in dataframes 89
ML model
training, with H2O AutoML 14
ML models, metadata
column_types 66
model parameters 65
output 66
output’s training metrics 67
output’s validation metrics 68
variable importances 65
ML pipeline, steps
data collection 5
data exploration 5
data preparation 5
data transformation 5
hyperparameter tuning 5
model selection 5
model training 5
prediction 5
model correlation 230
model correlation heatmaps 222, 230-232
model explainability interface
about 216
implementing, in Python 218, 219
model MOJOs, in H2O Sparkling Water
used, for making predictions 327-330
Model Object, Optimized (MOJOs)
benefits 290
H2O AutoML model, using to make predictions 297-300
H2O models, extracting 291
H2O models, extracting in H2O Flow 294
H2O models, extracting in Python 291-293
H2O models, extracting in R programming language 293
versus POJOs 290
model parameter 137
model performance metrics
absolute Matthews correlation coefficient, calculating 208-211
exploring 206
F1 score performance metric 206-208
R2 performance metric, measuring 211-214
model training functions, H2O Flow
AutoML model training 62
AutoML parameters 54
working with 54
model training, in Python
model training, in R
model training, sequences of interactions
client queries, for model information 134-136
client, sending request to H2O 130, 131
completion status, client polling 133, 134
Monte Carlo Methods 149
multiclass/polynomial classification 147
Multi-Layer Perceptron (MLP) 181
multiple or curvilinear regression 153
N
Natural Language Processing (NLP) 109
NA values
negative binomial regression 157
negative classification 190
Neural Networks (NNs)
workings 182
Nimbus 344
nodes 160
normal distribution 154
normality of errors 153
normally distributed 223
Not Available (NA)/nan 90
NumPy
URL 264
O
one-hot encoding 48
operating systems, with scikit-learn package
Arch Linux 266
Debian/Ubuntu 267
Fedora 267
NetBSD 267
Optimal Control Theory 147
Optimization Problem 147
oversampling the minority class technique
for solving imbalanced dataset 249
P
parallel computing 127
parameters, supporting imbalanced classes
experimenting with 248
parsing process 42
partial dependency graph 216
partial dependency plots (PDP) 217, 222, 232-236
PermGen 318
phi coefficient 210
pkgsrc-wip
download link 267
Plain Old Java Objects (POJOs)
H2O models, downloading in H2O Flow as 283, 284
H2O models, downloading in Python as 281, 282
H2O models, downloading in R as 282, 283
H2O models, extracting as 281
versus MOJOs 290
Poisson regression 157
polynomial 147
positive classification 190
POST request 131
precision 192
precision-recall curve (PR curve)
predicted values 223
prediction functions
working with 68
prediction problems
about 146
association 147
classification 146
clustering 147
Optimization/Control 147
regression analysis 146
predictions
PredictionService class
methods 340
PredictionService file
attributes 340
predictors 157
probability density function 153
Project Object Model (POM) 337
pseudo-residuals 174
PySparkling 315
Python
about 123
H2O models, downloading as POJOs 281, 282
H2O models, extracting as MOJOs 291-293
installing 9
model explainability interface, implementing 218, 219
model training, with H2O AutoML 16-21
URL 9
used, for implementing H2O 8
python3-sklearn-doc package 267
python3-sklearn-lib package 267
python3-sklearn package 267
Q
Q-Learning 149
R
R
about 123
H2O models, downloading as POJOs 282, 283
H2O models, extracting as MOJOs 293
model explainability interface, implementing 220, 221
model training, with H2O AutoML 21, 22
used, for implementing H2O 11
used, for installing H2O 12-14
R2 performance metric
random component 157
Random Forest/Random Decision Forest 159, 168-172
random grid search optimization 138, 142, 143
raw data 80
RDD operations
actions 309
transformations 309
recall 192
receiver operating characteristic (ROC) curve
Recurrent Neural Network (RNN) 180, 181
Red Wine Quality dataset
alcohol 225
chlorides 225
citric acid 225
density 225
fixed acidity 224
free sulfur dioxide 225
pH 225
quality 225
reference link 224
residual analysis graph plot 225, 226
residual sugar 225
sulphates 225
total sulfur dioxide 225
volatile acidity 225
regression analysis 146
regression models 222
regression problem 146, 324, 325
reinforcement learning 149
residual 152, 174, 189, 213, 222
residual analysis 217, 221-224
residual analysis plot 223
residual values 223
Resilient Distributed Dataset (RDD)
converting, into H2OFrame 313
response variables 146
Root Mean Squared Error (RMSE) 19, 189, 254
Root Mean Squared Logarithmic Error (RMSLE) 254
root node 160
rows
combining, from two dataframes 83-85
RStudio
URL 12
S
scikit-learn
about 264
building 265
H2O AutoML integration 264
H2O AutoML, using 270
installing 265
installing, from source 267
installing, Python distribution used 266, 267
latest official release, installing 266
URL 264
scikit-learn library
about 268
Matplotlib 265
NumPy 264
SciPy
URL 265
semi-supervised learning 148
sensitivity 192
Shapley Additive Explanations (SHAP)
about 237
reference link 238
SHAP value 238
skill of model 206
soft clustering 147
sorting metric 63
Spark cluster manager
about 306
Hadoop YARN 307
Kubernetes 307
standalone 306
Spark Context 307
Spark driver 307
Spark executors 307
specificity 192
split_frame function
reference link 17
spout 345
Spring Boot
about 332
architecture components 334, 335
architecture, designing 334
implementation, working with 335-343
using 332
Spring Framework 332
Stacked Ensemble models
about 20
reference link 179
stacking 179
standard deviation 137
stochastic gradient descent 181
storm-starter directory
files 350
Structured Query Language (SQL) 39
supervised learning 148
Supervisor daemon 344
systematic component 157
T
Tableau 123
target class 248
target encoding
about 111
reference link 119
used, for encoding data 111-119
threshold 195
tokenization
about 109
tokenize() 109
topology 345
training logs
about 275
creation_epoch key 278
duration_secs key 278
start_epoch key 278
start_{model_name} key 278
stop_epoch key 278
transformations, RDD operations 309
U
undersampling the majority class technique
for solving imbalanced dataset 248
Universally Unique Identifiers (UUIDs) 44
unsupervised learning 148
User Interfaces (UIs) 27
V
values
replacing, in dataframes 92-97
variable importance graph 216
variable importance heatmap 221
variance 270
W
Wine Quality dataset
reference link 333
Wine Quality dataset features
alcohol 334
chlorides 333
citric acid 333
color 334
density 333
fixed acidity 333
free sulfur dioxide 333
pH 334
quality 334
residual sugar 333
sulphates 334
total sulfur dioxide 333
volatile acidity 333
X
XGBoost
reference link 179
Y
Yet Another Resource Negotiator (YARN) 307
Z
Zookeeper cluster
about 345
URL 345