Index

As this ebook edition doesn't have fixed pagination, the page numbers below are hyperlinked for reference only, based on the printed edition of this book.

Symbols

.ascharacter() 107

.asfactor() 107

.asnumeric() 107

.isfactor() 107

.isnumeric() 107

.isstring() 107

A

absolute Matthews correlation coefficient

calculating 208-211

accuracy 192

actions, RDD operations 309

activation function 183

Anaconda 266

Apache Software Foundation

reference link 316

Apache Spark

about 304

components 304, 305

exploring 304

implementing, with H2O Sparkling Water 319

URL 310

Apache Spark architecture

about 306

Spark cluster manager 306

Spark Context 307

Spark driver 307

Spark executors 307

Apache Spark, components

GraphX 305

Spark core 305

Spark MLlib 305

Spark SQL 305

Spark streaming 305

Apache Storm

about 343

architecture 344, 345

architecture components 349

architecture, designing 348

implementation, working with 349-358

installing 346

master node 344

problem statement 347

streams 345

topologies 345

tuples 345

URL 346

using 343

worker node 344

Apache Storm repo

reference link 346

Apache Thrift

about 345

reference link 345

area under the precision-recall curve (AUC-PR)

about 254

calculating 200-202

area under the ROC curve (AUC-ROC)

about 20, 188, 254

calculating 193-200

Artificial Neural Networks (ANNs)

about 179

types 180, 181

association problem 147

association rule learning 148

Atomicity, Consistency, Isolation, and Durability (ACID) 126

Automated Machine Learning (AutoML)

about 3-6, 54

triggering, on dataset 54

used, for model training 62

AutoML, model operations

Model Management options 56

Run AutoML 55

Run Specific Models 55

AutoML parameters

advanced parameters 58-60

basic parameters 56, 57

expert parameters 61

AutoML training

execution in Sparkling Water 320-327

B

backpropagation 181, 184

bagging 172

Bayesian optimization 138

bias 137, 183, 270

binary classification 147

binary wheels 268

bolt 345

boosting 179

C

Cartesian Hyperparameter Search 140

categorical data 111

cbind() 81, 83

cells 31

chunking 48

class balancing parameters

working with 250, 251

classification problems

about 146, 324

binary classification 147

multiclass/polynomial classification 147

client layer

about 122

observing 123, 124

cluster 127

cluster computing 127

clustering

about 147

hard clustering 147

soft clustering 147

coefficient of determination 211

cold scoring 291

columns

combining, from two dataframes 81-83

sorting 102-106

column summary 49

column types

modifying 106-109

Comprehensive R Archive Network (CRAN)

about 11

URL 11

computational statistics 145

Concrete Compressive Strength dataset

reference link 319

conda environment 268

confusion matrix

about 190, 221

working with 190-193

Control Theory 147

Convolutional Neural Networks (CNNs) 181

cross-validation 257, 258

cross-validation parameters

experimenting with 256

working with 259-261

curl operation 131

curse of dimensionality 141

cut-off line 195

Cython 267

D

data

encoding, with target encoding 111-119

DataFrames

about 80, 126

columns, combining from 81-83

feature columns, manipulating 102

feature_test 269

feature_train 269

label_test 269

label_train 269

merging 86-89

missing values, handling 89

NA values, filling 90-92

reframing 80

rows, combining from 83-85

values, replacing 92-97

data functions, H2O Flow

dataframe, observing 46-51

dataframe, splitting 51-53

dataset, importing 37-42

dataset, parsing 42-46

working with 37

data ingestion interaction

flow 128

data manipulation

reference link 97

data processing 79

data sharing

in external Sparkling Water backend 314

in internal Sparkling Water backend 313

datum 93

decision trees

about 148, 159-167

components 160

deep learning (DL)

about 179, 185, 227

reference link 185

Deep Neural Network (DNN) 181

Deep Q Network 149

dependent variables 146

deviance, ranking metric 188

Directed Acyclic Graph (DAG) 345

Distributed Random Forest (DRF) algorithm

about 159, 227, 296

decision trees 159-167

Extremely Randomized Trees (XRT) 172

Random Forest 168-172

E

early stopping 252, 253

early stopping parameters

working with 253-255

encoding 48, 111

ensemble ML 179

Enumerated Types (Enums) 44

epoch 252

error metrics 18

Estimator 265

event logging, H2O AutoML 275

event log levels

DEBUG 277

ERROR 276

FATAL 276

INFO 276

TRACE 277

WARN 276

event logs

about 275

level 276

message 277

name 277

retrieving 275, 276

stage 277

timestamp 276

value 277

viewing, in R 276

exhaustive search 140

explainability features

confusion matrix 221

exploring 221

feature importance heatmaps 229, 230

individual conditional expectation plots 222-241

leaderboard 221

learning curve plots 241-243

model correlation heatmap 222-232

partial dependency plots 222-236

residual analysis 221-224

SHAP summary plots 222-238

variable importance 221-228

variable importance heatmap 221

explainability function, parameters

columns 217

exclude_explanations 217

include_explanations 217

newdata/frame 217

object 218

plot_overrides 218

top_n_features 217

external Sparkling Water backend

data sharing 314

Extract-Transform-Load (ETL) 344

Extremely Randomized Trees (XRT) 172

F

F1 score performance metric 206-208

feature columns of dataframe

manipulating 102

feature importance

about 226-229

heatmaps 229, 230

features 17, 146

feedforward neural network 180

fillna() function 90-92, 98

Fisher’s Iris dataset 14

Flow notebook/Flow 28

Fluid Vector 126

forward propagation 183, 184

G

Gaussian distribution 154

Gedeon method 227

Generalized Linear Model (GLM) algorithm

about 150, 227

working with 157-159

Generalized Linear Model (GLM) algorithm, components

link function 157

random component 157

systematic component 157

generative models 148

Gini Impurity 163, 164

global explanation 216

gradient 151

gradient-based optimization 138

Gradient Boosting Machine (GBM)

about 70, 173, 227

building 173-179

Graphviz

about 295

installing, on Linux 295

installing, on Mac 295

installing, on Windows 295

grid search hyperparameter optimization 136-141

grid search optimization 138

H

H2O

implementing, with Python 8

implementing, with R 11

installing, with Python 9-11

installing, with R 12-14

H2O-3 MOJO model, loading and usage

reference link 330

H2O AI high-level architecture

about 122

client layer, observing 123, 124

Java Virtual Machine (JVM) component layer, observing 124-126

H2O AutoML

about 6

event logging 274-278

implementing, with H2O Sparkling Water 319

integration, with scikit-learn 264

minimum standard requirements 7, 8

used, for training ML model 14

used, for training model in Python 16-21

used, for training model in R 21-25

using 332, 343

using, in scikit-learn 270

H2OAutoMLClassifier function

about 265

experimenting with 271, 272

H2O AutoML leaderboard performance metrics

AUC-PR metric, calculating 200-202

confusion matrix 190-193

exploring 188

log loss 202-206

Mean Squared Error (MSE) 188, 189

ROC-AUC metric, calculating 193-200

Root Mean Squared Error (RMSE) 189

H2O AutoML model MOJOs

used, for making predictions 297-300

H2OAutoMLRegressor function

about 265

experimenting with 273, 274

H2O client-server interactions

during ingestion of data 127-129

H2ODataSpout file

attributes 352, 353

h2o.explain() function 216

h2o.explain_row() function 216

H2O Flow

about 27, 124

basics 28

data functions, working with 37

downloading 29

download link 29

exploring 29-36

H2O models, downloading as POJOs 283, 284

launching 29

model training functions, working with 54

prediction functions, working with 68

H2OFrame

converting, into RDD 313

H2O leaderboard 272, 273

H2O models

download, as POJOs in H2O Flow 283, 284

download, as POJOs in Python 281, 282

download, as POJOs in R 282, 283

extracting, as MOJOs 291

extracting, as MOJOs in H2O Flow 294

extracting, as MOJOs in Python 291-293

extracting, as MOJOs in R programming language 293

extracting, as POJOs 281

using, as POJOs 284-288

H2O server

sequence of interactions, during model training 129

h2o.sklearn module 265

H2O Sparkling Water

about 310, 312

AutoML training, execution 320-327

benefits 310

downloading 315-318

download link 316

installing 315-318

installing requirements 315

predictions, making with model MOJOs 327-330

problem statement 319, 320

recommended tuning 318, 319

used, for implementing H2O AutoML 319

used, for implementing Spark 319

H2O Sparkling Water, backends

external backend 312

internal backend 311

H2O Sparkling Water, supported platforms

reference link 314

H2OStormStarter file

attributes 353-356

Hadoop Distributed File System (HDFS) 128, 307

hard clustering 147

harmonic mean 207

Heart Failure Clinical dataset

features 347

reference link 347

heteroscedasticity 223, 224

hex file 42

homoscedasticity 223, 224

hot scoring 291

hyperparameter

about 136, 137

values 158

hyperparameter optimization/hyperparameter tuning 5, 137, 138

I

imbalanced dataset

about 248

oversampling 249

undersampling 248

imputation function

about 98

parameters 98

using, to fill missing values 98-102

imputation strategy 98

individual conditional expectation plots 217-241

IntelliJ IDEA version 336

internal parameters 137

internal Sparkling Water backend

data sharing 313

Iris flower dataset

about 14, 15, 291

reference link 14

J

Java

installation link 8

installing 8

Java Archive (JAR) file 29

Java Database Connectivity (JDBC) 6

Java Runtime Environment (JRE) 8

JavaScript 123

Java Virtual Machine (JVM) 10, 289, 313

Java Virtual Machine (JVM) component layer

about 122

observing 124-126

job manager 131

jsr166y 126

Jupyter Notebook 217

JVM nodes

about 124

algorithm layer 125

language layer 125

resource management layer 125

JVM processes

about 124

Distributed key-value store 126

Fluid Vector Frame 126

Fork/Join 126

job 126

MapReduce Task 126

NonBlockingHashMap 126

K

key-value pairs

data 268

target 269

target_names 269

key-value store 126

K-fold cross-validation 257

K-means clustering 148

Kubernetes 307

L

label 17, 146

label encoding 48

Laplacian regularization 148

lazy evaluation 309

leaderboard 62, 216, 221

leaf nodes 160

learning curve plots 241-243

learning rate 176

linear equation 151

linear regression

about 148-152, 223

assumptions 153-157

link function 157

local explanation 216

logistic regression algorithm 195, 269

log loss

about 202

working with 203-206

log system 343

loss function 16, 184

M

Machine Learning Control (MLC) 147

Machine Learning (ML) 3, 27, 79, 289

Machine Learning (ML), types

reinforcement learning 149

semi-supervised learning 148

supervised learning 148

unsupervised learning 148

majority class 248

Market Basket Analysis 147

Matplotlib

about 265

URL 265

Maven project 316

Maven repository

reference link 298

Maven repository for Spark

reference link 316

mean 137

Mean Absolute Error (MAE) 254

mean per-class error 188

Mean Squared Deviation (MSD) 188

Mean Squared Error (MSE) 19, 59, 188, 189, 254

merge() function 86, 89

merge key 86

Microsoft Excel 123

Miniconda 266

minority class 248

missing values

handling, in dataframes 89

ML algorithm 62, 150

ML model

about 63, 64

training, with H2O AutoML 14

ML models, metadata

column_types 66

model parameters 65

output 66

output’s training metrics 67

output’s validation metrics 68

variable importances 65

ML pipeline, steps

data collection 5

data exploration 5

data preparation 5

data transformation 5

hyperparameter tuning 5

model selection 5

model training 5

prediction 5

model correlation 230

model correlation heatmaps 222, 230-232

model explainability interface

about 216

implementing, in Python 218, 219

implementing, in R 220, 221

working with 216-218

model MOJOs, in H2O Sparkling Water

used, for making predictions 327-330

Model Object, Optimized (MOJOs)

about 56, 289, 290

benefits 290

H2O AutoML model, using to make predictions 297-300

H2O models, extracting 291

H2O models, extracting in H2O Flow 294

H2O models, extracting in Python 291-293

H2O models, extracting in R programming language 293

models, viewing 295-297

versus POJOs 290

model parameter 137

model performance metrics

absolute Matthews correlation coefficient, calculating 208-211

exploring 206

F1 score performance metric 206-208

R² performance metric, measuring 211-214

model training 16, 62

model training functions, H2O Flow

AutoML model training 62

AutoML parameters 54

working with 54

model training, in Python

with H2O AutoML 16-21

model training, in R

with H2O AutoML 21-25

model training, sequences of interactions

client queries, for model information 134-136

client, sending request to H2O 130, 131

completion status, client polling 133, 134

H2O, running 131-133

Monte Carlo Methods 149

multiclass/polynomial classification 147

Multi-Layer Perceptron (MLP) 181

multiple or curvilinear regression 153

N

Natural Language Processing (NLP) 109

NA values

filling in dataframes 90-92

negative binomial regression 157

negative classification 190

Neural Networks (NNs)

about 148, 179

components 182-184

workings 182

Nimbus 344

nodes 160

normal distribution 154

normality of errors 153

normally distributed 223

Not Available (NA)/nan 90

NumPy

about 264, 267

URL 264

O

one-hot encoding 48

operating systems, with scikit-learn package

Arch Linux 266

Debian/Ubuntu 267

Fedora 267

NetBSD 267

Optimal Control Theory 147

Optimization Problem 147

overfitting models 176, 251

oversample the minority class techniques

solving, imbalanced dataset 249

P

parallel computing 127

parameters, supporting imbalanced classes

experimenting with 248

parsing process 42

partial dependency graph 216

partial dependency plots (PDP) 217, 222, 232-236

PermGen 318

phi coefficient 210

pkgsrc-wip

download link 267

Plain Old Java Objects (POJOs)

about 64, 280, 281, 290

H2O models, downloading in H2O Flow as 283, 284

H2O models, downloading in Python as 281, 282

H2O models, downloading in R as 282, 283

H2O models, extracting as 281

H2O model, using as 284-288

versus MOJOs 290

Poisson regression 157

polynomial 147

positive classification 190

POST request 131

precision 192

precision-recall curve (PR curve)

calculating 200-202

predicted values 223

prediction functions

working with 68

prediction problems

about 146

association 147

classification 146

clustering 147

Optimization/Control 147

regression analysis 146

predictions

making, with H2O Flow 69-71

results, exploring 71-76

PredictionService class

methods 340

PredictionService file

attributes 340

predictors 157

probability density function 153

Project Object Model (POM) 337

pseudo-residuals 174

PySparkling 315

Python

about 123

H2O models, downloading as POJOs 281, 282

H2O models, extracting as MOJOs 291-293

installing 9

model explainability interface, implementing 218, 219

model training, with H2O AutoML 16-21

URL 9

used, for implementing H2O 8

used, for installing H2O 9-11

python3-sklearn-doc package 267

python3-sklearn-lib package 267

python3-sklearn package 267

Q

Q-Learning 149

R

R

about 123

H2O models, downloading as POJOs 282, 283

H2O models, extracting as MOJOs 293

installing 11, 12

model explainability interface, implementing 220, 221

model training, with H2O AutoML 21, 22

used, for implementing H2O 11

used, for installing H2O 12-14

R² performance metric

measuring 211-214

random component 157

Random Forest/Random Decision Forest 159, 168-172

random grid search optimization 138, 142, 143

raw data 80

rbind() 83, 85

RDD operations

actions 309

transformations 309

recall 192

receiver operating characteristic (ROC) curve

calculating 193-200

Recurrent Neural Network (RNN) 180, 181

Red Wine Quality dataset

alcohol 225

chlorides 225

citric acid 225

density 225

fixed acidity 224

free sulfur dioxide 225

pH 225

quality 225

reference link 224

residual analysis graph plot 225, 226

residual sugar 225

sulphates 225

total sulfur dioxide 225

volatile acidity 225

regression analysis 146

regression models 222

regression problem 146, 324, 325

reinforcement learning 149

residual 152, 174, 189, 213, 222

residual analysis 217, 221-224

residual analysis plot 223

residual values 223

Resilient Distributed Dataset (RDD)

about 307-310

converting, into H2OFrame 313

response variables 146

Root Mean Squared Error (RMSE) 19, 189, 254

Root Mean Squared Logarithmic Error (RMSLE) 254

root node 160

rows

combining, from two dataframes 83-85

RStudio

URL 12

S

scikit-learn

about 264

building 265

experimenting with 268-270

H2O AutoML integration 264

H2O AutoML, using 270

installing 265

installing, from source 267

installing, Python distribution used 266, 267

latest official release, installing 266

URL 264

scikit-learn library

about 268

Matplotlib 265

NumPy 264

SciPy

about 265, 267

URL 265

semi-supervised learning 148

sensitivity 192

Shapley Additive Explanations (SHAP)

about 237

reference link 238

SHAP summary 217, 222

SHAP summary plots 236-238

SHAP value 238

skill of model 206

soft clustering 147

sort() 103, 104

sorting metric 63

Spark cluster manager

about 306

Hadoop YARN 307

Kubernetes 307

standalone 306

Spark Context 307

Spark driver 307

Spark executors 307

specificity 192

split_frame function

reference link 17

Split operation 51, 52

spout 345

Spring Boot

about 332

architecture components 334, 335

architecture, designing 334

implementation, working with 335-343

problem statement 332, 333

using 332

Spring Framework 332

Stacked Ensemble models

about 20

reference link 179

stacking 179

standard deviation 137

stochastic gradient descent 181

storm-starter directory

files 350

Structured Query Language (SQL) 39

supervised learning 148

Supervisor daemon 344

systematic component 157

T

Tableau 123

target class 248

target encoding

about 111

reference link 119

used, for encoding data 111-119

threshold 195

tokenization

about 109

of textual data 109-111

tokenize() 109

topology 345

training logs

about 275

creation_epoch key 278

duration_secs key 278

retrieving 277, 278

start_epoch key 278

start_{model_name} key 278

stop_epoch key 278

transformations, RDD operations 309

U

undersampling the majority class techniques

solving, imbalanced dataset 248

Universally Unique Identifiers (UUIDs) 44

unsupervised learning 148

User Interfaces (UIs) 27

V

values

replacing, in dataframes 92-97

variable importance 221-228

variable importance graph 216

variable importance heatmap 221

variance 270

W

weights 137, 183

Wine Quality dataset

reference link 333

Wine Quality dataset features

alcohol 334

chlorides 333

citric acid 333

color 334

density 333

fixed acidity 333

free sulfur dioxide 333

pH 334

quality 334

residual sugar 333

sulphates 334

total sulfur dioxide 333

volatile acidity 333

X

XGBoost

about 179, 227

reference link 179

Y

Yet Another Resource Negotiator (YARN) 307

Z

ZooKeeper cluster

about 345

URL 345
