advanced_grid_search method 114
AI (artificial intelligence) 1
AIF360 (AI Fairness 360) 96 – 100
arbitrary category imputation 47 – 48
autoencoders
training to learn features 127 – 130
BERT (bidirectional encoder representations from transformers)
transfer learning with 131 – 133
BinaryLabelDataset dataframe 97
binning 54 – 55
case studies
COVID-19 diagnostics case study 71
law school success prediction case study 102
object recognition case study 160
social media sentiment classification case study 137
categorical data construction 54 – 59
binning 54 – 55
categorical dummy bucketing 234 – 236
constructing dummy features from categorical data 227 – 228
domain-specific feature construction 58 – 59
when to dummify categorical variables vs. leaving as single column 232 – 233
categorical dummy bucketing 234 – 236
CI/CD (continuous integration/continuous delivery) 198
COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) dataset 73 – 75
computer vision case study 160
feature construction 140 – 142
histogram of oriented gradients 143 – 148
principal component analysis 147 – 148
using fine-tuned VGG-11 features with logistic regression 158 – 159
using pretrained VGG-11 as feature extractor 151 – 154
problem statement and success definition 140
ConvNets (convolutional networks) 149
COVID-19 diagnostics case study 71
COVID-flu diagnostic dataset 36 – 39
exploratory data analysis 39 – 41
categorical data construction 54 – 59
numerical feature transformations 48 – 54
imputing missing qualitative data 46 – 48
imputing missing quantitative data 41 – 46
problem statement and success definition 36 – 39
create_feature_group method 211
CSAT (customer satisfaction score) 21
daily price features 179 – 180
benefits of using feature stores 198 – 202
creating training data in Hopsworks 215 – 220
MLOps and feature stores 198 – 203
setting up feature stores with Hopsworks 204 – 215
connecting to Hopsworks with HSFS API 204 – 206
using feature groups to select data 213 – 215
Wikipedia and MLOps and feature stores 202 – 203
data-type-specific feature engineering techniques 226 – 232
constructing dummy features from categorical data 227 – 228
standardization and normalization 228
day trading case study
feature construction 166 – 185
domain-specific features 179 – 185
rolling/expanding window features 169 – 178
feature selection 185 – 188
recursive feature elimination 187 – 188
selecting features using machine learning 186 – 187
dimension reduction, optimizing with principal component analysis 147 – 148
disparate impact
treating using Yeo-Johnson transformers 91 – 96
domain-specific feature construction 58 – 59
DummifyRiskFactor transformer 57
dummy features
categorical dummy bucketing 234 – 236
constructing from categorical data 227 – 228
when to dummify categorical variables vs. leaving as single column 232 – 233
why not to dummify everything 232 – 233
EDA (exploratory data analysis) 5
EMA (exponential moving average) 180
end-of-tail imputation 43 – 46
machine learning complexity and speed 31 – 32
expanding window features 170 – 172
exploratory data analysis (EDA) 5
COVID-19 diagnostics case study 39 – 41
law school success prediction case study 76 – 79
exponential moving average (EMA) 180
fair representation implementation 96 – 100
fairness
definitions of fairness 79 – 81
disparate treatment vs. disparate impact 79
how to know if bias in data needs to be dealt with 234
measuring bias in baseline model 85 – 90
feature construction
basics of 25 – 26, 224
COVID-19 diagnostics case study 48 – 59
categorical data construction 54 – 59
numerical feature transformations 48 – 54
day trading case study 166 – 185
domain-specific features 179 – 185
rolling/expanding window features 169 – 178
law school success prediction case study
object recognition case study 140 – 142
social media sentiment classification case study 109
feature engineering 12, 33, 241
as crucial as machine learning model choice 222 – 223
categorical dummy bucketing 234 – 236
combining learned features with conventional features 236 – 239
data-type-specific techniques 226 – 232
machine learning complexity and speed 31 – 32
frequently asked questions 232 – 234
how to know if bias in data needs to be dealt with 234
when to dummify categorical variables vs. leaving as single column 232 – 233
why not to dummify everything 232 – 233
further reading material 240 – 241
great data and great models 4 – 5
qualitative data vs. quantitative data 16
need for 3 – 4
not one-size-fits-all solution 223
pipeline 5 – 8, 221 – 222
raw-data vectorizers 239 – 240
types of 9 – 10, 24 – 29, 224 – 226
feature construction 25 – 26, 224
feature extraction 27 – 28, 225
feature improvement 24 – 25, 224
feature selection 26 – 27, 224 – 225
feature extraction
basics of 27 – 28, 225
day trading case study 189 – 192
law school success prediction case study 96 – 100
object recognition case study
histogram of oriented gradients 143 – 148
principal component analysis 147 – 148
social media sentiment classification case study 123 – 125
feature groups
using to select data 213 – 215
feature improvement
basics of 24 – 25, 224
COVID-19 diagnostics case study 41 – 48
imputing missing qualitative data 46 – 48
imputing missing quantitative data 41 – 46
social media sentiment classification case study 118 – 123
cleaning noise from text 118 – 120
standardizing tokens 120 – 123
feature learning
basics of 28 – 29, 226
object recognition case study 149 – 159
using fine-tuned VGG-11 features with logistic regression 158 – 159
using pretrained VGG-11 as feature extractor 151 – 154
social media sentiment classification case study 125 – 135
BERT's pretrained features 133 – 135
feature selection
basics of 26 – 27, 224 – 225
COVID-19 diagnostics case study 66 – 69
day trading case study 185 – 188
recursive feature elimination 187 – 188
selecting features using machine learning 186 – 187
feature stores
single source of features 200 – 202
creating training data in Hopsworks 215 – 220
setting up with Hopsworks 204 – 215
connecting to Hopsworks with HSFS API 204 – 206
using feature groups to select data 213 – 215
GANs (generative adversarial networks) 10
COVID-flu diagnostic dataset 36 – 39
exploratory data analysis 39 – 41
categorical data construction 54 – 59
numerical feature transformations 48 – 54
imputing missing qualitative data 46 – 48
imputing missing quantitative data 41 – 46
problem statement and success definition 36 – 39
HOGs (histogram of oriented gradients) 143 – 148
Hopsworks
creating training data in 215 – 220
setting up feature stores with 204 – 215
connecting to Hopsworks with HSFS API 204 – 206
IDF (inverse document frequency) 115
k-NN (k-nearest neighbors) 52
KBinsDiscretizer class 55, 234
law school success prediction case study 102
exploratory data analysis 76 – 79
fairness and bias measurement 79 – 81
definitions of fairness 79 – 81
disparate treatment vs. disparate impact 79
problem statement and success definition 75
logistic regression, using fine-tuned VGG-11 features with 158 – 159
MACD (moving average convergence divergence) 180 – 181
Machine Learning Bookcamp (Grigorev) 241
min-max standardization scales 52
ML (machine learning)
feature construction 166 – 185
feature engineering as crucial as ML model choice 222 – 223
feature selection with 68 – 69
pipeline 5 – 8
selecting features using 186 – 187
MLM (masked language model) 132
MLOps Engineering at Scale (Osipov) 241
most-frequent category imputation 47
moving average convergence divergence (MACD) 180 – 181
NLP (Natural Language Processing) 4, 137
cleaning noise from text 118 – 120
standardizing tokens 120 – 123
BERT's pretrained features 133 – 135
problem statement and success definition 108
text vectorization 108 – 117, 135
TF-IDF vectorization 115 – 117
tweet sentiment dataset 105 – 108
NLTK (Natural Language Toolkit) 120 – 121
noise, cleaning from text 118 – 120
defined 16 – 17
NSP (next sentence prediction) 132
numerical feature transformations 48 – 54
object recognition case study 160
feature construction 140 – 142
histogram of oriented gradients 143 – 148
principal component analysis 147 – 148
using fine-tuned VGG-11 features with logistic regression 158 – 159
using pretrained VGG-11 as feature extractor 151 – 154
problem statement and success definition 140
PCA (principal component analysis) 27, 123, 147 – 148
polynomial feature extraction 189 – 192
Principles of Data Science, The, Second Edition (Ozdemir and Kakade) 241
qualitative data, imputing missing 46 – 48
arbitrary category imputation 47 – 48
most-frequent category imputation 47
quantitative data, imputing missing 41 – 46
end-of-tail imputation 43 – 46
raw-data vectorizers 239 – 240
RFE (recursive feature elimination) 187 – 188
rolling window features 169 – 170
SelectFromModel object 186, 223
singular value decomposition (SVD) 27, 123 – 125
social media sentiment classification case study 137
cleaning noise from text 118 – 120
standardizing tokens 120 – 123
BERT's pretrained features 133 – 135
problem statement and success definition 108
text vectorization 108 – 117, 135
TF-IDF vectorization 115 – 117
tweet sentiment dataset 105 – 108
structured data
constructing dummy features from categorical data 227 – 228
standardization and normalization 228
SVD (singular value decomposition) 27, 123 – 125
text vectorization 108 – 117, 135
TF-IDF vectorization 115 – 117
TF-IDF (term-frequency inverse document-frequency) vectorization 115 – 117
time series analysis case study 196
feature construction 166 – 185
domain-specific features 179 – 185
rolling/expanding window features 169 – 178
feature selection 185 – 188
recursive feature elimination 187 – 188
selecting features using machine learning 186 – 187
time series cross-validation splitting 173 – 178
tokens, standardizing 120 – 123
tweet sentiment dataset 105 – 108
Twitter
day trading case study 181 – 184
social media sentiment classification case study 137
unstructured data 14 – 15, 230 – 232
raw-data vectorizers 239 – 240
TF-IDF vectorization 115 – 117
VGG (Visual Geometry Group) 149
using fine-tuned VGG-11 features with logistic regression 158 – 159
using pretrained VGG-11 as feature extractor 151 – 154
Yeo-Johnson transformers 91 – 96
z-score standardization scales 52