Index

A

advanced_grid_search method 114

AI (artificial intelligence) 1

AIF360 (AI Fairness 360) 96-100

AN (actual negatives) 86

AP (actual positives) 86

arbitrary category imputation 47-48

arbitrary value imputation 43

autoencoders

basics of 126-127

training to learn features 127-130

B

bag of words approach 109

batch_embed_text function 133

BERTs (bidirectional encoder representations from transformers)

pretrained features 133-135

transfer learning with 131-133

BinaryLabelDataset dataframe 97

binning 54-55

Box-Cox transforms 50-52

C

cached feature groups 208

case studies

COVID-19 diagnostics case study 71

data streaming case study 220

day trading case study 196

law school success prediction case study 102

object recognition case study 160

social media sentiment classification case study 137

categorical data construction 54-59

binning 54-55

categorical dummy bucketing 234-236

constructing dummy features from categorical data 227-228

domain-specific feature construction 58-59

one-hot encodings 56-58

when to dummify categorical variables vs. leaving as single column 232-233

categorical dummy bucketing 234-236

CI/CD (continuous integration and development) 198

CIFAR-10 dataset 139-140

coef attribute 68

ColumnTransformer object 84

COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) dataset 73-75

computer vision case study 160

CIFAR-10 dataset 139-140

feature construction 140-142

feature extraction

histogram of oriented gradients 143-148

principal component analysis 147-148

feature learning 149-159

fine-tuning VGG-11 155-157

using fine-tuned VGG-11 features with logistic regression 158-159

using pretrained VGG-11 as feature extractor 151-154

image vectorization 159-160

problem statement and success definition 140

concept drift 6

ConvNets (Convolutional Networks) 149

count vectorization 110-115

CountVectorizer 110, 115, 239

COVID-19 diagnostics case study 71

COVID-flu diagnostic dataset 36-39

exploratory data analysis 39-41

feature construction 48-59

categorical data construction 54-59

numerical feature transformations 48-54

feature improvement 41-48

imputing missing qualitative data 46-48

imputing missing quantitative data 41-46

feature selection 66-69

hypothesis testing 67

machine learning 68-69

mutual information 66-67

pipeline

building 59-65

train/test splits 59-65

problem statement and success definition 36-39

create_feature_group method 211

CSAT (customer satisfaction score) 21

D

daily price features 179-180

data drift 7

data imputation 227

data streaming case study

benefits of using feature stores 198-202

creating training data in Hopsworks 215-220

provenance 219-220

training datasets 215-219

MLOps and feature stores 198-203

setting up feature stores with Hopsworks 204-215

connecting to Hopsworks with HSFS API 204-206

feature groups 207-213

using feature groups to select data 213-215

Wikipedia and MLOps and feature stores 202-203

data transforms 228

data-type-specific feature engineering techniques 226-232

structured data 227-228

constructing dummy features from categorical data 227-228

data imputation 227

data transforms 228

standardization and normalization 228

unstructured data 230-232

image data 230

text data 230

time-series data 230-232

DataLoaders 151

date/time features 166-167

datetime index 163

datetime object 175

datetime.strptime feature 210

day trading case study 196

feature construction 166-185

date/time features 166-167

domain-specific features 179-185

lag features 168

rolling/expanding window features 169-178

feature extraction 189-192

feature selection 185-188

recursive feature elimination 187-188

selecting features using machine learning 186-187

problem statement 164-166

TWLO dataset 162-166

dimension reduction, optimizing with principal component analysis 147-148

disparate impact

disparate treatment vs. 79

treating using Yeo-Johnson transformers 91-96

domain-specific feature construction 58-59

DummifyRiskFactor transformer 57

dummy features

categorical dummy bucketing 234-236

constructing from categorical data 227-228

when to dummify categorical variables vs. leaving as single column 232-233

why not to dummify everything 232-233

E

EDA (exploratory data analysis) 5

EMA (exponential moving average) 180

end-of-tail imputation 43-46

equalized odds 81

evaluation metrics 30-32

fairness and bias 31

interpretability 31

machine learning complexity and speed 31-32

machine learning metrics 30

event_time feature 211

expanding window features 170-172

Explainer object 85

exploratory data analysis

COVID-19 diagnostics case study 39-41

law school success prediction case study 76-79

exploratory data analysis (EDA) 5

exponential moving average (EMA) 180

ExtraTreesClassifier model 37

F

fair representation implementation 96-100

fairness and bias 31, 79-81

bias-aware model 91-100

feature construction 91-96

feature extraction 96-100

definitions of fairness 79-81

equalized odds 81

statistical parity 80-81

unawareness 80

disparate treatment vs. disparate impact 79

how to know if bias in data needs to be dealt with 234

measuring bias in baseline model 85-90

mitigating bias 90-91

in-processing 90

post-processing 90-91

preprocessing 90

false negatives (FN) 86

false positives (FP) 86

feature construction

basics of 25-26, 224

COVID-19 diagnostics case study 48-59

categorical data construction 54-59

numerical feature transformations 48-54

day trading case study 166-185

date/time features 166-167

domain-specific features 179-185

lag features 168

rolling/expanding window features 169-178

law school success prediction case study

baseline model 82

bias-aware model 91-96

object recognition case study 140-142

social media sentiment classification case study 109

feature engineering 12, 33, 241

approach to process 32

as crucial as machine learning model choice 222-223

case studies 11-12

categorical dummy bucketing 234-236

combining learned features with conventional features 236-239

data-type-specific techniques 226-232

structured data 227-228

unstructured data 230-232

defined 2-3

evaluation metrics 30-32

fairness and bias 31

interpretability 31

machine learning complexity and speed 31-32

machine learning metrics 30

frequently asked questions 232-234

how to know if bias in data needs to be dealt with 234

when to dummify categorical variables vs. leaving as single column 232-233

why not to dummify everything 232-233

further reading material 240-241

great data and great models 4-5

levels of data 16-23

interval level 19-20

nominal level 16-18

ordinal level 18

qualitative data vs. quantitative data 16

ratio level 20-23

limits of 4

need for 3-4

not one-size-fits-all solution 223

pipeline 5-8, 221-222

raw-data vectorizers 239-240

types of 9-10, 24-29, 224-226

feature construction 25-26, 224

feature extraction 27-28, 225

feature improvement 24-25, 224

feature learning 28-29, 226

feature selection 26-27, 224-225

types of data 14-15

structured data 14

unstructured data 14-15

feature extraction

basics of 27-28, 225

day trading case study 189-192

law school success prediction case study 96-100

object recognition case study

histogram of oriented gradients 143-148

principal component analysis 147-148

social media sentiment classification case study 123-125

feature groups

basics of 207-213

using to select data 213-215

feature improvement

basics of 24-25, 224

COVID-19 diagnostics case study 41-48

imputing missing qualitative data 46-48

imputing missing quantitative data 41-46

social media sentiment classification case study 118-123

cleaning noise from text 118-120

standardizing tokens 120-123

feature learning

basics of 28-29, 226

object recognition case study 149-159

fine-tuning VGG-11 155-157

using fine-tuned VGG-11 features with logistic regression 158-159

using pretrained VGG-11 as feature extractor 151-154

social media sentiment classification case study 125-135

autoencoders 126-127

BERTs pretrained features 133-135

transfer learning 130

feature scaling 52-54

feature selection

basics of 26-27, 224-225

COVID-19 diagnostics case study 66-69

hypothesis testing 67

machine learning 68-69

mutual information 66-67

day trading case study 185-188

recursive feature elimination 187-188

selecting features using machine learning 186-187

feature stores

benefits of using 198-202

compliance and governance 202

real-time feature serving 202

single source of features 200-202

creating training data in Hopsworks 215-220

provenance 219-220

training datasets 215-219

MLOps and 198-203

setting up with Hopsworks 204-215

connecting to Hopsworks with HSFS API 204-206

feature groups 207-213

using feature groups to select data 213-215

Wikipedia and 202-203

feature-engine package 43

FeatureUnion class 60, 237

FeatureUnion object 98

float64 types 41

FluSymptoms feature 70

FN (false negatives) 86

four-fifths rule 80

FP (false positives) 86

G

GANs (generative adversarial networks) 10

GridSearchCV instance 37

Grigorev, Alexey 241

H

harmonic mean 21

healthcare case study 71

COVID-flu diagnostic dataset 36-39

exploratory data analysis 39-41

feature construction 48-59

categorical data construction 54-59

numerical feature transformations 48-54

feature improvement 41-48

imputing missing qualitative data 46-48

imputing missing quantitative data 41-46

feature selection 66-69

hypothesis testing 67

machine learning 68-69

mutual information 66-67

pipeline

building 59-65

train/test splits 59-65

problem statement and success definition 36-39

HOGs (histogram of oriented gradients) 143-148

Hopsworks

creating training data in 215-220

provenance 219-220

training datasets 215-219

setting up feature stores with 204-215

connecting to Hopsworks with HSFS API 204-206

feature groups 207-215

HSFS API 204-206

Huggingface 239

hypothesis testing 67

I

IDF (inverse document frequency) 115

image data 230

image vectorization 159-160

interpretability 31

interval level

dealing with data at 19-20

defined 19

K

k-NN (k-nearest neighbors) 52

Kakade, Sunil 241

KBinsDiscretizer class 55, 234

L

lag features 168

law school success prediction case study 102

baseline model 82-90

feature construction 82

measuring bias in 85-90

pipeline 83-84

bias-aware model 91-100

feature construction 91-96

feature extraction 96-100

COMPAS dataset 73-75

exploratory data analysis 76-79

fairness and bias measurement 79-81

definitions of fairness 79-81

disparate treatment vs. disparate impact 79

mitigating bias 90-91

in-processing 90

post-processing 90-91

preprocessing 90

problem statement and success definition 75

log-transforms 48-49

logistic regression, using fine-tuned VGG-11 features with 158-159

lymphocytes feature 43

M

MACD (moving average convergence divergence) 180-181

Machine Learning Bookcamp (Grigorev) 241

max_features parameter 110, 118

mean pixel value (MPV) 140

mean/median imputation 42

min-max standardization scales 52

ML (machine learning)

complexity and speed 31-32

day trading case study 196

feature construction 166-185

feature extraction 189-192

feature selection 185-188

problem statement 164-166

TWLO dataset 162-166

feature engineering as crucial as ML model choice 222-223

feature selection with 68-69

metrics 30

pipeline 5-8

selecting features using 186-187

MLM (masked language model) 132

MLOps 198-203

MLOps Engineering at Scale (Osipov) 241

model_fairness object 86

most-frequent category imputation 47

moving average convergence divergence (MACD) 180-181

MPV (mean pixel value) 140

MultiLabelBinarizer class 57

multivariate time series 164

mutual information 66-67

N

ngram_range parameter 111

NLP (Natural Language Processing) 4, 137

feature extraction 123-125

feature improvement 118-123

cleaning noise from text 118-120

standardizing tokens 120-123

feature learning 125-135

autoencoders 126-127

BERTs pretrained features 133-135

transfer learning 130

problem statement and success definition 108

text vectorization 108-117, 135

bag of words approach 109

count vectorization 110-115

TF-IDF vectorization 115-117

tweet sentiment dataset 105-108

NLTK (Natural Language Toolkit) 120-121

noise, cleaning from text 118-120

nominal binary feature 166

nominal level

dealing with data at 17-18

defined 16-17

normalization 228

NPS (net promoter score) 21

NSP (next sentence prediction) 132

numerical feature transformations 48-54

Box-Cox transforms 50-52

feature scaling 52-54

log-transforms 48-49

O

object pandas 41

object recognition case study 160

CIFAR-10 dataset 139-140

feature construction 140-142

feature extraction

histogram of oriented gradients 143-148

principal component analysis 147-148

feature learning 149-159

fine-tuning VGG-11 155-157

using fine-tuned VGG-11 features with logistic regression 158-159

using pretrained VGG-11 as feature extractor 151-154

image vectorization 159-160

problem statement and success definition 140

on-demand feature groups 208

one-hot encodings 56-58

online feature serving 208

ordinal feature 166

ordinal level

dealing with data at 18

defined 18

Osipov, Carl 241

Ozdemir, Sinan 241

P

pandas profiling 106

PCA (principal component analysis) 27, 123, 147-148

pip3 install hsfs library 205

Pipeline class 60

Pipeline object 37, 173

plot_gains function 178

PN (predicted negatives) 86

polynomial feature extraction 189-192

PowerTransformer class 50

PP (predicted positives) 86

precision, defined 38-39

Principles of Data Science, The, Second Edition (Ozdemir and Kakade) 241

priors_count value 93

provenance 219-220

Q

qualitative data

imputing missing 46-48

arbitrary category imputation 47-48

most-frequent category imputation 47

quantitative data vs. 16

quantitative data

imputing missing 41-46

arbitrary value imputation 43

end-of-tail imputation 43-46

mean/median imputation 42

qualitative data vs. 16

R

ratio level

dealing with data at 20-23

defined 20

raw-data vectorizers 239-240

recall, defined 38-39

RFE (recursive feature elimination) 187-188

RiskFactor feature 56

rolling window features 169-170

S

sample bias 77

SelectFromModel module 185

SelectFromModel object 186, 223

SimpleImputer class 42

singular value decomposition (SVD) 27, 123-125

social media sentiment classification case study 137

feature extraction 123-125

feature improvement 118-123

cleaning noise from text 118-120

standardizing tokens 120-123

feature learning 125-135

autoencoders 126-127

BERTs pretrained features 133-135

transfer learning 130

problem statement and success definition 108

text vectorization 108-117, 135

bag of words approach 109

count vectorization 110-115

TF-IDF vectorization 115-117

tweet sentiment dataset 105-108

sparse matrix object 110

split_data function 175

standardization 228

StandardScaler class 53

StandardScaler module 224

statistical parity 80-81

stopwords 113

structured data 14, 227-228

constructing dummy features from categorical data 227-228

data imputation 227

data transforms 228

standardization and normalization 228

SVD (singular value decomposition) 27, 123-125

T

text data 230

text vectorization 108-117, 135

bag of words approach 109

count vectorization 110-115

TF-IDF vectorization 115-117

TF-IDF (term-frequency inverse document-frequency) vectorization 115-117

TfidfVectorizer 239

time series analysis case study 196

feature construction 166-185

date/time features 166-167

domain-specific features 179-185

lag features 168

rolling/expanding window features 169-178

feature extraction 189-192

feature selection 185-188

recursive feature elimination 187-188

selecting features using machine learning 186-187

problem statement 164-166

TWLO dataset 162-166

time series cross-validation splitting 173-178

time-series data 230-232

TimeSeriesSplit object 173

tokenizer parameter 121

tokens, standardizing 120-123

TP (true positives) 86

train/test splits 59-65

training datasets 215-219

transfer learning

basics of 130

with BERTs 131-133

tweet sentiment dataset 105-108

Twitter

day trading case study 181-184

social media sentiment classification case study 137

U

unawareness 80

univariate time series 164

unstructured data 14-15, 230-232

image data 230

text data 230

time-series data 230-232

V

vectorization

image 159-160

raw-data vectorizers 239-240

text 108-117, 135

bag of words approach 109

count vectorization 110-115

TF-IDF vectorization 115-117

VGG (Visual Geometry Group) 149

VGG-11 149-159

fine-tuning 155-157

using fine-tuned VGG-11 features with logistic regression 158-159

using pretrained VGG-11 as feature extractor 151-154

W

Wikipedia 202-203

Y

Yeo-Johnson transformers 91-96

Z

z-score standardization scales 52
