index

A

A/B testing 316–333

data collection 317–319

evaluating categorical metrics 329–333

evaluating continuous metrics 319–323

using alternative displays and tests 325–329

what not to do 319

acceptance testing 438–470

data consistency 439–446

dangers of data silo 445–446

feature stores 441–442

process over technology 442–445

training and inference skew 440–441

end user vs. internal use testing 453–460

biased testing 456–457

dogfooding 457–458

SME evaluation 459–460

fallbacks and cold starts 447–452

cold-start woes 450–452

leaning heavily on prior art 448–450

model interpretability 460–469

Shapley additive explanations 461–463

using shap 463–466, 469

ACID-compliant storage layer 442

active retraining 352

Agile software engineering 31–35

communication and cooperation 33–35

embracing and expecting change 35

algorithmic complexity 536–539

alignment 408–410

ALS (alternating least squares) 44

Anaconda Navigator 542

analysis paralysis 53

API documentation 130–135

approximate Shapley value estimation 461–463

architecture, code 276–278

ARIMA, rapid testing for 184–186

artifact management 472–481

interfacing with model registry 476–481

MLflow model registry 474–475

asynchronous concurrency 239–241

attribution measurement 302–316

clarifying correlation vs. causation 312–316

leveraging A/B testing for calculations 316–333

data collection 317–319

evaluating categorical metrics 329–333

evaluating continuous metrics 319–323

using alternative displays and tests 325–329

what not to do 319

prediction performance 302–310

autocorrelation 154

autoML (automated-ML) 205

autoregressive parameters (p) variable 184

availability, data 403–404

B

ball of mud 273

baseline comparison visualization 164–167

Becoming Agile in an Imperfect World (Smith and Sidky) 113

BI (business intelligence) style queries 194

biased testing 456–457

Big O 510–539

algorithmic complexity for ML 536–539

analyzing decision-tree complexity 531–536

complexity 519–529

O(1) 519–521

O(n) 521–523

O(n²) 524–529

overview 515–516

overview 510–516

Bird, Steven 514

black swan events 346

blind catch 283–284

branch strategies, logging 234–236

bulk external delivery 500–502

delivery consistency 500–501

quality assurance 502

burst volume 504–507

business knowledge 50

business rules chaos 116–120

backup plan 119–120

planning for 117–119

C

cargo cult ML (machine learning) behavior 432–437

categorical metrics 329–333

causation 312–316

CDD (chaos-driven development) 115

CI/CD (continuous integration/continuous deployment) system 194

citizen data scientist 206

classification problems 502

clean experimentation environment 540–541

cleansing data 410–412

CNNs (convolutional neural networks) 14

code and coding 269–299

code architecture 276–278

code smells 270–272

designing modular ML 257–264

efficient code 274–275

exception eating 282–288

exception handling 285–286

handling errors right way 286–288

try/catch block 283–284

excessively nested logic 292–297

global mutable objects 288–291

encapsulation to prevent mutable side effects 290–291

mutability 288–290

naming conventions and structure 273–274

production code 399–437

avoiding cargo cult ML (machine learning) behavior 432–437

guiding principles 401–412

monitoring everything in model life cycle 417–421

monitoring features 412–416

simplicity 421–426

wireframing ML projects 426–431

setting guidelines in 163–172

baseline comparison visualization 164–167

standard metrics 167–172

tuple unpacking 278–282

alternative to 280–282

example of 278–280

code smells 270–272

cold starts 447–452

collaborative involvement 33

collections, polynomial relationship and 524–529

communication 33–35, 76–123, 163

business rules chaos 116–120

backup plan 119–120

planning for 117–119

defining problem 79–100

ideal implementation 86–88

project-based meetings 89–93

setting critical discussion boundaries 94–100

what will it do 81–86

working with SMEs (subject-matter experts) 89

explaining results 120–122

meeting with cross-functional teams 101–108

development progress reviews 105

experimental update meeting 102–103

MVP review 106–107

preproduction review 107–108

SMEs (subject-matter experts) review/prototype review 103–104

setting limits on experimentation 108–116

CDD (chaos-driven development) 115

maintainability and extensibility 112–113

PDD (prayer-driven development) 114–115

RDD (resume-driven development) 115–116

TDD (test-driven development) or FDD (feature-driven development) 113

time limit 109–110

complexity 519–529

assessing risk 66

elegant complexity 355–364

lightweight scripted style 357–361

overengineering vs. 361–364

O(1) 519–521

O(n) 521–523

O(n²) 524–529

overview 515–516

concept drift 341–343

concurrency

asynchronous concurrency 239–241

scalability and 239

Conda environment manager 542

constructIndexers() method 363

containers

creating container-based pristine environment for experimentation 543–544

for dependency hell 542

continuous integration/continuous deployment (CI/CD) system 194

continuous metrics 319–323

control code 510

cooperation 33–35

correlation 312–316

cost, serving needs 494

cowboy development 115

cross-functional teams 101–108

development progress reviews 105

experimental update meeting 102–103

MVP review 106–107

preproduction review 107–108

SMEs review/prototype review 103–104

D

d (differences) variable 184

data

analysis 139–146

cleanliness 143–146

collection for A/B testing 317–319

consistency 439–446

dangers of data silo 445–446

feature stores 441–442

process over technology 442–445

training and inference skew 440–441

guiding principles for production code 401–412

alignment 408–410

checking data provenance 404–408

data availability 403–404

embedding data cleansing 410–412

quality 50–52

database, serving from 498

DataFrame functions module 359

DataFrame object 372

data science 26–37

co-opting principles of Agile software engineering 31–35

communication and cooperation 33–35

embracing and expecting change 35

foundation of ML (machine learning) engineering 35–37

foundation of simplicity 29

increasing project success 27–29

Data Science, Classification, and Related Methods (Hayashi) 27

data silo 445–446

data warehouse, serving from 498

debugging walls of text 255–257

decision trees, complexity 531–536

delivery consistency 500–501

demos, planning for 56–57

dependency hell 540, 542

deployment 18–21

detecting drift 335–347

concept drift 341–343

feature drift 337–339

feedback drift and law of diminishing returns 346–347

label drift 339–341

prediction drift 343–345

reality drift 346

development 15–18

progress reviews 105

setting up environment 540–544

case for clean experimentation environment 540–541

containers to deal with dependency hell 542

creating container-based pristine environment for experimentation 543–544

sprint reviews 98

DevOps (development operations) 31

diminishing returns 346–347

discussion boundaries 94–100

development sprint reviews 98

MVP review 98–99

post-experimentation phase 96–98

post-research phase discussion 95–96

preproduction review 100

displays 325–329

Docker 542

docker pull continuumio/anaconda3 command 544

dogfooding 457–458

drift 334–352

detecting 335–347

concept drift 341–343

feature drift 337–339

feedback drift and law of diminishing returns 346–347

label drift 339–341

prediction drift 343–345

reality drift 346

responding to 347–352

drivers, handling tuning with SparkTrials 218–222

E

edge deployment 507–508

efficient code 274–275

elegant complexity 355–364

lightweight scripted style (imperative) 357–361

overengineering vs. 361–364

elif statements 292, 294

else statements 292, 294

encapsulation 290–291

end user testing 453–460

biased testing 456–457

dogfooding 457–458

SME evaluation 459–460

ER (entity-relationship) diagrams 409

errors 286–288

estimating amount of work 73–74

evaluation 21–22

exception eating 282–288

exception handling 285–286

handling errors right way 286–288

try/catch block 283–284

experimental scoping 60–74

experimentation 64–74

assessing complexity risk 66

estimating amount of work 73–74

scoping research phase, importance of 68–73

tracking phases 66–67

overview 61–62

research 62–64

experimental update meeting 102–103

experimentation 13–15, 64–74, 124–241

assessing complexity risk 66

choosing tech for platform and team 215–227

handling tuning from driver with SparkTrials 218–222

handling tuning from workers with pandas_udf 222–226

Spark 216–217

using new paradigms for teams 226–227

estimating amount of work 73–74

limitations on 108–116

CDD (chaos-driven development) 115

maintainability and extensibility 112–113

PDD (prayer-driven development) 114–115

RDD (resume-driven development) 115–116

TDD (test-driven development) or FDD (feature-driven development) 113

time limit 109–110

logging 229–236

MLflow tracking 230–232

printing and 232–234

version control, branch strategies, and working with others 234–236

planning 126–137

assigning testing 135–136

collecting metrics 136–137

reading API documentation 130–135

researching 126–130

possibilities, whittling down 190–196

evaluating prototypes properly 191–193

questions in planning session 193–196

preparation 137–156

moving from script to reusable code 146–153

performing data analysis 139–146

scalability 237–241

asynchronous concurrency 239–241

concurrency 239

scoping research phase 68–73

testing ideas 162–190

running quick forecasting tests 172–190

setting guidelines in code 163–172

tracking phases 66–67

tuning 199–214

Hyperopt primer 206–208

options 201–206

using Hyperopt to tune complex forecasting problem 208–214

explainable artificial intelligence (XAI) 460

ExponentialSmoothing() class 211

extensibility 112–113

F

fallbacks 447–452

cold-start woes 450–452

leaning heavily on prior art 448–450

FDD (feature-driven development) 113

feature drift 337–339

feature ignorance 338

features, monitoring 412–416

feature stores 441–442, 482–489

evaluating 489

reasons for 483–485

using 485–489

feedback drift 346–347

fit() method 178–179, 211

foldLeft operation 378

forecasting tests 172–190

creating validation dataset 173–174

rapid testing for ARIMA 184–186

rapid testing of Holt-Winters exponential smoothing algorithm 186–190

rapid testing of VAR model approach 175–182

FP-growth (frequent-pattern-growth) market-basket analysis algorithms 95

frameworks, generalization and 379–381

functionality 52

functions, benefits of 153

G

GANs (generative adversarial networks) 15

GDPR (General Data Protection Regulation) 407

generalization 379–381

_generate_boundaries() method 265

generate_hyperopt_report() function 213

generate_log_map_and_plot() function 279

global mutable objects 288–291

encapsulation to prevent mutable side effects 290–291

mutability 288–290

grid search 202–203

H

hacker mentality 368–370

hacking (cowboy development) 115

Hayashi, C. 27

Heisenbugs 288

high volume 504–507

HIPAA (Health Insurance Portability and Accountability Act) 407

Holt-Winters exponential smoothing algorithm 186–190

HSD (honestly significant difference) tests 327

HWES (Holt-Winters Exponential Smoothing) model 208

Hyperopt

overview 206–208

TPEs (tree-structured Parzen estimators) 204–205

tuning complex forecasting problem 208–214

Hyperopt Trials() object 217

hypothesis testing 308

I

IDE (integrated development environment) 17

if statements 292, 294

imperative scripted style 357–361

implementation, simplicity in 424–426

import statements 463

imposter syndrome 368

inference skew 440–441

integrated models 507–508

internal use

prediction serving architecture 497–499

testing 453–460

biased testing 456–457

dogfooding 457–458

SME evaluation 459–460

interpretability 460–469

Shapley additive explanations 461–463

approximate Shapley value estimation 461–463

foundation 461

how to use values from 463

using shap 463–466, 469

shap summary plot 466–467

waterfall plots 467–469

J

JIT (just-in-time) 263

K

Klein, Ewan 514

knowledge, curse of 52–53

Koskela, Lasse 113

L

label drift 339–341

lightweight scripted style 357–361

linear relationship algorithm 521–523

logging 229–236

MLflow tracking 230–232

printing and 232–234

version control, branch strategies, and working with others 234–236

log statements 233

Loper, Edward 514

low volume 504

LSTM (long short-term memory) 144

M

mad scientist developers 375–377

maintainability 112–113

Mandelbugs 288

manual tuning 201–202

Map object 375

Map type 375

maxlags parameter 179

metrics

categorical metrics 329–333

collecting 136–137

continuous metrics 319–323

scoring 308–310

microbatch streaming 502–503

microservice framework 498–499

ML (machine learning)

algorithmic complexity for 536–539

code smells 270–272

development 353–395

dangers of open source 390–392

elegant complexity 355–364

generalization and frameworks 379–381

optimizing too early 382–390

technology-driven development vs. solution-driven development 393–395

unintentional obfuscation 364–379

ML (machine learning) engineering 3–25

core tenets of 8–22

deployment 18–21

development 15–18

evaluation 21–22

experimentation 13–15

planning 8–10

scoping and research 10–12

data science and foundation of 35–37

goals of 24

reasons for 5–8

MLflow

model registry

artifact management 474–475

interfacing with 476–481

tracking 230–232

model life cycle 417–421

model measurement 300–333

leveraging A/B testing for attribution calculations 316–333

data collection 317–319

evaluating categorical metrics 329–333

evaluating continuous metrics 319–323

using alternative displays and tests 325–329

what not to do 319

measuring model attribution 302–316

clarifying correlation vs. causation 312–316

prediction performance 302–310

modularity for ML 245–268

debugging walls of text 255–257

designing modular ML code 257–264

monolithic scripts 248–255

considerations for 252–255

history of 249

walls of text 249–252

using test-driven development for ML 264–267

modulo function 521

monitoring

everything in model life cycle 417–421

features 412–416

monolithic scripts 248–255

considerations for 252–255

history of 249

walls of text 249–252

moving average (q) variable 184

mutable objects, global 288–291

encapsulation to prevent mutable side effects 290–291

mutability 288–290

MVP review 98–99, 106–107

mystic developers 370–372

N

naming conventions 273–274

Natural Language Processing with Python (Bird, Klein, and Loper) 514

NDCG (normalized discounted cumulative gain) metrics 45

nested logic 292–297

NLTK package 513

nonstationary time series 141

novel algorithm 115

O

O(1) complexity 519–521

O(n) complexity 521–523

O(n²) complexity 524–529

obfuscation 364–379

hacker mentality 368–370

mad scientist developers 375–377

mystic developers 370–372

safer bet approach 377–378

show-off type 373–375

troublesome coding habits 378–379

objective_function function 208

OLTP (online transaction processing) storage layer 441

open source 390–392

optimizing 382–390

overengineering 361–364

P

p (autoregressive parameters) variable 184

pandas_udf 222–226

paradigms 226–227

parallelism 239

partial autocorrelation test 154

passive retraining 352

PDD (prayer-driven development) 114–115

pdf (probability density function) 341, 345

personalization 47

phases, experimentation 66–67

PII (personally identifiable information) 407

planning 8–10, 38–60, 126–137

assigning testing 135–136

basic 47–53

analysis paralysis 53

assumption of business knowledge 50

assumption of data quality 50–52

assumption of functionality 52

knowledge, curse of 52–53

collecting metrics 136–137

experimentation by solution building 58–60

first meeting 53–56

for demos 56–57

reading API documentation 130–135

researching 126–130

phase of 129–130

quick visualization of dataset 127–129

session questions 193–196

data requirements 194

development cadence 195–196

existing code used for project 195

getting predictions to end users 195

inference running location 195

running frequency 193

running location for training 194

setting up code base 194

storing forecasts 194

plot_predictions() function 213

pmf (probability mass function) 341, 345

PoC (proof of concept) 70

polynomial relationship 524–529

post-experimentation phase 96–98

post-research phase discussion 95–96

prayer-driven development (PDD) 114–115

prediction drift 343–345

prediction performance 302, 308–310

prediction serving architecture 490–508

bulk external delivery 500–502

delivery consistency 500–501

quality assurance 502

determining serving needs 493–497

cost 494

recency 494–497

SLA 494

integrated models 507–508

internal use cases 497–499

serving from database or data warehouse 498

serving from microservice framework 498–499

microbatch streaming 502–503

real-time server-side 503–507

burst volume and high volume 504–507

low volume 504

preparation 137–156

moving from script to reusable code 146–153

functions, benefits of 153

importance of 154–156

performing data analysis 139–146

preproduction review 100, 107–108

printing, logging and 232–234

print statements 170, 232–233, 256, 416

prior art 448–450

probability density function (pdf) 341, 345

process over technology 442–445

production

infrastructure 471–509

artifact management 472–481

feature stores 482–489

prediction serving architecture 490–508

writing code 399–437

avoiding cargo cult ML (machine learning) behavior 432–437

guiding principles 401–412

monitoring everything in model life cycle 417–421

monitoring features 412–416

simplicity 421–426

wireframing ML (machine learning) projects 426–431

project-based meetings 89–93

project success 27–29

prototype culling 60

provenance of data 404–408

Q

q (moving average) variable 184

quadratic() method 523

quality assurance 502

quality testing 438–470

data consistency 439–446

dangers of data silo 445–446

feature stores 441–442

process over technology 442–445

training and inference skew 440–441

end user vs. internal use testing 453–460

biased testing 456–457

dogfooding 457–458

SME evaluation 459–460

fallbacks and cold starts 447–452

cold-start woes 450–452

leaning heavily on prior art 448–450

model interpretability 460–469

Shapley additive explanations 461–463

using shap 463–466, 469

R

random search 203–204

rapid testing

for ARIMA 184–186

of Holt-Winters exponential smoothing algorithm 186–190

of VAR model approach 175–182

RDBMS (relational database management system) 137

RDD (resilient distributed dataset) 223, 359

RDD (resume-driven development) 115–116

reality drift 346

real-time server-side 503–507

burst volume and high volume 504–507

low volume 504

recency 494–497

recurrent neural networks (RNNs) 144

regression problems 502

remove_bias value 210

REPL (read-eval-print loop) 128

researching 10–12, 126–130

experimental scoping 62–64

phase of 129–130

scoping phase, importance of 68–73

visualization of dataset 127–129

responding to drift 347–352

results, explaining 120–122

returns, diminishing 346–347

reusable code 146–153

functions, benefits of 153

importance of 154–156

rm -rf command 386

RMSE (root mean squared error) 44

RNNs (recurrent neural networks) 144

ROI (return on investment) 473

runtime performance 510–539

algorithmic complexity for ML (machine learning) 536–539

analyzing decision-tree complexity 531–536

Big O 510–516

complexity 519–529

O(1) 519–521

O(n) 521–523

O(n²) 524–529

overview 515–516

run_tuning() function 228

S

safer bet approach 377–378

scalability 237–241

asynchronous concurrency 239–241

concurrency 239

scoping 10–12

serving architecture, prediction 490–508

bulk external delivery 500–502

delivery consistency 500–501

quality assurance 502

determining serving needs 493–497

cost 494

recency 494–497

SLA 494

integrated models (edge deployment) 507–508

internal use cases 497–499

serving from database or data warehouse 498

serving from microservice framework 498–499

microbatch streaming 502–503

real-time server-side 503–507

burst volume and high volume 504–507

low volume 504

SGD (stochastic gradient descent) 536

shap 463–466, 469

shap summary plot 466–467

waterfall plots 467–469

Shapley additive explanations 461–463

approximate Shapley value estimation 461–463

foundation 461

how to use values from 463

shap package 461, 463, 466, 468–469, 475, 480

show-off type 373–375

Sidky, Ahmed 113

simplicity 421–426

in implementation 424–426

in problem definitions 423–424

simplicity, foundation of 29

singular value decomposition (SVD) model 58

SLA, determining serving needs 494

SMEs (subject-matter experts)

evaluation 459–460

review 103–104

working with 89

Smith, Greg 113

smoothing_level value 210

smoothing_seasonal value 210

solution building 58–60

solution-driven development 393–395

space complexity 515

spaghetti code 273

Spark 215–227

handling tuning from driver with SparkTrials 218–222

handling tuning from workers with pandas_udf 222–226

reasons for 216–217

using new paradigms for teams 226–227

SPC (statistical process control) rules 345

standardization 163

standard metrics 167–172

statement 386

String values 370

structure, coding 273–274

summary plots, shap 466–467

SVD (singular value decomposition) model 58

T

TDD (test-driven development) 113, 264–267

technology, process over 442–445

technology-driven development 393–395

Test Driven (Koskela) 113

testing ideas 162–190

assigning 135–136

running quick forecasting tests 172–190

creating validation dataset 173–174

rapid testing for ARIMA 184–186

rapid testing of Holt-Winters exponential smoothing algorithm 186–190

rapid testing of VAR model approach 175–182

setting guidelines in code 163–172

baseline comparison visualization 164–167

standard metrics 167–172

test statistic 141

time limit 109–110

TPEs (tree-structured Parzen estimators) 204–205

training, inference skew and 440–441

Trials() mode 225

Trials object 207, 220

trigger-once operation 494

try/catch block 283–284

tuning 199–214

handling from driver with SparkTrials 218–222

handling from workers with pandas_udf 222–226

Hyperopt primer 206–208

options 201–206

advanced techniques 205–206

grid search 202–203

manual tuning 201–202

random search 203–204

TPEs (tree-structured Parzen estimators) 204–205

using Hyperopt to tune complex forecasting problem 208–214

tuple unpacking 278–282

alternative to 280–282

example of 278–280

type 386

U

unsupervised problems 502

use_basin_hopping value 210

use_boxcox value 210

use_brute value 210

V

validation dataset 173–174

VAR model approach 175–182

VectorAssembler constructor 360

version control 234–236

visualization of dataset 127–129

VM (virtual machine) container 215

W

walls of text

debugging 255–257

monolithic scripts 249–252

waterfall plots, shap 467–469

wireframing ML projects 426–431

workers, handling tuning with pandas_udf 222–226

WoT (walls of text) 250

X

XAI (explainable artificial intelligence) 460
