for the circle problem, 464
80–20 rule, 83
Σ, in math, 30
accuracy
calculating, 62
fundamental limits of, 163
accuracy_score, 62
AdaBoostClassifier, 400, 403–406
additive model, 318
aggregation, 390
algorithms
analysis of, 72
genetic, 101
less important than data, 15
amoeba (StackExchange user), 465
analytic learning, 18
Anderson, Edgar, 56
ANOVA test, 463
area under the curve (AUC), 177–178, 182–193, 202
arguments, of a function, 362
arithmetic mean, 170, see also average
assumptions, 55, 270, 282, 286–287, 439
average
computing from confusion matrix, 170
simple, 30
average centered dot product, see covariance
background knowledge, 322, 331, 439
bag of global visual words (BoGVW), 483, 488–490
bag of visual words (BoVW), 481–483
bagged classifiers
creating, 394
implementing, 407
bagging
basic algorithm for, 395
bias-variance in, 396
BaggingRegressor, 407
base models
overfitting, 396
well-calibrated, 407
BaseEstimator, 311
baseline methods, 159–161, 189, 191
baseline values, 356
basketball players, 397
Bayes optimal classifier, 464
bias
in combined models, 390
number of, 148
bias-variance tradeoffs, 145–149, 154, 396
in decision trees, 249
in performance estimation, 382
big data, 71
Big-O analysis, 82
bigrams, 471
binary classification, 55, 174, 267
confusion matrix for, 164
binomials, 524
bivariate correlation, 415
black holes, models of, 467
body mass index (BMI), 322, 410–411
bootstrap aggregation, see bagging
Box, George, 69
C4.5, C5.0, CART algorithms, 244
calculated shortcut strategy, 100–101, 104
Calvinball game, 67
card games, 21
case-based reasoning, 18
categorical features, 5–7, 18, 346
categories, 332
Cauchy-Schwarz inequality, 463
causality, 233
Celsius, converting to Fahrenheit, 325–326
center, see mean
classification_report, 169–170
ClassifierMixin, 202
classifiers
evaluating, 70–71, 159–203, 238–239
smart, 189
on subsets of features, 494
coefficient of determination, 130
coin flipping, 21
and binomials, 524
collections, 20
combinations, 41
combinatorics, 423
complexity
cost of increasing, 12
trading off for errors, 125–126, 295–301
complexity analysis, 82
compression, 13
computational learning theory, 15
computer graphics, 82
computer memory, see memory
computer science, 362
confounding factors, 233
confusion matrix, 164, 171–178
computing averages from, 168, 170
constant linear model, 146
contrast coding, 356
conveyor belt, 377
convolutional neural network, 516
corpus, 472
corrcoef (NumPy), 416
correlation, 415–417, 423, 464
costs
comparing, for different models, 127
of predictions, 56
CountVectorizer, 473
covariance
between all pairs of features, 278
exploring graphically, 292
length-normalized, 463
not affected by data shifting, 451
covariance matrix (CM), 279–283, 451, 456
computing, 455
diagonal, 281
eigendecomposition of, 452, 456, 459
CRISP-DM process, 18
cross-validation (CV), 128–131
as a single learner, 230
comparing learners with, 154–155
extracting scores from, 192
feature engineering during, 323
minimum number of examples for, 152
with boosting, 403
wrapping methods inside, 370–372
cross_val_score, 130, 132, 137, 196, 207, 379
Cumulative Response curve, 189
using kernels with, 461
data
accuracy of, 15
big, 71
centering, 221, 322, 325, 445–447, 451, 457
cleaning, 323
collecting, 14
converting to tabular, 470
fuzzy towards the tails, 88
geometric view of, 410
incomplete, 16
making assumptions about, 270
modeling, 14
more important than algorithms, 15
noisiness of, 15
nonlinear, 285
preparing, 14
preprocessing, 341
reducing, 250–252, 324–325, 461
standardized, 105, 221–225, 231, 315–316, 447
synthetic, 117
total amount of variation in, 451
transforming, see feature engineering
datasets
applying learners to, 394
examples in, 5
features in, 5
finding relationships in, 445
missing values in, 322
poorly represented classes in, 133
reducing, 449
single, distribution from, 390
testing, see testing datasets
training, see training datasets
datasets.load_boston, 105, 234
datasets.load_breast_cancer, 84, 203
datasets.load_digits, 319
decision trees (DT), 239–249, 290–291, 464
bagged, 395
bias-variance tradeoffs in, 249
flexibility of, 313
prone to overfitting, 241
selecting features in, 325, 412
unique identifiers in, 241, 322
viewed as ensembles, 405
vs. random forests, 396
DecisionTreeClassifier, 247
deep neural networks, 481
democratic legislature, 388
dependent variables, see targets
deployment, 14
Descartes, René, 170
diabetes dataset, 85, 105, 322, 416
diagonal covariance matrix, 281
Diagonal Linear Discriminant Analysis (DLDA), 282–285, 292
diagrams, drawing, 245
Dietterich, Tom, 375
Dijkstra, Edsger, 54
dimensionality reduction
with PCA, 451
discontinuous target, 308
discriminant analysis (DA), 269–287, 290–292
distances
as weights, 90
sum product of, 275
total, 94
distributions
binomial, 524
from a single dataset, 390
random, 369
domain knowledge, see background knowledge
dot, 29–30, 38, 47–52, 245, 455
dot products, 29–30, 38, 47–52
advantages of, 43
and kernels, 438–441, 458–459, 461
average centered, see covariance
double cross strategy, 375
dual problem, solving, 459
dummy coding, see one-hot coding
dummy methods, see baseline methods
educated guesses, 71
eigendecomposition (EIGD), 452, 456, 458, 465–466
eigenvalues and eigenvectors, 456–457
Einstein, Albert, 124
ElasticNet, 318
empirical loss, 125
enterprises, competitive advantages of, 16
entropy, 464
enumerate, 494
errors
between predictions and reality, 350
in data collection process, 322
in measurements, 15, 142–143, 241
margin, 254
measuring, 33
negating, 207
positive, 33
sources of, 145
trading off for complexity, 125–126, 295–301
vs. residuals, 218
vs. score, 207
weighted, 399
estimated values, see predicted values
estimators, 66
Euclidean space, 466
evaluation
deterministic, 142
events
probability distribution of, 25–27
examples, 5
dependent vs. independent, 391
distance between, 63–64, 438–439
duplicating by weight, 399
grouping together, 479
learning from, 4
quantity of, 15
relationships between, 434
supporting, 252
tricky vs. bad, 144
execution time, see time
extract-transform-load (ETL), 323
extrapolation, 71
extreme gradient boosting, 406
F1 calculation, 170
f_classif, 422
factor analysis (FA), 466
factory machines
choosing knob values for, 115, 144, 156, 337
stringing together, 377
with a side tray, 65
Fahrenheit, converting to Celsius, 325–326
failures, in a legal system, 12
fair bet, 259
false negative rate (FNR), 164–166
false positive rate (FPR), 164–166, 173–181
Fawcett, Tom, 18
feature construction, 322, 341–350, 410–411
how to perform, 324
limitations of, 377
feature selection, 322, 324–325, 410–428, 449
by importance, 425
formal statistics for, 463
integrating with a pipeline, 426–428
modelless, 464
feature-and-split
feature-pairwise Gram matrix, 464
features, 5
causing targets, 233
conditionally independent, 69
counterproductive, 322
covariant, 270
different, 63
numerical, 6–7, 18, 225, 343–344, 346
relationships between, 417
sets of, 423
standardizing, 85
Fisher’s Iris Dataset, see iris dataset
Fisher, Sir Ronald, 56
fit, 224–225, 337, 363, 367–368, 371–372, 379, 381
fit-estimators, 66
fit_intercept, 340
flash cards, 398
flashlights, messaging with, 417–418
flat surface, see planes
flipping coins, 21
and binomials, 524
fmin, 500
folds, 128
forward stepwise selection, 463
fromiter (NumPy), 494
full joint distribution, 148
functions
parameters of
vs. arguments, 362
functools, 20
fundraising campaign, 189
future, predicting, 7
fuzzy specialist scenario, 405
gain curve, see Lift Versus Random curve
games
expected value of, 32
fair, 259
sets of rules for, 67
Gaussian Naive Bayes (GNB), 82, 282–287
genetic algorithms, 101
geometric mean, 170
get_support, 413
Ghostbusters, 218
Galton regression, 7
global visual words, 483, 487–490
good old-fashioned (GOF) linear regression, 300–301, 519–521
and complex problems, 307
gradient descent (GD), 101, 292
GradientBoostingClassifier, 400, 403–406
Gram matrix, 464
graphics processing units (GPUs), 71, 82
greediness, for feature selection, 423–424
Hamming distance, 63
Hand and Till M method, 183–185, 197, 200, 202
harmonic mean, 170
Hettinger, Raymond, 54
hist, 22
histogram, 21
hold-out test set (HOT), 114–115
hyperparameters
adjusting, 116
choosing, 359
cross-validation for, 371–377, 380–382
for tradeoffs between complexity and errors, 126
overfitting, 370
random combinations of, 368–370
hyperplanes, 39
IBM, 3
ID3 algorithm, 244
identification variables, 241, 322, 324
illusory correlations, 233
images
classification of, 9
import, 19
in-sample evaluation, 60
independence, 23
independence assumptions, 148
independent component analysis (ICA), 466
independent variables, see features
indicator function, 243
inductive logic programming, 18
infinity-norm, 367
information gain, 325
information theory, 417
input features, 7
inputs, see features
intercept
avoiding, 356
International System of Units (SI), 73
iris dataset, 56–58, 60–61, 82, 133, 166–168, 174, 190–195, 242, 245, 329–332, 336, 480, 495
IsoMap, 462
iteratively reweighted least squares (IRLS), 291
jackknife resampling, 157
Jupyter notebooks, 19
k-fold Cross-Validation (CV), 129–131
with repeated train-test splits, 137
k-Means Clustering (k-MC), 479–481
k-Nearest Neighbors (k-NN), 64–67
algorithm of, 63
bias-variance for, 145
combining values from, 64
for nonlinear data, 285
performance of, 74–76, 78–81, 429–430
picking the best k, 113, 116, 154, 363–365
k-Nearest Neighbors classification (k-NN-C), 64
k-Nearest Neighbors regression (k-NN-R), 87–91
comparing to linear regression, 102–104, 147, 229
evaluating, 221
vs. piecewise constant regression, 310
Kaggle website, 406
Keras, 82
kernel matrix, 438
kernel methods, 458
learners used with, 438
mock-up, 437
kernels
and dot products, 438–441, 458–459, 461
approximate vs. full, 436
feature construction with, 428–445
KNeighborsClassifier, 66, 362–363
KNeighborsRegressor, 91
Knuth, Donald, 83
kurtosis, 466
L1 regularization, see lasso regression
L2 regularization, see ridge regression
Lasso, 300
lasso regression (L1), 300, 307
blending with ridge regression, 318
selecting features in, 325, 411, 424
learning algorithms, 8
in sklearn, 157
learning methods
incremental/decremental, 130
nonparametric, 65
parameters of, 115
requiring normalization, 221
learning models, see also models
choosing, 81
combining multiple, see ensembles
performance of, 102
overestimating, 109
tolerating mistakes in data, 16
used with kernel methods, 438
least-squares fitting, 101
leave-one-out cross-validation (LOOCV), 140–142
length-normalized covariance, 463
length-normalized dot product, 462–463
Levenshtein distance, 464
Lift Versus Random curve, 189, 193
limited capacity, 109–110, 117
limited resources, 187
linalg.svd (NumPy), 455
line magic, 75
linear combination, 28
Linear Discriminant Analysis (LDA), 282–285, 495
linear regression (LR), 91–97, 305
bias of, 350
calculating predicted values with, 97, 265
comparing to k-NN-R, 102–104, 229
default metric for, 209
example of, 118
for nonlinear data, 285
good old-fashioned (GOF), 300–301, 307, 519–521
graphical presentation of, 504
performing, 97
selecting features in, 425
using standardized data for, 105
viewed as ensembles, 405
linear relationships, 415, 417
linearity, 285
LinearRegression, 371
lines
between classes, 250
drawing through points, 92, 237–238
finding the best, 98–101, 253, 268–269, 350, 410, 448–449, 457, 465
piecewise, 313
straight, 91
limited capacity of, 122
logistic regression (LogReg), 259–269, 287, 290–292
and loss, 526
calculating predicted values with, 265
for nonlinear data, 285
kernelized, 436
performance of, 429
solving perfectly separable classification problems with, 268–269
logreg_loss_01, 507
lookup tables, 13
loss functions
defining, 501
minimizing, 526
M method, 183–185, 197, 200, 202
machine learning
and math, 19–20
definition of, 4
limits of, 15
running on GPUs, 82
macro, 168
macro precision, 168
magical_minimum_finder, 500–511
Mann-Whitney U statistic, 202
margin errors, 254
mathematics
1-based indexing in, 54
Σ notation, 30
derivatives, 526
eigenvalues and eigenvectors, 456–457
optimization, 500
parameters, 318
matrices, 456
breaking down, 457
decomposition (factorization), 452, 455
identity, 465
squaring, 466
transposing, 465
Matrix, The, 67
max_depth, 242
maximum margin separator, 252
mean
arithmetic, 170, see also average
definition of, 88
empirical, 457
for two variables, multiplying, 271
geometric, 170
harmonic, 170
multiple, for each train-test split, 231
mean absolute error (MAE), 209
mean squared error (MSE), 91, 101, 130, 209
measurements
accuracy of, 27
critical, 16
levels of, 18
overlapping, 410
median
computing on training data, 349
definition of, 88
predicting, 205
median absolute error, 209
medical diagnosis, 10
assessing correctness of, 11–12
for rare diseases, 160, 163, 178
memory
constraints of, 325
cost of, 71
relating to input size, 72
shared between programs, 76–77
testing usage of, 77–81, 102–104
memory_profiler, 78
merge, 334
methods
chaining, 166
metrics.accuracy_score, 62
metrics.mean_squared_error, 91
metrics.SCORERS.keys(), 161–162, 208
micro, 168
Minkowski distance, 63, 82, 367
MinMaxScaler, 327
mistakes, see errors
Mitchell, Tom, 18
Moby Dick, 13
mode value, 446
models
additive, 318
building, 14
comparing, 14
concrete, 371
features working well with, 423–426, 464
fitting, 359–361, 363, 367, 370
fully defined, 371
not modifying the internal state of, 8, 361
performance of, 423
Monte Carlo, see randomness
Monte Carlo cross-validation, see repeated train-test splitting (RTTS)
Morse code, 417
multiclass learners, 179–185, 195–201
mutual information, 418–423, 464
minimizing, 466
Naive Bayes (NB)
bias-variance for, 148
in text classification, 69
performance of, 74–76, 78–81, 191
natural language processing (NLP), 9
nearest neighbors, see k-Nearest Neighbors
Nearest Shrunken Centroids (NSC), 292
NearestCentroids, 292
nested cross-validation, 157, 370–377
Netflix, 117
newsgroups, 476
Newton’s Method, 292
No Free Lunch Theorem, 290
noise
eliminating, 144
manipulating, 117
non-normality, 350
nonic, 120
nonlinearity, 285
nonparametric learning methods, 65
nonprimitive events, see compound events
normal distribution, 27, 520–524
normal equations, 101
normalization, 221, 322, 356, 474–476
Normalizer, 475
np_array_fromiter, 491–492, 494–495
np_cartesian_product, 41
numbers
binary vs. decimal, 53
numerical features, 6–7, 18, 225, 343–344, 346
NumPy, 20
np.corrcoef, 416
floating-point numbers in, 52–53
np.fromiter, 494
np.histogram, 21
np.linalg.svd, 455
np.polyfit, 119
np.random.randint, 21
np.searchsorted, 310
Nystroem kernel, 436
odds
one-hot coding, 333–341, 347, 356, 526
one-versus-all (OvA), 169
one-versus-one (OvO), 181–182, 253
one-versus-rest (OvR), 168, 179–182, 253, 267
OneHotEncoder, 333
OpenCV library, 485
optimization, 156, 497–500, 526
premature, 83
ordinal regression, 18
outcome, outputs, see targets
and resampling, 128
overfitting, 117, 122–126, 290, 296
of base models, 396
pairplot, 86
pandas, 20
DataFrame, 323
parabolas, 45
piecewise, 313
parameters, 115
adjusting, 116
choosing, 359
in computer science vs. math, 318
shuffling, 368
tuning, 362
vs. arguments, 362
Pareto principle, 83
partitions, 242
patsy
connecting sklearn and, 347–348
documentation for, 356
PayPal, 189
peeking, 225
penalization, see complexity
percentile, 206
performance, 102
estimating, 382
measuring, 74–76, 78–81, 173, 178
overestimating, 109
physical laws, 17
piecewise constant regression, 309–313, 318
implementing, 310
preprocessing inputs in, 341
vs. k-NN-R, 310
PiecewiseConstantRegression, 313
pipelines
integrating feature selection with, 426–428
playing cards, 21
plus-one trick, 38, 43–45, 336, 521
polyfit, 119
polynomial kernel, 253
polynomials
quadratic, 45
precision, 165
macro, 168
tradeoffs between recall and, 168, 170–173, 185–187, 202
precision-recall curve (PRC), 185–187, 202
predict, 224–225, 379, 490–491
predictions, 165
flipping, 202
probability of, 170
real-world cost of, 56
predictive features, 7
predictive residuals, 219
predictors, see features
premature optimization, 83
presumption of innocence, 12
prime factorization, 452
principal components analysis (PCA), 445–462, 465–466
feature engineering in, 324
using dot products, 458–459, 461
probabilistic graphical models (PGMs), 516–525
and linear regression, 519–523
and logistic regression, 523–525
probabilistic principal components analysis (PPCA), 466
probability
of primitive events, 22
processing time, see time
programs
bottlenecks in, 83
Provost, Foster, 18
purchasing behavior, predicting, 11
pydotplus, 245
Pythagorean theorem, 63
Python
list comprehension in, 136
memory management in, 77
using modules in the book, 20
Quadratic Discriminant Analysis (QDA), 282–285
quadratic polynomials, see parabolas
quantile, 206
R²
for mean model, 229
misusing, 130
randint, 369
random forests (RFs)
comparing, 403
selecting features in, 425
random guess strategy, 98–99, 101
random sampling, 325
random.randint, 21
RandomForestClassifier, 425
RandomizedSearchCV, 369
randomness, 16
affecting data, 143
for feature selection, 423
inherent in decisions, 241
pseudo-random, 139
to generate train-test splits, 133, 138–139
rbf, 467
reality, 165
comparing to predictions, 215–217
recall, 165
tradeoffs between precision and, 168, 170–173, 185–187, 202
Receiver Operating Characteristic (ROC) curves, 172–181, 192, 202
and multiclass problem, 179–181
area under, 177–178, 182–193, 202
recentering, see data, centering
rectangles
areas of, 275
overlapping, 243
recursive feature elimination, 425–426
regression
definition of, 85
ordinal, 18
RegressorMixin, 311
regressors
default metric for, 209
performance of, 317
scoring for, 130
regularized linear regression, 296–301, 305
reinforcement learning, 18
repeated train-test splitting (RTTS), 133–139, 156
resampling
with replacement, 157, 391–392
without replacement, 391
rescaling, see scaling, standardizing
reshape, 333
residuals
predictive, 219
Studentized, 232
resources
limited, 187
needed by an algorithm, 72
utilization in regression, 102–104
RFE, 425
Ridge, 300
ridge regression (L2), 300, 307
blending with lasso regression, 318
root mean squared error (RMSE), 101
calculating, 119
comparing regressors on, 315
high, 142
size of values in, 136
rvs, 369
sampling, see resampling
scaling
statistical, 326
scipy.stats, 369
scores
extracting from CV classifiers, 192
for each class, 181
vs. loss, 207
scoring function, 184
Seaborn, 20
pairplot, 86
tsplot, 151
searchsorted, 310
SelectPercentile, 422
shrinkage, see complexity
shuffle, 368
SIFT_create, 485
signed area, 275
simple average, 30
simplicity, 124
singular value decomposition (SVD), 452, 465–466
sklearn
baseline models in, 205
boosters in, 400
classification metrics in, 161–163, 208–209
classifiers in, 202
common interface of, 379
confusion matrix in, 173
consistency of, 225
cross-validation in, 129–130, 132, 184
custom models in, 311
distance calculators in, 64
documentation of, 368
feature correlation in, 416–417
feature evaluation in, 463
feature selection in, 425
learners in, 318
linear regression in, 300, 310
logistic regression in, 267
naming conventions in, 207, 362
normalization in, 356
plotting learning curves in, 157
sparse-aware methods in, 356
storing data in, 333
SVC in, 253
SVR in, 307
terminology of, 61, 66, 127, 160
text representation in, 471–479, 494
thresholds in, 176
using alternative systems instead, 119
using OvR, 253
skpre.Normalizer, 495
Skynet, 389
smart step strategy, 99–101, 267
smoothness, 308, 406, see also complexity, regularization
sns.pairplot, 58
softmax function, 526
sorted lists, 465
splines, 318
spread, see standard deviation
square root of the sum of squared errors, 93
squared error loss, 301
squared error points, 209
ss.geom, 369
ss.norm, 369
ss.uniform, 369
StackExchange, 465
stacking, 390
StackOverflow, 292
standard deviation, 54, 85, 221, 327
standardization, 85, 105, 221–225, 231, 327
StandardScaler, 223–225, 326–327
stationary learning tasks, 16
statistics, 87
coefficient of determination, 130, 209
distribution of the mean, 391
dummy coding, 334
for feature selection, 463
Studentized residuals, 232
variation in data, 451
statsmodels
documentation for, 356
Stochastic Gradient Descent (SGD), 267
stocks
choosing action for, 9
predicting pricing for, 11
storage space
measuring, 72
student performance, 195–201, 203, 225–226
comparing regressors on, 314–317
predicting, 10
Studentized residuals, 232
studying for a test, 109, 116–117
sum of probabilities of events
all primitive, 22
independent, 23
sum of squared errors (SSE), 33–34, 93–94, 210–212, 271, 301
smallest, 100
sum product, 30
summary statistic, 87
supervised learning from examples, 4, 9–11
Support Vector Classifiers (SVCs), 252–259, 290–291, 301, 442
boundary in, 252
computing, 291
maximum margin separator in, 305
performance of, 429
Support Vector Machines (SVMs), 252, 291, 442, 465
feature engineering in, 324
vs. the polynomial kernel, 437
Support Vector Regression (SVR), 301–307
main options for, 307
supporting examples, 252
T-distributed Stochastic Neighbor Embedding (TSNE), 462
t-test, 463
tabular data, 470
targets
cooperative values of, 296
discontinuous, 308
predicting, 397
task understanding, 14
teaching to the test, 59–60, 114
protecting against, 110–111, 372, 377
TensorFlow, 82
term frequency-inverse document frequency (TF-IDF), 475–477, 495
testing datasets, 60–61, 110, 114
predicting on, 66
resampling, 128
testing phase, see assessment, selection
tests
positive vs. negative, 163–166
specificity of, 165
text
classification of, 69
representing as table rows, 470–471
TfidfVectorizer, 475, 478, 495
Theano, 82
time
constraints of, 325
relating to input size, 72
time series, plotting, 151
Tolkien, J. R. R., 290
total distance, 94
tradeoffs, 13
between bias and variance, see bias-variance tradeoffs
between complexity and errors, 126
between false positives and negatives, 172
between precision and recall, 168, 170–173
train-test splits, 60, 110, 115
for cross-validation, 132
multiple, 128
randomly selected, 370
train_test_split, 60, 70–71, 79, 349
training datasets, 60–61, 110, 114
duplicating examples by weight in, 399
fitting estimators on, 66
randomly selected, 370
resampling, 128
unique identifiers in, 241, 322
training error, 60
training phase, 113
treatment coding, see one-hot coding
tree-building algorithms, 244
trigrams, 471
true negative rate (TNR), 164–166
true positive rate (TPR), 164–166, 173–181
Trust Region Newton’s Method, 292
tsplot, 151
Twenty Newsgroups dataset, 476
two-humped camel, see data, multimodal
unaccounted-for differences, 350
underfitting, 117, 122–125, 296
unigrams, 471
unique identifiers, 241, 322, 324
univariate feature selection, 415
unsupervised activities, 445
validation, 110, 156, see also cross-validation
validation sets (ValS), 114
randomly selected, 370
size of, 115
values
accuracy of, 15
actual, 33
baseline, 356
definition of, 5
explicit, vs. function parameters, 360–361
cooperative, 296
transforming, 350
under- vs. overestimating, 33
variance
always positive, 272
not affected by data shifting, 451
VarianceThreshold, 413
vectorizers, 495
verification, 156
vocabularies, 482
global, 487
votes, weighted, 390
VotingClassifier, 407
warp functions, 440
weighted
errors, 399
votes, 390
weights
distributions of, 524
pairs of, 524
total size of, 297
whuber (StackOverflow user), 292
wine dataset, 412–414, 426–428, 449
Wittgenstein, Ludwig, 18
words
adjacent, 471
in a document, 471
visual, 491
World War II, 172
xgboost, 406
z-scoring, see standardizing
zip, 30