for training instances,
137f
Kernel logistic regression,
261
computational expense,
259
computational simplicity,
259
Kernels, conditional probability models using,
402–403
K-nearest-neighbor method,
85
Knowledge Flow interface
components configuration and connection,
566–567
evaluation components,
566
incremental learning,
567
Knowledge representation,
91
L
Labor negotiations example,
16–18
Language identification,
516
Large item sets, finding with association rules,
240–241
Latent Dirichlet allocation (LDA),
379–381
Latent semantic analysis (LSA),
376–377
LatentSemanticAnalysis method,
376–377
LaTeX typesetting system,
571
Lattice-structured models,
408
Law of diminishing returns,
507
Lazy classifiers, in Weka,
563
Learning
in performance situations,
21
testing,
Learning Bayesian networks,
344–347
Least Absolute Shrinkage and Selection Operator (LASSO),
394
Least-squares linear regression,
70,
129
Leave-one-out cross-validation,
169
Linear chain conditional random fields,
408–409
Linear classification
in binary classification problems,
69
local, numeric prediction with,
273–284
nonlinear class boundaries,
254–256
stochastic gradient descent,
270–272
support vector machine use,
252
Linear discriminant analysis,
310
matrix vector formulations,
394–395
Linear threshold unit,
159
LinearForwardSelection method,
334
LinearRegression algorithm,
160
Locally weighted linear regression,
281–283
distance-based weighting schemes,
282
in nonlinear function approximation,
282
LogitBoost algorithm,
492,
493
Long short-term memory (LSTM),
457,
458
Loss functions
M
Machine learning
applications,
in diagnosis applications,
25
Manufacturing process applications,
27–28
Marketing and sales,
26–27
market basket analysis,
26–27
Marginal log-likelihood for PPCA,
374
Marginal probabilities,
387
Markov chain Monte Carlo methods,
368–369
Massive Online Analysis (MOA),
512
Max-product algorithms,
391
Maximum likelihood estimation,
338–339
Maximum a posteriori parameter estimation,
339
Mean-absolute errors,
195
Metadata
relations among attributes,
513
Metalearning algorithms, in Weka,
563
Metropolis–Hastings algorithm,
368–369
Mini-batch-based stochastic gradient descent,
433–434
Minimum description length (MDL) principle,
179,
197–200
probability theory and,
199
MIOptimalBall algorithm,
478
Missing values
machine learning schemes and,
63
Mixed-attribute problems,
11
Mixed National Institute of Standards and Technology (MNIST),
421–422,
421t
Mixture of Gaussians
expectation maximization algorithm,
353–356
of principal component analyzers,
360–361
Model trees
with nominal attributes,
279f
smoothing calculation,
274
Multiclass classification problem,
396
MultiClassClassifier algorithm,
334
Multiclass logistic regression,
396–400
Multiclass prediction,
181
Multi-instance learning
aggregating the input,
157
converting to single-instance learning,
472–474
nearest-neighbor learning adaptation to,
475
upgrading learning algorithms,
475
Multi-instance problems,
53
converting to single-instance problem,
157
Multilabeled instances,
45
Multilayer perceptrons
datasets corresponding to,
262f
as feed-forward networks,
269
MultilayerPerceptron algorithm,
284
Multinomial logistic regression,
396
Multinomial Naïve Bayes,
103
error-correcting output codes,
324–326
pairwise classification,
323
Multiple linear regression,
491
Multiresponse linear regression,
129
Multistage decision property,
110
N
Naïve Bayes
for document classification,
103–104
independent attributes assumption,
469
NaiveBayes algorithm,
160
NaiveBayesMultinomial algorithm,
160
NaiveBayesUpdateable algorithm,
566–567
Nearest-neighbor classification,
85
Nearest-neighbor learning,
475
Hausdorff distance variants and,
477
multi-instance data adaptation,
475
Neuron’s receptive field,
440
N-fold cross-validation,
169
Noise,
Nonlinear class boundaries,
254–256
Nonparametric density models for classification,
362–363
Normal distribution
Numeric attributes
classification rules,
224
converting discrete attributes to,
303
normal-distribution assumption for,
105
Numeric prediction,
16,
44
outcome as numeric value,
46
support vector machine algorithms for,
256
Numeric prediction (local linear models),
273–284
locally weighted linear regression,
281–283
rules from model trees,
281
Numeric-attribute problems,
11
O
One-tailed probability,
166
Option trees
as alternating decision trees,
495,
495f
decision trees versus,
494
Order-independent rules,
119
Ordered classes, predictions for,
402
“Ordered logit” models,
402
Ordinal attributes,
55–56
Orthogonal coordinate systems,
305
Output
instance-based representation,
84–87
knowledge representation,
91
forward stagewise additive regression and,
491–492
Overfitting-avoidance bias,
35
P
Pair-adjacent violators (PAV) algorithm,
330
Pairwise classification,
323
Parametric density models for classification,
362–363
Partial decision trees
expansion algorithm,
229f
Partial least squares regression,
307–309
Partitioning
Parzen window density estimation,
361
Perceptron learning rule,
132
instance presentation to,
133
linear classification using,
131–133
Performance
classifier, predicting,
165
instance-based learning,
246
Personal information use,
37–38
PKIDiscretize filter,
334
matrix vector formulations,
394–395
Posterior distribution,
337
Posterior predictive distribution,
367–368
Prediction
Pretraining deep autoencoders with RBMs,
448
Principal component analysis
for dimensionality reduction,
377–378
principal components,
306
Principal components regression,
307
PrincipalComponents filter,
334
Principle of multiple explanations,
200
Probabilistic inference methods,
368–370
probability propagation,
368
sampling, simulated annealing, and iterated conditional modes,
368–370
variational inference,
370
Probabilistic methods,
336
Bayesian estimation and prediction,
367–370
clustering and probability density estimation,
352–363
conditional probability models,
392–403
maximum likelihood estimation,
338–339
maximum a posteriori parameter estimation,
339
sequential and temporal models,
403–410
software packages and implementations,
414–415
Probabilistic principal component analysis (PPCA)
expected gradient for,
375
expected log-likelihood for,
374
marginal log-likelihood for,
374
Probabilities
probability density function relationship,
177
Probability density estimation,
352–363
comparing parametric, semiparametric and nonparametric density models,
362–363
expectation maximization algorithm,
353–356
two-class mixture model,
354f
Probability density functions,
102
Probability estimates,
340
Probability propagation,
368
Programming by demonstration,
528
Projections
Fisher’s linear discriminant analysis,
311–312
independent component analysis,
309–310
linear discriminant analysis,
310
quadratic discriminant analysis,
310–311
“Proportional odds” models,
402
Proportional k-interval discretization,
297–298
Pruning
example illustration,
216f
incremental reduced-error,
225,
226f
Q
Quadratic discriminant analysis,
310–311
R
Radial basis function (RBF),
270
RandomCommittee algorithm,
501
RandomForest algorithm,
501
RandomSubSpace algorithm,
501
RBFNetwork algorithm,
284
RBMs, pretraining deep autoencoders with,
448
Recall-precision curves,
190
area under the precision-recall curve,
192
Reconstructive learning,
449
Rectangular generalizations,
86–87
Rectified linear units (ReLUs),
424–425
Recurrent neural networks
deep encoder-decoder recurrent network,
460f
exploding and vanishing gradients,
457–459
recurrent network architectures,
459–460
Recursive feature elimination,
290–291
Reduced-error pruning,
225,
269
Reference distribution,
321
principal components,
307
Linear regression equation,
16
Relation-valued attributes,
59
Relative absolute errors,
196
RELIEF (Recursive Elimination of Features),
331
Replicated subtree problem,
76
decision tree illustration,
77f
Representation learning techniques,
418
Restricted Boltzmann machines (RBMs),
451–452
Resubstitution errors,
163
ROC curves
from different learning schemes,
189
generating with cross-validation,
189
for two learning schemes,
189f
RotationForest algorithm,
501
Rule sets
model trees for generating,
281
Rules
computer-generated,
19–21
decision lists versus,
119
PRISM method for constructing,
118–119
S
“Scaled” kernel function,
361
Scheme-independent attribute selection,
289–292
instance-based learning methods,
291
recursive feature elimination,
290–291
Scheme-specific attribute selection,
293–295
selective Naïve Bayes,
295
Scientific applications,
28
Search, generalization as,
31–35
Search engines, in web mining,
21–22
Search methods (Weka),
413,
564
Second-order analysis,
435
Selective Naïve Bayes,
295
Semantic relationship,
513
Semiparametric density models for classification,
362–363
Semisupervised learning
clustering for classification,
468–470
Separate-and-conquer algorithms,
119,
289
Sequential and temporal models,
403–410
SimpleCart algorithm,
242
SimpleKMeans algorithm,
160
SimpleLinearRegression algorithm,
160
Simple probabilistic modeling,
96–105
Single-attribute evaluators,
564
Single-consequent rules,
126
Single-linkage clustering algorithm,
147,
149
Smoothing calculation,
274
Soybean classification example,
19–21
Standard deviation from the mean,
166
Standardizing statistical variables,
61
Statistical clustering,
296
Statistical modeling,
406
Statistics, machine learning and,
30–31
Stochastic backpropagation,
268–269
Stochastic deep networks
categorical and continuous variables,
452–453
contrastive divergence,
452
restricted Boltzmann machines,
451–452
Stochastic gradient descent,
270–272
Stratified threefold cross-validation,
168
StringToWordVector filter,
290,
563
Structural descriptions,
6–7
decision trees,
learning techniques,
Structure learning by conditional independence tests,
349
“Structured prediction” techniques,
407–408
Student’s distribution with k-1 degrees of freedom,
173–174
Success rate, error rate and,
215–216
Sum-product algorithms
marginal probabilities,
387
probable explanation example,
390
Super-parent one-dependence estimator,
348–349
Supervised discretization,
297,
332
Support, of association rules,
79,
120
Support vector machines (SVMs),
252,
403,
471
linear regression differences,
256–257
Synthetic transformations,
437
T
Tables
as knowledge representation,
68
Tabular input format,
127
Tenfold cross-validation,
169
Text mining
conditional random fields for,
410
document classification,
516
Threefold cross-validation,
168
3-point average recall,
191
“Time-homogeneous” models,
405
Top-down induction, of decision trees,
221
support vector machines,
261
Tree-augmented Naïve Bayes (TAN),
348
Two-class mixture model,
354f
Typographic errors,
63–64
U
Ubiquitous computing,
527
Unsupervised discretization,
297–298
Unsupervised pretraining,
437
User Classifier (Weka),
72
V
Variables, standardizing,
61
Variational inference,
370
Variational parameters,
370
Venn diagrams, in cluster representation,
87–88
Visualization, in Weka,
562
W
Weather problem example,
10–12
alternating decision tree,
495f
attributes evaluation,
94t
counts and probabilities,
97t
data with numeric class,
47t
expanded tree stumps,
108f
identification codes,
111t
multi-instance ARFF file,
60f
numeric data with summary statistics,
101t
Weighting attributes
Weights
determination process,
16
Weka workbench
components configuration and connection,
566–567
evaluation components,
566
incremental learning,
567
ISO-8601 date/time format,
59
metalearning algorithms,
563
User Classifier facility,
72
visualization components,
565,
567
Winnow algorithm
linear classification with,
133–135
versions illustration,
134f
X
XML (eXtensible Markup Language),
57,
568
Z