List of Figures

Figure 1.1 Rules for the contact lens data. 13
Figure 1.2 Decision tree for the contact lens data. 14
Figure 1.3 Decision trees for the labor negotiations data. 18
Figure 1.4 Life cycle of a data mining project. 29
Figure 2.1 A family tree and two ways of expressing the sister-of relation. 48
Figure 2.2 ARFF file for the weather data. 58
Figure 2.3 Multi-instance ARFF file for the weather data. 60
Figure 3.1 A linear regression function for the CPU performance data. 69
Figure 3.2 A linear decision boundary separating Iris setosas from Iris versicolors. 70
Figure 3.3 Constructing a decision tree interactively: (A) creating a rectangular test involving petallength and petalwidth; (B) the resulting (unfinished) decision tree. 73
Figure 3.4 Models for the CPU performance data: (A) linear regression; (B) regression tree; (C) model tree. 74
Figure 3.5 Decision tree for a simple disjunction. 76
Figure 3.6 The exclusive-or problem. 77
Figure 3.7 Decision tree with a replicated subtree. 77
Figure 3.8 Rules for the iris data. 81
Figure 3.9 The shapes problem. 82
Figure 3.10 Different ways of partitioning the instance space. 86
Figure 3.11 Different ways of representing clusters. 88
Figure 4.1 Pseudocode for 1R. 93
Figure 4.2 Tree stumps for the weather data. 106
Figure 4.3 Expanded tree stumps for the weather data. 108
Figure 4.4 Decision tree for the weather data. 109
Figure 4.5 Tree stump for the ID code attribute. 111
Figure 4.6 Covering algorithm: (A) covering the instances; (B) decision tree for the same problem. 113
Figure 4.7 The instance space during operation of a covering algorithm. 115
Figure 4.8 Pseudocode for a basic rule learner. 118
Figure 4.9 (A) Finding all item sets with sufficient coverage; (B) finding all sufficiently accurate association rules for a k-item set. 127
Figure 4.10 Logistic regression: (A) the logit transform; (B) example logistic regression function. 130
Figure 4.11 The perceptron: (A) learning rule; (B) representation as a neural network. 132
Figure 4.12 The Winnow algorithm: (A) unbalanced version; (B) balanced version. 134
Figure 4.13 A kD-tree for four training instances: (A) the tree; (B) instances and splits. 137
Figure 4.14 Using a kD-tree to find the nearest neighbor of the star. 137
Figure 4.15 Ball tree for 16 training instances: (A) instances and balls; (B) the tree. 139
Figure 4.16 Ruling out an entire ball (gray) based on a target point (star) and its current nearest neighbor. 140
Figure 4.17 Iterative distance-based clustering. 143
Figure 4.18 A ball tree: (A) two cluster centers and their dividing line; (B) corresponding tree. 145
Figure 4.19 Hierarchical clustering displays. 149
Figure 4.20 Clustering the weather data. 151
Figure 4.21 Hierarchical clusterings of the iris data. 153
Figure 5.1 A hypothetical lift chart. 185
Figure 5.2 Analyzing the expected benefit of a mailing campaign when the cost of mailing is (A) $0.50 and (B) $0.80. 187
Figure 5.3 A sample ROC curve. 188
Figure 5.4 ROC curves for two learning schemes. 189
Figure 5.5 Effect of varying the probability threshold: (A) error curve; (B) cost curve. 193
Figure 6.1 Example of subtree raising, where node C is “raised” to subsume node B. 214
Figure 6.2 Pruning the labor negotiations decision tree. 216
Figure 6.3 Algorithm for forming rules by incremental reduced-error pruning. 226
Figure 6.4 RIPPER: (A) algorithm for rule learning; (B) meaning of symbols. 228
Figure 6.5 Algorithm for expanding examples into a partial tree. 229
Figure 6.6 Example of building a partial tree. 230
Figure 6.7 Rules with exceptions for the iris data. 232
Figure 6.8 Extended prefix trees for the weather data: (A) the full data; (B) the data conditional on temperature=mild; (C) the data conditional on humidity=normal. 238
Figure 7.1 A boundary between two rectangular classes. 249
Figure 7.2 A maximum margin hyperplane. 253
Figure 7.3 Support vector regression: (A) ε=1; (B) ε=2; (C) ε=0.5. 257
Figure 7.4 Example data sets and corresponding perceptrons. 262
Figure 7.5 Step vs sigmoid: (A) step function; (B) sigmoid function. 264
Figure 7.6 Gradient descent using the error function w² + 1. 265
Figure 7.7 Multilayer perceptron with a hidden layer (omitting bias inputs). 267
Figure 7.8 Hinge, squared and 0–1 loss functions. 271
Figure 7.9 Pseudocode for model tree induction. 278
Figure 7.10 Model tree for a data set with nominal attributes. 279
Figure 8.1 Attribute space for the weather dataset. 292
Figure 8.2 Discretizing the temperature attribute using the entropy method. 299
Figure 8.3 The result of discretizing the temperature attribute. 299
Figure 8.4 Class distribution for a two-class, two-attribute problem. 302
Figure 8.5 Principal component transform of a dataset: (A) variance of each component; (B) variance plot. 306
Figure 8.6 Comparing principal component analysis and Fisher’s linear discriminant analysis. 312
Figure 8.7 Number of international phone calls from Belgium, 1950–1973. 318
Figure 8.8 Overoptimistic probability estimation for a two-class problem. 329
Figure 9.1 A simple Bayesian network for the weather data. 341
Figure 9.2 Another Bayesian network for the weather data. 342
Figure 9.3 The Markov blanket for variable x6 in a 10-variable Bayesian network. 348
Figure 9.4 The weather data: (A) reduced version; (B) corresponding AD tree. 350
Figure 9.5 A two-class mixture model. 354
Figure 9.6 DensiTree showing possible hierarchical clusterings of a given data set. 360
Figure 9.7 Probability contours for three types of model, all based on Gaussians. 362
Figure 9.8 (A) Bayesian network for a mixture model; (B) multiple copies of the Bayesian network, one for each observation; (C) plate notation version of (B). 371
Figure 9.9 (A) Bayesian network for probabilistic PCA; (B) equal-probability contour for a Gaussian distribution along with its covariance matrix’s principal eigenvector. 372
Figure 9.10 The singular value decomposition of a t by d matrix. 377
Figure 9.11 Graphical models for (A) pLSA, (B) LDAb, and (C) smoothed LDAb. 379
Figure 9.12 (A) Bayesian network and (B) corresponding factor graph. 382
Figure 9.13 The Markov blanket for variable x6 in a 10-variable factor graph. 383
Figure 9.14 (A) and (B) Bayesian network and corresponding factor graph; (C) and (D) Naïve Bayes model and corresponding factor graph. 384
Figure 9.15 (A) Bayesian network representing the joint distribution of y and its parents; (B) factor graph for a logistic regression for the conditional distribution of y given its parents. 384
Figure 9.16 (A) Undirected graph representing a Markov random field structure; (B) corresponding factor graph. 385
Figure 9.17 Message sequence in an example factor graph. 389
Figure 9.18 (A) and (B) First- and second-order Markov models for a sequence of variables; (C) hidden Markov model; (D) Markov random field. 404
Figure 9.19 Mining emails for meeting details. 406
Figure 9.20 (A) Dynamic Bayesian network representation of a hidden Markov model; (B) similarly structured Markov random field; (C) factor graph for (A); and (D) factor graph for a linear chain conditional random field. 407
Figure 10.1 A feedforward neural network. 424
Figure 10.2 Computation graph showing forward propagation in a deep network. 426
Figure 10.3 Backpropagation in a deep network (the forward computation is shown with gray arrows). 429
Figure 10.4 Parameter updates that follow the forward and backward propagation steps (shown with gray arrows). 430
Figure 10.5 Typical learning curves for the training and validation sets. 431
Figure 10.6 Pseudocode for mini-batch based stochastic gradient descent. 435
Figure 10.7 Typical convolutional neural network architecture. 439
Figure 10.8 Original image; filtered with the two Sobel operators; magnitude of the result. 441
Figure 10.9 Examples of what random neurons detect in different layers of a convolutional neural network using the visualization approach of Zeiler and Fergus (2013). Underlying imagery kindly provided by Matthew Zeiler. 442
Figure 10.10 Example of the convolution, pooling, and decimation operations used in convolutional neural networks. 443
Figure 10.11 A simple autoencoder. 445
Figure 10.12 A deep autoencoder with multiple layers of transformation. 447
Figure 10.13 Low-dimensional principal component space (left) compared with one learned by a deep autoencoder (right). 447
Figure 10.14 Boltzmann machines: (A) fully connected; (B) restricted; (C) more general form of (B). 450
Figure 10.15 (A) Deep Boltzmann machine and (B) deep belief network. 453
Figure 10.16 (A) Feedforward network transformed into a recurrent network; (B) hidden Markov model; and (C) recurrent network obtained by unwrapping (A). 456
Figure 10.17 Structure of a “long short-term memory” unit. 459
Figure 10.18 Recurrent neural networks: (A) bidirectional; (B) encoder-decoder. 460
Figure 10.19 A deep encoder-decoder recurrent network. 460
Figure 12.1 Algorithm for bagging. 483
Figure 12.2 Algorithm for boosting. 488
Figure 12.3 Algorithm for additive logistic regression. 493
Figure 12.4 Simple option tree for the weather data. 494
Figure 12.5 Alternating decision tree for the weather data. 495
Figure 13.1 A tangled web. 521