Figure 1.1 | Rules for the contact lens data. | 13 |
Figure 1.2 | Decision tree for the contact lens data. | 14 |
Figure 1.3 | Decision trees for the labor negotiations data. | 18 |
Figure 1.4 | Life cycle of a data mining project. | 29 |
Figure 2.1 | A family tree and two ways of expressing the sister-of relation. | 48 |
Figure 2.2 | ARFF file for the weather data. | 58 |
Figure 2.3 | Multi-instance ARFF file for the weather data. | 60 |
Figure 3.1 | A linear regression function for the CPU performance data. | 69 |
Figure 3.2 | A linear decision boundary separating Iris setosas from Iris versicolors. | 70 |
Figure 3.3 | Constructing a decision tree interactively: (A) creating a rectangular test involving petallength and petalwidth; (B) the resulting (unfinished) decision tree. | 73 |
Figure 3.4 | Models for the CPU performance data: (A) linear regression; (B) regression tree; (C) model tree. | 74 |
Figure 3.5 | Decision tree for a simple disjunction. | 76 |
Figure 3.6 | The exclusive-or problem. | 77 |
Figure 3.7 | Decision tree with a replicated subtree. | 77 |
Figure 3.8 | Rules for the iris data. | 81 |
Figure 3.9 | The shapes problem. | 82 |
Figure 3.10 | Different ways of partitioning the instance space. | 86 |
Figure 3.11 | Different ways of representing clusters. | 88 |
Figure 4.1 | Pseudocode for 1R. | 93 |
Figure 4.2 | Tree stumps for the weather data. | 106 |
Figure 4.3 | Expanded tree stumps for the weather data. | 108 |
Figure 4.4 | Decision tree for the weather data. | 109 |
Figure 4.5 | Tree stump for the ID code attribute. | 111 |
Figure 4.6 | Covering algorithm: (A) covering the instances; (B) decision tree for the same problem. | 113 |
Figure 4.7 | The instance space during operation of a covering algorithm. | 115 |
Figure 4.8 | Pseudocode for a basic rule learner. | 118 |
Figure 4.9 | (A) Finding all item sets with sufficient coverage; (B) finding all sufficiently accurate association rules for a k-item set. | 127 |
Figure 4.10 | Logistic regression: (A) the logit transform; (B) example logistic regression function. | 130 |
Figure 4.11 | The perceptron: (A) learning rule; (B) representation as a neural network. | 132 |
Figure 4.12 | The Winnow algorithm: (A) unbalanced version; (B) balanced version. | 134 |
Figure 4.13 | A kD-tree for four training instances: (A) the tree; (B) instances and splits. | 137 |
Figure 4.14 | Using a kD-tree to find the nearest neighbor of the star. | 137 |
Figure 4.15 | Ball tree for 16 training instances: (A) instances and balls; (B) the tree. | 139 |
Figure 4.16 | Ruling out an entire ball (gray) based on a target point (star) and its current nearest neighbor. | 140 |
Figure 4.17 | Iterative distance-based clustering. | 143 |
Figure 4.18 | A ball tree: (A) two cluster centers and their dividing line; (B) corresponding tree. | 145 |
Figure 4.19 | Hierarchical clustering displays. | 149 |
Figure 4.20 | Clustering the weather data. | 151 |
Figure 4.21 | Hierarchical clusterings of the iris data. | 153 |
Figure 5.1 | A hypothetical lift chart. | 185 |
Figure 5.2 | Analyzing the expected benefit of a mailing campaign when the cost of mailing is (A) $0.50 and (B) $0.80. | 187 |
Figure 5.3 | A sample ROC curve. | 188 |
Figure 5.4 | ROC curves for two learning schemes. | 189 |
Figure 5.5 | Effect of varying the probability threshold: (A) error curve; (B) cost curve. | 193 |
Figure 6.1 | Example of subtree raising, where node C is “raised” to subsume node B. | 214 |
Figure 6.2 | Pruning the labor negotiations decision tree. | 216 |
Figure 6.3 | Algorithm for forming rules by incremental reduced-error pruning. | 226 |
Figure 6.4 | RIPPER: (A) algorithm for rule learning; (B) meaning of symbols. | 228 |
Figure 6.5 | Algorithm for expanding examples into a partial tree. | 229 |
Figure 6.6 | Example of building a partial tree. | 230 |
Figure 6.7 | Rules with exceptions for the iris data. | 232 |
Figure 6.8 | Extended prefix trees for the weather data: (A) the full data; (B) the data conditional on temperature=mild; (C) the data conditional on humidity=normal. | 238 |
Figure 7.1 | A boundary between two rectangular classes. | 249 |
Figure 7.2 | A maximum margin hyperplane. | 253 |
Figure 7.3 | Support vector regression: (A) ε=1; (B) ε=2; (C) ε=0.5. | 257 |
Figure 7.4 | Example data sets and corresponding perceptrons. | 262 |
Figure 7.5 | Step vs sigmoid: (A) step function; (B) sigmoid function. | 264 |
Figure 7.6 | Gradient descent using the error function w² + 1. | 265 |
Figure 7.7 | Multilayer perceptron with a hidden layer (omitting bias inputs). | 267 |
Figure 7.8 | Hinge, squared and 0–1 loss functions. | 271 |
Figure 7.9 | Pseudocode for model tree induction. | 278 |
Figure 7.10 | Model tree for a data set with nominal attributes. | 279 |
Figure 8.1 | Attribute space for the weather dataset. | 292 |
Figure 8.2 | Discretizing the temperature attribute using the entropy method. | 299 |
Figure 8.3 | The result of discretizing the temperature attribute. | 299 |
Figure 8.4 | Class distribution for a two-class, two-attribute problem. | 302 |
Figure 8.5 | Principal component transform of a dataset: (A) variance of each component; (B) variance plot. | 306 |
Figure 8.6 | Comparing principal component analysis and Fisher’s linear discriminant analysis. | 312 |
Figure 8.7 | Number of international phone calls from Belgium, 1950–1973. | 318 |
Figure 8.8 | Overoptimistic probability estimation for a two-class problem. | 329 |
Figure 9.1 | A simple Bayesian network for the weather data. | 341 |
Figure 9.2 | Another Bayesian network for the weather data. | 342 |
Figure 9.3 | The Markov blanket for variable x₆ in a 10-variable Bayesian network. | 348 |
Figure 9.4 | The weather data: (A) reduced version; (B) corresponding AD tree. | 350 |
Figure 9.5 | A two-class mixture model. | 354 |
Figure 9.6 | DensiTree showing possible hierarchical clusterings of a given data set. | 360 |
Figure 9.7 | Probability contours for three types of model, all based on Gaussians. | 362 |
Figure 9.8 | (A) Bayesian network for a mixture model; (B) multiple copies of the Bayesian network, one for each observation; (C) plate notation version of (B). | 371 |
Figure 9.9 | (A) Bayesian network for probabilistic PCA; (B) equal-probability contour for a Gaussian distribution along with its covariance matrix’s principal eigenvector. | 372 |
Figure 9.10 | The singular value decomposition of a t by d matrix. | 377 |
Figure 9.11 | Graphical models for (A) pLSA, (B) LDAb, and (C) smoothed LDAb. | 379 |
Figure 9.12 | (A) Bayesian network and (B) corresponding factor graph. | 382 |
Figure 9.13 | The Markov blanket for variable x₆ in a 10-variable factor graph. | 383 |
Figure 9.14 | (A) and (B) Bayesian network and corresponding factor graph; (C) and (D) Naïve Bayes model and corresponding factor graph. | 384 |
Figure 9.15 | (A) Bayesian network representing the joint distribution of y and its parents; (B) factor graph for a logistic regression for the conditional distribution of y given its parents. | 384 |
Figure 9.16 | (A) Undirected graph representing a Markov random field structure; (B) corresponding factor graph. | 385 |
Figure 9.17 | Message sequence in an example factor graph. | 389 |
Figure 9.18 | (A) and (B) First- and second-order Markov models for a sequence of variables; (C) Hidden Markov model; (D) Markov random field. | 404 |
Figure 9.19 | Mining emails for meeting details. | 406 |
Figure 9.20 | (A) Dynamic Bayesian network representation of a hidden Markov model; (B) similarly structured Markov random field; (C) factor graph for (A); and (D) factor graph for a linear chain conditional random field. | 407 |
Figure 10.1 | A feedforward neural network. | 424 |
Figure 10.2 | Computation graph showing forward propagation in a deep network. | 426 |
Figure 10.3 | Backpropagation in a deep network (the forward computation is shown with gray arrows). | 429 |
Figure 10.4 | Parameter updates that follow the forward and backward propagation steps (shown with gray arrows). | 430 |
Figure 10.5 | Typical learning curves for the training and validation sets. | 431 |
Figure 10.6 | Pseudocode for mini-batch based stochastic gradient descent. | 435 |
Figure 10.7 | Typical convolutional neural network architecture. | 439 |
Figure 10.8 | Original image; filtered with the two Sobel operators; magnitude of the result. | 441 |
Figure 10.9 | Examples of what random neurons detect in different layers of a convolutional neural network using the visualization approach of Zeiler and Fergus (2013). Underlying imagery kindly provided by Matthew Zeiler. | 442 |
Figure 10.10 | Example of the convolution, pooling, and decimation operations used in convolutional neural networks. | 443 |
Figure 10.11 | A simple autoencoder. | 445 |
Figure 10.12 | A deep autoencoder with multiple layers of transformation. | 447 |
Figure 10.13 | Low-dimensional principal component space (left) compared with one learned by a deep autoencoder (right). | 447 |
Figure 10.14 | Boltzmann machines: (A) fully connected; (B) restricted; (C) more general form of (B). | 450 |
Figure 10.15 | (A) Deep Boltzmann machine and (B) deep belief network. | 453 |
Figure 10.16 | (A) Feedforward network transformed into a recurrent network; (B) hidden Markov model; and (C) recurrent network obtained by unwrapping (A). | 456 |
Figure 10.17 | Structure of a “long short-term memory” unit. | 459 |
Figure 10.18 | Recurrent neural networks: (A) bidirectional; (B) encoder-decoder. | 460 |
Figure 10.19 | A deep encoder-decoder recurrent network. | 460 |
Figure 12.1 | Algorithm for bagging. | 483 |
Figure 12.2 | Algorithm for boosting. | 488 |
Figure 12.3 | Algorithm for additive logistic regression. | 493 |
Figure 12.4 | Simple option tree for the weather data. | 494 |
Figure 12.5 | Alternating decision tree for the weather data. | 495 |
Figure 13.1 | A tangled web. | 521 |