Part II. More advanced machine learning schemes

We have seen the basic ideas of several machine learning methods and studied in detail how to assess their performance on practical data mining problems. Now we are well prepared to look at more powerful and advanced machine learning algorithms. Our aim is to explain these both at a conceptual level and with a fair amount of technical detail, so that you can understand them fully and appreciate the key implementation issues that arise.

There are substantial differences between the simple methods described in Chapter 4, Algorithms: the basic methods, and the more sophisticated algorithms required to achieve state-of-the-art performance on many real-world problems, but the principles are the same. So are the inputs and, in many cases, the outputs—the methods of knowledge representation. But the algorithms are more complex—e.g., they may have to be extended to deal with numeric attributes, missing values, and—most challenging of all—noisy data.

Chapter 4, Algorithms: the basic methods, opened with an explanation of how to infer rudimentary rules, and went on to examine probabilistic modeling and decision trees. Then we returned to rule induction, and continued with association rules, linear models, the nearest-neighbor method of instance-based learning, clustering, and multi-instance learning. In Part II, we develop all these topics further, and also encounter some new ones.

We begin, in Chapter 6, Trees and rules, with decision tree induction, working up to a full description of the C4.5 system, a landmark decision tree program that is one of the most widely used workhorses of machine learning. Next we describe decision rule induction. Despite the simplicity of the idea, inducing rules that perform comparably with state-of-the-art decision trees turns out to be quite difficult in practice. Most high-performance rule inducers find an initial rule set and then refine it using a rather complex optimization stage that discards or adjusts individual rules to make them work better together. We describe the ideas that underlie rule learning in the presence of noise, and then go on to cover a scheme that operates by forming partial decision trees, an approach that has been demonstrated to perform well, while avoiding complex heuristics. Following this, we take a brief look at how to generate rules with exceptions, which were described in Section 3.4, and examine fast data structures for learning association rules.

Chapter 7, Extending instance-based and linear models, extends instance-based learning methods and linear models. There has been a resurgence of interest in linear models with the introduction of support vector machines, a blend of linear modeling and instance-based learning. These select a small number of critical boundary instances called “support vectors” from each class and build a linear discriminant function that separates them as widely as possible. This instance-based approach transcends the limitations of linear boundaries by making it practical to include further terms in the function that allow quadratic, cubic, and higher order decision boundaries. The same techniques can be applied to the perceptron described in Section 4.6 to implement complex decision boundaries, and also to least squares regression. An older but also very powerful technique for extending the perceptron is to connect units together into multilayer “neural networks.” We cover all these ideas in Chapter 7, Extending instance-based and linear models.

Chapter 7, Extending instance-based and linear models, also describes classic instance-based learners, developing the simple nearest-neighbor method introduced in Section 4.7 and showing some more powerful alternatives that perform explicit generalization. The final part of the chapter extends linear regression for numeric prediction to a more sophisticated procedure that comes up with the tree representation introduced in Section 3.3, and goes on to describe locally weighted regression, an instance-based strategy for numeric prediction.

Chapter 8, Data transformations, discusses methods for transforming data to improve machine learning. We look primarily at techniques that process the input to machine learning to make it more amenable to learning. We cover selection of informative attributes, discretization of numeric attributes, data projections for dimensionality reduction and learning from text data, efficient sampling from large datasets, data cleansing using such techniques as anomaly detection; and also consider transforming multiclass classification problems to two-class problems, and calibrating class probability estimates to make them more accurate.

Chapter 9, Probabilistic methods, covers probabilistic modeling approaches that go far beyond the simple Naïve Bayes classifier introduced in Chapter 4, Algorithms: the basic methods. We begin with a review of some fundamental concepts, such as maximum likelihood estimation, that form the basis of probabilistic approaches. Then we examine Bayesian networks, a powerful way of extending the Naïve Bayes method to make it less “naïve” by accommodating datasets that have internal dependencies. Next we consider how clustering can be viewed from a probabilistic perspective as fitting a mixture of probability distributions to a dataset. This is essentially a form of density estimation, and we also discuss the alternative approach of kernel density estimation to model the distribution of a dataset.

Except for the brief review of foundations at the beginning, this initial part of Chapter 9, Probabilistic methods, is fairly light on mathematics. However, the remainder of the material in the chapter requires a more mathematical approach. We look at general approaches for fitting models with unknown variables—hidden attributes that are not explicitly included in a dataset—before considering truly Bayesian methods for estimation and prediction. The next big topic is how to represent probability distributions using graphical models such as factor graphs. We will encounter probabilistic principal component analysis and Markov random fields, as well as (probabilistic) latent semantic analysis and latent Dirichlet allocation, two well-known models for working with text data. We will also see how to efficiently compute probabilities for tree-structured graphical models using the sum-product and max-product algorithms.

Up to now, most of the techniques described in this book have twin goals: maximizing predictive accuracy and producing comprehensible models. Chapter 10, Data transformations, however, breaks away and focuses exclusively on maximizing modeling accuracy; it makes no attempt to provide insight by generating interpretable models. Here we enter the domain of “deep learning,” the idea of learning very complex artificial neural networks that implicitly extract increasingly more abstract representations of the underlying patterns in a dataset. We first cover the nuts and bolts of deep feedforward networks, including commonly used loss functions and activation functions, before delving into the details of training and evaluating deep networks, including hyperparameter tuning, data augmentation, and pretraining. Next, we introduce a particular type of feedforward network, called a convolutional neural network, that reduces the number of parameters to be learned by weight sharing. We also describe autoencoders and Boltzmann machines, which are network models for unsupervised learning; and recurrent networks, which are designed for sequential data. Chapter 10, Deep learning, closes with some pointers to software implementations of deep learning methods.

The main focus of this book is on supervised techniques for machine learning, although we also consider unsupervised learning in the form of clustering and association rule mining. However, there are other, less conventional learning settings. In fact, we have already encountered one in Chapter 4, Algorithms: the basic methods—multi-instance learning (although it can also be interpreted as a form of supervised learning with bag-based examples). Chapter 8, Data transformations, covers more advanced techniques for multi-instance learning than those in Chapter 4, Algorithms: the basic methods. We also look at semisupervised learning, which, by combining supervised and unsupervised learning, promises a tantalizing opportunity to exploit unlabeled data for learning more accurate classification and numeric prediction models.

To maximize accuracy in practical applications, advantage can often be gained by combining multiple models. A wealth of practical experience has shown that ensembles of models are often necessary to squeeze out the last drops of predictive accuracy. Chapter 12, Ensemble learning, introduces various popular methods of ensemble learning—bagging, boosting, and randomization, including the renowned random forest variant—before going on to show how boosting can be interpreted as a form of additive regression, a statistical model building approach. A major drawback of most ensemble techniques is lack of interpretability, but alternating decision trees and related tree-based approaches offer the prospect of high accuracy while also providing insight. Finally we examine stacking, an intuitive method for combining a diverse set of models to maximize accuracy.

Chapter 13, Moving on: Applications and beyond, provides an outlook and overview of applications. We discuss learning from massive datasets and data streams, consider the use of domain knowledge, and provide an overview of application areas such as text mining, web mining, computer vision, speech recognition, and natural language processing, before briefly discussing the scenario of adversarial learning—learning with a malevolent teacher. At the very end, we imagine a future in which machine learning pervades our everyday activities.

Because of the nature of the material, this second part of the book differs from Part I in how you can approach it. Chapters can be read independently. Each is selfcontained, including references to further reading and to the research literature.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.145.33.235