Chapter 4. Identifying Anomalous Data

In this chapter, we will delve into deep neural networks and deep learning models, focusing on auto-encoder models, which can be used to learn the features of a dataset. The first part of the chapter introduces unsupervised learning, where there is no specific outcome to be predicted. The next section provides a conceptual overview of auto-encoder models, particularly in a machine learning and deep neural network context. The core of the chapter shows how to build and apply an auto-encoder model to identify anomalous data. Such atypical data may simply be bad data or outliers, but these techniques are also used for fraud detection; for example, when an individual's credit card spending pattern differs from their usual behavior, it may be a red flag that something is wrong. Finally, the chapter closes with an exploration of how to fine-tune these models, including the use of the different regularization strategies discussed in the previous chapter. In addition to being useful in its own right, this chapter provides important building blocks for using and training deep learning models.

This chapter will cover the following topics:

  • What is unsupervised learning?
  • How do auto-encoders work?
  • Training an auto-encoder in R
  • Use case – building and applying an auto-encoder model
  • Fine-tuning auto-encoder models

Getting started with unsupervised learning

So far, we have focused on models and techniques that broadly fall under the category of supervised learning. Supervised learning is supervised in the sense that the task is for the machine to learn the relationship between a set of variables or features and one or more outcomes. Often, there is only a single outcome. For example, a company may wish to predict whether someone is likely to become a customer, in which case the outcome, whether an individual becomes a customer, is coded as yes/no. In this chapter, we will delve into methods of unsupervised learning. In contrast to supervised learning, where one or more outcome variables or labeled data are used, unsupervised learning does not require any outcomes or labels; it uses only the input features for learning. A common example of unsupervised learning is cluster analysis, such as K-means clustering, where the machine learns hidden or latent clusters in the data by optimizing a criterion (for example, minimizing the variance within each cluster).
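
To make this concrete, the following is a minimal sketch of K-means clustering in base R. The use of the built-in iris data and the choice of three clusters are illustrative assumptions, not part of this chapter's use case:

    # A minimal sketch of unsupervised learning with k-means, using the
    # built-in iris data (an illustrative choice, not the chapter's data)
    data(iris)

    # Use only the input features; no outcome labels are supplied
    features <- iris[, 1:4]

    # Ask for three latent clusters; k-means minimizes the
    # within-cluster sum of squares
    set.seed(42)
    fit <- kmeans(features, centers = 3, nstart = 25)

    # Inspect the cluster sizes and the within-cluster variance criterion
    fit$size
    fit$tot.withinss

Note that the algorithm never sees the iris species labels; it groups observations purely from the input features.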

Another way to think about unsupervised learning is that the goal is to predict the inputs. An example of this is shown in Figure 4.1. At first this seems counter-intuitive, as it may appear unhelpful to learn a sophisticated model whose only purpose is to reproduce the inputs fed into it. However, there are a number of useful applications. One common use of unsupervised learning is dimension reduction: given a set of p variables, the goal is to find a set of k latent variables, where k < p, such that the k latent variables can reasonably reproduce the p raw variables. This is always a trade-off and balancing act, as typically the greater the dimension reduction, the greater the simplicity, but at the cost of accuracy:

Figure 4.1
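
To preview the idea shown in Figure 4.1, below is a minimal sketch of a network trained to predict its own inputs. The keras package and the toy data here are assumptions made purely for illustration; the chapter's own implementation comes later:

    library(keras)

    # Toy input: 200 observations of 4 numeric features (illustrative only)
    x <- matrix(rnorm(200 * 4), ncol = 4)

    # A 4 -> 2 -> 4 network: the 2-unit bottleneck is the latent code
    model <- keras_model_sequential() %>%
      layer_dense(units = 2, activation = "relu", input_shape = 4) %>%
      layer_dense(units = 4)

    # The inputs also serve as the targets, so the model learns to
    # reproduce its own inputs from the compressed representation
    model %>% compile(optimizer = "adam", loss = "mse")
    model %>% fit(x, x, epochs = 20, verbose = 0)

The key point is the second argument to fit(): the targets are the inputs themselves, which is what makes the learning unsupervised.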

Perhaps the most common example of dimension reduction is principal component analysis. Principal component analysis uses an orthogonal transformation to go from the raw data to the principal components. In addition to being uncorrelated, the principal components are ordered from the component that explains the most variance to the one that explains the least. Although all principal components can be used (in which case the dimensionality of the data is not reduced), typically only components that explain a sufficiently large amount of variance (for example, those with high eigenvalues) are retained, and components that account for relatively little variance are dropped as noise or unnecessary.
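
As a brief illustration, here is a sketch of principal component analysis in base R using prcomp(); the iris data is again an illustrative assumption:

    # Principal component analysis on four numeric input features;
    # centering and scaling the variables is a common choice
    pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

    # Proportion of variance explained by each component; components
    # that explain little variance can be dropped as noise
    summary(pca)

    # Keep only the first two components as a reduced representation
    reduced <- pca$x[, 1:2]

Here the four raw variables are compressed into two latent components, trading a small loss of information for a simpler representation.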

A variety of other methods for unsupervised learning are covered in Chapter 14 of Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning (2nd ed.), Springer. The remainder of this chapter will focus on unsupervised methods for deep learning, specifically on auto-encoders.
