Chapter 3. Unsupervised Machine Learning Techniques

In the last chapter, we focused on supervised learning, that is, learning from a training dataset that was labeled. In the real world, obtaining labeled data is often difficult. In many domains, it is virtually impossible to label data, either because of the cost of labeling or because of the sheer volume or velocity at which data is generated. In those situations, unsupervised learning, in its various forms, offers the right approaches to explore, visualize, and perform descriptive and predictive modeling. In many applications, unsupervised learning is coupled with supervised learning as a first step to isolate interesting data elements for labeling.

In this chapter, we will focus on various methodologies, techniques, and algorithms that are practical and well-suited for unsupervised learning. We begin by noting the issues that are common between supervised and unsupervised learning when it comes to handling data and transformations. We will then briefly introduce the particular challenges faced in unsupervised learning owing to the lack of "ground truth" and the nature of learning under those conditions.

We will then discuss the techniques of feature analysis and dimensionality reduction applied to unlabeled datasets. This is followed by an introduction to the broad spectrum of clustering methods and discussions on the various algorithms in practical use, just as we did with supervised learning in Chapter 2, Practical Approach to Real-World Supervised Learning, showing how each algorithm works, when to use it, and its advantages and limitations. We will conclude the section on clustering by presenting the different cluster evaluation techniques.

Following the treatment of clustering, we will approach the subject of outlier detection. We will contrast various techniques and algorithms that illustrate what makes some objects outliers—also called anomalies—within a given dataset.

The chapter will conclude with clustering and outlier detection experiments, conducted with a real-world dataset and an analysis of the results obtained. In this case study, we will be using ELKI and SMILE Java libraries for the machine learning tasks and will present code and results from the experiments. We hope that this will provide the reader with a sense of the power and ease of use of these tools.

Issues in common with supervised learning

Many of the issues that we discussed in relation to supervised learning also apply to unsupervised learning. Some of them are listed here:

  • Types of features handled by the algorithm: Most clustering and outlier detection algorithms need a numeric representation to work effectively; transforming categorical or ordinal data has to be done carefully
  • Curse of dimensionality: Having a large number of features results in sparse spaces and degrades the performance of clustering algorithms. Dimensionality must be suitably reduced, either by feature selection, where only a subset of the most relevant features is retained, or by feature extraction, which transforms the feature space into a new set of principal variables in a lower-dimensional space
  • Scalability in memory and training time: Many unsupervised learning algorithms cannot scale beyond a few thousand instances, due to either memory or training time constraints
  • Outliers and noise in data: Many algorithms are affected by noise in the features, the presence of anomalous data, or missing values; these need to be transformed and handled appropriately
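To illustrate the first point, the snippet below is a minimal sketch of one common way to turn a categorical feature into the numeric representation that most clustering algorithms require: one-hot encoding, where each category becomes its own 0/1 dimension. The class name and category values here are illustrative, not taken from any particular library.

```java
import java.util.*;

// Minimal sketch: one-hot encoding of a categorical feature into a numeric
// vector, so that clustering algorithms requiring numeric input can use it.
public class OneHotEncoder {
    private final List<String> categories;

    public OneHotEncoder(Collection<String> observedValues) {
        // Fix a deterministic category order from the distinct observed values
        this.categories = new ArrayList<>(new TreeSet<>(observedValues));
    }

    public double[] encode(String value) {
        double[] vec = new double[categories.size()];
        int idx = categories.indexOf(value);
        if (idx >= 0) {
            vec[idx] = 1.0;
        }
        // A value never seen during fitting maps to the all-zeros vector
        return vec;
    }

    public static void main(String[] args) {
        OneHotEncoder enc =
            new OneHotEncoder(Arrays.asList("red", "green", "blue"));
        // Categories are sorted alphabetically: [blue, green, red]
        System.out.println(Arrays.toString(enc.encode("green")));
        // prints [0.0, 1.0, 0.0]
    }
}
```

Note that one-hot encoding increases dimensionality by one dimension per category, which interacts with the curse-of-dimensionality issue above; for high-cardinality features, a lower-dimensional transformation may be preferable.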