Unsupervised learning in the data-mining life cycle 

To understand the role of unsupervised learning, it is important to first look at the overall life cycle of the data-mining process. Several methodologies divide this life cycle into distinct stages, called phases. Currently, there are two popular ways to represent the data-mining life cycle:

CRISP-DM (Cross-Industry Standard Process for Data Mining) life cycle
SEMMA (Sample, Explore, Modify, Model, Assess) data-mining process

CRISP-DM was developed by a consortium of data miners who belonged to various companies, including Chrysler and SPSS (Statistical Package for Social Science). SEMMA was proposed by SAS (Statistical Analysis System). Let's look at one of these two representations of the data-mining life cycle, CRISP-DM, and try to understand the place of unsupervised learning in the data-mining life cycle. Note that SEMMA has somewhat similar phases within its life cycle.

If we look at the CRISP-DM life cycle, we can see that it consists of six distinct phases, which are shown in the following figure:

Let's understand each phase one by one:

Phase 1: Business Understanding: This is about gathering the requirements and involves trying to fully understand the problem in depth from a business point of view. Defining the scope of the problem and properly rephrasing it in machine learning (ML) terms is an important part of this phase. For example, for a binary classification problem, it is sometimes helpful to phrase the requirements as a hypothesis that can be proved or rejected. This phase is also about documenting the expectations for the machine learning model that will be trained downstream in Phase 4. For example, for a classification problem, we need to document the minimum acceptable accuracy of a model that can be deployed in production.

It is important to note that Phase 1 of the CRISP-DM life cycle is about business understanding. It focuses on what needs to be done, not on how it will be done.

Phase 2: Data Understanding: This is about understanding the data that is available for data mining. In this phase, we will find out whether the right datasets are available for the problem we are trying to solve. After identifying the datasets, we need to understand the quality of the data and its structure. We need to find out what patterns can be extracted out of the data that can potentially lead us toward important insights. We will also try to find the right feature that can be used as the label (or the target variable) according to the requirements gathered in Phase 1. Unsupervised learning algorithms can play a powerful role in achieving the objectives of Phase 2. Unsupervised algorithms can be used for the following purposes:
- To discover patterns in the dataset
- To understand the structure of the dataset by analyzing the discovered patterns
- To identify or derive the target variable
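As a rough illustration of these purposes, the sketch below runs a tiny k-means clustering over unlabeled records and treats the resulting cluster index as a candidate target variable. The data, the feature names, and the `kmeans` helper are all hypothetical and written only for this example; a real project would more likely use an established library implementation.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Tiny k-means for illustration; returns (centroids, cluster index per point)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as starting centroids
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Hypothetical unlabeled records: (monthly_visits, average_basket_value)
customers = [(1, 10), (2, 12), (1, 11), (20, 95), (22, 100), (21, 98)]

centroids, segments = kmeans(customers, k=2)

# The discovered segment index can serve as a derived label for later phases.
for record, segment in zip(customers, segments):
    print(record, "-> segment", segment)
```

The two groups in this toy dataset are well separated, so the clustering recovers them regardless of the starting centroids. On real data, the number of clusters and the meaning of each segment would have to be checked against the requirements from Phase 1.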
Phase 3: Data Preparation: This is about preparing the data for the ML model that we will train in Phase 4. The available labeled data is divided into two unequal parts. The larger portion is called the training data and is used for training the model downstream in Phase 4. The smaller portion is called the testing data and is used in Phase 5 for model evaluation. In this phase, the unsupervised machine learning algorithms can be used as a tool to prepare the data—for example, they can be used to convert unstructured data into structured data, providing additional dimensions that can be helpful in training the model.
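The division of labeled data into a larger training portion and a smaller testing portion can be sketched as follows. The `split_train_test` helper, the 30% test fraction, and the toy rows are assumptions made for illustration; the text does not prescribe a particular ratio.

```python
import random

def split_train_test(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and return (training_rows, testing_rows)."""
    shuffled = rows[:]                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labeled rows: (feature_value, label)
labeled_rows = [(x, x % 2) for x in range(10)]

train_rows, test_rows = split_train_test(labeled_rows)
print(len(train_rows), "training rows,", len(test_rows), "testing rows")
```

Shuffling before splitting avoids a split that simply mirrors the order in which the data was collected.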
Phase 4: Modeling: This is the phase where we use supervised learning to formulate the patterns that we have discovered. The data is expected to have been prepared according to the requirements of our chosen supervised learning algorithm, and the particular feature that will be used as the label is finalized here. In this phase, we form mathematical formulations to represent the relationships in our patterns of interest by training the model on the training data that was created in Phase 3. The resulting mathematical formulation depends on our choice of algorithm.
Phase 5: Evaluation: This phase is about testing the newly trained model using the test data from Phase 3. If the evaluation does not match the expectations set in Phase 1, then we need to iterate through all the preceding phases again, starting with Phase 1. This is illustrated in the preceding figure.
Phase 6: Deployment: If the evaluation meets or exceeds the expectations described in Phase 1, then the trained model is deployed in production and starts generating a solution to the problem we defined in Phase 1.
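The hand-off between evaluation and deployment amounts to comparing a test-set score against the minimum documented in Phase 1. The sketch below shows one possible form of that check. The threshold value, the labels, and the function names are hypothetical, and accuracy is used only because the text mentions it as an example metric.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def next_step(test_accuracy, minimum_accuracy):
    """Deploy when the Phase 1 expectation is met; otherwise iterate again."""
    return "deploy" if test_accuracy >= minimum_accuracy else "iterate"

MINIMUM_ACCURACY = 0.80  # hypothetical expectation documented in Phase 1

# Hypothetical test labels and model predictions from Phase 5
test_labels = [1, 0, 1, 1, 0]
test_predictions = [1, 0, 1, 0, 0]

score = accuracy(test_predictions, test_labels)
decision = next_step(score, MINIMUM_ACCURACY)
print(f"accuracy={score:.2f} -> {decision}")
```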

Phase 2 (Data Understanding) and Phase 3 (Data Preparation) of the CRISP-DM life cycle are all about understanding the data and preparing it for training the model. These phases involve data processing. Some organizations employ specialists for this data engineering phase.

The process of arriving at a solution to a problem is fully data driven. A combination of supervised and unsupervised machine learning is used to formulate a workable solution. This chapter focuses on the unsupervised learning part of the solution.

Data engineering comprises Phase 2 and Phase 3, and is the most time-consuming part of machine learning. It can take as much as 70% of the time and resources of a typical ML project. The unsupervised learning algorithms can play an important role in data engineering.

The following sections provide more details regarding unsupervised algorithms.
