In the previous chapters, the focus was on predicting a variable, which could be a number, a class, or a category. In this chapter, we change the approach and try to create new features and variables at scale, ideally ones better suited to our prediction purposes than those already included in the observation matrix. We will first introduce unsupervised methods and then illustrate three of them that are able to scale to big data:
Unsupervised learning is a branch of machine learning whose algorithms draw inferences from data that carries no explicit labels (unlabeled data). The goal of such techniques is to extract hidden patterns and group similar data.
In these algorithms, the unknown parameters of interest for each observation (group membership and topic composition, for instance) are often modeled as latent variables (or a series of hidden variables). These variables are hidden within the system of observed variables: they cannot be observed directly, only deduced from the past and present outputs of the system. Typically, the output of the system contains noise, which makes this operation harder.
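To make the idea of a latent variable concrete, here is a minimal sketch using only NumPy. The three group centers, the noise scale, and the names `latent_group`, `centers`, and `X` are illustrative assumptions, not taken from the chapter's datasets. Each observation is generated from a hidden group membership plus Gaussian noise, and only the noisy observations would be handed to an unsupervised method.

```python
import numpy as np

# Fixed seed so the sketch is reproducible.
rng = np.random.default_rng(0)

# Latent variable: the hidden group membership of each observation.
# In a real problem this array is unknown; we create it here only to
# simulate the data-generating process.
n_samples = 300
latent_group = rng.integers(0, 3, size=n_samples)

# Hypothetical group centers in a 2-D feature space (an assumption).
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])

# Observed variables: the center of the hidden group plus Gaussian noise.
noise = rng.normal(scale=1.0, size=(n_samples, 2))
X = centers[latent_group] + noise

# An unsupervised method sees only X and must deduce the grouping.
print(X.shape)  # (300, 2)
```

The larger the noise scale relative to the distance between centers, the more the groups overlap and the harder it becomes to recover `latent_group` from `X` alone.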
In common problems, unsupervised methods are used in two main situations:
First of all, before starting with our illustration, let's import the modules that we will need throughout the chapter into our notebook:
In : import matplotlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import pylab
%matplotlib inline
import matplotlib.cm as cm
import copy
import tempfile
import os