Supervised and unsupervised machine learning

Supervised machine learning is the set of techniques used to build a model that approximates a function. The function takes a set of input variables, also called independent variables, and maps them to the output variable, also called the dependent variable or the label.

Because we know, for each set of input variables, the label (or value) we are trying to predict, this becomes a supervised learning problem.

In an unsupervised learning problem, by contrast, there is no output variable to predict. Instead, we try to group the data points so that they form logical groups.

The high-level distinction between supervised and unsupervised learning is shown in the following diagram:

As the preceding diagram shows, the supervised learning approach can distinguish between the two classes, as follows:

Supervised learning has two major objectives:

  • Predict the probability of an event happening—classification
  • Estimate the value of the continuous dependent variable—regression

The major methods that can help in classification are as follows:

  • Logistic regression
  • Decision tree
  • Random forest
  • Gradient boosting
  • Neural network
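To make the first of these methods concrete, here is a minimal numpy sketch of logistic regression trained by batch gradient descent; the function name `fit_logistic`, the toy data, and the learning-rate settings are illustrative assumptions, not code from this book:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Illustrative logistic regression via batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)            # predicted probabilities
        grad_w = X.T @ (p - y) / len(y)   # gradient of the log-loss w.r.t. w
        grad_b = np.mean(p - y)           # gradient w.r.t. the intercept
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data: class 1 whenever the single feature is positive
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])
w, b = fit_logistic(X, y)
probs = sigmoid(X @ w + b)   # predicted probability of the event for each row
```

The output is a probability between 0 and 1 for each observation, which is exactly the "probability of an event happening" objective described above.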

All of these methods except logistic regression, along with linear regression, can also be used to estimate a continuous variable (regression).
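As a sketch of the regression objective, the following fits a linear regression with numpy's least-squares solver; the toy data (roughly y = 3x + 2 with noise) is an assumption made up for illustration:

```python
import numpy as np

# Toy data: y is roughly 3x + 2 plus a little noise
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([5.1, 7.9, 11.2, 13.8])

# Append an intercept column and solve the least-squares problem
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
slope, intercept = coef   # slope ≈ 2.94, intercept ≈ 2.15
```

Here the model estimates a continuous dependent variable directly, rather than the probability of a discrete event.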

While these techniques help in estimating a continuous variable or in predicting the probability of an event happening (discrete variable prediction), unsupervised learning helps in grouping. Grouping can be either of rows (which is a typical clustering technique) or of columns (a dimensionality reduction technique). The major methods of row groupings are:

  • K-means clustering
  • Hierarchical clustering
  • Density-based clustering
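A minimal sketch of the first row-grouping method, k-means clustering, written in plain numpy; the helper name `kmeans`, the fixed iteration count, and the two-blob toy data are illustrative assumptions:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means: alternate nearest-centroid assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point (row) to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

# Two well-separated blobs of observations (rows)
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
labels, centers = kmeans(X, k=2)
```

Each row of the dataset ends up with a cluster label, which is what "grouping of rows" means in practice: observations (for example, customers) assigned to segments.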

The major methods of column groupings are:

  • Principal component analysis
  • t-Distributed Stochastic Neighbor Embedding (t-SNE)
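The first column-grouping method, principal component analysis, can be sketched in a few lines with an SVD; the function name `pca` and the toy dataset of two highly correlated columns are assumptions for illustration:

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components via SVD."""
    Xc = X - X.mean(axis=0)                # centre each column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T        # scores in the reduced space

# Two highly correlated columns: one component captures almost all the spread
X = np.array([[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8]])
Z = pca(X, n_components=1)   # shape (4, 1): two columns collapsed into one
```

Because the two original columns move together, a single principal component preserves nearly all of the information, which is the point of column grouping.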

Row groupings identify the segments of customers (observations) present in our dataset.

Column groupings reduce the number of columns. This comes in handy when the number of independent variables is high. In that case, building the model can be difficult, because the number of weights to be estimated is high, and interpreting the model can also be difficult, because some of the independent variables may be highly correlated with each other. Principal component analysis or t-SNE comes in handy in such a scenario: it reduces the number of independent variables without losing too much of the information present in the dataset.

In the next section, we will go through an overview of all the major machine learning algorithms.
