Unsupervised learning

Let's talk in detail now about two different types of machine learning: supervised and unsupervised learning. Sometimes there can be kind of a blurry line between the two, but the basic definition of unsupervised learning is that you're not giving your model any answers to learn from. You're just presenting it with a group of data, and your machine learning algorithm tries to make sense of it given no additional information.

Let's say I give it a bunch of different objects, like these balls and cubes and sets of dice and whatnot. Let's then say I have some algorithm that will cluster these objects into groups of things that are similar to each other, based on some similarity metric.

Now I haven't told the machine learning algorithm, ahead of time, what categories certain objects belong to. I don't have a cheat sheet that it can learn from, where I have a set of existing objects and my correct categorization of them. The machine learning algorithm must infer those categories on its own. This is an example of unsupervised learning, where I don't have a set of answers that I'm getting it to learn from. I'm just trying to let the algorithm gather its own answers based on the data presented to it alone.

The problem with this is that we don't necessarily know what the algorithm will come up with! If I gave it that bunch of objects shown in the preceding image, is it going to group things into things that are round, things that are large versus small, or things that are red versus blue? I don't know. It's going to depend primarily on the metric that I give it for similarity between items. But sometimes clusters will emerge that are surprising, ones that you didn't expect to see.
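To make that dependence on the similarity metric concrete, here's a minimal sketch. It assumes a tiny k-means-style clustering with two groups and a weighted Euclidean distance, which is just one possible choice of algorithm and metric. The objects, the feature values, and the function names are all made up for illustration.

```python
import math

# Hypothetical objects: (name, size in cm, redness from 0 = blue to 1 = red).
OBJECTS = [
    ("small red ball", 2.0, 1.0),
    ("small blue cube", 2.5, 0.0),
    ("large red cube", 9.0, 0.9),
    ("large blue ball", 9.5, 0.1),
]


def weighted_distance(a, b, weights):
    """Euclidean distance after scaling each feature by its weight."""
    return math.sqrt(sum((w * (x - y)) ** 2 for x, y, w in zip(a, b, weights)))


def two_means(points, weights, iterations=10):
    """Tiny k-means with k=2; centroids start at the two points farthest apart."""
    pairs = [(p, q) for i, p in enumerate(points) for q in points[i + 1:]]
    centroids = list(
        max(pairs, key=lambda pq: weighted_distance(pq[0], pq[1], weights))
    )
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(2), key=lambda c: weighted_distance(p, centroids[c], weights))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(2):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels


def groups(weights):
    """Cluster OBJECTS under the given feature weights; return sets of names."""
    points = [(size, red) for _, size, red in OBJECTS]
    labels = two_means(points, weights)
    clusters = {}
    for (name, _, _), lab in zip(OBJECTS, labels):
        clusters.setdefault(lab, set()).add(name)
    return sorted(clusters.values(), key=sorted)


# No labels are given anywhere; only the metric changes between the two runs.
print(groups((1.0, 0.0)))  # metric looks at size only
print(groups((0.0, 1.0)))  # metric looks at color only
```

With the weight on size, the small objects land in one group and the large ones in the other. With the weight on color, the same objects split into red versus blue instead. Nothing about the data changed, only the notion of similarity.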

So that's really the point of unsupervised learning: if you don't know what you're looking for, it can be a powerful tool for discovering classifications that you didn't even know were there. We call this a latent variable: some property of your data that you didn't even know was there originally, which can be teased out by unsupervised learning.

Let's take another example of unsupervised learning. Say I was clustering people instead of balls and dice. I'm writing a dating site and I want to see what sorts of people tend to cluster together. There are some attributes that people tend to cluster around, which decide whether they tend to like each other and date each other, for example. Now you might find that the clusters that emerge don't conform to your preconceived stereotypes. Maybe it's not about college students versus middle-aged people, or people who are divorced and whatnot, or their religious beliefs. Maybe if you look at the clusters that actually emerged from that analysis, you'll learn something new about your users and figure out that there's something more important than any of those existing features of your people in deciding whether they like each other. So that's an example of unsupervised learning providing useful results.

Another example could be clustering movies based on their properties. If you were to run clustering on a set of movies from, say, IMDb, maybe the results would surprise you. Perhaps it's not just about the genre of the movie. Maybe there are other properties, like the age of the movie or the running length or what country it was released in, that are more important. You just never know. Or we could analyze the text of product descriptions and try to find the terms that carry the most meaning for a certain category. Again, we might not necessarily know ahead of time what terms, or what words, are most indicative of a product being in a certain category, but through unsupervised learning, we can tease out that latent information.
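As a rough sketch of the product description idea, here's one common way to score how much meaning a term carries: TF-IDF, which rewards terms that are frequent in one description but rare across the others. That choice is an assumption on my part, not the only way to do it, and the product names, descriptions, and function names below are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical product descriptions, invented for illustration.
DESCRIPTIONS = {
    "tent": "lightweight waterproof tent for camping",
    "boots": "waterproof leather boots for hiking",
    "laptop": "lightweight laptop for work and travel",
}


def tf_idf(docs):
    """Score each term in each description by term frequency times log(N / df)."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many descriptions does each term appear?
    doc_freq = Counter(term for words in tokenized.values() for term in set(words))
    scores = {}
    for name, words in tokenized.items():
        counts = Counter(words)
        scores[name] = {
            term: (count / len(words)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


def top_terms(docs, n=2):
    """Return the n highest-scoring terms per description (ties broken A to Z)."""
    return {
        name: [t for t, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]
        for name, s in tf_idf(docs).items()
    }


print(top_terms(DESCRIPTIONS))
```

Notice that nobody told the code which words matter. A word like "for" shows up in every description, so it scores zero, while a word that appears in only one description floats to the top.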
