Let's see just how easy it is to do k-means clustering using scikit-learn and Python.
The first thing we're going to do is create some random data that we want to try to cluster. Just to make it easier, we'll actually build some clusters into our fake test data. So let's pretend there's some real fundamental relationship between these data, and there are some real natural clusters that exist in it.
So to do that, we can work with this little createClusteredData() function in Python:
from numpy import random, array

# Create fake income/age clusters for N people in k clusters
def createClusteredData(N, k):
    random.seed(10)
    pointsPerCluster = float(N) / k
    X = []
    for i in range(k):
        incomeCentroid = random.uniform(20000.0, 200000.0)
        ageCentroid = random.uniform(20.0, 70.0)
        for j in range(int(pointsPerCluster)):
            X.append([random.normal(incomeCentroid, 10000.0),
                      random.normal(ageCentroid, 2.0)])
    X = array(X)
    return X
The function starts off with a consistent random seed so you'll get the same result every time. We want to create N people split across k clusters, so we pass N and k to createClusteredData().
Our code first figures out how many points that works out to per cluster and stores it in pointsPerCluster. Then, it builds up a list X that starts off empty. For each cluster, we're going to create some random centroid of income (incomeCentroid) between 20,000 and 200,000 dollars and some random centroid of age (ageCentroid) between the ages of 20 and 70.
What we're doing here is creating a fake scatter plot that will show income versus age for N people and k clusters. So around each random centroid that we created, I'm then going to create a normally distributed set of random data with a standard deviation of 10,000 in income and a standard deviation of 2 in age. That will give us back a bunch of age/income data that is clustered into some pre-existing clusters that we chose at random. OK, let's go ahead and run that.
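If you want to eyeball the fake data before clustering it, a quick scatter plot works. This is just a sanity check I'm adding here, not part of the original listing:

import matplotlib.pyplot as plt

# Generate 100 fake people spread across 5 clusters and plot income vs. age.
data = createClusteredData(100, 5)
plt.figure(figsize=(8, 6))
plt.scatter(data[:, 0], data[:, 1])
plt.show()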
Now, to actually do k-means, you'll see how easy it is.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale

data = createClusteredData(100, 5)

model = KMeans(n_clusters=5)

# Note I'm scaling the data to normalize it! Important for good results.
model = model.fit(scale(data))

# We can look at the clusters each data point was assigned to
print(model.labels_)

# And we'll visualize it:
plt.figure(figsize=(8, 6))
plt.scatter(data[:,0], data[:,1], c=model.labels_.astype(float))
plt.show()
All you need to do is import KMeans from scikit-learn's cluster package. We're also going to import matplotlib so we can visualize things, and also import scale so we can take a look at how that works.
So we use our createClusteredData() function to create 100 random people around 5 clusters. So there are 5 natural clusters in the data that I'm creating. We then create a KMeans model with k of 5; we're picking 5 clusters because we know that's the right answer. But again, in unsupervised learning you don't necessarily know what the real value of k is. You need to iterate and converge on it yourself. And then we just call fit() on the KMeans model, using the data that we have.
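The original doesn't spell out how you'd iterate on k, but a minimal sketch of one common approach is to re-fit KMeans for several candidate values and compare the inertia_ (within-cluster sum of squared distances) that scikit-learn reports; a sharp "elbow" in those numbers is one hint at a reasonable k. The specific range of k values here is just my choice for illustration:

from sklearn.cluster import KMeans
from sklearn.preprocessing import scale

# Try several values of k and print how tight the resulting clusters are.
scaled = scale(data)
for k in range(2, 9):
    model_k = KMeans(n_clusters=k, n_init=10, random_state=10).fit(scaled)
    print(k, model_k.inertia_)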
Now, the scale() function I alluded to earlier normalizes the data. One important thing with k-means is that it works best if your data is all normalized. That means everything is at the same scale. A problem that I have here is that my ages range from 20 to 70, but my incomes range all the way up to 200,000, so these values are not really comparable; the incomes are much larger than the age values. scale() will take all that data and put it on a consistent scale so I can actually compare these things as apples to apples, and that will help a lot with your k-means results.
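If you want to see what scale() is actually doing to this data, a quick check (my own addition, not from the original) is to print the mean and standard deviation of each column before and after scaling:

from sklearn.preprocessing import scale

# Before scaling: income is in the tens of thousands, age is in the tens,
# so distances would be dominated almost entirely by income.
print(data.mean(axis=0), data.std(axis=0))

# After scaling: each column has roughly zero mean and unit standard deviation,
# so income and age contribute comparably to the distance calculations.
scaled = scale(data)
print(scaled.mean(axis=0), scaled.std(axis=0))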
So, once we've called fit on our model, we can look at the resulting labels that we got. Then we can visualize it using a little bit of matplotlib magic. You can see in the code we have a little trick where we assign the colors based on the labels we ended up with, converted to floating-point numbers. That's just a way to assign arbitrary colors to a given value. So let's see what we end up with:
It didn't take that long. You can see in the results which cluster each point was assigned to. We know that our fake data is already pre-clustered, and it seems that it identified the first and second clusters pretty easily. It got a little bit confused beyond that point, though, because our clusters in the middle are actually a little bit mushed together. They're not really that distinct, so that was a challenge for k-means. But regardless, it did come up with some reasonable guesses at the clusters. This is probably an example of where four clusters would more naturally fit the data.
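If you want to test that hunch, a quick experiment (my own addition, not part of the original walkthrough) is to re-fit with n_clusters=4 and eyeball the result the same way:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale

# Re-fit with 4 clusters to see whether the mushed-together middle clusters
# get merged into something that looks more natural.
model4 = KMeans(n_clusters=4).fit(scale(data))
plt.figure(figsize=(8, 6))
plt.scatter(data[:, 0], data[:, 1], c=model4.labels_.astype(float))
plt.show()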