Clustering with DBSCAN 

Clustering with DBSCAN using Marcin's package is equally simple. In fact, you only need to change a single line of code, from this:

c, err := clusters.KMeans(10000, 25, clusters.EuclideanDistance)

You would change it to this:

c, err := clusters.DBSCAN(eps, minPts, clusters.EuclideanDistance)

Now, of course, the question is what values should eps and minPts be?

eps represents the maximum distance between two points for them to be considered neighbors. minPts is the minimum number of points required to form a dense cluster. Let's address eps first.
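Before hunting for concrete values, it helps to see how the two parameters interact. The following is a minimal sketch, not the package's implementation: a point is a "core point" of a dense cluster if at least minPts points, itself included, lie within eps of it. The toy data and the euclidean helper here are invented for illustration:

```go
package main

import (
	"fmt"
	"math"
)

// euclidean is a plain Euclidean distance, standing in for the
// package's distance function in this sketch.
func euclidean(a, b []float64) float64 {
	var sum float64
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// isCorePoint reports whether point i has at least minPts points
// (including itself) within eps — the density test DBSCAN uses to
// decide whether a point can seed a cluster.
func isCorePoint(data [][]float64, i int, eps float64, minPts int) bool {
	count := 0
	for j := range data {
		if euclidean(data[i], data[j]) <= eps {
			count++
		}
	}
	return count >= minPts
}

func main() {
	data := [][]float64{{0, 0}, {0.1, 0}, {0.2, 0.1}, {5, 5}}
	fmt.Println(isCorePoint(data, 0, 0.5, 3)) // dense corner: true
	fmt.Println(isCorePoint(data, 3, 0.5, 3)) // lone outlier: false
}
```

Shrinking eps or raising minPts makes the density test stricter, so more points end up labeled as noise.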

How do we know what the best distance is? A good way to figure this out is to visualize the data. In fact, this is what the original inventors of the DBSCAN algorithm suggest. But what exactly are we to visualize?

We want to visualize the distance between the tweets. Given a dataset, we can compute a distance matrix that looks something like this:

| | A | B | C | ... |
|--|--|--|--|--|
| A | | | | |
| B | | | | |
| C | | | | |
| ... | | | | |
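To make the table concrete, here is a minimal, self-contained sketch of building such a pairwise distance matrix. The toy 2-D points and the euclidean helper are made up for illustration; in the chapter, each row would be a tweet's vector:

```go
package main

import (
	"fmt"
	"math"
)

// distanceMatrix fills in the table above: cell [i][j] holds the
// distance between row i and row j of the dataset.
func distanceMatrix(a [][]float64, distance func(x, y []float64) float64) [][]float64 {
	m := make([][]float64, len(a))
	for i := range a {
		m[i] = make([]float64, len(a))
		for j := range a {
			m[i][j] = distance(a[i], a[j])
		}
	}
	return m
}

// euclidean is a plain Euclidean distance used for this sketch.
func euclidean(x, y []float64) float64 {
	var sum float64
	for i := range x {
		d := x[i] - y[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

func main() {
	pts := [][]float64{{0, 0}, {3, 4}, {6, 8}}
	for _, row := range distanceMatrix(pts, euclidean) {
		fmt.Println(row)
	}
	// The matrix is symmetric with zeros on the diagonal:
	// [0 5 10]
	// [5 0 5]
	// [10 5 0]
}
```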

To do so, we write the following function:

func knn(a [][]float64, k int, distance func(a, b []float64) float64) ([][]float64, []float64) {
	// For every row, compute its distance to every row in the
	// dataset, then keep only the k smallest distances.
	var distances [][]float64
	for _, row := range a {
		var dists []float64
		for _, row2 := range a {
			dist := distance(row, row2)
			dists = append(dists, dist)
		}
		sort.Sort(sort.Float64Slice(dists))
		topK := dists[:k]
		distances = append(distances, topK)
	}
	// The last column holds each row's distance to its k-th nearest
	// neighbor; sorting it produces the k-distance curve we plot.
	var lastCol []float64
	for _, d := range distances {
		l := d[len(d)-1]
		lastCol = append(lastCol, l)
	}
	sort.Sort(sort.Float64Slice(lastCol))
	return distances, lastCol
}

This function takes a matrix of floats, where each row represents a tweet, and finds the k nearest neighbors of each row. Let's walk through the algorithm. As we walk through it, bear in mind that each row is a tweet; you can therefore think of each row as a very complicated coordinate.

The first thing we want to do is to find the distance between a tweet and another tweet, hence the following block:

	var distances [][]float64
	for _, row := range a {
		var dists []float64
		for _, row2 := range a {
			dist := distance(row, row2)
			dists = append(dists, dist)
		}

Of particular note are the two expressions for _, row := range a and for _, row2 := range a. In a normal KNN function, you'd have two matrices, a and b, and you'd find the distance between a tweet in a and a tweet in b. For the purposes of drawing this chart, however, we are comparing tweets within the same dataset, so both loops range over a.

Once we have acquired all the distances, we want to find the closest neighbors, so we sort the list and keep the top k in the distance matrix:

		sort.Sort(sort.Float64Slice(dists))
		topK := dists[:k]
		distances = append(distances, topK)

This, in a very quick way, is how to do k-nearest neighbors. Of course, it's not the most efficient approach: the algorithm shown here is O(n^2). There are better ways of doing things, but for the purposes of this project, this suffices.
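To see the function end to end, here is a self-contained run on four invented 2-D points. The knn body is repeated verbatim so the snippet compiles on its own (sort.Float64s is shorthand for sort.Sort(sort.Float64Slice(...))), and the euclidean helper is an assumption for illustration:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// euclidean is a plain Euclidean distance used for this sketch.
func euclidean(a, b []float64) float64 {
	var sum float64
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// knn is the chapter's function. Note that each row's distance list
// includes its distance to itself (0), so with k = 2 the last column
// is really each row's distance to its single nearest neighbor.
func knn(a [][]float64, k int, distance func(a, b []float64) float64) ([][]float64, []float64) {
	var distances [][]float64
	for _, row := range a {
		var dists []float64
		for _, row2 := range a {
			dists = append(dists, distance(row, row2))
		}
		sort.Float64s(dists)
		distances = append(distances, dists[:k])
	}
	var lastCol []float64
	for _, d := range distances {
		lastCol = append(lastCol, d[len(d)-1])
	}
	sort.Float64s(lastCol)
	return distances, lastCol
}

func main() {
	// Three points close together and one far-away outlier.
	pts := [][]float64{{0, 0}, {1, 0}, {0, 1}, {10, 10}}
	_, lastCol := knn(pts, 2, euclidean)
	// The outlier's nearest-neighbor distance stands out at the end
	// of the sorted curve — exactly the jump we look for in the plot.
	fmt.Println(lastCol)
}
```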

After that, we grab the last column of the matrix and sort it; this is what we wish to plot. The plotting code is not unlike that seen in previous chapters, so I shall provide it here with no further elaboration on how to use it:

func plotKNNDist(a []float64) plotter.XYs {
	points := make(plotter.XYs, len(a))
	for i, val := range a {
		points[i].X = float64(i)
		points[i].Y = val
	}
	return points
}

When I plot the real Twitter data to figure out the ideal eps, I get the following output:

What you want to find is an elbow or knee in the picture. Unfortunately, as you can tell, there are many of them. This is going to make clustering with the DBSCAN algorithm difficult. What this means is that the data is rather noisy.
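If you would rather not eyeball the curve, one crude heuristic (an assumption of mine, not something the chapter prescribes) is to take the point just before the largest jump in the sorted k-distance curve as the elbow, and use its value as a candidate eps:

```go
package main

import "fmt"

// kneeIndex returns the index just before the largest jump in a
// sorted k-distance curve — a crude stand-in for eyeballing the
// elbow. The value at that index is then a candidate eps. On noisy
// curves with many elbows, this picks only the single biggest one.
func kneeIndex(sorted []float64) int {
	best, bestGap := 0, 0.0
	for i := 1; i < len(sorted); i++ {
		if gap := sorted[i] - sorted[i-1]; gap > bestGap {
			bestGap, best = gap, i-1
		}
	}
	return best
}

func main() {
	// An invented k-distance curve: flat, then a sharp jump.
	curve := []float64{0.1, 0.12, 0.15, 0.16, 0.9, 1.2}
	i := kneeIndex(curve)
	fmt.Println(i, curve[i]) // elbow sits just before the jump to 0.9
}
```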

One thing of particular importance is the distance function used. I will go into this a little further in the following sections on tweaking the program.
