Building stories

Simhash should be used to detect near-duplicate articles only. Extending our search to a 3-bit or 4-bit difference quickly becomes inefficient (detecting a 3-bit difference requires 5,488 distinct queries against Cassandra, while 41,448 queries would be needed to detect up to 4-bit differences) and tends to bring in far more noise than genuinely related articles. Should the user want to build larger stories, a typical clustering technique must then be applied.

Building term frequency vectors

We will start grouping events into stories using a KMeans algorithm, taking the articles' word frequencies as input vectors. TF-IDF is a simple, efficient, and proven technique for building vectors out of text content. The basic idea is to compute a word frequency that we normalize by the inverse document frequency across the dataset, hence decreasing the weight of common words (such as stop words) while increasing the weight of words specific to a given document. Computing the term frequencies is essentially the Wordcount algorithm, one of the basics of MapReduce processing. We first compute our RDD of term frequencies for each word in each document.

val tfRDD = documentRDD.flatMap { case (docId, body) =>
  body.split("\\s").map { word =>
    ((docId, word), 1)
  }
}
.reduceByKey(_ + _)
.map { case ((docId, word), tf) =>
  (docId, (word, tf))
}

The IDF is the logarithm of the total number of documents divided by the number of documents containing the word w:

idf(w) = log(N / df(w))

Note that the code below applies a common add-one smoothing to both the numerator and the denominator.

val n = sc.broadcast(documentRDD.count())
val dfMap = sc.broadcast(
  tfRDD.map { case (docId, (word, _)) =>
    (docId, word)
  }
  .distinct()
  .values
  .map { word =>
    (word, 1)
  }
  .reduceByKey(_+_)
  .collectAsMap()
)
 
val tfIdfRDD = tfRDD.mapValues { case (word, tf) =>
  val df = dfMap.value(word)
  val idf = math.log((n.value + 1).toDouble / (df + 1))
  (word, tf * idf)
}

As our output vectors are made of words, we need to assign a sequence ID to each word in our corpus. There are two possible solutions here: either we build a dictionary and assign an ID to each word, or we group different words into the same buckets using a hash function. The former is ideal but results in vectors around a million features long (as many features as there are unique words), while the latter is much smaller (as many features as the user specifies) but may lead to undesired effects due to hash collisions (the fewer the features, the more collisions).
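
For reference, the dictionary approach could be sketched as follows (a minimal, illustrative snippet of ours; the variable name dictionaryRDD is not part of the original code). The hashing approach we actually use is shown right after.

// Dictionary approach (sketch): assign a unique, stable ID to every distinct
// word in the corpus. These IDs would then replace the words in our
// (docId, (word, tfIdf)) tuples, typically via a join.
val dictionaryRDD = tfIdfRDD
  .map { case (_, (word, _)) => word }
  .distinct()
  .zipWithIndex()   // RDD[(word, uniqueId)]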

val numFeatures = 256

val vectorRDD = tfIdfRDD.mapValues { case (word, tfIdf) =>
  // Map each word to a non-negative bucket index in [0, numFeatures)
  val rawMod = word.hashCode % numFeatures
  val index = rawMod + (if (rawMod < 0) numFeatures else 0)
  (index, tfIdf)
}
.groupByKey()
.values
.map { it =>
  // Sum the weights of words that fall into the same bucket (hash collisions)
  val weights = it.groupBy(_._1).mapValues(_.map(_._2).sum).toSeq
  Vectors.sparse(numFeatures, weights)
}

Although we described the TF-IDF technique in detail, this hashing TF can be achieved in just a couple of lines thanks to the MLlib utilities, as we will see next. We have built an RDD of 256-dimensional vectors that could technically be fed into a KMeans clustering, but, due to the hashing properties explained earlier, we would be subject to dramatic hash collisions.

val tfModel = new HashingTF(1 << 20)
val tfRDD = documentRDD.values.map { body =>
  tfModel.transform(body.split("\\s"))
}

val idfModel = new IDF().fit(tfRDD)
val tfIdfRDD = idfModel.transform(tfRDD)
val normalizer = new Normalizer()
val sparseVectorRDD = tfIdfRDD map normalizer.transform

The curse of dimensionality, the data science plague

Increasing our feature size from 256 to, say, 2^20 will strongly limit the number of collisions, but this comes at a price: our data points are now embedded in a very high-dimensional space.
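
To get a feel for the difference, here is a quick back-of-the-envelope estimate (a sketch of ours, assuming a purely illustrative vocabulary of one million distinct words) of how many words end up sharing a bucket with another word:

// Expected number of occupied buckets when hashing `vocab` words uniformly
// into `buckets` buckets; the difference (vocab - occupied) estimates how
// many words are forced to share a bucket with a previously hashed word.
def expectedCollidingWords(vocab: Long, buckets: Long): Double = {
  val pBucketEmpty = math.pow(1.0 - 1.0 / buckets, vocab.toDouble)
  val expectedOccupied = buckets * (1.0 - pBucketEmpty)
  vocab - expectedOccupied
}

expectedCollidingWords(1000000L, 256L)      // ~999,744: virtually every word collides
expectedCollidingWords(1000000L, 1L << 20)  // ~355,000: far fewer, though not zero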

Here we describe a clever approach to overcoming the curse of dimensionality (http://www.stat.ucla.edu/~sabatti/statarray/textr/node5.html) without having to dive deep into fuzzy mathematical theories around matrix calculation (such as singular value decomposition) and without the need for compute-intensive operations. This approach is called Random Indexing and is similar to the Simhash concept described earlier.

Note

More information about Random Indexing can be found at http://eprints.sics.se/221/1/RI_intro.pdf.

The idea is to generate a sparse, randomly generated, and unique representation of each distinct feature (here, a word), composed of +1s, -1s, and mostly 0s. Then, each time we come across a word in a context (a document), we add this word's signature to a context vector. A document vector is then the sum of each of its words' vectors, as shown in the following Figure 8 (or, in our case, the sum of each of its TF-IDF vectors):

Figure 8: Building a Random Indexing vector
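
To make the mechanics concrete, the following is a minimal, standalone sketch of Random Indexing in plain Scala; the dimensions, the number of non-zero positions, and the helper names (signature, embed) are purely illustrative and not part of the library we will actually use below:

import scala.util.Random
import scala.util.hashing.MurmurHash3

// Every word gets a sparse ternary signature: up to `nonZeroPositions`
// randomly chosen positions set to +1 or -1, everything else 0. Seeding the
// generator with a hash of the word keeps the signature deterministic.
val dimensions = 256
val nonZeroPositions = 10

def signature(word: String): Array[Double] = {
  val rng = new Random(MurmurHash3.stringHash(word))
  val vector = Array.fill(dimensions)(0.0)
  (1 to nonZeroPositions).foreach { _ =>
    vector(rng.nextInt(dimensions)) = if (rng.nextBoolean()) 1.0 else -1.0
  }
  vector
}

// A document vector is the sum of its words' signatures, each weighted
// by the word's TF-IDF score.
def embed(tfIdfWeights: Seq[(String, Double)]): Array[Double] = {
  tfIdfWeights.foldLeft(Array.fill(dimensions)(0.0)) {
    case (acc, (word, weight)) =>
      signature(word).zipWithIndex.foreach { case (v, i) => acc(i) += v * weight }
      acc
  }
}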

We invite our purist math geek readers to dive into the Johnson-Lindenstrauss lemma (http://ttic.uchicago.edu/~gregory/courses/LargeScaleLearning/lectures/jl.pdf), which basically states that "if we project points in a vector space into a randomly selected subspace of sufficiently high dimensionality, the distances between the points are approximately preserved". While the Random Indexing technique itself can be implemented (with a fair amount of effort), the Johnson-Lindenstrauss lemma, although quite useful, is far more difficult to grasp. Luckily, an implementation is part of the excellent Spark package generalized-kmeans-clustering (https://github.com/derrickburns/generalized-kmeans-clustering) from Derrick Burns.

val embedding = Embedding(Embedding.MEDIUM_DIMENSIONAL_RI)
val denseVectorRDD = sparseVectorRDD map embedding.embed
denseVectorRDD.cache()

We were finally able to project our 2^20-dimensional vectors into only 256 dimensions. This technique offers huge benefits, to say the least:

  • We have a fixed number of features. Our vectors will never grow in size should we encounter a new word in the future that was not part of our initial dictionary. This will be particularly useful in a streaming context.
  • Our input feature space is extremely large (2^20). Although collisions will still occur, the risk is mitigated.
  • The distances are preserved thanks to the Johnson-Lindenstrauss lemma.
  • Our output vectors are relatively small (256). We overcame the curse of dimensionality.

As we cached our vector RDD into memory, we can now look at the KMeans clustering itself.

Optimizing KMeans

We assume our readers are already familiar with KMeans clustering, as this algorithm is probably the most famous and widely used algorithm for unsupervised clustering. Any attempt at yet another explanation here would not be as good as the many resources available after more than half a century of active research.

We previously created our vectors based on the articles' content (TF-IDF). The next step is to start grouping articles into stories based on their similarity. In the Spark implementation of KMeans, only the Euclidean distance measure is supported. One could argue that the cosine distance is more suitable for text analysis, but we assume the former is accurate enough, as we do not want to repackage the MLlib distribution for this exercise. For more explanation regarding the use of the cosine distance for text analysis, please refer to http://www.cse.msu.edu/~pramanik/research/papers/2003Papers/sac04.pdf. We report in the following code both the Euclidean and cosine functions, which can be applied to any array of doubles (the logical data structure behind dense vectors):

def euclidean(xs: Array[Double], ys: Array[Double]) = {
  require(xs.length == ys.length)
  math.sqrt((xs zip ys)
    .map { case (x, y) =>
      math.pow(y - x, 2)
    }
    .sum
  )
}

def cosine(xs: Array[Double], ys: Array[Double]) = {

  require(xs.length == ys.length)
  val magX = math.sqrt(xs.map(i => i * i).sum)
  val magY = math.sqrt(ys.map(i => i * i).sum)
  val dotP = (xs zip ys).map { case (x, y) =>
    x * y
  }.sum

  dotP / (magX * magY)
}
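
As a side note (a quick sanity check of ours, not something taken from the MLlib API): once vectors are L2-normalized, as we did with the Normalizer earlier, the squared Euclidean distance becomes a monotonic function of the cosine similarity, which is one reason the Euclidean distance is usually acceptable for text vectors:

// For L2-normalized x and y, ||x - y||^2 == 2 * (1 - cosine(x, y)),
// so ranking points by Euclidean or by cosine distance is equivalent.
def normalize(v: Array[Double]): Array[Double] = {
  val norm = math.sqrt(v.map(i => i * i).sum)
  v.map(_ / norm)
}

val x = normalize(Array(0.3, 0.5, 0.1))
val y = normalize(Array(0.2, 0.4, 0.4))

val squaredEuclidean = math.pow(euclidean(x, y), 2)
val cosineBased = 2 * (1 - cosine(x, y))
// squaredEuclidean and cosineBased agree up to floating point precision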

Training a new KMeans clustering is fairly simple using the MLlib package. We specify a threshold of 0.01 below which we consider our cluster centers to have converged, and set the maximum number of iterations to 1,000.

val model: KMeansModel = new KMeans()
  .setEpsilon(0.01)
  .setK(numberOfClusters)
  .setMaxIterations(1000)
  .run(denseVectorRDD)

But what is the right number of clusters in our particular use case? With between 500 and 1,000 different articles per 15-minute batch, how many stories can we build? The right question is: how many true events do we think happened within a 15-minute batch window? In fact, optimizing KMeans for news articles is no different from any other use case; it is done by optimizing its associated cost, the cost being the sum of the squared distances (SSE) from the points to their respective centroids.

val wsse = model.computeCost(denseVectorRDD) 

With k equal to the number of articles, the associated cost is 0 (each article is the center of its own cluster). Similarly, with k equal to 1, the cost is at its maximum. The best value of k is therefore the smallest value after which adding a new cluster does not bring any significant gain in the associated cost, usually visible as an elbow in the SSE curve shown in the next figure.
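
The elbow curve itself can be produced with a simple sweep over candidate values of k, training one model per candidate and recording the associated cost (the candidate values below are arbitrary and only meant as a sketch):

// Train one KMeans model per candidate k and record its cost (WSSE) so the
// elbow can be spotted on the resulting (k, cost) curve.
val candidateKs = Seq(10, 50, 100, 200, 300, 500)
val costs = candidateKs.map { k =>
  val candidateModel = new KMeans()
    .setEpsilon(0.01)
    .setK(k)
    .setMaxIterations(1000)
    .run(denseVectorRDD)
  (k, candidateModel.computeCost(denseVectorRDD))
}
costs.foreach { case (k, wsse) => println(s"k=$k cost=$wsse") }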

Using all the 15,000 articles we collected so far, the optimal number of clusters is not obvious here but would probably be around 300.

Figure 9: Elbow method using the cost function

A rule of thumb is to choose k as a function of n (the number of articles), typically k ≈ √(n/2). With over 15,000 articles, following this rule would return k ≈ 100.

We use a value of 100 and start predicting our clusters for each of our data points.

val clusterTitleRDD = articleRDD
  .zip(denseVectorRDD)
  .map { case ((id, article), vector) =>
    (model.predict(vector), article.title)
  }

Although this could be greatly improved, we can confirm that many articles that look similar are grouped within the same stories. We report here some Samsung-related articles belonging to the same cluster:

  • What Samsung can learn from Tylenol, Mattel, and JetBlue...
  • Huawei Mate 9 Appears To Be A Samsung Galaxy Note 7 Clone...
  • In light of the Note 7 debacle, Samsung may be poised to...
  • Samsung's spiralling stock draws investors betting...
  • Note 7 fiasco leaves Samsung's smartphone brand...
  • Samsung's smartphone brand takes beating from Note 7 fiasco...
  • Note 7 fiasco leaves Samsung's smartphone brand in question...
  • Note 7 fiasco leaves Samsung's smartphone brand in question...
  • Samsung's smartphone brand takes beating from Note 7 fiasco...
  • Fiasco leaves Samsung's smartphone brand in question...

Surely these similar articles were not eligible for a Simhash lookup as they differ by more than 1 or 2 bits. A clustering technique can instead be used to group similar (but not duplicate) articles into broader stories. It is worth mentioning that optimizing KMeans is a tedious task that requires many iterations and thorough analysis. This, however, is out of scope here, as we will be focusing on much larger clusters and much smaller datasets in real time.
