Twitter and the Godwin point

With our text content properly cleaned up, we can feed it into a Word2Vec algorithm and attempt to understand words in their actual context.

Learning context

As it says on the tin, the Word2Vec algorithm transforms a word into a vector. The idea is that words used in similar contexts are embedded close to one another in the vector space and, as such, appear contextually related.

Well integrated into Spark, a Word2Vec model can be trained as follows:

import org.apache.spark.mllib.feature.Word2Vec

val corpusRDD = tweetRDD
  .map(_.body.split("\\s").toSeq)
  .filter(_.distinct.length >= 4)

val model = new Word2Vec().fit(corpusRDD)

Here we extract each tweet as a sequence of words, keeping only records with at least four distinct words. Note that the list of all words needs to fit in memory, as the fitted model is collected back to the driver as a map of word to vector (an array of floats). The vector size and the learning rate can be tuned through the setVectorSize and setLearningRate methods respectively.
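For reference, here is what such tuning could look like. The values below are purely illustrative and are not recommendations for this particular corpus:

import org.apache.spark.mllib.feature.Word2Vec

// Illustrative settings only (the defaults are vectorSize = 100 and
// learningRate = 0.025)
val tunedModel = new Word2Vec()
  .setVectorSize(200)     // wider embeddings capture more context
  .setLearningRate(0.01)  // smaller learning steps
  .setMinCount(5)         // ignore words seen fewer than five times
  .setNumIterations(3)    // extra passes over the corpus
  .fit(corpusRDD)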

Next, we use a Zeppelin notebook to interact with our model, sending it different words and asking for their closest synonyms. The results are quite impressive:

model.findSynonyms("#lockherup", 10).foreach(println)

/*
(#hillaryforprison,2.3266071900089313)
(#neverhillary,2.2890002973310066)
(#draintheswamp,2.2440446323298175)
(#trumppencelandslide,2.2392471034643604)
(#womenfortrump,2.2331140131326874)
(#trumpwinsbecause,2.2182999853485454)
(#imwithhim,2.1950198833564563)
(#deplorable,2.1570936207197016)
(#trumpsarmy,2.155859656266577)
(#rednationrising,2.146132149205829)
*/  

While hashtags generally pass through standard NLP pipelines unnoticed, they make a major contribution to tone and emotion. A tweet marked as neutral can, in fact, be much harsher than it sounds when it carries hashtags like #HillaryForPrison or #LockHerUp. So, let's attempt to take this into account using an interesting feature called word-vector association. A common example of this association, given by the original Word2Vec algorithm, is shown here:

[KING] is to [MAN] what [QUEEN] is to [?????]

This can be translated into the following vector arithmetic:

VKING - VQUEEN = VMAN - V???? 
V???? = VMAN - VKING + VQUEEN 

The nearest point should therefore be [WOMAN]. Technically speaking, this can be implemented as follows:

import org.apache.spark.mllib.linalg.Vectors

def association(word1: String, word2: String, word3: String) = {

  // Vector capturing the relation from word1 to word2
  val isTo = model
    .getVectors
    .get(word2)
    .get
    .zip(model.getVectors.get(word1).get)
    .map(t => t._1 - t._2)

  // Vector of the word we want to transpose the relation onto
  val what = model
    .getVectors
    .get(word3)
    .get

  // Apply the relation to word3
  val vec = isTo
    .zip(what)
    .map(t => t._1 + t._2)
    .map(_.toDouble)

  Vectors.dense(vec)

}
 
val assoc = association("trump", "republican", "clinton")

model.findSynonyms(assoc, 1)
     .foreach(println)

// (democrat,1.6838367309269164) 

Saving/retrieving this model can be done as follows:

import org.apache.spark.mllib.feature.Word2VecModel

model.save(sc, "/path/to/word2vec")

val retrieved = Word2VecModel.load(sc, "/path/to/word2vec")

Visualizing our model

As our vectors are 100 dimensions wide, they are difficult to plot using traditional methods. However, you may have come across the TensorFlow project and its recently open-sourced Embedding Projector (http://projector.tensorflow.org/). This project offers a nice way to visualize our models thanks to its ability to quickly render high-dimensional data. It is easy to use as well: we simply export our vectors as tab-separated data points, load them into a web browser, and voila!
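The projector expects one tab-separated vector per line, plus an optional metadata file with the corresponding labels. A minimal sketch of such an export, run on the driver (the file names below are arbitrary):

import java.io.PrintWriter

// vectors.tsv holds one tab-separated vector per line; metadata.tsv holds
// the matching word labels, one per line, in the same order
val vectors = new PrintWriter("vectors.tsv")
val labels  = new PrintWriter("metadata.tsv")

model.getVectors.foreach { case (word, vector) =>
  vectors.println(vector.mkString("\t"))
  labels.println(word)
}

vectors.close()
labels.close()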

Figure 7: Embedding Projector, neighbours of Computer

The Embedding Projector projects high-dimensional vectors onto a 3D space, where each dimension represents one of the first three principal components (PCA). We can also build our own projection, where we basically stretch our vectors toward four specific directions. In the following representation, we stretch our vectors left, right, up, and down toward [Trump], [Clinton], [Love], and [Hate]:

Figure 8: Embedding Projector, custom projection
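The same kind of custom projection can be approximated outside the projector by scoring every word against two axes built from the anchor vectors. The following is only a sketch: the axis and dot helpers are ours, and we assume all four anchor words exist in the vocabulary:

// Build an axis as the difference between two anchor word vectors
def axis(pos: String, neg: String): Array[Float] =
  model.getVectors(pos)
    .zip(model.getVectors(neg))
    .map { case (p, n) => p - n }

def dot(a: Array[Float], b: Array[Float]): Double =
  a.zip(b).map { case (x, y) => (x * y).toDouble }.sum

val xAxis = axis("trump", "clinton")
val yAxis = axis("love", "hate")

// Score every word against both axes
val projected = model.getVectors.map { case (word, vector) =>
  (word, dot(vector, xAxis), dot(vector, yAxis))
}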

Now that we have a greatly simplified vector space, we can more easily understand each word and how it relates to its neighbors (democrat versus republican and love versus hate). For example, with the French election coming up next year, we see that France is closer to Trump than it is to Clinton. Could this be seen as an early indicator of the upcoming election?

Word2Graph and Godwin point

You don't have to play around with the Twitter Word2Vec model for very long before you come across sensitive terms and references to World War II. In fact, this phenomenon was originally described by Mike Godwin in 1990 as Godwin's Law (https://www.wired.com/1994/10/godwin-if-2/), which states the following:

As an online discussion grows longer, the probability of a comparison involving Nazis or Hitler approaches 1

As of 2012, it is even part of the Oxford English Dictionary.

Figure 9: Godwin's Law

Building a Word2Graph

Although more of a rhetorical device than an actual mathematical law, Godwin's Law remains a fascinating anomaly and seems relevant to the US election. Naturally, we decided to explore the idea further using graph theory. The first step is to broadcast our model back to the executors and parallelize our list of words. For each word, we output the top five synonyms and build an Edge object with the word similarity as the edge weight. Let's take a look:

// Broadcast the model and a word -> vertex ID dictionary to the executors
val bModel = sc.broadcast(model)
val bDictionary = sc.broadcast(
  model.getVectors
    .keys
    .toList
    .zipWithIndex
    .map(l => (l._1, l._2.toLong + 1L))
    .toMap
)

import org.apache.spark.graphx._

// Only keep words longer than three characters
val wordRDD = sc.parallelize(
  model.getVectors
    .keys
    .toSeq
    .filter(s => s.length > 3)
)

// For each word, emit an edge towards each of its top five synonyms,
// weighted by the normalized similarity
val word2EdgeRDD = wordRDD.mapPartitions { it =>
  val model = bModel.value
  val dictionary = bDictionary.value

  it.flatMap { from =>
    val synonyms = model.findSynonyms(from, 5)
    val tot = synonyms.map(_._2).sum
    synonyms.map { case (to, sim) =>
      val norm = sim / tot
      Edge(dictionary(from), dictionary(to), norm)
    }
  }
}

val word2Graph = Graph.fromEdges(word2EdgeRDD, 0L)

word2Graph.cache()
word2Graph.vertices.count()

val word2Graph = Graph.fromEdges(word2EdgeRDD, 0L)

word2Graph.cache()
word2Graph.vertices.count()

To prove Godwin's Law, we have to show that, no matter the input node, we can always find a path from that node to the Godwin point. In mathematical terms, this requires the graph to be ergodic. With more than one connected component, our graph cannot be ergodic, as some nodes would never lead to the Godwin point. Therefore:

val cc = word2Graph
  .connectedComponents()
  .vertices
  .values
  .distinct
  .count

println(s"Do we still have faith in humanity? ${cc > 1L}")
// false

As we only have one connected component, the next step is to compute the shortest path from each node to the Godwin point:

import org.apache.spark.graphx.lib.ShortestPaths

// godwin holds the vertex ID of the Godwin point in our word dictionary
val shortestPaths = ShortestPaths.run(word2Graph, Seq(godwin))

The shortest path algorithm is quite simple and can easily be implemented with Pregel using the same techniques described in Chapter 7, Building Communities. The basic approach is to start Pregel on the target node (our Godwin point) and send a message back along its incoming edges, incrementing a counter at each hop. Each node always keeps the smallest counter it has seen and propagates this value further along its own incoming edges. The algorithm stops when no more messages are exchanged.
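As a rough illustration, a minimal Pregel version of this backward propagation could look as follows. This is only a sketch of the idea (here godwin is the vertex ID of the Godwin point, and the built-in ShortestPaths used above remains the convenient option):

import org.apache.spark.graphx._

// Every vertex starts at an "infinite" distance, except the Godwin point
val initialGraph = word2Graph.mapVertices { (vid, _) =>
  if (vid == godwin) 0 else Int.MaxValue
}

val hops = initialGraph.pregel(Int.MaxValue, activeDirection = EdgeDirection.In)(
  // Vertex program: keep the smallest hop count seen so far
  (_, current, received) => math.min(current, received),
  // Send messages backwards along incoming edges, one more hop away
  triplet => {
    if (triplet.dstAttr != Int.MaxValue && triplet.dstAttr + 1 < triplet.srcAttr) {
      Iterator((triplet.srcId, triplet.dstAttr + 1))
    } else {
      Iterator.empty
    }
  },
  // Merge messages: keep the minimum hop count
  (a, b) => math.min(a, b)
)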

We normalize this distance using a Godwin depth of 16, calculated as the maximum over all the shortest paths:

val depth = sc.broadcast(
  shortestPaths.vertices
    .values
    .filter(_.nonEmpty)
    .map(_.values.min)
    .max()
)

logInfo(s"Godwin depth is [${depth.value}]")
// 16
 
shortestPaths.vertices.map { case (vid, hops) =>
  if (hops.nonEmpty) {
    val godwin = Option(
      math.min(hops.values.min / depth.value.toDouble, 1.0)
    )
    (vid, godwin)
  } else {
    (vid, None: Option[Double])
  }
}
.filter(_._2.isDefined)
.map { case (vid, distance) =>
  (vid, distance.get)
}
.collectAsMap()

The following figure shows a depth of 4 - we normalize the scores of 0, 1, 2, 3, and 4 to 0.0, 0.25, 0.5, 0.75, and 1.0 respectively:

Figure 10: The normalized Godwin distance

Finally, we collect each vertex with its associated distance as a map. We can easily sort this collection from the most to the least-sensitive word, but we will not report our findings here (for obvious reasons!).

On November 7 and 8, 2016, this map contained all the words from our Twitter dictionary, implying full ergodicity. According to Godwin's Law, any word, given enough time, can lead to the Godwin point. We will use this map later in the chapter when we build features from Twitter text content.

Random walks

One way to simulate random walks over our Word2Vec graph is to treat it as a Markov chain. Given a transition matrix T and a walk of N steps, we compute the matrix power T^N. Given a state S1 (meaning a word w1), we can then read off the probability distribution of jumping from S1 to any state SN in N transitions. In practice, given a dictionary of ~100k words, a dense representation of such a transition matrix would require around 50 GB of memory. We can easily build a sparse representation of T using the IndexedRowMatrix class from MLlib:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

val size = sc.broadcast(
  word2Graph
    .vertices
    .count()
    .toInt
)

// One sparse row per word, holding the transition probabilities
// towards its top synonyms
val indexedRowRDD = word2Graph.edges
  .map { edge =>
    (edge.srcId, (edge.dstId.toInt, edge.attr))
  }
  .groupByKey()
  .map { case (id, it) =>
    new IndexedRow(id, Vectors.sparse(size.value, it.toSeq))
  }

val m1 = new IndexedRowMatrix(indexedRowRDD)

// m2 would have to be a local, dense Matrix (see the discussion below)
val m3 = m1.multiply(m2)

Unfortunately, there is no built-in method in Spark to perform distributed matrix multiplication with sparse support. The m2 matrix therefore needs to be a local dense matrix and must fit in memory. A solution could be to decompose the matrix (using SVD) and play with the symmetric property of the Word2Vec matrix (if word w1 is a synonym of w2, then w2 is a synonym of w1) in order to simplify this process. Using simple matrix algebra, one can show that, given a matrix M with singular value decomposition:

M = U × S × V^T

and M symmetric, then:

M^n = U × S^n × U^T
M^n = U × S^n × V^T

for even and odd values of n respectively. In theory, we would only need to compute the powers of S, which is a diagonal matrix. In practice, this requires a lot of effort and is computationally expensive for little real value (all we want is to generate random word associations). Instead, we generate random walks using our Word2Vec graph, the Pregel API, and a Monte Carlo simulation. This will generate word associations starting from the seed word love. The algorithm stops after 100 iterations or when a path reaches our Godwin point. The details of this algorithm can be found in our code repository.

Godwin.randomWalks(graph, "love", 100) 
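The repository implementation is not reproduced here, but the idea can be conveyed with a simple driver-side Monte Carlo sketch. Note that randomWalk below is a hypothetical helper, not the repository's API, and that the repository version also stops as soon as the walk reaches the Godwin point:

import scala.util.Random

// Hypothetical helper: walk at most maxHops steps from a seed word,
// choosing each next word among its top 5 synonyms with probability
// proportional to the similarity weight
def randomWalk(seed: String, maxHops: Int): List[String] = {
  val path = scala.collection.mutable.ListBuffer(seed)
  var current = seed
  var hops = 0
  while (hops < maxHops && model.getVectors.contains(current)) {
    val candidates = model.findSynonyms(current, 5)
    val total = candidates.map(_._2).sum
    var draw = Random.nextDouble() * total
    current = candidates
      .find { case (_, weight) =>
        draw -= weight
        draw <= 0
      }
      .map(_._1)
      .getOrElse(candidates.last._1)
    path += current
    hops += 1
  }
  path.toList
}

randomWalk("love", 100)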

Tip

It is also worth mentioning that a matrix M is said to be ergodic (hence also proving Godwin's Law) if there exists an integer n such that all entries of M^n are strictly positive.
