Word2vec has been developed by Tomas Mikolov at Google, around 2012. The original idea behind word2vec was to demonstrate that one might improve efficiency by trading the model's complexity for efficiency. Instead of representing a document as bags of words, word2vec takes each word context into account by trying to analyze n-grams or skip-grams (a set of surrounding tokens with potential the token in question skipped). The words and word contexts themselves are represented by an array of floats/doubles . The objective function is to maximize log likelihood:
Where:
By choosing the optimal and to get a comprehensive word representation (also called map optimization). Similar words are found based on cosine similarity metric (dot product) of . Spark implementation uses hierarchical softmax, which reduces the complexity of computing the conditional probability to , or log of the vocabulary size V, as opposed to , or proportional to V. The training is still linear in the dataset size, but is amenable to big data parallelization techniques.
Word2vec
is traditionally used to predict the most likely word given context or find similar words with a similar meaning (synonyms). The following code trains in word2vec
model on Leo Tolstoy's Wars and Peace, and finds synonyms for the word circle. I had to convert the Gutenberg's representation of War and Peace to a single-line format by running the cat 2600.txt | tr "
" " " > warandpeace.txt
command:
scala> val word2vec = new Word2Vec word2vec: org.apache.spark.mllib.feature.Word2Vec = org.apache.spark.mllib.feature.Word2Vec@58bb4dd scala> val model = word2vec.fit(sc.textFile("warandpeace").map(_.split("\W+").toSeq) model: org.apache.spark.mllib.feature.Word2VecModel = org.apache.spark.mllib.feature.Word2VecModel@6f61b9d7 scala> val synonyms = model.findSynonyms("life", 10) synonyms: Array[(String, Double)] = Array((freedom,1.704344822168997), (universal,1.682276637692245), (conception,1.6776193389148586), (relation,1.6760497906519414), (humanity,1.67601036253831), (consists,1.6637604144872544), (recognition,1.6526169382380496), (subjection,1.6496559771230317), (activity,1.646671198014248), (astronomy,1.6444424059160712)) scala> synonyms foreach println (freedom,1.704344822168997) (universal,1.682276637692245) (conception,1.6776193389148586) (relation,1.6760497906519414) (humanity,1.67601036253831) (consists,1.6637604144872544) (recognition,1.6526169382380496) (subjection,1.6496559771230317) (activity,1.646671198014248) (astronomy,1.6444424059160712)
While in general, it is hard to some with an objective function, and freedom
is not listed as a synonym to life
in the English Thesaurus, the results do make sense.
Each word in the word2vec model is represented as an array of doubles. Another interesting application is to find associations a to b is the same as c to ? by performing subtraction vector(a) - vector(b) + vector(c):
scala> val a = model.getVectors.filter(_._1 == "monarchs").map(_._2).head a: Array[Float] = Array(-0.0044642715, -0.0013227836, -0.011506443, 0.03691717, 0.020431392, 0.013427449, -0.0036369907, -0.013460356, -3.8938568E-4, 0.02432113, 0.014533845, 0.004130258, 0.00671316, -0.009344602, 0.006229065, -0.005442078, -0.0045390734, -0.0038824948, -6.5973646E-4, 0.021729799, -0.011289608, -0.0030690092, -0.011423801, 0.009100784, 0.011765533, 0.0069619063, 0.017540144, 0.011198071, 0.026103685, -0.017285397, 0.0045515243, -0.0044477824, -0.0074411617, -0.023975836, 0.011371289, -0.022625357, -2.6478301E-5, -0.010510282, 0.010622139, -0.009597833, 0.014937023, -0.01298345, 0.0016747514, 0.01172987, -0.001512275, 0.022340108, -0.009758578, -0.014942565, 0.0040697413, 0.0015349758, 0.010246878, 0.0021413323, 0.008739062, 0.007845526, 0.006857361, 0.01160148, 0.008595... scala> val b = model.getVectors.filter(_._1 == "princess").map(_._2).head b: Array[Float] = Array(0.13265875, -0.04882792, -0.08409957, -0.04067986, 0.009084379, 0.121674284, -0.11963971, 0.06699862, -0.20277102, 0.26296946, -0.058114383, 0.076021515, 0.06751665, -0.17419271, -0.089830205, 0.2463593, 0.062816426, -0.10538805, 0.062085453, -0.2483566, 0.03468293, 0.20642486, 0.3129267, -0.12418643, -0.12557726, 0.06725172, -0.03703333, -0.10810595, 0.06692443, -0.046484336, 0.2433963, -0.12762263, -0.18473054, -0.084376186, 0.0037174677, -0.0040220995, -0.3419341, -0.25928706, -0.054454487, 0.09521076, -0.041567303, -0.13727514, -0.04826158, 0.13326299, 0.16228828, 0.08495835, -0.18073058, -0.018380836, -0.15691829, 0.056539804, 0.13673553, -0.027935665, 0.081865616, 0.07029694, -0.041142456, 0.041359138, -0.2304657, -0.17088272, -0.14424285, -0.0030700471, -0... scala> val c = model.getVectors.filter(_._1 == "individual").map(_._2).head c: Array[Float] = Array(-0.0013353615, -0.01820516, 0.007949033, 0.05430816, -0.029520465, -0.030641818, -6.607431E-4, 0.026548808, 0.04784935, -0.006470232, 0.041406438, 0.06599842, 0.0074243015, 0.041538745, 0.0030222891, -0.003932073, -0.03154199, -0.028486902, 0.022139633, 0.05738223, -0.03890591, -0.06761177, 0.0055152955, -0.02480924, -0.053222697, -0.028698998, -0.005315235, 0.0582403, -0.0024816995, 0.031634405, -0.027884213, 6.0290704E-4, 1.9750209E-4, -0.05563172, 0.023785716, -0.037577976, 0.04134448, 0.0026664822, -0.019832063, -0.0011898747, 0.03160933, 0.031184288, 0.0025268437, -0.02718441, -0.07729341, -0.009460656, 0.005344515, -0.05110715, 0.018468754, 0.008984449, -0.0053139487, 0.0053904117, -0.01322933, -0.015247412, 0.009819351, 0.038043085, 0.044905875, 0.00402788... scala> model.findSynonyms(new DenseVector((for(i <- 0 until 100) yield (a(i) - b(i) + c(i)).toDouble).toArray), 10) foreach println (achievement,0.9432423663884002) (uncertainty,0.9187759184842362) (leader,0.9163721499105207) (individual,0.9048367510621271) (instead,0.8992079672038455) (cannon,0.8947818781378154) (arguments,0.8883634101905679) (aims,0.8725107984356915) (ants,0.8593842583047755) (War,0.8530727227924755)
This can be used to find relationships in the language.
Porter Stemmer was first developed around the 1980s and there are many implementations. The detailed steps and original reference are provided at http://tartarus.org/martin/PorterStemmer/def.txt. It consists of roughly 6-9 steps of suffix/endings replacements, some of which are conditional on prefix or stem. I will provide a Scala-optimized version with the book code repository. For example, step 1 covers the majority of stemming cases and consists of 12 substitutions: the last 8 of which are conditional on the number of syllables and the presence of vowels in the stem:
def step1(s: String) = { b = s // step 1a processSubList(List(("sses", "ss"), ("ies","i"),("ss","ss"), ("s", "")), _>=0) // step 1b if (!(replacer("eed", "ee", _>0))) { if ((vowelInStem("ed") && replacer("ed", "", _>=0)) || (vowelInStem("ing") && replacer("ing", "", _>=0))) { if (!processSubList(List(("at", "ate"), ("bl","ble"), ("iz","ize")), _>=0 ) ) { // if this isn't done, then it gets more confusing. if (doublec() && b.last != 'l' && b.last != 's' && b.last != 'z') { b = b.substring(0, b.length - 1) } else if (calcM(b.length) == 1 && cvc("")) { b = b + "e" } } } } // step 1c (vowelInStem("y") && replacer("y", "i", _>=0)) this }
The complete code is available at https://github.com/alexvk/ml-in-scala/blob/master/chapter09/src/main/scala/Stemmer.scala.
3.143.241.133