So, now that we can create vectors that encode the meaning of words, and we know that any given movie review, after tokenization, is an array of N words, we can begin creating a poor man's doc2vec by averaging all of the word vectors that make up the review. Note that by averaging the individual word vectors, we lose the specific sequencing of the words, which, depending on how sensitive your application is to word order, can make a difference:
v(review) = (v(word_1) + v(word_2) + ... + v(word_N)) / count(words in review)
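To make the formula concrete, here is a minimal sketch of the averaging step in plain Scala; the vectors lookup table and the averageVectors name are hypothetical, used only for illustration:
// A minimal sketch of the averaging step; `vectors` is a hypothetical
// lookup table from token to its word vector.
def averageVectors(tokens: Seq[String],
                   vectors: Map[String, Array[Double]]): Array[Double] = {
  val known = tokens.flatMap(vectors.get)   // skip out-of-vocabulary tokens
  require(known.nonEmpty, "no known tokens in review")
  // Element-wise sum, then divide by the number of contributing vectors
  val sum = known.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
  sum.map(_ / known.length)
}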
Ideally, one would use a flavor of doc2vec to create document vectors; however, doc2vec had not yet been implemented in MLlib at the time of writing this book, so for now we are going to use this simple version, which, as you will see, gives surprisingly good results. Fortunately, the Spark ML implementation of the word2vec model already averages the word vectors when the input column contains a list of tokens. For example, we can show that the phrase funny movie has a vector equal to the average of the vectors for the funny and movie tokens:
import spark.implicits._  // enables the toDF conversion on Seq

val testDf = Seq(Seq("funny"), Seq("movie"), Seq("funny", "movie")).toDF("reviewTokens")
w2vModel.transform(testDf).show(truncate = false)
The output shows that the vector computed for the funny movie token list is the element-wise average of the vectors computed for funny and movie individually.
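If you want to verify the equality numerically rather than by eye, a small check like the following works; it assumes the model's output column was named reviewVector via setOutputCol when the model was built, so adjust that name to match your own pipeline:
import org.apache.spark.ml.linalg.Vector

// Collect the three vectors, keyed by their space-joined token list
val vecs = w2vModel.transform(testDf)
  .select("reviewTokens", "reviewVector")   // "reviewVector" is an assumed name
  .collect()
  .map(r => r.getSeq[String](0).mkString(" ") -> r.getAs[Vector](1).toArray)
  .toMap

// The phrase vector should match the element-wise mean of the word vectors
val manual  = vecs("funny").zip(vecs("movie")).map { case (a, b) => (a + b) / 2 }
val maxDiff = manual.zip(vecs("funny movie")).map { case (a, b) => math.abs(a - b) }.max
println(s"Largest component difference: $maxDiff")   // expect a value near 0.0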
Hence, we can prepare our simple version of doc2vec with a single model transformation:
// Append the averaged document vector to each review row
val inputData = w2vModel.transform(movieReviews)
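As a quick follow-up, it can be worth confirming that the transformation appended a vector column next to the original review columns; the exact column names depend on how the model was configured earlier in the chapter:
// Each review row now carries an extra vector column (its name was set
// by the setOutputCol call made when the word2vec model was defined)
inputData.printSchema()
inputData.show(3)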