Word2Vec

Word2Vec is a neural-network-based natural language processing tool that uses a technique called skip-grams to convert the words of a sentence into an embedded vector representation. Let's see how this works with a collection of sentences about animals:

  • A dog was barking
  • Some cows were grazing the grass
  • Dogs usually bark randomly
  • The cow likes grass

Using a neural network with a hidden layer (a machine learning technique used in many unsupervised learning applications), we can learn, given enough examples, that dog and barking are related and that cow and grass are related, in the sense that they frequently appear close to each other. This closeness is measured by probabilities. The output of Word2Vec is a vector of Double features.
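To make the idea of words that appear close to each other more concrete, here is a minimal sketch in plain Scala (no Spark required) of how skip-gram style (target, context) pairs might be generated from the sentences above. The skipGramPairs helper, the window size of 2, and the naive lowercase-and-split tokenization are assumptions made for illustration; Spark's internal implementation differs.

```scala
// Hypothetical helper: emits a (target, context) pair for every word
// and each neighbor within `window` positions of it.
def skipGramPairs(tokens: Seq[String], window: Int): Seq[(String, String)] =
  for {
    i <- tokens.indices
    j <- math.max(0, i - window) to math.min(tokens.length - 1, i + window)
    if j != i
  } yield (tokens(i), tokens(j))

val sentences = Seq(
  "A dog was barking",
  "Some cows were grazing the grass",
  "Dogs usually bark randomly",
  "The cow likes grass"
)

// Naive tokenization (an assumption): lowercase and split on whitespace.
val pairs = sentences.flatMap { s =>
  skipGramPairs(s.toLowerCase.split("\\s+").toSeq, window = 2)
}

// Count how often each pair co-occurs. Training on many such pairs is what
// pulls frequently co-occurring words toward each other in the vector space.
val pairCounts = pairs.groupBy(identity).map { case (pair, occurrences) =>
  (pair, occurrences.size)
}

pairs.take(5).foreach(println)
```

With a window of 2, dog is paired with barking and cow is paired with grass, which is the kind of co-occurrence evidence the model learns from.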

To use Word2Vec, you need to import the class:

import org.apache.spark.ml.feature.Word2Vec

First, initialize a Word2Vec Estimator, specifying the input column and the output column. Here, we choose the words column created by the Tokenizer and generate an output column, wordvector, holding a word vector of size 3. Setting the minimum count to 0 keeps every word in the vocabulary, however rarely it occurs:

scala> val word2Vec = new Word2Vec().setInputCol("words").setOutputCol("wordvector").setVectorSize(3).setMinCount(0)
word2Vec: org.apache.spark.ml.feature.Word2Vec = w2v_fe9d488fdb69

Next, invoking the fit() function on the input dataset yields a Word2VecModel, which is a Transformer:

scala> val word2VecModel = word2Vec.fit(noStopWordsDF)
word2VecModel: org.apache.spark.ml.feature.Word2VecModel = w2v_fe9d488fdb69
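Once a model has been fitted, it can be inspected directly. The following is a minimal, self-contained sketch that assumes Spark MLlib is on the classpath; the local SparkSession, the wordsDF stand-in dataset, and the application name are assumptions for illustration and are not the noStopWordsDF used in this section. It uses the getVectors and findSynonyms methods of Word2VecModel.

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation.
// In spark-shell, a `spark` session already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("Word2VecSketch")
  .getOrCreate()
import spark.implicits._

// Stand-in data shaped like the tokenized "words" column (hypothetical).
val wordsDF = Seq(
  (1, Seq("i", "am", "new", "to", "machine", "learning")),
  (2, Seq("coffee", "is", "best", "when", "you", "drink", "it", "hot"))
).toDF("id", "words")

val model = new Word2Vec()
  .setInputCol("words")
  .setOutputCol("wordvector")
  .setVectorSize(3)
  .setMinCount(0)
  .fit(wordsDF)

// One row per vocabulary word, with columns "word" and "vector".
model.getVectors.show(false)

// The 2 words closest to "coffee", with columns "word" and "similarity".
model.findSynonyms("coffee", 2).show(false)
```

On a corpus this small, the similarities are not meaningful; the sketch only shows where the learned word vectors live and how to query them.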

Further, invoking the transform() function on the input dataset yields an output dataset:

scala> val word2VecDF = word2VecModel.transform(noStopWordsDF)
word2VecDF: org.apache.spark.sql.DataFrame = [id: int, sentence: string ... 3 more fields]

The following is the output dataset, showing the id and sentence columns, the intermediate words and filteredWords columns, and the output column wordvector:

scala> word2VecDF.show(false)
|id|sentence |words |filteredWords |wordvector |
|1 |Hello there, how do you like the book so far? |[hello, there,, how, do, you, like, the, book, so, far?] |[hello, there,, like, book, far?] |[0.006875938177108765,-0.00819675214588642,0.0040686681866645815]|
|2 |I am new to Machine Learning |[i, am, new, to, machine, learning] |[new, machine, learning] |[0.026012470324834187,0.023195965060343344,-0.10863214979569116] |
|3 |Maybe i should get some coffee before starting |[maybe, i, should, get, some, coffee, before, starting] |[maybe, get, coffee, starting] |[-0.004304863978177309,-0.004591284319758415,0.02117823390290141]|
|4 |Coffee is best when you drink it hot |[coffee, is, best, when, you, drink, it, hot] |[coffee, best, drink, hot] |[0.054064739029854536,-0.003801364451646805,0.06522738828789443] |
|5 |Book stores have coffee too so i should go to a book store|[book, stores, have, coffee, too, so, i, should, go, to, a, book, store]|[book, stores, coffee, go, book, store]|[-0.05887459063281615,-0.07891856770341595,0.07510609552264214] |
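Each row above has a single vector even though the sentence contains several words. This is because Word2VecModel represents a row by averaging the vectors of the words in its input column. The following plain Scala sketch illustrates that averaging step; the averageVector helper and the vector values are hypothetical, and Spark's exact handling of edge cases such as out-of-vocabulary words may differ.

```scala
// Hypothetical 3-dimensional word vectors; real values come from a trained model.
val wordVectors: Map[String, Array[Double]] = Map(
  "new"      -> Array(0.5, 0.0, -0.5),
  "machine"  -> Array(0.0, 0.25, -1.0),
  "learning" -> Array(1.0, 0.5, 0.0)
)

// Hypothetical helper: averages, component by component, the vectors of the
// words that have an entry in `vectors`. Returns a zero vector if none do.
def averageVector(
    words: Seq[String],
    vectors: Map[String, Array[Double]],
    size: Int): Array[Double] = {
  val known = words.flatMap(vectors.get)
  if (known.isEmpty) Array.fill(size)(0.0)
  else (0 until size).map(i => known.map(_(i)).sum / known.size).toArray
}

val sentenceVector = averageVector(Seq("new", "machine", "learning"), wordVectors, 3)
println(sentenceVector.mkString("[", ",", "]"))
```

Because the result is an average, two sentences that share most of their words end up with similar vectors, regardless of word order.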

The following diagram of the Word2Vec features shows the words being converted into a vector:
