First, let's consider a simple movie review, which will act as our base example in the next few sections:
Now, imagine that we have a sliding window that includes the word currently in focus (highlighted in red in the following image), in addition to the five words before and after the focus word (highlighted in yellow):
The words in yellow form the context that surrounds the current focus word, ideas. These context words act as inputs to our feed-forward neural network, which has one hidden layer and one output layer; each context word is encoded via one-hot-encoding (a vector the length of the vocabulary with a 1 at the word's index and all other elements zeroed out):
In the preceding diagram, the total size of our vocabulary (that is, post-tokenization) is denoted by a capital C. We one-hot-encode each word within the context window; in this case, the five words before and after our focus word, ideas. At this point, we propagate our encoded vectors to the hidden layer via a weighted sum, just as in a normal feed-forward neural network, where we specify beforehand the number of neurons in the hidden layer. Finally, a softmax function is applied between the single hidden layer and the output layer, which attempts to predict the current focus word. This is achieved by maximizing the conditional probability of observing the focus word (ideas), given the context of its surrounding words (film, with, plenty, of, smart, regarding, the, impact, of, and alien). Notice that the output layer is the same size as our initial vocabulary, C.
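To make the mechanics concrete, the following is a minimal, self-contained sketch of the forward pass just described: the context words are one-hot encoded against a toy vocabulary, projected into the hidden layer via a weighted sum (which, for one-hot inputs, reduces to averaging rows of the input weight matrix), and a softmax over the output layer assigns a probability to every word in the vocabulary. The toy vocabulary, the hidden-layer size, and the random weights are assumptions for illustration only; training (not shown) would adjust the weights so that the probability of the true focus word rises:

```scala
import scala.util.Random

object CbowForwardPass {
  def main(args: Array[String]): Unit = {
    // Toy vocabulary built from the context words above (a real C spans the whole corpus)
    val vocabulary = Seq("alien", "film", "ideas", "impact", "of",
                         "plenty", "regarding", "smart", "the", "with")
    val index: Map[String, Int] = vocabulary.zipWithIndex.toMap
    val C = vocabulary.size            // vocabulary size (input and output layer size)
    val H = 4                          // hidden-layer size, chosen beforehand
    val rng = new Random(42)

    // Randomly initialized weights: input -> hidden (C x H) and hidden -> output (H x C)
    val wIn  = Array.fill(C, H)(rng.nextGaussian() * 0.1)
    val wOut = Array.fill(H, C)(rng.nextGaussian() * 0.1)

    // The sliding window around the focus word "ideas": five words before and five after
    val context = Seq("film", "with", "plenty", "of", "smart",
                      "regarding", "the", "impact", "of", "alien")
    val focus = "ideas"

    // One-hot encoding: a length-C vector with a single 1 at the word's index
    def oneHot(word: String): Array[Double] = {
      val v = Array.fill(C)(0.0)
      v(index(word)) = 1.0
      v
    }

    // Weighted sum of the one-hot inputs into the hidden layer; because each input
    // has exactly one non-zero element, this amounts to averaging rows of wIn
    val hidden = Array.fill(H)(0.0)
    context.foreach { w =>
      val x = oneHot(w)
      for (j <- 0 until H; i <- 0 until C)
        hidden(j) += x(i) * wIn(i)(j) / context.size
    }

    // Score every vocabulary word at the output layer, then softmax into probabilities
    val scores   = Array.tabulate(C)(c => (0 until H).map(j => hidden(j) * wOut(j)(c)).sum)
    val expTotal = scores.map(math.exp).sum
    val probs    = scores.map(s => math.exp(s) / expTotal)

    println(f"P($focus | context) = ${probs(index(focus))}%.4f")
  }
}
```

With random weights, the printed probability sits near 1/C; word2vec training repeatedly nudges wIn and wOut to increase it, and the rows of wIn end up being the word vectors we are after.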
Herein lies the interesting property of both families of word2vec algorithms: word2vec is an unsupervised learning algorithm at heart (no labeled data is required), yet it learns individual word vectors by solving a supervised prediction task. This is true for the CBOW model and also for the skip-gram model, which we will cover next. Note that, at the time of writing this book, Spark's MLlib only incorporates the skip-gram model of word2vec.
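As a point of reference, here is a minimal sketch of fitting that skip-gram model with Spark's DataFrame-based Word2Vec estimator (org.apache.spark.ml.feature.Word2Vec). The single-review toy corpus, the vector size, and the query word are illustrative assumptions only; a real corpus would contain many tokenized documents:

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SparkSession

object SkipGramExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SkipGramExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A toy corpus of one tokenized snippet; in practice, load and tokenize many reviews
    val reviews = Seq(
      "film with plenty of smart ideas regarding the impact of alien".split(" ").toSeq
    ).map(Tuple1.apply).toDF("tokens")

    // Configure the skip-gram word2vec estimator
    val word2Vec = new Word2Vec()
      .setInputCol("tokens")
      .setOutputCol("features")
      .setVectorSize(50)   // dimensionality of the learned word vectors
      .setWindowSize(5)    // five words before and after the focus word
      .setMinCount(0)      // keep every token in this tiny example

    val model = word2Vec.fit(reviews)

    // Inspect the learned vectors and query the nearest neighbors of a word
    model.getVectors.show(truncate = false)
    model.findSynonyms("ideas", 3).show()

    spark.stop()
  }
}
```

The findSynonyms call performs a cosine-similarity lookup over the learned vectors; on a corpus this small the neighbors are essentially noise, but the call shape is identical for a full dataset.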