In the previous model, we used a window of words before and after the focus word to predict the focus word. The skip-gram model takes a similar approach but reverses the architecture of the neural network. That is, we start with the focus word as the input to the network and then try to predict the surrounding context words using a single hidden layer:
As you can see, the skip-gram model is the exact opposite of the CBOW model. The training goal of the network is to minimize the summed prediction error across all the context words in the output layer. In our example, the input is ideas, and the output layer predicts the surrounding words film, with, plenty, of, smart, regarding, the, impact, of, and alien.
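To make these training pairs concrete, here is a minimal sketch (plain Scala, not taken from the book's code; the object and method names are ours for illustration) that slides a window over the example sentence and emits a (focus, context) pair for each surrounding word:

```scala
// Sketch only: generate skip-gram (focus, context) training pairs by
// pairing each focus word with every word inside a +/- window around it.
object SkipGramPairs {
  def pairs(tokens: Seq[String], window: Int): Seq[(String, String)] =
    tokens.zipWithIndex.flatMap { case (focus, i) =>
      val context = tokens.slice(math.max(0, i - window), i) ++
        tokens.slice(i + 1, math.min(tokens.length, i + 1 + window))
      context.map(c => (focus, c))
    }

  def main(args: Array[String]): Unit = {
    val sentence = "film with plenty of smart ideas regarding the impact of alien"
      .split(" ").toSeq
    // For the focus word "ideas" and a window of 5, this prints the
    // (ideas, context) pairs matching the output layer described above.
    pairs(sentence, window = 5)
      .filter(_._1 == "ideas")
      .foreach(println)
  }
}
```

Running this for the focus word ideas yields exactly the context words listed previously, one pair per surrounding word; during training, each such pair is presented to the network as one (input, target) example.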
In the previous chapter, you saw that we used a tokenization function that removed stopwords such as the, with, to, and so on. We have intentionally kept the stopwords here so that the example reads clearly without losing the reader. In the example that follows, we will apply the same tokenization function as in Chapter 4, Predicting Movie Reviews Using NLP and Spark Streaming, which removes the stopwords.
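As a rough preview, the following is a hedged sketch of what stopword-removing tokenization can look like using Spark ML's built-in RegexTokenizer and StopWordsRemover; the actual helper from Chapter 4 is not reproduced here, and the column names and sample review are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover}

object TokenizeReviews {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TokenizeReviews").master("local[*]").getOrCreate()

    // Illustrative input; in the book this would be the movie review data.
    val reviews = spark.createDataFrame(Seq(
      (0, "Film with plenty of smart ideas regarding the impact of alien contact")
    )).toDF("id", "review")

    // Lowercase and split on non-word characters.
    val tokenizer = new RegexTokenizer()
      .setInputCol("review").setOutputCol("rawTokens")
      .setPattern("\\W+")

    // Drop English stopwords such as "the", "with", and "of".
    val remover = new StopWordsRemover()
      .setInputCol("rawTokens").setOutputCol("tokens")

    val tokens = remover.transform(tokenizer.transform(reviews))
    tokens.select("tokens").show(truncate = false)

    spark.stop()
  }
}
```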