Now that we have ingested our data and created our labels, it's time to extract our features to build our binary classification model. As its name suggests, the bag-of-words approach is a very common feature-extraction technique whereby we take a piece of text, in this case a movie review, and represent it as a bag (aka multiset) of its words and grammatical tokens. Let's look at an example using a few movie reviews:
Review 1: Jurassic World was such a flop!
Review 2: Titanic ... an instant classic. Cinematography was as good as the acting!!
For each token (which can be a word or a punctuation mark), we will create a feature and then count the occurrences of that token throughout the document. Here's what our bag-of-words dataset would look like for the first review:
| Review ID | a | flop | Jurassic | such | World | ! |
| --- | --- | --- | --- | --- | --- | --- |
| Review 1 | 1 | 1 | 1 | 1 | 1 | 1 |
First, notice the arrangement of this dataset, often called a document-term matrix: each document (row) is described by the set of words (terms) that make up the columns of this two-dimensional matrix. We could also transpose the rows and columns to create (you guessed it) a term-document matrix, in which each column lists the documents that contain a particular term and the numbers inside the cells are the counts. Also, notice that the columns are ordered alphabetically rather than by where the words appear in the review, which means we lose any sense of word order. As far as this representation is concerned, the word "flop" is no more or less similar to "Jurassic" than to any other token, and while we know this is not true, it highlights one of the limitations of the bag-of-words approach: word order is lost, and sometimes different documents can have the same representation while meaning totally different things.
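Before moving on, here is a minimal sketch of how such a document-term matrix could be assembled with Spark's ML feature transformers. The regex pattern, column names, and local SparkSession are illustrative assumptions, and a real tokenizer will keep every token it finds (including words such as "was" and "the"), so its vocabulary may differ slightly from the hand-built tables in this section:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.{CountVectorizer, RegexTokenizer}

val spark = SparkSession.builder()
  .appName("bag-of-words-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// The two toy reviews from this section
val reviews = Seq(
  ("Review 1", "Jurassic World was such a flop!"),
  ("Review 2", "Titanic ... an instant classic. Cinematography was as good as the acting!!")
).toDF("reviewId", "text")

// Split each review into word and punctuation tokens (the pattern is an illustrative choice)
val tokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("tokens")
  .setPattern("""\w+|[^\w\s]""")
  .setGaps(false) // the pattern matches tokens rather than the gaps between them

// Count the occurrences of each token per document: one row of the matrix per review
val vectorizer = new CountVectorizer()
  .setInputCol("tokens")
  .setOutputCol("features")

val tokenized = tokenizer.transform(reviews)
val bowModel  = vectorizer.fit(tokenized)

bowModel.transform(tokenized).select("reviewId", "features").show(truncate = false)
println(bowModel.vocabulary.mkString(", ")) // the term columns Spark learned from the data
```

Each row of the resulting features column is one row of the document-term matrix, just stored in Spark's sparse vector format rather than as a wide table of mostly zeros.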
In the next chapter, you will learn about an extremely powerful learning algorithm pioneered at Google and included in Spark called word-to-vector (word2vec), which maps each term to a dense numeric vector that "encodes" its meaning.
Second, notice that for our given review of six tokens (including punctuation), we have six columns. Suppose we added the second review to our document-term matrix; how would our original bag-of-words representation change?
| Review ID | a | acting | an | as | Cinematography | classic | flop | good | instant | Jurassic | such | Titanic | was | World | . | ! |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Review 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
| Review 2 | 0 | 1 | 1 | 2 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 2 |
We nearly tripled our original number of features, from six to 16 tokens, which brings us to another consideration with this approach. Given that we must create a feature for every token, it's not difficult to see that we will soon have an extremely wide and very sparse matrix representation (sparse because no single document will contain every word, symbol, emoticon, and so on, so most of the cells will be zero). This poses some interesting problems with respect to dimensionality for our algorithms.
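To make that sparsity concrete, here is a small sketch (assuming Spark's ml.linalg vector types) of Review 1's row from the 16-column table above, stored as a sparse vector that records only the non-zero cells:

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Review 1's row of the 16-column document-term matrix above: only 6 of the
// 16 cells are non-zero, so we store just those positions and their counts.
// Column indices follow the table order: 0 = "a", 6 = "flop", 9 = "Jurassic",
// 10 = "such", 13 = "World", 15 = "!".
val review1Row: Vector = Vectors.sparse(
  16,
  Array(0, 6, 9, 10, 13, 15),         // positions of the non-zero counts
  Array(1.0, 1.0, 1.0, 1.0, 1.0, 1.0) // the counts themselves
)

println(review1Row)         // (16,[0,6,9,10,13,15],[1.0,1.0,1.0,1.0,1.0,1.0])
println(review1Row.toDense) // all 16 cells, most of them 0.0
```

With hundreds of thousands of token columns, this kind of sparse storage is what keeps the matrix manageable in memory.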
Consider the situation where we are trying to train a random forest using a bag-of-words approach on text data with more than 200,000 distinct tokens, where most of the inputs will be zero. Recall that a tree-based learner makes "go left or go right" decisions at each split, and how it does so depends on the feature type. In a bag-of-words setting, features can be encoded as true or false (that is, whether the document contains the term) or as occurrence counts (that is, how many times the document contains the term). For each successive branch in our tree, the algorithm must consider all of these features (or at least the square root of the number of features, in the case of a random forest), even though the feature space is extremely wide and sparse, and make a decision that influences the overall outcome.
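As a sketch of those two flavours of bag-of-words features, and assuming the tokenized DataFrame from the earlier snippet, Spark's CountVectorizer can emit either raw term counts or simple presence/absence flags:

```scala
import org.apache.spark.ml.feature.CountVectorizer

// Occurrence counts: how many times each term appears in the document
val countFeatures = new CountVectorizer()
  .setInputCol("tokens")
  .setOutputCol("termCounts")
  .fit(tokenized)
  .transform(tokenized)

// Presence/absence flags: 1.0 if the document contains the term at all, 0.0 otherwise
val binaryFeatures = new CountVectorizer()
  .setInputCol("tokens")
  .setOutputCol("termFlags")
  .setBinary(true)
  .fit(tokenized)
  .transform(tokenized)
```

Either way, the tree learner still sees one column per term, which is why the width and sparsity of this matrix matter so much.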
Luckily, in the next section, you will learn how Spark deals with this type of dimensionality and sparsity, along with some steps we can take to reduce the number of features.