Document classification with and without GloVe

In this example, we're going to use a somewhat famous text classification problem known as the 20 newsgroup problem (http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.html). In this problem, we are given 19,997 documents, each belonging to a newsgroup. Our goal is to use the text of each post to predict which newsgroup it belongs to. For the millennials among us, a newsgroup is sort of the precursor to Reddit (but it's probably closer to the great-great-great-grandfather of Reddit). The topics covered in those newsgroups vary greatly and include such topics as politics, religion, and operating systems, all of which you should avoid discussing in polite company. These posts are fairly long, and there are 174,074 unique words in the corpus.
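
If you'd like to follow along, one quick way to get a copy of the corpus is scikit-learn's built-in loader. Be aware that scikit-learn ships a lightly deduplicated version of the dataset, so the document and vocabulary counts it reports will differ slightly from the figures above; the sketch below, including the use of the Keras Tokenizer to count unique words, is just one illustrative way to load and inspect the data.

```python
from sklearn.datasets import fetch_20newsgroups
from tensorflow.keras.preprocessing.text import Tokenizer

# Fetch the full corpus; scikit-learn's copy is a lightly
# deduplicated version of the original 19,997-document CMU dataset
newsgroups = fetch_20newsgroups(subset="all")
print(f"{len(newsgroups.data)} documents, "
      f"{len(newsgroups.target_names)} newsgroups")

# Fit a tokenizer on the raw text to count the unique words
tokenizer = Tokenizer()
tokenizer.fit_on_texts(newsgroups.data)
print(f"{len(tokenizer.word_index)} unique words")
```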

This time, we're going to build two versions of the model. In the first version, we will use an embedding layer and learn the embedding space, just as we did in the previous example. In the second version, we will use GloVe vectors as the weights for the embedding layer. We'll then spend some time at the end comparing and contrasting the two methods.
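
As a preview of the second version, here's a minimal sketch of how pre-trained GloVe vectors can be wired into a Keras Embedding layer. The file name glove.6B.100d.txt, the 100-dimensional choice, and the tokenizer carried over from the loading sketch above are all assumptions for illustration; the key idea is to build a weight matrix aligned with our word index and freeze it.

```python
import numpy as np
from tensorflow.keras.layers import Embedding
from tensorflow.keras.initializers import Constant

EMBEDDING_DIM = 100  # assumes the 100d GloVe file; match your download

# Parse the GloVe text file into a {word: vector} dictionary
embeddings_index = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:  # assumed local path
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

# Build a weight matrix whose row i holds the GloVe vector for word i
# in the tokenizer's word index; words missing from GloVe stay all-zero
embedding_matrix = np.zeros((len(tokenizer.word_index) + 1, EMBEDDING_DIM))
for word, i in tokenizer.word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

# Initialize the embedding layer with GloVe and freeze it, so the
# network uses the pre-trained space instead of learning its own
glove_embedding = Embedding(
    input_dim=embedding_matrix.shape[0],
    output_dim=EMBEDDING_DIM,
    embeddings_initializer=Constant(embedding_matrix),
    trainable=False,
)
```

Setting trainable=True instead would let backpropagation fine-tune the GloVe vectors, which is a common middle ground between the two versions.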

Lastly, in this example we will use a 1D CNN instead of an LSTM.
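
To make that architecture change concrete, here's a sketch of a small 1D CNN text classifier. The sequence length, vocabulary cap, filter count, and kernel size are placeholder hyperparameters chosen for illustration, not values taken from this example.

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, Conv1D, GlobalMaxPooling1D,
                                     Dense, Dropout)

SEQUENCE_LENGTH = 1000  # assumed padded/truncated document length
VOCAB_SIZE = 20000      # assumed cap on the word index
NUM_CLASSES = 20        # one class per newsgroup

model = Sequential([
    Input(shape=(SEQUENCE_LENGTH,)),
    # learned embedding (version one); swap in the frozen GloVe layer
    # from the previous sketch to get version two
    Embedding(VOCAB_SIZE, 100),
    Conv1D(128, 5, activation="relu"),  # 5-word filters slid over the text
    GlobalMaxPooling1D(),               # keep each filter's strongest response
    Dense(128, activation="relu"),
    Dropout(0.5),
    Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Unlike an LSTM, which reads the sequence one step at a time, the Conv1D layer detects local n-gram patterns in parallel, which typically makes it much faster to train on long documents like these.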
