Preparing the data 

For this example, we use a dataset called WikiText2. The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. Compared to the preprocessed version of Penn Treebank (PTB), another popularly used dataset, WikiText-2 is over two times larger. The WikiText dataset also features a far larger vocabulary and retains the original case, punctuation, and numbers. Since the dataset contains full articles, it is well suited for models that take advantage of long-term dependencies.

The dataset was introduced in a paper called Pointer Sentinel Mixture Models (https://arxiv.org/abs/1609.07843). The paper discusses solutions to a specific problem, namely that an LSTM with a softmax layer has difficulty predicting rare or unseen words, even when the context makes them clear. Let's not worry about this for now, as it is an advanced concept and outside the scope of this book.

The following screenshot shows what the data looks like inside the WikiText dump:

As usual, torchtext makes it easier to use the dataset by providing abstractions over downloading and reading it. Let's look at the code that does that:

from torchtext import data, datasets

TEXT = data.Field(lower=True, batch_first=True)
train, valid, test = datasets.WikiText2.splits(TEXT, root='data')

The previous code takes care of downloading the WikiText2 data and splitting it into train, valid, and test datasets. The key difference in language modeling is how the data is processed. All the text data in WikiText2 is stored in one long tensor. Let's look at the following code and its results to better understand how the data is processed:

print(len(train[0].text))

#output
2088628

As we can see from the previous results, we have only one example field and it contains all the text. Let's also quickly look at how the text is represented:

print(train[0].text[:100])

#Results of first 100 tokens

'<eos>', '=', 'valkyria', 'chronicles', 'iii', '=', '<eos>', '<eos>', 'senjō', 'no', 'valkyria', '3', ':', '<unk>', 'chronicles', '(', 'japanese', ':', '3', ',', 'lit', '.', 'valkyria', 'of', 'the', 'battlefield', '3', ')', ',', 'commonly', 'referred', 'to', 'as', 'valkyria', 'chronicles', 'iii', 'outside', 'japan', ',', 'is', 'a', 'tactical', 'role', '@-@', 'playing', 'video', 'game', 'developed', 'by', 'sega', 'and', 'media.vision', 'for', 'the', 'playstation', 'portable', '.', 'released', 'in', 'january', '2011', 'in', 'japan', ',', 'it', 'is', 'the', 'third', 'game', 'in', 'the', 'valkyria', 'series', '.', '<unk>', 'the', 'same', 'fusion', 'of', 'tactical', 'and', 'real', '@-@', 'time', 'gameplay', 'as', 'its', 'predecessors', ',', 'the', 'story', 'runs', 'parallel', 'to', 'the', 'first', 'game', 'and', 'follows', 'the'

Take a quick look at the earlier image that showed the initial text, and compare it with how the text has been tokenized here. We now have a single long sequence of length 2088628 representing WikiText2. The next important thing is how we batch the data.
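Before moving on, here is a minimal sketch of how that batching step could look with torchtext's BPTTIterator, which slices the long token sequence into fixed-length backpropagation-through-time chunks. This is only an illustration of the idea, not the exact code used later; the batch_size and bptt_len values are placeholder choices, and the build_vocab call is included only so that the snippet is self-contained:

# Build the vocabulary over the training split before batching
TEXT.build_vocab(train)

# Slice the long token sequence into contiguous chunks for language modeling
train_iter, valid_iter, test_iter = data.BPTTIterator.splits(
    (train, valid, test),
    batch_size=20,   # illustrative value
    bptt_len=35)     # illustrative sequence length per batch

batch = next(iter(train_iter))
print(batch.text.shape, batch.target.shape)

Each batch exposes a text tensor and a target tensor, where target is simply text shifted by one token, which is exactly the supervision signal a language model needs.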
