Chapter 5. Text Classification

We’re leaving images behind for now and turning our attention to another area where deep learning has proven to be a significant advance on traditional techniques: natural language processing (NLP). A good example of this is Google Translate. Originally, the code that handled translation was a weighty 500,000 lines of code. The new, TensorFlow-based system has approximately 500, and it performs better than the old method.

Recent breakthroughs also have occurred in bringing transfer learning (which you learned about in Chapter 4) to NLP problems. New architectures such as the Transformer architecture have led to the creation of networks like OpenAI’s GPT-2, the larger variant of which produces text that is almost human-like in quality (and in fact, OpenAI has not released the weights of this model for fear of it being used maliciously).

This chapter provides a whirlwind tour of recurrent neural networks and embeddings. Then we explore the torchtext library and how to use it for text processing with an LSTM-based model.

Recurrent Neural Networks

If we look back at how we’ve been using our CNN-based architectures so far, we can see they have always been working on one complete snapshot of time. But consider these two sentence fragments:

The cat sat on the mat.

She got up and impatiently climbed on the chair, meowing for food.

Say you were to feed those two sentences, one after the other, into a CNN and ask, where is the cat? You’d have a problem, because the network has no concept of memory. This is incredibly important when it comes to dealing with data that has a temporal domain (e.g., text, speech, video, and time-series data).1 Recurrent neural networks (RNNs) answer this problem by giving neural networks a memory via hidden state.

What does an RNN look like? My favorite explanation is, “Imagine a neural network crossed with a for loop.” Figure 5-1 shows a diagram of a classical RNN structure.

Classical RNN diagram
Figure 5-1. An RNN

We add input at time step t and get a hidden output state ht, and that output is also fed back into the RNN for the next time step. We can unroll this network to take a deeper look at what’s going on, as shown in Figure 5-2.

Unrolled RNN diagram
Figure 5-2. An unrolled RNN

What we have here is a grouping of fully connected layers (with shared parameters), a series of inputs, and our output. Input data is fed into the network, and the next item in the sequence is predicted as output. In the unrolled view, we can see that the RNN can be thought of as a pipeline of fully connected layers, with the successive input being fed into the next layer in the sequence (with the usual nonlinearities such as ReLU being inserted between the layers). When we have our completed predicted sequence, we then have to backpropagate the error back through the RNN. Because this involves stepping back through the network’s steps, this process is known as backpropagation through time. The error is calculated on the entire sequence, then the network is unfolded as in Figure 5-2, and the gradients are calculated for each time step and combined to update the shared parameters of the network. You can imagine it as doing backprop on individual networks and summing all the gradients together.
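
To make the “neural network crossed with a for loop” intuition concrete, here’s a minimal sketch of a hand-rolled tanh RNN cell. The class name and shapes are illustrative only, not something we use later in the chapter:

import torch
import torch.nn as nn

class TinyRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # One shared fully connected layer, reused at every time step
        self.input_to_hidden = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, seq):
        # seq has shape [sequence_length, batch_size, input_size]
        hidden = torch.zeros(seq.size(1), self.input_to_hidden.out_features)
        for step in seq:  # the "for loop": each time step plus the previous hidden state
            hidden = torch.tanh(self.input_to_hidden(torch.cat([step, hidden], dim=1)))
        return hidden  # final hidden state, ht

In practice, you’d reach for PyTorch’s built-in nn.RNN (or the LSTM and GRU layers we’re about to meet), which implement this loop far more efficiently.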

That’s the theory behind RNNs. But this simple structure has problems that we need to talk about, along with how they were overcome by newer architectures.

Long Short-Term Memory Networks

In practice, RNNs were and are particularly susceptible to the vanishing gradient problem we talked about in Chapter 2, or the potentially worse scenario of the exploding gradient, where your error tends off toward infinity. Neither is good, so RNNs couldn’t be brought to bear on many of the problems they were considered suitable for. That all changed in 1997 when Sepp Hochreiter and Jürgen Schmidhuber introduced the Long Short-Term Memory (LSTM) variant of the RNN.

Figure 5-3 diagrams an LSTM layer. I know, there’s a lot going on here, but it’s not too complex. Honest.

LSTM diagram
Figure 5-3. An LSTM

OK, I admit, it is quite intimidating. The key is to think about the three gates (input, output, and forget). In a standard RNN, we “remember” everything forever. But that’s not how our brains work (sadly!), and the LSTM’s forget gate allows us to model the idea that as we continue in our input chain, the beginning of the chain becomes less important. And how much the LSTM forgets is something that is learned during training, so if it’s in the network’s best interest to be very forgetful, the forget gate parameters will do so.

The cell ends up being the “memory” of the network layer; and the input, output, and forget gates will determine how data flows through the layer. The data may simply pass through, it may “write” to the cell, and that data may (or may not!) flow through to the next layer, modified by the output gate.

This assemblage of parts was enough to solve the vanishing gradient problem, and also has the virtue of being Turing-complete, so theoretically, you can do any calculation that you can do on a computer with one of these.
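
You won’t normally build those gates by hand; PyTorch’s nn.LSTM module wraps them all up for you. Here’s a quick sketch of its inputs and outputs (the sizes are arbitrary):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=1)
seq = torch.randn(7, 3, 10)          # [sequence_length, batch_size, features]
outputs, (hidden, cell) = lstm(seq)  # cell is the layer's "memory"
print(outputs.shape, hidden.shape, cell.shape)
# torch.Size([7, 3, 20]) torch.Size([1, 3, 20]) torch.Size([1, 3, 20])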

But things didn’t stop there, of course. Several developments have occurred in the RNN space since LSTMs, and we’ll cover some of the major ones in the next sections.

Gated Recurrent Units

Since 1997, many variants of the base LSTM network have been created, most of which you probably don’t need to know about unless you’re curious. However, one variant that came along in 2014, the gated recurrent unit (GRU), is worth knowing about, as it has become quite popular in some circles. Figure 5-4 shows the makeup of a GRU architecture.

GRU diagram
Figure 5-4. A GRU

The main takeaway is that the GRU merges the forget and input gates into a single update gate and does away with the separate cell state. This means that it has fewer parameters than an LSTM and so tends to be quicker to train and uses fewer resources at runtime. For these reasons, and because they’re essentially a drop-in replacement for LSTMs, they’ve become quite popular. However, strictly speaking, they are less expressive than LSTMs because of that merging, so in general I recommend trying both GRUs and LSTMs in your network and seeing which one performs better. Or just accept that the LSTM may be a little slower in training, but may end up being the best choice in the end. You don’t have to follow the latest fad—honest!
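
Trying that comparison in PyTorch is easy, because nn.GRU takes the same core arguments as nn.LSTM; the only practical difference is that a GRU returns a single hidden state rather than a (hidden, cell) pair. A quick sketch:

import torch.nn as nn

lstm_encoder = nn.LSTM(input_size=300, hidden_size=100, num_layers=1)
gru_encoder = nn.GRU(input_size=300, hidden_size=100, num_layers=1)

# output, (hidden, cell) = lstm_encoder(seq)
# output, hidden = gru_encoder(seq)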

biLSTM

Another common variant of the LSTM is the bidirectional LSTM or biLSTM for short. As you’ve seen so far, traditional LSTMs (and RNNs in general) can look to the past as they are trained and make decisions. Unfortunately, sometimes you need to see the future as well. This is particularly the case in applications like translation and handwriting recognition, where what comes after the current state can be just as important as the previous state for determining output.

A biLSTM solves this problem in the simplest of ways: it’s essentially two stacked LSTMs, with the input being sent in the forward direction in one LSTM and reversed in the second. Figure 5-5 shows how a biLSTM works across its input bidirectionally to produce the output.

biLSTM diagram
Figure 5-5. A biLSTM

PyTorch makes it easy to create biLSTMs by passing in a bidirectional=True parameter when creating an LSTM() unit, as you’ll see later in the chapter.
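
Just remember that a bidirectional LSTM produces concatenated forward and backward states, so anything downstream needs to expect twice the hidden size. A quick sketch with arbitrary sizes:

import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=300, hidden_size=100, bidirectional=True)
seq = torch.randn(7, 3, 300)
outputs, (hidden, cell) = bilstm(seq)
print(outputs.shape)  # torch.Size([7, 3, 200]) -- forward and backward outputs concatenated
print(hidden.shape)   # torch.Size([2, 3, 100]) -- one final hidden state per direction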

That completes our tour through the RNN-based architectures. In Chapter 9, we return to the question of architecture when we look at the Transformer-based BERT and GPT-2 models.

Embeddings

We’re almost at the point where we can start writing some code! But before we do, one little detail may have occurred to you: how do we represent words in a network? After all, we’re feeding tensors of numbers into a network and getting tensors out. With images, it seemed a fairly obvious thing to convert them into tensors representing the red/green/blue component values, and they’re already naturally thought of as arrays as they come with a height and width baked in. But words? Sentences? How is that going to work?

The simplest approach is still one that you’ll find in many approaches to NLP, and it’s called one-hot encoding. It’s pretty simple! Let’s look at our first sentence from the start of the chapter:

The cat sat on the mat.

If we consider that this is the entire vocabulary of our world, we have a tensor of [the, cat, sat, on, mat]. One-hot encoding simply means that we create a vector that is the size of the vocabulary, and for each word in it, we allocate a vector with one parameter set to 1 and the rest to 0:

the — [1 0 0 0 0]
cat — [0 1 0 0 0]
sat — [0 0 1 0 0]
on  — [0 0 0 1 0]
mat — [0 0 0 0 1]

We’ve now converted the words into vectors, and we can feed them into our network. Additionally, we may add extra symbols into our vocabulary, such as UNK (unknown, for words not in the vocabulary) and START/STOP to signify the beginning and ends of sentences.
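
If you want to build these vectors yourself, PyTorch has a helper for it. A minimal sketch using our tiny vocabulary:

import torch
import torch.nn.functional as F

vocab = ["the", "cat", "sat", "on", "mat"]
word_to_index = {word: i for i, word in enumerate(vocab)}

indices = torch.tensor([word_to_index[w] for w in ["the", "cat", "sat", "on", "the", "mat"]])
one_hot = F.one_hot(indices, num_classes=len(vocab))
print(one_hot[1])  # tensor([0, 1, 0, 0, 0]) -- the vector for "cat"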

One-hot encoding has a few limitations that become clearer when we add another word into our example vocabulary: kitty. From our encoding scheme, kitty would be represented by [0 0 0 0 0 1] (with all the other vectors being padded with a zero). First, you can see that if we are going to model a realistic set of words, our vectors are going to be very long with almost no information in them. Second, and perhaps more importantly, we know that a very strong relationship exists between the words kitty and cat (also with dammit, but thankfully that’s been skipped from our vocab here!), and this is impossible to represent with one-hot encoding; the two words are completely different things.

An approach that has become more popular recently is replacing one-hot encoding with an embedding matrix (of course, a one-hot encoding is an embedding matrix itself, just one that doesn’t contain any information about relationships between words). The idea is to squash the dimensionality of the vector space down to something a little more manageable and take advantage of the space itself.

For example, if we have an embedding in a 2D space, perhaps cat could be represented by the tensor [0.56, 0.45] and kitten by [0.56, 0.445], whereas mat could be [0.2, -0.1]. We cluster similar words together in the vector space and can do distance checks such as Euclidean or cosine distance functions to determine how close words are to each other. And how do we determine where words fall in the vector space? An embedding layer is no different from any other layer you’ve seen so far in building neural networks; we initialize the vector space randomly, and hopefully the training process updates the parameters so that similar words or concepts gravitate toward each other.
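
Using the made-up 2D vectors above, a quick sketch of those distance checks in PyTorch looks like this:

import torch
import torch.nn.functional as F

cat = torch.tensor([0.56, 0.45])
kitten = torch.tensor([0.56, 0.445])
mat = torch.tensor([0.2, -0.1])

print(F.cosine_similarity(cat, kitten, dim=0))  # close to 1.0 -- very similar
print(F.cosine_similarity(cat, mat, dim=0))     # much lower
print(torch.dist(cat, kitten))                  # Euclidean distance, close to 0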

A famous example of embedding vectors is word2vec, which was released by Google in 2013.2 This was a set of word embeddings trained using a shallow neural network, and it revealed that the transformation into vector space seemed to capture something about the concepts underpinning the words. In its commonly cited finding, if you pulled the vectors for King, Man, and Woman and then subtracted the vector for Man from King and added the vector for Woman, you would get a result that was the vector representation for Queen. Since word2vec, other pretrained embeddings have become available, such as ELMo, GloVe, and fasttext.

As for using embeddings in PyTorch, it’s really simple:

embed = nn.Embedding(vocab_size, dimension_size)

This will contain a tensor of vocab_size x dimension_size initialized randomly. I prefer to think that it’s just a giant array or lookup table. Each word in your vocabulary indexes into an entry that is a vector of dimension_size, so if we go back to our cat and its epic adventures on the mat, we’d have something like this:

import torch
import torch.nn as nn

cat_mat_embed = nn.Embedding(5, 2)
cat_tensor = torch.tensor([1])   # the index of cat in our vocabulary (must be an integer tensor)
cat_mat_embed.forward(cat_tensor)

> tensor([[ 1.7793, -0.3127]], grad_fn=<EmbeddingBackward>)

We create our embedding, a tensor that contains the position of cat in our vocabulary, and pass it through the layer’s forward() method. That gives us our random embedding. The result also points out that we have a gradient function that we can use for updating the parameters after we combine it with a loss function.

We’ve now gone through all the theory and can get started on building something!

torchtext

Just like torchvision, PyTorch provides an official library, torchtext, for handling text-processing pipelines. However, torchtext is not quite as battle-tested as torchvision, nor does it have as many eyes on it, which means it’s not quite as easy to use or as well documented. But it is still a powerful library that can handle a lot of the mundane work of building up text-based datasets, so we’ll be using it for the rest of the chapter.

Installing torchtext is fairly simple. You use either standard pip:

pip install torchtext

or a specific conda channel:

conda install -c derickl torchtext

You’ll also want to install spaCy (an NLP library), and pandas if you don’t have them on your system (again, either using pip or conda). We use spaCy for processing our text in the torchtext pipeline, and pandas for exploring and cleaning up our data.

Getting Our Data: Tweets!

In this section, we build a sentiment analysis model, so let’s grab a dataset. torchtext provides a bunch of built-in datasets via the torchtext.datasets module, but we’re going to work on one from scratch to get a feel for building a custom dataset and feeding it into a model we’ve created. We use the Sentiment140 dataset. This is based on tweets from Twitter, with every tweet ranked as 0 for negative, 2 for neutral, and 4 for positive.

Download the zip archive and unzip it. We use the file training.1600000.processed.noemoticon.csv. Let’s look at the file using pandas:

import pandas as pd
tweetsDF = pd.read_csv("training.1600000.processed.noemoticon.csv",
                        header=None)

You may at this point get an error like this:

UnicodeDecodeError: 'utf-8' codec can't decode bytes in
position 80-81: invalid continuation byte

Congratulations—you’re now a real data scientist and you get to deal with data cleaning! From the error message, it appears that the default C-based CSV parser that pandas uses doesn’t like some of the Unicode in the file, so we need to switch to the Python-based parser:

tweetsDF = pd.read_csv("training.1600000.processed.noemoticon.csv",
                        engine="python", header=None)

Let’s take a look at the structure of the data by displaying the first five rows:

>>> tweetsDF.head(5)
0  0  1467810672  ...  NO_QUERY   scotthamilton  is upset that ...
1  0  1467810917  ...  NO_QUERY        mattycus  @Kenichan I dived many times ...
2  0  1467811184  ...  NO_QUERY         ElleCTF    my whole body feels itchy
3  0  1467811193  ...  NO_QUERY          Karoli  @nationwideclass no, it's ...
4  0  1467811372  ...  NO_QUERY        joy_wolf  @Kwesidei not the whole crew

Annoyingly, we don’t have a header field in this CSV (again, welcome to the world of a data scientist!), but by looking at the website and using our intuition, we can see that what we’re interested in is the last column (the tweet text) and the first column (our labeling). However, the labels aren’t great, so let’s do a little feature engineering to work around that. Let’s see what counts we have in our training set:

>>> tweetsDF[0].value_counts()
4    800000
0    800000
Name: 0, dtype: int64

Curiously, there are no neutral values in the training dataset. This means that we could formulate the problem as a binary choice between 0 and 1 and work out our predictions from there, but for now we stick to the original plan that we may possibly have neutral tweets in the future. To encode the classes as numbers starting from 0, we first create a column of type category from the label column:

tweetsDF["sentiment_cat"] = tweetsDF[0].astype('category')

Then we encode those classes as numerical information in another column:

tweetsDF["sentiment"] = tweetsDF["sentiment_cat"].cat.codes

We then save the modified CSV back to disk:

tweetsDF.to_csv("train-processed.csv", header=None, index=None)

I recommend that you save another CSV that has a small sample of the 1.6 million tweets for you to test things out on too:

tweetsDF.sample(10000).to_csv("train-processed-sample.csv", header=None,
    index=None)

Now we need to tell torchtext what we think is important for the purposes of creating a dataset.

Defining Fields

torchtext takes a straightforward approach to generating datasets: you tell it what you want, and it’ll process the raw CSV (or JSON) for you. You do this by first defining fields. The Field class has a considerable number of parameters that can be assigned to it, and although you probably won’t use all of them at once, Table 5-1 provides a handy guide as to what you can do with a Field.

Table 5-1. Field parameter types
Parameter Description Default

sequential

Whether the field represents sequential data (i.e., text). If set to False, no tokenization is applied.

True

use_vocab

Whether to include a Vocab object. If set to False, the field should contain numerical data.

True

init_token

A token that will be added to the start of this field to indicate the beginning of the data.

None

eos_token

End-of-sentence token appended to the end of each sequence.

None

fix_length

If set to an integer, all entries will be padded to this length. If None, sequence lengths will be flexible.

None

dtype

The type of the tensor batch.

torch.long

lower

Convert the sequence into lowercase.

False

tokenize

The function that will perform sequence tokenization. If set to spacy, the spaCy tokenizer will be used.

string.split

pad_token

The token that will be used as padding.

<pad>

unk_token

The token that will be used to represent words that are not present in the Vocab dict.

<unk>

pad_first

Pad at the start of the sequence.

False

truncate_first

Truncate at the beginning of the sequence (if necessary).

False

As we noted, we’re interested in only the labels and the tweets text. We define these by using the Field datatype:

from torchtext import data

LABEL = data.LabelField()
TWEET = data.Field(tokenize='spacy', lower=True)

We’re defining LABEL as a LabelField, which is a subclass of Field that sets sequential to False (as it’s our numerical category class). TWEET is a standard Field object, where we have decided to use the spaCy tokenizer and convert all the text to lowercase, but otherwise we’re using the defaults as listed in the previous table. If, when running through this example, the step of building the vocabulary is taking a very long time, try removing the tokenize parameter and rerunning. This will use the default of simply splitting on whitespace, which will speed up the tokenization step considerably, though the created vocabulary will not be as good as the one spaCy creates.
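
If you do want to try that faster route while experimenting, the field is just the defaults (the TWEET_FAST name here is only for illustration; we’ll stick with the spaCy-based TWEET field for the rest of the chapter):

TWEET_FAST = data.Field(lower=True)  # default tokenizer: split on whitespace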

Having defined those fields, we now need to produce a list that maps them onto the list of rows that are in the CSV:

fields = [('score', None), ('id', None), ('date', None), ('query', None),
          ('name', None),
          ('tweet', TWEET), ('category', None), ('label', LABEL)]

Armed with our declared fields, we now use TabularDataset to apply that definition to the CSV:

twitterDataset = data.TabularDataset(
        path="train-processed.csv",
        format="CSV",
        fields=fields,
        skip_header=False)

This may take some time, especially with the spaCy parser. Finally, we can split into training, testing, and validation sets by using the split() method:

(train, test, valid) = twitterDataset.split(split_ratio=[0.8,0.1,0.1])

(len(train),len(test),len(valid))
> (1280000, 160000, 160000)

Here’s an example pulled from the dataset:

>vars(train.examples[7])

{'label': '6681',
 'tweet': ['woah',
  ',',
  'hell',
  'in',
  'chapel',
  'thrill',
  'is',
  'closed',
  '.',
  'no',
  'more',
  'sweaty',
  'basement',
  'dance',
  'parties',
  '?',
  '?']}

In a surprising turn of serendipity, the randomly selected tweet references the closure of a club in Chapel Hill I frequently visited. See if you find anything as weird on your dive through the data!

Building a Vocabulary

Traditionally, at this point we would build a one-hot encoding of each word that is present in the dataset—a rather tedious process. Thankfully, torchtext will do this for us, and will also allow a max_size parameter to be passed in to limit the vocabulary to the most common words. This is normally done to prevent the construction of a huge, memory-hungry model. We don’t want our GPUs too overwhelmed, after all. Let’s limit the vocabulary to a maximum of 20,000 words in our training set:

vocab_size = 20000
TWEET.build_vocab(train, max_size = vocab_size)

We can then interrogate the vocab class instance object to make some discoveries about our dataset. First, we ask the traditional “How big is our vocabulary?”:

len(TWEET.vocab)
> 20002

Wait, wait, what? Yes, we specified 20,000, but by default, torchtext will add two more special tokens, <unk> for unknown words (e.g., those that get cut off by the 20,000 max_size we specified), and <pad>, a padding token that will be used to pad all our text to roughly the same size to help with efficient batching on the GPU (remember that a GPU gets its speed from operating on regular batches). You can also specify eos_token or init_token symbols when you declare a field, but they’re not included by default.
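
You can see those special tokens (and the mapping in both directions) by poking at the vocab’s itos and stoi attributes; the exact words after the specials will depend on your data:

print(TWEET.vocab.itos[:5])       # e.g. ['<unk>', '<pad>', '!', '.', 'I']
print(TWEET.vocab.stoi['<pad>'])  # 1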

Now let’s take a look at the most common words in the vocabulary:

>TWEET.vocab.freqs.most_common(10)
[('!', 44802),
 ('.', 40088),
 ('I', 33133),
 (' ', 29484),
 ('to', 28024),
 ('the', 24389),
 (',', 23951),
 ('a', 18366),
 ('i', 17189),
 ('and', 14252)]

Pretty much what you’d expect, as we’re not removing stop-words with our spaCy tokenizer. (Because it’s just 140 characters, we’d be in danger of losing too much information from our model if we did.)

We are almost finished with our datasets. We just need to create a data loader to feed into our training loop. torchtext provides the BucketIterator class, which produces what it calls a Batch: almost, but not quite, like the data loader we used on images. (You’ll see shortly that we have to update our training loop to deal with some of the oddities of the Batch interface.)

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train, valid, test),
    batch_size=32,
    device=device)
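
If you want to see what one of those Batch objects looks like before we get to the training loop, you can pull one out and inspect it; the fields come back as attributes rather than as a (data, label) tuple:

batch = next(iter(train_iterator))
print(batch.tweet.shape)  # [sequence_length, batch_size] tensor of word indices
print(batch.label.shape)  # [batch_size] tensor of class labels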

Putting everything together, here’s the complete code for building up our datasets:

from torchtext import data

device = "cuda"
LABEL = data.LabelField()
TWEET = data.Field(tokenize='spacy', lower=True)

fields = [('score', None), ('id', None), ('date', None), ('query', None),
          ('name', None),
          ('tweet', TWEET), ('category', None), ('label', LABEL)]

twitterDataset = data.TabularDataset(
        path="train-processed.csv",
        format="CSV",
        fields=fields,
        skip_header=False)

(train, test, valid) = twitterDataset.split(split_ratio=[0.8, 0.1, 0.1])

vocab_size = 20000
TWEET.build_vocab(train, max_size=vocab_size)

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train, valid, test),
    batch_size=32,
    device=device)

With our data processing sorted, we can move on to defining our model.

Creating Our Model

We use the Embedding and LSTM modules in PyTorch that we talked about in the first half of this chapter to build a simple model for classifying tweets:

import torch.nn as nn

class OurFirstLSTM(nn.Module):
    def __init__(self, hidden_size, embedding_dim, vocab_size):
        super(OurFirstLSTM, self).__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.encoder = nn.LSTM(input_size=embedding_dim,
                hidden_size=hidden_size, num_layers=1)
        self.predictor = nn.Linear(hidden_size, 2)

    def forward(self, seq):
        output, (hidden,_) = self.encoder(self.embedding(seq))
        preds = self.predictor(hidden.squeeze(0))
        return preds

model = OurFirstLSTM(hidden_size=100, embedding_dim=300, vocab_size=20002)
model.to(device)

All we do in this model is create three layers. First, the words in our tweets are pushed into an Embedding layer, which we have established as a 300-dimensional vector embedding. That’s then fed into an LSTM with 100 hidden features (again, we’re compressing down from the 300-dimensional input like we did with images). Finally, the output of the LSTM (the final hidden state after processing the incoming tweet) is pushed through a standard fully connected layer with two outputs to correspond to our two classes (negative or positive). Next we turn to the training loop!

Updating the Training Loop

Because of some of torchtext’s quirks, we need to write a slightly modified training loop. First, we create an optimizer (we use Adam as usual) and a loss function. Although the original Sentiment140 labeling scheme allows for three potential classes per tweet, only two classes are actually present in the dataset, so we use CrossEntropyLoss() over our two model outputs (which also generalizes if neutral tweets turn up later). Given just two classes, we could in fact change the output of the model to produce a single number between 0 and 1 and then use binary cross-entropy (BCE) loss (and we can combine the sigmoid layer that squashes output between 0 and 1 plus the BCE layer into a single PyTorch loss function, BCEWithLogitsLoss()). I mention this because if you’re writing a classifier that must always be one state or the other, it’s a better fit than the standard cross-entropy loss that we’re about to use.

from torch import optim

optimizer = optim.Adam(model.parameters(), lr=2e-2)
criterion = nn.CrossEntropyLoss()

def train(epochs, model, optimizer, criterion, train_iterator, valid_iterator):
    for epoch in range(1, epochs + 1):

        training_loss = 0.0
        valid_loss = 0.0
        model.train()
        for batch_idx, batch in enumerate(train_iterator):
            optimizer.zero_grad()
            predict = model(batch.tweet)
            loss = criterion(predict, batch.label)
            loss.backward()
            optimizer.step()
            training_loss += loss.data.item() * batch.tweet.size(0)
        training_loss /= len(train_iterator)

        model.eval()
        for batch_idx, batch in enumerate(valid_iterator):
            predict = model(batch.tweet)
            loss = criterion(predict, batch.label)
            valid_loss += loss.data.item() * batch.tweet.size(0)
        valid_loss /= len(valid_iterator)

        print('Epoch: {}, Training Loss: {:.2f}, '
              'Validation Loss: {:.2f}'.format(epoch, training_loss, valid_loss))

The main thing to be aware of in this new training loop is that we have to reference batch.tweet and batch.label to get the particular fields we’re interested in; they don’t fall out quite as nicely from the enumerator as they do in torchvision.

Once we’ve trained our model by using this function, we can use it to classify some tweets to do simple sentiment analysis.
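
Calling the function is then straightforward; five epochs here is just an arbitrary starting point:

train(5, model, optimizer, criterion, train_iterator, valid_iterator)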

Classifying Tweets

Another hassle of torchtext is that it’s a bit of a pain to get it to predict things. What you can do is emulate the processing pipeline that happens internally and make the required prediction on the output of that pipeline, as shown in this small function:

def classify_tweet(tweet):
    categories = {0: "Negative", 1:"Positive"}
    processed = TWEET.process([TWEET.preprocess(tweet)])
    return categories[model(processed).argmax().item()]

We have to call preprocess(), which performs our spaCy-based tokenization. After that, we can call process() to turn the tokens into a tensor based on our already-built vocabulary. The only thing we have to be careful about is that torchtext expects a batch of examples, so we have to wrap our single token list in another list before handing it off to the processing function. Then we feed it into the model. This will produce a tensor that looks like this:

tensor([[ 0.7828, -0.0024]])

The tensor element with the highest value corresponds to the model’s chosen class, so we use argmax() to get the index of that, and then item() to turn that zero-dimension tensor into a Python integer that we index into our categories dictionary.
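
Using the function is then a one-liner. The prediction shown here is only illustrative; what you actually get back depends on your training run, and you may need to move the processed tensor to the same device as your model:

classify_tweet("ugh, my flight got cancelled again")
> 'Negative'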

With our model trained, let’s look at how to do some of the other tricks and techniques that you learned for images in Chapters 2–4.

Data Augmentation

You might wonder exactly how you can augment text data. After all, you can’t really flip it horizontally as you can an image! But you can use some techniques with text that will provide the model with a little more information for training. First, you could replace words in the sentence with synonyms, like so:

The cat sat on the mat

could become

The cat sat on the rug

Aside from the cat’s insistence that a rug is much softer than a mat, the meaning of the sentence hasn’t changed. But mat and rug will be mapped to different indices in the vocabulary, so the model will learn that the two sentences map to the same label, and hopefully that there’s a connection between those two words, as everything else in the sentences is the same.
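
A minimal sketch of synonym replacement might look like the following, assuming a hypothetical get_synonyms() helper (backed by something like WordNet) that returns a list of synonyms for a word:

import random

def synonym_replacement(words, n):
    # words is a list of tokens; replace up to n randomly chosen words
    # with one of their synonyms. get_synonyms() is an assumed helper.
    new_words = list(words)
    candidate_positions = list(range(len(new_words)))
    random.shuffle(candidate_positions)
    replaced = 0
    for idx in candidate_positions:
        synonyms = get_synonyms(new_words[idx])
        if synonyms:
            new_words[idx] = random.choice(synonyms)
            replaced += 1
        if replaced >= n:
            break
    return new_words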

In early 2019, the paper “EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks” suggested three other augmentation strategies: random insertion, random swap, and random deletion. Let’s take a look at each of them.3

Random Insertion

A random insertion technique looks at a sentence and then randomly inserts synonyms of existing nonstop-words into the sentence n times. Assuming you have a way of getting a synonym of a word and a way of eliminating stop-words (common words such as and, it, the, etc.), shown, but not implemented, in this function via get_synonyms() and remove_stopwords(), an implementation of this would be as follows:

import random
from random import randrange

def random_insertion(sentence, n):
    # sentence is a list of tokens; remove_stopwords() and get_synonyms()
    # are assumed helpers that strip stop-words and return a list of synonyms
    words = remove_stopwords(sentence)
    for _ in range(n):
        new_synonym = random.choice(get_synonyms(random.choice(words)))
        sentence.insert(randrange(len(sentence) + 1), new_synonym)
    return sentence

An example of this in practice, where synonyms of cat and mat are inserted into the sentence, could look like this:

The cat sat on the mat
The cat mat sat on feline the mat

Random Deletion

As the name suggests, random deletion deletes words from a sentence. Given a probability parameter p, it will go through the sentence and decide whether to delete a word or not based on that random probability:

import random

def random_deletion(words, p=0.5):
    # words is a list of tokens; each word survives with probability 1 - p
    if len(words) == 1:
        return words
    remaining = list(filter(lambda x: random.uniform(0, 1) > p, words))
    if len(remaining) == 0:
        return [random.choice(words)]
    else:
        return remaining

The implementation deals with the edge cases—if there’s only one word, the technique returns it; and if we end up deleting all the words in the sentence, the technique samples a random word from the original set.

Random Swap

The random swap augmentation takes a sentence and then swaps words within it n times, with each iteration working on the previously swapped sentence. Here’s an implementation:

import random

def random_swap(sentence, n=5):
    # sentence is a list of tokens; swap two randomly chosen positions n times
    length = range(len(sentence))
    for _ in range(n):
        idx1, idx2 = random.sample(length, 2)
        sentence[idx1], sentence[idx2] = sentence[idx2], sentence[idx1]
    return sentence

We sample two random indices based on the length of the sentence and then just keep swapping until we hit n.

The techniques in the EDA paper average about a 3% improvement in accuracy when used with small amounts of labeled examples (roughly 500). If you have more than 5,000 examples in your dataset, the paper suggests that this improvement may fall to 0.8% or lower, due to the model obtaining better generalization from the larger amounts of data available over the improvements that EDA can provide.

Back Translation

Another popular approach for augmenting datasets is back translation. This involves translating a sentence from our original language into one or more other languages and then translating them all back to the original language. We can use the Python library googletrans for this purpose. Install it with pip, as it doesn’t appear to be in conda at the time of this writing:

pip install googletrans

Then, we can translate our sentence from English to French, and then back to English:

import googletrans
from googletrans import Translator

translator = Translator()

sentences = ['The cat sat on the mat']

translations_fr = translator.translate(sentences, dest='fr')
fr_text = [t.text for t in translations_fr]
translations_en = translator.translate(fr_text, dest='en')
en_text = [t.text for t in translations_en]
print(en_text)
print(en_text)

>> ['The cat sat on the carpet']

That gives us an augmented sentence from English to French and back again, but let’s go a step further and select a language at random:

import random

available_langs = list(googletrans.LANGUAGES.keys())
tr_lang = random.choice(available_langs)
print(f"Translating to {googletrans.LANGUAGES[tr_lang]}")

translations = translator.translate(sentences, dest=tr_lang)
t_text = [t.text for t in translations]
print(t_text)

translations_en_random = translator.translate(t_text, src=tr_lang, dest='en')
en_text = [t.text for t in translations_en_random]
print(en_text)

In this case, we use random.choice to grab a random language, translate to that language, and then translate back as before. We also pass in the language to the src parameter just to help the language detection of Google Translate along. Try it out and see how much it resembles the old game of Telephone.

You need to be aware of a few limits. First, you can translate only up to 15,000 characters at a time, though that shouldn’t be too much of a problem if you’re just translating sentences. Second, if you are going to use this on a large dataset, you’ll want to do your data augmentation on a cloud instance rather than your home computer, because if Google bans your IP, you won’t be able to use Google Translate for normal use! Make sure that you send a few batches at a time rather than the entire dataset at once. This should also allow you to restart translation batches if there’s an error on the Google Translate backend.
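
A sketch of what that batching might look like is shown here; the helper name, batch size, and pause are only illustrative, and you’d want to add your own error handling and retry logic:

import time
from googletrans import Translator

def back_translate_in_batches(sentences, dest='fr', batch_size=100, pause=1.0):
    # Hypothetical helper: translate small chunks out and back, pausing between
    # requests so we don't hammer the Google Translate backend.
    translator = Translator()
    results = []
    for start in range(0, len(sentences), batch_size):
        chunk = sentences[start:start + batch_size]
        out = translator.translate(chunk, dest=dest)
        back = translator.translate([t.text for t in out], src=dest, dest='en')
        results.extend(t.text for t in back)
        time.sleep(pause)
    return results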

Augmentation and torchtext

You might have noticed that everything I’ve said so far about augmentation hasn’t involved torchtext. Sadly, there’s a reason for that. Unlike torchvision or torchaudio, torchtext doesn’t offer a transform pipeline, which is a little annoying. It does offer a way of performing pre- and post-processing, but this operates only on the token (word) level, which is perhaps enough for synonym replacement but doesn’t provide enough control for something like back translation. And if you do try to hijack the pipelines for augmentation, you should probably do it in the preprocessing pipeline rather than the post-processing one, as all you’ll see in the latter is a tensor of integers, which you’d have to map back to words via the vocab rules.

For these reasons, I suggest not spending your time trying to twist torchtext into knots to do data augmentation. Instead, do the augmentation outside PyTorch, using techniques such as back translation to generate new data, and feed that into the model as if it were real data.

That’s augmentation covered, but there’s an elephant in the room that we should address before wrapping up the chapter.

Transfer Learning?

You might be wondering why we haven’t talked about transfer learning yet. After all, it’s a key technique that allows us to create accurate image-based models, so why can’t we do that here? Well, it turns out that it has been a little harder to get transfer learning working on LSTM networks. But not impossible. We’ll return to the subject in Chapter 9, where you’ll see how to get transfer learning working with both the LSTM- and Transformer-based networks.

Conclusion

In this chapter, we built a text-processing pipeline covering encoding and embeddings, trained a simple LSTM-based neural network to perform classification, and explored some data augmentation strategies for text-based data. You have plenty to experiment with so far. I’ve chosen to make every tweet lowercase during the tokenization phase. This is a popular approach in NLP, but it does throw away potential information in the tweet. Think about it: “Why is this NOT WORKING?” to our eyes is even more suggestive of a negative sentiment than “Why is this not working?” but we’ve thrown away that difference between the two tweets before it even hits the model. So definitely try running with case sensitivity left in the tokenized text. And try removing stop-words from your input text to see whether that helps improve the accuracy. Traditional NLP methods make a big point of removing them, but I’ve often found that deep learning techniques can perform better when leaving them in the input (which we’ve done in this chapter). This is because they provide more context for the model to learn from, whereas sentences that have been reduced to only important words may be missing nuances in the text.

You may also want to alter the size of the embedding vector. Larger vectors mean that the embedding can capture more information about the word it’s modeling at the cost of using more memory. Try going from 100- to 1,000-dimensional embeddings and see how that affects training time and accuracy.

Finally, you can also play with the LSTM. We’ve used a simple approach, but you can increase num_layers to create stacked LSTMs, increase or decrease the number of hidden features in the layer, or set bidirectional=True to create a biLSTM. Replacing the entire LSTM with a GRU layer would also be an interesting thing to try; does it train faster? Is it more accurate? Experiment and see what you find!

In the meantime, we move on from text and into the audio realm with torchaudio.

Further Reading

1 Note that it’s not impossible to do these things with CNNs; a lot of in-depth research in the last few years has been done to apply CNN-based networks in the temporal domain. We won’t cover them here, but “Temporal Convolutional Networks: A Unified Approach to Action Segmentation” by Colin Lea, et al. (2016) provides further information. And seq2seq!

2 See “Efficient Estimation of Word Representations in Vector Space” by Tomas Mikolov et al. (2013).

3 See “EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks” by Jason W. Wei and Kai Zou (2019).
