Preparing the data

When working with text documents like these, it can take a lot of mundane code to get where you want to be. I'm including this example as one way to handle the problem. Once you understand what's going on here, you will be able to reuse much of it in future problems and shorten your development time, so it's worth studying closely.
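For reference, the unpacked 20 Newsgroups data looks roughly like this. The directory names are real newsgroup names, but the numeric file names shown here are just illustrative post IDs:

20_newsgroup/
    alt.atheism/
        49960
        51060
        ...
    comp.graphics/
        37261
        ...
    ...
    talk.religion.misc/
        83421
        ...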

The following function is going to take the top-level directory where the 20 newsgroup texts live. Within that directory there will be 20 individual directories, one per newsgroup, each containing files; each file is a single newsgroup post:

import os
import sys

def load_data(text_data_dir, vocab_size, sequence_length, validation_split=0.2):
    data = dict()
    data["vocab_size"] = vocab_size
    data["sequence_length"] = sequence_length

    # prepare text samples and their labels
    print('Processing text dataset')

    texts = []  # list of text samples
    labels_index = {}  # dictionary mapping label name to numeric id
    labels = []  # list of label ids
    for name in sorted(os.listdir(text_data_dir)):
        path = os.path.join(text_data_dir, name)
        if os.path.isdir(path):
            label_id = len(labels_index)
            labels_index[name] = label_id
            for fname in sorted(os.listdir(path)):
                if fname.isdigit():
                    fpath = os.path.join(path, fname)
                    if sys.version_info < (3,):
                        f = open(fpath)
                    else:
                        f = open(fpath, encoding='latin-1')
                    t = f.read()
                    i = t.find('\n\n')  # skip header
                    if 0 < i:
                        t = t[i:]
                    texts.append(t)
                    f.close()
                    labels.append(label_id)
    print('Found %s texts.' % len(texts))
    data["texts"] = texts
    data["labels"] = labels
    return data

For each directory, we take the directory name and add it to a dictionary that maps it to a number. That number becomes the value we want to predict, our label. We keep the list of labels in data["labels"].

Likewise, for the texts, we open each file and parse out just the relevant text, ignoring the header junk about who posted the message. We then store the text in data["texts"]. By the way, it's very important to remove the part of the header that identifies the newsgroup; that's cheating!
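Calling the loader looks like this. Note that the ./20_newsgroup path and the vocab_size and sequence_length values are just illustrative choices, not anything mandated above:

data = load_data("./20_newsgroup", vocab_size=20000, sequence_length=1000)  # illustrative path/params
print(len(data["texts"]))   # total number of posts loaded
print(data["labels"][:3])   # numeric label ids, e.g. [0, 0, 0]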

In the end, we are left with a list of texts and a corresponding list of labels; however, at this point, each of those texts is a string. The next thing we need to do is split these strings into word tokens, convert those tokens into numeric tokens, and pad the sequences so that they're all the same length. This is pretty much what we did in the previous example; however, there the data came pre-tokenized. I'll use this function to accomplish the task, as shown in the following code:

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

def tokenize_text(data):
    tokenizer = Tokenizer(num_words=data["vocab_size"])
    tokenizer.fit_on_texts(data["texts"])
    data["tokenizer"] = tokenizer
    sequences = tokenizer.texts_to_sequences(data["texts"])

    word_index = tokenizer.word_index
    print('Found %s unique tokens.' % len(word_index))

    data["X"] = pad_sequences(sequences, maxlen=data["sequence_length"])
    data["y"] = to_categorical(np.asarray(data["labels"]))
    print('Shape of data tensor:', data["X"].shape)
    print('Shape of label tensor:', data["y"].shape)

    # texts and labels aren't needed anymore
    data.pop("texts", None)
    data.pop("labels", None)
    return data

Here we're taking that list of texts and tokenizing them with keras.preprocessing.text.Tokenizer. After that, we're padding the resulting sequences to equal length. Finally, we're converting the numeric labels to one-hot format, as we have done in other multiclass classification problems with Keras.
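If the Tokenizer and pad_sequences pipeline is new to you, here is a minimal standalone sketch of what it does; the toy strings and the num_words and maxlen values are made up purely for illustration:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

toy_texts = ["the cat sat on the mat", "the dog sat"]
tokenizer = Tokenizer(num_words=10)   # keep only the 10 most frequent words
tokenizer.fit_on_texts(toy_texts)     # build the word -> integer index
sequences = tokenizer.texts_to_sequences(toy_texts)
print(sequences)
# e.g. [[1, 3, 2, 4, 1, 5], [1, 6, 2]] -- more frequent words get lower indices
print(pad_sequences(sequences, maxlen=5))
# pre-pads short sequences with zeros and pre-truncates long ones to length 5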

We're almost done with the data; lastly, we need to take our texts and labels and randomly split them into train, validation, and test sets, as shown in the following code. I don't have much data to work with, so I'm going to be fairly stingy with the test and validation sets here. If those samples are too small, I might not get a good estimate of actual model performance, so be careful when you do this:

from sklearn.model_selection import train_test_split

def train_val_test_split(data):
    # hold out 20% of the data, then split that holdout into validation and test
    data["X_train"], X_test_val, data["y_train"], y_test_val = train_test_split(
        data["X"], data["y"], test_size=0.2, random_state=42)
    data["X_val"], data["X_test"], data["y_val"], data["y_test"] = train_test_split(
        X_test_val, y_test_val, test_size=0.25, random_state=42)
    return data
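Putting it all together, the whole preparation pipeline boils down to three calls. As before, the path and the vocab_size and sequence_length values are illustrative choices rather than anything prescribed above:

data = load_data("./20_newsgroup", vocab_size=20000, sequence_length=1000)  # illustrative path/params
data = tokenize_text(data)
data = train_val_test_split(data)
print(data["X_train"].shape, data["X_val"].shape, data["X_test"].shape)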