Conventions and style

The code, iterators, and wrappers that we will be using are from Practical Torchtext. This is a torchtext tutorial that was created by Keita Kurita—one of the top five contributors to torchtext.

The naming conventions and style are loosely inspired by the preceding work and fastai—a deep learning framework built on top of PyTorch itself.

Let's begin by setting up the required variable placeholders:

from torchtext.data import Field

The Field class determines how the data is preprocessed and converted into a numeric format. It is a fundamental torchtext data structure and worth looking into. The Field class models common text-processing steps and sets them up for numericalization (or vectorization):

LABEL = Field(sequential=False, use_vocab=False)

By default, a field takes in strings of words as input and later builds a mapping from those words to integers. This mapping is called the vocab, and it is effectively a one-hot encoding of the tokens.
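To make the mapping concrete, here is a minimal sketch; the throwaway demo field and the toy token lists are ours, purely for illustration (build_vocab also accepts a Dataset, which is how we will use it later):

from torchtext.data import Field

# Throwaway field built on two toy token lists, purely for illustration
demo = Field(sequential=True, lower=True)
demo.build_vocab([["the", "cat", "sat"], ["the", "dog", "ran"]])

print(demo.vocab.stoi["the"])  # token -> integer index
print(demo.vocab.itos[:4])     # index -> token; starts with <unk> and <pad>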

We saw that each label in our case is already an integer marked as 0 or 1. Therefore, we will not build a vocab for it; instead, we will tell the Field class that the data is already numeric and non-sequential by setting use_vocab=False and sequential=False, respectively. Next, let's set up the field for the text itself:

tokenize = lambda x: x.split()
TEXT = Field(sequential=True, tokenize=tokenize, lower=True)

A few things are happening here, so let's unpack it a bit:

  • lower=True: All input is converted to lowercase.
  • sequential=True: Marks the input as sequential data (a sequence of tokens); if False, no tokenization is applied.
  • tokenize: We defined a custom tokenize function that simply splits the string on spaces. You should replace this with the spaCy tokenizer (set tokenize="spacy") and see whether that changes the loss curve or the final model's performance; a short sanity-check sketch follows this list.
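To see those pieces working together, here is a quick sanity check on a toy sentence of our own; note that exact Field method signatures vary slightly across torchtext releases, so this assumes a 0.4-style API where process takes just the batch:

# Toy sentence, purely illustrative
tokens = TEXT.preprocess("The quick brown Fox")
print(tokens)  # ['the', 'quick', 'brown', 'fox'] -- tokenized and lowercased

TEXT.build_vocab([tokens])  # toy vocab, just for this demo
batch = TEXT.process([tokens])  # pad + numericalize against the vocab
print(batch)  # LongTensor of shape (seq_len, batch_size=1)

Calling build_vocab again later on the real dataset simply replaces this toy vocab. To try the spaCy suggestion from the list above, you would redefine the field as TEXT = Field(sequential=True, tokenize="spacy", lower=True), which requires spaCy and its English model to be installed.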