Representing text data

While our aim is to predict the next word in a given sentence, or (ideally) predict a series of words that make sense and conform to some measure of English syntax/grammar, we will actually be encoding our data at the character level. This means that we need to take our text data (in this example, the collected works of William Shakespeare) and generate a sequence of tokens. These tokens might be whole sentences, individual words, or even characters themselves, depending on what type of model we are training.

Once we've tokenized our text data, we need to turn these tokens into some kind of numeric representation that's amenable to computation. As we've discussed, in our case these representations are tensors. The tokens are turned into tensors, on which we then perform a number of operations to extract different properties of the text, hereafter referred to as our corpus.
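As a minimal sketch of this step (plain Python, no framework, using a short stand-in corpus rather than the Shakespeare text), character-level tokenization and the mapping from tokens to integer IDs might look like this:

```python
# A toy corpus standing in for the full Shakespeare text.
corpus = "to be or not to be"

# Character-level tokenization: each character is one token.
tokens = list(corpus)

# Assign each unique character a stable integer ID.
char_to_id = {ch: i for i, ch in enumerate(sorted(set(corpus)))}

# Encode the whole corpus as a sequence of integer IDs,
# ready to be wrapped in a tensor.
encoded = [char_to_id[ch] for ch in tokens]
```

The resulting list of integers is what would typically be handed to a tensor constructor in whichever framework the model uses.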

The aim here is to generate a vocabulary vector (a vector of length n, where n is the number of unique characters in our corpus). We will use this vector as a template to encode each character.
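One way to sketch this idea (again in plain Python with a toy corpus, as an illustration rather than the book's exact implementation) is to build the vocabulary vector and use each character's position in it to produce a one-hot encoding:

```python
corpus = "to be or not to be"

# The vocabulary vector: one entry per unique character in the corpus.
vocab = sorted(set(corpus))
n = len(vocab)  # number of unique characters

def one_hot(ch):
    # A vector of length n with a single 1 at the character's index.
    vec = [0] * n
    vec[vocab.index(ch)] = 1
    return vec

# Encode each character of the corpus against the vocabulary template.
encoded = [one_hot(ch) for ch in corpus]
```

Each character is thus represented by a sparse vector of length n, which is the form a framework would consume as a tensor.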
