One-hot encoding

In one-hot encoding, each token is represented by a vector of length N, where N is the size of the vocabulary, that is, the total number of unique words in the document. Let's take a simple sentence and observe how each token would be represented as a one-hot encoded vector. The following is the sentence and its associated token representation:

An apple a day keeps doctor away said the doctor.
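
Before building the table, it helps to see how the sentence splits into tokens and how large the vocabulary is. The following is a minimal sketch (the variable names are illustrative, and it assumes the trailing period is stripped before splitting on whitespace):

sentence = "An apple a day keeps doctor away said the doctor."

# Strip the trailing period and split on whitespace to get the tokens
tokens = sentence.rstrip('.').split()

# Keep only the first occurrence of each word to build the vocabulary
vocab = list(dict.fromkeys(tokens))

print(len(tokens))   # 10 tokens in the sentence
print(len(vocab))    # 9 unique words, so each one-hot vector has length 9
print(vocab)
# ['An', 'apple', 'a', 'day', 'keeps', 'doctor', 'away', 'said', 'the']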

One-hot encoding for the preceding sentence can be represented in tabular format as follows:

Token     One-hot encoded vector
An        100000000
apple     010000000
a         001000000
day       000100000
keeps     000010000
doctor    000001000
away      000000100
said      000000010
the       000000001

This table describes the tokens and their one-hot encoded representations. The vector length is 9, as there are nine unique words in the sentence. Many machine learning libraries ease the process of creating one-hot encoded variables, but we will write our own implementation to make it easier to understand, and we can reuse the same implementation to build other features required for later examples. The following code contains a Dictionary class, which provides the functionality to create a dictionary of unique words along with a function that returns a one-hot encoded vector for a particular word. Let's take a look at the code and then walk through each functionality:

import numpy as np

class Dictionary(object):
    def __init__(self):
        self.word2idx = {}   # maps each unique word to its index
        self.idx2word = []   # stores each unique word at its index position
        self.length = 0      # number of unique words seen so far

    def add_word(self, word):
        # Register the word only if it has not been seen before
        if word not in self.idx2word:
            self.idx2word.append(word)
            self.word2idx[word] = self.length
            self.length += 1
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)

    def onehot_encoded(self, word):
        # A vector of zeros with a single one at the word's index
        vec = np.zeros(self.length)
        vec[self.word2idx[word]] = 1
        return vec

The preceding code provides three important functionalities:

  • The initialization function, __init__, creates a word2idx dictionary, which stores every unique word along with its index. The idx2word list stores all the unique words, and the length variable holds the total number of unique words in our document.
  • The add_word function takes a word and adds it to word2idx and idx2word, and increases the length of the vocabulary, provided the word is unique.
  • The onehot_encoded function takes a word and returns a vector of length N with zeros throughout, except at the index of the word. If the index of the passed word is two, then the value of the vector at index two will be one, and all the remaining values will be zeros.

As we have defined our Dictionary class, let's use it on our thor_review data. The following code demonstrates how the word2idx is built and how we can call our onehot_encoded function:

dic = Dictionary()

# Add every whitespace-separated token of the review to the dictionary
for tok in thor_review.split():
    dic.add_word(tok)

print(dic.word2idx)

The output of the preceding code is as follows:

# Results of word2idx

{'the': 0, 'action': 1, 'scenes': 2, 'were': 3, 'top': 4, 'notch': 5, 'in': 6, 'this': 7, 'movie.': 8, 'Thor': 9, 'has': 10, 'never': 11, 'been': 12, 'epic': 13, 'MCU.': 14, 'He': 15, 'does': 16, 'some': 17, 'pretty': 18, 'sh*t': 19, 'movie': 20, 'and': 21, 'he': 22, 'is': 23, 'definitely': 24, 'not': 25, 'under-powered': 26, 'anymore.': 27, 'unleashed': 28, 'this,': 29, 'I': 30, 'love': 31, 'that.': 32}

One-hot encoding for the word were is as follows:

# One-hot representation of the word 'were'
dic.onehot_encoded('were')
array([0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
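
Each token of the review maps to one such vector. As a rough sketch (the onehot_matrix helper below is not part of the Dictionary class; it is only an illustration), the whole review can be stacked into a matrix with one row per token:

import numpy as np

def onehot_matrix(dic, text):
    # One row per token, one column per word in the vocabulary
    return np.stack([dic.onehot_encoded(tok) for tok in text.split()])

review_matrix = onehot_matrix(dic, thor_review)
print(review_matrix.shape)   # (number_of_tokens, size_of_vocabulary)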

One of the challenges with one-hot representation is that the data is too sparse: the size of each vector grows quickly as the number of unique words in the vocabulary increases. This limitation is one of the reasons one-hot encoding is rarely used with deep learning.
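
To get a feel for how quickly this grows, here is a back-of-the-envelope sketch (the vocabulary size and document length are made-up figures) of the memory a dense one-hot representation would need:

import numpy as np

vocab_size = 50_000    # an illustrative vocabulary size
num_tokens = 1_000     # an illustrative document length in tokens

# A dense one-hot representation stores one float per vocabulary entry per token
dense = np.zeros((num_tokens, vocab_size), dtype=np.float32)

print(dense.nbytes / 1e6)       # roughly 200 MB for a single document
print(num_tokens / dense.size)  # 2e-05, i.e. only 0.002% of the entries are non-zero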
