© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2021
A. Kulkarni, A. Shivananda, Natural Language Processing Recipes, https://doi.org/10.1007/978-1-4842-7351-7_3

3. Converting Text to Features

Akshay Kulkarni and Adarsha Shivananda
Bangalore, Karnataka, India
This chapter covers basic to advanced feature engineering (text to features) methods. By the end of the chapter, you will be comfortable with the following recipes.
  • Recipe 1. One-hot encoding

  • Recipe 2. Count vectorizer

  • Recipe 3. n-grams

  • Recipe 4. Co-occurrence matrix

  • Recipe 5. Hash vectorizing

  • Recipe 6. Term frequency-inverse document frequency (TF-IDF)

  • Recipe 7. Word embedding

  • Recipe 8. Implementing fastText

  • Recipe 9. Converting text to features using state-of-the-art embeddings

Now that all the text preprocessing steps have been discussed, let’s explore feature engineering, the foundation for natural language processing. As you know, machines or algorithms cannot understand characters, words, or sentences. They can only take numbers as input, including binary values. But textual data is inherently unstructured and noisy, which makes it impossible for machines to work with it directly.

The procedure of converting raw text into a machine-understandable format (numbers) is called feature engineering. The performance and accuracy of machine learning and deep learning algorithms are fundamentally dependent on the feature engineering technique.

This chapter discusses different feature engineering methods and techniques; their functionalities, advantages, and disadvantages; and examples to help you realize the importance of feature engineering.

Recipe 3-1. Converting Text to Features Using One-Hot Encoding

One-hot encoding is the traditional method used in feature engineering. Anyone who knows the basics of machine learning has come across one-hot encoding. It is the process of converting categorical variables into features or columns and coding one or zero for that particular category. The same logic is used here, and the number of features is the number of total tokens present in the corpus.

Problem

You want to convert text to a feature using one-hot encoding.

Solution

One-hot encoding converts characters or words into binary numbers, as shown next.

                 I   love   NLP   is   future
I love NLP       1   1      1     0    0
NLP is future    0   0      1     1    1

How It Works

There are many functions to generate one-hot encoding features. Let’s take one function and discuss it in depth.

Step 1-1. Store the text in a variable

The following shows a single line.
Text = "I am learning NLP"

Step 1-2. Execute a function on the text data

The following is a function from the pandas library to convert text into a feature.
# Importing the library
import pandas as pd
# Generating the features
pd.get_dummies(Text.split())
Result :
   I  NLP  am  learning
0  1    0   0         0
1  0    0   1         0
2  0    0   0         1
3  0    1   0         0

The output has four features since the number of distinct words present in the input was 4.
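To reproduce the document-level table shown in the solution, here is a minimal sketch (assuming simple whitespace tokenization) that builds a presence/absence matrix for two documents.
# a minimal sketch: one-hot (presence/absence) features for two documents
import pandas as pd
docs = ["I love NLP", "NLP is future"]
tokens = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokens for word in doc})
onehot = pd.DataFrame(
    [[1 if word in doc else 0 for word in vocab] for doc in tokens],
    index=docs, columns=vocab)
print(onehot)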

Recipe 3-2. Converting Text to Features Using a Count Vectorizer

The approach used in Recipe 3-1 has a disadvantage: it does not consider the frequency of a word. If a particular word appears multiple times, that information is lost unless the counts are included in the analysis. A count vectorizer solves that problem. This recipe covers another method for converting text to a feature: the count vectorizer.

Problem

How do you convert text to a feature using a count vectorizer?

Solution

A count vectorizer is similar to one-hot encoding, but instead of checking whether a particular word is present or not, it counts the words that are present in the document.

In the following example, the words I and NLP occur twice in the first document.

                                              I   love   NLP   is   future   will   learn   in   2month
I love NLP and I will learn NLP in 2 months   2   1      2     0    0        1      1       1    1
NLP is future                                 0   0      1     1    1        0      0       0    0

How It Works

sklearn has a feature extraction function that extracts features out of text. Let’s look at how to execute this. The following imports the CountVectorizer function from sklearn.
#importing the function
from sklearn.feature_extraction.text import CountVectorizer
# Text
text = ["I love NLP and I will learn NLP in 2month "]
# create the transform
vectorizer = CountVectorizer()
# tokenizing
vectorizer.fit(text)
# encode document
vector = vectorizer.transform(text)
# summarize & generating output
print(vectorizer.vocabulary_)
print(vector.toarray())
Result:
{'love': 4, 'nlp': 5, 'and': 1, 'will': 6, 'learn': 3, 'in': 2, '2month': 0}
[[1 1 1 1 1 2 1]]

The token nlp has index 5 in the vocabulary, and the corresponding count in the vector is 2 because it appears twice in the document.
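To see the full document-term matrix from the solution table, here is a quick sketch (with the same two example documents). Note that CountVectorizer lowercases the text and, by default, ignores single-character tokens such as I.
# a quick sketch: count features for two documents
from sklearn.feature_extraction.text import CountVectorizer
docs = ["I love NLP and I will learn NLP in 2month",
        "NLP is future"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)
print(vectorizer.vocabulary_)
print(counts.toarray())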

Recipe 3-3. Generating n-grams

In the preceding methods, each word was considered a feature. There is a drawback to this method. It does not consider the previous and next words, which together may convey the proper and complete meaning. For example, consider the phrase not bad. If it is split into individual words, the overall sense of good, which is roughly what this phrase means, is lost.

As you saw, you could lose potential information or insights because many words make sense once they are put together. n-grams can solve this problem.

n-grams are combinations of multiple letters or multiple words. They are formed in such a way that even the previous and next words are captured.
  • A unigram is a single word in a sentence.

  • A bigram is the combination of two words.

  • A trigram is the combination of three words. And so on.

For example, look at the sentence, “I am learning NLP.”
  • Unigrams: “I”, “am”, “learning”, “NLP”

  • Bigrams: “I am”, “am learning”, “learning NLP”

  • Trigrams: “I am learning”, “am learning NLP”

Problem

Generate the n-grams for a given sentence.

Solution

There are a lot of packages that generate n-grams. TextBlob is one of the most commonly used.

How It Works

Follow the steps in this section.

Step 3-1. Generate n-grams using TextBlob

Let’s look at how to generate n-grams using TextBlob .
Text = "I am learning NLP"
Use the following TextBlob function to create n-grams. Use the text that is defined and state the n based on the requirement.
#Import textblob
from textblob import TextBlob
#For unigram : Use n = 1
TextBlob(Text).ngrams(1)
This is the output.
[WordList(['I']), WordList(['am']), WordList(['learning']), WordList(['NLP'])]
#For Bigram : For bigrams, use n = 2
TextBlob(Text).ngrams(2)
[WordList(['I', 'am']),
 WordList(['am', 'learning']),
 WordList(['learning', 'NLP'])]

The output contains three word lists, each with two words.

Step 3-2. Generate bigram-based features for a document

Just like in the last recipe, a count vectorizer generates the features. Using the same function, let’s generate bigram features and see what the output looks like.
#importing the function
from sklearn.feature_extraction.text import CountVectorizer
# Text
text = ["I love NLP and I will learn NLP in 2month "]
# create the transform
vectorizer = CountVectorizer(ngram_range=(2,2))
# tokenizing
vectorizer.fit(text)
# encode document
vector = vectorizer.transform(text)
# summarize & generating output
print(vectorizer.vocabulary_)
print(vector.toarray())
This is the result .
{'love nlp': 3, 'nlp and': 4, 'and will': 0, 'will learn': 6, 'learn nlp': 2, 'nlp in': 5, 'in 2month': 1}
[[1 1 1 1 1 1 1]]

The output has features with bigrams; in the example, the count is 1 for all tokens. You can similarly use trigrams.
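As a quick illustration, the same sentence can be vectorized with trigram features by setting ngram_range=(3,3).
# a quick sketch: trigram features for the same sentence
from sklearn.feature_extraction.text import CountVectorizer
text = ["I love NLP and I will learn NLP in 2month "]
vectorizer = CountVectorizer(ngram_range=(3,3))
vector = vectorizer.fit_transform(text)
print(vectorizer.vocabulary_)
print(vector.toarray())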

Recipe 3-4. Generating a Co-occurrence Matrix

Let’s discuss a feature engineering method called a co-occurrence matrix.

Problem

You want to understand and generate a co-occurrence matrix.

Solution

A co-occurrence matrix is like a count vectorizer; it counts the occurrence of a group of words rather than individual words.

How It Works

Let’s look at generating this kind of matrix using NLTK, bigrams, and some basic Python coding skills.

Step 4-1. Import the necessary libraries

Here is the code.
import numpy as np
import nltk
from nltk import bigrams
import itertools

Step 4-2. Create function for a co-occurrence matrix

The following is the co_occurrence_matrix function .
def co_occurrence_matrix(corpus):
    vocab = set(corpus)
    vocab = list(vocab)
    vocab_to_index = { word:i for i, word in enumerate(vocab) }
    # Create bigrams from all words in corpus
    bi_grams = list(bigrams(corpus))
    # Frequency distribution of bigrams ((word1, word2), num_occurrences)
    bigram_freq = nltk.FreqDist(bi_grams).most_common(len(bi_grams))
    # Initialise co-occurrence matrix
    # co_occurrence_matrix[current][previous]
    co_occurrence_matrix = np.zeros((len(vocab), len(vocab)))
    # Loop through the bigrams taking the current and previous word,
    # and the number of occurrences of the bigram.
    for bigram in bigram_freq:
        current = bigram[0][1]
        previous = bigram[0][0]
        count = bigram[1]
        pos_current = vocab_to_index[current]
        pos_previous = vocab_to_index[previous]
        co_occurrence_matrix[pos_current][pos_previous] = count
    co_occurrence_matrix = np.matrix(co_occurrence_matrix)
    # return the matrix and the index
    return co_occurrence_matrix,vocab_to_index

Step 4-3. Generate a co-occurrence matrix

Here are the sentences for testing .
sentences = [['I', 'love', 'nlp'],
                   ['I', 'love','to' 'learn'],   # note: the missing comma joins 'to' and 'learn' into the single token 'tolearn' (seen in the output below)
                   ['nlp', 'is', 'future'],
                   ['nlp', 'is', 'cool']]
# create one list using many lists
merged = list(itertools.chain.from_iterable(sentences))
matrix, vocab_to_index = co_occurrence_matrix(merged)
# generate the matrix as a DataFrame (pandas was imported as pd in Recipe 3-1)
CoMatrixFinal = pd.DataFrame(matrix, index=vocab_to_index, columns=vocab_to_index)
print(CoMatrixFinal)
           I   is  love  future  tolearn  cool  nlp
I        0.0  0.0   0.0     0.0      0.0   0.0  1.0
is       0.0  0.0   0.0     0.0      0.0   0.0  2.0
love     2.0  0.0   0.0     0.0      0.0   0.0  0.0
future   0.0  1.0   0.0     0.0      0.0   0.0  0.0
tolearn  0.0  0.0   1.0     0.0      0.0   0.0  0.0
cool     0.0  1.0   0.0     0.0      0.0   0.0  0.0
nlp      0.0  0.0   1.0     1.0      1.0   0.0  0.0

The word pairs (I, love) and (nlp, is) appeared together twice, and a few other word pairs appeared only once.

Recipe 3-5. Hash Vectorizing

A count vectorizer and a co-occurrence matrix both have one limitation: the vocabulary can become very large and cause memory/computation issues.

A hash vectorizer is one way to solve this problem.

Problem

You want to understand and generate a hash vectorizer.

Solution

A hash vectorizer is memory efficient; instead of storing tokens as strings, the vectorizer applies the hashing trick to encode them as numerical indexes. The downside is that it is one way: once vectorized, the features cannot be mapped back to the original tokens.

How It Works

Let’s look at an example using sklearn.

Step 5-1. Import the necessary libraries and create a document

Here’s the code.
from sklearn.feature_extraction.text import HashingVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]

Step 5-2. Generate a hash vectorizer matrix

Let’s create a hash vectorizer matrix (HashingVectorizer) with a vector size of 10.
# transform
vectorizer = HashingVectorizer(n_features=10)
# create the hashing vector
vector = vectorizer.transform(text)
# summarize the vector
print(vector.shape)
print(vector.toarray())
(1, 10)
[[ 0.           0.57735027  0.       0.       0.      0.     0.
  -0.57735027  -0.57735027  0.       ]]

It created a vector of size 10, and now it can be used for any supervised/unsupervised tasks.
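To build intuition for the hashing trick, here is a rough sketch. It is not sklearn's actual implementation (which uses MurmurHash3 and a signed variant); it only illustrates how tokens can be mapped to a fixed number of buckets without storing a vocabulary.
# illustrative only: map tokens to a fixed number of buckets via hashing
import hashlib
def token_to_bucket(token, n_features=10):
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features
for token in ["the", "quick", "brown", "fox"]:
    print(token, "->", token_to_bucket(token))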

Recipe 3-6. Converting Text to Features Using TF-IDF

The above-mentioned text-to-feature methods have a few drawbacks, hence the introduction of TF-IDF. The following explains the motivation.
  • Suppose a particular word appears in all the documents in the corpus. It achieves higher importance in our previous methods, even though it may not be relevant to your problem.

  • TF-IDF reflects how important a word is to a document in a collection and hence down-weights words that appear frequently across all the documents.

Problem

You want to convert text to features using TF-IDF.

Solution

Term frequency (TF) is the ratio of the count of a particular word in a sentence to the total number of words in that sentence. TF captures the importance of the word relative to the length of the document. For example, a word that appears 3 times in a sentence of 10 words carries more weight than the same word appearing 3 times in a sentence of 100 words, which is exactly what TF reflects. TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

Inverse document frequency (IDF) is the log of the ratio of the total number of documents (rows) to the number of documents in which a word is present. IDF = log(N/n), where N is the total number of documents, and n is the number of documents in which the word appears.

IDF measures the rareness of a term. Words like a and the show up in all the corpus documents, but rare words are not in all documents. So, if a word appears in almost all the documents, that word is of no use since it does not help with classification or information retrieval. IDF nullifies this problem.

TF-IDF is the simple product of TF and IDF that addresses both drawbacks, making predictions and information retrieval relevant.

TF-IDF = TF * IDF

How It Works

Follow the steps in this section.

Step 6-1. Read the text data

The following is a familiar phrase.
Text = ["The quick brown fox jumped over the lazy dog.",
"The dog.",
"The fox"]

Step 6-2. Create the features

Execute the following code on the text data.
#Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
#Create the transform
vectorizer = TfidfVectorizer()
#Tokenize and build vocab
vectorizer.fit(Text)
#Summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
This is the result.
Text = ["The quick brown fox jumped over the lazy dog.",
"The dog.",
"The fox"]
{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
[ 1.69314718  1.28768207  1.28768207  1.69314718   1.69314718  1.69314718  1.69314718  1.   ]

Observe that the word the appears in all three documents, so it does not add much value. Its IDF value is 1, which is lower than that of all the other tokens.
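By default, TfidfVectorizer uses a smoothed IDF: idf(t) = ln((1 + N) / (1 + n)) + 1, where N is the number of documents and n is the number of documents containing the term. A short sanity check reproduces the values printed above.
# sanity check: sklearn's smoothed IDF for a few terms from the corpus above
import numpy as np
N = 3    # total number of documents
doc_freq = {'the': 3, 'dog': 2, 'fox': 2, 'brown': 1}
for term, n in doc_freq.items():
    print(term, np.log((1 + N) / (1 + n)) + 1)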

All the methods or techniques you have looked at so far are based on frequency. They are called frequency-based embeddings or features. The next recipe looks at prediction-based embeddings, typically called word embeddings.

Recipe 3-7. Implementing Word Embeddings

This recipe assumes that you have a working knowledge of how a neural network works and the mechanisms by which weights in the neural network are updated. If you are new to neural networks, we suggest that you go through Chapter 6 to gain a basic understanding of how a neural network works.

Even though all the previous methods solve most problems, once you get into more complex problems where you want to capture the semantic relation between words (context), these methods fail to perform.

The following explains the challenges with the methods discussed so far.
  • The techniques fail to capture the context and meaning of the words. They depend only on the appearance or frequency of words. You need a way to capture the context or semantic relationships.

      a. I am eating an apple.

      b. I am using an Apple.

    In the example, apple has different meanings when it is used with the different adjacent words eating and using.
  • For a problem like document classification (book classification in a library), a document is huge, and many tokens are generated. In these scenarios, the number of features can get out of control, hampering accuracy and performance.

A machine/algorithm can match two documents or texts and say whether they are the same. But how do we make a machine tell you about cricket or Virat Kohli when you search for MS Dhoni? How do you make the machine understand that the word apple in “An apple is a tasty fruit” is a fruit that can be eaten and not a company?

The answer to these questions lies in creating a representation for words that capture their meanings, semantic relationships, and the different types of contexts they are used in.

Word embeddings address these challenges. A word embedding is a feature-learning technique in which words from the vocabulary are mapped to vectors of real numbers, capturing the contextual hierarchy.

In the following table, every word is represented by a vector of four numbers. Using the word embeddings technique, we derive such vectors for each word so they can be used in later analysis and applications. In the example, the dimension is four, but in practice a dimension greater than 100 is usually used.

Words       Vectors
text         0.36   0.36  -0.43   0.36
idea        -0.56  -0.56   0.72  -0.56
word         0.35  -0.43   0.12   0.72
encode       0.19   0.19   0.19   0.43
document    -0.43   0.19  -0.43   0.43
grams        0.72  -0.43   0.72   0.12
process      0.43   0.72   0.43   0.43
feature      0.12   0.45   0.12   0.87

Problem

You want to implement word embeddings.

Solution

Word embeddings are prediction-based; they use shallow neural networks, and the learned weights are used as the vector representations of words.

word2vec is Google’s deep learning framework for training word embeddings. It uses all the words of the whole corpus and predicts the nearby words. It creates a vector for all the words present in the corpus so that the context is captured. It also performs very well on word similarity and word analogy tasks.

There are two main types of word2vec models.
  • skip-gram

  • Continuous Bag of Words (CBOW)

[Figure: CBOW and skip-gram model architectures]

How It Works

The above figure shows the architecture of the CBOW and skip-gram algorithms used to build word embeddings. Let’s look at how these models work.

skip-gram

The skip-gram model predicts the probabilities of the surrounding context words given a target word.

Let’s take a small sentence and understand how it works. Each sentence generates target words and their context, which are the nearby words. The number of words considered around the target word is called the window size. The following table shows all the possible target and context pairs for a window size of 2 (a short sketch after the table shows how such pairs can be generated). The window size needs to be selected based on the data and the resources at your disposal. The larger the window size, the higher the computing power needed.

Text = “I love NLP and I will learn NLP in 2 months”
 

Sentence window                  Target word   Context
I love NLP                       I             love, NLP
I love NLP and                   love          I, NLP, and
I love NLP and I will learn      NLP           I, love, and, I
in 2 months                      months        in, 2
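The following sketch (plain Python, not the word2vec implementation itself) generates such (target, context) pairs for a chosen window size.
# illustrative sketch: list (target, context) pairs for a given window size
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, end) if j != i]
        pairs.append((target, context))
    return pairs
tokens = "I love NLP and I will learn NLP in 2 months".split()
for target, context in skipgram_pairs(tokens)[:4]:
    print(target, "->", context)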

Since it takes a lot of text and computing power, let’s use sample data to build a skip-gram model.

Import the text corpus and break it into sentences. Perform some cleaning and preprocessing like removing punctuation and digits and splitting the sentences into words or tokens.
#Example sentences
sentences = [['I', 'love', 'nlp'],
                  ['I', 'will', 'learn', 'nlp', 'in', '2','months'],
                  ['nlp', 'is', 'future'],
                  ['nlp', 'saves', 'time', 'and', 'solves', 'lot', 'of', 'industry', 'problems'],
                  ['nlp', 'uses', 'machine', 'learning']]
#import library
!pip install gensim
import gensim
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot
# training the model
skipgram = Word2Vec(sentences, size =50, window = 3, min_count=1,sg = 1)
print(skipgram)
# access vector for one word
print(skipgram['nlp'])
[ 0.00552227 -0.00723104  0.00857073  0.00368054 -0.00071274  0.00837146
  0.00179965 -0.0049786  -0.00448666 -0.00182289  0.00857488 -0.00499459
  0.00188365 -0.0093498   0.00174774 -0.00609793 -0.00533857 -0.007905
 -0.00176814 -0.00024082 -0.00181886 -0.00093836 -0.00382601 -0.00986026
  0.00312014 -0.00821249  0.00787507 -0.00864689 -0.00686584 -0.00370761
  0.0056183   0.00859488 -0.00163146  0.00928791  0.00904601  0.00443816
 -0.00192308  0.00941    -0.00202355 -0.00756564 -0.00105471  0.00170084
  0.00606918 -0.00848301 -0.00543473  0.00747958  0.0003408   0.00512787
 -0.00909613  0.00683905]
Since our vector size parameter was 50, the model gives a vector of size 50 for each word.
# access vector for another one word
print(skipgram['deep'])
KeyError: "word 'deep' not in vocabulary"

We get an error saying the word doesn’t exist because this word was not in our input training data. We need to train the algorithm on as large a dataset as possible so that we do not miss words.

There is one more way to tackle this problem. Read Recipe 3-8 (fastText) for the answer.
# save model
skipgram.save('skipgram.bin')
# load model
skipgram = Word2Vec.load('skipgram.bin')
A 2-D projection plot (e.g., t-SNE or PCA) is one way to evaluate word embeddings. Let’s generate one using PCA and see how it looks.
# 2-D projection plot of the word vectors (PCA)
X = skipgram[skipgram.wv.vocab]
pca = PCA(n_components=2)
result = pca.fit_transform(X)
# create a scatter plot of the projection
pyplot.scatter(result[:, 0], result[:, 1])
words = list(skipgram.wv.vocab)
for i, word in enumerate(words):
       pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()
[Figure: 2-D projection of the skip-gram word vectors]

Continuous Bag of Words (CBOW)

Now let’s look at how to build a CBOW model .
#import library
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot
#Example sentences
sentences = [['I', 'love', 'nlp'],
                   ['I', 'will', 'learn', 'nlp', 'in', '2','months'],
                   ['nlp', 'is', 'future'],
                   ['nlp', 'saves', 'time', 'and', 'solves', 'lot', 'of', 'industry', 'problems'],
                   ['nlp', 'uses', 'machine', 'learning']]
# training the model (sg=0 selects the CBOW architecture)
cbow = Word2Vec(sentences, size =50, window = 3, min_count=1, sg = 0)
print(cbow)
# access vector for one word
print(cbow['nlp'])
# save model
cbow.save('cbow.bin')
# load model
cbow = Word2Vec.load('cbow.bin')
# 2-D projection plot of the word vectors (PCA)
X = cbow[cbow.wv.vocab]
pca = PCA(n_components=2)
result = pca.fit_transform(X)
# create a scatter plot of the projection
pyplot.scatter(result[:, 0], result[:, 1])
words = list(cbow.wv.vocab)
for i, word in enumerate(words):
       pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()
[Figure: 2-D projection of the CBOW word vectors]

Training these models requires a huge amount of computing power. Let’s use Google’s pre-trained model instead, which was trained on a corpus of about 100 billion words.

Download the model from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit and keep it in your local storage.


Import the gensim package and follow the steps to learn Google’s word2vec.
# import gensim package
import gensim
# load the saved model (recent gensim versions expose this as KeyedVectors.load_word2vec_format)
model = gensim.models.KeyedVectors.load_word2vec_format(r'C:\Users\GoogleNews-vectors-negative300.bin', binary=True)
#Checking how similarity works.
print (model.similarity('this', 'is'))
Output:
0.407970363878
#Lets check one more.
print (model.similarity('post', 'book'))
Output:
0.0572043891977
The words this and is have a good amount of similarity, but the similarity between the words post and book is poor. For any given pair of words, the model uses the vectors of both words and calculates the similarity between them.
# Finding the odd one out.
model.doesnt_match('breakfast cereal dinner lunch'.split())
The output is
'cereal'
Among breakfast, cereal, dinner, and lunch, the word cereal is the least related to all the other three words.
# It can also find relationships between words.
model.most_similar(positive=['woman', 'king'], negative=['man'])
This is the output.
queen: 0.7699

If you add woman to king and subtract man, it predicts queen as the closest word with a similarity of about 0.77. Isn’t this amazing?

[Figure: word analogy example (king - man + woman ≈ queen)]

Let’s look at a few interesting examples of word embeddings plotted with t-SNE, such as terms for home interiors and exteriors. All the words related to electrical fittings are near each other; similarly, words related to bathroom fittings are near each other, and so on. This is the beauty of word embeddings.

[Figure: t-SNE plot of word embeddings for home interior and exterior terms]

Recipe 3-8. Implementing fastText

fastText is another deep learning framework developed by Facebook to capture context and generate a feature vector.

Problem

You want to learn how to implement fastText in Python.

Solution

fastText is an improved version of word2vec. word2vec considers only whole words when building representations, whereas fastText also uses the characters (character n-grams) of each word while computing its representation.
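As an illustration of the idea (a simplified sketch, not fastText's exact internals), a word can be decomposed into character n-grams, typically with boundary markers added around the word.
# illustrative sketch: character n-grams of a word with boundary markers
def char_ngrams(word, min_n=1, max_n=2):
    marked = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams
print(char_ngrams("nlp", min_n=1, max_n=2))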

How It Works

Let’s look at how to build a fastText word embedding.
# Import FastText
from gensim.models import FastText
from sklearn.decomposition import PCA
from matplotlib import pyplot
#Example sentences
sentences = [['I', 'love', 'nlp'],
                   ['I', 'will', 'learn', 'nlp', 'in', '2','months'],
                   ['nlp', 'is', 'future'],
                   ['nlp', 'saves', 'time', 'and', 'solves', 'lot', 'of', 'industry', 'problems'],
                   ['nlp', 'uses', 'machine', 'learning']]
fast = FastText(sentences,size=20, window=1, min_count=1, workers=5, min_n=1, max_n=2)
# vector for word nlp
print(fast['nlp'])
[-0.00459182  0.00607472 -0.01119007  0.00555629 -0.00781679  -0.01376211
  0.00675235 -0.00840158 -0.00319737  0.00924599  0.00214165  -0.01063819
  0.01226836  0.00852781  0.01361119 -0.00257012  0.00819397  -0.00410289
 -0.0053979  -0.01360016]
# vector for word deep
print(fast['deep'])
[ 0.00271002 -0.00242539 -0.00771885 -0.00396854  0.0114902   -0.00640606
  0.00637542 -0.01248098 -0.01207364  0.01400793 -0.00476079  -0.00230879
  0.02009759 -0.01952532  0.01558956 -0.01581665  0.00510567  -0.00957186
 -0.00963234 -0.02059373]
This is the advantage of using fastText. The word deep was not present in the word2vec training data, so we did not get a vector for it there. But since fastText builds representations at the character level, it provides a result even for a word that was not in the training data. You can see the vector for the word deep.
# save and load the model
fast.save('fast.bin')
fast = FastText.load('fast.bin')
# visualize
X = fast[fast.wv.vocab]
pca = PCA(n_components=2)
result = pca.fit_transform(X)
# create a scatter plot of the projection
pyplot.scatter(result[:, 0], result[:, 1])
words = list(fast.wv.vocab)
for i, word in enumerate(words):
      pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()
[Figure: 2-D projection of the fastText word vectors]

The figure above shows the embedding representation for fastText. If you observe closely, the words love and solve are close together in fastText, whereas in the skip-gram and CBOW plots, love and learn are near each other. This is because of the character-level embeddings.

Recipe 3-9. Converting Text to Features Using State-of-the-Art Embeddings

Let’s discuss and implement some advanced context-based feature engineering methods.

Problem

You want to convert text to features using state-of-the-art embeddings.

Solution

Let’s discuss the following seven methods.
  • GloVe Embedding

  • ELMo

  • Sentence encoders
    • doc2vec

    • Sentence-BERT

    • Universal Encoder

    • InferSent

  • Open-AI GPT

GloVe (Global Vectors) is an alternative word embedding method that creates a vector space for words. The GloVe model trains on the co-occurrence counts of words and produces the vector space by minimizing a least squares error.

In GloVe, you first construct a co-occurrence matrix: each row is a word, and each column is a context. The matrix counts how frequently each word appears with each context. Since the context dimension is very large, you want to reduce it and learn a low-dimensional representation of the word embedding. This process can be regarded as reconstructing the co-occurrence matrix, i.e., minimizing a reconstruction loss. The motivation of GloVe is to explicitly force the model to learn such relationships based on the co-occurrence matrix.

word2vec (both skip-gram and CBOW) is predictive and ignores the fact that some context words occur more often than others. It only takes the local context into consideration and hence fails to capture the global context.

While word2vec predicts the context of a given word, GloVe learns by constructing a co-occurrence matrix.

word2vec does not have global information embedded in it, while GloVe creates a global co-occurrence matrix that counts the frequency of each context with each word. The presence of this global information often makes GloVe embeddings better.

GloVe does not learn with a neural network like word2vec. Instead, it minimizes a simple loss function based on the difference between the dot product of the word embeddings and the log of the co-occurrence count.
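For reference, the weighted least squares objective from the GloVe paper can be written (in plain notation) as

J = sum over word pairs (i, j) of f(X_ij) * (w_i . w'_j + b_i + b'_j - log(X_ij))^2

where X_ij is the co-occurrence count of words i and j, w_i and w'_j are the word and context vectors, b_i and b'_j are bias terms, and f is a weighting function that limits the influence of very frequent co-occurrences.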

The research paper is at https://nlp.stanford.edu/pubs/glove.pdf.

ELMo

ELMo vectors are functions of the entire sentence in which a word appears. The main advantage of this method is that the same word can have different vectors under different contexts.

ELMo is a deep contextualized word representation model. It looks at complex characteristics of words (e.g., syntax and semantics), and studies how they vary across linguistic contexts (i.e., to model polysemy).

Word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus.

Words that carry different meanings in different contexts are called polysemous words. ELMo can successfully handle words of this nature, which GloVe and fastText fail to capture.

The research paper is at www.aclweb.org/anthology/N18-1202.pdf.


Sentence Encoders

Why learn sentence embeddings? Traditional techniques average the word embeddings to form a sentence embedding. But there are cons to this approach: the order of the words is not considered, so the similarity obtained by averaging word vectors stays the same even if the words in a sentence are swapped.

doc2vec

doc2vec is based on word2vec. Words maintain a grammatical structure, but documents don’t have any grammatical structures. To solve this problem, another vector (paragraph ID) is added to the word2vec model. This is the only difference between word2vec and doc2vec.

To represent a sentence, word2vec-based approaches average the vectors of all its words, while doc2vec directly learns a vector for the sentence or document. Like word2vec, there are two doc2vec models available.
  • Distributed Memory Model of Paragraph Vectors (PV-DM)

  • Distributed Bag of Words version of Paragraph Vector (PV-DBOW)

The distributed memory (DM) model is similar to the CBOW model. CBOW predicts the target word given its context as an input, whereas in doc2vec, a paragraph ID is added.

The Distributed Bag-Of-Words (DBOW) model is similar to the skip-gram model in word2vec, which predicts the context words from a target word. This model only takes paragraph ID as input and predicts context from the vocabulary.

The research paper is at https://cs.stanford.edu/~quocle/paragraph_vector.pdf.

Sentence-BERT

Sentence-BERT (SBERT) is a modification of the pre-trained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity. This reduces the effort of finding the most similar pair from 65 hours with BERT/RoBERTa to about 5 seconds with SBERT while maintaining the accuracy from BERT.

The leader of the pack, Sentence-BERT, was introduced in 2019 and immediately took the pole position for sentence embeddings. There are four key concepts at the heart of this BERT-based model.
  • Attention

  • Transformers

  • BERT

  • Siamese networks

Sentences are passed through a BERT model and a pooling layer to generate their embeddings.

The research paper is at www.aclweb.org/anthology/D19-1410.pdf.

Universal Encoder

The Universal Sentence Encoder model specifically targets transfer learning to the NLP tasks and generates embeddings.

It is trained on a variety of data sources to learn for a wide variety of tasks. The sources are Wikipedia, web news, web question-answer pages, and discussion forums. The input is variable-length English text, and the output is a 512-dimensional vector.

Traditionally, sentence embeddings were calculated by averaging the embeddings of all the words in the sentence; however, simply adding or averaging has limitations and is not well suited for deriving the true semantic meaning of the sentence. The Universal Sentence Encoder makes getting sentence-level embeddings easy.

Two variants of the TensorFlow model allow for trade-offs between accuracy and computing resources.
  • Transformers

  • Deep Average Network

The research paper is at https://arxiv.org/pdf/1803.11175v2.pdf.

InferSent

In 2017, Facebook introduced InferSent, a sentence representation model trained on the Stanford Natural Language Inference (SNLI) dataset. SNLI is a dataset of 570,000 English sentence pairs; each pair consists of a premise and a hypothesis labeled with one of the following categories: entailment, contradiction, or neutral.

The research paper is at https://arxiv.org/pdf/1705.02364.pdf.

Open-AI GPT

The GPT (Generative Pre-trained Transformer ) architecture implements a deep neural network, specifically a transformer model, which uses attention in place of previous recurrence-based and convolution-based architectures. Attention mechanisms allow the model to selectively focus on segments of input text it predicts to be the most relevant.

Due to the broadness of the dataset on which it is trained and the broadness of its approach, GPT became capable of performing a diverse range of tasks beyond simple text generation: answering questions, summarizing, and even translating between languages in a variety of specific domains.

The research paper is at https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.

How It Works

Download the dataset from www.kaggle.com/rounakbanik/ted-talks and keep it in your local folder. Then follow the steps in this section.


Step 9-1. Import a notebook and data to Google Colab

Google Colab is used for this project because the BERT-family models are huge, and building in Colab is easier and faster.

Go to Google Colab at https://colab.research.google.com/notebooks/intro.ipynb.


Go to File and open a new notebook, or upload a notebook from your local machine by selecting “Upload notebook”.

[Figure: uploading a notebook in Google Colab]

To import the data, go to Files, click the Upload to Session Storage option, and then import the csv file.

[Figure: uploading the CSV file to Colab session storage]

Step 9-2. Install and import libraries

#If any of these libraries are not installed, please install them using pip before importing.
import pandas as pd
import numpy as np
import scipy
import os
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
import string
import csv
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer # used for preprocessing
import warnings
warnings.filterwarnings('ignore')
from sklearn import preprocessing
import spacy
from tqdm import tqdm
import re
import matplotlib.pyplot as plt # our main display package
import plotly.graph_objects as go
import tensorflow_hub as hub
import tensorflow as tf
print(tf.__version__)

Step 9-3. Read text data

df = pd.read_csv('Ted talks.csv')
df_sample=df.iloc[0:100,:]

Step 9-4. Process text data

Let’s implement the preprocessing steps that you learned in Chapter 2.
# remove urls, handles, and the hashtag from hashtags
def remove_urls(text):
    new_text = ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z ])|(\w+://\S+)", " ", text).split())
    return new_text
# make all text lowercase
def text_lowercase(text):
    return text.lower()
# remove numbers
def remove_numbers(text):
    result = re.sub(r'\d+', '', text)
    return result
# remove punctuation
def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)
# tokenize
def tokenize(text):
    text = word_tokenize(text)
    return text
# remove stopwords
stop_words = set(stopwords.words('english'))
def remove_stopwords(text):
    text = [i for i in text if not i in stop_words]
    return text
# lemmatize
lemmatizer = WordNetLemmatizer()
def lemmatize(text):
    text = [lemmatizer.lemmatize(token) for token in text]
    return text
def preprocessing(text):
    text = text_lowercase(text)
    text = remove_urls(text)
    text = remove_numbers(text)
    text = remove_punctuation(text)
    text = tokenize(text)
    text = remove_stopwords(text)
    text = lemmatize(text)
    text = ' '.join(text)
    return text
#preprocessing input
for i in range(df_sample.shape[0]):
    df_sample['description'][i]=preprocessing(str(df_sample['description'][i]))
#in case the description has newline characters
df_sample['description'] = df_sample['description'].str.replace('\n', ' ')

Step 9-5. Generate a feature vector

#Implementations of above methods
#GloVe:
#loading pre-trained glove model
#downloading and unzipping all word embeddings
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip
!ls
!pwd
#importing 100-d glove model
glove_model_100vec = pd.read_table("glove.6B.100d.txt", sep=" ", index_col=0, header=None, quoting=csv.QUOTE_NONE)
# getting mean vector for each sentence
def get_mean_vector(glove_model, words):
    # remove out-of-vocabulary words
    #assuming 100-d vector
    words = [word for word in word_tokenize(words) if word in list(glove_model_100vec.index)] #if word is in vocab
    if len(words) >= 1:
        return np.mean(glove_model_100vec.loc[words].values, axis=0)
    else:
        return np.array([0]*100)
#creating empty list and appending all mean arrays for comparing cosine similarities
glove_vec=[]
for i in df_sample.description:
    glove_vec.append(list(get_mean_vector(glove_model_100vec, i)))
glove_vec=np.asarray(glove_vec)
glove_vec
#output
array([[-0.11690753,  0.17445151,  0.04606778, ..., -0.48718723,
         0.28744267,  0.16625453],
       [-0.12658561,  0.17125735,  0.44709804, ..., -0.18936391,
         0.51547109,  0.2958283 ],
       [-0.06018609,  0.12372995,  0.27105957, ..., -0.38565426,
         0.39135596,  0.2519755 ],
       ...,
       [-0.12469988,  0.11091088,  0.16328073, ..., -0.08730062,
         0.25822592,  0.12540627],
       [ 0.09014104,  0.09796044,  0.13403036, ..., -0.371885  ,
         0.19138244,  0.05781978],
       [ 0.00891036,  0.09064478,  0.22670132, ..., -0.26099886,
         0.47415786,  0.30951336]])
ELMo:
# Due to an open issue with TensorFlow Hub on the latest version (2.x), we are downgrading to a TensorFlow 1.x version
#!pip uninstall tensorflow
!pip install tensorflow==1.15
import tensorflow as tf
import tensorflow_hub as hub
print(tf.__version__)
#Load pre-trained model
embed_ = hub.Module("https://tfhub.dev/google/elmo/3")
#function to average word vectors of each sentence
def elmo_vectors_sentence(x):
  sentence_embeddings = embed_(x.tolist(), signature="default", as_dict=True)["elmo"]
  with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    # Average of each vector
    return sess.run(tf.reduce_mean(sentence_embeddings,1))
#if your dataset is large, process it in batches of 100 samples: uncomment and run the code below. Since we have just 100 samples, we are not doing this here
 #samples= [df[i:i+100] for i in range(0,df.shape[0],100)]
 # elmo_vec = [elmo_vectors_sentence(x['description']) for x in samples]
 #elmo_vec_full= np.concatenate(elmo_vec, axis = 0)
#embeddings on our dataset
elmo_vec = elmo_vectors_sentence(df_sample['description'])
elmo_vec
#output
array([[ 0.0109894 , -0.16668989, -0.06553215, ...,  0.07014981,
         0.09196191,  0.04669906],
       [ 0.15317157, -0.19256656,  0.01390844, ...,  0.03459582,
         0.28029835,  0.11106762],
       [ 0.20210212, -0.13186318, -0.20647219, ..., -0.15281932,
         0.12729007,  0.17192583],
       ...,
       [ 0.29017407, -0.45098212,  0.0250571 , ..., -0.12281103,
         0.23303834,  0.15486737],
       [ 0.22871418,  0.12254314, -0.22637479, ...,  0.04150296,
         0.31900924,  0.28121516],
       [ 0.05940952,  0.01366339, -0.17502695, ...,  0.20946877,
         0.0020928 ,  0.1114894 ]], dtype=float32)
Doc2Vec:
#importing doc2vec and tagged document
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
#tokenizing data
tokenized_data=[word_tokenize(word) for word in df_sample.description]
train_data=[TaggedDocument(d, [i]) for i, d in enumerate(tokenized_data)]   #adding paragraph id as mentioned in explanation for training
train_data
#output
[TaggedDocument(words=['sir', 'ken', 'robinson', 'make', 'entertaining', 'profoundly', 'moving', 'case', 'creating', 'education', 'system', 'nurture', 'rather', 'undermines', 'creativity'], tags=[0]),
 TaggedDocument(words=['humor', 'humanity', 'exuded', 'inconvenient', 'truth', 'al', 'gore', 'spell', 'way', 'individual', 'address', 'climate', 'change', 'immediately', 'buying', 'hybrid', 'inventing', 'new', 'hotter', 'brand', 'name', 'global', 'warming'], tags=[1]),
 TaggedDocument(words=['new', 'york', 'time', 'columnist', 'david', 'pogue', 'take', 'aim', 'technology', 'worst', 'interface', 'design', 'offender', 'provides', 'encouraging', 'example', 'product', 'get', 'right', 'funny', 'thing', 'burst', 'song'], tags=[2]),
 TaggedDocument(words=['emotionally', 'charged', 'talk', 'macarthur', 'winning', 'activist', 'majora', 'carter', 'detail', 'fight', 'environmental', 'justice', 'south', 'bronx', 'show', 'minority', 'neighborhood', 'suffer', 'flawed', 'urban', 'policy'], tags=[3]),
 TaggedDocument(words=['never', 'seen', 'data', 'presented', 'like', 'drama', 'urgency', 'sportscaster', 'statistic', 'guru', 'han', 'rosling', 'debunks', 'myth', 'called', 'developing', 'world'], tags=[4])
……….
## Train doc2vec model
model = Doc2Vec(train_data, vector_size = 100, window = 2, min_count = 1, epochs = 100)
def get_vectors(model,words):
  words = [word for word in word_tokenize(words) if word in list(model.wv.vocab)]
  #words = [word for word in word_tokenize(words) if word in list(model.wv.index_to_key)] #if gensim version is >4.0.0 ,use this line
  if len(words)>=1:
    return model.infer_vector(words)
  else:
    return np.array([0]*100)
#defining empty list
doc2vec_vec=[]
for i in df_sample.description:
    doc2vec_vec.append(list(get_vectors(model, i)))
doc2vec_vec=np.asarray(doc2vec_vec)
doc2vec_vec
#output
array([[ 0.00505156, -0.582084  , -0.33430266, ...,  0.29665616,
        -0.5472022 ,  0.48537165],
       [ 0.05787622, -0.6559785 , -0.41140306, ...,  0.24132295,
        -0.73182726,  0.6089837 ],
       [ 0.02416484, -0.48238695, -0.29850838, ...,  0.2710957 ,
        -0.51971895,  0.4405582 ],
       ...,
       [ 0.0511999 , -0.5991625 , -0.34839907, ...,  0.29519215,
        -0.68761116,  0.4545323 ],
       [ 0.0180944 , -0.8318272 , -0.3488748 , ...,  0.30490136,
        -0.7558393 ,  0.56117946],
       [-0.04790357, -0.66188   , -0.3797214 , ...,  0.34476635,
        -0.7202311 ,  0.5834031 ]], dtype=float32)

Sentence-BERT

#BERT sentence transformer for sentence encoding
!pip install sentence-transformers
#importing bert-base model
from sentence_transformers import SentenceTransformer
sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')
#one more model to try
#model = SentenceTransformer('paraphrase-MiniLM-L12-v2')
#embeding on description column
sentence_embeddings_BERT = sbert_model.encode(df_sample['description'])
print('Sample BERT embedding vector - length', len(sentence_embeddings_BERT[0]))
#output
Sample BERT embedding vector - length 768
sentence_embeddings_BERT
array([[-0.31804532,  0.6571422 ,  0.5327481 , ..., -0.76469   ,
        -0.4919126 ,  0.1543465 ],
       [-0.08962823,  1.0855986 ,  0.37181526, ..., -0.84685326,
         0.5427714 ,  0.32389015],
       [-0.13385592,  0.8280815 ,  0.76139224, ..., -0.33403403,
         0.2664094 , -0.05493931],
       ...,
       [ 0.05133615,  1.1150284 ,  0.75921553, ...,  0.5516633 ,
         0.46614835,  0.28827885],
       [-1.3568689 ,  0.2995725 ,  0.99510914, ...,  0.26881158,
        -0.1879525 ,  0.18646894],
       [-0.20679009,  0.8725009 ,  1.2933054 , ..., -0.44921246,
         0.14516312, -0.2050481 ]], dtype=float32)
sentence_embeddings_BERT.shape
#output
(100, 768)
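These sentence vectors can now be compared directly. For example, here is a quick, illustrative check of the cosine similarity between the first two descriptions (scipy was imported earlier).
# cosine similarity between the first two description embeddings
from scipy.spatial.distance import cosine
similarity = 1 - cosine(sentence_embeddings_BERT[0], sentence_embeddings_BERT[1])
print(similarity)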

Universal Encoder

#Load the pre-trained model
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model_USE = hub.load(module_url)
embeddings_USE = model_USE(df_sample['description'])
embeddings_USE = tf.reshape(embeddings_USE,[100,512])
embeddings_USE.shape
#output
TensorShape([Dimension(100), Dimension(512)])
#output is tensor

Infersent

There are two versions of InferSent. Version 1 uses GloVe, whereas version 2 uses fastText vectors. You can choose to work with any model. We used version 2, so we downloaded the InferSent model and the pre-trained word vectors.
! mkdir encoder
! curl -Lo encoder/infersent2.pkl https://dl.fbaipublicfiles.com/infersent/infersent2.pkl
! mkdir GloVe
! curl -Lo GloVe/glove.840B.300d.zip http://nlp.stanford.edu/data/glove.840B.300d.zip
! unzip GloVe/glove.840B.300d.zip -d GloVe/
from models import InferSent
import torch
V = 2
MODEL_PATH = 'encoder/infersent%s.pkl' % V   # path where the model was downloaded above
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': V}
model_infer = InferSent(params_model)
model_infer.load_state_dict(torch.load(MODEL_PATH))
W2V_PATH = 'GloVe/glove.840B.300d.txt'   # path where the vectors were unzipped above
model_infer.set_w2v_path(W2V_PATH)
#building vocabulary
model_infer.build_vocab(df_sample.description, tokenize=True)
#output
Found 1266(/1294) words with w2v vectors
Vocab size : 1266
#encoding sample dataset
infersent_embed = model_infer.encode(df_sample.description,tokenize=True)
#shape of our vector
infersent_embed.shape
#output
(100, 4096)
infersent_embed
#output
array([[ 0.00320979,  0.0560745 ,  0.11894835, ...,  0.04763867,
         0.02359796,  0.09751415],
       [ 0.00983471,  0.11757359,  0.12201475, ...,  0.06545023,
         0.04181211,  0.07941461],
       [-0.02874381,  0.18418473,  0.12211668, ...,  0.07526097,
         0.06728931,  0.1058861 ],
       ...,
       [ 0.00766308,  0.10781102,  0.13686652, ...,  0.08371441,
         0.01190174,  0.12111058],
       [-0.02874381,  0.20537955,  0.11543981, ...,  0.08811261,
         0.03787484,  0.08826952],
       [ 0.12408942,  0.30591702,  0.23708522, ...,  0.1063919 ,
         0.0908693 ,  0.14098585]], dtype=float32)

Open-AI GPT

#installing necessary model
!pip install pytorch_pretrained_bert
import torch
from pytorch_pretrained_bert import OpenAIGPTTokenizer, OpenAIGPTModel
tokenizer_openai = OpenAIGPTTokenizer.from_pretrained('openai-gpt')  # construct a GPT tokenizer (based on byte-pair encoding)
model_openai = OpenAIGPTModel.from_pretrained('openai-gpt')
model_openai.eval()
print('Model Loaded')
#function to get embedding of each token
def Embedding_openai(Sentence):
  tokens = word_tokenize(Sentence)
  vectors = np.zeros((1,768))
  for word in tokens:
      subwords = tokenizer_openai.tokenize(word)
      indexed_tokens = tokenizer_openai.convert_tokens_to_ids(subwords)
      tokens_tensor = torch.tensor([indexed_tokens])
      with torch.no_grad():
          try:
            vectors += np.array(torch.mean(model_openai(tokens_tensor),1))
          except Exception as ex:
            continue
  vectors /= len(tokens)
  return vectors
# Initialize matrix with dimensions (number of rows) x (vector dimension)
open_ai_vec = np.zeros((df_sample.shape[0], 768))
# generating sentence embedding for each row
for iter in range(df_sample.shape[0]):
    text = df_sample.loc[iter,'description']
    open_ai_vec[iter] = Embedding_openai(text)
open_ai_vec
#output
array([[ 0.16126736,  0.14900037,  0.10306535, ...,  0.22078205,
        -0.38590393, -0.09898915],
       [ 0.17074709,  0.20849738,  0.14996684, ...,  0.21315758,
        -0.46983403,  0.02419061],
       [ 0.25158801,  0.12217634,  0.09847356, ...,  0.25541687,
        -0.44979091, -0.0174561 ],
       ...,
       [ 0.26624974,  0.15842849,  0.10565209, ...,  0.23473342,
        -0.40087843, -0.07652373],
       [ 0.22917288,  0.22115094,  0.09217898, ...,  0.18310198,
        -0.33768173, -0.16026535],
       [ 0.21503123,  0.21615047,  0.04715349, ...,  0.25044506,
        -0.42287723, -0.01473052]])

Step 9-6. Generate a feature vector function automatically using a selected embedding method

#takes input as dataframe and embedding model name as mentioned in function
def get_embed(df,model):
  if model=='Glove':
    return glove_vec
  if model=='ELMO':
    return elmo_vec
  if model=='doc2vec':
    return doc2vec_vec
  if model=='sentenceBERT':
    return sentence_embeddings_BERT
  if model=='USE':
    return embeddings_USE
  if model=='infersent':
    return infersent_embed
  if model=='Open-ai':
    return open_ai_vec
get_embed(df_sample,'ELMO')
#output
array([[ 0.0109894 , -0.16668989, -0.06553215, ...,  0.07014981,
         0.09196191,  0.04669906],
       [ 0.15317157, -0.19256656,  0.01390844, ...,  0.03459582,
         0.28029835,  0.11106762],
       [ 0.20210212, -0.13186318, -0.20647219, ..., -0.15281932,
         0.12729007,  0.17192583],
       ...,
       [ 0.29017407, -0.45098212,  0.0250571 , ..., -0.12281103,
         0.23303834,  0.15486737],
       [ 0.22871418,  0.12254314, -0.22637479, ...,  0.04150296,
         0.31900924,  0.28121516],
       [ 0.05940952,  0.01366339, -0.17502695, ...,  0.20946877,
         0.0020928 ,  0.1114894 ]], dtype=float32)
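As a small, illustrative usage example (any of the model names above can be passed), the returned matrix can feed directly into a similarity search, here finding the talk description most similar to the first one.
# illustrative usage: find the description most similar to the first one
from sklearn.metrics.pairwise import cosine_similarity
vectors = get_embed(df_sample, 'sentenceBERT')
similarities = cosine_similarity(vectors[0:1], vectors)[0]
similarities[0] = -1    # ignore self-similarity
print(df_sample['description'].iloc[similarities.argmax()])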

We hope that you are now comfortable with converting text to features. Now that the data has been cleaned and processed and features have been created, let’s jump into building applications that solve business problems.
