A. Kulkarni, A. Shivananda, Natural Language Processing Recipes, https://doi.org/10.1007/978-1-4842-7351-7_3

3. Converting Text to Features

Akshay Kulkarni¹ and Adarsha Shivananda¹

(1)

Bangalore, Karnataka, India

This chapter covers basic to advanced feature engineering (text to features) methods. By the end of the chapter, you will be comfortable with the following recipes.

Recipe 1. One-hot encoding
Recipe 2. Count vectorizer
Recipe 3. n-grams
Recipe 4. Co-occurrence matrix
Recipe 5. Hash vectorizing
Recipe 6. Term frequency-inverse document frequency (TF-IDF)
Recipe 7. Word embedding
Recipe 8. Implementing fastText
Recipe 9. Converting text to features using state-of-the-art embeddings

Now that all the text preprocessing steps have been discussed, let’s explore feature engineering, the foundation for natural language processing. As you know, machines or algorithms cannot understand characters, words, or sentences. They can only take numbers (including binary values) as input. But textual data is inherently unstructured and noisy, which makes it impossible for machines to work with directly.

The procedure of converting raw text into a machine-understandable format (numbers) is called feature engineering. The performance and accuracy of machine learning and deep learning algorithms are fundamentally dependent on the feature engineering technique.

This chapter discusses different feature engineering methods and techniques; their functionalities, advantages, and disadvantages; and examples to help you realize the importance of feature engineering.

Recipe 3-1. Converting Text to Features Using One-Hot Encoding

One-hot encoding is the traditional method used in feature engineering. Anyone who knows the basics of machine learning has come across one-hot encoding. It is the process of converting categorical variables into features or columns and coding one or zero for that particular category. The same logic is used here, and the number of features is the number of total tokens present in the corpus.

Problem

You want to convert text to a feature using one-hot encoding.

Solution

One-hot encoding converts characters or words into binary numbers, as shown next.

	I	love	NLP	is	Future
I love NLP	1	1	1	0	0
NLP is future	0	0	1	1	1

How It Works

There are many functions to generate one-hot encoding features. Let’s take one function and discuss it in depth.

Step 1-1. Store the text in a variable

The following shows a single line.

Text = "I am learning NLP"

Step 1-2. Execute a function on the text data

The following is a function from the pandas library to convert text into a feature.

# Importing the library

import pandas as pd

# Generating the features

pd.get_dummies(Text.split())

Result (pandas 2.0 and later print True/False instead of 1/0 unless dtype=int is passed):

   I  NLP  am  learning
0  1    0   0         0
1  0    0   1         0
2  0    0   0         1
3  0    1   0         0

The output has four features since the number of distinct words present in the input was 4.
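The pandas call above encodes the words of a single sentence. To reproduce the document-level table shown in the Solution, a minimal pure-Python sketch could look like the following. The variable names (docs, vocab, one_hot) are illustrative and not part of any library.

```python
# Build a binary (one-hot style) document-term matrix by hand.
docs = ["I love NLP", "NLP is future"]

# Vocabulary in first-seen order, so the columns match the Solution table.
vocab = []
for doc in docs:
    for token in doc.split():
        if token not in vocab:
            vocab.append(token)

# 1 if the token occurs in the document, 0 otherwise.
one_hot = [[1 if token in doc.split() else 0 for token in vocab] for doc in docs]

print(vocab)
for doc, row in zip(docs, one_hot):
    print(doc, row)
```

Each row has one entry per vocabulary token, so the number of features equals the number of distinct tokens in the corpus.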

Recipe 3-2. Converting Text to Features Using a Count Vectorizer

The approach used in Recipe 3-1 has a disadvantage. It does not consider the frequency of a word. If a particular word appears multiple times, that information is lost when only its presence is recorded. A count vectorizer solves that problem. This recipe covers another method for converting text to a feature: the count vectorizer.

Problem

How do you convert text to a feature using a count vectorizer?

Solution

A count vectorizer is similar to one-hot encoding, but instead of checking whether a particular word is present or not, it counts the words that are present in the document.

In the following example, the words I and NLP occur twice in the first document.

	I	love	NLP	is	future	will	learn	In	2month
I love NLP and I will learn NLP in 2 months	2	1	2	0	0	1	1	1	1
NLP is future	0	0	1	1	1	0	0	0	0

How It Works

sklearn has a feature extraction function that extracts features out of text. Let’s look at how to execute this. The following imports the CountVectorizer function from sklearn.

#importing the function

from sklearn.feature_extraction.text import CountVectorizer

# Text

text = ["I love NLP and I will learn NLP in 2month "]

# create the transform

vectorizer = CountVectorizer()

# tokenizing

vectorizer.fit(text)

# encode document

vector = vectorizer.transform(text)

# summarize & generating output

print(vectorizer.vocabulary_)

print(vector.toarray())

Result:

{'love': 4, 'nlp': 5, 'and': 1, 'will': 6, 'learn': 3, 'in': 2, '2month': 0}

[[1 1 1 1 1 2 1]]

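The token nlp, at index 5, appears twice in the document. The single-character word I is missing from the vocabulary because CountVectorizer lowercases the text and, by default, keeps only tokens of two or more characters.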
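To see where these numbers come from, the counting can be sketched without sklearn. The sketch assumes CountVectorizer's default behavior (lowercasing, and a token pattern that keeps tokens of two or more word characters); it is an illustration, not the library's implementation.

```python
import re
from collections import Counter

text = "I love NLP and I will learn NLP in 2month "

# Assumption: mimic the defaults (lowercase, tokens of 2+ word characters).
tokens = re.findall(r"\b\w\w+\b", text.lower())
counts = Counter(tokens)

# Alphabetical vocabulary mapped to column indexes.
vocabulary = {token: index for index, token in enumerate(sorted(counts))}
vector = [counts[token] for token in sorted(counts)]

print(vocabulary)
print(vector)
```

The single-character token i never enters the vocabulary, and nlp lands at index 5 with a count of 2.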

Recipe 3-3. Generating n-grams

In the preceding methods, each word was considered a feature. There is a drawback to this method. It does not consider the previous words and the next words to see if it would give a proper and complete meaning. For example, consider the phrase not bad. If it is split into individual words, it loses out on conveying good, which is what this phrase means.

As you saw, you could lose potential information or insights because many words make sense once they are put together. n-grams can solve this problem.

n-grams are sequences of multiple adjacent letters or words. They are formed in such a way that even the previous and next words are captured.

Unigrams are the individual words present in a sentence.
A bigram is the combination of two adjacent words.
A trigram is the combination of three adjacent words. And so on.

For example, look at the sentence, “I am learning NLP.”

Unigrams: “I”, “am”, “learning”, “NLP”
Bigrams: “I am”, “am learning”, “learning NLP”
Trigrams: “I am learning”, “am learning NLP”

Problem

Generate the n-grams for a given sentence.

Solution

There are a lot of packages that generate n-grams. TextBlob is one commonly used option.

How It Works

Follow the steps in this section.

Step 3-1. Generate n-grams using TextBlob

Let’s look at how to generate n-grams using TextBlob .

Text = "I am learning NLP"

Use the following TextBlob function to create n-grams. Use the text that is defined and state the n based on the requirement.

#Import textblob

from textblob import TextBlob

#For unigram : Use n = 1

TextBlob(Text).ngrams(1)

This is the output.

[WordList(['I']), WordList(['am']), WordList(['learning']), WordList(['NLP'])]

#For bigrams : Use n = 2

TextBlob(Text).ngrams(2)

[WordList(['I', 'am']),

WordList(['am', 'learning']),

WordList(['learning', 'NLP'])]

The output contains three lists, each with two words.

Step 3-2. Generate bigram-based features for a document

Just like in the last recipe, a count vectorizer generates the features. Using the same function, let’s generate bigram features to see what the output looks like.

#importing the function

from sklearn.feature_extraction.text import CountVectorizer

# Text

text = ["I love NLP and I will learn NLP in 2month "]

# create the transform

vectorizer = CountVectorizer(ngram_range=(2,2))

# tokenizing

vectorizer.fit(text)

# encode document

vector = vectorizer.transform(text)

# summarize & generating output

print(vectorizer.vocabulary_)

print(vector.toarray())

This is the result .

{'love nlp': 3, 'nlp and': 4, 'and will': 0, 'will learn': 6, 'learn nlp': 2, 'nlp in': 5, 'in 2month': 1}

[[1 1 1 1 1 1 1]]

The output has features with bigrams; in the example, the count is 1 for all tokens. You can similarly use trigrams.
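If you do not want to depend on a library, n-grams over a token list can be produced with a few lines of plain Python. This is a minimal sketch; the helper name ngrams is our own and not the TextBlob or sklearn function.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "I am learning NLP".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A sentence with k tokens yields k - n + 1 n-grams, which is why the four-word sentence gives three bigrams and two trigrams.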

Recipe 3-4. Generating a Co-occurrence Matrix

Let’s discuss a feature engineering method called a co-occurrence matrix.

Problem

You want to understand and generate a co-occurrence matrix.

Solution

A co-occurrence matrix is like a count vectorizer; it counts the occurrence of a group of words rather than individual words.

How It Works

Let’s look at generating this kind of matrix using NLTK, bigrams, and some basic Python coding skills.

Step 4-1. Import the necessary libraries

Here is the code.

import numpy as np

import nltk

from nltk import bigrams

import itertools

Step 4-2. Create function for a co-occurrence matrix

The following is the co_occurrence_matrix function .

def co_occurrence_matrix(corpus):
    vocab = set(corpus)
    vocab = list(vocab)
    vocab_to_index = { word:i for i, word in enumerate(vocab) }

    # Create bigrams from all words in corpus
    bi_grams = list(bigrams(corpus))

    # Frequency distribution of bigrams ((word1, word2), num_occurrences)
    bigram_freq = nltk.FreqDist(bi_grams).most_common(len(bi_grams))

    # Initialise co-occurrence matrix
    # co_occurrence_matrix[current][previous]
    co_occurrence_matrix = np.zeros((len(vocab), len(vocab)))

    # Loop through the bigrams taking the current and previous word,
    # and the number of occurrences of the bigram.
    for bigram in bigram_freq:
        current = bigram[0][1]
        previous = bigram[0][0]
        count = bigram[1]
        pos_current = vocab_to_index[current]
        pos_previous = vocab_to_index[previous]
        co_occurrence_matrix[pos_current][pos_previous] = count

    co_occurrence_matrix = np.matrix(co_occurrence_matrix)

    # return the matrix and the index
    return co_occurrence_matrix,vocab_to_index

Step 4-3. Generate a co-occurrence matrix

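Here are the sentences for testing.

import pandas as pd

# Note: there is no comma between 'to' and 'learn', so Python joins the two
# adjacent string literals into the single token 'tolearn' (see the output).
sentences = [['I', 'love', 'nlp'],
             ['I', 'love','to' 'learn'],
             ['nlp', 'is', 'future'],
             ['nlp', 'is', 'cool']]

# create one list using many lists
merged = list(itertools.chain.from_iterable(sentences))

# the function returns both the matrix and the word-to-index mapping
matrix, vocab_to_index = co_occurrence_matrix(merged)

# generate the matrix
# (row and column order follows Python's set ordering and can differ between runs)
CoMatrixFinal = pd.DataFrame(matrix, index=list(vocab_to_index), columns=list(vocab_to_index))

print(CoMatrixFinal)

           I   is  love  future  tolearn  cool  nlp
I        0.0  0.0   0.0     0.0      0.0   0.0  1.0
is       0.0  0.0   0.0     0.0      0.0   0.0  2.0
love     2.0  0.0   0.0     0.0      0.0   0.0  0.0
future   0.0  1.0   0.0     0.0      0.0   0.0  0.0
tolearn  0.0  0.0   1.0     0.0      0.0   0.0  0.0
cool     0.0  1.0   0.0     0.0      0.0   0.0  0.0
nlp      0.0  0.0   1.0     1.0      1.0   0.0  0.0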

Rows are the current word and columns are the previous word. The pairs (I, love) and (nlp, is) each appeared together twice, and the other pairs appeared only once.
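Because the function above works on one merged list, it also counts pairs that straddle two sentences (for example, nlp followed by I). The following is a hedged variant that counts adjacent pairs inside each sentence only and uses a sorted vocabulary so the layout is reproducible. It assumes 'to' and 'learn' are meant to be separate tokens, and it uses only the standard library.

```python
from collections import Counter

sentences = [['I', 'love', 'nlp'],
             ['I', 'love', 'to', 'learn'],
             ['nlp', 'is', 'future'],
             ['nlp', 'is', 'cool']]

# Count adjacent word pairs inside each sentence only.
pair_counts = Counter()
for sentence in sentences:
    pair_counts.update(zip(sentence, sentence[1:]))

# A sorted vocabulary gives a reproducible row/column order.
vocab = sorted({word for sentence in sentences for word in sentence})
index = {word: i for i, word in enumerate(vocab)}

# matrix[current][previous], the same orientation as co_occurrence_matrix.
matrix = [[0] * len(vocab) for _ in vocab]
for (previous, current), count in pair_counts.items():
    matrix[index[current]][index[previous]] = count

print(vocab)
for word, row in zip(vocab, matrix):
    print(word, row)
```

Whether cross-sentence pairs should count is a modeling choice; keeping sentences separate avoids linking words that never appeared together.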

Recipe 3-5. Hash Vectorizing

A count vectorizer and a co-occurrence matrix both have one limitation: the vocabulary can become very large and cause memory/computation issues.

A hash vectorizer is one way to solve this problem.

Problem

You want to understand and generate a hash vectorizer.

Solution

A hash vectorizer is memory efficient: instead of storing tokens as strings in a vocabulary, the vectorizer applies the hashing trick to encode them as numerical indexes. The downside is that it is one way, and once vectorized, the original tokens cannot be retrieved.

How It Works

Let’s look at an example using sklearn.

Step 5-1. Import the necessary libraries and create a document

Here’s the code.

from sklearn.feature_extraction.text import HashingVectorizer

# list of text documents

text = ["The quick brown fox jumped over the lazy dog."]

Step 5-2. Generate a hash vectorizer matrix

Let’s create a hash vectorizer matrix (HashingVectorizer) with a vector size of 10.

# transform

vectorizer = HashingVectorizer(n_features=10)

# create the hashing vector

vector = vectorizer.transform(text)

# summarize the vector

print(vector.shape)

print(vector.toarray())

(1, 10)

[[ 0. 0.57735027 0. 0. 0. 0. 0.

-0.57735027 -0.57735027 0. ]]

It created a vector of size 10, and now it can be used for any supervised/unsupervised tasks.
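The hashing trick itself is simple enough to sketch by hand. In the following illustration, md5 from the standard library is a stand-in hash chosen only because it is deterministic; scikit-learn uses a signed MurmurHash3 and L2-normalizes each row by default, which is why its output above contains values such as 0.577 with both signs. The function name hash_vectorize is hypothetical.

```python
import hashlib
import re

def hash_vectorize(text, n_features=10):
    """Map tokens to a fixed number of buckets by hashing; no vocabulary is stored."""
    vector = [0] * n_features
    for token in re.findall(r"\b\w\w+\b", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        # The bucket index is the hash value modulo the vector size.
        vector[int(digest, 16) % n_features] += 1
    return vector

vector = hash_vectorize("The quick brown fox jumped over the lazy dog.")
print(len(vector), sum(vector))
```

The vector length stays at n_features no matter how many distinct tokens appear, at the cost of collisions when different tokens share a bucket.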

Recipe 3-6. Converting Text to Features Using TF-IDF

The text-to-feature methods above have a drawback, hence the introduction of TF-IDF.

Let’s say a particular word appears in all the corpus documents. It achieves higher importance in our previous methods, but that may not be relevant to your case.
TF-IDF reflects how important a word is to a document in a collection and hence down-weights words that frequently appear in all the documents.

Problem

You want to convert text to features using TF-IDF.

Solution

Term frequency (TF) is the ratio of the count of a particular word in a sentence to the total number of words in that sentence. Because it is a ratio, TF captures the importance of a word irrespective of the length of the document. For example, a word that occurs 3 times in a 10-word sentence (TF = 0.3) carries more weight than a word that occurs 3 times in a 100-word sentence (TF = 0.03). TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

Inverse document frequency (IDF) is the log of the ratio of the total number of rows (documents) to the number of rows in which a word is present. IDF = log(N/n), where N is the total number of rows, and n is the number of rows in which the word is present.

IDF measures the rareness of a term. Words like a and the show up in all the corpus documents, but rare words are not in all documents. So, if a word appears in almost all the documents, that word is of no use since it does not help with classification or information retrieval. IDF nullifies this problem.

TF-IDF is the simple product of TF and IDF that addresses both drawbacks, making predictions and information retrieval relevant.

TF-IDF = TF * IDF
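The definitions above translate directly into code. The following is a minimal sketch of the textbook formulas (TF as a ratio, IDF = log(N/n)) on a small made-up corpus; the function names are our own, and idf assumes the term occurs in at least one document.

```python
import math

docs = [["the", "quick", "brown", "fox"],
        ["the", "dog"],
        ["the", "fox"]]

def tf(term, doc):
    """Count of the term divided by the number of terms in the document."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log(N / n); assumes the term appears in at least one document."""
    n = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("the", docs[0], docs))    # "the" is in every document
print(tf_idf("quick", docs[0], docs))  # "quick" is in one document only
```

A word that occurs in every document gets IDF = log(1) = 0, so its TF-IDF score is zero regardless of how often it appears.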

How It Works

Follow the steps in this section.

Step 6-1. Read the text data

The following is a familiar phrase.

Text = ["The quick brown fox jumped over the lazy dog.",

"The dog.",

"The fox"]

Step 6-2. Create the features

Execute the following code on the text data.

#Import TfidfVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

#Create the transform

vectorizer = TfidfVectorizer()

#Tokenize and build vocab

vectorizer.fit(Text)

#Summarize

print(vectorizer.vocabulary_)

print(vectorizer.idf_)

This is the result.

{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}

[ 1.69314718 1.28768207 1.28768207 1.69314718 1.69314718 1.69314718 1.69314718 1. ]

Observe that the word "the" appears in all three documents, so it does not add much value. Its IDF value is 1, which is lower than that of all the other tokens.
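The printed IDF values are not plain log(N/n), which would give 0 for a word present in every document. They match the smoothed variant that TfidfVectorizer applies by default (smooth_idf=True): idf = ln((1 + N) / (1 + df)) + 1. The sketch below recomputes the values by hand; the document frequencies are counted from the three example sentences.

```python
import math

n_docs = 3

# Number of documents in which each token appears.
doc_freq = {"brown": 1, "dog": 2, "fox": 2, "jumped": 1,
            "lazy": 1, "over": 1, "quick": 1, "the": 3}

# Smoothed IDF: ln((1 + N) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
       for term, df in doc_freq.items()}

for term in sorted(idf):
    print(term, round(idf[term], 8))
```

Tokens found in one document get about 1.693, those found in two get about 1.288, and the token found in all three gets exactly 1.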

All the methods or techniques you have looked at so far are based on frequency. They are called frequency-based embeddings or features. The next recipe looks at prediction-based embeddings, typically called word embeddings.

Recipe 3-7. Implementing Word Embeddings

This recipe assumes that you have a working knowledge of how a neural network works and the mechanisms by which weights in the neural network are updated. If you are new to neural networks, we suggest that you go through Chapter 6 to gain a basic understanding of how a neural network works.

Even though all the previous methods solve most problems, once you get into more complex problems where you want to capture the semantic relation between words (context), these methods fail to perform.

The following explains the challenges with the methods discussed so far.

The techniques fail to capture the context and meaning of the words. They depend on the appearance or frequency of words. You need to know how to capture the context or semantic relationships.
a. I am eating an apple.
b. I am using an Apple.

In the example, apple has different meanings when it is used with different adjacent words: eating and using.

For a problem like document classification (for example, classifying books in a library), each document is huge, and many tokens are generated. In these scenarios, the number of features can get out of control, hampering accuracy and performance.

A machine/algorithm can match two documents/texts and say whether they are the same or not. How do we make machines talk about cricket or Virat Kohli when you search for MS Dhoni? How do you make the machine understand that the word apple in “An apple is a tasty fruit” is a fruit that can be eaten and not a company?

The answer to these questions lies in creating a representation for words that capture their meanings, semantic relationships, and the different types of contexts they are used in.

Word embeddings address these challenges. A word embedding is a feature-learning technique in which the words of a vocabulary are mapped to vectors of real numbers that capture their contextual relationships.

In the following table, every word is represented by four numbers, called vectors. Using the word embeddings technique, we derived those vectors for each word to use them in future analysis and building applications. In the example, the dimension is four, but you usually use a dimension greater than 100.

Words	Vectors
text	0.36	0.36	-0.43	0.36
idea	-0.56	-0.56	0.72	-0.56
word	0.35	-0.43	0.12	0.72
encode	0.19	0.19	0.19	0.43
document	-0.43	0.19	-0.43	0.43
grams	0.72	-0.43	0.72	0.12
process	0.43	0.72	0.43	0.43
feature	0.12	0.45	0.12	0.87

Problem

You want to implement word embeddings.

Solution

Word embeddings are prediction-based. They use shallow neural networks to train a model, and the learned weights are then used as the vector representation of each word.

word2vec is a framework developed at Google to train word embeddings. It uses all the words of the whole corpus and predicts the nearby words. It creates a vector for every word present in the corpus so that the context is captured. It performs strongly on word similarity and word analogy tasks.

There are mainly two types in word2vec.

skip-gram
Continuous Bag of Words (CBOW)

[Figure: architecture of the CBOW and skip-gram models]

How It Works

The above figure shows the architecture of the CBOW and skip-gram algorithms used to build word embeddings. Let’s look at how these models work.

skip-gram

The skip-gram model¹ predicts the probabilities of the context words given a target word.

Let’s take a small sentence and understand how it works. Each sentence generates a target word and a context, which consists of the nearby words. The number of words to be considered around the target word is called the window size. The following table shows target and context words for window size 2. Window size needs to be selected based on data and the resources at your disposal. The larger the window size, the more computing power is required.

Text = “I love NLP and I will learn NLP in 2 months”

	Target word	Context
I love NLP	I	love, NLP
I love NLP and	love	I, NLP, and
I love NLP and I	NLP	I, love, and, I
…	…	…
in 2 months	months	in, 2
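The target and context words for a given window size can be generated with a short helper. This is an illustrative sketch in plain Python; the function name context_windows is our own and is not part of gensim.

```python
def context_windows(tokens, window=2):
    """Yield (target, context) with up to `window` words on each side of the target."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

tokens = "I love NLP and I will learn NLP in 2 months".split()
pairs = list(context_windows(tokens, window=2))

for target, context in pairs[:3]:
    print(target, context)
```

Words near the start or end of the sentence simply get a shorter context, because there are fewer neighbors on one side.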

Since it takes a lot of text and computing power, let’s use sample data to build a skip-gram model.

Import the text corpus and break it into sentences. Perform some cleaning and preprocessing like removing punctuation and digits and splitting the sentences into words or tokens.

#Example sentences

sentences = [['I', 'love', 'nlp'],

['I', 'will', 'learn', 'nlp', 'in', '2','months'],

['nlp', 'is', 'future'],

['nlp', 'saves', 'time', 'and', 'solves', 'lot', 'of', 'industry', 'problems'],

['nlp', 'uses', 'machine', 'learning']]

#import library

!pip install gensim

import gensim

from gensim.models import Word2Vec

from sklearn.decomposition import PCA

from matplotlib import pyplot

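# training the model (sg = 1 selects skip-gram)
# Note: this is the gensim 3.x API. In gensim 4.x, pass vector_size=50 instead of size=50.

skipgram = Word2Vec(sentences, size =50, window = 3, min_count=1,sg = 1)

print(skipgram)

# access vector for one word
# (in gensim 4.x, use skipgram.wv['nlp'])

print(skipgram['nlp'])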

[ 0.00552227 -0.00723104 0.00857073 0.00368054 -0.00071274 0.00837146

0.00179965 -0.0049786 -0.00448666 -0.00182289 0.00857488 -0.00499459

0.00188365 -0.0093498 0.00174774 -0.00609793 -0.00533857 -0.007905

-0.00176814 -0.00024082 -0.00181886 -0.00093836 -0.00382601 -0.00986026

0.00312014 -0.00821249 0.00787507 -0.00864689 -0.00686584 -0.00370761

0.0056183 0.00859488 -0.00163146 0.00928791 0.00904601 0.00443816

-0.00192308 0.00941 -0.00202355 -0.00756564 -0.00105471 0.00170084

0.00606918 -0.00848301 -0.00543473 0.00747958 0.0003408 0.00512787

-0.00909613 0.00683905]

Since our vector size parameter was 50, the model gives a vector of size 50 for each word.

# access vector for another word

print(skipgram['deep'])

KeyError: "word 'deep' not in vocabulary"

We get an error saying the word doesn’t exist because this word was not in our input training data. We need to train the algorithm on as large a dataset as possible so that we do not miss words.

There is one more way to tackle this problem. Read Recipe 3-8 for the answer.

# save model

skipgram.save('skipgram.bin')

# load model

skipgram = Word2Vec.load('skipgram.bin')

A two-dimensional plot is one way to inspect word embeddings. The following code projects the vectors onto two dimensions with PCA (t-SNE is a common alternative) and plots them.

# 2-D projection with PCA
# (gensim 3.x API; in gensim 4.x, wv.vocab is replaced by wv.key_to_index)

X = skipgram[skipgram.wv.vocab]
pca = PCA(n_components=2)
result = pca.fit_transform(X)

# create a scatter plot of the projection
pyplot.scatter(result[:, 0], result[:, 1])

# label each point with its word
words = list(skipgram.wv.vocab)
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))

pyplot.show()

[Figure: two-dimensional projection of the skip-gram word vectors]

Continuous Bag of Words (CBOW)

Now let’s look at how to build a CBOW model .

#import library

from gensim.models import Word2Vec

from sklearn.decomposition import PCA

from matplotlib import pyplot

#Example sentences

sentences = [['I', 'love', 'nlp'],

['I', 'will', 'learn', 'nlp', 'in', '2','months'],

['nlp', 'is', 'future'],

['nlp', 'saves', 'time', 'and', 'solves', 'lot', 'of', 'industry', 'problems'],

['nlp', 'uses', 'machine', 'learning']]

# training the model (sg = 0 selects CBOW; sg = 1 would train skip-gram)

cbow = Word2Vec(sentences, size =50, window = 3, min_count=1,sg = 0)

print(cbow)

# access vector for one word

print(cbow['nlp'])

# save model

cbow.save('cbow.bin')

# load model

cbow = Word2Vec.load('cbow.bin')

# 2-D projection with PCA

X = cbow[cbow.wv.vocab]
pca = PCA(n_components=2)
result = pca.fit_transform(X)

# create a scatter plot of the projection
pyplot.scatter(result[:, 0], result[:, 1])

# label each point with its word
words = list(cbow.wv.vocab)
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))

pyplot.show()

[Figure: two-dimensional projection of the CBOW word vectors]

Training these models requires a huge amount of computing power. Let’s use Google’s pre-trained model, which has been trained with more than 100 billion words.

Download the model from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit and keep it in your local storage.

Import the gensim package and follow the steps to learn Google’s word2vec.

# import gensim package

import gensim

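# load the saved model
# (KeyedVectors provides load_word2vec_format; the raw string keeps the backslashes in the Windows path)

model = gensim.models.KeyedVectors.load_word2vec_format(r'C:\Users\GoogleNews-vectors-negative300.bin', binary=True)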

#Checking how similarity works.

print (model.similarity('this', 'is'))

Output:

0.407970363878

#Lets check one more.

print (model.similarity('post', 'book'))

Output:

0.0572043891977

This and is have a good amount of similarity, but the similarity between the words post and book is poor. For any given set of words, it uses the vectors of both the words and calculates the similarity between them.
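The similarity score reported here is the cosine similarity between the two word vectors. A minimal sketch of that calculation follows; the two four-dimensional vectors are the illustrative values for text and idea from the table earlier in this recipe, not vectors from the Google model.

```python
import math

def cosine_similarity(a, b):
    """Dot product of the two vectors divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Illustrative four-dimensional vectors.
text = [0.36, 0.36, -0.43, 0.36]
idea = [-0.56, -0.56, 0.72, -0.56]

print(cosine_similarity(text, text))
print(cosine_similarity(text, idea))
```

The score ranges from -1 (opposite directions) to 1 (same direction), and it depends only on the angle between the vectors, not on their lengths.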

# Finding the odd one out.

model.doesnt_match('breakfast cereal dinner lunch'.split())

The output is

'cereal'

Among breakfast, cereal, dinner, and lunch, the word cereal is the least related to all the other three words.

# It can also find relations between words.

model.most_similar(positive=['woman', 'king'], negative=['man'])

This is the output.

queen: 0.7699

If you add woman and king and subtract man, the closest word vector is queen, with a cosine similarity of about 0.77. Isn’t this amazing?
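The idea behind this analogy can be illustrated with toy vectors. In the sketch below, the two-dimensional vectors are hypothetical values invented for illustration (the first axis loosely stands for royalty, the second for gender); real embeddings have hundreds of dimensions, and gensim's own ranking works on normalized vectors.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical 2-D vectors, for illustration only.
vectors = {
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.2],
}

# king - man + woman
query = [k - m + w for k, m, w in
         zip(vectors["king"], vectors["man"], vectors["woman"])]

# Rank the remaining words by cosine similarity to the query vector.
candidates = {w: v for w, v in vectors.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(query, candidates[w]))
print(best)
```

The input words are excluded from the candidates, otherwise one of them would often be the nearest neighbor of the query.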

Let’s look at a few interesting examples using the t-SNE plot for word embeddings, such as for home interiors and exteriors. For example, all the words related to electrical fittings are near each other; similarly, words related to bathroom fittings are near each other, and so on. This is the beauty of word embeddings.

[Figure: t-SNE plot of word embeddings for home interior and exterior terms]

Recipe 3-8. Implementing fastText

fastText is another framework, developed by Facebook, that captures context and generates a feature vector.

Problem

You want to learn how to implement fastText in Python.

Solution

fastText is an improved version of word2vec. word2vec treats each whole word as the unit when building the representation, whereas fastText also uses the characters (character n-grams) of each word when computing its representation.

How It Works

Let’s look at how to build a fastText word embedding.

# Import FastText

from gensim.models import FastText

from sklearn.decomposition import PCA

from matplotlib import pyplot

#Example sentences

sentences = [['I', 'love', 'nlp'],

['I', 'will', 'learn', 'nlp', 'in', '2','months'],

['nlp', 'is', 'future'],

['nlp', 'saves', 'time', 'and', 'solves', 'lot', 'of', 'industry', 'problems'],

['nlp', 'uses', 'machine', 'learning']]

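# Note: this is the gensim 3.x API. In gensim 4.x, use vector_size=20 instead of size=20,
# and fast.wv['nlp'] instead of fast['nlp'].

fast = FastText(sentences,size=20, window=1, min_count=1, workers=5, min_n=1, max_n=2)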

# vector for word nlp

print(fast['nlp'])

[-0.00459182 0.00607472 -0.01119007 0.00555629 -0.00781679 -0.01376211

0.00675235 -0.00840158 -0.00319737 0.00924599 0.00214165 -0.01063819

0.01226836 0.00852781 0.01361119 -0.00257012 0.00819397 -0.00410289

-0.0053979 -0.01360016]

# vector for word deep

print(fast['deep'])

[ 0.00271002 -0.00242539 -0.00771885 -0.00396854 0.0114902 -0.00640606

0.00637542 -0.01248098 -0.01207364 0.01400793 -0.00476079 -0.00230879

0.02009759 -0.01952532 0.01558956 -0.01581665 0.00510567 -0.00957186

-0.00963234 -0.02059373]

This is the advantage of using fastText. The word deep was not present when training word2vec, so we did not get a vector for that word. But since fastText builds representations at the character level, it provides results even for a word that was not there in training. You can see the vector for the word deep.
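To make the character-level idea concrete, the following sketch splits a word into character n-grams. Wrapping the word in < and > boundary markers follows the description in the fastText paper; the helper name char_ngrams is our own, and this is an illustration rather than gensim's internal implementation.

```python
def char_ngrams(word, min_n=3, max_n=3):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]

print(char_ngrams("where"))
print(char_ngrams("deep"))
```

An unseen word such as deep can still be broken into n-grams, and a vector can be assembled from the n-gram vectors learned during training, which is why fastText returns a result where word2vec raises an error.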

# save model

fast.save('fast.bin')

# load model

fast = FastText.load('fast.bin')

# visualize

X = fast[fast.wv.vocab]

pca = PCA(n_components=2)

result = pca.fit_transform(X)

# create a scatter plot of the projection

pyplot.scatter(result[:, 0], result[:, 1])

words = list(fast.wv.vocab)

for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))

pyplot.show()

[Figure: two-dimensional projection of the fastText word vectors]

The figure above shows the embedding representation for fastText. If you observe closely, the words love and solve are close together in fastText, but in skip-gram and CBOW, love and learn are near each other. This is because of character-level embeddings.

Recipe 3-9. Converting Text to Features Using State-of-the-Art Embeddings

Let’s discuss and implement some advanced context-based feature engineering methods.

Problem

You want to learn how to convert text to features using state-of-the-art embeddings.

Solution

Let’s discuss the following seven methods.

GloVe Embedding
ELMo
Sentence encoders
- doc2vec
- Sentence-BERT
- Universal Encoder
- InferSent
Open-AI GPT

GloVe

GloVe is an alternative word embedding method that creates a vector space for words. The GloVe model trains on co-occurrence counts of words and produces the vector space by minimizing a least-squares error.

In GloVe, you first construct a co-occurrence matrix: each row is a word, and each column is a context. This matrix records how often each word occurs with each context. Since the context dimension is very large, you want to reduce it and learn a low-dimensional representation of the word embedding. This process can be regarded as a reconstruction problem for the co-occurrence matrix, with a reconstruction loss to minimize. The motivation of GloVe is to explicitly force the model to learn such a relationship based on the co-occurrence matrix.

The skip-gram and CBOW models of word2vec are predictive and ignore the fact that some context words occur more often than others. They take into consideration only the local context and hence fail to capture the global context.

While word2vec predicts the context of a given word, GloVe learns by constructing a co-occurrence matrix.

word2vec does not have global information embedded, while GloVe creates a global co-occurrence matrix that counts how often each context appears with each word. The presence of global information is the main advantage claimed for GloVe.

GloVe does not learn by a neural network like word2vec. Instead, it minimizes a simple weighted least-squares loss on the difference between the dot product of two word embeddings and the log of their co-occurrence.
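As a rough illustration of this loss, the following sketch computes the contribution of a single word pair. The weighting function with x_max = 100 and alpha = 0.75 follows the values reported in the GloVe paper; the function names and the example embeddings, biases, and count are hypothetical.

```python
import math

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting function: damps rare pairs and caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error between (dot product + biases) and log co-occurrence count."""
    dot = sum(a * b for a, b in zip(w_i, w_j))
    return weight(x_ij) * (dot + b_i + b_j - math.log(x_ij)) ** 2

# Hypothetical 2-D embeddings, biases, and a co-occurrence count of 10.
loss = pair_loss([0.5, 1.0], [1.0, 0.8], 0.1, 0.2, 10)
print(loss)
```

Training sums this term over all word pairs with a nonzero count and adjusts the embeddings and biases to make the total as small as possible.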

The research paper is at https://nlp.stanford.edu/pubs/glove.pdf.

ELMo

ELMo vectors are a function of the entire input sentence. The main advantage of this method is that a word can have different vectors in different contexts.

ELMo is a deep contextualized word representation model. It looks at complex characteristics of words (e.g., syntax and semantics), and studies how they vary across linguistic contexts (i.e., to model polysemy).

Word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus.

Words that have different meanings in different contexts are called polysemous words. ELMo can successfully handle words of this nature, which GloVe and fastText fail to capture.

The research paper is at www.aclweb.org/anthology/N18-1202.pdf.

Sentence Encoders

Why learned sentence embeddings? Traditional techniques use an average of the word embeddings to form sentence embeddings. But there are cons to this approach: the order of the words is not considered, so the averaged vector stays the same if the words in a sentence are swapped.
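The word-order problem is easy to demonstrate. In this small sketch, the two-dimensional word vectors are hypothetical values chosen only for illustration, and average_embedding is our own helper.

```python
# Hypothetical 2-D word vectors, for illustration only.
vectors = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [0.5, 0.5]}

def average_embedding(tokens):
    """Sentence embedding as the element-wise mean of the word vectors."""
    dims = len(next(iter(vectors.values())))
    return [sum(vectors[t][d] for t in tokens) / len(tokens) for d in range(dims)]

a = average_embedding("dog bites man".split())
b = average_embedding("man bites dog".split())

print(a, b, a == b)
```

The two sentences mean very different things, yet their averaged embeddings are identical, which is the limitation that learned sentence encoders aim to overcome.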

doc2vec

doc2vec is based on word2vec. Words maintain a grammatical structure, but documents don’t have any grammatical structures. To solve this problem, another vector (paragraph ID) is added to the word2vec model. This is the only difference between word2vec and doc2vec.

With word2vec, a sentence vector is obtained by taking the mean of the vectors of its words, while doc2vec directly represents a sentence as a vector. Like word2vec, there are two doc2vec models available.

Distributed Memory Model of Paragraph Vectors (PV-DM)
Distributed Bag of Words version of Paragraph Vector (PV-DBOW)

The distributed memory (DM) model is similar to the CBOW model. CBOW predicts the target word given its context as an input, whereas in doc2vec, a paragraph ID is added.

The Distributed Bag-Of-Words (DBOW) model is similar to the skip-gram model in word2vec, which predicts the context words from a target word. This model only takes paragraph ID as input and predicts context from the vocabulary.

The research paper is at https://cs.stanford.edu/~quocle/paragraph_vector.pdf.

Sentence-BERT

Sentence-BERT (SBERT) is a modification of the pre-trained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity. This reduces the effort of finding the most similar pair from 65 hours with BERT/RoBERTa to about 5 seconds with SBERT while maintaining the accuracy from BERT.

Sentence-BERT was introduced in 2019 and quickly became a leading approach for sentence embeddings. There are four key concepts at the heart of this BERT-based model.

Attention
Transformers
BERT
Siamese networks

Sentences are passed to BERT models and a pooling layer to generate their embeddings.

The research paper is at www.aclweb.org/anthology/D19-1410.pdf.

Universal Encoder

The Universal Sentence Encoder model specifically targets transfer learning to other NLP tasks and generates sentence embeddings.

It is trained on a variety of data sources so that it generalizes across a wide variety of tasks. The sources are Wikipedia, web news, web question-answer pages, and discussion forums. The input is variable-length English text, and the output is a 512-dimensional vector.

Traditionally, sentence embeddings were calculated by averaging the embeddings of all the words in the sentence; however, simply adding or averaging has limitations and is not suited to capturing the true semantic meaning of a sentence. The Universal Sentence Encoder makes getting sentence-level embeddings easy.
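
One limitation of averaging is easy to demonstrate: the average ignores word order. In this minimal sketch, the three 2-d word vectors are invented for illustration, and two sentences with opposite meanings end up with identical embeddings.

```python
# Toy word vectors, invented for illustration only.
toy_vectors = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}


def average_embedding(sentence, vectors):
    """Average the vectors of all in-vocabulary words of a sentence."""
    dim = len(next(iter(vectors.values())))
    words = [w for w in sentence.lower().split() if w in vectors]
    if not words:
        return [0.0] * dim
    return [sum(vectors[w][d] for w in words) / len(words) for d in range(dim)]


a = average_embedding("dog bites man", toy_vectors)
b = average_embedding("man bites dog", toy_vectors)
print(a == b)  # True: the average cannot tell the two sentences apart
```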

Two variants of the TensorFlow model allow for trade-offs between accuracy and computing resources.

Transformers
Deep Average Network

The research paper is at https://arxiv.org/pdf/1803.11175v2.pdf.

InferSent

In 2017, Facebook introduced InferSent, a sentence representation model trained on the Stanford Natural Language Inference (SNLI) dataset. SNLI contains 570,000 English sentence pairs. Each pair consists of a premise and a hypothesis and is labeled with one of the following categories: entailment, contradiction, or neutral.
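
The following sketch shows what one such training example looks like and how two sentence vectors can be combined before classification. The combination (u, v, |u - v|, u * v) follows the description in the InferSent paper, not this chapter; the class name `NLIExample`, the sentence pair, and the 2-d vectors are made up for illustration.

```python
from dataclasses import dataclass

LABELS = ("entailment", "contradiction", "neutral")


@dataclass
class NLIExample:
    premise: str
    hypothesis: str
    label: str

    def __post_init__(self):
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label}")


def pair_features(u, v):
    """Concatenate u, v, |u - v| and u * v into one feature vector."""
    diff = [abs(a - b) for a, b in zip(u, v)]
    prod = [a * b for a, b in zip(u, v)]
    return list(u) + list(v) + diff + prod


# An invented example pair, not taken from the SNLI data.
example = NLIExample("A man is playing a guitar.", "A person is making music.", "entailment")
features = pair_features([1.0, 2.0], [3.0, 0.5])
print(features)  # [1.0, 2.0, 3.0, 0.5, 2.0, 1.5, 3.0, 1.0]
```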

The research paper is at https://arxiv.org/pdf/1705.02364.pdf.

Open-AI GPT

The GPT (Generative Pre-trained Transformer) architecture implements a deep neural network, specifically a transformer model, which uses attention in place of previous recurrence-based and convolution-based architectures. Attention mechanisms allow the model to selectively focus on segments of input text it predicts to be the most relevant.
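
As a rough illustration of how attention "focuses" on parts of the input, here is a minimal single-head scaled dot-product attention in plain Python. It applies a causal mask so that each position attends only to itself and earlier positions, as in GPT-style models. Learned projection matrices and multiple heads are deliberately omitted, and the input vectors are made up.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    """Scaled dot-product attention; position i sees only positions <= i."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, query in enumerate(queries):
        scores = [
            sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        output = [
            sum(w * values[j][d] for j, w in enumerate(weights))
            for d in range(len(values[0]))
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_attention(x, x, x)
print(weights[0])  # [1.0]: the first token can only attend to itself
print(weights[2])  # three weights; the largest one belongs to the last token
```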

Due to the broadness of the dataset on which it is trained and the broadness of its approach, GPT became capable of performing a diverse range of tasks beyond simple text generation: answering questions, summarizing, and even translating between languages in a variety of specific domains.

The research paper is at https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.

How It Works

Download the dataset from www.kaggle.com/rounakbanik/ted-talks and keep it in your local folder. Then follow the steps in this section.

Step 9-1. Import a notebook and data to Google Colab

This recipe uses Google Colab because BERT models are large, and running them in Colab is easier and faster.

Go to Google Colab at https://colab.research.google.com/notebooks/intro.ipynb.

Go to File and open a new notebook, or upload one from your local machine by selecting “Upload notebook”.

To import the data, go to Files, click the Upload to Session Storage option, and then import the csv file.

Step 9-2. Install and import libraries

#If any of these libraries are not installed, please install them using pip before importing.

import pandas as pd

import numpy as np

import scipy

import os

import nltk

nltk.download('stopwords')

nltk.download('punkt')

nltk.download('wordnet')

import string

import csv

from nltk.tokenize import word_tokenize

from nltk.corpus import stopwords

from nltk.stem import WordNetLemmatizer # used for preprocessing

import warnings

warnings.filterwarnings('ignore')

from sklearn import preprocessing

import spacy

from tqdm import tqdm

import re

import matplotlib.pyplot as plt # our main display package

import plotly.graph_objects as go

import tensorflow_hub as hub

import tensorflow as tf

print(tf.__version__)

Step 9-3. Read text data

df = pd.read_csv('Ted talks.csv')

df_sample=df.iloc[0:100,:].copy() # copy so later column assignments do not touch df

Step 9-4. Process text data

Let’s implement the preprocessing steps that you learned in Chapter 2.

# remove urls, handles, and the hashtag from hashtags
def remove_urls(text):
    new_text = ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z ])|(\w+://\S+)", " ", text).split())
    return new_text

# make all text lowercase
def text_lowercase(text):
    return text.lower()

# remove numbers
def remove_numbers(text):
    result = re.sub(r'\d+', '', text)
    return result

# remove punctuation
def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

# tokenize
def tokenize(text):
    text = word_tokenize(text)
    return text

# remove stopwords
stop_words = set(stopwords.words('english'))
def remove_stopwords(text):
    text = [i for i in text if not i in stop_words]
    return text

# lemmatize
lemmatizer = WordNetLemmatizer()
def lemmatize(text):
    text = [lemmatizer.lemmatize(token) for token in text]
    return text

def preprocessing(text):
    text = text_lowercase(text)
    text = remove_urls(text)
    text = remove_numbers(text)
    text = remove_punctuation(text)
    text = tokenize(text)
    text = remove_stopwords(text)
    text = lemmatize(text)
    text = ' '.join(text)
    return text

#preprocessing input
df_sample['description'] = df_sample['description'].astype(str).apply(preprocessing)

#in case the description has newline characters
df_sample['description'] = df_sample['description'].str.replace('\n', ' ', regex=False)
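
For a quick check of cleaning steps like these without NLTK, the following simplified variant uses only `re` and `string`. It is a sketch, not the chapter's pipeline: it skips tokenization, stop-word removal, and lemmatization, removes URLs before other symbols, and runs on a made-up sample sentence.

```python
import re
import string


def clean(text):
    """Lowercase, then strip URLs, @handles, digits, and punctuation."""
    text = text.lower()
    text = re.sub(r"\w+://\S+", " ", text)    # drop URLs
    text = re.sub(r"@[a-z0-9_]+", " ", text)  # drop @handles
    text = re.sub(r"\d+", "", text)           # drop digits
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())             # collapse whitespace


sample = "Watch @TEDTalks at https://www.ted.com/talks in 2006: 18 minutes, #ideas!"
print(clean(sample))  # watch at in minutes ideas
```

Note that the backslashes in `\w`, `\S`, and `\d` matter: without them, the patterns match the literal letters instead of word characters, non-space characters, and digits.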

Step 9-5. Generate a feature vector

#Implementations of above methods

#GloVe:

#loading pre-trained glove model

#downloading and unzipping all word embeddings

!wget http://nlp.stanford.edu/data/glove.6B.zip

!unzip glove*.zip

!ls

!pwd

#importing 100-d glove model

glove_model_100vec = pd.read_table("glove.6B.100d.txt", sep=" ", index_col=0, header=None, quoting=csv.QUOTE_NONE)

# getting mean vector for each sentence
def get_mean_vector(glove_model, words):
    # remove out-of-vocabulary words (assuming 100-d vectors)
    words = [word for word in word_tokenize(words) if word in glove_model.index] #if word is in vocab
    if len(words) >= 1:
        return np.mean(glove_model.loc[words].values, axis=0)
    else:
        return np.zeros(100)

#creating empty list and appending all mean arrays for comparing cosine similarities
glove_vec=[]
for i in df_sample.description:
    glove_vec.append(list(get_mean_vector(glove_model_100vec, i)))
glove_vec=np.asarray(glove_vec)
glove_vec

#output

array([[-0.11690753, 0.17445151, 0.04606778, ..., -0.48718723,

0.28744267, 0.16625453],

[-0.12658561, 0.17125735, 0.44709804, ..., -0.18936391,

0.51547109, 0.2958283 ],

[-0.06018609, 0.12372995, 0.27105957, ..., -0.38565426,

0.39135596, 0.2519755 ],

...,

[-0.12469988, 0.11091088, 0.16328073, ..., -0.08730062,

0.25822592, 0.12540627],

[ 0.09014104, 0.09796044, 0.13403036, ..., -0.371885 ,

0.19138244, 0.05781978],

[ 0.00891036, 0.09064478, 0.22670132, ..., -0.26099886,

0.47415786, 0.30951336]])
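
The GloVe text file stores one word per line followed by its vector components, separated by spaces. The sketch below parses that layout and computes a mean vector with an out-of-vocabulary fallback, using only the standard library. The two-line "file" is made up, and `load_vectors` and `mean_vector` are hypothetical helper names.

```python
import io


def load_vectors(handle):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def mean_vector(vectors, tokens, dim):
    """Average the vectors of known tokens; all zeros if none is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


toy_file = io.StringIO("talk 1.0 2.0\nidea 3.0 4.0\n")
toy_glove = load_vectors(toy_file)
print(mean_vector(toy_glove, ["talk", "idea", "zzz"], dim=2))  # [2.0, 3.0]
print(mean_vector(toy_glove, ["zzz"], dim=2))                  # [0.0, 0.0]
```

A dictionary gives constant-time vocabulary lookups, which matters when the vocabulary holds hundreds of thousands of words.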

ELMo:

# Because of an open issue with TensorFlow Hub on the latest version (2.x), we downgrade to TensorFlow 1.x

#!pip uninstall tensorflow

!pip install tensorflow==1.15

import tensorflow as tf

import tensorflow_hub as hub

print(tf.__version__)

#Load pre-trained model

embed_ = hub.Module("https://tfhub.dev/google/elmo/3")

#function to average word vectors of each sentence
def elmo_vectors_sentence(x):
    sentence_embeddings = embed_(x.tolist(), signature="default", as_dict=True)["elmo"]
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.tables_initializer())
        # Average of each vector
        return sess.run(tf.reduce_mean(sentence_embeddings,1))

#If your dataset is large, process it in batches of 100 samples by uncommenting the code below. Because we have just 100 samples, we skip this step.
#samples= [df[i:i+100] for i in range(0,df.shape[0],100)]
# elmo_vec = [elmo_vectors_sentence(x['description']) for x in samples]
#elmo_vec_full= np.concatenate(elmo_vec, axis = 0)

#embeddings on our dataset

elmo_vec = elmo_vectors_sentence(df_sample['description'])

elmo_vec

#output

array([[ 0.0109894 , -0.16668989, -0.06553215, ..., 0.07014981,

0.09196191, 0.04669906],

[ 0.15317157, -0.19256656, 0.01390844, ..., 0.03459582,

0.28029835, 0.11106762],

[ 0.20210212, -0.13186318, -0.20647219, ..., -0.15281932,

0.12729007, 0.17192583],

...,

[ 0.29017407, -0.45098212, 0.0250571 , ..., -0.12281103,

0.23303834, 0.15486737],

[ 0.22871418, 0.12254314, -0.22637479, ..., 0.04150296,

0.31900924, 0.28121516],

[ 0.05940952, 0.01366339, -0.17502695, ..., 0.20946877,

0.0020928 , 0.1114894 ]], dtype=float32)
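
The batching idea mentioned in the comments of the ELMo code can be sketched independently of TensorFlow. Here, `fake_embed` is a made-up stand-in for a real encoder such as `elmo_vectors_sentence`, and `embed_in_batches` is a hypothetical helper; only the chunking and concatenation logic is the point.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(texts, embed_fn, batch_size=100):
    """Apply embed_fn to each batch and concatenate the per-text vectors."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors


def fake_embed(batch):
    # Stand-in encoder: returns one 2-d vector per text.
    return [[float(len(text)), float(text.count(" ") + 1)] for text in batch]


texts = [f"talk number {i}" for i in range(250)]
result = embed_in_batches(texts, fake_embed, batch_size=100)
print(len(result))                          # 250
print([len(b) for b in batched(texts, 100)])  # [100, 100, 50]
```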

Doc2Vec:

#importing doc2vec and tagged document

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

#tokenizing data

tokenized_data=[word_tokenize(word) for word in df_sample.description]

train_data=[TaggedDocument(d, [i]) for i, d in enumerate(tokenized_data)] #adding paragraph id as mentioned in explanation for training

train_data

#output

[TaggedDocument(words=['sir', 'ken', 'robinson', 'make', 'entertaining', 'profoundly', 'moving', 'case', 'creating', 'education', 'system', 'nurture', 'rather', 'undermines', 'creativity'], tags=[0]),

TaggedDocument(words=['humor', 'humanity', 'exuded', 'inconvenient', 'truth', 'al', 'gore', 'spell', 'way', 'individual', 'address', 'climate', 'change', 'immediately', 'buying', 'hybrid', 'inventing', 'new', 'hotter', 'brand', 'name', 'global', 'warming'], tags=[1]),

TaggedDocument(words=['new', 'york', 'time', 'columnist', 'david', 'pogue', 'take', 'aim', 'technology', 'worst', 'interface', 'design', 'offender', 'provides', 'encouraging', 'example', 'product', 'get', 'right', 'funny', 'thing', 'burst', 'song'], tags=[2]),

TaggedDocument(words=['emotionally', 'charged', 'talk', 'macarthur', 'winning', 'activist', 'majora', 'carter', 'detail', 'fight', 'environmental', 'justice', 'south', 'bronx', 'show', 'minority', 'neighborhood', 'suffer', 'flawed', 'urban', 'policy'], tags=[3]),

TaggedDocument(words=['never', 'seen', 'data', 'presented', 'like', 'drama', 'urgency', 'sportscaster', 'statistic', 'guru', 'han', 'rosling', 'debunks', 'myth', 'called', 'developing', 'world'], tags=[4])

……….

## Train doc2vec model

model = Doc2Vec(train_data, vector_size = 100, window = 2, min_count = 1, epochs = 100)

def get_vectors(model,words):
    words = [word for word in word_tokenize(words) if word in list(model.wv.vocab)]
    #words = [word for word in word_tokenize(words) if word in list(model.wv.index_to_key)] #if gensim version is >= 4.0.0, use this line instead
    if len(words)>=1:
        return model.infer_vector(words)
    else:
        return np.array([0]*100)

#defining empty list
doc2vec_vec=[]
for i in df_sample.description:
    doc2vec_vec.append(list(get_vectors(model, i)))
doc2vec_vec=np.asarray(doc2vec_vec)
doc2vec_vec

#output

array([[ 0.00505156, -0.582084 , -0.33430266, ..., 0.29665616,

-0.5472022 , 0.48537165],

[ 0.05787622, -0.6559785 , -0.41140306, ..., 0.24132295,

-0.73182726, 0.6089837 ],

[ 0.02416484, -0.48238695, -0.29850838, ..., 0.2710957 ,

-0.51971895, 0.4405582 ],

...,

[ 0.0511999 , -0.5991625 , -0.34839907, ..., 0.29519215,

-0.68761116, 0.4545323 ],

[ 0.0180944 , -0.8318272 , -0.3488748 , ..., 0.30490136,

-0.7558393 , 0.56117946],

[-0.04790357, -0.66188 , -0.3797214 , ..., 0.34476635,

-0.7202311 , 0.5834031 ]], dtype=float32)

Sentence-BERT

#BERT sentence transformer for sentence encoding

!pip install sentence-transformers

#importing bert-base model

from sentence_transformers import SentenceTransformer

sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')

#one more model to try

#model = SentenceTransformer('paraphrase-MiniLM-L12-v2')

#embedding on description column
sentence_embeddings_BERT = sbert_model.encode(df_sample['description'].tolist())

print('Sample BERT embedding vector - length', len(sentence_embeddings_BERT[0]))

#output

Sample BERT embedding vector - length 768

sentence_embeddings_BERT

array([[-0.31804532, 0.6571422 , 0.5327481 , ..., -0.76469 ,

-0.4919126 , 0.1543465 ],

[-0.08962823, 1.0855986 , 0.37181526, ..., -0.84685326,

0.5427714 , 0.32389015],

[-0.13385592, 0.8280815 , 0.76139224, ..., -0.33403403,

0.2664094 , -0.05493931],

...,

[ 0.05133615, 1.1150284 , 0.75921553, ..., 0.5516633 ,

0.46614835, 0.28827885],

[-1.3568689 , 0.2995725 , 0.99510914, ..., 0.26881158,

-0.1879525 , 0.18646894],

[-0.20679009, 0.8725009 , 1.2933054 , ..., -0.44921246,

0.14516312, -0.2050481 ]], dtype=float32)

sentence_embeddings_BERT.shape

#output

(100, 768)

Universal Encoder

#Load the pre-trained model

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"

model_USE = hub.load(module_url)

embeddings_USE = model_USE(df_sample['description'])

embeddings_USE = tf.reshape(embeddings_USE,[100,512])

embeddings_USE.shape

#output

TensorShape([Dimension(100), Dimension(512)])

#output is tensor

InferSent

There are two versions of InferSent. Version 1 uses GloVe, whereas version 2 uses fastText vectors. You can work with either version. We used version 2, so we downloaded the InferSent model and the pre-trained word vectors.

! mkdir encoder

! curl -Lo encoder/infersent2.pkl https://dl.fbaipublicfiles.com/infersent/infersent2.pkl

! mkdir GloVe

! curl -Lo GloVe/glove.840B.300d.zip http://nlp.stanford.edu/data/glove.840B.300d.zip

! unzip GloVe/glove.840B.300d.zip -d GloVe/

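# models.py comes from the InferSent repository (facebookresearch/InferSent) and must be importable
from models import InferSent
import torch

V = 2
# paths match the download commands above; adjust them if you stored the files elsewhere
MODEL_PATH = 'encoder/infersent%s.pkl' % V
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': V}
model_infer = InferSent(params_model)
model_infer.load_state_dict(torch.load(MODEL_PATH))

# note: version 2 was trained with fastText vectors, but this code pairs it with the GloVe vectors downloaded above
W2V_PATH = 'GloVe/glove.840B.300d.txt'
model_infer.set_w2v_path(W2V_PATH)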

#building vocabulary

model_infer.build_vocab(df_sample.description, tokenize=True)

#output

Found 1266(/1294) words with w2v vectors

Vocab size : 1266

#encoding sample dataset

infersent_embed = model_infer.encode(df_sample.description,tokenize=True)

#shape of our vector

infersent_embed.shape

#output

(100, 4096)
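
The 4096 dimensions are consistent with the parameters above, assuming the encoder is a bidirectional LSTM with max pooling as described in the InferSent paper: two directions times `enc_lstm_dim` of 2048. The toy sketch below shows the same arithmetic with a hidden size of 2; the hidden states are hand-written numbers, not real LSTM outputs.

```python
def max_pool_over_time(hidden_states):
    """Take the element-wise maximum across all time steps."""
    return [max(column) for column in zip(*hidden_states)]


def bilstm_max_pool(forward_states, backward_states):
    """Concatenate forward and backward states per step, then max-pool."""
    combined = [f + b for f, b in zip(forward_states, backward_states)]
    return max_pool_over_time(combined)


# Three time steps, hidden size 2 per direction (made-up values).
forward = [[0.1, 0.9], [0.5, 0.2], [0.3, 0.4]]
backward = [[0.7, 0.0], [0.2, 0.6], [0.1, 0.8]]

sentence_vector = bilstm_max_pool(forward, backward)
print(sentence_vector)       # [0.5, 0.9, 0.7, 0.8]
print(len(sentence_vector))  # 4, that is, 2 directions x hidden size 2
```

The output size does not depend on the number of time steps, which is why sentences of any length map to vectors of the same dimension.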

infersent_embed

#output

array([[ 0.00320979, 0.0560745 , 0.11894835, ..., 0.04763867,

0.02359796, 0.09751415],

[ 0.00983471, 0.11757359, 0.12201475, ..., 0.06545023,

0.04181211, 0.07941461],

[-0.02874381, 0.18418473, 0.12211668, ..., 0.07526097,

0.06728931, 0.1058861 ],

...,

[ 0.00766308, 0.10781102, 0.13686652, ..., 0.08371441,

0.01190174, 0.12111058],

[-0.02874381, 0.20537955, 0.11543981, ..., 0.08811261,

0.03787484, 0.08826952],

[ 0.12408942, 0.30591702, 0.23708522, ..., 0.1063919 ,

0.0908693 , 0.14098585]], dtype=float32)

Open-AI GPT

#installing necessary model

!pip install pytorch_pretrained_bert

import torch

from pytorch_pretrained_bert import OpenAIGPTTokenizer, OpenAIGPTModel

tokenizer_openai = OpenAIGPTTokenizer.from_pretrained('openai-gpt') #construct a GPT tokenizer based on Byte-Pair Encoding

model_openai = OpenAIGPTModel.from_pretrained('openai-gpt')

model_openai.eval()

print('Model Loaded')

#function to get embedding of each token
def Embedding_openai(Sentence):
    tokens = word_tokenize(Sentence)
    vectors = np.zeros((1,768))
    for word in tokens:
        subwords = tokenizer_openai.tokenize(word)
        indexed_tokens = tokenizer_openai.convert_tokens_to_ids(subwords)
        tokens_tensor = torch.tensor([indexed_tokens])
        with torch.no_grad():
            try:
                vectors += np.array(torch.mean(model_openai(tokens_tensor),1))
            except Exception as ex:
                continue
    vectors /= len(tokens)
    return vectors

# Initialize a matrix with dimensions (number of rows) x (vector dimension)
open_ai_vec = np.zeros((df_sample.shape[0], 768))

# generating sentence embedding for each row
for iter in range(df_sample.shape[0]):
    text = df_sample.loc[iter,'description']
    open_ai_vec[iter] = Embedding_openai(text)
open_ai_vec

#output

array([[ 0.16126736, 0.14900037, 0.10306535, ..., 0.22078205,

-0.38590393, -0.09898915],

[ 0.17074709, 0.20849738, 0.14996684, ..., 0.21315758,

-0.46983403, 0.02419061],

[ 0.25158801, 0.12217634, 0.09847356, ..., 0.25541687,

-0.44979091, -0.0174561 ],

...,

[ 0.26624974, 0.15842849, 0.10565209, ..., 0.23473342,

-0.40087843, -0.07652373],

[ 0.22917288, 0.22115094, 0.09217898, ..., 0.18310198,

-0.33768173, -0.16026535],

[ 0.21503123, 0.21615047, 0.04715349, ..., 0.25044506,

-0.42287723, -0.01473052]])

Step 9-6. Generate a feature vector function automatically using a selected embedding method

#takes input as dataframe and embedding model name as mentioned in function
def get_embed(df,model):
    if model=='Glove':
        return glove_vec
    if model=='ELMO':
        return elmo_vec
    if model=='doc2vec':
        return doc2vec_vec
    if model=='sentenceBERT':
        return sentence_embeddings_BERT
    if model=='USE':
        return embeddings_USE
    if model=='infersent':
        return infersent_embed
    if model=='Open-ai':
        return open_ai_vec

get_embed(df_sample,'ELMO')

#output

array([[ 0.0109894 , -0.16668989, -0.06553215, ..., 0.07014981,

0.09196191, 0.04669906],

[ 0.15317157, -0.19256656, 0.01390844, ..., 0.03459582,

0.28029835, 0.11106762],

[ 0.20210212, -0.13186318, -0.20647219, ..., -0.15281932,

0.12729007, 0.17192583],

...,

[ 0.29017407, -0.45098212, 0.0250571 , ..., -0.12281103,

0.23303834, 0.15486737],

[ 0.22871418, 0.12254314, -0.22637479, ..., 0.04150296,

0.31900924, 0.28121516],

[ 0.05940952, 0.01366339, -0.17502695, ..., 0.20946877,

0.0020928 , 0.1114894 ]], dtype=float32)
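
One design note: a chain of `if` statements returns `None` silently when the model name is misspelled. A dictionary-based lookup, sketched below as an alternative, fails loudly instead. The names `make_lookup` and `lookup_embedding` are hypothetical, and the tiny lists stand in for the arrays computed in Step 9-5.

```python
def make_lookup(registry):
    """Build a lookup function over a name -> embedding matrix mapping."""
    def lookup_embedding(model):
        try:
            return registry[model]
        except KeyError:
            known = ", ".join(sorted(registry))
            raise ValueError(
                f"Unknown embedding '{model}'. Choose one of: {known}"
            ) from None
    return lookup_embedding


# Made-up stand-ins for glove_vec, elmo_vec, and doc2vec_vec.
registry = {
    "Glove": [[0.1, 0.2]],
    "ELMO": [[0.3, 0.4]],
    "doc2vec": [[0.5, 0.6]],
}
lookup_embedding = make_lookup(registry)
print(lookup_embedding("ELMO"))  # [[0.3, 0.4]]
```

Calling `lookup_embedding("elmo")` raises a `ValueError` that lists the valid names, which makes typos easier to spot.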

We hope that you are now comfortable with natural language processing. Now that the data has been cleaned and processed, and features have been created, let’s jump into building applications that solve business problems.
