Language is the most important instrument of thought.
—Marvin Minsky
This chapter introduces algorithms for natural language processing (NLP). It first covers the fundamentals of NLP and then moves on to preparing data for NLP tasks. After that, it explains the concepts of vectorizing textual data. Next, we discuss word embeddings. Finally, we present a detailed use case.
This chapter is made up of the following sections:
By the end of this chapter, you will understand the basic techniques that are used for NLP. You should also understand how NLP can be used to solve some interesting real-world problems.
Let’s start with the basic concepts.
NLP is a branch of machine learning that deals with the interaction between computers and human language. It involves analyzing, processing, and understanding human language to enable machines to comprehend and respond to human communication. NLP is a comprehensive subject and involves using computational linguistics algorithms and human-computer interaction technologies and methodologies to process complex unstructured data.
NLP works by processing human language and breaking it down into its constituent parts, such as words, phrases, and sentences. The goal is to enable the computer to understand the meaning of the text and respond appropriately. NLP algorithms utilize various techniques, such as statistical models, machine learning, and deep learning, to analyze and process large volumes of natural language data. For complex problems, we may need to use a combination of techniques to come up with an effective solution.
One of the most significant challenges in NLP is dealing with the complexity and ambiguity of human language. Languages are quite diverse with complex grammatical structures and idiomatic expressions. Additionally, the meaning of words and phrases can vary depending on the context in which they are used. NLP algorithms must be able to handle these complexities to achieve effective language processing.
Let’s start by looking at some of the terminology that is used when discussing NLP.
NLP is a vast field of study. In this section, we will investigate some of the basic terminology related to NLP:
Corpora, the plural of corpus, can be annotated, meaning they may contain extra details about the texts, such as part-of-speech tags and named entities. These annotated corpora offer specific information that enhances the training and evaluation of NLP algorithms, making them especially valuable resources in the field.
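As a small illustration of this kind of annotation, the following sketch attaches part-of-speech tags to the tokens of a short sentence using the nltk library, which is introduced in more detail later in this chapter. The sample sentence is made up for demonstration, and resource names can vary slightly between nltk versions:
import nltk
from nltk.tokenize import word_tokenize
# Download the resources needed for tokenization and POS tagging
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
text = "Annotated corpora pair every token with extra information."
tokens = word_tokenize(text)
# Each token is paired with a part-of-speech tag, such as NN for a noun or VBZ for a verb
print(nltk.pos_tag(tokens))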
Next, let us study different text preprocessing techniques used in NLP:
Text preprocessing is a vital stage in NLP, where raw text data undergoes a transformation to become suitable for machine learning algorithms. This transformation involves converting the unorganized and often messy text into what is known as a “structured format.” A structured format means that the data is organized into a more systematic and predictable pattern, often involving techniques like tokenization, stemming, and removing unwanted characters. These steps help in cleaning the text, reducing irrelevant information or “noise,” and arranging the data in a manner that makes it easier for the machine learning models to understand. By following this approach, the raw text, which may contain inconsistencies and irregularities, is molded into a form that enhances the accuracy, performance, and efficiency of subsequent NLP tasks. In this section, we will explore various techniques used in text preprocessing to achieve this structured format.
As a reminder, tokenization is the crucial process of dividing text into smaller units, known as tokens. These tokens can be as small as individual words or even subwords. In NLP, tokenization is often considered the first step in preparing text data for further analysis. The reason for this foundational role lies in the very nature of language, where understanding and processing text requires breaking it down into manageable parts. By transforming a continuous stream of text into individual tokens, we create a structured format that mirrors the way humans naturally read and understand language. This structuring provides the machine learning models with a clear and systematic way to analyze the text, allowing them to recognize patterns and relationships within the data. As we delve deeper into NLP techniques, this tokenized format becomes the basis upon which many other preprocessing and analysis steps are built.
The following code snippet tokenizes the given text using the Natural Language Toolkit (nltk) library in Python. nltk is a widely used Python library, specifically designed for working with human language data. It provides easy-to-use interfaces and tools for tasks such as classification, tokenization, stemming, tagging, parsing, and more, making it a valuable asset for NLP. For those who wish to leverage these capabilities in their Python projects, the nltk library can be downloaded and installed directly from the Python Package Index (PyPI) by using the command pip install nltk. By incorporating the nltk library into your code, you can access a rich set of functions and resources that streamline the development and execution of various NLP tasks, making it a popular choice among researchers, educators, and developers in the field of computational linguistics. Let us start by importing relevant functions and using them:
import nltk
from nltk.tokenize import word_tokenize
# The 'punkt' tokenizer models are required by word_tokenize
nltk.download('punkt')
corpus = 'This is a book about algorithms.'
tokens = word_tokenize(corpus)
print(tokens)
The output will be a list that looks like this:
['This', 'is', 'a', 'book', 'about', 'algorithms', '.']
In this example, each token is a word. The granularity of the resulting tokens will vary based on the objective—for example, each token can consist of a word, a sentence, or a paragraph.
To tokenize text based on sentences, you can use the sent_tokenize function from the nltk.tokenize module:
from nltk.tokenize import sent_tokenize
corpus = 'This is a book about algorithms. It covers various topics in depth.'
sentences = sent_tokenize(corpus)
print(sentences)
In this example, the corpus variable contains two sentences. The sent_tokenize function takes the corpus as input and returns a list of sentences. When you run this code, you will get the following output:
['This is a book about algorithms.', 'It covers various topics in depth.']
Sometimes we may need to break down large texts into paragraph-level chunks, and a simple helper function can handle that task. This can be particularly useful in applications such as document summarization, where understanding the structure at the paragraph level may be crucial. Tokenizing text into paragraphs might seem straightforward, but it can be complex depending on the structure and format of the text. A simple approach is to split the text on two consecutive newline characters, which often separate paragraphs in plain text documents.
Here’s a basic example:
def tokenize_paragraphs(text):
    # Split on two consecutive newline characters
    paragraphs = text.split('\n\n')
    return [p.strip() for p in paragraphs if p]
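As a quick usage example, calling this helper on a small, made-up document whose paragraphs are separated by blank lines might look like this:
sample = "Algorithms are step-by-step procedures.\n\nThey can be analyzed for correctness and efficiency."
print(tokenize_paragraphs(sample))
# ['Algorithms are step-by-step procedures.', 'They can be analyzed for correctness and efficiency.']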
Next, let us look into how we can clean the data.
Cleaning data is an essential step in NLP, as raw text data often contains noise and irrelevant information that can hinder the performance of NLP models. The goal of cleaning data for NLP is to preprocess the text data to remove noise and irrelevant information, and to transform it into a format that is suitable for analysis using NLP techniques. Note that data cleaning is done after the text is tokenized. The reason is that cleaning might involve operations that depend on the structure revealed by tokenization. For instance, removing specific words or altering word forms might be done more accurately after the text is tokenized into individual terms.
Let us study some techniques used to clean data and prepare it for machine learning tasks:
Case conversion is a technique in NLP where text is transformed from one case format to another, such as from uppercase to lowercase, or from title case to uppercase.
For example, the text “Natural Language Processing” in title case could be converted to lowercase to be “natural language processing.”
This simple yet effective step helps in standardizing the text, which in turn simplifies its processing for various NLP algorithms. By ensuring that the text is in a uniform case, it aids in eliminating inconsistencies that might otherwise arise from variations in capitalization.
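A minimal sketch of case conversion in Python relies on the built-in string methods:
text = "Natural Language Processing"
print(text.lower())   # natural language processing
print(text.upper())   # NATURAL LANGUAGE PROCESSING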
Punctuation removal in NLP refers to the process of removing punctuation marks from raw text data before analysis. Punctuation marks are symbols such as periods (.), commas (,), question marks (?), and exclamation marks (!) that are used in written language to indicate pauses, emphasis, or intonation. While they are essential in written language, they can add noise and complexity to raw text data, which can hinder the performance of NLP models.
It’s a reasonable concern to wonder how the removal of punctuation might affect the meaning of sentences. Consider the following examples:
"She's a cat."
"She's a cat??"
Without punctuation, both lines become “She’s a cat,” potentially losing the distinct emphasis conveyed by the question marks.
However, it’s worth noting that in many NLP tasks, such as topic classification or sentiment analysis, punctuation might not significantly impact the overall understanding. Additionally, models can rely on other cues from the text’s structure, content, or context to derive meaning. In cases where the nuances of punctuation are critical, specialized models and preprocessing techniques may be employed to retain the required information.
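One common way to strip punctuation in Python is the translate method combined with the string.punctuation constant; here is a minimal sketch using one of the sentences above:
import string
text = "She's a cat??"
# Remove every character listed in string.punctuation
cleaned = text.translate(str.maketrans('', '', string.punctuation))
print(cleaned)  # Shes a cat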
Numbers within text data can pose challenges in NLP. Here’s a look at two main strategies for handling numbers in text analysis, considering both the traditional approach of removal and an alternative option of standardization.
In some NLP tasks, numbers may be considered noise, particularly when the focus is on aspects like word frequency or sentiment analysis. Here’s why some analysts might choose to remove numbers:
However, an alternative approach is to convert all numbers to a standard representation rather than discarding them. This method acknowledges that numbers can carry essential information and ensures that their value is retained in a consistent format. It can be particularly useful in contexts where numerical data plays a vital role in the meaning of the text.
Deciding whether to remove or retain numbers requires an understanding of the problem being solved. An algorithm may need customization to distinguish whether a number is significant based on the context of the text and the specific NLP task. Analyzing the role of numbers within the domain of the text and the goals of the analysis can guide this decision-making process.
Handling numbers in NLP is not a one-size-fits-all approach. Whether to remove, standardize, or carefully analyze numbers depends on the unique requirements of the task at hand. Understanding these options and their implications helps in making informed decisions that align with the goals of the text analysis.
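The following sketch contrasts the two strategies, removal versus standardization, using simple regular expressions; the <NUM> placeholder token is just one illustrative choice of standard representation:
import re
text = "The restaurant served 250 meals in 3 hours."
# Strategy 1: remove numbers entirely
print(re.sub(r'\d+', '', text))
# Strategy 2: standardize numbers to a placeholder token
print(re.sub(r'\d+', '<NUM>', text))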
White space in the context of text data is not merely the space between words; it includes other “invisible” characters, such as multiple spaces and tab characters, that create spacing within text. In NLP, white space removal refers to the process of eliminating these unnecessary white space characters. Removing unnecessary white spaces can reduce the size of the text data and make it easier to process and analyze.
Here’s a simple example to illustrate white space removal:
"The quick brown fox jumps over the lazy dog."
"The quick brown fox jumps over the lazy dog."
In the above example, extra spaces and a tab character (denoted by
) are removed to create a cleaner and more standardized text string.
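In Python, collapsing runs of spaces, tabs, and newlines into single spaces can be done with a regular expression; a minimal sketch:
import re
text = "The   quick brown\tfox  jumps over the lazy dog."
cleaned = re.sub(r'\s+', ' ', text).strip()
print(cleaned)  # The quick brown fox jumps over the lazy dog.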
Stop word removal is the process of eliminating common words, known as stop words, from a text corpus. Stop words are words that occur frequently in a language but do not carry significant meaning or contribute to the overall understanding of the text. Examples of stop words in English include the, and, is, in, and for. Stop word removal helps reduce the dimensionality of the data. By removing words that don’t contribute meaningfully to the analysis, computational resources can be focused on the words that do matter, improving the efficiency of various NLP algorithms.
Note that stop word removal is more than a mere reduction in text size; it’s about focusing on the words that truly matter for the analysis at hand. While stop words play a vital role in language structure, their removal in NLP can enhance the efficiency and focus of the analysis, particularly in tasks like sentiment analysis where the primary concern is understanding the underlying emotion or opinion.
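A minimal sketch of stop word removal using nltk's built-in English stop word list looks like this (the sample sentence is made up for illustration):
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
text = "This is a book about algorithms and it is useful"
stop_words = set(stopwords.words('english'))
# Keep only the tokens that are not in the stop word list
filtered = [w for w in word_tokenize(text) if w.lower() not in stop_words]
print(filtered)  # ['book', 'algorithms', 'useful']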
In textual data, most words are likely to be present in slightly different forms. Reducing each word to its origin or stem in a family of words is called stemming. It is used to group words based on their similar meanings to reduce the total number of words that need to be analyzed. Essentially, stemming reduces the overall dimensionality of the problem. The most common algorithm for stemming English is the Porter algorithm.
Let us look at a couple of examples:
{use, used, using, uses} => use
{easily, easier, easiest} => easi
It’s important to note that stemming can sometimes result in misspelled words, as seen in example 2, where easi was produced.
Stemming is a simple and quick process, but it may not always produce correct results. For cases where correct spelling is required, lemmatization is a more appropriate method. Lemmatization considers the context and reduces words to their base form. The base form of a word, also known as the lemma, is its simplest meaningful version: the way the word would appear in a dictionary, stripped of any inflectional endings. Because the lemma is always a correct English word, lemmatization produces more accurate and meaningful word roots.
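The difference between the two approaches is easiest to see side by side. The following sketch compares nltk's PorterStemmer with its WordNetLemmatizer (the lemmatizer needs the 'wordnet' resource, and on some nltk versions 'omw-1.4' as well); the word list is arbitrary:
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')
ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()
for word in ['using', 'easily', 'studies']:
    # Stemming may produce truncated non-words; lemmatization returns dictionary forms
    print(word, ps.stem(word), lemmatizer.lemmatize(word, pos='v'))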
The process of guiding algorithms to recognize similarities is a precise and thoughtful task. Unlike humans, algorithms need explicit rules and criteria to make connections that might seem obvious to us. Understanding this distinction and knowing how to provide the necessary guidance is a vital skill in the development and tuning of algorithms for various applications.
Let us look into how we can clean text using Python.
First, let’s import the necessary libraries:
import string
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
# Make sure to download the NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
Next, here is the main function to perform text cleaning:
def clean_text(text):
    """
    Cleans input text by converting case, removing punctuation, numbers,
    white spaces, and stop words, and by stemming
    """
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Remove leading and trailing white spaces
    text = text.strip()
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = nltk.word_tokenize(text)
    filtered_text = [word for word in tokens if word not in stop_words]
    text = ' '.join(filtered_text)
    # Stemming
    ps = PorterStemmer()
    tokens = nltk.word_tokenize(text)
    stemmed_text = [ps.stem(word) for word in tokens]
    text = ' '.join(stemmed_text)
    return text
Let us test the clean_text() function:
corpus="7- Today, Ottawa is becoming cold again "
clean_text(corpus)
The result will be:
today ottawa becom cold
Note the word becom in the output. As we are using stemming, not all the words in the output are expected to be correct English words.
The preceding processing steps are the ones typically needed; the actual steps depend on the problem that we want to solve and will vary from use case to use case. For example, if the numbers in the text represent something that may have some value in the context of the problem that we are trying to solve, then we may not need to remove them in the normalization phase.
Once the data is cleaned, we need to store the results in a data structure tailored for this purpose. This data structure is called the Term Document Matrix (TDM) and is explained next.
A TDM is a mathematical structure used in NLP. It’s a table that counts the frequency of terms (words) in a collection of documents. Each row represents a unique term, and each column represents a specific document. It’s an essential tool for text analysis, where you can see how often each word occurs in various texts.
For example, consider two documents containing the words cat and dog:
Document 1: "cat cat dog"
Document 2: "dog dog cat"
The resulting TDM looks like this:
        Document 1    Document 2
cat         2             1
dog         1             2
This matrix structure allows the efficient storage, organization, and analysis of large text datasets. In Python, the CountVectorizer class from the sklearn library can be used to create a TDM as follows:
from sklearn.feature_extraction.text import CountVectorizer
# Define a list of documents
documents = ["Machine Learning is useful", "Machine Learning is fun", "Machine Learning is AI"]
# Create an instance of CountVectorizer
vectorizer = CountVectorizer()
# Fit and transform the documents into a TDM
tdm = vectorizer.fit_transform(documents)
# Print the TDM
print(tdm.toarray())
The output looks as follows:
[[0 0 1 1 1 1]
[0 1 1 1 1 0]
[1 0 1 1 1 0]]
Note that corresponding to each document, there is a row, and corresponding to each distinct word, there is a column. There are three documents and there are six distinct words, resulting in a matrix with dimensions 3x6.
In this matrix, the numbers represent the frequency with which each word (column) appears in the corresponding document (row). For example, the 1 in the first row and third column means that the third word in the vocabulary (is) appears once in the first document.
TDM uses the frequency of each term by default, which is a simple way to quantify the importance of each word in the context of each individual document. A more sophisticated way to quantify the importance of each word is TF-IDF, which is explained in the next section.
Term Frequency-Inverse Document Frequency (TF-IDF) is a method used to calculate the significance of words in a document. It considers two main components to determine the weight of each term: the Term Frequency (TF) and the Inverse Document Frequency (IDF). The TF looks at how often a word appears in a specific document, while the IDF examines how rare the word is across a collection of documents, known as a corpus. In the context of TF-IDF, the corpus refers to the entire set of documents that you are analyzing. If we are working with a collection of book reviews, for example, the corpus would include all the reviews:
To compute TF-IDF using Python, do the following:
from sklearn.feature_extraction.text import TfidfVectorizer
# Define a list of documents
documents = ["Machine Learning enables learning", "Machine Learning is fun", "Machine Learning is useful"]
# Create an instance of TfidfVectorizer
vectorizer = TfidfVectorizer()
# Fit and transform the documents into a TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(documents)
# Get the feature names
feature_names = vectorizer.get_feature_names_out()
# Loop over the feature names and print the TF-IDF score for each term
for i, term in enumerate(feature_names):
tfidf = tfidf_matrix[:, i].toarray().flatten()
print(f"{term}: {tfidf}")
This will print:
enables: [0.60366655 0. 0. ]
fun: [0. 0.66283998 0. ]
is: [0. 0.50410689 0.50410689]
learning: [0.71307037 0.39148397 0.39148397]
machine: [0.35653519 0.39148397 0.39148397]
useful: [0. 0. 0.66283998]
Each column in the output corresponds to a document, and each row shows the TF-IDF values for one term across the documents. For example, the term fun has a non-zero TF-IDF value only in the second document, which is in line with our expectations.
The TF-IDF method provides a valuable way to weigh the importance of terms within individual documents and across an entire corpus. The resulting TF-IDF values reveal the relevance of specific terms within each document, taking into account both their frequency in a given document and their rarity across the entire collection. In the provided example, the varying TF-IDF scores for different terms demonstrate the model’s ability to distinguish words that are unique to specific documents from those that are more commonly used. This ability can be leveraged in various applications, such as text classification, information retrieval, and feature selection, to enhance the understanding and processing of text data.
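For reference, the conventional formulation behind these scores, leaving aside the smoothing and L2 normalization that scikit-learn's TfidfVectorizer applies (which is why its exact values differ slightly), is:
\[ \text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t), \qquad \text{idf}(t) = \log\frac{N}{\text{df}(t)} \]
where tf(t, d) is the number of times term t appears in document d, N is the total number of documents in the corpus, and df(t) is the number of documents that contain t.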
One of the major advancements in NLP is our ability to create a meaningful numeric representation of words in the form of dense vectors. This technique is called word embedding. So, what exactly is a dense vector? Imagine you have a word like apple. In word embedding, apple might be represented as a series of numbers, such as [0.5, 0.8, 0.2], where each number is a coordinate in a continuous, multi-dimensional space. The term “dense” means that most or all of these numbers are non-zero, unlike sparse vectors, where many elements might be zero. In simple terms, word embedding takes each word in a text and turns it into a unique, multi-dimensional point in space. This way, words with similar meanings will end up closer to each other in this space, allowing algorithms to understand the relationships between words. Yoshua Bengio first introduced the term in his paper A Neural Probabilistic Language Model. Each word in an NLP problem can be thought of as a categorical object.
In word embedding, we try to establish the neighborhood of each word and use it to quantify its meaning and importance. The neighborhood of a word is the set of words that surround it.
To truly grasp the concept of word embedding, let’s look at a tangible example involving a vocabulary of four familiar fruits: apple, banana, orange, and pear. The goal here is to represent these words as dense vectors, numerical arrays where each number captures a specific characteristic or feature of the word.
Why represent words this way? In NLP, converting words into dense vectors enables algorithms to quantify the relationships between different words. Essentially, we’re turning abstract language into something that is mathematically measurable.
Consider the features of sweetness, acidity, and juiciness for our fruit words. We could rate these features on a scale from 0 to 1 for each fruit, where 0 means the feature is entirely absent, and 1 means the feature is strongly present. This rating could look like this:
"apple": [0.5, 0.8, 0.2] – moderately sweet, quite acidic, not very juicy
"banana": [0.2, 0.3, 0.1] – not very sweet, moderately acidic, not juicy
"orange": [0.9, 0.6, 0.9] – very sweet, somewhat acidic, very juicy
"pear": [0.4, 0.1, 0.7] – moderately sweet, barely acidic, quite juicy
The numbers are subjective and can be derived from taste tests, expert opinions, or other methods, but they serve to transform the words into a format that an algorithm can understand and work with.
Visualizing this, you can imagine a 3D space where each axis represents one of the features (sweetness, acidity, or juiciness), and each fruit’s vector places it at a specific point in this space. Words (fruits) with similar tastes would be closer to each other in this space.
So, why the choice of dense vectors with a length of 3? This is based on the specific features we have chosen to represent. In other applications, the vector length might be different, based on the number of features you want to capture.
This example illustrates how word embedding takes a word and turns it into a numerical vector that holds real-world meaning. It’s a crucial step in enabling machines to “understand” and process human language.
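To make this geometric intuition concrete, the following sketch stores the made-up fruit vectors from above and compares two pairs of them with cosine similarity, a standard way to measure how close two embedding vectors are:
import numpy as np
embeddings = {
    'apple':  np.array([0.5, 0.8, 0.2]),
    'banana': np.array([0.2, 0.3, 0.1]),
    'orange': np.array([0.9, 0.6, 0.9]),
    'pear':   np.array([0.4, 0.1, 0.7]),
}
def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction; values near 0 mean they are unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine_similarity(embeddings['apple'], embeddings['orange']))
print(cosine_similarity(embeddings['apple'], embeddings['banana']))
Note that cosine similarity compares only the direction of the vectors, not their length, so two fruits with very different overall intensities can still receive a high similarity score.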
Word2Vec is a prominent method used for obtaining vector representations of words, commonly referred to as word embeddings. Rather than “generating words,” this algorithm creates numerical vectors that represent the semantic meaning of each word in the language.
The basic idea behind Word2Vec is to use a neural network to predict the context of each word in a given text corpus. The neural network is trained by inputting the word and its surrounding context words, and the network learns to output the probability distribution of the context words given the input word. The weights of the neural network are then used as the word embeddings, which can be used for various NLP tasks:
import gensim
# Define a text corpus
corpus = [['apple', 'banana', 'orange', 'pear'],
['car', 'bus', 'train', 'plane'],
['dog', 'cat', 'fox', 'fish']]
# Train a word2vec model on the corpus
model = gensim.models.Word2Vec(corpus, window=5, min_count=1, workers=4)
Let us break down the important parameters of the Word2Vec() function:
- vector_size: The dimensionality of the word vectors. It is commonly set to a value such as 100 or 300, depending on the complexity of the vocabulary.
- window: The maximum distance between the target word and its context words. If window is set to 5, the algorithm will consider the five words immediately before and after the target word in the training process.
- min_count: The minimum frequency a word must have to be included in training. If we set min_count to 2, for example, any word that appears fewer than two times across all the sentences will be ignored during training.
Once the Word2Vec model is trained, one of the powerful ways to use it is to measure the similarity or “distance” between words in the embedding space. This similarity score can give us insight into how the model perceives relationships between different words. Now let us check the model by looking at the distance between car and train:
print(model.wv.similarity('car', 'train'))
-0.057745814
Now let’s look into the similarity of car and apple:
print(model.wv.similarity('car', 'apple'))
0.11117952
Thus, the output gives us the similarity score between individual terms based on the word embeddings learned by the model.
The following details help with interpreting similarity scores:
Thus, these similarity scores provide quantitative insights into word relationships. By understanding these scores, you can better analyze the semantic structure of your text corpus and leverage it for various NLP tasks.
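Beyond pairwise scores, gensim can also list the nearest neighbors of a word in the embedding space. With a toy corpus this small the rankings are not meaningful, but the call below illustrates the API:
# Return the three words whose vectors are closest to 'car'
print(model.wv.most_similar('car', topn=3))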
Word2Vec provides a powerful and efficient way to represent textual data in a way that captures semantic relationships between words, reduces dimensionality, and improves accuracy in downstream NLP tasks. Let us look into the advantages and disadvantages of Word2Vec.
The following are the advantages of using Word2Vec:
Now let us look into some of the disadvantages of using Word2Vec:
Next, let us look into a case study about restaurant reviews that combines all the concepts presented in this chapter.
We will use the Yelp Reviews dataset, which contains reviews labeled as positive (5 stars) or negative (1 star). We will train a model that can classify the reviews of a restaurant as negative or positive.
Let’s implement this processing pipeline by going through the following steps.
First, we import the packages that we need:
import numpy as np
import pandas as pd
import re
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
# scikit-learn components used later in this pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
Then we import the dataset from a .tsv file:
url = 'https://storage.googleapis.com/neurals/data/2023/Restaurant_Reviews.tsv'
dataset = pd.read_csv(url, delimiter='\t', quoting=3)
dataset.head()
Review Liked
0 Wow... Loved this place. 1
1 Crust is not good. 0
2 Not tasty and the texture was just nasty. 0
3 Stopped by during the late May bank holiday of... 1
4 The selection on the menu was great and so wer... 1
Next, we clean the data by performing text preprocessing on each of the reviews of the dataset using stemming and stopword removal techniques:
def clean_text(text):
    text = re.sub('[^a-zA-Z]', ' ', text)
    text = text.lower()
    text = text.split()
    ps = PorterStemmer()
    text = [ps.stem(word) for word in text
            if word not in set(stopwords.words('english'))]
    text = ' '.join(text)
    return text
corpus = [clean_text(review) for review in dataset['Review']]
The code iterates through each review in the dataset (in this case, the 'Review' column) and applies the clean_text function to preprocess and clean each review. The result is a list of cleaned and preprocessed reviews stored in the corpus variable.
Now let’s define the features (represented by X) and the label (represented by y). Remember that features are the independent variables or attributes that describe the characteristics of the data, used as input for predictions. Labels are the dependent variables or target values that the model is trained to predict, representing the outcomes corresponding to the features:
vectorizer = CountVectorizer(max_features=1500)
X = vectorizer.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values
Let’s divide the data into testing and training data:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
To train the model, we are using the Naive Bayes algorithm that we studied in Chapter 7:
classifier = GaussianNB()
classifier.fit(X_train, y_train)
Let’s predict the test set results:
y_pred = classifier.predict(X_test)
Next, let us print the confusion matrix. Remember that the confusion matrix is a table that helps visualize the performance of the classification model:
cm = confusion_matrix(y_test, y_pred)
print(cm)
[[55 42]
[12 91]]
Looking at the confusion matrix, we can estimate the misclassification.
The confusion matrix gives us a glimpse into the misclassifications made by our model. In this context, there are:
The 55 true negatives and 91 true positives show that our model has a reasonable ability to distinguish between positive and negative reviews. However, the 42 false positives and 12 false negatives highlight areas for potential improvement.
In the context of restaurant reviews, understanding these numbers helps business owners and customers alike gauge the general sentiment. A high rate of true positives and true negatives indicates that the model can be trusted to give an accurate sentiment overview. This information could be invaluable for restaurants aiming to improve service or for potential customers seeking honest reviews. On the other hand, the presence of false positives and negatives suggests areas where the model might need fine-tuning to avoid misclassification and provide more accurate insights.
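To turn the confusion matrix into the summary metrics discussed above, we can either read them off the matrix directly or use scikit-learn's helper functions; a minimal sketch:
from sklearn.metrics import accuracy_score, precision_score, recall_score
# Accuracy computed directly from the matrix: (TN + TP) / total
print((cm[0, 0] + cm[1, 1]) / cm.sum())
# The same and related metrics via scikit-learn
print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))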
The continued advancement of NLP technology has revolutionized the way we interact with computers and other digital devices. It has made significant progress in recent years, with impressive achievements in many tasks, including:
The chapter discussed the basic terminology related to NLP, such as corpus, word embeddings, language modeling, machine translation, and sentiment analysis. In addition, the chapter covered various text preprocessing techniques that are essential in NLP, including tokenization, which involves breaking down text into smaller units called tokens, and other techniques such as stemming and stop word removal.
The chapter also discussed word embeddings and then presented a use case on restaurant review sentiment analysis. Now, readers should have a better understanding of the fundamental techniques used in NLP and their potential applications to real-world problems.
In the next chapter, we will look at training neural networks for sequential data. We will also investigate how the use of deep learning can further improve NLP techniques and the methodologies discussed in this chapter.