9

Algorithms for Natural Language Processing

Language is the most important instrument of thought.

—Marvin Minsky

This chapter introduces algorithms for natural language processing (NLP). It first introduces the fundamentals of NLP. Then it covers how to prepare data for NLP tasks and explains the concepts behind vectorizing textual data. Next, we discuss word embeddings. Finally, we present a detailed use case.

This chapter is made up of the following sections:

  • Introducing NLP
  • Bag-of-words-based (BoW-based) NLP
  • Introduction to word embedding
  • Case study: Restaurant review sentiment analysis

By the end of this chapter, you will understand the basic techniques that are used for NLP. You should also understand how NLP can be used to solve some interesting real-world problems.

Let’s start with the basic concepts.

Introducing NLP

NLP is a branch of machine learning that deals with the interaction between computers and human language. It involves analyzing, processing, and understanding human language to enable machines to comprehend and respond to human communication. NLP is a comprehensive subject and involves using computational linguistics algorithms and human-computer interaction technologies and methodologies to process complex unstructured data.

NLP works by processing human language and breaking it down into its constituent parts, such as words, phrases, and sentences. The goal is to enable the computer to understand the meaning of the text and respond appropriately. NLP algorithms utilize various techniques, such as statistical models, machine learning, and deep learning, to analyze and process large volumes of natural language data. For complex problems, we may need to use a combination of techniques to come up with an effective solution.

One of the most significant challenges in NLP is dealing with the complexity and ambiguity of human language. Languages are quite diverse with complex grammatical structures and idiomatic expressions. Additionally, the meaning of words and phrases can vary depending on the context in which they are used. NLP algorithms must be able to handle these complexities to achieve effective language processing.

Let’s start by looking at some of the terminology that is used when discussing NLP.

Understanding NLP terminology

NLP is a vast field of study. In this section, we will investigate some of the basic terminology related to NLP:

  • Corpus: A corpus is a large and structured collection of text or speech data that serves as a resource for NLP algorithms. It can consist of various types of textual data, such as written text, spoken language, transcribed conversations, and social media posts. A corpus is created by intentionally gathering and organizing data from various online and offline sources, including the internet. While the internet can be a rich source for acquiring data, deciding what data to include in a corpus requires a purposeful selection and alignment with the goals of the particular study or analysis being conducted.

    Corpora, the plural of corpus, can be annotated, meaning they may contain extra details about the texts, such as part-of-speech tags and named entities. These annotated corpora offer specific information that enhances the training and evaluation of NLP algorithms, making them especially valuable resources in the field.

  • Normalization: This process involves converting text into a standard form, such as converting all characters to lowercase or removing punctuation, making it more amenable to analysis.
  • Tokenization: Tokenization breaks down text into smaller parts called tokens, usually words or subwords, enabling a more structured analysis.
  • Named Entity Recognition (NER): NER identifies and classifies named entities within the text, such as people’s names, locations, organizations, etc.
  • Stop words: These are commonly used words such as and, the, and is, which are often filtered out during text processing as they may not contribute significant meaning.
  • Stemming and lemmatization: Stemming involves reducing words to their root form, while lemmatization involves converting words to their base or dictionary form. Both techniques help in analyzing the core meaning of words.

Next, let us look at some broader NLP concepts and tasks that build on these fundamentals:

  • Word embeddings: This is a method used to translate words into numerical form, where each word is represented as a vector in a space that may have many dimensions. In this context, a “high-dimensional vector” refers to an array of numbers where the number of dimensions, or individual components, is quite large—often in the hundreds or even thousands. The idea behind using high-dimensional vectors is to capture the complex relationships between words, allowing words with similar meanings to be positioned closer together in this multi-dimensional space. The more dimensions the vector has, the more nuanced the relationships it can capture. Therefore, in word embeddings, semantically related words end up being closer to each other in this high-dimensional space, making it easier for algorithms to understand and process language in a way that reflects human understanding.
  • Language modeling: Language modeling is the process of developing statistical models that can predict or generate sequences of words or characters based on the patterns and structures found in a given text corpus.
  • Machine translation: The process of automatically translating text from one language to another using NLP techniques and models.
  • Sentiment analysis: The process of determining the attitude or sentiment expressed in a piece of text, often by analyzing the words and phrases used and their context.

Text preprocessing in NLP

Text preprocessing is a vital stage in NLP, where raw text data undergoes a transformation to become suitable for machine learning algorithms. This transformation involves converting the unorganized and often messy text into what is known as a “structured format.” A structured format means that the data is organized into a more systematic and predictable pattern, often involving techniques like tokenization, stemming, and removing unwanted characters. These steps help in cleaning the text, reducing irrelevant information or “noise,” and arranging the data in a manner that makes it easier for the machine learning models to understand. By following this approach, the raw text, which may contain inconsistencies and irregularities, is molded into a form that enhances the accuracy, performance, and efficiency of subsequent NLP tasks. In this section, we will explore various techniques used in text preprocessing to achieve this structured format.

Tokenization

As a reminder, tokenization is the crucial process of dividing text into smaller units, known as tokens. These tokens can be as small as individual words or even subwords. In NLP, tokenization is often considered the first step in preparing text data for further analysis. The reason for this foundational role lies in the very nature of language, where understanding and processing text requires breaking it down into manageable parts. By transforming a continuous stream of text into individual tokens, we create a structured format that mirrors the way humans naturally read and understand language. This structuring provides the machine learning models with a clear and systematic way to analyze the text, allowing them to recognize patterns and relationships within the data. As we delve deeper into NLP techniques, this tokenized format becomes the basis upon which many other preprocessing and analysis steps are built.

The following code snippet is tokenizing the given text using the Natural Language Toolkit (nltk) library in Python. The nltk is a widely used library in Python, specifically designed for working with human language data. It provides easy-to-use interfaces and tools for tasks such as classification, tokenization, stemming, tagging, parsing, and more, making it a valuable asset for NLP. For those who wish to leverage these capabilities in their Python projects, the nltk library can be downloaded and installed directly from the Python Package Index (PyPI) by using the command pip install nltk. By incorporating the nltk library into your code, you can access a rich set of functions and resources that streamline the development and execution of various NLP tasks, making it a popular choice among researchers, educators, and developers in the field of computational linguistics. Let us start by importing relevant functions and using them:

import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')  # Download the tokenizer models (needed once)
corpus = 'This is a book about algorithms.'
tokens = word_tokenize(corpus)
print(tokens)

The output will be a list that looks like this:

['This', 'is', 'a', 'book', 'about', 'algorithms', '.']

In this example, each token is a word. The granularity of the resulting tokens will vary based on the objective—for example, each token can consist of a word, a sentence, or a paragraph.

To tokenize text based on sentences, you can use the sent_tokenize function from the nltk.tokenize module:

from nltk.tokenize import sent_tokenize
corpus = 'This is a book about algorithms. It covers various topics in depth.'
sentences = sent_tokenize(corpus)
print(sentences)

In this example, the corpus variable contains two sentences. The sent_tokenize function takes the corpus as input and returns a list of sentences. When you run this code, you will get the following output:

['This is a book about algorithms.', 'It covers various topics in depth.']

Sometimes we may need to break down large texts into paragraph-level chunks. nltk can help with that task. This could be particularly useful in applications such as document summarization, where understanding the structure at the paragraph level may be crucial. Tokenizing text into paragraphs might seem straightforward, but it can be complex depending on the structure and format of the text. A simple approach is to split the text on two consecutive newline characters, which often separate paragraphs in plain text documents.

Here’s a basic example:

def tokenize_paragraphs(text):
    # Split on two consecutive newline characters
    paragraphs = text.split('\n\n')
    return [p.strip() for p in paragraphs if p.strip()]
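
For example, applying the function above to a small two-paragraph string produces one chunk per paragraph:

text = "First paragraph about algorithms.\n\nSecond paragraph about data."
print(tokenize_paragraphs(text))
['First paragraph about algorithms.', 'Second paragraph about data.']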

Next, let us look into how we can clean the data.

Cleaning data

Cleaning data is an essential step in NLP, as raw text data often contains noise and irrelevant information that can hinder the performance of NLP models. The goal of cleaning data for NLP is to preprocess the text data to remove noise and irrelevant information, and to transform it into a format that is suitable for analysis using NLP techniques. Note that data cleaning is done after the text is tokenized. The reason is that cleaning might involve operations that depend on the structure revealed by tokenization. For instance, removing specific words or altering word forms might be done more accurately after the text is tokenized into individual terms.

Let us study some techniques used to clean data and prepare it for machine learning tasks:

Case conversion

Case conversion is a technique in NLP where text is transformed from one case format to another, such as from uppercase to lowercase, or from title case to uppercase.

For example, the text “Natural Language Processing” in title case could be converted to lowercase to be “natural language processing.”

This simple yet effective step helps in standardizing the text, which in turn simplifies its processing for various NLP algorithms. By ensuring that the text is in a uniform case, it aids in eliminating inconsistencies that might otherwise arise from variations in capitalization.
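
As a quick illustration, Python's built-in lower() method handles this step directly:

text = "Natural Language Processing"
print(text.lower())  # natural language processing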

Punctuation removal

Punctuation removal in NLP refers to the process of removing punctuation marks from raw text data before analysis. Punctuation marks are symbols such as periods (.), commas (,), question marks (?), and exclamation marks (!) that are used in written language to indicate pauses, emphasis, or intonation. While they are essential in written language, they can add noise and complexity to raw text data, which can hinder the performance of NLP models.

It’s a reasonable concern to wonder how the removal of punctuation might affect the meaning of sentences. Consider the following examples:

"She's a cat."

"She's a cat??"

Without punctuation, both lines become “She’s a cat,” potentially losing the distinct emphasis conveyed by the question marks.

However, it’s worth noting that in many NLP tasks, such as topic classification or sentiment analysis, punctuation might not significantly impact the overall understanding. Additionally, models can rely on other cues from the text’s structure, content, or context to derive meaning. In cases where the nuances of punctuation are critical, specialized models and preprocessing techniques may be employed to retain the required information.
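
As a minimal sketch of one common approach, Python's str.translate can strip all punctuation in a single pass; note that this blanket removal also drops apostrophes:

import string
text = "She's a cat??"
print(text.translate(str.maketrans('', '', string.punctuation)))  # Shes a cat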

Handling numbers in NLP

Numbers within text data can pose challenges in NLP. Here’s a look at two main strategies for handling numbers in text analysis, considering both the traditional approach of removal and an alternative option of standardization.

In some NLP tasks, numbers may be considered noise, particularly when the focus is on aspects like word frequency or sentiment analysis. Here’s why some analysts might choose to remove numbers:

  • Lack of relevance: Numeric characters may not carry significant meaning in specific text analysis scenarios.
  • Skewing frequency counts: Numbers can distort word frequency counts, especially in models like topic modeling.
  • Reducing complexity: Removing numbers may simplify the text data, potentially enhancing the performance of NLP models.

However, an alternative approach is to convert all numbers to a standard representation rather than discarding them. This method acknowledges that numbers can carry essential information and ensures that their value is retained in a consistent format. It can be particularly useful in contexts where numerical data plays a vital role in the meaning of the text.
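
A minimal sketch of this standardization idea, using a hypothetical <NUM> placeholder token, could look like this:

import re
text = "The bill was 42 dollars and 50 cents in 2023."
print(re.sub(r'\d+', '<NUM>', text))
# The bill was <NUM> dollars and <NUM> cents in <NUM>.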

Deciding whether to remove or retain numbers requires an understanding of the problem being solved. An algorithm may need customization to distinguish whether a number is significant based on the context of the text and the specific NLP task. Analyzing the role of numbers within the domain of the text and the goals of the analysis can guide this decision-making process.

Handling numbers in NLP is not a one-size-fits-all approach. Whether to remove, standardize, or carefully analyze numbers depends on the unique requirements of the task at hand. Understanding these options and their implications helps in making informed decisions that align with the goals of the text analysis.

White space removal

White space in the context of text data is not merely the space between words but includes other “invisible” characters, such as tabs and newlines, that create spacing within text. In NLP, white space removal refers to the process of eliminating unnecessary white space characters, such as multiple spaces and tab characters. Removing unnecessary white spaces can reduce the size of the text data and make it easier to process and analyze.

Here’s a simple example to illustrate white space removal:

  • Input text: "The quick brown fox jumps over the lazy dog."
  • Processed text: "The quick brown fox jumps over the lazy dog."

In the above example, extra spaces and a tab character (denoted by ) are removed to create a cleaner and more standardized text string.
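
In Python, one simple way to collapse runs of spaces, tabs, and newlines into single spaces is to split the text and rejoin it, as in this small sketch:

text = "The   quick brown\tfox  jumps over the lazy dog."
print(' '.join(text.split()))  # The quick brown fox jumps over the lazy dog.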

Stop word removal

Stop word removal is the process of eliminating common words, known as stop words, from a text corpus. Stop words are words that occur frequently in a language but do not carry significant meaning or contribute to the overall understanding of the text. Examples of stop words in English include the, and, is, in, and for. Stop word removal helps reduce the dimensionality of the data: by removing words that don’t contribute meaningfully to the analysis, computational resources can be focused on the words that do matter, improving the efficiency of various NLP algorithms.

Note that stop word removal is more than a mere reduction in text size; it’s about focusing on the words that truly matter for the analysis at hand. While stop words play a vital role in language structure, their removal in NLP can enhance the efficiency and focus of the analysis, particularly in tasks like sentiment analysis where the primary concern is understanding the underlying emotion or opinion.
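
The following sketch shows stop word removal with the nltk stop word list (it assumes the stopwords resource has been downloaded with nltk.download('stopwords')):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stop_words = set(stopwords.words('english'))
text = "this is a book about algorithms"
print([w for w in word_tokenize(text) if w not in stop_words])
# ['book', 'algorithms']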

Stemming and lemmatization

In textual data, most words are likely to be present in slightly different forms. Reducing each word to its origin or stem within a family of words is called stemming. It is used to group words based on their similar meanings to reduce the total number of words that need to be analyzed. Essentially, stemming reduces the overall dimensionality of the problem. The most common algorithm for stemming English is the Porter algorithm.

For example, let us look into a couple of examples:

  • Example 1: {use, used, using, uses} => use
  • Example 2: {easily, easier, easiest} => easi

It’s important to note that stemming can sometimes result in misspelled words, as seen in example 2 where easi was produced.

Stemming is a simple and quick process, but it may not always produce correct results. For cases where correct spelling is required, lemmatization is a more appropriate method. Lemmatization considers the context and reduces words to their base form. The base form of a word, also known as the lemma, is the way the word would appear in a dictionary, stripped of any inflectional endings. Because the lemma is always a valid English word, lemmatization produces more accurate and meaningful word roots than stemming.
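
To see the difference in practice, the following sketch compares nltk's PorterStemmer with its WordNetLemmatizer (it assumes the WordNet data has been downloaded with nltk.download('wordnet')):

from nltk.stem import PorterStemmer, WordNetLemmatizer
ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(ps.stem('easily'))                         # easili - a stem, not a dictionary word
print(lemmatizer.lemmatize('mice'))              # mouse
print(lemmatizer.lemmatize('running', pos='v'))  # run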

The process of guiding algorithms to recognize similarities is a precise and thoughtful task. Unlike humans, algorithms need explicit rules and criteria to make connections that might seem obvious to us. Understanding this distinction and knowing how to provide the necessary guidance is a vital skill in the development and tuning of algorithms for various applications.

Cleaning data using Python

Let us look into how we can clean text using Python.

First, let’s import the necessary libraries:

import string
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
# Make sure to download the NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

Next, here is the main function to perform text cleaning:

def clean_text(text):
    """
    Cleans input text by converting case, removing punctuation, numbers,
    white spaces, stop words and stemming
    """
    # Convert to lowercase
    text = text.lower()
    
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    
    # Remove white spaces
    text = text.strip()
    
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = nltk.word_tokenize(text)
    filtered_text = [word for word in tokens if word not in stop_words]
    text = ' '.join(filtered_text)
    
    # Stemming
    ps = PorterStemmer()
    tokens = nltk.word_tokenize(text)
    stemmed_text = [ps.stem(word) for word in tokens]
    text = ' '.join(stemmed_text)
    
    return text

Let us test the function clean_text():

corpus="7- Today, Ottawa is becoming cold again "
clean_text(corpus)

The result will be:

today ottawa becom cold

Note the word becom in the output. As we are using stemming, not all the words in the output are expected to be correct English words.

Not all of the preceding processing steps are needed every time; the actual processing steps depend on the problem that we want to solve. They will vary from use case to use case. For example, if the numbers in the text represent something that may have some value in the context of the problem that we are trying to solve, then we may not need to remove the numbers from the text in the normalization phase.

Once the data is cleaned, we need to store the results in a data structure tailored for this purpose. This data structure is called the Term Document Matrix (TDM) and is explained next.

Understanding the Term Document Matrix

A TDM is a mathematical structure used in NLP. It’s a table that counts the frequency of terms (words) in a collection of documents. Each row represents a unique term, and each column represents a specific document. It’s an essential tool for text analysis, where you can see how often each word occurs in various texts.

For documents containing the words cat and dog:

  • Document 1: cat cat dog
  • Document 2: dog dog cat

        Document 1    Document 2
cat     2             1
dog     1             2

This matrix structure allows the efficient storage, organization, and analysis of large text datasets. In Python, the CountVectorizer class from the sklearn library can be used to create a TDM as follows:

from sklearn.feature_extraction.text import CountVectorizer
# Define a list of documents
documents = ["Machine Learning is useful", "Machine Learning is fun", "Machine Learning is AI"]
# Create an instance of CountVectorizer
vectorizer = CountVectorizer()
# Fit and transform the documents into a TDM
tdm = vectorizer.fit_transform(documents)
# Print the TDM
print(tdm.toarray())

The output looks as follows:

[[0 0 1 1 1 1]
 [0 1 1 1 1 0]
 [1 0 1 1 1 0]]

Note that corresponding to each document, there is a row, and corresponding to each distinct word, there is a column. There are three documents and there are six distinct words, resulting in a matrix with dimensions 3x6.

In this matrix, the numbers represent the frequency with which each word (column) appears in the corresponding document (row). For example, the 1 in the first row and third column means that the third word in the vocabulary ('is') appears once in the first document.
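
To see which column corresponds to which word, you can inspect the vectorizer's vocabulary; this assumes a scikit-learn version that provides get_feature_names_out:

print(vectorizer.get_feature_names_out())
# ['ai' 'fun' 'is' 'learning' 'machine' 'useful']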

TDM uses the frequency of each term by default, which is a simple way to quantify the importance of each word in the context of each individual document. A more sophisticated way to quantify the importance of each word is TF-IDF, which is explained in the next section.

Using TF-IDF

Term Frequency-Inverse Document Frequency (TF-IDF) is a method used to calculate the significance of words in a document. It considers two main components to determine the weight of each term: the Term Frequency (TF) and the Inverse Document Frequency (IDF). The TF looks at how often a word appears in a specific document, while the IDF examines how rare the word is across a collection of documents, known as a corpus. In the context of TF-IDF, the corpus refers to the entire set of documents that you are analyzing. If we are working with a collection of book reviews, for example, the corpus would include all the reviews:

  • TF: TF measures the number of times a term appears in a document. It is calculated as the ratio of the number of occurrences of a term in a document to the total number of terms in the document. The more frequent the term, the higher its TF value.
  • IDF: IDF measures the importance of a term across the entire corpus of documents. It is calculated as the logarithm of the ratio of the total number of documents in the corpus to the number of documents containing the term. The rarer the term across the corpus, the higher its IDF value.
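
To make these two definitions concrete, here is a small hand computation for a single term using the plain formulas above; note that scikit-learn's TfidfVectorizer, used next, applies a smoothed IDF and L2 normalization, so its scores will differ from this textbook version:

import math
documents = ["Machine Learning enables learning",
             "Machine Learning is fun",
             "Machine Learning is useful"]
term = "fun"
# TF: occurrences of the term in document 2 divided by that document's length
doc = documents[1].lower().split()
tf = doc.count(term) / len(doc)                          # 1 / 4 = 0.25
# IDF: log of (total documents / documents containing the term)
df = sum(term in d.lower().split() for d in documents)
idf = math.log(len(documents) / df)                      # log(3 / 1) ≈ 1.10
print(tf * idf)                                          # ≈ 0.27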

To compute TF-IDF using Python, do the following:

from sklearn.feature_extraction.text import TfidfVectorizer
# Define a list of documents
documents = ["Machine Learning enables learning", "Machine Learning is fun", "Machine Learning is useful"]
# Create an instance of TfidfVectorizer
vectorizer = TfidfVectorizer()
# Fit and transform the documents into a TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(documents)
# Get the feature names
feature_names = vectorizer.get_feature_names_out()
# Loop over the feature names and print the TF-IDF score for each term
for i, term in enumerate(feature_names):
    tfidf = tfidf_matrix[:, i].toarray().flatten()
    print(f"{term}: {tfidf}")

This will print:

enables:   [0.60366655 0.         0.        ]
fun:       [0.         0.66283998 0.        ]
is:        [0.         0.50410689 0.50410689]
learning:  [0.71307037 0.39148397 0.39148397]
machine:   [0.35653519 0.39148397 0.39148397]
useful:    [0.         0.         0.66283998]

Each printed row corresponds to a term, and the three values in each row are that term’s TF-IDF scores across the three documents. For example, the term fun has a non-zero TF-IDF value only in the second document, which is in line with our expectations.

Summary and discussion of results

The TF-IDF method provides a valuable way to weigh the importance of terms within individual documents and across an entire corpus. The resulting TF-IDF values reveal the relevance of specific terms within each document, taking into account both their frequency in a given document and their rarity across the entire collection. In the provided example, the varying TF-IDF scores for different terms demonstrate the model’s ability to distinguish words that are unique to specific documents from those that are more commonly used. This ability can be leveraged in various applications, such as text classification, information retrieval, and feature selection, to enhance the understanding and processing of text data.

Introduction to word embedding

One of the major advancements in NLP is our ability to create a meaningful numeric representation of words in the form of dense vectors. This technique is called word embedding. So, what exactly is a dense vector? Imagine you have a word like apple. In word embedding, apple might be represented as a series of numbers, such as [0.5, 0.8, 0.2], where each number is a coordinate in a continuous, multi-dimensional space. The term “dense” means that most or all of these numbers are non-zero, unlike sparse vectors where many elements might be zero. In simple terms, word embedding takes each word in a text and turns it into a unique, multi-dimensional point in space. This way, words with similar meanings will end up closer to each other in this space, allowing algorithms to understand the relationships between words. Yoshua Bengio first introduced the term in his paper A Neural Probabilistic Language Model. Each word in an NLP problem can be thought of as a categorical object.

In word embedding, we try to establish the neighborhood of each word and use it to quantify its meaning and importance. The neighborhood of a word is the set of words that surround it.

To truly grasp the concept of word embedding, let’s look at a tangible example involving a vocabulary of four familiar fruits: apple, banana, orange, and pear. The goal here is to represent these words as dense vectors, numerical arrays where each number captures a specific characteristic or feature of the word.

Why represent words this way? In NLP, converting words into dense vectors enables algorithms to quantify the relationships between different words. Essentially, we’re turning abstract language into something that is mathematically measurable.

Consider the features of sweetness, acidity, and juiciness for our fruit words. We could rate these features on a scale from 0 to 1 for each fruit, where 0 means the feature is entirely absent, and 1 means the feature is strongly present. This rating could look like this:

"apple": [0.5, 0.8, 0.2] – moderately sweet, quite acidic, not very juicy
"banana": [0.2, 0.3, 0.1] – not very sweet, moderately acidic, not juicy
"orange": [0.9, 0.6, 0.9] – very sweet, somewhat acidic, very juicy
"pear": [0.4, 0.1, 0.7] – moderately sweet, barely acidic, quite juicy

The numbers are subjective and can be derived from taste tests, expert opinions, or other methods, but they serve to transform the words into a format that an algorithm can understand and work with.

Visualizing this, you can imagine a 3D space where each axis represents one of the features (sweetness, acidity, or juiciness), and each fruit’s vector places it at a specific point in this space. Words (fruits) with similar tastes would be closer to each other in this space.

So, why the choice of dense vectors with a length of 3? This is based on the specific features we have chosen to represent. In other applications, the vector length might be different, based on the number of features you want to capture.

This example illustrates how word embedding takes a word and turns it into a numerical vector that holds real-world meaning. It’s a crucial step in enabling machines to “understand” and process human language.
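
To show how such vectors can be compared, the following sketch computes cosine similarity between the fruit vectors above using NumPy; the numbers are the illustrative ratings from this example, not learned embeddings:

import numpy as np
embeddings = {
    "apple":  np.array([0.5, 0.8, 0.2]),
    "banana": np.array([0.2, 0.3, 0.1]),
    "orange": np.array([0.9, 0.6, 0.9]),
    "pear":   np.array([0.4, 0.1, 0.7]),
}
def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: values near 1 mean similar direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine_similarity(embeddings["apple"], embeddings["banana"]))
print(cosine_similarity(embeddings["apple"], embeddings["orange"]))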

Implementing word embedding with Word2Vec

Word2Vec is a prominent method used for obtaining vector representations of words, commonly referred to as word embeddings. Rather than “generating words,” this algorithm creates numerical vectors that represent the semantic meaning of each word in the language.

The basic idea behind Word2Vec is to use a neural network to predict the context of each word in a given text corpus. The neural network is trained by inputting the word and its surrounding context words, and the network learns to output the probability distribution of the context words given the input word. The weights of the neural network are then used as the word embeddings, which can be used for various NLP tasks:

import gensim
# Define a text corpus
corpus = [['apple', 'banana', 'orange', 'pear'],
          ['car', 'bus', 'train', 'plane'],
          ['dog', 'cat', 'fox', 'fish']]
# Train a word2vec model on the corpus
model = gensim.models.Word2Vec(corpus, window=5, min_count=1, workers=4)

Let us break down the important parameters of the Word2Vec() function:

  • sentences: This is the input data for the model. It should be a collection of sentences, where each sentence is a list of words. Essentially, it’s a list of lists of words that represents your entire text corpus.
  • size: This defines the dimensionality of the word embeddings (note that in gensim 4.x this parameter is named vector_size). In other words, it sets the number of features or numerical values in the vectors that represent the words. A typical value might be 100 or 300, depending on the complexity of the vocabulary.
  • window: This parameter sets the maximum distance between the target word and the context words used for prediction within a sentence. For example, if you set the window size to 5, the algorithm will consider the five words immediately before and after the target word in the training process.
  • min_count: Words that appear infrequently in the corpus may be excluded from the model by setting this parameter. If you set min_count to 2, for example, any word that appears fewer than two times across all the sentences will be ignored during training.
  • workers: This refers to the number of processing threads used during training. Increasing this value can speed up training on multi-core machines by enabling parallel processing.
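
As a small sketch of how these parameters fit together (assuming gensim 4.x, where the dimensionality parameter is named vector_size rather than size):

from gensim.models import Word2Vec
model = Word2Vec(sentences=corpus, vector_size=50, window=5, min_count=1, workers=4)
# Each word in the vocabulary now maps to a 50-dimensional dense vector
print(model.wv['car'].shape)   # (50,)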

Once the Word2Vec model is trained, one of the powerful ways to use it is to measure the similarity or “distance” between words in the embedding space. This similarity score can give us insight into how the model perceives relationships between different words. Now let us check the model by looking at the distance between car and train:

print(model.wv.similarity('car', 'train'))
-0.057745814

Now let’s look into the similarity of car and apple:

print(model.wv.similarity('car', 'apple'))
0.11117952

Thus, the output gives us the similarity score between individual terms based on the word embeddings learned by the model.

Interpreting similarity scores

The following details help with interpreting similarity scores:

  • Very similar: Scores close to 1 signify strong similarity. Words with this score often share contextual or semantic meanings.
  • Moderately similar: Scores around 0.5 indicate some level of similarity, possibly due to shared attributes or themes.
  • Weak or no similarity: Scores close to 0 or negative imply little to no similarity or even contrast in meanings.

Thus, these similarity scores provide quantitative insights into word relationships. By understanding these scores, you can better analyze the semantic structure of your text corpus and leverage it for various NLP tasks.

Word2Vec provides a powerful and efficient way to represent textual data in a way that captures semantic relationships between words, reduces dimensionality, and improves accuracy in downstream NLP tasks. Let us look into the advantages and disadvantages of Word2Vec.

Advantages and disadvantages of Word2Vec

The following are the advantages of using Word2Vec:

  • Capturing semantic relationships: Word2Vec’s embeddings are positioned in the vector space in such a way that semantically related words are located near each other. This spatial arrangement captures syntactic and semantic relationships like synonyms, analogies, and more, enabling better performance in tasks like information retrieval and semantic analysis.
  • Reducing dimensionality: Traditional one-hot encoding of words can create a sparse and high-dimensional space, especially with large vocabularies. Word2Vec compresses this into a denser and lower-dimensional continuous vector space (typically ranging from 100 to 300 dimensions). This condensed representation preserves essential linguistic patterns while being computationally more efficient.
  • Extensibility to out-of-vocabulary words: A trained Word2Vec model only contains vectors for words seen during training. However, the approach extends naturally: subword-based successors such as FastText reuse the Word2Vec training idea to compose embeddings for unseen words from character n-grams, which improves robustness on new text data.

Now let us look into some of the disadvantages of using Word2Vec:

  • Training complexity: Word2Vec models can be computationally demanding to train, particularly with vast vocabularies and higher-dimensional vectors. They require significant computing resources and may necessitate optimization techniques, such as negative sampling or hierarchical softmax, to scale efficiently.
  • Lack of interpretability: The continuous and dense nature of Word2Vec embeddings makes them challenging to interpret by humans. Unlike carefully crafted linguistic features, the dimensions in Word2Vec don’t correspond to intuitive characteristics, making it difficult to understand what specific aspects of the words are being captured.
  • Sensitive to text preprocessing: The quality and effectiveness of Word2Vec embeddings can vary significantly based on the preprocessing steps applied to the text data. Factors such as tokenization, stemming, and lemmatization, or the removal of stopwords, must be carefully considered. The choice of preprocessing can impact the spatial relationships within the vector space, potentially affecting the model’s performance on downstream tasks.

Next, let us look into a case study about restaurant reviews that combines all the concepts presented in this chapter.

Case study: Restaurant review sentiment analysis

We will use the Yelp Reviews dataset, which contains labeled reviews as positive (5 stars) or negative (1 star). We will train a model that can classify the reviews of a restaurant as negative or positive.

Let’s implement this processing pipeline by going through the following steps.

Importing required libraries and loading the dataset

First, we import the packages that we need:

import numpy as np
import pandas as pd
import re
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
# scikit-learn components used later in this pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

Then we import the dataset from a .tsv (tab-separated values) file:

url = 'https://storage.googleapis.com/neurals/data/2023/Restaurant_Reviews.tsv'
dataset = pd.read_csv(url, delimiter='\t', quoting=3)
dataset.head()
                                              Review  Liked
0                           Wow... Loved this place.      1
1                                 Crust is not good.      0
2          Not tasty and the texture was just nasty.      0
3  Stopped by during the late May bank holiday of...      1
4  The selection on the menu was great and so wer...      1

Building a clean corpus: Preprocessing text data

Next, we clean the data by performing text preprocessing on each of the reviews of the dataset using stemming and stopword removal techniques:

def clean_text(text):
    text = re.sub('[^a-zA-Z]', ' ', text)
    text = text.lower()
    text = text.split()
    ps = PorterStemmer()
    stop_words = set(stopwords.words('english'))
    text = [ps.stem(word) for word in text if word not in stop_words]
    text = ' '.join(text)
    return text
corpus = [clean_text(review) for review in dataset['Review']]

The code iterates through each review in the dataset (in this case, the 'Review' column) and applies the clean_text function to preprocess and clean each review. The code creates a new list called corpus. The result is a list of cleaned and preprocessed reviews stored in the corpus variable.

Converting text data into numerical features

Now let’s define the features (represented by X) and the label (represented by y). Remember that features are the independent variables or attributes that describe the characteristics of the data, used as input for predictions.

And labels are the dependent variables or target values that the model is trained to predict, representing the outcomes corresponding to the features:

vectorizer = CountVectorizer(max_features=1500)
X = vectorizer.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values

Let’s divide the data into testing and training data:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

To train the model, we are using the Naive Bayes algorithm that we studied in Chapter 7:

classifier = GaussianNB()
classifier.fit(X_train, y_train)

Let’s predict the test set results:

y_pred = classifier.predict(X_test)

Next, let us print the confusion matrix. Remember that the confusion matrix is a table that helps visualize the performance of the classification model:

cm = confusion_matrix(y_test, y_pred)
print(cm)
[[55 42]
 [12 91]]
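
To summarize the matrix as a single number, you can also compute the accuracy on the test set, as in this short sketch using scikit-learn's accuracy_score:

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))  # (55 + 91) / 200 = 0.73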

Looking at the confusion matrix, we can see how many reviews were misclassified.

Analyzing the results

The confusion matrix gives us a glimpse into the classifications made by our model. With scikit-learn’s convention of actual labels along the rows and predicted labels along the columns (negative class first), there are:

  • 55 true negatives (negative reviews correctly predicted as negative)
  • 42 false positives (negative reviews incorrectly predicted as positive)
  • 12 false negatives (positive reviews incorrectly predicted as negative)
  • 91 true positives (positive reviews correctly predicted as positive)

The 55 true negatives and 91 true positives show that our model has a reasonable ability to distinguish between positive and negative reviews. However, the 42 false positives and 12 false negatives highlight areas for potential improvement.

In the context of restaurant reviews, understanding these numbers helps business owners and customers alike gauge the general sentiment. A high rate of true positives and true negatives indicates that the model can be trusted to give an accurate sentiment overview. This information could be invaluable for restaurants aiming to improve service or for potential customers seeking honest reviews. On the other hand, the presence of false positives and negatives suggests areas where the model might need fine-tuning to avoid misclassification and provide more accurate insights.

Applications of NLP

The continued advancement of NLP technology has revolutionized the way we interact with computers and other digital devices. It has made significant progress in recent years, with impressive achievements in many tasks, including:

  • Topic identification: To discover topics in a text repository and then classify the documents in the repository according to the topics discovered.
  • Sentiment analysis: To classify the text according to the positive or negative sentiments that it contains.
  • Machine translation: To translate between different languages.
  • Speech to text: To convert spoken words into written text.
  • Question answering: This is a process of understanding and responding to a query using the information that is available. It involves intelligently interpreting the question and providing a relevant answer based on the existing knowledge or data.
  • Entity recognition: To identify entities (such as a person, place, or thing) from text.
  • Fake news detection: To flag fake news based on the content.

Summary

The chapter discussed the basic terminology related to NLP, such as corpus, word embeddings, language modeling, machine translation, and sentiment analysis. In addition, the chapter covered various text preprocessing techniques that are essential in NLP, including tokenization, which involves breaking down text into smaller units called tokens, and other techniques such as stemming and stop word removal.

The chapter also discussed word embeddings and then presented a use case on restaurant review sentiment analysis. Now, readers should have a better understanding of the fundamental techniques used in NLP and their potential applications to real-world problems.

In the next chapter, we will look at training neural networks for sequential data. We will also investigate how the use of deep learning can further improve NLP techniques and the methodologies discussed in this chapter.

Learn more on Discord

To join the Discord community for this book – where you can share feedback, ask questions to the author, and learn about new releases – follow the QR code below:

https://packt.link/WHLel
