© Pradeepta Mishra 2019
Pradeepta Mishra, PyTorch Recipes, https://doi.org/10.1007/978-1-4842-4258-2_7

7. Natural Language Processing Using PyTorch

Pradeepta Mishra, Bangalore, Karnataka, India

Natural language processing is an important branch of computer science. It is the computational study of human language, in which computers perform various language tasks; the field is also known as computational linguistics. There are two components of natural language processing: natural language understanding and natural language generation. Natural language understanding involves analyzing the input language, extracting its meaning, and responding to it. Natural language generation is the process of producing language from input text. Language can be used in various ways, and one word may have different meanings, so removing ambiguity is an important part of natural language understanding.

Ambiguity can be of three types.
  • Lexical ambiguity arises at the level of individual words: deciding whether a word is a noun, verb, adverb, and so forth.

  • Syntactic ambiguity arises when one sentence admits multiple grammatical interpretations, so it is unclear how the subject and predicate should be read.

  • Referential ambiguity arises when a word or phrase, such as a pronoun, can refer to more than one entity, event, or scenario expressed in the text.

Text analysis is a precursor to natural language processing and understanding. Text analysis means corpus creation (assembling a collection of documents) and then cleaning the corpus: removing white space, punctuation, stop words, and junk values such as symbols and emojis, which have no textual meaning. After cleanup, the next task is to represent the text in vector form. This is done using the standard Word2vec model, or the text can be represented in term frequency and inverse document frequency (tf-idf) format. In today's world, we see many applications that use natural language processing; the following are some examples.
  • Spell-checking applications, online and on smartphones. The user types a particular word, and the system checks it and suggests a correction if the spelling is wrong.

  • Keyword search has been an integral part of our lives over the last decade. Whenever we go to a restaurant, buy something, or visit some place, we do an online search first. If the typed keyword contains an error, no exact match is retrieved; however, modern search engines are intelligent enough to predict the user's intent and suggest the pages the user actually wants.

  • Predictive text is used in various chat applications. As the user types, a list of likely next words appears, based on the user's writing pattern, and the user can select a word from the list to build the sentence.

  • Question-answering systems, such as Google Home and Amazon Alexa, allow users to interact with the system in natural language. The system processes that information, performs an intelligent search, and retrieves the best results for the user.

  • Alternate data extraction is used when the actual data is not directly available to the user, but the user can fetch publicly available data from the Internet and search it for relevant information. For example, if I want to buy a laptop, I may want to compare its price across various online portals. I can have a system scrape the price information from those websites and give me a summary. This process is called alternate data collection using web scraping, text processing, and natural language processing.

  • Sentiment analysis is the process of analyzing the mood of a customer, user, or agent from the text they express, such as customer reviews or movie reviews. The text is analyzed and tagged as expressing positive or negative sentiment, and similar applications can be built on top of this tagging.

  • Topic modeling is the process of finding the distinct topics present in a corpus. For example, if we take text from science, math, English, and biology, jumble it all together, and ask the machine to classify the text and tell us how many topics exist in the corpus, a perfect topic modeling system correctly separates the English words from the biology words, the biology words from the science words, and so on.

  • Text summarization is the process of condensing the text in a corpus into a shorter form. If we have a two-page, 1,000-word document and need to summarize it in a 200-word paragraph, we can do so with text summarization algorithms.

  • Language translation is translating one language into another, such as English to French or French to German. Language translation helps users understand another language and makes communication effective.

The study of human language is very complex. The same sentence may have many meanings, yet it is constructed for a specific intended audience. To handle the complexity of natural language, we need not only tools and programs but also systems and methods. The following five-step approach is used in natural language processing to understand text from the user.
  • Lexical analysis identifies the structure of the word.

  • Syntactic analysis studies the grammar and structure of sentences.

  • Semantic analysis determines the literal meaning of a word in its context.

  • PoS (part-of-speech) analysis identifies and parses the parts of speech.

  • Pragmatic analysis is understanding the real-world meaning of a word in context.

In this chapter, we use PyTorch to implement the steps that are most commonly used in natural language processing tasks.

Recipe 7-1. Word Embedding

Problem

How do we create a word-embedding model using PyTorch?

Solution

Word embedding is the process of representing words, phrases, and tokens meaningfully in a vector structure. The input text is mapped to vectors of real numbers, so the resulting feature vectors can be used for further computation by machine learning or deep learning models.

How It Works

Words and phrases are represented as vectors of real numbers. Words or phrases that have similar meanings in a paragraph or document have similar vector representations, which makes finding similar words an efficient computation over vectors. There are various algorithms for creating embedding vectors from text; Word2vec and GloVe are two well-known ones. Let's look at the following example.

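A minimal version of the setup for this example needs only the core PyTorch packages; everything that follows in this recipe assumes these imports.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)  # fix the random seed so the embedding weights are reproducible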

The following sets up an embedding layer.

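A minimal sketch of such a layer follows. nn.Embedding is a lookup table that maps each integer word index to a trainable vector of real numbers; the two-word vocabulary and five-dimensional vectors here are purely illustrative.

# Map each word of a toy vocabulary to an integer index.
word_to_ix = {"hello": 0, "world": 1}

# An embedding layer: 2 words in the vocabulary, 5-dimensional vectors.
embeds = nn.Embedding(num_embeddings=2, embedding_dim=5)

# Look up the embedding vector for "hello".
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
hello_embed = embeds(lookup_tensor)
print(hello_embed)  # a 1 x 5 tensor of trainable weights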

Let's look at some sample text. The following text has two paragraphs, and each paragraph has several sentences. If we apply word embedding to these two paragraphs, we get real-valued vectors as features from the text. Those features can be used for further computation.

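Any short corpus with a couple of paragraphs works the same way; the placeholder text below stands in for the book's sample.

# Placeholder corpus standing in for the two sample paragraphs.
test_sentence = """Natural language processing is the study of human
language by computers. Word embedding represents words and phrases as
vectors of real numbers. Words with similar meanings end up with
similar vector representations, which makes finding related words a
simple matter of comparing vectors.""".split()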

Tokenization is the process of splitting sentences into small chunks, called tokens; a sequence of n consecutive tokens is known as an n-gram. A single word is a unigram, two words a bigram, three words a trigram, and so on.

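With the corpus tokenized by split(), context/target pairs can be built directly. The sketch below forms trigrams by pairing each word with the two words that precede it; the context size and embedding dimension are illustrative hyperparameters.

CONTEXT_SIZE = 2    # number of preceding words used as context
EMBEDDING_DIM = 10  # size of each embedding vector

# Each training example pairs two context words with the word that follows them.
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]

# Index the vocabulary of the corpus.
vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}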

An n-gram language modeler written in PyTorch learns to predict a word from the words that precede it, which is what lets it surface relevant keywords.

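A minimal n-gram language modeler can be written as a small feed-forward network: embed the context words, concatenate the embeddings, and pass them through two linear layers to produce log-probabilities over the vocabulary. The 128-unit hidden layer is an illustrative choice.

class NGramLanguageModeler(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        # Concatenate the context-word embeddings into one long vector.
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        # Log-probabilities over the whole vocabulary.
        return F.log_softmax(out, dim=1)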

The n-gram modeler takes three arguments: the vocabulary size, the embedding dimension, and the context size. Let's look at the loss function and the model specification.

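Because the model emits log-probabilities, the negative log-likelihood loss is the natural pairing; a losses list tracks training progress.

losses = []
loss_function = nn.NLLLoss()  # expects log-probabilities as input
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)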

Apply the Adam optimizer.

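A sketch of the training loop with Adam follows; the learning rate and epoch count are illustrative.

optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0
    for context, target in trigrams:
        # Turn the context words into a tensor of integer indices.
        context_idxs = torch.tensor([word_to_ix[w] for w in context],
                                    dtype=torch.long)
        model.zero_grad()
        log_probs = model(context_idxs)
        loss = loss_function(
            log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    losses.append(total_loss)

print(losses)  # the per-epoch loss should decrease as training progresses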

Context extraction from sentences is also important. Let’s look at the following function.

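A small helper of this kind turns a list of context words into a tensor of indices that the model can consume; the function name here is illustrative.

def make_context_vector(context, word_to_ix):
    """Turn a list of context words into a tensor of word indices."""
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)

# For example, two context words drawn from the corpus:
print(make_context_vector(["human", "language"], word_to_ix))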

Recipe 7-2. CBOW Model in PyTorch

Problem

How do we create a CBOW model using PyTorch?

Solution

There are two main methods to represent words and phrases as vectors: continuous bag of words (CBOW) and skip-gram. The CBOW model learns embedding vectors by predicting the current word from its context, meaning the words before and after it. If we take a context of size 4, the four words to the left of the current word and the four words to the right of it are considered, and the model learns to predict the current word from those eight context words.

How It Works

Let’s look at the following example.

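A sketch of the data preparation follows, assuming a context of two words on each side of the target; the corpus text is a placeholder.

CONTEXT_SIZE = 2  # two words to the left and two to the right

raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.""".split()

vocab = set(raw_text)
word_to_ix = {word: i for i, word in enumerate(vocab)}

# Each training pair: four surrounding words as context, the middle word as target.
data = []
for i in range(CONTEXT_SIZE, len(raw_text) - CONTEXT_SIZE):
    context = [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]
    data.append((context, target))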
Graphically, the bag-of-words model looks like what is shown in Figure 7-1. It has three layers: the input layer, which holds the embedding vectors of the context words; the projection layer, a computational layer provided by the neural network; and the output layer, which holds the word predicted by the model.
Figure 7-1. CBOW model representation

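A minimal CBOW implementation sums the context-word embeddings and maps the result to log-probabilities over the vocabulary. The sketch below reuses the make_context_vector helper from Recipe 7-1; the embedding dimension, learning rate, and epoch count are illustrative.

class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, inputs):
        # Sum the embeddings of all context words into a single vector.
        embeds = self.embeddings(inputs).sum(dim=0).view(1, -1)
        return F.log_softmax(self.linear(embeds), dim=1)

model = CBOW(len(vocab), embedding_dim=10)
loss_function = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(10):
    for context, target in data:
        context_idxs = make_context_vector(context, word_to_ix)
        model.zero_grad()
        log_probs = model(context_idxs)
        loss = loss_function(
            log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))
        loss.backward()
        optimizer.step()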

Recipe 7-3. LSTM Model

Problem

How do we create an LSTM model using PyTorch?

Solution

The long short-term memory (LSTM) model is a specific form of recurrent neural network that is commonly used in the natural language processing field. Text comes as sequences of words that together make a meaningful sentence, so we need a model that remembers long and short sequences of text to predict the next word or phrase.

How It Works

Let’s look at the following example.

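The example relies on the same core PyTorch imports used earlier in the chapter. A quick way to see the LSTM interface is to push a random sequence through a small nn.LSTM; the dimensions here are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

# An LSTM whose inputs and hidden states are both 3-dimensional.
lstm = nn.LSTM(input_size=3, hidden_size=3)

# A sequence of five random inputs, shaped (seq_len, batch, features).
inputs = torch.randn(5, 1, 3)
out, (hidden, cell) = lstm(inputs)
print(out.shape)  # torch.Size([5, 1, 3]): one output per element of the sequence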

Prepare sequences of words as training data to train the LSTM network.

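A minimal sketch of the whole pipeline, assuming a part-of-speech tagging setup: each training sentence is paired with one tag per word, an LSTM reads the word embeddings, and a linear layer maps the LSTM outputs to tag scores. The sentences, tag set, and hyperparameters below are illustrative.

# Hand-tagged sentences (DET = determiner, NN = noun, V = verb).
training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"]),
]

word_to_ix = {}
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}

EMBEDDING_DIM = 6
HIDDEN_DIM = 6

def prepare_sequence(seq, to_ix):
    """Convert a list of words or tags into a tensor of indices."""
    return torch.tensor([to_ix[w] for w in seq], dtype=torch.long)

class LSTMTagger(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        # The LSTM turns each word embedding into a hidden state.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        # A linear layer maps hidden states to tag scores.
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        return F.log_softmax(tag_space, dim=1)

model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

for epoch in range(300):  # a tiny corpus needs many passes
    for sentence, tags in training_data:
        model.zero_grad()
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)
        tag_scores = model(sentence_in)
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()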

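After training, run a sentence through the model with gradients disabled and read off the most likely tag for each word.

with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)
    # The index of the highest score in each row is the predicted tag.
    print(torch.argmax(tag_scores, dim=1))
    # Ideally tensor([0, 1, 2, 0, 1]), i.e., DET NN V DET NN.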
