Recipe 1. Noun Phrase extraction
Recipe 2. Text similarity
Recipe 3. Parts of speech tagging
Recipe 4. Information extraction – NER – Entity recognition
Recipe 5. Topic modeling
Recipe 6. Text classification
Recipe 7. Sentiment analysis
Recipe 8. Word sense disambiguation
Recipe 9. Speech recognition and speech to text
Recipe 10. Text to speech
Recipe 11. Language detection and translation
Before getting into the recipes, let's first understand the NLP pipeline and life cycle. We implement many concepts in this book, and the volume of content can feel overwhelming. To make things simpler and smoother, let's see the flow we need to follow for an NLP solution.
Define the Problem : Understand the customer sentiment across the products.
Understand the depth and breadth of the problem : Understand the customer/user sentiment across the product. Why are we doing this? What is the business impact? Etc.
Data requirement brainstorming : Have a brainstorming activity to list out all possible data points.
All the reviews from customers on e-commerce platforms like Amazon, Flipkart, etc.
Emails sent by customers
Warranty claim forms
Survey data
Call center conversations using speech to text
Feedback forms
Social media data like Twitter, Facebook, and LinkedIn
Data collection : We learned different techniques to collect the data in Chapter 1. Based on the data and the problem, we might have to incorporate different data collection methods. In this case, we can use web scraping and Twitter APIs.
Text Preprocessing : We know that data won’t always be clean. We need to spend a significant amount of time to process it and extract insight out of it using different methods that we discussed earlier in Chapter 2.
Text to feature : As we discussed, texts are characters and machines will have a tough time understanding them. We have to convert them to features that machines and algorithms can understand using any of the methods we learned in the previous chapter.
Machine learning/Deep learning : Machine learning and deep learning are part of the artificial intelligence umbrella; they make systems learn patterns in data automatically without being explicitly programmed. Most NLP solutions are based on this, and since we have converted text to features, we can leverage machine learning or deep learning algorithms to achieve goals like text classification, natural language generation, etc.
Insights and deployment : There is absolutely no use in building NLP solutions if the insights are not properly communicated to the business. Always take time to connect the dots between the model/analysis output and the business, thereby creating the maximum impact.
Recipe 4-1. Extracting Noun Phrases
In this recipe, let us extract noun phrases from the text data (a sentence or documents).
Problem
You want to extract a noun phrase.
Solution
Noun phrase extraction is important when you want to analyze the "who" in a sentence. Let's see an example using TextBlob.
How It Works
Recipe 4-2. Finding Similarity Between Texts
In this recipe, we are going to discuss how to find the similarity between two documents or texts. There are many similarity metrics, like Euclidean, cosine, Jaccard, etc. Applications of text similarity can be found in areas like spelling correction and data deduplication.
Cosine similarity : Calculates the cosine of the angle between the two vectors.
Jaccard similarity : The score is calculated using the intersection or union of words.
Jaccard Index = (the number in both sets) / (the number in either set) * 100.
Levenshtein distance : Minimal number of insertions, deletions, and replacements required for transforming string “a” into string “b.”
Hamming distance : Number of positions at which the symbols differ between the two strings. It is defined only for strings of equal length.
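The metrics above can be sketched in plain Python. The function names and sample strings below are illustrative choices, not from any particular library:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm

def jaccard_similarity(s1, s2):
    """Intersection over union of the word sets of two sentences."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    return len(w1 & w2) / len(w1 | w2)

def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def levenshtein_distance(a, b):
    """Minimal insertions, deletions, and replacements to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, c1 in enumerate(a, 1):
        curr = [i]
        for j, c2 in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (c1 != c2)))   # replacement
        prev = curr
    return prev[-1]

print(jaccard_similarity("I like NLP", "I love NLP"))  # 0.5
print(hamming_distance("karolin", "kathrin"))          # 3
print(levenshtein_distance("kitten", "sitting"))       # 3
```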
Problem
You want to find the similarity between texts/documents.
Solution
The simplest way to do this is by using cosine similarity from the sklearn library.
How It Works
Let’s follow the steps in this section to compute the similarity score between text documents.
Step 2-1 Create/read the text data
Step 2-2 Find the similarity
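A minimal sketch using sklearn, as suggested in the solution, is shown below. The sample documents are illustrative assumptions:

```python
# A hedged sketch of document similarity: TF-IDF vectors from
# sklearn plus cosine similarity. The sample documents are mine.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = (
    "I like NLP",
    "I am exploring NLP",
    "I am a beginner in NLP",
    "I want to learn NLP",
    "I like advanced NLP",
)

# Fit TF-IDF on all documents, then compare the first one to every document.
tfidf = TfidfVectorizer().fit_transform(documents)
scores = cosine_similarity(tfidf[0:1], tfidf)
print(scores)
```

The first entry is 1.0 (a document compared with itself), and the last document scores highest among the rest because it shares the word "like" with the first.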
If we observe closely, the first sentence and the last sentence have higher similarity than the rest of the sentences.
Phonetic matching
1. Install and import the library
!pip install fuzzy
import fuzzy
2. Run the Soundex function
soundex = fuzzy.Soundex(4)
3. Generate the phonetic form
soundex('natural')    #output 'N364'
soundex('natuaral')   #output 'N364'
soundex('language')   #output 'L52'
soundex('processing') #output 'P625'
Soundex treats "natural" and "natuaral" as the same; the phonetic code for both strings is "N364." For "language" and "processing," the codes are "L52" and "P625," respectively.
Recipe 4-3. Tagging Part of Speech
Part of speech (POS) tagging is another crucial part of natural language processing that involves labeling words with a part of speech such as noun, verb, adjective, etc. POS tagging is the base for Named Entity Recognition, Sentiment Analysis, Question Answering, and Word Sense Disambiguation.
Problem
Tagging the parts of speech for a sentence.
Solution
Rule based - Manually created rules that tag a word as belonging to a particular POS.
Stochastic based - These algorithms capture the sequence of words and assign tags based on the probability of the sequence, using hidden Markov models.
How It Works
Again, NLTK has a widely used POS tagging module. nltk.pos_tag() is the function that generates the POS tag for each token in a list: tokenize the document first and pass the tokens in a single call.
Step 3-1 Store the text in a variable
Step 3-2 NLTK for POS
CC coordinating conjunction
CD cardinal digit
DT determiner
EX existential there (like: “there is” ... think of it like “there exists”)
FW foreign word
IN preposition/subordinating conjunction
JJ adjective ‘big’
JJR adjective, comparative ‘bigger’
JJS adjective, superlative ‘biggest’
LS list marker 1)
MD modal could, will
NN noun, singular ‘desk’
NNS noun plural ‘desks’
NNP proper noun, singular ‘Harrison’
NNPS proper noun, plural ‘Americans’
PDT predeterminer ‘all the kids’
POS possessive ending parent’s
PRP personal pronoun I, he, she
PRP$ possessive pronoun my, his, hers
RB adverb very, silently
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
TO to go ‘to’ the store
UH interjection
VB verb, base form take
VBD verb, past tense took
VBG verb, gerund/present participle taking
VBN verb, past participle taken
VBP verb, sing. present, non-3rd person take
VBZ verb, 3rd person sing. present takes
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-adverb where, when
Recipe 4-4. Extract Entities from Text
In this recipe, we are going to discuss how to identify and extract entities from text, a task called Named Entity Recognition (NER). There are multiple libraries to perform this task, like NLTK chunker, StanfordNER, spaCy, OpenNLP, and NeuroNER; and there are also many APIs, like Watson NLU, AlchemyAPI, NERD, Google Cloud NLP API, and many more.
Problem
You want to identify and extract entities from the text.
Solution
The simplest way to do this is by using ne_chunk from NLTK, or spaCy.
How It Works
Let’s follow the steps in this section to perform NER.
Step 4-1 Read/create the text data
Step 4-2 Extract the entities
Execute the below code.
Using NLTK
Using SpaCy
According to the output, Apple is an organization, 10000 is money, and New York is a place. The results are accurate and can be used for many NLP applications.
Recipe 4-5. Extracting Topics from Text
In this recipe, we are going to discuss how to identify topics from the document. Say, for example, there is an online library with multiple departments based on the kind of book. As the new book comes in, you want to look at the unique keywords/topics and decide on which department this book might belong to and place it accordingly. In these kinds of situations, topic modeling would be handy.
Basically, this is document tagging and clustering.
Problem
You want to extract or identify topics from the document.
Solution
The simplest way to do this is by using the gensim library.
How It Works
Let’s follow the steps in this section to identify topics within documents using gensim.
Step 5-1 Create the text data
Step 5-2 Cleaning and preprocessing
Step 5-3 Preparing document term matrix
Step 5-4 LDA model
All the weights associated with the topics from the sentences seem almost similar. You can run this on large data to extract significant topics. The whole idea of implementing this on sample data is to make you familiar with it; you can use the same code snippet on large data for meaningful results and insights.
Recipe 4-6. Classifying Text
Text classification – The aim of text classification is to automatically classify text documents into predefined categories.
Sentiment Analysis
Document classification
Spam – ham mail classification
Resume shortlisting
Document summarization
Problem
Spam - ham classification using machine learning.
Solution
If you observe, Gmail has a folder called “Spam.” It basically classifies your emails into spam and ham so that you don’t have to read unnecessary emails.
How It Works
Let’s follow the step-by-step method to build the classifier.
Step 6-1 Data collection and understanding
Please download data from the below link and save it in your working directory:
Step 6-2 Text processing and feature engineering
Step 6-3 Model training
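The training step can be sketched with sklearn as below. A tiny in-memory toy sample stands in for the downloaded dataset, and the pipeline shape (TF-IDF features feeding multinomial Naive Bayes) is an illustrative assumption:

```python
# A hedged sketch of spam/ham classification with sklearn's
# Naive Bayes. The toy emails stand in for the real dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "win a free prize claim your money now",
    "free entry claim cash prize winner",
    "lowest price free offer win money",
    "are we still meeting for lunch tomorrow",
    "please review the attached project report",
    "schedule the team meeting for monday",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Vectorize the text and train the classifier in one pipeline.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(emails, labels)

print(model.predict(["claim your free prize money"]))
print(model.predict(["let us schedule a meeting to review the report"]))
```

On real data you would hold out a test split and compare this against a linear classifier, as the text notes.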
Naive Bayes is giving better results than the linear classifier. We can try many more classifiers and then choose the best one.
Recipe 4-7. Carrying Out Sentiment Analysis
In this recipe, we are going to discuss how to understand the sentiment of a particular sentence or statement. Sentiment analysis is one of the widely used techniques across the industries to understand the sentiments of the customers/users around the products/services. Sentiment analysis gives the sentiment score of a sentence/statement tending toward positive or negative.
Problem
You want to do a sentiment analysis.
Solution
The simplest way to do this is by using the TextBlob or VADER library.
How It Works
Polarity : Polarity lies in the range [-1, 1], where 1 means a positive statement and -1 means a negative statement.
Subjectivity : Subjectivity lies in the range [0, 1] and indicates how far the text is a personal opinion rather than factual information.
Step 7-1 Create the sample data
Step 7-2 Cleaning and preprocessing
Refer to Chapter 2, Recipe 2-10, for this step.
Step 7-3 Get the sentiment scores
This is a negative review, as the polarity is “-0.68.”
Note: We will cover one real-world use case of sentiment analysis with an end-to-end implementation in the next chapter, Recipe 5-2.
Recipe 4-8. Disambiguating Text
Ambiguity arises because words can have different meanings in different contexts.
For example, the word “bank” can mean a financial institution or sloping land, depending on the context of the sentence.
Problem
You want to disambiguate the sense of a word based on its context.
Solution
The Lesk algorithm is one of the best-known algorithms for word sense disambiguation. Let’s see how to solve this using the pywsd and nltk packages.
How It Works
Below are the steps to achieve the results.
Step 8-1 Import libraries
Step 8-2 Disambiguating word sense
Observe that in context-1, “bank” is a financial institution, but in context-2, “bank” is sloping land.
Recipe 4-9. Converting Speech to Text
Converting speech to text is a very useful NLP technique.
Problem
You want to convert speech to text.
Solution
The simplest way to do this is by using the SpeechRecognition and PyAudio libraries.
How It Works
Let’s follow the steps in this section to implement speech to text.
Step 9-1 Understanding/defining business problem
Interaction with machines is trending toward voice, the usual mode of human communication. Popular examples are Siri, Alexa, and Google Assistant.
Step 9-2 Install and import necessary libraries
Step 9-3 Run below code
Recipe 4-10. Converting Text to Speech
Converting text to speech is another useful NLP technique.
Problem
You want to convert text to speech.
Solution
The simplest way to do this is by using the gTTS library.
How It Works
Let’s follow the steps in this section to implement text to speech.
Step 10-1 Install and import necessary libraries
Step 10-2 Run below code, gTTS function
Recipe 4-11. Translating Speech
Language detection and translation.
Problem
Whenever you try to analyze data from blogs that are hosted across the globe, especially websites from countries like China, where Chinese is used predominantly, analyzing such data or performing NLP tasks on such data would be difficult. That’s where language translation comes to the rescue. You want to translate one language to another.
Solution
The easiest way to do this is by using the goslate library.
How It Works
Let’s follow the steps in this section to implement language translation in Python.
Step 11-1 Install and import necessary libraries
Step 11-2 Input text
Step 11-3 Run goslate function
Well, that feels like quite an accomplishment, doesn’t it? We have implemented many advanced NLP applications and techniques. That’s not all, folks; we have a couple more interesting chapters ahead, where we will look at industrial applications of NLP, their solution approaches, and end-to-end implementations.