Building Your NLP Vocabulary

In the earlier chapters, you were introduced to why Natural Language Processing (NLP) is important, especially in today's context, followed by a discussion of a few prerequisites and Python libraries that are highly beneficial for NLP tasks. In this chapter, we will take this discussion further and cover, in detail, some of the most concrete tasks involved in building a vocabulary for NLP and preprocessing textual data. We will start by learning what a vocabulary is and then move on to actually building one by applying various methods to text data, methods that appear in most NLP pipelines across organizations.

In this chapter, we'll cover the following topics:

  • Lexicons
  • Phonemes, graphemes, and morphemes
  • Tokenization
  • Understanding word normalization

Technical requirements

The code files for this chapter can be downloaded from the following GitHub repository: https://github.com/PacktPublishing/Hands-On-Python-Natural-Language-Processing/tree/master/Chapter03.

Lexicons

Lexicons can be defined as the vocabulary of a person, language, or branch of knowledge. In simple terms, a lexicon can be thought of as a dictionary of terms, which are called lexemes. For instance, the terms used by medical practitioners can be thought of as a lexicon for their profession. As an example, when building an algorithm to convert a physical prescription written by doctors into an electronic form, the lexicon would be primarily composed of medical terms. Lexicons are used for a wide variety of NLP tasks, where they are provided as a list of words, or a vocabulary. Conversations in a given field are driven by its respective vocabulary. In this chapter, we will look at the steps and processes involved in building a natural language vocabulary.

Phonemes, graphemes, and morphemes

Before we start looking at the steps for building vocabulary, we need to understand phonemes, graphemes, and morphemes:

  • Phonemes can be thought of as units of speech sound, made by the mouth, that differentiate one word from another in a language.
  • Graphemes are groups of letters of size one or more that can represent these individual sounds or phonemes. The word spoon consists of five letters that actually represent four phonemes, identified by the graphemes s, p, oo, and n.
  • A morpheme is the smallest meaningful unit in a language. The word unbreakable is composed of three morphemes:
    • un—a bound morpheme signifying not
    • break—the root morpheme
    • able—a free morpheme signifying can be done

Now, let's delve into some practical aspects that form the base of every NLP-based system.

Tokenization

In order to build up a vocabulary, the first thing to do is to break the documents or sentences into chunks called tokens. Each token carries a semantic meaning associated with it. Tokenization is one of the fundamental things to do in any text-processing activity. Tokenization can be thought of as a segmentation technique wherein you are trying to break down larger pieces of text chunks into smaller meaningful ones. Tokens generally comprise words and numbers, but they can be extended to include punctuation marks, symbols, and, at times, understandable emoticons.

Let’s go through a few examples to understand this better:

sentence = "The capital of China is Beijing"
sentence.split()

Here's the output:

['The', 'capital', 'of', 'China', 'is', 'Beijing']

A simple sentence.split() method could provide us with all the different tokens in the sentence The capital of China is Beijing. Each token in the preceding split carries an intrinsic meaning; however, it is not always as straightforward as this.

Issues with tokenization

Consider the sentence and corresponding split in the following example:

sentence = "China's capital is Beijing"
sentence.split()

Here's the output:

["China's", 'capital', 'is', 'Beijing']

In the preceding example, should it be China, Chinas, or China's? A split method often does not know how to deal with situations containing apostrophes.

In the next two examples, how do we deal with we'll and I'm? We'll indicates we will and I'm indicates I am. What should be the tokenized form of we'll? Should it be well or we'll or we and 'll separately? Similarly, how do we tokenize I'm? An ideal tokenizer should be able to process we'll into two tokens, we and will, and I'm into two tokens, I and am. Let's see how our split method would do in this situation.

Here's the first example:

sentence = "Beijing is where we'll go"
sentence.split()

Here's the output:

['Beijing', 'is', 'where', "we'll", 'go']

Here's the second example:

sentence = "I'm going to travel to Beijing"
sentence.split()

Here's the output:

["I'm", 'going', 'to', 'travel', 'to', 'Beijing']

How do we represent Hong Kong? Should it be two different tokens or a single token?

sentence = "Let's travel to Hong Kong from Beijing"
sentence.split()

Here's the output:

["Let's", 'travel', 'to', 'Hong', 'Kong', 'from', 'Beijing']

Here, ideally, Hong Kong should be one token, but think of another sentence: The name of the King is Kong. In such scenarios, Kong should be an individual token. In such situations, context can play a major role in understanding how to treat similar token representations when the context varies. Tokens of size 1, such as Kong, are referred to as unigrams, whereas tokens of size 2, such as Hong Kong, are referred to as bigrams. These can be generalized under the wing of n-grams, which we'll discuss towards the end of this chapter.

How do we deal with periods? How do we understand whether they signify the end of a sentence or indicate an abbreviation?

In the following code snippet and subsequent output, the period between M and S is actually indicative of an abbreviation:

sentence = "A friend is pursuing his M.S from Beijing"
sentence.split()

Here's the output:

['A', 'friend', 'is', 'pursuing', 'his', 'M.S', 'from', 'Beijing']

In the next example, does a token such as umm carry any meaning? Shouldn't it be removed? Even though a token such as umm is not part of the English vocabulary, it becomes important in use cases involving speech synthesis, as it indicates that the person is pausing and trying to think of something. Again, along with the context, the use case also matters when deciding whether something should be kept as a token or simply removed as a fragment of text that doesn't convey any meaning:

sentence = "Most of the times umm I travel"
sentence.split()

Here's the output:

['Most', 'of', 'the', 'times', 'umm', 'I', 'travel']

The rise of social media platforms has resulted in a massive influx of user data, which is a rich mine of information for understanding individuals and communities; however, it has also given rise to a world of emoticons, short forms, new abbreviations (often called the millennial language), and so on. There is a need to understand this ever-growing kind of text, including cases where, for instance, the character P used with a colon (:) and a hyphen (-), as in :-P, denotes a face with a stuck-out tongue. Hashtags are another very common thing on social media that are mostly indicative of summaries or emotions behind a Facebook post or a tweet on Twitter. An example of this is shown in the following snippet. Such growth has led to the development of tokenizers such as TweetTokenizer:

sentence = "Beijing is a cool place!!! :-P <3 #Awesome"
sentence.split()

Here's the output:

['Beijing', 'is', 'a', 'cool', 'place!!!', ':-P', '<3', '#Awesome']

In the next section, we will look at TweetTokenizer and a few other standard tokenizers available from the nltk library.

Different types of tokenizers

Based on the understanding we have developed so far, let's discuss the different types of tokenizers that are readily available for usage and see how these could be leveraged for the proper tokenization of text.

Regular expressions

Regular expressions are sequences of characters that define a search pattern. They are one of the earliest and still one of the most effective tools for identifying patterns in text. Imagine searching for email IDs in a corpus of text: they follow the same pattern and are guided by a set of rules, no matter which domain hosts them. Regular expressions are the way to go for identifying such things in text data, rather than trying machine learning-oriented techniques. Another notable example where regular expressions have been widely employed is the SUTime offering from Stanford NLP, wherein tokenization based on regular expressions is used to identify the date, time, duration, and set type entities in text. Look at the following sentence:

Last summer, they met every Tuesday afternoon, from 1:00 pm to 3:00 pm.

For this sentence, the SUTime library would return TIMEX expressions where each TIMEX expression would indicate the existence of one of the aforementioned entities:

  • Last summer (type DATE, value 2019-SU):
    <TIMEX3 tid="t1" type="DATE" value="2019-SU">Last summer</TIMEX3>
  • every Tuesday afternoon (type SET, value XXXX-WXX-2TAF):
    <TIMEX3 periodicity="P1W" quant="every" tid="t2" type="SET" value="XXXX-WXX-2TAF">every Tuesday afternoon</TIMEX3>
  • 1:00 pm (type TIME, value 2020-01-06T13:00):
    <TIMEX3 tid="t3" type="TIME" value="2020-01-06T13:00">1:00 pm</TIMEX3>
  • 3:00 pm (type TIME, value 2020-01-06T15:00):
    <TIMEX3 tid="t4" type="TIME" value="2020-01-06T15:00">3:00 pm</TIMEX3>

The TIMEX expressions can be parsed to convert them into a user-readable format.
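As a minimal illustration of the idea (using Python's built-in re module rather than SUTime itself, with an illustrative pattern name, time_pattern), a simple regular expression can already pick out time-of-day mentions such as 1:00 pm and 3:00 pm from the sentence above:

import re

sentence = "Last summer, they met every Tuesday afternoon, from 1:00 pm to 3:00 pm."
# A deliberately simple, illustrative pattern for times such as 1:00 pm
time_pattern = r'\d{1,2}:\d{2}\s?(?:am|pm)'
re.findall(time_pattern, sentence, flags=re.IGNORECASE)

This returns ['1:00 pm', '3:00 pm']; richer expressions such as Last summer need the kind of rule sets that SUTime encodes.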

Try it out!

You can try various phrases at https://nlp.stanford.edu/software/sutime.shtml.

Regular expressions-based tokenizers

The nltk package in Python provides a regular expression-based tokenizer (RegexpTokenizer). It can be used to tokenize or split a sentence based on a provided regular expression. Take the following sentence:

A Rolex watch costs in the range of $3000.0 - $8000.0 in the USA.

Here, we would like to have expressions indicating money, alphabetic sequences, and abbreviations together. We can define a regular expression to do this and pass the utterance to the corresponding tokenizer object, as shown in the following code block:

from nltk.tokenize import RegexpTokenizer
s = "A Rolex watch costs in the range of $3000.0 - $8000.0 in USA."
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
tokenizer.tokenize(s)

Here's the output:

['A',
 'Rolex',
 'watch',
 'costs',
 'in',
 'the',
 'range',
 'of',
 '$3000.0',
 '-',
 '$8000.0',
 'in',
 'USA',
 '.']

Now, how did this work?

The \w+|\$[\d\.]+|\S+ regular expression allows three alternative patterns (a small demonstration follows this list):

  • First alternative: \w+, which matches any word character (equal to [a-zA-Z0-9_]). The + is a quantifier and matches between one and unlimited times, as many times as possible.
  • Second alternative: \$[\d\.]+. Here, \$ matches the character $, \d matches a digit between 0 and 9, \. matches the character . (period), and + again acts as a quantifier matching between one and unlimited times.
  • Third alternative: \S+. Here, \S accepts any non-whitespace character and + again acts in the same way as in the preceding two alternatives.
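To make the alternation concrete, here is a small, illustrative check with Python's built-in re module, showing which alternative captures which kind of token:

import re

pattern = r'\w+|\$[\d\.]+|\S+'

# The first alternative, \w+, matches plain word characters
print(re.findall(pattern, "Rolex watch"))
# The second alternative, \$[\d\.]+, keeps the dollar sign attached to the amount
print(re.findall(pattern, "costs $3000.0"))
# The third alternative, \S+, picks up leftovers such as the standalone hyphen
print(re.findall(pattern, "$3000.0 - $8000.0"))

The three calls print ['Rolex', 'watch'], ['costs', '$3000.0'], and ['$3000.0', '-', '$8000.0'], respectively.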

There are other tokenizers built on top of the RegexpTokenizer, such as the BlankLine tokenizer, which tokenizes a string treating blank lines as delimiters where blank lines are those that contain no characters except spaces or tabs.

The WordPunct tokenizer is another implementation on top of RegexpTokenizer, which tokenizes a text into a sequence of alphabetic and non-alphabetic characters using the regular expression \w+|[^\w\s]+.
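For instance, the WordPunctTokenizer class in nltk splits contractions and punctuation into separate tokens, as a quick illustrative run shows:

from nltk.tokenize import WordPunctTokenizer

s = "Let's travel to Hong Kong from Beijing!"
WordPunctTokenizer().tokenize(s)

This yields ['Let', "'", 's', 'travel', 'to', 'Hong', 'Kong', 'from', 'Beijing', '!'].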

Try it out!

Build a regular expression to figure out email IDs from the text. Validate your expression at https://regex101.com.

Treebank tokenizer

The Treebank tokenizer also uses regular expressions to tokenize text according to the Penn Treebank (https://catalog.ldc.upenn.edu/docs/LDC95T7/cl93.html). Here, words are mostly split based on punctuation.

The Treebank tokenizer does a great job of splitting contractions such as doesn't into does and n't. It also separates periods that appear at the end of a line into their own tokens. Punctuation such as commas is split off if followed by whitespace.

Let’s look at the following sentence and tokenize it using the Treebank tokenizer:

I'm going to buy a Rolex watch that doesn't cost more than $3000.0

The code is as follows:

from nltk.tokenize import TreebankWordTokenizer
s = "I'm going to buy a Rolex watch that doesn't cost more than $3000.0"
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(s)

Here's the output:

['I',
 "'m",
 'going',
 'to',
 'buy',
 'a',
 'Rolex',
 'watch',
 'that',
 'does',
 "n't",
 'cost',
 'more',
 'than',
 '$',
 '3000.0']

As can be seen in the example and corresponding output, this tokenizer primarily helps in analyzing each component in the text separately. The I'm gets split into two components, namely the I, which corresponds to a noun phrase, and the 'm, which corresponds to a verb component. This split allows us to work on individual tokens that carry significant information that would have been difficult to analyze and parse if it was a single token. Similarly, doesn't gets split into does and n't, helping to better parse and understand the inherent semantics associated with the n't, which indicates negation.

TweetTokenizer

As discussed earlier, the rise of social media has given rise to an informal language wherein people tag each other using their social media handles and use a lot of emoticons, hashtags, and abbreviated text to express themselves. We need tokenizers in place that can parse such text and make things more understandable. TweetTokenizer caters to this use case significantly. Let's look at the following sentence/tweet:

@amankedia I'm going to buy a Rolexxxxxxxx watch!!! :-D #happiness #rolex <3

The tweet contains a social media handle, amankedia, a couple of hashtags in the form of #happiness and #rolex, and :-D and <3 emoticons. The next code snippet and the corresponding output show how all the text gets tokenized using TweetTokenizer to take care of all of these occurrences.

Consider the following example:

from nltk.tokenize import TweetTokenizer
s = "@amankedia I'm going to buy a Rolexxxxxxxx watch!!! :-D #happiness #rolex <3"
tokenizer = TweetTokenizer()
tokenizer.tokenize(s)

Here's the output:

['@amankedia',
 "I'm",
 'going',
 'to',
 'buy',
 'a',
 'Rolexxxxxxxx',
 'watch',
 '!',
 '!',
 '!',
 ':-D',
 '#happiness',
 '#rolex',
 '<3']

Another common thing in social media writing is the use of expressions such as Rolexxxxxxxx. Here, many extra x's are present in addition to the usual one; this is a very common trend and should be addressed to bring the word to a form as close to normal as possible.

TweetTokenizer provides two additional parameters, strip_handles and reduce_len. The reduce_len parameter tries to reduce excessively repeated characters in a token; with it enabled, the word Rolexxxxxxxx is tokenized as Rolexxx in an attempt to reduce the number of x's present:

from nltk.tokenize import TweetTokenizer
s = "@amankedia I'm going to buy a Rolexxxxxxxx watch!!! :-D #happiness #rolex <3"
tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
tokenizer.tokenize(s)

Here's the output:

["I'm",
 'going',
 'to',
 'buy',
 'a',
 'Rolexxx',
 'watch',
 '!',
 '!',
 '!',
 ':-D',
 '#happiness',
 '#rolex',
 '<3']

The parameter strip_handles, when set to True, removes the handles mentioned in a post/tweet. As can be seen in the preceding output, @amankedia is stripped, since it is a handle.

One more parameter that is available with TweetTokenizer is preserve_case, which, when set to False, converts everything to lower case in order to normalize the vocabulary. The default value for this parameter is True.
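As a quick illustration (reusing the tweet from the previous example), setting preserve_case=False lowercases the tokens:

from nltk.tokenize import TweetTokenizer

s = "@amankedia I'm going to buy a Rolexxxxxxxx watch!!! :-D #happiness #rolex <3"
# preserve_case=False lowercases tokens to normalize the vocabulary
tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True, preserve_case=False)
tokenizer.tokenize(s)

With this setting, I'm becomes i'm and Rolexxx becomes rolexxx in the output.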

Understanding word normalization

Most of the time, we don't want to have every individual word fragment that we have ever encountered in our vocabulary. We could want this for several reasons, one being the need to correctly distinguish (for example) the phrase U.N. (with characters separated by periods) from UN (without any periods). We can also bring words to their root form in the dictionary. For instance, am, are, and is can be identified by their root form, be. On another front, we can remove inflections from words to bring them down to the same form. The words car, cars, and car's can all be identified as car.

Also, common words that occur very frequently and do not convey much meaning, such as the articles a, an, and the, can be removed. However, all of this depends heavily on the use case. Wh- words, such as when, why, where, and who, do not carry much information in most contexts and are removed as part of a technique called stopword removal, which we will see a little later in the Stopword removal section; however, in situations such as question classification and question answering, these words become very important and should not be removed. Now, with a basic understanding of these techniques, let's dive into them in detail.

Stemming

Imagine bringing all of the words computer, computerization, and computerize into one word, compute. What happens here is called stemming. As part of stemming, a crude attempt is made to remove the inflectional forms of a word and bring them to a base form called the stem. The chopped-off pieces are referred to as affixes. In the preceding example, compute is the base form and the affixes are r, rization, and rize, respectively. One thing to keep in mind is that the stem need not be a valid word as we know it. For example, the word traditional would get stemmed to tradit, which is not a valid word in the English dictionary.

The two most common algorithms/methods employed for stemming include the Porter stemmer and the Snowball stemmer. The Porter stemmer supports the English language, whereas the Snowball stemmer, which is an improvement on the Porter stemmer, supports multiple languages, which can be seen in the following code snippet and its output:

from nltk.stem.snowball import SnowballStemmer
print(SnowballStemmer.languages)

Here's the output:

('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')

One thing to note from the snippet is that the Porter stemmer is one of the offerings provided by the Snowball stemmer. Other stemmers include the Lancaster, Dawson, Krovetz, and Lovins stemmers, among others. We will look at the Porter and Snowball stemmers in detail here.
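Of these, the Lancaster stemmer also ships with nltk and is generally more aggressive than the Porter family, so it tends to produce shorter stems. Here is a small, illustrative sketch (the stems are printed rather than listed here, as they differ from Porter's output):

from nltk.stem import LancasterStemmer

lancaster = LancasterStemmer()
# Lancaster applies a larger, iteratively applied rule set than Porter
print([lancaster.stem(word) for word in ['traditional', 'reference', 'colonizer']])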

The Porter stemmer works only with strings, whereas the Snowball stemmer works with both strings and Unicode data. The Snowball stemmer also allows the option to ignore stopwords as an inherent functionality.
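As a quick, illustrative check of the ignore_stopwords option (assuming having is part of the stemmer's built-in stopword list), compare the following two calls:

from nltk.stem.snowball import SnowballStemmer

# With ignore_stopwords=True, stopwords such as 'having' are left untouched
print(SnowballStemmer('english', ignore_stopwords=True).stem('having'))
print(SnowballStemmer('english').stem('having'))

The first call should leave having unchanged, while the second stems it to have.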

Let's now first apply the Porter stemmer to words and see its effects in the following code block:

plurals = ['caresses', 'flies', 'dies', 'mules', 'died', 'agreed', 'owned', 'humbled', 'sized', 'meeting', 'stating',
'siezing', 'itemization', 'traditional', 'reference', 'colonizer', 'plotted', 'having', 'generously']

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
singles = [stemmer.stem(plural) for plural in plurals]
print(' '.join(singles))

Here's the stemmed output from the Porter stemming algorithm:

caress fli die mule die agre own humbl size meet state siez item tradit refer colon plot have gener

Next, let's see how the Snowball stemmer would do on the same text:

stemmer2 = SnowballStemmer(language='english')
singles = [stemmer2.stem(plural) for plural in plurals]
print(' '.join(singles))

Here's the stemmed output of applying the Snowball stemming algorithm:

caress fli die mule die agre own humbl size meet state siez item tradit refer colon plot have generous

As can be seen in the preceding code snippets, the Snowball stemmer requires the specification of a language parameter. In most cases, its output is similar to that of the Porter stemmer, except for generously, where the Porter stemmer outputs gener and the Snowball stemmer outputs generous. The example shows how the Snowball stemmer makes minor changes to the Porter algorithm, achieving improvements in some cases.

Over-stemming and under-stemming

Potential problems with stemming arise in the form of over-stemming and under-stemming. A situation may arise when words that are stemmed to the same root should have been stemmed to different roots. This problem is referred to as over-stemming. In contrast, another problem occurs when words that should have been stemmed to the same root aren't stemmed to it. This situation is referred to as under-stemming.
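A quick, illustrative check with the Porter stemmer makes both problems visible (these are classic examples from the literature; the exact stems may vary with the stemmer version):

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
# Over-stemming: words with distinct meanings collapsing onto one stem
print([stemmer.stem(word) for word in ['universal', 'university', 'universe']])
# Under-stemming: related forms failing to reach a common stem
print([stemmer.stem(word) for word in ['alumnus', 'alumni', 'alumnae']])

Typically, the first group collapses to a single stem even though the words mean different things, while the second group ends up with different stems despite referring to the same concept.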

Lemmatization

Unlike stemming, wherein a few characters are removed from words using crude methods, lemmatization is a process wherein the context is used to convert a word to its meaningful base form. It helps in grouping together words that have a common base form and so can be identified as a single item. The base form is referred to as the lemma of the word and is also sometimes known as the dictionary form.

Lemmatization algorithms try to identify the lemma form of a word by taking into account the neighborhood context of the word, part-of-speech (POS) tags, the meaning of a word, and so on. The neighborhood can span across words in the vicinity, sentences, or even documents.

Also, the same word can have different lemmas depending on the context. A lemmatizer tries to identify the part-of-speech tag based on the context in order to identify the appropriate lemma. The most commonly used lemmatizer is the WordNet lemmatizer. Other lemmatizers include the Spacy lemmatizer, the TextBlob lemmatizer, and the Gensim lemmatizer, among others. In this section, we will explore the WordNet and Spacy lemmatizers.

WordNet lemmatizer

WordNet is a lexical database of English that is freely and publicly available. As part of WordNet, nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing distinct concepts. These synsets are interlinked using lexical and conceptual semantic relationships. It can be easily downloaded, and the nltk library offers an interface to it that enables you to perform lemmatization.

Let's try and lemmatize the following sentence using the WordNet lemmatizer:

We are putting in efforts to enhance our understanding of Lemmatization

Here is the code:

import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
s = "We are putting in efforts to enhance our understanding of Lemmatization"
token_list = s.split()
print("The tokens are: ", token_list)
lemmatized_output = ' '.join([lemmatizer.lemmatize(token) for token in token_list])
print("The lemmatized output is: ", lemmatized_output)

Here's the output:

The tokens are:  ['We', 'are', 'putting', 'in', 'efforts', 'to', 'enhance', 'our', 'understanding', 'of', 'Lemmatization']
The lemmatized output is:  We are putting in effort to enhance our understanding of Lemmatization

As can be seen, the WordNet lemmatizer did not do much here. Of are, putting, efforts, and understanding, only efforts was reduced to its base form, effort; the rest were left untouched.

What are we lacking here?

The WordNet lemmatizer works well if the POS tags are also provided as inputs.
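For instance, passing the verb POS tag explicitly already produces the expected base forms:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# pos='v' tells the lemmatizer to treat the token as a verb
print(lemmatizer.lemmatize('putting', pos='v'))
print(lemmatizer.lemmatize('are', pos='v'))

This prints put and be, respectively.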

It is practically impossible to manually annotate each word with its POS tag in a text corpus. Now, how do we solve this problem and provide the part-of-speech tags for individual words as input to the WordNet lemmatizer?

Fortunately, the nltk library provides a method for finding POS tags for a list of words using an averaged perceptron tagger, the details of which are out of the scope of this chapter.

The POS tags provided by the POS tagging method for the sentence We are putting in efforts to enhance our understanding of Lemmatization can be found in the following code snippet:

nltk.download('averaged_perceptron_tagger')
pos_tags = nltk.pos_tag(token_list)
pos_tags

Here's the output:

[('We', 'PRP'),
 ('are', 'VBP'),
 ('putting', 'VBG'),
 ('in', 'IN'),
 ('efforts', 'NNS'),
 ('to', 'TO'),
 ('enhance', 'VB'),
 ('our', 'PRP$'),
 ('understanding', 'NN'),
 ('of', 'IN'),
 ('Lemmatization', 'NN')]

As can be seen, the POS tagger returns a list of tuples of the form (token, POS tag). Now, the POS tags need to be converted into a form that the WordNet lemmatizer understands and passed in as input along with the tokens.

The following code snippet does what's needed by mapping each POS tag to its first character, which the lemmatizer accepts in the appropriate format:

from nltk.corpus import wordnet

##This is a common method which is widely used across the NLP community of practitioners and readers
def get_part_of_speech_tags(token):
    """Maps POS tags to first character lemmatize() accepts.
    We are focusing on Verbs, Nouns, Adjectives and Adverbs here."""
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    tag = nltk.pos_tag([token])[0][1][0].upper()
    return tag_dict.get(tag, wordnet.NOUN)

Now, let’s see how the WordNet lemmatizer performs when the POS tags are also provided as inputs:

lemmatized_output_with_POS_information = [lemmatizer.lemmatize(token, get_part_of_speech_tags(token)) for token in token_list]
print(' '.join(lemmatized_output_with_POS_information))

Here's the output:

We be put in effort to enhance our understand of Lemmatization

The following conversions happened:

  • are to be
  • putting to put
  • efforts to effort
  • understanding to understand

Let’s compare this with the Snowball stemmer:

stemmer2 = SnowballStemmer(language='english')
stemmed_sentence = [stemmer2.stem(token) for token in token_list]
print(' '.join(stemmed_sentence))

Here's the output:

we are put in effort to enhanc our understand of lemmat

As can be seen, the WordNet lemmatizer makes a sensible and context-aware conversion of the token into its base form, unlike the stemmer, which tries to chop the affixes from the token.

Spacy lemmatizer

The Spacy lemmatizer comes with pretrained models that can parse text and figure out the various properties of the text, such as POS tags, named-entity tags, and so on, with a simple function call. The prebuilt models identify the POS tags and assign a lemma to each token, unlike the WordNet lemmatizer, where the POS tags need to be explicitly provided.

We can install Spacy and download the en model for the English language by running the following command from the command line:

pip install spacy && python -m spacy download en

Now that we have installed spacy, let's see how spacy helps with lemmatization using the following code snippet:

import spacy
nlp = spacy.load('en')
doc = nlp("We are putting in efforts to enhance our understanding of Lemmatization")
" ".join([token.lemma_ for token in doc])

Here's the output:

'-PRON- be put in effort to enhance -PRON- understanding of lemmatization'

The spacy lemmatizer performed a decent job without the input information of the POS tags. The advantage here is that there's no need to look out for external dependencies for fetching POS tags as the information is built into the pretrained model.

Another thing to note in the preceding output is the -PRON- lemma. The lemma for pronouns is returned as -PRON-, which is Spacy's default behavior. This can act as a feature or, conversely, be a limitation, since the exact lemma is not returned.
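If the -PRON- placeholder is not desirable for a given use case, one simple workaround (a sketch, assuming the Spacy 2.x behavior shown above) is to fall back to the original token text for pronouns:

import spacy

nlp = spacy.load('en')
doc = nlp("We are putting in efforts to enhance our understanding of Lemmatization")
# Use the surface form whenever Spacy returns the -PRON- placeholder
" ".join([token.text if token.lemma_ == '-PRON-' else token.lemma_ for token in doc])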

Spacy supports multiple languages other than English. You can learn what they are at https://spacy.io/usage/models.

Stopword removal

From time to time in the previous sections, a technique called stopword removal was mentioned. We will finally look at the technique in detail here.

What are stopwords?

Stopwords are words such as a, an, the, in, at, and so on that occur frequently in text corpora and do not carry a lot of information in most contexts. These words are generally required for the completion of sentences and making them grammatically sound. They are often the most common words in a language, can be filtered out in most NLP tasks, and consequently help in reducing the vocabulary or search space. There is no single, universally available list of stopwords, and what counts as a stopword varies by use case; however, curated lists of words are maintained for many languages that can be treated as language-specific stopwords, and these should be modified based on the problem being solved.

Let’s look at the stopwords available for English in the nltk library!

nltk.download('stopwords')
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))
", ".join(stop)

Here's the output:

"it's, yours, an, doing, any, mightn't, you, having, wasn't, themselves, just, over, below, needn't, a, this, shan't, them, isn't, was, wouldn't, as, only, his, or, shan, wouldn, don, where, own, were, he, out, do, it, am, won, isn, there, hers, to, ll, most, for, weren, have, by, while, the, re, that, down, haven, has, is, here, itself, all, didn, herself, shouldn, him, ve, who, doesn, m, hadn't, after, further, weren't, at, hadn, should've, too, because, can, now, same, more, she's, wasn, these, yourself, himself, being, very, until, myself, few, so, which, ourselves, they, t, you'd, did, o, aren, but, that'll, such, whom, of, s, you'll, those, doesn't, my, what, aren't, during, hasn, through, will, couldn, i, mustn, needn, mustn't, d, had, me, under, won't, haven't, its, with, when, their, between, if, once, against, before, on, not, you're, each, yourselves, in, and, are, shouldn't, some, nor, her, does, she, off, how, both, our, then, why, again, we, no, y, be, other, ma, from, up, theirs, couldn't, should, into, didn't, ours, about, ain, you've, don't, above, been, than, your, hasn't, mightn"

If you look closely, you'll notice that Wh- words such as who, what, when, why, how, which, where, and whom are part of this list of stopwords; however, in one of the previous sections, it was mentioned that these words are very significant in use cases such as question answering and question classification. Measures should be taken to ensure that these words are not filtered out when the text corpus undergoes stopword removal. Let's learn how this can be achieved by running through the following code block:

wh_words = ['who', 'what', 'when', 'why', 'how', 'which', 'where', 'whom']
stop = set(stopwords.words('english'))
sentence = "how are we putting in efforts to enhance our understanding of Lemmatization"
for word in wh_words:
    stop.remove(word)
sentence_after_stopword_removal = [token for token in sentence.split() if token not in stop]
" ".join(sentence_after_stopword_removal)

Here's the output:

'how putting efforts enhance understanding Lemmatization'

The preceding code snippet shows that the sentence how are we putting in efforts to enhance our understanding of Lemmatization gets modified to how putting efforts enhance understanding Lemmatization. The stopwords are, we, in, to, our, and of were removed from the sentence. Stopword removal is generally the first step that is taken after tokenization while building a vocabulary or preprocessing text data.

Case folding

Another strategy that helps with normalization is called case folding. As part of case folding, all the letters in the text corpus are converted to lowercase. The and the would be treated the same under case folding, whereas they would be treated as different tokens without it. This technique helps systems that deal with information retrieval, such as search engines.

Lamborghini, which is a proper noun, will be treated as lamborghini; whether the user typed Lamborghini or lamborghini would not make a difference, and the same results would be returned.

However, in situations where proper nouns are derived from common noun terms, case folding becomes a bottleneck, as case-based distinction is an important feature here. For instance, General Motors is composed of common noun terms but is itself a proper noun, so performing case folding here might cause issues. Another problem arises when acronyms are converted to lowercase, as there is a high chance that they will map to common nouns. A widely used example is CAT, which stands for Common Admission Test in India, getting converted to cat.

A potential solution to this is to build machine learning models that can use features from a sentence to determine which words or tokens in the sentence should be lowercase and which shouldn't be; however, this approach doesn't always help when users mostly type in lowercase. As a result, lowercasing everything becomes a wise solution.

Language is also a major factor here; in some languages, such as English, capitalization patterns within a text carry a lot of information, whereas in other languages, case might not be as important.

The following code snippet shows a very straightforward approach that would convert all letters in a sentence to lowercase, making use of the lower() method available in Python:

s = "We are putting in efforts to enhance our understanding of Lemmatization"
s = s.lower()
s

Here's the output:

'we are putting in efforts to enhance our understanding of lemmatization'

N-grams

Until now, we have focused on tokens of size 1, which means only one word. Sentences generally contain names of people and places and other open compound terms, such as living room and coffee mug. These phrases convey a specific meaning when two or more words are used together. When used individually, they carry a different meaning altogether and the inherent meaning behind the compound terms is somewhat lost. The usage of multiple tokens to represent such inherent meaning can be highly beneficial for the NLP tasks being performed. Even though such occurrences are rare, they still carry a lot of information. Techniques should be employed to make sense of these as well.

In general, these are grouped under the umbrella term of n-grams. When n is equal to 1, they are termed unigrams. Bigrams, or 2-grams, refer to pairs of words, such as dinner table. Phrases such as the United Arab Emirates, comprising three words, are termed trigrams or 3-grams. This naming system can be extended to larger n-grams, but most NLP tasks use only trigrams or lower.

Let’s understand how this works for the following sentence:

Natural Language Processing is the way to go

The phrase Natural Language Processing carries an inherent meaning that would be lost if each of the words in the phrase is processed individually; however, when we use trigrams, these phrases can be extracted together and the meaning gets captured. In general, all NLP tasks make use of unigrams, bigrams, and trigrams together to capture all the information.

The following code illustrates an example of capturing bigrams:

from nltk.util import ngrams
s = "Natural Language Processing is the way to go"
tokens = s.split()
bigrams = list(ngrams(tokens, 2))
[" ".join(token) for token in bigrams]

The output shows the list of bigrams that we captured:

['Natural Language',
 'Language Processing',
 'Processing is',
 'is the',
 'the way',
 'way to',
 'to go']

Let's try and capture trigrams from the same sentence using the following code:

s = "Natural Language Processing is the way to go"
tokens = s.split()
trigrams = list(ngrams(tokens, 3))
[" ".join(token) for token in trigrams]

The output shows the trigrams that were captured from the sentence:

['Natural Language Processing',
 'Language Processing is',
 'Processing is the',
 'is the way',
 'the way to',
 'way to go']
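Since most pipelines use unigrams, bigrams, and trigrams together, they can also be collected in one go with the same ngrams helper; the following is a small, illustrative sketch:

from nltk.util import ngrams

s = "Natural Language Processing is the way to go"
tokens = s.split()
# Collect unigrams, bigrams, and trigrams into a single list
all_grams = [" ".join(gram) for n in range(1, 4) for gram in ngrams(tokens, n)]
all_grams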

Taking care of HTML tags

Often, data is scraped from online websites for information retrieval. Since these are mostly HTML pages, there needs to be some preprocessing to remove the HTML tags. HTML tags are mostly noise; however, sometimes they can also carry specific information. Let's think of a use case where a website such as Amazon uses specific tags for identifying features of a product—for example, a <price> tag can be custom created to carry price entries for products. In such scenarios, HTML can be highly useful; however, they are noise for most NLP data.

How do we get rid of them?

BeautifulSoup is an amazing library that helps us with handling such data. The following code snippet shows an example of how this can be achieved:

html = "<!DOCTYPE html><html><body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>"
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()
print(text)

Here's the output:

My First HeadingMy first paragraph.

How does all this fit into my NLP pipeline?

The steps we discussed should be performed as part of preprocessing the text corpora before applying any algorithms to the data; however, which steps to apply and which to ignore depend on the use case.
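As one illustrative way of stitching these steps together (the preprocess helper below is hypothetical, and the exact choice and order of steps depends on your use case), here is a minimal sketch that tokenizes, case folds, removes stopwords, and lemmatizes a sentence, assuming the nltk resources downloaded earlier in this chapter:

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer(preserve_case=False)   # tokenization plus case folding
lemmatizer = WordNetLemmatizer()
stop = set(stopwords.words('english'))

def preprocess(text):
    tokens = tokenizer.tokenize(text)                           # break the text into tokens
    tokens = [token for token in tokens if token not in stop]   # stopword removal
    return [lemmatizer.lemmatize(token) for token in tokens]    # lemmatization

preprocess("We are putting in efforts to enhance our understanding of Lemmatization")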

These tokens can also be put together after the necessary preprocessing steps that we looked at previously to form the vocabulary. A simple example of this can be seen in the following code:

s = "Natural Language Processing is the way to go"
tokens = set(s.split())
vocabulary = sorted(tokens)
vocabulary

Here's the output:

['Language', 'Natural', 'Processing', 'go', 'is', 'the', 'to', 'way']

Summary

In this chapter, we looked at the various steps that are needed to build a natural language vocabulary. These play the most critical role in preprocessing any natural language data. Data preprocessing is probably one of the most important aspects of any machine learning application, and the same applies to NLP as well. When performed properly, these steps help with the machine learning aspects that generally occur after preprocessing the data, consequently providing better results most of the time compared with scenarios where no preprocessing is involved.

In the next chapter, we will use the techniques discussed in this chapter to preprocess data and subsequently build mathematical representations of text that can be understood by machine learning algorithms.
