12.2 TextBlob

TextBlob is an object-oriented NLP text-processing library that is built on the NLTK and pattern NLP libraries and simplifies many of their capabilities. Some of the NLP tasks TextBlob can perform include:

  • Tokenization—splitting text into pieces called tokens, which are meaningful units, such as words and numbers.

  • Parts-of-speech (POS) tagging—identifying each word’s part of speech, such as noun, verb, adjective, etc.

  • Noun phrase extraction—locating groups of words that represent nouns, such as “red brick factory.”

  • Sentiment analysis—determining whether text has positive, neutral or negative sentiment.

  • Inter-language translation and language detection powered by Google Translate.

  • Inflection—pluralizing and singularizing words. There are other aspects of inflection that are not part of TextBlob.

  • Spell checking and spelling correction.

  • Stemming—reducing words to their stems by removing prefixes or suffixes. For example, the stem of “varieties” is “varieti.”

  • Lemmatization—like stemming, but produces real words based on the original words’ context. For example, the lemmatized form of “varieties” is “variety.”

  • Word frequencies—determining how often each word appears in a corpus.

  • WordNet integration for finding word definitions, synonyms and antonyms.

  • Stop word elimination—removing common words, such as a, an, the, I, we, you and more to analyze the important words in a corpus.

  • n-grams—producing sets of consecutive words in a corpus for use in identifying words that frequently appear adjacent to one another.

Many of these capabilities are used as part of more complex NLP tasks. In this section, we’ll perform these NLP tasks using TextBlob and NLTK.
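To make one of the listed tasks concrete, here is a minimal sketch of producing n-grams with plain Python. The ngrams function is our own hypothetical helper, not part of TextBlob or NLTK, and it assumes the text has already been split into words:

```python
def ngrams(words, n=3):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

words = 'Today is a beautiful day'.split()

# Each trigram overlaps the previous one by two words.
trigrams = ngrams(words, n=3)
print(trigrams)
```

If the text contains fewer than n words, the helper simply returns an empty list.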

Installing the TextBlob Module

To install TextBlob, open your Anaconda Prompt (Windows), Terminal (macOS/Linux) or shell (Linux), then execute the following command:

conda install -c conda-forge textblob

Windows users might need to run the Anaconda Prompt as an Administrator for proper software installation privileges. To do so, right-click Anaconda Prompt in the start menu and select More > Run as administrator.

Once installation completes, execute the following command to download the NLTK corpora used by TextBlob:

ipython -m textblob.download_corpora

These include:

  • The Brown Corpus (created at Brown University) for parts-of-speech tagging.

  • Punkt for English sentence tokenization.

  • WordNet for word definitions, synonyms and antonyms.

  • Averaged Perceptron Tagger for parts-of-speech tagging.

  • conll2000 for breaking text into components, like nouns, verbs, noun phrases and more—known as chunking the text. The name conll2000 is from the conference that created the chunking data—Conference on Computational Natural Language Learning.

  • Movie Reviews for sentiment analysis.

Project Gutenberg

A great source of text for analysis is the free e-books at Project Gutenberg:

https://www.gutenberg.org

The site contains over 57,000 e-books in various formats, including plain text files. These are out of copyright in the United States. For information about Project Gutenberg’s Terms of Use and copyright in other countries, see:

https://www.gutenberg.org/wiki/Gutenberg:Terms_of_Use

In some of this section’s examples, we use the plain-text e-book file for Shakespeare’s Romeo and Juliet, which you can find at:

https://www.gutenberg.org/ebooks/1513

Project Gutenberg does not allow programmatic access to its e-books. You’re required to copy the books for that purpose. To download Romeo and Juliet as a plain-text e-book, right-click the Plain Text UTF-8 link on the book’s web page, then select the Save Link As… (Chrome/Firefox), Download Linked File As… (Safari) or Save target as (Microsoft Edge) option to save the book to your system. Save it as RomeoAndJuliet.txt in the ch12 examples folder to ensure that our code examples will work correctly. For analysis purposes, we removed the Project Gutenberg text before "THE TRAGEDY OF ROMEO AND JULIET", as well as the Project Gutenberg information at the end of the file starting with:

End of the Project Gutenberg EBook of Romeo and Juliet,
by William Shakespeare
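If you’d rather trim the file in code than by hand, the following is one possible sketch. It assumes the two markers quoted above each appear in the text. The function name trim_gutenberg and the tiny sample string are hypothetical stand-ins for the real e-book:

```python
START = 'THE TRAGEDY OF ROMEO AND JULIET'
END = 'End of the Project Gutenberg EBook of Romeo and Juliet'

def trim_gutenberg(text, start=START, end=END):
    """Keep the text from the start marker up to (not including) the end marker."""
    begin = text.index(start)        # raises ValueError if the marker is missing
    finish = text.index(end, begin)  # search for the end marker after the start
    return text[begin:finish].rstrip()

# Hypothetical miniature "e-book" used only to demonstrate the slicing.
sample = ('License header...\n'
          'THE TRAGEDY OF ROMEO AND JULIET\n'
          'ACT I\n'
          'End of the Project Gutenberg EBook of Romeo and Juliet,\n'
          'by William Shakespeare\n')

play = trim_gutenberg(sample)
print(play)
```

For the real file, you would read RomeoAndJuliet.txt into a string first and pass that string to the function.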

Self Check

  1. (Fill-In) TextBlob is an object-oriented NLP text-processing library built on the       and       NLP libraries, and simplifies accessing their capabilities.
    Answer: NLTK, pattern.

12.2.1 Create a TextBlob

TextBlob is the fundamental class for NLP with the textblob module. Let’s create a TextBlob containing two sentences:

In [1]: from textblob import TextBlob

In [2]: text = 'Today is a beautiful day. Tomorrow looks like bad weather.'

In [3]: blob = TextBlob(text)

In [4]: blob
Out[4]: TextBlob("Today is a beautiful day. Tomorrow looks like bad
weather.")

TextBlobs—and, as you’ll see shortly, Sentences and Words—support string methods and can be compared with strings. They also provide methods for various NLP tasks. Sentences, Words and TextBlobs inherit from BaseBlob, so they have many common methods and properties.

[Note: We use snippet [3]’s TextBlob in several of the following Self Checks and sub-sections, in which we continue the previous interactive session.]

Self Check

  1. (Fill-In)       is the fundamental class for NLP with the textblob module.
    Answer: TextBlob.

  2. (True/False) TextBlobs support string methods and can be compared with strings using the comparison operators.
    Answer: True.

  3. (IPython Session) Create a TextBlob named exercise_blob containing 'This is a TextBlob'.
    Answer:

    In [5]: exercise_blob = TextBlob('This is a TextBlob')
    
    In [6]: exercise_blob
    Out[6]: TextBlob("This is a TextBlob")
    

12.2.2 Tokenizing Text into Sentences and Words

Natural language processing often requires tokenizing text before performing other NLP tasks. TextBlob provides convenient properties for accessing the sentences and words in TextBlobs. Let’s use the sentences property to get a list of Sentence objects:

In [7]: blob.sentences
Out[7]:
[Sentence("Today is a beautiful day."),
 Sentence("Tomorrow looks like bad weather.")]

The words property returns a WordList object containing a list of Word objects, representing each word in the TextBlob with the punctuation removed:

In [8]: blob.words
Out[8]: WordList(['Today', 'is', 'a', 'beautiful', 'day', 'Tomorrow',
'looks', 'like', 'bad', 'weather'])
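To see roughly what tokenization involves, here is a naive sketch that uses only regular expressions from the Python Standard Library. It is not how TextBlob tokenizes (the library relies on NLTK tokenizers such as Punkt), and this simple pattern would split incorrectly on abbreviations like “Dr.” that end with a period:

```python
import re

text = 'Today is a beautiful day. Tomorrow looks like bad weather.'

# Split at whitespace that follows sentence-ending punctuation.
sentences = re.split(r'(?<=[.!?])\s+', text.strip())

# Treat runs of letters, digits and apostrophes as words; punctuation is dropped.
words = re.findall(r"[A-Za-z0-9']+", text)

print(sentences)
print(words)
```

For this two-sentence text, the sketch produces the same sentence and word boundaries shown in snippets [7] and [8], though as plain strings rather than Sentence and Word objects.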

Self Check

  1. (IPython Session) Create a TextBlob with two sentences, then tokenize it into Sentences and Words, displaying all the tokens.
    Answer:

    In [9]: ex = TextBlob('My old computer is slow. My new one is fast.')
    
    In [10]: ex.sentences
    Out[10]: [Sentence("My old computer is slow."), Sentence("My new one is
    fast.")]
    
    In [11]: ex.words
    Out[11]: WordList(['My', 'old', 'computer', 'is', 'slow', 'My', 'new',
    'one', 'is', 'fast'])
    

12.2.3 Parts-of-Speech Tagging

Parts-of-speech (POS) tagging is the process of evaluating words based on their context to determine each word’s part of speech. There are eight primary English parts of speech—nouns, pronouns, verbs, adjectives, adverbs, prepositions, conjunctions and interjections (words that express emotion and that are typically followed by punctuation, like “Yes!” or “Ha!”). Within each category there are many subcategories.

Some words have multiple meanings. For example, the words “set” and “run” have hundreds of meanings each! If you look at the dictionary.com definitions of the word “run,” you’ll see that it can be a verb, a noun, an adjective or a part of a verb phrase. An important use of POS tagging is determining a word’s meaning among its possibly many meanings. This is important for helping computers “understand” natural language.

The tags property returns a list of tuples, each containing a word and a string representing its part-of-speech tag:

In [12]: blob
Out[12]: TextBlob("Today is a beautiful day. Tomorrow looks like bad
weather.")

In [13]: blob.tags
Out[13]:
[('Today', 'NN'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('beautiful', 'JJ'),
 ('day', 'NN'),
 ('Tomorrow', 'NNP'),
 ('looks', 'VBZ'),
 ('like', 'IN'),
 ('bad', 'JJ'),
 ('weather', 'NN')]

By default, TextBlob uses a PatternTagger to determine parts-of-speech. This class uses the parts-of-speech tagging capabilities of the pattern library:

https://www.clips.uantwerpen.be/pattern

You can view the library’s 63 parts-of-speech tags at

https://www.clips.uantwerpen.be/pages/MBSP-tags

In the preceding snippet’s output:

  • Today, day and weather are tagged as NN—a singular noun or mass noun.

  • is and looks are tagged as VBZ—a third person singular present verb.

  • a is tagged as DT—a determiner.

  • beautiful and bad are tagged as JJ—an adjective.

  • Tomorrow is tagged as NNP—a proper singular noun.

  • like is tagged as IN—a subordinating conjunction or preposition.
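Because tags is an ordinary list of tuples, you can process it with standard Python tools. The following sketch groups the words by tag. The tags list is copied from snippet [13] so the code runs without TextBlob installed, and the descriptions dictionary is our own, built from the bullets above:

```python
from collections import defaultdict

# Copied from snippet [13]'s output.
tags = [('Today', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('beautiful', 'JJ'),
        ('day', 'NN'), ('Tomorrow', 'NNP'), ('looks', 'VBZ'), ('like', 'IN'),
        ('bad', 'JJ'), ('weather', 'NN')]

# Descriptions taken from the preceding list.
descriptions = {
    'NN': 'singular noun or mass noun',
    'VBZ': 'third person singular present verb',
    'DT': 'determiner',
    'JJ': 'adjective',
    'NNP': 'proper singular noun',
    'IN': 'subordinating conjunction or preposition',
}

# Collect the words that share each tag, preserving their order in the text.
words_by_tag = defaultdict(list)
for word, tag in tags:
    words_by_tag[tag].append(word)

for tag, words in words_by_tag.items():
    print(f'{tag} ({descriptions[tag]}): {", ".join(words)}')
```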

Self Check

  1. (Fill-In)       is the process of evaluating words based on their context to determine each word’s part of speech.
    Answer: Parts-of-speech (POS) tagging.

  2. (IPython Session) Display the parts-of-speech tags for the sentence, 'My dog is cute'.
    Answer:

    In [14]: TextBlob('My dog is cute').tags
    Out[14]: [('My', 'PRP$'), ('dog', 'NN'), ('is', 'VBZ'), ('cute', 'JJ')]
    

In the preceding output, the POS tag PRP$ indicates a possessive pronoun.

12.2.4 Extracting Noun Phrases

Let’s say you’re preparing to purchase a water ski so you’re researching them online. You might search for “best water ski.” In this case, “water ski” is a noun phrase. If the search engine does not parse the noun phrase properly, you probably will not get the best search results. Go online and try searching for “best water,” “best ski” and “best water ski” and see what you get.

A TextBlob’s noun_phrases property returns a WordList object containing a list of Word objects—one for each noun phrase in the text:

In [15]: blob
Out[15]: TextBlob("Today is a beautiful day. Tomorrow looks like bad
weather.")

In [16]: blob.noun_phrases
Out[16]: WordList(['beautiful day', 'tomorrow', 'bad weather'])

Note that a Word representing a noun phrase can contain multiple words. A WordList is an extension of Python’s built-in list type. WordLists provide additional methods for stemming, lemmatizing, singularizing and pluralizing.

Self Check

  1. (IPython Session) Show the noun phrase(s) in the sentence, 'The red brick factory is for sale'.
    Answer:

    In [17]: TextBlob('The red brick factory is for sale').noun_phrases
    Out[17]: WordList(['red brick factory'])
    

12.2.5 Sentiment Analysis with TextBlob’s Default Sentiment Analyzer

One of the most common and valuable NLP tasks is sentiment analysis, which determines whether text is positive, neutral or negative. For instance, companies might use this to determine whether people are speaking positively or negatively online about their products. Consider the positive word “good” and the negative word “bad.” Just because a sentence contains “good” or “bad” does not mean the sentence’s sentiment necessarily is positive or negative. For example, the sentence

The food is not good.

clearly has negative sentiment. Similarly, the sentence

The movie was not bad.

clearly has positive sentiment, though perhaps not as positive as something like

The movie was excellent!

Sentiment analysis is a complex machine-learning problem. However, libraries like TextBlob have pretrained machine learning models for performing sentiment analysis.

Getting the Sentiment of a TextBlob

A TextBlob’s sentiment property returns a Sentiment object indicating whether the text is positive or negative and whether it’s objective or subjective:

In [18]: blob
Out[18]: TextBlob("Today is a beautiful day. Tomorrow looks like bad
weather.")

In [19]: blob.sentiment
Out[19]: Sentiment(polarity=0.07500000000000007,
subjectivity=0.8333333333333333)

In the preceding output, the polarity indicates sentiment with a value from -1.0 (negative) to 1.0 (positive) with 0.0 being neutral. The subjectivity is a value from 0.0 (objective) to 1.0 (subjective). Based on the values for our TextBlob, the overall sentiment is close to neutral, and the text is mostly subjective.

Getting the polarity and subjectivity from the Sentiment Object

The values displayed above probably provide more precision than you need in most cases. This can detract from numeric output’s readability. The IPython magic %precision allows you to specify the default precision for standalone float objects and float objects in built-in types like lists, dictionaries and tuples. Let’s use the magic to round the polarity and subjectivity values to three digits to the right of the decimal point:

In [20]: %precision 3
Out[20]: '%.3f'

In [21]: blob.sentiment.polarity
Out[21]: 0.075

In [22]: blob.sentiment.subjectivity
Out[22]: 0.833
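The %precision magic is available only in IPython. In a regular Python script, a format specifier or the built-in round function gives a similar result. In this sketch, the two values are copied from snippet [19] rather than computed:

```python
# Values copied from snippet [19]'s output.
polarity = 0.07500000000000007
subjectivity = 0.8333333333333333

# The .3f format specifier rounds to three digits when building a string.
formatted = f'{polarity:.3f}, {subjectivity:.3f}'

# round returns a float rather than a string.
rounded = round(subjectivity, 3)

print(formatted)
print(rounded)
```

Formatting changes only how a value is displayed, whereas round creates a new, less precise value, so formatting is usually the safer choice when you still need the original number for later calculations.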

Getting the Sentiment of a Sentence

You also can get the sentiment at the individual sentence level. Let’s use the sentences property to get a list of Sentence objects, then iterate through them and display each Sentence’s sentiment property:

In [23]: for sentence in blob.sentences:
    ...:     print(sentence.sentiment)
    ...:
Sentiment(polarity=0.85, subjectivity=1.0)
Sentiment(polarity=-0.6999999999999998, subjectivity=0.6666666666666666)

This might explain why the entire TextBlob’s sentiment is close to 0.0 (neutral)—one sentence is positive (0.85) and the other negative (-0.6999999999999998).
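A quick calculation supports this. For these two sentences, averaging the per-sentence values gives the overall values from snippet [19]. The numbers below are copied from the outputs above. This is an observation about this particular example, not a description of how the PatternAnalyzer combines scores for longer texts:

```python
# Per-sentence values copied from snippet [23]'s output.
sentence_polarities = [0.85, -0.6999999999999998]
sentence_subjectivities = [1.0, 0.6666666666666666]

# Simple arithmetic means of the two sentences' values.
mean_polarity = sum(sentence_polarities) / len(sentence_polarities)
mean_subjectivity = sum(sentence_subjectivities) / len(sentence_subjectivities)

print(f'{mean_polarity:.3f}')      # compare with the overall polarity
print(f'{mean_subjectivity:.3f}')  # compare with the overall subjectivity
```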

Self Check

  1. (IPython Session) Import Sentence from the textblob module, then create Sentence objects to check the sentiment of the three sentences used in this section’s introduction.
    Answer: Snippet [25]’s output shows that the sentence’s sentiment is somewhat negative (due to “not good”). Snippet [26]’s output shows that the sentence’s sentiment is somewhat positive (due to “not bad”). Snippet [27]’s output shows that the sentence’s sentiment is totally positive (due to “excellent”). The outputs indicate that all three sentences are subjective, with the last being perfectly positive and subjective.

    In [24]: from textblob import Sentence
    
    In [25]: Sentence('The food is not good.').sentiment
    Out[25]: Sentiment(polarity=-0.35, subjectivity=0.6000000000000001)
    
    In [26]: Sentence('The movie was not bad.').sentiment
    Out[26]: Sentiment(polarity=0.3499999999999999,
    subjectivity=0.6666666666666666)
    
    In [27]: Sentence('The movie was excellent!').sentiment
    Out[27]: Sentiment(polarity=1.0, subjectivity=1.0)
    

12.2.6 Sentiment Analysis with the NaiveBayesAnalyzer

By default, a TextBlob and the Sentences and Words you get from it determine sentiment using a PatternAnalyzer, which uses the same sentiment analysis techniques as in the Pattern library. The TextBlob library also comes with a NaiveBayesAnalyzer (module textblob.sentiments), which was trained on a database of movie reviews. Naive Bayes is a commonly used machine learning text-classification algorithm. The following uses the analyzer keyword argument to specify a TextBlob’s sentiment analyzer. Recall from earlier in this ongoing IPython session that text contains 'Today is a beautiful day. Tomorrow looks like bad weather.':

In [28]: from textblob.sentiments import NaiveBayesAnalyzer

In [29]: blob = TextBlob(text, analyzer=NaiveBayesAnalyzer())

In [30]: blob
Out[30]: TextBlob("Today is a beautiful day. Tomorrow looks like bad
weather.")

Let’s use the TextBlob’s sentiment property to display the text’s sentiment using the NaiveBayesAnalyzer:

In [31]: blob.sentiment
Out[31]: Sentiment(classification='neg', p_pos=0.47662917962091056,
p_neg=0.5233708203790892)

In this case, the overall sentiment is classified as negative (classification='neg'). The Sentiment object’s p_pos indicates that the TextBlob is 47.66% positive, and its p_neg indicates that the TextBlob is 52.34% negative. Since the overall sentiment is only slightly more negative, we’d probably view this TextBlob’s sentiment as neutral overall.

Now, let’s get the sentiment of each Sentence:

In [32]: for sentence in blob.sentences:
    ...:     print(sentence.sentiment)
    ...:
Sentiment(classification='pos', p_pos=0.8117563121751951,
p_neg=0.18824368782480477)
Sentiment(classification='neg', p_pos=0.174363226578349,
p_neg=0.8256367734216521)

Notice that rather than polarity and subjectivity, the Sentiment objects we get from the NaiveBayesAnalyzer contain a classification, either 'pos' (positive) or 'neg' (negative), along with p_pos (percentage positive) and p_neg (percentage negative) values from 0.0 to 1.0. Once again, we see that the first sentence is positive and the second is negative.
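To give a feel for how a Naive Bayes classifier arrives at values like p_pos and p_neg, here is a toy sketch that uses only the Python Standard Library. It is not the NaiveBayesAnalyzer’s implementation. The four training sentences and the classify function are hypothetical, and a real analyzer trains on thousands of reviews:

```python
import math
from collections import Counter

# Hypothetical, tiny training set of labeled texts.
training = [
    ('a beautiful and excellent movie', 'pos'),
    ('good acting and a good story', 'pos'),
    ('a bad and boring movie', 'neg'),
    ('terrible acting and a bad story', 'neg'),
]

# Count how often each word appears under each label.
word_counts = {'pos': Counter(), 'neg': Counter()}
doc_counts = Counter()
for text, label in training:
    word_counts[label].update(text.split())
    doc_counts[label] += 1

vocabulary = set(word_counts['pos']) | set(word_counts['neg'])

def classify(text):
    """Return (label, p_pos, p_neg) using Naive Bayes with add-one smoothing."""
    log_scores = {}
    for label in ('pos', 'neg'):
        total = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / len(training))  # prior probability
        for word in text.lower().split():
            if word in vocabulary:  # ignore words never seen in training
                smoothed = word_counts[label][word] + 1
                score += math.log(smoothed / (total + len(vocabulary)))
        log_scores[label] = score

    # Convert the log scores into probabilities that sum to 1.0.
    highest = max(log_scores.values())
    weights = {label: math.exp(s - highest) for label, s in log_scores.items()}
    norm = sum(weights.values())
    p_pos, p_neg = weights['pos'] / norm, weights['neg'] / norm
    return ('pos' if p_pos >= p_neg else 'neg'), p_pos, p_neg

print(classify('a beautiful day'))
print(classify('bad weather'))
```

The “naive” part is the assumption that each word contributes independently, which is why the per-word probabilities can simply be multiplied (or, as here, their logarithms added).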

Self Check

  1. (IPython Session) Check the sentiment of the sentence 'The movie was excellent!' using the NaiveBayesAnalyzer.
    Answer:

    In [33]: text = ('The movie was excellent!')
    
    In [34]: exblob = TextBlob(text, analyzer=NaiveBayesAnalyzer())
    
    In [35]: exblob.sentiment
    Out[35]: Sentiment(classification='pos', p_pos=0.7318278242290406,
    p_neg=0.26817217577095936)
    

12.2.7 Language Detection and Translation

Inter-language translation is a challenging problem in natural language processing and artificial intelligence. With advances in machine learning, artificial intelligence and natural language processing, services like Google Translate (100+ languages) and Microsoft Bing Translator (60+ languages) can translate between languages instantly.

Inter-language translation also is great for people traveling to foreign countries. They can use translation apps to translate menus, road signs and more. There are even efforts at live speech translation so that you’ll be able to converse in real time with people who do not know your natural language. Some smartphones can now work together with in-ear headphones to provide near-live translation of many languages. In the “IBM Watson and Cognitive Computing” chapter, we develop a script that does near real-time inter-language translation among languages supported by Watson.

The TextBlob library uses Google Translate to detect a text’s language and translate TextBlobs, Sentences and Words into other languages. Let’s use the detect_language method to detect the language of the text we’re manipulating ('en' is English):

In [36]: blob
Out[36]: TextBlob("Today is a beautiful day. Tomorrow looks like bad
weather.")

In [37]: blob.detect_language()
Out[37]: 'en'

Next, let’s use the translate method to translate the text to Spanish ('es') then detect the language on the result. The to keyword argument specifies the target language.

In [38]: spanish = blob.translate(to='es')

In [39]: spanish
Out[39]: TextBlob("Hoy es un hermoso dia. Mañana parece mal tiempo.")

In [40]: spanish.detect_language()
Out[40]: 'es'

Next, let’s translate our TextBlob to simplified Chinese (specified as 'zh' or 'zh-CN') then detect the language on the result:

In [41]: chinese = blob.translate(to='zh')

In [42]: chinese
Out[42]: TextBlob("chinese letters")

In [43]: chinese.detect_language()
Out[43]: 'zh-CN'

Method detect_language’s output always shows simplified Chinese as 'zh-CN', even though the translate method can receive simplified Chinese as 'zh' or 'zh-CN'.

In each of the preceding cases, Google Translate automatically detects the source language. You can specify a source language explicitly by passing the from_lang keyword argument to the translate method, as in

chinese = blob.translate(from_lang='en', to='zh')

Google Translate uses ISO 639-1 language codes listed at

https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes

For the supported languages, you’d use these codes as the values of the from_lang and to keyword arguments. Google Translate’s list of supported languages is at:

https://cloud.google.com/translate/docs/languages

Calling translate without arguments translates from the detected source language to English:

In [44]: spanish.translate()
Out[44]: TextBlob("Today is a beautiful day. Tomorrow seems like bad
weather.")

In [45]: chinese.translate()
Out[45]: TextBlob("Today is a beautiful day. Tomorrow looks like bad
weather.")

Note the slight difference in the English results.

Self Check

  1. (IPython Session) Translate 'Today is a beautiful day.' into French, then detect the language.
    Answer:

    In [46]: blob = TextBlob('Today is a beautiful day.')
    
    In [47]: french = blob.translate(to='fr')
    
    In [48]: french
    Out[48]: TextBlob("Aujourd'hui est un beau jour.")
    
    In [49]: french.detect_language()
    Out[49]: 'fr'
    

12.2.8 Inflection: Pluralization and Singularization

Inflections are different forms of the same words, such as singular and plural (like “person” and “people”) and different verb tenses (like “run” and “ran”). When you’re calculating word frequencies, you might first want to convert all inflected words to the same form for more accurate word frequencies. Words and WordLists each support converting words to their singular or plural forms. Let’s pluralize and singularize a couple of Word objects:

In [1]: from textblob import Word

In [2]: index = Word('index')

In [3]: index.pluralize()
Out[3]: 'indices'

In [4]: cacti = Word('cacti')

In [5]: cacti.singularize()
Out[5]: 'cactus'

Pluralizing and singularizing are sophisticated tasks which, as you can see above, are not as simple as adding or removing an “s” or “es” at the end of a word.

You can do the same with a WordList:

In [6]: from textblob import TextBlob

In [7]: animals = TextBlob('dog cat fish bird').words

In [8]: animals.pluralize()
Out[8]: WordList(['dogs', 'cats', 'fish', 'birds'])

Note that the word “fish” is the same in both its singular and plural forms.
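To see why simple suffix rules are not enough, consider this hypothetical naive_pluralize function. It is our own sketch, not TextBlob’s algorithm. A few spelling rules handle regular nouns, but words like “index” and “fish” need an explicit table of exceptions, and a real implementation needs far more rules and exceptions than shown here:

```python
# Hypothetical exception table; real libraries need many more entries.
IRREGULAR = {'index': 'indices', 'cactus': 'cacti', 'child': 'children',
             'focus': 'foci', 'fish': 'fish'}

def naive_pluralize(word):
    """Pluralize a lowercase English noun using a few simple rules."""
    if word in IRREGULAR:                             # exceptions come first
        return IRREGULAR[word]
    if word.endswith(('s', 'x', 'z', 'ch', 'sh')):    # box -> boxes
        return word + 'es'
    if word.endswith('y') and len(word) > 1 and word[-2] not in 'aeiou':
        return word[:-1] + 'ies'                      # variety -> varieties
    return word + 's'                                 # dog -> dogs

plurals = [naive_pluralize(w) for w in ['dog', 'cat', 'fish', 'bird', 'index']]
print(plurals)
```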

Self Check

  1. (IPython Session) Singularize the word 'children' and pluralize 'focus'.
    Answer:

    In [1]: from textblob import Word
    
    In [2]: Word('children').singularize()
    Out[2]: 'child'
    
    In [3]: Word('focus').pluralize()
    Out[3]: 'foci'
    

12.2.9 Spell Checking and Correction

For natural language processing tasks, it’s important that the text be free of spelling errors. Software packages for writing and editing text, like Microsoft Word, Google Docs and others automatically check your spelling as you type and typically display a red line under misspelled words. Other tools enable you to manually invoke a spelling checker.

You can check a Word’s spelling with its spellcheck method, which returns a list of tuples containing possible correct spellings and a confidence value. Let’s assume we meant to type the word “they” but we misspelled it as “theyr.” The spell checking results show two possible corrections with the word 'they' having the highest confidence value:

In [1]: from textblob import Word

In [2]: word = Word('theyr')

In [3]: %precision 2
Out[3]: '%.2f'

In [4]: word.spellcheck()
Out[4]: [('they', 0.57), ('their', 0.43)]

Note that the word with the highest confidence value might not be the correct word for the given context.

TextBlobs, Sentences and Words all have a correct method that you can call to correct spelling. Calling correct on a Word returns the correctly spelled word that has the highest confidence value (as returned by spellcheck):

In [5]: word.correct() # chooses word with the highest confidence value
Out[5]: 'they'

Calling correct on a TextBlob or Sentence checks the spelling of each word. For each incorrect word, correct replaces it with the correctly spelled one that has the highest confidence value:

In [6]: from textblob import TextBlob

In [7]: sentence = TextBlob('Ths sentense has missplled wrds.')

In [8]: sentence.correct()
Out[8]: TextBlob("The sentence has misspelled words.")
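The idea of ranking candidate corrections by confidence can be sketched with the Python Standard Library. The following toy checker generates every string that is one edit away from the misspelled word, keeps those found in a word list, and weights them by how common they are. The word_counts values are hypothetical, and the spellcheck function here is our own simplified stand-in, not TextBlob’s implementation:

```python
import string

# Hypothetical word counts; a real checker derives these from a large corpus.
word_counts = {'they': 57, 'their': 43, 'the': 500, 'there': 30}

def edits1(word):
    """Return every string one delete, transpose, replace or insert away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def spellcheck(word):
    """Return (candidate, confidence) pairs, most likely candidate first."""
    if word in word_counts:          # already a known word
        return [(word, 1.0)]
    candidates = [w for w in edits1(word) if w in word_counts]
    if not candidates:               # sketch's choice: give up with 0.0
        return [(word, 0.0)]
    total = sum(word_counts[w] for w in candidates)
    ranked = sorted(candidates, key=word_counts.get, reverse=True)
    return [(w, word_counts[w] / total) for w in ranked]

print(spellcheck('theyr'))
```

Note that 'the' and 'there' are not suggested for 'theyr', even though 'the' is the most common word in the table, because each is more than one edit away.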

Self Check

  1. (True/False) You can check a Word’s spelling with its correct method, which returns a list of tuples containing possible correct spellings and a confidence value.
    Answer: False. You can check a Word’s spelling with its spellcheck method, which returns a list of tuples containing potential correct spellings and a confidence value.

  2. (IPython Session) Correct the spelling in 'I canot beleive I misspeled thees werds'.
    Answer:

    In [1]: from textblob import TextBlob
    
    In [2]: sentence = TextBlob('I canot beleive I misspeled thees werds')
    
    In [3]: sentence.correct()
    Out[3]: TextBlob("I cannot believe I misspelled these words")
    

12.2.10 Normalization: Stemming and Lemmatization

Stemming removes a prefix or suffix from a word leaving only a stem, which may or may not be a real word. Lemmatization is similar, but factors in the word’s part of speech and meaning and results in a real word.

Stemming and lemmatization are normalization operations, in which you prepare words for analysis. For example, before calculating statistics on words in a body of text, you might convert all words to lowercase so that capitalized and lowercase words are not treated differently. Sometimes, you might want to use a word’s root to represent the word’s many forms. For example, in a given application, you might want to treat all of the following words as “program”: program, programs, programmer, programming and programmed (and perhaps U.K. English spellings, like programmes as well).

Words and WordLists each support stemming and lemmatization via the methods stem and lemmatize. Let’s use both on a Word:

In [1]: from textblob import Word

In [2]: word = Word('varieties')

In [3]: word.stem()
Out[3]: 'varieti'

In [4]: word.lemmatize()
Out[4]: 'variety'
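The difference between the two results can be illustrated with a small sketch. The first function applies only the first group of suffix rules from the Porter stemming algorithm (sses to ss, ies to i, ss unchanged, s removed), which is enough to turn “varieties” into “varieti.” The second uses a hypothetical two-entry dictionary as a stand-in for the WordNet lookup that real lemmatization performs. Neither is a complete implementation:

```python
def porter_step_1a(word):
    """Apply only the first rule group of the Porter stemmer."""
    if word.endswith('sses'):
        return word[:-2]   # caresses -> caress
    if word.endswith('ies'):
        return word[:-2]   # varieties -> varieti
    if word.endswith('ss'):
        return word        # caress -> caress
    if word.endswith('s'):
        return word[:-1]   # cats -> cat
    return word

# Hypothetical lookup table standing in for a dictionary-based lemmatizer.
LEMMAS = {'varieties': 'variety', 'strawberries': 'strawberry'}

def toy_lemmatize(word):
    """Return the dictionary form if known; otherwise return the word as is."""
    return LEMMAS.get(word, word)

for word in ['varieties', 'strawberries']:
    print(word, porter_step_1a(word), toy_lemmatize(word))
```

The stemmer needs no vocabulary, which makes it fast but lets it produce non-words. The lemmatizer can return real words only because it consults a dictionary.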

Self Check

  1. (True/False) Stemming is similar to lemmatization, but factors in the word’s part of speech and meaning and results in a real word.
    Answer: False. Lemmatization is similar to stemming, but factors in the word’s part of speech and meaning and results in a real word.

  2. (IPython Session) Stem and lemmatize the word 'strawberries'.
    Answer:

    In [1]: from textblob import Word
    
    In [2]: word = Word('strawberries')
    
    In [3]: word.stem()
    Out[3]: 'strawberri'
    
    In [4]: word.lemmatize()
    Out[4]: 'strawberry'
    

12.2.11 Word Frequencies

Various techniques for detecting similarity between documents rely on word frequencies. As you’ll see here, TextBlob automatically counts word frequencies. First, let’s load the e-book for Shakespeare’s Romeo and Juliet into a TextBlob. To do so, we’ll use the Path class from the Python Standard Library’s pathlib module:

In [1]: from pathlib import Path

In [2]: from textblob import TextBlob

In [3]: blob = TextBlob(Path('RomeoAndJuliet.txt').read_text())

Use the file RomeoAndJuliet.txt that you downloaded earlier. We assume here that you started your IPython session from that folder. When you read a file with Path’s read_text method, it closes the file immediately after it finishes reading the file.

You can access the word frequencies through the TextBlob’s word_counts dictionary. Let’s get the counts of several words in the play:

In [4]: blob.word_counts['juliet']
Out[4]: 190

In [5]: blob.word_counts['romeo']
Out[5]: 315

In [6]: blob.word_counts['thou']
Out[6]: 278

If you already have tokenized a TextBlob into a WordList, you can count specific words in the list via the count method:

In [7]: blob.words.count('joy')
Out[7]: 14

In [8]: blob.noun_phrases.count('lady capulet')
Out[8]: 46
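Word counting of this kind can also be sketched with the Standard Library’s Counter class. This is an illustration of the idea, not TextBlob’s code. The one-line sample is a stand-in for the full play, and the sketch lowercases the text first, mirroring the lowercase keys used in snippets [4] through [6]:

```python
import re
from collections import Counter

# One line from the play, standing in for the full text.
text = 'O Romeo, Romeo! wherefore art thou Romeo?'

# Lowercase the text, then extract runs of letters and apostrophes as words.
words = re.findall(r"[a-z']+", text.lower())

# Counter maps each distinct word to its number of occurrences.
word_counts = Counter(words)

print(word_counts['romeo'])
print(word_counts['juliet'])       # missing keys count as 0
print(word_counts.most_common(2))
```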

Self Check

  1. (True/False) You can access the word frequencies through the TextBlob’s counts dictionary.
    Answer: False. You can access the word frequencies through the word_counts dictionary.

  2. (IPython Session) Using the TextBlob from this section’s IPython session, determine how many times the stop words “a,” “an” and “the” appear in Romeo and Juliet.
    Answer:

    In [9]: blob.word_counts['a']
    Out[9]: 483
    
    In [10]: blob.word_counts['an']
    Out[10]: 71
    
    In [11]: blob.word_counts['the']
    Out[11]: 688
    

12.2.12 Getting Definitions, Synonyms and Antonyms from WordNet

WordNet is a word database created by Princeton University. The TextBlob library uses the NLTK library’s WordNet interface, enabling you to look up word definitions, and get synonyms and antonyms. For more information, check out the NLTK WordNet interface documentation at:

https://www.nltk.org/api/nltk.corpus.reader.html#module-nltk.corpus.reader.wordnet

Getting Definitions

First, let’s create a Word:

In [1]: from textblob import Word

In [2]: happy = Word('happy')

The Word class’s definitions property returns a list of all the word’s definitions in the WordNet database:

In [3]: happy.definitions
Out[3]:
['enjoying or showing or marked by joy or pleasure',
 'marked by good fortune',
 'eagerly disposed to act or to be of service',
 'well expressed and to the point']

The database does not necessarily contain every dictionary definition of a given word. There’s also a define method that enables you to pass a part of speech as an argument so you can get definitions matching only that part of speech.

Getting Synonyms

You can get a Word’s synsets—that is, its sets of synonyms—via the synsets property. The result is a list of Synset objects:

In [4]: happy.synsets
Out[4]:
[Synset('happy.a.01'),
 Synset('felicitous.s.02'),
 Synset('glad.s.02'),
 Synset('happy.s.04')]

Each Synset represents a group of synonyms. In the notation happy.a.01:

  • happy is the original Word’s lemmatized form (in this case, it’s the same).

  • a is the part of speech, which can be a for adjective, n for noun, v for verb, r for adverb or s for adjective satellite. Many adjective synsets in WordNet have satellite synsets that represent similar adjectives.

  • 01 is the sense number. Many words have multiple meanings, and this number identifies the corresponding meaning in the WordNet database. Numbering starts at 01 for a word’s first sense.

There’s also a get_synsets method that enables you to pass a part of speech as an argument so you can get Synsets matching only that part of speech.
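The three-part notation is easy to take apart with string methods. In this sketch, parse_synset_name is our own hypothetical helper, not part of TextBlob or NLTK, and the part-of-speech names come from the list above:

```python
# Part-of-speech letters as described in the preceding list.
POS_NAMES = {'a': 'adjective', 'n': 'noun', 'v': 'verb',
             'r': 'adverb', 's': 'adjective satellite'}

def parse_synset_name(name):
    """Split a name like 'happy.a.01' into (lemma, part of speech, sense number)."""
    # rsplit from the right so a lemma containing periods stays intact.
    lemma, pos, sense = name.rsplit('.', 2)
    return lemma, POS_NAMES[pos], int(sense)

for name in ['happy.a.01', 'felicitous.s.02', 'glad.s.02', 'happy.s.04']:
    print(parse_synset_name(name))
```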

You can iterate through the synsets list to find the original word’s synonyms. Each Synset has a lemmas method that returns a list of Lemma objects representing the synonyms. A Lemma’s name method returns the synonymous word as a string. In the following code, for each Synset in the synsets list, the nested for loop iterates through that Synset’s Lemmas (if any). Then we add the synonym to the set named synonyms. We used a set collection because it automatically eliminates any duplicates we add to it:

In [5]: synonyms = set()

In [6]: for synset in happy.synsets:
   ...:     for lemma in synset.lemmas():
   ...:         synonyms.add(lemma.name())
   ...:

In [7]: synonyms
Out[7]: {'felicitous', 'glad', 'happy', 'well-chosen'}

Getting Antonyms

If the word represented by a Lemma has antonyms in the WordNet database, invoking the Lemma’s antonyms method returns a list of Lemmas representing the antonyms (or an empty list if there are no antonyms in the database). In snippet [4] you saw there were four Synsets for 'happy'. First, let’s get the Lemmas for the Synset at index 0 of the synsets list:

In [8]: lemmas = happy.synsets[0].lemmas()

In [9]: lemmas
Out[9]: [Lemma('happy.a.01.happy')]

In this case, lemmas returned a list of one Lemma element. We can now check whether the database has any corresponding antonyms for that Lemma:

In [10]: lemmas[0].antonyms()
Out[10]: [Lemma('unhappy.a.01.unhappy')]

The result is a list of Lemmas representing the antonym(s). Here, we see that the one antonym for 'happy' in the database is 'unhappy'.

Self Check

  1. (Fill-In) A(n)       represents synonyms of a given word.
    Answer: Synset.

  2. (IPython Session) Display the synsets and definitions for the word “boat.”
    Answer:

    In [1]: from textblob import Word
    
    In [2]: word = Word('boat')
    
    In [3]: word.synsets
    Out[3]: [Synset('boat.n.01'), Synset('gravy_boat.n.01'),
    Synset('boat.v.01')]
    
    In [4]: word.definitions
    Out[4]:
    ['a small vessel for travel on water',
    'a dish (often boat-shaped) for serving gravy or sauce',
    'ride in a boat on water']

    In this case, there were three Synsets, and the definitions property displayed the corresponding definitions.

12.2.13 Deleting Stop Words

Stop words are common words that are often removed from text before analyzing it because they typically do not provide useful information. The following table shows NLTK’s list of English stop words, which is returned by the NLTK stopwords object’s words method20 (which we’ll use momentarily):

[Table: NLTK’s English stop words list, a bracketed, comma-separated list of quoted words that includes 'in', "doesn't", 'themselves' and "wouldn't".]

The NLTK library has lists of stop words for several other natural languages as well. Before using NLTK’s stop-words lists, you must download them, which you do with the nltk module’s download function:

In [1]: import nltk

In [2]: nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data] C:\Users\PaulDeitel\AppData\Roaming\nltk_data...
[nltk_data] Unzipping corpora\stopwords.zip.
Out[2]: True

For this example, we’ll load the 'english' stop words list. First import stopwords from the nltk.corpus module, then call its words method to load the 'english' stop words list:

In [3]: from nltk.corpus import stopwords

In [4]: stops = stopwords.words('english')

Next, let’s create a TextBlob from which we’ll remove stop words:

In [5]: from textblob import TextBlob

In [6]: blob = TextBlob('Today is a beautiful day.')

Finally, to remove the stop words, let’s use the TextBlob’s words in a list comprehension that adds each word to the resulting list only if the word is not in stops:

In [7]: [word for word in blob.words if word not in stops]
Out[7]: ['Today', 'beautiful', 'day']
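
NLTK’s English stop words are all lowercase, and the test word not in stops is case sensitive, so a capitalized stop word such as 'The' at the start of a sentence would be kept. The following is a minimal sketch of a case-insensitive filter. The function name remove_stop_words and the short sample_stops list are hypothetical stand-ins, used here so the code runs without downloading the NLTK list:

```python
def remove_stop_words(words, stops):
    """Return the words that are not stop words, ignoring case."""
    stop_set = {stop.lower() for stop in stops}  # a set gives fast membership tests
    return [word for word in words if word.lower() not in stop_set]


# Hypothetical stand-in for stopwords.words('english')
sample_stops = ['the', 'is', 'a']
words = ['The', 'weather', 'is', 'beautiful', 'today']

print(remove_stop_words(words, sample_stops))
# prints ['weather', 'beautiful', 'today']
```

With the real data, you could pass blob.words and stops from the preceding snippets as the two arguments.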

Self Check

  1. (Fill-In) ________ are common words in text that are often removed from text before analyzing it.
    Answer: Stop words.

  2. (IPython Session) Eliminate stop words from a TextBlob containing the sentence 'TextBlob is easy to use.'
    Answer:

    In [1]: from nltk.corpus import stopwords
    
    In [2]: stops = stopwords.words('english')
    
    In [3]: from textblob import TextBlob
    
    In [4]: blob = TextBlob('TextBlob is easy to use.')
    
    In [5]: [word for word in blob.words if word not in stops]
    Out[5]: ['TextBlob', 'easy', 'use']

12.2.14 n-grams

An n-gram21 is a sequence of n text items, such as letters in words or words in a sentence. In natural language processing, n-grams can be used to identify letters or words that frequently appear adjacent to one another. For text-based user input, this can help predict the next letter or word a user will type—such as when completing items in IPython with tab-completion or when entering a message to a friend in your favorite smartphone messaging app. For speech-to-text, n-grams might be used to improve the quality of the transcription. N-grams are a form of co-occurrence in which words or letters appear near each other in a body of text.
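
To make the prediction idea concrete, the following minimal sketch counts two-word n-grams (bigrams) in a short sample and suggests the word that most often follows a given word. It uses only the Python Standard Library. The names build_next_word_table and predict_next and the sample sentence are hypothetical, and real predictive-text systems are considerably more sophisticated:

```python
from collections import Counter, defaultdict


def build_next_word_table(words):
    """Map each word to a Counter of the words that immediately follow it."""
    table = defaultdict(Counter)
    for current, following in zip(words, words[1:]):  # each adjacent pair is a bigram
        table[current][following] += 1
    return table


def predict_next(table, word):
    """Return the most frequent follower of word, or None if word has none."""
    followers = table.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


sample = 'the cat sat on the mat and the cat slept'.split()
table = build_next_word_table(sample)

print(predict_next(table, 'the'))  # prints cat ('the cat' occurs twice, 'the mat' once)
```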

TextBlob’s ngrams method produces a list of WordList n-grams of length three by default, known as trigrams. You can pass the keyword argument n to produce n-grams of any desired length. In the following output, the first trigram contains the first three words in the text ('Today', 'is' and 'a'). Then, ngrams creates a trigram starting with the second word ('is', 'a' and 'beautiful') and so on until it creates a trigram containing the last three words in the TextBlob. Note that ngrams ignores sentence boundaries, so some trigrams, such as 'beautiful', 'day' and 'Tomorrow', span two sentences:

In [1]: from textblob import TextBlob

In [2]: text = 'Today is a beautiful day. Tomorrow looks like bad weather.'

In [3]: blob = TextBlob(text)

In [4]: blob.ngrams()
Out[4]:
[WordList(['Today', 'is', 'a']),
 WordList(['is', 'a', 'beautiful']),
 WordList(['a', 'beautiful', 'day']),
 WordList(['beautiful', 'day', 'Tomorrow']),
 WordList(['day', 'Tomorrow', 'looks']),
 WordList(['Tomorrow', 'looks', 'like']),
 WordList(['looks', 'like', 'bad']),
 WordList(['like', 'bad', 'weather'])]

The following produces n-grams consisting of five words:

In [5]: blob.ngrams(n=5)
Out[5]:
[WordList(['Today', 'is', 'a', 'beautiful', 'day']),
 WordList(['is', 'a', 'beautiful', 'day', 'Tomorrow']),
 WordList(['a', 'beautiful', 'day', 'Tomorrow', 'looks']),
 WordList(['beautiful', 'day', 'Tomorrow', 'looks', 'like']),
 WordList(['day', 'Tomorrow', 'looks', 'like', 'bad']),
 WordList(['Tomorrow', 'looks', 'like', 'bad', 'weather'])]
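
The outputs above follow a sliding-window pattern: a list of w words yields w - n + 1 n-grams, which is why the 10-word text produced eight trigrams and six five-word n-grams. The following pure-Python sketch reproduces that pattern for a plain list of words. It is an illustration only, not TextBlob’s actual implementation, and the name word_ngrams is hypothetical:

```python
def word_ngrams(words, n=3):
    """Return every run of n consecutive words as a list of lists."""
    if n < 1:
        raise ValueError('n must be at least 1')
    # Slide a window of size n across the list one position at a time
    return [words[i:i + n] for i in range(len(words) - n + 1)]


words = ['Today', 'is', 'a', 'beautiful', 'day']

print(word_ngrams(words))
# prints [['Today', 'is', 'a'], ['is', 'a', 'beautiful'], ['a', 'beautiful', 'day']]

print(word_ngrams(words, n=5))
# prints [['Today', 'is', 'a', 'beautiful', 'day']]
```

When n is larger than the number of words, the range is empty, so the function returns an empty list.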

Self Check

  1. (Fill-In) N-grams are a form of ________ in which words appear near each other in a body of text.
    Answer: co-occurrence.

  2. (IPython Session) Produce n-grams consisting of three words each for 'TextBlob is easy to use.'
    Answer:

    In [1]: from textblob import TextBlob
    
    In [2]: blob = TextBlob('TextBlob is easy to use.')
    
    In [3]: blob.ngrams()
    Out[3]:
    [WordList(['TextBlob', 'is', 'easy']),
     WordList(['is', 'easy', 'to']),
     WordList(['easy', 'to', 'use'])]