TextBlob is an object-oriented NLP text-processing library that is built on the NLTK and pattern NLP libraries and simplifies many of their capabilities. Some of the NLP tasks TextBlob can perform include:
Tokenization—splitting text into pieces called tokens, which are meaningful units, such as words and numbers.
Parts-of-speech (POS) tagging—identifying each word’s part of speech, such as noun, verb, adjective, etc.
Noun phrase extraction—locating groups of words that represent nouns, such as “red brick factory.”
Sentiment analysis—determining whether text has positive, neutral or negative sentiment.
Inter-language translation and language detection powered by Google Translate.
Inflection—pluralizing and singularizing words. There are other aspects of inflection that are not part of TextBlob.
Spell checking and spelling correction.
Stemming—reducing words to their stems by removing prefixes or suffixes. For example, the stem of “varieties” is “varieti.”
Lemmatization—like stemming, but produces real words based on the original words’ context. For example, the lemmatized form of “varieties” is “variety.”
Word frequencies—determining how often each word appears in a corpus.
WordNet integration for finding word definitions, synonyms and antonyms.
Stop word elimination—removing common words, such as a, an, the, I, we and you, to analyze the important words in a corpus.
n-grams—producing sets of consecutive words in a corpus for use in identifying words that frequently appear adjacent to one another.
Many of these capabilities are used as part of more complex NLP tasks. In this section, we’ll perform these NLP tasks using TextBlob and NLTK.
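As a preview of what tokenization produces, here is a minimal standard-library sketch. The regular expressions below are naive illustrations of the idea, not what TextBlob's tokenizers actually do:

```python
import re

text = 'Today is a beautiful day. Tomorrow looks like bad weather.'

# naive sentence split: break after ., ! or ? followed by whitespace
sentences = re.split(r'(?<=[.!?])\s+', text.strip())

# naive word tokenization: runs of word characters, punctuation dropped
words = re.findall(r'\w+', text)

print(sentences)
print(words)
```

Real tokenizers handle abbreviations, quotes and other edge cases that these patterns ignore.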
To install TextBlob, open your Anaconda Prompt (Windows), Terminal (macOS/Linux) or shell (Linux), then execute the following command:
conda install -c conda-forge textblob
Windows users might need to run the Anaconda Prompt as an Administrator for proper software installation privileges. To do so, right-click Anaconda Prompt in the start menu and select More > Run as administrator.
Once installation completes, execute the following command to download the NLTK corpora used by TextBlob:
ipython -m textblob.download_corpora
These include:
The Brown Corpus (created at Brown University) for parts-of-speech tagging.
Punkt for English sentence tokenization.
WordNet for word definitions, synonyms and antonyms.
Averaged Perceptron Tagger for parts-of-speech tagging.
conll2000 for breaking text into components, like nouns, verbs, noun phrases and more—known as chunking the text. The name conll2000 is from the conference that created the chunking data—Conference on Computational Natural Language Learning.
Movie Reviews for sentiment analysis.
A great source of text for analysis is the free e-books at Project Gutenberg.
The site contains over 57,000 e-books in various formats, including plain text files. These are out of copyright in the United States. For information about Project Gutenberg’s Terms of Use and copyright in other countries, see:
https:/
In some of this section’s examples, we use the plain-text e-book file for Shakespeare’s Romeo and Juliet, which you can find at:
https://www.gutenberg.org/ebooks/1513
Project Gutenberg does not allow programmatic access to its e-books; you’re required to copy the books for that purpose. To download Romeo and Juliet as a plain-text e-book, right-click the Plain Text UTF-8 link on the book’s web page, then select Save Link As… (Chrome/Firefox), Download Linked File As… (Safari) or Save target as (Microsoft Edge) to save the book to your system. Save it as RomeoAndJuliet.txt in the ch12 examples folder to ensure that our code examples will work correctly. For analysis purposes, we removed the Project Gutenberg text before "THE TRAGEDY OF ROMEO AND JULIET", as well as the Project Gutenberg information at the end of the file starting with:
End of the Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare
(Fill-In) TextBlob is an object-oriented NLP text-processing library built on the _______ and _______ NLP libraries, and simplifies accessing their capabilities.
Answer: NLTK, pattern.
TextBlob is the fundamental class for NLP with the textblob module. Let’s create a TextBlob containing two sentences:
In [1]: from textblob import TextBlob
In [2]: text = 'Today is a beautiful day. Tomorrow looks like bad weather.'
In [3]: blob = TextBlob(text)
In [4]: blob
Out[4]: TextBlob("Today is a beautiful day. Tomorrow looks like bad
weather.")
TextBlobs—and, as you’ll see shortly, Sentences and Words—support string methods and can be compared with strings. They also provide methods for various NLP tasks. Sentences, Words and TextBlobs inherit from BaseBlob, so they have many common methods and properties.
[Note: We use snippet [3]’s TextBlob in several of the following Self Checks and subsections, in which we continue the previous interactive session.]
(Fill-In) _______ is the fundamental class for NLP with the textblob module.
Answer: TextBlob.
(True/False) TextBlobs support string methods and can be compared with strings using the comparison operators.
Answer: True.
(IPython Session) Create a TextBlob named exercise_blob containing 'This is a TextBlob'.
Answer:
In [5]: exercise_blob = TextBlob('This is a TextBlob')
In [6]: exercise_blob
Out[6]: TextBlob("This is a TextBlob")
Natural language processing often requires tokenizing text before performing other NLP tasks. TextBlob provides convenient properties for accessing the sentences and words in TextBlobs. Let’s use the sentences property to get a list of Sentence objects:
In [7]: blob.sentences
Out[7]:
[Sentence("Today is a beautiful day."),
Sentence("Tomorrow looks like bad weather.")]
The words property returns a WordList object containing a list of Word objects, representing each word in the TextBlob with the punctuation removed:
In [8]: blob.words
Out[8]: WordList(['Today', 'is', 'a', 'beautiful', 'day', 'Tomorrow',
'looks', 'like', 'bad', 'weather'])
(IPython Session) Create a TextBlob with two sentences, then tokenize it into Sentences and Words, displaying all the tokens.
Answer:
In [9]: ex = TextBlob('My old computer is slow. My new one is fast.')
In [10]: ex.sentences
Out[10]: [Sentence("My old computer is slow."), Sentence("My new one is
fast.")]
In [11]: ex.words
Out[11]: WordList(['My', 'old', 'computer', 'is', 'slow', 'My', 'new',
'one', 'is', 'fast'])
Parts-of-speech (POS) tagging is the process of evaluating words based on their context to determine each word’s part of speech. There are eight primary English parts of speech—nouns, pronouns, verbs, adjectives, adverbs, prepositions, conjunctions and interjections (words that express emotion and that are typically followed by punctuation, like “Yes!” or “Ha!”). Within each category there are many subcategories.
Some words have multiple meanings. For example, the words “set” and “run” have hundreds of meanings each! If you look at the dictionary.com definitions of the word “run,” you’ll see that it can be a verb, a noun, an adjective or a part of a verb phrase. An important use of POS tagging is determining a word’s meaning among its possibly many meanings. This is important for helping computers “understand” natural language.
The tags property returns a list of tuples, each containing a word and a string representing its part-of-speech tag:
In [12]: blob
Out[12]: TextBlob("Today is a beautiful day. Tomorrow looks like bad
weather.")
In [13]: blob.tags
Out[13]:
[('Today', 'NN'),
('is', 'VBZ'),
('a', 'DT'),
('beautiful', 'JJ'),
('day', 'NN'),
('Tomorrow', 'NNP'),
('looks', 'VBZ'),
('like', 'IN'),
('bad', 'JJ'),
('weather', 'NN')]
By default, TextBlob uses a PatternTagger to determine parts-of-speech. This class uses the parts-of-speech tagging capabilities of the pattern library:
https://www.clips.uantwerpen.be/pattern
You can view the library’s 63 parts-of-speech tags at
https://www.clips.uantwerpen.be/pages/MBSP-tags
In the preceding snippet’s output:
Today, day and weather are tagged as NN—a singular noun or mass noun.
is and looks are tagged as VBZ—a third person singular present verb.
a is tagged as DT—a determiner.
beautiful and bad are tagged as JJ—an adjective.
Tomorrow is tagged as NNP—a proper singular noun.
like is tagged as IN—a subordinating conjunction or preposition.
(Fill-In) _______ is the process of evaluating words based on their context to determine each word’s part of speech.
Answer: Parts-of-speech (POS) tagging.
(IPython Session) Display the parts-of-speech tags for the sentence 'My dog is cute'.
Answer:
In [14]: TextBlob('My dog is cute').tags
Out[14]: [('My', 'PRP$'), ('dog', 'NN'), ('is', 'VBZ'), ('cute', 'JJ')]
In the preceding output, the POS tag PRP$ indicates a possessive pronoun.
Let’s say you’re preparing to purchase a water ski so you’re researching them online. You might search for “best water ski.” In this case, “water ski” is a noun phrase. If the search engine does not parse the noun phrase properly, you probably will not get the best search results. Go online and try searching for “best water,” “best ski” and “best water ski” and see what you get.
A TextBlob’s noun_phrases property returns a WordList object containing a list of Word objects—one for each noun phrase in the text:
In [15]: blob
Out[15]: TextBlob("Today is a beautiful day. Tomorrow looks like bad
weather.")
In [16]: blob.noun_phrases
Out[16]: WordList(['beautiful day', 'tomorrow', 'bad weather'])
Note that a Word representing a noun phrase can contain multiple words. A WordList is an extension of Python’s built-in list type. WordLists provide additional methods for stemming, lemmatizing, singularizing and pluralizing.
(IPython Session) Show the noun phrase(s) in the sentence 'The red brick factory is for sale'.
Answer:
In [17]: TextBlob('The red brick factory is for sale').noun_phrases
Out[17]: WordList(['red brick factory'])
One of the most common and valuable NLP tasks is sentiment analysis, which determines whether text is positive, neutral or negative. For instance, companies might use this to determine whether people are speaking positively or negatively online about their products. Consider the positive word “good” and the negative word “bad.” Just because a sentence contains “good” or “bad” does not mean the sentence’s sentiment necessarily is positive or negative. For example, the sentence
The food is not good.
clearly has negative sentiment. Similarly, the sentence
The movie was not bad.
clearly has positive sentiment, though perhaps not as positive as something like
The movie was excellent!
Sentiment analysis is a complex machine-learning problem. However, libraries like TextBlob have pretrained machine learning models for performing sentiment analysis.
A TextBlob’s sentiment property returns a Sentiment object indicating whether the text is positive or negative and whether it’s objective or subjective:
In [18]: blob
Out[18]: TextBlob("Today is a beautiful day. Tomorrow looks like bad
weather.")
In [19]: blob.sentiment
Out[19]: Sentiment(polarity=0.07500000000000007,
subjectivity=0.8333333333333333)
In the preceding output, the polarity indicates sentiment with a value from -1.0 (negative) to 1.0 (positive), with 0.0 being neutral. The subjectivity is a value from 0.0 (objective) to 1.0 (subjective). Based on the values for our TextBlob, the overall sentiment is close to neutral, and the text is mostly subjective.
Getting the polarity and subjectivity from the Sentiment Object
The values displayed above probably provide more precision than you need in most cases. This can detract from numeric output’s readability. The IPython magic %precision allows you to specify the default precision for standalone float objects and float objects in built-in types like lists, dictionaries and tuples. Let’s use the magic to round the polarity and subjectivity values to three digits to the right of the decimal point:
In [20]: %precision 3
Out[20]: '%.3f'
In [21]: blob.sentiment.polarity
Out[21]: 0.075
In [22]: blob.sentiment.subjectivity
Out[22]: 0.833
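The %precision magic affects only how IPython displays values. In a plain Python script you can get the same effect by formatting the values yourself, for example with an f-string (a stdlib-only sketch):

```python
# the raw values returned by the sentiment property
polarity = 0.07500000000000007
subjectivity = 0.8333333333333333

# format to three digits after the decimal point for display only;
# the underlying float values are unchanged
print(f'polarity: {polarity:.3f}, subjectivity: {subjectivity:.3f}')
```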
Getting the Sentiment of a Sentence
You also can get the sentiment at the individual sentence level. Let’s use the sentences property to get a list of Sentence objects, then iterate through them and display each Sentence’s sentiment property:
In [23]: for sentence in blob.sentences:
...: print(sentence.sentiment)
...:
Sentiment(polarity=0.85, subjectivity=1.0)
Sentiment(polarity=-0.6999999999999998, subjectivity=0.6666666666666666)
This might explain why the entire TextBlob’s sentiment is close to 0.0 (neutral)—one sentence is positive (0.85) and the other negative (-0.6999999999999998).
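In fact, the overall polarity is consistent with a simple average of the two sentence polarities, which we can verify with a quick calculation (the averaging here illustrates how the numbers relate; it is not necessarily the exact formula the underlying analyzer uses):

```python
# polarities reported for the two sentences above
sentence_polarities = [0.85, -0.6999999999999998]

# the mean of the sentence polarities matches the overall
# polarity (0.075) reported for the whole TextBlob
overall = sum(sentence_polarities) / len(sentence_polarities)
print(round(overall, 3))  # 0.075
```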
(IPython Session) Import Sentence from the textblob module, then make Sentence objects to check the sentiment of the three sentences used in this section’s introduction.
Answer: Snippet [25]’s output shows that the sentence’s sentiment is somewhat negative (due to “not good”). Snippet [26]’s output shows that the sentence’s sentiment is somewhat positive (due to “not bad”). Snippet [27]’s output shows that the sentence’s sentiment is totally positive (due to “excellent”). The outputs indicate that all three sentences are subjective, with the last being perfectly positive and subjective.
In [24]: from textblob import Sentence
In [25]: Sentence('The food is not good.').sentiment
Out[25]: Sentiment(polarity=-0.35, subjectivity=0.6000000000000001)
In [26]: Sentence('The movie was not bad.').sentiment
Out[26]: Sentiment(polarity=0.3499999999999999,
subjectivity=0.6666666666666666)
In [27]: Sentence('The movie was excellent!').sentiment
Out[27]: Sentiment(polarity=1.0, subjectivity=1.0)
NaiveBayesAnalyzer
By default, a TextBlob and the Sentences and Words you get from it determine sentiment using a PatternAnalyzer, which uses the same sentiment-analysis techniques as the pattern library. The TextBlob library also comes with a NaiveBayesAnalyzer (module textblob.sentiments), which was trained on a database of movie reviews. Naive Bayes is a commonly used machine-learning text-classification algorithm. The following uses the analyzer keyword argument to specify a TextBlob’s sentiment analyzer. Recall from earlier in this ongoing IPython session that text contains 'Today is a beautiful day. Tomorrow looks like bad weather.':
In [28]: from textblob.sentiments import NaiveBayesAnalyzer
In [29]: blob = TextBlob(text, analyzer=NaiveBayesAnalyzer())
In [30]: blob
Out[30]: TextBlob("Today is a beautiful day. Tomorrow looks like bad
weather.")
Let’s use the TextBlob’s sentiment property to display the text’s sentiment using the NaiveBayesAnalyzer:
In [31]: blob.sentiment
Out[31]: Sentiment(classification='neg', p_pos=0.47662917962091056,
p_neg=0.5233708203790892)
In this case, the overall sentiment is classified as negative (classification='neg'). The Sentiment object’s p_pos indicates that the TextBlob is 47.66% positive, and its p_neg indicates that the TextBlob is 52.34% negative. Since the overall sentiment is just slightly more negative, we’d probably view this TextBlob’s sentiment as neutral overall.
Now, let’s get the sentiment of each Sentence:
In [32]: for sentence in blob.sentences:
...: print(sentence.sentiment)
...:
Sentiment(classification='pos', p_pos=0.8117563121751951,
p_neg=0.18824368782480477)
Sentiment(classification='neg', p_pos=0.174363226578349,
p_neg=0.8256367734216521)
Notice that rather than polarity and subjectivity, the Sentiment objects we get from the NaiveBayesAnalyzer contain a classification—'pos' (positive) or 'neg' (negative)—and p_pos (percentage positive) and p_neg (percentage negative) values from 0.0 to 1.0. Once again, we see that the first sentence is positive and the second is negative.
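Note that p_pos and p_neg are complementary: for each Sentiment they sum to approximately 1.0, so the classification is simply whichever probability is larger. A quick check with the values from the preceding outputs:

```python
# (p_pos, p_neg) pairs from the two Sentence outputs above
sentiments = [(0.8117563121751951, 0.18824368782480477),
              (0.174363226578349, 0.8256367734216521)]

for p_pos, p_neg in sentiments:
    # the two probabilities sum to ~1.0 (allowing for floating-point error)
    assert abs((p_pos + p_neg) - 1.0) < 1e-9
    print('pos' if p_pos > p_neg else 'neg')
```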
(IPython Session) Check the sentiment of the sentence 'The movie was excellent!' using the NaiveBayesAnalyzer.
Answer:
In [33]: text = ('The movie was excellent!')
In [34]: exblob = TextBlob(text, analyzer=NaiveBayesAnalyzer())
In [35]: exblob.sentiment
Out[35]: Sentiment(classification='pos', p_pos=0.7318278242290406,
p_neg=0.26817217577095936)
Inter-language translation is a challenging problem in natural language processing and artificial intelligence. With advances in machine learning, artificial intelligence and natural language processing, services like Google Translate (100+ languages) and Microsoft Bing Translator (60+ languages) can translate between languages instantly.
Inter-language translation also is great for people traveling to foreign countries. They can use translation apps to translate menus, road signs and more. There are even efforts at live speech translation so that you’ll be able to converse in real time with people who do not know your natural language. Some smartphones can now work together with in-ear headphones to provide near-live translation of many languages. In the “IBM Watson and Cognitive Computing” chapter, we develop a script that does near real-time inter-language translation among languages supported by Watson.
The TextBlob library uses Google Translate to detect a text’s language and translate TextBlobs, Sentences and Words into other languages. Let’s use the detect_language method to detect the language of the text we’re manipulating ('en' is English):
In [36]: blob
Out[36]: TextBlob("Today is a beautiful day. Tomorrow looks like bad
weather.")
In [37]: blob.detect_language()
Out[37]: 'en'
Next, let’s use the translate method to translate the text to Spanish ('es'), then detect the language of the result. The to keyword argument specifies the target language.
In [38]: spanish = blob.translate(to='es')
In [39]: spanish
Out[39]: TextBlob("Hoy es un hermoso dia. Mañana parece mal tiempo.")
In [40]: spanish.detect_language()
Out[40]: 'es'
Next, let’s translate our TextBlob to simplified Chinese (specified as 'zh' or 'zh-CN'), then detect the language of the result:
In [41]: chinese = blob.translate(to='zh')
In [42]: chinese
Out[42]: TextBlob("")
In [43]: chinese.detect_language()
Out[43]: 'zh-CN'
Method detect_language’s output always shows simplified Chinese as 'zh-CN', even though the translate function can receive simplified Chinese as 'zh' or 'zh-CN'.
In each of the preceding cases, Google Translate automatically detects the source language. You can specify a source language explicitly by passing the from_lang keyword argument to the translate method, as in
chinese = blob.translate(from_lang='en', to='zh')
Google Translate uses ISO-639-1 language codes listed at
https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
For the supported languages, you’d use these codes as the values of the from_lang and to keyword arguments. Google Translate’s list of supported languages is at:
https://cloud.google.com/translate/docs/languages
Calling translate without arguments translates from the detected source language to English:
In [44]: spanish.translate()
Out[44]: TextBlob("Today is a beautiful day. Tomorrow seems like bad
weather.")
In [45]: chinese.translate()
Out[45]: TextBlob("Today is a beautiful day. Tomorrow looks like bad
weather.")
Note the slight difference in the English results.
(IPython Session) Translate 'Today is a beautiful day.' into French, then detect the language.
Answer:
In [46]: blob = TextBlob('Today is a beautiful day.')
In [47]: french = blob.translate(to='fr')
In [48]: french
Out[48]: TextBlob("Aujourd'hui est un beau jour.")
In [49]: french.detect_language()
Out[49]: 'fr'
Inflections are different forms of the same words, such as singular and plural (like “person” and “people”) and different verb tenses (like “run” and “ran”). When you’re calculating word frequencies, you might first want to convert all inflected words to the same form for more accurate word frequencies. Words and WordLists each support converting words to their singular or plural forms. Let’s pluralize and singularize a couple of Word objects:
In [1]: from textblob import Word
In [2]: index = Word('index')
In [3]: index.pluralize()
Out[3]: 'indices'
In [4]: cacti = Word('cacti')
In [5]: cacti.singularize()
Out[5]: 'cactus'
Pluralizing and singularizing are sophisticated tasks which, as you can see above, are not as simple as adding or removing an “s” or “es” at the end of a word.
You can do the same with a WordList:
In [6]: from textblob import TextBlob
In [7]: animals = TextBlob('dog cat fish bird').words
In [8]: animals.pluralize()
Out[8]: WordList(['dogs', 'cats', 'fish', 'birds'])
Note that the word “fish” is the same in both its singular and plural forms.
(IPython Session) Singularize the word 'children' and pluralize 'focus'.
Answer:
In [1]: from textblob import Word
In [2]: Word('children').singularize()
Out[2]: 'child'
In [3]: Word('focus').pluralize()
Out[3]: 'foci'
For natural language processing tasks, it’s important that the text be free of spelling errors. Software packages for writing and editing text, like Microsoft Word, Google Docs and others automatically check your spelling as you type and typically display a red line under misspelled words. Other tools enable you to manually invoke a spelling checker.
You can check a Word’s spelling with its spellcheck method, which returns a list of tuples containing possible correct spellings and a confidence value. Let’s assume we meant to type the word “they” but misspelled it as “theyr.” The spell-checking results show two possible corrections, with the word 'they' having the highest confidence value:
In [1]: from textblob import Word
In [2]: word = Word('theyr')
In [3]: %precision 2
Out[3]: '%.2f'
In [4]: word.spellcheck()
Out[4]: [('they', 0.57), ('their', 0.43)]
Note that the word with the highest confidence value might not be the correct word for the given context.
TextBlobs, Sentences and Words all have a correct method that you can call to correct spelling. Calling correct on a Word returns the correctly spelled word that has the highest confidence value (as returned by spellcheck):
In [5]: word.correct() # chooses word with the highest confidence value
Out[5]: 'they'
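Conceptually, correct just picks the highest-confidence candidate from spellcheck’s results. Under that assumption, the selection step can be sketched in plain Python:

```python
# candidate (word, confidence) tuples like those spellcheck returned above
candidates = [('they', 0.57), ('their', 0.43)]

# choose the candidate whose confidence value is largest
best_word, confidence = max(candidates, key=lambda pair: pair[1])
print(best_word)  # they
```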
Calling correct on a TextBlob or Sentence checks the spelling of each word. For each incorrect word, correct replaces it with the correctly spelled one that has the highest confidence value:
In [6]: from textblob import TextBlob
In [7]: sentence = TextBlob('Ths sentense has missplled wrds.')
In [8]: sentence.correct()
Out[8]: TextBlob("The sentence has misspelled words.")
(True/False) You can check a Word’s spelling with its correct method, which returns a list of tuples containing possible correct spellings and a confidence value.
Answer: False. You can check a Word’s spelling with its spellcheck method, which returns a list of tuples containing potential correct spellings and a confidence value.
(IPython Session) Correct the spelling in 'I canot beleive I misspeled thees werds'.
Answer:
In [1]: from textblob import TextBlob
In [2]: sentence = TextBlob('I canot beleive I misspeled thees werds')
In [3]: sentence.correct()
Out[3]: TextBlob("I cannot believe I misspelled these words")
Stemming removes a prefix or suffix from a word leaving only a stem, which may or may not be a real word. Lemmatization is similar, but factors in the word’s part of speech and meaning and results in a real word.
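To see why stemming can yield non-words like “varieti,” here is a toy suffix-stripping rule in plain Python, loosely mimicking the Porter stemmer step that rewrites “ies” as “i.” This is a deliberately simplified illustration, not the real algorithm:

```python
def toy_stem(word):
    """Strip a few common suffixes; real stemmers apply many ordered rules."""
    if word.endswith('ies'):
        return word[:-3] + 'i'   # varieties -> varieti
    if word.endswith('es'):
        return word[:-2]         # boxes -> box
    if word.endswith('s'):
        return word[:-1]         # words -> word
    return word

print(toy_stem('varieties'))  # varieti
print(toy_stem('words'))      # word
```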
Stemming and lemmatization are normalization operations, in which you prepare words for analysis. For example, before calculating statistics on words in a body of text, you might convert all words to lowercase so that capitalized and lowercase words are not treated differently. Sometimes, you might want to use a word’s root to represent the word’s many forms. For example, in a given application, you might want to treat all of the following words as “program”: program, programs, programmer, programming and programmed (and perhaps U.K. English spellings, like programmes as well).
Words and WordLists each support stemming and lemmatization via the methods stem and lemmatize. Let’s use both on a Word:
In [1]: from textblob import Word
In [2]: word = Word('varieties')
In [3]: word.stem()
Out[3]: 'varieti'
In [4]: word.lemmatize()
Out[4]: 'variety'
(True/False) Stemming is similar to lemmatization, but factors in the word’s part of speech and meaning and results in a real word.
Answer: False. Lemmatization is similar to stemming, but factors in the word’s part of speech and meaning and results in a real word.
(IPython Session) Stem and lemmatize the word 'strawberries'.
Answer:
In [1]: from textblob import Word
In [2]: word = Word('strawberries')
In [3]: word.stem()
Out[3]: 'strawberri'
In [4]: word.lemmatize()
Out[4]: 'strawberry'
Various techniques for detecting similarity between documents rely on word frequencies. As you’ll see here, TextBlob automatically counts word frequencies. First, let’s load the e-book for Shakespeare’s Romeo and Juliet into a TextBlob. To do so, we’ll use the Path class from the Python Standard Library’s pathlib module:
In [1]: from pathlib import Path
In [2]: from textblob import TextBlob
In [3]: blob = TextBlob(Path('RomeoAndJuliet.txt').read_text())
Use the file RomeoAndJuliet.txt that you downloaded earlier. We assume here that you started your IPython session from that folder. When you read a file with Path’s read_text method, it closes the file immediately after it finishes reading.
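Here is a small stdlib-only demonstration of Path’s read_text method, using a temporary file as a stand-in for RomeoAndJuliet.txt:

```python
import tempfile
from pathlib import Path

# write a small stand-in file (in place of the full play)
path = Path(tempfile.gettempdir()) / 'sample.txt'
path.write_text('O Romeo, Romeo! wherefore art thou Romeo?', encoding='utf-8')

# read_text opens the file, reads its entire contents and closes it
text = path.read_text(encoding='utf-8')
print(text)
```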
You can access the word frequencies through the TextBlob’s word_counts dictionary. Let’s get the counts of several words in the play:
In [4]: blob.word_counts['juliet']
Out[4]: 190
In [5]: blob.word_counts['romeo']
Out[5]: 315
In [6]: blob.word_counts['thou']
Out[6]: 278
If you already have tokenized a TextBlob into a WordList, you can count specific words in the list via the count method:
In [7]: blob.words.count('joy')
Out[7]: 14
In [8]: blob.noun_phrases.count('lady capulet')
Out[8]: 46
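Under the hood, a frequency table like word_counts can be built with the standard library’s collections.Counter. The sketch below shows the idea on a tiny sample; word_counts keys are lowercase (as the 'juliet' and 'romeo' lookups above suggest), so we lowercase the tokens first:

```python
from collections import Counter

# a tiny stand-in for the play's text
text = 'Romeo loves Juliet and Juliet loves Romeo and Romeo'

# lowercase the tokens so 'Romeo' and 'romeo' are counted together
counts = Counter(word.lower() for word in text.split())
print(counts['romeo'])   # 3
print(counts['juliet'])  # 2
```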
(True/False) You can access the word frequencies through the TextBlob’s counts dictionary.
Answer: False. You can access the word frequencies through the word_counts dictionary.
(IPython Session) Using the TextBlob from this section’s IPython session, determine how many times the stop words “a,” “an” and “the” appear in Romeo and Juliet.
Answer:
In [9]: blob.word_counts['a']
Out[9]: 483
In [10]: blob.word_counts['an']
Out[10]: 71
In [11]: blob.word_counts['the']
Out[11]: 688
WordNet is a word database created by Princeton University. The TextBlob library uses the NLTK library’s WordNet interface, enabling you to look up word definitions and get synonyms and antonyms. For more information, check out the NLTK WordNet interface documentation at:
https:/
First, let’s create a Word:
In [1]: from textblob import Word
In [2]: happy = Word('happy')
The Word class’s definitions property returns a list of all the word’s definitions in the WordNet database:
In [3]: happy.definitions
Out[3]:
['enjoying or showing or marked by joy or pleasure',
'marked by good fortune',
'eagerly disposed to act or to be of service',
'well expressed and to the point']
The database does not necessarily contain every dictionary definition of a given word. There’s also a define method that enables you to pass a part of speech as an argument so you can get definitions matching only that part of speech.
You can get a Word’s synsets—that is, its sets of synonyms—via the synsets property. The result is a list of Synset objects:
In [4]: happy.synsets
Out[4]:
[Synset('happy.a.01'),
Synset('felicitous.s.02'),
Synset('glad.s.02'),
Synset('happy.s.04')]
Each Synset represents a group of synonyms. In the notation happy.a.01:
happy is the original Word’s lemmatized form (in this case, it’s the same).
a is the part of speech, which can be a for adjective, n for noun, v for verb, r for adverb or s for adjective satellite. Many adjective synsets in WordNet have satellite synsets that represent similar adjectives.
01 is an index number. Many words have multiple meanings, and this is the index number of the corresponding meaning in the WordNet database.
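The three parts of a synset name are dot-separated, so you can pull them apart with ordinary string handling (a small illustrative sketch):

```python
name = 'happy.a.01'

# split the synset name into its lemma, part-of-speech code and sense number
lemma, pos, sense = name.split('.')
print(lemma, pos, int(sense))  # happy a 1
```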
There’s also a get_synsets method that enables you to pass a part of speech as an argument so you can get Synsets matching only that part of speech.
You can iterate through the synsets list to find the original word’s synonyms. Each Synset has a lemmas method that returns a list of Lemma objects representing the synonyms. A Lemma’s name method returns the synonymous word as a string. In the following code, for each Synset in the synsets list, the nested for loop iterates through that Synset’s Lemmas (if any). Then we add the synonym to the set named synonyms. We used a set collection because it automatically eliminates any duplicates we add to it:
In [5]: synonyms = set()
In [6]: for synset in happy.synsets:
...: for lemma in synset.lemmas():
...: synonyms.add(lemma.name())
...:
In [7]: synonyms
Out[7]: {'felicitous', 'glad', 'happy', 'well-chosen'}
If the word represented by a Lemma has antonyms in the WordNet database, invoking the Lemma’s antonyms method returns a list of Lemmas representing the antonyms (or an empty list if there are no antonyms in the database). In snippet [4] you saw there were four Synsets for 'happy'. First, let’s get the Lemmas for the Synset at index 0 of the synsets list:
In [8]: lemmas = happy.synsets[0].lemmas()
In [9]: lemmas
Out[9]: [Lemma('happy.a.01.happy')]
In this case, lemmas returned a list of one Lemma element. We can now check whether the database has any corresponding antonyms for that Lemma:
In [10]: lemmas[0].antonyms()
Out[10]: [Lemma('unhappy.a.01.unhappy')]
The result is a list of Lemmas representing the antonym(s). Here, we see that the one antonym for 'happy' in the database is 'unhappy'.
(Fill-In) A(n) _______ represents a group of synonyms of a given word.
Answer: Synset.
(IPython Session) Display the synsets and definitions for the word “boat.”
Answer:
In [1]: from textblob import Word
In [2]: word = Word('boat')
In [3]: word.synsets
Out[3]: [Synset('boat.n.01'), Synset('gravy_boat.n.01'),
Synset('boat.v.01')]
In [4]: word.definitions
Out[4]:
['a small vessel for travel on water',
'a dish (often boat-shaped) for serving gravy or sauce',
'ride in a boat on water']
In this case, there were three Synsets, and the definitions property displayed the corresponding definitions.
Stop words are common words in text that are often removed from text before analyzing it because they typically do not provide useful information. NLTK’s list of English stop words is returned by the NLTK stopwords module’s words function, which we’ll use momentarily.
The NLTK library has lists of stop words for several other natural languages as well. Before using NLTK’s stop-words lists, you must download them, which you do with the nltk module’s download function:
In [1]: import nltk
In [2]: nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\PaulDeitel\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
Out[2]: True
For this example, we’ll load the 'english' stop words list. First, import stopwords from the nltk.corpus module, then use its words function to load the 'english' stop words list:
In [3]: from nltk.corpus import stopwords
In [4]: stops = stopwords.words('english')
Next, let’s create a TextBlob from which we’ll remove stop words:
In [5]: from textblob import TextBlob
In [6]: blob = TextBlob('Today is a beautiful day.')
Finally, to remove the stop words, let’s use the TextBlob’s words in a list comprehension that adds each word to the resulting list only if the word is not in stops:
In [7]: [word for word in blob.words if word not in stops]
Out[7]: ['Today', 'beautiful', 'day']
(Fill-In) _______ are common words in text that are often removed from text before analyzing it.
Answer: Stop words.
(IPython Session) Eliminate stop words from a TextBlob containing the sentence 'TextBlob is easy to use.'
Answer:
In [1]: from nltk.corpus import stopwords
In [2]: stops = stopwords.words('english')
In [3]: from textblob import TextBlob
In [4]: blob = TextBlob('TextBlob is easy to use.')
In [5]: [word for word in blob.words if word not in stops]
Out[5]: ['TextBlob', 'easy', 'use']
An n-gram is a sequence of n text items, such as letters in words or words in a sentence. In natural language processing, n-grams can be used to identify letters or words that frequently appear adjacent to one another. For text-based user input, this can help predict the next letter or word a user will type—such as when completing items in IPython with tab-completion or when entering a message to a friend in your favorite smartphone messaging app. For speech-to-text, n-grams might be used to improve the quality of the transcription. N-grams are a form of co-occurrence, in which words or letters appear near each other in a body of text.
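The sliding-window idea behind n-grams is simple to express in plain Python (an illustrative sketch, not TextBlob’s implementation):

```python
def ngrams(tokens, n=3):
    """Return every window of n consecutive tokens, sliding one token at a time."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

tokens = 'Today is a beautiful day'.split()
print(ngrams(tokens))
# [['Today', 'is', 'a'], ['is', 'a', 'beautiful'], ['a', 'beautiful', 'day']]
```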
TextBlob’s ngrams method produces a list of WordList n-grams of length three by default—known as trigrams. You can pass the keyword argument n to produce n-grams of any desired length. The output shows that the first trigram contains the first three words in the sentence ('Today', 'is' and 'a'). Then, ngrams creates a trigram starting with the second word ('is', 'a' and 'beautiful'), and so on until it creates a trigram containing the last three words in the TextBlob:
In [1]: from textblob import TextBlob
In [2]: text = 'Today is a beautiful day. Tomorrow looks like bad weather.'
In [3]: blob = TextBlob(text)
In [4]: blob.ngrams()
Out[4]:
[WordList(['Today', 'is', 'a']),
WordList(['is', 'a', 'beautiful']),
WordList(['a', 'beautiful', 'day']),
WordList(['beautiful', 'day', 'Tomorrow']),
WordList(['day', 'Tomorrow', 'looks']),
WordList(['Tomorrow', 'looks', 'like']),
WordList(['looks', 'like', 'bad']),
WordList(['like', 'bad', 'weather'])]
The following produces n-grams consisting of five words:
In [5]: blob.ngrams(n=5)
Out[5]:
[WordList(['Today', 'is', 'a', 'beautiful', 'day']),
WordList(['is', 'a', 'beautiful', 'day', 'Tomorrow']),
WordList(['a', 'beautiful', 'day', 'Tomorrow', 'looks']),
WordList(['beautiful', 'day', 'Tomorrow', 'looks', 'like']),
WordList(['day', 'Tomorrow', 'looks', 'like', 'bad']),
WordList(['Tomorrow', 'looks', 'like', 'bad', 'weather'])]
(Fill-In) N-grams are a form of _______ in which words appear near each other in a body of text.
Answer: co-occurrence.
(IPython Session) Produce n-grams consisting of three words each for 'TextBlob is easy to use.'
Answer:
In [1]: from textblob import TextBlob
In [2]: blob = TextBlob('TextBlob is easy to use.')
In [3]: blob.ngrams()
Out[3]:
[WordList(['TextBlob', 'is', 'easy']),
WordList(['is', 'easy', 'to']),
WordList(['easy', 'to', 'use'])]