So far the data we have worked with generally has been in the form of numbers or countable values. In most cases, we’ve simply stored the data without conducting any analysis after the fact. In this chapter, we’ll attempt to tackle the tricky subject of the English language.1
How does Google know what you’re looking for when you type “cute kitten” into its Image Search? Because of the text that surrounds the cute kitten images. How does YouTube know to bring up a certain Monty Python sketch when you type “dead parrot” into its search bar? Because of the title and description text that accompanies each uploaded video.
In fact, even typing in terms such as “deceased bird monty python” immediately brings up the same “Dead Parrot” sketch, even though the page itself contains no mention of the words “deceased” or “bird.” Google knows that a “hot dog” is a food and that a “boiling puppy” is an entirely different thing. How? It’s all statistics!
Although you might not think that text analysis has anything to do with your project, understanding the concepts behind it can be extremely useful for all sorts of machine learning, as well as the more general ability to model real-world problems in probabilistic and algorithmic terms.
For instance, the Shazam music service can identify audio as containing a certain song recording, even if that audio contains ambient noise or distortion. Google is working on automatically captioning images based on nothing but the image itself.2 By comparing known images of, say, hot dogs to other images of hot dogs, the search engine can gradually learn what a hot dog looks like and observe these patterns in additional images it is shown.
In Chapter 7, we looked at breaking up text content into n-grams, or sets of phrases that are n-words in length. At a very basic level, this can be used to determine which sets of words and phrases tend to be most commonly used in a section of text. In addition, it can be used to create natural-sounding data summaries by going back to the original text and extracting sentences around some of these most popular phrases.
One piece of sample text we’ll be using to do this is the inauguration speech of the ninth president of the United States, William Henry Harrison. Harrison’s presidency sets two records in the history of the office: one for the longest inauguration speech, and another for the shortest time in office, 32 days.
We’ll use the full text of this speech as the source for many of the code samples in this chapter.
Slightly modifying the code used to find n-grams in Chapter 7, we can produce code that looks for sets of 2-grams and sorts them using Python's built-in sorted function, with a sort key supplied by the operator module:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import string
import operator

def cleanInput(input):
    input = re.sub('\n+', " ", input).lower()
    input = re.sub('\[[0-9]*\]', "", input)
    input = re.sub(' +', " ", input)
    input = bytes(input, "UTF-8")
    input = input.decode("ascii", "ignore")
    cleanInput = []
    input = input.split(' ')
    for item in input:
        item = item.strip(string.punctuation)
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            cleanInput.append(item)
    return cleanInput

def ngrams(input, n):
    input = cleanInput(input)
    output = {}
    for i in range(len(input)-n+1):
        ngramTemp = " ".join(input[i:i+n])
        if ngramTemp not in output:
            output[ngramTemp] = 0
        output[ngramTemp] += 1
    return output

content = str(
    urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read(),
    'utf-8')
ngrams = ngrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key=operator.itemgetter(1),
                      reverse=True)
print(sortedNGrams)
The output produces, in part:
[('of the', 213), ('in the', 65), ('to the', 61), ('by the', 41),
('the constitution', 34), ('of our', 29), ('to be', 26), ('from the', 24),
('the people', 24), ('and the', 23), ('it is', 23), ('that the', 23),
('of a', 22), ('of their', 19), ...
Of these 2-grams, “the constitution” seems like a reasonably popular subject in the speech, but “of the,” “in the,” and “to the” don’t seem especially noteworthy. How can you automatically get rid of unwanted words in an accurate way?
Fortunately, there are people out there who carefully study the differences between “interesting” words and “uninteresting” words, and their work can help us do just that. Mark Davies, a linguistics professor at Brigham Young University, maintains the Corpus of Contemporary American English, a collection of over 450 million words from the last decade or so of popular American publications.
The list of 5,000 most frequently found words is available for free, and fortunately, this is far more than enough to act as a basic filter to weed out the most common 2-grams. Just the first 100 words vastly improves the results, with the addition of an isCommon function:
def isCommon(ngram):
    commonWords = ["the", "be", "and", "of", "a", "in", "to", "have", "it",
        "i", "that", "for", "you", "he", "with", "on", "do", "say", "this",
        "they", "is", "an", "at", "but", "we", "his", "from", "that", "not",
        "by", "she", "or", "as", "what", "go", "their", "can", "who", "get",
        "if", "would", "her", "all", "my", "make", "about", "know", "will",
        "as", "up", "one", "time", "has", "been", "there", "year", "so",
        "think", "when", "which", "them", "some", "me", "people", "take",
        "out", "into", "just", "see", "him", "your", "come", "could", "now",
        "than", "like", "other", "how", "then", "its", "our", "two", "more",
        "these", "want", "way", "look", "first", "also", "new", "because",
        "day", "more", "use", "no", "man", "find", "here", "thing", "give",
        "many", "well"]
    for word in ngram:
        if word in commonWords:
            return True
    return False
This produces the following 2-grams that were found more than twice in the text body:
('united states', 10), ('executive department', 4),
('general government', 4), ('called upon', 3), ('government should', 3),
('whole country', 3), ('mr jefferson', 3), ('chief magistrate', 3),
('same causes', 3), ('legislative body', 3)
Appropriately enough, the first two items in the list are “United States” and “executive department,” which we would expect for a presidential inauguration speech.
It’s important to note that we are using a list of common words from relatively modern times to filter the results, which might not be appropriate given that the text was written in 1841. However, because we’re using only the first 100 or so words on the list—which we can assume are more stable over time than, say, the last 100 words—and we appear to be getting satisfactory results, we can likely save ourselves the effort of tracking down or creating a list of the most common words from 1841 (although such an effort might be interesting).
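The chapter doesn't show the glue between isCommon and the earlier n-gram code; one way to wire them together might look like the following sketch. The abbreviated word list and sample (2-gram, count) pairs here are stand-ins for the full versions above:

```python
def isCommon(ngram):
    # Abbreviated common-word list; the full version has ~100 entries
    commonWords = ["the", "of", "in", "to", "and"]
    for word in ngram:
        if word in commonWords:
            return True
    return False

# (2-gram, count) pairs, as produced by the earlier sorted() call
sortedNGrams = [("of the", 213), ("united states", 10),
                ("in the", 65), ("executive department", 4)]

# Keep only 2-grams found more than twice that contain no common words
filtered = [(ngram, count) for ngram, count in sortedNGrams
            if count > 2 and not isCommon(ngram.split())]
print(filtered)
# [('united states', 10), ('executive department', 4)]
```

Note that isCommon expects a list of words, which is why each 2-gram string is split before being passed in.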
Now that some key topics have been extracted from the text, how does this help us write text summaries? One way is to search for the first sentence that contains each “popular” n-gram, the theory being that the first instance will yield a satisfactory overview of the body of the content. The first five most popular 2-grams yield these bullet points:
The Constitution of the United States is the instrument containing this grant of power to the several departments composing the government.
Such a one was afforded by the executive department constituted by the Constitution.
The general government has seized upon none of the reserved rights of the states.
Called from a retirement which I had supposed was to continue for the residue of my life to fill the chief executive office of this great and free nation, I appear before you, fellow-citizens, to take the oaths which the constitution prescribes as a necessary qualification for the performance of its duties; and in obedience to a custom coeval with our government and what I believe to be your expectations I proceed to present to you a summary of the principles which will govern me in the discharge of the duties which I shall be called upon to perform.
The presses in the necessary employment of the government should never be used to clear the guilty or to varnish crime.
Sure, it might not be published in CliffsNotes any time soon, but considering that the original document was 217 sentences in length, and the fourth sentence (“Called from a retirement...”) condenses the main subject down fairly well, it’s not too bad for a first pass.
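A minimal sketch of that extraction step, assuming a naive regex-based sentence splitter (the summarize helper and the sample text here are my own, not from the chapter):

```python
import re

def summarize(text, popularNgrams):
    # Naive sentence splitter: break on ., !, or ? followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text)
    summary = []
    for ngram in popularNgrams:
        for sentence in sentences:
            # Take the first sentence that contains this popular n-gram
            if ngram.lower() in sentence.lower():
                summary.append(sentence.strip())
                break
    return summary

text = ("The general government has seized upon none of the reserved "
        "rights of the states. The executive department must respect "
        "the general government.")
print(summarize(text, ["general government", "executive department"]))
```

For real speeches, a more robust sentence tokenizer (such as NLTK's, introduced later in this chapter) would handle abbreviations and edge cases better than this regex.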
You might have heard of Markov text generators. They’ve become popular for entertainment purposes, as in the Twitov app, as well as for generating real-sounding spam emails to fool detection systems.
All of these text generators are based on the Markov model, which is often used to analyze large sets of random events, where one discrete event is followed by another discrete event with a certain probability.
For example, we might build a Markov model of a weather system as illustrated in Figure 8-1.
In this model, each sunny day has a 70% chance of the following day also being sunny, with a 20% chance of the following day being cloudy with a mere 10% chance of rain. If the day is rainy, there is a 50% chance of rain the following day, a 25% chance of sun, and a 25% chance of clouds.
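As a quick illustration, this weather model can be simulated directly from a transition table. The sunny and rainy rows below come from the probabilities just described; the cloudy row is an assumption of mine, since the text doesn't specify one:

```python
import random

# Transition probabilities: the sunny and rainy rows are from the text;
# the cloudy row is an assumed placeholder
weather = {
    "sunny":  {"sunny": 0.70, "cloudy": 0.20, "rainy": 0.10},
    "rainy":  {"rainy": 0.50, "sunny": 0.25, "cloudy": 0.25},
    "cloudy": {"cloudy": 0.40, "sunny": 0.40, "rainy": 0.20},
}

def nextDay(today):
    # Pick tomorrow's weather with probability proportional to its weight
    states = list(weather[today])
    probs = [weather[today][s] for s in states]
    return random.choices(states, weights=probs)[0]

day = "sunny"
forecast = []
for _ in range(10):
    day = nextDay(day)
    forecast.append(day)
print(forecast)
```

Each run produces a different ten-day forecast, but over many runs the relative frequencies of sunny, cloudy, and rainy days settle toward the model's long-run behavior.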
Note:
Obviously, this is a very simple system, and Markov models can grow arbitrarily large in size. In fact, Google’s PageRank algorithm is based partly on a Markov model, with websites represented as nodes and inbound/outbound links represented as connections between nodes. The “likelihood” of landing on a particular node represents the relative popularity of the site. That is, if our weather system represented an extremely small Internet, “rainy” would have a low PageRank, while “cloudy” would have a high PageRank.
With all of this in mind, let’s bring it back down to a more concrete example: analyzing and writing text.
Again using the inauguration speech of William Henry Harrison analyzed in the previous example, we can write the following code that generates arbitrarily long Markov chains (with the chain length set to 100) based on the structure of its text:
from urllib.request import urlopen
from random import randint

def wordListSum(wordList):
    sum = 0
    for word, value in wordList.items():
        sum += value
    return sum

def retrieveRandomWord(wordList):
    randIndex = randint(1, wordListSum(wordList))
    for word, value in wordList.items():
        randIndex -= value
        if randIndex <= 0:
            return word

def buildWordDict(text):
    # Remove newlines and quotes
    text = text.replace("\n", " ")
    text = text.replace("\"", "")

    # Make sure punctuation marks are treated as their own "words,"
    # so that they will be included in the Markov chain
    punctuation = [',', '.', ';', ':']
    for symbol in punctuation:
        text = text.replace(symbol, " " + symbol + " ")

    words = text.split(" ")
    # Filter out empty words
    words = [word for word in words if word != ""]

    wordDict = {}
    for i in range(1, len(words)):
        if words[i-1] not in wordDict:
            # Create a new dictionary for this word
            wordDict[words[i-1]] = {}
        if words[i] not in wordDict[words[i-1]]:
            wordDict[words[i-1]][words[i]] = 0
        wordDict[words[i-1]][words[i]] += 1
    return wordDict

text = str(
    urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read(),
    'utf-8')
wordDict = buildWordDict(text)

# Generate a Markov chain of length 100
length = 100
chain = ""
currentWord = "I"
for i in range(0, length):
    chain += currentWord + " "
    currentWord = retrieveRandomWord(wordDict[currentWord])

print(chain)
The output of this code changes every time it is run, but here’s an example of the uncannily nonsensical text it will generate:
I sincerely believe in Chief Magistrate to make all necessary sacrifices and oppression of the remedies which we may have occurred to me in the arrangement and disbursement of the democratic claims them , consolatory to have been best political power in fervently commending every other addition of legislation , by the interests which violate that the Government would compare our aboriginal neighbors the people to its accomplishment . The latter also susceptible of the Constitution not much mischief , disputes have left to betray . The maxim which may sometimes be an impartial and to prevent the adoption or
So what’s going on in the code?
The function buildWordDict takes in the string of text, which was retrieved from the Internet. It then does some cleaning and formatting, removing quotes and putting spaces around other punctuation so that each punctuation mark is effectively treated as a separate word. After this, it builds a two-dimensional dictionary (a "dictionary of dictionaries") that has the following form:
{word_a : {word_b : 2, word_c : 1, word_d : 1},
 word_e : {word_b : 5, word_d : 2},
 ...}
In this example dictionary, “word_a” was found four times, two instances of which were followed by “word_b,” one instance followed by “word_c,” and one instance followed by “word_d.” “Word_e” was found seven times: five times followed by “word_b” and twice by “word_d.”
If we were to draw a node model of this result, the node representing “word_a” would have a 50% arrow pointing toward “word_b” (which followed it two out of four times), a 25% arrow pointing toward “word_c,” and a 25% arrow pointing toward “word_d.”
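Those percentages can be computed directly from the raw counts. A small sketch, using the "word_a" entry from the example dictionary (the helper name is my own):

```python
def transitionProbabilities(followerCounts):
    # Normalize raw follower counts into transition probabilities
    total = sum(followerCounts.values())
    return {word: count / total for word, count in followerCounts.items()}

word_a_followers = {"word_b": 2, "word_c": 1, "word_d": 1}
print(transitionProbabilities(word_a_followers))
# {'word_b': 0.5, 'word_c': 0.25, 'word_d': 0.25}
```

The Markov generator in the code above never builds these probabilities explicitly; retrieveRandomWord achieves the same weighting by sampling directly from the raw counts.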
Once this dictionary is built up, it can be used as a lookup table to see where to go next, no matter which word in the text you happen to be on.3 Using the sample dictionary of dictionaries, we might currently be on “word_e,” which means that we’ll pass the dictionary {word_b : 5, word_d: 2}
to the retrieveRandomWord
function. This function in turn retrieves a random word from the dictionary, weighted by the number of times it occurs.
By starting with a random starting word (in this case, the ubiquitous “I”), we can traverse through the Markov chain easily, generating as many words as we like.
In Chapter 3, we created a scraper that collects links from one Wikipedia article to the next, starting with the article on Kevin Bacon, and stores them in a database. Why are we bringing it up again? Because it turns out the problem of choosing a path of links that starts on one page and ends up on the target page (i.e., finding a string of pages between http://bit.ly/1d7SU7h and http://bit.ly/1GOSY7Z) is the same as finding a Markov chain where both the first word and last word are defined. These sorts of problems are directed graph problems, where A → B does not necessarily mean that B → A. The word “football” might often be followed by the word “player,” but you’ll find that the word “player” is much less often followed by the word “football.” Although Kevin Bacon’s Wikipedia article links to the article on his home city, Philadelphia, the article on Philadelphia does not reciprocate by linking back to him.
In contrast, the original Six Degrees of Kevin Bacon game is an undirected graph problem. If Kevin Bacon starred in Flatliners with Julia Roberts, then Julia Roberts necessarily starred in Flatliners with Kevin Bacon, so the relationship goes both ways (it has no “direction”). Undirected graph problems tend to be less common in computer science than directed graph problems, and both are computationally difficult to solve.
Although much work has been done on these sorts of problems and multitudes of variations on them, one of the best and most common ways to find shortest paths in a directed graph—and thus find paths between the Wikipedia article on Kevin Bacon and all other Wikipedia articles—is through a breadth-first search.
A breadth-first search is performed by first searching all links that link directly to the starting page. If those links do not contain the target page (the page you are searching for), then a second level of links—pages that are linked by a page that is linked by the starting page—is searched. This process continues until either the depth limit (6 in this case) is reached or the target page is found.
A complete solution to the breadth-first search, using a table of links as described in Chapter 5, is as follows:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pymysql

conn = pymysql.connect(host='127.0.0.1', unix_socket='/tmp/mysql.sock',
                       user='root', passwd=None, db='mysql', charset='utf8')
cur = conn.cursor()
cur.execute("USE wikipedia")

class SolutionFound(RuntimeError):
    def __init__(self, message):
        self.message = message

def getLinks(fromPageId):
    cur.execute("SELECT toPageId FROM links WHERE fromPageId = %s",
                (fromPageId))
    if cur.rowcount == 0:
        return None
    else:
        return [x[0] for x in cur.fetchall()]

def constructDict(currentPageId):
    links = getLinks(currentPageId)
    if links:
        return dict(zip(links, [{}]*len(links)))
    return {}

# The link tree may either be empty or contain multiple links
def searchDepth(targetPageId, currentPageId, linkTree, depth):
    if depth == 0:
        # Stop recursing and return, regardless
        return linkTree
    if not linkTree:
        linkTree = constructDict(currentPageId)
        if not linkTree:
            # No links found. Cannot continue at this node
            return {}
    if targetPageId in linkTree.keys():
        print("TARGET " + str(targetPageId) + " FOUND!")
        raise SolutionFound("PAGE: " + str(currentPageId))

    for branchKey, branchValue in linkTree.items():
        try:
            # Recurse here to continue building the tree
            linkTree[branchKey] = searchDepth(targetPageId, branchKey,
                                              branchValue, depth - 1)
        except SolutionFound as e:
            print(e.message)
            raise SolutionFound("PAGE: " + str(currentPageId))
    return linkTree

try:
    searchDepth(134951, 1, {}, 4)
    print("No solution found")
except SolutionFound as e:
    print(e.message)
The functions getLinks
and constructDict
are helper functions that retrieve links from the database given a page, and format those links into a dictionary. The main function, searchDepth
, works recursively to simultaneously construct and search a tree of links, working one level at a time. It operates on the following rules:
The output for searching for a link between the page on Kevin Bacon (page ID 1, in my database) and the page on Eric Idle (page ID 134951 in my database) is:
TARGET 134951 FOUND!
PAGE: 156224
PAGE: 155545
PAGE: 3
PAGE: 1
This translates into the relationship of links: Kevin Bacon → San Diego Comic Con International → Brian Froud → Terry Jones → Eric Idle
In addition to solving “6 degree” problems and modeling which words tend to follow which other words in sentences, directed and undirected graphs can be used to model a variety of different situations encountered in web scraping. Which websites link to which other websites? Which research papers cite which other research papers? Which products tend to be shown with which other products on a retail site? What is the strength of this link? Is the link reciprocal?
Recognizing these fundamental types of relationships can be extremely helpful for making models, visualizations, and predictions based on scraped data.
So far, this chapter has focused primarily on the statistical analysis of words in bodies of text. Which words are most popular? Which words are unusual? Which words are likely to come after which other words? How are they grouped together? What we are missing is understanding, to the extent that we can, what the words represent.
The Natural Language Toolkit (NLTK) is a suite of Python libraries designed to identify and tag parts of speech found in natural English text. Its development began in 2000, and over the past 15 years dozens of developers around the world have contributed to the project. Although the functionality it provides is tremendous (entire books are devoted to NLTK), this section will focus on just a few of its uses.
The NLTK module can be installed in the same way as other Python modules, either by downloading the package through the NLTK website directly or by using any number of third-party installers with the keyword “nltk.” For complete installation instructions, see the NLTK website.
After installing the module it’s a good idea to download its preset text repositories so you can try out some of the features more easily. Type this on the Python command line:
>>> import nltk
>>> nltk.download()
This opens the NLTK Downloader (Figure 8-2).
I recommend installing all of the available packages. Because everything is text based, the downloads are very small; you never know what you’ll end up using, and you can easily uninstall packages at any time.
NLTK is great for generating statistical information about word counts, word frequency, and word diversity in sections of text. If all you need is a relatively straightforward calculation (e.g., the number of unique words used in a section of text), importing NLTK might be overkill—it’s a very large module. However, if you need to do relatively extensive analysis of a text, you have a number of functions at your fingertips that will give you just about any metric you want.
Analysis with NLTK always starts with the Text
object. Text
objects can be created from simple Python strings in the following way:
from
nltk
import
word_tokenize
from
nltk
import
Text
tokens
=
word_tokenize
(
"Here is some not very interesting text"
)
text
=
Text
(
tokens
)
The input for the word_tokenize
function can be any Python text string. If you don’t have any long strings handy but still want to play around with the features, NLTK has quite a few books already built into the library, which can be accessed using the import function:
from
nltk.book
import
*
This loads the nine books:
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
We will be working with text6, “Monty Python and the Holy Grail” (the screenplay for the 1975 movie) in all of the following examples.
Text objects can be manipulated much like normal Python arrays, as if they were an array containing words of the text. Using this property, you can count the number of unique words in a text and compare it against the total number of words:
>>> len(text6)/len(set(text6))
7.833333333333333
The preceding shows that each word in the script was used about eight times on average. You can also put the text into a frequency distribution object to see what some of the most common words are and the frequencies for various words:
>>> from nltk import FreqDist
>>> fdist = FreqDist(text6)
>>> fdist.most_common(10)
[(':', 1197), ('.', 816), ('!', 801), (',', 731), ("'", 421), ('[', 319),
(']', 312), ('the', 299), ('I', 255), ('ARTHUR', 225)]
>>> fdist["Grail"]
34
Because this is a screenplay, some artifacts of how it is written can pop up. For instance, “ARTHUR” in all caps crops up frequently because it appears before each of King Arthur’s lines in the script. In addition, a colon (:) appears before every single line, acting as a separator between the name of the character and the character’s line. Using this fact, we can see that there are 1,197 lines in the movie!
What we have called 2-grams in previous chapters, NLTK refers to as bigrams (from time to time you might also hear 3-grams referred to as “trigrams,” but I personally prefer n-gram rather than bigram or trigram). You can create, search, and list 2-grams extremely easily:
>>> from nltk import bigrams
>>> bigrams = bigrams(text6)
>>> bigramsDist = FreqDist(bigrams)
>>> bigramsDist[("Sir", "Robin")]
18
To search for the 2-gram “Sir Robin,” we need to break it up into the tuple ("Sir", "Robin") to match the way the 2-grams are represented in the frequency distribution. There is also a trigrams module that works in exactly the same way. For the general case, you can also import the ngrams module:
>>> from nltk import ngrams
>>> fourgrams = ngrams(text6, 4)
>>> fourgramsDist = FreqDist(fourgrams)
>>> fourgramsDist[("father", "smelt", "of", "elderberries")]
1
Here, the ngrams
function is called to break up a text object into n-grams of any size, governed by the second parameter. In this case, I’m breaking the text into 4-grams. Then, I can demonstrate that the phrase “father smelt of elderberries” occurs in the screenplay exactly once.
Frequency distributions, text objects, and n-grams also can be iterated through and operated on in a loop. The following prints out all 4-grams that begin with the word “coconut,” for instance:
from nltk.book import *
from nltk import ngrams

fourgrams = ngrams(text6, 4)
for fourgram in fourgrams:
    if fourgram[0] == "coconut":
        print(fourgram)
The NLTK library has a vast array of tools and objects designed to organize, count, sort, and measure large swaths of text. Although we’ve barely scratched the surface of their uses, most of these tools are very well designed and operate rather intuitively for someone familiar with Python.
So far, we’ve compared and categorized all the words we’ve encountered based only on the value they represent by themselves. There is no differentiation between homonyms or the context in which the words are used.
Although some people might be tempted to dismiss homonyms as rarely problematic, you might be surprised at how frequently they crop up. Most native English speakers probably don’t even register that a word is a homonym, much less consider that it might possibly be confused for another word in a different context.
“He was objective in achieving his objective of writing an objective philosophy, primarily using verbs in the objective case” is easy for humans to parse but might make a web scraper think the same word is being used four times and cause it to simply discard all the information about the meaning behind each word.
In addition to sussing out parts of speech, being able to distinguish a word being used in one way versus another might be useful. For example, you might want to look for company names made up of common English words, or analyze someone’s opinions about a company: “ACME Products is good” and “ACME Products is not bad” can have the same meaning, even though one sentence uses “good” and the other uses “bad.”
In addition to measuring language, NLTK can assist in finding meaning in the words based on context and its own very large dictionaries. At a basic level, NLTK can identify parts of speech:
>>> from nltk.book import *
>>> from nltk import word_tokenize
>>> text = word_tokenize("Strange women lying in ponds distributing swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony.")
>>> from nltk import pos_tag
>>> pos_tag(text)
[('Strange', 'NNP'), ('women', 'NNS'), ('lying', 'VBG'), ('in', 'IN'),
('ponds', 'NNS'), ('distributing', 'VBG'), ('swords', 'NNS'), ('is', 'VBZ'),
('no', 'DT'), ('basis', 'NN'), ('for', 'IN'), ('a', 'DT'), ('system', 'NN'),
('of', 'IN'), ('government', 'NN'), ('.', '.'), ('Supreme', 'NNP'),
('executive', 'NN'), ('power', 'NN'), ('derives', 'NNS'), ('from', 'IN'),
('a', 'DT'), ('mandate', 'NN'), ('from', 'IN'), ('the', 'DT'),
('masses', 'NNS'), (',', ','), ('not', 'RB'), ('from', 'IN'), ('some', 'DT'),
('farcical', 'JJ'), ('aquatic', 'JJ'), ('ceremony', 'NN'), ('.', '.')]
Each word is separated into a tuple containing the word and a tag identifying the part of speech (see the preceding Penn Treebank Tags sidebar for more information about these tags). Although this might seem like a very straightforward lookup, the complexity needed to perform the task correctly becomes apparent with the following example:
>>> text = word_tokenize("The dust was thick so he had to dust")
>>> pos_tag(text)
[('The', 'DT'), ('dust', 'NN'), ('was', 'VBD'), ('thick', 'JJ'), ('so', 'RB'),
('he', 'PRP'), ('had', 'VBD'), ('to', 'TO'), ('dust', 'VB')]
Notice that the word “dust” is used twice in the sentence: once as a noun, and again as a verb. NLTK identifies both usages correctly, based on their context in the sentence. NLTK identifies parts of speech using a context-free grammar defined by the English language. Context-free grammars are, essentially, sets of rules that define which things are allowed to follow which other things in ordered lists. In this case, they define which parts of speech are allowed to follow which other parts of speech. Whenever an ambiguous word such as “dust” is encountered, the rules of the context-free grammar are consulted, and an appropriate part of speech that follows the rules is selected.
So, what’s the point of knowing whether a word is a verb or a noun in a given context? It might be neat in a computer science research lab, but how does it help with web scraping?
A very common problem in web scraping deals with search. You might be scraping text off a site and want to be able to search it for instances of the word “google,” but only when it’s being used as a verb, not a proper noun. Or you might be looking only for instances of the company Google and don’t want to rely on people’s correct use of capitalization in order to find those instances. Here, the pos_tag
function can be extremely useful:
from nltk import word_tokenize, sent_tokenize, pos_tag

sentences = sent_tokenize("Google is one of the best companies in the world."
                          " I constantly google myself to see what I'm up to.")
nouns = ['NN', 'NNS', 'NNP', 'NNPS']

for sentence in sentences:
    if "google" in sentence.lower():
        taggedWords = pos_tag(word_tokenize(sentence))
        for word in taggedWords:
            if word[0].lower() == "google" and word[1] in nouns:
                print(sentence)
This prints only sentences that contain the word “google” (or “Google”) as some sort of a noun, not a verb. Of course, you could be more specific and demand that only instances of Google tagged with “NNP” (a proper noun) are printed, but even NLTK makes mistakes at times, and it can be good to leave yourself a little wiggle room, depending on the application.
Much of the ambiguity of natural language can be resolved using NLTK’s pos_tag
function. By searching text not just for instances of your target word or phrase but instances of your target word or phrase plus its tag, you can greatly increase the accuracy and effectiveness of your scraper’s searches.
Processing, analyzing, and understanding natural language by machine is one of the most difficult tasks in computer science, and countless volumes and research papers have been written on the subject. I hope that the coverage here will inspire you to think beyond conventional web scraping, or at least give some initial direction about where to begin when undertaking a project that requires natural language analysis.
There are many excellent resources on introductory language processing and Python’s Natural Language Toolkit. In particular, Steven Bird, Ewan Klein, and Edward Loper’s book Natural Language Processing with Python presents both a comprehensive and introductory approach to the topic.
In addition, James Pustejovsky and Amber Stubbs’ Natural Language Annotations for Machine Learning provides a slightly more advanced theoretical guide. You’ll need a knowledge of Python to implement the lessons; the topics covered work perfectly with Python’s Natural Language Toolkit.
1 Although many of the techniques described in this chapter can be applied to all or most languages, it’s okay for now to focus on natural language processing in English only. Tools such as Python’s Natural Language Toolkit, for example, focus on English. Fifty-six percent of the Internet is still in English (with German following at a mere 6%, according to http://w3techs.com/technologies/overview/content_language/all). But who knows? English’s hold on the majority of the Internet will almost certainly change in the future, and further updates may be necessary in the next few years.
2 See “A Picture Is Worth a Thousand (Coherent) Words: Building a Natural Description of Images,” November 17, 2014 (http://bit.ly/1HEJ8kX).
3 The exception is the last word in the text because nothing follows the last word. In our example text, the last word is a period (.), which is convenient because it has 215 other occurrences in the text and so does not represent a dead end. However, in real-world implementations of the Markov generator, the last word of the text might be something you need to account for.