Using modified regular expressions, we can define chunk patterns. These are patterns of part-of-speech tags that define what kinds of words make up a chunk. We can also define patterns for what kinds of words should not be in a chunk. These unchunked words are known as chinks.
A ChunkRule specifies what to include in a chunk, while a ChinkRule specifies what to exclude from a chunk. In other words, chunking creates chunks, while chinking breaks up those chunks.
We first need to know how to define chunk patterns. These are modified regular expressions designed to match sequences of part-of-speech tags. An individual tag is specified by surrounding angle brackets, such as <NN> to match a noun tag. Multiple tags can then be combined, as in <DT><NN> to match a determiner followed by a noun. Regular expression syntax can be used within the angle brackets to match individual tag patterns, so you can use <NN.*> to match all nouns, including NN and NNS. You can also use regular expression syntax outside of the angle brackets to match patterns of tags: <DT>?<NN.*>+ will match an optional determiner followed by one or more nouns. The chunk patterns are converted internally to regular expressions using the tag_pattern2re_pattern() function:
>>> from nltk.chunk import tag_pattern2re_pattern
>>> tag_pattern2re_pattern('<DT>?<NN.*>+')
'(<(DT)>)?(<(NN[^\{\}<>]*)>)+'
You don't have to use this function to do chunking, but it might be useful or interesting to see how your chunk patterns convert to regular expressions.
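To make the conversion more concrete, here is a minimal sketch of the same idea using only Python's re module. The function simple_tag_pattern_to_regex is a hypothetical, simplified stand-in written for illustration. It is not NLTK's implementation, and it skips the validation and escaping that tag_pattern2re_pattern() performs.

```python
import re

def simple_tag_pattern_to_regex(tag_pattern):
    """Hypothetical, simplified stand-in for tag_pattern2re_pattern()."""
    # Whitespace in a tag pattern is not significant, so drop it.
    pattern = re.sub(r"\s", "", tag_pattern)
    # Wrap every <...> tag in groups so that a quantifier written after
    # the tag (such as ? or +) applies to the whole tag.
    pattern = pattern.replace("<", "(<(").replace(">", ")>)")
    # A '.' in a tag pattern may match any character except the ones
    # that delimit tags and chunks.
    pattern = pattern.replace(".", "[^{}<>]")
    return pattern

regex = simple_tag_pattern_to_regex("<DT>?<NN.*>+")

# A tagged sentence can be flattened to a string of tags and matched directly.
tags = "<DT><NN><NNS><VBZ>"
matched = re.match(regex, tags).group(0)
print(regex)
print(matched)
```

The match stops before <VBZ>, since the pattern only allows an optional determiner followed by noun tags.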
The pattern for specifying a chunk is to use surrounding curly braces, such as {<DT><NN>}. To specify a chink, you flip the braces, as in }<VB>{. These rules can be combined into a grammar for a particular phrase type. Here's a grammar for noun-phrases that combines both a chunk and a chink pattern, along with the result of parsing the sentence "The book has many chapters":
>>> from nltk.chunk import RegexpParser
>>> chunker = RegexpParser(r'''
... NP:
...     {<DT><NN.*><.*>*<NN.*>}
...     }<VB.*>{
... ''')
>>> chunker.parse([('the', 'DT'), ('book', 'NN'), ('has', 'VBZ'), ('many', 'JJ'), ('chapters', 'NNS')])
Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('NP', [('many', 'JJ'), ('chapters', 'NNS')])])
The grammar tells the RegexpParser that there are two rules for parsing NP chunks. The first chunk pattern says that a chunk starts with a determiner followed by any kind of noun. Then any number of other words is allowed, until a final noun is found. The second pattern says that verbs should be chinked, thus separating any large chunks that contain a verb. The result is a tree with two noun-phrase chunks: "the book" and "many chapters".
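The interplay of the two rules can be sketched on a plain string of tags, without NLTK. The helpers to_regex, chunk, and chink below are hypothetical names invented for this illustration, and the approach (regular expression substitution on a braced tag string) is an assumption about how such rules could work, not a copy of NLTK's code.

```python
import re

def to_regex(tag_pattern):
    # Wrap each <...> tag in a non-capturing group so that quantifiers such
    # as * or ? placed after a tag apply to the whole tag.
    pattern = tag_pattern.replace("<", "(?:<(?:").replace(">", ")>)")
    # Inside a tag pattern, '.' must not run across tag or chunk boundaries.
    return pattern.replace(".", "[^{}<>]")

def chunk(tag_string, tag_pattern):
    """Wrap every match of tag_pattern in curly braces."""
    return re.sub(to_regex(tag_pattern), r"{\g<0>}", tag_string)

def chink(tag_string, tag_pattern):
    """Cut every match of tag_pattern out of the chunk that contains it."""
    # The lookahead only accepts matches whose next brace closes a chunk,
    # which means the match sits inside a chunk.
    inside_chunk = "(?:" + to_regex(tag_pattern) + r")(?=[^{}]*\})"
    split = re.sub(inside_chunk, r"}\g<0>{", tag_string)
    # Chinking at the edge of a chunk leaves an empty chunk behind; drop it.
    return split.replace("{}", "")

tags = "<DT><NN><VBZ><JJ><NNS>"
chunked = chunk(tags, "<DT><NN.*><.*>*<NN.*>")
chinked = chink(chunked, "<VB.*>")
print(chunked)
print(chinked)
```

The chunk pattern first swallows the whole sentence, and the chink pattern then carves the verb back out, leaving two chunks.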
Tagged sentences are always parsed into a Tree (found in the nltk.tree module). The top node of the Tree is 'S', which stands for sentence. Any chunks found will be subtrees whose nodes will refer to the chunk type. In this case, the chunk type is 'NP' for noun-phrase. Trees can be drawn by calling the draw() method, as in t.draw().
Here's what happens, step-by-step:
1. The sentence is converted into a flat Tree, as shown in the following figure:
2. The Tree is used to create a ChunkString.
3. RegexpParser parses the grammar to create an NP RegexpChunkParser with the given rules.
4. A ChunkRule is created and applied to the ChunkString, which matches the entire sentence into a chunk, as shown in the following figure:
5. A ChinkRule is created and applied to the same ChunkString, which splits the big chunk into two smaller chunks with a verb between them, as shown in the following figure:
6. The ChunkString is converted back to a Tree, now with two NP chunk subtrees, as shown in the following figure:
You can do this yourself using the classes in nltk.chunk.regexp. ChunkRule and ChinkRule are both subclasses of RegexpChunkRule and require two arguments: the pattern, and a description of the rule. ChunkString is an object that starts with a flat tree, which is then modified by each rule when it is passed in to the rule's apply() method. A ChunkString is converted back to a Tree with the to_chunkstruct() method. Here's the code to demonstrate it:
>>> from nltk.chunk.regexp import ChunkString, ChunkRule, ChinkRule
>>> from nltk.tree import Tree
>>> t = Tree('S', [('the', 'DT'), ('book', 'NN'), ('has', 'VBZ'), ('many', 'JJ'), ('chapters', 'NNS')])
>>> cs = ChunkString(t)
>>> cs
<ChunkString: '<DT><NN><VBZ><JJ><NNS>'>
>>> ur = ChunkRule('<DT><NN.*><.*>*<NN.*>', 'chunk determiners and nouns')
>>> ur.apply(cs)
>>> cs
<ChunkString: '{<DT><NN><VBZ><JJ><NNS>}'>
>>> ir = ChinkRule('<VB.*>', 'chink verbs')
>>> ir.apply(cs)
>>> cs
<ChunkString: '{<DT><NN>}<VBZ>{<JJ><NNS>}'>
>>> cs.to_chunkstruct()
Tree('S', [Tree('CHUNK', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('CHUNK', [('many', 'JJ'), ('chapters', 'NNS')])])
The preceding tree diagrams can be drawn at each step by calling cs.to_chunkstruct().draw().
You will notice that the subtrees from the ChunkString are tagged as 'CHUNK' and not 'NP'. That's because the previous rules are phrase agnostic; they create chunks without needing to know what kind of chunks they are.
Internally, the RegexpParser creates a RegexpChunkParser for each chunk phrase type. So if you are only chunking NP phrases, there will only be one RegexpChunkParser. The RegexpChunkParser gets all the rules for the specific chunk type, and handles applying the rules in order and converting the 'CHUNK' trees to the specific chunk type, such as 'NP'.
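The relabeling step can be illustrated with a small stand-alone sketch. The function to_tree and its parameters chunk_label and root_label are hypothetical names chosen for this example, and the nested (label, children) tuples are a simplified substitute for NLTK's Tree class. The sketch assumes the braced tag string has exactly one tag per token, in order.

```python
import re

def to_tree(tagged_tokens, chunk_string, chunk_label="CHUNK", root_label="S"):
    """Rebuild a nested (label, children) structure from a braced tag string."""
    children = []
    tokens = iter(tagged_tokens)
    current = None  # collects the tokens of the currently open chunk, if any
    for piece in re.findall(r"[{}]|<[^<>]+>", chunk_string):
        if piece == "{":
            current = []
        elif piece == "}":
            # Close the chunk and label it with the requested phrase type.
            children.append((chunk_label, current))
            current = None
        else:
            # Each tag in the string corresponds to the next token.
            token = next(tokens)
            (children if current is None else current).append(token)
    return (root_label, children)

sentence = [("the", "DT"), ("book", "NN"), ("has", "VBZ"),
            ("many", "JJ"), ("chapters", "NNS")]
generic = to_tree(sentence, "{<DT><NN>}<VBZ>{<JJ><NNS>}")
noun_phrases = to_tree(sentence, "{<DT><NN>}<VBZ>{<JJ><NNS>}", chunk_label="NP")
print(generic)
print(noun_phrases)
```

The rules that produced the braces never see the label; it is only attached when the string is turned back into a tree, which is why the same rules can serve any phrase type.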
Here's some code to illustrate the usage of RegexpChunkParser. We pass the previous two rules into the RegexpChunkParser, and then parse the same sentence tree we created before. The resulting tree is just like what we got from applying both rules in order, except 'CHUNK' has been replaced with 'NP' in the two subtrees. This is because RegexpChunkParser defaults to chunk_node='NP' (NLTK 3 renamed this keyword to chunk_label).
>>> from nltk.chunk import RegexpChunkParser
>>> chunker = RegexpChunkParser([ur, ir])
>>> chunker.parse(t)
Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('NP', [('many', 'JJ'), ('chapters', 'NNS')])])
If you wanted to parse a different chunk type, then you could pass that in as chunk_node to RegexpChunkParser. Here's the same code we have just seen, but instead of 'NP' subtrees, we will call them 'CP' for custom phrase.
>>> from nltk.chunk import RegexpChunkParser
>>> chunker = RegexpChunkParser([ur, ir], chunk_node='CP')
>>> chunker.parse(t)
Tree('S', [Tree('CP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('CP', [('many', 'JJ'), ('chapters', 'NNS')])])
RegexpParser does this internally when you specify multiple phrase types. This will be covered in Partial parsing with regular expressions.
The same parsing results can be obtained by using two chunk patterns in the grammar, and discarding the chink pattern:
>>> chunker = RegexpParser(r'''
... NP:
...     {<DT><NN.*>}
...     {<JJ><NN.*>}
... ''')
>>> chunker.parse(t)
Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('NP', [('many', 'JJ'), ('chapters', 'NNS')])])
In fact, you could reduce the two chunk patterns into a single pattern.
>>> chunker = RegexpParser(r'''
... NP:
...     {(<DT>|<JJ>)<NN.*>}
... ''')
>>> chunker.parse(t)
Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('NP', [('many', 'JJ'), ('chapters', 'NNS')])])
How you create and combine patterns is really up to you. Pattern creation is a process of trial and error, and entirely depends on what your data looks like and which patterns are easiest to express.
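When experimenting with alternative patterns, it can help to compare what each candidate chunks on the same tag sequence. The following sketch does that with plain regular expressions. The helpers to_regex and chunk_spans are hypothetical and only approximate the behavior of chunk rules; in particular, the lookahead used to avoid re-chunking existing chunks is an assumption made for this example.

```python
import re

def to_regex(tag_pattern):
    # Group each <...> tag so quantifiers and alternation act on whole tags,
    # and keep '.' from matching across tag or chunk delimiters.
    pattern = tag_pattern.replace("<", "(?:<(?:").replace(">", ")>)")
    return pattern.replace(".", "[^{}<>]")

def chunk_spans(tag_string, tag_patterns):
    """Return the tag sequences chunked by applying the patterns in order."""
    for tag_pattern in tag_patterns:
        # Only match outside existing chunks: the next brace after the
        # match, if there is one, must open a chunk rather than close one.
        regex = "(?:" + to_regex(tag_pattern) + r")(?=[^{}]*(?:\{|$))"
        tag_string = re.sub(regex, r"{\g<0>}", tag_string)
    return re.findall(r"\{([^{}]*)\}", tag_string)

tags = "<DT><NN><VBZ><JJ><NNS>"
two_patterns = chunk_spans(tags, ["<DT><NN.*>", "<JJ><NN.*>"])
one_pattern = chunk_spans(tags, ["(<DT>|<JJ>)<NN.*>"])
print(two_patterns)
print(one_pattern)
```

Both variants chunk the same two spans on this sentence, so the choice between them comes down to which is easier to read and maintain.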
You can also create chunk rules with a surrounding tag context. For example, the pattern <DT>{<NN>} will be parsed into a ChunkRuleWithContext. Any time there's a tag on either side of the curly braces, you will get a ChunkRuleWithContext instead of a ChunkRule. This can allow you to be more specific about when to parse particular kinds of chunks.
Here's an example of using ChunkRuleWithContext directly. It takes four arguments: the left context, the pattern to chunk, the right context, and a description:
>>> from nltk.chunk.regexp import ChunkRuleWithContext
>>> ctx = ChunkRuleWithContext('<DT>', '<NN.*>', '<.*>', 'chunk nouns only after determiners')
>>> cs = ChunkString(t)
>>> cs
<ChunkString: '<DT><NN><VBZ><JJ><NNS>'>
>>> ctx.apply(cs)
>>> cs
<ChunkString: '<DT>{<NN>}<VBZ><JJ><NNS>'>
>>> cs.to_chunkstruct()
Tree('S', [('the', 'DT'), Tree('CHUNK', [('book', 'NN')]), ('has', 'VBZ'), ('many', 'JJ'), ('chapters', 'NNS')])
This example only chunks nouns that follow a determiner, therefore ignoring the noun that follows an adjective. Here's how it would look using the RegexpParser:
>>> chunker = RegexpParser(r'''
... NP:
...     <DT>{<NN.*>}
... ''')
>>> chunker.parse(t)
Tree('S', [('the', 'DT'), Tree('NP', [('book', 'NN')]), ('has', 'VBZ'), ('many', 'JJ'), ('chapters', 'NNS')])
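The effect of a context rule can also be sketched with ordinary regular expressions: the left context is matched but left outside the braces, and the right context is only checked, not chunked. The helper names to_regex and chunk_with_context are hypothetical, and the use of a lookahead for the right context is a design choice made for this sketch, not a statement about how NLTK implements ChunkRuleWithContext.

```python
import re

def to_regex(tag_pattern):
    # Group each <...> tag and keep '.' within a single tag.
    pattern = tag_pattern.replace("<", "(?:<(?:").replace(">", ")>)")
    return pattern.replace(".", "[^{}<>]")

def chunk_with_context(tag_string, left, chunk, right):
    """Chunk `chunk` only where it is preceded by `left` and followed by `right`."""
    regex = (
        "(?P<left>" + to_regex(left) + ")"
        "(?P<chunk>" + to_regex(chunk) + ")"
        "(?=" + to_regex(right) + ")"  # right context is checked, not consumed
    )
    # Re-emit the left context unchanged and wrap only the chunk in braces.
    return re.sub(regex, r"\g<left>{\g<chunk>}", tag_string)

tags = "<DT><NN><VBZ><JJ><NNS>"
with_context = chunk_with_context(tags, "<DT>", "<NN.*>", "<.*>")
print(with_context)
```

Only the noun after the determiner is chunked. The plural noun is skipped because its left neighbor is an adjective, matching the behavior described for the context rule.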