Chunking and chinking with regular expressions

Using modified regular expressions, we can define chunk patterns. These are patterns of part-of-speech tags that define what kinds of words make up a chunk. We can also define patterns for what kinds of words should not be in a chunk. These unchunked words are known as chinks.

A ChunkRule specifies what to include in a chunk, while a ChinkRule specifies what to exclude from a chunk. In other words, chunking creates chunks, while chinking breaks up those chunks.

Getting ready

We first need to know how to define chunk patterns. These are modified regular expressions designed to match sequences of part-of-speech tags. An individual tag is specified by surrounding angle brackets, such as <NN> to match a noun tag. Multiple tags can then be combined, as in <DT><NN> to match a determiner followed by a noun. Regular expression syntax can be used within the angle brackets to match individual tag patterns, so you can do <NN.*> to match all nouns including NN and NNS. You can also use regular expression syntax outside of the angle brackets to match patterns of tags. <DT>?<NN.*>+ will match an optional determiner followed by one or more nouns. The chunk patterns are converted internally to regular expressions using the tag_pattern2re_pattern() function:

>>> from nltk.chunk import tag_pattern2re_pattern
>>> tag_pattern2re_pattern('<DT>?<NN.*>+')
'(<(DT)>)?(<(NN[^\{\}<>]*)>)+'

You don't have to use this function to do chunking, but it might be useful or interesting to see how your chunk patterns convert to regular expressions.
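As a rough illustration of what that conversion involves, here is a simplified sketch that uses only Python's standard re module. It is not NLTK's implementation: the helper name simple_tag_pattern_to_regex and the TAG_CHAR constant are hypothetical, and the sketch skips details such as pattern validation and escaped periods. It wraps each tag in regex groups, makes a period match any character allowed inside a tag, and then searches a string of tags.

```python
import re

# Assumption: a tag may contain any character except braces and angle brackets.
TAG_CHAR = r"[^{}<>]"

def simple_tag_pattern_to_regex(tag_pattern):
    """Hypothetical, simplified stand-in for tag_pattern2re_pattern()."""
    pattern = re.sub(r"\s", "", tag_pattern)             # ignore whitespace
    pattern = pattern.replace("<", "(<(").replace(">", ")>)")  # group each tag
    pattern = pattern.replace(".", TAG_CHAR)             # '.' stays inside one tag
    return pattern

regex = simple_tag_pattern_to_regex("<DT>?<NN.*>+")
tags = "<DT><NN><VBZ><JJ><NNS>"  # tags of "the book has many chapters"
matches = [m.group(0) for m in re.finditer(regex, tags)]

print(regex)    # (<(DT)>)?(<(NN[^{}<>]*)>)+
print(matches)  # ['<DT><NN>', '<NNS>']
```

Because the period can never match an angle bracket, a pattern such as <NN.*> cannot run past the end of its own tag, which is what keeps each bracketed pattern tied to a single tag.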

How to do it...

The pattern for specifying a chunk is to use surrounding curly braces, such as {<DT><NN>}. To specify a chink, you flip the braces, as in }<VB>{. These rules can be combined into a grammar for a particular phrase type. Here's a grammar for noun-phrases that combines both a chunk and a chink pattern, along with the result of parsing the sentence "The book has many chapters":

>>> from nltk.chunk import RegexpParser
>>> chunker = RegexpParser(r'''
... NP:
...    {<DT><NN.*><.*>*<NN.*>}
...    }<VB.*>{
... ''')
>>> chunker.parse([('the', 'DT'), ('book', 'NN'), ('has', 'VBZ'), ('many', 'JJ'), ('chapters', 'NNS')])
Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('NP', [('many', 'JJ'), ('chapters', 'NNS')])])

The grammar tells the RegexpParser that there are two rules for parsing NP chunks. The first chunk pattern says that a chunk starts with a determiner followed by any kind of noun. Then any number of other words is allowed, until a final noun is found. The second pattern says that verbs should be chinked, thus separating any large chunks that contain a verb. The result is a tree with two noun-phrase chunks: "the book" and "many chapters".

Note

Tagged sentences are always parsed into a Tree (found in the nltk.tree module). The top node of the Tree is 'S', which stands for sentence. Any chunks found will be subtrees whose nodes refer to the chunk type. In this case, the chunk type is 'NP' for noun-phrase. Trees can be drawn by calling the draw() method, as in t.draw().

How it works...

Here's what happens, step-by-step:

  1. The sentence is converted into a flat Tree, as shown in the following figure:
    [Figure: the flat Tree for the sentence]
  2. The Tree is used to create a ChunkString.
  3. RegexpParser parses the grammar to create an NP RegexpChunkParser with the given rules.
  4. A ChunkRule is created and applied to the ChunkString, which matches the entire sentence into a chunk, as shown in the following figure:
    [Figure: the Tree with the entire sentence in one chunk]
  5. A ChinkRule is created and applied to the same ChunkString, which splits the big chunk into two smaller chunks with a verb between them, as shown in the following figure:
    [Figure: the Tree with two chunks and a verb between them]
  6. The ChunkString is converted back to a Tree, now with two NP chunk subtrees, as shown in the following figure:
    [Figure: the final Tree with two NP chunk subtrees]

You can do this yourself using the classes in nltk.chunk.regexp. ChunkRule and ChinkRule are both subclasses of RegexpChunkRule and require two arguments: the pattern, and a description of the rule. ChunkString is an object that starts with a flat tree, which is then modified by each rule when it is passed in to the rule's apply() method. A ChunkString is converted back to a Tree with the to_chunkstruct() method. Here's the code to demonstrate it:

>>> from nltk.chunk.regexp import ChunkString, ChunkRule, ChinkRule
>>> from nltk.tree import Tree
>>> t = Tree('S', [('the', 'DT'), ('book', 'NN'), ('has', 'VBZ'), ('many', 'JJ'), ('chapters', 'NNS')])
>>> cs = ChunkString(t)
>>> cs
<ChunkString: '<DT><NN><VBZ><JJ><NNS>'>
>>> ur = ChunkRule('<DT><NN.*><.*>*<NN.*>', 'chunk determiners and nouns')
>>> ur.apply(cs)
>>> cs
<ChunkString: '{<DT><NN><VBZ><JJ><NNS>}'>
>>> ir = ChinkRule('<VB.*>', 'chink verbs')
>>> ir.apply(cs)
>>> cs
<ChunkString: '{<DT><NN>}<VBZ>{<JJ><NNS>}'>
>>> cs.to_chunkstruct()
Tree('S', [Tree('CHUNK', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('CHUNK', [('many', 'JJ'), ('chapters', 'NNS')])])

The preceding tree diagrams can be drawn at each step by calling cs.to_chunkstruct().draw().
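To see why a chink splits one chunk into two, it can help to model the string-level effect of the two rules with plain regular expressions. The following is a minimal sketch under stated assumptions, not NLTK's own code: to_regex, apply_chunk, and apply_chink are hypothetical helpers, and the lookaheads are only a loose approximation of how a rule can tell whether a match lies inside or outside an existing chunk.

```python
import re

TAG_CHAR = r"[^{}<>]"  # assumption: characters allowed inside a tag

def to_regex(tag_pattern):
    """Simplified tag-pattern conversion (hypothetical helper)."""
    pattern = re.sub(r"\s", "", tag_pattern)
    pattern = pattern.replace("<", "(<(").replace(">", ")>)")
    return pattern.replace(".", TAG_CHAR)

def apply_chunk(chunk_string, tag_pattern):
    """Wrap every match that is not already inside a chunk in braces."""
    # Lookahead: either no '}' follows, or a '{' comes before the next '}'.
    regex = "(?P<chunk>" + to_regex(tag_pattern) + r")(?=[^}]*(\{|$))"
    return re.sub(regex, lambda m: "{" + m.group("chunk") + "}", chunk_string)

def apply_chink(chunk_string, tag_pattern):
    """Cut every match that lies inside a chunk out of that chunk."""
    # Lookahead: the next brace after the match must close a chunk.
    regex = "(?P<chink>" + to_regex(tag_pattern) + r")(?=[^{]*\})"
    result = re.sub(regex, lambda m: "}" + m.group("chink") + "{", chunk_string)
    return result.replace("{}", "")  # drop chunks left empty by the cut

tags = "<DT><NN><VBZ><JJ><NNS>"
chunked = apply_chunk(tags, "<DT><NN.*><.*>*<NN.*>")
chinked = apply_chink(chunked, "<VB.*>")

print(chunked)  # {<DT><NN><VBZ><JJ><NNS>}
print(chinked)  # {<DT><NN>}<VBZ>{<JJ><NNS>}
```

The chink replacement closes the current chunk before the verb and opens a new one after it, so a single substitution is enough to produce two chunks.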

There's more...

You will notice that the subtrees from the ChunkString are tagged as 'CHUNK' and not 'NP'. That's because the previous rules are phrase agnostic; they create chunks without needing to know what kind of chunks they are.

Internally, the RegexpParser creates a RegexpChunkParser for each chunk phrase type. So if you are only chunking NP phrases, there will only be one RegexpChunkParser. The RegexpChunkParser gets all the rules for the specific chunk type, and handles applying the rules in order and converting the 'CHUNK' trees to the specific chunk type, such as 'NP'.

Here's some code to illustrate the usage of RegexpChunkParser. We pass the previous two rules into the RegexpChunkParser, and then parse the same sentence tree we created before. The resulting tree is just like what we got from applying both rules in order, except 'CHUNK' has been replaced with 'NP' in the two subtrees. This is because RegexpChunkParser defaults to chunk_node='NP'.

>>> from nltk.chunk import RegexpChunkParser
>>> chunker = RegexpChunkParser([ur, ir])
>>> chunker.parse(t)
Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('NP', [('many', 'JJ'), ('chapters', 'NNS')])])
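The relabeling step can be pictured as grouping the tagged words according to the braces in the chunk string and attaching a label to each group. The sketch below is a standard-library approximation only: to_chunk_structure and its label parameter are hypothetical names, and plain tuples and lists stand in for NLTK's Tree objects.

```python
import re

def to_chunk_structure(tagged_words, chunk_string, label="CHUNK"):
    """Group tagged words by the braces in chunk_string (hypothetical helper)."""
    # Each piece is either a whole chunk "{...}" or a single unchunked tag "<...>".
    pieces = re.findall(r"\{[^{}]*\}|<[^<>]*>", chunk_string)
    result, index = [], 0
    for piece in pieces:
        count = piece.count("<")  # number of tags covered by this piece
        words = tagged_words[index:index + count]
        index += count
        if piece.startswith("{"):
            result.append((label, words))  # a labeled chunk
        else:
            result.extend(words)           # a word outside any chunk
    return result

sentence = [("the", "DT"), ("book", "NN"), ("has", "VBZ"),
            ("many", "JJ"), ("chapters", "NNS")]
chunk_string = "{<DT><NN>}<VBZ>{<JJ><NNS>}"

generic = to_chunk_structure(sentence, chunk_string)
noun_phrases = to_chunk_structure(sentence, chunk_string, label="NP")
print(generic)
print(noun_phrases)
```

The rules only ever produce braces, so the same chunk string can be turned into 'CHUNK' groups or 'NP' groups. The label is chosen when the structure is built, not when the rules are applied.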

Different chunk types

If you wanted to parse a different chunk type, then you could pass that in as chunk_node to RegexpChunkParser (in NLTK 3, this keyword argument is named chunk_label). Here's the same code we have just seen, but instead of 'NP' subtrees, we will call them 'CP' for custom phrase.

>>> from nltk.chunk import RegexpChunkParser
>>> chunker = RegexpChunkParser([ur, ir], chunk_node='CP')
>>> chunker.parse(t)
Tree('S', [Tree('CP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('CP', [('many', 'JJ'), ('chapters', 'NNS')])])

RegexpParser does this internally when you specify multiple phrase types. This will be covered in Partial parsing with regular expressions.

Alternative patterns

The same parsing results can be obtained by using two chunk patterns in the grammar, and discarding the chink pattern:

>>> chunker = RegexpParser(r'''
... NP:
...    {<DT><NN.*>}
...    {<JJ><NN.*>}
... ''')
>>> chunker.parse(t)
Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('NP', [('many', 'JJ'), ('chapters', 'NNS')])])

In fact, you could reduce the two chunk patterns into a single pattern.

>>> chunker = RegexpParser(r'''
... NP:
...    {(<DT>|<JJ>)<NN.*>}
... ''')
>>> chunker.parse(t)
Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('NP', [('many', 'JJ'), ('chapters', 'NNS')])])

How you create and combine patterns is really up to you. Pattern creation is a process of trial and error, and entirely depends on what your data looks like and which patterns are easiest to express.
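One way to make that trial and error less tedious is to run several candidate pattern sets over the same tags and compare the results with the chunking you want. The harness below is a sketch built on assumptions: it uses a simplified, regex-only model of chunk rules rather than NLTK itself, and to_regex, apply_chunk, and the candidate names are all hypothetical.

```python
import re

TAG_CHAR = r"[^{}<>]"  # assumption: characters allowed inside a tag

def to_regex(tag_pattern):
    """Simplified tag-pattern conversion (hypothetical helper)."""
    pattern = re.sub(r"\s", "", tag_pattern)
    pattern = pattern.replace("<", "(<(").replace(">", ")>)")
    return pattern.replace(".", TAG_CHAR)

def apply_chunk(chunk_string, tag_pattern):
    """Wrap every match that is not already inside a chunk in braces."""
    regex = "(?P<chunk>" + to_regex(tag_pattern) + r")(?=[^}]*(\{|$))"
    return re.sub(regex, lambda m: "{" + m.group("chunk") + "}", chunk_string)

tags = "<DT><NN><VBZ><JJ><NNS>"
expected = "{<DT><NN>}<VBZ>{<JJ><NNS>}"

# Each candidate is a list of chunk patterns applied in order.
candidates = {
    "two patterns": ["<DT><NN.*>", "<JJ><NN.*>"],
    "one pattern": ["(<DT>|<JJ>)<NN.*>"],
    "too narrow": ["<DT><NN>"],
}

results = {}
for name, patterns in candidates.items():
    chunk_string = tags
    for pattern in patterns:
        chunk_string = apply_chunk(chunk_string, pattern)
    results[name] = chunk_string
    print(name, chunk_string, chunk_string == expected)
```

In this model the first two candidates reach the expected chunking, while the third misses "many chapters" because it only accepts a determiner followed by a singular noun tag.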

Chunk rule with context

You can also create chunk rules with a surrounding tag context. For example, the pattern <DT>{<NN>} will be parsed into a ChunkRuleWithContext. Any time there's a tag on either side of the curly braces, you will get a ChunkRuleWithContext instead of a ChunkRule. This allows you to be more specific about when to parse particular kinds of chunks.

Here's an example of using ChunkRuleWithContext directly. It takes four arguments: the left context, the pattern to chunk, the right context, and a description:

>>> from nltk.chunk.regexp import ChunkRuleWithContext
>>> ctx = ChunkRuleWithContext('<DT>', '<NN.*>', '<.*>', 'chunk nouns only after determiners')
>>> cs = ChunkString(t)
>>> cs
<ChunkString: '<DT><NN><VBZ><JJ><NNS>'>
>>> ctx.apply(cs)
>>> cs
<ChunkString: '<DT>{<NN>}<VBZ><JJ><NNS>'>
>>> cs.to_chunkstruct()
Tree('S', [('the', 'DT'), Tree('CHUNK', [('book', 'NN')]), ('has', 'VBZ'), ('many', 'JJ'), ('chapters', 'NNS')])

This example only chunks nouns that follow a determiner, therefore ignoring the noun that follows an adjective. Here's how it would look using the RegexpParser:

>>> chunker = RegexpParser(r'''
... NP:
...    <DT>{<NN.*>}
... ''')
>>> chunker.parse(t)
Tree('S', [('the', 'DT'), Tree('NP', [('book', 'NN')]), ('has', 'VBZ'), ('many', 'JJ'), ('chapters', 'NNS')])
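The effect of context can also be modeled at the string level: the left and right contexts must match, but only the middle part ends up inside the braces. This is a simplified sketch using only the standard library, not NLTK's implementation, and the helper names to_regex and apply_chunk_with_context are hypothetical.

```python
import re

TAG_CHAR = r"[^{}<>]"  # assumption: characters allowed inside a tag

def to_regex(tag_pattern):
    """Simplified tag-pattern conversion (hypothetical helper)."""
    pattern = re.sub(r"\s", "", tag_pattern)
    pattern = pattern.replace("<", "(<(").replace(">", ")>)")
    return pattern.replace(".", TAG_CHAR)

def apply_chunk_with_context(chunk_string, left, chunk, right):
    """Chunk only the middle pattern; the contexts stay outside the braces."""
    regex = ("(?P<left>" + to_regex(left) + ")"
             + "(?P<chunk>" + to_regex(chunk) + ")"
             + "(?P<right>" + to_regex(right) + ")"
             + r"(?=[^}]*(\{|$))")  # skip matches already inside a chunk

    def rebuild(match):
        return (match.group("left") + "{" + match.group("chunk") + "}"
                + match.group("right"))

    return re.sub(regex, rebuild, chunk_string)

tags = "<DT><NN><VBZ><JJ><NNS>"
after_determiner = apply_chunk_with_context(tags, "<DT>", "<NN.*>", "")
after_adjective = apply_chunk_with_context(tags, "<JJ>", "<NN.*>", "")

print(after_determiner)  # <DT>{<NN>}<VBZ><JJ><NNS>
print(after_adjective)   # <DT><NN><VBZ><JJ>{<NNS>}
```

Swapping the left context from a determiner to an adjective selects the other noun, which shows how context narrows a chunk pattern without adding the context tags to the chunk.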

See also

In the next recipe, we will cover merging and splitting chunks.
