Swapping noun cardinals

In a chunk, a cardinal word, tagged as CD, refers to a number, such as 10. These cardinals often occur before or after a noun. For normalization purposes, it can be useful to always put the cardinal before the noun.

How to do it...

The swap_noun_cardinal() function is defined in transforms.py. It swaps any cardinal that occurs immediately after a noun with the noun so that the cardinal occurs immediately before the noun. It uses a helper function, tag_equals(), which is similar to tag_startswith(), but in this case, the function it returns does an equality comparison with the given tag:

def tag_equals(tag):
  def f(wt):
    return wt[1] == tag
  return f

Now we can define swap_noun_cardinal():

def swap_noun_cardinal(chunk):
  cdidx = first_chunk_index(chunk, tag_equals('CD'))
  # cdidx must be > 0 and there must be a noun immediately before it
  if not cdidx or not chunk[cdidx-1][1].startswith('NN'):
    return chunk

  noun, nntag = chunk[cdidx-1]
  chunk[cdidx-1] = chunk[cdidx]
  chunk[cdidx] = noun, nntag
  return chunk

Let's try it on a date, such as Dec 10, and another common phrase, the top 10.

>>> swap_noun_cardinal([('Dec.', 'NNP'), ('10', 'CD')])
[('10', 'CD'), ('Dec.', 'NNP')]
>>> swap_noun_cardinal([('the', 'DT'), ('top', 'NN'), ('10', 'CD')])
[('the', 'DT'), ('10', 'CD'), ('top', 'NN')]

The result is that the numbers are now in front of the noun, creating 10 Dec and the 10 top.

How it works...

We start by looking for a CD tag in the chunk. If no CD is found, or if the CD is at the beginning of the chunk, then the chunk is returned as is. There must also be a noun immediately before the CD. If we do find a CD with a noun preceding it, then we swap the noun and cardinal.

See also

The Correcting verb forms recipe defines the first_chunk_index() function used to find tagged words in a chunk.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.220.254.4