Using regular expressions for NER

Regular expressions can be used to identify entities in a document. We will investigate two general approaches:

  • The first one uses regular expressions as supported by Java. These can be useful in situations where the entities are relatively simple and consistent in their form.
  • The second approach uses classes designed to specifically use regular expressions. To demonstrate this, we will use LingPipe's RegExChunker class.

When working with regular expressions, it is advantageous to avoid reinventing the wheel. There are many sources for predefined and tested expressions. One such library can be found at http://regexlib.com/Default.aspx. We will use several of the regular expressions in this library for our examples.

To test how well these approaches work, we will use the following text for most of our examples:

private static String regularExpressionText
    = "He left his email address ([email protected]) and his "
    + "phone number,800-555-1234. We believe his current address "
    + "is 100 Washington Place, Seattle, CO 12345-1234. I "
    + "understand you can also call at 123-555-1234 between "
    + "8:00 AM and 4:30 most days. His URL is http://example.com "
    + "and he was born on February 25, 1954 or 2/25/1954.";

Using Java's regular expressions to find entities

To demonstrate how these expressions can be used, we will start with several simple examples. The initial example starts with the following declaration. It is a simple expression designed to identify certain types of phone numbers:

String phoneNumberRE = "\\d{3}-\\d{3}-\\d{4}";

We will use the following code to test our simple expressions. The Pattern class's compile method takes a regular expression and compiles it into a Pattern object. The Pattern object's matcher method is then called with the target text and returns a Matcher object, which allows us to repeatedly identify regular expression matches. Both classes are in the java.util.regex package:

Pattern pattern = Pattern.compile(phoneNumberRE);
Matcher matcher = pattern.matcher(regularExpressionText);
while (matcher.find()) {
    System.out.println(matcher.group() + " [" + matcher.start()
        + ":" + matcher.end() + "]");
}

The Matcher's find method returns true each time it locates another match. Its group method returns the text that matched the expression. Its start and end methods give the position of the matched text in the target text, where end is the index just past the last matched character.

When executed, we will get the following output:

800-555-1234 [68:80]
123-555-1234 [196:208]
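
For readers who want to run the example directly, the fragments above can be assembled into a single class. This is only a minimal sketch: the class name PhoneNumberFinder and the helper findMatches are illustrative names that do not appear in the original code, and the sample string is a shortened stand-in for regularExpressionText, so the offsets differ from those shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class; the name is an assumption for illustration.
public class PhoneNumberFinder {
    // Backslashes are doubled because this is a Java string literal.
    public static final String PHONE_NUMBER_RE = "\\d{3}-\\d{3}-\\d{4}";

    // Returns each match formatted as "text [start:end]".
    public static List<String> findMatches(String regex, String text) {
        List<String> results = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            results.add(matcher.group() + " [" + matcher.start()
                + ":" + matcher.end() + "]");
        }
        return results;
    }

    public static void main(String[] args) {
        // Shortened sample text, not the chapter's full test string.
        String text = "Call 800-555-1234 or 123-555-1234.";
        for (String match : findMatches(PHONE_NUMBER_RE, text)) {
            System.out.println(match);
        }
    }
}
```

With this sample text, the program prints 800-555-1234 [5:17] followed by 123-555-1234 [21:33].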

A number of other regular expressions can be used in a similar manner. These are listed in the following table, which gives, for each entity type, the regular expression and the output produced when that expression is used in the previous code sequence. The expressions are shown in raw form; when one is placed in a Java string literal, each backslash must be doubled:

Entity type: URL
Regular expression: \b(https?|ftp|file|ldap)://[-A-Za-z0-9+&@#/%?=~_|!:,.;]*[-A-Za-z0-9+&@#/%=~_|]
Output: http://example.com [256:274]

Entity type: ZIP code
Regular expression: [0-9]{5}(\-?[0-9]{4})?
Output: 12345-1234 [150:160]

Entity type: E-mail
Regular expression: [a-zA-Z0-9'._%+-]+@(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,4}
Output: [email protected] [27:45]

Entity type: Time
Regular expression: (([0-1]?[0-9])|([2][0-3])):([0-5]?[0-9])(:([0-5]?[0-9]))?
Output: 8:00 [217:221]
        4:30 [229:233]

Entity type: Date
Regular expression: ((0?[13578]|10|12)(-|\/)(([1-9])|(0[1-9])|([12])([0-9]?)|(3[01]?))(-|\/)((19)([2-9])(\d{1})|(20)([01])(\d{1})|([8901])(\d{1}))|(0?[2469]|11)(-|\/)(([1-9])|(0[1-9])|([12])([0-9]?)|(3[0]?))(-|\/)((19)([2-9])(\d{1})|(20)([01])(\d{1})|([8901])(\d{1})))
Output: 2/25/1954 [315:324]

There are many other regular expressions that we could have used. However, these examples illustrate the basic technique. As demonstrated with the date regular expression, some of these can be quite complex.
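
One way to keep such expressions manageable is to let a loose regular expression find candidates and leave validation to a library. The following sketch does this for dates using java.time. It is an alternative technique, not the approach taken by the date expression above: it only handles the M/D/YYYY form with slashes, it does not restrict the range of years, and the names DateCandidateFinder and findDates are assumptions made for illustration.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; class and method names are assumptions.
public class DateCandidateFinder {
    // Loose pattern: finds anything shaped like M/D/YYYY.
    private static final Pattern CANDIDATE =
        Pattern.compile("\\b\\d{1,2}/\\d{1,2}/\\d{4}\\b");

    // STRICT resolution rejects impossible dates such as 2/30/1954.
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("M/d/uuuu")
            .withResolverStyle(ResolverStyle.STRICT);

    // Returns only the candidates that are real calendar dates.
    public static List<LocalDate> findDates(String text) {
        List<LocalDate> dates = new ArrayList<>();
        Matcher matcher = CANDIDATE.matcher(text);
        while (matcher.find()) {
            try {
                dates.add(LocalDate.parse(matcher.group(), FORMAT));
            } catch (DateTimeParseException e) {
                // Shaped like a date but not a valid one; skip it.
            }
        }
        return dates;
    }

    public static void main(String[] args) {
        System.out.println(findDates("born on 2/25/1954, not 2/30/1954"));
    }
}
```

The trade-off is that the regular expression no longer documents which dates are valid; that knowledge moves into the formatter.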

It is common for regular expressions to miss some entities and to falsely report non-entities as entities. For example, suppose we replace the text with the following string:

regularExpressionText = 
    "(888)555-1111 888-SEL-HIGH 888-555-2222-J88-W3S";

Executing the code will return this:

888-555-2222 [27:39]

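It missed the first two phone numbers, which use parentheses and letters rather than the digits-and-hyphens form the expression expects, and falsely reported the leading digits of the "part number" as a phone number.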
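
A tighter expression can reduce both kinds of error. The sketch below is one possible refinement, not a complete phone number pattern: it accepts an area code written either as 888- or as (888), and it uses lookbehind and lookahead so that a match cannot be embedded in a longer hyphenated token. The class name StricterPhoneFinder and the method findPhones are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; one possible refinement, not a full solution.
public class StricterPhoneFinder {
    public static final String PHONE_RE =
        "(?<![\\w-])"                       // not preceded by a word char or hyphen
        + "(?:\\(\\d{3}\\)\\s?|\\d{3}-)"    // area code as (888) or 888-
        + "\\d{3}-\\d{4}"                   // exchange and line number
        + "(?![\\w-])";                     // not followed by a word char or hyphen

    // Returns the text of every match of PHONE_RE in the given text.
    public static List<String> findPhones(String text) {
        List<String> results = new ArrayList<>();
        Matcher matcher = Pattern.compile(PHONE_RE).matcher(text);
        while (matcher.find()) {
            results.add(matcher.group());
        }
        return results;
    }

    public static void main(String[] args) {
        String text = "(888)555-1111 888-SEL-HIGH 888-555-2222-J88-W3S";
        System.out.println(findPhones(text));
    }
}
```

On the replacement text this reports only (888)555-1111. The part number is no longer matched, but the lettered number 888-SEL-HIGH is still missed, and a genuine phone number that happens to be followed by a hyphen would now be rejected. Each refinement trades one kind of error for another.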

We can also search for more than one regular expression at a time using the | operator. In the following statement, three regular expressions are combined using this operator. They are declared using the corresponding entries in the previous table:

Pattern pattern = Pattern.compile(phoneNumberRE + "|" 
    + timeRE + "|" + emailRegEx);

When executed using the original regularExpressionText string defined earlier, we get the following output:

[email protected] [27:45]
800-555-1234 [68:80]
123-555-1234 [196:208]
8:00 [217:221]
4:30 [229:233]
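
Note that the combined output does not say which of the three expressions produced each match. One way to recover that information is to wrap each alternative in a named capturing group, which Java has supported since version 7. The following is a sketch under that assumption; the class name TypedEntityFinder, the group names, and the sample text (including its e-mail address) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class; group names and sample text are assumptions.
public class TypedEntityFinder {
    static final String PHONE_RE = "\\d{3}-\\d{3}-\\d{4}";
    static final String TIME_RE =
        "(([0-1]?[0-9])|([2][0-3])):([0-5]?[0-9])(:([0-5]?[0-9]))?";
    static final String EMAIL_RE =
        "[a-zA-Z0-9'._%+-]+@(?:[a-zA-Z0-9-]+\\.)+[a-zA-Z]{2,4}";

    // Each alternative is wrapped in a named group.
    static final Pattern COMBINED = Pattern.compile(
        "(?<phone>" + PHONE_RE + ")"
        + "|(?<time>" + TIME_RE + ")"
        + "|(?<email>" + EMAIL_RE + ")");

    static final String[] TYPES = {"phone", "time", "email"};

    // Returns "type: text" for each match, in order of appearance.
    public static List<String> findTyped(String text) {
        List<String> results = new ArrayList<>();
        Matcher matcher = COMBINED.matcher(text);
        while (matcher.find()) {
            for (String type : TYPES) {
                // group(name) is null for alternatives that did not match.
                if (matcher.group(type) != null) {
                    results.add(type + ": " + matcher.group(type));
                    break;
                }
            }
        }
        return results;
    }

    public static void main(String[] args) {
        String text = "Call 800-555-1234 at 8:00 or write jane@example.org";
        for (String entity : findTyped(text)) {
            System.out.println(entity);
        }
    }
}
```

For the sample text this prints phone: 800-555-1234, time: 8:00, and email: jane@example.org, one per line.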

Using LingPipe's RegExChunker class

The RegExChunker class uses chunks to find entities in text. The class uses a regular expression to represent an entity. Its chunk method returns a Chunking object that can be used as we did in our earlier examples.

The RegExChunker class' constructor takes three arguments:

  • String: This is a regular expression
  • String: This is a type of entity or category
  • double: This is the score assigned to each chunk that is found
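
To make these three arguments concrete without requiring LingPipe on the classpath, here is a plain-Java sketch of the idea behind a regular expression based chunker. It is not LingPipe's implementation: SimpleRegexChunker and SimpleChunk are hypothetical names, only java.util.regex is used, and the sample text is a shortened stand-in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in that mimics the idea; NOT LingPipe's RegExChunker.
public class SimpleRegexChunker {
    // Minimal value object: span offsets plus a type label and a score.
    public static final class SimpleChunk {
        public final int start;
        public final int end;
        public final String type;
        public final double score;

        SimpleChunk(int start, int end, String type, double score) {
            this.start = start;
            this.end = end;
            this.type = type;
            this.score = score;
        }
    }

    private final Pattern pattern;
    private final String type;
    private final double score;

    // Same three pieces of information: expression, entity type, score.
    public SimpleRegexChunker(String regex, String type, double score) {
        this.pattern = Pattern.compile(regex);
        this.type = type;
        this.score = score;
    }

    // Every match becomes a chunk with the fixed type and score.
    public List<SimpleChunk> chunk(String text) {
        List<SimpleChunk> chunks = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            chunks.add(new SimpleChunk(
                matcher.start(), matcher.end(), type, score));
        }
        return chunks;
    }

    public static void main(String[] args) {
        String timeRE =
            "(([0-1]?[0-9])|([2][0-3])):([0-5]?[0-9])(:([0-5]?[0-9]))?";
        String text = "Call between 8:00 AM and 4:30 most days.";
        SimpleRegexChunker chunker =
            new SimpleRegexChunker(timeRE, "time", 1.0);
        for (SimpleChunk c : chunker.chunk(text)) {
            System.out.println("Type: " + c.type + " Entity: ["
                + text.substring(c.start, c.end) + "] Score: " + c.score);
        }
    }
}
```

The point of the sketch is that the type and the score are fixed per chunker rather than computed per match, which is why every chunk reported by a single regular expression chunker carries the same score.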

We will demonstrate this class using a regular expression representing time, as shown in the next example. The regular expression is the same one used in the Using Java's regular expressions to find entities section earlier in this chapter. The Chunker instance is then created:

String timeRE =
    "(([0-1]?[0-9])|([2][0-3])):([0-5]?[0-9])(:([0-5]?[0-9]))?";
Chunker chunker = new RegExChunker(timeRE, "time", 1.0);

The chunk method is used along with the displayChunkSet method, as shown here:

Chunking chunking = chunker.chunk(regularExpressionText);
Set<Chunk> chunkSet = chunking.chunkSet();
displayChunkSet(chunker, regularExpressionText);

The displayChunkSet method is shown in the following code segment. The chunkSet method returns a Set collection of Chunk instances. We can use various methods to display specific parts of the chunk:

public void displayChunkSet(Chunker chunker, String text) {
    Chunking chunking = chunker.chunk(text);
    Set<Chunk> set = chunking.chunkSet();
    for (Chunk chunk : set) {
        System.out.println("Type: " + chunk.type() + " Entity: ["
             + text.substring(chunk.start(), chunk.end())
             + "] Score: " + chunk.score());
    }
}

The output is as follows:

Type: time Entity: [8:00] Score: 1.0
Type: time Entity: [4:30] Score: 1.0

Alternatively, we can declare a simple class to encapsulate the regular expression, which lends itself to reuse in other situations. Next, the TimeRegexChunker class is declared; it supports the identification of time entities:

public class TimeRegexChunker extends RegExChunker {
    private final static String TIME_RE = 
      "(([0-1]?[0-9])|([2][0-3])):([0-5]?[0-9])(:([0-5]?[0-9]))?";
    private final static String CHUNK_TYPE = "time";
    private final static double CHUNK_SCORE = 1.0;
    
    public TimeRegexChunker() {
        super(TIME_RE,CHUNK_TYPE,CHUNK_SCORE);
    }
}

To use this class, replace this section's initial declaration of chunker with the following declaration:

Chunker chunker = new TimeRegexChunker();

The output will be the same as before.
