Regular expressions can be used to identify entities in a document. We will investigate two general approaches:
Using Java's regular expressions to find entities
Using the RegExChunker class
When working with regular expressions, it is advantageous to avoid reinventing the wheel. There are many sources of predefined and tested expressions. One such library can be found at http://regexlib.com/Default.aspx. We will use several of the regular expressions in this library for our examples.
To test how well these approaches work, we will use the following text for most of our examples:
private static String regularExpressionText =
    "He left his email address ([email protected]) and his "
    + "phone number,800-555-1234. We believe his current address "
    + "is 100 Washington Place, Seattle, CO 12345-1234. I "
    + "understand you can also call at 123-555-1234 between "
    + "8:00 AM and 4:30 most days. His URL is http://example.com "
    + "and he was born on February 25, 1954 or 2/25/1954.";
To demonstrate how these expressions can be used, we will start with several simple examples. The initial example starts with the following declaration. It is a simple expression designed to identify certain types of phone numbers:
String phoneNumberRE = "\\d{3}-\\d{3}-\\d{4}";
We will use the following code to test our simple expressions. The Pattern class's compile method takes a regular expression and compiles it into a Pattern object. That object's matcher method is then executed against the target text and returns a Matcher object, which allows us to repeatedly identify regular expression matches:
// Requires java.util.regex.Pattern and java.util.regex.Matcher
Pattern pattern = Pattern.compile(phoneNumberRE);
Matcher matcher = pattern.matcher(regularExpressionText);
while (matcher.find()) {
    System.out.println(matcher.group()
        + " [" + matcher.start() + ":" + matcher.end() + "]");
}
The find method returns true when a match occurs. The group method returns the text that matches the expression. The start and end methods give the position of the matched text in the target text, where end is the offset just past the last matched character.
When executed, we will get the following output:
800-555-1234 [68:80]
123-555-1234 [196:208]
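The match loop above can be wrapped in a small reusable method so that other expressions can be tried without repeating the boilerplate. The following is a minimal sketch, not code from the chapter: the class name RegexEntityFinder, the method name findEntities, and the short sample string are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, introduced only for illustration.
class RegexEntityFinder {

    // Returns one "match [start:end]" string for each match of regex in text.
    static List<String> findEntities(String regex, String text) {
        List<String> results = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            results.add(matcher.group()
                + " [" + matcher.start() + ":" + matcher.end() + "]");
        }
        return results;
    }

    public static void main(String[] args) {
        // Backslashes are doubled because this is a Java string literal.
        String phoneNumberRE = "\\d{3}-\\d{3}-\\d{4}";
        // Assumed sample text, shorter than the chapter's test string.
        String sample = "Call 800-555-1234 or 123-555-1234.";
        for (String entity : findEntities(phoneNumberRE, sample)) {
            System.out.println(entity);
        }
    }
}
```

For the assumed sample string, this prints 800-555-1234 [5:17] followed by 123-555-1234 [21:33]. Returning a list instead of printing directly makes it easier to compare the results of different expressions.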
A number of other regular expressions can be used in a similar manner. These are listed in the following table. The third column is the output produced when the corresponding regular expression is used in the previous code sequence:
Entity type | Regular expression | Output |
---|---|---|
URL | | |
ZIP code | | |
Email | | |
Time | | |
Date | | |
There are many other regular expressions that we could have used. However, these examples illustrate the basic technique. As demonstrated with the date regular expression, some of these can be quite complex.
It is common for regular expressions to miss some entities and to falsely report non-entities as entities. For example, suppose we replace the text with the following string:
regularExpressionText = "(888)555-1111 888-SEL-HIGH 888-555-2222-J88-W3S";
Executing the code will return this:
888-555-2222 [27:39]
It missed the first two phone numbers and falsely reported the "part number" as a phone number.
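One way to reduce both kinds of error is to tighten the expression. The sketch below is an assumption for illustration, not an expression from the chapter or from regexlib.com: it accepts an area code written as (888) in addition to 888-, and it uses lookbehind and lookahead to reject a match that is glued to another word character or hyphen, as in the part number. The names StricterPhoneMatcher and findPhones are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class, introduced only for illustration.
class StricterPhoneMatcher {

    // Assumed pattern: accepts "ddd-ddd-dddd" and "(ddd)ddd-dddd".
    static final Pattern PHONE = Pattern.compile(
        "(?<![\\w-])"                      // not preceded by a word character or hyphen
        + "(?:\\(\\d{3}\\)\\s?|\\d{3}-)"   // area code written as "(888)" or "888-"
        + "\\d{3}-\\d{4}"                  // exchange and line number
        + "(?![\\w-])");                   // not followed by a word character or hyphen

    // Returns the text of every phone-like match in the input.
    static List<String> findPhones(String text) {
        List<String> results = new ArrayList<>();
        Matcher matcher = PHONE.matcher(text);
        while (matcher.find()) {
            results.add(matcher.group());
        }
        return results;
    }

    public static void main(String[] args) {
        String text = "(888)555-1111 888-SEL-HIGH 888-555-2222-J88-W3S";
        System.out.println(findPhones(text));
    }
}
```

With this pattern, the parenthesized number is found and the part number is no longer reported. The vanity number 888-SEL-HIGH is still missed, which shows that each refinement covers only the formats it was written for.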
We can also search for more than one regular expression at a time using the | operator. In the following statement, three regular expressions are combined using this operator. They are declared using the corresponding entries in the previous table:
Pattern pattern = Pattern.compile(phoneNumberRE + "|" + timeRE + "|" + emailRegEx);
When executed using the original regularExpressionText string defined at the beginning of the previous section, we get the following output:
[email protected] [27:45]
800-555-1234 [68:80]
123-555-1234 [196:208]
8:00 [217:221]
4:30 [229:233]
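A combined expression reports the matched text but not which alternative produced it. One possible way to recover the entity type is to wrap each alternative in a named capturing group, which java.util.regex has supported since Java 7. The following sketch is an assumption for illustration: the class name LabeledRegexFinder, the group names phone and time, and the sample string are hypothetical, and only the phone and time expressions shown in this chapter are combined.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class, introduced only for illustration.
class LabeledRegexFinder {

    static final String PHONE_RE = "\\d{3}-\\d{3}-\\d{4}";
    static final String TIME_RE =
        "(([0-1]?[0-9])|([2][0-3])):([0-5]?[0-9])(:([0-5]?[0-9]))?";

    // Each alternative sits in its own named group so it can be identified later.
    static final Pattern COMBINED = Pattern.compile(
        "(?<phone>" + PHONE_RE + ")|(?<time>" + TIME_RE + ")");

    // Returns one "type: text" string per match.
    static List<String> findLabeled(String text) {
        List<String> results = new ArrayList<>();
        Matcher matcher = COMBINED.matcher(text);
        while (matcher.find()) {
            // group(name) is null when that alternative did not take part in the match.
            String type = matcher.group("phone") != null ? "phone" : "time";
            results.add(type + ": " + matcher.group());
        }
        return results;
    }

    public static void main(String[] args) {
        // Assumed sample text, shorter than the chapter's test string.
        String sample = "Call 800-555-1234 between 8:00 AM and 4:30";
        for (String entity : findLabeled(sample)) {
            System.out.println(entity);
        }
    }
}
```

For the assumed sample string, this prints phone: 800-555-1234, then time: 8:00, then time: 4:30. The nested unnamed groups inside the time expression remain valid when it is wrapped in a named group.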
LingPipe's RegExChunker class uses chunks to find entities in text. The class uses a regular expression to represent an entity. Its chunk method returns a Chunking object that can be used as we did in our earlier examples.
The RegExChunker class's constructor takes three arguments:
String: The regular expression
String: The type of entity or category
double: A value for the score
We will demonstrate this class using a regular expression representing time, as shown in the next example. The regular expression is the same as the one used in Using Java's regular expressions to find entities earlier in this chapter. The Chunker instance is then created:
String timeRE =
    "(([0-1]?[0-9])|([2][0-3])):([0-5]?[0-9])(:([0-5]?[0-9]))?";
Chunker chunker = new RegExChunker(timeRE, "time", 1.0);
The chunk method is used along with the displayChunkSet method, as shown here:
Chunking chunking = chunker.chunk(regularExpressionText);
Set<Chunk> chunkSet = chunking.chunkSet();
displayChunkSet(chunker, regularExpressionText);
The displayChunkSet method is shown in the following code segment. The chunkSet method returns a Set collection of Chunk instances. We can use various methods to display specific parts of the chunk:
public void displayChunkSet(Chunker chunker, String text) {
    Chunking chunking = chunker.chunk(text);
    Set<Chunk> set = chunking.chunkSet();
    for (Chunk chunk : set) {
        System.out.println("Type: " + chunk.type() + " Entity: ["
            + text.substring(chunk.start(), chunk.end())
            + "] Score: " + chunk.score());
    }
}
The output is as follows:
Type: time Entity: [8:00] Score: 1.0
Type: time Entity: [4:30] Score: 1.0
Alternatively, we can declare a simple class to encapsulate the regular expression, which lends itself to reuse in other situations. Next, the TimeRegexChunker class is declared. It supports the identification of time entities:
public class TimeRegexChunker extends RegExChunker {
    private final static String TIME_RE =
        "(([0-1]?[0-9])|([2][0-3])):([0-5]?[0-9])(:([0-5]?[0-9]))?";
    private final static String CHUNK_TYPE = "time";
    private final static double CHUNK_SCORE = 1.0;

    public TimeRegexChunker() {
        super(TIME_RE, CHUNK_TYPE, CHUNK_SCORE);
    }
}
To use this class, replace this section's initial declaration of chunker with the following declaration:
Chunker chunker = new TimeRegexChunker();
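To make the idea behind a regular expression chunker concrete, the following sketch approximates it with java.util.regex alone. It is a hypothetical stand-in and not RegExChunker's actual implementation: the names SimpleRegexChunker and Span are invented for illustration, and the sketch only mirrors the three constructor arguments (expression, type, and score) and the start and end offsets described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a regex-based chunker; not part of any library.
class SimpleRegexChunker {

    // Hypothetical value class: one typed, scored span of the input text.
    static final class Span {
        final int start;
        final int end;
        final String type;
        final double score;

        Span(int start, int end, String type, double score) {
            this.start = start;
            this.end = end;
            this.type = type;
            this.score = score;
        }
    }

    private final Pattern pattern;
    private final String type;
    private final double score;

    SimpleRegexChunker(String regex, String type, double score) {
        this.pattern = Pattern.compile(regex);
        this.type = type;
        this.score = score;
    }

    // Every match becomes a span carrying the same fixed type and score.
    List<Span> chunk(String text) {
        List<Span> spans = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            spans.add(new Span(matcher.start(), matcher.end(), type, score));
        }
        return spans;
    }

    public static void main(String[] args) {
        String timeRE =
            "(([0-1]?[0-9])|([2][0-3])):([0-5]?[0-9])(:([0-5]?[0-9]))?";
        // Assumed sample text, shorter than the chapter's test string.
        String text = "between 8:00 AM and 4:30 most days";
        for (Span span : new SimpleRegexChunker(timeRE, "time", 1.0).chunk(text)) {
            System.out.println("Type: " + span.type + " Entity: ["
                + text.substring(span.start, span.end)
                + "] Score: " + span.score);
        }
    }
}
```

Because the score is a constructor argument, every span produced by one chunker instance carries the same score, which matches the constant 1.0 seen in the output earlier in this section.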