Simple Java SBDs

Sometimes, text may be simple enough that Java core support will suffice. There are two approaches that will perform SBD: using regular expressions and using the BreakIterator class. We will examine both approaches here.

Using regular expressions

Regular expressions can be difficult to understand. While simple expressions are not usually a problem, as they become more complex, their readability worsens. This is one of the limitations of regular expressions when trying to use them for SBD.

We will present two different regular expressions. The first expression is simple, but does not do a very good job. It illustrates a solution that may be too simple for some problem domains. The second is more sophisticated and does a better job.

In this example, we create a regular expression consisting of a single character class that matches periods, question marks, and exclamation marks. The String class's split method is then used to split the text into sentences:

String simple = "[.?!]";
String[] splitString = (paragraph.split(simple));
for (String string : splitString) {
    System.out.println(string);
}

The output is as follows:

When determining the end of sentences we need to consider several factors
 Sentences may end with exclamation marks
 Or possibly questions marks
 Within sentences we may find numbers like 3
14159, abbreviations such as found in Mr
 Smith, and possibly ellipses either within a sentence …, or at the end of a sentence…

As expected, the method splits the paragraph at every terminator character, regardless of whether that character is part of a number or an abbreviation. The terminators themselves are also discarded.
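One small variation on the simple split is sketched below. It is not the book's code: the class name, the method name, and the sample text are all assumptions made for illustration. It splits only on whitespace that follows a terminator, using a lookbehind, so the terminators are kept and a number such as 3.14159 stays intact. Abbreviations such as "Mr." are still split incorrectly.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of the original example.
class LookbehindSplit {
    // Splits on whitespace that is preceded by '.', '?' or '!'.
    // The lookbehind does not consume the terminator, so it stays in the sentence.
    static List<String> split(String text) {
        return Arrays.asList(text.split("(?<=[.?!])\\s+"));
    }

    public static void main(String[] args) {
        // Assumed sample text, shorter than the paragraph used in the chapter.
        String sample = "Pi is about 3.14159. Is that right? Ask Mr. Smith!";
        for (String sentence : split(sample)) {
            System.out.println(sentence);
        }
    }
}
```

The period inside the number is not followed by whitespace, so no split happens there. The period after "Mr" is followed by a space, so the abbreviation problem remains.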

A second approach follows, which produces better results. This example has been adapted from an example found at http://stackoverflow.com/questions/5553410/regular-expression-match-a-sentence. The Pattern class, which compiles the following regular expression, is used:

[^.!?\s][^.!?]*(?:[.!?](?!['"]?\s|$)[^.!?]*)*[.!?]?['"]?(?=\s|$)

The comments embedded in the following code sequence explain what each part of the expression represents:

Pattern sentencePattern = Pattern.compile(
    "# Match a sentence ending in punctuation or EOS.\n"
    + "[^.!?\\s]    # First char is non-punct, non-ws\n"
    + "[^.!?]*      # Greedily consume up to punctuation.\n"
    + "(?:          # Group for unrolling the loop.\n"
    + "  [.!?]      # (special) inner punctuation ok if\n"
    + "  (?!['\"]?\\s|$)  # not followed by ws or EOS.\n"
    + "  [^.!?]*    # Greedily consume up to punctuation.\n"
    + ")*           # Zero or more (special normal*)\n"
    + "[.!?]?       # Optional ending punctuation.\n"
    + "['\"]?       # Optional closing quote.\n"
    + "(?=\\s|$)",
    Pattern.MULTILINE | Pattern.COMMENTS);

Another representation of this expression can be generated using the display tool found at http://regexper.com/. As shown in the following diagram, it graphically depicts the expression and can clarify how it works:

[Diagram: graphical depiction of the regular expression, generated with regexper.com]

The matcher method creates a Matcher instance for the sample paragraph. Its find method is then called repeatedly, and each match is displayed:

Matcher matcher = sentencePattern.matcher(paragraph);
while (matcher.find()) {
    System.out.println(matcher.group());
}

The output follows. The sentence terminators are retained, but there are still problems with abbreviations:

When determining the end of sentences we need to consider several factors.
Sentences may end with exclamation marks!
Or possibly questions marks?
Within sentences we may find numbers like 3.14159, abbreviations such as found in Mr.
Smith, and possibly ellipses either within a sentence …, or at the end of a sentence…
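For experimenting with this pattern outside the chapter's project, the following self-contained sketch may help. It uses the compact form of the expression without the COMMENTS flag and collects the matches into a list. The class name, method name, and sample text are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class, not part of the original example.
class RegexSentences {
    // Compact form of the sentence pattern; backslashes and quotes are
    // escaped as required inside a Java string literal.
    private static final Pattern SENTENCE = Pattern.compile(
        "[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)",
        Pattern.MULTILINE);

    // Returns every match of the pattern, in order of appearance.
    static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = SENTENCE.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample text, shorter than the paragraph used in the chapter.
        String sample = "Pi is about 3.14159. Is that right? Ask Mr. Smith!";
        sentences(sample).forEach(System.out::println);
    }
}
```

On this sample the number is kept whole, while "Ask Mr." and "Smith!" are still reported as separate sentences, which mirrors the abbreviation problem seen in the output above.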

Using the BreakIterator class

The BreakIterator class can be used to detect various text boundaries such as those between characters, words, sentences, and lines. Different methods are used to create different instances of the BreakIterator class as follows:

  • For characters, the getCharacterInstance method is used
  • For words, the getWordInstance method is used
  • For sentences, the getSentenceInstance method is used
  • For lines, the getLineInstance method is used
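As an illustration of one of these factory methods, the sketch below uses getWordInstance to pull the words out of a short string. The word iterator also reports boundaries around spaces and punctuation, so the sketch keeps only tokens that start with a letter or digit. That filter, the class name, and the sample text are assumptions made for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of the original text.
class WordBoundaries {
    static List<String> words(String text) {
        BreakIterator iterator = BreakIterator.getWordInstance(Locale.US);
        iterator.setText(text);
        List<String> words = new ArrayList<>();
        int start = iterator.first();
        // Walk consecutive boundary pairs until DONE is returned.
        for (int end = iterator.next(); end != BreakIterator.DONE;
                start = end, end = iterator.next()) {
            String token = text.substring(start, end);
            // Skip tokens that are only whitespace or punctuation.
            if (Character.isLetterOrDigit(token.charAt(0))) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!"));
    }
}
```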

Detecting breaks between characters is important at times, for example, when we need to process characters that are composed of multiple Unicode code points, such as ü. This character is sometimes formed by combining \u0075 (u) with the combining diaeresis \u0308 (¨). The class will treat such a sequence as a single character. This capability is further detailed at https://docs.oracle.com/javase/tutorial/i18n/text/char.html.
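The difference between code units and user-perceived characters can be seen in a small sketch. It assumes the ü is written as the letter u followed by the combining diaeresis U+0308. The class name and sample string are hypothetical.

```java
import java.text.BreakIterator;

// Hypothetical helper class, not part of the original text.
class GraphemeCount {
    // Counts user-perceived characters by counting character boundaries.
    static int count(String text) {
        BreakIterator iterator = BreakIterator.getCharacterInstance();
        iterator.setText(text);
        int count = 0;
        iterator.first();
        while (iterator.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'u' followed by U+0308 (combining diaeresis), then "ber".
        String text = "u\u0308ber";
        System.out.println("char values: " + text.length());
        System.out.println("characters:  " + count(text));
    }
}
```

Here String.length reports five char values, while the iterator reports four characters because the base letter and its combining mark are kept together.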

The BreakIterator class can be used to detect the end of a sentence. It uses a cursor that references the current boundary, and it supports next and previous methods that move the cursor forward and backward in the text, respectively. BreakIterator has a single, protected default constructor. To obtain an instance of the BreakIterator class to detect the end of a sentence, use the static getSentenceInstance method, as shown here:

BreakIterator sentenceIterator = BreakIterator.getSentenceInstance();

There is also an overloaded version of the method. It takes a Locale instance as an argument:

Locale currentLocale = new Locale("en", "US");
BreakIterator sentenceIterator = 
    BreakIterator.getSentenceInstance(currentLocale);

Once an instance has been created, the setText method will associate the text to be processed with the iterator:

sentenceIterator.setText(paragraph);

BreakIterator identifies the boundaries found in text using a series of methods and fields. All of these return integer values, and they are detailed in the following table:

Method or field    Usage
first              Returns the first boundary of the text
next               Returns the boundary following the current boundary
previous           Returns the boundary preceding the current boundary
DONE               A constant with the value -1, returned by next and previous when there are no more boundaries to be found
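The previous method can be used in the same way as next to walk the text from the end. The sketch below is an illustration only: it starts from the last method, which is part of BreakIterator but not listed in the table above, and the class name and sample text are assumptions.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of the original text.
class ReverseSentences {
    // Returns the sentences of the text from last to first.
    static List<String> lastToFirst(String text) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
        iterator.setText(text);
        List<String> result = new ArrayList<>();
        int end = iterator.last();  // boundary at the end of the text
        // previous() returns DONE once the first boundary has been passed.
        for (int start = iterator.previous(); start != BreakIterator.DONE;
                end = start, start = iterator.previous()) {
            result.add(text.substring(start, end).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(lastToFirst("One. Two! Three?"));
    }
}
```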

To use the iterator in a sequential fashion, the first boundary is identified using the first method, and then the next method is called repeatedly to find the subsequent boundaries. The process is terminated when DONE is returned. This technique is illustrated in the next code sequence, which uses the previously declared sentenceIterator instance:

int boundary = sentenceIterator.first();
while (boundary != BreakIterator.DONE) {
    int begin = boundary;
    System.out.print(boundary + "-");
    boundary = sentenceIterator.next();
    int end = boundary;
    if (end == BreakIterator.DONE) {
        break;
    }
    System.out.println(boundary + " ["
        + paragraph.substring(begin, end) + "]");
}

On execution, we get the following output:

0-75 [When determining the end of sentences we need to consider several factors. ]
75-117 [Sentences may end with exclamation marks! ]
117-146 [Or possibly questions marks? ]
146-233 [Within sentences we may find numbers like 3.14159 , abbreviations such as found in Mr. ]
233-319 [Smith, and possibly ellipses either within a sentence … , or at the end of a sentence…]
319-

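The iterator handles the simple sentences and keeps the number 3.14159 intact, but it still breaks incorrectly after the abbreviation Mr., so it is not successful with more complex sentences.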
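When only the sentences are needed rather than the boundary offsets, the loop can be wrapped in a small helper. The following is a minimal sketch under stated assumptions: the class name, method name, and sample text are hypothetical, and trailing whitespace is trimmed from each sentence.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of the original example.
class SentenceList {
    static List<String> sentences(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(locale);
        iterator.setText(text);
        List<String> result = new ArrayList<>();
        int start = iterator.first();
        // Each (start, end) pair of consecutive boundaries delimits one sentence.
        for (int end = iterator.next(); end != BreakIterator.DONE;
                start = end, end = iterator.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "Hello there. How are you? Fine!";
        sentences(sample, Locale.US).forEach(System.out::println);
    }
}
```

Checking for DONE in the loop condition avoids the dangling last line seen in the output above, because nothing is printed or stored until both ends of a sentence are known.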

Both regular expressions and the BreakIterator class have limitations. They are useful for text consisting of relatively simple sentences. However, when the text becomes more complex, it is better to use the NLP APIs instead, as discussed in the next section.
