Sometimes, text may be simple enough that Java core support will suffice. There are two approaches that will perform SBD: using regular expressions and using the BreakIterator class. We will examine both approaches here.
Regular expressions can be difficult to understand. While simple expressions are not usually a problem, as they become more complex, their readability worsens. This is one of the limitations of regular expressions when trying to use them for SBD.
We will present two different regular expressions. The first expression is simple, but does not do a very good job. It illustrates a solution that may be too simple for some problem domains. The second is more sophisticated and does a better job.
In this example, we create a regular expression character class that matches periods, question marks, and exclamation marks. The String class' split method is used to split the text into sentences:
String simple = "[.?!]";
String[] splitString = paragraph.split(simple);
for (String string : splitString) {
    System.out.println(string);
}
The output is as follows:
When determining the end of sentences we need to consider several factors
Sentences may end with exclamation marks
Or possibly questions marks
Within sentences we may find numbers like 3
14159, abbreviations such as found in Mr
Smith, and possibly ellipses either within a sentence …, or at the end of a sentence…
As expected, the method splits the paragraph at every one of these characters, regardless of whether the character is part of a number or an abbreviation. The terminators themselves are also discarded.
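One small variation on the simple pattern is sketched below. It is not from the original text: the class name TerminatorSplit and the sample string are hypothetical, and the pattern assumes that a sentence terminator is always followed by whitespace. The lookbehind keeps the terminator with its sentence, and requiring whitespace stops the split inside numbers such as 3.14159. Abbreviations such as Mr. are still split incorrectly.

```java
import java.util.Arrays;
import java.util.List;

public class TerminatorSplit {

    // Split on whitespace that directly follows a terminator. The lookbehind
    // (?<=[.?!]) is zero-width, so the terminator stays with its sentence.
    static List<String> split(String text) {
        return Arrays.asList(text.split("(?<=[.?!])\\s+"));
    }

    public static void main(String[] args) {
        // Hypothetical sample text, used only for illustration.
        String sample = "Sentences may end with exclamation marks! "
                + "Or possibly questions marks? "
                + "Numbers like 3.14159 stay whole.";
        for (String sentence : split(sample)) {
            System.out.println(sentence);
        }
    }
}
```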
A second approach follows, which produces better results. This example has been adapted from an example found at http://stackoverflow.com/questions/5553410/regular-expression-match-a-sentence. The Pattern class, which compiles the following regular expression, is used:
[^.!?\s][^.!?]*(?:[.!?](?!['"]?\s|$)[^.!?]*)*[.!?]?['"]?(?=\s|$)
The comments in the following code sequence explain what each part represents:
// Each fragment ends with \n so that, in COMMENTS mode, the embedded
// # comment stops at the end of its own line.
Pattern sentencePattern = Pattern.compile(
    "# Match a sentence ending in punctuation or EOS.\n"
    + "[^.!?\\s]          # First char is non-punct, non-ws\n"
    + "[^.!?]*            # Greedily consume up to punctuation.\n"
    + "(?:                # Group for unrolling the loop.\n"
    + "  [.!?]            # (special) inner punctuation ok if\n"
    + "  (?!['\"]?\\s|$)  # not followed by ws or EOS.\n"
    + "  [^.!?]*          # Greedily consume up to punctuation.\n"
    + ")*                 # Zero or more (special normal*)\n"
    + "[.!?]?             # Optional ending punctuation.\n"
    + "['\"]?             # Optional closing quote.\n"
    + "(?=\\s|$)",
    Pattern.MULTILINE | Pattern.COMMENTS);
Another representation of this expression can be generated using the display tool found at http://regexper.com/. As shown in the following diagram, it graphically depicts the expression and can clarify how it works:
The matcher method is executed against the sample paragraph and then the results are displayed:
Matcher matcher = sentencePattern.matcher(paragraph);
while (matcher.find()) {
    System.out.println(matcher.group());
}
The output follows. The sentence terminators are retained, but there are still problems with abbreviations:
When determining the end of sentences we need to consider several factors.
Sentences may end with exclamation marks!
Or possibly questions marks?
Within sentences we may find numbers like 3.14159, abbreviations such as found in Mr.
Smith, and possibly ellipses either within a sentence …, or at the end of a sentence…
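For experimenting with this expression on other input, a self-contained sketch follows. The class name RegexSentences, the helper method, and the short sample string are hypothetical; the pattern is the single-line form of the expression shown earlier, with the backslashes doubled for a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSentences {

    // Single-line form of the sentence expression. COMMENTS mode is not
    // needed here because the pattern contains no embedded comments.
    private static final Pattern SENTENCE = Pattern.compile(
            "[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)",
            Pattern.MULTILINE);

    // Collect every match instead of printing it, so callers can inspect
    // the sentences that were found.
    static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = SENTENCE.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample: the number is kept whole, but the
        // abbreviation still ends a sentence.
        for (String sentence : sentences("Pi is 3.14159. Ask Mr. Smith!")) {
            System.out.println(sentence);
        }
    }
}
```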
The BreakIterator class can be used to detect various text boundaries such as those between characters, words, sentences, and lines. Different methods are used to create different instances of the BreakIterator class as follows:
- For characters, the getCharacterInstance method is used
- For words, the getWordInstance method is used
- For sentences, the getSentenceInstance method is used
- For lines, the getLineInstance method is used

Detecting breaks between characters is important at times, for example, when we need to process characters that are composed of multiple Unicode characters such as ü. This character is sometimes formed by combining the \u0075 (u) and \u0308 (combining diaeresis, ¨) Unicode characters. The class will identify these types of characters. This capability is further detailed at https://docs.oracle.com/javase/tutorial/i18n/text/char.html.
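The difference between code units and user-perceived characters can be sketched as follows. This example is an illustration rather than part of the original text: the class name CharacterBoundaries and the sample word are hypothetical, and the ü is deliberately built from u followed by the combining diaeresis U+0308.

```java
import java.text.BreakIterator;

public class CharacterBoundaries {

    // Count user-perceived characters by counting the boundaries that
    // follow the first one.
    static int countUserCharacters(String text) {
        BreakIterator iterator = BreakIterator.getCharacterInstance();
        iterator.setText(text);
        int count = 0;
        iterator.first();
        while (iterator.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "u" plus combining diaeresis (U+0308), followed by "ber".
        String word = "u\u0308ber";
        System.out.println("char count:     " + word.length());
        System.out.println("user-perceived: " + countUserCharacters(word));
    }
}
```

The String length counts the combining mark as a separate char, while the iterator keeps it attached to its base letter.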
The BreakIterator class can be used to detect the end of a sentence. It uses a cursor that references the current boundary. It supports next and previous methods that move the cursor forward and backward in the text, respectively. BreakIterator has a single, protected default constructor. To obtain an instance of the BreakIterator class to detect the end of a sentence, use the static getSentenceInstance method, as shown here:
BreakIterator sentenceIterator = BreakIterator.getSentenceInstance();
There is also an overloaded version of the method. It takes a Locale instance as an argument:
Locale currentLocale = new Locale("en", "US");
BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(currentLocale);
Once an instance has been created, the setText method will associate the text to be processed with the iterator:
sentenceIterator.setText(paragraph);
BreakIterator identifies the boundaries found in text using a series of methods and a field. All of these yield integer values, and they are detailed in the following table:
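Method | Usage |
---|---|
first | Returns the first boundary in the text |
next | Returns the boundary following the current boundary |
previous | Returns the boundary preceding the current boundary |
DONE | The final integer, which is assigned a value of -1 (indicating that there are no more boundaries to be found) |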
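The previous method is easiest to see when walking the text backward. The sketch below is an illustration, not part of the original text: the class name BackwardBoundaries and the sample string are hypothetical. It positions the cursor with last, a BreakIterator method that moves to the final boundary, and then steps back until DONE is returned.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BackwardBoundaries {

    // Collect sentence boundaries from the end of the text to the start.
    static List<Integer> boundariesFromEnd(String text) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
        iterator.setText(text);
        List<Integer> boundaries = new ArrayList<>();
        for (int boundary = iterator.last();
                boundary != BreakIterator.DONE;
                boundary = iterator.previous()) {
            boundaries.add(boundary);
        }
        return boundaries;
    }

    public static void main(String[] args) {
        // Hypothetical sample text with two simple sentences.
        System.out.println(boundariesFromEnd("Hi there. Bye!"));
    }
}
```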
To use the iterator in a sequential fashion, the first boundary is identified using the first method, and then the next method is called repeatedly to find the subsequent boundaries. The process is terminated when DONE is returned. This technique is illustrated in the next code sequence, which uses the previously declared sentenceIterator instance:
int boundary = sentenceIterator.first();
while (boundary != BreakIterator.DONE) {
    int begin = boundary;
    System.out.print(boundary + "-");
    boundary = sentenceIterator.next();
    int end = boundary;
    if (end == BreakIterator.DONE) {
        break;
    }
    System.out.println(boundary + " ["
        + paragraph.substring(begin, end) + "]");
}
On execution, we get the following output:
0-75 [When determining the end of sentences we need to consider several factors. ]
75-117 [Sentences may end with exclamation marks! ]
117-146 [Or possibly questions marks? ]
146-233 [Within sentences we may find numbers like 3.14159 , abbreviations such as found in Mr. ]
233-319 [Smith, and possibly ellipses either within a sentence … , or at the end of a sentence…]
319-
This approach works for the simple sentences but is not successful with the more complex one: the abbreviation Mr. is treated as the end of a sentence.
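When the sentences are needed as values rather than printed output, the same loop can be wrapped in a helper. The following is a minimal sketch under stated assumptions: the class name BreakIteratorSentences, the method name, and the sample text are hypothetical, and trimming the trailing whitespace from each sentence is a choice made here, not something the original code does.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BreakIteratorSentences {

    // Walk consecutive boundary pairs and collect the text between them.
    static List<String> sentences(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(locale);
        iterator.setText(text);
        List<String> result = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next();
                end != BreakIterator.DONE;
                start = end, end = iterator.next()) {
            // Assumption: trailing whitespace is not wanted in the result.
            String candidate = text.substring(start, end).trim();
            if (!candidate.isEmpty()) {
                result.add(candidate);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample text.
        String sample = "Hello world. How are you? Fine!";
        for (String sentence : sentences(sample, Locale.US)) {
            System.out.println("[" + sentence + "]");
        }
    }
}
```

Returning a list keeps the boundary logic in one place, so the same helper can later be compared against the output of an NLP API.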
The uses of both regular expressions and the BreakIterator class have limitations. They are useful for text consisting of relatively simple sentences. However, when the text becomes more complex, it is better to use the NLP APIs instead, as discussed in the next section.