The necessary steps include the following:
- Insert the following import statements:
import java.util.ArrayList;
import java.util.List;
import com.aliasi.sentences.IndoEuropeanSentenceModel;
import com.aliasi.sentences.SentenceModel;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;
- Next, add the following code segment to the main method. It declares the sample text and obtains a tokenizer factory and a sentence model:
String text =
    "We will start with a simple sentence. However, is it "
    + "possible for a sentence to end with a question "
    + "mark? Obviously that is possible! Another "
    + "complication is the use of a number such as 56.32 "
    + "or ellipses such as ... Ellipses may be found ... "
    + "with a sentence! Of course, we may also find the "
    + "use of abbreviations such as Mr. Smith or "
    + "Dr. Jones.";
TokenizerFactory tokenizerFactory =
    IndoEuropeanTokenizerFactory.INSTANCE;
SentenceModel sentenceModel = new IndoEuropeanSentenceModel();
- Add the following lists, which will hold the tokens and the whitespace surrounding them:
List<String> tokenList = new ArrayList<>();
List<String> whiteList = new ArrayList<>();
- Insert the next code sequence to populate these lists and find the sentence boundaries:
Tokenizer tokenizer = tokenizerFactory.tokenizer(
    text.toCharArray(), 0, text.length());
tokenizer.tokenize(tokenList, whiteList);
int[] sentenceBoundaries = sentenceModel.boundaryIndices(
    tokenList.toArray(new String[tokenList.size()]),
    whiteList.toArray(new String[whiteList.size()]));
- Add the following code to display the sentences detected:
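int start = 0;
// Each boundary is the index of the last token in a sentence
for (int boundary : sentenceBoundaries) {
    System.out.print("[");
    while (start <= boundary) {
        // whiteList.get(start + 1) is the whitespace that follows the token
        System.out.print(tokenList.get(start) +
            whiteList.get(start + 1));
        start++;
    }
    System.out.println("]");
}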
- Execute the program. You will see the following output displayed:
[We will start with a simple sentence. ]
[However, is it possible for a sentence to end with a question mark? ]
[Obviously that is possible! ]
[Another complication is the use of a number such as 56.32 or ellipses such as ... Ellipses may be found ... with a sentence! ]
[Of course, we may also find the use of abbreviations such as Mr. Smith or Dr. Jones.]
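
The display loop pairs each token with whiteList.get(start + 1) because the whitespace list holds one more entry than the token list: entry 0 is the whitespace before the first token, and entry i + 1 is the whitespace that follows token i. The following is a minimal sketch of that reassembly logic on its own, without LingPipe, assuming hand-built token and whitespace arrays. The class name BoundaryDemo, the sentences helper, and the sample data are hypothetical and serve only as illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; not part of LingPipe.
class BoundaryDemo {

    // Rebuilds sentences from parallel token and whitespace arrays.
    // Assumes whites has tokens.length + 1 entries and that each
    // boundary is the index of the last token of a sentence.
    static List<String> sentences(String[] tokens, String[] whites,
            int[] boundaries) {
        List<String> result = new ArrayList<>();
        int start = 0;
        for (int boundary : boundaries) {
            StringBuilder sentence = new StringBuilder();
            while (start <= boundary) {
                // Append the token and the whitespace that follows it
                sentence.append(tokens[start]).append(whites[start + 1]);
                start++;
            }
            result.add(sentence.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hand-built data for the text "Hello there. Is it late?"
        String[] tokens = {"Hello", "there", ".", "Is", "it", "late", "?"};
        String[] whites = {"", " ", "", " ", " ", " ", "", ""};
        int[] boundaries = {2, 6};
        for (String sentence : sentences(tokens, whites, boundaries)) {
            System.out.println("[" + sentence + "]");
        }
        // Prints:
        // [Hello there. ]
        // [Is it late?]
    }
}
```

As in the output shown earlier, every sentence except the last keeps the space that followed its final token, while the last one ends directly at the closing bracket because the trailing whitespace entry is empty.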