How to do it...

The necessary steps include the following:

  1. Insert the following import statements:
import java.util.ArrayList;
import java.util.List;
import com.aliasi.sentences.IndoEuropeanSentenceModel;
import com.aliasi.sentences.SentenceModel;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;
  1. Next, add the following code to the main method. It defines the sample text and creates a tokenizer factory and a sentence model:
String text =
    "We will start with a simple sentence. However, is it "
    + "possible for a sentence to end with a question "
    + "mark? Obviously that is possible! Another "
    + "complication is the use of a number such as 56.32 "
    + "or ellipses such as ... Ellipses may be found ... "
    + "with a sentence! Of course, we may also find the "
    + "use of abbreviations such as Mr. Smith or "
    + "Dr. Jones.";
TokenizerFactory tokenizerFactory =
    IndoEuropeanTokenizerFactory.INSTANCE;
SentenceModel sentenceModel = new IndoEuropeanSentenceModel();
  1. Add the following lists, which will hold the tokens and the whitespace between them:
List<String> tokenList = new ArrayList<>();
List<String> whiteList = new ArrayList<>();
  1. Insert the next code sequence to populate these lists and find the sentence boundaries:
Tokenizer tokenizer = tokenizerFactory.tokenizer(
    text.toCharArray(), 0, text.length());
tokenizer.tokenize(tokenList, whiteList);

int[] sentenceBoundaries = sentenceModel.boundaryIndices(
    tokenList.toArray(new String[tokenList.size()]),
    whiteList.toArray(new String[whiteList.size()]));
  1. Add the following code to display the detected sentences:
// Each boundary is the index of the last token in a sentence
int start = 0;
for (int boundary : sentenceBoundaries) {
    System.out.print("[");
    while (start <= boundary) {
        // whiteList.get(start + 1) is the whitespace after token start
        System.out.print(tokenList.get(start)
            + whiteList.get(start + 1));
        start++;
    }
    System.out.println("]");
}
  1. Execute the program. It displays the following output:
[We will start with a simple sentence. ]
[However, is it possible for a sentence to end with a question mark? ]
[Obviously that is possible! ]
[Another complication is the use of a number such as 56.32 or ellipses such as ... Ellipses may be found ... with a sentence! ]
[Of course, we may also find the use of abbreviations such as Mr. Smith or Dr. Jones.]
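
The display loop in the steps above relies on how the token list, the whitespace list, and the boundary indices line up. The following is a minimal sketch of that bookkeeping only, with no LingPipe dependency. The class `SentenceJoiner`, its `joinSentences` method, and the hand-built arrays are hypothetical and are not part of the recipe. It assumes the whitespace array has one more entry than the token array, with entry `i + 1` following token `i`, which is how the recipe's loop indexes `whiteList`:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of LingPipe or the recipe
class SentenceJoiner {

    // Rebuilds sentences from parallel token and whitespace arrays.
    // Assumption: whites.length == tokens.length + 1, where whites[i + 1]
    // is the whitespace that follows tokens[i].
    // Each entry in boundaries is the index of a sentence-final token.
    public static List<String> joinSentences(
            String[] tokens, String[] whites, int[] boundaries) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int boundary : boundaries) {
            StringBuilder sentence = new StringBuilder();
            while (start <= boundary) {
                sentence.append(tokens[start]).append(whites[start + 1]);
                start++;
            }
            sentences.add(sentence.toString());
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Hand-built data standing in for the tokenizer and sentence model
        String[] tokens = {"Is", "it", "possible", "?", "Obviously", "!"};
        String[] whites = {"", " ", " ", "", " ", "", ""};
        int[] boundaries = {3, 5};
        for (String sentence : joinSentences(tokens, whites, boundaries)) {
            System.out.println("[" + sentence + "]");
        }
    }
}
```

Because `start` is never reset inside the loop, each sentence begins at the token after the previous boundary. This is also why every bracketed sentence except the last ends with a trailing space: the whitespace after the sentence-final token is appended along with it.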