We will use OpenNLP's SentenceDetectorME class to illustrate the training process. This class has a static train method that uses sample sentences found in a file. The method returns a model that is usually serialized to a file for later use.
Models use specially annotated data to clearly specify where a sentence ends. Frequently, a large file is used to provide a good training sample. Part of the file is used for training, and the rest is used to verify the model after it has been trained.
The training file used by OpenNLP consists of one sentence per line. Usually, at least 10 to 20 sample sentences are needed to avoid processing errors. To demonstrate the process, we will use a file called sentence.train. It consists of Chapter 5 of Twenty Thousand Leagues Under the Sea by Jules Verne. The text of the book can be found at http://www.gutenberg.org/files/164/164-h/164-h.htm#chap05. The file can be downloaded from www.packtpub.com.
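The one-sentence-per-line layout can be sketched with a few lines of plain Java. The sentences below are placeholders, not the actual Project Gutenberg text, and the file name is a temporary stand-in for sentence.train:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class TrainFileSketch {
    public static void main(String[] args) throws IOException {
        // Each training sentence occupies exactly one line of the file.
        List<String> sentences = List.of(
            "The sea was calm that morning.",
            "Captain Nemo studied the charts.",
            "We dove at dawn.");
        Path trainFile = Files.createTempFile("sentence", ".train");
        Files.write(trainFile, sentences);

        // Reading the file back yields one sentence per line.
        List<String> lines = Files.readAllLines(trainFile);
        System.out.println(lines.size()); // 3
        Files.delete(trainFile);
    }
}
```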
A FileReader object is used to open the file. This object is used as the argument of the PlainTextByLineStream constructor. The resulting stream consists of a string for each line of the file. This stream is used as the argument of the SentenceSampleStream constructor, which converts the sentence strings to SentenceSample objects. These objects hold the beginning index of each sentence. This process is shown next, where the statements are enclosed in a try block to handle exceptions that may be thrown by these statements:
try {
    ObjectStream<String> lineStream = new PlainTextByLineStream(
        new FileReader("sentence.train"));
    ObjectStream<SentenceSample> sampleStream =
        new SentenceSampleStream(lineStream);
    ...
} catch (FileNotFoundException ex) {
    // Handle exception
} catch (IOException ex) {
    // Handle exception
}
Now the train method can be used like this:
SentenceModel model = SentenceDetectorME.train(
    "en", sampleStream, true, null,
    TrainingParameters.defaultParams());
The output of the method is a trained model. The parameters of this method are detailed in the following table:
Parameter | Meaning
---|---
"en" | The language code of the training text (English)
sampleStream | The training text stream
true | Specifies whether end tokens should be used
null | A dictionary for abbreviations
TrainingParameters.defaultParams() | Specifies that the default training parameters should be used
In the following sequence, an OutputStream is created and used to save the model in the modelFile file. This allows the model to be reused by other applications:
OutputStream modelStream = new BufferedOutputStream(
    new FileOutputStream("modelFile"));
model.serialize(modelStream);
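The sequence above does not close the stream, so buffered bytes may never reach the file. A safer variant wraps the stream in try-with-resources so it is flushed and closed even when an exception is thrown. This is only a sketch of the I/O pattern: the placeholder bytes and temporary file stand in for the real call to model.serialize:

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class SerializeSketch {
    public static void main(String[] args) throws IOException {
        Path modelFile = Files.createTempFile("modelFile", ".bin");
        // Placeholder bytes; the real code would call model.serialize(modelStream).
        byte[] modelBytes = {0x4D, 0x4F, 0x44, 0x45, 0x4C};

        // try-with-resources flushes and closes the buffered stream automatically.
        try (OutputStream modelStream = new BufferedOutputStream(
                Files.newOutputStream(modelFile))) {
            modelStream.write(modelBytes);
        }

        System.out.println(Files.size(modelFile)); // 5
        Files.delete(modelFile);
    }
}
```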
The output of this process is as follows. Not all of the iterations are shown here, to save space. By default, the indexing cutoff is 5 and training runs for 100 iterations:
Indexing events using cutoff of 5
Computing event counts... done. 93 events
Indexing... done.
Sorting and merging events... done. Reduced 93 events to 63.
Done indexing.
Incorporating indexed data for training... done.
Number of Event Tokens: 63
Number of Outcomes: 2
Number of Predicates: 21
...done.
Computing model parameters ...
Performing 100 iterations.
1: ... loglikelihood=-64.4626877920749 0.9032258064516129
2: ... loglikelihood=-31.11084296202819 0.9032258064516129
3: ... loglikelihood=-26.418795734248626 0.9032258064516129
4: ... loglikelihood=-24.327956749903198 0.9032258064516129
5: ... loglikelihood=-22.766489585258565 0.9032258064516129
6: ... loglikelihood=-21.46379347841989 0.9139784946236559
7: ... loglikelihood=-20.356036369911394 0.9139784946236559
8: ... loglikelihood=-19.406935608514992 0.9139784946236559
9: ... loglikelihood=-18.58725539754483 0.9139784946236559
10: ... loglikelihood=-17.873030559849326 0.9139784946236559
...
99: ... loglikelihood=-7.214933901940582 0.978494623655914
100: ... loglikelihood=-7.183774954664058 0.978494623655914
We can then use the model as illustrated in the next code sequence. This is based on the techniques illustrated in Using the SentenceDetectorME class earlier in this chapter:
try (InputStream is = new FileInputStream(
        new File(getModelDir(), "modelFile"))) {
    SentenceModel model = new SentenceModel(is);
    SentenceDetectorME detector = new SentenceDetectorME(model);
    String[] sentences = detector.sentDetect(paragraph);
    for (String sentence : sentences) {
        System.out.println(sentence);
    }
} catch (FileNotFoundException ex) {
    // Handle exception
} catch (IOException ex) {
    // Handle exception
}
The output is as follows:
When determining the end of sentences we need to consider several factors. Sentences may end with exclamation marks! Or possibly questions marks? Within sentences we may find numbers like 3.14159, abbreviations such as found in Mr. Smith, and possibly ellipses either within a sentence …, or at the end of a sentence…
This model did not process the last sentence very well, which reflects a mismatch between the sample text and the text the model is used against. Using relevant training data is important. Otherwise, downstream tasks based on this output will suffer.
We reserved a part of the sample file for evaluation purposes so that we can use the SentenceDetectorEvaluator class to evaluate the model. We modified the sentence.train file by extracting the last ten sentences and placing them in a file called evalSample. Then we used this file to evaluate the model. In the next example, we've reused the lineStream and sampleStream variables to create a stream of SentenceSample objects based on the file's contents:
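Because the training file holds one sentence per line, the split we describe, moving the last ten lines of sentence.train into evalSample, can also be scripted instead of done by hand. The following sketch uses java.nio with temporary files and dummy sentences standing in for the real data:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class SplitSample {
    // Move the last holdOut lines of the training file into the evaluation file.
    static void split(Path train, Path eval, int holdOut) throws IOException {
        List<String> lines = Files.readAllLines(train);
        int cut = Math.max(0, lines.size() - holdOut);
        Files.write(eval, lines.subList(cut, lines.size()));
        Files.write(train, lines.subList(0, cut));
    }

    public static void main(String[] args) throws IOException {
        Path train = Files.createTempFile("sentence", ".train");
        Path eval = Files.createTempFile("evalSample", ".txt");

        // Fifteen dummy one-sentence lines stand in for the real training data.
        List<String> dummy = new ArrayList<>();
        for (int i = 1; i <= 15; i++) {
            dummy.add("Sentence number " + i + ".");
        }
        Files.write(train, dummy);

        split(train, eval, 10);
        System.out.println(Files.readAllLines(train).size()); // 5
        System.out.println(Files.readAllLines(eval).size());  // 10

        Files.delete(train);
        Files.delete(eval);
    }
}
```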
lineStream = new PlainTextByLineStream(
    new FileReader("evalSample"));
sampleStream = new SentenceSampleStream(lineStream);
An instance of the SentenceDetectorEvaluator class is created using the previously created SentenceDetectorME variable, detector. The second argument of the constructor is a SentenceDetectorEvaluationMonitor object, which we will not use here. Then the evaluate method is called:
SentenceDetectorEvaluator sentenceDetectorEvaluator =
    new SentenceDetectorEvaluator(detector, null);
sentenceDetectorEvaluator.evaluate(sampleStream);
The getFMeasure method will return an instance of the FMeasure class that provides measurements of the quality of the model:
System.out.println(sentenceDetectorEvaluator.getFMeasure());
The output follows. Precision is the fraction of detected sentences that are correct, and recall is the fraction of the actual sentences that the model detects. The F-measure is the harmonic mean of precision and recall, and reflects how well the model works overall. For tokenization and SBD tasks, it is best to keep the precision above 90 percent:
Precision: 0.8181818181818182
Recall: 0.9
F-Measure: 0.8571428571428572
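The printed F-measure can be reproduced from the precision and recall values with a quick check in plain Java, since it is simply their harmonic mean:

```java
public class FMeasureCheck {
    public static void main(String[] args) {
        double precision = 0.8181818181818182;
        double recall = 0.9;
        // F-measure is the harmonic mean of precision and recall.
        double f = 2 * precision * recall / (precision + recall);
        System.out.println(f); // approximately 0.857142857...
    }
}
```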