Training a Sentence Detector model

We will use OpenNLP's SentenceDetectorME class to illustrate the training process. This class has a static train method that uses sample sentences found in a file. The method returns a model that is usually serialized to a file for later use.

Models use specially annotated data to clearly specify where a sentence ends. Frequently, a large file is used to provide a good sample for training purposes. Part of the file is used for training purposes, and the rest is used to verify the model after it has been trained.

The training file used by OpenNLP consists of one sentence per line. Usually, at least 10 to 20 sample sentences are needed to avoid processing errors. To demonstrate the process, we will use a file called sentence.train. It consists of Chapter 5 of Twenty Thousand Leagues Under the Sea by Jules Verne. The text of the book can be found at http://www.gutenberg.org/files/164/164-h/164-h.htm#chap05. The file can be downloaded from www.packtpub.com.
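The one-sentence-per-line format is easy to produce programmatically. As a minimal sketch using only the standard library (the three sample sentences are placeholders, not the actual Verne text), the following code writes such a training file:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class CreateTrainingFile {
    public static void main(String[] args) throws IOException {
        // Each line of the training file holds exactly one sentence.
        List<String> sentences = List.of(
            "The Nautilus sailed on.",
            "Captain Nemo said nothing.",
            "We waited for his answer.");
        Path file = Path.of("sentence.train");
        Files.write(file, sentences);
        System.out.println(Files.readAllLines(file).size() + " sentences written");
    }
}
```

A real training file would, of course, contain many more sentences drawn from text similar to what the model will process.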

A FileReader object is used to open the file. This object is used as the argument of the PlainTextByLineStream constructor. The stream that results consists of a string for each line of the file. This is used as the argument of the SentenceSampleStream constructor, which converts the sentence strings to SentenceSample objects. These objects hold the beginning index of each sentence. This process is shown next, where the statements are enclosed in a try block to handle exceptions that may be thrown by these statements:

try {
    ObjectStream<String> lineStream = new PlainTextByLineStream(
        new FileReader("sentence.train"));
    ObjectStream<SentenceSample> sampleStream
        = new SentenceSampleStream(lineStream);
    ...
} catch (FileNotFoundException ex) {
    // Handle exception
} catch (IOException ex) {
    // Handle exception
}

Now the train method can be used like this:

SentenceModel model = SentenceDetectorME.train("en", sampleStream, true,
    null, TrainingParameters.defaultParams());

The output of the method is a trained model. The parameters of this method are detailed in the following table:

Parameter                             Meaning
"en"                                  Specifies that the language of the text is English
sampleStream                          The training text stream
true                                  Specifies whether end-of-sentence tokens should be used
null                                  A dictionary of abbreviations (none is used here)
TrainingParameters.defaultParams()    Specifies that the default training parameters should be used

In the following sequence, an OutputStream is created and used to save the model in the modelFile file. This allows the model to be reused for other applications:

OutputStream modelStream = new BufferedOutputStream(
    new FileOutputStream("modelFile"));
model.serialize(modelStream);
modelStream.close();

The output of this process is as follows. Not all of the iterations are shown here, to save space. By default, training uses an indexing cutoff of 5 and 100 iterations:

Indexing events using cutoff of 5

    Computing event counts...  done. 93 events
    Indexing...  done.
Sorting and merging events... done. Reduced 93 events to 63.
Done indexing.
Incorporating indexed data for training...  
done.
    Number of Event Tokens: 63
        Number of Outcomes: 2
      Number of Predicates: 21
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-64.4626877920749    0.9032258064516129
  2:  ... loglikelihood=-31.11084296202819    0.9032258064516129
  3:  ... loglikelihood=-26.418795734248626    0.9032258064516129
  4:  ... loglikelihood=-24.327956749903198    0.9032258064516129
  5:  ... loglikelihood=-22.766489585258565    0.9032258064516129
  6:  ... loglikelihood=-21.46379347841989    0.9139784946236559
  7:  ... loglikelihood=-20.356036369911394    0.9139784946236559
  8:  ... loglikelihood=-19.406935608514992    0.9139784946236559
  9:  ... loglikelihood=-18.58725539754483    0.9139784946236559
 10:  ... loglikelihood=-17.873030559849326    0.9139784946236559
 ...
 99:  ... loglikelihood=-7.214933901940582    0.978494623655914
100:  ... loglikelihood=-7.183774954664058    0.978494623655914

Using the trained model

We can then use the model as illustrated in the next code sequence. This is based on the techniques illustrated in Using the SentenceDetectorME class earlier in this chapter:

try (InputStream is = new FileInputStream(
        new File(getModelDir(), "modelFile"))) {
    SentenceModel model = new SentenceModel(is);
    SentenceDetectorME detector = new SentenceDetectorME(model);
    String[] sentences = detector.sentDetect(paragraph);
    for (String sentence : sentences) {
        System.out.println(sentence);
    }
} catch (FileNotFoundException ex) {
    // Handle exception
} catch (IOException ex) {
    // Handle exception
}

The output is as follows:

When determining the end of sentences we need to consider several factors.
Sentences may end with exclamation marks! Or possibly questions marks?
Within sentences we may find numbers like 3.14159,
abbreviations such as found in Mr.
Smith, and possibly ellipses either within a sentence …, or at the end of a sentence…

This model did not process the last sentence very well, which reflects a mismatch between the sample text and the text against which the model is used. Using relevant training data is important; otherwise, downstream tasks based on this output will suffer.

Evaluating the model using the SentenceDetectorEvaluator class

We reserved a part of the sample file for evaluation purposes so that we can use the SentenceDetectorEvaluator class to evaluate the model. We modified the sentence.train file by extracting the last ten sentences and placing them in a file called evalSample. Then we used this file to evaluate the model. In the next example, we've reused the lineStream and sampleStream variables to create a stream of SentenceSample objects based on the file's contents:

lineStream = new PlainTextByLineStream(new FileReader("evalSample"));
sampleStream = new SentenceSampleStream(lineStream);
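The split described above, where the last ten sentences are moved into a separate evaluation file, can be sketched with standard-library code alone (the file names sentence.train and evalSample come from the text; the helper method itself is our own). The main method below sets up a demo file of 15 placeholder sentences so that the split can be observed:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class SplitTrainingFile {
    // Moves the last n lines of the training file into a separate evaluation file.
    static void split(Path train, Path eval, int n) throws IOException {
        List<String> lines = Files.readAllLines(train);
        int cut = Math.max(0, lines.size() - n);
        Files.write(eval, lines.subList(cut, lines.size()));
        Files.write(train, lines.subList(0, cut));
    }

    public static void main(String[] args) throws IOException {
        Path train = Path.of("sentence.train");
        Path eval = Path.of("evalSample");
        // Demo setup: 15 placeholder sentences, one per line (stand-ins for the Verne text).
        List<String> demo = new ArrayList<>();
        for (int i = 1; i <= 15; i++) {
            demo.add("Sentence number " + i + ".");
        }
        Files.write(train, demo);
        split(train, eval, 10);
        System.out.println(Files.readAllLines(train).size() + " training, "
            + Files.readAllLines(eval).size() + " evaluation sentences");
    }
}
```

Keeping the evaluation sentences out of the training file matters: evaluating a model on sentences it was trained on would overstate its quality.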

An instance of the SentenceDetectorEvaluator class is created using the SentenceDetectorME instance, detector, that was created previously. The second argument of the constructor is a SentenceDetectorEvaluationMonitor object, which we will not use here. Then the evaluate method is called:

SentenceDetectorEvaluator sentenceDetectorEvaluator
    = new SentenceDetectorEvaluator(detector, null);
sentenceDetectorEvaluator.evaluate(sampleStream);

The getFMeasure method will return an instance of the FMeasure class that provides measurements of the quality of the model:

System.out.println(sentenceDetectorEvaluator.getFMeasure());

The output follows. Precision is the fraction of detected sentences that are correct, and recall is the fraction of actual sentences that are detected. The F-measure combines precision and recall into a single score that reflects how well the model works overall. It is best to keep the precision above 90 percent for tokenization and SBD tasks:

Precision: 0.8181818181818182
Recall: 0.9
F-Measure: 0.8571428571428572
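The reported F-measure can be reproduced by hand: it is the harmonic mean of precision and recall. A minimal standard-library check (the class and method names here are ours, not part of OpenNLP):

```java
public class FMeasureCheck {
    // F-measure (F1) is the harmonic mean of precision and recall.
    static double fMeasure(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        double precision = 9.0 / 11.0;  // 0.8181..., as reported above
        double recall = 0.9;
        System.out.println(fMeasure(precision, recall));  // ~0.8571, matching the output
    }
}
```

Because it is a harmonic mean, the F-measure is pulled toward the lower of the two values, so a model cannot hide a poor precision behind a high recall.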