Training a model

We will use OpenNLP to demonstrate how a model is trained. The training file used must:

  • Contain marks to demarcate the entities
  • Have one sentence per line

We will use the following training file named en-ner-person.train:

<START:person> Joe <END> was the last person to see <START:person> Fred <END>. 
He saw him in Boston at McKenzie's pub at 3:00 where he paid $2.45 for an ale. 
<START:person> Joe <END> wanted to go to Vermont for the day to visit a cousin who works at IBM, but <START:person> Sally <END> and he had to look for <START:person> Fred <END>.
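The <START:person> … <END> markers can be pulled out with a simple regular expression. The following stdlib-only sketch is purely illustrative (OpenNLP's own parsing is done later by NameSampleDataStream); it extracts the marked person names from a single training line:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MarkupSketch {
    // Matches "<START:person> name <END>" and captures the name text
    private static final Pattern PERSON =
        Pattern.compile("<START:person>\\s*(.*?)\\s*<END>");

    static List<String> extractPersons(String line) {
        List<String> names = new ArrayList<>();
        Matcher m = PERSON.matcher(line);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        String line = "<START:person> Joe <END> was the last person to see "
                    + "<START:person> Fred <END>.";
        System.out.println(extractPersons(line));  // [Joe, Fred]
    }
}
```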

Several methods used in this example can throw exceptions. These statements will be placed in a try-with-resources block as shown here, where the model's output stream is created:

try (OutputStream modelOutputStream = new BufferedOutputStream(
        new FileOutputStream(new File("modelFile")))) {
    ...
} catch (IOException ex) {
    // Handle exception
}

Within the block, we create an ObjectStream<String> object using the PlainTextByLineStream class. Its constructor takes a FileInputStream instance, and the resulting stream returns each line of the file as a String object. The en-ner-person.train file is used as the input file, as shown here. The "UTF-8" string specifies the character encoding used:

ObjectStream<String> lineStream = new PlainTextByLineStream(
    new FileInputStream("en-ner-person.train"), "UTF-8");

The lineStream object streams lines annotated with tags delineating the entities in the text. These need to be converted to NameSample objects so that the model can be trained. This conversion is performed by the NameSampleDataStream class as shown here. A NameSample object holds the names of the entities found in the text:

ObjectStream<NameSample> sampleStream = 
    new NameSampleDataStream(lineStream);

The train method can now be executed as follows:

TokenNameFinderModel model = NameFinderME.train(
    "en", "person",  sampleStream, 
    Collections.<String, Object>emptyMap(), 100, 5);

The arguments of the method are detailed in the following table:

Parameter                                 Meaning
"en"                                      Language code
"person"                                  Entity type
sampleStream                              Sample data
Collections.<String, Object>emptyMap()    Resources (none are used in this example)
100                                       The number of iterations
5                                         The cutoff
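The cutoff sets the minimum number of times a feature (predicate) must occur in the training data to be retained; rarer features are discarded during indexing. The following stdlib-only sketch illustrates that filtering idea (it is not OpenNLP's actual indexer, and the feature strings are made up for the example):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

public class CutoffSketch {
    // Keep only features whose occurrence count reaches the cutoff
    static Set<String> applyCutoff(List<String> features, int cutoff) {
        Map<String, Integer> counts = new HashMap<>();
        for (String f : features) {
            counts.merge(f, 1, Integer::sum);
        }
        return counts.entrySet().stream()
            .filter(e -> e.getValue() >= cutoff)
            .map(Map.Entry::getKey)
            .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        List<String> features = Arrays.asList(
            "prev=the", "prev=the", "prev=the", "cap=true", "cap=true");
        // With a cutoff of 3, only "prev=the" survives
        System.out.println(applyCutoff(features, 3));  // [prev=the]
    }
}
```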

The model is then serialized to an output file:

model.serialize(modelOutputStream);

The output of this sequence is as follows. It has been shortened to conserve space. Basic information about the model creation is detailed:

Indexing events using cutoff of 5

  Computing event counts...  done. 53 events
  Indexing...  done.
Sorting and merging events... done. Reduced 53 events to 46.
Done indexing.
Incorporating indexed data for training...  
done.
  Number of Event Tokens: 46
      Number of Outcomes: 2
    Number of Predicates: 34
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-36.73680056967707  0.05660377358490566
  2:  ... loglikelihood=-17.499660626361216  0.9433962264150944
  3:  ... loglikelihood=-13.216835449617108  0.9433962264150944
  4:  ... loglikelihood=-11.461783667999262  0.9433962264150944
  5:  ... loglikelihood=-10.380239416084963  0.9433962264150944
  6:  ... loglikelihood=-9.570622475692486  0.9433962264150944
  7:  ... loglikelihood=-8.919945779143012  0.9433962264150944
...
 99:  ... loglikelihood=-3.513810438211968  0.9622641509433962
100:  ... loglikelihood=-3.507213816708068  0.9622641509433962

Evaluating a model

The model can be evaluated using the TokenNameFinderEvaluator class. The evaluation process uses marked-up sample text. For this simple example, a file called en-ner-person.eval was created containing the following text:

<START:person> Bill <END> went to the farm to see <START:person> Sally <END>. 
Unable to find <START:person> Sally <END> he went to town.
There he saw <START:person> Fred <END> who had seen <START:person> Sally <END> at the book store with <START:person> Mary <END>.

The following code is used to perform the evaluation. The previous model is used as the argument of the TokenNameFinderEvaluator constructor. A NameSampleDataStream instance is created based on the evaluation file. The TokenNameFinderEvaluator class' evaluate method performs the evaluation:

TokenNameFinderEvaluator evaluator = 
    new TokenNameFinderEvaluator(new NameFinderME(model));    
lineStream = new PlainTextByLineStream(
    new FileInputStream("en-ner-person.eval"), "UTF-8");
sampleStream = new NameSampleDataStream(lineStream);
evaluator.evaluate(sampleStream);

To determine how well the model worked with the evaluation data, the getFMeasure method is executed. The results are then displayed:

FMeasure result = evaluator.getFMeasure();
System.out.println(result.toString());

The following output displays the precision, recall, and F-measure. The precision indicates that 50 percent of the entities found exactly match the evaluation data. The recall is the percentage of entities defined in the corpus that were found in the same location. The F-measure is the harmonic mean of precision and recall, defined as: F1 = 2 * Precision * Recall / (Precision + Recall)

Precision: 0.5
Recall: 0.25
F-Measure: 0.3333333333333333
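The reported F-measure can be checked by hand from the precision and recall values. The following self-contained sketch applies the harmonic-mean formula given above (it recomputes the numbers rather than using OpenNLP's FMeasure class):

```java
public class FMeasureCheck {
    // Harmonic mean of precision and recall: F1 = 2PR / (P + R)
    static double f1(double precision, double recall) {
        if (precision + recall == 0.0) {
            return 0.0;
        }
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // The precision and recall values reported by getFMeasure above
        System.out.println(f1(0.5, 0.25));  // 0.3333333333333333
    }
}
```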

The training and evaluation sets should be much larger to create a better model. The intent here was to demonstrate the basic approach used to train and evaluate an NER model.
