We will use OpenNLP to demonstrate how a model is trained. The training file must contain one sentence per line, with each entity marked by <START:person> and <END> tags separated from the surrounding tokens by spaces.
We will use the following training file, named en-ner-person.train:
<START:person> Joe <END> was the last person to see <START:person> Fred <END>. He saw him in Boston at McKenzie's pub at 3:00 where he paid $2.45 for an ale. <START:person> Joe <END> wanted to go to Vermont for the day to visit a cousin who works at IBM, but <START:person> Sally <END> and he had to look for <START:person> Fred <END>.
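To make the annotation format concrete, the following sketch uses a plain Java regular expression (not part of OpenNLP) to pull the entity names out of a marked-up line such as the one above. The class and method names are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MarkupDemo {
    // Extracts the text between <START:type> and <END> markers,
    // trimming the spaces that separate the tags from the tokens.
    static List<String> extractEntities(String line, String type) {
        List<String> names = new ArrayList<>();
        Matcher m = Pattern
            .compile("<START:" + type + ">\\s*(.*?)\\s*<END>")
            .matcher(line);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        String line = "<START:person> Joe <END> was the last person to see "
                + "<START:person> Fred <END>.";
        System.out.println(extractEntities(line, "person")); // prints [Joe, Fred]
    }
}
```

This is the same markup that NameSampleDataStream parses for us later in the example.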
Several methods of this example are capable of throwing exceptions. These statements will be placed in a try-with-resources block as shown here, where the model's output stream is created:

try (OutputStream modelOutputStream = new BufferedOutputStream(
        new FileOutputStream(new File("modelFile")))) {
    ...
} catch (IOException ex) {
    // Handle exception
}
Within the block, we create an ObjectStream<String> object using the PlainTextByLineStream class. This class' constructor takes a FileInputStream instance and returns each line as a String object. The en-ner-person.train file is used as the input file, as shown here. The "UTF-8" string specifies the character encoding used:

ObjectStream<String> lineStream = new PlainTextByLineStream(
    new FileInputStream("en-ner-person.train"), "UTF-8");
The lineStream object contains strings annotated with tags delineating the entities in the text. These need to be converted to NameSample objects so that the model can be trained. This conversion is performed by the NameSampleDataStream class, as shown here. A NameSample object holds the names of the entities found in the text:

ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);
The train method can now be executed as follows:

TokenNameFinderModel model = NameFinderME.train(
    "en", "person", sampleStream,
    Collections.<String, Object>emptyMap(), 100, 5);
The arguments of the method are detailed in the following table:

| Parameter | Meaning |
|---|---|
| "en" | Language code |
| "person" | Entity type |
| sampleStream | Sample data |
| Collections.<String, Object>emptyMap() | Resources |
| 100 | The number of iterations |
| 5 | The cutoff |
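The cutoff controls which features survive indexing: any feature (predicate) seen fewer times than the cutoff is discarded before the model parameters are computed, which is why the training log below reports a reduced event count. The following sketch illustrates the idea with a plain frequency filter; it is not OpenNLP's implementation, and all names in it are illustrative:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CutoffDemo {
    // Keeps only features whose corpus frequency meets the cutoff;
    // rarer features are dropped before training.
    static List<String> applyCutoff(List<String> features, int cutoff) {
        Map<String, Integer> counts = new HashMap<>();
        for (String f : features) {
            counts.merge(f, 1, Integer::sum);
        }
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= cutoff) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> features =
            List.of("w=Joe", "w=Joe", "w=Joe", "w=ale", "w=Joe", "w=Joe");
        // "w=Joe" occurs 5 times and survives a cutoff of 5; "w=ale" does not.
        System.out.println(applyCutoff(features, 5)); // prints [w=Joe]
    }
}
```

A higher cutoff produces a smaller, more robust model at the cost of ignoring rare evidence.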
The model is then serialized to an output file:
model.serialize(modelOutputStream);
The output of this sequence is as follows. It has been shortened to conserve space. Basic information about the model creation is detailed:
Indexing events using cutoff of 5
Computing event counts...  done. 53 events
Indexing...  done.
Sorting and merging events... done. Reduced 53 events to 46.
Done indexing.
Incorporating indexed data for training...  done.
Number of Event Tokens: 46
Number of Outcomes: 2
Number of Predicates: 34
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-36.73680056967707  0.05660377358490566
  2:  ... loglikelihood=-17.499660626361216  0.9433962264150944
  3:  ... loglikelihood=-13.216835449617108  0.9433962264150944
  4:  ... loglikelihood=-11.461783667999262  0.9433962264150944
  5:  ... loglikelihood=-10.380239416084963  0.9433962264150944
  6:  ... loglikelihood=-9.570622475692486  0.9433962264150944
  7:  ... loglikelihood=-8.919945779143012  0.9433962264150944
...
 99:  ... loglikelihood=-3.513810438211968  0.9622641509433962
100:  ... loglikelihood=-3.507213816708068  0.9622641509433962
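Once serialized, the model file can be reloaded later and applied to new, pre-tokenized text. The following is a minimal sketch, assuming the OpenNLP 1.5.x library is on the classpath and reusing the "modelFile" name from the serialization step above; the sample tokens are illustrative:

```java
import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class UseModel {
    public static void main(String[] args) throws Exception {
        try (InputStream modelIn = new FileInputStream("modelFile")) {
            // Reload the serialized model and wrap it in a finder.
            TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
            NameFinderME finder = new NameFinderME(model);

            // The finder expects pre-tokenized text.
            String[] tokens = {"Joe", "met", "Fred", "in", "Boston", "."};
            Span[] names = finder.find(tokens);
            for (Span span : names) {
                System.out.println(span.getType() + ": " + tokens[span.getStart()]);
            }
        }
    }
}
```

The Span objects give token offsets into the input array rather than character positions, so the original tokens must be kept to recover the entity text.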
The model can be evaluated using the TokenNameFinderEvaluator class. The evaluation process uses marked-up sample text. For this simple example, a file called en-ner-person.eval was created, containing the following text:
<START:person> Bill <END> went to the farm to see <START:person> Sally <END>. Unable to find <START:person> Sally <END> he went to town. There he saw <START:person> Fred <END> who had seen <START:person> Sally <END> at the book store with <START:person> Mary <END>.
The following code is used to perform the evaluation. The previous model is used as the argument of the TokenNameFinderEvaluator constructor. A NameSampleDataStream instance is created based on the evaluation file. The TokenNameFinderEvaluator class' evaluate method performs the evaluation:

TokenNameFinderEvaluator evaluator =
    new TokenNameFinderEvaluator(new NameFinderME(model));
lineStream = new PlainTextByLineStream(
    new FileInputStream("en-ner-person.eval"), "UTF-8");
sampleStream = new NameSampleDataStream(lineStream);
evaluator.evaluate(sampleStream);
To determine how well the model worked with the evaluation data, the getFMeasure method is executed and the results are displayed:

FMeasure result = evaluator.getFMeasure();
System.out.println(result.toString());
The following output displays the precision, recall, and F-measure. The precision indicates that 50 percent of the entities found exactly match the evaluation data. The recall is the percentage of entities defined in the corpus that were found in the same location. The F-measure is the harmonic mean of precision and recall, defined as: F1 = 2 * Precision * Recall / (Recall + Precision)
Precision: 0.5
Recall: 0.25
F-Measure: 0.3333333333333333
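The reported F-measure can be checked by hand with the harmonic-mean formula. The following sketch (plain Java, independent of OpenNLP) applies the formula to the precision and recall values shown above:

```java
public class FMeasureCheck {
    // Harmonic mean of precision and recall: 2PR / (P + R).
    static double f1(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        double precision = 0.5;  // values reported by the evaluator above
        double recall = 0.25;
        System.out.println(f1(precision, recall)); // prints 0.3333333333333333
    }
}
```

With precision 0.5 and recall 0.25, the numerator is 0.25 and the denominator 0.75, giving the 0.3333... figure in the output.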
The data and evaluation sets should be much larger to create a better model. The intent here was to demonstrate the basic approach used to train and evaluate an NER model.