How it works...

An InputStreamFactory instance is needed to supply the training data when training the model. An anonymous inner class is a convenient way of creating one.

The training-data.train file contains the training data. Each entity of type location is enclosed between a <START:location> tag and an <END> tag. The file holds only a small training sample, which is enough for this example. A larger and more varied file would give the trainer more data to calibrate the model and would result in a more accurate model.
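To make the annotation format concrete, the following is a minimal sketch that lists the entities marked up in a single training line. It is only an illustration: the AnnotatedLineReader class, its regular expression, and the sample sentence are assumptions made for this example and are not part of OpenNLP or of the recipe's training file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative helper (not part of OpenNLP) that lists the entities annotated in one line. */
final class AnnotatedLineReader {

    // Matches "<START:type> entity text <END>" and captures the type and the text.
    private static final Pattern ENTITY =
        Pattern.compile("<START:(\\w+)>\\s*(.+?)\\s*<END>");

    /** Returns each annotated entity as "type=text", in order of appearance. */
    static List<String> entities(String line) {
        List<String> found = new ArrayList<>();
        Matcher matcher = ENTITY.matcher(line);
        while (matcher.find()) {
            found.add(matcher.group(1) + "=" + matcher.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical training line; the recipe's actual file is not reproduced here.
        String line = "The train leaves <START:location> Boston <END> at noon "
                    + "and reaches <START:location> New York <END> by evening .";
        System.out.println(entities(line)); // prints [location=Boston, location=New York]
    }
}
```

Scanning a training file this way is a quick check that every <START:location> tag has a matching <END> tag before the file is handed to the trainer.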

The first try block is duplicated in the following code snippet for your convenience. It was used to create three streams:

The first statement created an OutputStream instance for the model. The model was later serialized through this stream to the location-model.bin file, which is placed in the root directory of the project.
The second statement created an ObjectStream instance consisting of the individual lines of the training file.
The last statement created an ObjectStream instance of NameSample objects. This puts the training data into the correct format to train the model:

try (OutputStream modelOutputStream = new BufferedOutputStream(
         new FileOutputStream(new File("location-model.bin")));
     ObjectStream<String> stringStream = new PlainTextByLineStream(
         inputStreamFactory, "UTF-8");
     ObjectStream<NameSample> nameSampleStream =
         new NameSampleDataStream(stringStream)) {
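Because the three streams are declared in the header of a try-with-resources statement, Java closes them automatically when the block exits, in the reverse order of their declaration. The following sketch demonstrates that behavior with a stand-in resource. The CloseOrderDemo and Tracked classes are hypothetical and exist only for this illustration; they reuse the recipe's variable names as labels and do not touch OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch: records the order in which try-with-resources closes its resources. */
final class CloseOrderDemo {

    /** A stand-in resource (hypothetical) that logs its name when it is closed. */
    static final class Tracked implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Tracked(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close() {
            log.add(name);
        }
    }

    /** Declares three resources in the recipe's order and returns the order they were closed in. */
    static List<String> closeOrder() {
        List<String> log = new ArrayList<>();
        try (Tracked model = new Tracked("modelOutputStream", log);
             Tracked lines = new Tracked("stringStream", log);
             Tracked samples = new Tracked("nameSampleStream", log)) {
            // Training and serialization would happen here.
        }
        return log;
    }

    public static void main(String[] args) {
        // prints [nameSampleStream, stringStream, modelOutputStream]
        System.out.println(closeOrder());
    }
}
```

In the recipe, this ordering means the sample stream is closed before the line stream it wraps, and the model's output stream is closed last.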

The train method performed the actual training. It used five parameters:

The first parameter specifies the language of the training text. In this example, we used English ("en").
The second parameter is the name of the entity type that the model is trained to find, which is LOCATION in this example.
The third parameter is the ObjectStream instance holding the training data.
The fourth parameter specifies the training parameters used during the training process. We used the default set of training parameters.
The last parameter is an instance of the TokenNameFinderFactory class.

We used the train method as shown next:

TokenNameFinderModel locationModel =
    NameFinderME.train("en", "LOCATION", nameSampleStream,
        TrainingParameters.defaultParams(), new TokenNameFinderFactory());

The serialize method was then executed to save the model to the model file:

locationModel.serialize(modelOutputStream);

The second try block tested the serialized model. A detailed explanation of this technique can be found in the recipe on using OpenNLP to find entities in text.

Notice that the model detected only one of the two cities in the sample string. It missed the city of Quebec. Using a more comprehensive set of training data can help overcome this problem.
