We will use OpenNLP, Stanford API, and LingPipe to demonstrate various classification approaches. We will spend more time with LingPipe as it offers several different classification approaches.
The DocumentCategorizer interface specifies methods to support the classification process. The interface is implemented by the DocumentCategorizerME class, which classifies text into predefined categories using a maximum entropy framework.
First, we have to train our model because OpenNLP does not have prebuilt categorization models. This process consists of creating a file of training data and then using the DocumentCategorizerME class to perform the actual training. The resulting model is typically saved to a file for later use.
The training file format consists of a series of lines where each line represents a document. The first word of the line is the category, followed by text separated by whitespace. Here is an example for the dog category:
dog The most interesting feature of a dog is its ...
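This line format can be sketched with a small helper method (the trainingLine name is ours, for illustration only, and not part of OpenNLP): it simply prefixes the category to the whitespace-separated text.

```java
// Sketch of the OpenNLP document-categorizer training format:
// each line is "<category> <whitespace-separated text>".
// The trainingLine helper is hypothetical, for illustration only.
public class TrainingLine {
    public static String trainingLine(String category, String text) {
        // Collapse internal whitespace so the sample stays on one line
        return category + " " + text.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(trainingLine("dog",
            "The most interesting feature of a dog is its loyalty."));
    }
}
```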
To demonstrate the training process, we created the en-animal.train file with two categories: cats and dogs. For the training text, we used sections of Wikipedia. For dogs (http://en.wikipedia.org/wiki/Dog), we used the As Pets section. For cats (http://en.wikipedia.org/wiki/Cats_and_humans), we used the Pet section plus the first paragraph of the Domesticated varieties section. We also removed the numeric references from the sections.
The first part of each line is shown here:
dog The most widespread form of interspecies bonding occurs ...
dog There have been two major trends in the changing status of ...
dog There are a vast range of commodity forms available to ...
dog An Australian Cattle Dog in reindeer antlers sits on Santa's lap ...
dog A pet dog taking part in Christmas traditions ...
dog The majority of contemporary people with dogs describe their ...
dog Another study of dogs' roles in families showed many dogs have ...
dog According to statistics published by the American Pet Products ...
dog The latest study using Magnetic resonance imaging (MRI) ...
cat Cats are common pets in Europe and North America, and their ...
cat Although cat ownership has commonly been associated ...
cat The concept of a cat breed appeared in Britain during ...
cat Cats come in a variety of colors and patterns. These are physical ...
cat A natural behavior in cats is to hook their front claws periodically ...
cat Although scratching can serve cats to keep their claws from growing ...
When creating training data, it is important to use a large enough sample size. The data we used is not sufficient for some analysis. However, as we will see, it does a pretty good job of identifying the categories correctly.
The DoccatModel class supports categorization and classification of text. A model is trained using the train method based on annotated text. The train method takes a string denoting the language and an ObjectStream<DocumentSample> instance holding the training data. Each DocumentSample instance holds the annotated text and its category.
In the following example, the en-animal.train file is used to train the model. Its input stream is used to create a PlainTextByLineStream instance, which is then converted to an ObjectStream<DocumentSample> instance. The train method is then applied. The code is enclosed in a try-with-resources block to handle exceptions. We also created an output stream that we will use to persist the model:
DoccatModel model = null;
try (InputStream dataIn = new FileInputStream("en-animal.train");
        OutputStream dataOut = new FileOutputStream("en-animal.model")) {
    ObjectStream<String> lineStream =
        new PlainTextByLineStream(dataIn, "UTF-8");
    ObjectStream<DocumentSample> sampleStream =
        new DocumentSampleStream(lineStream);
    model = DocumentCategorizerME.train("en", sampleStream);
    ...
} catch (IOException e) {
    // Handle exceptions
}
The output is as follows and has been shortened to conserve space:
Indexing events using cutoff of 5
Computing event counts... done. 12 events
Indexing... done.
Sorting and merging events... done. Reduced 12 events to 12.
Done indexing.
Incorporating indexed data for training... done.
Number of Event Tokens: 12
Number of Outcomes: 2
Number of Predicates: 30
...done.
Computing model parameters ...
Performing 100 iterations.
  1: ... loglikelihood=-8.317766166719343	0.75
  2: ... loglikelihood=-7.1439957443937265	0.75
  3: ... loglikelihood=-6.560690872956419	0.75
  4: ... loglikelihood=-6.106743124066829	0.75
  5: ... loglikelihood=-5.721805583104927	0.8333333333333334
  6: ... loglikelihood=-5.3891508904777785	0.8333333333333334
  7: ... loglikelihood=-5.098768040466029	0.8333333333333334
...
 98: ... loglikelihood=-1.4117372921765519	1.0
 99: ... loglikelihood=-1.4052738190352423	1.0
100: ... loglikelihood=-1.398916120150312	1.0
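Each numbered line of this output reports the log-likelihood of the training data under the current model; it should move toward zero as the iterations proceed, as it does here. A small, self-contained sketch (our own parsing helper, not part of OpenNLP) extracts the value from such a line:

```java
// Extract the loglikelihood value from an OpenNLP training output
// line such as "1: ... loglikelihood=-8.317766166719343 0.75".
// The parse helper is illustrative only.
public class LogLikelihood {
    public static double parse(String line) {
        int start = line.indexOf("loglikelihood=")
            + "loglikelihood=".length();
        int end = start;
        while (end < line.length()
                && !Character.isWhitespace(line.charAt(end))) {
            end++;
        }
        return Double.parseDouble(line.substring(start, end));
    }

    public static void main(String[] args) {
        String line = "1: ... loglikelihood=-8.317766166719343 0.75";
        // A value closer to zero indicates a better fit
        System.out.println(parse(line));
    }
}
```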
The model is saved using the serialize method, as shown here. It is written to the en-animal.model file opened in the previous try-with-resources block:
OutputStream modelOut = new BufferedOutputStream(dataOut);
model.serialize(modelOut);
Once a model has been created, we can use the DocumentCategorizerME class to classify text. We need to read the model, create an instance of the DocumentCategorizerME class, and then invoke the categorize method to return an array of probabilities that tell us which category the text best fits.
Since we are reading from a file, exceptions need to be dealt with, as shown here:
try (InputStream modelIn =
        new FileInputStream(new File("en-animal.model"))) {
    ...
} catch (IOException ex) {
    // Handle exceptions
}
With the input stream, we create instances of the DoccatModel and DocumentCategorizerME classes, as illustrated here:
DoccatModel model = new DoccatModel(modelIn);
DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
The categorize method is called using a string as an argument. It returns an array of double values, each element containing the likelihood that the text belongs to a category. The DocumentCategorizerME class' getNumberOfCategories method returns the number of categories handled by the model, and its getCategory method returns the category for a given index.
We used these methods in the following code to display each category and its corresponding likelihood:
double[] outcomes = categorizer.categorize(inputText);
for (int i = 0; i < categorizer.getNumberOfCategories(); i++) {
    String category = categorizer.getCategory(i);
    System.out.println(category + " - " + outcomes[i]);
}
For testing, we used part of the Wikipedia article (http://en.wikipedia.org/wiki/Toto_%28Oz%29) for Toto, Dorothy's dog. We used the first sentence of The classic books section as declared here:
String toto = "Toto belongs to Dorothy Gale, the heroine of "
    + "the first and many subsequent books. In the first "
    + "book, he never spoke, although other animals, native "
    + "to Oz, did. In subsequent books, other animals "
    + "gained the ability to speak upon reaching Oz or "
    + "similar lands, but Toto remained speechless.";
To test for a cat, we used the first sentence of the Tortoiseshell and Calico section of the Wikipedia article (http://en.wikipedia.org/wiki/Cats_and_humans) as declared here:
String calico = "This cat is also known as a calimanco cat or "
    + "clouded tiger cat, and by the abbreviation 'tortie'. "
    + "In the cat fancy, a tortoiseshell cat is patched "
    + "over with red (or its dilute form, cream) and black "
    + "(or its dilute blue) mottled throughout the coat.";
Using the text for toto, we get the following output, which suggests that the text should be placed in the dog category:
dog - 0.5870711529777994
cat - 0.41292884702220056
Using calico instead yields these results:
dog - 0.28960436044424276
cat - 0.7103956395557574
We could have used the getBestCategory method to return only the best category. This method takes the array of outcomes and returns a string. The getAllResults method returns all of the results as a string. These two methods are illustrated here:
System.out.println(categorizer.getBestCategory(outcomes));
System.out.println(categorizer.getAllResults(outcomes));
The output will be as follows:
cat
dog[0.2896] cat[0.7104]
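Internally, getBestCategory amounts to an argmax over the outcomes array. A minimal plain-Java sketch of the same selection (the helper method is ours, and the category names and probabilities are from the example above):

```java
// Pick the category with the highest probability from the
// outcomes array, mirroring what getBestCategory does.
public class BestCategory {
    public static String best(String[] categories, double[] outcomes) {
        int bestIndex = 0;
        for (int i = 1; i < outcomes.length; i++) {
            if (outcomes[i] > outcomes[bestIndex]) {
                bestIndex = i;
            }
        }
        return categories[bestIndex];
    }

    public static void main(String[] args) {
        String[] categories = {"dog", "cat"};
        double[] outcomes = {0.2896, 0.7104};
        System.out.println(best(categories, outcomes)); // prints "cat"
    }
}
```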
The Stanford API supports several classifiers. We will examine the use of the ColumnDataClassifier class for general classification and the StanfordCoreNLP pipeline to perform sentiment analysis. The classifiers supported by the Stanford API can be difficult to use at times. With the ColumnDataClassifier class, we will demonstrate how to classify the size of boxes. With the pipeline, we will illustrate how to determine the positive or negative sentiment of short text phrases. The classifier can be downloaded from http://www-nlp.stanford.edu/wiki/Software/Classifier.
This classifier uses multiple values (columns) to describe each data item. In this demonstration, we will use a training file to create a classifier and then a test file to assess its performance. The class uses a properties file to configure the creation process.
We will create a classifier that attempts to classify a box based on its dimensions. Three categories are possible: small, medium, and large. The height, width, and length of a box are expressed as floating point numbers and are used to characterize it.
The properties file specifies parameter information and supplies data about the training and test files. There are many possible properties that can be specified. For this example, we will use only a few of the more relevant properties.
We will use the following properties file, saved as box.prop. The first set of properties deals with the number of features contained in the training and test files. Since we use three values, three realValued columns are specified. The trainFile and testFile properties specify the location and names of the respective files:
useClassFeature=true
1.realValued=true
2.realValued=true
3.realValued=true
trainFile=.box.train
testFile=.box.test
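This file follows the standard Java properties format, so its parsing can be sketched with java.util.Properties from the standard library (the in-memory string here stands in for the actual box.prop file):

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Demonstrates how a box.prop-style file parses as standard
// Java properties; the string below stands in for the file.
public class BoxProp {
    public static Properties load(String content) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(content));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        String content = "useClassFeature=true\n"
            + "1.realValued=true\n"
            + "2.realValued=true\n"
            + "3.realValued=true\n"
            + "trainFile=.box.train\n"
            + "testFile=.box.test\n";
        Properties props = load(content);
        System.out.println(props.getProperty("1.realValued")); // prints "true"
        System.out.println(props.getProperty("trainFile"));    // prints ".box.train"
    }
}
```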
The training and test files use the same format. Each line consists of a category followed by the defining values, each separated by a tab. The box.train training file consists of 60 entries and the box.test file consists of 30 entries. These files can be downloaded from www.packtpub.com. The first line of the box.train file follows. The category is small; its height, width, and length are 2.34, 1.60, and 1.50, respectively:
small 2.34 1.60 1.50
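Such a line can be split on the tab character into the category and its three numeric features; a small parsing sketch (the helper names are ours, for illustration):

```java
// Parse a box.train-style line: a category followed by three
// tab-separated floating point values. The helpers are illustrative.
public class BoxLine {
    public static String category(String line) {
        return line.split("\t")[0];
    }

    public static double[] features(String line) {
        String[] parts = line.split("\t");
        double[] values = new double[parts.length - 1];
        for (int i = 1; i < parts.length; i++) {
            values[i - 1] = Double.parseDouble(parts[i]);
        }
        return values;
    }

    public static void main(String[] args) {
        String line = "small\t2.34\t1.60\t1.50";
        System.out.println(category(line));    // prints "small"
        System.out.println(features(line)[0]); // prints 2.34
    }
}
```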
The code to create the classifier is shown here. An instance of the ColumnDataClassifier class is created using the properties file as the constructor's argument. An instance of the Classifier interface is returned by the makeClassifier method. This interface supports three methods, two of which we will demonstrate. The readTrainingExamples method reads the training data from the training file:
ColumnDataClassifier cdc = new ColumnDataClassifier("box.prop");
Classifier<String, String> classifier =
    cdc.makeClassifier(cdc.readTrainingExamples("box.train"));
When executed, we get extensive output. We will discuss the more relevant parts here. The first part of the output repeats parts of the property file:
3.realValued = true
testFile = .box.test
...
trainFile = .box.train
The next part displays the number of datasets read along with various features' information, as shown here:
Reading dataset from box.train ... done [0.1s, 60 items].
numDatums: 60 numLabels: 3 [small, medium, large]
...
AVEIMPROVE     The average improvement / current value
EVALSCORE      The last available eval score
Iter ## evals ## <SCALING> [LINESEARCH] VALUE TIME |GNORM| {RELNORM} AVEIMPROVE EVALSCORE
The classifier then iterates over the data to create the classifier:
Iter 1 evals 1 <D> [113M 3.107E-4] 5.985E1 0.00s |3.829E1| {1.959E-1} 0.000E0 -
Iter 2 evals 5 <D> [M 1.000E0] 5.949E1 0.01s |1.862E1| {9.525E-2} 3.058E-3 -
Iter 3 evals 6 <D> [M 1.000E0] 5.923E1 0.01s |1.741E1| {8.904E-2} 3.485E-3 -
...
Iter 21 evals 24 <D> [1M 2.850E-1] 3.306E1 0.02s |4.149E-1| {2.122E-3} 1.775E-4 -
Iter 22 evals 26 <D> [M 1.000E0] 3.306E1 0.02s
QNMinimizer terminated due to average improvement: | newest_val - previous_val | / |newestVal| < TOL
Total time spent in optimization: 0.07s
At this point, the classifier is ready to use. Next, we use the test file to verify it. We start by getting a line from the test file using the ObjectBank class' getLineIterator method. This class supports the conversion of data read into a more standardized form. The getLineIterator method returns one line at a time in a format that can be used by the classifier. The loop for this process is shown here:
for (String line :
        ObjectBank.getLineIterator("box.test", "utf-8")) {
    ...
}
Within the for-each statement, a Datum instance is created from the line and passed to the classifier's classOf method, which returns the predicted category, as shown here. The Datum interface supports objects that contain features. When a Datum is used as the argument of the classOf method, the category determined by the classifier is returned:
Datum<String, String> datum = cdc.makeDatumFromLine(line);
System.out.println("Datum: {" + line + "] Predicted Category: "
    + classifier.classOf(datum));
When this sequence is executed, each line of the test file is processed and the predicted category is displayed, as follows. Only the first two and last two lines are shown here. The classifier classified most, but not all, of the test data correctly; as the first line shows, one small box was predicted to be medium:
Datum: {small 1.33 3.50 5.43] Predicted Category: medium
Datum: {small 1.18 1.73 3.14] Predicted Category: small
...
Datum: {large 6.01 9.35 16.64] Predicted Category: large
Datum: {large 6.76 9.66 15.44] Predicted Category: large
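Because each test line begins with the true category, the classifier's accuracy can be computed by comparing that category with the prediction. A self-contained sketch over collected label pairs (the sample arrays below are illustrative, echoing the output above):

```java
// Compute accuracy as the fraction of predictions that match
// the true categories taken from the test file's first column.
public class Accuracy {
    public static double accuracy(String[] actual, String[] predicted) {
        int correct = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i].equals(predicted[i])) {
                correct++;
            }
        }
        return (double) correct / actual.length;
    }

    public static void main(String[] args) {
        String[] actual = {"small", "small", "large", "large"};
        String[] predicted = {"medium", "small", "large", "large"};
        System.out.println(accuracy(actual, predicted)); // prints 0.75
    }
}
```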
To test an individual entry, we can use the makeDatumFromStrings method to create a Datum instance. In the next code sequence, a one-dimensional array of strings is created where each element represents the data values for a box. The first entry, the category, is left empty. The Datum instance is then used as the argument of the classOf method to predict its category:
String sample[] = {"", "6.90", "9.8", "15.69"};
Datum<String, String> datum = cdc.makeDatumFromStrings(sample);
System.out.println("Category: " + classifier.classOf(datum));
The output for this sequence is shown here, which correctly classifies the box:
Category: large
In this section, we will illustrate how the Stanford API can be used to perform sentiment analysis. We will use the StanfordCoreNLP pipeline to perform this analysis on different texts.
We will use three different texts, as defined here. The review string is a movie review from Rotten Tomatoes (http://www.rottentomatoes.com/m/forrest_gump/) about the movie Forrest Gump:
String review = "An overly sentimental film with a somewhat "
    + "problematic message, but its sweetness and charm "
    + "are occasionally enough to approximate true depth "
    + "and grace. ";
String sam = "Sam was an odd sort of fellow. Not prone "
    + "to angry and not prone to merriment. Overall, "
    + "an odd fellow.";
String mary = "Mary thought that custard pie was the "
    + "best pie in the world. However, she loathed "
    + "chocolate pie.";
To perform this analysis, we need to use a sentiment annotator, as shown here. This also requires the tokenize, ssplit, and parse annotators. The parse annotator provides more structural information about the text, which will be discussed in more detail in Chapter 7, Using a Parser to Extract Relationships:
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, parse, sentiment");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
The text is used to create an Annotation instance, which is then used as the argument to the annotate method that performs the actual work, as shown here:
Annotation annotation = new Annotation(review);
pipeline.annotate(annotation);
The following array holds the strings for the different sentiments possible:
String[] sentimentText = {"Very Negative", "Negative",
    "Neutral", "Positive", "Very Positive"};
The Annotation class' get method returns an object that implements the CoreMap interface. In this case, these objects represent the results of splitting the input text into sentences, as shown in the following code. For each sentence, a Tree instance is obtained that represents a tree structure containing a parse of the text for the sentiment. The getPredictedClass method returns an index into the sentimentText array reflecting the sentiment of the text:
for (CoreMap sentence : annotation.get(
        CoreAnnotations.SentencesAnnotation.class)) {
    Tree tree = sentence.get(
        SentimentCoreAnnotations.AnnotatedTree.class);
    int score = RNNCoreAnnotations.getPredictedClass(tree);
    System.out.println(sentimentText[score]);
}
When the code is executed using the review string, we get the following output:
Positive
The text sam consists of three sentences. The output, showing the sentiment of each sentence, is as follows:
Neutral
Negative
Neutral
The text mary consists of two sentences. The output for each is as follows:
Positive
Neutral
We will use LingPipe to demonstrate a number of classification tasks, including general text classification using trained models, sentiment analysis, and language identification.

Several of the tasks described in this section will use the following declarations. LingPipe comes with training data for several categories. The categories array contains the names of the categories packaged with LingPipe:
String[] categories = {"soc.religion.christian",
    "talk.religion.misc", "alt.atheism", "misc.forsale"};
The DynamicLMClassifier class is used to perform the actual classification. It is created using the categories array, giving it the names of the categories to use. The nGramSize value specifies the number of contiguous items in a sequence used by the model for classification purposes:
int nGramSize = 6;
DynamicLMClassifier<NGramProcessLM> classifier =
    DynamicLMClassifier.createNGramProcess(categories, nGramSize);
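The n-grams here are contiguous character sequences drawn from the text. A self-contained sketch (our own helper, not LingPipe code) shows what an nGramSize of 6 produces for a short string:

```java
import java.util.ArrayList;
import java.util.List;

// Extract all contiguous character n-grams from a string,
// illustrating what an nGramSize of 6 means for the model.
public class CharNGrams {
    public static List<String> ngrams(String text, int n) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            result.add(text.substring(i, i + n));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("for sale", 6));
        // prints [for sa, or sal, r sale]
    }
}
```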
General text classification using LingPipe involves training the DynamicLMClassifier class using training files and then using the class to perform the actual classification. LingPipe comes with several training datasets, found in the LingPipe directory demos/data/fourNewsGroups/4news-train. We will use these to illustrate the training process. This example is a simplified version of the process found at http://alias-i.com/lingpipe/demos/tutorial/classify/read-me.html.
We start by declaring the training directory:
String directory = ".../demos";
File trainingDirectory = new File(directory
    + "/data/fourNewsGroups/4news-train");
In the training directory, there are four subdirectories whose names are listed in the categories array. In each subdirectory is a series of files with numeric names. These files contain newsgroup (http://qwone.com/~jason/20Newsgroups/) data corresponding to that directory's name.
The process of training the model involves using each file and category with the DynamicLMClassifier class' handle method. The method uses the file to create a training instance for the category and then augments the model with this instance. The process uses nested for-loops.
The outer for-loop creates a File object using the directory's name and then applies the list method against it. The list method returns a list of the files in the directory. The names of these files are stored in the trainingFiles array, which will be used in the inner loop:
for (int i = 0; i < categories.length; ++i) {
    File classDir = new File(trainingDirectory, categories[i]);
    String[] trainingFiles = classDir.list();
    // Inner for-loop
}
The inner for-loop, as shown next, will open each file and read its text. The Classification class represents a classification with a specified category. It is used with the text to create a Classified instance. The DynamicLMClassifier class' handle method updates the model with the new information:
for (int j = 0; j < trainingFiles.length; ++j) {
    try {
        File file = new File(classDir, trainingFiles[j]);
        String text = Files.readFromFile(file, "ISO-8859-1");
        Classification classification =
            new Classification(categories[i]);
        Classified<CharSequence> classified =
            new Classified<>(text, classification);
        classifier.handle(classified);
    } catch (IOException ex) {
        // Handle exceptions
    }
}
The classifier can be serialized for later use, as shown here. The AbstractExternalizable class is a utility class that supports the serialization of objects. It has a static compileTo method that accepts a Compilable instance and a File object and writes the object to the file, as follows:
try {
    AbstractExternalizable.compileTo((Compilable) classifier,
        new File("classifier.model"));
} catch (IOException ex) {
    // Handle exceptions
}
The loading of the classifier will be illustrated in the Classifying text using LingPipe section later in this chapter.
Other newsgroup data can be found at http://qwone.com/~jason/20Newsgroups/. These collections of data can be used to train other models. Although there are only 20 categories, they can be useful for training. Three different downloads are available; some have been sorted, and in others, duplicate data has been removed.
To classify text, we will use the DynamicLMClassifier class' classify method. We will demonstrate its use with two different text sequences:
forSale: The first is from http://www.homes.com/for-sale/, where we use the first complete sentence.
martinLuther: The second is from http://en.wikipedia.org/wiki/Martin_Luther, where we use the first sentence of the second paragraph.

These strings are declared here:
String forSale = "Finding a home for sale has never been "
    + "easier. With Homes.com, you can search new "
    + "homes, foreclosures, multi-family homes, "
    + "as well as condos and townhouses for sale. "
    + "You can even search our real estate agent "
    + "directory to work with a professional "
    + "Realtor and find your perfect home.";
String martinLuther = "Luther taught that salvation and subsequently "
    + "eternity in heaven is not earned by good deeds "
    + "but is received only as a free gift of God's "
    + "grace through faith in Jesus Christ as redeemer "
    + "from sin and subsequently eternity in Hell.";
To reuse the classifier serialized in the previous section, use the AbstractExternalizable class' readObject method, as shown here. We will use the LMClassifier class instead of the DynamicLMClassifier class. Both support the classify method, but the DynamicLMClassifier class is not readily serializable:
LMClassifier classifier = null;
try {
    classifier = (LMClassifier) AbstractExternalizable.readObject(
        new File("classifier.model"));
} catch (IOException | ClassNotFoundException ex) {
    // Handle exceptions
}
In the next code sequence, we apply the LMClassifier class' classify method. This returns a JointClassification instance, which we use to determine the best match:
JointClassification classification = classifier.classify(text);
System.out.println("Text: " + text);
String bestCategory = classification.bestCategory();
System.out.println("Best Category: " + bestCategory);
For the forSale text, we get the following output:
Text: Finding a home for sale has never been easier. With Homes.com, you can search new homes, foreclosures, multi-family homes, as well as condos and townhouses for sale. You can even search our real estate agent directory to work with a professional Realtor and find your perfect home.
Best Category: misc.forsale
For the martinLuther text, we get the following output:
Text: Luther taught that salvation and subsequently eternity in heaven is not earned by good deeds but is received only as a free gift of God's grace through faith in Jesus Christ as redeemer from sin and subsequently eternity in Hell.
Best Category: soc.religion.christian
Both texts were classified correctly.
Sentiment analysis is performed in a very similar manner to that of general text classification. One difference is the use of only two categories: positive and negative.
We need data files to train our model. We will use a simplified version of the sentiment analysis performed at http://alias-i.com/lingpipe/demos/tutorial/sentiment/read-me.html, using sentiment data developed for movies (http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz). This data was developed from 1,000 positive and 1,000 negative reviews of movies found in IMDb's movie archives.
These reviews need to be downloaded and extracted. A txt_sentoken directory will be extracted along with its two subdirectories: neg and pos. Both of these subdirectories contain movie reviews. Although some of these files could be held in reserve to evaluate the model created, we will use all of them to simplify the explanation.
We will start by re-initializing the variables declared in the Using LingPipe to classify text section. The categories array is set to a two-element array to hold the two categories. The classifier variable is assigned a new DynamicLMClassifier instance using the new category array and an nGramSize of 8:
categories = new String[2];
categories[0] = "neg";
categories[1] = "pos";
nGramSize = 8;
classifier = DynamicLMClassifier.createNGramProcess(
    categories, nGramSize);
As we did earlier, we will create a series of instances based on the contents found in the training files. We will not detail the following code as it is very similar to that found in the Training text using the Classified class section. The main difference is there are only two categories to process:
String directory = "...";
File trainingDirectory = new File(directory, "txt_sentoken");
for (int i = 0; i < categories.length; ++i) {
    Classification classification =
        new Classification(categories[i]);
    File file = new File(trainingDirectory, categories[i]);
    File[] trainingFiles = file.listFiles();
    for (int j = 0; j < trainingFiles.length; ++j) {
        try {
            String review = Files.readFromFile(
                trainingFiles[j], "ISO-8859-1");
            Classified<CharSequence> classified =
                new Classified<>(review, classification);
            classifier.handle(classified);
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}
The model is now ready to be used. We will use the review for the movie Forrest Gump:
String review = "An overly sentimental film with a somewhat "
    + "problematic message, but its sweetness and charm "
    + "are occasionally enough to approximate true depth "
    + "and grace. ";
We use the classify method to perform the actual work. It returns a Classification instance whose bestCategory method returns the best category, as shown here:
Classification classification = classifier.classify(review);
String bestCategory = classification.bestCategory();
System.out.println("Best Category: " + bestCategory);
When executed, we get the following output:
Best Category: pos
This approach will also work well for other categories of text.
LingPipe comes with a model, langid-leipzig.classifier, trained for several languages and found in the demos/models directory. A list of supported languages is found in the following table. This model was developed using training data derived from the Leipzig Corpora Collection (http://corpora.uni-leipzig.de/). Another good tool can be found at http://code.google.com/p/language-detection/.
Language | Abbreviation | Language | Abbreviation
---|---|---|---
Catalan | cat | Italian | it
Danish | dk | Japanese | jp
English | en | Korean | kr
Estonian | ee | Norwegian | no
Finnish | fi | Sorbian | sorb
French | fr | Swedish | se
German | de | Turkish | tr
To use this model, we use essentially the same code we used in the Classifying text using LingPipe section earlier in this chapter. We start with the same movie review of Forrest Gump:
String text = "An overly sentimental film with a somewhat "
    + "problematic message, but its sweetness and charm "
    + "are occasionally enough to approximate true depth "
    + "and grace. ";
System.out.println("Text: " + text);
The LMClassifier instance is created using the langid-leipzig.classifier file:
LMClassifier classifier = null;
try {
    classifier = (LMClassifier) AbstractExternalizable.readObject(
        new File(".../langid-leipzig.classifier"));
} catch (IOException | ClassNotFoundException ex) {
    // Handle exceptions
}
The classify method is used, followed by the bestCategory method, to obtain the best language fit, as shown here:
Classification classification = classifier.classify(text);
String bestCategory = classification.bestCategory();
System.out.println("Best Language: " + bestCategory);
The output is as follows with English being chosen:
Text: An overly sentimental film with a somewhat problematic message, but its sweetness and charm are occasionally enough to approximate true depth and grace.
Best Language: en
The following code example uses the first sentence of the Swedish Wikipedia entry in Swedish (http://sv.wikipedia.org/wiki/Svenska) for the text:
text = "Svenska är ett östnordiskt språk som talas av cirka "
    + "tio miljoner personer[1], främst i Finland "
    + "och Sverige.";
The output, as shown here, correctly selects the Swedish language:
Text: Svenska är ett östnordiskt språk som talas av cirka tio miljoner personer[1], främst i Finland och Sverige.
Best Language: se
Training can be conducted in the same way as done for the previous LingPipe models. Another consideration when performing language identification is that the text may be written in multiple languages. This can complicate the language detection process.
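One simple mitigation for mixed-language text is to split it into sentences and classify each fragment separately, so that each piece is dominated by a single language. A sketch of the splitting step follows; the splitter is a naive, illustrative helper of our own, and in practice each piece would then be passed to classifier.classify:

```java
// Split mixed-language text into sentences so each piece can be
// classified separately; a naive splitter on sentence punctuation.
// Each resulting sentence would then go to classifier.classify.
public class SentenceSplit {
    public static String[] split(String text) {
        return text.trim().split("(?<=[.!?])\\s+");
    }

    public static void main(String[] args) {
        String mixed = "This sentence is English. "
            + "Svenska är ett östnordiskt språk.";
        for (String sentence : split(mixed)) {
            System.out.println(sentence);
            // classifier.classify(sentence) would go here
        }
    }
}
```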