We will use OpenNLP, Stanford API, and LingPipe to demonstrate various classification approaches. We will spend more time with LingPipe as it offers several different classification approaches.
The DocumentCategorizer interface specifies methods to support the classification process. The interface is implemented by the DocumentCategorizerME class, which classifies text into predefined categories using a maximum entropy framework.
First, we have to train our model because OpenNLP does not have prebuilt categorization models. This process consists of creating a file of training data and then using the DocumentCategorizerME class to perform the actual training. The resulting model is typically saved to a file for later use.
The training file format consists of a series of lines where each line represents a document. The first word of the line is the category, followed by text separated by whitespace. Here is an example for the dog category:
dog The most interesting feature of a dog is its ...
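This line format can be sketched with a small helper method (the trainingLine name is ours, for illustration only, and not part of OpenNLP): it simply prefixes the category to the whitespace-separated text.

```java
// Sketch of the OpenNLP document-categorizer training format:
// each line is "<category> <whitespace-separated text>".
// The trainingLine helper is hypothetical, for illustration only.
public class TrainingLine {
    public static String trainingLine(String category, String text) {
        // Collapse internal whitespace so the sample stays on one line
        return category + " " + text.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(trainingLine("dog",
            "The most interesting feature of a dog is its loyalty."));
    }
}
```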
To demonstrate the training process, we created the en-animal.train file with two categories: cats and dogs. For the training text, we used sections of Wikipedia. For dogs (http://en.wikipedia.org/wiki/Dog), we used the As Pets section. For cats (http://en.wikipedia.org/wiki/Cats_and_humans), we used the Pet section plus the first paragraph of the Domesticated varieties section. We also removed the numeric references from the sections.
The first part of each line is shown here:
dog The most widespread form of interspecies bonding occurs ...
dog There have been two major trends in the changing status of ...
dog There are a vast range of commodity forms available to ...
dog An Australian Cattle Dog in reindeer antlers sits on Santa's lap ...
dog A pet dog taking part in Christmas traditions ...
dog The majority of contemporary people with dogs describe their ...
dog Another study of dogs' roles in families showed many dogs have ...
dog According to statistics published by the American Pet Products ...
dog The latest study using Magnetic resonance imaging (MRI) ...
cat Cats are common pets in Europe and North America, and their ...
cat Although cat ownership has commonly been associated ...
cat The concept of a cat breed appeared in Britain during ...
cat Cats come in a variety of colors and patterns. These are physical ...
cat A natural behavior in cats is to hook their front claws periodically ...
cat Although scratching can serve cats to keep their claws from growing ...
When creating training data, it is important to use a large enough sample size. The data we used is not sufficient for some analysis. However, as we will see, it does a pretty good job of identifying the categories correctly.
The DoccatModel class supports categorization and classification of text. A model is trained using the train method based on annotated text. The train method takes a string denoting the language and an ObjectStream<DocumentSample> instance holding the training data. Each DocumentSample instance holds the annotated text and its category.
In the following example, the en-animal.train file is used to train the model. Its input stream is used to create a PlainTextByLineStream instance, which is then converted to an ObjectStream<DocumentSample> instance. The train method is then applied. The code is enclosed in a try-with-resources block to handle exceptions. We also created an output stream that we will use to persist the model:
DoccatModel model = null;
try (InputStream dataIn = new FileInputStream("en-animal.train");
        OutputStream dataOut = new FileOutputStream("en-animal.model")) {
    ObjectStream<String> lineStream =
        new PlainTextByLineStream(dataIn, "UTF-8");
    ObjectStream<DocumentSample> sampleStream =
        new DocumentSampleStream(lineStream);
    model = DocumentCategorizerME.train("en", sampleStream);
    ...
} catch (IOException e) {
    // Handle exceptions
}
The output is as follows and has been shortened to conserve space:
Indexing events using cutoff of 5
Computing event counts... done. 12 events
Indexing... done.
Sorting and merging events... done. Reduced 12 events to 12.
Done indexing.
Incorporating indexed data for training... done.
Number of Event Tokens: 12
Number of Outcomes: 2
Number of Predicates: 30
...done.
Computing model parameters ...
Performing 100 iterations.
  1: ... loglikelihood=-8.317766166719343	0.75
  2: ... loglikelihood=-7.1439957443937265	0.75
  3: ... loglikelihood=-6.560690872956419	0.75
  4: ... loglikelihood=-6.106743124066829	0.75
  5: ... loglikelihood=-5.721805583104927	0.8333333333333334
  6: ... loglikelihood=-5.3891508904777785	0.8333333333333334
  7: ... loglikelihood=-5.098768040466029	0.8333333333333334
...
 98: ... loglikelihood=-1.4117372921765519	1.0
 99: ... loglikelihood=-1.4052738190352423	1.0
100: ... loglikelihood=-1.398916120150312	1.0
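Each numbered line of this output reports the log-likelihood of the training data under the current model; it should move toward zero as the iterations proceed, as it does here. A small, self-contained sketch (our own parsing helper, not part of OpenNLP) extracts the value from such a line:

```java
// Extract the loglikelihood value from an OpenNLP training output
// line such as "1: ... loglikelihood=-8.317766166719343 0.75".
// The parse helper is illustrative only.
public class LogLikelihood {
    public static double parse(String line) {
        int start = line.indexOf("loglikelihood=")
            + "loglikelihood=".length();
        int end = start;
        while (end < line.length()
                && !Character.isWhitespace(line.charAt(end))) {
            end++;
        }
        return Double.parseDouble(line.substring(start, end));
    }

    public static void main(String[] args) {
        String line = "1: ... loglikelihood=-8.317766166719343 0.75";
        // A value closer to zero indicates a better fit
        System.out.println(parse(line));
    }
}
```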
The model is saved using the serialize method, as shown here. It is written to the en-animal.model file opened in the previous try-with-resources block:
OutputStream modelOut = new BufferedOutputStream(dataOut);
model.serialize(modelOut);
Once a model has been created, we can use the DocumentCategorizerME class to classify text. We need to read the model, create an instance of the DocumentCategorizerME class, and then invoke the categorize method to return an array of probabilities that tell us which category the text best fits.
Since we are reading from a file, exceptions need to be dealt with, as shown here:
try (InputStream modelIn =
        new FileInputStream(new File("en-animal.model"))) {
    ...
} catch (IOException ex) {
    // Handle exceptions
}
With the input stream, we create instances of the DoccatModel and DocumentCategorizerME classes, as illustrated here:
DoccatModel model = new DoccatModel(modelIn);
DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
The categorize method is called using a string as an argument. It returns an array of double values, each element containing the likelihood that the text belongs to a category. The DocumentCategorizerME class' getNumberOfCategories method returns the number of categories handled by the model, and its getCategory method returns the category for a given index.
We used these methods in the following code to display each category and its corresponding likelihood:
double[] outcomes = categorizer.categorize(inputText);
for (int i = 0; i < categorizer.getNumberOfCategories(); i++) {
    String category = categorizer.getCategory(i);
    System.out.println(category + " - " + outcomes[i]);
}
For testing, we used part of the Wikipedia article (http://en.wikipedia.org/wiki/Toto_%28Oz%29) for Toto, Dorothy's dog. We used the first sentence of The classic books section as declared here:
String toto = "Toto belongs to Dorothy Gale, the heroine of "
    + "the first and many subsequent books. In the first "
    + "book, he never spoke, although other animals, native "
    + "to Oz, did. In subsequent books, other animals "
    + "gained the ability to speak upon reaching Oz or "
    + "similar lands, but Toto remained speechless.";
To test for a cat, we used the first sentence of the Tortoiseshell and Calico section of the Wikipedia article (http://en.wikipedia.org/wiki/Cats_and_humans) as declared here:
String calico = "This cat is also known as a calimanco cat or "
    + "clouded tiger cat, and by the abbreviation 'tortie'. "
    + "In the cat fancy, a tortoiseshell cat is patched "
    + "over with red (or its dilute form, cream) and black "
    + "(or its dilute blue) mottled throughout the coat.";
Using the text for toto, we get the following output, which suggests that the text should be placed in the dog category:
dog - 0.5870711529777994
cat - 0.41292884702220056
Using calico instead yields these results:
dog - 0.28960436044424276
cat - 0.7103956395557574
We could have used the getBestCategory method to return only the best category. This method takes the array of outcomes and returns a string. The getAllResults method returns all of the results as a string. These two methods are illustrated here:
System.out.println(categorizer.getBestCategory(outcomes));
System.out.println(categorizer.getAllResults(outcomes));
The output will be as follows:
cat
dog[0.2896] cat[0.7104]
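Internally, getBestCategory amounts to an argmax over the outcomes array. A minimal plain-Java sketch of the same selection (the helper method is ours, and the category names and probabilities are from the example above):

```java
// Pick the category with the highest probability from the
// outcomes array, mirroring what getBestCategory does.
public class BestCategory {
    public static String best(String[] categories, double[] outcomes) {
        int bestIndex = 0;
        for (int i = 1; i < outcomes.length; i++) {
            if (outcomes[i] > outcomes[bestIndex]) {
                bestIndex = i;
            }
        }
        return categories[bestIndex];
    }

    public static void main(String[] args) {
        String[] categories = {"dog", "cat"};
        double[] outcomes = {0.2896, 0.7104};
        System.out.println(best(categories, outcomes)); // prints "cat"
    }
}
```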
The Stanford API supports several classifiers. We will examine the use of the ColumnDataClassifier class for general classification and the StanfordCoreNLP pipeline to perform sentiment analysis. The classifiers supported by the Stanford API can be difficult to use at times. With the ColumnDataClassifier class, we will demonstrate how to classify the size of boxes. With the pipeline, we will illustrate how to determine the positive or negative sentiment of short text phrases. The classifier can be downloaded from http://www-nlp.stanford.edu/wiki/Software/Classifier.
This classifier uses multiple values (columns) to describe each data item. In this demonstration, we will use a training file to create a classifier and then a test file to assess its performance. The class uses a properties file to configure the creation process.
We will create a classifier that attempts to classify a box based on its dimensions. Three categories are possible: small, medium, and large. The height, width, and length of a box are expressed as floating point numbers and are used to characterize it.
The properties file specifies parameter information and supplies data about the training and test files. There are many possible properties that can be specified. For this example, we will use only a few of the more relevant properties.
We will use the following properties file, saved as box.prop. The first set of properties deals with the number of features contained in the training and test files. Since we use three values, three realValued columns are specified. The trainFile and testFile properties specify the location and names of the respective files:
useClassFeature=true
1.realValued=true
2.realValued=true
3.realValued=true
trainFile=.box.train
testFile=.box.test
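This file follows the standard Java properties format, so its parsing can be sketched with java.util.Properties from the standard library (the in-memory string here stands in for the actual box.prop file):

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Demonstrates how a box.prop-style file parses as standard
// Java properties; the string below stands in for the file.
public class BoxProp {
    public static Properties load(String content) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(content));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        String content = "useClassFeature=true\n"
            + "1.realValued=true\n"
            + "2.realValued=true\n"
            + "3.realValued=true\n"
            + "trainFile=.box.train\n"
            + "testFile=.box.test\n";
        Properties props = load(content);
        System.out.println(props.getProperty("1.realValued")); // prints "true"
        System.out.println(props.getProperty("trainFile"));    // prints ".box.train"
    }
}
```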
The training and test files use the same format. Each line consists of a category followed by the defining values, each separated by a tab. The box.train training file consists of 60 entries and the box.test file consists of 30 entries. These files can be downloaded from www.packtpub.com. The first line of the box.train file follows. The category is small; its height, width, and length are 2.34, 1.60, and 1.50, respectively:
small 2.34 1.60 1.50
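Such a line can be split on the tab character into the category and its three numeric features; a small parsing sketch (the helper names are ours, for illustration):

```java
// Parse a box.train-style line: a category followed by three
// tab-separated floating point values. The helpers are illustrative.
public class BoxLine {
    public static String category(String line) {
        return line.split("\t")[0];
    }

    public static double[] features(String line) {
        String[] parts = line.split("\t");
        double[] values = new double[parts.length - 1];
        for (int i = 1; i < parts.length; i++) {
            values[i - 1] = Double.parseDouble(parts[i]);
        }
        return values;
    }

    public static void main(String[] args) {
        String line = "small\t2.34\t1.60\t1.50";
        System.out.println(category(line));    // prints "small"
        System.out.println(features(line)[0]); // prints 2.34
    }
}
```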
The code to create the classifier is shown here. An instance of the ColumnDataClassifier class is created using the properties file as the constructor's argument. An instance of the Classifier interface is returned by the makeClassifier method. This interface supports three methods, two of which we will demonstrate. The readTrainingExamples method reads the training data from the training file:
ColumnDataClassifier cdc = new ColumnDataClassifier("box.prop");
Classifier<String, String> classifier =
    cdc.makeClassifier(cdc.readTrainingExamples("box.train"));
When executed, we get extensive output. We will discuss the more relevant parts here. The first part of the output repeats parts of the property file:
3.realValued = true
testFile = .box.test
...
trainFile = .box.train
The next part displays the number of datasets read along with various features' information, as shown here:
Reading dataset from box.train ... done [0.1s, 60 items].
numDatums: 60 numLabels: 3 [small, medium, large]
...
AVEIMPROVE     The average improvement / current value
EVALSCORE      The last available eval score
Iter ## evals ## <SCALING> [LINESEARCH] VALUE TIME |GNORM| {RELNORM} AVEIMPROVE EVALSCORE
The classifier then iterates over the data to create the classifier:
Iter 1 evals 1 <D> [113M 3.107E-4] 5.985E1 0.00s |3.829E1| {1.959E-1} 0.000E0 -
Iter 2 evals 5 <D> [M 1.000E0] 5.949E1 0.01s |1.862E1| {9.525E-2} 3.058E-3 -
Iter 3 evals 6 <D> [M 1.000E0] 5.923E1 0.01s |1.741E1| {8.904E-2} 3.485E-3 -
...
Iter 21 evals 24 <D> [1M 2.850E-1] 3.306E1 0.02s |4.149E-1| {2.122E-3} 1.775E-4 -
Iter 22 evals 26 <D> [M 1.000E0] 3.306E1 0.02s
QNMinimizer terminated due to average improvement: | newest_val - previous_val | / |newestVal| < TOL
Total time spent in optimization: 0.07s
At this point, the classifier is ready to use. Next, we use the test file to verify it. We start by getting a line from the test file using the ObjectBank class' getLineIterator method. This class supports the conversion of data read into a more standardized form. The getLineIterator method returns one line at a time in a format that can be used by the classifier. The loop for this process is shown here:
for (String line :
        ObjectBank.getLineIterator("box.test", "utf-8")) {
    ...
}
Within the for-each statement, a Datum instance is created from the line and passed to the classifier's classOf method, which returns the predicted category, as shown here. The Datum interface supports objects that contain features. When a Datum is used as the argument of the classOf method, the category determined by the classifier is returned:
Datum<String, String> datum = cdc.makeDatumFromLine(line);
System.out.println("Datum: {" + line + "] Predicted Category: "
    + classifier.classOf(datum));
When this sequence is executed, each line of the test file is processed and the predicted category is displayed, as follows. Only the first two and last two lines are shown here. The classifier classified most, but not all, of the test data correctly; as the first line shows, one small box was predicted to be medium:
Datum: {small 1.33 3.50 5.43] Predicted Category: medium
Datum: {small 1.18 1.73 3.14] Predicted Category: small
...
Datum: {large 6.01 9.35 16.64] Predicted Category: large
Datum: {large 6.76 9.66 15.44] Predicted Category: large
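Because each test line begins with the true category, the classifier's accuracy can be computed by comparing that category with the prediction. A self-contained sketch over collected label pairs (the sample arrays below are illustrative, echoing the output above):

```java
// Compute accuracy as the fraction of predictions that match
// the true categories taken from the test file's first column.
public class Accuracy {
    public static double accuracy(String[] actual, String[] predicted) {
        int correct = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i].equals(predicted[i])) {
                correct++;
            }
        }
        return (double) correct / actual.length;
    }

    public static void main(String[] args) {
        String[] actual = {"small", "small", "large", "large"};
        String[] predicted = {"medium", "small", "large", "large"};
        System.out.println(accuracy(actual, predicted)); // prints 0.75
    }
}
```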
To test an individual entry, we can use the makeDatumFromStrings method to create a Datum instance. In the next code sequence, a one-dimensional array of strings is created where each element represents the data values for a box. The first entry, the category, is left empty. The Datum instance is then used as the argument of the classOf method to predict its category:
String sample[] = {"", "6.90", "9.8", "15.69"};
Datum<String, String> datum = cdc.makeDatumFromStrings(sample);
System.out.println("Category: " + classifier.classOf(datum));
The output for this sequence is shown here, which correctly classifies the box:
Category: large
In this section, we will illustrate how the Stanford API can be used to perform sentiment analysis. We will use the StanfordCoreNLP pipeline to perform this analysis on different texts.
We will use three different texts, as defined here. The review string is a movie review from Rotten Tomatoes (http://www.rottentomatoes.com/m/forrest_gump/) about the movie Forrest Gump:
String review = "An overly sentimental film with a somewhat "
    + "problematic message, but its sweetness and charm "
    + "are occasionally enough to approximate true depth "
    + "and grace. ";
String sam = "Sam was an odd sort of fellow. Not prone "
    + "to angry and not prone to merriment. Overall, "
    + "an odd fellow.";
String mary = "Mary thought that custard pie was the "
    + "best pie in the world. However, she loathed "
    + "chocolate pie.";
To perform this analysis, we need to use a sentiment annotator, as shown here. This also requires the tokenize, ssplit, and parse annotators. The parse annotator provides more structural information about the text, which will be discussed in more detail in Chapter 7, Using a Parser to Extract Relationships:
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, parse, sentiment");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
The text is used to create an Annotation instance, which is then used as the argument to the annotate method that performs the actual work, as shown here:
Annotation annotation = new Annotation(review);
pipeline.annotate(annotation);
The following array holds the strings for the different sentiments possible:
String[] sentimentText = {"Very Negative", "Negative",
    "Neutral", "Positive", "Very Positive"};
The Annotation class' get method returns an object that implements the CoreMap interface. In this case, these objects represent the results of splitting the input text into sentences, as shown in the following code. For each sentence, a Tree instance is obtained that represents a tree structure containing a parse of the text for the sentiment. The getPredictedClass method returns an index into the sentimentText array reflecting the sentiment of the text:
for (CoreMap sentence : annotation.get(
        CoreAnnotations.SentencesAnnotation.class)) {
    Tree tree = sentence.get(
        SentimentCoreAnnotations.AnnotatedTree.class);
    int score = RNNCoreAnnotations.getPredictedClass(tree);
    System.out.println(sentimentText[score]);
}
When the code is executed using the review string, we get the following output:
Positive
The text sam consists of three sentences. The output, showing the sentiment of each sentence, is as follows:
Neutral
Negative
Neutral
The text mary consists of two sentences. The output for each is as follows:
Positive
Neutral
We will use LingPipe to demonstrate a number of classification tasks, including general text classification using trained models, sentiment analysis, and language identification.

Several of the tasks described in this section will use the following declarations. LingPipe comes with training data for several categories. The categories array contains the names of the categories packaged with LingPipe:
String[] categories = {"soc.religion.christian",
    "talk.religion.misc", "alt.atheism", "misc.forsale"};
The DynamicLMClassifier class is used to perform the actual classification. It is created using the categories array, giving it the names of the categories to use. The nGramSize value specifies the number of contiguous items in a sequence used by the model for classification purposes:
int nGramSize = 6;
DynamicLMClassifier<NGramProcessLM> classifier =
    DynamicLMClassifier.createNGramProcess(categories, nGramSize);
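The n-grams here are contiguous character sequences drawn from the text. A self-contained sketch (our own helper, not LingPipe code) shows what an nGramSize of 6 produces for a short string:

```java
import java.util.ArrayList;
import java.util.List;

// Extract all contiguous character n-grams from a string,
// illustrating what an nGramSize of 6 means for the model.
public class CharNGrams {
    public static List<String> ngrams(String text, int n) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            result.add(text.substring(i, i + n));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("for sale", 6));
        // prints [for sa, or sal, r sale]
    }
}
```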
General text classification using LingPipe involves training the DynamicLMClassifier class using training files and then using the class to perform the actual classification. LingPipe comes with several training datasets, found in the LingPipe directory demos/data/fourNewsGroups/4news-train. We will use these to illustrate the training process. This example is a simplified version of the process found at http://alias-i.com/lingpipe/demos/tutorial/classify/read-me.html.
We start by declaring the training directory:
String directory = ".../demos";
File trainingDirectory = new File(directory
    + "/data/fourNewsGroups/4news-train");
In the training directory, there are four subdirectories whose names are listed in the categories array. In each subdirectory is a series of files with numeric names. These files contain newsgroup (http://qwone.com/~jason/20Newsgroups/) data corresponding to that directory's name.
The process of training the model involves using each file and category with the DynamicLMClassifier class' handle method. The method uses the file to create a training instance for the category and then augments the model with this instance. The process uses nested for-loops.
The outer for-loop creates a File object using the directory's name and then applies the list method against it. The list method returns a list of the files in the directory. The names of these files are stored in the trainingFiles array, which will be used in the inner loop:
for (int i = 0; i < categories.length; ++i) {
    File classDir = new File(trainingDirectory, categories[i]);
    String[] trainingFiles = classDir.list();
    // Inner for-loop
}
The inner for-loop, as shown next, will open each file and read its text. The Classification class represents a classification with a specified category. It is used with the text to create a Classified instance. The DynamicLMClassifier class' handle method updates the model with the new information:
for (int j = 0; j < trainingFiles.length; ++j) {
    try {
        File file = new File(classDir, trainingFiles[j]);
        String text = Files.readFromFile(file, "ISO-8859-1");
        Classification classification =
            new Classification(categories[i]);
        Classified<CharSequence> classified =
            new Classified<>(text, classification);
        classifier.handle(classified);
    } catch (IOException ex) {
        // Handle exceptions
    }
}
The classifier can be serialized for later use, as shown here. The AbstractExternalizable class is a utility class that supports the serialization of objects. It has a static compileTo method that accepts a Compilable instance and a File object and writes the object to the file, as follows:
try {
    AbstractExternalizable.compileTo((Compilable) classifier,
        new File("classifier.model"));
} catch (IOException ex) {
    // Handle exceptions
}
The loading of the classifier will be illustrated in the Classifying text using LingPipe section later in this chapter.
Other newsgroup data can be found at http://qwone.com/~jason/20Newsgroups/. These collections of data can be used to train other models. Although there are only 20 categories, they can be useful for training. Three different downloads are available; some have been sorted, and in others, duplicate data has been removed.
To classify text, we will use the DynamicLMClassifier class' classify method. We will demonstrate its use with two different text sequences:
forSale: The first is from http://www.homes.com/for-sale/, where we use the first complete sentence.
martinLuther: The second is from http://en.wikipedia.org/wiki/Martin_Luther, where we use the first sentence of the second paragraph.

These strings are declared here:
String forSale = "Finding a home for sale has never been "
    + "easier. With Homes.com, you can search new "
    + "homes, foreclosures, multi-family homes, "
    + "as well as condos and townhouses for sale. "
    + "You can even search our real estate agent "
    + "directory to work with a professional "
    + "Realtor and find your perfect home.";
String martinLuther = "Luther taught that salvation and subsequently "
    + "eternity in heaven is not earned by good deeds "
    + "but is received only as a free gift of God's "
    + "grace through faith in Jesus Christ as redeemer "
    + "from sin and subsequently eternity in Hell.";
To reuse the classifier serialized in the previous section, use the AbstractExternalizable class' readObject method, as shown here. We will use the LMClassifier class instead of the DynamicLMClassifier class. Both support the classify method, but the DynamicLMClassifier class is not readily serializable:
LMClassifier classifier = null;
try {
    classifier = (LMClassifier) AbstractExternalizable.readObject(
        new File("classifier.model"));
} catch (IOException | ClassNotFoundException ex) {
    // Handle exceptions
}
In the next code sequence, we apply the LMClassifier class' classify method. This returns a JointClassification instance, which we use to determine the best match:
JointClassification classification = classifier.classify(text);
System.out.println("Text: " + text);
String bestCategory = classification.bestCategory();
System.out.println("Best Category: " + bestCategory);
For the forSale text, we get the following output:
Text: Finding a home for sale has never been easier. With Homes.com, you can search new homes, foreclosures, multi-family homes, as well as condos and townhouses for sale. You can even search our real estate agent directory to work with a professional Realtor and find your perfect home.
Best Category: misc.forsale
For the martinLuther text, we get the following output:
Text: Luther taught that salvation and subsequently eternity in heaven is not earned by good deeds but is received only as a free gift of God's grace through faith in Jesus Christ as redeemer from sin and subsequently eternity in Hell.
Best Category: soc.religion.christian
Both texts were classified correctly.
Sentiment analysis is performed in a very similar manner to that of general text classification. One difference is the use of only two categories: positive and negative.
We need data files to train our model. We will use a simplified version of the sentiment analysis performed at http://alias-i.com/lingpipe/demos/tutorial/sentiment/read-me.html, using sentiment data developed for movies (http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz). This data was developed from 1,000 positive and 1,000 negative reviews of movies found in IMDb's movie archives.
These reviews need to be downloaded and extracted. A txt_sentoken directory will be extracted along with its two subdirectories: neg and pos. Both of these subdirectories contain movie reviews. Although some of these files could be held in reserve to evaluate the model created, we will use all of them to simplify the explanation.
We will start by re-initializing the variables declared in the Using LingPipe to classify text section. The categories array is set to a two-element array to hold the two categories. The classifier variable is assigned a new DynamicLMClassifier instance using the new category array and an nGramSize of 8:
categories = new String[2];
categories[0] = "neg";
categories[1] = "pos";
nGramSize = 8;
classifier = DynamicLMClassifier.createNGramProcess(
    categories, nGramSize);
As we did earlier, we will create a series of instances based on the contents found in the training files. We will not detail the following code as it is very similar to that found in the Training text using the Classified class section. The main difference is there are only two categories to process:
String directory = "...";
File trainingDirectory = new File(directory, "txt_sentoken");
for (int i = 0; i < categories.length; ++i) {
    Classification classification =
        new Classification(categories[i]);
    File file = new File(trainingDirectory, categories[i]);
    File[] trainingFiles = file.listFiles();
    for (int j = 0; j < trainingFiles.length; ++j) {
        try {
            String review = Files.readFromFile(
                trainingFiles[j], "ISO-8859-1");
            Classified<CharSequence> classified =
                new Classified<>(review, classification);
            classifier.handle(classified);
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}
The model is now ready to be used. We will use the review for the movie Forrest Gump:
String review = "An overly sentimental film with a somewhat "
    + "problematic message, but its sweetness and charm "
    + "are occasionally enough to approximate true depth "
    + "and grace. ";
We use the classify method to perform the actual work. It returns a Classification instance whose bestCategory method returns the best category, as shown here:
Classification classification = classifier.classify(review);
String bestCategory = classification.bestCategory();
System.out.println("Best Category: " + bestCategory);
When executed, we get the following output:
Best Category: pos
This approach will also work well for other categories of text.
LingPipe comes with a model, langid-leipzig.classifier, trained for several languages and found in the demos/models directory. A list of supported languages is found in the following table. This model was developed using training data derived from the Leipzig Corpora Collection (http://corpora.uni-leipzig.de/). Another good tool can be found at http://code.google.com/p/language-detection/.
Language | Abbreviation | Language | Abbreviation
---|---|---|---
Catalan | cat | Italian | it
Danish | dk | Japanese | jp
English | en | Korean | kr
Estonian | ee | Norwegian | no
Finnish | fi | Sorbian | sorb
French | fr | Swedish | se
German | de | Turkish | tr
To use this model, we use essentially the same code we used in the Classifying text using LingPipe section earlier in this chapter. We start with the same movie review of Forrest Gump:
String text = "An overly sentimental film with a somewhat "
    + "problematic message, but its sweetness and charm "
    + "are occasionally enough to approximate true depth "
    + "and grace. ";
System.out.println("Text: " + text);
The LMClassifier instance is created using the langid-leipzig.classifier file:
LMClassifier classifier = null;
try {
    classifier = (LMClassifier) AbstractExternalizable.readObject(
        new File(".../langid-leipzig.classifier"));
} catch (IOException | ClassNotFoundException ex) {
    // Handle exceptions
}
The classify method is used, followed by the bestCategory method, to obtain the best language fit, as shown here:
Classification classification = classifier.classify(text);
String bestCategory = classification.bestCategory();
System.out.println("Best Language: " + bestCategory);
The output is as follows with English being chosen:
Text: An overly sentimental film with a somewhat problematic message, but its sweetness and charm are occasionally enough to approximate true depth and grace.
Best Language: en
The following code example uses the first sentence of the Swedish Wikipedia entry in Swedish (http://sv.wikipedia.org/wiki/Svenska) for the text:
text = "Svenska är ett östnordiskt språk som talas av cirka "
    + "tio miljoner personer[1], främst i Finland "
    + "och Sverige.";
The output, as shown here, correctly selects the Swedish language:
Text: Svenska är ett östnordiskt språk som talas av cirka tio miljoner personer[1], främst i Finland och Sverige.
Best Language: se
Training can be conducted in the same way as done for the previous LingPipe models. Another consideration when performing language identification is that the text may be written in multiple languages. This can complicate the language detection process.
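One simple mitigation for mixed-language text is to split it into sentences and classify each fragment separately, so that each piece is dominated by a single language. A sketch of the splitting step follows; the splitter is a naive, illustrative helper of our own, and in practice each piece would then be passed to classifier.classify:

```java
// Split mixed-language text into sentences so each piece can be
// classified separately; a naive splitter on sentence punctuation.
// Each resulting sentence would then go to classifier.classify.
public class SentenceSplit {
    public static String[] split(String text) {
        return text.trim().split("(?<=[.!?])\\s+");
    }

    public static void main(String[] args) {
        String mixed = "This sentence is English. "
            + "Svenska är ett östnordiskt språk.";
        for (String sentence : split(mixed)) {
            System.out.println(sentence);
            // classifier.classify(sentence) would go here
        }
    }
}
```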