Chapter 9. Text Analysis

Text analysis is a broad topic and is typically referred to as Natural Language Processing (NLP). It is used for many different tasks, including text searching, language translation, sentiment analysis, speech recognition, and classification, to mention a few. Analyzing text can be difficult because of the particularities and ambiguity found in natural languages. However, a considerable amount of work has been done in this area, and several Java APIs support this effort.

We will start with an introduction to the basic concepts and tasks used in NLP. These include the following:

  • Tokenization: The process of splitting text into individual tokens or words.
  • Stop words: These are words that are common and may not be necessary for processing. They include such words as the, a, and to.
  • Named Entity Recognition (NER): This is the process of identifying elements of text such as people's names, locations, or things.
  • Parts of Speech (POS): This identifies the grammatical parts of a sentence such as noun, verb, adjective, and so on.
  • Relationships: Here, we are concerned with identifying how parts of text are related to each other, such as the subject and object of a sentence.

The concepts of words, sentences, and paragraphs are well known. However, extracting and analyzing these components is not always that straightforward. The term corpus frequently refers to a collection of text.

As with most data science problems, it is important to preprocess text. Frequently, this involves handling such tasks as these:

  • Handling Unicode
  • Converting text to uppercase or lowercase
  • Removing stop words
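
As a rough illustration of these preprocessing steps, the following sketch uses only standard Java. The class name, the method name, and the three-word stop list are hypothetical and exist only for this example; a real application would use a much longer stop list and a proper tokenizer:

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class PreprocessSketch {
    // A tiny, hypothetical stop word list; real lists are much longer.
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "to");

    static List<String> preprocess(String text) {
        // Handle Unicode: compose characters so equivalent sequences compare equal.
        String normalized = Normalizer.normalize(text, Normalizer.Form.NFC);
        // Convert to lowercase with a fixed locale to avoid locale-specific results.
        String lower = normalized.toLowerCase(Locale.ROOT);
        List<String> kept = new ArrayList<>();
        // Naive tokenization: split on runs of characters that are
        // neither letters nor digits.
        for (String token : lower.split("[^\\p{L}\\p{N}]+")) {
            // Remove stop words and any empty leading token.
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "cafe\u0301" is "cafe" followed by a combining acute accent;
        // NFC normalization composes it into a single accented letter.
        System.out.println(preprocess("The cat went to a cafe\u0301."));
    }
}
```

Normalizing before splitting matters here: without it, the combining accent would not count as a letter and the last word would be cut short.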

We examined several techniques for tokenization and removing stop words in Chapter 3, Data Cleaning. In this chapter, we will focus on POS, NER, extracting relationships from sentences, text classification, and sentiment analysis.

There are several NLP APIs available for Java.

We will use OpenNLP and DL4J to demonstrate text analysis in this chapter. We chose these because they are both well-known and have good published resources for additional support.

We will use the Google Word2Vec and Doc2Vec neural networks to perform text classification. This includes feature vectors based on other words as well as using labeled information to classify documents. Finally, we will discuss sentiment analysis. This type of analysis seeks to assign meaning to text and also uses the Word2Vec network.

We start our discussion with NER.

Implementing named entity recognition

This is sometimes referred to as finding people and things. Given a text segment, we may want to identify all the names of people present. However, this is not always easy because a name such as Rob may also be used as a verb.

In this section, we will demonstrate how to use OpenNLP's TokenNameFinderModel class to find names and locations in text. While there are other entities we may want to find, this example will demonstrate the basics of the technique. We begin with names.

Most names occur within a single sentence. We do not want to search across sentence boundaries because an entity such as a state might inadvertently be identified incorrectly. Consider the following sentences:

Jim headed north. Dakota headed south.

If we ignored the period, then the state of North Dakota might be identified as a location, when in fact it is not present.
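
One way to avoid this problem is to split the text into sentences before searching for entities. The following is a minimal sketch that uses the standard java.text.BreakIterator class rather than OpenNLP; the class and method names are hypothetical:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceSplitSketch {
    // Splits text into sentences using the JDK's rule-based sentence iterator.
    static List<String> splitSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // Each call to next() returns the boundary that ends the current sentence.
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Each sentence would then be tokenized and searched separately.
        for (String s : splitSentences("Jim headed north. Dakota headed south.")) {
            System.out.println(s);
        }
    }
}
```

A rule-based splitter like this may also break after abbreviations that end in a period, so a trained sentence detector, such as OpenNLP's SentenceDetectorME, may be a better fit for real text.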

Using OpenNLP to perform NER

We start our example with a try-with-resources block, which closes the model streams automatically and lets us handle exceptions. OpenNLP uses models that have been trained on different sets of data. In this example, the en-token.bin and en-ner-person.bin files contain the models for the tokenization of English text and for English name elements, respectively. These files can be downloaded from http://opennlp.sourceforge.net/models-1.5/. However, the IO stream used here is standard Java:

try (InputStream tokenStream =  
            new FileInputStream(new File("en-token.bin")); 
        InputStream personModelStream = new FileInputStream( 
            new File("en-ner-person.bin"));) { 
    ... 
} catch (Exception ex) { 
    // Handle exceptions 
} 

An instance of the TokenizerModel class is initialized using the token stream. This instance is then used to create the actual TokenizerME tokenizer. We will use this instance to tokenize our sentence:

TokenizerModel tm = new TokenizerModel(tokenStream); 
TokenizerME tokenizer = new TokenizerME(tm); 

The TokenNameFinderModel class is used to hold a model for name entities. It is initialized using the person model stream. An instance of the NameFinderME class is created using this model since we are looking for names:

TokenNameFinderModel tnfm = new 
  TokenNameFinderModel(personModelStream); 
NameFinderME nf = new NameFinderME(tnfm); 

To demonstrate the process, we will use the following sentence. We then convert it to a series of tokens using the tokenizer's tokenize method:

String sentence = "Mrs. Wilson went to Mary's house for dinner."; 
String[] tokens = tokenizer.tokenize(sentence); 

The Span class holds information regarding the positions of entities. The find method will return the position information, as shown here:

Span[] spans = nf.find(tokens); 

This array holds information about the person entities found in the sentence. We then display this information as shown here, assuming that out refers to System.out (for example, through a static import):

for (int i = 0; i < spans.length; i++) { 
    out.println(spans[i] + " - " + tokens[spans[i].getStart()]); 
} 

The output for this sequence is as follows. Notice that it identifies the last name of Mrs. Wilson but not the "Mrs.":

[1..2) person - Wilson
[4..5) person - Mary

Once these entities have been extracted, we can use them for specialized analysis.

Identifying location entities

We can also find other types of entities such as dates and locations. In the following example, we find locations in a sentence. It is very similar to the previous person example, except that an en-ner-location.bin file is used for the model:

try (InputStream tokenStream =  
            new FileInputStream("en-token.bin"); 
        InputStream locationModelStream = new FileInputStream( 
            new File("en-ner-location.bin"));) { 
 
    TokenizerModel tm = new TokenizerModel(tokenStream); 
    TokenizerME tokenizer = new TokenizerME(tm); 
 
    TokenNameFinderModel tnfm =  
        new TokenNameFinderModel(locationModelStream); 
    NameFinderME nf = new NameFinderME(tnfm); 
 
    String sentence = "Enid is located north of Oklahoma City."; 
    String[] tokens = tokenizer.tokenize(sentence); 
 
    Span[] spans = nf.find(tokens); 
 
    for (int i = 0; i < spans.length; i++) { 
        out.println(spans[i] + " - " +  
        tokens[spans[i].getStart()]); 
    } 
} catch (Exception ex) { 
    // Handle exceptions 
} 

With the sentence defined previously, the model was only able to find the second city, as shown here. The span [5..7) covers both tokens of Oklahoma City, although the loop prints only the first token of each span. The miss is likely due to the ambiguity of the name Enid, which is both the name of a city and a person's name:

[5..7) location - Oklahoma

Suppose we use the following sentence:

sentence = "Pond Creek is located north of Oklahoma City."; 

Then we get this output:


[1..2) location - Creek
[6..8) location - Oklahoma

Unfortunately, it has captured only Creek and missed the full name of the town, Pond Creek. NER is a useful tool for many applications, but like many techniques, it is not foolproof. The accuracy of the NER approach presented here, and of many of the other NLP examples, will vary depending on factors such as the quality of the model, the language being used, and the type of entity.
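
The outputs shown earlier print only the first token of each entity, even when the span covers several tokens. A span such as [5..7) is half-open: it starts at token 5 and ends just before token 7. The following sketch shows one way to recover the full entity text from the start and end indexes, as returned by the Span class's getStart and getEnd methods. The class name, the helper method, and the hand-written token array are assumptions for illustration; the tokens mirror what the indexes in the earlier output imply:

```java
import java.util.Arrays;

class SpanTextSketch {
    // Joins tokens[start] through tokens[end - 1] with single spaces.
    static String spanText(String[] tokens, int start, int end) {
        return String.join(" ", Arrays.copyOfRange(tokens, start, end));
    }

    public static void main(String[] args) {
        // Hand-written tokens standing in for the tokenizer's output.
        String[] tokens = {"Enid", "is", "located", "north", "of",
            "Oklahoma", "City", "."};
        // [5..7) is the location span reported for this sentence.
        System.out.println(spanText(tokens, 5, 7)); // prints "Oklahoma City"
    }
}
```

With real OpenNLP objects, the call would be spanText(tokens, spans[i].getStart(), spans[i].getEnd()). OpenNLP's Span class also offers a static spansToStrings method that serves a similar purpose.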

We may also be interested in how text can be classified. We will examine one approach in the next section.
