Understanding normalization

Normalization is a process that converts a list of words to a more uniform sequence. This is useful in preparing text for later processing. Transforming the words to a standard format allows other operations to work with the data without having to deal with issues that might otherwise compromise the process. For example, converting all words to lowercase simplifies the searching process.

The normalization process can also improve text matching. For example, there are several ways that the term "modem router" can be expressed, such as modem and router, modem & router, modem/router, and modem-router. Normalizing these variants to a common form makes it easier to supply the right information to a shopper.
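
To make this concrete, here is a minimal sketch of one way to collapse such variants with a regular expression. The pattern and the variant list are illustrative assumptions, not a production-ready rule:

String[] variants = {"modem and router", "modem & router",
    "modem/router", "modem-router"};
for (String variant : variants) {
    // Replace " and " or the separators &, /, and - with a single space
    String normalized =
        variant.replaceAll("\\s+and\\s+|\\s*[&/-]\\s*", " ");
    System.out.println(normalized);  // Prints "modem router" each time
}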

Understand that normalization might also compromise an NLP task. Converting to lowercase letters can decrease the reliability of searches when the case is important, for example, when distinguishing the acronym "IT" from the pronoun "it".

Normalization operations can include the following:

  • Changing characters to lowercase
  • Expanding abbreviations
  • Removing stopwords
  • Stemming and lemmatization

We will investigate these techniques here, except for expanding abbreviations. Abbreviation expansion works much like stopword removal, except that each abbreviation is replaced with its expanded form rather than removed, as the sketch that follows illustrates.
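
Though we will not develop it further, a minimal sketch of abbreviation expansion might look like the following. The map contents and the sample tokens are hypothetical stand-ins for a real abbreviation list:

Map<String, String> abbreviations = new HashMap<>();
abbreviations.put("dr", "doctor");
abbreviations.put("st", "street");
abbreviations.put("apt", "apartment");

String[] tokens = {"Dr", "Smith", "lives", "on", "Elm", "St"};
for (int i = 0; i < tokens.length; i++) {
    // Replace the token with its expansion when one exists;
    // otherwise, the token is left unchanged
    tokens[i] = abbreviations.getOrDefault(
        tokens[i].toLowerCase(), tokens[i]);
}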

Converting to lowercase

Converting text to lowercase is a simple process that can improve search results. We can either use Java methods such as the String class' toLowerCase method, or use the capability found in some NLP APIs such as LingPipe's LowerCaseTokenizerFactory class. The toLowerCase method is demonstrated here:

String text = "A Sample string with acronyms, IBM, and UPPER " + "and lowercase letters.";
String result = text.toLowerCase();
System.out.println(result);

The output will be as follows:

a sample string with acronyms, ibm, and upper and lowercase letters.

LingPipe's LowerCaseTokenizerFactory approach is illustrated in the section Normalizing using a pipeline, later in this chapter.

Removing stopwords

There are several approaches to remove stopwords. A simple approach is to create a class to hold and remove stopwords. Also, several NLP APIs provide support for stopword removal. We will create a simple class called StopWords to demonstrate the first approach. We will then use LingPipe's EnglishStopTokenizerFactory class to demonstrate the second approach.

Creating a StopWords class

The process of removing stopwords involves examining a stream of tokens, comparing them to a list of stopwords, and then removing the stopwords from the stream. To illustrate this approach, we will create a simple class that supports basic operations as defined in the following table:

Constructor/Method             Usage
Default constructor            Uses a default set of stopwords
Single-argument constructor    Uses stopwords stored in a file
addStopWord                    Adds a new stopword to the internal list
removeStopWords                Accepts an array of words and returns a new array with the stopwords removed

Create a class called StopWords, which declares two instance variables as shown in the following code block. The defaultStopWords variable is an array that holds the default stopword list. The HashSet variable stopWords holds the stopwords for processing purposes:

public class StopWords {

    private String[] defaultStopWords = {"i", "a", "about", "an",
        "are", "as", "at", "be", "by", "com", "for", "from", "how",
        "in", "is", "it", "of", "on", "or", "that", "the", "this",
        "to", "was", "what", "when", "where", "who", "will", "with"};

    private static HashSet<String> stopWords = new HashSet<String>();
    ...
}

The class's two constructors, shown next, populate the HashSet:

public StopWords() {
    stopWords.addAll(Arrays.asList(defaultStopWords));
}

public StopWords(String fileName) {
    // try-with-resources ensures the reader is closed
    try (BufferedReader bufferedReader =
            new BufferedReader(new FileReader(fileName))) {
        String line;
        while ((line = bufferedReader.readLine()) != null) {
            stopWords.add(line);
        }
    } catch (IOException ex) {
        ex.printStackTrace();
    }
}

The convenience method addStopWord allows additional words to be added:

public void addStopWord(String word) {
    stopWords.add(word);
}

The removeStopWords method removes the stopwords. It creates an ArrayList that holds the original words passed to the method. The for loop removes any word found in the stopWords set, as determined by the contains method; the loop index is decremented after each removal so that the word shifted into the vacated position is not skipped. The ArrayList is then converted to an array of strings and returned. This is shown as follows:

public String[] removeStopWords(String[] words) {
    ArrayList<String> tokens = 
        new ArrayList<String>(Arrays.asList(words));
    for (int i = 0; i < tokens.size(); i++) {
        if (stopWords.contains(tokens.get(i))) {
            tokens.remove(i);
            i--;  // Recheck this index; the list shifted left
        }
    }
    return tokens.toArray(new String[tokens.size()]);
}
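
Since stopWords is a HashSet, the loop can also be replaced with a single call to the Collection interface's removeAll method, which removes every element that appears in the given collection. An equivalent version of the method follows:

public String[] removeStopWords(String[] words) {
    ArrayList<String> tokens =
        new ArrayList<String>(Arrays.asList(words));
    // Remove every token that appears in the stopword set
    tokens.removeAll(stopWords);
    return tokens.toArray(new String[tokens.size()]);
}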

The following sequence illustrates how StopWords can be used. First, we declare an instance of the StopWords class using the default constructor. The OpenNLP SimpleTokenizer class is declared and the sample text is defined, as shown here:

StopWords stopWords = new StopWords();
SimpleTokenizer simpleTokenizer = SimpleTokenizer.INSTANCE;
paragraph = "A simple approach is to create a class "
    + "to hold and remove stopwords.";

The sample text is tokenized and then passed to the removeStopWords method. The new list is then displayed:

String tokens[] = simpleTokenizer.tokenize(paragraph);
String list[] = stopWords.removeStopWords(tokens);
for (String word : list) {
    System.out.println(word);
}

When executed, we get the following output. The "A" is not removed because it is uppercase and the class does not perform case conversion:

A
simple
approach
create
class
hold
remove
stopwords
.

Using LingPipe to remove stopwords

LingPipe possesses the EnglishStopTokenizerFactory class that we will use to identify and remove stopwords. The words in its list can be found at http://alias-i.com/lingpipe/docs/api/com/aliasi/tokenizer/EnglishStopTokenizerFactory.html. They include words such as a, was, but, he, and for.

The factory class' constructor requires a TokenizerFactory instance as its argument. We will use the factory's tokenizer method to process a list of words and remove the stopwords. We start by declaring the string to be tokenized:

String paragraph = "A simple approach is to create a class " 
    + "to hold and remove stopwords.";

Next, we create an instance of a TokenizerFactory based on the IndoEuropeanTokenizerFactory class. We then use that factory as the argument to create our EnglishStopTokenizerFactory instance:

TokenizerFactory factory = IndoEuropeanTokenizerFactory.INSTANCE;
factory = new EnglishStopTokenizerFactory(factory);

Using the LingPipe Tokenizer class and the factory's tokenizer method, the text declared in the paragraph variable is processed. The tokenizer method takes an array of char, a starting index, and the number of characters to process:

Tokenizer tokenizer = factory.tokenizer(paragraph.toCharArray(), 0, paragraph.length());

The following for-each statement will iterate over the revised list:

for (String token : tokenizer) {
    System.out.println(token);
}

The output will be as follows:

A
simple
approach
create
class
hold
remove
stopwords
.

Notice that although the letter "A" is a stopword, it was not removed from the list. This is because the stopword list contains the lowercase "a", not the uppercase "A", so the word was missed. We will correct this problem in the section Normalizing using a pipeline, later in the chapter.

Using stemming

Finding the stem of a word involves removing any prefixes or suffixes; what is left is considered to be the stem. Identifying stems is useful for tasks where finding similar words is important. For example, a search may be looking for occurrences of words like "book". There are many words that share this stem, including books, booked, bookings, and bookmark. It can be useful to identify stems and then look for their occurrences in a document. In many situations, this can improve the quality of a search.

A stemmer may produce a stem that is not a real word. For example, it may decide that bounties, bounty, and bountiful all have the same stem, "bounti". This can still be useful for searches.

Note

Similar to stemming is lemmatization, the process of finding a word's lemma, that is, its form as found in a dictionary. This can also be useful for some searches. Stemming is frequently viewed as a more primitive technique, where the attempt to get to the "root" of a word involves cutting off parts of the beginning and/or ending of a token.

Lemmatization can be thought of as a more sophisticated approach where effort is devoted to finding the morphological or vocabulary meaning of a token. For example, the word "having" has a stem of "hav" while its lemma is "have". Also, the words "was" and "been" have different stems but the same lemma, "be".

Lemmatization can often use more computational resources than stemming. They both have their place and their utility is partially determined by the problem that needs to be solved.

Using the Porter Stemmer

The Porter Stemmer is a commonly used stemmer for English. Its home page can be found at http://tartarus.org/martin/PorterStemmer/. It uses five steps to stem a word.

Although Apache OpenNLP 1.5.3 does not contain the PorterStemmer class, its source code can be downloaded from https://svn.apache.org/repos/asf/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/stemmer/PorterStemmer.java. It can then be added to your project.

In the next example, we demonstrate the PorterStemmer class against an array of words. The input could just as easily have originated from some other text source. An instance of the PorterStemmer class is created and then its stem method is applied to each word of the array:

String words[] = {"bank", "banking", "banks", "banker", "banked", "bankart"};
PorterStemmer ps = new PorterStemmer();
for(String word : words) {
    String stem = ps.stem(word);
    System.out.println("Word: " + word + "  Stem: " + stem);
}

When executed, you will get the following output:

Word: bank  Stem: bank
Word: banking  Stem: bank
Word: banks  Stem: bank
Word: banker  Stem: banker
Word: banked  Stem: bank
Word: bankart  Stem: bankart

The last word is used in combination with the word "lesion" as in "Bankart lesion". This is an injury of the shoulder and doesn't have much to do with the previous words. It does show that only common affixes are used when finding the stem.

Other potentially useful PorterStemmer class methods are found in the following table:

Method    Meaning
add       Adds a char to the end of the current stem word
stem      When called without an argument, this method returns true if a different stem was produced
reset     Resets the stemmer so that a different word can be used
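
The following sketch shows how these methods might be used together. It assumes, as in the original reference implementation, that the stemmed result is retrieved with the toString method:

PorterStemmer stemmer = new PorterStemmer();
// Build up the word to be stemmed one character at a time
for (char ch : "banking".toCharArray()) {
    stemmer.add(ch);
}
// The no-argument stem method returns true if stemming
// produced a different word
if (stemmer.stem()) {
    System.out.println("Stem: " + stemmer.toString());
}
stemmer.reset();  // Ready the stemmer for another word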

Stemming with LingPipe

The PorterStemmerTokenizerFactory class is used to find stems using LingPipe. In this example, we will use the same words array as in the previous section. The IndoEuropeanTokenizerFactory class is used to perform the initial tokenization followed by the use of the Porter Stemmer. These classes are defined here:

TokenizerFactory tokenizerFactory = IndoEuropeanTokenizerFactory.INSTANCE;
TokenizerFactory porterFactory = 
    new PorterStemmerTokenizerFactory(tokenizerFactory);

An array to hold the stems is declared next. We reuse the words array declared in the previous section. Each word is processed individually. The word is tokenized and its stem is stored in stems as shown in the following code block. The words and their stems are then displayed:

String[] stems;
for (int i = 0; i < words.length; i++) {
    Tokenization tokenizer = new Tokenization(words[i], porterFactory);
    stems = tokenizer.tokens();
    System.out.print("Word: " + words[i]);
    for (String stem : stems) {
        System.out.println("  Stem: " + stem);
    }
}

When executed, we get the following output:

Word: bank  Stem: bank
Word: banking  Stem: bank
Word: banks  Stem: bank
Word: banker  Stem: banker
Word: banked  Stem: bank
Word: bankart  Stem: bankart

We have demonstrated the Porter Stemmer using OpenNLP and LingPipe examples. It is worth noting that other types of stemmers are available, including n-gram-based stemmers and various mixed probabilistic/algorithmic approaches.

Using lemmatization

Lemmatization is supported by a number of NLP APIs. In this section, we will illustrate how lemmatization can be performed using the StanfordCoreNLP class and OpenNLP's JWNLDictionary class. The lemmatization process determines the lemma of a word. A lemma can be thought of as the dictionary form of a word. For example, the lemma of "was" is "be".

Using the StanfordCoreNLP class

We will use the StanfordCoreNLP class with a pipeline to demonstrate lemmatization. We start by setting up the pipeline with four annotators including lemma as shown here:

StanfordCoreNLP pipeline;
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma");
pipeline = new StanfordCoreNLP(props);

The first four annotators are the ones we need here; the others listed are also available:

Annotator    Operation to be Performed
tokenize     Tokenization
ssplit       Sentence splitting
pos          POS tagging
lemma        Lemmatization
ner          Named entity recognition (NER)
parse        Syntactic parsing
dcoref       Coreference resolution

A paragraph variable is used with the Annotation constructor and the annotate method is then executed, as shown here:

String paragraph = "Similar to stemming is Lemmatization. " 
    +"This is the process of finding its lemma, its form " + 
    +"as found in a dictionary.";
Annotation document = new Annotation(paragraph);
pipeline.annotate(document);

We now need to iterate over the sentences and over the tokens of each sentence. The Annotation and CoreMap classes' get methods return values of the type specified; if there are no values of that type, null is returned. We will use these classes to obtain a list of lemmas.

First, a list of sentences is returned and then each word of each sentence is processed to find lemmas. The list of sentences and lemmas are declared here:

List<CoreMap> sentences = document.get(SentencesAnnotation.class);
List<String> lemmas = new LinkedList<>();

Two for-each statements iterate over the sentences to populate the lemmas list. Once this is completed, the list is displayed:

for (CoreMap sentence : sentences) {
    for (CoreLabel word : sentence.get(TokensAnnotation.class)) {
        lemmas.add(word.get(LemmaAnnotation.class));
    }
}

System.out.print("[");
for (String element : lemmas) {
    System.out.print(element + " ");
}
System.out.println("]");

The output of this sequence is as follows:

[similar to stem be lemmatization . this be the process of find its lemma , its form as find in a dictionary . ]

Comparing this to the original text, we see that it does a pretty good job:

Similar to stemming is Lemmatization. This is the process of finding its lemma, its form as found in a dictionary.

Using lemmatization in OpenNLP

OpenNLP also supports lemmatization using the JWNLDictionary class. This class' constructor uses a string that contains the path of the dictionary files used to identify roots. We will use a WordNet dictionary developed at Princeton University (wordnet.princeton.edu). The actual dictionary is a series of files stored in a directory. These files contain a list of words and their "root". For the example used in this section, we will use the dictionary found at https://code.google.com/p/xssm/downloads/detail?name=SimilarityUtils.zip&can=2&q=.

The JWNLDictionary class' getLemmas method is passed the word we want to process and a second parameter that specifies the POS for the word. It is important that the POS match the actual word type if we want accurate results.

In the next code sequence, we create an instance of the JWNLDictionary class using a path ending with \dict\. This is the location of the dictionary. We also define our sample text. The constructor can throw IOException and JWNLException, which we deal with in a try-catch block sequence:

try {
    // Note the escaped backslashes in the path string
    dictionary = new JWNLDictionary("…\\dict\\");
    paragraph = "Eat, drink, and be merry, for life is but a dream";
    …
} catch (IOException | JWNLException ex) {
    // Handle the exception
}

Following the text initialization, add the following statements. First, we tokenize the string using the WhitespaceTokenizer class as explained in the section Using the WhitespaceTokenizer class. Then, each token is passed to the getLemmas method with an empty string as the POS type. The original token and its lemmas are then displayed:

String tokens[] = WhitespaceTokenizer.INSTANCE.tokenize(paragraph);
for (String token : tokens) {
    String[] lemmas = dictionary.getLemmas(token, "");
    for (String lemma : lemmas) {
        System.out.println("Token: " + token + "  Lemma: " + lemma);
    }
}

The output is as follows:

Token: Eat,  Lemma: at
Token: drink,  Lemma: drink
Token: be  Lemma: be
Token: life  Lemma: life
Token: is  Lemma: is
Token: is  Lemma: i
Token: a  Lemma: a
Token: dream  Lemma: dream

The lemmatization process works well except for the token "is", which returns two lemmas. The second one is not valid. This illustrates the importance of using the proper POS for a token. We could have used one or more of the POS tags as the argument to the getLemmas method. However, this raises the question: how do we determine the correct POS? This topic is discussed in detail in Chapter 5, Detecting Parts of Speech.
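
For example, assuming the dictionary accepts Penn Treebank-style tags, supplying "VB" should restrict the lookup to verb lemmas. This call is a hypothetical sketch rather than verified output:

// Supplying a verb POS tag so that only verb lemmas are returned;
// "is" should then lemmatize to "be"
String[] lemmas = dictionary.getLemmas("is", "VB");
for (String lemma : lemmas) {
    System.out.println("Token: is  Lemma: " + lemma);
}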

A short list of POS tags is found in the following table. This list is adapted from https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html. The complete list of The University of Pennsylvania (Penn) Treebank Tag-set can be found at http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html.

Tag      Description
JJ       Adjective
NN       Noun, singular or mass
NNS      Noun, plural
NNP      Proper noun, singular
NNPS     Proper noun, plural
POS      Possessive ending
PRP      Personal pronoun
RB       Adverb
RP       Particle
VB       Verb, base form
VBD      Verb, past tense
VBG      Verb, gerund or present participle

Normalizing using a pipeline

In this section, we will combine many of the normalization techniques using a pipeline. To demonstrate this process, we will expand upon the example used in Using LingPipe to remove stopwords. We will add two additional factories to normalize text: LowerCaseTokenizerFactory and PorterStemmerTokenizerFactory.

The LowerCaseTokenizerFactory is applied before the EnglishStopTokenizerFactory is created, and the PorterStemmerTokenizerFactory is applied after it, as shown here:

paragraph = "A simple approach is to create a class "
     + "to hold and remove stopwords.";
TokenizerFactory factory = IndoEuropeanTokenizerFactory.INSTANCE;
factory = new LowerCaseTokenizerFactory(factory);
factory = new EnglishStopTokenizerFactory(factory);
factory = new PorterStemmerTokenizerFactory(factory);
Tokenizer tokenizer = factory.tokenizer(paragraph.toCharArray(), 0, paragraph.length());
for (String token : tokenizer) {
    System.out.println(token);
}

The output is as follows:

simpl
approach
creat
class
hold
remov
stopword
.

What we have left are the stems of the words in lowercase with the stopwords removed.
