Normalization is a process that converts a list of words to a more uniform sequence. This is useful in preparing text for later processing. By transforming the words to a standard format, other operations are able to work with the data and will not have to deal with issues that might compromise the process. For example, converting all words to lowercase will simplify the searching process.
The normalization process can also improve text matching. For example, the term "modem router" can be expressed in several ways, such as modem and router, modem & router, modem/router, and modem-router. Normalizing these variants to a common form makes it easier to supply the right information to a shopper.
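The idea can be sketched in plain Java. The replacement rules below are our own illustration, not part of any NLP API: separators such as "&", "/", and "-" (and the word "and") are collapsed to a single space and the result is lowercased, so every variant maps to one canonical form:

```java
import java.util.Locale;

public class VariantNormalizer {
    // Collapse "and", "&", "/", and "-" (with surrounding spaces)
    // into a single space, then lowercase, so all variants of a
    // term such as "modem router" map to one canonical form.
    public static String normalize(String term) {
        return term.toLowerCase(Locale.ROOT)
                   .replaceAll("\\s*(?:\\band\\b|[&/-])\\s*", " ")
                   .trim();
    }

    public static void main(String[] args) {
        String[] variants = {"modem and router", "modem & router",
            "modem/router", "modem-router"};
        for (String v : variants) {
            System.out.println(normalize(v)); // each prints "modem router"
        }
    }
}
```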
Be aware that the normalization process might also compromise an NLP task. Converting text to lowercase can decrease the reliability of searches when case is important.
Normalization operations can include the following:

- Converting characters to lowercase
- Expanding abbreviations
- Removing stopwords
- Stemming and lemmatization

We will investigate these techniques here except for expanding abbreviations. That technique is similar to the one used to remove stopwords, except that the abbreviations are replaced with their expanded versions.
Converting text to lowercase is a simple process that can improve search results. We can either use Java methods such as the String class' toLowerCase method, or use the capability found in some NLP APIs such as LingPipe's LowerCaseTokenizerFactory class. The toLowerCase method is demonstrated here:

```java
String text = "A Sample string with acronyms, IBM, and UPPER "
    + "and lowercase letters.";
String result = text.toLowerCase();
System.out.println(result);
```
The output will be as follows:
a sample string with acronyms, ibm, and upper and lowercase letters.
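One caveat worth noting: toLowerCase with no argument uses the default locale, which can produce surprising results in some languages (the Turkish dotless "ı" is the classic case). When normalizing text for matching, passing an explicit locale is safer:

```java
import java.util.Locale;

public class LowerCaseDemo {
    public static void main(String[] args) {
        String text = "TITLE";
        // Locale.ROOT gives locale-neutral case mapping. Under a
        // Turkish locale, uppercase 'I' lowercases to the dotless
        // 'ı', which would break comparisons against "title".
        System.out.println(text.toLowerCase(Locale.ROOT));          // title
        System.out.println(text.toLowerCase(new Locale("tr", "TR"))); // tıtle
    }
}
```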
LingPipe's LowerCaseTokenizerFactory approach is illustrated in the section Normalizing using a pipeline, later in this chapter.
There are several approaches to removing stopwords. A simple approach is to create a class to hold and remove them. Several NLP APIs also provide support for stopword removal. We will create a simple class called StopWords to demonstrate the first approach, and then use LingPipe's EnglishStopTokenizerFactory class to demonstrate the second.
The process of removing stopwords involves examining a stream of tokens, comparing them to a list of stopwords, and then removing the stopwords from the stream. To illustrate this approach, we will create a simple class that supports basic operations as defined in the following table:
Constructor/Method | Usage
---|---
Default constructor | Uses a default set of stopwords
Single argument constructor | Uses stopwords stored in a file
addStopWord | Adds a new stopword to the internal list
removeStopWords | Accepts an array of words and returns a new array with the stopwords removed
Create a class called StopWords, which declares two instance variables as shown in the following code block. The defaultStopWords variable is an array that holds the default stopword list. The HashSet variable stopWords is used to hold the stopwords for processing purposes:

```java
public class StopWords {
    private String[] defaultStopWords = {"i", "a", "about", "an",
        "are", "as", "at", "be", "by", "com", "for", "from", "how",
        "in", "is", "it", "of", "on", "or", "that", "the", "this",
        "to", "was", "what", "when", "where", "who", "will", "with"};
    private HashSet<String> stopWords = new HashSet<>();
    ...
}
```
Two constructors follow, which populate the HashSet. The second reads stopwords from a file, one per line; a try-with-resources block ensures that the reader is closed:

```java
public StopWords() {
    stopWords.addAll(Arrays.asList(defaultStopWords));
}

public StopWords(String fileName) {
    try (BufferedReader bufferedReader =
            new BufferedReader(new FileReader(fileName))) {
        while (bufferedReader.ready()) {
            stopWords.add(bufferedReader.readLine());
        }
    } catch (IOException ex) {
        ex.printStackTrace();
    }
}
```
The convenience method addStopWord allows additional words to be added:

```java
public void addStopWord(String word) {
    stopWords.add(word);
}
```
The removeStopWords method is used to remove the stopwords. It creates an ArrayList to hold the original words passed to the method. A for loop removes stopwords from this list: the contains method determines whether a word is a stopword and, if so, it is removed. Note that the index is decremented after each removal so that the element shifted into the vacated position is not skipped. The ArrayList is then converted to an array of strings and returned:

```java
public String[] removeStopWords(String[] words) {
    ArrayList<String> tokens =
        new ArrayList<String>(Arrays.asList(words));
    for (int i = 0; i < tokens.size(); i++) {
        if (stopWords.contains(tokens.get(i))) {
            tokens.remove(i);
            i--; // the list shifted left; revisit this index
        }
    }
    return tokens.toArray(new String[tokens.size()]);
}
```
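On Java 8 and later, the index bookkeeping can be avoided entirely; Collection.removeIf performs the same filtering in one call. A minimal self-contained sketch, using a hard-coded stopword set rather than the class above:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class RemoveIfDemo {
    public static void main(String[] args) {
        Set<String> stopWords =
            new HashSet<>(Arrays.asList("a", "is", "to", "and"));
        ArrayList<String> tokens = new ArrayList<>(Arrays.asList(
            "a", "simple", "approach", "is", "to", "create",
            "a", "class"));
        // removeIf safely removes every matching element in one pass,
        // with no index arithmetic to get wrong
        tokens.removeIf(stopWords::contains);
        System.out.println(tokens); // [simple, approach, create, class]
    }
}
```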
The following sequence illustrates how StopWords can be used. First, we declare an instance of the StopWords class using the default constructor. The OpenNLP SimpleTokenizer class is declared and the sample text is defined, as shown here:

```java
StopWords stopWords = new StopWords();
SimpleTokenizer simpleTokenizer = SimpleTokenizer.INSTANCE;
paragraph = "A simple approach is to create a class "
    + "to hold and remove stopwords.";
```
The sample text is tokenized and then passed to the removeStopWords method. The new list is then displayed:

```java
String tokens[] = simpleTokenizer.tokenize(paragraph);
String list[] = stopWords.removeStopWords(tokens);
for (String word : list) {
    System.out.println(word);
}
```
When executed, we get the following output. The "A" is not removed because it is uppercase and the class does not perform case conversion:

A
simple
approach
create
class
hold
remove
stopwords
.
LingPipe provides the EnglishStopTokenizerFactory class, which we will use to identify and remove stopwords. The words in its list can be found at http://alias-i.com/lingpipe/docs/api/com/aliasi/tokenizer/EnglishStopTokenizerFactory.html. They include words such as a, was, but, he, and for.
The EnglishStopTokenizerFactory class' constructor requires a TokenizerFactory instance as its argument. We will use the factory's tokenizer method to process a list of words and remove the stopwords. We start by declaring the string to be tokenized:

```java
String paragraph = "A simple approach is to create a class "
    + "to hold and remove stopwords.";
```
Next, we create an instance of a TokenizerFactory based on the IndoEuropeanTokenizerFactory class. We then use that factory as the argument to create our EnglishStopTokenizerFactory instance:

```java
TokenizerFactory factory = IndoEuropeanTokenizerFactory.INSTANCE;
factory = new EnglishStopTokenizerFactory(factory);
```
Using the LingPipe Tokenizer class and the factory's tokenizer method, the text declared in the paragraph variable is processed. The tokenizer method takes an array of char, a starting index, and a length:

```java
Tokenizer tokenizer = factory.tokenizer(
    paragraph.toCharArray(), 0, paragraph.length());
```
The following for-each statement iterates over the revised list:

```java
for (String token : tokenizer) {
    System.out.println(token);
}
```
The output will be as follows:

A
simple
approach
create
class
hold
remove
stopwords
.
Notice that although the letter "A" is a stopword, it was not removed from the list. This is because the stopword list uses a lowercase "a", not an uppercase "A", so the word was missed. We will correct this problem in the section Normalizing using a pipeline, later in the chapter.
Finding the stem of a word involves removing any prefixes or suffixes and what is left is considered to be the stem. Identifying stems is useful for tasks where finding similar words is important. For example, a search may be looking for occurrences of words like "book". There are many words that contain this word including books, booked, bookings, and bookmark. It can be useful to identify stems and then look for their occurrence in a document. In many situations, this can improve the quality of a search.
A stemmer may produce a stem that is not a real word. For example, it may decide that bounties, bounty, and bountiful all have the same stem, "bounti". This can still be useful for searches.
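A crude suffix-stripping sketch (our own toy rules, far simpler than a real algorithm such as Porter's) shows how such non-word stems arise:

```java
public class ToySuffixStemmer {
    // Our own toy rules: strip a few common English suffixes. The
    // result need not be a real word: "bounties", "bounty", and
    // "bountiful" all reduce to the non-word stem "bounti".
    public static String stem(String word) {
        if (word.endsWith("ies")) {
            return word.substring(0, word.length() - 3) + "i";
        }
        if (word.endsWith("ful")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.endsWith("y")) {
            return word.substring(0, word.length() - 1) + "i";
        }
        if (word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    public static void main(String[] args) {
        for (String w : new String[]{"bounties", "bounty", "bountiful"}) {
            System.out.println(w + " -> " + stem(w)); // all yield "bounti"
        }
    }
}
```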
Similar to stemming is lemmatization, the process of finding a word's lemma, its form as found in a dictionary. This can also be useful for some searches. Stemming is frequently viewed as a more primitive technique, where the attempt to get to the "root" of a word involves cutting off parts of the beginning and/or ending of a token.
Lemmatization can be thought of as a more sophisticated approach where effort is devoted to finding the morphological or vocabulary meaning of a token. For example, the word "having" has a stem of "hav" while its lemma is "have". Also, the words "was" and "been" have different stems but the same lemma, "be".
Lemmatization can often use more computational resources than stemming. They both have their place and their utility is partially determined by the problem that needs to be solved.
The Porter Stemmer is a commonly used stemmer for English. Its home page can be found at http://tartarus.org/martin/PorterStemmer/. It uses five steps to stem a word.
Although Apache OpenNLP 1.5.3 does not contain the PorterStemmer class, its source code can be downloaded from https://svn.apache.org/repos/asf/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/stemmer/PorterStemmer.java. It can then be added to your project.
In the next example, we demonstrate the PorterStemmer class against an array of words. The input could just as easily have originated from some other text source. An instance of the PorterStemmer class is created and then its stem method is applied to each word of the array:

```java
String words[] = {"bank", "banking", "banks", "banker", "banked",
    "bankart"};
PorterStemmer ps = new PorterStemmer();
for (String word : words) {
    String stem = ps.stem(word);
    System.out.println("Word: " + word + " Stem: " + stem);
}
```
When executed, you will get the following output:

Word: bank Stem: bank
Word: banking Stem: bank
Word: banks Stem: bank
Word: banker Stem: banker
Word: banked Stem: bank
Word: bankart Stem: bankart
The last word is used in combination with the word "lesion" as in "Bankart lesion". This is an injury of the shoulder and doesn't have much to do with the previous words. It does show that only common affixes are used when finding the stem.
Other potentially useful PorterStemmer class methods are found in the following table:

Method | Meaning
---|---
add(char ch) | This will add a char to the end of the current word
stem() | The method used without an argument will return true if a different stem occurs
reset() | Resets the stemmer so that a different word can be used
The PorterStemmerTokenizerFactory class is used to find stems using LingPipe. In this example, we will use the same words array as in the previous section. The IndoEuropeanTokenizerFactory class is used to perform the initial tokenization, followed by the use of the Porter Stemmer. These classes are defined here:

```java
TokenizerFactory tokenizerFactory =
    IndoEuropeanTokenizerFactory.INSTANCE;
TokenizerFactory porterFactory =
    new PorterStemmerTokenizerFactory(tokenizerFactory);
```
We reuse the words array declared in the previous section. Each word is processed individually: it is tokenized and its stems are retrieved using the tokens method, as shown in the following code block. The words and their stems are then displayed:

```java
for (int i = 0; i < words.length; i++) {
    Tokenization tokenizer =
        new Tokenization(words[i], porterFactory);
    String[] stems = tokenizer.tokens();
    System.out.print("Word: " + words[i]);
    for (String stem : stems) {
        System.out.println(" Stem: " + stem);
    }
}
```
When executed, we get the following output:

Word: bank Stem: bank
Word: banking Stem: bank
Word: banks Stem: bank
Word: banker Stem: banker
Word: banked Stem: bank
Word: bankart Stem: bankart
We have demonstrated Porter Stemmer using OpenNLP and LingPipe examples. It is worth noting that there are other types of stemmers available including NGrams and various mixed probabilistic/algorithmic approaches.
Lemmatization is supported by a number of NLP APIs. In this section, we will illustrate how lemmatization can be performed using the StanfordCoreNLP and the OpenNLP JWNLDictionary classes. The lemmatization process determines the lemma of a word. A lemma can be thought of as the dictionary form of a word. For example, the lemma of "was" is "be".
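Because irregular forms such as "was" cannot be derived by suffix rules, lemmatizers rely on dictionaries. A toy sketch (our own tiny hand-built table, not an API) illustrates the lookup idea:

```java
import java.util.HashMap;
import java.util.Map;

public class ToyLemmatizer {
    private static final Map<String, String> LEMMAS = new HashMap<>();
    static {
        // Tiny hand-built dictionary of irregular forms
        LEMMAS.put("was", "be");
        LEMMAS.put("been", "be");
        LEMMAS.put("is", "be");
        LEMMAS.put("having", "have");
    }

    // Fall back to the word itself when no entry exists
    public static String lemma(String word) {
        return LEMMAS.getOrDefault(word, word);
    }

    public static void main(String[] args) {
        System.out.println(lemma("was"));   // be
        System.out.println(lemma("been"));  // be
        System.out.println(lemma("dream")); // dream
    }
}
```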
We will use the StanfordCoreNLP class with a pipeline to demonstrate lemmatization. We start by setting up the pipeline with four annotators, including lemma, as shown here:

```java
StanfordCoreNLP pipeline;
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma");
pipeline = new StanfordCoreNLP(props);
```
These and other common annotators are explained in the following table:

Annotator | Operation to be Performed
---|---
tokenize | Tokenization
ssplit | Sentence splitting
pos | POS tagging
lemma | Lemmatization
ner | NER
parse | Syntactic parsing
dcoref | Coreference resolution
A paragraph variable is used with the Annotation constructor, and the annotate method is then executed, as shown here:

```java
String paragraph = "Similar to stemming is Lemmatization. "
    + "This is the process of finding its lemma, its form "
    + "as found in a dictionary.";
Annotation document = new Annotation(paragraph);
pipeline.annotate(document);
```
We now need to iterate over the sentences and the tokens of each sentence. The Annotation and CoreMap classes' get methods return values of the type specified; if there are no values of the specified type, they return null. We will use these classes to obtain a list of lemmas.
First, a list of sentences is returned and then each word of each sentence is processed to find lemmas. The list of sentences and lemmas are declared here:
```java
List<CoreMap> sentences =
    document.get(SentencesAnnotation.class);
List<String> lemmas = new LinkedList<>();
```
Two for-each statements iterate over the sentences to populate the lemmas list. Once this is completed, the list is displayed:

```java
for (CoreMap sentence : sentences) {
    for (CoreLabel word : sentence.get(TokensAnnotation.class)) {
        lemmas.add(word.get(LemmaAnnotation.class));
    }
}
System.out.print("[");
for (String element : lemmas) {
    System.out.print(element + " ");
}
System.out.println("]");
```
The output of this sequence is as follows:
[similar to stem be lemmatization . this be the process of find its lemma , its form as find in a dictionary . ]
Comparing this to the original text, we see that it does a pretty good job:
Similar to stemming is Lemmatization. This is the process of finding its lemma, its form as found in a dictionary.
OpenNLP also supports lemmatization using the JWNLDictionary class. This class' constructor takes a string containing the path of the dictionary files used to identify roots. We will use a WordNet dictionary developed at Princeton University (wordnet.princeton.edu). The actual dictionary is a series of files stored in a directory; these files contain a list of words and their "root". For the example used in this section, we will use the dictionary found at https://code.google.com/p/xssm/downloads/detail?name=SimilarityUtils.zip&can=2&q=.
The JWNLDictionary class' getLemmas method is passed the word we want to process and a second parameter that specifies the POS for the word. It is important that the POS match the actual word type if we want accurate results.
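Why the POS matters can be sketched with a toy lookup keyed by word and tag (our own illustration; the words, tags, and method here are not part of the JWNLDictionary API):

```java
import java.util.HashMap;
import java.util.Map;

public class PosAwareLemmaDemo {
    private static final Map<String, String> LEMMAS = new HashMap<>();
    static {
        // Toy entries keyed by word + POS: the same surface form can
        // map to different lemmas depending on its part of speech.
        LEMMAS.put("saw|VERB", "see");
        LEMMAS.put("saw|NOUN", "saw");
        LEMMAS.put("leaves|VERB", "leave");
        LEMMAS.put("leaves|NOUN", "leaf");
    }

    // Fall back to the word itself when no entry exists
    public static String lemma(String word, String pos) {
        return LEMMAS.getOrDefault(word + "|" + pos, word);
    }

    public static void main(String[] args) {
        System.out.println(lemma("leaves", "NOUN")); // leaf
        System.out.println(lemma("leaves", "VERB")); // leave
    }
}
```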
In the next code sequence, we create an instance of the JWNLDictionary class using a path ending with \dict\. This is the location of the dictionary. We also define our sample text. The constructor can throw IOException and JWNLException, which we handle in a try-catch block:

```java
try {
    dictionary = new JWNLDictionary("…\\dict\\");
    paragraph = "Eat, drink, and be merry, for life is but a dream";
    …
} catch (IOException | JWNLException ex) {
    // handle the exception
}
```
Following the text initialization, add the following statements. First, we tokenize the string using the WhitespaceTokenizer class, as explained in the section Using the WhitespaceTokenizer class. Then, each token is passed to the getLemmas method with an empty string as the POS type. The original token and its lemmas are then displayed:

```java
String tokens[] = WhitespaceTokenizer.INSTANCE.tokenize(paragraph);
for (String token : tokens) {
    String[] lemmas = dictionary.getLemmas(token, "");
    for (String lemma : lemmas) {
        System.out.println("Token: " + token + " Lemma: " + lemma);
    }
}
```
The output is as follows:

Token: Eat, Lemma: at
Token: drink, Lemma: drink
Token: be Lemma: be
Token: life Lemma: life
Token: is Lemma: is
Token: is Lemma: i
Token: a Lemma: a
Token: dream Lemma: dream
The lemmatization process works well except for the token "is", which returns two lemmas; the second one is not valid. This illustrates the importance of using the proper POS for a token. We could have used one or more of the POS tags as the argument to the getLemmas method. However, this raises the question: how do we determine the correct POS? This topic is discussed in detail in Chapter 5, Detecting Parts of Speech.
A short list of POS tags is found in the following table. This list is adapted from https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html. The complete list of The University of Pennsylvania (Penn) Treebank Tag-set can be found at http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html.
Tag | Description
---|---
JJ | Adjective
NN | Noun, singular or mass
NNS | Noun, plural
NNP | Proper noun, singular
NNPS | Proper noun, plural
POS | Possessive ending
PRP | Personal pronoun
RB | Adverb
RP | Particle
VB | Verb, base form
VBD | Verb, past tense
VBG | Verb, gerund or present participle
In this section, we will combine many of the normalization techniques using a pipeline. To demonstrate this process, we will expand upon the example used in Using LingPipe to remove stopwords. We will add two additional factories to normalize text: LowerCaseTokenizerFactory and PorterStemmerTokenizerFactory.

The LowerCaseTokenizerFactory factory is added before the creation of the EnglishStopTokenizerFactory, and the PorterStemmerTokenizerFactory after it, as shown here:

```java
paragraph = "A simple approach is to create a class "
    + "to hold and remove stopwords.";
TokenizerFactory factory = IndoEuropeanTokenizerFactory.INSTANCE;
factory = new LowerCaseTokenizerFactory(factory);
factory = new EnglishStopTokenizerFactory(factory);
factory = new PorterStemmerTokenizerFactory(factory);
Tokenizer tokenizer = factory.tokenizer(
    paragraph.toCharArray(), 0, paragraph.length());
for (String token : tokenizer) {
    System.out.println(token);
}
```
The output is as follows:

simpl
approach
creat
class
hold
remov
stopword
.
What we have left are the stems of the words in lowercase with the stopwords removed.
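The same wrap-one-stage-in-another design can be sketched in plain Java with java.util.function.Function composition. This is our own stand-in for the LingPipe factories, using only lowercasing and stopword removal as stages:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

public class PipelineSketch {
    // Build a pipeline by composing stages, mirroring how each
    // LingPipe factory wraps the factory created before it.
    public static Function<List<String>, List<String>> pipeline(
            Set<String> stopWords) {
        Function<List<String>, List<String>> lowerCase = tokens ->
            tokens.stream()
                  .map(t -> t.toLowerCase(Locale.ROOT))
                  .collect(Collectors.toList());
        Function<List<String>, List<String>> removeStops = tokens ->
            tokens.stream()
                  .filter(t -> !stopWords.contains(t))
                  .collect(Collectors.toList());
        // Lowercasing runs first, so uppercase stopwords like "A"
        // are caught, just as in the factory pipeline above
        return lowerCase.andThen(removeStops);
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(
            Arrays.asList("a", "is", "to", "and"));
        List<String> tokens = Arrays.asList("A", "simple", "approach",
            "is", "to", "create", "a", "class");
        System.out.println(pipeline(stopWords).apply(tokens));
        // [simple, approach, create, class]
    }
}
```

Ordering the stages is the key design decision, exactly as with the factories: lowercase before stopword removal, and stemming last.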