Searching is a rich and complex topic. There are many types of searches and many approaches to performing them. The intent here is to demonstrate how various NLP techniques can be applied to support this effort.
A single text document can be processed in a reasonable amount of time on most machines. However, when multiple large documents need to be searched, creating an index is a common approach, as it allows subsequent searches to complete quickly.
We will demonstrate one approach to create an index and then search using the index. Although the text we will use is not that large, it is sufficient to demonstrate the process.
We need to:
There are several factors that influence the contents of an index file:
We will use OpenNLP to demonstrate the process. The intent of this example is to demonstrate how to combine NLP techniques in a pipeline to solve a search-type problem. This is not a comprehensive solution, and we will ignore some techniques, such as stemming. In addition, the actual creation of an index file will not be presented but rather left as an exercise for the reader. Here, we will focus on how NLP techniques can be used.
Specifically, we will:
We will develop two classes to support the index data structure: Word and Positions. We will also augment the StopWords class, developed in Chapter 2, Finding Parts of Text, to support an overloaded version of the removeStopWords method. The new version will provide a more convenient way of removing stop words.
We start with a try-with-resources block to open streams for the sentence model, en-sent.bin, and a file containing the contents of Twenty Thousand Leagues Under the Sea by Jules Verne. The book was downloaded from http://www.gutenberg.org/ebooks/164 and modified slightly to remove leading and trailing Gutenberg text to make it more readable:
try (InputStream is = new FileInputStream(new File(
        "C:/Current Books/NLP and Java/Models/en-sent.bin"));
        FileReader fr = new FileReader("Twenty Thousands.txt");
        BufferedReader br = new BufferedReader(fr)) {
    …
} catch (IOException ex) {
    // Handle exceptions
}
The sentence model is used to create an instance of the SentenceDetectorME class, as shown here:
SentenceModel model = new SentenceModel(is);
SentenceDetectorME detector = new SentenceDetectorME(model);
Next, we will create a string using a StringBuilder instance to support the detection of sentence boundaries. The book's file is read and added to the StringBuilder instance. The sentDetect method is then applied to create an array of sentences, as shown here:
String line;
StringBuilder sb = new StringBuilder();
while ((line = br.readLine()) != null) {
    sb.append(line + " ");
}
String sentences[] = detector.sentDetect(sb.toString());
For the modified version of the book file, this method created an array with 14,859 sentences.
Next, we used the toLowerCase method to convert the text to lowercase. This was done to ensure that when the stop words are removed, the method catches all of them:
for (int i = 0; i < sentences.length; i++) {
    sentences[i] = sentences[i].toLowerCase();
}
Converting to lowercase and removing stop words restricts searches to case-insensitive matches on non-stop words. However, this is considered a feature of this implementation and can be adjusted in other implementations.
Next, the stop words are removed. As mentioned earlier, an overloaded version of the removeStopWords method has been added to make it easier to use in this example. The new method is shown here:
public String removeStopWords(String words) {
    String arr[] = WhitespaceTokenizer.INSTANCE.tokenize(words);
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < arr.length; i++) {
        // Keep only the tokens that are not stop words
        if (!stopWords.contains(arr[i])) {
            sb.append(arr[i] + " ");
        }
    }
    return sb.toString();
}
We created a StopWords instance using the stop-words_english_2_en.txt file, as shown in the following code sequence. This is one of several lists that can be downloaded from https://code.google.com/p/stop-words/. We chose this file simply because it contains stop words that we felt were appropriate for the book:
StopWords stopWords = new StopWords("stop-words_english_2_en.txt");
for (int i = 0; i < sentences.length; i++) {
    sentences[i] = stopWords.removeStopWords(sentences[i]);
}
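As mentioned, stemming is ignored in this example. If it were desired, this is the point in the pipeline where a stemmer could be applied, so that, for example, "reefs" and "reef" share a single index entry. The following is a minimal sketch, assuming a recent OpenNLP release that provides the opennlp.tools.stemmer.PorterStemmer class (older releases may not include it):

PorterStemmer stemmer = new PorterStemmer();
for (int i = 0; i < sentences.length; i++) {
    StringBuilder stemmed = new StringBuilder();
    // Reduce each token to its stem before it is indexed
    for (String token : WhitespaceTokenizer.INSTANCE.tokenize(sentences[i])) {
        stemmed.append(stemmer.stem(token)).append(" ");
    }
    sentences[i] = stemmed.toString().trim();
}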
The text has now been processed. The next step is to create an index-like data structure based on the processed text. This structure will use the Word and Positions classes. The Word class consists of fields for the word and an ArrayList of Positions objects. Since a word may appear more than once in a document, the list is used to maintain its positions within the document. This class is defined as shown here:
public class Word {
    private String word;
    private final ArrayList<Positions> positions;

    public Word() {
        this.positions = new ArrayList<>();
    }

    public void addWord(String word, int sentence, int position) {
        this.word = word;
        Positions counts = new Positions(sentence, position);
        positions.add(counts);
    }

    public ArrayList<Positions> getPositions() {
        return positions;
    }

    public String getWord() {
        return word;
    }
}
The Positions class contains a field for the sentence number, sentence, and a field for the position of the word within the sentence, position. The class definition is as follows:
class Positions {
    int sentence;
    int position;

    Positions(int sentence, int position) {
        this.sentence = sentence;
        this.position = position;
    }
}
To use these classes, we create a HashMap instance to hold position information about each word in the file:
HashMap<String, Word> wordMap = new HashMap<>();
The creation of the Word entries in the map is shown next. Each sentence is tokenized, and then each token is checked to see whether it already exists in the map. The word is used as the key to the hash map.
The containsKey method determines whether the word has already been added. If it has, the existing Word instance is removed from the map. If it has not, a new Word instance is created. In either case, the new position information is added to the Word instance, and the instance is then added to the map:
for (int sentenceIndex = 0; sentenceIndex < sentences.length; sentenceIndex++) {
    String words[] = WhitespaceTokenizer.INSTANCE.tokenize(
        sentences[sentenceIndex]);
    Word word;
    for (int wordIndex = 0; wordIndex < words.length; wordIndex++) {
        String newWord = words[wordIndex];
        if (wordMap.containsKey(newWord)) {
            word = wordMap.remove(newWord);
        } else {
            word = new Word();
        }
        word.addWord(newWord, sentenceIndex, wordIndex);
        wordMap.put(newWord, word);
    }
}
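On Java 8 and later, the remove-then-put sequence can be expressed more compactly with the map's computeIfAbsent method, which returns the existing Word for a key or creates and inserts a new one in a single step. This is an alternative sketch, not the version used in the rest of this chapter:

for (int sentenceIndex = 0; sentenceIndex < sentences.length; sentenceIndex++) {
    String words[] = WhitespaceTokenizer.INSTANCE.tokenize(
        sentences[sentenceIndex]);
    for (int wordIndex = 0; wordIndex < words.length; wordIndex++) {
        // Fetches the existing Word or inserts a new one if the key is absent
        Word word = wordMap.computeIfAbsent(words[wordIndex], k -> new Word());
        word.addWord(words[wordIndex], sentenceIndex, wordIndex);
    }
}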
To demonstrate the actual lookup process, we use the get method to return the Word instance for the word "reef". The list of positions is returned with the getPositions method, and then each position is displayed, as shown here:
Word word = wordMap.get("reef");
ArrayList<Positions> positions = word.getPositions();
for (Positions position : positions) {
    System.out.println(word.getWord() + " is found at line "
        + position.sentence + ", word " + position.position);
}
reef is found at line 0, word 10
reef is found at line 29, word 6
reef is found at line 1885, word 8
reef is found at line 2062, word 12
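Because the index was built from lowercased text with stop words removed, a raw query term must be normalized the same way before it is looked up, or the search may miss entries. The following helper is a minimal sketch; the name findWord is ours, not part of OpenNLP:

public Word findWord(HashMap<String, Word> wordMap, StopWords stopWords,
        String query) {
    // Apply the same normalization that was applied to the indexed text
    String normalized = stopWords.removeStopWords(query.toLowerCase()).trim();
    // Returns null if the term was a stop word or was never indexed
    return wordMap.get(normalized);
}

A call such as findWord(wordMap, stopWords, "Reef") will then match the lowercase "reef" entries shown above.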
This implementation is relatively simple, but it does demonstrate how to combine various NLP techniques to create and use an index data structure that can be saved as an index file. Other enhancements are possible, including extensions to the Positions class. These are left as exercises for the reader.
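As a starting point for the index file exercise, here is a minimal sketch that writes the map to a plain-text file, one occurrence per line. The file name index.txt and the word/sentence/position format are our own choices for illustration; Java object serialization would be another option. The usual java.io and java.util imports are assumed:

try (PrintWriter out = new PrintWriter(new FileWriter("index.txt"))) {
    for (Map.Entry<String, Word> entry : wordMap.entrySet()) {
        for (Positions p : entry.getValue().getPositions()) {
            // One occurrence per line: word, sentence number, word position
            out.println(entry.getKey() + " " + p.sentence + " " + p.position);
        }
    }
} catch (IOException ex) {
    // Handle exceptions
}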