Creating a pipeline to search text

Searching is a rich and complex topic. There are many types of searches and many approaches to performing them. The intent here is to demonstrate how various NLP techniques can be applied to support this task.

A single text document can usually be processed in a reasonable amount of time on most machines. However, when multiple large documents need to be searched, creating an index is a common approach; the index allows the search process itself to complete in a reasonable period of time.

We will demonstrate one approach to creating an index and then searching with it. Although the text we will use is not that large, it is sufficient to demonstrate the process.

We need to:

  1. Read the text from the file
  2. Tokenize and find sentence boundaries
  3. Remove stop words
  4. Accumulate the index statistics
  5. Write out the index file

There are several factors that influence the contents of an index file:

  • Removal of stop words
  • Case-sensitive searches
  • Finding synonyms
  • Using stemming and lemmatization
  • Allowing searches across sentence boundaries

We will use OpenNLP to demonstrate the process. The intent of this example is to demonstrate how to combine NLP techniques in a pipeline process to solve a search-type problem. This is not a comprehensive solution, and we will ignore some techniques, such as stemming. In addition, the actual creation of an index file will not be presented but rather left as an exercise for the reader. Here, we will focus on how NLP techniques can be used.

Specifically, we will:

  • Split the book into sentences
  • Convert the sentences to lowercase
  • Remove stop words
  • Create an internal index data structure

We will develop two classes to support the index data structure: Word and Positions. We will also augment the StopWords class, developed in Chapter 2, Finding Parts of Text, to support an overloaded version of the removeStopWords method. The new version provides a more convenient way to remove stop words from a string.
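
The StopWords class itself is not repeated here. For reference, the following is a minimal sketch of what this example assumes the class provides; the HashSet field named stopWords and the one-word-per-line file format are assumptions based on how the class is used later in this section:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;

public class StopWords {
    // Stop words loaded from the file, one word per line (assumed format)
    private final HashSet<String> stopWords = new HashSet<>();

    public StopWords(String fileName) {
        try (BufferedReader br = new BufferedReader(
                new FileReader(fileName))) {
            String line;
            while ((line = br.readLine()) != null) {
                stopWords.add(line.trim().toLowerCase());
            }
        } catch (IOException ex) {
            // Handle exceptions
        }
    }

    // The overloaded removeStopWords(String) method shown later in
    // this section is added to this class
}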

We start with a try-with-resources block to open streams for the sentence model, en-sent.bin, and a file containing the contents of Twenty Thousand Leagues Under the Sea by Jules Verne. The book was downloaded from http://www.gutenberg.org/ebooks/164 and modified slightly to remove leading and trailing Gutenberg text to make it more readable:

try (InputStream is = new FileInputStream(new File(
    "C:/Current Books/NLP and Java/Models/en-sent.bin"));
    FileReader fr = new FileReader("Twenty Thousands.txt");
    BufferedReader br = new BufferedReader(fr)) {
        …
} catch (IOException ex) {
    // Handle exceptions
}

The sentence model is used to create an instance of the SentenceDetectorME class as shown here:

SentenceModel model = new SentenceModel(is);
SentenceDetectorME detector = new SentenceDetectorME(model);

Next, we use a StringBuilder instance to assemble the book's text into a single string so that sentence boundaries can be detected. The book's file is read line by line and appended to the StringBuilder instance. The sentDetect method is then applied to create an array of sentences, as shown here:

String line;
StringBuilder sb = new StringBuilder();
while ((line = br.readLine()) != null) {
    sb.append(line).append(" ");
}
String sentences[] = detector.sentDetect(sb.toString());

For the modified version of the book file, this method created an array with 14,859 sentences.
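
This count can be reproduced with a simple print of the array's length:

System.out.println("Number of sentences: " + sentences.length);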

Next, we use the toLowerCase method to convert the text to lowercase. This ensures that the stop word removal catches all of the stop words, regardless of their original case.

for (int i = 0; i < sentences.length; i++) {
    sentences[i] = sentences[i].toLowerCase();
}

Converting to lowercase and removing stop words restricts the searches that can be performed. However, this is considered a feature of this implementation and can be adjusted for other implementations.

Next, the stop words are removed. As mentioned earlier, an overloaded version of the removeStopWords method has been added to make it easier to use with this example. The new method is shown here:

public String removeStopWords(String words) {
    String arr[] = 
        WhitespaceTokenizer.INSTANCE.tokenize(words);
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < arr.length; i++) {
        // Keep only the tokens that are not stop words
        if (!stopWords.contains(arr[i])) {
            sb.append(arr[i]).append(" ");
        }
    }
    return sb.toString();
}

We created a StopWords instance using the stop-words_english_2_en.txt file as shown in the following code sequence. This is one of several lists that can be downloaded from https://code.google.com/p/stop-words/. We chose this file simply because it contains stop words that we felt were appropriate for the book.

StopWords stopWords = new StopWords("stop-words_english_2_en.txt");
for (int i = 0; i < sentences.length; i++) {
    sentences[i] = stopWords.removeStopWords(sentences[i]);
}

The text has now been processed. The next step is to create an index-like data structure based on the processed text. This structure will use the Word and Positions classes. The Word class consists of a field for the word and an ArrayList of Positions objects. Since a word may appear more than once in a document, the list is used to maintain its positions within the document. This class is defined as shown here:

public class Word {
    private String word;
    private final ArrayList<Positions> positions;

    public Word() {
        this.positions = new ArrayList<>();
    }

    public void addWord(String word, int sentence, 
            int position) {
        this.word = word;
        Positions counts = new Positions(sentence, position);
        positions.add(counts);
    }

    public ArrayList<Positions> getPositions() {
        return positions;
    }

    public String getWord() {
        return word;
    }
}

The Positions class contains a field for the sentence number, sentence, and a field for the position of the word within the sentence, position. The class definition is as follows:

class Positions {
    int sentence;
    int position;

    Positions(int sentence, int position) {
        this.sentence = sentence;
        this.position = position;
    }
}

To use these classes, we create a HashMap instance to hold position information about each word in the file:

HashMap<String, Word> wordMap = new HashMap<>();

The creation of the Word entries in the map is shown next. Each sentence is tokenized, and then each token is checked to see whether it already exists in the map. The word is used as the key to the hash map.

The containsKey method determines whether the word has already been added. If it has, the existing Word instance is removed from the map so that it can be updated. If the word has not been added before, a new Word instance is created. In either case, the new position information is added to the Word instance, and it is then put back into the map:

for (int sentenceIndex = 0; 
        sentenceIndex < sentences.length; sentenceIndex++) {
    String words[] = WhitespaceTokenizer.INSTANCE.tokenize(
        sentences[sentenceIndex]);
    Word word;
    for (int wordIndex = 0; 
            wordIndex < words.length; wordIndex++) {
        String newWord = words[wordIndex];
        if (wordMap.containsKey(newWord)) {
            word = wordMap.remove(newWord);
        } else {
            word = new Word();
        }
        word.addWord(newWord, sentenceIndex, wordIndex);
        wordMap.put(newWord, word);
    }
}

To demonstrate the actual lookup process, we use the get method to return an instance of the Word object for the word "reef". The list of positions is returned by the getPositions method, and then each position is displayed, as shown here (the "line" in the output refers to the sentence index):

Word word = wordMap.get("reef");
ArrayList<Positions> positions = word.getPositions();
for (Positions position : positions) {
    System.out.println(word.getWord() + " is found at line " 
        + position.sentence + ", word " + position.position);
}

The output is as follows:

reef is found at line 0, word 10
reef is found at line 29, word 6
reef is found at line 1885, word 8
reef is found at line 2062, word 12

This implementation is relatively simple, but it does demonstrate how to combine various NLP techniques to create and use an index data structure that can be saved as an index file. Other enhancements are possible, including:

  • Applying other filter operations
  • Storing document information in the Positions class
  • Storing chapter information in the Positions class
  • Providing search options, such as:
    • Case-sensitive searches
    • Exact text searches
  • Improving exception handling
These are left as exercises for the reader.
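
As a rough starting point for the index file, the following sketch writes each word and its positions to a simple text file. The file name index.txt and the word sentence:position line format are illustrative assumptions rather than part of the original example, and the code assumes that java.io.FileWriter and java.io.PrintWriter have been imported:

try (PrintWriter out = new PrintWriter(new FileWriter("index.txt"))) {
    for (Word entry : wordMap.values()) {
        StringBuilder lineBuilder = new StringBuilder(entry.getWord());
        for (Positions p : entry.getPositions()) {
            // Record each occurrence as sentence:position
            lineBuilder.append(" ")
                .append(p.sentence).append(":").append(p.position);
        }
        out.println(lineBuilder.toString());
    }
} catch (IOException ex) {
    // Handle exceptions
}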
