Finding sentences

Words (tokens) aren't the only structures that we're interested in. Another useful grammatical structure is the sentence. In this recipe, we'll follow a process similar to the one in the previous recipe, Tokenizing text, to create a function that pulls sentences from a string, just as tokenize pulled tokens from one.

Getting ready

We'll need to include clojure-opennlp in our project.clj file:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]])

We will also need to require it into the current namespace:

(require '[opennlp.nlp :as nlp])

Finally, we'll need a model for a statistical sentence splitter. I downloaded en-sent.bin from http://opennlp.sourceforge.net/models-1.5/ and saved it as models/en-sent.bin.

How to do it…

As in the Tokenizing text recipe, we will start by loading the sentence identification model data, as shown here:

(def get-sentences
  (nlp/make-sentence-detector "models/en-sent.bin"))

Now, we use that data to split a text into a series of sentences, as follows:

user=> (get-sentences "I never saw a Purple Cow.
           I never hope to see one.
           But I can tell you, anyhow.
           I'd rather see than be one.")
 ["I never saw a Purple Cow."
  "I never hope to see one."
  "But I can tell you, anyhow."
  "I'd rather see than be one."]
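
Since the result prints as a vector of strings, ordinary sequence functions can be applied to it. The following is a minimal sketch, not part of the recipe itself: sentence-lengths is a hypothetical helper, the whitespace split is only a crude stand-in for the tokenizer from the previous recipe, and the output shown above is reused as literal data so that no model file is needed to try it.

```clojure
(require '[clojure.string :as str])

;; The sentences returned in the example above, reused as literal data.
(def sentences
  ["I never saw a Purple Cow."
   "I never hope to see one."
   "But I can tell you, anyhow."
   "I'd rather see than be one."])

;; Hypothetical helper: counts whitespace-separated words per sentence.
;; This is a rough approximation, not a real tokenizer.
(defn sentence-lengths [sentences]
  (mapv #(count (str/split % #"\s+")) sentences))

(sentence-lengths sentences)
;; => [6 6 6 6]
```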

How it works…

The data model in models/en-sent.bin contains the information that OpenNLP needs to recreate a previously trained sentence identification algorithm. Once we have reinstantiated this algorithm, we can use it to identify the sentences in a text, as we did by calling get-sentences.
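
To make the idea of reinstantiating the algorithm more concrete, here is a rough sketch of the same steps written directly against OpenNLP's Java classes. It is an illustration only, not the actual source of nlp/make-sentence-detector, and load-sentence-detector is a hypothetical name. It assumes that the OpenNLP jar pulled in by clojure-opennlp is on the classpath and that the model was saved to models/en-sent.bin as described earlier.

```clojure
(require '[clojure.java.io :as io])
(import '[opennlp.tools.sentdetect SentenceModel SentenceDetectorME])

;; Hypothetical helper: reads the trained model from disk, wraps it in a
;; detector, and returns a function that splits a string into sentences.
(defn load-sentence-detector [model-path]
  (with-open [in (io/input-stream model-path)]
    (let [model    (SentenceModel. in)          ; deserialize the trained model
          detector (SentenceDetectorME. model)] ; rebuild the detector from it
      (fn [text]
        ;; sentDetect returns a Java String array; vec makes it a vector.
        (vec (.sentDetect detector text))))))

;; Usage, assuming the model file exists at this path:
;; (def my-get-sentences (load-sentence-detector "models/en-sent.bin"))
;; (my-get-sentences "I never saw a Purple Cow. I never hope to see one.")
```

The model file is read only once, when the function is created. Each later call reuses the detector that was built from it.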
