Tokenizing text

Before we can do any real analysis of a text or a corpus of texts, we have to identify the words in it. This process is called tokenization. Its output is a list of tokens: typically the words in the text, and often the punctuation as well. This is different from tokenizing formal languages such as programming languages: natural language tokenization is meant to handle messier, more ambiguous input, and its results are less structured.

It's easy to write your own tokenizer, but there are a lot of edge and corner cases to account for. It's just as easy to pull in a natural language processing (NLP) library that includes one or more tokenizers. In this recipe, we'll use OpenNLP (http://opennlp.apache.org/) and its Clojure wrapper, clojure-opennlp (https://clojars.org/clojure-opennlp).
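To get a feel for those edge cases, here's a minimal hand-rolled sketch (a hypothetical naive-tokenize function, not part of OpenNLP or clojure-opennlp) built on a single regular expression:

(defn naive-tokenize
  "A naive tokenizer: runs of word characters (keeping apostrophes)
  or single punctuation marks. Purely illustrative."
  [s]
  (re-seq #"[\w']+|[^\s\w]" s))

user=> (naive-tokenize "This isn't a string.")
("This" "isn't" "a" "string" ".")

Notice that it leaves isn't as a single token. Contractions, abbreviations, hyphenation, and the like are exactly where a trained tokenizer earns its keep.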

Getting ready

We'll need to include clojure-opennlp in our project.clj file:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]])

We will also need to require it into the current namespace, as follows:

(require '[opennlp.nlp :as nlp])

Finally, we'll download a model for a statistical tokenizer. I downloaded all of the files from http://opennlp.sourceforge.net/models-1.5/ and saved them into the models/ directory.

How to do it…

In order to tokenize a document, we first need to create a tokenizer. We can do this by loading the model:

(def tokenize (nlp/make-tokenizer "models/en-token.bin"))

Then, we can use it by passing a string to the tokenizer function:

user=> (tokenize "This is a string.")
["This" "is" "a" "string" "."]
user=> (tokenize "This isn't a string.")
["This" "is" "n't" "a" "string" "."]

How it works…

In OpenNLP, tokenizers are statistical models trained to identify token boundaries in a particular language. The en-token.bin file contains the data for a tokenizer trained on English. In the second example of the previous section, we can see that it correctly pulls the contracted not (n't) off its base word, is.

Once we load this model data back into a tokenizer, we can call it on any string to pull the tokens out.

The main catch is that the language the model was trained on has to match the language of the input strings we're attempting to tokenize.
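The models page linked above also provides tokenizer models for several other languages. For example, here's a hypothetical sketch for German (assuming you've downloaded de-token.bin into models/ alongside the English model; the exact output depends on the model):

(def tokenize-de (nlp/make-tokenizer "models/de-token.bin"))

user=> (tokenize-de "Das ist ein Satz.")
["Das" "ist" "ein" "Satz" "."]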
