What is tokenization?

Tokenization is the process of breaking text down into simpler units. For most text, we are concerned with isolating words. Text is split into tokens based on a set of delimiters, and these delimiters are frequently whitespace characters. Whitespace in Java is defined by the Character class's isWhitespace method; the characters it accepts are listed in the following table, and a simple tokenizer built on this method is sketched after it. However, at times we may need to use a different set of delimiters. For example, different delimiters can be useful when whitespace obscures text breaks that we need to detect, such as paragraph boundaries.

Character                  Meaning

Unicode space character    space_separator, line_separator, or paragraph_separator
\t                         U+0009 horizontal tabulation
\n                         U+000A line feed
\u000B                     U+000B vertical tabulation
\f                         U+000C form feed
\r                         U+000D carriage return
\u001C                     U+001C file separator
\u001D                     U+001D group separator
\u001E                     U+001E record separator
\u001F                     U+001F unit separator
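
To make the definition concrete, what follows is a minimal sketch of a tokenizer built directly on the Character.isWhitespace method. The class and method names here are our own, not part of any library:

    import java.util.ArrayList;
    import java.util.List;

    public class WhitespaceTokenizer {

        // Splits text into tokens, treating every character for which
        // Character.isWhitespace returns true as a delimiter
        public static List<String> tokenize(String text) {
            List<String> tokens = new ArrayList<>();
            StringBuilder current = new StringBuilder();
            for (char c : text.toCharArray()) {
                if (Character.isWhitespace(c)) {
                    if (current.length() > 0) {
                        tokens.add(current.toString());
                        current.setLength(0);
                    }
                } else {
                    current.append(c);
                }
            }
            if (current.length() > 0) {
                tokens.add(current.toString());
            }
            return tokens;
        }

        public static void main(String[] args) {
            // Tabs and line feeds are treated just like spaces
            System.out.println(tokenize("Mr. Smith went\tto\nWashington."));
            // Prints: [Mr., Smith, went, to, Washington.]
        }
    }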

The tokenization process is complicated by a large number of factors such as:

  • Language: Different languages present unique challenges. Whitespace is a commonly used delimiter, but it is not sufficient if we need to work with languages such as Chinese, which does not use whitespace between words.
  • Text format: Text is often stored or presented in different formats. Plain text is processed differently from HTML or other markup, which complicates the tokenization process.
  • Stopwords: Commonly used words might not be important for some NLP tasks, such as general searches. These common words are called stopwords. Stopwords are sometimes removed when they do not contribute to the NLP task at hand. They include words such as "a", "and", and "she".
  • Text expansion: It is sometimes desirable to expand acronyms and abbreviations so that later processing can produce better-quality results. For example, if a search is interested in the word "machine", then knowing that IBM stands for International Business Machines can be useful.
  • Case: The case of a word (upper or lower) may be significant in some situations. For example, case can help identify proper nouns. When identifying the parts of text, converting everything to the same case can simplify searches.
  • Stemming and lemmatization: These processes alter words to reduce them to their "roots". Stemming typically strips affixes heuristically, while lemmatization maps a word to its dictionary form; a naive stemming sketch follows this list.
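
The following is a deliberately naive stemming sketch that strips a few common English suffixes. It is only meant to illustrate the idea; production systems use well-tested algorithms such as the Porter stemmer, and lemmatization additionally requires dictionary lookups:

    public class NaiveStemmer {

        // Strips a few common suffixes; this is a toy heuristic,
        // not a real stemming algorithm
        public static String stem(String word) {
            String w = word.toLowerCase();
            for (String suffix : new String[] {"ing", "ed", "es", "s"}) {
                if (w.length() > suffix.length() + 2 && w.endsWith(suffix)) {
                    return w.substring(0, w.length() - suffix.length());
                }
            }
            return w;
        }

        public static void main(String[] args) {
            System.out.println(stem("walking")); // walk
            System.out.println(stem("walked"));  // walk
            System.out.println(stem("walks"));   // walk
        }
    }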

Removing stopwords can save space in an index and make the indexing process faster. However, some search engines do not remove stopwords because they can be useful for certain queries. For example, when performing an exact match, removing stopwords will result in misses. Also, the named entity recognition (NER) task often depends on stopword inclusion. Recognizing that "Romeo and Juliet" is a play depends on the inclusion of the word "and".
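
As an illustration, here is a minimal stopword filter. The hard-coded stopword set is a toy example of our own; a real application would load a full list such as those mentioned in the following note:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;

    public class StopwordFilter {

        // A toy stopword set, for illustration only
        private static final Set<String> STOPWORDS = new HashSet<>(
                Arrays.asList("a", "an", "and", "in", "of", "on", "the", "to"));

        public static List<String> removeStopwords(List<String> tokens) {
            return tokens.stream()
                    .filter(token -> !STOPWORDS.contains(token.toLowerCase()))
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            List<String> tokens =
                    Arrays.asList("the", "cat", "sat", "on", "a", "mat");
            System.out.println(removeStopwords(tokens));
            // Prints: [cat, sat, mat]
        }
    }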

Note

There are many lists that define stopwords, and what constitutes a stopword can depend on the problem domain. A list of stopwords can be found at http://www.ranks.nl/stopwords; it covers several categories of English stopwords as well as stopwords for languages other than English. At http://www.textfixer.com/resources/common-english-words.txt, you will find a comma-separated list of English stopwords.

The top ten stopwords, adapted from Stanford (http://library.stanford.edu/blogs/digital-library-blog/2011/12/stopwords-searchworks-be-or-not-be), are listed in the following table:

Stopword   Occurrences

the        7,578
of         6,582
and        4,106
in         2,298
a          1,137
to         1,033
for        695
on         685
an         289
with       231

We will focus on the techniques used to tokenize English text. This usually involves using whitespace or other delimiters to return a list of tokens.
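
For instance, a single call to String's split method with a whitespace regular expression returns such a list. This is only the simplest approach, and as the output shows, punctuation stays attached to the tokens:

    import java.util.Arrays;
    import java.util.List;

    public class SplitExample {
        public static void main(String[] args) {
            String text = "Let's pause,  and\tthen reflect.";
            // "\\s+" matches one or more whitespace characters, so runs of
            // spaces, tabs, and newlines act as a single delimiter
            List<String> tokens = Arrays.asList(text.split("\\s+"));
            System.out.println(tokens);
            // Prints: [Let's, pause,, and, then, reflect.]
        }
    }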

Note

Parsing is closely related to tokenization. They are both concerned with identifying parts of text, but parsing is also concerned with identifying the parts of speech and their relationship to each other.

Uses of tokenizers

The output of tokenization can be used for simple tasks such as spell checking and processing simple searches. It is also useful for various downstream NLP tasks, such as identifying parts of speech (POS), detecting sentences, and classifying text. Most of the chapters that follow involve tasks that require tokenization.

Frequently, the tokenization process is just one step in a larger sequence of tasks. These steps involve the use of pipelines, as we will illustrate in Using a pipeline later in this chapter. This highlights the need for tokenizers that produce quality results for the downstream task. If the tokenizer does a poor job, then the downstream task will be adversely affected.

There are many different tokenizers and tokenization techniques available in Java. There are several core Java classes that were designed to support tokenization. Some of these are now outdated. There are also a number of NLP APIs designed to address both simple and complex tokenization problems. The next two sections will examine these approaches. First, we will see what the Java core classes have to offer, and then we will demonstrate a number of the NLP API tokenization libraries.
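
As a brief preview, the following sketch uses java.util.StringTokenizer, one of the older core classes just mentioned; its own documentation recommends the split method of String or the java.util.regex package for new code:

    import java.util.StringTokenizer;

    public class LegacyTokenizerDemo {
        public static void main(String[] args) {
            // The second argument replaces the default whitespace
            // delimiters; here, each line becomes one token, and the
            // empty line between them is skipped
            StringTokenizer tokenizer =
                    new StringTokenizer("paragraph one\n\nparagraph two", "\n");
            while (tokenizer.hasMoreTokens()) {
                System.out.println(tokenizer.nextToken());
            }
            // Prints:
            // paragraph one
            // paragraph two
        }
    }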
