Tokenization is the process of breaking text down into simpler units. For most text, we are concerned with isolating words. Tokens are split based on a set of delimiters, which are frequently whitespace characters. Whitespace in Java is defined by the Character class's isWhitespace method; these characters are listed in the following table. At times, however, there may be a need to use a different set of delimiters. For example, non-whitespace delimiters can be useful when whitespace obscures text breaks that matter, such as paragraph boundaries, and detecting these breaks is important.
| Character | Meaning |
| --- | --- |
| Unicode space character | (space_separator, line_separator, or paragraph_separator) |
| U+0009 | Horizontal tabulation |
| U+000A | Line feed |
| U+000B | Vertical tabulation |
| U+000C | Form feed |
| U+000D | Carriage return |
| U+001C | File separator |
| U+001D | Group separator |
| U+001E | Record separator |
| U+001F | Unit separator |
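The behavior of isWhitespace for the characters in the table can be checked directly. A minimal sketch (the no-break space comparison is an extra detail worth noting, drawn from the method's documented behavior):

```java
public class WhitespaceDemo {
    public static void main(String[] args) {
        // Characters from the table are all reported as whitespace
        System.out.println(Character.isWhitespace('\t'));     // U+0009 horizontal tabulation: true
        System.out.println(Character.isWhitespace('\n'));     // U+000A line feed: true
        System.out.println(Character.isWhitespace('\u001C')); // file separator: true

        // The no-break space U+00A0 is a space_separator, but it is
        // explicitly excluded by isWhitespace
        System.out.println(Character.isWhitespace('\u00A0')); // false
    }
}
```

This distinction matters when tokenizing text copied from web pages, where no-break spaces are common.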
The tokenization process is complicated by a large number of factors. One consideration is how to handle stopwords.
Removing stopwords can save space in an index and make the indexing process faster. However, some search engines do not remove stopwords because they can be useful for certain queries. For example, when performing an exact match, removing stopwords will result in misses. Also, the named-entity recognition (NER) task often depends on stopword inclusion. Recognizing that "Romeo and Juliet" is a play depends on the inclusion of the word "and".
There are many lists that define stopwords. Sometimes what constitutes a stopword depends on the problem domain. A list of stopwords can be found at http://www.ranks.nl/stopwords. It lists a few categories of English stopwords and stopwords for languages other than English. At http://www.textfixer.com/resources/common-english-words.txt, you will find a comma-separated list of English stopwords.
A list of the top ten stopwords, adapted from Stanford (http://library.stanford.edu/blogs/digital-library-blog/2011/12/stopwords-searchworks-be-or-not-be), is shown in the following table:
| Stopword | Occurrences |
| --- | --- |
| the | 7,578 |
| of | 6,582 |
| and | 4,106 |
| in | 2,298 |
| a | 1,137 |
| to | 1,033 |
| for | 695 |
| on | 685 |
| an | 289 |
| with | 231 |
We will focus on the techniques used to tokenize English text. This usually involves using whitespace or other delimiters to return a list of tokens.
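The simplest form of this, splitting on runs of whitespace with a regular expression, can be sketched as follows; note how punctuation stays attached to the adjacent word, one of the complications mentioned above:

```java
import java.util.Arrays;

public class SimpleTokenizer {
    public static void main(String[] args) {
        String text = "Let's pause, and then reflect.";
        // \s+ matches one or more whitespace characters; a different
        // delimiter set could be substituted in the regular expression
        String[] tokens = text.split("\\s+");
        System.out.println(Arrays.toString(tokens));
        // [Let's, pause,, and, then, reflect.]
    }
}
```

A whitespace-only tokenizer leaves "pause," and "reflect." as single tokens, which is often unacceptable for downstream tasks.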
The output of tokenization can be used for simple tasks such as spell checking and processing simple searches. It is also useful for various downstream NLP tasks such as part-of-speech (POS) identification, sentence detection, and classification. Most of the chapters that follow involve tasks that require tokenization.
Frequently, the tokenization process is just one step in a larger sequence of tasks. These steps involve the use of pipelines, as we will illustrate in Using a pipeline later in this chapter. This highlights the need for tokenizers that produce quality results for the downstream task. If the tokenizer does a poor job, then the downstream task will be adversely affected.
Many different tokenizers and tokenization techniques are available in Java. Several core Java classes were designed to support tokenization, though some of them are now outdated. There are also a number of NLP APIs designed to address both simple and complex tokenization problems. The next two sections will examine these approaches. First, we will see what the core Java classes have to offer, and then we will demonstrate a number of the NLP API tokenization libraries.
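As a preview of what the core classes offer, java.text.BreakIterator performs locale-sensitive word-boundary analysis. The sketch below is one way to use it; the words helper method and the letter-or-digit filter are our own choices, not a prescribed pattern:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BreakIteratorDemo {
    // Collects word tokens between locale-aware boundaries, keeping only
    // candidates that contain at least one letter or digit
    public static List<String> words(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator boundary = BreakIterator.getWordInstance(Locale.ENGLISH);
        boundary.setText(text);
        int start = boundary.first();
        for (int end = boundary.next(); end != BreakIterator.DONE;
                start = end, end = boundary.next()) {
            String candidate = text.substring(start, end);
            if (candidate.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(candidate);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(words("Mr. Smith went to Washington."));
    }
}
```

Unlike a plain whitespace split, BreakIterator separates punctuation from adjacent words, which the filter then discards.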