Chapter 2. Finding Parts of Text

Finding parts of text is concerned with breaking text down into individual units called tokens, and optionally performing additional processing on these tokens. This additional processing can include stemming, lemmatization, stopword removal, synonym expansion, and converting text to lowercase.

We will demonstrate several tokenization techniques found in the standard Java distribution. These are included because sometimes they are all you need to get the job done, and there may be no need to import an NLP library at all. However, these techniques are limited. We then discuss the specific tokenizers and tokenization approaches supported by the NLP APIs. These examples provide a reference for how the tokenizers are used and the type of output they produce, followed by a simple comparison of the differences between the approaches.
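
As a quick preview, the following minimal sketch uses only the JDK, with no NLP libraries, to break a sample sentence into tokens. The sample text and the whitespace delimiter are illustrative choices:

    import java.util.Arrays;
    import java.util.List;
    import java.util.Scanner;

    public class SimpleTokenization {
        public static void main(String[] args) {
            String text = "Let's pause, and then reflect.";

            // String.split: breaks the text wherever the regular expression matches
            List<String> splitTokens = Arrays.asList(text.split("\\s+"));
            System.out.println("split:   " + splitTokens);

            // java.util.Scanner: reads whitespace-delimited tokens one at a time
            try (Scanner scanner = new Scanner(text)) {
                while (scanner.hasNext()) {
                    System.out.println("scanner: " + scanner.next());
                }
            }
        }
    }

Both approaches leave punctuation attached to adjacent words, which is one of the limitations that the NLP tokenizers address.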

There are many specialized tokenizers. For example, the Apache Lucene project supports tokenizers for various languages and specialized documents: the WikipediaTokenizer class handles Wikipedia-specific documents, and the ArabicAnalyzer class handles Arabic text. It is not possible to illustrate all of these approaches here.
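
If you do need one of Lucene's tokenizers, the general usage pattern looks roughly like the following sketch. It uses the general-purpose StandardAnalyzer rather than one of the specialized classes mentioned above, and the exact constructor and package layout depend on the Lucene version on your classpath:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class LuceneTokenizationSketch {
        public static void main(String[] args) throws Exception {
            String text = "Lucene analyzers produce token streams.";
            try (Analyzer analyzer = new StandardAnalyzer()) {
                // The analyzer turns the text into a stream of tokens
                TokenStream stream = analyzer.tokenStream("content", new StringReader(text));
                CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
                stream.reset();
                while (stream.incrementToken()) {
                    System.out.println(term.toString());
                }
                stream.end();
                stream.close();
            }
        }
    }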

We will also examine how certain tokenizers can be trained to handle specialized text. This can be useful when a different form of text is encountered. It can often eliminate the need to write a new and specialized tokenizer.
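
As a sketch of what using such a tokenizer looks like, the following code loads a trained OpenNLP tokenizer model and applies it to a sentence. It assumes OpenNLP is on the classpath, and the model file name en-token.bin is a placeholder for whichever pretrained or custom-trained model you supply:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import opennlp.tools.tokenize.Tokenizer;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;

    public class TrainedTokenizerSketch {
        public static void main(String[] args) throws Exception {
            // The model file name is a placeholder; substitute the pretrained or
            // custom-trained model appropriate for your text.
            try (InputStream modelIn = new FileInputStream("en-token.bin")) {
                TokenizerModel model = new TokenizerModel(modelIn);
                Tokenizer tokenizer = new TokenizerME(model);
                String[] tokens = tokenizer.tokenize("Mr. Smith went to Washington.");
                for (String token : tokens) {
                    System.out.println(token);
                }
            }
        }
    }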

Next, we will illustrate how some of these tokenizers can be used to support specific operations such as stemming, lemmatization, and stopword removal. Parts of speech (POS) can also be considered a special instance of parts of text; however, that topic is investigated in Chapter 5, Detecting Parts of Speech.
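
Of these operations, stopword removal is the simplest to picture: once the text has been tokenized, tokens that appear on a stopword list are discarded. The following sketch uses plain Java and a tiny illustrative stopword list; real applications use much larger lists:

    import java.util.Arrays;
    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;

    public class StopwordRemovalSketch {
        // A tiny illustrative stopword list; real lists are much larger
        private static final Set<String> STOPWORDS = Set.of("a", "an", "the", "and", "of", "to");

        public static void main(String[] args) {
            String text = "The quick brown fox jumped over the lazy dog";
            List<String> tokens = Arrays.asList(text.toLowerCase().split("\\s+"));
            List<String> filtered = tokens.stream()
                    .filter(token -> !STOPWORDS.contains(token))
                    .collect(Collectors.toList());
            System.out.println(filtered);
        }
    }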

Understanding the parts of text

There are a number of ways of categorizing parts of text. For example, we may be concerned with character-level issues such as punctuation, with a possible need to ignore or expand contractions. At the word level, we may need to perform different operations such as:

  • Identifying morphemes using stemming and/or lemmatization
  • Expanding abbreviations and acronyms
  • Isolating number units

We cannot always split words at punctuation because punctuation is sometimes considered part of the word, as in the contraction "can't". We may also be concerned with grouping multiple words to form meaningful phrases. Sentence detection can also be a factor: we do not necessarily want to group words that cross sentence boundaries.
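
To see how punctuation inside a word complicates naive splitting, the following sketch contrasts a simple regular-expression split with the JDK's BreakIterator, whose word boundary rules generally keep contractions such as "can't" intact. The sample sentence is illustrative:

    import java.text.BreakIterator;
    import java.util.Arrays;

    public class ContractionSplitSketch {
        public static void main(String[] args) {
            String text = "She can't leave now.";

            // Naive approach: splitting on any non-letter character breaks "can't" apart
            System.out.println(Arrays.toString(text.split("[^a-zA-Z]+")));

            // BreakIterator applies word boundary rules instead of a fixed delimiter
            BreakIterator iterator = BreakIterator.getWordInstance();
            iterator.setText(text);
            int start = iterator.first();
            for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
                String candidate = text.substring(start, end);
                // Keep only segments that contain at least one letter or digit
                if (candidate.codePoints().anyMatch(Character::isLetterOrDigit)) {
                    System.out.println(candidate);
                }
            }
        }
    }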

In this chapter, we are primarily concerned with the tokenization process and a few specialized techniques such as stemming. We will not attempt to show how they are used in other NLP tasks. Those efforts are reserved for later chapters.
