Chapter 13. Text Analytics with R

In the previous chapter, we examined how to deal with nested data using multilevel analyses. In Chapter 11, Classification Trees, we discovered how to classify data using decision trees. Here, we will deal with textual data. This chapter will cover the following topics:

  • A brief introduction to text analytics
  • How to load and preprocess text
  • How to perform document classification
  • How to perform basic topic modeling to extract meaning
  • How to download news articles using R

An introduction to text analytics

It might come as a surprise, or not, but textual data represents the greatest part of the overall data accessible to companies and data analysts. Textual data is often available only in unstructured form. Imagine, for instance, an e-mail, a company memo, or a post on a blog. What they have in common is that the text is mostly presented as words arranged in sentences, which are in turn arranged in paragraphs. More complex documents are also composed of sub-sections, sections, and chapters. Humans derive meaning from this basic structure and the relationships between these elements. But for machines to classify documents and extract meaning, text preprocessing is required.

There are several usual steps in the preprocessing of textual documents for classification. These include:

  1. Importing the corpus.
  2. Converting text to lowercase, so that, in the analyses, words that include capital letters are not distinguished from words that do not. For instance, the following words are the same after converting to lowercase:
    • Documents
    • DOCUMENTS
    • documents
  3. Removing punctuation so that words followed or preceded by punctuation marks are not treated differently from words that are not. For instance, the following words are the same after removing punctuation:
    • documents.
    • documents:
    • documents
  4. Removing numbers contained in the text. Text often includes numbers, which can interfere with the analyses because they are treated as words. It is, therefore, often useful to discard them.
  5. Stop word filtering, which is the suppression of uninformative words (contained in most documents), such as:
    • it
    • me
    • where
    • some
  6. Removing extra whitespace occurring in the original text or resulting from the previous operations.
  7. Performing stemming. Textual analysis often requires replacing words with their stem. The following words are the same after stemming:
    • documentation
    • documented
    • documents
    • document
  8. Performing other necessary transformations, which we will examine in the next section. It is important to note that the task of the analyst is to determine which of the preceding steps are necessary for a particular analysis. Additional steps include:
    • Tokenizing and building the term-document matrix. In this step, each unit (usually a single word, but it can also be an n-gram, that is, two or more contiguous words) is assigned to a column. Documents are presented in rows. The cells (the intersections of rows and columns) represent either the presence/absence of each word in each document, the count of each word in each document, or the term frequency–inverse document frequency (the tf–idf measure). This measure takes into account the rarity of the words in the whole corpus (scarcer words are more informative than common words). We discussed how to compute the measure in Chapter 4, Cluster Analysis.
    • Pruning rare tokens, which is the suppression of words that occur infrequently in the corpus.
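
As a rough illustration, the steps listed above can be sketched in base R alone. The three documents, the short stop word list, and the names preprocess, counts, tfidf, and pruned are all made up for this example. Stemming is left out because it needs an add-on package (for instance, SnowballC), and the tf–idf variant shown (relative term frequency multiplied by log(N/df)) is one common definition that may differ from the one used in Chapter 4.

```r
# Three toy documents (invented for illustration)
docs <- c(
  "Documents, DOCUMENTS and 2 more documents.",
  "Where is the memo? It is in some blog post.",
  "The memo cites 3 documents: see the blog post."
)

# A tiny hand-picked stop word list; real analyses use a much longer one
stop_words <- c("it", "me", "where", "some", "is", "the",
                "in", "and", "more", "see")

preprocess <- function(x, stop_words) {
  x <- tolower(x)                                # lowercase
  x <- gsub("[[:punct:]]+", " ", x)              # punctuation -> space
  x <- gsub("[[:digit:]]+", " ", x)              # numbers -> space
  tokens <- strsplit(trimws(x), "[[:space:]]+")  # split on runs of whitespace
  # drop stop words and any empty strings
  lapply(tokens, function(tk) tk[!(tk %in% stop_words) & nzchar(tk)])
}

tokens <- preprocess(docs, stop_words)

# Term-document matrix of counts: documents in rows, terms in columns
vocab  <- sort(unique(unlist(tokens)))
counts <- t(vapply(
  tokens,
  function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
dimnames(counts) <- list(paste0("doc", seq_along(docs)), vocab)

# Presence/absence version of the same matrix
presence <- (counts > 0) * 1L

# One common tf-idf variant (an assumption, see the text above)
tf    <- counts / rowSums(counts)     # relative frequency within each document
df    <- colSums(counts > 0)          # number of documents containing each term
idf   <- log(nrow(counts) / df)
tfidf <- sweep(tf, 2, idf, `*`)

# Pruning rare tokens: keep terms that occur in at least two documents
pruned <- counts[, df >= 2, drop = FALSE]

print(counts)
print(round(tfidf, 3))
```

Printing counts shows one row per document and one column per remaining term. In this sketch, a document left with no tokens after filtering would produce NaN values in tf, so a real pipeline would need to handle that case.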

Once these steps are performed, it is possible to analyze term associations and perform document classification on the basis of the term-document matrix in an efficient way, using algorithms we have already discovered and others we will discover here.
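
One simple way to look at term associations, shown here only as a sketch, is to correlate the columns of a term-document matrix: terms that tend to occur in the same documents get a high positive correlation. The small matrix of counts and the term names below are hypothetical, and correlation is just one possible association measure.

```r
# Hypothetical counts: 4 documents (rows) by 4 terms (columns)
counts <- matrix(
  c(2, 1, 0, 0,
    3, 2, 0, 1,
    0, 0, 2, 1,
    0, 1, 3, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4),
                  c("memo", "report", "blog", "post"))
)

# Pearson correlation between every pair of term columns
term_cor <- cor(counts)

# Terms associated with "memo", strongest first (excluding "memo" itself)
assoc <- sort(term_cor["memo", colnames(term_cor) != "memo"],
              decreasing = TRUE)
print(round(assoc, 2))
```

With these made-up counts, "report" co-occurs with "memo" and so ranks first, while "blog" appears only in documents without "memo" and ends up with a negative correlation.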

In the following sections, we will examine how to perform text analytics with R, focusing on classification and the extraction of meaning.
