Text classification

In text classification, or text categorization, the goal is to assign a text document, based on its content, to one or more classes or categories. These classes tend to be general subject areas, such as vehicles or pets. Such general classes are referred to as topics, and the task is then also called topic classification or topic spotting. While documents can be categorized according to other attributes such as document type, author, and publication year, the focus in this chapter will be on the document content only. Examples of text classification include the following:

Spam detection in email messages, user comments, web pages, and so on
Detection of sexually explicit content
Sentiment detection, which automatically classifies a product or service review as positive or negative
Email sorting according to content
Topic-specific search, where search engines restrict searches to a particular topic or genre, hence providing more accurate results

These examples show how important text classification is in information retrieval; most modern information retrieval systems use some kind of text classifier. The classification task that we will use as an example in this book is detecting email spam.
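To make the spam detection task concrete, the following is a minimal sketch of a text classifier in plain Java. It is an illustration only and does not use Mallet: it assumes a multinomial naive Bayes model with add-one smoothing and a simple lowercase, non-word-character tokenizer. The class name TinySpamClassifier, the labels "spam" and "ham", and the training sentences are all hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy multinomial naive Bayes text classifier; an illustrative sketch, not Mallet's API. */
class TinySpamClassifier {
    // Per-label word frequencies, per-label token totals, and per-label document counts.
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Assumed tokenizer: lowercase the text and split on runs of non-word characters.
    private static String[] tokenize(String text) {
        return text.toLowerCase().split("\\W+");
    }

    /** Adds one labeled document to the training counts. */
    public void train(String label, String text) {
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue; // split() can yield a leading empty string
            counts.merge(token, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** Returns the label with the highest log-posterior score, or null if untrained. */
    public String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            // Log prior: fraction of training documents carrying this label.
            double score = Math.log(docCounts.get(label) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int total = totalWords.getOrDefault(label, 0);
            for (String token : tokenize(text)) {
                if (token.isEmpty()) continue;
                int count = counts.getOrDefault(token, 0);
                // Add-one (Laplace) smoothing so unseen words do not zero out the score.
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TinySpamClassifier classifier = new TinySpamClassifier();
        classifier.train("spam", "win free money now");
        classifier.train("spam", "free prize claim now");
        classifier.train("ham", "meeting agenda for monday");
        classifier.train("ham", "lunch on monday with the team");
        System.out.println(classifier.classify("claim your free money"));
        System.out.println(classifier.classify("agenda for the team meeting"));
    }
}
```

With only four training sentences this is far too small to be useful in practice; it only shows the shape of the task, in which labeled documents go in and a predicted class comes out for unseen text.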

We will continue this chapter with an introduction to Mallet, a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text. We will then cover two text analytics applications: topic modeling, and spam detection as an instance of text classification.
