Chapter 7. Text Classification

In this chapter, we will cover:

  • Bag of Words feature extraction
  • Training a naive Bayes classifier
  • Training a decision tree classifier
  • Training a maximum entropy classifier
  • Measuring precision and recall of a classifier
  • Calculating high information words
  • Combining classifiers with voting
  • Classifying with multiple binary classifiers

Introduction

Text classification is a way to categorize documents or pieces of text. By examining the word usage in a piece of text, classifiers can decide what class label to assign to it. A binary classifier decides between two labels, such as positive or negative; the text can have one label or the other, but not both. A multi-label classifier, on the other hand, can assign one or more labels to a piece of text.

Classification works by learning from labeled feature sets, or training data, to later classify an unlabeled feature set. A feature set is basically a key-value mapping of feature names to feature values. In the case of text classification, the feature names are usually words, and the values are all True. Because documents may contain unknown words, and the number of possible words can be very large, words that don't occur in a given text are omitted from its feature set, rather than being included with the value False.
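To make this concrete, here is a minimal sketch of such a feature set, assuming the text has already been split into word tokens (the bag_of_words helper name is illustrative, not a specific library function):

    from nltk.tokenize import word_tokenize

    def bag_of_words(words):
        # Map each word to True; absent words are simply left out of the
        # dict rather than being included with the value False.
        return dict((word, True) for word in words)

    feats = bag_of_words(word_tokenize("the quick brown fox"))
    # {'the': True, 'quick': True, 'brown': True, 'fox': True}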

An instance is a single feature set. It represents a single occurrence of a combination of features. We will use instance and feature set interchangeably. A labeled feature set is an instance with a known class label that we can use for training or evaluation.
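For example, pairing a feature set with a known label gives a labeled instance, and a list of such pairs can serve as training data. The sketch below uses the bag_of_words helper from above and toy, made-up labels purely for illustration:

    import nltk

    # Toy training data: (feature set, label) pairs, i.e. labeled instances.
    train_feats = [
        (bag_of_words(['great', 'fun', 'film']), 'pos'),
        (bag_of_words(['boring', 'dull', 'plot']), 'neg'),
    ]

    classifier = nltk.NaiveBayesClassifier.train(train_feats)

    # Classifying an unlabeled feature set returns a predicted label.
    print(classifier.classify(bag_of_words(['fun', 'film'])))

Training classifiers like this one is covered in the recipes that follow.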
