Text classification is a way to categorize documents or pieces of text. By examining the word usage in a piece of text, classifiers can decide what class label to assign to it. A binary classifier decides between two labels, such as positive or negative: the text gets one label or the other, but not both. A multi-label classifier, by contrast, can assign one or more labels to a piece of text.
Classification works by learning from labeled feature sets, or training data, to later classify an unlabeled feature set. A feature set is basically a key-value mapping of feature names to feature values. In the case of text classification, the feature names are usually words, and the values are all True. As the documents may have unknown words, and the number of possible words may be very large, words that don't occur in the text are omitted, instead of including them in a feature set with the value False.
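The feature set described above can be sketched in plain Python as a dictionary. This is a minimal illustration, not a definitive implementation: the helper name bag_of_words and the sample sentence are assumptions made for this example.

```python
def bag_of_words(words):
    """Map every word that occurs in the text to True.

    Words that do not occur are simply left out of the mapping,
    rather than being stored with the value False.
    """
    return {word: True for word in words}


# Illustrative input; any list of tokens would do.
features = bag_of_words("the quick brown fox".split())
print(features)
# {'the': True, 'quick': True, 'brown': True, 'fox': True}

# An absent word has no entry at all.
print("dog" in features)
# False
```

Omitting absent words keeps each feature set as small as the text it came from, no matter how large the overall vocabulary grows.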
An instance is a single feature set. It represents a single occurrence of a combination of features. We will use instance and feature set interchangeably. A labeled feature set is an instance with a known class label that we can use for training or evaluation.
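To make the distinction concrete, here is a small sketch of labeled and unlabeled instances. It assumes the common convention of pairing each feature set with its label in a tuple; the helper bag_of_words, the sample words, and the labels pos and neg are invented for illustration.

```python
def bag_of_words(words):
    """Build a feature set that maps each occurring word to True."""
    return {word: True for word in words}


# Labeled feature sets: (instance, class label) pairs for training or evaluation.
labeled_featuresets = [
    (bag_of_words(["great", "movie"]), "pos"),
    (bag_of_words(["terrible", "plot"]), "neg"),
]

# An unlabeled instance is just the feature set, with no label attached.
unlabeled = bag_of_words(["great", "plot"])

for features, label in labeled_featuresets:
    print(label, sorted(features))
```

A classifier would learn from the pairs in labeled_featuresets and then be asked to predict a label for an instance such as unlabeled.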