Categorizing newspaper articles and newswires into topics

Newspaper articles and newswires are a huge, regularly published source of information about events over time. Text classification is the preprocessing step that sorts these documents into a specific corpus, and text categorization is the basis of further text processing.

We will now introduce an N-gram-based text-classification algorithm. An N-gram is an N-character slice of a longer string. The key step of this algorithm is computing profiles of N-gram frequencies.
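To make the definition concrete, here is a minimal sketch of extracting character N-grams in R. The function name `char_ngrams` is a hypothetical helper introduced for illustration; it is not taken from the book's code bundle.

```r
# Hypothetical helper (not from the book's code bundle):
# return every N-character slice of a single string.
char_ngrams <- function(text, n) {
  len <- nchar(text)
  # A string shorter than n has no N-grams.
  if (len < n) return(character(0))
  starts <- seq_len(len - n + 1)
  # substring() is vectorized over the start and end positions.
  substring(text, starts, starts + n - 1)
}

char_ngrams("TEXT", 2)
# [1] "TE" "EX" "XT"
```

A string of length k yields k - N + 1 slices, so "TEXT" gives three bigrams.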

Before introducing the algorithm, here are illustrations of a couple of concepts it uses:

[Figure: Categorizing newspaper articles and newswires into topics]

The N-gram-based text categorization

The pseudocode for the N-gram-based text-categorization algorithm is summarized as follows:

[Figure: The N-gram-based text categorization]
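As a rough illustration of the categorization step, the sketch below compares a document profile against one profile per category and picks the closest. It assumes the "out-of-place" rank distance commonly used with N-gram profiles; the text here does not spell out the distance measure, so treat that choice, and the helper names `out_of_place` and `classify_profile`, as assumptions rather than the book's implementation.

```r
# Hypothetical helper: out-of-place distance between two ranked profiles.
# Each profile is a character vector of N-grams ordered from most to least
# frequent. The distance measure is an assumption, not taken from the source.
out_of_place <- function(doc_profile, cat_profile) {
  pos <- match(doc_profile, cat_profile)
  # An N-gram missing from the category profile gets a maximum penalty.
  max_penalty <- length(cat_profile)
  penalties <- ifelse(is.na(pos), max_penalty,
                      abs(pos - seq_along(doc_profile)))
  as.numeric(sum(penalties))
}

# Hypothetical helper: choose the category with the smallest distance.
# category_profiles is a named list of ranked profiles.
classify_profile <- function(doc_profile, category_profiles) {
  distances <- vapply(category_profiles, out_of_place, numeric(1),
                      doc_profile = doc_profile)
  names(distances)[which.min(distances)]
}

# Toy profiles, made up for illustration only.
categories <- list(sports = c("g", "o", "a", "l"),
                   finance = c("b", "a", "n", "k"))
classify_profile(c("g", "a", "o"), categories)
# [1] "sports"
```

In the toy example, the document profile differs from the sports profile by two rank positions in total, while two of its three N-grams are missing from the finance profile, so sports is selected.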

The N-gram frequency-generation algorithm is as follows:

[Figure: The N-gram-based text categorization]
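One possible way to generate an N-gram frequency profile in R is sketched below. The preprocessing choices (lower-casing, treating non-letters as separators, padding each word with "_", using N = 1 to 3, and keeping the `top_k` most frequent entries) are assumptions made for illustration, and `ngram_profile` is a hypothetical name; the book's own code may differ.

```r
# Hypothetical helper: build a ranked N-gram frequency profile for one text.
# Assumptions (not from the source): lower-casing, non-letters as separators,
# "_" padding around each word, N = 1..3, and keeping the top_k entries.
ngram_profile <- function(text, n_values = 1:3, top_k = 300) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]
  grams <- character(0)
  for (w in words) {
    padded <- paste0("_", w, "_")
    for (n in n_values) {
      if (nchar(padded) >= n) {
        starts <- seq_len(nchar(padded) - n + 1)
        grams <- c(grams, substring(padded, starts, starts + n - 1))
      }
    }
  }
  # Count each distinct N-gram and order from most to least frequent.
  counts <- sort(table(grams), decreasing = TRUE)
  counts[seq_len(min(top_k, length(counts)))]
}

profile <- ngram_profile("aaa", n_values = 1)
names(profile)
# [1] "a" "_"
```

The names of the returned table, in order, form the ranked profile that a distance-based comparison between documents and categories would consume.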

The R implementation

Please take a look at the R code file ch_10_ngram_classifier.R in the bundle of R code for the preceding algorithms. The code can be tested with the following command:

> source("ch_10_ngram_classifier.R")