Categorizing newspaper articles and newswires into topics

Newspaper articles and newswires are a large, periodical source of knowledge about events at different points in time. Classifying these texts is a preprocessing step for storing the documents in a specific corpus, and text categorization is the basis of text processing.

We will now introduce an N-gram-based text-classification algorithm. An N-gram is an N-character slice of a longer string. The key point of this algorithm is the calculation of profiles of N-gram frequencies.
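
To make the idea of an N-gram and its frequency profile concrete, here is a minimal R sketch. The helper names `char_ngrams` and `ngram_profile` are hypothetical and are not taken from the book's code bundle; lowercasing the text first is also an assumption.

```r
# Hypothetical helper: slice a string into overlapping character N-grams.
char_ngrams <- function(text, n) {
  text <- tolower(text)                 # assumption: case is ignored
  len <- nchar(text)
  if (len < n) return(character(0))     # too short to contain any N-gram
  starts <- seq_len(len - n + 1)
  substring(text, starts, starts + n - 1)
}

# Hypothetical helper: count each N-gram and sort by descending frequency.
ngram_profile <- function(text, n) {
  counts <- table(char_ngrams(text, n))
  sort(counts, decreasing = TRUE)
}

grams <- char_ngrams("TEXT", 2)         # the 2-grams "te", "ex", "xt"
profile <- ngram_profile("banana", 2)   # "an" and "na" occur twice, "ba" once
```

Whether to pad the string with blanks or to restrict slices to word boundaries is a design choice that this sketch leaves out.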

Before introducing the algorithm, here are a couple of concepts that it relies on:

The N-gram-based text categorization

The summarized pseudocode for the N-gram-based text-categorization algorithm is as follows:

The N-gram frequency-generation algorithm is as follows:
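
As a rough illustration of both steps in R, the sketch below builds ranked N-gram profiles and assigns a document to the nearest category. It assumes the commonly used rank-based "out-of-place" distance between profiles, which may differ from the book's pseudocode. The function names (`rank_profile`, `out_of_place`, `classify`), the defaults `n = 3` and `top = 300`, and the toy training texts are all hypothetical.

```r
# Ranked profile: the `top` most frequent character N-grams of `text`,
# most frequent first (assumption: case is ignored).
rank_profile <- function(text, n = 3, top = 300) {
  text <- tolower(text)
  len <- nchar(text)
  if (len < n) return(character(0))
  starts <- seq_len(len - n + 1)
  grams <- substring(text, starts, starts + n - 1)
  counts <- sort(table(grams), decreasing = TRUE)
  names(counts)[seq_len(min(top, length(counts)))]
}

# Out-of-place distance: sum of rank differences between the two profiles.
# N-grams missing from the category profile get a fixed maximum penalty.
out_of_place <- function(doc_profile, cat_profile) {
  max_penalty <- length(cat_profile)
  pos <- match(doc_profile, cat_profile)
  diffs <- abs(pos - seq_along(doc_profile))
  diffs[is.na(diffs)] <- max_penalty
  as.numeric(sum(diffs))
}

# Assign the document to the category whose profile is closest.
classify <- function(doc, category_texts, n = 3, top = 300) {
  doc_profile <- rank_profile(doc, n, top)
  distances <- vapply(category_texts, function(txt) {
    out_of_place(doc_profile, rank_profile(txt, n, top))
  }, numeric(1))
  names(distances)[which.min(distances)]
}

# Toy training data, invented purely for illustration.
training <- c(
  sports  = "the team won the match after scoring a late goal in the game",
  finance = "the bank raised interest rates as markets and stocks fell sharply"
)
label <- classify("stocks and markets fell as the bank raised rates", training)
```

In practice each category profile would be built from many documents rather than a single sentence, and the profile length `top` would be tuned to the corpus.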

The R implementation

Please take a look at the R code file ch_10_ngram_classifier.R from the bundle of R code for the preceding algorithms. The code can be tested with the following command:

> source("ch_10_ngram_classifier.R")
