Counting the occurrence of each word token

We are often interested only in the occurrence of certain words, their counts, or a related measure, and not in the order of the words. We can therefore view a text as a collection of words. This is called the Bag of Words (BoW) model. Although it is a very basic model, it works pretty well in practice. We could define a more complex model that takes into account the order of words and PoS (part-of-speech) tags, but such a model would be more computationally expensive and more difficult to program, and in most cases the basic BoW model suffices. Doubtful? We can give it a shot and see whether the BoW model makes sense.
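To make the idea concrete, here is a minimal sketch (the sentence is made up purely for illustration) of reducing a text to a bag of token counts with Python's built-in Counter; word order is discarded and only the counts survive:

>>> from collections import Counter
>>> doc = 'the quick brown fox jumps over the lazy dog'
>>> Counter(doc.split())
Counter({'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'jumps': 1, 'over': 1, 'lazy': 1, 'dog': 1})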

We start by converting documents into a matrix where each row represents a newsgroup document and each column represents a word token, or specifically, a unigram to begin with. The value of each element in the matrix is the number of times the word (column) occurs in the document (row). We use the CountVectorizer class from scikit-learn to do the work:

>>> from sklearn.feature_extraction.text import CountVectorizer

The important parameters and options of the CountVectorizer class are summarized in the following table:

Constructor parameter   Default value   Example values         Description
ngram_range             (1, 1)          (1, 2), (2, 2)         Lower and upper bound of the n-grams to be extracted
stop_words              None            'english', or a list   Which stop word list to use, if any
lowercase               True            True, False            Whether to convert all characters to lowercase
max_features            None            None, 200, 500         If not None, consider only the most frequent tokens
binary                  False           True, False            If True, all non-zero counts are set to 1
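As a quick illustration of two of these options, here is a small sketch on a made-up two-document corpus, showing how ngram_range and stop_words change the extracted vocabulary:

>>> docs = ['machine learning is fun', 'learning python is fun too']
>>> cv = CountVectorizer(ngram_range=(1, 2), stop_words='english').fit(docs)
>>> print(cv.get_feature_names())
['fun', 'learning', 'learning fun', 'learning python', 'machine', 'machine learning', 'python', 'python fun']

The stop words is and too are dropped first, and bigrams are then built from the remaining tokens.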

We first initialize the count vectorizer with the 500 top features (the 500 most frequent tokens):

>>> count_vector = CountVectorizer(max_features=500)

We then fit it on the raw text data, which learns the vocabulary and transforms the documents into count vectors in one step:

>>> data_count = count_vector.fit_transform(groups.data)

Now the count vectorizer captures the top 500 features and generates a token count matrix out of the original text input:

>>> data_count
<11314x500 sparse matrix of type '<class 'numpy.int64'>'
with 798221 stored elements in Compressed Sparse Row format>
>>> data_count[0]
<1x500 sparse matrix of type '<class 'numpy.int64'>'
with 53 stored elements in Compressed Sparse Row format>

The resulting count matrix is a sparse matrix that stores only the non-zero elements (hence, only 798,221 stored elements instead of 11314 * 500 = 5,657,000). For example, the first document is converted into a sparse vector composed of 53 non-zero elements. If you are interested in seeing the whole matrix, feel free to run the following:
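You can verify these numbers directly through the standard attributes of the scipy sparse matrix:

>>> data_count.nnz        # number of stored (non-zero) elements
798221
>>> data_count.shape
(11314, 500)

So only about 14% of the matrix entries are non-zero, which is why the sparse representation saves so much memory.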

>>> data_count.toarray()

If you just want the first row, run the following:

>>> data_count.toarray()[0]

The preceding command yields a dense array of 500 integer counts for the first document, most of them zeros, since only 53 of its 500 features are non-zero.
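If you also want to know which of the 500 tokens occur in the first document and how often, you can pair the non-zero counts with the feature names. A minimal sketch (the exact tokens printed depend on the dataset):

>>> import numpy as np
>>> row = data_count.toarray()[0]
>>> names = count_vector.get_feature_names()
>>> for i in np.nonzero(row)[0][:5]:   # look at the first 5 non-zero tokens
...     print(names[i], row[i])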

So what are those 500 top features? They can be found in the following output:

>>> print(count_vector.get_feature_names())
['00', '000', '10', '100', '11', '12', '13', '14', '145', '15', '16', '17', '18', '19', '1993', '20', '21', '22', '23', '24', '25', '26', '27', '30', '32', '34', '40', '50', '93', 'a86', 'able', 'about', 'above', 'ac', 'access', 'actually', 'address', 'after', 'again', 'against', 'ago', 'all', 'already', 'also', 'always', 'am', 'american', 'an', 'and', 'andrew', 'another', 'answer', 'any', 'anyone', 'anything', 'apple', 'apr', 'april', 'are', 'armenian', 'around', 'article', 'as', 'ask', 'at', 'au', 'available', 'away', 'ax', 'b8f', 'back', 'bad', 'based', 'be', 'because', 'been',
...
, 'that', 'the', 'their', 'them', 'then', 'there', 'these', 'they', 'thing', 'things', 'think', 'this', 'those', 'though', 'thought', 'three', 'through', 'time', 'times', 'to', 'today', 'told', 'too', 'true', 'try', 'trying', 'turkish', 'two', 'type', 'uiuc', 'uk', 'under', 'university', 'unix', 'until', 'up', 'us', 'usa', 'use', 'used', 'using', 'uucp', 've', 'version', 'very', 'vs', 'want', 'war', 'was', 'washington', 'way', 'we', 'well', 'were', 'what', 'when', 'where', 'whether', 'which', 'while', 'who', 'whole', 'why', 'will', 'win', 'window', 'windows', 'with', 'without', 'won', 'word', 'work', 'works', 'world', 'would', 'writes', 'wrong', 'wrote', 'year', 'years', 'yes', 'yet', 'you', 'your']

Our first trial doesn't look perfect. Obviously, the most popular tokens are numbers, or letters combined with numbers such as a86, which convey no important information. Moreover, many of the tokens have no discriminative meaning on their own, such as you, the, them, and then. Also, some words convey essentially the same information, for example, tell and told, use and used, and time and times. Let's tackle these issues.
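As a first pass, here is a minimal sketch: filtering the built-in English stop word list and keeping only purely alphabetic tokens of at least two letters. The token_pattern regex and the variable names here are our own choices, one of several reasonable options; collapsing related forms such as use and used onto a single token additionally requires stemming or lemmatization on top of this:

>>> count_vector_cleaned = CountVectorizer(stop_words='english',
...                                        token_pattern=r'[a-zA-Z]{2,}',
...                                        max_features=500)
>>> data_cleaned = count_vector_cleaned.fit_transform(groups.data)

With digits and stop words gone, the 500 vocabulary slots are freed up for more informative tokens.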
