Clustering newsgroups data using k-means

Up to this point, you should be very familiar with k-means clustering. Let's see what we can mine from the newsgroups dataset using this algorithm. Here, we use all the data from four categories as an example.

We first load the data from those newsgroups and preprocess it as we did in Chapter 2, Mining the 20 Newsgroups Dataset with Clustering and Topic Modeling Algorithms:

>>> from sklearn.datasets import fetch_20newsgroups
>>> categories = [
...     'alt.atheism',
...     'talk.religion.misc',
...     'comp.graphics',
...     'sci.space',
... ]
>>> groups = fetch_20newsgroups(subset='all',
...                             categories=categories)
>>> labels = groups.target
>>> label_names = groups.target_names
>>> def is_letter_only(word):
...     for char in word:
...         if not char.isalpha():
...             return False
...     return True
>>> from nltk.corpus import names
>>> all_names = set(names.words())
>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> data_cleaned = []
>>> for doc in groups.data:
...     doc = doc.lower()
...     doc_cleaned = ' '.join(lemmatizer.lemmatize(word)
...                            for word in doc.split()
...                            if is_letter_only(word)
...                            and word not in all_names)
...     data_cleaned.append(doc_cleaned)

We then convert the cleaned text data into count vectors using CountVectorizer of scikit-learn:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> count_vector = CountVectorizer(stop_words="english",
...                                max_features=None, max_df=0.5, min_df=2)
>>> data = count_vector.fit_transform(data_cleaned)

Note that the vectorizer we use here does not limit the number of features (word tokens), but sets the minimum and maximum document frequency, which are 2 documents and 50% of the dataset, respectively. The document frequency of a word is measured by the fraction of documents (samples) in the dataset that contain that word.
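To see how these two thresholds prune the vocabulary, here is a minimal sketch on a made-up toy corpus (the four short documents and the variable names are purely illustrative):

>>> # 'space' occurs in 3 of the 4 toy documents (75% > max_df of 50%),
>>> # so it is dropped; words occurring in only one document fail min_df=2
>>> toy_docs = ['space shuttle launch', 'space mission',
...             'graphics file format', 'space graphics']
>>> toy_vector = CountVectorizer(max_df=0.5, min_df=2)
>>> toy_vector.fit_transform(toy_docs)
>>> print(toy_vector.get_feature_names())
['graphics']

Only graphics survives, since it is the only token that appears in at least two documents but in no more than half of them. (In recent versions of scikit-learn, get_feature_names() has been replaced by get_feature_names_out().)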

With the input data ready, we now try to cluster the documents into four groups as follows:

>>> from sklearn.cluster import KMeans
>>> k = 4
>>> kmeans = KMeans(n_clusters=k, random_state=42)
>>> kmeans.fit(data)

Let's do a quick check on the sizes of the resulting clusters:

>>> clusters = kmeans.labels_
>>> from collections import Counter
>>> print(Counter(clusters))
Counter({3: 3360, 0: 17, 1: 7, 2: 3})

The clusters don't look right, with most samples (3,360 samples) lumped into one big cluster (cluster 3). What could have gone wrong? It turns out that our count-based features are not sufficiently representative. A better numerical representation for text data is the term frequency-inverse document frequency (tf-idf). Instead of simply using the token count, or the so-called term frequency (tf), it assigns each term frequency a weighting factor that is inversely proportional to the document frequency. In practice, the idf factor of a term t in a set of documents D is calculated as follows:

idf(t, D) = log(nD / (1 + nt))

Here, nD is the total number of documents, nt is the number of documents containing the term t, and the 1 is added to avoid division by zero.

With the idf factor incorporated, the tf-idf representation diminishes the weight of common terms (such as get and make) that occur frequently, and emphasizes terms that rarely occur but convey important meaning.
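As a quick sanity check of the formula, here is a minimal sketch that computes the idf factor by hand on a tiny made-up corpus (the three mini-documents are purely illustrative); note that scikit-learn's TfidfVectorizer, used next, applies a smoothed variant of idf, so its exact values differ slightly:

>>> import numpy as np
>>> docs = [['space', 'launch'], ['space', 'graphics'],
...         ['god', 'graphics']]
>>> n_docs = len(docs)
>>> def idf(term):
...     # count documents containing the term; 1 is added in the
...     # denominator to avoid division by zero
...     n_term = sum(term in doc for doc in docs)
...     return np.log(n_docs / (1 + n_term))
>>> print('idf(space) = {:.2f}'.format(idf('space')))
idf(space) = 0.00
>>> print('idf(god) = {:.2f}'.format(idf('god')))
idf(god) = 0.41

The word space, which occurs in two of the three documents, receives a smaller factor than god, which occurs in only one, so rare but meaningful terms get amplified once term frequencies are multiplied by their idf.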

To use the tf-idf representation, we just need to replace CountVectorizer with TfidfVectorizer from scikit-learn as follows:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> tfidf_vector = TfidfVectorizer(stop_words='english',
...                                max_features=None, max_df=0.5, min_df=2)

Now, redo feature extraction using the tf-idf vectorizer and the k-means clustering algorithm on the resulting feature space:

>>> data = tfidf_vector.fit_transform(data_cleaned)
>>> kmeans.fit(data)
>>> clusters = kmeans.labels_
>>> print(Counter(clusters))
Counter({1: 1560, 2: 686, 3: 646, 0: 495})

The clustering result becomes more reasonable.

We can also take a closer look at the clusters by examining what they contain and the top 10 terms (the terms with the 10 highest tf-idf values in the cluster centroid) representing each cluster:

>>> import numpy as np
>>> cluster_label = {i: labels[np.where(clusters == i)] for i in
...                  range(k)}
>>> terms = tfidf_vector.get_feature_names()
>>> centroids = kmeans.cluster_centers_
>>> for cluster, index_list in cluster_label.items():
...     counter = Counter(cluster_label[cluster])
...     print('cluster_{}: {} samples'.format(cluster, len(index_list)))
...     for label_index, count in sorted(counter.items(),
...                                      key=lambda x: x[1], reverse=True):
...         print('{}: {} samples'.format(label_names[label_index], count))
...     print('Top 10 terms:')
...     for ind in centroids[cluster].argsort()[-10:]:
...         print(' %s' % terms[ind], end="")
...     print()

cluster_0: 495 samples
sci.space: 494 samples
comp.graphics: 1 samples
Top 10 terms:
toronto moon zoology nasa hst mission wa launch shuttle space
cluster_1: 1560 samples
sci.space: 459 samples
alt.atheism: 430 samples
talk.religion.misc: 352 samples
comp.graphics: 319 samples
Top 10 terms:
people new think know like ha just university article wa
cluster_2: 686 samples
comp.graphics: 651 samples
sci.space: 32 samples
alt.atheism: 2 samples
talk.religion.misc: 1 samples
Top 10 terms:
know thanks need format looking university program file graphic image
cluster_3: 646 samples
alt.atheism: 367 samples
talk.religion.misc: 275 samples
sci.space: 2 samples
comp.graphics: 2 samples
Top 10 terms:
moral article morality think jesus people say christian wa god

From what we observe in the preceding results:

  • cluster_0 is obviously about space and includes almost all sci.space samples and related terms such as moon, nasa, launch, shuttle, and space.
  • cluster_1 is more of a generic topic.
  • cluster_2 is more about computer graphics and related terms, such as format, program, file, graphic, and image.
  • cluster_3 is an interesting one, which successfully brings together two overlapping topics, atheism and religion, with key terms including moral, morality, jesus, christian, and god.

Feel free to try different values of k, or use the Elbow method to find the optimal one (this is actually an exercise for this chapter).
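For the Elbow method, a minimal sketch (the candidate range of k from 1 to 8 is an arbitrary choice here) is to fit k-means for each candidate k on the tf-idf features and record the within-cluster sum of squared distances, which scikit-learn exposes as inertia_:

>>> sse = []
>>> k_candidates = range(1, 9)
>>> for k_value in k_candidates:
...     km = KMeans(n_clusters=k_value, random_state=42)
...     km.fit(data)
...     sse.append(km.inertia_)
>>> for k_value, score in zip(k_candidates, sse):
...     print('k={}: SSE={:.1f}'.format(k_value, score))

Plotting SSE against k and picking the point where the curve starts to flatten gives you the elbow.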

It is quite interesting to find the key terms for each text group via clustering. Topic modeling is another approach to doing so, but in a much more direct way: it does not simply search for key terms in clusters generated beforehand, but directly extracts collections of key terms from the documents. You will see how this works in the next section.
