Solving our initial challenge

We now put everything together and demonstrate our system for the following new post that we assign to the variable new_post:

Disk drive problems. Hi, I have a problem with my hard disk.

After 1 year it is working only sporadically now.

I tried to format it, but now it doesn't boot any more.

Any ideas? Thanks.

As we have learned previously, we will first have to vectorize this post before we predict its label as follows:

>>> new_post_vec = vectorizer.transform([new_post])
>>> new_post_label = km.predict(new_post_vec)[0]
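
To make the two steps concrete outside of our full pipeline, here is a minimal, self-contained sketch. The toy corpus and all toy_* names are hypothetical, and a plain TfidfVectorizer stands in for the stemming vectorizer we built earlier, so the cluster numbers will not match those of the real dataset:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): two hardware posts and two graphics posts.
toy_posts = [
    "my hard disk does not boot after format",
    "hard disk drive fails to boot",
    "rendering 3d graphics with polygons",
    "3d graphics polygons and shading",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_vectorized = toy_vectorizer.fit_transform(toy_posts)

# n_init and random_state are fixed only to make the sketch repeatable.
toy_km = KMeans(n_clusters=2, n_init=10, random_state=3)
toy_km.fit(toy_vectorized)

# transform() reuses the vocabulary learned by fit_transform();
# it expects a list of documents, hence the brackets.
toy_new_post = "disk drive problems, my hard disk does not boot"
toy_new_vec = toy_vectorizer.transform([toy_new_post])

# predict() returns one label per input row; we passed a single row.
toy_new_label = toy_km.predict(toy_new_vec)[0]
print(toy_new_label, toy_km.labels_)
```

Note that the new post is only transformed, never fitted: words that the vectorizer has not seen during fitting are silently ignored.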

Now that we have the clustering, we do not need to compare new_post_vec to all post vectors. Instead, we can focus only on the posts of the same cluster. Let us fetch their indices in the original dataset:

>>> similar_indices = (km.labels_==new_post_label).nonzero()[0]

The comparison in the bracket results in a Boolean array, and nonzero converts that array into a smaller array containing the indices of the True elements.
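
The following small sketch shows the same idiom on a hand-made label array. The values in toy_labels are hypothetical and merely stand in for km.labels_:

```python
import numpy as np

# Hypothetical cluster labels for six posts, as KMeans would store them in labels_.
toy_labels = np.array([2, 0, 2, 1, 0, 2])
toy_label_of_new_post = 2

# Element-wise comparison: a Boolean array with one entry per post.
mask = toy_labels == toy_label_of_new_post

# nonzero() returns a tuple with one index array per dimension;
# for a 1-D array we take the first (and only) element.
toy_indices = mask.nonzero()[0]

print(toy_indices)  # [0 2 5]
```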

Using similar_indices, we then simply have to build a list of posts together with their similarity scores as follows:

>>> similar = []
>>> for i in similar_indices:
...    dist = sp.linalg.norm((new_post_vec - vectorized[i]).toarray())
...    similar.append((dist, dataset.data[i]))
>>> similar = sorted(similar)
>>> print(len(similar))
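
The loop above can be tried out in isolation with a few hand-made vectors. This is only a sketch under stated assumptions: the three-dimensional toy vectors and post texts are hypothetical, and np.linalg.norm is used in place of sp.linalg.norm:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical stand-ins for the vectorized posts and the new post.
toy_vectorized = csr_matrix(np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]))
toy_texts = ["post a", "post b", "post c"]
toy_new_vec = csr_matrix(np.array([[1.0, 0.0, 0.0]]))

toy_similar = []
for i in range(toy_vectorized.shape[0]):
    # The difference of two sparse rows is sparse again;
    # we convert it to a dense array before taking the norm.
    diff = (toy_new_vec - toy_vectorized[i]).toarray()
    toy_similar.append((np.linalg.norm(diff), toy_texts[i]))

# Tuples are compared element by element, so sorting orders by distance first.
toy_similar.sort()
print([text for _, text in toy_similar])  # ['post a', 'post c', 'post b']
```

Because ties in the distance fall back to comparing the second tuple element, the second element should be something sortable, such as the post text.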

The cluster of our post contains 44 posts. To give the user a quick idea of what kind of similar posts are available, we can now present the most similar post (show_at_1), an in-between post (show_at_2), and the least similar one (show_at_3), all taken from the same cluster:

>>> show_at_1 = similar[0]
>>> show_at_2 = similar[len(similar)//2]
>>> show_at_3 = similar[-1]
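
Picking the three entries only relies on the list being sorted by distance. A minimal sketch with hypothetical (distance, post) pairs shows the indexing; floor division is used for the middle element because list indices have to be integers:

```python
# Hypothetical (distance, post) pairs, already sorted by distance.
toy_similar = [
    (0.2, "closest"),
    (0.5, "near"),
    (0.9, "middle"),
    (1.1, "far"),
    (1.4, "farthest"),
]

# // is floor division, so the result is an int in Python 2 and Python 3 alike.
middle = len(toy_similar) // 2

print(toy_similar[0])       # (0.2, 'closest')
print(toy_similar[middle])  # (0.9, 'middle')
print(toy_similar[-1])      # (1.4, 'farthest')
```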

The following table shows the posts together with their similarity values:



Excerpt from post

BOOT PROBLEM with IDE controller
I've got a Multi I/O card (IDE controller + serial/parallel interface) and two floppy drives (5 1/4, 3 1/2) and a Quantum ProDrive 80AT connected to it. I was able to format the hard disk, but I could not boot from it. I can boot from drive A: (which disk drive does not matter) but if I remove the disk from drive A and press the reset switch, the LED of drive A: continues to glow, and the hard disk is not accessed at all. I guess this must be a problem of either the Multi I/o card or floppy disk drive settings (jumper configuration?) Does someone have any hint what could be the reason for it. […]

IDE Cable
I just bought a new IDE hard drive for my system to go with the one I already had. My problem is this. My system only had a IDE cable for one drive, so I had to buy cable with two drive connectors on it, and consequently have to switch cables. The problem is, the new hard drive's manual refers to matching pin 1 on the cable with both pin 1 on the drive itself and pin 1 on the IDE card. But for the life of me I cannot figure out how to tell which way to plug in the cable to align these. Secondly, the cable has like a connector at two ends and one between them. I figure one end goes in the controller and then the other two go into the drives. Does it matter which I plug into the "master" drive and which into the "Slave"? any help appreciated […]

Conner CP3204F info please
How to change the cluster size Wondering if somebody could tell me if we can change the cluster size of my IDE drive. Normally I can do it with Norton's Calibrat on MFM/RLL drives but dunno if I can on IDE too. […]

It is interesting how the posts reflect the similarity measurement score. The first post contains all the salient words from our new post. The second one also revolves around hard disks, but lacks concepts such as formatting. Finally, the third one is only slightly related. Still, for all the posts, we would say that they belong to the same domain as that of the new post.

Another look at noise

We should not expect a perfect clustering, in the sense that posts from the same newsgroup are also clustered together. An example will give us a quick impression of the noise that we have to expect:

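>>> post_group = zip(dataset.data, dataset.target)
>>> # Sort by post length so that z[5] and z[6] are the two posts printed below
>>> z = sorted((len(post[0]), post[0], dataset.target_names[post[1]]) for post in post_group)
>>> print(z[5:7])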
[(107, 'From: "kwansik kim" <[email protected]>
Subject: Where is FAQ ?

Where can I find it ?

Thanks, Kwansik

', ''), (110, 'From: [email protected]
Subject: What is 3dO?

Someone please fill me in on what 3do.


', '')]

For both of these posts, there is no real indication of which newsgroup they belong to, considering only the wording that is left after the preprocessing step:

>>> analyzer = vectorizer.build_analyzer() 
>>> list(analyzer(z[5][1]))
[u'kwansik', u'kim', u'kkim', u'cs', u'indiana', u'edu', u'subject', u'faq', u'thank', u'kwansik']
>>> list(analyzer(z[6][1]))
[u'lioness', u'mapl', u'circa', u'ufl', u'edu', u'subject', u'3do', u'3do', u'thank', u'bh']
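
To see what the analyzer does in isolation, we can run one on a single string. This is a sketch under assumptions: it uses a plain TfidfVectorizer with the built-in English stop word list, so, unlike the vectorizer of this chapter, it does not stem (thanks stays thanks):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A plain TfidfVectorizer; the chapter's stemming vectorizer is not reproduced here.
toy_vectorizer = TfidfVectorizer(stop_words="english")

# build_analyzer() returns a callable that lowercases, tokenizes,
# and removes stop words. It does not require a fitted vectorizer.
toy_analyzer = toy_vectorizer.build_analyzer()

toy_post = "Subject: Where is FAQ ?\n\nWhere can I find it ?\n\nThanks, Kwansik"
toy_tokens = toy_analyzer(toy_post)
print(toy_tokens)
```

Very little of the already short post survives, which is exactly the effect we observe for the two posts above.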

This is only after tokenization, lowercasing, and stop word removal. If we also remove the words that min_df and max_df will filter out later in fit_transform, it gets even worse:

>>> list(set(analyzer(z[5][1])).intersection(vectorizer.get_feature_names()))
[u'cs', u'faq', u'thank']
>>> list(set(analyzer(z[6][1])).intersection(vectorizer.get_feature_names()))
[u'bh', u'thank']
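
The effect of min_df can be reproduced on a toy corpus. In this sketch (all posts and toy_* names are hypothetical), the fitted vocabulary_ attribute serves as the set of terms that survived the filtering, and we intersect it with the tokens of one post:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_posts = [
    "faq about graphics cards",
    "graphics faq and thanks",
    "thanks for the graphics help",
    "kwansik asks where the faq is",
]

# min_df=2 drops every term that occurs in fewer than two posts.
toy_vectorizer = TfidfVectorizer(min_df=2, stop_words="english")
toy_vectorizer.fit(toy_posts)

# vocabulary_ maps each surviving term to its column index.
toy_vocab = set(toy_vectorizer.vocabulary_)

toy_analyzer = toy_vectorizer.build_analyzer()
kept = sorted(set(toy_analyzer(toy_posts[3])) & toy_vocab)
print(kept)  # ['faq']
```

Of the last post, only a single term is left to describe it, since kwansik and asks occur in no other post.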

Furthermore, most of the words occur frequently in other posts as well, as we can check with the IDF scores. Remember that the higher the TF-IDF value, the more discriminative a term is for a given post. As IDF is a multiplicative factor in TF-IDF, a low IDF value signals that the term is not of great value in general:

>>> for term in ['cs', 'faq', 'thank', 'bh', 'thank']:
...     print('IDF(%s)=%.2f'%(term, vectorizer._tfidf.idf_[vectorizer.vocabulary_[term]]))

So, except for bh, which is close to the maximum overall IDF value of 6.74, the terms don't have much discriminative power. Understandably, posts from different newsgroups will be clustered together.
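
To get a feeling for how the document frequency drives the IDF value, we can compute it by hand. The sketch below assumes the smoothed formula that current scikit-learn versions use by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1; the version behind the numbers in this chapter may differ, and the corpus size and document frequencies are hypothetical:

```python
import math


def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1 (assumed scikit-learn default)."""
    return math.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0


n_docs = 1000  # hypothetical corpus size

# Hypothetical document frequencies: a common, a less common, and a rare term.
for term, doc_freq in [("thank", 300), ("faq", 40), ("bh", 1)]:
    print("IDF(%s)=%.2f" % (term, smoothed_idf(n_docs, doc_freq)))
```

A term that appears in nearly every post ends up close to the minimum value of 1, whereas a term that appears in a single post gets the highest possible value for the corpus.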

For our goal, however, this is no big deal, as we are only interested in cutting down the number of posts that we have to compare a new post to. After all, the particular newsgroup that our training data came from is of no special interest.

