Finding the most similar documents

The document-term matrix produced by CountVectorizer lets us find the most similar documents using the pdist() function for pairwise distances provided by the scipy.spatial.distance module. pdist() returns a condensed distance matrix: a flat vector whose entries correspond to the upper triangle of the square distance matrix, excluding the diagonal. We use np.triu_indices() to translate the position of the smallest distance in that vector into the row and column indices of the two closest token vectors:

import numpy as np
from scipy.spatial.distance import pdist

m = binary_dtm.toarray()  # pdist does not accept sparse format
pairwise_distances = pdist(m, metric='cosine')
closest = np.argmin(pairwise_distances)  # index that minimizes distance
# k=1 skips the diagonal so the indices line up with pdist's condensed order
rows, cols = np.triu_indices(n_docs, k=1)
rows[closest], cols[closest]
(11, 75)
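
To make the index translation concrete, the following is a minimal sketch on a made-up four-document matrix (toy_dtm and its contents are illustrative assumptions, not data from the notebook). It cross-checks the pair recovered from the condensed vector against the full square matrix built with squareform():

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy binary document-term matrix: 4 documents, 5 tokens (made-up data)
toy_dtm = np.array([
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
])
n_docs = toy_dtm.shape[0]

# Condensed vector of length n_docs * (n_docs - 1) / 2
condensed = pdist(toy_dtm, metric='cosine')
closest = np.argmin(condensed)

# k=1 excludes the diagonal, so position i in the condensed vector
# corresponds to the pair (rows[i], cols[i])
rows, cols = np.triu_indices(n_docs, k=1)
pair = (int(rows[closest]), int(cols[closest]))

# Cross-check against the full square matrix, ignoring self-distances
square = squareform(condensed)
np.fill_diagonal(square, np.inf)
check = np.unravel_index(np.argmin(square), square.shape)
```

Here documents 0 and 2 share two of their tokens and no other pair overlaps as strongly, so both pair and check point to (0, 2).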

Articles 11 and 75 are closest by cosine similarity because they share 58 tokens (see the notebook):

Heading: Software watching while you work
Body: Software that can not only monitor every keystroke and action performed at a PC but can also be used as legally binding evidence of wrong-doing has been unveiled. Worries about cyber-crime and sabotage have prompted many employers to consider monitoring employees.

Heading: BT program to beat dialer scams
Body: BT is introducing two initiatives to help beat rogue dialer scams, which can cost dial-up net users thousands. From May, dial-up net users will be able to download free software to stop computers using numbers not on a user's pre-approved list.
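
The link between shared tokens and cosine similarity can be sketched as follows. For binary vectors, the dot product equals the number of tokens both documents contain, and each vector's norm is the square root of its token count. The two vectors below are made-up illustrations, not the articles above:

```python
import numpy as np

# Two made-up binary token vectors (True = token present in the document)
doc_a = np.array([1, 1, 1, 0, 0, 1], dtype=bool)
doc_b = np.array([1, 1, 0, 1, 0, 1], dtype=bool)

# Tokens that occur in both documents
shared = int(np.sum(doc_a & doc_b))

# Cosine similarity for binary vectors: shared / sqrt(|a| * |b|)
cosine_similarity = shared / np.sqrt(doc_a.sum() * doc_b.sum())
cosine_distance = 1 - cosine_similarity
```

With three shared tokens and four tokens per document, the similarity is 3 / 4 = 0.75, so the distance that pdist() would report for this pair is 0.25.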


Both CountVectorizer and TfidfVectorizer can be used with spaCy. For example, to lemmatize tokens and exclude punctuation and whitespace during tokenization, we pass a custom tokenizer:

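import spacy
from sklearn.feature_extraction.text import CountVectorizer

# spaCy 3+ requires the full model name; older releases accepted the 'en' shortcut
nlp = spacy.load('en_core_web_sm')

def tokenizer(doc):
    # return lemmas, skipping punctuation and whitespace tokens
    return [w.lemma_ for w in nlp(doc)
            if not (w.is_punct or w.is_space)]

vectorizer = CountVectorizer(tokenizer=tokenizer, binary=True)
doc_term_matrix = vectorizer.fit_transform(docs.body)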
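
The tokenizer hook itself does not depend on spaCy. As a rough sketch of the mechanism, the example below plugs a hypothetical regex-based simple_tokenizer into CountVectorizer in place of the spaCy pipeline. The function, the toy documents, and the choice to pass token_pattern=None (so the unused default pattern does not trigger a warning) are illustrative assumptions, and no lemmatization takes place:

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def simple_tokenizer(doc):
    # Hypothetical stand-in for the spaCy tokenizer: lowercase the text and
    # keep alphabetic runs, which drops punctuation and whitespace
    return re.findall(r"[a-z]+", doc.lower())

# Made-up documents for illustration
toy_docs = [
    "Dial-up users: beware of scams!",
    "Software monitors users, and users object.",
]

# token_pattern=None because the custom tokenizer replaces the default regex
vectorizer = CountVectorizer(tokenizer=simple_tokenizer, binary=True,
                             token_pattern=None)
toy_dtm = vectorizer.fit_transform(toy_docs)

# Sorted list of the tokens the vectorizer learned
vocabulary = sorted(vectorizer.vocabulary_)
```

Because binary=True, the repeated token "users" in the second document is recorded as 1 rather than 2, which is the representation the cosine comparison above relies on.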

See the notebook for additional details and more examples.

