Finding the most similar documents

The CountVectorizer result lets us find the most similar documents using the pdist() function for pairwise distances provided by the scipy.spatial.distance module. It returns a condensed distance matrix whose entries correspond to the upper triangle of the square distance matrix, excluding the diagonal. We use np.triu_indices() to translate the index that minimizes the distance into the row and column indices of the two closest document vectors:

import numpy as np
from scipy.spatial.distance import pdist

m = binary_dtm.toarray()  # pdist does not accept sparse format
n_docs = m.shape[0]
pairwise_distances = pdist(m, metric='cosine')
closest = np.argmin(pairwise_distances)  # index that minimizes distance
rows, cols = np.triu_indices(n_docs, k=1)  # k=1 skips the diagonal, matching pdist's order
rows[closest], cols[closest]
(11, 75)
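The index mapping can be checked by hand on a small example. The following is a minimal pure-Python sketch, not the notebook's code: `toy_dtm` and `cosine_distance` are hypothetical names, and `itertools.combinations` stands in for the ordering of the condensed matrix, which lists the pairs (0, 1), (0, 2), ..., (1, 2), ... row by row above the diagonal.

```python
from itertools import combinations
from math import sqrt

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical binary document-term matrix: 4 documents, 5 tokens
toy_dtm = [
    [1, 1, 0, 0, 1],
    [0, 1, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 1, 1, 0],
]

# All pairs (i, j) with i < j, in the same order as the condensed matrix
pairs = list(combinations(range(len(toy_dtm)), 2))
condensed = [cosine_distance(toy_dtm[i], toy_dtm[j]) for i, j in pairs]

# Position of the smallest distance, then back to document indices
closest = min(range(len(condensed)), key=condensed.__getitem__)
closest_pair = pairs[closest]
print(closest, closest_pair)
```

For n documents the condensed matrix holds n(n-1)/2 entries, so the index arrays used for the lookup need the same length; an index set that also contains the diagonal has n(n+1)/2 entries and would point at the wrong pair.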

Articles 11 and 75 are the closest pair by cosine distance; they share 58 tokens (see notebook):

Topic: tech
Heading: Software watching while you work
Body: Software that can not only monitor every keystroke and action performed at a PC but can also be used as legally binding evidence of wrong-doing has been unveiled. Worries about cyber-crime and sabotage have prompted many employers to consider monitoring employees.

Topic: tech
Heading: BT program to beat dialer scams
Body: BT is introducing two initiatives to help beat rogue dialer scams, which can cost dial-up net users thousands. From May, dial-up net users will be able to download free software to stop computers using numbers not on a user's pre-approved list.
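To see how the number of shared tokens drives the result: for binary vectors, the dot product equals the count of shared tokens and each norm is the square root of a document's distinct-token count. A minimal sketch under that observation follows; the token sets are made up for illustration and are not the articles' actual vocabularies, and `binary_cosine_similarity` is a hypothetical helper.

```python
from math import sqrt

def binary_cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity of two binary token vectors, computed from token sets."""
    shared = tokens_a & tokens_b
    return len(shared) / sqrt(len(tokens_a) * len(tokens_b))

# Hypothetical token sets, not the actual article vocabularies
doc_a = {"software", "monitor", "pc", "employee"}
doc_b = {"software", "dial-up", "pc", "scam", "user"}

n_shared = len(doc_a & doc_b)  # tokens present in both documents
similarity = binary_cosine_similarity(doc_a, doc_b)
distance = 1 - similarity      # the quantity a cosine distance metric reports
print(n_shared, round(similarity, 3), round(distance, 3))
```

More shared tokens relative to the documents' sizes means higher similarity and a smaller cosine distance.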

 

Both CountVectorizer and TfidfVectorizer can be used with spaCy; for example, to perform lemmatization and exclude certain characters during tokenization, we use the following:

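import spacy
from sklearn.feature_extraction.text import CountVectorizer

nlp = spacy.load('en_core_web_sm')  # spaCy 3+ model name; older releases accepted 'en'

def tokenizer(doc):
    # lemmatize and drop punctuation and whitespace tokens
    return [w.lemma_ for w in nlp(doc)
            if not (w.is_punct or w.is_space)]

vectorizer = CountVectorizer(tokenizer=tokenizer, binary=True)
doc_term_matrix = vectorizer.fit_transform(docs.body)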
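What a custom tokenizer combined with binary=True produces can be illustrated without spaCy or scikit-learn. The sketch below is a stand-in under stated assumptions: `simple_tokenizer` is a hypothetical regex-based tokenizer that drops punctuation and whitespace but does not lemmatize, and the toy documents are invented. It mimics the vectorizer's default lowercasing and alphabetically sorted vocabulary.

```python
import re

def simple_tokenizer(doc):
    """Stand-in tokenizer: keep word-like tokens, drop punctuation and whitespace."""
    return re.findall(r"\w+", doc.lower())

# Hypothetical toy documents
toy_docs = [
    "BT beats dialer scams.",
    "Software monitors every keystroke!",
    "Dialer scams, scams, scams...",
]

tokenized = [simple_tokenizer(doc) for doc in toy_docs]

# Alphabetically sorted vocabulary across all documents
vocabulary = sorted({tok for toks in tokenized for tok in toks})

# Binary indicators: 1 if the token occurs at all, regardless of how often
binary_rows = [[int(term in set(toks)) for term in vocabulary]
               for toks in tokenized]

print(vocabulary)
for row in binary_rows:
    print(row)
```

The third document repeats "scams" three times, yet its row contains a single 1 for that token, which is the effect of binary=True.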

See the notebook for additional details and more examples.
