Document-term matrix with sklearn

The scikit-learn feature_extraction.text module offers two tools to create a document-term matrix. CountVectorizer uses binary or absolute counts to measure the term frequency tf(d, t) for each document d and token t.

TfidfVectorizer, in contrast, weighs the (absolute) term frequency by the inverse document frequency (idf). As a result, a term that appears in more documents receives a lower weight than a token with the same frequency in a given document but a lower frequency across all documents. More specifically, using the default settings, the tf-idf(d, t) entries of the document-term matrix are computed as tf-idf(d, t) = tf(d, t) × idf(t), where idf(t) = log[(1 + n_d) / (1 + df(d, t))] + 1.

Here, n_d is the number of documents and df(d, t) the document frequency of term t. The resulting tf-idf vectors for each document are normalized with respect to their absolute (L1) or squared (L2, the default) totals (see the sklearn documentation for details). The tf-idf measure was originally used in information retrieval to rank search engine results and has subsequently proven useful for text classification and clustering.
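The default computation can be verified by hand on a toy corpus (the two documents below are hypothetical): applying sklearn's smoothed idf formula to raw counts and L2-normalizing each row reproduces TfidfVectorizer's output:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['the cat sat', 'the cat sat on the mat']  # hypothetical mini-corpus

tfidf = TfidfVectorizer()              # defaults: smoothed idf, L2 normalization
X = tfidf.fit_transform(docs).toarray()

# Reproduce the default idf: log((1 + n_d) / (1 + df(d, t))) + 1
counts = CountVectorizer().fit_transform(docs).toarray()
df = (counts > 0).sum(axis=0)          # document frequency per term
n_d = counts.shape[0]                  # number of documents
idf = np.log((1 + n_d) / (1 + df)) + 1

raw = counts * idf                     # tf-idf before normalization
manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)  # L2-normalize rows
print(np.allclose(X, manual))          # manual computation matches sklearn
```

Note the smoothing terms (the +1 inside the log's numerator and denominator, and the +1 added to the log) are sklearn's defaults and differ slightly from the textbook idf definition.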

Both tools use the same interface: they tokenize and apply further optional preprocessing to a list of documents before vectorizing the text, generating the token counts that populate the document-term matrix.
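The shared fit/transform interface also means a vocabulary learned on one set of documents can be applied to new text; tokens unseen during fitting are simply ignored (the documents below are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ['the cat sat', 'the dog ran']   # hypothetical training documents
vec = CountVectorizer().fit(train_docs)       # learn the vocabulary
print(vec.vocabulary_)                        # maps token -> column index

# transform() reuses the fitted vocabulary; 'fast' was never seen, so it is dropped
new = vec.transform(['the cat ran fast'])
print(new.toarray())
```

This is the pattern to follow when vectorizing a held-out test set: fit on the training documents only, then transform both splits.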

Key parameters that affect the size of the vocabulary include the following:

  • stop_words: Use a built-in or provide a list of (frequent) words to exclude
  • ngram_range: Include n-grams in a range for n defined by a tuple of (nmin, nmax)
  • lowercase: Convert characters accordingly (default is True)
  • min_df / max_df: Ignore words that appear in fewer/more documents than a threshold (if int) or in a smaller/larger share of documents (if float in [0.0, 1.0])
  • max_features: Limit the number of tokens in a vocabulary accordingly
  • binary: Set all non-zero counts to 1 (if True)
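A brief sketch of how these parameters shrink or grow the vocabulary (the three toy documents are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['the cat sat on the mat',
        'the dog sat on the log',
        'cats and dogs']                       # hypothetical documents

# Baseline vocabulary: every unigram
base = CountVectorizer().fit(docs)
print(len(base.vocabulary_))

# Drop built-in English stop words and add bigrams
vec = CountVectorizer(stop_words='english', ngram_range=(1, 2)).fit(docs)
print(sorted(vec.vocabulary_))                 # includes bigrams like 'cat sat'

# min_df=2 keeps only tokens appearing in at least two documents
frequent = CountVectorizer(min_df=2).fit(docs)
print(sorted(frequent.vocabulary_))
```

Note that n-grams are formed after stop-word removal, so 'cat sat' appears as a bigram even though 'the' originally sat between other tokens in the raw text.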

See the document_term_matrix notebook for the following code samples and additional details. We are again using the 2,225 BBC News articles for illustration.
