The Tf-idf vectorizer

A TfidfVectorizer can be broken down into two components: the tf part, which represents term frequency, and the idf part, meaning inverse document frequency. It is a term-weighting method that has applications in information retrieval and clustering.

Each word is given a weight that reflects how important it is to a document in a corpus. Let's look into each part a little more:

  • tf: term frequency: Measures how frequently a term occurs in a document. Since documents can be different in length, it is possible that a term would appear many more times in longer documents than shorter ones. Thus, the term frequency is often divided by the document length, or the total number of terms in the document, as a way of normalization.
  • idf: inverse document frequency: Measures how important a term is. While computing term frequency, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times but have little importance. So, we need to weight the frequent terms less, while we scale up the rare ones.
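
The two parts described above can be sketched in a few lines of plain Python. This is a minimal illustration of the textbook formulation (term count divided by document length, multiplied by the log of the total number of documents over the number of documents containing the term), not scikit-learn's exact implementation, which by default uses a smoothed idf and normalizes each row. The toy corpus and the function names are invented for this example.

```python
import math
from collections import Counter

# Toy corpus, invented purely for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat",
    "the bird flew away",
]
docs = [doc.split() for doc in corpus]


def term_frequency(term, tokens):
    # Raw count of the term divided by the document length.
    return Counter(tokens)[term] / len(tokens)


def inverse_document_frequency(term, docs):
    # log(N / df): zero for a term found in every document, larger for rare terms.
    # Assumes the term occurs in at least one document (otherwise df would be 0).
    df = sum(1 for tokens in docs if term in tokens)
    return math.log(len(docs) / df)


def tf_idf(term, tokens, docs):
    return term_frequency(term, tokens) * inverse_document_frequency(term, docs)


for term in ("the", "cat"):
    print(term, round(tf_idf(term, docs[0], docs), 4))
# the 0.0
# cat 0.1831
```

Although "the" occurs twice in the first document and "cat" only once, "the" appears in every document, so its idf is zero and its weight vanishes, while the rarer "cat" keeps a positive weight.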

To re-emphasize, a TfidfVectorizer is similar to a CountVectorizer, in that it constructs features from tokens, but it goes a step further and rescales the counts according to how frequently each term occurs across the corpus. Let's see an example of this in action.

First, our import:

from sklearn.feature_extraction.text import TfidfVectorizer

To bring up some code from before, a plain vanilla CountVectorizer will output a document-term matrix:

vect = CountVectorizer()
_ = vect.fit_transform(X)  # sparse document-term matrix
print(_.shape, _[0, :].mean())

(99989, 105849) 6.61319426731e-05

Our TfidfVectorizer can be set up as follows:

vect = TfidfVectorizer()
_ = vect.fit_transform(X)
print(_.shape, _[0, :].mean())  # same number of rows and columns, different cell values

(99989, 105849) 2.18630609758e-05

We can see that both vectorizers output the same number of rows and columns, but produce different cell values. Both TfidfVectorizer and CountVectorizer transform text data into quantitative data over the same vocabulary, but the way in which they fill in the cells differs: CountVectorizer stores raw token counts, while TfidfVectorizer stores tf-idf weights.
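
The same comparison is easier to inspect on a tiny corpus. The sketch below uses a three-sentence toy corpus (invented for illustration, standing in for X) and relies on scikit-learn's default settings as I understand them: a smoothed idf and L2 normalization of each row.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, invented for illustration; it stands in for the X used above.
toy_corpus = [
    "the cat sat",
    "the dog sat",
    "the bird flew",
]

count_vect = CountVectorizer()
counts = count_vect.fit_transform(toy_corpus)

tfidf_vect = TfidfVectorizer()
tfidf = tfidf_vect.fit_transform(toy_corpus)

# Both matrices have one row per document and one column per vocabulary term.
print(counts.shape, tfidf.shape)

# Compare the first document's row under each vectorizer.
vocab = tfidf_vect.vocabulary_          # maps each term to its column index
count_row = counts[0].toarray().ravel()
tfidf_row = tfidf[0].toarray().ravel()

for term in ("the", "sat", "cat"):
    col = vocab[term]
    print(term, count_row[col], round(tfidf_row[col], 3))
```

In the first document, "the", "sat", and "cat" each have a count of 1, yet their tf-idf weights differ: "cat" appears in only one document and receives the largest weight, while "the" appears in all three and receives the smallest.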
