The effect of smoothing

TfidfVectorizer offers two options that smooth document and term frequencies; the first also prevents division by zero:

  • smooth_idf: Add 1 to every document frequency, as if an extra document contained each token in the vocabulary exactly once, to prevent zero division
  • sublinear_tf: Apply sublinear tf scaling; in other words, replace tf with 1 + log(tf)
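
The two options can be sketched in plain Python. This is a minimal illustration that assumes the formulas given in the scikit-learn documentation (natural logarithm, with 1 added to the idf); the function names idf and sublinear_tf are hypothetical helpers, not part of the library:

```python
import math

def idf(n_docs, doc_freq, smooth=True):
    """Inverse document frequency for a term that occurs in doc_freq of n_docs documents."""
    if smooth:
        # smooth_idf=True: pretend one extra document contains every term once
        return math.log((1 + n_docs) / (1 + doc_freq)) + 1
    # smooth_idf=False: undefined when doc_freq is 0
    return math.log(n_docs / doc_freq) + 1

def sublinear_tf(tf):
    """Dampened term frequency: 1 + log(tf) for positive counts, 0 otherwise."""
    return 1 + math.log(tf) if tf > 0 else 0.0

print(idf(3, 0))           # about 2.39: defined even for a term seen in no document
try:
    idf(3, 0, smooth=False)
except ZeroDivisionError:
    print("unsmoothed idf is undefined when doc_freq is 0")
print(sublinear_tf(10))    # about 3.30 instead of a raw count of 10
```

A document frequency of zero can only occur for terms that are missing from the fitted corpus, for example when a fixed vocabulary is supplied; for all other terms, smoothing just shifts the idf values slightly.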

In combination with normalized weights, the results differ slightly:

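import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(smooth_idf=True,
                       norm='l2',           # squared weights sum to 1 by document
                       sublinear_tf=False,  # if True, use 1+log(tf)
                       binary=False)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(vect.fit_transform(sample_docs).todense(),
             columns=vect.get_feature_names_out())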

   call    me  please  taxi  tomorrow   you
0  0.39  0.00    0.00  0.00      0.65  0.65
1  0.43  0.55    0.00  0.72      0.00  0.00
2  0.27  0.34    0.90  0.00      0.00  0.00
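
The last row of the table can be reproduced by hand as a check on how smoothing and L2 normalization interact. The sketch below assumes term counts and document frequencies inferred from the table itself (sample_docs is not shown in this section), with "please" occurring twice in document 2:

```python
import math

# Assumed inputs, inferred from the table rather than taken from sample_docs
n_docs = 3
doc_freq = {'call': 3, 'me': 2, 'please': 1}      # documents containing each term
term_counts = {'call': 1, 'me': 1, 'please': 2}   # raw counts in document 2

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

# Raw tf-idf, then L2 normalization so the squared weights sum to 1
raw = {t: tf * idf[t] for t, tf in term_counts.items()}
norm = math.sqrt(sum(w ** 2 for w in raw.values()))
weights = {t: round(w / norm, 2) for t, w in raw.items()}

print(weights)  # {'call': 0.27, 'me': 0.34, 'please': 0.9}
```

Because "call" appears in every document, its smoothed idf is exactly 1, so its final weight depends only on the normalization within each document.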