Using CountVectorizer

The notebook contains an interactive visualization that explores how the min_df and max_df settings affect the size of the vocabulary. We read the articles into a DataFrame, configure the CountVectorizer to produce binary flags and keep all tokens (max_df=1.0, min_df=1), and call its .fit_transform() method to produce a document-term matrix:

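from sklearn.feature_extraction.text import CountVectorizer

# binary=True flags token presence; max_df=1.0 and min_df=1 keep every token
binary_vectorizer = CountVectorizer(max_df=1.0,
                                    min_df=1,
                                    binary=True)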

binary_dtm = binary_vectorizer.fit_transform(docs.body)
binary_dtm
<2225x29275 sparse matrix of type '<class 'numpy.int64'>'
    with 445870 stored elements in Compressed Sparse Row format>

The output is a scipy.sparse matrix in compressed sparse row (CSR) format that efficiently stores only the 445,870 non-zero entries, a small share (<0.7%) of the cells in the 2,225 (document) rows and 29,275 (token) columns.
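
The share of non-zero entries and the effect of min_df on vocabulary size can be checked directly. The following is a minimal sketch on a small made-up corpus, not the articles used above: toy_docs and the helper vocab_size are hypothetical names introduced only for illustration, and the numbers apply to this toy data alone.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for docs.body
toy_docs = [
    "stocks rally as markets recover",
    "markets fall as oil prices drop",
    "oil prices rally on supply fears",
    "football club wins league title",
]


def vocab_size(min_df=1, max_df=1.0):
    """Fit a binary CountVectorizer and return the number of tokens kept."""
    vectorizer = CountVectorizer(min_df=min_df, max_df=max_df, binary=True)
    vectorizer.fit(toy_docs)
    return len(vectorizer.vocabulary_)


# Document-term matrix with all tokens retained (the default thresholds)
binary_dtm = CountVectorizer(binary=True).fit_transform(toy_docs)
n_docs, n_tokens = binary_dtm.shape

# nnz is the number of stored (non-zero) entries in the sparse matrix
density = binary_dtm.nnz / (n_docs * n_tokens)

print(f"shape: {binary_dtm.shape}, non-zero share: {density:.1%}")
print(f"vocabulary with min_df=1: {vocab_size(min_df=1)}")
print(f"vocabulary with min_df=2: {vocab_size(min_df=2)}")
```

Applying the same density calculation to the matrix above gives 445870 / (2225 * 29275), or roughly 0.68%, which is consistent with the stated share of less than 0.7%. Raising min_df drops tokens that appear in too few documents, so the vocabulary shrinks.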
