When we deal with text documents that contain millions of words, we need to convert them into a numeric representation so that machine learning algorithms can use them. These algorithms need numerical data in order to analyze it and output meaningful information. This is where the bag-of-words approach comes into the picture. Bag-of-words is a model that learns a vocabulary from all the words across all the documents, and then models each document by building a histogram of the words that occur in it.
Create a new Python file and import the following packages:

import numpy as np
from nltk.corpus import brown
from chunking import splitter
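The splitter function is reused from the text-chunking recipe covered earlier, so keep that chunking.py file in the same directory. If you don't have it handy, a minimal sketch of what it needs to do, assuming the same name and signature, looks like this:

# chunking.py: minimal sketch of the splitter helper imported above
def splitter(data, num_words):
    # Split the input text into chunks containing num_words words each
    words = data.split(' ')
    output = []
    cur_words = []
    for word in words:
        cur_words.append(word)
        if len(cur_words) == num_words:
            output.append(' '.join(cur_words))
            cur_words = []
    # Append whatever is left over as the final chunk (possibly empty
    # when the word count divides evenly)
    output.append(' '.join(cur_words))
    return output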
Define the main function and load the input data from the Brown corpus:

if __name__=='__main__':
    # Read the first 10,000 words from the Brown corpus
    data = ' '.join(brown.words()[:10000])
Divide the text into chunks, where each chunk contains 2,000 words:

    # Number of words in each chunk
    num_words = 2000

    chunks = []
    counter = 0
    text_chunks = splitter(data, num_words)
    for text in text_chunks:
        chunk = {'index': counter, 'text': text}
        chunks.append(chunk)
        counter += 1
Extract a document-term matrix, which counts the number of occurrences of each word in each chunk:

    # Extract the document-term matrix
    from sklearn.feature_extraction.text import CountVectorizer
    # min_df ignores words that appear in fewer than 5 chunks; max_df
    # ignores words that appear in more than 95% of the chunks
    vectorizer = CountVectorizer(min_df=5, max_df=.95)
    doc_term_matrix = vectorizer.fit_transform([chunk['text'] for chunk in chunks])
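At this point, doc_term_matrix is a SciPy sparse matrix with one row per chunk and one column per vocabulary word. A quick inspection (not part of the recipe itself) confirms its layout:

    # Rows correspond to chunks, columns to vocabulary words
    print(doc_term_matrix.shape)
    print(type(doc_term_matrix))   # a sparse CSR matrix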
Extract the vocabulary from the vectorizer object and print it. Note that on scikit-learn versions older than 1.0, you would call get_feature_names() instead:

    vocab = np.array(vectorizer.get_feature_names_out())
    print("\nVocabulary:")
    print(vocab)
Print the document-term matrix. We name the five chunks and then print the count of each vocabulary word in each chunk:

    print("\nDocument term matrix:")
    chunk_names = ['Chunk-0', 'Chunk-1', 'Chunk-2', 'Chunk-3', 'Chunk-4']

    # Right-align each field in a 12-character-wide column
    formatted_row = '{:>12}' * (len(chunk_names) + 1)
    print('\n', formatted_row.format('Word', *chunk_names), '\n')

    for word, item in zip(vocab, doc_term_matrix.T):
        # 'item' is a sparse row holding the counts of 'word' in each
        # chunk; min_df=5 guarantees the word occurs in all five chunks,
        # so 'item.data' contains exactly five counts
        output = [str(x) for x in item.data]
        print(formatted_row.format(word, *output))
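If the row-by-row formatting feels opaque, a simpler view (just a sketch, not part of the original recipe) is to densify the matrix and print it in one shot; that is fine for a matrix this small, but it defeats the purpose of the sparse representation on a large corpus:

    # Dense view: rows are chunks, columns follow the vocabulary order
    print(doc_term_matrix.toarray())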
The full code is in the bag_of_words.py file. If you run this code, you will see two things printed on the Terminal: the first output is the vocabulary, and the second is the document-term matrix showing the word counts for each chunk.

To see how the model works, consider the following sentences:

- The brown dog is running.
- The black dog is in the black room.
- Running in the room is forbidden.
If you consider all the three sentences, we have the following nine unique words:

the, brown, dog, is, running, black, in, room, forbidden

Now, let's convert each sentence into a histogram using the count of words in each sentence. Each feature vector will be 9-dimensional because we have nine unique words. Using the word order above, the sentences become:

- The brown dog is running: [1, 1, 1, 1, 1, 0, 0, 0, 0]
- The black dog is in the black room: [2, 0, 1, 1, 0, 2, 1, 1, 0]
- Running in the room is forbidden: [1, 0, 0, 1, 1, 0, 1, 1, 1]
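You can verify these counts with the same CountVectorizer class used above. This is a quick sketch, and note that scikit-learn orders its vocabulary alphabetically, so the columns come out as a permutation of the word order used here:

from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    'The brown dog is running',
    'The black dog is in the black room',
    'Running in the room is forbidden',
]

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(sentences)

# Alphabetical vocabulary:
# ['black' 'brown' 'dog' 'forbidden' 'in' 'is' 'room' 'running' 'the']
print(vectorizer.get_feature_names_out())

# One 9-dimensional histogram per sentence:
# [[0 1 1 0 0 1 0 1 1]
#  [2 0 1 0 1 1 1 0 2]
#  [0 0 0 1 1 1 1 1 1]]
print(vectors.toarray())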
Once we extract these feature vectors, we can use machine learning algorithms to analyze them.