The clustering model

We can tokenize the text into sentences to expand our dataset. We imported a function called sent_tokenize from the nltk package (Natural Language Toolkit). This function takes a single string and returns an ordered list of the sentences it contains, split at sentence-ending punctuation. For example:

sent_tokenize("hello! I am Sinan. How are you??? I am fine")

['hello!', 'I am Sinan.', 'How are you???', 'I am fine']

We will apply this function to our entire corpus using Python's reduce. We apply sent_tokenize to each review and concatenate the resulting lists into a single list called sentences that holds all of our sentences:

sentences = reduce(lambda x, y:x+y, texts.apply(lambda x: sent_tokenize(str(x).decode('utf-8'))))
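The line above is written for Python 2: in Python 3, str has no decode method and reduce has to be imported from functools. As a minimal Python 3 sketch of the same flattening step, the snippet below uses itertools.chain instead of reduce, which also avoids building a new list at every concatenation. To stay self-contained it swaps in a hypothetical regex helper, simple_sent_split, for sent_tokenize, and a short made-up list, reviews, for texts; neither is part of the original code.

```python
import re
from itertools import chain


def simple_sent_split(text):
    """Hypothetical stand-in for nltk's sent_tokenize.

    Splits on whitespace that follows '.', '!' or '?'. It is far cruder
    than the real tokenizer and is only here to keep the example runnable.
    """
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


# Made-up reviews standing in for the `texts` column used in the chapter
reviews = [
    "hello! I am Sinan. How are you??? I am fine",
    "Great food. Slow service.",
]

# Tokenize each review, then flatten the per-review lists into one list
sentences = list(chain.from_iterable(simple_sent_split(str(r)) for r in reviews))

print(len(sentences))
```

With nltk installed, replacing simple_sent_split with sent_tokenize and reviews with texts should give the same kind of flat list as the reduce version.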

We can now see how many sentences we have:

# the number of sentences
len(sentences)

118151

This gives us 118,151 sentences to work with. To create a document-term matrix, let's use TfidfVectorizer on our sentences:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')

tfidf_transformed = tfidf.fit_transform(sentences)

tfidf_transformed

We get the following:

<118151x280901 sparse matrix of type '<type 'numpy.float64'>'
        with 1180273 stored elements in Compressed Sparse Row format>
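To get a feel for how sparse this matrix is, the short calculation below works only from the three numbers in the summary above. It is a back-of-the-envelope sketch, and the variable names are chosen for illustration.

```python
# Shape and stored-element count copied from the matrix summary above
n_rows, n_cols = 118151, 280901
n_stored = 1180273

density = n_stored / (n_rows * n_cols)   # fraction of cells that are stored
avg_per_row = n_stored / n_rows          # stored features per sentence
dense_gb = n_rows * n_cols * 8 / 1e9     # float64 takes 8 bytes per cell

print(f"density: {density:.6%}")
print(f"stored values per sentence: {avg_per_row:.1f}")
print(f"dense float64 size: {dense_gb:.0f} GB")
```

Each sentence has roughly ten stored values out of 280,901 possible features, while a dense copy of the same matrix would need hundreds of gigabytes. That is why it matters whether a model can work on the sparse format directly.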

Now, let's try to fit a PCA to this data, like so:

# try to fit PCA

PCA(n_components=1000).fit(tfidf_transformed)

Upon running this code, we get the following error:

TypeError: PCA does not support sparse input. See TruncatedSVD for a possible alternative.

This error tells us that PCA, at least in the scikit-learn version used here, cannot take a sparse input, and it suggests that we use TruncatedSVD instead. Singular value decomposition (SVD) is a matrix factorization that yields the same components as PCA when the data is centered, and it allows us to work with sparse matrices. Let's take this suggestion and use the TruncatedSVD module.
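The claim that SVD recovers the PCA components on centered data can be checked numerically. The sketch below uses plain NumPy on a small random dense matrix (an assumption made for illustration, not the review data) and compares the eigenvectors of the covariance matrix with the right singular vectors of the centered data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Small correlated toy dataset: 200 samples, 5 features (illustrative only)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# PCA route: eigenvectors of the covariance matrix of the centered data
Xc = X - X.mean(axis=0)
n = Xc.shape[0]
cov = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)          # eigh returns ascending order
order = np.argsort(eigvals)[::-1]               # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# SVD route: right singular vectors of the same centered matrix
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Directions agree up to sign; variances equal S**2 / (n - 1)
same_dirs = np.allclose(np.abs(Vt), np.abs(eigvecs.T), atol=1e-6)
same_vars = np.allclose(S ** 2 / (n - 1), eigvals)
print(same_dirs, same_vars)
```

The centering step is what a sparse matrix cannot afford: subtracting each column's mean turns almost every zero into a non-zero value. TruncatedSVD skips centering and factorizes the matrix as it is, so its components match PCA's only to the extent that the data is already close to centered.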
