We can tokenize the text into sentences to expand our dataset. We imported a function called sent_tokenize from the nltk package (the Natural Language Toolkit). This function takes in a single string and outputs an ordered list of its sentences, split on sentence-ending punctuation. For example:
sent_tokenize("hello! I am Sinan. How are you??? I am fine")
['hello!', 'I am Sinan.', 'How are you???', 'I am fine']
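If you have not used nltk before, note that sent_tokenize relies on the pre-trained Punkt tokenizer models, which are downloaded separately from the package itself. A one-time setup, assuming a standard nltk installation, looks like this:

import nltk

# one-time download of the Punkt models that sent_tokenize relies on
nltk.download('punkt')

from nltk.tokenize import sent_tokenize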
We will apply this function to our entire corpus using some reduce logic in Python. Essentially, we apply the sent_tokenize function to each review and concatenate the results into a single list called sentences that holds all of our sentences:
sentences = reduce(lambda x, y: x + y,
                   texts.apply(lambda x: sent_tokenize(str(x).decode('utf-8'))))
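As a side note, the same flattening can be written with itertools.chain, which avoids building a new intermediate list for every + in the reduce. A minimal sketch, assuming texts is the same pandas Series of reviews used above:

from itertools import chain

# equivalent flattening: chain the per-review sentence lists into one list,
# avoiding the intermediate lists that repeated '+' concatenation builds
sentences = list(chain.from_iterable(
    texts.apply(lambda x: sent_tokenize(str(x).decode('utf-8')))))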
We can now see how many sentences we have:
# the number of sentences
len(sentences)
118151
This tells us that we have 118,151 sentences to work with. To create a document-term matrix, let's use TfidfVectorizer on our sentences:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')
tfidf_transformed = tfidf.fit_transform(sentences)
tfidf_transformed
We get the following:
<118151x280901 sparse matrix of type '<type 'numpy.float64'>' with 1180273 stored elements in Compressed Sparse Row format>
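Note how sparse this matrix is: of the roughly 33.2 billion cells (118,151 rows times 280,901 columns), only 1,180,273 hold a nonzero value, a density of about 0.0036%. We can confirm this from the matrix itself:

# fraction of cells that hold a nonzero value
n_rows, n_cols = tfidf_transformed.shape
tfidf_transformed.nnz / float(n_rows * n_cols)  # about 3.6e-05, i.e. ~0.0036%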
Now, let's try to fit a PCA to this data, like so:
from sklearn.decomposition import PCA

# try to fit PCA
PCA(n_components=1000).fit(tfidf_transformed)
Upon running this code, we get the following error:

TypeError: PCA does not support sparse input. See TruncatedSVD for a possible alternative.
This error tells us that PCA cannot accept a sparse input, and it suggests that we use TruncatedSVD instead. Singular value decomposition (SVD) is a matrix trick for computing the same components as PCA (when the data is centered) that allows us to work with sparse matrices directly. Let's take this suggestion and use the TruncatedSVD module.
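TruncatedSVD accepts the sparse matrix as-is, with no dense conversion required; the one caveat is that, unlike PCA, it does not center the data first. A minimal sketch, mirroring the failed PCA call above:

from sklearn.decomposition import TruncatedSVD

# TruncatedSVD works on the sparse matrix directly, no densifying needed
svd = TruncatedSVD(n_components=1000)
svd.fit(tfidf_transformed)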