Stemming and lemmatizing words

As mentioned earlier, we have two basic strategies to deal with words from the same root—stemming and lemmatization. Stemming is a quicker approach that involves, if necessary, chopping off letters, for example, words becomes word after stemming. The result of stemming doesn't have to be a valid word. For instance, trying and try become tri. Lemmatizing, on the other hand, is slower but more accurate. It performs a dictionary lookup and guarantees to return a valid word. Recall we have implemented both stemming and lemmatization using NLTK in a previous section.

Putting all of these (preprocessing, dropping stop words, lemmatizing, and count vectorizing) together, we obtain the following:

>>> from nltk.corpus import names
>>> all_names = set(names.words())
>>> count_vector_sw = CountVectorizer(stop_words="english", max_features=500)
>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> data_cleaned = []
>>> for doc in groups.data:
... doc = doc.lower()
... doc_cleaned = ' '.join(lemmatizer.lemmatize(word)
for word in doc.split()
if is_letter_only(word) and
word not in all_names)
... data_cleaned.append(doc_cleaned)
>>> data_cleaned_count = count_vector_sw.fit_transform(data_cleaned)

Now the features are much more meaningful:

>>> print(count_vector_sw.get_feature_names())
['able', 'accept', 'access', 'according', 'act', 'action', 'actually', 'add', 'address', 'ago', 'agree', 'algorithm', 'allow', 'american', 'anonymous', 'answer', 'anybody', 'apple', 'application', 'apr', 'april', 'arab', 'area', 'argument', 'armenian', 'article', 'ask', 'asked', 'assume', 'atheist', 'attack', 'attempt', 'available', 'away', 'bad', 'based', 'belief', 'believe', 'best', 'better', 'bible', 'big', 'bike', 'bit', 'black', 'board', 'body', 'book', 'box', 'build', 'bus', 'buy', 'ca', 'california', 'called', 'came', 'canada', 'car', 'card', 'care', 'carry', 'case', 'cause', 'center', 'certain', 'certainly', 'chance', 'change', 'check', 'child', 'chip', 'christian', 'church', 'city', 'claim', 'clear', 'clinton', 'clipper', 'code', 'college', 'color', 'come', 'coming', 'command', 'comment', 'common', 'communication', 'company', 'computer', 'consider', 'considered', 'contact', 'control', 'copy',
……
……
'short', 'shot', 'similar', 'simple', 'simply', 'single', 'site', 'situation', 'size', 'small', 'software', 'sort', 'sound', 'source', 'space', 'special', 'specific', 'speed', 'standard', 'start', 'started', 'state', 'statement', 'steve', 'stop', 'strong', 'study', 'stuff', 'subject', 'sun', 'support', 'sure', 'taken', 'taking', 'talk', 'talking', 'tape', 'tax', 'team', 'technical', 'technology', 'tell', 'term', 'test', 'texas', 'text', 'thanks', 'thing', 'think', 'thinking', 'thought', 'time', 'tin', 'today', 'told', 'took', 'total', 'tried', 'true', 'truth', 'try', 'trying', 'turkish', 'turn', 'type', 'understand', 'united', 'university', 'unix', 'unless', 'usa', 'use', 'used', 'user', 'using', 'usually', 'value', 'various', 'version', 'video', 'view', 'wa', 'want', 'wanted', 'war', 'water', 'way', 'weapon', 'week', 'went', 'western', 'white', 'widget', 'win', 'window', 'woman', 'word', 'work', 'working', 'world', 'worth', 'write', 'written', 'wrong', 'year', 'york', 'young']
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.188.198.94