Stemming and lemmatization

In textual data, most words are likely to appear in slightly different forms. Reducing each word to its origin or stem within a family of words is called stemming. It is used to group words with similar meanings, which reduces the total number of distinct words that need to be analyzed. Essentially, stemming reduces the overall dimensionality of the problem.

For example, {use, used, using, uses} => use.

The most common algorithm for stemming English is the Porter algorithm.
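The following is a minimal sketch of Porter stemming, assuming the NLTK library is installed and used here purely for illustration:

```python
from nltk.stem import PorterStemmer

# Apply the Porter algorithm to the example word family
stemmer = PorterStemmer()
words = ["use", "used", "using", "uses"]

print([stemmer.stem(w) for w in words])
# ['use', 'use', 'use', 'use']
```

All four inflected forms collapse to the single stem use, shrinking the vocabulary the downstream model has to handle.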

Stemming is a crude process that chops off the ends of words, which can produce misspelled words. For many use cases, each word is simply an identifier of a level in our problem space, and misspellings do not matter. If correctly spelled words are required, then lemmatization should be used instead of stemming.
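To illustrate the difference, here is a small sketch contrasting the Porter stemmer with NLTK's WordNet-based lemmatizer (this assumes NLTK and its wordnet corpus are available; the word studies is chosen only as an example):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
# The lemmatizer needs the WordNet corpus: nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

word = "studies"
print(stemmer.stem(word))         # 'studi'  -- truncated, not a real word
print(lemmatizer.lemmatize(word)) # 'study'  -- a valid dictionary form
```

The stemmer blindly strips the suffix and leaves the non-word studi, whereas the lemmatizer maps the token to its dictionary form study, at the cost of needing a vocabulary lookup and, ideally, part-of-speech information.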

Algorithms lack common sense. For the human brain, treating similar words the same is straightforward. For an algorithm, we have to guide it and provide the grouping criteria.

Fundamentally, there are three different ways of implementing NLP. These three techniques, which differ in their level of sophistication, are as follows:

  • Bag-of-words-based (BoW-based) NLP
  • Traditional NLP classifiers
  • Using deep learning for NLP