Stemming, lemmatization, and stopwords

Stemming and lemmatization are two distinct but closely related techniques that attempt to reduce each word to its base form, which simplifies the language model. For instance, if we were to stem the various forms of the word cat, we'd make the following transformation:

cat, cats, cat's, cats' -> cat

The difference between lemmatization and stemming lies in how we make this transformation. Stemming is done algorithmically, typically by applying rules that strip suffixes. When applied to multiple forms of the same word, the extracted root should be the same most of the time. Lemmatization, by contrast, uses a vocabulary of known base forms and takes into account how the word was used.
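To make the contrast concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: toy_stem, its suffix list, toy_lemmatize, and the TOY_LEMMAS table are hypothetical and far cruder than a real stemmer or lemmatizer. The point is only that the stemmer applies rules blindly, while the lemmatizer looks words up in a vocabulary and needs to know how the word was used.

```python
import re

def toy_stem(word: str) -> str:
    """Rule-based suffix stripping; no vocabulary is consulted."""
    word = word.lower()
    # Drop possessive endings first: "cat's" -> "cat", "cats'" -> "cats".
    word = re.sub(r"('s|')$", "", word)
    # Then strip the first matching suffix, keeping at least three letters.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-vocabulary keyed by (word, part of speech).
TOY_LEMMAS = {
    ("mice", "noun"): "mouse",
    ("better", "adj"): "good",
    ("saw", "verb"): "see",
    ("saw", "noun"): "saw",
}

def toy_lemmatize(word: str, pos: str) -> str:
    """Dictionary lookup; falls back to the lowercased word if unknown."""
    word = word.lower()
    return TOY_LEMMAS.get((word, pos), word)

print([toy_stem(w) for w in ["cat", "cats", "cat's", "cats'"]])
# ['cat', 'cat', 'cat', 'cat']
print(toy_stem("mice"), toy_lemmatize("mice", "noun"))
# mice mouse
print(toy_lemmatize("saw", "verb"), toy_lemmatize("saw", "noun"))
# see saw
```

The rules handle the regular forms of cat, but an irregular plural such as mice only resolves through the vocabulary lookup, and saw resolves differently depending on its part of speech.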

Stemming is typically much faster than lemmatization. The Porter stemmer works very well in many cases, so you might consider it a safe first choice for stemming.

Stop words are words that are very common in a language but carry very little semantic meaning. The canonical example is the word the. I just used it three times in my last sentence, but it really only held meaning once. Often we remove stop words to make the input a little sparser.
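Removing stop words usually amounts to filtering tokens against a fixed list. The sketch below assumes a deliberately tiny, hypothetical list (TOY_STOP_WORDS) and a simple regular-expression tokenizer; real stop-word lists are much longer and vary by library and language.

```python
import re
from typing import List

# Hypothetical, deliberately tiny stop-word list for illustration only.
TOY_STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def remove_stop_words(text: str) -> List[str]:
    """Lowercase, split into word tokens, and drop anything in the list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in TOY_STOP_WORDS]

print(remove_stop_words("The canonical example is the word the."))
# ['canonical', 'example', 'word']
```

Note that a plain filter like this also drops the one occurrence of the that carried meaning in that sentence, which is the price of treating stop words as a fixed list.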

Most BoW models benefit from stemming, lemmatization, and removing stop words. Sometimes word-embedding models, which we will talk about soon, also benefit from stemming or lemmatization. Word-embedding models will rarely benefit from the removal of stop words.
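One way to see why BoW models tend to benefit is to watch the vocabulary shrink. The following is a rough sketch under stated assumptions: strip_plural is a crude, hypothetical stand-in for a stemmer, TOY_STOP_WORDS is a made-up list, and the two-sentence corpus is invented for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list for this toy corpus.
TOY_STOP_WORDS = {"a", "is", "on", "the"}

def strip_plural(token: str) -> str:
    """Crude stand-in for a stemmer: drop one trailing 's' from longer words."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def bag_of_words(text: str, normalize: bool = False) -> Counter:
    """Count tokens, optionally after stop-word removal and toy stemming."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if normalize:
        tokens = [strip_plural(t) for t in tokens if t not in TOY_STOP_WORDS]
    return Counter(tokens)

corpus = "The cat sat on the mat. The cats sat on the mats."
raw = bag_of_words(corpus)
normalized = bag_of_words(corpus, normalize=True)

print(len(raw), len(normalized))
# 7 3
print(sorted(normalized.items()))
# [('cat', 2), ('mat', 2), ('sat', 2)]
```

With normalization, cat and cats share a single count, so the model sees fewer, better-populated features.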
