Stemming and lemmatization

In textual data, most words are likely to appear in slightly different forms. Reducing each word to its origin or stem within a family of words is called stemming. It is used to group words with similar meanings, which reduces the total number of distinct words that need to be analyzed. Essentially, stemming reduces the overall dimensionality of the problem.

For example, {use, used, using, uses} => use.

The most common algorithm for stemming English is the Porter algorithm.
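The following is a minimal sketch of Porter stemming, assuming the NLTK library is installed and used here purely for illustration:

```python
from nltk.stem import PorterStemmer

# Apply the Porter algorithm to the example word family
stemmer = PorterStemmer()
words = ["use", "used", "using", "uses"]

print([stemmer.stem(w) for w in words])
# ['use', 'use', 'use', 'use']
```

All four inflected forms collapse to the single stem use, shrinking the vocabulary the downstream model has to handle.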

Stemming is a crude process that chops off the ends of words, which can produce misspelled words. For many use cases, each word is simply an identifier of a level in our problem space, and misspellings do not matter. If correctly spelled words are required, then lemmatization should be used instead of stemming.
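To illustrate the difference, here is a small sketch contrasting the Porter stemmer with NLTK's WordNet-based lemmatizer (this assumes NLTK and its wordnet corpus are available; the word studies is chosen only as an example):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
# The lemmatizer needs the WordNet corpus: nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

word = "studies"
print(stemmer.stem(word))         # 'studi'  -- truncated, not a real word
print(lemmatizer.lemmatize(word)) # 'study'  -- a valid dictionary form
```

The stemmer blindly strips the suffix and leaves the non-word studi, whereas the lemmatizer maps the token to its dictionary form study, at the cost of needing a vocabulary lookup and, ideally, part-of-speech information.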

Algorithms lack common sense. For the human brain, treating similar words the same is straightforward. For an algorithm, we have to guide it and provide the grouping criteria.

Fundamentally, there are three different ways of implementing NLP. These three techniques, which differ in their level of sophistication, are as follows:

  • Bag-of-words-based (BoW-based) NLP
  • Traditional NLP classifiers
  • Using deep learning for NLP