Text preprocessing

Textual data requires careful preprocessing before any feature extraction or feature engineering can be performed. Preprocessing involves several steps; the following are some of the most widely used:

  • Tokenization
  • Lowercasing
  • Removal of special characters
  • Contraction expansion
  • Stopword removal
  • Spelling correction
  • Stemming and lemmatization
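
As a rough illustration of how several of these steps chain together, here is a minimal sketch that uses only the Python standard library. The CONTRACTIONS and STOPWORDS tables and the function names are hypothetical and deliberately tiny; they are assumptions made for this example, not taken from the text. Spelling correction and stemming/lemmatization are left out because they normally rely on a dedicated NLP library.

```python
import re

# Hypothetical, deliberately tiny lookup tables (illustration only).
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "it's": "it is",
}
STOPWORDS = {"a", "an", "the", "is", "it", "to", "and", "of", "in"}

# Match any known contraction as a whole word.
_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b"
)


def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    return _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0)], text)


def preprocess(text):
    """Lowercase, expand contractions, strip special characters,
    tokenize on whitespace, and drop stopwords."""
    text = text.lower()                       # lowercasing
    text = expand_contractions(text)          # contraction expansion
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # removal of special characters
    tokens = text.split()                     # whitespace tokenization
    return [tok for tok in tokens if tok not in STOPWORDS]  # stopword removal


print(preprocess("It's a test: we can't stop!"))
# ['test', 'we', 'cannot', 'stop']
```

The order of the steps matters in this sketch: contractions are expanded before special characters are removed, because stripping the apostrophe first would leave fragments such as "can t" that no longer match the lookup table. Only the straight ASCII apostrophe is handled here.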

We cover most of these techniques in detail in the chapters on use cases. For further background, readers may refer to Chapter 4 and Chapter 7 of Practical Machine Learning with Python, Sarkar et al., Springer, 2017.
