To perform feature extraction, we still need to provide the individual words, or tokens, that compose the original text. However, we do not need to consider all the words or characters. We can, for example, skip punctuation or unimportant words such as prepositions or articles, which mostly do not carry any useful information.
Furthermore, a common practice is to normalize tokens to a common representation. This can include unifying characters (for example, using only lowercase characters, removing diacritics, or using a common character encoding such as UTF-8) or reducing words to a common base form (so-called stemming; for example, "cry"/"cries"/"cried" is represented by "cry").
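As a minimal sketch of character-level normalization (lowercasing and stripping diacritics; this is a simple stand-in, not a full stemmer), the Python standard library's `unicodedata` module can be used:

```python
import unicodedata

def normalize_token(token):
    # Lowercase, then decompose accented characters (NFKD) and drop
    # the combining marks, leaving only plain ASCII characters.
    token = token.lower()
    decomposed = unicodedata.normalize("NFKD", token)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(normalize_token("Café"))  # -> "cafe"
```

For actual stemming (mapping "cries" to "cry"), a dedicated stemmer such as the Porter algorithm would be used instead; the function above handles only the character-unification part.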
In our example, we will perform this process using the following steps:
- Lowercase all words ("Because" and "because" are the same word).
- Remove punctuation symbols with a regular expression function.
- Remove stopwords. These are essentially prepositions and conjunctions, such as in, at, the, and and, that add no contextual meaning to the review that we want to classify.
- Find "rare tokens" that occur fewer than three times in our corpus of reviews.
- Finally, remove all "rare tokens."
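The steps above can be sketched in plain Python. The stopword set here is an illustrative subset rather than a full stopword list, and the function names are our own for this sketch:

```python
import re
from collections import Counter

# Illustrative subset of English stopwords, not an exhaustive list.
STOPWORDS = {"in", "at", "the", "and", "a", "an", "of", "to", "is", "was"}

def tokenize(review):
    # Lowercase the text, replace punctuation with spaces using a
    # regular expression, then split and drop stopwords.
    review = review.lower()
    review = re.sub(r"[^\w\s]", " ", review)
    return [tok for tok in review.split() if tok not in STOPWORDS]

def remove_rare_tokens(tokenized_reviews, min_count=3):
    # Count occurrences of each token across the whole corpus, then
    # filter out tokens that occur fewer than min_count times.
    counts = Counter(tok for review in tokenized_reviews for tok in review)
    return [[tok for tok in review if counts[tok] >= min_count]
            for review in tokenized_reviews]
```

A short usage example: tokenizing each review first, then pruning the rare tokens in a second pass, since rarity can only be judged against the full corpus.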