Normalization is performed on the input text data to improve its quality before it is used to train a machine learning model. It usually involves the following processing steps:
- Converting all text to uppercase or lowercase
- Removing punctuation
- Removing numbers
Note that although the preceding steps are typically needed, the actual steps depend on the problem that we want to solve and will vary from use case to use case. For example, if the numbers in the text carry some value in the context of the problem that we are trying to solve, then we may not need to remove them during the normalization phase.
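The normalization steps listed above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the function name `normalize_text` and its keyword flags are hypothetical, chosen here to show how each step can be switched on or off depending on the use case. It assumes English text, since `string.punctuation` and `string.digits` cover ASCII characters only.

```python
import string


def normalize_text(text, lowercase=True, remove_punctuation=True, remove_numbers=True):
    """Apply the basic normalization steps to a single string.

    Each step is optional because the right set of steps depends on the problem.
    """
    if lowercase:
        # Convert all text to lowercase so that "Word" and "word" match.
        text = text.lower()
    if remove_punctuation:
        # Delete every ASCII punctuation character.
        text = text.translate(str.maketrans("", "", string.punctuation))
    if remove_numbers:
        # Delete every ASCII digit.
        text = text.translate(str.maketrans("", "", string.digits))
    # Collapse the extra whitespace left behind by the removals.
    return " ".join(text.split())


sample = "Hello, World! It costs 42 dollars."
print(normalize_text(sample))
print(normalize_text(sample, remove_numbers=False))
```

With the default flags, the sample sentence loses its capitalization, punctuation, and the number 42. Passing `remove_numbers=False` keeps the number, which is the kind of adjustment we would make when the numbers carry meaning for the problem.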