Pre-processing and text normalization

Preprocessing is one of the most important parts of the analysis process. It reformats the unstructured data into a uniform, standardized form. The characters, words, and sentences identified at this stage are the fundamental units passed on to all further processing stages. The quality of the preprocessing therefore has a big impact on the final result of the whole process.

The process has several stages: from simple text cleaning, such as removing whitespace, punctuation, HTML tags, and special characters, up to more sophisticated normalization techniques such as tokenization, stemming, or lemmatization. In general, the main aim is to keep all the characters and words that are important for the analysis, get rid of all the others, and maintain the text corpus in one uniform format.

First, we import all the necessary libraries:

import re, itertools
import nltk
from nltk.corpus import stopwords
# nltk.download('stopwords'); nltk.download('punkt')  # one-time download of the required NLTK data

When dealing with raw text, we usually have a set of words that includes many details we are not interested in, such as whitespace, line breaks, and blank lines. Moreover, many words contain capital letters, so string comparisons treat, for example, "go" and "Go" as two different words. To handle such distinctions, we clean the text and, later on, convert all words to lowercase. We start with the following steps:

  1. Perform basic text cleaning.
  2. Strip leading and trailing whitespace:
verbatim = verbatim.strip() 

Many text processing tasks can be done via pattern matching. We can find words containing a given character and replace it with another one, or simply remove it. Regular expressions give us a powerful and flexible method for describing the character patterns we are interested in. They are commonly used for cleaning punctuation, HTML tags, and URL paths.

  3. Remove punctuation:
verbatim = re.sub(r'[^\w\s]', '', verbatim) 
  4. Remove HTML tags:
verbatim = re.sub('<[^<]+?>', '', verbatim) 
  5. Remove URLs:
verbatim = re.sub(r'^https?://.*[\r\n]*', '', verbatim, flags=re.MULTILINE) 

Depending on the quality of the text corpus, sometimes there is a need to implement some corrections. This applies to text sources such as Twitter or forums, where emotions play a role and comments often contain words with repeated letters, for example, "happpppy" instead of "happy".

  6. Standardize words (collapse repeated letters):
verbatim = ''.join(''.join(s)[:2] for _, s in itertools.groupby(verbatim))  # e.g. 'happpppy' -> 'happy'

After the removal of punctuation or whitespace, words can end up attached to each other. This happens especially when the periods at the ends of sentences are deleted. The corpus might then look like this: "the brown dog is lostEverybody is looking for him", so there is a need to split "lostEverybody" into two separate words.

  7. Split attached words:
verbatim = re.sub(r'([a-z])([A-Z])', r'\1 \2', verbatim)  # e.g. 'lostEverybody' -> 'lost Everybody'

Stop words are the most commonly used words in a language: mainly determiners, prepositions, and coordinating conjunctions. By removing these very common words, we can focus on the words that actually carry meaning and improve the accuracy of further text processing.

  8. Convert the text to lowercase with lower():
verbatim = verbatim.lower() 
  9. Remove stop words:
stop_words = set(stopwords.words('english'))
verbatim = ' '.join(word for word in verbatim.split() if word not in stop_words)
  10. Stemming and lemmatization: The main aim of stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. Stemming reduces word forms to so-called stems, whereas lemmatization reduces word forms to linguistically valid lemmas (a minimal sketch using NLTK follows after this list).
    • Some examples of such normalization are cars -> car (stemming or lemmatization), and men -> man or went -> go (lemmatization, as such irregular forms require a dictionary lookup)
    • Such text processing can give added value in some domains and may improve the accuracy of practical information extraction tasks
  11. Tokenization: Tokenization is the process of breaking a text corpus up into words (most commonly), phrases, or other meaningful elements, which are then called tokens. The tokens become the basic units for further text processing.
tokens = nltk.word_tokenize(verbatim) 
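
As a minimal sketch of step 10, NLTK's PorterStemmer and WordNetLemmatizer can be applied to the tokens produced in step 11. This particular choice of stemmer and lemmatizer is an assumption rather than part of the recipe above, and the lemmatizer additionally requires a one-time nltk.download('wordnet'):

from nltk.stem import PorterStemmer, WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stems = [stemmer.stem(token) for token in tokens]  # e.g. 'cars' -> 'car'
lemmas = [lemmatizer.lemmatize(token, pos='v') for token in tokens]  # pos='v' treats tokens as verbs, e.g. 'went' -> 'go'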

Other useful techniques include spelling correction, the use of domain knowledge, and grammar checking.
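
For spelling correction in particular, one possible sketch (an assumption here, since the TextBlob library is not used elsewhere in this recipe) is TextBlob's correct() method, which applies a simple dictionary-based correction and may not fix every misspelling:

from textblob import TextBlob
verbatim = str(TextBlob(verbatim).correct())  # e.g. 'speling' is typically corrected to 'spelling'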
