Preprocessing data using tokenization

Tokenization is the process of dividing text into a set of meaningful pieces. These pieces are called tokens. For example, we can divide a chunk of text into words, or we can divide it into sentences. Depending on the task at hand, we can define our own conditions to divide the input text into meaningful tokens. Let's take a look at how to do this.
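Note that NLTK's sentence and word tokenizers rely on the pre-trained Punkt models. If you have never downloaded them, a one-time setup step along these lines is needed first (a minimal sketch, assuming a standard NLTK installation):

    import nltk
    nltk.download('punkt')    # fetches the pre-trained Punkt tokenizer models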

How to do it…

  1. Create a new Python file and add the following lines. Let's define some sample text for analysis:
    text = "Are you curious about tokenization? Let's see how it works! We need to analyze a couple of sentences with punctuations to see it in action."
  2. Let's start with sentence tokenization. NLTK provides a sentence tokenizer, so let's import this:
    # Sentence tokenization
    from nltk.tokenize import sent_tokenize
  3. Run the sentence tokenizer on the input text and extract the tokens:
    sent_tokenize_list = sent_tokenize(text)
  4. Print the list of sentences to see whether it works correctly:
    print "
    Sentence tokenizer:"
    print sent_tokenize_list
  5. Word tokenization is very commonly used in NLP. NLTK comes with a couple of different word tokenizers. Let's start with the basic word tokenizer:
    # Create a new word tokenizer
    from nltk.tokenize import word_tokenize
    
    print "
    Word tokenizer:"
    print word_tokenize(text)
  6. There is another word tokenizer available, called PunktWordTokenizer. It splits the text on punctuation, but keeps the punctuation within the word tokens instead of turning it into separate tokens:
    # Create a new punkt word tokenizer
    from nltk.tokenize import PunktWordTokenizer
    
    punkt_word_tokenizer = PunktWordTokenizer()
    print "
    Punkt word tokenizer:"
    print punkt_word_tokenizer.tokenize(text)
  7. If we want the punctuation to be split into separate tokens, we need to use WordPunctTokenizer (a short sketch after this list contrasts the three word tokenizers):
    # Create a new WordPunct tokenizer
    from nltk.tokenize import WordPunctTokenizer
    
    word_punct_tokenizer = WordPunctTokenizer()
    print "
    Word punct tokenizer:"
    print word_punct_tokenizer.tokenize(text)
  8. The full code is in the tokenizer.py file. If you run this code, you will see the sentence list and the three word-token lists printed on your Terminal.
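
To see how the three word tokenizers differ, here is a short, self-contained comparison sketch. It is not part of the original tokenizer.py, it assumes an NLTK version that still ships PunktWordTokenizer (which is no longer available in newer releases), and the exact token lists can vary slightly between versions:

    # Contrast the three word tokenizers on a single phrase
    from nltk.tokenize import word_tokenize, WordPunctTokenizer, PunktWordTokenizer
    
    phrase = "Let's see how it works!"
    
    # word_tokenize splits the contraction into "Let" and "'s"
    print(word_tokenize(phrase))
    
    # PunktWordTokenizer splits on punctuation but keeps it attached to the words
    print(PunktWordTokenizer().tokenize(phrase))
    
    # WordPunctTokenizer separates every punctuation mark: "Let", "'", "s"
    print(WordPunctTokenizer().tokenize(phrase))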