Tokenization is the process of dividing text into a sequence of meaningful pieces called tokens. For example, we can divide a chunk of text into words, or we can divide it into sentences. Depending on the task at hand, we can define our own conditions to split the input text into meaningful tokens. Let's take a look at how to do this.
text = "Are you curious about tokenization? Let's see how it works! We need to analyze a couple of sentences with punctuations to see it in action."
Let's start with sentence tokenization:

# Sentence tokenization
from nltk.tokenize import sent_tokenize
sent_tokenize_list = sent_tokenize(text)
print " Sentence tokenizer:" print sent_tokenize_list
Next, let's tokenize the same text into words:

# Create a new word tokenizer
from nltk.tokenize import word_tokenize

print "\nWord tokenizer:"
print word_tokenize(text)
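The two tokenizers can also be combined if you need words grouped by sentence. Here is a minimal, self-contained sketch (the sample string is just for illustration):

# Sketch: sentence-level tokenization followed by word-level tokenization
from nltk.tokenize import sent_tokenize, word_tokenize

sample = "Are you curious about tokenization? Let's see how it works!"
for sentence in sent_tokenize(sample):
    # Each sentence is tokenized into words, with punctuation kept as separate tokens
    print(word_tokenize(sentence))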
There is another word tokenizer that is based on PunktWord. This splits the text on punctuation, but keeps it within the words:

# Create a new punkt word tokenizer
from nltk.tokenize import PunktWordTokenizer

punkt_word_tokenizer = PunktWordTokenizer()
print "\nPunkt word tokenizer:"
print punkt_word_tokenizer.tokenize(text)
If you want to split the punctuation into separate tokens, use the WordPunct tokenizer:

# Create a new WordPunct tokenizer
from nltk.tokenize import WordPunctTokenizer

word_punct_tokenizer = WordPunctTokenizer()
print "\nWord punct tokenizer:"
print word_punct_tokenizer.tokenize(text)
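As mentioned at the beginning, we can also define our own conditions for splitting the text. The following is a minimal sketch using NLTK's RegexpTokenizer; the pattern shown here (keep runs of word characters, drop punctuation entirely) is just one illustrative choice:

# Sketch: a custom tokenizer driven by a regular expression
from nltk.tokenize import RegexpTokenizer

# r'\w+' keeps runs of word characters and discards punctuation
custom_tokenizer = RegexpTokenizer(r'\w+')
print(custom_tokenizer.tokenize("Are you curious about tokenization? Let's see how it works!"))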
You can find all of this code in the tokenizer.py file. If you run this code, you will see each tokenizer's output printed on your Terminal.