Preprocessing data using tokenization

Tokenization is the process of dividing text into a set of meaningful pieces. These pieces are called tokens. For example, we can divide a chunk of text into words, or we can divide it into sentences. Depending on the task at hand, we can define our own conditions to divide the input text into meaningful tokens. Let's take a look at how to do this.
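Note that NLTK's sentence and word tokenizers rely on the pre-trained Punkt models. If you have never downloaded them, a one-time setup step along these lines is needed first (a minimal sketch, assuming a standard NLTK installation):

    import nltk
    nltk.download('punkt')    # fetches the pre-trained Punkt tokenizer models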

How to do it…

  1. Create a new Python file and add the following lines. Let's define some sample text for analysis:
    text = "Are you curious about tokenization? Let's see how it works! We need to analyze a couple of sentences with punctuations to see it in action."
  2. Let's start with sentence tokenization. NLTK provides a sentence tokenizer, so let's import this:
    # Sentence tokenization
    from nltk.tokenize import sent_tokenize
  3. Run the sentence tokenizer on the input text and extract the tokens:
    sent_tokenize_list = sent_tokenize(text)
  4. Print the list of sentences to see whether it works correctly:
    print "
    Sentence tokenizer:"
    print sent_tokenize_list
  5. Word tokenization is very commonly used in NLP. NLTK comes with a couple of different word tokenizers. Let's start with the basic word tokenizer:
    # Create a new word tokenizer
    from nltk.tokenize import word_tokenize
    
    print "
    Word tokenizer:"
    print word_tokenize(text)
  6. There is another word tokenizer available, called PunktWordTokenizer. It splits the text on punctuation, but keeps the punctuation within the word tokens instead of turning it into separate tokens:
    # Create a new punkt word tokenizer
    from nltk.tokenize import PunktWordTokenizer
    
    punkt_word_tokenizer = PunktWordTokenizer()
    print "
    Punkt word tokenizer:"
    print punkt_word_tokenizer.tokenize(text)
  7. If we want the punctuation to be split into separate tokens, we need to use WordPunctTokenizer (a short sketch after this list contrasts the three word tokenizers):
    # Create a new WordPunct tokenizer
    from nltk.tokenize import WordPunctTokenizer
    
    word_punct_tokenizer = WordPunctTokenizer()
    print "
    Word punct tokenizer:"
    print word_punct_tokenizer.tokenize(text)
  8. The full code is in the tokenizer.py file. If you run this code, you will see the sentence list and the three word-token lists printed on your Terminal.
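
To see how the three word tokenizers differ, here is a short, self-contained comparison sketch. It is not part of the original tokenizer.py, it assumes an NLTK version that still ships PunktWordTokenizer (which is no longer available in newer releases), and the exact token lists can vary slightly between versions:

    # Contrast the three word tokenizers on a single phrase
    from nltk.tokenize import word_tokenize, WordPunctTokenizer, PunktWordTokenizer
    
    phrase = "Let's see how it works!"
    
    # word_tokenize splits the contraction into "Let" and "'s"
    print(word_tokenize(phrase))
    
    # PunktWordTokenizer splits on punctuation but keeps it attached to the words
    print(PunktWordTokenizer().tokenize(phrase))
    
    # WordPunctTokenizer separates every punctuation mark: "Let", "'", "s"
    print(WordPunctTokenizer().tokenize(phrase))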