Dividing text using chunking

Chunking refers to dividing the input text into pieces based on arbitrary conditions. It differs from tokenization in that there are no constraints, and the chunks need not be meaningful on their own. Chunking is used very frequently in text analysis: when you deal with very large text documents, you need to divide them into chunks for further processing. In this recipe, we will divide the input text into a number of pieces, where each piece contains a fixed number of words.

How to do it…

  1. Create a new Python file, and import the following package:
    from nltk.corpus import brown
  2. Let's define a function to split text into chunks. The first step is to divide the text based on spaces:
    # Split a text into chunks 
    def splitter(data, num_words):
        words = data.split(' ')
        output = []
  3. Initialize a couple of required variables:
        cur_count = 0
        cur_words = []
  4. Let's iterate through the words:
        for word in words:
            cur_words.append(word)
            cur_count += 1
  5. Once you hit the required number of words, reset the variables:
            if cur_count == num_words:
                output.append(' '.join(cur_words))
                cur_words = []
                cur_count = 0
  6. After the loop, append any remaining words as the final chunk, and return the output:
        output.append(' '.join(cur_words))
    
        return output
  7. We can now define the main function. Load the data from the Brown corpus; we will use the first 10,000 words:
    if __name__ == '__main__':
        # Read the data from the Brown corpus
        data = ' '.join(brown.words()[:10000])
  8. Define the number of words in each chunk:
        # Number of words in each chunk 
        num_words = 1700
  9. Call the splitter function on this text data and print the output:
        text_chunks = splitter(data, num_words)
    
        print("Number of text chunks =", len(text_chunks))
  10. The full code is in the chunking.py file. If you run this code, you will see the number of chunks printed on the terminal. It should be 6, because the first 10,000 words split into five full chunks of 1,700 words plus one final chunk of 1,500 words.
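The steps above can be collapsed into a single, self-contained script. Here is a minimal sketch that exercises the same splitter logic on a small synthetic string instead of the Brown corpus, so the chunk arithmetic is easy to verify by hand (the variable names mirror the recipe; the synthetic input is our own):

```python
# Split a text into chunks of num_words words each
def splitter(data, num_words):
    words = data.split(' ')
    output = []
    cur_count = 0
    cur_words = []
    for word in words:
        cur_words.append(word)
        cur_count += 1
        # Once we hit the required number of words, flush the chunk
        if cur_count == num_words:
            output.append(' '.join(cur_words))
            cur_words = []
            cur_count = 0
    # Append any leftover words as the final chunk
    output.append(' '.join(cur_words))
    return output

if __name__ == '__main__':
    # 10 synthetic words in chunks of 4 -> chunks of 4, 4, and 2 words
    text = ' '.join('w{}'.format(i) for i in range(10))
    chunks = splitter(text, 4)
    print(len(chunks))                          # 3
    print([len(c.split(' ')) for c in chunks])  # [4, 4, 2]
```

One detail worth noting: when the total word count divides evenly by `num_words`, the final `output.append` adds an empty string as a trailing chunk; guarding it with `if cur_words:` avoids that edge case. With the Brown corpus data, 10,000 is not a multiple of 1,700, so the recipe's output of 6 chunks is unaffected.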