Dividing text using chunking

Chunking refers to dividing the input text into pieces based on arbitrary conditions. It differs from tokenization in that there are no constraints, and the chunks need not be meaningful on their own. Chunking is used very frequently in text analysis: when you deal with very large text documents, you need to divide them into chunks for further processing. In this recipe, we will divide the input text into a number of pieces, where each piece contains a fixed number of words.

How to do it…

  1. Create a new Python file, and import the following package:
    from nltk.corpus import brown
  2. Let's define a function to split text into chunks. The first step is to divide the text based on spaces:
    # Split a text into chunks 
    def splitter(data, num_words):
        words = data.split(' ')
        output = []
  3. Initialize a couple of required variables:
        cur_count = 0
        cur_words = []
  4. Let's iterate through the words:
        for word in words:
            cur_words.append(word)
            cur_count += 1
  5. Once you hit the required number of words, reset the variables:
            if cur_count == num_words:
                output.append(' '.join(cur_words))
                cur_words = []
                cur_count = 0
  6. After the loop, append any remaining words as the final chunk, and return the output:
        output.append(' '.join(cur_words))
    
        return output
  7. We can now define the main function. Load the data from the Brown corpus; we will use the first 10,000 words:
    if __name__ == '__main__':
        # Read the data from the Brown corpus
        data = ' '.join(brown.words()[:10000])
  8. Define the number of words in each chunk:
        # Number of words in each chunk 
        num_words = 1700
  9. Call the splitter function on this text data and print the output:
        text_chunks = splitter(data, num_words)
    
        print("Number of text chunks =", len(text_chunks))
  10. The full code is in the chunking.py file. If you run this code, you will see the number of chunks printed on the terminal. It should be 6, because the first 10,000 words split into five full chunks of 1,700 words plus one final chunk of 1,500 words.
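The steps above can be collapsed into a single, self-contained script. Here is a minimal sketch that exercises the same splitter logic on a small synthetic string instead of the Brown corpus, so the chunk arithmetic is easy to verify by hand (the variable names mirror the recipe; the synthetic input is our own):

```python
# Split a text into chunks of num_words words each
def splitter(data, num_words):
    words = data.split(' ')
    output = []
    cur_count = 0
    cur_words = []
    for word in words:
        cur_words.append(word)
        cur_count += 1
        # Once we hit the required number of words, flush the chunk
        if cur_count == num_words:
            output.append(' '.join(cur_words))
            cur_words = []
            cur_count = 0
    # Append any leftover words as the final chunk
    output.append(' '.join(cur_words))
    return output

if __name__ == '__main__':
    # 10 synthetic words in chunks of 4 -> chunks of 4, 4, and 2 words
    text = ' '.join('w{}'.format(i) for i in range(10))
    chunks = splitter(text, 4)
    print(len(chunks))                          # 3
    print([len(c.split(' ')) for c in chunks])  # [4, 4, 2]
```

One detail worth noting: when the total word count divides evenly by `num_words`, the final `output.append` adds an empty string as a trailing chunk; guarding it with `if cur_words:` avoids that edge case. With the Brown corpus data, 10,000 is not a multiple of 1,700, so the recipe's output of 6 chunks is unaffected.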