Time for action—counting frequent words by filtering

Suppose you have some plain text files and you want to know what they say. You don't want to read them, so you decide to count how many times each word appears in the text and look at the most frequent ones to get an idea of what the files are about.

Note

Before starting, you'll need at least one text file to play with. The text file used in this tutorial is named smcng10.txt and is available for you to download from the Packt website.

Let's work:

  1. Create a new transformation.
  2. Read your file with a Text file input step. The trick here is to set as the separator a character you don't expect to find in the file, for example |. That way, each entire line is recognized as a single field. In the Fields tab, define a single string field named line.
  3. From the Transform category of steps, drag a Split field to rows step to the canvas, and create a hop from the Text file input step to this new step.
  4. Configure the step like this:
    [Screenshot: Split field to rows configuration window]
  5. With this last step selected, do a preview. Your preview window should look like this:
    [Screenshot: preview of the Split field to rows step]
  6. Close the preview window.
  7. Expand the Flow category of steps, and drag a Filter rows step to the work area.
  8. Create a hop from the last step to the Filter rows step.
  9. Edit the Filter rows step by double-clicking it.
  10. Click the <field> textbox to the left of the = sign. The list of fields appears. Select word.
  11. Click the = sign. A list of operations appears. Select IS NOT NULL.
  12. The window looks like the following:
    [Screenshot: Filter rows window with the condition word IS NOT NULL]
  13. Click OK.
  14. From the Transform category of steps drag a Sort rows step to the canvas, and create a hop from the Filter rows step to this new step.
  15. Sort the rows by word.
  16. From the Statistics category, drag a Group by step, and create a hop from the Sort rows step to this step.
  17. Configure the grids in the Group by configuration window as shown:
    [Screenshot: Group by configuration window]
  18. Add a Calculator step, create a hop from the last step to it, and calculate a new field named len_word that holds the length of each word. To do so, use the calculator function Return the length of a string A and select word from the drop-down menu for Field A.
  19. Expand the Flow category and drag another Filter rows step to the canvas.
  20. Create a hop from the Calculator step to this step and edit it.
  21. Click <field> and select counter.
  22. Click the = sign, and select >.
  23. Click <value>. A small window appears.
  24. In the Value textbox of the little window, enter 2.
  25. Click OK.
  26. Position the mouse cursor over the icon in the upper-right corner of the window. When the text Add condition shows up, click on the icon.
    [Screenshot: Add condition icon in the upper-right corner of the Filter rows window]
  27. A new blank condition is shown below the one you created.
  28. Click null = [] and create the condition len_word > 3 in the same way you created the condition counter > 2.
  29. Click OK.
  30. The final condition looks like this:
    [Screenshot: Filter rows window with the condition counter > 2 AND len_word > 3]
  31. Add one more Filter rows step to the transformation and create a hop from the last step to this new step.
  32. On the left side of the condition, select word.
  33. As comparator select IN LIST.
  34. At the end of the condition, inside the textbox value, type the following: a;an;and;the;that;this;there;these.
  35. Click the upper-left square above the condition and the word NOT will appear.
  36. The condition looks like the following:
    [Screenshot: Filter rows window with the NOT word IN LIST condition]
  37. Add a Sort rows step, create a hop from the previous step to this step, and sort the rows in the descending order of counter.
  38. Add a Dummy step at the end of the transformation, and create a hop from the last step to the Dummy step.
  39. With the Dummy step selected, preview the transformation. The following is what you should see now:
    [Screenshot: preview of the final list of words and counters]

What just happened?

You read a regular plain text file, then counted, filtered, and sorted the words that appear in it.

The first thing you did was to read the plain file and split the lines so that every word became a new row in the dataset. Consider, for example, the following line:

subsidence; comparison with the Portillo chain.

The splitting of this line resulted in the following rows being generated:

[Figure: the rows generated from the sample line, one word per row]

Thus, a new field named word became the basis for your transformation.
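
Outside the tool, the same splitting can be sketched in a few lines of Python. This is only an analogy: the function name split_field_to_rows is hypothetical, and the single-space delimiter is an assumption, since the step's actual configuration is not reproduced in the text.

```python
def split_field_to_rows(line, delimiter=" "):
    """Emit one row (a dict with a 'word' field) per piece of the line.

    The space delimiter is an assumption made for this sketch.
    """
    return [{"word": piece} for piece in line.split(delimiter)]


rows = split_field_to_rows("subsidence; comparison with the Portillo chain.")
for row in rows:
    print(row["word"])
```

Note that under this assumption punctuation stays attached to the words, so subsidence; and chain. are treated as words in their own right.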

First of all, you discarded rows with null words by using a filter with the condition word IS NOT NULL. Then, you counted the words by using the Group by step you learned in the previous tutorial. Once you had counted the words, you discarded the rows where the word was too rare (it appeared two times or fewer), too short (fewer than 4 characters), or too common (it matched a list you typed).

Once you applied all those filters, you sorted the rows in the descending order of the number of times the word appeared in the file so that you could see the most frequent words.
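
As a rough analogy, the whole chain of steps could be sketched in Python as follows. Everything here is illustrative rather than a description of how the transformation is implemented: the function name frequent_words and its parameters are hypothetical, the space delimiter is an assumption, and empty strings stand in for null words.

```python
from collections import Counter

STOP_WORDS = {"a", "an", "and", "the", "that", "this", "there", "these"}


def frequent_words(lines, min_count=2, min_len=3, stop_words=STOP_WORDS):
    # Split field to rows: one word per row (space delimiter is an assumption).
    words = [w for line in lines for w in line.split(" ")]
    # First Filter rows: word IS NOT NULL (empty pieces stand in for nulls).
    words = [w for w in words if w]
    # Sort rows + Group by: count the occurrences of each word.
    counts = Counter(words)
    # Calculator + Filter rows: counter > 2 AND len_word > 3,
    # followed by the NOT ... IN LIST filter.
    kept = [
        (word, counter)
        for word, counter in counts.items()
        if counter > min_count and len(word) > min_len and word not in stop_words
    ]
    # Final Sort rows: descending order of counter.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


sample = [
    "rocks rocks rocks the the the",
    "strata  strata strata strata that that that",
    "sun sun sun",
]
print(frequent_words(sample))
```

In the sample, the and sun are dropped for being too short, that is dropped because it is in the list of common words, and the double space in the second line produces an empty piece that the first filter removes.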

Scrolling down the preview window a little to skip some prepositions, pronouns, and other very common words that have nothing to do with a specific subject, you found words such as shells, strata, formation, South, elevation, porphyritic, Valley, tertiary, calcareous, plain, North, rocks, and so on. If you had to guess, you would say that this was a book or article about geology, and you would be right. The text taken for this exercise was Geological Observations on South America by Charles Darwin.

Filtering rows using the Filter rows step

The Filter rows step allows you to filter rows based on conditions and comparisons.

The step checks the condition for every row and lets through only the rows for which the condition is true. The other rows are discarded.

In the counting words exercise, you used the Filter rows step several times so you already have an idea of how it works. Let's review it.

In the Filter rows settings window you have to enter a condition. The following table summarizes the different kinds of conditions you may enter:

Condition | Description | Example
--- | --- | ---
A single field followed by IS NULL or IS NOT NULL | Checks whether the value of a field in the stream is null | word IS NOT NULL
A field, a comparator, and a constant | Compares a field in the stream against a constant value | counter > 2
Two fields separated by a comparator | Compares two fields in the stream | line CONTAINS word
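
To make the three kinds of conditions concrete, here is a minimal Python sketch in which rows are dictionaries and a condition is a function that returns true or false. The helper filter_rows and the sample rows are hypothetical and only mimic the behavior described above; they are not part of the tool.

```python
def filter_rows(rows, condition):
    """Keep only the rows for which the condition is true."""
    return [row for row in rows if condition(row)]


rows = [
    {"line": "shells and strata", "word": "strata", "counter": 5},
    {"line": "the Portillo chain", "word": None, "counter": 1},
    {"line": "rocks of the plain", "word": "valley", "counter": 3},
]

# A single field followed by IS NOT NULL: word IS NOT NULL
not_null = filter_rows(rows, lambda r: r["word"] is not None)

# A field, a comparator, and a constant: counter > 2
frequent = filter_rows(rows, lambda r: r["counter"] > 2)

# Two fields separated by a comparator: line CONTAINS word
contained = filter_rows(
    rows, lambda r: r["word"] is not None and r["word"] in r["line"]
)

print(len(not_null), len(frequent), len(contained))
```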

You can combine conditions as shown here:

counter > 2
AND
len_word>3

You can also create subconditions such as:

(
counter > 2
AND
len_word>3
)
OR
(word in list geology; sun)

In this last example, the condition lets the word geology pass even if it appears only once. It also lets the word sun pass, despite its length.
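
The same nested condition can be written as a Python predicate, which may help when checking which rows you expect to pass. The function name passes and the sample rows are made up for illustration.

```python
def passes(row):
    """(counter > 2 AND len_word > 3) OR (word IN LIST geology;sun)."""
    frequent_and_long = row["counter"] > 2 and row["len_word"] > 3
    in_list = row["word"] in ("geology", "sun")
    return frequent_and_long or in_list


rows = [
    {"word": "geology", "counter": 1, "len_word": 7},      # passes: in the list
    {"word": "sun", "counter": 9, "len_word": 3},          # passes: in the list
    {"word": "strata", "counter": 4, "len_word": 6},       # passes: frequent and long
    {"word": "bed", "counter": 9, "len_word": 3},          # fails: too short
    {"word": "porphyritic", "counter": 1, "len_word": 11}, # fails: too rare
]

kept = [row["word"] for row in rows if passes(row)]
print(kept)
```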

When editing conditions, you always have a contextual menu that allows you to add and delete sub-conditions, change the order of existing conditions, and more.

You may wonder what the Send 'true' data to step: and Send 'false' data to step: textboxes are for. Be patient; you will learn how to use them in Chapter 4.

Have a go hero—playing with filters

Now it is your turn to try filtering rows. Modify the counting_words transformation in the following way:

  • Replace the Filter rows steps. Using a Formula step, create a flag (a Boolean field) that evaluates the different conditions (counter > 2, and so on). Then use a single Filter rows step that lets through only the rows for which the flag is true. Test it and verify that the results are the same as before the change.

    Tip

    In the Formula editing window, use the options under the Logic category.

    Then in the Filter rows step, you can type true or Y as the value against which you compare the flag.

  • Add a sub-condition to avoid excluding some words, just like the one in the example: (word in list geology; sun). Change the list of words and test the filter to see that the results are as expected.
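
If it helps to see the flag idea before building it, the following Python sketch computes a Boolean field first and then filters on that field alone. It is only an analogy for the exercise: add_flag and the sample rows are hypothetical, and the expression you write in the Formula step will use that step's own syntax.

```python
STOP_WORDS = {"a", "an", "and", "the", "that", "this", "there", "these"}


def add_flag(row):
    """Return a copy of the row with a Boolean 'flag' field added."""
    flag = (
        row["counter"] > 2
        and row["len_word"] > 3
        and row["word"] not in STOP_WORDS
    )
    return {**row, "flag": flag}


rows = [
    {"word": "strata", "counter": 4, "len_word": 6},
    {"word": "these", "counter": 7, "len_word": 5},
    {"word": "sun", "counter": 9, "len_word": 3},
]

flagged = [add_flag(row) for row in rows]
# A single filter on the flag replaces the separate filters.
kept = [row["word"] for row in flagged if row["flag"]]
print(kept)
```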

Have a go hero—counting words and discarding those that are commonly used

If you take a look at the results in the tutorial, you may notice that some words appear more than once in the final list because of special signs such as . , ) or ", or because of lower or upper case letters. For example, look how many times the word rock appears: rock (99 occurrences), rock, (51 occurrences), rock. (11 occurrences), rock." (1 occurrence), rock: (6 occurrences), and rock; (2 occurrences). You can fix this and make the word rock appear only once: before grouping the words, remove all extra signs and convert all words to lower case or upper case, so they are grouped as expected.

Try one or more of the following steps: Formula, Calculator, Replace in string.
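
The effect you are after can be sketched in Python as follows. This is one possible approach under stated assumptions, not the solution to the exercise: the helper normalize is hypothetical, and it only strips signs at the start and end of a word, leaving internal characters such as hyphens untouched.

```python
import string
from collections import Counter


def normalize(word):
    """Strip surrounding punctuation and convert the word to lower case."""
    return word.strip(string.punctuation).lower()


raw_words = ["rock", "rock,", "rock.", 'rock."', "rock:", "rock;", "Rock"]

# Normalizing before counting makes all the variants group together.
counts = Counter(normalize(word) for word in raw_words)
print(counts)
```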
