Executing part of a job several times until a condition is true

Suppose that you have a list of tasks that have to be repeated while or until a condition is true (or false). If you know about programming languages, think of this as an analogy of a while or repeat until loop. Kettle allows you to implement these kinds of iterations and this recipe explains how to do it.

For the recipe, you will use one of the transformations described in the introduction of this chapter: the transformation that generates random numbers and writes them to a file. You will execute the transformation repeatedly and keep track of the number of lines written to those files. You will continue executing the transformation as long as the total number of written lines is less than 25.

Getting ready

You will need the transformation that generates random numbers described in the introduction. If instead of downloading the transformation you created it yourself, then you will have to do a quick fix in order to make Kettle save the number of written lines to the log (this has already been done in the transformation available on the book's site):

  1. Edit the transformation.
  2. Press Ctrl-T to bring the transformation's setting window.
  3. Select the Logging tab and click on Transformation.
  4. In the Fields to log: grid, search for the entry named LINES_OUTPUT. Under Step name, select the name of the step that generates the file of random numbers. The result is shown in the following screenshot:
    Getting ready
  5. Save the transformation.

How to do it...

Carry out the following steps:

  1. Create a job.
  2. From the General category, drag START, Set variables, and Transformation entries. Create a hop from the START entry toward the Set variables entry, and another from this entry toward the Transformation entry.
  3. Double-click on the Set variables entry. Add a row in order to define the variable that will keep track of the number of lines written. Under Variable name, type total_lines, for Value type 0, and for Variable scope type select Valid in the current job.
  4. Configure the Transformation entry to run the transformation that generates the random numbers.
  5. From the Scripting category, add a JavaScript entry.
  6. From the Utility category, drag two Write To Log entries.
  7. Link the entries as shown in the following diagram:
    How to do it...
  8. Double-click on the JavaScript entry. In the JavaScript: area, type the following code:
    var total_lines = parseInt(parent_job.getVariable("total_lines"))
    var new_total_lines = total_lines + previous_result.getNrLinesOutput();
    parent_job.setVariable("total_lines", new_total_lines);
    new_total_lines < 25;
    
  9. Double-click on the Write To Log entry that is executed after the success of the JavaScript entry (the entry at the end of the green hop). For Log level, select Minimal logging. For Log subject, type lines written=${total_lines}. For Log message type Ready to run again.
  10. Double-click on the other Write To Log entry, the one that is executed after the failure of the JavaScript entry (the entry at the end of the red hop). For Log level, select Minimal logging. For Log subject type ${total_lines} lines have been written. For Log message, type The generation of random numbers has succeeded.
  11. Save the job.
  12. Press F9 to run the job. For Log level, select Minimal logging and click on Launch.
  13. In the Logging tab of the Execution results pane, you will see the following:
    2011/01/11 22:43:50 - Spoon - Starting job...
    2011/01/11 22:43:50 - main - Start of job execution
    2011/01/11 22:43:50 - lines written=10 - Ready to run again ...
    2011/01/11 22:43:50 - lines written=20 - Ready to run again ...
    2011/01/11 22:43:51 - 30 lines have been written. - The generation of random numbers has been successful.
    2011/01/11 22:43:51 - main - Job execution finished
    2011/01/11 22:43:51 - Spoon - Job has ended.
    
  14. In order to confirm that 30 lines have actually been written, open the generated files.

How it works...

In order to run the transformation that generates a file with random numbers until the number of written lines is greater than 25, you implemented a loop. The following flow chart shows you the logic of this process:

How it works...

To control the execution of the transformation, you created a variable named total_lines and initialized this variable with the value 0. After executing the transformation, you incremented the value of the variable using JavaScript code. If the value was less than 25, you wrote a message to the log and re-ran the transformation. If not, you wrote a final message to the log.

The JavaScript code deserves a separate explanation:

previous_result.getNrLinesOutput() is the function that returns the number of lines that were written by the previous job entry. That is the value that you have to add to the total_lines variable, in order to keep the variable updated.

The couple of functions parent_job.getVariable() and parent_job.setVariable() are meant to get and set the value of the Kettle variable named total_lines. By default, the type of the Kettle variables is String. Therefore, in order to do the math, you had to use the parseInt() JavaScript function.

The second line in the JavaScript code evaluates to True or False. If it evaluates to True, the JavaScript entry follows the green arrow. If it evaluates to False, the JavaScript entry follows the red arrow.

There's more...

In this recipe, you built a loop and controlled the execution with the help of a JavaScript entry. The following subsections give you more information about these topics.

Implementing loops in a job

Suppose that you need to build a job in which one or more entries have to be executed repeatedly until a condition is met. People refer to these repetitive tasks as a loop. The first thing you have to have in mind is a clear understanding of the logic of this loop; that is, the condition that will cause the exit from the loop: You may want to exit the loop when:

  • There are no more files for processing
  • The number of errors exceeds a predefined threshold
  • The job is taking more time than expected - maybe due to an unavailable service
  • The number of records inserted into a table exceeded the expected value and so on

Once you understand this, you have to implement the logic. In the recipe, you implemented the logic with a JavaScript job entry. There are other entries that you can use for deciding whether to exit a loop or not. You will find useful entries for this purpose in the Conditions category: Simple evaluation, Evaluate file metrics, Check webservice availability, and so on. You can even implement the logic with an extra transformation that will succeed or fail according to your rules.

Note

Make sure that the number of iterations is small. If you build an endless loop or a loop with many iterations, then you risk running out of heap space.

Tip

Here are a few tips to avoid errors:

If your loop is causing you troubles—for instance, you run out of memory—try to rethink the solution. The following are some alternatives you can think of:

  • Solve the same problem by creating a list of elements and iterating over that list.
  • Consider limiting the number of iterations to a maximum value N.
  • In the logic that determines whether to exit the loop or not, add a condition for ensuring that the number of iterations remains below N.

Using the JavaScript step to control the execution of the entries in your job

The JavaScript entry is a useful step for controlling whether a job entry or a group of job entries should be executed or not. In particular, you used it in the recipe for deciding if the loop should end or not.

This entry works as follows: In its setting window, you should type JavaScript code that ends with an expression that evaluates to a Boolean. As with any job entry, the Javascript entry either succeeds or fails. Success or failure is decided by the result of evaluating that expression. Then, based on that value, Kettle knows which entry to execute next.

Within the code, you are free to use the previous_result element. The previous_result element is the representation of the Result object, an object that contains the result of the execution of the previous job entry. In the recipe, you used the previous_result element to ask for the number of written lines, but you can ask for the number of read lines, the number of errors, the number of executed job entries, and so on. You can find a complete description of the available previous_result options at the following URL: http://wiki.pentaho.com/display/EAI/Evaluating+conditions+in+The+JavaScript+job+entry

See also

The recipe named Executing part of a job once for every row in dataset in this chapter. This recipe will help you understand one of the alternative approaches for avoiding running out of memory.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.138.120.136