Moving part of a transformation to a subtransformation

Suppose that you have a part of a transformation that you would like to use in another transformation. A quick way to do that would be to copy the set of steps and paste them into the other transformation, and then perform some modifications, for example, changing the names of the fields accordingly.

Now you realize that you need it in a third place. You do that again: copy, paste, and modify.

What if you notice that there was a bug in that part of the transformation? Or maybe you'd like to optimize something there? You would need to do that in three different places! This inconvenience is one of the reasons why you might like to move those steps to a common place - a subtransformation.

In this recipe, you will develop a subtransformation that receives the following two dates:

  1. A date of birth
  2. A reference date

The subtransformation will calculate how old a person was (or will be) at the reference date if the date of birth provided was theirs.

For example, if the date of birth is December 30, 1979 and the reference date is December 19, 2010, the age would be calculated as 30 years.

Then, you will call that subtransformation from a main transformation.

Getting ready

You will need a file containing a list of names and dates of birth, for example:

name,birthdate
Paul,31/12/1969
Santiago,15/02/2004
Lourdes,05/08/1994
Anna,08/10/1978
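
The dates in this sample appear to use a day/month/year layout (31/12/1969 can only be read that way), so the step that reads the file will need a matching date format. As a quick sanity check outside Kettle, the following minimal Java sketch parses the sample rows. The pattern dd/MM/yyyy is an assumption, and SampleFileSketch is a hypothetical name, not something provided by Kettle.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Kettle: parses the sample rows shown above.
class SampleFileSketch {
    // Assumption: the birthdate column is formatted as day/month/year.
    static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("dd/MM/yyyy");

    // Turns the lines of the file (header included) into name -> date of birth.
    static Map<String, LocalDate> parse(String[] lines) {
        Map<String, LocalDate> people = new LinkedHashMap<>();
        for (int i = 1; i < lines.length; i++) { // start at 1 to skip the header line
            String[] parts = lines[i].split(",");
            people.put(parts[0], LocalDate.parse(parts[1], FORMAT));
        }
        return people;
    }

    public static void main(String[] args) {
        String[] lines = {
            "name,birthdate",
            "Paul,31/12/1969",
            "Santiago,15/02/2004",
            "Lourdes,05/08/1994",
            "Anna,08/10/1978"
        };
        parse(lines).forEach((name, date) -> System.out.println(name + " -> " + date));
    }
}
```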

How to do it...

This recipe is split into two parts.

First, you will create the subtransformation by carrying out the following steps:

  1. Create a transformation.
  2. From the Mapping category, add two Mapping input specification steps and one Mapping output specification step. Rename the Mapping output specification step to output.
  3. Also, add a Join Rows (Cartesian product) (Join category), a Calculator (Transform category), and a User Defined Java Expression or UDJE for short (Scripting category) step. Link the steps as shown in the following diagram:
  4. Double-click on one of the Mapping input specification steps. Add a field named birth_field. For Type, select Date. Name the step birthdates.
  5. Double-click on the other Mapping input specification step. Add a field named reference_field. For Type, select Date. Name the step reference date.
  6. Double-click on the Join Rows (Cartesian product) step. For Main step to read from, select birthdates.
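
When the reference date stream carries a single row, as it does when it is fed by a step that produces one date, a Cartesian product effectively appends that date to every row of birthdates. The following Java sketch illustrates the idea with plain lists. The class and method names are hypothetical, and the code only mimics a join with no condition; it is not how Kettle implements the step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of a Cartesian product between two streams.
class CartesianJoinSketch {
    // Pairs every row of the main stream with every row of the other stream.
    static List<String[]> join(List<String> birthdates, List<String> referenceDates) {
        List<String[]> joined = new ArrayList<>();
        for (String birth : birthdates) {
            for (String reference : referenceDates) {
                joined.add(new String[] {birth, reference});
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> rows = join(
                Arrays.asList("31/12/1969", "15/02/2004"),
                Arrays.asList("19/12/2010"));
        // Each birthdate now travels together with the single reference date.
        for (String[] row : rows) {
            System.out.println(row[0] + " | " + row[1]);
        }
    }
}
```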

The following two steps perform the main task - the calculation of the age.

Note

Note that these steps are a slightly modified version of the steps you used for calculating the age in the previous recipe.

  1. Double-click on the Calculator step and fill in the settings window, as shown in the following screenshot:
  2. Double-click on the UDJE step. Add a field named calculated_age. As Value type, select Integer. For Java expression type:
    ((b_month > t_month) ||
    (b_month - t_month == 0 && b_day > t_day))?
    (t_year - b_year - 1):(t_year - b_year)
    

    Note

    The expression is written over three lines for clarity. You should type the whole expression on a single line.

  3. Save the transformation.
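
To check the age logic before wiring it into Kettle, it can be tried as plain Java. The sketch below reads the first branch of the expression as t_year - b_year - 1 and assumes that the Calculator step supplies the year, month, and day of each date under the names used in the expression. AgeSketch and its methods are hypothetical names introduced for illustration.

```java
import java.time.LocalDate;

// Hypothetical stand-alone version of the age calculation.
class AgeSketch {
    // Same logic as the UDJE expression, with the fields passed as parameters.
    static int calculatedAge(int bYear, int bMonth, int bDay,
                             int tYear, int tMonth, int tDay) {
        return ((bMonth > tMonth) || (bMonth - tMonth == 0 && bDay > tDay))
                ? (tYear - bYear - 1)   // birthday not yet reached in the reference year
                : (tYear - bYear);      // birthday already reached
    }

    // Assumption: year, month, and day are extracted from each date beforehand.
    static int calculatedAge(LocalDate birth, LocalDate reference) {
        return calculatedAge(
                birth.getYear(), birth.getMonthValue(), birth.getDayOfMonth(),
                reference.getYear(), reference.getMonthValue(), reference.getDayOfMonth());
    }

    public static void main(String[] args) {
        // Born December 30, 1979, reference date December 19, 2010.
        System.out.println(calculatedAge(LocalDate.of(1979, 12, 30), LocalDate.of(2010, 12, 19)));
    }
}
```

Running main prints 30, which matches the example given in the introduction of this recipe.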

Now you will create the main transformation. It will read the sample file and calculate the age of the people in the file as of the present day.

  1. Create another transformation.
  2. Use a Text file input step to read the sample file. Name the step people.
  3. Use a Get System Info step to get the present day: Add a field named today. For Type, select Today 00:00:00. Name the step today.
  4. From the Mapping category, add a Mapping (sub-transformation) step. Link the steps as shown in the following diagram:
  5. Double-click on the Mapping step. The following are the most important steps in this recipe!
  6. In the first textbox, under the text Use a file for the mapping transformation, select the transformation created earlier.
  7. Click on the Add Input button. A new Input tab will be added to the window. Under this tab, you will define a correspondence between the incoming step people and the subtransformation step birthdates.
  8. In the Input source step name, type people, the name of the step that reads the file.

    Tip

    Alternatively, you can select it by clicking on the Choose... button.

  9. In the Mapping target step name, type birthdates, the name of the subtransformation step that expects the dates of birth.
  10. Click on Ask these values to be renamed back on output?
  11. Under the same tab, fill in the grid as follows: Under Fieldname from source step, type birthdate, the name of the field coming out of the people step containing the date of birth. Under Fieldname to mapping input step, type birth_field, the name of the field in the subtransformation step birthdates that will contain the date of birth needed for calculating the age.

    Tip

    Alternatively, you can add the whole line by clicking on Mapping... and selecting the matching fields in the window that is displayed.

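  12. Add another Input tab. Under this tab, you will define a correspondence between the incoming step today and the subtransformation step reference date. Fill in the tab as follows: For Input source step name, type today. For Mapping target step name, type reference date. In the grid, under Fieldname from source step, type today, and under Fieldname to mapping input step, type reference_field. Leave Ask these values to be renamed back on output? unchecked.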
  13. Finally, click on Add Output to add an Output tab. Under this tab, click on Is this the main data path?
  14. Under the same tab, fill in the grid as follows: Under Fieldname from mapping step, type calculated_age. Under Fieldname to target step, type age.
  15. Close the mapping settings window and save the transformation.
  16. Do a preview on the last step; you should see the following screen:

How it works...

The subtransformation (the first transformation you created) has the purpose of calculating the age of a person at a given reference date. In order to do that, it defines two entry points through the use of the Mapping input specification steps. These steps are meant to specify the fields needed by the subtransformation. In this case, you defined the date of birth in one entry point and the reference date in the other. Then it calculates the age in the same way you would do it with any regular transformation. Finally, it defines an output point through the Mapping output specification step.

Note that we developed the subtransformation blindly, without testing or previewing. This was because you cannot preview a subtransformation. The Mapping input specification steps are just a definition of the data that will be provided; they have no data to preview.

Tip

While you are designing a subtransformation, you can provisionally substitute each Mapping input specification step with a step that provides some fictional data, for example, a Text file input, a Generate rows, a Get System Info, or a Data Grid step.

This fictional data for each of these steps has to have the same metadata as the corresponding Mapping input specification step. This will allow you to preview and test your subtransformation before calling it from another transformation.

Now, let's explain the main transformation, the one that calls the subtransformation. You added as many input tabs as entry points to the subtransformation. The input tabs are meant to map the steps and fields in your transformation to the corresponding steps and fields in the subtransformation. For example, the field that you called today in your main transformation became reference_field in the subtransformation.

On the other side, in the subtransformation, you defined just one output point. Therefore, under the Output tab, you clicked on Is this the main data path? Checking it means that you don't need to specify the correspondence between steps. What you did under this tab was fill in the grid to ask that the field calculated_age be renamed to age.

In the final preview, you can see all the fields you had before the subtransformation, plus the fields added by it. Among these fields, there is the age field which was the main field you expected to be added.

As you can see in the final dataset, the field birthdate kept its name, while the field today was renamed to reference_field. The field birthdate kept its name because you checked the Ask these values to be renamed back on output? option under the people input tab. On the other hand, the field today was renamed because you didn't check that option under the today input tab.
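
The renaming behavior described above can be pictured as a sequence of renames applied to each row. The following Java sketch models a row as a map of field names to values. It is only an illustration under simplifying assumptions: MappingRenameSketch is a hypothetical name, the age is a hard-coded placeholder, and any intermediate fields created inside the subtransformation are left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of how field names change around the Mapping step.
class MappingRenameSketch {
    // Returns a copy of the row in which the field `from` is called `to`.
    static Map<String, Object> rename(Map<String, Object> row, String from, String to) {
        Map<String, Object> renamed = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : row.entrySet()) {
            renamed.put(field.getKey().equals(from) ? to : field.getKey(), field.getValue());
        }
        return renamed;
    }

    // Simulates one row going through the Mapping step of this recipe.
    static Map<String, Object> throughMapping(Map<String, Object> input) {
        // Input tabs: rename the caller's fields to the names the subtransformation expects.
        Map<String, Object> row = rename(input, "birthdate", "birth_field");
        row = rename(row, "today", "reference_field");
        // Placeholder for the work done inside the subtransformation.
        row.put("calculated_age", 40);
        // Output tab: calculated_age becomes age.
        row = rename(row, "calculated_age", "age");
        // Only the people tab asked for its values to be renamed back on output.
        return rename(row, "birth_field", "birthdate");
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("name", "Paul");
        row.put("birthdate", "31/12/1969");
        row.put("today", "19/12/2010");
        System.out.println(throughMapping(row).keySet());
    }
}
```

Running main prints [name, birthdate, reference_field, age]: the date of birth recovers its original name, while today stays renamed.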

There's more...

Kettle subtransformations are a practical way to centralize some functionality so that it may be used in more than one place. Another use of subtransformations is to isolate a part of a transformation that meets some specific purpose as a whole, in order to keep the main transformation simple, no matter whether you will reuse that part or not.

Let's look at some examples of what you might like to implement via a subtransformation:

  • Take some free text representing an address, parse it, and return the street name, street number, city, zip code, and state.
  • Take some text, validate it according to a set of rules, clean it (for example, by removing some unwanted characters), and return the validated clean text along with a flag indicating whether the original text was valid or not.
  • Take an error code and write a customized line to the Kettle log.
  • Take the date of birth of a person and a reference date and calculate how old that person was at the reference date.

If you then wish to implement any of the following enhancements, you will need to do it in only one place:

  • Enhance the process for parsing the parts of an address
  • Change the rules for validating the text
  • Internationalize the text you write to the Kettle log
  • Change the method or algorithm for calculating the age

From the development point of view, a subtransformation is just a regular transformation with some input and output steps connecting it to the transformations that use it.

Back in Chapter 6, Understanding Data Flows, it was explained that when a transformation is launched, each step starts a new thread; that is, all steps work simultaneously. The fact that we are using a subtransformation does not change that. When you run a transformation that calls a subtransformation, both the steps in the transformation and those in the subtransformation start at the same time, and run in parallel. The subtransformation is not an isolated process; the data in the main transformation just flows through the subtransformation. Imagine this flow as if the steps in the subtransformation were part of the main transformation. In this sense, it is worth noting that a common cause of error in the development of subtransformations is the wrong use of the Select values step.

Note

Selecting some values with a Select values step by using the Select & Alter tab in a subtransformation will implicitly remove not only the rest of the fields in the subtransformation, but also all of the fields in the transformation that calls it.

Tip

If you need to rename or reorder some fields in a subtransformation, then make sure you check the Include unspecified fields, ordered by name option in order to keep not only the rest of the fields in the subtransformation but also the fields coming from the calling transformation.

If what you need is to remove some fields, do not use the Select & Alter tab; use the Remove tab instead. If needed, use another Select values step to reorder or rename the fields afterward.
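
The difference between selecting and removing can be illustrated by treating a row as a map that holds both the caller's fields and the fields of the subtransformation. The Java sketch below is a simplified model with hypothetical names (SelectValuesSketch, selectOnly, remove); it mimics the effect described above rather than the actual implementation of the Select values step.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the two ways of dropping fields.
class SelectValuesSketch {
    // Mimics selecting fields: only the listed fields survive.
    static Map<String, Object> selectOnly(Map<String, Object> row, List<String> keep) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (String field : keep) {
            if (row.containsKey(field)) {
                result.put(field, row.get(field));
            }
        }
        return result;
    }

    // Mimics removing fields: only the listed fields are dropped.
    static Map<String, Object> remove(Map<String, Object> row, List<String> drop) {
        Map<String, Object> result = new LinkedHashMap<>(row);
        result.keySet().removeAll(drop);
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("name", "Paul");              // field coming from the calling transformation
        row.put("birth_field", "31/12/1969"); // fields used inside the subtransformation
        row.put("b_year", 1969);
        row.put("calculated_age", 40);

        // Selecting calculated_age also discards name, which belongs to the caller.
        System.out.println(selectOnly(row, Arrays.asList("calculated_age")).keySet());
        // Removing b_year leaves every other field in place, including name.
        System.out.println(remove(row, Arrays.asList("b_year")).keySet());
    }
}
```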
