Joining two or more streams based on given conditions

There are occasions where you will need to join two datasets. If you are working with databases, you could use SQL statements to perform this task, but for other kinds of input (XML, text, Excel), you will need another solution.

Kettle provides the Merge Join step to join data coming from any kind of source.

Let's assume that you are building a house and want to track and manage the costs of building it. Before starting, you prepared an Excel file with the estimated costs for the different parts of your house. Now, you are given a weekly file with the progress and the real costs. So, you want to compare both to see the progress.

Getting ready

To run this recipe, you will need two Excel files, one for the budget and another with the real costs. The budget.xls has the estimated starting date, estimated end date, and cost for the planned tasks. The costs.xls has the real starting date, end date, and cost for tasks that have already started.

You can download the sample files from the book's site.

How to do it...

Carry out the following steps:

  1. Create a new transformation.
  2. Drop two Excel input steps into the canvas.
  3. Use one step for reading the budget information (budget.xls file) and the other for reading the costs information (costs.xls file).
  4. Under the Fields tab of these steps, click on the Get fields from header row button in order to populate the grid automatically. Apply the format dd/MM/yyyy to the fields of type Date and $0.00 to the fields with costs.
  5. Add a Merge Join step from the Join category, and create a hop from each Excel input step toward this step. The following diagram depicts what you have so far:
    How to do it...
  6. Configure the Merge Join step, as shown in the following screenshot:
    How to do it...
  7. If you do a preview on this step, you will obtain the result of the two Excel files merged. In order to have the columns more organized, add a Select values step from the Transform category. In this new step, select the fields in this order: task, starting date (est.), starting date, end date (est.), end date, cost (est.), cost.
  8. Doing a preview on the last step, you will obtain the merged data with the columns of both Excel files interspersed, as shown in the following screenshot:
How to do it...

How it works...

In the example, you saw how to use the Merge Join step to join data coming from two Excel files. You can use this step to join any other kind of input.

In the Merge Join step, you set the name of the incoming steps, and the fields to use as the keys for joining them. In the recipe, you joined the streams by just a single field: the task field.

Note

The rows are expected to be sorted in an ascending manner on the specified key fields.

There's more...

In the example, you set the Join Type to LEFT OUTER JOIN. Let's see explanations of the possible join options:

Join

Description

In the example

INNER

The result contains only the rows with the same key in both sources

You will obtain only the tasks that have estimated and real information.

LEFT OUTER

The result contains all the rows from the first source, and the correspondent values for second source (or empty values for non-matching keys)

You will obtain all the tasks from the budget, and the real costs related to these tasks (with empty values for the tasks that still haven't any associated costs).

RIGHT OUTER

The result contains all the rows from the second source, and the corresponding values for the first source (or empty values for non-matching keys)

You will obtain all the real tasks' costs and their related information from the budget. If there is a cost for a task that hadn't been estimated, the estimated cost will be empty.

FULL OUTER

The result contains all the rows from both sources (with empty values for non-matching keys)

You will obtain all the tasks from the budget and the real costs. This was the case in the recipe.

See also

The recipe named Reading an Excel file in Chapter 2, Reading and Writing Files. In this recipe, you can learn the details of reading Excel files with Kettle.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
13.59.197.213