Comparing two streams and generating differences

Suppose that you have two streams with the same structure and want to find out the differences in the data. Kettle has a step meant specifically for that purpose: the Merge Rows (diff) step. In this recipe, you will see how it works.

Suppose that you have a file with information about the fastest roller coasters around the world. Now, you get an updated file and want to find out the differences between the files: There can be new roller coasters in the list; maybe some roller coasters are no longer among the fastest. Besides, you were told that in the old file, there were some errors about the location, country, and year information, so you are also interested in knowing if some of these have changed.

Getting ready

For this recipe, you will need two files with information about roller coasters. You can download them from the book's site.

Both files have the same structure and look like the following:

Roller_Coaster|Speed|park|location|country|Year
Kingda Ka|128 mph|Six Flags Great Adventure|Jackson, New Jersey||2005
Top Thrill Dragster|120 mph|Cedar Point|Sandusky, Ohio||2003
Dodonpa|106.8 mph|Fuji-Q Highland|FujiYoshida-shi|Japan|2001
Steel Dragon 2000|95 mph|Nagashima Spa Land|Mie|Japan|2000
Millennium Force|93 mph|Cedar Point|Sandusky, Ohio||2000
...

For the There's more section, you will also need a database with the first file already loaded in a table. You will find a script for creating and loading it also available for downloading.

How to do it...

Carry out the following steps:

  1. Create a transformation.
  2. Drag a Text file input step into the canvas and use it to read the file top_roller_coasters.txt. As a separator, type |.
  3. Do a preview to make sure that you are reading the file as expected.
  4. Add a Sort rows step to sort the rows by roller_coaster and park.
  5. Repeat the steps 2 to 4 to read the file named top_roller_coasters_updates.txt and sort the rows also by roller_coaster and park.
  6. From the Join category, add a Merge Rows (diff) step and use it to join both streams as depicted in the following diagram:
    How to do it...
  7. Double-click on the step you just added. In the Reference rows origin: select the name of the step coming from the stream that reads the top_roller_coasters.txt file.
  8. In the Compare rows origin: select the name of the step coming from the stream that reads the top_roller_coasters_updates.txt file.
  9. As Flag fieldname, type flag.
  10. Fill the Keys to match: and Values to compare: grids as shown in the following screenshot:
    How to do it...

    Note

    You can save time by clicking on the Get key fields and Get value fields buttons to fill each grid respectively. Then just delete the fields that you don't need.

  11. Close the window and do a preview; you should see the following:
How to do it...

How it works...

The Merge Rows (diff) step is used for comparing two streams and finding out the differences between them. The output of the step is a single stream. The output stream contains a new field that acts as a flag indicating the kind of difference found as explained next.

Note

When you use the Merge Rows (diff) step, the two streams you are merging must have the same metadata, that is, the name, order, and type of the fields must be the same.

Let's call the streams being merged reference stream and compare stream. The first holds the old data while the second holds the new data. In the recipe, the old data is the data coming from the top_roller_coasters.txt file and the new data is the data coming from the top_roller_coasters_update.txt file.

Note

Both streams must be sorted on the specified keys.

In order to perform the comparison, you have to tell Kettle how to detect that a row is the same in both streams, that is, you have to specify the key fields. You do it by entering them in the first grid. In the recipe, the key was made up by the roller coaster name and the park name (roller_coaster and park fields).

Tip

If your data comes from a database, instead of using a Sort rows step for sorting the rows, you can sort them in the Table input. That will give you better performance.

Given the two streams, Kettle tries to match rows of both streams based on the key fields provided. Depending on the result, it sets a different value for the flag as explained in the following table:

Result of the comparison

Flag

Example

The key was only found in the reference stream

new

Formula Rossa roller coaster

The key was only found in the compared stream

deleted

Colossos roller coaster

The key was found in both streams and the fields typed in the Value to compare grid are equal

identical

Millennium Force roller coaster. The location (Sandusky, Ohio), country (empty), and year (2000) were the same in both streams.

The key was found in both streams but at least one of the fields typed in the Value to compare grid is different

changed

Furious Baco roller coaster. The location changed from Spain to Salou and the Country changed from empty to Spain.

Note that if a row is found in both streams with identical key fields and compare fields, it is marked as identical even if there are differences in other fields. For example, the Dodonpa roller coaster has a speed of 106.8 mph in the reference stream, but a speed of 106 mph in the compare stream. As you didn't put the speed in the values to compare list, the rows are marked as identical.

As a final remark, note that for the rows marked as new or changed, the values that pass to the output stream are those coming from the compare stream.

For the rows marked as identical or deleted, the values that are passed are those coming from the reference stream.

There's more...

The Merge Rows (diff) step is commonly used together with the Synchronize after merge step to keep a database table updated. The following section shows an example of how to do this.

Using the differences to keep a table up to date

Suppose that you have a table in a database with information about roller coasters, and that you have already inserted the data in the top_roller_coasters.txt file in that table.

Now, you have the top_roller_coasters_updates.txt file and want to update the table based on the differences.

Note

The table is totally de-normalized on purpose to keep the exercise simple.

Try the following:

  1. After running the script mentioned in the introduction, modify the transformation in the recipe by replacing the first stream with a Table Input step, in order to read the table rollercoasters. Use the following statement:
    SELECT roller_coaster
    , speed
    , park
    , location
    , country
    , year
    FROM rollercoasters
    ORDER BY roller_coaster
    , park
    
  2. You will have something like the following:
    Using the differences to keep a table up to date
  3. Do a preview on the last step, the Merge Rows (diff). The output should be exactly the same as the output in the recipe.
  4. Now, add a Synchronize after merge step. Select the connection to the database or create it if it doesn't exist and as Target table, type rollercoasters. Fill the grids as shown in the following screenshot:
    Using the differences to keep a table up to date
  5. Select the Advanced tab and fill in the Operation frame as shown in the following screenshot:
    Using the differences to keep a table up to date
  6. Close the window, save the transformation, and run it.
  7. Execute a SELECT statement to see the data in the rollercoaster table. The roller coasters with the flag deleted should have been deleted from the table. The rows with the flag new should have been inserted in the table, and the rows with the flag changed should have been updated.

See also

The recipe named Inserting, deleting, or updating a table depending on a field in Chapter 1, Working with Databases. This recipe will help you understand the use of the Synchronize after merge step.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.144.93.141