Time for action—testing the transformation that keeps a history of product changes

  1. In the previous tutorial you loaded a dimension with products by using a Dimension lookup/update step. You ran the transformation once, causing the insertion of one record for each product and a special record with values n/a for the descriptive fields. Let's apply some changes in the operational database, and run the transformation again to see how the Dimension lookup/update step keeps history.
  2. In MySQL Query Browser, open the script update_jumbo_products.sql and run it.
  3. Switch to Spoon.
  4. If the transformation created in the last tutorial is not open, open it again.
  5. Run the transformation. Explore the js_dw database again. Press Open SQL for [lk_puzzles] and type the following sentence:
    SELECT *
    FROM lk_puzzles
    WHERE id_js_man = 'JUM'
    ORDER BY id_js_prod
    , version
    
  6. You will see this:
    Time for action—testing the transformation that keeps a history of product changes

What just happened?

After making some changes in the operational database, you ran the transformation for a second time. The modifications you made caused the insertion of new records recreating the history of the puzzle attributes.

Keeping an entire history of data with a Type II slowly changing dimension

Type II SCDs differ from Type I SCDs in that a Type II keeps the whole history of the data of your dimension. Typical examples of attributes for which you would like to keep a history are sales territories that change over time, categories of products that are reclassified from time to time, and promotions that you apply to products and are valid in a given range of dates.

Note

There are no rules that dictate whether or not you keep/retain the history in a dimension. It's the final user who decides based on his requirements.

In the puzzle dimension, you kept information about the changes for the name and theme attributes. Let's see how the history is kept for this sample dimension.

Each puzzle is to be represented by one or more records, each with the information valid during a certain period of time, as in the following example:

Keeping an entire history of data with a Type II slowly changing dimension

The history is kept in three extra fields in the dimension table—version, date_from, and date_to.

The version field is an automatically incremented value that maintains a revision number of the records for a particular puzzle.

The date range is used to indicate the period of applicability of the data.

In the tutorial you also had a current field, that acted as a flag to show if a record is the record valid in the present day.

The sample puzzle, Cinderellas Grand Arrival, was classified in the category Castles until October 1, 2009. After that date, the puzzle was reclassified as a Disney puzzle. This is the second version of the puzzle, as indicated by the column version. It's also the current version, as indicated by the column current.

In general, if you have to implement a Type II SCD with PDI, your dimension table must have the first three fields—version, date from, and date to. The current flag is optional.

Loading Type II SCDs with the Dimension lookup/update step

Type II SCDs can be loaded by using the Dimension lookup/update step. The Dimension lookup/update or Dimension L/U for short, looks in the dimension for a record that matches the information you put in the Keys grid of the settings window.

If the lookup fails, it inserts a new record. If a record is found, the step inserts or updates records depending on how you configured the step.

Let's explain how the Dimension L/U works with the following sample puzzle in the js database:

Loading Type II SCDs with the Dimension lookup/update stepType II SCDsusing, to maintain entire history

The first time you run the transformation, the step looks in the dimension for a record where id_js_prod is equal to JUMBO107and id_js_man is equal to JUM. Not only that, the period from start_date to end_date of the found record must contain the value of the stream datefield, which is 01/10/2009.

Because you never loaded this table before, the table was empty and so the lookup failed.

As a result, the step inserts the following record:

Loading Type II SCDs with the Dimension lookup/update stepType II SCDsusing, to maintain entire history

Note the values that the step put for the special fields:

The version for the new record is 1, the current flag is set to true, and the start_date and end_date take as values the dates you put in the Min.year and Max.year: 01/01/1900 and 31/12/2199.

After making some modifications to the operational database, you ran the transformation again. Look at the following screenshot:

Loading Type II SCDs with the Dimension lookup/update stepType II SCDsusing, to maintain entire history

The puzzle information changed. As you see to the right of the Table input step, the puzzle is now classified as a Disney puzzle.

This time the lookup succeeds. There is a record for which the keys match and the period from start_date to end_date of the found record, 01/01/1900 to 31/12/2199, obviously contains the value of the stream datefield, 01/10/2009.

Once found, the step compares the fields you put in the Fields tab—name and theme in the dimension table against pro_name and pro_theme in the incoming stream.

As there is a difference in the theme field, the step inserts a new record, and modifies the current—it changes the validity dates and sets the current flag to false. Now this puzzle has two versions in the dimension table, as you see below the Dimension L/U icon in the drawing.

These update and insert operations are made for all records that changed.

For the records that didn't change, dimension records are found but as nothing changed, nothing is inserted or updated.

Note

Take a note about the stream date: The field you put here is key to the loading process of the dimension, as its value is interpreted by PDI as the effective date of the change. In the tutorial, you put a fixed date—01/10/2009. In real situations you should use the effective or last changed date of the data if that date is available. If it is not available, leave the field blank. PDI will use the system date.

In this example, you filled the column Type of SCD update with the option Insert for every field. Doing so, you loaded a pure Type II SCD, that is, a dimension that keeps track of all changes in all fields.

In the sample puzzles dimension, you kept a history of changes both in the theme and in the name. For the sample puzzle, the theme was changed from Castles to Disney. If, after some time, you query the sales and notice that the sales for that puzzle increased after the change, then you may conclude that the customers are more interested in Disney puzzles than in castle puzzles. The possibility of creating these kinds of reports is a good reason for maintaining a Type II SCD.

On the other hand, if the name of the puzzle changes, you may not be so interested in knowing what the name was before. Fortunately, you may change the configuration and create a Hybrid SCD. Instead of selecting Insert for every field, you may select Update or Punch through:

  • When there is a change in a field for which you chose Update, the new value overwrites the old value in the last dimension record version, this being the usual behavior in Type I SCDs.
  • When there is a change in a field for which you chose Punch through, the new data overwrites the old value in all record versions.

Note that selecting Punch through for all the fields, the Dimension L/U step allows you to load a Type I SCD dimension. When you build Type I SCD you are not interested in range dates. Thus, you can leave the Stream datefield textbox empty. The current date is assumed by default.

In practice both Type I, Type II, and Hybrid SCDs are used. The choice of the type of SCD depends on the business needs.

Besides all those inserts and updates operations, the Dimension L/U automatically inserts in the dimension a record for unavailable data.

Note

In order to insert the special record with key equal to zero, all fields must have default values or allow nulls. If none of these conditions are true, the automatic insertion will fail.

In order to load a dimension with the Dimension L/U step, your table has to have columns for the version, date from, and date to. The step automatically maintains those columns. You simply have to put their names in the right textbox in the settings window.

Besides those fields, your dimension table may have a column for the current flag, and another column for the date of last insert or update. To fill those optional columns, you have to add them in the Fields tab as you did in the tutorial.

Have a go hero—keeping a history just for the theme of a product

Modify the loading of the products dimension so that it only keeps a history of the theme. If the name of the product changes, just overwrite the old values. Modify some data in the js database and run your transformation to confirm that it works as expected.

Have a go hero—loading a Type II SCD dimension

As you saw in the Hero exercise to add regions to the Region Dimension, the countries were grouped in three: Spain, Rest of Europe, Rest of the World.

As the sales rose in several countries of the world, you decided to regroup the countries in more than three groups. However, you want to do it starting in 2008. For older sales you prefer to keep seeing the sales grouped by the original categories.

This is what you will do: Use the table named lk_regions_2 to create a Type II Region dimension. Here is a guide to follow:

Create a transformation that loads the dimension. You will take the stream date (the date you use for loading the dimension) from the command line. If the command line argument is empty, use the present day.

As the name for the sheet with the region definition, use a named parameter.

Tip

Stream date

If the command line argument is present, remember to change it to Date before using it. You do that with a Select values step.

Note that you have to define the format of the entered data in advance. Suppose that you want to enter as argument the date January 1, 2008. If you chose the format dd-mm-yyyy, you'll have to enter the argument as 01-01-2008.

In case the command line argument is absent, you can get the default with a Get System Info step. Note that the system date you add with this step is already a Date field.

Now just follow these steps:

  1. Run the transformation by using the regions.xls file. Don't worry about the command line argument. Check that the dimension was loaded as expected. There has to be a single record for every city.
  2. Run the transformation again. This time use the regions2008.xls file as source for the region column. As command line, enter January 1st, 2008. Remember to type the date in the expected format (check the preceding tip). Explore the dimension table. There has to be two records for each country—one valid before 2008 and one valid after that date.
  3. Modify the sheet to create a new grouping for the American countries. Use your imagination for this task! Run the transformation for the third time. This time use the sheet you created and as date, type the present day (or leave the argument blank). Explore the dimension table. Now each city for the countries you regrouped has to have three versions, where the current is the version you created. The other cities should continue to have two versions each, because nothing related to those cities changed.

Pop quiz—loading slowly changing dimensions

Suppose you have DVDs with the French films in the catalog you've created so far. You rent those DVDs and keep the rental information in the database. Now you will design a dimensional model for that data.

  1. You begin by designing a dimension to store the names of the films. How do you create the Films dimension:

    a. As a Type I SCD

    b. As a Type II SCD

    c. You will decide when you have rented enough films so you make the right decision.

  2. In order to create that dimension, you could use:

    a. A Dimension L/U step

    b. A Combination L/U step

    c. Either of the above

    d. Neither of the above

Pop quiz—loading type III slowly changing dimensions

Type III SCD are dimensions that store the immediately preceding and current value for a descriptive field of the dimension. Each entity is stored in a single record. The field for which you want to keep the previous value has two columns assigned in the record: One for the current value and the other for the old. Sometimes, it is possible to have a third column holding the date of effective change.

Type III SCDs are appropriate when you don't want to keep all the history, but mainly when you need to support two views of the attribute simultaneously—the previous and the current. Suppose you have an Employees dimension. Among the attributes you have their position. People are promoted from time to time and you want to keep these changes in the dimension; however, you are not interested in knowing all the intermediate positions the employees have been through. In this case, you may implement a Type III SCD.

The question is, how would you load a Type III SCD with PDI:

a. With a Dimension L/U step configuring it properly

b. By using a Database lookup step to get the previous value. Then with a Dimension L/U step or a Combination L/U step to insert or update the records.

c. You can't load Type III SCDs with PDI

It's worth saying that type III SCD are used rather infrequently and not always can be automated. Sometimes they are used to represent human-applied changes and the implementation has to be made manually.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.145.94.159