n/a
for the descriptive fields. Let's apply some changes in the operational database, and run the transformation again to see how the Dimension lookup/update step keeps history. update_jumbo_products.sql
and run it. js_dw
database again. Press Open SQL for [lk_puzzles] and type the following sentence:SELECT * FROM lk_puzzles WHERE id_js_man = 'JUM' ORDER BY id_js_prod , version
After making some changes in the operational database, you ran the transformation for a second time. The modifications you made caused the insertion of new records recreating the history of the puzzle attributes.
Type II SCDs differ from Type I SCDs in that a Type II keeps the whole history of the data of your dimension. Typical examples of attributes for which you would like to keep a history are sales territories that change over time, categories of products that are reclassified from time to time, and promotions that you apply to products and are valid in a given range of dates.
There are no rules that dictate whether or not you keep/retain the history in a dimension. It's the final user who decides based on his requirements.
In the puzzle dimension, you kept information about the changes for the name and theme attributes. Let's see how the history is kept for this sample dimension.
Each puzzle is to be represented by one or more records, each with the information valid during a certain period of time, as in the following example:
The history is kept in three extra fields in the dimension table—version
, date_from, and date_to
.
The version field is an automatically incremented value that maintains a revision number of the records for a particular puzzle.
The date range is used to indicate the period of applicability of the data.
In the tutorial you also had a current field, that acted as a flag to show if a record is the record valid in the present day.
The sample puzzle, Cinderellas Grand Arrival, was classified in the category Castles until October 1, 2009. After that date, the puzzle was reclassified as a Disney puzzle. This is the second version of the puzzle, as indicated by the column version
. It's also the current version, as indicated by the column current
.
In general, if you have to implement a Type II SCD with PDI, your dimension table must have the first three fields—version
, date from
, and date to
. The current flag is optional.
Type II SCDs can be loaded by using the Dimension lookup/update step. The Dimension lookup/update or Dimension L/U for short, looks in the dimension for a record that matches the information you put in the Keys grid of the settings window.
If the lookup fails, it inserts a new record. If a record is found, the step inserts or updates records depending on how you configured the step.
Let's explain how the Dimension L/U works with the following sample puzzle in the js
database:
The first time you run the transformation, the step looks in the dimension for a record where id_js_prod
is equal to JUMBO107
and id_js_man
is equal to JUM
. Not only that, the period from start_date
to end_date
of the found record must contain the value of the stream datefield
, which is 01/10/2009
.
Because you never loaded this table before, the table was empty and so the lookup failed.
As a result, the step inserts the following record:
Note the values that the step put for the special fields:
The version for the new record is 1, the current flag is set to true, and the start_date and end_date take as values the dates you put in the Min.year
and Max.year: 01/01/1900
and 31/12/2199
.
After making some modifications to the operational database, you ran the transformation again. Look at the following screenshot:
The puzzle information changed. As you see to the right of the Table input step, the puzzle is now classified as a Disney puzzle.
This time the lookup succeeds. There is a record for which the keys match and the period from start_date
to end_date
of the found record, 01/01/1900
to 31/12/2199
, obviously contains the value of the stream datefield, 01/10/2009
.
Once found, the step compares the fields you put in the Fields tab—name
and theme
in the dimension table against pro_name
and pro_theme
in the incoming stream.
As there is a difference in the theme
field, the step inserts a new record, and modifies the current—it changes the validity dates and sets the current flag to false. Now this puzzle has two versions in the dimension table, as you see below the Dimension L/U icon in the drawing.
These update and insert operations are made for all records that changed.
For the records that didn't change, dimension records are found but as nothing changed, nothing is inserted or updated.
Take a note about the stream date: The field you put here is key to the loading process of the dimension, as its value is interpreted by PDI as the effective date of the change. In the tutorial, you put a fixed date—01/10/2009
. In real situations you should use the effective or last changed date of the data if that date is available. If it is not available, leave the field blank. PDI will use the system date.
In this example, you filled the column Type of SCD update with the option Insert
for every field. Doing so, you loaded a pure Type II SCD, that is, a dimension that keeps track of all changes in all fields.
In the sample puzzles dimension, you kept a history of changes both in the theme and in the name. For the sample puzzle, the theme was changed from Castles
to Disney
. If, after some time, you query the sales and notice that the sales for that puzzle increased after the change, then you may conclude that the customers are more interested in Disney puzzles than in castle puzzles. The possibility of creating these kinds of reports is a good reason for maintaining a Type II SCD.
On the other hand, if the name of the puzzle changes, you may not be so interested in knowing what the name was before. Fortunately, you may change the configuration and create a Hybrid SCD. Instead of selecting Insert
for every field, you may select Update
or Punch through:
Update
, the new value overwrites the old value in the last dimension record version, this being the usual behavior in Type I SCDs. Punch through
, the new data overwrites the old value in all record versions.Note that selecting Punch through
for all the fields, the Dimension L/U step allows you to load a Type I SCD dimension. When you build Type I SCD you are not interested in range dates. Thus, you can leave the Stream datefield textbox empty. The current date is assumed by default.
In practice both Type I, Type II, and Hybrid SCDs are used. The choice of the type of SCD depends on the business needs.
Besides all those inserts and updates operations, the Dimension L/U automatically inserts in the dimension a record for unavailable data.
In order to insert the special record with key equal to zero, all fields must have default values or allow nulls. If none of these conditions are true, the automatic insertion will fail.
In order to load a dimension with the Dimension L/U step, your table has to have columns for the version, date from, and date to. The step automatically maintains those columns. You simply have to put their names in the right textbox in the settings window.
Besides those fields, your dimension table may have a column for the current flag, and another column for the date of last insert or update. To fill those optional columns, you have to add them in the Fields tab as you did in the tutorial.
Modify the loading of the products dimension so that it only keeps a history of the theme. If the name of the product changes, just overwrite the old values. Modify some data in the js
database and run your transformation to confirm that it works as expected.
As you saw in the Hero exercise to add regions to the Region Dimension, the countries were grouped in three: Spain, Rest of Europe, Rest of the World.
As the sales rose in several countries of the world, you decided to regroup the countries in more than three groups. However, you want to do it starting in 2008. For older sales you prefer to keep seeing the sales grouped by the original categories.
This is what you will do: Use the table named lk_regions_2
to create a Type II Region dimension. Here is a guide to follow:
Create a transformation that loads the dimension. You will take the stream date (the date you use for loading the dimension) from the command line. If the command line argument is empty, use the present day.
As the name for the sheet with the region definition, use a named parameter.
Stream date
If the command line argument is present, remember to change it to Date before using it. You do that with a Select values step.
Note that you have to define the format of the entered data in advance. Suppose that you want to enter as argument the date January 1, 2008. If you chose the format dd-mm-yyyy
, you'll have to enter the argument as 01-01-2008
.
In case the command line argument is absent, you can get the default with a Get System Info step. Note that the system date you add with this step is already a Date field.
Now just follow these steps:
regions.xls
file. Don't worry about the command line argument. Check that the dimension was loaded as expected. There has to be a single record for every city. regions2008.xls
file as source for the region column. As command line, enter January 1st, 2008. Remember to type the date in the expected format (check the preceding tip). Explore the dimension table. There has to be two records for each country—one valid before 2008 and one valid after that date.Suppose you have DVDs with the French films in the catalog you've created so far. You rent those DVDs and keep the rental information in the database. Now you will design a dimensional model for that data.
a. As a Type I SCD
b. As a Type II SCD
c. You will decide when you have rented enough films so you make the right decision.
a. A Dimension L/U step
b. A Combination L/U step
c. Either of the above
d. Neither of the above
Type III SCD are dimensions that store the immediately preceding and current value for a descriptive field of the dimension. Each entity is stored in a single record. The field for which you want to keep the previous value has two columns assigned in the record: One for the current value and the other for the old. Sometimes, it is possible to have a third column holding the date of effective change.
Type III SCDs are appropriate when you don't want to keep all the history, but mainly when you need to support two views of the attribute simultaneously—the previous and the current. Suppose you have an Employees dimension. Among the attributes you have their position. People are promoted from time to time and you want to keep these changes in the dimension; however, you are not interested in knowing all the intermediate positions the employees have been through. In this case, you may implement a Type III SCD.
The question is, how would you load a Type III SCD with PDI:
a. With a Dimension L/U step configuring it properly
b. By using a Database lookup step to get the previous value. Then with a Dimension L/U step or a Combination L/U step to insert or update the records.
c. You can't load Type III SCDs with PDI
It's worth saying that type III SCD are used rather infrequently and not always can be automated. Sometimes they are used to represent human-applied changes and the implementation has to be made manually.
3.133.130.199