Chapter 12 Time Variant Database Definition

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

If this Time Variant Solution Design has taken you out of your comfort zone, then please pause, breathe deeply, and get ready to go back to your comfort zone. This database design is more about fitting Instance Keys and indexes into an existing data warehouse design than it is about creating a new data warehouse design. If you’re waiting for this chapter to tell you to throw everything away and prepare to start over, you’re going to wait a very long time. Rather than throw away everything you know about your data warehouse, including all those little idiosyncrasies that only a handful of people know about, this chapter will assume that you have maintained your data warehouse to the best of your ability and that your data warehouse is the best it can be. For that reason, obviously you’re not trying to solve a data model or ETL problem with a Time Variant design. You are simply trying to integrate Time Variance into an existing data warehouse.

If, however, you’re starting from the beginning, please be aware that everything in this chapter comes after the source system analysis, conceptual data model, logical data model, and the first physical data model. Time variance should be introduced somewhere around the second physical data model. All these analysis and design methods are explained in Building and Maintaining a Data Warehouse. This discussion of time variance in a data warehouse does not bypass or replace any of the analysis and design methods that are key to the success of every data warehouse. Rather, this discussion of time variance in a data warehouse is in addition to the data warehouse analysis and design methods. For that reason, this discussion of data warehousing assumes a shared level of knowledge. If this is your first book on data warehousing, please have a copy of Building and Maintaining a Data Warehouse handy for reference purposes.

Tables of Types and Types of Tables

This Time Variant Solution Design is a method for defining table keys and optimizing joins to those table keys. Fact tables are still fact tables. Dimension tables are still dimension tables. Dimensional data models are still dimensional data models. In addition, Third Normal Form retains its present form. This Time Variant Solution Design is not intended to change any of the data warehousing data modeling methods. Instead, this Time Variant Solution Design is intended to optimize joins within those data models when they are intended to reflect the passage of time.

The genius of the fact, dimension, and summary table concepts is rendered more obvious in the presence of this Time Variant Solution Design. Rather than modify their purpose and function, the addition of time variance to fact, dimension, and summary tables only increases their effectiveness in achieving their individual goals and purposes. We don’t need to force fact, dimension, and summary tables to a new and unfamiliar purpose, like a square peg in a round hole. Instead, we expand the value achieved by fact, dimension, and summary tables in a data warehouse by incorporating time variance into their current design.

Type 2 Time Variant Dimension Tables

Dimension tables present the properties and attributes of an entity. Time variance expands that definition to...present the properties and attributes of an entity as of a moment in time. Without time variance a dimension table is a two-way join between an entity and a property or attribute. The addition of time variance creates a three-way join that now includes the time interval during which the two-way join between an entity and a property or attribute was in effect. That three-way join has been the cause of significant performance issues in relational data warehouses.

The performance degradation is resolved by the use of an Instance Key that uniquely identifies a single row of a dimension table. In that way, an Instance Key uniquely identifies the row that presents the two-way join between an entity and its attributes and properties as of a specific moment in time. An Instance Key is a sequential numeric column in a dimension table. A Simple Instance Key is composed of only that sequential numeric column of data in a dimension table. A Compound Instance Key is composed of an Entity Key and a sequential numeric column of data that resets back to the value 1 for every new Entity Key.

Figure 11.4 shows a set of time variant dimension tables that join by Simple Instance Keys. Figure 11.5 shows a set of time variant dimension tables that join by Compound Instance Keys. Both sets of tables follow the same query paths, which can go in two directions. The query can begin with fact table rows and then join up to the dimension tables. The query can also begin with a parent dimension table rows and then join down to the fact table. Instance Keys are intended to optimize the joins in query paths in both directions.

A query that begins with a fact table will join up to the dimension tables. More specifically, a query that begins with a row of data from a fact table will join up to a single row in a dimension table. The optimization of this Time Variant Solution Design is achieved by optimizing the join between a row of a fact table and a row of a dimension table. To that end, the Instance Key is the primary key (or primary index) of a Type 2 time variant dimension table. When the Instance Key of a dimension table is also the primary key (or primary index) of a table, a join to that table via the Instance Key becomes the fastest join path to that table. To make that happen, the Instance Key of every dimension table is also the primary key (or primary index) of every Type 2 time variant dimension table. For these tables, that means the Instance Key is the fastest possible join path to each table. When the Instance Key of a table is a Simple Instance Key, the primary key of that table is that single column which is the Instance Key. When the Instance Key of a table is a Compound Instance Key, the primary key of that table is the pair of columns that are the Entity Key and the Instance Key, which together form the Compound Instance Key.

A query that begins with a fact table joins to a dimension table and then joins dimension tables to dimension tables. As shown in Figure 11.4 and Figure 11.5, a time variant join can occur between dimensions. Figure 11.4 shows the dimension table Item joins to the Color_Formula and Carrier_Terms tables by referencing the single column that is the Instance Key of the Color_Formula table and the single column that is the Instance Key of the Carrier_Terms table. Figure 11.5 shows the dimension table Item joins to the Color_Formula and Carrier_Terms tables by referencing the two columns that are the Entity Key and Instance Key of the Color_Formula table and the two columns that are the Entity Key and Instance Key of the Carrier_Terms table. Every dimension table has its own Instance Key, which is its primary key.

The query path from a fact table begins with the fact table. A foreign key column in the fact table joins to the Instance Key of a dimension table. In Figure 11.4 the fact table foreign key joins to a Simple Instance Key in the Item table. In Figure 11.5 the fact table foreign key joins to a Compound Instance Key in the Item table. Then a dimension table continues the query path by joining, via Instance Keys, to other dimension tables. In Figure 11.4 the Item table joins to the Color_Formula and Carrier_Terms tables via Simple Instance Keys. In Figure 11.5 the Item table joins to the Color_Formula and Carrier_Terms tables via Compound Instance Keys.

The query path that begins with a fact table and then joins to dimension tables usually is the result of a question that begins with the transactions in the fact table. The query path can also begin with a dimension table. A Department summary would begin with the Department dimension table. A Region summary would begin with a Region dimension table. Or, in the case of the example in Figure 11.4 and Figure 11.5, a Carrier_Terms summary would begin with the Carrier_Terms table; a Color_Formula summary would begin with the Color_Formula table.

To optimize the join path from a dimension table (e.g., Carrier_Terms or Color_Formula) backward toward the fact table, the foreign key Instance Keys in all dimension tables will be indexed as a secondary index. In Figure 11.4 and Figure 11.5, the join path from the Color_Formula table “backward” to the Item table will be optimized by a secondary index on the Color_Formula Instance Key in the Item table; in addition, the join path from the Carrier_Terms table “backward” to the Item table will be optimized by a secondary index on the Carrier_Terms Instance Key in the Item table. So, when the query path goes “backward” from a dimension table, through other dimension tables, down to the fact table, the primary keys and foreign keys reverse. The join from the Carrier_Terms table down to the Item table finds the applicable rows in the Item table by finding the Item rows wherein the Carrier_Terms Instance Key matches the Carrier_Terms Instance Key of the Carrier_Terms table. The join from the Color_Formula table down to the Item table finds the applicable rows in the Item table by finding the Item rows wherein the Color_Formula Instance Key matches the Color_Formula Instance Key of the Color_Formula table. To optimize this join path, the Color_Formula Instance Key in the Item table is covered by a secondary index, and the Carrier_Terms Instance Key in the Item table is also covered by a secondary index. When the Instance Keys are Simple Instance Keys, the covering secondary index is an index of only the one Instance Key column. When the Instance Keys are Compound Instance Keys, the covering secondary index is an index of the two Instance Key columns.

To summarize the design of a Type 2 dimension table:

The Instance Key is a numeric sequential unique identifier for every row of a dimension table.
A Simple Instance Key is made of only a single numeric column within a dimension table.
A Compound Instance Key is made of an Entity Key and a single numeric sequential column of data that resets back to 1 for every new entity.
A dimension table’s Instance Key is the primary key, or primary index, of that table.
A dimension table has foreign Instance Keys which should be covered as a secondary index.
For some RDBMS platforms, foreign key/primary key relations perform more efficiently when all instances of the key are the same data type.
Figure 11.4 and Figure 11.5 display foreign key/primary key relations in a Type 2 time variant data warehouse.

Type 1 and Type 2 Time Variant Dimension Tables in a Shared Environment

Type 2 Time Variance is achieved by connecting every row of a fact table to the dimension rows that were in effect at the time of the transaction presented in that row. Type 1 Time Variance is achieved by connecting every row of a fact table to the dimension rows that are in effect right now. This will create two representations of the enterprise. One is a Type 1 “present state” representation of the enterprise. The other is a Type 2 “historic state” representation of the enterprise. Figure 11.8 (showing Simple Instance Keys) and Figure 11.9 (showing Compound Instance Keys) give a big picture view of how these two representations work together in one data warehouse. As a DBA, you choose how to present these two sets of dimensions (present state and historic state). The possibilities include two separate databases, two schemas, or some other demarcation. The goal of that demarcation is to allow the users of the data warehouse to know which set of dimensions they are using. They need to know that when they are in the “Present State” demarcation, their query results will reflect only the present state of the enterprise. Likewise, the data warehouse users need to know that when they are in the “Historic State” demarcation, their query results will reflect only the historic state of the enterprise.

Figure 11.8 and Figure 11.9 show that both the Type 1 Time Variance dimension tables and the Type 2 Time Variance dimension tables share the same set of fact tables. The sharing of fact tables avoids unnecessary resource consumption by doubling the disk capacity of the fact tables via replicated tables and avoids the possibility of differing total quantities when comparing Type 1 and Type 2 result sets (and yes, everyone will compare the Type 1 and Type 2 data).

Having established the concept and implementation of Instance Keys, Type 2 Time Variance is easy to understand. The remaining challenge is to build the Type 1 Time Variant dimension tables and the fact tables, so that the Type 1 and Type 2 dimension tables can share the same fact tables.

Type 1 Time Variant Dimension Tables

A data warehouse presents enterprise events in both their historic context and their present context. Ralph Kimball dubbed the presentation of enterprise events in their present context as Type 1 Time Variance, and Kimball dubbed the presentation of enterprise events in their historic context as Type 2 Time Variance. The discussion of Type 1 Time Variance versus Type 2 Time Variance is a discussion of the presentation of dimensions in the data warehouse. The previous section explained the concepts underlying relational structures that would achieve Type 2 Time Variance. This section will explain the concepts underlying relational structures that would achieve Type 1 Time Variance.

The historic context of Type 2 Time Variant dimension tables is achieved by joining on Instance Keys. Instance Keys synchronize the rows that were in effect as of a moment in time. The present context of Type 1 Time Variant dimension tables is achieved by joining on Entity Keys. Entity Keys ignore the possibility that a row of dimension data might not have been in effect as of a moment in the past, because the Type 1 dimension tables cannot see the past.

All the same Columns

The Type 1 dimension tables have all the same columns as the Type 2 dimension tables, including the Instance Keys. Retaining all the columns allows the queries and BI reports using the Type 2 “historic context” dimension to be as similar as possible to the queries and BI reports using the Type 1 “historic context” dimensions. The same columns also make management of the two sets of dimension tables easier and cleaner. Also, retaining the Instance Keys, Row_First_Date, and Row_Last_Date columns in the Type 1 “present state” dimension tables strengthens the effect of the data quality assessment that will verify that the Type 1 dimension tables do indeed have only the most recent dimension row for each entity. So, while it may seem strange at first to retain all the time variant columns in the Type 1 dimension tables, that design approach adds more value than the disk space it consumes.

Only the Row in Effect Right Now

The Type 1 Time Variant dimension table will avoid any possible confusion about historic dimension rows by presenting only the row in effect right now for a given entity. Figure 11.6 and Figure 11.7 display how this would look in the example of a transaction involving Item Key 4. In that example, Item Key is the Entity Key, and Item Key 4 is the entity. Since the example includes only one item (Item 4), the Type 1 Item table includes only one row. If the example were expanded to include four hundred items, then the Type 1 Item table would include four hundred rows.

The other obvious observation is that the Type 1 dimension tables have fewer rows than their Type 2 dimension table counterparts. Obviously, the Type 1 dimension tables have fewer rows because they do not include any history rows.

Entity Primary Keys

The primary key in Type 1 Time Variant dimension tables is a structural Data Definition Language (DDL) change compared to the primary keys of the Type 2 Time Variant dimension tables. In Type 1 Time Variant dimension tables the Entity Key is the primary key. This is different because in a Type 1 Time Variant dimension you are not joining to an entity as of a moment in the past. Instead, you are joining to an entity...period. In a Type 1 Time Variant dimension there is no representation of the past. So, there is no need to try to join to the past. Instead, each entity has only one row, and that one row is the goal of a join to that entity.

When the query path goes from a fact table to a Type 1 dimension table, the fact table joins to an Entity Key in a dimension table. That dimension table joins to another dimension table via the Entity Key of the second dimension table. The second dimension table joins to a third dimension table via the Entity Key of the third dimension table, and so on.

Entity Foreign Keys

When the query path goes from a Type 1 dimension table to a fact table, the dimension table joins to the fact table via a foreign Entity Key. A Type 1 dimension table can join to a second Type 1 dimension table by finding its Entity Key as a foreign key in the second dimension table. The second dimension table can also join to a third dimension table by finding its Entity Key as a foreign key in the third dimension table, and so on. Finally, the lowest-level Type 1 dimension table joins to a fact table by finding its Entity Key as a foreign key in the fact table.

To optimize the performance of these joins from a Type 1 dimension table to another Type 1 dimension table, or a fact table, the foreign Entity Keys should be covered by a secondary index. By joining to the secondary index, the join from one dimension table to another dimension table should be able to find a more efficient explain than a full table scan.

Fact Tables

The rows of a fact table represent individual enterprise events or transactions. In a data warehouse, the fact tables represent the business functions of the data warehouse. The rows of the fact tables represent the individual occurrences of those business functions. The Entity Keys and Instance Keys in a fact table identify the individual enterprise entities involved in the business function.

The primary key of a fact table has no relevance to this Time Variant Solution Design. So, the primary key should be designed to optimize BI reports that query large numbers of fact rows. The foreign keys that join to dimension tables, however, are relevant to this Time Variant Solution Design. To optimize the performance of queries that begin from the dimension tables, the Entity Keys that join to Type 1 dimension tables should each be covered by a secondary index, and the Instance Keys that join to Type 2 dimension tables should each be covered by a secondary index.

Time Summary Tables

The detailed and granular data in a fact table can be summarized by any dimension hierarchy referenced in that fact table. Transactions can be summarized by a Product hierarchy up to the Department level of that hierarchy. Events can be summarized by a Labor hierarchy up to the Skill level of that hierarchy. The Time hierarchy is a ubiquitous summary scheme. Typically, a data warehouse will summarize individual transactions up to the Day and Week. From those two Time summaries, other less frequent Time summaries can be performed at run time.

The “Summary Tables” section of Chapter 11 explained that Time summaries in this Time Variant Solution Design can be defined as either Type 1 Summary tables or Type 2 Summary tables. A Type 1 Summary table is different from its source Fact table in that a Type 1 Summary table has only Entity Keys and no Instance Keys. A Type 2 Summary table is different from its source Fact table in that a Type 2 Summary table has only Instance Keys and no Entity Keys.

A Type 2 Summary table could serve both purposes as a Type 1 and Type 2 Summary table by retaining both the Entity Keys and Instance Keys. Such a hybrid Type 1/Type 2 Summary table is feasible because hierarchically all Instance Keys for an Entity Key are contained within that Entity Key. When a hybrid Type 1/Type 2 Summary table joins to a Type 1 Dimension table, a summation operation must occur again to sum the quantitative measurements up to the Entity Keys in the Summary table. When a hybrid Type 1/Type 2 Summary table joins to a Type 2 Dimension table, no additional summation operation is required to sum the quantitative measurements up to the Instance Keys in the Summary table.

The decision to deploy a Type 1 Summary table, Type 2 Summary table, or hybrid Type 1/Type 2 Summary table is a judgment call between the requirements, disk capacity, and CPU and I/O capacity of the data warehouse RDBMS.

Conclusion

Figure 11.8, using Simple Instance Keys, and Figure 11.9, using Compound Instance Keys, display a data warehouse that has a set of Type 1 “present state” dimension tables, a set of Type 2 “historic state” dimension tables, and a fact table that joins to both sets of dimension tables. The Type 1 “present state” dimension tables join via Entity Keys. The Type 2 “historic state” dimension tables join via Instance Keys. The fact and summary tables join via the method indicated by the set of dimension tables to which they join in an individual query. The result of these three sets of tables is a data warehouse that can present the events and transactions of the enterprise in their present and historic context.

Conceivably, a data warehouse can use a mixture of Simple and Compound Instance Keys. In practice, such a mixture of Simple and Compound Instance Keys can be very confusing. For that reason, if possible, a data warehouse should be designed using only Simple Instance Keys or only Compound Instance Keys. If that unilateral design is not an option, then a naming standard should be adopted to identify the time variant key of each table. Because every dimension table, including each Type 1 table and every Type 2 table, will retain its Entity Key and Instance Key, a table with a Simple Instance Key will look very similar to a table with a Compound Instance Key. The data warehouse users should not be required to profile the data in every dimension table to determine its key structure before using each table. For that reason, the best approach is to unilaterally use one Instance Key design (either Simple or Compound) or a naming standard that distinguishes the two from each other.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Chapter 12 Time Variant Database Definition

Create new playlist

Sign In

Sign Up

Table of Contents for
Chapter 12 Time Variant Database Definition