This section of the chapter will lead you through the design and build process for the small warehouse (often referred to as a data mart) used for the reporting examples in the following chapters of this book.
For this book, we have taken the Microsoft AdventureWorks sample system, which already includes a warehouse schema for reporting.
Therefore, in this next section I will use a theoretical tennis statistic reporting system to show you the steps involved in designing a data warehouse.
The first step is to assess each source system table for its type of data in order to determine if it fits into a Dimension table, a Fact table, or another table type. Based upon our assessment of the source tables, we can then design and build the warehouse tables. This is followed by the creation of a process to copy the data from the source to the warehouse. Finally, we review and tune the database in order to ensure that we can meet the goals we have set.
We will start with the list of source tables. For each one, we will note which data type they are, examine the contents, determine if they are required in the warehouse, and consider any other factors that might be significant.
The standard approach is to look at the table from three angles:
The tables in the Tennis Statistics source system are split into Men's Tour and Ladies' Tour tables, along with some joint lookup tables, for example Tennis Court Types. A section of the database showing Men's Tennis is shown in the following diagram:
Using the approach shown previously, we create the following table:
We can now draw up a design for the warehouse, with a star schema in mind. Our goals are to reduce joins and therefore reduce snowflake designs, and make the design flexible enough for any reporting that we would like to think of. We will also keep in mind the rules laid out in the first section of this chapter.
The initial result of our analysis has resulted in six Dimension tables and two Fact tables. This is not set in stone; we need to be flexible enough to add more tables later if the report requirements dictate. There are no aggregate tables yet in place, because at this stage we do not know where, if any, there are performance issues. Given that we have designed a star then initial performance for a two million record table should be good:
Warehouse Table Name |
Content |
Type |
Granularity |
---|---|---|---|
|
List of Players |
Dimension |
Player |
|
List of Rounds |
Dimension |
Round |
|
List of Dates |
Dimension |
Day |
|
List of Court Types |
Dimension |
Court Type |
|
List of Stadium |
Dimension |
Stadium |
|
List of Tournaments |
Dimension |
Tournament |
|
Matches at a tournament |
Fact |
Match, Player |
|
Players at a tournament |
Fact |
Tournament, Player |
These tables can be arranged into the two following star schema layouts, with joins between the tables using the primary keys for the dimensions.
Note the direct connection from the Tournament Dimension table (W_TOURNAMENT_D
) to the Match Fact table (W_MATCH_F
). If we modeled the Tournament table to connect to the Match table (W_MATCH_D
) and did not have a Tournament WID
field in the Match fact, then we would have a snowflake design.
WID
is a column name suffix that was introduced to us in the Siebel days. It stands for Warehouse ID, and is used to name the field which is normally part of the key for a table. For example, DATE_WID
would be used on the date table and would contain a warehouse derived ID. In the Oracle BI Applications databases, you often see ROW_WID
as the primary key on the Dimension tables.
Having designed your warehouse, it's now time to create the objects.
Creating the tables is straightforward and can be achieved using free tools such as SQL Developer.
You are now faced with choices over the use of indexes and partitions. The general rule of thumb is to only create what you need. If an index is not going to be assisting in reports then do not create it. If a table has more than 10 million records then it becomes a good candidate for partitioning. The latest versions of Oracle have a useful feature for auto incrementing partitions which can make the whole process much easier to implement (make sure you have the right license though!).
Primary keys should be added where suitable, certainly on the Dimension tables, but also on the Fact tables provided they do not slow down the loading process too much. If you are not able to use a unique key in the fact due to massive amounts of data, then it is prudent to check that the dimensions rule is not broken, that is, make sure there are no references in the fact to dimension records that do not exist.
Having created and populated our warehouse the hard work begins! Once the Oracle BI tool starts to throw requests at our database we will learn where the bottlenecks are. Monitoring the ETL and database statistics each day are vital to ensure that potential breakages are avoided. This includes understanding how the tables are growing each day, and what the impact is on our allocated disk space.
The latest version of Oracle Enterprise Manager (11g) has a substantial set of monitoring processes that makes it an invaluable tool in your armory.
Implement Usage Tracking in Oracle BI to understand where your customers are experiencing slow performance, and act on the information gathered. This could be adding indexes, Performance Data or even changing the way that OBI accesses the star.
Usage tracking will be explained in detail later in this book.
3.144.116.69