Creating a warehouse

This section of the chapter will lead you through the design and build process for the small warehouse (often referred to as a data mart) used for the reporting examples in the following chapters of this book.

For this book, we have taken the Microsoft AdventureWorks sample system, which already includes a warehouse schema for reporting.

Therefore, in this next section we will use a theoretical tennis statistics reporting system to show you the steps involved in designing a data warehouse.

The first step is to assess each source system table for the type of data it holds, in order to determine whether it fits into a Dimension table, a Fact table, or another table type. Based upon our assessment of the source tables, we can then design and build the warehouse tables. This is followed by the creation of a process to copy the data from the source to the warehouse. Finally, we review and tune the database to ensure that we can meet the goals we have set.

Source system assessment

We will start with the list of source tables. For each one, we will note what type of data it holds, examine its contents, determine whether it is required in the warehouse, and consider any other factors that might be significant.

The standard approach is to look at the table from three angles:

  • Physical attributes:
    • What are the columns and their data types?
    • What restrictions are there?
    • And most importantly, what keys are there?

  • Data content:
    • Try to get as big a sample of the data as possible
    • Bring it into a system you understand, such as Excel, Access, or a database
    • Finally, try to understand the field contents and the numbers involved (a sample profiling query is sketched after this list)

  • Business use:
    • Find out how the table is used in the source system
    • If the table is used in an existing reporting system, which reports use it, and how?
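
The first two angles can often be covered with a few queries against the Oracle data dictionary and the source table itself. The following is a minimal sketch; the source table MENS_MATCH_RESULTS and its PLAYER_ID column are hypothetical names used for illustration.

-- Physical attributes: columns, data types, and nullability of a source table
SELECT column_name, data_type, data_length, nullable
FROM   all_tab_columns
WHERE  table_name = 'MENS_MATCH_RESULTS'      -- hypothetical source table
ORDER  BY column_id;

-- Keys and other constraints declared on the same table
SELECT c.constraint_name, c.constraint_type, cc.column_name
FROM   all_constraints c
JOIN   all_cons_columns cc
       ON  cc.constraint_name = c.constraint_name
       AND cc.owner           = c.owner
WHERE  c.table_name = 'MENS_MATCH_RESULTS'
ORDER  BY c.constraint_name, cc.position;

-- Data content: how big is the table, and how many distinct players appear in it?
SELECT COUNT(*)                  AS row_count,
       COUNT(DISTINCT player_id) AS distinct_players   -- hypothetical column
FROM   mens_match_results;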

The tables in the Tennis Statistics source system are split into Men's Tour and Ladies' Tour tables, along with some shared lookup tables, for example, Tennis Court Types. A section of the database covering Men's Tennis is shown in the following diagram:

(Diagram: Men's Tennis section of the source database)

Using the approach shown previously, we create the following table:

(Table: source system assessment of the Tennis Statistics tables)

Warehouse design

We can now draw up a design for the warehouse, with a star schema in mind. Our goals are to reduce joins (and therefore avoid snowflake designs) and to make the design flexible enough for any reporting we are likely to need. We will also keep in mind the rules laid out in the first section of this chapter.

Warehouse tables

Our initial analysis has resulted in six Dimension tables and two Fact tables. This is not set in stone; we need to be flexible enough to add more tables later if the report requirements dictate. There are no aggregate tables in place yet, because at this stage we do not know whether there are any performance issues. Given that we have designed a star schema, initial performance for a two-million-record table should be good:

Warehouse Table Name | Content                 | Type      | Granularity
---------------------+-------------------------+-----------+--------------------
W_PLAYER_D           | List of Players         | Dimension | Player
W_ROUND_D            | List of Rounds          | Dimension | Round
W_DATE_D             | List of Dates           | Dimension | Day
W_COURT_TYPE_D       | List of Court Types     | Dimension | Court Type
W_STADIUM_D          | List of Stadiums        | Dimension | Stadium
W_TOURNAMENT_D       | List of Tournaments     | Dimension | Tournament
W_MATCH_F            | Matches at a tournament | Fact      | Match, Player
W_TOURNAMENT_F       | Players at a tournament | Fact      | Tournament, Player

These tables can be arranged into the two following star schema layouts, with joins between the tables using the primary keys for the dimensions.

The match star schema

(Diagram: the match star schema)

Note the direct connection from the Tournament Dimension table (W_TOURNAMENT_D) to the Match Fact table (W_MATCH_F). If the Tournament dimension instead reached the Match fact only through another table, with no Tournament WID field on the Match fact, then we would have a snowflake design.

Note

WID is a column name suffix that was introduced to us in the Siebel days. It stands for Warehouse ID, and is used to name the field which is normally part of the key for a table. For example, DATE_WID would be used on the date table and would contain a warehouse derived ID. In the Oracle BI Applications databases, you often see ROW_WID as the primary key on the Dimension tables.
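
As a minimal sketch of how this convention carries into the DDL, one dimension and the match fact could look like the following. Only the table names come from the design above; the attribute and measure columns (PLAYER_NAME, HANDEDNESS, MATCH_WON_FLAG, and so on) are assumptions used for illustration.

CREATE TABLE w_player_d (
  row_wid      NUMBER        NOT NULL,   -- warehouse-generated surrogate key
  player_name  VARCHAR2(100),            -- assumed attribute
  handedness   VARCHAR2(10),             -- assumed attribute, e.g. 'LEFT' or 'RIGHT'
  CONSTRAINT w_player_d_pk PRIMARY KEY (row_wid)
);

CREATE TABLE w_match_f (
  row_wid         NUMBER NOT NULL,
  player_wid      NUMBER NOT NULL,   -- joins to W_PLAYER_D.ROW_WID
  tournament_wid  NUMBER NOT NULL,   -- direct join to W_TOURNAMENT_D keeps this a star
  round_wid       NUMBER NOT NULL,
  date_wid        NUMBER NOT NULL,
  court_type_wid  NUMBER NOT NULL,
  stadium_wid     NUMBER NOT NULL,
  match_won_flag  NUMBER(1),         -- assumed measure: 1 if this player won the match
  sets_won        NUMBER,            -- assumed measure
  sets_lost       NUMBER,            -- assumed measure
  CONSTRAINT w_match_f_pk PRIMARY KEY (row_wid)
);

Keeping every dimension's WID directly on the fact, including TOURNAMENT_WID, is what preserves the single-join star shape described above.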

The tournament star schema

(Diagram: the tournament star schema)

With the previous two star designs, we are now able to respond to a whole range of analyses, such as how many matches left-handed players win on hard courts in the Southern Hemisphere each year!
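
As an illustration of the kind of star join this supports, the left-handers analysis might translate into something like the following; the attribute columns (HANDEDNESS, COURT_TYPE, HEMISPHERE, CALENDAR_YEAR) and the MATCH_WON_FLAG measure are assumptions carried over from the earlier sketch.

-- How many matches do left-handed players win on hard courts
-- in the Southern Hemisphere each year? (sketch only)
SELECT d.calendar_year,
       SUM(f.match_won_flag) AS matches_won
FROM   w_match_f      f
JOIN   w_player_d     p ON p.row_wid = f.player_wid
JOIN   w_court_type_d c ON c.row_wid = f.court_type_wid
JOIN   w_stadium_d    s ON s.row_wid = f.stadium_wid
JOIN   w_date_d       d ON d.row_wid = f.date_wid
WHERE  p.handedness = 'LEFT'
AND    c.court_type = 'HARD'
AND    s.hemisphere = 'SOUTHERN'
GROUP  BY d.calendar_year
ORDER  BY d.calendar_year;

Every dimension is reached in a single join from the fact, which is exactly the behaviour the star design is meant to give us.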

Populating and tuning

Having designed your warehouse, it's now time to create the objects.

Creating the tables is straightforward and can be achieved using free tools such as SQL Developer.

You are now faced with choices over the use of indexes and partitions. The general rule of thumb is to create only what you need. If an index is not going to assist the reports, then do not create it. If a table has more than 10 million records, then it becomes a good candidate for partitioning. The latest versions of Oracle have a useful feature, interval partitioning, which creates new partitions automatically as data arrives and can make the whole process much easier to implement (make sure you have the right license, though!).
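
For example, the match fact could be created as an interval-partitioned table; this is a sketch only, assuming a MATCH_DATE column on the fact and that the Partitioning option is licensed.

-- Alternative, partitioned definition of the match fact:
-- Oracle creates a new yearly partition automatically as data arrives.
CREATE TABLE w_match_f (
  row_wid     NUMBER NOT NULL,
  player_wid  NUMBER NOT NULL,
  match_date  DATE   NOT NULL
  -- ...remaining WID and measure columns as in the earlier sketch
)
PARTITION BY RANGE (match_date)
INTERVAL (NUMTOYMINTERVAL(1, 'YEAR'))
( PARTITION p_before_2010 VALUES LESS THAN (DATE '2010-01-01') );

-- Index only what the reports will actually use, for example the player key:
CREATE BITMAP INDEX w_match_f_player_ix ON w_match_f (player_wid) LOCAL;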

Primary keys should be added where suitable, certainly on the Dimension tables, but also on the Fact tables, provided they do not slow down the loading process too much. If you are not able to enforce a unique key on the fact due to the sheer volume of data, then it is prudent to check that referential integrity to the dimensions is not broken; that is, make sure there are no references in the fact to dimension records that do not exist.
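
A check of this kind can be as simple as an anti-join per dimension key; the sketch below covers the player dimension and would be repeated for each WID column on the fact.

-- Fact rows that reference a player dimension record that does not exist;
-- zero rows returned means the rule holds for this dimension.
SELECT f.row_wid, f.player_wid
FROM   w_match_f f
WHERE  NOT EXISTS (SELECT 1
                   FROM   w_player_d d
                   WHERE  d.row_wid = f.player_wid);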

Monitoring and maintaining

Having created and populated our warehouse, the hard work begins! Once the Oracle BI tool starts to throw requests at our database, we will learn where the bottlenecks are. Monitoring the ETL and database statistics each day is vital to ensure that potential breakages are avoided. This includes understanding how the tables are growing each day, and what the impact is on our allocated disk space.
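
As a simple sketch, daily growth can be tracked from the data dictionary and optimizer statistics refreshed after each load; the W_ naming pattern follows the warehouse tables above.

-- Size of each warehouse table, largest first, to track growth against disk space
SELECT segment_name,
       ROUND(bytes / 1024 / 1024) AS size_mb
FROM   user_segments
WHERE  segment_type = 'TABLE'
AND    segment_name LIKE 'W\_%' ESCAPE '\'
ORDER  BY bytes DESC;

-- Refresh optimizer statistics on the busiest fact after the nightly load
BEGIN
  DBMS_STATS.gather_table_stats(ownname => USER, tabname => 'W_MATCH_F');
END;
/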

The latest version of Oracle Enterprise Manager (11g) has a substantial set of monitoring processes that makes it an invaluable tool in your armory.

Implement Usage Tracking in Oracle BI to understand where your customers are experiencing slow performance, and act on the information gathered. This could mean adding indexes, building aggregates, or even changing the way that OBI accesses the star.
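
For example, a simple aggregate built from the match fact (a sketch reusing the assumed MATCH_WON_FLAG measure) could serve tournament-level reports without scanning every match row; it would be refreshed as part of the nightly ETL and then mapped in Oracle BI as an additional source for the fact.

-- Hypothetical aggregate at Tournament/Player level, following the _A naming style
CREATE TABLE w_match_tournament_a AS
SELECT tournament_wid,
       player_wid,
       SUM(match_won_flag) AS matches_won,
       COUNT(*)            AS matches_played
FROM   w_match_f
GROUP  BY tournament_wid, player_wid;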

Usage tracking will be explained in detail later in this book.
