16.2. Data Warehousing and OLAP

Most commercial information systems are built to support heavy volumes of transactions by many users on a daily basis. Examples include banking, insurance, and order processing systems. These Online Transaction Processing (OLTP) systems typically require quick throughput for their largely predefined range of transactions, especially update transactions. To improve performance, historical data is often archived once it reaches a certain age, reducing the size of the data sets used for daily operations. A single organization may have several OLTP systems (e.g., purchasing, sales, inventory, customer records), possibly implemented using different kinds of DBMS or other software applications, and the coupling between such systems may be weak or even nonexistent.

Over time, businesses became aware that the collective information contained in their various systems had great potential for analyzing market trends and improving their business processes. However, their OLTP systems were unsuitable for executives to perform this task, given the poor performance and complex interface for ad hoc analysis queries (e.g., aggregated multitable joins and nested correlated subqueries). Moreover, insufficient integration or history made some queries simply impossible. Partly to address the problem of integrating data from prerelational systems for analysis purposes, IBM proposed the notion of an “information warehouse”. Although performance problems delayed acceptance of this idea for some years, a later proposal for a “data warehouse” by Bill Inmon (1993) was enthusiastically embraced by the business community, and nowadays most large companies have a data warehouse.

A data warehouse is an enterprisewide, integrated, historical database of information extracted from individual data sources for the purpose of supporting analysis of the business by management. During analysis it is read-only. Since the patterns and trends sought from the analysis tend to evolve slowly, and some imprecision is acceptable, updates are performed in bulk according to an agreed schedule (e.g., weekly).

The construction of a data warehouse for a large enterprise can be a lengthy task. To exploit the benefits as soon as possible, the data warehouse is often built iteratively, one subject area at a time. As subject areas are added to the data warehouse, they may be used to load data marts (see Figure 16.1). A data mart is a smaller “departmental warehouse” focused on one subject area, often containing more summarized and less detailed data than the data warehouse. For end users who perform their analysis within one subject area, a data mart provides a simpler model adapted to their needs, and the smaller data volume often leads to better performance.

Figure 16.1. Data is extracted into the data warehouse then loaded into data marts.


Many different approaches to data warehousing exist. Sometimes, data marts are built first and used to incrementally construct the data warehouse. However, if an analytical query may span multiple subject areas, it is critical that an overall enterprise architecture be in place to make the appropriate connections.

It has been argued that the data warehouse should be implemented as a fully normalized relational model, based directly on the enterprise data model, with no summary data, postponing any denormalization and aggregation to the data marts loaded from the warehouse (e.g., Moody and Kortink 2000). Current practice, however, usually does incorporate denormalized and summarized data in the data warehouse (Silverston et al. 1997). A nice overview of alternative approaches is provided by Jukic (2006).

Before extracting data to the data warehouse, the business analysis needs should be determined and the relevant data sources identified. The data sources may include operational databases as well as spreadsheets and legacy systems. Any details in the data sources irrelevant to the analysis needs should be removed. For example, the phone number and street address of a store are probably not of interest, but its city probably is. The remaining data now needs to be transformed to remove errors (applying integrity rules). For example, a gendercode field with many ‘F’ and ‘M’ entries might include a few instances of ‘D’, ‘G’, ‘J’, or ‘N’. Assuming that these are typographical errors, can you guess the intended letter? (Hint: look at the keyboard.)

We must also ensure that all facts of the same type are represented in the same way. As a trivial example, a customer’s birthdate might be recorded as a character string in a ‘DOB’ field in one source, and elsewhere in a ‘birthdate’ column based on a date data type. Once the data is “cleansed” it is transformed into a uniform representation in the data warehouse (typically a relational database).
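
As an illustrative sketch of such a cleansing step (the staging table and column names here are hypothetical, and the keyboard-based mapping is just one plausible reading of the hint above), the stray gender codes could be mapped to their intended values and the string-based birthdates converted to a true date type during the load:

-- Hedged sketch: Src_Customer, WH_Customer, and their columns are assumptions.
insert into WH_Customer (customerNr, gender, birthdate)
select custNr,
       case genderCode
            when 'D' then 'F'   -- D and G sit next to F on the keyboard
            when 'G' then 'F'
            when 'J' then 'M'   -- J and N sit next to M
            when 'N' then 'M'
            else genderCode
       end,
       cast(DOB as date)        -- DOB held as a character string in this source
from Src_Customer;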

To facilitate the analysis of historical trends, appropriate temporal data should be included, at the desired granularity (e.g., daily, weekly, or monthly). For example, suppose an operational data source stores the following fact type: Employee manages Store. Over time, a store may have different managers. To retain history of these changes, when store management facts are loaded into the data warehouse, the load date may be inserted into the key, to populate the historical fact type: Employee on Date managed Store. Basic aspects of temporal modeling were discussed in Section 10.3.
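
A minimal sketch of such a load step, assuming hypothetical source and warehouse table names, simply stamps each incoming management fact with the load date, so that the store and date together form the key of the historical fact type:

-- Hedged sketch: table and column names are assumptions for illustration.
insert into WH_StoreManagement (storeNr, loadDate, empNr)
select storeNr, current_date, empNr   -- Employee on Date managed Store
from   Src_StoreManager;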

To improve query performance, data marts (and usually the data warehouse) often contain derived data and denormalized structures. As a simple example of derived data, suppose an operational source stores the fact type Customer was born on Date. For demographical analysis, we may be interested in how product preferences are influenced by the age of customers. In this case it may be more appropriate in a data mart to store the age, or even just the age group, of the customer rather than their birthdate, using a fact type such as Customer on Date belonged to AgeGroup. The snapshot dates are inserted when the fact type is updated incrementally.
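
A hedged sketch of that incremental update (the data mart table, its columns, and the age bands are assumptions, not part of the model) derives an approximate age from the operational birthdate and bands it as each snapshot is loaded:

-- Hedged sketch: DM_CustomerAgeGroup and the age bands are illustrative only.
insert into DM_CustomerAgeGroup (customerNr, snapshotDate, ageGroup)
select customerNr,
       current_date,
       case   -- approximate age in years (ignores month and day)
            when extract(year from current_date) - extract(year from birthdate) < 30 then 'Under 30'
            when extract(year from current_date) - extract(year from birthdate) < 50 then '30-49'
            else '50+'
       end
from Src_Customer;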

A more typical example incorporating both derivation and denormalization is a data mart for analyzing Sales trends. An ORM conceptual model of such a mart (simplified) is shown in Figure 16.2. In this UoD, the company makes sales from many stores, located in various cities within the same country (e.g., the United States).

Figure 16.2. A conceptual schema for the Sales data mart (derivation rules omitted).


A city is identified by combining its name and state (e.g., Portland, Oregon differs from Portland, Maine). States are grouped into regions (e.g., Oregon and Washington belong to the Northwest region). Items sold have a unique code and title, and belong to a category (e.g., Developer Tools). A line item is identified by its line number on a given invoice, and records the sale of an item, as well as the quantity and unit price for that item. The line total for each line item is derived by multiplying the quantity by the unit price and is then stored. To support sales analysis over time, the month number, quarter number, and year for each date are derived and stored. Although calendar years are used here, we could use fiscal years instead or as well. In Figure 16.2, fact types that are derived and stored are marked “**”, but for simplicity the derivation rules are omitted.

Using the default Rmap algorithm, this conceptual schema maps to the following six table schemes: LineItem (invoiceNr, lineNr, itemCode, quantity, unitPrice, lineTotal); Invoice (invoiceNr, saleDate, storeNr); Item (itemCode, itemTitle, category); Store (storeNr, stateCode, cityName); StateLocation (stateCode, region); TimeDimension (saleDate, saleYear, quarterNr, monthNr).

However, to improve query performance it is decided to denormalize the relational schema to four table schemes, as shown in Figure 16.3. The Item and TimeDimension tables are normalized, but the Sale and Store tables are not. The Sale table is denormalized to 1NF, since saleDate and storeNr are functionally dependent on invoiceNr, which is just part of the primary key. The Store table is denormalized to 2NF since region is functionally dependent on stateCode, a nonkey attribute.

Figure 16.3. A denormalized, relational star schema for the model in Figure 16.2.
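
A hedged SQL sketch of these four tables follows (table and column names as in Figure 16.3; the data types, sizes, and constraint details are assumptions):

-- Sketch of the Figure 16.3 star schema; data types and sizes are assumptions.
create table Item (
    itemCode    varchar(15)   primary key,
    itemTitle   varchar(60)   not null unique,
    category    varchar(30)   not null );

create table Store (
    storeNr     int           primary key,
    cityName    varchar(40)   not null,
    stateCode   char(2)       not null,
    region      varchar(20)   not null );  -- embedded FD: stateCode -> region

create table TimeDimension (
    saleDate    date          primary key,
    saleYear    int           not null,    -- derived from saleDate and stored
    quarterNr   int           not null,
    monthNr     int           not null );

create table Sale (
    invoiceNr   int           not null,
    lineNr      int           not null,
    itemCode    varchar(15)   not null references Item,
    saleDate    date          not null references TimeDimension,
    storeNr     int           not null references Store,
    quantity    int           not null,
    unitPrice   decimal(9,2)  not null,
    lineTotal   decimal(11,2) not null,    -- derived as quantity * unitPrice and stored
    primary key (invoiceNr, lineNr) );     -- embedded FD: invoiceNr -> saleDate, storeNr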


The decision to denormalize in this way is indicated by annotating the conceptual schema, as shown in Figure 16.2. Here the key object types are shaded. An object type is a key object type if and only if its preferred identification scheme is used as the primary key of a table. Graphically, each key object type forms the root of a tree, where each node is an object type and each edge is a functional (n:1 or 1:1) predicate.

For example, from the Store object type we run down functional chains to the leaf object types CityName and Region, gathering all the fact types on the way to group them into a single table based on the identifier for Store.

A functional chain stops if it runs into a key object type (or a leaf or a nonfunctional predicate). For example, starting at LineItem we gather up all its functional fact types, as well as those for Invoice, but we cannot proceed past Date, Store, or Item, since these are key object types. This leads to the Sale table in Figure 16.3.

The denormalized Sale and Store tables contain embedded functional dependencies (e.g., stateCode → region), but there is no need to enforce these because they have already been enforced in the operational tables from which the data mart is derived. Since the operational tables are used for base updates, and not the data mart, it is acceptable to denormalize the data mart in this way. Reducing the number of tables eliminates the need for many joins, leading to faster queries.
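
As a rough illustration of the join savings (a sketch using the table names given earlier), a query for units sold per store reads a single table in the data mart, whereas the normalized schema would need a join between LineItem and Invoice:

-- Data mart (denormalized): storeNr is carried on the Sale table itself.
select storeNr, sum(quantity) as nrUnits
from   Sale
group by storeNr;

-- Normalized schema: the same question requires a join.
select i.storeNr, sum(li.quantity) as nrUnits
from   LineItem as li
       join Invoice as i on i.invoiceNr = li.invoiceNr
group by i.storeNr;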

The schema in Figure 16.3 is composed of a central table (Sale) linked by foreign key connections to outer tables (Item, Store, and TimeDimension). This is called a star schema, since the central table may be viewed as the center of a star pattern, with its outer tables becoming “points of the star”. In data warehousing terminology, the central table is called a “fact table” and the outer tables are dimension tables. Since all tables contain facts, the term “fact table” is rather inappropriate here. Moreover, ORM uses the term “fact table” to mean an atomic fact table. To avoid confusion, we’ll use the more descriptive “central table” instead of the more popular “fact table” in this context.

Some approaches require the primary key of the central table to include all the keys of its dimension tables. This would require the (invoiceNr, lineNr) key of the Sale table in Figure 16.3 to be expanded to the superkey (invoiceNr, lineNr, itemCode, saleDate, storeNr). From a purely logical standpoint this is not required, since joins can be made on nonkey attributes.

If some dimension tables are themselves used as central tables for other stars, the overall schema is called a “snowflake schema”. A set of star schemas with shared dimension tables is sometimes called a “galaxy”.

Data warehouses and data marts are used for online analytical processing (OLAP) and data mining. The term “OLAP” was introduced by Edgar Codd to describe interactive analysis of dimensioned and aggregated data for decision support (Codd et al. 1993). Data mining involves deeper analysis of the data, typically using sophisticated statistical techniques and complex algorithms for detecting patterns. Nowadays, many tools for OLAP and data mining are in use. Either topic deserves a book in itself. The remainder of this section provides a brief overview of OLAP.

There are three main approaches to OLAP. Each uses base data as well as aggregated data (e.g., sales figures might be summed and grouped at various levels). Multidimensional OLAP (MOLAP) stores both base and aggregated data in multidimensional structures rather than tables. Relational OLAP (ROLAP) stores both base and aggregated data in relational tables. Hybrid OLAP (HOLAP) stores base data in relational tables and aggregated data in multidimensional structures. Some DBMSs, such as Microsoft SQL Server, support all three kinds of OLAP.

The multidimensional structures used for OLAP are popularly known as cubes. In geometry, a cube is a three-dimensional box structure. In OLAP theory, a cube can have as many dimensions as you like. Cubes provide an intuitive way to visualize and browse data, as well as fast access to aggregate data. Let’s consider a simple example.

With reference to the star schema in Figure 16.3, suppose we wanted to list the number of units sold in the years 2007 through 2009 for each geographic region and each item category. As an exercise you might like to formulate the SQL query for this (one possible formulation is sketched after Table 16.1). You need to perform the natural join of the four tables, group by saleYear, region, and category, and compute sum(qty). An extract of a possible result is shown in Table 16.1, assuming only two categories (SW = Software, HW = Hardware) and four regions (N = North, S = South, E = East, W = West). The fact type underlying this table is the quaternary: Year in Region had sales of items of Category in NrUnits. The full table display of the result would include 24 rows (8 for each year); only the first 9 rows are shown here.

Table 16.1. Units sold.

Year   Region   Category   NrUnits
2007   N        SW           10000
2007   N        HW             500
2007   S        SW           12000
2007   S        HW             330
2007   E        SW            7500
2007   E        HW             440
2007   W        SW           12000
2007   W        HW             350
2008   N        SW           14500
...
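
One possible formulation of that query, sketched with the Figure 16.3 table and column names (qty above abbreviates the quantity column), is:

select t.saleYear, st.region, i.category, sum(s.quantity) as nrUnits
from   Sale as s
       join Item as i on i.itemCode = s.itemCode
       join Store as st on st.storeNr = s.storeNr
       join TimeDimension as t on t.saleDate = s.saleDate
where  t.saleYear between 2007 and 2009
group by t.saleYear, st.region, i.category
order by t.saleYear, st.region, i.category;

With the normalized six-table schema given earlier, the same result would need two additional joins (LineItem to Invoice, and Store to StateLocation).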

An alternative way of displaying the data using a cube is shown in Figure 16.4. Here the independent variables (year, region, and category) appear as the dimensions making up the edges of the cube, and the values for the dependent variable or measure (units sold) appear in the relevant cells. Only the software sales figures are shown here in the 12 cells making up the front half of the cube. You can imagine seeing the hardware sales figures by rotating the cube to see its other side.

Figure 16.4. Cube depiction of units sold (only the software figures are shown here).


OLAP cubes can be much more complex than this example. For any set of independent dimensions, there may be more than one measure (e.g., units sold, and revenue). These different measures are often collectively referred to as the Measures dimension. Moreover, each independent dimension typically has a hierarchy of levels. For example, a Location dimension might have regions at its top level, composed of states at the second level, with cities at the third level, and stores at the fourth level. Similarly a Time dimension may be decomposed into Years, then Quarters, then Months, and then Days. Finally, the Item dimension may be decomposed into categories and then items.

So our data mart example can be used to construct a cube with three independent, hierarchical dimensions (Location, Time, and Item) and one dependent dimension for UnitsSold and Revenue measures. The neat thing about the cube structure is that it enables aggregate values for the measures to be efficiently stored and accessed for all levels of the hierarchies. When analyzing a cube, you can choose to consolidate or roll up these aggregates (e.g., roll up sales figures for cities to regional sales figures). You can also do the opposite, drilling down to a finer level of granularity. Moreover, you can slice and dice the cube whichever way you like by taking a subcube of it (e.g., if you restrict the item category in Figure 16.4 to software, you get the front slice of the cube).
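
In a ROLAP setting, a similar consolidation can be expressed with the ROLLUP grouping introduced in SQL:1999 (mentioned again at the end of this section). A hedged sketch over the Figure 16.3 tables, rolling unit sales up the Location hierarchy:

-- Sketch: subtotals per city, state, and region, plus a grand total.
select st.region, st.stateCode, st.cityName, sum(s.quantity) as nrUnits
from   Sale as s
       join Store as st on st.storeNr = s.storeNr
group by rollup (st.region, st.stateCode, st.cityName);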

Because of such advantages of the cube structure, star schemas in data marts are often used to create a variety of cubes for easy analysis. SQL Server provides wizards to simplify the task of cube creation and also supports the Multidimensional Expressions (MDX) language for querying cubes.

For example, assume SalesCube is based on the hierarchical Location, Time and Item dimensions discussed earlier. The following MDX query will result in a two-dimensional grid that lists the software units sold and revenue in Washington State in each of the first three quarters of 2007.

select { [Measures].[UnitsSold], [Measures].[Revenue] } on columns,
       { [Time].[2007].[Q1] : [Time].[2007].[Q3] } on rows
from SalesCube
where ([Item].[SW], [Location].[N].[WA])

In MDX, braces are used to delimit sets of members or tuples, and square brackets may be used to delimit identifiers or values. The dot notation is used to move down a hierarchy. Notice that Time is drilled down to quarters and Location is drilled down to states. A colon “:” is used to indicate a range (like “..” in English). The result grid from this query has two columns for units sold and revenue, and three rows for the first three quarters of 2007. A full coverage of MDX would require another chapter or two, so it is not pursued further here. An extensive treatment of MDX and OLAP for SQL Server is provided by Thomsen et al. (1999).

Different OLAP tools work differently. SQL:1999 onwards includes support for cube and rollup aggregation (GROUP BY CUBE and GROUP BY ROLLUP), so it is hoped that some OLAP features will become more standardized in the future. MDX queries can be very complex, and Microsoft SQL Server provides a very high-level language called English Query that can be used by end users to generate MDX queries. The topic of natural language-based queries is discussed in the next section.
