Understanding Data Warehousing

Suppose that you are the president of a small, new company. Your company needs to grow, but you have limited resources to support the expansion. You have decisions to make, and to make those decisions you must have particular information. Much of the information you need comes from outside the organization. That’s why you read the Wall Street Journal and keep a bookmark in your browser pointed at www.bloomberg.com. But much of the information you need also comes from inside the organization, and much of that information is numerical. A data warehouse is a tool for storing and analyzing numerical information.

Ralph Kimball, perhaps the preeminent apostle of data warehousing philosophy, describes a data warehouse simply as “the place where people can access their data.” He lists six critical requirements for a data warehouse:

  • Warehouse retrievals must be fast.

  • Warehouse values must be internally consistent.

  • Users must be able to slice and dice—that is, extract a single item (slice) and compare items in a cross-tabulated table (dice)—warehouse data.

  • A warehouse must include easy-to-use browsing tools.

  • Warehouse data must be complete and reliable.

  • Quality warehouse data requires quality data-gathering processes.

Analysis Services can play an integral part in the data warehousing strategy of a corporation, but it does not satisfy all the needs of a data warehouse. In fact, Analysis Services facilitates only two of the six requirements: Analysis Services does help make retrieval from a warehouse fast (Requirement 1), and it does make it easier to slice and dice the data (Requirement 3). Analysis Services does not directly provide easy-to-use browsing tools (Requirement 4)—although, as described in Chapter 5, “Office 2000 Analysis Components,” you can use the Microsoft Excel 2000 PivotTable report tool as a simple browser for an Analysis Services warehouse.

Kimball’s remaining three requirements—all of which have to do with making the data in the warehouse consistent, reliable, and trustworthy—must be done before your use of Analysis Services enters the picture. In other words, Analysis Services assumes that you already have a valid, functional data warehouse created using a relational database management system such as SQL Server.

Analysis Services creates a new layer on top of an existing relational warehouse. The purpose of that new layer is to make access to the data very fast and very flexible (two of Kimball’s data warehouse requirements).

Note

Even though Analysis Services is bundled with SQL Server 2000, it does not require that you use SQL Server as your relational database management system. You can use Analysis Services with Oracle, DB2, Informix, or any other relational database tool, including Microsoft Access.


The Purpose of a Data Warehouse

A data warehouse stores stable, verified data values. You might find it helpful to compare a data warehouse with a transaction database:

  • A transaction database helps people carry out activities, and a data warehouse helps people make decisions. For example, a transaction database might show which seats are available on an airline flight so that a travel agent can book a new reservation. A data warehouse, on the other hand, might show the historical pattern of empty seats by flight so that an airline manager can decide whether to adjust flight schedules in the future.

  • A transaction database is volatile; its information constantly changes as new orders are placed or cancelled, as new products are built or shipped, or as new reservations are made. A data warehouse is stable; its information is updated at standard intervals—perhaps monthly, weekly, or even hourly—and, in an ideal world, an update would add values for the new time period only, without changing values previously stored in the warehouse.

  • A transaction database focuses on the details: the travel agent booking a flight reservation doesn’t want to know the average number of empty seats, and the parent purchasing the latest Harry Potter book doesn’t care about inventory levels for the Juvenile Fiction product line. A data warehouse focuses on high-level aggregates: the manager updating flight schedules doesn’t care which specific seats were empty but does want to see the big picture. The implication of this difference is that the key values in a data warehouse must be numeric values that can be summarized.

  • A transaction database typically provides the data values that are then stored in a data warehouse.

Dimensions in Data Analysis

In the world of data warehousing, a summarizable numerical value that you use to monitor your business is called a measure. When looking for numerical information, your first question is which measure you want to see. You could look at, say, Sales Dollars, at Shipment Units, at Defects Per Hour, or at Ad Campaign Responses. Suppose that you ask to see a report of your company’s Units Sold. Here’s what you get:

113

Looking at the one value doesn’t tell you much. You want to break it out into something more informative. For example, how has your company done over time? You ask for a monthly analysis, and here’s the new report:

JanuaryFebruaryMarchApril
14413325

Your company has been operating for four months, so across the top of the report you find four labels for the months. Rather than the one value you had before, you now find four values. The months subdivide the original value. The new number of values equals the number of months. This is analogous to calculating linear distances in the physical world: the length of a line is simply the length.

You’re still not satisfied with the monthly report. Your company sells more than one product. How did each of those products do over time? You ask for a new report by product and by month:

 JanuaryFebruaryMarchApril
Colony Blueberry Muffins  617
Colony Cranberry Muffins61668
Sphinx Bagels82521 

Your young company sells three products, so down the left side of the report are the three product names. Each product subdivides the monthly values. Meanwhile, the four labels for the months are still across the top of the report. You now have 12 values to consider. The number of values equals the number of products times the number of months. This is analogous to calculating the area of a rectangle in the physical world: area equals the rectangle’s length times its width. The report even looks like a rectangle.

The Term OLAP in Analysis Services

The previous version of Analysis Services was named OLAP Services. The name was changed in SQL Server 2000 to avoid the use of an unfamiliar acronym and also because Analysis Services now includes data mining tools, which are outside the realm of OLAP technology. The term OLAP still appears frequently in the product documentation, and you should understand what it means. Understanding the history of the term can help you understand its meaning.

In the 1980s, E. F. Codd coined the term online transaction processing (OLTP) and proposed 12 criteria that define an OLTP database. His terminology and criteria became widely accepted as the standard for databases used to manage the day-to-day operations (transactions) of a company. In the 1990s, Codd came up with the term online analytical processing (OLAP) and again proposed 12 criteria to define an OLAP database. This time his criteria did not gain wide acceptance, but the term OLAP did, seeming perfect to many for describing databases designed to facilitate decision making (analysis) in an organization.

Some people use OLAP simply as a synonym for data warehousing. Usually, however, the term OLAP describes specialized tools that make warehouse data easily accessible. It is in that specialized sense that the term OLAP is used in SQL Server 2000 Analysis Services.


The comparison to a rectangle, however, applies only to the arithmetic involved, not to the shape of the report. Your report could be organized differently—it could just as easily look like this:

Colony Blueberry MuffinsJanuary 
Colony Blueberry MuffinsFebruary 
Colony Blueberry MuffinsMarch6
Colony Blueberry MuffinsApril17
Colony Cranberry MuffinsJanuary6
Colony Cranberry MuffinsFebruary16
Colony Cranberry MuffinsMarch6
Colony Cranberry MuffinsApril8
Sphinx BagelsJanuary8
Sphinx BagelsFebruary25
Sphinx BagelsMarch21
Sphinx BagelsApril 

Whether you display the values in a list like the one above (where the numerical values form a line) or display them in a grid (where they form a rectangle), you still have the potential for 12 values if you have four monthly values for each of three products. Your report has 12 potential values because the products and the months are independent. Each product gets its own sales value—even if that value is zero—for each month.

Back to the rectangular report. Suppose that your company sells in two different states and you’d like to know how each product is doing each month in each state. Add another set of labels indicating the states your company uses, and you get a new report, one that looks like this:

  JanuaryFebruaryMarchApril
WAColony Blueberry Muffins  310
 Colony Cranberry Muffins3166 
 Sphinx Bagels4166 
ORColony Blueberry Muffins  37
 Colony Cranberry Muffins3  8
 Sphinx Bagels4915 

The report now has two labels for the states, three labels for products (each shown twice), and four labels for months. It has the potential for showing 24 values, even if some of those value cells are blank. The number of potential values equals the number of states times the number of products times the number of months. This is analogous to calculating the volume of a cube in the physical world: volume equals the length of the cube times its width times its height. Your report doesn’t really look like a cube—it looks more like a rectangle. Again, you could rearrange it to look like a list, and the beginning of the list would look like this:

WAColony Blueberry MuffinsJanuary 
WAColony Cranberry MuffinsJanuary3
WASphinx BagelsJanuary4
ORColony Blueberry MuffinsJanuary 
ORColony Cranberry MuffinsJanuary3
ORSphinx BagelsJanuary4
WAColony Blueberry MuffinsFebruary 
WAColony Cranberry MuffinsFebruary16
WASphinx BagelsFebruary16

Whichever way you lay out your report, it has three independent lists of labels, and the total number of potential values in the report equals the number of unique items in the first independent list of labels (for example, two states) times the number of unique items in the second independent list of labels (three products) times the number of unique items in the third independent list of labels (four months). Because the phrase independent list of labels is wordy, and because the arithmetic used to calculate the number of potential values in the report is identical to the arithmetic used to calculate length, area, and volume—measurements of spatial extension—in place of independent list of labels, data warehouse designers borrow the term dimension from mathematics. Remember that this is a borrowed term. A data analysis dimension is very different from a physical dimension. Thus, your report has three dimensions—State, Product, and Time—and the report’s number of values equals the number of items in the first dimension times the number of items in the second dimension, and so forth. Using the term dimension doesn’t say anything about how the labels and values are displayed in a report or even about how they should be stored in a database.

Each time you’ve created a new dimension, the items in that dimension have conceptually related to one another—for example, they are all products, or they are all dates. Accordingly, items in a dimension are called members of that dimension.

Now complicate the report even more. Perhaps you want to see dollars as well as units. You get a new report that looks like this:

  JanuaryFebruaryMarchApril
U$U$U$U$
WAColony Blueberry Muffins    37.441024.80
 Colony Cranberry Muffins37.951642.40615.90  
 Sphinx Bagels47.321629.28610.98  
ORColony Blueberry Muffins    3  7.44   717.36
 Colony Cranberry Muffins37.95       821.20
 Sphinx Bagels47.32   916.471527.45  
U=Units; $=Dollars

Because units and dollars are independent of the State, Product, and Time dimensions, they form what you can think of as a new, fourth dimension, which you could call a Measures dimension. The number of values in the report still equals the product of the number of members in each dimension: 2 times 3 times 4 times 2, which equals 48. But there is not—and there does not need to be—any kind of physical world analogue. Remember that the word dimension is simply a convenient way of saying independent list of labels, and having four (or twenty or sixty) independent lists is just as easy as having three. It just makes the report bigger.

In the physical world, the object you’re measuring changes depending on how many dimensions there are. For example, a one-dimensional inch is a linear inch, but a two-dimensional inch is a square inch, and a three-dimensional inch is a cubic inch. A cubic inch is a completely different object from a square inch or from a linear inch. In your report, however, the object that you measure as you add dimensions is always the same: a numerical value. There is no difference between a numerical value in a “four-dimensional” report and a numerical value in a “one-dimensional” report. In the reporting world, an additional dimension simply creates a new, independent way to subdivide a measure.

Although adding a fourth or fifth dimension to a report is not a metaphysical act, that’s not to say that adding a new dimension is trivial. Suppose that you start with a report with two dimensions: 30 products and 12 months, or 360 possible values. Adding three new members to the product dimension increases the number of values in the report to 396, a 10 percent increase. Adding a third dimension with three new members, however, increases the number of values in the report to 1080, a 300 percent increase. Consider this extreme example: with 128 members in a single dimension, a report has 128 possible values, but with those same 128 total members split up into 64 dimensions—with two members in each dimension—a report has 18,446,744,073,709,551,616 possible values!

Hierarchies in Data Analysis

When an organization is small or hasn’t been around for a long time, it’s easy to understand the business simply by looking at the detailed numbers. For example, if you have a company that has been in business for only four months and it has only three products, you can comfortably analyze data at the lowest level of detail. Psychological studies indicate that most people can easily comprehend about seven items—or seven groups of items. Grouping—aggregating—is the way that humans deal with numerous items. Once your company has sold products for more than six months, you’ll undoubtedly want to start looking at the values by quarter as well as by month. Likewise, once your company has more than a dozen products, you’ll probably want to group the products into product lines or product groups. But how do aggregations such as quarters and product lines fit into dimensions?

Generally, you think of members in a dimension as “belonging together.” January and February naturally seem to belong together and clearly should reside in the same dimension. January and Colony Blueberry Muffins don’t naturally belong together and clearly should not reside in the same dimension. But what about the members January and Qtr1? Do they belong together?

Remember that a dimension is really an independent list of labels for the report. To decide whether new members belong in a new dimension or in an existing dimension, imagine the new members as the column headings of a report, with the members of the existing dimension forming the row headings. If the new members are independent, you should—at least potentially—have a value each time they intersect. But look at a report that shows months on the rows and quarters on the columns:

 Qtr1Qtr2
January14 
February41 
March33 
April 25
May 29
June 39

Half the cells are empty, and it’s not coincidental. There is no such thing as a January in Qtr2, just as there’s no April in Qtr1. The report looks silly. Putting the two quarters in a dimension other than the Time dimension multiplies the total number of values by two but also guarantees that half of the values will always be empty. So the number of true potential values has not changed. In fact, the report with quarters on the columns never shows you what you want to see, which is the total for each quarter.

The report you want looks more like this:

Qtr188
January14
February41
March33
Qtr293
April25
May29
June39

Months and quarters are not completely independent members and should not appear in separate dimensions. The quarter totals are simply aggregations of the month totals, and they belong in the same dimension. There is, however, something different between Qtr1 and January. For one thing, you probably want the January label to be indented more than the Qtr1 label. And there’s something similar about the labels Qtr1 and Qtr2.

Even though the words Month and Quarter don’t appear in the report, you naturally refer to the labels January, February, and so forth as months, and you refer to the labels Qtr1, Qtr2, and so forth as quarters. Months and quarters form a hierarchy within the Time dimension, and each degree of summarization is referred to as a level. For example, in this Time dimension, January and February are members of the Month level, and Qtr1 and Qtr2 are members of the Quarter level. As time goes on and you add more months and quarters, you’ll eventually add a Year level to the dimension’s hierarchy. A dimension containing more than a few members almost always breaks into a hierarchy, and a hierarchy, by definition, contains levels.

When you have a hierarchy in a dimension, sometimes you’ll want to see the entire hierarchy, sometimes you’ll want to see only the top one or two levels, and sometimes you’ll want to see only the lowest level. You can use the term members to describe either all the members in the entire dimension or only the members of a specific level within the dimension. For example, the members of the Time dimension include years and quarters as well as months, and the members of the Months level within the Time dimension do not include any years or quarters. The members at the lowest level of detail are called leaf members. A dimension cannot exist without leaf members, but it is possible to have a dimension with nothing but leaf members—that is, with only one level. For example, in the Measures dimension, it doesn’t make sense to sum the total of Units and Dollars.

Some hierarchies, such as Time, are balanced: if there are months under Qtr1, there will also be months under Qtr2, Qtr3, and Qtr4. In a balanced hierarchy, it’s easy to give names to levels. For example, the levels in a typical Time hierarchy might have the names Year, Quarter, and Month.

Some hierarchies are unbalanced. An organization chart is often unbalanced. For example, in many companies, there might be many more people—and thus many more levels of management—in the Manufacturing organization than in Human Resources. In an unbalanced hierarchy, it’s often difficult to give names to specific levels, but leaf members are always the ones that have no children below them.

Some hierarchies appear to blur the distinction between unbalanced and balanced. For example, in a geographic hierarchy, you might have easily named levels—Country, Region, and State—but skip the Region level for certain States. This really is a balanced hierarchy (because there are easily named levels), but the parents of some of the members are missing or invisible. A hierarchy that hides some of the parent members is called a ragged hierarchy.

Analysis Services gives you a great deal of flexibility in defining balanced or unbalanced hierarchies, whether ragged or not. A dimension always has leaf members. The hierarchy simply defines how (and whether) the values for leaf members are summarized.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.133.87.156