CHAPTER 14


Reporting Design

Great questions make great reporting.

—Diane Sawyer

People use reporting in every aspect of their daily lives. Reports provide the local weather, the status of the morning train, and even the best place to get a cup of joe. And this is all before getting to work! At work, reports include enterprise dashboards, system outage reports, and timesheet cards. Any way you slice it, reporting is something that no one can get away from.

Wouldn’t it be nice if you could understand how that reporting works? How the reports get put together and how the information is stored underneath the covers? This chapter will show you how that happens, by discussing two types of reporting you may encounter: analytical and aggregation. You will learn more about each area, including the types of dimensional modeling and how to get started on your own dimensional model. You will also create an aggregation model. Finally, you will look at common queries that can be used to retrieve information from each of these models.

Keep in mind that there are entire sets of books dedicated just to reporting, so you won’t learn everything here, but hopefully, you’ll learn enough to get you started on your own reporting solution.

Reporting Styles

Reporting can mean different things to different people and can be used in different ways. Reporting can lean toward the analytical, or even be used as an aggregation engine to display a specific set of information. Although each of these purposes is a valid reporting option, each has a unique way of storing information to ensure it can quickly and accurately satisfy its purpose and provide the correct information.

I’ve often debated with coworkers, clients, and friends about the best way to store data in a database. While methodologies differ, the overarching viewpoints revolve around two focuses: data in and data out, or, as Claudia Imhoff put it, “Are You an Inny or an Outty?” ( http://www.information-management.com/issues/19990901/1372-1.html ). Innies are people who are most interested in storing data in tables in the quickest, most efficient way. Outties, on the other hand, are most interested in making the output of the data as simple as possible for the end user. Both types of people can be involved in relational or data warehouse design. I unabashedly admit to being an outty. My entire goal in storing data is to make it easier for someone to get it out. I don’t care if I need to repeat values, break normalization practices, or create descriptive column names that could compete with an anaconda for length.

The reporting styles discussed in this chapter, which I use to handle these reporting situations, are:

  • Analytical
  • Aggregation

Each modeling paradigm has its own particular way of modeling the underlying database and storing the data. We will dig further into each of these reporting styles in this chapter.

Analytical Reporting

Probably the best-known method for storing data for analytical reporting is a dimensional model. Analytics focuses on two areas: understanding what happened and forecasting or trending what will happen. Both of these areas and questions can be solved using a dimensional model.

Dimensional modeling is a very business-focused type of modeling. Knowing the business process behind how the data is used is important to modeling the data correctly. The same set of information can actually be modeled in two different ways, based on how the information is used! In dimensional modeling, denormalization is a good thing. You will see how to create a dimensional model later in this chapter.

Dimensional models are typically used in data warehouses or datamarts. There are two leading methodologies for creating a dimensional model, led by two amazing technologists: Ralph Kimball and Bill Inmon. There are others as well, but we will focus on these two methodologies. While I typically use an approach more like Kimball’s, both methodologies have benefits and can be used to create a successful dimensional model and analytical reporting solution.

Ralph Kimball

Ralph Kimball’s approach to data warehousing and dimensional modeling is typically known as a bottom-up approach. This name signifies a bus architecture, in which a number of datamarts are combined using shared, conformed dimensions to create an enterprise data warehouse. Each datamart (and in turn, the data warehouse) is created using a star or snowflake schema, which contains one or more fact tables linked to many dimensions.

To create the bus architecture, subject-oriented datamarts are created that contain the business logic for a particular department. Over time, new datamarts are created, using some of the same entities or dimensions as the original datamart. Eventually, the datamarts grow to create a full enterprise data warehouse.

For more information on Ralph Kimball’s approach, see http://www.kimballgroup.com .

Bill Inmon

The other leading data warehouse methodology is championed by Bill Inmon and is typically known as a top-down approach. In this scenario, “top-down” means starting with an enterprise data warehouse and creating subject area datamarts from that data warehouse. The enterprise data warehouse is typically in third normal form, which was covered in Chapter 5, while the datamarts are typically in a dimensional format.

Inmon also created the concept of a corporate information factory, which combines all organization systems, including applications and data storage, into one cohesive machine. Operational data stores, enterprise data warehouses, and data management systems are prevalent in these systems.

See http://www.inmoncif.com for additional information on Bill Inmon’s approach.

Aggregation Reporting

A hybrid of analytical and relational reporting, aggregation reporting combines the best of both worlds to create a high-performing set of information in specific reporting scenarios. This information is used for some analysis and is also used operationally. By creating a set of aggregated values, report writers can quickly pull exactly the information they need.

Modeling summary tables typically occurs when a report that uses an aggregated value becomes too slow for the end users. This could be because the amount of data being aggregated has increased or because the end users are doing something that wasn’t originally planned for. To fix the slow reporting situation, summary tables can be created to speed up the reports without having to create a whole data warehouse.

Some scenarios where it may make sense to create summary tables include rolling up data over time and looking at departments or groups within a company where you can report on multiple levels. Aggregation reporting is especially useful when there is a separate reporting database whose resources can be dedicated to queries and reports.
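To make this concrete, the following is a minimal sketch of what a summary table might look like; the agg schema and all of the names are illustrative, and we will build out real models later in the chapter.

 -- Illustrative only: a summary table that rolls daily sales detail
 -- up to the month level so reports do not aggregate detail rows
 CREATE SCHEMA agg
 GO
 CREATE TABLE agg.MonthlySales
 (
  YearValue INTEGER NOT NULL,
  MonthValue INTEGER NOT NULL,
  DepartmentName VARCHAR(50) NOT NULL,
  SalesAmount DECIMAL(12,2) NOT NULL,
  CONSTRAINT PK_MonthlySales PRIMARY KEY CLUSTERED
  (YearValue ASC, MonthValue ASC, DepartmentName ASC)
 )
 GO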

Requirements-Gathering Process

Before starting to do any modeling, it is essential to gather the requirements for the end solution. In this case, the word “requirements” doesn’t mean how the end solution should look or who should have access to it. “Requirements” refers to who will use the information and how they will use it, particularly the business processes involved and the results of using the data.

By talking to numerous members of the business, executive team, and key stakeholders, you will learn how the business works and what is valued. The corporate strategy and goals provide an important aspect of your requirements: they ensure you are on the same page as the company and that your analytics can contribute to the high-level goals of the organization. At this time, I like to review the existing technical processes and data model. Once you understand the business inside and out, you will be able to see how the current infrastructure does or does not support the business.

Here, my system for gathering requirements veers off from typical processes. I like to create the initial data model as part of the requirements process. I find that I dig more deeply into understanding the relationship between entities, attributes, and processes if I am trying to put it into a data model. Then, if I can explain the data model in terms of a business process to the business folk, I am golden!

A recap of the steps listed in this section is shown in Figure 14-1.

image

Figure 14-1. Requirements gathering steps diagram

Whether you require an analytical or an aggregation reporting solution, you will follow a similar requirements process. The step that ports the business process requirements into a particular type of data model (step 4 in Figure 14-1) is described in the following sections of this chapter.

The requirements-gathering phase is a very important step in any software development life cycle, including a reporting solution. Be firm about talking to the business as well as the technology department.

At the end of a business intelligence project, I often find myself in a situation where I am explaining a piece of the business to someone from another department. Because of the perpetual silos that occur in a company, it’s too easy to ignore something that doesn’t affect you directly. Your goal as a reporting modeler is to learn about all areas and processes. Often, you’ll discover ties between the siloed departments that were impossible to initially see!

Dimensional Modeling for Analytical Reporting

The design principles in this section describe dimensional modeling, which is used for analytical reporting. Dimensional modeling takes a business process and separates it into logical units, typically described as entities, attributes, and metrics. These logical units are split into separate tables, called dimensions and facts. Additional tables can be included, but these are the two most common and important tables that are used.

Following are the different types of tables we will work with in a dimensional model:

  • Dimension: A dimension table contains information about an entity. All descriptors of that entity are included in the dimension. The most common dimension is the date dimension, where the entity is a day and the descriptors include month, year, and day of week.
  • Fact: A fact table is the intersection of all of the dimensions and includes the numbers you care about, also known as measures. The measures are usually aggregatable, so they can be summed, averaged, or used in calculations. An example of a fact table would be a store sales fact, where the measures include a dollar amount and unit quantity.
  • Bridge: A bridge table, also known as a many-to-many table, links two tables together that cannot be linked by a single key. There are multiple ways of presenting this bridge, but the outcome is to create a many-to-many relationship between two dimension tables.

We will start by describing the different types of dimensions that you can create and wrap up with the types of facts you can create. After we complete our discussion of facts and dimensions, we will end up with a complete health care insurer data model, as shown in Figure 14-2.

image

Figure 14-2. Completed health care payer dimensional model

Dimensions

As previously stated, a dimension is the table that describes an entity. Once the business process is described, the first step is to determine what dimensions we need. Then, we can decide on what type of dimension we need to use, what the grain (or smallest detail) of the dimension should be, and finally, what we want to put into the dimension. Let’s start with the business process.

While the claim payment business process is fairly complicated, we will use a limited version to illustrate a dimensional model. When an insurance company receives a claim for a member, the claim goes through an adjudication process. The claim can either be automatically adjudicated, manually adjudicated, or denied. As an insurance company, we need to know whether we adjudicated the claim, how much the physician requested based on the member’s diagnosis, and how much we are willing to pay for that procedure.

Sometimes, translation between the technical and business folks can be difficult, so try to understand how the business works and talk in their language. Don’t try to relate the business to another field, and don’t assume that you understand what they are saying. Ask the questions in different ways and try to come up with edge cases to see if they may help you gain a deeper understanding. If 99 percent of camouflage pants come in green or gray, but 1 percent comes in purple, you must know about that 1 percent from a data perspective.

Once you’ve thought through the business process, it is easy to pick out the key entities. While getting started, it may help to write down the business process, as in the previous paragraph, and then highlight the entities as you come to them. The following paragraph illustrates how to highlight the entities from the business process:

When an insurance company receives a claim for a member, the claim goes through an adjudication process. The claim can either be automatically adjudicated, manually adjudicated, or denied. As an insurance company, we need to know whether we adjudicated the claim, how much the physician requested based on the member’s diagnosis, and how much we are willing to pay for that procedure.

Using those italicized phrases, our dimensions are: date, member, adjudication type, physician/provider, diagnosis, and procedure. Once we have the dimensions, it is important to talk to the business to find out what the grain for each of those dimensions should be. Do we receive claims on a daily or monthly basis? Is a member defined as an insurance number, a social security number, or a household? And so on. These are questions that must be asked of and answered by the business.

You must know where the data will be sourced from to ensure that the correct names and datatypes are used in the model. Fortunately, healthcare is a field that has standard definitions and files for many of the entities that we are interested in! During the walkthrough of each entity, any standard sources will be called out.

Once you have gone through the requirements process, discovered all of the business processes, and know everything there is to know about your business, it is time to start modeling! I like to start with dimensions, as the facts seem to fall into place after that. If you recall from earlier in the chapter, a dimension is an entity and its descriptors. Dimensions also include hierarchies, which are levels of properties that relate to each other in groups. Let’s walk through a few examples of dimensions to solidify everything we’ve discussed.

Date Dimension

The date dimension is the most common dimension, as almost every business wants to see how things change over time. Sometimes, the business even wants to see information at a lower granularity than date, such as at an hour or minute level. While it is tempting to combine the time granularity with the date dimension, don’t do it! Analysis is typically done at either the time or the date level, so a rollup is not necessary, and you would inflate the number of rows in your dimension (365 days * 1440 minutes = 525,600 rows in your table). Let’s walk through the thought process of creating the date dimension, and then you can use a similar process to create a separate time dimension.
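As a sketch of where that separate dimension leads, a minute-grain time dimension might look like the following; the table and column names are illustrative, and it relies on the dim schema created in the next code block.

 -- A possible minute-grain time dimension, kept separate from dim.Date
 -- (assumes the dim schema created just below)
 CREATE TABLE dim.Time
 (
  TimeKey INTEGER NOT NULL, -- smart key, e.g., 835 for 8:35 AM
  TimeValue TIME NOT NULL,
  HourValue INTEGER NOT NULL,
  MinuteValue INTEGER NOT NULL,
  CONSTRAINT PK_Time PRIMARY KEY CLUSTERED
  (
   TimeKey ASC
  )
 )
 GO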

You can always start with a base date dimension and modify it to suit your needs. The base date dimension should have the following features:

  • An integer key that uniquely identifies that row: Sometimes, a surrogate key is used, which has absolutely no business value and can be implemented using an IDENTITY column. My recommendation is to use a smart key, which combines the year and date in an eight-digit number, such as 20110928 for September 28, 2011 (using the United States convention). Smart keys make sense for date and time dimensions, because you know the entity itself will never change. In other words, September 28, 2011, will always be in the month of September and will always be in the year 2011.
  • The business key that represents the entity : For a date dimension, this value is very simple; it is just the date. This business key should always be the lowest granularity of information stored in your dimension.
  • Additional standard attributes: These attributes can include information about whether this date is a holiday, whether a promotion is running during this time, or additional fields describing a fiscal calendar.

The code to create a base date dimension table is listed here:

 -- Create schema for all dimension tables
 CREATE SCHEMA dim
 GO

 -- Create Date Dimension
 CREATE TABLE dim.Date
 (
  DateKey INTEGER NOT NULL,
  DateValue DATE NOT NULL,
  DayValue INTEGER NOT NULL,
  WeekValue INTEGER NOT NULL,
  MonthValue INTEGER NOT NULL,
  YearValue INTEGER NOT NULL,
 CONSTRAINT PK_Date PRIMARY KEY CLUSTERED
 (
  DateKey ASC
 ))
 GO

Note    When I create my table and column names, I use a particular naming convention that includes schemas to describe the table type, capital letters for each word in the name, and a set of suffixes that describe the column type. While this standard works for me, you should use a convention that fits into your organization’s standards.

To populate the dimension, you can use a stored procedure or script that will automatically add additional days as needed. Such a stored procedure follows. Note the initial insert statement that adds the unknown row if it doesn’t exist.

Unknown rows are used to symbolize that the relationship between a metric and a particular dimension entity does not exist. The relationship may not exist for a number of reasons:

  • The relationship does not exist in the business process.
  • The incoming feed does not yet know what the relationship should be.
  • The information does not appear to be valid or applicable to the scenario.

By tying the fact table to the unknown row, it is very simple to see where the relationship is missing. The surrogate key for each unknown row is –1, and the descriptors contain variations of “unknown,” such as UN, -1, and UNK. Each dimension that we will create will contain an unknown row to highlight this use case.

In some cases, it may make sense to create multiple unknown rows to distinguish between the three reasons for having an unknown row. One reason why you may want to do this is if you have many late-arriving dimensions, where you are very concerned with the rows that have not yet been linked to a dimension. Another reason you may want to do this is if you have poor data quality, and it is important to distinguish between acceptable and unacceptable unknown rows.
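As a sketch, one convention assigns a different negative key to each reason; the keys and placeholder dates here are illustrative, not a standard.

 -- Illustrative: one unknown row per reason, shown for the date dimension
 INSERT INTO dim.Date VALUES (-1, '01/01/1900', -1, -1, -1, -1) -- not applicable to the business process
 INSERT INTO dim.Date VALUES (-2, '01/02/1900', -2, -2, -2, -2) -- not yet known (late arriving)
 INSERT INTO dim.Date VALUES (-3, '01/03/1900', -3, -3, -3, -3) -- failed data quality checks
 GO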

The following stored procedure uses only one unknown row:

 -- Create Date Dimension Load Stored Procedure
 CREATE PROCEDURE dim.LoadDate (@startDate DATETIME, @endDate DATETIME)
 AS
 BEGIN
  -- Add the unknown row if it does not already exist
  IF NOT EXISTS (SELECT * FROM dim.Date WHERE DateKey = -1)
  BEGIN
   INSERT INTO dim.Date
   SELECT -1, '01/01/1900', -1, -1, -1, -1
  END

  -- Add one row per day in the requested range, skipping existing days
  WHILE @startdate <= @enddate
  BEGIN
   IF NOT EXISTS (SELECT * FROM dim.Date WHERE DateValue = @startdate)
    INSERT INTO dim.Date
    SELECT CONVERT(CHAR(8), @startdate, 112) AS DateKey
     ,@startdate AS DateValue
     ,DAY(@startdate) AS DayValue
     ,DATEPART(wk, @startdate) AS WeekValue
     ,MONTH(@startdate) AS MonthValue
     ,YEAR(@startdate) AS YearValue
   SET @startdate = DATEADD(dd, 1, @startdate)
  END
 END
 GO

The outcome of this stored procedure is to load any date values that are not yet in the date dimension. Running the following statement will insert two years of dates:

 EXECUTE dim.LoadDate '01/01/2011', '12/31/2012'
 GO

A sample of the data in the table is shown here:

DateKey   DateValue   DayValue  WeekValue  MonthValue  YearValue
--------  ----------  --------  ---------  ----------  ---------
-1        1900-01-01  -1        -1         -1          -1
20111230  2011-12-30  30        53         12          2011
20111231  2011-12-31  31        53         12          2011
20120101  2012-01-01  1         1          1           2012
20120102  2012-01-02  2         1          1           2012

This dimension and stored procedure may be exactly what you need, and if so, move on to the next section on slowly changing dimensions. On the other hand, there’s a good chance that you’ll need to go further with this dimension. To start, many organizations have fiscal calendars. You will use the same base date dimension but add additional columns for FiscalDayValue, FiscalWeekValue, and so on. In the same vein, you may also have date-specific descriptors that should be included. These could be as simple as calling out the federal holidays or as complicated as highlighting the various sales periods throughout the year. To include any of those additional items, you’ll want to modify the stored procedure to use the logic specific to your organization.
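For example, if your fiscal calendar simply shifts the calendar months, computed columns are one lightweight option instead of modifying the load logic. The following sketch assumes a fiscal year that starts on July 1; the column names and start month are assumptions to adjust for your organization.

 -- Illustrative fiscal columns, assuming a July 1 fiscal year start
 ALTER TABLE dim.Date
 ADD FiscalMonthValue AS (CASE WHEN MonthValue >= 7 THEN MonthValue - 6 ELSE MonthValue + 6 END),
  FiscalYearValue AS (CASE WHEN MonthValue >= 7 THEN YearValue + 1 ELSE YearValue END)
 GO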

Slowly Changing Dimension

The date dimension is about as simple as a dimension can get. The information can be populated once or over time and is never modified. The rest of the dimensions won’t be that simple. To begin, we need to discuss the concept of slowly changing dimensions. Here are the most common types of slowly changing dimensions:

  • Type 1, or changing: A type 1 change overwrites the value that was previously assigned to that attribute. A type 1 change is typically used when keeping the original value is not important. Example attributes include product category and company name.
  • Type 2, or historical: A type 2 change is more in-depth than a type 1 change. It tracks any changes that occurred to that attribute over time so seeing the original, final, and any intermediate values is possible. Example historical attributes in which you may be interested include customer state and employee manager.
  • Type 3, or hybrid: Finally, a type 3 change is a hybrid of both the type 1 and type 2 changes. It keeps track of the intermediate and final values on every row. This allows reporting to compare the current value with the latest value for any intermediate value or to be able to see the alternate view, as though the change didn’t happen. This approach is not often used, but one good example is comparing the average temperature for a location at that time versus the final average temperature over the life of the warehouse.

As this book has hammered over and over again, uniqueness is so very important when dealing with data. And I’m not going to contradict that now, except to say that it is possibly even more important when dealing with reporting. You need to determine the unique identifier, also known as the business key, for each entity. This key is how you will know how to link a dimension to a fact and how to create additional rows in the case of a slowly changing dimension.

In the case of a health insurance company, a member can change personal information on a regular basis, and seeing how that information tracks over time may be important. Let’s walk through the steps of creating the member dimension.

As with every dimension, we begin by determining the business key for the entity. For an insured member, the business key is the insurance number. Additional attributes for personal information that are analyzed include first name, last name, primary care physician, county, state, and membership duration. Our initial dimension can be created using the following script:

 -- Create the Member dimension table
 CREATE TABLE dim.Member
 (
  MemberKey INTEGER NOT NULL IDENTITY(1,1),
  InsuranceNumber VARCHAR(12) NOT NULL,
  FirstName VARCHAR(50) NOT NULL,
  LastName VARCHAR(50) NOT NULL,
  PrimaryCarePhysician VARCHAR(100) NOT NULL,
  County VARCHAR(40) NOT NULL,
  StateCode CHAR(2) NOT NULL,
  MembershipLength VARCHAR(15) NOT NULL,
 CONSTRAINT PK_Member PRIMARY KEY CLUSTERED
 (
  MemberKey ASC
 ))
 GO

For demonstration’s sake, load the member dimension with the following INSERT script:

 -- Load Member dimension table
 SET IDENTITY_INSERT [dim].[Member] ON
 GO
 INSERT INTO [dim].[Member]
 ([MemberKey],[InsuranceNumber],[FirstName],[LastName],[PrimaryCarePhysician]
  ,[County],[StateCode],[MembershipLength])
 VALUES
 (-1, 'UNKNOWN','UNKNOWN','UNKNOWN','UNKNOWN','UNKNOWN','UN','UNKNOWN')
 GO
 SET IDENTITY_INSERT [dim].[Member] OFF
 GO
 INSERT INTO [dim].[Member]
 ([InsuranceNumber],[FirstName],[LastName],[PrimaryCarePhysician]
  ,[County],[StateCode],[MembershipLength])
 VALUES
 ('IN438973','Brandi','Jones','Dr. Keiser & Associates','Henrico','VA','<1 year')
 GO
 INSERT INTO [dim].[Member]
 ([InsuranceNumber],[FirstName],[LastName],[PrimaryCarePhysician]
  ,[County],[StateCode],[MembershipLength])
 VALUES
 ('IN958394','Neil','Gomez','Healthy Lifestyles','Henrico','VA','1-2 year')
 GO
 INSERT INTO [dim].[Member]
 ([InsuranceNumber],[FirstName],[LastName],[PrimaryCarePhysician]
  ,[County],[StateCode],[MembershipLength])
 VALUES
 ('IN3867910','Catherine','Patten','Dr. Jenny Stevens','Spotsylvania','VA','<1 year')
 GO

Use the following query to retrieve the sample data that follows:

 select MemberKey, InsuranceNumber, FirstName, LastName, PrimaryCarePhysician
 from dim.Member
 GO

Here are the results:

MemberKey  InsuranceNumber  FirstName  LastName  PrimaryCarePhysician
---------  ---------------  ---------  --------  -----------------------
-1         UNKNOWN          UNKNOWN    UNKNOWN   UNKNOWN
1          IN438973         Brandi     Jones     Dr. Keiser & Associates
2          IN958394         Neil       Gomez     Healthy Lifestyles
3          IN3867910        Catherine  Patten    Dr. Jenny Stevens

What do we do when a piece of information changes? Well, if it’s something that we want to track, we use a type 2 change. If it’s something we don’t care about, we use a type 1 change. We must know which type we want at modeling time, because it requires some adjustments to the table. In the case of the member dimension, we want a type 2 attribute. We know this because we want to track when a member’s primary care physician (PCP) changes, to verify that we are only paying out for visits to a PCP. To track this information as a type 2 attribute, we need to add a flag called isCurrent to indicate which row contains the most current information. The code to add this field is listed here:

 ALTER TABLE dim.Member
 ADD isCurrent INTEGER NOT NULL DEFAULT 1
 GO

When a load process runs to update the member dimension, it will check for a match on the business key of InsuranceNumber; if one exists, it will add an additional row and set isCurrent to 0 on the old row. Use the following code to emulate Brandi changing her primary care physician and updating the old record:

 INSERT INTO [dim].[Member]
 ([InsuranceNumber],[FirstName],[LastName],[PrimaryCarePhysician]
  ,[County],[StateCode],[MembershipLength])
 VALUES
 ('IN438973','Brandi','Jones','Dr. Jenny Stevens','Henrico','VA','<1 year')
 GO
 UPDATE [dim].[Member] SET isCurrent = 0
 WHERE InsuranceNumber = 'IN438973' AND PrimaryCarePhysician = 'Dr. Keiser & Associates'
 GO
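In a real load process, this check-and-flip logic runs in bulk rather than row by row. The following is a rough set-based sketch, assuming a hypothetical staging table, stage.Member, that holds the latest source values with the same business columns.

 -- Sketch only: expire current rows whose PCP changed...
 UPDATE dm SET isCurrent = 0
 FROM dim.Member dm
 INNER JOIN stage.Member sm ON dm.InsuranceNumber = sm.InsuranceNumber
 WHERE dm.isCurrent = 1
 AND dm.PrimaryCarePhysician <> sm.PrimaryCarePhysician
 GO
 -- ...then insert a new current row for any staged member without one
 INSERT INTO dim.Member
 (InsuranceNumber, FirstName, LastName, PrimaryCarePhysician,
  County, StateCode, MembershipLength)
 SELECT sm.InsuranceNumber, sm.FirstName, sm.LastName, sm.PrimaryCarePhysician,
  sm.County, sm.StateCode, sm.MembershipLength
 FROM stage.Member sm
 WHERE NOT EXISTS (SELECT 1 FROM dim.Member dm
  WHERE dm.InsuranceNumber = sm.InsuranceNumber AND dm.isCurrent = 1)
 GO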

If we use a query similar to the one before, but add the new isCurrent column as follows, we can see the full change history in the dimension:

 select MemberKey, InsuranceNumber, FirstName, LastName, PrimaryCarePhysician, isCurrent
 from dim.Member

And here are the results:

MemberKey  InsuranceNumber  FirstName  LastName  PrimaryCarePhysician     isCurrent
---------  ---------------  ---------  --------  -----------------------  ---------
-1         UNKNOWN          UNKNOWN    UNKNOWN   UNKNOWN                  1
1          IN438973         Brandi     Jones     Dr. Keiser & Associates  0
2          IN958394         Neil       Gomez     Healthy Lifestyles       1
3          IN3867910        Catherine  Patten    Dr. Jenny Stevens        1
4          IN438973         Brandi     Jones     Dr. Jenny Stevens        1

Note    For a more detailed explanation of how to programmatically add the logic for a type 2 dimension using SSIS, see Apress’s Pro SQL Server 2012 Integration Services by Michael Coles and Francis Rodrigues.

An alternative to using the isCurrent flag is to use two columns: a start date and an end date. The start date describes when the row first became active, and the end date is when that row expired. If the row has not yet expired, the end date is null, so to see all current values, you can query all rows where the end date is null. This approach also lets you see when each dimension row changed without having to look at the fact table.
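A minimal sketch of that variant follows; the column names are illustrative.

 -- Illustrative start/end date columns as an alternative to isCurrent
 ALTER TABLE dim.Member
 ADD RowStartDate DATE NOT NULL DEFAULT '1900-01-01',
  RowEndDate DATE NULL
 GO
 -- Current rows are simply those that have not expired
 select MemberKey, InsuranceNumber, PrimaryCarePhysician
 from dim.Member
 where RowEndDate IS NULL
 GO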

You may ask yourself why you need any flag or date range at all. Isn’t it possible to just find the latest row based on the growing identity key, where the maximum value is the current one? You would be absolutely correct that you can get the same information; however, the query to pull that information is more complicated and difficult to write. As a case in point, would you rather write the query in Listing 14-1 or Listing 14-2 on a regular basis?

Listing 14-1. Preferred Query to Pull the Latest Dimension Row

select * from dim.Member where isCurrent = 1
GO

Listing 14-2. Rejected Query to Pull the Latest Dimension Row

select * from (
   select m.*, ROW_NUMBER() OVER (PARTITION BY m.InsuranceNumber
   ORDER BY m.MemberKey DESC) As Latest
   from dim.Member m ) LatestMembers
WHERE LatestMembers.Latest = 1
GO

The provider dimension is similar to the member dimension, as it can change over time. The information in this dimension is based on a standard from the National Provider Registry, which can be found at http://www.cms.gov/NationalProvIdentStand/downloads/Data_Dissemination_File-Readme.pdf . By using a limited set of that information, we can create the dimension as described here:

 -- Create the Provider dimension table
 CREATE TABLE dim.Provider (
  ProviderKey INTEGER IDENTITY(1,1) NOT NULL,
  NPI VARCHAR(10) NOT NULL,
  EntityTypeCode INTEGER NOT NULL,
  EntityTypeDesc VARCHAR(12) NOT NULL, -- (1:Individual,2:Organization)
  OrganizationName VARCHAR(70) NOT NULL,
  DoingBusinessAsName VARCHAR(70) NOT NULL,
  Street VARCHAR(55) NOT NULL,
  City VARCHAR(40) NOT NULL,
  State VARCHAR(40) NOT NULL,
  Zip VARCHAR(20) NOT NULL,
  Phone VARCHAR(20) NOT NULL,
  isCurrent INTEGER NOT NULL DEFAULT 1,
  CONSTRAINT PK_Provider PRIMARY KEY CLUSTERED
 (
  ProviderKey ASC
 ))
 GO

Add an initial set of data using the following query:

 -- Insert sample data into Provider dimension table
 SET IDENTITY_INSERT [dim].[Provider] ON
 GO
 INSERT INTO [dim].[Provider]
 ([ProviderKey],[NPI],[EntityTypeCode],[EntityTypeDesc],[OrganizationName],
  [DoingBusinessAsName],[Street],[City],[State],[Zip],[Phone])
 VALUES
 (-1, 'UNKNOWN',-1,'UNKNOWN','UNKNOWN','UNKNOWN','UNKNOWN','UNKNOWN','UNKNOWN','UNKNOWN','UNKNOWN')
 GO
 SET IDENTITY_INSERT [dim].[Provider] OFF
 GO
 INSERT INTO [dim].[Provider]
 ([NPI],[EntityTypeCode],[EntityTypeDesc],[OrganizationName],
  [DoingBusinessAsName],[Street],[City],[State],[Zip],[Phone])
 VALUES
 ('1234567',1,'Individual','Patrick Lyons','Patrick Lyons','80 Park St.',
 'Boston','Massachusetts','55555','555-123-1234')
 GO
 INSERT INTO [dim].[Provider]
 ([NPI],[EntityTypeCode],[EntityTypeDesc],[OrganizationName],
  [DoingBusinessAsName],[Street],[City],[State],[Zip],[Phone])
 VALUES
 ('2345678',1,'Individual','Lianna White, LLC','Dr. White & Associates','74 West Pine Ave.',
 'Waltham','Massachusetts','55542','555-123-0012')
 GO
 INSERT INTO [dim].[Provider]
 ([NPI],[EntityTypeCode],[EntityTypeDesc],[OrganizationName],
  [DoingBusinessAsName],[Street],[City],[State],[Zip],[Phone])
 VALUES
 ('76543210',2,'Organization','Doctors Conglomerate, Inc','Family Doctors','25 Main Street Suite 108',
 'Boston','Massachusetts','55555','555-321-4321')
 GO
 INSERT INTO [dim].[Provider]
 ([NPI],[EntityTypeCode],[EntityTypeDesc],[OrganizationName],
  [DoingBusinessAsName],[Street],[City],[State],[Zip],[Phone])
 VALUES
 ('3456789',1,'Individual','Dr. Drew Adams','Dr. Drew Adams','1207 Corporate Center',
 'Peabody','Massachusetts','55554','555-234-1234')
 GO

A dimension can have attributes of multiple types, so the last name can be type 1 while the primary care physician is type 2. I’ve never understood why they’re called slowly changing dimensions when they exist at an attribute level, but that’s the way it is.

Snowflake Dimension

In our discussion of modeling methodologies, we talked about star schemas and snowflake schemas. The key differentiator between those two methodologies is the snowflake dimension. A snowflake dimension takes the information from one dimension and splits it into two dimensions. The second dimension contains a subset of information from the first dimension and is linked to the first dimension, rather than the fact, through a surrogate key.

A great example of this is when one entity can be broken into two subentities that change independently. By grouping each subentity’s properties into its own dimension, the subentity can change on its own, or it can be linked directly to the fact, instead of through the intermediary dimension, if the grain applies. However, breaking out an entity into a snowflake dimension is not without its drawbacks. Never, ever, ever link both the first and second dimension to the same fact table: you’d create a circular relationship that could cause discrepancies if your loads are not kept up to date. I also recommend never snowflaking your dimensions to more than one level, if at all. The added complexity and maintenance needed to support such a structure are often not worthwhile when alternative methods can be found. Let’s talk through a possible snowflake dimension, when you would want to use it, and alternative methods.

To continue our healthcare example, we will look at the types of insurance coverage provided by the company. There are multiple types of plans, each of which contains a set of benefits. Sometimes we look at the metrics at the benefit level, and other times at the plan level. Even though we sometimes look at benefits and plans differently, there is a direct relationship between them.

Let’s begin by looking at each of the tables separately. The benefit table creation script is listed here:

 -- Create the Benefit dimension table
 CREATE TABLE dim.Benefit(
  BenefitKey INTEGER IDENTITY(1,1) NOT NULL,
  BenefitCode INTEGER NOT NULL,
  BenefitName VARCHAR(35) NOT NULL,
  BenefitSubtype VARCHAR(20) NOT NULL,
  BenefitType VARCHAR(20) NOT NULL,
 CONSTRAINT PK_Benefit PRIMARY KEY CLUSTERED
 (
  BenefitKey ASC
 ))
 GO

The health plan creation script is shown here:

 -- Create the Health Plan dimension table
 CREATE TABLE dim.HealthPlan(
  HealthPlanKey INTEGER IDENTITY(1,1) NOT NULL,
  HealthPlanIdentifier CHAR(4) NOT NULL,
  HealthPlanName VARCHAR(35) NOT NULL,
 CONSTRAINT PK_HealthPlan PRIMARY KEY CLUSTERED
 (
  HealthPlanKey ASC
 ))
 GO

I’ve already explained that a set of benefits can be tied to a plan, so we need to add the health plan dimension’s key to the benefit dimension. Use an ALTER statement to do this, as shown:

 ALTER TABLE dim.Benefit
 ADD HealthPlanKey INTEGER
 GO
 ALTER TABLE dim.Benefit WITH CHECK
 ADD CONSTRAINT FK_Benefit_HealthPlan
 FOREIGN KEY(HealthPlanKey) REFERENCES dim.HealthPlan (HealthPlanKey)
 GO

When the benefit key is added to a fact, the plan key does not also need to be added but will still be linked and can be used to aggregate or filter the information. In addition, if the grain of one of the fact tables is only at the plan level, the plan key can be used directly without having to use the benefit key.

We will not use the Benefit dimension in our model, but we will use the HealthPlan dimension. You can populate the data using the following script:

 -- Insert sample data into Health plan dimension
 INSERT INTO [dim].[HealthPlan]
  ([HealthPlanIdentifier],[HealthPlanName])
 VALUES ('BRON','Bronze Plan')
 GO
 INSERT INTO [dim].[HealthPlan]
  ([HealthPlanIdentifier],[HealthPlanName])
 VALUES ('SILV','Silver Plan')
 GO
 INSERT INTO [dim].[HealthPlan]
  ([HealthPlanIdentifier],[HealthPlanName])
 VALUES ('GOLD','Gold Plan')
 GO

An important concept that we have not yet covered is how to handle hierarchies in your data structure. Hierarchies, which are levels of properties that relate to each other in groups, will come up over and over again. Each level of the hierarchy is included in the same dimension, and the value of each parent level is repeated for every child value. In the Benefit dimension, BenefitSubtype and BenefitType create a hierarchy. Ensure in the load process that every subtype always maps to the same type, to help later reporting.
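One way to verify that rule is a quick data quality query; this sketch flags any subtype that maps to more than one type.

 -- Illustrative check that each BenefitSubtype rolls up to one BenefitType
 select BenefitSubtype
 from dim.Benefit
 group by BenefitSubtype
 having count(distinct BenefitType) > 1
 GO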

Type Dimension

Another common dimension type that often appears is what I like to call a type dimension, because it is often simple to identify due to its name. This dimension is a very simple one that typically contains a few type 1 changing attributes. Not all entities have to contain the word “type,” but it’s a nice indicator. Examples include transaction type, location type, and sale type.

In our health care example, the adjudication type is a perfect example of this dimension. The adjudication type dimension contains two columns and four rows, and the full creation script is shown here:

 -- Create the AdjudicationType dimension table
 CREATE TABLE dim.AdjudicationType (
  AdjudicationTypeKey INTEGER IDENTITY(1,1) NOT NULL,
  AdjudicationType VARCHAR(6) NOT NULL,
  AdjudicationCategory VARCHAR(8) NOT NULL,
 CONSTRAINT PK_AdjudicationType PRIMARY KEY CLUSTERED
 (
  AdjudicationTypeKey ASC
 ))
 GO

The values for the adjudication type dimension are static, so they can be loaded once, instead of on a recurring or nightly basis. The following INSERT script will insert the three business value rows and the one unknown row:

 -- Insert values for the AdjudicationType dimension
 SET IDENTITY_INSERT dim.AdjudicationType ON
 GO
 INSERT INTO dim.AdjudicationType
  (AdjudicationTypeKey, AdjudicationType, AdjudicationCategory)
 VALUES (-1, 'UNKNWN', 'UNKNOWN')
 INSERT INTO dim.AdjudicationType
  (AdjudicationTypeKey, AdjudicationType, AdjudicationCategory)
 VALUES (1, 'AUTO', 'ACCEPTED')
 INSERT INTO dim.AdjudicationType
  (AdjudicationTypeKey, AdjudicationType, AdjudicationCategory)
 VALUES (2, 'MANUAL', 'ACCEPTED')
 INSERT INTO dim.AdjudicationType
  (AdjudicationTypeKey, AdjudicationType, AdjudicationCategory)
 VALUES (3, 'DENIED', 'DENIED')
 GO
 SET IDENTITY_INSERT dim.AdjudicationType OFF
 GO

The values in this table can then be shown as follows:

AdjudicationTypeKey  AdjudicationType  AdjudicationCategory
-------------------  ----------------  --------------------
-1                   UNKNWN            UNKNOWN
1                    AUTO              ACCEPTED
2                    MANUAL            ACCEPTED
3                    DENIED            DENIED

This provides an easy way to filter any claim that you have on either the actual type of adjudication that occurred or the higher-level category.
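For example, once the fact.ClaimPayment table is created later in this chapter, a filter on the higher-level category might look like this sketch:

 -- Illustrative filter on the higher-level adjudication category
 select fcp.ClaimID, fcp.ClaimAmount
 from fact.ClaimPayment fcp
 inner join dim.AdjudicationType dat on fcp.AdjudicationTypeKey=dat.AdjudicationTypeKey
 where dat.AdjudicationCategory = 'ACCEPTED'
 GO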

Our final two dimensions also fit into this category: Diagnosis and HCPCSProcedure. Diagnosis and procedure codes contain a standard set of data fields. Diagnosis codes are known as International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) codes, and procedure codes as Healthcare Common Procedure Coding System (HCPCS) codes. We can access this set of information through the Centers for Medicare and Medicaid Services web site ( http://www.cms.gov ). The creation script for these dimensions follows:

 -- Create Diagnosis dimension table
 CREATE TABLE dim.Diagnosis(
  DiagnosisKey int IDENTITY(1,1) NOT NULL,
  DiagnosisCode char(7) NULL,
  ShortDesc varchar(60) NULL,
  LongDesc varchar(322) NULL,
  OrderNumber int NULL,
  CONSTRAINT PK_Diagnosis PRIMARY KEY CLUSTERED
 (
  DiagnosisKey ASC
 ))
 GO

 -- Create HCPCSProcedure dimension table
 CREATE TABLE dim.HCPCSProcedure (
  ProcedureKey INTEGER IDENTITY(1,1) NOT NULL,
  ProcedureCode CHAR(5) NOT NULL,
  ShortDesc VARCHAR(28) NOT NULL,
  LongDesc VARCHAR(80) NOT NULL,
  CONSTRAINT PK_HCPCSProcedure PRIMARY KEY CLUSTERED
 (
  ProcedureKey ASC
 ))
 GO

Depending on how in-depth your business needs are, you can add more fields to the diagnosis and procedure dimensions. For a full list of all available diagnosis fields, see http://www.cms.gov/ICD10/downloads/ICD10OrderFiles.zip , and for all available procedure fields, see https://www.cms.gov/HCPCSReleaseCodeSets/Downloads/contr_recordlayout_2011.pdf . You can also use the files provided to load the set of static rows (for each year that the code set is released). Sample Integration Services packages are provided with the code for this chapter to load these values.

Facts

Now that we’ve looked at the entities and properties that describe our business process, we need to tie it all together with the metrics. The fact table combines all dimensions and the metrics associated with our business process. We can create many types of facts, just as we can create many types of dimensions. We will discuss two of the more commonly occurring fact types now: transaction and snapshot.

Transaction Fact

The fact table that most people know is the transaction fact. This table relates the metrics and dimensions together very simply: for every action that occurs in the business process, there is one row in the table. It is important to understand the grain of the table, which is the combination of the dimensions that describes what the metric is used for. For example, if you were to look at a bank transaction, the grain of the table could consist of a day, bank location, bank teller, customer, and transaction type. Compare this to a bank promotion, where the grain could be hour, bank location, customer, and promotion type.

Along with the dimension grain, metrics make up the fact table. The metrics are typically numeric values, such as currencies, quantities, or counts. These values can be aggregated in some fashion: by summing, averaging, or using the maximum or minimum values.

To continue our healthcare example, one transaction fact would be the payouts for insurance claims. Since we’ve already created our dimensions, there is a good chance we already know what our grain should be, but let’s walk through the business process to make sure we have everything we need.

For this example, we have talked to the business owners and understand the business process down to the smallest detail. We know that our grain is day, insurance number, adjudication value, ICD-10 code, and HCPCS code. Once we have the grain, we are very close to having a complete transaction fact.

Now that we have the dimensional grain of the fact, the next step is to describe the metrics related to those dimensions. We will use the same process we followed to determine our dimensions: look at the business process description that started our dimension discussion, and pull out the aggregatable values that become the metrics. An example of this process with metrics italicized can be seen in the following paragraph:

When an insurance company receives a claim for a member, the claim goes through an adjudication process. The claim can either be automatically adjudicated, manually adjudicated, or denied. As an insurance company, we need to know whether we adjudicated the claim, how much the physician requested based on the member’s diagnosis, and how much we are willing to pay for that procedure.

Based on the italicized phrases, the desired metrics are adjudication counts, claim amount, and payout amount. It is important to look at how the business uses these metrics. Existing reports are a great place to start, but you also want to talk to the business analysts to understand what they do with that information. If they always take the information from a report and then perform manipulations in Excel, you want to build those manipulations directly into your dimensional model.

Our completed fact table is shown here:

 -- Create schema for all fact tables
 CREATE SCHEMA fact
 GO

 -- Create Claim Payment transaction fact table
 CREATE TABLE fact.ClaimPayment
 (
  DateKey INTEGER NOT NULL,
  MemberKey INTEGER NOT NULL,
  AdjudicationTypeKey INTEGER NOT NULL,
  ProviderKey INTEGER NOT NULL,
  DiagnosisKey INTEGER NOT NULL,
  ProcedureKey INTEGER NOT NULL,
  ClaimID VARCHAR(8) NOT NULL,
  ClaimAmount DECIMAL(10,2) NOT NULL,
  AutoPayoutAmount DECIMAL(10,2) NOT NULL,
  ManualPayoutAmount DECIMAL(10,2) NOT NULL,
  AutoAdjudicatedCount INTEGER NOT NULL,
  ManualAdjudicatedCount INTEGER NOT NULL,
  DeniedCount INTEGER NOT NULL
 )
 GO

Of course, we need to link our fact table to the dimensions that we’ve already created, using foreign keys. When loading your data, it may make sense to disable or remove the constraints and enable them afterward to increase the speed of the load. I like to have the relationships in place after the load to double-check that the load was successful and that the fact doesn’t reference dimension keys that do not exist. To set up the foreign keys, use the following script:

 -- Add foreign keys from ClaimPayment fact to dimensions
 ALTER TABLE fact.ClaimPayment WITH CHECK
 ADD CONSTRAINT FK_ClaimPayment_AdjudicationType
 FOREIGN KEY(AdjudicationTypeKey) REFERENCES dim.AdjudicationType (AdjudicationTypeKey)
 GO
 ALTER TABLE fact.ClaimPayment WITH CHECK
 ADD CONSTRAINT FK_ClaimPayment_Date
 FOREIGN KEY(DateKey) REFERENCES dim.Date (DateKey)
 GO
 ALTER TABLE fact.ClaimPayment WITH CHECK
 ADD CONSTRAINT FK_ClaimPayment_Diagnosis
 FOREIGN KEY(DiagnosisKey) REFERENCES dim.Diagnosis (DiagnosisKey)
 GO
 ALTER TABLE fact.ClaimPayment WITH CHECK
 ADD CONSTRAINT FK_ClaimPayment_HCPCSProcedure
 FOREIGN KEY(ProcedureKey) REFERENCES dim.HCPCSProcedure (ProcedureKey)
 GO
 ALTER TABLE fact.ClaimPayment WITH CHECK
 ADD CONSTRAINT FK_ClaimPayment_Member
 FOREIGN KEY(MemberKey) REFERENCES dim.Member (MemberKey)
 GO
 ALTER TABLE fact.ClaimPayment WITH CHECK
 ADD CONSTRAINT FK_ClaimPayment_Provider
 FOREIGN KEY(ProviderKey) REFERENCES dim.Provider (ProviderKey)
 GO

How did we come up with six metrics from the three italicized descriptions in our business process? In this case, we talked to the business owners and found out that they always look at payout amounts in a split fashion: either the automatically adjudicated or the manually adjudicated amounts. We will model this business scenario by splitting the value into two separate columns based on whether the adjudication flag is set to AUTO or MANUAL. This means that one column will always have 0 for every row. While this approach may seem counterintuitive, it will help us query the end result without always having to filter on the adjudication type dimension. Since we know the business will always look at one or the other, consider the queries in Listings 14-3 and 14-4. Which would you rather write every time you pull payment information from the database?

Listing 14-3. Preferred Query to Pull Amount Information

select AutoPayoutAmount, ManualPayoutAmount
from fact.ClaimPayment
GO

Listing 14-4. Rejected Query to Pull Amount Information

select CASE dat.AdjudicationType
        WHEN 'AUTO'
        THEN fp.ClaimAmount
        ELSE 0
        END as AutoPayoutAmount
  , CASE dat.AdjudicationType
    WHEN 'MANUAL'
    THEN fp.ClaimAmount
    ELSE 0
    END as ManualPayoutAmount
from fact.ClaimPayment fp
left join dim.AdjudicationType dat on fp.AdjudicationTypeKey=dat.AdjudicationTypeKey
GO

The next metric we want to discuss more deeply is the adjudication counts, which help the business figure out how many of its claims are adjudicated (manually versus automatically) and denied. Rather than force the business to use the adjudication flag again, we will add three columns, one for each adjudication type. This set of columns will have a value of 1 in one of the columns and 0 in the other two. Some people prefer to think of these as flag columns with bits that are flipped if the value fits that row, but I prefer to think of and use numeric values that are easily aggregated.

When creating your SQL Server tables, always use the smallest sized data type that you can. This is another scenario where understanding the business is essential. The data type should be large enough to handle any data value that comes in but small enough that it doesn’t waste server space.

You can use a query similar to the following to create a set of random rows for your fact tables:

 -- Insert sample data into ClaimPayment fact table
 DECLARE @i INT
 SET @i = 0
 WHILE @i < 1000
 BEGIN
 INSERT INTO fact.ClaimPayment
 (
  DateKey, MemberKey, AdjudicationTypeKey, ProviderKey, DiagnosisKey,
  ProcedureKey, ClaimID, ClaimAmount, AutoPayoutAmount, ManualPayoutAmount,
  AutoAdjudicatedCount, ManualAdjudicatedCount, DeniedCount
 )
 SELECT
  CONVERT(CHAR(8), DATEADD(dd, RAND() * -100, getdate()), 112),
  (SELECT CEILING((COUNT(*) - 1) * RAND()) from dim.Member),
  (SELECT CEILING((COUNT(*) - 1) * RAND()) from dim.AdjudicationType),
  (SELECT CEILING((COUNT(*) - 1) * RAND()) from dim.Provider),
  (SELECT CEILING((COUNT(*) - 1) * RAND()) from dim.Diagnosis),
  (SELECT CEILING((COUNT(*) - 1) * RAND()) from dim.HCPCSProcedure),
  'CL' + CAST(@i AS VARCHAR(6)),
  RAND() * 100000,
  RAND() * 100000 * (@i % 2),
  RAND() * 100000 * ((@i+1) % 2),
  @i % 2,
  (@i+1) % 2,
  0
 SET @i = @i + 1
 END
 GO

Finally, the remaining column in the claim payment fact to discuss is the ClaimID. This column is the odd man out, in that it looks like a dimensional attribute but lives in the fact. These columns are known as degenerate dimensions and are typically identifiers or transaction numbers. In scenarios where an entity has an almost one-to-one relationship with the fact, putting the entity in its own dimension would just waste space. These columns are typically used for more detailed or drill-down reporting, so we can include the data in the fact for ease of reporting.
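For instance, a drill-through report might pull a single claim’s detail straight from the fact, as in the following sketch (the claim number comes from the sample data generated earlier):

 -- Illustrative drill-down on the ClaimID degenerate dimension
 select ClaimID, ClaimAmount, AutoPayoutAmount, ManualPayoutAmount
 from fact.ClaimPayment
 where ClaimID = 'CL42'
 GO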

Snapshot Fact

The snapshot fact is not always intuitive when thinking through a business process or trying to query your model, but it is extremely powerful. As the name suggests, the snapshot fact records a set of information as a snapshot in time. Typical snapshot facts include financial balances on a daily basis and monthly membership counts.

The healthcare data model we will complete in this section is shown in Figure 14-3.

image

Figure 14-3. Snapshot fact dimensional model for our health care payer

In our healthcare example, daily member count is a perfect example of a snapshot fact. We begin by following a similar process to determine our dimensions and metrics. Our business process is listed below with dimensions in bold and metrics italicized:

An insurance company’s core business depends on its members. It is important to know how many members the company has over time, whether the number of members is increasing or decreasing, and which health plan is doing well over time.

We don’t have many dimensions attached to this fact based on the business process, but that’s OK, because it supports our reporting questions. We have two main dimensions: date and health plan. Our metric is the number of members that we have. Note that we don’t have a member dimension, because we are looking at an aggregation of the number of members over time, not individual members. It is important to provide this data on a time basis so that people can report historical information. Our final snapshot table can be created with the following script:

 -- Create Membership snapshot fact table
 CREATE TABLE fact.Membership (
  DateKey INTEGER NOT NULL,
  HealthPlanKey INTEGER NOT NULL,
  MemberAmount INTEGER NOT NULL
 )
 GO

In addition, we need to add the foreign keys to link the fact table to our dimensions, using the following script:

 -- Add foreign keys from Membership fact to dimensions
 ALTER TABLE fact.Membership WITH CHECK
 ADD CONSTRAINT FK_Membership_Date
 FOREIGN KEY(DateKey) REFERENCES dim.Date (DateKey)
 GO
 ALTER TABLE fact.Membership WITH CHECK
 ADD CONSTRAINT FK_Membership_HealthPlan
 FOREIGN KEY(HealthPlanKey) REFERENCES dim.HealthPlan (HealthPlanKey)
 GO

To load sample data into this table, use the following query:

 -- Insert sample data into the Membership fact table
 DECLARE @startdate DATE
 DECLARE @enddate DATE
 SET @startdate = '1/1/2011'
 SET @enddate = '12/31/2011'
 WHILE @startdate <= @enddate
 BEGIN
 INSERT INTO fact.Membership
 SELECT CONVERT(CHAR(8), @startdate, 112) AS DateKey
  ,1 AS HPKey
  ,RAND() * 1000 AS MemberAmount
 INSERT INTO fact.Membership
 SELECT CONVERT(CHAR(8), @startdate, 112) AS DateKey
  ,2 AS HPKey
  ,RAND() * 1000 AS MemberAmount
 INSERT INTO fact.Membership
 SELECT CONVERT(CHAR(8), @startdate, 112) AS DateKey
  ,3 AS HPKey
  ,RAND() * 1000 AS MemberAmount
 SET @startdate = DATEADD(dd, 1, @startdate)
 END
 GO

This type of fact table can get very large, because it contains data for every combination of keys. Although, in this case, the key combination is just day and health plan, imagine adding a few more dimensions and see how quickly that would grow. What we lose in storage, however, we gain in power. As an example, the following query quickly illustrates how many members plan 2 had over time. We can use this information to show a line chart or a dashboard that shows trends.

 select dd.DateValue, fm.MemberAmount
 from fact.Membership fm
 left join dim.Date dd on fm.DateKey = dd.DateKey
 left join dim.HealthPlan dp on fm.HealthPlanKey = dp.HealthPlanKey
 where dd.DateValue IN ('2011-09-01', '2011-09-02', '2011-09-03')
 and fm.HealthPlanKey = 2
 GO

This snapshot fact table is based on a daily grain, but showing snapshots on only a monthly grain instead is also common. This coarser grain can ease the storage issue but should be considered only if the business process looks at the metrics on a monthly basis.
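To show the flavor of the coarser grain, the following sketch rolls the daily snapshot up to months by averaging the daily counts; taking the end-of-month value instead is another common choice.

 -- Illustrative monthly view of the daily membership snapshot
 select dd.YearValue, dd.MonthValue, fm.HealthPlanKey,
  AVG(fm.MemberAmount) as AvgMemberAmount
 from fact.Membership fm
 inner join dim.Date dd on fm.DateKey = dd.DateKey
 group by dd.YearValue, dd.MonthValue, fm.HealthPlanKey
 GO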

Analytical Querying

Recall that, from my outty perspective, the whole point of storing data is to get the information out, and we can write a variety of queries to get that information. Let’s start with a combination of our dimensions and transaction fact from the previous section to provide some information to our users. Then, we’ll look at how indexes can improve our query performance.

Queries

Our first reporting request is to deliver the number of claims that have had a payment denied over a certain length of time. For this query, we will use the fact.ClaimPayment table joined to some dimension tables:

 select count(fcp.ClaimID) as DeniedClaimCount
 from fact.ClaimPayment fcp
 inner join dim.AdjudicationType da on fcp.AdjudicationTypeKey=da.AdjudicationTypeKey
 inner join dim.Date dd on fcp.DateKey=dd.DateKey
 where da.AdjudicationType = 'DENIED'
 and dd.MonthValue = 9
 GO

This query highlights a few querying best practices. First of all, try not to filter by the surrogate key values. Using surrogate keys is tempting, because we know that the key for the DENIED adjudication type is always going to be 3. Doing so would also remove a join, because we could filter directly on the claim payment fact. However, this makes the query less readable in the future, not only to other report writers but also to us.

Additionally, this query shows the importance of descriptive column and table aliases. Descriptive column aliases make it easier to understand the output of the query, and using table aliases to qualify columns reduces possible confusion in the future if you add new columns or modify the query.

Because we were smart in creating our model, we can actually improve the performance of this query even further by using our DeniedCount column. Essentially, the work we need to do in the query was already done in the load process, so we end up with the following query:

 select sum(fcp.DeniedCount) as DeniedClaimCount
 from fact.ClaimPayment fcp
 inner join dim.Date dd on fcp.DateKey=dd.DateKey
 where dd.MonthValue = 9
 GO

Our next reporting query is to assist the operations department in determining how much we are paying out based on automatic adjudication per provider, as the company is always trying to improve that metric. We can use the same fact but modify our dimensions to create the following query:

 select dp.OrganizationName, sum(fcp.AutoPayoutAmount) as AutoPayoutAmount
 from fact.ClaimPayment fcp
 inner join dim.Provider dp on fcp.ProviderKey=dp.ProviderKey
 group by dp.OrganizationName
 GO

Many of the principles from relational querying also apply to reporting queries. Among these are to make sure to use set-based logic where possible, avoid using select *, and restrict the query results whenever possible. If we take the previous query, we can improve it by adding a filter and focusing on the business need. Because the operations group wants to improve their automatic adjudication rate, it is looking at the value over time. Taking those two factors into account, our new query can be used to create a line chart of the automatic adjudication rate over time:

 select dp.OrganizationName,

 dd.MonthValue,

 sum(fcp.AutoPayoutAmount)/sum(fcp.ClaimAmount)*100 as AutoRatio

 from fact.ClaimPayment fcp

 inner join dim.Provider dp on fcp.ProviderKey=dp.ProviderKey

 inner join dim.Date dd on fcp.DateKey=dd.DateKey

 where dd.DateValue between '01/01/2011' and '12/31/2011'

 group by dp.OrganizationName, dd.MonthValue

 GO

A request that I often receive is “I want everything.” While I recommend digging into the business need to truly understand the reasoning behind the request, sometimes “everything” is exactly what the situation calls for. If so, we could provide a query similar to this:

 select dd.DateValue, dm.InsuranceNumber, dat.AdjudicationType,

  dp.OrganizationName, ddiag.DiagnosisCode, dhcpc.ProcedureCode,

  SUM(fcp.ClaimAmount) as ClaimAmount,

  SUM(fcp.AutoPayoutAmount) as AutoPayoutAmount,

  SUM(fcp.ManualPayoutAmount) as ManualPayoutAmount,

  SUM(fcp.AutoAdjudicatedCount) as AutoAdjudicatedCount,

  SUM(fcp.ManualAdjudicatedCount) as ManualAdjudicatedCount,

  SUM(fcp.DeniedCount) as DeniedCount

 from fact.ClaimPayment fcp

 inner join dim.Date dd on fcp.DateKey=dd.DateKey

 inner join dim.Member dm on fcp.MemberKey=dm.MemberKey

 inner join dim.AdjudicationType dat on fcp.AdjudicationTypeKey=dat.AdjudicationTypeKey

 inner join dim.Provider dp on fcp.ProviderKey=dp.ProviderKey

 inner join dim.Diagnosis ddiag on fcp.DiagnosisKey=ddiag.DiagnosisKey

 inner join dim.HCPCSProcedure dhcpc on fcp.ProcedureKey=dhcpc.ProcedureKey

 group by dd.DateValue, dm.InsuranceNumber, dat.AdjudicationType,

  dp.OrganizationName, ddiag.DiagnosisCode, dhcpc.ProcedureCode

 GO

The result of this query provides a great base for a drill-through report, where an end user can see a high-level summary of information and drill into the details. Additional fields from each dimension can be included without affecting the grain of the query.

Notice that, in each of the preceding queries, we start with the fact table and then add the appropriate dimensions. Using this approach focuses on the metrics and uses the entities as descriptors. If you ever find that you are querying a dimension directly, revisit the business process, and see whether you can create a new fact to alleviate this concern.
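For example, if you catch yourself writing a report like the following sketch, straight against a dimension, that is a hint that a business event, such as member enrollment, deserves a fact table of its own:

 -- A smell, not a pattern: reporting straight off a dimension
 select count(*) as MemberCount
 from dim.Member
 GO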

Indexing

Indexing is typically used to increase query speed, so it’s a perfect fit for reporting. Let’s look at the following query based on our healthcare model and work through some performance tuning through indexing:

 SELECT ProcedureKey, SUM(ClaimAmount) As ClaimByProcedure

 FROM fact.ClaimPayment

 GROUP BY ProcedureKey

 GO

With no index, this results in the following query plan:

 |--Stream Aggregate(GROUP BY:([ReportDesign].[fact].[ClaimPayment].[ProcedureKey]) DEFINE:([Expr1004]=SUM([ReportDesign].[fact].[ClaimPayment].[ClaimAmount])))
      |--Sort(ORDER BY:([ReportDesign].[fact].[ClaimPayment].[ProcedureKey] ASC))
           |--Table Scan(OBJECT:([ReportDesign].[fact].[ClaimPayment]))

The table scan is a sign that this query may take longer than we would wish. Let’s start with a nonclustered index on our procedure key to see whether that speeds up performance. Create the index with this script:

 CREATE NONCLUSTERED INDEX NonClusteredIndex ON fact.ClaimPayment

 (

  ProcedureKey ASC

 )

 GO

Running the same query produces the same plan, so this index gains us nothing: the query still needs ClaimAmount, which the index does not cover, so the optimizer sticks with the table scan. Next, let’s try a new feature in SQL Server 2012 called columnstore indexes. Rather than storing index data per row, they store data by column. Like typical indexes, they don’t change the structure of the underlying data, but in SQL Server 2012, they do make the table read-only because of the complexity of maintaining them. For each of the columns referenced in a columnstore index, all of the data is grouped and stored together. A table may have only a single columnstore index, but because of the way the index is structured, each referenced column is useful independently, unlike a traditional row-based index. Create a columnstore index using the following script:

 CREATE NONCLUSTERED COLUMNSTORE INDEX ColumnStoreIndex ON fact.ClaimPayment

 (

  DateKey,

  MemberKey,

  AdjudicationTypeKey,

  ProviderKey,

  DiagnosisKey,

  ProcedureKey,

  ClaimID,

  ClaimAmount,

  AutoPayoutAmount,

  ManualPayoutAmount,

  AutoAdjudicatedCount,

  ManualAdjudicatedCount,

  DeniedCount

 )

 GO

When you run the query again, you can see that the plan has changed quite drastically:

 |--Stream Aggregate(GROUP BY:([ReportDesign].[fact].[ClaimPayment].[ProcedureKey]) DEFINE:([Expr1004]=SUM([ReportDesign].[fact].[ClaimPayment].[ClaimAmount])))
      |--Sort(ORDER BY:([ReportDesign].[fact].[ClaimPayment].[ProcedureKey] ASC))
           |--Index Scan(OBJECT:([ReportDesign].[fact].[ClaimPayment].[ColumnStoreIndex]))

The index scan operator (rather than the table scan operator from the original plan) shows that this query is now using the columnstore index. I won’t cover columnstore indexes in any more detail, but for fact tables and dimensions with large numbers of rows and low-cardinality columns, you may get order-of-magnitude performance improvements by simply adding this one index. For more details, see the “columnstore indexes” topic in SQL Server Books Online.

Summary Modeling for Aggregation Reporting

If you are interested in analytical reporting, a dimensional model will probably best suit your needs. On the other hand, in some scenarios, a full dimensional model is not necessary. Creating an aggregation table or set of tables with the desired information will work perfectly for you in these cases. This section covers the reasons why you would want to use summary tables over dimensional modeling, the pros and cons of using a summary table, and finally, how to create a summary table.

No formal methodologies exist for summary modeling, because the development is so specific to your exact need. You can create summary tables to show values over specific time frames, rollups of subsidiary companies versus the entire parent company, or even every requested value for a customer. We will cover an example of something I have used successfully in the past.

Summary modeling is best used when the slicing and dicing that we normally associate with data warehousing is not necessary. The information provided from the table usually feeds a limited set of reports but is tied very tightly to the information needed. Some of the benefits of summary modeling follow:

  • Increased speed due to the preaggregated values
  • Ease of maintenance and understandability due to fewer joins

Before jumping into this solution, you’ll want to consider some disadvantages as well:

  • Fixed data granularity limits the types of reports that can be created.
  • Details of the data are not available for further analytical research.
  • Historical tracking, such as what is provided with slowly changing dimensions, is not available.

Initial Summary Table

Once you’ve decided that the benefits outweigh the disadvantages, you are ready to begin. Go through the same steps described in the “Requirements Gathering Process” section earlier in this chapter; the work in this section picks up at step 4.

Let’s use the same business process as described earlier: when an insurance company receives a claim for a member, the claim goes through an adjudication process. The claim can be automatically adjudicated, manually adjudicated, or denied. As an insurance company, we need to know whether we adjudicated the claim, how much the physician requested based on the member’s diagnosis, and how much we are willing to pay for that procedure.

In addition, we’ve been told that we want to display the number of claims over time for each adjudication type in a report. We could show this information using the prior model, but how would we do this in a summary model? We will walk through the steps now to produce the final model shown in Figure 14-4.

image

Figure 14-4. Summary model for health care claim data

Our first step is to determine the grain of our table. Using the business process description and the report requirements, we arrive at one row per day per adjudication type, with claim count and claim amount as the stored values. Unlike in dimensional modeling, all codes and descriptive values are included in the aggregation table. Summary tables also do not use surrogate keys, because there is no need to join tables together.

The code to create this table is shown here:

 -- Create schema for summary tables

 CREATE SCHEMA [sum]

 GO

 -- Create Daily Claims table

 CREATE TABLE [sum].DailyClaims (

  ClaimDate DATE NOT NULL,

  AdjudicationType VARCHAR(6) NOT NULL,

  ClaimCount INTEGER NOT NULL,

  ClaimAmount DECIMAL(10,2) NOT NULL

 )

 GO

Note that when you load the data, you will need to preaggregate the values to match your grain. For example, your claims system probably records more than one claim per day, but you want to see them all together! You can use SQL statements with SUM functions or an Integration Services package with the Aggregate transformation; either way, your insert should be the complete set of values (a load sketch follows the sample results below). For a sample set of data, run the following query:

 -- Add sample data for summary tables

 DECLARE @i INT

 SET @i = 0

 WHILE @i < 1000

 BEGIN

 INSERT INTO sum.DailyClaims

 (

  ClaimDate, AdjudicationType, ClaimCount, ClaimAmount

 )

 SELECT

  CONVERT(CHAR(8), DATEADD(dd, RAND() * -100, getdate()), 112),

  CASE CEILING(3 * RAND())

  WHEN 1 THEN 'AUTO'

  WHEN 2 THEN 'MANUAL'

  ELSE 'DENIED'

  END,

  1,

  RAND() * 100000

 SET @i = @i + 1

 END

 GO

The results that would be stored in the table could look similar to the following:

ClaimDate   AdjudicationType  ClaimCount  ClaimAmount
----------  ----------------  ----------  -----------
2011-09-23  AUTO              1           54103.93
2011-09-05  DENIED            1           30192.30
2011-08-26  MANUAL            1           9344.87
2011-09-06  DENIED            1           29994.54
2011-06-25  AUTO              1           52412.47
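In a real load, as noted earlier, the insert would preaggregate the claim detail to the table’s grain. Here is a minimal sketch; stage.Claim and its columns are assumptions standing in for wherever your claim detail actually lives:

 -- Sketch: preaggregate detail rows to the daily grain before inserting
 -- (stage.Claim is a hypothetical source table with one row per claim)
 INSERT INTO sum.DailyClaims (ClaimDate, AdjudicationType, ClaimCount, ClaimAmount)
 SELECT CAST(sc.ClaimDate AS DATE), sc.AdjudicationType,
  COUNT(*), SUM(sc.ClaimAmount)
 FROM stage.Claim sc
 GROUP BY CAST(sc.ClaimDate AS DATE), sc.AdjudicationType
 GO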

Additional Summary Tables

We’ve only completed part of our model, though. What if report writers want to look at the adjudication types on a monthly, rather than daily, basis? Or even on a yearly basis? To satisfy these requirements, we will create two more tables:

 -- Create Monthly Claims table

 CREATE TABLE [sum].MonthlyClaims (

   ClaimMonth INTEGER NOT NULL,

   ClaimYear INTEGER NOT NULL,

   AdjudicationType VARCHAR(6) NOT NULL,

   ClaimCount INTEGER NOT NULL,

   ClaimAmount DECIMAL(10,2) NOT NULL

 )

 GO

 -- Create Yearly Claims table

 CREATE TABLE [sum].YearlyClaims (

   ClaimYear INTEGER NOT NULL,

   AdjudicationType VARCHAR(6) NOT NULL,

   ClaimCount INTEGER NOT NULL,

   ClaimAmount DECIMAL(10,2) NOT NULL

 )

 GO

These tables can be loaded using the same method as before or by aggregating from the smaller-grained tables. Sample queries to load the month and year tables are shown here:

 -- Add summarized data

 insert into sum.MonthlyClaims

 select MONTH(ClaimDate), YEAR(ClaimDate), AdjudicationType,

   SUM(ClaimCount), SUM(ClaimAmount)

 from sum.DailyClaims

 group by MONTH(ClaimDate), YEAR(ClaimDate), AdjudicationType

 GO

 insert into sum.YearlyClaims

 select YEAR(ClaimDate), AdjudicationType, SUM(ClaimCount), SUM(ClaimAmount)

 from sum.DailyClaims

 group by YEAR(ClaimDate), AdjudicationType

 GO

When creating a model such as this one, it is tempting to create one supertable that includes all of the date grains: daily, monthly, and yearly. Such a table would carry an additional column, such as DateType, describing the grain of each row. While this option is valid, I prefer to keep these as separate tables to make querying and maintenance easier.
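For illustration only, such a supertable might look like the following sketch; I show it to make the trade-off concrete, not to recommend it:

 -- Hypothetical supertable combining all grains; shown for comparison only
 CREATE TABLE [sum].AllClaims (
  DateType VARCHAR(5) NOT NULL,  -- 'DAY', 'MONTH', or 'YEAR'
  ClaimDate DATE NULL,           -- populated only when DateType = 'DAY'
  ClaimMonth INTEGER NULL,       -- populated only when DateType = 'MONTH'
  ClaimYear INTEGER NULL,
  AdjudicationType VARCHAR(6) NOT NULL,
  ClaimCount INTEGER NOT NULL,
  ClaimAmount DECIMAL(10,2) NOT NULL
 )
 GO

Every query against this table would have to remember to filter on DateType, which is exactly the kind of trap that separate tables avoid.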

Aggregation Querying

Because this is the section on reporting, we must talk about getting the information out of the tables we’ve designed! By design, queries against summary tables are usually quite simple, but we still have a few tricks up our sleeves. Let’s look at some queries using the different grain tables we created and then at how to index those tables for the best performance.

Queries

Let’s begin with the original request we received from the business: a way to see the number of claims over time for different adjudication types. Depending on which time grain the users are interested in, they will use one of the following very simple queries:

 SELECT ClaimDate, AdjudicationType, ClaimCount, ClaimAmount

 FROM sum.DailyClaims

 GO

 SELECT ClaimMonth, ClaimYear, AdjudicationType, ClaimCount, ClaimAmount

 FROM sum.MonthlyClaims

 GO

 SELECT ClaimYear, AdjudicationType, ClaimCount, ClaimAmount

 FROM sum.YearlyClaims

 GO

These queries return exactly the information requested by the end user. It is still important to drill into what the final report should look like. For example, if the users are only interested in seeing the last three months of data at the day level, we can write the query like this:

 SELECT ClaimDate, AdjudicationType, ClaimCount, ClaimAmount

 FROM sum.DailyClaims

 WHERE ClaimDate BETWEEN DATEADD(mm, -3, getdate()) and getdate()

 ORDER BY ClaimDate

 GO

The previous query filters the data so that a smaller set of information needs to be consumed and the query performs more quickly. Many of the other query best practices we’ve discussed previously in this book also apply to summary tables.

Our next request deals with the shape of the output. We need the claim amounts for each of the adjudication types split into separate columns to be able to easily create a line chart. In this scenario, we will use the MonthlyClaims table and the SQL Server PIVOT operator. The new query is shown here:

 SELECT ClaimMonth, ClaimYear, [AUTO], [MANUAL], [DENIED]

 FROM

 (SELECT ClaimMonth, ClaimYear, AdjudicationType, ClaimAmount

  FROM sum.MonthlyClaims) AS Claims

 PIVOT

 (

 SUM(ClaimAmount)

 FOR AdjudicationType IN ([AUTO], [MANUAL], [DENIED])

 ) AS PivotedClaims

 GO

This query works because we have a fixed list of values in the AdjudicationType field. Once we’ve successfully pivoted the information from rows to columns, the end user can use the results to create a report similar to the one shown in Figure 14-5.

image

Figure 14-5. Sample claim amount report
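As an aside, if the PIVOT syntax ever feels opaque to your report writers, the same result can be produced with conditional aggregation. This sketch is equivalent for our fixed set of adjudication types; SUM ignores the NULLs from the unmatched CASE branches, matching the PIVOT behavior:

 SELECT ClaimMonth, ClaimYear,
  SUM(CASE WHEN AdjudicationType = 'AUTO' THEN ClaimAmount END) AS [AUTO],
  SUM(CASE WHEN AdjudicationType = 'MANUAL' THEN ClaimAmount END) AS [MANUAL],
  SUM(CASE WHEN AdjudicationType = 'DENIED' THEN ClaimAmount END) AS [DENIED]
 FROM sum.MonthlyClaims
 GROUP BY ClaimMonth, ClaimYear
 GO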

Indexing

Indexing is especially important for summary tables, as users will often be retrieving just a handful of rows. A covering index is a perfect fit here. Let’s use a variation of our first simple query and build an index to support it. The query is

 select SUM(ClaimCount) from sum.DailyClaims

 where ClaimDate = '09/10/2011'

 GO

With no index, we get a fairly poor plan with a table scan and stream aggregate, as shown here:

|--Compute Scalar(DEFINE:([Expr1004]=CASE WHEN [Expr1006]=(0) THEN NULL ELSE [Expr1007] END))
     |--Stream Aggregate(DEFINE:([Expr1006]=Count(*), [Expr1007]=SUM([ReportDesign].[sum].[DailyClaims].[ClaimCount])))
          |--Table Scan(OBJECT:([ReportDesign].[sum].[DailyClaims]), WHERE:([ReportDesign].[sum].[DailyClaims].[ClaimDate]=CONVERT_IMPLICIT(date,[@1],0)))

Luckily for us, we know that a covering index that includes the column in the where clause and the column being selected will greatly improve the performance. Create the index using the following script:

 CREATE NONCLUSTERED INDEX NonClusteredIndex ON sum.DailyClaims

 (

  ClaimDate ASC,

  ClaimCount

 )

 GO

We end up with the plan shown here, in which the index seek will definitely improve our performance:

|--Compute Scalar(DEFINE:([Expr1004]=CASE WHEN [Expr1006]=(0) THEN NULL ELSE [Expr1007] END))
     |--Stream Aggregate(DEFINE:([Expr1006]=Count(*), [Expr1007]=SUM([ReportDesign].[sum].[DailyClaims].[ClaimCount])))
          |--Index Seek(OBJECT:([ReportDesign].[sum].[DailyClaims].[NonClusteredIndex]), SEEK:([ReportDesign].[sum].[DailyClaims].[ClaimDate]=CONVERT_IMPLICIT(date,[@1],0)) ORDERED FORWARD)

Make sure to always create the index based on the intended usage of the table. Creating the correct indexing scheme will make both you and your end users happier!
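For instance, if a second set of reports filtered by adjudication type across a date range, a different covering index would serve it better. The following is a sketch under that assumed usage; the INCLUDE clause covers the measure columns without widening the index key:

 -- Sketch: a covering index for an assumed filter-by-type usage pattern
 CREATE NONCLUSTERED INDEX NonClusteredIndex_AdjType ON sum.DailyClaims
 (
  AdjudicationType ASC,
  ClaimDate ASC
 )
 INCLUDE (ClaimCount, ClaimAmount)
 GO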

Summary

This chapter took a sharp turn off the path of relational modeling to delve into the details of modeling for reporting. We covered why you would want to model a reporting database differently than a transactional system and the different ways we can model for reporting. I definitely did not cover every potential scenario you may face, but I hope I gave you a good introduction to get started with some initial reporting structures.

Dimensional modeling for analytical reporting and summary modeling for aggregate reporting serve different needs and requirements, so be sure to use the one most appropriate for your environment. Dimensional modeling, the better-known method, has several well-known methodologies, led by Ralph Kimball and Bill Inmon; this chapter used ideas from both while explaining some general dimensional modeling concepts. Summary modeling is a great first step toward reporting and can be very powerful, but be sure you understand the pros and cons before committing to this type of solution.

Finally, we looked at querying the different types of models. When designing for reporting, the end user and final queries should always be in the forefront. Without people who query, you have no need for a reporting database!
