Business Problem

The customer for this chapter is the same manufacturing company that we discussed in Chapter 3, “Building a Data Warehouse.” We will be looking at the broader integration issues underlying the data warehouse solution, taking into account the sources of information and other business processes.

Problem Statement

The manufacturer has grown partly by acquisition, and the need to integrate multiple disparate systems is proving costly and difficult to achieve. Due to the high cost and risk of moving each of the businesses to a single enterprise resource planning (ERP) solution, two systems with different reporting capabilities are currently used. Also, additional information such as budgets and forecasts is generated outside the ERP systems.

This has led to the following problems for the business:

  • Managers at every level aren’t getting the information they need to run the enterprise effectively. Producing information that describes the business as a whole requires extensive manual work to bring the two sets of accounts together, which increases costs. To reduce that effort, much of the detail is omitted, leaving the corporation with only a high-level view of operations.

  • Information is stale by the time it reaches the readers. Manually consolidating information takes time, leading to the process being initiated only once per quarter.

Solution Overview

In our solution, we’ll create a conformed data warehouse that can logically accommodate the information available in each of the sales systems. Our extraction, transformation, and loading (ETL) process will transform the data from each data source into the shape needed for the data warehouse. The ETL process can be run on a weekly (or more frequent) basis to keep the information current. The budgeting information will be integrated into the warehouse, eliminating the manual process of tracking sales to budgets. Users will continue to be able to use familiar Excel spreadsheets to enter their budget figures.

Business Requirements

The high-level requirements to support the business objectives are as follows:

  • Present a consolidated view of the information. The information from the two source ERP systems needs to be consolidated, taking into account differences in product and territory dimensions and differences in representing transactional sales measures.

  • Present a comparison of budget versus actual sales. The business creates budgets using an Excel spreadsheet for each region to understand sales performance. The budgets are created at a higher level than the detailed level transactions, specifically region by product by quarter.

High-Level Architecture

We will use SQL Server Integration Services (SSIS) to read data from our data sources, perform any data transformations we need, and then store it in the data warehouse. Integration Services will also be used to control the sequence of events and processes, such as emptying tables, loading data in the proper order, and generating audit logs of the operations. The tasks and transforms necessary for each of these processes are stored in Integration Services packages. You develop packages using the graphic designer for SQL Server Integration Services projects in BI Development Studio. You can test and debug your packages in the development studio. Once your packages are ready, you deploy them to the Integration Services server, where they can be executed directly, or they can be invoked by a job step in a SQL Agent job.

We will load the full set of dimension data into our warehouse on a daily basis. This data has the business keys from the original companies. We will use our own surrogate keys to uniquely identify the products and regions and create a uniform model across the business. A result of reloading the dimensions is that the surrogate keys can change, invalidating the facts previously loaded. For this example, we will reload all the sales for each company so that the new surrogate keys are reflected in the fact table. This works if the volume of data is small enough that the processing fits within your time constraints. We show you how to properly update existing dimensions and incrementally load facts in Chapter 8, “Managing Changing Data.”

The cube data will remain available to users during the rebuild process through the built-in caching facilities of Analysis Services. During the loading of sales data, we will translate business keys into the new surrogate keys. We will load budget data quarterly from Excel spreadsheets from each company and perform the same business key translations as we do for sales. This will allow us to directly compare sales and budgets. We’ll use the capability of Integration Services to loop through the filenames in a directory to automate the loading of the budget spreadsheets. Figure 4-1 illustrates the flow of data in our ETL process.

Figure 4-1. High-level architecture

image

Business Benefits

The solution will deliver the following benefits to the client:

  • The automation of data integration will lead to lower costs and improved awareness of business performance and provide a solid platform for future growth and acquisitions.

  • Access to more up-to-date information will improve the business’ agility and responsiveness to changing market conditions.

Data Model

We are extending the data model that we created in Chapter 3 by adding support for budgeting. The first question that we normally ask about a new business area is “What is the grain of the fact data?” You might remember from Chapter 3 that the grain of the Shipments fact data was customer by product, by day, by plant, and by ship method. The grain is defined by the lowest level of each dimension used in the fact table.

Now we’ll look at the grain of the budget facts. You will see that although some of the same dimensions are used, the level differs for budgets. Specifically, the Time dimension goes down to the level of a day, but the budget grain is only at the level of a quarter. You don’t need to define a new Time dimension to handle this; the same dimension can always support a coarser grain.

The Budget Data

In our budget, we want to predict what we are going to sell, where it will be sold, and over what period of time.

We want to support the planning process of the manufacturing plants by providing estimates for each product, so Product is one component of the grain of the budget. We also want to show the commitments made by each sales territory to support these sales, so Sales Territory is another component of the grain. Finally, in contrast to the Shipments data, which is at a day level, it would be unrealistic to produce a budget that tried to predict for each day of the upcoming year exactly what products we would ship to our customers. The budget numbers need to be at a much higher level, such as quarterly. Our grain then is by product, by territory, and by quarter.

Figure 4-2 shows a sample of a budget spreadsheet that we will use as the source for the budget facts. There will be one spreadsheet for each sales region.

Figure 4-2. Budget spreadsheet

image

Working with Extended Measures

The budget spreadsheets express everything in terms of number of units. This is convenient in some ways because unit is a nice, additive measure that is very simple to model. However, we also want to use the price and cost to compute expected revenue, cost, and profit. Because the budget numbers are at the Product level, these additional measures can be calculated by looking up the appropriate selling price, cost, and profit amount on the relevant product record and multiplying by the number of units. By having the ETL process look up these numbers and store them on the fact table, we reduce the chances of a transcription error in the spreadsheet.

Instead of physically storing the extra columns in the Budget fact table, you could instead create a view over the fact table that joins to the Product table and calculates the revenue, cost, and profit. A side effect of this approach is that the values on the Product dimension record, such as standard cost, can change over time. If we calculated the revenue, cost, and profit in a view, those figures would change through the course of a fiscal year, and we wouldn’t be able to go back and see the historical values. Of course, if this is the behavior you are after, using a view is a good solution.
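A view along these lines would always reflect the current list price and standard cost on the Product dimension. This is a minimal sketch only; the table and column names are assumptions based on our model rather than the exact definitions from Chapter 3:

-- Sketch only: assumes a FactBudget table holding Units plus the dimension
-- keys, and a DimProduct table carrying ListPrice and StandardCost
CREATE VIEW dbo.BudgetExtended
AS
SELECT b.SalesTerritoryKey,
       b.ProductKey,
       b.TimeKey,
       b.Units,
       b.Units * p.ListPrice                    AS BudgetRevenue,
       b.Units * p.StandardCost                 AS TotalCost,
       b.Units * (p.ListPrice - p.StandardCost) AS BudgetProfit
FROM dbo.FactBudget AS b
JOIN dbo.DimProduct AS p ON p.ProductKey = b.ProductKey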

Figure 4-3 shows the data model we will use to support the budgets. We will store the extended revenue and cost in the fact table. This will give us better performance and will retain the revenue and costs current at the time of the budgeting.

Figure 4-3. Budget fact table

image
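Expressed in T-SQL, the budget fact table we will populate might look something like the following sketch, with the extended measures stored physically so that they retain the prices and costs current at budgeting time. Treat the names and data types as assumptions; the authoritative definition is the model in Figure 4-3:

-- Sketch of the budget fact table; names and types are assumptions
CREATE TABLE dbo.FactBudget
(
    SalesTerritoryKey int   NOT NULL,  -- surrogate key from DimSalesTerritory
    ProductKey        int   NOT NULL,  -- surrogate key from DimProduct
    TimeKey           int   NOT NULL,  -- Time dimension member for the quarter
    Units             int   NOT NULL,
    BudgetRevenue     money NOT NULL,  -- Units * list price at budget time
    TotalCost         money NOT NULL   -- Units * standard cost at budget time
)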

Technical Solution

SQL Server Integration Services is the tool we will use to bring in our dimension data and our sales and budget facts from the source systems. Integration Services will also be used to transform the data into the shape we want before it is loaded into our data warehouse.

At a high level, all you want to do is copy some data from one or more data sources to the data warehouse. However, we are now at the point where we need to specify in detail just how to do that, and there is more to this than simply copying the data.

The sales and budget facts depend on the dimension data, so we must load the dimensions first. For simplicity, we have elected to reload all the data in our dimension and fact tables rather than trying to determine what has changed. Our data volume is small enough that we can get away with this. We address how you can load only the changes to the dimensions and facts in Chapter 8.

We need to accomplish several things in our ETL process. We need to load the dimensions, and then the sales facts can be loaded. Annual forecasts are prepared toward the end of each year, and these facts are loaded independently of the other dimensions or facts. When the data loading is completed, we need to process the dimensions and cubes in Analysis Services.

Getting Started with Integration Services

We are going to create an Integration Services project to load our dimensions and facts into the data warehouse. The sequence of tasks and transforms you define in Integration Services is saved in packages, which can reside in SQL Server or the file system. We like to create modular packages that have a single purpose, instead of creating very large packages that do everything. Later, we will add a package for loading the sales data, and another to load the budgets. We strongly encourage the use of integrated source control, such as Visual SourceSafe 2005 or Visual Studio Team System, to prevent multiple developers from inadvertently modifying the same package.

Data Sources and Destinations

An Integration Services project usually needs one or more sources of data and a destination for the transformed data. In our project, we want to populate our data warehouse with data from a number of data sources: the dimension data and the sales from our on-line transaction processing (OLTP) systems, and the budgets from spreadsheets.

Integration Services uses connection managers to connect data with tasks and transforms in Integration Services. Connection managers specify a database in an instance of SQL Server or other relational database such as Oracle or DB2, a file, a spreadsheet, or one of several other data sources. We find it useful to define these as soon as we create a package.

We’d like to clarify a potential area of confusion about Data Source objects. These are not connection managers, but you can create a connection manager from a Data Source object. The advantage of a Data Source object is that it is visible to all packages in a project, and any connection managers based on a Data Source object are synchronized. If you change where a Data Source object points, all connection managers based on that data source will change to the new location. Note that this only applies during design. It is the connection manager that is exposed for configuration at runtime. This is discussed later in the section on deployment.

You don’t need to have all your data in one source, or even one type of source—it can come from any number of different sources. If you don’t see a data provider for your particular data source, there is usually one available from third parties. A quick search for “ole db providers” using your favorite search engine should locate what you need.

For our customer, we’ll need a connection to the data warehouse on SQL Server, a connection to each of the sales databases, and a connection to the file location where the Excel budget data is kept.

If you are looking down the road to deploying your solution to another environment, rest easy. You can dynamically change where a connection manager points to at runtime by assigning a variable to the appropriate property of the connection manager, such as the ServerName property. You can also set up an external runtime configuration that will set any property of a connection manager. We’ll look at how to do that in a later section on deployment.

QUICK START: Setting Up Connection Managers

We’re assuming at this point that you have set up your data warehouse in the previous chapter and that your source data is accessible via the network from the machine running SQL Server 2005. Our example uses data from a SQL Server database:

  1. In the BI Development Studio, select New Project from the File menu, and choose Integration Services Project from the Business Intelligence Projects subfolder.

  2. Name the project Manufacturing and click OK.

  3. Create a new data source by right-clicking Data Sources in the Solution Explorer.

  4. Skip over the welcome page. Click New on the How to Define the Connection page.

  5. On the Connection Manager form, choose the provider (Native OLE DB\SQL Native Client for our example).

  6. Type or select the server name. Remember to specify the Instance name if you are referring to a named instance of SQL Server.

  7. Choose the method to log on to the server. (We prefer Windows authentication.)

  8. Select a database to connect to that contains one or more of the source tables for the dimensions and click OK. Rename the connection manager to SalesTracking.

  9. Repeat this process to create a data source for the data warehouse. Note that in Step 8, you will choose the data warehouse database, rather than the source database, and rename it to DataWarehouse.

  10. At the bottom of the Control Flow tab, right-click in the Connection Managers area, select New Connection from Data Source, and choose the data sources you just created (SalesTracking and DataWarehouse).

Now we are ready to start defining our ETL process.

TIP:

Be Consistent in Naming Connection Managers

When it comes time to deploy your solution from development to other environments, you will want to redirect your data sources and destinations. The configuration for each connection manager is usually stored by name in a SQL Server table or an XML file. If you use the same name in each Integration Services package for connection managers that refer to the same database, you will only have to change a few rows in a configuration table or an attribute in a few XML configuration files. Note that connection manager names are case sensitive.

Loading the Dimensions

We are going to create one package for each dimension we want to load. All these packages will be part of the project we just created.

Integration Services explicitly separates the design of the process flow from the design of the data flow. This separation simplifies the creation of ETL processes, and makes the process you implement much more understandable to others who might have to maintain the application. You first define the sequence of events that need to take place in order to populate the data warehouse. This is called the Control Flow. Separately, you define how each table in the data warehouse is populated from the data sources and what transformations the data must go through before it is ready for the data warehouse. This is called the Data Flow. Each package has one control flow. You can have multiple data flows in a control flow, and you will have a data flow for each entity you are populating in the data warehouse.

Defining the Control Flow

We’ll start by loading the Sales Territory dimension table. In this section, we create an Integration Services package you can use as a model for loading dimension tables from any source. We hasten to point out that we will deal with what are called slowly changing dimensions (SCDs) in a separate chapter because of the extra complexity needed to handle them. Right now, we’re just going to ease into loading what we will assume to be static dimensions. This is appropriate for simple dimensions such as status, condition, or gender. It is also appropriate for situations where you don’t care about the historical values of a dimension and will be reloading all your data when you update the data warehouse, or for when you are just working on a proof of concept. In our experience, most applications have an implicit requirement to maintain history, so you should plan on accommodating this in your designs.

Because our control flow for loading a dimension consists of just copying data from a source to a destination, simply drag a Data Flow task from the Toolbox onto the surface of the Control Flow tab, as shown in Figure 4-4. Click the label to rename the Data Flow task to Load Sales Territories.

Figure 4-4. Initial control flow

image

Finally, you are ready to build the piece of the puzzle that you originally set out to do: moving some data into the data warehouse.

Defining the Data Flow

Data flow processing is initiated when a Data Flow task is executed in a control flow. In a Data Flow task, for one entity in the data warehouse, you define the flow of data from source to destination and the transforms needed along the way. The other streams required to populate other tables in the data warehouse will be handled by other Data Flow tasks. You can apply as many transformations as needed to prepare the data for the data warehouse. In this initial example, we just copy a subset of the columns from the source to the data warehouse.

QUICK START: Configuring a Simple Data Flow

Now we’ll continue on with designing a package to load a dimension. In this example, we copy a subset of the columns of the Sales Territory table from our OLTP source system into a table in our data warehouse.

  1. If you haven’t already done so, drag a Data Flow task from the Toolbox onto the surface of the Control Flow tab, as shown in Figure 4-4, and click the label to rename it to Load Sales Territories.

  2. Double-click the new Data Flow task to open the data flow design surface. Drag an OLE DB Source from the Toolbox onto the pane and change the name to Sales Territory.

  3. Drag an OLE DB Destination data flow destination from the Toolbox onto the pane and change the name to DW Sales Territory.

  4. Click Sales Territory and drag the green arrow onto DW Sales Territory. You can’t configure the data destination properly until it has data flowing into it. You will notice in the Sales Territory data flow source that there is a small red circle with an X in it. This is because it is missing some information. You can hover the mouse over the X to see what the problem is. We need to tell it where to find the source data, which table it is in, and which columns we want to copy.

  5. Double-click the Sales Territory OLE DB Source to open the OLE DB Source Editor.

  6. Set the OLE DB Connection Manager setting to SalesTracking.

  7. Set the data access mode to Table or view, and select [Sales].[Sales Territory] from the list of tables. Click Columns.

  8. We only want to copy TerritoryID, Name, CountryRegionCode, and Group, so check those columns. Uncheck the others.

  9. Double-click the DW Sales Territory OLE DB destination to open the OLE DB Destination Editor.

  10. Set the Connection Manager to DataWarehouse.

  11. Select [dbo].[dimSalesTerritory] from the list of tables.

  12. Click Mappings.

  13. Drag and drop each input column onto the corresponding destination column. The TerritoryID input column is mapped to the business key, not to the SalesTerritoryKey. SalesTerritoryKey is an internally generated surrogate key. (The editor does try to map the columns based on the name, but none of our names match. If there are any automatic mappings, it is a good idea to review them.)

  14. Click OK to save the configuration. This would be a good time to save the whole solution, too.

Your data flow should now look like Figure 4-5.

Figure 4-5. Basic data flow

image

TIP:

SQL Server Destinations Only Work on a Local Server

We did not choose the SQL Server Destination in the Quick Start because it will only work when the database is on the same server that the package is running on. If you know you will always be running on the same server, you should definitely choose the SQL Server Destination, because it is faster than the OLE DB Destination.

Testing Your Package

To try out your package, simply click the green Play button (or press F5, or choose Debug/Start). The data flow between the source and destination will show you how many rows were sent to the destination. You should see the task turn green when it has completed. The package is still in debug mode. To continue to work with the package, you need to stop debugging by pressing Shift+F5 or choosing Debug/Stop debugging.

You’re so excited that it actually worked, you call the boss over to see it go, and press Play again. After what seems to be an interminable delay, the source task turns green, but the destination turns red. It didn’t work the second time. You claim it is just the demonstration effect, but deep down you know something is missing.

Where do you look to see what went wrong? Click the tab labeled Progress to see a detailed list of the results of each step, as shown in Figure 4-6. Scroll down to find the row with an exclamation mark and hover the mouse over the line to view the entire error message. The message autohides after a short time. You can right-click it, copy it to the clipboard, and paste it into Notepad to make it easier to read the message.

Figure 4-6. Examining the progress log

image

In the message, you’ll see that there is a violation of a unique key constraint. The business key is required to be unique, and that’s a good thing. We just tried to load the same data twice. One of the constraints prevented an accidental duplication of the dimensional data, but we need to be able to run this task more than once! We need to change the control flow so that the table is emptied before we reload.

Making Your Package Repeatable

The problem with the task we just created is that the data in the destination is not overwritten. Instead, we are inserting additional data even if it’s the same data. A simple fix for this is to delete the data at the destination just before we insert the new data. All we need to do is issue a Truncate Table SQL statement. We can do this by modifying the control flow, adding a simple Execute SQL task before the Load Sales Territories task.

QUICK START: Working with Execute SQL Tasks

In this example, we truncate the destination table by adding a SQL statement to the control flow. This will prevent the same rows from being added twice to the table:

  1. Drag an Execute SQL task onto the Control Flow tab, above the Data Flow task.

  2. Rename this task Empty Sales Territories.

  3. Double-click the Empty Sales Territories task on the Control Flow tab to bring up the Execute SQL Task Editor.

  4. On the General tab of the editor, verify that the ResultSet property is set to None, the ConnectionType is OLE DB, and the SQLSourceType is Direct input.

  5. Click in the Connection property and choose the DataWarehouse connection manager from the list.

  6. In the SQL Statement property, enter the following SQL command: Truncate table dbo.DimSalesTerritory

  7. Click Parse Query to check for syntax errors. Note that this does not check to see whether the tables and columns exist! Your query may still fail at runtime if you made a spelling mistake in the table name.

  8. Click OK to save the definition of this task.

  9. Set the execution order of the two tasks by dragging the green arrow from the Empty Sales Territories task onto the Load Sales Territories task. This causes the Load Sales Territories task to wait for the successful completion of the Empty Sales Territories task before beginning execution. You can right-click the green arrow and change the precedence condition so that the next task starts on failure of the previous task, or on any completion status, rather than only on success.

    Your new control flow will now look like Figure 4-7.

    Figure 4-7. Final control flow including Execute SQL task

    image

Now you can bring back the boss and click Play with confidence. The table will be emptied first, and then repopulated with the current contents of the source, so there will be no problem rerunning this task.

Inspecting the Data Flow

Sometimes, your transforms don’t work as you expected, and the problem is more complex than forgetting to empty a table. If you want to see what data is flowing between two data flow components, you can add a Data Viewer to monitor the output of a task. Simply right-click the flow line you want to examine, choose Data Viewers, and click Add. Choose Grid to show a table of the rows being passed between the data flow components.

The Data Viewer shows you a batch of rows, not a continuous stream of data. The size of the batch depends on the size of the rows. The data flow pauses at the end of a batch, allowing you to inspect the data in that batch. Press the Play button (green arrow) to continue.

Completing the Dimensions

You’ve now got a general idea of how a simple data flow is built, so you can go ahead and build packages for loading the rest of the dimensions. Use the same Integration Services project for these dimensions; just create a new package by right-clicking the Packages node in the Solution Explorer pane. After you have the dimensions in place, you can load the facts.

Loading the Fact Table

Unlike our dimension tables, the source records for the Shipments facts come from more than one table. The sales data is in two major tables: SalesOrderHeader and SalesOrderDetail. We also need to include the ProductCostHistory table so that we’ll know how much the items cost when they were shipped. The header contains the business keys for salesperson, territory, the dates, and the customer. The detail table contains the business keys for product and special offer, as well as the measures for quantity shipped, unit price, and discount. To answer questions such as “Where do we ship each of our products?” and “What was our cost for the products we shipped?,” we need to have columns from all three source tables in our Shipments fact table.

The other transformation that we need to perform is to translate the business keys in the incoming source records to the corresponding surrogate keys used by the dimensions.

Working with Multiple Source Tables

You have several choices about how to create this denormalized view of your shipments data that joins the three source tables together. You could create a view in the original data source and query that view directly through the data source in your data flow. But often, you can’t make any changes or additions to the source database, so you need an alternative method for loading this data. You can either load each table independently and denormalize later with a view, or you can use a SQL query in your data source that joins the tables you need. For demonstration purposes, we’re going to use a SQL query that joins the three source tables and loads the result into a denormalized table.

In our previous example where we loaded a dimension, the OLE DB data source Data Access Mode property was set to Table or view. To specify a query that we will build ourselves, you need to set the Data Access Mode to SQL Command. If you click the Query Builder button, you will see a fairly familiar graphical query designer. Use the Add Table icon (plus sign over a grid) to select the three tables we need (SalesOrderHeader, SalesOrderDetail, and ProductCostHistory) and create the joins by dragging SalesOrderID from the Header to the Detail table, and ProductID from the Detail to the CostHistory table. Finally, qualify which time frame in CostHistory we are referring to by adding a where clause:

WHERE OrderDate >= ProductCostHistory.StartDate and OrderDate < ProductCostHistory.EndDate


Check off the columns we want to use. (We won’t talk you through checking off each one; you can decide what you need.) Now you can click OK twice, and you have a multi-table data stream in your data flow (see Figure 4-8).

Figure 4-8. Creating a multiple-table query in a data source

image
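The finished query from Figure 4-8, expressed as a SQL command, ends up looking something like the following sketch. The exact column list is up to you, and the schema and column names assume an AdventureWorks-style sales source, so adjust them to match your own systems:

-- Sketch of the source query for the Shipments fact data
SELECT h.OrderDate,
       h.CustomerID,
       h.SalesPersonID,
       h.TerritoryID,
       d.ProductID,
       d.SpecialOfferID,
       d.OrderQty,
       d.UnitPrice,
       d.UnitPriceDiscount,
       c.StandardCost
FROM Sales.SalesOrderHeader AS h
JOIN Sales.SalesOrderDetail AS d ON d.SalesOrderID = h.SalesOrderID
JOIN Production.ProductCostHistory AS c ON c.ProductID = d.ProductID
WHERE h.OrderDate >= c.StartDate
  AND h.OrderDate < c.EndDate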

Looking Up the Dimension Keys

The next step is to translate the business keys in the data stream from our data source into the surrogate keys used by the dimension tables in the data warehouse. The process is simple: Using the business key, look up the corresponding row in the dimension table. Sounds like a join, so why not use a join? One reason is that we want to avoid hardcoding any references to databases, possibly on other servers, into a query; you would not be able to change the server name to use the package in another environment without editing the package, which only invites maintenance headaches. Another reason is to ensure data quality by ensuring that every fact has a matching dimension member, which we discuss in Chapter 7.

Integration Services has a Lookup transform that can translate business keys into surrogate keys, and it meets both criteria: the dimension table can be in a separate database from the source fact table, and the transform assists us with data quality. You use one Lookup transform for each business key you are translating, so you will end up with a series of lookups in the data flows of the packages where you are populating a fact table. For each Lookup transform, we need to identify which business key we want to translate into a surrogate key, which database and table contain the dimension table that maps these two keys, and finally which column in the dimension table is the surrogate key we want to return to the data stream.
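Conceptually, each Lookup performs the equivalent of the following query for every row flowing through the data flow. This is just an illustration in T-SQL, using the column names we assume for our Product dimension; the transform itself caches the reference table rather than issuing a query per row:

-- What the Lookup Product transform does, expressed as a query
SELECT ProductKey               -- surrogate key returned to the data stream
FROM dbo.DimProduct
WHERE ProductBusinessKey = ?    -- business key arriving from the source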

To configure a Lookup transform to translate business keys into surrogate keys, drag a Lookup transform from the Toolbox onto the data flow surface, and connect the source data stream to it. Rename it to something descriptive such as Lookup Product. Double-click the transform to edit it. You need to specify a connection manager that points to the database containing the reference table, which is one of our dimension tables in the data warehouse. Next, select the reference table from the Use a table or view list.

On the Columns tab, specify which columns in the data flow are to be used to uniquely specify a row in the reference table. Just drag one or more columns from the available input columns onto the Available Lookup Columns in the reference table. This is shown in Figure 4-9.

Figure 4-9. Configuring a Lookup transform

image

Finally, check one or more columns from the Available Lookup Columns that you want to have join the data flow. This is usually the surrogate key column in the dimension table.

A word of caution applies to Lookup transforms. The matching columns, or join criteria, are case sensitive, unlike a typical join in SQL Server. You might need to create a derived column to set the case of the business keys to a known state—for example, all uppercase.
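If you would rather not add a Derived Column transform, one alternative (our own suggestion here, not a step in this chapter’s packages) is to normalize the case of string business keys in the source query itself:

-- Upper-case a string business key in the source query so that it matches
-- the case stored in the dimension table; CountryRegionCode is just an
-- example column
SELECT TerritoryID,
       Name,
       UPPER(CountryRegionCode) AS CountryRegionCode,
       [Group]
FROM [Sales].[Sales Territory]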

Add a Lookup transform for each of the other dimensions using the same pattern we used for the Product dimension.

To complete the data flow, drag an OLE DB Destination from the Toolbox onto the data flow surface. Rename this destination to Sales Data. Reuse the DataWarehouse connection manager and set the table to your sales data table. (If you haven’t created this table in the data warehouse yet, you can easily do so by clicking the New button beside the table field.)

You should end up with a data flow that looks like Figure 4-10.

Figure 4-10. Fact table data flow

image

Loading the Budget Information from Excel

The previous examples loaded data from relational tables. Now, let’s look at loading data from an Excel spreadsheet. We want to be able to import the budget figures into the data warehouse so that we can compare them with the actual results.

The company creates sales forecasts for each of the four quarters in a year, for each region and product. These forecasts are at the product level because they are used for plant capacity planning as well as revenue forecasts.

Budgeting will need its own package. You add a new package to a project by right-clicking Packages in the Solution Explorer and choosing New Package. Rename the package to Load Budgets. You will follow the same pattern of adding a data flow task to the control flow, along with any other tasks you need, such as one to empty the budget table in the data warehouse.

For the data flow, you start with an Excel data source. When you edit the data source and click New to create the connection manager, you will see that the dialog is different from the one for a database source. You browse to the Excel file containing your budget, and specify which sheet in the workbook you want to use.

Matching Source and Destination Data Types

One important thing you need to do is set the true data types of each column. Excel chooses a type for each column based on an examination of the first eight rows. That choice doesn’t always match the data warehouse requirements. For example, numeric columns come in as float, whereas general comes in as nvarchar(255). We recommend using a Data Conversion transform right after the Excel data source to set the data types to what you really need.

Reshaping the Data

Fact tables, such as the budget forecasts we are working with, need to have exactly one occurrence of each measure per row, along with one member of each dimension. This isn’t always the way people like to see the information presented to them on a spreadsheet.

Looking back at the spreadsheet we are importing, shown in Figure 4-2, you can see that the rows aren’t normalized. Four quarterly forecasts for each region and product repeat on each row, and the time dimension is not a value in the row; it is a column name. We really need each quarter’s forecast on a row of its own. This will allow us to create a cube to compare the forecasts with the actual revenue. Fortunately, we can use an Unpivot transform to transpose these repeating columns onto rows of their own. In our case, one row from the spreadsheet will become four rows after the Unpivot transform.
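The Unpivot transform does this reshaping inside the data flow, but if you find it easier to think in SQL, the equivalent T-SQL shows the shape of the output we are after. This is a sketch with an assumed staging table and column names, purely for illustration:

-- One spreadsheet row (ProductCode, Q1..Q4) becomes four rows, each with
-- a Quarter label and a Units value
SELECT ProductCode,
       Quarter,   -- takes the values 'Q1', 'Q2', 'Q3', 'Q4'
       Units
FROM dbo.BudgetStaging
UNPIVOT (Units FOR Quarter IN (Q1, Q2, Q3, Q4)) AS u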

QUICK START: Working with the Unpivot Transformation

To use the Unpivot transform, you need to specify which columns should simply be repeated on each output row (passed through) and which columns are the repeating columns that should each generate a new row. You also need to provide a value for a new column that denotes which original column each new row represents:

  1. Drag an Unpivot transform onto the data flow pane, below the Excel source.

  2. Connect the Excel Source to the Unpivot transform.

  3. Double-click the Unpivot transform to open the editor to set the configuration.

  4. Check Q1, Q2, Q3, and Q4 for the columns to unpivot. These names will also be used in the Pivot Values, which will be an additional column used to identify which column a new row represents. Think of the names as an additional component in the compound key for this row. We change them later in Step 7.

  5. Uncheck the Description column from the Pass Through check box. We don’t need the description, because we have the product code. It was just there to help the users understand which product they were budgeting for. Columns Q1 to Q4 are automatically passed through to the output stream, on separate rows.

  6. Set the Destination Column for each row to Units. This is the name of the new column that the value in Q1 to Q4 will be placed in. You could use different names for each column, but that isn’t appropriate for normalizing this spreadsheet.

  7. Set the Pivot Value column to 1, 2, 3, and 4, corresponding to Q1 to Q4.

  8. Set the Pivot Value Column Name to Quarter. This column will take on the Pivot Values corresponding to columns Q1 to Q4 we set in Step 7, based on the column that is on the current row.

  9. Click OK to save the configuration.

In the Unpivot transformation, we created a new column for Quarter. Whenever a transform creates a new column, check that the column has a data type compatible with the column it is mapped to in the next step in the data flow. You can check and set the data type for a new column created in a transform by using the Input and Output Properties tab on the Advanced Editor for that transform. Start the Advanced Editor by right-clicking the transform and choosing Show Advanced Editor. The Unpivot transform configuration should look like Figure 4-11.

Figure 4-11. Transposing column values to rows

image

We have a couple of small tasks remaining before we can save the data into the warehouse. We need to use a Lookup transform to find the Territory surrogate key for the region in the budget, and another Lookup transform to find the product record to give us the product surrogate key, unit price, and unit cost. Finally, we want to compute the total price and cost for this product for this region and quarter.

Adding New Columns for Total Cost and Price

To create the two new columns for the total cost and total price, we’ll use a Derived Column transform. This transform enables you to specify a name for the new column, define an expression for its value, and set its data type. Drag a Derived Column transform onto the data flow panel and give it a name such as Compute cost and price.

Double-click the Derived Column transform to open it for editing. Enter a new column name for the derived column, such as TotalCost.

Provide the formula for the column. In our example, drag Budget Units from the available columns into the Expression field, enter an asterisk (*), and then drag in Product Cost. Check that the data type is correct. It should be Currency. You can create multiple new columns in one Derived Column transform, so just repeat this procedure for BudgetRevenue. Figure 4-12 shows how you would configure a derived column transform to create the new columns.

Figure 4-12. Adding new columns with a Derived Column transform

image

Saving the Budget into the Data Warehouse

The output of the Compute cost and price transform is now ready to be saved to the data warehouse. This output contains all the columns we’ve defined. You did something similar when you saved the SalesTerritory dimension into the data warehouse. All you need to do is drop an OLE DB Destination onto the data flow, drag the green arrow from the Derived Column transform onto the destination, and map the columns from the stream onto the destination table. Your data flow for budget loading should now look like Figure 4-13.

Figure 4-13. Final budget data flow

image

Loading Multiple Sets of Data

In our example, we have individual budget spreadsheets submitted from each region, so there will be many spreadsheets, and we don’t want to have to change our package to specify the name of each spreadsheet. You can use the For Each Loop Container task to enumerate the filenames in a directory that match a specific pattern. If we establish a convention that all our budgets will have filenames that start with Budget and have .XLS extensions, and we put them in a fixed directory, we can easily load every budget regardless of any other distinguishing parts of the name. For example, the name for the Canadian budget for fiscal year 2006 would be Budget-CAD-2006.XLS. This fits the pattern and distinguishes it from the Southwest budget.

Using Variables in a Package

The For Each loop gives us back a list of filenames; but how do we work with each filename, one at a time? We need a way to take one filename off the list and pass it to a data source so that it works on a file from the list, not the one we coded into the package at design time. This is what variables are for.

Variables are objects you can store values in for use in calculations and expressions, and for passing information between tasks and other packages. Variables have scope, which defines how widely known the variable will be. Package scope means that the variable will be known to every task in the package, and also inside every data flow invoked by the package. If, when you create a variable, the Data Flow tab is selected rather than the Control Flow tab, the variable will have the scope of only that Data Flow task and will not be visible to the For Each Loop task. This prevents accidental name collisions, but it can also cause you a bit of confusion when you try to find a variable that you accidentally created with the wrong scope.

QUICK START: Processing a Set of Files

  1. Go to the Control Flow tab, select Variables from the Integration Services menu to show the Variables window, and click the Add Variable icon. Give the variable a name, such as BudgetFileName. Select String as the data type.

  2. You want to specify an initial Value for this variable that points to a valid Excel spreadsheet so that you can work with the data flow in design mode. Without a value, the Excel data source will complain; so, specify a full path such as C:\Budget-CAD-2006.xls.

  3. Drag and drop a new For Each Loop Container onto the Control Flow tab, and double-click it to configure the properties.

  4. On the Collection section, specify the folder name that contains your budget spreadsheets and the filenames to search for (such as Budget*.xls).

  5. On the Variable Mappings section, select the variable created in Step 1 (BudgetFileName) from the list and specify 0 as the Index to map. As the For Each loop iterates through the filenames, each will be assigned one at a time to the variable. Click OK.

  6. Drag your existing data flow task on top of the For Each Loop Container. If the Data Flow task has a preceding task, such as one to empty the table, you must disconnect the two tasks and make the connection to the For Each Loop Container. This will let Integration Services know that the data flow needs to be executed for every file that is retrieved in the loop.

  7. You need to configure the Excel connection to pick up the filename from the variable. Right-click the Excel Connection Manager in the Connection Managers area at the bottom of the package designer and open the Properties window. Click the ellipsis button (...) next to the Expressions property.

  8. Select the ExcelFilePath property and drag the variable BudgetFileName into the expression area. Click OK.

Your control flow for loading the budgets should now look like what is shown in Figure 4-14.

Figure 4-14. Control flow for iterating budget files

image

Managing the Solution

During development, you should be taking periodic checkpoints of your development so that you have a point to go back to should something go seriously wrong with the design or coding of one of your packages. When your packages are through the development stage, you will want to deploy them to the test and production environments and be able to configure them appropriately for that environment.

Protecting Your Source Code

The packages you create are XML files, and you can consider them to be source code for Integration Services packages. Throughout the development cycle, you will reach stages where a component is running well, perhaps not perfectly, but it is worth saving as a checkpoint before you continue to modify it. We highly recommend you use some form of source control, such as Visual SourceSafe (VSS), to save copies of your source files at significant checkpoints. VSS 2005 integrates with BI Development Studio to perform automatic checkout of any package you modify to prevent others from overwriting your work, as well as saving a historical copy of what you started with.

Deployment

When you were developing your packages, you were executing them from your client in BI Development Studio. This was great for debugging, but now it’s time to deploy the packages to a location more appropriate for a production environment, or at least to a test environment. You can deploy your packages to SQL Server or to a file system managed through SQL Server Management Studio. SQL Server is preferred if you are going to work in a clustered server environment.

Before you deploy your packages, you will want to change some properties of the packages, such as the connection managers, so that they point to the right servers.

Changing Settings for a Production Environment

Packages include a lot of information that may differ between development and production environments, such as the names of servers, database names, and input file paths. In Integration Services, you can specify which properties should be configured at runtime and where to obtain the values for those properties. The values can be stored in an environment variable, an XML file, a table in SQL Server, or some combination of the three.

You specify a configuration file for a package by selecting Package Configurations from the SSIS menu. The dialog includes a wizard that walks you through the various options, such as selecting which properties of your package you would like to read from the configuration file at runtime. You can reuse the same configuration for multiple packages in your solution, meaning that a value for a property such as a ServerName can be specified in one place, but applied to all packages launched from the same location.

A common scenario is to store configuration settings in a table in SQL Server. A connection manager is used to specify which server to use, but one of our goals here is to be able to pick up different settings depending on the environment—how do we change this connection manager to point to another server? This is where an environment variable or an XML file in a fixed location is used. An environment variable proves very useful if you cannot guarantee that the XML file will be in the same location in each environment. Here is one way to set this up.

First, you need to add a new connection manager to each package you want to configure. Use a consistent name for the connection manager throughout your solutions. We use one named ETL_Operations. It should point to a SQL Server database that will contain tables used for managing your ETL tools, auditing, and operational statistics, not one used for the data warehouse. We’ll create a new database named the same as the connection manager, ETL_Operations.

Next, we’ll create an XML configuration file that contains a connection string specifying where the ETL_Operations connection manager should point. Using the Configuration Wizard, choose a configuration type of XML configuration file and specify a fully qualified file name to contain the configuration settings. This file will reside in the same location on every server where you want to use this configuration (for example, development and production environments). We used C:\SSISConfigurations\Manufacturing.dtsConfig. Figure 4-15 illustrates what this looks like in the wizard.

Figure 4-15. Setting the configuration pointer

image

Click Next. Now you can specify the properties of the ETL_Operations connection manager that will change as you move through different environments. In our example, it’s just the server name. We consistently use the same database name and Windows authentication to keep things simple. Expand the Connection Managers node, and the ETL_Operations properties, and then check the ServerName property, as shown in Figure 4-16.

Figure 4-16. Specifying a connection property to be configured at runtime

image

Click Next. Give the configuration a name. We’ll use ConfigurationReference. Click Finish to save the specifications into the package. The configuration now appears in the Package Organizer.

Now you can add the other properties you want to configure to the configuration table in SQL Server referenced by the ConfigurationReference configuration file we just created. Click Add to add another configuration, but this time choose a type of SQL Server. For the Connection, choose the ETL_Operations connection manager we created just for this purpose. We don’t have a table yet, so click New. This will create a table called [SSIS Configurations]. We’ll use this table from now on for all our configurations stored in SQL Server. For the Configuration filter setting, choose a name related to the application, not the environment. We’ll use Manufacturing. This will distinguish our configurations from others that could be set up independently for other applications (see Figure 4-17).

Figure 4-17. Defining configurations stored in SQL Server

image

Now, just as you did for the ETL_Operations connection manager, select the ServerName property from all the connection managers you want to reconfigure at runtime. Click Next and provide a name for the configuration, such as Connections. You should now see two configurations in the Package Configuration Organizer. Leave them in this order, with the ConfigurationReference configuration before the Connections configuration, as shown in Figure 4-18.

Figure 4-18. Configuration Organizer

image

Reusing the Configuration Settings in Other Packages

Now that you’ve got the configurations set up for one package, you can easily reuse them in other packages that use the same connection managers. When you start the configuration wizard for a package, begin as you did for the first package by specifying the XML configuration file. Provide the same filename, and you will be prompted to either reuse the existing settings or to overwrite them. Choose to reuse existing settings. Do the same for the SQL Server configurations, specifying the same connection manager, table (choose from the drop-down list), and filter.

It is important to note that to reuse existing configurations, you must use the same names for the connection managers, including case, and you must have the same number of connection managers. If you open the configuration for a package that is missing a connection manager (even if it doesn’t use it), that connection manager will be dropped from the configuration.

Using the Configurations in Multiple Environments

We have defined the configurations for one environment, but our goal was to be able to use them in several environments. This is easy to accomplish. First, copy the XML file to the new environment, putting it in exactly the same drive and directory. Edit the file with Notepad and change the server name to the new server name. Then use the Copy Database Wizard in SQL Server Management Studio to copy the ETL_Operations database to the new server. Open the [SSIS Configurations] table and change the ConfiguredValue column for each of your connection managers to point to the new server(s).
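For example, after copying the database, you might repoint the configured connections by updating the configuration rows directly. The server names in this sketch are placeholders for your own development and production servers:

-- Repoint configured connections at the production server
UPDATE dbo.[SSIS Configurations]
SET ConfiguredValue = 'PRODSQL01'       -- new (production) server
WHERE ConfigurationFilter = 'Manufacturing'
  AND ConfiguredValue = 'DEVSQL01'      -- old (development) server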

Deploying Packages to a New Server

After you have set up the package configuration, the next step is to move the packages to the targeted environment.

Deploying a Single Package

You can move the packages one at a time, as they are completed, using the File/Save copy as menu option. Choose SQL Server as the location and specify which SQL Server instance you want to use to store the package. At the Package path: prompt, click the browse (...) button; under Stored Packages, select the folder where you want to save the package (usually MSDB, but you can create your own), and type in the name you want to store the package under, without the .dtsx extension. Click OK to set the location and name, and OK again to complete the save.

Deploying All Packages in a Project

You can also move all the packages in a project to SQL Server or to the file system on a server by using the manifest and the Deployment Wizard. The manifest is a file defining what is to be deployed. You need to enable a project setting to create a deployment manifest. This is done through the Project/Properties menu. On the Deployment Utility node of the Configuration Properties, set the CreateDeploymentUtility property to True. Doing so causes the manifest to be created when you build the project. To launch the Deployment Wizard, double-click the manifest, which is located in the bin\Deployment directory of your project.

Creating Integration Services Subfolders on SQL Server

Using SQL Server Management Studio, you can organize your packages stored on SQL Server by connecting to an Integration Services service. (Note that you do not specify an instance name for Integration Services, just the server name.) You will see two nodes in the tree: Running Packages and Stored Packages. Open the Stored Packages node. If you have many packages, it is convenient to create subfolders under the MSDB folder. To create a subfolder, just right-click the parent folder and choose New folder. When you deploy a package, or choose File/Save copy as, you can specify the path to the subfolder you want to store it in.

Security

Package security is managed by specific roles for Integration Services created and managed by the SQL Server engine. Unless a user is a member of one of these three roles, that user will not be able to work with packages stored in SQL Server, or even store a package there. The three roles are as follows:

  • db_dtsadmin. Can add, remove, and execute any package.

  • db_dtsltduser. Can add, remove, and execute its own packages.

  • db_dtsoperator. Can view and execute any package directly or as a job.

You assign Windows groups or users to these roles through the Security node of the msdb database in SQL Server Management Studio.
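For example, assuming a couple of hypothetical Windows groups that already exist as users in the msdb database, you could grant membership like this:

USE msdb
GO
-- The ETL developers can store, modify, and run their own packages
EXEC sp_addrolemember N'db_dtsltduser', N'DOMAIN\ETL Developers'
-- The operations team can view and run any package
EXEC sp_addrolemember N'db_dtsoperator', N'DOMAIN\Operations'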

You can also change the roles that can run or replace a package. In SQL Server Management Studio, right-click a package and choose Package roles. You can set each of the read and write permissions to any SQL role defined for the msdb database.

Maintenance

You can also use the deployment utility to create a new installation routine after you have made changes to packages that were already installed; the new deployment will replace the existing packages in production. Because you probably don’t want to overwrite any configuration changes that were made in production, you should set the AllowConfigurationChanges project property to False.

Operations

After you’ve completed your packages and they’ve been promoted to production, you are in a completely different environment from BI Development Studio. Your connection managers need to know where production servers are, you’ll want to log progress information if a package fails, and you may want to be able to restart from a known checkpoint.

Running Packages in Production

After the packages have been deployed on the production server, you can use the Package Execution utility (available from the Integration Services folder on the Start menu) to select a package, set options and configurations, assign variable values, set connection properties, and, of course, run the package. You can also right-click a package and choose Run to execute a package. Both of these methods execute the package on the machine you are working on, not the server containing the packages.

Instead of running packages manually, in a production environment it is usually better if the packages are automatically executed at a certain time or after some event, such as every Sunday night at 10 p.m., with someone being notified only if they don’t succeed. You can use the SQL Server Agent service introduced in Chapter 2, “Introduction to SQL Server 2005,” to schedule packages; with the Agent, you can specify package logging options, select which configuration file to use, and notify an operator via e-mail or other device if the job fails.
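As a rough sketch, here is what such a job might look like if you script it in T-SQL rather than using the SQL Server Agent dialogs. The job name, package path, and server are assumptions for illustration:

USE msdb
GO
EXEC sp_add_job @job_name = N'Load Manufacturing DW'
EXEC sp_add_jobstep @job_name = N'Load Manufacturing DW',
    @step_name = N'Run data warehouse load package',
    @subsystem = N'SSIS',
    @command = N'/SQL "\Manufacturing\LoadDataWarehouse" /SERVER "(local)"'
EXEC sp_add_jobschedule @job_name = N'Load Manufacturing DW',
    @name = N'Sunday 10 pm',
    @freq_type = 8,                 -- weekly
    @freq_interval = 1,             -- on Sunday
    @freq_recurrence_factor = 1,    -- every week
    @active_start_time = 220000     -- 10:00 p.m.
EXEC sp_add_jobserver @job_name = N'Load Manufacturing DW'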

Execution Location of Packages

Packages run through jobs started by the SQL Agent are executed on the server, as you would expect. You can run packages from a client machine using SQL Server Management Studio, but you need to be aware that the package will run on the client, not on the server that displays in the Object Explorer. If you use environment variables or XML files to set runtime configurations of packages, you must create those variables and XML files on the client, too.

Packages running on the client may increase the network traffic, or may execute slower than on the server, depending on the capacity of the client. Some packages may fail if the Integration Services runtime isn’t installed on the client.

Running Packages within a Transaction

Many packages contain multiple steps that need to succeed or fail as a single unit. For example, in the Sales Territory dimension package that we created earlier, if the truncate step succeeds but the import from the source database fails for some reason (such as the source database not being available), we would be left with an empty Sales Territory table in the data warehouse.

Integration Services deals with this issue by allowing you to set up the whole package to run in a transaction so that it succeeds or fails as a unit, or to explicitly select which tasks need to share the same transaction. If you have packages that call other packages, they can also share the same transaction. If any task that is part of the transaction fails, all the preceding tasks in the transaction will be rolled back.

To set up a transaction for a package, right-click anywhere in the Control Flow view and select Properties. Change the TransactionOption property from Supported, which means that if the package is called from another package, it participates in the other package’s transaction, to Required, which creates a new transaction for this package if one doesn’t already exist. All the tasks that need to participate in the package’s transaction also need to have their individual TransactionOption property set to Supported, which is the default setting for tasks.

Building Restartable Packages

Some packages include really long-running tasks such as downloading a file from an FTP server. If a subsequent task fails for some reason, such as loading the data into a table when the database server is not available, you might prefer not to have to run the whole package again, but only the steps after the download.

Integration Services uses a concept called checkpoints to support this. When checkpoints are enabled for a package, a checkpoint file is created while the package is executing that keeps track of the last step that successfully executed. When a step fails and you have to restart the package, Integration Services can use the checkpoint file to restart from the point of failure rather than from the beginning.

To enable checkpoints, right-click anywhere in the Control Flow view and select Properties. Change the SaveCheckpoints property to True, specify a CheckpointFileName, and set the CheckpointUsage property to IfExists to make the package restart from the checkpoint if a checkpoint file exists.

TIP:

Designing Packages for Restartability

A package can only be restarted from the task level; so if, for example, you have multiple steps within a single Data Flow task, you can only restart the package at the beginning of the data flow. If you split up complex Data Flow tasks into multiple simpler tasks, the package can be restarted at a more granular level.

Logging Package Information

It often proves useful for operations staff to have access to a detailed execution log in case a package fails and they need to discover the cause. You can set up logging for Integration Services packages either in the Visual Studio designer by selecting Logging from the Integration Services menu or when you execute the package using the Package Execution utility or SQL Server Agent.

Logs can be written to multiple destinations, including a simple text file, a SQL Server trace file for integrated debugging of the database and the Integration Services package, the Windows event log, or a SQL Server table. You can also specify the level of detail to capture, so the log doesn't grow larger than necessary. We recommend logging at least these events: OnError, OnPostExecute, OnPreValidate, and OnTaskFailed.
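
If you choose the SQL Server log provider, events are written to a table in whichever database the provider's connection manager points at (dbo.sysdtslog90 in SQL Server 2005). A query along the following lines, assuming the default table name and msdb as the destination database, lets operations staff review recent failures:

-- Review errors captured by the SQL Server log provider over the last week.
USE msdb;
GO

SELECT  starttime,
        source,          -- the package or task that raised the event
        event,           -- OnError, OnTaskFailed, and so on
        message
FROM    dbo.sysdtslog90
WHERE   event IN ('OnError', 'OnTaskFailed')
  AND   starttime >= DATEADD(day, -7, GETDATE())
ORDER BY starttime DESC;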

Next Steps

In this introduction to Integration Services, you’ve learned how to copy and transform data in a number of different ways. Integration Services has many additional aspects, and we want to draw your attention to a few of them.

Data Quality

For purposes of clarity in introducing Integration Services, we have assumed in this chapter a rather ideal world where there are no errors in the data. This is rarely the case, and in Chapter 7 we discuss techniques for detecting and correcting bad data. All the tasks and transforms we used have an error output you can use to divert the process or data flow in case of an error.

Scaling Your Solution

The techniques shown in this chapter work well for moderate to high volumes of data. If you have a very large database, or very high rates of new data, you should read Chapter 11, “Very Large Data Warehouses,” which addresses design and management issues relating to Integration Services packages, the data warehouse, and the on-line analytical processing (OLAP) databases.

Integration Services packages are typically stored in the msdb database in SQL Server. This means that if that SQL Server instance is clustered, the packages are accessible even if a cluster node fails. Integration Services itself is not cluster aware and doesn't really need to be, because it is really just a service available on any machine it is installed on. In a cluster, you can only store Integration Services packages on one instance of SQL Server. If you are using Integration Services packages stored on a clustered instance of SQL Server, be sure to use the virtual server name, not the physical server name, when referencing a package.

Other Transformations

Our goal in this chapter was to get you familiar with the structure of Integration Services packages, not necessarily to explain each transform or task. A number of other transformations are not covered in this chapter. One set of transformations controls how rows flow between source and destination, such as Union All to merge multiple data sets with the same set of columns, and Merge Join, which allows you to perform database-style joins between two data streams. You can also sort data using the Sort transformation, or use the Aggregate transformation to perform functions such as sum or count.

Another set of transformations allows you to transform column values, such as converting data types using Data Conversion or performing string manipulations using the Character Map transformation. There are also transformations for working with Analysis Services objects, such as the Dimension Processing and Data Mining Model Training transformations.

The Script transform is extremely useful for creating complex transformations because you can write .NET code directly within the data flow.

Finally, one important transformation called the Slowly Changing Dimension transformation is covered in detail in Chapter 8.

Control Flow Tasks

We have focused our attention mostly on the Data Flow task in this chapter because it is the most commonly used task. Other tasks include data preparation tasks such as the File System and FTP tasks for copying or downloading files, the Message Queue task for working with MSMQ, and the Bulk Insert task for rapidly loading data from flat files.

You can also add a great deal of flexibility to your packages by writing custom .NET code using a Script task, executing other packages or Windows applications using the Execute Package or Execute Process tasks, or even calling Web services to perform a function or return data using the Web Service task.
