Chapter 11. Warehousing Software

A warehousing team will require several different types of tools during the course of a warehousing project. These software products generally fall into one or more of the categories illustrated in Figure 11-1 and described below.

  • Extraction and transformation. . As part of the data extraction and transformation process, the warehouse team requires tools that can extract, transform, integrate, clean, and load data from source systems into one or more data warehouse databases. Middleware and gateway products may be required for warehouses that extract data from host-based source systems.

  • Warehouse storage. . Software products are also required to store warehouse data and their accompanying metadata. Relational database management systems in particular are well suited to large and growing warehouses.

  • Data access and retrieval. . Different types of software are required to access, retrieve, distribute, and present warehouse data to its end users.

Tool examples listed throughout this chapter are provided for reference purposes only and are by no means an attempt to provide a complete list of vendors and tools. The tools are listed in alphabetical order by company name; the sequence does not imply any form of ranking.

Data Warehouse Software Components

Figure 11-1. Data Warehouse Software Components

Also, many of the sample tools listed automate more than one aspect of the warehouse back-end process. Thus, a tool listed in the extraction category may also have features that fit into the transformation or data quality categories.

Middleware and Connectivity Tools

Connectivity tools provide transparent access to source systems in heterogeneous computing environments. Such tools are expensive but quite often prove to be invaluable because they provide transparent access to databases of different types, residing on different platforms.

Examples of commercial middleware and connectivity tools include:

  • IBM. . DataJoiner

  • Oracle. . Transparent Gateway

  • SAS. . SAS/Connect

  • Sybase. . Enterprise Connect

Extraction Tools

There are now quite a number of extraction tools available, making tool selection a potentially complicated process.

Tool Selection

Warehouse teams have many options when it comes to extraction tools. In general, the choice of tool depends greatly on the following factors:

  • The source system platform and database. . Extraction and transformation tools cannot access all types of data sources on all types of computing platforms. Unless the team is willing to invest in middleware, the tool options are limited to those that can work with the enterprise's source systems.

  • Built-in extraction or duplication functionality. . The source systems may have built-in extraction or duplication features, either through application code or through database technology. The availability of these built-in tools may help reduce the technical difficulties inherent in the data extraction process.

  • The batch windows of the operational systems. . Some extraction mechanisms are faster or more efficient than others. The batch windows of the operational systems determine the available time frame for the extraction process and therefore may limit the team to a certain set of tools or extraction techniques.

The enterprise may opt to use simple custom-programmed extraction scripts for open, homogeneous computing environments; although without a disciplined approach to documentation, such an approach may create an extraction system that is difficult to maintain. Sophisticated extraction tools are a better choice for source systems in proprietary, heterogeneous environments, although these tools are quite expensive.

Extraction Methods

There are two primary methods for extracting data from source systems (see Figure 11-2):

  • Bulk extractions. . The entire data warehouse is refreshed periodically by extractions from the source systems. All applicable data are extracted from the source systems for loading into the warehouse. This approach heavily taxes the network connection between source and target databases, but such warehouses are easier to set up and maintain.

  • Change-based replication. . Only data that have been newly inserted or updated in the source systems are extracted and loaded into the warehouse. This approach places less stress on the network (due to the smaller volume of data to be transported) but requires more complex programming to determine when a new warehouse record must be inserted or when an existing warehouse record must be updated.

    Extraction Options

    Figure 11-02. Extraction Options

Examples of extraction tools include:

  • Apertus Carleton. . Passport

  • Evolutionary Technologies. . ETI Extract

  • Platinum. . InfoPump

Transformation Tools

Transformation tools are aptly named; they transform extracted data into the appropriate format, data structure, and values that are required by the data warehouse.

Most transformation tools provide the features illustrated in Figure 11-3 and described below.

  • Field splitting and consolidation. . Several logical data items may be implemented as a single physical field in the source systems, resulting in the need to split up a single source field into more than one target warehouse field. At the same time, there will be many instances when several source system fields must be consolidated and stored in one single warehouse field. This is especially true when the same field can be found in more than one source system.

  • Standardization. . Standards and conventions for abbreviations, date formats, data types, character formats, etc., are applied to individual data items to improve uniformity in both format and content. Different naming conventions for different warehouse object types are also defined and implemented as part of the transformation process.

  • Deduplication. . Rules are defined to identify duplicate stores of customers or products. In many cases, the lack of data makes it difficult to determine whether two records actually refer to the same customer or product. When a duplicate is identified, two or more records are merged to form one warehouse record. Potential duplicates can be identified and logged for further verification.

    Data Transformations

    Figure 11-03. Data Transformations

Warehouse load images (i.e., records to be loaded into the warehouse) are created toward the end of the transformation process. Depending on the team's key generation approach, these load images may or may not yet have warehouse keys.

Examples of transformation tools include the following:

  • Apertus Carleton. . Enterprise/Integrator

  • Data Mirror. . Transformation Server

  • Informatica. . PowerMart Designer

Data Quality Tools

Data quality tools assist warehousing teams with the task of locating and correcting data errors that exist in the source system or in the data warehouse. Experience has shown that easily up to 15 percent of the raw data extracted from operational systems are inconsistent or incorrect. A higher percentage of data are likely to be in the wrong format.

Variations in naming conventions, abbreviations, and formats result in inconsistencies that increase the difficulty of locating duplicate records. For example, "14/F," "14th Floor," and "14th Flr." all mean the same thing to operational staff but may not be recognized as equivalent during the warehouse load.

Erroneous spelling of names, addresses, etc., due to homonyms likewise cause inconsistencies. Updates (e.g., change of address) in one system that are not propagated to other source systems also cause data quality problems.

Data quality tools can help identify and correct data errors, ideally at the source systems. If corrections at the source are not possible, data quality tools can also be used on the warehouse load images or on the warehouse data itself. However, this practice will introduce inconsistencies between the source systems and the warehouse data; the warehouse team may inadvertently create data synchronization problems.

It is interesting to note that while dirty data continue to be one of the biggest issues for data warehousing initiatives, research indicates that data quality investments consistently receive but a small percentage of total warehouse spending.

Examples of data quality tools include the following:

  • DataFlux. . Data Quality Workbench

  • Pine Cone Systems. . Content Tracker

  • Prism. . Quality Manager

  • Vality Technology. . Integrity Data Reengineering

Data Loaders

Data loaders load transformed data (i.e., load images) into the data warehouse. If load images are available on the same RDBMS engine as the warehouse, then stored procedures can be used to handle the warehouse loading.

If the load images do not yet have warehouse keys, then data loaders must generate the appropriate warehouse keys as part of the load process.

Database Management Systems

A database management system is required to store the cleansed and integrated data for easy retrieval by business users. Two flavors of database management systems are currently popular: relational databases and Multidimensional databases.

Relational Database Management Systems (RDBMS)

All major relational database vendors have already announced the availability or upcoming availability of data warehousing related features in their products. These features aim to make the respective RDBMSes particularly suitable to very large database (VLDB) implementations. Examples of such features are bit-mapped indexes and parallel query capabilities.

Examples of these products include

  • IBM. . DB2

  • Informix. . Informix RDBMS

  • Microsoft. . SQL Server

  • Oracle. . Oracle RDBMS

  • Red Brick Systems. . Red Brick Warehouse

  • Sybase. . RDBMS Engine—System 11

Multidimensional Databases (MDDBs)

Multidimensional database engines store data in hypercubes, i.e., pages of numbers that are paged in and out of memory on an as-needed basis, depending on the scope and type of query. This approach is in contrast to the use of tables and fields in relational databases.

Different MDDB engines have different limitations as to the number of dimensions and variables (facts) that can be stored. As a result, most MDDB engines have maximum database sizes below 100 gigabytes. New versions of these products, however, continuously push the limits further back by increasing the number of dimensions supported, as well as the corresponding storage capacity.

Examples of these products include:

  • Arbor. . Essbase

  • BrioQuery. . Enterprise

  • Dimensional Insight. . DI-Diver

  • Oracle. . Express Server

Convergence of RDBMSes and MDDBs

Many relational database vendors have announced plans to integrate multidimensional capabilities into their RDBMSes. This integration will be achieved by caching SQL query results on a multidimensional hypercube on the database. Such Database OLAP technology (sometimes referred to as DOLAP) aims to provide warehousing teams with the best of both OLAP worlds.

Metadata Repository

Although there is a current lack of metadata repository standards, there is a consensus that the metadata repository should support the documentation of source system data structures, transformation business rules, the extraction and transformation programs that move the data, and data structure definitions of the warehouse or data marts. In addition, the metadata repository should also support aggregate navigation, query statistics collection, and end-user help for warehouse contents.

Metadata repository products are also referred to as information catalogs and business information directories. Examples of metadata repositories include:

  • Apertus Carleton. . Warehouse Control Center

  • Informatica. . PowerMart Repository

  • Intellidex. . Warehouse Control Center

  • Prism. . Prism Warehouse Directory

Data Access and Retrieval Tools

Data warehouse users derive and obtain information through these types of tools. Data access and retrieval tools are currently classified into the subcategories below.

Online Analytical Processing (OLAP) Tools

OLAP tools allow users to make ad hoc queries or generate canned queries against the warehouse database. The OLAP category has since divided further into the multidimensional OLAP (MOLAP) and relational OLAP (ROLAP) markets.

MOLAP products run against a multidimensional database (MDDB). These products provide exceptional responses to queries and typically have additional functionality or features, such as budgeting and forecasting capabilities. Some of the tools also have built-in statistical functions. MOLAP tools are better suited to power users in the enterprise.

ROLAP products, in contrast, run directly against warehouses in relational databases (RDBMS). While the products provide slower response times than their MOLAP counterparts, ROLAP products are simpler and easier to use and are therefore suitable to the typical warehouse user. Also, since ROLAP products run directly against relational databases, they can be used directly with large enterprise warehouses.

Examples of OLAP tools include:

  • Arbor Software. . Essbase OLAP

  • Cognos. . Powerplay

  • Intranet Business Systems. . R/olapXL

Reporting Tools

These tools allow users to produce canned, graphic-intensive, sophisticated reports based on the warehouse data. There are two main classifications of reporting tools: report writers and report servers.

Report writers allow users to create parameterized reports that can be run by users on an as-needed basis. These typically require some initial programming to create the report template. Once the template has been defined, however, generating a report can be as easy as clicking a button or two.

Report servers are similar to report writers but have additional capabilities that allow their users to schedule when a report is to be run. This feature is particularly helpful if the warehouse team prefers to schedule report generation processing during the night, after a successful warehouse load. By scheduling the report run for the evening, the warehouse team effectively removes some of the processing from the daytime, leaving the warehouse free for ad hoc queries from online users. Some report servers also come with automated report distribution capabilities. For example, a report server can e-mail a newly generated report to a specified user or generate a web page that users can access on the enterprise intranet. Report servers can also store copies of reports for easy retrieval by users over a network on an as-needed basis.

Examples of reporting tools include:

  • IQ Software. . IQ/SmartServer

  • Seagate Software. . Crystal Reports

Executive Information Systems (EIS)

EIS systems and other Decision Support Systems (DSS) are packaged applications that run against warehouse data. These provide different executive reporting features, including "what if" or scenario-based analysis capabilities and support for the enterprise budgeting process.

Examples of these tools include:

  • Comshare. . Decision

  • Oracle. . Oracle Financial Analyzer

While there are packages that provide decisional reporting capabilities, there are EIS and DSS development tools that enable the rapid development and maintenance of custom-made decisional systems.

Examples include:

  • Microstrategy. . DSS Executive

  • Oracle. . Express Objects

Data Mining

Data mining tools search for inconspicuous patterns in transaction-grained data to shed new light on the operations of the enterprise. Different data mining products support different data mining algorithms or techniques (e.g., market basket analysis, clustering), and the selection of a data mining tool is often influenced by the number and type of algorithms supported.

Regardless of the mining techniques, however, the objectives of these tools remain the same: crunching through large volumes of data to identify actionable patterns that would otherwise have remained undetected.

Data mining tools work best with transaction-grained data. For this reason, the deployment of data mining tools may result in a dramatic increase in warehouse size. Due to disk costs, the warehousing team may find itself having to make the painful compromise of storing transaction-grained data for only a subset of its customers. Other teams may compromise by storing transaction-grained data for a short time on a first-in-first-out basis (e.g., transactions for all customers, but for the last six months only).

One last important note about data mining: Since these tools infer relationships and patterns in warehouse data, a clean data warehouse will always produce better results than a dirty warehouse. Dirty data may mislead both the data mining tools and their users by producing erroneous conclusions.

Examples of data mining products include:

  • ANGOSS. . KnowledgeSTUDIO

  • Data Distilleries. . Data Surveyor

  • HyperParallel. . //Discovery

  • IBM. . Intelligent Miner

  • Integral Solutions. . Clementine

  • Magnify. . PATTERN

  • NeoVista Software. . Decision Series

  • Syllogic. . Syllogic Data Mining Tool

Exception Reporting and Alert Systems

These systems highlight or call an end-user's attention to data or a set of conditions about data that are defined as exceptions. An enterprise typically implements three types of alerts:

  • Operational alerts from individual operational systems. . These have long been used in OLTP applications and are typically used to highlight exceptions relating to transactions in the operational system. However, these types of alerts are limited by the data scope of the OLTP application concerned.

  • Operational alerts from the Operational Data Store. . These alerts require integrated operational data and therefore are possible only on the Operational Data Store. For example, a bank branch manager may wish to be alerted when a bank customer who has missed a loan payment has made a large withdrawal from his deposit account.

  • Decisional alerts from the data warehouse. . These alerts require comparisons with historical values and therefore are possible only on the data warehouse. For example, a sales manager may wish to be alerted when the sales for the current month are found to be at least 8 percent less than sales for the same month last year.

Products that can be used as exception reporting or alert systems include:

  • Compulogic. . Dynamic Query Messenger

  • Pine Cone Systems. . Activator Module (Content Tracker)

Web-Enabled Products

Front-end tools belonging to the above categories have gradually been adding web-publishing features. This development is spurred by the growing interest in intranet technology as a cost-effective alternative for sharing and delivering information within the enterprise.

Data Modeling Tools

Data modeling tools allow users to prepare and maintain an information model of both the source database and the target database. Some of these tools also generate the data structures based on the models that are stored or are able to create models by reverse engineering existing databases. IT organizations that have enterprise data models will quite likely have documented these models using a data modeling tool. While these tools are nice to have, they are not a prerequisite for a successful data warehouse project.

As an aside, some enterprises make the mistake of adding the enterprise data model to the list of data warehouse planning deliverables. While an enterprise data model is helpful to warehousing, particularly during the source system audit, it is definitely not a prerequisite of the warehousing project. Making the enterprise model a prerequisite or a deliverable of the project will only serve to divert the team's attention from building a warehouse to documenting what data currently exists.

Examples include:

  • Cayenne Software. . Terrain

  • Relational Matters. . Syntagma Designer

  • Sybase. . PowerDesigner WarehouseArchitect

Warehouse Management Tools

These tools assist warehouse administrators in the day-to-day management and administration of the warehouse. Different warehouse management tools support or automate different aspects of the warehouse administration and management tasks.

For example, some tools focus on the load process and therefore track the load histories of the warehouse. Other tools track the types of queries that users direct to the warehouse and identify which data are not used and therefore are candidates for removal.

Examples include:

  • Pine Cone Systems. . Usage Tracker, Refreshment Tracker

  • Red Brick Systems. . Enterprise Control and Coordination

Source Systems

Data warehouses would not be possible without source systems, i.e., the operational systems of the enterprise that serve as the primary source of warehouse data. Although strictly speaking, the source systems are not data warehousing software products, they do influence the selection of these tools or products.

The computing environments of the source systems generally determine the complexity of extracting operational data. As can be expected, heterogeneous computing environments increase the difficulties that a data warehouse team may encounter with data extraction and transformation.

Application packages (e.g., integrated banking or integrated manufacturing and distribution systems) with proprietary database structures will also pose data access problems.

External data sources may also be used. Examples include Bloomberg News, Lundberg, A.C. Nielsen, Dun and Bradstreet, Mailcode or Zipcode Data, Dow Jones News Service, Lexis, New York Times Services, and Nexis.

In Summary

Quite a number of technology vendors are supplying warehousing products in more than one category, and a clear trend toward the integration of different warehousing products is evidenced by efforts to share metadata across different products and by the many partnerships and alliances formed between warehousing vendors.

Despite this, there is still no clear market leader for an integrated suite of data warehousing products. Warehousing teams are still forced to take on the responsibility of integrating disparate products, tools, and environments or to rely on the services of a solution integrator. Until this situation changes, enterprises should carefully evaluate the fit of the tools they eventually select for different aspects of their warehousing initiative. The integration problems posed by the source system data are difficult enough without adding tool integration problems to the project.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.217.8.82