Chapter 13. Warehouse Metadata

Metadata have traditionally been defined as data about data. While such a catchy statement may not seem very helpful, it is actually quite appropriate as a definition—metadata are a form of abstration that describes the structure and contents of the data warehouse.

Metadata Are a Form of Abstration

It is fairly easy to apply abstraction on concrete, tangible items. Information technology professionals do this all the time when they design operational systems. A concrete product is abstracted and described by its properties (i.e., data attributes)—for example, name, color, weight, size, price. A person can also be abstracted and described through his name, age, gender, occupation, etc.

Abstraction complexity increases when the item that is abstracted is not as concrete; however, such abstraction is still routinely performed in operational systems. For example, a banking transaction can be described by the transaction amount, transaction currency, transaction type (e.g., withdrawal), and the date and time when the transaction took place.

Figure 13–1 and Figure 13–2 present two metadata examples for data warehouses; the first example provides sample metadata for warehouse fields. The second provides sample metadata for warehouse dimensions. These metadata are supported by the Warehouse Designer software product that accompanies this book.

In data warehousing, abstraction is applied to the data sources, extraction and transformation rules and programs, data structure, and contents of the data warehouse itself. Since the data warehouse is a repository of data, the results of such an abstraction—the metadata—can be described as "data about data."

Why Are Metadata Important?

Metadata are important to a data warehouse for several reasons. To explain why, we examine the different uses of metadata.

Metadata Example for Warehouse Fields

Figure 13-1. Metadata Example for Warehouse Fields

Metadata Example for Warehouse Dimensions

Figure 13-2. Metadata Example for Warehouse Dimensions

Metadata Establish the Context of the Warehouse Data

Metadata help warehouse administrators and users locate and understand data items, both in the source systems and in the warehouse data structures. For example, the date value 02/05/1998 may mean different dates depending on the date convention used. The same set of numbers can be interpreted as February 5, 1998 or as May 2, 1998. If metadata describing the format of this date field were available, the definite and unambiguous meaning of the data item could be easily determined.

In operational systems, software developers and database administrators deal with metadata every day. All technical documentation of source systems are metadata in one form or another. Metadata, however, remain for the most part transparent to the end users of operational systems. They perceive the operational system as a black box and interact only with the user interface.

This practice is in direct contrast to data warehousing, where the users of decisional systems actively browse through the contents of the data warehouse and must first understand the warehouse contents before they can make effective use of the data.

Metadata Facilitate the Analysis Process

Consider the typical process that business analysts follow as part of their work. Enterprise analysts must go through the process of locating data, retrieving data, interpreting and analyzing data to yield information, presenting the information, and then recommending courses of action.

To make the data warehouse useful to enterprise analysts, the metadata must provide warehouse end users with the information they need to easily perform the analysis steps. Thus, metadata should allow users to quickly locate data that are in the warehouse. The metadata should also allow analysts to interpret data correctly by providing information about data formats (as in the above data example) and data definitions.

As a concrete example, when a data items in the warehouse Fact table is labeled "Profit," the user should be able to consult the warehouse metadata to learn how the Profit data item is computed.

Metadata Are a Form of Audit Trail for Data Transformation

Metadata document the transformation of source data into warehouse data. Warehouse metadata must be able to explain how a particular piece of warehouse data was derived from the operational systems. All business rules that govern the transformation of data to new values or new formats are also documented as metadata.

This form of audit trail is required if users are to gain confidence in the veracity and quality of warehouse data. It is also essential to the user's understanding of warehouse data to know where they came from.

In addition, some warehousing products use this type of metadata to generate extraction and transformation scripts for use on the warehouse back-end.

Metadata Improve or Maintain Data Quality

Metadata can improve or maintain warehouse data quality through the definition of valid values for individual warehouse data items. Prior to actual loading into the warehouse, the warehouse load images can be reviewed by a data quality tool to check for compliance with valid values for key data items. Data errors are quickly highlighted for correction.

Metadata can even be used as the basis for any error-correction processing that should be done if a data error is found. Error-correction rules are documented in the metadata repository and executed by program code on an as needed basis.

Metadata Types

Although there are still ongoing discussions and debates regarding standards for metadata repositories, it is generally agreed that metadata repository must consider the metadata types described in the next subsections.

Administrative Metadata

Administrative metadata contain descriptions of the source databases and their contents, the data warehouse objects, and the business rules used to transform data from the sources into the data warehouse.

  • Data sources. . These are descriptions of all data sources used by the warehouse, including information about the data ownership. Each record and each data item is defined to ensure a uniform understanding by all warehousing team members and warehouse users. Any relationships between different data sources (e.g., one provides data to another) are also documented.

  • Source-to-target field mapping. . The mapping of source fields (in operational systems) to target fields (in the data warehouse) explains what fields are used to populate the data warehouse. It also documents the transformations and formatting changes that were applied to the original, raw data to derive the warehouse data.

  • Warehouse schema design. . This model of the data warehouse describes the warehouse servers, databases, database tables, fields, and any hierarchies that may exist in the data. All referential tables, system codes, etc., are also documented.

  • Warehouse back-end data structure. . This is a model of the back-end of the warehouse, including staging tables, load image tables, and any other temporary data structures that are used during the data transformation process.

  • Warehouse back-end tools or programs. . A definition of each extraction, transformation, and quality assurance program or tool that is used to build or refresh the data warehouse. This definition includes how often the programs are run, in what sequence, what parameters are expected, and the actual source code of the programs (if applicable). If these programs are generated, the name of the tool and the date and time when the programs were generated should also be included.

  • Warehouse architecture. . If the warehouse architecture is one where an enterprise warehouse feeds many departmental or vertical data marts, the warehouse architecture should be documented as well. If the data mart contains a logical subset of the warehouse contents, this subset should also be defined.

  • Business rules and policies. . All applicable business rules and policies are documented. Examples include business formulas for computing costs or profits.

  • Access and security rules. . Rules governing the security and access rights of users should likewise be defined.

  • Units of measure. . All units of measurement and conversion rates used between different units should also be documented, especially if conversion formulas and rates change over time.

End-User Metadata

End-user metadata help users create their queries and interpret the results. Users may also need to know the definitions of the warehouse data, their descriptions, and any hierarchies that may exist within the various dimensions.

  • Warehouse contents. . Metadata must describe the data structure and contents of the data warehouse in user-friendly terms. The volume of data in various schemas should likewise be presented. Any aliases that are used for data items are documented as well. Rules used to create summaries and other precomputed totals are also documented.

  • Predefined queries and reports. . Queries and reports that have been predefined and that are readily available to users should be documented to avoid duplication of effort. If a report server is used, the schedule for generating new reports should be made known.

  • Business rules and policies. . All business rules applicable to the warehouse data should be documented in business terms. Any changes to business rules over time should also be documented in the same manner.

  • Hierarchy definitions. . Descriptions of the hierarchies in warehouse dimensions are also documented in end-user metadata. Hierarchy definitions are particularly important to support drilling up and down warehouse dimensions.

  • Status information. . Different rollouts of the data warehouse will be in different stages of development. Status information is required to inform warehouse users of the warehouse status at any point in time. Status information may also vary at the table level. For example, the base-level schemas of the warehouse may already be available and online to users while the aggregates are being computed.

  • Data quality. . Any known data quality problems in the warehouse should be clearly documented for the users. This will prompt users to make careful use of warehouse data.

  • Warehouse load history. . A history of all warehouse loads, including data volume, data errors encountered, and load time frame. This should be synchronized with the warehouse status information. The load schedule should also be available—users need to know when new data will be available.

  • Warehouse purging rules. . The rules that determine when data is removed from the warehouse should also be published for the benefit of warehouse end-users. Users need this information to understand when data will become unavailable.

Optimization Metadata

Metadata are maintained to aid in the optimization of the data warehouse design and performance. Examples of such metadata include:

  • Aggregate definitions. . All warehouse aggregates should also be documented in the metadata repository. Warehouse front-end tools with aggregate navigation capabilities rely on this type of metadata to work properly.

  • Collection of query statistics. . It is helpful to track the types of queries that are made against the warehouse. This information serves as an input to the warehouse administrator for database optimization and tuning. It also helps to identify warehouse data that are largely unused.

Versioning

Given the fact that a data warehouse contains data over different time periods, it is important to consider the effect that time may have on the business rules, source-to-target mappings, aggregate definitions, and other types of metadata in the warehouse.

Users must have access to the correct metadata for the time period they are currently studying. Without the appropriate user metadata for all time periods in the warehouse, the business users cannot be blamed for jumping to wrong conclusions and making decisions based on misinterpreted data.

Similarly, IT staff require this information for warehouse maintenance purposes. What at first glance seems to be an error in data transformation or processing may in reality be simply a change in policy or business rules.

Metadata versions, therefore, must be carefully tracked and made available to both users and the warehouse team.

Metadata as the Basis for Automating Warehousing Tasks

Although metadata have traditionally been used as a form of after-the-fact documentation, there is a clear trend in data warehousing toward metadata taking on a more active role. Almost all the major data warehouse products or tools allow their users to record and maintain metadata about the warehouse, and make use of the metadata as a basis for automating one or more aspects of the back-end warehouse process.

For example:

  • Extraction and transformation. . Users of extraction and transformation tools can specify source-to-target field mappings and enter all business rules that govern the transformation of data from the source to the target. The mapping (which is a form of metadata) serves as the basis for generating scripts that automate the extraction and transformation process.

  • Data quality. . Users of data quality tools can specify valid values for different data items in either the source system, load image, or the warehouse itself. These data quality tools use such metadata as the basis for identifying and correcting data errors.

  • Schema generation. . Similarly, users of Warehouse Designer (one of the tools provided with this book) use the tool to record metadata relating to the data structure of a dimensional data warehouse or data mart into the tool. Warehouse Designer then uses the metadata as the basis for generating the SQL Data Definition Language (DDL) statements that create data warehouse tables, fields, indexes, aggregates, etc.

  • Front-end tools. . Front-end tools also make use of metadata to gain access to the warehouse database. R/OLAPXL (the ROLAP front-end tool that accompanies this book) makes use of metadata to display warehouse tables and fields and to redirect queries to summary tables (i.e., aggregate navigation).

In Summary

Although quite a lot has been written or said about the importance of metadata, there is yet to be a consistent and reliable implementation of warehouse metadata and metadata repositories on an industry-wide scale.

To address this industry-wide issue, an organization called the Meta Data Coalition was formed to define and support the ongoing evolution of a metadata interchange format. The coalition has released a metadata interchange specification that aims to be the standard for sharing metadata among different types of products. At least 30 warehousing vendors are currently members of this organization.

Until a clear metadata standard is established, enterprises have no choice but to identify the type of metadata required by their respective warehouse initiatives, then acquire the necessary tools to support their metadata requirements.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.117.216.229