Chapter 15. Warehouse Maintenance and Evolution

With the data warehouse in production, the warehousing team will face a new set of challenges—the maintenance and evolution of the warehouse.

Regular Warehous Loads

New or updated data must be loaded regularly from the source systems into the data warehouse to ensure that the latest data are available to warehouse users. This loading is typically conducted during the evenings, when the operational systems can be taken offline. Each step in the back-end process—extract, transform, quality assure, and load—must be performed for each warehouse load.

New warehouse loads imply the need to calculate and populate aggregate tables with new records. In cases where the data warehouse feeds one or more data marts, the warehouse loading is not complete until the data marts have likewise been loaded with the latest data.

Warehouse Statistics Collection

Warehouse usage statistics should be collected on a regular basis to monitor the performance and utilization of the warehouse. The following types of statistics will prove to be insightful.

  • Queries per day. . The number of queries that the warehouse responds to on any given day, categorized into levels of complexity whenever possible. Queries against summary tables also indicate the usefulness of these stored aggregates.

  • Query response times. . The time it takes for each query to execute.

  • Alerts per day. . The number of alerts or exceptions that are triggered by the warehouse on any given day, if an alert system is in place.

  • Valid users. . The number of users who have access to the warehouse.

  • Users per day. . The number of users who actually make use of the warehouse on any given day. This number can be compared to the number of valid users.

  • Frequency of use. . The number of times a user actually logs on to the data warehouse within a given time frame. This statistic indicates how much the warehouse supports the user's day-to-day activities.

  • Session length. . The length of time a user stays online each time he logs on to the data warehouse.

  • Time of day, day of week, day of month. . The time of day, day of week, and day of month when each query is executed. This statistic may highlight periods where there is constant, heavy usage of warehouse data.

  • Subject areas. . Identifies which of the subject areas in the warehouse are more frequently used. This information also serves as a guide for subject areas that are candidates for removal.

  • Warehouse size. . The number of records of data for each warehouse table after each warehouse load. This statistic is a useful indicator of the growth rate of the warehouse.

  • Warehouse contents profile. . Statistics about the warehouse contents (e.g., total number of customers or accounts, number of employees, number of unique products, etc.). This information provides interesting metrics about the business growth.

Warehouse User Profiles

As more users access the warehouse, the usability of the data access and retrieval tools becomes critical. The majority of users will not have the patience to learn a whole new set of tools and will simply continue the current and convenient practice of submitting requests to the IT department.

The warehouse team must therefore evaluate the profiles of each of the intended warehouse users. This user evaluation can also be used as input to tool selection and to determine the number of licenses required for each data access and retrieval tool.

In general, there are three types of warehouse end users, and their preferred method for interacting with the data warehouse varies accordingly. These users are:

  • Senior and executive management. . These end users generally prefer to view information through predefined reports with built-in hierarchical drilling capabilities. They prefer reports that use graphical presentation media, such as charts and models, to quickly convey information.

  • Middle management and senior analysts. . These individuals prefer to create their own queries and reports, using the available tools. They create information in an ad hoc style, based on the information needs of senior and executive management. However, their interest is often limited to a specific product group, a specific geographical area, or a specific aspect of the enterprise's performance. The preferred interfaces for users of this type is spreadsheets and front-ends that provide budgeting and forecasting capabilities.

  • Business analyst and IT support. . These individuals are among the heaviest users of warehouse data and are the ones who perform actual data collection and analysis. They create the charts and reports that are required to present their findings to senior management. They also prefer to work with tools that allow them to create their own queries and reports.

The above categories describe the typical user profiles. The actual preference of individual users may vary, depending on individual IT literacy and working style.

Security and Access Profiles

A data warehouse contains critical information in a readily accessible format. It is therefore important to keep secure not only the warehouse data but also the information that is distilled from the warehouse.

OLTP approaches to security, such as the restriction of access to critical tables, will not work with a data warehouse because of the exploratory fashion by which warehouse data are used. Most analysts will use the warehouse in an ad hoc manner and will not necessarily know at the outset what subject areas they will be exploring or even what range of queries they will be creating. By restricting user access to certain tables, the warehouse security may inadvertently inhibit analysts and other warehouse users from discovering critical and meaningful information.

Initial warehouse rollouts typically require fairly low security because of the small and targeted set of users intended for the initial rollouts. There will therefore be a need to revisit the security and access profiles of users as each rollout is deployed.

When users leave an organization, their corresponding user profiles should be removed to prevent the unauthorized retrieval and use of warehouse data.

Also, if the warehouse data are made available to users over the public Internet infrastructure, the appropriate security measures should be put in place.

Data Quality

Data quality (or the lack thereof) will continue to plague warehousing efforts in the years to come. The enterprise will need to determine how data errors will be handled in the warehouse. There are two general approaches to data quality problems.

  • Only clean data gets in. . Only data that are certified 100 percent correct are loaded into the warehouse. Users are confident that the warehouse contains correct data and can take decisive action based on the information it provides. Unfortunately, since data errors may take a long time to identify, and even more to fix, it may be a while before a full warehouse load is completed. Also, a vast majority of queries (e.g., who are our top-10 customers? how many product combinations are we selling?) will not be meaningful if a warehouse load is incomplete.

  • Clean as we go. . All data are loaded into the warehouse, but mechanisms are defined and implemented to identify and correct data errors. Although such an approach allows warehouse loads to take place, the quality of the data is suspect and may result in misleading information and ill-informed decisions. The questionable data quality may also cause problems with user acceptance—users will be less inclined to use the warehouse if they do not believe the information it provides.

It is unrealistic to expect that all data quality errors will be corrected during the course of one warehouse rollout. However, acceptance of this reality does not mean that data quality efforts are for naught and can be abandoned.

Whenever possible, correct the data in the source systems so that cleaner data are provided in the next warehouse load. Provide mechanisms for clearly identifying dirty warehouse data. If users know which parts of the warehouse are suspect, they will still be able to find value in the data that are correct.

It is an unfortunate fact of life that older enterprises have larger data volumes and, consequently, a larger volume of data errors.

Data Growth

Initial warehouse deployments may not face space or capacity problems, but as time passes and the warehouse size grows with each new data load, the proper management of data growth expansion proliferation grows in importance.

There are several ways to handle data growth, including:

  • Use of aggregates. . The use of stored aggregates significantly reduces the space required by the data, especially if the data are required only at a highly summarized level. The detailed data can be deleted or archived after aggregates have been created. Note however, that the removal of detailed data implies the loss of the ability to drill down for more detail. Also, new summaries at other levels may not be derivable from the current portfolio of aggregate schemas.

  • Limiting the time frame. . Although users will want the warehouse to store as much data for as long as possible, there may be a need to compromise by limiting the length of historical data in the warehouse.

  • Removing unused data. . Using query statistics gathered over time, it is possible for warehouse administrators to identify rarely used data in the warehouse. These records are ideal candidates for removal since their storage results in costs with very little business value.

Updates to Warehouse Subsystems

As time passes, a number of conditions will necessitate changes to the data structure of the warehouse, its staging areas, its back-end subsystems, and, consequently, its metadata. We describe some of these conditions in the following subsections.

Source System Evolution

As the source systems evolve, so by necessity does the data warehouse. It is therefore critical that any plans to change the scope, functionality, and availability of the source systems also consider any possible impact on the data warehouse. The CIO is in the best position to ensure that the project efforts are coordinated across multiple projects.

  • Changes in scope. . Scope changes in operational systems typically imply one or more of the following: the availability of new data in an existing system, the removal of previously available data in an existing system, or the migration of currently available data to a new or different computing environment. An example of the latter is the deployment of a new system to replace an existing one.

  • Change in functionality. . There are times when the data structure already existing in the operational systems remains the same but the processing logic and business rules governing the input of future data is changed. Such changes require updates to data integrity rules and metadata used for quality assurance. All quality assurance programs should likewise be updated.

  • Change in availability. . Additional demands on the operational system may affect the availability of the source system (e.g., smaller batch windows). The batch windows may affect the schedule of regular warehouse extractions and may place new efficiency and performance demands on the warehouse extraction and transformation subsystems.

Use of New or Additional External Data

Some data are commercially available for purchase and can be integrated into the data warehouse as the business needs evolve. Not that the use of external data presents its own set of difficulties due to the likelihood of incompatible formats or level of detail.

The use of new or additional external data has the same impact on the warehouse back-end subsystems as do changes to internal data sources.

Database Optimization and Tuning

As query statistics are collected and user base increases, there will be a need to perform database optimization and tuning tasks to maintain an acceptable level of warehouse performance.

To avoid or control the impact of nasty surprises, inform users when changes are made to the production database. Keep in mind that any changes to the database should first be tested in a safe environment.

Databases can be tuned through a number of approaches, including but not limited to the following:

  • Use of parallel query options. . Some of the major database management systems offer options that will split up a large query into several smaller queries that can be run in parallel. The results of the smaller queries are then combined and presented to users as a single result set. While such options have costs, their implementation is transparent to users, who notice only the improvements in response time.

  • Indexing strategies. . As very large database (VLDB) implementations are becoming more popular, database vendors are offering indexing options or strategies to improve the response times to queries against very large tables.

  • Dropping of referential integrity checking. . While debates still exist as to whether or not referential integrity checking should be left on during warehouse loading, it is an undeniable fact that when referential integrity is turned off, the loading of warehouse data becomes faster. Some parties reason that since data are checked prior to warehouse loading, there will be no need to enforce referential integrity constraints.

Data Warehouse Staffing

Not all organizations with a data warehouse choose to create a permanent unit to administer and maintain it. Each organization will have to decide if a permanent unit is required to maintain the data warehouse.

A permanent unit has the advantage of focusing the warehouse staff formally on the care and feeding of the data warehouse. A permanent unit also increases the continuity in staff assignments by decreasing the possibility of losing staff to other IT projects or systems in the enterprise.

The use of matrix organizations in place of permanent units has also proven to be effective, provided that roles and responsibilities are clearly defined and that the IT division is not undermanned.

If the warehouse development was partially or completely outsourced to third parties because of a shortage of internal IT resources, the enterprise may find it necessary to staff up at the end of the warehouse rollout. As the project draws to a close, the consultants or contractors will be turning over the day-to-day operations of the warehouse to internal IT staff. The lack of internal IT resources may result in haphazard turnovers. Alternatively, the enterprise may have to outsource the maintenance of the warehouse.

Warehouse Staff and User Training

The enterprise may find it helpful to establish a training program for both technology staff and end users.

User Training

  • Warehousing overview. . Half-day overviews can be prepared for executive or senior management to manage expectations.

  • User roles. . User training should also cover general data warehousing concepts and explain how users are involved during data warehouse planning, design, and construction activities.

  • Warehouse contents and metadata. . Once a data warehouse has been deployed, the user training should focus strongly on the contents of the warehouse. Users must understand the data that are now available to them and must understand also the limitations imposed by the scope of the warehouse. The contents and usage of business metadata should also be explained.

  • Data access and retrieval tools. . User training should also focus on the selected end-user tools. If users find the tools difficult to use, the IT staff will quickly find themselves saddled with the unwelcome task of creating reports and queries for end users.

Warehouse Staff Training

Warehouse staff require training on a number of topics covering the planning, design, implementation, management, and maintenance of data warehouses. Depending on their project roles, the staff will need to specialize or focus on different areas or different aspects of the warehousing life cycle. For example, the metadata administrator needs specialized courses on metadata repository management. Whereas the warehouse DBA needs dimensional modeling training.

Readers may find it helpful to refer to the training program recommended by the Committee for Data Warehousing Education at the Data Warehousing Institute (http://www.dw-insitute.com).

Subsequent Warehouse Rollouts

The data warehouse is extended continuously. Data warehouse design and construction skills will always be needed as long as end-user requirements and business situations continue to evolve. Each subsequent rollout is designed to extend the functionality and scope of the warehouse.

As new user requirements are studied and subsequent warehouse rollouts get underway, the overall data warehouse architecture is revisited and modified as needed.

Data marts are deployed as needed within the integrating framework of the warehouse. Avoid multiple, unrelated data marts because these will merely create unnecessary data management and administration problems.

Chargeback Schemes

It may be necessary at some point for the IT Department to start charging user groups for warehouse usage, as a way of obtaining continuous funding for the data warehouse initiative.

Note that chargeback schemes will work only if there are reliable mechanisms to track and monitor usage of the warehouse per user. They also put the warehouse to the test—users will have to feel that they are getting their money's worth each time they use the warehouse. Warehouse usage will drop if users feel that the warehouse has no value.

Disaster Recovery

The challenges of deploying new technology may cause warehouse administrators to place a lower priority on disaster recovery. As time passes and more users come to depend on the data warehouse, however, the warehouse achieves mission-critical status. The appropriate disaster recovery procedures are therefore required to safeguard the continuous availability and reliability of the warehouse.

Ideally, the warehouse team conducts a dry run of the disaster recovery procedure at least once prior to the deployment of the warehouse. Disaster recovery drills on a regular basis will also prove helpful.

Some disasters may require the reinstallation of operating systems and database management systems aside from the reloading of warehouse data and population of aggregate tables. The recovery procedures should consider this possibility.

A final note: Review the disaster recovery plan at the end of each rollout. The plan may have to be updated in light of changes to the architecture, scope, and size of the warehouse.

In Summary

A data warehouse initiative does not stop with one successful warehouse deployment. The warehousing team must sustain the initial momentum by maintaining and evolving the data warehouse.

Unfortunately, maintenance activities remain very much in the background—often unseen and unappreciated until something goes wrong. Many people do not realize that evolving a warehouse can be trickier than the initial deployment. The warehousing team has to meet a new set of information requirements without compromising the performance of the initial deployment and without limiting the warehouse's ability to meet future requirements.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.149.251.155