Chapter 5. The Project Manager

The warehouse Project Manager is responsible for any and all technical activities related to planning, designing, and building a data warehouse. Under ideal circumstances, this role is fulfilled by internal IT staff. It is not unusual, however, for this role to be outsourced, especially for early or pilot projects, because warehousing technologies and techniques are so new.

How Do I Roll Out a Data Warehouse Initiative?

If you are starting a data warehouse initiative, there are three main things to keep in mind. Always start with a planning activity. Always implement a pilot project as your "proof of concept." And, always extend the functionality of the warehouse in an iterative manner.

Start with a Data Warehouse Planning Activity

The scope of a data warehouse varies from one enterprise to another. The desired scope and scale are typically determined by the information requirements that drive the warehouse design and development. These requirements, in turn, are driven by the business context of the enterprise—the industry, the fierceness of competition, and the state of the art in industry practices.

Regardless of the industry, however, it is advisable to start a data warehouse initiative with a short planning activity. The Project Manager should launch and manage the activities listed below.

Decisional Requirements Analysis. Start with an analysis of the decision support needs of the organization. The warehousing team must understand the user requirements and attempt to map these to the data sources available. The team also designs potential queries or reports that can meet the stated information requirements.

Note that unlike system development projects for OLTP applications, the information needs of decisional users cannot be pinned down and are frequently changing. The Requirements Analysis team should therefore gain enough of an understanding of the business to be able to anticipate likely changes to end-user requirements.

Decisional Source System Audit. Conduct an audit of all potential sources of data. This crucial and very detailed task verifies that data sources exist to meet the decisional information needs identified during requirements analysis. There is no point in designing a warehouse schema that cannot be populated because of a lack of source data.

Similarly, there is no point in designing reports or queries when data are not available to generate them. Log all data items that are currently not supported or provided by the operational systems and submit these to the CIO as inputs for IT planning.

Logical and Physical Warehouse Schema Design (Preliminary). The results of requirements analysis and source system audit serve as inputs to the design of the warehouse schema. The schema details all fact and dimension tables and fields, as well as the data sources for each warehouse field. The preliminary schema produced as part of the warehousing planning activity will be progressively refined with each rollout of the data warehouse.

The goal of the team is to design a data structure that will be resilient enough to meet the constantly changing information requirements of warehouse end-users.

Other Concerns. The three tasks described above should also provide the warehousing team with an understanding of:

  • the required warehouse architecture;

  • the appropriate phasing and rollout strategy; and

  • the ideal scope for a pilot implementation.

The data warehouse plan must also evaluate the need for an ODS layer between the operational systems and the data warehouse.

You can find additional information on the above activities in Part III, Process.

Implement a Proof-of-Concept Pilot

Start with a pilot implementation as the first rollout for data warehousing. Pilot projects have the advantage of being small and manageable, thereby providing the organization with a data warehouse "proof of concept" that has a good chance of success.

Determine the functional scope of a pilot implementation based on two factors:

  • The degree of risk the enterprise is willing to take. The project difficulty increases as the number of source systems, users, and locations increases. Politically sensitive areas of the enterprise are also very high risk.

  • The potential for leveraging the pilot project. Avoid constructing a "throwaway" prototype for the pilot project. The pilot warehouse must have actual value to the enterprise. Figure 5-1 is a matrix for assessing the pilot project.

Avoid high-risk projects with very low reward possibilities. Ideally, the pilot project has low or manageable risk factors but has a highly visible impact on the way decisions are made in the enterprise. An early and high-profile success will increase the grassroots support of the warehousing initiative.

Selecting Pilot Projects: Risk vs. Reward

Figure 5-1. Selecting Pilot Projects: Risk vs. Reward

Extend Functionality Iteratively

Once the warehouse pilot is in place and is stable, implement subsequent rollouts of the data warehouse to continuously layer new functionality or extend existing warehousing functionality on a cost-justifiable, prioritized basis, illustrated by the diagram in Figure 5-2.

Top-Down. Drive all rollouts by a top-down study of user requirements. Note that decisional requirements are subject to constant change; the team will never be able to fully document and understand the requirements, simply because the requirements change as the business situation changes. Don't fall into the trap of wanting to analyze everything to extreme detail (i.e., analysis paralysis).

Bottom-Up. While some team members are working top-down, other team members are working bottom-up. The results of the bottom-up study serve as the reality check for the rollout—some of the top-down requirements will quickly become unrealistic, given the state and contents of the intended source systems. End users should be quickly informed of limitations imposed by source system data to properly manage their expectations.

Iterative Extension of Functionality, i.e., Evolution

Figure 5-2. Iterative Extension of Functionality, i.e., Evolution

Back-End. Each rollout or iteration extends the back-end (i.e., the server component) of the data warehouse. Warehouse subsystems are created or extended to extract, transform, clean, and integrate more data. Warehouse data structures are extended to support a larger scope of data. Aggregate records are computed and loaded. Metadata records are populated as required.

Front-End.  The front-end (i.e., client component) of the warehouse is also extended by deploying the existing data access and retrieval tools to more users and by deploying new tools (e.g., data mining tools, new decision support applications) to warehouse users. The availability of more data implies that new reports and new queries can be defined.

How Important Is the Hardware Platform?

Although the mainframe environment is also used as a data warehouse platform, data warehousing hardware discussions typically revolve around two main types of hardware technologies: symmetric multiprocessing (SMP) and massively parallel processing (MPP) servers.

SMPs. Symmetric multiprocessing (SMP) hardware has multiple processors that share one memory (see Figure 5-3). This type of architecture is often referred to as the "Shared Everything" architecture. When additional computing power is required, additional CPUs are added to the machine (although there is a limit to the number) or several SMP machines are clustered together.

MPPs. In contrast, massively parallel processing (MPP) hardware supports multiple nodes, where each node has one or more processors, each with its own memory (see Figure 5-4). Additional nodes can be added to increase processing power.

The choice between SMP and MPP is influenced by a number of factors, including the complexity of the query environment, the price/performance ratio, the proven processing capacity of the hardware platform with the target RDBMS, the anticipated warehouse applications, and the foreseen increases in warehouse size and users.

For example, complex queries that involve multiple table joins might realize better performance with an MPP configuration. MPPs, though, are generally more expensive. Clustered SMPs may provide a highly scalable implementation with better price/performance benefits.

SMP Hardware Configuration

Figure 5-3. SMP Hardware Configuration

MPP Hardware Configuration

Figure 5-4. MPP Hardware Configuration

What Technologies Are Involved?

Several types of technologies are used to make data warehousing possible. These technology types are enumerated briefly below. You can find more information in Part 4, Technology.

  • Source systems. The operational systems of the enterprise are the most likely source systems for a data warehouse. The warehouse may also make use of external data sources from third parties.

  • Middleware, extraction, transportation and transformation technologies. These tools extract and reorganize data from the various source systems. These tools vary greatly in terms of complexity, features, and price. The ideal tools for the enterprise are heavily dependent on the computing environment of the source systems and the intended computing environment of the data warehouse.

  • Data quality tools. These tools identify or correct data quality errors that exist in the raw source data. Most tools of this type are used to call the warehouse team's attention to potential quality problems. Unfortunately, much of the data cleansing process is still manual; it is also tedious due to the volume of data involved.

  • Warehouse storage. Database management systems (DBMS) are used to store the warehouse data. DBMS products are generally classified as relational (e.g., Oracle, Informix, Sybase) or multidimensional (e.g., Essbase, BrioQuery, Express Server).

  • Metadata management. These tools create, store, and manage the warehouse metadata.

  • Data access and retrieval tools. These are tools used by warehouse end users to access, format, and disseminate warehouse data in the form of reports, query results, charts, and graphs. Other data access and retrieval tools actively search the data warehouse for patterns in the data (i.e., data mining tools). Decision Support Systems and Executive Information Systems also fall into this category.

  • Data modeling tools. These tools are used to prepare and maintain an information model of both the source databases and the warehouse database.

  • Warehouse management tools. These tools are used by warehouse administrators to create and maintain the warehouse (e.g., create and modify warehouse data structures, generate indexes).

  • Data warehouse hardware. This refers to the data warehouse server platforms and their related perating systems.

Figure 5-5 depicts the typical warehouse software components and their relationships to one another.

Do I Still Use Relational Databases for Data Warehousing?

Although there were initial doubts about the use of relational database technology in data warehousing, experience has shown that there is actually no other appropriate database management system for an enterprise-wide data warehouse.

Data Warehouse Components

Figure 5-5. Data Warehouse Components

MDDBs. The confusion about relational databases arises from the proliferation of OLAP products that make use of a multidimensional database (MDDB). MDDBs store data in a "hypercube" i.e., a multidimensional array that is paged in and out of memory as needed, as illustrated in Figure 5-6.

RDBMS. In contrast, relational databases store data as tables with rows and columns that do not map directly to the multidimensional view that users have of data. Structured Query Language (SQL) scripts are used to retrieve data from RDBMSes.

Two Rival Approaches

Although these two approaches are apparent "rivals," the apparent competition between MDDB and relational database (RDB) technology presents enterprises with interesting architectural alternatives for implementing data warehousing technology. It is not unusual to find enterprises making use of both technologies, depending on the requirements of the user community.

MDDB Data Structures

Figure 5-6. MDDB Data Structures

From an architectural perspective, the enterprise can get the best of both worlds through the careful use of both technologies in different parts of the warehouse architecture.

  • Enterprise data warehouses. These have a tendency to grow significantly beyond the size limit of most MDDBs and are therefore typically implemented with relational database technology. Only relational database technology is capable of storing up to terabytes of data while still providing acceptable load and query performance.

  • Data marts. A data mart is typically a subset of the enterprise data warehouse. These subsets are determined either by geography (i.e., one data mart per location) or by user group. Data marts, due to their smaller size, may take advantage of multidimensional databases for better reporting and analysis performance.

Warehousing Architectures

Below, we present how the relational and multi-dimensional database technologies can be used together for data warehouse and data mart implementation.

RDBMSes in Warehousing Architectures. Data warehouses are built on relational database technology. Online Analytical Processing (OLAP) tools are then used to interact directly with the relational data warehouse or with a relational data mart (see Figure 5-7). Relational OLAP (ROLAP) tools recognize the relational nature of the database but still present their users with multidimensional views of the data.

MDDBs in Warehousing Architectures. Alternatively, data is extracted from the relational data warehouse and placed in multidimensional data structures to reflect the multidimensional nature of the data (see Figure 5-8). Multidimensional OLAP (MOLAP) tools run against the multidimensional server, rather than against the data warehouse.

Tiered Data Warehousing Architectures. The enterprise is free to mix and match these two database technologies, depending on the scale and size of the data warehouse, as illustrated in Figure 5-9.

It will not be unusual to find an enterprise with the following tiered data warehousing architecture:

  • ROLAP tools, which run directly against relational databases, are used whenever the queries are fairly simple and when the administration overhead that comes with multidimensional tools is not justified for a particular user base.

  • Multidimensional databases are used for data marts, and specific multidimensional front-end applications query the contents of the MDDB. Data marts may also use relational database technology, in which case, users make use of ROLAP front-ends.

Relational Databases

Figure 5-7. Relational Databases

Multidimensional Databases

Figure 5-8. Multidimensional Databases

Trade-Offs: MDDB vs. RDBMS

Consider the following factors when choosing between the multidimensional and relational approaches:

Size. Multidimensional databases are generally limited by size, although the size limit has been increasing gradually over the years. In the mid-1990s, 10 gigabytes of data in a hypercube already presented problems and unacceptable query performance. Some multidimensional products today are able to handle up to 100 gigabytes of data. Despite this improvement, however, large data warehouses are still better served by relational front-ends running against high-performing and scalable relational databases.

Tiered Data Warehousing Architecture

Figure 5-9. Tiered Data Warehousing Architecture

Volatility of Source Data. Highly volatile data are better handled by relational technology. Multidimensional data in hypercubes generally take long to load and update. Thus, the time required to constantly load and update the multidimensional data structure may prevent the enterprise from loading new data as often as desired.

Aggregate Strategy. Multidimensional hypercubes support aggregations better, although this advantage will disappear as relational databases improve their support of aggregate navigation. Drilling up and down on RDBMSes generally take longer than on MDDBs as well. However, due to their size limitation, MDDBs will not be suited to warehouses or data marts with very detailed data.

Investment Protection. Most enterprises have already made significant investments in relational technology (e.g., RDBMS assets) and skill sets. The continued use of these tools and skills for another purpose provides additional return on investment and lowers the technical risk for the data warehousing effort.

Ability to Manage Complexity. A multidimensional DBMS adds a layer to the overall systems architecture of the warehouse. Sufficient resources must be allocated to administer and maintain the MDDB layer. If the administrative overhead is not or cannot be justified, an MDDB will not be appropriate.

Type of Users. Power users generally prefer the range of functionality available in multidimensional OLAP tools. Users that require broad views of enterprise data require access to the data warehouse and therefore are best served by a relational OLAP tool.

Recently, many of the large database vendors have announced plans to integrate their multi-dimensional and relational database products. In this scenario, end-users make use of the multi-dimensional front-end tools for all their queries. If the query requires data that are not available in the MDDB, the tools will retrieve the required data from the larger relational database. Dubbed as a "drill-through" feature, this innovation will certainly introduce new data warehousing architectures.

How Long Does a Data Warehousing Project Last?

Data warehousing is a long, daunting task; it requires significant, prolonged effort on the part of the enterprise and may have the unpleasant side effect of highlighting problem areas in operational systems. Like any task of great magnitude, the data warehousing effort must be partitioned into manageable chunks, where each piece is managed as an individual project or rollout.

Data warehouse rollouts after the pilot warehouse must fit together within an overall strategy. Define the strategy at the start of the data warehousing effort. Constantly update (at least once a year) the data warehouse strategy as new requirements are understood, new operational systems are installed, and new tools become available.

In cases where the enterprise also has an Operational Data Store initiative, the ODS and warehousing projects must be synchronized. The data warehouse should take advantage of the ODS as a source system as soon as possible.

Figure 5-10 depicts how the entire decisional systems effort can be interleaved.

Each data warehouse rollout should be scoped to last anywhere between three to six months, with a team of about 6 to 12 people working on it full time. Part-time team members can easily bring the total number of participants to more than 20. Sufficient resources must be allocated to support each rollout.

Interleaved Operational, ODS, and Warehouse Projects

Figure 5-10. Interleaved Operational, ODS, and Warehouse Projects

How Is a Data Warehouse Different from Other IT Projects?

Since much of computing has focused on meeting operational information needs, IT professionals have a natural tendency to apply the same methodologies, approaches or techniques to data warehousing projects. Unfortunately, data warehousing projects differ from other IT projects in a number of ways, as discussed below.

A Data Warehouse Project is Not a Package Implementation Project

A data warehouse project requires a number of tools and software utilities that are available from multiple vendors. At present, there is still no single suite of tools that can automate the entire data warehousing effort.

Most of the major warehouse vendors, however, are now attempting to provide off-the-shelf solutions for warehousing projects by bundling their warehousing products with that of other warehousing partners. This solution limits the potential tool integration problems of the warehousing team.

A Data Warehouse Never Stops Evolving; It Changes with the Business

Unlike OLTP systems that are subject only to changes related to the process or area of the business they support, a data warehouse is subject to changes to the decisional information requirements of decision-makers. In other words, it is subject to any changes in the business context of the enterprise.

Also, unlike OLTP systems, a successful data warehouse will result in more questions from business users. Change requests for the data warehouse are a positive indication that the warehouse is being used.

Data Warehouses Are Huge

Without exaggeration, enterprise-wide data warehouses are huge. A pilot data warehouse can easily be more than 10 gigabytes in size. A data warehouse in production for a little over a year can easily reach 1 terabyte, depending on the granularity and the volume of data. Databases of this size require different database optimization and tuning techniques.

Project Progress and Effort Are Highly Dependent on Accessibility and Quality of Source System Data

The progress of a data warehouse project is highly dependent on where the operational data resides. Enterprises that make use of proprietary application packages will find themselves dealing with locked data. Enterprises with data distributed over different locations with no easy access will also encounter difficulties.

Similarly, the quality of the existing data plays a major role in the project. Data quality problems consistently remain at the top of the list of data warehouse issues. Unfortunately, none of the available tools can automate away the problem of data quality. Although tools can help identify problem areas, these problems can be only be resolved manually.

What Are the Critical Success Factors of a Data Warehousing Project?

A number of factors influence the progress and success of data warehousing projects. While the list below does not claim to be complete, it highlights areas of the warehousing project that the project manager is in a position to actively control or influence.

  • Proper planning. Define a warehousing strategy and expect to review it after each warehouse rollout. Bear in mind that IT resources are still required to manage and administer the warehouse once it is in production. Stay coordinated with any scheduled maintenance work on the warehouse source systems.

  • Iterative development and change management. Stay away from the big-bang approach. Divide the warehouse initiative into manageable rollouts, each to last anywhere between three to six months. Constantly collect feedback from users. Identify lessons learned at the end of each project and feed these into the next iteration.

  • Access to and involvement of key players. The Project Sponsor, the CIO, and the Project Manager must all be actively involved in setting the direction of the warehousing project. Work together to resolve the business and technical issues that will arise on the project. Choose the project team members carefully, taking care to ensure that the project team roles that must be performed by internal resources are staffed accordingly.

  • Training and communication. If data warehousing is new to the enterprise or if new team members are joining the warehousing initiative, set aside enough time for training the team members. The roles of each team member must be communicated clearly to set role expectations.

  • Vigilant issue management. Keep a close watch on project issues and ensure their speedy resolution. The Project Manager should be quick to identify the negative consequences on the project schedule if issues are left unresolved. The Project Sponsor should intervene and ensure the proper resolution of issues, especially if these are clearly causing delays, or deal with highly politicized areas of the business.

  • Warehousing approach. One of the worst things a Project Manager can do is to apply OLTP development approaches to a warehousing project. Apply a system development approach that is tailored to data warehousing; avoid OLTP development approaches that have simply been repackaged into warehousing terms.

  • Demonstration with a pilot project. The ideal pilot project is a high-impact, low-risk area of the enterprise. Use the pilot as a proof-of-concept to gain champions within the enterprise and to refute the opposition.

  • Focus on essential minimal characteristics. The scope of each project or rollout should be ruthlessly managed to deliver the essential minimal characteristics for that rollout. Don't get carried away by spending time on the bells and whistles, especially with the front-end tools.

In Summary

The Project Manager is responsible for all technical aspects of the project on a day-to-day basis. Since up to 80 percent of any data warehousing project can be devoted to the back-end of the warehouse, the role of Project Manager can easily be one of the busiest on the project.

Despite the fact that business users now drive the warehouse development, a huge part of any data warehouse project is still technology centered. The critical success factors of typical technology projects therefore still apply to data warehousing projects. Bear in mind, however, that data warehousing projects are more susceptible to organizational and logistical issues than the typical technology project.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.216.251.37