Chapter 21 Miscellaneous topics

This chapter is for topics that are important enough to write about but are not large enough by themselves to warrant their own chapter. The following miscellaneous topics are addressed in this chapter:

Data marts
Monitoring the DW 2.0 environment
Moving data from one data mart to another
What to do about bad data
The speed of the movement of data within DW 2.0
Data warehouse utilities

DW 2.0 has been presented as a representation of the base data that resides at the core of the DW 2.0 enterprise data warehouse. However, there are independent structures that use that data for analytical purposes. The exploration facility is one such structure. Another structure that takes data from DW 2.0 is the data mart.

DATA MARTS

Figure 21.1 depicts a data mart being fed data from the DW 2.0 enterprise data warehouse.

A data mart is built for the convenience of people who look at the data in the same way. Typically data marts are built for departments. Examples of corporate departments that typically have their own data marts include finance, sales, marketing, and accounting.

Any part of the DW 2.0 environment can be used to source a data mart. The normal DW 2.0 source of data mart data is the Integrated Sector. However, it is possible to populate data marts with data from other places in the DW 2.0 environment.

images

FIGURE 21.1 Data marts emanate from the DW 2.0 environment.

THE CONVENIENCE OF A DATA MART

The convenience of the data mart is that data in DW 2.0 is detailed and data in a data mart is usually not so detailed. When people have a common way of looking at data, it is more efficient and more convenient to take the detailed data, structure it once the way the group wants to look at it, and make it available. That way, when someone wants to see their data, it is always structured and formatted the way that individual wants to see it. He/she does not have to worry about the work of taking detailed data and restructuring it.

There are some other important reasons data marts are so popular. Data marts are appealing because the cost of processing normally goes way down when data is taken out of the organization’s enterprise data warehouse. The cost of processing on the machine(s) that houses DW 2.0 is usually as high as computing cycles ever get to be. When data is taken offline and onto another much smaller, departmental machine, the cost of data mart processing goes down.

Another reason data marts are popular is that by taking the data mart to another machine, the machine cycles for the DW 2.0 enterprise data warehouse environment are preserved. Moving machine cycles from the DW 2.0 environment to another environment greatly enhances the performance in the main DW 2.0 environment.

Yet another reason taking a data mart to another machine is a good idea is that different departments like the idea of ownership of their own data and their own processing.

There are many good reasons data marts are popular and separating data from the main DW 2.0 environment makes a lot of sense. Figure 21.2 shows that different groups of people have similar ways of looking at data.

TRANSFORMING DATA MART DATA

The kinds of processing that occur as data is moved from the DW 2.0 environment to the data mart environment include summarization of data, aggregation of data, selection and filtering of data, and restructuring of fields and other data attributes.

Figure 21.3 depicts the types of activities that occur as the detailed data found in the DW 2.0 environment is reformed into the data mart structure.

Determining at what point it makes sense to move analytical processing from the enterprise data warehouse to a data mart is one of the most interesting questions faced by the data warehouse architect. The answer is that when a lot of people look at the data in the same way and are doing a lot of queries, then it makes sense to create a data mart.

images

FIGURE 21.2 Data marts serve groups of people who look at and use data in approximately the same way.

MONITORING DW 2.0

A data warehouse monitor is one of the best ways to know when the time has come to add one or more data marts. Figure 21.4 depicts a data warehouse monitor examining the activities going on inside a data warehouse. When a consistently high usage pattern is detected, it is probably time to build a data mart.

images

FIGURE 21.3 To be prepared for the data mart, data from DW 2.0 is summarized, filtered, aggregated, and restructured.

Data marts are generally autonomous after they have been created. End users pretty much do whatever they want to do with their data mart(s).

MOVING DATA FROM ONE DATA MART TO ANOTHER

The practice of moving data directly from one data mart to another is architecturally unsound. Figure 21.5 indicates that in nearly every circumstance, data should not be moved directly from one data mart to another.

images

FIGURE 21.4 To determine when it is time to build a data mart, the accesses to the DW 2.0 environment need to be monitored.

If there is a need for data to be shared between two or more data marts, the data in question should be placed in the DW 2.0 enterprise data warehouse environment.

images

FIGURE 21.5 In nearly every circumstance, data sharing between data marts is a very poor practice.

BAD DATA

It is not reasonable to expect that all data will always be entered perfectly into a large, complex enterprise data warehouse environment. Figure 21.6 depicts the question—What should be done about bad data in the DW 2.0 environment?

The first thing that should be done is to try to identify the source of bad data. If the source can be found, the next step is to correct the data at the source. Figure 21.7 shows that the first step is to try to find out how bad data got into DW 2.0.

images

FIGURE 21.6 What to do about bad data?

images

FIGURE 21.7 Find out how the bad data got into the DW 2.0 environment and make corrections in ETL or elsewhere.

images

FIGURE 21.8 Add a “balancing” record, if possible.

images

FIGURE 21.9 Make an arbitrary adjustment for a group of records.

A BALANCING ENTRY

Correcting a process that is sending bad data to the data warehouse does not address the problem of what to do about bad data that has already gotten into the data warehouse.

One approach to correcting data in DW 2.0 is to find the bad data and make a “balancing” entry. If it is found that there is an erroneous entry of $23.61 in the system, then another entry equal to –$23.61 will correct it. This approach keeps the books balanced and leaves an audit trail. But this approach works only where there is a finite amount of data to be adjusted and where erroneous data can be identified.

Figure 21.8 depicts a balancing entry.

RESETTING A VALUE

Unfortunately there are many circumstances in which there is not a finite number of incorrect records that can be found and for which a balancing entry can be created. In this case an arbitrary entry is made that “resets” the values for an account.

When an arbitrary resetting of values is done by the creation of a new entry, the audit trail should carefully show how the resetting was done.

Figure 21.9 depicts the resetting of values.

MAKING CORRECTIONS

A third approach to correcting values in a DW 2.0 environment is the practice of finding bad records and then changing the values in those records. Unfortunately there are many drawbacks to this approach. The first problem is that there is no clear and obvious audit trail. A second disadvantage is that the integrity of data has been destroyed.

For example, a report is run at 9:31 AM and a summary value of $5918.91 is calculated. Then at 10:14 AM a lot of direct adjustments are made. Now at 11:57 AM the values are recalculated, and a sum of $4817.73 is calculated. The problem is that there is no way to relate the different summarizations or reconcile why the values are different.

images

FIGURE 21.10 Go in and change values in the incorrect record.

Figure 21.10 shows the actual replacement of values in records in the DW 2.0 environment.

THE SPEED OF MOVEMENT OF DATA

The speed with which data moves into and through the DW 2.0 data warehouse environment raises an interesting philosophical issue. One school of thought says that data ought to move as quickly as possible throughout the DW 2.0. In other words, if a transaction occurs at 7:13 PM, then the transaction should have its data entered into and reflected in the Interactive Sector at 7:14 PM. There are many advocates of this “as soon as possible” speed of data throughout the DW 2.0 environment, as suggested by Figure 21.11.

The alternative approach is to move data in a slower, more deliberate manner throughout the DW 2.0 environment. For example, a transaction might be executed on January 14th and the data may not find its way into the DW 2.0 environment until January 21st. This means that a week will pass before the data enters the DW 2.0 environment. The data is allowed to “settle” while it is awaiting movement into the DW 2.0 environment.

Data settles when it has been allowed enough time for an activity to be adjusted if made incorrectly. For example, suppose a transaction executes on Monday. On Tuesday, it is discovered that an adjustment needs to be made to the transaction. On Wednesday another adjustment that needs to be made is found. Finally, on Friday the transaction is sent to the DW 2.0 environment. Not being in a hurry to enter the transaction data into DW 2.0 has allowed time for the transaction data to settle into a finalized state. The result is much more accurate data and much simpler processing for the DW 2.0 environment.

Figure 21.12 illustrates allowing data to settle before being transmitted to the DW 2.0 environment.

images

FIGURE 21.11 Updating data in DW 2.0 as fast as possible—one approach.

DATA WAREHOUSE UTILITIES

A data warehouse utility (DWU) is a facility that takes some or all data warehouse processing and transparently supplants some or all of the existing data warehouse infrastructure. There are many good reasons for deploying a data warehouse utility, including performance, cost, and extending the license of a DBMS. “Dataupia” is a good example of a data warehouse utility.

images

FIGURE 21.12 Allowing data time to “settle” before it is moved to DW 2.0.

The following example illustrates why a data warehouse utility offers significant advantages. Consider a “standard” data warehouse processing environment as depicted in Figure 21.13.

This figure depicts an end user communicating directly with technology such as SAP. SAP in turn communicates directly with a DBMS such as Oracle. Oracle communicates directly with traditional SAN technology, such as EMC, IBM, or Hitachi.

images

FIGURE 21.13 A traditional DW 2.0 environment.

images

FIGURE 21.14 What happens when DW 2.0 ages.

Over time the amount of data in the traditional environment depicted in Figure 21.13 starts to grow and can become very large. Figure 21.14 depicts what happens to the traditional data base environment over time.

There are three basic reasons data grows so steadfastly in this environment:

Data is collected at the granular level.
Data is collected historically.
Data is collected and integrated from a wide variety of sources.

There are many consequences of the growth of data. One of the biggest consequences is that the cost of the data and the infrastructure to support it grows dramatically as well. Figure 21.15 illustrates that the cost of the infrastructure rises as a function of the volume of data that is managed.

And the cost does not just increase incrementally—it increases dramatically.

The cost of storage is an interesting factor in data warehouse processing. In the first 2 or 3 years of building and developing a data warehouse, the cost of storage is almost incidental. But as a data warehouse matures, the costs of other data warehouse features diminish, while the costs of storage increase. And it is not just the cost of storage, it is the cost of storage infrastructure that increases as well. There are the processor costs, the software license costs, and the channel costs. In addition, after the storage has been acquired and implemented, there is the ongoing operations cost. When the actual cost of storage is considered in light of these other factors, it is seen that the cost of actual storage is only a fraction of the total costs of storage.

Unfortunately, the increase in storage and storage infrastructure costs is inevitable. Once an organization gets to be locked into a way of processing, the organization continues down that path for a long, long time.

But organizations have a way of needing to manage their budget. A major increase in expenditures each year cannot continue indefinitely. It is only natural that organizations desire to find ways to manage their budget. Therefore, it is normal for them to turn to the data warehouse utility to help them manage their budget and their data warehouse environment.

The data warehouse utility is a means of storing and managing data for a fraction of the cost of traditional SAN storage.

images

FIGURE 21.15 The cost of the infrastructure.

images

FIGURE 21.16 The splitting of data.

Figure 21.16 depicts a data warehouse managed under a combination of traditional storage and a DWU.

This figure shows that part of the data is managed under traditional storage and another part is managed under a DWU. The net effect of such a split is that it significantly drives down the cost of operating a data warehouse on a day-to-day basis.

Depending on the particulars, the cost of operating a data warehouse may drop as much as an order of magnitude with the addition of a DWU. For example, suppose an organization has a 10-terabyte data warehouse and the operating budget is $10,000,000 annually. Now suppose the organization reduces its data warehouse storage requirements by half by adding a DWU. The annual operating costs look like:

Five terabytes traditional storage—$5,000,000
Five terabytes DWU storage—$500,000
Total operating costs—$5,500,000
Total savings—$4,500,000

The movement to DWU has cut the ongoing operating cost of storage significantly. Implementing a DWU is not as simple as plugging an electric cord into a wall socket. There are multiple strategies for the deployment of DWU technology. Each style of deployment has its advantages and disadvantages.

One approach to DWU deployment is to replace the traditional technology completely—the DWU is rolled in and the traditional technology is rolled out. The advantage of the replacement strategy is the immediate reduction in the cost of storage. There are also some disadvantages, including:

License disruption: The existing traditional environment cannot be simply discarded without consideration of existing contracts and licenses.
Software disruption: In some cases (not the case for Dataupia), some DWUs are not transparent to the controls needed to manage the storage processing. In this case the DWU cannot supplant the traditional storage without an upheaval.
Storage optimization: The DWU storage is not optimized for OLTP processing. In some cases, the DWU is optimized for data warehouse processing only.

Another approach is to supplant some traditional data warehouse storage incrementally.

The disadvantage of the incremental approach to supplanting traditional data warehouse data storage is that there must be transparency of the data. The DWU must be compatible with the operating system and DBMS that controls the traditional storage. If the DWU is not transparent to the operating system and DBMS of the traditional storage environment, then it must access and manage slices of the traditional environment that can be easily separated from the traditional environment. Unfortunately this incremental, separated approach is somewhat limited in its application.

The third approach to the implementation of DWU technology is the deployment of DWU technology with transparency. In this case, the transparency means that the user of the DBMS does not know where the data resides. The data can reside on the DWU or on traditional storage. The system simply does not care. The actual physical location and management of the data are transparent to the user and the DBMS. For the transparency approach to work, the software that controls the DWU must be compatible with both the operating systems and the DBMS that is used to manage traditional storage.

images

FIGURE 21.17 Data base transparency.

Figure 21.17 depicts the transparent approach.

SUMMARY

Data marts contain departmental data for the purpose of decision making. There are lots of reasons for the creation of a data mart, including

the cost of machine cycles is low;
the end user has control;
the performance of the DW 2.0 environment is enhanced.

When bad data enters the DW 2.0 environment

the source of the bad data should be identified and corrected;
a balancing entry can be created;
a value may be reset;
actual corrections can be made to the data.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.227.48.82