THE IDENTIFICATION OF CHANGES TO DATA

Causality

It would be very helpful if, during the analysis of the kinds of changes that can occur, it is made clear as to whether the changes are causal in nature. It would be sufficient to identify causal changes only and to assume that all unidentified changes are noncausal. This will provide assistance to the designers. Some changes, as has been previously indicated, can occur to attributes that have the property of false retrospection but that, due to the fact that they are determinants, have a “knock-on” effect on other attributes that might have the property of true retrospection. The capture of changes has to be developed into an automated process. Some mechanism is created that enables changes that have occurred to be identified by examining the organization's operational systems as these are, invariably, the source of data for the data warehouse. The source system will not share the data warehouse data model and will not be aware of the effect of changes. For instance, in the Wine Club, there is a relationship between the address of a customer and the sales area that contains the address. So it could be said that the address determines the sales area. This determinant relationship is identical to that used in the process of normalization. However, the purpose here is quite different and has more to do with the synchronization of the timing of changes to attribute values, to ensure temporal consistency, than the normalization of relations. Thus the term causality has been adopted in order to distinguish this requirement as it is unique to data warehousing. The operational system that records customers' details may not be aware of the sales area hierarchy. When a customer moves, the fact that a change in sales area might have occurred would not normally be apparent. It becomes the responsibility of the data warehouse designer to manage this problem.

Generally, there is little recognition of the fact that logically connected data may need to be extracted from different data files which, in turn, might belong to various operational systems. That there may be a need to implement a physical link between data sources due to the causal nature of the relationship is also not recognized. This stitching together of data from various sources is very important in data warehousing. Apart from operational systems, the sources can also be external to the organization. For instance, an external attribute relating to a customer's economic classification might be added to the customer's record. This is a good example of causality. What is the trigger that generates a change in the economic classification when a change in the customer's circumstances is implemented so that temporal consistency is maintained?

Without such a facility, the data warehouse may be recording inconsistent information. If a customer's address changes, then the sales area code must be checked and updated, if necessary, at the same time as the address change. Where the data relating to addresses and sales areas is derived from different source systems, the temporal synchronization of these changes may be difficult to implement. If temporal synchronization is not achieved, then any subsequent query involving the history of these attributes may produce inaccurate results.

The main point is to recognize the problem and ensure that the causal nature of changes is covered during the requirements analysis.

The Frequency of Capture of Changes

Associated with identification of changes is the timing with which changes to data are captured into the data warehouse. In this respect, the behavior of “behavioral data” is different from the behavior of “circumstances.” The frequency of capture for the fact data is usually as close as possible to the granularity of the valid time event. For instance, in the Wine Club example, the granularity of time of a sale is “day” and the sales are captured into the data warehouse on a daily basis. There are exceptions, such as telecommunications where the granularity of time is “seconds” but the frequency of capture is, typically, still daily. Nevertheless, the time assigned to the fact can usually be regarded as the valid time.

The granularity of time for recording changes to the dimensions adopts an appearance that is often misleading. The most frequently used method for identifying changes to dimensions is by use of the file comparison approach, as outlined in the previous chapter. The only time that can be used to determine when the change occurred will be the time that the change was detected (i.e., the time the file comparison process was executed). The time recorded on the dimensional record can be at any level of grain, for example, day. In this example, the granularity of time for the changed data capture appears to be daily because the date that the change was captured will be used to record the change.

However, this is a record of the transaction time so far as the data warehouse is concerned. It is not actually a record of the transaction time that the change was recorded in the originating source system. The granularity is related to the frequency that the changed data is captured. If the changes are detected and captured into the data warehouse on a monthly basis, then the transaction time frequency should be recorded as monthly.

In practical situations, different parts of the model are usually updated at differing frequencies of time. Some changed data is captured daily while others are weekly and, still others, monthly. The frequency of capture is often dependent on the processing cycle of the source systems.

As with the previous section on causality, the valid time and transaction time should be the same, if possible. Where such synchronization is not possible, the difference between the two times should be recorded so that the potential error can be estimated. Our modeling method should provide a means of capturing the true granularity of time on a per attribute basis.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.12.166.61