INTRODUCING DATA QUALITY RULES

Data Quality Rules are central to this method.

  • They keep ownership firmly in the hands of the business (Golden Rule 1).

  • They expose to the project the knowledge hidden in the business (Golden Rule 2).

  • The process of deriving them gets the business to address the issue of the limitations of time and resources (Golden Rule 3).

  • They form the contract between the business and the technicians as to what constitutes quality data and how to go about securing it (Golden Rule 4).

On a well-run data migration project you will spend far more time, effort and resource on Data Quality Rules than on any other activity.

So what are Data Quality Rules?

Data Quality Rules

Data Quality Rules are a statement of the metrics that will be used to measure the quality of the data for each of the data sets under consideration, either at Legacy Data Store or Key Business Data Area level, and the set of steps that will bring current data to the level where these metrics are met.


Hint

If I were to be pedantic I would define the Data Quality Rules as ‘the statement of metrics that will be used to measure the quality of the data sets under consideration’. As you will see in the paragraphs on what Data Quality Rules documents should contain, there is a ‘Method Statements’ section for the ‘set of steps that will bring current data to the level where these metrics are met’. However, by referring to the whole document as the ‘Data Quality Rule’ I reinforce in the Stakeholders’ minds that a quality statement is not complete without a metric and a method (or mitigation). I then go one step further and shorten Data Quality Rule to DQR. Although this is one of those dreaded Three Letter Acronyms (TLAs), it has the benefit of not having to be changed to make it plural.

Make ‘DQR’ part of the vocabulary of your data migration project. You will then not have Data Stakeholders presenting you with ‘data problems’. They will be requesting additional DQR and so will have accepted the necessity to create metrics and method statements, schedule resources etc. They will have become part of the solution.
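
Although a DQR is a business document rather than a technical artefact, it can help to picture what one records. The Python sketch below is purely illustrative: the structures and field names are my own assumptions, not a prescribed template, but they capture the point that a DQR is not complete without a metric, an acceptance threshold and its Method Statements.

# A minimal sketch of what a DQR record might capture. The structures and
# field names are illustrative assumptions, not a prescribed format.
from dataclasses import dataclass, field
from typing import List

@dataclass
class MethodStatement:
    """One step that moves current data towards the agreed metric."""
    description: str    # e.g. 'deduplicate customer records in the CRM extract'
    owner: str          # the Data Stakeholder accountable for the step
    target_date: str    # when the step is due to complete

@dataclass
class DataQualityRule:
    """A quality statement is not complete without a metric and a method."""
    dqr_id: str                          # e.g. 'DQR-014'
    data_stores: List[str]               # the Legacy Data Stores the rule addresses
    quality_statement: str               # what 'quality data' means for this data set
    metric: str                          # how the statement will be measured
    acceptance_threshold: float          # e.g. 0.98 means 98 per cent of records must pass
    method_statements: List[MethodStatement] = field(default_factory=list)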


They are used at a number of points in a data migration project.

They are used in the first stage of data preparation, where they form the basis of subsequent data cleansing and data preparation activities.

A second set is developed later in data preparation, when the new system data design is included and the data prepared in Stage 1 is further enhanced to meet the criteria of the new system.

How Are Data Quality Rules Created?

The most successful way of generating a Data Quality Rule is to invite the Data Stakeholders to a series of facilitated meetings and thrash out the detail face to face (see ‘Generating Data Quality Rules’ on page 71).

Hint

It is also possible to use the emerging data profiling tools to seek out possible data quality issues. However, keep Golden Rules 1 and 2 in mind. Although this is a good way of creating a ‘straw man’ to present to your key Data Stakeholders, do not presume that this can tell you the full story.


Use of corporate data models to form baseline

If you are fortunate enough to have the assistance of a Corporate Data Architect, you can use their data models and modelling expertise as the baseline from which to analyse the divergence of Legacy Data Stores that will form the starting point for Data Quality Rules. However, do not discount Legacy Data Stores that do not conform to the corporate model, but use the information sensitively in your Data Quality Rules workshops.

Hint

A key aspect to remember in conversations with the information resource function is that we need to know which data model is being presented to us. This book recommends a two-pass approach to data preparation. In the first we align/measure the difference between the Legacy Data Stores and the legacy data model. In the second we align the legacy data model to the new system data model. Often, the corporate data modellers will have already made this cognitive leap before we arrive on the project and have to be pulled back from an over-enthusiastic, premature rush into moulding legacy data into the new data structure shape.


Data Quality Rules and Legacy Data Stores

It is possible for one Data Quality Rule to address more than one data store and it is possible for a data store to have many Data Quality Rules written for it. However, it is not possible for a Data Quality Rule to address no data stores. Even where the rule is written for a pure data gathering exercise, say to fill a data gap between the Legacy Data Stores and the new system, there will still be a Transient Data Store in the middle to hold new data prior to transmission. The new system itself may even be the data store if the missing data is to be keyed straight in.
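
If the DQR register is held in a structured form, this constraint is simple to police. The snippet below is a minimal sketch only, reusing the illustrative DataQualityRule structure from the earlier example: it builds the data store view of the many-to-many relationship and rejects any rule that addresses no data store at all.

from collections import defaultdict

# A minimal sketch, reusing the illustrative DataQualityRule structure above.
# A DQR may address many data stores, and a data store may have many DQRs,
# but no DQR may address none: even a pure data-gathering rule points at a
# Transient Data Store (or at the new system itself).
def index_rules_by_store(rules):
    """Build the data store to DQRs view of the many-to-many relationship."""
    by_store = defaultdict(list)
    for rule in rules:
        if not rule.data_stores:
            raise ValueError(f"{rule.dqr_id} addresses no data store at all")
        for store in rule.data_stores:
            by_store[store].append(rule.dqr_id)
    return by_store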

First-cut Data Quality Rules

First-cut Data Quality Rules are designed to provide reassurance that legacy data is internally consistent. There is no mapping to the new system, but legacy data is audited to ensure that it will be fit for loading. Any known problems that would inhibit data loading are resolved.

The quality of the Legacy Data Stores, and how rigorously corporate data management techniques have been applied, will partially determine the amount of remedial work that needs to be done. With high quality, well maintained Legacy Data Stores this phase can be restricted to a simple audit, but due diligence dictates that at least one Data Quality Rule be raised so that what is meant by ‘quality data’ can be rigorously defined and tested. Reassurance that there are no data quality issues is cheaper than uncovering data quality issues once the migration code has been written and time is running short.

Anecdote

Whenever I get involved with ‘failing’ migration projects I nearly always find that first-cut Data Quality Rules were skimped or are missing altogether. The temptation to trust the corporate Legacy Data Stores is great: after all, these are systems that have been running the company. There are also the issues discussed above of the distance Data Store Owners may have from the day-to-day operation of corporate systems. Resist the temptation to skimp. If nothing else, see it as a dry run for the more complex second-cut Data Quality Rules and use it to build the virtual team you will need to complete the task. However, I guarantee that data deficiencies which existing work-arounds have covered up will come out of the woodwork.


A more common situation is to be confronted with a mixed bag of systems, of varying quality, which may or may not conform to any corporate standard. Within this disparate bunch of systems there will often be local inconsistencies. It is not uncommon for major corporations to rely, unwittingly, on locally derived data stores, created within the user community, often in spreadsheet format. They may be incompatible with any known corporate standard.

Anecdote

Try to avoid the ‘King Canute’ mentality to unofficial data stores. I have worked in more than one migration where the official policy of only using the designated corporate systems has led to better quality local data stores being deliberately overlooked. This led to poorer quality data being migrated, user dissatisfaction and the absolute certainty in my mind that those unofficial Legacy Data Stores would be up and running again within days of the new system going live. Once again it is an example of the wrong Data Stakeholders driving the process.


In the first-cut Data Quality Rules we emphasize metrics gathering and creating local consistency. We need to be able to answer the question of what is meant by suitable data quality in a manner that accords with Golden Rules 3 and 4. We need to update the Legacy Data Store definition forms with statements of quality that are backed by clear measurements. Where the data sets fall short of a Data Quality Rule, we need the Data Stakeholders either to define the steps that will bring the data measurably up to the standard of the rule, to reduce the threshold of acceptance, or to drop the rule altogether.
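
What ‘backed by clear measurements’ can look like in practice is sketched below. It is an illustration only: the rule (every customer record must carry a postcode), the sample records and the 98 per cent threshold are all hypothetical, but the shape is typical of a first-cut metric, namely a simple pass rate compared with the threshold the Data Stakeholders have agreed.

# A sketch only: one first-cut DQR measured as a simple pass rate and compared
# with the agreed acceptance threshold. The rule (every customer record must
# carry a postcode), the records and the threshold are hypothetical.
def pass_rate(records, rule_predicate):
    """Fraction of records that satisfy the rule's predicate."""
    if not records:
        return 0.0
    return sum(1 for record in records if rule_predicate(record)) / len(records)

customers = [
    {"id": 1, "postcode": "SW1A 1AA"},
    {"id": 2, "postcode": ""},          # fails the hypothetical rule
    {"id": 3, "postcode": "M1 1AE"},
]

def has_postcode(record):
    return bool(record.get("postcode", "").strip())

measured = pass_rate(customers, has_postcode)
threshold = 0.98   # the acceptance threshold recorded in the DQR

# Where the measurement falls short, the Data Stakeholders choose: add method
# statements to remedy the data, lower the threshold, or drop the rule.
print(f"measured {measured:.0%} against a threshold of {threshold:.0%}")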

Second-cut Data Quality Rules

It is through the second-cut Data Quality Rules that the new system’s data structures are introduced, based on the known data quality established by the first-cut Data Quality Rules. The Key Business Data Area will have been brought to a standard level of quality via the first-cut Data Quality Rules; the second-cut Data Quality Rules take these standardized data sets and derive the rules that will allow them to be mapped onto the new system’s requirements. This is more than a data mapping exercise, although a set of extract, transform and load definitions will be delivered as an output. This step works through, with the business, the question of which, amongst multiple candidates, is the most appropriate data source for each data item. We decide how data structures can be amended, to the satisfaction of the Data Store Owners, to fit the new requirement. Finally, we agree how data that may never have been gathered before, but is necessary for the new system, can be generated. This is the application of the new system data requirements to the knowledge we now have of the Legacy Data Stores, and the creation of the steps that will get us from the old to the new. The whole process is led and owned by the business areas affected.
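
One way to picture the ‘most appropriate data source’ decision is as an agreed precedence list per target field, recorded alongside the mapping. The sketch below is purely illustrative: the store names, field names and precedence order are hypothetical, and the decision itself is made and owned by the business, but it shows how such an agreement can be captured and then applied mechanically at extract time.

# Purely illustrative: a per-field precedence list, agreed with the business,
# used to pick the most appropriate legacy source for each new system field.
# Store names, field names and the precedence order are all hypothetical.
SOURCE_PRECEDENCE = {
    "customer_name": ["CRM", "BILLING", "LOCAL_SPREADSHEET"],
    "contact_email": ["LOCAL_SPREADSHEET", "CRM"],   # local store agreed to be more current
}

def select_value(target_field, candidate_values):
    """Return the value from the highest-precedence store that actually holds one."""
    for store in SOURCE_PRECEDENCE.get(target_field, []):
        value = candidate_values.get(store)
        if value:
            return store, value
    return None, None    # a data gap: the DQR must say how it will be filled

# Example: the same customer as seen in three legacy stores.
store, value = select_value(
    "contact_email",
    {"CRM": "j.smith@example.com", "LOCAL_SPREADSHEET": "", "BILLING": None},
)
print(store, value)   # falls back to CRM because the local value is blank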

Iteration

For both first and second-cut Data Quality Rules there may be more than one iteration through the Data Quality Rules process. Typically in first-cut Data Quality Rules the first iteration is one of establishing a baseline of rules and measuring the Legacy Data Stores against those rules. The second iteration takes the findings of the first and records data cleansing activities to solve the issues uncovered.

Hint

Although Data Quality Rules are iterative, with one Data Quality Rule revealing issues that can only be dealt with by another Data Quality Rule, multiple iterations of the Data Quality Rules process are to be avoided. Better quality products in the earlier stages of this method are key to reducing the number of cycles. The better the identification and initial analysis of the Legacy Data Stores, the fewer additional data stores will emerge downstream. The better the identification of Data Stakeholders, the better the business knowledge that is shared, which in turn reduces the number of data quality surprises.


Types of Data Quality Rules

There are different types of Data Quality Rules and therefore different types of possible data migration failure:

  • Internal consistency: The commonest type of rule drawn up by technicians for data loading. This covers all the standard data load validation criteria — range checking, data type checking, referential integrity checking, reasonableness checking, etc. These check one Legacy Data Store against its own rules (see the sketch after this list).

  • External consistency: This extends internal consistency checks of a Legacy Data Store with the wider system environment. It checks the Legacy Data Stores against the Key Business Data Area rules in first-cut Data Quality Rules and against the new system requirements in second-cut Data Quality Rules.

  • Reality Check: This is the sort of checking that technicians typically do not attempt, because the answers lie outside the data held in computer systems, in the reality of the business world. But it is an issue the project must address, because it causes friction between the business and the project when a new system starts to run, either for real or in parallel. Just because all the internal consistency rules are met does not mean that a data item corresponds with a genuine piece of business reality. This reinforces the need to keep Golden Rule 1 at the forefront of any migration project.
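
To make the internal consistency category concrete, the sketch below applies the checks listed above (range, data type, referential integrity and reasonableness) to a hypothetical meter-reading record from a single Legacy Data Store. The record layout, the list of valid accounts and the limits are invented for illustration; they are not part of the method.

from datetime import date

# An illustration only: internal consistency checks of the kind listed above
# (range, data type, referential integrity, reasonableness) applied to a
# hypothetical meter-reading record from a single Legacy Data Store.
VALID_ACCOUNT_IDS = {"A001", "A002", "A003"}   # stand-in for the parent account table

def internal_consistency_errors(reading):
    """Return a list of rule breaches for one record; an empty list means it passes."""
    errors = []
    # Data type check
    if not isinstance(reading.get("units_used"), (int, float)):
        errors.append("units_used is not numeric")
    # Range check
    elif not 0 <= reading["units_used"] <= 100_000:
        errors.append("units_used is outside the permitted range")
    # Referential integrity check
    if reading.get("account_id") not in VALID_ACCOUNT_IDS:
        errors.append("account_id has no matching account record")
    # Reasonableness check
    if reading.get("reading_date") and reading["reading_date"] > date.today():
        errors.append("reading is dated in the future")
    return errors

print(internal_consistency_errors(
    {"account_id": "A999", "units_used": -5, "reading_date": date(2099, 1, 1)}
))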

Anecdote

I have seen migration data sets where whole hotels have been duplicated, or where lengths of pipeline that I can see out of the window of the office do not exist in the appropriate data set. Often the Reality Check data error hints at the existence of a data set we have yet to uncover — otherwise how did the business previously function? Asking this question often drives out the missing data set. When it does, invoke the change control procedure, create a new Legacy Data Store definition form, and bring the newly discovered data store into consideration.


I prefer a two-pass approach like this because:

  • Out of the first-cut Data Quality Rules we get a range of Key Business Data Areas that contain Legacy Data Stores that, to a known degree, are internally consistent, consistent with the model of the Key Business Data Area and correspond with the real world.

  • In the second-cut Data Quality Rules we introduce the data model of the new system. We can now approach the question of how we migrate from the Legacy Data Store to the new system confident that we are basing our judgements on known data qualities (and weaknesses — remember Golden Rule 3).
