THE CAUSES OF POOR-QUALITY DATA

Poor-quality data can arise for a number of reasons, some technical and some human (although even the technical reasons can probably be traced back to some human error):

  • databases having inappropriate schemas;

  • errors being made on data entry;

  • data decaying over time;

  • data being corrupted when moved between systems;

  • lack of understanding of the data when it is used.

As discussed in Chapter 2, databases should be designed so that there is no unnecessary duplication of data, and so that they can cope with changes in requirements without major cost implications. Avoiding unnecessary duplication prevents update anomalies, which can lead to data inconsistency and hence to inaccurate information being presented to users. When a database has not been designed to cope flexibly with future data requirements, and redesigning it would be too costly for the enterprise, overall data quality can suffer because ‘work-arounds’ are developed that overload columns in the database: columns are used to hold data that they were not designed to hold. Often these ‘work-arounds’ are not properly documented, so knowledge about the meaning of the data is held only in the heads of a small number of users. Databases must therefore be designed with flexibility and data quality in mind, even if this is at the expense of performance.
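The update anomaly described above can be made concrete with a small sketch. The table and values below are hypothetical, not taken from the text: a customer's address is duplicated in every order row, so updating one copy leaves the database inconsistent.

```python
import sqlite3

# Hypothetical schema in which the customer's address is (wrongly)
# duplicated in every order row instead of being held once.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer TEXT,
    address  TEXT)""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "Smith", "1 High St"),
     (2, "Smith", "1 High St")])

# Update anomaly: only one copy of the duplicated address is changed...
conn.execute("UPDATE orders SET address = '2 Low Rd' WHERE order_id = 1")

# ...so the database now reports two conflicting addresses for one customer.
addresses = {row[0] for row in conn.execute(
    "SELECT DISTINCT address FROM orders WHERE customer = 'Smith'")}
```

A normalised design would hold the address once, in a separate customer table, so a single update could not create this inconsistency.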

There can be a number of reasons why errors are made on data entry; some errors are accidental and some are deliberate. Accidental data-entry errors, for example mistyping a date or a name, are the most common source of poor-quality data. The number of these accidental errors is sometimes increased because insufficient thought has been given to the way that data is entered. Spelling errors, for example, can be reduced by providing the user with a set of valid values from which to select an option. Another source of data-entry errors is a system designed so that values are required for some data before those values are actually available. Users then either enter fictional data so that they can complete the process or abandon the data entry until the values are available.
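Restricting input to a set of valid values, as suggested above, can be sketched as follows. The permitted set and the function name are illustrative assumptions, not part of any real system:

```python
# Assumed permitted set, standing in for a drop-down list of valid options.
VALID_TITLES = {"Mr", "Mrs", "Ms", "Dr"}

def validate_title(entered: str) -> str:
    """Normalise a free-typed title and reject anything outside the
    permitted set, so spelling variants cannot reach the database."""
    cleaned = entered.strip().capitalize()
    if cleaned not in VALID_TITLES:
        raise ValueError(f"{entered!r} is not a valid title")
    return cleaned
```

In a real interface the user would pick from the list directly; the validation function is the fallback for channels where free typing cannot be avoided.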

The value of some data decays over time. This is particularly true of databases supporting human resources operations. The qualifications held by employees are normally recorded when they are first employed but it is very seldom that a human resources department has procedures in place to ensure that this data is regularly checked and updated. An employee can, therefore, work hard to gain new qualifications and yet these are not recorded in the company’s information systems. A search to find employees with appropriate qualifications for a task may well miss the most appropriately qualified employee. In another environment, the stock figures in a retail store’s database are normally amended to take account of the arrival of new stock and of sales, but the stock figures can be inaccurate if they are not regularly adjusted to take account of pilfering and shoplifting.
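One simple procedure for catching decay of this kind is to record when each item was last verified and to flag records that have gone unchecked for too long. The field names and the yearly review interval below are assumptions for illustration:

```python
from datetime import date, timedelta

# Assumed policy: qualification records should be re-verified yearly.
REVIEW_INTERVAL = timedelta(days=365)

def stale_records(records, today):
    """Return the records whose data has not been verified within
    the review interval, and so may have decayed."""
    return [r for r in records
            if today - r["last_verified"] > REVIEW_INTERVAL]
```

Running such a report regularly gives the human resources department a concrete worklist instead of relying on employees to volunteer updates.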

Perfectly good data can be corrupted when it is moved between systems, for example, when extracted from operational systems to feed a data warehouse. This corruption is generally because the documentation of the feeder systems has not been kept up to date as they have been modified and, consequently, inappropriate transformations and cleaning procedures have been applied to the data.
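A defensive pattern against this kind of corruption is to run sanity checks on each transformed row before loading it, quarantining anything implausible rather than loading it silently. The function, field name, and plausibility bound below are hypothetical:

```python
def load_with_checks(rows, transform, checks):
    """Apply a transformation to extracted rows, quarantining any row
    that fails a sanity check instead of loading corrupt data silently."""
    loaded, rejected = [], []
    for row in rows:
        out = transform(row)
        if all(check(out) for check in checks):
            loaded.append(out)
        else:
            rejected.append(row)
    return loaded, rejected

# Assumed sanity check: no single retail sale should exceed 10,000.
# If a feeder system silently switched from pounds to pence, the
# out-of-date transformation would produce values that fail this check.
plausible = lambda r: 0 < r["amount"] < 10_000
```

Rejected rows can then be inspected by hand, which is often how an undocumented change to a feeder system first comes to light.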

Finally, there is a danger that data may not be understood when it is presented to users. This is normally caused by the documentation being out of date or metadata being missing or ambiguous.
