Chapter 5. Data Quality and MDM

5.1. Introduction

One of the motivating factors for instituting an MDM program is the desire for consistency and accuracy of enterprise data. In fact, many MDM activities have evolved out of data cleansing processes needed for data warehousing or data migrations. The ability to use data parsing, standardization, and matching enabled the development of an “entity index” for people or products, which then was used for ongoing identity resolution and the elimination of duplicate entries. The realization that the entity index itself represented a valuable information resource was a precursor to the development of the master data environment altogether and the corresponding services supporting that master data environment.
When evaluating both the primary business objectives and the operational requirements for evolving toward an environment that relies on master data, there is going to be an underlying expectation that the instantiation of an MDM program necessarily implies improved data quality across the board. The reality is not so straightforward. By necessity, there will be some fundamental questions about the following:
▪ The definition of data quality
▪ The specification of the business data quality expectations
▪ Ways of monitoring those expectations to ensure that they are met in an acceptable manner
At a conceptual layer, these questions center on the trustworthiness of the data as it is extracted from the numerous sources and consolidated into the master repository. However, at the operational level, these questions begin to delve much deeper into the core definitions, perceptions, formats, and representations of the data elements that comprise the model for each master data object. The challenges of monitoring and ensuring data quality within the MDM environment become associated with identifying critical data elements, determining which data elements constitute master data, locating and isolating master data objects that exist within the enterprise, and reviewing and resolving the variances between the different representations in order to consolidate instances into a single view. Even after the initial integration of data into a master data environment, there will still be a need to instantiate data inspection, monitoring, and controls to identify any potential data quality issues and prevent any material business impacts from occurring.
Data assessment, parsing, standardization, identity resolution, enterprise integration—all of these aspects of consolidation rely on data quality tools and techniques successfully employed over time for both operational and analytical (read: data warehousing) purposes. In this chapter, we will look at how the distribution of information leads to inconsistency and then explore the data quality techniques needed for MDM and how data quality tools meet those needs.

5.2. Distribution, Diffusion, and Metadata

Because of the ways that diffused application architectures have evolved across different project teams and lines of business, it is likely that although only a relatively small number of core master objects (or more likely, object types) are used, there are going to be many different ways that these objects are named, modeled, represented, and stored. For example, any application that must manage contact information for individual customers will rely on a data model that maintains the customer's name, but one application may maintain the individual's full name, whereas others might break up the name into its first, middle, and last parts. Conceptually, these persistent models are storing the same content, but the slightest variance in representation prevents most computer systems from recognizing the similarity between record instances.
Even when different models are generally similar in structure, there might still be naming differences that would confuse most applications. For example, suppose the unique identifier assigned to a customer is defined as a 10-character numeric string, padded on the left with zeros, and that format is used consistently across different applications. Yet even when the format and rules are identical, different names, such as CUST_NUM versus CUSTOMER_NUMBER versus CUST_ID, may still complicate the use of the identifier as a foreign key among the different data sets. Alternatively, the same data element concept may be represented with slight variations, manifested in data types and lengths. What is represented as a numeric field in one table may be alphanumeric in another, or similarly named attributes may be of slightly different lengths.
Ultimately, to consolidate data with a high degree of confidence in its quality, the processes must have an aligned view of what data objects are to be consolidated and what those objects look like in their individual instantiations. Therefore, an important process supporting the data quality objectives of an MDM program involves collecting the information to populate a metadata inventory. Not only is this metadata inventory a resource to be used for identification of key data entities and critical data elements, it also is used to standardize the definitions of data elements and connect those definitions to authoritative sources, harmonizing the variances between data element representations as well as identifying master data objects and sources. Collecting and managing the various kinds of master metadata is discussed in Chapter 6.
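To make the idea of a metadata inventory concrete, the following minimal Python sketch groups variant physical data elements under a shared logical concept so that differences in names, types, and lengths can be reviewed side by side. The source, column, and concept names are hypothetical and are not drawn from any particular system or tool.

```python
from collections import defaultdict

# A toy metadata inventory: each physical column is registered with its source,
# its declared type, and the logical concept it is believed to represent.
# All entries here are hypothetical examples.
inventory = [
    {"source": "orders_db",  "column": "CUST_NUM",        "type": "CHAR(10)",    "concept": "customer identifier"},
    {"source": "crm_db",     "column": "CUSTOMER_NUMBER", "type": "CHAR(10)",    "concept": "customer identifier"},
    {"source": "billing_db", "column": "CUST_ID",         "type": "NUMERIC(10)", "concept": "customer identifier"},
    {"source": "crm_db",     "column": "FULL_NAME",       "type": "VARCHAR(80)", "concept": "customer name"},
]

# Group the physical representations by logical concept to expose the variances
# (different names, types, and lengths) that must be harmonized before consolidation.
by_concept = defaultdict(list)
for entry in inventory:
    by_concept[entry["concept"]].append((entry["source"], entry["column"], entry["type"]))

for concept, variants in by_concept.items():
    print(concept)
    for source, column, col_type in variants:
        print(f"  {source}.{column}: {col_type}")
```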

5.3. Dimensions of Data Quality

We must have some yardstick for measuring the quality of master data. Similar to the way that data quality expectations for operational or analytical data silos are specified, master data quality expectations are organized within defined data quality dimensions to simplify their specification and measurement/validation. This provides an underlying structure to support the expression of data quality expectations that can be reflected as rules employed within a system for validation and monitoring. By using data quality tools, data stewards can define minimum thresholds for meeting business expectations and use those thresholds to monitor data validity with respect to those expectations, which then feeds into the analysis and ultimate elimination of root causes of data issues whenever feasible.
Data quality dimensions are aligned with the business processes to be measured, such as measuring the quality of data associated with data element values or presentation of master data objects. The dimensions associated with data values and data presentation lend themselves well to system automation, making them suitable for employing data rules within the data quality tools used for data validation. These dimensions include (but are not limited to) the following:
▪ Uniqueness
▪ Accuracy
▪ Consistency
▪ Completeness
▪ Timeliness
▪ Currency

5.3.1. Uniqueness

Uniqueness refers to requirements that entities modeled within the master environment are captured, represented, and referenced uniquely within the relevant application architectures. Asserting uniqueness of the entities within a data set implies that no entity logically exists more than once within the MDM environment and that there is a key that can be used to uniquely access each entity (and only that specific entity) within the data set. For example, in a master product table, each product must appear once and be assigned a unique identifier that represents that product across the client applications.
The dimension of uniqueness is characterized by stating that no entity exists more than once within the data set. When there is an expectation of uniqueness, data instances should not be created if there is an existing record for that entity. This dimension can be monitored in two ways. As a static assessment, it implies applying duplicate analysis to the data set to determine whether duplicate records exist, and as an ongoing monitoring process, it implies providing an identity matching and resolution service inlined within the component services supporting record creation to locate exact or potential matching records.
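As a minimal sketch of the static assessment just described, the following counts occurrences of a candidate identifying key and reports any key that appears more than once; the records and field names are hypothetical. An ongoing monitoring process would instead invoke a matching and resolution service at record creation time.

```python
from collections import Counter

# Hypothetical product records drawn from a consolidated extract.
products = [
    {"product_id": "P-0001", "description": "Widget, small"},
    {"product_id": "P-0002", "description": "Widget, large"},
    {"product_id": "P-0001", "description": "Small widget"},   # same key, variant description
]

# Static duplicate analysis: count occurrences of the identifying key and
# flag any key assigned to more than one record instance.
key_counts = Counter(record["product_id"] for record in products)
duplicates = {key: count for key, count in key_counts.items() if count > 1}

print("Duplicate identifiers:", duplicates)   # {'P-0001': 2}
```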

5.3.2. Accuracy

Data accuracy refers to the degree to which data correctly represent the “real-life” objects they are intended to model. In many cases, accuracy is measured by how the values agree with an identified source of correct information (such as reference data). There are different sources of correct information: a database of record, a similar corroborative set of data values from another table, dynamically computed values, or perhaps the result of a manual process. Accuracy is actually quite challenging to monitor, not just because one requires a secondary source for corroboration, but because real-world information may change over time. If corroborative data are available as a reference data set, an automated process can be put in place to verify the accuracy, but if not, a manual process may be instituted to contact existing sources of truth to verify value accuracy. The amount of effort expended on manual verification is dependent on the degree of accuracy necessary to meet business expectations.
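Where a corroborating reference data set is available, an automated accuracy check can be as simple as the following sketch; the reference values and customer records are illustrative placeholders rather than real reference data.

```python
# Hypothetical reference data set of record: the "correct" postal codes on file
# for each customer, perhaps sourced from a validated database of record.
reference_postal_codes = {"C001": "21093", "C002": "10017"}

# Candidate records whose accuracy is being assessed.
candidates = [
    {"customer_id": "C001", "postal_code": "21093"},
    {"customer_id": "C002", "postal_code": "10016"},   # disagrees with the reference
]

# Accuracy check: a value is flagged when it disagrees with the corroborating source.
for record in candidates:
    expected = reference_postal_codes.get(record["customer_id"])
    if expected is not None and record["postal_code"] != expected:
        print(f"{record['customer_id']}: {record['postal_code']} does not match reference {expected}")
```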

5.3.3. Consistency

Consistency refers to data values in one data set being consistent with values in another data set. A strict definition of consistency specifies that two data values drawn from separate data sets must not conflict with each other. Note that consistency does not necessarily imply correctness.
The notion of consistency with a set of predefined constraints can be even more complicated. More formal consistency constraints can be encapsulated as a set of rules that specify consistency relationships between values of attributes, either across a record or message, or along all values of a single attribute. However, there are many ways that process errors may be replicated across different platforms, sometimes leading to data values that may be consistent even though they may not be correct.
An example of a consistency rule verifies that, within a corporate hierarchy structure, the sum of the numbers of customers assigned to the individual customer representatives does not exceed the number of customers for the entire corporation.
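A minimal sketch of that rule might look like the following, assuming hypothetical counts for each representative and an independently maintained corporate-level total.

```python
# Hypothetical hierarchy: customers assigned to each representative within one corporation,
# plus the independently maintained corporate-level customer count.
customers_per_rep = {"rep_01": 120, "rep_02": 85, "rep_03": 40}
corporate_customer_count = 230

# Consistency rule: the sum across representatives should not exceed the corporate total.
assigned_total = sum(customers_per_rep.values())
if assigned_total > corporate_customer_count:
    print(f"Inconsistent: {assigned_total} assigned vs. {corporate_customer_count} at the corporate level")
else:
    print("Consistent with the corporate-level count")
```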
Consistency Contexts
▪ Between one set of attribute values and another attribute set within the same record (record-level consistency)
▪ Between one set of attribute values and another attribute set in different records (cross-record consistency)
▪ Between one set of attribute values and the same attribute set within the same record at different points in time (temporal consistency)
▪ Across data values or data elements used in different lines of business or in different applications
▪ Consistency may also take into account the concept of “reasonableness,” in which some range of acceptability is imposed on the values of a set of attributes.

5.3.4. Completeness

The concept of completeness implies the existence of non-null values assigned to specific data elements. Completeness can be characterized in one of three ways. The first is asserting mandatory value assignment—the data element must have a value. The second expresses value optionality, essentially only forcing the data element to have (or not have) a value under specific conditions. The third is in terms of data element values that are inapplicable, such as providing a “waist size” for a hat.
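The three characterizations can be expressed as simple rules, as in the following sketch; the element names, the “pants”/“hat” categories, and the conditions are hypothetical examples rather than a prescribed rule set.

```python
# Hypothetical completeness rules for a product record:
#  - mandatory: description must always be populated
#  - conditional: waist_size is required only for products in the "pants" category
#  - inapplicable: waist_size must be absent for hats
def completeness_issues(record):
    issues = []
    if not record.get("description"):
        issues.append("description is mandatory")
    if record.get("category") == "pants" and record.get("waist_size") is None:
        issues.append("waist_size is required for pants")
    if record.get("category") == "hat" and record.get("waist_size") is not None:
        issues.append("waist_size is inapplicable for hats")
    return issues

print(completeness_issues({"category": "hat", "description": "Fedora", "waist_size": 34}))
print(completeness_issues({"category": "pants", "description": "Chinos"}))
```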

5.3.5. Timeliness

Timeliness refers to the time expectation for accessibility and availability of information. Timeliness can be measured as the time between when information is expected and when it is readily available for use. In the MDM environment, this concept is of particular interest, because synchronizing application data updates with the centralized resource supports the concept of a common, shared, unique representation. The success of business applications relying on master data depends on consistent and timely information. Therefore, service levels specifying how quickly the data must be propagated through the centralized repository should be defined so that compliance with those timeliness constraints can be measured.
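A timeliness measurement then reduces to comparing an observed propagation delay against the agreed service level, as in this sketch; the 15-minute threshold and the timestamps are assumed values for illustration only.

```python
from datetime import datetime, timedelta

# Hypothetical service level: updates must reach the master repository within 15 minutes.
propagation_sla = timedelta(minutes=15)

# Timestamps captured when an update was made in the source application and when the
# corresponding change became visible in the master repository.
updated_in_source = datetime(2009, 3, 2, 10, 0, 0)
visible_in_master = datetime(2009, 3, 2, 10, 22, 0)

# Timeliness measurement: compare the observed propagation delay with the agreed threshold.
delay = visible_in_master - updated_in_source
print("Propagation delay:", delay, "- within SLA" if delay <= propagation_sla else "- SLA violated")
```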

5.3.6. Currency

Currency refers to the degree to which information is up to date with the world that it models and whether it is correct despite possible time-related changes. Currency may be measured as a function of the expected frequency rate at which the master data elements are expected to be updated, as well as verifying that the data are up to date, which potentially requires both automated and manual processes. Currency rules may be defined to assert the “lifetime” of a data value before it needs to be checked and possibly refreshed. For example, one might assert that the telephone number for each vendor must be current, indicating a requirement to maintain the most recent values associated with the vendor's contact data. An investment may be made in manual verification of currency, as with validating accuracy. However, a more reasonable approach might be to adjust business processes so that the information is verified during transactions with counterparties, at which point the current values of the data can be established.
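A currency rule of this kind can be checked by comparing the age of the last verification against an assumed “lifetime,” as in the following sketch; the 180-day limit and the vendor records are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical currency rule: a vendor's telephone number must have been verified
# within the last 180 days before it is considered current.
max_age = timedelta(days=180)
today = date(2009, 6, 1)

vendors = [
    {"vendor": "Acme Supply", "phone_verified_on": date(2009, 3, 15)},
    {"vendor": "Widget Corp", "phone_verified_on": date(2008, 9, 1)},
]

# Flag values whose "lifetime" has expired and that need to be rechecked or refreshed.
for vendor in vendors:
    if today - vendor["phone_verified_on"] > max_age:
        print(f"{vendor['vendor']}: telephone number needs re-verification")
```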

5.3.7. Format Compliance

Every modeled object has a set of rules bounding its representation, and format compliance refers to whether data element values are stored, exchanged, and presented in a format that is consistent with the object's value domain, as well as consistent with similar attribute values. Each column has metadata associated with it: its data type, precision, format patterns, use of a predefined enumeration of values, domain ranges, underlying storage formats, and so on. Parsing and standardization tools can be used to validate data values against defined formats and patterns to monitor adherence to format specifications.
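For simple cases, format compliance can be validated with pattern matching, as in this sketch, which reuses the 10-digit, zero-padded customer identifier format mentioned earlier in the chapter as an assumed rule.

```python
import re

# Hypothetical format rule: the customer identifier is a 10-character numeric string,
# left-padded with zeros.
customer_id_pattern = re.compile(r"^\d{10}$")

for value in ["0000012345", "12345", "00000A2345"]:
    status = "conforms" if customer_id_pattern.match(value) else "violates the format rule"
    print(f"{value}: {status}")
```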

5.3.8. Referential Integrity

Assigning unique identifiers to those data entities that ultimately are managed as master data objects (such as customers or products) within the master environment simplifies the data's management. However, the need to index every item using a unique identifier introduces new expectations any time that identifier is used as a foreign key across different data applications. There is a need to verify that every assigned identifier is actually assigned to an entity existing within the environment. Conversely, for any “localized” data entity that is assigned a master identifier, there must be an assurance that the master entity matches that identifier. More formally, this is referred to as referential integrity. Rules associated with referential integrity often are manifested as constraints against duplication (to ensure that each entity is represented once, and only once) and reference integrity rules, which assert that all values used for all keys actually refer back to an existing master record.
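A basic referential integrity check verifies that every foreign key value refers back to an existing master record, along the lines of the following sketch with hypothetical product identifiers and order records.

```python
# Hypothetical master product identifiers and application records that reference them.
master_product_ids = {"P-0001", "P-0002", "P-0003"}

order_lines = [
    {"order": "O-100", "product_id": "P-0002"},
    {"order": "O-101", "product_id": "P-0009"},   # refers to a nonexistent master record
]

# Referential integrity check: every foreign key must refer back to an existing master record.
orphans = [line for line in order_lines if line["product_id"] not in master_product_ids]
for line in orphans:
    print(f"Order {line['order']} references unknown master product {line['product_id']}")
```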

5.4. Employing Data Quality and Data Integration Tools

Data quality and data integration tools have evolved from simple standardization and pattern matching into suites of tools for complex automation of data analysis, standardization, matching, and aggregation. For example, data profiling has matured from a simplistic distribution analysis into a suite of complex automated analysis techniques that can be used to identify, isolate, monitor, audit, and help address anomalies that degrade the value of an enterprise information asset. Early uses of data profiling for anomaly analysis have been superseded by more complex uses that are integrated into proactive information quality processes. When coupled with other data quality technologies, these processes provide a wide range of functional capabilities. In fact, there is a growing trend to employ data profiling for identification of master data objects in their various instantiations across the enterprise.
A core expectation of the MDM program is the ability to consolidate multiple data sets representing a master data object (such as “customer”) and to resolve variant representations into a conceptual “best representation” whose presentation is promoted as representing a master version for all participating applications. This capability relies on consulting metadata and data standards that have been discovered through the data profiling and discovery process to parse, standardize, match, and resolve the surviving data values from identified replicated records. More to the point, the tools and techniques used to identify duplicate data and data anomalies are exactly the same ones used to facilitate an effective strategy for resolving those anomalies within an MDM framework. The fact that these capabilities are available from traditional data cleansing vendors is indicated by the numerous consolidations, acquisitions, and partnerships between data integration vendors and data quality tool vendors, and it underscores the conventional wisdom that data quality tools are required for a successful MDM implementation.
Most important is the ability to transparently aggregate data in preparation for presenting a uniquely identifiable representation via a central authority and to provide access for applications to interact with the central authority. In the absence of a standardized integration strategy (and its accompanying tools), the attempt to transition to an MDM environment would be stymied by the need to modernize all existing production applications. Data integration products have evolved to the point where they can adapt to practically any data representation framework and can provide the means for transforming existing data into a form that can be materialized, presented, and manipulated via a master data system.

5.5. Assessment: Data Profiling

Data profiling originated as a set of algorithms for statistical analysis and assessment of the quality of data values within a data set, as well as for exploring relationships that exist between value collections within and across data sets. For each column in a table, a data profiling tool provides a frequency distribution of the different values, offering insight into the type and use of each column. Cross-column analysis can expose embedded value dependencies, whereas intertable analysis explores overlapping value sets that may represent foreign key relationships between entities. It is in this way that profiling can be used for anomaly analysis and assessment. However, the challenges of master data integration have presented new possibilities for the use of data profiling, not just for analyzing the quality of source data but especially with respect to the discovery, assessment, and registration of enterprise metadata as a prelude to determining the best sources for master objects, as well as managing the transition to MDM and its necessary data migration.
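At its simplest, column profiling amounts to building a frequency distribution of values per column, as in the following sketch; the rows are hypothetical, and a commercial profiling tool would add far richer statistics as well as cross-column and cross-table analysis.

```python
from collections import Counter

# Hypothetical rows from a single table; a profiling tool would compute this per column.
rows = [
    {"state": "MD", "status": "active"},
    {"state": "MD", "status": "active"},
    {"state": "md", "status": "ACTIVE"},   # case variants show up in the distribution
    {"state": "NY", "status": "closed"},
]

# Column profiling: a frequency distribution of the distinct values in each column,
# which offers insight into the type, use, and irregularities of that column.
for column in ("state", "status"):
    distribution = Counter(row[column] for row in rows)
    print(column, dict(distribution))
```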

5.5.1. Profiling for Metadata Resolution

If the objective of an MDM program is to consolidate and manage a uniquely referenceable centralized master resource, then before we can materialize a single master record for any entity, we must be able to do the following:
1. Discover which enterprise data resources may contain entity information
2. Understand which attributes carry identifying information
3. Extract identifying information from the data resource
4. Transform the identifying information into a standardized or canonical form
5. Establish similarity to other standardized records
This entails cataloging the data sets, their attributes, formats, data domains, definitions, contexts, and semantics, not just as an operational resource but rather in a way that can be used to automate master data consolidation and govern the ongoing application interactions with the master repository. In other words, to be able to manage the master data, one must first be able to manage the master metadata.
Addressing these aspects suggests the need to collect and analyze master metadata in order to assess, resolve, and unify similarity in both structure and semantics. Although many enterprise data sets may have documented metadata (e.g., RDBMS models, COBOL copybooks) that reveal structure, some of the data—such as fixed-format or character-separated files—may have little or no documented metadata at all. The MDM team must be able to resolve master metadata in terms of formats at the element level and structure at the instance level. Among a number of surveyed case studies, this requirement is best addressed by creatively applying data profiling techniques. To best collect comprehensive and consistent metadata from all enterprise sources, the natural technique is to employ both the statistical and analytical algorithms provided by data profiling tools to drive the empirical assessment of structure and format metadata while simultaneously exposing embedded data models and dependencies.
Profiling is used to capture the relevant characteristics of each data set in a standard way, including names and source data type (e.g., RDBMS table, VSAM file, CSV file), as well as the characteristics of each of its columns/attributes (e.g., length, data type, format pattern, among others). Creating a comprehensive inventory of data elements enables the review of meta-model characteristics such as frequently used names, field sizes, and data types. Managing this knowledge in a metadata repository again allows the statistical assessment capabilities of data profiling techniques to be applied, this time to look for common attribute names (e.g., “CUSTOMER”) and their assigned data types (e.g., VARCHAR(20)) in order to identify (and potentially standardize against) commonly used types, sizes, and formats. This secondary assessment highlights differences in the forms and structures used to represent similar concepts.
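The following sketch illustrates that secondary assessment in miniature, counting how often each attribute name and declared type pairing occurs across profiled sources; the source and column names are hypothetical placeholders.

```python
from collections import Counter

# Hypothetical element-level metadata gathered from several sources during profiling.
profiled_columns = [
    ("crm_db.contacts",  "CUSTOMER_NAME", "VARCHAR(20)"),
    ("orders_db.orders", "CUSTOMER",      "VARCHAR(20)"),
    ("billing_db.bills", "CUSTOMER",      "VARCHAR(30)"),
    ("support_db.cases", "CUSTOMER",      "VARCHAR(20)"),
]

# Count how often each (name, type) pairing occurs to surface the commonly used
# representations, which can then be proposed as the standard for the master model.
pairings = Counter((name, declared_type) for _, name, declared_type in profiled_columns)
for (name, declared_type), count in pairings.most_common():
    print(f"{name} as {declared_type}: seen {count} time(s)")
```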
Commonalities among data tables may expose the existence of a master data object. For example, different structures will contain names, addresses, and telephone numbers. Iterative assessment using data profiling techniques will suggest to the analyst that these data elements are common characteristics of what ultimately resolves into a “party” or “customer” type. Approaches to these processes for metadata discovery are covered in greater detail in Chapter 7.

5.5.2. Profiling for Data Quality Assessment

The next use of data profiling as part of an MDM program is to assess the quality of the source data sets that will feed the master repository. The result of the initial assessment phase will be a selection of candidate data sources to feed the master repository, but it will be necessary to evaluate the quality of each data source to determine the degree to which that source conforms to the business expectations. This is where data profiling again comes into play. Column profiling provides statistical information regarding the distribution of data values and associated patterns that are assigned to each data attribute, including range analysis, sparseness, format and pattern evaluation, cardinality and uniqueness analysis, value absence, abstract type recognition, and attribute overloading analysis.
These techniques are used to assert data attribute value conformance to the quality expectations for the consolidated repository. Profiling also involves analyzing dependencies across columns (looking for candidate keys, looking for embedded table structures, discovering business rules, or looking for duplication of data across multiple rows). When applied across tables, profiling evaluates the consistency of relational structure, analyzing foreign keys and ensuring that implied referential integrity constraints actually hold. Data rules can be defined that reflect the expression of data quality expectations, and the data profiler can be used to validate data sets against those rules. Characterizing data quality levels based on data rule conformance provides an objective measure of data quality that can be used to score candidates for suitability for inclusion in the master repository.
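A simple way to turn rule conformance into a comparable score is sketched below; the two rules, the record values, and the scoring approach (the share of records that pass all rules) are assumptions for illustration rather than a standard metric.

```python
import re

# Hypothetical data rules reflecting the quality expectations for a candidate source.
rules = {
    "customer_id is 10 digits": lambda r: bool(re.fullmatch(r"\d{10}", r.get("customer_id", ""))),
    "name is populated":        lambda r: bool(r.get("name", "").strip()),
}

records = [
    {"customer_id": "0000012345", "name": "Jones, Mary"},
    {"customer_id": "12345",      "name": ""},
]

# Score the source by the proportion of records that satisfy all rules; this kind of
# objective measure can be used to compare candidate sources for the master repository.
conforming = sum(all(check(r) for check in rules.values()) for r in records)
print(f"Conformance score: {conforming / len(records):.0%}")
```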

5.5.3. Profiling as Part of Migration

The same rules that are discovered or defined during the data quality assessment phase can be used for ongoing conformance as part of the operational processes for streaming data from source data systems into the master repository. By using defined data rules to proactively validate data, an organization can distinguish those records that conform to defined data quality expectations and those that do not. In turn, these defined data rules can contribute to baseline measurements and ongoing auditing for data stewardship and governance. In fact, embedding data profiling rules within the data integration framework makes the validation process for MDM relatively transparent.
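Embedded in an integration stream, such validation might do little more than route records according to the defined rules, as in this sketch; the validation function and record layout are hypothetical stand-ins for whatever rules were discovered or defined during assessment.

```python
# Hypothetical inline validation embedded in the integration stream: records that
# satisfy the defined data rules flow on to the master repository, while the rest
# are routed to a remediation queue and counted for stewardship metrics.
def is_valid(record):
    return bool(record.get("customer_id")) and bool(record.get("name"))

incoming = [
    {"customer_id": "0000012345", "name": "Jones, Mary"},
    {"customer_id": "",           "name": "Smith, John"},
]

accepted, rejected = [], []
for record in incoming:
    (accepted if is_valid(record) else rejected).append(record)

print(f"{len(accepted)} record(s) loaded, {len(rejected)} routed to remediation")
```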

5.6. Data Cleansing

The original driver for data quality tools was correcting what was perceived to be “bad” data associated with database marketing, and so the early data quality tools focused on customer name and address cleansing. This typically consists of the following:
▪ Customer record parsing, which will take semistructured customer/entity data and break it up into component pieces such as title, first name, last name, and suffix. This also looks for connectives (DBA, AKA, &, “IN TRUST FOR”) that indicate multiple parties in the data field.
▪ Address parsing, which is a similar activity for addresses.
▪ Address standardization, which makes sure that addresses conform to a published postal standard, such as the standard defined by the U.S. Postal Service. This includes changing street designations to the standard form (e.g., ST for Street, AVE for Avenue, W for West).
▪ Address cleansing, which fills in missing fields in addresses (such as ZIP codes, ZIP+4, or area codes) and corrects mistakes in addresses, such as fixing street names, reassigning post office locality data, or changing the City field in an address from a vanity address (e.g., from “ROLLING HILLS HEIGHTS” to “SMALL VALLEY”).
The cleansing process for a data value (usually a character string) typically follows this sequence:
▪ Subject the candidate string to parsing to identify key components, called “tokens,” within the string.
▪ Determine whether the components reflect a recognized pattern (such as “First Name, Middle Initial, Last Name” for customer names).
▪ If so, map the recognized tokens to the corresponding components of the pattern.
▪ Apply any rules for standardizing the individual tokens (such as changing “Mike” to “Michael”).
▪ Apply any rules for standardizing the collection of tokens together (such as mapping tokens extracted from a NAME data element into FIRST_NAME, MIDDLE_NAME, and LAST_NAME elements).
▪ If the components do not map to a recognized pattern, attempt to determine whether there are similarities to known patterns. This will help in figuring out whether there are specific errors in the data value that can be reviewed by one of the data stewards responsible for overseeing the quality of that specific master data object type.
Once a value has been transformed into a standard representation, existing master reference lists can be searched for the standardized entity name. If a match is found, the candidate record is compared with the existing master entry. Any discrepancies can also be called out into the stewardship process for resolution, with resulting updates communicated into either the master index or the record being evaluated.
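A drastically simplified sketch of the parse, standardize, and match sequence follows; the nickname table, the assumed “First [Middle] Last” pattern, and the master index keyed by standardized name are all illustrative simplifications of what parsing and matching tools actually do.

```python
# Hypothetical sketch of the parse -> standardize -> match sequence for a personal name.
NICKNAMES = {"MIKE": "MICHAEL", "BOB": "ROBERT"}         # token-level standardization rules
master_index = {("MICHAEL", "SMITH"): "C-0042"}          # existing master entries keyed by name

def standardize_name(raw):
    # Parse the string into tokens and assume a "First [Middle] Last" pattern.
    tokens = raw.upper().replace(".", "").split()
    first, last = tokens[0], tokens[-1]
    # Apply token-level standardization (e.g., Mike -> Michael).
    return NICKNAMES.get(first, first), last

candidate = standardize_name("Mike J. Smith")
match = master_index.get(candidate)
print("Standardized:", candidate, "-> matched master id:", match)
```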
Over time, cleansing has become more sophisticated; now we rely on the master repository for cleanliness, but the methods necessary to integrate stewardship roles in making corrections, as well as to learn from the decisions that are made, need to be introduced into the automation processes. For example, early cleansing processes were performed in batch, with output files provided to analysts for “postmortem” review and decision making. Modifications to actual records were performed manually, with all the associated challenges of synchronization and propagation of changes to dependent data sets downstream.
The automation process for MDM at the service layer must now be able to embed the functionality supporting the data stewardship part of the business process. Instead of batch processing for cleansing, services can now inline the identity resolution as part of data acquisition. If the identity can be resolved directly, no interaction with the business client is needed. However, if there are potential match discrepancies, or if no exact matches are found, the application itself can employ the underlying MDM service to prompt the business user for more information to help in the resolution process.
Enabling real-time decisions to be made helps in eliminating the introduction of duplicate or erroneous data at the earliest point of the work stream. At an even higher level of sophistication, there are techniques for learning from the decisions made by users to augment the rule sets for matching, thereby improving the precision of future matching and resolution. The use of parsing, standardization, and matching for master data consolidation is discussed in greater detail in Chapter 10.

5.7. Data Controls

Business processes are implemented within application services and components, which in turn are broken down into individual processing stages, with communication performed via data exchanges. Within the MDM environment, the business processing stages expect that the data being exchanged are of high quality, and the assumption of data appropriateness is carried over to application development as well.
However, no system is immune to the potential for introduction of flawed data into the system, especially when the acquired data are being repurposed across the enterprise. Errors characterized as violations of expectations for completeness, accuracy, timeliness, consistency, and other dimensions of data quality often impede the ability of an automated task to effectively complete its specific role in the business process. Data quality control initiatives are intended to assess the potential for the introduction of data flaws, determine the root causes, and eliminate the source of the introduction of flawed data if possible.
If it is not possible to eliminate the root cause, it may be necessary to use a data source that is known to have flaws. However, being aware of this possibility, notifying the downstream clients, and enabling the staff to mitigate any impacts associated with known flawed data help to control any potential damage if that is the only source available for the needed data.
But the reality is that even the most sophisticated data quality management activities do not prevent all data flaws. Consider the concept of data accuracy. Although we can implement automated processes for validating that values conform to format specifications, belong to defined data domains, or are consistent across columns within a single record, there is no way to automatically determine if a value is accurate. For example, salespeople are required to report their daily sales totals to the sales system, but if one salesperson inadvertently transposed two digits on one of the sales transaction amounts, the sales supervisor would not be able to determine the discrepancy without calling the sales team to verify their numbers (and even they might not remember the right numbers!).
The upshot is that despite your efforts to ensure the quality of data, there are always going to be data issues that require attention and remediation. The goal is to determine the protocols that need to be in place to detect data errors as early as possible in the processing stream(s), whom to notify to address the issue, and whether the issue can be resolved appropriately within a “reasonable” amount of time. These protocols are composed of two aspects: controls, which are used to detect the issue, and service level agreements, which specify the reasonable expectations for response and remediation.

5.7.1. Data and Process Controls

In practice, every processing stage has embedded controls, either of the “data control” or “process control” variety. The objective of the control process is to ensure that any issue that might incur a significant business impact late in the processing stream is identified early in the processing stream. The effectiveness of a control process is demonstrated when the following occurs:
▪ Control events occur when data failure events take place.
▪ The proper mitigation or remediation actions are performed.
▪ The corrective actions to correct the problem and eliminate its root cause are performed within a reasonable time frame.
▪ A control event for the same issue is never triggered further downstream.
Contrary to the intuitive data quality ideas around defect prevention, the desire is that the control process discovers many issues, because the goal is assurance that if there are any issues that would cause problems downstream, they can be captured very early upstream.

5.7.2. Data Quality Control versus Data Validation

Data quality control differs from data validation in that validation is a process to review and measure conformance of data with a set of defined business rules, but control is an ongoing process to reduce the number of errors to a reasonable and manageable level and to institute a mitigation or remediation of the root cause within an agreed-to time frame. A data quality control mechanism is valuable for communicating data trustworthiness to enterprise stakeholders by demonstrating that any issue with a potential impact would have been caught early enough to have been addressed and corrected, thereby preventing the impact from occurring altogether.

5.8. MDM and Data Quality Service Level Agreements

A data quality control framework bolsters the ability to establish data quality service level agreements by identifying the issues and initiating processes to evaluate and remediate them. Pushing the controls as far back as possible in each process stream increases trust, especially when the control is instantiated at the point of data acquisition or creation. In turn, the master data repository can be used to validate data quality, and, as we will explore in Chapter 12, the component service layer will embed the validation and control across the data life cycle.
A key component of establishing the control framework is a data quality service level agreement (SLA); the sidebar lists what should be in a data quality SLA.
A Data Quality SLA Describes
▪ Which data assets are covered by the SLA
▪ The business impacts associated with data flaws
▪ The data quality dimensions associated with each data element
▪ Characterizations of the data quality expectations for each data element for each identified dimension
▪ How conformance to data quality expectations is measured along with the acceptability threshold for each measurement
▪ The individual to be notified in case the acceptability threshold is not met
▪ The times for expected resolution or remediation of discovered issues and an escalation strategy when the resolution times are not met
▪ A process for logging issues, tracking progress in resolution, and measuring performance in meeting the SLA
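Captured in machine-readable form, such an SLA might look something like the following sketch; the asset names, thresholds, contacts, and time frames are placeholders, not recommended values.

```python
# A hypothetical data quality SLA captured as a structured specification; every name,
# threshold, and contact below is an illustrative placeholder.
customer_master_sla = {
    "covered_assets": ["customer_master"],
    "elements": {
        "postal_code": {
            "dimension": "format compliance",
            "expectation": "matches the 5-digit or ZIP+4 pattern",
            "acceptability_threshold": 0.98,     # minimum share of conforming records
            "notify_on_breach": "customer-data-steward@example.com",
            "resolution_time_hours": 48,
            "escalation": "data governance board after 72 hours",
        },
    },
    "issue_log": "tracked in the stewardship issue registry",
}

# Compare a measured conformance level against the agreed acceptability threshold.
measured_conformance = 0.95
postal_rule = customer_master_sla["elements"]["postal_code"]
if measured_conformance < postal_rule["acceptability_threshold"]:
    print("Threshold missed; notify", postal_rule["notify_on_breach"])
```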

5.8.1. Data Controls, Downstream Trust, and the Control Framework

Data controls evaluate the data being propagated from business customer-facing applications to the master data environment and ensure that the data sets conform to quality expectations defined by the business users. Data controls can be expressed at different levels of granularity. At the most granular level, data element level controls review the quality of the value in the context of its assignment to the element. The next level of granularity includes data record level controls, which examine the quality of the set of (element, value) pairs within the context of the record. An even higher level incorporates data set and data collection level controls, which focus on completeness of the data set, availability of data, and timeliness in its delivery.
In essence, data quality management for MDM must provide a means for both error prevention and error detection and remediation. Continued monitoring of conformance to data expectations provides only partial support for keeping the data aspect of business processes under control. The introduction of a service level agreement, along with certification that the SLAs are being observed, provides a higher level of trust at the end of the business process that any issues with the potential for significant business impact will have been caught and addressed early in the process.

5.9. Influence of Data Profiling and Quality on MDM (and Vice Versa)

In many master data management implementations, MDM team members and their internal customers have indicated that data quality improvement is both a driver and a by-product of their MDM or Customer Data Integration (CDI) initiatives, often citing data quality improvement as the program's major driver. Consider these examples:
▪ A large software firm's customer data integration program was driven by the need to improve customer data integrated from legacy systems or migrated from acquired company systems. As customer data instances were brought into the firm's Customer Relationship Management (CRM) system, the MDM team used data profiling and data quality tools to understand what data were available, to evaluate whether the data met business requirements, and to resolve duplicate identities. In turn, the master customer system was adopted as the baseline for matching newly created customer records to determine potential duplication as part of a quality identity management framework.
▪ An industry information product compiler discussed its need to rapidly and effectively integrate new data sources into its master repository with high quality, because deploying a new data source could take weeks, if not months. By using data profiling tools, the customer could increase the speed of deploying a new data source. As a by-product, the customer stated that one of the ways it could add value to the data was by improving the quality of the source data. This improvement was facilitated when the company worked with its clients to point out source data inconsistencies and anomalies, and then provided services to assist in root-cause analysis and elimination.

5.10. Summary

Because the MDM program is intended to create a synchronized, consistent repository of quality master information, data quality integration incorporates data profiling, parsing, standardization, and resolution aspects to both inventory and identify candidate master object sets as well as to assess that data's quality. However, this must be done in a way that establishes resolution and the management of the semantics, hierarchies, taxonomies, and relationships of those master objects, and this process will be explored in Chapter 6 and Chapter 10. On the other hand, we have seen (in Chapter 4) that the benefits of imposing an enterprise data governance framework include the oversight of critical data elements, clear unambiguous definitions, and collaboration among multiple organizational divisions.
For MDM, these two ideas converge in the use of data profiling and data quality tools for assessment, semantic analysis, integration, data inspection and control, and monitoring—essentially across the board. Using profiling to assess and inventory enterprise metadata provides an automated approach to the fundamental aspects of building the master data model. Data profiling and data quality together are used to parse and monitor content within each data instance. This means that their use is not just a function of matching names, addresses, or products, but rather automating the conceptual understanding of the information embedded within the representative record. This knowledge is abstracted as part of a “metadata control” approach, with a metadata registry serving as the focus of meaning for shared information. Fully integrating data quality control and management into the organization is one of the single most important success factors for MDM.