Chapter 9

PREFERRED DATA ARCHITECTURE CONCEPT

Defining the to-be data architecture for the organization.

The next step after data cross-referencing is to designate the preferred data architecture for the organization. Designating the preferred data architecture sets the stage to solve the third problem, data redundancy, and fourth problem, data variability. The process builds on the understanding gained through the data inventory process and the data cross-reference process.

Chapter 9 describes the concepts and principles for designating the preferred data architecture. The next chapter describes the process and techniques for designating the preferred data architecture. Chapters 11 and 12 describe how to transform the disparate data resource to a comparate data resource based on the preferred data architecture.

CONCEPTS AND PRINCIPLES

Preferred means to put before; to promote or advance to a rank or position; to like better or best; to give priority; to put or set forward for consideration. Preferred data are data that have the preferred names, definitions, structure, integrity rules, format, and content acceptable for data sharing.

A preferred data architecture is a subset of the common data architecture that contains preferred data. It’s the desired data architecture that provides a pattern for designing a comparate data resource and for transforming a disparate data resource to a comparate data resource.

The term preferred is used rather than official or standard because use of the preferred data architecture is a choice. The preferred data architecture does not have to be used, but anyone not using the preferred data architecture bears the responsibility for problems arising from not using it.

I settled on the term preferred after working with a university regarding their Help Desk problem. The Help Desk provided support for software products throughout the university. However, the variety of software products being purchased was more than the Help Desk staff could learn. The Help Desk produced a standard set of software products that could be purchased and that they would support.

Complaints about the standard were prolific and blunt—you can’t tell me what software to buy with my grant money. A faculty senate committee asked if I’d help find a solution. After considerable discussion with all parties, we settled on a list of preferred software products that would be supported by the Help Desk. Preferred software products were defined as those prominently used throughout the university. Anyone could certainly use non-preferred software products, but could not expect routine support from the Help Desk. However, they could contract with the Help Desk for support of non-preferred software products.

The new approach worked great—nothing standard or official—just a list of preferred software products that would be routinely supported by the Help Desk, and the option to contract for additional non-preferred support.

The preferred data architecture concept is that the redundancy and variability of disparate data will be resolved through the designation of a preferred data architecture and the transformation of disparate data to comparate data according to that preferred data architecture. The data redundancy and variability may not be eliminated, but will be reduced to a known and manageable level. Formal documentation of the preferred data architecture allows people to readily understand and share common data to meet the business information demand.

The preferred data architecture is how the organization chooses to build and maintain their data resource. It’s a preferred subset of their common data architecture that represents the most desirable to-be data architecture for the comparate data resource. It’s the data architecture that works best for understanding and sharing common data.

The preferred data architecture objective is to designate the preferred representation of all data at the organization’s disposal so those data can be readily understood and shared within and without the organization. The objective is to take a common data architecture that was enhanced to cover the data cross-references and designate preferred components that will become a pattern or template for designing and building a comparate data resource and transforming disparate data to comparate data.

The preferred data designation process is the process of designating and finalizing preferred data names, data definitions, data integrity rules, primary and foreign keys, data characteristic variations, data reference set variations, and data sources. Data translation rules between data characteristic variations and data reference items are based on the preferred data designations. It’s a process that produces maximum benefit with minimal effort.

The preferred data designations are made within the context of a common data architecture after data inventorying and cross-referencing have been completed. The designations cannot be made during data inventorying or cross-referencing because the complete picture of disparate data is not known. The complete picture of disparate data is only available after data inventorying and cross-referencing have been completed.

Principles

Several principles apply to designating a preferred data architecture, including the preferred data designation principle, data understanding, enhancements and adjustments, a deterministic – prescriptive – prospective architecture, a resultant data architecture, and redundancy and variability resolution. Each of these principles is described below.

Preferred Data Designation Principle

The preferred data designation principle states that all preferred designations that comprise the preferred data architecture will be made within a common data architecture, after data cross-referencing has been completed, according to the organization’s perception of the business world, by knowledgeable detail data stewards.

Data Understanding

The understanding of disparate data began with the data inventory, and increased through cross-referencing the inventoried data to a common data architecture. The understanding does not end with data cross-referencing. It continues through designation of a preferred data architecture, and even through data transformation to a comparate data resource.

The thorough understanding of both disparate data and comparate data is never completed. Additional understanding is continually gained through additional insights about existing disparate data, or about how the organization perceives the business world where they operate. Additional understanding is also gained as the business world changes and the organization changes in response to those changes.

During one class I was teaching, I was describing how to understand disparate data and make the preferred data designations. One attendee summarized my explanation this way: preferred data designations are the most reasonable way for the organization to structure its data resource to meet its business information demand. Then that reasonable structure could be screwed up any way necessary to meet external reporting requirements. My response was, “In a nutshell, yes.”

Enhancements and Adjustments

Enhancements may be made to a common data architecture during designation of the preferred data architecture. New components may be added that were not identified through data inventory and cross-referencing, but are needed to support business operations. Enhancements may be made to the data definitions as additional insight is gained.

Adjustments may be made to a common data architecture during designation of the preferred data architecture. The increased understanding of the existing data may result in the organization having a different perception of the business world and the data they need to operate in that business world. A common data architecture would be adjusted accordingly. Adjustments can also be made to a common data architecture after the preferred data designations have been made to remove components that are not relevant.

Enhancements to the data inventory may be made based on insights gained during designation of the preferred data architecture. These enhancements usually include enhanced definitions so that people better understand the existing data. Enhancements could also include inventorying additional data that were not known at the time of the data inventory.

Adjustments may also be made to the data inventory based on insights gained. These adjustments may be a further breakdown of combined data components, or a different breakdown of combined data components.

Adjustments may be made to the data cross-references based on adjustments to a common data architecture, adjustments to the data inventory, or a better understanding of the data cross-reference. Designation of the preferred data architecture often provides a better understanding of the existing disparate data, resulting in an adjustment to the data cross-references.

Adjustments may need to be made to the preferred data designations, as additional insight is gained through the preferred logical and preferred physical data designation processes. Adjustments may be made to any of the components of the logical or physical preferred data architectures, or to the data translation rules. Adjustments may also be made after the comparate data resource has been implemented, based on changing business needs.

Deterministic – Prescriptive – Prospective

Designating a preferred data architecture is deterministic because it formally designates the preferred data architecture by a set of principles and techniques. It provides a future data architecture for the organization’s data that is different from the probabilistic data architecture of the disparate data resource.

Designating a preferred data architecture is prescriptive because it provides a direction for development of a comparate data resource. It prescribes the future to-be data architecture for the organization’s comparate data resource that is different from describing the existing disparate data resource.

Designating a preferred data architecture is prospective because it looks ahead to what the organization needs to properly build and maintain a comparate data resource. It provides a to-be data architecture for the comparate data resource that is different from the as-is data architecture of the disparate data resource.

Resultant Data Architecture

The preferred data architecture is a result of the data inventory and cross-referencing processes. It is developed within the context of a common data architecture based on the understanding gained through data inventory and data cross-referencing, rather than independent of those processes. It may be considered a by-product of data inventorying and data cross-referencing, but that term sometimes implies waste material from a process. Also, additional effort is needed to make the preferred designations. The preferred data architecture does not happen automatically. Therefore, a resultant data architecture is the proper term.

Redundancy and Variability Resolution

Data redundancy is the third basic problem with disparate data and data variability is the fourth basic problem. Designating the preferred data architecture does not resolve these problems the way data inventory resolved the first basic problem and data cross-referencing resolved the second basic problem. However, designating the preferred data architecture sets the stage for resolving the third problem by designating preferred data sources, and the fourth problem by designating preferred data variations. Data resource transformation resolves those problems.

DATA ARCHITECTURES

Data architectures include the disparate data architecture, enterprise data architecture, preferred logical data architecture, preferred physical data architecture, and the comparate data resource. Each of these topics is described below.

Disparate Data Architecture

The existing disparate data does have an architecture, although that architecture is very disjointed and inconsistent, and is not readily visible. Most disparate data were developed in a probabilistic manner that was seldom planned or coordinated across the organization. The data inventory and cross-referencing processes documented that disparate data architecture, which becomes the as-is data architecture.

Enterprise Data Architecture

Enterprise data architecture is not clearly defined, has multiple definitions, and has been misused and abused to the point of being meaningless. It has become part of the lexical challenge and is not used during data resource integration. A similar situation exists with enterprise data model. Enterprise data architecture and enterprise data model are unqualified terms and are not used in data resource integration.

An enterprise data architecture is usually developed independent of the existing disparate data. It’s seldom done at the variation level, which prevents adequate cross-referencing. It provides little or no understanding of the existing disparate data, which makes any formal transformation from disparate data to comparate data very difficult.

Even if an enterprise data architecture were complete and perfect, and accurately represented the organization’s perception of the business world, which it usually doesn’t, it’s very difficult, if not impossible, to integrate disparate data according to an enterprise data architecture. Although it may be prospective, it is not based on a retrospective understanding of the existing disparate data. It provides little functionality for formal data transformation. The effort is often wasted, and frequently results in more disparate data through development of new databases.

If an enterprise data architecture has already been developed, it should be reviewed to determine if it was developed with formal data names, has comprehensive data definitions, used a formal data subject thesaurus, and used common data name words.

If an enterprise data architecture has the formality, then it is reviewed to determine how well it represents the organization’s perception of the business world. If the representation is strong, then it could be used as an initial common data architecture for cross-referencing. However, the risk is that unnecessary components may be added to the initial common data architecture.

If an enterprise data architecture does not have the formality and does not represent the organization’s perception of the business world, it should be discarded. If it appears to have some value, then it could be documented as a data product and cross-referenced to a common data architecture.

Generic data architectures and universal data models are often treated as enterprise data architectures. Most are not formal data architectures with formal data names, comprehensive data definitions, proper data structures, and precise data integrity rules. Most do not represent all of the organization’s data or the organization’s perception of the business world. None of them provides any understanding of the existing disparate data.

At best, generic data architectures and universal data models should be documented as data products and cross-referenced to a common data architecture. The data cross-referencing will prove or disprove their worth.

Preferred Logical Data Architecture

The preferred logical data architecture is the common, desired, to-be logical data architecture for the organization. It’s a subset of a common data architecture developed from a thorough understanding gained through data inventorying and cross-referencing. It’s developed through a formal process that normalizes the data based on the organization’s perception of the business world.

The preferred logical data architecture covers all data, not just the data in databases. It’s developed prospectively within a common data architecture according to the organization’s perception of the business world, based on a retrospective understanding of the existing disparate data. It’s a result of data inventorying and cross-referencing that avoids all the perceptions and misperceptions of an enterprise data architecture. It’s not developed independently.

A preferred logical data architecture consists of preferred data characteristic variations, preferred data reference set variations, preferred data definitions, preferred primary keys and foreign keys, preferred data integrity rules, preferred data sources, and preferred data transformation rules. The preferred data names were established and maintained through the data cross-referencing process.

The preferred logical data architecture is a minimal effort approach to understanding and resolving disparate data. It leads to informed decisions about data resource integration, and sets the stage for developing the preferred physical data architecture, which will be used for data transformation.

The preferred logical data architecture is documented within a common data architecture by placing a preferred indicator on the appropriate data characteristic variations, data reference set variations, and primary and foreign keys.

Preferred Physical Data Architecture

The preferred physical data architecture is the common, desired, to-be physical data architecture for the organization. It’s developed from a formal denormalization of the preferred logical data architecture. It’s the pattern or template for building the comparate data resource.

The preferred physical data architecture covers only the data stored in databases. It sets the stage for transforming disparate physical data to comparate physical data. Data names are abbreviated according to formal data name word abbreviations and a data name abbreviation algorithm. Data definitions and data integrity rules are adjusted to support the data denormalization. Primary keys and foreign keys are adjusted for physical implementation.

The preferred physical data architecture may include generalizations that the organization does not perceive in the business world, such as a legal entity. However, those generalizations must not be viewable by the business. The business should input data and acquire data based on specific data views of the generalized data entity. The data views show the data as the organization perceives the business world. Data integrity rules are placed on the data views, rather than on the generalized data entity.

The preferred physical data architecture is documented as a data product with a designation of preferred. No cross-references are made to a common data architecture, because the link can be made through the formal data names and their formal abbreviations.

Comparate Data Resource

The comparate data resource is developed according to the preferred physical data architecture. It is logically integrated within a common data architecture, but is physically deployed as necessary to be readily accessible. It supports the think globally – act locally principle.

PREFERRED DATA DESIGNATIONS

Preferred data designations are made for data names, data definitions, data characteristic variations, data reference set variations, data keys, data sources, data occurrences and instances, data integrity rules, and multiple preferred designations. Each of these designations is described below.

A preferred data designation is a data variation that has been accepted by the consensus of knowledgeable people as being preferred for data sharing and development of a comparate data resource. A non-preferred data designation is a data variation that has not been accepted as preferred.

General guidelines can be established for making consistent preferred data designations. For example, dates will always be in a CYMD format, measurement units will all be metric, addresses will always be maintained at the individual component level, and so on. These guidelines help the data stewards make informed decisions about the preferred data designations.
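
A guideline such as the date format can be illustrated with a small translation routine. The sketch below is only an illustration, and it assumes that CYMD means century, year, month, and day written as eight digits. The source format names and the function name to_cymd are hypothetical and are not part of any formal method.

```python
from datetime import datetime

# Hypothetical date variations found in disparate data (assumed for illustration).
SOURCE_FORMATS = {
    "MDY_SLASH": "%m/%d/%Y",  # for example 03/15/2024
    "DMY_DASH": "%d-%m-%Y",   # for example 15-03-2024
}


def to_cymd(value: str, source_variation: str) -> str:
    """Translate a date string from a known source variation to CYMD.

    CYMD is assumed here to be an eight-digit string: four-digit year,
    two-digit month, two-digit day.
    """
    parsed = datetime.strptime(value, SOURCE_FORMATS[source_variation])
    return parsed.strftime("%Y%m%d")


print(to_cymd("03/15/2024", "MDY_SLASH"))
print(to_cymd("15-03-2024", "DMY_DASH"))
```

Both calls describe the same calendar day, so both produce the same CYMD value. A guideline stated this precisely gives the data stewards one target format no matter how many date variations the inventory uncovered.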

The general approach is to look at the existing data resource within the context of a common data architecture and make preferred designations based on the guidelines and what is reasonable for the organization. The frequency of existence and frequency of use may be considerations in making the preferred designations, but are not primary criteria. Consistency and the desires of the organization are the primary concerns.

Preferred data designations are identified with a preferred designation indicator. In situations where multiple preferred designations are made due to cultural, geographical, or political differences, a qualifier is added to the preferred designation indicator. Acceptable data designations are identified with an acceptable designation indicator. Obsolete data designations are identified with an obsolete designation indicator.
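
One possible way to record these indicators is sketched below. The class names, the label format, and the rule that a qualifier is rejected on anything other than a preferred designation are assumptions made for illustration. The text only states that a qualifier is added to the preferred designation indicator when multiple preferred designations exist.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Designation(Enum):
    PREFERRED = "preferred"
    ACCEPTABLE = "acceptable"
    OBSOLETE = "obsolete"


@dataclass(frozen=True)
class DesignationIndicator:
    designation: Designation
    # Qualifier for multiple preferred designations, for example a
    # geographic or cultural scope (hypothetical values).
    qualifier: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumption: qualifiers are only meaningful on preferred designations.
        if self.qualifier is not None and self.designation is not Designation.PREFERRED:
            raise ValueError("a qualifier applies only to a preferred designation")

    def label(self) -> str:
        if self.qualifier:
            return f"{self.designation.value} ({self.qualifier})"
        return self.designation.value


print(DesignationIndicator(Designation.PREFERRED, "Canada").label())
print(DesignationIndicator(Designation.OBSOLETE).label())
```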

Preferred Data Names

A preferred logical data name is the data name developed according to the data naming taxonomy and approved by the business as the preferred name for the data. The preferred logical data names are the data names developed for an initial common data architecture and for enhancements to that common data architecture. They should have been developed using a data subject thesaurus and a set of data name common words.

A preferred physical data name is the data name developed from the preferred logical data name during formal data denormalization according to a set of data name word abbreviations and a formal data name abbreviation algorithm.
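
The derivation of a physical data name can be sketched as a lookup against a set of data name word abbreviations. The abbreviations, the underscore separator, and the decision to reject any word without a formal abbreviation are all assumptions made for illustration. An organization's own abbreviation algorithm may differ considerably.

```python
# Hypothetical data name word abbreviations (assumed for illustration).
WORD_ABBREVIATIONS = {
    "employee": "EMP",
    "birth": "BRTH",
    "date": "DT",
    "name": "NM",
}


def abbreviate_data_name(logical_name: str) -> str:
    """Build a physical data name from a logical data name.

    Each word is replaced by its formal abbreviation and the results are
    joined with underscores. A word with no formal abbreviation raises an
    error rather than being abbreviated ad hoc.
    """
    parts = []
    for word in logical_name.split():
        key = word.lower()
        if key not in WORD_ABBREVIATIONS:
            raise KeyError(f"no formal abbreviation for data name word: {word}")
        parts.append(WORD_ABBREVIATIONS[key])
    return "_".join(parts)


print(abbreviate_data_name("Employee Birth Date"))
```

Because the abbreviation is a deterministic function of the logical name and the abbreviation set, the same logical name always yields the same physical name.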

Preferred Data Definitions

Data definitions were documented and enhanced through the data inventory and cross-referencing processes. These data definitions are an aggregate of existing definitions and insight gained during data inventory and cross-referencing. They need to be reviewed and finalized into preferred data definitions.

A preferred data definition is a comprehensive and denotative data definition developed from all of the insights documented during data inventory and cross-referencing that fully explains the data with respect to the business. That data definition may still be enhanced based on additional insights, but it is finalized with respect to pulling all current insights together into a comprehensive data definition.

Preferred data definitions are prepared for data subjects, data characteristics, data characteristic variations, data reference set variations, and data reference items. The general approach is to start with the data subject definitions, then proceed to data characteristic and data characteristic variation definitions. Then data reference set variation definitions are reviewed, followed by data reference items.

Data subject definitions are based on the organization’s perception of the business world. Data characteristic definitions build on the definition of their parent data subject. The data subject definition is not repeated in each data characteristic, but the data characteristic definition must support the data subject definition. The same is true for data characteristic variations. Each data characteristic variation definition builds on its parent data characteristic definition by describing the specific variation in format or content.

Data reference set variation definitions also build on their parent data subject definition by describing the variation in the data reference items contained in the data reference set. The data reference item definitions describe the data property represented by the data item.

Preferred Data Characteristic Variation

Designation of preferred data characteristic variations applies to all data, not just the data in databases. The splitting and combining of data items was done during the data inventory process. No splitting or combining needs to be done within a common data architecture.

A preferred data characteristic variation is a data characteristic variation within a data characteristic that has been designated as the one preferred for data sharing and development of a comparate data resource.

A non-preferred data characteristic variation is a data characteristic variation within a data characteristic that has not been designated as preferred. A non-preferred data characteristic variation may be either acceptable or obsolete.

An acceptable data characteristic variation is any data characteristic variation that is not preferred, but is acceptable to use for an interim period until appropriate changes can be made to databases or application programs. However, its use should not be perpetuated.

An acceptable data characteristic variation may be designated until more insight is gained to make a preferred designation. However, a preferred data characteristic variation must be designated before data transformation can proceed.

An obsolete data characteristic variation is any data characteristic variation that is obsolete and can no longer be used. Ideally, all data characteristic variations, except the preferred, will become obsolete. However, that goal may not be achieved for many years, although substantial progress toward that goal can be made.
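
The requirement that a preferred data characteristic variation be designated before data transformation can proceed lends itself to a simple readiness check. The sketch below assumes the designations are available as plain tuples. The function name and the sample data characteristics are hypothetical.

```python
from collections import defaultdict


def characteristics_not_ready(variations):
    """Return data characteristics that have no preferred variation.

    variations is an iterable of tuples:
    (data characteristic, data characteristic variation, designation),
    where designation is "preferred", "acceptable", or "obsolete".
    """
    preferred_counts = defaultdict(int)
    for characteristic, _variation, designation in variations:
        if designation == "preferred":
            preferred_counts[characteristic] += 1
        else:
            # Touch the key so the characteristic is reported if it never
            # receives a preferred variation.
            preferred_counts[characteristic] += 0
    return sorted(c for c, count in preferred_counts.items() if count == 0)


# Hypothetical designations (assumed for illustration).
designations = [
    ("Employee Birth Date", "CYMD", "preferred"),
    ("Employee Birth Date", "MDY", "obsolete"),
    ("Vehicle Weight", "Pounds", "acceptable"),
]
print(characteristics_not_ready(designations))
```

In this sample, Vehicle Weight has only an acceptable variation, so it is reported as not ready for data transformation. The check tests for at least one preferred variation rather than exactly one, because qualified multiple preferred designations are permitted.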

Preferred Data Reference Set Variation

Designation of preferred data reference set variations applies to all data, not just the data in databases. The splitting and combining of data properties was done during the data inventory process. The splitting of a set of data codes was also done during the data inventory process. The combining of partial sets of data codes needs to be done during preferred data designations. If an appropriate data reference set variation does not exist for a combined set of data codes, one needs to be created.

A preferred data reference set variation is a data reference set variation within a data subject that has been designated as preferred for data sharing and development of a comparate data resource.

A non-preferred data reference set variation is a data reference set variation within a data subject that has not been designated as preferred. A non-preferred data reference set variation may be either acceptable or obsolete.

An acceptable data reference set variation is any data reference set variation that is not preferred, but is acceptable to use for an interim period until appropriate changes can be made to databases or application programs. However, its use should not be perpetuated.

An acceptable data reference set variation may be designated until more insight is gained to make a preferred designation. However, a preferred data reference set variation must be designated before data transformation can proceed.

An obsolete data reference set variation is any data reference set variation that is obsolete and can no longer be used. Ideally, all data reference set variations, except the preferred, will become obsolete. However, that goal may not be achieved for many years, although substantial progress toward that goal can be made.

The preferred data reference set variation only shows the values for the data reference item code, name, and definition. The format for those values is defined by the preferred data characteristic variations. Each data reference set, defined as a data subject, has a preferred data characteristic variation for the coded data value, the name, and the data definition. Those preferred data characteristic variations designate the preferred format for the data reference item.
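
Once a preferred data reference set variation has been designated, translating data reference items from a non-preferred variation becomes a lookup. The sketch below uses an invented fuel type reference set. The variation names, codes, and function name are hypothetical and serve only to show the shape of such a translation.

```python
# Hypothetical preferred data reference set variation: code -> name.
PREFERRED_FUEL_TYPES = {"G": "Gasoline", "D": "Diesel", "E": "Electric"}

# Hypothetical translations from non-preferred variations to preferred codes.
TRANSLATIONS_TO_PREFERRED = {
    "Fuel Type, Legacy Numeric": {"1": "G", "2": "D", "3": "E"},
}


def translate_reference_item(variation: str, code: str) -> str:
    """Translate a code from a non-preferred variation to the preferred code."""
    try:
        preferred_code = TRANSLATIONS_TO_PREFERRED[variation][code]
    except KeyError:
        raise ValueError(f"no translation for {code!r} in {variation!r}") from None
    if preferred_code not in PREFERRED_FUEL_TYPES:
        raise ValueError(f"{preferred_code!r} is not a preferred data reference item")
    return preferred_code


print(translate_reference_item("Fuel Type, Legacy Numeric", "2"))
```

A code with no translation raises an error instead of passing through silently, which keeps unknown data reference items visible to the data stewards.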

Preferred Data Keys

The disparate primary and foreign keys are reviewed during preferred data architecture designation and used to designate preferred primary and foreign keys. Designation of preferred primary keys and foreign keys applies only to the data in databases and data models. A check needs to be made to ensure that all appropriate primary and foreign keys from the data inventory have been documented in a common data architecture. Some disparate data primary keys that represent data from multiple data subjects are not appropriate and should not be documented in a common data architecture.

Preferred Primary Keys

Disparate data often contain more than one primary key for a data subject. These disparate primary keys are identified and documented during data inventory and cross-referencing. A disparate primary key may have meaning only for the data product set or variation in which it appears, and is not placed in a common data architecture. The disparate primary keys are reviewed and given a designation based on their validity and range of usefulness.

The preferred primary key principle states that each data subject in a common data architecture will have one and only one preferred primary key designated that uniquely identifies all data occurrences within that data subject in the organization’s common data architecture.

A candidate primary key is a primary key that has been identified and considered as a primary key, but has not been verified. It has been documented during the data inventory and placed in a common data architecture, but has not been reviewed for its validity or range of uniqueness. All primary keys that originate from the data inventory are candidate primary keys.

A preferred primary key is a primary key that has been designated as preferred for use in a comparate data resource. It uniquely identifies all data occurrences in a data subject within a common data architecture for the organization and has been designated as preferred for data sharing and development of a comparate data resource. Only one preferred primary key is designated for each data subject.

An alternate primary key is a primary key that is valid and acceptable, but is not the preferred primary key. It uniquely identifies all data occurrences in a data subject within a common data architecture for the organization, but has not been designated as the preferred primary key. Multiple alternate primary keys may be designated for each data subject.

A limited primary key is a primary key that is available for all data occurrences, but has a limited range of uniqueness for data occurrences. For example, a primary key may uniquely identify vehicles within a state, but not across states. The limited range of uniqueness is specified as a comment for the primary key. A limited primary key is not perpetuated in the comparate data resource.

An obsolete primary key is a primary key that has no further use and should not be used. It no longer uniquely identifies each data occurrence in a data subject within a common data architecture for the organization, contains data characteristics that are not necessary for unique identification, or is not appropriate for some reason.
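
Reviewing the range of uniqueness of a candidate primary key can be supported by a simple test against the data occurrences. The sketch below follows the vehicle example above. The field names and sample data are hypothetical, and the result only informs the data steward's designation. Uniqueness in a sample does not by itself make a key valid, and the choice between preferred and alternate remains a business decision.

```python
from collections import Counter


def is_unique(occurrences, key_fields):
    """True when the key fields identify every data occurrence exactly once."""
    counts = Counter(tuple(o[f] for f in key_fields) for o in occurrences)
    return all(count == 1 for count in counts.values())


def uniqueness_range(occurrences, key_fields, scope_field):
    """Report "full", "limited", or "none" for a candidate primary key.

    "limited" means the key is unique only within each value of scope_field.
    """
    if is_unique(occurrences, key_fields):
        return "full"
    if is_unique(occurrences, list(key_fields) + [scope_field]):
        return "limited"
    return "none"


# Hypothetical vehicle data occurrences (assumed for illustration).
vehicles = [
    {"state": "WA", "plate": "ABC123", "vin": "V001"},
    {"state": "OR", "plate": "ABC123", "vin": "V002"},
    {"state": "WA", "plate": "XYZ789", "vin": "V003"},
]
print(uniqueness_range(vehicles, ["plate"], "state"))
print(uniqueness_range(vehicles, ["vin"], "state"))
```

Here the plate number repeats across states but not within a state, so it behaves as a limited primary key, while the vehicle identification number is unique across all occurrences.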

Preferred, alternate, limited, and obsolete primary keys may be either business keys or non-business keys. A business key is a primary key consisting of a fact or facts whose values have meaning to the business. A business key is sometimes referred to as an intelligent key; however, that term is not used because a primary key cannot possess intelligence. Generally, a business key is used for data normalization.

A non-business key is a primary key consisting of a fact or facts whose values have no meaning to the business. A non-business key is sometimes referred to as a non-intelligent key; however, that term is not used because a primary key cannot possess intelligence.

A physical key is a preferred or alternate primary key that may or may not be meaningful to the business, but is useful for physical navigation in the database. One of the preferred or alternate primary keys, whether a business key or a non-business key, is designated as the physical key during denormalization of the preferred logical data architecture to the preferred physical data architecture. That primary key becomes the physical key in the comparate data resource.

Preferred Foreign Keys

Disparate data often have more than one foreign key to the same parent data subject, whether explicitly stated or implied. These disparate foreign keys are identified and documented during data inventory and cross-referencing. When the primary keys have been designated, the disparate foreign keys are reviewed and given a designation based on the designation of primary keys.

The preferred foreign key principle states that each subordinate data subject in a common data architecture will have one and only one preferred foreign key designated that uniquely identifies the parent data occurrence in a parent data subject.

A candidate foreign key is a foreign key that has been documented during the data inventory and placed in a common data architecture, but has not been reviewed and given a specific designation. All foreign keys that originate from the data inventory are candidate foreign keys.

A preferred foreign key is a foreign key that matches the preferred primary key in a parent data subject. Only one preferred foreign key is designated for each parent data subject.

An alternate foreign key is a foreign key that matches an alternate primary key in a parent data subject. A limited foreign key is a foreign key that matches a limited primary key in a parent data subject. An obsolete foreign key is a foreign key that matches an obsolete primary key in a parent data subject.
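The way a foreign key inherits its designation from the primary key it matches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the dictionary layout, and the vehicle data characteristic names are hypothetical and are not part of the common data architecture.

```python
from enum import Enum


class Designation(Enum):
    CANDIDATE = "candidate"
    PREFERRED = "preferred"
    ALTERNATE = "alternate"
    LIMITED = "limited"
    OBSOLETE = "obsolete"


def designate_foreign_key(foreign_key_characteristics, parent_primary_keys):
    """Give a foreign key the designation of the parent primary key it matches.

    parent_primary_keys maps a frozenset of data characteristic names to the
    Designation already given to that primary key. A foreign key that matches
    no reviewed primary key stays a candidate.
    """
    key = frozenset(foreign_key_characteristics)
    return parent_primary_keys.get(key, Designation.CANDIDATE)


# Hypothetical primary key designations for a parent data subject, Vehicle.
vehicle_primary_keys = {
    frozenset({"Vehicle. Identification Number"}): Designation.PREFERRED,
    frozenset({"Vehicle. State Code", "Vehicle. License Number"}): Designation.ALTERNATE,
    frozenset({"Vehicle. License Number"}): Designation.LIMITED,
}
```

The lookup uses a set of data characteristic names, so the order in which the characteristics are listed in the foreign key does not matter.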

Preferred Data Relations

Data relations are based on foreign keys, and only apply to databases and data models. General cardinalities can be documented for the data relation, but specific cardinalities are documented as data integrity rules. Data relation names are documented with the data relation. However, those data relation names should add meaning to the data relation, such as supplies, provides, purchases, and so on. Statements like is one of, has many, belongs to, and so on, add no meaning and are not appropriate data relation names.

Preferred Data Sources

Disparate data are very redundant with the same business fact being stored in a variety of different data files in different databases. The reasons for the data redundancy were described in Chapter 5 and in Data Resource Simplexity. An indication of the extent of data redundancy can be determined after data inventorying and cross-referencing, as described in Chapter 7. That data redundancy must be resolved before developing a comparate data resource.

The fact that data redundancy exists in a disparate data resource means that a specific location needs to be designated for obtaining each business fact. Therefore, the resolution of data redundancy begins with the designation of preferred data sources.

A preferred data source is the data product unit or variation within a data product set or variation representing a data file that will be the source for a business fact. It’s the location where an individual business fact can be obtained that is the most current and most accurate. It’s the location for the highest quality data that is sometimes referred to as the best-of-breed data.

The preferred data source is at a business fact level, not at a data occurrence or data file level. Combined business facts were broken down during data inventory and were cross-referenced to a common data architecture. The preferred data sourcing will be done at the business fact level.

Traditional data integration emphasizes a single system of reference, database of reference, or record of reference. Virtually no traditional data integration approaches emphasize the sourcing of data from a variety of different sources in a disparate data resource. The traditional approach is too simplistic because a single system, database, or record seldom has the highest quality data—the most current and most accurate data. The most current and most accurate data are often scattered across a variety of systems, databases, and records.

One traditional approach to data sourcing is the big behemoth approach where the biggest or the most prominent database becomes the preferred data source. However, size and frequency of use do not necessarily mean the most current or most accurate data. The best approach is to inventory a big behemoth as a data product, make cross-references to a common data architecture, and then determine if it is truly the preferred source for data.

The reality of disparate data is that the best sources of data are scattered throughout the disparate data resource. No one single location has the best data for a comparate data architecture. Identifying single sources of data will likely lead to lower quality data. The only reasonable approach to obtain the highest quality data possible is conditional data sourcing.

Conditional data sourcing is the process of selecting preferred data from a variety of different locations based on which location has the most current and most accurate data. It is done at a business fact level based on the data inventory and cross-referencing. Conditional data sourcing is sometimes referred to as selective data sourcing. Other appropriate terms would be preferred source of value or preferred source of quality.

Conditional data sourcing is based on the most current and most accurate data. It is not based on the format or content of the data. The format and content can be easily translated during data transformation and are not a concern in designating preferred data sources.

Conditional data sourcing is done at the data characteristic level, because data characteristics represent business facts. All of the data product units or variations for all the data characteristic variations within a data characteristic are identified. These data product units or variations are reviewed to determine which is most current and most accurate, regardless of their format or content.

The location that is the most current and most accurate is designated as the preferred source for that business fact. The situation frequently arises, even at the business fact level, that the most current and most accurate data come from different locations based on time or other conditions. The preferred sources are documented accordingly.

Conditional data sourcing may result in different sources for the business facts within a data subject, and in different sources for a specific business fact. Both of these differences in data sourcing must be documented in a common data architecture. They will be used during data transformation to acquire the most current and most accurate data.

Preferred data sources are documented as data source rules. A data source rule specifies the preferred source from which a particular business fact is obtained and the conditions that determine the preferred source. The data source rule is stored with the data characteristic and applies to all data characteristic variations for that data characteristic.

An unconditional data source rule is a data source rule that specifies only one location as the preferred data source. A conditional data source rule is a data source rule that specifies multiple locations as the preferred data source and the conditions for selecting one of those locations.
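One possible way to record unconditional and conditional data source rules is sketched below in Python. The class names, the data product unit names, and the hire-year condition are all hypothetical; they only illustrate that a rule belongs to a data characteristic and that a conditional rule carries the conditions for choosing among locations.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class SourceCondition:
    description: str                  # the condition, stated for people
    applies: Callable[[dict], bool]   # the condition, stated for programs
    source: str                       # data product unit used when it applies


@dataclass(frozen=True)
class DataSourceRule:
    data_characteristic: str
    conditions: tuple

    @property
    def is_conditional(self) -> bool:
        # Conditional means more than one location can be the preferred source.
        return len({c.source for c in self.conditions}) > 1

    def preferred_source(self, occurrence: dict) -> Optional[str]:
        # Return the source for the first condition that applies.
        for condition in self.conditions:
            if condition.applies(occurrence):
                return condition.source
        return None


# Unconditional: a single preferred location (hypothetical names).
birth_date_rule = DataSourceRule(
    "Employee. Birth Date",
    (SourceCondition("always", lambda o: True, "HR_DB.DOB"),),
)

# Conditional: the preferred location depends on time (hypothetical names).
salary_rule = DataSourceRule(
    "Employee. Salary Amount",
    (
        SourceCondition("hired before 2005", lambda o: o["hire_year"] < 2005, "PAYROLL_FILE.SAL_AMT"),
        SourceCondition("hired in 2005 or later", lambda o: o["hire_year"] >= 2005, "HR_DB.SALARY"),
    ),
)
```

Calling salary_rule.preferred_source({"hire_year": 1998}) selects the payroll file, while the same call for a 2012 hire selects the HR database.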

Data source rules specify the preferred source for data, and the data characteristic variation specifies the format and content for each source. The data translation rules will change any non-preferred data characteristic variation to a preferred data characteristic variation.

The set of preferred data sources, as specified by the data source rules, becomes the source of reference for data transformation and development of a comparate data resource.

Preferred Data Occurrences

Redundant physical data occurrences frequently appear in a disparate data resource. Redundant physical data occurrences is the situation where the same logical data occurrence exists multiple times in different data files in a disparate data resource. Data product sets or variations may contain a complete set of redundant logical data occurrences, or may contain a combination of redundant and non-redundant logical data occurrences. The degree of physical data occurrence redundancy in a disparate data resource is a major problem that needs to be identified and resolved.

Data inventorying and cross-referencing to a common data architecture provided complete logical data occurrences for a data subject with respect to the business facts contained in the data occurrences. However, it did not identify redundant physical data occurrences in the disparate data resource.

A data integration key is a set of data characteristics that could identify possible redundant physical data occurrences in a disparate data resource. It’s not a primary key because it does not uniquely identify each data occurrence. It’s not a foreign key because no corresponding primary key exists. It’s only used to identify possible redundant physical data occurrences in a disparate data resource.

A data occurrence may not include all of the data characteristics in a data integration key. However, the data integration key still provides a fuzzy indication of possible redundant data occurrences. It identifies the most likely redundant data occurrences. People ultimately need to make the final decision by verifying true redundancy and false positive matches.
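A minimal Python sketch of how a data integration key might flag possible redundant data occurrences follows. The matching policy is an assumption made for illustration: two occurrences are flagged when they agree on at least two key characteristics and disagree on none, and a characteristic missing from either occurrence is ignored. The characteristic names and record identifiers are hypothetical.

```python
from itertools import combinations

# Hypothetical data integration key for a Person data subject.
INTEGRATION_KEY = ("name", "birth_date", "postal_code")


def normalize(value):
    """Reduce trivial format differences before comparing values."""
    return None if value is None else str(value).strip().lower()


def possibly_redundant(a, b, key=INTEGRATION_KEY, min_matches=2):
    matches = 0
    for characteristic in key:
        va = normalize(a.get(characteristic))
        vb = normalize(b.get(characteristic))
        if va is None or vb is None:
            continue          # missing on one side: no evidence either way
        if va != vb:
            return False      # a disagreement rules the pair out
        matches += 1
    return matches >= min_matches


def candidate_pairs(occurrences):
    """List the pairs of occurrences that a person should review."""
    return [
        (a["id"], b["id"])
        for a, b in combinations(occurrences, 2)
        if possibly_redundant(a, b)
    ]


occurrences = [
    {"id": "payroll-17", "name": "Ann Lee", "birth_date": "1980-02-01", "postal_code": "98501"},
    {"id": "hr-203", "name": "ann lee ", "birth_date": "1980-02-01"},
    {"id": "hr-204", "name": "Ann Lee", "birth_date": "1975-06-30", "postal_code": "98501"},
]
pairs = candidate_pairs(occurrences)
```

The result is a list of candidates only. Each flagged pair still needs to be verified by a person as either true redundancy or a false positive match.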

The disparate primary keys, data product set or variation definitions, data cross-references, sets of data items, and integration key are used to integrate redundant physical data occurrences into one set of logical data occurrences within a common data architecture.

The same situation exists with historical data instances. Redundant historical data instances is the situation where redundant physical data occurrences may have corresponding physical historical data instances. Those physical historical data instances are probably not redundant within their parent physical data occurrence. However, they could be redundant across physical data occurrences in a disparate data resource.

Preferred Data Integrity Rules

Disparate data integrity rules were identified and documented during the data inventory process, including the data edits performed by database management systems and application programs. Those data integrity rules were listed with their respective data product set or data product unit, but were not cross-referenced to a common data architecture. During the preferred data designation process, those disparate data integrity rules are brought over to a common data architecture, reviewed, and finalized into preferred data integrity rules.

A preferred data integrity rule is a data integrity rule that has either been confirmed or created to ensure the integrity of a common data architecture. A candidate data integrity rule is a data integrity rule that was documented during the data inventory and brought over to a common data architecture. Note that each of the types of data integrity rules could be qualified with candidate or preferred for clarification.

Data integrity rules only apply to data subjects and data characteristics. All disparate data integrity rules for data product sets or variations are aggregated to their corresponding data subject. The corresponding data subject is identified by navigating through the data product units or variations within the data product sets or variations, to the corresponding data characteristic variation, to the parent data characteristic, to the parent data subject.

Similarly, all disparate data integrity rules for data product units or variations are aggregated to their corresponding data characteristic. The corresponding data characteristic is identified by navigating from the data product unit or variation, to the corresponding data characteristic variation, to the parent data characteristic.
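The navigation from a data product unit to its parent data characteristic can be sketched as two lookups. The Python below is a hedged illustration only: the cross-reference dictionaries, the data product unit names, and the rule wording are hypothetical stand-ins for what the data inventory and data cross-referencing would actually provide.

```python
from collections import defaultdict

# Hypothetical cross-references: data product unit -> data characteristic variation.
unit_to_variation = {
    "EMP_FILE.BRTH_DT": "Employee. Birth Date, CYMD",
    "HR_DB.DOB": "Employee. Birth Date, MDY",
    "HR_DB.SALARY": "Employee. Salary Amount, Dollars",
}

# Hypothetical parents: data characteristic variation -> data characteristic.
variation_to_characteristic = {
    "Employee. Birth Date, CYMD": "Employee. Birth Date",
    "Employee. Birth Date, MDY": "Employee. Birth Date",
    "Employee. Salary Amount, Dollars": "Employee. Salary Amount",
}

# Disparate data integrity rules as documented with each data product unit.
disparate_rules = [
    ("EMP_FILE.BRTH_DT", "must not be in the future"),
    ("HR_DB.DOB", "must be present"),
    ("HR_DB.SALARY", "must be greater than zero"),
]


def aggregate_candidate_rules(rules):
    """Group disparate rules under their parent data characteristic."""
    candidates = defaultdict(list)
    for unit, rule in rules:
        variation = unit_to_variation[unit]
        characteristic = variation_to_characteristic[variation]
        candidates[characteristic].append((unit, rule))
    return dict(candidates)


candidate_rules = aggregate_candidate_rules(disparate_rules)
```

Keeping the originating data product unit with each candidate rule preserves the trail back to the disparate data when the candidates are reviewed.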

These aggregated data integrity rules become the candidate data integrity rules that are reviewed and adjusted as necessary to ensure the integrity of preferred data. Generally, very few disparate data integrity rules are documented and aggregated to a common data architecture. Most preferred data integrity rules need to be created. The preferred data integrity rules may need to be adjusted throughout data transformation.

The preferred data integrity rules are documented with the appropriate data subject or data characteristic. The data integrity rule normalization principle states that data integrity rules are normalized to the data resource component that they represent or on which they take action. The data integrity rules are named accordingly. The data integrity rules are then documented with the data subject or data characteristic by which they are named. Sometimes, the specification of preferred data integrity rules results in movement from the component containing the candidate data integrity rule to another component that the preferred data integrity rule represents.

Multiple Preferred Data Designations

Multiple preferred data designations may need to be made based on differences in culture, geography, or politics. For example, a multi-national organization faces a difference in languages, social customs, monetary units, addresses, names, and so on. These differences need appropriate preferred data designations.

Multiple preferred data designations are not made for small segments of an organization that just want to have their own set of preferred data. For example, one department wants a preferred variation for a student’s birth date in the normal sequence and another department wants a preferred variation for a student’s birth date in the inverted sequence. The organization needs to come to a consensus for one preferred variation.

Multiple preferred data designations is the situation where multiple data characteristic variations or multiple data reference set variations are designated as preferred due to culture, geography, or politics. When multiple preferred data designations are made, a qualifier is added to the designation indicating the conditions for which each preferred designation is used.

Multiple preferred data designations pertain to the data values, not to the data architecture. For example, multiple preferred data characteristic variations could be designated for a course description in English, German, and French. Multiple preferred data reference set variations could be designated for management levels in different regions of the world.

The entire common data architecture could be in a different language, including the data inventory, data cross-referencing, and preferred data architecture. A common data architecture variation is a language variation in a common data architecture. The same common data architecture exists in a different language. A common data architecture variation is different from multiple preferred data designations. Multiple preferred data designations can exist across common data architecture variations.

Preferred Data Templates

Preferred data variations can be used to develop preferred data templates. A preferred data template is a subset of the preferred logical data architecture for a specific subject area that promotes data sharing within or between organizations, and helps organizations develop applications and databases using preferred data. The template is prepared from the logical data architecture so that organizations can implement that logical data architecture in their particular physical operating environment.

These preferred data templates are readily available to any organization that wants to either share data in the preferred form or maintain their data in the preferred form. Preferred data templates are very beneficial in the public sector for sharing data across many different organizations. They result in savings on original development, savings on edge matching data between jurisdictions, and savings on sharing data. They are a good way to prevent data disparity, improve data quality, and effectively use limited resources.

Preferred data templates are an excellent example of how data standards should be presented, and how data registries should be managed. When standards are prepared from preferred data, and registries document preferred data, the data disparity can be substantially reduced. Organizations using the preferred data templates from data standards and data registries can readily share data.

DATA TRANSLATIONS

When the preferred data designations have been made, data translation rules can be prepared between the preferred data and the non-preferred. The data translation principle states that data translation rules are prepared between preferred data designations and non-preferred data designations to assist in the transformation between disparate data to comparate data. Data translation rules are prepared both ways between the preferred data and the non-preferred data. Data translation rules may be prepared between non-preferred data, but only when necessary.

Data Translation Rules

A data translation rule is a data rule that defines the translation of a data value from one unit to another unit. It represents the translation of the values of a single fact to different units, and is not considered to be a data derivation rule. It’s an algorithm for translating data values between preferred and non-preferred data designations, or between different non-preferred data designations, when necessary. It only specifies translations in format or content between data variations. It cannot specify a translation in meaning.

A preferred data translation rule is a data translation rule between a preferred data designation and a non-preferred data designation. Preferred data translation rules are routinely prepared to assist data transformation.

A non-preferred data translation rule is a data translation rule between different non-preferred data designations. Non-preferred data translation rules are very time consuming and are not often used. Therefore, they are only prepared when needed and are used on an interim basis.

A forward data translation rule is a data translation rule from a non-preferred data designation to a preferred data designation. A reverse data translation rule is a data translation rule from a preferred data designation to a non-preferred data designation. Both forward and reverse data translation rules are prepared between preferred and non-preferred data designations. Forward and reverse data translation rules between different non-preferred data designations are created only when necessary.

A fundamental data translation rule is a basic data translation rule that can be applied to many specific data translations. The data translation rule is specified once and can be inherited for many specific data translations. For example, fundamental data translation rules can be prepared for changes in measurement units or dates.

A specific data translation rule is a data translation rule that applies directly to the data translations. It may inherit a fundamental data translation rule, or it may specify a unique data translation rule. For example, the translation between Street Segment. Length, Feet to Street Segment. Length, Meters can inherit a fundamental data translation rule for feet to meters.
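The Street Segment example can be sketched in Python as a specific rule that inherits a fundamental rule. The class and variable names are assumptions made for illustration. The factor 0.3048 is the international foot in meters; a data resource measured in survey feet would need a different factor.

```python
from dataclasses import dataclass
from typing import Callable


# Fundamental data translation rules: specified once, inherited many times.
def feet_to_meters(feet: float) -> float:
    return feet * 0.3048


def meters_to_feet(meters: float) -> float:
    return meters / 0.3048


@dataclass(frozen=True)
class SpecificTranslationRule:
    source_variation: str
    target_variation: str
    algorithm: Callable[[float], float]   # inherited or unique algorithm

    def translate(self, value: float) -> float:
        return self.algorithm(value)


# Specific rules for one data characteristic, in both directions.
street_length_feet_to_meters = SpecificTranslationRule(
    "Street Segment. Length, Feet",
    "Street Segment. Length, Meters",
    feet_to_meters,
)
street_length_meters_to_feet = SpecificTranslationRule(
    "Street Segment. Length, Meters",
    "Street Segment. Length, Feet",
    meters_to_feet,
)
```

Any other data characteristic measured in feet can reference the same feet_to_meters function, so the conversion is written and verified only once.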

The data sharing vision emphasizes that data are shared in their preferred form, and that the contributing organization or receiving organization that does not maintain data in the preferred designation is responsible for translation. The sharing of common data translation rules assists the data sharing process.

Data Translation Approaches

Data translation can be performed three different ways: common-to-physical data translations, physical-to-physical data translations, and common-to-common data translations. Each of these approaches is described below.

Common-to-physical data translations are data translations between a common data architecture and the disparate data documented as data products. Specifically, data translations are prepared between the preferred data characteristic variations and data product units or variations, or between the data reference items in a data reference set variation and the data product codes or variations. The problem with preparing common-to-physical data translations is that a translation needs to be prepared for each physical manifestation of the data in the disparate data resource, resulting in a tremendous effort.

Physical-to-physical data translations are data translations between the disparate data documented as data products and the comparate data resource. Specifically, data translations are prepared between data product units or variations and the preferred physical variation in the comparate data resource, or between data codes or variations documented as data products and corresponding preferred codes in the comparate data resource. The problem with physical-to-physical data translations is that data translation rules need to be prepared for each physical manifestation of the data in the disparate data resource, resulting in a tremendous effort.

Common-to-common data translations are data translations between the preferred and non-preferred data designations within a common data architecture, and applied as needed to physical data translation. The common-to-common approach specifies the minimum set of data translations that can be applied to physical data translations, as needed. The approach is much more efficient and is the recommended approach.
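The efficiency of the common-to-common approach can be seen in a small Python sketch. One data translation rule is defined per pair of data characteristic variations, and the data cross-references determine which rule applies to each physical data product unit. The data product unit names, the variation names, and the choice of the CYMD variation as preferred are hypothetical.

```python
from datetime import datetime


def mdy_to_cymd(value: str) -> str:
    """Translate a date such as 02/01/1980 to 19800201."""
    return datetime.strptime(value, "%m/%d/%Y").strftime("%Y%m%d")


PREFERRED_VARIATION = "Student. Birth Date, CYMD"

# Common-to-common rules, keyed by (source variation, target variation).
common_rules = {
    ("Student. Birth Date, MDY", PREFERRED_VARIATION): mdy_to_cymd,
}

# Data cross-references: data product unit -> data characteristic variation.
cross_references = {
    "ADMISSIONS.BIRTH_DT": "Student. Birth Date, MDY",
    "REGISTRAR.DOB": "Student. Birth Date, MDY",
    "HOUSING.BDATE": "Student. Birth Date, CYMD",
}


def translate_to_preferred(data_product_unit: str, value: str) -> str:
    """Apply the common rule for whichever variation the unit carries."""
    variation = cross_references[data_product_unit]
    if variation == PREFERRED_VARIATION:
        return value   # already in the preferred variation
    return common_rules[(variation, PREFERRED_VARIATION)](value)
```

Three data product units are served by a single rule here. Under the common-to-physical or physical-to-physical approach, each non-preferred unit would need its own translation.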

Data Characteristic Translations

A data characteristic translation rule is a data translation rule that translates data values between non-preferred and preferred variations of a data characteristic. Each data characteristic translation rule has a source data characteristic variation, a translation algorithm, and a target data characteristic variation. Since data translation rules are prepared both ways between preferred and non-preferred data characteristic variations, two data characteristic translation rules are routinely prepared. Others may be prepared between non-preferred data characteristic variations as needed.

Data characteristic translation rules for irregular data characteristic variations can often be difficult. For example, a person’s name in any format could require an extensive algorithm, or may need human intervention to interpret the irregularity and translate it to a specific format. Translating data from a specific format to an irregular form is not possible. Since the format is irregular, the best approach is to use the specific format as the irregular format.

Data translation rules may be coded into one or more programming languages and made available to organizations involved in data sharing. Sharing common translation routines saves resources and promotes data sharing.

Data Reference Item Translations

A data reference item translation rule is a data translation rule that translates coded data values and names between data reference items in preferred and non-preferred data reference set variations within a data subject. The rule translates only the data values, not the format of those data values. The preferred data characteristic variation identifies the format for the data values.

Since data translation rules are prepared both ways between data items in preferred and non-preferred data reference set variations, two data reference item translation rules are routinely prepared. Others may be prepared between data items in different non-preferred data reference set variations as needed. Like data characteristic variations, the translation rules may be coded in several programming languages and made available to organizations involved in data sharing.

Data reference item translations can be more difficult than data characteristic variation translations, because of the relationship between data properties and data reference items.

A one-to-one data reference item translation rule translates the coded data value and/or the name from one data reference item in the source to one data reference item in the target. These translations are very routine.

A many-to-one data reference item translation rule translates the coded data value and/or the name from many different data reference items in the source to one data reference item in the target. These translations are very routine.

A one-to-many data reference item translation rule translates one coded data value and/or name from the source to many data reference items in the target. These translations are difficult and require additional input to make the split from one source value to many target values.
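The difference between the routine translations and the one-to-many translation can be sketched in Python with a hypothetical employment status data reference set. The coded data values, the function names, and the use of a work schedule as the additional input are all assumptions made for illustration.

```python
# Source codes: 1 = active full time, 2 = active part time, 3 = inactive.
# Target codes: A = active, I = inactive.

# Many-to-one (1 and 2 to A) and one-to-one (3 to I): a routine lookup.
SOURCE_TO_TARGET = {"1": "A", "2": "A", "3": "I"}

# The opposite direction: A splits into two items, I maps to one item.
TARGET_TO_SOURCE = {
    "A": {"full-time": "1", "part-time": "2"},
    "I": "3",
}


def translate_to_target(code: str) -> str:
    return SOURCE_TO_TARGET[code]


def translate_to_source(code: str, work_schedule: str = None) -> str:
    targets = TARGET_TO_SOURCE[code]
    if isinstance(targets, str):
        return targets   # one-to-one: no additional input needed
    if work_schedule not in targets:
        # One-to-many: the coded data value alone cannot decide the split.
        raise ValueError(f"code {code!r} needs additional input to translate")
    return targets[work_schedule]
```

The one-to-many case raises an error rather than guessing, which reflects the point that the split cannot be made from the coded data value alone.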

SUMMARY

Preferred data can be designated after data inventorying and cross-referencing have been completed for a major segment of the data resource or for the entire data resource. The preferred data designations set the stage for transforming the disparate data to a comparate data resource. They bring all of the understanding accumulated through data inventorying and cross-referencing together to make the preferred designations.

The data designation process is deterministic, prescriptive, and prospective. The preferred data designations identify the preferred logical data architecture. That preferred logical data architecture is formally denormalized to a preferred physical data architecture, which becomes the template for the comparate data resource.

Preferred data characteristic variations, data reference set variations, primary keys, and foreign keys are designated based on precise criteria. Preferred data sources are designated, including conditional data sources. Comprehensive data definitions are finalized, and precise data integrity rules are developed. Multiple preferred designations can be made based on cultural, geographical, or political differences.

Redundant data occurrences and data instances are identified for consolidation. Data translation rules for common-to-common data translations are prepared for data characteristic variations and the data items in data reference set variations. Those data translation rules are used as necessary to translate between the disparate data and comparate data.

Designating preferred data ends the process of identifying, documenting, and understanding disparate data. The next step is to begin transforming the disparate data to a comparate data resource, and eliminating the disparate data.

QUESTIONS

The following questions are provided as a review of preferred data designations and translation schemes, and to stimulate thought about defining the preferred data architecture for an organization.

  1. What is the preferred data designation principle?
  2. What is the difference between a preferred logical data architecture and a preferred physical data architecture?
  3. Why is the preferred data architecture considered a resultant data architecture?
  4. Why is the traditional enterprise data architecture not developed?
  5. Why might multiple preferred data designations be made?
  6. What types of preferred data designations are made?
  7. What’s the purpose of the preferred physical data architecture?
  8. What’s the purpose of preferred data templates?
  9. What are data translation rules?
  10. Why are data translation rules done from common variations to common variations?