Chapter 10

PREFERRED DATA ARCHITECTURE PROCESS

How to define the to-be data architecture.

Chapter 9 described the concepts and principles for designating the preferred data architecture. Designating the preferred data architecture sets the stage for solving the third problem, data redundancy, and fourth problem, data variability. The process builds on the understanding gained through the data inventory process and the data cross-reference process.

Chapter 10 describes the techniques for designating the preferred logical data architecture by making all of the preferred data designations within a common data architecture. The preferred physical data architecture is developed based on the preferred logical data architecture and the organization’s operating environment. The techniques for developing data translation rules between data characteristic variations, and between data reference items in data reference set variations are described. These techniques are based on the concepts and principles presented in the last chapter.

PREFERRED DATA DESIGNATION PREPARATION

Preparation for making preferred data designations includes the scope of preferred data designations, the sequence of preferred data designations, the involvement in making preferred data designations, and adjustments to a common data architecture. Each of these topics is described below.

Scope

The scope for preferred data designations is the same as for data inventorying and data cross-referencing. However, all disparate data within the scope must have been included in data inventorying and data cross-referencing. If preferred data designations are made and another segment of disparate data is found within that scope, the preferred data designations could change based on insights gained from the additional disparate data.

Therefore, the scope must be with respect to subject areas within a common data architecture, not with respect to the disparate data resource. For example, all stream data, or all public works data, or all student data must be included in the data inventorying and data cross-referencing processes. Selecting a segment of the disparate data for data inventorying and data cross-referencing may not include all of the disparate data for a subject area. A check should be made to ensure that all disparate data within a specific subject area are included in the process so that the preferred data designations will be correct and will not need to be changed.

Sequence

The preferred data architecture is designated after data cross-referencing has been completed for the entire disparate data resource, or for a major subject area of the disparate data resource. Preferred data designations cannot be made before data cross-referencing is completed, because all of the insights for preferred data designations may not be available. Preferred data designations could be made that would need to be changed based on additional insights.

Developing and enhancing an initial common data architecture should not be confused with predetermining data designations. A common data architecture is based on the organization’s perception of the business world. However, establishing an initial common data architecture according to an organization’s perception of the business world is not predetermining any preferred data designations.

For example, an organization could decide to put all contractors in one data subject, with a contractor type designation showing plumbers, electricians, fabricators, general contractors, and so on. The data cross-references would be made accordingly. The process establishes a desired common data architecture, but does not establish preferred data designations.

Preferred data designations are not made in any particular sequence. However, the preferred data designations must be made before the data translations rules can be prepared, because the preferred data designations must be known to develop the data translation rules between the preferred and non-preferred data designations.

The general approach for preferred data designations is to work from preferred data names, through preferred data definitions, preferred data characteristic variations, preferred data reference set variations, preferred primary and foreign keys, preferred data sources, preferred data occurrences and instances, and preferred data integrity rules. The data characteristic variation translation schemes and the data reference item translation schemes are designated after the preferred data designations.

Involvement

Designating the preferred data architecture is done by the core team with input from the detail data stewards. The data steward’s knowledge of the organization’s perception of the business world where the organization operates, and their knowledge of the data needed to support business operations, help them designate the preferred data architecture. The core team and detail data stewards may reach out to others in the organization for insights or clarification about the preferred data architecture.

The people involved in designating the preferred data architecture must be knowledgeable about the organization’s perception of the business world, the data the organization needs to operate in that business world, and data normalization and denormalization. Business professionals should be included to verify the data needed to support their business activities. Two good approaches are to use business cases and reporting requirements to verify the data needed and the preferred designations for those data.

Common Data Architecture Adjustment

A common data architecture may need to be adjusted during the preferred data designation process and after the preferred data process has been completed. Adjustments during the preferred data designation process could include adding data characteristic variations or data reference set variations when existing ones are not acceptable for a preferred data designation. Adjustments could include grouping or splitting data subjects based on changes in the organization’s perception of the business world resulting from additional insights gained.

Adjustments could include deletions of data subjects or data characteristics that are no longer relevant. Deletions usually result when an enterprise data architecture is prepared ahead of time and used as an initial common data architecture, but components of that enterprise data architecture have no data cross-references. Deletions may be removal from a common data architecture, with proper entries into a data subject thesaurus, or may be leaving the components in a common data architecture with a notation that they are obsolete.

When a common data architecture is adjusted, data cross-references may need to adjusted accordingly, and data inventorying may even need to be adjusted. The discovery nature of understanding and resolving disparate data requires moving forward and backward through all the processes until the desired result is achieved.

PREFERRED DATA DESIGNATIONS

Making preferred data designations includes designating preferred data names, data definitions, data characteristic variations, data reference set variations, data subjects, data keys, data sources, data occurrences and instances, data integrity rules, and multiple preferred data designations. Each of these preferred designations is described below.

Preferred Data Names

The data names within a common data architecture are likely to be correct when the data naming taxonomy, a data subject thesaurus, and a list of data name common words are used. However, when these techniques are not used, the common data architecture names may not be correct.

The data subject and data characteristic names should be reviewed to determine if they were formed properly and if they represent the data according to how the organization perceives the business world. Any changes to the data names should be made before proceeding with making preferred data designations. The data subject thesaurus and the list of common data name words should also be enhanced if necessary.

Preferred Data Definitions

Developing preferred data definitions can be a challenging task. The traditional approach is to create a short data definition that provides little meaning with respect to the business, which is often a tautology. The data definitions must be comprehensive, denotative, and business oriented, as described in Chapter 3 and Data Resource Simplexity.

The general approach is to pull all of the data definitions accumulated during the data inventory process through the data cross-references to a common data architecture. Data product set definitions are pulled over to the corresponding data subjects, and data product unit definitions are pulled over to the corresponding data characteristic variations.

The set of definitions for each data subject is reviewed and a comprehensive data definition is prepared for that data subject. Data characteristic definitions and data reference set variations can be reviewed to provide insight into a data subject definition. Business professionals can be involved to ensure the data subject definition is complete and correct with respect to the business.

The set of definitions for all of the data characteristic variations within a data characteristic is reviewed and a comprehensive data definition is prepared for the data characteristic. The data subject definitions can help focus the data characteristic definitions, since the data characteristics qualify a data subject. Business professionals should also be involved to ensure that the data characteristic definition is complete and correct with respect to the business.

When the preferred data definitions have been developed and accepted by the business professionals, the data product definitions that were pulled over through the cross-references do not need to remain in a common data architecture. They can be removed from the common data definitions, but are still maintained with the data products to help people understand the existing disparate data.

In some situations, a comprehensive data definition cannot be developed because business professionals disagree. When disagreement occurs, the definition should be expanded until agreement is achieved. If agreement cannot be achieved with an expanded definition, the data subject or data characteristic likely represents two or more components that need to be separated, or separate components that need to be combined.

Try combining or subdividing data subjects or data characteristics to see if consensus can be gained. When consensus is gained by combining or subdividing, the data cross-references may need to be adjusted accordingly, and the data inventory may even need to be adjusted. Those adjustments are part of the discovery process.

Developing preferred data designations is an excellent opportunity to develop fundamental data definitions and inherit those fundamental data definitions. The development and use of fundamental data definitions is described in Chapter 3 and in Data Resource Simplexity.

Preferred Data Characteristic Variations

The set of data characteristic variations for each data characteristic is reviewed to determine which variation is most appropriate for the business as the preferred data characteristic variation. General guidelines can be prepared for designating preferred data characteristic variations, such as all dates will be in the CYMD format, all measurement units will be metric, people’s names in the normal sequence, monetary units in US Dollars, and so on.

The criteria for designating a preferred data characteristic variation are listed below.

When only one data characteristic variation exists within a data characteristic and it is appropriate for the business, it is designated as the preferred data characteristic variation.

When multiple data characteristic variations exist within a data characteristic and one is appropriate for the business, it is designated as the preferred data characteristic variation.

When multiple data characteristic variations exist within a data characteristic and several are appropriate for the business, the most reasonable one is designated as the preferred data characteristic variation.

When none of the existing data characteristic variations within a data characteristic is appropriate for the business, a new data characteristic variation is created that is appropriate for the business and is designated as the preferred data characteristic variation.

All other data characteristic variations become non-preferred data characteristic variations. The non-preferred data characteristic variations are designated as acceptable or obsolete.

Data Characteristic Variation List

A data characteristic variation list is a list of all the data characteristic variations within a data characteristic. The list is useful for reviewing all of the data characteristic variations and designating the preferred data characteristic variation.

For example, the employee birth date variations listed below were identified during data cross-referencing. The general guidelines state the CYMD format is the preferred variation for date, and that variation is designated as preferred. The designation is made with an asterisk after the name.

     Employee. Birth Date

          Employee. Birth Date, YMD

          Employee. Birth Date, MDY

          Employee. Birth Date, M/D/Y

          Employee. Birth Date, CYMD *

Well depth data characteristic variations are listed below. The general guidelines state that metric units are the preferred variation for measurements, but that variation does not exist. A new data characteristic variation is defined for meters and is designated as the preferred data characteristic variation, as shown below.

     Well. Depth

          Well. Depth, Inches

          Well. Depth, Yards

          Well. Depth, Feet

          Well. Depth, Meters *

The well location data characteristic variations are listed below. A review of these variations shows that a preferred data characteristic variation cannot be designated due to the wide range in accuracy and the difficulty of developing data translation schemes. Therefore, no data characteristic variation is designated as preferred, and a companion data characteristic is established to show the format of the well location. Note that the companion data characteristic could be a data reference set with data reference items for the different formats and accuracies of the well location.

     Well. Location

          Well. Location, PLS 40 Acres

          Well. Location, Lat/Lon 5 Degrees

          Well. Location, SPC 50 Feet

          Well. Location, Lat/Lon 1 Degree

     Well. Location

          Well Location. Format

Data Product Unit Cross-Reference List

A data product unit cross-reference list is a list of the data product units or variations and their corresponding data characteristic variations. The list could be useful for understanding the source of the variation and determining the preferred variation.

For example, the data product cross-reference list for equipment identification is s below. On closer inspection, the first data product unit is actually the serial number of the equipment, while the latter two data product units are identifiers assigned by the organization.

     EQP: EQUIP_ID Equipment. Identification Number, Char 16

     EQUIP: EQUIP_ID_NUM Equipment. Identification Number, Num 8

     EQ: EQUIPMENT_NUM Equipment. Identification Number, Char 5

The data cross-references were adjusted as shown below, and two preferred data characteristic variations were designated.

     EQP: EQUIP_ID Equipment. Serial Number, Char 16 *

     EQUIP: EQUIP_ID_NUM Equipment. Identification Number, Num 8

     EQ:  EQUIPMENT_NUM Equipment. Identification Number, Char 5 *

The data product unit cross-reference list below shows the building values that were thought to be the same values. However, on closer inspection the values were actually different.

     FAC: BLDG_VAL Building. Current Value, Dollars 6

     BLDG BLD_VAL Building. Current Value, Thousands 4

     BLD: STRCT_VAL Building. Current Value, Dollars 8

The data cross-references were adjusted as shown below, data characteristic variations were added, and preferred designations were made.

     FAC: BLDG_VAL Building. Current Value, Dollars 6

Building. Current Value, Dollars 8 *

     BLDG BLD_VAL Building. Original Value, Thousands 4

Building. Original Value, Dollars 8 *

     BLD: STRCT_VAL Building. Replacement Cost, Dollars 8 *

The reverse situation could also exist. The data product unit cross-references below show cross-references from a truck data file and an auto data file. On closer inspection, these are all vehicles and should be cross-referenced accordingly.

     TRK_FLE

          ACQ_DT Truck. Acquisition Date, YMD

          MFG Truck. Manufacturer Name Complete, 16

          MDL_NM Truck. Style Name Complete, 12

     AUTO_FLE

          DT_PCH Auto. Purchase Date, MDY

          M_NM Auto. Manufacturer Name Complete, 14

          STYL Auto. Style Name Complete, 14

The data cross-references are adjusted to vehicle data characteristic variations as shown below. Data characteristic variations were added to support the general guidelines, and the preferred data characteristic variations were designated.

     TRK_FLE

          ACQ_DT Vehicle. Acquisition Date, YMD

Vehicle. Acquisition Date, CYMD *

          MFG Vehicle. Manufacturer Name Complete, 16 *

          MDL_NM Vehicle. Style Name Complete, 12 *

     AUTO_FLE

          DT_PCH Vehicle. Purchase Date, YMD

          M_NM Vehicle. Manufacturer Name Complete, 12

          STYL Vehicle. Style Name Complete, 8

Designating preferred data characteristic variations for irregular data is often difficult. The best approach is to designate a preferred data characteristic variation and then try to make the appropriate data characteristic variation translation. For example, existing employee names may be irregular, but the organization desires to have the complete name in the normal sequence. The addition of a preferred data characteristic variation is shown below.

     EMP: EMPL_NM Employee. Name, Irregular 18

Employee. Name Complete, Normal 20 *

These examples show how data product unit cross-reference lists can be used to evaluate the data characteristic variations, make adjustments, and designate the preferred data characteristic variations.

Preferred Data Characteristic Name

The formal data name of a preferred data characteristic in the preferred logical data architecture does not carry the variation name. The appearance of a data characteristic in the preferred logical data architecture means that the preferred data characteristic variation is used. That specific variation does not need to be designated in the data characteristic name.

Using the variation name in the preferred logical data architecture makes the name longer than necessary. Using a variation name of Preferred also makes the name longer than necessary, and does not indicate the actual preferred format or content. Therefore, the data characteristic name in the preferred logical data architecture means the preferred data characteristic variation.

The preferred format and content of the data characteristics are stated in the data characteristic definition, as shown below.

     The employee’s name is in the normal sequence.

     The well depth is in meters.

The format of the data characteristics is also described in the data integrity rules where appropriate, as shown below.

     Employee. Name, Domain! 5 <= Character <= 20

     Well. Depth, Domain! 1 <= Integer <= 5

Preferred Data Reference Set Variations

Making preferred data reference set variations is more difficult than making preferred data characteristic variation definitions, because of the variability in the data reference item code, name, and domain of values. The set of data reference set variations for each data subject is reviewed to determine which variation is the most appropriate for the business as the preferred data reference set variation. The codes, names, definitions, and domain of values are all reviewed to make that determination.

The criteria for designating a preferred data reference set variation are listed below.

When only one data reference set variation exists within a data subject and it is appropriate for the business, it is designated as the preferred data reference set variation.

When multiple data reference set variations exist within a data subject and one is appropriate for the business, it is designated as the preferred data reference set variation.

When multiple data reference set variations exist within a data characteristic and several are appropriate for the business, the most reasonable one is designated as the preferred data reference set variation.

When none of the existing data reference set variations within a data subject are appropriate for the business, a new data reference set variation is created that has the desired data reference items, and is designated as the preferred data reference set variations.

All other data reference set variations become non-preferred data reference set variations. The non-preferred data reference set variations are designated as acceptable or obsolete.

Data Reference Item List

A data reference item list is a listing of all the data reference items in a data reference set variation, including the data reference item codes, data reference item names, and data reference item definitions. For example, a data reference set variation for disability is shown below.

     Disability. Set 1;

S Sight A person with sight impairment.

H Hearing A person with hearing impairment.

P Physical A person with physical impairment.

D Development A person with developmental impairment.

Note that the format for data reference item coded values, names, and definitions is not a concern for designating a preferred data reference set variation. Only the values of the code and name, and the meaning of the data reference item are important for making a preferred designation. The format is determined by the data characteristic representing the code, name, and definition.

A preferred data reference set variation is designated with an asterisk after the name, in the same way as a preferred data characteristic variation. For example, the data reference set variation shown below is the one that exists in the disparate data and is appropriate for the business. Therefore, it is designated as the preferred data reference set variation.

     Stream Gradient. Set 1; *

           1 Flat Less than 4.99 percent gradient.

2 Slight 5 to 9.99 percent gradient.

3 Moderate 10 to 19.00 percent gradient.

4 Steep 20 to 39.99 percent gradient.

5 Extreme 40 percent or steeper gradient.

A data reference item matrix is a matrix of all of the data reference items, for all of the data reference set variations, for a single data subject, including the coded data values, data reference item names, and data reference item definitions. It identifies the variability between disparate data reference set variations for the same data subject.

For example, four data reference set variations for disability are shown below. The names and the codes are different. The definitions were left off of the example due to space, but would normally be included.

     Disability. School:        Disability. Employment;        Disability. Health;

        10   Sight            A   Seeing    V   Vision

        20   Hearing            H   Hear                 S   Sound

        30   Physical            P    Physical    A   Accidental

        40   Develop            D   Developed                 G   Genetic

None of these data reference set variations meet the criteria for a preferred data reference set variation. Therefore, a new data reference set variation is developed with the preferred codes, names, and definitions, and is designated as the preferred data reference set variation.

   Disability.   

     School;               Employment;            Health;   New; *

     10   Sight       A   Seeing V   Vision   S   Sight

     20   Hearing       H   Hear S   Sound   H   Hearing

     30   Physical       P    Physical A   Accidental   P   Physical

     40   Develop       D   Developed G   Genetic   D   Developmental

The data reference items in the above examples represent the same domain of data properties. However, that is not always the case. Different data reference set variations may represent a different domain of data properties within the same data subject.

For example, two disparate data reference sets were found for small vehicle horsepower and large vehicle horsepower, as shown below.

     Vehicle Horsepower. Small;

          Below 100 A Below 100 Horsepower

          100-199 B 100-199 Horsepower

          200-349 C 200-349 Horsepower

          350-499 D 350-499 Horsepower

        Vehicle Horsepower. Large;

          500-749 1 500-749 HP

          750-999 2 750-999 HP

          1000-1499 3 1000-1499 HP

          1500+ 4 1500+ HP

The two data reference sets represent different domains of data properties for vehicle horsepower and need to be combined into one data reference set. The new preferred data reference set is shown below with the two disparate data reference sets. The combined disparate data reference set variations are shown on the left, and the preferred coded data values and definitions are shown on the right.

     Vehicle Horsepower. New; *

          Below 100 A 1 Below 100 Horsepower

          100-199 B 2 100-199 Horsepower

          200-349 C 3 200-349 Horsepower

          350-499 D 4 350-499 Horsepower

          500-749 1 5 500-749 Horsepower

          750-999 2 6 750-999 Horsepower

          1000-1499 3 7 1000-1499 Horsepower

          1500+ 4 8 Above 1500 Horsepower

A similar situation exists with management level data reference items. Two disparate data reference sets of management level data reference items are shown below. Neither the codes nor the names match, and no definitions were provided.

    Management Level. Personnel;

        OP Owner / Partner

        EX Executive

        MN Manager

        SP Supervisor

        LW Lead Worker

     Management Level. Finance;

        041 Senior Manager

        163 Line Manager

        239   Unit Manager

        157   Team Manager

        445 Team Member

A new data reference set was defined to include all of the management-level data reference items, as shown below. The new data reference item names and coded data values are shown, along the corresponding disparate coded data values.

     Management Level. New; *

          Owner 01 OP

          Senior Executive 02 EX 041

          Line Manager 03 MN 163

          Unit Supervisor 04 SP 239

          Lead Member 05 LW 157

          Team Member 06 445

Existing data reference set variations may be too detailed for the business. The data reference items may need to be combined to better represent the business. For example, current ecological regions are show below.

     Ecological Region. Wildlife;

          1 Northeast

          2 East Central

          3 Southeast

          4 Southern Midwest

          5 Central Midwest

          6 Northern Midwest

          7 Southwest

          8 Northwest

These eight detailed ecological regions need to be combined into three general ecological regions, as shown below. A new data reference set variation is created and designated as preferred.

     Ecological Region. Wildlife; Ecological  Region. New *

          1 Northeast E East

          2 East Central E East

          3 Southeast E East

          4 Southern Midwest M Midwest

          5 Central Midwest M Midwest

          6 Northern Midwest M Midwest

          7 Southwest W West

          8 Northwest W West

Preferred Data Reference Set Name

The formal data name of a preferred data reference set variation is the data subject name without any variation name, the same as a preferred data characteristic variation. The appearance of a data reference set in the preferred logical data architecture means that the preferred data reference set variation is used. That specific variation does not need to be designated in the name.

The preferred format of the coded data value, data reference item name, and data reference item definition are defined in their corresponding preferred data characteristic variations. For example, a data subject that represents a data reference set for Animal Species has a preferred data characteristic variation for Animal Species. Code, for Animal Species. Name, and for Animal Species. Definition. Those preferred data characteristic variations define the format and integrity rules for the data reference items in the preferred data reference set variation.

Preferred Data Subjects

Data subjects are designated as preferred when they contain a preferred data reference set variation, or one or more preferred data characteristic variations. Other data subjects become non-preferred. The situation occurs when an enterprise data architecture is developed independent of the existing disparate data and used as an initial common data architecture. After data cross-referencing, some of the data subjects do not have a preferred data reference set or preferred data characteristic variations, and remain non-preferred.

The non-preferred data subjects may remain in a common data architecture, but a note should be added to the data definition that it is non-preferred. An entry should also be made in the data subject thesaurus with the data subject name referring to a preferred data subject. Alternatively, the non-preferred data subject may be removed from a common data architecture.

Preferred Data Keys

Primary keys are important for uniquely identifying each data occurrence in a data subject, and for navigating the data resource. Disparate data have many different primary keys for the same data subject, different ranges of uniqueness, and unnecessary data characteristics not required for uniqueness. Understanding these disparate primary keys and designating preferred primary keys is often a challenging task.

A primary key list is list of the primary keys for a data subject that exists in the disparate data. Only the data characteristic is listed for each primary key, not the data characteristic variation. Designating preferred primary keys is not dependent on the format or content of the data. It’s only dependent on the value of the primary key. The content and format will be translated to the preferred data characteristic variation during data transformation.

The example below shows the primary keys that have been identified for vehicle data in a disparate data resource. The data characteristics are listed for each primary key.

     Vehicle

          Primary Key:

              Vehicle. Identification Number

         Primary Key:

             State. Code, ANSI

             Vehicle. License Number

        Primary Key:

            Vehicle. License Number

        Primary Key:

            Vehicle. License Number

            Vehicle. Model Name

The primary key range of uniqueness is the range of data occurrences for which the primary key provides a unique identification. The primary keys in disparate data may have different ranges of uniqueness that must be identified before a preferred primary key can be designated.

For example, the primary keys identified for a vehicle are shown below with a statement of their range of uniqueness. The range of uniqueness is placed after the Primary Key label. Unnecessary data characteristics are also indicated to provide insight into designating a preferred primary key.

     Vehicle

          Primary Key: Unique for all vehicles manufactured.

              Vehicle. Identification Number

         Primary Key:  Unique for vehicles in the United States.

             State. Code, ANSI

             Vehicle. License Number

        Primary Key:  Unique for vehicles in a state in the United States.

            Vehicle. License Number

        Primary Key:  Unique for vehicles in a state in the United States.

            Vehicle. License Number

            Vehicle. Model Name – Unnecessary data

A primary key matrix is a matrix of the primary keys that shows all of the disparate primary keys for a data subject and across related data subjects. The matrix is helpful for evaluating disparate primary keys, designating a preferred primary key, and combining data occurrences and data instances.

The primary key matrix is useful for identifying commonality across related data subjects in the public sector where data sharing is common. Public sector organizations need to share data to operate effectively and efficiently, and primary keys can be used for that data sharing.

For example, a primary key matrix is shown below for drivers, employees, inmates, welfare recipients, and businesses in a state. The corresponding primary key lists can be used to identify range of uniqueness, unnecessary data characteristics, and other primary keys that could be useful for data sharing.

     Driver    Employee Inmate Welfare    Business

     Social Security Number X           X

     Business Identifier           X

     Driver’s License Number     X

     State Identifier     X  

     Fingerprint Identifier     X

     Person Name Complete      X X            X

     Person Birth Date               X X              X

A primary key matrix could be prepared for all data subjects in the organization’s data resource, or within the scope of data integration. The matrix is not particularly useful for operational data, because the matrix would be relatively sparse. However, the matrix is very useful for aggregated data in variable hierarchies to identify the primary keys for each data set in those variable hierarchies.

Preferred Primary Key

All of the primary keys in a data subject are reviewed to determine the preferred primary key. The criteria for designating primary keys are listed below.

Each primary key identified in the disparate data becomes a candidate primary key for a data subject.

Each candidate primary key in a data subject is reviewed to determine its range of uniqueness for identifying each data occurrence in a data subject and any unnecessary data characteristics.

When a candidate primary key is unique within the scope of the organization’s business, it remains a candidate primary key.

When a candidate primary key does not uniquely identify each data occurrence within the scope of the organization’s business, it is designated as an obsolete primary key.

When a candidate primary key has limited uniqueness within the scope of the organization’s business, it is designated as a limited primary key and the range of uniqueness is noted.

When a candidate primary key is meaningful within the scope of the organization’s business, it is designated as the preferred business primary key. Any remaining candidate primary keys meaningful for the business are designated as alternate business primary keys.

When a candidate primary key is not meaningful within the scope of the organization’s business, but is useful for database management, it is designated as the preferred physical primary key. Any remaining candidate primary keys that are not meaningful for the business, but are useful for database management, are designated alternate physical primary keys.

When no candidate primary keys exist for a data subject, a new primary key that is meaningful within the scope of the organization’s business is created and designated as the preferred business primary key.

When no candidate primary keys exist for a data subject, a new primary key that is meaningful within the scope of the organization’s business for database management is created and designated as the preferred physical primary key.

No candidate primary keys are deleted from a common data architecture, even though they are obsolete. They are retained to show that they had been identified and determined to be obsolete.

Using the vehicle example above, all four primary keys become candidate primary keys. The first primary key becomes preferred because it uniquely identifies all vehicles manufactured. The second primary key is an alternate in the United States because it is readily identifiable and is used in nationwide law enforcement systems. The third primary key is an alternate because it could be useful within a specific state. The fourth primary key is obsolete because it contains unnecessary data characteristics.

     Vehicle

          Primary Key: Unique for all vehicles manufactured.

      Preferred

              Vehicle. Identification Number

         Primary Key:  Unique for vehicles in the United States.

           Alternate

             State. Code, ANSI

             Vehicle. License Number

        Primary Key:  Unique for vehicles in a state in the United States.

     Alternate

            Vehicle. License Number

        Primary Key:  Unique for vehicles in a state in the United States.

     Obsolete

            Vehicle. License Number

            Vehicle. Model Name – Unnecessary data

Both business and physical primary keys can be designated. Business primary keys are used for data normalization and physical primary keys are used in database management systems. Each has a specific purpose and needs to be identified.

For example, the primary keys identified for employees are shown below. The first is preferred and is a business primary key. The second is an alternate and is a business primary key. The third is preferred and is a physical primary key.

     Employee

        Primary Key:  Unique for all employees in organization.

     Preferred. Business.

                Employee. Social Security Number

        Primary Key:  Unique for all employees in organization.

     Alternate. Business

                  Employee. Organization Identifier

        Primary Key:  Unique for all employees in organization.

     Preferred. Physical.

                Employee. System Identifier

Data Integration Key

Primary keys are used to uniquely identify data occurrences within a data subject. However, the primary keys in disparate data may not be useful for integrating data occurrences from different data files representing the same data subject. Those disparate primary keys may be unique within their data file, but are not unique across multiple data files, making the integration of data occurrences difficult.

For example, student data in different data files at the school level, the district level, and the state level may have different primary keys. Integrating student data, particularly when students move between schools, districts, and states can be a difficult task. Some means of uniquely identifying students across all schools, districts, and states is needed to provide accurate student data.

A data integration key was defined in the last chapter as a set of data characteristics that could identify possible redundant physical data occurrences in a disparate data resource. A data integration key is not a primary key because it does not uniquely identify a specific data occurrence. It is only used to identify possible identical data occurrences.

The reason that most traditional data integration approaches use a system of record or system of reference is that the lack of a primary key across multiple data sources is not an issue. One primary data source is used and the other secondary data sources are largely ignored. However, that approach is often not acceptable to the business, because data occurrences could be lost in the integration process.

A data integration key for student data might include the student’s birth name, birth date, birth location (hospital, city, state, and country), race, ethnicity, mother’s maiden name, and so on. The data integration key could be many different data characteristics contained in a data occurrence that might be useful for identifying identical data occurrences. Although it doesn’t uniquely identify identical data occurrences, it provides insight for making a judgment call about whether data occurrences represent the same student.

A data integration key is documented in the same way as a primary key, but is labeled as a data integration key. The data integration key for student data is shown below. The data values of a data integration key are used to identify possible matches between students in different data files. Some data values may not exist in all data files, but the existing data values help identify possible matches.

     Student

          Integration Key: Ethnicity. Code

               Race. Code

               Student. Birth Date

               Student. Birth City Name

               Student. Birth Country Name

               Student. Birth Name Complete

               Student. Birth State Name

               Student. Mother Maiden Name

A data integration key index is a table showing the values of a data integration key for each data occurrence, in all data files, within a data subject. A data integration key index is created for each data subject and is labeled with the data subject name. The index may be quite large, since it contains the data integration key values for each data record, in each data file, within a data subject.

Possible matching data occurrences across the data files within a data subject can be identified electronically, but the final determination of identical data occurrences may need to be made manually. Using a data integration key with too few data characteristics may result in unreasonable matches. Using a data integration key with too many data characteristics may result in unnecessary processing time and no matches. The best approach is to identify a data integration key that will most likely identify realistic matches.

Preferred Foreign Key

All foreign keys in a data subject are reviewed to determine the preferred foreign keys that match the preferred primary keys. The criteria for designating foreign keys are listed below.

When a candidate foreign key matches the preferred business primary key in a parent data subject, it is designated as a preferred business foreign key.

When a candidate foreign key matches the preferred physical primary key in a parent data subject, it is designated as a preferred physical foreign key.

When a candidate foreign key matches an alternate business primary key in a parent data subject, it is designated as an alternate business foreign key.

When a candidate foreign key matches an alternate physical primary key in a parent data subject, it is designated as an alternate physical foreign key.

When a candidate foreign key matches a limited primary key in a parent data subject, it is designated as a limited foreign key.

When a candidate foreign key matches an obsolete primary key in a parent data subject, it is designated as an obsolete foreign key.

Remaining candidate foreign keys are reviewed to determine if a parent data subject has a possible corresponding primary key that has not been designated.

When a parent data subject has a corresponding primary key that has not been designated, that designation is made, and the candidate foreign key is designated accordingly.

When a parent data subject has no corresponding primary key, the candidate foreign key is designated as obsolete.

When no candidate foreign keys exist, and one or more parent data subjects exist, foreign keys are created for the parent data subjects based on the primary keys in those parent data subjects.

A foreign key list is a list of the foreign keys for a data subject that exists in the disparate data. Only the data characteristic is listed for each foreign key, not the data characteristic variation. Designating foreign keys, like primary keys, is not dependent on the format or content of the data. It’s only dependent on the value of the foreign key. The content and format will be translated to the preferred data characteristic during data transformation.

Using the employee example above, each employee receives many pay checks. The foreign key list for pay checks is shown below. The first foreign key is to Employee using Employee. Social Security Number, and it is a preferred business foreign key. The other foreign keys are listed accordingly.

     Pay Check

           Foreign Key:    Employee Employee. Social Security Number

          Preferred. Business.

          Foreign Key:     Employee Employee. Organization Identifier

         Alternate. Business

          Foreign Key:     Employee Employee. System Identifier

         Preferred. Physical.

Generally, identifying the primary key referenced by a foreign key is not difficult when the formal data names are used. The data subject name in the foreign key identifies the parent data subject, and the data characteristic names match the data characteristics in the parent data subject. The additional data subject name after the Foreign Key label is a routine practice for designating foreign keys, and is very useful for documenting foreign keys in disparate data.

A common practice is to list all implicit foreign keys. For example, if a class is identified by the course identifier and the academic year identifier, then foreign keys are defined for both the class and the academic year. The practice is routine for documenting all possible navigations between data subjects.

Preferred Data Relations

Data relations between data subjects are designated by the foreign keys. However, only one data relation exists between a parent and subordinate data subject even though many foreign keys may exist between those two data subjects. That data relation becomes the preferred data relation, and represents both a business data relation and a physical data relation.

Preferred Data Sources

A preferred data source was defined in the last chapter as the data product unit or variation that will be the source for a business fact. It’s the location where an individual business fact can be obtained that is the most current and most accurate, and has the highest quality. Data sources that are not designated as preferred data sources become non-preferred data sources.

Conditional data sourcing was defined in the last chapter as the process of selecting preferred data from a variety of different locations based on which location has the most current and most accurate data. The source is based on the most current and most accurate data, not on the format or content of the data, as specified in a preferred data characteristic variation. The format and content can be easily translated during data transformation and is not a consideration in designating data sources.

In other words, the preferred data characteristic variation and the preferred data source are not necessarily the same. The preferred data source is determined independent of the preferred data characteristic variation.

Data source rules were defined in the last chapter as rules specifying the preferred data source where a particular business fact is obtained, and the conditions that determine the preferred source for each business fact. As described in the last chapter and above, a single system or record of reference seldom exists. The typical situation is conditional data sourcing, which is specified by data source rules.

An unconditional data source rule specifies one location where a business fact is obtained. A conditional data source rule specifies multiple locations where a business fact is obtained, and the conditions for specifying the location.

Possible insights for identifying redundant data occurrences in a disparate data resource are listed below. These insights provide some indication of possible redundant data occurrences.

Disparate primary keys could be used to indicate possible redundant data occurrences. However, the primary keys for the same data occurrence are often different in different data files for the same data subject. In addition, the same primary key, such as a system assigned identifier, may appear in data files representing different data subjects.

The data product set or variation definitions could provide some indication of the data occurrences contained in that data product set or variation and whether they might be redundant with data occurrences in another data product set or variation.

The cross-references to a common data architecture could be reviewed to find all of the data product sets or variations for each data subject, either through cross-references to a data subject variation, or through cross-references to data characteristic variations within the data subject.

The same set of data items in a data file could indicate redundant physical data occurrences. However, the same of set of data items does not necessarily mean those data occurrences are redundant.

A data characteristic source list is a list of all the data sources for each data characteristic. The data characteristics are listed for each data subject, and the data sources are listed for each data characteristic. The data sources for each data characteristic are compiled from the cross-references between data product units or variations and data characteristic variations. The data subject and data characteristic names are the logical common data architecture names, and the source names are the physical data product set or variation and data unit or variation names.

For example, the data characteristic source list for a few of the employee data characteristics is shown below. The data product units or variations are listed for the data characteristics within the employee data subject.

     Employee

          Employee. Name

               PYRL.EMPL_NM

               AA.EMP_NAME

               TRAIN.EMPLYE_NM

          Employee. Birth Date

               PYRL.EMPL_BD

               AA.EMP_BIRTH

               TRAIN.EMPLYE_BDATE

          Employee: Race. Code

               PYRL.EMPL_RACE

               AA.EMP_RC

               TRAIN.EMPLYE_ETH

The criteria for specifying the data source rules are listed below.

Prepare a data characteristic source list for the scope of data being integrated to determine data source rules.

For each data characteristic, look at each source of the data and determine which is the best source based on whether data values are present and on the quality of the data, such as the currentness and accuracy of the data.

Develop the unconditional or conditional data source rule according to the availability and quality of the data.

Data source rules use the data rule notation. The data rule identifier is Source! and any conditions follow the When notation. The double left carets (<<) indicate the source of the data, meaning comes from.

The data source rules for the employee data are shown below. The unconditional source for Race. Code is PYRL.EMPL_RACE. The conditional sources for Employee. Name between January 1, 1990 and December 31, 1999 is PYRL.EMPL_NM, and after January 1, 2000 is AA.EMP_NM. The conditional source for Employee. Birth Date is based on the parent Department. Identifier.

     Employee: Race. Code

          Source! << PYRL.EMPL_RACE

     Employee. Name

          Source ! When Employee. Hire Date >= ‘January 1, 1990’

                        & <= ‘December 31, 1999’ << PYRL.EMPL_NM

                        When Employee. Hire Date >= January 1, 2000

                        << AA.EMP_NAME

     Employee. Birth Date

          Source! When Department. Identifier ^ Employee = ’62’ | ‘63’ | ‘64’

                        << PYRL_EMPL_BD

                        When Department. Identifier ^ Employee <> ’62’ | ‘63’ | ‘64’

                        << TRAIN.EMPLYE_BDATE

In some situations, a data value may not exist in the preferred source designated by the data source rule. If missing data values are likely to occur, alternate data sources may be defined in the data source rules. For example, the data source rule for Employee. Birth Date may designate alternate sources, as shown below. The last two conditions were added so that when the data value for birth date was missing, another data source would be used.

     Employee. Birth Date

          Source! When Department. Identifier ^ Employee = ’62’ | ‘63’ | ‘64’

                        << PYRL_EMPL_BD

                        When Department. Identifier ^ Employee <> ’62’ | ‘63’ | ‘64’

                        << TRAIN.EMPLYE_BDATE

           When TRAIN.EMPLYE_BDATE = ‘ ‘ << PYRL.EMPL_BD

           When PYRL.EMPL_BD = ‘ ‘ << AA.EMP_BIRTH

An alternate notation for data source rules can be used where the data source rules are listed for a data subject rather than for a data characteristic, and the data characteristics are listed within the data source rule criteria. The alternate notation avoids repeating the conditional criteria. For example, the student data sources are shown below.

     Student

          Source!

               Student. Race. Code << PYRL.EMPL_RACE

               When Student. Registration Date <= ‘December 31, 1999’

                    Student. Birth Date << STDT.BRTH_DT

                    Student. Name Complete << STDT.NAM

                    And so on…

               When Student. Registration Date >= ‘January 1, 2000’

                    Student. Birth Date << REG.STD_BD

                    Student. Name Complete << REG.STD_NM

                    And so on…

The examples above are simple, but show how data source rules can be written. The conditional criteria could be a date, a type of entity, a country or region, age, residence, date or date range, or any other combination of criteria. The conditional criteria could be from the data entity receiving the source data, or from a parent data entity. Virtually any criteria can be specified to designate the source of the data. The rules can be structured in any manner that is easy to interpret and uses the fewest notations.

Data Occurrences and Data Instances

The examples above show data source rules for a single non-redundant data occurrence in disparate data, and for multiple redundant data occurrences in disparate data. In other words, if stream data for all streams exist in only one data file, that data file becomes the unconditional source for stream data. However, if stream data for all streams exist in more than one data file, then conditional data sourcing needs to be specified for which data items are sourced from each of those data files to form the preferred data occurrence.

A problem arises when not all data occurrences for a data subject exist in any single data file. For example, different data files may contain grade school students, middle school students, and high school students. Also, multiple data files exist for grade school, middle school, and high school students. Student data occurrences are redundant across the data files within a grade level, but not across data files for different grade levels. All of these student data occurrences need to be brought together into a single data file for all students.

The preferred business primary key uniquely identifies each student, but does not appear in the existing data files. A primary key index can be prepared to show the existing primary keys for students in the existing data files and the preferred business primary key for each student. Data source rules are then prepared to define the preferred data sources for each student within each grade level.

A more intricate problem arises when the student data occurrences are redundant across data files within a grade level, and are redundant across data files for different grade levels. For example, several data files contain data for grade school and middle school students, and several data files contain data for middle school and high school students. The middle school students are redundant both within and across grade levels.

Again, the preferred business primary key uniquely identifies each student, but does not appear in the existing data file. A primary key index can be prepared to show the existing primary keys for students in the existing data files and the preferred business primary key for each student. In addition, a data integration key index can be prepared to show possible matching students in the existing data files. Data source rules are then prepared to define the preferred data sources for each student.

The examples above describe current non-redundant and redundant data occurrences between existing data files—current data instances. The situation can become quite detailed, but can be resolved with the use of a primary key index, a data integration key index, and data source rules. However, the situation becomes even more detailed with historical data instances.

Historical data instances are subordinate to current data instances in a one-to-many relationship. In other words, each data occurrence has a current data instance and could have many historical data instances. These historical data instances can be as redundant as the current data instances described above, and they can represent different times.

Historical data instances will have the same business and physical preferred primary key as their parent current data instance, plus a time component. The time component depends on the particular data subject. Geologic events, such as continental drift, may be a decade or century, timber stand growth may be a year, a person’s name change may be a day, a vehicle accident may be an hour, a chemical reaction may be a second, a nuclear reaction may be a micro-second, and so on.

Data source rules are first defined to bring the current data instances together, as described above. Then data source rules are defined to bring the historical data instances together within their parent data occurrence. The additional parameter for defining the historical data source rules is the time component. A different time results in a different historical data instance, regardless of the source of that data instance or the historical data captured.

The historical data source rules specify which historical data items are to be sourced. The data source rules can specify a subset of the historical data items, such as only those data items whose value changed, or they can specify all data items in the historical data instance.

Preferred Data Integrity Rules

A preferred data integrity rule was defined in the last chapter as a data integrity rule that has either been confirmed or created to ensure integrity of a common data architecture. A candidate data integrity rule was defined as a data integrity rule that was documented during the data inventory and brought over to a common data architecture.

Generally, very few data integrity rules exist in the disparate data resource, which is the reason that most disparate data have very low quality. A complete set of precise data integrity rules usually need to be specified within a common data architecture. However, the data integrity rules that do exist in disparate data can be a start to developing precise data integrity rules within a common data architecture.

The best approach is to bring all data integrity rules documented during the data inventory process over to a common data architecture. These candidate data integrity rules are then reviewed to determine if they can become preferred data integrity rules, or if preferred data integrity rules need to be developed.

Preferred data value rules, conditional data value rules, data structure rules, conditional data structure rules, data derivation and rederivation rules, and data retention rules all need to be defined within a common data architecture as described in Chapter 3 and in Data Resource Simplexity. These preferred data integrity rules define the quality needed in a comparate data resource, based on business needs.

The preferred data integrity rules must have a violation action and violation notification defined. Many people resist defining violation actions and notifications because it’s too time consuming. However, the default for data that fail the data integrity rules is to delete and ignore. That default is not practical for data resource integration. Therefore, the violation actions and notifications must be defined.

Many people resist defining a complete set of precise data integrity rules because it’s a time consuming process. However, it’s usually these same people who complain about the low quality of the existing disparate data. The time must be spent developing a complete set of precise data integrity rules if the comparate data resource is to contain high quality data to support business needs.

The preferred data integrity rules are identified with the word Preferred. For example, a preferred data integrity rule for a student’s birth date would be specified as shown below.

     Student. Birth Date, Domain! Preferred

          January 1, 1986 <= Date <= December 31, 1998

Data integrity rules that are not preferred do not have the preferred designation. They usually remain in a common data architecture as an indication of the data integrity that exists in the disparate data.

Multiple Preferred Data Designations

Multiple preferred data designations was defined in the last chapter as the situation where multiple data characteristic variations or multiple data reference set variations are designated as preferred due to culture, geography, or politics.

For example, multiple data characteristic variations could be designated for a course description in English, German, and French, as shown below. The preferred data characteristic variations can be used as necessary for international students.

     Course. Description, English *

     Course. Description, German *

     Course. Description, French *

Similarly, multiple data reference set variations could be designated for management levels in different regions of the world, as shown below. The data reference set variations can be used as appropriate for the world region.

     Management Level; Europe *

     Management Level; Asia *

     Management Level; North America *

Changes Over Time

A common data architecture often changes over time as the business world changes, or as an organization chooses to operate differently in the business world. Those changes need to be represented in the preferred data designations. The criteria for designating changes over time are listed below.

When a data subject name changes, a version stating the relevant dates is added to the former data subject name, and the data subject new name is established with a version stating the relevant dates. All of the data characteristics and data reference set variations are moved to the new data subjects, with appropriate name changes.

When a data characteristic name changes, a version stating the relevant dates is added to the former data characteristic name, and a new data characteristic name is established with a version stating the relevant dates. All of the data characteristic variations are moved to the new data characteristic.

When a data definition changes, the new definition is placed ahead of the old definitions with the effective date. The former definition is retained for historical purposes.

When a preferred data characteristic variation changes, a version notation is added to the former preferred data characteristic variation stating the relevant dates. A new preferred data characteristic variation is designated as preferred, with a version notation stating the relevant dates.

When a preferred data reference set variation changes, a version notation is added to the former preferred data reference set variation stating the relevant dates. A new preferred data reference set variation is designated as  preferred, with a version notation stating the relevant dates.

When a data reference item in a preferred data reference set changes, the end date for that data reference item is added. The former data reference item is not removed from the data reference set. The begin date is added for a new data reference item.

When primary or foreign key designations are changed, a version notation is added to the former key stating the relevant dates. New designations are made, with a version notation stating the relevant dates.

When a preferred data source changes, which seldom happens, a version notation is added to the former preferred data source stating the relevant dates. A new preferred data source is designated, with a version notation stating the relevant dates.

When a preferred data integrity rule changes, a version notation is added to the former preferred data integrity rule stating the relevant dates. A new preferred data integrity rule is established, with a version notation stating the relevant dates.

Preferred Data Architecture Adjustment

Components may have been added to a common data architecture during the preferred data designation process to develop a comparate data resource that supports how the organization desires to do business. However, some components placed in an initial common data architecture, or components added during the preferred data designation and never used, may no longer be relevant for the way the organization desires to do business.

A common data architecture can be reviewed after the preferred data designations have been made to identify all non-relevant components and either remove them or note that they are no longer relevant. The two approaches are described below and either approach is acceptable.

The first approach is to remove all non-relevant components so that the common data architecture represents only components that support data cross-references and preferred data designations. The approach results in a smaller common data architecture, but does not indicate thoughts and insights gained during the preferred data designation process.

The second approach is to retain the non-relevant components and label them as non-relevant with an explanation as to why they are not relevant. The approach results in a larger common data architecture, but documents the thoughts and insights gained during the preferred data designation process.

DATA TRANSLATION RULES

Data translation rules set the stage to actually convert the data values between disparate data and comparate data. A data translation rule was defined in the last chapter as specifying the algorithm for translating data values between preferred and non-preferred data designations, or between different non-preferred data designations when necessary. Preferred data translation rules are between preferred and non-preferred data designations. Non-preferred data translation rules are between non-preferred data designations.

Forward data translation rules are between non-preferred data designations and preferred data designations. Reverse data translation rules are between preferred data designations and non-preferred data designations.

Data translation rules apply to data characteristic variations and data reference set variations. Each of these types of data translation rules is described below.

Data Characteristic Variation Translation

A data characteristic translation rule was defined in the last chapter as a data translation rule that translates data values between non-preferred and preferred variations of a data characteristic. It consists of a source data characteristic variation, a translation algorithm, and a target data characteristic variation.

A data characteristic translation rule is designated with the keyword Translation!. The notation >> is used to indicate goes to or converted to. The notation << is used to indicate comes from or converted from. The data translation rule is normalized to the data characteristic variation on which it takes action, meaning the receiving data characteristic variation.

Data characteristic translation rules are prepared both ways between preferred and non-preferred data characteristic variations. Measurements, such as feet to inches or miles to kilometers, are relatively easy. For example, the translation between well depth in inches and well depth in meters is shown below.

     Well

          Well. Depth, Meters << Inches Translation!

               Well. Depth, Meters = Well. Depth, Inches / 39.39

Names are more difficult to translate, such as a person’s name from normal to inverted sequence. For example, the name of drivers who receive licenses is translated from an inverted sequence to a normal sequence.

     Driver

          Driver. Name Complete, Normal << Inverted Translation!

          Algorithm for parsing and rearranging the driver’s name

A person’s name in an irregular format would require an extensive algorithm to translate that name to a regular format, and many need human intervention to interpret the irregularity and translate it to a specific format. Translating a person’s name from a regular format to an irregular format would not be possible.

Some translations, such as state plane coordinates to latitude and longitude, are the most difficult and require a detailed mathematical algorithm.

Some specific locations can be translated to more general locations. For example, a construction accident location specified by latitude and longitude to 1 minute accuracy could be translated to a more general location, such as a 40-acre parcel. However, it would not be possible to translate a well location from a general 40-acre parcel to a specific latitude and longitude location with 1 minute accuracy.

Some location translations require more than one source data characteristic and more than one target data characteristic. For example, translating a well location specified as latitude and longitude (Lat / Lon) to a well location specified as a section, tier, and range (PLS for Public Land Survey) requires two source data characteristics and three target data characteristics.

     Well

          Well Location PLS << Lat / Lon Translation!

               Well. Latitude Degrees, 2 Digits

               Well. Longitude Degrees, 2 Digits

               Well. Section Number

               Well. Tier Number

               Well. Range Number

               Algorithm for converting Lat / Lon to PLS

The translation of data value length variations is relatively easy, if the data value does not need to be truncated. Translating from a shorter data value to a longer data value is easy. Translating from a longer data value to a shorter data value can be difficult if the source data value is longer than the target. If the data value needs to be truncated, the translation algorithm becomes very detailed, or human intervention is needed.

Translating data values from right to left justified, or from left to right justified is relatively easy. The translation algorithm specifies the shifting of the data value either right or left.

Translating data values that have different meaning is not valid. For example, translating Well. Depth, Estimated Feet to Well. Depths, Measured Meters is not valid because the data values have different meanings.

Data characteristic translation rules are maintained until all disparate data have been permanently translated to a comparate data resource. The rules then become inactive, but are retained for historical purposes.

Fundamental Data Translation Rules

The fundamental data concept can be used for defining data translation rules. Fundamental data translation rules are basic translation rules that can be inherited in many specific data translations. Specific data translation rules apply directly to the data and can inherit fundamental data translation rules.

Fundamental data translation rules are stored in a Translation Rule data subject. The data characteristics are the specific translation rules. Since data translation rules are specified both ways between non-preferred and preferred data characteristic variations, two translation rules are defined.

For example, the two data translation rules for translating between a date CYMD and a date YMD are shown below. The algorithm for making the translation is shown in the contents of each data characteristic.

     Translation Rule. Date CYMD >> Date YMD

     Translation Rule. Date YMD >> Date CYMD

Similarly, the translation between meters and inches is shown below.

     Translation Rule. Meters >> Inches

     Translation Rule. Inches >> Meters

However, in the case of measurement, the translation could have different accuracies. The data characteristic variation indicates the accuracy, as shown below.

     Translation Rule. Meters >> Inches, 0.001

     Translation Rule. Meters >> Inches, 0.1

An alternative notation is to show the  number of significant digits as the data characteristic variation. For example, the translation from meters to inches could be 39.3900787401575, or 39.4, or 39.0. The data translation rule notations are shown below.

     Translation Rule. Meters >> Inches, 13 Digits

     Translation Rule. Meters >> Inches, 1 Digit

     Translation Rule. Meters >> Inches, 0 Digits

Fundamental data translation rules can be used for specific data translations. For example, translating an employee’s birth date from MDY format to CYMD format using a fundamental data translation rule is shown below.

     Employee

         Employee. Birth Date, CYMD << MDY Translation!

               Translation Rule. CYMD << MDY

Similarly, Pi could be defined in a data translation rule as 3.14159265 or 3.14, as shown below.

     Translation Rule. Pi, 8 Digits

     Translation Rule. Pi, 2 Digits

Pi can then be used in specific data translation rules. For example, a survey plot area is calculated from its radius using the Pi translation rule.

     Survey Plot

          Survey Plot. Area << Radius Translation!

               Survey Plot. Area, 2 Digits = Survey Plot. Radius, 3  Digits ** 2

                   * Translation Rule. Pi, 2 Digits

Fundamental data translation rules can be developed for a majority of the data characteristic variation translations. Very few specific data characteristic translation rules need to be defined. Fundamental data translation rules minimize the documentation and maximize the consistency of data characteristic variation translation.

Data Reference Item Translation

A data reference item translation rule was defined in the last chapter as translating coded data values and names between data reference items in preferred and non-preferred data reference set variations in a data subject. The data reference set translation rule only translates the data value, not the format of the data value. The data characteristic translation rule for the coded data value and name translates the format.

A one-to-one data reference item translation rule translates the coded data value and/or the name from one data reference item in the source to one data reference item in the target, which is relatively easy. A many-to-one data reference item translation rule translates the coded data value and/or the name from many different data reference items in the source to one data reference item in the target, which is relatively easy. A one-to-many data reference item translation rule translates one coded data value and/or name from the source to many data reference items in the target, which can be difficult and may require additional input.

Data reference item translation rules are usually a matrix of the coded data values and names that shows both the non-preferred and the preferred sets of data reference items. The matrix is the translation rule that can be used to translate from the non-preferred to the preferred data reference items, and from the preferred to the non-preferred data reference items.

The data reference item translation rule is named by the data subject containing the data reference set variations. For example, the matrix for disability data reference item translations is named Disability. Translation!, the matrix for vehicle horsepower data reference items translations is named Vehicle Horsepower. Translation!, and the matrix for management level data reference item translations is named Management Level. Translation!. The matrix is documented with the data subject containing the data reference set variations.

The data reference item translation rule for disability is shown below. The coded data values and names are shown for the three disparate data reference set variations, and for the new preferred data reference set variation. The coded data values and names can be translated either way based on the matrix of values.

   Disability. Translation!   

     School;               Employment;            Health;   New; *

     10   Sight       A   Seeing V   Vision   S   Sight

     20   Hearing       H   Hear S   Sound   H   Hearing

     30   Physical       P    Physical A   Accidental   P   Physical

     40   Develop       D   Developed G   Genetic   D   Developmental

The data reference item translation rule for vehicle horsepower is shown below. The matrix shows the combination of two distinctly separate disparate data reference set variations combined into one all-inclusive preferred data reference set variation. The coded data values and names can be translated either direction.

     Vehicle Horsepower. Translation!

          Small;    Large;         New; *

          A     Below 100            1     Below 100 Horsepower

          B     100-199            2     100-199 Horsepower

          C     200-349            3     200-349 Horsepower

          D     350-499            4     350-499 Horsepower

              1     500-749         5     500-749 Horsepower

              2     750-999         6     750-999 Horsepower

              3     1000-1499       7     1000-1499 Horsepower

              4     1500+         8   Above 1500 Horsepower

The data reference item translation rule for management level is shown below. The matrix shows the combination of two overlapping disparate data reference set variations combined into one all-inclusive preferred data reference set variation. The coded data values and names can be translated either way.

     Management Level. Translation!

          New; * Personnel; Finance;

          Owner 01 Owner/Partner OP

          Senior Executive 02 Executive EX Senior Manager 041

          Line Manager 03 Manager MN Line Manager 163

          Unit Supervisor 04 Supervisor SP Unit Manager 239

          Lead Member 05 Lead  Worker LW Team Manager 157

          Team Member 06 Team Member 445

The data reference item translation rule for ecological regions is shown below. The matrix shows a reduction of a detailed disparate data reference set variation into a new, more general preferred data reference set variation. The coded data values and names can be translated from the non-preferred to the preferred, but cannot be translated from the preferred to the non-preferred without additional input.

     Ecological Region. Translation!

          Wildlife; New; *

          1 Northeast E East

          2 East Central E East

          3 Southeast E East

          4 Southern Midwest M Midwest

          5 Central Midwest M Midwest

          6 Northern Midwest M Midwest

          7 Southwest W West

          8 Northwest W West

These data reference item translation rules are simple, but they cover the range of possibilities, and can be used to develop more detailed data reference item translation rules.

PREFERRED PHYSICAL DATA ARCHITECTURE

The preferred physical data architecture can be developed when the preferred logical data architecture designation has been completed. The preferred physical data architecture is developed according to a formal set of data denormalization criteria. These data denormalization criteria were discussed in Chapter 3 and described in detail in Data Resource Simplexity.

The primary objective of formal data denormalization is to develop a physical data architecture that is optimum for a particular operating environment, without compromising the logical data architecture. Note that the entire data architecture is adjusted, not just the structure. The major problems with traditional data denormalization are that only structure is adjusted, and the logical data model is often compromised. The result is a data resource that does not have high quality and does not meet business needs.

All four components of a data architecture are denormalized. First, the data structure is denormalized according to the formal data denormalization criteria. Second, the logical data names are changed to physical data names on the denormalized data structure using a set of data name word abbreviations and an abbreviation algorithm. Third, the logical data definitions are adjusted to the physical data structure. Fourth, the data integrity rules are adjusted to the physical data structure and become the data edits that need to be implemented to ensure data quality.

The resulting physical data architecture is documented as a data product with a designation of preferred. No data cross-references are made between the preferred physical data architecture and a common data architecture. That preferred physical data model is then used to design the comparate data resource, and for data transformation between disparate data and comparate data.

PREFERRED DATA TEMPLATES

A preferred data template was defined in the last chapter as a subset of the preferred logical data architecture for a specific subject area to promote data sharing between organizations and help organization develop applications and databases. Generally, a preferred data template is prepared for the preferred logical data architecture so an organization can formally denormalize that architecture for their particular operating environment.

Preferred data templates are most useful in the public sector where data are common across many different public sector organizations. For example, geologic data, topographic data, streams, roads, power lines, and so on, benefit from preferred data templates. Similarly, crime and criminal data, vehicle registration data, business license data, and so on, benefit from preferred data templates.

SUMMARY

Preferred data designations are the next step after data cross-referencing has been completed. Preferred data designations include preferred data names, preferred data definitions, preferred data characteristic variations, preferred data reference set variations, preferred data subjects, preferred primary and foreign keys, preferred data sources, and preferred data integrity rules. Specific criteria guide the preferred data designation process.

A data integration key is used to identify possible matching data occurrences in data files that contain redundant data. Historical data instances are designated as data subjects subordinate to the data subject containing the current data instance. Multiple preferred data designations may be made for cultural, geographical, or political differences. Changes over time for data names, data definitions, data structure, and data integrity rules are designated with the version notation.

Data translation rules are developed for translating data values between preferred and non-preferred data characteristic variations, and between preferred and non-preferred data reference set variation. Fundamental data translation rules are developed to support specific data translation rules. Fundamental data translation rules provide maximum consistency with minimum effort.

A preferred physical data architecture is developed after the preferred logical data architecture has been designated based on formal data denormalization criteria. Preferred data templates to support data sharing between organizations can be prepared based on the preferred logical data architecture. Preferred data templates are most useful for sharing data between public sector organizations.

The preferred data designation process provides a preferred logical data architecture and a preferred physical data architecture for developing a comparate data resource and transforming disparate data to comparate data. The process ends the documentation and understanding aspects of existing disparate data and sets the stage for physically transforming the disparate data and building a comparate data resource that has high quality and meets business needs.

QUESTIONS

The following questions are provided as a review of preferred data designations and translation schemes, and to stimulate thought about defining the preferred data architecture for an organization.

  1. What preferred data designations are made during the preferred data designation process?
  2. How is a data integration key useful for making preferred data designations?
  3. How are preferred data definitions prepared?
  4. How are preferred primary keys and foreign keys determined?
  5. What is done with primary keys and foreign keys that are not designated as preferred?
  6. How are preferred data sources determined?
  7. How are preferred data integrity rules prepared?
  8. How are redundant data occurrences and data instances identified?
  9. How is a preferred physical data architecture prepared?
  10. How are data translation rules prepared?
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.15.182.62