Chapter 8

DATA CROSS-REFERENCING PROCESS

How to go about cross-referencing disparate data.

Data inventorying began the process of thoroughly understanding disparate data. An initial understanding was gained from existing sources of insight about the disparate data. That initial understanding was documented during the data inventory process. Data cross-referencing expands that initial understanding into an understanding within a common context based on an organization’s perception of the business world.

Chapter 7 described the concepts and principles for cross-referencing the data inventory to a common data architecture. Chapter 8 describes the process and techniques for performing the data cross-referencing between data products and a common data architecture. The extent of disparate data variability and redundancy can be determined after the cross-referencing has been completed.

Data cross-referencing is a non-destructive process that sets the stage for designating the preferred data architecture, which is used to build a comparate data resource. An initial common data architecture is developed based on the organization’s perception of the business world, and is enhanced during the cross-referencing process. The cross-referencing normalizes the basic components of disparate data within that common data architecture.

DATA CROSS-REFERENCE PREPARATION

Preparation for data cross-referencing includes defining the scope of cross-referencing, defining the sequence of cross-referencing, determining who will be involved in cross-referencing, determining how the cross-references will be documented, establishing an initial common data architecture, maintaining common data definitions, establishing a data subject thesaurus, and establishing a list of common data name words. Each of these topics is described below.

Data Cross-Reference Scope

Setting the scope of data cross-referencing is relatively easy. Data cross-referencing cannot be performed until the disparate data have been inventoried. Therefore, the scope of the data inventory sets the scope of data cross-referencing. Within that scope, the data can be cross-referenced to a common data architecture as soon as an initial common data architecture has been established and enough understanding is available to make the cross-reference.

Data Cross-Referencing Sequence

The sequence of data inventory and data cross-referencing was described in Chapter 6. The data inventory could be completed within the defined scope followed by data cross-referencing within that scope, or data cross-referencing could be performed during the data inventory process.

The primary criterion is that data cross-referencing can be performed as soon as enough understanding has been gained to make a reasonable determination about the cross-reference. Over the years, I’ve found that the sequence is largely a personal preference. Some people prefer to make a cross-reference during the data inventory as soon as sufficient understanding is available. Some people prefer to complete the entire data inventory and then proceed with data cross-referencing.

Either approach is acceptable because the option always exists to make changes to the data cross-references anytime additional insight is gained. Additional insight could lead to an enhancement of the data inventory, which could lead to changes in the data cross-references. The discovery nature of data inventory and data cross-referencing is the normal operating procedure.

Data Cross-Referencing Involvement

The data inventory process included anyone who had insight about the disparate data. Many different individuals were involved in the process because the objective was to gain as much insight about disparate data as possible. Literally dozens, or even hundreds, of people could be involved in the data inventory process. However, a core team drove the data inventory process.

A core team also drives the data cross-referencing process. Few additional people are involved in the data cross-referencing process because of the detailed nature of the cross-referencing. The core team members may reach out to individuals for clarifications before making the cross-references, but the actual cross-referencing is done by relatively few people.

The results of the data cross-referencing should be readily available for anyone to review and comment. Since data cross-referencing increases understanding of the disparate data in a common context, that increased understanding must be readily available to anyone in the organization. Making the data cross-referencing readily available not only encourages comments and additional insights, but it helps people understand the impacts of disparate data and encourages them to support creation of a comparate data resource.

Data Cross-Referencing Documentation

A common data architecture provides the construct for the data cross-referencing process and for documenting the cross-references. Each organization needs to determine how and where a common data architecture will be documented based on their physical operating environment. Software products could be obtained for documenting a common data architecture and the data cross-references, but those products must have the capability of documenting a common data architecture and the data cross-references as described in the current book. Organizations should not consider acquiring a software product that does not support a common data architecture and data cross-referencing as described here.

Many organizations develop their own software for documenting the data inventory, common data architecture, and data cross-references. Although the approach requires some initial resources, the result is a product that fits well with the organization’s physical operating environment and a common data architecture, and is readily available to everyone in the organization.

Initial Common Data Architecture

An initial common data architecture is developed prior to beginning the data cross-referencing process. That initial common data architecture is based on an organization’s perception of the business world and only includes major business objects and events within the scope of the data inventory and data cross-referencing. An initial common data architecture is not intended to be a complete data architecture for the organization. It is intended to provide a starting point for data cross-referencing.

An initial common data architecture is continually enhanced during data cross-referencing, based on the data that are being cross-referenced. The result is a common data architecture that accurately represents the disparate data and can be used to designate a preferred data architecture.

For example, if the scope of data inventory and cross-referencing is about primary education, the major business objects and events might be Student, School District, School, Grade Level, and Academic Year. These business objects and events would be documented as data subjects and an initial definition would be prepared based on the business.

Major business features for each of the business objects and events would be documented as data characteristics with initial definitions. For example, Student. Name Complete, School District. Name, School District. State Identifier, School. Name, School. State Identifier, Grade Level. Name, Academic Year. Begin Date, and Academic Year. End Date might be the major business features.

Each wave of data inventory and data cross-referencing may require an enhancement to a common data architecture for the major business objects and events within the scope of that wave. However, that enhancement is only enough to start the data cross-referencing. It’s not intended to produce a complete data architecture.

For example, the next wave of data inventory and cross-referencing might include all of the classes and teachers for primary education. Initial business objects and events might be Course, Course Section, Education Program, and Educator. These would be documented as data subjects in the initial common data architecture.

Initial business features might be Course. Identifier, Course. Name, Course Section. Identifier, Education Program. Identifier, Education Program. Name, and Educator. Name Complete. These would be documented as data characteristics.

Common Data Definitions

Data definitions are crucial for making the proper cross-references. Initial data definitions are prepared for components of the initial common data architecture. The data definitions should be as comprehensive as possible, including what is and what is not included. An initial data definition is prepared for any new common data architecture components, and those data definitions are continually enhanced during data cross-referencing.

Changes to a data definition, other than enhancements and clarifications, require careful consideration. All existing cross-references should be reviewed to determine if the changed definition would alter any of those cross-references. If the changed definition requires a change in the cross-references, then those changes are made.

Data Subject Thesaurus

A data subject thesaurus is a list of synonyms and related business terms that help people find data subjects that support their business information needs. It’s a list of business terms and alias data subject names that point to the formal data subject, as described in the last chapter.

A data subject thesaurus needs to be established before data cross-referencing begins. Each data subject name is entered into the data subject thesaurus, and any business terms or alias data subject names are listed for that data subject name.

For example, Student would be entered into the data subject thesaurus. Related terms might be Pupil, Attendee, Participant, and so on. These terms would be entered as aliases pointing to Student. Similarly, Educator would be entered into the data subject thesaurus. Related terms might be Teacher, Instructor, Trainer, and so on. These terms would be entered as aliases to Educator.

The data subject thesaurus is continually enhanced during and after the data cross-referencing process. Anytime a new data subject is considered, the data subject thesaurus should be checked to determine if that data subject already exists. If the data subject does not exist, then the new data subject can be added to a common data architecture. Checking the data subject thesaurus before creating a new data subject ensures that no data subject synonyms or homonyms are developed.

Anytime a new data subject is added to a common data architecture, an entry is made into the data subject thesaurus, including all possible aliases for that new data subject. Anytime a new alias name for a data subject is encountered, an entry is made into the data subject thesaurus. The result is a comprehensive list of alias terms and formal data subject names.
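The thesaurus check described above can be sketched in a few lines of Python. This is only an illustration, not a prescribed implementation: the class name DataSubjectThesaurus and its methods are assumptions made for the example, and a real thesaurus would live in whatever documentation product the organization uses.

```python
class DataSubjectThesaurus:
    """Maps business terms and alias names to formal data subject names."""

    def __init__(self):
        self._subject_by_term = {}

    @staticmethod
    def _key(term):
        # Match terms regardless of case or surrounding whitespace.
        return term.strip().lower()

    def find(self, term):
        """Return the formal data subject a term points to, or None."""
        return self._subject_by_term.get(self._key(term))

    def add(self, subject, aliases=()):
        """Enter a data subject name and its aliases.

        Raises ValueError if any term already points to another data
        subject, which would otherwise create a synonym or homonym.
        """
        terms = [subject, *aliases]
        for term in terms:
            existing = self.find(term)
            if existing is not None and existing != subject:
                raise ValueError(f"{term!r} already points to {existing!r}")
        for term in terms:
            self._subject_by_term[self._key(term)] = subject


thesaurus = DataSubjectThesaurus()
thesaurus.add("Student", ["Pupil", "Attendee", "Participant"])
thesaurus.add("Educator", ["Teacher", "Instructor", "Trainer"])

print(thesaurus.find("pupil"))   # Student
print(thesaurus.find("Course"))  # None, so a new data subject may be needed
```

Checking find before calling add mirrors the rule that the thesaurus is consulted before any new data subject is created.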

Common Words

A common word is a word that has consistent meaning whenever it is used in a data name. A list of common words must be established for data subjects, data characteristics, and data characteristic variations before data cross-referencing begins. Any word used in a data name that has a common meaning is documented as a common word and is used consistently throughout a common data architecture.

For example, common data subject words might be Activity, History, and Suspense, meaning transaction data, historical data, and data pending some action, respectively. Common data characteristic words might be Number, Amount, Count, and Quantity, meaning an identifying number, a monetary amount, a count of items, and a capacity or size, respectively. Common data characteristic variation words might be Estimated, Measured, Normal, Inverted, and Irregular.

Establishing and maintaining a set of common words and using them consistently for all data names within a common data architecture ensures that the data names are readily understood and have a consistent meaning.

DATA CROSS-REFERENCES

Three data cross-references are made between data products and a common data architecture, as shown in Figure 7.2. The first is between data product sets or variations and data subjects; the second is between data product units or variations and data characteristic variations; and the third is between data product codes or variations and data reference set variations. Each of these cross-references is described below.

Data Product Set Cross-References

A data product set cross-reference was defined in the last chapter. It is one of the three cross-references between data products and a common data architecture that identifies a data product set or variation as representing a specific subset of a data subject.

For example, a data product set named STDT_DATA references a data file containing a subset of student data representing middle school students. The data product set scope statement describes the subset of data for middle school students. The corresponding data subject would be Student, with a data subject variation for [Middle School] Student. The data subject variation definition describes the subset of middle school student data. A cross-reference is made between STDT_DATA and [Middle School] Student.

Note that the cross-reference in no way indicates that the data file represented by the data product set contains all of the data items for a student. The cross-reference only identifies the subset of data contained in the data file.

Another example is a data product set named Contractor, which represents a data file containing data about general contractors. General contractors are considered by the organization to be a role for contractors. A data subject variation is created within the Contractor data subject for “General” Contractor. A cross-reference is made between Contractor and “General” Contractor.

A third example is data subject variations for manifestations of a data perspective involved in aggregated data analysis. The data subject might be Timber Analytics Focus for the analysis of the growth and harvesting of timber stands. The data subject variations might be Timber Analytics 1, Timber Analytics 2, and so on. The definition would explain the manifestations. The cross-reference might be between a data product set variation representing a data set in a data hierarchy and Timber Stand Analytics 4.

A fourth example is a selection of timber stand data for data analysis. Timber stands are selected for the Douglas fir species, above 2000 feet in elevation, that originated between 1920 and 1970. The data subject variation might be [Selection 3] Timber Stand Analytics. The definition would explain the selection. The cross-reference would be between a data product set representing the aggregated data hierarchy and [Selection 3] Timber Stand Analytics.

Data Product Set Cross-Reference Criteria

The following criteria are used for enhancing a common data architecture to provide cross-references between data product sets or variations and data subject variations. A common data architecture is searched for an appropriate data subject and data subject variation. The data subject thesaurus is helpful for finding data subjects.

When an appropriate data subject variation is found, the cross-reference is made.

When an appropriate data subject variation is not found, a new data subject variation is created and the cross-reference is made.

When an appropriate data subject is not found, a new data subject is created, a data subject variation is created, and the cross-reference is made.

When a new data subject is established, entries are made into the data subject thesaurus for that data subject and any aliases to that data subject.

Data Product Set Cross-Reference Comment

Cross-reference comments can be made at the time of the cross-reference or after the cross-reference. The comment may include any insight into the validity of the cross-reference, additional insight that needs to be gained, and so on. The comment should contain the name of the person making the comment, the date of the comment, and the source of any insight leading to the comment.

Data Product Unit Cross-Reference

A data product unit cross-reference was defined in the last chapter. It is one of the three cross-references between data products and a common data architecture that takes the basic disparate data product units and normalizes them within a common data architecture.

Data Product Unit Cross-Reference List

A data product unit cross-reference list shows the data product units or variations and their respective data characteristic variation. Data product units or variations are listed on the left in the order they appear in the data file. The corresponding data characteristic variations are listed on the right. The list shows the data product units exactly as they appear in the data product, including sequence, spelling, capitalization, and punctuation.

For example, data product unit cross-references for the Water Right File are shown below.

Water Right File

   Control Number

      Type Water            Water Resource Category. Code, 1 Numeric

      Region                Washington Ecology Region. Number, 3 Alpha

      Old New               Water Right Number Status. Code, 1 Numeric

      Assigned Number       Water Right. Number, 6 Numeric

      Stage                 Water Right Stage. Code, 1 Alpha

      Record Modifier       Water Right Record. Modifier, 2 Numeric

      Reason For Modifier   Water Right Record Reason. Code, 2 Alpha

   AA Transaction

      Trans Code            Water Right Transaction. Code, 1 Numeric

      County                Washington County. Number, 2 Numeric

      Status                Water Right Status. Code, 1 Alpha

      Name                  Water Right Processor. Name, Variable

      Number of POD/W       Water Right Removal Site. Count, 2 Numeric

      Repeat POD/W          Water Right Removal. Repeat Count, 1 Numeric

      Location of POD/W     Water Right Removal. Location Detail, 42 Text

      WRIA                  Water Resource Inventory Area. Number, 4 Alpha

      Section               Water Right Section. Number, 2 Alpha

      Township              Water Right Tier. Number, 3 Alpha

      Range                 Water Right Range. Number, 3 Alpha

      E or W                Water Right Range. Direction, 1 Alpha

The cross-referencing is based on reasoning, definitions, knowledge of the disparate data, and knowledge of a common data architecture. The data product units or variations are taken one at a time and a common data architecture is searched to find an appropriate data characteristic variation.

When an apparent match is found, the data product unit or variation is reviewed to make sure it fits within the data characteristic variation definition, and the data characteristic variation definition is reviewed to make sure it encompasses the data product unit or variation. If the review results in a match, the cross-reference is made. If a match is not found, the search continues, and may result in creation of a new data characteristic variation.

Any difference in format or content results in a different data characteristic variation. The data characteristic variation name must represent that difference in format or content. The name can include the format, content, length, or any other words representing the variation of a data characteristic, as shown below.

     Person. Name Complete

        Person. Name Complete, Normal 48

        Person. Name Complete, Normal 55

        Person. Name Complete, Normal 60

        Person. Name Complete, Inverted 48

        Person. Name Complete, Inverted 55

        Person. Name Complete, Irregular

     Vehicle. Model Name

        Vehicle. Model Name, 50 Right

        Vehicle. Model Name, 35 Right

        Vehicle. Model Name, 28 Left

     Well. Depth

        Well. Depth, Measured Laser

        Well. Depth, Measured Physical

        Well. Depth, Estimated
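Names written in this notation can be split mechanically into their parts, which is useful for checking that a name is well formed. The sketch below assumes only what the examples show: the data subject ends with a period, and the data characteristic is separated from its variation by a comma. The function name parse_data_name and the DataName type are hypothetical.

```python
from typing import NamedTuple, Optional


class DataName(NamedTuple):
    subject: str
    characteristic: str
    variation: Optional[str]  # None when the name stops at the characteristic


def parse_data_name(name: str) -> DataName:
    """Split 'Subject. Characteristic, Variation' into its three parts."""
    subject, period, rest = name.strip().partition(". ")
    if not period or not rest:
        raise ValueError(f"no 'Subject. Characteristic' structure in {name!r}")
    characteristic, _, variation = rest.partition(", ")
    return DataName(subject.strip(), characteristic.strip(),
                    variation.strip() or None)


print(parse_data_name("Person. Name Complete, Inverted 55"))
print(parse_data_name("Well. Depth"))

# A comma where the period belongs is rejected rather than silently accepted.
try:
    parse_data_name("Vehicle, Model Name")
except ValueError as error:
    print(error)
```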

Data Product Unit Cross-Referencing Criteria

The following criteria are used for enhancing a common data architecture to provide appropriate cross-references between data product units or variations and data characteristic variations:

When an appropriate data characteristic variation is found, the cross-reference is made.

When an appropriate data characteristic variation is not found, a new data characteristic variation is established and the cross-reference is made.

When an appropriate data characteristic is not found, a new data characteristic is established, a new data characteristic variation is established, and the cross-reference is made. The new data characteristic must not be a synonym or homonym of an existing data characteristic. The use of common words will help identify possible synonyms and homonyms.

When an appropriate data subject is not found after searching the data subject thesaurus for a possible match, a new data subject is established, a new data characteristic is established, a new data characteristic variation is established, and the cross-reference is made. The new data subject must not be a synonym or homonym of an existing data subject.

When a new data subject is established, entries are made into the data subject thesaurus for that data subject and any aliases to that data subject.

Any new data characteristic variations, data characteristics, or data subjects must be unique to ensure a common data architecture remains stable. Entering a new component to a common data architecture without checking for an appropriate component that already exists is often tempting. However, every effort should be made to ensure that each new component is unique.
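The criteria above amount to a find-or-create cascade from data subject to data characteristic to data characteristic variation. The following Python sketch shows only the bookkeeping. It assumes the judgment calls (searching the data subject thesaurus, reviewing definitions, ruling out synonyms and homonyms) have already been made by a person. The nested-dictionary layout, the function name, and the data product unit names are all hypothetical.

```python
def cross_reference_unit(architecture, cross_references, unit,
                         subject, characteristic, variation):
    """Record a cross-reference, creating any missing components.

    architecture: {subject: {characteristic: set of variations}}
    cross_references: {data product unit: (subject, characteristic, variation)}
    Returns the components that had to be created, in creation order.
    """
    created = []
    if subject not in architecture:
        architecture[subject] = {}
        created.append(("data subject", subject))
    characteristics = architecture[subject]
    if characteristic not in characteristics:
        characteristics[characteristic] = set()
        created.append(("data characteristic", characteristic))
    variations = characteristics[characteristic]
    if variation not in variations:
        variations.add(variation)
        created.append(("data characteristic variation", variation))
    cross_references[unit] = (subject, characteristic, variation)
    return created


architecture = {"Well": {"Depth": {"Measured Laser", "Measured Physical"}}}
cross_references = {}

# The variation already exists, so nothing is created.
print(cross_reference_unit(architecture, cross_references,
                           "WELL_DPTH_LSR", "Well", "Depth", "Measured Laser"))

# Only the variation is missing.
print(cross_reference_unit(architecture, cross_references,
                           "WELL_DPTH_EST", "Well", "Depth", "Estimated"))

# Nothing exists yet, so all three levels are created.
print(cross_reference_unit(architecture, cross_references,
                           "STRM_WDTH", "Stream", "Width", "Measured"))
```

Returning the list of created components makes it easy to see which cross-references enhanced the common data architecture and therefore deserve an extra review for uniqueness.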

Data product unit cross-referencing may take many iterations and many reviews by knowledgeable people to determine the appropriate cross-reference. Sometimes the cross-reference is a match, but the data characteristic variation definition does not fully encompass the data product unit or variation definition. In that situation, the data characteristic variation definition is enhanced.

Similarly, the data characteristic definition and data subject definitions need to be reviewed to determine if they encompass the data characteristic variation definitions. Those definitions may need to be enhanced accordingly. The process ensures that the common data definitions are always comprehensive.

If the data product unit or variation is found to represent multiple or variable business facts, then the data inventory needs to be adjusted to break the disparate data down into basic components. Then the data cross-referencing can proceed. For example, a data product unit for a stream size actually contains the width of the stream and the depth of the stream. Those two facts need to be defined as data product unit variations in the data inventory, and each of those basic components needs to be cross-referenced to an appropriate data characteristic variation.

Any difference in meaning results in a different data characteristic, not in a variation of a data characteristic. For example, several data product units may exist for the depth of a water well, and cross-referencing those data product units to variations of a water well depth data characteristic might seem easy. However, a review of the definitions shows that some of those depths are the total depth of the well and others are the depth to water. Because the meanings differ, total well depth and depth to water are two different data characteristics, and each data product unit is cross-referenced to a variation of the data characteristic that matches its meaning.

Data Product Unit Cross-Reference Comment

Cross-reference comments can be made at the time the cross-reference is made or after the cross-reference. The comment may include any insight into the validity of the cross-reference, additional insight that needs to be gained, and so on. That comment may be enhanced at any time during the cross-reference process.

Cross-reference comments are typically retained rather than being deleted. For example, if a cross-reference comment were made questioning the validity of the cross-reference, and it was later determined that the cross-reference was valid, an additional statement is added that the validity was confirmed. The process provides a good audit trail of concerns about cross-references and the resolution of those concerns.

Each cross-reference comment should contain the name of the person making the comment, the date of the comment, and the source of any insight leading to the comment.
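One way to keep such an audit trail is to store comments as an append-only list on the cross-reference, with each comment carrying the three required items. The Python sketch below is illustrative only. The class and field names, the people, and the dates are made up for the example.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)  # a comment is never edited once it is made
class CrossReferenceComment:
    author: str
    made_on: date
    source: str
    text: str


@dataclass
class CrossReference:
    data_product_unit: str
    data_characteristic_variation: str
    comments: list = field(default_factory=list)

    def add_comment(self, author, made_on, source, text):
        # Comments are only ever appended, so earlier concerns stay visible.
        self.comments.append(
            CrossReferenceComment(author, made_on, source, text))


xref = CrossReference("Township", "Water Right Tier. Number, 3 Alpha")
xref.add_comment("J. Analyst", date(2024, 3, 4), "Data dictionary review",
                 "Validity questioned: unclear whether Township is a tier.")
xref.add_comment("K. Steward", date(2024, 3, 18), "Application program review",
                 "Validity confirmed: Township holds the tier number.")

for comment in xref.comments:
    print(comment.made_on, comment.author, "-", comment.text)
```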

Data Product Code Cross-Reference

A data product code cross-reference was defined in the last chapter. It is one of the three cross-references between data products and a common data architecture that takes the basic disparate data code properties and normalizes them to a data reference set variation within a common data architecture.

Data Reference Set Variation

A data reference set variation contains a specific set of data reference items. Even though individual data product codes or variations are cross-referenced to a data reference set variation, the complete set of data product codes for a particular data product unit or variation must match the complete set of data reference items in the data reference set variation. Any difference in the data product code values, data product code name, data product code meaning, or domain of data product codes results in a different data reference set variation.

Finding the proper data reference set variation requires looking at all the candidate data reference set variations to determine if they contain an exact match to the set of data product codes or variations being cross-referenced. If an exact match is not found, a new data reference set variation is created.

If a data reference set, which is a data subject, does not exist, then a new data subject is created and defined. The definition should describe the meaning of the set of codes contained in that data subject. As with any new data subject, an entry is made in the data subject thesaurus, including any alias data subject names. Then data reference set variations are created within that data subject.

The data subject name should reflect the contents of the data in the set of data reference items, such as Management Level, Gender, Ethnicity, and so on. The data reference set variation names include qualifiers to the data subject name using the data naming taxonomy notation, such as Management Level. Personnel; or Management Level. Finance;.

Data Reference Item List

A data reference item list is a listing of all of the data reference items in a data reference set variation, including the data reference item codes, data reference item names, and data reference item definitions.

For example, a data subject would be defined for Management Level with a definition of the meaning of management level. A data reference set variation would be created for Management Level. Personnel 1; that contained the data reference items shown below.

E Executive Above pay range 16

M Manager Pay range 12 to 1

S Supervisor Pay range 9 to 1

L Lead Worker Pay range 6 to

W Worker Pay range 5 and below

Data product codes or variations from a data product unit or variation that contained that exact set of codes (code value, name, domain, and meaning) would be cross-referenced to the data reference set variation.

For example, a data product unit contains the domain of codes E, M, S, L, and W, with names and definitions as shown above, which are defined as data product codes. Each of those data product codes is cross-referenced to the Management Level. Personnel 1; data reference set variation.

Another data product unit in another data file might contain the domain of names Executive, Manager, Supervisor, Lead Worker, and Worker, as shown above, which are documented as data product codes. Each of these data product codes would be cross-referenced to the Management Level. Personnel 1; data reference set variation.

When a data product unit contains a domain of codes E, M, S, and W, where W represents any worker, those four codes are documented as data product codes. Those data product codes are then cross-referenced to a different data reference set variation due to the difference in the domain and meaning of the codes. A new data reference set variation would be created for Management Level. Personnel 2; containing the four data reference items. The data product codes would then be cross-referenced to that data reference set variation.

The question always arises about a domain of data values. For example, a data product unit contained data product codes E, M, S, L, and W, with the names and definitions as shown above. A data product unit in another data file contained the data product codes E, M, S, and L, with the same names and definitions as shown above. Are these the same set of data codes?

Each of the data product units containing those data codes needs to be reviewed to determine whether the two sets of data codes are actually identical and the second data file simply contained no value for Workers, or whether the second set really has a smaller domain. A data dictionary, data edits, or application programs may be reviewed to make the determination. When the two sets of data codes are identical, the data product codes can be cross-referenced to the same data reference set variation. If the two sets of data codes are in fact different, then a new data reference set variation needs to be created.

The resolution of these situations is done during the preferred data architecture designation. The data cross-referencing only determines that the specific sets of data codes are identical or are different.

When the coded data values are different, such as EX, MN, SU, LW, and WR, even though the names and definitions are the same, then the data product codes are cross-referenced to a different data reference set variation. Similarly, when the data product code names are different, such as Big Boss, Little Boss, Trainee Boss, Leader, and Worker, even though the coded data values and definitions are the same, the data product codes are cross-referenced to a different data reference set variation.
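The exact-match rule for coded sets can be expressed as a set comparison. The Python sketch below is a simplification under stated assumptions: each data reference item is a tuple of coded data value, name, and meaning, the meaning strings are placeholders standing in for the full definitions, and the function name is hypothetical. It does not cover the case of a data product unit that holds only names.

```python
def find_matching_variation(variations, product_codes):
    """Return the name of the data reference set variation whose items
    exactly match the given data product codes, or None if none does."""
    wanted = frozenset(product_codes)
    for variation_name, items in variations.items():
        if frozenset(items) == wanted:
            return variation_name
    return None


# Each item is (coded data value, name, meaning). The meanings are
# placeholders, not the real definitions.
personnel_1 = [
    ("E", "Executive", "executive definition"),
    ("M", "Manager", "manager definition"),
    ("S", "Supervisor", "supervisor definition"),
    ("L", "Lead Worker", "lead worker definition"),
    ("W", "Worker", "worker definition"),
]
variations = {"Management Level. Personnel 1;": personnel_1}

# Same values, names, and meanings in another order: an exact match.
same_set = list(reversed(personnel_1))
print(find_matching_variation(variations, same_set))

# Four codes where W means any worker: different domain and meaning.
any_worker_set = personnel_1[:3] + [("W", "Worker", "any worker definition")]

# Same names and meanings, but different coded data values.
different_values = [
    ("EX", "Executive", "executive definition"),
    ("MN", "Manager", "manager definition"),
    ("SU", "Supervisor", "supervisor definition"),
    ("LW", "Lead Worker", "lead worker definition"),
    ("WR", "Worker", "worker definition"),
]
print(find_matching_variation(variations, different_values))

# No exact match, so a new data reference set variation is created.
name = find_matching_variation(variations, any_worker_set)
if name is None:
    name = "Management Level. Personnel 2;"
    variations[name] = any_worker_set
print(name)
```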

When the meaning of the data product codes is different, the data product codes belong to different data reference set variations. For example, two sets of data product codes have the same coded data values and names, as shown below.

  

     NA North America

     SA South America

     EU Europe

     AS Asia

These two sets of data product codes appear to be identical and might be considered to be the same data reference set variation. However, the data code definition for North America in one set of data codes includes Central America because of the ease of distribution, but in the other set of data codes the definition for South America includes Central America because of the similarity in language. Therefore, these two sets of data codes belong to different data reference set variations.

The example from Chapter 6 for a combination of gender, hair color, and eye color is shown below. The complete breakdown of two genders, five hair colors, and five eye colors results in 50 individual codes. Twenty-five of these codes represent each of the genders, 10 of these codes represent each of the hair colors, and 10 of these codes represent each of the eye colors.

     1 Male, Blond Hair, Blue Eyes

          1     Male

          1     Blond Hair

          1     Blue Eyes

     2 Female, Blond Hair, Blue Eyes

          2     Female

          2     Blond Hair

          2     Blue Eyes

     3 Male, Brown Hair, Blue Eyes

          3     Male

          3     Brown Hair

          3     Blue Eyes

     And so on.

Three data reference set variations are created within their respective data subjects for Gender. 1;, Hair Color. 1;, and Eye Color. 1;. The gender data reference set variation has 25 data reference items for each of the genders, the hair color data reference set variation has 10 data reference items for each of the hair colors, and the eye color data reference set variation has 10 data reference items for each of the eye colors.

Data cross-references are made between the gender data product code variations and the gender data reference set variation, between the hair color data product code variations and the hair color data reference set variation, and between the eye color data product code variations and the eye color data reference set variation.
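The arithmetic behind the 50 combined codes, and the item counts in each data reference set variation, can be checked with a short Python sketch. Only Male, Female, Blond, Brown (hair), and Blue (eyes) come from the example; the remaining hair and eye colors are invented to fill out the five values each, and the code numbering assumes gender varies fastest, then hair color, then eye color, as the first three codes suggest.

```python
from collections import Counter
from itertools import product

genders = ["Male", "Female"]
hair_colors = ["Blond", "Brown", "Black", "Red", "Gray"]  # last three assumed
eye_colors = ["Blue", "Brown", "Green", "Hazel", "Gray"]  # last four assumed

# product() varies its last argument fastest, so gender changes on every
# code, hair color every 2 codes, and eye color every 10 codes.
combined_codes = {
    code: (gender, hair, eye)
    for code, (eye, hair, gender) in enumerate(
        product(eye_colors, hair_colors, genders), start=1)
}

# Each combined code yields one data product code variation per fact.
gender_items = [(code, facts[0]) for code, facts in combined_codes.items()]
hair_items = [(code, facts[1]) for code, facts in combined_codes.items()]
eye_items = [(code, facts[2]) for code, facts in combined_codes.items()]

print(len(combined_codes))                         # 50
print(combined_codes[3])                           # ('Male', 'Brown', 'Blue')
print(Counter(name for _, name in gender_items))   # 25 per gender
print(Counter(name for _, name in hair_items))     # 10 per hair color
print(Counter(name for _, name in eye_items))      # 10 per eye color
```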

The process may seem a bit too detailed, but the detail will be needed to prepare the data translation rules following the preferred data architecture designations.

The example from Chapter 6 for a hierarchy of census codes is shown below. Three data reference set variations are created within their respective data subjects for Census Race. 1;, Census Race Category. 1;, and Census Race Group. 1;. Data reference items are created in Census Race. 1; for each of the individual Census Race codes. Data reference items are created in Census Race Category. 1; for each distinct range of Census Race Category codes. Data reference items are created in Census Race Group. 1; for each distinct range of Census Race Group codes. Cross-references are then made between the data product codes and their respective data reference set variations.

Census Race

653 Hawaiian

And so on.

Census Race Category

653 – 699 Pacific Islander

And so on.

Census Race Group

653 – 659 Polynesian

         And so on.

Again, the process may seem a bit too detailed, but the detail will be needed to prepare the data translation rules following the preferred data architecture designations.
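A minimal Python sketch of the hierarchy cross-reference follows. It uses only the single code and the two ranges shown in the example; the function names and the dictionary layout are assumptions made for illustration.

```python
# Values taken from the example above; a real code set has many more entries.
CENSUS_RACE = {653: "Hawaiian"}
CENSUS_RACE_CATEGORY = [((653, 699), "Pacific Islander")]
CENSUS_RACE_GROUP = [((653, 659), "Polynesian")]


def find_range(ranges, code):
    """Return the name of the first inclusive range containing code, else None."""
    for (low, high), name in ranges:
        if low <= code <= high:
            return name
    return None


def cross_reference(code):
    """Cross-reference one data product code to all three reference set variations."""
    return {
        "Census Race. 1;": CENSUS_RACE.get(code),
        "Census Race Category. 1;": find_range(CENSUS_RACE_CATEGORY, code),
        "Census Race Group. 1;": find_range(CENSUS_RACE_GROUP, code),
    }
```

A single data product code therefore yields up to three cross-references, one for each level of the hierarchy.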

Data reference set variations and data reference items are defined independent of the format of the data values. Only the data values are important for designating data reference set variations. The format of the data values is documented with the data product unit or variation and is reflected in the corresponding data characteristic variation.

When a data reference set is created as a data subject, the initial definition describes the contents of that data reference set. Each data reference set variation definition inherits the definition of its parent data reference set. The data reference set variation definition describes the particular variation of the data reference set, such as a larger domain, different definitions, and so on, but it does not list all of the data reference items.

The data reference item definitions must be within the scope of the data reference set definition. Whenever data product codes or variations are cross-referenced to a data reference set variation, the definitions must be reviewed to ensure that the definitions of the data product codes fit within the scope of the definition for the data reference set and data reference set variation. When a discrepancy is encountered, the data reference set definition, data reference set variation definition, or data reference item definition must be enhanced. After any enhancement, the existing cross-references must be reviewed to ensure they are valid.

Whenever definitions of data product codes don’t exist, the best determination of their meaning is made and entered as the definition. If a discrepancy is found later, the appropriate changes can be made to the definitions or the data cross-references.

Whenever any combination of data product codes is found, such as super sets or subsets, multiple property data codes, and so on, return to the data inventory and break the codes down into their individual components. Then cross-reference those data product codes to the appropriate data reference set variation.

Data reference item coded values and names may appear in several data reference set variations. That situation is normal and is part of the understanding process. The reason for the different data reference set variations may be a difference in the domain of data reference items, or in the meaning of the data reference items.

Data Product Code Cross-Reference Criteria

The data product code cross-reference criteria are summarized below.

Review the set of data product codes or variations for a data product unit or variation.

Search data reference set variations for a matching set of data reference items.

When a match is found, cross-reference each data product code or variation to the data reference set variation.

When no match is found, create a new data reference set variation with a matching set of data reference items. Then cross-reference each data product code or variation to the data reference set variation.

When no matching data reference set is found, create a new data subject for that data reference set. Consult the data subject thesaurus for the existence of a possible data reference set. Make appropriate entries in the data subject thesaurus. Create a data reference set variation with a matching set of data reference items. Then cross-reference each data product code or variation to the data reference set variation.
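The criteria above can be sketched as a small Python function. This is a simplified illustration, not a prescribed implementation: the in-memory dictionary model, the function name, the World Region data subject name, and the code definitions are assumptions, and the data subject thesaurus step is omitted. A match requires the same coded values with the same definitions.

```python
def cross_reference_codes(reference_sets, subject, product_codes):
    """Return the data reference set variation the product codes map to.

    reference_sets: data subject name -> variation name -> {code: definition}
    A new data subject and/or variation is created when no match exists.
    """
    # Create the data subject for the reference set when it does not exist.
    variations = reference_sets.setdefault(subject, {})
    for name, items in variations.items():
        if items == product_codes:  # same codes with the same definitions
            return name
    # No match: create a new variation with a matching set of items.
    new_name = f"{subject}. {len(variations) + 1};"
    variations[new_name] = dict(product_codes)
    return new_name


# Hypothetical usage: two code sets with identical codes but different meanings.
sets = {}
first = cross_reference_codes(sets, "World Region", {"EU": "Europe", "AS": "Asia"})
again = cross_reference_codes(sets, "World Region", {"EU": "Europe", "AS": "Asia"})
other = cross_reference_codes(
    sets, "World Region", {"EU": "Europe and Russia", "AS": "Asia"}
)
```

Because the definitions take part in the comparison, code sets that look identical but differ in meaning end up in different data reference set variations.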

Data Product Code Cross-Reference Comments

Cross-reference comments can be made during or after data product code cross-referencing. Generally, no cross-reference comment is needed. However, in situations where a match between a set of data product codes and a set of data reference items is questionable, cross-reference comments should be made.

Cross-reference comments are typically retained rather than being deleted. For example, if a cross-reference comment questioning the validity of the match between a set of data product codes and a set of data reference items is made, and the cross-reference is later determined to be valid, an additional statement is added that the validity was confirmed. The process provides a good audit trail of concerns about cross-referencing and the resolution of those concerns.

DATA PRODUCTS

Data cross-references are performed for data files, for summary data, for aggregated data, for screens, reports, and forms, for data models, for application programs, for complex data, and for changes over time. Each of these cross-references is described below.

Data Files

Data cross-references are not made between data product sets or variations and data subjects because data files seldom represent complete and single data subjects. Data files typically have a many-to-many relationship with data subjects. The cross-referencing must be done at a more detailed level to place data items within their appropriate data subject. Data cross-references are made between data product sets or variations and data subject variations, as described above.

Data cross-references are not made between data product units or variations and data characteristics because data product units or variations represent some variation of a data characteristic. Data cross-references are made between data product units or variations and data characteristic variations, and between data product codes or variations and data reference set variations, as described above.

Data cross-referencing at the detailed level splits multiple subject data files into their respective data subjects, combines multiple file data subjects into one data subject, and resolves other data variability that exists between data files and data subjects.

Splitting Data Files

A data file may be split into many data subjects. For example, a data file has both vegetation data and river data. The organization desires to split the data file into the two different data subjects for vegetation and for rivers. Two data product set variations, for vegetation and for rivers, are created during the data inventory.

The appropriate data product units are documented for each data product set variation. In other words, data items appropriate for vegetation are documented as data product units for the vegetation data product set variation, and data items appropriate for rivers are documented as data product units for the river data product set variation. Data items that are appropriate for both vegetation and rivers are documented as data product units in both data product set variations. The data product units are then cross-referenced to the data characteristic variations within the appropriate data subject for vegetation or rivers.
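The split can be sketched in Python as follows. The data item names and the layout are hypothetical; only the vegetation and river subjects come from the example.

```python
# Hypothetical data items in one physical data file, each tagged with the
# data subjects it is appropriate for.
file_items = {
    "VEG_TYPE": {"Vegetation"},
    "VEG_HGT": {"Vegetation"},
    "RVR_NM": {"River"},
    "RVR_LNGTH": {"River"},
    "SRVY_DT": {"Vegetation", "River"},  # appropriate for both subjects
}


def split_into_set_variations(items):
    """Build one data product set variation per subject, listing its units."""
    variations = {}
    for item, subjects in items.items():
        for subject in sorted(subjects):
            variations.setdefault(subject, []).append(item)
    return variations


set_variations = split_into_set_variations(file_items)
```

The shared item is listed in both data product set variations, so it can be cross-referenced once within each data subject.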

Combining Data Files

Multiple data files may be combined into one data subject. For example, many different data files contain employee data. The organization desires to combine all employee data into one data subject. A data product set or variation is created for each data file and the employee data items are documented as data product units or variations. The data product units or variations are then cross-referenced to data characteristic variations within the employee data subject.

Combining Data Files with Types

Multiple data files may be combined into one data subject. For example, three data files contain data for prospective students, undergraduate students, and graduate students. The organization desires to combine these data files into one data subject for student, and retain whether the student is prospective, undergraduate, or graduate. Data product sets or variations and data product units or variations are documented for each data file during data inventory.

For example, the data product set representing the prospective student data file would be cross-referenced to a data subject variation for “Prospective” Student, which is within the data subject for Student. That cross-reference indicates that all data in that data file pertain to prospective students. The data product units or variations are then cross-referenced to appropriate data characteristic variations within the student data subject, as described above.

A Student Type. Code would be defined to indicate the type of student. However, that definition occurs during the preferred data architecture designation, not during data cross-referencing. The data cross-referencing process only connects existing disparate data to a common data architecture. Data subjects or data characteristics are only created to cover cross-referencing.

Data Records

Data records are not cross-referenced to a common data architecture. Only the data product units or variations contained in the data records are cross-referenced to a common data architecture for understanding. The data product sets and data product set variations document the data records with respect to the physical data files. The data subject containing the data characteristic variations to which the data product units or variations are cross-referenced represents the logical data record. The process complies with the normalization of data during cross-referencing.

Data Instances

Historical data instances are not cross-referenced to the common data architecture. Only the data product units or variations contained in the historical data instances are cross-referenced to a common data architecture for understanding. However, those data product units or variations are cross-referenced to a data characteristic variation that belongs to a history data subject.

For example, the data product units in a historical data instance for students would be cross-referenced to data characteristic variations that belong to the Student History data subject. Cross-referencing historical data instances to a history data subject complies with the normalization of time during data cross-referencing. A determination can be made during the preferred data architecture designation process whether the historical data instances should remain in a separate history file or combined with the current data instances.

Data Keys

Disparate primary keys and foreign keys are not cross-referenced to a common data architecture. The possible many-to-many relation between data product sets or data product set variations and data subjects makes it impractical to try cross-referencing primary and foreign keys. Only the data product units or variations are cross-referenced to a common data architecture for understanding.

However, primary keys and foreign keys can be listed for data subjects if they are relevant. For example, a data file represents employee data and has a primary key of EMPL_ID, defined as a department-assigned unique identifier of an employee. The EMPL_ID is cross-referenced to Employee. Department Identifier, Numeric 6. That primary key is documented for the Employee data subject.

A data file for department data has the department name as a primary key. That department name is relevant and is documented as a primary key in a common data architecture.

Another data file for employee data has a primary key of EMP_SSN, which is cross-referenced to Employee. Social Security Number, Character 9. That data file also has a foreign key of DPT_NM, which is cross-referenced to Department. Name Complete, Alpha 24. That primary key can also be documented for Employee. The documentation of these primary keys and foreign keys is shown below.

     Department

          Primary Key: Department. Name Complete

     Employee

          Primary Key: Employee. Department Identifier

          Primary Key: Employee. Social Security Number

          Foreign Key: Department Department. Name Complete

Note that the documentation of primary keys and foreign keys in a common data architecture does not include the data characteristic variation. Only the fact is important for primary keys and foreign keys, not the variation of that fact.
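Dropping the variation when documenting keys can be sketched in Python. The dictionary layout and function name are assumptions; the names and cross-references are those of the example above.

```python
def characteristic_name(variation_name):
    """Drop the variation part (after the first comma) from a variation name."""
    return variation_name.split(",", 1)[0].strip()


# Cross-references from the example: disparate key -> data characteristic variation.
key_cross_references = {
    "EMPL_ID": "Employee. Department Identifier, Numeric 6",
    "EMP_SSN": "Employee. Social Security Number, Character 9",
    "DPT_NM": "Department. Name Complete, Alpha 24",
}

# Keys are documented per data subject at the data characteristic level only.
documented_keys = {
    "Department": {
        "primary": [characteristic_name(key_cross_references["DPT_NM"])],
        "foreign": [],
    },
    "Employee": {
        "primary": [
            characteristic_name(key_cross_references["EMPL_ID"]),
            characteristic_name(key_cross_references["EMP_SSN"]),
        ],
        "foreign": [characteristic_name(key_cross_references["DPT_NM"])],
    },
}
```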

Primary keys are not documented for data subjects when they are not relevant. For example, a data file has a combination of student data, parent or guardian data, and class data. The primary key is a system-assigned identifier. The individual data items are cross-referenced to their respective data subjects for Student, Parent/Guardian, and Class. The system-assigned identifier is not documented for Student, Parent/Guardian, or Class because it only has meaning with respect to the disparate data file.

Data Integrity Rules

Data integrity rules are not cross-referenced to a common data architecture and are not involved in the data cross-reference process. Data characteristic variations are not dependent on variations in data integrity rules, because those rules are too variable and informally defined. The data inventory process documented the data integrity rules that exist in the disparate data. Those existing data integrity rules will be pulled together during the preferred data architecture designation process and used to develop formal data integrity rules.

Data Accuracy

Data accuracy can be documented during the data inventory process and can be used to determine the data characteristic variation name. For example, the lake size in one data file is estimated from aerial photographs at a scale of 1:24,000, but is surveyed on the ground in another data file. These two data items would be cross-referenced to Lake. Size, Acres Estimated 1:24,000 and Lake. Size, Acres Surveyed, respectively.

One alternative in a common data architecture is to create companion data characteristics for the size of the lake and the accuracy of the determination. For example, Lake. Size, Acres would be the data characteristic variation for the size of the lake, and Lake Size Determination. Code would be the data characteristic variation identifying how the lake size was determined. However, the determination is made during the preferred data architecture designation, rather than during data cross-referencing. Data characteristics are only created during data cross-referencing to support cross-references.

Summary Data Cross-Referencing

Summary data in fixed hierarchies usually appear on screens, reports, or forms, although those data could be stored in databases. A data hierarchy for summary data is shown in Appendix B. Since the summary data are named according to the data set in which they appear, they are cross-referenced to data characteristic variations in the data subject representing that data set.

For example, the department data on the data hierarchy shown in Appendix B would be cross-referenced as shown below.

     Department

          Department Identifier          Department. Identifier, Alpha 6

          Department Name          Department. Name, Alpha 24

          Department Employee Count          Department. Employee Count, Numeric 2

          Department Annual Budget          Department. Annual Budget, Numeric 8

          Department Expense To Date          Department. Expense To Date, Numeric 8

The other data on the report would be cross-referenced in a similar manner.

When summary data are stored in a data file, they are inventoried and documented as data product sets or variations and data product units or variations. These are cross-referenced as described above. In addition, any primary and foreign keys are documented as described above.

Aggregated Data Cross-Referencing

Aggregated data in variable hierarchies may appear on screens, reports, or forms, or they may appear in databases. A data hierarchy for aggregated student analytics data is shown in Appendix C.

Within a common data architecture, a data subject is created for the data focus, such as Student Analytics Focus. The data definition describes the data focus, such as an accumulation of analytics about students. Data characteristics are created within that data subject, such as Enrollment Count, Average Student Age, and so on. The data definitions describe the meaning of the data characteristics.

Data subject variations are defined for each manifestation of the data focus, such as Student Analytics 1, Student Analytics 2, and so on. The data definitions describe the meaning of each manifestation. Primary keys and foreign keys are documented for each data subject variation.

The example below shows a portion of the Student Analytics data from Appendix C. Student Reporting System is the data product and is not cross-referenced to a common data architecture. Student Enrollment Summary is a data product set and could be cross-referenced to a data subject variation showing any selection or subset of the student data, such as [Selection A] Student Analytics. Funding School Disability Grade and Funding School data sets within the data hierarchy are documented as data product set variations. These would be cross-referenced to corresponding data subject variations showing the parent data sets, such as Student Analytics 1 and Student Analytics 3.

     Student Reporting System          No cross-reference

          Student Enrollment Summary          [Selection A] Student Analytics

               Funding School Disability Grade        Student Analytics 1

                    Enrollment Count

                    Average Student Age

               Funding School          Student Analytics 3

                    Enrollment Count

                    Average Student Age

     And so on.

The data items in each data set of the data hierarchy are cross-referenced to corresponding data characteristic variations, as shown below.

     Funding School Disability Grade

          Enrollment Count         Student Analytics. Enrollment Count, Numeric 5

          Average Student Age   Student Analytics. Average Age, Numeric 5

     Funding School

          Enrollment Count         Student Analytics. Enrollment Count, Numeric 5

          Average Student Age   Student Analytics. Average Age, Numeric 5

     And so on.

The process may seem detailed, but that detail provides the necessary understanding for making preferred data architecture designations and transforming the disparate data.
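The two levels of cross-referencing for aggregated data can be sketched in Python. The names are those of the Student Analytics example; the dictionary layout and function name are assumptions for illustration.

```python
# Data sets are cross-referenced to data subject variations.
set_cross_references = {
    "Student Enrollment Summary": "[Selection A] Student Analytics",
    "Funding School Disability Grade": "Student Analytics 1",
    "Funding School": "Student Analytics 3",
}

# Data items are cross-referenced to data characteristic variations.
item_cross_references = {
    "Enrollment Count": "Student Analytics. Enrollment Count, Numeric 5",
    "Average Student Age": "Student Analytics. Average Age, Numeric 5",
}

# Portion of the data hierarchy shown in the example.
hierarchy = {
    "Funding School Disability Grade": ["Enrollment Count", "Average Student Age"],
    "Funding School": ["Enrollment Count", "Average Student Age"],
}


def cross_reference_hierarchy(hierarchy):
    """List (data set, subject variation, data item, characteristic variation)."""
    rows = []
    for data_set, items in hierarchy.items():
        for item in items:
            rows.append(
                (data_set, set_cross_references[data_set],
                 item, item_cross_references[item])
            )
    return rows


rows = cross_reference_hierarchy(hierarchy)
```

The same data item in different data sets maps to the same data characteristic variation, while the data sets themselves map to different data subject variations.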

Predictive Data

Predictive data can be operational data, summary data, or aggregated data. The uniqueness with predictive data is not with the data themselves, but with the processing that is performed on those data. In other words, predictive analysis and data mining techniques are process issues, not data issues. Therefore, the data input to a predictive analysis and the data resulting from a predictive analysis are cross-referenced as described above.

Screens – Reports – Forms

The data on screens, reports, and forms can be operational data, summary data, aggregated data, or predictive data. Those data are cross-referenced as described above for data files, summary data, aggregated data, or predictive data. Data hierarchies can be developed for screens, reports, and forms, and used to cross-reference the data.

Screens, reports, and forms don’t have any primary keys or foreign keys. They also don’t have any data integrity rules typically found with data files and data models. However, they do have data derivation rules for producing any summary data, aggregated data, or predictive data. Those data derivation rules need to be documented as data integrity rules.

XML structures typically contain operational data; however, XML structures can also contain summary data, aggregated data, and predictive data. XML structures do have a structure for the data that indicates the data subject. The data in XML structures are cross-referenced as described above.

XML structures have no primary or foreign keys. However, they do have data derivation rules for producing summary data, aggregated data, or predictive data. Those data derivation rules need to be documented as data integrity rules.

Data Models

Logical and physical data models can represent operational data, summary data, aggregated data, or predictive data. Logical and physical data models are cross-referenced as described above.

Application Programs

The data in application programs are cross-referenced as described above. The data read from and written to data files by application programs are not cross-referenced, because that cross-reference would be redundant with the cross-referencing of the data files themselves. Primary keys and foreign keys don’t exist in application programs. However, any data integrity rules enforced by the application program need to be documented as data integrity rules.

The data in purchased applications that are used by the organization are cross-referenced to the data subjects as perceived by the organization, not by the definition of the data files in the purchased application. The data files and data items in many purchased applications are not used by the organization as defined in the application. For example, Party may contain people contributing aid, Product may contain people receiving aid, and Sales Region may contain sites where aid is rendered. The cross-references would be made to Aid Contributors, Aid Recipients, and Aid Sites accordingly.

Complex Structured Data Cross-Referencing

Complex structured data, such as spatial data, textual data, video data, image data, and so on, are not cross-referenced, because they were documented only as data products. The specific data subjects and data characteristics are not known, and cannot be inventoried or cross-referenced. However, the breakdown of complex structured data into the component data structures can be documented and cross-referenced, as described above.

Geographic information systems contain large quantities of operational data that are documented and cross-referenced as described for data files. The geographic component that contains the coordinates for points, lines, and polygons can be documented as a data item. However, the contents of that data item cannot be readily documented.

Changes Over Time

Changes over time for data product units and variations are cross-referenced to corresponding data characteristic variations in the common data architecture. For example, the changes in the Vehicle Collision Comment data item from a general comment to a comment about the injuries resulting from the collision are shown below.

     Veh_Clsn (DPS)

          CMT (DPU)

               Comment <Pre-1999> (DPUV)

               Comment <1999 – Current> (DPUV)

The cross-references to corresponding data characteristic variations are shown below.

     Comment <Pre-1999>          Vehicle Collision. Comment, Alpha 36 <Pre-1999>

     Comment <1999 – Current>          Vehicle Collision. Injuries, Alpha 36 <1999 – Current>
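Selecting the cross-reference that applies to a given year can be sketched in Python. The sketch follows the Pre-1999 and 1999 – Current split of the data product unit variations and assumes that Pre-1999 means through 1998; the function name and table layout are hypothetical.

```python
# (first_year, last_year) bounds are inclusive; None means open-ended.
# Assumption: "Pre-1999" covers every year up to and including 1998.
CMT_CROSS_REFERENCES = [
    ((None, 1998), "Vehicle Collision. Comment, Alpha 36 <Pre-1999>"),
    ((1999, None), "Vehicle Collision. Injuries, Alpha 36 <1999 – Current>"),
]


def variation_for_year(year):
    """Return the data characteristic variation that applies in the given year."""
    for (first, last), target in CMT_CROSS_REFERENCES:
        if (first is None or first <= year) and (last is None or year <= last):
            return target
    raise ValueError(f"no cross-reference covers year {year}")
```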

Changes over time for data product codes or variations are not cross-referenced to corresponding data reference items. The data product codes or variations are cross-referenced to a data reference set variation that contains the same set of data product codes or variations, including the variations over time.

For example, the changes in data product codes for an Executive are shown below.

          E     Executive

               E     Executive <Pre-1988>

               E     Executive & Board <1989 – Current>

These data product code variations are cross-referenced to a data reference set variation containing exactly the same set of data reference items with the same variations.

INTERIM COMMON DATA ARCHITECTURES

Data cross-referencing can be done by cross-referencing data products to interim common data architectures, and then cross-referencing those interim common data architectures to the final common data architecture. The process is used only in very large organizations where it’s difficult to go from data products to a final common data architecture in one process. It was never intended for small segments of a small data resource, because it is too time consuming. Only one level of interim common data architectures is used, because the process is quite detailed.

Two basic approaches can be used for interim common data architectures. Each approach is described below.

The first approach is to inventory and document the data products for a major division or a major geographical area, and cross-reference those data products to an interim common data architecture. For example, a large multi-national organization could inventory and document the data products in major world regions, such as North America, South America, Europe, Asia, and so on. Those data products would then be cross-referenced to an interim common data architecture for that world region.

When the interim data architectures are completed, they are documented as data products. The interim common data architecture becomes a data product; the data subjects, including data reference sets, become data product sets; data subject variations become data product set variations; data characteristics become data product units; data characteristic variations become data product unit variations; and data reference items become data product codes.

The cross-references are then made to the final common data architecture, as described above. The final common data architecture is enhanced as necessary to cover the cross-references. The process is completed by changing the original data cross-references between the original data products and the interim common data architecture, to data cross-references between the original data products and the final common data architecture.
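The final step, replacing interim cross-references with final ones, amounts to composing two mappings. A minimal Python sketch follows; the function name and all data names in the usage example are hypothetical.

```python
def compose(original_to_interim, interim_to_final):
    """Re-point each original data product at its final cross-reference.

    Raises KeyError when an interim component has not yet been
    cross-referenced to the final common data architecture.
    """
    return {
        product: interim_to_final[interim]
        for product, interim in original_to_interim.items()
    }


# Hypothetical names, for illustration only.
original_to_interim = {
    "EU.EMPL_ID": "Employee. Identifier, Numeric 6",
    "EU.EMPL_NM": "Employee. Name, Alpha 30",
}
interim_to_final = {
    "Employee. Identifier, Numeric 6": "Employee. Department Identifier, Numeric 6",
    "Employee. Name, Alpha 30": "Employee. Name Complete, Alpha 30",
}
original_to_final = compose(original_to_interim, interim_to_final)
```

Letting a missing interim entry raise an error, rather than silently skipping it, makes incomplete cross-referencing visible before the interim architecture is retired.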

The first approach is very detailed, but is useful for very large organizations with relatively distinct sets of disparate data. The process is monitored by the tactical data stewards to ensure that a correct and complete final common data architecture is developed.

The second approach is to inventory and document data products as described above for the first approach. The interim common data architectures are then merged into a final common data architecture, combining any commonalities in the interim common data architectures. The data products are then merged together. Finally, the interim data cross-references are adjusted to the final common data architecture.

The second approach may be easier than the first approach, or it may be more difficult, depending on the similarity between the initial sets of data products. If the initial sets of data products are quite different, the approach is easier. If the initial sets of data products are very similar, the approach is more difficult. The approach requires close coordination among the tactical data stewards.

A third approach that is becoming more common with larger networks is to put the entire process online. The data products are inventoried and documented in one central location, the common data architecture is documented in one central location, and all data cross-references are done at that central location. The approach is much simpler than developing interim common data architectures and then cross-referencing those interim common data architectures into a final common data architecture. However, it requires constant coordination from the tactical data stewards. A lack of close coordination could lead to the entire approach failing.

ENHANCEMENT AND ADJUSTMENTS

Enhancements need to be made to the data inventory and to a common data architecture during data cross-referencing. Adjustments also need to be made to the data inventory, to the data cross-references, and to a common data architecture during data cross-referencing. Each of these situations is described below.

These enhancements and adjustments are a natural part of the data resource integration process. The uncertainty about disparate data makes a precise, one-time pass very difficult. Additional insight is continually gained during the data resource integration process that increases understanding about the organization’s perception of the business world, how they operate in that business world, and the data needed to support that operation. That additional insight results in enhancements and adjustments.

Data Inventory Enhancement

Enhancements may need to be made to the data inventory during data cross-referencing. Enhancements to the data inventory include enhancing the data definitions and adding to the data integrity rules. Any time additional insight is gained about the existing disparate data, that insight must be documented.

People frequently think that additional insight can be remembered or is obvious, and therefore doesn’t need to be documented. However, over time, that memory fades and the insight is lost. That loss could lead to difficulty making an accurate data cross-reference.

Common Data Architecture Enhancement

Enhancements are continually made to the common data architecture during data cross-referencing. Enhancements include adding new components to the common data architecture as needed to support cross-references, and enhancing data definitions. Adding new components is relatively easy, because those components must be in place to perform the cross-referencing. However, enhancing the definitions is more difficult.

As with data inventory definitions, people tend to become lax at enhancing definitions for thorough understanding. The data definitions need to be continually enhanced so that they adequately cover subordinate components and the data products being cross-referenced. The process is critical for the designation of a preferred data architecture. Any confusion or uncertainty about a common data architecture will impact the designation of a preferred data architecture.

Data Inventory Adjustments

Adjustments may need to be made to the data inventory during data cross-referencing. The adjustments are usually a further breakdown of the data products to their basic components. Data product sets may need to be further broken down into basic components. Data product units may need to be further broken down into basic components. Data product codes may need to be further broken down into basic components. All of these are normal occurrences during data cross-referencing.

Occasionally, the breakdown of a data product set, data product unit, or data product code is found to be invalid. That breakdown is eliminated from the data inventory and the data cross-referencing can continue.

Occasionally, the structure of the data inventory needs to be adjusted. That adjustment is made and the data cross-referencing can continue.

Common Data Architecture Adjustments

A common data architecture may need to be adjusted based on a better understanding of the business world and how the organization operates in that business world. As an initial common data architecture grows with the data cross-referencing, organizations often have a better understanding of how their data resource could support their business. That better understanding often leads to an adjustment of a common data architecture.

One of the reasons for not going too far with an initial common data architecture is that more adjustments may need to be made. An initial common data architecture may add components that are not needed for cross-referencing, and may create components that are not properly named or defined. A better approach is to let the data cross-referencing drive enhancement of a common data architecture. The result will be fewer adjustments.

Data Name Changes

Data names in a common data architecture can be changed. Changing a data name is relatively easy, because the change doesn’t directly impact the cross-references. The name is simply changed. The data definition may need to be modified to reflect the data name change.

A data subject name may need to be changed from Contractor to Vendor, because more than contractors are included. The name is changed, the definition is adjusted accordingly, and the data subject thesaurus is updated.

A data characteristic name may need to be changed from Book. Cover Illustration to Book. Cover Layout because more than the illustration was included. The definition is changed accordingly.

A data characteristic variation name may need to be changed to further qualify the variation, such as Well. Depth, Measured to Well. Depth, Measured Laser because the depth could also be measured with a less accurate measuring tape. The definition is changed accordingly.

A data reference set variation name may need to be changed to enhance understanding, such as Stream Gradient Type. 1; and Stream Gradient Type. 2; to Stream Gradient Type. Canadian; and Stream Gradient Type. American;.

Data Definition Changes

Data definitions are frequently enhanced, as described above. In some situations, however, a data definition needs to be changed, either because the definition is wrong or because the scope changed. Each data definition change must be evaluated to determine if the definition change will result in a name change, or a separation or a combination of components. A separation or combination of components might result in a change to the data cross-references.

Data Subject Changes

Data subjects may be combined. For example, two data subjects are created for Caregiver and Caretaker. These data subjects represent the same thing from the organization’s perception of the business world and are combined into Caregiver. The data subject thesaurus is updated accordingly.

Data subjects may be separated. For example, volunteers were included in Employee due to State reporting requirements. However, the organization decided to separate volunteers because they are managed differently and have a substantially different set of data characteristics. Therefore, Employee and Volunteer are created. The data subject thesaurus is updated accordingly.

Data Characteristic Changes

Data characteristics may be combined. For example, Patient. Weight Clothed and Patient. Weight Unclothed are combined into Patient. Weight, because all patients at the doctor’s office were weighed with their clothes on. The data definition is adjusted accordingly.

Data characteristics may be separated. For example, Patient. Height is separated into Patient. Height Shoes and Patient. Height No Shoes, because patients at a hospital could be measured either way. The data definitions are adjusted accordingly.

Common Data Architecture Reviews

A common data architecture should be reviewed periodically to ensure that it adequately represents the organization’s perception of the business world and the data needed to operate successfully in that business world. Data subjects, data characteristics, data characteristic variations, and data reference set variations should be reviewed for similarities or differences that may result in combining or splitting of those components.

More frequent reviews may be necessary when multiple teams are working on the same data subject area. The teams may be diverging in their perception of the business world, and may be creating a diverging common data architecture. A good data subject thesaurus, and frequent checking of that thesaurus, help prevent a divergent common data architecture.

A final review should be done after the data cross-referencing has been completed, before the preferred data architecture designations. Discovery of changes during preferred data architecture designations results in a delay, and possible changes to those designations.

Data Subject Thesaurus

Keeping the data subject thesaurus up to date and referring to it regularly helps prevent data subject synonyms and homonyms. Fewer synonyms and homonyms result in fewer adjustments to a common data architecture. When a data subject thesaurus is not maintained and used, a common data architecture begins to deteriorate and frequent adjustments are needed.
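Checking a proposed data subject name against the thesaurus can be as simple as a table lookup. The following is a minimal sketch, not a prescribed tool. The names THESAURUS, PREFERRED_SUBJECTS, and preferred_subject are hypothetical, and the entries are taken from the Contractor, Vendor, Caregiver, and Caretaker examples earlier in this chapter.

```python
from typing import Optional

# Hypothetical thesaurus: alias (lower case) -> preferred data subject name.
THESAURUS = {
    "contractor": "Vendor",
    "caretaker": "Caregiver",
}

# Hypothetical list of data subjects already in the common data architecture.
PREFERRED_SUBJECTS = ["Vendor", "Caregiver", "Employee", "Volunteer"]


def preferred_subject(proposed_name: str) -> Optional[str]:
    """Return the preferred data subject name, or None if the name is unknown."""
    key = proposed_name.strip().lower()
    if key in THESAURUS:
        return THESAURUS[key]
    for subject in PREFERRED_SUBJECTS:
        if subject.lower() == key:
            return subject
    # Unknown name: a person must decide whether it is a new data subject
    # or a new alias that should be added to the thesaurus.
    return None
```

A name that resolves to an existing data subject is a likely synonym and should not be added as a new data subject. A name that returns nothing still requires a review of the definitions before a new data subject is created.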

Data Name Vocabulary

Keeping the vocabulary of common words up to date and referring to it regularly helps prevent synonyms and homonyms. A good vocabulary ensures consistency in data names. When no vocabulary exists, synonyms and homonyms are easily created, which is less than desirable and usually results in adjustments.

Data Cross-Reference Adjustments

Any changes to data definitions, changes to data names, splitting of components, combining of components, or adjustments to the data inventory may result in an adjustment to the data cross-references. Any data cross-references to components that have been adjusted need to be reviewed for validity, and adjusted if necessary.

The data cross-references should also be reviewed periodically to ensure they are still valid. The data cross-reference comments should be reviewed to determine if all concerns about the cross-references have been resolved. If a concern has been resolved, a statement is entered describing how the concern was resolved.
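Identifying the data cross-references that need review can be sketched as a simple filter. The example below is illustrative only and assumes the cross-references are stored as pairs of names. CrossReference and cross_references_to_review are hypothetical names, not part of any particular tool.

```python
from typing import Iterable, List, NamedTuple, Set


class CrossReference(NamedTuple):
    data_product: str      # a data product set, unit, or code (or a variation)
    common_component: str  # the common data architecture component it maps to


def cross_references_to_review(
    cross_refs: Iterable[CrossReference],
    adjusted_products: Set[str],
    adjusted_components: Set[str],
) -> List[CrossReference]:
    """Return the cross-references that touch any adjusted item."""
    return [
        xref
        for xref in cross_refs
        if xref.data_product in adjusted_products
        or xref.common_component in adjusted_components
    ]
```

The filter only produces a review list. Whether each listed cross-reference is still valid remains a judgment made by the people doing the cross-referencing.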

A final review of the data cross-reference should be made prior to designating the preferred data architecture. Data cross-references that are not valid could lead to problems developing or implementing data translation rules or data transformation rules.

DATA REDUNDANCY AND VARIABILITY

An indication of the redundancy and variability of disparate data can be determined after data cross-referencing has been completed for a segment of the disparate data resource or for the entire disparate data resource.

Data Redundancy

When data cross-referencing has been completed for a segment of the disparate data resource or the entire disparate data resource, the data redundancy can be evaluated. Data redundancy was defined in Chapter 4 as the unknown and unmanaged duplication of business facts in a disparate data resource. It’s the same facts, for the same data occurrence, for the same time period. It’s the situation where a single business fact is stored in more than one location, and the locations may not be in synch. Data redundancy is typically determined for data files only.

Data redundancy can be difficult to determine because it’s not known whether data occurrences are redundant or non-redundant. For example, a student’s birth date may exist in two different data files. If those two data files contain redundant data occurrences for students, then the student’s birth date is redundant. However, if those two data files represent different sets of students, such as middle school and high school, the student’s birth date is not redundant.

Apparent data redundancy is the apparent existence of the same business fact in multiple data files, regardless of whether those data files contain redundant data occurrences. It’s the redundancy of a business fact based on the data characteristic name. Actual data redundancy is the existence of the same business fact in multiple data files that contain redundant data occurrences. It’s the redundancy of a business fact based on the data characteristic name and a determination of the redundancy in data occurrences. Data occurrence redundancy is the existence of multiple data occurrences for the same existence of a business object or happening of a business event.

Apparent data redundancy is relatively easy to calculate. The number of data product units or variations that exist in data product sets or variations representing data files is counted for each data characteristic. In other words, for each data characteristic, determine all of the data characteristic variations. For each of those data characteristic variations, determine all of the data product units or variations that belong to data files. Add up the number of data product units or variations for a data characteristic and that’s the data redundancy.

For example, if a data characteristic has six variations and those variations cross-reference to 14 data product units or variations that belong to data files, the data redundancy for that data characteristic is 14. The individual data characteristic redundancies can be averaged for all the data characteristics in a data subject to provide a data redundancy for that data subject. The individual data characteristic redundancies could also be averaged across all the data subjects in a segment of the data resource to determine the redundancy for that segment of the data resource. The same could be done for the entire data resource.
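The counting and averaging can be sketched in a few lines. The example below is a minimal illustration and assumes the cross-references are held in a nested table. The names STUDENT_CROSS_REFS, apparent_redundancy, and subject_redundancy, as well as the file names and variation labels, are made up for the example.

```python
from statistics import mean

# Hypothetical cross-references for one data subject:
# data characteristic -> {data characteristic variation ->
#                         data product units that belong to data files}
STUDENT_CROSS_REFS = {
    "Student. Birth Date": {
        "variation 1": ["FILE_A.BDATE", "FILE_B.BIRTH_DT", "FILE_C.DOB"],
        "variation 2": ["FILE_D.BDAY"],
    },
    "Student. Name": {
        "variation 1": ["FILE_A.NAME", "FILE_B.STU_NM"],
    },
}


def apparent_redundancy(variations):
    """Count the data product units across all variations of one characteristic."""
    return sum(len(units) for units in variations.values())


def subject_redundancy(cross_refs):
    """Average the apparent redundancy over the characteristics of a data subject."""
    return mean(apparent_redundancy(v) for v in cross_refs.values())
```

With the made-up data above, the birth date has an apparent data redundancy of 4, the name has 2, and the data subject averages 3. Averaging over a segment of the data resource works the same way with a larger table.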

Actual data redundancy is more difficult to calculate because non-redundant data occurrences may exist across the data files. Using the student example above, the data in two data files for middle school students and high school students do not contain redundant data. The only way to determine the actual data redundancy is to determine the data occurrence redundancy across data files and use that redundancy to calculate the actual data redundancy. That task is not easy and is seldom performed.
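One possible first step toward data occurrence redundancy is to look for the same data occurrence in more than one data file. The sketch below is only an illustration and rests on a strong assumption: that the data files share a reliable identifier, such as a student number. In disparate data that assumption often fails, which is part of why the task is difficult. The function name redundant_occurrences is hypothetical.

```python
from collections import Counter


def redundant_occurrences(identifiers_by_file):
    """Return the identifiers that appear in more than one data file.

    identifiers_by_file maps a data file name to the identifiers of the
    data occurrences stored in that file (assumed comparable across files).
    """
    counts = Counter(
        identifier
        for identifiers in identifiers_by_file.values()
        for identifier in set(identifiers)  # count each file at most once
    )
    return {identifier for identifier, n in counts.items() if n > 1}
```

An empty result, as with separate middle school and high school files, suggests the apparent redundancy of a business fact in those files is not actual redundancy. A non-empty result identifies the data occurrences for which the business fact is stored more than once.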

Actual data redundancy is also difficult to determine because data reference sets may qualify many different data subjects. For example, a gender code data reference set may qualify students, educators, bus drivers, and so on. The gender code would appear in each of these data files. However, that gender code does not represent actual data redundancy.

Actual data redundancy is difficult to determine because historical data instances contain the same data items. For example, a student’s weight may be stored in a current data instance and in three historical data instances. However, those four existences of a student’s weight do not represent actual data redundancy.

Apparent data redundancy could be determined for screens, reports, and forms, even though it is typically intended for data files. The same process described above is used, with the exception that the data products would be screens, reports, or forms, rather than data files. Actual data redundancy of screens, reports, and forms is very difficult to determine because of data selections and time frames. However, the process is useful for identifying screens, reports, and forms that might contain redundant data and could be considered for elimination.

Apparent data redundancy is best used as an indicator, rather than an absolute. An apparent data redundancy of 10 or higher is a critical problem. An apparent data redundancy of 3 might be manageable. Each organization needs to determine the data redundancy level that is acceptable.

Data Variability

Data item variability, as defined earlier, is the variability in the format or content of data items representing the same business fact. It’s a measure of how many different formats or contents exist for a particular data item across data files, and on screens, reports, and forms.

Data item variability is relatively easy to determine after data cross-referencing has been completed. The number of data characteristic variations for a data characteristic is the data item variability. For example, if Stream Segment. Name has six variations, the data item variability is six. Data item variability can be averaged for all the data characteristics in a data subject, or for all the data characteristics in a segment of the data resource, or for the entire data resource.

Data code variability, as defined earlier, is the variability in the coded data values, names, definitions, and domain of codes in a set of data codes. It’s a measure of how many variations exist for a particular set of data codes across data files.

Data code variability is relatively easy to determine after data cross-referencing has been completed. The number of data reference set variations for a data reference set is the data code variability. For example, if a data reference set for Gender has eight different data reference set variations, the data code variability is eight. Data code variability can be averaged for all the data reference sets in a segment of the data resource, or for the entire data resource.
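Both data item variability and data code variability are counts of variations, so the same small calculation covers them. The sketch below is illustrative only. The function names are hypothetical, and the variation labels are placeholders standing in for real data characteristic variations and data reference set variations.

```python
from statistics import mean


def variability(variations_by_component):
    """Count the distinct variations recorded for each component.

    The component is a data characteristic (data item variability) or a
    data reference set (data code variability).
    """
    return {
        name: len(set(variations))
        for name, variations in variations_by_component.items()
    }


def average_variability(variations_by_component):
    """Average the variability over all components supplied."""
    return mean(list(variability(variations_by_component).values()))


# Placeholder data matching the examples in the text.
ITEM_VARIATIONS = {
    "Stream Segment. Name": [f"variation {i}" for i in range(1, 7)],
}
CODE_VARIATIONS = {
    "Gender": [f"variation {i}" for i in range(1, 9)],
}
```

Supplying the data characteristics of one data subject, of a segment, or of the entire data resource gives the average at that level.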

SUMMARY

Data cross-referencing continues the process of thoroughly understanding disparate data by cross-referencing the disparate data to a common data architecture. The data cross-reference scope is driven by the scope of the data inventory. The data must be inventoried before they can be cross-referenced. The data cross-referencing may be done in concert with the data inventory, or it may be done after the data inventory has been completed. The data cross-referencing is usually done by a core team, with input from others as needed.

Data cross-referencing is documented according to a common data architecture. The documentation is stored by whatever means is appropriate for the organization. An initial common data architecture is developed and is continually enhanced during data cross-referencing. The data subject thesaurus and a set of common words used in data names are maintained throughout the data cross-referencing.

Data cross-references are made between data product sets or variations and corresponding data subject variations, between data product units or variations and corresponding data characteristic variations, and between data product codes or variations and corresponding data reference set variations. No other cross-references are made between data products and a common data architecture.

Interim common data architectures may be developed for major segments of very large organizations. These interim common data architectures are then treated as data products and are cross-referenced to a final common data architecture. Tactical data stewards must be actively involved in the process to ensure that it is performed appropriately.

The common data architecture is continually enhanced during data cross-referencing. The data inventory may also be enhanced as additional insight is gained about disparate data. The data inventory may also need to be adjusted based on insights gained during data cross-referencing. A common data architecture should be reviewed periodically and at the conclusion of data cross-referencing. Adjustments are made so that a common data architecture accurately represents an organization’s perception of the business world. Data cross-references may need to be modified based on adjustments to the data inventory and common data architecture.

Data cross-referencing provides an understanding of disparate data within the context of a common data architecture. It sets the stage for designating a preferred data architecture, which is used for data transformation. The quality of data cross-referencing, like the quality of the data inventory, determines the quality of the preferred data architecture and data transformation.

QUESTIONS

The following questions are provided as a review of the process to thoroughly understand disparate data, and to stimulate thought about how to understand disparate data.

  1. How is the scope of data cross-referencing established?
  2. Who should be involved in data cross-referencing?
  3. How is an initial common data architecture developed?
  4. How detailed should an initial common data architecture be?
  5. How is a data subject thesaurus useful?
  6. How are common data words useful?
  7. What’s the purpose of interim common data architectures?
  8. Why are data product codes or variations cross-referenced to a data reference set variation?
  9. What is the purpose of a data cross-reference comment?
  10. How are data redundancy and data variability determined?