Chapter 6

DATA INVENTORY PROCESS

How to go about inventorying disparate data.

Organizations seldom have a complete inventory of all data at their disposal. A data inventory process resolves that problem by providing a complete inventory of the existing data maintained by, and at the disposal of, the organization. It also sets the stage for understanding the existing data so they can be cross-referenced to a common data architecture and a preferred data architecture can be designated.

Chapter 4, on Data Variability, described the different types of variability that could be encountered in a disparate data resource. Chapter 5, on Data Inventory Concept, described what needs to be done during the data inventory and the constructs for documenting the results of a data inventory. The current chapter describes how to conduct a data inventory and how to document the results of that inventory.

The data inventory process not only inventories the existing disparate data, but also breaks down any combined data into their basic components and begins documenting an initial understanding about disparate data. Both the breakdown and the initial understanding are needed to thoroughly understand the disparate data and resolve data disparity.

DATA INVENTORY PREPARATION

Data inventory preparation includes setting the scope of the data inventory, establishing the sequence of inventorying data, determining the involvement of people in the data inventory, and documenting the data inventory. Each of these topics is described below.

Data Inventory Scope

A data resource inventory must, ultimately, include all data at an organization’s disposal. No data should be excluded from a complete data resource inventory. Many organizations consider only the data in databases, which is a start. However, inventorying databases is far from a complete inventory of disparate data. All data at an organization’s disposal must ultimately be inventoried.

The data in an organization’s data resource cannot be completely inventoried and documented at one time. A phased approach is established based on the problems that are critical to the business. Priorities for a phased approach are established based on benefits to the business and where the business is experiencing pain.

Major pain points are often the best place to start inventorying and understanding the existing data. If the business is experiencing many pain points, then a decision needs to be made about the most critical pain points, and the priorities need to be based on the severity of the pain to the business.

Priorities can also be based on business critical data. Business critical data are any data that are critical to operating the business. Business critical data are often referred to as vital data, high priority data, core business data, baseline data, crucial data, and so on. Essentially, business critical data are any data supporting the pulse of the business.

In most situations I’ve encountered, the major pain points occur with the business critical data. Pain with business critical data is often far more severe than pain with data that are not business critical. Therefore, the general approach to establishing priorities favors business critical data that are creating pain.

For example, if public works data are critical to the business and poor quality public works data are causing pain, that subject area becomes the initial scope. Similarly, if employee data, stream data, infrastructure data, and so on, are critical to the business and people are experiencing the pain of low quality data, those subject areas become the initial scope of the data inventory.

The initial scope of the data inventory may increase or decrease as the data inventory progresses. I’ve found that the initial scope actually increases in the majority of situations. What typically happens is that the initial scope is set based on pain with business critical data. However, as the data inventory progresses, additional data are discovered beyond the initial scope, and the scope is increased to include those additional data.

In Data Resource Simplexity, I mentioned a client that identified 16 databases with customer data that were critical to the business, and the disparity was so bad that the organization had essentially lost track of its customer base. We established a scope and schedule based on those 16 databases. When I arrived, they had discovered an additional seven databases with customer data, for a total of 23 databases containing customer data. That situation is typical of what happens when organizations begin inventorying their disparate data.

I’m often asked how to continue data inventory and documentation when the first phase is over. In the vast majority of situations I’ve encountered, organizations have no problem continuing the data inventory process. People see the benefit and are literally knocking at the door to be included in the next data inventory phase.

Usually, organizations start waves of data inventory based on priorities. As one wave moves from data inventory into data cross-referencing, another wave of data inventory is initiated. Once the process is started, it is usually self-perpetuating, and is driven by the benefits.

Priorities may, and often do, change during the data inventory process. Balancing increased scope and changes in priorities can be difficult. However, it is not impossible if successive waves of data inventory are established. Formal project management can be implemented to track waves of data inventory, increased scope, and changing priorities.

Data Inventory Sequence

When the data inventory scope has been set, the sources of insight about disparate data within that scope are identified. Typically, the databases that contain any data that could conceivably be within the scope are identified. Not only the major databases, but any database that contains data within the scope, even one with only a few data items, must be identified and included. It is better to err on the side of too many databases than too few, to obtain as much understanding as possible.

Application programs; data models; screens, reports, and forms; formal and informal documentation; and knowledgeable people must also be identified and included in the data inventory. Even with the best guess about the sources of insight to be included in the data inventory, additional sources will be identified during the course of the data inventory.

A horizontal phased approach is to conduct a high-level data inventory and document all of the databases within the organization as data products. The high-level data inventory provides a good overview of the data the organization maintains in their data resource—the size of the breadbox. Next, the highest priority databases are inventoried and the data files are documented as data product sets. Finally, the highest priority data files are inventoried and the data items are documented as data product units.

A vertical phased approach is to conduct the data inventory starting with the highest priority database and documenting that database clear down to the data items. Then, the next highest priority database is inventoried and documented.

The horizontal phased approach may seem unnecessary in smaller organizations. However, it could be a valid approach in a very large organization. One state agency where I consulted had a monumental problem: answering legislative information requests took more than six months, which was totally unacceptable. In addition, some of the information provided to the legislature was inconsistent and caused concern among the legislators.

One thing an organization does not want to do is tell executives, superior courts, state or federal legislatures, and other regulatory organizations that a request for information cannot be met in a timely manner. Even worse is to provide conflicting information in a legal or political environment.

We followed a horizontal phased approach by inventorying all of the databases, then the data files in the high priority databases, then the data items in the high priority data files. The process took four people nearly a year just to inventory and document the high priority data. The result showed that the agency had far more data than they had ever expected, and that those data were more disparate than expected, which clarified the problems with disparate data. No wonder the agency had problems meeting requests for information.

The general sequence for a data inventory is to inventory and document the primary sources of insight first, followed by the supplemental sources of insight. The supplemental sources provide additional insight that can be used to enhance the documentation of the primary sources of insight. Usually, trying to document the supplemental sources of insight first produces confusing results and often leads to wasted resources.

Within the primary sources of insight, databases and data files are generally inventoried and documented first. Then the other primary sources of insight, such as data models, application programs, and screens, reports, and forms, are inventoried and documented. However, the sequence can be altered depending on the particular problems that an organization faces.

For example, understanding screens, reports, or forms may be a critical problem for the organization. In that situation, the screens, reports, or forms may be inventoried and documented first. Similarly, the data used and produced by application programs may be a critical problem for the organization. In that situation, the application programs are inventoried and documented. The same is true for data models.

Data Inventory Involvement

The people involved in a data inventory need to be identified and initial time commitments established. Typically, a core team is established and responsibilities are assigned. Ancillary team members are identified and their responsibilities are designated. Finally, anyone with insight about the disparate data is identified.

The core team often remains the same throughout the data inventory. However, additional ancillary team members and people with insight about the disparate data are frequently identified. Operating in an area of discovery often leads to the identification of many people who can contribute to understanding disparate data.

Initial time commitments can be established. However, these time commitments will likely increase as uncertainty about the disparate data increases. Not only are more people involved in the data inventory, but also the time commitment of those people frequently increases.

Each data inventory project takes on a life of its own. Even though an initial scope, involvement, and commitments are established, the project develops a life of its own as people become involved in understanding disparate data. The schedule should be general and not too rigid. The area of discovery requires that the schedule be flexible to allow the capture and documentation of all understanding about the data resource.

A large time commitment up front often turns people away from a data inventory because they don’t see the benefit. However, when people see the benefit they will get involved and contribute as much time as necessary to resolve existing problems.

One situation I encountered involved developing a new county property appraisal system. The existing appraisal system had massive problems with a plethora of data reference sets used to document and appraise residential and commercial property. The data reference sets had coded data values and names, but seldom had good definitions.

Appraisers were extremely busy appraising property and defending appraisals during appeals, but were contributing time to developing the new system. I set up an online system for the appraisers to enter the definitions they used, concerned that they had little time to contribute. To my surprise, a massive number of definitions was entered, and those definitions were in gross disagreement. My concern about involvement turned to concern about how to resolve the disparity in definitions. Ultimately, the chief commercial and chief residential appraisers went online to establish formal definitions for each data reference item in each data reference set.

The appraisers saw the benefit of resolving a problem that was contributing pain to the assessor’s office. They contributed the time necessary, often their own time, to resolve the problem. Therefore, initial schedules and time commitments should be general and not too rigid. Rigid schedules and time commitments seldom work during data inventory.

A key point about involvement is to let people establish their own schedule. I’ve found that when people need to pause and think, they need to be allowed that time. When people are ready and willing to move ahead, they need to be given every opportunity to move ahead uninterrupted.

Another key point about involvement is to establish a no blame - no whitewash policy to encourage people to become involved. Whenever blame is placed, people are polarized and do not readily contribute to the data inventory. When bad situations are covered up, the understanding about the disparate data is compromised, and any subsequent tasks will likely be less than fully successful. Therefore, all contributions to the data inventory must follow a no blame - no whitewash policy.

I worked with one organization where the business people brought me in because they could not get the data they needed to support the business. The data had been physically manipulated to the extent that they didn’t know what the data represented. The business had difficulty getting the data they needed, and the data they did get often conflicted.

My first task, which was not trivial, was to get the business and IT to sit down at the same table. The hostilities were obvious. The business attitude was how dare you mess up my critical business data. The IT attitude was how dare you question my ability to manipulate the data. The entire first session of several hours was spent shedding the hostilities and adopting a no blame - no whitewash approach. The situation existed—that was a fact. Let’s get on with understanding the data and resolving the situation.

Data Inventory Documentation

The Data Product Model provides the construct for documenting the data inventory. Each organization needs to determine how and where that Data Product Model is implemented, based on their needs and physical operating environment. Software products for documenting the data inventory may be obtained, but those products must have the capability to document the data inventory as described in the current book. Organizations should not consider acquiring a software product that does not document the data inventory as described here.

The opt for detail principle emphasizes that all detail should be documented during the data inventory process. It’s far better to have more detail than needed to thoroughly understand the disparate data than to have too little detail and not thoroughly understand the disparate data. More detail leads to better understanding and more informed decisions about creating a comparate data resource.

Data inventory is a dynamic discovery process. New insight is continually gained through the data inventory process, and even through the data cross-referencing and preferred data designation processes. The continuous enhancement principle recognizes that new insight is constantly gained and that the documentation needs to be constantly enhanced based on that new insight.

All enhancements to data inventory documentation should contain the name of the person entering the enhancement, the date of the enhancement, and the source of insight leading to the enhancement. These provide a way to track and verify the enhancement. The no blame - no whitewash policy ensures that people can enter enhancements without fear of any repercussion.
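
Where the documentation is kept in a simple custom tool, those tracking fields can be captured with each enhancement. The following is a minimal sketch in Python; the field names and example values are hypothetical, not a prescribed implementation.

     from dataclasses import dataclass
     from datetime import date

     @dataclass
     class Enhancement:
         """One enhancement to the data inventory documentation."""
         text: str               # the enhancement itself
         entered_by: str         # person entering the enhancement
         entered_on: date        # date of the enhancement
         source: str             # source of insight leading to the enhancement

     # Hypothetical example of a tracked enhancement.
     note = Enhancement(
         text="Locations use the Public Land Survey System.",
         entered_by="J. Analyst",
         entered_on=date(2011, 3, 14),
         source="Water Rights data dictionary",
     )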

The data inventory documentation should be retained, even after data resource integration is complete. The documentation not only provides a history about the integrated data resource (data tracking), but old material may be found at any time. Retaining the documentation provides a way for people to understand that old material.

The data components that are captured during the data inventory are the four components of the Data Architecture Segment of the Data Resource Management Framework—data names, data definitions, data structure, and data integrity rules. All four of these components are needed to thoroughly understand the data.

Data names may or may not exist, as described in the last chapter. If a data name exists, that data name should be used as it appears. If a data name does not exist, an appropriate data name needs to be created.

Data definitions are often vague. Seldom are they complete or accurate. Existing data definitions need to be documented, but do not assume that those data definitions are complete and accurate.

The scope of the data is often vague or unknown. The scope of the data must be documented in a scope statement, which describes any selection criteria, how the data are a subset of a larger set of data, or the role of the data.

Data integrity rules are often nonexistent. Most that do exist are data edits in database management systems or applications. Only the data integrity that actually exists should be documented during the data inventory, even though people have a tendency to list the desired data integrity rules. The desired data integrity rules are defined during the preferred data architecture designation process.

SUPPORTING TECHNIQUES

Two supporting techniques for the data inventory are data scanners and data profilers. Each of these techniques is described below.

Data Scanning

Data scanning is used primarily for capturing data to be stored in the data resource, such as document scanning, bar code scanning, and so on. However, data scanning also applies to scanning databases to identify the data items that exist in the database, or scanning application programs to identify and document the data items that are used or produced. The scanning does not capture the actual data values—it only captures the data items so they can be properly documented.

Data scanning in the context of data resource integration is the process of electronically or manually scanning databases or application programs to identify the data stored by databases, or the data used or produced by applications. Data scanning can capture technical insight into the data, but cannot capture semantic insight into the data. Data scanning for data resource integration is often referred to as data discovery scanning because it supports increased understanding of the existing disparate data.

Data scanners are a source of insight about disparate data. Data scanners are good at inventorying existing data and documenting that inventory. Data scanners are good for capturing technical data contained in database management systems and for identifying changes in the databases that occurred since the last scan.
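
As a minimal illustration, assuming the database is SQLite and its catalog is accessible, an automated scanner might capture the technical inventory as shown below. Only data item names and types are captured, never data values.

     import sqlite3

     def scan_database(path):
         """Scan a database catalog and return its data files (tables) and
         data items (columns), without reading any data values."""
         connection = sqlite3.connect(path)
         inventory = {}
         tables = connection.execute(
             "SELECT name FROM sqlite_master WHERE type = 'table'"
         ).fetchall()
         for (table,) in tables:
             # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk).
             columns = connection.execute(f"PRAGMA table_info({table})").fetchall()
             inventory[table] = [
                 {"name": column[1], "type": column[2], "primary_key": bool(column[5])}
                 for column in columns
             ]
         connection.close()
         return inventory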

However, data scanners provide little semantic understanding of the data. Data scanners cannot integrate the inventory across multiple sources, such as scanning a database and then scanning an associated data dictionary, because they have no context for that integration. Data scanners do not work well with screens, reports, and forms, or for formal and informal documentation.

In other words, data scanners are good at inventorying the data that exist and changes to those data over time, but provide very little semantic understanding about those data. The semantic understanding comes from human intervention. Data scanners can do the inventory, but people need to interpret the meaning of that inventory.

I frequently receive questions about automated versus manual data scanning. Automated data scanning is faster, but does not provide the semantic understanding. Manual data scanning is slower, but comments that provide insight into the meaning of the data can be entered by the person doing the scanning. In addition, cross-referencing is generally easier and faster when done by the same person who did the manual scanning.

Data cross-referencing has not been described yet. However, data scanning and data cross-referencing can be done serially or in parallel. The process is serial when all of the data within the scope are inventoried and then all those data are cross-referenced to a common data architecture. The process is parallel when each piece of data or group of data is inventoried and then cross-referenced. When a person understands the processes of inventorying and cross-referencing data, those processes can be used in any manner to understand the data with minimum effort.

The risk with the serial approach is that the understanding gained during data inventory may be lost by the time the cross-referencing is performed. The risk with the parallel approach is that a cross-reference may be made without thoroughly understanding the data because not all of the sources of insight have been inventoried. The choice of a serial approach or a parallel approach needs to be made based on the relative risks. In many situations, a mixture of the serial approach and the parallel approach is used based on the relative risks.

The choice of a serial or parallel approach also depends on whether data scanners were used for the data inventory. When data scanners are used for the data inventory, the serial approach is probably the best. When the data inventory is done manually, either the serial approach or the parallel approach may be used.

Each organization needs to make a choice about the degree of automated versus manual scanning that is done during the data inventory process. For example, automated scanning may be done for databases followed by manual scanning of data dictionaries, informal notes, and so on. The results of the manual scanning can be used to supplement the automated scanning.

Several organizations where I worked wrote their own automated scanners for database inventory, and ran those scanners periodically to identify changes in the databases. These custom scanners work well for an organization’s unique operating environment, and can be built quickly and inexpensively. They can also be designed for easy entry of supplemental insight gained from other sources.
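
A minimal sketch of such a change-detection scanner, assuming two inventory snapshots in the form produced by the scan sketch above:

     def diff_inventories(previous, current):
         """Report data files and data items added or removed since the last scan."""
         changes = []
         for table in sorted(current.keys() - previous.keys()):
             changes.append(f"new data file: {table}")
         for table in sorted(previous.keys() - current.keys()):
             changes.append(f"dropped data file: {table}")
         for table in sorted(current.keys() & previous.keys()):
             previous_items = {item["name"] for item in previous[table]}
             current_items = {item["name"] for item in current[table]}
             for item in sorted(current_items - previous_items):
                 changes.append(f"new data item: {table}.{item}")
             for item in sorted(previous_items - current_items):
                 changes.append(f"dropped data item: {table}.{item}")
         return changes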

Data Profiling

Data profiling has many different definitions. It’s an approach to data quality analysis that uses statistics to show patterns of usage and patterns of content, automated as much as possible. Some profiling activities must be done manually, but most can be automated.2

Data profiling is also the process of examining the data available in existing databases and files, and collecting statistics about those data. The statistics can be used to identify uses of the data, improve data quality, understand the data for integration, document the data, and provide an inventory of the data. Data profiling can be used to identify candidate data sources for a data warehouse, clarify the structure of the data, and identify and understand data anomalies. The purpose of data profiling is to understand the data thoroughly so they support the business.

Data profiling is often referred to as data discovery. It is used to decipher and validate patterns in the data. It is used to identify problem areas so plans can be made to resolve those problem areas. It can be used to identify redundant data across data sources and to prepare for the integration of those data. It can be used to identify the completeness and the accuracy of the data.

Data profiling, in the context of data resource integration, is the process of analyzing the data values in databases to determine possible data meaning, data structure, and data integrity rules in preparation for data resource integration. These determinations must be verified before they can be accepted as fact and used for data resource integration. Data profiling for data resource integration is often referred to as data discovery profiling because it supports increased understanding of the existing disparate data.

Data profiling can be useful for understanding disparate data if used properly. It’s similar to psychological profiling of a criminal or the biochemical profiling of blood. Data profiling combines formal analysis with informal sleuthing to gain a better understanding of the data resource. It’s preliminary and not absolute, but it does provide some insight into the data resource.

Data domain profiling analyzes the existing domain of data values for data items in a database. The existing data values, their frequency distribution, variability, missing values, existence of multiple values, possibility of redundancy, and so on, are analyzed and documented. The analysis can identify the variability in data values, both within a data file and across data files. The variability in content and format, such as measurement units, forms of a person’s name, and so on, can be identified. The existence of elemental and combined facts can be identified.

The analysis can identify the existence or lack of data integrity rules. However, the results of data domain profiling must be reviewed and verified before any data integrity rules are documented. In other words, the analysis is only a preliminary indication that must be verified.

Data domain profiling can identify data anomalies. A data anomaly is any data value that does not follow a pattern that matches a reasonable expectation of the business. It could be a correct data value, or it could be an error. If it’s a correct data value, it could be acceptable or unacceptable to the business. Data anomalies can be used during the designation of preferred data to create precise data integrity rules.
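
A minimal sketch of data domain profiling for a single data item is shown below. The frequency threshold for flagging candidate anomalies is an assumption; any flagged value must still be reviewed and verified before it is treated as an anomaly.

     from collections import Counter

     def profile_domain(values, anomaly_threshold=0.01):
         """Profile the existing domain of data values for one data item."""
         total = len(values)
         missing = sum(1 for value in values if value in (None, ""))
         counts = Counter(value for value in values if value not in (None, ""))
         # Rare values are only candidate anomalies -- they could be correct.
         anomalies = [value for value, count in counts.items()
                      if count / max(total, 1) < anomaly_threshold]
         return {
             "total_records": total,
             "missing_values": missing,
             "distinct_values": len(counts),
             "frequency_distribution": counts.most_common(),
             "candidate_anomalies": anomalies,
         }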

Functional dependency profiling analyzes the data values for possible data relations between sets of data. If the same domain of data values is identified in different data files, a presumption can be made that those two data files might be related through a primary key – foreign key relationship. For example, the domain of values A through G appearing in data items in two different data files might indicate a subordinate data entity and a data reference set.

However, functional dependency profiling is an indication only and may not be correct. For example, two different data reference sets could exist with each having the domain of values A through G. Therefore, the results of functional dependency profiling must be reviewed and verified before a data relation is established.
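
The comparison itself is easy to automate. The following is a minimal sketch, assuming the data values for each data item have already been extracted; the overlap threshold is an assumption, and a positive result is an indication only.

     def candidate_relation(child_values, parent_values, threshold=0.95):
         """Indicate a possible primary key - foreign key relation when nearly
         all values of one data item fall within the domain of another."""
         child = {value for value in child_values if value is not None}
         parent = set(parent_values)
         if not child:
             return False
         overlap = len(child & parent) / len(child)
         # An indication only; it must be reviewed and verified before
         # a data relation is established.
         return overlap >= threshold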

Human data profiling identifies the pattern of actions different people exhibit when entering or editing data. Patterns about how people collect data, enter data, and edit data can be helpful for understanding disparate data. The patterns can also be useful for identifying data integrity rules that are not documented anywhere.

Automated tools can support data profiling. Existing data values can be analyzed by some very sophisticated techniques to identify candidate data integrity rules and functional dependencies. These tools, however, cannot capture understanding or meaning placed on data by people or organizations. They cannot determine intent or use of the data, and they cannot track the data from their source to their present location. Therefore, they only provide an indication that must be verified.

The verified results of data profiling are documented in the appropriate data product unit or data product code set. Preliminary results could be documented, but a statement should be made that the results are preliminary and need to be verified.

Data profiling is typically done during the data inventory so that all possible insights are available for data cross-referencing. However, questions may arise during data cross-referencing, and even during the designation of preferred data, that require additional data profiling. Therefore, data profiling could be done at any time that additional insight is needed.

INVENTORYING DISPARATE DATA

Data variability was described in Chapter 4. The last chapter described the data product model used for documenting the data inventory. The current section describes how to document the variability of disparate data according to the data product model, including how to break combinations of data down to their basic components.

Databases

Data products are any major independent set of documentation about disparate data, such as an information system, database, data dictionary, major project, data model, and so on. Data products are identified, named, and defined at the beginning of the data inventory process, or anytime during the data inventory process when a relevant data product is discovered.

For example, a Water Rights database is being inventoried. The data product name might be Water Rights Database. The initial data product definition might be something as simple as A database containing a variety of data files documenting all of the water rights within the State of Washington. The definition is enhanced as additional insight is gained.

No scope statement or data integrity rules are documented for data products. Data file-relation diagrams can be documented for data products. Diagrams can be formal or informal. They can be included in the definition or reference can be made to an external location.

Data Files

Data files can represent a single data subject or multiple data subjects, and they can represent a complete data subject or a partial data subject. These determinations cannot be made with certainty during the data inventory, because the complete scope of a data subject within a disparate data resource is not known. These determinations can only be made with certainty after data cross-referencing has been completed and the complete scope of a data subject has been determined.

However, indications as to the type of variability in a data file can be noted in the data product set definition during the data inventory process. For example, a person inventorying a data file may know or determine that the data file represents part of a data subject or multiple data subjects based on their experience. That insight can be documented in the data product set definition.

Data files are documented as data product sets. The physical data file name usually becomes the data product set name. A data file definition may exist, or an initial definition may need to be created. For example, a data file in the Water Rights Database might be the Water_Right_File, which becomes the name of the data product set. Usually, the data file name is entered as it appears for the physical data file. The initial definition might be as simple as The Water Right File contains the basic features for each water right. That definition is enhanced as additional insight is gained.

The Water Rights Database and Water_Right_File are documented as shown below. The notation in parentheses after each component indicates the data product model component used to document the data inventory. It’s included here for understanding only, and does not appear when documenting disparate data.

     Water Rights Database (DP)

          Water_Right_File (DPS)

Data files can contain multiple data record types. Each of these data record types is documented as a data product set variation. For example, the Water Right File has two data record types, called Control_Number and AA_Transaction. Data product set variations are established for each of these data record types. The data product set variation names would be Control_Number and AA_Transaction. The initial definitions might be A record of the control numbers for water rights and The location of the water right, respectively. These definitions would be enhanced as additional insight is gained about each data record type.

The components of the data inventory are shown below.

     Water Rights Database (DP)

          Water_Right_File (DPS)

               Control_Number (DPSV)

               AA_Transaction (DPSV)    
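
Where the documentation is maintained in a custom tool, the data product hierarchy can be represented directly. The sketch below is one possible Python representation of the example above; it illustrates the construct and is not a prescribed implementation.

     from dataclasses import dataclass, field

     @dataclass
     class Component:
         """One component in the data product hierarchy (DP, DPS, DPSV, and so on)."""
         name: str
         kind: str                # "DP", "DPS", "DPSV", "DPU", ...
         definition: str = ""
         children: list = field(default_factory=list)

     water_rights = Component(
         "Water Rights Database", "DP",
         "A database containing a variety of data files documenting all of "
         "the water rights within the State of Washington.",
         children=[
             Component(
                 "Water_Right_File", "DPS",
                 "The Water Right File contains the basic features for each water right.",
                 children=[
                     Component("Control_Number", "DPSV",
                               "A record of the control numbers for water rights."),
                     Component("AA_Transaction", "DPSV",
                               "The location of the water right."),
                 ]),
         ])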

A scope statement describes any selection criteria, how the data product set is a subset of a larger set of data, and/or the role of the data. For example, a data file may contain a subset of student data for middle school students, or for students from 2000 through 2010. The data file may contain data about plumbers, which are a subset of contractors. Any description of the scope of the data should be documented for a full understanding about what the data represents.

Data integrity rules that pertain to data records within the data file or between data files are documented for data product sets and data product set variations. The data integrity rules can be documented in the formal notation or in a textual description. The objective is to capture the data integrity rule, not necessarily have that data integrity rule in the formal notation.

Physical primary keys and foreign keys are documented for each data file, as described in the last chapter. The foreign keys are listed in alphabetical order by parent data subject name for ready identification. Data relations are documented with the foreign key, as described in the last chapter.

Primary and foreign keys are often difficult to identify. Some searching may need to be done because the primary and foreign keys are not readily apparent. Primary keys can be identified and used to find foreign keys in subordinate data files. Similarly, foreign keys can be identified and used to find primary keys in parent data files. However, finding data items with the same name in different data files does not always indicate a primary key – foreign key relationship.

Data Records

A data record can represent a complete data occurrence or a partial data occurrence. A data record can also represent a single data occurrence or multiple data occurrences. Multiple data occurrences can be either subordinate data occurrences or parallel data occurrences.

The determination of whether a data record represents a complete or partial data occurrence cannot be made with certainty during the data inventory, because the complete scope of the data occurrence within a disparate data resource is not known. The determination can only be made with certainty after data cross-referencing has been completed and the complete scope of a data occurrence has been determined. However, indications of the completeness of a data occurrence can be noted in the data product set or data product set variation definition during the data inventory process.

Partial data occurrences might be identified by multiple record types within a data file. For example, record type D might contain all of the detail data for a street segment, record type C might contain an initial comment, and record type E might contain an extension of that comment. Each of these three data record types would be documented as data product set variations, as described above.

The determination of whether a data record represents single or multiple data occurrences, and whether multiple data occurrences are subordinate or parallel, can usually be made during the data inventory. The determination is documented in the data product set or data product set variation definition.

The data items for subordinate data occurrences are only documented once, regardless of how many subordinate data occurrences are contained in the data record. For example, a data record contains multiple subordinate data occurrences for quarterly stream flow during the year. Each data occurrence contains data items for stream depth, stream width, total quarterly flow volume, and average daily flow volume. The data product set definition describes the quarterly data occurrences. Each of the four data items is only documented once.

Similarly, the data items for parallel data occurrences are only documented once, regardless of how many parallel data occurrences are contained in the data record. For example, a single data record for traffic collisions contains four data occurrences for four different traffic collisions. Each data occurrence contains the traffic collision number, the date, the number of injuries, the number of deaths, the number of vehicles, and the total damage. The data product set definition describes the four data occurrences. Each of the six data items for a data occurrence is only documented once.

Note that the data file variability and the data record variability may not be the same. For example, a data file could represent a complete data subject, meaning data records for each data occurrence in that data subject, but the data occurrence may contain only part of the data items for that data subject.

Data Instances

Data records can represent current data instances or historical data instances. The historical data instances may either be complete and contain all of the data items, or may be partial and contain only the data items that changed and identifying data items. Historical data instances can be self-contained in the same data file or stored in separate data files.

Usually, these determinations can be made during the data inventory. They are documented in the data product set or data product set variation definition. The data items are documented as described below.

The importance of data product set and data product set variation definitions should be obvious. The comprehensive definitions covering data files, data records, and data instances will be most valuable for data cross-referencing, designating a preferred data architecture, and data transformation. Remember the opt for detail principle and document any insights about the data files.

Data Items

Data items are probably the most variable of any data in a disparate data resource. Each of the data item variability types must be identified and properly documented.

Data items are documented as data product units or data product unit variations. The data product unit name should be the name of the data item as it is contained in the data file. Ideally, the data item names should be listed in the exact format and sequence as they appear in the data file, which helps draw technical people into the data inventory process.

Data item definitions should be as comprehensive as possible. Any source of insight about a data item should be documented in the data product unit definition. Remember the opt for detail principle.

Data items can represent a single data characteristic, multiple data characteristics, or a partial data characteristic.

A single characteristic data item can be consistent, meaning it always contains the same data characteristic. For example, the single characteristic data items for the Water_Right_File are each documented as data product units, as shown below. Initial data product unit definitions are entered, and are enhanced as additional insight is gained.

     Water Rights Database (DP)

          Water_Right_File (DPS)

               Control_Number (DPSV)

                    Type_Water (DPU)

                    Region (DPU)

                    Old_New (DPU)

                    Assigned_Number (DPU)

                    Stage (DPU)

                    Record_Modifier (DPU)

                    Reason_for_Modifier (DPU)

            AA_Transaction (DPSV)

                   Trans_Code (DPU)

                   County (DPU)

                   Status (DPU)

                   Name (DPU)

                   Number_of_POD/W (DPU)

A single characteristic data item can also be variable, meaning it contains different data characteristics in different data records, but only one of those characteristics at any time. For example, a data item might contain a student’s birth date, or the reason for no birth date. The variable data item needs to be broken down into two data product unit variations, as shown below. The data product unit definition states what the variable data item represents as it appears in the database. The data product unit variation names need to be created and should be as close to a reasonable name as possible. The data product unit variation definitions describe the real meaning of each variation.

     Std_Bth_Dt (DPU)

          Student Birth Date (DPUV)

          Birth Date Reason Code (DPUV)
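
During later data transformation, each value of such a variable data item must be routed to the proper variation. The following is a minimal sketch, assuming a hypothetical MM/DD/YYYY format distinguishes birth dates from reason codes.

     import re

     def classify_std_bth_dt(value):
         """Assign a value of the variable data item Std_Bth_Dt to its
         data product unit variation (the format is assumed for illustration)."""
         if re.fullmatch(r"\d{2}/\d{2}/\d{4}", value):
             return "Student Birth Date"
         return "Birth Date Reason Code"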

A multiple characteristic data item contains two or more data characteristics that may or may not be closely related. When the data characteristics are closely related, such as a person’s name or birth date, the data item is documented as a data product unit. However, when the data characteristics are not closely related, the individual components need to be broken down and documented as data product unit variations.

For example, a project data item contains the project member’s responsibility on the project, the date the person was assigned to the project, and the monthly time commitment. Each of these components needs to be defined as a data product unit variation, as shown below.

     Prjct_Dtl (DPU)

          Project Member Responsibility Description (DPUV)

          Project Member Assigned Date (DPUV)

          Project Member Monthly Hours (DPUV)

A partial characteristic data item contains only part of a data characteristic. For example, a vehicle accident description is broken down into three different data items due to a length limitation. Each of these data items is documented as a data product unit, as described above. However, an additional comment is made in the three data product units’ definitions that each of these data items is part of a complete data item.

Alternatively, a data product unit could be created for the combined data item with a definition that describes the individual data items as part of a combined data item, as shown below. The physical data items are documented as data product unit variations, with definitions describing that each is part of a combined data item.

     Vehicle Collision Description (DPU)

          Desc_1 (DPUV)

          Desc_2 (DPUV)

          Desc_3 (DPUV)
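
When the preferred data are eventually designated, the partial data items can be recombined. The following is a minimal sketch, assuming the three physical data items simply concatenate.

     def combine_description(record):
         """Recombine the partial data items Desc_1 through Desc_3 into the
         complete Vehicle Collision Description."""
         parts = (record.get("Desc_1", ""),
                  record.get("Desc_2", ""),
                  record.get("Desc_3", ""))
         return "".join(parts).rstrip()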

A data item can have a fixed or variable format. The format of the data item is documented in the data product unit definition, such as a fixed format date of MM/DD/YY, or a variable format mechanic’s name that could be in either normal or inverted sequence. If a companion data item identifies the format, then both definitions need to describe the relationship between the two data items. These definitions will assist in the data cross-referencing.

A data item can have a fixed or variable length. The data item length is documented in the data product unit definition. Any delimiters used for designating the data item length for variable length data items are also described. If a companion data item identifies the length, then both definitions need to describe the relationship between the two data items.

A data item can have a fixed or variable sequence within a data record. The data item sequence is documented in the data product unit or data product unit variation definition. Fixed sequence data items are relatively easy to describe. Variable sequence data items usually have a companion data item that provides a mnemonic identifying the variable sequence data item. Both data item definitions need to describe the relationship between the two data items.

Data integrity rules pertaining to the data item or the relation between data items are documented for each data item. The data integrity rules can be documented in the formal notation or in a textual description. The objective is to capture the data integrity rule, not necessarily have that data integrity rule in the formal notation.

A difference in data integrity rules does not represent a data product unit variation. Data integrity rules are often so weak and so variable that trying to document differences as data product unit variations only leads to additional work and confusion. Therefore, the data integrity rules are documented for their respective component.

Data Codes

Data codes are the second most variable data in a disparate data resource. Each of the data code variability types must be identified and documented. A comment can be made in the parent data product unit or data product unit variation regarding the variability of the data codes.

Data codes can be single property or multiple properties. A single property data code represents only one data property. For example, B for Brown Hair, M for Male, and E for Executive Manager are single property data codes.

Single property data codes are documented as data product codes with any combination of the coded data value, the data code name, and the data code definition, as shown below. Sometimes a decision needs to be made whether a character string is a long data code name or a short data code definition. The best approach is for the person doing the data inventory to use their best discretion.

E Executive Above pay range 16 (DPC)

M Manager Pay range 12 to 15 (DPC)

S Supervisor Pay range 9 to 11 (DPC)

L Lead Worker Pay range 6 to 8 (DPC)

W Worker Pay range 5 and below (DPC)

Multiple property data codes combine two or more data properties into a single data code, as shown below.

E Executive Above pay range 16

M Manager, Supervisor Pay range 9 to 15

W Lead Worker, Worker Pay range 8 and below

The multiple property data codes are broken down into single property data codes and documented as data product code variations, as shown below. Note that no data product code variations are defined for Executive. Note also that coded data values do not appear for Supervisor and Lead Worker data product code variations. It’s sometimes difficult to identify the single properties. Usually, the best guess is made with adjustments during cross-referencing based on additional insight.

E Executive Above pay range 16 (DPC)

M Manager, Supervisor Pay range 9 to 15 (DPC)

     M Manager Pay range 12 to 15 (DPCV)

     Supervisor Pay range 9 to 11 (DPCV)

W Lead Worker, Worker Pay range 8 and below (DPC)

     Lead Worker Pay range 6 to 8 (DPCV)

     W Worker Pay range 5 and below (DPCV)

Data codes may represent a single data subject or multiple data subjects. Single subject data codes are documented as shown above.

A multiple subject data code includes more than one data property from different data subjects, as shown below. The data codes represent data subjects for gender, hair color, and eye color.

1 Male, Blond Hair, Blue Eyes

2 Female, Blond Hair, Blue Eyes

3 Male, Brown Hair, Blue Eyes

4 Female, Brown Hair, Blue Eyes

And so on.

These multiple subject data codes need to be broken down into single property data codes for single data subjects and documented as data product code variations, as shown below.

1 Male, Blond Hair, Blue Eyes (DPC)

     1     Male (DPCV)

     1     Blond Hair (DPCV)

     1     Blue Eyes (DPCV)

2 Female, Blond Hair, Blue Eyes (DPC)

     2     Female (DPCV)

     2     Blond Hair (DPCV)

     2     Blue Eyes (DPCV)

3 Male, Brown Hair, Blue Eyes (DPC)

     3     Male (DPCV)

     3     Brown Hair (DPCV)

     3     Blue Eyes (DPCV)

     And so on.

Note that when the documentation of data product code variations is complete, a single data property from a single data subject will be represented by many different coded data values. For example, Male is represented by data codes 1, 3, and so on. These multiple data codes will be resolved during data cross-referencing.
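
Once the data subjects are known, the mechanical part of the breakdown can be scripted. The following is a minimal sketch using the example values above; the order of data subjects within each combined code is an assumption that must be verified.

     # Combined codes from the example; each coded value carries one data
     # property from each of three data subjects.
     combined_codes = {
         "1": ("Male", "Blond Hair", "Blue Eyes"),
         "2": ("Female", "Blond Hair", "Blue Eyes"),
         "3": ("Male", "Brown Hair", "Blue Eyes"),
     }
     data_subjects = ("Gender", "Hair Color", "Eye Color")

     def break_down(codes):
         """Produce one data product code variation per data subject for
         each multiple subject data code."""
         variations = []
         for code, properties in codes.items():
             for subject, value in zip(data_subjects, properties):
                 variations.append((code, subject, value))
         return variations

     # break_down(combined_codes) yields ("1", "Gender", "Male"),
     # ("1", "Hair Color", "Blond Hair"), ("1", "Eye Color", "Blue Eyes"),
     # and so on.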

Set of Data Codes

A set of data codes may represent a complete data subject or a partial data subject. The determination of whether a set of data codes represents a complete or partial data subject cannot be made with certainty during data inventory, because the complete scope of the data codes within a disparate data resource is not known. Additional sets of data codes may be discovered that contain a larger domain of data codes. Therefore, the determination can only be made with certainty after data cross-referencing has been completed and the complete scope of data codes has been determined. An indication whether a set of data codes is complete or partial can be entered in the data product unit or data product unit variation definition.

A set of data codes may represent a single data subject or multiple data subjects. A single subject set of data codes was described above.

A multiple subject set of data codes is the situation where the data codes in a set of data codes represents more than one data subject. For example, a set of data codes for counties within a state contains 42 data codes. However, the state contains only 39 counties. Looking at the data codes in detail shows that codes 1 through 39 represent the counties, 40 represents outside the state, but within the United States, 41 represents outside the United States, but within North America, and 42 represents outside North America.

Clearly, the set of data codes represents something other than counties within a state. The data codes actually represent the source of revenue, which is within a county, outside the state but within the United States, outside the United States but within North America, and outside North America.

That set of data codes should be broken into two sets of data codes. The first set of data codes represents the revenue source as from a county, from the United States outside the state, from North America outside the United States, and from outside North America. The second set of data codes represents only the counties within the state.

These combined data codes are broken down as described above for multiple subject data codes and documented as data product code variations.

Data integrity rules are not documented for data product codes. The data integrity rules for data codes are documented as a data product unit or data product unit variation for the data item containing the data codes.

A set of data codes may contain a hidden hierarchy. Using the example in Chapter 4 with census codes, the data codes are broken down into components of the hierarchy: Census Race Category, Census Race Group, Census Race. The breakdown is done at the data product unit level. The parent data product unit represents the original set of individual data codes. The definition describes that the data codes represent a hidden hierarchy, and the data product unit contains detail codes representing the Census Race. The detail data codes are listed for the parent data product unit and two data product unit variations are created for Census Race Category Code and Census Race Group Code. The definitions describe the hierarchy and the position of the component within that hierarchy. Data product codes are listed for each data product unit variation.

The result of the breakdown of the hidden hierarchy is shown below. Note that the data codes for Census Race Category and Census Race Group contain a range of data code values.

Census Race (DPU)

     653 Hawaiian (DPC)

     And so on.

     Census Race Category (DPUV)

          653 – 699 Pacific Islander (DPC)

          And so on.

     Census Race Group (DPUV)

          653 – 659 Polynesian (DPC)

          And so on.
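
The range breakdown lends itself to a small lookup once the hidden hierarchy is identified. The following is a minimal sketch using the census example; the ranges shown are only those from the example and must be verified against the full set of codes.

     # Ranges of detail codes reconstructed from the example.
     race_groups = [(653, 659, "Polynesian")]            # Census Race Group
     race_categories = [(653, 699, "Pacific Islander")]  # Census Race Category

     def lookup(ranges, code):
         """Resolve a detail data code to its component of the hidden hierarchy."""
         for low, high, name in ranges:
             if low <= code <= high:
                 return name
         return None

     # Census Race 653 (Hawaiian) resolves to the Polynesian group and
     # the Pacific Islander category.
     assert lookup(race_groups, 653) == "Polynesian"
     assert lookup(race_categories, 653) == "Pacific Islander"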

The full extent of data code variability may not be known until data cross-referencing, or even the designation of preferred data. Whenever data code variability is encountered, that variability must be documented, and appropriate adjustments must be made to the data cross-referencing and preferred data designations.

Changes Over Time

Many things evolve over time, like species evolution, changes in the Earth’s geology, geographical and political boundaries, laws and regulations, and so on. The data resource is no different. It is dynamic and evolves over time in response to changes in the business.

Data resource evolution is the process where an organization’s data resource changes over time in response to business needs. Data names, data definitions, data structure, and data integrity rules all evolve over time. It’s not the same as the natural drift of the data resource toward disparity.

Data resource evolution must be identified and documented during the data inventory. Basically, any changes over time are documented as variations using the version notation from the data naming taxonomy.

For example, the scope of water rights might have changed over time. Those changes in scope, as reflected in the data definitions, would be documented as data product set variations, as shown below.

     Water Rights Database (DP)

          Water_Right_File (DPS)

               Water_Right_File <Pre-1971> (DPSV)

               Water_Right_File <1971 – 1985> (DPSV)

               Water_Right_File <1986 – Current> (DPSV)
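
When the data are later transformed, each data record must be matched to its variation. The following is a minimal sketch, assuming each data record carries a usable date.

     from datetime import date

     # Version boundaries taken from the Water_Right_File example.
     versions = [
         (date(1970, 12, 31), "Water_Right_File <Pre-1971>"),
         (date(1985, 12, 31), "Water_Right_File <1971 - 1985>"),
         (date.max, "Water_Right_File <1986 - Current>"),
     ]

     def variation_for(record_date):
         """Assign a data record to its data product set variation by date."""
         for boundary, name in versions:
             if record_date <= boundary:
                 return name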

Minor changes in data definitions do not constitute a variation. The change must be a substantial change in the scope of the data. A data definition threshold is reached when that substantial change occurs and a new variation is identified. For example, the definition of housing might be changed to include temporary structures built by the homeless, or the definition of employee might be changed to include volunteers. Similarly, the data may be split, resulting in the creation of variations.

The Water_Right_File described above has two data record types. The breakdown of the data record types due to changes over time within the data record is shown below.

     Water Rights Database (DP)

          Water_Right_File (DPS)

               Control_Number <Pre-1979> (DPSV)

               Control_Number <1979 – 1982> (DPSV)

               Control_Number <1983 – Current> (DPSV)

               AA_Transaction <Pre-1976> (DPSV)

               AA_Transaction <1976 – 1993> (DPSV)

               AA_Transaction <1994 – Current> (DPSV)

Since only one level of variation is allowed, any multiple variations must be broken down to their basic components. For example, if both the Water_Right_File changes and the data record type changes occurred, the lowest level detail of the components would be documented.

Changes in primary and foreign keys are documented using the same version notation. For example, if the primary key for Education Course changed, both primary keys would be shown with their version notation. Changes in the foreign keys of a subordinate data file would be noted in a similar manner.

     Education Course

          Primary Key: <Pre-2005>

               Department. Code

               Course. Identifier

               Academic Term. Code

          Primary Key: <2005 – Current>

               Education Course. System Identifier

          Foreign Key: Department Department. Code

          Foreign Key: Academic Term Academic Term. Code

Any changes in data items would be documented in a similar way. For example, if a general comment field for vehicle collision was changed to a description of the injuries, data product unit variations would be created as shown below. The data definitions would describe the different contents of the two data product unit variations.

     Veh_Clsn (DPS)

          CMT (DPU)

               Comment <Pre-1998> (DPUV)

               Comment <1998 – Current> (DPUV)

Any changes in data codes would be documented in a similar way. For example, Management Level code E used to mean Executive, but now means Executive and Board Members.

          E     Executive (DPC)

               E     Executive <Pre-1988> (DPCV)

               E     Executive & Board <1988 – Current> (DPCV)

Changes may also occur during the data resource integration process. The data resource does not remain static during data resource integration because the business does not remain static. Ideally, the changes should be kept to a minimum. Data files may be added or removed, data items may be added or removed, data codes may be added or removed, and so on. However, changes cannot be prevented during data resource integration. All of these changes need to be tracked and eventually used to enhance the documentation.

Data Models

Many different data models exist for an organization’s data resource. These data models often have different names, definitions, primary and foreign keys, and data integrity rules; many overlap and conflict, resulting in disparate data models. Also, many generic and universal data models are used to start the initial common data architecture, as described in the next chapter. All of these data models need to be inventoried and documented.

Logical and physical data models, including generic data architectures and universal data models, are documented in the same way as described above for databases. The only difference is that the terms for logical data models are changed from data files, data items, and data codes to data entities, data attributes, and data reference items.

For example, multiple subject data files become multiple subject data entities. Multiple characteristic data items become multiple characteristic data attributes. Multiple property data codes become multiple property data reference items. The reader should begin to see the power of, and necessity for, a common data architecture.

Data models are documented as data products. Data entities and data files are documented as data product sets or data product set variations. Data attributes and data items are documented as data product units and data product unit variations. Data reference items and data codes are documented as data product codes or data product code variations.
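
For example, a hypothetical logical data model containing a Student data entity might be documented as shown below; the entity, attribute, and reference item names are illustrative only:

     Student Data Model (DP)

          Student (DPS)

               Student. Name (DPU)

               Student. Status Code (DPU)

                    A     Active (DPC)

                    I     Inactive (DPC)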

Application Programs

The data used and produced by application programs are inventoried and documented as described above for databases, except that the primary keys, foreign keys, and data relations are usually not known. More often, only the data that are input to or output by the application program are documented. However, working data within the application program may be inventoried and documented if those data provide insight into the organization’s disparate data resource.

The primary insight from application programs is the identification and documentation of data integrity rules. Data integrity rules pertaining to the domain of data values, data derivation, data retention, and so on, can only be identified by looking at application programs. The processing performed on the data may also reveal some of the meaning of the data. Definitions found in comments throughout the application program may be useful in understanding disparate data.

An information system is documented as a data product. Application programs or program modules are documented as data product sets or data product set variations. Data items are documented as data product units or data product unit variations. Data codes are documented as data product codes or data product code variations.
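
As a brief sketch, a hypothetical payroll application might be documented as shown below; the program and data item names are illustrative only:

     Payroll System (DP)

          Calc_Gross_Pay (DPS)

               Hours_Worked (DPU)

               Pay_Rate (DPU)

A data integrity rule discovered in the program, such as Hours_Worked being limited to the range 0 through 168, would be noted in the definition of the corresponding data product unit.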

Comments may be made in the data product set definition pertaining to the screens or reports produced by the application program. However, the documentation of screens and reports is separate from the documentation of the application program. Comments may also be made about the data files accessed or updated by the application program.

Purchased applications often contain far more data files than the organization uses. Only the data files used by the organization are documented, and only the data items used by the organization within those data files are documented. Inventorying the unused data files and data items, which often number in the thousands or tens of thousands, wastes resources and serves no useful purpose.

Screens – Reports – Forms

Screens, reports, and forms are documented based on the data hierarchies shown in Appendices B and C. A data structure is prepared for each screen, report, or form, including summary or aggregated data, and that structure is used as the base for documentation. Documenting screens, reports, and forms without a corresponding data structure is difficult, if not impossible.

Typically, only data items are documented for screens, reports, and forms, including the elemental data items and any summary or aggregated data items. Any data definitions that exist should be documented. Data definitions are usually non-existent or vague, but column and row headings, and use of the data can provide insight into the meaning of the data. No primary or foreign keys exist on screens, reports, and forms, and no integrity rules exist.

The information system containing the screen, report, or form is documented as a data product. The screen, report, or form itself is documented as a data product set. The data items on the screen, report, or form are documented as data product units. Data items are documented as described above.
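
For example, a hypothetical student enrollment screen might be documented as shown below; the system, screen, and data item names are illustrative only:

     Student Information System (DP)

          Student Enrollment Screen (DPS)

               Student Name (DPU)

               Enrollment Date (DPU)

               Total Credit Hours (DPU)

The Total Credit Hours data product unit represents a summary data item appearing on the screen.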

XML is inventoried and documented in the same way as a report. The information system containing the XML is documented as a data product. The XML document set is documented as a data product set or a data product set variation. The data items in the XML set are documented as data product units or data product unit variations. Any codes within the XML are documented as data product codes or data product code variations.

Generally, the structure of XML is not documented. It’s the XML data items that are of concern and need to be understood. However, the structure of the XML could be documented in the same way as a screen, report, or form so it can be better understood. Generally, no data integrity rules are documented for XML unless derivation rules appear in documentation about the XML.
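
As a minimal sketch, consider a hypothetical XML fragment for employee data:

     <Employee>

          <Name>...</Name>

          <HireDate>...</HireDate>

          <StatusCode>A</StatusCode>

     </Employee>

That fragment might be documented as shown below; the system and data item names are illustrative only:

     Human Resource System (DP)

          Employee (DPS)

               Name (DPU)

               HireDate (DPU)

               StatusCode (DPU)

                    A     Active (DPC)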

Supplemental Documentation

Formal and informal documentation includes data dictionaries, project descriptions, data entry instructions, instruction manuals, standards, legislation, external reporting requirements, informal notes, e-mails, meeting minutes, memos, and so on. These supplemental sources of insight can either be documented as separate data products, or they can be used to enhance existing documentation from the primary sources of insight.

For example, if a standard is produced as a data model, it could be documented as a separate data product, as described above for data models. Data dictionaries could either be documented as a separate data product, or could be used to supplement existing documentation. Meeting minutes, memos, e-mails, and so on, would likely be used to supplement existing documentation.

The decision of whether formal and informal documentation should be entered as separate data products or used to supplement existing documentation can be difficult. Generally, if the documentation can stand alone, or relates to multiple primary sources of insight, it is documented as a data product. If the documentation does not stand alone or does not relate to multiple primary sources of insight, it is used to supplement existing documentation.

The tacit knowledge of individuals is used to enhance existing documentation from the primary sources of insight.

Other Forms of Data

Summary data in fixed hierarchies are documented based on data hierarchies described in Appendix B. The overall structure, such as a report, is documented as a data product. Each data set in the hierarchy is documented as a data product set within the data product. Each data item in the hierarchy is documented as a data product unit within its respective data product set. Any data codes are documented as data product codes within their respective data item.

The relationships of the data sets in the data hierarchy are documented with the primary keys for each data set. The primary key consists of the unique identifier for the data set being documented and the primary keys of all parent data sets, as shown in Appendix C. For example, the primary key for a Section within the Organization might be:

     Section

          Primary Key:

               Department. Identifier

               Division. Identifier

               Section. Identifier
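
To illustrate the surrounding structure, a hypothetical organization summary report might be documented as shown below, with each data set carrying the primary keys of its parent data sets:

     Organization Summary Report (DP)

          Department (DPS)

               Primary Key:

                    Department. Identifier

          Division (DPS)

               Primary Key:

                    Department. Identifier

                    Division. Identifier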

Aggregated data in variable data hierarchies are documented as described in Appendix C. The overall structure, such as a report, is documented as a data product. Each data set in the hierarchy is documented as a data product set within the data product. Each data item in the hierarchy is documented as a data product unit within its respective data product set. Any data codes are documented as data product codes within their respective data item.

Unlike summary data in fixed hierarchies, the data items are named as manifestations of the data focus name, as described in Appendix C. For example, an aggregated data item might be named Timber Stand Analytics 6. Total Timber Volume.
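
Building on that example, a hypothetical timber report might be documented as shown below; the report and data set names are illustrative only:

     Timber Sale Report (DP)

          Timber Stand Analytics 6 (DPS)

               Timber Stand Analytics 6. Total Timber Volume (DPU)

               Timber Stand Analytics 6. Average Stand Age (DPU)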

Predictive data are documented in the same way as operational data, summary data, or aggregated data, depending on what the data represent. Generally, predictive data that are input to predictive analysis are documented in the same way as described for operational data. The data resulting from predictive analysis are documented in the same way as summary data and aggregated data.

Most spatial data are tabular data that are documented in the same way as operational data. The definition of the spatial data items representing the geography of the points, lines, and polygons can be documented, but the data values, data structure, and data integrity rules for these geographic data items are not documented.

Geographic information system databases are documented as data products. The data layers, such as timber, soil, roads, and so on, are documented as data product sets. The data items in each data layer are documented as data product units.
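
As a brief sketch, a hypothetical geographic information system database with a roads data layer might be documented as shown below; the layer and data item names are illustrative only:

     County GIS Database (DP)

          Roads Layer (DPS)

               Road Name (DPU)

               Surface Type Code (DPU)

                    G     Gravel (DPC)

                    P     Paved (DPC)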

Complex structured data are documented either in their complex structured form, or in a normally structured form resulting from breaking down the complex structure into simpler structures, as described in the last chapter. The complex structured form is documented as a data product where the definition contains the nature and contents of the complex structured data. Documentation of the simpler structures is done in the same way as described above for operational data or for screens, reports, and forms. Either approach is acceptable, but at the very least, complex structured data must be documented as a data product.

Non-electronic data should also be documented because they are part of the organization’s data resource. Non-electronic data are documented in the same way as their electronic component, as described above. The definitions should specify the form and location of the non-electronic data, and any other insight that is useful for locating and understanding the non-electronic data.

SUMMARY

The scope of the data inventory is prioritized based on business critical data and where the organization is feeling pain over disparate data. Waves of data inventory can be established to keep the data inventory process moving ahead. Once data inventory is started in an organization, it becomes self-perpetuating as more and more people want to get their data integrated.

The general sequence for a data inventory is from the primary sources of insight to the secondary sources of insight. The secondary sources usually supplement the insights gained from the primary sources. Either a horizontal or a vertical phased approach may be used with the data inventory.

Involvement in the data inventory usually starts with identification of a core team and ancillary team members. Other people are brought into the process as needed. Since data inventory is a discovery process, a rigid project schedule and precise time commitments should be avoided. A no-blame, no-whitewash approach should be established so that people feel free to provide candid insights about the disparate data.

Documentation of the data inventory is done according to the Data Product Model. The opt-for-detail principle ensures that as much detail as possible about the disparate data is captured and documented. The organization needs to determine where the data inventory documentation is stored, based on its particular operating environment. The data inventory documentation is retained after the data have been transformed to provide a history of the changes from disparate data to comparate data.

Supporting techniques include data scanning and data profiling, which can be automated or manual. Data scanning identifies the data in databases, application programs, screens, reports, and forms, and so on, but does not deal with data values. Data profiling documents the data values and makes predictions about the meaning of those data values. However, those predictions need to be reviewed and verified before they are accepted.

Databases are documented according to the Data Product Model, including data files, data records, data instances, data items, data codes, and sets of data codes. Changes over time are documented as variations. Data models are documented similar to databases, with a slight change in terms for logical data models. Application programs provide additional insight into the disparate data. Screens, reports, and forms show how the data are used or entered.

Supplemental insight can be either documented according to the Data Product Model, or as supplements to the existing documentation. Summary data, aggregated data, predictive data, complex structured data, and non-electronic data are all documented as part of the organization’s data resource.

A data inventory starts the process of understanding the disparate data and sets the stage for cross-referencing those data to a common data architecture for further understanding and for designating a preferred data architecture. Thoroughly inventorying and documenting the detail about disparate data results in accurate cross-referencing, a sound preferred data architecture, and a successful transformation to a comparate data resource.

QUESTIONS

The following questions are provided as a review of the data inventory process, and to stimulate thought about how to adequately inventory and manage changes to disparate data.

  1. Why are combined data broken down to their basic components during data inventory?
  2. Why should the data definitions contain as much detail as possible about disparate data?
  3. How are data relations documented?
  4. How are changes over time documented?
  5. What’s the difference between summary data and aggregated data?
  6. How are summary data named?
  7. How are aggregated data named?
  8. Why do non-electronic data need to be inventoried and documented?
  9. How are complex structured data inventoried and documented?
  10. What is the objective of inventorying existing disparate data?