Chapter 4

DISPARATE DATA VARIABILITY

All aspects of a disparate data resource are highly variable.

The naming, definition, structure, integrity, and documentation of disparate data are highly variable. The result of high variability in disparate data is a poor understanding about how well the data actually represent the business. That poor understanding increases the uncertainty about how the organization operates in the business world.

Thoroughly understanding disparate data is a primary objective of data resource integration. That thorough understanding begins with an understanding of the different types of variability that may be encountered in a disparate data resource. When the types of variability can be readily recognized, the stage is set for thoroughly understanding the disparate data and building a comparate data resource to adequately support an organization’s business information demand.

Chapter 4 describes the different types of variability that may be encountered in a disparate data resource. It does not include how to document that variability or how to resolve that variability. Those topics are described in the following chapters. The chapter does provide an overview of the wide range of variability that can be encountered in a disparate data resource.

CONCEPTS AND PRINCIPLES

The discussion of disparate data variability begins with an overview of the concepts and principles regarding the wide range of variability in a disparate data resource. The overview includes a definition of data resource variability, the acceptable and unacceptable levels of variability, the need to expect anything when understanding a disparate data resource, the use of a common data architecture as a reference, the need to thoroughly understand disparate data, and the types of data variability that can be expected.

Data Resource Variability

A disparate data resource, as defined earlier, is a data resource that is substantially composed of disparate data that are dis-integrated and not subject oriented. It is in a state of disarray, where the low quality does not, and cannot, adequately support an organization’s business information demand.

Variability is the quality, state, or degree of being variable or changeable; apt or liable to vary or change; changeable; inconsistent; characterized by variations; having much diversity; or not true to type. Variation is the act or process of varying; the state or fact of being varied; the extent to which a thing varies; or an instance of varying. Data variation is the variation in the data meaning, data structure, data integrity, data domain, data content and format, and so on.

Disparate data resource variability is a state where all aspects of a disparate data resource are inconsistent, characterized by data variations, and are not true to the concepts and principles of a comparate data resource. The data are highly variable in their names, definitions, structure, integrity, and documentation. The variability is pervasive throughout the disparate data resource.

The more disparate data that an organization has, the greater the data variability. The more geographically or functionally diverse an organization is, the greater the data variability. The longer an organization has been in business and the larger the organization, the greater the data variability. The more mergers and acquisitions an organization has, the greater the data variability.

Greater data variability makes the task of gaining control of the data and developing a comparate data resource more difficult. Greater data variability causes greater uncertainty about the disparate data. Most of the other problems managing a data resource are relatively minor compared to the problems associated with data variability and uncertainty.

The disparate data resource variability may be explicit or implicit. Explicit disparate data resource variability is the variability that can be readily seen or identified in the data names, definitions, structure, integrity, and documentation of a disparate data resource. Implicit disparate data resource variability is the variability that is not readily seen or identified in the data names, definitions, structure, integrity, and documentation of a disparate data resource. Implicit disparate data resource variability is either implied by existing documentation or exists in people’s minds.

Disparate data resource variability is like different languages or dialects of a language. The variability is different in each organization and needs to be thoroughly understood before the data can be integrated into a comparate data resource. Anyone who is seriously interested in integrating a data resource must be able to readily understand the language of disparate data variability.

The presumed data resource variability principle states that disparate data are highly variable in their names, definitions, structure, integrity, and documentation. Data resource variability should be considered as the norm in most public and private sector organizations. Seldom does one find an existing data resource that does not have some degree of variability.

Acceptable Variability

Acceptable variability is the situation where a normal range of variability is acceptable. Variability exists in all aspects of a business and a normal level of variability must be accepted to perform business successfully. Unacceptable variability is the situation where the variability exceeds the normal range and becomes unacceptable. Most organizations seek to resolve the unacceptable variability.

Acceptable data resource variability is the acceptable level of variability for an organization’s data resource. Acceptable data resource variability can be either temporal or cultural. Temporal variability is the normal change in the data resource due to changes in the business over time. Organizations add or drop lines of business, reorient their focus, establish new initiatives, and so on. The data resource must reflect these changes. Cultural variability is the normal differences due to culture, geography, politics, and so on, such as different names, addresses, monetary units, and so on. The data resource must reflect these cultural differences.

Unacceptable data resource variability is any temporal or cultural variability in the data resource that is beyond the acceptable level. Any data resource variability that is unacceptable and impacts the business must be resolved. A comparate data resource must be developed that fully supports the current and future business information demand.

The data resource variability principle states that every data resource has a level of variability that must be accepted and clarified, and that any variability above that acceptable level must be resolved. Data resource integration seeks to resolve the unacceptable data resource variability and to clarify the acceptable level of variability.

Expect Variability

Both acceptable and unacceptable data resource variability must be expected when integrating a disparate data resource. The expect anything principle states that when seeking to understand and resolve disparate data, anything should be expected. One should expect any situation, even if it seems irrational. One thing I learned in years of law enforcement was to expect the irrational, look for the irrational, learn to understand the irrational, and then deal with it appropriately. When I explain some of the situations I encounter with disparate data, people often respond with “That’s irrational.” I simply say “My point exactly.”

An old saying states that “Anything which is not expressly forbidden is guaranteed to occur.” That saying is quite true. However, with a disparate data resource, even things that are expressly forbidden do occur. A better statement, with respect to a disparate data resource, is “Anything that can occur, will occur.” Therefore, expect anything when attempting to understand and resolve a disparate data resource, no matter how irrational it might seem.

Common Data Architecture Reference

The Common Data Architecture will be used as the reference point for understanding and resolving disparate data. The terms data subject and data characteristic will be used rather than data entity and data attribute for two reasons. First, data entities and data attributes are used in data models, and those data models are often disparate. Anything that is itself disparate cannot be used to understand and resolve disparity. Second, the Common Data Architecture will be the base for designating preferred data to build a comparate data resource. Therefore, it is best to use the Common Data Architecture throughout the entire data resource integration process.

The common data architecture reference principle states that the thorough understanding and resolution of a disparate data resource, and the development of a comparate data resource, are done within the construct of a common data architecture. The Common Data Architecture is the common construct for understanding and resolving a disparate data resource and developing a comparate data resource that fully supports the business information demand.

Disparate Data Understanding

A disparate data resource needs to be thoroughly understood before any attempt can be made to resolve that disparity. The disparate data understanding principle states that all disparate data variability, including data names, definitions, structure, integrity, and existing documentation will be understood and formally documented at a detailed level within the context of a common data architecture. Any attempt to resolve data disparity and integrate a disparate data resource without thoroughly understanding the disparate data will likely end in failure or a result that is less than fully successful.

Data Variability Types

Even after many years of working with disparate data in a wide variety of public and private sector organizations, I’m still amazed at the number of ingenious ways that people can screw up the data resource. Just when I think I’ve seen it all, I run across yet another way that data are managed improperly. I’ll probably continue to encounter new ways of messing with the data as long as I’m involved in data resource management.

The five types of data variability correspond to the five components of a data architecture:  data names, data definitions, data structure, data integrity, and data documentation. Data name and definition variability pertain to the meaning of the data—the semantics—and will be presented together. Data structure variability pertains to the arrangement and relationships of the data—the structure. Data integrity variability pertains to the rules for maintaining the data resource—the quality. Data documentation variability pertains to the formal documentation of the semantics, structure, and quality of the data resource.

The following sections describe each type of data variability with examples. The basic types of data variability are described, but the manifestations of all the basic types are not described; that would take a book unto itself. A table is presented in the Summary that shows the possible combinations of the basic types of data variability. I’ve attempted to cover all of the different types of variability, but there may be types I’ve never seen.

Any person interested in understanding and resolving a disparate data resource must be aware of these types of data variability and be able to readily recognize them when attempting to thoroughly understand a disparate data resource.

DATA NAME AND DEFINITION VARIABILITY

Data resource integration includes the integration of the semantics, structure, quality, and documentation of the data resource. The semantic component includes data names and data definitions. The last chapter on Data Architecture Integration summarized the data name and data definition problems leading to informal data names and vague data definitions that could be expected with a disparate data resource. Those problems are the root cause of the variability in data names and data definitions.

Data name variability is the situation where data names are informal and have a wide range of variability that contributes little to understanding the data resource. The informal data names often detract from thoroughly understanding the data resource.

Data definition variability is the situation where data definitions are vague and have a wide range of variability that contributes little to understanding the data resource. The vague data definitions often detract from thoroughly understanding the data resource.

Informal data names and vague data definitions are prevalent throughout operational and evaluational data, structured and complex structured data, electronic and non-electronic data, logical and physical data models, and forms, screens, and reports. Anyone attempting to understand and resolve disparate data must learn to recognize the existence of informal data names and vague data definitions, and learn to develop formal data names and comprehensive data definitions to provide a strong semantic understanding about the data resource.

DATA STRUCTURE VARIABILITY

Disparate data do have a structure, but that structure is improper for a comparate data resource that fully meets the business information demand. The last chapter on Integrating the Data Resource summarized the data structure problems leading to improper data structures. The current section describes the data structure variability that exists in a disparate data resource. Knowing the different ways that disparate data are improperly structured helps develop the proper structure for a comparate data resource.

Each new data integration project brings new insights into how physical data can be designed without the benefit of formal logical and physical data modeling. Data files have been developed in every imaginable way, and even some unimaginable ways. Just when I think I’ve seen it all, I encounter yet another way that disparate data are structured.

The primary cause of improper data structures is a lack of proper data normalization and denormalization at all levels, including the data files, data occurrences, data instances, data items, and data codes. Ideally, all data go through a formal data design process that includes data normalization and data denormalization. However, most disparate data never went through formal data modeling, data normalization, or data denormalization. Any structure that does exist is usually a physical structure of the data files developed to meet a specific database or application need. The result is a disparate data resource that is substantially un-normalized.

Data structure variability is the variability that exists in the improper structure of data in a disparate data resource. Data structure variability can occur with data files, data records, data items, data codes, and data relations, and usually occurs with all five. One major task of data resource integration is identifying the data structure variability and resolving that data variability. Knowing the data structure variability makes it easier to perform data inventory, make the data cross-references, make preferred designations, and develop a comparate data resource.

The current section describes the prominent structural abnormalities that could exist in a disparate data resource. These structural abnormalities apply to all aspects of a data resource, including operational and evaluational data, structured and complex structured data, electronic and non-electronic data, logical and physical data models, and forms, screens, and reports. The basic types of abnormalities and brief examples are provided.

Data File Variability

Data file variability is the variability that exists within and across data files in a disparate data resource. Data file variability can exist at the data file level, the data record level, and the data instance level. Each of these types of data file variability is described below, followed by a description of the data redundancy that results from data file variability.

Data Files

A data file, as defined earlier, is a physical file of data that exists in a database management system, such as a computer file, or outside a database management system, such as a manual file.

A data subject, as defined earlier, is a person, place, thing, concept, or event that is of interest to the organization and about which data are captured and maintained in the organization’s data resource.

Ideally, a data file represents a data subject that has gone through formal data normalization and denormalization. However, most data files containing disparate data never went through that process, as described above. They were usually built to meet a specific database or application need. In many situations, the data from one data subject is split across many data files, and multiple data subjects are often combined into one data file.

A disparate data file is a data file that did not go through formal data normalization and data denormalization, and does not represent a single, complete data subject, or related data subjects resulting from formal data denormalization. Disparate data files often represent multiple data subjects, partial data subjects, or a combination of multiple and partial data subjects.

For example, a training data file contains data about employees, training courses, training classes, and training facilities. However, all of the employee data may be spread across multiple data files for training, payroll, affirmative action, project involvement, and so on.

Similarly, an equipment data file contains data about equipment, equipment breakdowns, equipment maintenance, and mechanics working on the equipment. Other data files may contain additional data about where the equipment was purchased, equipment maintenance schedules, mechanic certifications, and so on.

Disparate data files can contain single data subjects or multiple data subjects. A single subject data file is a data file that contains all of the data items, or a subset of the data items, representing the data characteristics for a single data subject. A multiple subject data file is a data file that contains all of the data items, or a subset of the data items, representing the data characteristics for multiple data subjects.

Disparate data files can contain complete data subjects or partial data subjects. A complete subject data file is a data file that contains all of the data items representing all of the data characteristics for a single data subject or for multiple data subjects. A partial subject data file is a data file that contains a subset of the data items representing the data characteristics for a single data subject or for multiple data subjects.

These categories of disparate data files can be combined into four basic data file types that could exist in a disparate data resource.

A single complete subject data file is relatively rare because it is very difficult to determine if all of the data characteristics have even been identified for a data subject. For example, a timber stand data file is reviewed and appears to contain all of the data items representing timber stands, such as species, slope, aspect, form class, and so on, and may be considered to represent a complete data subject for timber stands. However, data files may be encountered later that contain additional data items representing timber stands.

A multiple complete subject data file is also relatively rare for the same reason. For example, a class data file contains data items representing data characteristics for the student, for the class, for the course, and for the school data subjects. However, the data file does not contain all of the data items for a student, class, course, or school. Additional data items for those data subjects are contained in other data files.

A single partial subject data file is relatively common. For example, stream segment data, such as volume, flow rate, length, width, depth, gradient, sediment load, and so on, are often spread across multiple data files. However, each of those data files contains only data items about stream segments.

A multiple partial subject data file is very common. For example, a data file contains data items representing the data characteristics for a building data subject, such as structure, construction material, rooms, and so on, and for the land parcel where the building is located, such as size, slope, aspect, and so on. However, it does not contain all of the data items for a building or land parcel.
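The four basic data file types can be sketched as a small classification over a data inventory. The sketch below is illustrative only: the `SubjectCoverage` structure, the `complete` flag, and the sample files are assumptions rather than part of any formal method, and "complete" can only mean complete with respect to the data characteristics identified so far.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubjectCoverage:
    """Hypothetical inventory entry: one data subject found in a data file."""
    subject: str
    complete: bool  # True if all known data items for the subject are present


def classify_data_file(coverage: list[SubjectCoverage]) -> str:
    """Return one of the four basic data file types."""
    if not coverage:
        raise ValueError("a data file must represent at least one data subject")
    count = "single" if len(coverage) == 1 else "multiple"
    # A data file is complete only if every data subject it holds is complete.
    extent = "complete" if all(c.complete for c in coverage) else "partial"
    return f"{count} {extent} subject data file"


# Illustrative data files modeled on the examples in the text.
stream_file = [SubjectCoverage("Stream Segment", complete=False)]
parcel_file = [
    SubjectCoverage("Building", complete=False),
    SubjectCoverage("Land Parcel", complete=False),
]

print(classify_data_file(stream_file))
print(classify_data_file(parcel_file))
```

Because later data files may reveal additional data items, a classification of "complete" should be treated as provisional and revisited as the inventory grows.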

Partial subject data files, whether single subject or multiple subject, usually mean that the data characteristics for a single data subject are scattered across many different data files. These scattered data characteristics are often redundant and out of synch with each other, which contributes to the variability found in a disparate data resource.

Similarly, complete subject data files do not guarantee that redundant data characteristics don’t exist for that data subject in other data files. When a complete subject data file is encountered, whether single or multiple, one should not assume that other data files containing redundant data items don’t exist.

Looking at the data file disparity the other way around shows single file data subjects and multiple file data subjects. A single file data subject is a complete data subject that is contained in a single data file. The situation almost never exists in a disparate data resource. A multiple file data subject is a data subject that exists in multiple data files. The situation is common in a disparate data resource.
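Looking at the disparity "the other way around" amounts to inverting an inventory of data files into an inventory of data subjects. The following is a minimal sketch; the file names and their contents are hypothetical, loosely based on the training example above.

```python
from collections import defaultdict

# Hypothetical inventory: data file name -> data subjects found in that file.
files_to_subjects = {
    "training": ["Employee", "Training Course", "Training Class"],
    "payroll": ["Employee"],
    "facility": ["Training Facility"],
}

# Invert the inventory: data subject -> data files that contain it.
subject_to_files = defaultdict(list)
for file_name, subjects in files_to_subjects.items():
    for subject in subjects:
        subject_to_files[subject].append(file_name)

# Data subjects found in more than one data file are multiple file data subjects.
multiple_file_subjects = sorted(
    subject for subject, files in subject_to_files.items() if len(files) > 1
)
print(multiple_file_subjects)
```

A data subject that appears in only one data file is merely a candidate single file data subject; it qualifies only if that data file also holds the complete data subject.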

Data Records

A data record, as defined earlier, is a physical grouping of data items that are stored in or retrieved from a data file. It is the basic component of a data file. A data occurrence, as defined earlier, is a logical record that represents the existence of a business object or the happening of a business event in the business world.

Ideally, a data record represents a single data occurrence for a single data subject. For example, data occurrences in the Employee data subject for John J. Jones, Sally S. Smith, Marilyn M. McDonald, and so on, have corresponding data records in the Employee data file. However, data records in a disparate data resource do not always represent a single data occurrence for a single data subject. They may represent a partial data occurrence, a single data occurrence, or multiple data occurrences from the same data subject or from a subordinate data subject.

A disparate data record is a data record that did not go through formal data normalization and denormalization, and does not represent a single data occurrence, or multiple data occurrences resulting from formal data denormalization. Disparate data records often represent partial data occurrences or multiple data occurrences.

A complete occurrence data record is a data record that contains all of the data items for a data occurrence. A partial occurrence data record is a data record that contains only part of the data items for a data occurrence. The complete data occurrence is split across multiple data records, usually due to some length limitation.

The practice of splitting data occurrences across multiple data records was common with punched cards, where the record length was limited to 80 columns, and continued into formal database management systems. A complete data occurrence often needed two or more punched cards to store all the data. For example, an employee’s personal data might be placed in one record type and professional data in another record type. A partial occurrence data record is relatively common, particularly in an older disparate data resource.

The data record types may be explicitly identified with a name or number, such as 1, 2, 3, and so on. However, in some situations, the data record type may be implicitly identified by its sequence in the data file. The latter situation was common with punched card data files, where room did not exist on the punched card for a record identifier, so the sequence implied the data record type. One of the major problems that was encountered in EAM days was someone dropping a tray of punched cards that had no record type identification.
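Understanding partial occurrence data records usually involves reassembling the complete data occurrence from its record types. The sketch below assumes explicitly identified record types (1 for personal data, 2 for professional data) and a shared employee identifier; the layout, field names, and values are all hypothetical.

```python
from collections import defaultdict

# Hypothetical partial occurrence data records with an explicit record type.
records = [
    {"employee_id": "E100", "record_type": 1, "name": "John J. Jones"},
    {"employee_id": "E100", "record_type": 2, "job_title": "Analyst"},
    {"employee_id": "E200", "record_type": 1, "name": "Sally S. Smith"},
]
EXPECTED_TYPES = {1, 2}  # assumed set of record types per data occurrence


def assemble_occurrences(records):
    """Merge partial records by identifier and report missing record types."""
    occurrences = defaultdict(dict)
    seen_types = defaultdict(set)
    for rec in records:
        key = rec["employee_id"]
        seen_types[key].add(rec["record_type"])
        for item, value in rec.items():
            if item != "record_type":
                occurrences[key][item] = value
    incomplete = {
        key: EXPECTED_TYPES - types
        for key, types in seen_types.items()
        if types != EXPECTED_TYPES
    }
    return dict(occurrences), incomplete


occurrences, incomplete = assemble_occurrences(records)
print(occurrences["E100"])
print(incomplete)
```

When the record type is only implied by sequence, no field like `record_type` exists to drive the merge, which is why a reordered file (the dropped tray of cards) is so difficult to recover.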

A single occurrence data record is a data record that represents a single data occurrence. For example, all of the data about a stream segment is contained in one data record, or all of the data about a timber stand is contained in one data record. A single occurrence data record is relatively common in a disparate data resource.

A multiple occurrence data record is a data record that represents multiple data occurrences in a single data record. A multiple occurrence data record may contain subordinate data occurrences or parallel data occurrences.

A subordinate data occurrence is a data occurrence from a data subject that is subordinate to the data subject represented by the data file. For example, a single data record contains the data occurrence for the annual profits of the organization, but also contains the quarterly profits for that year. The quarterly profits represent a subordinate data subject to the annual profits. A subordinate data occurrence is relatively common and could be the result of formal data denormalization.

A parallel data occurrence is a data occurrence from the same data subject represented by the data file. For example, several short data occurrences for periodic stream flow data might be placed in the same data record. A parallel data occurrence is relatively rare, but does exist with some punched card data files and may have migrated to database management systems.
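A multiple occurrence data record with subordinate data occurrences can be pictured by separating it into its parent and subordinate data occurrences. The following sketch uses the annual and quarterly profit example; the field names and amounts are hypothetical.

```python
# Hypothetical multiple occurrence data record: one annual profit occurrence
# plus four subordinate quarterly profit occurrences in the same record.
record = {
    "year": 2009,
    "annual_profit": 100,
    "q1_profit": 20,
    "q2_profit": 30,
    "q3_profit": 25,
    "q4_profit": 25,
}


def split_subordinate(record):
    """Separate the parent data occurrence from its subordinate occurrences."""
    annual = {"year": record["year"], "annual_profit": record["annual_profit"]}
    quarterly = [
        {"year": record["year"], "quarter": q, "profit": record[f"q{q}_profit"]}
        for q in range(1, 5)
    ]
    return annual, quarterly


annual, quarterly = split_subordinate(record)
print(annual)
for occurrence in quarterly:
    print(occurrence)
```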

Data Instances

A data instance, as defined earlier, is a specific set of data values for the characteristics in a data occurrence that are valid at a point in time or for a period of time. Many data instances can exist for each data occurrence, particularly when historical data are maintained.

A current data instance is the most recent data instance that represents the most recent values of the data items in the data occurrence. An historical data instance is any data instance, other than the current data instance, that represents previous data values of the data items in the data occurrence.

A complete historical data instance contains a complete set of data items in the data occurrence, whether or not the data values changed. A partial historical data instance contains a subset of data items in the data occurrence, usually the data items whose data values changed and appropriate identifiers. Both complete and partial historical data instances exist in a disparate data resource.

Note that the complete and partial designations for data instances apply only to the set of data items contained in the data record, not to the full set of data items that may exist for a data subject across multiple data files. For example, if a data file is a partial subject data file and all of those data items are stored as a historical data instance, then the historical data instance is a complete data instance. However, if only the data items whose data value changed are stored, then the historical data instance is a partial data instance.
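The difference between complete and partial historical data instances can be illustrated with a small sketch. It assumes a data record held as a dictionary and treats `employee_id` and `effective_date` as the "appropriate identifiers"; these names and values are hypothetical.

```python
def complete_historical_instance(previous):
    """Keep every data item in the record, whether or not its value changed."""
    return dict(previous)


def partial_historical_instance(previous, current, identifiers):
    """Keep the identifiers plus only the data items whose values changed."""
    changed = {
        item: value
        for item, value in previous.items()
        if current.get(item) != value
    }
    return {**{item: previous[item] for item in identifiers}, **changed}


# Hypothetical previous and current data instances for one data occurrence.
previous = {
    "employee_id": "E100",
    "effective_date": "1985-09-01",
    "name": "John J. Jones",
    "job_title": "Analyst",
    "salary_grade": 7,
}
current = {
    "employee_id": "E100",
    "effective_date": "1990-01-01",
    "name": "John J. Jones",
    "job_title": "Senior Analyst",
    "salary_grade": 8,
}

complete = complete_historical_instance(previous)
partial = partial_historical_instance(
    previous, current, identifiers=("employee_id", "effective_date")
)
print(complete)
print(partial)
```

Here the unchanged `name` value is carried only by the complete historical data instance, which is the kind of repeated value that accumulates when complete instances are retained.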

Self-contained historical data is the situation where historical data instances are retained in the same data file along with the current data instance. Separate historical data is the situation where historical data instances are retained in a separate data file. Both self-contained and separate historical data exist in a disparate data resource.

Historical data instances can be extremely variable in a disparate data resource. Looking at all the possible situations described above with data files, data records, and data instances, it should be quite obvious that the retention of historical data instances can lead to very disparate data. Disparate data instances is the situation where the retention of historical data instances across disparate data files and disparate data records can easily result in large quantities of disparate data.

Data Redundancy

Redundant means exceeding what is necessary or normal; superfluous; characterized by or containing an excess; characterized by similarity or repetition; profuse; or lavish. Redundant data are inconsistently maintained on different data sites, by different methods, and are seldom kept in synch. Data redundancy is the unknown and unmanaged duplication of business facts in a disparate data resource. It’s the same facts, for the same data occurrence, for the same time period. It’s the situation where a single business fact is stored in more than one location, and the locations may not be in synch. It’s the unnecessary duplication of data that is a major contributor to data disparity.

Replication is a copy or reproduction; the action or process of replicating or reproducing; or creating a replica. Data replication is the consistent copying of data from one primary data site to one or more secondary data sites. The copied data are kept in synch with the primary data on a regular basis. Data replication is usually performed for operational efficiency.

A strong distinction is made between data redundancy and data replication during data resource integration. Data redundancy is unnecessary and confusing, and needs to be resolved during data resource integration. Data replication is necessary for operational efficiency and the data are kept in synch with the source. The replicated data certainly have redundancy, since they are copied from a primary data source that has redundancy. However, the emphasis is on resolving the data redundancy in the primary data source. When that data redundancy is resolved, the redundancy of the replicated data will be resolved.

A distinction also needs to be made about data redundancy. Data items with the same name in different data files do not always mean that data redundancy exists. For example, two student data files with the same set of data items may actually contain different data occurrences. One data file may be for middle school students and the other data file may be for high school students. Clearly, these are not redundant data. Another example is two data files for students, where one is high school students and the other is middle school and high school students. One of the sets of high school students is redundant. A third example is two data files for middle school students, but on close examination, one is for students from 1971 through 1990, and the other is for students from 1991 through 2010. These data are not redundant. Data redundancy only occurs with the same data characteristics for the same data occurrence for the same time frame.

Redundant data items is the situation where a data item representing the same data characteristic exists in different data files or different data records, whether that data item has the same data name or a different data name. Redundant data items do not necessarily mean that redundant data exist.
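The distinction between redundant data items and redundant data can be sketched as a comparison of data occurrences and time frames. The example assumes both student data files hold the same data characteristics and that a student identifier plus a school year identifies an occurrence for a time frame; the field names and values are hypothetical.

```python
def redundant_occurrences(file_a, file_b):
    """Return (student, year) keys present in both files.

    Assumes both files carry the same data characteristics, so a shared key
    means the same facts for the same occurrence and the same time frame.
    """
    keys_a = {(r["student_id"], r["school_year"]) for r in file_a}
    keys_b = {(r["student_id"], r["school_year"]) for r in file_b}
    return keys_a & keys_b


# Hypothetical student data files with identical data items.
middle_school = [
    {"student_id": "S1", "school_year": 2005},
    {"student_id": "S2", "school_year": 2005},
]
high_school = [{"student_id": "S3", "school_year": 2005}]
middle_and_high = middle_school + high_school

# Same data items, different data occurrences: no redundant data.
print(redundant_occurrences(middle_school, high_school))
# The high school students appear in both files: redundant data.
print(redundant_occurrences(high_school, middle_and_high))
```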

Disparate data files, disparate data records, and disparate data instances create two levels of data redundancy. The first level of data redundancy is created when disparate data files and disparate data records contain redundant data. The data redundancy can be quite large, particularly in organizations that have been in business for many years and have a large data resource.

The second level of data redundancy is created when disparate data instances contain redundant data. The data redundancy greatly magnifies the data redundancy created in the first level of data redundancy, leading to massive quantities of redundant data.

Four situations lead to the second level of data redundancy. The first situation is when complete historical data instances are maintained. Complete historical data instances make it easy to extract data for evaluational processing because the complete records are available and don’t need to be rebuilt. However, massive data redundancy is created by saving data values that did not change.

The second situation is when data items representing the same data characteristic exist in multiple data files, and each of those data files maintains complete data instances. Massive data redundancy can be created.

The third situation is when partial historical data instances are maintained, but they contain more than the data values that changed and the necessary identifiers. The data values retained that did not change contribute to the data redundancy.

The fourth situation is when redundant data exist and the data history is out of synch. One data file received the change and created a historical data instance, yet another data file did not receive the change and did not create a historical data instance. The same thing happens when the data files received the change at different times and created historical data instances with different times. The same thing also happens when one data file received the change and created a historical data instance, while the other data file received the change but did not create a historical data instance.
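The first and third situations differ only in how many unchanged data values a historical data instance carries. A minimal Python sketch, assuming a data instance can be represented as a dictionary of data item names and values (the item names and function names are hypothetical), shows the smallest possible historical data instance and counts the values a complete instance would repeat.

```python
def minimal_history_instance(previous: dict, current: dict, identifiers: set) -> dict:
    """Keep only the identifiers and the data values that changed."""
    return {
        item: value
        for item, value in current.items()
        if item in identifiers or previous.get(item) != value
    }


def redundant_value_count(previous: dict, current: dict, identifiers: set) -> int:
    """Count the unchanged, non-identifier values a complete instance would repeat."""
    return sum(
        1
        for item, value in current.items()
        if item not in identifiers and previous.get(item) == value
    )


before = {"student_id": 7, "name": "Sue Wilson", "grade": 9, "school": "North"}
after = {"student_id": 7, "name": "Sue Wilson", "grade": 10, "school": "North"}

print(minimal_history_instance(before, after, {"student_id"}))
print(redundant_value_count(before, after, {"student_id"}))
```

Only the grade changed here, so a complete historical data instance would store the name and the school a second time.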

Data redundancy is a major problem in a disparate data resource and requires considerable effort to understand and resolve.

Data Item Variability

A data item, as defined earlier, is an individual field in a data record. It represents a data attribute, subject to adjustments made during formal data denormalization. Ideally, a data item represents an elemental or combined data characteristic. However, data items in disparate data files do not always represent an elemental or combined data characteristic.

A data characteristic, as defined earlier, is an individual fact that describes or characterizes a data subject. It represents a business feature and contains a single fact, or closely related facts, about a data subject.

Data item variability is the variability in the format or content of data items representing the same business fact. It’s a measure of how many different formats or contents exist for a particular data item across data files, and on screens, reports, and forms. However, many disparate data items represent multiple data characteristics that may or may not be closely related, partial data characteristics, or complex data characteristics. Each of these situations is described below.

An elemental data characteristic is a single elemental fact that cannot be further divided and retain its meaning, such as a month number or a day number within a month.

A combined data characteristic is the combination of two or more closely related elemental data characteristics into a group that is managed as a single unit. Note that the elemental data characteristics must be closely related. For example, the elemental data characteristics for century, year, month, and day are closely related and may be combined into a data characteristic for date. Similarly, the data characteristics for a person’s individual name, middle name, and family name are closely related and may be combined into a data characteristic for the person’s complete name.

A multiple data characteristic is two or more single or combined data characteristics that are not closely related and should not be stored together or managed as a single unit. The data characteristics may be from the same data subject or from different data subjects. For example, the combination of a project name and project initiation date would be a multiple data characteristic.

A disparate data item is a data item that contains other than an elemental or combined data characteristic. Disparate data items may contain multiple data characteristics, partial data characteristics, or complex data characteristics.

Single Characteristic Data Items

A single characteristic data item is a data item that contains only one elemental or combined data characteristic. For example, vegetation scientific name is an elemental data characteristic, and a project leader’s name is a combined data characteristic.

A consistent characteristic data item is a data item that always contains an elemental or combined data characteristic. For example, a data item consistently contains a vehicle’s model name, or a data item always contains a driver’s name.

A variable characteristic data item is a data item that could contain several different data characteristics, but only one of those data characteristics appears in any data record. In other words, the data characteristics are mutually exclusive. The data characteristic that is contained in the data item is usually determined by the data value in another data item or by the data value itself. For example, a land parcel owner’s birth date data item could contain the actual birth date or the reason for no birth date.

A data item format is the physical format of the data value contained in the data item. A fixed format data item is a data item whose data value is always in the same format. For example, a student’s name is always in the normal sequence and right justified, or an applicant’s name is always in the inverted sequence and left justified. A variable format data item is a data item whose data value could be one of a variety of different formats. For example, a shipping date could be in any of a wide variety of date formats, such as MDY, M/D/Y, YMD, Y/M/D, and so on.

Variable format data items may have an identifier for the format, although the practice is not common. For example, variable format codes might be D1, D2, D3, and so on, for MDY, M/D/Y, YMD, and so on, and could appear before the data item or at the beginning of the data record.
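When a format identifier is present, it can drive the interpretation of the data value. The Python sketch below assumes the identifier occupies the first two characters of the data item, that months and days are zero padded, and that years have four digits; the text specifies none of these details, so they are illustrative assumptions only.

```python
from datetime import date, datetime

# Hypothetical mapping of format identifiers to strptime patterns.
# The text gives D1, D2, D3 for MDY, M/D/Y, YMD; four-digit years are assumed.
FORMAT_PATTERNS = {
    "D1": "%m%d%Y",
    "D2": "%m/%d/%Y",
    "D3": "%Y%m%d",
}


def read_dated_item(raw: str) -> date:
    """Split off the two-character format identifier and parse the rest."""
    format_code, value = raw[:2], raw[2:]
    pattern = FORMAT_PATTERNS[format_code]  # KeyError signals an unknown identifier
    return datetime.strptime(value, pattern).date()


# Three different physical formats, one shipping date.
for raw in ("D103121978", "D203/12/1978", "D319780312"):
    print(raw, "->", read_dated_item(raw).isoformat())
```

Without an identifier, a value such as 030478 remains ambiguous, which is why undocumented variable format data items are so hard to understand.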

Data item content is the physical variation in the data values contained in a data item. For example, a data item may represent a street segment length. However, the data value may be in feet, meters, miles, and so on.

Data item length is the physical length of the data value contained in the data item. A fixed length data item is a data item whose length is fixed. For example, a vehicle manufacturer’s name is always 50 characters long. A variable length data item is a data item whose length is variable. For example, the length of a textual description or accident explanation might vary.

The length of a variable length data item is usually determined by a length value or by a delimiter. The length value typically precedes the data value, as shown below.

12John R Smith

However, the length value may be stored in a variety of other locations, such as all of the data item lengths at the beginning of a data record.
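A reader for length-prefixed data items can be sketched as follows. The sketch assumes a two-digit length value that immediately precedes each data value, as in the example above; a real data file may use a different width or keep the lengths elsewhere in the data record.

```python
def read_length_prefixed(record: str, prefix_width: int = 2) -> list:
    """Read successive data items that are each preceded by a length value."""
    items = []
    position = 0
    while position < len(record):
        # The length value says how many characters the data value occupies.
        length = int(record[position:position + prefix_width])
        position += prefix_width
        items.append(record[position:position + length])
        position += length
    return items


print(read_length_prefixed("12John R Smith"))
print(read_length_prefixed("12John R Smith11Apartment 6"))
```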

A delimiter is a special character, such as a comma, ampersand, or colon, that appears after each data item. The example below shows a comma as the data item delimiter.

John R Smith,12345 Jackson Highway S,Apartment 6,
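Because the delimiter appears after each data item, including the last, a naive split leaves an empty trailing value. A small Python sketch (the function name is hypothetical) removes the final delimiter before splitting. It assumes the delimiter never occurs inside a data value, which is not guaranteed in a disparate data resource.

```python
def read_delimited(record: str, delimiter: str = ",") -> list:
    """Split a data record whose data items are each followed by a delimiter."""
    if record.endswith(delimiter):
        # Drop the delimiter that follows the last data item.
        record = record[:-len(delimiter)]
    return record.split(delimiter)


print(read_delimited("John R Smith,12345 Jackson Highway S,Apartment 6,"))
```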

Variable Sequence Data Items

Variable sequence data item is the situation where the data items can be in any sequence in a data record. The specific data item is identified by a keyword or mnemonic, followed by the data value. The example below shows mnemonics for tree measurements where SN is the species name, DM is the diameter, HT is the height, and AG is the age. The example shows a variable length with delimiters. However, the length could be fixed.

SNDouglas Fir,DM14.2,HT125,AG50
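Reading such a data record amounts to looking up each mnemonic. The following Python sketch assumes two-character mnemonics and comma delimiters, as in the example above; the function and dictionary names are hypothetical.

```python
# Mnemonics from the tree measurement example.
MNEMONICS = {
    "SN": "species name",
    "DM": "diameter",
    "HT": "height",
    "AG": "age",
}


def read_variable_sequence(record: str) -> dict:
    """Return the data values keyed by data item name, whatever their sequence."""
    values = {}
    for item in record.split(","):
        mnemonic, value = item[:2], item[2:]
        values[MNEMONICS[mnemonic]] = value  # KeyError signals an unknown mnemonic
    return values


print(read_variable_sequence("SNDouglas Fir,DM14.2,HT125,AG50"))
# A different sequence of the same data items yields the same result.
print(read_variable_sequence("AG50,HT125,SNDouglas Fir,DM14.2"))
```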

Multiple Characteristic Data Items

A multiple characteristic data item is a data item that contains more than one data characteristic. For example, a data item contains both a plant’s common name and its scientific name, such as Red Alder,Alnus Rubra.

Like single characteristic data items, a multiple characteristic data item can contain consistent or variable data characteristics with a fixed or variable format and a fixed or variable length. The variable lengths of multiple characteristics within a data item are delineated in much the same way as the variable lengths of single characteristic data items are delineated in a data record.

Partial Characteristic Data Items

A partial characteristic data item is a data item that contains part of a data characteristic. Other parts of the data characteristic are contained in one or more other data items. For example, an accident description may exceed the allowed length of a textual data item, resulting in the description running to multiple data items. The sequence of the multiple data items may be indicated by a number, or may be implied by the sequence of the data items in the data record.

A partial characteristic data item is usually a consistent characteristic with a fixed format, although it may have a fixed or variable length. However, since anything is possible in a disparate data resource, a partial characteristic data item could have a variable composition and a variable format.

More on Delimiters

Multiple data item length delimiters may appear in rare situations. For example, a comma could be used as a length delimiter for data items in a string of data items, such as between each project team member in a string of team member names, as shown below.

John Jones,Jack Smith,Sue Wilson,Bill Arnold

In addition, an ampersand could be used between groups of related data items, such as between a string of project team member names and a string of project team members’ responsibility on the team, as shown below. Note that the team member responsibilities belong to a different data subject than the project team member names.

John,Jack,Sue,Bill&ProjectLead,Secretary,Analyst,Analyst

The groups of related data items could belong to the same data subject, such as the project team member names and their birth dates, as shown below.

John,Jack,Sue,Bill&3/12/78,4/16/82,9/9/88,12/1/72

The team responsibility could be implied by the position of the team member name. In the example below, the first position is project lead, the second position is secretary, the third position is analyst, and the fourth position is designer.

John,Jack,Sue,Bill

Successive delimiters would indicate a missing position on the team. In the example below, the secretary and analyst are missing from the team, as shown by the successive delimiters.

Jack,,,Jason
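The delimiter situations above can be untangled in two steps, sketched here in Python. The sketch assumes the ampersand separates groups, the comma separates data items within a group, and the four team positions are those named in the positional example; the function names are hypothetical.

```python
# Team positions implied by the sequence of names, from the example above.
POSITIONS = ["project lead", "secretary", "analyst", "designer"]


def split_groups(record: str) -> list:
    """Split on the group delimiter first, then on the data item delimiter."""
    return [group.split(",") for group in record.split("&")]


def team_by_position(record: str) -> dict:
    """Assign each name to its implied position; an empty value means unfilled."""
    return {
        position: (name or None)
        for position, name in zip(POSITIONS, record.split(","))
    }


names, responsibilities = split_groups(
    "John,Jack,Sue,Bill&ProjectLead,Secretary,Analyst,Analyst"
)
print(list(zip(names, responsibilities)))
print(team_by_position("Jack,,,Jason"))
```

The successive delimiters must be preserved while splitting. Discarding empty values would shift Jason into the secretary position.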

Any person interested in understanding and resolving disparate data must be aware of these possible situations with data items and be able to readily identify them.

Data Code Variability

Data code variability is the variability in the coded data values, names, definitions, and domain of codes in a set of data codes. It’s a measure of how many variations exist for a particular set of data codes across data files. Ideally, a data code represents a single property of a data subject, and a data code set represents a single data subject. However, many variations of the ideal occur in a disparate data resource.

Data code variability is one of the most confusing things about disparate data, and one of the most difficult to understand and resolve. Data codes can have the same name with different codes, the same codes with different names, the same definitions with different names and/or codes, the same meaning with different definitions, the same codes and names with different meanings, and so on. People have found many ways to invent data code structures that are very difficult to use and to understand.

A data property is a single feature, trait, or quality within a grouping or classification of features, traits, or qualities belonging to a data characteristic. For example, gender has data properties for male, female, and unknown. Management level has data properties for executive, manager, supervisor, and lead worker.

A data code is any data item whose data value has been encoded or shortened in some manner. For example, the gender data properties might be coded as M, F, and U, and the management level data properties might be encoded as E, M, S, and L. A data code is also known as a coded data value.

A data code set is a complete group of data codes that represent all of the data properties for a single data subject. For example, data code sets are defined for the data properties of gender or management level, which are data subjects.

A set of data codes is a subset of data codes representing only part of the data properties for a complete data code set, or a mixture of properties from different data code sets. For example, a set of data codes for management level might include only E for Executive and M for Manager, or a set of data codes might include a mixture of hair color and eye color.

Data Code Properties

Ideally, each data code represents a single data property. However, data codes may be multiple property or partial property. Each of these situations is described below.

A single property data code is a data code that represents one specific data property of a single data subject. For example, Br represents brown hair and Bl represents blond hair in the data subject for hair color. Another example is management level codes E for Executive, M for Manager, S for Supervisor, and L for Lead Worker. Single property data codes are very common in a disparate data resource.

A multiple property data code is a data code that represents two or more data properties of the same data subject. For example, 1 represents blond and gray hair, 2 represents black and brown hair, and so on. Another example is management level codes E for executive and manager, S for supervisor, and L for lead worker. The executive and manager data properties have been combined into one data code. Multiple property data codes are relatively common in a disparate data resource.

Note that a complex property data code doesn’t exist. A data code represents either a single data property, multiple data properties, or a partial data property. A data code cannot represent any combination of a single data property, multiple data properties, or partial data properties.

Data Code Subjects

The examples above were data codes that represented data properties for a single data subject, such as hair color, gender, management level, and so on. Data codes can also represent multiple data subjects.

A single subject data code is a data code that represents a single data subject, such as the ones shown above for gender, management level, and hair color. Single subject data codes are very common in a disparate data resource.

A multiple subject data code is a data code that represents two or more different data subjects. For example, gender, hair color, and eye color might be combined so that 1 is male, blond hair, blue eyes; 2 is female, blond hair, blue eyes; 3 is male, brown hair, blue eyes; and so on. Multiple subject data codes are relatively common in a disparate data resource.
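Understanding a multiple subject data code means separating it into one data property per data subject. A minimal Python sketch, using only the three code values given in the example (the table and function names are hypothetical):

```python
# The three multiple subject data codes from the example.
MULTIPLE_SUBJECT_CODES = {
    1: ("male", "blond", "blue"),
    2: ("female", "blond", "blue"),
    3: ("male", "brown", "blue"),
}


def split_subjects(code: int) -> dict:
    """Separate one multiple subject data code into single subject properties."""
    gender, hair_color, eye_color = MULTIPLE_SUBJECT_CODES[code]
    return {"gender": gender, "hair color": hair_color, "eye color": eye_color}


print(split_subjects(3))
```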

Note that complex subject data codes don’t exist. A data code represents either a single data subject or multiple data subjects. A data code can’t represent both a single data subject and a multiple data subject. Also, partial subject data codes don’t exist.

Sets of Data Codes

The examples above were data codes that represented single, multiple, or partial data properties, and single or multiple data subjects. A set of data codes can be complete or partial, and can represent a single data subject or multiple data subjects.

A complete set of data codes contains all of the data properties for a single data subject. For example, engine type codes are defined for gasoline, diesel, propane, and electric. Complete sets of data codes are very common in a disparate data resource.

A partial set of data codes contains a subset of the data properties for a single data subject. For example, the engine type codes above would be a partial data code set if wind, wood, coal, and human power were considered as engine types. Partial sets of data codes are relatively rare in a disparate data resource, but do exist when the entire data resource is considered.

A single subject set of data codes is a set of data codes that represent one data subject. Single subject sets of data codes are relatively common in a disparate data resource.

A multiple subject set of data codes is a set of data codes that represent more than one data subject. For example, a set of county codes has data codes 1 through 41, yet the state has only thirty-nine counties. Code 40 means outside of the state but within the United States, and code 41 means outside of the United States. Although the data codes are mutually exclusive, they obviously represent more than one data subject. Multiple subject sets of data codes are relatively rare in a disparate data resource, but do exist.
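The county code example can be made explicit by stating which data subject each code value belongs to. In the Python sketch below, the subject name "location outside state" is a hypothetical label for the second data subject; the text does not name it.

```python
def classify_county_code(code: int) -> tuple:
    """Return the data subject and meaning represented by a county code value."""
    if 1 <= code <= 39:
        return ("county", f"county number {code}")
    if code == 40:
        return ("location outside state", "outside the state, within the United States")
    if code == 41:
        return ("location outside state", "outside the United States")
    raise ValueError(f"{code} is not in the set of county codes")


for code in (17, 40, 41):
    print(code, classify_county_code(code))
```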

Note that a complex subject set of data codes doesn’t exist. A set of data codes represents either a single data subject or multiple data subjects. Similarly, a partial subject set of data codes doesn’t exist.

Disparate data codes is the situation where data codes can represent single, multiple, or partial data properties; where data codes can represent single or multiple data subjects; where sets of data codes can represent single or multiple data subjects; and where sets of data codes can be complete or partial.

Coded Data Codes

Coded data codes is the situation where single property data codes are combined into a multiple property data code. For example, gender, eye color, and hair color data codes might be combined so that 1 is M Br Bl (male, brown eyes, blond hair), 2 is F Br Bl, and so on. Coded data codes are very rare in a disparate data resource, but do occur.

Hidden Hierarchies

A hidden data code hierarchy is the situation where a single set of data codes represents a hierarchy of data codes. For example, the Census Race Code is a three-digit number. However, buried in the three-digit number is a hidden three-level hierarchy of codes. These three levels were defined as Census Race Category, Census Race Group, and Census Race. Census Race Category is identified by a range of three-digit numbers, such as 653 through 699 is Pacific Islander. Census Race Group is identified by another range of three-digit numbers within Census Race Category, such as 653 through 659 is Polynesian. Census Race is identified by a single three-digit number within Census Race Group, such as 653 for Hawaiian.
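Exposing a range-based hidden hierarchy requires a table of ranges for each level. The Python sketch below contains only the three entries named in the example; a complete table would have to come from the documentation of the code set, and the table and function names are hypothetical.

```python
# Only the ranges named in the example; the real tables are far larger.
CATEGORY_RANGES = [(653, 699, "Pacific Islander")]
GROUP_RANGES = [(653, 659, "Polynesian")]
RACES = {653: "Hawaiian"}


def lookup_range(code: int, ranges: list):
    """Return the name of the range containing the code, or None if no range does."""
    for low, high, name in ranges:
        if low <= code <= high:
            return name
    return None


def expand_race_code(code: int) -> dict:
    """Expose the three hidden levels behind a single three-digit code."""
    return {
        "category": lookup_range(code, CATEGORY_RANGES),
        "group": lookup_range(code, GROUP_RANGES),
        "race": RACES.get(code),
    }


print(expand_race_code(653))
```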

Another form of a hidden data hierarchy is a single set of data codes that represents a hierarchy of codes, but the distinction is sequential through the data value. For example, a customer identification number might be a seven-digit number. However, the first two digits represent the sales region, the next two digits represent the sales district within the sales region, and the next three digits represent the customer number within the sales district.
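A positional hierarchy of this kind can be exposed by slicing the data value. The Python sketch below follows the seven-digit layout described above and keeps each part as text so that leading zeros survive; the function name is hypothetical.

```python
def split_customer_id(customer_id: str) -> dict:
    """Split a seven-digit customer identification number into its hidden levels."""
    if len(customer_id) != 7 or not customer_id.isdigit():
        raise ValueError("expected a seven-digit customer identification number")
    return {
        "sales region": customer_id[:2],
        "sales district": customer_id[2:4],
        "customer number": customer_id[4:],
    }


print(split_customer_id("1204567"))
```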

Hidden hierarchies are relatively common in a disparate data resource. Anyone working with a disparate data resource must be able to recognize the existence of hidden hierarchies.

Data Relation Variability

Data relation variability is the variability that exists with the data relations, the names and cardinalities for those data relations, primary keys, and foreign keys. Ideally, data relations with their names and cardinalities, primary keys, and foreign keys are formally designed. However, that is far from the norm in a disparate data resource. The different types of data relation variability are described below.

Data Relations

A data relation is an association between data occurrences in different data subjects or data entities, or within a data subject or data entity, or between data records in different data files or within a data file. It provides the connections between data subjects for building the proper data structure, and between data files for navigating in the database.

A logical data relation is an association between data occurrences in different data subjects or data entities, or within a data subject or data entity. It is defined during data normalization and has a name or short phrase describing the data relation.

A physical data relation is an association between data records in different data files or within a data file. It is typically defined during formal data denormalization and has no name.

Data Cardinality

Data cardinality is a specification of the number of data occurrences that are allowed or required in each data subject or data entity that are involved in a data relation, or the number of data records that are allowed or required for each data file that are involved in the data relation.

General data cardinality is the data cardinality specified by the data relation or by a semantic statement for the data relation. A semantic statement is a textual statement of the relationship between data entities. General data cardinality indicates one-to-one, one-to-many, and many-to-many data relations. General data cardinality typically appears for logical data relations, but not for physical data relations.

Specific data cardinality is the data cardinality specified by a notation at the end of a data relation and is more specific than the general data cardinality. Specific data cardinality indicates 0, 1, M, 0/M, or 1/M data occurrences or data records. Specific data cardinality may appear on logical data relations, but typically does not appear on physical data relations.

Primary Keys

Data relations are based on primary and foreign keys, or simply on the same data characteristics, regardless of the data item names, that exist in different data files.

A primary key is a set of one or more data attributes whose values uniquely identify each data occurrence in a data entity in a logical data model. In a database, a primary key is a set of one or more data items whose values uniquely identify each data record in a data file.

Ideally, primary keys are formally identified during logical and physical data modeling and then incorporated into the data files. However, the primary keys in most disparate data files were defined as they were needed, frequently without any formal data normalization or denormalization. That practice resulted in disparate primary keys.

A disparate primary key is any primary key defined in a disparate data resource that does not meet the formal criteria for a true primary key. The specific situations are described below.

Primary keys often contain data items that are not necessary for the unique identification of each data record in the data file. For example, the primary key for a vehicle might contain the vehicle’s license number and the manufacture date. The situation is relatively common in a disparate data resource.

Primary keys may exist in a data file, but may not be readily identifiable. The situation is very common in a disparate data resource.

Primary keys may never have been defined, particularly if no attempt was made to uniquely identify each data record in a data file. The situation is rare in a disparate data resource, but it does exist with data files that are not part of a formal database management system.

Primary keys may be defined and maintained without any valid need. Many disparate data files have primary keys that are maintained but never used. The situation is relatively rare in a disparate data resource.

Primary keys may be defined and maintained, but do not uniquely identify each data record in the data file; additional data items are needed for unique identification of each data record. The situation is relatively rare in a disparate data resource, but does exist in data files that are not part of a database management system.

Different data files may contain the same primary key, even though the data item names may be different, and may represent the same data subject or different data subjects. The situation is relatively common in a disparate data resource.

System identifiers or counters are often used as the primary key, but they make it difficult to identify which data item or set of data items uniquely identify a data record. In addition, some system identifiers and counters are often reused when a data record has been deleted. The situation is relatively common in a disparate data resource.

Data files representing the same data subject could have different primary keys. For example, a vehicle purchase data file could have a primary key for vehicle identification number and purchase date, a vehicle surplus data file could have a primary key for vehicle license number and sale date, and a vehicle inventory data file could have a primary key for the vehicle license number. The situation is relatively common in a disparate data resource.

In addition to the variability described above, primary keys could also contain the variability mentioned earlier for data items. The combined variability often results in extreme difficulty identifying primary keys in a disparate data resource.
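Two of the situations above, a primary key that does not uniquely identify each data record and a primary key that contains unnecessary data items, can be tested against the data records themselves. The Python sketch below assumes data records are available as dictionaries; the data item names and function names are hypothetical. A result obtained from a sample of data records is evidence only, because other data records could still contradict it.

```python
def is_unique(records: list, key_items: list) -> bool:
    """True if no two data records share the same values for the key items."""
    seen = set()
    for record in records:
        key = tuple(record[item] for item in key_items)
        if key in seen:
            return False
        seen.add(key)
    return True


def unnecessary_key_items(records: list, key_items: list) -> list:
    """Key items that can each be dropped without losing uniqueness in these records."""
    if len(key_items) < 2:
        return []
    return [
        item
        for item in key_items
        if is_unique(records, [other for other in key_items if other != item])
    ]


vehicles = [
    {"license_number": "ABC123", "manufacture_date": "2001-05-01"},
    {"license_number": "XYZ789", "manufacture_date": "2001-05-01"},
]
declared_key = ["license_number", "manufacture_date"]

print(is_unique(vehicles, declared_key))
print(unnecessary_key_items(vehicles, declared_key))
```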

Foreign Keys

A foreign key in logical data models is the primary key of a data occurrence in a parent data entity that is placed in each data occurrence of a subordinate data entity to identify the parent data occurrence in that parent data entity. In data files, a foreign key is the primary key of a data record in a parent data file that is placed in each data record of a subordinate data file to identify the parent data record in that parent data file.

A disparate foreign key is any foreign key defined in a disparate data resource that does not meet the formal criteria for a true foreign key. The specific situations are described below.

Foreign keys may not be readily identifiable and are often difficult to identify. The situation is very common in a disparate data resource.

A foreign key may be readily identified, but have no corresponding primary key in any data file. The situation is relatively rare in a disparate data resource.

A foreign key may go, or at least appear to go, to many different parent data files based on the data item names. The situation is very common in a disparate data resource.

A foreign key may be a subset of the primary key in a parent data file, particularly when the primary key contains data items that are not necessary for uniqueness. The situation is relatively common in a disparate data resource.

A foreign key may have different data item names than the primary key, even though it represents the same data characteristics. The situation is relatively common in a disparate data resource.

A foreign key may have the same data item names as a primary key, but those data items do not represent the same data characteristics. Therefore, the foreign key is not valid for that parent data file. This situation is relatively common in a disparate data resource.

A foreign key may contain data items that uniquely identify a data record in a parent data file, even though those data items are not designated as a primary key. This situation is relatively rare in a disparate data resource.

Data files subordinate to the same parent data file may have different foreign keys to that parent data file. This situation is relatively rare in a disparate data resource.

The same data files in different databases may have different foreign keys to the same parent data file. This situation is relatively common in a disparate data resource.

A subordinate data file in different databases may have different parents. In other words, the foreign keys to all possible parent data files across databases may not exist in a data file in a particular database. The situation is relatively common in a disparate data resource.

In addition to the variability described above for foreign keys, foreign keys can also contain the variability mentioned earlier for primary keys and for data items. The combined variability often results in extreme difficulty identifying foreign keys in a disparate data resource.
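A suspected foreign key can be checked against a candidate parent data file by looking for subordinate data records whose key values match no parent data record. The Python sketch below allows the foreign key and primary key to have different data item names, as described above; the data files, item names, and function name are hypothetical.

```python
def orphan_records(children: list, foreign_key_items: list,
                   parents: list, primary_key_items: list) -> list:
    """Return subordinate data records whose foreign key matches no parent record."""
    parent_keys = {
        tuple(parent[item] for item in primary_key_items) for parent in parents
    }
    return [
        child
        for child in children
        if tuple(child[item] for item in foreign_key_items) not in parent_keys
    ]


vehicles = [{"license_number": "ABC123"}, {"license_number": "XYZ789"}]
repairs = [
    {"repair_id": 1, "veh_lic": "ABC123"},
    {"repair_id": 2, "veh_lic": "QQQ000"},
]

print(orphan_records(repairs, ["veh_lic"], vehicles, ["license_number"]))
```

Many orphans suggest that the data items do not represent the same data characteristics, or that the wrong parent data file was chosen. No orphans is supporting evidence, but not proof, that the foreign key is valid.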

DATA INTEGRITY VARIABILITY

Disparate data do have some level of data integrity, usually in the form of data edits and constraints that are applied through database management systems or applications. However, most of those data edits and constraints are imprecise for a comparate data resource that fully meets the business information demand. The primary reason for the imprecise data edits is that people just didn’t take the time to formally specify the data integrity criteria and ensure those criteria were properly implemented.

As with data structures, I never cease to be amazed at the low integrity in a disparate data resource. The more I look at disparate data in various public and private sector organizations, the more I’m appalled at the lack of formal data edits that are consistently applied to the data resource. The result is a low quality data resource that leads to low quality information.

The last chapter on Integrating the Data Resource summarized the problems leading to imprecise data integrity. The current section describes the types of data integrity variability that can be expected in a disparate data resource. Knowing the variability in data edits helps develop precise data edits for a comparate data resource.

Data integrity variability is the variability that exists with data edits in a disparate data resource. Ideally, data integrity rules are defined during logical data modeling and are transformed to data edits during physical data modeling. However, that is not true for most disparate data resources. Data integrity rules were seldom defined during logical data modeling, and data edits that were often superficial and incomplete were prepared during physical implementation of the data.

Data integrity variability applies to operational and evaluational data, structured and complex structured data, electronic and non-electronic data, logical and physical data models, and forms, reports, and screens. Anyone attempting to understand and resolve disparate data must learn to recognize imprecise data integrity rules, and learn to prepare precise data integrity rules that ensure quality in a comparate data resource.

Data accuracy, as defined earlier, is a measure of how well the data values represent the business world at a point in time or for a period of time. Data accuracy includes the method used to identify objects in the business world and the method of collecting data about those objects. It describes how an object was identified and the means by which the data were collected.

Disparate data have widely varying degrees of accuracy, although the accuracy is frequently unknown and not readily apparent. The accuracy cannot be changed or improved during the data integration process. It can only be identified and documented to increase understanding of the data.

DATA DOCUMENTATION VARIABILITY

Disparate data are frequently documented to some extent. However, the documentation is largely incomplete and inadequate. The last chapter on Integrating the Data Resource summarized the problems leading to limited data documentation. The current section describes the types of data documentation variability that can be expected in a disparate data resource. Knowing the variability in data documentation helps develop robust data documentation for a comparate data resource.

I’m astounded at the lack of formal documentation that exists for the data resource in most public and private sector organizations. It seems to me that any resource that is critical to an organization should be thoroughly documented. However, the critical data resource is not documented to the same extent that other critical resources are documented.

Data documentation variability is the variability that exists with the documentation about a disparate data resource. Ideally, all components of the organization’s data resource are formally documented and readily available. However, that is not true for most disparate data resources. The documentation is sparse, inconsistent, and widely scattered through the organization, in people’s minds, in database management systems, in data models, and in applications.

Data documentation variability applies to operational and evaluational data, structured and complex structured data, electronic and non-electronic data, logical and physical data models, and forms, reports, and screens. Anyone attempting to understand and resolve disparate data must learn to recognize the lack of complete, robust data documentation, and learn to prepare robust data documentation based on any information they can gain about the disparate data resource.

VARIABILITY OVER TIME

The description of variability presented above applies to a point in time. The first dimension of data variability is the variability in data names, definitions, structure, integrity, and documentation that exists at any point in time with the operational data in a disparate data resource. The first dimension can be considerable and often overwhelming.

However, a data resource can change over time to reflect changes in both business and technology. For example, data names change, data definitions change, data structure changes, data integrity changes, the data documentation changes, and the data values captured and maintained can change. Variability over time due to business and technology changes is necessary and acceptable. What is not acceptable is change for change’s sake, or failure to properly manage the necessary and acceptable change.

The second dimension of data variability is the variability in data names, definitions, structure, integrity, and documentation that occurs over time with the operational data in a disparate data resource. The second dimension is in addition to the first dimension of data variability, meaning that the data variability at a point in time is magnified by the data variability over time. The result can easily be overwhelming and makes management of a disparate data resource nearly impossible. The overwhelming nature of the variability is the reason many organizations choose not to attempt understanding and resolution of a disparate data resource.
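The difference between the first and second dimensions can be illustrated with a small sketch that compares the descriptors of one data item at two points in time. The class, the function, and the sample names and formats are hypothetical and exist only for this example. Real disparate data would carry many more descriptors than the four shown.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DataItemSnapshot:
    """Descriptors of one data item as observed on a given date (hypothetical)."""
    observed: str       # ISO date the snapshot was taken
    name: str
    definition: str
    data_format: str
    length: int


def changed_aspects(earlier, later):
    """Return {aspect: (old, new)} for each descriptor that differs over time."""
    old, new = asdict(earlier), asdict(later)
    return {
        key: (old[key], new[key])
        for key in old
        if key != "observed" and old[key] != new[key]
    }


# Two assumed snapshots of the same business fact, ten years apart.
snap_2005 = DataItemSnapshot(
    "2005-01-01", "CUST_BRTH_DT", "Customer birth date", "MMDDYY", 6
)
snap_2015 = DataItemSnapshot(
    "2015-01-01", "CUSTOMER_DOB", "Customer birth date", "YYYY-MM-DD", 10
)

changes = changed_aspects(snap_2005, snap_2015)
for aspect, (before, after) in changes.items():
    print(f"{aspect}: {before!r} -> {after!r}")
```

Each snapshot on its own may already vary from other data items for the same business fact at that date (the first dimension). The comparison adds the changes between dates (the second dimension) on top of that.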

The first and second dimensions of data variability apply to operational data and their related data models, forms, screens, and reports. A similar situation is occurring with evaluational data, including both analytical data (the aggregation space) and predictive data (the influence and variation space). These evaluational data are following the same path of data variability as operational data.

The third dimension of data variability is the variability in data names, definitions, structure, integrity, and documentation that occurs with evaluational data in a disparate data resource. The third dimension of data variability magnifies the first and second dimensions, because the evaluational data are extracted from the operational data. In addition, the evaluational data are often analyzed in a variety of different ways and by different methods, both at a point in time and over time. No wonder the results are often questionable.

Take the variability of operational data at a point in time, add the variability to operational data over time, extract evaluational data from those operational data, analyze those evaluational data over time, and you have some serious data variability in a disparate data resource. What is the probability that those operational and evaluational data will adequately support the current and future business information demand?

SUMMARY

Data variability occurs with the semantics of a data resource (data names and definitions), the structure of the data resource (structure), the integrity of the data resource (quality), and documentation of the data resource. Data variability occurs with operational and evaluational data, structured and complex structured data, electronic and non-electronic data, logical and physical data models, and forms, reports, and screens. Data variability occurs at a point in time and over time.

One should be able to draw five conclusions from the above descriptions of data variability. First, data variability exists throughout the entire disparate data resource. Second, data variability exists in all components of the data architecture. Third, the variability with data names and definitions, data integrity, and data documentation is largely due to a lack of complete and consistent management. Fourth, the variability with data structures is out of control with many things being done that should never be done. Fifth, the existing data disparity will continue, and will get worse, unless strong action is taken to stop further data disparity and to resolve existing data disparity.

These conclusions don’t paint a very positive picture about an organization’s data resource and its ability to support the current and future business information demand. Any of these conclusions would be difficult to accept, but accepting all five conclusions is almost beyond comprehension. No wonder people have trouble understanding their data and the business supported by those data. No wonder the current and future business information demand is not fully supported.

The table below summarizes the structural data variability for data files, data records, data instances, data items, and data codes.

Data File
     Composition
          Single Subject
          Multiple Subjects
     Completeness
          Complete Subject
          Partial Subject

Data Record
     Completeness
          Complete Occurrence
          Partial Occurrence
     Composition
          Single Occurrence
          Multiple Occurrences
               Subordinate Occurrence
               Parallel Occurrence

Data Instance
     Time Frame
          Current Data Instance
          Historical Data Instance
               Complete Historical Data Instance
               Partial Historical Data Instance
     History Location
          Self-Contained Historical Data
          Separate Historical Data

Data Item
     Business Fact
          Single Characteristic
          Multiple Characteristic
          Partial Characteristic
     Composition
          Consistent Characteristic
          Variable Characteristic
     Format
          Fixed
          Variable
     Length
          Fixed
          Variable

Data Code
     Data Properties
          Single Property
          Multiple Property
     Data Subjects
          Single Subject
          Multiple Subject

Set of Data Codes
     Data Properties
          Complete Subject
          Partial Subject
     Data Subjects
          Single Subject
          Multiple Subject
     Coded Data Code
     Hidden Data Code Hierarchy
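The summary above can also be held as a simple lookup structure, so that observations about a disparate data file or data item are recorded with consistent terms. The sketch below is only an illustration: it transcribes the Data File and Data Item entries and leaves out the rest, and the `classify` function and its return layout are hypothetical.

```python
# Partial transcription of the structural variability summary
# (Data File and Data Item entries only).
STRUCTURAL_VARIABILITY = {
    "Data File": {
        "Composition": ["Single Subject", "Multiple Subjects"],
        "Completeness": ["Complete Subject", "Partial Subject"],
    },
    "Data Item": {
        "Business Fact": [
            "Single Characteristic",
            "Multiple Characteristic",
            "Partial Characteristic",
        ],
        "Composition": ["Consistent Characteristic", "Variable Characteristic"],
        "Format": ["Fixed", "Variable"],
        "Length": ["Fixed", "Variable"],
    },
}


def classify(component, observed):
    """Check observed variants against the summary and list aspects not yet assessed."""
    aspects = STRUCTURAL_VARIABILITY[component]
    for aspect, variant in observed.items():
        if aspect not in aspects:
            raise ValueError(f"{aspect!r} is not an aspect of {component!r}")
        if variant not in aspects[aspect]:
            raise ValueError(f"{variant!r} is not a variant of {component} {aspect}")
    return {
        "component": component,
        "classified": dict(observed),
        "unclassified": sorted(set(aspects) - set(observed)),
    }


result = classify("Data Item", {"Format": "Variable", "Length": "Fixed"})
print(result["unclassified"])
```

Rejecting terms that are not in the summary keeps the recorded observations from adding yet another layer of naming variability.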

QUESTIONS

The following questions are provided as a review of disparate data variability, and to stimulate thought about understanding and documenting the variability of disparate data.

  1. What is data variability?
  2. What is the difference between acceptable and unacceptable data variability?
  3. Why does data variability exist in an organization’s data resource?
  4. What impacts does data variability have on the organization?
  5. What component of the data architecture has the most data variability?
  6. What are the dimensions of data variability?
  7. How does data variability impact understanding?
  8. Why is the Common Data Architecture used as the reference for understanding data variability?
  9. Why should one expect data variability in the organization’s data resource?
  10. What can be done to stop further data variability?