This chapter briefly discusses data modeling, providing an overview of Third Normal Form (3NF), data vault modeling (e.g., Data Vault 2.0), and dimensional modeling techniques. Many books offer complete explanations and are better suited to a deep dive into theory and practice. Instead, I provide an overview of the main data modeling options, both to enable identification and to explain why certain techniques are preferred at different points in a data lifecycle.
If you are familiar with data modeling, feel free to skip this chapter, which is intended to be a simple explanation for newcomers. However, I contend that understanding data modeling is important when explaining the why, how, and what of data ingestion through to data consumption.
This chapter is a very brief guide intended only to get you started. A full explanation of database normalization is at https://en.wikipedia.org/wiki/Database_normalization.
Third Normal Form (3NF)
Whether Snowflake is the target database of choice or not, 3NF is optimized for database writes. Typically, we insert, update, or delete from a single table at a time. I do not discuss transaction boundaries here but draw your attention to Snowflake micro-partitions, discussed in Chapter 3, and earlier comments on not using Snowflake for heavy OLTP workloads. The salient point is that 3NF is not a read-optimized pattern because we must join several tables together to derive full context. Each table holds information about a single “thing.” Object joins can be expensive operations, and the data model must be understood to ensure desired meaning can be derived. We must also ensure uniqueness is enforced in our 3NF model and prevent duplicates, which may consist of programmatically derived unique constraints and date attributes for temporal data.
Database normalization is the process by which the core database design is distilled into a suite of objects containing a single expression of like or grouped values with keys providing relationships. Joining objects via keys informs the model, resulting in a logical and meaningful representation of the business context. 3NF has been the optimal and most encountered implementation pattern for database structures for decades; it is best understood and widely accepted. Now that we know what we are trying to achieve, how do we implement a 3NF data model?
Figure 13-1 depicts a typical scenario where the tables suffixed “_type” hold values used to restrict or constrain values held in the parent tables. While Snowflake does not enforce constraints apart from NOT NULL, I contend declaring constraints allows relationships to be discovered by tooling, easing our business colleagues’ understanding of our data models.
The importance of using surrogate keys may not be readily apparent from the example, because adding sequences also adds an attribute to each table. But imagine the situation without surrogate keys where party_name changes. For example, a merger or acquisition may require the legal entity to be renamed. In such a scenario, the party_name change would break the referential integrity between party and party_contact and every downstream table thereafter. Better to make the data model agnostic of name changes and use immutable surrogate keys to join tables rather than mutable real-world labels. Surrogate keys provide flexibility and are adaptable, whereas natural keys are inflexible and do not adapt to change cascading through our models. What we also find in practice is that most, if not all, SQL optimizers prefer numeric equivalence checks to character equivalence checks. Therefore, sequence-generated surrogate keys offer a simple way to improve performance.
Let’s look at how sample SQL statements might look, assuming we have created appropriate tables, sequences, and records not shown here.
Using Natural Keys
Using natural keys multiplies the join criteria for every new table added to the data model, leading to code bloat with consequent maintenance overhead for our support team and when adding new features.
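The following sketch illustrates the point using hypothetical tables and columns loosely based on the party model discussed earlier (the names party_contact_type, party_type_name, and contact_type_name are assumptions, not the book’s actual schema). Every component of the business identifier must be repeated in every join condition:

```sql
-- Hypothetical natural-key join: each join repeats every part of the
-- composite business identifier, and the predicate count grows with
-- every table added to the model.
SELECT p.party_name,
       pc.contact_value
FROM   party p
JOIN   party_contact pc
  ON   pc.party_name      = p.party_name       -- natural key part 1
 AND   pc.party_type_name = p.party_type_name  -- natural key part 2
JOIN   party_contact_type pct
  ON   pct.party_name        = pc.party_name
 AND   pct.party_type_name   = pc.party_type_name
 AND   pct.contact_type_name = pc.contact_type_name;
```

Three objects joined here already require five join predicates; renaming a party would also invalidate every one of them.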
Using Surrogate Keys
Using surrogate keys adds readability to our code by reducing the number of join conditions. Not only is the data model more understandable, but adding a new relationship is relatively easy. Adding constraints is also simpler, as we do not need to list all the constituent attributes but instead rely upon surrogate keys alone. Where surrogate keys are used, the number of core join conditions is always one less than the number of objects joined.
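The same hypothetical query rewritten with sequence-generated surrogate keys (party_id and party_contact_id are assumed column names) shows the reduction; three objects joined, two join conditions:

```sql
-- Hypothetical surrogate-key join: one join condition per pair of
-- objects joined, regardless of how many columns make up the
-- underlying business identifier.
SELECT p.party_name,
       pc.contact_value
FROM   party p
JOIN   party_contact pc
  ON   pc.party_id = p.party_id
JOIN   party_contact_type pct
  ON   pct.party_contact_id = pc.party_contact_id;
```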
3NF Assumptions
When working with 3NF data models, there is an implicit assumption that reference data satisfies any declared referential integrity. While Snowflake does not enforce referential integrity, we might reasonably assume our ingested data is correct at the point of delivery and, to our earlier point of not rejecting any invalid records, might optionally set missing or invalid attributes to defaults while validating and reporting invalid records back to the source.
3NF Summary
Joining tables together allows information to be derived from the discrete data sets but implies a high degree of familiarity with the data model. We may also see challenges with identifying data sets for date ranges. Each component table may have its own temporal attributes, particularly where SCD2 has been implemented, implying that further domain knowledge is required. We would extend the number of SQL predicates to explicitly include the date ranges required, recognizing the code quickly becomes bloated. However, as previously stated, 3NF is a storage (write) pattern, not a retrieval (read) pattern. We are most concerned with writing data into Snowflake as fast as possible for data ingestion. We later discuss retrieval patterns.
Adding attributes at a higher level of grain than currently supported in a data model is generally very easy. For example, adding a geographical region where a country exists on the party table would require a new reference table for the region and an intersection entity between the country table (not shown) and the region. After which, all reporting referring to a country may need amending to add a filter for the region. Similarly, adding a lower grain of detail to a 3NF data model is generally doable with a little effort and some disruption but typically not nearly as invasive as for dimensional modeling described next.
Denormalization
No discussion on normalization is complete without mentioning denormalization, the process by which we join our discrete 3NF tables to provide meaningful business objects for reporting. While this topic deserves more considered treatment than we can provide here, I offer a few suggestions.
We prefer to represent all data in our data warehouse (see https://en.wikipedia.org/wiki/Data_warehouse) and not just those that match our reference data. We contend that all data, regardless of quality, is valid, and our business colleagues are responsible for their data quality. We must not exclude records by filtering out “bad” data but instead notify source systems of records failing validation.
Our success is entirely dependent upon adoption by our users. We must enable an infrastructure that enables results driving business value. In other words, we must retain our business users’ perspective and provide what they want, how they want it, and not deliver what we think they want. From this position, we may enrich our data and suggest summaries, aggregates, and pre-filtered data sets tuned to our customer needs.
We must make our data sets easy to discover and self-serve. Adding object tags, as discussed in Chapter 12, is one very powerful way to enable our business community. Lastly, we must not forget data security and adhere to the policy.
Dimensional Modeling
In contrast to 3NF, which is optimized for transactional speed of writing data into an RDBMS, dimensional models are concerned with the speed of reading data from our data warehouse. These are two different paradigms with interrelated structures and complementary approaches. Ralph Kimball developed dimensional modeling; more information is at https://en.wikipedia.org/wiki/Dimensional_modeling. For further reading, I recognize Bill Inmon as the father of data warehousing. More information is at https://en.wikipedia.org/wiki/Bill_Inmon.
Date Dimension
The date dimension, typically joined twice for date-banding queries, is more prominent in dimensional modeling. We might create a date dimension spanning a Julian day range from before our earliest system record to some point in the far future, easily created using SQL.
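A minimal sketch of generating such a date dimension in Snowflake follows, using the GENERATOR table function. The table name, column names, start date, and row count are all assumptions for illustration; note that Snowflake documents SEQ4 as gap-free only within a single GENERATOR scan, which is the usage here:

```sql
-- Generate roughly 100 years of daily records from an assumed epoch.
CREATE OR REPLACE TABLE date_dim AS
WITH days AS (
    SELECT DATEADD(day, SEQ4(), '1970-01-01'::DATE) AS d
    FROM   TABLE(GENERATOR(ROWCOUNT => 36525))      -- ~100 years of days
)
SELECT TO_CHAR(d, 'YYYY-MM-DD') AS date_label,      -- VARCHAR for easy filtering
       d                        AS full_date,
       YEAR(d)                  AS year_number,
       QUARTER(d)               AS quarter_number,
       MONTH(d)                 AS month_number,
       DAYOFWEEK(d)             AS day_of_week
FROM   days;
```

The VARCHAR date_label column is what removes the need for explicit DATE casting in consumer queries.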
Sample Query
From this query, we can readily see how to extract data for a given time-band by joining the date dimension twice and setting appropriate values. Note that the dimension contains VARCHAR definitions, removing the need to convert to the DATE data type to make usage easy for our business consumers. We might add more predicates to filter on specific criteria or summarize data.
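A date-banding query of the kind described might look like the following sketch; the fact table, its columns, and the date dimension columns are assumptions for illustration. The date dimension appears twice, once bounding the start of the band and once bounding the end:

```sql
-- Hypothetical date-banding query joining the date dimension twice.
-- The VARCHAR date_label column means no casting to DATE is needed
-- in the consumer-facing predicates.
SELECT f.party_id,
       SUM(f.amount) AS total_amount
FROM   fact_payment f
JOIN   date_dim d_from ON f.payment_date >= d_from.full_date
JOIN   date_dim d_to   ON f.payment_date <= d_to.full_date
WHERE  d_from.date_label = '2023-01-01'
AND    d_to.date_label   = '2023-03-31'
GROUP BY f.party_id;
```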
Dimensional Modeling Assumptions
When working with star schemas, there is an implicit assumption that dimension data satisfies all fact data. While Snowflake does not enforce referential integrity, we might reasonably assume our ingested data is correct at the point of delivery and, to our earlier point of not rejecting any invalid records, might optionally set missing or invalid attributes to dimension default values.
Dimensional Modeling Summary
Dimensions are generally quite small, sometimes as large as a few thousand records, changing slowly. Facts are generally huge, in the order of billions of records, with new data sets added frequently.
One can readily observe fact table contents are easily aggregated and summarized, further speeding up periodic reporting over defined time buckets. Moreover, the same dimensions can be used to report against summarized and aggregated data. With our date dimension attributes extended to months, quarters, and years, the natural calendar relationships enable time bucket–based reporting. We might extend our party dimension to include industry classifications, sectors, and geographical locations, and our agreement dimension includes similar attributes. With the ability to extend dimensions by adding textual search attributes without impacting our fact tables, we should take the opportunity to consult with our business colleagues for their input. Closer working relationships are often achievable, leading to more successful business outcomes. Considering that we may have poor quality data, each dimension must contain a default value against which all non-matching facts are assigned to enable complete reporting. However, we must not apply 3NF to our dimensions and subset the contents into what is paradoxically called a “snowflake” schema, as doing so may significantly impact performance.
Given our knowledge of Snowflake micro-partitions, the potential for querying huge data volumes, and knowledge of the dimensions commonly used for lookups, we must consider appropriate cluster keys and materialized views for summarization, aggregation, and alternative lookup paths.
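As a sketch of what this might look like, the following statements declare a cluster key and a pre-aggregating materialized view; the table and column names (fact_payment, payment_date, amount) are assumptions, and materialized view support is subject to Snowflake edition and documented limitations:

```sql
-- Cluster a large fact table by its date attribute so time-band queries
-- prune micro-partitions effectively.
ALTER TABLE fact_payment CLUSTER BY (payment_date);

-- Pre-aggregate by month for faster periodic reporting.
CREATE MATERIALIZED VIEW mv_payment_by_month AS
SELECT DATE_TRUNC('MONTH', payment_date) AS payment_month,
       SUM(amount)                       AS total_amount
FROM   fact_payment
GROUP BY DATE_TRUNC('MONTH', payment_date);
```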
Adding attributes at a higher level of grain than currently supported is generally very easy; for example, adding a geographical region where a country exists as the new dimension attribute self-evidently rolls up. But what happens when the grain of the dimension is too high to support evolving business requirements? For example, what are the options if we need to decompose an attribute in an existing dimension to a finer grain of detail? Our first option is to restructure the original fact table to decompose existing facts into the same finer grain along with a re-created dimension table with extended attributes and increased surrogate key ranges. Our second option is to create a new fact table with the new data and dimension table. Neither option is easy to implement. The first option is fraught with danger , whereas the second option would be expected to reconcile with the original fact table.
Data Vault 2.0
Data vault modeling breaks down the relationship between elements of an entity and its primary key into hubs and satellites joined by link tables. Data vault is a distinct modeling technique that takes time to master. If done badly, expect a poor outcome.
A data vault is not intended as an end-user query repository, and such usage must be avoided.
A raw data vault is designed to represent unchanged data as it was loaded from the source. We always represent data as supplied, regardless of how complete or imperfect it is on receipt, and our work up to this point in this book explains how to historize using the SCD2 pattern. A raw data vault represents a “system of record”. It may be better expressed as connecting directly to staging tables, as historization comes from the timestamp of the latest record rather than from the SCD2 pattern’s valid_from and valid_to timestamps.
For existing 3NF schemas with history built up, we should reflect that history in the data vault by using the valid_from timestamp as the load timestamp of the latest record. The key point is that it is not Data Vault 2.0 versus Kimball; it is Data Vault 2.0 underpinning Kimball, avoiding the expensive rework of a “direct to Kimball” approach.
The key advantages of data vault modeling include the following:
Ability to bulk load data in parallel
Efficient loading and storing data from disparate sources
Flexibility to add new hubs, links, and satellites without disrupting the existing model
Track and audit data with ease
Scalable, adaptable, and consistent architecture
However, the key disadvantage of data vault modeling is the expansion in the number of objects used in the model, rendering a data vault unsuitable for direct end-user reporting. The data vault methodology never recommends direct access to data vault structures; the proposal is always to build virtual or materialized star schema data marts over the data vault for familiar and optimized reporting.
Flexibility comes with complexity, requiring a mindset shift from traditional data modeling. And data vault modeling is not a good candidate for single source systems and/or relatively static data.
Success Factors
A data vault is not a silver bullet. Poor understanding and execution of core data vault concepts and principles result in poor implementation and general dissatisfaction. Data vaults have been well described in various media. This section seeks to inform basic principles rather than dive into technicalities; it is broken down into a series of headings explaining each principle.
Experience Counts
Without adherence to the data vault methodology, projects fail badly. Having at least one seasoned data vault practitioner on the team is highly recommended to ensure best practice is observed. While costly, this is a risk mitigation strategy. I have been on a failed data vault project. I speak from experience.
Business Keys, Not Surrogate Keys
Every practical 3NF system uses surrogate keys instead of natural keys to represent relationships, making code cleaner and physical data models easier to read. In contrast, data vault modeling relies upon the correct identification of business keys, either single or composite.
At all costs, avoid using sequences or surrogate keys as business keys; take time to identify proper business identifiers, and concatenate them for uniqueness if necessary.
Using surrogate keys is highly likely to result in refactoring because a data vault relies upon hashing for keys. Where surrogate keys have been selected, and given that data vault modeling accepts data sets into a single entity from multiple sources that may use overlapping sequence numbers, a key collision is almost certain to occur.
Every column included before hashing must be separated with an uncommon delimiter, such as double-pipe (||), between columns.
All null values are converted to blank before hashing because null values break the hash. For surrogate hash keys, nulls are substituted with the zero key (‘-1’).
Column order must be consistent, and new columns added to the satellite table must be added to the end of the record hash.
Leading and trailing blanks of every column must be removed before hashing.
For surrogate hash keys only, business keys must be up-cased to ensure passive integration across source applications of business keys loading to a common hub. Alternate solutions may be applicable if variable business key treatments are needed.
All business keys must be cast to text. This is a standard treatment because business keys are not metrics. This is done before hashing.
We may optionally include a business key collision code (BKCC) to differentiate keys from disparate sources where overlap would cause hash clashes.
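The rules above can be sketched as a single Snowflake hash key expression; the source table and column names (src_party, party_code, country_code, bkcc) are assumptions for illustration:

```sql
-- Sketch of a hub hash key applying the rules above: cast to text,
-- trim, up-case, convert nulls to blank, and delimit with double-pipe.
SELECT MD5(
         COALESCE(TRIM(UPPER(party_code::VARCHAR)),   '') || '||' ||  -- business key part 1
         COALESCE(TRIM(UPPER(country_code::VARCHAR)), '') || '||' ||  -- business key part 2
         COALESCE(TRIM(UPPER(bkcc::VARCHAR)),         '')             -- optional collision code
       ) AS party_hash_key
FROM   src_party;
```

Because the column order feeds the hash, it must never change once established; new attributes belong at the end.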
Hashing
Data vault extensively uses hashing, a one-way algorithm translating an input value into a fixed-length 128-bit value that can be stored in a 32-character field. Hashes are not used for cryptographic functions but simply to generate a manageable key. A far more comprehensive explanation of hashing is at https://en.wikipedia.org/wiki/Hash_function. For our purposes, we use the Snowflake functionality documented at https://docs.snowflake.com/en/sql-reference/functions/md5.html.
Hash keys are generated from all non-primary key values with a fixed field separator and stored in a field called hashdiff. Comparing hashdiff for existing data with hashdiff for new or modified records identifies whether any non-primary key value has changed. However, the comparison must only be against the most recent record and not the historical record; otherwise, records reverted to earlier values may not be detected as the original hashdiff is calculated and no delta detected.
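A minimal sketch of that latest-record-only comparison follows; the staging and satellite table names and columns are assumptions:

```sql
-- Detect new or changed records by comparing hashdiff against only the
-- most recent satellite record per hash key, using Snowflake's QUALIFY.
WITH latest AS (
    SELECT party_hash_key,
           hashdiff
    FROM   sat_party
    QUALIFY ROW_NUMBER() OVER (PARTITION BY party_hash_key
                               ORDER BY load_timestamp DESC) = 1
)
SELECT stg.*
FROM   stg_party stg
LEFT   JOIN latest l
  ON   l.party_hash_key = stg.party_hash_key
WHERE  l.party_hash_key IS NULL       -- brand-new business key
   OR  l.hashdiff <> stg.hashdiff;    -- attributes changed since latest record
```

Comparing against the full history instead of only the latest record is the trap described above: a record reverted to an earlier value would match its historical hashdiff and the change would go undetected.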
Hash Clash
Hash clash (also known as hash collision) occurs when two different input values resolve to the same hash value. With MD5, hash clash can become more likely when over 5.06Bn records exist in a table. Therefore, I suggest the probability is so small as to be insignificant; more information is at https://en.wikipedia.org/wiki/Hash_collision. If hash clash is perceived to be an issue, alternatives exist, such as SHA1/SHA2/SHA3, noting hash generation time is longer than for MD5. Various strategies can be implemented to test for hash collision, some of which are at https://dwa.guide/2017/10/20/2017-10-20-implementing-a-hash-key-collision-strategy/.
Rehashing
Recognizing data vault modeling as an insert-only model, there may be the temptation to “knife and fork” data without re-creating the associated hash key and propagating it to hubs and links, breaking the explicit relationship between business key and hash. We must exclude all manual updates in our systems and rely upon correctly supplied data submission only to retain data fidelity. If data is wrong at the point of submission, our systems regard the data as valid for the full lifecycle of the data. Correction by resubmission is the preferred option, as reporting may have relied upon the bad data.
Business Ontology
A business ontology provides a consistent representation of data and data relationships, offering a structured way to interpret corporate data as a consolidation of taxonomies. A taxonomy is a clearly defined hierarchical structure for categorizing information.
Without a business ontology, we are destined to fail.
Expect Change
Data models change over time , and data vault modeling facilitates change by adopting a flexible architecture. Incorporating new features and capabilities in data vault modeling is relatively easy compared to 3NF and dimensional models.
Data Representation
Loading new data sets into an existing model is straightforward, as data vault modeling expects each record to be identified back to the source by default. Typically, our 3NF systems segregate by data source, where each schema or data model corresponds to a single data source.
Data Vault Core Components
Hub
A hub is a central entity containing business keys enabling all related satellites and links to interconnect.
When creating hub entries, avoid using sequences or surrogate keys as business keys, take time to identify proper business identifiers, and concatenate for uniqueness if necessary.
A hub comprises the following attributes:
A primary key comprising the hash of the business key(s) (preferably one or more textual identifiers separated by a fixed delimiter, not a sequence or surrogate key)
Alternate keys from satellites, in our example: party_name_hash/party_name and party_type_hash/party_type_name
A load timestamp defaulted to current_timestamp()
The data source for the loaded data
A hub should have at least one satellite table.
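The attributes above might be declared as follows; the table name, column names, and data types are assumptions for illustration, and the primary key declaration is informational only since Snowflake does not enforce it:

```sql
-- Illustrative hub DDL sketch.
CREATE TABLE hub_party (
    party_hash_key  VARCHAR(32)   NOT NULL,                      -- MD5 of the business key(s)
    party_code      VARCHAR(255)  NOT NULL,                      -- the business key itself
    load_timestamp  TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),   -- load time
    record_source   VARCHAR(255)  NOT NULL,                      -- data source of the loaded data
    CONSTRAINT pk_hub_party PRIMARY KEY (party_hash_key)
);
```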
Link
A link comprises the following attributes:
A primary key comprising the hash of party_id and contact_id
Foreign hash key #1, in our example, party_id
Foreign hash key #2, in our example, contact_id
A load timestamp defaulted to current_timestamp()
The data source for the loaded data
A link implements many-to-many relationships and may also have satellites.
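A link relating the party and contact hubs might be sketched as follows; names and types are assumptions, and the constraints are informational in Snowflake:

```sql
-- Illustrative link DDL sketch: one hash key per related hub plus a
-- primary key hashed over both business keys.
CREATE TABLE link_party_contact (
    party_contact_hash_key VARCHAR(32)   NOT NULL,                    -- MD5 over both business keys
    party_hash_key         VARCHAR(32)   NOT NULL,                    -- foreign hash key #1 (hub_party)
    contact_hash_key       VARCHAR(32)   NOT NULL,                    -- foreign hash key #2 (hub_contact)
    load_timestamp         TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(), -- load time
    record_source          VARCHAR(255)  NOT NULL,                    -- data source of the loaded data
    CONSTRAINT pk_link_party_contact PRIMARY KEY (party_contact_hash_key)
);
```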
Satellite
Satellites contain the business context (i.e., non-key attributes of a single hub or link). For example, party and party_type are both satellites of the party hub. We might also separate volatile attributes (i.e., those which change frequently) from non-volatile attributes (i.e., those which are static or change infrequently) into separate satellite tables. Separating high and low-frequency data attributes can assist with the speed of data loading and significantly reduce the space that historical data consumes.
Normally, a new satellite table would be designed for each data source, and another common consideration is data classification. This approach also enables data to be split apart based on classification or sensitivity. This makes it easier to handle special security considerations by physically separating data elements.
A satellite can link to only a single hub, with load_timestamp used to determine the most recent record.
A satellite comprises the following attributes:
A composite primary key consisting of the parent hub (or link) primary key and load timestamp
Non-key attributes
A load timestamp defaulted to current_timestamp()
The data source for the loaded data
The hash_diff key (hash of all attributes that we want to track for SCD2 changes)
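Putting those attributes together, a satellite on the party hub might be sketched as follows; names and types are assumptions:

```sql
-- Illustrative satellite DDL sketch: parent hub key plus load timestamp
-- form the composite primary key; hash_diff tracks attribute changes.
CREATE TABLE sat_party (
    party_hash_key  VARCHAR(32)   NOT NULL,                      -- parent hub primary key
    load_timestamp  TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),   -- load time
    party_name      VARCHAR(255),                                -- non-key attribute
    party_type_name VARCHAR(255),                                -- non-key attribute
    record_source   VARCHAR(255)  NOT NULL,                      -- data source of the loaded data
    hash_diff       VARCHAR(32)   NOT NULL,                      -- MD5 over tracked attributes
    CONSTRAINT pk_sat_party PRIMARY KEY (party_hash_key, load_timestamp)
);
```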
Other Data Vault Tables
Other data vault table types exist for reference data, bridge, and point-in-time tables. They are largely used for edge cases rather than core data vault delivery and may be encountered occasionally.
Data Vault Summary
Summary
This chapter began with a simplified discussion of Third Normal Form (3NF) and pseudocode examples showing the benefits of using surrogate keys, particularly as real-world labels change and data models expand, change, or adapt over time. We also saw how 3NF models are relatively easy to change compared to star schemas. Our 3NF discussion concluded with denormalization, the term used to describe how we bring a 3NF data model back into business usable reporting structures.
Moving on to dimensional modeling, we proposed how our 3NF model might be re-imagined as a dimensional model. I explained how dimensions are created, and attributes may be extended to include a more business-friendly context while recognizing star schemas are difficult to change once implemented.
Finally, the advantages of implementing data vault modeling as an intermediate step between 3NF and star schemas became apparent. Data vault modeling delivers successful business outcomes by enabling rapid dimensional model deployment with corresponding object tagging and embedded data ownership, facilitating later self-service.
3NF: Faster database writes, moderately flexible approach
Data vault modeling: Parallel writes, hybrid flexible approach
Dimensional modeling: Faster reads, rigid approach
Having discussed data modeling, let’s look at the next chapter.