© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
A. Carruthers, Building the Snowflake Data Cloud, https://doi.org/10.1007/978-1-4842-8593-0_13

13. Data Modeling

Andrew Carruthers, Birmingham, UK

This chapter briefly discusses data modeling, providing an overview of Third Normal Form (3NF), data vault modeling (e.g., Data Vault 2.0), and dimensional modeling techniques. Many books offer complete explanations and are better suited to a deep dive into theory and practice. Instead, I provide an overview of the main data modeling options to help you both identify each technique and understand why certain techniques are preferred at different points in a data lifecycle.

If you are familiar with data modeling, feel free to skip this chapter, which is intended to be a simple explanation for newcomers. However, I contend that understanding data modeling is important when explaining the why, how, and what of data ingestion through to data consumption.

This chapter is a very brief guide intended only to get you started. A full explanation of database normalization is at https://en.wikipedia.org/wiki/Database_normalization.

Third Normal Form (3NF)

Whether Snowflake is the target database of choice or not, 3NF is optimized for database writes. Typically, we insert, update, or delete from a single table at a time. I do not discuss transaction boundaries here but draw your attention to Snowflake micro-partitions, discussed in Chapter 3, and earlier comments on not using Snowflake for heavy OLTP workloads. The salient point is that 3NF is not a read-optimized pattern because we must join several tables together to derive full context. Each table holds information about a single “thing.” Object joins can be expensive operations, and the data model must be understood to ensure desired meaning can be derived. We must also ensure uniqueness is enforced in our 3NF model and prevent duplicates, which may consist of programmatically derived unique constraints and date attributes for temporal data.

Database normalization is the process by which the core database design is distilled into a suite of objects containing a single expression of like or grouped values with keys providing relationships. Joining objects via keys informs the model, resulting in a logical and meaningful representation of the business context. 3NF has been the optimal and most encountered implementation pattern for database structures for decades; it is best understood and widely accepted. Now that we know what we are trying to achieve, how do we implement a 3NF data model?

Consider Figure 13-1, which represents part of a sample physical data model where there are parties (companies), contacts (people), and agreements (documents).

Figure 13-1

Sample 3NF model (a block diagram showing party_type, party, party_contact, agreement, contact, and contact_type)

Figure 13-1 depicts a typical scenario where the tables suffixed “_type” hold values used to restrict or constrain values held in the parent tables. While Snowflake does not enforce constraints apart from NOT NULL, we contend that declaring constraints allows relationships to be discovered by tooling, easing our business colleagues’ understanding of our data models.

As you dive deeper into data modeling, you better understand how relationships are the “glue” binding our model together. Using natural keys (i.e., the physical labels that provide uniqueness) quickly becomes unmanageable; instead, we use surrogate keys expressed as sequences. You may have seen sequences used in previous chapters. Let’s discuss the reasoning. Figure 13-2 illustrates how sequences improve the readability of the data model. The first image shows natural keys in the party_contact intersection table and agreement table, whereas the second image shows the addition of surrogate keys generated from sequences, one per table.

Figure 13-2

Natural keys and surrogate keys (two block diagrams of the same tables, with party_contact highlighted in the surrogate key version)

The importance of using surrogate keys may not be readily apparent from the example, because adding sequences also adds an attribute to each table. But imagine the situation without surrogate keys where party_name changes. For example, a merger or acquisition may require the legal entity to be renamed. In such a scenario, the party_name change would break the referential integrity between party and party_contact and all downstream tables thereafter. Better to make the data model agnostic of name changes and use immutable surrogate keys to join tables rather than mutable real-world labels. Surrogate keys provide flexibility and are adaptable, whereas natural keys are inflexible and do not adapt to change cascading through our models. In practice, we also find that most, if not all, SQL optimizers prefer numeric comparisons to character equivalence checks. Therefore, sequence-generated surrogate keys offer a simple way to improve performance.

Let’s look at how sample SQL statements might appear, assuming appropriate tables, sequences, and records have been created along the lines of the following sketch.
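This sketch shows the surrogate key variant only (the natural key variant in Figure 13-2 carries party_name and contact_name in party_contact instead); the agreement table follows the same pattern, and all data types are illustrative.
CREATE OR REPLACE SEQUENCE seq_party;
CREATE OR REPLACE SEQUENCE seq_contact;
CREATE OR REPLACE SEQUENCE seq_party_contact;

CREATE OR REPLACE TABLE party
(
   party_id     NUMBER       DEFAULT seq_party.NEXTVAL,  -- surrogate key from the sequence
   party_name   VARCHAR(255) NOT NULL,                   -- natural key retained as an attribute
   CONSTRAINT pk_party PRIMARY KEY (party_id)            -- declared, not enforced, by Snowflake
);

CREATE OR REPLACE TABLE contact
(
   contact_id   NUMBER       DEFAULT seq_contact.NEXTVAL,
   contact_name VARCHAR(255) NOT NULL,
   CONSTRAINT pk_contact PRIMARY KEY (contact_id)
);

CREATE OR REPLACE TABLE party_contact
(
   party_contact_id NUMBER DEFAULT seq_party_contact.NEXTVAL,
   party_id         NUMBER NOT NULL,
   contact_id       NUMBER NOT NULL,
   CONSTRAINT pk_party_contact PRIMARY KEY (party_contact_id),
   CONSTRAINT fk_pc_party   FOREIGN KEY (party_id)   REFERENCES party   (party_id),
   CONSTRAINT fk_pc_contact FOREIGN KEY (contact_id) REFERENCES contact (contact_id)
);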

Using Natural Keys

With natural keys, we must know the names of the attributes to join, which may not always correlate to the table name, as the first example in Figure 13-2 shows.
SELECT p.party_name,
       c.contact_name,
       a.agreement_name,
       a.agreement_value
FROM   party           p,
       contact         c,
       party_contact   pc,
       agreement       a
WHERE  pc.party_name   = p.party_name
AND    pc.contact_name = c.contact_name
AND    pc.party_name   = a.party_name
AND    pc.contact_name = a.contact_name;

Using natural keys increases the number of join criteria for every new table added to the data model, leading to code bloat with a consequent maintenance overhead for our support team and when adding new features.

Using Surrogate Keys

With surrogate keys, we may reasonably assume that most (but not all) key names correlate to the table name, as the second example in Figure 13-2 shows.
SELECT p.party_name,
       c.contact_name,
       a.agreement_name,
       a.agreement_value
FROM   party               p,
       contact             c,
       party_contact       pc,
       agreement           a
WHERE  pc.party_id         = p.party_id
AND    pc.contact_id       = c.contact_id
AND    pc.party_contact_id = a.party_contact_id;

Using surrogate keys adds readability to our code by reducing the number of join conditions. Not only is the data model more understandable, but adding a new relationship is relatively easy. Adding constraints is also simpler as we do not need to list all the constituent attributes but instead rely upon surrogate keys alone. Where surrogate keys are used, the number of core join conditions is always one less than the number of objects joined.

3NF Assumptions

When working with 3NF data models, there is an implicit assumption that reference data satisfies any declared referential integrity. While Snowflake does not enforce referential integrity, we might reasonably assume our ingested data is correct at the point of delivery and, to our earlier point of not rejecting any invalid records, might optionally set missing or invalid attributes to defaults while validating and reporting invalid records back to the source.

3NF Summary

Joining tables together allows information to be derived from the discrete data sets but implies a high degree of familiarity with the data model. We may also see challenges with identifying data sets for date ranges. Each component table may have its own temporal attributes, particularly where SCD2 has been implemented, implying that further domain knowledge is required. We would extend the SQL predicates to explicitly include the required date ranges, recognizing the code quickly becomes bloated. However, as previously stated, 3NF is a storage (write) pattern, not a retrieval (read) pattern. For data ingestion, we are most concerned with writing data into Snowflake as fast as possible. We later discuss retrieval patterns.
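As an illustration, and assuming hypothetical SCD2 valid_from_date and valid_to_date columns on each temporal table, the earlier surrogate key query gains one pair of date predicates per table.
SELECT p.party_name,
       a.agreement_name,
       a.agreement_value
FROM   party           p,
       party_contact   pc,
       agreement       a
WHERE  pc.party_id         = p.party_id
AND    pc.party_contact_id = a.party_contact_id
AND    '2021-12-01'::DATE BETWEEN p.valid_from_date  AND p.valid_to_date   -- one pair of date
AND    '2021-12-01'::DATE BETWEEN pc.valid_from_date AND pc.valid_to_date  -- predicates per
AND    '2021-12-01'::DATE BETWEEN a.valid_from_date  AND a.valid_to_date;  -- temporal table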

Adding attributes at a higher level of grain than currently supported in a data model is generally very easy. For example, adding a geographical region where a country exists on the party table would require a new reference table for the region and an intersection entity between the country table (not shown) and the region, after which all reporting referring to a country may need amending to add a region filter. Similarly, adding a lower grain of detail to a 3NF data model is generally doable with a little effort and some disruption but typically not nearly as invasive as for dimensional modeling, described next.

Denormalization

No discussion on normalization is complete without mentioning denormalization , the process by which we join our discrete 3NF tables to provide meaningful business objects for reporting. While this topic deserves more considered treatment than we can provide here, I offer a few suggestions.
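As a simple sketch, and reusing the tables from Figure 13-2, a denormalized reporting object might be no more than the surrogate key join wrapped in a view definition.
CREATE OR REPLACE VIEW v_party_agreement
AS
SELECT p.party_name,
       c.contact_name,
       a.agreement_name,
       a.agreement_value
FROM   party               p,
       contact             c,
       party_contact       pc,
       agreement           a
WHERE  pc.party_id         = p.party_id
AND    pc.contact_id       = c.contact_id
AND    pc.party_contact_id = a.party_contact_id;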

We prefer to represent all data in our data warehouse (see https://en.wikipedia.org/wiki/Data_warehouse ) and not just those that match our reference data. We contend that all data, regardless of quality, is valid, and our business colleagues are responsible for their data quality. We must not exclude records by filtering out “bad” data but instead notify source systems of records failing validation.

Our success is entirely dependent upon adoption by our users. We must provide an infrastructure that enables results driving business value. In other words, we must retain our business users’ perspective and provide what they want, how they want it, and not deliver what we think they want. From this position, we may enrich our data and suggest summaries, aggregates, and pre-filtered data sets tuned to our customer needs.

We must make our data sets easy to discover and self-serve. Adding object tags, as discussed in Chapter 12, is one very powerful way to enable our business community. Lastly, we must not forget data security, and we must adhere to policy.

Dimensional Modeling

In contrast to 3NF, which is optimized for the transactional speed of writing data into an RDBMS, dimensional models are concerned with the speed of reading data from our data warehouse: two different paradigms with interrelated structures and complementary approaches. Ralph Kimball developed dimensional modeling; more information is at https://en.wikipedia.org/wiki/Dimensional_modeling. For further reading, I recognize Bill Inmon as the father of data warehousing. More information is at https://en.wikipedia.org/wiki/Bill_Inmon.

Dimensional modeling is often referred to as a star schema reflecting the visual representation of the data model . Re-purposing our earlier 3NF data model into a star schema depicted in Figure 13-3 illustrates how entities are conflated into dimensions containing textual attributes, with a new fact entity containing numeric data with foreign keys to dimensions. Putting aside temporal considerations for dimension entries, we arrive at a business-friendly data model where our business users can intuitively understand and self-serve with minimal intervention or explanation from technical staff.
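A minimal DDL sketch of the star schema in Figure 13-3 might look as follows; the column names mirror the sample query later in this section and are otherwise illustrative, and dim_contact, dim_agreement, and dim_date (sketched in the next section) follow the same pattern.
CREATE OR REPLACE TABLE dim_party
(
   party_id     NUMBER       NOT NULL,   -- dimension surrogate key
   party_name   VARCHAR(255),
   party_type   VARCHAR(255)             -- textual attribute conflated from party_type
);

CREATE OR REPLACE TABLE fact_agreement
(
   agreement_id    NUMBER       NOT NULL,   -- foreign keys to dimensions
   party_id        NUMBER       NOT NULL,
   contact_id      NUMBER       NOT NULL,
   valid_from_date NUMBER       NOT NULL,   -- keys into dim_date
   valid_to_date   NUMBER       NOT NULL,
   agreement_value NUMBER(38,2)             -- numeric measure
);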

Figure 13-3

Sample dimension model (fact_agreement in the center, linked to dim_party and dim_date on the left and to dim_contact and dim_agreement on the right)

Date Dimension

A date dimension, used twice in the following query for date-banding, is more prominent in dimensional modeling. We might create a date dimension spanning a Julian day range from before our earliest system record to some point in the far future, easily created using SQL.
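One way to generate such a date dimension is with Snowflake’s GENERATOR table function; this is a minimal sketch where the start date, row count, and column names are assumptions, and further attributes would be added in the same way.
CREATE OR REPLACE TABLE dim_date
AS
WITH day_numbers AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY SEQ4()) - 1 AS day_offset
    FROM   TABLE(GENERATOR(ROWCOUNT => 36525))       -- roughly 100 years of days
),
calendar AS
(
    SELECT DATEADD(day, day_offset, '1990-01-01'::DATE) AS calendar_date
    FROM   day_numbers
)
SELECT TO_NUMBER(TO_CHAR(calendar_date, 'YYYYMMDD')) AS date_id,       -- key referenced by facts
       TO_CHAR(calendar_date, 'YYYYMMDD')            AS yyyymmdd,      -- VARCHAR for easy filtering
       calendar_date,
       YEAR(calendar_date)                           AS calendar_year,
       QUARTER(calendar_date)                        AS calendar_quarter,
       MONTH(calendar_date)                          AS calendar_month,
       DAYOFWEEK(calendar_date)                      AS day_of_week
FROM   calendar;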

Sample Query

Dimension keys typically correlate to the dimension name, as our example from Figure 13-3 shows.
SELECT p.party_name,
       c.contact_name,
       a.agreement_name,
       a.agreement_value
FROM   fact_agreement    f,
       dim_party         p,
       dim_contact       c,
       dim_agreement     a,
       dim_date          d1,
       dim_date          d2
WHERE  f.party_id        = p.party_id
AND    f.contact_id      = c.contact_id
AND    f.agreement_id    = a.agreement_id
AND    f.valid_from_date = d1.valid_from_date_id
AND    f.valid_to_date   = d2.valid_to_date_id
AND    d1.yyyymmdd       = '20211201'
AND    d2.yyyymmdd       = '20211214';

From this query, we can readily see how to extract data for a given time-band by joining the date dimension twice and setting appropriate values. Note that the dimension contains VARCHAR definitions, removing the need to convert to the DATE data type to make usage easy for our business consumers. We might add more predicates to filter on specific criteria or summarize data.

Dimensional Modeling Assumptions

When working with star schemas, there is an implicit assumption that dimension data satisfies all fact data. While Snowflake does not enforce referential integrity, we might reasonably assume our ingested data is correct at the point of delivery and, to our earlier point of not rejecting any invalid records, might optionally set missing or invalid attributes to dimension default values.

Dimensional Modeling Summary

Dimensions are generally quite small, sometimes as large as a few thousand records, and change slowly. Facts are generally huge, on the order of billions of records, with new data sets added frequently.

One can readily observe fact table contents are easily aggregated and summarized, further speeding up periodic reporting over defined time buckets. Moreover, the same dimensions can be used to report against summarized and aggregated data. With our date dimension attributes extended to months, quarters, and years, the natural calendar relationships enable time bucket–based reporting. We might extend our party dimension to include industry classifications, sectors, and geographical locations, and our agreement dimension includes similar attributes. With the ability to extend dimensions by adding textual search attributes without impacting our fact tables, we should take the opportunity to consult with our business colleagues for their input. Closer working relationships are often achievable, leading to more successful business outcomes. Considering that we may have poor quality data, each dimension must contain a default value against which all non-matching facts are assigned to enable complete reporting. However, we must not apply 3NF to our dimensions and subset the contents into what is paradoxically called a “snowflake” schema, as doing so may significantly impact performance.

Given our knowledge of Snowflake micro-partitions, the potential for querying huge data volumes, and knowledge of the dimensions commonly used for lookups, we must consider appropriate cluster keys and materialized views for summarization, aggregation, and alternative lookup paths.
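For example, and purely as a sketch with assumed column names, we might cluster the fact table on its most commonly pruned date key and pre-aggregate agreement values per party in a materialized view (a Snowflake Enterprise Edition feature).
-- Cluster the fact table on the key most commonly used for partition pruning
ALTER TABLE fact_agreement CLUSTER BY (valid_from_date);

-- Pre-aggregate agreement values by party for frequent summary reporting
CREATE OR REPLACE MATERIALIZED VIEW mv_agreement_by_party
AS
SELECT party_id,
       COUNT(*)             AS agreement_count,
       SUM(agreement_value) AS total_agreement_value
FROM   fact_agreement
GROUP BY party_id;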

Adding attributes at a higher level of grain than currently supported is generally very easy; for example, adding a geographical region where a country exists as the new dimension attribute self-evidently rolls up. But what happens when the grain of the dimension is too high to support evolving business requirements? For example, what are the options if we need to decompose an attribute in an existing dimension to a finer grain of detail? Our first option is to restructure the original fact table to decompose existing facts into the same finer grain along with a re-created dimension table with extended attributes and increased surrogate key ranges. Our second option is to create a new fact table with the new data and dimension table. Neither option is easy to implement. The first option is fraught with danger, whereas the second option would be expected to reconcile with the original fact table.

Data Vault 2.0

Data vault modeling breaks down the relationship between elements of an entity and its primary key into hubs and satellites joined by link tables. Data vault is a distinct modeling technique that takes time to master. If done badly, expect a poor outcome.

Note

A data vault is not intended as an end-user query repository, and such usage must be avoided.

A raw data vault is designed to represent unchanged data as it was loaded from the source. We always represent data as supplied, regardless of how complete or imperfect it is on receipt, and our work up to this point in this book explains how to historize using the SCD2 pattern. A raw data vault represents a “system of record” and may be better expressed as connecting directly to staging tables, as the historization comes from the timestamp of the latest record and not through the SCD2 pattern implementing valid_to and valid_from timestamps.

For existing 3NF schemas with history built up, we should reflect that history in the data vault, using the valid_from timestamp as the timestamp of the latest record. The key point is that it’s not Data Vault 2.0 versus Kimball; it’s Data Vault 2.0 underpinning Kimball and avoiding the expensive rework of a “direct to Kimball” approach.

Now that you understand where data vault modeling fits, let’s address the distinct advantages that data vault modeling offers over 3NF and dimensional modeling.
  • Ability to bulk load data in parallel

  • Efficient loading and storing data from disparate sources

  • Flexibility to add new hubs, links, and satellites without disrupting the existing model

  • Track and audit data with ease

  • Scalable, adaptable, and consistent architecture

However, the key disadvantage of data vault modeling is the expansion in the number of objects used in the model, rendering a data vault unsuitable for end-user reporting, which must be avoided. The data vault methodology never recommends direct access to the data vault structures. The proposal is always to build virtual or materialized star schema data marts over the data vault for familiar and optimized reporting.

Flexibility comes with complexity, requiring a mindset shift from traditional data modeling. And data vault modeling is not a good candidate for single source systems and/or relatively static data.

Re-purposing our earlier 3NF data model into a data vault format depicted in Figure 13-4 illustrates how entities are expanded into hubs containing business keys, satellites with textual attributes, and a link table containing the relationship between hubs.

Figure 13-4

Sample data vault model (the party_contact link connects to the party satellite, party hub, and party_type satellite on one side and to the contact_type satellite, contact hub, and contact satellite on the other)

Success Factors

A data vault is not a silver bullet . Poor understanding and execution of core data vault concepts and principles result in poor implementation and general dissatisfaction. Data vaults have been well described in various media. This section seeks to inform basic principles rather than dive into technicalities; it is broken down into a series of headings explaining each principle.

Experience Counts

Without adherence to the data vault methodology, projects fail badly. Having at least one seasoned data vault practitioner on the team is highly recommended to ensure best practice is observed. While costly, this is a risk mitigation strategy. I have been on a failed data vault project. I speak from experience.

Business Keys, Not Surrogate Keys

Every practical 3NF system uses surrogate keys instead of natural keys to represent relationships, making both code cleaner and physical data models easier to read. In contrast, data vault modeling relies upon the correct identification of business keys, either single or composite.

Note

At all cost, avoid using sequences or surrogate keys as business keys; take time to identify proper business identifiers, and concatenate for uniqueness if necessary.

Using surrogate keys is highly likely to result in refactoring because a data vault relies upon hashing for keys. Where surrogate keys have been selected, and given that data vault modeling accepts data sets into a single entity from multiple sources that may use overlapping sequence numbers, a hash collision is almost certain to occur.

To avoid hash clashes (see the following), we use concatenated natural keys subject to the following rules.
  • Every column included before hashing must be delimited with an uncommon delimiter, such as a double pipe (||), between columns.

  • All null values are converted to blank before hashing. Null values break the hash. For surrogate hash keys, these are substituted with zero keys (‘-1’).

  • Column order must be consistent, and new columns added to the satellite table must be added to the end of the record hash.

  • Leading and trailing blanks of every column must be removed before hashing.

  • For surrogate hash keys only, business keys must be up-cased to ensure passive integration across source applications of business keys loading to a common hub. Alternate solutions may be applicable if variable business key treatments are needed.

  • All business keys must be cast to text. This is a standard treatment because business keys are not metrics. This is done before hashing.

  • We may optionally include a business key collision code (BKCC) to differentiate keys from disparate sources where overlap would cause hash clashes.
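Putting those rules together, and using the MD5 function referenced in the next section, the hash keys for the party example might be derived along these lines; the staging table and non-key attribute names (stg_party, registered_address, country_code) are hypothetical.
SELECT -- Business key hash: cast to text, trim, up-case, delimit, then hash
       MD5(
         COALESCE(TRIM(UPPER(party_name::VARCHAR)), '')      || '||' ||
         COALESCE(TRIM(UPPER(party_type_name::VARCHAR)), '')
       ) AS party_hash_key,
       -- Record hash (hashdiff) over the non-key attributes tracked for change
       MD5(
         COALESCE(TRIM(registered_address::VARCHAR), '')     || '||' ||
         COALESCE(TRIM(country_code::VARCHAR), '')
       ) AS hashdiff
FROM   stg_party;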

Hashing

Data vault makes extensive use of hashing, a one-way algorithm translating an input value into a fixed-length 128-bit value that can be stored in a 32-character field. Hashes are not used for cryptographic functions but simply to generate a manageable key. A far more comprehensive explanation of hashing is at https://en.wikipedia.org/wiki/Hash_function. For our purposes, we use the Snowflake functionality documented at https://docs.snowflake.com/en/sql-reference/functions/md5.html.

A hash is generated from all non-primary key values, concatenated with a fixed field separator, and stored in a field called hashdiff. Comparing the hashdiff for existing data with the hashdiff for new or modified records identifies whether any non-primary key value has changed. However, the comparison must only be against the most recent record and not the historical records; otherwise, a record reverted to an earlier value may not be detected because its hashdiff matches the original historical record and no delta is detected.
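A sketch of that comparison, assuming a staging table (stg_party) already carrying the calculated hashes and a sat_party satellite shaped as described later in this chapter.
INSERT INTO sat_party
(
   party_hash_key,
   load_timestamp,
   record_source,
   hashdiff,
   registered_address,
   country_code
)
SELECT s.party_hash_key,
       CURRENT_TIMESTAMP(),
       s.record_source,
       s.hashdiff,
       s.registered_address,
       s.country_code
FROM   stg_party s
LEFT OUTER JOIN
(
   SELECT party_hash_key,
          hashdiff,
          ROW_NUMBER() OVER (PARTITION BY party_hash_key
                             ORDER BY load_timestamp DESC) AS rn
   FROM   sat_party
) latest
ON     latest.party_hash_key = s.party_hash_key
AND    latest.rn             = 1                  -- compare against the most recent record only
WHERE  latest.party_hash_key IS NULL              -- brand-new business key
OR     latest.hashdiff      <> s.hashdiff;        -- non-key attributes changed since the latest record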

Hash Clash

Hash clash (also known as hash collision) occurs when two different input values resolve to the same hash value. With MD5, hash clash can become more likely when over 5.06Bn records exist in a table. Therefore, I suggest the probability is so small as to be insignificant; more information is at https://en.wikipedia.org/wiki/Hash_collision. If hash clash is perceived to be an issue, alternatives exist, such as SHA1/SHA2/SHA3, noting hash generation time is longer than for MD5. Various strategies can be implemented to test for hash collision, some of which are at https://dwa.guide/2017/10/20/2017-10-20-implementing-a-hash-key-collision-strategy/.

Rehashing

Recognizing data vault modeling as an insert-only model, there may be a temptation to “knife and fork” data without re-creating the associated hash key and propagating it to hubs and links, breaking the explicit relationship between business key and hash. We must exclude all manual updates from our systems and rely only upon correctly supplied data submissions to retain data fidelity. If data is wrong at the point of submission, our systems regard the data as valid for its full lifecycle. Correction by resubmission is the preferred option, as reporting may have relied upon the bad data.

Business Ontology

Business ontology provides a consistent representation of data and data relationships, offering a structured way to interpret corporate data as a consolidation of taxonomies. A taxonomy is a clearly defined hierarchical structure for categorizing information.

Note

Without a business ontology, we are destined to fail.

Expect Change

Data models change over time, and data vault modeling facilitates change by adopting a flexible architecture. Incorporating new features and capabilities in data vault modeling is relatively easy compared to 3NF and dimensional models.

Data Representation

Loading new data sets into an existing model is straightforward as data vault modeling expects each record to be identified back to the source by default. Typically our 3NF systems segregate by data source where each schema or data model corresponds to a single data source.

Data Vault Core Components

Noting earlier comments on identifying natural business keys rather than surrogate keys, let’s look at guiding principles for designing each major data vault table type using Figure 13-5 as an example.

Figure 13-5

Sample data vault pattern (the party_contact link in the center connects to the party_type satellite, party hub, and party satellite on one side and to the contact_type satellite, contact hub, and contact satellite on the other)

Hub

A hub is a central entity containing business keys enabling all related satellites and links to interconnect.

When creating hub entries, avoid using sequences or surrogate keys as business keys, take time to identify proper business identifiers, and concatenate for uniqueness if necessary.

Hub tables typically contain these attributes.
  • A primary key comprising the hash of the business key(s), preferably one or more textual identifiers separated by a fixed delimiter, not a sequence or surrogate key

  • Alternate keys from satellites, in our example: party_name_hash/party_name and party_type_hash/party_type_name

  • A load timestamp defaulted to current_timestamp()

  • The data source for the loaded data

A hub should have at least one satellite table.
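Putting those attributes together, a minimal hub sketch for the party example might look as follows; data types and lengths are illustrative, and the alternate key columns are abbreviated.
CREATE OR REPLACE TABLE hub_party
(
   party_hash_key  VARCHAR(32)   NOT NULL,                             -- MD5 of the business key(s)
   party_name      VARCHAR(255)  NOT NULL,                             -- business key retained for readability
   load_timestamp  TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP() NOT NULL,
   record_source   VARCHAR(255)  NOT NULL,
   CONSTRAINT pk_hub_party PRIMARY KEY (party_hash_key)                -- declared, not enforced
);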

Link

The intersection between the party hub and contact hub is the link table. A link table supports many-to-many relationships and typically contains these attributes.
  • A primary key comprising the hash of party_id and contact_id

  • A foreign hash key #1 in our example party_id

  • A foreign hash key #2 in our example contact_id

  • A load timestamp defaulted to current_timestamp()

  • The data source for the loaded data

A link implements many-to-many relationships and may also have satellites.
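A minimal link sketch following those attributes; again, data types are illustrative.
CREATE OR REPLACE TABLE link_party_contact
(
   party_contact_hash_key VARCHAR(32)   NOT NULL,   -- hash of the party and contact keys
   party_hash_key         VARCHAR(32)   NOT NULL,   -- foreign hash key to hub_party
   contact_hash_key       VARCHAR(32)   NOT NULL,   -- foreign hash key to hub_contact
   load_timestamp         TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP() NOT NULL,
   record_source          VARCHAR(255)  NOT NULL,
   CONSTRAINT pk_link_party_contact PRIMARY KEY (party_contact_hash_key)
);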

Satellite

Satellites contain the business context (i.e., non-key attributes of a single hub or link). For example, party and party_type are both satellites of the party hub. We might also separate volatile attributes (i.e., those which change frequently) from non-volatile attributes (i.e., those which are static or change infrequently) into separate satellite tables. Separating high and low-frequency data attributes can assist with the speed of data loading and significantly reduce the space that historical data consumes.

Normally, a new satellite table would be designed for each data source, and another common consideration is data classification. This approach also enables data to be split apart based on classification or sensitivity. This makes it easier to handle special security considerations by physically separating data elements.

A satellite can only link to a single hub with load_timestamp used to determine the most recent record.

Satellites always have these attributes.
  • A composite primary key consisting of the parent hub (or link) primary key and load timestamp

  • Non-key attributes

  • A load timestamp defaulted to current_timestamp()

  • The data source for the loaded data

  • The hash_diff key (hash of all attributes that we want to track for SCD2 changes)
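A minimal satellite sketch with those attributes; the non-key attributes (registered_address, country_code) are hypothetical examples.
CREATE OR REPLACE TABLE sat_party
(
   party_hash_key     VARCHAR(32)   NOT NULL,   -- parent hub primary key
   load_timestamp     TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP() NOT NULL,
   record_source      VARCHAR(255)  NOT NULL,
   hashdiff           VARCHAR(32)   NOT NULL,   -- hash of the tracked non-key attributes
   registered_address VARCHAR(255),             -- illustrative non-key attributes
   country_code       VARCHAR(10),
   CONSTRAINT pk_sat_party PRIMARY KEY (party_hash_key, load_timestamp)
);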

Other Data Vault Tables

Other data vault table types exist for reference data, bridge, and point-in-time tables. Largely, they may be used for edge cases, not core data vault delivery, and may be encountered occasionally.

Data Vault Summary

Data vault modeling must not be used for reporting but instead may be an intermediary step between the data source (or 3NF) and star schema, as Figure 13-6 shows.

Figure 13-6

Sample data model usage (a flow from the 3NF schema, through the data vault, to the star schema: the 3NF schema contains party_type, party, party_contact, and so on; the data vault contains the party_type satellite, party hub, party satellite, party_contact link, and so on; the star schema contains dim_party, dim_date, fact_agreement, dim_contact, and dim_agreement)

Summary

This chapter began with a simplified discussion of Third Normal Form (3NF) and pseudocode examples showing the benefits of using surrogate keys, particularly as real-world labels change and data models expand, change, or adapt over time. We also saw how 3NF models are relatively easy to change compared to star schemas. Our 3NF discussion concluded with denormalization, the term used to describe how we bring a 3NF data model back into business-usable reporting structures.

Moving on to dimensional modeling, we proposed how our 3NF model might be re-imagined as a dimensional model. I explained how dimensions are created and how attributes may be extended to include a more business-friendly context, while recognizing star schemas are difficult to change once implemented.

Finally, the advantages of implementing data vault modeling as an intermediate step between 3NF and star schema became apparent. Data vault modeling delivers successful business outcomes by enabling rapid dimensional model deployment with corresponding object tagging and embedded data ownership, facilitating later self-service.

I suggest these comparisons among the three modeling approaches discussed.
  • 3NF: Faster database writes, moderately flexible approach

  • Data vault modeling: Parallel writes, hybrid flexible approach

  • Dimensional modeling: Faster reads, rigid approach

Having discussed data modeling, let’s move on to the next chapter.
