Chapter 7
Tying Off a Few Data Vault Loose Ends
A quick glance at some more advanced topics
The basic building blocks for a Data Vault have a delightful elegance – there are just Hubs, Links, and Satellites. I love the pattern. But within this simplicity there are many variations. Perhaps one could compare Data Vault to Lego® bricks. The Lego I first encountered some years ago had a few limited sizes, and a few colors, and that was it, yet there was enormous potential to create all sorts of structures.
Data Vault, too, has several recognized ways of putting together the basic building blocks. A few are mentioned below, but there are more. I do encourage you to buy the books, attend the courses and conferences, and participate in forums. This Data Vault primer is most definitely not intended to replace any such formal knowledge acquisition. But the following topics may prove to be helpful in building your foundations for Data Vault success.
Hierarchical Links
In traditional data modeling, we encounter self-referencing relationships, where a Foreign Key points to the Primary Key of another row in the same table.
A simple example might be a table of Employees, with Employee Number as the Primary Key. Each Employee (other than the most senior “boss”) may identify their manager by using their manager’s Employee Number in a Foreign Key relationship. [Some might argue that the employment hierarchy should involve an Employee table and a Position table, but that’s a separate topic. For the sake of introducing Hierarchical Links in Data Vault, I am choosing to present a simpler model.]
Figure 100: Employee hierarchy - operational
Another example might be an Organization Unit table, where individual Organization Units reference another Organization Unit to which it “belongs”.
The Data Vault equivalent of these operational system self-referencing relationships is a Link known commonly as a “hierarchical” Link. In almost every aspect it really is just an ordinary Link, but with each instance in a hierarchical Link pointing to two Hub instances in the same Hub table, for example an Employee Hub instance for a subordinate and an Employee Hub instance for the related manager.
Figure 101: Employee Hierarchy – Data Vault
Same-As Links
One commonly quoted usage of a Same-As Link is to associate multiple customer instances in a Customer Hub that actually represent the same real-world customer. Maybe a few years ago Sam Smith dealt with the company, using Customer Number Cust123. Now, some years later, Sam turns up again and starts buying. The “correct” process may be to recognize Sam as the same person, and get them to use the old Customer Number, but the process breaks and Sam gets a new Customer Number Cust456.
The Data Vault ends up with history for purportedly two separate customers. Subsequently, a cleanup initiative involving the source systems spots the duplication, or maybe an operator identifies and records the duplication. We don’t want to rewrite history by merging data in the Data Vault. That breaks the trust in its audit trail. So what do we do? It’s simple. We create a Same-As Link that has two Foreign Key relationships, one to point to the “old” Customer Number, and one to point to the “new” Customer Number.
Figure 102: Customer Same-As Link
Reference data
Often our IT systems have what many call “reference data”. Examples could include a collection of country codes and names, a collection of order status codes and names, or a collection of employee gender codes and names. There are a variety of ways to manage reference data within a Data Vault. Let’s begin the discussion by looking at an example from a banking scenario. In the operational system, we’ve got Customers, and they are classified according to some Customer Type reference data. A simple reference table could look like this:
Figure 103: Customer Type - no history
We’ve already decided that the Customer from the operational system is to be loaded to a Data Vault Hub. But where do we load the Customer Type reference data?
One of the first questions to ask is whether or not the history of changes is to be kept. We could use the data structure from the above operational table structure directly in our Data Vault if we only want the current values for a given code. As an alternative, if we want our Data Vault to hold the history of slowly changing values in our reference table, the following table structure might do the trick. Each time a Customer Type Name or Description changes, we can add a new row holding a snapshot of the new values, while retaining the previous values, too.
Figure 104: Customer Type - slowly changing history
Maybe Country is another reference set for customers, recording their country of residence. In this scenario for the bank, let’s assume that Country is not just a simple code-&-description reference data set. Instead, it is its own business concept with a recognized business key, and its own rich set of attributes, including recording the currency for each country, a geopolitical classification, a perceived risk of lending to customers in that country, whether or not the bank’s head office recognizes political sanctions on trading with that country, and more. In such a case, maybe “country” could be seen as a Hub in its own right, especially if it has its own rich set of relationships with other Hubs.
The below diagram shows both scenarios. Country is modeled as a completely standard Data Vault Hub-&-Satellite structure, with a Link associating it to the Customer Hub. In contrast, the Customer Type Code attribute in the Customer’s Satellite holds just the code, and this can be joined to the identical code in the Customer Type table.
Figure 105: Reference data in a Data Vault
Two things are to be noted:
There’s another variation that is commonly practiced. The Customer Type table can be replaced by a Hub-&-Satellite pair. As shown in the diagram, it is not connected by a formal Data Vault Link (it’s now become an “unconnected” Hub) but it has the advantage of having all the standard Data Vault mechanisms for data loading and recording audit trails.
Deletions
The baseline functionality of a Data Vault is great for capturing inserts into source system databases. A new Fire Truck is created, and it goes into a Hub. A new relationship between Fire Truck and a Fire is created, and it gets loaded into a Link, and any attributes describing these new instances get loaded into appropriate Satellites.
The baseline functionality of a Data Vault is also great for capturing updates as they occur in source system databases. The changes are reflected by new rows in the affected Satellites.
But deletions? If a Fire Truck is deleted in an operational system, how does a Data Vault respond? A bare-bones, minimalistic Data Vault will do nothing, but don’t get concerned – there are features within Data Vault for recording such activities if we need them.
Before we go any further, let’s see if we can get some clarity about so-called “deletions”. Some of the variations that may be of interest to us are listed below.
Now let’s immediately get one misconception out of the way. If we know something has been deleted in a source, we don’t try to mimic that action in our Data Vault. Data Vault architecture is intended to be “insert only”. We insert new Hub and Link instances to represent new business objects and new business relationships or new transactions. We insert new Satellite rows to represent observed updates. But if a deletion occurs back in the source system, we don’t delete the history for those things from the Data Vault. What we can do is insert some data so that we can keep a record of the fact that a “deletion” has been occurred.
There’s one other foundational matter regarding “deletions” that needs to be communicated. Any perceived “deletion” is a perception told to the Data Vault from a nominated source. Independent of the type of deletion (logical, physical, no longer seen), each and every source system can inform us of their different perceptions, at different times. The Human Resources system can, for example, tell us that an employee has been logically deleted on Monday by setting a termination date for last Friday. The Payroll system might inform us a year later of the termination, after the lucky ex-employee has received a year’s pay without doing any work! Then the source system for managing building access to employees has a different view, as does the system for managing computer network logins.
How is this separation of perceptions managed in the Data Vault? It’s simple, really, and it’s handled by Satellites. The recommended practice for Satellites hanging off a Hub (or Link) is that each source has its own Satellite. If we’ve got four sources of Employee information, we can expect at least four Satellites to be hanging off the Employee Hub. Let’s look quickly at the three types of “deletions” noted in the bullet list above.
Logical deletions such as an employee’s termination date are recorded in data, along with all sorts of other attributes. The dates can be put in an ordinary Satellite along with the other attributes, or if we like, we can split the Satellite out to create a so-called “Effectivity Satellite”. Either way, it’s just data from a source system feed ending up in a regular Satellite.
Physical deletions can be identified by mechanisms such as Change Data Capture (CDC) on a source system’s relational table. If we physically delete a row for an Employee from a system responsible for recording building access permissions, that action may be captured by the database CDC mechanism. If we subsequently expose the CDC results to the Data Vault, we can record the precise moment in time when the deletion occurred. Data Vaults can use what is called a “Status Tracking Satellite” to capture not only physical deletes, but also physical inserts and updates. Some CDC mechanisms even offer the ability to capture the details for each time a row is read!
No-Longer-Seen observations also have a standard Data Vault mechanism, known as a “Record Tracking Satellite”.
Status Tracking Satellites, Record Tracking Satellites, and Effectivity Satellites all have different structures. Some people may prefer a single structure to record “deletions”. This can be achieved as follows:
If we make the Staging Area work a bit harder, as described above, we could present different types of “deletions” in one standard data structure. We could go one step further and put all these “deletion” tracking milestones into one physical Satellite, even if they come from different sources. If we actually implemented the Tracking Satellites this way, there would problems. Standard Satellite update mechanisms simply wouldn’t work with (1) multiple source data feeds populating the one Satellite, and (2) even for one source data feed, having the data presented in a Name/Value Pair construct. It is not what I am suggesting. (Instead, I recommend one Tracking Satellite per source, with appropriately named specific columns.) However, for the sake of simple presentation in the worked example that follows, the data is presented as if all new data is added to the single, shared Tracking Satellite. Let’s look at an example.
There are only perhaps a dozen single-row data feeds, so the scenario isn’t too vast, but it has some interesting and subtle twists and turns. I encourage you to carefully follow the story as it unfolds. The details are a bit technical. If you are happy with the standard Data Vault mechanisms of Status Tracking Satellites, Record Tracking Satellites, and Effectivity Satellites, please read up on them in Dan and Michael’s book.38 However, if you have struggled with these basic building blocks and are looking for another way, there’s an alternative presented below. Either way, the following scenario will still be helpful to understand how these things fit together, even though the data structure presented below is the simplified one used to hold all types of deletions in a single form.
Also note that while the focus of this section is on capturing “deletions”, we could use the same mechanism to capture “insertions” by source system (as compared to the single-source “first seen” Data Vault dates and times). This can assist in constructing a whole-of-life view for nominated objects (Hubs or Links) in the Data Vault. Like with “deletions”, these “insertions” can capture logical insertions (via an Effective Start Date attribute), physical insertions (by using CDC records), and First-Seen insertions based on comparison of consecutive full batch extracts as seen in the Staging Area.
The worked example starts with a person successfully being accepted for a 9-month engagement. Some hard-working individual in the Human Resources (HR) department comes in on New Years’ Day (Monday January 1st), and loads the details for the expected new employee into their HR system. Employee Number 111 is assigned, and an intended start date of Monday January 8th is entered.
The Data Vault Satellite to capture this new employee’s dates is shown below, followed by an explanation of how the data got assembled. For the sake of brevity, date-&-time stamps are abbreviated to just month and day.
Table 19: Tracking Satellite after HR load on January 2nd
A batch job fires up a bit after midnight each night to take a full snapshot extract from the HR system. The Staging Area persists a copy of “active last time” Employee Numbers. Its pre-processing function for this batch noted that Employee 111 wasn’t in the last extract, but is now. It generated a row of data for loading to the Data Vault with a Status Code of “First Seen Date”, and its own Staging Area run date-&-time to declare that Employee 111 was first seen by the Staging Area at 1:00am on Tuesday January 2nd.
The intended Start Date for Employee 111 is just a piece of data that can be mapped to a Satellite. The Staging Area’s pre-processing simply assigns a Status Code of “Effective Start Date”. No End Date is nominated for the Employee, so the logic simply ignores a NULL value and does nothing in the way of generating an “Effective End Date” row.
After loading a new instance in the Employee Hub (a given), we also added rows to our special shared-records Satellite.
Now we swing our attention to the source system that issues passes to permit access to buildings. Instead of using a batch job for a full extract, this system runs on a database with the ability to generated Change Data Capture rows. The HR system had created a record for the new employee a week ahead of the expected start date. The person responsible for creating new building access passes normally waits until a new employee actually turns up on their first day. The source system then, by default, issues cards for six months from the actual date of the card’s creation. The Effective From and Effective To dates are generated and stored.
However, the person who issues cards is taking Monday off, so instead creates the access card ahead of time, on the Friday, and forgets to override the Effective From date, giving the impression that the new employee actually commenced on Friday the 5th. It’s a data quality mistake, but one that the Data Vault will faithfully record. On Friday January 5th, a new row is created in the Building Access system, and a CDC “insert” row is automatically generated by the CDC mechanism. Again we look at the Data Vault results, and then discuss them in more detail.
Table 20: Tracking Satellite after Building Access load on January 5th
The Staging Area pre-processing function generates a row of data for loading to the Data Vault with a Status Code of “CDC Physical Insert”, and the precise date and time, to a fraction of a second (not shown), when the physical insert actually took place in the source system. The Staging Area pre-processing also generates rows for the Effective From and Effective To dates.
Most employees like to get paid, but that doesn’t usually happen on day 1 of their employment. While company policy is to get the HR records loaded before the new employee arrives, and to create the building access card on their arrival, creation of the Payroll records can wait a bit. On Tuesday January 9th, the Payroll records are created, and exposed to the Staging Area early the next morning, Wednesday January 10th. Just like the HR system, a full extract is provided as a batch job. Like the HR system, the “First Seen Date” is generated, as is the “Effective Start Date”, matching that of the HR system. The Payroll worker also provided an “Effective To Date” of Monday October 8th to reflect the 9-month employment contract.
Table 21: Tracking Satellite after Payroll load on January 10th
Now the drama begins. Things haven’t worked out as expected in the relationship between the organization and the worker. The planned 9-month contract is being cut short. On Friday February 2nd, notice is given of premature termination, with an end date in a few weeks (February 23rd). This information is loaded into the HR system, which in turn is loaded to the Data Vault on Monday February 5th.
Table 22: Tracking Satellite after HR load on February 5th
In the turmoil of the unexpected departure, nobody thought to tell Payroll what was happening. The employee may no longer be active in the HR system, but is very much active in the Payroll system! They have continued to pay the ex-employee. It’s now July, in the new financial year.
The logic in the “full” batch extract of HR records isn’t actually a full extract. The extraction code says that if we’re in a new financial year, and an ex-employee was terminated in the previous financial year, exclude them from the extract. This is a beautiful example of one of the differences between a physical delete and “no longer seen”. As of the first working day in the new financial year (Monday July 2nd), Employee 111 is still in the HR system, but not in the extract file. When the file hits the Staging Areas and the business key comparison is performed, 111 is “no longer seen”, with the implication it may have been deleted.
Table 23: Tracking Satellite after HR load in July
A Data Vault is capable of executing business rules. Let’s assume that one very valuable rule is to generate an alert when a person has disappeared from view in the HR system but appears to be still active in the Payroll system. It is discovered on Tuesday 3rd July that in fact the ex-employee has continued to be paid since termination in February. Oops! Amongst other things, their Termination Date in the Payroll system is set after-the-fact back to February 23rd, and this information is loaded into the Data Vault early the following morning.
Table 24: Tracking Satellite after Payroll load in July
Another belated action is taken, this time by the Building Access system. The record for Employee 111 is physically deleted on Friday July 6th. They could have first updated the End Date, but if the record is being deleted, who cares.
Table 25: Tracking Satellite after Building Access load in July
If the scenario above seems a bit complicated, sadly it is a direct reflection of the complexity that sometimes occurs in the real world. So here’s a bit of good news. When the real-world systems disagree as to the “facts”, the Data Vault is well positioned to contribute to the resolution of data quality problems observed in source systems.
There’s another bit of good news. As noted earlier, the mechanisms put in place for tracking “deletions” can also track “insertions”, contributing to the building up of a picture of entity life cycle, as seen by multiple source systems with sometimes conflicting perspectives. This feature hasn’t been shown in the example, but it would be easy to add.
And the last bit of good news is that the three patterns described in this section (logical inserts/deletes, physical inserts/deletes, and first-seen/no-longer-seen), while requiring three separate bits of code, are a pattern for reuse across all Hub and Links, not just Employees as highlighted in this scenario.
Point-In-Time (PIT) and Bridge tables
Dan Linstedt has commented that Point-In-Time (PIT) tables and Bridge tables are “… the two most misunderstood, and misused, and misapplied modeling techniques in the Data Vault landscape today.”39 He elsewhere also stresses the value of these artifacts, especially to assist with simplicity and response time performance of queries for downstream users. So they’re important, but what do they look like?
Let’s start with PIT tables. I’ve come across people who really struggle to understand or explain them, but the fundamentals really aren’t that hard to grasp.
We’ll paint a hypothetical picture. Chris is an employee of Acme. The Data Vault has an Employee Hub, with two of the source systems being the Payroll system and the Human Resources (HR) system. The Hub has Satellites not only split by source system, but also by general topic and/or expected rate of change for the attribute values. A snippet of the Satellites and their values over time for Chris follows.
The first Satellite is from the Payroll system, and holds “Personal” slowly changing attributes such as Date-Of-Birth, Gender, Country-Of-Birth, and Name. Date-Of-Birth may never change unless a data entry mistake was made. Gender may change, but the Data Vault modeler does not expected this to be a frequent change. Country-Of-Birth may possibly change if the country itself changes name, but again this won’t be frequent. Finally, the Name attribute is expected to have occasional but infrequent changes for some employees. For the sake of simplicity, it is assumed that the only attribute in this set that changes for Chris is the Name, and this is the only attribute shown in the sample for the Payroll-sourced “Personal” Satellite.
Figure 106: Point-In-Time (PIT) sample data
The other Payroll Satellite holds many payroll “Rate” attributes such as frequency of pay (weekly or fortnightly), bank account details, grade and weekly pay rate. Chris, and all of the employees, hope that the pay rate is a frequently changing attribute! Again, only a subset of these attributes are shown in the table above.
Similarly, the Human Resource system’s data feed is split, this time into three Satellites, with some of the attributes shown.
Let’s walk through the timeline.
Now we get downstream users who want to submit queries.
Someone wants to compare the situation as at end of June this year (2009), for all employees. But focusing just on Chris, what is Chris’ Name (according to both systems), Grade, Weekly Pay, Education, Job Title, Home phone and Mobile phone as at that date?
Figure 107: Point-In-Time = end June 2009
When we look at the point-in-time line through the Satellite values, we can easily see the answer. But the code for finding the active Satellites at a given point-in-time can be a little trickier. So what if we construct that code, execute it, and store the results? We could hold all of the actual attribute values (Name = C. Jones, Grade = 3 …), but to avoid a bloated data set, a baseline Point-In-Time table instead holds the set of foreign keys that point to the Satellites that were active at the point-in-time.
That’s one example, based on a query for end of June, 2009. Maybe we might want to keep PIT rows for the end of every financial year, or for every month, or for every time any values changes in any of the Satellites. That’s fine. We can do that, too.
That’s the good news, but there’s a caution. The “point-in-time” relates to timestamps in the Data Vault recording when the data was loaded. These are technical dates, and shouldn’t be used for constructing a business view of when things happened. For that perspective, we need to look at dates in the Satellites’ data.
Now let’s talk about Bridge tables.
Earlier in our fire truck example, we saw a pair of Hubs, and one associated Link. It is copied below, with a “Bridge” table added.
In its simplest form, a Bridge table holds a set of foreign keys pointing to related Hubs and their Links, plus a date that records when the associations were applicable. The example shows two Hubs and one Link; the scope of a Bridge table can be much wider, involving lots of Hubs and Links.
Like for Point-In-Time tables, they exist primarily to improve performance. The heavy lifting for complex joins is resolved in a set of keys, ready to be used to support end-user queries.
There’s a bit more to PIT and Bridge tables, but for the Data Vault modeler, it might be enough to have a broad understanding of what they are and where they fit. If you want to read more, Chapter 6 in Dan Linstedt and Michael Olschimke’s book40 goes into greater (and more technical) detail.
Using a Data Vault to Shine Some Light on the “Process” World
Gap analysis using a Data Vault
We’ve already recognized the role of processes to assist in understanding, challenging, and extending the enterprise model. That’s an example of a process view of the world making a contribution to the Data Vault world. But can a Data Vault pay back the favor by making a contribution to the “process” world?
The answer is a resounding, “Yes”, and one way it does this is by giving us a measured and reliable view into the process world. Dan says, “Data Vault models are representative of business processes and are tied to the business through the business keys.”41 So how does this work, and how can we leverage Data Vault to deliver value?
I did some work at a water utility. At first glance, the business was so very simple. One way we could look at it was that they had two products. First, they sold clean water for drinking, cooking, and bathing. Second, they charged, in one way or another, for taking away less-than-clean water, including after people pressed the flush button. But they did have some problems they wanted help with, and water meters were at the heart of their issues.
Sometimes data people and process people don’t see eye to eye. Tommy was different. He’s a really nice bloke, and he’s great at his process stuff. Importantly, he understands “data” as well, and welcomed the opportunity for the two of us to work together. He asked if the Data Vault could provide some metrics to help track actual progress through a number of business processes.
Data Vaults are great at seeing when business keys (in Hubs) are created, when the relationships between them (in Links) are created, and when the Satellites that hold data about them reflect an update. They also have mechanisms for recording when things are deleted (logically or physically) in a source database. Let’s take an example of how we might use a Data Vault to inspect some business processes.
Figure 109: Data Vault objects for a water utility
We start by looking at the Data Vault model. We’ve got a couple of Hubs (shown as boxes):
I’ve got lines representing relationships between the Hubs, and these would be implemented as Links in the Data Vault.
One is for a connection between major and minor water meters as might occur in a block of flats. The major meter measures total flow for the overall property, and the minor meters measure flow for individual houses. The difference between the sum of the minor meters and the total flow for the major meter records “property” use, perhaps for watering of the common garden.
The other relationship, shown as a dotted line between the Meter and the Remote Read Device (RRD), is for the relationship between a water meter and the remote read device. Please refer to the notes in “Hubs with more than one business key“ for further details of this relationship.
Now let’s look at a selected subset of business processes related to meters.
Figure 110: Data Vault and water utility processes
The sunny-day picture goes something like this.
The water utility holds a number of spare meters, and spare remote read devices, in stock. When the levels start to get low, the Order Placement process kicks in. The thing that’s unusual about this process is that the water utility orders ahead of time what the meter numbers are to be for the yet-to-be-manufactured meters. Most manufacturers deliver products with their own serial numbers assigned; here, the manufacturer delivers products with the customer’s serial numbers physically stamped on the meters. So the water utility’s internal records in the operational system (and in the Meter Hub in the Data Vault) can record meter instances before they are even manufactured!
If all goes according to plan, two weeks later the Order Fulfilment process records the delivery of the ordered meters and remote read devices, and they go into the warehouse. Almost immediately, the Asset Management system updates its records. This IT system is responsible for all corporate assets (company cars, laptop computers, … and yes, water meters and remote read devices). Even though it uses its own Asset Number business key for all assets, it also records the Meter Number for water meters and Remote Read Device Number for the RRDs. That’s taken care of the incoming assets. Now we engage with water utility customers. An order comes in for a Meter Installation. After the Meter Installation process is completed, additional details about the meter (in Satellites) and additional Links (for example, to the Customer) can be added.
If the meter is a simple, single meter for a residential customer, there is no need for a Customer Audit. However, if it’s for a large block of units, a factory, or a hospital, where there are multiple meters involved in the installation, an audit will be performed, amongst other things to make sure that the records are correct as to what meter number is connected where. Now the customer has connections, the schedule for meter readers is updated to include the new meters, and in due course, either Physical Reads or Remote Reads are performed.
That’s the sunny day scenario. But what might happen in reality? Orders can get placed, but the delivery can be way beyond the expected two weeks. Deliveries can be made, but the Asset Management system fails to “almost immediately” record the new assets (or maybe it never gets around to recording the new arrivals). Meters can be installed, but in situations where an audit is mandated, the audit process slips through the cracks. And though it may sound impossible, a customer can be connected, but the bills for usage never arrive because the meter is never read.
So how might a Data Vault help? In a blog,42 Dan Linstedt shares the following glimpse of hope: “The original Data Vault model provides pure business value by demonstrating the gaps between the business perception (of the way they THINK their business is running) and the reality (the way the data is actually captured and processed through business processes and systems).”
Dan and Michael go into more detail in their book, in the section titled “The value of bad data”.43 My attempt at summarizing their position goes like this:
A Data Vault can be used to track what’s happening in source systems, facilitate comparison around business keys, and offer the results for downstream analysis. For the water utility, some simple examples follow:
They may be examples of pretty obvious sorts of analytics we could perform. A bit of good old fashioned detective work by Tommy found some more anomalies.
My first story relates to changes in subdivision plans. Maybe a property developer originally intended to build five houses on a large block of land, and to assist the builders, five water meters and taps were installed. Subsequently, the developer decided to build four larger houses instead of five crammed shoe-box dwellings. Permission was given for the changes, and the titles office and the local shire council knew about the new plans, but the water utility was not informed. We’ve got five meters now on four blocks, and one is never going to be connected.
Now the twist that will make you smile. The five meters get scheduled for reading. In due course, readings are taken, and of course, the one that’s not connected has a reading of zero usage. There’s another process that kicks in at this time. Sometimes meters get stuck and won’t turn. Water is still being delivered, but not billed, so there’s a targeted process to replace faulty or blocked meters with a brand new ones. The field worker does his or her job, and replaces the zero-read meter, but no one wakes up to the fact that it’s not turning because it’s not connected. And in due course, the new meter that also records zero usage is also replaced. And the new one also reports a zero read …!
To understand “Process”, we need more than a Data Vault
For us to deliver the full potential of performing analysis on processes, I suggest we need at least three things. The first is to see raw data in all of its glory. Some other approaches to data warehousing aim to clean up the data on the way in. In contrast, I’ve heard Dan talk about capturing “all the data, all the time”. Warts and all, the whole philosophy behind loading source-centric raw data into the Data Vault is to not hide perceived anomalies from view. That’s a tick for the first of my three requirements. Next, we want to perform analytics by source. The bare minimum of Data Vault capture of dates does provide some insight, but it might not be enough. For example, the Meter Hub will record the source of the first data feed to present Meter Number 123 to the Data Vault, and the date it was loaded, but it doesn’t record what other sources have also had visibility of Meter 123, or when. More advanced aspects of Data Vault, such as Status Tracking Satellites or Record Tracking Satellites, can provide insight into business keys, and their creation dates and even “deletion” dates, by source. This is what I wanted.
Lastly, we need to recognize that IT systems, the data they capture, and this data that then gets moved to a Data Vault, is only part of the picture. Yes, it’s important that a Data Vault is capable of faithfully capturing what’s being entered into the source systems. But things can happen in the business that are not even visible in the source systems. People do work-arounds. They capture stuff in spreadsheets. They have totally manual processes that breach the rules. This is the reality.
Dan and Michael are very precise when we look closely at what they say on this topic. They claim (emphasis mine) that, “… source systems provide the truth about the data being collected and what is really happening from a data perspective”.44 In this quote, Dan and Michael are not claiming that the source systems represent the truth, the whole truth, and nothing but the truth for processes that occur outside the view of the source systems.
I remember doing some data quality work for one of Australia’s largest telecommunications companies. The lead consultant, Bill, had a PhD in mathematics. He held the view that we could get some insight by looking at the data in IT systems, and comparing them. If two systems disagreed on a person’s date of birth, he concluded that at least one was wrong, but maybe both were wrong. The way to check was to go back to the real world. I remember a comparison across four systems, three of which agreed. One person at the company concluded that the three must be correct, and the fourth wrong. Bill challenged this assumption, and went on to prove that three were wrong, and one right. Bill had a sense of humor. He got a coffee cup made that portrayed him being at the company’s reception desk, with the receptionist saying something like, “Bill, you can’t be you – three out of four of our systems record the fact that you’re dead!”
Two more stories from my time working with Bill. The first was where a data analysis showed a spike in errors for the IT systems’ data from one phone exchange. Bill dug a bit deeper, and uncovered the fact that the staff at that particular exchange never updated the central records because they had all of the information they needed on a spreadsheet. Then there was the time when Bill’s dogged pursuit of the truth uncovered an illegal hard-wired patch between two telephone exchanges.
A number of us have seen less dramatic examples of where the IT systems don’t capture the facts. Maybe a “standard” process doesn’t work in fringe cases, so people “break the rules” to get the job done. A recent client is involved in forensic analysis related to deaths that the coroner wants investigated. I won’t confront you with the sad background, but sufficient to say that the central IT system simply can’t cope with some of the tragic real-world scenarios. To do their job, workarounds are absolutely essential, and totally ethical. But the key message is that if we were to load the data from operational systems into a Data Vault and then analyze it in isolation from the “real world”, we could come to some dangerously wrong conclusions. To try to protect the community by learning from the past, we must have solid analysis. In this forensics setting, lives depend on it.
What’s the key message from all of this? That while the Data Vault can facilitate data analytics by triggering lines of investigation, or providing detailed insight as the analyst drills down, we need to have a healthy skepticism and challenge what the data is telling us. A capable human being is required to drive what Dan and Michael call “gap analysis”. Don’t underestimate the value that the Data Vault can add to the equation in providing an integrated view, and dates that challenge how processes are really working, but I do encourage you to dig deeper. Bill used to require his team to “ride the trucks”, even if that meant crawling around a dirty, dusty phone exchange to understand what was really happening.
A bit on projects
Agility
Data Vault has the potential to be “agile”, but agility doesn’t just happen because someone thinks it’s a good idea. There’s a lot of stuff that needs to be in place for agile to succeed. Just some of the things we might need for any agile software development project include:
Much of the list above applies to any agile project, whether it is for a Data Vault build or not.
Platform & tooling projects versus the Data Vault project
There are two threats to Data Vault projects I’ve observed that relate to muddying the waters on project scope.
The first threat is confusing the project to build required infrastructure with the project to build a Data Vault. Maybe we want Hadoop as part of our architecture? Fine. And we need a project to get it in place in our organization? Fine again. And we want to fund and manage the Hadoop implementation as part of the Data Vault project? That’s where I see red lights and hear warning bells clanging.
No matter what the new software components might be (Hadoop as described above, or real-time mechanisms, or service-oriented architectures, or …), if they’re part of our strategy, identify the interdependences, but consider funding and managing them separately.
There’s an option we might want to consider: perhaps we can at least start the Data Vault project on an interim platform, deliver tangible value to the business, and accept we may have some technical debt to pay back when the other projects catch up with us. After all, much of the metadata we need (including for the design of Hubs, Links, and Satellites, and source-to-target mapping) should turn out to be relatively stable even if the underlying technology changes.
The second threat is similar, but closer to home for the Data Vault project, and it’s about confusing the Data Vault delivery project with the tool evaluation / acquisition / build project. Sooner or later (probably sooner), we will want to use a tool for moving the data from layer to layer, doing extract/transform/load, though not necessarily in that order!
We may choose to buy a tool, and that strategy has lots of merits. I encourage you to check out what’s on the market. Evaluate the candidates. Do we need to specifically model columns such as Source and Load Date-&-Time stamps, or is the tool smart enough to know these are part of the Data Vault package? Can we define, manage, and execute business rules? Do they integrate with our testing tools and DevOps strategies, or maybe they even include them? Lots of questions. And if we can find a tool that meets our requirements, or is at least “sufficient”, that’s great.
Alternatively, we might decide to build our own tool, or at least hard-code some bits while we do a Data Vault proof-of-concept.
I consulted to one organization that was very altruistic. It wanted to gift its entire Data Vault (less the data) to others in the same industry, and not have them need to get a license for the Data Vault toolset. They chose to develop the tools in-house so they could give them away. I take my hat off for their generosity. But I simply want to warn that building you own toolset does come at a price. And like with establishing other software infrastructure, I recommend that the build of a tool be separately funded and managed.
A few hints
OK, we’ve got everything in place. We’ve got the team, they’re trained, we’ve got the complementary software, a tool, and we’re ready to go. I’ve got a few final hints for us to consider as we build our Data Vault:
There’s one phrase that appeared in all of the above bullet points. It’s “end-to-end”. Why do I see this as being important? I’ve unfortunately seen tool vendors, and Data Vault consultants, who impress with how quickly they can load data into the Data Vault, but fail to give enough focus on actually delivering against business expectations. The good will that might be won by demonstrating a speedy load will sooner or later evaporate unless business value pops out the other end of the sausage machine.
One more suggestion. In the earlier section titled “Is “agile evolution” an option?“, I shared a confronting story. The lack of a data model hampered an agile project. The team took one iteration to assemble a relatively stable data model. Subsequent iterations resulted in a 12-fold increase in productivity.
My conclusion? For a Data Vault project, I suggest that the data modeler(s) on the team try to think about what’s coming up, at least one sprint ahead of the development.
A few controversies
There are a number of well-recognized voices in the Data Vault community in addition to that of Dan Linstedt, the founder of Data Vault. I’ve heard a joke that goes something like this: “If you’ve got three data modelers in a room, you’ll get at least four opinions.” Oops.
I take heart from the PhD of Graeme Simsion on a related matter, as adapted for publication in his book, “Data Modeling Theory and Practice”.45 I can’t do an entire book justice in a few sentences, but here’s my take-away from his work. If we’ve got a bunch of data modelers, and each one thinks there can only be one “correct” solution for a given problem, don’t be surprised if there is heated disagreement. Conversely, if all accept that several variations may actually “work”, a more balanced consideration of alternatives and their relative merits is possible.
I think it is healthy if there is room for respectful discussion on alternatives to Data Vault modeling standards. For me, my baseline is the Data Vault 2.0 standard, but I welcome objective consideration of variations, especially as they may suit specific circumstances. Dan Linstedt also welcomes the standards being challenged. He just, quite rightly, warns against changes without due consideration as to unwelcome side effects.
Modeling a Link or a Hub?
We may encounter debates as to when something could or should be modeled as a Hub versus modeling it as a Link. I know this is contentious in some quarters. Part of me would prefer to not even raise the topic, as whatever position I take, I suspect someone will disagree. Nonetheless, avoidance of the issue won’t make it go away, so I will stick my neck out, hopefully in a manner that is respectful of the views of others.
I would like to share two scenarios that might help shed some light on this topic.
In the first scenario, there is an accident on a highway. Cars are damaged, drivers are upset, insurance companies may be faced with claims, but thankfully no one has died. From the perspective of one of my clients, there may be a police report with “links” to cars and their registration numbers, drivers and their driver’s license numbers, health practitioners who may assist, police force members identified by their member number, and so on. The “key” message (pun intended) is that there is no business key for the accident itself. However, in our Data Vault, we can create a Link (and maybe a Satellite – see the next “controversy” for comments on Links with Satellites) to associate the accident with the cars, drivers, health practitioners, police members, and more.
In the second scenario, a fatality occurs and forensic examination of the deceased is required. A “Case Number” business key is assigned. In the Data Vault, there is a Case Hub. We can create an instance in the Case Hub, populate its Satellite, and also populate Links to the cars, drivers, health practitioners, police members, and so on.
Which model is correct, or are both models OK? Here are my starting-point rule-of-thumb guidelines for choosing Hubs versus Links, at least as they relate to business views of the world:
There we are. Some nice, clean guidelines.
But what about an event or transaction that also has a clearly recognized business key? It seems to span my two definitions; it’s an event so it’s a Link, but it has an independent business key so it’s a Hub! For example, maybe an engineering workshop has “transactions” that are work orders, with a Work Order Number.
If the work orders are “recognized by the business as a tangible concept, with an identifiable business key” (such as a Work Order Number given to the customer so he/she can enquire on progress), then to me, the work orders look like they belong to a Hub. (If something walks like a duck, quacks like a duck, and swims like a duck, maybe it is a duck.) But if the transaction didn’t have a Work Order Number and was identified by nothing more than the collection of business keys in related Hubs, the work orders look like they belong to a Link.
Even if my guidelines are helpful, please recognize that, as noted by Graeme Simsion’s research, there might still be times when experienced modelers may have to agree to disagree, take a position, and see how it works.
Roelant Vos also adds a valuable perspective. We can start by modeling at a higher level of abstraction, capturing how the business sees their world. From this position, if we have flexible tools, we may be able to generate variations of the physical Data Vault tables while having a single, agreed logical view. That approach may actually defuse some of the tension in the following topics. Thanks, Roelant!
Modeling a Satellite on a Link?
I’ve got another controversy to share at this point. Some suggest that a Link should never have a Satellite, apart from maybe an “Effectivity Satellite” that has a From Date and a To Date to record the period the relationship was active. Those who hold this view seem to suggest that if a “thing” has attributes, it has to be a Hub.
Conversely, others happily model Satellites against Links.
For a moment, I want to step back to traditional data modeling, then to the Unified Modeling Language (UML) to see if we can get some clues to help us decide which way to go.
In a traditional data model, we have relationships, and these somewhat approximate a point-in-time equivalent of a Data Vault Link. If an operational system has a one-to-one or one-to-many relationship, the relationship itself is represented by the presence of a Foreign Key in one entity. This type of relationship is unlikely to have associated attributes, and if it does, they will be mixed in with other attributes in the entity holding the Foreign Key. If data from this operational system maps to a Data Vault Link, there is probably no data to even put in a Satellite off the Link.
Conversely, if the relationship is a many-to-many relationship, modeled as a resolution entity, it can have attributes in addition to the Foreign Keys that point to the participating “parent” entities. When we map this resolution entity into a Data Vault, I think most if not all people are happy that the additional attributes go into a Satellite. But is it a Satellite hanging off a Link, or a Satellite hanging off a Hub?
Before we make a decision, let’s quickly look at the UML Class Diagram notation – a modeling style with some correlation to a data modeler’s Entity-Relationship Diagram. A UML class approximates an ERD entity, and a UML association approximates an ERD relationship. Now if the association has its own attributes, a new type of object appears – an “association class”. It’s an association (a relationship) that can have its own attributes in a dedicated class (entity). It looks something like a data modeler’s resolution entity, but can hold attributes for an association (relationship) even if the cardinality is not many-to-many.
So where does all of this lead us? One answer is that there are many answers! Dan Linstedt’s Data Vault 2.0 standard of allowing Satellites to hang off a Link looks like it’s got good company with data modeling resolution entities, and with UML association classes. Others may argue that the UML association class gives us a hint that the moment an association / relationship has attributes, it needs a class, and does that hint at a Hub in the Data Vault world?
OK, I will share my opinion; for me, I like to start by considering whether a “thing” looks like a Hub or a Link (see the section immediately above), independent of whether it has its own attributes or not. And if it turns out to look like a good Link, and it has attributes, I am happy to create a Satellite on the Link. I think this approach is clean, simple, consistent, and it aligns with much of what I observe in both data modeling and class modeling (UML) practices.
Having said that, you may reach a different conclusion, and hopefully we can still be friends!
Normalization for Links, and Links-on-Links
We may encounter guidelines on Links that encourage the structure to be “de-normalized”. A purist data modeler may look at the examples, and suggest that in fact the structures are being “normalized”.
In some ways, it’s simply not worth the debate. Very few can quote precise definitions for first, second and third normal form, let alone Boyce-Codd normal form, or fourth or fifth normal form. And I suspect few care about such abstract terminology anyhow. It is arguable that a working knowledge of these levels of normalization is more important than their definitions. In another way, precision as to whether something is normalized or not does matter. But before we get to that, let me note that the examples I’ve seen on Data Vault modeling sometimes touch on the more advanced levels of normalization, such as 5th Normal Form (5NF). That may sound scary to some, so let me please share a very simple example from a cut-down model for the airline industry. The examples given in the text below the diagram are not meant to be necessarily correct, but just indicative of the types of information that could be held.
Figure 111: Example of normalization for Links
The large circles represent core business concepts for airlines. They could be relational tables in an operational source system, or Hubs in a Data Vault.
Forming an outer ring are three boxes with arrows to the core entities. They could be many-to-many resolution entities in the operational system’s relational database, or Links in a Data Vault.
Each of the above three tables relates to just two “parent” tables. They are normalized. If we tried to have a single three-way table that held the Cartesian product of all combinations from the three separate tables, it would not only cease to be normalized, but it may hold combinations that are not valid. For example:
But a Cartesian product of all of these data sets may be misleading. Perhaps Qantas chooses to only operate its A380s between Melbourne and Los Angeles (not its 767s). Maybe it operates 767s between Melbourne and Heathrow (not its A380s). It is misleading to create a 3-way Link between the three Hubs, based on possible combinations; retaining the set of three 2-way Links is actually holding the relationships in a normalized form.
Now we turn the example on its head.
The single 3-way Operations table/Link holds triplets of codes from the Airport, Aircraft Type, and Airline entities, defining which Airlines actually operate which Aircraft Types out of which Airports. One of the examples above was Qantas operating its A380s into and out of Los Angeles.
It was dangerous to assume we could take all combinations of the three 2-way Links to generate a single 3-way Link to represent which airline flew what aircraft types in and out of what airports. Likewise, we can’t simply take the single 3-way Link and assume it represents all combinations for the other 2-way Links. Maybe no airlines actually operate out of Roxby Down with a Cessna 172, but the Airport is capable of handling them. This time, retaining the single 3-way Link is actually holding the relationships in a normalized form!
We could have included another Link on the diagram, showing Flights. It would be a 4-way Link, involving the Airport table/Hub twice (From, and To), plus the Airline and the Aircraft Type. It too would be normalized for this purpose. So what are the take-away messages?
Closing comments on controversies
I’ve seen divisive debates on alternative approaches. My own opinion is that while the differences are important, there is much in common. Let’s leverage off what is truly common, and respectfully and objectively evaluate the differences.
More uses for top-down models
Remember that Dan Linstedt said that if we have an enterprise ontology, we should use it? That’s great if we have one. But what if we don’t? The section of this book titled “Task #1: Form the Enterprise View” was written to help people who didn’t already have an enterprise data model to create it quickly, without comprising the quality.
But can the same enterprise data model be used for other purposes? The good news is there may be many initiatives within an organization where an enterprise data model can potentially make a contribution. Some of these are listed below.
… and last but not least, we return to the catalyst for this book – using the enterprise data model to shape a Data Vault design.
So an enterprise data model can be used for many purposes. But what happens when it is used for more than one of these reasons, in a single organization?
Even if our motivation for pursuing top-down big-picture enterprise modeling had been for only one of the above reasons, that’s enough. But if the model is used across two or more such initiatives, the multiplier effect kicks in as the investment in one area contributes to another related area. For example, if several agile in-house development projects also produce source data that will be required in a Data Vault, why not base all such projects on a common data architecture? Everybody wins.
In conclusion
Australia has some cities that grew somewhat organically, and their origin is reflected in some chaotic aspects! Australia also has some major cities that were planned from the outset, where the designers articulated a vision for the end-state, and the authorities defined a clear path to get there.
Some Data Vault practitioners appear to follow a simplistic bottom-up raw source-centric Data Vault design and hope the business-centric Data Vault solution will magically appear. Like the source-driven tool vendors, they can impress with their speedy delivery of something, but subsequent delivery of tangible business value may be much slower, and the project, and therefore the Data Vault initiative, may fail.
Of course, the Data Vault is flexible, adaptive, and agile. You can create source specific Hubs and Links, and later create business-centric Data Vault artifacts to prepare the raw data for business consumption. But if you head down this route, don’t be surprised if you experience an explosion of source-specific Hubs, followed by a matching explosion of source-specific Links.
Data Vault is about integration. Just because Data Vault can use the business rules layer to untangle a messy gap between source-specific Hubs and Links, and ready-for-consumption artifacts, why should it? Why not create a small number of clean, agreed, business-centric Hubs, and then define business-centric Links around them to represent the more stable business relationships? If you take this approach of identifying business-centric Hubs and Links, the source-specific Data Vault Satellites can now be loaded happily against this framework, immediately providing a useful level of integration.
The good news is that this can be done. By properly designing the Data Vault, from the very outset, based on an enterprise data model, you can expect to reduce “business debt”. And hopefully this book will help you achieve tangible business value both initially and in the longer term.
And now a parting thought. While Data Vault is capable of contributing to the cleaning up problems in the source operational systems, let’s not forget the role of data governance in moving resolutions back to the source where reasonably possible. Operational staff who use these source systems will thank you. But to guide such an endeavor, you’ll benefit if you start with a top-down enterprise model.
3.143.3.104