© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
A. Carruthers, Building the Snowflake Data Cloud, https://doi.org/10.1007/978-1-4842-8593-0_2

2. Breaking Data Silos

Andrew Carruthers, Birmingham, UK

Every organization has data silos: those segregated, highly controlled IT systems and data repositories guarded by a limited, select group of individuals who hold their data closely. Traditionally, each silo operates as a self-contained micro-business, with minimal overlap and only essential data interchange. Furthermore, in larger organizations, data silos have coalesced around common industry-leading applications, each with its own embedded data model, leading to further challenges in data integration. And the model is self-perpetuating, with niche skillsets required to support each domain or line of business.

We might not recognize our everyday work as contributing to data silos. It is hard to see the forest for the trees when standing in the middle of the woods, but rest assured, data silos exist all around us. We just need to take a step back and think about the invisible boundaries that guard our data, whether physical in terms of machinery and location, logical in terms of network access, firewalls, and access groups, or otherwise. These data silos lock up value and prevent the monetization of our most valuable asset: data.

Some examples of data silos exist even in small companies: human resources owns employee records, work locations, and diversity information; finance owns business hierarchies, cost centers, and revenue by product line. There are many more benefits, some surprising, for organizations willing to break open their data silos, where cross-domain data sharing underpins a wide variety of outcomes.

I am not advocating a free-for-all nor recommending any breach of proper security safeguards. I am simply proposing to democratize our data by freeing up essential elements for cross-domain reporting and public interchange (for a price), all in a highly secure Snowflake Data Cloud environment where the data owner retains full control.

But first, we must ingest data into Snowflake, and this is where our challenges begin.

Why Break Down Silos

Chapter 1 discussed how 50% of respondents frequently share data with third parties to drive innovation in products and services. But in contrast, 64% admit their organization struggles to integrate data from varied sources. Are these examples of companies that have broken down their data silos or companies that have retained the status quo? It is too simplistic to argue one way or the other without context, but a clear perspective, attitude, and approach are important. Challenging all three allows you to break the status quo and realize the value held in your data.

Data in its correct context provides information. This is more clearly stated using the data, information, knowledge, wisdom (DIKW) pyramid. Typically, information is defined in terms of data, knowledge in terms of information, and wisdom in terms of knowledge. Figure 2-1 illustrates the relationship between the layers of the pyramid and the value gained at each.

Figure 2-1. DIKW pyramid. From top to bottom, the layers are wisdom, knowledge, information, and data.

You must understand the questions you want to answer and, from the underlying data, build the context to provide the wisdom.

Where does my organization generate revenue from? This question is easily enough answered; you turn to your finance system and run a report. But the report is only available to nominated and suitably entitled finance staff, one of whom must be available to execute the request.

What is the revenue impact if system X fails? This complex example is not as easy to answer because the upstream system may not lie in finance; it probably lies in another data silo, such as sales, marketing, or risk. These are the kinds of questions that breaking down data silos answers, and using the Snowflake Data Cloud, we have the platform to collate, categorize, store, and protect our data, putting information and knowledge at the fingertips of those who need it: when they need it, and how they need it.

Where Are the Silos?

Data silos exist all around us: the generally inaccessible logical containers we may think of in terms of an application name or vendor product, or the repositories we regularly but invisibly use.

As a starting point, we could map out our organization's operating model, lines of business, and functional units to identify likely sources of information. At this point, we are analyzing the “art of the possible,” indulging ourselves by creatively imagining outcomes if only we had access to the data, in one place, at the same time…

Moltke the Elder famously stated, “No plan survives contact with the enemy.” There is a ring of truth to this quote when navigating our organizations because, for many data owners, knowledge is power. Data control locks up information, knowledge, and wisdom, and is, in and of itself, a data silo to be broken down. But since you are reading this book, I encourage you to embrace new thinking, without which the status quo never changes, and we forever remain bound by the past. The Diffusion of Innovation theory was discussed in Chapter 1. The same logic applies. Do you want to be an innovator or early adopter in changing your thinking too?

Knowledge locked up in data silos quickly goes stale. For example, marketing data degrades at 25% per quarter, and technical knowledge—the data silos in our heads of how to do things—also degrades as new tools and techniques arise. Up-to-date documentation helps, and sharing knowledge is one sure way to stay current; self-education is another way.

Data silos are not just the boundaries between business functions or lines of business. They also occur between systems in the same domain. For example, one large European bank has 146 systems in HR alone, some cloud-based, some on-prem. And there is a propensity for each domain to build its own data warehouse to solve its own problems, often using disparate technologies, perpetuating the very problem we seek to solve and rapidly heading toward complexity and the diode point mentioned in Chapter 1, after which recovery is impossible.

We also see data silos and cottage industries proliferating to cater to weaknesses in vendor-supplied strategic platforms. The product may not support the workflow required, the overworked support team cannot implement required changes quickly enough, or the demand is for an answer “right now,” and a weeklong delay is unacceptable. Propagation of tactical solutions increases technical debt, a burden which must be paid for at some point, unpicked, and reconciled back to the source. This creates another suite of data silos loosely coupled to their sources, and the presence of a system whose only purpose is to catalog these End-User Developed Applications (EUDAs) should give us all cause for concern. But if someone had the foresight to put a central registry in place to record these EUDAs, we would have a decent starting point from which to identify and systematically retire them.

For those of us totally engaged, utterly immersed, and consumed with passion for their work, how do we keep the lightning in the bottle? How do we retain motivation to over-perform, over-deliver, and continually push boundaries?

Some of the answers lie in our working environment. Research has shown that people who work in cognitively demanding environments more readily share their skill, knowledge, and expertise. But when co-workers place high demands on them, or management expects them to share knowledge and bring others up to the same standard, the constant calls on their time and repeated interruptions often result in withdrawal and lack of engagement, exactly the opposite behavior to the one we want to encourage.

My personal view is to keep encouraging both your visionaries and thought leaders to innovate and push boundaries. These are your 11-out-of-10 performers, week in and week out: the ones who can’t wait to get to work, who always have something up their sleeve to pull out and amaze, and the self-starters who see opportunity long before others. It is uncomfortable giving freedom to people who think differently. But these people find ways, often unconventional, to break down data silos and allow others to bring their skill, knowledge, and expertise to bear. They see those things hidden in plain sight. Their lateral thinking and ability to connect the dots often result in a competitive advantage.

Forgive them their sins when they overstep boundaries and create noise, provide head-cover, and let them get on with innovating and ultimately delivering value to our organizations, because if we constrain our rocket scientists into conformance with norms and insist on noise reduction because management likes things quiet and peaceful, the lightning is sure to escape the bottle. Once the energy has gone, it has gone forever, and we have robbed our organization of the very essence crucial for continued success. Iron sharpens iron, and smart people spark greatness wherever they land and unlock huge potential in others, often without knowing the impact of their presence or their few words of wisdom. They have a massively positive ripple effect reaching out to the far corners of our organizations.

We have few thought leaders and even fewer visionaries. Let’s encourage and retain those we have; there is always the option for them to work elsewhere with more enlightened teams and organizations, and our visionaries’ and thought leaders’ organizational contribution and value far exceed the noise they create. Put another way, noise is evidence that something is happening.

Note

The lessons are clear: facilitate, equip, enable, and release, don’t constrain, overburden, and eventually kill our visionaries and thought leaders. Our organizations cannot survive without them.

Returning to discuss more tangible data silos, we might consider Word documents as data silos. Consider system documentation containing a table of contents (TOC) with information for each heading. If we can access our system documentation centrally from Snowflake, with unstructured document support, the TOC can be extracted and programmatically used to interrogate content according to each heading, storing results into a table.

Imagine a template document with common TOC. Using unstructured document support, we can bring all identically named sections together and interrogate using SQL. Alternatively, imagine scanned invoices where key-value pairs such as invoice number and amount can be extracted programmatically. Who needs expensive OCR software now? Chapter 10 explains how to implement tooling to extract text from images and provides a practical example for you to implement and extend.
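Snowflake’s unstructured data support makes documents such as these queryable alongside tables. The following is a minimal sketch, assuming an internal stage named documents_stg and illustrative file paths; Chapter 10 builds on these building blocks.

-- Create an internal stage with a directory table so file metadata becomes queryable
CREATE STAGE documents_stg
  DIRECTORY = (ENABLE = TRUE)
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');

-- Upload documents (e.g., PUT file:///tmp/system_design.docx @documents_stg; from SnowSQL),
-- then refresh the directory table to register them
ALTER STAGE documents_stg REFRESH;

-- List the documents now visible to SQL, with scoped URLs for downstream processing
SELECT relative_path,
       size,
       last_modified,
       BUILD_SCOPED_FILE_URL(@documents_stg, relative_path) AS scoped_url
FROM DIRECTORY(@documents_stg);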

Once we have captured data from unstructured documents , how will our business use the data? How will the data inform our decision-making? Why is it important to know? Only you can begin to know the answers to these questions, and naturally, this book can only go so far in exposing tooling to assuage your thirst for wisdom. But, in your hands, you have a starting point and new thinking to apply, hopefully changing perspectives.

We must not think of data silos as being limited to a subset of our organization’s estate or a suite of physical servers and disk drives. Rather, we should approach data silos from the perspective of identifying where the value lies, then identify an access path to unlock the value. The data content defines whether the container is a silo or not.

Data silos are all around us, we just need the eyes to see and the imagination to unlock what is before us.

To quote Saurin Shah, Snowflake developer, “It doesn’t matter how we store the data. It matters how data is shared and consumed.” There is an implicit assumption in Saurin’s comment; we must be able to unlock the data from its storage container, then make the data accessible to allow consumption.

Saurin’s revelation must challenge our thinking on what constitutes a data silo because today’s definition of what a data silo looks like will be very different in 10 years with new, currently unimaginable formats and media available. As an extreme example, who would have thought our own DNA would be such a rich (and complex) source of information? And who can foresee how the information contained in our DNA can be used for the future benefit of mankind?

Regardless of where we find data of interest or value, we must develop and maintain access patterns to unlock desired content for several reasons. Emerging data formats require new custom handlers. Existing handlers require upgrades as product and code versions change. New platforms necessitate handler refactoring. For some, this is a ready-made business idea: plug-ins for Snowflake delivering data from hard-to-reach sources.

The lesson is clear. Even if we don’t have a starting point today, a suite of tooling ready to deploy, or even minimal capability to execute, we must be ready. The future is rushing toward us ever faster, and now is the time to begin. A journey of a thousand miles starts with a single step.

How to Break Open Silos

By now, we have an idea of the types of data we are interested in, their likely business locations, data owners, and some thoughts on how to interact with data silos. Naturally, proper governance determines the entities and attributes permissible for access.

You must also consider your organization’s network topology, as accessing data isn’t simple. To bring data together, we must consider the end-to-end connectivity from source to destination, which allows our data to flow. For those of us fortunate enough to have clean running water, I like the analogy of a water tap. When I turn the tap on, I get water, but what facilitates the water arriving? Logically, there has to be pipework from the source to the destination.

Likewise, we can separate our data sourcing strategy into two parts: the plumbing, which provides the end-to-end connectivity, and the tap, which determines how much data flows through the pipework, subject to the maximum capacity of the pipework, or network bandwidth in our case. While related, these two components should be considered separate deliverables. In larger organizations, an architectural governance body is mostly concerned with the plumbing. Any system-to-system interconnectivity should require approval to prevent a free-for-all situation. However, the data which flows through the plumbing is governed by a different authority. When breaking down data silos, it is important to remember the distinction and to engage each body separately, notwithstanding our cybersecurity colleagues, who rightly insist on data security from connection through transit to final storage in Snowflake. And the hidden point behind this message is the high degree of human interaction and collaboration required to begin the journey of accessing silos.

Each data silo has its own format, structure, and entitlements (the prerequisites to allow access). Some commonly used examples include SharePoint, Excel spreadsheets, business applications, archives, log files, databases, catalogs, active directories, and shared network drives. The list is not exhaustive, and each may contain one or more object types with data of interest, requiring a custom connector to interact with and derive value from.

We must ensure our organization is prepared, equipped, and ready to accelerate. We best set ourselves up for success by having the right attitude and approach, which requires investment into presentations, showcases, internal education, and continual positive reinforcement by evangelizing our approach and product. If we are not enthusiastically endorsing our own work, why should anyone else believe and have confidence in us? Technology is not the issue. People and processes are.

Note

Showcasing capability and conducting “hearts and minds” are critical to our success.

How Data Is Evolving

No conversation on data silos would be complete without discussing how data is evolving. We want to avoid repeating the mistakes of the past and enable data democratization for the future. Previously mentioned themes on future data growth included the Internet of Things (IoT) generating sensor data, imaging, and semi-structured records. But we must also be mindful of medical data and the possibility of personalized gene therapy, satellite imagery, data from different spectra, and climate data. The list goes on.

Our ambition to consume ever more complex and challenging data pushes the boundaries of conventional database technology. We don’t always have the tools to handle everything at our fingertips, but we can develop them over time. Consumption is not an end-state in and of itself. Rather, consumption facilitates convergence with existing data sets, leading to new insights and opportunities from which our organizations benefit.

We are experiencing exponential growth in the volume, velocity, and variety of data, broadly categorized into three themes. First, structured data: our (hopefully) well-understood third normal form OLTP databases, data vaults, data warehouses, and star schemas. More recently, semi-structured data: JSON, XML, Avro, Parquet, and ORC formats, all supported and managed by Snowflake tooling. And latterly, unstructured data: Microsoft Word documents, images, audio, video, sensor data, and so on. Considering the Snowflake Data Cloud, we might view the three themes as evolutions of data, and of the management thereof, as shown in Figure 2-2.

Figure 2-2. Evolution of data management. From the innermost layer outward: table-based data warehouse; file-based plus table-based data lake or lakehouse; borderless data with the Data Cloud.

In Figure 2-2, there are no hard borders between the themes, and with some investment of time and effort, we can develop frictionless transitions between them. Snowflake supplies the tools and sufficient examples to work out our integration patterns and approach to implementing our Snowflake Data Cloud. We also examine homegrown tooling in Chapter 10, specifically for semi-structured and unstructured data, with hands-on examples delivering the capability to extract textual content from image files.

The most important takeaway from the evolving data landscape is to enable data interoperability. That is, merging data across all three themes into readily usable data sets for everyone to consume. We must hide complexity while retaining full data lineage, that is, full traceability of data from the point of ingestion through to the point of consumption, and provide the means for self-service wherever possible. Snowflake has the tools to do all of this.
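To make the interoperability point concrete, here is a minimal sketch of bridging the structured and semi-structured themes in Snowflake, using an illustrative raw_events table and a hard-coded JSON payload; the same pattern applies to feeds landed from an API.

-- Land raw JSON in a VARIANT column, then expose it relationally
CREATE OR REPLACE TABLE raw_events (payload VARIANT);

INSERT INTO raw_events
SELECT PARSE_JSON('{"order_id": 101, "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}');

-- LATERAL FLATTEN turns the nested array into rows that join and filter like any table
SELECT r.payload:order_id::NUMBER AS order_id,
       i.value:sku::STRING        AS sku,
       i.value:qty::NUMBER        AS qty
FROM raw_events r,
     LATERAL FLATTEN(INPUT => r.payload:items) i;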

Common Reference Data

Surprisingly, or shockingly to some, most organizations do not have a single source of standard reference data. There are exceptions, but each domain typically owns its reference data. All is well and good if the reference data matches, and in truth, countries and currencies rarely change, but they do. Good data governance should address the issue of common reference data across an organization. But it is not always easy, as our large corporates frequently purchase subsidiaries and divest other parts of the organization. This theme is further discussed in Chapter 12.

One immediate benefit we can gain from implementing Snowflake Data Cloud is to share approved golden source reference data, managed and maintained centrally, immediately available to all.
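Secure Data Sharing is the natural mechanism for this. The following is a minimal sketch, assuming an illustrative ref_db.common schema holding the golden source and an illustrative consumer account identifier; consumers query the shared objects live, with no copies to keep in sync.

-- Publish golden source reference data to other Snowflake accounts via a secure share
CREATE SHARE reference_data_share;

GRANT USAGE ON DATABASE ref_db TO SHARE reference_data_share;
GRANT USAGE ON SCHEMA ref_db.common TO SHARE reference_data_share;
GRANT SELECT ON TABLE ref_db.common.countries TO SHARE reference_data_share;
GRANT SELECT ON TABLE ref_db.common.currencies TO SHARE reference_data_share;

-- Add the consuming accounts (identifier is illustrative)
ALTER SHARE reference_data_share ADD ACCOUNTS = myorg.consumer_account;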

Snowflake and OLTP

You are probably wondering why we don’t use Snowflake as our primary data store, replacing the vendor-supplied databases hosted on whatever platform they run on. This is a great question, and one answered in Chapter 3, but for now, let’s consider the market segment in which Snowflake operates.

Designed from the ground up, Snowflake is aimed at the data warehouse market, where huge volumes of data are accessed, sorted, and filtered to answer complex business questions. The fundamental storage pattern is very different in a data warehouse than in traditional OLTP systems, where individual record management occurs. It is a matter of scale. Snowflake is optimized for huge data volumes, whereas OLTP systems are optimized for small, record-level transactions, albeit with the capability to implement data warehouse functionality, just not on the same scale as Snowflake.

Accessing Silos

The IT software industry has matured over many years. Numerous organizations have delivered market-leading platforms in specific market segments and industry sectors; examples include Oracle Financials, ServiceNow, Salesforce, and Workday. There are plenty more established vendors and a never-ending list of new entrants eager for your business.

Naturally, each vendor seeks to protect their intellectual property by implementing database schemas that are difficult (or impossible) for humans to read, proprietary compression algorithms for uploaded attachments, inextricably linked data and logic held in large binary objects, and many more fiendishly difficult approaches to challenge the unwary. All of these, and many more, reinforce data silos by preventing data interchange except through pre-defined and often limited interfaces, raising barriers that protect vendor interests but rarely serve the data owner’s interests.

Each large-scale vendor application has three core components resulting in three primary patterns for accessing application data, as shown in Figure 2-3. This section examines each core component and the pros and cons of using each pattern. This is not an exhaustive treatise. And, yes, there are some out-of-the-box solutions for some of the issues raised in specific domains. But we are talking in general terms here to expose the underlying challenges to a wide audience.

Figure 2-3. Silo landscape and access paths. Wrapped in security, the three core components (API or RaaS, business model, data model) map to three access patterns (ELT/ETL, views, schema replication).

Let’s also discuss a few options to solve domain-specific issues.

Security Architecture

Data security must be “front and center” of everything we do with our data ecosystems, whether on-prem or in the cloud. However, the considerations for the cloud are far more stringent. In truth, this section belongs in Chapter 3, but it deserves a brief mention here because we are discussing data moving between containers, and we must keep our data secure.

Before we consider the data sets to move, we must consider how to securely connect to a data source, data protection in transit (particularly over the public Internet), and how Snowflake protects inbound data (see Figure 2-4).

Figure 2-4. Data in transit. The connection from the data silo to Snowflake must be authenticated at both ends, and the data protected in transit.

Every data source application provides at least one, and hopefully several, secure authentication mechanisms to establish a connection. We must choose a connection mechanism supported and approved by our cybersecurity architects. We must also ensure the pipework through which our data flows is protected; our data must be encrypted in transit. Finally, our connection to Snowflake must likewise be secure. We know Snowflake does not allow third parties to create their own drivers. Instead, it develops and maintains its own drivers, guaranteeing both authentication and data flow into Snowflake are secure.
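On the Snowflake side, one control our cybersecurity colleagues typically mandate is restricting where connections may originate from. A minimal sketch using a network policy follows; the CIDR ranges and policy name are illustrative.

-- Restrict which network ranges may authenticate to the account
CREATE NETWORK POLICY corporate_only
  ALLOWED_IP_LIST = ('198.51.100.0/24', '203.0.113.0/24')
  COMMENT = 'Only corporate egress ranges may connect';

-- Apply at account level (requires appropriate privileges, such as SECURITYADMIN)
ALTER ACCOUNT SET NETWORK_POLICY = corporate_only;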

Data Model

Every application has a data model: the fundamental data structures that enable the application to hold and manage its version of the truth. In database terms, we refer to this as a schema. Of course, there is usually more than one schema in a database, but for this explanation, we stick to a single schema.

Think of any database as a set of Russian dolls. The analogy breaks down very quickly, but simply put, a schema is one of the inner dolls—a logical container into which we put multiple other Russian dolls, or more precisely, tables and views. A table is like a spreadsheet tab and contains data. A view is analogous to formulae applied to cells in the table, an overlay adding value if you like.
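Expressed as Snowflake objects, the analogy looks something like the following sketch; the database, schema, table, and view names are purely illustrative.

CREATE DATABASE hr_db;                         -- the outermost doll
CREATE SCHEMA hr_db.payroll;                   -- an inner doll: a logical container

CREATE TABLE hr_db.payroll.employee (          -- a table holds data, like a spreadsheet tab
    employee_id NUMBER,
    full_name   STRING,
    salary      NUMBER(10,2)
);

CREATE VIEW hr_db.payroll.employee_summary AS  -- a view overlays logic, like formulae on the cells
SELECT COUNT(*)    AS headcount,
       AVG(salary) AS average_salary
FROM hr_db.payroll.employee;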

Tables can have relationships with other tables in a database schema , linking “like” data and allowing navigation between tables. The result can be likened to a tapestry because the relationships display a picture of how the data is related. However, in a badly structured schema, the opposite side of the tapestry is more representative, with threads hanging all over the place and great difficulty navigating between tables because the relationships are obscured, and the picture is not evident.

In some schemas, the table names have meaning. In others, they are codes designed to hide their true meaning, and sometimes several layers of views on top provide abstraction, but usually no real meaning. This approach and others mentioned previously are signs of vendor lock-in—how it becomes too hard to move away from their custom product suite, providing a perpetual revenue stream to the vendors.

Equally, other software companies specialize in unlocking vendor-specific lock-ins and charge for their product accordingly. A frustrating pattern is where vendors of decommissioned and no longer supported applications retain lock-ins to prevent data access outside their tooling. Where data retention periods of 10 years or more are common, a lot of on-prem hardware and disk storage are needlessly consumed to provide occasional data access. I will discuss how to address archiving legacy data sets later.

Back to our data model: data is input into the data model using proprietary screens or data interfaces, where data can (or should) be validated for completeness and correctness before insertion. We ought to be able to rely upon data always being of good quality and fit for purpose. But for a variety of reasons, which I do not go into here, we often find gaps in our data: inconsistencies, duplicates, missing reference data, and so forth; the list goes on, and data correction post-capture is far more expensive than data correction at ingestion. The important point is that we have lots of data in data models that may or may not be well structured or readily understood, and we need to move that data to Snowflake. And for the curious, types of data models are discussed in Chapter 13.

Having a brief insight into data models and data quality, let’s discuss extracting data from our data model and subsequent data interchange.

For live production systems, part of our competitive advantage is derived from having data available in (near) real time in an understandable and easily consumed format. This is where many organizations spend enormous amounts of money, time, and effort extracting, transforming, and loading data, then conforming it into usable structures readily understood by consumers. We return to this theme later, but for now, let’s focus on the big picture of getting data out from our source data model and into our target data model. For simplicity’s sake, we further assume the source data model and target data model are the same, regardless of source platform, because we have already decided to use Snowflake as our target data warehouse platform.

Seasoned developers know it is relatively easy to transfer data from one system to another. We won’t dwell on the mechanics but simply state that we can extract from the source and insert it into the target. Having data does not necessarily convey its meaning, only the presence of it—unless we can infer relationships directly from the data model, which is increasingly rare given vendor predilection for lock-ins.

If you move data without an associated context, all you have done is create another duplicate data set perpetuating the problem we want to solve, arguably one small step closer to breaking down a data silo, but at a high price, as discussed next.

Business Model

You now understand in broad brush strokes what a data model is, how a data model should work, and why we need a data model. Overlaying every data model is the business model, a conceptual, logical model rather than a physical “thing” often used to describe how information flows through a system. This is where functionality interacts with data and gives meaning.

Let’s use a simple example. Assume we have a table containing address data. We may implement functionality to validate the postcode/zip code against supplied reference data to ensure each address is valid and there is a house number, the combination of which in the United Kingdom gives the exact place to deliver letters and parcels. We may also have access to the latitude and longitude of each postcode which for the United Kingdom gives geolocation data from which we can use more functionality to derive address density and map our addresses graphically. We can also calculate the distance between a distribution depot and delivery address or calculate travel time between two points for different times of the day using readily available geolocation services.
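As a sketch of this kind of business logic in SQL, assume illustrative address, postcode_ref, and depot tables carrying postcodes, latitudes, and longitudes; Snowflake’s HAVERSINE function then gives the great-circle distance in kilometers.

-- Validate addresses against postcode reference data and compute distance from a depot
SELECT a.address_id,
       a.house_number,
       a.postcode,
       HAVERSINE(d.latitude, d.longitude, p.latitude, p.longitude) AS distance_from_depot_km
FROM address a
JOIN postcode_ref p ON p.postcode = a.postcode    -- addresses failing this join have invalid postcodes
CROSS JOIN depot d
WHERE d.depot_code = 'BHM01';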

But there is limited value in knowing the address only. We need to know the context for each address, such as whether residential or commercial, and the name of the company or homeowner. In essence, the relationships from the data model discussed earlier using our tapestry analogy. It is from these relationships we derive information. This business model is a logical overlay to the data model where we can see data ingestion and consumption. I often refer to this simplistically as data on the left, processing, or functionality of some sort in the middle, and data consumption on the right. A business model cannot exist without a data model, and a data model is useless without a business model. In simple terms, the business model gives context and meaning to the data model by providing information from data.

Conventions are important in data modeling because they provide structure and meaning, which often (but not always) translate across domains. As inferred in the previous section, data models can vary wildly in quality and readability. Our objective should always be to apply the KISS Principle: Keep it Simple, Stupid. Alternatively, there is Occam’s Razor. Both broadly amount to the same thing.

If vendor lock-ins obscure meaning, the last thing we should consider is replicating, cloning, or reverse-engineering the full business model into our Snowflake copy, because vendors change their data structures, add new features, and refactor tables and relationships, rendering our copy mechanism invalid, with associated dependency management, maintenance, emergency fixes, and pain for all consumers.

On a case-by-case basis, we might consider refactoring a small part of the business logic into a suite of reporting objects or views, as this may give sufficient utility to our consumers. This works well, particularly for archived applications discussed later, and may represent an acceptable way forward subject to the risks outlined. Ideally, and in preference, we need some form of abstraction insulating us from a change in both the underlying data model and overlaid business model. It is discussed next.

Application Interface or Reporting Service

You now know the issues inherent in understanding and interpreting our data and business models, with only a partial answer for the business model. Fortunately (with some caveats), we have other options.

Most vendors provide a technical gateway into their business models, referred to as application programming interfaces (APIs) or reporting as a service (RaaS). Other terminology and differing access paths exist, such as built-in reporting screens, but to keep increasing complexity at bay, we consider both API and RaaS synonymous.

Let’s assume we have our access path to extract information from the underlying business model. Downhill from here, right? Wrong. Very wrong because the presence of an API or RaaS access path, assuming we have solved the plumbing as discussed previously, leads us to another set of challenges—turning on the tap.

Naturally, vendors do not make it easy to understand all the nuances and implications of invoking their APIs. While sample code is supplied, it rarely (i.e., never) meets our exact requirements and leaves much information for us to discover by trial and error. Our challenge is finding someone who has the skill, knowledge, and expertise to understand how to invoke the API to derive the required resultant data set. These people are hard to find and harder to engage as they are subject matter experts (SMEs) in high demand and consequently command high fees to engage. Every cost-conscious organization, that is, every single organization in existence, has a small number of SMEs. Each is highly loaded to get the best value from them. We need their managers' engagement to procure SME time, which is harder to attain where the benefits of breaking down data silos have not been explained or understood, and our additional ask is another demand against a precious resource.

Assuming we have access to an SME, our next challenge is explaining what we want from the API , the results of which must be in a mutually acceptable format. We might consider a semi-structured format such as JSON, giving nested records for later processing into flattened relational records. We must also consider the trade-offs with invoking an API, traversing the resultant data set, and recursively calling the same API with slightly different parameters because each round trip takes time, consumes bandwidth, and can fail for a variety of reasons. One answer is to have our SME write a wrapper that resolves all recursive calls and returns a single super-set of data, but this takes time to articulate and understand requirements and then develop an appropriate wrapper, thus compounding our challenge in extracting data.

Depending upon the returned data set format, other considerations arise. Some attributes contain embedded commas, where we have specified a comma-separated values format. Embedded commas in the resultant data set may result in extra columns. For this reason, we prefer pipe-delimited output, assuming the data doesn’t contain embedded pipes. You get the idea. Furthermore, APIs and RaaS may not provide 100% coverage of all the information we need from a system.
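In Snowflake, the delimiter and quoting decisions boil down to a file format definition. A minimal sketch follows; the format name is illustrative, and the enclosing quotes guard against embedded delimiters.

-- A reusable file format for pipe-delimited extracts
CREATE FILE FORMAT pipe_delimited
  TYPE = 'CSV'
  FIELD_DELIMITER = '|'
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  SKIP_HEADER = 1
  EMPTY_FIELD_AS_NULL = TRUE;

-- Referenced later when loading, for example:
-- COPY INTO staging_table FROM @extract_stage FILE_FORMAT = (FORMAT_NAME = 'pipe_delimited');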

However, what we should be able to rely upon with APIs and RaaS, is a degree of isolation from changes to both the underlying data model and the business model. As we have discussed in previous sections, vendor changes to both can result in our implementations becoming brittle, causing unexpected failures, emergency changes, and unwanted extra work. A key factor when investigating third-party applications and tooling must be an examination of available SMEs in the marketplace and the desire in the organization to invest in their staff to mitigate execution risk.

We must also consider another vendor lock-in strategy: rate limiting API and RaaS calls to restrict the number of records returned and the number of times an API or RaaS call can be made in a specific timeframe.

Built-in Reporting Screens

Some vendor products come with built-in reporting tooling, offering canned reports (those available out of the box) and the capability to develop homegrown screens. Built-in reporting tools can be used to derive data feeds and can possibly be scheduled too, depending upon the implementation.

Like API and RaaS solutions, built-in reporting screens may provide an attractive option for generating data extracts while also providing a degree of isolation from change to both data model and business model. A further advantage of reporting screens is that they may be developed by power users and do not require SME knowledge. They facilitate closer cooperation from operations staff, providing much-needed contextual information. If you understand why our systems are used, along with typical access paths, you are in a far better position to understand the implications of change.

Custom Views

Database views are stored SQL queries that overlay our data model. They are physical database objects providing an abstraction layer where we may put logic to join tables, pre-filter, aggregate, or summarize data. We often use views to denormalize our data model into more business-friendly terminology and presentation, reducing the technical knowledge required to understand our data.

As with the business model, it may be possible to create a (small) suite of custom reporting views to enable efficient data extract. These require maintenance in the event of underlying data model changes. Likely to be highly dependent upon SME knowledge and at risk of creating more technical debt, custom views are only as good as the underlying SQL used to source the data. Any filters, aggregations, or summarizations should be well described in the documentation.
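A sketch of such a reporting view follows, with illustrative vendor table and column names; the point is the translation of cryptic source names into business-friendly terms, and the documented filter.

-- A denormalizing reporting view over cryptic vendor objects
CREATE VIEW reporting.v_open_invoices AS
SELECT i.inv_no  AS invoice_number,
       c.cust_nm AS customer_name,
       i.inv_amt AS invoice_amount,
       i.inv_dt  AS invoice_date
FROM vendor_schema.t_inv  i
JOIN vendor_schema.t_cust c ON c.cust_id = i.cust_id
WHERE i.status_cd = 'OPEN';  -- document this filter, or consumers will misread the numbers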

Application Connectors

Salesforce integration is via Tableau CRM Sync Out Connector, offering out-of-the-box incremental data loads into Snowflake (see Figure 2-5).

Figure 2-5. Tableau CRM Sync Out. Data flows from Salesforce, through Tableau CRM Sync Out, into Snowflake.

Sample scripts are supplied to configure Snowflake, after which the Salesforce connection must be configured. Note that the type and frequency of data synchronization are determined by Salesforce configuration. This push-only interface from Salesforce into Snowflake manages object creation, providing a degree of insulation from Salesforce data model changes. Note that no mention is made of data model delta changes, only object creation. The author has not tested the Tableau CRM Sync Out Connector, but more information is at www.snowflake.com/blog/integrating-salesforce-data-with-snowflake-using-tableau-crm-sync-out/.

Third-Party Products

Historically, Snowflake has not seen itself as a data integration tool supplier. Instead, Snowflake typically refers integration questions to its Partner Connect program, accessible via the Snowflake user interface. Many ETL/ELT vendors, such as Boomi and Matillion, provide connectors offering the capability to abstract details away from the real work of implementing API or RaaS calls, in other words, simplifying data interchange.

We observe a clear trend toward consolidating disparate data sets into Snowflake for cross-domain, consistent, strategic reporting, which has not gone unnoticed in the marketplace. Ease of connectivity from disparate data sources and vendor platforms into Snowflake is an area to watch for developments. Another option is third-party tooling to replicate whole databases and schemas; Partner Connect has several offerings available.

In-House Application Development

Having worked through a typical vendor-supplied application stack and exposed many challenges in breaking down data silos, we must ask whether IT has served our organizations well by disbanding the high-performance, on-site development teams behind our custom implementations and turning excellent developers into functional managers of offshore teams. Some organizations have lost their corporate knowledge of their key systems, and formerly excellent developers have lost their core technical skills. If you don’t use it, you will lose it. But we can’t turn back the clock. We can only move forward and deal with the consequences, many of which were unforeseen at the time decisions were made.

Approaches to Data Capture

There are many approaches to capturing data, and when implementing Snowflake, we should consider a pattern-based approach. But what is a pattern-based approach? How can we develop patterns? And most importantly, why should we bother? I’m glad you asked.

Patterns provide an easily described, documented, repeatable, and consistent integration path for ingesting data sets. The alternative is a custom integration path for each data set, which is hard to maintain, reuse, explain to others, and support. Naturally, one size does not fit all. Therefore, we may have several patterns, but not many; ideally a handful or less.

If we want to accelerate our software delivery, reduce our operating risk, and make our systems robust and scalable, we should adopt a pattern-based approach. Furthermore, we may have architectural governance to approve our proposals before allowing changes to occur, and by implementing common patterns, we make their job easier too. Remember the tapestry analogy from Chapter 1? Patterns change our perspective, so we look at the picture, not the reverse side. Patterns and an example are covered in Chapter 12.

Tactical Approach

We often face pressure to deliver as quickly as possible regardless of long-term consequences and technical debt, which is often never addressed. IT is not a perfect discipline, and as you have seen, it is not going to reduce in complexity any time soon. Quite the opposite. Recognizing the need for business consumers to derive knowledge and wisdom, we are often constrained when delivering quality, robust data sets and end up relying upon someone manually cobbling data into a spreadsheet. Something is better than nothing, and we can always refactor it later, or so we are told. But in the meantime, we have increased operational risk, because the data set is tactical, and technical debt, because we must later replace tactical feeds with strategically sourced data.

Adopting a pattern-based approach where a rules-driven engine automagically creates objects and integrates with our continuous delivery pipeline is readily achievable. This approach does not cater to data management principles outlined in Chapter 11 but does provide a starting point for automating our data ingestion capability.

For every book of work item where a tactical approach is the only achievable option in the short term, I strongly recommend tactical feeds are only accepted on the condition that funding is made available to remediate at a later stage, when strategic data sets become available. Our end-state should be to decommission all tactical feeds and repoint to strategic feeds, which is discussed next.

Strategic Approach

What is a strategic data set, and how does it differ from a tactical data set? One definition of a strategic data set is one programmatically sourced from the system which holds the system of record, otherwise referred to as the golden source.

Our preferred approach to data ingestion must be to use strategic data wherever possible and migrate tactical feeds at the earliest opportunity. It takes discipline and determination to reduce technical debt. But we do not serve our organizations well if we allow—or worse, encourage—proliferation and reliance on non-authoritative sourced data sets.

Regardless of our position in our organization, we all have a role to play and should adopt a common approach to reducing our technical debt. Without quality data, we cannot expect accurate information, knowledge may be compromised, and wisdom out of step with reality. Data quality (completeness, volume, validated) and veracity (accurate, timely, reliable) contribute to improved decision-making and better outcomes.

Data Pipeline and Data Content

Previously, you learned why we should separate pipework from the data flow. Where resources are constrained, I advise focusing on the plumbing first, including security impact because this is the biggest challenge, followed by invoking API/RaaS where different challenges await. This approach may also mitigate delivery risk for large amounts of work, since resources are not always available when needed, and we often must wait for availability. We should therefore advance upon a wide front and address those opportunities available to action while waiting for others to open.

Change Data Capture (CDC)

CDC is a technique to capture only those records which have changed, the delta. Depending upon how the originating source handles changed records, CDC may not be available without significant rework, testing, and later deployment; in other words, it’s not worth the effort. However, CDC is a very elegant and powerful approach to minimizing data captured for onward processing and should be considered for every interface.
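Where the data has already landed in Snowflake, streams provide a built-in way to capture the delta for onward processing. The following is a minimal sketch with illustrative staging and history table names.

-- Track changes on a staging table
CREATE STREAM customer_changes ON TABLE staging.customer;

-- Consuming the stream in DML returns only rows changed since the last consumption,
-- then advances the stream offset
INSERT INTO warehouse.customer_history
SELECT customer_id,
       customer_name,
       METADATA$ACTION,
       METADATA$ISUPDATE,
       CURRENT_TIMESTAMP()
FROM customer_changes;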

You might also consider implementing Data Vault 2.0, where CDC is required. While non-trivial to deliver a Data Vault 2.0–based design, the outcome is worth the investment if done well. Data Vault 2.0 is covered in Chapter 13.

Bulk Load or Trickle Feed

Some data sets lend themselves to a bulk load approach, and other data sets lend themselves to a trickle feed approach. One size does not fit all, and we must determine the appropriate approach for each feed. As a rule of thumb, a monthly feed may be acceptable for infrequently received data, for example, new starters to our organization. However, inventory changes may need to be intra-day to enable just-in-time shipment to maintain warehouse stock levels. Regardless of the approach, Snowflake has built-in tooling to ease your implementation.
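The two approaches map naturally onto COPY INTO and Snowpipe. A minimal sketch follows, with illustrative stage, table, and pipe names; auto-ingest additionally assumes an external stage wired up to cloud storage notifications.

-- Bulk load: an on-demand COPY for infrequent, high-volume feeds
COPY INTO hr.new_starters
FROM @hr_stage/new_starters/
FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = '|' SKIP_HEADER = 1);

-- Trickle feed: a pipe auto-ingests files as they land, suiting intra-day inventory changes
CREATE PIPE inventory_pipe AUTO_INGEST = TRUE AS
COPY INTO warehouse.inventory_movements
FROM @inventory_stage
FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = '|');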

Data Usage

Once we have the data in Snowflake Data Cloud, we must address the questions of ownership and usage . Having identified there may be data from many source systems in an individual line of business or functional domain, we must determine the ground rules for data access.

For some organizations, it is sufficient for a user to belong to a line of business or functional domain to be granted access to all data in the governed data set. For other organizations, a finer-grained approach is preferred, where role-based access control (RBAC) is created for individual application data sets.

The former approach is something of a blunt instrument and may inadvertently grant access to data to unintended recipients. The latter approach requires more administration (typically via AD group membership) but provides tighter control, and this is the approach we prefer. It is always easier to relax a control than to tighten up controls later.

Row-level security (RLS) controls access to subsets of data in a domain. Typically, RLS is applied to objects created for data consumption. We return to data consumption in Chapter 9.
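Snowflake implements RLS with row access policies. A minimal sketch follows, assuming an illustrative mapping table of roles to cost centers and a finance.revenue table carrying a cost_centre column.

-- Define the policy: a row is visible only if the current role is mapped to its cost center
CREATE ROW ACCESS POLICY finance.cost_centre_policy
AS (cost_centre STRING) RETURNS BOOLEAN ->
  EXISTS (
    SELECT 1
    FROM finance.role_cost_centre_map m
    WHERE m.role_name   = CURRENT_ROLE()
      AND m.cost_centre = cost_centre
  );

-- Attach the policy to the consumption object
ALTER TABLE finance.revenue ADD ROW ACCESS POLICY finance.cost_centre_policy ON (cost_centre);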

Migration from Snowflake

Despite being convinced the Snowflake Data Cloud is the way forward for all the good reasons explained before, we must consider the possibility of Snowflake becoming another data silo, and also that some future product may eclipse Snowflake’s core capabilities, to which we may want to re-home our data.

Snowflake as a data silo is an interesting concept. The tooling itself locking up data is anathema to the open market approach adopted by Snowflake, but there is an outside possibility, albeit highly improbable.

Moving data platforms is a very infrequent event in an organization's life. Typically, vendor lock-ins, non-transferable employee skillsets, system interfaces and infrastructure, lack of funding, and appetite impede the bold decision to re-platform. Yet some architectural governance forums prefer to have an exit plan as part of their approval process.

We have already determined SQL skills to be transferrable as the lingua franca of databases; therefore, the only data platforms available to migrate to away from Snowflake are those already in common everyday use, precisely the data warehousing platforms we are moving from. I’m sure I am not the only one to see the irony here. I find it difficult to imagine what new platform will be available in, say, five years; therefore, the subject of exit planning should be deferred until the question becomes relevant, as any current answer is speculative. Regardless, Snowflake is committed to continued access post exit, which may be enshrined in your master service agreement or simply a reversion to the on-demand charging model.
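It is also worth noting that data is never locked in at the storage layer: unloading a table to an open format is a single statement. A minimal sketch, with an illustrative external stage and table:

-- Unload a table to an external stage in an open format, ready for any future platform
COPY INTO @export_stage/customer/
FROM warehouse.customer
FILE_FORMAT = (TYPE = 'PARQUET');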

Summary

This chapter began by explaining why every organization has data silos and identified the types of data silos where value is locked up. We then looked at why organizations have data silos and the perceived benefits thereof, systematically breaking down the arguments and building the case for the Snowflake Data Cloud.

Looking toward the future, where unknowable data types and new forms of media await, you have begun to understand how to address future state data demands. We can now begin to think in terms of yes being our first response backed up by the knowledge we can develop tooling to access data, rather than immediately saying no because the answers lie outside of our skill, knowledge, expertise, and comfort zone.

This discussion has taken you through some of the complexities in accessing data silos, calling out some key dependencies and challenges we will face, not the least of which is our security posture. And having established the right mindset in preparation for looking deeper into our data silos, let’s open the door to the next chapter, which discusses Snowflake’s architecture.
