© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
A. Carruthers, Building the Snowflake Data Cloud, https://doi.org/10.1007/978-1-4842-8593-0_1

1. The Snowflake Data Cloud

Andrew Carruthers, Birmingham, UK

The Snowflake Data Cloud is a means by which data can be shared and consumed in near real time by willing participants across geographical borders—seamlessly and securely. Snowflake offers solutions to consume, reformat, categorize, and publish a wide variety of data formats and leverage siloed legacy data sets while opening new market opportunities to monetize data assets.

We must first understand the nature of the problems we currently face in our organizations. We cannot address problems if we do not acknowledge them.

I draw on more than 30 years of experience working in widely varying organizations, each with its own challenges. I broadly distill that experience into several core themes discussed in this book. No one-size-fits-all solution exists, and there are always edge cases, but with some lateral thinking and innovative approaches, resolution may be at hand.

By developing an understanding of how the Snowflake Data Cloud can solve the problems we face, and by providing a platform, we deliver a springboard for future success.

The journey isn’t easy, and convincing our peers is equally hard, but the rewards as we progress, both in personal terms of our own development and organizations’ returns for the faith they place in us, are huge.

Thought leadership places a great burden upon us; identifying opportunities and having the courage and conviction to pursue them are equally important. The only way to be certain of failure is to not try. This book not only seeks to short-circuit your timeline to success but also provides tools to participate in the Snowflake Data Cloud revolution far sooner than you imagine.

This chapter overviews Snowflake’s core capabilities, including Secure Direct Data Share, Data Exchange, and Data Marketplace. With an understanding of each component, we then explain why we should integrate with the Snowflake Data Cloud.

Let’s approach the Snowflake Data Cloud by addressing the why, how, and what. In a TEDx Talk, Simon Sinek unpacked “How great leaders inspire action” ( www.youtube.com/watch?v=u4ZoJKF_VuA ). Please take 18 minutes to watch this video, which I found transformational. This book adopts Sinek’s pattern, with why being the most important question.

Snowflake documentation is at https://docs.snowflake.com/en/ . One legitimate reason for reading this book is to short-circuit your learning curve. However, there are times when there is no substitute for reading official documentation (which is rather good). We come to some of this later, but for now, at least you know where the documentation is.

Setting the Scene

Snowflake helps organizations meet the increasing challenges posed by integrating data from different systems. Snowflake helps organizations keep that data secure while providing the means to analyze the data for trends and reporting purposes. Snowflake is key to making data-driven decisions that drive future profitability for many organizations.

Almost without exception, large corporations have grown through mergers and acquisitions, resulting in diverse systems and tooling, all largely siloed around specific functions (such as human resources, finance, sales, and risk) or market alignments (such as equities, derivatives, and commodities). The organizational estate is further complicated by internal operating companies with less than 100% ownership and complex international structures subject to different regulatory regimes.

Growth by acquisition results in duplicate systems and data, often from different vendors using their own way of representing information. For example, Oracle Financials and SAP compete in the same space but have very different tooling, terminology, chart of accounts, mechanisms to input adjustments, and so forth.

At the macro and micro levels, inconsistency abounds in each line of business, functional domain, and lowest grain operating unit. We have armies of bright, intelligent, articulate people reconciling, mapping, adjusting, and correcting data daily, tying up talent and preventing our organizations from progressing.

Increasingly, our organizations are reliant upon data. For some, data is both their lifeblood and their energy. Recent mergers and acquisitions demonstrate that data provides a competitive advantage, increased market share, and an opportunity to grow revenue streams. One example is the London Stock Exchange Group's (LSEG) $27 billion acquisition of data and analytics company Refinitiv ( www.lseg.com/refinitiv-acquisition ) to become a major financial data provider. And there are other interesting acquisitions in the data visualization space: Google recently acquired Looker, and Salesforce acquired Tableau. Wherever we look, progressive, forward-looking companies are positioning themselves to both take advantage of and monetize their data, some with Snowflake at the center of their strategy.

These are not knee-jerk reactions. All three leading cloud providers are under pressure to provide integration pathways and tooling to enable low-friction data transfer and seamless data access, all with good reasons. Data is increasingly powering our economies, and the volume, velocity, and variety of data constantly challenge our infrastructure and capabilities to consume and make sense of it. We return to these themes repeatedly in this book and offer solutions on our journey to understanding the Snowflake Data Cloud.

The Published Research

If you need further convincing, a cursory examination of published research provides startling insight. Several studies shed light on why the Snowflake Data Cloud is important.

International Data Corporation (IDC) ( www.idc.com/getdoc.jsp?containerId=prUS46286020 ) stated that in 2020, more than 59 ZB (59 trillion gigabytes) of data was created, captured, copied, and consumed. Most of this is copied data (or worse, copies of copies), with the ratio of replicated data (copied and consumed) to unique data (created and captured) being approximately 9:1 and trending toward less unique and more replicated data. The ratio is soon expected to reach 10:1, and this growth is forecast to continue through 2024 with a five-year compound annual growth rate (CAGR) of 26%.

The Economist Intelligence Unit, in its Snowflake-sponsored report, “Data Evolution in the Cloud: The Lynchpin of Competitive Advantage” ( www.snowflake.com/economist-research/data-evolution-in-the-cloud-the-lynchpin-of-competitive-advantage/ ), states that 87% of the 914 executives surveyed agree that data is the most important differentiator in the business landscape today, and 50% of respondents frequently share data with third parties to drive innovation in products and services. In contrast, 64% admit their organization struggles to integrate data from varied sources. The 50% of organizations that frequently share data must be doing something right. How are they leveraging their infrastructure and tooling to share their data? Where did they draw their inspiration from? We return to these themes later, but an immediate conclusion we can draw is that someone, or some group, must have changed their thinking to embrace a new paradigm and found themselves in an environment where they could try something new. Likewise, the 64% who struggle to integrate data from varied sources are at least acknowledging they have a problem and, hopefully, looking for solutions. If your organization is in the 64%, keep reading, because out of the same sample, 83% agree their organizations will routinely use AI to process data in the next five years. Without integrating more disparate data, less complete and potentially inaccurate results will arise.

The Global Perspective

Our organizations are a microcosm of the global perspective. Whether or not we know or acknowledge our position in the digital economy, our organizations are subject to external market forces that compel change and adaptation to new ways of working. We need a global perspective to underpin our organization's success, and the Snowflake Data Cloud does just that: a single platform delivering out-of-the-box tooling that enables frictionless, seamless, and rapid data integration, with the opportunity to monetize our digital assets.

In every organization, whether immediately obvious or not, there are data silos: repositories we think may hold value but cannot access, data sets we are not allowed to see or use, and data sets hidden in plain sight, both the obvious and the not so obvious. Data silos are discussed in Chapter 2, where we start to challenge the received wisdom of tightly segregating our data. Continuing with our theme of asking why, there are typically two reasons we don't get answers to questions. The first is that knowledge is power; the second is that the answer is unknown.

The volume, velocity, and variety of data are ever-increasing. I touched upon these themes earlier and discuss them again in Chapter 10. As a sneak preview, that chapter also explains how to address some of the less obvious data silos. For the avoidance of doubt, it was my favorite chapter to write and the one with the most scope to unlock the potential of our organization's data. Bear with me, as there is much to understand before we get there (oh, all right, you skipped straight there).

Unavoidable in today's increasingly complex, international, and multi-jurisdictional organizations, data governance policies, processes, and procedures oversee and increasingly control every aspect of our organization's software development lifecycle (SDLC). How policies and procedures impact what, where, and how we can use our data is covered in Chapter 11. It's not all doom and gloom; there are very good reasons for data governance, and we lift the lid on some hidden reasons why we must comply.

We all know software is increasingly complex. Anyone using a desktop PC without deep knowledge of the underlying operating system, and without tools to keep process and memory hogs at bay, ends up upgrading hardware far more often than might otherwise be needed. Subtle software interactions and edge cases cause unforeseen scenarios, and we don't always have the tools to identify and remediate issues. Chapter 13 offers some ideas and concepts for remediating these challenges; not an easy subject to resolve, but we can do something to help ourselves.

Having read this, I must be clear: organizations cannot fund constant hardware, storage, network, and associated infrastructure upgrades. Increasingly, costs are associated with software patching, products that help us manage our infrastructure, and improvements to our security posture dealing with both current threats (what we know) and emerging threats (those we don't yet know about). All of this is without considering up-skilling and increasing our staffing levels while the regulatory environment tightens and imposes greater controls and reporting obligations.

Every day, we focus on what we need to do to keep the lights on. Our operational staff faces more demand from ever-growing data volumes but with the same people and infrastructure, constantly reacting and firefighting the same challenges daily. If we are fortunate, we have robust documentation explaining how we do our day-to-day jobs. But documentation only remains relevant if we devote sufficient skill, time, and energy to its maintenance. We all know the first thing to suffer in highly stretched teams is the documentation, offering a litmus test in and of itself.

If we are ever to break this never-ending cycle, we need to think differently, challenge the status quo, and quickly pivot to focus on why. Embracing automation, low-code tooling, reducing dependencies upon key people, and implementing appropriate controls and alerting mechanisms reduce the daily noise and allow us time to understand why.

We have the means at our fingertips to resolve some of these seemingly intractable problems. But we must start now because the longer we delay, the more the big issues compound, become harder to understand, and head toward an inflection point beyond which we will never recover. A friend described this as a diode (a discrete component that only conducts current in one direction) at some unknowable point on our upward trajectory of increasing complexity. Once past it, we can never go back because the complexity, inertia, complacency, and all the other reasons make it far too hard; the only way from this point onward is to start again.

Why Consider the Snowflake Data Cloud?

If we want a cloud-first strategy, we must look at the core capabilities required to re-home our data. Central to answering this question is identifying a platform that supports much of what we already know and use daily: a SQL-based engine of some sort, because SQL is the lingua franca of the databases that underpin most of our current systems. We cannot leave behind our rich legacy and skill sets but must leverage them while adopting a new paradigm.

Much has been made of data lakes using Hadoop, HDFS, MapReduce, and other tools. In certain cases, this approach remains valid. But largely, after more than five years of trying, initial promise, and in some cases huge financial investment, the results are generally disappointing. Many independent comparisons show Snowflake to be superior across a range of capabilities.

Naturally, no single offering will ever encompass all aspects of a cloud-first strategy, which must consider more than a single product and have wider utility; for example, implementing a data fabric. But there are good reasons for putting the Snowflake Data Cloud at the center of our cloud-first strategy. First and foremost, it is the only SaaS data warehouse built for the cloud with security baked in by design. In a book titled Building the Snowflake Data Cloud, you would expect Snowflake to be the first component in implementing our cloud-first vision, but not without sound reasons.

Having decided to investigate Snowflake, we should satisfy ourselves that the marketing hype lives up to expectations and that claims are verified. In our organization, a proof of concept (POC) was stood up from a standing start: a lone developer with zero prior knowledge of Snowflake (but with significant Oracle experience) was able, over seven weeks and using representative sample data sets, to deliver a robust, extensible application subsequently used as the basis for production rollout. The claim may be unbelievable to some, but I was the developer, and two years later, I am now responsible for running the department and production rollout.

For those still unsure and thinking maybe the author got lucky or the POC was flawed, challenge yourself to imagine “what if” the same could be repeated in your organization, dare to try Snowflake for yourself, and establish your own POC and measurement criteria. I don’t ask anyone to trust my word. Find out for yourself because facts and evidence are impossible to argue against. Anyone with a good understanding of relational database techniques and a firm grasp of SQL will find Snowflake readily accessible, and the learning curve is not as sharp as imagined. Moreover, Snowflake offers a free 30-day trial with sufficient credits to run a credible POC. There is no excuse for not trying.

This book challenges established thinking. Our current tooling perpetuates long-established, entrenched thinking, which does not work in the new data-driven paradigm. Something must give: either we continue to spend vast sums of money supporting the status quo, or we find another way. The same rationale must underpin our decision to investigate the Snowflake Data Cloud.

Benefits of Cloud Integration

I have already discussed the broader themes leading to the inevitable conclusion that we must change our thinking and approach to take advantage of the data-driven paradigm before us. This section identifies tangible benefits of cloud integration, drawing parallels between cloud service providers and current on-premise implementations. I offer this perspective as the outcomes underpin the adoption of the Snowflake Data Cloud.

Following the theme of why, how, and what, I first offer deeper insight into why we should move to a cloud provider. Naturally, these themes cover both how and what. According to an International Data Group (IDG) research report (sponsored by Accenture AWS Business Group), benefits accruing from cloud integration include a 45% reduction in infrastructure and storage costs, a 53% increase in typical operational efficiency, and 43% better utilization of data. This is significant, bearing in mind our previous observation of 9:1 copies of data to original data sets. Perhaps not surprisingly, 45% greater customer satisfaction is also recorded.

Let’s now look at some more tangible benefits the Snowflake Data Cloud offers.

Hardware and Software

The most obvious difference between cloud and on-premise (on-prem) is hardware provisioning, with software provisioning a close second. Most organizations have two or more data centers filled with hardware, with high-bandwidth network connections facilitating both application failover and disaster recovery in the event of a server or, worst case, data center outage.

Hardware, air conditioning, physical security, staff, UPS, halon fire protection, miles of networking cables, switches, routers, racking, test equipment, servicing, repair, renewal, and replacement: the list is almost endless. Each item carries a fixed, ever-depreciating cost, and the upgrade cycle is endless and increasingly demanding.

With the cloud, organization-owned data centers are no longer required. Cloud service providers (CSPs) provision everything, available on demand, 24×7×365, infinitely scalable, fault tolerant, secure, maintained, and patched. While we pay for service provision, we do not have the headache of managing the infrastructure. Moreover, apart from storage costs, we (mostly) only pay for what we consume. Storage costs are very low: Snowflake simply passes through the CSP storage charges and, depending upon location, at the time of writing, costs vary from $23 to $24.50 per terabyte per month.
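Storage consumption is also easy to inspect from within Snowflake itself. The following is a minimal sketch, assuming access to the built-in SNOWFLAKE.ACCOUNT_USAGE views and using a nominal $23 per terabyte rate; your contracted rate and region may differ.

-- Approximate monthly storage cost from the account usage views.
SELECT usage_date,
       (storage_bytes + stage_bytes + failsafe_bytes) / POWER(1024, 4) AS total_tb,
       ROUND(total_tb * 23, 2) AS approx_monthly_cost_usd  -- nominal rate, illustrative only
FROM   snowflake.account_usage.storage_usage
ORDER BY usage_date DESC
LIMIT  10;

At these rates, even 10 TB of compressed storage costs roughly $230 per month, trivial compared to the equivalent on-prem provisioning.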

Figure 1-1 shows the relative costs of retaining fixed on-prem infrastructure compared to moving to cloud-based elastic provisioning.
Figure 1-1

Data cloud adoption trends

Performance

On-prem dedicated hardware is always limited to the physical machines: CPU, memory, and disk. Regardless of whether the servers are in use, we continue to pay for their provision, yet servers are typically utilized only about 5% of the time. The same is true of network bandwidth; every organization has a limit to the amount of data carried on its internal network, with utilization peaks and troughs.

With cloud-based applications, instant performance scaling and per-second costing provide performance when required, for only as long as required, with predictable costs, and then scale back. We pay for what we consume, not what we provision, and have both elastic performance and storage at our fingertips. We don’t incur network bandwidth charges for cloud-to-cloud connectivity as data transits the public Internet; however, we may pay cloud provider egress charges.
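To make elastic scaling concrete, here is a minimal sketch of Snowflake warehouse (compute) management; the warehouse name and sizes are illustrative.

-- Create a small warehouse that suspends itself when idle, so we stop paying.
CREATE WAREHOUSE IF NOT EXISTS reporting_wh
  WAREHOUSE_SIZE      = 'XSMALL'
  AUTO_SUSPEND        = 60      -- suspend after 60 seconds of inactivity
  AUTO_RESUME         = TRUE    -- resume transparently on the next query
  INITIALLY_SUSPENDED = TRUE;

-- Scale up for a heavy batch run, then scale straight back down.
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'XLARGE';
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'XSMALL';

Per-second billing (after a 60-second minimum) means the XLARGE warehouse is only charged for the time it actually runs.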

When hardware failures occur, they are invisible to us. We benefit from always-on, infinite compute and infinite storage, elastically provisioned on demand. We do need to pay attention to authentication and network transport layer security, all discussed later.

Staffing

The highest costs to an organization are its people. With on-premise hardware, we incur the costs of security guards, physical installation, specialists to perform operating system and application patching, servicing, PAT testing, and the management needed to run teams.

With the cloud, costs are significantly reduced, but not to zero; we need more cloud security specialists instead. However, the cost savings are significant when compared to on-prem equivalents.

The COVID-19 pandemic has shown that our organizations can operate with at least the same or higher efficiency than previously thought possible, with reduced office capacity needed to support staff who can work remotely and connect to cloud services natively.

Control

With on-prem implementations, each organization has exclusive control of the hardware and data in its own domain. Total control is physically expressed in the infrastructure provisioned.

Cloud implementations have a different perspective. Absolute control is only as good as the security posture implemented, particularly in a shared environment such as may be provisioned for managed services. Across our estate, both on-prem and cloud, the security boundaries must remain impenetrable, and this is where our cybersecurity team is most important.

We also realize benefits from seamless patching and product updates applied "behind the scenes," invisible to us. Vendors release updates without downtime or outages, leading to higher uptime and availability. While we give up some administrative oversight, we benefit from integrated, joined-up, automated delivery throughout all our environments.

Another benefit is the feature-rich cloud environment offering built-in tooling and virtual hosting for third-party tools and applications. Driven by the need to increase market share, cloud providers continually improve their tools and integration pathways, enabling developers to innovate, reduce time to market, and find solutions to difficult business problems.

Data Security

Data security is of paramount importance to cloud-based solutions. Snowflake has a comprehensive suite of security controls—an interesting subject, which is discussed later.

Typically, each business owner determines the business data classification of each attribute in a data set, with specific protections mandated for each classification. For on-prem data that never leaves the internal network boundaries, security considerations do not arise to the same extent as they do for cloud, where data in transit is expected to be encrypted at the connection level (conforming to TLS 1.2, for example) and data at rest encrypted (AES-256, optionally with Bring Your Own Key, for example). We discuss data security in more detail later, but for now, it is enough to note the distinction between on-prem and cloud-hosted solutions.

Compliance

For on-premise solutions, most regulators are content with knowing data is held in their own jurisdiction. But for cloud-based solutions, some regulators apply additional scrutiny on a per data set basis before allowing data onto a cloud platform. Additional regulation is not identical between regulators; therefore, an individual approach must be adopted to satisfy each regulator’s appetite.

Additional governance may apply before approval is granted to move artifacts to the cloud, particularly where data is hosted in a location outside of the governing authority's jurisdiction. This is covered later, though it is important to note data governance and controls follow the data, not the cloud, ensuring we protect the data while maintaining the highest levels of security and compliance.

Data Volumes

You have seen from the research cited earlier that the ratio of data copies to original data is currently around 9:1 and expected to increase to 10:1 soon. These alarming figures hide many underlying issues. Replicated data sets held as pure copies (or copies of copies) result in an inability to reconcile across data sets due to differing extract times and SQL predicates.

Unreconcilable data leads to management mistrust and a desire to build siloed capabilities because "the central system must be wrong." You can see the problem emerging, particularly where manual out-of-cycle adjustments are made, further calling data provenance and authenticity into question.

Other direct consequences arise with increased data storage requirements and higher network bandwidth utilization as we shuffle data to and fro while requiring more powerful CPUs and memory to process data sets. There are others, but you get the idea.

With alarming regularity, new data formats, types, and challenges appear. And the velocity, volume, and variety of information are increasing exponentially. With the Snowflake Data Cloud, many of these challenges disappear. I will show you the how and the what.

Re-platforming to Cloud

If we simply want to reduce our hardware, infrastructure, and associated costs, can we simply “lift and shift” our technical footprint onto a cloud platform?

In short, yes, we can port our hardware and software to a cloud platform, but this misses the point. While superficially attractive by allowing the decommissioning of on-premise hardware, performing "lift and shift" onto the cloud only delays the inevitable. We still require specialist support staff and cannot fully take advantage of the benefits of re-platforming.

For example, imagine porting an existing data warehouse application to the cloud using "lift and shift." We can decommission the underlying hardware and gain the advantage of both elastic storage and CPU/memory. However, we must typically bounce our applications when making configuration changes, and we require the same support teams while introducing the added complication of cybersecurity to ensure our new implementation is secure.

In contrast, if we were to re-platform our data warehouse to Snowflake, we would immediately reduce our platform support requirements almost to zero. Elastic storage is automatically available; no configuration is required. CPU/memory configuration changes no longer require system outages, allowing us to break open our data silos. And we can still decommission our on-premises hardware. Both approaches require cybersecurity to ensure our configurations are safe. I show you how in Chapter 4.

In summary, re-platforming to the cloud buys us little at the expense of introducing cloud security concerns, and it effectively reinforces existing silos because our data remains locked up in the same platforms.

Where Is the Snowflake Data Cloud?

The Snowflake Data Cloud is not a single physical place. Rather, it is a collection of endpoints on the Internet, hidden in plain sight among various CSPs.

The chances are you have accessed many cloud platforms already just by browsing the Internet. Snowflake has chosen the top three CSPs to build the Snowflake Data Cloud.
  • Amazon Web Services (AWS)

  • Microsoft Azure (Azure)

  • Google Cloud Platform (GCP)

Figure 1-2 shows where Snowflake has established its presence. Note that locations are added from time to time.
Figure 1-2

Snowflake Data Cloud locations

Several other cloud providers exist, typically focused on specific applications or technologies, whereas the preceding three clouds are relatively agnostic in their approach.

In addition to the cloud providers, we describe the legacy infrastructure in our data centers as on-premise (or on-prem for short). Along with that term, here is some additional terminology you should know.
  • Connectivity between any cloud location and on-premise is North/South. North is in the cloud, and South is on-prem.

  • Connectivity between any two cloud locations (regardless of platform) is East/West.

The distinction in connectivity terminology becomes important for several reasons.
  • Establishing East/West connectivity is harder than it first appears for various reasons discussed later.

  • East/West bandwidth consumption is largely irrelevant, except for egress charges, discussed later.

  • North/South connectivity is usually easier to establish than East/West and is usually established via closed connectivity (e.g., AWS Direct Connect).

From the end-user perspective, the implication of deploying applications to the cloud is that physical location is irrelevant. From a technologist's perspective, it matters less than it used to, but the considerations for the cloud are very different from those for on-prem. What matters more is security, a subject we return to later.

Snowflake Data Cloud Features

Built from the ground up for the cloud, Snowflake Data Warehouse has been in development since 2012 and publicly available since 2014, becoming the largest software company to IPO in the United States when it launched on the New York Stock Exchange in September 2020.

Snowflake has security at its heart; from the very beginning, it has been security-focused. Offering exceptional performance compared to a traditional data warehouse, Snowflake is both highly scalable and resilient. It is implemented on all three major cloud platforms (AWS, Azure, and GCP), is equally interoperable across all three, and is supported centrally. There are no installation disks, periodic patching, operating system maintenance, or highly specialized database administration tasks. These are all performed seamlessly behind the scenes, leaving us to focus on delivering business benefits and monetizing our data, which is what this book is all about.

Snowflake supports ANSI standard Structured Query Language (SQL) , user-defined functions in both Java and Scala with Python soon to follow, and both JavaScript and SQL stored procedures. Almost everything you have learned about SQL, data warehousing, tools, tips, and techniques on other platforms translates to Snowflake. If you are a database expert, you have landed well here.
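As a flavor of how familiar this is, here is a minimal sketch of a SQL user-defined function and a JavaScript stored procedure; the names and logic are purely illustrative.

-- A SQL UDF: callable from any query, just like a built-in function.
CREATE OR REPLACE FUNCTION gb_to_tb(size_gb FLOAT)
  RETURNS FLOAT
  AS 'size_gb / 1024';

SELECT gb_to_tb(2048);  -- returns 2

-- A JavaScript stored procedure (argument names are uppercase inside JS).
CREATE OR REPLACE PROCEDURE echo_message(msg STRING)
  RETURNS STRING
  LANGUAGE JAVASCRIPT
  AS
  $$
    return "Received: " + MSG;
  $$;

CALL echo_message('hello');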

Snowflake is used ubiquitously across many industry sectors due to its excellent performance. New uses are emerging as the velocity, volume, and variety of information increases exponentially, and businesses search for competitive advantage and opportunities to monetize their data.

Business leaders looking to improve their decision-making process by faster data collection and information delivery should consider Snowflake. Improving knowledge delivery and increasing wisdom enables decision makers to set the trend instead of reacting to the trend. Imagine what you would do if you had knowledge at your fingertips five minutes before your competitors; Snowflake offers you the opportunity to turn this into reality. I’ll show you how later.

Snowflake is the only data platform built for the cloud. Natively supporting structured, semi-structured, and (soon) unstructured data, with a plethora of built-in tooling to rapidly ingest, transform and manipulate data, Snowflake delivers.

Having identified many differences between cloud and on-premises implementations and the benefits of moving to the cloud, let’s now discuss how the Snowflake Data Cloud benefits your organization.

The Snowflake Data Cloud enables secure, unified data across your organization and those you choose to collaborate with, resulting in a global ecosystem where participants choose their collaboration partners and effortlessly both publish and consume the data sets and data services of their choice. Huge and ever-increasing quantities of data can easily and rapidly be connected, accessed, and consumed, and value extracted.

When properly implemented, your organization realizes the benefits of broken-down data silos and experiences greater agility, enabling faster innovation. Extracting value from your data becomes far easier, leading to business transformation and opportunities to monetize your data.

There is no mention of hardware, platforms, operating systems, storage, or other limiting factors. Just pause for a moment and think about the implications. The vision is extraordinary, representing the holy grail every organization has searched for in one form or another since before the advent of the Internet age. How to seamlessly access cross-domain data sets, unlock value, find new markets, and monetize our data assets.

The Snowflake Data Cloud is the place with secure, unified data, seamlessly connected, available where and when we need it. If we get our implementation right, the Snowflake Data Cloud gives us a single source of truth at our fingertips, with the ability to go back in time up to 90 days. Seamlessly, out of the box.
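That time-travel capability is built in. A minimal sketch follows, with an illustrative table name; note that retention beyond one day, up to the 90-day maximum, requires Enterprise Edition or above.

-- Extend the retention window for a table (Enterprise Edition for > 1 day).
ALTER TABLE sales SET DATA_RETENTION_TIME_IN_DAYS = 90;

-- Query the table as it looked one hour ago.
SELECT * FROM sales AT(OFFSET => -60*60);

-- Recover a table dropped by mistake.
UNDROP TABLE sales;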

Business Lifecycle

Evidence for Snowflake being at the bottom of the growth stage is not hard to find. Look at Figure 1-3, which illustrates the standard product lifecycle and the stages every product goes through. Supporting my assertion, financial statements are at www.macrotrends.net/stocks/charts/SNOW/snowflake/financial-statements . Snowflake is a unicorn by any standard; big institutional investors at IPO included Berkshire Hathaway and Salesforce.
Figure 1-3

Business lifecycle

The six-year pre-IPO period might be regarded as the introduction stage in the diagram, with low sales, high costs, and low profits. The growth stage started post seed funding, with increased sales, reduced costs, and profits. Snowflake is certainly not into the maturity phase; the sheer volume of new features and product enhancements demonstrates Snowflake is ever-growing in capability and scope.

Snowflake published the numbers shown in Figure 1-4. Draw your own comparisons. The latest figures are at https://investors.snowflake.com/overview/default.aspx .
Figure 1-4

Snowflake published company highlights

Diffusion of Innovations

Snowflake’s innovative design separating storage from computing is a paradigm shift from old thinking, representing a clean break from the past. Snowflake architecture is explained in Chapter 3.

When presented with new ideas, we are often challenged to accept new thinking. The Diffusion of Innovations theory provides some answers by breaking down the population into five distinct segments, each with its own propensity to adopt or resist a specific innovation. Figure 1-5 presents the bell curve corresponding to adoption or resistance to innovation. I leave it to you to determine where your organization falls.
Figure 1-5

Diffusion of Innovations

Snowflake adoption is no different. In Figure 1-5, Snowflake sits somewhere to the left of the early adopters' profile. This is where those of us looking to make a strategic leap forward find ourselves: an opportunity to embrace a rapidly maturing product, enabling the Snowflake Data Cloud and providing another opportunity to monetize our data.

On the left, the innovators, 2.5% of the population, have already adopted Snowflake, endured the challenges of dealing with an immature product, unknowingly conducted user acceptance testing, and suffered sleepless nights creating workarounds for previously broken features—the list goes on.

The next 13.5% are early adopters. If you have read this far, you are likely one of them. We take advantage of the hard work the innovators have put into Snowflake and seize the day, grasp what is before us, and move forward.

At our fingertips is Snowflake, a mature product of which the next 34%, the early majority, are just becoming aware. It offers a window of opportunity to create the greatest commercial advantage and develop into new markets to monetize our data.

Let's pause our discussion on Diffusion of Innovations here; few want to be in the late majority, and fewer still among the laggards, but this is where 50% of the population find themselves, whether they know it or not.

Better to be early than just on time, never to be late.

Future State Enabled

Our world is changing in often unpredictable ways. For example, face recognition is now commonplace but was unthinkable ten years ago, with a consequential rise not only in the volume but also the velocity of data. By the time you read this book, Snowflake will have released support for unstructured data (a subject for later in this book), something mainstream databases struggle to deal with.

And new challenges arise. If you want to extract billing information from a photographed invoice, Snowflake has the tools to do just that. If you want to save paragraphs from a Microsoft Word document into a structured data format, Snowflake can do that too. I will show you how to do both in Chapter 9.

These represent microcosms of data sets requiring particular tools and techniques to process, and Snowflake has the answers. New features arrive all the time; as I wrote this chapter, object tagging became available (see Chapter 11 for details).

Emerging themes on future data growth (all discussed later) include the Internet of Things (IoT) generating sensor data, imaging, semi-structured records, and more; big data and analytics showcasing how we can make sense of these huge volumes of data; and machine-learning decision-making affecting our everyday lives.

First-Mover Advantage

Combining the Diffusion of Innovations theory with the business lifecycle gives us the knowledge to understand that we have a window of opportunity and first-mover advantage. We can identify and create new markets, be the first entrants into those markets, develop brand recognition and loyalty, and gain competitive advantage: in short, a marketer's dream.

But time is not on our side. Your organization’s competitors have already done their research and begun executing their strategies. Some are reading this book just as you are.

Your Journey

Your journey to the cloud is just that—a journey with some lessons to learn. Some are outside of the scope of this book, but many are illustrated here.

Not only are there significant cost savings, performance benefits, and flexibility to be gained, but also wider security considerations to address. And for those who enjoy technology, Snowflake has plenty to offer with an ever-expanding feature set delivered faster than most of us can assimilate change.

How Snowflake Implements the Data Cloud

For all approaches, it is important to understand the data owner always remains in full control. Role-based access control (RBAC) is discussed in Chapter 5, but for a sneak preview, everything in Snowflake is an object subject to RBAC, including secure direct data shares and the objects contained therein.

In other words, Snowflake is inherently highly secure from the ground up. Some of these features are discussed in Chapter 4. I will later show you how to configure Snowflake security monitoring, exceeding Snowflake recommendations and exposing some underlying features.
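As a taste of what that security monitoring looks like, the following sketch reports recent failed login attempts from the built-in account usage views (which can lag by up to around two hours); the seven-day window is an arbitrary choice.

SELECT user_name,
       error_message,
       COUNT(*) AS failed_attempts
FROM   snowflake.account_usage.login_history
WHERE  is_success = 'NO'
AND    event_timestamp > DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY user_name, error_message
ORDER BY failed_attempts DESC;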

Figure 1-6 illustrates the options available to us, along with some background information. Each option is briefly explained next.
Figure 1-6

Snowflake Data Cloud options

Global Data Mesh

The Snowflake infrastructure underpinning the Snowflake Data Cloud is its global data mesh . Snowflake addresses the challenges of building data platforms in unique ways which readily facilitate the delivery of robust data architectures.

With some prior thought on how RBAC enables data ownership and access, adherence to published standards for data ownership, and clarity of thought on logically and physically separating data sets, delivery of the Snowflake Data Cloud is readily achievable. But only if we do things the right way from the outset or, more probable for organizations with an existing data lake, develop a plan to bring structure and governance together along with clear direction to remediate the past.

Engagement Criteria

To successfully create our contribution to the Snowflake Data Cloud, we must adopt a domain-based approach to data ownership. In practice, we must know "who owns what" and enforce data ownership as a toll-gate before allowing data onto our platform. This is not as easy as one might think: our organizations are becoming more tightly regulated, and while it is attractive to "just throw the data in," any audit or governance function wants to know "who owns what." Better to set the standard at the outset rather than suffer the consequences later.

With multiple data domains in our data cloud, new opportunities exist for cross-domain reporting. Subject to correctly segregating our data using RBAC, we simply need to grant roles to users to enable data usage. Simple, right? No, not simple in practice because we may not want to grant access to all data in the domain, so we must restrict data to the minimal subset required to satisfy the business usage. And Snowflake has the tools to do this. Finer-grained RBAC, row-level security, object tagging, and the creation of custom views are some techniques we use.
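A minimal sketch of the pattern follows: a secure view exposes only the subset required, and access flows through a role. All object, role, and user names are illustrative; row access policies and object tagging are covered later.

CREATE ROLE IF NOT EXISTS equities_reader;

-- Expose only the columns and rows the business usage requires.
CREATE OR REPLACE SECURE VIEW finance.trading.equities_summary AS
SELECT trade_date, ticker, total_value
FROM   finance.trading.trades
WHERE  asset_class = 'EQUITY';

-- Grant the minimum privileges needed to use the view, then grant the role.
GRANT USAGE  ON DATABASE finance                          TO ROLE equities_reader;
GRANT USAGE  ON SCHEMA   finance.trading                  TO ROLE equities_reader;
GRANT SELECT ON VIEW     finance.trading.equities_summary TO ROLE equities_reader;
GRANT ROLE equities_reader TO USER jdoe;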

As I will unpack shortly, data discovery and self-service are central to Snowflake Data Cloud capability.

Secure Direct Data Share

Secure Direct Data Share is a built-in, highly secure core Snowflake capability reliant upon cloud storage to share data with other Snowflake customers; that is, consumers both inside and outside your immediate organization. We unpack Secure Direct Data Share in Chapter 14 with hands-on examples.

With Secure Direct Data Share, the data owners decide the data sets to share, and technologists provide the implementation. At all times, full control is maintained over which external customers have read-only access. Addressing “who can see what,” Snowflake is extending its monitoring capabilities to provide metrics on consumption. Chapter 14 provides a template.

Data is maintained in real time. As we update the source object data, our customers see the same changes in real time as they occur. Data sharing occurs behind the scenes. Once we declare the objects and entitle customers to access the share, our clients consume it; it is as simple as that. No more SFTP, data dumps, process management, and other cumbersome data interchange scheduling and tooling.
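To give a feel for how little is involved, here is a minimal provider-side sketch; the database, schema, table, share, and consumer account names are all illustrative. Chapter 14 walks through the real thing.

-- Create the share and grant it read access to the objects we publish.
CREATE SHARE reference_data_share;
GRANT USAGE  ON DATABASE ref_db                TO SHARE reference_data_share;
GRANT USAGE  ON SCHEMA   ref_db.public         TO SHARE reference_data_share;
GRANT SELECT ON TABLE    ref_db.public.country TO SHARE reference_data_share;

-- Entitle a consumer account; it sees live, read-only data with no copies made.
ALTER SHARE reference_data_share ADD ACCOUNTS = ab12345;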

Costs differ according to CSP and actions performed, noting as Snowflake product capabilities evolve, the costing model may also change. For further up-to-date information, see the documentation at https://docs.snowflake.com/en/user-guide/billing-data-transfer.html#understanding-snowflake-data-transfer-billing .

Suitable for all applications, Secure Direct Data Share explicitly controls which customers have access to shared data and is a great way to securely and safely monetize your data while retaining full control.

Many organizations suffer from incomplete, incorrect, divergent reference data, where inconsistencies lead to rejected records, unmatched attributes lead to missing content, and, over time, mistrust of the system slowly builds.

If we could address this one seemingly simple (it is not simple) issue across the many hundreds or thousands of applications in our organizations by sourcing our reference data once and distributing it across all our applications seamlessly, we would soon enjoy the benefits of corrected, conformed data that can more easily be joined because many more attribute values match.

And there are more opportunities; the story just gets better.

Data Exchange

Utilizing Snowflake's built-in data sharing capability, Data Exchange is an organization's own hub where data is published to consumers, that is, people and organizations both inside and outside the publishing organization's boundary. Data Exchange is also where organizations and individuals discover available data sets. We unpack Data Exchange with hands-on worked examples in Chapter 14.

Data sets controlled by business owners are published for consumption by others, either by approval-based subscription or freely available to all. Truly the beginnings of data democratization, data publishing and data discovery allow everyone to fully participate; they are both able and entitled to access data sets. Moreover, the silos begin to crumble, allowing data discovery, enrichment, and utilization in previously unthinkable ways.

Marking a significant step forward in an organization's capability to seamlessly interact, Data Exchange quickly allows subscribers to access their entitled data sets. Time to market for implementing data integration between organizations is slashed to hours, or at most days: no more SFTP, authentication, firewalls, or handshaking, and previous impediments disappear.

Central to the concept of data sharing is security. With Data Exchange, the data owner retains complete control of their own data sets and can curate, publish, manage, and remove entitlement at will. Naturally, a full, immutable audit trail retained for one year is available; this is discussed in more detail later.

Snowflake Marketplace

Organizations use Replication and Secure Direct Data Sharing capabilities to create listings on the Snowflake Marketplace, the single Internet location for seamless data interchange.

Snowflake Marketplace is available globally to all Snowflake accounts (except the VPS account, a special case) across all cloud providers.

Because Snowflake Marketplace uses Secure Data Sharing, the capability, security, and real-time features remain the same. The integration pattern differs in that replication, not a direct share, is used as the integration mechanism.

Snowflake Marketplace supports three types of listings.
  • Standard: As soon as the data is published, all consumers have immediate access to the data.

  • Purchase with a free sample: Upgrade to the full data set upon payment.

  • Personalized: Consumers must request access to the data for subsequent approval, or data is shared with a subset of consumer accounts.

However, there is a handshake between subscriber and provider, as information may be required to subscribe, and data sets may require pre-filtering, resulting in custom data presentation.

Once complete, the source database is replicated, becoming available to import into the subscribing account. Both options require local configuration to import the database, which becomes accessible similarly to every other database in the account.
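On the consumer side, that local configuration amounts to mounting the share as a database; a minimal sketch follows, with illustrative provider account, share, role, and object names.

-- Mount the provider's share as a local, read-only database.
CREATE DATABASE vendor_data FROM SHARE provider_acct.reference_data_share;

-- Entitle a local role to query the shared objects.
GRANT IMPORTED PRIVILEGES ON DATABASE vendor_data TO ROLE analyst;

-- From here, it behaves like any other database in the account.
SELECT * FROM vendor_data.public.country;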

Who Is Using the Snowflake Data Cloud?

With a very low barrier to entry, several organizations, including Refinitiv, FactSet, S&P Global, and Knoema, already use Snowflake to make their data available and are monetizing their assets, each offering a free trial enabling future clients to "try before you buy." This is data democratization at its best, enabling both individuals and organizations to decide for themselves which data sets best suit their needs.

Addressing the COVID-19 pandemic, many vendors offer relevant data sets enabling rapid dissemination and accrual of differing perspectives. Combined with machine learning and AI, unprecedented opportunities abound.

The variety of organizations using Snowflake Marketplace is growing rapidly, with free participation.

Beginning Your Snowflake Journey

The process of migrating to the cloud does not happen overnight. What you have read is thought-provoking and groundbreaking, and it significantly impacts your organization.

You may find your organization is unprepared for the journey, and you may need to showcase features as they are developed to expose the benefits.

The Snowflake Data Cloud is not a single destination. It is a journey. Features are constantly being developed, improved, and enhanced. Our job is to show others the way and lead using the examples found in this book while building out proof of concepts showcasing evolving Snowflake capabilities.

Summary

This chapter began by setting the scene, outlining problems every organization has in scaling to meet the torrent of data, and offering ways to mitigate and reduce cost while expanding capability.

You looked at the emerging data landscape, the changing nature of data, and the increasing storage costs inherent in holding multiple copies of data.

Introducing the Snowflake Data Cloud as the answer to many issues we face in our organizations set us on our journey to realize tangible benefits, not just in cost savings but also in future-proofing our organization by "right-platforming" for the future.

Practical experience shows that the importance and value of showcasing technical capability and conducting a "hearts and minds" campaign with our business colleagues cannot be overstated. Remember our discussion on Diffusion of Innovations: we must endeavor to "shift our thinking to the left" to help our colleagues embrace a new and unfamiliar paradigm.

And having established the right mindset in preparation for looking deeper into our data silos, let’s open the door to Chapter 2.
