
3. Architecture

Andrew Carruthers, Birmingham, UK
 

Understanding how Snowflake works helps you develop your strategy for implementing your contribution to the Snowflake Data Cloud. Snowflake is optimized for data warehouses but not OLTP. The distinction is clearly explained in this chapter. For the curious, search for micro-partitions.

Information is presented early in this book to clarify what is happening “under the hood” as you work through the sample code presented in later chapters. This chapter begins our deep dive into technical matters, focusing on some items that are not immediately obvious. I aim to pitch the information at a high enough level to satisfy the curious by asking what, how, and why, later delving into some obscure, less well-understood parts of Snowflake to assuage the thirst of the technophiles reading this book.

It should be said this chapter is not intended to provide comprehensive product coverage. It is written from my experience learning Snowflake and based on prior knowledge of Oracle. Therefore, this chapter should be read in the context of extending your presumed technical knowledge into the new paradigm Snowflake offers and drawing out information that I, in hindsight, would have benefitted from when learning Snowflake.

Understanding how Snowflake is built from the ground up to take full advantage of cloud capabilities helps you later understand how we integrate with the Snowflake Data Cloud.

Snowflake Service

We start our investigation into Snowflake architecture in the same place as almost every other Snowflake introductory presentation—by examining the three discrete layers implemented by Snowflake: cloud services, multi-cluster compute, and centralized storage. Each layer provides different functionalities and capabilities to Snowflake. We examine each separately. Snowflake introduces its key concepts at https://docs.snowflake.com/en/user-guide/intro-key-concepts.html .

The separation of storage and compute uniquely differentiates Snowflake from its competitors. As we work through this chapter, you learn how this fundamental architectural difference is a game-changer.

An important takeaway is the cloud-agnostic layer. Snowflake clients choose which of the three supported cloud platforms (AWS, Azure, Google Cloud) hosts their Snowflake VPCs. Although all cloud platforms provide the same capability, GCP lags a little behind AWS and Azure for both physical region availability and product feature implementation at the time of writing this book. The Snowflake Status site indicates not only Snowflake’s operational status but also physical region availability (see https://status.snowflake.com ).

Nor are we restricted to a single cloud service provider (CSP). We could host Snowflake in all three clouds and seamlessly move data between them. Later, you see some potential issues that arise with moving data between CSPs. We might also prefer to balance service provision across two or all three CSPs to reduce our application concentration risk or data gravity exposure to a single cloud provider.

The bulk of this book is based upon AWS since this is where my primary expertise lies. Almost all the written materials are transportable. Substitute Azure Blob Storage or Google Cloud Storage for Amazon Simple Storage Service (Amazon S3). There are syntax changes required later when we write some code, but these should be easily understood and resolved.

Snowflake architecture is a hybrid of shared-disk and shared-nothing architectures, with a centralized storage repository for persisted data accessible from all multi-cluster compute nodes in the account, represented by the intersection between centralized storage and multi-cluster compute, as illustrated in Figure 3-1. This means your data is always available for processing in a consistent state while each compute node operates independently.
Figure 3-1

Snowflake architecture

Cloud Services Layer

Cloud services ( https://docs.snowflake.com/en/user-guide/intro-key-concepts.html#cloud-services ) provide coordination services across all Snowflake activities, including access control, authentication (who can log in, and what can they do), infrastructure management, metadata management, query compilation, query planning, and query optimization.

Multi-Cluster Compute

Multi-cluster compute, also known as virtual warehouses ( https://docs.snowflake.com/en/user-guide/intro-key-concepts.html#query-processing ), lets us choose the number of processors and the amount of memory to allocate to processing our queries. The Snowflake approach starkly contrasts with the plethora of well-understood databases in our organizations, such as Oracle and SQL Server, where fixed numbers of processors and fixed amounts of memory are available. Highly scalable, Snowflake provides tee-shirt-sized warehouse templates to select from, as well as clustering. Snowflake delivers the performance required when we need it, on demand and subject to the parameters set for the warehouse, and scales automatically as our workloads increase.

Note

Each Snowflake warehouse is a named wrapper for a cluster of servers with CPU, memory, and temporary storage (SSD).

All queries require a warehouse to be declared. Queries whose results can be satisfied from cache or metadata do not consume Snowflake credits; all other query invocations do. More information is at https://docs.snowflake.com/en/user-guide/what-are-credits.html#what-are-snowflake-credits .
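
As a minimal sketch of the DDL involved, assuming a hypothetical warehouse name (demo_wh), a warehouse might be declared and selected as follows.

CREATE WAREHOUSE IF NOT EXISTS demo_wh
  WAREHOUSE_SIZE      = 'XSMALL'
  AUTO_SUSPEND        = 60      -- suspend after 60 seconds of inactivity
  AUTO_RESUME         = TRUE    -- resume automatically when a query arrives
  INITIALLY_SUSPENDED = TRUE;   -- consume no credits until first use

USE WAREHOUSE demo_wh;

SELECT CURRENT_WAREHOUSE();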

Database Storage

Database storage maps to CSP fundamental storage (Amazon S3, Azure Blob Storage, Google Cloud Storage). Regardless of the cloud provider, Snowflake manages all storage interactions for data warehouse core operations transparently through SQL. More information is at https://docs.snowflake.com/en/user-guide/intro-key-concepts.html#database-storage .

Snowflake passes through storage charges from the cloud provider, which vary according to region and cloud provider; nominally for AWS, and depending upon the region, approximately $23 per terabyte. Note that additional storage charges are incurred by the Time Travel and Fail-safe features.

Later, you learn how to view the underlying storage for database objects, but for now, it is enough to know Snowflake uses our chosen cloud provider storage to hold our data in a highly secure manner . An infinite amount of storage is available from the moment we log in. There is no need to ask Snowflake for more; storage is always available.

Snowflake Provisioning

Snowflake first establishes its technical footprint across the three supported clouds in disparate physical locations. For more information, please see the following .
For AWS and Azure, the Snowflake account provisioned is functionally equivalent with minor syntax changes for certain commands. At the time of writing, GCP Snowflake account features lag those for AWS and Azure. However, all Snowflake accounts are provisioned according to a common pattern, as shown in Figure 3-2.
Figure 3-2

Snowflake provisioning

Some clues in Figure 3-2 later aid our understanding. First, storage is outside the multi-tenant pod but still in the Snowflake deployment, meaning storage management is devolved from a single Snowflake account, making storage available across multiple Snowflake accounts. This approach underpins shares. Second, the metadata store is held in a key-value repository supplying the global services layer, discussed next. Lastly, the diagram calls out the virtual warehouse (your Snowflake account), the highest-level container available in our Snowflake provision.

Global Services Layer

Underpinning Snowflake and implementing the global services layer is FoundationDB, an open-source key-value data store well suited to OLTP style data management for high-speed read/write and data manipulation. FoundationDB holds our Snowflake account metadata, that is, information about every object, relationship, and security feature, the catalog which documents and articulates our account. Regardless of the cloud provider chosen, the global services layer implementation is consistent across all clouds, ensuring Snowflake interoperability.

From time to time, we need to access the global services layer. Some application development features, for example, used in security monitoring, can only be metadata driven if we can programmatically access our metadata. I explain more in Chapter 6, which dives into the technical details.

Not only do we need to access the global services layer for our application programming needs, but Snowflake capabilities also use the metadata to drive built-in functionality. Every aspect of Snowflake relies upon FoundationDB, so we must understand how to interact with our metadata using the tools available.

Security monitoring is covered in Chapter 4, but I want to discuss it briefly now. FoundationDB metadata is exposed in Snowflake by SQL commands such as LIST, SHOW, and DESCRIBE. These commands do not return record sets usable by other SQL commands, so it is impossible to create a view based upon the output of a SHOW command. Instead, we must think laterally and use JavaScript stored procedures, RESULT_SCAN, and a local table to convert the last query output into a usable record set. Note that this approach is only a point-in-time solution, and a caveat applies, waiting to trip up the unwary. For now, it is enough to know there is some friction when accessing the global services layer from SQL. You see how to resolve the challenges later.
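
As a minimal sketch of this pattern (object names are illustrative), a SHOW command’s output can be captured into a local table via RESULT_SCAN immediately after it runs.

SHOW DATABASES;

-- LAST_QUERY_ID() still points at the SHOW command, so its output
-- becomes a normal record set we can persist and query with SQL
CREATE OR REPLACE TABLE db_snapshot AS
SELECT "name", "created_on", "owner"
FROM   TABLE(RESULT_SCAN(LAST_QUERY_ID()));

SELECT * FROM db_snapshot;

Because the snapshot is only as fresh as the last SHOW command, this is the point-in-time caveat mentioned above.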

Snowhouse

Snowhouse is the in-house tool that Snowflake uses to monitor and manage itself and provide internal metrics on usage telemetry.

Note

At no point can Snowflake staff circumvent account security using Snowhouse; therefore, your data always remains secure.

Snowhouse merges every organization’s account metadata with data from other Snowflake internal systems, ranging from sales and marketing to infrastructure monitoring tooling, providing a holistic view of each account’s utilization. As an internal management utility, Snowhouse supports marketing and sales opportunities and is used by sales support engineers to identify consumption spikes, investigate underlying SQL performance issues, and examine cloud provider virtual machine configuration and performance.

Snowflake is the biggest consumer of Snowflake, which should build confidence for everyone considering using Snowflake as their Data Cloud. So if you are contacted by your Snowflake support team, there is sound, evidence-based reasoning behind the information imparted. Compare the proactive nature of Snowflake support with other vendors and draw your own conclusions.

Naturally, as end-consumers of Snowflake service provision, we do not have access to Snowhouse, so this section is included for information only.

Snowgrid

Snowgrid is a term you are sure to hear more about soon. This Snowflake in-house technology underpins transparent data sharing regardless of the cloud provider and physical location.

High Availability (HA)

In each region (or geographical location), each CSP implements its core capability across three geographically separate availability zones, segregated but interconnected. As a product deployed in a region, Snowflake inherits the underlying CSP architecture, as shown in Figure 3-3. Behind the scenes, Snowflake replicates changes in (near) real time, transparently to end users and with seamless operation. Every environment provides resilience and is built for synchronous replication. We cannot see which availability zone our account is hosted in; we simply accept the service as provisioned.
Figure 3-3

Snowflake high availability

We won’t get into the marketing hype or analysis of around 99.999999999% durability (an act of war has a higher chance of affecting your data) or 99.99% availability (less than an hour of downtime per year), but bring these metrics to your attention both for what cloud providers quote and for further investigation if you so desire.

With three availability zones and an immutable micro-partition approach (discussed later), Snowflake protects against many failure scenarios, including the following.
  • Customer error includes accidental deletion of data and objects, loading one or more data files with bad data, or loading data out of sequence.

  • Virtual machine failure includes triple redundancy for critical cloud services, automatic retries for failed parts of a query.

  • Single availability zone failure includes multiple availability zones on AWS and multiple availability sets on Azure. Snowflake’s service may be degraded, as load balancing is across one or two availability zones depending upon the nature of the failure.

  • Region failure includes loss of service in a region, cross-region database replication, and failover.

  • Multi-region failure includes loss of service in multiple regions, cross-region database replication, and failover.

The root cause of region failure would most likely be a Snowflake core service provision outage rather than the whole CSP infrastructure . The real-time Snowflake service status is at https://status.snowflake.com .

Patching and Maintenance

Every Snowflake deployment is automatically patched and updated with zero downtime, transparently to accounts, via a staged release process: Snowflake updates are first applied to early access accounts, 24 hours in advance of Standard accounts, and finally to Enterprise and higher editions.

Occasionally, Snowflake makes feature changes. These are pre-notified and generally not of great significance. However, for those fortunate enough to work in large corporates with dedicated Snowflake support staff or support contracts, expect to see periodic emails pre-notifying upcoming changes.

Further information on Snowflake releases is at https://docs.snowflake.com/en/user-guide/intro-releases.html#snowflake-releases . Chapter 4 walks through how to enable and disable behavior change releases to allow testing before deployment.

Create Your Trial Account

Most value is obtained by physically doing something. Practice makes perfect, and the repetitive act of typing commands, while tedious, reinforces memory retention. But to type commands, you must have a Snowflake account to practice on. Go to www.snowflake.com , click the Start For Free button, and enter a few details to start a 30-day free trial account .

Note

Be sure to select the Business Critical Edition because it is likely the version used by your organization.

We rely upon certain features in this book; therefore, the Business Critical Edition is essential for successfully working through sample code. I do not cover every feature (I must leave something for you to discover yourself!), but some features are essential to our understanding, particularly those related to security features. Figure 3-4 illustrates features available in each Snowflake edition .
Figure 3-4

Snowflake editions

All examples in this book have been tested using a trial account on AWS; therefore, I recommend AWS as the chosen platform. Your mileage will vary otherwise.

Organizations

Until recently, every Snowflake account was created in coordination with a Snowflake technical representative. A new feature called Organizations provides a high degree of self-service in managing all the Snowflake accounts for your company. Chapter 4 discusses the Organizations feature, but for now, it is sufficient to note all your accounts can be managed centrally, regardless of location and cloud provider(s), using the ORGADMIN role.

Accounts

Snowflake accounts were discussed earlier. Figure 3-2 shows the virtual warehouse as synonymous with your Snowflake account. Let’s unpack this by explaining what an account physically contains and the functionality each layer provides. We think of an account as the highest-level container and the outermost boundary of our Snowflake provisioned environment.

Snowflake provisions accounts in one of two billing formats. Snowflake On Demand is the pay-as-you-go option, where every credit is a fixed price. In contrast, a Capacity contract is a bulk credit purchase up-front over a fixed term. At the time of writing, each credit costs $5, with bulk volume discounts. A fuller discussion of this topic is out of scope for this book; please discuss with your Snowflake sales representative for further details. Snowflake credits are also explained at https://docs.snowflake.com/en/user-guide/what-are-credits.html#what-are-snowflake-credits .

Cloud Services

The cloud services layer is a functional suite of components coordinating activities across Snowflake to process user requests, from login to query dispatch. The cloud services layer runs on compute instances provisioned by Snowflake from the cloud provider. Metadata operations do not use virtual warehouse compute, and cloud services are only charged when their consumption exceeds 10% of the daily warehouse compute consumption. Several services are provisioned in the cloud services layer. Let’s discuss them in turn.

Authentication and Access Control

Authentication is the process of verifying the identity of an actor attempting to access Snowflake. An actor may be a named individual user or generic service user for system-to-system interconnectivity.

Snowflake supports multi-factor authentication via Duo, an opt-in feature available to all users.

Access control is enforced by network policies, which can be thought of as a ring fence around your Snowflake account, allowing access to your environment only from approved endpoints; they are configured at the account level. I discuss setting network policies, and how to effectively monitor those set, in Chapter 4.
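
A hedged sketch of an account-level network policy follows; the policy name and IP ranges are placeholders, and applying the policy requires an appropriately privileged role.

CREATE NETWORK POLICY corporate_network_policy
  ALLOWED_IP_LIST = ('203.0.113.0/24')    -- example corporate CIDR range
  BLOCKED_IP_LIST = ('203.0.113.99');     -- explicitly blocked address

ALTER ACCOUNT SET NETWORK_POLICY = corporate_network_policy;

SHOW NETWORK POLICIES;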

Infrastructure Management

Infrastructure management includes the dynamic (elastic) allocation and deallocation of resources at any time, resulting in exactly the resources required to fulfill demand for all users and workloads. Developers and users are left free to concentrate on delivering functionality and business value rather than being distracted by runtime environment issues.

Metadata Management

The metadata repository uses FoundationDB to hold table definitions, references for all micro-partition files for each object, tracking for all versions of the table data in the data retention window, and various statistics. Metadata management services provide query optimization, compilation, planning, and security.

Query Optimization

The Snowflake query optimizer performs static pruning in the metadata cache, eliminating micro-partitions based upon query predicates. Filters then identify the exact data to be processed using metadata.

When the result cache can satisfy the query, the records are returned directly from the result cache.

For the curious, the metadata cache holds several sets of statistics.
  • Tables: Row count, table size in bytes, file reference, and table version

  • Micro-partitions: Max/min value range, the number of distinct values, NULL count

  • Clustering: Total number of micro-partitions, micro-partition overlap values, micro-partition depth

This is used in performance tuning, a topic covered in Chapter 11. Dynamic pruning occurs in a warehouse cache, covered later in this chapter.
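
As an illustration of the metadata cache at work (table and column names are hypothetical), simple statistics such as these can typically be answered from metadata without a running warehouse.

-- Row counts and min/max ranges are held in table and micro-partition metadata
SELECT COUNT(*) FROM sales;
SELECT MIN(sale_date), MAX(sale_date) FROM sales;

-- Table-level statistics (row count, bytes) surfaced via SHOW
SHOW TABLES LIKE 'sales';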

Result Cache

The result cache stores the results from queries executed by all warehouses, with a retention period of 24 hours. Whenever an exact match query runs, the result set expiry timer is reset, extending retention for another 24 hours unless the underlying data has changed. Result sets persist for a maximum of 31 days before being automatically purged, after which a subsequent query execution is required to reinstate cached data.
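
When testing warehouse performance, it can be useful to bypass the result cache; a minimal sketch, assuming a hypothetical sales table, follows.

-- Result cache reuse is on by default; switch it off for the session while testing
ALTER SESSION SET USE_CACHED_RESULT = FALSE;

SELECT region, SUM(amount) FROM sales GROUP BY region;   -- always executes on the warehouse

ALTER SESSION SET USE_CACHED_RESULT = TRUE;              -- restore the default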

Virtual Warehouses

Virtual warehouses are also called multi-cluster compute . Each is a named wrapper for a group (or cluster) of servers with multicore/hyperthreading CPU(s), memory, and temporary storage (SSD). Warehouses are self-tuning massively parallel processing (MPP) engines.

Warehouse runtime is where Snowflake makes its money. Therefore, careful attention must be paid to right-sizing and parameter setting, particularly auto-suspend. Not all queries require a running warehouse, as some resolve their results directly from metadata or the query cache, but for those that do, the warehouse is instantiated if not already running; that is, the cloud services layer provisions the correct number of servers (CPU, memory, and SSD cache) to fulfill the warehouse configuration and begins executing the query.

Snowflake maintains several warehouses in each availability zone, reducing spin-up time on demand. Through Snowhouse statistical analysis, Snowflake accurately predicts usage, but for new regions where usage patterns are not yet established, predictions are less accurate; therefore, provisioning delays of more than 10 seconds may be experienced. Naturally, as usage patterns evolve, predictions become more accurate.

Warehouse usage is billed for a minimum of 1 minute, then per second after that. Warehouses can be manually created and managed at the command line or via the user interface.

Note

Check that all warehouses initially have auto-suspend set to 1 minute.
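
A quick way to act on this note (the warehouse name is illustrative) is to review and adjust the auto-suspend setting from the command line.

SHOW WAREHOUSES;   -- the "auto_suspend" column is reported in seconds

ALTER WAREHOUSE demo_wh SET AUTO_SUSPEND = 60;   -- 1 minute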

Occasionally warehouses fail to provision during startup. Snowflake detects failures and automatically attempts to repair failed resources. SQL statements start processing when 50% or more of the requested resources have been provisioned.

There is no single answer to identifying optimal warehouse configurations without proper testing. While warehouse scaling provides many answers, other performance options to consider, particularly for large data sets, are outlined later in this chapter. We can say that we require differing warehouse configurations according to the type of activity performed. Data loads and queries have different workload characteristics and must be sized accordingly.

In summary, these broad principles apply to sizing warehouses; a short SQL sketch follows the list.
  • Scale up increases the T-shirt size to improve query performance, process more data, and run more complex queries.

  • Scale out adds clusters to run more queries concurrently and support more users.

  • Scale across declares more warehouses to isolate workloads on appropriately sized warehouses and remove resource contention.
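
The following sketch illustrates the three dimensions using hypothetical warehouse names; the multi-cluster settings assume an edition that supports them.

-- Scale up: resize an existing warehouse (larger servers per cluster)
ALTER WAREHOUSE demo_wh SET WAREHOUSE_SIZE = 'MEDIUM';

-- Scale out: allow additional same-size clusters for concurrency
ALTER WAREHOUSE demo_wh SET MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 3;

-- Scale across: declare separate, right-sized warehouses per workload
CREATE WAREHOUSE load_wh  WAREHOUSE_SIZE = 'LARGE' AUTO_SUSPEND = 60;
CREATE WAREHOUSE query_wh WAREHOUSE_SIZE = 'SMALL' AUTO_SUSPEND = 60;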

Single Warehouse Concurrency

Note

Caveat. This section contains (reasonably safe) assumptions regarding underlying cloud provider CPU hardware (AWS EC2, Azure VM). Your mileage may vary according to real-world experience.

An X-small warehouse provisions a single server that contains CPU, memory, and SSD storage. There is no means of identifying the actual hardware—nor should we when operating in a cloud environment. But there is one aspect where the underlying hardware is material to configuring our warehouses: CPU configuration.

CPUs are no longer a single monolithic silicon block capable of processing a single task. Instead, a CPU has multiple cores, and each core can handle two or more concurrent threads. For our discussion, let’s assume each server has two CPUs, each CPU has four cores, and each core handles two threads, resulting in 16 available concurrent threads per server, as Figure 3-5 illustrates.
Figure 3-5

Suggested server configuration

While each server may have 16 threads available in total, it is also clear each server must retain the capacity to execute other services, including its own operating system. Therefore, we further assume only four of the suggested eight cores are available to Snowflake at any time.

The Snowflake Advanced SQL course indicates each server in a cluster has eight concurrent threads available to service queries. When demand for the warehouse exceeds eight concurrent execution requests and clustering is enabled, another warehouse cluster of the same size is instantiated, providing a further eight concurrent threads. Up to ten clusters can be instantiated concurrently, resulting in a maximum of 80 threads for a single X-small warehouse declaration at any time. Active clusters are scaled down when the last working thread has ceased executing its SQL statement, subject to the auto-suspend setting and to the warehouse not running in maximized mode.

Let’s look at an alternate view. There are eight cores in an X-small warehouse, of which we assume four are usable by Snowflake. When a query is submitted to this warehouse, Snowflake tries to split the query into four parts. Each of the cores tries to scan one-fourth of the table. Note that this suggestion is ideal; the query must be splittable, and the warehouse cores must be available for parallelization.

For those seeking a deeper understanding of concurrency, I suggest investigating the MAX_CONCURRENCY_LEVEL parameter at https://docs.snowflake.com/en/sql-reference/parameters.html#max-concurrency-level and the blog by Subhrajit Bandyopadhyay at www.linkedin.com/pulse/snowflake-concurrency-parallel-processing-subhrajit-bandyopadhyay/ .
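
As a hedged example (the warehouse name is illustrative), the parameter can be inspected and adjusted per warehouse.

SHOW PARAMETERS LIKE 'MAX_CONCURRENCY_LEVEL' IN WAREHOUSE demo_wh;

-- Lowering the value gives each query a larger share of the warehouse resources
ALTER WAREHOUSE demo_wh SET MAX_CONCURRENCY_LEVEL = 4;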

Scaling

Warehouses are defined from built-in templates in T-shirt sizes ranging from X-Small, with a single server, to 6XL, with 512 servers; this is called scaling up and down. The number of servers instantiated doubles for each increase in T-shirt size. Warehouses can be resized dynamically; the T-shirt size can be changed at the command line as desired. Warehouses can also be manually suspended in a session.

A further option (my preferred option) is to have multiple warehouses declared and use the one most appropriate for the workload at hand. This is called scaling across and back. You may be asking why I prefer this approach. First, there is the possibility that a process, having resized a warehouse, could fail, with the warehouse retaining the resized declaration for its next invocation, which could be inappropriate and costly. Second, for workload segregation, repeatable performance tuning, determining credit consumption for billing purposes, and preserving a consistent system configuration perspective, static warehouse declarations are a better, more appropriate fit. Third, when a warehouse is resized, both the old and new configurations run briefly while the old warehouse is quiesced, which could be costly for larger warehouses. Fourth, our data warehouses are complex and ever-growing environments; multiple warehouse declarations provide workload segregation.

Warehouses can also be clustered. Multiple instances of the same T-shirt size can be declared in the warehouse configuration and dynamically instantiated by Snowflake as load increases; clusters automatically shut down when the load decreases. This is called scaling out and in. Equally, a warehouse can be set to run in maximized mode by setting the minimum and maximum number of clusters to the same value greater than 1, in which case all clusters are instantiated when the warehouse starts.

Warehouse clusters are always of the same size; it is impossible to select different T-shirt sizes in a single warehouse declaration. Figure 3-6 illustrates the number of clusters instantiated and credit consumption per hour to inform our warehouse configuration. I discuss how to monitor credit consumption later.
Figure 3-6

Warehouse scaling options

For a single large query loading one billion rows, the effect of scaling up is illustrated in Figure 3-7. Increasing the warehouse size improves response time, but ultimately at increased cost if the warehouse size increases too much. Warehouse size selection and associated costs must be tested with a representative workload profile, recognizing that multiple warehouses can (and should) be declared, with the most appropriate selected for the target workload.
Figure 3-7

Warehouse scaling up cost/benefit

The important takeaway from Figure 3-7 is that scalability is linear until the query is no longer resource bound.

The scaling policy is the final warehouse parameter affecting multi-cluster warehouse behavior; it may be either Standard (the default) or Economy. A deprecated scaling policy called Legacy also exists; its behavior has been changed to Standard, and it should not be used. The scaling policy affects when clusters are instantiated or shut down. For more information, please refer to the documentation at https://docs.snowflake.com/en/user-guide/warehouses-multicluster.html#setting-the-scaling-policy-for-a-multi-cluster-warehouse .

Query Processing

The Snowflake query processing engine uses native SQL. A Snowflake session can only have one warehouse operating at a time, which may be either changed or resized, noting earlier comments on the implications of resizing warehouses.

Previous sections discussed some aspects of query processing. Where possible, queries are split according to the nature of the query and available processing cores for parallel processing.

Warehouse Caching

Behind the scenes, Snowflake maintains a pool of ready-to-run warehouses, which explains why Snowflake tries to reinstate the original cache when a suspended warehouse is resumed, though there is no guarantee this will occur. As cache reinstatement is not guaranteed, the minimum expectation is that the cache is purged when the warehouse is suspended; cache reinstatement is a bonus.

The warehouse cache is local to each specific warehouse and is never shared with other warehouses. There is no regular schedule for when the warehouse cache is purged. If a warehouse is always running, its cache will never be purged. This may be an attractive performance option for repeat queries. However, I recommend experimenting and thorough testing before implementation, as credits are consumed per second after the first minute of warehouse operation.

Resizing a warehouse results in new servers being provisioned; therefore, the cache is purged for the resized warehouse. This behavior is inferred from the previously explained behavior where resizing warehouses may result in additional credit consumption while the new servers are instantiated, which is another reason not to resize warehouses.

Caches are reused if the same (i.e., case sensitive) SQL text/query is reissued unless the Snowflake cache is invalidated by underlying data changes, ensuring the query always returns the latest data.

Query Profiling

Query profiling can be performed on both queries which have completed execution and queries undergoing execution. A profile can only be viewed in the user interface, and current information cannot be extracted programmatically for external tool analysis.

Detailed investigation and explanation of query profiling are beyond the scope of this book. However, I discuss the basic principles in Chapter 11. Further information is at https://docs.snowflake.com/en/user-guide/ui-snowsight-activity.html , noting the use of Snowsight.

Monitoring

The classic user interface provides a screen to monitor warehouse load, as shown in Figure 3-8.
Figure 3-8

Warehouse monitor

Selecting an individual warehouse drills into utilization for the past 14 days , as shown in Figure 3-9.
Figure 3-9

Recent warehouse utilization

Later, I explain how to extract warehouse utilization information and credit consumption from the Snowflake Account Usage store, as it relates to identifying credit consumption. But, for now, it is sufficient to know Snowflake provides monitoring tooling.
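
As a preview, and assuming the ACCOUNTADMIN role (or equivalent grants) on the Account Usage share, a sketch of a credit consumption query might look like the following.

SELECT warehouse_name,
       SUM(credits_used)                AS total_credits,
       SUM(credits_used_compute)        AS compute_credits,
       SUM(credits_used_cloud_services) AS cloud_services_credits
FROM   snowflake.account_usage.warehouse_metering_history
WHERE  start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP  BY warehouse_name
ORDER  BY total_credits DESC;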

Storage

Underpinning every Snowflake implementation, regardless of platform, is storage, the physical place where data lands and is persisted. Our discussion is more AWS-focused. Therefore, substitute Azure Blob Storage or Google Cloud Storage for Amazon S3.

Unlike previous generations of databases, Snowflake takes away all the headache of manually configuring storage on installation. There is no more configuring a storage area network (SAN) or network-attached storage (NAS), and no more databases hanging or transactions failing due to running out of disk space; the elastic nature of cloud storage provision ensures disk is always available, as much as we need, at the point of demand. Furthermore, due to the innovative approach implemented by Snowflake, we don’t have to be concerned with transaction management in the same way other database vendors are: no more rollback, undo, row chaining, or separating tablespaces onto separate physical drives.

As a nod to our cybersecurity colleagues, we must always ensure external storage is properly secured against unauthorized access by implementing appropriate security policies and then testing those policies to prove they are fit for purpose.

Different Platforms, Different Approaches

We must ask ourselves why Snowflake adopted their chosen approach. Snowflake is a data warehouse but not an online transaction processing (OLTP) database; therefore, storage considerations change.

Part of the answer lies in the fundamentally different nature of cloud storage provision. We no longer have to segment our data containers into different logical groups; Snowflake could have imposed similar containers, but the concept no longer applies. Furthermore, data warehousing has historically been an extension to most database vendor products rather than an original architectural consideration or initial core capability. Meanwhile, the market has matured, and data volumes, velocity, and variety have exploded. Data warehousing is now essential for serious data analysis; therefore, a different approach is warranted. Snowflake positions itself as a data warehouse provider, enabling a different storage management paradigm, which is discussed next.

Micro-partitions and Cluster Keys

Snowflake implements storage using micro-partitions , the underlying contiguous units of storage that comprise logical tables. Each micro-partition contains up to 16 MB of data compressed using proprietary compression algorithms. Uncompressed, each micro-partition holds between 50 MB and 500 MB of data organized in hybrid columnar format, optimizing compression according to the column’s data type.

Object metadata containing rowcount, table size in bytes, physical file reference, table version, and zone map information is retained for all written micro-partitions according to the Snowflake edition and retention period specified. (Data protection is covered later in this chapter.) For each micro-partition, metadata for the maximum and minimum value range, number of distinct values, and NULL count is retained. Where multiple micro-partitions are written for a single object, metadata includes the total number of micro-partitions, overlap depth, and overlap values. I briefly discuss how metadata is used later, but don’t expect a deep dive into Snowflake performance tuning because it is a significant subject and beyond the scope of this book.

Traditional database vendors’ disk management policy is mutable object storage: an object’s contents change according to data manipulation language (DML) operations (insert, update, and delete), and the object resizes dynamically, leading to mutable storage containers. Storing data changes for objects and their indexes in separate disk structures leads to proactive monitoring and human/machine intervention for disk management. The Snowflake approach is different; once a disk structure (a micro-partition) is written, it is immutable. DML operations result in new micro-partitions being written. Micro-partitions cannot be physically managed, apart from changing clustering keys, nor can their behavior be influenced.

Understanding the difference between mutable and immutable storage management is key to understanding why OLTP is not considered a good use case for Snowflake. Micro-partition management for individual record operations leads to queueing and transaction failures as the underlying storage struggles to keep up in high-volume DML scenarios. However, careful use of temporary tables to collate OLTP data, with a subsequent bulk load into permanent tables, may provide an acceptable pattern, assuming error handling in the event of failure allows re-run capability. This is just a suggestion.

Note

Use Snowflake for data warehousing, not OLTP.

Figure 3-10 illustrates a table with three micro-partitions and a DML operation to the United Kingdom data set, resulting in new micro-partition creation. We build upon this theme throughout this chapter.
Figure 3-10

Immutable micro-partitions

The implications of the Snowflake micro-partition approach are huge. On initial data load, Snowflake automatically clusters the data in micro-partitions without making any assumptions about the data. No histogram or other algorithm is applied; each micro-partition is written and its metadata captured. The default approach works well for most scenarios, and most often, particularly for data loads of less than 1 TB, query performance is fine.

Over time, where data has been subject to DML or data volumes exceed 1 TB, performance degradation or initial poor performance may be observed. Under these circumstances, the data may not be optimally clustered in the table according to the query predicates most often used to access the micro-partitions. We might choose to add a clustering key, a set of columns or expressions, on a table explicitly designed to match the most frequently used query predicates. Recognizing one size does not fit all, and alternative query predicate data access paths may be disadvantaged, Snowflake offers materialized views where different clustering strategies can be applied. Your mileage will vary according to your particular scenarios.

Snowflake provides automatic clustering , which monitors the state of clustered tables and, where appropriate, reclusters behind the scenes, but only for tables where a cluster key has been declared. Manual reclustering has been deprecated for all accounts; therefore, it should not be considered. Adding a cluster key is not recommended for all scenarios, and performance testing should be conducted to prove the benefits of any change before release into production environments.
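
A minimal sketch, assuming a hypothetical sales table most often filtered by country and sale date, follows; as noted above, any change should be performance tested before production use.

-- Declare a clustering key matching the most frequently used query predicates
ALTER TABLE sales CLUSTER BY (country, sale_date);

-- Automatic clustering can be suspended and resumed per table
ALTER TABLE sales SUSPEND RECLUSTER;
ALTER TABLE sales RESUME RECLUSTER;

-- Report how well the micro-partitions are clustered for those columns
SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(country, sale_date)');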

Micro-partitions enable horizontal and vertical partition pruning with static pruning performed in the metadata cache to only include micro-partitions matching query predicates. Dynamic pruning occurs at runtime during query execution based upon join conditions and other constructs. This approach explains the importance of cluster keys and data storage.

Although I discuss how our data in cloud storage is protected later, it is important to note all Snowflake customer data is encrypted by default using the latest security standards and best practices. Snowflake uses strong AES 256-bit encryption with a hierarchical key model rooted in a hardware security module. The Snowflake service rotates the keys regularly, and data can be automatically re-encrypted (rekeyed) regularly. Data encryption and key management are entirely transparent and require no configuration or management. See https://docs.snowflake.com/en/user-guide/security-encryption.html#encryption-key-management for further details.

Indexes and Constraints

Snowflake does not support the physical creation of separate indexes as objects stored in addition to the primary table. While unique, primary key, and foreign key constraints can be declared, they are not enforced by Snowflake, and no resultant objects are created. The declarations facilitate data self-discovery tools, allowing entity relationship diagrams to be created automatically. The only constraint enforced is NOT NULL.

Coming from a traditional RDBMS background with significant data warehousing experience, the absence of enforced constraints at first seems both counter-intuitive and a retrograde step.

In developing Snowflake applications, we face a mindset challenge because, in a data warehousing environment, we expect our data to be clean at the source, whereas in an OLTP environment, we expect our data to be validated before being allowed to enter our database. If we accept this principle, our data warehouse must not allow data to be updated locally, and all data changes must be made at the source; the absence of enforced constraints no longer matters. It is the declarations themselves that matter. Furthermore, if our data warehousing ingestion strategy includes mechanisms to identify data quality issues without excluding any data regardless of quality, we have an automated mechanism to both feed data gaps back to the originating system, closing the feedback cycle, and measure data quality over time.
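
A short sketch, using hypothetical names, illustrates the behavior: the primary key is accepted as a declaration, and only NOT NULL is enforced.

CREATE OR REPLACE TABLE customer (
    customer_id NUMBER NOT NULL,                       -- NOT NULL is enforced
    email       VARCHAR(254),
    CONSTRAINT pk_customer PRIMARY KEY (customer_id)   -- declared, not enforced
);

-- Both inserts succeed; the duplicate key is not rejected
INSERT INTO customer (customer_id, email) VALUES (1, 'first@example.com');
INSERT INTO customer (customer_id, email) VALUES (1, 'second@example.com');

-- This insert fails because NOT NULL is enforced
INSERT INTO customer (customer_id, email) VALUES (NULL, 'third@example.com');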

Materialized Views

As with other RDBMS platforms, Snowflake supports materialized views for Enterprise Edition and higher. We may choose to implement materialized views for a variety of reasons. This section focuses on micro-partitions and not the wider performance tuning considerations recognizing micro-partition pruning is a significant factor in query performance.

Materialized views are maintained using the serverless compute resources discussed later in this chapter. For now, it is enough to know micro-partition maintenance is taken care of by Snowflake and billed accordingly. All we have to do is declare the materialized view with the new cluster key definition. Snowflake takes care of the maintenance.
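
A minimal sketch, with hypothetical object names, of a materialized view declared with its own cluster key follows.

-- Same data as the base table, re-clustered to favor a different access path
CREATE MATERIALIZED VIEW sales_by_product
  CLUSTER BY (product_id)
AS
SELECT product_id, sale_date, amount
FROM   sales;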

Note

Materialized views do not have the Time Travel feature.

There are several limitations to Snowflake materialized views when compared to other RDBMSs. The most notable is that a materialized view can reference only a single table; materialized views cannot reference another materialized view. All limitations are explained at https://docs.snowflake.com/en/user-guide/views-materialized.html#limitations-on-creating-materialized-views .

Addressing the full reasons why we would want to implement materialized views is beyond the scope of this book, as the answers lie in the realms of performance tuning and query optimization. The key point is the option to redefine our tables by reclustering, thus supporting different optimal access paths.

Stages

Storage is also referred to as a stage. Later, I illustrate mapping S3 storage in our AWS account using a storage integration, which, along with the associated AWS account configuration to establish the trust relationship, provides the first of four storage types, called an external stage.

We also referred to S3 storage mapped in our Snowflake account where no AWS configuration is required, the first of which is called an internal stage, a named stage, or an internal named stage. All three labels are used interchangeably, and these stages are generally inaccessible outside the Snowflake boundary. Note that future capability is being developed to make internal stages accessible. When data is staged to an internal stage using the PUT command, the data is encrypted on the local machine before being transmitted to the internal stage, ensuring security throughout the data transit lifecycle.

There are object types, such as tables and materialized views, that require storage; each consumes S3 storage inside our Snowflake account. Their associated stages are called table stages, uniquely named, one per object.

Finally, user stages underpin the user interface. They should not be referenced directly. They also consume storage from our accounts.
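
The stage types are referenced with slightly different syntax; a brief sketch with hypothetical names follows (the PUT command runs from a client such as SnowSQL, not the classic web UI).

CREATE STAGE my_internal_stage;                   -- internal named stage
PUT file:///tmp/sales.csv @my_internal_stage;     -- encrypted client side before upload
LIST @my_internal_stage;

LIST @my_external_stage;                          -- external stage (storage integration)
LIST @%customer;                                  -- table stage for the CUSTOMER table
LIST @~;                                          -- current user's stage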

Note

For external stages, apply and test your security policy before releasing it into production.

Figure 3-11 illustrates the physical location of stages, which is covered later in this chapter.
Figure 3-11

Snowflake stages

Shares

Chapter 1 discussed Secure Direct Data Share. Now that you understand the immutable nature of micro-partitions and how Snowflake persists objects using cloud storage, let’s discuss the first integration pathway to the Snowflake Data Cloud built on Snowgrid, Snowflake’s underlying proprietary data interchange platform. Most accounts are enabled by default for sharing. If you encounter an issue, contact Snowflake support .

Note

Shares are available for account data copy to any local account in the same cloud provider and region.

The party who owns the data and grants entitlement to those who can use the data is the provider. The party who uses the data is the consumer. Providers always control who can access their data; Snowflake’s staff cannot view, override, or take control of a provider’s data. Objects in a consumed share cannot be modified or added to. A share is a read-only object for a consumer: the data can only be queried, not modified, and no other action can be performed. Consumers cannot re-share data provided via share technology with other accounts, but they can use it and join it to their own data sets. Consumers can insert shared data into their own tables by selecting from the share and then sharing those tables; this is the only way to cascade data sharing.

Secure Direct Data Sharing enables sharing selected objects in a database in your account with other Snowflake accounts, but only for those providers and consumers in the same cloud provider and Snowflake region.

For Snowflake customers, compute is paid for by the consuming account. The subject of non-Snowflake consumers using a reader account, where the provider pays for consumption, is discussed briefly in Chapter 14. Consumers may create a single database per share. An inbound database must be created before the share can be queried, and created databases can be renamed and dropped.
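
A minimal sketch of both sides of a share follows; database, share, and account identifiers are placeholders.

-- Provider account
CREATE SHARE sales_share;
GRANT USAGE  ON DATABASE sales_db              TO SHARE sales_share;
GRANT USAGE  ON SCHEMA   sales_db.public       TO SHARE sales_share;
GRANT SELECT ON TABLE    sales_db.public.sales TO SHARE sales_share;
ALTER SHARE sales_share ADD ACCOUNTS = xy67890;   -- consumer account locator

-- Consumer account
CREATE DATABASE shared_sales FROM SHARE ab12345.sales_share;   -- provider locator
SELECT COUNT(*) FROM shared_sales.public.sales;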

Most importantly, no data is copied. Metadata alone enables the secure sharing of data in underlying storage. Since no data is copied from the provider account, the consumer Snowflake account is not charged for storage. Just think of the implications: zero-copy, real-time transactional data sharing with clients under your absolute control, screaming data democratization, and enabling citizen scientists. Figure 3-12 illustrates how micro-partitions are shared.
Figure 3-12

Shared data conceptual model

Snowflake is constantly working on improving monitoring and exposing useful metrics to its customers. An upcoming enhancement allows providers to gather basic metrics on how their shared data has been consumed.

There are some restrictions. For example, only the most recent micro-partitions are available in the share, and not all schema object types can be shared. But I hope you agree that shares are an incredibly powerful, easy-to-manage, out-of-the-box solution to seamlessly sharing and monetizing data in your organization. Chapter 14 explains how to physically implement shares.

Cloning

Since you now understand the immutable nature of micro-partitions, you can see how Snowflake external stages, tables, schemas, and databases can be cloned using metadata operations alone, resulting in zero-copy cloning. A cloned object is a new, separate object against which all types of permissible operations for the parent object can be performed. The original underlying micro-partitions remain the same until the data is changed in a cloned object. Only when a DML operation occurs against the primary or cloned object are the affected micro-partitions instantiated, and at this point, storage costs are incurred. Cloning is almost instantaneous, a few minutes at most, and there are no limits to the size of the object that can be cloned.

Some limitations apply. Internal named stages cannot be cloned, but tables and their table stages can be. The contents (files) of a cloned external stage are not copied. Temporary tables and transient tables cannot be cloned as permanent tables. Materialized views cannot be directly cloned, but if a cloned database or schema contains materialized views, they are cloned. Also, check the documentation for information on how object tags are treated for cloned objects.

Note

Database clones are available internally in a single account. Schema clones are available in a database. Table clones are available in a schema.

Other cloned objects may exhibit unusual or unexpected behavior; more information is at https://docs.snowflake.com/en/user-guide/object-clone.html . Above all, test, retest, then retest again before relying upon your cloning procedures to ensure all edge cases have been considered when running automated test cases and on User Acceptance Testing environments.

Clones may be created as of the current timestamp, at or before a timestamp in history, or at or before a SQL statement identifier. For historical object cloning, the chosen point in time must be within the Time Travel retention period, and the source object must have existed at that point. Note that your current role must have the appropriate entitlement to the source object. Check the documentation for details.
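
A minimal sketch, with hypothetical object names and a placeholder query ID, of zero-copy cloning at the current time and back in time follows.

-- Clone the current state of a database (metadata only, near instantaneous)
CREATE DATABASE dev_db CLONE prod_db;

-- Clone a table as it existed one hour ago (must be within the retention period)
CREATE TABLE sales_restore CLONE sales AT (OFFSET => -3600);

-- Clone a schema as it was immediately before a given statement ran
CREATE SCHEMA reporting_restore CLONE reporting
  BEFORE (STATEMENT => '01a2b3c4-0000-1234-0000-000000000001');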

Cloned databases may be used to quickly spin up like-for-like environments to rapidly test new code or as a starting point for producing sample data sets and all manner of previously unthinkable scenarios due to the inherent time taken by legacy databases to produce replica environments. When coupled with the Time Travel feature, the huge advantages of the Snowflake micro-partition strategy become apparent. Cloned databases usually form the basis of replication, discussed next.

Replication

Replication must be enabled for your account using either an ORGADMIN or an ACCOUNTADMIN role. In larger organizations, the ORGADMIN feature may have been enabled; therefore, replication is probably enabled from a central function. Alternatively, replication can be enabled using the ACCOUNTADMIN role. Account security is addressed in Chapter 4. It goes without saying—but I do for clarity’s sake—that a second Snowflake account is required for replication. As stated, accounts may be on any supported cloud provider.

It is fair to say the Snowflake replication story is still a work in progress. The core components to protect data are in place, albeit currently limited to database replication, available across cloud providers and enabling instant failover and recovery. Account replication is expected in Public Preview in H1 2022, with object-level replication planned. Database replication is massively important. The immutable nature of micro-partitions acts as an enabler for data sharing, along with metadata synchronization. However, there are other considerations, such as repointing infrastructure to use the failed-over site, then seamlessly failing back when service is restored. This is where client redirect helps.

Database replication is distinctly different from shares in one fundamental way. Data in a shared object is available in real time in the consuming account. Data in a database replicated to another account must be refreshed and may take more than 20 seconds to appear in the consuming account as the serverless compute operations complete. Your mileage will vary, and testing must be conducted to ensure replication completes in an acceptable timeframe.
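
A minimal sketch of database replication follows; organization, account, and database names are placeholders.

-- On the primary account: allow the database to be replicated to a second account
ALTER DATABASE sales_db ENABLE REPLICATION TO ACCOUNTS myorg.dr_account;

-- On the secondary account: create the replica, then refresh on demand or on a schedule
CREATE DATABASE sales_db AS REPLICA OF myorg.primary_account.sales_db;
ALTER DATABASE sales_db REFRESH;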

Note

Replicated databases are available for account data copy to any local or remote account, cloud provider, and region.

Some limitations apply. Replication operations fail when object tagging is deployed and the target account edition is lower than Enterprise; also, referenced object tags must be in the database to be replicated. Databases with external tables cannot be replicated because the externally referenced objects are not copied, so the replication operation fails; the same applies to databases created from imported shares. Several object types are not replicated, including users, warehouses, roles, resource monitors, and network policies. The message is clear: test, test again, and then repeat once more.

Note

Replicated databases do not carry forward roles, which must be recreated.

Having identified several limitations with replicated databases, it is clear there are precursor activities to configure our target account for receiving replicated databases, and several post-replication database import activities must be performed. However, the Snowflake replication story is improving, with less coding expected with future releases.

Regardless of the cloud provider and region chosen, replication between accounts is possible, subject to restrictions outlined in the documentation. Note that all secondary replicas are available as read-only until one is made the primary; from that point onward, it is read-write, with all other replicas, including the original, reverting to read-only.

Same Cloud Provider and Region

This scenario is limited in data protection terms as both provider and consumer accounts rely on the same underlying HA infrastructure. However, valid scenarios exist where organizations may want to adopt this approach: where a merger or acquisition brings two distinct accounts for consolidation under a single organization, where security postures differ, and/or where the accounts support disparate business use cases. In these scenarios, consolidating some data into a single account can be highly beneficial while retaining both security postures independently. Shares may also be used to seamlessly exchange data sets between both accounts at zero storage cost, as the underlying storage is reused and only metadata is copied.

Same Cloud Provider and Different Region

Providing total failover protection in the event a single primary region fails, a replicated secondary may be activated as primary, thus restoring service. In the unlikely event that the CSP or Snowflake suffers from two (or more) availability zone outages, failover may not be possible if both accounts are affected. Having made this statement, the probability of such an event is exceedingly small.

For the same cloud provider in a different region, storage costs are incurred for both accounts as the underlying storage cannot be shared via metadata. Both accounts must instantiate micro-partitions.

Different Cloud Providers and Different Regions

Providing total failover protection in the event a single primary region fails, a replicated secondary may be activated as primary, thus restoring service. Noting that two or all three CSPs would have to suffer concurrent failures to render Snowflake unrecoverable, this scenario is so improbable as to be considered practically impossible.

For different CSPs in different regions, storage costs are incurred for both accounts as the underlying storage cannot be shared via metadata. Both accounts must instantiate micro-partitions. Also, egress costs are incurred when data is moved between cloud providers, which soon add up when refreshes occur.

Storage Summary

Unfortunately, the Snowflake documentation is less than clear when describing shares, cloning, replication, and available options. This section attempts to summarize the options available. Table 3-1 provides a quick summary.
Table 3-1

Storage Options

Storage Option        | Approach                   | Storage Cost                          | Resilience
Share                 | Same CSP, same region      | None, metadata copy only              | Same CSP, HA, single region, real time
Share                 | Same CSP, different region | Storage cost per account, per region  | Same CSP, HA, multi-region
Share                 | Different CSP              | Storage cost per account, per region  | Different CSP, HA, multi-region
Clone Database        | Same account               | None, metadata copy only initially    | Same CSP, HA, single region
Clone Database        | Same CSP, different region | Not available                         | Not available
Clone Database        | Different CSP              | Not available                         | Not available
Replicated Database   | Same CSP, same region      | None, metadata copy only              | Same CSP, HA, single region, refreshed
Replicated Database   | Same CSP, different region | Storage cost per account, per region  | Same CSP, HA, multi-region, refreshed
Replicated Database   | Different CSP              | Storage cost per account, per region  | Different CSP, HA, multi-region, refreshed

Serverless Compute

Some Snowflake capabilities do not require virtual warehouses but operate using Snowflake supplied and managed compute resources. Naturally, nothing is for free, and these “behind the scenes” capabilities are still costed and charged to your account.

But which services use serverless compute? The list is growing but includes Snowpipe, automatic clustering , search optimization service, external table metadata refresh, materialized view maintenance, database replication, failover and failback, and most recently, tasks that no longer require a dedicated warehouse to be configured.

We dive into some of these features later. I have already mentioned automatic maintenance for other features. This section serves as a reminder of line items that appear separately on your billing. For a fuller explanation, please see the documentation at https://docs.snowflake.com/en/user-guide/admin-serverless-billing.html .

Data Protection

This section discusses the Snowflake features available to protect our data, focusing on physical data recovery. Recall how CSP HA underpins confidence in stated claims of 99.999999999% durability and 99.99% availability. Snowflake relies upon published CSP HA capability to support its product features.

Time Travel

From the discussion on how micro-partitions are managed, you know their immutable nature leads to several extraordinary built-in Snowflake capabilities. The Time Travel feature allows you to revert objects, run queries, and create clones as of a point in time. Retention is restricted to a maximum of 1 day for Standard Edition and up to 90 days for all higher editions; Time Travel cannot be disabled at the account level but can be disabled for individual databases, schemas, and tables.

Note

It is recommended you set DATA_RETENTION_TIME_IN_DAYS to 90 for your account using the ACCOUNTADMIN role.
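A minimal sketch follows; note that retention beyond 1 day assumes Enterprise Edition or higher.

-- Set the account-level default Time Travel retention to 90 days
USE ROLE ACCOUNTADMIN;
ALTER ACCOUNT SET DATA_RETENTION_TIME_IN_DAYS = 90;

-- Confirm the new setting
SHOW PARAMETERS LIKE 'DATA_RETENTION_TIME_IN_DAYS' IN ACCOUNT;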

Assuming an appropriate retention period is set, a Snowflake administrator can query data as it stood at any point within the retained period, even if it has since been updated or deleted, and can create clones of databases, schemas, and tables using a specific timestamp, a relative timestamp offset, or a SQL query ID.
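The following sketch illustrates each variant using a hypothetical orders table; the timestamp and query ID are placeholders.

-- Query the table as it stood one hour ago (relative offset in seconds)
SELECT * FROM sales.public.orders AT (OFFSET => -3600);

-- Query the table as it stood at a specific timestamp
SELECT * FROM sales.public.orders
AT (TIMESTAMP => '2022-06-01 09:00:00'::TIMESTAMP_LTZ);

-- Query the table as it stood immediately before a given statement ran
SELECT * FROM sales.public.orders
BEFORE (STATEMENT => '01a2b3c4-0000-1234-0000-000000000001');

-- Clone the table as of a point in time
CREATE TABLE sales.public.orders_restored
CLONE sales.public.orders
AT (TIMESTAMP => '2022-06-01 09:00:00'::TIMESTAMP_LTZ);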

Storage overheads of typically 10% to 15% are incurred when implementing Time Travel, but storage is cheap, and the benefits far outweigh the cost. If you have ever waited hours for a backup to restore a database, this single feature is for you. Here you see why the immutable nature of micro-partitions is such a powerful enabler: recovery or cloning is typically sub-minute regardless of the volume of data affected by reverting to an earlier timestamp. We have all made mistakes on production systems while performing releases, inadvertently dropping an object, truncating a table, or incorrectly updating data. Time Travel provides an immediate recovery option at our fingertips. There is no need to call a database administrator and request a backup restore; we have the means to self-serve and move forward, with an immutable 1-year query history to identify the point in time or SQL statement to revert to.

There are some caveats. Referenced objects must be within the available retention period. Restoring tables and schemas is only supported in the current schema or current database as set by the in-scope role, even if a fully qualified object name is specified. The user executing the UNDROP command must have the OWNERSHIP privilege on the table granted to the role in use and must also have the CREATE privilege on the target schema. Both transient and temporary tables have a maximum Time Travel retention period of 1 day regardless of Snowflake edition. Time Travel is not supported for materialized views or external tables, for the (obvious) reasons that a materialized view's micro-partitions are built from existing table data, and an external table's storage is not managed by Snowflake, only referenced by it.

Changing the retention period for your account or individual objects changes the value for all lower-level objects where a retention period is not explicitly set. If the retention period at the account level is changed, all databases, schemas, and tables that do not have an explicit retention period automatically inherit the new retention period. The retention period defaults to 1 day and is enabled for all accounts on initial provisioning.

Note

I highly recommend the Time Travel feature be set to 90 days at database creation time.
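A minimal sketch, using a hypothetical database name:

-- Set the retention period at database creation time
CREATE DATABASE sales DATA_RETENTION_TIME_IN_DAYS = 90;

-- Or change it for an existing database
ALTER DATABASE sales SET DATA_RETENTION_TIME_IN_DAYS = 90;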

If a table with Time Travel enabled is dropped and a new table with the same name is subsequently created, the UNDROP TABLE command fails. You must first rename the existing object to enable restoration of the previous version.
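A sketch of the recovery sequence, again using hypothetical names:

-- A replacement table with the same name already exists, so rename it first
ALTER TABLE sales.public.orders RENAME TO orders_replacement;

-- Restore the most recently dropped version of the original table
UNDROP TABLE sales.public.orders;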

Assuming the execution context is set correctly, you can observe a table’s version history by executing the following.
SHOW TABLES HISTORY LIKE 'table_name' IN database.schema;

The Time Travel feature should be used to quickly restore service in the event of a bad data load, with the affected objects or databases reverted to the point in time immediately before the bad data was loaded, then carefully rolled forward before resuming normal service. No more waiting for database administrators to restore backups or hours spent rolling back transactions.

Fail-safe

Fail-safe is a data recovery service that provides an additional, non-configurable, 7-day retention period exclusively managed by Snowflake. It represents the final stage of the storage lifecycle. When objects age out of Time Travel, the underlying micro-partitions are not immediately deallocated and returned to the cloud provider storage pool. Instead, Snowflake retains the micro-partitions (and, by inference, associated metadata) and can restore objects, though this may take several hours to complete.

If the Time Travel retention period is reduced, any objects falling outside the reduced period are moved to the Fail-safe service and are no longer directly accessible. Objects moved to Fail-safe are not accessible in the owning account. Objects are only accessible via raising a support ticket to Snowflake. According to Snowflake documentation, “Fail-safe is not provided as a means for accessing historical data after the Time Travel retention period has ended. It is for use only by Snowflake to recover data that may have been lost or damaged due to extreme operational failures.”

Caveats apply. Transient, temporary, and external tables have no Fail-safe period. Fail-safe also incurs storage costs for the storage consumed by objects retained during the 7-day period.
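To understand where Time Travel and Fail-safe storage costs accrue, a query such as the following sketch against the ACCOUNT_USAGE share may help; the 20-row limit is arbitrary.

-- Review active, Time Travel, and Fail-safe storage per table
SELECT table_catalog, table_schema, table_name,
       active_bytes, time_travel_bytes, failsafe_bytes
FROM   snowflake.account_usage.table_storage_metrics
WHERE  deleted = FALSE
ORDER  BY failsafe_bytes DESC
LIMIT  20;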

Backup Strategy

Given the available options in Snowflake for data protection, we might be asking ourselves why a backup strategy may still be relevant. There is no single clear-cut answer to this question as several themes, including the right to be forgotten, come into play. What if, outside of the 90-day Time Travel period and the 7-day Fail-safe period, a mistake was made, and the wrong person's details were removed? Should we consider taking a hard backup of all affected objects into offline storage before removing an individual? What does your organization's policy state, and how do we translate policy into process and operational procedures? Unfortunately, there are more questions than answers here; they are shaped by your organization's approach to, understanding of, and maturity with Snowflake, and are posed to provoke thought.

Disaster Recovery (DR)

Having discussed data protection, let’s turn our thoughts to disaster recovery , which we also discussed earlier in our storage summary section. Recognizing our organizations may be subject to external regulatory requirements, and some business colleagues may not fully appreciate the inherent robust nature of Snowflake, we may find ourselves required to implement a full DR policy with complementary processes and procedures.

Our first proactive action should be to inform and educate our colleagues on the fundamentally different nature of CSP-provisioned Snowflake and the inherent protections and tooling baked into each. Having conducted our initial “hearts and minds” campaign, regularly refreshed content, and held repeat sessions, we may find many DR requirements are satisfied by Snowflake’s built-in capabilities, reducing our work. However, I recognize this takes time and offer the upcoming information as a starting point for developing your DR strategy.

We must clearly articulate each environment by maintaining a network topology diagram, operations manual, and DR runbook. Each should be periodically reviewed and approved, also stored in a readily accessible form. Our organizations may also have a centralized help desk and ticketing system with procedures to follow during a service outage. This is where our service level agreement documentation dictates both response level and timeline. We must also maintain our technical documentation and ensure our support staff can access the latest detailed information on each aspect of our system. None of this should come as a surprise to seasoned developers .

Business Continuity

Our data is safely stored in Snowflake and, when configured appropriately, immediately accessible at any point in the preceding 90-day period via Time Travel. Depending upon our data replication strategy, we can guarantee our data availability. However, diving deeper into the specifics of shares, clones, and replication, we find some limitations on how specific object types are treated. These must be investigated on a case-by-case basis, with the Snowflake documentation as the first point of reference, recognizing that replication features are constantly evolving.

Snowflake cannot consider every possible scenario, and each organization's business continuity requirements and strategy involve a degree of manual activity. It should be said that failover and failback in Snowflake are a good deal easier than on most other platforms, but some work is still required.

You may find existing or legacy documentation to assist in identifying essential infrastructure to protect and ensure continuity of service. Organizations often have central risk registers where known weaknesses are stored. These should correlate to applications, interfaces, services, and tooling. Likewise, your organization may have a system catalog containing contact details of application owners and support groups. And for the most forward-looking organizations and those subject to external regulatory requirements, end-to-end system mappings from data ingestion to data consumption may be in place. These sources point to organizational maturity and inputs into delivering robust business continuity plans .

Note

Test your failover and failback scenario regularly by cloning production, replicating, and then failing over the clone to another account. Use automated testing wherever possible.

Snowflake Service Status

The first check should be whether your CSP and location are up and running; see https://status.snowflake.com . If your service is affected, failover is a manually invoked customer-controlled activity achieved by metadata update; therefore, it is very quick.

Failover and Failback

Depending upon the type of objects contained in your replicated databases, you may need to run additional cleanup scripts. However, it is hoped all cleanup activity is well known and documented by the support team, so no nasty surprises arise.

We also assume the security model implemented via roles has been applied to the DR replicated databases and that all entitlements are consistent with the primary site. Surprisingly, it is common to assume entitlements follow replicated objects when changes are made, a step sadly neglected in many release runbooks.

Naturally, client tooling needs repointing, which is covered separately later. Batch loads, Snowpipe, and other custom loading mechanisms also need consideration to ensure seamless post-failover operation, where some feeds may need replaying and others must be prevented from running twice. Note that some CSP storage infrastructure, such as AWS S3, supports independent replication, which may simplify failover and failback.
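As a minimal sketch of the underlying commands, assuming hypothetical organization, account, and database names and a Snowflake edition that supports replication and failover:

-- On the primary account: permit replication and failover of the database
ALTER DATABASE sales ENABLE REPLICATION TO ACCOUNTS myorg.dr_account;
ALTER DATABASE sales ENABLE FAILOVER TO ACCOUNTS myorg.dr_account;

-- On the secondary account: create the replica and refresh it on a schedule
CREATE DATABASE sales AS REPLICA OF myorg.primary_account.sales;
ALTER DATABASE sales REFRESH;

-- During a failover, promote the secondary replica to become the primary
ALTER DATABASE sales PRIMARY;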

Note

Just as failover needs careful consideration and planning, so does failback.

Our failover and failback must also consider CSP-specific requirements outside of Snowflake; for example, these may include storage and API integrations, Lambda, and S3 and SQS configurations.

Client Redirect

Client redirect uses Snowflake organization enhancements to URLs for accessing your Snowflake account. A Connection object exists in each account, with only one designated as the primary. When failing over, the secondary Connection object is promoted to primary, causing a behind-the-scenes update to the URL. After the CNAME record (a type of DNS record) is updated, typically within 30 to 60 seconds, all new access via the URL reaches the new primary site. Note that failover and client redirect are two separate operations.
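A minimal sketch of the commands involved, again assuming hypothetical organization, account, and connection names:

-- On the primary account: create the connection and permit failover
CREATE CONNECTION sales_conn;
ALTER CONNECTION sales_conn ENABLE FAILOVER TO ACCOUNTS myorg.dr_account;

-- On the secondary account: create a replica of the connection
CREATE CONNECTION sales_conn AS REPLICA OF myorg.primary_account.sales_conn;

-- To redirect clients during failover, promote the secondary connection
ALTER CONNECTION sales_conn PRIMARY;

Clients then connect via the organization connection URL (of the form organization-connection_name.snowflakecomputing.com), which is repointed by the CNAME update described above.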

Extending Snowflake Capability

This section describes some interactions between Snowflake and cloud services essential for implementing the Snowflake Data Cloud. More features are explored in later chapters. For AWS-based Snowflake accounts, S3 storage underpins all our database objects; therefore, we must have a means of accessing S3 buckets from Snowflake. We also need a way to interact with cloud storage outside Snowflake because our data feeds are often supplied as flat files in batch format. Every organization, at some point, resolves its data transfer to the lowest common denominator of flat file batch transfer with the expectation of refactoring later. We start with flat files being the minimum standard for data transfer when moving data from our silos into Snowflake.

Keeping the technical details at bay for now, I must introduce a new component into our discussion: the CSP account, where the services we rely upon reside. In our case, this is an AWS account, which can be created at https://aws.amazon.com/free .

Why should we create an AWS account? Simply put, to access the wealth of capabilities developed and supported by your chosen cloud vendor that lie outside of Snowflake's core capabilities. While Snowflake delivers huge, ever-expanding data capabilities, some features, such as Secrets Manager, should never become fully integrated, while others, such as Lambdas, document parsers, and S3 buckets, are a natural fit to work alongside Snowflake.

Figure 3-13 shows S3 storage referenced by Snowflake in two locations. Remember, our Snowflake account is a Virtual Private Cloud (VPC), and our AWS account is also a VPC: two different containers that co-exist in the same cloud region (in this case, but they could also be in different regions). Note that Snowflake can reference both S3 buckets. But how?
Figure 3-13

AWS storage integration

Within the Snowflake account boundary, we have access to effectively unlimited storage; we simply declare, then reference, it, and the elastic cloud provisions as much as we need. At the time of writing, this internal storage can only be referenced from within the Snowflake boundary, although upcoming features may make it accessible elsewhere.

In our AWS account, we can also create S3 storage into which, from our desktop machines, we can upload files or, using supplied AWS functionality, automagically copy files across from other S3 buckets. We can also upload files programmatically using a variety of tools.

Note

Unprotected files in our AWS account S3 buckets can be exposed to the public Internet, so please ensure an appropriate security policy is applied and tested before adding files.

Snowflake has commands to make external storage visible and accessible, called a storage integration ( https://docs.snowflake.com/en/sql-reference/sql/create-storage-integration.html ). The commands require work in both Snowflake and the AWS Management Console to establish the trust relationship. Then, we can drop files into S3 and access them using SQL. This theme is returned to when covering unstructured document support in Chapter 10.
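The following sketch assumes a hypothetical S3 bucket and IAM role; the ARN and locations are placeholders, and the AWS-side trust policy work described in the documentation is still required.

-- Create the storage integration referencing the AWS IAM role
CREATE STORAGE INTEGRATION s3_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake_access_role'
  STORAGE_ALLOWED_LOCATIONS = ('s3://my-landing-bucket/inbound/');

-- Retrieve the IAM user and external ID Snowflake generates; these values
-- are added to the AWS role trust relationship
DESC STORAGE INTEGRATION s3_int;

-- Create an external stage over the bucket and list its contents
CREATE STAGE landing_stage
  STORAGE_INTEGRATION = s3_int
  URL = 's3://my-landing-bucket/inbound/'
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

LIST @landing_stage;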

Not only can we access files in external storage from Snowflake, but we can also automatically ingest data when a file lands in our external storage by using Snowpipe, a built-in Snowflake feature; see https://docs.snowflake.com/en/user-guide/data-load-snowpipe-auto-s3.html . Figure 3-14 has been extended to incorporate a new AWS service, Simple Queue Service (SQS) ( https://aws.amazon.com/sqs/ ), which notifies Snowpipe when files are ready to consume. Snowpipe is discussed in Chapter 8.
Figure 3-14

Snowpipe automated ingestion
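A minimal Snowpipe sketch follows, reusing the hypothetical stage from the previous example and a hypothetical target table; the SQS notification channel returned by SHOW PIPES is configured as the S3 bucket event notification target.

-- Auto-ingest files landing in the external stage into a raw staging table
CREATE PIPE landing_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO staging.public.orders_raw
  FROM @landing_stage
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- The notification_channel column holds the SQS ARN for the S3 event notification
SHOW PIPES LIKE 'landing_pipe';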

Summary

This chapter explained how Snowflake software as a service (SaaS) is deployed centrally in a CSP region and how Snowflake accounts are provisioned. The discussion also gave insight into the Snowflake editions, calling out capabilities we investigate later and answering the question of why we should integrate with an AWS account.

The chapter looked at cloud storage, investigated the different types of stages and their differences and uses, and explained how Snowflake implements storage using micro-partitioning, as these underpin much of the Snowflake Data Cloud's core capabilities.

Looking toward the future where unknowable data types and new forms of media await, we understand how to address future state data demands. You can now begin to think in terms of “yes” being your first response backed up by the knowledge we can develop tooling to access data, rather than immediately saying “no” because the answers lie outside of your skill, knowledge, expertise, and comfort zone.

This discussion has taken you through some of the complexities in accessing data silos, calling out some key dependencies and challenges, not the least of which is security posture.

And having established the right mindset in preparation for looking into data silos, let’s open the door to the chapter on account security.
