Performance tuning is often regarded as a “black art,” available only to those with experience gained over many years of hands-on work. This section offers insights into how Snowflake works “under the hood” and provides tools to better understand behavior and identify options to improve performance.
Chapter 3 discusses micro-partitions and cluster keys, which are critical to performance for large-volume data sets. This chapter delves into the fundamentals driving performance, starting with clustering. We cannot proactively propose solutions without understanding how clustering is implemented and a few techniques to cluster our tables optimally. I provide summary guidance, or “rules of thumb,” to facilitate design or later remediation.
By understanding how Snowflake processes queries to enable efficient data access, we can choose the most appropriate storage and warehouse strategy to reduce costs. Our next section focuses on query performance, where we investigate various scenarios in preparation for your real-world situations.
This chapter is neither exhaustive nor prescriptive in approach but offers a starting point in determining performance bottlenecks and ways to investigate and remediate them. Your real-world experience will vary from the scenarios presented.
Naturally, there is always much more to discover. I trust this brief chapter provides sufficient information and context and delivers motivation for your further investigations.
Clustering
We know each micro-partition holds between 50 MB and 500 MB of data organized in hybrid columnar format, optimizing compression according to the column data type. But how does Snowflake determine the order in which data is organized in micro-partitions? In the absence of explicit cluster key information, Snowflake simply preserves the order in which data arrives from the source file while filling micro-partitions to the requested size. In other words, data order is not changed, analyzed, or modified during load.
For many low-volume data sets, clustering by key value is not particularly relevant. For example, in those scenarios where all data for an object fit into a single micro-partition, no micro-partition pruning is possible. But for those large data sets spanning several or many micro-partitions with appropriate clustering keys declared, micro-partition pruning is possible, assuming query predicates match the cluster key.
This section focuses on cluster keys, recognizing that performance is also a function of CPU and memory. We later revisit warehouses.
What Is a Clustering Key?
A clustering key comprises a subset of attributes or expressions on a table, declared in the least selective to most selective order, with the express intent of co-locating data and designed to match the commonly used query predicates. Superficially a clustering key is “like” an index. But the similarity soon breaks down because only a single clustering key can be declared for a table. Unlike an index, all attributes are stored along with the clustered attributes. Clustering key documentation is at https://docs.snowflake.com/en/user-guide/tables-clustering-keys.html.
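As a brief sketch (the table and column names here are hypothetical, chosen only for illustration), a clustering key can be declared when the table is created or added later with ALTER TABLE.

```sql
-- Declare a clustering key at table creation (hypothetical table).
CREATE TABLE sales (
    sale_date   DATE,
    region      VARCHAR,
    customer_id NUMBER,
    amount      NUMBER(12,2)
)
CLUSTER BY (region, sale_date);

-- Or add/replace the clustering key on an existing table, ordered
-- least selective (region) to most selective (sale_date).
ALTER TABLE sales CLUSTER BY (region, sale_date);
```

Note that ALTER TABLE ... CLUSTER BY returns immediately; the actual reorganization of micro-partitions happens in the background.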
Cluster Key Operations
Although ALTER TABLE completes without delay, work continues in the background. To determine whether reclustering is complete, check the average_depth attribute in the result set for system$clustering_information; when the average_depth value remains constant, reclustering is complete.
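A minimal sketch of that check, assuming the hypothetical sales table from earlier:

```sql
-- average_depth is reported in the JSON returned by this function.
SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(region, sale_date)');

-- Re-run periodically; when average_depth stops decreasing and holds
-- constant between runs, background reclustering is complete.
```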
Default Clustering
Take a few moments to examine partition_depth_histogram, which tells us the disposition of the data. The preceding example confirms the table is well clustered, showing high partition counts at level 00001 down to low partition counts at level 00006.
Just as we did before, take a few moments to examine the partition_depth_histogram, which tells us the disposition of the data. The preceding example confirms the table is poorly clustered, showing partition counts that rise from low to high and then fall again as depth increases.
Clustering depth documentation is at https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html#clustering-depth.
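The histogram and average depth can both be retrieved with the system functions below; table and column names remain our hypothetical example.

```sql
-- Returns average_depth, total_constant_partition_count, and
-- partition_depth_histogram for the table's declared cluster key.
SELECT SYSTEM$CLUSTERING_INFORMATION('sales');

-- Returns just the average clustering depth for the stated columns;
-- lower values indicate better clustering.
SELECT SYSTEM$CLUSTERING_DEPTH('sales', '(region, sale_date)');
```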
Identify Cluster Key
Fewer attributes in the key are better; in preference, only one or two attributes.
Cluster key attribute order should be lowest cardinality first to highest cardinality last.
Apply expressions to reduce the cardinality of keys, for example, DATE_TRUNC().
Putting a high cardinality attribute before a low cardinality attribute reduces the effectiveness of clustering on the second attribute.
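To illustrate the cardinality-reducing expression mentioned above (again using the hypothetical sales table), DATE_TRUNC collapses a high-cardinality date or timestamp into coarser buckets, so a whole month's rows co-locate in the same micro-partitions.

```sql
-- Cluster on monthly buckets rather than individual dates, reducing
-- the effective cardinality of the second key attribute.
ALTER TABLE sales CLUSTER BY (region, DATE_TRUNC('MONTH', sale_date));
```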
As each real-world scenario differs, the following subsections describe a methodology for identifying cluster keys.
Step 1. Identify Access Paths
By profiling your queries, identify the most often used access paths to the object.
To do this using the Snowflake user interface, navigate to the History tab. Using the available filters, click the query ID corresponding to the SQL text of interest, then click the Profile tab.
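As an alternative to the user interface, the same profiling can be sketched in SQL against the ACCOUNT_USAGE store (table name and filter are hypothetical; note this view can lag by up to 45 minutes):

```sql
-- Find the longest-running recent queries touching the table, along
-- with how many micro-partitions they scanned versus the total.
SELECT query_id,
       query_text,
       total_elapsed_time / 1000 AS elapsed_seconds,
       partitions_scanned,
       partitions_total
FROM   snowflake.account_usage.query_history
WHERE  query_text ILIKE '%sales%'
AND    start_time > DATEADD(day, -7, CURRENT_TIMESTAMP())
ORDER BY total_elapsed_time DESC
LIMIT 20;
```

A high partitions_scanned relative to partitions_total suggests pruning is not occurring for those queries.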
If no micro-partition pruning is evident, either we are not using the cluster key as a filter column in query predicates, or the table is not clustered.
Step 2. Gather Baseline Metrics
Step 3. Assign Cluster Key
With the preceding information in hand, we should try another key aligned to the predicates used in the poorly performing queries identified earlier, and retest. For this test, we need real data to work with, created in the next section.
However, reclustering uses serverless compute, and operations must complete before results can be checked. To repeat the earlier guidance: to determine whether reclustering is complete, check the average_depth attribute in the result set for system$clustering_information; when the average_depth value remains constant, reclustering is complete.
Step 4. Retest
As with every change, our results must be tested and proven good.
As the next section demonstrates, we should make a single change at a time and be methodical about how we approach testing.
Setup Test Data
After confirming reclustering has completed, recheck clustering wherever you see a problem. Note that some result set rows were removed to save space.
Better clustering is indicated by
average_depth = low
In this example, we have partially achieved our desired effect.
average_depth desired = low, actual = medium (3.0055)
total_constant_partition_count desired = high, actual = high (1768)
ps_partkey = 200000000
ps_suppkey = 10000000
A closer look reveals the truth.
average_depth desired = low, actual = medium (8.7319)
total_constant_partition_count desired = high, actual = low (0)
But we can see notes providing useful information.
average_depth desired = low, actual = low (3.4401)
total_constant_partition_count desired = high, actual = low (0)
Our results are arguably better than those delivered for the source table where we began our investigation. In reality, they are only as good as the improvement in performance we experience in real-world usage. Caveat emptor. Under the circumstances, this is probably the best we can achieve.
Materialized Views
Having discussed materialized views in Chapter 10, we consider them worthy of a second mention.
Materialized views can be clustered using a different cluster key. As they are maintained behind the scenes using serverless compute, they are guaranteed to remain consistent with the underlying table. Therefore, materialized views represent an alternative way to implement clustering suited to access paths different from our primary use case, albeit incurring storage and maintenance costs.
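A sketch of this pattern, with hypothetical names and assuming an edition of Snowflake that supports materialized views: the base table stays clustered for its primary access path while the materialized view serves a different one.

```sql
-- Base table is clustered by (region, sale_date); the materialized
-- view provides an alternate access path clustered by customer.
CREATE MATERIALIZED VIEW sales_by_customer
CLUSTER BY (customer_id)
AS
SELECT customer_id, sale_date, region, amount
FROM   sales;
```

Queries filtering on customer_id may then be satisfied from the materialized view, which the optimizer can select automatically in preference to the base table.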
Automatic Clustering
To be clear, not every table benefits from clustering. Only very large tables, in the order of several terabytes and larger, benefit from clustering. If data volumes and testing indicate adding cluster keys is beneficial, then consider enabling automatic clustering. Automatic clustering is where Snowflake utilizes serverless compute and manages all future maintenance on an as-needed basis. More information is at https://docs.snowflake.com/en/user-guide/tables-auto-reclustering.html#automatic-clustering.
Automatic clustering is enabled by default when a table is reclustered.
Automatic clustering does not attempt to provide the best performance possible but instead aims to improve clustering depth to achieve both predictable and acceptable performance.
As we might expect, once enabled, we cannot influence automatic clustering behavior except by disabling it. There are no other levers available to apply.
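The single available lever looks like this (hypothetical table name):

```sql
-- Disable Automatic Clustering for a table.
ALTER TABLE sales SUSPEND RECLUSTER;

-- Re-enable it later; Snowflake resumes maintenance as needed.
ALTER TABLE sales RESUME RECLUSTER;
```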
The corresponding view can be found in Account Usage. Note that the latency may be up to 3 hours. More information is at https://docs.snowflake.com/en/sql-reference/account-usage/automatic_clustering_history.html#automatic-clustering-history-view.
Factors affecting credit consumption include the number and cardinality of key(s), how many micro-partitions are involved in reclustering, the size of the reclustered table, and the frequency and pattern of DML operations.
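A hedged sketch of monitoring that credit consumption via the Account Usage view mentioned above (remember the latency of up to three hours):

```sql
-- Credits consumed by Automatic Clustering per table over the
-- last month, largest consumers first.
SELECT table_name,
       SUM(credits_used)         AS credits,
       SUM(num_rows_reclustered) AS rows_reclustered
FROM   snowflake.account_usage.automatic_clustering_history
WHERE  start_time > DATEADD(month, -1, CURRENT_TIMESTAMP())
GROUP BY table_name
ORDER BY credits DESC;
```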
Clustering Recommendations
As previously stated, clustering is not for every table. I recommend starting with tables of 1 TB or more. For those smaller tables, you may see some benefits to adding a cluster key but proceed on a case-by-case basis.
Query predicates (WHERE, JOIN, GROUP BY columns) are a good indicator of appropriate cluster keys to add. We also have materialized views available to provide alternate clustering for the same table, and the optimizer may select from materialized views in preference to referenced tables.
Begin with one or two cluster key attributes and aim to keep the total number of cluster key attributes small, preferably fewer than three.
Always order cluster key attributes from least selective to most selective, that is, from low cardinality to high cardinality.
Monitor table activity as DML operations can skew micro-partition content over time.
Consider enabling automatic clustering, but note resource monitors do not currently track credit consumption.
Query Performance
Having looked at clustering, let’s now look at how to identify poorly performing queries and discover remediation techniques. Documentation is at https://docs.snowflake.com/en/user-guide/ui-query-profile.html#query-operator-details .
Warehouse Considerations
Snowflake retains cached results for 24 hours; each time the same SQL re-runs before the purge, the retention period resets, up to a maximum of 31 days.
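When benchmarking, the result cache can mask the true cost of a query. One option, assuming you want timings that reflect real work, is to disable it for the session:

```sql
-- Disable the result cache so repeated test runs are not served
-- from cached results; re-enable (TRUE) when benchmarking is done.
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
```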
We can check warehouse settings from the Snowflake user interface Warehouses tab.
Query Runtime and Memory Spills
This section addresses scaling up from Figure 11-4.
For this investigation, we resize our existing warehouse, compute_wh; don’t forget to resize it back to X-Small at the end of this exercise. We do not recommend resizing as a general practice.
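The resize itself is a single statement each way:

```sql
-- Scale up for the test...
ALTER WAREHOUSE compute_wh SET warehouse_size = 'MEDIUM';

-- ...and remember to scale back down afterward.
ALTER WAREHOUSE compute_wh SET warehouse_size = 'XSMALL';
```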
Re-run the same query, then repeat the same steps. Navigate to the History tab; refresh the screen if necessary. Note that the size is Medium, then click Query ID for the most recent SQL statement invocation. Total Duration is now 59.1 seconds in my case. Click Profile. In Profile Overview, spilling has been reduced to 6.47 GB.
Re-run the same query and then repeat the same steps. Navigate to the History tab; refresh the screen if necessary. Note that the size is Large, and then click Query ID for the most recent SQL statement invocation. Total Duration is now 29.3 seconds in my case. Click Profile. In Profile Overview, note the absence of spilling.
We can conclude for this sample query that either a Medium or a Large warehouse are reasonable compromises depending upon performance criteria. For your information, I re-ran using the X-Large warehouse (16 times the X-Small T-shirt size) with a reduction in runtime to 14 seconds.
One final consideration relates to cost. At the beginning of this section, we set auto-suspend to 60 seconds. Snowflake’s minimum warehouse charge is 60 seconds each time a warehouse starts. Therefore, we must consider the cost implications of running larger warehouses: each increase in T-shirt size doubles the cost per unit of runtime, albeit at reduced runtime.
Concurrency
This section addresses scaling out from Figure 11-4.
Auto-Scale mode is enabled by setting max_cluster_count greater than min_cluster_count. Where max_cluster_count and min_cluster_count are set to the same value greater than 1, the warehouse runs in Maximized mode, meaning the warehouse always initializes the stated number of clusters on startup.
Scaling out determines when and how Snowflake turns on/off additional clusters in the warehouse, up to the max_cluster_count according to query demand, maximizing query responsiveness and concurrency. There is no additional cost to implementing Auto-Scale mode, as a single cluster is initially instantiated on demand. As demand rises, more clusters are instantiated, then shut down as demand falls. In contrast, Maximized mode always instantiates the stated number of clusters on startup and does not shut clusters down as demand falls. Therefore, the initial cost is higher, which may be acceptable for consistently repeatable performance. There are no hard and fast rules, simply guidelines according to the importance of response times, concurrency demands, and performance. We pay according to requirements.
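The two modes described above can be sketched as follows, reusing our compute_wh warehouse:

```sql
-- Auto-Scale mode: min and max differ; clusters start and stop
-- on demand between the two bounds.
ALTER WAREHOUSE compute_wh
SET min_cluster_count = 1, max_cluster_count = 4;

-- Maximized mode: min equals max (> 1); all stated clusters
-- start immediately and stay up while the warehouse runs.
ALTER WAREHOUSE compute_wh
SET min_cluster_count = 3, max_cluster_count = 3;
```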
Snowflake implements a hard limit of 10 concurrent clusters, with each cluster capable of handling 8 concurrent requests.
Although the creation of multiple parallel execution threads is well beyond the scope of this chapter, you can get useful information on the effects of concurrent execution, SQL queueing, and clustering at https://community.snowflake.com/s/article/Putting-Snowflake-s-Automatic-Concurrency-Scaling-to-the-Test. I do not endorse third-party products. I simply offer this link to add value to your own investigations.
Workload Segregation
This section addresses scaling across from Figure 11-4.
You have seen the effect of scaling up, where increasing warehouse size improves query response times, and we have discussed the effect of adding clusters to scale out. This section addresses scaling across and explains why we should consider adding multiple separately declared warehouses of the same size.
To a point, adding more discretely named warehouses also addresses concurrency concerns, but the aim is more subtle. When we scale across, we segment our billing and cross-charge consumption by warehouse usage, where each department’s requirements for cost effectiveness and response times may differ. We also increase the probability of result cache reuse, as SQL queries issued within a department may have been run before the warehouse quiesces. A further benefit is the reduction in resource contention, as warehouse usage is restricted to a subset of the user base and is, therefore, more tunable to a narrower workload than might otherwise be possible.
The case for scaling across is less clear than for scaling up and out, but it becomes clearer in larger organizations and is left for your further investigation.
Summary
This chapter began by diving into clustering. You learned how to define a cluster key and determine better clustering as indicated by average_depth = low and total_constant_partition_count = high with a descending number of partitions across a shallow depth.
We then identified costs associated with automatic clustering. We determined how to disable and enable automatic clustering, noting that the aim is to improve clustering depth to achieve both predictable and acceptable performance.
Ending our discussion on clustering by providing recommendations, we discussed query performance and spilling data from memory to disk, commenting on the cost implications of increasing warehouse size.
A full treatment of concurrency falls outside the available space, though I hope you find the discussion useful. Finally, the comments on workload segregation may not resonate with smaller organizations.
Having looked at query performance, let’s next look at data management.