Overview of Snowflake virtual data warehouses
Understanding warehouse use cases
Virtual data warehouse considerations
Building your first Snowflake virtual data warehouse
Overview of Snowflake Virtual Warehouses
Warehouse sizing and features
Multicluster warehouses
Warehouse Sizes and Features
Snowflake virtual warehouse sizes (as mentioned in Chapter 2) are T-shirt sizes such as X-Small, Small, Medium, Large, and so on. The credit charges for a virtual warehouse start at 1 for X-Small and double each size you go up. This makes it an easy pricing model to remember.
Choosing the Right Size
Even with the simple pricing model, it still might seem like a daunting task to come up with the approximate size needed for your virtual warehouse. Data loading and query processing are the two considerations that need to be mapped out when determining the size you need. As queries get more complex, the time it takes for the server to execute can increase. Likewise, as more data gets loaded into Snowflake, the loading performance might be affected.
Note
Larger virtual warehouses may not result in better performance for data loading or query processing.
Concurrency
The number of queries that a virtual warehouse can concurrently process is determined by the size and complexity of each query. As queries are submitted, the virtual warehouse calculates and reserves the compute resources needed to process each query. If the virtual warehouse does not have enough remaining resources to process a query, the query is queued, pending resources that become available as other running queries complete.
STATEMENT_QUEUED_TIMEOUT_IN_SECONDS
STATEMENT_TIMEOUT_IN_SECONDS
Note
If queries are queuing more than desired, another virtual warehouse can be created, and queries can be manually redirected to the new virtual warehouse. In addition, resizing a virtual warehouse can enable limited scaling for query concurrency and queuing; however, virtual warehouse resizing is primarily for improving query performance.
To enable fully automated scaling for concurrency, Snowflake recommends multicluster virtual warehouses, which provide essentially the same benefits as creating additional virtual warehouses and redirecting queries but without requiring manual intervention. Multicluster virtual warehouses are discussed later in this chapter.
Default Virtual Warehouses in Sessions
When a session is initiated in Snowflake, the session does not, by default, have a virtual warehouse associated with it. Until a session has a virtual warehouse associated with it, queries cannot be submitted within the session. Snowflake allows sessions to specify which virtual warehouse they will default to. Sessions can be initiated by users or tools, and sessions may also change to another virtual warehouse by using the USE WAREHOUSE command .
To facilitate querying immediately after a session is initiated, Snowflake supports specifying a default virtual warehouse for each individual user. The default warehouse for a user is used as the warehouse for all sessions initiated by that user. A default warehouse can be specified when creating or modifying the user, either by using the web interface or by using CREATE USER/ALTER USER.
Snowflake clients (SnowSQL, JDBC driver, ODBC driver, Python connector, etc.) can specify a default warehouse through their connections or configuration files, as appropriate. Chapters 4, 5, and 6 discuss these Snowflake clients and provide more information on how to set up the default warehouse.
Multicluster Virtual Warehouses
Some organizations may replicate data into separate data marts. They may also shift some data workloads outside of normal business hours or queue usage to boost performance. Snowflake offers users the ability to automatically scale out their virtual warehouse by distributing replicated data, in memory, across separate compute clusters.
Resizing your virtual warehouse can provide performance benefits for slow-running queries and data loading. Multicluster warehouses will automatically increase or decrease the number of queries.
Multicluster virtual warehouses are an Enterprise Edition feature. If multicluster virtual warehouses are enabled for your account, you can also set the maximum and minimum number of clusters for the warehouse. Multicluster warehouses will use multiple clusters of servers to handle cases where fluctuating numbers of concurrent queries occur, such as during peak and off-peak hours. As the load increases, the virtual warehouse automatically starts more clusters to prevent queries from queuing. When the additional clusters are no longer needed, it shuts them down.
Note
Multicluster virtual warehouses are a Snowflake Enterprise Edition feature.
Overview of Multicluster Virtual Warehouses
Multicluster virtual warehouses improve concurrency and are designed specifically for handling queuing and performance issues related to large numbers of concurrent queries by many users. In addition, multicluster virtual warehouses can be automated to scale if your number of users/queries tends to fluctuate.
All your virtual warehouses should be configured as multicluster virtual warehouses.
Multicluster virtual warehouses should be configured to run in autoscaling mode, which enables Snowflake to automatically start and stop clusters as needed.
Minimum: Keep the default value of 1; this ensures that additional clusters are started only as needed. However, if high availability of the virtual warehouse is a concern, set the value higher than 1. This helps ensure virtual warehouse availability and continuity in the unlikely event that a cluster fails.
Maximum: Set this value as large as possible, while being mindful of the virtual warehouse size and corresponding credit costs. For example, an X-Large virtual warehouse (16 servers) with maximum clusters (10) will consume 160 credits in an hour if all 10 clusters run continuously for the hour.
Multicluster Credits and Usage
The number of servers per cluster, which is determined by the virtual warehouse size
The number of clusters, if using multicluster virtual warehouses
The length of time each server in each cluster runs
Virtual Warehouse Considerations
Query design
Query caching, reuse of frequently run queries
Scaling
Query Design
Table size is more significant than the number of rows.
Filters using predicates impact processing.
A higher number of joins will also impact processing.
Tip
To achieve the best results, try to execute relatively homogeneous queries (size, complexity, data sets, etc.) on the same warehouse; executing queries of widely varying size and/or complexity on the same warehouse makes it more difficult to analyze warehouse load, which can make it more difficult to select the best size to match the size, composition, and number of queries in your workload.
Caching Impacts
When a virtual warehouse is in a running state, it maintains a cache of table data that has been accessed from previously completed queries. The size of the cache is determined by the number of servers in the virtual warehouse. The size of the cache increases as the virtual warehouse size increases.
The cache is dropped when the virtual warehouse is suspended, which may result in slower initial performance for some queries as the virtual warehouse is resumed. The cache is rebuilt when a warehouse is resumed, and queries are then able to take advantage of the improved performance that the cache provides.
Consider the trade-off between saving credits by suspending a virtual warehouse versus maintaining the cache of data from previous queries to help with performance.
Scaling
Virtual warehouse resizing is supported by Snowflake at any time. You can even resize a warehouse while the virtual warehouse is running. When you have a slow query and have other similarly sized queries queued up, it might make sense to resize the warehouse while it’s running.
But be aware that larger is not necessarily faster; for smaller, basic queries that are already executing quickly, you may not see any significant improvement after resizing. If a query is already running, then it will not be impacted by the resized virtual warehouse. The additional virtual warehouses will be used only by the queued or new queries that will be processed.
Building a Snowflake Virtual Warehouse
There are two ways you can build a Snowflake virtual warehouse. It can be done either through the web interface or through SQL commands. We will demonstrate how to do it in both ways. We will be doing this through an already created Snowflake account. If you haven’t yet created your Snowflake account, please review Chapter 2 to get your Snowflake account set up. Let’s get started!
Creating a Virtual Warehouse
Snowflake Web Interface
- 1.
Log in to your Snowflake web interface.
- 2.
Click Warehouses + Create.
- 3.
Give your virtual warehouse a name, choose a size, and set the maximum idle time before the virtual warehouse automatically suspends. Click Finish. Figure 3-2 shows this information entered into the web interface.
Tip
SQL Commands
- 1.
The following CREATE WAREHOUSE command will give the same results as what was done in the web interface:
CREATE WAREHOUSE TestWarehouseWITH WAREHOUSE_SIZE = 'MEDIUM'WAREHOUSE_TYPE = 'STANDARD'AUTO_SUSPEND = 600AUTO_RESUME = TRUE; - 2.Figure 3-4 shows that our warehouse, TESTWAREHOUSE, is created after provisioning was completed.
Important
Always ensure that auto suspend and auto resume are set in your warehouse. By default, these settings are set for you when a virtual warehouse is provisioned. Auto suspend stops a warehouse if it sits idle for a specified period of time, while auto resume starts a suspended virtual warehouse when queries are submitted to it. This is important because a running warehouse will consume Snowflake credits only when compute resources are being utilized. Shutting down your warehouse, when they are not in use, will help conserve credits and control costs.
Starting, Resuming, Suspending, and Resizing
You can also manually start/resume, suspend, and resize a virtual warehouse. These are especially helpful for one-offs or if you don’t want to wait for the automatic start or suspension of your virtual warehouse. A virtual warehouse can be started at any time, even when it’s first created.
Web Interface
Start/Resume: Click Warehouses and select your warehouse. Click Resume.
Suspend: Click Warehouses and select your warehouse. Click Suspend.
Resize: Click Warehouses and select your warehouse. Click Configure. Select the new size from the Size drop-down. Click Finish.
SQL Command
Start/Resume: Use the ALTER WAREHOUSE command with RESUME.
ALTER WAREHOUSE TestWarehouseRESUME IF SUSPENDEDSuspend: Use the ALTER WAREHOUSE command with SUSPEND.
ALTER WAREHOUSE TestWarehouseSUSPENDResize: Use the ALTER WAREHOUSE command with SET WAREHOUSE_SIZE.
ALTER WAREHOUSE TestWarehouseSET WAREHOUSE_SIZE = XSMALL
Using a Warehouse
Setting Up Load Monitoring
- 1.
Click Warehouses and highlight your chosen warehouse. A page will appear on the right allowing you to grant privileges.
- 2.
Click Grant Privileges.
- 3.
For “Privileges to grant,” select MONITOR.
- 4.
For “Grant privileges to,” select ACCOUNTADMIN.
- 5.
Click Grant, as shown in Figure 3-6.
- 1.
Click Warehouses.
- 2.
Click your choice of warehouse name (which should be hyperlinked).
Note
The date slider has a minimum/maximum range of 8 hours to 14 days. Load monitoring data is not available previous to 14 days.
Understanding Load Monitoring
Running: Queries that were actively running during the interval. These queries may have started before or during the interval.
Queued: Queries that are in wait status. The wait could be because of the warehouse load being maxed out and therefore would need to wait for running queries to complete processing.
Queued Provisioning: Queries that are in wait status because the warehouse is provisioning, usually after a warehouse resumes.
Queued Repairing: Queries that are in wait status while the warehouse is repaired. While rare, this occurs only for a few minutes.
Query load is calculated by taking the execution time for each query and dividing it by the interval time in seconds. This produces a percent that is then additive for all the queries in that interval. The percent is divided up by query status, which is displayed in the bar chart.
Best Practices for Load Monitoring
A high query load and queuing are indicative of needing to start up a separate warehouse. Move the queued queries to the new warehouse.
A low query load could mean your running queries need more compute resources. Resizing the warehouse can take care of this.
Use the load monitor to study how the daily workloads look over the course of two weeks. Notice the trends in spikes and consider creating a warehouse specifically for peak workloads. Or consider switching to multicluster so that autoscaling takes care of the peak traffic.
Use the load monitor to look at trends in Snowflake credit usages. See whether the minimum cluster count or warehouse size needs to be decreased to save money on credits.
Summary
This chapter covered the basics of Snowflake virtual warehouses. You should now know the different virtual warehouse sizes and features and what a multicluster virtual warehouse is. We also reviewed some important things to consider with your virtual warehouse. Last, we showed how to create your first virtual warehouse and highlighted the load monitoring tool.