Chapter 3. Building Scalable Data Applications

Scalability is a requirement for successful data applications. A product that scales well can quickly onboard new customers, enable customers to run new workloads without impacting the performance of others, and take advantage of the elasticity of the cloud to keep costs in check. By thinking about scalability from the beginning you can avoid bottlenecks and costly redesign efforts that can blunt product growth.

In Chapter 2 you learned about important features of a modern data platform. In this chapter you will learn how to best leverage those features to design scalable data applications. We will begin with an overview of the key design considerations for building data applications that scale. The rest of the chapter will dive into best practices and real-world examples to support these considerations. At the end of this chapter you will understand how to make the best use of Snowflake’s features for designing scalable data applications.

Design Considerations for Data Applications

As discussed in Chapter 2, support for multiple tenants is a foundational requirement for data applications. Underlying this requirement are three components: storage, compute, and security. In this section we will present design patterns and examples covering each of these areas. Data application customers will be referred to as “tenants” and individual users associated with a customer will be referred to as “users.”

This section will include examples using Snowflake’s virtual warehouses, which are clusters of compute resources that provide CPU, memory, and temporary storage to perform SQL operations.[1] We will also introduce Snowflake’s multi-cluster warehouses, which offer the capabilities of virtual warehouses with a larger pool of compute resources, including built-in load balancing.[2]

Design Patterns for Storage

Data applications need to ensure that customers can’t see each other’s data, but the details of how this is accomplished depend on a variety of factors that will vary by application. In this section we will discuss different methods to isolate data between tenants and processes and provide recommendations on when to use each method.

Multi-tenant tables

Multi-tenant tables combine all tenants into a single set of database objects. In this scenario, all tenants share the same tables, with row-level security applied to ensure isolation. This greatly reduces the number of objects you have to maintain, which can make it easier to support many more tenants without increasing your operational burden.

Snowflake implements multi-tenant tables as illustrated in Figure 3-1. Source data is exposed in a single table shared across all tenants, with an application layer that ensures each tenant can only access the data it has permission to access.

Figure 3-1. Snowflake multi-tenant table
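The application-layer filtering described above can be sketched as follows. This is a minimal illustration, not Snowflake code: SQLite stands in for the data platform so the example runs locally, and the `orders` table, its columns, and the `fetch_orders` helper are all hypothetical names.

```python
from __future__ import annotations

import sqlite3

# Stand-in for the shared multi-tenant table. SQLite is used only so the
# sketch runs locally; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (tenant_id TEXT, order_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("acme", 1, 10.0), ("acme", 2, 25.5), ("globex", 3, 7.25)],
)


def fetch_orders(tenant_id: str) -> list[tuple[int, float]]:
    """Application-layer accessor: every query is filtered by tenant_id."""
    rows = conn.execute(
        "SELECT order_id, amount FROM orders WHERE tenant_id = ? ORDER BY order_id",
        (tenant_id,),
    )
    return rows.fetchall()
```

Because callers never query the table directly, a tenant can only receive rows tagged with its own identifier. In a real deployment the same filter would more likely live in the platform itself, for example in a secure view.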

An important consideration for multi-tenant tables is ensuring performance does not degrade as the tables grow. Without optimization, larger tables mean more data to scan and slower queries. To address this, cluster data in multi-tenant tables by tenant ID rather than by date. With this clustering scheme, each tenant’s query over its slice of the data requires Snowflake to scan as few files as possible, which improves performance.
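To see why the clustering key matters, consider a toy model of file pruning. This is a simplified sketch, not Snowflake's storage engine: it assumes rows are packed into fixed-size files in sort order, and that a file can be skipped when the tenant ID falls outside the file's minimum and maximum tenant values. `ROWS_PER_FILE` and the sample data are hypothetical.

```python
from itertools import islice

ROWS_PER_FILE = 4  # hypothetical file size, chosen to keep the example small


def build_files(rows, sort_key):
    """Sort rows by sort_key and pack them into fixed-size files."""
    it = iter(sorted(rows, key=sort_key))
    files = []
    while chunk := list(islice(it, ROWS_PER_FILE)):
        files.append(chunk)
    return files


def files_scanned(files, tenant_id):
    """Count files whose min/max tenant range could contain tenant_id."""
    count = 0
    for f in files:
        tenants = [tenant for tenant, _day in f]
        if min(tenants) <= tenant_id <= max(tenants):
            count += 1
    return count


# 4 tenants x 4 days = 16 rows of (tenant_id, day)
rows = [(tenant, day) for day in range(4) for tenant in ("a", "b", "c", "d")]

by_date = build_files(rows, sort_key=lambda r: (r[1], r[0]))
by_tenant = build_files(rows, sort_key=lambda r: (r[0], r[1]))
```

In this model, a query for tenant `"b"` touches all four files when the data is ordered by date, because every file contains every tenant. When the data is ordered by tenant ID, the same query touches only one file.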

Object per tenant

In an object per tenant model, underlying database instances are shared but database objects are allocated to separate tenants. For example, tenants may have their own databases, schemas, and tables but be commingled in a single database instance. Role-based access control (RBAC) is used to isolate tenants to their respective objects. While the overhead is lower than with an account per tenant model, discussed next, this approach can also quickly become unwieldy to manage as the customer base grows.

The object per tenant model is commonly used when the shape of the data for each tenant is different. In cases where requirements demand greater data separation than row-level security, object per tenant can be an effective choice. Figure 3-2 depicts how this option is designed in Snowflake.

Figure 3-2. Snowflake object per tenant design pattern
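One way to keep per-tenant objects manageable is to generate the setup statements from a template. The sketch below builds a list of SQL strings for a schema-per-tenant layout. The naming scheme (`<TENANT>_SCHEMA`, `<TENANT>_ROLE`) and the helper name are assumptions made for illustration, and the statements should be checked against the Snowflake documentation before use.

```python
from __future__ import annotations

import re

# Accept only plain identifiers so tenant names cannot inject SQL.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def tenant_setup_statements(database: str, tenant: str) -> list[str]:
    """Return SQL that creates a schema and a read-only role for one tenant."""
    for name in (database, tenant):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    schema = f"{database}.{tenant}_SCHEMA"  # hypothetical naming scheme
    role = f"{tenant}_ROLE"                 # hypothetical naming scheme
    return [
        f"CREATE SCHEMA IF NOT EXISTS {schema}",
        f"CREATE ROLE IF NOT EXISTS {role}",
        f"GRANT USAGE ON DATABASE {database} TO ROLE {role}",
        f"GRANT USAGE ON SCHEMA {schema} TO ROLE {role}",
        f"GRANT SELECT ON ALL TABLES IN SCHEMA {schema} TO ROLE {role}",
    ]
```

Calling `tenant_setup_statements("APP_DB", "ACME")` yields five statements scoped to `APP_DB.ACME_SCHEMA`. Every new tenant adds another set of objects and grants, which is where the management burden described above comes from.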

Account per tenant

Another way to provide data isolation is to create a new database instance for every tenant. This ensures complete separation between tenants, which is important for applications with contractual or regulatory requirements.

While this approach guarantees complete isolation, it comes with higher overhead and maintenance costs due to additional administrative objects to maintain for every instance. For example, for every tenant you will need to maintain a separate database instance. As the number of tenants increases, so will the number of resources needing to be upgraded, monitored, and debugged, leading to significant maintenance and support costs. Therefore, consideration should be given to the number of tenants you will need to support with this method.

Snowflake makes the creation of instances simpler, because an instance in Snowflake is just a logical Snowflake account which can be created with a SQL statement. There is still some administrative overhead with this approach, but not as much as with a physical database instance. Figure 3-3 shows the Snowflake account per tenant design pattern. In this case, each tenant has a dedicated database instance associated with its account.

Figure 3-3. Snowflake account per tenant design pattern

Design Patterns for Compute

Suboptimal design for compute in multi-tenant environments can lead to poor query performance, delays in ingesting new data, and difficulties servicing the needs of different tiers of customers. In this section we will discuss different methods for allocating compute resources to meet performance requirements.

Compute scaling is discussed along two axes: vertical and horizontal. Vertical scaling refers to the ability to provide more powerful resources to perform a task. If a customer runs a complex task, vertical scaling can improve the runtime. In cloud platforms, this involves provisioning different types of compute instances with more powerful specifications.

While vertical scaling can make a task run faster, horizontal scaling increases the number of tasks that can run at the same time, such as when many users access the platform simultaneously. Another example is a data ingestion process onboarding a large dataset, where scaling out horizontally adds nodes and improves parallelization, resulting in faster data ingestion. Horizontal scaling is implemented by changing the number of compute instances available under a load balancer.
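The difference between the two axes can be illustrated with a toy model. It assumes identical, independent tasks that run in batches, one task per instance, and a `speed` multiplier standing in for a more powerful instance. The function name and all numbers are hypothetical.

```python
import math


def makespan(num_tasks: int, work_per_task: float, instances: int, speed: float) -> float:
    """Time to finish all tasks when each instance runs one task at a time."""
    batches = math.ceil(num_tasks / instances)
    return batches * work_per_task / speed


# Baseline: 2 instances at speed 1, each task is 60 seconds of work.
baseline_single = makespan(1, 60, instances=2, speed=1)    # one task
vertical_single = makespan(1, 60, instances=2, speed=2)    # faster instances
horizontal_single = makespan(1, 60, instances=4, speed=1)  # more instances

baseline_many = makespan(8, 60, instances=2, speed=1)      # eight tasks
horizontal_many = makespan(8, 60, instances=4, speed=1)
```

In this model, doubling the speed halves the time for a single task, while doubling the instance count leaves a single task unchanged but halves the time to clear eight concurrent tasks.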

Overprovisioning

One approach to keeping up with compute demand is to provision an excess of compute resources in anticipation of increased or variable demand. Having additional compute resources available means you can quickly scale out capacity without having to wait for a new instance to be provisioned.

Overprovisioning relies on the ability to predict demand, which can be tricky. Consider the impact of COVID-19, which resulted in sudden, enormous demand for video conferencing. An event like this is effectively impossible to predict, and capacity planned in advance can fall short, disrupting service for customers.

Additional drawbacks to overprovisioning include cost and load balancing. If demand is lower than forecast, it’s costly to pay for idle resources. If demand is higher than forecast, there is a poor customer experience as performance will degrade. Furthermore, when scaling out capacity, existing jobs need to be reorganized to balance the load across the additional resources. This can be challenging to accomplish without impacting existing workloads.
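Both failure modes can be quantified with simple arithmetic. The sketch below assumes demand and capacity are measured in the same abstract compute units per hour. The function name and the sample demand figures are made up for illustration.

```python
from __future__ import annotations


def provisioning_gap(provisioned: int, demand_by_hour: list[int]) -> tuple[int, int]:
    """Return (idle unit-hours paid for, unit-hours of unmet demand)."""
    idle = sum(max(provisioned - d, 0) for d in demand_by_hour)
    unmet = sum(max(d - provisioned, 0) for d in demand_by_hour)
    return idle, unmet


demand = [2, 3, 8, 12, 4]  # hypothetical hourly demand
idle, unmet = provisioning_gap(6, demand)
```

With a fixed capacity of 6 units, this demand profile leaves 9 unit-hours idle during quiet periods and 8 unit-hours unserved during the peak, so the same fixed setting is both too expensive and too small.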

Autoscaling

Instead of trying to predict demand, autoscaling will increase the amount of compute available as demand rises and decrease it as demand subsides. Snowflake’s multi-cluster, shared data architecture manages this for you by autoscaling within a configurable number of minimum and maximum clusters and automatically load balancing as the number of users changes.
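The idea of scaling within a configured minimum and maximum can be sketched as a clamp. This is a toy model and not Snowflake's actual scaling policy: it assumes each cluster handles a fixed number of concurrent queries, and `queries_per_cluster` is a hypothetical parameter.

```python
import math


def clusters_needed(
    concurrent_queries: int,
    queries_per_cluster: int,
    min_clusters: int,
    max_clusters: int,
) -> int:
    """Clusters required for the current load, clamped to the configured range."""
    wanted = math.ceil(concurrent_queries / queries_per_cluster)
    return max(min_clusters, min(wanted, max_clusters))
```

With a range of 1 to 4 clusters and 8 queries per cluster, an idle system stays at 1 cluster, 20 concurrent queries call for 3, and 100 concurrent queries are capped at 4.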

When it comes to scaling, consideration needs to be given to trade-offs in cost, resource availability, and performance SLAs. With multi-cluster warehouses you can choose to provide dedicated compute resources for tenants that need them, and for others a pool of shared resources that can autoscale horizontally when more tenants are on the system, enabling you to easily provide separate tiers of service. For example, customers who pay more could each be given their own more powerful warehouse, while lower-paying customers could be pooled onto a smaller, cheaper warehouse that can autoscale up to a maximum number of clusters when more customers are simultaneously using the application. In this case, balancing loads across the scaled compute capacity must be managed as well. Snowflake has built-in load balancing that handles this out of the box with a simple configuration.
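A tiering scheme like this can be expressed as a small routing function in the application layer. The sketch below is illustrative only: the tier names, warehouse names, sizes, and cluster limits are all assumptions rather than recommended settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WarehouseSpec:
    name: str
    size: str
    min_clusters: int
    max_clusters: int


# Hypothetical shared pool for lower tiers; allowed to scale out to 4 clusters.
SHARED_POOL = WarehouseSpec("WH_SHARED_POOL", "SMALL", 1, 4)


def warehouse_for(tenant: str, tier: str) -> WarehouseSpec:
    """Premium tenants get a dedicated, larger warehouse; others share a pool."""
    if tier == "premium":
        return WarehouseSpec(f"WH_{tenant.upper()}", "LARGE", 1, 1)
    return SHARED_POOL
```

Here `warehouse_for("acme", "premium")` returns a dedicated `WH_ACME` specification, while every standard-tier tenant is routed to the same shared pool.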

Workload isolation

As discussed in Chapter 2, workload isolation is important to ensure good performance in multi-tenant data systems. Isolating workloads also helps protect against runaway processes or bad actors. For example, if a single compute instance were shared among several tenants, a rogue process could disrupt all the tenants. With separate instances, the instance with the rogue process could be shut down without disrupting other workloads.

Different workloads have different compute needs, making workload isolation attractive for separating synchronous workloads from asynchronous ones, isolating simple workloads from more complex workloads, and separating data processing tasks from analytical tasks.

The virtual warehouse model provided by Snowflake achieves workload isolation by enabling multiple compute instances to interact with the same underlying dataset, as shown in Figure 3-4. Workloads for ingestion, transformation, and different tenants operate within separate compute environments, enabling resources to be sized independently. This allows workloads to consume as many or as few resources as required, while also ensuring the different workloads don’t impact the performance of others. Tenants aren’t affected by continuous data ingestion and transformation, and synchronous and asynchronous workloads can be isolated.

Separate virtual warehouses can be provisioned for data processing and analytical workloads to allow these processes to work in parallel without impacting each other’s performance. All processes interact with the same data, with guaranteed consistency provided by Snowflake’s cloud services layer.[3]

Figure 3-4. Workload and tenant isolation in Snowflake

However, with multiple compute instances comes the need to manage access across tenants. Snowflake provides configurable compute privileges, enabling data applications to determine which compute instances customers have access to and the types of operations they are allowed to perform. We’ll discuss this and other security considerations in the next section.

Design Patterns for Security

With security breaches frequently in the news, you should expect customers to have concerns about the security of their data. Security features in a multi-tenant data platform should include guarantees for regulatory and contractual security requirements, as well as managing access to data and compute resources.

Access control

Within a multi-tenant system it is important to have some way of granting and restricting access for different users. Access control broadly refers to the mechanisms systems put in place to achieve this goal. In this section we will talk about two types of access control: role-based (RBAC) and discretionary (DAC).

It is useful to think of access management in terms of relationships between users and objects in the system. In a data application, objects include databases, tables, configuration controls, and compute resources. Relationships between users and objects are set by privileges generally falling into the categories of create, view, modify, and delete. As it is typical for a user to have more than one of these privileges, it is common to group multiple privileges into a role that is then used to control access. This is the RBAC model, shown in Figure 3-5.

Figure 3-5. RBAC example in Snowflake

As depicted in Figure 3-5, Role 1 and Role 2 each encapsulate multiple privileges for interacting with the available assets, and each individual user may be assigned one or both roles.
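The relationship between users, roles, and privileges can be modeled in a few lines. This is a conceptual sketch rather than Snowflake's implementation. The role names echo the figure, but the users, objects, and specific privileges are invented for illustration.

```python
# Each role bundles (object, privilege) pairs; names are hypothetical.
ROLE_PRIVILEGES = {
    "ROLE_1": {("reports", "view"), ("reports", "modify")},
    "ROLE_2": {("reports", "view"), ("warehouse", "view")},
}

# Users hold roles, never privileges directly.
USER_ROLES = {
    "alice": {"ROLE_1"},
    "bob": {"ROLE_1", "ROLE_2"},
}


def is_allowed(user: str, obj: str, privilege: str) -> bool:
    """A user may act if any of their roles carries the privilege."""
    return any(
        (obj, privilege) in ROLE_PRIVILEGES[role]
        for role in USER_ROLES.get(user, set())
    )
```

Changing what a role can do updates every user holding that role, which is the main operational benefit of grouping privileges this way.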

In addition to typical access controls for database objects, Snowflake includes warehouses and other entities in its set of securable objects, or elements that can have access constraints applied to them.[4] This can save data application developers significant overhead when setting up permissions for tenants. The alternative is a patchwork of permissions across different services, such as database grant permissions for relational components and specific cloud-based controls for granting permission to interact with blob storage. Encapsulating these lower-level permissions in this way streamlines permission management and reduces the chances of omissions and mistakes.

In a multi-tenant system where data can be shared, it is helpful to enable data creators to specify who should have access to their objects. This is where DAC comes in, where object owners can grant access to other users at their discretion. Snowflake combines these models, resulting in object owners providing access through roles that are assigned to users.
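The combination of the two models can be sketched as a catalog in which each object has an owning role, and only that owner may grant privileges on the object to other roles. The `Catalog` class and all role and object names are hypothetical, and this toy model omits much of what a real platform enforces.

```python
from __future__ import annotations


class Catalog:
    """Toy catalog: each object has one owning role that controls its grants."""

    def __init__(self) -> None:
        self.owner: dict[str, str] = {}                   # object -> owning role
        self.grants: dict[str, dict[str, set[str]]] = {}  # object -> role -> privileges

    def create(self, obj: str, owner_role: str) -> None:
        self.owner[obj] = owner_role
        self.grants[obj] = {}

    def grant(self, obj: str, privilege: str, to_role: str, granted_by: str) -> None:
        # Discretionary part: only the owner decides who gets access.
        if self.owner.get(obj) != granted_by:
            raise PermissionError(f"{granted_by} does not own {obj}")
        self.grants[obj].setdefault(to_role, set()).add(privilege)

    def can(self, role: str, obj: str, privilege: str) -> bool:
        # Role-based part: access is checked against roles, not individual users.
        if self.owner.get(obj) == role:
            return True
        return privilege in self.grants.get(obj, {}).get(role, set())
```

An owning role can share a report with one tenant role without exposing it to any other, and a non-owner attempting the same grant is rejected with `PermissionError`.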

One aspect to keep in mind when handling permissions is that it’s best to minimize the spread of privileges to a given object across several roles. Limiting access to a given object to a single role reduces overhead if there is a need to modify that permission in the future. A role hierarchy can then be used to create combinations of privileges, as in the example depicted in Figure 3-6.

Figure 3-6. Role hierarchy and privilege inheritance

In the upper-left corner of Figure 3-6 you can see that the application database is organized with two different schemas: a schema of base tables and a schema of secure views. It’s good practice with the multi-tenant table approach to isolate secure views into their own schema, and this is also important for defining the role hierarchy.

You begin by creating the roles associated with the database schema permissions, shown in the bottom row. Functional roles, shown in the middle row, associate a user’s function with the appropriate database permissions. For example, the DEVELOPER_READ role includes the BASE_SCHEMA_READ and VIEW_SCHEMA_READ roles.

Roles can be granted to other roles to create a hierarchy of inherited privileges, as we saw with DEVELOPER_READ. Notice also that BASE_SCHEMA_WRITE inherits BASE_SCHEMA_READ, further simplifying the permissions hierarchy by including read access with write access.

Because tenants should only have access to the secure views, the TENANT_TEMPLATE role is only granted the VIEW_SCHEMA_READ role. The TENANT_ROLE inherits the permissions of TENANT_TEMPLATE and can be assigned to system tenants. With this inheritance, any future changes made to the TENANT_TEMPLATE role will automatically propagate to all tenants.

All functional roles are granted up the privilege chain to the APP_OWNER and finally the SYSADMIN, to ensure the administrator role has access to everything in the system.
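The inheritance described in this walkthrough can be modeled as a graph of role grants. The sketch below uses the role names from the text. Because the figure is not reproduced here, the exact set of roles granted to APP_OWNER is an assumption, as is the `effective_roles` helper.

```python
from __future__ import annotations

# role -> roles it inherits from (roles that have been granted to it)
INHERITS: dict[str, set[str]] = {
    "BASE_SCHEMA_WRITE": {"BASE_SCHEMA_READ"},
    "DEVELOPER_READ": {"BASE_SCHEMA_READ", "VIEW_SCHEMA_READ"},
    "TENANT_TEMPLATE": {"VIEW_SCHEMA_READ"},
    "TENANT_ROLE": {"TENANT_TEMPLATE"},
    # Assumption: these are the roles rolled up to APP_OWNER.
    "APP_OWNER": {"DEVELOPER_READ", "TENANT_TEMPLATE", "BASE_SCHEMA_WRITE"},
    "SYSADMIN": {"APP_OWNER"},
}


def effective_roles(role: str) -> set[str]:
    """Return the role plus every role it inherits, directly or transitively."""
    seen: set[str] = set()
    stack = [role]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(INHERITS.get(current, ()))
    return seen
```

In this model `effective_roles("TENANT_ROLE")` contains only TENANT_ROLE, TENANT_TEMPLATE, and VIEW_SCHEMA_READ, so tenants never reach the base tables. Adding a role to the TENANT_TEMPLATE entry would change the result for every role that inherits from it.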

Auditing

The ability to audit changes in access controls is critical for assessing system vulnerabilities, tracking down the source of corrupt data, and complying with regulations such as the European Union’s General Data Protection Regulation (GDPR). In addition, many industries require auditing capabilities to do business; for example, if you hope to market your data application to healthcare or finance organizations, this is a critical requirement in a data platform. Snowflake meets this need with robust auditing of user permissions, logins, and queries, which can be accessed through logs.

Access and authorization controls

Another important area of security is ensuring the connections between the application tier and the underlying data platform are secure. Considerations in this space include authentication, encryption management, and secure network design.

To guarantee a secure connection between the application and the Snowflake Data Cloud, you can use AWS PrivateLink[5] or Azure Private Link.[6] These services allow you to create direct, secure connections between Snowflake and your application tier without exposure to the public internet. Snowflake allows connections from any device or IP by default, but you can limit access using allow and block lists.
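The logic of allow and block lists can be sketched with Python’s standard `ipaddress` module. This is a conceptual model only. It assumes that the block list takes precedence over the allow list and that an empty allow list means no restriction, so the actual evaluation rules should be confirmed in the Snowflake documentation. The function name and addresses are hypothetical.

```python
from __future__ import annotations

import ipaddress


def is_connection_allowed(ip: str, allowed: list[str], blocked: list[str]) -> bool:
    """Check an IP against CIDR allow and block lists (block list wins)."""
    addr = ipaddress.ip_address(ip)
    if any(addr in ipaddress.ip_network(net) for net in blocked):
        return False
    if not allowed:
        # Assumption: with no allow list configured, any address may connect.
        return True
    return any(addr in ipaddress.ip_network(net) for net in allowed)
```

With `allowed=["192.168.1.0/24"]` and `blocked=["192.168.1.99/32"]`, the address 192.168.1.10 is accepted, 192.168.1.99 is rejected by the block list, and 10.0.0.1 is rejected because it falls outside the allowed range.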

Snowflake provides a variety of options for user authentication, including OAuth, multifactor authentication (MFA), key pair authentication and rotation, and single sign-on (SSO). Having these services built in removes a significant burden from product teams, as providing support for even one of these methods is a substantial engineering undertaking.

As an example, consider a data application using OAuth to generate a secure token to access Snowflake. With OAuth, credentials do not have to be shared or stored, eliminating the need to build secure credential sharing and storage into your data application. Key pair authentication is another option where username/password credentials do not need to be explicitly shared; instead, a private key retrieved from a key vault is used to authenticate.

Summary

Ensuring your approach to storage, compute, and security will meet the demands of the ever-changing data landscape is fundamental to building successful data applications. In this chapter you learned how to take advantage of the virtually infinite storage and compute resources of cloud platforms to create scalable data applications. We discussed different approaches to storage, including the multi-tenant table, object per tenant, and account per tenant models. You also learned about approaches to optimizing compute resources, including using autoscaling to provision resources in response to demand instead of attempting to predict demand. With an understanding of the trade-offs and use cases for each approach, you can make informed design decisions.

Taking advantage of the scalability of the cloud requires an approach to security that will scale as well. With Snowflake, creating a role hierarchy to manage permissions and coupling DAC with RBAC results in robust and flexible access control while keeping permissions management tractable. To control access to Snowflake, a variety of secure user authentication modes are provided, as is support for securely connecting with application tiers on Azure and AWS.
