© Dmitry Anoshin, Dmitry Shirokov, Donna Strok 2020
D. Anoshin et al., Jumpstart Snowflake, https://doi.org/10.1007/978-1-4842-5328-1_13

13. Migrating to Snowflake

Dmitry Anoshin1, Dmitry Shirokov2, and Donna Strok3
(1)
British Columbia, Canada
(2)
Burnaby, BC, Canada
(3)
Seattle, WA, USA
 

Throughout the book you have learned about key concepts of Snowflake, including its architecture and its security capabilities. You have also met some unique Snowflake features. Moreover, you saw how Snowflake can be integrated with third-party tools for ELT/ETL and BI purposes as well as big data and advanced analytics use cases with Spark.

In this chapter, we will highlight some key migration scenarios to give you an idea of how you can migrate your legacy solution to the cloud. In addition, some organizations might try to upgrade an existing cloud solution that isn’t sufficient for a business use case or is very expensive.

Data warehouse modernization is one of the hottest topics right now, and many organizations are looking for best practices to modernize their legacy, expensive, and ineffective solutions using the cloud. Snowflake is a good choice for organizations because it is available on the major cloud platforms, namely, Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP), and it allows you to get instant value by democratizing data across the organization.

In this chapter, we will cover the following topics:
  • Data warehouse migration scenarios

  • Common data architectures

  • Key steps for a data warehouse migration

  • Real-world project

  • Additional resources for Snowflake migration

Data Warehouse Migration Scenarios

The goal of a data warehouse migration is to serve the growing data appetite of end users who are hungry for data insights. Before we dive deep into this topic, let’s categorize the organizations and their data needs. We will split organizations by their analytics maturity, as shown here:
  • Startups and small businesses without a proper analytics solution

  • Organizations with on-premise data solutions

  • Organizations with a default cloud solution deployed on Azure, GCP, or AWS

Startup or Small Business Analytics Scenario

The easiest deployment process is for startup companies. They don’t have any analytics solution yet and usually connect to source systems using a business intelligence (BI) tool or spreadsheets. They are looking for better alternatives; they don’t want to invest in an expensive solution, but they want to be sure that they can start small and scale easily. With Snowflake, they get all of its benefits and pay only for their workloads. Over time, as the business grows, the Snowflake implementation grows with it.

Figure 13-1 shows an example architecture for small companies before and with Snowflake.
../images/482416_1_En_13_Chapter/482416_1_En_13_Fig1_HTML.png
Figure 13-1

Before and with Snowflake for startups

Startups track key metrics, and it is important to get timely insights from data. As a result, analysts connect to the source systems and extract the data. This process is manual and not scalable. The next logical step is to hire a data engineer or an analytics consulting company and deploy a data warehouse with Snowflake. This will allow you to get insights from your data and grow the business.

On-Premise Analytics Scenario for Enterprises and Large Organizations

The second scenario is the biggest and the most popular. There are lots of enterprise organizations that are looking for a way to improve their existing on-premise solutions. These solutions are extremely expensive and require lots of resources to maintain. Moreover, they include lots of custom solutions for big data, streaming, and so on. The complexity of these solutions is extremely high, but the value is limited because on-premise solutions are a bottleneck and are not easy to scale, even with an unlimited budget. So, the best way forward is to migrate the existing on-premise solutions to the cloud and leverage an innovative analytics data platform such as Snowflake. With Snowflake, enterprises can migrate all their data to the cloud, use a single platform for the data warehouse, share data, and make use of machine learning.

Figure 13-2 shows an example architecture before and with Snowflake for enterprises and other large companies.
../images/482416_1_En_13_Chapter/482416_1_En_13_Fig2_HTML.png
Figure 13-2

Before and with Snowflake for enterprises

The figure is a top-level overview of an on-premise organization with big data (a data lake, usually deployed on top of Hadoop) and an on-premise massively parallel processing (MPP) data warehouse such as Oracle, Teradata, or Netezza. Usually, enterprises use enterprise-grade ETL solutions that are expensive and require powerful hardware. There are multiple options for streaming, and one of the most popular is Apache Kafka. Moreover, enterprises handle large volumes of data in semistructured formats such as JSON, Avro, Parquet, and so on. In the example in Figure 13-2, we are uploading JSON into a data lake and then parsing and loading it into a data warehouse. Finally, some organizations have to share data. This isn’t an easy or cheap task for an on-premise solution.

With Snowflake, organizations will migrate all their data into the cloud. Moreover, they will use a single data platform for streaming use cases, storing semistructured data, and querying the data via SQL, without physically moving the data. So, there are lots of benefits that will open new horizons for analytics and help to make business decisions driven by data.

Cloud Analytics Modernization with Snowflake

The last scenario is the trickiest one. Some modern organizations have already leveraged cloud vendors or migrated a legacy solution to the cloud. However, they may be facing challenges such as high cost, performance issues related to concurrency, or having multiple tools for various business scenarios such as streaming and big data analytics. As a result, they decide to try Snowflake and unify their data analytics with a single platform and get almost unlimited scalability and elasticity.

Figure 13-3 shows an example architecture before and with Snowflake for cloud deployments with Microsoft Azure.
../images/482416_1_En_13_Chapter/482416_1_En_13_Fig3_HTML.png
Figure 13-3

Before and with Snowflake for cloud analytics modernization on Microsoft Azure

On the left, we have multiple Azure solutions for the data warehouse and data lake, such as Azure Data Factory and Azure streaming services. On the right, we have Snowflake hosted on Azure, and we have leveraged a cloud ELT tool, Matillion ETL, which allows us to create complex transformations visually. However, we can still use Azure Data Factory for ELT. Finally, with this new architecture, you can leverage the data sharing capabilities without physically moving the data.

Data Warehouse Migration Process

We just reviewed three common scenarios for Snowflake migrations. We will dive deep into the second scenario because it is one of the most popular and complex. The first scenario isn’t a real migration scenario; it is more of a DW design and implementation project. The third scenario is an evolution of the second; it has a similar idea, and it is usually easier to perform since all the data is already in the cloud.

When we talk about DW migration, there are two major approaches.
  • Lift and shift: Just copy the data as is with limited changes.

  • Split and flip: Split a solution into logical functional data layers. Match the data functionality with the right technology. Leverage the wide selection of tools on the cloud to best fit the needs. Move data in phases such as prototype, learn, and perfect.

Although “lift and shift” is the faster approach, it has limited value for long-term organizational goals. As a result, we always prefer “split and flip.” This helps ensure that we don’t sacrifice long-term value for short-term gains.

We can split the migration process into two main buckets.
  • The organizational part of the migration project

  • The technical part of the migration project

Let’s review them in detail.

Organizational Part of the Migration Project

Figure 13-4 shows a high-level overview of the steps needed to prepare and execute the migration of an existing system to Snowflake.
../images/482416_1_En_13_Chapter/482416_1_En_13_Fig4_HTML.jpg
Figure 13-4

Key steps of migration process

Let’s learn more about each of the migration steps recommended by Snowflake.

Document the Existing Solution

You already know that Snowflake uses role-based access control; therefore, we have to document the existing users, their roles, and their permissions. This allows you to replicate the data access and security strategy implemented in your legacy system. You should pay special attention to sensitive data sets and how they’re secured, as well as how frequently security provisioning processes run, in order to create similar security within Snowflake. Finally, you should ensure that you have an up-to-date architecture diagram of the existing solution.
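
For example, if the legacy warehouse happens to be Oracle, a minimal sketch of capturing users, roles, and object privileges from the standard data dictionary views might look like the following (adapt the queries to whatever platform you are actually migrating from):

    -- Which roles are granted to which users (and to other roles)
    SELECT grantee, granted_role
    FROM dba_role_privs
    ORDER BY grantee;

    -- Which object privileges (SELECT, INSERT, ...) are granted, and on what
    SELECT grantee, owner, table_name, privilege
    FROM dba_tab_privs
    ORDER BY grantee, owner, table_name;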

Establish a Migration Approach

Then, you should establish a migration approach. You should list all the existing processes that you want to migrate. Moreover, you should identify all the processes that have to be refactored as well as the broken processes that need to be fixed. This will allow you to draft the deliverables and create a data architecture diagram to present to the stakeholders.

Snowflake generally recommends minimal re-engineering for the first iteration unless the current system is truly outdated. To provide value to the business as soon as possible, you should avoid a single “big-bang” deliverable as the migration approach and instead break the migration into incremental deliverables that enable your organization to start making the transition to Snowflake more quickly. This process is called agile data warehousing and allows you to deliver value quickly to end users.

Moreover, organizations may want to change their development or deployment processes as part of the migration. You should document new tools that will be introduced as a result of the migration, tools that will need to be deprecated, and the development environments that are needed. Whether the development and deployment processes change or not, you should capture the environments that will be used for the migration.

Capture the Development and Deployment Processes

Modern organizations care about DevOps. If you haven’t widely used it before, this could be a good opportunity to start implementing DevOps/DataOps procedures that will increase the quality of your analytics solution.

For example, usually organizations have the following environments:
  • Dev

  • QA

  • Prod

Moreover, they have source control repositories and methods for capturing deployment changes from one environment to another. These will be used for the migration. This information is critical to direct how the development and deployments are implemented.

The ideal candidates for starting the migration provide value to the business and require a minimal migration effort.

Prioritize Data Sets for Migration

You should learn more about the available data sets in the legacy solution. Rather than starting with the most complex data sets, we prefer to begin with a simple data set that can be migrated quickly to establish a foundation, through the development and deployment processes, that can be reused for the rest of the migration effort. To prioritize data sets for migration, you should understand the dependencies among data sets. Those dependencies need to be documented and confirmed with business stakeholders. Ideally, this documentation can be captured using an automated process that collects information from existing ETL jobs, job schedules, and so on. This will help you avoid manual work for identifying and documenting changes.

Creating an automated process provides value throughout the migration project by more easily identifying ongoing changes, since the underlying systems are unlikely to be static during the migration.

Identify the Migration Team

Another important task is to build the migration team. Some common roles needed for the migration are developer, quality assurance, business owner, project manager, program manager, scrum master, and communications lead. When a Snowflake solution partner is engaged for the migration, they may fulfill multiple needs, including solution design, gathering requirements, delivering the migration project, producing documentation, and conducting Snowflake training.

Based on our experience, the challenge is to change the paradigm from a traditional DW to a cloud DW. Engineers should be ready to learn new skills, and they may take additional professional courses related to cloud foundations and Snowflake best practices.

Define the Migration Deadlines and Budget

The expectations for any migration should be clear to all parties. But the expectations need to be combined with other information that has been gathered to determine whether the deadlines can be met. One of the benefits of gathering all of this information is to establish and communicate achievable deadlines, even if the deadlines are different from what the business expects.

It is common in migration projects that deadlines are defined before evaluating the scope of the project to determine whether the deadlines are achievable, especially if the business is trying to deprecate the legacy system before the renewal date. In situations where the deadline can’t be moved and the migration scope requires more time than is available before the deadline, work needs to be done with the business to agree on a path forward.

Understanding the budget that has been allocated to complete the migration is also critically important. The amount of migration work and the cost associated with it both need to be compared to the available budget to ensure that there are sufficient funds to complete the work. Pausing in the middle of a migration or stopping it altogether is a bad outcome for all involved parties.

When we are planning the budget, we should estimate the cost of Snowflake deployment and the cost of the migration project.

Determine the Migration Outcomes

Migration outcomes should be used to validate that the migration project is providing the overall benefit the business expects to achieve from the migration. For example, turning off the Oracle database system is one of the desired outcomes. That outcome should be achieved with the migration plan. This documentation can be expressed as success or failure criteria for the migration project and may also include benchmarks that compare process execution. Once compiled, this information should be used for communicating with stakeholders.

After identifying the migration outcomes, you should present them to the business along with the mitigation strategy and confirm the proposed approach will meet their requirements. This should be done to set appropriate expectations from the beginning of the migration.

The escalation process needs to be documented, including who is responsible for working on the issue, who is responsible for communicating the progress of the issue, and a list of contacts from the business, Snowflake, and any other parties involved in resolving the issue.

Establish Security

Depending on the security requirements, there may be a need to capture role creation, user creation, and the granting of roles to users for auditing purposes. While the existing database security can be a good starting point for setting up security within Snowflake, the security model should be evaluated to determine whether there are roles and users that are no longer needed or should be implemented differently as part of the migration to Snowflake. Additional roles may be required for restricting access to sensitive data. Moreover, you can think about improving the solution’s security by implementing multifactor authentication, collecting security logs, and so on.
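
Because security in Snowflake is expressed entirely in SQL, it is easy to script and audit. Here is a minimal sketch, assuming hypothetical role, user, and database names:

    -- Re-create the required role and user
    CREATE ROLE analyst_role;
    CREATE USER jane_doe PASSWORD = '********' DEFAULT_ROLE = analyst_role MUST_CHANGE_PASSWORD = TRUE;
    GRANT ROLE analyst_role TO USER jane_doe;

    -- Grant access to a database and its objects
    GRANT USAGE ON DATABASE sales TO ROLE analyst_role;
    GRANT USAGE ON SCHEMA sales.public TO ROLE analyst_role;
    GRANT SELECT ON ALL TABLES IN SCHEMA sales.public TO ROLE analyst_role;

    -- Capture the resulting grants for auditing purposes
    SHOW GRANTS TO ROLE analyst_role;
    SHOW GRANTS TO USER jane_doe;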

Develop a Test Plan

Develop a test plan by identifying the appropriate level and scope of testing for each environment. For example, schedules aren’t executed in dev, only in QA and prod. Automate as much as possible to ensure repeatable test processes with consistent output for validation purposes, and document agreed-on acceptance criteria.

Moreover, you should involve business users in this process; they are subject-matter experts and will help you evaluate the solution and quickly identify data discrepancies and broken processes.

Prepare Snowflake for Loading

Although Snowflake is a SQL data warehouse, it is different from other analytical DW platforms.

When you have physical servers, you can use a dedicated server for each environment (dev, test, prod). The hierarchy for the on-premise solution looks like this:
  • Physical server
    • Databases
      • Schemas
        • Tables/views/functions

In the case of Snowflake, you don’t have a physical machine. When you sign up for Snowflake, you get a link such as https://<your company name>.snowflakecomputing.com/, and you work within this single account. As a result, you don’t have a physical server layer, and you should think about how to organize your environments. To solve this particular issue, you have several options.
  • Use multiple accounts (different URLs).

  • Create many databases with an environment suffix (FIN_DEV, SALES_DEV, FIN_TEST, and so on).

  • Create databases that represent your environments and then create a schema for each legacy database (see the sketch after this list).
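
As a minimal sketch (database and schema names are hypothetical), the last two options could look like this:

    -- Option 2: one database per environment and subject area
    CREATE DATABASE fin_dev;
    CREATE DATABASE fin_test;
    CREATE DATABASE fin_prod;

    -- Option 3: one database per environment, one schema per legacy database
    CREATE DATABASE dev;
    CREATE SCHEMA dev.fin;
    CREATE SCHEMA dev.sales;
    CREATE DATABASE prod;
    CREATE SCHEMA prod.fin;
    CREATE SCHEMA prod.sales;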

This will require you to modify DDL while you are moving the schema from the on-premise solution to the cloud. This is one of the biggest engineering efforts in a migration. There are a number of tools available for this purpose that can do forward and reverse engineering. Moreover, you can leverage the Snowflake community and learn how others performed this step.

Finally, you should assign databases, database objects, and virtual warehouses to the appropriate security roles.

When you are ready, you can start to make an initial load into your data warehouse. Many options are available for loading. For example, you can unload data into cloud storage, such as S3 in the case of AWS, and then load this data into Snowflake. Or you can leverage cloud ETL tools like AWS Glue or Matillion ETL (a third-party commercial product). You can even use open source solutions like Apache Airflow or plain Python.
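
For instance, if the data has been unloaded to S3 as CSV files, a minimal sketch of the initial bulk load could look like the following (the bucket, credentials, and table names are hypothetical):

    -- Describe the unloaded files and point Snowflake at the bucket
    CREATE FILE FORMAT legacy_csv TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"';
    CREATE STAGE legacy_stage
      URL = 's3://my-migration-bucket/export/'
      CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
      FILE_FORMAT = (FORMAT_NAME = 'legacy_csv');

    -- Initial (bulk) load of one table
    COPY INTO dwh.staging.customers
    FROM @legacy_stage/customers/
    PATTERN = '.*[.]csv';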

Keep Data Up-to-Date (Executing the Migration)

After an initial load of data is complete, you should start to develop incremental load processes. This is the time when ETL/ELT tools are handy and help you to accelerate your development effort.
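
A common building block for incremental loads is a MERGE statement that upserts a delta extract into the target table. Here is a minimal sketch with hypothetical table and column names:

    MERGE INTO dwh.analytics.orders AS t
    USING dwh.staging.orders_delta AS s
      ON t.order_id = s.order_id
    WHEN MATCHED THEN
      UPDATE SET t.status = s.status, t.updated_at = s.updated_at
    WHEN NOT MATCHED THEN
      INSERT (order_id, status, updated_at)
      VALUES (s.order_id, s.status, s.updated_at);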

These processes should be scheduled and take the appropriate process dependencies into consideration. The state of the data loading should be clearly understood and communicated: for example, whether a load is in progress, has completed successfully, or has failed and needs to be addressed. Finally, begin comparing execution timings to ensure that SLAs are being met.

One of the key things is to constantly communicate with business users and allow them to visually track the load process. You can ensure this by collecting ETL logs at all stages of the ETL process and visualizing them with a BI tool.
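
As a minimal sketch (the audit table is hypothetical), each ETL job can write one row per load into a small log table, which a BI dashboard then queries:

    CREATE TABLE IF NOT EXISTS etl_audit.load_log (
      job_name    STRING,
      status      STRING,        -- RUNNING, SUCCEEDED, or FAILED
      rows_loaded NUMBER,
      started_at  TIMESTAMP_NTZ,
      finished_at TIMESTAMP_NTZ
    );

    -- Feed a simple load-status dashboard for business users
    SELECT job_name, status, rows_loaded, started_at, finished_at
    FROM etl_audit.load_log
    WHERE started_at >= DATEADD(day, -1, CURRENT_TIMESTAMP())
    ORDER BY started_at DESC;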

Implement the Test Plan (Executing the Migration)

Once an ETL/ELT process is in place, testing can begin. You can start with initial data comparisons. This allows you to quickly identify discrepancies and share the results with stakeholders. Additional groups should be engaged after the initial testing is completed. This helps to validate the data and fix issues within the new solution.

Run Systems in Parallel (Executing the Migration)

As business units are engaged in testing, you should run both systems (the legacy DW and the Snowflake DW) in parallel to ensure continued validation of the data. In some cases, you may export data from the legacy DW so it can be compared at the raw level. These comparisons should take place in Snowflake, where resources can be provisioned to compare data without negatively impacting either system.
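
A minimal sketch of such a comparison (warehouse, schema, and table names are hypothetical) runs on a dedicated virtual warehouse so it does not compete with other workloads:

    CREATE WAREHOUSE IF NOT EXISTS compare_wh
      WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;
    USE WAREHOUSE compare_wh;

    -- Row counts side by side
    SELECT 'legacy' AS source, COUNT(*) AS row_count FROM legacy_export.public.orders
    UNION ALL
    SELECT 'snowflake', COUNT(*) FROM dwh.analytics.orders;

    -- Rows present in the legacy export but missing (or different) in the new DW
    SELECT * FROM legacy_export.public.orders
    MINUS
    SELECT * FROM dwh.analytics.orders;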

You should attempt to minimize the time the two systems are running in parallel while still maintaining a sufficient validation process.

Repoint Tools to Snowflake

Up until now, the migration process has been focused on raw data comparisons. The final step is to point all business users’ connections to the new Snowflake DW. After the business units have validated that their tools are producing the required results, they cut over to Snowflake, begin scheduling, and communicate the cutover plans to all stakeholders.

Once the cutover is complete, users should be able to log into their business intelligence tools and repoint them to the Snowflake data warehouse.

Technical Aspects of a Migration Project

Figure 13-5 shows the key elements of a migration project from a technical point of view for a traditional on-premise data warehouse.
../images/482416_1_En_13_Chapter/482416_1_En_13_Fig5_HTML.jpg
Figure 13-5

Simplified DW migration flow

Let’s consider an example where we have an on-premise DW that we have decided to move to Snowflake. We should start with the DDL for moving the schemas, tables, views, and so on. There are many ways to replicate a data warehouse model in Snowflake, ranging from Python scripts that convert the source system’s DDL into Snowflake DDL to data modeling tools like SqlDBM that integrate well with Snowflake and can copy the source system DDL, convert it to Snowflake DDL, and deploy it into Snowflake. Moreover, we can use other tools that support forward and reverse engineering. This helps us automate the process and save time and money.
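
To illustrate the kind of translation involved, here is a simplified, hypothetical example of an Oracle table definition and its Snowflake equivalent; note that indexes, tablespaces, and other physical storage options simply disappear:

    -- Oracle (legacy):
    --   CREATE TABLE orders (
    --     order_id   NUMBER(10)   NOT NULL,
    --     status     VARCHAR2(20),
    --     created_at DATE
    --   );

    -- Snowflake equivalent:
    CREATE TABLE orders (
      order_id   NUMBER(10,0)  NOT NULL,
      status     VARCHAR(20),
      created_at TIMESTAMP_NTZ
    );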

After the DDL, we should move the data. There are many approaches to doing this. We can leverage cloud ETL tool capabilities and migrate data from the on-premise solution to Snowflake. For example, Matillion ETL can connect to the on-premise DW and load data into Snowflake using cloud storage such as S3, Blob Storage, and so on. This is an efficient way of moving data. Or, you could leverage Snowflake’s SnowSQL CLI and load data with the help of SQL. It is totally up to you. In some extreme cases involving a large volume of data, we might use physical devices such as AWS Snowball or Azure Data Box.
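
For example, a minimal SnowSQL sketch (stage, file paths, and table names are hypothetical) uploads local extracts to an internal stage with PUT and then loads them with COPY INTO:

    -- Run from the SnowSQL CLI
    CREATE STAGE dwh.staging.legacy_files;
    PUT file:///exports/orders_*.csv @dwh.staging.legacy_files AUTO_COMPRESS = TRUE;

    COPY INTO dwh.staging.orders
    FROM @dwh.staging.legacy_files
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    PATTERN = '.*orders.*[.]csv[.]gz';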

Finally, the most complicated part is migrating the ETL/ELT logic. This is the longest part, and there is a linear correlation between the number of DW objects and the time it takes to perform a migration. This is when we want to decide whether to migrate the existing logic as is (lift and shift) or to work closely with the business stakeholders and learn about the business logic behind the code so we can take it apart and improve it (split and flip).

From a tools standpoint, we can leverage scripting in Python, or we can leverage Snowflake Partner Connect and choose an ETL tool that was built specifically for the Snowflake DW. Some tools are managed services, and others give you more freedom. For example, Matillion provides a virtual machine that is hosted in your virtual private cloud (VPC), so you can establish a mature security posture. Moreover, when we are using ETL tools, we can create a pattern and then copy this pattern across use cases. The tools also allow end users to follow the process and visually observe the data flow. Finally, Snowflake supports stored procedures, which gives you the ability to implement an ETL solution with stored procedures as was previously done in Oracle, Teradata, or SQL Server.
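
At the time of writing, Snowflake stored procedures are written in JavaScript. Here is a minimal sketch (table names are hypothetical) that wraps a single transformation step so it can be scheduled and reused:

    CREATE OR REPLACE PROCEDURE dwh.analytics.load_daily_sales()
    RETURNS STRING
    LANGUAGE JAVASCRIPT
    AS
    $$
      // Rebuild a reporting table from the staging layer
      var stmt = snowflake.createStatement({
        sqlText: "INSERT OVERWRITE INTO dwh.analytics.daily_sales " +
                 "SELECT order_date, SUM(amount) FROM dwh.staging.orders GROUP BY order_date"
      });
      stmt.execute();
      return 'daily_sales refreshed';
    $$;

    CALL dwh.analytics.load_daily_sales();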

Real-World Migration Project

Let’s look at a real-world project. Figure 13-6 shows an architecture diagram for an e-commerce company that is selling used books online.
../images/482416_1_En_13_Chapter/482416_1_En_13_Fig6_HTML.jpg
Figure 13-6

Legacy DW architecture

It was a straightforward solution built on the Oracle database technology stack. It used PL/SQL as its main ETL tool, and a daily ETL process loaded data from several transactional systems as well as consuming data from marketing-specific APIs and SFTP servers. These were the challenges:
  • The solution was expensive from a license perspective.

  • ETL was complicated, and the database team owned the logic. They were a bottleneck for all new requests.

  • The DW had storage and compute limitations.

  • The DW required full-time DBA support (for patching, backups, and so on).

  • Performance was an issue and required deep knowledge of Oracle sizing and tuning (indexes, keys, partitions, query plans, and so on).

The company decided to move to the cloud to get more room to grow and to get the benefits of a cloud infrastructure. Figure 13-7 shows an architecture diagram of the new solution. This organization decided to go with Snowflake, because it wanted to have unlimited concurrency for queries, a consolidated DW, and a big data solution on a single data platform, as well as dedicated virtual warehouses for analysts with heavy queries.
../images/482416_1_En_13_Chapter/482416_1_En_13_Fig7_HTML.jpg
Figure 13-7

Modern DW architecture with Snowflake

Another major decision was made regarding the ETL tool. We reviewed several tools and decided to go with Matillion ETL because it was built specifically for Snowflake and allows us to solve the previous challenges with bottlenecks in the ETL process. It has an intuitive user interface and doesn’t require any coding knowledge. In addition, the organization deployed Tableau as its main BI tool and settled on self-service analytics, which is why Snowflake’s concurrency is a great benefit. Moreover, Matillion addressed another security requirement because it deploys within a private subnet in the AWS VPC.

Snowflake helped us to leverage big data and streaming capabilities that were impossible with the legacy solution. For big data, we processed web logs with Apache Spark deployed on top of an EMR cluster. Snowflake accesses the resulting Parquet files directly, so we don’t need to load them into Snowflake. For the streaming use case, we leveraged DynamoDB streams and Kinesis Firehose, and all data is sent into an S3 bucket where Snowflake can consume it.
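
As a minimal sketch (the bucket and column names are hypothetical), Snowflake can query the staged Parquet files in place through an external stage:

    CREATE STAGE weblogs_stage
      URL = 's3://company-weblogs/parquet/'
      CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
      FILE_FORMAT = (TYPE = PARQUET);

    -- Query Parquet directly without loading it into Snowflake
    SELECT $1:page::STRING AS page, $1:user_id::STRING AS user_id
    FROM @weblogs_stage
    LIMIT 100;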

This core project of migrating the Oracle DW and ETL took us six months with a team of two engineers; it took another three to four months to design and implement the streaming and big data solutions. The organization also leveraged the AWS SageMaker service for machine learning and advanced analytics, which can easily connect to Snowflake to query data and write model results back to Snowflake.

Additional Resources

Working with Snowflake requires you to have a new set of skills related to cloud computing. If you want to succeed with Snowflake, you should learn the best practices of deploying cloud analytics solutions and follow the market trends by reviewing new tools and methods for data processing and transformation in the cloud.

Currently Snowflake is available on AWS, Microsoft Azure, and Google Cloud Platform. We highly recommend you study a cloud vendor’s learning materials in order to get a better understanding of cloud computing and data storage. For example, if you deploy Snowflake on AWS, you may start with the AWS Technical Essentials course, which is free and gives you an overview of AWS. Then you can go deeper into AWS analytics with a big data specialization.

At the same time, you should learn Snowflake best practices using Snowflake training resources, community web sites, and blog posts. This book is a good start.

Summary

This chapter talked about the needs of organizations depending on their maturity model and identified three common organization types. Then you learned about the legacy DW modernization process and identified the key steps.

Finally, we looked at a real-world project of migrating to Snowflake and learned about its data architecture and project outcome.
