© Dmitry Anoshin, Dmitry Shirokov, Donna Strok 2020
D. Anoshin et al., Jumpstart Snowflake, https://doi.org/10.1007/978-1-4842-5328-1_1

1. Getting Started with Cloud Analytics

Dmitry Anoshin (British Columbia, Canada), Dmitry Shirokov (Burnaby, BC, Canada), and Donna Strok (Seattle, WA, USA)

“Don’t shoot for the middle. Dare to think big. Disrupt. Revolutionize. Don’t be afraid to form a sweeping dream that inspires, not only others, but yourself as well. Incremental innovation will not lead to real change—it only improves something slightly. Look for breakthrough innovations, change that will make a difference.”—Leonard Brody and David Raffa

Cloud technologies can change the way organizations do analytics. The cloud allows organizations to move fast and use best-of-breed technologies. Traditionally, data warehouse (DW) and business intelligence (BI) projects were considered a serious investment and took years to build. They required a solid team of BI, DW, and data integration (ETL) developers and architects. Moreover, they required big investments, IT support, and resources and hardware purchases. Even if you had the team, budget, and hardware in place, there was still a chance you would fail.

The cloud computing concept isn’t new, but only recently has it started to be widely used for analytics use cases. The cloud creates access to near-infinite, low-cost storage; provides scalable computing; and gives you tools for building secure solutions. Finally, you pay only for what you use.

In this chapter, we will cover the analytics market trends over the last decade and the DW evolution. In addition, we will cover key cloud concepts. Finally, you will meet the Snowflake DW and learn about its unique architecture.

Time to Innovate

As data professionals, we have worked on many data warehouse projects. We have designed and implemented numerous enterprise data warehouse solutions across various industries. Some projects we built from scratch, and others we fixed. Moreover, we have migrated systems from “legacy” to modern massively parallel processing (MPP) platforms and leveraged extract-load-transform (ELT) to let the MPP DW platform do the heavy lifting.

MPP is one of the core principles of analytics data warehousing, and it is still valid today. It is also good to know about the alternative that existed before MPP was introduced: symmetric multiprocessing (SMP). Figure 1-1 shows a simple example that helps us understand the difference between SMP and MPP.
Figure 1-1. SMP vs. MPP

Let’s look at a simple example. Imagine you have to do laundry. You have two options.
  • Miss a party on Friday night but visit the laundromat where you can run all your laundry loads in parallel because everyone else is at the party. (This is MPP.)

  • Visit the laundromat on Saturday and use just one washing machine. (This is SMP.)

It is obvious that running six washing machines at the same time is faster than running one at a time. It is this linear scalability of MPP systems that allows us to accomplish our task faster. Table 1-1 compares the SMP and MPP systems. If you work with a DW, you are probably aware of these concepts. Snowflake innovates in this area and actually combines SMP and MPP.
Table 1-1. MPP vs. SMP

Massively parallel processing (MPP): The coordinated processing of a single task by multiple processors, with each processor using its own OS and memory and communicating with the others through a messaging interface. MPP is usually a share-nothing architecture.

Symmetric multiprocessing (SMP): A tightly coupled multiprocessor system in which the processors share resources such as a single instance of the operating system (OS), memory, I/O devices, and a common bus. SMP is a shared-disk architecture.
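To make the share-nothing idea a bit more concrete, here is a minimal Python sketch that splits one job into independent partitions and processes them in parallel, the way the six washing machines do in the laundry analogy. This is only a toy illustration of the MPP principle, not how a real MPP engine is implemented; the six-load setup and one-second "wash time" are assumptions carried over from the example.

```python
# Toy illustration of the MPP idea: split one task into independent
# partitions (share-nothing) and process them in parallel, versus
# processing them one after another on a single worker (the SMP-style queue).
# This is an analogy only, not how a real MPP database works internally.
from multiprocessing import Pool
import time

def wash_load(load_id: int) -> str:
    time.sleep(1)  # pretend each laundry load takes one unit of work
    return f"load {load_id} done"

if __name__ == "__main__":
    loads = list(range(6))

    # One washing machine: loads are processed one after another (~6 units).
    start = time.time()
    serial = [wash_load(l) for l in loads]
    print(f"serial:   {time.time() - start:.1f}s")

    # Six washing machines: each load gets its own worker (~1 unit).
    start = time.time()
    with Pool(processes=6) as pool:
        parallel = pool.map(wash_load, loads)
    print(f"parallel: {time.time() - start:.1f}s")
```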

In our past work, Oracle was popular across enterprise organizations. All the DW solutions had one thing in common: they were extremely expensive and required the purchase of hardware. For consulting companies, the hardware drove revenue; you could have an unprofitable consulting project, but a hardware deal would cover the yearly bonus.

Later, we saw the rise of Hadoop and big data. The Internet was full of news about replacing traditional DWs with Hadoop ecosystems. It was a good time for Java developers, who could enjoy writing MapReduce jobs until the community released SQL tools such as Hive, Presto, and so on. Instead of learning Java, we personally applied the Pareto principle: solving 20 percent of the tasks with traditional DW platforms and SQL brought 80 percent of the value. (In reality, we think it was more like 80 percent of the cases producing 95 percent of the value.)

Later, we saw the rise of data science and machine learning, and developers started to learn R and Python. But we found we still needed ELT/ETL and a DW in place; otherwise, those local R/Python scripts had little value. It was relatively easy to get a sample data set and build a model using data mining techniques. However, it was a challenge to automate and scale this process because of a lack of computing power.

Then came data lakes. It was clear that a DW couldn’t hold all the data, and storing everything in a DW was expensive. If you aren’t familiar with data lakes, see https://medium.com/rock-your-data/getting-started-with-data-lake-4bb13643f9.

Again, some parties argued that data lakes were the new DWs and that everyone should immediately migrate their traditional solutions to data lakes using the Hadoop technology stack. Based on our experience with BI and business users, we personally didn’t believe that data lakes could replace traditional SQL DWs. However, a data lake could complement an existing DW solution when there was a big volume of unstructured data and we didn’t want to use the existing DW for it because it lacked the computing power and storage capacity. Tools such as Hive, Presto, and Impala gave us SQL access to big data storage and let us use data lake data with traditional BI solutions. This path was expensive, but it could work for big companies with resources and strong IT teams.

In 2013, we heard about DWs in the cloud, namely, Amazon Redshift. We didn’t see much difference between the cloud-based Amazon Redshift and on-premise Teradata, but it was obvious that we could get the same results without buying an extremely expensive appliance. Even at that time, we noticed one benefit of Redshift: it was built on top of the existing open source database Postgres. This meant we didn’t really need to learn anything new. We knew the MPP concept from Teradata, and we knew Postgres, so we could start to use Redshift immediately. It was definitely a breath of fresh air in a world of big dinosaurs like Oracle and Teradata.

It should be obvious to you that Amazon Redshift wasn’t a disruptive innovation. It was an incremental innovation that built on a foundation already in place. In other words, it was an improvement to the existing technology or system. That is the core difference between Snowflake and other cloud DW platforms.

Amazon Redshift became quite popular, and other companies introduced their own cloud DW platforms. Nowadays, all big market vendors are building DW solutions for the cloud.

In contrast to these incremental improvements, Snowflake was a disruptive innovation. The founders of Snowflake collected all the pain points of the existing DW platforms and came up with a new architecture and a new product that addresses modern data needs and allows organizations to move fast with limited budgets and small teams. If you are interested in a market overview of DW solutions, refer to Gartner’s Magic Quadrant for Data Management Solutions for Analytics, shown in Figure 1-2.
Figure 1-2. Gartner’s Magic Quadrant for Data Management Solutions for Analytics. Source: Smoot, Rob; “Snowflake Recognized as a Leader by Gartner: Third Consecutive Year Positioned in the Magic Quadrant Report,” Jan 23, 2019, https://www.snowflake.com/blog/snowflake-recognized-as-a-leader-by-gartner-third-consecutive-year-positioned-in-the-magic-quadrant-report/

Everyone has their own journey. Some of us worked with big data technologies like Hadoop; others spent time with traditional DW and BI solutions. But we all share the common goal of helping our organizations become truly data-driven. With the rise of cloud computing, we have many new opportunities to do our jobs better and faster. Moreover, cloud computing has opened new ways of doing analytics. Snowflake was founded in 2012, came out of stealth mode in October 2014, and became generally available in June 2015. Snowflake brought innovation to the data warehouse world, and it marks a new era of data warehousing.

Key Cloud Computing Concepts

Before jumping into Snowflake, we’ll cover key cloud fundamentals to help you better understand the value of the cloud platform.

Basically, cloud computing is a remote virtual pool of on-demand shared resources offering compute, storage, database, and network services that can be rapidly deployed at scale. Figure 1-3 shows the key elements of cloud computing.
Figure 1-3. Key terms of cloud computing

Table 1-2 defines the key terms of cloud computing. These are the building blocks for a cloud analytics solution as well as the Snowflake DW.
Table 1-2. Key Terms of Cloud Computing

Compute: The “brain” that processes our workloads. It provides the CPUs and RAM to run processes and, in our case, process data.

Databases: Traditional SQL or NoSQL databases that we can leverage for our applications and analytics solutions to store structured data.

Storage: Allows us to save and store data in raw format as files. These could be traditional text files, images, or audio. Any resource in the cloud that can store data is a storage resource.

Network: Provides the resources for connectivity between cloud services and their consumers.

ML/AI: Provides special types of resources for heavy computations and analytics workloads.

It is important to mention hypervisors as a core element of cloud computing. Figure 1-4 shows a physical host with multiple virtual machines (VMs) and a hypervisor, which creates the virtualized environment that allows multiple VMs to run on a single physical host.
Figure 1-4. Role of the hypervisor

Virtualization gives us the following benefits:
  • Reduces capital expenditure

  • Reduces operating costs

  • Provides a smaller footprint

  • Provides optimization for resources

There are three cloud deployment models, as shown in Figure 1-5.
Figure 1-5. Cloud deployment models

The model you choose depends on your organization’s data handling policies and security requirements. For example, government and health organizations that handle a lot of critical customer information often prefer to keep their data in a private cloud. Table 1-3 defines the cloud deployment models.
Table 1-3. Cloud Deployment Models

Public cloud: The service provider opens up its cloud infrastructure for many organizations to use. The infrastructure resides on the service provider’s premises (its data centers), and organizations pay only for the resources they use.

Private cloud: The cloud infrastructure is owned and used solely by a particular institution, organization, or enterprise.

Hybrid cloud: A mix of public and private clouds.

In most cases, we prefer to go with a public cloud. AWS, Azure, and GCP all are public clouds, and you can start building solutions and applications immediately.

It is also good to know about the cloud service models (as opposed to on-premise solutions). Figure 1-6 shows the three main service models with an easy analogy, “pizza as a service.”
Figure 1-6. Cloud service models, pizza as a service

One example of IaaS is a cloud virtual machine, such as Amazon EC2. Amazon Elastic MapReduce (i.e., managed Hadoop) is an example of PaaS, and Amazon DynamoDB (the AWS NoSQL database) is an example of SaaS because it is completely managed for you.

Note

In a cloud software distribution model, SaaS is the most comprehensive service, and it abstracts much of the underlying hardware and software maintenance from the end user. It is characterized by a seamless, web-based experience, with as little management and optimization as possible required of the end user. The IaaS and PaaS models, comparatively, often require significantly more management of the underlying hardware or software.

Snowflake follows the SaaS model and is also known as a data warehouse as a service (DWaaS). Everything, from the database storage infrastructure to the compute resources used for analysis and the optimization of data within the database, is handled by Snowflake.

A final aspect of cloud computing theory is the shared responsibility model (SRM). Figure 1-7 shows the key elements of the SRM.
Figure 1-7. Cloud vendors’ shared responsibility model

SRM has many attributes, but the main idea is that the cloud vendor is responsible for the security of the cloud, and the customers are responsible for the security in the cloud. This means that the clients should define their security strategies and leverage best practices for their data in order to keep it secure.

When we talk about the cloud, you should know that cloud resources are hosted in data centers and there is a concept of a region. You can find information about Snowflake availability for the different cloud vendors and regions at https://docs.snowflake.net/manuals/user-guide/intro-regions.html.

Before moving to the next section, refer to Figure 1-8, which shows how long data takes to upload to the cloud; this reference comes from a Google Cloud Platform presentation.
Figure 1-8. Modern bandwidth

The table in Figure 1-8 is a useful reference when migrating a DW from an on-premise solution to the cloud. You will learn more about DW migration and modernization in Chapter 14.

Meet Snowflake

Snowflake is the first data warehouse that was built for the cloud from the ground up, and it is a first-in-class data warehouse as a service. Snowflake runs on the most popular cloud providers such as Amazon Web Services and Microsoft Azure. Moreover, Snowflake has announced availability on Google Cloud Platform. As a result, we can deploy the DW platform on any of the major cloud vendors. Snowflake is faster and easier to use and far more flexible than a traditional DW. It handles all aspects of authentication, configurations, resource management, data protection, availability, and optimization.

It is easy to get started with Snowflake. You just need to choose the right edition of Snowflake and sign up. You can start with a free trial and learn about the key features of Snowflake or compare it with other DW platforms at https://trial.snowflake.com. You can immediately load your data and get insights. All the components of Snowflake services run in a public cloud infrastructure.
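As a rough sketch of what that first step can look like from code, the snippet below connects to a trial account with the official Snowflake Connector for Python (pip install snowflake-connector-python) and runs a sanity-check query. The account identifier, user, and password are placeholders you would replace with the values from your own trial account.

```python
# Minimal sketch: connect to a Snowflake trial account and run a first query.
# The connection parameters below are placeholders, not real credentials.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345.us-east-1",   # placeholder account identifier
    user="YOUR_USER",              # placeholder user
    password="YOUR_PASSWORD",      # placeholder password
)

try:
    cur = conn.cursor()
    # A simple sanity check: ask Snowflake which version is running.
    cur.execute("SELECT CURRENT_VERSION()")
    print(cur.fetchone()[0])
finally:
    conn.close()
```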

Note

Snowflake cannot be run on private cloud infrastructures (on-premises or hosted). It is not a packaged software offering that can be installed by a user. Snowflake manages all aspects of software installation and updates.

Snowflake was built from the ground up and designed to handle modern big data and analytics challenges. It combines the benefits of both SMP and MPP architectures and takes full advantage of the cloud. Figure 1-9 shows the architecture of Snowflake.
../images/482416_1_En_1_Chapter/482416_1_En_1_Fig9_HTML.jpg
Figure 1-9

Snowflake architecture

Similar to an SMP (shared-disk) architecture, Snowflake uses central storage that is accessible from all the compute nodes. In addition, similar to an MPP architecture, Snowflake processes queries using MPP compute clusters, also known as virtual warehouses. As a result, Snowflake combines the data-management simplicity of a shared-disk architecture with the scalability of a shared-nothing (MPP) architecture.

As shown in Figure 1-9, the Snowflake architecture consists of three main layers. Table 1-4 describes each layer.
Table 1-4. Key Layers of Snowflake

Service layer: Consists of the services that coordinate Snowflake’s work. The services run on dedicated instances and include authentication, infrastructure management, metadata management, query parsing and optimization, and access control.

Compute layer: Consists of virtual warehouses (VWs). Each VW is an MPP compute cluster made up of multiple compute nodes, and each VW is an independent cluster that doesn’t share resources with other VWs.

Storage layer: Stores data in an internal compressed columnar format using cloud storage; for example, in AWS it is S3, and in Azure it is Blob storage. Snowflake manages all aspects of data storage, and customers don’t have direct access to the file storage. Data is accessible only via SQL.

In other words, Snowflake offers almost unlimited computing and storage capabilities by utilizing cloud storage and compute. Let’s look at a simple example of a traditional organization with a DW platform. Say you have a DW and you run ETL processing overnight. During heavy ETL processing, business users can barely use the DW because few resources are available. At the same time, the marketing department needs to run complex queries to calculate its attribution model, and the inventory team needs to run its reports and optimize inventory. In other words, every process and every team in the organization is important, but the DW is a bottleneck.

With Snowflake, every team or department can have its own virtual warehouse that can be scaled up and down immediately depending on its requirements, as sketched below. Moreover, the ETL process can have its own virtual warehouse that runs only overnight. The DW isn’t a bottleneck anymore, which allows the organization to unlock its data’s potential, and the organization pays only for the resources it uses. You don’t have to buy expensive appliances or guess at future workloads. Snowflake truly democratizes data and gives almost unlimited power to business users.
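As an illustrative sketch of this warehouse-per-team idea, the statements below create a dedicated ETL warehouse and a separate marketing warehouse and then resize the ETL warehouse for a heavy run, reusing the conn object from the earlier connection example. The warehouse names, sizes, and suspend timeouts are assumptions chosen for the example rather than recommendations, and running them requires a role with the CREATE WAREHOUSE privilege.

```python
# Sketch: one virtual warehouse per workload, each scaled independently.
# Names, sizes, and timeouts below are illustrative assumptions.
ddl_statements = [
    # ETL gets its own warehouse that suspends itself when the nightly run ends.
    """CREATE WAREHOUSE IF NOT EXISTS etl_wh
         WITH WAREHOUSE_SIZE = 'LARGE'
              AUTO_SUSPEND = 300
              AUTO_RESUME = TRUE
              INITIALLY_SUSPENDED = TRUE""",
    # Marketing runs its attribution queries on a separate, smaller warehouse.
    """CREATE WAREHOUSE IF NOT EXISTS marketing_wh
         WITH WAREHOUSE_SIZE = 'SMALL'
              AUTO_SUSPEND = 60
              AUTO_RESUME = TRUE""",
    # Scale the ETL warehouse up for a heavy run and back down afterward.
    "ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'XLARGE'",
    "ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'LARGE'",
]

cur = conn.cursor()
for stmt in ddl_statements:
    cur.execute(stmt)
```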

In addition to scalability and simplicity, Snowflake offers many unique features that didn’t exist before and aren’t available in other DW platforms (cloud or on-premise), such as data sharing, Time Travel, database replication and failover, and zero-copy cloning, all of which you will learn about in this book.

Summary

In this chapter, we briefly reviewed the history of data warehousing and covered the fundamentals of cloud computing. This information gave you some background so that you have a better understanding of why Snowflake was brought to the market and why the cloud is the future of data warehousing and modern analytics. Finally, you learned about the unique architecture of Snowflake and its key layers. In the next chapter, you will learn how to start working with Snowflake.
