© Dmitry Anoshin, Dmitry Shirokov, Donna Strok 2020
D. Anoshin et al., Jumpstart Snowflake, https://doi.org/10.1007/978-1-4842-5328-1_12

12. Snowflake and Data Science

Dmitry Anoshin (1), Dmitry Shirokov (2), and Donna Strok (3)
(1) British Columbia, Canada
(2) Burnaby, BC, Canada
(3) Seattle, WA, USA

“You’re only given a little spark of madness. You mustn’t lose it.”

—Robin Williams, “Little spark of madness” stand-up, 1977

Nowadays, data is one of any company’s main assets. As a result, every analytics team faces the need to organize its data science processes. Snowflake is a smart choice as a data source for storing structured and semistructured data.

In other words, its elastic storage and compute let you store virtually unlimited amounts of data, paying only for the storage and compute you actually use, while searching for insights through data analysis and model building.

Additionally, the platform integrates with the most popular analytics solutions.

In this chapter, you will learn about how Snowflake and data science platforms work together. We will cover the following topics:
  • Snowflake-supported advanced analytics solutions

  • Apache Spark introduction

  • Snowflake and Spark connector

  • Snowflake and Databricks

Snowflake and the Analytics Ecosystem

Snowflake supports many popular analytical solutions. Table 12-1 shows some of the platforms that are integrated with Snowflake.
Table 12-1 Popular Analytics Solutions That Work with Snowflake
  • Alteryx1 is a self-service data analytics platform.

  • Apache Spark2 is an open source cluster computing framework.

  • Databricks3 is a cloud-based big data processing company founded by the creators of Apache Spark.

  • DataRobot4 is a predictive analytics platform for rapidly building and deploying predictive models in the cloud or in an enterprise.

  • H2O.ai5 is an open source machine learning and artificial intelligence platform.

  • RStudio6 is an open source integrated development environment for R.

  • Qubole7 is a cloud-native data platform based on Apache Spark, Apache Airflow, and Presto.

Snowflake and Apache Spark

Let’s look at an example of how the interaction between the database and the analytical platform works. Apache Spark is the de facto industry standard for big data engineering and analytics. Spark is an open source analytics framework based on a distributed computing cluster. It is commonly used for engineering data pipelines (including streaming), ad hoc analysis, machine learning, graph analysis, and other types of analytics.

Machine learning is becoming increasingly popular in many companies because it can significantly impact many of a company’s business processes. Spark can be used for both model training and production. We can build a machine learning model with Spark MLlib, Spark’s built-in machine learning library of distributed, highly scalable algorithms. Additionally, Spark works well with Pandas,8 Scikit-learn,9 TensorFlow,10 and other popular statistical, machine learning, and deep learning libraries.
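To make this concrete, here is a minimal MLlib sketch using the pyspark.ml API. The data, column names, and session setup are illustrative assumptions, not part of the chapter’s examples.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# On Databricks or in spark-shell a session already exists as `spark`.
spark = SparkSession.builder.appName("mllib_sketch").getOrCreate()

# Toy training set: two numeric features and a binary label (illustrative only)
train_raw = spark.createDataFrame(
    [(0.0, 1.1, 0.0), (2.0, 1.0, 1.0), (2.1, 3.3, 1.0), (0.3, 0.2, 0.0)],
    ["f1", "f2", "label"],
)

# MLlib estimators expect a single vector column of features
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
train = assembler.transform(train_raw)

# Fit a distributed logistic regression and inspect the learned weights
model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
print(model.coefficients, model.intercept)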

Moreover, you can use Apache MLflow11 for organizing the lifecycle of the model and Apache Airflow12 or similar solutions for building data pipelines.

Data scientists and analysts prefer to use SQL, R, and Python, while data engineers usually use languages such as Python, Java, and Scala. Spark provides APIs for all of these languages, so each team member can write Spark code that executes on a distributed cluster of machines in the language that best fits the team’s skills. Your company can also choose the optimal deployment strategy, because Apache Spark can be set up and deployed on-premises or in the cloud. The most popular cloud providers, such as Amazon Web Services, Microsoft Azure, and Google Cloud, supply Spark as a component or service. AWS supports Spark as part of the EMR13 service; Microsoft supports Spark in the Hadoop-based Azure HDInsight14 as well as in the Azure Databricks platform; and Google Cloud Dataproc15 is a managed Hadoop-based service that includes Spark.

Let’s look at the main components of Apache Spark; see Figure 12-1.
Figure 12-1 Apache Spark high-level components

Table 12-2 describes the components of the platform.
Table 12-2 Components of Apache Spark
  • Spark SQL: the module for working with structured data by using SQL. It is compliant with the ANSI SQL 2011 specification and also supports Hive QL. (A short sketch follows this table.)

  • Spark Streaming: the Spark Structured Streaming API, which allows you to build scalable data pipeline solutions that work in near-real-time mode.

  • MLlib: the implementation of machine learning algorithms in a scalable and distributed manner.

  • GraphX: the module for graph analytics. For instance, you can use graph analysis as part of fraud analytics or customer churn analysis.

  • Spark Core with Dataframe API: the distributed, scalable engine that deploys on different types of big data clusters such as Mesos, Hadoop YARN, and Kubernetes. It provides high-level abstractions, including the RDD and Dataframe,16 for operating on structured and semi/unstructured data using the Python, R, and Scala/Java languages.

  • Datasource API: provides the ability to develop connectors for Apache Spark. Snowflake has already developed such a library.

  • Spark Modules: an ecosystem for the development and distribution of Spark-compatible libraries.
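As a quick illustration of the Spark SQL and Dataframe API entries in Table 12-2, the following minimal PySpark sketch builds a small dataframe in memory, registers it as a temporary view, and aggregates it with plain SQL. The data and names are illustrative only.

from pyspark.sql import SparkSession

# Local session just for illustration; on Databricks, `spark` already exists.
spark = SparkSession.builder.appName("spark_sql_sketch").getOrCreate()

# A tiny dataframe built in memory
orders = spark.createDataFrame(
    [(1, "HIGH", 120.50), (2, "LOW", 35.10), (3, "HIGH", 87.00)],
    ["order_id", "priority", "amount"],
)

# Expose the dataframe to Spark SQL and aggregate with a SQL statement
orders.createOrReplaceTempView("orders")
spark.sql("""
    SELECT priority, COUNT(*) AS cnt, SUM(amount) AS total
    FROM orders
    GROUP BY priority
""").show()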

Connector for Apache Spark

Snowflake provides the Snowflake Spark Connector,17 which allows you to use Snowflake and Spark together. Let’s dive into how it works.

Figure 12-2 shows a data flow process between Snowflake data warehouse services and managed Apache Spark.
Figure 12-2 Bidirectional data transfer between Snowflake and Spark

The connector supports two modes of data transfer, depending on whether an internal or external stage is used. Stages can be based on AWS S3 or an Azure Blob Storage container. Internal stages are created and dropped automatically during data transfer within the Snowflake session. You can choose external stages if you prefer to manage the data transfer location yourself.

Note

Best practice is to use internal transfer. Use external transfer only if there is a need to store data for more than 36 hours.
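If you do need the external mode, the staging location and credentials are supplied through additional connector options. The following sketch shows the idea for an S3-based stage; the option names (tempDir, awsAccessKey, awsSecretKey) and every value are assumptions to verify against your connector version’s documentation. The dictionary would be passed to spark.read or df.write via .options(**external_options), as shown later in this chapter.

# Sketch only: external (self-managed) S3 staging for the Snowflake Spark
# connector. Option names and values are assumptions; verify them against
# your connector version before relying on them.
external_options = dict(
    sfUrl="<your_snowflake_account>.snowflakecomputing.com",
    sfUser="<user>",
    sfPassword="<password>",
    sfDatabase="<database>",
    sfSchema="<schema>",
    sfWarehouse="<warehouse>",
    tempDir="s3a://<your-bucket>/snowflake-staging/",  # external stage location
    awsAccessKey="<aws_access_key_id>",
    awsSecretKey="<aws_secret_access_key>",
)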

Additionally, one of the key features of the Spark connector is query pushdown optimization. With pushdown optimization, the logic for transforming or querying data runs on the database side. The connector integrates deeply with Spark: it can read Spark’s logical query plans and translate them, fully or partially, into SQL that executes in Snowflake. This reduces the amount of data moved between the Snowflake database and Spark, which dramatically improves performance. See Table 12-3; a short pushdown-oriented read sketch follows the table.
Table 12-3 Interaction Between Spark and Snowflake
  • Spark dataframe: the data structure in Apache Spark’s distributed memory that can be created automatically from the data in a Snowflake table. The schema of the table and the schema of the dataframe must match; otherwise, you must specify the mapping.18 In the opposite direction, it is the data structure that holds the data to be written to the table.

  • Snowflake JDBC driver: a high-performance, optimized driver developed by Snowflake.

  • Snowflake Spark Connector: a connector that implements the Spark Datasource API for Snowflake and is published as a Maven19 package.

  • Snowflake internal/external stages: the Snowflake stages used by the data transfer process.

  • Snowflake table: the source/target table in the Snowflake database.
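The following minimal sketch illustrates the idea of keeping work on the Snowflake side. It assumes a running Spark session and the connection dictionary named options from Listing 12-1 later in this chapter; the query text targets the TPCH_SF1 sample schema. The connector’s query option hands a whole SQL statement to Snowflake, so only the aggregated result travels to Spark. Treat it as an illustration rather than one of the chapter’s listings.

# Sketch: let Snowflake run the aggregation and return only the small result.
# `options` is the connection dictionary shown in Listing 12-1.
agg_df = (spark.read
    .format("snowflake")
    .options(**options)
    .option("query",
            "SELECT o_orderpriority, COUNT(*) AS order_cnt "
            "FROM orders GROUP BY o_orderpriority")
    .load())
agg_df.show()

Filters and aggregations applied to a dbtable-based dataframe can likewise be pushed down automatically through the connector’s query-plan translation described above.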

Working with Databricks

Databricks provides the Databricks Unified Analytics Platform,20 which is a data science and data engineering platform with tools for data engineers to build data pipelines, for data scientists to build machine learning models, and for business users to consume real-time dashboards.

In addition to the Spark engine, the platform provides many additional enterprise-level components for building a complete process for gathering insights from data sets, as well as designing and testing machine learning models.

Databricks comes with the Snowflake Spark Connector already set up, so you can use its interface to quickly configure a connection between the two platforms. See Figure 12-3 and Table 12-4.
Figure 12-3 Elements of Databricks Unified Analytics Platform

Table 12-4 Components of Databricks Unified Analytics Platform
  • Databricks Workspace: an environment that provides a hierarchy of notebooks, libraries, dashboards, and experiments, with appropriate access controls for them.

  • Databricks Runtime: a set of components that improve the usability, performance, and security of big data analytics, including the following:

    - Apache Spark, an open source analytics framework based on a distributed computing cluster.

    - Delta Lake,21 an open source storage layer with ACID transactions, scalable metadata, and versioning features. It also includes schema evolution features for building data lakes and can be built on S3, Azure Data Lake Storage, or HDFS.22

    - ML Libraries, a set of the most popular machine learning libraries, such as TensorFlow.

    - MLflow,23 an open source platform for managing the machine learning lifecycle: MLflow Projects provides a code packaging format; MLflow Models provides a model packaging format, with deployment to Docker, Azure ML, or Apache Spark for serving; and MLflow Tracking records experiments, including code, parameters, and metrics.

  • Databricks Cloud Services: the platform can be deployed on AWS and Azure. Databricks is software as a service, which means the platform provides the benefits of a fully managed service and reduces infrastructure complexity.

Using Snowflake and Databricks Together

Let’s see how the Databricks interface works with Snowflake. This is just an example of how you might do this.
  1. Sign into Azure at azure.microsoft.com.24

Note The minimum requirement for a Databricks cluster is two nodes. Your Azure account has to be “pay as you go.” Please check your account’s limits and quotas25 and pricing details.26

  2. Log into your Azure account.

  3. Create a new Databricks service using Home ➤ Azure Databricks ➤ Create Azure Databricks service. See Figure 12-4.

     Figure 12-4 Creating a new Azure Databricks service

  4. Open a Databricks environment.
A Databricks notebook is essentially an extended version of a Python (Jupyter) notebook. See Figure 12-5.
Figure 12-5 Azure Databricks environment

  5. Create a new small Spark cluster by selecting Cluster ➤ Create cluster. See Figure 12-6.

     Figure 12-6 Launching a new Spark cluster

Set Cluster Name to db_cluster.

Databricks Runtime 5.3 ML bundles the core runtime components, including Apache Spark and Scala; machine learning libraries such as Pandas, PyTorch, and TensorFlow; and other popular data science packages.

Next, use Python version 3.

Finally, choose the Worker Type. The standard type is fine here; for better performance, you can pick a worker type with Databricks Delta Caching27 enabled. See Figure 12-7.
Figure 12-7 Launching a new Spark cluster

  6. Create a new notebook using Azure Databricks ➤ Create a blank notebook, call it snowflake_test, and attach it to the existing cluster. See Figure 12-8.

     Figure 12-8 Attaching the cluster to a notebook

  7. Connect to Snowflake.

Replace the placeholders with your own Snowflake credentials before executing Listing 12-1.

Caution Best practice is to use Databricks secrets28 instead of hard-coded values. In Listing 12-1, we kept plain placeholders for readability; a secrets-based sketch follows the listing.
options = dict(sfUrl="<your_snowflake_account>.snowflakecomputing.com",
               sfUser="<user>",
               sfPassword="<password>",
               sfDatabase="SNOWFLAKE_SAMPLE_DATA",
               sfSchema="TPCH_SF1",
               sfWarehouse="SMALL_COMPUTE_WH")
Listing 12-1 Connecting to Snowflake’s SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 Schema
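As mentioned in the caution above, a safer pattern is to read the credentials from a Databricks secret scope with dbutils.secrets.get. The scope name and key names below are hypothetical and must already exist in your workspace; this is a sketch, not part of the chapter’s listing.

# Hedged sketch: pull credentials from a Databricks secret scope instead of
# hard-coding them. The scope "snowflake" and the key names are hypothetical.
user = dbutils.secrets.get(scope="snowflake", key="sf_user")
password = dbutils.secrets.get(scope="snowflake", key="sf_password")

options = dict(sfUrl="<your_snowflake_account>.snowflakecomputing.com",
               sfUser=user,
               sfPassword=password,
               sfDatabase="SNOWFLAKE_SAMPLE_DATA",
               sfSchema="TPCH_SF1",
               sfWarehouse="SMALL_COMPUTE_WH")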

  8. Read data from Snowflake.

df = (spark.read
    .format("snowflake")
    .options(**options)
    .option("dbtable", "ORDERS")
    .load())
display(df)
  9. Write data into Snowflake.

# Note: point sfDatabase/sfSchema at a database where your role can create
# tables; the shared SNOWFLAKE_SAMPLE_DATA database is read-only.
(df.write
    .format("snowflake")
    .options(**options)
    .option("dbtable", "sampletable")
    .mode("overwrite")
    .save())
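As an optional sanity check (a sketch, not one of the original steps), you can read the table back and display a few rows, assuming the write above targeted a database and schema where your role can create tables.

# Read back the freshly written table to confirm the round trip worked.
check_df = (spark.read
    .format("snowflake")
    .options(**options)
    .option("dbtable", "sampletable")
    .load())
display(check_df.limit(5))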

Summary

In this chapter, we covered how Snowflake works with modern analytics solutions. You learned about which popular advanced analytics platforms have deep integration with Snowflake and saw how this is done in practice by running through a quick example of Databricks.

In the next chapter, you will learn about how to migrate a legacy data warehouse system into Snowflake.
