© Dmitry Anoshin, Dmitry Shirokov, Donna Strok 2020
D. Anoshin et al., Jumpstart Snowflake, https://doi.org/10.1007/978-1-4842-5328-1_12

12. Snowflake and Data Science

Dmitry Anoshin (1), Dmitry Shirokov (2), and Donna Strok (3)
(1) British Columbia, Canada
(2) Burnaby, BC, Canada
(3) Seattle, WA, USA

“You’re only given a little spark of madness. You mustn’t lose it.”

—Robin Williams, “Little spark of madness” stand-up, 1977

Nowadays, data is one of any company’s main assets. As a result, every analytics team faces the need to organize its data science processes. Snowflake is a smart choice as a data source for storing structured and semistructured data.

In other words, its elastic storage and compute let you store virtually unlimited amounts of data, paying only for the storage and compute you actually use, while searching for insights through data analysis and model building.

Additionally, the platform integrates with the most popular analytics solutions.

In this chapter, you will learn about how Snowflake and data science platforms work together. We will cover the following topics:
  • Snowflake-supported advanced analytics solutions

  • Apache Spark introduction

  • Snowflake and Spark connector

  • Snowflake and Databricks

Snowflake and the Analytics Ecosystem

Snowflake supports many popular analytical solutions. Table 12-1 shows some of the platforms that are integrated with Snowflake.
Table 12-1 Popular Analytics Solutions That Work with Snowflake
  • Alteryx1 is a self-service data analytics platform.

  • Apache Spark2 is an open source cluster computing framework.

  • Databricks3 is a cloud-based big data processing company founded by the creators of Apache Spark.

  • DataRobot4 is a predictive analytics platform for rapidly building and deploying predictive models in the cloud or in an enterprise.

  • H2O.ai5 is an open source machine learning and artificial intelligence platform.

  • RStudio6 is an open source integrated development environment for R.

  • Qubole7 is a cloud-native data platform based on Apache Spark, Apache Airflow, and Presto.

Snowflake and Apache Spark

Let’s look at an example of how the interaction between the database and the analytical platform works. Apache Spark is the de facto industry standard for big data engineering and analytics. Spark is an open source analytics framework based on a distributed computing cluster. It is commonly used for engineering data pipelines (including streaming), ad hoc analysis, machine learning, graph analysis, and other types of analytics.

Machine learning is becoming increasingly popular in many companies because it can significantly impact many of a company’s business processes. Spark can be used for both model training and production. We can build a machine learning model with Spark MLlib, Spark’s built-in machine learning library of distributed, highly scalable algorithms. Additionally, Spark works well with Pandas,8 Scikit-learn,9 TensorFlow,10 and other popular statistical, machine learning, and deep learning libraries.
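To make this concrete, here is a minimal MLlib sketch using the pyspark.ml API. The data, column names, and session setup are illustrative assumptions, not part of the chapter’s examples.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# On Databricks or in spark-shell a session already exists as `spark`.
spark = SparkSession.builder.appName("mllib_sketch").getOrCreate()

# Toy training set: two numeric features and a binary label (illustrative only)
train_raw = spark.createDataFrame(
    [(0.0, 1.1, 0.0), (2.0, 1.0, 1.0), (2.1, 3.3, 1.0), (0.3, 0.2, 0.0)],
    ["f1", "f2", "label"],
)

# MLlib estimators expect a single vector column of features
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
train = assembler.transform(train_raw)

# Fit a distributed logistic regression and inspect the learned weights
model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
print(model.coefficients, model.intercept)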

Moreover, you can use Apache MLflow11 for organizing the lifecycle of the model and Apache Airflow12 or similar solutions for building data pipelines.

Data scientists and analysts prefer to use SQL, R, and Python, while data engineers usually use languages such as Python, Java, and Scala. Spark provides APIs for all of these languages, so each team member can write Spark code that executes on a distributed cluster of machines in the language that best fits the team’s skills. Your company can also choose the optimal deployment strategy, because Apache Spark can be set up and deployed on-premises or in the cloud. The most popular cloud providers, such as Amazon Web Services, Microsoft Azure, and Google Cloud, supply Spark as a component or service. AWS supports Spark as part of the EMR13 service; Microsoft supports Spark in the Hadoop-based Azure HDInsight14 as well as in the Azure Databricks platform; and Google Cloud Dataproc15 is a managed Hadoop-based service that includes Spark.

Let’s look at the main components of Apache Spark; see Figure 12-1.
Figure 12-1 Apache Spark high-level components

Table 12-2 describes the components of the platform.
Table 12-2 Components of Apache Spark
  • Spark SQL: the module for working with structured data by using SQL. It is compliant with the ANSI SQL 2011 specification and also supports Hive QL. (A short sketch follows this table.)

  • Spark Streaming: the Spark Structured Streaming API, which allows you to build scalable data pipeline solutions that work in near-real-time mode.

  • MLlib: the implementation of machine learning algorithms in a scalable and distributed manner.

  • GraphX: the module for graph analytics. For instance, you can use graph analysis as part of fraud analytics or customer churn analysis.

  • Spark Core with Dataframe API: the distributed, scalable engine that deploys on different types of big data clusters such as Mesos, Hadoop YARN, and Kubernetes. It provides high-level abstractions, including the RDD and Dataframe,16 for operating on structured and semi/unstructured data using the Python, R, and Scala/Java languages.

  • Datasource API: provides the ability to develop connectors for Apache Spark. Snowflake has already developed such a library.

  • Spark Modules: an ecosystem for the development and distribution of Spark-compatible libraries.
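As a quick illustration of the Spark SQL and Dataframe API entries in Table 12-2, the following minimal PySpark sketch builds a small dataframe in memory, registers it as a temporary view, and aggregates it with plain SQL. The data and names are illustrative only.

from pyspark.sql import SparkSession

# Local session just for illustration; on Databricks, `spark` already exists.
spark = SparkSession.builder.appName("spark_sql_sketch").getOrCreate()

# A tiny dataframe built in memory
orders = spark.createDataFrame(
    [(1, "HIGH", 120.50), (2, "LOW", 35.10), (3, "HIGH", 87.00)],
    ["order_id", "priority", "amount"],
)

# Expose the dataframe to Spark SQL and aggregate with a SQL statement
orders.createOrReplaceTempView("orders")
spark.sql("""
    SELECT priority, COUNT(*) AS cnt, SUM(amount) AS total
    FROM orders
    GROUP BY priority
""").show()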

Connector for Apache Spark

Snowflake provides the Snowflake Spark Connector,17 which allows you to use Snowflake and Spark together. Let’s dive into how it works.

Figure 12-2 shows a data flow process between Snowflake data warehouse services and managed Apache Spark.
Figure 12-2 Bidirectional data transfer between Snowflake and Spark

The connector supports two modes of data transfer, depending on whether an internal or external stage is used. Stages can be based on AWS S3 or an Azure Blob Storage container. Internal stages are created and dropped automatically during data transfer within the Snowflake session. You can choose external stages if you prefer to manage the data transfer location yourself.

Note

Best practice is to use internal transfer. Use external transfer only if there is a need to store data for more than 36 hours.
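If you do need the external mode, the staging location and credentials are supplied through additional connector options. The following sketch shows the idea for an S3-based stage; the option names (tempDir, awsAccessKey, awsSecretKey) and every value are assumptions to verify against your connector version’s documentation. The dictionary would be passed to spark.read or df.write via .options(**external_options), as shown later in this chapter.

# Sketch only: external (self-managed) S3 staging for the Snowflake Spark
# connector. Option names and values are assumptions; verify them against
# your connector version before relying on them.
external_options = dict(
    sfUrl="<your_snowflake_account>.snowflakecomputing.com",
    sfUser="<user>",
    sfPassword="<password>",
    sfDatabase="<database>",
    sfSchema="<schema>",
    sfWarehouse="<warehouse>",
    tempDir="s3a://<your-bucket>/snowflake-staging/",  # external stage location
    awsAccessKey="<aws_access_key_id>",
    awsSecretKey="<aws_secret_access_key>",
)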

Additionally, one of the key features of the Spark connector is query pushdown optimization. With pushdown optimization, the logic for transforming or querying data runs on the database side. The connector integrates deeply with Spark: it can read Spark’s logical query plans and translate them, fully or partially, into SQL that executes in Snowflake. This reduces the amount of data moved between the Snowflake database and Spark, which dramatically improves performance. See Table 12-3; a short pushdown-oriented read sketch follows the table.
Table 12-3 Interaction Between Spark and Snowflake
  • Spark dataframe: the data structure in Apache Spark’s distributed memory that can be created automatically from the data in a Snowflake table. The schema of the table and the schema of the dataframe must match; otherwise, you must specify the mapping.18 In the opposite direction, it is the data structure that holds the data to be written to the table.

  • Snowflake JDBC driver: a high-performance, optimized driver developed by Snowflake.

  • Snowflake Spark Connector: a connector that implements the Spark Datasource API for Snowflake and is published as a Maven19 package.

  • Snowflake internal/external stages: the Snowflake stages used by the data transfer process.

  • Snowflake table: the source/target table in the Snowflake database.
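The following minimal sketch illustrates the idea of keeping work on the Snowflake side. It assumes a running Spark session and the connection dictionary named options from Listing 12-1 later in this chapter; the query text targets the TPCH_SF1 sample schema. The connector’s query option hands a whole SQL statement to Snowflake, so only the aggregated result travels to Spark. Treat it as an illustration rather than one of the chapter’s listings.

# Sketch: let Snowflake run the aggregation and return only the small result.
# `options` is the connection dictionary shown in Listing 12-1.
agg_df = (spark.read
    .format("snowflake")
    .options(**options)
    .option("query",
            "SELECT o_orderpriority, COUNT(*) AS order_cnt "
            "FROM orders GROUP BY o_orderpriority")
    .load())
agg_df.show()

Filters and aggregations applied to a dbtable-based dataframe can likewise be pushed down automatically through the connector’s query-plan translation described above.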

Working with Databricks

Databricks provides the Databricks Unified Analytics Platform,20 which is a data science and data engineering platform with tools for data engineers to build data pipelines, for data scientists to build machine learning models, and for business users to consume real-time dashboards.

In addition to the Spark engine, the platform provides many additional enterprise-level components for building a complete process for gathering insights from data sets, as well as designing and testing machine learning models.

Databricks comes with the Snowflake Spark Connector already set up, so you can use its interface to quickly configure a connection between the two platforms. See Figure 12-3 and Table 12-4.
Figure 12-3 Elements of Databricks Unified Analytics Platform

Table 12-4 Components of Databricks Unified Analytics Platform
  • Databricks Workspace: an environment that provides a hierarchy of notebooks, libraries, dashboards, and experiments, with appropriate access controls for them.

  • Databricks Runtime: a set of components that improve the usability, performance, and security of big data analytics, including the following:

    - Apache Spark, an open source analytics framework based on a distributed computing cluster.

    - Delta Lake,21 an open source storage layer with ACID transactions, scalable metadata, and versioning features. It also includes schema evolution features for building data lakes and can be built on S3, Azure Data Lake Storage, or HDFS.22

    - ML Libraries, a set of the most popular machine learning libraries, such as TensorFlow.

    - MLflow,23 an open source platform for managing the machine learning lifecycle: MLflow Projects provides a code packaging format; MLflow Models provides a model packaging format, with deployment to Docker, Azure ML, or Apache Spark for serving; and MLflow Tracking records experiments, including code, parameters, and metrics.

  • Databricks Cloud Services: the platform can be deployed on AWS and Azure. Databricks is software as a service, which means the platform provides the benefits of a fully managed service and reduces infrastructure complexity.

Using Snowflake and Databricks Together

Let’s see how the Databricks interface works with Snowflake. This is just an example of how you might do this.
  1. Sign into Azure at azure.microsoft.com.24

Note The minimum requirement for a Databricks cluster is two nodes. Your Azure account has to be “pay as you go.” Please check your account’s limits and quotas25 and pricing details.26

  2. Log into your Azure account.

  3. Create a new Databricks service using Home ➤ Azure Databricks ➤ Create Azure Databricks service. See Figure 12-4.

     Figure 12-4 Creating a new Azure Databricks service

  4. Open a Databricks environment.
A Databricks notebook is essentially an extended version of a Python (Jupyter) notebook. See Figure 12-5.
Figure 12-5 Azure Databricks environment

  5. Create a new small Spark cluster by selecting Cluster ➤ Create cluster. See Figure 12-6.

     Figure 12-6 Launching a new Spark cluster

Set Cluster Name to db_cluster.

Databricks Runtime 5.3 ML bundles the core runtime components, including Apache Spark and Scala; machine learning libraries such as Pandas, PyTorch, and TensorFlow; and other popular data science packages.

Next, use Python version 3.

Finally, choose the Worker Type. The standard type is fine here; for better performance, you can pick a worker type with Databricks Delta Caching27 enabled. See Figure 12-7.
Figure 12-7 Launching a new Spark cluster

  6. Create a new notebook using Azure Databricks ➤ Create a blank notebook, call it snowflake_test, and attach it to the existing cluster. See Figure 12-8.

     Figure 12-8 Attaching the cluster to a notebook

  7. Connect to Snowflake.

Replace the placeholders with your own Snowflake credentials before executing Listing 12-1.

Caution Best practice is to use Databricks secrets28 instead of hard-coded values. In Listing 12-1, we kept plain placeholders for readability; a secrets-based sketch follows the listing.
options = dict(sfUrl="<your_snowflake_account>.snowflakecomputing.com",
               sfUser="<user>",
               sfPassword="<password>",
               sfDatabase="SNOWFLAKE_SAMPLE_DATA",
               sfSchema="TPCH_SF1",
               sfWarehouse="SMALL_COMPUTE_WH")
Listing 12-1 Connecting to Snowflake’s SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 Schema
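As mentioned in the caution above, a safer pattern is to read the credentials from a Databricks secret scope with dbutils.secrets.get. The scope name and key names below are hypothetical and must already exist in your workspace; this is a sketch, not part of the chapter’s listing.

# Hedged sketch: pull credentials from a Databricks secret scope instead of
# hard-coding them. The scope "snowflake" and the key names are hypothetical.
user = dbutils.secrets.get(scope="snowflake", key="sf_user")
password = dbutils.secrets.get(scope="snowflake", key="sf_password")

options = dict(sfUrl="<your_snowflake_account>.snowflakecomputing.com",
               sfUser=user,
               sfPassword=password,
               sfDatabase="SNOWFLAKE_SAMPLE_DATA",
               sfSchema="TPCH_SF1",
               sfWarehouse="SMALL_COMPUTE_WH")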

  8. Read data from Snowflake.

df = (spark.read
    .format("snowflake")
    .options(**options)
    .option("dbtable", "ORDERS")
    .load())
display(df)
  9. Write data into Snowflake.

# Note: point sfDatabase/sfSchema at a database where your role can create
# tables; the shared SNOWFLAKE_SAMPLE_DATA database is read-only.
(df.write
    .format("snowflake")
    .options(**options)
    .option("dbtable", "sampletable")
    .mode("overwrite")
    .save())
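As an optional sanity check (a sketch, not one of the original steps), you can read the table back and display a few rows, assuming the write above targeted a database and schema where your role can create tables.

# Read back the freshly written table to confirm the round trip worked.
check_df = (spark.read
    .format("snowflake")
    .options(**options)
    .option("dbtable", "sampletable")
    .load())
display(check_df.limit(5))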

Summary

In this chapter, we covered how Snowflake works with modern analytics solutions. You learned about which popular advanced analytics platforms have deep integration with Snowflake and saw how this is done in practice by running through a quick example of Databricks.

In the next chapter, you will learn about how to migrate a legacy data warehouse system into Snowflake.
