Chapter 1, “Core Data Concepts,” and Chapter 2, “Relational Databases in Azure,” examine the fundamental concepts of analytical workloads, including common definitions and design patterns. This chapter expands on these concepts by exploring the various components that can be involved in Azure-based analytical workloads. These components include services that are involved in ingesting and processing data and storage options for a modern data warehouse.
Throughout this book we have covered the features and design considerations used by different workload types. For this reason, the following sections will only provide a summary of the different workload types. The important takeaway for this chapter is how analytical workloads differentiate from transactional ones and how batch and stream processing are used in a modern data warehouse solution. Understanding these features will set the stage for the rest of the chapter when we examine how to build modern data warehouses in Azure.
Analytical workloads can be built using many of the same technologies and components as transactional workloads. However, there are several design practices and features that are more optimal for one over the other. When designing a modern data warehouse, it is important to consider what sets analytical and transactional workloads apart.
As discussed in Chapter 1, “Core Data Concepts,” online transaction processing (OLTP) systems capture the business transactions that support the day-to-day operations of a business. Data stores that are used for OLTP systems must be able to handle millions of transactions a day while ensuring that none of the data is corrupted. Traditionally, OLTP systems have been hosted on relational databases, as these platforms implement ACID properties to ensure data integrity.
Relational databases supporting OLTP systems are highly normalized, typically following third normal form (3NF), separating related data into multiple tables to eliminate data redundancy. This design standard ensures that database tables are optimized for write operations. While this level of normalization is ideal for write operations, it is less efficient for analytical workloads that perform read-heavy operations. Analysts who have built reports from databases that are designed for OLTP workloads will inevitably be forced to write complicated queries that use several join operations to create the desired result set. This can lead to poor performance and concurrency issues with write operations.
Before examining features and best practices for analytical workloads, it is important to note that not all OLTP workloads are suitable for highly normalized, relational databases. Transactional data that is produced in large volumes and at high speeds can take a performance hit when being conformed to a fixed, normalized schema. In these cases, organizations can choose to host their transactional workloads on NoSQL document databases such as the Azure Cosmos DB Core (SQL) API. These databases store data in its original state as semi-structured documents, enabling transactions to be written to them very quickly.
While document databases are extremely efficient data stores for large volume and high velocity write operations, the lack of a consistent structure makes them difficult to use with analytical applications like reporting tools. Useful data fields are typically extracted from semi-structured NoSQL documents and stored in a format that is optimized for read-heavy operations. Several modern analytical services can also leverage data virtualization techniques to structure data for reporting applications while leaving the data in its source data store.
Analytical workloads are designed to help business users make data-driven decisions. These systems are used to answer several questions about the business: What has happened over the previous period? Why did particular events happen? What will happen if all things stay the same? What will happen if we make specific changes in different areas? As discussed in Chapter 1, these questions are answered by the different types of analytics that make up the analytics maturity model.
Data-driven business decisions come from extracting useful information from several source data stores, including OLTP databases. Once extracted, source data will typically undergo several transformation steps to remove extraneous features and remediate data quality issues. Cleansed data is then conformed to an easy-to-use data model for analytics. Data that is ready to be analyzed is stored in a relational data warehouse, an OLAP model, or as files in an enterprise data lake.
Reporting applications and analytical applications used to analyze historical data typically retrieve data from read optimized data stores such as a data warehouse or an OLAP model. Many of these systems offer in-memory and column-based storage capabilities that are optimal for analytical queries that aggregate large amounts of data. Data warehouses and department-specific data marts are built with relational databases like Azure Synapse Analytics dedicated SQL pools or Azure SQL Database. Unlike OLTP data stores that are built with relational databases, data warehouses use a denormalized data model. The section “Data Modeling Best Practices for Data Warehouses” later in this chapter covers this approach in further detail.
While most analytical workloads store processed data used by reporting applications in a relational data warehouse such as Azure Synapse Analytics dedicated SQL pools or an OLAP tool such as Azure Analysis Services, many organizations choose to store data used by data scientists as files in an enterprise data lake. Cloud-based data lakes such as Azure Data Lake Storage Gen2 (ADLS) can store massive amounts of data at a much lower cost than a relational data warehouse. Data lakes can also store large amounts of unstructured data such as images, video, and audio that data scientists can leverage with deep learning techniques. Data architects can take advantage of these capabilities by providing data scientists with large volumes and several types of data that they can use to build insightful machine learning models.
Modern cloud-based analytical workloads typically use a combination of an enterprise data lake and a data warehouse. Relational database engines used to host data warehouses offer faster query performance, higher user concurrency, and more granular security than data lake technologies. On the other hand, data lake services can host virtually unlimited amounts of data at a much lower cost, allowing users to store multiple copies of data to leverage for several different use cases. Data lakes can also store a wide variety of data, allowing users to interact with semi-structured and unstructured data with relative ease. This is why most organizations store all of their data in an enterprise data lake and only load data that is necessary for reporting from the data lake into a data warehouse.
In recent years, several technologies have been introduced that are optimized for ad hoc analysis with data stored in a data lake. By storing data using a columnar format such as Parquet, analysts can leverage data virtualization technologies such as Azure Synapse Analytics serverless SQL pools to query their data with T-SQL without having to create a separate copy of the data in a relational database. Data engineers can also store data in ADLS with Delta Lake. Delta Lake is an open-source storage layer that enables ACID properties on Parquet files in ADLS. This ensures data integrity for data stored in ADLS, perfect for ad hoc analysis and data science initiatives. More information about Delta Lake and the “Lakehouse” concept can be found at https://delta.io.
While a data lake used like a data warehouse can replace a relational database for smaller workloads, large reporting workloads that analyze data from several sources can benefit from the performance of a relational database. You can find more information about the benefits of using a relational data warehouse and a data lake together in the following blog post from James Serra: www.jamesserra.com/archive/2020/09/data-lakehouse-synapse.
Data used by analytical workloads has to go through a data processing workflow before it eventually lands in a data lake and/or a data warehouse. Even if the data does not undergo any transformations, it still needs to be extracted and loaded into a destination data store. Data engineers can use one or a combination of the following data processing techniques to create an end-to-end data pipeline: batch processing and stream processing.
Batch and stream processing are two data processing techniques that are used to manipulate data at rest and in real time. As discussed in Chapter 1, these techniques can be leveraged together in modern data processing architectures such as the Lambda architecture. This empowers organizations to make decisions with a wide variety of data that is generated at different speeds. Let's examine each of these techniques further in the following sections before exploring how they can be used in the same solution.
Batch processing activities act on groups, or batches, of data at predetermined periods of time or after a specified event. One example of batch processing is a retail company processing daily sales every night and loading the transformed data into a data warehouse. The following list includes reasons why you would want to use batch processing:
Data architects can implement batch processing activities using one of two techniques: extract, transform, and load (ETL) or extract, load, and transform (ELT). ETL pipelines extract data from one or more source systems, transform the data to meet user specifications, and then load the data in an analytical data store. ELT processes flip the transform and load stages and allow data engineers to transform data in the analytical data store. Because the ELT pattern is optimized for big data workloads, the analytical data store must be capable of working on data at scale. For this reason, ELT pipelines commonly use MPP technologies like Azure Synapse Analytics as the analytical data store.
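The ELT pattern described above can be sketched in miniature with Python's built-in sqlite3 module standing in for the analytical data store. The sample sales rows and table names are hypothetical; the point is that raw data is loaded first and the transformation runs inside the store's own engine:

```python
import sqlite3

# Raw rows "extracted" from a hypothetical source system.
raw_sales = [
    ("2023-01-01", "bike", 250.0),
    ("2023-01-01", "helmet", 40.0),
    ("2023-01-02", "bike", 250.0),
]

# ELT step 1: load the raw data into the analytical store as-is.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_sales (sale_date TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", raw_sales)

# ELT step 2: transform *inside* the store, letting its engine do the work
# (in a real ELT pipeline this would run at scale on an MPP engine).
conn.execute("""
    CREATE TABLE daily_sales AS
    SELECT sale_date, SUM(amount) AS total
    FROM raw_sales
    GROUP BY sale_date
""")

for row in conn.execute("SELECT * FROM daily_sales ORDER BY sale_date"):
    print(row)
```

In an ETL pipeline, by contrast, the aggregation would happen in the pipeline tool before the `INSERT`, and only the transformed rows would ever reach the destination.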
Batch processing workflows in the cloud generally use the following components:
Figure 5.1 illustrates an example of a batch processing workflow that uses ADF to extract data from a few source systems, lands the raw data in ADLS, processes the data with a combination of Azure Databricks and ADF mapping data flows, and finally loads the processed data into an Azure Synapse Analytics dedicated SQL pool.
Stream processing is a data processing technique that involves ingesting a continuous stream of data and performing computations on the data in real time. It is used for processing scenarios that have very short latency requirements, typically measured in seconds or milliseconds. Data that is ready for analysis is either sent directly to a dashboard or loaded into a persistent data store such as ADLS or Azure Synapse Analytics dedicated SQL pool for long-term analysis. Some examples of stream processing are listed here:
Cloud-based stream processing workflows generally use the following components:
As discussed in Chapter 1, stream processing workflows can use one of two approaches: live or on demand. The “live” approach is the most commonly used pattern, processing data continuously as it is generated. The “on-demand” approach persists incoming data in object storage and processes it in micro-batches. An example of this approach is illustrated in Figure 5.2.
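The micro-batch idea behind the "on-demand" approach can be sketched with a tumbling window: incoming events are grouped into fixed, non-overlapping time windows, and each window is aggregated as a small batch. The sensor readings and window size below are hypothetical:

```python
from collections import defaultdict

# Hypothetical stream of (epoch_seconds, temperature) sensor readings.
readings = [(0, 20.0), (3, 22.0), (7, 21.0), (11, 25.0), (14, 27.0)]

def tumbling_window_avg(stream, window_seconds):
    """Group readings into fixed, non-overlapping time windows and average each."""
    windows = defaultdict(list)
    for ts, value in stream:
        # Integer division assigns each event to exactly one window.
        windows[ts // window_seconds].append(value)
    # Key each result by the window's start time.
    return {w * window_seconds: sum(v) / len(v) for w, v in sorted(windows.items())}

print(tumbling_window_avg(readings, 5))
```

A "live" engine would compute the same aggregates continuously as events arrive rather than over persisted micro-batches, but the windowing logic is conceptually the same.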
Azure data services make it easy for data architects to use batch and stream processing workflows in the same solution. This flexibility gives business units the ability to quickly make well-informed decisions from their data. These cloud-native solutions are designed with modern data processing patterns like the Lambda architecture.
The Lambda architecture is a data processing pattern that provides a framework for how users can use a combination of batch and stream processing for data analysis. Solutions that use the Lambda architecture separate batch and stream processing operations into a cold and hot path. Figure 5.3 illustrates the components and process flow used by the Lambda architecture.
The cold path, also known as the batch layer, manages all operations that are not constrained by low latency requirements. Batch layer operations typically process large datasets at predetermined periods of time. Once processed, data is loaded into the serving layer (e.g., an analytical data store like Azure Synapse Analytics) to be analyzed by reporting and analytical applications.
The hot path, also known as the speed layer, manages stream processing operations. Data is immediately processed and is either directly sent to a reporting application for instant analysis or loaded into the serving layer and combined with data processed in the batch layer.
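The way the serving layer reconciles the two paths can be sketched as a merge of a precomputed batch view with incremental counts from the speed layer. The page-view counts below are hypothetical:

```python
# Batch (cold path) view: totals computed by last night's batch run.
batch_view = {"page_a": 1000, "page_b": 540}

# Speed (hot path) view: counts streamed in since the last batch run.
speed_view = {"page_a": 12, "page_c": 3}

def serving_layer(batch, speed):
    """Merge the batch and speed views so queries see up-to-date totals."""
    merged = dict(batch)
    for key, count in speed.items():
        merged[key] = merged.get(key, 0) + count
    return merged

print(serving_layer(batch_view, speed_view))
```

When the next batch run completes, its view absorbs the streamed events and the speed view is reset, keeping the hot path small.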
Modern data warehouse solutions are more than just a simple analytical data store. They are made up of several components that give users flexible options for how they can analyze their data. Technologies used by modern data warehouse solutions are designed to scale horizontally as well as vertically, meaning that they can process and store very large datasets. Modern computing paradigms that enable these technologies to manage large and diverse datasets have also led to more dynamic design patterns. As discussed previously in this chapter, modern data warehouse solutions can combine batch and stream processing workflows with the Lambda architecture.
Cloud platforms such as Azure make building these solutions more accessible than ever before. Instead of having to procure hardware and spend the time configuring distributed services such as Hadoop or Spark to work in an on-premises environment, users can quickly deploy services that are designed to be core components of a modern data warehouse solution. Azure's pay-per-use cost model and the ability to quickly scale or delete services allow organizations to test different modern data warehouse components by completing short projects known as proofs of concept (POCs). POCs enable users to evaluate critical design decisions without having to make any large upfront hardware commitments.
The following sections explore data modeling best practices for the most commonly used Azure services for modern data warehouse solutions.
Data warehouses are data management systems that support analytical workloads and business intelligence (BI) activities. Data managed by a data warehouse is derived from several sources, such as OLTP systems, web APIs, IoT devices, and social media networks. Unlike OLTP systems, data warehouses use data models that are read-optimized so analytical queries issued against them can efficiently return aggregated calculations to support business decisions.
As discussed in Chapter 2, data warehouses use denormalized data models that are optimized for analytical queries and read-heavy workloads. The most common design practice for this approach is the star schema. Star schemas denormalize business data to minimize the number of tables in the data model. Tables consist of business entities and measurable events that are related to those entities. This division of data categories is represented by the two types of tables defined in the star schema: dimension tables and fact tables.
Dimension tables contain information that describes a particular business entity. These tables are typically very wide, containing several descriptor columns and a key column that serves as a unique identifier. Some common entities that are stored as dimension tables include date, customer, product category, and product subcategory information. In all of these cases, there could be a relatively small number of rows but a large number of columns to provide as much descriptive information as possible.
Fact tables store quantifiable observations that are related to the dimension tables. These tables can grow to be very large, comprising several millions of rows related to specific measurable events. Some fact table examples include Internet sales, product inventory, and weather metrics. Fact tables also include foreign key columns that are used to establish relationships with dimension tables. These relationships determine the level of granularity that analytical queries can use when filtering fact table data. For example, a query that is filtering an Internet sales fact table by a date dimension can only return time slices for the level of detail contained in the date dimension.
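The fact-to-dimension relationship can be illustrated with a minimal sketch in plain Python. The table contents are hypothetical; the fact rows carry a foreign key (date_key) into the date dimension, and aggregation can only be as granular as the attributes the dimension provides:

```python
# Dimension table: one row per date, keyed by a surrogate date_key.
dim_date = {
    20230101: {"month": "2023-01", "year": 2023},
    20230102: {"month": "2023-01", "year": 2023},
    20230201: {"month": "2023-02", "year": 2023},
}

# Fact table: one row per sale, with a foreign key into dim_date.
fact_sales = [
    {"date_key": 20230101, "amount": 100.0},
    {"date_key": 20230102, "amount": 150.0},
    {"date_key": 20230201, "amount": 200.0},
]

def sales_by_month(facts, dates):
    """Join facts to the date dimension and aggregate at month granularity."""
    totals = {}
    for row in facts:
        month = dates[row["date_key"]]["month"]
        totals[month] = totals.get(month, 0.0) + row["amount"]
    return totals

print(sales_by_month(fact_sales, dim_date))
```

If the date dimension only stored months, a query could never slice the facts by individual day, which is what is meant by the dimension determining the available level of granularity.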
In the Azure ecosystem there are several services that can be used to build a modern data warehouse solution. Depending on the scenario and the skillset of the engineers building the solution, most Azure services can be used to build different components of a data processing pipeline. However, there is a set of core Azure data services that are specifically designed to process big data workloads:
Each of these services can perform a variety of different functions in a data processing pipeline. This versatility allows them to be used in various stages of ETL or ELT pipelines. They have the flexibility to manage data in a variety of different formats and can scale horizontally as well as vertically to process very large volumes of data.
First, let's examine how Azure HDInsight, Azure Databricks, and ADF are used in modern data warehouse solutions. End-to-end data processing solutions with Azure Synapse Analytics will be described in the section “End-to-End Analytics with Azure Synapse Analytics” later in this chapter.
Azure HDInsight is a managed, open-source analytics service in Azure. With Azure HDInsight, you can deploy distributed clusters for Apache Hadoop, Apache Spark, Apache Interactive Query/LLAP (Live Long and Process), Apache Kafka, Apache Storm, and Apache HBase in Azure. Being able to quickly stand up these environments without having to procure and manage hardware reduces the barriers to entry for organizations who are beginning to build a modern data warehouse.
Open-source frameworks like Hadoop and Spark are designed to handle large-scale data processing activities by using a scale-out architecture. While they can be installed on a single server node for test purposes, most use cases leverage multiple server nodes that are clustered together to perform processing activities at scale. Clusters consist of a head/driver node that divides jobs into smaller tasks and one or more worker nodes that execute each task.
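The head/worker division of labor can be sketched with a thread pool standing in for a cluster. This is only an analogy on one machine (real clusters distribute partitions across separate servers), but the scatter-gather shape is the same:

```python
from concurrent.futures import ThreadPoolExecutor

def head_node_sum(data, num_workers):
    """The 'head' splits a job into per-partition tasks; 'workers' run them."""
    # Divide the input into one partition per worker.
    partitions = [data[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Each worker task computes a partial result on its own partition...
        partials = list(pool.map(sum, partitions))
    # ...and the head combines the partial results into the final answer.
    return sum(partials)

print(head_node_sum(list(range(1, 101)), num_workers=4))  # → 5050
```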
Distributed frameworks also rely on resource managers like Apache Hadoop YARN (Yet Another Resource Negotiator) to manage cluster resources and job scheduling. Resource managers designate compute resources (such as CPU, memory, IO) to cluster nodes and monitor the resource usage. Knowing details of how YARN and other resource managers are designed is beyond the scope of the DP-900 exam and this book, but you can find more information at the following link if you would like to learn more: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html.
Azure HDInsight makes it easy to manage distributed frameworks like Hadoop and Spark and offers the capability to customize a cluster deployment, such as adding new components and languages. Also, since Azure HDInsight is a PaaS service, you can easily scale the number of worker nodes allocated to a cluster up or down to increase compute power or cut back on cost.
It is important to understand the different Azure HDInsight cluster types and when you should use them. Also, keep in mind that after you have deployed an Azure HDInsight cluster, you will not be able to change the cluster type. For this reason, it is critical that you understand the scenarios the cluster will be supporting. The following list describes each of the cluster types supported by Azure HDInsight:
https://docs.microsoft.com/en-us/azure/hdinsight/hadoop/hdinsight-use-hive.
One drawback to Hadoop is that it only supports batch processing, forcing users to leverage another service like Apache Storm or Apache Spark for distributed stream processing. Hadoop also reads and writes data from and to disk, potentially leading to poorer processing performance than Apache Spark, which supports in-memory processing.
As with any service in Azure, you can configure and deploy an Azure HDInsight cluster through the Azure Portal, through an Azure PowerShell or Azure CLI script, or via an Infrastructure as Code template like ARM or Bicep. Creating an Azure HDInsight cluster in Azure deploys the service chosen as the cluster type, the Apache Hadoop YARN resource manager to manage cluster resources, and several popular open-source tools such as Ambari, Avro, Hive, Sqoop, Tez, Pig, and Zookeeper. This greatly reduces the time it takes to get started building distributed solutions.
Most modern data warehouse scenarios leverage Apache Spark over Apache Hadoop, Apache Storm, and Apache Interactive Query to process large datasets due to its speed, ability to perform batch and stream processing activities, number of data source connectors, and overall ease of use. As a matter of fact, ADF mapping data flows use Apache Spark clusters to perform ETL activities. Apache Spark also enables data scientists and data analysts to interactively manipulate data concurrently.
There are a few management aspects that must be considered when deploying an Azure HDInsight cluster:
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-use-external-metadata-stores#custom-metastore.

https://docs.microsoft.com/en-us/azure/hdinsight/enterprise-security-package.

Later in this chapter we will discuss two other Azure services that can be used to build Apache Spark clusters. Azure Databricks and Azure Synapse Apache Spark pools are two Apache Spark–based analytics platforms that overcome the management overhead presented by Azure HDInsight. Both services allow you to easily pause (referred to as “terminate” in Azure Databricks) Spark clusters and maintain schema metadata without needing a custom external metastore. They are also natively integrated with Azure AD, enabling users to leverage their existing authentication/authorization mechanisms. Because of the ease of use and the additional components that provide a unified development experience for data engineers, Azure Databricks and Azure Synapse Analytics are the preferred choices for Apache Spark workloads. Reasons to use Azure Databricks instead of Azure Synapse Analytics Apache Spark pools and vice versa will be described in the following sections.
Azure HDInsight clusters are typically used in scenarios where Azure Databricks and Azure Synapse Analytics cannot be used or if Apache Kafka is required. The most common example of a scenario where Azure Databricks and Azure Synapse Analytics cannot be used is a solution that requires its Azure resources to come from a region that does not support either of these services. Azure Event Hubs also provides an endpoint compatible with Apache Kafka that can be leveraged by most Apache Kafka applications as an alternative to managing an Apache Kafka cluster with Azure HDInsight. Configuring the Azure Event Hubs Kafka endpoint is beyond the scope of the DP-900 exam, but you can find more information at https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-for-kafka-ecosystem-overview if you would like to learn more.
Apache Spark was developed in 2009 by researchers at the University of California, Berkeley. Their goal was to build a solution that overcame the inefficiencies of the Apache Hadoop MapReduce framework for big data processing activities. While based on the MapReduce framework for distributing processing activities across several compute servers, Apache Spark enhances this framework by performing several operations in-memory. Spark also extends MapReduce by allowing users to interactively query data on the fly and create stream processing workflows.
The Spark architecture is very similar to the distributed pattern used by Hadoop. At a high level, Spark applications can be broken down into the following four components:
Figure 5.4 illustrates how the components of a Spark application fit into the architecture of a three node (one driver and two workers) Spark cluster.
The Spark Core API enables developers to build Spark applications with several popular development languages, including Java, Scala, Python, R, and SQL. These languages have Spark-specific APIs, like PySpark for Python and SparkR for R, that are designed to parallelize code operations across Spark executors. The creators of Spark also developed several Spark-based libraries designed for a variety of big data scenarios, including MLlib for distributed machine learning applications, GraphX for graph processing, Spark Structured Streaming for stream processing, and Spark SQL + DataFrames for structuring and analyzing data.
As mentioned earlier, the Spark RDD API and the Spark DataFrame API are used to create and manipulate data objects. The RDD API is a low-level API that serves as the foundation for Spark programming. An RDD is an immutable distributed collection of data, partitioned across multiple worker nodes. The RDD API has several operations that allow developers to perform transformations and actions in a parallelized manner. While the Spark DataFrame API is used more often than the Spark RDD API, there are still some scenarios where RDDs can be more optimal than DataFrames. More information on RDDs can be found at https://databricks.com/glossary/what-is-rdd.
The DataFrame API is a high-level abstraction of the RDD API that allows developers to use a query language like SQL to manipulate data. Unlike RDDs, DataFrame objects are organized as named columns (like a relational database table), making them easy to manipulate. DataFrames are also optimized with Spark's native optimization engine, the catalyst optimizer, a feature that is not available for RDDs. More information on how to get started with the DataFrame API can be found at https://docs.microsoft.com/en-us/azure/databricks/getting-started/spark/dataframes.
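A defining trait of both APIs is that transformations (such as map and filter) are lazy and only run when an action (such as collect) is called. The toy class below is a plain-Python analogy of that behavior, not the Spark API itself; everything in it is hypothetical and runs on a single machine:

```python
class MiniRDD:
    """A toy, single-machine stand-in for an RDD: transformations are lazy,
    actions trigger evaluation. (Illustration only -- not the Spark API.)"""

    def __init__(self, data, ops=()):
        self._data = data
        self._ops = ops  # recorded transformations, not yet executed

    def map(self, fn):          # transformation: just records the step
        return MiniRDD(self._data, self._ops + (("map", fn),))

    def filter(self, pred):     # transformation: just records the step
        return MiniRDD(self._data, self._ops + (("filter", pred),))

    def collect(self):          # action: runs the whole recorded pipeline
        items = iter(self._data)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

# Nothing is computed until collect() is called.
rdd = MiniRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # → [0, 4, 16, 36, 64]
```

In real Spark, this laziness is what lets the catalyst optimizer (for DataFrames) rearrange and fuse the recorded steps before any data is touched.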
In 2013, the creators of Apache Spark founded Databricks, a data and artificial intelligence company that packages the Spark ecosystem into an easy-to-use cloud-native platform. The company brands the Databricks service as a “Unified Analytics Platform” that enables data engineers, data scientists, and data analysts to work together in the same environment. Within a single instantiation of a Databricks environment, known as a workspace, users can take advantage of the following features:
The Databricks platform can be used on Azure with the Azure Databricks service. Azure Databricks is fully integrated with other Azure services such as Azure AD and has connectors for several popular Azure data stores such as ADLS, Azure SQL Database, Azure Cosmos DB, and Azure Synapse Analytics dedicated SQL pools. Because Azure Databricks natively integrates with Azure AD, administrators can use their existing identity infrastructure to enable fine-grained user permissions for Databricks objects such as notebooks, clusters, jobs, and data.
The platform architecture for Azure Databricks can be broken down into two fundamental layers: the control plane and the data plane.
Spark clusters deployed through Azure Databricks use Azure VMs as cluster nodes. As we will discuss in the section “Creating a Spark Cluster with Azure Databricks” later in this chapter, users can choose from several different VM types to serve different use cases.
Azure Databricks allows users to create two types of Spark clusters: all-purpose and job. All-purpose clusters can be used to analyze data collaboratively with interactive notebooks, while job clusters are used to run automated jobs for dedicated workloads. Job clusters are brought online when a job is started and terminated when the job is finished.
Azure Databricks workspaces can be deployed with one of three price tiers: standard, premium, or trial. The primary difference between the standard and premium price tiers is that role-based access control for workspace objects and Azure AD credential passthrough are only available with the premium price tier. The trial price tier is a 14-day free trial of the Azure Databricks premium price tier.
Pricing for Spark clusters created in Azure Databricks consists of two primary components: the cost of the driver and worker node VMs and the processing cost. Processing cost is measured by the number of Databricks Units (DBUs) consumed during cluster runtime. A DBU is a unit of processing capability per hour, billed on per-second usage. You can estimate DBU usage by multiplying the total number of cluster nodes by the number of hours the cluster was running. For example, a cluster with 1 driver node and 3 worker nodes that ran for a total of 2 hours consumed 8 DBUs (that is, 4 nodes × 2 cluster runtime hours).
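The estimate above can be written out as a short calculation. This follows the chapter's simplified formula (in practice, DBU consumption also varies by VM instance type), using the premium All-Purpose rate from Table 5.1:

```python
def dbu_usage(num_nodes, runtime_hours):
    """Approximate DBU consumption as node-hours, per the chapter's
    simplified formula (real DBU rates also depend on the VM type)."""
    return num_nodes * runtime_hours

def processing_cost(num_nodes, runtime_hours, dbu_price):
    """Processing cost = DBUs consumed x price per DBU (VM cost is separate)."""
    return dbu_usage(num_nodes, runtime_hours) * dbu_price

# 1 driver + 3 workers running for 2 hours at the premium All-Purpose rate.
print(dbu_usage(4, 2))                          # → 8
print(processing_cost(4, 2, dbu_price=0.55))    # → 4.4
```

Remember that this is only the DBU component of the bill; the Azure VM charges for the four nodes are added on top.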
While the Azure VM cost will remain the same regardless of which price tier the Azure Databricks workspace was deployed with, the DBU price will vary. Table 5.1 lists the DBU price differences for the standard and premium price tiers.
TABLE 5.1 Standard and premium tier DBU prices
Workload | Standard Tier DBU Price | Premium Tier DBU Price
---|---|---
All-Purpose Compute | $0.40/DBU-hour | $0.55/DBU-hour
Jobs Compute | $0.15/DBU-hour | $0.30/DBU-hour
Jobs Light Compute | $0.07/DBU-hour | $0.22/DBU-hour
You can create an Azure Databricks workspace through any of the Azure deployment methods. The easiest way to get started is by creating a workspace through the Azure Portal with the following steps:
Navigate to portal.azure.com and search for Azure Databricks in the search bar at the top of the page. Click Azure Databricks to go to the Azure Databricks page in the Azure Portal.

Once the Azure Databricks workspace is deployed, go back to the Azure Databricks page, and click on the newly created workspace. Click on the Launch Workspace button in the middle of the overview page to navigate to the workspace UI and start working within the Databricks ecosystem. Figure 5.7 is an example of what this button looks like.
A new browser window will open after you click the Launch Workspace button, prompting you to sign in with your Azure AD credentials. Once you are signed in, you will be brought to the Azure Databricks web application where you can begin working with Databricks. The next section describes the key components of the web application.
The home page for an Azure Databricks workspace serves as a location for users to start working with Databricks. Figure 5.8 is an example of the Azure Databricks web application home page.
As you can see in Figure 5.8, there are common task options such as creating a new notebook and importing data. There are also quick navigation links to recently worked on notebooks, Spark documentation, and helpful blog posts.
On the left side of the page is a toolbar with several buttons. The number of buttons in the toolbar varies based on which persona is chosen. Azure Databricks personas include Data Science & Engineering, Machine Learning, and SQL. You can change the persona by clicking the icon below the Databricks logo in the toolbar. Figure 5.9 illustrates this icon and the different options that can be selected from it.
For the purposes of this book and the DP-900 exam, we will only cover the Data Science & Engineering persona. Of the 13 buttons under the Data Science & Engineering persona icon, the first 8 are the most relevant to building solutions in Azure Databricks, including the following:
Spark clusters can be configured and deployed by clicking on the Create Cluster button on the Compute page. Clicking this button will take you to the Create Cluster page, where you will be required to define the following settings (see Figure 5.13):
The first step to begin working with data is to create a new notebook. You can do this by clicking the Create button on the left-side toolbar and clicking Notebook. This will open a pop-up window that will prompt you to enter a name for the notebook, choose a primary language (Python, Scala, SQL, or R), and select a cluster to attach the notebook to. Once these options are set, click the Create button to create the notebook. You will be taken to the notebook once it is created. Figure 5.14 illustrates how to create a new Python notebook from this window.
The first cell in a notebook is typically used to import any libraries that will be needed to manipulate data or to establish a connection with an external data source. This section will focus on connecting to Azure Storage, more specifically ADLS. There are three ways to establish a connection to ADLS with Azure Databricks:

Creating a mount point in DBFS with Azure AD credential passthrough
Accessing the data directly with Azure AD credential passthrough
Authenticating with a service principal

Creating a service principal is out of scope for the DP-900 exam and will not be covered in this book. Refer to the following article to learn how to create a service principal that can be used to establish a connection with ADLS: https://docs.microsoft.com/en-us/azure/active-directory/develop/howto-create-service-principal-portal. For now, we will cover how to establish a connection by creating a mount point in DBFS with Azure AD credential passthrough.
To create a mount point in DBFS for an ADLS account, use the dbutils.fs.mount command in the first notebook cell. This command uses three parameters to define a mount point:

source—This parameter takes the ADLS URI as an argument. If required, the URI can point to a specific subdirectory in ADLS.
mount_point—This parameter sets the location (in DBFS) and name of the mount point.
extra_configs—This parameter accepts the authorization information required to access the external storage account. You can set a variable to the OAuth and Spark configuration settings for Azure AD credential passthrough and pass it in the extra_configs parameter to make the dbutils.fs.mount command reusable and more readable.

Once the mount point has been created, you can run the dbutils.fs.ls command with the mount point name as an argument to verify that you can view the subdirectories in the dp900-adls-container container. See Figure 5.15 for an illustration of both the dbutils.fs.mount and dbutils.fs.ls commands.
Users attempting to read or write data via the mount point will have their credentials evaluated. As an alternative to creating a mount point, users can access data directly from an ADLS account with Azure AD credential passthrough by passing the ADLS URI to a spark.read
command. For example, the following PySpark code assumes that the cluster running the command has Azure AD credential passthrough enabled and the user running the command has the appropriate permissions to the products subdirectory of the dp900-adls-container container:
readCsvData = spark.read.csv("abfss://[email protected]/products/products/*.csv")
While Azure AD credential passthrough is the most seamless method for accessing an ADLS account, there are several scenarios where you will need to use one of the other two access methods. For example, batch processing jobs that are orchestrated via ADF or an Azure Databricks job will need to establish a connection to the ADLS path with a service principal to guarantee a consistent connection. Refer to the following article to learn more about how to use the different access methods to establish a connection with ADLS: https://cloudblogs.microsoft.com/industry-blog/en-gb/technetuk/2020/07/01/securing-access-to-azure-data-lake-gen-2-from-azure-databricks.
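As a sketch of the service principal option, the fs.azure.* keys below are the standard ABFS OAuth configuration settings; the tenant ID, client ID, and secret shown are placeholders, and in practice the secret should be read from a secret scope rather than hard-coded.

```python
def service_principal_configs(tenant_id: str, client_id: str, client_secret: str) -> dict:
    """Spark settings that let the ABFS driver authenticate as a service principal."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }

# In a notebook, these settings would be passed to dbutils.fs.mount as the
# extra_configs argument, giving scheduled jobs a connection that does not
# depend on the identity of an interactive user.
configs = service_principal_configs("<tenant-id>", "<client-id>", "<client-secret>")
print(configs["fs.azure.account.auth.type"])
```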
Azure Data Factory (ADF) is a managed cloud service that can be used to build complex ETL, ELT, and data integration projects. With ADF, data engineers can create automated workflows (known as pipelines) that orchestrate data movement and data transformation activities. The following list includes several strengths that make ADF an integral part of any data-driven solution built in Azure:
A single Azure subscription can have one or more data factories (also known as ADF instances). This allows users to isolate different projects as well as to support different stages of a solution's development life cycle, such as development, test, quality assurance, and production.
ADF instances are composed of a set of core components: pipelines, activities, datasets, linked services, and integration runtimes. Refer to the following documentation pages to learn more:

Data stores and formats supported by the copy activity: https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-overview#supported-data-stores-and-formats
External transformation activities: https://docs.microsoft.com/en-us/azure/data-factory/transform-data#external-transformations
Creating a self-hosted integration runtime: https://docs.microsoft.com/en-us/azure/data-factory/create-self-hosted-integration-runtime?tabs=data-factory
The Azure-SSIS integration runtime: https://docs.microsoft.com/en-us/azure/data-factory/concepts-integration-runtime#azure-ssis-integration-runtime
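One way to see how these components relate is the JSON definition that ADF stores for each resource. The following is a hypothetical, heavily trimmed sketch (all names are invented and many required properties are omitted): the pipeline contains a copy activity, the activity references a source and a sink dataset, and each dataset references a linked service that holds the connection information.

```python
import json

# Trimmed sketch of ADF resource definitions (illustrative names only).
pipeline = {
    "name": "CopyProductCategory",
    "properties": {
        "activities": [
            {
                "name": "Copy SQL to ADLS",
                "type": "Copy",
                # The activity points at datasets by reference name.
                "inputs": [{"referenceName": "AzureSqlProductCategory",
                            "type": "DatasetReference"}],
                "outputs": [{"referenceName": "AdlsProductsCsv",
                             "type": "DatasetReference"}],
            }
        ]
    },
}

# Each dataset, in turn, points at the linked service that knows how to
# connect to the underlying data store.
datasets = {
    "AzureSqlProductCategory": {"linkedServiceName": "AzureSqlDbLinkedService"},
    "AdlsProductsCsv": {"linkedServiceName": "AdlsLinkedService"},
}

copy_activity = pipeline["properties"]["activities"][0]
source_name = copy_activity["inputs"][0]["referenceName"]
print(json.dumps(datasets[source_name]))
```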
Now that we have established what the core components of ADF are, let's dive into how to create an ADF instance through the Azure Portal and how to navigate the Azure Data Factory Studio UI.
The following steps describe how to create a new Azure Data Factory instance through the Azure Portal:
portal.azure.com
and search for Data factories in the search bar at the top of the page. Click Data Factories to go to the Data factories page in the Azure Portal.

Once the ADF instance is deployed, go back to the Data factories page and click on the newly created data factory. Click on the Open Azure Data Factory Studio button in the middle of the overview page to navigate to the Azure Data Factory Studio and start working within the ADF ecosystem. Figure 5.18 is an example of the overview page with the Open Azure Data Factory Studio button highlighted.
Clicking the Open Azure Data Factory Studio button will open a new browser window, using your Azure AD credentials to log into the Azure Data Factory Studio. Figure 5.19 highlights the main features of the Azure Data Factory Studio home page.
The Azure Data Factory Studio is the central tool for authoring ADF resources. There are several buttons on the home page that enable users to start building new workflows very quickly:
On the left side of the page there is a toolbar with four buttons, including a Home button that will navigate users back to the Azure Data Factory Studio home page. The following list describes how you can use the other buttons in the toolbar to build and manage ADF resources:
The following section, “Building an ADF Pipeline with a Copy Data Activity,” will detail how to create the activity, datasets, and linked services that are associated with the pipeline in Figure 5.20 (shown earlier). More specifically, it will demonstrate how to use the copy activity to copy data from an Azure SQL Database to an ADLS account. The source database is restored from the publicly available AdventureWorksLT2019 database backup. If you would like to build this demo on your own, you can find the database backup at https://docs.microsoft.com/en-us/sql/samples/adventureworks-install-configure?view=sql-server-ver15&tabs=ssms#download-backup-files
.
The first step in creating an ADF pipeline through the Azure Data Factory Studio is to navigate to the Author page by clicking either the Author button on the left-side toolbar or the Orchestrate button on the home page. The left pane on the Author page contains a tree view named Factory Resources. From here, you can create or navigate through existing pipelines, datasets, mapping data flows, or Power Query activities by clicking the + button or the ellipsis (…) next to each menu item. Figure 5.23 illustrates how to create a blank pipeline by clicking the + button.
After you click Pipeline, a blank pipeline canvas will open with a new toolbar on the left side of the canvas that contains every activity that can be added to the pipeline. Any of these activities can be dragged from the Activities toolbar and dropped onto the central canvas to build out the pipeline. At the top of the canvas there are buttons to validate the pipeline for any errors, debug the pipeline, and add a trigger to the pipeline. On the right side of the canvas is the Properties tab where you can add a friendly name and a description for the pipeline. At the bottom of the canvas there are options to create new parameters and variables that can make pipeline runs more dynamic. Figure 5.24 illustrates each of these components with a friendly name added in the Properties tab.
To add a copy activity, expand the Move & Transform option in the Activities toolbar and drag the Copy Data activity to the canvas. The new activity will include six configuration tabs located at the bottom of the canvas. The first tab (the General tab) allows you to provide a friendly name and description for the activity as well as time-out and retry settings. Figure 5.25 is an example of this view with a friendly name that describes the activity's functionality.
Out of the six copy activity configuration tabs, only two of them require user input: the Source tab and the Sink tab. These two tabs will define the source dataset and the destination, or sink, dataset that the data is being copied to. The Source tab allows you to choose an existing dataset or create a new one. If you click the + New button, a new page will open where you can choose from one of the available data source connectors (see Figure 5.26).
In the search bar, type Azure SQL Database and choose the Azure SQL Database connector. Click Continue at the bottom of the page to open the Set Properties page for the dataset. This page allows you to set a friendly name for the dataset and choose/create the linked service that will be used to connect to the data source. Expand the Linked Service drop-down menu and click + New to create a linked service for the database. This will open a new page where you can set a friendly name for the linked service, the integration runtime, and the connection information for the database. Figure 5.27 is a completed example of the New Linked Service page for an Azure SQL Database.
Once the settings for the linked service are properly set, click the Create button to create the linked service and to be redirected to the Set Properties page for the dataset. With the linked service defined, the next step will be to either choose the table or view that the dataset will represent or leave the Table Name setting blank. For the purposes of this example, we will choose the SalesLT.ProductCategory table. Figure 5.28 is a completed example of the Set Properties page.
After you click OK at the bottom of the Set Properties page, the dataset will be created and added as the source dataset in the copy activity. Because the source dataset is an Azure SQL Database, the Source tab includes several optional settings that are tailored to relational databases. For example, if you did not choose a table or view in the dataset tab, you can use a query or a stored procedure to define the dataset. You can also parameterize this setting so that the dataset varies based on the value passed to the parameter. Figure 5.29 illustrates the list of options that are available in the Source tab for an Azure SQL Database.
Now that the source dataset is set, the next step is to configure a sink dataset. The Sink tab provides the same options as the Source tab, along with the ability to create a new dataset. Because this example uses an ADLS account as the sink data store, choose the Azure Data Lake Storage Gen2 connector on the New Dataset page. After clicking Continue, you will be prompted to set a file format for the dataset. For this example, choose the DelimitedText (CSV) option and click Continue.
As with the Azure SQL Database dataset, the Set Properties page allows you to set a friendly name for the dataset and choose/create the linked service that will be used to connect to the data source. The new linked service page for ADLS is also similar to the new linked service page for Azure SQL Database as it allows you to set a friendly name for the linked service, the integration runtime, and the connection information for the storage account (see Figure 5.30). Click the Create button to create the linked service and to be redirected to the Set Properties page for the dataset.
With the ADLS linked service defined, the Set Properties page allows you to either set a file path for the dataset or leave it blank. This example uses the dp900-adls-container/products/
file path for the sink dataset (see Figure 5.31).
After you click OK at the bottom of the Set Properties page, the dataset will be created and added as the sink dataset in the copy activity. Like the Azure SQL Database dataset, there are several additional settings in the Sink tab that will be relevant to the chosen dataset type. The Sink tab (and the Source tab) also allows you to open the dataset with the Open button (next to the Sink dataset setting). This button opens a new page that allows you to make several changes that are specific to the dataset type. Because the sink dataset is CSV data stored in ADLS, the list of settings that can be edited include how the data is compressed, the column and row delimiters for the data, how the data is encoded, and whether the first row should be treated as a header. You can also use this page to define a filename for the dataset. Figure 5.32 illustrates this page with all of the available dataset settings.
Once the dataset settings are properly configured, navigate back to the pipeline by clicking on the pipeline tab at the top of the page. Click on the Mapping tab to map the source dataset columns to the sink columns. This tab also allows you to set datatype settings, such as the date/time format, and whether to truncate data that is longer than what the column definition allows. Figure 5.33 is an example of the Mapping tab.
Navigate to the Settings tab after mapping the source and sink columns. This tab allows you to set how many data integration units (DIUs) you want allocated to the pipeline, or to let the pipeline automatically apply the optimal number of DIUs. You can also set the degree of parallelism that the copy activity will use if the volume of the source dataset requires a scale-out solution.
The last tab in the copy activity is the User Properties tab. This tab allows you to tag and monitor specific ADF resources, such as datasets.
Click the Publish All button at the top of the page to save the pipeline and the datasets. Once the pipeline is published, you can run it or schedule it to run later: click the Add Trigger button at the top of the canvas and choose either Trigger Now to begin a pipeline run immediately or New/Edit to create a scheduled or event-based trigger. Figure 5.34 illustrates where the Publish All and Add Trigger buttons are located.
While the previously described set of services can be used in a variety of data processing tasks, it is important to note that there are other Azure data services that are used for niche data processing use cases. For example, Azure Stream Analytics and Azure Data Explorer are almost exclusively used in stream processing workflows. These services are out of scope for the DP-900 exam and will only be covered briefly in the following sections.
Azure Stream Analytics is a PaaS stream processing engine that can be used to process high volumes of streaming data from multiple sources. Users can create Azure Stream Analytics jobs through the Azure Portal, Azure CLI, Azure PowerShell, or an Infrastructure as Code template like ARM. Jobs consist of three core components: one or more inputs, a query, and one or more outputs.
Inputs can include real-time message ingestion services like Azure Event Hubs and Azure IoT Hub as well as persistent data stores like Azure Blob Storage and Azure SQL Database. This enables developers to combine streaming data with historical data or with reference data for lookup operations.
Developers can use the Stream Analytics query language to filter, sort, aggregate, or join data from different sources. This language is a subset of standard T-SQL with additional functionality to apply computations over specific time windows. The language can also be extended with JavaScript and C# user-defined functions.
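As a rough illustration of what a windowed computation produces (this is plain Python, not Stream Analytics syntax), the following sketch buckets events into fixed, non-overlapping tumbling windows and counts the events in each; in the Stream Analytics query language, the equivalent grouping is expressed with a windowing function such as TumblingWindow.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Count events per fixed, non-overlapping window of window_seconds.

    events is an iterable of (timestamp_seconds, payload) tuples.
    """
    counts = defaultdict(int)
    for ts, _payload in events:
        # Every event belongs to exactly one window, identified by its start time.
        window_start = (ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)

# Four sensor readings processed with a 10-second tumbling window:
events = [(0, "a"), (3, "b"), (9, "c"), (12, "d")]
print(tumbling_window_counts(events, 10))  # {0: 3, 10: 1}
```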
Jobs deliver processed information to one or more outputs. Azure Stream Analytics allows you to customize what happens based on the results of the data that was processed. Here are some common outputs:
If you would like to learn more about Azure Stream Analytics, visit https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-introduction
.
Azure Data Explorer is a near real-time processing engine that analyzes structured, semi-structured, and unstructured data across time windows. It uses the Kusto Query Language (KQL) to analyze data and is capable of ingesting and analyzing petabytes of data. Typical use cases for Azure Data Explorer include interactively analyzing logs and conducting time series analytics on metric data from IoT sensors.
If you would like to learn more about Azure Data Explorer, visit https://docs.microsoft.com/en-us/azure/data-explorer/data-explorer-overview
.
Azure Synapse Analytics is an enterprise analytics system that integrates multiple services that serve analytical workloads in a single environment. Through the Azure Synapse workspace, users can leverage services such as serverless and dedicated SQL pools, Apache Spark pools, Data Explorer pools, Synapse pipelines, and Azure Synapse Link for Azure Cosmos DB to build a modern data warehouse solution. Refer to the following documentation pages to learn more:

Apache Spark pool configurations: https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-pool-configurations
Azure Synapse Link for Azure Cosmos DB: https://docs.microsoft.com/en-us/azure/cosmos-db/synapse-link

As you can see, Azure Synapse Analytics allows users to leverage several different technologies to build modern data warehouse solutions in the same environment. The following sections describe how to get started with Azure Synapse Analytics, including how to deploy a workspace and how to navigate Synapse Studio. Afterward, we will examine the two categories of SQL pools, dedicated and serverless, and when to use them.
Like any service in Azure, an Azure Synapse workspace can be deployed through the Azure Portal, Azure PowerShell, or Azure CLI or via an Infrastructure as Code template. The following steps describe how to deploy a new Azure Synapse workspace through the Azure Portal:
portal.azure.com
and search for Azure Synapse Analytics in the search bar at the top of the page. Click Azure Synapse Analytics to go to the Azure Synapse Analytics page in the Azure Portal.

The Create Synapse Workspace page includes five tabs to tailor the workspace configuration. Let's start by exploring the options in the Basics tab. Just as with other services, this tab requires you to choose an Azure subscription, a resource group, a name, and a region for the workspace. You will also need to associate an ADLS account with the workspace. Azure Synapse will use this ADLS account as the primary storage account, along with a container in it to store workspace data. A completed example of this tab can be seen in Figure 5.35.
The Security tab requires you to create an administrator account for the serverless and dedicated SQL pools managed by the workspace. You can also use this tab to enable network access to the associated ADLS account with the workspace managed identity and to enable double encryption with a key that you provide. Figure 5.36 is an example of the Security tab.
The Networking tab allows you to choose whether to set up a dedicated, managed VNet for Azure Synapse Analytics. You can also enable access for all IP addresses through this tab or choose to grant access to specific IP addresses after the workspace is deployed.
The Tags tab allows you to place tags on the resources deployed with Azure Synapse Analytics. Tags are used to categorize resources for cost management purposes.
Finally, the Review + Create tab allows you to review the configuration choices made during the design process. If you are satisfied with the choices made for the Azure Synapse Analytics workspace, click the Create button to begin deploying the workspace.
Once the Azure Synapse Analytics workspace is deployed, go back to the Azure Synapse Analytics page and click on the newly created workspace. From the Azure Portal, administrators can set Azure AD authentication, create new analytics pools (dedicated SQL, Apache Spark, and Data Explorer pools), configure network settings, and monitor performance. Click on the Open Synapse Studio button in the middle of the overview page to navigate to the Synapse Studio UI. Figure 5.37 is an example of the workspace overview page with the Open Synapse Studio button highlighted.
A new browser window will open after you click the Open Synapse Studio button, using your Azure AD credentials to log in to the workspace. Figure 5.38 is an example of the Synapse Studio home page.
Synapse Studio is the central tool for administering and managing all aspects of the Azure Synapse Analytics ecosystem. Its home page includes several buttons that enable users to start building new Azure Synapse Analytics workflows very quickly, including an Ingest button to begin moving data, an Explore and Analyze button that navigates users to Azure Synapse Analytics tutorials, and a Visualize button to connect to a Power BI workspace. On the left side of the page there is a toolbar with six buttons, including a Home button that will navigate users back to the Synapse Studio home page. The following list describes how you can use the other buttons in the toolbar to build a modern data warehouse:
Azure Synapse Analytics dedicated SQL pools (formerly Azure SQL Data Warehouse) are relational data stores that use a massively parallel processing (MPP) architecture to optimally manage large datasets. They do this by separating compute and storage, using a SQL engine to perform computations and Azure Storage to store the data. Dedicated SQL pools use a relational schema, typically a star schema, to serve data to users as tables or views for business intelligence applications.
In a modern data warehouse architecture, a dedicated SQL pool is at the end of an ETL/ELT process, serving as the single source of truth for data analysts and BI applications. Tables using columnstore compression can store an unlimited amount of data, making dedicated SQL pools the ideal destination data store for big data workloads that process several terabytes or even petabytes worth of data. Additional processes can also extract subsets of data that represent specific business segments from a dedicated SQL pool and load them into Azure Analysis Services or Power BI OLAP models for self-service BI scenarios.
As mentioned in Chapter 2, dedicated SQL pools shard data into 60 distributions across one or more compute nodes depending on the dedicated SQL pool's service level objective (SLO). Tables can be defined with one of three distribution patterns to optimize how data is sharded throughout the distributions. The following list is a quick reminder of the three distribution patterns and when to use each one:

Hash-distributed—Rows are assigned to a distribution by hashing the values in a designated distribution column. This pattern is best for large fact tables that are frequently joined or aggregated on the distribution column.
Round-robin—Rows are spread evenly across the distributions without regard to their content. This pattern is best for staging tables or tables without an obvious distribution column.
Replicated—A full copy of the table is cached on every compute node. This pattern is best for small dimension tables.
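The intent behind the distribution patterns can be sketched in plain Python. This is conceptual only (the engine's actual hash function and row assignment are internal), but it shows why hash distribution keeps rows with the same key in the same distribution while round-robin simply deals rows out evenly:

```python
NUM_DISTRIBUTIONS = 60  # dedicated SQL pools always shard data into 60 distributions

def hash_distribute(rows, key):
    """Hash distribution: rows with the same key value always land in the
    same distribution, keeping joins/aggregations on that key local."""
    shards = {i: [] for i in range(NUM_DISTRIBUTIONS)}
    for row in rows:
        shards[hash(row[key]) % NUM_DISTRIBUTIONS].append(row)
    return shards

def round_robin_distribute(rows):
    """Round-robin: rows are dealt out evenly with no regard to content --
    fast to load, but joins may require data movement at query time."""
    shards = {i: [] for i in range(NUM_DISTRIBUTIONS)}
    for n, row in enumerate(rows):
        shards[n % NUM_DISTRIBUTIONS].append(row)
    return shards

# (Replicated tables simply copy the whole table to every compute node,
# which suits small dimension tables.)
rows = [{"ProductCategoryID": i % 5, "qty": i} for i in range(120)]
rr = round_robin_distribute(rows)
print(len(rr[0]), len(rr[59]))  # 2 2 -- perfectly even
```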
Along with classic relational database features such as partitioning, row-store indexes, and statistics, dedicated SQL pools include several features that optimize the performance of analytical queries that aggregate large numbers of rows. These features are especially useful for querying historical data from fact tables, which can quickly become very large. Some of the most important features are as follows:
One such feature is result set caching, which stores query results so that repeated queries can be served from the cache instead of being recomputed. The following command enables result set caching for a dedicated SQL pool:
ALTER DATABASE dp900dedicatedSQLpool
SET RESULT_SET_CACHING ON;
Unlike OLTP database engines like Azure SQL Database, dedicated SQL pools are not suitable for transactional workloads, which are characterized by frequent, small write operations and queries that interact with only a few rows of data (such as a query with a WHERE clause that performs a seek operation to a specific set of rows). Instead, it is best used for bulk write operations and queries that perform aggregations over large amounts of data.
In addition to Synapse Studio, dedicated SQL pools support several management tasks and tools that are commonly used by other Microsoft SQL offerings (such as SQL Server and Azure SQL Database). SQL developers can connect to a dedicated SQL pool with Azure Data Studio or SQL Server Management Studio (SSMS). Database administrators can also leverage security postures that are common to Azure SQL, such as the following:
More information about the different security components available for Azure SQL can be found in Chapter 2.
In addition to the methods described previously in this book for deploying Azure resources, users can deploy a new dedicated SQL pool through Synapse Studio with the following steps:
As with any PaaS database in Azure, the SLO of a dedicated SQL pool can be easily scaled up or down to meet different workload needs. This can be done through the Azure Portal, Azure PowerShell, T-SQL, or the Create or Update Database REST API. The following is a sample T-SQL script that updates a dedicated SQL pool's SLO to DW1000c:
ALTER DATABASE dp900dedicatedSQLpool
MODIFY (SERVICE_OBJECTIVE = 'DW1000c');
Because compute and storage are separated, dedicated SQL pools can be paused when they are not used to save on compute costs. Users can pause and restart dedicated SQL pools through the Azure Portal, Synapse Studio, Azure PowerShell, and the dedicated SQL pool REST APIs. Pause and restart for dedicated SQL pools can also be automated with Azure Automation runbooks, Synapse pipelines, or ADF. Figure 5.46 illustrates where to find the pause button for a dedicated SQL pool in Synapse Studio. Once the pool is paused, the pause button will be replaced by a resume button.
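As a sketch of what that REST-based automation relies on, the following helper builds the management URL that pauses a dedicated SQL pool. The provider path follows the documented Microsoft.Synapse resource pattern, but treat the exact path and api-version as assumptions to confirm against the REST API reference:

```python
def pause_sql_pool_url(subscription_id, resource_group, workspace, sql_pool,
                       api_version="2021-06-01"):
    """Build the Azure management REST URL that pauses a dedicated SQL pool.

    The api-version shown is an assumption; a POST to this URL (with a
    bearer token) pauses the pool, and the matching /resume path restarts it.
    """
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Synapse/workspaces/{workspace}"
        f"/sqlPools/{sql_pool}/pause?api-version={api_version}"
    )

print(pause_sql_pool_url("<sub-id>", "dp900-rg", "dp900synapse",
                         "dp900dedicatedSQLpool"))
```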
Traditional relational databases that use a symmetric multiprocessing (SMP) design, such as SQL Server or Azure SQL Database, use an ETL process for data loading. Distributed platforms that use an MPP design, like Azure Synapse Analytics dedicated SQL pools, can process and store large amounts of data at scale, allowing them to leverage ELT patterns to load and transform data within the same service. This allows developers to perform data processing activities without having to rely on additional services for data transformation prior to loading.
Dedicated SQL pools support several data loading methods, including popular SQL Server methods such as the bulk copy program (bcp) utility and the SqlBulkCopy API. However, the fastest and most scalable way to load data is through PolyBase or the COPY statement. In fact, when loading data into a dedicated SQL pool via ADF, it is recommended to set the Copy Method setting in the Sink tab to use either the Copy command or PolyBase. With PolyBase and the COPY statement, developers can access data stored in Azure Blob Storage or ADLS via T-SQL commands.
Generally, both of these data loading options are best when used to load data into staging tables. Staging tables are usually defined as heap tables, or tables without any indexes. The lack of an index means that data will not be reordered as it is being written, allowing the data to be written very quickly. Staging tables can be predefined before the external table is created with a normal CREATE TABLE
command or created after the external table is established with a CREATE TABLE AS SELECT
(CTAS) statement. More information about the CTAS statement can be found in the section “PolyBase” later in this chapter.
Once data is loaded into the staging tables, developers can use different techniques to update production tables with the staging data. Some techniques include using the MERGE statement to insert, update, or delete data in the production table based on differences in the staging table or replacing a section of the production table with the updated staging table through a process called partition switching. More information about partition switching can be found at https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-partition#partition-switching
. New production tables that are based on the staging table, but use a different distribution method and index design, can be created with a CTAS operation.
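The MERGE-based technique can be sketched in plain Python (illustrative logic only, not T-SQL): staging rows update matching production rows and insert new ones, keyed on the primary key.

```python
def merge(production: dict, staging: dict) -> dict:
    """Upsert: staging rows update matching production rows (same key) and
    insert new ones; untouched production rows are kept.

    Both inputs map primary key -> row.
    """
    merged = dict(production)   # keep the existing production rows
    merged.update(staging)      # matched keys are updated, new keys inserted
    return merged

production = {1: "Bikes", 2: "Components"}
staging = {2: "Components & Parts", 3: "Clothing"}  # one update, one insert
print(merge(production, staging))
# {1: 'Bikes', 2: 'Components & Parts', 3: 'Clothing'}
```

A real MERGE statement can also delete production rows that are absent from staging; that branch is omitted here for brevity.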
While the COPY statement offers the best performance and most flexibility for loading data, it is still important to understand how to use PolyBase to load data into a dedicated SQL pool. The following sections describe how to use PolyBase and the COPY statement to load data from Azure Storage into a dedicated SQL pool.
PolyBase is a data virtualization technology that enables dedicated SQL pools to query Azure Storage data while allowing the data to stay in its original location and format. PolyBase uses external tables to shape and access Azure Storage data. External tables overlay a schema on top of the data so that it can be easily queried with T-SQL commands.
Defining external tables involves specifying the data source, the format of files in Azure Storage, and the table definition. These can be defined with the following T-SQL commands:

CREATE EXTERNAL DATA SOURCE
CREATE EXTERNAL FILE FORMAT
CREATE EXTERNAL TABLE
External data sources are used to establish a connection with an Azure storage account, such as one that supports Azure Blob Storage or ADLS. The CREATE EXTERNAL DATA SOURCE command that is used to create an external data source requires the following arguments:

LOCATION—This provides the connectivity protocol and path to the data source, such as abfss://[email protected]/.
CREDENTIAL—This specifies the database-scoped credential used to authenticate to the external data source. This argument is only required if the storage object does not allow anonymous access. Storage account access keys, service principals, and managed identities are the only supported authentication mechanisms for Azure Storage. Developers can create a database-scoped credential with the CREATE DATABASE SCOPED CREDENTIAL T-SQL command.
TYPE = HADOOP—This specifies the external data source type that is being configured. It is required when the external data source is ADLS or Azure Blob Storage.

The following example creates an ADLS external data source that uses an access key to authenticate to the storage account:
/* The following creates a database master key that is used to encrypt
the credential secret created in the CREATE DATABASE SCOPED CREDENTIAL step. */
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<password>';
/* Use the following command to create the database-scoped
credential with the storage account key. */
CREATE DATABASE SCOPED CREDENTIAL dp900StorageCredential
WITH
IDENTITY = 'dp900adls001', -- This is the storage account name.
SECRET = '<storage_account_access_key>';
CREATE EXTERNAL DATA SOURCE dp900_ADLS_Ext_Source
WITH
(
LOCATION = 'abfss://[email protected]/',
CREDENTIAL = dp900StorageCredential,
TYPE = HADOOP
);
The next step in using PolyBase is to define the file format of the data stored in the external data source. External file formats created for Azure Synapse Analytics SQL pools (both dedicated and serverless SQL pools) support delimited text (such as CSV or TSV) and Parquet file formats. The CREATE EXTERNAL FILE FORMAT command accepts a required FORMAT_TYPE argument that defines the file format and several optional arguments, such as how the data is compressed. Several of these optional arguments apply only to delimited text files, including the following:

FIELD_TERMINATOR—This specifies which character in a delimited text file marks the end of each field (column). The default field terminator is the pipe character (|).
STRING_DELIMITER—This specifies the delimiter that wraps string data in a delimited text file. The default string delimiter is an empty string ("").
FIRST_ROW—This specifies the row number that is read first in all files.
DATE_FORMAT—This specifies a specific format for date and time data in a delimited text file.

The following example creates an external file format for CSV files:
CREATE EXTERNAL FILE FORMAT dp900_CSV_File_Format
WITH
(
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = ',',
STRING_DELIMITER = '"'
)
);
Now that the external data source and external file format are defined, we can finally create the external table. The CREATE EXTERNAL TABLE command allows developers to define column names and data types for external data, and it accepts arguments for the external data source and the external file format. It also enables developers to specify the folder or the file path and filename for the data in the external data source with the optional LOCATION argument.
The CREATE EXTERNAL TABLE
command also allows developers to specify reject parameters that will determine how PolyBase handles dirty records. This information is stored as metadata when the external table is created and is used when a SELECT statement is issued against the table to determine the number or percentage of rows that can be rejected before the query fails. The query will return partial results until the reject threshold is exceeded, after which the query will fail with the appropriate error message. The following arguments can be used to set the reject threshold:
REJECT_TYPE—This clarifies whether the REJECT_VALUE option is specified as a literal value or a percentage. When VALUE is chosen, a query issued against the external table will fail when the number of rejected rows exceeds the defined value. When PERCENTAGE is chosen, a query issued against the external table will fail when the percentage of rejected rows exceeds the defined threshold.

REJECT_VALUE—This specifies the value or the percentage of rows that can be rejected before the query fails. When VALUE is chosen, the argument must be an integer between 0 and 2,147,483,647. When PERCENTAGE is chosen, the argument must be a decimal value between 0 and 100.

REJECT_SAMPLE_VALUE—This determines the number of rows to attempt to retrieve before PolyBase recalculates the percentage of rejected rows. It is only available when PERCENTAGE is chosen for the REJECT_TYPE and must be an integer between 0 and 2,147,483,647.

The following example creates an external table for the SalesLT.ProductCategory CSV file that was created in ADLS by the ADF copy activity described previously in this chapter:
CREATE EXTERNAL TABLE [dbo].[ProductCategory_External]
(
ProductID INT,
ProductSubcategoryID INT,
ProductName VARCHAR(50)
)
WITH
(
LOCATION = '/products/productcategory.csv',
DATA_SOURCE = dp900_ADLS_Ext_Source,
FILE_FORMAT = dp900_CSV_File_Format,
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);
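For comparison, a reject threshold can also be expressed as a percentage. The following sketch is hypothetical (the table name is illustrative, and it assumes the same external data source and file format as the example above); it tells PolyBase to sample rows in batches of 1,000 and fail a query once more than 10 percent of sampled rows are rejected:

```sql
-- Hypothetical variant of the external table above using a
-- percentage-based reject threshold instead of a literal value.
CREATE EXTERNAL TABLE [dbo].[ProductCategory_External_Pct]
(
    ProductID INT,
    ProductSubcategoryID INT,
    ProductName VARCHAR(50)
)
WITH
(
    LOCATION = '/products/productcategory.csv',
    DATA_SOURCE = dp900_ADLS_Ext_Source,
    FILE_FORMAT = dp900_CSV_File_Format,
    REJECT_TYPE = PERCENTAGE,       -- interpret REJECT_VALUE as a percentage
    REJECT_VALUE = 10,              -- fail once more than 10% of rows are rejected
    REJECT_SAMPLE_VALUE = 1000      -- recalculate the percentage every 1,000 rows
);
```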
With the external table defined, developers can issue queries against the data without having to move the data from Azure Storage to the dedicated SQL pool. If they would like to create a copy of the data in the dedicated SQL pool, then they can do so with a CTAS statement. CTAS statements allow developers to create new tables based on the output of a SELECT statement. In a dedicated SQL pool, developers can define the distribution method and index design within the context of a CTAS statement. The following example uses a CTAS to create a dedicated SQL pool staging table based on the previously created external table:
CREATE TABLE [dbo].[ProductCategory_Staging]
WITH (DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM [dbo].[ProductCategory_External];
Once the data is stored in the staging table, data engineers can perform transformations with native T-SQL queries that leverage the built-in distributed query processing capabilities of the dedicated SQL pool. Transformed data can then be moved from the staging table to a production table through a variety of methods, such as a MERGE statement, partition switching, or with an INSERT INTO SELECT
statement. New production tables can also be created with a CTAS statement where the SELECT statement retrieves data from the staging table.
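As an illustrative sketch of that last step, the production table and column names below are hypothetical, and the hash distribution with a clustered columnstore index is one common dedicated SQL pool design choice, not a requirement:

```sql
-- Hypothetical production load: create an empty, hash-distributed
-- production table from the staging schema, then append rows to it.
CREATE TABLE [dbo].[ProductCategory]
WITH
(
    DISTRIBUTION = HASH(ProductID),
    CLUSTERED COLUMNSTORE INDEX
)
AS SELECT * FROM [dbo].[ProductCategory_Staging] WHERE 1 = 2; -- schema only, no rows

INSERT INTO [dbo].[ProductCategory]
SELECT ProductID, ProductSubcategoryID, ProductName
FROM [dbo].[ProductCategory_Staging];
```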
The COPY statement is a T-SQL construct that provides the most flexibility and best performance for parallel data ingestion into an Azure Synapse Analytics dedicated SQL pool, offering several data-loading enhancements over PolyBase.
The COPY command uses several arguments to determine how to ingest data:
FILE_TYPE—This specifies the format of the external data. Supported file formats include CSV, Parquet, and ORC.

CREDENTIAL—This specifies the identity mechanism used to access the Azure storage account.

MAXERRORS—This optional argument specifies the maximum number of reject rows allowed before the COPY statement is cancelled. If not specified, the default value for this argument is 0.

COMPRESSION—This optional argument specifies the data compression method for the data.

FIELDQUOTE—This argument applies to CSV files and specifies the character that is used as the quote character in the file. If not specified, the quote character (") is used as the default value for this argument.

FIELDTERMINATOR—This argument applies to CSV files and specifies the field terminator used in the CSV file. If not specified, the comma character (,) is used as the default value for this argument.

ROWTERMINATOR—This argument applies to CSV files and specifies the row terminator used in the CSV file. By default, the row terminator is the newline character (\n).

FIRSTROW—This argument applies to CSV files and specifies the row number that is read first in all files by the COPY statement.

DATEFORMAT—This argument applies to CSV files and specifies the date format of the date mapping using SQL Server date formats. Supported date formats include mdy, dmy, ymd, ydm, myd, and dym.

ENCODING—This argument applies to CSV files and specifies the data encoding standard for the files. The default for this argument is UTF8, but it can be changed to UTF16 depending on the encoding standard used by the files loaded by the COPY statement.

IDENTITY_INSERT—This argument is specific to values that map to an identity column in the target table. If the argument is set to OFF (the default), the values are verified but not imported. If the argument is set to ON, the values are imported into the identity column.

AUTO_CREATE_TABLE—This argument specifies whether the table can be created automatically by working alongside the automatic schema discovery feature.

A more extensive list of the arguments that can be used with the COPY statement can be found at https://docs.microsoft.com/en-us/sql/t-sql/statements/copy-into-transact-sql?view=azure-sqldw-latest&preserve-view=true#syntax.
The following example uses the COPY statement to load the SalesLT.ProductCategory CSV file from ADLS into the [dbo].[ProductCategory_Staging] dedicated SQL pool table. It assumes that the table has already been created and is empty.
COPY INTO [dbo].[ProductCategory_Staging]
FROM 'https://dp900adls001.dfs.core.windows.net/dp900-adls-container/products/productcategory.csv'
WITH (
    FILE_TYPE = 'CSV',
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<SAS_TOKEN>'),
    FIELDQUOTE = '"',
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '0x0A'
);
With this command, developers can quickly load data from ADLS into a dedicated SQL pool staging table and perform any computations required before moving the staging data into a production table.
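Because the COPY statement supports automatic schema discovery, the target table does not always need to exist in advance. The following sketch is illustrative only (the Parquet file path and table name are hypothetical) and assumes the AUTO_CREATE_TABLE argument is available for the file type being loaded:

```sql
-- Hypothetical Parquet load that lets COPY create the target table
-- automatically via schema discovery (AUTO_CREATE_TABLE).
COPY INTO [dbo].[ProductCategory_Auto]
FROM 'https://dp900adls001.dfs.core.windows.net/dp900-adls-container/products/productcategory.parquet'
WITH (
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<SAS_TOKEN>'),
    AUTO_CREATE_TABLE = 'ON'
);
```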
Azure Synapse Analytics serverless SQL pool is a serverless query service that enables users to analyze files in Azure Storage with T-SQL queries. Every workspace comes with a serverless SQL endpoint (named “Built-in”) that data analysts and developers can use to quickly begin querying data in a variety of different formats, including Parquet, CSV, and JSON. Additionally, serverless SQL pools can be used to query Azure Cosmos DB with Azure Synapse Link and Spark tables that are created with Azure Synapse Analytics Apache Spark pools.
Typical use cases for serverless SQL pools include ad hoc data exploration and the logical data warehouse.
Logical data warehouses that are built with serverless SQL pools use data virtualization techniques similar to those used with dedicated SQL pools, including external data sources to connect to storage accounts, external file formats that define the format of the data in Azure Storage, and external tables that define a schema for your external data sources. The primary difference is that external data sources are native to serverless SQL pools and do not require (or support) the TYPE = HADOOP argument. More information about using these constructs to create a logical data warehouse with a serverless SQL pool can be found at https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/tutorial-logical-data-warehouse.
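As a minimal sketch of that difference, a serverless external data source can be created with just a LOCATION. The data source name and URL below are hypothetical, reusing the storage account naming from earlier examples:

```sql
-- Hypothetical serverless SQL pool external data source. Note the
-- absence of the TYPE = HADOOP argument required by dedicated SQL pools.
CREATE EXTERNAL DATA SOURCE dp900_Serverless_Ext_Source
WITH
(
    LOCATION = 'https://dp900adls001.dfs.core.windows.net/dp900-adls-container'
);
```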
Just like dedicated SQL pools, serverless SQL pools support several management tasks and tools that are common to the Microsoft suite of SQL offerings. Developers can choose to run ad hoc queries against a serverless SQL pool endpoint from Synapse Studio or via common client tools like Azure Data Studio and SQL Server Management Studio (SSMS). Furthermore, database administrators can manage authentication and authorization with SQL authentication and Azure AD.
While serverless and dedicated SQL pools both leverage distributed processing architectures that are designed to manage large datasets, choosing between them depends on use case requirements and the acceptable cost-to-performance trade-off. Serverless SQL pools use a pay-per-query cost model, only charging users for the amount of data processed by each query. This cost model provides a cheap alternative to dedicated SQL pools for quickly analyzing data with ad hoc queries. However, because storage is not local to the serverless SQL pool and compute is automatically scaled, queries tend to run more slowly (on the order of seconds or minutes) than queries executed against a dedicated SQL pool. For this reason, dedicated SQL pools are a better option for workloads that require optimized and consistent performance.
Synapse Studio makes it easy to start analyzing data with the serverless SQL pool by creating a new SQL script. To do this, click on the Develop button on the left-side toolbar and click on the + icon. Select SQL script to open a new SQL script window. Within the script window, you can write SQL scripts that use the serverless SQL pool or one of the dedicated SQL pools associated with the workspace. The properties pane on the right side of the script window allows you to rename the script and add a description that explains the functionality of the script. You can also save scripts in Synapse Studio or to an associated Git repository by clicking Publish All at the top of the script window. Figure 5.47 illustrates the layout of the SQL script window.
The ribbon at the top of the SQL script window includes several options for running a script, viewing a query's execution plan (exclusive to dedicated SQL pools), connecting to a SQL pool, and setting the database context. To execute queries with the serverless SQL pool endpoint, make sure the “Built-in” SQL pool is chosen in the Connect To drop-down menu (see Figure 5.48).
Before going over how to run queries in Synapse Studio, let's briefly discuss the basic structure of a serverless SQL pool query. Serverless SQL pool queries that perform exploratory analysis rely heavily on the OPENROWSET function to read data from external storage. For example, the following query uses the OPENROWSET function to retrieve the first 100 rows of the publicly available New York City yellow taxicab dataset.
SELECT TOP 100 * FROM
OPENROWSET(
    BULK 'https://azureopendatastorage.blob.core.windows.net/nyctlc/yellow/puYear=*/puMonth=*/*.parquet',
    FORMAT = 'PARQUET'
) AS [nyc]
The BULK parameter specifies the location of the data, while FORMAT specifies the file format of the data being read. The URL used by the query also contains wildcards (*) to read all of the Parquet files in all of the year and month folders.
This query also uses the column metadata in the Parquet files to infer the column names and data types of the result set. Queries can also automatically infer the column names of data from CSV files if there is a header row. However, there are times when you will want to explicitly define a schema to have more control over the data. Explicitly defining a schema also allows you to specify which columns you want to read from the files. You can define a schema for your data by adding a WITH clause with the column names and data types immediately after the closing parenthesis of the OPENROWSET function. The following example uses the WITH clause to explicitly return three columns from the New York City yellow taxicab dataset.
SELECT TOP 100 * FROM
OPENROWSET(
    BULK 'https://azureopendatastorage.blob.core.windows.net/nyctlc/yellow/puYear=*/puMonth=*/*.parquet',
    FORMAT = 'PARQUET'
) WITH (
tpepPickupDateTime DATETIME2,
passengerCount INT,
tripDistance FLOAT
) AS [nyc]
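The same pattern applies to delimited files. As an illustrative sketch (the file path is hypothetical, reusing the storage naming from earlier examples), a CSV file with a header row can be read with FORMAT = 'CSV'; the HEADER_ROW option of parser version 2.0 tells the serverless SQL pool to infer column names from the first row:

```sql
-- Hypothetical CSV query: PARSER_VERSION '2.0' supports HEADER_ROW,
-- which infers column names from the file's first row.
SELECT TOP 100 * FROM
OPENROWSET(
    BULK 'https://dp900adls001.dfs.core.windows.net/dp900-adls-container/products/productcategory.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS [rows]
```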
Passing the entire storage URL into the BULK parameter is a quick and easy way to read the contents of files that are publicly available, or to use basic authentication methods such as Azure AD authentication for Azure AD logins. However, this option provides limited authentication options and can become tedious, as it forces developers to add the full storage URL to the BULK parameter every time they query the storage account. A more repeatable and secure option is to persist the location as an external data source and the access credential as a database-scoped credential in a serverless SQL pool logical database.
The following example creates a new logical database and an external data source that references the location of the New York City yellow taxicab dataset. You can then pass the external data source name to the optional DATA_SOURCE parameter in the OPENROWSET command. This allows you to alter the argument passed to the BULK parameter to only the folder path that needs to be queried.
CREATE DATABASE dp900_serverlessdb;

USE dp900_serverlessdb
GO

CREATE EXTERNAL DATA SOURCE nyc_yellowcab
WITH
(
    LOCATION = 'https://azureopendatastorage.blob.core.windows.net/nyctlc/yellow/'
);
SELECT TOP 100 * FROM
OPENROWSET(
BULK 'puYear=*/puMonth=*/*.parquet',
DATA_SOURCE='nyc_yellowcab',
FORMAT='PARQUET'
) WITH (
tpepPickupDateTime DATETIME2,
passengerCount INT,
tripDistance FLOAT
) AS [nyc]
This script can be executed in the Synapse Studio SQL script window by clicking the Run button at the top of the window. Figure 5.49 shows the SQL script window and the results from the executed script.
More information about how to use OPENROWSET to query external data with a serverless SQL pool can be found at https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-openrowset
.
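A common next step, sketched here as an assumption rather than an example from this chapter, is to wrap such an OPENROWSET query in a view so analysts can query the external data like a regular table. The view name is hypothetical; the database and external data source are the ones created above:

```sql
-- Illustrative sketch: persist the OPENROWSET query as a view in the
-- serverless database so the external data can be queried like a table.
CREATE VIEW dbo.vw_nyc_yellowcab
AS
SELECT * FROM
OPENROWSET(
    BULK 'puYear=*/puMonth=*/*.parquet',
    DATA_SOURCE = 'nyc_yellowcab',
    FORMAT = 'PARQUET'
) WITH (
    tpepPickupDateTime DATETIME2,
    passengerCount INT,
    tripDistance FLOAT
) AS [nyc];
```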
This chapter started by discussing different types of data workflows. Transactional systems, also known as online transaction processing (OLTP) systems, capture business transactions such as sales in a point-of-sale system. They are optimized to handle write-heavy workloads, often handling millions of transactions a day. Analytical systems differ from transactional systems in that they are optimized for read-heavy operations. Data is gathered from several source systems and consolidated in one or a few data stores that users can use for reports, ad hoc analysis, and data science projects.
Analytical systems can process data in batches or as a continuous stream of data. Batch processing involves processing large amounts of data at predetermined periods of time or after a specified event. Stream processing ingests and transforms data in real time as it is generated. Modern data architectures like the Lambda architecture make it easy to use both batch and stream processing in the same solution.
There are several services offered through Azure that data engineers can use when building a modern data warehouse solution. Azure HDInsight is a PaaS resource that can be used to build data processing pipelines with several popular open-source frameworks such as Apache Hadoop, Apache Spark, Apache Kafka, Apache HBase, Interactive Query (Hive LLAP), and Apache Storm. Azure Databricks is another PaaS resource that provides a unified platform for data engineers building data processing pipelines with Databricks Spark. Databricks Spark is a highly optimized version of Apache Spark, making it an ideal choice for most Spark applications.
Azure HDInsight, Azure Databricks, and several other data movement and data transformation activities can be orchestrated with Azure Data Factory (ADF). ADF enables data engineers to build data engineering pipelines with the Azure Data Factory Studio, a drag-and-drop, low-code/no-code development environment. Developers can author data movement and data transformation activities to run in parallel or chain them together so that they run sequentially. Because ADF is a PaaS offering, compute infrastructure is abstracted from the user in the form of an integration runtime. However, users can choose to use their own compute infrastructure by installing a self-hosted integration runtime on a virtual machine. This allows users to leverage data sources that are located in on-premises and private networks.
Organizations that want to use a single platform to achieve end-to-end analytics can do so with Azure Synapse Analytics. With Azure Synapse Analytics, users can use the Synapse Studio to manage all aspects of the data processing life cycle. Developers can author low-code/no-code data integration pipelines to move and transform data, leverage the serverless SQL pool to explore operational and object data stores with T-SQL without moving the data, build scale-out data engineering solutions with Apache Spark pools, and store report-ready data in relational tables that are optimized to serve analytical queries with a dedicated SQL pool.