Chapter 4: Engineering Real-Time Analytics with Azure Synapse Link Using Cosmos DB

In this chapter, we will cover how to perform real-time analytics with Cosmos DB by integrating Azure Synapse Link. Designing and implementing Internet of Things (IoT) end-to-end solutions from data ingestion, data processing, and deriving insights will be explored in depth. This will help us to learn about real-world use cases involving Synapse Link and Cosmos DB, where we will use it effectively for real-time analytics and reporting.

We will cover the following recipes:

  • Integrating an Azure Synapse ETL pipeline with Cosmos DB
  • Setting up Azure Cosmos DB analytical store
  • Enabling Azure Synapse Link and connecting Azure Cosmos DB to Azure Synapse
  • IoT end-to-end solutions and getting real-time insights
  • Use cases with Synapse Link

Integrating an Azure Synapse ETL pipeline with Cosmos DB

In this section, we will understand Azure Synapse Link for Cosmos DB and its usage for real-time analytics. We will learn about the following topics:

  • The basics of Cosmos DB
  • Azure Synapse Link integration with Cosmos DB
  • Feature support of Cosmos DB with Synapse
  • Synapse runtime support
  • Structured streaming support
  • Security features of Cosmos DB

Introducing Cosmos DB

Azure Cosmos DB is a fully isolated columnar store for large-scale analytics with operational data and a transactional store that is schema-agnostic, which allows us to use it with transactional applications without the overhead of managing schema or indexes.

Azure Cosmos DB analytical store is optimized for use for analytical workloads, which offers good analytical query performance.

When operational data is stored in a transaction store in row store format, analytical queries would be expensive, since we need provisioned throughput on the data scan stored in row format. This will impact the performance of transactional workloads that are used by real-time applications. Hence, larger operational data will be stored in a data warehouse as a separate data storage layer. This data will then be analyzed by Spark-based clusters for efficient analytics. Separation of analytical storage and compute for this operation data incurs more costs since it involves both Spark clusters and data warehouse storage. It also results in additional latency because the ETL pipelines are run less frequently than the transaction workloads. The upsert and merge statements on this operation data in analytical storage will become complex and it is not fast like newly ingested data.

Cosmos DB analytical store addresses the complexity and latency challenges of designing ETL pipelines. It can sync the operational data into a separate columnstore automatically instead of having a data warehouse. As you know, columnstore is the preferred storage format for all large-scale analytical queries, which results in faster query performance.

Azure Synapse Link integration

Azure Synapse Link for Azure Cosmos DB is a cloud-native managed offering from Microsoft that enables hybrid transactional and analytical processing (HTAP) to run near real-time analytical queries with operation data stored in Azure Cosmos DB.

Synapse Link enables no-ETL (Extract-Transform-Load) analytics in Azure Synapse Analytics for all operational data at a larger scale. Synapse Spark or Synapse SQL is used along with Synapse Link and Cosmos DB by enabling near real-time business intelligence, analytics, and machine learning analytical pipelines. The existing transactional workload on Cosmos DB is not impacted because of enabling analytical store and Synapse Link.

No separate connectors are required and we can now directly connect Azure Cosmos DB containers with Azure Synapse Analytics using Synapse Link. There is no additional data transformation required to analyze the operation data. The analytical store in Cosmos DB can be analyzed with the following:

  • An Apache Spark pool of Synapse, where existing data transformations are written for the operational data
  • Serverless SQL pool and integration support with BI tools such as Power BI

Supported features of Azure Synapse Link

Azure Cosmos DB supports two types of containers:

  • HTAP container – This container supports both transactional and analytical storage. It has Synapse Link integration enabled.
  • Online Transactional Processing (OLTP) container – This container supports only transactional stores and it does not have Synapse Link enabled with it.

You can use an Apache Spark pool and serverless SQL pool. The following table shows the supported features of Synapse Link:

Figure 4.1 – Table showing the supported features of Synapse Link

Figure 4.1 – Table showing the supported features of Synapse Link

Hybrid transactional and analytical processing containers support the following code-generated actions for serverless SQL pool:

  • Use T-SQL syntax and infer schema automatically from a Cosmos DB container.
  • Create a SQL view that directly accesses the Cosmos DB container for BI.
  • Combine queries from Cosmos DB containers with queries from Azure Blob storage or ADLS using CETAS(CREATE EXTERNAL TABLE AS SELECT).

Here are the important points with respect to Synapse Link:

  • Synapse Link is now supported by the Azure Cosmos DB SQL Core API and MongoDB API.
  • It is not supported by Gremlin, Cassandra, and the Table API.
  • Likewise, there is no support for Azure Cosmos DB serverless.
  • Once you have enabled the Synapse Link capability, it cannot be disabled.

Azure Synapse runtime support

The following table details the actions supported by Azure Synapse runtime.

Figure 4.2 – Table showing Synapse runtime support

Figure 4.2 – Table showing Synapse runtime support

Structured streaming support

Currently, Spark structured streaming support for Azure Cosmos DB is implemented using the change feed functionality of the transactional store. Streaming is not supported by analytical store.

Network and data security support for Azure Synapse Link with Cosmos DB

Managed private endpoints (Private Link) can be established between both a transactional store and analytical store with the same Azure Cosmos DB account in an Azure Synapse Analytics workspace.

Data is encrypted at rest using customer-managed keys stored in Azure Key Vault. Data is encrypted at transit using HTTPS/SSL.

Setting up Azure Cosmos DB analytical store

In this recipe, we will learn how to set up Azure Cosmos DB analytical store, which helps us to run near real-time analytical queries with operation data. CosmosDB is used as storage for real-time data and data that's near real-time, such as from IoT sensory devices.

Getting ready

Before we begin the recipe, create the resource group SynapseRG in the Azure portal.

How to do it…

We will create an Azure Cosmos DB account and learn how to set up the container in it. Let's get started:

  1. Log in to Azure portal: https://portal.azure.com/#home.
  2. Search for cosmos by using the top search bar on the Microsoft Azure page and navigate to Azure Cosmos DB.
Figure 4.3 – Search for Azure Cosmos DB

Figure 4.3 – Search for Azure Cosmos DB

  1. Click on Azure Cosmos DB.
    Figure 4.4 – Create an Azure Cosmos DB account

Figure 4.4 – Create an Azure Cosmos DB account

  1. Select Core (SQL) by clicking Create.
Figure 4.5 – Select Core (SQL)

Figure 4.5 – Select Core (SQL)

  1. On the Basics tab, select the existing resource group and fill in Account Name. Select Location so that it's the same as the Synapse workspace, which in our case is (US) East US. Capacity mode has two options, Provisioned throughput and Serverless. Choose Provisioned throughput.
 Figure 4.6 – Basics tab of Cosmos DB account

Figure 4.6 – Basics tab of Cosmos DB account

  1. Select Global distribution. Cosmos DB supports geo-redundancy and multi-region writes in selected availability zones. We are disabling these settings for this Azure Cosmos DB account.
Figure 4.7 – Global Distribution of Cosmos DB account

Figure 4.7 – Global Distribution of Cosmos DB account

  1. Select Networking and select the connectivity choices listed in the following screenshot. We are allowing connectivity for All networks.
Figure 4.8 – Networking for Cosmos DB account

Figure 4.8 – Networking for Cosmos DB account

  1. In the Backup Policy tab, specify Backup policy as Periodic, Backup interval in minutes, and Backup retention in hours. In case of disaster, we are enabling two copies to be retained and enabling Backup storage redundancy as Geo-redundant backup storage.
Figure 4.9 – Backup Policy of Cosmos DB account

Figure 4.9 – Backup Policy of Cosmos DB account

  1. Create data encryption keys by selecting Service-managed key.
Figure 4.10 – Encryption of Cosmos DB account

Figure 4.10 – Encryption of Cosmos DB account

  1. Review and create the Azure Cosmos DB account.
Figure 4.11 – Review + create of Cosmos DB account

Figure 4.11 – Review + create of Cosmos DB account

  1. Click Add Container and create a new Azure Cosmos DB container.
Figure 4.12 – Add Container for Azure Cosmos DB

Figure 4.12 – Add Container for Azure Cosmos DB

  1. Create a database ID, select a database throughput, and enter a container ID and a partition key. Click OK.
    Figure 4.13 – Add a new container for the Cosmos DB account

Figure 4.13 – Add a new container for the Cosmos DB account

We have successfully created an Azure Cosmos DB account and Cosmos DB container.

Enabling Azure Synapse Link and connecting Azure Cosmos DB to Azure Synapse

In this recipe, we will learn how to enable Synapse Link and connect Azure Cosmos DB to Azure Synapse. For large-scale analytics that provide high-scale performance without impacting on operational load, we recommend enabling Synapse Link. This helps to achieve HTAP capability for the Cosmos DB container.

Getting ready

Create a resource group called SynapseRG in the Azure portal.

How to do it…

We need to enable the features of Azure Synapse Link in the Azure Cosmos DB account and a linked service is required to connect to the Azure Cosmos DB database. Let's explore step by step how to enable Synapse Link in the Cosmos DB account by adding linked servers in Synapse Studio:

  1. Go to the previously created Azure Cosmos DB account from the Setting up Azure Cosmos DB analytical store recipe and select the Features settings. Select Azure Synapse Link.
Figure 4.14 – Select Azure Synapse Link

Figure 4.14 – Select Azure Synapse Link

  1. Enable Azure Synapse Link.
 Figure 4.15 – Enable Azure Synapse Link

Figure 4.15 – Enable Azure Synapse Link

  1. Go to the existing Synapse Analytics workspace and navigate to Synapse Studio.
Figure 4.16 – Synapse workspace

Figure 4.16 – Synapse workspace

  1. Navigate to the Data tab and Add new resource.
Figure 4.17 – Add new resource

Figure 4.17 – Add new resource

  1. Select Azure Cosmos DB (SQL API) and click Create.
Figure 4.18 – Connect to Cosmos DB (SQL API)

Figure 4.18 – Connect to Cosmos DB (SQL API)

  1. On the New linked service tab, enter a name, and make a selection for Authentication method. Also make a selection for Azure Cosmos DB account name and fill in Database Name.
Figure 4.19 – Creating a new linked service

Figure 4.19 – Creating a new linked service

  1. Create the linked service and publish it for us to view the linked services in the Data tab.

IoT end-to-end solutions and getting real-time insights

Azure Cosmos DB analytical store allows us to perform near real-time analytics on operational data. It has both transactional and analytical stores. Before Synapse was introduced, it was really hard to get real-time analytics out of Cosmos DB. Before the release of Synapse, we needed to extract and load data in a separate analytics store to build real-time dashboards.

Now, with the introduction of Azure Synapse Link, we can integrate Cosmos DB with Synapse Analytics, serverless SQL pools, and Spark pools, thus leading the way for data analysts to analyze real-time data. Using Spark pools, Synapse, and serverless SQL, BI engineers can create real-time dashboards using Power BI. Data scientists can now use the Cosmos DB analytical store to preprocess data and create models using a Synapse Spark pool and derive critical business insights from data without disturbing the transactional store of Azure Cosmos DB.

The IoT has increased automation and performance efficiency across all industries. Capturing streaming data from IoT devices and analyzing and predicting the future state is a common scenario to improve performance efficiency in all fields.

Figure 4.20 shows the architecture, which uses Azure Cosmos DB effectively as transactional and analytical store at near real time. It involves IoT devices installed on-premises or at the edge, streaming data at regular intervals into Azure IoT hub.

Azure IoT hub uses Azure Stream Analytics to stream data processing and ingests data into Azure Cosmos transactional store. The data is then interactively analyzed in Azure Synapse serverless SQL, which queries data from Cosmos DB analytical store using Synapse Link integration. Real time dashboards are generated in Power BI.

Data is transformed, standardized, feature engineered, and then available for predictive analytics and advanced analytics. This predicted value is used as actions for IoT devices.

Figure 4.20 – Reference architecture for IoT real-time analytics

Figure 4.20 – Reference architecture for IoT real-time analytics

As we go through the recipe, we will simulate the ingestion of data from IoT devices with the help of Synapse notebooks – Spark Structured Streaming that, in the real world, can be thought of as data from IoT devices ingested through Azure IoT Hub and analyzed by Stream Analytics. Real-time insights will be created with Azure Cosmos DB by integrating with Synapse Link.

Getting ready

We will be using an IoT temperature dataset for our scenario. This dataset will consist of temperature data for the month of October in a factory: this includes attributes such as ID, room ID, date and time, temperature, and location – whether inside or outside the set of rooms:

We will use a single item of this dataset in JSON format to simulate streaming ingestion to Azure Cosmos DB containers.

How to do it…

Let's get started:

  1. Let's simulate stream ingestion from IoT devices to a Cosmos DB transactional store. We will use spark.readStream and add all columns to the data frame, which simulates the IoT temperature data. Then writeStream into Cosmos DB with the Synapse linked service:

    df = (spark.readStream.format("rate").option("rowsPerSecond", 10).load())

    from pyspark.sql.functions import lit

    dfIoTSignals=(df.withColumn('id', lit('1'))

                    .withColumn('room_id', lit('Room Admin'))

                    .withColumn('noted_date', df["timestamp"].cast(StringType()))

                    .withColumn('temp', lit('1'))

                    .withColumn('location', lit('Test')) 

                    )

    streamQuery = dfIoTSignals

                        .writeStream

                        .format("cosmos.oltp")

                        .outputMode("append")

                        .option("spark.cosmos.connection.mode", "gateway") 

                        .option("spark.synapse.linkedService", "cosmoscookbook")

                        .option("spark.cosmos.container", "cosmosIoTdb")

                        .option("checkpointLocation", "/writeCheckpointDir")

                        .start()

    streamQuery.awaitTermination()

Now, let's follow the step-by-step procedure to set up Azure Cosmos DB, analyze the data using SQL, and create real-time dashboards with Power BI.

  1. Create/edit the container in Azure Cosmos DB. Fill in Database id, make a selection for Database throughput, and fill in Container id and Partition key. Click OK.
    Figure 4.21 – Add a new container for Azure Cosmos DB

Figure 4.21 – Add a new container for Azure Cosmos DB

  1. Upload a JSON file to Azure Cosmos DB.
Figure 4.22 – Upload JSON to Cosmos DB

Figure 4.22 – Upload JSON to Cosmos DB

  1. Create a Serverless SQL database in Synapse Studio.
    Figure 4.23 – Create a Serverless SQL database

Figure 4.23 – Create a Serverless SQL database

  1. Next, we will create a database in the SQL script shown as follows in Figure 4.24 and then connect it to Built-in.
Figure 4.24 – Create a new database

Figure 4.24 – Create a new database

  1. In the next step, we will use Create view which access the Cosmos DB container with the account name, database name, region in lowercase, and primary read-only key:

    CREATE VIEW IOTTEMP

    AS 

    SELECT  *

    FROM OPENROWSET (

        'CosmosDB', N'account=cosmoscookbooks;database=cosmosIoTdb;region=westus;key=k2G5cNMSgjanJfJNv0BCNyr9ydE0avGecR17WiCLJmSacs4gPiWtHklDJXKVAi7SEM9ZgjylHEPWEvoYFtL8Ew==',IoTTemp)

    AS q1

  2. Query the view with the select statement to find out how many temperatures have been recorded inside and outside the rooms:

    select location, count(*) as count from IOTTEMP group by location

The results can be viewed as follows.

Figure 4.25 – Results window

Figure 4.25 – Results window

  1. Now, analyze the real-time data interactively and create charts to understand the data:

    select DAY(CONVERT(DATETIME,noted_date,103)) AS 'Days in October', temp from IOTTEMP order by 'Days in October'

  2. Click Chart in the results and a line chart will be generated, as shown in the following screenshot of the Synapse Studio SQL script Results window.
Figure 4.26 – Chart Results window

Figure 4.26 – Chart Results window

  1. Let's connect the view as dataset in Power BI and create a scatter plot now. We need to first connect to the view by selecting SQL Server as the input data source.
Figure 4.27 – Power BI – SQL Server data source connection

Figure 4.27 – Power BI – SQL Server data source connection

  1. Go to synapsecookbook, click Properties, and copy Serverless SQL endpoint.
Figure 4.28 – Synapse SQL pool credentials

Figure 4.28 – Synapse SQL pool credentials

  1. Enter the SQL Server database endpoint that was previously copied as the Power BI SQL Server data source.
Figure 4.29 – Synapse configuration in Power BI

Figure 4.29 – Synapse configuration in Power BI

  1. Select the view from the Synapse views as shown in the following screenshot. Select Load.
Figure 4.30 – Power BI – dataset view configuration

Figure 4.30 – Power BI – dataset view configuration

  1. The dataset is now available in the report window and we can create a scatter plot visualization with X Axis as Day in October and Y Axis as temp.
Figure 4.31 – Visualizations dataset

Figure 4.31 – Visualizations dataset

  1. The scatter plot graph is now ready. This shows temperature data for October, broken down by day.
Figure 4.32 – Power BI – scatter plot

Figure 4.32 – Power BI – scatter plot

Use cases using Synapse Link

Enterprises are leveraging real-time streaming analytics for quicker responses and near real-time analytics in several industry-standard use cases. All the following industry use cases make use of cloud-managed native HTAP integration between Azure Cosmos DB and Azure Synapse Analytics with Synapse Link:

  • Detecting anomalies and fraud in real time
  • Real-time personalization
  • Providing healthcare, emergency, and humanitarian services
  • Data-driven decision making
  • Assortment optimization
  • Proximity marketing
  • Contextual recommendations
  • Ad optimization
  • Money laundering/payment fraud detection
  • Risk management in rapidly changing capital markets
  • Stock market surveillance
  • Real-time location analytics
  • Fleet management
  • Analytics in sports
  • Smart energy analytics
  • Social media analytics
  • System and network monitoring

In this section, we will learn about different pipelines for the use cases explained in the IoT end-to-end solutions and getting real-time insights recipe.

Smart meters and industry plants regularly sync data to Azure Event Hubs, which helps in smart energy analytics and predictive maintenance.

Immense data is generated from social media daily and it is ingested in real time for social media analytics.

Data-driven websites use end users' clicks for stream analytics to derive real-time personalization and ad optimization along with contextual recommendations.

All real-time data sources stream data in real time, which is stored in Cosmos DB Analytical store, and real-time dashboards are generated in Power BI using Synapse Link.

Based on the IoT end-to-end solution architecture in Figure 4.20, we have used Synapse Link along with Cosmos DB for the design of the following pipelines:

  • Data engineering pipeline – The stream is processed with window functions and standardized and transformed to derive meaningful insights.
  • Predictive analytics and advanced analytics machine learning pipeline – Synapse Link enables us to analyze quick-changing operational data in Azure Cosmos DB. Synapse Link is leveraged by data engineers and data scientists to build efficient advanced analytics machine learning and predictive pipelines.
  • Azure Cosmos DB analytical store leverages integration with Apache Spark pools in Azure Synapse Analytics where data is queried interactively.
  • Build machine learning (ML) models and deep learning models with Spark ML and Azure ML integration in Azure Synapse Analytics for batch and real-time inferencing.
  • The model inference is stored in Azure Cosmos DB, which helps with near real-time operational scoring.
  • Business intelligence – real-time dashboards with Power BI.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.146.255.127