In this chapter, we will cover how to perform real-time analytics with Cosmos DB by integrating Azure Synapse Link. Designing and implementing Internet of Things (IoT) end-to-end solutions from data ingestion, data processing, and deriving insights will be explored in depth. This will help us to learn about real-world use cases involving Synapse Link and Cosmos DB, where we will use it effectively for real-time analytics and reporting.
We will cover the following recipes:
In this section, we will understand Azure Synapse Link for Cosmos DB and its usage for real-time analytics. We will learn about the following topics:
Azure Cosmos DB is a fully isolated columnar store for large-scale analytics with operational data and a transactional store that is schema-agnostic, which allows us to use it with transactional applications without the overhead of managing schema or indexes.
Azure Cosmos DB analytical store is optimized for use for analytical workloads, which offers good analytical query performance.
When operational data is stored in a transaction store in row store format, analytical queries would be expensive, since we need provisioned throughput on the data scan stored in row format. This will impact the performance of transactional workloads that are used by real-time applications. Hence, larger operational data will be stored in a data warehouse as a separate data storage layer. This data will then be analyzed by Spark-based clusters for efficient analytics. Separation of analytical storage and compute for this operation data incurs more costs since it involves both Spark clusters and data warehouse storage. It also results in additional latency because the ETL pipelines are run less frequently than the transaction workloads. The upsert and merge statements on this operation data in analytical storage will become complex and it is not fast like newly ingested data.
Cosmos DB analytical store addresses the complexity and latency challenges of designing ETL pipelines. It can sync the operational data into a separate columnstore automatically instead of having a data warehouse. As you know, columnstore is the preferred storage format for all large-scale analytical queries, which results in faster query performance.
Azure Synapse Link for Azure Cosmos DB is a cloud-native managed offering from Microsoft that enables hybrid transactional and analytical processing (HTAP) to run near real-time analytical queries with operation data stored in Azure Cosmos DB.
Synapse Link enables no-ETL (Extract-Transform-Load) analytics in Azure Synapse Analytics for all operational data at a larger scale. Synapse Spark or Synapse SQL is used along with Synapse Link and Cosmos DB by enabling near real-time business intelligence, analytics, and machine learning analytical pipelines. The existing transactional workload on Cosmos DB is not impacted because of enabling analytical store and Synapse Link.
No separate connectors are required and we can now directly connect Azure Cosmos DB containers with Azure Synapse Analytics using Synapse Link. There is no additional data transformation required to analyze the operation data. The analytical store in Cosmos DB can be analyzed with the following:
Azure Cosmos DB supports two types of containers:
You can use an Apache Spark pool and serverless SQL pool. The following table shows the supported features of Synapse Link:
Hybrid transactional and analytical processing containers support the following code-generated actions for serverless SQL pool:
Here are the important points with respect to Synapse Link:
The following table details the actions supported by Azure Synapse runtime.
Currently, Spark structured streaming support for Azure Cosmos DB is implemented using the change feed functionality of the transactional store. Streaming is not supported by analytical store.
Managed private endpoints (Private Link) can be established between both a transactional store and analytical store with the same Azure Cosmos DB account in an Azure Synapse Analytics workspace.
Data is encrypted at rest using customer-managed keys stored in Azure Key Vault. Data is encrypted at transit using HTTPS/SSL.
In this recipe, we will learn how to set up Azure Cosmos DB analytical store, which helps us to run near real-time analytical queries with operation data. CosmosDB is used as storage for real-time data and data that's near real-time, such as from IoT sensory devices.
Before we begin the recipe, create the resource group SynapseRG in the Azure portal.
We will create an Azure Cosmos DB account and learn how to set up the container in it. Let's get started:
We have successfully created an Azure Cosmos DB account and Cosmos DB container.
In this recipe, we will learn how to enable Synapse Link and connect Azure Cosmos DB to Azure Synapse. For large-scale analytics that provide high-scale performance without impacting on operational load, we recommend enabling Synapse Link. This helps to achieve HTAP capability for the Cosmos DB container.
Create a resource group called SynapseRG in the Azure portal.
We need to enable the features of Azure Synapse Link in the Azure Cosmos DB account and a linked service is required to connect to the Azure Cosmos DB database. Let's explore step by step how to enable Synapse Link in the Cosmos DB account by adding linked servers in Synapse Studio:
Azure Cosmos DB analytical store allows us to perform near real-time analytics on operational data. It has both transactional and analytical stores. Before Synapse was introduced, it was really hard to get real-time analytics out of Cosmos DB. Before the release of Synapse, we needed to extract and load data in a separate analytics store to build real-time dashboards.
Now, with the introduction of Azure Synapse Link, we can integrate Cosmos DB with Synapse Analytics, serverless SQL pools, and Spark pools, thus leading the way for data analysts to analyze real-time data. Using Spark pools, Synapse, and serverless SQL, BI engineers can create real-time dashboards using Power BI. Data scientists can now use the Cosmos DB analytical store to preprocess data and create models using a Synapse Spark pool and derive critical business insights from data without disturbing the transactional store of Azure Cosmos DB.
The IoT has increased automation and performance efficiency across all industries. Capturing streaming data from IoT devices and analyzing and predicting the future state is a common scenario to improve performance efficiency in all fields.
Figure 4.20 shows the architecture, which uses Azure Cosmos DB effectively as transactional and analytical store at near real time. It involves IoT devices installed on-premises or at the edge, streaming data at regular intervals into Azure IoT hub.
Azure IoT hub uses Azure Stream Analytics to stream data processing and ingests data into Azure Cosmos transactional store. The data is then interactively analyzed in Azure Synapse serverless SQL, which queries data from Cosmos DB analytical store using Synapse Link integration. Real time dashboards are generated in Power BI.
Data is transformed, standardized, feature engineered, and then available for predictive analytics and advanced analytics. This predicted value is used as actions for IoT devices.
As we go through the recipe, we will simulate the ingestion of data from IoT devices with the help of Synapse notebooks – Spark Structured Streaming that, in the real world, can be thought of as data from IoT devices ingested through Azure IoT Hub and analyzed by Stream Analytics. Real-time insights will be created with Azure Cosmos DB by integrating with Synapse Link.
We will be using an IoT temperature dataset for our scenario. This dataset will consist of temperature data for the month of October in a factory: this includes attributes such as ID, room ID, date and time, temperature, and location – whether inside or outside the set of rooms:
We will use a single item of this dataset in JSON format to simulate streaming ingestion to Azure Cosmos DB containers.
Let's get started:
df = (spark.readStream.format("rate").option("rowsPerSecond", 10).load())
from pyspark.sql.functions import lit
dfIoTSignals=(df.withColumn('id', lit('1'))
.withColumn('room_id', lit('Room Admin'))
.withColumn('noted_date', df["timestamp"].cast(StringType()))
.withColumn('temp', lit('1'))
.withColumn('location', lit('Test'))
)
streamQuery = dfIoTSignals
.writeStream
.format("cosmos.oltp")
.outputMode("append")
.option("spark.cosmos.connection.mode", "gateway")
.option("spark.synapse.linkedService", "cosmoscookbook")
.option("spark.cosmos.container", "cosmosIoTdb")
.option("checkpointLocation", "/writeCheckpointDir")
.start()
streamQuery.awaitTermination()
Now, let's follow the step-by-step procedure to set up Azure Cosmos DB, analyze the data using SQL, and create real-time dashboards with Power BI.
CREATE VIEW IOTTEMP
AS
SELECT *
FROM OPENROWSET (
'CosmosDB', N'account=cosmoscookbooks;database=cosmosIoTdb;region=westus;key=k2G5cNMSgjanJfJNv0BCNyr9ydE0avGecR17WiCLJmSacs4gPiWtHklDJXKVAi7SEM9ZgjylHEPWEvoYFtL8Ew==',IoTTemp)
AS q1
select location, count(*) as count from IOTTEMP group by location
The results can be viewed as follows.
select DAY(CONVERT(DATETIME,noted_date,103)) AS 'Days in October', temp from IOTTEMP order by 'Days in October'
Enterprises are leveraging real-time streaming analytics for quicker responses and near real-time analytics in several industry-standard use cases. All the following industry use cases make use of cloud-managed native HTAP integration between Azure Cosmos DB and Azure Synapse Analytics with Synapse Link:
In this section, we will learn about different pipelines for the use cases explained in the IoT end-to-end solutions and getting real-time insights recipe.
Smart meters and industry plants regularly sync data to Azure Event Hubs, which helps in smart energy analytics and predictive maintenance.
Immense data is generated from social media daily and it is ingested in real time for social media analytics.
Data-driven websites use end users' clicks for stream analytics to derive real-time personalization and ad optimization along with contextual recommendations.
All real-time data sources stream data in real time, which is stored in Cosmos DB Analytical store, and real-time dashboards are generated in Power BI using Synapse Link.
Based on the IoT end-to-end solution architecture in Figure 4.20, we have used Synapse Link along with Cosmos DB for the design of the following pipelines:
3.146.255.127