Thus far, our more detailed exploration of Microsoft’s IoT footprint has focused on analysis of streaming data from devices, managing and monitoring the devices, and pushing intelligence to the edge. Chances are, you will want to integrate some of your existing data sources into your new IoT solution. Much of this chapter focuses on methods for finding and integrating the data that you will need.
We’ll first look at typical preexisting potential sources of data. Then we’ll explore the roles of Azure Data Factory, Azure Data Explorer, and PolyBase in moving and accessing data. We’ll also look at configuring VPN connections or ExpressRoute for network connectivity between on-premises devices and systems and the cloud, using Azure Data Box to physically move data to Azure data centers, and using Azure Data Catalog to find where data is located.
Finally, multiple Microsoft partners have for years provided data historians (time series databases), which are often part of IoT solutions. We’ll look at how two of them, OSIsoft and PTC, are integrating Azure into their modern cloud-based deployment architectures.
Preexisting sources of data
Integrating and finding data sources
Data historians and integration with Azure
Preexisting Sources of Data
Organizations that deploy IoT solutions almost always find a need to integrate data from legacy transactional and data warehousing systems in order to deliver the key metrics needed in answering business questions. As transactional data fits neatly into rows and columns, relational databases are typically deployed to provide data management solutions for such data.
Preexisting footprints are almost always unique. Legacy data sources might reside on premises, in clouds, or in a combination of locations. On-premises relational databases typically found include Microsoft SQL Server, Oracle, IBM DB2, MySQL, PostgreSQL, and others. When deployed in Azure, these databases can run within virtual machines. There are also databases available for PaaS deployment, such as Azure SQL Database, Azure SQL Data Warehouse, and Snowflake.
Microsoft’s Azure SQL Database is a fully managed service based on the latest version of the Microsoft SQL Server general-purpose database engine; new capabilities for the SQL Server family are typically released there first. Azure SQL Database also supports non-relational structures including graphs, JSON, spatial, and XML. Its hyperscale service tier enables database scalability up to 100 terabytes, and by configuring elastic pools, you can assign resources that are shared across databases.
Azure SQL Data Warehouse uses a massively parallel processing (MPP) engine to perform queries across the extremely large databases often found in data warehouses, including those up to petabytes in size. The database consists of a control node that optimizes and coordinates parallel queries and multiple compute nodes (up to 60).
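Conceptually, the MPP engine balances parallel query work by spreading each table’s rows across a fixed set of distributions served by the compute nodes. The toy Python sketch below illustrates that idea only; the hash function, column name, and sample rows are our own illustrative stand-ins, not the engine’s actual implementation.

```python
from collections import defaultdict

DISTRIBUTIONS = 60  # the service spreads rows over a fixed number of distributions

def assign_distribution(key) -> int:
    """Toy stand-in for the engine's deterministic hash on a distribution column."""
    return hash(key) % DISTRIBUTIONS

# Illustrative rows keyed by a hypothetical device_id distribution column
rows = [{"device_id": f"dev-{i}", "temp": 20 + i % 5} for i in range(1000)]

placement = defaultdict(list)
for row in rows:
    placement[assign_distribution(row["device_id"])].append(row)

# Every row lands in exactly one distribution, so a large scan can be
# executed by the compute nodes in parallel, one slice of rows each.
assert sum(len(v) for v in placement.values()) == len(rows)
```

Choosing a distribution column with many distinct values (as a device identifier typically has) keeps the slices roughly even, which is what makes the parallelism effective.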
In this figure, we’ve pictured Azure SQL Database or Azure SQL Data Warehouse serving as data warehouses in the cloud. However, the data warehouse engine could also be one of the other cloud-based or on-premises databases that we previously mentioned.
Integrating and Finding Data Sources
We might choose to analyze our various sources of data by gathering the data to a single location. We could also keep the data in place and analyze it through a distributed query. In this section of the chapter, we explore ways of doing this and describe how you can track where the data resides. We begin by discussing how Azure Data Factory can be used to gather data.
Azure Data Factory
Data Movement Activities. Copy data from a data source to a specified target (also known as a data sink).
Data Transformation Activities. Custom coding using Hive, Pig, MapReduce, Hadoop Streaming, or Spark in HDInsight; Machine Learning in an Azure VM; stored procedures in a SQL engine; Databricks; or an Azure Batch process.
Control Activities. Used to invoke another pipeline from within a pipeline, define for-each iterations and do-until loops, call REST endpoints, look up records, table names, or values from external sources, retrieve metadata, branch based on conditions, and specify wait times for pipelines.
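An ADF pipeline combining these activity types is ultimately expressed as a JSON document. The sketch below shows the general shape of a pipeline with a single copy activity; the pipeline, dataset, and parameter names are hypothetical, and the full ADF schema includes additional properties, so treat this as an outline rather than a deployable definition.

```python
import json

# Hypothetical pipeline: copy device readings from an on-premises SQL table
# (source) to Azure Blob Storage (sink), echoing ADF's JSON structure.
pipeline = {
    "name": "CopyDeviceReadings",  # hypothetical pipeline name
    "properties": {
        "activities": [
            {
                "name": "CopySqlToBlob",
                "type": "Copy",  # a data movement activity
                "inputs": [{"referenceName": "OnPremSqlReadings", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "BlobReadingsSink", "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "SqlSource"},
                    "sink": {"type": "BlobSink"},
                },
            }
        ],
        # Parameters let manual runs or triggers pass arguments at execution time
        "parameters": {"windowStart": {"type": "String"}},
    },
}

print(json.dumps(pipeline, indent=2))
```

The `parameters` section is how arguments reach datasets and linked services when a run is triggered, which we return to below when discussing scheduling.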
Data connectors are available for a variety of sources and targets in Azure including databases, NoSQL databases, files and file systems, and services and applications. You can also use generic protocols and interfaces to access data, such as ODBC, OData, and REST.
Connectors are also available for database data sources such as Greenplum, HBase, IBM DB2, IBM Informix, IBM Netezza, Microsoft SQL Server, MySQL, Oracle, PostgreSQL, SAP Business Warehouse, SAP HANA, Spark, Sybase, Teradata, and Vertica. Connectors for NoSQL databases include Cassandra, Couchbase, and MongoDB. Examples of connectors for applications include Microsoft Dynamics, Concur, Oracle Eloqua, Marketo, PayPal, Salesforce, SAP, and Square. Various connectors for the Amazon and Google cloud data management systems are also provided including for Amazon Redshift and Google BigQuery.
Note
In organizations that prefer developing ETL scripts using SQL Server Integration Services (SSIS), you can configure and provision an SSIS Integration Runtime from within ADF to enable a lift and shift of your SSIS packages.
Once we are satisfied with our design, we can schedule the workflow for execution. Pipeline runs can be instantiated by passing arguments to parameters defined in the pipeline (e.g., for datasets or linked services), either manually or within triggers.
Once deployed, you can monitor the success of your activities and pipelines. ADF supports pipeline monitoring via Azure Monitor, APIs, PowerShell, Azure Monitor logs, and health indicators in the Azure Portal.
Query Services Across Diverse Data
Microsoft offers a couple of services enabling queries across semi-structured and structured data sources. Azure Data Explorer requires moving data into a common database, while PolyBase simply accesses data through external tables as if the data were local.
Microsoft describes Azure Data Explorer as a fast and scalable data exploration service. The cluster and database service can load incoming streaming data and/or copy data from other data sources through Event Hubs, Event Grid, Kafka, the Python and Node SDKs, the .NET Standard SDK, Logstash, and ADF.
You begin by creating a cluster (including name and compute specifications) in a resource group using the Azure Portal. You then create a Data Explorer database and define a retention period (in days) and a cache period (in days). You start the cluster (and can later stop it to minimize cost). You next create a target table in the Azure Data Explorer cluster using a create table command through the query interface and provide a mapping of incoming JSON data to column names and data types in the table. Lastly, you add a data connection.
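The table-creation and mapping steps are issued as control commands through the query interface. The sketch below assembles representative commands; the table name, columns, and JSON paths are hypothetical, so check them against your own incoming data.

```python
# Hypothetical telemetry table and ingestion mapping for Azure Data Explorer.
# These are control commands you would run in the query interface; the
# column names and JSON paths below are illustrative.
create_table = (
    ".create table Telemetry "
    "(Timestamp: datetime, DeviceId: string, Temperature: real)"
)

# Maps fields of incoming JSON documents onto the table's columns.
create_mapping = (
    ".create table Telemetry ingestion json mapping 'TelemetryMapping' "
    "'["
    '{"column":"Timestamp","path":"$.timestamp","datatype":"datetime"},'
    '{"column":"DeviceId","path":"$.deviceId","datatype":"string"},'
    '{"column":"Temperature","path":"$.temp","datatype":"real"}'
    "]'"
)

print(create_table)
print(create_mapping)
```

With the table and mapping in place, the data connection you add last can route incoming JSON events into the correct columns automatically.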
You can query data using Azure Data Explorer in a web-based interface, writing queries in the Kusto Query Language (KQL). Queries generally begin by referencing a table or function. Some of the advanced capabilities of KQL include scalar and tabular operators, time series analysis and anomaly detection, aggregations, forecasting, and machine learning.
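For instance, a KQL query over a hypothetical Telemetry table of device readings follows the pattern just described: reference the table, then pipe it through filtering and aggregation operators. The table and column names below are illustrative.

```python
# A representative KQL query over a hypothetical Telemetry table:
# reference the table, filter to the last hour, then aggregate
# readings into 5-minute bins per device.
kql_query = """Telemetry
| where Timestamp > ago(1h)
| summarize AvgTemp = avg(Temperature) by DeviceId, bin(Timestamp, 5m)
| order by AvgTemp desc"""
print(kql_query)
```

The pipe (`|`) chains tabular operators left to right, which is what makes KQL queries read as a sequence of transformation steps.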
PolyBase is built into SQL Server instances and Azure SQL Data Warehouse. You access data that resides in Azure Storage, Hadoop, and other file systems and databases using external table definitions. Connections to the data repositories are made via ODBC. The definitions and user permissions are stored in the database. Query performance can be improved by configuring PolyBase scale-out groups.
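Defining an external table over, say, files in Azure Blob Storage looks broadly like the T-SQL sketched below. All object names and the storage path are hypothetical, and the exact options and supported sources vary by SQL engine version, so verify the syntax against the documentation for your deployment.

```python
# Illustrative T-SQL for querying Azure Blob Storage files via PolyBase.
# Object names and the storage path are hypothetical placeholders.
polybase_ddl = """CREATE EXTERNAL DATA SOURCE SensorBlobStore
WITH (TYPE = HADOOP,
      LOCATION = 'wasbs://sensordata@mystorageacct.blob.core.windows.net');

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

CREATE EXTERNAL TABLE dbo.SensorReadingsExternal (
    DeviceId    NVARCHAR(50),
    ReadingTime DATETIME2,
    Temperature FLOAT
)
WITH (LOCATION = '/readings/',
      DATA_SOURCE = SensorBlobStore,
      FILE_FORMAT = CsvFormat);

-- Once defined, the external table is queried like a local one:
-- SELECT DeviceId, AVG(Temperature)
-- FROM dbo.SensorReadingsExternal GROUP BY DeviceId;"""
print(polybase_ddl)
```

The key point is the last step: after the one-time external definitions, queries join and aggregate the remote data exactly as they would a local table.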
PolyBase is sometimes used in combination with ADF, especially when the goal is to speed data transformations and updates. Behind the scenes, it provides the copy capabilities that we described earlier in the ADF section of this chapter.
Note
Key PolyBase features, including support for certain sources, vary based on the version of the SQL engine that PolyBase is bundled with. Always check the Microsoft documentation first regarding feature availability for the version that you plan to deploy.
Connecting On-Premises Networks to Azure
Virtual Private Network (VPN) connection
Azure ExpressRoute connection
ExpressRoute connection with VPN failover
A VPN connection includes an Azure VPN gateway that sends encrypted traffic between the Azure VNet and an on-premises network. VPN connections use the public Internet and were limited to speeds of 1.25 Gb per second at the time this book was published. Traffic received by the VPN gateway that is destined for applications in Azure is routed to an internal load balancer.
A VPN appliance must also be present on-premises, providing external connectivity for that network. The VPN appliance might be a dedicated hardware device or a software service (such as the Routing and Remote Access Service in Windows Server).
ExpressRoute with VPN failover combines the two options just described. Both types of connections are configured: ExpressRoute circuits provide connectivity under normal conditions, and the VPN connection provides failover connectivity in situations where ExpressRoute connectivity is lost.
Within the backend cloud, a hub–spoke topology can be deployed in Azure to isolate workloads when services are to be shared. For example, network virtual appliances and DNS servers can be shared for different workloads, different departments, or different stages of development and deployment.
The hub–spoke topology is deployed using CLI scripts found in GitHub. An Azure VNet serves as a hub in the topology providing a central connection to the on-premises network and provides shared services for other VNets serving as spokes. Each spoke VNet is connected to the hub VNet by peering, enabling traffic exchange between each spoke and the hub. The spoke VNets enable isolation and separate management of the workloads.
Bulk Data Transfer
Legacy data warehouses and remote IoT databases are sometimes terabytes or more in size. Network connections into Azure might be too slow to provide timely and reliable data transfers. In such scenarios, offline data transfer can make sense, and Microsoft offers Azure Data Box solutions for this purpose.
You can order the Data Box through the Azure Portal. Microsoft ships the Data Box to your site where you upload data through a local web user interface. You then ship it back to a Microsoft Azure Data Center, and the data is uploaded there into your Azure Storage account. You can track this process through the Azure Portal.
The Data Box can be used for one-time migration of very large data quantities. If data is subsequently gathered and needs to be uploaded to Azure, the incremental volumes for updates are usually a fraction of the size and can likely be handled by network connections to Azure. If remote locations periodically become disconnected and very large data volumes are again gathered, a Data Box can be requested again and used to move these larger data volumes to Azure.
Alternatively, you might choose online data transfer. Data Box Gateway is a virtual device residing on-premises supporting NFS and SMB as protocols. The virtual device transfers data to Azure block blobs, Azure page blobs, or Azure Files.
Azure Data Box Edge is a physical device with the gateway capabilities of the Data Box Gateway. It additionally can be used to preprocess data including performing data aggregation, modifying data (e.g., removing sensitive data), selecting and sending only subsets of data, or analyzing and reacting to IoT events locally. You can apply ML models at the edge before data is transferred.
For example, the Data Box Edge can be used to capture video and, configured with the Azure IoT Edge runtime, can push video frames through an FPGA-based AI model. The source code for such an application is posted on GitHub.
Azure Data Catalog
Finding the location of data that you might need from within your IoT architecture can be challenging. As you’ve seen in previous architecture diagrams, the data could be stored in a variety of locations and in a variety of types of data management systems.
Azure Data Catalog is a cloud-based service where data management locations in your implementation can be registered and metadata describing the data in each location is stored. The metadata is searchable and can be enhanced by users of the Data Catalog, enabling crowdsourcing of descriptions, tags, and other descriptive metadata. The search capability makes finding where data is located much simpler.
Name of asset
Type of asset
Description of asset
Names of attributes or columns
Data types of attributes or columns
Descriptions of attributes or columns
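The metadata recorded for each asset, together with the crowdsourced tags that make it searchable, can be pictured as small records like the following. The asset name, columns, tags, and search helper here are our own illustrative sketch, not the Data Catalog API.

```python
# Toy model of Data Catalog asset metadata and tag search.
# Asset names, columns, and tags are hypothetical.
catalog = [
    {
        "name": "SensorFacts",
        "type": "Table",
        "description": "Per-device sensor readings",
        "columns": [
            {"name": "DeviceId", "dtype": "nvarchar", "description": "Device identifier"},
            {"name": "Temperature", "dtype": "float", "description": "Reading in Celsius"},
        ],
        "tags": ["iot", "telemetry"],  # enriched by users of the catalog
    },
]

def search(tag: str):
    """Return the names of assets whose crowdsourced tags include the term."""
    return [asset["name"] for asset in catalog if tag in asset["tags"]]

assert search("telemetry") == ["SensorFacts"]
```

Because any user can add tags and descriptions, the searchable surface of the catalog grows over time, which is what makes finding data across many systems simpler.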
Through the Data Catalog interface, we can also open the tables using a variety of tools. For the sensor facts table, we can browse the data contained in the table in Excel (all of it or just the first 1,000 entries) or use SQL Server Data Tools, Power BI Desktop, or Power Query. If we choose to explore a database, we see all the tables present in the database and can explore each one.
Data Historians and Integration with Azure
Vendors of earlier-generation IoT solutions designed them for entirely on-premises deployment. They provided landing spots for device data in on-premises time series data stores called data historians. Today, many of these vendors are moving key components to the cloud, and the components they choose to move vary from vendor to vendor. Here, we’ll look at how OSIsoft and PTC ThingWorx are leveraging Azure. As you might expect, both leverage the Azure IoT Hub to land streaming data in the cloud.
PI Data Archive. A time series database in which data is tagged so that queries can use metadata in addition to date and time.
PI Asset Framework. Maps sensor readings in tags into models (e.g., parent–child relationships) enabling determination of the operational condition of devices.
Asset Analytics. Enables real-time device metrics to be viewed and real-time analytics to be applied at the edge.
PI Event Frames. Conditions can be defined that produce events of interest tied to analytics, anomaly detection, and notifications.
PI Notifications. Enables alerting of operators when defined thresholds are met or exceeded, or failures or other anomalies occur.
Note the similarity on the backend to previously described Azure architectures. PI Vision provides an additional mobile/desktop interface in the OSIsoft solution.
In comparison, PTC chose to develop ThingWorx with components deployed across the entire ecosystem. Within the IoT Edge, PTC provides Software Content Management (ThingWorx SCM), Remote Access and Control (ThingWorx RAC), and Industrial Connectivity (Kepware). Microsoft’s complementary offerings include Azure Stream Analytics, Azure ML, and Azure Functions in containers.
In the backend, the ThingWorx historian is hosted on Azure PostgreSQL. PTC also provides an Asset Advisor and machine learning (ML) solution. Microsoft provides the Azure IoT Hub, Blob Storage, Time Series Insights, Azure Active Directory, and Azure Machine Learning as important complementary components.
Not all the components pictured in the two previous architectures are always implemented. The historians for each of these companies are widely present in their installed bases, so the components directly tied to the historians tend to have the greatest adoption.