© Robert Stackowiak 2019
R. Stackowiak, Azure Internet of Things Revealed, https://doi.org/10.1007/978-1-4842-5470-7_7

7. Infrastructure Integration

Robert Stackowiak, Elgin, IL, USA

Thus far, our more detailed exploration of Microsoft’s IoT footprint has focused on analysis of streaming data from devices, managing and monitoring the devices, and pushing intelligence to the edge. Chances are, you will want to integrate some of your existing data sources into your new IoT solution. Much of this chapter focuses on methods for finding and integrating the data that you will need.

We’ll first look at typical preexisting potential sources of data. Then we’ll explore the roles of Azure Data Factory, Azure Data Explorer, and PolyBase in moving and/or accessing data. We’ll also look at configuring VPN connections and/or ExpressRoute for network connectivity between on-premises devices and systems and the cloud, using the Azure Data Box to physically move data to Azure data centers, and using Azure Data Catalog to find where data is located.

Finally, multiple Microsoft partners have provided data historians (time series databases) for years, and these are often part of IoT solutions. We’ll look at how two of them, OSIsoft and PTC, are integrating Azure into their modern cloud-based deployment architectures.

The chapter includes the following major sections:
  • Preexisting sources of data

  • Integrating and finding data sources

  • Data historians and integration with Azure

Preexisting Sources of Data

Organizations that deploy IoT solutions almost always find a need to integrate data from legacy transactional and data warehousing systems in order to deliver the key metrics needed to answer business questions. Because transactional data fits neatly into rows and columns, relational databases are typically deployed to manage it.

Preexisting footprints are almost always unique. Legacy data sources might reside on premises, in clouds, or in a combination of locations. Relational databases typically found on premises include Microsoft SQL Server, Oracle, IBM DB2, MySQL, PostgreSQL, and others. In Azure, these databases can run within virtual machines. Databases are also available for PaaS deployment, such as Azure SQL Database, Azure SQL Data Warehouse, and Snowflake.

Microsoft’s Azure SQL Database is a fully managed service based on the latest version of the Microsoft SQL Server general-purpose database engine, and it is where the newest capabilities of the SQL Server family are released first. Azure SQL Database also supports non-relational structures including graph, JSON, spatial, and XML data. Its hyperscale service tier enables database scalability up to 100 terabytes. By configuring elastic pools, you can share assigned resources across multiple databases.

Azure SQL Data Warehouse uses a massively parallel processing (MPP) engine to run queries across the extremely large databases often found in data warehouses, including those up to petabytes in size. The database consists of a control node, which optimizes and coordinates parallel queries, and multiple compute nodes (up to 60).

In Figure 7-1, we illustrate a typical footprint that varies slightly from diagrams presented earlier in this book in that we have identified some of the key transactional systems. In this figure, the ERP system might be SAP, Oracle E-Business Suite, Oracle Fusion Applications, Infor, or another vendor solution deployed either on premises or in a cloud. The ERP DW pictured would likely be SAP BW in an SAP implementation. The HR system could be Workday, another cloud-based HR solution, or a legacy on-premises HR system. The CRM solution could be Microsoft Dynamics 365 or Salesforce in the cloud, or another legacy on-premises or cloud application.
../images/480071_1_En_7_Chapter/480071_1_En_7_Fig1_HTML.jpg
Figure 7-1

Example of transactional systems in full footprint

In this figure, we’ve pictured Azure SQL Database or Azure SQL Data Warehouse serving as data warehouses in the cloud. However, the data warehouse engine could also be one of the other cloud-based or on-premises databases that we previously mentioned.

Integrating and Finding Data Sources

We might choose to analyze our various sources of data by gathering the data into a single location. We could also keep the data in place and analyze it through distributed queries. In this section of the chapter, we explore ways of doing both and describe how you can track where the data resides. We begin by discussing how Azure Data Factory can be used to gather data.

Azure Data Factory

Azure Data Factory (ADF) is a cloud-based integration service used to perform automated data extraction, loading, and transformation (ELT) within pipelines. Pipelines are workflows that are created and scheduled within ADF. Pipelines can contain three types of activities:
  • Data Movement Activities. Copy data from a data source to a specified target (also known as a data sink).

  • Data Transformation Activities. Custom code using Hive, Pig, MapReduce, Hadoop Streaming, or Spark in HDInsight; Machine Learning in an Azure VM; stored procedures in a SQL engine; Databricks; or an Azure Batch process.

  • Control Activities. Used to invoke another pipeline from within a pipeline, define iteration (ForEach) and do-until loops, call REST endpoints, look up records, table names, or values from external sources, retrieve metadata, establish branches based on conditions, and specify wait times for pipelines.

Data connectors are available for a variety of sources and targets in Azure including databases, NoSQL databases, files and file systems, and services and applications. You can also use generic protocols and interfaces to access data, such as ODBC, OData, and REST.

Figure 7-2 shows an Azure Portal interface view of linked Azure data management services that can serve as sources or targets. Azure data connectors shown include Azure Blob Storage, Cosmos DB, Azure Data Lake Storage, Azure Database for MariaDB, Azure Database for MySQL, Azure Database for PostgreSQL, Azure SQL Data Warehouse, and Azure SQL Database.
../images/480071_1_En_7_Chapter/480071_1_En_7_Fig2_HTML.jpg
Figure 7-2

Azure linked services within Azure Data Factory

Connectors are also available for database data sources such as Greenplum, HBase, IBM DB2, IBM Informix, IBM Netezza, Microsoft SQL Server, MySQL, Oracle, PostgreSQL, SAP Business Warehouse, SAP HANA, Spark, Sybase, Teradata, and Vertica. Connectors for NoSQL databases include Cassandra, Couchbase, and MongoDB. Examples of connectors for applications include Microsoft Dynamics, Concur, Oracle Eloqua, Marketo, PayPal, Salesforce, SAP, and Square. Connectors for Amazon and Google cloud data management systems are also provided, including Amazon Redshift and Google BigQuery.

A simple Copy Activity function can be initiated in a data pipeline to move data from on-premises or cloud-based data sources into Azure for further processing. Figure 7-3 shows the ADF interface that we used to create and then validate our data copy pipeline. More often, pipelines consist of a series of steps where control flows are used to orchestrate pipeline activities.
../images/480071_1_En_7_Chapter/480071_1_En_7_Fig3_HTML.png
Figure 7-3

Azure Data Factory pipeline containing copy data function
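
As a rough illustration of what Figure 7-3 accomplishes, the following sketch defines a minimal copy pipeline using the Azure SDK for Python (the azure-mgmt-datafactory package). It assumes the data factory, its linked services, and the two datasets (here named SensorDataRaw and SensorDataStaged) already exist; all resource names and credentials are placeholders, and exact model signatures can vary by SDK version.

```python
from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    BlobSink,
    BlobSource,
    CopyActivity,
    DatasetReference,
    PipelineResource,
)

# Placeholder identity and scope values
credential = ClientSecretCredential(
    tenant_id="<tenant-id>", client_id="<app-id>", client_secret="<app-secret>"
)
adf_client = DataFactoryManagementClient(credential, "<subscription-id>")

rg_name = "iot-rg"               # resource group containing the factory (placeholder)
df_name = "iot-data-factory"     # existing Azure Data Factory instance (placeholder)

# A single Copy Activity reading from a source dataset and writing to a sink dataset.
copy_activity = CopyActivity(
    name="CopySensorData",
    inputs=[DatasetReference(reference_name="SensorDataRaw")],
    outputs=[DatasetReference(reference_name="SensorDataStaged")],
    source=BlobSource(),
    sink=BlobSink(),
)

# Publish the pipeline; it can then be run manually or attached to a trigger.
pipeline = PipelineResource(activities=[copy_activity], parameters={})
adf_client.pipelines.create_or_update(rg_name, df_name, "CopySensorPipeline", pipeline)
```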

Note

In organizations that prefer developing ELT scripts using SQL Server Integration Services (SSIS), you can configure and provision an SSIS Integration Runtime from within ADF to enable a lift and shift of your SSIS packages.

We can schedule the workflow for execution when we are satisfied with our design. Pipeline runs can be instantiated manually or by triggers, passing arguments to the parameters (e.g., datasets or linked services) defined in the pipeline.

Once deployed, you can monitor the success of your activities and pipelines. ADF supports pipeline monitoring via Azure Monitor, APIs, PowerShell, Azure Monitor logs, and health indicators in the Azure Portal.
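
Continuing the earlier sketch, a pipeline run can be started and its status polled from the same SDK; the names below are the placeholders used above.

```python
import time

from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

credential = ClientSecretCredential(
    tenant_id="<tenant-id>", client_id="<app-id>", client_secret="<app-secret>"
)
adf_client = DataFactoryManagementClient(credential, "<subscription-id>")

rg_name, df_name = "iot-rg", "iot-data-factory"

# Start a run of the pipeline; arguments for pipeline parameters would be passed here.
run = adf_client.pipelines.create_run(
    rg_name, df_name, "CopySensorPipeline", parameters={}
)

# Poll the run until it reaches a terminal state.
while True:
    pipeline_run = adf_client.pipeline_runs.get(rg_name, df_name, run.run_id)
    if pipeline_run.status not in ("Queued", "InProgress"):
        break
    time.sleep(15)

print("Pipeline run completed with status:", pipeline_run.status)
```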

Query Services Across Diverse Data

Microsoft offers a couple of services enabling queries across semi-structured and structured data sources. Azure Data Explorer requires moving data to a common database, while PolyBase simply accesses data through external tables as if the data were local.

Microsoft describes Azure Data Explorer as a fast and scalable data exploration service. The cluster and database service can load incoming streaming data and/or copy data from other data sources through Event Hubs, Event Grid, Kafka, the Python and Node SDKs, the .NET Standard SDK, Logstash, and ADF.

You begin by creating a cluster (including name and compute specifications) in a resource group using the Azure Portal. You then create a Data Explorer database and define a retention period (in days) and a cache period (in days). You start the cluster (and can later stop it to minimize cost). You next create a target table in the Azure Data Explorer cluster using a create table command through the query interface and provide a mapping of incoming JSON data to column names and data types in the table. Lastly, you add a data connection.
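
As a hedged sketch of the table-creation and mapping steps, the Kusto control commands below are issued through the azure-kusto-data Python client. The cluster URI, database, table, and column names are all assumptions for illustration.

```python
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

cluster_uri = "https://<cluster-name>.<region>.kusto.windows.net"  # placeholder
kcsb = KustoConnectionStringBuilder.with_aad_device_authentication(cluster_uri)
client = KustoClient(kcsb)

database = "IoTTelemetry"  # placeholder database name

# Create the target table for incoming device readings.
create_table_cmd = (
    ".create table SensorReadings "
    "(Timestamp: datetime, DeviceId: string, Temperature: real, Humidity: real)"
)
client.execute_mgmt(database, create_table_cmd)

# Map fields in the incoming JSON documents to the table's columns.
mapping_json = (
    '[{"column":"Timestamp","path":"$.timestamp","datatype":"datetime"},'
    '{"column":"DeviceId","path":"$.deviceId","datatype":"string"},'
    '{"column":"Temperature","path":"$.temperature","datatype":"real"},'
    '{"column":"Humidity","path":"$.humidity","datatype":"real"}]'
)
create_mapping_cmd = (
    ".create table SensorReadings ingestion json mapping 'SensorMapping' '"
    + mapping_json
    + "'"
)
client.execute_mgmt(database, create_mapping_cmd)
```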

You can query data using Azure Data Explorer in a web-based interface. You can also write queries using the Kusto Query Language (KQL). Queries generally begin by referencing a table or function. Some of the advanced capabilities of KQL include scalar and tabular operators, time series analysis and anomaly detection, aggregations, forecasting, and machine learning.
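
The same Python client can run KQL queries against the table created above; the query here is a simple aggregation sketch using the placeholder names from the earlier example.

```python
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

cluster_uri = "https://<cluster-name>.<region>.kusto.windows.net"  # placeholder
kcsb = KustoConnectionStringBuilder.with_aad_device_authentication(cluster_uri)
client = KustoClient(kcsb)

# Average temperature per device over the last hour, in 5-minute bins.
query = """
SensorReadings
| where Timestamp > ago(1h)
| summarize AvgTemp = avg(Temperature) by DeviceId, bin(Timestamp, 5m)
| order by DeviceId asc, Timestamp asc
"""

response = client.execute("IoTTelemetry", query)
for row in response.primary_results[0]:
    print(row["DeviceId"], row["Timestamp"], row["AvgTemp"])
```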

PolyBase is built into SQL Server instances and Azure SQL Data Warehouse. You access data that resides in Azure Storage, Hadoop, and other file systems and databases using external table definitions. Connections to the data repositories are made via ODBC. The definitions and user permissions are stored in the database. Query performance can be improved by configuring PolyBase scale-out groups.
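
The sketch below shows what such external table definitions typically look like, issued here from Python over ODBC with pyodbc. It assumes a database master key and a database scoped credential for the storage account (named AzureStorageCredential) already exist; server, container, and column names are placeholders, and exact options vary by SQL engine version.

```python
import pyodbc

# Placeholder connection string to the SQL instance or SQL Data Warehouse
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<server>.database.windows.net;DATABASE=<database>;"
    "UID=<user>;PWD=<password>"
)
conn.autocommit = True  # external object DDL cannot run inside a user transaction
cursor = conn.cursor()

# External data source pointing at an Azure Blob Storage container.
cursor.execute("""
CREATE EXTERNAL DATA SOURCE SensorBlobStorage
WITH (TYPE = HADOOP,
      LOCATION = 'wasbs://sensordata@<storageaccount>.blob.core.windows.net',
      CREDENTIAL = AzureStorageCredential);
""")

# File format describing the delimited text files in the container.
cursor.execute("""
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));
""")

# External table: queries against it read the files in place, as if the data were local.
cursor.execute("""
CREATE EXTERNAL TABLE dbo.SensorReadingsExternal (
    DeviceId NVARCHAR(50),
    ReadingTime DATETIME2,
    Temperature FLOAT
)
WITH (LOCATION = '/readings/',
      DATA_SOURCE = SensorBlobStorage,
      FILE_FORMAT = CsvFormat);
""")
```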

Figure 7-4 illustrates a typical scale-out group consisting of a head node and multiple compute nodes. The illustration of the head node shows a SQL instance, the PolyBase engine, and the PolyBase Data Movement Service (DMS). The head node is the location to which queries are submitted. The PolyBase engine parses queries on external data, generates the query plan, and distributes work to the compute nodes’ DMS. Compute nodes contain only a SQL instance and the PolyBase DMS and can be added in whatever numbers are needed to scale processing to meet query performance requirements.
../images/480071_1_En_7_Chapter/480071_1_En_7_Fig4_HTML.jpg
Figure 7-4

PolyBase scale-out group

PolyBase is sometimes used in combination with ADF, especially when the goal is to speed data transformations and updates. Behind the scenes, it provides the copy capabilities into the data warehouse that we described earlier in the ADF section of this chapter.

Note

Key PolyBase features, including support for certain sources, vary based on the version of the SQL engine that PolyBase is bundled with. Always first check the Microsoft documentation regarding feature availability for the version that you plan to deploy.

Connecting On-Premises Networks to Azure

There are multiple ways by which you can connect on-premises networks to an Azure Virtual Network (VNet). They include
  • Virtual Private Network (VPN) connection

  • Azure ExpressRoute connection

  • ExpressRoute connection with VPN failover

A VPN connection includes an Azure VPN gateway that is used to send encrypted traffic between the Azure VNet and an on-premises network. VPN connections use the public Internet and were limited to speeds of 1.25 Gb per second at the time this book was published. Network traffic received by the VPN gateway is routed to an internal load balancer when traffic is being sent to applications in Azure.

Figure 7-5 shows the Azure Portal interface used in configuring a VPN Gateway. Note that there is also an option to choose ExpressRoute as the gateway type which then presents a different set of prompts.
../images/480071_1_En_7_Chapter/480071_1_En_7_Fig5_HTML.jpg
Figure 7-5

Configuring a VPN Gateway through the Azure Portal
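
The same gateway can be provisioned programmatically. The following is a minimal sketch using the azure-mgmt-network package, assuming a VNet with a GatewaySubnet and a public IP address already exist; all resource IDs and names are placeholders, and model details can vary by SDK version.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.network.models import (
    SubResource,
    VirtualNetworkGateway,
    VirtualNetworkGatewayIPConfiguration,
    VirtualNetworkGatewaySku,
)

network_client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg = "iot-network-rg"  # placeholder resource group

subnet_id = (
    "/subscriptions/<subscription-id>/resourceGroups/iot-network-rg/providers/"
    "Microsoft.Network/virtualNetworks/iot-vnet/subnets/GatewaySubnet"
)
public_ip_id = (
    "/subscriptions/<subscription-id>/resourceGroups/iot-network-rg/providers/"
    "Microsoft.Network/publicIPAddresses/iot-vpn-gw-ip"
)

gateway = VirtualNetworkGateway(
    location="eastus",
    gateway_type="Vpn",          # the alternative gateway type is "ExpressRoute"
    vpn_type="RouteBased",
    sku=VirtualNetworkGatewaySku(name="VpnGw1", tier="VpnGw1"),
    ip_configurations=[
        VirtualNetworkGatewayIPConfiguration(
            name="gwipconfig",
            subnet=SubResource(id=subnet_id),            # must be the GatewaySubnet
            public_ip_address=SubResource(id=public_ip_id),
        )
    ],
)

# Gateway provisioning is a long-running operation (often 30+ minutes).
poller = network_client.virtual_network_gateways.begin_create_or_update(
    rg, "iot-vpn-gateway", gateway
)
print("Provisioned:", poller.result().name)
```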

A VPN appliance must also be present on-premises providing external connectivity for that network. The VPN appliance might be a dedicated hardware device or a software service (such as the Routing and Remote Access Service in Windows Server).

Traffic over the public Internet is encrypted and flows through an IPsec tunnel. Given possible latency challenges when deploying VPN connections, they are generally used in situations where traffic is considered light. Figure 7-6 illustrates a VPN connection in a site-to-site configuration. Multiple VPN connections are typically present to help scale bandwidth.
../images/480071_1_En_7_Chapter/480071_1_En_7_Fig6_HTML.jpg
Figure 7-6

VPN site-to-site configuration

Azure ExpressRoute connections use private, dedicated Layer 2 or Layer 3 circuits provided by third-party network providers. Bandwidths of up to 10 Gb per second are possible. Some providers offer dynamic scaling of bandwidth to meet changing requirements and to enable charge-back for the bandwidth being used. High-bandwidth routers are required for connection to on-premises networks. Microsoft edge routers provide connections to the Azure VNet in the cloud. Figure 7-7 illustrates a private ExpressRoute connection in a site-to-site configuration.
../images/480071_1_En_7_Chapter/480071_1_En_7_Fig7_HTML.jpg
Figure 7-7

ExpressRoute site-to-site configuration

ExpressRoute with VPN failover combines the two options just described. Both types of connections are configured: ExpressRoute circuits provide connectivity under normal conditions, and VPN connections are configured as a failover path to provide connectivity if ExpressRoute connectivity is lost.

Within the backend cloud, a hub-and-spoke topology can be deployed in Azure to isolate workloads while still sharing common services. For example, network virtual appliances and DNS servers can be shared across different workloads, different departments, or different stages of development and deployment.

The hub-and-spoke topology is deployed using CLI scripts found on GitHub. An Azure VNet serves as the hub in the topology, providing a central connection to the on-premises network and shared services for the other VNets that serve as spokes. Each spoke VNet is connected to the hub VNet by peering, enabling traffic exchange between each spoke and the hub. The spoke VNets enable isolation and separate management of workloads.
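
As a minimal sketch of just the peering step, the following uses azure-mgmt-network to peer a spoke VNet with the hub in both directions. The subscription, resource group, and VNet names are placeholders, the hub is assumed to already have a VPN or ExpressRoute gateway, and the referenced CLI scripts perform the full deployment end to end.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.network.models import SubResource, VirtualNetworkPeering

network_client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg = "iot-network-rg"  # placeholder resource group

hub_vnet_id = (
    "/subscriptions/<subscription-id>/resourceGroups/iot-network-rg/providers/"
    "Microsoft.Network/virtualNetworks/hub-vnet"
)
spoke_vnet_id = (
    "/subscriptions/<subscription-id>/resourceGroups/iot-network-rg/providers/"
    "Microsoft.Network/virtualNetworks/spoke1-vnet"
)

# Hub-to-spoke peering: allows traffic from the hub and offers gateway transit.
network_client.virtual_network_peerings.begin_create_or_update(
    rg, "hub-vnet", "hub-to-spoke1",
    VirtualNetworkPeering(
        remote_virtual_network=SubResource(id=spoke_vnet_id),
        allow_virtual_network_access=True,
        allow_gateway_transit=True,   # spokes may use the hub's gateway
    ),
).result()

# Spoke-to-hub peering: the spoke uses the hub's gateway for on-premises connectivity.
network_client.virtual_network_peerings.begin_create_or_update(
    rg, "spoke1-vnet", "spoke1-to-hub",
    VirtualNetworkPeering(
        remote_virtual_network=SubResource(id=hub_vnet_id),
        allow_virtual_network_access=True,
        use_remote_gateways=True,
    ),
).result()
```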

Bulk Data Transfer

Legacy data warehouses and remote IoT databases are sometimes terabytes or more in size. Network connections into Azure might be too slow to provide timely and reliable data transfers. In such scenarios, offline data transfer can make sense, and Microsoft offers the Azure Data Box family of solutions for this purpose.

You can order the Data Box through the Azure Portal. Microsoft ships the Data Box to your site where you upload data through a local web user interface. You then ship it back to a Microsoft Azure Data Center, and the data is uploaded there into your Azure Storage account. You can track this process through the Azure Portal.

The Data Box can be used for one-time migration of very large data quantities. If data is subsequently gathered and needs to be uploaded to Azure, the incremental volumes for updates are usually a fraction of the size and can likely be handled by network connections to Azure. If remote locations periodically become disconnected and very large data volumes are again gathered, a Data Box can be requested again and used to move these larger data volumes to Azure.

The Azure Data Box comes in three variations: Data Box Disk (40 TB of disk, 35 TB usable), the standard Data Box (100 TB of disk, 80 TB usable), and Data Box Heavy (1,000 TB of disk, 800 TB usable). You can also send up to ten of your own disks to an Azure data center for loading. Figure 7-8 illustrates the Azure Data Box choices viewed through the Azure Portal.
../images/480071_1_En_7_Chapter/480071_1_En_7_Fig8_HTML.png
Figure 7-8

Azure Data Box creation

Alternatively, you might choose online data transfer. Data Box Gateway is a virtual device that resides on premises and supports the NFS and SMB protocols. The virtual device transfers data to Azure block blobs, Azure page blobs, or Azure Files.

Azure Data Box Edge is a physical device with the gateway capabilities of the Data Box Gateway. It can additionally be used to preprocess data, including aggregating data, modifying data (e.g., removing sensitive data), selecting and sending only subsets of data, and analyzing and reacting to IoT events locally. You can apply ML models at the edge before data is transferred.

For example, the Data Box Edge can be used to capture video and, configured with the Azure IoT Edge runtime, can push video frames through an FPGA-based AI model. The source code for such an application is posted on GitHub.

Figure 7-9 illustrates the Azure Portal interface used in creating the Data Box Gateway and Azure Data Box Edge.
../images/480071_1_En_7_Chapter/480071_1_En_7_Fig9_HTML.png
Figure 7-9

Azure Data Box Gateway/Edge creation

Azure Data Catalog

Finding the location of data that you might need within your IoT architecture can be challenging. As you’ve seen in previous architecture diagrams, the data could be stored in a variety of locations and in a variety of types of data management systems.

Azure Data Catalog is a cloud-based service where data management locations in your implementation can be registered and metadata describing the data in each location is stored. The metadata is searchable and can be enhanced by users of the Data Catalog, enabling crowdsourcing of descriptions, tags, and other descriptive metadata. The search capability makes finding where data is located much simpler.

Typical metadata includes the following (an illustrative record follows the list):
  • Name of asset

  • Type of asset

  • Description of asset

  • Names of attributes or columns

  • Data types of attributes or columns

  • Descriptions of attributes or columns
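
The record below is a purely illustrative Python dictionary showing the kind of metadata the catalog holds for one registered asset; the field names are descriptive only and do not reflect the Data Catalog API's actual schema.

```python
# Illustrative only: the shape of metadata Azure Data Catalog maintains for an asset.
sensor_facts_metadata = {
    "asset_name": "SensorFacts",
    "asset_type": "Table",
    "description": "Hourly aggregated sensor readings loaded from IoT Hub",
    "columns": [
        {"name": "DeviceId",    "data_type": "nvarchar(50)", "description": "Registered device identifier"},
        {"name": "ReadingTime", "data_type": "datetime2",    "description": "Hour of the aggregated readings"},
        {"name": "AvgTemp",     "data_type": "float",        "description": "Average temperature for the hour"},
    ],
    # Crowdsourced annotations added by catalog users
    "tags": ["sensor", "IoT", "telemetry"],
    "experts": ["data.engineer@example.com"],
}
```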

Figure 7-10 illustrates an interface in the first-generation version of Azure Data Catalog. We searched for data tagged as “sensor.” On the first page of search results, we found that sensor data is stored in SQL Data Warehouses, Hive databases and tables, and an Azure container. When we select one of the SQL Data Warehouse locations, the left side of the interface shows the available filters, assigned experts, glossary terms, and user tags associated with the sensor facts table. On the right side, we can provide a friendly name and a better description, add experts by providing their email addresses, add tags, change connection information, and take ownership of management.
../images/480071_1_En_7_Chapter/480071_1_En_7_Fig10_HTML.png
Figure 7-10

Azure Data Catalog view of sensor data management locations

Through the interface, we can also open the tables with a variety of tools. For the sensor facts table, we can browse the data contained in the table in Excel (all of it or just the first 1000 entries) or use SQL Server Data Tools, Power BI Desktop, or Power Query. If we choose to explore a database, we see all the tables present in the database and can explore each one.

Data Historians and Integration with Azure

Vendors of earlier-generation IoT solutions designed those solutions for entirely on-premises deployment. They provided landing spots for device data in on-premises time series data stores called data historians. Today, many of these vendors are moving key components to the cloud, and the components they choose to move vary from vendor to vendor. Here, we’ll look at how OSIsoft and PTC (with ThingWorx) are leveraging Azure. As you might expect, both leverage the Azure IoT Hub to land streaming data in the cloud.

OSIsoft chose to leverage Microsoft Azure data and analytics solutions on the backend and to fully deliver their platform at the edge. The OSIsoft solution is built upon the PI System, which is deployed on premises where devices are located. The PI Server serves as a data historian, capturing, storing, and managing data produced by the edge devices. Key components in the PI Server include
  • PI Data Archive. A time series database in which data is tagged so that queries can use metadata in addition to date and time.

  • PI Asset Framework. Maps sensor readings in tags into models (e.g., parent–child relationships) enabling determination of the operational condition of devices.

  • Asset Analytics. Enables real-time device metrics to be viewed and real-time analytics to be applied at the edge.

  • PI Event Frames. Conditions can be defined to produce events of interest tied to analytics, anomaly detection, and notifications.

  • PI Notifications. Enables alerting of operators when defined thresholds are met or exceeded, or failures or other anomalies occur.

Streaming integration from the PI System to Azure is provided by the PI Integrator Advanced Edition, which serves as a gateway to the Azure IoT Hub. Batch integration is provided by the PI Integrator Standard Edition, which feeds data directly into Azure SQL Database or Azure SQL Data Warehouse. Both are pictured in Figure 7-11.
../images/480071_1_En_7_Chapter/480071_1_En_7_Fig11_HTML.jpg
Figure 7-11

OSIsoft PI System integration with Azure components

Note the similarity on the backend to previously described Azure architectures. PI Vision provides an additional mobile/desktop interface in the OSIsoft solution.

In comparison, PTC chose to develop ThingWorx with components deployed across the entire ecosystem. Within the IoT Edge, PTC provides Software Content Management (ThingWorx SCM), Remote Access and Control (ThingWorx RAC), and Industrial Connectivity (Kepware). Microsoft’s complementary offerings include Azure Stream Analytics, Azure ML, and Azure Functions in containers.

In the backend, the ThingWorx historian is hosted on Azure Database for PostgreSQL. PTC also provides an Asset Advisor and machine learning (ML) solution. Microsoft provides the Azure IoT Hub, Blob Storage, Time Series Insights, Azure Active Directory, and Azure Machine Learning as important complementary components.

Figure 7-12 illustrates how all these components fit together in a reference architecture.
../images/480071_1_En_7_Chapter/480071_1_En_7_Fig12_HTML.jpg
Figure 7-12

PTC ThingWorx integration with Azure components

Not all the components pictured in the two previous architectures are always implemented. The historians for each of these companies are widely present in their installed bases, so the components directly tied to the historians tend to have the greatest adoption.
