In this chapter, we explore processing, analyzing, and visualizing data that lands in the Azure cloud in greater depth than the introduction provided in Chapter 2. Our goal is to help you understand how each of the platforms and tools described is best utilized as you consider its inclusion in your own architecture. You should gain insight into where and how to deploy them.
As data coming from IoT devices is most often semi-structured, we focus the data management discussion in this chapter on Azure HDInsight and Cosmos DB. Data warehouses are also often part of the architecture as they enable business intelligence and analytics solutions where the data lines up neatly into rows and columns. We’ll address how they and associated tools can fit into this architecture in Chapter 7 when we consider integration with legacy data solutions.
Among the topics we cover in this chapter are the following:
Azure Stream Analytics
Time Series Insights
Azure Databricks
Semi-structured Data Management (Azure HDInsight and Cosmos DB)
Azure Machine Learning
Cognitive Services
Data Visualization and Power BI
Azure Bot Service and Bot Framework
Azure Stream Analytics
Azure Stream Analytics is an in-memory streaming analytics and event-processing engine designed to run transformation queries against input coming from IoT Hubs, Event Hubs, and Azure Blob Storage. It can be deployed in Azure or at the edge in containers deployed to devices.
Transformation queries are based on SQL and are used for filtering, sorting, aggregating, and joining streaming data, or for applying geospatial functions. You can also define calls to the Azure Machine Learning service and/or create user-defined JavaScript or C# functions to run in your jobs. Stream Analytics jobs can be created using the Azure Portal, Azure PowerShell, or Visual Studio.
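A typical transformation query computes a windowed aggregate, such as the average reading per device over a tumbling window. The following plain-Python sketch (not the Stream Analytics engine itself; the event shape and field names are illustrative) mimics what a query grouping by device ID and a 60-second tumbling window computes:

```python
from collections import defaultdict

def tumbling_window_avg(events, window_seconds=60):
    """Average each device's temperature per tumbling window,
    mimicking GROUP BY deviceId, TumblingWindow(second, 60)."""
    buckets = defaultdict(list)
    for e in events:
        # Each event falls into exactly one non-overlapping window
        window_start = (e["ts"] // window_seconds) * window_seconds
        buckets[(e["deviceId"], window_start)].append(e["temp"])
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}

events = [
    {"deviceId": "d1", "ts": 0,  "temp": 20.0},
    {"deviceId": "d1", "ts": 30, "temp": 22.0},
    {"deviceId": "d1", "ts": 65, "temp": 25.0},  # lands in the next window
    {"deviceId": "d2", "ts": 10, "temp": 18.0},
]
print(tumbling_window_avg(events))
```

In the real service, the engine handles event-time ordering, partitioning, and late arrivals for you; the sketch only conveys the grouping semantics.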
Stream Analytics can process millions of events every second in Azure. Through partitioning, complex queries can be parallelized and executed on multiple nodes. The Stream Analytics SLA guarantees 99.9 percent availability, evaluated over one-minute intervals. Built-in checkpoints provide recoverability if delivery of an event fails.
Output can be sent to a monitored queue (such as to an Azure Service Bus, Azure Functions, or Azure Event Hubs) to trigger alerts or custom workflows. Data can be stored in downstream Azure data management solutions such as Azure Data Lake Storage, Cosmos DB, SQL Database, or SQL Data Warehouse and is often visualized in Power BI.
When you create a new job using the Azure Portal, you begin by defining the job name, choosing a subscription and resource group, choosing a location, and indicating the hosting environment and (for cloud deployments) the number of streaming units that provide a pool of computation resources.
Streaming inputs can be defined coming from IoT Hubs, Event Hubs, or Blob Storage. Reference inputs can be defined coming from Blob Storage or a SQL Database. Outputs can be designated to Event Hubs, SQL Database, Blob Storage, Table storage, Service Bus topics, Service Bus queues, Cosmos DB, Power BI, Azure Data Lake Storage, or Azure Functions.
Time Series Insights
IoT devices commonly send telemetry messages to the cloud in a time series (i.e., the data is timestamped). The data initially lands in Azure in the Azure IoT Hub or Azure Event Hub. Time Series Insights connects to Azure IoT Hubs and Azure Event Hubs and parses JSON from these incoming messages. Metadata is joined with telemetry, and the data is indexed in a columnar store. The data is stored in memory and SSDs for up to 400 days. It can be queried using the Time Series Insights explorer or using APIs in custom applications.
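Conceptually, Time Series Insights enriches each timestamped telemetry record with the metadata for its device so that queries can slice by both. A minimal sketch in plain Python (the field names and metadata shape are illustrative, not the TSI API):

```python
# Per-device metadata, as might be joined on ingress
metadata = {
    "d1": {"building": "HQ", "floor": 2},
    "d2": {"building": "Plant-A", "floor": 1},
}

# Timestamped JSON telemetry as parsed from the hub
telemetry = [
    {"deviceId": "d1", "timestamp": "2019-05-01T00:00:00Z", "humidity": 41},
    {"deviceId": "d2", "timestamp": "2019-05-01T00:00:05Z", "humidity": 55},
]

def enrich(events, meta):
    """Join each telemetry event with its device's metadata."""
    return [{**e, **meta.get(e["deviceId"], {})} for e in events]

for row in enrich(telemetry, metadata):
    print(row)
```

TSI then indexes the enriched records in a columnar store; the sketch shows only the join step.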
You begin deployment by defining a Time Series Insights environment to be used. The Azure Portal prompts you for an environment name, subscription, location, and pricing tier (where tiers selected define ingress rates in millions of events per day and storage capacity in millions of events). Next, you define the event source by providing a name and source type (IoT Hub or Event Hub). You then select a hub (usually an existing hub) and apply an IoT Hub access policy name. For IoT Hubs, you also set a consumer group parameter and can create an event source timestamp property name. You can then create the Time Series Insights environment.
Time series data can be monitored to determine the health of your devices. You can apply perspective views and discern patterns when performing root cause analysis. Azure Stream Analytics might also be inserted into the data flow to help you find anomalies and send alerts.
Azure Databricks
Azure Databricks enables a fully managed Apache Spark cluster in the cloud. You can program in Python, R, Scala, SQL, and Java and utilize the Spark Core API. As the entire Spark ecosystem is provided, you can use Spark SQL to work with tabular data stored in DataFrames, process and analyze streaming data in real time (with integration to HDFS, Flume, and Kafka), utilize GraphX, and access the MLlib machine learning library that includes classification, regression, clustering, collaborative filtering, and dimensionality reduction algorithms.
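Spark programs express computation as chained transformations over distributed collections. The same filter/map/reduce shape can be sketched with plain Python built-ins (a local stand-in for illustration, not actual PySpark; in Databricks you would apply the equivalent transformations to an RDD or DataFrame across the cluster):

```python
from functools import reduce

# (deviceId, temperature) readings; values above 50.0 are out of range
readings = [("d1", 21.5), ("d2", 99.0), ("d1", 22.5), ("d2", 101.0)]

# filter -> map -> reduce pipeline, the shape of a Spark Core job
hot = filter(lambda r: r[1] > 50.0, readings)   # keep out-of-range readings
temps = map(lambda r: r[1], hot)                # project the temperature
total = reduce(lambda a, b: a + b, temps)       # aggregate across partitions
print(total)  # 200.0
```

The difference in Spark is that each stage runs in parallel across worker nodes, with the runtime handling partitioning and shuffles.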
The Databricks Runtime is built upon this Spark base and can be deployed as serverless. It can also be utilized with datastores that support Spark such as Azure Data Lake Storage, Blob Storage, Cosmos DB, and Azure SQL Data Warehouse.
Through the Azure Portal, you begin by creating an Azure Databricks workspace (providing a workspace name, subscription, resource group, location, and pricing tier). You are then ready to create a Databricks cluster.
Databricks cluster creation begins with you providing a cluster name and defining the cluster mode (standard or high concurrency). You select the Databricks runtime version that you wish to deploy as well as the Python version that will be used. You next select whether you want autoscaling turned on and how many minutes of inactivity should elapse before the cluster is terminated. Next, you select the minimum and maximum number of worker nodes and the type of hardware used. You also select the type of hardware used for the driver. Advanced options can be applied including Spark configuration options, tags, logging, and init scripts.
Within notebooks, you can provide code in R, Python, Scala, or SQL and provide supporting commentary and documentation. You can visualize data using tools such as Matplotlib, ggplot, or d3. Power BI provides additional data visualization capabilities as described later in this chapter.
Semi-structured Data Management
In addition to processing and analyzing data at the edge or within the data stream, machine learning models are often developed through analysis of historical data over lengthy time periods. Such data needs to land in a data management system designed for storage and analysis at that scale.
NoSQL databases are ideal for semi-structured data. At the beginning of this century, Hadoop established itself as a popular open-source historical data store. The Hadoop version available in a PaaS offering from Microsoft is Azure HDInsight. More recently, NoSQL databases that are globally distributed have proven their ability to scale to enormous sizes. Microsoft’s PaaS offering here is Cosmos DB.
In this section of the chapter, we’ll describe Azure HDInsight and Cosmos DB. Either can be created through the Azure Portal, Azure CLI, and PowerShell. We’ll describe the creation of these data management systems using the Azure Portal.
Azure HDInsight
Azure HDInsight is Microsoft’s cloud-based offering that consists of Apache Hadoop components in the Hortonworks Data Platform (HDP). HDInsight clusters enable deployment of Hadoop, Spark for in-memory processing, Interactive Query (Hive with low-latency analytical processing, or LLAP), Kafka and Storm for processing streaming data, HBase (a NoSQL database), and/or ML Services.
Clusters are monitored using Apache Ambari and Azure Monitor. Cluster health and availability, cluster resource utilization, performance across the entire cluster, and YARN job queues are monitored with Ambari. Resource utilization at the virtual machine level is monitored using Azure Monitor. Information about the workloads being run is present in the YARN resource manager and in Azure Monitor logs.
Languages native to Hadoop include Pig Latin, HiveQL, and Spark SQL. Programming languages supported include Java, Python, .NET, and Go. Other languages, such as Scala, can be deployed in Java Virtual Machines. Typical development environments that are used include Visual Studio, Visual Studio Code, Eclipse, and IntelliJ for Scala.
Earlier releases of the distribution were deployed to either Azure Data Lake Storage (ADLS) Gen1, which features a hierarchical file system, or to Blob Storage. ADLS Gen2 combines the hierarchical file system with Blob Storage capabilities and is now commonly selected for deployment of HDInsight clusters.
An Azure Blob File System (ABFS) driver is provided with HDInsight, as well as with Databricks, providing access to this storage. If you are going to use Azure Data Lake Storage in the deployment, ADLS must be created first.
Note
Using the Azure Portal to create ADLS, you first select a subscription and resource group for the storage account, give it a name, and set the location. You can also specify performance, account kind, replication, and access tier. Next, under the Advanced options, you can set security and virtual network fields (if not satisfied with the defaults provided). In the Data Lake Storage Gen2 section, you set the hierarchical namespace to enabled.
Deploying HDInsight is a three-step process using the Azure Portal. You begin by defining basic properties including a name for the Hadoop cluster, subscription to be used, cluster login name and password, secure shell (SSH) username, password for SSH, resource group for the cluster and dependent storage account, and location. You also select the cluster type and select the version of HDInsight that you want to deploy.
Next, you select the storage type (either Azure Blob Storage or Azure Data Lake Storage) and the storage account (from your subscriptions or from another subscription by providing an access key). You can choose to preserve metadata outside of the cluster by linking a SQL database for Hive and/or Oozie.
In the third step, you receive a summary of your selections and can edit those selections. When satisfied with the choices made, you next create the cluster. Clusters can take up to 20 minutes to be created.
A common means of moving data into and out of HDInsight when connected to the IoT Hub is to use Apache Kafka. You would begin by installing the IoT Hub Connector on an edge node in the HDInsight cluster. You would then get the IoT Hub connection information, configure the connector to serve as a sink and/or source for data movement, and start the connector.
Cosmos DB
Cosmos DB is a globally distributed multi-model database. The database can manage key-value, columnar, document, and graph data. Indexing of all data is automatic, and no schema or secondary indexes are required. Data can be made accessible using SQL, the MongoDB API, the Cassandra API, the Azure Table Storage API, or the Gremlin API.
Storage and throughput are elastically scaled across regions, making it possible to handle hundreds of millions of requests per second. Since the data is globally distributed, SLAs guarantee that read and write requests complete within 10 milliseconds at the 99th percentile in the region closest to the user. SLAs of 99.999 percent for high availability can also be attained.
Cosmos DB lets you choose among five consistency levels, trading consistency against availability and latency:
Strong Consistency. Only when an operation is complete is it visible to all.
Bounded Staleness Consistency. Read operations will lag writes by a bounded number of versions (consistent prefixes) or time interval; this level preserves 99.99 percent availability.
Session Consistency. Consistent prefixes are applied with predictable consistency for a session, featuring high read throughput and low latency.
Consistent Prefix Consistency. Reads will never see out-of-order writes.
Eventual Consistency. Provides the lowest cost for reads; however, reads can see out-of-order data.
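The trade-off between the strongest and weakest of these levels can be illustrated with a toy two-replica store (purely didactic; Cosmos DB's actual replication protocol is far more sophisticated):

```python
class TwoReplicaStore:
    """Toy model: a write lands on replica 0 immediately and reaches
    replica 1 only after replicate() runs (the replication 'lag')."""
    def __init__(self):
        self.replicas = [{}, {}]
        self.pending = []

    def write(self, key, value):
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def replicate(self):
        for key, value in self.pending:
            self.replicas[1][key] = value
        self.pending.clear()

    def read_eventual(self, key):
        # May serve a stale replica: cheap, but can miss recent writes
        return self.replicas[1].get(key)

    def read_strong(self, key):
        # Visible only once all replicas agree: waits out the lag
        self.replicate()
        return self.replicas[1].get(key)

store = TwoReplicaStore()
store.write("temp", 21.0)
print(store.read_eventual("temp"))  # None: the write has not replicated yet
print(store.read_strong("temp"))    # 21.0: strong reads wait for replication
```

Eventual reads are the cheapest precisely because they never wait; strong reads pay the coordination cost to guarantee visibility.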
In IoT solutions, data typically arrives in Cosmos DB in one of several ways:
Loading data from the Databricks in-memory engine (where data initially landed in Azure in the IoT Hub and was then loaded into Databricks)
Creating stored procedures and Logic Apps, triggered through an Event Grid deployed on the IoT Hub, that write data into Cosmos DB
Deploying Azure Functions in IoT Hub message routing that write data to Cosmos DB
Azure Machine Learning
Microsoft’s machine learning offerings in Azure include the following:
Azure Machine Learning Studio
Azure Machine Learning service (including development environments)
Azure Machine Learning Studio
Azure Machine Learning Studio is an online development environment providing a drag-and-drop interface that is used in building, testing, and deploying predictive analytics solutions. At the time this book was published, experiments were limited to training sets of no more than 10 GB in size. However, a visual interface based on ML Studio integrated with the Azure Machine Learning service was in preview, enabling preparation, training, and deployment with the much larger datasets typically used by data scientists.
Drag-and-drop modules and functions are provided for building experiments that include saved datasets, trained models, transforms, data format conversions, data transformation, feature selection, machine learning, OpenCV library modules, Python language modules, R language modules, statistical functions, text analytics, time series anomaly detection, and web services. The machine learning category includes functions used in evaluation, initializing the model using anomaly detection, classification, clustering, or regression algorithms, scoring, and training. Statistical functions include math operations, linear correlation, probability distribution functions, t-test, and descriptive statistics reporting.
In the figure, we see a typical experiment data flow that begins with data input containing known outcomes, then preparing the data, splitting it for model training purposes, testing various mathematical models against the data, scoring them, and evaluating them for accuracy. Once we’re satisfied with a specific model, we convert the training experiment into a predictive experiment and can deploy it as a web service. Sample code is also provided in C#, Python, and R.
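The split/train/score/evaluate flow that ML Studio wires up graphically can be sketched in a few lines of plain Python; here a trivial one-feature least-squares model stands in for the algorithm modules, and the synthetic dataset is our own:

```python
import random

def fit_line(points):
    """Ordinary least squares for y = a*x + b over (x, y) pairs."""
    n = len(points)
    sx = sum(x for x, _ in points); sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points); sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Data input with known outcomes: y = 2x + 1
data = [(x, 2 * x + 1) for x in range(20)]
random.seed(0)
random.shuffle(data)

# Split the data for model training and testing
train, test = data[:15], data[15:]

# Train, then score the held-out test set and evaluate accuracy (MSE)
a, b = fit_line(train)
mse = sum((a * x + b - y) ** 2 for x, y in test) / len(test)
print(round(a, 3), round(b, 3), round(mse, 6))
```

In ML Studio, each of these steps is a module you drag onto the canvas; converting to a predictive experiment replaces the training path with the fitted model behind a web service endpoint.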
Azure Machine Learning Service
The Azure Machine Learning service is Microsoft’s PaaS offering used to train, deploy, and manage machine learning models at scales that data scientists typically work with. It is an open framework and can be used with open-source libraries that include MXNet, PyTorch, scikit-learn, and TensorFlow.
You begin by first generating a Machine Learning service workspace, typically through the Azure Portal. You provide a workspace name, subscription, resource group, and Azure region location for the workspace to be run.
You’ll have access to “Getting Started in Azure Notebooks,” a Forum, samples in GitHub, and the documentation when you enter the workspace. You will also have access to other features under public preview.
Azure Notebooks is a free cloud service, with packages preinstalled, that supports up to 4 GB of memory and 1 GB of data. To remove these limits, you can attach a Notebooks project to a VM running the Jupyter server or to the Azure Data Science Virtual Machine.
The Azure Data Science VM includes popular data science and related tools preinstalled and pre-configured and comes in Linux Ubuntu and Windows editions. Some of the tools that you will find here include Microsoft R/Open, Microsoft ML Server (with support for R and Python), Anaconda Python, various data management servers, Spark-based big data platforms used for development and testing, a Jupyter Notebook Server, IDE support for R Studio and Visual Studio, data movement and management tools, machine learning tools, and deep learning tools.
Microsoft developers will be happy to find that Visual Studio can also be used for building, testing, and deploying Azure Machine Learning service solutions. The code editor highlights syntax, provides intelligent code completion (known as Intellisense), and provides auto text formatting. You can debug your code locally by installing appropriate Python versions and libraries and the deep learning frameworks that you are using in your project.
Cognitive Services
Azure Cognitive Services provides APIs, SDKs, and services enabling software developers to add cognitive features into applications. As noted in Chapter 2, these services focus in the areas of vision, speech, language, search, and decision. In the building of IoT applications, vision and decision are most often considered for deployment.
The Computer Vision Service provides advanced algorithms for processing images and returning information about them. The Custom Vision Service enables building of custom image classifiers. Both services are typically used with smart cameras that capture images at the edge and perform local analysis or transmit images to the cloud where the algorithms process the data.
The Computer Vision Service has several visual features relevant in IoT applications. It can be used to detect brands, assign images to categories based on taxonomies that you define, determine accent and dominant colors, provide descriptions, detect objects, and apply tagging.
The Custom Vision Service provides an image training environment. You begin by tagging a set of training images using tags that are consistent with what you are trying to detect. For example, if you are trying to train the service to detect the types of crops in a farm field, you’d first assemble a training set of images that are tagged with the crop types you wish to detect.
Next you train the model and set a probability threshold for accuracy. The default is a goal of reaching 50 percent accuracy or above. You begin the training by simply hitting the train button shown in the previous figure.
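The probability threshold simply decides which predicted tags count as detections. A minimal sketch (the tag names and scores are invented for illustration):

```python
def detected_tags(predictions, threshold=0.5):
    """Keep only tags whose predicted probability meets the threshold,
    as the Custom Vision probability threshold setting does."""
    return [tag for tag, p in predictions if p >= threshold]

# Hypothetical per-tag probabilities returned for one image
predictions = [("corn", 0.92), ("soybeans", 0.48), ("wheat", 0.07)]

print(detected_tags(predictions))                 # default 50 percent threshold
print(detected_tags(predictions, threshold=0.4))  # lower threshold admits more tags
```

Raising the threshold trades recall for precision: fewer tags are reported, but with higher confidence.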
Custom Vision can have many other use cases. For example, models might be produced for use in visual inspection of the condition of utility lines to determine the need for their replacement, analyzing medical images for possible anomalies where further diagnosis might be needed, and determining whether components are properly aligned as they are placed into parts on a manufacturing assembly line.
Among the decision APIs, the Anomaly Detector is particularly relevant to IoT applications. You can use these RESTful APIs to detect anomalies in streaming data, leveraging previously seen data points. The APIs can also generate models that detect anomalies in JSON formatted time series datasets created in batch processes.
The APIs can provide details about the data including expected values, anomaly boundaries, and positions. Anomaly boundaries are automatically set. However, you can manually adjust the boundaries if you prefer more (or less) sensitivity in identifying anomalies.
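The idea of expected values with adjustable boundaries can be illustrated with a simple mean-plus-or-minus-k-sigma detector over a JSON-like series. This is a didactic stand-in, not the Anomaly Detector algorithm, and the `k` parameter here is our own analog of a sensitivity setting:

```python
import statistics

def find_anomalies(series, k=2.0):
    """Flag indices whose value falls outside mean +/- k*stddev.
    Lowering k tightens the boundaries and flags more points."""
    values = [p["value"] for p in series]
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    lo, hi = mu - k * sigma, mu + k * sigma
    return [i for i, v in enumerate(values) if not (lo <= v <= hi)]

# Timestamped values with one spike, shaped like a time series payload
series = [{"timestamp": f"2019-05-01T00:0{i}:00Z", "value": v}
          for i, v in enumerate([10, 11, 10, 12, 11, 60, 10, 11])]

print(find_anomalies(series))         # flags the spike at index 5
print(find_anomalies(series, k=3.0))  # wider boundaries miss the spike
```

The real service fits far richer models to the series, but the boundary-adjustment behavior is analogous: widening the boundaries reports fewer anomalies.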
Data Visualization and Power BI
Power BI is a business intelligence platform from Microsoft used in visualizing, aggregating, analyzing, and sharing data and data analysis. The Power BI service is deployed in the Microsoft cloud. The Power BI Desktop is free, downloadable software for your personal computer providing an environment to connect to data sources, develop data models, create visuals, and combine visuals into reports. Once created, you can publish these reports to the Power BI service.
When starting in Power BI Desktop, you likely will first download a sample of data to begin development. As development progresses and/or you deploy to the Power BI service, you can use Direct Query to analyze and report on the full live dataset.
In IoT scenarios, typical data sources include Blob Storage, Azure Data Lake Storage, HDInsight (HDFS, Interactive Query, and Spark), and Cosmos DB. Relational database sources that can be accessed include Azure SQL Database, Azure SQL Data Warehouse, Azure Analysis Services, Microsoft SQL Server and SQL Server Analysis Services, IBM DB2, Informix, Netezza, MySQL, Oracle, PostgreSQL, SAP HANA and Business Warehouse, Snowflake, and any database supporting ODBC. Online services such as Dynamics and Salesforce can be accessed. Additionally, file types such as Excel, XML, JSON, PDF, and text or CSV can be leveraged.
Once loaded into Power BI Desktop, you might choose to transform data in the data model. For example, you can rename tables, update data types, append tables together and cleanse data so that similar sets can be combined, and rename groups of data.
As you create the report, you can select from many different data visualizations provided. Examples of available visualizations include stacked bar charts, stacked column charts, clustered bar charts, clustered column charts, 100 percent stacked bar charts, 100 percent stacked column charts, line charts, area charts, stacked area charts, line and stacked column charts, line and clustered column charts, ribbon charts, waterfall charts, scatter charts, pie charts, donut charts, treemaps, filled maps, funnels, gauges, cards, multi-row cards, KPIs, slicers, tables, matrices, R script visuals, Python visuals, ArcGIS Maps, globe maps, tornado charts, and custom visuals that you import.
Whereas reports show data from a single dataset, dashboards can display data present from a variety of datasets and reports. As such, they can provide a more holistic view as to how a business is functioning and leverage data from IoT devices and lines of business systems.
Dashboards are created only in the Power BI service (not through the Desktop). The dashboards can be created from scratch directly from datasets, by pinning reports, or by modifying existing dashboards.
The Power BI service can also automatically surface quick insights from your datasets, including the following:
Category outliers (top and/or bottom)
Change points in a time series
Correlation
Low variance
Major factors (e.g., most of a total value comes from a single factor)
Overall trends in time series
Seasonality in time series
Steady share
Time series outliers
Note
Power BI users can be granted access to Azure Machine Learning models developed by data scientists. Power Query discovers the models the user has access to and exposes them as dynamic Power Query functions. At the time this book was published, this capability was supported in Power BI dataflows and in Power Query online in the Power BI service.
You can collaborate with others in the creation of reports and dashboards by sharing workspaces. Once created, access to reports and dashboard tiles can be made available through Microsoft Teams by adding Power BI Tabs to channels and pointing to the report or tile link. Reports can also be printed (including as PDFs) or embedded into portals.
Reports and dashboards in the Power BI service can also be shared directly to e-mail addresses where the individuals will have the same access as the publisher (unless row-level security applied to the dataset restricts them). When granting access, the publisher can choose to allow the recipient to also share the report or dashboard or build new content using the underlying dataset.
Azure Bot Service and Bot Framework
Bots provide a question and answer or natural language interface akin to talking to a human or intelligent robot. The Azure Bot Service and Bot Framework provide tools used in building, testing, deploying, and managing intelligent bots. Microsoft provides an extensible framework that includes the SDK, tools, templates, and AI services.
You can extend your bot’s functionality by using Microsoft’s QnA Maker to set up a knowledge base to answer questions. Natural language understanding is accomplished by leveraging LUIS in Cognitive Services. Multiple models can be managed and leveraged during a bot conversation. Graphics, menus, cards, and buttons can be added to text to complete the experience.
For example, you might use QnA Maker as a front end for users that then pushes SQL to backend data management systems. You might also use a bot to push a command to an IoT edge device.
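A bot front end ultimately maps recognized intents and entities to backend actions such as a parameterized query or a device command. A minimal dispatch sketch in plain Python (the intent names, entity shape, and command format are invented for illustration, not the Bot Framework SDK):

```python
def dispatch(intent, entities):
    """Route a recognized intent (e.g., from LUIS) to a backend action."""
    if intent == "GetTemperature":
        # Would be sent to the data tier as a parameterized query
        return ("sql", "SELECT temp FROM readings WHERE device_id = ?",
                (entities["device"],))
    if intent == "RebootDevice":
        # Would be sent to the device via IoT Hub as a direct method call
        return ("device-command", entities["device"], "reboot")
    return ("fallback", "Sorry, I did not understand that.")

print(dispatch("GetTemperature", {"device": "d1"}))
print(dispatch("RebootDevice", {"device": "d2"}))
```

Keeping this mapping layer separate from the conversation logic makes it easy to test the bot's actions without a live channel.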
Microsoft provides a Bot Framework Emulator useful in debugging and interrogating your bot. Once you have configured your bot in the Azure Portal, the bot can also be reached through a web chat interface for testing. When testing is complete, you can publish your bot to Azure or a web service.
Once deployed, you can gather data in the Azure Portal related to traffic, latency, users, messages, and channels. You can use this data to determine how best to improve the capabilities and performance of your bot.