Ingestion – streaming, processing, and data lakes

An IoT device is usually associated with a sensor whose purpose is to measure or monitor the physical world. It does so asynchronously with respect to the rest of the IoT technology stack; that is, a sensor is always attempting to broadcast data, whether or not a cloud or fog node is listening. This matters because the data itself is where the business value lies. Even if most of the data produced is redundant, there is always the possibility that a significant event will occur. This continuous flow of readings is the data stream.

The IoT stream from a sensor to a cloud is assumed to be:

  • Constant and never-ending
  • Asynchronous
  • Either structured or unstructured
  • As close to real-time as possible

We discussed the cloud latency problem earlier, in Chapter 10, Cloud and Fog Topologies. We also learned about the need for fog computing to help resolve the latency issue, but even without fog computing nodes, efforts are taken to optimize the cloud architecture to support IoT real-time needs. To do this, a cloud must maintain a flow of data and keep it moving. Essentially, data moving from one service to another in the cloud should do so as a pipeline, where the arrival of data triggers the next function, without the need to poll for data. The alternative is batch processing, where data accumulates and is processed at intervals. Most hardware architectures treat data flow the same way, moving data from one block to another as it arrives. Additionally, careful use of storage and filesystem access is critical to reducing overall latency.

For this reason, most streaming frameworks support in-memory operations and avoid the cost of temporary storage to a mass filesystem altogether. Michael Stonebraker called out the importance of streaming data in this fashion; see Michael Stonebraker, Uğur Çetintemel, and Stan Zdonik, "The 8 Requirements of Real-Time Stream Processing," SIGMOD Record 34, 4 (December 2005), 42-47. A well-designed message queue assists with this pattern. Building a cloud architecture that scales from hundreds of nodes to millions requires careful consideration.
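To make the pipeline pattern concrete, the following is a minimal sketch in Python (the stage names and the stand-in sensor feed are hypothetical) of an in-memory pipeline: a bounded queue connects two stages, and the arrival of data, not polling, triggers the next function. Nothing touches the filesystem:

    # Minimal in-memory pipeline sketch: a bounded queue connects two
    # stages, and data arrival (not polling) drives the next function.
    # Stage names and the fake sensor feed are illustrative assumptions.
    import queue
    import threading

    events = queue.Queue(maxsize=1024)  # bounded buffer absorbs short spikes

    def ingest_stage():
        """Producer: push raw sensor readings into the pipeline."""
        for reading in range(10):        # stand-in for a real sensor feed
            events.put({"sensor_id": "s1", "value": reading})
        events.put(None)                 # sentinel: end of stream

    def process_stage():
        """Consumer: blocks until data arrives, then processes immediately."""
        while True:
            item = events.get()          # wakes on arrival; no polling loop
            if item is None:
                break
            print("processed", item["sensor_id"], item["value"])

    t1 = threading.Thread(target=ingest_stage)
    t2 = threading.Thread(target=process_stage)
    t1.start(); t2.start()
    t1.join(); t2.join()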

The data stream will also not be perfect. With hundreds to thousands of sensors streaming asynchronously, more often than not data will be missing (a sensor lost communication), poorly formed (an error in transmission), or out of sequence (data may flow to the cloud over multiple paths); a defensive ingest stage handling these cases is sketched after the following list. At a minimum, a streaming system must:

  • Scale with event growth and spikes
  • Provide a publish/subscribe API for clients to interface with
  • Approach near-real-time latency
  • Provide scaling of rule processing
  • Support data lakes and data warehousing
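
The following is a minimal, hypothetical sketch of the defensive ingest step mentioned above: it drops malformed records, skips records with missing fields, and re-sequences out-of-order events within a small buffering window before handing them downstream. The field names and window size are assumptions for illustration:

    # Hypothetical defensive ingest sketch: tolerate malformed, missing,
    # and out-of-order events before passing them downstream.
    import heapq
    import itertools
    import json

    REORDER_WINDOW = 16            # assumption: events buffered to re-sequence
    _tiebreak = itertools.count()  # keeps the heap stable on equal timestamps

    def clean_stream(raw_lines):
        """Yield well-formed events in timestamp order, within the window."""
        heap = []
        for line in raw_lines:
            try:
                event = json.loads(line)       # malformed records are dropped
            except json.JSONDecodeError:
                continue
            if "timestamp" not in event or "sensor_id" not in event:
                continue                       # incomplete records are dropped
            heapq.heappush(heap, (event["timestamp"], next(_tiebreak), event))
            if len(heap) > REORDER_WINDOW:     # emit the oldest buffered event
                yield heapq.heappop(heap)[2]
        while heap:                            # end of stream: drain the buffer
            yield heapq.heappop(heap)[2]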

Apache provides several open source software projects (under the Apache 2.0 license) that assist with building a stream processing architecture. Apache Spark is a stream processing framework that processes data in small batches (micro-batches). It is particularly useful when memory on a cloud cluster is constrained (for example, less than 1 TB). Spark is built on in-memory processing, which reduces filesystem dependency and latency, as mentioned previously. Another advantage of working on batched data is that it suits machine learning models, which will be covered later in this chapter; several models, such as convolutional neural networks, can work on data in batches. An alternative from Apache is Storm. Storm attempts to process data as close to real time as possible in a cloud architecture. It has a lower-level API than Spark and processes data as discrete events rather than dividing the stream into batches, which gives it low latency (sub-second performance).
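As a concrete illustration of Spark's micro-batch model, here is a minimal PySpark Structured Streaming sketch that reads sensor events from a Kafka topic and computes a windowed average. The broker address, topic name, and event schema are hypothetical, and the Spark-Kafka connector package is assumed to be available:

    # Minimal PySpark Structured Streaming sketch (micro-batch processing).
    # Broker address, topic name, and event schema are assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import avg, col, from_json, window
    from pyspark.sql.types import (DoubleType, StringType, StructField,
                                   StructType, TimestampType)

    spark = SparkSession.builder.appName("iot-microbatch").getOrCreate()

    schema = StructType([
        StructField("sensor_id", StringType()),
        StructField("temperature", DoubleType()),
        StructField("event_time", TimestampType()),
    ])

    # Each micro-batch pulls the newest records from the Kafka topic.
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "kafka-broker:9092")
           .option("subscribe", "sensor-events")
           .load())

    events = (raw.select(from_json(col("value").cast("string"), schema)
                 .alias("e")).select("e.*"))

    # One-minute tumbling-window average per sensor; the watermark lets
    # Spark tolerate events that arrive up to two minutes late.
    averages = (events
                .withWatermark("event_time", "2 minutes")
                .groupBy(window("event_time", "1 minute"), "sensor_id")
                .agg(avg("temperature").alias("avg_temp")))

    query = (averages.writeStream
             .outputMode("append")
             .format("console")
             .start())
    query.awaitTermination()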

To feed the stream processing frameworks, we can use Apache Kafka or Flume. Apache Kafka is a distributed message broker that can accept data (for example, bridged from MQTT) on the ingest side from various IoT sensors and clients, and connects to Spark or Storm on the outbound side. MQTT itself doesn't buffer data, so if thousands of clients are communicating with the cloud over MQTT, some system is needed to react to the incoming stream and provide the necessary buffering. Kafka fills this role: it scales on demand (another important cloud attribute) and reacts well to spikes in events; a stream of 100,000 events per second can be supported with Kafka. Flume, on the other hand, is a distributed system to collect, aggregate, and move data from one source to another, and is slightly easier to use out of the box. It is also tightly integrated with Hadoop. Flume is slightly less scalable than Kafka, since adding more consumers means changing the Flume architecture. Both ingestors could stream data in-memory without ever storing it; however, generally we don't want to do that. We want to take the raw sensor data and store it in as raw a form as possible, alongside all the other sensor data streaming in simultaneously.
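To make the ingest path concrete, below is a minimal, hypothetical bridge in Python that subscribes to MQTT sensor topics and republishes each payload into a Kafka topic, where it can be buffered and replayed for downstream consumers. It assumes the paho-mqtt (1.x API) and kafka-python packages, and the broker addresses and topic names are placeholders:

    # Hypothetical MQTT-to-Kafka bridge (paho-mqtt 1.x and kafka-python).
    # Broker addresses and topic names below are placeholder assumptions.
    import json

    import paho.mqtt.client as mqtt
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="kafka-broker:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    def on_message(client, userdata, msg):
        # Republish each MQTT payload into Kafka, which buffers the stream
        # and lets consumers (Spark, Storm) read at their own pace.
        producer.send("sensor-events", {
            "topic": msg.topic,
            "payload": msg.payload.decode("utf-8"),
        })

    client = mqtt.Client()
    client.on_message = on_message
    client.connect("mqtt-broker", 1883)   # assumed MQTT broker address
    client.subscribe("sensors/#")         # all sensor topics
    client.loop_forever()                 # block and dispatch messages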

When we think of IoT deployments with thousands or millions of sensors and end nodes, a cloud environment may make use of a data lake. A data lake is essentially a massive storage facility holding raw, unfiltered data from many sources. Data lakes are flat filesystems: whereas a typical filesystem is organized hierarchically, with volumes, directories, folders, and files, a data lake organizes the elements in its storage by attaching metadata elements (tags) to each entry. The classic data lake model is Apache Hadoop, and nearly all cloud providers use some form of data lake underneath their services.
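As one possible illustration of this flat, tag-based storage pattern, the following sketch writes a raw payload into an S3-style object store via boto3. The bucket name, object key, and tag values are hypothetical; the point is that each entry lives under a flat key and is located later by its metadata rather than by a directory hierarchy:

    # Sketch of flat data lake storage: raw payloads stored under flat keys,
    # located later by metadata tags rather than a directory hierarchy.
    # Bucket, key, and tag values are hypothetical assumptions.
    import json

    import boto3

    s3 = boto3.client("s3")

    raw_event = {"sensor_id": "s1", "temperature": 21.4, "ts": 1735689600}

    s3.put_object(
        Bucket="iot-data-lake",                # assumed bucket name
        Key="raw/s1-1735689600.json",          # flat key, no hierarchy
        Body=json.dumps(raw_event).encode("utf-8"),
        Metadata={                             # tags that index the entry
            "sensor-id": "s1",
            "schema": "temperature-v1",
            "source": "mqtt-bridge",
        },
    )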

Data lake storage is particularly useful in IoT, as it will store any form of data, whether structured or unstructured. A data lake also assumes that all data is valuable and will be kept permanently. This bulk, persistent mass of data is optimal for data analytics engines. Many analytics algorithms perform better the more data they are fed, or the more data is used to train their models.

A conceptual architecture using traditional batch processing and stream processing is illustrated in the following diagram. In this architecture, the data lake is fed by a Kafka instance; Kafka could also provide the interface to Spark in batches and send data on to a data warehouse.

There are several ways to reconfigure the topology in the following diagram, as the connectors between components are standardized: 

Basic diagram of a cloud ingestion engine feeding a data warehouse. Spark acts as the stream channel service.