To stream or not to stream

Streams are datasets that update continuously as each new data message arrives, with little to no latency. Streaming analytics operates on this continuously updating dataset at much shorter intervals than batch processing. "Real-time analytics" is something of a misnomer when applied to streaming analytics, because processing typically runs at intervals of minutes rather than continuously. The interval length drives processing and technology requirements, so set intervals as long as the use case allows in order to save costs.
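To make the idea of interval-based processing concrete, here is a minimal sketch that groups sensor messages into fixed intervals and averages each one. The function name, the (timestamp, value) event format, and the sample readings are assumptions for illustration only; a real streaming service would handle this for you.

```python
from collections import defaultdict


def tumbling_window_means(events, interval_seconds):
    """Average (timestamp_seconds, value) events per fixed-length interval.

    Hypothetical helper for illustration; not part of any streaming library.
    """
    buckets = defaultdict(list)
    for timestamp, value in events:
        # Integer division maps each event to the start of its interval.
        window_start = (timestamp // interval_seconds) * interval_seconds
        buckets[window_start].append(value)
    # One aggregate per interval, ordered by interval start time.
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


# Made-up temperature readings: (seconds since start, degrees).
events = [(0, 20.0), (30, 22.0), (65, 21.0), (110, 23.0), (125, 30.0)]

per_minute = tumbling_window_means(events, 60)         # three intervals to process
per_five_minutes = tumbling_window_means(events, 300)  # one interval to process
```

The same five messages produce three results at a one-minute interval but only one at a five-minute interval, which is the sense in which longer intervals reduce the processing work.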

Stream datasets normally keep data for a window of time and then discard it. Handling streams requires specialized technology and processing options, which are, for the most part, in addition to the long-term big data storage technology we have focused on in this chapter. Amazon Kinesis is an example of a specialized data streaming service.
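The keep-then-discard behavior can be sketched with a small in-memory buffer. This is only an illustration of the concept, not how Amazon Kinesis or any other service is implemented; the class name, the retention parameter, and the assumption that messages arrive in timestamp order are all hypothetical.

```python
from collections import deque


class WindowedStream:
    """Keeps only the events that fall inside a retention window (illustrative)."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self._events = deque()

    def append(self, timestamp, value):
        # Assumes events arrive in increasing timestamp order.
        self._events.append((timestamp, value))
        self._evict(now=timestamp)

    def _evict(self, now):
        # Discard events strictly older than the retention window.
        cutoff = now - self.retention_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def values(self):
        return [value for _, value in self._events]


stream = WindowedStream(retention_seconds=60)
for ts, reading in [(0, 1.0), (20, 2.0), (50, 3.0), (70, 4.0), (130, 5.0)]:
    stream.append(ts, reading)

retained = stream.values()  # only readings from the last 60 seconds remain
```

Anything that falls out of the window is gone, which is why long-term history still needs the separate storage layer described in this chapter.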

The technology and code base needed to support analytics are usually different for streaming than for long-term historical analytics. This means that most use cases require two systems and two code bases. There are multiple efforts underway to consolidate these into one; Spark 2.1, with its Spark Streaming module, is a notable example, but this area is still maturing.

For IoT analytics, it is important to consider the use cases and the datasets you will need to support before deciding that real-time streaming analytics is actually needed. When you are looking for insights and new business opportunities, you will almost always be using datasets with long-term history. Data from the latest hour or minute adds little incremental value for those use cases.

Tempting as real-time analytics is, you have to consider whether the additional cost and maintenance of streaming technology is worth it. In most cases for IoT analytics, it is not, and the need could be better served by increasing the batch processing frequency. Streaming analytics is also more limited than batch analytics when it comes to incorporating additional datasets. Unless those datasets are also real-time, they are only usable for less frequent batch processing.

While this holds for exploratory analytics aimed at finding insights, it may not hold when putting those insights into production. You may want near real-time analytics processing in some cases, such as when applying a predictive model to streaming sensor data to predict an equipment failure. You may also want real-time streaming analytics for reporting dashboards that are monitored by fast-reaction customer support teams.

These are separate decisions, however. The recommendation is to avoid the overhead of adding streaming analytics unless there is a strong business case to do so. Real-time sounds good, but you need to make sure the incremental gain is worth the additional expense in both time and resources.
