If you bought this book from Amazon, you may remember how other books on similar topics were recommended. Ever wondered how Amazon does this in real time? Amazon has a recommendation system, based on clustering, that recommends similar or related items for sale.
In the financial industry, there is a use case that is just the opposite of this: fraud detection, which is about identifying outliers, or anything that doesn't belong to a cluster. I will discuss it in more detail in this chapter as a project.
In this chapter, I will explain low-latency, or real-time, analytics and cover the full data lifecycle of the fraud detection project.
So far, I have discussed tools and technologies that solve big data problems using batch processing. Batch processing, as the name suggests, processes data in batches, with latency ranging from minutes to days. But what exactly is real-time processing?
If you ask business and IT this question, the answers will range from milliseconds to minutes, and each of them is almost correct; it depends on the perspective and the use case. For a stock trader, real time means milliseconds, but for news or media trend analysis, minutes are acceptable. For visual interfaces, business users will accept a delay of a few seconds.
There are only two types of processing—batch and stream. Batch processing is further classified into large batches and micro batches. I have already discussed large batch processing using Hive, Pig, and Java MapReduce.
Micro batch and stream processing are classified as real-time processing: they process smaller sets of data as they arrive, such as live stock prices.
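To make the micro-batch idea concrete, here is a minimal sketch of my own using Spark Streaming's DStream API; it is not code from this book's project. It collects incoming lines from a socket into two-second batches and processes each batch as it arrives. The host, port, and two-second interval are illustrative placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MicroBatchSketch").setMaster("local[2]")
    // Each micro batch covers a two-second window of incoming data.
    val ssc = new StreamingContext(conf, Seconds(2))

    // Hypothetical source: a socket emitting one stock quote per line.
    val quotes = ssc.socketTextStream("localhost", 9999)

    // Count the quotes received in each micro batch and print the result.
    quotes.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The count and print run once per micro batch, which is exactly the "smaller sets of data as they arrive" pattern described above; a pure stream processor such as Storm would instead handle each record individually as it comes in.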
Let's also briefly discuss what software and tools we have at our disposal to process data in real time:
For structured data, use the Spark SQL module of Spark. Please visit http://spark.apache.org/sql/ for more details.
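As a rough illustration of querying structured data with Spark SQL, here is a minimal sketch assuming the Spark 2.x SparkSession API; the transactions.json file and its columns are hypothetical, not data from this book.

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkSqlSketch")
      .master("local[*]")          // local mode, just for illustration
      .getOrCreate()

    // Load structured data into a DataFrame (hypothetical input file).
    val transactions = spark.read.json("transactions.json")
    transactions.createOrReplaceTempView("transactions")

    // Run a plain SQL query over the DataFrame.
    val largeTxns = spark.sql(
      "SELECT account_id, amount FROM transactions WHERE amount > 10000")
    largeTxns.show()

    spark.stop()
  }
}
```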
I expect a natural question from readers at this point: if using Spark or Storm with Kafka and HBase makes data processing real time, why bother with high-latency batch processing at all?
Real-time processing works on data in memory instead of on disk, and that means it is more expensive. However, Spark or Storm running on Hadoop is still horizontally scalable and cheaper than traditional high-end in-memory systems, such as SAP HANA.
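The in-memory versus disk-based distinction is something you can see directly in Spark, where you choose where a dataset is kept between actions. The sketch below is illustrative only; the input file and the choice of MEMORY_ONLY are my assumptions, not part of the book's project.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object InMemorySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("InMemorySketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical input file.
    val events = spark.read.textFile("events.log")

    // Keep the dataset in executor memory so repeated queries avoid disk I/O;
    // StorageLevel.DISK_ONLY would fall back to the slower, cheaper path.
    events.persist(StorageLevel.MEMORY_ONLY)

    println(events.count())                               // first action fills the cache
    println(events.filter(_.contains("ERROR")).count())   // served from memory
    spark.stop()
  }
}
```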