Chapter 6. Getting Experienced

If you bought this book from Amazon, you may remember how other books on similar topics were recommended. Ever wondered how Amazon does this in real time? It has a recommendation system, based on clustering, that recommends similar or related items for sale.

In the financial industry, there is a use case that is just the opposite of this: fraud detection, which identifies outliers, or anything that doesn't belong to a cluster. I will discuss it in more detail in this chapter as a project.

In this chapter, I will explain low-latency, or real-time, analytics and cover the full data lifecycle of the fraud detection project:

  • Data collection—ingesting data using Kafka, Storm, and Spark
  • Data transformation—using Storm and Spark and writing back the results to Kafka and HBase

Real-time big data

So far, I have discussed tools and technologies that solve big data problems using batch processing. Batch processing, as the name suggests, processes data in batches and has a higher latency, ranging from minutes to days. But what exactly is real-time processing?

If you ask business and IT this question, the answers will vary from milliseconds to minutes, and in fact all of them are almost correct: it depends on the perspective and the use case. For a stock trader, real time means milliseconds, but for news or media trend analysis, minutes are acceptable. For visual interfaces, business users will accept a delay of a few seconds.

There are only two types of processing: batch and stream. Batch processing is further classified into large batches and micro batches. I have already discussed large batch processing using Hive, Pig, and Java MapReduce.

Micro-batch and stream processing are both classified as real-time processing; they process small sets of data as they arrive, such as live stock prices.
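The micro-batch idea can be sketched in plain Python. The function below is illustrative only: it groups an incoming stream of records into small, fixed-size batches, whereas real frameworks such as Spark Streaming cut batches by a time interval rather than a record count, and the variable names here are my own.

```python
def micro_batch(stream, batch_size):
    """Group an incoming stream of records into small batches.

    Illustrative sketch: real micro-batch engines (e.g. Spark
    Streaming) cut the stream by time interval, not record count.
    """
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch          # hand this micro batch downstream
            batch = []
    if batch:                    # flush any trailing partial batch
        yield batch

# Simulated feed of live stock prices
prices = [101.2, 101.5, 100.9, 101.1, 101.8]
batches = list(micro_batch(prices, batch_size=2))
```

Stream processing, by contrast, would handle each price individually the moment it arrives, with no grouping step at all.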

Let's also briefly discuss what software and tools we have at our disposal to process data in real time:

  • Apache Spark: An open source cluster computing framework and, as mentioned on its website, Spark programs run up to 100 times faster than Hadoop MapReduce in memory, and 10 times faster on disk. It is accessible through APIs in Java, Python, SQL, and (most importantly) Scala. Visit http://spark.apache.org/ for more details.

    Note

    For structured data, use the Spark SQL module of Spark. Please visit http://spark.apache.org/sql/ for more details.

  • Apache Storm: An open source distributed stream-processing system that integrates with existing real-time sources, such as message queues. Visit http://storm.apache.org/ for more details.
  • Apache Kafka: A highly scalable, high-throughput, distributed publish-subscribe messaging system. Visit http://kafka.apache.org/ for more details.
  • Apache HBase: As discussed briefly in Chapter 1, Big Data Overview, this is a near-real-time, read/write, key-value NoSQL database.
  • Programming languages: I encourage developers to learn languages such as Scala and Python in addition to Java.
  • IBM InfoSphere Streams: This is a computing platform that ingests and analyzes real-time sources at rates of up to millions of events per second, with millisecond response times. It is freely available, but licensed for commercial production systems. Visit http://www-03.ibm.com/software/products/en/infosphere-streams for more details.
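The publish-subscribe pattern that Kafka implements at scale can be illustrated with a toy in-memory broker. The class and method names below are my own, and real Kafka adds partitions, replication, and consumer groups; the sketch only shows the core idea that producers append records to a named topic while each consumer tracks its own read position (offset) independently:

```python
from collections import defaultdict

class MiniBroker:
    """A toy in-memory publish-subscribe broker (not Kafka's API)."""

    def __init__(self):
        self.topics = defaultdict(list)   # topic -> ordered log of records
        self.offsets = defaultdict(int)   # (topic, consumer) -> next offset

    def publish(self, topic, record):
        """Producer side: append a record to the topic's log."""
        self.topics[topic].append(record)

    def poll(self, topic, consumer):
        """Consumer side: return records this consumer has not yet seen."""
        log = self.topics[topic]
        start = self.offsets[(topic, consumer)]
        self.offsets[(topic, consumer)] = len(log)
        return log[start:]

broker = MiniBroker()
broker.publish("transactions", {"card": "1234", "amount": 250.0})
broker.publish("transactions", {"card": "5678", "amount": 9900.0})
first = broker.poll("transactions", "fraud-detector")   # both records
second = broker.poll("transactions", "fraud-detector")  # nothing new yet
```

Because each consumer keeps its own offset, a fraud-detection job and an archival job could both read the same `transactions` topic without interfering with each other, which is exactly the decoupling that makes this pattern useful for real-time pipelines.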

I expect a natural question from readers at this point: if using Spark or Storm with Kafka and HBase makes data processing real time, why bother with high-latency batch processing at all?

Real-time processing works on data in memory rather than on disk, which makes it more expensive. However, Spark or Storm running on Hadoop is still horizontally scalable and cheaper than traditional high-end in-memory systems, such as SAP HANA.
