Spark and Elasticsearch for Real-Time Analytics

In the previous chapter, we looked at the machine learning features of the Elastic Stack. We used a single metric job to track one-dimensional data (the volume field of the cf_rfem_hist_price index) and detect anomalies in Kibana. We also introduced the scikit-learn Python package and performed the same anomaly detection in Python, but on three-dimensional data (adding two more fields: changePercent and changeOverTime).

In this chapter, we will look at another advanced feature, known as Elasticsearch for Apache Hadoop (ES-Hadoop). ES-Hadoop covers two major areas. The first is the integration of Elasticsearch with Hadoop distributed computing environments, such as Apache Spark, Apache Storm, and Hive. The second is using the Hadoop filesystem as backend storage for Elasticsearch, so that you can index Hadoop data into the Elastic Stack and take advantage of its search engine and visualization capabilities. In this chapter, our focus is on ES-Hadoop's Apache Spark support. Apache Spark is an open source processing engine for analytics, machine learning, and a range of other data challenges. We'll practice reading data from an Elasticsearch index, performing some computations using Spark, and then writing the results back to Elasticsearch.

By the end of this chapter, we will have covered the following topics:

  • Overview of ES-Hadoop
  • Apache Spark support
  • Real-time analytics using Elasticsearch and Apache Spark
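To give a feel for the read-compute-write round trip outlined above, here is a minimal Scala sketch using the ES-Hadoop Spark connector (the elasticsearch-spark artifact). It assumes an Elasticsearch node on localhost:9200 holding the cf_rfem_hist_price index from the previous chapter; the symbol grouping field and the cf_rfem_avg_volume target index name are illustrative assumptions, not from the original text.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg
// Implicits that add saveToEs() to DataFrames (from elasticsearch-spark)
import org.elasticsearch.spark.sql._

object EsSparkSketch {
  def main(args: Array[String]): Unit = {
    // Point the Spark session at the local Elasticsearch node (assumed address)
    val spark = SparkSession.builder()
      .appName("es-spark-demo")
      .master("local[*]")
      .config("es.nodes", "localhost")
      .config("es.port", "9200")
      .getOrCreate()

    // Read the index from the previous chapter into a DataFrame
    val prices = spark.read.format("es").load("cf_rfem_hist_price")

    // Example computation: average trading volume per symbol
    // ("symbol" is an assumed field name in the index)
    val avgVolume = prices
      .groupBy("symbol")
      .agg(avg("volume").as("avg_volume"))

    // Write the results back to a (hypothetical) new index
    avgVolume.saveToEs("cf_rfem_avg_volume")

    spark.stop()
  }
}
```

The same pattern works from spark-shell; the key point is that ES-Hadoop exposes an Elasticsearch index as a Spark data source, so the read, the aggregation, and the write-back all stay within the DataFrame API.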