Chapter 3. Apache Spark Streaming

Apache Spark Streaming is the stream processing module within Apache Spark. It uses the Spark cluster to scale to a high degree. Being based on Spark, it is also highly fault tolerant, with the ability to rerun failed tasks by checkpointing the data stream being processed. The following areas will be covered in this chapter after an initial section, which provides a practical overview of how Apache Spark processes stream-based data:

  • Error recovery and checkpointing
  • TCP-based stream processing
  • File streams
  • Flume stream source
  • Kafka stream source

For each topic, I will provide a worked example in Scala, showing how the stream-based architecture can be set up and tested.

Overview

When giving an overview of the Apache Spark Streaming module, I would advise you to check the http://spark.apache.org/ website for up-to-date information, as well as the Spark-based user groups. My reason for saying this is that these are the primary places where Spark information is available. Also, the extremely fast (and increasing) pace of change means that, by the time you read this, new Spark functionality and versions will be available. So, in light of this, when giving an overview, I will try to generalize.

(Figure: Spark Streaming overview; input sources such as Kafka, Flume, and HDFS feed the Spark Streaming module, and processed data is output to HDFS, databases, and dashboards)

The previous figure shows potential data sources for Apache Spark Streaming, such as Kafka, Flume, and HDFS. These feed into the Spark Streaming module, and are processed as discrete streams. The diagram also shows that other Spark module functionality, such as machine learning, can be applied to the stream-based data. The fully processed data can then be output to HDFS, databases, or dashboards. This diagram is based on the one at the Spark Streaming website, but I wanted to extend it to express both the Spark module functionality and the dashboarding options. The previous diagram shows a metrics feed being sent from Spark to Graphite. It is also possible to feed Solr-based data to Lucidworks Banana (a port of Kibana). It is also worth mentioning here that Databricks (see Chapter 8, Spark Databricks and Chapter 9, Databricks Visualization) can also present Spark stream data as a dashboard.

(Figure: a continuous input stream broken into discrete stream (DStream) batches, with a window spanning multiple batches)

When discussing Spark discrete streams, the previous figure, again taken from the Spark website at http://spark.apache.org/, is the diagram I like to use. The green boxes in the previous figure show the continuous data stream sent to Spark being broken down into a discrete stream (DStream). The size of each element in the stream is then based on a batch time, which might be two seconds. It is also possible to create a window, shown as the red box in the previous figure, over the DStream. For instance, when carrying out trend analysis in real time, it might be necessary to determine the top ten Twitter-based hashtags over a ten-minute window, as sketched in the example that follows.
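To give a flavor of what such a windowed computation looks like, the following is a minimal sketch. It assumes that a DStream[String] of tweet texts, called tweets, already exists (stream creation itself is shown in the next section); it counts hashtags over a ten-minute window that slides every ten seconds:

    import org.apache.spark.streaming.{Minutes, Seconds}

    // Split each tweet into words, and keep only the hashtags
    val hashTags = tweets.flatMap(_.split(" ")).filter(_.startsWith("#"))

    // Count each hashtag over a ten minute window, sliding every ten seconds
    val counts = hashTags.map(tag => (tag, 1))
                         .reduceByKeyAndWindow(_ + _, Minutes(10), Seconds(10))

    // For each batch, print the current top ten hashtags by count
    counts.foreachRDD { rdd =>
      rdd.sortBy(_._2, ascending = false).take(10).foreach(println)
    }

Note that, as the window slides, reduceByKeyAndWindow recomputes the counts from scratch; an overloaded version of this method accepts an inverse reduce function to make the update incremental, at the cost of requiring checkpointing.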

So, given that Spark can be used for stream processing, how is a stream created? The following Scala-based code shows how a Twitter stream can be created. This example is simplified because Twitter authorization has not been included, but you get the idea (the full example code is in the Checkpointing section). The Spark stream context, called ssc, is created using the Spark context, sc. A batch time is specified when it is created; in this case, five seconds. A Twitter-based DStream, called stream, is then created from the StreamingContext using a window of 60 seconds:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.twitter.TwitterUtils
    val ssc    = new StreamingContext(sc, Seconds(5))
    val stream = TwitterUtils.createStream(ssc, None).window(Seconds(60))

The stream processing can be started with the stream context start method (shown next), and the awaitTermination method indicates that it should process until stopped. So, if this code is embedded in an application, it will run until the session is terminated, perhaps with Ctrl + C:

    ssc.start()
    ssc.awaitTermination()
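It is also possible to stop the stream programmatically rather than killing the session. As a minimal sketch, the StreamingContext stop method accepts flags that control whether the underlying Spark context is stopped as well, and whether data already received is drained first:

    // Stop the streaming context gracefully, letting data that has
    // already been received finish processing, and also stop the
    // underlying Spark context
    ssc.stop(stopSparkContext = true, stopGracefully = true)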

This explains what Spark Streaming is and what it does, but it does not explain error handling, or what to do if your stream-based application fails. The next section will examine Spark Streaming error management and recovery.
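As a brief taster of that section, recovery in Spark Streaming is built around a checkpoint directory and the StreamingContext.getOrCreate factory method, which rebuilds a context from checkpoint data if any exists. The following is a minimal sketch of that pattern; the HDFS checkpoint path is just an assumed example:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Build a fresh streaming context, registering an HDFS-based
    // checkpoint directory (an assumed example path)
    def createContext(): StreamingContext = {
      val newSsc = new StreamingContext(sc, Seconds(5))
      newSsc.checkpoint("hdfs:///user/spark/checkpoint")
      // the stream creation and processing would be defined here
      newSsc
    }

    // Rebuild the context from checkpoint data if it is present;
    // otherwise, create the context from scratch
    val ssc = StreamingContext.getOrCreate("hdfs:///user/spark/checkpoint",
                                           createContext _)
    ssc.start()
    ssc.awaitTermination()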
