Getting started with Spark Streaming jobs

Unlike the batch jobs covered so far, a streaming job runs continuously, and Spark Streaming is no exception to this rule despite being described as micro-batched. Before discussing this any further, let's first write a hello world program in Spark Streaming, and what better place to start than a word count problem on a streaming source, in this case a socket. In the following example, we count the number of words in each micro batch:

// Configure Spark and create a streaming context with a one-second batch interval
SparkConf sparkConf = new SparkConf().setAppName("WordCountSocketEx").setMaster("local[*]");
JavaStreamingContext streamingContext = new JavaStreamingContext(sparkConf, Durations.seconds(1));

// Receive lines of text from the socket; replace the placeholders with the actual host and port
JavaReceiverInputDStream<String> streamingLines = streamingContext.socketTextStream(
    "IP_OF_THE_SOCKET", Integer.parseInt("PORT_OF_THE_SOCKET"), StorageLevels.MEMORY_AND_DISK_SER);

// Split each line into words
JavaDStream<String> words = streamingLines.flatMap(str -> Arrays.asList(str.split(" ")).iterator());

// Count the occurrences of each word within the current micro batch
JavaPairDStream<String, Integer> wordCounts = words
    .mapToPair(str -> new Tuple2<>(str, 1))
    .reduceByKey((count1, count2) -> count1 + count2);

// Print a sample of the counts for every batch, start the computation, and block until it terminates
wordCounts.print();
streamingContext.start();
streamingContext.awaitTermination();

The example highlights a few points that need to be kept in mind while writing a streaming job:

  • Every streaming job needs a StreamingContext; in Java this is a JavaStreamingContext.
  • The streaming context accepts two parameters, a SparkConf and a batchDuration. The batch duration determines the micro-batching interval and should therefore be chosen carefully.
  • A Spark Streaming job does not start any computation until the start() method of the streaming context is called.
  • Call awaitTermination() to block until the computation terminates, or stop() to end the streaming job explicitly (see the shutdown sketch after this list).
  • Never set the master URL to local or local[1], as that leaves only one thread for the job. DStream-based receivers (sockets, Kafka, Flume, Kinesis, and so on) need one thread to run the receiver itself, leaving none to process the received data. It is therefore recommended to use local[n] with n greater than one, or better still local[*]. The same logic extends to streaming jobs running on a cluster: the number of cores allocated to the Spark Streaming application must be greater than the number of receivers.
  • Also add the following Maven dependency for working with Spark Streaming:
<dependency> 
  <groupId>org.apache.spark</groupId> 
  <artifactId>spark-streaming_2.11</artifactId> 
  <version>2.1.1</version> 
</dependency> 
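
As mentioned in the list above, the job keeps running until it is stopped explicitly. The following sketch reuses the streamingContext from the earlier example and shows one possible shutdown pattern; the 30-second timeout is only an illustrative value, not something prescribed by the API:

// Wait for at most 30 seconds for the streaming computation to terminate (illustrative timeout)
boolean terminated = streamingContext.awaitTerminationOrTimeout(30000);
if (!terminated) {
    // Stop the streaming context and the underlying SparkContext,
    // allowing batches that are already in flight to finish first
    streamingContext.stop(true, true);
}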
To stream data over a socket, one can use Netcat, a utility available on Linux-based machines. Use the following command to create a server-side socket for streaming data:
$nc -lk <Port Number>
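
For a quick local test, one could start Netcat on, say, port 9999 with nc -lk 9999 and point the receiver at that port; the host name and port below are purely example values, not anything mandated by the example:

// Hypothetical local test: Netcat started on the same machine with "nc -lk 9999"
JavaReceiverInputDStream<String> streamingLines = streamingContext.socketTextStream(
    "localhost", 9999, StorageLevels.MEMORY_AND_DISK_SER);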