Input sources and output stores

Spark Streaming supports three kinds of input sources:

  • Basic sources: Sources directly available in the StreamingContext API. Examples: file systems, socket connections, and Akka actors.
  • Advanced sources: Sources like Kafka, Flume, Kinesis, Twitter, and so on, which are available through extra utility classes.
  • Custom sources: Require implementing a user-defined receiver.

Multiple receivers can be created in the same application to receive data from different sources. It is important to allocate enough resources (cores and memory) so that the receivers and the processing tasks can execute simultaneously. For example, if you start your application with one core, it will be taken by the receiver and no tasks will be executed because of a lack of available cores.
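For example, here is a minimal sketch (the application name and batch interval are assumptions) that reserves two local cores so that the receiver and the processing tasks can run at the same time:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // "local[2]" gives the application two cores: one for the receiver and one for
    // the tasks that process the received batches. With "local[1]" the receiver
    // would occupy the only core and no batch would ever be processed.
    val conf = new SparkConf().setAppName("ReceiverDemo").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(10))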

Basic sources

There are four basic sources available in the Spark StreamingContext API, as described in the following list (a short usage sketch follows the list):

  • TCP stream: For streaming data via TCP/IP by specifying a hostname and a port number. streamingContext.socketTextStream is used to create the input DStream.
  • File stream: For reading files from a file system such as HDFS, NFS, or S3. streamingContext.fileStream or streamingContext.textFileStream(dataDirectory) (also available in Python) is used to create the input DStream.
  • Akka actors: For creating DStreams through Akka actors. streamingContext.actorStream is used to create the DStream.
  • Queue of RDDs: For testing a Spark Streaming application with a queue of RDDs. streamingContext.queueStream is used to create the DStream.
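Assuming the ssc context from the sketch above, the basic sources can be created as follows (the hostname, port, and directory are placeholders):

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD

    // TCP stream: connect to a socket server on localhost:9999.
    val lines = ssc.socketTextStream("localhost", 9999)

    // File stream: watch a directory for newly created files.
    val files = ssc.textFileStream("hdfs:///data/incoming")

    // Queue of RDDs: useful for testing the streaming logic without a live source.
    val rddQueue  = new mutable.Queue[RDD[String]]()
    val testQueue = ssc.queueStream(rddQueue)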

Advanced sources

Let's take a look at some of the important advanced sources:

  • Kafka: Kafka is the most widely used source in streaming applications, as it provides high reliability. It is a publish-subscribe messaging system that stores data as a distributed commit log and lets streaming applications pull data from it. There are two approaches for using Kafka: the receiver-based approach and the direct approach (a sketch of the direct approach follows this list).
  • Flume: Flume is a reliable log collection tool that is tightly integrated with Hadoop. Like Kafka, Flume also offers two approaches: a regular push-based approach and a custom pull-based approach.
  • Kinesis: Kinesis is a service offered by Amazon Web Services (AWS) for stream processing. The Kinesis receiver creates an input DStream using the Kinesis Client Library (KCL) provided by AWS.
  • Twitter: For streaming a public stream of tweets using the Twitter Streaming API. TwitterUtils uses the Twitter4j library.
  • ZeroMQ: ZeroMQ is another high-performance asynchronous messaging system.
  • MQTT: MQTT is a publish-subscribe messaging protocol for use on top of the TCP/IP protocol.
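As a sketch of the Kafka direct approach, using the spark-streaming-kafka connector that ships with Spark 1.x (the broker addresses and topic name are assumptions, and ssc is the context created earlier):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Direct approach: no receiver is used; Spark reads the Kafka partitions itself,
    // which is what enables its exactly-once receiving semantics.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics      = Set("events")

    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    kafkaStream.map(_._2).print()   // the message values; the keys are in _._1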

Custom sources

Input DStreams can also be created from your own custom data sources by implementing a user-defined receiver that receives data from the custom source. The Python API does not support this functionality yet.
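A minimal sketch of a user-defined receiver is shown below; the socket-based source is only an illustration, and the class name is an assumption:

    import java.net.Socket
    import scala.io.Source
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    // A toy receiver that reads lines from a TCP socket and hands them to Spark via store().
    class CustomLineReceiver(host: String, port: Int)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart(): Unit = {
        // Start a background thread so that onStart() returns immediately.
        new Thread("Custom Line Receiver") {
          override def run(): Unit = receive()
        }.start()
      }

      def onStop(): Unit = { /* nothing to clean up; the thread exits when the socket closes */ }

      private def receive(): Unit = {
        try {
          val socket = new Socket(host, port)
          for (line <- Source.fromInputStream(socket.getInputStream).getLines())
            store(line)                          // hand each record to Spark
          socket.close()
          restart("Trying to connect again")     // ask Spark to restart the receiver
        } catch {
          case e: Exception => restart("Error receiving data", e)
        }
      }
    }

The receiver is then plugged in with ssc.receiverStream(new CustomLineReceiver("localhost", 9999)).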

Receiver reliability

There are two types of receivers in Spark Streaming based on reliability:

  • Reliable receiver: A reliable receiver sends an acknowledgment to the source system after receiving the blocks and replicating them in the Spark cluster. So, if the receiver fails, data is not lost.
  • Unreliable receiver: An unreliable receiver does not send an acknowledgment after receiving the blocks. So, in this case, if the receiver fails, data is lost.

So, depending on the type of receiver, data is received at-most-once (unreliable receivers) or at-least-once (reliable receivers). For example, while the regular receiver-based Kafka API provides at-least-once semantics, the Kafka direct API provides exactly-once semantics. To make sure that received blocks are not lost in case of driver failures, enabling the write-ahead log (WAL) helps.
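As a minimal sketch (the application name and checkpoint path are assumptions), the WAL is enabled through a configuration property and needs a checkpoint directory to write to:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("ReliableReceiverApp")
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")  // turn on the WAL

    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("hdfs:///user/spark/checkpoints")                 // received blocks are also logged here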

Output stores

Once the data is processed in the Spark Streaming application, it can be written to a variety of sinks such as HDFS, relational databases, HBase, Cassandra, Kafka, or Elasticsearch. All output operations are processed one at a time, and they are executed in the same order in which they are defined in the application. Also, in some cases, the same record can be processed more than once and end up duplicated in the output store.
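For example, here is a minimal sketch of two common output operations (wordCounts is an assumed DStream and the HDFS path is a placeholder):

    // Prints the first 10 elements of every batch to the driver's stdout.
    wordCounts.print()

    // Arbitrary output logic runs here once per batch, in the order it is defined.
    wordCounts.foreachRDD { (rdd, time) =>
      rdd.saveAsTextFile(s"hdfs:///tmp/wordcounts-${time.milliseconds}")
    }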

It is important to understand the at-most-once, at-least-once, and exactly-once guarantees offered by Spark Streaming. For example, the Kafka direct API provides exactly-once receiving semantics, so each record is received and processed exactly once, whereas a receiver-based source may deliver the same record more than once. The output semantics are therefore affected by the choice of receiver. For NoSQL databases such as HBase, writing the same record twice will simply update it with a new version. For updating the same record in HBase, the timestamp can be passed from the source system instead of letting HBase assign one.

So, if the record is processed and sent to HBase multiple times, it just replaces the existing record because the timestamp is the same. This concept is called idempotent updates. However, for relational databases, sending the same record twice may either throw an exception (for example, a primary key violation) or insert the second record as a duplicate. Another way to achieve exactly-once output is the transactional updates approach, where each batch's results are committed atomically together with a unique batch identifier, so a retried batch is not applied twice.
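The following sketch illustrates the idempotent-update idea with the HBase client API; the DStream name, table name, column family, and record layout are all assumptions:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes

    // events is assumed to be a DStream[(String, String, Long)] of
    // (row key, value, event time from the source system).
    events.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // Create the connection on the executor, once per partition.
        val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = conn.getTable(TableName.valueOf("events"))
        partition.foreach { case (rowKey, value, eventTime) =>
          val put = new Put(Bytes.toBytes(rowKey))
          // Using the source-provided timestamp means a re-sent record overwrites
          // the same cell version instead of creating a duplicate (idempotent update).
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"), eventTime, Bytes.toBytes(value))
          table.put(put)
        }
        table.close()
        conn.close()
      }
    }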
