Chapter 4. Storm and Kafka Integration

Apache Kafka is a high-throughput, distributed, fault-tolerant, and replicated messaging system that was first developed at LinkedIn. Its use cases range from log aggregation to stream processing to replacing other messaging systems.

Kafka has emerged as one of the important components of real-time processing pipelines in combination with Storm. Kafka can act as a buffer or feeder for messages that need to be processed by Storm. Kafka can also be used as the output sink for results emitted from the Storm topologies.

In this chapter, we will cover the following topics:

  • An overview of Apache Kafka and how it differs from traditional messaging platforms
  • Setting up single-node and multi-node Kafka clusters
  • Producing data into a Kafka partition
  • Using KafkaSpout in a Storm topology to consume messages from Kafka

The Kafka architecture

Kafka has an architecture that differs significantly from other messaging systems. Kafka is a peer-to-peer system in which each node is called a broker. The brokers coordinate their actions with the help of a ZooKeeper ensemble.

[Figure: A Kafka cluster]

The following are the important components of Kafka.

The producer

In Kafka, messages are published by a producer to named entities called topics. A topic is a queue that can be consumed by multiple consumers. For parallelism, a Kafka topic can have multiple partitions, and reads and writes can happen to each partition in parallel. The data for each partition of a topic is stored in a separate directory on disk. These directories can be placed on different disks, allowing us to overcome the I/O limitations of a single disk. Also, two partitions of a single topic can be allocated to different brokers, increasing throughput further, as each partition operates independently of the others. Each message in a partition has a unique sequence number associated with it, called an offset.

Have a look at the following diagram showing the Kafka topic distribution:

[Figure: Kafka topic distribution]
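The following is a minimal producer sketch using the Java client API. The broker address localhost:9092 and the topic name words are assumptions for illustration, and Kafka 0.8-era deployments shipped an older Scala-based producer API, so treat this as a sketch rather than the exact code used elsewhere in this book:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class SimpleProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Assumed broker address; point this at your own cluster
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            KafkaProducer<String, String> producer = new KafkaProducer<>(props);
            // Messages that share a key are always routed to the same partition,
            // so per-key ordering is preserved
            producer.send(new ProducerRecord<>("words", "key-1", "hello kafka"));
            producer.close();
        }
    }

Records sent without a key are spread across the partitions, which is how a producer exploits the per-partition parallelism described above.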

Replication

Kafka supports the replication of partitions of a topic to support fault tolerance. It automatically handles the replication of a partition and makes sure that the replicas of a partition are assigned to different brokers. Kafka elects one broker as the leader of a partition, and all writes and reads must go through the leader. The replication feature was introduced in Kafka 0.8.0.
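For instance, a replicated topic can be created with the command-line tools that ship with Kafka. The topic name and the local ZooKeeper address below are illustrative assumptions, and very early 0.8.0 releases shipped a differently named script:

    bin/kafka-topics.sh --create --zookeeper localhost:2181 \
        --replication-factor 2 --partitions 2 --topic words

With a replication factor of 2, each of the two partitions gets one leader and one follower, placed on different brokers.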

Consumers

A consumer reads a range of messages from a broker. A group ID is associated with each consumer, and all the consumers with the same group ID act as a single logical consumer: each message of a topic is delivered to exactly one consumer in a consumer group. Different consumer groups for a particular topic can process messages at their own pace, as messages are not removed from a topic as soon as they are consumed. In fact, it is the responsibility of the consumers to keep track of how many messages they have consumed.

The following diagram depicts the relationship between consumers and consumer groups. We have a topic and two consumer groups with group IDs 1 and 2. Consumer group 1 has two consumers, A and B, and each of them consumes from one of the partitions of the topic: consumer A consumes from partition p and consumer B consumes from partition q. Consumer group 2 has only a single consumer, X, which consumes messages from both partitions, p and q.

[Figure: Kafka consumer groups]
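The following is a minimal sketch of a consumer joining group 1 using the Java client API; again, the broker address, group ID, and topic name are illustrative assumptions:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class GroupConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            // All consumers sharing this group.id split the topic's partitions
            props.put("group.id", "group-1");
            // Start from the beginning of the topic if no offset is stored yet
            props.put("auto.offset.reset", "earliest");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            consumer.subscribe(Collections.singletonList("words"));
            // Poll in a loop; each record arrives from one of the group's partitions
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }

Starting a second instance of this program with the same group.id makes Kafka split the topic's partitions between the two instances, mirroring consumers A and B in the diagram.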

As mentioned earlier in this section, each message in a partition has a unique sequence number associated with it, called an offset. It is through this offset that consumers know how much of the stream they have already processed. If a consumer decides to replay already-processed messages, all it needs to do is set the offset back to an earlier value when consuming from Kafka.
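As a sketch of such a replay, assuming the same illustrative topic and broker address as before, the Java client lets a consumer take manual control of a partition and rewind its offset:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class ReplayConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "replay-group");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            // Take manual control of partition 0 and rewind to the beginning;
            // Kafka will redeliver everything still retained on disk
            TopicPartition partition = new TopicPartition("words", 0);
            consumer.assign(Collections.singletonList(partition));
            consumer.seek(partition, 0L);
            System.out.println(consumer.poll(Duration.ofSeconds(1)).count()
                    + " replayed records");
            consumer.close();
        }
    }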

Brokers

A broker receives messages from producers (push mechanism) and delivers messages to consumers (pull mechanism). A broker also manages the persistence of messages on disk: for each topic, it creates a directory that contains multiple log files. The Kafka broker is very lightweight; it only opens file handles for the partitions to persist messages and manages the TCP connections.

Data retention

Each topic in Kafka has an associated retention time that can be controlled with the log.retention.minutes property in the broker configuration. When this time expires, Kafka deletes the expired data files for that particular topic. This is a very efficient operation, as it is simply a file deletion.

Another way of controlling retention is through the log.retention.bytes property, which tells Kafka to delete old data files once a partition reaches a certain size. If both properties are configured, deletion happens as soon as either limit is reached.
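For example, a broker's server.properties might combine both limits; the values below are illustrative:

    # Delete log segments older than 7 days (10,080 minutes)
    log.retention.minutes=10080
    # ...or once a partition's log grows past 1 GB, whichever comes first
    log.retention.bytes=1073741824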
