Container messaging

For the scenario we just covered, where back-pressure on individual processing stages causes cascading stoppages, message queues (often also referred to as pub/sub messaging systems) provide exactly the solution we need. Message queues generally store data as messages in a First-In, First-Out (FIFO) structure and work by letting the sender add the desired inputs to a particular stage's queue ("enqueue") while the worker (listener) triggers on new messages within that queue. While a worker is processing a message, the queue hides it from the rest of the workers, and once the worker completes successfully, the message is removed from the queue permanently. By operating on results asynchronously, we allow the senders to keep working on their own tasks and completely modularize the data processing pipeline.
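To make these mechanics concrete, here is a minimal in-memory sketch of such a queue in Python. This is not any particular product's API; the class name, method names, and the default 30-second visibility timeout are made up for illustration, but the behaviour mirrors what was just described: enqueue at the back, hand the oldest message to exactly one worker while hiding it from the others, delete it on acknowledgement, and put it back on top if the worker never acknowledges in time.

    import time
    import threading
    import collections

    class ToyMessageQueue:
        """Illustrative FIFO queue with hide-on-receive and timeout-based requeue."""

        def __init__(self, visibility_timeout=30.0):
            self._lock = threading.Lock()
            self._ready = collections.deque()    # (msg_id, message) waiting for a worker, FIFO
            self._in_flight = {}                 # msg_id -> (message, requeue deadline)
            self._next_id = 0
            self._visibility_timeout = visibility_timeout

        def enqueue(self, message):
            """Sender adds a new input to the back of the queue."""
            with self._lock:
                self._next_id += 1
                self._ready.append((self._next_id, message))
                return self._next_id

        def receive(self):
            """Hand the oldest message to a worker and hide it from everyone else."""
            with self._lock:
                self._requeue_expired()
                if not self._ready:
                    return None
                msg_id, message = self._ready.popleft()
                deadline = time.time() + self._visibility_timeout
                self._in_flight[msg_id] = (message, deadline)
                return msg_id, message

        def ack(self, msg_id):
            """Worker finished successfully, so the message is removed permanently."""
            with self._lock:
                self._in_flight.pop(msg_id, None)

        def _requeue_expired(self):
            """Any message whose worker never acked in time goes back to the top."""
            now = time.time()
            for msg_id, (message, deadline) in list(self._in_flight.items()):
                if now > deadline:
                    del self._in_flight[msg_id]
                    self._ready.appendleft((msg_id, message))

A quick enqueue()/receive() run with a short timeout, where the "worker" never calls ack(), shows the message reappear on the next receive() call, which is exactly the failure scenario walked through next.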

To see queues in action, let's say we have two running containers and within a very short period of time, messages A, B, C, and D arrive one after another as inputs from some imaginary processing step (red indicating the top of the queue):

Internally, the queue tracks their ordering. Initially, neither of the container queue listeners has noticed the messages, but very quickly both get a notification that there is new work to be done, so they retrieve the messages in the order in which they were received. The messaging queue (depending on the exact implementation) marks those messages as unavailable to other listeners and sets a timeout within which the worker must complete. In this example, Message A and Message B have been marked for processing by the available workers:

During this process, let's assume that Container 1 suffers a catastrophic failure and simply dies. Message A's timeout on the queue expires before the message is finished, so the queue puts it back on top and makes it available to listeners again, while our other container keeps on working:

With Message B successfully completed, Container 2 notifies the queue that the task is done, and the queue removes the message from its list for good. With that out of the way, the container takes the topmost message, which turns out to be the unfinished Message A, and the process continues just as before:

While this cluster stage has been dealing with failures and overloading, the previous stage, which put all of these messages on the queue, has kept working on its own dedicated workload. Our current stage also has not lost any data, even though half of our processing capacity was forcibly removed at a random point in time.

The new pseudocode loop for a worker would now look a bit more like this (a concrete sketch in code follows the list):

  • Register as a listener on a queue.
  • Loop forever doing the following:
    • Wait for a message from the queue.
    • Process the data from the queue.
    • Send the processed data to the next queue.
    • Acknowledge the message so that the queue can delete it permanently.
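As one possible realization of that loop, the following Python worker uses Redis lists as the queue (one of the options discussed later in this section). The host name, key names, and process() function are placeholders chosen for illustration rather than anything dictated by the pseudocode; parking each in-flight message in a per-stage "processing" list until it is acknowledged is one common way to get the requeue-on-failure behaviour described above.

    import redis

    # Hypothetical connection details and key names for this pipeline stage.
    r = redis.Redis(host="queue.internal", port=6379)
    INPUT = "stage2:input"            # queue this worker listens on
    PROCESSING = "stage2:processing"  # where in-flight messages are parked
    NEXT = "stage3:input"             # queue feeding the next pipeline stage

    def process(raw):
        # Placeholder for the real work this stage performs on each message.
        return raw.upper()

    while True:
        # Block until a message arrives, atomically moving it to the processing
        # list so that it is no longer visible to the other listeners.
        raw = r.brpoplpush(INPUT, PROCESSING, timeout=0)
        result = process(raw)
        # Hand the result to the next stage...
        r.lpush(NEXT, result)
        # ...and acknowledge by removing the original from the processing list.
        # A separate maintenance job would push anything that lingers here too
        # long back onto INPUT, giving us the timeout-and-requeue behaviour.
        r.lrem(PROCESSING, 1, raw)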

With this new system, if there is any kind of processing slowdown in the pipeline, the queues for the overloaded stages will start to grow in size, whereas if the earlier stages slow down, the queues will shrink until they are empty. As long as the maximum queue size can handle the volume of messages and the overloaded stages can keep up with the average demand, you can be confident that all the data in the pipeline will eventually be processed, and your trigger for scaling up a stage is pretty much as simple as noticing a growing queue that is not caused by a bug. This not only helps mitigate differences in scaling between pipeline stages, but it also helps preserve data if pieces of your cluster go down: the queues will grow while the failure lasts and then drain as you bring your infrastructure back to a fully working state, all without data loss.
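To show how simple such a scaling trigger can be, the following sketch just inspects queue depths; the Redis host, queue names, and threshold are hypothetical, and in practice you would feed the result into whatever mechanism actually starts new containers.

    import redis

    r = redis.Redis(host="queue.internal", port=6379)   # hypothetical queue host

    # Hypothetical backlog threshold above which a stage is considered overloaded.
    SCALE_UP_THRESHOLD = 1000

    def stages_needing_more_workers(stage_queues):
        """Return the stages whose input queues have grown past the threshold."""
        overloaded = []
        for queue_name in stage_queues:
            backlog = r.llen(queue_name)   # number of messages currently waiting
            if backlog > SCALE_UP_THRESHOLD:
                overloaded.append((queue_name, backlog))
        return overloaded

    # Example: check the input queues of two pipeline stages.
    print(stages_needing_more_workers(["stage2:input", "stage3:input"]))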

If this bundle of benefits were not enough, consider that you also gain a guarantee that the data will be processed: the queue keeps the data around, so if a worker dies, the queue will (as we saw earlier) make the message available again so that another worker can pick it up, unlike socket-based processing, which would simply lose the work silently in that case. The increase in processing density, the increase in failure tolerance, and the better handling of burst data make queues extremely attractive to container developers. If all of your communication is also done through queues, service discovery might not even be needed for these workers beyond telling them where the queue manager is, since the queue does that discovery work for you.

Unsurprisingly, these benefits come at a development cost, which is why queues are not as widely used as one might expect. In most cases, you will not only need to add queue client libraries to your worker code, but in many types of deployments, you will also need a process or daemon running somewhere to act as the main queue arbiter that handles the messages. In fact, I would go as far as to say that choosing the messaging system alone is a research task unto itself, but if you are looking for quick answers, Apache Kafka (https://kafka.apache.org/), RabbitMQ (https://www.rabbitmq.com/), and Redis-backed custom implementations (https://redis.io/) generally seem to be the most popular choices for in-house messaging queues in clustering contexts, roughly in order from the largest deployments to the smallest.
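To give a feel for what that client code looks like, here is a rough RabbitMQ consumer using the pika library; the queue names and the do_work() function are placeholders, and the important details are the manual acknowledgement and the fact that an unacknowledged message is redelivered if the worker's connection dies.

    import pika

    def do_work(body):
        # Placeholder for this stage's actual processing.
        return body.upper()

    connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq.internal"))
    channel = connection.channel()
    channel.queue_declare(queue="stage2_input", durable=True)
    channel.queue_declare(queue="stage3_input", durable=True)
    channel.basic_qos(prefetch_count=1)   # hand this worker one message at a time

    def handle(ch, method, properties, body):
        result = do_work(body)
        # Send the result on to the next stage's queue.
        ch.basic_publish(exchange="", routing_key="stage3_input", body=result,
                         properties=pika.BasicProperties(delivery_mode=2))
        # Only now is the original message removed from the queue; if the worker
        # had died before this point, RabbitMQ would redeliver it to another consumer.
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue="stage2_input", on_message_callback=handle)
    channel.start_consuming()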

As with most things we have covered so far, most cloud providers offer some type of service for this (AWS SQS, Google Cloud Pub/Sub, Azure Queue Storage, and so on) so that you don't have to build it yourself. If you are OK with spending a little more money, you can use these services and not worry about hosting the daemon process yourself. Historically, messaging queues have been hard to maintain and manage properly in-house, so I would venture to say that many, if not most, cloud-based systems use these services instead of deploying their own.
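For completeness, here is roughly what the same worker loop looks like against a hosted queue such as AWS SQS, using boto3; the queue URL and the process() function are placeholders, and the visibility timeout value is just an example. The notable point is that the hide/timeout/requeue behaviour is handled entirely by the service.

    import boto3

    def process(body):
        # Placeholder for this stage's actual processing.
        print("processing", body)

    sqs = boto3.client("sqs", region_name="us-east-1")
    # Hypothetical queue URL for this stage's input.
    queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/stage2-input"

    while True:
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,        # long polling so we do not busy-wait
            VisibilityTimeout=300,     # how long the message stays hidden while we work
        )
        for msg in resp.get("Messages", []):
            process(msg["Body"])
            # Deleting the message is the acknowledgement; if the worker crashes
            # before this call, SQS makes the message visible to others again.
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])

The same basic shape applies to the other hosted offerings, which differ mainly in client library and naming.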