State reconciliation

Besides the vocabulary we have just learned, we also need to understand the algorithm underlying almost all orchestration frameworks, state reconciliation, which deserves its own short section here. It works on a very simple three-step process:

The desired or current state changes: the user sets the desired task count(s) for each service, or a running task of a service disappears.
The orchestration framework works out what is needed to move from the current state to the desired state (delta evaluation).
The framework executes whatever is needed to take the cluster to that state (state reconciliation).
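The delta evaluation step can be sketched in a few lines of Python. This is only an illustration under simple assumptions: cluster state is modelled as a plain mapping from service name to task count, and the function name `evaluate_delta` and the service name `web` are hypothetical, not taken from any real orchestration framework.

```python
def evaluate_delta(current: dict[str, int], desired: dict[str, int]) -> dict[str, int]:
    """Return the per-service change in task count needed to reach the desired state.

    A service missing from either mapping is treated as having zero tasks.
    Services that already match the desired count are left out of the result.
    """
    delta = {}
    for name in sorted(set(current) | set(desired)):
        change = desired.get(name, 0) - current.get(name, 0)
        if change != 0:
            delta[name] = change
    return delta


scale_down = evaluate_delta({"web": 5}, {"web": 3})
scale_up = evaluate_delta({"web": 3}, {"web": 5})
print(scale_down)  # {'web': -2}
print(scale_up)    # {'web': 2}
```

A negative value means that many tasks must be stopped, and a positive value means that many must be started.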

For example, if we currently have five running tasks for a service in the cluster and change the desired state to only three tasks, our management/orchestration system will see that the difference is -2, so it will pick two tasks at random and kill them. Conversely, if we have three tasks running and want five instead, the system will see that the desired delta is +2, so it will pick two nodes with available resources and start two new tasks on them. A short walk-through of two state transitions should also help clarify this process:

Initial State: Service #1 (3 tasks), Service #2 (2 tasks)
Desired State: Service #1 (1 task),  Service #2 (4 tasks)

Reconciliation:
 - Kill 2 random Service #1 tasks
 - Start 2 Service #2 tasks on available nodes

New Initial State: Service #1 (1 task), Service #2 (4 tasks)

New Desired State: Service #1 (2 tasks), Service #2 (0 tasks)

Reconciliation:
 - Start 1 Service #1 task on an available node
 - Kill all 4 running tasks of Service #2

Final State: Service #1 (2 tasks), Service #2 (0 tasks)
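The two transitions above can be replayed with a small simulation. This is a minimal sketch, not how any particular framework implements it: tasks are just ID strings held in memory, node placement is ignored, and the names `reconcile`, `service1`, and `service2` are made up for illustration.

```python
import itertools
import random

_task_ids = itertools.count(1)  # hypothetical source of unique task IDs


def reconcile(tasks: dict[str, list[str]], desired: dict[str, int],
              rng: random.Random) -> list[str]:
    """Mutate `tasks` until every service matches `desired`; return the actions taken."""
    actions = []
    for service in sorted(set(tasks) | set(desired)):
        running = tasks.setdefault(service, [])
        delta = desired.get(service, 0) - len(running)
        if delta < 0:
            # Too many tasks: kill -delta of them, chosen at random.
            for task_id in rng.sample(running, -delta):
                running.remove(task_id)
                actions.append(f"kill {task_id}")
        else:
            # Too few tasks (or exactly enough, in which case delta is 0).
            for _ in range(delta):
                task_id = f"{service}-{next(_task_ids)}"
                running.append(task_id)
                actions.append(f"start {task_id}")
    return actions


rng = random.Random(0)
tasks: dict[str, list[str]] = {}
reconcile(tasks, {"service1": 3, "service2": 2}, rng)          # reach the initial state
step1 = reconcile(tasks, {"service1": 1, "service2": 4}, rng)  # first transition
step2 = reconcile(tasks, {"service1": 2, "service2": 0}, rng)  # second transition

counts = {name: len(ids) for name, ids in tasks.items()}
print(counts)  # {'service1': 2, 'service2': 0}
```

The first transition produces two kills and two starts, and the second produces one start and four kills, matching the reconciliation steps listed above.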

Using this very simple but powerful logic, we can dynamically scale our services up and down without worrying (to a degree) about the intermediate stages. Internally, keeping and maintaining state is such a difficult task that most orchestration frameworks delegate it to a special, high-speed key-value store component (for example, etcd, ZooKeeper, or Consul).

Since our system only cares about where the current state is and where it needs to be, this algorithm also doubles as the mechanism for building resilience: a dead node or container reduces the current task count for the affected applications, which automatically triggers a state transition back to the desired counts. As long as services are mostly stateless and you have the resources to run the replacement tasks, these clusters are resilient to almost any type of failure. Hopefully you can now see how a few simple concepts tie together to create such a robust infrastructure.
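To make the resilience argument concrete, the sketch below simulates a node failure and lets the same count comparison restore the desired state. Everything here is an assumption made for illustration: the node and service names, the `heal` function, and the "least-loaded node" placement rule are hypothetical and much simpler than what a real scheduler does.

```python
from collections import Counter


def current_counts(placements: dict[str, list[str]]) -> Counter:
    """Count running tasks per service across all healthy nodes."""
    return Counter(svc for tasks in placements.values() for svc in tasks)


def heal(placements: dict[str, list[str]], desired: dict[str, int]) -> list[tuple[str, str]]:
    """Start replacement tasks on the least-loaded surviving nodes."""
    if not placements:
        raise RuntimeError("no healthy nodes left to schedule on")
    counts = current_counts(placements)
    started = []
    for service, want in desired.items():
        for _ in range(want - counts[service]):  # empty range if nothing is missing
            node = min(placements, key=lambda n: len(placements[n]))
            placements[node].append(service)
            started.append((service, node))
    return started


desired = {"web": 3, "api": 2}
placements = {
    "node-a": ["web", "api"],
    "node-b": ["web", "api"],
    "node-c": ["web"],
}

del placements["node-b"]  # simulated node failure: its tasks vanish from the current state
started = heal(placements, desired)

print(len(started))                       # 2
print(dict(current_counts(placements)))   # {'web': 3, 'api': 2}
```

Note that no failure-specific logic is needed: the lost node simply shows up as a lower current count, and the usual delta drives the recovery.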

With our new understanding of management and orchestration framework basics, we will now take a brief look at each one of our available options (Docker Swarm, Kubernetes, Marathon) and see how they compare with each other.
