A Scheduling Solution

As we just learned, the simple job of scheduling tasks to run on a node can quickly become complex. Resource schedulers in operating systems have evolved over time, and we have a lot of experience scheduling local resources. Solutions for scheduling across a cluster of machines are not nearly as mature or advanced. However, some great solutions are available to address this need, and they are evolving at a very rapid rate.

Many of today's cluster scheduling solutions provide similar features and approaches to container placement. Before covering some of the more popular schedulers, we will first discuss the features they commonly share.

Strategies, Policies, and Algorithms

The strategies, or logic, used to determine which nodes a task or service runs on can be as simple as selecting a random node that meets our constraints, or something far more complex. As mentioned earlier, we would like to optimize resource utilization without adversely impacting application performance. Common strategies a scheduler might use are bin packing, spread, and random. A bin packing approach fills up one node before moving on to fill the next. A spread approach essentially performs round-robin placement of services across all available nodes. A random approach places services on randomly selected nodes in the cluster. Placement can get considerably more complicated than this as we place different types of workloads next to one another and need to prioritize them.

Many of the technologies used for scheduling will offer extensibility for scheduling strategies, so that more complex strategies can be used if needed, or strategies that better meet the business requirements.
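The three placement strategies above can be sketched in a few lines of code. The following is a minimal, illustrative model, not any real scheduler's API; the node dictionaries, field names, and function names are invented for this sketch:

```python
import random

def bin_pack(nodes, task_cpu):
    """Fill the first node that still has capacity before moving to the next."""
    for node in nodes:
        if node["free_cpu"] >= task_cpu:
            node["free_cpu"] -= task_cpu
            return node["name"]
    raise RuntimeError("no node can fit the task")

def spread(nodes, task_cpu):
    """Place on the node running the fewest tasks, giving a round-robin effect."""
    candidates = [n for n in nodes if n["free_cpu"] >= task_cpu]
    node = min(candidates, key=lambda n: n["tasks"])
    node["free_cpu"] -= task_cpu
    node["tasks"] += 1
    return node["name"]

def random_place(nodes, task_cpu):
    """Place on any node with enough free capacity, chosen at random."""
    candidates = [n for n in nodes if n["free_cpu"] >= task_cpu]
    node = random.choice(candidates)
    node["free_cpu"] -= task_cpu
    return node["name"]
```

Because each strategy shares the same signature, a scheduler built this way could accept a custom strategy function, which is essentially what the extensibility hooks mentioned above provide.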

Rules & Constraints

Most schedulers enable us to tag nodes and then use that data to apply constraints when scheduling tasks. For example, we could tag nodes that have SSD storage attached to them, and then when we are scheduling tasks, we can provide a constraint stating that the task must be scheduled to a node with SSD storage. We can tag nodes with fault domains and provide a constraint stating that the scheduler needs to spread a set of tasks across the fault or update domains, ensuring all our service instances are not in the same fault domain.
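Constraint matching of this kind boils down to filtering the candidate node list by tags before a placement strategy runs. A minimal sketch, assuming hypothetical node and tag structures invented for illustration:

```python
def feasible_nodes(nodes, constraints):
    """Keep only nodes whose tags satisfy every constraint, e.g. storage == 'ssd'."""
    return [
        n for n in nodes
        if all(n["tags"].get(key) == value for key, value in constraints.items())
    ]
```

A scheduler would typically apply a filter step like this first, and only then rank the surviving nodes with its placement strategy.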


Availability Sets

In Azure, virtual machines are organized into availability sets, which assign fault and update domains to the virtual machine instances. Update domains indicate groups of hardware that can be rebooted together. Fault domains indicate groups of machines that share a common power source, network switch, and so on.


Dependencies

A scheduler needs a way to define a grouping of containers, their dependencies, and their connectivity requirements. A service can be composed of multiple containers that need to remain in close proximity. Kubernetes enables you to group a set of containers that make up a service into pods. A good example of this is a Node.js service that uses NGINX for its static content and Redis as a “local” cache. Other services might need to be near each other for efficiency reasons.
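The key scheduling consequence of grouping is that the containers must land on the same node, so the node must have capacity for their combined requirements. A minimal sketch of that idea, with invented pod and node structures rather than the Kubernetes API:

```python
def schedule_pod(nodes, pod):
    """A pod's containers land together, so a node must fit their combined needs."""
    needed = sum(c["cpu"] for c in pod["containers"])
    for node in nodes:
        if node["free_cpu"] >= needed:
            node["free_cpu"] -= needed
            return node["name"]
    return None  # no single node can host the whole group

# Hypothetical grouping of the example service above.
web_pod = {"name": "web", "containers": [
    {"name": "node-app", "cpu": 1},
    {"name": "nginx", "cpu": 1},
    {"name": "redis-cache", "cpu": 1},
]}
```

Note that grouping makes placement harder: a cluster can have three CPUs free in total yet still be unable to host a three-CPU pod if the capacity is scattered across nodes.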

Replication

A scheduler needs to deal with optimally scheduling multiple instances of a service across the cluster. It’s likely we don’t want to place all the instances of a service on a single node or in the same fault domain. The scheduler often needs to be aware of these factors and distribute the instances across multiple nodes in the cluster.
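One common way to distribute replicas is to always place the next instance in the fault domain holding the fewest replicas so far. A minimal sketch under that assumption, with invented node structures:

```python
from collections import Counter

def place_replicas(nodes, replicas):
    """Place each replica in the fault domain with the fewest replicas so far."""
    per_domain = Counter()
    placement = []
    for _ in range(replicas):
        node = min(nodes, key=lambda n: per_domain[n["fault_domain"]])
        per_domain[node["fault_domain"]] += 1
        placement.append((node["name"], node["fault_domain"]))
    return placement
```

With two fault domains and four replicas, this yields two replicas per domain, so losing any one domain leaves half the instances running.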

Reallocation

Cluster resources can change as nodes are added, removed, or even fail. In addition to the cluster changing, service load can change as well. A good scheduling solution needs to be able to constantly monitor the nodes and services, and then make decisions on how to best reallocate resources based on availability as well as the most optimal performance, scale, and efficiency needs of the system.


Azure Healing

In Azure, if the hardware running the services fails, Azure will handle moving the virtual machine the services are running on. This situation, where a machine goes away and comes back, can be challenging for schedulers to manage. Also note that Azure works at the machine level and will not help us if an individual container fails.


High Availability

It could be important that our orchestration and scheduling services are highly available. Most of the orchestration tools enable us to deploy multiple management instances, so that if one is unavailable for planned or unplanned maintenance, we can connect to another. Depending on the application requirements and cluster management features, it can be acceptable to run a single instance of the management and orchestration service. If the management node is temporarily unavailable, we simply cannot reschedule services until it returns; the running application is not impacted.

Rolling Updates

As we roll out updates to our services, the scheduler might need to scale down an existing version of a service while scaling up the new version and routing traffic across both. The scheduler should monitor the health and status of the deployment and automatically roll back the update if necessary.
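The update-then-verify loop described above can be sketched as follows. This is a simplified model, not any particular scheduler's implementation; the batch size, health-check callback, and rollback behavior are all assumptions made for illustration:

```python
def rolling_update(instances, new_version, health_check, batch_size=1):
    """Replace instances batch by batch; roll everything back if health degrades."""
    original = list(instances)  # snapshot for rollback
    for start in range(0, len(instances), batch_size):
        for i in range(start, min(start + batch_size, len(instances))):
            instances[i] = new_version
        if not health_check(instances):
            instances[:] = original  # automatic rollback of the whole deployment
            return False
    return True
```

A real scheduler would also drain traffic from each batch before replacing it and route requests across both versions while the rollout is in flight.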

Autoscaling

The cluster scheduler might be able to schedule instances of a service based on time of day, or based on monitoring metrics to match the load on the system. In addition to being able to monitor and schedule task instances within a cluster, there could be a need to automatically scale the cluster nodes to reduce the cost of idle resources or to respond to increased capacity needs. The scheduler can raise an event for the provisioning systems to add new nodes to the cluster or remove some nodes from the cluster.
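A common metric-driven approach is to scale the instance count proportionally to the ratio of observed utilization to a target utilization, clamped to configured bounds. A minimal sketch, where the target, minimum, and maximum values are illustrative assumptions:

```python
import math

def desired_instances(current, cpu_utilization, target=0.6, min_n=2, max_n=10):
    """Scale instance count toward a target CPU utilization, within bounds."""
    desired = math.ceil(current * cpu_utilization / target)
    return max(min_n, min(max_n, desired))
```

When the clamped result repeatedly hits the maximum, that is a natural point to raise the event mentioned above and ask the provisioning system for more cluster nodes.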

API

Continuous Integration/Deployment systems might need to schedule new services or update an existing service. An API will be necessary to make this possible. The API can even expose events which can be used to integrate with other services in the infrastructure. This can be useful for monitoring, scheduler tuning, and the integration of other systems or even cloud platform infrastructure.

There are a growing number of technologies available in the market for the orchestration and scheduling of services. Some are not yet feature-complete, but they are moving extremely fast. We will cover a few of the more popular ones in use as of this writing. The first we will look at is Docker Swarm.
