Chapter 7. The Future of Mesos

Mesos has come a long way. It started out as a graduate student project at the University of California, Berkeley, but since then, it’s been rolled out in production across hundreds of thousands of machines, attracting scores of developers to its ecosystem. We’ve learned about building applications on Mesos today, but where is Mesos going? In this chapter, we’ll look at several current initiatives in the Mesos ecosystem that stand to become critical and valuable features of Mesos.

Multitenant Workloads

Before we discuss multitenancy, let’s look at a motivating problem: noisy neighbors. Noisy neighbors are a problem in real life as well as in distributed systems with multiple users (tenants). When an apartment building’s walls are too thin, you can hear your neighbors blaring music through the walls. Analogously, when a system doesn’t provide sufficient isolation (i.e., thick walls), your application’s performance can be adversely affected by other applications running on the same machine. For example, multiple CPU-intensive applications running on the same machine could all compete to use every CPU, resulting in reduced overall performance. As a result, it’s hard for the users or cluster operators to predict the performance of their applications—whereas if those applications were running in containers, they’d each be guaranteed a share of the CPU, reducing the unpredictability of their performance.

Multitenancy refers to the situation in which a single resource (in our case, a Mesos cluster and the resources on its slaves) must be shared by many users, along with the problems that arise when those users accidentally monopolize what should be shared. You want a system that is designed for multitenancy, because it’s much easier to have Mesos manage the isolation of resources than to rely on users to carefully write cooperative applications.

Today, many Mesos installations benefit from containerization by isolating different applications from one another; in the future, more and more Mesos clusters will be isolating different users or customers from one another. Many large enterprises have struggled with multitenancy problems. For example, many companies administer and run separate Hadoop clusters for each team that needs a cluster. Why don’t they just share one big cluster? Frequently, it’s not possible for each team to achieve the level of service and performance they need if they can’t predict how much the cluster will be utilized at any point. Separate clusters isolate the teams from one another, ensuring that each can always fully utilize the resources it needs. The separation of their systems is a solution to the noisy neighbor problem; however, since they’re not sharing, the companies have to buy many more computers.

Mesos can serve multitenant workloads by allowing many users to share the same physical hardware (or VMs); with Mesos, Linux containers provide the isolation. Linux containers are a technology based on cgroups, open sourced by Google to enable high-performance isolation of multiple workloads on the same machine. Through projects like CoreOS, Docker, and Mesos, Linux containers have grown from an interesting but difficult-to-use technology into a powerful, popular, lightweight isolation mechanism. Although cgroups initially only provided support for isolating CPUs and memory, more and more Linux kernel subsystems are being integrated. For instance, in July 2014, Mesos added deep integration with the Linux network isolation stack. This allows Mesos to be configured to control and isolate network bandwidth usage between containers running on the cluster. There is also ongoing work to bring isolation to disk I/O usage between containers. Over time, we’ll see Mesos gain more and more isolation features, which will continue to reduce the noisy neighbor problem in multitenant workloads.
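
For example, the isolators a slave uses are chosen when it starts up. A sketch of enabling CPU, memory, and network isolation might look like the following; the master address and bandwidth cap are placeholders, and the network/port_mapping isolator additionally requires a kernel built with the appropriate networking features:

    # Illustrative slave startup; addresses, flags, and limits will vary by site.
    mesos-slave --master=zk://zk1.example.com:2181/mesos \
      --isolation='cgroups/cpu,cgroups/mem,network/port_mapping' \
      --egress_rate_limit_per_container=100MB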

There are also Mesos frameworks being built to solve multitenancy problems at a higher level, to make it easier for enterprises to achieve good isolation between different users of a shared Mesos cluster. For example, Myriad was started in 2014 by eBay to run YARN on Mesos. YARN is the resource manager for Hadoop 2.0. With Myriad, a single REST API request is all that’s needed to create a new, fully isolated YARN cluster in Mesos. Thus, if you deploy Myriad, when a new team needs a Hadoop cluster for experimentation or production, there’s nearly no administrative or operational cost. Another multitenant Mesos framework is Cook, written by the author of this book and open sourced by Two Sigma in 2015. Cook is a preemptive job scheduler; it’s designed to allow any user to use as many resources as are available, but automatically scale that user back if other users’ demand for capacity increases later.

Another way that Mesos supports multitenancy is through resource reservations. Reservations can be used to enable and enforce minimum quotas, which makes them the strongest guarantee available in a multitenant environment, at the cost of dynamic flexibility. If you believe that users generally won’t monopolize the entire Mesos cluster, you can forgo reservations and allow any framework to scale itself up or down freely. When some users have production-critical service-level agreements that must be met, you can use reservations to guarantee them specific resources. As of version 0.25, Mesos has RESTful HTTP APIs so that cluster operators can easily create reservations for users on the fly, simplifying the task of guaranteeing resources to particular users and workloads (see the sketch below). There’s also an initiative to add a RESTful quota API to Mesos, so that operators can let Mesos figure out reservations for them, based on high-level guidelines about how many resources each role should be limited to.
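
For example, with the 0.25 API, reserving eight CPUs for a role might look roughly like the following. The master address, slave ID, role, and principal here are all placeholders:

    curl -i -X POST http://master.mesos.example.com:5050/master/reserve \
      -d slaveId=20151001-000000-1234-0-S0 \
      -d resources='[
        {
          "name": "cpus",
          "type": "SCALAR",
          "scalar": { "value": 8 },
          "role": "ads",
          "reservation": { "principal": "ops" }
        }
      ]'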

Mesos offers many features for building more robust multitenant systems. It integrates with the Linux kernel’s resource isolation mechanisms to provide low-level isolation between tasks on the cluster. Framework authors are also aware of this problem, and more and more frameworks are coming out that simplify the management of multitenant clusters. Resource reservations are the most powerful tool for solving multitenancy problems: they absolutely guarantee resources to a tenant, at the expense of preventing other users from accessing those resources. Now, what if there were a way for a resource to be reserved but still shared when it wasn’t currently needed? In the next section, we’ll look at the oversubscription feature Mesos added in version 0.23.

Oversubscription

What does it mean to utilize a cluster at 100%? To some, we’re at 100% utilization when we have no more free resources on our Mesos cluster. However, even when a framework reserves resources for a task (such as a web server), that task might not fully use all of them. In fact, on most clusters, the actual usage is only 10–30%. To counter this, the Mesos oversubscription feature was created. It allows Mesos clusters to automatically utilize their reserved but unused resources.1

To understand oversubscription, let’s first define slack. Slack is the difference between what resources you think you’re using and what you’re actually using (Figure 7-1). Everyone’s goal is to reduce slack, because slack is pure waste: resources that could be doing something productive instead sit idle. There are two kinds of slack on a Mesos cluster:

Allocation slack

Allocation slack is the difference between the resources available on the cluster and the resources that are reserved by frameworks. Mesos was engineered to address this type of slack efficiently, by repeatedly reoffering resources to all connected frameworks. This way, if one framework didn’t want or couldn’t make use of some resources, another framework could have the chance to use those resources. Some frameworks, such as Spark, take advantage of this by launching many small tasks that use few resources, so that they can get tiny allocations on many machines, driving cluster utilization up and exploiting those resources for their users.

Usage slack

Usage slack is the difference between the reserved resources and the resources actually used. For instance, if a web server reserves two CPUs, it may use nearly no resources during off-peak times when it doesn’t have many requests to process. The oversubscription feature is designed to allow Mesos to reduce this type of slack as well. (The short example below puts numbers to both kinds of slack.)
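
To make the distinction concrete, here’s a tiny worked example in Python with made-up numbers for a single slave:

    # Made-up numbers for one slave, measured in CPUs.
    total_on_slave = 16.0          # what the slave offers to Mesos
    reserved_by_frameworks = 12.0  # what frameworks have claimed for tasks
    actually_used = 5.0            # what the tasks really consume right now

    allocation_slack = total_on_slave - reserved_by_frameworks  # 4.0 CPUs
    usage_slack = reserved_by_frameworks - actually_used        # 7.0 CPUs
    print(allocation_slack, usage_slack)                        # 4.0 7.0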

Figure 7-1. Allocation and usage slack of CPUs and memory

In some ways, Mesos has become a victim of its own success: allocation slack can be dramatically reduced on Mesos clusters, with commercial production clusters reporting allocation slacks of 5–15%, much lower than the 20–30% seen with earlier systems. So, a consortium of companies, including Twitter, Mesosphere, and Intel, set out to build a system that would enable Mesos to reduce usage slack as well. The result of their work is the Mesos oversubscription system, which applies control theory to manage usage slack across many resources.

To enable oversubscription, Mesos added a new type of resource offer: a revocable offer. Revocable offers are exactly like normal offers, except that tasks launched on revocable offers can be killed by Mesos at any time. By default, frameworks won’t receive any revocable offers; they can opt in by adding the REVOCABLE_RESOURCES capability to their FrameworkInfos when they register. If the Mesos cluster has also been configured to enable the oversubscription mechanism, then any opted-in frameworks will see another type of offer alongside normal offers: the resources in these special offers will have their revocable field populated, which means the tasks using them can be killed at any time. Note that currently an executor must have only revocable or only normal resources; a mix of revocable and nonrevocable tasks cannot be launched on the same executor.
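
To opt in with the old Python bindings, a framework adds the capability to its FrameworkInfo before registering, and can then check each offered resource’s revocable field. This is a minimal sketch; the framework name and the helper function are illustrative:

    from mesos.interface import mesos_pb2

    framework = mesos_pb2.FrameworkInfo()
    framework.user = ""                   # let Mesos fill in the current user
    framework.name = "revocable-example"  # illustrative name

    # Opt in to receiving revocable (oversubscribed) resources.
    capability = framework.capabilities.add()
    capability.type = mesos_pb2.FrameworkInfo.Capability.REVOCABLE_RESOURCES

    # Later, when inspecting offers, revocability is marked on each resource:
    def is_revocable(resource):
        return resource.HasField("revocable")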

The oversubscription system comprises two pluggable components: the resource estimator and the quality of service (QoS) controller. The resource estimator’s job is to report to the slave how much slack is currently available in its running tasks, so that the slave can advertise the slack as additional resources. The QoS controller’s job is to track the usage slack. When the usage slack decreases below what has been allocated to revocable resources, the QoS controller will begin to kill revocable tasks to ensure that the cluster’s guaranteed reserved resources are provided.
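
The real components are C++ slave modules, but the division of labor can be sketched schematically. In this Python pseudocode, every name is illustrative rather than part of any Mesos API:

    import time

    ESTIMATION_INTERVAL = 5  # seconds; illustrative

    def oversubscription_loop(slave):
        while True:
            allocated = slave.allocated_resources()  # promised to tasks
            used = slave.measured_usage()            # actually consumed

            # Resource estimator: advertise the usage slack as revocable.
            slave.advertise_revocable(allocated - used)

            # QoS controller: if guaranteed tasks ramp up and the slack
            # shrinks below what revocable tasks consume, evict revocable
            # tasks until the guarantees can be honored again.
            if slave.revocable_usage() > allocated - used:
                slave.kill_some_revocable_tasks()

            time.sleep(ESTIMATION_INTERVAL)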

In Mesos 0.23, there is only a single, fixed resource estimator, which allows a cluster administrator to have all slaves advertise fixed extra resources. This can be useful when you know that most tasks request one or two CPUs, but typically only use 5–10% of their resources. The fixed resource estimator can then allow every slave to have many more tasks scheduled on it than it actually has CPUs. Only a no-op QoS controller ships with Mesos 0.23.
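
Enabling the fixed estimator is a matter of slave flags. This sketch assumes a standard module build; the library path and the two extra CPUs are placeholders:

    mesos-slave \
      --resource_estimator="org_apache_mesos_FixedResourceEstimator" \
      --modules='{
        "libraries": [{
          "file": "/usr/local/lib/libfixed_resource_estimator.so",
          "modules": [{
            "name": "org_apache_mesos_FixedResourceEstimator",
            "parameters": [{"key": "resources", "value": "cpus:2"}]
          }]
        }]
      }'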

Mesosphere and Intel are also building Serenity, a sophisticated control system that continuously and dynamically measures the usage slack on each slave so that the cluster can utilize those resources. Serenity also knows how to estimate the impact of noisy neighbors, how to distinguish between a task that is starting up and one that has reached its steady state, and how to make the other practical adjustments necessary to optimize cluster utilization.

Oversubscription is a powerful new feature that will help large clusters squeeze 10–40% more out of their resources. The revocable resources that oversubscription introduces aren’t only useful for putting slack resources to work; they will also form the basis of preemptible tasks.

Databases and Turnkey Infrastructure

Today, creating the complete infrastructure for an always-available distributed system is challenging. You need to set up databases, web servers, big data filesystems, analytics tools, monitoring systems, and report generation processes. This is a lot of work: besides the challenge of sorting through the countless technologies and solutions in each space, every tool has its own specific way that it needs to be installed and configured. Creating a process to enable repeatable installation and configuration is often a major sticking point: to reliably rebuild the entire infrastructure, one must typically write a large amount of bespoke deployment code, but doing so is expensive and time-consuming, so that work is often deferred until the system has had time to mature. Mesos is positioned to change the way we approach deploying distributed application infrastructures.

Mesos has always been adept at running computation-oriented tasks, such as web servers and analytics frameworks. In version 0.23, Mesos finally added support for managing disks on the cluster. As a result, many data-oriented applications are now able to become Mesos frameworks. For instance, Cassandra,2 HDFS,3 and Kafka4 have already started seeing use on Mesos clusters. Over the course of early 2016, as the Mesos persistent disks API sees more uptake, Mesos’s ecosystem will finally provide numerous robust choices for disk-based applications. One of the exciting projects to watch is Apache Cotton (formerly Mysos), which provides powerful automation for MySQL clusters.
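
A framework uses the new API by carving a persistent volume out of a disk offer with a CREATE operation, and then accepting the offer with that operation rather than simply launching tasks. Here is a rough sketch with the Python bindings, in which the role, volume ID, size, and container path are all made up:

    from mesos.interface import mesos_pb2

    def create_volume_operation():
        disk = mesos_pb2.Resource()
        disk.name = "disk"
        disk.type = mesos_pb2.Value.SCALAR
        disk.scalar.value = 1024                   # MB; illustrative size
        disk.role = "db"                           # a role with reserved disk
        disk.disk.persistence.id = "my-volume-id"  # illustrative volume ID
        disk.disk.volume.container_path = "data"   # where tasks see the volume
        disk.disk.volume.mode = mesos_pb2.Volume.RW

        operation = mesos_pb2.Offer.Operation()
        operation.type = mesos_pb2.Offer.Operation.CREATE
        operation.create.volumes.extend([disk])
        return operation

    # The scheduler then passes this operation to the driver's acceptOffers
    # call instead of calling launchTasks directly.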

In 2016, we’re finally going to be able to simply install a Mesos cluster, and then ask the Mesos cluster to bootstrap our databases, monitoring systems, web servers, and analytics engines. This will dramatically simplify the amount of work needed to get a scalable, high-availability, self-healing cluster up and running. Since Mesos frameworks are running programs that can react to changing cluster conditions, they’ll be able to automatically handle rote database maintenance and other standard cluster operation tasks. By automating these time-consuming tasks, operations staff will be able to spend more time focusing on other issues, such as application tuning and increasing utilization.

IP per Container

Another exciting feature coming for Mesos is IP per container. The problem we’d like to solve has two cases:

  1. We’d like to use plain DNS names to have our browser automatically find tasks on our Mesos cluster (e.g., mytask.mesos.mycompany.com).

  2. We’d like to more easily port applications that were built to run on an entire machine to Mesos (e.g., Cassandra, Riak, and other databases).

Mesos already has a DNS solution, called Mesos-DNS. Mesos-DNS reads all the task metadata from the masters and, for tasks that specify service discovery information or have alphanumeric-only names, generates DNS records so that those tasks can be found through standard mechanisms. Alas, there remains a difficult problem: when a task needs to listen on some ports, most Mesos frameworks will dynamically allocate those ports based on what’s available on the slave. As a result, even if you know the DNS name of the host, the browser will only try ports 80 (HTTP) and 443 (HTTPS), so it won’t be able to actually connect to the running application. Today, Mesos-DNS addresses this by also serving SRV records, which include the port number that the service is running on. Unfortunately, browsers don’t support SRV records, although you can use Nginx as a stopgap.5
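
For instance, supposing Mesos-DNS serves a mesos domain (every name in this example is a placeholder), you could compare the two record types with dig:

    # A record: resolves to the slave's IP, but says nothing about ports.
    dig mytask.myframework.mesos +short

    # SRV record: also carries the dynamically assigned port.
    dig _mytask._tcp.myframework.mesos SRV +short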

The IP per container system solves this problem by allocating each executor its own specific IP address. This way, the DNS entry for the task points to the executor’s IP address, and then the task can bind to whatever ports it wants on its private IP address. The way this works is that each executor’s container gets its own Linux network namespace. This creates a separate network stack per container, so that the host can route traffic to the appropriate container. In addition, each network stack can have quality of service controls applied to it, so that each container can be assigned guaranteed network bandwidth. This feature will allow Mesos to manage IP addresses as first-class resources, like CPU, memory, and disk are today.
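
The kernel primitive underneath is easy to see with iproute2. This generic Linux demo (it requires root, and has nothing to do with Mesos itself) shows that a fresh namespace has a completely separate, empty network stack:

    sudo ip netns add demo
    sudo ip netns exec demo ip addr   # only a down loopback device
    sudo ip netns del demo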

Summary

There are a lot of new and exciting features and plans in the Mesos ecosystem. Mesos has been built from the ground up to provide the strong isolation guarantees necessary to build out multitenant systems, and we’re starting to see this realized with systems like Myriad and Cook, as well as features like the network isolator.

Alongside the growing use in multitenant applications, Mesos is developing a sophisticated oversubscription feature that will enable the same type of high cluster utilization that Google has been able to achieve. As designed, the oversubscription feature is easy to get started with (if fixed oversubscription is enough for you), and advanced, dynamically adaptive oversubscription is well underway.

Finally, persistent disks and IP per container are the last features needed to allow Mesos to offer the complete infrastructure management that products like OpenStack were built to provide. Mesos clusters can manage all of an organization’s infrastructure, including analytics, databases, and services. Having an IP address per container will make it much easier to port legacy applications that require specific port ranges, and to expose Mesos applications to end users without building complex routing and proxying layers.

Mesos is taking over the data center, bringing lightweight containerization, orchestration, and simple, centralized management to everyone with its open source ecosystem.

1 The Mesos oversubscription feature is a solution to low actual utilizations, based on Google’s Heracles system.

2 Cassandra is a high-performance NoSQL database, popular for its extremely high scalability.

3 HDFS, the Hadoop Filesystem, is one of the most popular big data filesystems.

4 Kafka is a high-performance, replicated queue that enables big data publish/subscribe workloads to scale.

5 Nginx can be configured to route incoming domain names based on SRV records with SRV Router.
