Command and Control

Live control is only necessary if it takes your instances a long time to be ready to run. As a thought experiment, imagine that any configuration change took ten milliseconds to roll out and that each instance could be restarted in another hundred milliseconds. In that world, live control would be more trouble than it was worth. Whenever an instance needed to be modified, it would be simpler to just kill the instance and let the scheduler start a new one.

If your instances run in containers and get their configuration from a configuration service, then that is exactly the world you live in. Containers start very quickly, and a replacement instance picks up the new configuration the moment it boots.

Sadly, not every service is made of instances that start up so quickly. Anything based on Oracle’s JVM (or OpenJDK for that matter) needs a “warm-up” period before the JIT really kicks in and makes it fast. Many services need to hold a lot of data in cache before they perform well enough. That also adds to the startup time. If the underlying infrastructure uses virtual machines instead of containers, then it can take several minutes to restart.

Controls to Offer

In those cases, you need to look at ways to send control signals to running instances. Here is a brief checklist of controls to plan for:

  • Reset circuit breakers.
  • Adjust connection pool sizes and timeouts.
  • Disable specific outbound integrations.
  • Reload configuration.
  • Start or stop accepting load.
  • Feature toggles.

Not every service will need all of these controls. They should give you a place to start, though.
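As a concrete starting point, here is a minimal sketch of what such an admin API might look like in Python, using only the standard library. Everything here is illustrative, not a standard: the endpoint paths, the `AppState` fields, and the port number are all assumptions.

```python
# Hypothetical admin API sketch. Endpoint names, AppState fields, and the
# port are illustrative assumptions, not a standard convention.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AppState:
    """Runtime state that operators can adjust without a restart."""
    def __init__(self):
        self.accepting_load = True
        self.pool_size = 20
        self.toggles = {"new_checkout": False}

def handle_control(state, method, path):
    """Map an admin request onto a state change; returns (status, body)."""
    if method != "POST":
        return 405, {"error": "admin controls are POST-only"}
    if path == "/admin/stop-accepting-load":
        state.accepting_load = False
        return 200, {"accepting_load": state.accepting_load}
    if path == "/admin/start-accepting-load":
        state.accepting_load = True
        return 200, {"accepting_load": state.accepting_load}
    if path.startswith("/admin/pool-size/"):
        state.pool_size = int(path.rsplit("/", 1)[1])
        return 200, {"pool_size": state.pool_size}
    if path.startswith("/admin/toggle/"):
        name = path.rsplit("/", 1)[1]
        state.toggles[name] = not state.toggles.get(name, False)
        return 200, {name: state.toggles[name]}
    return 404, {"error": "unknown control"}

STATE = AppState()

class AdminHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        status, body = handle_control(STATE, "POST", self.path)
        payload = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

def serve_admin(port=8079):
    # Bind to loopback on a port separate from ordinary traffic, so the
    # admin API is never reachable by the general public.
    HTTPServer(("127.0.0.1", port), AdminHandler).serve_forever()
```

Keeping the dispatch logic in a plain function like `handle_control` also makes these controls testable without standing up a real server.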

Many services also expose controls to update the database schema, or even to delete all data and reseed it. These are presumably helpful in test environments but extremely hazardous in production. These controls result from a breakdown in roles. Developers don’t trust operations to deploy the software and run the scripts correctly. Operations doesn’t allow developers to log in to the production machines to update the schemata. That breakdown is itself a problem to fix. Don’t build a self-destruct button into your production code!

Another common control is the “flush cache” button. This is also quite hazardous. It may not be a self-destruct button, but it’s the button that vents all your atmosphere into space. An instance that flushes a cache will have really bad performance for the next several minutes. It may also generate a dogpile on the underlying service or database. Some kinds of services just can’t respond until their working set is loaded into memory.

Sending Commands

Once you’ve decided which controls to expose, there’s still the question of how to convey the operator’s intention to the instances themselves. The simplest approach is to offer an admin API over HTTP. Each instance of the service listens on a port for these requests. It must be a different port from the one that serves ordinary traffic, however. The admin API should not be available to the general public!

An HTTP API leaves the door open for higher levels of automation in the future. In the beginning, it’s fine to use cURL or any other HTTP client to poke the admin API. If that API happens to be described in OpenAPI format,[44] then a GUI comes for free with Swagger UI.[45]

At larger scales, simple scripts to call the admin API may no longer suffice. For one thing, it takes time to make the API call to each instance. Suppose each API call takes just a quarter-second to complete. It will take two minutes to loop over a fleet of 500 instances. Actually, that assumes all the instances are up and responding properly. More likely, whatever script loops over those API calls will stall out partway through because some instance doesn’t respond.
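One common answer before reaching for a queue is to fan the calls out concurrently, with a hard per-instance timeout so one dead instance can’t stall the sweep. The sketch below assumes a hypothetical `POST /admin/reload-config` endpoint; the host list, endpoint, and worker count are illustrative.

```python
# Fan one admin command out across a fleet with a bounded per-instance
# timeout. The endpoint path and host format are hypothetical.
from concurrent.futures import ThreadPoolExecutor
from urllib import error, request

def send_command(host, path="/admin/reload-config", timeout=2.0):
    """POST one admin command; never block longer than `timeout` seconds."""
    try:
        req = request.Request(f"http://{host}{path}", method="POST")
        with request.urlopen(req, timeout=timeout) as resp:
            return host, resp.status
    except OSError as exc:  # URLError, timeouts, refused connections
        return host, f"failed: {exc}"

def sweep(hosts, workers=32):
    """Serially, 500 calls at 250 ms each take ~125 seconds. With 32
    workers the sweep finishes in seconds, and a hung instance costs at
    most one timeout instead of blocking the whole loop."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(send_command, hosts))
```

Note that a failed call produces a result rather than an exception, so the sweep always reports on every instance, responsive or not.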

That’s when it’s time to build a “command queue.” This is a shared message queue or pub/sub bus that all the instances can listen to. The admin tool sends out a command that the instances then perform.

Be careful, though! With a command queue, it’s even easier to create a dogpile. It’s often a good idea to have each instance add a random bit of delay to spread them out a bit. It can also help to identify “waves” or “gangs” of instances. So a command may target “wave 1,” followed by “wave 2” and “wave 3” a few minutes later.
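The instance side of that pattern can be sketched in a few lines. This assumes each command arrives as a dictionary with hypothetical `"action"` and `"wave"` fields; the message bus itself (Kafka, Redis pub/sub, or similar) is left out of the sketch.

```python
# Instance-side handling of a command-queue message, assuming a command
# shape like {"action": "reload-config", "wave": 2}. Names are illustrative.
import random
import time

MY_WAVE = 2          # assigned at deploy time, e.g. instance_id % 3
MAX_JITTER = 30.0    # seconds of random delay to spread out the dogpile

def should_act(command, my_wave=MY_WAVE):
    """Act only on commands that target our wave (or all waves)."""
    target = command.get("wave")
    return target is None or target == my_wave

def handle(command, actions, sleep=time.sleep, jitter=MAX_JITTER):
    """Ignore other waves, apply random jitter, then dispatch the action."""
    if not should_act(command):
        return None
    sleep(random.uniform(0, jitter))  # spread instances out in time
    return actions[command["action"]]()
```

The jitter keeps a single broadcast from making every instance act at the same instant, while the wave field lets an operator stage a risky command across the fleet a slice at a time.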

Scriptable Interfaces

Admin GUIs demo very well. Unfortunately, they are a nightmare in production. The chief problem with a GUI is all the clicking. Mice are not easily scriptable; operators have to resort to GUI testing tools like Watir or Selenium to automate them. GUIs slow down operations by forcing administrators to repeat the same manual process on each service or instance (there might be many) every time the process is needed. For example, the clean shutdown sequence on one order management system I worked on required clicking, then waiting several minutes, on each of six different servers. Guess how often the clean shutdown sequence was observed? With a one-hour change window, nobody can afford to spend half of it waiting on the GUI.

The net result is that GUIs make terrible administrative interfaces for long-term production operation. The best interface for long-term operation is the command line. Given a command line, operators can easily build a scaffolding of scripts, logging, and automated actions to keep your software happy.
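A command-line wrapper over the admin API can be very thin and still compose cleanly with shell scripts and cron jobs. The sketch below assumes a hypothetical `/admin/<action>` endpoint convention; the tool name, host format, and action list are all made up for illustration.

```python
# Hypothetical thin CLI over an admin HTTP API. The /admin/<action>
# convention and the action names are illustrative assumptions.
import argparse
from urllib import request

def build_parser():
    parser = argparse.ArgumentParser(prog="adminctl")
    parser.add_argument("host", help="instance to control, e.g. app-07:8079")
    parser.add_argument(
        "action",
        choices=["reload-config", "stop-accepting-load", "reset-breakers"],
    )
    return parser

def main(argv=None):
    args = build_parser().parse_args(argv)
    req = request.Request(f"http://{args.host}/admin/{args.action}", method="POST")
    with request.urlopen(req, timeout=5) as resp:
        print(resp.status)
```

From there, looping over a fleet, logging results, or wiring the command into an automated runbook is ordinary shell scripting rather than screen-scraping a GUI.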

Remember This

It’s easy to get excited about control plane software. Blog posts and Hacker News will always egg you on to build more. But always keep the operating costs in mind. Anything you build must either be maintained or torn down. Choose the options that are appropriate for your team size and the scale of your workload.

Start with visibility. Use logging, tracing, and metrics to create transparency. Collect and index logs to look for general patterns. That also gets logs off of the machines for postmortem analysis when a machine or instance fails.

Use configuration, provisioning, and deployment services to gain leverage over larger or more dynamic systems. The more you move toward ephemeral machines, the more you need these. This pipeline to production is not just a set of development tools. It is the production environment that developers use to produce value. Treat it with the same care as you would any other production environment.

Once the system is (somewhat) stabilized and problems are visible, build control mechanisms. These should give you more precise control than just reconfiguring and restarting instances. A large system deployed to long-lived machines benefits more from control mechanisms than a highly dynamic environment will.
