Continuous evaluation

Cloud computing infrastructure providers compete on cost, performance, and features. Their offerings gradually become cheaper, quicker to start up, and more efficient at CPU-, disk-, or network-intensive workloads, and they support more exotic hardware such as GPUs. Because of these inevitable changes and market dynamics, it is important to continuously evaluate the accuracy of the planner over time.

The accuracy of the planner depends on a few factors. The set of supported machine instance types (for example, m4.large, c4.large, and so on) may change over time. The cost per hour may change. And the performance characteristics may change: the machines may start up faster, or they may handle the same processing task more or less efficiently. In our example planning application, all of these numbers were coded directly in the Main class, but a traditional database could be used to store this information instead, in order to facilitate easy updates.
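As a rough sketch of what that might look like, the following class loads the per-instance-type data over JDBC instead of hard-coding it in Main. The table and column names (instance_types, cost_per_hour, and so on) are illustrative assumptions, not part of the example application:

import java.sql.*;
import java.util.*;

// Hypothetical holder for the per-instance-type data the planner needs.
class InstanceTypeInfo {
    final String name;           // e.g., "m4.large"
    final double costPerHour;    // USD per hour
    final double startupSeconds; // time to boot and configure
    final double secondsPerTask; // average time to process one work unit

    InstanceTypeInfo(String name, double costPerHour,
                     double startupSeconds, double secondsPerTask) {
        this.name = name;
        this.costPerHour = costPerHour;
        this.startupSeconds = startupSeconds;
        this.secondsPerTask = secondsPerTask;
    }
}

class InstanceTypeRepository {
    private final Connection conn;

    InstanceTypeRepository(Connection conn) { this.conn = conn; }

    // Load every supported instance type, so updating a price or a
    // benchmark result only requires updating a database row.
    List<InstanceTypeInfo> loadAll() throws SQLException {
        List<InstanceTypeInfo> result = new ArrayList<>();
        String sql = "SELECT name, cost_per_hour, startup_seconds, seconds_per_task " +
                     "FROM instance_types";
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                result.add(new InstanceTypeInfo(
                        rs.getString("name"),
                        rs.getDouble("cost_per_hour"),
                        rs.getDouble("startup_seconds"),
                        rs.getDouble("seconds_per_task")));
            }
        }
        return result;
    }
}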

Continuous evaluation in a production environment should include active benchmarking: for every task completed on a cloud machine of a certain type, a record should be made in a database of the time-to-completion for that task and machine. With this information, each run of the planner can recompute the average time to complete a task on each of the various cloud machine instance types, enabling more accurate estimates.
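The following is a minimal sketch of such a benchmark log, again using JDBC; the task_benchmarks table and its columns are assumptions chosen for illustration. Each completed task inserts one row, and the planner queries the running average at planning time:

import java.sql.*;

class BenchmarkLog {
    private final Connection conn;

    BenchmarkLog(Connection conn) { this.conn = conn; }

    // Called once per completed task: record how long this machine type took.
    void recordCompletion(String instanceType, String taskId, double seconds)
            throws SQLException {
        String sql = "INSERT INTO task_benchmarks (instance_type, task_id, seconds) " +
                     "VALUES (?, ?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, instanceType);
            ps.setString(2, taskId);
            ps.setDouble(3, seconds);
            ps.executeUpdate();
        }
    }

    // Called at planning time: recompute the average processing time
    // for a given instance type from all recorded runs.
    double averageSeconds(String instanceType) throws SQLException {
        String sql = "SELECT AVG(seconds) FROM task_benchmarks WHERE instance_type = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, instanceType);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getDouble(1);
            }
        }
    }
}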

We have not yet asked a critical question about our planner: was it at all accurate? The planner estimated that the image processing job would require 59.25 minutes to complete, including the time required to start and configure the cloud machines. In other words, it predicted that the elapsed time from the point at which the various setup-and-run.sh scripts were executed (all in parallel, for the 10 planned machines) to the time the job finished and all machines were terminated would be 59.25 minutes. In actuality, the entire process took 58.64 minutes, an error of about 1%.

Interestingly, a little naiveté about a cloud provider's offerings can have big consequences. The t2.* instance types on AWS (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/t2-instances.html) and the B-series machines on Microsoft's Azure (https://techcrunch.com/2017/09/11/azure-gets-bursty/) are designed for bursty performance. If we run a benchmark of the image processing task on a single subset of 100 images, we will see a certain (high) level of performance. However, if we then give one of those same machines a long list of image processing tasks, eventually the machine will slow down. These machine types are cheaper because they only offer high performance for short intervals. This cannot be detected in a quick benchmark; it can only be detected after a long processing task has been underway for some time. Or, one could read all of the documentation before attempting anything:

T2 instances are designed to provide a baseline level of CPU performance with the ability to burst to a higher level when required by your workload.

(https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/t2-instances.html)

When a job that is predicted to take about an hour drags on for two, three, or more hours, one begins to suspect that something is wrong. The following figure shows a graph of CPU utilization on a t2.* instance. It is clear from the graph that either something is seriously wrong with the image processing code, or the cloud provider is enforcing a cap of no more than 10% CPU utilization after about 30 minutes of processing.

These are the kinds of subtleties that require some forewarning and demonstrate the importance of continuous evaluation and careful monitoring:


Figure 3: Bursty performance of Amazon's t2.* instance types. The CPU was expected to be utilized at 100% at all times. Other instance types, such as m4.* and c4.*, perform as expected, that is, without bursting.
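One way to get this forewarning is to log CPU utilization continuously while a long job runs, rather than relying only on a short benchmark. The following sketch uses the JDK's com.sun.management.OperatingSystemMXBean (a HotSpot-specific extension) to sample system CPU load once a minute; in practice, the cloud provider's own monitoring service, such as the one that produced the graph above, can serve the same purpose:

import com.sun.management.OperatingSystemMXBean;
import java.lang.management.ManagementFactory;

// Samples system CPU load once a minute while a long job runs, so a
// sustained drop (for example, to roughly 10%) shows up in the logs long
// before the job's final runtime reveals the problem.
public class CpuLoadMonitor implements Runnable {
    @Override
    public void run() {
        // Cast is HotSpot/OpenJDK-specific; getSystemCpuLoad() returns a
        // value in [0.0, 1.0], or a negative value if unavailable.
        OperatingSystemMXBean os = (OperatingSystemMXBean)
                ManagementFactory.getOperatingSystemMXBean();
        while (!Thread.currentThread().isInterrupted()) {
            double load = os.getSystemCpuLoad();
            System.out.printf("system CPU load: %.0f%%%n", load * 100);
            try {
                Thread.sleep(60_000); // sample once per minute
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) {
        Thread monitor = new Thread(new CpuLoadMonitor());
        monitor.setDaemon(true); // do not keep the JVM alive after the job finishes
        monitor.start();
        // ... run the long image processing job here ...
    }
}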
