Imagine the checkout counter of a supermarket on a Saturday evening, the usual rush-hour time. It is common to see long queues of people waiting to check out with their purchases. What could a store manager do to reduce the rush and waiting time?
A typical manager would try a few approaches, including asking those staffing the checkout counters to pick up their speed, and trying to redistribute people across queues so that each queue has roughly the same waiting time. In other words, they would manage the current load with available resources by optimizing the performance of the existing resources.
However, if the store has existing counters that are not in operation—and enough people at hand to manage them—the manager could enable those counters, and move people to these new counters. In other words, they would add resources to the store to scale the operation.
Software systems, too, scale in a similar way. An existing software application can be scaled by adding compute resources to it.
When the system scales by either adding or making better use of resources inside a compute node, such as CPU or RAM, it is said to scale vertically or scale up. On the other hand, when a system scales by adding more compute nodes to it, such as creating a load-balanced cluster of servers, it is said to scale horizontally or scale out.
The degree to which a software system is able to scale when compute resources are added is called its scalability. Scalability is measured in terms of how much the system's performance characteristics, such as throughput or latency, improve with respect to the addition of resources. For example, if a system doubles its capacity by doubling the number of servers, it is scaling linearly.
Increasing the concurrency of a system often increases its scalability. In the supermarket example given earlier, the manager is able to scale out their operations by opening additional counters. In other words, they increase the amount of concurrent processing done in their store. Concurrency is the amount of work that gets done simultaneously in a system.
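The supermarket analogy can be sketched in a few lines of Python. This is a toy illustration, not part of the chapter's later examples: the `checkout` function and the customer names are hypothetical stand-ins, and a thread pool plays the role of extra counters.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def checkout(customer):
    """Simulate serving one customer at a counter (a hypothetical stand-in)."""
    time.sleep(0.1)  # stand-in for the actual checkout work
    return f"served {customer}"

customers = [f"customer-{i}" for i in range(8)]

# One "counter": customers are served one after the other.
start = time.perf_counter()
serial = [checkout(c) for c in customers]
serial_time = time.perf_counter() - start

# Four "counters": the same customers are served concurrently.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    concurrent = list(pool.map(checkout, customers))
concurrent_time = time.perf_counter() - start

print(f"serial: {serial_time:.2f}s, concurrent: {concurrent_time:.2f}s")
```

With four workers, the same amount of work finishes in roughly a quarter of the time, which is exactly the effect the manager achieves by opening more counters. Threads suffice here because the simulated work sleeps rather than computes; CPU-bound work in Python would typically need processes instead, a point the chapter returns to later.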
In this chapter, we look at the different techniques of scaling a software application with Python.
We will cover the following topics in this chapter, in roughly this order:

asyncio in Python
async and await
The Mandelbrot set - Using Celery
uWSGI – WSGI middleware on steroids
Gunicorn – unicorn for WSGI
Gunicorn versus uWSGI
How do we measure the scalability of a system? Let's take an example, and see how this is done.
Let's say our application is a simple report generation system for employees. It is able to load employee data from a database, and generate a variety of reports in bulk, such as pay slips, tax deduction reports, employee leave reports, and more.
The system is able to generate 120 reports per minute—this is the throughput or capacity of the system expressed as the number of successfully completed operations in a given unit of time. Let's say the time it takes to generate a report at the server side (latency) is roughly 2 seconds.
Let's say the architect decides to scale up the system by doubling the RAM on its server.
Once this is done, a test shows that the system is able to increase its throughput to 180 reports per minute. The latency remains the same at 2 seconds.
So, at this point, the system has scaled sub-linearly with respect to the memory added. The scalability of the system expressed in terms of throughput increase is as follows:
Scalability (throughput) = 180/120 = 1.5X
As a second step, the architect decides to double the number of servers on the backend—all with the same memory. After this step, it's found that the system's performance throughput has now increased to 350 reports per minute. The scalability achieved by this step is given as follows:
Scalability (throughput) = 350/180 = 1.9X
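The two scaling steps above amount to a simple before/after throughput ratio, which can be captured in a small helper function (a sketch for illustration, not part of the report-generation system itself):

```python
def scalability(throughput_after, throughput_before):
    """Scalability expressed as the ratio of throughput after and before a scaling step."""
    return throughput_after / throughput_before

# Step 1: doubling the RAM took throughput from 120 to 180 reports per minute.
print(scalability(180, 120))  # 1.5

# Step 2: doubling the servers took throughput from 180 to 350 reports per minute.
print(scalability(350, 180))  # roughly 1.94
```

Note that each step is measured against the throughput achieved by the previous step, not against the original baseline, which is why the second step comes out close to linear.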
The system has now responded much better with a close to linear increase in scalability.
After further analysis, the architect finds that by rewriting the code that was processing reports on the server to run in multiple processes instead of a single process, they are able to reduce the processing time at the server, and hence, the latency of each request by roughly 1 second per request at peak time. The latency has now gone down from 2 seconds to 1 second.
The system's performance with respect to latency has become better as follows:
Performance (latency) = 2/1 = 2X
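The architect's change, moving report generation from a single process to a pool of worker processes, could look along these lines. This is a minimal sketch: `generate_report` and the employee IDs are hypothetical stand-ins for the real workload.

```python
from multiprocessing import Pool

def generate_report(employee_id):
    # Hypothetical stand-in for the CPU-bound report-generation work.
    return f"report for employee {employee_id}"

def generate_all(employee_ids, processes=4):
    # Spreading CPU-bound work across processes sidesteps the GIL,
    # so reports are generated in parallel and per-request latency
    # at peak load goes down.
    with Pool(processes=processes) as pool:
        return pool.map(generate_report, employee_ids)

if __name__ == "__main__":
    reports = generate_all(range(10))
    print(len(reports))  # 10
```

Processes, rather than threads, are the right tool here because report generation is compute-bound; with threads, Python's GIL would serialize the work and the latency would not improve.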
How does this improve scalability? Since the time taken to process each request is less now, the system overall will be able to respond to similar loads at a faster rate than what it was able to earlier. With the exact same resources, the system's throughput performance, and hence, scalability has increased assuming other factors remain the same.
Let's summarize what we've discussed so far: there is a relationship between scalability, performance, concurrency, and latency. We have captured these relationships in the following table:
Concurrency | Latency | Performance | Scalability
---|---|---|---
High | Low | High | High
High | High | Variable | Variable
Low | High | Poor | Poor
An ideal system is one that has good concurrency and low latency; such a system has high performance, and would respond better to scaling up and/or scaling out.
A system with high concurrency, but also high latency, would have variable characteristics—its performance, and hence, scalability would be potentially very sensitive to other factors such as current system load, network congestion, geographical distribution of compute resources and requests, and so on.
A system with low concurrency and high latency is the worst case—it would be difficult to scale such a system, as it has poor performance characteristics. The latency and concurrency issues should be addressed before the architect decides to scale the system horizontally or vertically.
Scalability is always described in terms of the variation in throughput performance.