Chapter 5. Writing Applications that Scale

Imagine the checkout counter of a supermarket on a Saturday evening, the usual rush-hour time. It is common to see long queues of people waiting to check out with their purchases. What could a store manager do to reduce the rush and waiting time?

A typical manager would try a few approaches, such as asking those manning the checkout counters to pick up their speed, and redistributing people across the queues so that each queue has roughly the same waiting time. In other words, they would manage the current load by optimizing the performance of the existing resources.

However, if the store has existing counters that are not in operation—and enough people at hand to manage them—the manager could enable those counters, and move people to these new counters. In other words, they would add resources to the store to scale the operation.

Software systems, too, scale in a similar way. An existing software application can be scaled by adding compute resources to it.

When a system scales by either adding or making better use of resources inside a compute node, such as CPU or RAM, it is said to scale vertically or scale up. On the other hand, when a system scales by adding more compute nodes to it, such as creating a load-balanced cluster of servers, it is said to scale horizontally or scale out.

The degree to which a software system is able to scale when compute resources are added is called its scalability. Scalability is measured in terms of how much the system's performance characteristics, such as throughput or latency, improve with respect to the addition of resources. For example, if a system doubles its capacity by doubling the number of servers, it is scaling linearly.

Increasing the concurrency of a system often increases its scalability. In the supermarket example given earlier, the manager is able to scale out their operations by opening additional counters. In other words, they increase the amount of concurrent processing done in the store. Concurrency is the amount of work that gets done simultaneously in a system.

In this chapter, we look at the different techniques of scaling a software application with Python.

Our discussion in this chapter will roughly follow this outline of topics:

  • Scalability and performance
  • Concurrency
    • Concurrency and parallelism
    • Concurrency in Python – multithreading
      • Thumbnail generator
      • Thumbnail generator – producer/consumer architecture
      • Thumbnail generator – program end condition
      • Thumbnail generator – resource constraint using locks
      • Thumbnail generator – resource constraint using semaphores
      • Resource constraint – semaphore versus lock
      • Thumbnail generator – URL rate controller using conditions
      • Multi-threading – Python and GIL
    • Concurrency in Python – multiprocessing
      • A primality checker
      • Sorting disk files
      • Sorting disk files – using a counter
      • Sorting disk files – using multiprocessing
    • Multi-threading versus multiprocessing
    • Concurrency in Python – Asynchronous execution
      • Pre-emptive versus co-operative multitasking
      • asyncio in Python
      • Waiting for future – async and await
      • Concurrent futures – high-level concurrent processing
    • Concurrency options – how to choose
  • Parallel processing libraries
    • joblib
    • PyMP
      • Fractals – the Mandelbrot set
      • Fractals – scaling the Mandelbrot set implementation
  • Scaling for the web
    • Scaling workflows – message queues and task queues
    • Celery – a distributed task queue
      • The Mandelbrot set – using Celery
    • Serving Python on the web – WSGI
      • uWSGI – WSGI middleware on steroids
      • Gunicorn – unicorn for WSGI
      • Gunicorn versus uWSGI
  • Scalability architectures
    • Vertical scalability architectures
    • Horizontal scalability architectures

Scalability and performance

How do we measure the scalability of a system? Let's take an example, and see how this is done.

Let's say our application is a simple report generation system for employees. It is able to load employee data from a database, and generate a variety of reports in bulk, such as pay slips, tax deduction reports, employee leave reports, and more.

The system is able to generate 120 reports per minute—this is the throughput or capacity of the system expressed as the number of successfully completed operations in a given unit of time. Let's say the time it takes to generate a report at the server side (latency) is roughly 2 seconds.
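
To make these two measures concrete, here is a minimal sketch of how they could be computed for a batch of requests; the generate_report callable and the list of requests are hypothetical placeholders, not part of the system described here:

    import time

    def measure(generate_report, requests):
        # Time the whole batch to derive throughput, and each call to derive latency.
        latencies = []
        start = time.perf_counter()
        for request in requests:
            t0 = time.perf_counter()
            generate_report(request)
            latencies.append(time.perf_counter() - t0)
        elapsed = time.perf_counter() - start

        throughput = len(requests) / (elapsed / 60.0)   # reports per minute
        avg_latency = sum(latencies) / len(latencies)   # seconds per report
        return throughput, avg_latency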

Let's say the architect decides to scale up the system by doubling the RAM on its server.

Once this is done, a test shows that the system is able to increase its throughput to 180 reports per minute. The latency remains the same at 2 seconds.

So, at this point, the system has scaled somewhat less than linearly with respect to the memory added. The scalability of the system expressed in terms of throughput increase is as follows:

Scalability (throughput) = 180/120 = 1.5X

As a second step, the architect decides to double the number of servers on the backend—all with the same memory. After this step, it's found that the system's performance throughput has now increased to 350 reports per minute. The scalability achieved by this step is given as follows:

Scalability (throughput) = 350/180 = 1.9X

The system has now responded much better with a close to linear increase in scalability.
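
Expressed as a few lines of Python, the scalability arithmetic for the two scaling steps looks roughly like this (the numbers are the ones from the example above):

    def scalability(throughput_before, throughput_after):
        # Scalability factor: ratio of throughput after adding resources to before.
        return throughput_after / throughput_before

    print(scalability(120, 180))   # 1.5  -> after doubling the RAM (scale up)
    print(scalability(180, 350))   # ~1.9 -> after doubling the servers (scale out)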

After further analysis, the architect finds that by rewriting the code that was processing reports on the server to run in multiple processes instead of a single process, he is able to reduce the processing time at the server, and hence, the latency of each request by roughly 1 second per request at peak time. The latency has now gone down from 2 seconds to 1 second.
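
The rewritten report-processing code is not shown at this point in the book, but the kind of change described could look roughly like the following sketch, which spreads the work across a pool of worker processes; generate_report and employee_ids are hypothetical placeholders:

    from concurrent.futures import ProcessPoolExecutor

    def generate_report(employee_id):
        # Placeholder for the CPU-heavy work of generating a single report.
        ...

    def generate_reports(employee_ids, workers=4):
        # Generate reports in several worker processes instead of one after
        # another in a single process, reducing the time taken per batch.
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(generate_report, employee_ids))

    if __name__ == "__main__":
        # The guard matters on platforms that spawn worker processes.
        reports = generate_reports(range(100), workers=4)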

The system's performance with respect to latency has become better as follows:

Performance (latency) = 2/1 = 2X

How does this improve scalability? Since the time taken to process each request is now lower, the system as a whole will be able to respond to similar loads at a faster rate than it could earlier. With exactly the same resources, the system's throughput performance, and hence its scalability, has increased, assuming other factors remain the same.

Let's summarize what we've discussed so far, as follows:

  1. In the first step, the architect increased the throughput of a single system by scaling it up, that is, by adding extra memory as a resource, which increased the overall scalability of the system. In other words, he scaled the performance of a single system by scaling up, which boosted the overall performance of the whole system.
  2. In the second step, he added more nodes to the system, and hence increased its ability to perform work concurrently, and found that the system responded well by rewarding him with a near-linear scalability factor. In other words, he increased the throughput of the system by scaling its resource capacity. Thus, he increased the scalability of the system by scaling out, that is, by adding more compute nodes.
  3. In the third step, he made a critical fix by running the computation in more than one process. In other words, he increased the concurrency of a single system by dividing the computation into more than one part. He found that this improved the performance characteristics of the application by reducing its latency, potentially setting up the application to handle workloads better under high stress.

We find that there is a relationship between scalability, performance, concurrency, and latency. This can be explained as follows:

  1. When the performance of one of the components in a system goes up, generally the performance of the overall system goes up.
  2. When an application scales in a single machine by increasing its concurrency, it has the potential to improve performance, and hence, the net scalability of the system in deployment.
  3. When a system reduces its processing time, or latency, at the server, it contributes positively to scalability.

We have captured these relationships in the following table:

Concurrency | Latency | Performance | Scalability
------------|---------|-------------|------------
High        | Low     | High        | High
High        | High    | Variable    | Variable
Low         | High    | Poor        | Poor

An ideal system is one that has good concurrency and low latency; such a system has high performance, and would respond better to scaling up and/or scaling out.

A system with high concurrency, but also high latency, would have variable characteristics—its performance, and hence, scalability would be potentially very sensitive to other factors such as current system load, network congestion, geographical distribution of compute resources and requests, and so on.

A system with low concurrency and high latency is the worst case—it would be difficult to scale such a system, as it has poor performance characteristics. The latency and concurrency issues should be addressed before the architect decides to scale the system horizontally or vertically.

Scalability is always described in terms of variation in performance throughput.
