Impact of processing models on throughput and latency

To verify this statement, let's try to do a simple load test. For this purpose, we are going to use a simple Spring Boot 2.x application with Web MVC or WebFlux (let's call it the middleware). We are also going to simulate I/O activity from the middleware by making a few network calls to a third-party service, which will return an empty successful response with a guaranteed 200 milliseconds of average latency. The communication flow is depicted as follows:

Diagram 6.13. Communication flow for benchmark

To launch our middleware and simulate client activity, we are going to use a Microsoft Azure infrastructure with Ubuntu Server 16.04 installed on each machine. For the middleware, we are going to use a D12 v2 VM (4 virtual CPUs and 28 GB RAM). For the client, we are going to use an F4 v2 VM (4 virtual CPUs and 8 GB RAM). User activity will be increased sequentially in small steps. We are going to start our load test with four simultaneous users and finish with 20,000 simultaneous users. This will give us smooth latency and throughput curves and allow us to create understandable graphs. To produce an appropriate load on the middleware and collect statistics and measurements correctly, we are going to use a modern HTTP benchmarking tool called wrk (https://github.com/wg/wrk).

Note that these benchmarks are intended to show a tendency, rather than the system's stability over time, and to gauge how well the current implementation of the WebFlux framework performs. The following measurements show the advantages of the non-blocking, asynchronous communication in WebFlux over the blocking, synchronous, thread-based communication in Web MVC.

The following is an example of the Web MVC middleware code used for measurements:

@RestController                                                    // (1)
@SpringBootApplication                                             //
public class BlockingDemoApplication                               //
        implements InitializingBean {                              //
    ...                                                            // (1.1)

    @GetMapping("/")                                               // (2)
    public void get() {                                            //
        restTemplate.getForObject(someUri, String.class);          // (2.1)
        restTemplate.getForObject(someUri, String.class);          // (2.2)
    }                                                              //
    ...                                                            //
}                                                                  //

The preceding code can be described as follows:

  1. This is the declaration of the class, annotated with @SpringBootApplication. At the same time, this class is a controller, annotated with @RestController. To keep this example as simple as possible, we have skipped the initialization process and the field declarations in this class, as shown at point (1.1).
  2. Here, we have a get method with the @GetMapping declaration. In order to reduce redundant network traffic and focus only on framework performance, we do not return any content in the response body. According to the flow in the preceding diagram, we perform two HTTP requests to the remote server, as shown at points (2.1) and (2.2).
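The initialization skipped at point (1.1) is not part of the listing. A minimal sketch of what it might contain follows; the RestTemplate configuration and the target address are assumptions, not values from the benchmark:

```
// A hypothetical sketch of the state elided at point (1.1).
private final RestTemplate restTemplate = new RestTemplate();
private String someUri;

@Override
public void afterPropertiesSet() {
    // InitializingBean callback: resolve the address of the slow
    // third-party service here (the URL below is a placeholder).
    someUri = "http://remote-service:8080/";
}
```

The real application may wire these dependencies differently; only the names restTemplate and someUri are taken from the listing above.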

As we can see from the previous example and schema, the middleware's average response time should be around 400 milliseconds.

Note that, for this test, we are going to use a Tomcat web server, which is the default for Web MVC. In addition, to see how the performance of Web MVC changes, we are going to configure as many Thread instances as there are simultaneous users. The following sh script shows the setup for Tomcat:

java -Xss512K -Xmx24G -Xms24G
     -Dserver.tomcat.prestart-min-spare-threads=true
     -Dserver.tomcat.max-threads=$1
     -Dserver.tomcat.min-spare-threads=$1
     -Dserver.tomcat.max-connections=100000
     -Dserver.tomcat.accept-count=100000
     -jar ...

As we can see from the preceding script, the values of the max-threads and min-spare-threads parameters are dynamic and are defined by the number of parallel users in the test.
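Little's Law (throughput = concurrency / latency) makes it clear why the thread count must track the user count in this model: with ~400 milliseconds spent waiting on I/O per request, N worker threads can complete at most N / 0.4 requests per second. A back-of-the-envelope sketch, with an assumed thread count rather than a value from the benchmark:

```java
// Theoretical throughput ceiling of the thread-per-connection model,
// via Little's Law. The thread count of 4,000 is an assumed example.
public class CapacityEstimate {
    public static void main(String[] args) {
        int workerThreads = 4_000;     // max-threads, matched to parallel users
        double latencySeconds = 0.4;   // two sequential 200 ms remote calls

        double ceilingRps = workerThreads / latencySeconds;
        System.out.println("theoretical ceiling: " + ceilingRps + " req/s");
        // With every worker parked on I/O for 400 ms, 4,000 threads can
        // serve at most ~10,000 requests per second; any additional user
        // has to wait in the queue for a free thread.
    }
}
```

This is why the script pre-starts min-spare-threads at the same value as max-threads: any smaller pool would add queueing time on top of the 400 ms of I/O latency.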

The preceding setup is not production ready and is used only for the purpose of showing the disadvantages of the threading model used in Spring Web MVC, especially the thread-per-connection model.

By launching the test suite against our service, we will get the following result curve:

Diagram 6.14. Web MVC throughput measurement results

The preceding diagram shows that, at some point, we start losing throughput, which means that there is contention or incoherence in our application.

In order to compare the performance results of the Web MVC framework, we have to run an identical test for WebFlux as well. The following is the code that we use in order to measure WebFlux-based application performance:

@RestController
@SpringBootApplication
public class ReactiveDemoApplication
        implements InitializingBean {
    ...

    @GetMapping("/")
    public Mono<Void> get() {                              // (1)
        return                                             //
            webClient                                      //
                .get()                                     // (2)
                .uri(someUri)                              //
                .retrieve()                                //
                .bodyToMono(DataBuffer.class)              //
                .doOnNext(DataBufferUtils::release)        //
                .then(                                     // (3)
                    webClient                              //
                        .get()                             // (4)
                        .uri(someUri)                      //
                        .retrieve()                        //
                        .bodyToMono(DataBuffer.class)      //
                        .doOnNext(DataBufferUtils::release)//
                        .then()                            //
                )                                          //
                .then();                                   // (5)
    }
    ...
}

The preceding code shows that we are now actively using Spring WebFlux and Project Reactor features in order to achieve asynchronous and non-blocking request and response processing. Just as in the Web MVC case, we return a Void result, but at point (1) it is now wrapped in the reactive type, Mono. At point (2), we execute a remote call using the WebClient API, and then at point (3) we chain – in the same sequential fashion – the second remote call, shown at point (4). Finally, at point (5), we discard the results of both calls and return a Mono<Void> that notifies the subscriber of the completion of both executions.

Note that, with the Reactor technique, we could improve the execution time by performing both requests in parallel: since both executions are non-blocking and asynchronous, we would not have to allocate additional Thread instances for that. However, in order to maintain the behavior of the system depicted in diagram 6.13, we keep the execution sequential, so the resulting latency should still be ~400 milliseconds on average.
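The sequential-but-non-blocking composition that .then() gives us can be mimicked with nothing but the JDK's CompletableFuture. In the following sketch, the remoteCall helper is a hypothetical stand-in for one 200 millisecond remote call; it exists only to make the point that chaining two asynchronous operations does not block any thread between them:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SequentialAsync {

    // Hypothetical stand-in for one 200 ms remote call: the future is
    // completed by the scheduler's timer thread, so no caller thread blocks.
    static CompletableFuture<Void> remoteCall(ScheduledExecutorService timer) {
        CompletableFuture<Void> future = new CompletableFuture<>();
        timer.schedule(() -> { future.complete(null); }, 200, TimeUnit.MILLISECONDS);
        return future;
    }

    public static void main(String[] args) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        long start = System.nanoTime();

        // Chain the calls sequentially, as Mono ... .then(...) does: the second
        // call starts only after the first completes, yet nothing blocks in between.
        remoteCall(timer)
                .thenCompose(ignored -> remoteCall(timer))
                .join();

        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("elapsed ~" + elapsedMs + " ms"); // roughly 400 ms in total
        timer.shutdown();
    }
}
```

The .join() at the end is only there to keep the demo alive; in the WebFlux handler, no such blocking call exists, because the framework subscribes to the returned Mono itself.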

By launching the test suite against our WebFlux-based middleware, we will get the following result curve:

Diagram 6.15. WebFlux throughput measurement results

As we may see from the preceding chart, the tendency of the WebFlux curve is somewhat similar to that of the Web MVC curve.

In order to compare both curves, let's put them on the same plot:

Diagram 6.16. WebFlux versus Web MVC throughput measurement results comparison

In the preceding diagram, the line of + (plus) symbols is for Web MVC and the line of - (dash) symbols is for WebFlux. In this case, higher means better; as we can see, WebFlux has almost twice the throughput.

Also, it should be noted here that there are no measurements for Web MVC after 12,000 parallel users. The problem is that Tomcat's thread pool takes too much memory and does not fit in the given 28 GB. Therefore, each time Tomcat tries to dedicate more than 12,000 Thread instances, the Linux kernel kills that process. This point emphasizes that the thread-per-connection model does not fit in cases where we need to handle more than around 10,000 users.
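The numbers from the launch script make this failure easy to estimate: with -Xss512K reserved per thread on top of a 24 GB heap, 12,000 threads push the process past the VM's 28 GB of physical memory, before even counting Metaspace, native buffers, and the kernel's own needs. A rough calculation:

```java
// Rough memory estimate for 12,000 Tomcat worker threads under the
// JVM settings from the launch script (-Xss512K -Xmx24G).
public class ThreadMemoryEstimate {
    public static void main(String[] args) {
        long heapGb = 24;       // -Xmx24G: heap reserved for the application
        long stackKb = 512;     // -Xss512K: stack reserved per thread
        long threads = 12_000;  // point at which the process gets killed

        double stacksGb = threads * stackKb / (1024.0 * 1024.0);
        double totalGb = heapGb + stacksGb;

        System.out.printf("thread stacks: %.1f GB, heap + stacks: %.1f GB%n",
                stacksGb, totalGb);
        // ~5.9 GB of stacks on top of the 24 GB heap is already ~29.9 GB,
        // which exceeds the 28 GB of RAM on the middleware VM -- so the
        // kernel's OOM killer terminates the JVM.
    }
}
```
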

The preceding comparison is a comparison of the thread-per-connection model versus the non-blocking, asynchronous processing model. In the first case, the only way to process requests without a significant impact on latency is to dedicate a separate Thread to each user. In this way, we minimize the time a user spends in the queue waiting for an available Thread. In contrast, the WebFlux configuration does not require the allocation of a separate Thread per user, since we use non-blocking I/O. In real-world scenarios, the usual configuration of a Tomcat server has a limited thread pool size.

Nevertheless, both curves show a similar tendency and have critical points, after which throughput starts degrading. This may be explained by the fact that many systems have limits on the number of open client connections. In addition, the comparison may be a bit unfair, since we use different implementations of HTTP clients, with different configurations. For example, the default connection strategy for RestTemplate is to allocate a new HTTP connection on each call. In contrast, the default Netty-based WebClient implementation uses a connection pool under the hood, so a connection may be reused. Even though the system could be tuned to reuse opened connections, such a comparison may be misleading.
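For reference, the pooling behavior of the Netty-based WebClient can also be configured explicitly through Reactor Netty's ConnectionProvider. The following is only a sketch; the pool name and size are assumed values, not settings used in this benchmark:

```
// Sketch only: an explicitly sized connection pool for the Netty-based
// WebClient. "benchmark-pool" and the size of 500 are assumptions.
ConnectionProvider provider = ConnectionProvider.create("benchmark-pool", 500);
HttpClient httpClient = HttpClient.create(provider);
WebClient webClient = WebClient.builder()
        .clientConnector(new ReactorClientHttpConnector(httpClient))
        .build();
```

Similarly, RestTemplate could be given a pooled client (for example, via HttpComponentsClientHttpRequestFactory), which is one way to make the comparison fairer; the delay-based simulation below is another.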

Therefore, to get a better comparison, we are going to simulate the network activity with an artificial delay: two sequential 200 millisecond delays, giving the same ~400 milliseconds in total. For both cases, the following code is used:

Mono.empty()
    .delaySubscription(Duration.ofMillis(200))
    .then(Mono.empty()
        .delaySubscription(Duration.ofMillis(200)))
    .then()

For WebFlux, the return type is Mono<Void>, and for Web MVC, the execution flow ends with a call to the .block() operation, so the Thread is blocked for the specified delay. We use the same code in both cases in order to get identical delay-scheduling behavior.
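Put together, the Web MVC handler for this round would look roughly as follows (a sketch, assuming Reactor is also on the classpath of the MVC application, as the shared delay code requires):

```
@GetMapping("/")
public void get() {
    Mono.empty()
        .delaySubscription(Duration.ofMillis(200))
        .then(Mono.empty()
            .delaySubscription(Duration.ofMillis(200)))
        .then()
        .block();  // Web MVC: the servlet thread is parked for the full 400 ms
}
```

The WebFlux variant returns the same chain as a Mono<Void> without the final .block(), so no thread is held for the duration of the delay.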

We are also going to use a similar cloud setup. For the middleware, we are going to use an E4s v3 VM (4 virtual CPUs and 32 GB RAM), and for the client, a B4ms VM (4 virtual CPUs and 16 GB RAM).

By running our test suite against the services, the following results can be observed:

Diagram 6.17. WebFlux versus Web MVC throughput measurement result comparison, without additional I/O

In the preceding diagram, the line of + (plus) symbols is for Web MVC and the line of - (dash) symbols is for WebFlux. As we can see, the overall results are higher than with real external calls. That means that either a connection pool within the application or a connection policy within the operating system has a huge impact on system performance.

Nevertheless, WebFlux still shows twice the throughput of Web MVC, which confirms our assumption about the inefficiency of the thread-per-connection model. WebFlux still scales as Amdahl's Law predicts. However, we should remember that, along with application limitations, there are system limitations, which may alter our interpretation of the final results.

We can also compare both modules with regard to their latency and CPU usage, which are depicted in diagrams 6.18 and 6.19 respectively:

Diagram 6.18. A comparison of the latency of WebFlux and Web MVC without additional I/O

In the preceding diagram, the line of + (plus) symbols is for Web MVC and the line of - (dash) symbols is for WebFlux. In this case, the lower the result, the better. The preceding diagram depicts a huge degradation in latency for Web MVC. At a parallelization level of 12,000 simultaneous users, WebFlux shows a response time that is around 2.1 times better.

From the perspective of CPU usage, we have the following tendency:

Diagram 6.19. A comparison of the CPU usage of WebFlux and Web MVC without additional I/O

In the preceding diagram, the solid line is for Web MVC and the dashed line is for WebFlux. Again, the lower the result, the better in this case. We can conclude that WebFlux is much more efficient with regard to throughput, latency, and CPU usage. The difference in CPU usage may be explained by the redundant work of context switching between different Thread instances.
