© Hillel Wayne 2018
Hillel Wayne, Practical TLA+. https://doi.org/10.1007/978-1-4842-3829-5_5

5. Concurrency

Hillel Wayne, Chicago, Illinois, USA

Almost everything we do is time dependent. Every mutation splits the temporal state of the program in two: one before the change, and one after. For a simple system, we can precisely define any state based on the initial state and the lines of code. It evolves in a deterministic, predictable way.

But many programs aren’t that simple. In a concurrent system, there is no single timeline. We have multiple actions that can happen in any order, producing a fractured spread of new time lines. Concurrent systems describe everything from threads sharing memory to independent actors to changes in our real world. And concurrent systems are notoriously hard to design correctly. There are simply too many possible behaviors to reason through.

So we’ll reason with TLA+ instead. We’ve already done some basic nondeterminism with either and with. In this chapter, we introduce the idea of processes and labels, which give us the structure we need to spec out and test generalized concurrent code.

Labels

Before we talk about concurrency, we need to cover labels. The last time we used labels was back in Chapter 2 with the wire example. That’s because we don’t need labels for single-process applications, which we’ve been writing so far. We need labels to accurately describe concurrent systems.

Labels determine the grain of atomicity of your spec. TLC executes everything in a label in a single step, or action. Then it checks all invariants and looks for the next label to execute (action to take). Just as TLC checks all possible behaviors on every either and with, it also checks all possible behaviors on the set of next possible labels. In other words, if you have a concurrent system, TLC will test all available next actions for a possible error.

When translating PlusCal into TLA+, we get an extra pc (“program counter”) variable that tracks which label we're currently on. If pc = "A" then the next state will consist of everything under the A label. We can jump to a label in the same process with goto NameOfLabel. Since specifications are smaller than programs, goto is a lot easier to reason about in PlusCal than in a programming language, and it's often quite useful.
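As a minimal sketch (the algorithm, variable, and label names are mine), here is how labels, pc, and goto fit together:

(*--algorithm counter
variable x = 0;
begin
  Increment:
    x := x + 1;
    if x < 3 then
      goto Increment;
    end if;
end algorithm;*)

When pc = "Increment", everything under that label runs as one step. Once x reaches 3, the goto is skipped, pc becomes "Done", and the behavior terminates.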

Tip

PlusCal automatically defines a “Done” label at the end of every process. You cannot use “Done” as part of your own label, but you can jump to it with goto.

You can have as many labels as you’d like, with the main cost being performance. However, there’s also a minimum number of labels you need. You must place labels according to the following rules:
  • You must have a label at the beginning of each process and before every while.

  • You may not place a label inside a macro or a with statement.

  • You must place a label after every goto.

  • If you use an either or an if and any possible branch has a label inside it, you must place a label after the end of the control structure.

  • You may not assign to the same variable twice in a label.
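As a sketch of the first and fourth rules (the labels and variable are mine):

(*--algorithm label_rules
variable x = 0;
begin
  Init:            \* rule 1: a label must begin the body
    x := 1;
  Loop:            \* rule 1: and must precede every while
    while x < 3 do
      either
        x := x + 1;
      or
        Bump:      \* this branch contains a label...
          x := 3;
      end either;
      AfterEither: \* ...so rule 4 requires a label after end either
        skip;
    end while;
end algorithm;*)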

The last rule deserves a little more explanation. Given the following:
Valid:
  either x := 1;
  or     x := 2;
  end either;
Invalid:
  x := 1;
  x := 2;
Valid is a valid use of a label: even though x appears twice, only one of those assignments can happen in any given execution of the label. Invalid assigns to x twice, so it’s an invalid use of a label. This can become a problem when dealing with functions. We cannot write
Invalid:
  struct.key1 := 1;
  struct.key2 := 2;
because that assigns to struct twice. For this particular case, PlusCal has the || operator. You can combine two assignments with || and they will be evaluated simultaneously.
Valid:
  struct.key1 := 1 ||
  struct.key2 := 2;

With that, we’re ready to talk concurrency.

Processes

A common situation in programming is the reader-writer pattern. This is where you have two or more asynchronous processes communicating over a shared channel, one of which is primarily writing messages and one of which is primarily consuming them. This occurs in a lot of places: pub-sub in Internet services, threads with a shared buffer, environmental sensors, etc. We’ll model the case where the shared channel is bounded, so that “the message buffer length does not exceed the maximum size” is an invariant.
EXTENDS TLC, Integers, Sequences
CONSTANTS MaxQueueSize
(*--algorithm message_queue
variable queue = <<>>;
define
  BoundedQueue == Len(queue) <= MaxQueueSize
end define;
process writer = "writer"
begin Write:
  while TRUE do
    queue := Append(queue, "msg");
  end while;
end process;
process reader = "reader"
variables current_message = "none";
begin Read:
  while TRUE do
    current_message := Head(queue);
    queue := Tail(queue);
  end while;
end process;
end algorithm;*)

The most important thing about this system is that it is concurrent. This means there’s no enforced order to when either process runs: the writer could write a dozen messages before the reader reads six, and then the writer could only add one more before the reader reads the rest. We do this by using the process keyword. Each process must be assigned a value; in this case, strings. Unlike with single-process algorithms, all processes must explicitly use (and begin with) labels.

TLC is able to choose any process to run. It executes one label in that process, calculates invariants, and then chooses the next label in the next process to run. Note that pc is no longer a single value. Now it’s a function that represents the current label each process can execute.

The reader also has a local variable: current_message is inaccessible to the writer process or anything in a define block. However, a macro can use it if called inside the process. Like global variables, local variables can also be defined with \in, in which case TLC will check all of the possible starting states.

Run this with MaxQueueSize <- 2 and INVARIANT BoundedQueue. You should see it immediately fail. TLC starts by immediately running the reader process, which tries to Head an empty queue. Since Head is undefined for empty sequences, the spec fails. The problem is that we have no way of forcing the reader to wait until there’s something in the queue. Let’s fix that.

Await

await Expression prevents a step from running until Expression is true. You can also use the keyword when.
process reader = "reader"
variable current_message = "none";
begin Read:
  while TRUE do
    await queue /= <<>>;
    current_message := Head(queue);
    queue := Tail(queue);
  end while;
end process;
Neither of the assignments in the Read label can happen until the queue is not empty. This means that the Read action is not enabled when the queue is empty: it cannot happen. Then the only enabled action is Write, meaning TLC must execute Write next. In effect, this forces the reader to wait until the writer adds something to the queue. Adding this prevents the empty read case, so TLC reveals a different error: the writer can write until the length of the queue exceeds MaxQueueSize. We fix this by adding an await to the writer, too.
process writer = "writer"
begin Write:
  while TRUE do
    queue := Append(queue, "msg");
    await Len(queue) <= MaxQueueSize;
  end while;
end process;

Here, I put the await after the append to queue. This has slightly different behavior: the step can’t happen until the await is true with the updated queue. If taking the action would end with queue being above the maximum size, the await disables the action entirely. This can be a little confusing when you first encounter it, so I recommend always placing your awaits at the beginning of the step unless you have a good reason not to.

If you run this, it should pass (9 states).

Deadlocks

await prevents a process from continuing until its conditions are met. What happens when all of our processes are prevented from continuing?

Let’s add the case where the reader might fail to properly handle the message. This can happen several states after we pop the message from the queue. In this case, we usually want to log some error to be processed, which means the reader should add an error to the queue. Here’s what this all looks like:
macro add_to_queue(val) begin
  await Len(queue) < MaxQueueSize;
  queue := Append(queue, val);
end macro;
process writer = "writer"
begin Write:
  while TRUE do
    add_to_queue("msg");
  end while;
end process;
process reader = "reader"
variable current_message = "none";
begin Read:
  while TRUE do
    await queue /= <<>>;
    current_message := Head(queue);
    queue := Tail(queue);
    either
      skip;
    or
      NotifyFailure:
        current_message := "none";
        add_to_queue("fail");
    end either;
  end while;
end process;

First, since both of the processes write to the queue, we pull the add logic into a macro named add_to_queue. To simulate the reader process failing, we use a common PlusCal pattern I call possibly: an either with two branches, one of which does nothing (skip). In the other, we need to use a new label. This is because we’ve already modified both current_message and queue in the Read action. Since you cannot assign to the same variable twice in the same step, we add the NotifyFailure label. Since one of the branches of the either has a label in it, we’d need to put a new label after end either if we wanted to write more in the process. However, the end of the either is the end of the while block and the end of the while block is the end of the process, so we don't need another label.

Try running this. You should see a new error: Deadlock Reached. A deadlock is when none of the processes in your spec can take any actions. Usually this is because of an await statement, but it can also happen with with x \in S if S is the empty set. Deadlocks are usually bad. If you’re writing a spec where a deadlock isn’t bad, you can disable the check in the model, under Model Overview > What to Check? > Deadlock.
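As a minimal sketch of a guaranteed deadlock (the names are mine): the only action awaits a condition that can never become true, so no process can ever move.

(*--algorithm stuck
variable x = 0;
begin
  A:
    await x > 0;  \* nothing ever changes x, so A is never enabled
end algorithm;*)

A with y \in S behaves the same way when S is empty: there is no value to choose, so the step can never be taken.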

Process Sets

One fix you see a lot in the wild is to add more readers: if both the writer and the reader are stuck in a deadlock, the second reader can pop from the queue. Practically, this sometimes works. But does it always work, or can it, in some circumstances, still lead to a deadlock? To test this, let’s change the reader from a single process to a set of them.
process reader \in {"r1", "r2"}
variable current_message = "none";
begin Read:
  while TRUE do
    await queue /= <<>>;
    current_message := Head(queue);
    queue := Tail(queue);
    either
      skip;
    or
      NotifyFailure:
        current_message := "none";
        add_to_queue(self);
    end either;
  end while;
end process;

We made two changes here. The first is that instead of assigning reader to a value, we’re saying it’s in the set {"r1", "r2"}. TLC will create two copies of reader: one for each element, and assign each of them its own set of local variables. During the model checking, at every step TLC can advance “writer” or “r1” or “r2”. Second, to distinguish the two readers in the message queue, we call add_to_queue with self instead of “fail”. If a process has multiple copies, such as “r1” and “r2”, self is whatever value that given copy is assigned to.

Note

All process names across all processes must be comparable. Since the value for writer is a string, the value for reader can be either a set of strings or a set of model values.

If we run this, we should still see a deadlock. While multiple readers may reduce the chance of a deadlock, they do not eliminate it entirely, and TLC will still catch the error.

Procedures

What if we want to share multiple-step behavior between processes? Macros cannot contain labels, so we cannot use them for this purpose. Our final piece of PlusCal syntax, procedures, addresses this use case. To demonstrate them, here’s what our spec looks like when we replace the macro with a single-label procedure:
procedure add_to_queue(val="") begin
  Add:
    await Len(queue) < MaxQueueSize;
    queue := Append(queue, val);
    return;
end procedure;
process writer = "writer"
begin Write:
  while TRUE do
    call add_to_queue("msg");
  end while;
end process;
process reader \in {"r1", "r2"}
variable current_message = "none";
begin Read:
  while TRUE do
    await queue /= <<>>;
    current_message := Head(queue);
    queue := Tail(queue);
    either
      skip;
    or
      NotifyFailure:
        current_message := "none";
        call add_to_queue(self);
    end either;
  end while;
end process;

If you run this, you should see the same expected deadlock. A procedure has the same syntax as a macro, except that it can contain labels. In addition, you can define local variables for a procedure in the same manner you would for processes. You can only define the local variables as equaling an expression (=), though, not as being some element of a set (\in). We exit the procedure with return, which does not return any value to the calling process; it simply ends the procedure.

In order to call a procedure in a process, we have to prefix it with call. A called procedure must be immediately followed by a label, the end of an enclosing block, a goto, or a return.
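As a sketch of both rules (the procedure, labels, and values are mine), here is a procedure with a local variable and a call followed by a label:

procedure count_up(limit=0)
variable i = 0;    \* procedure locals can only be initialized with =
begin
  Count:
    while i < limit do
      i := i + 1;
    end while;
    return;
end procedure;

process p = "p"
begin
  Run:
    call count_up(3);
  AfterCall:       \* the call must be followed by a label, return, goto, or block end
    skip;
end process;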

Procedures must be defined after macros and before processes. A good rule of thumb to remember this is that procedures can use macros but macros cannot use procedures, so procedures must follow macros. Similarly, processes can call procedures and macros, but procedures cannot use processes.
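Putting the ordering together, a spec skeleton might look like this (all names are mine):

(*--algorithm skeleton
variables x = 0;           \* 1. variables
define
  NonNegative == x >= 0    \* 2. definitions
end define;
macro bump() begin         \* 3. macros
  x := x + 1;
end macro;
procedure do_bump()        \* 4. procedures
begin
  P:
    bump();
    return;
end procedure;
process p = "p"            \* 5. processes
begin
  Run:
    call do_bump();
end process;
end algorithm;*)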

Tip

When using process sets that use procedures or macros, you can still use self inside of the procedure or macro. It will refer to the value of the calling process.
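As a sketch (the names are mine), a macro writing to a per-process slot via self:

variables finished = [w \in {"w1", "w2"} |-> FALSE];

macro mark_finished() begin
  finished[self] := TRUE;  \* self is "w1" or "w2", whichever copy called us
end macro;

process worker \in {"w1", "w2"}
begin
  Work:
    mark_finished();
end process;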

Example

We can use processes to model anything concurrent, not just algorithms. One common use case is to use processes to model time periods: cases where some external activity happens every so often. For this example, we’ll have several clients consume some shared resource that periodically renews. This is a generic implementation and can represent clients calling a rate-limited API, loggers cutting a tree farm, scheduling CPU time, etc.

First, let’s implement what this might look like without any renewal process.
EXTENDS Integers
CONSTANT ResourceCap, MaxConsumerReq
ASSUME ResourceCap > 0
ASSUME MaxConsumerReq \in 1..ResourceCap
(*--algorithm cache
variables resources_left = ResourceCap;
define
  ResourceInvariant == resources_left >= 0
end define;
process actor = "actor"
variables
  resources_needed \in 1..MaxConsumerReq
begin
  UseResources:
    while TRUE do
      resources_left := resources_left - resources_needed;
    end while;
end process;
end algorithm; *)

We have two constants: one that represents the total possible resources in the system, and one that represents the maximum a given actor can consume per tick. The actor will continuously consume from the global pool of resources, eventually depleting them all. We want to make it so that it never consumes more resources than are possible (depleting to zero is fine).

Run this with ResourceCap <- 6, MaxConsumerReq <- 2 and INVARIANT ResourceInvariant. This should fail by violating ResourceInvariant. Since we don’t have anything stopping us from overconsuming, this makes sense. Let’s add an await to make sure this doesn’t happen.
  UseResources:
    while TRUE do
      await resources_left >= resources_needed;
      resources_left := resources_left - resources_needed;
    end while;
The good news is this no longer violates ResourceInvariant. The bad news is it deadlocks. Once we run out of resources, the actor can’t do anything. Since the resource is supposed to be renewable, we should add a “time” process that occasionally refreshes resources_left.
process time = "time"
begin
  Tick:
    resources_left := ResourceCap;
    goto Tick;
end process;

Whenever time runs, it resets resources_left back to the cap. Now the actor can never deadlock, and our spec passes (22 states).

Let’s make this more complex. Often we have a number of consumers using the same resource, not just one. If they don’t coordinate, they can cause global problems even if each one is locally safe. We start by generalizing the number of actors in the system.
EXTENDS Integers
CONSTANT ResourceCap, MaxConsumerReq, Actors
ASSUME ResourceCap > 0
ASSUME Actors /= {}
ASSUME MaxConsumerReq \in 1..ResourceCap
(*--algorithm cache
variables resources_left = ResourceCap;
define
  ResourceInvariant == resources_left >= 0
end define;
process actor \in Actors
variables
  resources_needed \in 1..MaxConsumerReq;
begin
  WaitForResources:
    while TRUE do
      await resources_left >= resources_needed;
      resources_left := resources_left - resources_needed;
    end while;
end process;

time remains the same. The constant Actors will have to be a set of strings or a set of model values; in these kinds of cases, a set of model values is usually preferable to a set of strings. Since resources_needed is local to each actor, they don’t all need to have the same value. Try Actors <- [model value] {a1, a2} and rerun. You should see that it still passes (69 states).

What if the actors don’t drain the resources atomically? Once they start, they remove resources over some period of time, during which the other actors can also be draining them. Let’s also say that the actors only check that there are enough resources when they first start consuming and do no consistency checks in the middle of the process.
process actor \in Actors
variables
  resources_needed \in 1..MaxConsumerReq;
begin
  WaitForResources:
    while TRUE do
      await resources_left >= resources_needed;
      UseResources:
        while resources_needed > 0 do
          resources_left := resources_left - 1;
          resources_needed := resources_needed - 1;
        end while;
        with x \in 1..MaxConsumerReq do
          resources_needed := x;
        end with;
    end while;
end process;

Since the await is in the same step that enters UseResources, we can only step into the loop if there are enough resources available. However, once we do, the while loop will keep running until we’ve consumed our fill. Since we’re destructively updating resources_needed, we need to reset it at the end of the loop. This also means we can update it to a different value than at the beginning of the process. The actor may first need one unit of resource, then two, then one again, etc.

If we run this, we now violate ResourceInvariant again. One actor can start consuming, but halfway through another actor can deplete the rest of the pool, at which point the first actor breaks the invariant.

We’ll try two fixes for this: one that won’t succeed and one that will. Our first fix will be to only let each actor complete once before refreshing. The cap is 6, there are two actors, and the most each can consume per complete iteration is 2. Since 6 >= 2 * 2, limiting them should work, right?

We need to add some supplementary values. These are not necessarily “real” qualities of the system, just bookkeeping we add to guarantee that each actor only runs once per tick. We can do this by adding a variable called ran for each actor, and then having time set it to false on every tick. Since two separate processes use it, we need to make it a global value.
(*--algorithm cache
variables
  resources_left = ResourceCap,
  ran = [a \in Actors |-> FALSE];
\* ...
process actor \in Actors
variables
  resources_needed \in 1..MaxConsumerReq
begin
  WaitForResources:
    while TRUE do
      await ~ran[self];
      when resources_left >= resources_needed;
      UseResources:
        while resources_needed > 0 do
          resources_left := resources_left - 1;
          resources_needed := resources_needed - 1;
        end while;
        with x \in 1..MaxConsumerReq do
          resources_needed := x;
        end with;
        ran[self] := TRUE;
    end while;
end process;
process time = "time"
begin
  Tick:
    resources_left := ResourceCap;
    ran := [a \in Actors |-> FALSE];
    goto Tick;
end process;
This passes (389 states). But that’s only because we’re exploring a very small state space, and maybe some other configuration of values would break this. We can test this by letting the model span over a whole range of possible maximum capacities, not just the single value we picked.
(*--algorithm cache
variables
  resource_cap \in 1..ResourceCap,
  resources_left = resource_cap,
  ran = [a \in Actors |-> FALSE];
define
  ResourceInvariant == resources_left >= 0
end define;
\* ...
process time = "time"
begin
  Tick:
    resources_left := resource_cap;
    ran := [a \in Actors |-> FALSE];
    goto Tick;
end process;

When we run this, we once again violate ResourceInvariant. TLC picks resource_cap = 1, and as 1 < 2 * 2 our “fix” no longer works. This is why it’s important to look at larger state spaces.

SYMMETRY SETS

The downside of larger state spaces is that “larger” becomes intractable much too quickly. Try rerunning the model without ResourceInvariant as a checked invariant. That change balloons our search space from 389 to 2085 states. If you add a third actor, you now have 24,485 states!

This is where symmetry sets can make a big difference. If you convert Actors to a symmetry set, the model should pass with only 6040 states – less than a quarter the size of the original state space. Using symmetry sets is often a good first-pass optimization for your slower specs.
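In the Toolbox you mark the set symmetric when editing the constant. Outside the Toolbox, a cfg-driven run can get the same effect with Permutations from the TLC module (the definition name perms is mine, and the spec must EXTENDS TLC):

\* in the spec, e.g. in the define block:
perms == Permutations(Actors)

\* in the model's .cfg file:
SYMMETRY perms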

Restricting how many times each actor could run didn’t work. Let’s try using a semaphore instead. A semaphore is a shared value that all of the actors can access, which we use for coordination. What we can do is have the actors “reserve” how many resources they intend to consume. Instead of checking whether there are enough resources in the world, they instead check how many are left in the semaphore value. Since we can subtract the amount instantaneously, we don’t have to worry about the same race conditions. We’ll remove all of the “ran” junk, since that wasn’t helpful.
variables
  resource_cap \in 1..ResourceCap,
  resources_left = resource_cap,
  reserved = 0; \* our semaphore
define
  ResourceInvariant == resources_left >= 0
end define;
process actor \in Actors
variables
  resources_needed \in 1..MaxConsumerReq
begin
  WaitForResources:
    while TRUE do
      await reserved + resources_needed <= resources_left;
      reserved := reserved + resources_needed;
      UseResources:
        while resources_needed > 0 do
          resources_left := resources_left - 1;
          resources_needed := resources_needed - 1;
        end while;
        with x \in 1..MaxConsumerReq do
          resources_needed := x;
        end with;
    end while;
end process;
process time = "time"
begin
  Tick:
    resources_left := resource_cap;
    reserved := 0;
    goto Tick;
end process;
This still fails for ResourceCap <- 6, MaxConsumerReq <- 2. TLC will find an error similar to the following:
  1. TLC sets resource_cap to 1.

  2. a1 reserves 1 resource and enters UseResources.

  3. Before a1 does anything, Tick happens, resetting reserved to 0.

  4. a2 reserves 1 resource and enters UseResources.

  5. Both a1 and a2 resolve UseResources, bringing resources_left to -1.

Instead of resetting reserved on every tick, let’s instead have the actors gradually mark when they’ve consumed resources and no longer need capacity reserved.

  UseResources:
    while resources_needed > 0 do
      resources_left := resources_left - 1;
      resources_needed := resources_needed - 1;
      reserved := reserved - 1;
    end while;
\* ...
process time = "time"
begin
  Tick:
    resources_left := resource_cap;
    \* line deleted here
    goto Tick;
end process;

This passes (1588 states).

Summary

In this chapter, we introduced concurrent specifications and how to model them in PlusCal. We observed that concurrent specifications run into a wide range of exciting new problems, such as race conditions and deadlocks, and we saw how to model external effects with processes.

Modeling concurrency is one of the best-known use cases for TLA+. We programmers are very good at reasoning about deterministic code and very bad at reasoning about concurrent systems, but the risks and dangers of the latter are so much higher. As we saw in our example, specification can be a powerful tool for safely managing concurrency.

In the next chapter, we will learn how to write temporal properties: invariants that apply to entire behaviors at once.
