Chapter 17

Concurrent copying & compaction

In this chapter we discuss approaches to defragmenting the heap concurrently with the mutator, relocating live objects either by concurrent copying or by concurrent compaction. Here we consider how the mark-compact approaches of Chapter 3 and the copying approaches of Chapter 4 extend to operate concurrently with the mutator.

We focus initially on collection techniques based on copying (evacuating or scavenging) reachable objects out of a fromspace into a tospace, after which the fromspace can be reclaimed. Recall that when scanning object fields the collector must convert all fromspace pointers to tospace pointers, replacing each fromspace pointer with the forwarding address of its fromspace target, copying the fromspace target the first time it is encountered.

Concurrent copying collectors must not only protect the collector against mutation but also protect the mutator against concurrent copying. Moreover, concurrent updates by the mutator must be propagated to the copies being constructed in tospace by the collector.

For copying collectors, a black mutator must by definition hold only tospace pointers. If it held fromspace pointers then the collector would never revisit and forward them, violating correctness. This is called the black mutator tospace invariant: the mutator operates at all times ahead of the wavefront in tospace. Similarly, a grey mutator must by definition hold only fromspace pointers at the beginning of the collector cycle. In the absence of a read barrier to forward a fromspace pointer to the tospace copy, the grey mutator cannot directly acquire tospace pointers from fromspace objects (since the copying collector does not forward pointers stored in fromspace objects). This is called the grey mutator fromspace invariant. Of course, for termination of a copying algorithm, all mutator threads must end the collection cycle holding only tospace pointers, so any copying collector that allows grey mutator threads to continue operating in fromspace must eventually switch them all over to tospace by forwarding their roots. Moreover, updates by the mutator in fromspace must also be reflected in tospace or else they will be lost.

17.1  Mostly-concurrent copying: Baker’s algorithm

Maintaining a tospace invariant for all mutator threads is perhaps the simplest approach to concurrent copying because it guarantees that the mutator threads never see objects that the collector is yet to copy, or is in the middle of copying. Establishing the tospace invariant in a mostly-concurrent world requires stopping all the mutator threads (atomically) to sample and forward their roots (copying their targets) at the beginning of the collection cycle. At this point, the now-black mutators contain only (grey) tospace pointers, but the (unscanned) grey targets will still contain fromspace pointers. Baker’s [1978] black mutator read barrier was first formulated for incremental collection to protect against a mutator acquiring one of these fromspace pointers, and subsequently extended by Halstead [1985] for concurrent copying. The read barrier has the effect of presenting the illusion to the mutator threads that the collection cycle has completed, by preventing them from crossing the collector wavefront boundary between tospace and fromspace.

Baker-style concurrent collection is illustrated in Algorithm 17.1, as a revision of the non-concurrent copying algorithm of Algorithm 4.2. Notice that the read barrier needs to trigger only when loading from a grey tospace object (ahead of the collector wavefront). Only then is the forward operation needed to ensure that the loaded reference is to a tospace object, copying any uncopied fromspace object as necessary. As specified here, synchronisation between mutator and collector is relatively coarse-grained (at the level of objects): the collector atomic block scans the next grey object, while the mutator atomic read barrier forwards any reference loaded from a grey object. The atomic blocks ensure that a mutator thread can never load a reference from an object that is in the middle of being scanned (to turn it from grey to black).

As presented in Algorithm 17.1, atomicity of the Read operation ensures that the mutator sees the correct state of the src object (grey or not) and the target object (forwarded or not), and allows the mutator to copy the target object if it is in fromspace without interfering with copying already in progress in the collector. Thus, the mutator’s atomic Read operation may incur overhead proportional to the size of the object being copied. It is possible to obtain finer-grained atomicity by synchronising each of these operations more carefully with the collector.

One approach is to allow finer-grained synchronisation using a work list holding field addresses rather than object references. A difficulty then is how to distinguish grey fields from black fields. The problem is ensuring that the wavefront is easily determined by the mutator. At the granularity of objects it is simple enough to set a grey bit in the header of each grey object, but for fields there is not usually a cheap place to store this information. However, with Cheney scanning the scan pointer can be advanced (atomically) as each field is scanned, so black fields lie behind the scan pointer and grey fields in front. In this case, the read barrier might look something like:

atomic Read(src, i):
    ref ← src[i]
    if ref ≠ null && scan ≤ &src[i]        /* src[i] is grey */
        ref ← forward(ref)
    return ref

Of course, this description leaves out all the machinery needed to advance the wavefront atomically through each of the fields. We will see techniques for achieving this finer-grained processing in Chapter 19, where minimising interruptions by the collector becomes important for real-time systems.

Mostly-concurrent, mostly-copying collection

Mostly-concurrent collection also naturally applies to mostly-copying collections. Recall that a mostly-copying collector must treat ambiguous roots conservatively, pinning all objects referenced by ambiguous roots. The collector is free to move the remaining objects not directly referenced by ambiguous roots. It is straightforward to use the brief stop-the-world phase of a mostly-concurrent collector to mark (and pin) all the objects referenced by the ambiguous roots in the mutator thread stacks and registers. At this point all the mutator threads are black, and a Baker-style read barrier will ensure that the mutator threads never subsequently acquire references to uncopied objects.

Algorithm 17.1: Mostly-concurrent copying

shared worklist ← empty

collect():
    atomic
        flip()
        for each fld in Roots
            process(fld)
    loop
        atomic
            if isEmpty(worklist)
                break                           /* exit loop */
            ref ← remove(worklist)
            scan(ref)

flip():
    fromspace, tospace ← tospace, fromspace
    free, top ← tospace, tospace + extent

scan(toRef):
    for each fld in Pointers(toRef)
        process(fld)

process(fld):
    fromRef ← *fld
    if fromRef ≠ null
        *fld ← forward(fromRef)                 /* update with tospace reference */

forward(fromRef):
    toRef ← forwardingAddress(fromRef)
    if toRef = null                             /* not copied (not marked) */
        toRef ← copy(fromRef)
    return toRef

copy(fromRef):
    toRef ← free
    free ← free + size(fromRef)
    if free > top
        error "Out of memory"
    move(fromRef, toRef)
    forwardingAddress(fromRef) ← toRef          /* mark */
    add(worklist, toRef)
    return toRef

atomic Read(src, i):
    ref ← src[i]
    if isGrey(src)
        ref ← forward(ref)
    return ref

DeTreville [1990] used this approach for concurrently collecting Modula-2+ and subsequently for Modula-3 [Cardelli et al, 1992], both systems-oriented programming languages whose compilers were not sophisticated enough to generate accurate stack maps. Also, because their compilers did not emit an explicit barrier for heap accesses, DeTreville applied an Appel et al [1988] read barrier to synchronise the mutator with the collector using virtual memory page protection. Detlefs [1990] used the same technique for C++, modifying the AT&T C++ compiler to derive automatically the accurate pointer maps for heap objects needed to allow copying of objects not referenced directly from ambiguous roots.

Subsequently, Hosking [2006] replaced use of coarse-grained virtual memory page protection as the read barrier mechanism with compiler-generated object-grained read barrier support. The motivation for this was the difficulty of managing page protections atomically in the presence of mutator threads that are preemptively scheduled by the operating system. Because the read barrier is needed only during the copying phase of collection, after all the mutator threads have been stopped to scan their ambiguous roots and make them black, it is possible to avoid expensive atomic instructions in the fast path of the barrier that checks if the source object is grey. Atomic operations are thus needed only to ensure atomicity of the forwarding operation.
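
The following sketch illustrates, in C11, the flavour of such an object-grained barrier: the fast path only tests a grey bit in the source object’s header with a plain load, while atomic operations are confined to installing the forwarding address. The layout, field names and allocate_tospace helper are illustrative assumptions, not taken from any particular collector.

    #include <stdatomic.h>
    #include <stddef.h>
    #include <string.h>

    typedef struct Object Object;
    struct Object {
        _Atomic(Object *) forward;   /* null until a tospace copy exists */
        int    grey;                 /* set while the object awaits scanning */
        size_t size;                 /* object size in bytes */
        Object *fields[];            /* pointer fields (simplified layout) */
    };

    extern Object *allocate_tospace(size_t size);   /* assumed bump allocator */

    static Object *forward_ref(Object *from) {
        if (from == NULL) return NULL;
        Object *to = atomic_load(&from->forward);
        if (to != NULL) return to;                   /* already copied */
        Object *copy = allocate_tospace(from->size); /* speculative copy */
        memcpy(copy, from, from->size);
        Object *expected = NULL;
        if (atomic_compare_exchange_strong(&from->forward, &expected, copy))
            return copy;                             /* we installed the copy */
        return expected;   /* lost the race: the speculative copy is simply wasted */
    }

    Object *read_barrier(Object *src, size_t i) {
        Object *ref = src->fields[i];
        if (src->grey)               /* fast path: plain load and unsynchronised test */
            ref = forward_ref(ref);  /* slow path: forward (and copy) as necessary */
        return ref;
    }

The point of the design is that the common case (source object not grey) executes only an ordinary load and branch; the compare-and-swap appears only on the copying path.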

17.2  Brooks’s indirection barrier

An alternative approach to requiring a tospace invariant is to allow the mutator to make progress without concern for the wavefront. Brooks [1984] observes that if every object (whether in fromspace or tospace) has a non-null forwarding pointer (either to its fromspace original or to its copy in tospace) then the test on the src object in the read barrier can be eliminated. A fromspace object that has not yet been copied will have an indirection field that points to itself. When copying an object, the fromspace indirection field is atomically updated to refer to the tospace copy. The tospace copy has an indirection field that points to itself. All heap accesses, both reads and writes, of pointers, non-pointers and mutable values in header words, now always require an unconditional dereference operation to follow any indirection pointer to the tospace copy if one exists. Thus, the Read barrier for the mutator is rewritten by Brooks as in Algorithm 17.2.

Now the only problem is that the read barrier can still read a field ahead of the wavefront that might refer to an uncopied fromspace object. Fortunately, the ubiquitous indirection field relaxes the need for the tospace invariant imposed by Baker so the mutator is allowed to operate grey and hold fromspace references. To ensure termination Brooks imposes a Dijkstra-style Write barrier to prevent the insertion of fromspace pointers behind the wavefront as in Algorithm 17.2.

Because mutator threads now operate grey, once copying is finished they need a final scan of their stacks to replace any remaining unforwarded references. The alternative, as performed in the original incremental Brooks collector, is simply to scan the thread stacks and registers of each mutator thread after each collector increment, in order to redirect any references they may hold to copied objects before they can resume.
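
A minimal C sketch of the Brooks scheme follows: every object carries an indirection field that points to itself until a tospace copy exists, after which it points to that copy, so every access dereferences it unconditionally and no ‘is this object grey?’ test is needed. The struct layout and the is_black/forward_ref helpers are illustrative assumptions.

    typedef struct BrooksObject BrooksObject;
    struct BrooksObject {
        BrooksObject *indirect;      /* self, or the tospace copy if one exists */
        BrooksObject *fields[8];     /* example pointer fields */
    };

    extern int in_black(BrooksObject *p);                 /* assumed: behind the wavefront */
    extern BrooksObject *forward_ref(BrooksObject *p);    /* assumed: copy/forward to tospace */

    BrooksObject *brooks_read(BrooksObject *src, int i) {
        src = src->indirect;          /* unconditional: follow the indirection */
        return src->fields[i];
    }

    void brooks_write(BrooksObject *src, int i, BrooksObject *ref) {
        src = src->indirect;
        if (in_black(src))            /* Dijkstra-style insertion barrier */
            ref = forward_ref(ref);   /* never install a fromspace pointer behind the wavefront */
        src->fields[i] = ref;
    }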

17.3  Self-erasing read barriers

Baker-style collectors require a read barrier to preserve their black mutator invariant. Read barriers are often considered more expensive than write barriers since reads are more prevalent than writes. Furthermore, read barriers are conditional: given a Read(src,i), they must test whether the reference loaded from src[i] refers to tospace and, if not, evacuate its target. Cheadle et al [2004] eliminate this test, and with it all overhead in accessing a black tospace object, for a Baker-style incremental copying collector in the Glasgow Haskell Compiler (GHC). The first word of every object (closure) in GHC points to its entry code: the code to execute (enter) when the closure is evaluated. They provide two versions of this code. In addition to the standard version, a second version will scavenge the closure before entering the standard code. Let us see how this scheme operates. When the collector is off, the entry-code word points to the standard, non-scavenging code. However, when an object is copied to tospace, this word is hijacked and set to point to the self-scavenging code. If the object, now in tospace, is entered, the self-scavenging code is executed first to copy the object’s children to tospace. Then the original value of the entry-code word is reinstated. Finally, the standard version of the code is entered. The beauty of this scheme is that if the closure is evaluated again in the future then its standard code will be entered unconditionally: the read barrier has been erased.

The cost of this scheme is some duplication of code: Cheadle et al found the overhead to be 25% over that of a stop-the-world copying collector. In later work [Cheadle et al, 2008] they applied this technique to flip method-table pointers in the Jikes RVM Java virtual machine. To do so they had to virtualise most accesses to an object (all method calls and accesses to fields, unless they are static or private). However, they were able to recoup some of this cost by having the run-time compiler inline aggressively.
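
The sketch below uses C function pointers to stand in for GHC closure entry code and shows the essence of the ‘self-erasing’ barrier. It simplifies considerably (for instance, it saves the standard entry code in a separate field); all names and the forward_ref helper are illustrative assumptions.

    typedef struct Closure Closure;
    typedef void (*EntryCode)(Closure *self);

    struct Closure {
        EntryCode enter;             /* entry-code word: hijacked while unscavenged */
        EntryCode standard_enter;    /* saved standard entry code (a simplification) */
        Closure  *children[4];
    };

    extern Closure *forward_ref(Closure *p);   /* assumed: copy/forward to tospace */

    static void self_scavenging_enter(Closure *self) {
        for (int i = 0; i < 4; i++)            /* scavenge: copy the children to tospace */
            if (self->children[i] != NULL)
                self->children[i] = forward_ref(self->children[i]);
        self->enter = self->standard_enter;    /* erase the barrier: reinstate standard code */
        self->standard_enter(self);            /* finally, run the standard code */
    }

    void enter_closure(Closure *c) {
        c->enter(c);                 /* unconditional: no read-barrier test here */
    }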

Algorithm 17.2: Brooks’s indirection barriers

atomic Read(src, i):
    src ← forwardingAddress(src)
    return src[i]

atomic Write(src, i, ref):
    src ← forwardingAddress(src)
    if isBlack(src)                     /* src is behind wavefront in tospace */
        ref ← forward(ref)
    src[i] ← ref

17.4  Replication copying

The Brooks indirection barrier imposes a time and space penalty. Following an indirection pointer adds (bounded) overhead to every mutator heap access (both reads and writes, pointers and non-pointers), and the indirection pointer adds an additional pointer word to the header of every object. It has the advantage of avoiding the need for Baker’s tospace invariant which forces the mutator to perform copying work when loading a fromspace reference from the heap, while preserving the essential property that accesses (both reads and writes) go to the tospace copy whenever one is present. This has the important result that heap updates are never lost because they occur either to the fromspace original before it is copied or to the tospace copy afterwards.1

Replication copying collectors [Nettles et al, 1992; Nettles and O’Toole, 1993] relax this requirement by allowing the mutator to continue operating against fromspace originals even while the collector is copying them to tospace. That is, the mutator threads obey a fromspace invariant, updating the fromspace objects directly, while a write barrier logs all updates to fromspace objects to record the differences that must still be applied to their tospace copies. In other words, replication copying collectors allow the state of the tospace copy to lag behind that of its fromspace original, so long as by the time the collector is finished copying, but before it can discard fromspace, all mutator updates have been applied from the log to the tospace copy and all mutator roots have been forwarded. Thus, the termination condition for collection is that the mutation log is empty, the mutator’s roots have all been scanned, and all of the objects in tospace have been scanned.

Concurrent replication copying requires synchronisation between the mutator and collector via the mutation log, and when updating the roots from the mutators. Thread-local buffers and work stealing techniques can minimise the synchronisation overhead when manipulating the mutation log [Azagury et al, 1999]. The collector must use the mutation log to ensure that all replicas reach a consistent state before the collection terminates. When the collector modifies a replica that has already been scanned it must rescan the replica to make sure that any object referenced as a result of the mutation is also replicated in tospace. Termination of the collector requires that each mutator thread be stopped to scan its roots. When there are no more objects to scan, the mutator log is empty, and no mutator has any remaining references to uncopied objects, then the collection cycle is finished. At this point all the mutator threads are stopped together briefly to switch them over to tospace by redirecting their roots.

The resulting algorithm imposes only short pauses to sample (and at the end redirect) the mutator roots: each mutator thread is stopped separately to scan its roots, with a brief stop-the-world phase at the end of the cycle to switch all the threads over to tospace.

The downside to replication copying is that every mutation of the heap, not just pointers, needs to be logged by the mutator threads. This imposes a much higher write barrier overhead than for traditional pointer-only write barriers, and the mutation log can become a bottleneck. For languages that discourage mutation, such as the functional language ML used by Nettles and O’Toole, this is less of an issue so performance does not suffer.
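
A sketch of the kind of logging write barrier this implies is shown below: the mutator updates the fromspace original directly and appends an entry to a thread-local mutation log, which the collector later replays against the tospace replica. The layout, log representation and flush_log_to_collector hook are illustrative assumptions; a real implementation would use growable buffers and handle non-pointer fields as well.

    #include <stddef.h>

    typedef struct Obj Obj;
    struct Obj { Obj *fields[8]; };

    typedef struct { Obj *obj; size_t slot; } LogEntry;

    #define LOG_CAPACITY 4096
    static _Thread_local LogEntry mutation_log[LOG_CAPACITY];
    static _Thread_local size_t   log_len;

    extern void flush_log_to_collector(LogEntry *entries, size_t n);  /* assumed hook */

    void replicating_write(Obj *obj, size_t slot, Obj *value) {
        obj->fields[slot] = value;             /* update the fromspace original */
        if (log_len == LOG_CAPACITY) {         /* hand a full buffer to the collector */
            flush_log_to_collector(mutation_log, log_len);
            log_len = 0;
        }
        mutation_log[log_len++] = (LogEntry){ obj, slot };  /* record what must be replayed */
    }

Using per-thread buffers keeps the common case synchronisation-free; synchronisation is paid only when a buffer is handed to the collector, as noted above.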

17.5  Multi-version copying

Nettles and O’Toole [1993] still require global stop-the-world synchronisation of the mutator threads to transition them to tospace. Their algorithm is not lock-free because no mutator can make progress while this transition occurs. Herlihy and Moss [1992] dispense with the need for a global transition. They adapt Halstead’s multiprocessor refinement of Baker’s [1978] algorithm, which divides the heap into multiple per-processor regions. Each processor has its own fromspace and tospace, and is responsible for evacuating into its own tospace any fromspace object it discovers while scanning. Halstead uses locking to handle races between processors that compete to copy the same object, and for updates to avoid writing to an object while it is being evacuated. He also retains global synchronisation to have all the processors perform the flip into their tospace before discarding their fromspace. To eliminate this global synchronisation, Herlihy and Moss decouple fromspace reclamation from the flip. They divide each processor region into a single tospace plus multiple (zero or more) fromspaces. As copying proceeds, multiple fromspace versions of an object can accumulate in different spaces. Only one of these versions is current while the rest are obsolete.

Each processor2 alternates between its mutator task and a scanning task that checks local variables and its tospace for pointers to fromspace versions. When such a pointer is found, the scanner locates the object’s current version. If that version is in a fromspace then it copies it to a new current version in its tospace (the old version is now obsolete).

In this way, the processors cooperate to move objects from fromspaces to tospaces, and to redirect reachable pointers to the tospaces. Each processor is responsible for scanning its own tospace for fromspace pointers, and for copying any fromspace object it finds (including objects in fromspaces of other processors) that does not have a current tospace copy in some processor. A processor can flip at any time during its mutator task (when its tospace is full and so long as it has sufficient free space to allocate a new tospace), but not in the middle of a scan. It cannot free its fromspaces until it can be sure no other processor holds references to any of its fromspace objects.

To manage versions, Herlihy and Moss maintain a forwarding pointer field next at all times in each object, so that each obsolete fromspace version refers to its next version, terminating at the current version which has a null forwarding pointer. When copying a fromspace object into its own tospace, a scanning processor atomically installs the tospace copy at the end of the version chain using CompareAndSwap, making it current. Thus, every mutator heap access must traverse to the end of the chain of versions before performing the access. Moreover, to preserve lock-freedom while ensuring that heap updates are not lost, every store into an object creates a new version of the object in the mutating processor’s tospace, using CompareAndSwap to make it current. Thus, scanning and copying require no global synchronisation, while preserving all mutator updates.
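
The following C11 sketch shows the core of this versioning discipline: the next field is null in the current version and points to a newer version otherwise; a store builds a fresh local tospace version and tries to append it to the chain with a compare-and-swap, retrying on failure. The struct layout and allocate_local_tospace are illustrative assumptions.

    #include <stdatomic.h>
    #include <stddef.h>
    #include <string.h>

    typedef struct Version Version;
    struct Version {
        _Atomic(Version *) next;     /* null while this version is current */
        size_t   size;
        Version *fields[];
    };

    extern Version *allocate_local_tospace(size_t size);   /* assumed allocator */

    static Version *current_version(Version *v) {
        Version *n;
        while ((n = atomic_load(&v->next)) != NULL)
            v = n;                   /* walk to the end of the version chain */
        return v;
    }

    void versioned_write(Version *obj, size_t i, Version *value) {
        for (;;) {
            Version *cur  = current_version(obj);
            Version *copy = allocate_local_tospace(cur->size);  /* new candidate version */
            memcpy(copy, cur, cur->size);
            atomic_store(&copy->next, NULL);
            copy->fields[i] = value;                            /* apply the update */
            Version *expected = NULL;
            if (atomic_compare_exchange_strong(&cur->next, &expected, copy))
                return;              /* the new version is now current */
            /* lost the race: another version was installed; retry from the chain end */
        }
    }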

A processor owning fromspaces (the owner) can discard them only if no other scanning processor (scanners) holds any of its fromspace pointers. A scan is clean with respect to a given owner if the scan completes without finding any pointers to versions in any of its fromspaces, otherwise it is dirty. A round is an interval during which every processor starts and completes a scan. A clean round is one in which every scan is clean and no processor executes a flip. After a processor executes a flip the resulting fromspace can be reclaimed after completion of a subsequent clean round.

An owner detects that another scanner has started and completed a scan using two atomic handshake bits, each written by one processor and read by the other. Initially, both bits agree. To start a flip, the owner creates a new tospace, marks the old tospace as a fromspace, and inverts its handshake bit. At the start of a scan, the scanner reads the owner’s handshake bit, performs the scan, and sets its handshake bit to the value read from the owner’s. Thus, the handshake bits will agree once the scanner has started and completed a scan in the interval since the owner’s bit was inverted.

However, an owner must detect that all processors have started and completed a scan, and every processor is symmetrically both an owner and a scanner, so the handshake bits are arranged into two arrays. An owner array contains the owner handshake bits, indexed by owner processor. A 2-dimensional scanner array contains the scanner handshake bits, with an element for each owner-scanner pair. Because a scan can complete with respect to multiple owners, the scanner must copy the entire owner array into a local array on each scan. At the end of the scan, the scanner must set its corresponding scanner bits to these previously saved values. An owner detects that the round is complete as soon as its owner bit agrees with the bits from all scanners. An owner cannot begin a new round until the current round is complete.

To detect whether a completed round was clean the processors share an array of dirty bits, indexed by processor. When an owner executes a flip, it sets the dirty bit for all other processors. Also, when a scanner finds a pointer into another processor’s fromspace it sets that processor’s dirty bit. If an owner’s dirty bit is clear at the end of a round then the round was clean, and it can reclaim its fromspaces. If not, then it simply clears its dirty bit and starts a new scan. By associating dirty bits with fromspaces rather than processor regions, and having scanners set the dirty bit for the target fromspace when they find a pointer, it is also possible to reclaim fromspaces individually rather than all at once.

Algorithm 17.3: Herlihy and Moss [1992] owner update in place

WriteLocal(a, next, i, v):
    seq ← (next + 1) % 2
    seq(a) ← seq                            $
    index(a) ← i
    value(a) ← v
    if CompareAndSet(&next(a), next, seq)
        scan(a[i])
        a[i] ← v
    else
        Write(a, i, v)

Herlihy and Moss prove safety and liveness for their algorithm, but they do not explore performance of an actual implementation. The liveness argument relies on the observation that if each processor always eventually scans then some processor always eventually reclaims its fromspaces. At worst, because each processor will eventually exhaust its free spaces, further flips will cease, and all processors will eventually focus on scanning until a clean round ensues. Of course, this resource exhaustion has the effect of causing blocking in the system as a whole.

Extensions to avoid copy-on-write

The novelty of this multi-versioning algorithm is that it is entirely lock-free. Its downside is the need to create a new version on every heap update, though this may be useful on a non-uniform memory architecture multiprocessor to improve locality. Herlihy and Moss consider several alternatives that avoid creating a new version on every update:

Update in place with CompareAndSwap2. The first extension assumes the availability of the CompareAndSwap2 operator, which allows both performing the update and ensuring that the forwarding pointer next remains null as a single atomic operation. Unfortunately, CompareAndSwap2 is not widely implemented on modern multiprocessors. Transactional memory might be a viable alternative; in fact, this algorithm inspired the later work of Herlihy and Moss on transactional memory.

Owner update in place. Another approach simply uses CompareAndSwap but it requires additional fields in the header of object a: seq(a) is a modulo two sequence number, index(a) is the index of the slot being updated and value(a) is the new value for that slot. Also, the forwarding pointer field next(a) is permitted to hold a sequence number, in addition to a pointer or null (this is easy enough to achieve by tagging the forwarding pointer field with a low bit to distinguish pointers to suitably aligned objects from a sequence number). There need only be two values for sequence numbers: if seq(a)=next(a) then the current update is installed, and otherwise it is ignored.

To perform a store using the full write barrier, a processor chains down the list of versions until it finds the current version (one with null or a sequence number stored in its next field). If the current version is local, then the processor performs the WriteLocal operation illustrated in Algorithm 17.3. This takes the current version a, the observed next field (either null or a sequence number), the index i of the slot to be modified, and the new value of the slot v. It then uses CompareAndSwap to install the new sequence number in the next field. If successful, then the processor performs a deletion barrier to scan any pointer overwritten by the store (this preserves the invariant that scanning has inspected every pointer written into tospace), before performing the store. Otherwise, the processor locates the newer version and retries the update by invoking the full write barrier. Having the owning processor update in place is well-suited to a non-uniform memory architecture where it is more efficient to update local objects.

If the object is remote then the new owner makes a local tospace copy as before, except that after making the copy, but before performing the store, it must check whether next(a) = seq(a). If they are equal, then it must first complete the pending update, performing the deletion barrier to scan the slot indicated by the index field and storing the value from the value field into that slot. The same action must be performed when the scanner evacuates an object into tospace. This ensures that any writes performed on the original object while it is being copied are linearised before writes performed to the copy.

Locking update in place. Finally, there is the alternative of giving up on lock-freedom and using CompareAndSwap to lock the object while it is updated. As before, only the owner of the current version may update in place. The owner locks an object by:

1.  using CompareAndSwap to install a distinguished locked value in the object’s next field;

2.  scanning the pointer (if any) being overwritten by the store;

3.  performing the update;

4.  scanning the pointer (if any) being stored; and

5.  unlocking the object by setting next back to null.

Since the owner is the only processor that updates the object in place, there is no need to synchronise with the scanner. The deletion barrier in step 2 ensures that pointers possibly seen by other processors will be scanned. The insertion barrier in step 4 ensures that if the object has already been scanned then the new pointer will not be mistakenly omitted.
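
The five steps above translate fairly directly into code. The sketch below uses a C11 compare-and-swap to install a distinguished LOCKED value in the next field; the Version layout, LOCKED encoding and scan_pointer hook are illustrative assumptions.

    #include <stdatomic.h>
    #include <stddef.h>

    typedef struct Version Version;
    struct Version {
        _Atomic(Version *) next;     /* null = current, LOCKED = being updated in place */
        Version *fields[8];
    };

    #define LOCKED ((Version *)1)    /* distinguished value, never a real object address */

    extern void scan_pointer(Version *p);   /* assumed: collector's scan of one pointer */

    int owner_write_in_place(Version *obj, size_t i, Version *value) {
        Version *expected = NULL;
        /* 1. lock: succeeds only if obj is still the current, unlocked version */
        if (!atomic_compare_exchange_strong(&obj->next, &expected, LOCKED))
            return 0;                         /* caller falls back to the full write barrier */
        scan_pointer(obj->fields[i]);         /* 2. deletion barrier on the overwritten pointer */
        obj->fields[i] = value;               /* 3. the update itself */
        scan_pointer(value);                  /* 4. insertion barrier on the stored pointer */
        atomic_store(&obj->next, NULL);       /* 5. unlock */
        return 1;
    }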

17.6  Sapphire

A problem with symmetric division of the heap into independently collected regions per processor as done by Halstead [1985] and Herlihy and Moss [1992] is that it ties the heap structure to the topology of the multiprocessor. Unfortunately, application heap structures and thread-level parallelism may not map so easily to this configuration. Moreover, one processor can become a bottleneck because it happens to own a particularly large or knotty portion of the heap, causing other processors to wait for it to complete its scan before they can discard their fromspaces, so they may end up stalling if they have no free space in which to allocate. It may be possible to steal free space from another processor, but this requires the ability to reconfigure the per-processor heap regions dynamically. These issues were discussed earlier in Chapter 14. Instead, non-parallel concurrent collectors place collector work asymmetrically on one or more dedicated collector threads, whose priority can easily be adjusted to achieve a balance of throughput between mutator and collector threads.

Algorithm 17.4: Sapphire phases

MarkCopy:
    Mark                    /* mark reachable objects */
    Allocate                /* allocate tospace shells */
    Copy                    /* copy fromspace contents into tospace shells */

Mark:
    PreMark                 /* install Mark phase write barrier */
    RootMark                /* blacken global variables */
    HeapMark/StackMark      /* process collector mark queue */

Flip:
    PreFlip                 /* install Flip phase write barrier */
    HeapFlip                /* flip all heap fromspace pointers to tospace */
    ThreadFlip              /* flip threads, one at a time */
    Reclaim                 /* reclaim fromspace */

Sapphire [Hudson and Moss, 2001, 2003] is a concurrent copying algorithm designed to work well in the presence of a large number of mutator threads on small- to medium-scale shared-memory multiprocessors. It extends previous concurrent replication copying algorithms to allow one thread at a time to flip from operating in fromspace, as opposed to having to stop them to transition them all at once over to tospace. This minimises the amount of time that any given application thread may need to block to support the collector. To cope with mutators operating in both fromspace and tospace at the same time, Sapphire requires that they update both the fromspace and tospace copies of an object, when both exist. Sapphire also requires that new objects created by the mutator be allocated black in a separate newspace in the heap that survives the current collection but is not scanned like tospace. This helps guarantee termination of the collection cycle, since fromspace (and hence tospace) are bounded in size. Because new objects are not scanned, they are treated as black (behind the wavefront), and subject to an insertion write barrier that scans and forwards any pointer as it is written.

Collector phases

Sapphire has two major groups of phases (outlined in Algorithm 17.4).

MarkCopy: The first group of phases marks the objects directly reachable from global variables and mutator thread stacks and registers and copies them into tospace. During these phases the mutators all read from the originals in fromspace, but also must mirror their writes to the tospace copies. The fromspace and tospace copies are kept loosely coherent by relying on the programming language memory model (in this case for Java [Manson et al, 2005; Gosling et al, 2015], but which should also apply to the forthcoming memory model for C++ [Boehm and Adve, 2008]) to avoid a barrier on reads of non-volatile fields (volatile fields require a read barrier). This means the updates to each copy need not be atomic or simultaneous. Rather, a Java application need only perceive that the values in the copies cohere at application-level synchronisation points. Any changes made by a mutator thread to fromspace copies between two synchronisation points will be propagated to the tospace copies before passing the second synchronisation point. If all threads are at synchronisation points, then the fromspace and tospace copies will be consistent with one another.3 This is important during the second group of phases, when mutators can observe both fromspace and tospace copies.

Algorithm 17.5: Sapphire pointer equality

(a) Fast path
pointerEQ(p, q):
    if p = q return true
    if q = null return false
    if p = null return false
    return flipPointerEQ(p, q)          /* called only during Flip phases */

(b) Flip phase slow path
flipPointerEQ(p, q):
    pp ← forward(p)
    qq ← forward(q)
    return pp = qq

(c) Pointer forwarding
forward(p):                             /* p is a non-null pointer */
    pp ← toAddress(p)                   /* pp is null if p is in tospace */
    if pp = null
        pp ← p
    return pp

Flip: The second group of phases forwards pointers in global variables and thread stacks and registers, flipping them one at a time into tospace. Unflipped mutator threads may hold references to both fromspace and tospace copies (even of the same object). Previous concurrent copying collectors impose a tospace invariant using a read barrier to redirect mutators out of fromspace [Baker, 1978], or impose a fromspace invariant while replicating and then flip all at once [Nettles and O’Toole, 1993]. Incremental flipping plus having no read barrier means that mutators may access both fromspace and tospace at the same time, which requires slightly tighter synchronisation of updates to both copies.

This also affects pointer equality, since the fromspace and tospace copies of the same object must appear to have the same reference at the language level. Every pointer equality operation must apply a barrier as illustrated in Algorithm 17.5a. Note that if either argument is statically null then the compiler can reduce the test to a simple p=q. During the Flip phases the test must also call the flipPointerEQ function (Algorithm 17.5b), which compares the forwarded pointers, similarly to the read barrier of Brooks [1984].

MarkCopy: Mark. The Mark phase marks every fromspace object reachable from the roots, both global variables and thread stacks/registers. The Sapphire design calls for the collector to perform all the marking, working from a queue. The mutator write barrier ensures that if the mutator stores (into a global variable or the heap) a reference to an unmarked fromspace object p then p is added to the queue. The collector first scans the global variables, enqueuing any reference it finds to an unmarked fromspace object. The collector then marks and scans the unmarked objects in the queue. When it removes a pointer to object p from the queue, if p is not yet marked then it marks p and scans its slots to enqueue any unmarked fromspace objects referred to by p.

When the collector finds the mark queue empty it scans each mutator, one at a time, stopping each mutator to scan its stacks and registers and enqueuing any reference it finds to an unmarked fromspace object. If the collector makes a pass through all the mutators without enqueuing any objects for marking then marking is complete; otherwise marking and scanning continue. Termination relies on the fact that the write barrier prevents retreating the marking wavefront, and that new objects are allocated black. Eventually all reachable fromspace objects will be marked and scanned.

The Mark phase has three steps. The PreMark step installs the Mark phase write barrier WriteMark, shown in Algorithm 17.6a. Mutators do not perform any marking directly, but rather enqueue objects for the collector to mark. Fromspace objects that are not marked are implicitly white. Objects in the mark queue are implicitly grey. This can be encoded using an ‘enqueued’ bit which also allows avoiding enqueuing an object more than once. Each mutator has its own queue, so enqueuing normally involves no synchronisation. When the collector scans a mutator’s stack it also empties that mutator’s queue by appending the mutator’s queue onto its own queue.
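
A sketch of such a per-mutator queue is given below: the Mark phase write barrier tests an ‘enqueued’ bit so an object is handed to the collector at most once, and each mutator keeps its own unsynchronised queue. The layout, queue representation and in_fromspace helper are illustrative assumptions, and overflow handling is elided.

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct Obj Obj;
    struct Obj {
        bool marked;     /* set by the collector */
        bool enqueued;   /* set by the mutator's barrier; avoids duplicate entries */
        Obj *fields[8];
    };

    #define QUEUE_CAPACITY 1024
    typedef struct {
        Obj   *items[QUEUE_CAPACITY];
        size_t len;
    } MarkQueue;

    static _Thread_local MarkQueue my_queue;          /* one queue per mutator thread */

    extern bool in_fromspace(Obj *p);                 /* assumed helper */

    void write_mark(Obj *p, size_t i, Obj *q) {
        p->fields[i] = q;
        if (q != NULL && in_fromspace(q) && !q->marked && !q->enqueued) {
            q->enqueued = true;                       /* white: queue it for the collector */
            my_queue.items[my_queue.len++] = q;       /* thread-local, so no synchronisation */
        }
    }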

The RootMark step scans and blackens the global variables by enqueuing their unmarked targets for the collector to mark using WriteMark. Stores into newly-allocated objects, including initialising stores, invoke the write barrier, so newly-allocated objects are treated as black. Mutator stacks and registers are still grey.

Finally, the HeapMark/StackMark step processes the collector’s mark queue, a separate set of explicitly grey (marked) objects, and the thread stacks. For each reference in the mark queue, the collector checks if it is already marked. If not, the collector marks the object and enters it into the explicit grey set for scanning (otherwise the already marked object is ignored). Each explicitly grey source object is scanned to blacken its slots by enqueuing their unmarked target objects for the collector to mark using WriteMark, and then the source object is considered to be black, as noted by the fact that the object is marked but not in the explicit grey set. The collector iterates until the mark queue and the explicit grey set are both empty. (An object can be enqueued for marking more than once, but eventually it will be marked and no longer enqueued by the mutators.)

Whenever the mark queue and grey sets are both empty, the collector scans a mutator stack by briefly stopping the mutator thread at a safe point (which cannot be in the middle of a write barrier), scanning the thread’s stack and registers to blacken them by enqueuing every unmarked root using WriteMark. Once the collector has scanned every thread’s stack without finding any white pointers or enqueued objects, and the mark queue and grey set are empty, there can be no white pointers in the thread stacks, global variables, or newly-allocated objects. They are now all black. The termination argument for this phase relies on the write barrier to keep globals and newly-allocated objects black. The write barrier also prevents mutators from writing white references into the heap. A mutator can obtain a white pointer only from a (reachable) grey or white object. Because there were no grey objects since the mutator threads were scanned, it cannot obtain a white pointer from a grey object, so it can only obtain a white pointer from another white object. But, because the mutator had no white references when it was scanned, it has no way to acquire a first white reference, so it cannot obtain any white references after the scan. This applies to all mutators, so the thread stacks must all be black.

Algorithm 17.6: Sapphire write barriers

(a) The Mark phase barrier
WriteMark(p, i, q):
    p[i] ← q
    if isFromSpace(q) && not marked(q)      /* white */
        enqueue(q)                          /* collector will mark later */

(b) The Copy phase barrier
WriteCopy(p, i, q):
    p[i] ← q                                $
    pp ← toAddress(p)                       $
    if pp ≠ null                            /* p is in fromspace */
        q ← forward(q)                      /* omit this for non-pointer q */
        pp[i] ← q                           $

(c) The Flip phase barrier
WriteFlip(p, i, q):
    q ← forward(q)                          /* omit this for non-pointer q */
    p[i] ← q
    pp ← toAddress(p)
    if pp ≠ null                            /* p is in fromspace */
        pp[i] ← q
        return
    pp ← fromAddress(p)
    if pp ≠ null                            /* p is in tospace */
        pp[i] ← q
        return

Note that only mutators active since their last scan during this Mark phase need to be rescanned. Similarly, only stack frames active since their last scan within this Mark phase need to be rescanned.

At this point, all white fromspace objects are unreachable.

MarkCopy: Allocate. Once the Mark phase has determined the set of reachable fromspace objects, the collector allocates an empty tospace copy ‘shell’ for each marked fromspace object. It sets a forwarding pointer in the fromspace object to refer to the tospace copy, and builds a hash table for the reverse mapping from each tospace copy to its fromspace copy. This is needed because a mutator thread that has been flipped to tospace still needs to update fromspace copies whenever other threads are still operating in fromspace.

MarkCopy: Copy. Once every marked fromspace object has a tospace copy and forwarding pointer installed, the collector begins copying the contents of fromspace objects to their tospace shells. This phase maintains the invariant that tospace objects refer only to other tospace objects by using a new Copy phase write barrier WriteCopy, which keeps both the fromspace and tospace copies up to date with respect to all writes, and makes sure to store only tospace pointers into tospace objects, as shown in Algorithm 17.6b. The mutators are all still operating unflipped in fromspace, so the barrier acts only to propagate all updates of fromspace objects to their tospace copies. Here, toAddress is the same as forwardingAddress from previous copying schemes, returning null for tospace objects. Thus, the forward operation in Sapphire will convert a fromspace pointer to its corresponding tospace pointer, but act as the identity operation on non-fromspace pointers. The memory accesses performed by this barrier (marked with $) must execute in the specified order, but otherwise the barrier is unsynchronised because Sapphire assumes no mutator-mutator data races on non-volatiles.

Algorithm 17.7: Sapphire word copying procedure

copyWord(p, q):
    for i ← 1 to MAX_RETRY do
        toValue ← *p                            $
        toValue ← forward(toValue)              /* omit this for non-pointers */
        *q ← toValue                            $
        fromValue ← *p                          $
        if toValue = fromValue
            return
    LoadLinked(q)                               $
    toValue ← *p                                $
    toValue ← forward(toValue)                  /* omit this for non-pointers */
    StoreConditionally(q, toValue)              /* assuming no spurious failure */ $

copyWordSafe(p, q):
    for i ← 1 to MAX_RETRY do ...               /* as in copyWord */
    loop
        LoadLinked(q)                           $
        toValue ← *p                            $
        toValue ← forward(toValue)              /* omit this for non-pointers */
        if StoreConditionally(q, toValue)       $
            return                              /* SC succeeded */

The Copy phase comprises the following steps. The Pre-Copy step installs the Copy phase write barrier WriteCopy (Algorithm 17.6b). The Copy step copies the contents of each black (marked and scanned) fromspace object into its tospace copy. To cope with concurrent updates by mutators while it is copying object contents, the collector uses lock-free synchronisation to resolve any race that occurs, as shown in Algorithm 17.7. This tries copying the value of a word from one location p to another location q without synchronisation primitives, retrying (up to some limit) when the ‘from’ location p is observed to change, and then resorts to using a combination of the LoadLinked/StoreConditionally (LL/SC) primitives to effect the copy (so long as no update occurs to the ‘to’ location q, otherwise the mutator write barrier ensures both copies are updated in which case the collector need not). Again, memory accesses (marked with $) must execute in the specified order. Note that the algorithm as specified here using LoadLinked/StoreConditionally assumes that SC fails only in the case that the mutator really did update the ‘from’ location p. Unfortunately, current hardware does not provide such guarantees (SC can fail spuriously). Thus, LoadLinked/StoreConditionally cannot ensure that the collector will make progress in the face of repeated updates to the ‘from’ location.4 In practice, the copyWord loop must be re-coded defensively as shown in copyWordSafe, but the collector will fail to make progress if the mutator repeatedly updates the ‘from’ location p between every invocation of LoadLinked and StoreConditionally.
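
On stock hardware without LoadLinked/StoreConditionally, a rough approximation of the copyWordSafe retry loop can be built from a C11 compare-and-swap, as sketched below. This is only an approximation: unlike LL/SC, a CAS cannot detect an intervening write of the same value to the ‘to’ slot, and the $-ordering constraints of Algorithm 17.7 are covered here only by using sequentially consistent atomics. The forward_word helper is an illustrative assumption.

    #include <stdatomic.h>
    #include <stdint.h>

    extern uintptr_t forward_word(uintptr_t v);  /* assumed: forwards pointers,
                                                    identity on non-pointer words */

    void copy_word_safe(_Atomic uintptr_t *p, _Atomic uintptr_t *q) {
        for (;;) {
            uintptr_t old = atomic_load(q);                  /* stands in for LoadLinked(q) */
            uintptr_t to  = forward_word(atomic_load(p));    /* read and forward *p */
            if (atomic_compare_exchange_strong(q, &old, to)) /* stands in for SC(q) */
                return;
            /* CAS failed: the mutator's write barrier updated *q; retry so that the
             * value finally written to *q reflects the latest value of *p */
        }
    }

As with copyWordSafe itself, this loop cannot guarantee collector progress if the mutator repeatedly updates the slot between the load and the compare-and-swap.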

Flip. Again, the phase operates in several steps. Beginning in this phase, unflipped mutators can operate in both fromspace and tospace. The PreFlip step installs the Flip phase write barrier WriteFlip to cope with this (Algorithm 17.6c). The HeapFlip step then processes references held in global variables and newspace objects, flipping all fromspace pointers to tospace. WriteFlip guarantees not to undo this work by ensuring that only tospace pointers are written into global variables and newspace. The ThreadFlip step then flips each mutator thread, one at a time: it stops the thread, flips any fromspace pointers in its stacks and registers over to tospace, and restarts the thread. During this phase, all mutators still need to update both fromspace and tospace copies. Thus, flipped threads need to be able to map from tospace copies back to fromspace copies (using fromAddress analogously to toAddress). Finally, once all threads are flipped and no thread is still executing WriteFlip, the Reclaim step reclaims fromspace and discards the reverse mapping table from tospace back to fromspace.

Since unflipped threads may access both fromspace and tospace copies of the same object, the pointer equality test needs to compare the tospace pointers (Algorithm 17.5b).

Merging phases

Sapphire allows some phases to be merged. For example, RootMark, HeapMark/StackMark, Allocate, and Copy can be merged into a single Replicate phase, which combines WriteMark and WriteCopy into a single Replicate write barrier: when writing a fromspace pointer into a tospace slot, in addition to enqueuing the fromspace object for copying, the write barrier also enqueues the tospace slot so that it can be fixed later by the collector.

Volatile fields

Java volatile fields require a physical memory access for each source code access, and accesses must appear to be sequentially consistent. For this reason, volatile fields require heavier synchronisation on mutator access and while the collector is copying them to ensure that their copies are kept properly coherent. Hudson and Moss describe several techniques for achieving this, each of which imposes substantial additional overhead for accessing volatile fields.

In summary, Sapphire extends previous concurrent copying algorithms, and has much in common with replication schemes. It permits one thread at a time to flip from fromspace to tospace rather than all at once, and minimises thread blocking (pauses) while avoiding a read barrier for non-volatile fields. Mutators simply update both the fromspace and tospace copies of an object (when both exist) to keep them coherent.

17.7  Concurrent compaction

Chapter 3 discussed approaches to garbage collection that split into two phases, marking and compacting. Recall that compaction is decoupled from the tracing phase that determines reachable objects. This allows greater freedom than a copying collector over the order in which objects are relocated (by address, say, rather than in order of tracing for reachability).

Compressor

Compressor [Kermany and Petrank, 2006], presented earlier in Section 3.4 and Section 14.8, exploits the freedom allowed by separating marking from copying to perform compaction concurrently with the mutator threads.

Recall that Compressor first computes an auxiliary first-object table that maps a tospace page to the first fromspace object that will be moved into the page. Parallel compactor threads then race to claim an unmapped tospace virtual page, map it to a physical page, populate it with its copies from fromspace pages, and redirect each pointer field in the copies to refer to tospace. Once all the live objects in a fromspace page have been copied it is immediately unmapped.

To enable concurrent compaction, Compressor exploits virtual memory page protection primitives similarly to Appel et al [1988], where protection served as the read barrier for concurrent copying collection in order to prevent the mutator from accessing tospace pages whose objects have not yet been copied or contain unforwarded pointers. Ossia et al [2004] also used protection simply to allow concurrent forwarding of pointers in pages containing compacted objects, but Compressor drives both compaction and forwarding using page protection. Compressor protects the tospace pages from read and write access (without yet mapping them to physical pages). Computing the first-object table and protecting tospace occurs concurrently with mutator threads operating in fromspace. Compressor then briefly stops all the mutator threads to switch their roots to refer to tospace addresses before releasing the threads. Of course, the contents of those pages have not yet been copied. At this point, if a mutator accesses a protected tospace page it will trap. Handling the trap requires doing the work of compaction to map and populate the page with its copies (the mutator performs incremental compaction work as if it was a compactor thread), and forward the references in those copies, before the mutator can resume and access the page. Note that only the data for this page is copied, thus the handler will not evacuate the beginning of an object that starts on the previous page or the end of one that continues onto the next page. Concurrent compaction requires that a compactor thread be able to access the page while other mutator threads are still protected from access. To support this, Compressor double-maps each physical page when its contents are to be copied, once in its ‘natural’ (still-protected) tospace virtual page, and again in an unprotected virtual page private to the compactor thread (see also Section 11.10). Once the compaction work has been done for that page, the tospace virtual page can be unprotected so mutators can proceed, and the private mapping is discarded.
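
The double-mapping trick itself can be illustrated with POSIX shared memory: the same physical page is mapped at two virtual addresses, one ‘natural’ tospace view that stays protected against mutators and one private, unprotected view through which the compactor writes the copied objects. The sketch below is illustrative only (error handling is elided; the shared-memory name, page size and access pattern are assumptions), but every call is a standard POSIX interface.

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define PAGE 4096u

    int main(void) {
        int fd = shm_open("/compressor-demo", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, PAGE);

        /* 'natural' tospace view: protected, so any mutator access traps */
        char *tospace = mmap(NULL, PAGE, PROT_NONE, MAP_SHARED, fd, 0);

        /* compactor's private view of the same physical page: fully accessible */
        char *window = mmap(NULL, PAGE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        window[0] = 42;                    /* 'copy' an object through the private window */

        /* once the page is populated and its pointers forwarded, unprotect the
         * natural view so mutators may resume access */
        mprotect(tospace, PAGE, PROT_READ | PROT_WRITE);

        char value = tospace[0];           /* mutators now see the copied contents */
        (void)value;

        munmap(window, PAGE);              /* discard the private mapping */
        munmap(tospace, PAGE);
        shm_unlink("/compressor-demo");
        close(fd);
        return 0;
    }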

In essence, Compressor applies the standard tricolour invariant. Fromspace pages are white, protected tospace pages are grey, and unprotected tospace pages are black. Initially, the mutator threads operate grey in fromspace while the first-object table is computed along with the tospace addresses. When the mutator threads are flipped over to tospace they are black. The protection-driven double mapping read barrier prevents the black mutator threads from acquiring stale fromspace references from grey pages that are still in the process of being populated with their fromspace copies.

Compressor must also handle other aspects of the tricolour invariant. In particular, after marking and before the task of determining the first-object table begins, mutators must allocate all new objects in tospace, to prevent those allocations from interfering with the relocation map (otherwise, allocating to a hole in fromspace would interfere). Moreover, these newly allocated objects must eventually have their pointer fields scanned after the mutators flip to tospace, to redirect any stale fromspace references in those fields over to tospace, and similarly for global roots. Thus, both newly allocated tospace objects and the global roots must be protected from access by mutators, with traps on their pages forcing scanning to redirect their pointers.

Because the performance of Compressor depends heavily on the cost of virtual memory page mapping and protection primitives, which can be onerous [Hosking and Moss, 1993], it is important to batch these operations as much as possible. For example, Compressor actually protects and double-maps the entire tospace at the beginning of collection (to avoid the cost of double mapping each page as it is processed). Similarly, Compressor moves eight virtual pages per trap (to better amortise the trap overhead on mutator threads).

One downside of Compressor is that when a mutator traps on access to a protected tospace page then it must not only copy all of that page’s objects, it must also forward all the pointers in those objects to refer to their relocated (or soon to be relocated) targets. This can impose significant pauses on the mutator. In a moment, we will discuss the Pauseless collector, which reduces the amount of incremental work needed to be performed by a mutator to copying at most one object (without needing to forward any of the stale fromspace references it contains). Before doing so, let us briefly review the way in which Compressor drives compaction using page protection, as illustrated in Figure 17.1. The figures show the logical grouping of virtual pages into distinct categories (the linear address-ordered layout of the heap is intentionally not represented):

Live: pages containing (mostly) live objects (initially dark grey in the figures)

Condemned: pages containing some live objects, but mostly dead, which are good candidates for compaction (light grey in the figures, with dark grey live objects)

Free: pages currently free but available for allocation (dashed borders)

New Live: pages in which space has been allocated for relocated live objects that have not yet been copied into it (dashed borders, with dashed space allocated for copies)

Dead: unmapped pages that can be recycled (freed for allocation) once there are no pointers to them (shown hatched in the figures)

Figure 17.1a illustrates the initial state in which live objects have been identified along with those to be relocated. For ease of later comparison with Pauseless, we take the liberty here to restrict compaction only to pages sparsely occupied by live objects. In Compressor, live tospace pages containing stale references that need forwarding, and tospace pages into which objects are yet to be relocated, must first be protected to prevent the mutators from accessing them. Concurrently with the mutators, the forwarding information for the live objects is prepared on the side in auxiliary data structures. At this point, the heap pages are configured as in Figure 17.1b, and the mutator roots are all flipped over to refer only to the protected tospace pages. Compaction can now proceed concurrently with the mutators, which will trap if they try to access an unprocessed tospace page. Trapping on a live tospace page causes all of the references in that page to be forwarded to refer to tospace, as in Figure 17.1c. Trapping on a reserved tospace page evacuates objects from condemned fromspace pages to fill the page, and the references contained in these copied objects are forwarded to refer to tospace (Figure 17.1d). When all the live objects in a condemned fromspace page have been evacuated, it is completely dead and its physical page can be unmapped and returned to the operating system, though its virtual page cannot be recycled until all references to it have been forwarded. Compaction ceases when all tospace pages have been processed and unprotected (Figure 17.1e). We now contrast this approach with the Pauseless collector.


Figure 17.1: Compressor

Pauseless

The Pauseless collector [Click et al, 2005; Azul, 2008], and its generational extension C4 [Tene et al, 2011], protects fromspace pages that contain objects being moved, instead of protecting tospace pages containing moved objects and/or stale pointers. Rather than needing to protect all of the tospace pages like Compressor, Pauseless protects the much smaller set of pages whose objects are actually being moved (focusing on sparsely populated pages that will yield most space), and these pages can be protected incrementally. Pauseless uses a read barrier to intercept and repair stale fromspace references before the mutator can use them, and avoids blocking the mutator to fix up entire pages. The initial implementation of Pauseless used proprietary hardware to implement the read barrier directly as a special load-reference instruction, but on stock hardware Pauseless compiles the necessary logic inline with every load-reference operation by the mutator.

Hardware and operating system support.

Azul’s proprietary hardware supports a number of fast user-mode trap handlers. The hardware translation lookaside buffer supports an additional GC-mode privilege level, in addition to the usual user and kernel modes. Several of the fast user-mode traps switch to GC-mode in the trap handler. The translation lookaside buffer also supports large (one or two megabyte) pages, much larger than the usual page size of standard processors. These large pages are the standard unit of work for the Pauseless collector.

The hardware also supports a fast co-operative preemption mechanism via interrupts that are taken only on user-selected instructions, corresponding to GC-safe points. Blocked threads are already at GC-safe points. Running threads can quickly be brought to a GC-safe point and allowed to continue, without idling them. This permits a very fast checkpoint operation, where mutators can be asked to quickly perform a small amount of GC-related work and then carry on, while blocked threads can have the work performed on their behalf by the collector. In contrast, a normal stop-the-world operation requires that all mutators reach a GC-safe point before they can proceed. In a checkpoint, running threads are never idled, and the GC work is spread out in time.

As mentioned earlier, Azul’s proprietary hardware supports a hardware read barrier, which performs a number of checks and actions depending on the particular phase of the collector. The read barrier, sketched in Algorithm 17.8, executes the load first, followed by the barrier logic, which cycles the loaded reference through the translation lookaside buffer as if it were an address. If this address corresponds to a GC-protected page then a fast user-mode GC-trap handler is invoked. The barrier ignores null references. Unlike a Brooks-style indirection barrier there is no null check, no memory access, no load-use penalty, no forwarding word in the object header and no cache footprint imposed by this mechanism.

Pauseless also steals one address bit from the 64-bit address space. The hardware ignores (strips) this bit in loads and stores. This bit is called the Not-Marked-Through (NMT) bit and is used during the concurrent marking phase of the collector to decide whether the reference has previously been scanned by the collector. The hardware maintains a desired value for the Not-Marked-Through bit and will trap to the Not-Marked-Through-trap handler if the reference has the wrong flavour. Null references are also ignored here.

On standard hardware, the read barrier must be emulated at some cost. The GC protection check is emulated with standard page protection, and the read barrier is emulated either by issuing a dead load instruction or by explicitly checking a side table in software for the need to trap. The Not-Marked-Through check is emulated by multi-mapping memory and changing page protections to reflect the expected Not-Marked-Through bit value.

Algorithm 17.8: Pauseless read barrier

    Read(src, i):
        ref ← src[i]
        if protected(ref)
            ref ← GCtrap(ref, &src[i])
        return ref

    GCtrap(oldRef, addr):
        newRef ← forward(oldRef)          /* forward/copy as necessary */
        mark(newRef)                      /* mark as necessary */
        loop                              /* will repeat only if CAS fails spuriously */
            if oldRef = CompareAndSwap(addr, oldRef, newRef)
                return newRef             /* CAS succeeded, so we are done */
            if oldRef ≠ *addr
                return newRef             /* another thread updated addr, but newRef is still correct */

Null references are quite common, so they must be filtered explicitly, though the compiler can often fold this test into the existing null pointer safety checks required by languages like Java. Stripping the Not-Marked-Through bit in software can be achieved by having the compiler modify all dereferences to strip it before use, and by reusing the stripped reference wherever the reuse does not cross a GC-safe point. Alternatively, the operating system can be modified to multi-map memory or alias address ranges so that the Not-Marked-Through bit is effectively ignored.
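Putting these pieces together, a software emulation of the read barrier might look roughly like the following C sketch. The choice of stolen address bit, the page size, and the helpers expected_nmt, page_is_gc_protected and gc_trap are assumptions made for the sketch, not Azul’s implementation.

    #include <stdint.h>
    #include <stdbool.h>

    #define NMT_MASK   ((uintptr_t)1 << 62)   /* illustrative choice of the stolen address bit */
    #define PAGE_SHIFT 21                     /* two-megabyte pages */

    extern uintptr_t expected_nmt;                         /* 0 or NMT_MASK: the desired flavour this cycle */
    extern bool      page_is_gc_protected(uintptr_t page); /* software side table of GC-protected pages */
    extern void     *gc_trap(void *ref, void **slot);      /* slow path: forward/mark, then heal the slot */

    static inline void *read_barrier(void **slot) {
        void *ref = *slot;
        if (ref == NULL)                      /* null references never trap */
            return NULL;
        uintptr_t bits = (uintptr_t)ref;
        if ((bits & NMT_MASK) != expected_nmt ||
            page_is_gc_protected((bits & ~NMT_MASK) >> PAGE_SHIFT))
            ref = gc_trap(ref, slot);         /* self-healing slow path */
        return (void *)((uintptr_t)ref & ~NMT_MASK);   /* strip the NMT bit before use */
    }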

The Pauseless garbage collection phases. The Pauseless collector is divided into three main phases, each of which is fully parallel and concurrent:

Mark is responsible for periodically refreshing the mark bits. In the process of doing that it will set the Not-Marked-Through bit for all references to the desired value and gather liveness statistics for each page. The marker starts from the roots (static global variables and mutator stacks) and begins marking reachable objects. The Not-Marked-Through bit assists in making the mark phase fully concurrent, as described further below.

Relocate uses the most recently available mark bits to find sparse pages with little live data, to compact those pages (relocating their objects), and to free their physical backing memory.

The relocate phase starts by protecting sparsely occupied pages from mutator access and then copies live objects out of those pages. Forwarding information maintained on the side tracks the locations of relocated objects. If a mutator loads a reference to a protected page the read barrier will trigger a GC-trap, which changes the stale protected-page reference to the correctly forwarded reference. After the page contents have been relocated, the relocate phase frees the physical memory, which can be immediately recycled by the operating system. Virtual memory cannot be freed until no more stale references to that page remain in the heap.

A relocate phase runs continuously, freeing memory to keep pace with mutator allocation. It runs standalone or concurrently with the next mark phase.

Remap updates every pointer in the heap whose target has been relocated.

Collector threads traverse the object graph executing a read barrier against every reference in the heap, forwarding stale references as if a mutator had trapped on the reference. At the end of this phase no live heap reference can refer to pages protected by the previous relocate phase, so virtual memory for those pages is freed.

Since both the remap and mark phases traverse all live objects, Pauseless is able to fold them together. The remap phase for the current GC cycle runs concurrently with the mark phase for the next GC cycle.

Pauseless has several qualitative advantages. Firstly, there is no ‘rush’ to finish any given phase. No phase imposes a substantial burden on the mutators that needs to be relieved by ending the phase quickly. There is no ‘race’ to finish some phase before collection can begin again — the Relocate phase runs continuously and can immediately free memory at any point. Since all phases are parallel, the collector can keep up with any number of mutator threads simply by adding more collector threads. Unlike other concurrent marking collectors, marking is guaranteed to complete in a single pass regardless of the mutation rate (there is no need to re-mark — revert to grey — previously marked objects, or stop the mutators in a final mark step to ensure termination). Collector threads will compete with mutator threads for CPU time, though any spare CPU can be employed by the collector.

Secondly, the phases incorporate a ‘self-healing’ effect: mutators immediately correct the cause of each read barrier trap by replacing the trapping reference, in the slot from which it was loaded, with an updated reference that will not trigger another trap. The work involved depends on the type of the trap. Once the mutators’ working sets have been repaired, they can execute at full speed without further traps. The result is a drop in mutator utilisation for a short period (a ‘trap storm’) following a phase shift, with a minimum mutator utilisation penalty of approximately 20 milliseconds spread over a few hundred milliseconds. But Pauseless has no stop-the-world pauses in which all threads must be stopped simultaneously. We now discuss the phases in more detail.

Mark. The mark phase manipulates mark bits managed on the side. It begins by clearing the current cycle’s mark bits. Each object has two mark bits, one for the current cycle and one for the previous cycle. The mark phase then marks all global references, scans each mutator thread’s root set, and flips the per-thread expected Not-Marked-Through value. Running threads cooperate by marking their own root sets at a checkpoint. Blocked (or stalled) threads have their root sets marked in parallel by mark-phase collector threads. Each mutator thread can proceed as soon as its root set has been marked (and its expected Not-Marked-Through value flipped), but the mark phase cannot proceed until all threads have passed the checkpoint.

After the root sets have been marked, marking proceeds in parallel and concurrently with the mutators in the style of Flood et al [2001]. The markers ignore the Not-Marked-Through bit, which is used only by the mutators. This continues until all live objects have been marked. New objects are allocated in live pages. Because mutators can hold (and thus store) only marked-through references, the initial state of the mark bit for new objects does not matter for marking.

The Not-Marked-Through bit is crucial to completing the mark phase in a single pass over the live objects, regardless of stores by the mutator, because the read barrier prevents mutators from acquiring unmarked references. A mutator that loads a reference with the wrong flavour of Not-Marked-Through bit takes a Not-Marked-Through-trap, which communicates the reference to the marker threads. Because it can never acquire an unmarked reference, a mutator can never store and propagate an unmarked reference. The Not-Marked-Through-trap also stores the corrected (marked) reference back to memory, so that particular reference can never cause a trap in the future. This self-healing effect means that a phase change will not make the mutators wait while the marker threads flip the Not-Marked-Through bits in the objects on which the mutator is working. Instead, each mutator flips each reference it encounters as it runs. Steady-state Not-Marked-Through-traps are rare.

The mutator must take care that updating the trapping reference does not clobber a store to the same location by another thread since the Not-Marked-Through-trap occurred. Thus, the trap handler uses a CompareAndSwap operation to update the memory only if it has not changed since the trapping thread read from that location. Because a checkpoint is used to initiate the mark phase, different threads may briefly have a different view of the desired Not-Marked-Through value. It is possible for two threads to compete repeatedly over a single reference’s Not-Marked-Through value via trapping and updating in memory. This can last only until the unflipped thread passes its next GC-safe point where it will trap, mark through its stack, and cross the checkpoint.

Note that it is not possible for a single thread to hold the same reference twice in its root set with different Not-Marked-Through settings, so pointer equality can always be implemented using bit-wise equality.

Termination of the marking phase need only concern itself with the window between a mutator loading an unmarked reference and executing the read barrier on it. Read barriers never span a GC-safe point, so it is sufficient that all the mutators cross a GC-safe point without trapping. Thus, the marking phase requests an empty checkpoint. Any references discovered before the checkpoint will be marked as normal. When all mutators have passed the checkpoint without reporting a new reference for marking, the mark phase is complete. Otherwise, the marker threads will consume the new references for marking and the checkpoint can be repeated. Because no new references can be created with the wrong Not-Marked-Through bit, this process must eventually terminate.
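This termination protocol amounts to a simple loop, sketched below in C. The routine names (drain_mark_worklist, request_empty_checkpoint and so on) are hypothetical stand-ins for the marker’s work list and checkpoint machinery.

    #include <stdbool.h>

    extern void drain_mark_worklist(void);        /* mark everything queued so far */
    extern void request_empty_checkpoint(void);   /* every mutator merely crosses a GC-safe point */
    extern bool new_references_reported(void);    /* did any mutator trap during the checkpoint? */
    extern bool mark_worklist_is_empty(void);

    void mark_phase_terminate(void) {
        for (;;) {
            drain_mark_worklist();
            request_empty_checkpoint();
            if (!new_references_reported() && mark_worklist_is_empty())
                return;                           /* no mutator trapped: marking is complete */
            /* otherwise consume the newly reported references and try again */
        }
    }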

Relocate. The relocate phase starts by finding sparsely occupied pages. Figure 17.2a shows a logical grouping of virtual pages into distinct categories (again, the linear address-ordered layout of the heap is intentionally not illustrated). There are references from both the mutator roots and live pages into sparse pages whose live objects are to be compacted by evacuation. The relocate phase first builds side arrays to hold forwarding pointers for the objects to be relocated. These cannot be held in the fromspace originals because the physical storage for the fromspace pages will be reclaimed immediately after copying and long before all the fromspace references have been forwarded. The side array of forwarding data is not large because only sparse pages are relocated, so it can be implemented easily as a hash table. The relocate phase then protects the mostly dead condemned pages from access by the mutators as in Figure 17.2b. Objects in these pages are now considered stale, and can no longer be modified. Also, if a mutator loads a reference that points into a protected page the read barrier will now take a GC-trap.
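As a rough illustration of such a side structure, the following C sketch keeps a small open-addressed hash table mapping fromspace addresses to their tospace copies, using compare-and-swap so that racing copiers agree on a single winner. The table size, hash function and interface are assumptions made for the sketch, not the Pauseless data structure.

    #include <stdatomic.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Only objects on the few condemned sparse pages appear here, so a small
     * table suffices; it is assumed never to fill. */
    #define FWD_SLOTS 4096

    struct fwd_entry { _Atomic(uintptr_t) from; _Atomic(uintptr_t) to; };
    static struct fwd_entry fwd_table[FWD_SLOTS];

    static size_t fwd_hash(uintptr_t from) {
        return (size_t)((from >> 4) * 2654435761u) % FWD_SLOTS;
    }

    /* Record that the object at 'from' now lives at 'to'. Racing copiers agree
     * on a single winner; the canonical tospace copy is returned. */
    uintptr_t fwd_install(uintptr_t from, uintptr_t to) {
        for (size_t i = fwd_hash(from); ; i = (i + 1) % FWD_SLOTS) {
            uintptr_t claimed = 0;
            if (atomic_compare_exchange_strong(&fwd_table[i].from, &claimed, from)
                || claimed == from) {
                uintptr_t prev = 0;
                if (atomic_compare_exchange_strong(&fwd_table[i].to, &prev, to))
                    return to;            /* our copy became the canonical one */
                return prev;              /* another thread installed its copy first */
            }
            /* slot holds a different key: keep probing */
        }
    }

    /* Return the tospace address for 'from', or 0 if it has not been copied yet. */
    uintptr_t fwd_lookup(uintptr_t from) {
        for (size_t i = fwd_hash(from); ; i = (i + 1) % FWD_SLOTS) {
            uintptr_t f = atomic_load(&fwd_table[i].from);
            if (f == 0)    return 0;      /* never relocated (or copy not yet recorded) */
            if (f == from) return atomic_load(&fwd_table[i].to);
        }
    }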

At the time the fromspace pages are protected, running mutators may have stale references in their root set. These are already past their read barrier and will not get caught directly. Instead, the mutators are asked to forward any existing stale references from their root set with a checkpoint, relocating the fromspace targets as necessary (Figure 17.2c). Once all the mutators have passed this checkpoint, copying of the remaining live objects into tospace can proceed concurrently with the mutators. The read barrier prevents the mutators from seeing a stale object before it has finished moving.


Figure 17.2: Pauseless

As in the mark phase, the read barrier in the relocate phase prevents the mutator from loading a stale reference. The self-healing GC-trap handler forwards the reference and updates the memory location using CompareAndSwap. If the fromspace object has not yet been copied then the mutator will copy the object on behalf of the collector. This is illustrated in Figure 17.2d. The mutator can read the GC-protected page because the GC-trap handler runs in the elevated GC-protection mode. Large objects that span multiple pages are not relocated, nor are objects in mostly live pages. An object that consumes about half of a page can be copied in about a millisecond.5

To amortise the cost of modifying translation lookaside buffer protections and forwarding the mutator roots, Pauseless batches up groups of sparse pages for compaction, typically protecting (and relocating and freeing) a few gigabytes at a time. The rate at which relocation must proceed is dictated only by the need to keep up with the allocation rate of the mutators.

Remap. Virtual memory is not freed immediately. The final step of forwarding the remaining stale references in the live pages and reclaiming virtual memory falls to the remap phase. At the end of the remap phase there are no more stale references to the fromspace pages so their virtual memory can now be recycled (Figure 17.2e), the side array of forwarding pointers can be reclaimed, and the GC cycle is complete. Recall that real memory for evacuated pages was reclaimed long before, during the relocate phase.

Finalisation and weak references. Java’s soft and weak references (see Section 12.1) lead to a race between the collector nulling a reference and the mutator strengthening it. Fortunately, Pauseless can process soft and weak references concurrently with the mutator by having the collector CompareAndSwap the reference down to null only while it remains not marked-through. The Not-Marked-Through-trap handler already has the proper CompareAndSwap behaviour, allowing the mutator and the collector to race on the CompareAndSwap. If the mutator wins then the reference is strengthened (and the collector will know), while if the collector wins then the reference is nulled (and the mutator sees only the null).
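The race can be written directly as two compare-and-swap attempts on the same slot, as in the C sketch below. The types and function names are illustrative, not the Pauseless implementation; the point is only that exactly one of the two CompareAndSwaps can succeed.

    #include <stdatomic.h>
    #include <stddef.h>

    typedef struct { _Atomic(void *) referent; } weak_ref;

    /* Collector: null the referent only if it still holds the not-marked-through value. */
    void collector_clear(weak_ref *w, void *unmarked_ref) {
        void *expected = unmarked_ref;
        atomic_compare_exchange_strong(&w->referent, &expected, NULL);
        /* If the CAS fails, a mutator strengthened the reference first. */
    }

    /* Mutator (Not-Marked-Through trap): replace the not-marked-through value with
     * the marked-through flavour of the same reference. */
    void *mutator_strengthen(weak_ref *w, void *unmarked_ref, void *marked_ref) {
        void *expected = unmarked_ref;
        if (atomic_compare_exchange_strong(&w->referent, &expected, marked_ref))
            return marked_ref;            /* we won: the referent stays alive */
        return expected;                  /* either NULL (collector won) or already healed */
    }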

Operating system extensions. Pauseless makes aggressive and sustained use of virtual memory mapping and physical memory manipulation. This functionality can be implemented using standard operating system primitives, but the rates at which it can be applied through those primitives are prohibitively slow. Pauseless-specific extensions to the operating system’s memory manager result in significant performance improvements [Azul, 2010]. Enterprise Java applications commonly see allocation rates of 200–500 megabytes per second per core, which must be matched by a sustained garbage collection rate to avoid pauses. In Pauseless, each page will eventually be remapped once (and later unmapped once) in order to reclaim dead object space. No physical memory copying is required, so the remap rate is not significantly sensitive to memory bandwidth. Instead, the cost of the remapping operations dominates. Typical operating systems support remapping with three limitations:

1.  Each page remap includes an implicit translation lookaside buffer invalidation operation. Since translation lookaside buffer invalidations require multiple cross-CPU interrupts (over all cores) the cost of remapping grows with the number of active threads in the program. This happens even when the active threads do not participate in the remapping, or have no interaction with the remapped memory.

2.  Only small (typically four kilobyte) page mappings can be remapped.

3.  Remap operations are single-threaded within a process (grabbing a common write lock).

To address these shortcomings, Pauseless benefits from operating system extensions that support remapping without translation lookaside buffer invalidation (these can be applied in bulk at the end of a large set of remaps as necessary), remapping of large (typically two megabyte) page mappings, and multiple concurrent remaps within the same process. These operating system improvements result in approximately three orders of magnitude speedup compared to a stock operating system, scaling almost linearly as the number of active threads doubles.

Summing up, Pauseless is designed as a fully parallel and concurrent collector for large multiprocessor systems. It requires no stop-the-world pauses and dead objects can be reclaimed at any point during a collector cycle. There are no phases where the collector must race to finish before the mutators run out of free memory. Mutators can perceive a period of reduced utilisation during trap storms at some phase shifts, but the self-healing property of these traps serves to recover utilisation quickly.

17.8  Issues to consider

This chapter has laid out the basic principles of concurrent copying collection and concurrent compaction to reduce fragmentation, while also avoiding long pauses. As in any concurrent collector algorithm, the collector must be protected against mutations that can otherwise cause lost objects. But because the collector is moving objects, the mutator must also be protected against accessing stale copies. Some algorithms protect the mutator by making sure it operates with a tospace invariant so that it can never hold references to stale fromspace objects [Baker, 1978]. Others protect the mutator by making it forward to tospace copies as they are created, but otherwise allow it to continue operating in fromspace [Brooks, 1984]. Still others permit continued operation in fromspace, so long as updates eventually propagate to tospace [Nettles et al, 1992; Nettles and O’Toole, 1993]. Once copying has finished all the mutators flip to tospace in a single step. Dispensing with this global transition can mean accumulating chains of multiple versions, which mutators must traverse to find the most up-to-date copy [Herlihy and Moss, 1992]. Alternatively, by performing updates on both copies, mutators can be transitioned one at a time [Hudson and Moss, 2001, 2003]. Compaction can be performed in similar ways but without the need to copy all objects at every collection [Kermany and Petrank, 2006; Click et al, 2005; Azul, 2008].

These approaches may result in longer pauses than non-moving concurrent collection: on any given heap access the mutator may need to wait for an object (or objects) to move or indirect to the current version. Indeed, Baker [1992a] devised his Treadmill algorithm as an antidote to the churn present in his original copying collector [Baker, 1978]. While copying or compaction are needed to avoid fragmentation, they can present particular difficulties for applications that are sensitive to prolonged or frequent pauses. Often, such applications also operate in environments where memory is unusually constrained, such as embedded systems, where defragmentation can be even more important. We consider how to manage concurrent copying or concurrent compaction while tightly bounding pauses for such applications in Chapter 19.

1Atomic copying of an object and installation of the forwarding address from the old copy to the new one is not always simple.

2Herlihy and Moss use the term process for what might now be called a thread, but we continue to use processor here to match Halstead [1985] and to emphasise that the heap regions should be thought of as per-processor.

3We emphasise that Sapphire assumes that there are no races on updating non-volatiles.

4We thank Laurence Hellyer for describing this problem.

5Recall that Pauseless’s pages are quite large.
