27
Simple Remote Heaps
Jason Hughes
Steel Penny Games, Inc.
27.1 Introduction
A remote heap is an old tool in the memory management toolbox. It is appropriate to consider using a remote heap whenever a memory architecture is segmented, such as on the Wii, where MEM1 is much faster than MEM2 and can benefit from having all the allocation and deletion work done with lower latency. A remote heap is also useful when a processor has to handle feeding other custom processors with their own local memory, such as pushing audio data to a sound chip, when the only way to touch that memory is via bulk DMA calls that have significant latency. Similarly, on the PlayStation 3, the PPU may want to treat the memory associated with each SPU as a remote heap so that it can play traffic cop, with data whizzing back and forth, without needing to have direct access to their individual memories. Conceptually, remote heaps could also be used in a unified memory architecture with multiple processors, such as the Xbox 360, where each heap is mutually distinct and partitions memory for each CPU to work on, with a guarantee that no other processor will cause mutex blocking at the central memory manager.
Considerations
There are limitations to every technique. Unfortunately, remote heaps suffer from a significant overhead in performance and local memory footprint, which can be traced primarily back to the fact that the only data a typical program has to refer to an allocation is its address. In a standard heap, metadata describing each allocation is stored a few bytes before each allocated block of memory, so operations such as Free() need only do some quick pointer arithmetic to do their work.
This is not as straightforward with a remote heap because that metadata has to be
associated with an address, without the benefit of storing data at that address.
Designing a good remote heap means flirting with hash tables, red-black trees, or
similar data structures.
Unfortunately, hash tables of addresses are terrible when searching for memory to allocate (sequential allocations would be randomly scattered through memory and have O(n) allocation time, where n scales with the size of the hash table rather than with the number of entries used) and are equally bad for merging adjacent free blocks.
Red-black trees of addresses (such as the standard template library map) are memory hungry and somewhat tedious to set up with allocators from fixed memory pools, but they do work very well when freeing blocks in O(log n) time, where n is the number of current allocations, and merging adjacent free blocks is a constant-time operation (neglecting the reordering of the tree when a node is deleted). Unfortunately, they have O(n) running time for allocation because best-fit and first-fit strategies must scan all blocks to find free space.
One solution is to carefully pack bits so that the O(n) searches are as fast as possible. Here, we accept the worst-case performance and try to make the best of it. There are some benefits to this method, which is why we present it: sometimes it is ideal. In general, though, it lacks performance. We call this the "bitwise remote heap" approach because each unit of memory is represented as a single bit.
The other solution is to realize that a single address-based data structure is
not ideal for all functions. However, keeping two structures in sync tends to force
each data structure to have the worst-case performance of the other. Our solution
is to allow them to fall out of sync and correct mistakes as they are detected. We
call this the “blockwise remote heap” because it tracks individual allocations as
single units.
27.2 Bitwise Remote Heap
Overview
The bitwise remote heap is very similar to the Medium Page Allocation scheme
described in the previous chapter. The basic idea is to manage a block of memory
(or rather, a range of addresses) in equally sized chunks by associating each
chunk’s availability with a single bit in a packed bitmap. If the bit is true, the
block is currently free. If the bit is false, it is part of an allocation already. This
convention may seem strange, but it was chosen specifically because Intel CPUs
can find the next set bit in a single instruction,¹ and we're always looking for free blocks. On other processors, the opposite might be true.
One downside of this approach is that all allocations must be rounded up to an even multiple of the chunk size, which can be wasteful if this size is set too large. Another disadvantage is that, when finding an allocation, the implementation may need to scan all of the bits in the bitmap, only to fail, and this failing scan takes longer as the number of bits in the bitmap increases.
Consequently, this kind of implementation is ideal when relatively few allocations are required, they can be easily rounded to a specific size, and the amount of memory managed is relatively small. This sounds like a lot of restrictions to place on a remote heap, but in reality, they are satisfied quite frequently. Most remote heaps dealing with custom hardware, such as audio processors or specialized processors (vector units, GPUs, etc.), have alignment requirements anyway, and they often have a relatively small block of memory that needs to be managed.
Algorithm
The bitwise remote heap has only two important functions:
Alloc(). The obvious method for finding a free block of memory in a bitmap is to scan for a sequence of bits that are set. This can be accomplished by loading 32 bits at a time, scanning for the first 1 bit, noting that location, then scanning for the first 0 bit after it. Often this requires spanning multiple 32-bit values in sequence, so an optimized routine is required. Also, many processor architectures have assembly language mnemonics that do this kind of search in a single instruction, for further performance enhancement. Clearly, this operation requires worst-case n/32 memory fetches, where n is the number of bits in the heap bitmap. Once the range is found, the bits in the range are set to zero. A single bit is set in an end-of-allocation bitmap that marks the last block in the allocation. Think of it like null-terminating a string.
Free(). When deleting a chunk of memory, we simply turn a range of 0 bits
into 1 bits, scanning forward until we find the allocation terminator bit in the
end-of-allocation bitmap. Once found, we clear the terminating bit as well.
Coalescing free blocks is completely unnecessary because adjacent free bits
implicitly represent contiguous free memory. Thus, there is relatively little
management overhead in deallocation.
¹ See the bit scan forward (BSF) instruction in the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 2A: Instruction Set Reference, A-M. Many details on bit scan operations are also available at http://chessprogramming.wikispaces.com/BitScan.
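To make the two operations concrete, below is a minimal sketch of a bitwise remote heap in C++. The class name, the members, and the simple bit-at-a-time scan (rather than the optimized word-at-a-time search described above) are our own illustrative choices, not the chapter's accompanying source code. It assumes the managed range starts at a nonzero base address, so that a return value of zero can signal failure.

#include <cstdint>
#include <vector>

// A minimal sketch of a bitwise remote heap. One bit per chunk: 1 = free,
// 0 = allocated, matching the convention described above.
class BitwiseRemoteHeap
{
public:
    // 'base' is the remote address of the managed range; assumed nonzero so
    // that a zero return from Alloc() can signal failure.
    BitwiseRemoteHeap(uint32_t base, uint32_t chunkSize, uint32_t numChunks)
        : m_base(base), m_chunkSize(chunkSize), m_numChunks(numChunks),
          m_free((numChunks + 31) / 32, 0xffffffffu),  // all chunks start free
          m_end((numChunks + 31) / 32, 0)              // no terminator bits yet
    {
    }

    // Returns a remote address, or 0 on failure. 'bytes' must be nonzero.
    uint32_t Alloc(uint32_t bytes)
    {
        uint32_t needed = (bytes + m_chunkSize - 1) / m_chunkSize;
        uint32_t run = 0;
        for (uint32_t bit = 0; bit < m_numChunks; ++bit)
        {
            run = TestBit(m_free, bit) ? run + 1 : 0;  // count consecutive free bits
            if (run == needed)
            {
                uint32_t first = bit - needed + 1;
                for (uint32_t b = first; b <= bit; ++b)
                    ClearBit(m_free, b);               // mark chunks as used
                SetBit(m_end, bit);    // terminator bit, like a null in a string
                return m_base + first * m_chunkSize;
            }
        }
        return 0;   // no run of free chunks was long enough
    }

    void Free(uint32_t address)
    {
        uint32_t bit = (address - m_base) / m_chunkSize;
        while (!TestBit(m_end, bit))   // release chunks up to the terminator
            SetBit(m_free, bit++);
        SetBit(m_free, bit);           // release the last chunk...
        ClearBit(m_end, bit);          // ...and clear the terminator bit
    }

private:
    static bool TestBit(const std::vector<uint32_t> &v, uint32_t i)
    { return ((v[i >> 5] >> (i & 31)) & 1u) != 0; }
    static void SetBit(std::vector<uint32_t> &v, uint32_t i)
    { v[i >> 5] |= 1u << (i & 31); }
    static void ClearBit(std::vector<uint32_t> &v, uint32_t i)
    { v[i >> 5] &= ~(1u << (i & 31)); }

    uint32_t m_base, m_chunkSize, m_numChunks;
    std::vector<uint32_t> m_free;   // 1 = chunk free
    std::vector<uint32_t> m_end;    // 1 = last chunk of an allocation
};

For example, a heap managing 1 MB of audio RAM in 256-byte chunks would be constructed as BitwiseRemoteHeap heap(audioRamBase, 256, 4096), where audioRamBase is a hypothetical device address. Note that no coalescing code appears anywhere: adjacent free bits merge implicitly, exactly as the text describes.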
Certain small changes can dramatically improve performance. One major bottleneck is scanning thousands of bits to look for memory. The simple trick of remembering a good place to start searching can cut the search time significantly when making repeated calls to Alloc(). When doing this, every allocation that succeeds remembers the very next bit as the starting point, since large blocks of memory tend to be partitioned into smaller ones, and the next piece of free memory is likely to be adjacent. Also, whenever a block of memory is released in Free(), we reset the starting point if the address is lower than the current starting point. This is very similar to a first-fit allocation strategy, while skipping a lot of searching. However, this method can fail when an allocation occurs close to the end of memory and most of the free blocks are at the start, so an additional loop must check for free space up to the starting point in the event of an allocation failure. This tiny change is extremely important for good performance. See Table 27.2 at the end of the article for a comparison.
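Here is how the search hint might thread into the allocator, as a sketch built on the class above. It assumes a new m_hint member (initialized to zero) and a FindRun() helper, both our own names, have been added to the class declaration; FindRun() is simply the scan loop from the earlier sketch, factored out to search only the bit range [lo, hi). In Free(), the hint would likewise be lowered whenever the released address falls below it.

bool BitwiseRemoteHeap::FindRun(uint32_t needed, uint32_t lo, uint32_t hi,
                                uint32_t &first)
{
    uint32_t run = 0;
    for (uint32_t bit = lo; bit < hi; ++bit)
    {
        run = TestBit(m_free, bit) ? run + 1 : 0;
        if (run == needed)
        {
            first = bit - needed + 1;
            return true;
        }
    }
    return false;
}

uint32_t BitwiseRemoteHeap::Alloc(uint32_t bytes)
{
    uint32_t needed = (bytes + m_chunkSize - 1) / m_chunkSize;
    uint32_t first = 0;

    // Scan from the hint to the end; on failure, rescan from the beginning.
    // The second range extends slightly past the hint so that a run which
    // straddles it is not missed.
    uint32_t hi2 = m_hint + needed;
    if (hi2 > m_numChunks)
        hi2 = m_numChunks;
    if (!FindRun(needed, m_hint, m_numChunks, first) &&
        !FindRun(needed, 0, hi2, first))
        return 0;

    for (uint32_t b = first; b < first + needed; ++b)
        ClearBit(m_free, b);                 // mark the range as allocated
    SetBit(m_end, first + needed - 1);       // terminate the allocation
    m_hint = first + needed;                 // the next free piece is likely here
    return m_base + first * m_chunkSize;
}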
A further improvement, which we have not personally implemented, would be to remove the O(n) search entirely. This could be accomplished by keeping a hierarchical table of bitmaps, where each bit represents the free status of several bits at the lower level. Along with such a system come inaccuracies in determining whether a block is available for an allocation, meaning more Alloc() failures, but each failure is detected more quickly. Sometimes that's a good trade-off.
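As a rough illustration of the idea (again, not something the chapter implements), the following sketch keeps a one-word summary bitmap over a 1024-chunk heap; the names and the whole-group granularity are our own assumptions. A set summary bit guarantees a fully free 32-chunk group, while a clear bit proves nothing, which is exactly the inaccuracy described above.

#include <cstdint>

static const uint32_t kGroups = 32;   // 32 groups x 32 bits = 1024 chunks
static uint32_t g_freeBits[kGroups];  // 1 = chunk free, as in the main bitmap
static uint32_t g_summary;            // bit g set = group g is entirely free

// Call after any bit in a group changes.
static void UpdateSummary(uint32_t group)
{
    if (g_freeBits[group] == 0xffffffffu)
        g_summary |= 1u << group;     // group is wholly free
    else
        g_summary &= ~(1u << group);  // conservative: may still have free bits
}

// Finds a wholly free group by testing 32 summary bits instead of scanning
// up to 1024 chunk bits; useful for allocations of a group or larger.
static int FindFreeGroup()
{
    for (uint32_t g = 0; g < kGroups; ++g)
        if (g_summary & (1u << g))
            return static_cast<int>(g);
    return -1;   // fall back to the exhaustive bit scan, or fail quickly
}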
Finally, a simple, cheap improvement would be to create a table of free block addresses that is sorted by the size of the free block. This makes allocation extremely quick because any allocation request is simply a query against this table for any block of at least the requested size, running up the table to larger blocks if no blocks of the right size are available. However, deallocation becomes slower because bits must now be scanned in both directions to coalesce free blocks effectively. The bookkeeping required to maintain a useful table is slightly complicated and might become a significant overhead if many small fragments begin filling up the table. Thus, we would not bother keeping small free chunks in the table, since searching is likely to provide a valid block anyway, but anything over some threshold that is less likely to find space would be a good candidate for keeping in the table. This probably results in O(n) deallocation, but on a heavily loaded heap, it would be negligible in practice. So, again, it might be a good trade-off.
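A sketch of such a table follows, using std::multimap purely for brevity, even though, as noted earlier, standard library tree structures are memory hungry; a production version would likely use a structure backed by a fixed pool. The threshold value and all names are our own assumptions.

#include <cstdint>
#include <map>

// Size-sorted table of large free blocks. Small fragments are deliberately
// excluded and left to the ordinary bit scan.
class FreeBlockTable
{
public:
    static const uint32_t kThreshold = 1024;   // don't track tiny fragments

    void Add(uint32_t address, uint32_t size)
    {
        if (size >= kThreshold)
            m_bySize.insert(std::make_pair(size, address));
    }

    // Removes and returns a block of at least 'size' bytes, if one exists.
    // "Running up the table" to larger blocks happens automatically because
    // the multimap is ordered by size.
    bool Take(uint32_t size, uint32_t &address, uint32_t &blockSize)
    {
        std::multimap<uint32_t, uint32_t>::iterator it =
            m_bySize.lower_bound(size);
        if (it == m_bySize.end())
            return false;
        blockSize = it->first;
        address = it->second;
        m_bySize.erase(it);
        return true;
    }

private:
    std::multimap<uint32_t, uint32_t> m_bySize;   // size -> start address
};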
27.3 Blockwise Remote Heap
Overview
The blockwise remote heap is quite similar in implementation to a standard heap, except that it does not benefit from storing extra data immediately prior to the address in user memory. Since we cannot use the address to quickly find metadata about allocations, the naive implementation leaves us with a linked list of tracking blocks, ordered by address. A naive implementation forces all operations to scan the heap's tracking linked list before allocating or freeing any memory, giving it a slow O(n) running time, where n is the number of allocations in the heap.
The implementation on the website does not attempt to make Free() as fast as possible but, rather, does a simple linked list search. This could be improved by creating a second data structure that keeps the tracking block pointer associated with each allocated address, perhaps in a hash table or balanced tree, but that is an exercise left to the reader. The memory requirements for such a structure are substantial and may be entirely unnecessary for applications where Free() is rarely or never called and the entire heap is instead dropped all at once.
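For reference, the tracking node and the naive O(n) Free() might look like the following sketch; the field and function names are ours, not necessarily those of the accompanying source.

#include <cstdint>

// Per-allocation tracking node kept in local memory, ordered by remote address.
struct TrackingBlock
{
    uint32_t       address;   // start of the block in remote memory
    uint32_t       size;      // size of the block in bytes
    bool           used;      // false if this block is currently free
    TrackingBlock *next;      // next block in ascending address order
};

// Placeholder: a real heap returns the node to its tracking pool here.
static void ReleaseNode(TrackingBlock *) {}

// Freeing walks the address-ordered list to find the node that owns the
// address; this search is what makes the naive version O(n). Merging with
// free neighbors keeps contiguous free memory represented as one block.
void NaiveFree(TrackingBlock *head, uint32_t address)
{
    TrackingBlock *prev = 0;
    for (TrackingBlock *b = head; b != 0; prev = b, b = b->next)
    {
        if (b->address != address)
            continue;
        b->used = false;
        if (b->next != 0 && !b->next->used)   // coalesce with the next block
        {
            TrackingBlock *dead = b->next;
            b->size += dead->size;
            b->next = dead->next;
            ReleaseNode(dead);
        }
        if (prev != 0 && !prev->used)         // coalesce with the previous block
        {
            prev->size += b->size;
            prev->next = b->next;
            ReleaseNode(b);
        }
        return;
    }
}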
However, since fast allocation is actually a complicated problem, we present a solution that is nearly constant-time for the majority of allocations and is linear for the rest. We do this by maintaining a bookmark table that essentially points to the first free block of each power-of-two size of free memory. It remembers the last place the code found a free block of each size. Once we allocate a block, that entry in the table may no longer be accurate. Updating the entry may require a full traversal of the heap, a very slow operation, so we allow it to become stale. Instead, during allocation and free calls, we store any free blocks we come across in the table at the appropriate power of two (rounded down). While there is no guarantee that the table is up to date, in practice it tends to be close, at no additional performance cost. During an allocation, if a table entry appears invalid, we can always check the next-higher power of two, or the next, until one is found to be valid. In most cases, this works very well. In empirical tests, 65 percent of the allocations have positive hits on the cached references in the bookmark table, which means 65 percent of the long searches for free memory were avoided.
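The following sketch shows one plausible shape for the bookmark table, reusing the TrackingBlock node from the previous sketch; the bucket math and all names are our own assumptions. Note how a stale or too-small entry simply pushes the search to the next-higher bucket rather than triggering a repair traversal.

#include <cstdint>

static const uint32_t kNumBuckets = 32;
static TrackingBlock *s_bookmark[kNumBuckets];  // entries may go stale

// Bucket for a size: floor(log2(size)), clamped to the table.
static uint32_t BucketFor(uint32_t size)
{
    uint32_t b = 0;
    while (b + 1 < kNumBuckets && size >= (2u << b))
        ++b;
    return b;
}

// Called whenever Alloc() or Free() walks past a free block: record it at
// its rounded-down power of two. Keeping the table warm costs nothing extra,
// since we were touching the block anyway.
static void NoteFreeBlock(TrackingBlock *block)
{
    if (!block->used)
        s_bookmark[BucketFor(block->size)] = block;
}

// Find a candidate free block of at least 'bytes'. A cached entry may be
// stale (allocated since it was recorded), so validate it and climb to the
// next-higher power of two until a valid one turns up. A recycled node would
// additionally be detected via the pool sentinel described next. A null
// return sends the caller to the full O(n) list walk.
static TrackingBlock *LookupBookmark(uint32_t bytes)
{
    for (uint32_t b = BucketFor(bytes); b < kNumBuckets; ++b)
    {
        TrackingBlock *cand = s_bookmark[b];
        if (cand != 0 && !cand->used && cand->size >= bytes)
            return cand;
    }
    return 0;
}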
The tracking data for allocations is stored in a pool with a threaded free-list, making the location of a valid metadata block during allocation and deletion an O(1) operation. Threaded free-lists act like a stack, since we simply want a blank node to write to when allocating (pop) and want to return a node to the unused pool when freeing (push). As in any other standard pool implementation, the used and unused structures occupy the same physical memory at different times, and we just cast to one or the other, depending on the current state of that block. To aid in debugging, as well as to facilitate lazy bookmarking, unused metadata nodes are marked with a sentinel value that cannot appear in an allocation, so we can repair the bookmark table when it gets out of date.
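A sketch of that pool follows, again building on the TrackingBlock struct from earlier; the sentinel value and names are our own choices. The union makes the aliasing explicit: the same storage is either live metadata or a stamped free-list link.

#include <cstdint>

static const uint32_t kFreeSentinel = 0xfeeefeeeu;  // assumed never a valid address

union PoolNode
{
    TrackingBlock block;      // when in use: allocation metadata
    struct Unused
    {
        uint32_t  sentinel;   // kFreeSentinel marks this node as unused
        PoolNode *nextFree;   // thread to the next unused node (stack link)
    } unused;
};

class TrackingPool
{
public:
    TrackingPool(PoolNode *storage, uint32_t count) : m_top(0)
    {
        for (uint32_t i = 0; i < count; ++i)   // thread and stamp every node
            Push(&storage[i]);
    }

    TrackingBlock *Pop()                // O(1): take a blank node (allocate)
    {
        if (m_top == 0)
            return 0;
        PoolNode *node = m_top;
        m_top = node->unused.nextFree;
        return &node->block;
    }

    void Push(PoolNode *node)           // O(1): return a node (free)
    {
        node->unused.sentinel = kFreeSentinel;
        node->unused.nextFree = m_top;
        m_top = node;
    }

    // A bookmark entry pointing at a node carrying the sentinel is stale:
    // the allocation it referred to has been freed and the node recycled.
    static bool IsStale(const TrackingBlock *b)
    {
        return reinterpret_cast<const PoolNode *>(b)->unused.sentinel
            == kFreeSentinel;
    }

private:
    PoolNode *m_top;   // top of the stack of unused nodes
};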