27
Simple Remote Heaps
Jason Hughes
Steel Penny Games, Inc.
27.1 Introduction
A remote heap is an old tool in the memory management toolbox. It is appropriate to consider using a remote heap whenever a memory architecture is segmented, such as on the Wii, where MEM1 is much faster than MEM2 and can benefit from having all the allocation and deletion work done with lower latency. A remote heap is also useful when a processor has to handle feeding other custom processors with their own local memory, such as pushing audio data to a sound chip, when the only way to touch that memory is via bulk DMA calls that have significant latency. Similarly, on the PlayStation 3, the PPU may want to treat the memory associated with each SPU as a remote heap so that it can play traffic cop, with data whizzing back and forth, without needing to have direct access to their individual memories. Conceptually, remote heaps could also be used in a unified memory architecture with multiple processors, such as the Xbox 360, where each heap is mutually distinct and partitions memory for each CPU to work on, with a guarantee that no other processor will cause mutex blocking at the central memory manager.
Considerations
There are limitations to every technique. Unfortunately, remote heaps suffer from a significant overhead in performance and local memory footprint, which can be traced primarily back to the fact that the only data a typical program has to refer to an allocation is its address. In a standard heap, metadata describing each allocation is stored a few bytes before each allocated block of memory, so operations such as Free() need only do some quick pointer arithmetic to do their work.
This is not as straightforward with a remote heap because that metadata has to be
associated with an address, without the benefit of storing data at that address.
Designing a good remote heap means flirting with hash tables, red-black trees, or
similar data structures.
Unfortunately, hash tables of addresses are terrible when searching for memory to allocate (sequential allocations would be randomly scattered through memory and have O(n) allocation time, where n scales with the size of the hash table rather than with the number of entries used) and are equally bad for merging adjacent free blocks.
Red-black trees of addresses (such as the standard template library map) are memory hungry and somewhat tedious to set up with allocators from fixed memory pools, but they do work very well when freeing blocks in O(log n) time, where n is the number of current allocations, and merging adjacent free blocks is a constant-time operation (neglecting the reordering of the tree when a node is deleted). Unfortunately, they have O(n) running time for allocation because best-fit and first-fit strategies must scan all blocks to find free space.
One solution is to carefully pack bits so that the O(n) searches are as fast as possible. Here, we accept the worst-case performance and try to make the best of it. There are some benefits to this method, which is why we present it: sometimes it is ideal. In general, though, it lacks performance. We call this the "bitwise remote heap" approach because each unit of memory is represented as a single bit.
The other solution is to realize that a single address-based data structure is
not ideal for all functions. However, keeping two structures in sync tends to force
each data structure to have the worst-case performance of the other. Our solution
is to allow them to fall out of sync and correct mistakes as they are detected. We
call this the “blockwise remote heap” because it tracks individual allocations as
single units.
27.2 Bitwise Remote Heap
Overview
The bitwise remote heap is very similar to the Medium Page Allocation scheme
described in the previous chapter. The basic idea is to manage a block of memory
(or rather, a range of addresses) in equally sized chunks by associating each
chunk’s availability with a single bit in a packed bitmap. If the bit is true, the
block is currently free. If the bit is false, it is part of an allocation already. This
convention may seem strange, but it was chosen specifically because Intel CPUs
can find the next set bit in a single instruction,¹ and we're always looking for free blocks. On other processors, the opposite might be true.
One downside of this approach is that all allocations must be rounded up to an even multiple of the chunk size, which can be wasteful if this size is set too large. Another disadvantage is that, when finding an allocation, the implementation may need to scan all of the bits in the bitmap, only to fail, and this failing scan takes longer as the number of bits in the bitmap increases.
Consequently, this kind of implementation is ideal when relatively few allocations are required, they can be easily rounded to a specific size, and the amount of memory managed is relatively small. This sounds like a lot of restrictions to place on a remote heap, but in reality, they are satisfied quite frequently. Most remote heaps dealing with custom hardware, such as audio processors or specialized processors (vector units, GPUs, etc.), have alignment requirements anyway, and they often have a relatively small block of memory that needs to be managed.
Algorithm
The bitwise remote heap has only two important functions:
Alloc(). The obvious method for finding a free block of memory in a bitmap is to scan for a sequence of bits that are set. This can be accomplished by loading 32 bits at a time, scanning for the first 1 bit, noting that location, then scanning for the first 0 bit after it. Often this requires spanning multiple 32-bit values in sequence, so an optimized routine is required. Also, many processor architectures have assembly language mnemonics that do this kind of search in a single instruction, for further performance enhancement. Clearly, this operation requires worst-case n/32 memory fetches, where n is the number of bits in the heap bitmap. Once the range is found, the bits in the range are set to zero. A single bit is set in an end-of-allocation bitmap that marks the last block in the allocation. Think of it like null-terminating a string.
Free(). When deleting a chunk of memory, we simply turn a range of 0 bits
into 1 bits, scanning forward until we find the allocation terminator bit in the
end-of-allocation bitmap. Once found, we clear the terminating bit as well.
Coalescing free blocks is completely unnecessary because adjacent free bits
implicitly represent contiguous free memory. Thus, there is relatively little
management overhead in deallocation.
¹ See the bit scan forward (BSF) instruction in the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 2A: Instruction Set Reference, A-M. Many details on bit scan operations are also available at http://chessprogramming.wikispaces.com/BitScan.
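To make the two operations concrete, below is a minimal sketch of a bitwise remote heap in C++. The class name, the members, and the simple bit-at-a-time scan (rather than the optimized word-at-a-time search described above) are our own illustrative choices, not the chapter's accompanying source code. It assumes the managed range starts at a nonzero base address, so that a return value of zero can signal failure.

#include <cstdint>
#include <vector>

// A minimal sketch of a bitwise remote heap. One bit per chunk: 1 = free,
// 0 = allocated, matching the convention described above.
class BitwiseRemoteHeap
{
public:
    // 'base' is the remote address of the managed range; assumed nonzero so
    // that a zero return from Alloc() can signal failure.
    BitwiseRemoteHeap(uint32_t base, uint32_t chunkSize, uint32_t numChunks)
        : m_base(base), m_chunkSize(chunkSize), m_numChunks(numChunks),
          m_free((numChunks + 31) / 32, 0xffffffffu),  // all chunks start free
          m_end((numChunks + 31) / 32, 0)              // no terminator bits yet
    {
    }

    // Returns a remote address, or 0 on failure. 'bytes' must be nonzero.
    uint32_t Alloc(uint32_t bytes)
    {
        uint32_t needed = (bytes + m_chunkSize - 1) / m_chunkSize;
        uint32_t run = 0;
        for (uint32_t bit = 0; bit < m_numChunks; ++bit)
        {
            run = TestBit(m_free, bit) ? run + 1 : 0;  // count consecutive free bits
            if (run == needed)
            {
                uint32_t first = bit - needed + 1;
                for (uint32_t b = first; b <= bit; ++b)
                    ClearBit(m_free, b);               // mark chunks as used
                SetBit(m_end, bit);    // terminator bit, like a null in a string
                return m_base + first * m_chunkSize;
            }
        }
        return 0;   // no run of free chunks was long enough
    }

    void Free(uint32_t address)
    {
        uint32_t bit = (address - m_base) / m_chunkSize;
        while (!TestBit(m_end, bit))   // release chunks up to the terminator
            SetBit(m_free, bit++);
        SetBit(m_free, bit);           // release the last chunk...
        ClearBit(m_end, bit);          // ...and clear the terminator bit
    }

private:
    static bool TestBit(const std::vector<uint32_t> &v, uint32_t i)
    { return ((v[i >> 5] >> (i & 31)) & 1u) != 0; }
    static void SetBit(std::vector<uint32_t> &v, uint32_t i)
    { v[i >> 5] |= 1u << (i & 31); }
    static void ClearBit(std::vector<uint32_t> &v, uint32_t i)
    { v[i >> 5] &= ~(1u << (i & 31)); }

    uint32_t m_base, m_chunkSize, m_numChunks;
    std::vector<uint32_t> m_free;   // 1 = chunk free
    std::vector<uint32_t> m_end;    // 1 = last chunk of an allocation
};

For example, a heap managing 1 MB of audio RAM in 256-byte chunks would be constructed as BitwiseRemoteHeap heap(audioRamBase, 256, 4096), where audioRamBase is a hypothetical device address. Note that no coalescing code appears anywhere: adjacent free bits merge implicitly, exactly as the text describes.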
Certain small changes can dramatically improve performance. One major bottleneck is scanning thousands of bits to look for memory. The simple trick of remembering a good place to start searching can cut the search time significantly when making repeated calls to Alloc(). When doing this, every allocation that succeeds remembers the very next bit as the starting point, since large blocks of memory tend to be partitioned into smaller ones, and the next piece of free memory is likely to be adjacent. Also, whenever a block of memory is released in Free(), we reset the starting point if the address is lower than the current starting point. This is very similar to a first-fit allocation strategy, while skipping a lot of searching. However, this method can fail when an allocation occurs close to the end of memory and most of the free blocks are at the start, so an additional loop must check for free space up to the starting point in the event of an allocation failure. This tiny change is extremely important for good performance. See Table 27.2 at the end of the article for a comparison.
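Here is how the search hint might thread into the allocator, as a sketch built on the class above. It assumes a new m_hint member (initialized to zero) and a FindRun() helper, both our own names, have been added to the class declaration; FindRun() is simply the scan loop from the earlier sketch, factored out to search only the bit range [lo, hi). In Free(), the hint would likewise be lowered whenever the released address falls below it.

bool BitwiseRemoteHeap::FindRun(uint32_t needed, uint32_t lo, uint32_t hi,
                                uint32_t &first)
{
    uint32_t run = 0;
    for (uint32_t bit = lo; bit < hi; ++bit)
    {
        run = TestBit(m_free, bit) ? run + 1 : 0;
        if (run == needed)
        {
            first = bit - needed + 1;
            return true;
        }
    }
    return false;
}

uint32_t BitwiseRemoteHeap::Alloc(uint32_t bytes)
{
    uint32_t needed = (bytes + m_chunkSize - 1) / m_chunkSize;
    uint32_t first = 0;

    // Scan from the hint to the end; on failure, rescan from the beginning.
    // The second range extends slightly past the hint so that a run which
    // straddles it is not missed.
    uint32_t hi2 = m_hint + needed;
    if (hi2 > m_numChunks)
        hi2 = m_numChunks;
    if (!FindRun(needed, m_hint, m_numChunks, first) &&
        !FindRun(needed, 0, hi2, first))
        return 0;

    for (uint32_t b = first; b < first + needed; ++b)
        ClearBit(m_free, b);                 // mark the range as allocated
    SetBit(m_end, first + needed - 1);       // terminate the allocation
    m_hint = first + needed;                 // the next free piece is likely here
    return m_base + first * m_chunkSize;
}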
A further improvement, which we have not personally implemented, would be to remove the O(n) search entirely. This could be accomplished by keeping a hierarchical table of bitmaps, where each bit represents the free status of several bits at the lower level. Along with such a system come inaccuracies in determining whether a block is available for an allocation, meaning more Alloc() failures, but each failure is detected more quickly. Sometimes that's a good trade-off.
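As a rough illustration of the idea (again, not something the chapter implements), the following sketch keeps a one-word summary bitmap over a 1024-chunk heap; the names and the whole-group granularity are our own assumptions. A set summary bit guarantees a fully free 32-chunk group, while a clear bit proves nothing, which is exactly the inaccuracy described above.

#include <cstdint>

static const uint32_t kGroups = 32;   // 32 groups x 32 bits = 1024 chunks
static uint32_t g_freeBits[kGroups];  // 1 = chunk free, as in the main bitmap
static uint32_t g_summary;            // bit g set = group g is entirely free

// Call after any bit in a group changes.
static void UpdateSummary(uint32_t group)
{
    if (g_freeBits[group] == 0xffffffffu)
        g_summary |= 1u << group;     // group is wholly free
    else
        g_summary &= ~(1u << group);  // conservative: may still have free bits
}

// Finds a wholly free group by testing 32 summary bits instead of scanning
// up to 1024 chunk bits; useful for allocations of a group or larger.
static int FindFreeGroup()
{
    for (uint32_t g = 0; g < kGroups; ++g)
        if (g_summary & (1u << g))
            return static_cast<int>(g);
    return -1;   // fall back to the exhaustive bit scan, or fail quickly
}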
Finally, a simple, cheap improvement would be to create a table of free block addresses that is sorted by the size of the free block. This makes allocation extremely quick because any allocation request is simply a query against this table for any block of at least the requested size, running up the table to larger blocks if no blocks of the right size are available. However, deallocation becomes slower because bits must now be scanned in both directions to coalesce free blocks effectively. The bookkeeping required to maintain a useful table is slightly complicated and might become a significant overhead if many small fragments begin filling up the table. Thus, we would not bother keeping small free chunks in the table, since searching is likely to provide a valid block anyway, but anything over some threshold that is less likely to find space would be a good candidate for keeping in the table. This probably results in O(n) deallocation, but on a heavily loaded heap, it would be negligible in practice. So, again, it might be a good trade-off.
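A sketch of such a table follows, using std::multimap purely for brevity, even though, as noted earlier, standard library tree structures are memory hungry; a production version would likely use a structure backed by a fixed pool. The threshold value and all names are our own assumptions.

#include <cstdint>
#include <map>

// Size-sorted table of large free blocks. Small fragments are deliberately
// excluded and left to the ordinary bit scan.
class FreeBlockTable
{
public:
    static const uint32_t kThreshold = 1024;   // don't track tiny fragments

    void Add(uint32_t address, uint32_t size)
    {
        if (size >= kThreshold)
            m_bySize.insert(std::make_pair(size, address));
    }

    // Removes and returns a block of at least 'size' bytes, if one exists.
    // "Running up the table" to larger blocks happens automatically because
    // the multimap is ordered by size.
    bool Take(uint32_t size, uint32_t &address, uint32_t &blockSize)
    {
        std::multimap<uint32_t, uint32_t>::iterator it =
            m_bySize.lower_bound(size);
        if (it == m_bySize.end())
            return false;
        blockSize = it->first;
        address = it->second;
        m_bySize.erase(it);
        return true;
    }

private:
    std::multimap<uint32_t, uint32_t> m_bySize;   // size -> start address
};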
27.3 Blockwise Remote Heap
Overview
The blockwise remote heap is quite similar in implementation to a standard heap, except that it does not benefit from storing extra data immediately prior to the address in user memory. Since we cannot use the address to quickly find metadata about allocations, the naive implementation leaves us with a linked list of tracking blocks, ordered by address. A naive implementation forces all operations to scan the heap's tracking linked list before allocating or freeing any memory, giving it a slow O(n) running time, where n is the number of allocations in the heap.
The implementation on the website does not attempt to make Free() as fast as possible but, rather, does a simple linked list search. This could be improved by creating a second data structure that keeps the tracking block pointer associated with each allocated address, perhaps in a hash table or balanced tree, but that is an exercise left to the reader. The memory requirements for such a structure are substantial and may be entirely unnecessary for applications where Free() is rarely or never called and the entire heap is instead dropped all at once.
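For reference, the tracking node and the naive O(n) Free() might look like the following sketch; the field and function names are ours, not necessarily those of the accompanying source.

#include <cstdint>

// Per-allocation tracking node kept in local memory, ordered by remote address.
struct TrackingBlock
{
    uint32_t       address;   // start of the block in remote memory
    uint32_t       size;      // size of the block in bytes
    bool           used;      // false if this block is currently free
    TrackingBlock *next;      // next block in ascending address order
};

// Placeholder: a real heap returns the node to its tracking pool here.
static void ReleaseNode(TrackingBlock *) {}

// Freeing walks the address-ordered list to find the node that owns the
// address; this search is what makes the naive version O(n). Merging with
// free neighbors keeps contiguous free memory represented as one block.
void NaiveFree(TrackingBlock *head, uint32_t address)
{
    TrackingBlock *prev = 0;
    for (TrackingBlock *b = head; b != 0; prev = b, b = b->next)
    {
        if (b->address != address)
            continue;
        b->used = false;
        if (b->next != 0 && !b->next->used)   // coalesce with the next block
        {
            TrackingBlock *dead = b->next;
            b->size += dead->size;
            b->next = dead->next;
            ReleaseNode(dead);
        }
        if (prev != 0 && !prev->used)         // coalesce with the previous block
        {
            prev->size += b->size;
            prev->next = b->next;
            ReleaseNode(b);
        }
        return;
    }
}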
However, since fast allocation is actually a complicated problem, we present a solution that is nearly constant-time for the majority of allocations and is linear for the rest. We do this by maintaining a bookmark table that essentially points to the first free block of each power-of-two size of free memory. It remembers the last place the code found a free block of each size. Once we allocate a block, that entry in the table may no longer be accurate. Updating the entry may require a full traversal of the heap, a very slow operation, so we allow it to become stale. Instead, during allocation and free calls, we store any free blocks we come across in the table at the appropriate power of two (rounded down). While there is no guarantee that the table is up to date, in practice it tends to be close, at no additional performance cost. During an allocation, if a table entry appears invalid, we can always check the next-higher power of two, or the next, until one is found to be valid. In most cases, this works very well. In empirical tests, 65 percent of the allocations have positive hits on the cached references in the bookmark table, which means 65 percent of the long searches for free memory were avoided.
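The following sketch shows one plausible shape for the bookmark table, reusing the TrackingBlock node from the previous sketch; the bucket math and all names are our own assumptions. Note how a stale or too-small entry simply pushes the search to the next-higher bucket rather than triggering a repair traversal.

#include <cstdint>

static const uint32_t kNumBuckets = 32;
static TrackingBlock *s_bookmark[kNumBuckets];  // entries may go stale

// Bucket for a size: floor(log2(size)), clamped to the table.
static uint32_t BucketFor(uint32_t size)
{
    uint32_t b = 0;
    while (b + 1 < kNumBuckets && size >= (2u << b))
        ++b;
    return b;
}

// Called whenever Alloc() or Free() walks past a free block: record it at
// its rounded-down power of two. Keeping the table warm costs nothing extra,
// since we were touching the block anyway.
static void NoteFreeBlock(TrackingBlock *block)
{
    if (!block->used)
        s_bookmark[BucketFor(block->size)] = block;
}

// Find a candidate free block of at least 'bytes'. A cached entry may be
// stale (allocated since it was recorded), so validate it and climb to the
// next-higher power of two until a valid one turns up. A recycled node would
// additionally be detected via the pool sentinel described next. A null
// return sends the caller to the full O(n) list walk.
static TrackingBlock *LookupBookmark(uint32_t bytes)
{
    for (uint32_t b = BucketFor(bytes); b < kNumBuckets; ++b)
    {
        TrackingBlock *cand = s_bookmark[b];
        if (cand != 0 && !cand->used && cand->size >= bytes)
            return cand;
    }
    return 0;
}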
The tracking data for allocations is stored in a pool with a threaded free-list, making the location of a valid metadata block during allocation and deletion an O(1) operation. Threaded free-lists act like a stack, since we simply want a blank node to write to when allocating (pop) and want to return a node to the unused pool when freeing (push). As in any other standard pool implementation, the used and unused structures occupy the same physical memory at different times, and we just cast to one or the other, depending on the current state of that block. To aid in debugging, as well as to facilitate lazy bookmarking, unused metadata nodes are marked with a sentinel value that cannot appear in an allocation, so we can repair the bookmark table when it gets out of date.
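A sketch of that pool follows, again building on the TrackingBlock struct from earlier; the sentinel value and names are our own choices. The union makes the aliasing explicit: the same storage is either live metadata or a stamped free-list link.

#include <cstdint>

static const uint32_t kFreeSentinel = 0xfeeefeeeu;  // assumed never a valid address

union PoolNode
{
    TrackingBlock block;      // when in use: allocation metadata
    struct Unused
    {
        uint32_t  sentinel;   // kFreeSentinel marks this node as unused
        PoolNode *nextFree;   // thread to the next unused node (stack link)
    } unused;
};

class TrackingPool
{
public:
    TrackingPool(PoolNode *storage, uint32_t count) : m_top(0)
    {
        for (uint32_t i = 0; i < count; ++i)   // thread and stamp every node
            Push(&storage[i]);
    }

    TrackingBlock *Pop()                // O(1): take a blank node (allocate)
    {
        if (m_top == 0)
            return 0;
        PoolNode *node = m_top;
        m_top = node->unused.nextFree;
        return &node->block;
    }

    void Push(PoolNode *node)           // O(1): return a node (free)
    {
        node->unused.sentinel = kFreeSentinel;
        node->unused.nextFree = m_top;
        m_top = node;
    }

    // A bookmark entry pointing at a node carrying the sentinel is stale:
    // the allocation it referred to has been freed and the node recycled.
    static bool IsStale(const TrackingBlock *b)
    {
        return reinterpret_cast<const PoolNode *>(b)->unused.sentinel
            == kFreeSentinel;
    }

private:
    PoolNode *m_top;   // top of the stack of unused nodes
};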