430 CHAPTER 13. ASK THE EXPERTS
They also include novel designs to make them more economical in terms of time, space, or energy.
The main problem has been that multiprocessor companies have been constrained by legacy. The
cache coherence protocol is a complex hardware module, whose design involves careful crafting of
state machines and detailed analysis of timing issues. In addition, it interacts with multiple other
parts of the machine, including the caches, processor load-store unit, and bus or network modules.
As a result, it takes many person-years to debug a cache coherence protocol. When companies have
implemented a protocol that works, they are very reluctant to change it in any way.
Currently, each company has a few cache coherence protocol designs that have been shown to
work correctly. The full details of these designs are rarely revealed to people outside of the company.
In addition, they may not be fully documented even inside the company. In practice, these
protocols are likely more complicated than necessary, with states that were added to optimize
conditions that may no longer be relevant. These states may in fact be slowing down the
common case.
The bottom line is that these protocols are suboptimal in performance, energy, complexity, and
scalability. However, it is unclear what researchers can do to change this status quo.
[YS] Can you describe your research on bulk coherence? What problems does it attack, and
what unique characteristics does it have?
[JT] Bulk coherence¹ is an effort to design a cache coherence protocol from new principles, in order
to enable a more programmable environment. It also improves performance, and does not
increase hardware complexity.
Bulk coherence provides the software with high-performance sequential memory consistency,
which improves programmability. In addition, bulk coherence supports several novel
hardware primitives. These primitives can be used to build a sophisticated program development-
and-debugging environment, including low-overhead data-race detection, deterministic replay of
parallel programs, and high-speed disambiguation of sets of addresses. The overhead of these
primitives is low enough for them to remain on during production runs.
The key idea in bulk coherence is twofold. First, the hardware automatically executes all software
as a series of dynamically created atomic blocks of thousands of dynamic instructions, called
chunks. Chunk execution is invisible to the software and, therefore, puts no restriction on the
programming language or model. Second, bulk coherence introduces the use of "hardware address
signatures" to operate on groups of addresses. Signatures are a low-overhead mechanism to ensure
atomic and isolated execution of chunks, and they help provide high-performance sequential memory
consistency.
Bulk coherence enables higher performance because the processor hardware is free to aggressively
reorder and overlap a program's memory accesses within chunks, without the risk of breaking the
program's expected behavior in a multiprocessor environment. Moreover, the compiler can create
the chunks, and then further improve performance by heavily optimizing the instructions within each
chunk.
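The chunk-granularity execution described above can be sketched as a toy model. Plain Python sets stand in for hardware signatures, and the commit rule, class, and method names are illustrative assumptions rather than the actual hardware interface:

```python
# Toy model of chunk-level conflict resolution, assuming a
# bulk-coherence-style design in which chunk commits are totally
# ordered and a committing chunk squashes any in-flight chunk it
# conflicts with. All names here are hypothetical.

class Chunk:
    def __init__(self, name):
        self.name = name
        self.read_set = set()    # cache lines read by the chunk
        self.write_set = set()   # cache lines written by the chunk
        self.squashed = False

    def load(self, addr):
        self.read_set.add(addr >> 6)   # assume 64-byte lines

    def store(self, addr):
        self.write_set.add(addr >> 6)

def commit(chunk, in_flight):
    """Commit `chunk`; squash any in-flight chunk whose read or
    write set overlaps the committer's write set."""
    for other in in_flight:
        if chunk.write_set & (other.read_set | other.write_set):
            other.squashed = True   # must roll back and re-execute

a, b = Chunk("A"), Chunk("B")
a.store(0x1000)     # A writes a line
b.load(0x1008)      # B reads the same 64-byte line
commit(a, [b])      # A commits first, so B is squashed
```

Because conflicts are detected and resolved only at chunk boundaries, the hardware and compiler are free to reorder accesses inside a chunk; the total order of chunk commits is what yields sequential consistency at chunk granularity.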
Finally, the bulk multicore organization decreases hardware design complexity because memory-
consistency enforcement is largely decoupled from processor structures. In a conventional processor
¹ J. Torrellas, L. Ceze, J. Tuck, C. Cascaval, P. Montesinos, W. Ahn, and M. Prvulovic. The Bulk Multicore Architecture for
Improved Programmability. CACM, December 2009.