Chapter 13
Ask the Experts
Josep Torrellas on Parallel Multicore Architectures
Josep Torrellas is a Professor at the Departments of Computer
Science and (by courtesy) Electrical and Computer Engineer-
ing at the University of Illinois at Urbana-Champaign (UIUC).
He is the Director of the Center for Programmable Extreme
Scale Computing, and past Director of the Illinois-Intel Paral-
lelism Center (I2PC). He is a Fellow of IEEE (2004) and ACM
(2010). He received the 2015 IEEE Computer Society Techni-
cal Achievement Award, for “Pioneering contributions to shared-
memory multiprocessor architectures and thread-level specula-
tion”, and the ICCD High-Impact Paper Award for “One of the 5
most cited papers in the first 30 years of ICCD”. He has served as the Chair of the IEEE Techni-
cal Committee on Computer Architecture (TCCA) (2005-2010) and as a Council Member of CRA's
Computing Community Consortium (CCC) (2011-2014). He was a Willett Faculty Scholar at UIUC
(2002-2009).
Torrellas’ research interests are shared-memory parallel computer architecture, energy-efficient
architectures, hardware reliability, and software dependability. He has published over 200 papers
and received 12 Best Paper awards. In the early nineties, Torrellas was involved in the
Stanford DASH and Illinois Cedar experimental multiprocessor projects. Later, he led the Illinois
Aggressive Cache Only Memory Architecture (I-ACOMA) project, which was one of the Ten Point-
Design Studies funded by the federal government to accelerate the arrival of a petascale machine.
He led the DARPA-funded M3T Polymorphic Computer Architecture, and co-directed the NSF-
funded FlexRAM Intelligent Memory project. He was one of the PIs in the DARPA-funded IBM
PERCS multiprocessor project. As part of the Illinois-Intel Parallelism Center, he led the design
of the Bulk Multicore Architecture for parallel programming productivity. He was also a co-PI of
the DARPA-funded Intel Runnemede multiprocessor, developed under the Ubiquitous High Perfor-
mance Computing program, to build an extreme-scale multiprocessor designed from the ground up
for energy efficiency.
Torrellas has served in the organization of numerous professional conferences and workshops.
Recent major service includes Program Chair of PACT 2014, ISCA 2012, HPCA 2005, IEEE Micro
Top Picks 2005, and SC 2007. As of 2015, he has graduated 35 Ph.D. students, who are now leaders
in academia and industry.
Interview
[Yan Solihin] Josep, could you describe your experience in researching and developing support
for effective parallel execution of programs on multicore architectures?
[Josep Torrellas] I have been performing research in the area of shared-memory multiprocessor
architectures for over 25 years now. I have worked on cache coherence protocols, memory consis-
tency support, prefetching, thread-level speculation, low-power design, and hardware and software
reliability.
I started back in the late 80’s by looking at the behavior of shared data in the directory-based
protocol of the Stanford DASH multiprocessor. I then worked to characterize the performance of
the Illinois Cedar multiprocessor. Later on, I contributed to the design of shared-memory multi-
processors with thread-level speculation (TLS). I showed the broad use of speculative threading
mechanisms in parallel computer architecture, including in speculative synchronization, program
debugging, and low-overhead program monitoring. I also showed that speculative threading is a
primitive for high-performance and low-complexity sequential consistency (SC) enforcement.
In the 90’s, I worked on the Illinois Aggressive Cache Only Memory Architecture (I-ACOMA)
design. I contributed several NUMA machine organizations, coherence protocols, and prefetch-
ing schemes. A couple of significant contributions are embedded-ring snoopy cache-coherence pro-
tocols and the ReVive incremental, in-memory multiprocessor checkpointing. We also released the
widely used SESC simulator of multiprocessor architectures.
In the 00’s, I worked on the Bulk multicore architecture, as part of the Intel Center for Paral-
lelism. This was a novel architecture designed for parallel programming productivity. As part of
this effort, I worked with Intel researchers, and we built an x86-compatible hardware prototype for
record-and-replay of parallel programs.
Also, during this time, I became interested in process variation and energy-efficient multipro-
cessor design. As part of an effort on extreme scale computing, we developed the VARIUS-NTV
models of process variation, and multiple techniques to mitigate and tolerate process variation. This
work culminated in the design of an extreme-scale manycore with Intel researchers called Run-
nemede.
[YS] How have parallel architectures, including multicores, evolved in the last 20-30 years?
[JT] I finished my Ph.D. in the early 90’s. At that time, there was a lot of excitement about cache-
coherent shared-memory multiprocessors, especially scalable ones. I did my Ph.D. thesis in the
context of the Stanford DASH multiprocessor, and then joined a team that evaluated the Illinois
Cedar multiprocessor at the University of Illinois, Urbana-Champaign. These two experimental
machines, and the research groups that developed them, introduced many research ideas.
At that time, development groups in companies were very interested in these ideas. They would
invite researchers to discuss the latest results. There were several new products in the market.
Later, in the late 90’s and early 00’s, there was a progressive disenchantment with multipro-
cessors. Superscalar uniprocessors kept improving in every generation, and multiprocessor design
teams had trouble keeping up with the changes. Every new uniprocessor generation required a major
redesign of the multiprocessor. A bit later, simultaneous multithreaded (SMT) processors appeared,
which further pushed the performance of uniprocessors. Throughout this time, I kept working on
multiprocessors.
Finally, around the mid 00’s, multicores appeared, driven by energy and power concerns. They
have been central to computer architecture since. Many of the topics that were studied one or two
decades earlier were reconsidered with different constraints. Now, parallel architectures in the form
of multicores or manycores are very popular. However, the design issues that are being discussed
are more incremental than before. The field is much more mature.
[YS] In your view, how much effort can and should programmers be expected to expend
writing parallel programs?
[JT] Several years ago, when multicores were about to become ubiquitous, I thought that all pro-
grammers would have to program in parallel. However, it has since become clear to me that many
programmers are insulated from concerns about parallelism. They program user interfaces or a va-
riety of layers in large projects. Only those programmers who program the core parallel algorithms,
including the parallel libraries, require knowledge of parallelism. They are often experienced pro-
grammers (although not always). These programmers should spend a lot of effort to design high-
performing, bug-free parallel codes.
It is for these programmers that computer architects need to work hard. Computer architects
need to provide the hardware support necessary to make life easy for these programmers. Specif-
ically, architects need to provide cache coherence support, efficient synchronization, transparent
caching and prefetching, and intuitive memory consistency models.
[YS] What are the key hardware techniques that have worked well in extracting or managing
thread-level parallelism?
[JT] The key hardware techniques that have worked well in managing thread-level parallelism in-
clude cache coherence, memory consistency support, and synchronization. These techniques have
been designed to interface well with both the memory hierarchy and the core of processors. Hard-
ware transactional memory may also end up being a useful support for parallelism, but it is a bit
early to tell at this point.
On the other hand, it has been a disappointment to see that hardware support to parallelize
codes has not succeeded. There was a lot of effort in the 00’s to design structures to speculatively
parallelize codes with potential dependences. This effort included techniques to monitor for depen-
dence violations at run time, buffer multiple versions of the same data in caches, and squash and
restart when a dependence was violated. None of this is currently in use, although it has inspired
transactional memory. I hope that it will be used one day.
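To make the mechanism concrete, here is a minimal software sketch, in C, of the dependence
tracking just described: each speculative task records the addresses it reads, and when a logically
earlier task writes an address that a later task has already read, the later task is squashed.
The structure names, the read-set size, and the squash policy are illustrative assumptions, not
any shipped design.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define MAX_SPEC_READS 64   /* illustrative per-task read-set capacity */

    /* Per-task speculative state: the addresses read so far. */
    typedef struct {
        uintptr_t reads[MAX_SPEC_READS];
        size_t    nreads;
        bool      squashed;
    } spec_task;

    /* Record a speculative load so later stores can be checked against it. */
    static void record_read(spec_task *t, uintptr_t addr) {
        if (t->nreads < MAX_SPEC_READS)
            t->reads[t->nreads++] = addr;
    }

    /* When a logically earlier task commits a store to addr, any logically
       later task that already read addr consumed a stale value: mark it
       squashed so it can be restarted from a safe point. */
    static void check_store(spec_task *later, uintptr_t addr) {
        for (size_t i = 0; i < later->nreads; i++)
            if (later->reads[i] == addr) {
                later->squashed = true;
                return;
            }
    }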
[YS] What are the limitations of current cache coherence protocols in multicore architectures?
[JT] Over the years, the academic community has come up with many improvements for cache
coherence protocols. They include novel ways to make them scalable to large numbers of cores.
They also include novel designs to make them more economical in terms of time, space, or energy.
The main problem has been that multiprocessor companies have been constrained by legacy. The
cache coherence protocol is a complex hardware module, whose design involves careful crafting of
state machines and detailed analysis of timing issues. In addition, it interacts with multiple other
parts of the machine, including the caches, processor load-store unit, and bus or network modules.
As a result, it takes many person-years to debug a cache coherence protocol. When companies have
implemented a protocol that works, they are very reluctant to change it in any way.
Currently, each company has a few cache coherence protocol designs that have been shown to
work correctly. The full details of these designs are rarely revealed to people outside of the company.
In addition, they may not be fully documented even inside of the company. In practice, these
protocols are likely to be more complicated than needed, with states that were added to optimize
certain conditions that may not be relevant anymore. These states may in fact be slowing down the
operation of the common case.
The bottom line is that these protocols are suboptimal in performance, energy, complexity, and
scalability. However, it is very unclear what researchers can do to change this status quo.
[YS] Can you describe your research on bulk coherence? What problems does it attack and
what unique characteristics does it have?
[JT] Bulk coherence [1] is an effort to design a cache coherence protocol from new principles, in order
to enable a more programmable environment. In addition, it improves performance, and does
not increase hardware complexity.
Bulk coherence provides the software with high-performance sequential memory consistency,
which improves programmability. In addition, bulk coherence provides support for several novel
hardware primitives. These primitives can be used to build a sophisticated program development-
and-debugging environment, including low-overhead data-race detection, deterministic replay of
parallel programs, and high-speed disambiguation of sets of addresses. These primitives have
low enough overhead to remain always on during production runs.
The key idea in bulk coherence is twofold: first, the hardware automatically executes all soft-
ware as a series of dynamically-created atomic blocks of thousands of dynamic instructions called
chunks. Chunk execution is invisible to the software and, therefore, puts no restriction on the pro-
gramming language or model. Second, bulk coherence introduces the use of “hardware address
signatures” to operate on groups of addresses. Signatures are a low-overhead mechanism to ensure
atomic and isolated execution of chunks, and help provide high-performance sequential memory
consistency.
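As a rough illustration of such signatures, the following C sketch encodes cache-line addresses
into a Bloom-filter-style bit vector and detects chunk conflicts by intersecting signatures. The
signature width and hash functions here are illustrative assumptions; real designs use fixed,
hardware-friendly bit permutations.

    #include <stdbool.h>
    #include <stdint.h>

    #define SIG_BITS  1024                 /* illustrative signature width */
    #define SIG_WORDS (SIG_BITS / 64)

    /* A signature summarizes the set of cache-line addresses a chunk has
       read or written, at the cost of possible false positives. */
    typedef struct { uint64_t bits[SIG_WORDS]; } signature;

    /* Two simple hashes over the cache-line address (illustrative only). */
    static unsigned hash1(uintptr_t addr) {
        return (unsigned)((addr >> 6) % SIG_BITS);
    }
    static unsigned hash2(uintptr_t addr) {
        return (unsigned)(((addr >> 6) * 2654435761u) % SIG_BITS);
    }

    /* Add an address to a chunk's read or write signature. */
    static void sig_insert(signature *s, uintptr_t addr) {
        s->bits[hash1(addr) / 64] |= 1ull << (hash1(addr) % 64);
        s->bits[hash2(addr) / 64] |= 1ull << (hash2(addr) % 64);
    }

    /* A non-empty intersection of one chunk's write signature with another
       chunk's read or write signature signals a (possibly false-positive)
       conflict, so one of the chunks is squashed and re-executed. */
    static bool sig_conflict(const signature *a, const signature *b) {
        for (int i = 0; i < SIG_WORDS; i++)
            if (a->bits[i] & b->bits[i])
                return true;
        return false;
    }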
Bulk coherence enables higher performance because the processor hardware is free to aggres-
sively reorder and overlap the memory accesses of a program within chunks without risk of break-
ing their expected behavior in a multiprocessor environment. Moreover, the compiler can create the
chunks, and then further improve performance by heavily optimizing the instructions within each
chunk.
Finally, the Bulk multicore organization decreases hardware design complexity because memory-
consistency enforcement is largely decoupled from processor structures. In a conventional processor
that issues memory accesses out of order, supporting sequential consistency requires intrusive pro-
cessor modifications.
[1] J. Torrellas, L. Ceze, J. Tuck, C. Cascaval, P. Montesinos, W. Ahn, and M. Prvulovic. The Bulk
Multicore Architecture for Improved Programmability. CACM, December 2009.
[YS] What are the key challenges in the future in architecture support for effective parallel
execution that still need to be addressed?
[JT] Fortunately, there is no shortage of challenges in architectural support for effective parallel
execution. The first and most important one is that, as multiprocessor designers are more and more
concerned with energy efficiency and low power, there will be pressure to ignore programmability
issues. An example of this trend is the emergence of heterogeneity. Another example is the emer-
gence of 1,000-core manycore designs that may not be cache-coherent. The challenge in this area
is to design low-overhead, energy-efficient primitives for programmability that are compatible with
an energy-first philosophy.
Another challenge is how to support effective Quality of Service (QoS). In the future, we will
have large manycores with hundreds or a thousand cores. They will be running many applications
that will compete for resources. Effective QoS requires novel hardware support.
Another important challenge is the emergence of a new breed of dynamic languages called
scripting languages. Programmers like these languages because they allow quick prototyping. How-
ever, compilers find these languages hard to manage because there is little static information. Conse-
quently, there is a lot of work that the compilers need to do at runtime, which slows down execution.
This is a great opportunity for hardware support.
Hardware support for synchronization and fences needs to be redesigned. Existing hardware is
not designed for the large manycores that we expect to see in the next few years: it is too costly
and does not scale.
Finally, the emergence of security and privacy concerns presents a major challenge to hardware
designers. It is a totally open problem to identify the most cost-effective hardware primitives that
we need to add to uniprocessors and multiprocessors to ensure security and privacy.
[YS] Josep, I appreciate your time and insights in this interview. Thank you.