Chapter 13
Ask the Experts
Josep Torrellas on Parallel Multicore Architectures
Josep Torrellas is a Professor at the Departments of Computer
Science and (by courtesy) Electrical and Computer Engineer-
ing at the University of Illinois at Urbana-Champaign (UIUC).
He is the Director of the Center for Programmable Extreme
Scale Computing, and past Director of the Illinois-Intel Paral-
lelism Center (I2PC). He is a Fellow of IEEE (2004) and ACM
(2010). He received the 2015 IEEE Computer Society Techni-
cal Achievement Award, for “Pioneering contributions to shared-
memory multiprocessor architectures and thread-level specula-
tion”, and the ICCD High-Impact Paper Award for “One of the 5
most cited papers in the first 30 years of ICCD”. He has served as the Chair of the IEEE Techni-
cal Committee on Computer Architecture (TCCA) (2005-2010) and as a Council Member of CRA's
Computing Community Consortium (CCC) (2011-2014). He was a Willett Faculty Scholar at UIUC
(2002-2009).
Torrellas’ research interests are shared-memory parallel computer architecture, energy-efficient
architectures, hardware reliability, and software dependability. He has published over 200 papers
and received 12 Best Paper awards. In the early nineties, Torrellas was involved in the
Stanford DASH and Illinois Cedar experimental multiprocessor projects. Later, he led the Illinois
Aggressive Cache Only Memory Architecture (I-ACOMA) project, which was one of the Ten Point-
Design Studies funded by the federal government to accelerate the arrival of a petascale machine.
He led the DARPA-funded M3T Polymorphic Computer Architecture, and co-directed the NSF-
funded FlexRAM Intelligent Memory project. He was one of the PIs in the DARPA-funded IBM
PERCS multiprocessor project. As part of the Illinois-Intel Parallelism Center, he led the design
of the Bulk Multicore Architecture for parallel programming productivity. He was also a co-PI of
the DARPA-funded Intel Runnemede multiprocessor, developed under the Ubiquitous High Perfor-
mance Computing program, to build an extreme-scale multiprocessor designed from the ground up
for energy efficiency.
Torrellas has served in the organization of numerous professional conferences and workshops.
Recent major service includes Program Chair of PACT 2014, ISCA 2012, HPCA 2005, IEEE Micro
Top Picks 2005, and SC 2007. As of 2015, he has graduated 35 Ph.D. students, who are now leaders
in academia and industry.
Interview
[Yan Solihin] Josep, could you describe your experience in researching and developing support
for effective parallel execution of programs on multicore architectures?
[Josep Torrellas] I have been performing research in the area of shared-memory multiprocessor
architectures for over 25 years now. I have worked on cache coherence protocols, memory consis-
tency support, prefetching, thread-level speculation, low-power design, and hardware and software
reliability.
I started back in the late 80’s by looking at the behavior of shared data in the directory-based
protocol of the Stanford DASH multiprocessor. I then worked to characterize the performance of
the Illinois Cedar multiprocessor. Later on, I contributed to the design of shared-memory multi-
processors with thread-level speculation (TLS). I showed the broad use of speculative threading
mechanisms in parallel computer architecture, including in speculative synchronization, program
debugging, and low-overhead program monitoring. I also showed that speculative threading is a
primitive for high-performance and low-complexity sequential consistency (SC) enforcement.
In the 90’s, I worked on the Illinois Aggressive Cache Only Memory Architecture (I-ACOMA)
design. I contributed several NUMA machine organizations, coherence protocols, and prefetch-
ing schemes. A couple of significant contributions are embedded-ring snoopy cache-coherence pro-
tocols and the ReVive incremental, in-memory multiprocessor checkpointing. We also released the
widely used SESC simulator of multiprocessor architectures.
In the 00’s, I worked on the Bulk multicore architecture, as part of the Intel Center for Paral-
lelism. This was a novel architecture designed for parallel programming productivity. As part of
this effort, I worked with Intel researchers, and we built an x86-compatible hardware prototype for
record-and-replay of parallel programs.
Also, during this time, I became interested in process variation and energy-efficient multipro-
cessor design. As part of an effort on extreme scale computing, we developed the VARIUS-NTV
models of process variation, and multiple techniques to mitigate and tolerate process variation. This
work culminated in the design of an extreme-scale manycore with Intel researchers called Run-
nemede.
[YS] How have parallel architectures, including multicores, evolved in the last 20-30 years?
[JT] I finished my Ph.D. in the early 90’s. At that time, there was a lot of excitement about cache-
coherent shared-memory multiprocessors, especially scalable ones. I did my Ph.D. thesis in the
context of the Stanford DASH multiprocessor, and then joined a team that evaluated the Illinois
Cedar multiprocessor at the University of Illinois, Urbana-Champaign. These two experimental
machines, and the research groups that developed them, introduced many research ideas.
At that time, development groups in companies were very interested in these ideas. They would
invite researchers to discuss the latest results. There were several new products in the market.
Later, in the late 90’s and early 00’s, there was a progressive disenchantment with multipro-
cessors. Superscalar uniprocessors kept improving in every generation, and multiprocessor design
teams had trouble keeping up with the changes. Every new uniprocessor generation required a major
redesign of the multiprocessor. A bit later, simultaneous multithreaded (SMT) processors appeared,
which further pushed the performance of uniprocessors. Throughout this time, I kept working on
multiprocessors.
Finally, around the mid 00’s, multicores appeared, driven by energy and power concerns. They
have been central to computer architecture since. Many of the topics that were studied one or two
decades earlier were reconsidered with different constraints. Now, parallel architectures in the form
of multicores or manycores are very popular. However, the design issues that are being discussed
are more incremental than before. The field is much more mature.
[YS] In your view, how much effort can and should programmers be expected to expend
writing parallel programs?
[JT] Several years ago, when multicores were about to become ubiquitous, I thought that all pro-
grammers would have to program in parallel. However, it has since become clear to me that many
programmers are insulated from concerns about parallelism. They program user interfaces or a va-
riety of layers in large projects. Only those programmers who program the core parallel algorithms,
including the parallel libraries, require knowledge of parallelism. They are often experienced pro-
grammers (although not always). These programmers should spend a lot of effort to design high-
performing, bug-free parallel codes.
It is for these programmers that computer architects need to work hard. Computer architects
need to provide the hardware support necessary to make life easy for these programmers. Specif-
ically, architects need to provide cache coherence support, efficient synchronization, transparent
caching and prefetching, and intuitive memory consistency models.
[YS] What are the key hardware techniques that have worked well in extracting or managing
thread-level parallelism?
[JT] The key hardware techniques that have worked well in managing thread-level parallelism in-
clude cache coherence, memory consistency support, and synchronization. These techniques have
been designed to interface well with both the memory hierarchy and the core of processors. Hard-
ware transactional memory may also end up being a useful support for parallelism, but it is a bit
early to tell at this point.
On the other hand, it has been a disappointment to see that hardware support to parallelize
codes has not succeeded. There was a lot of effort in the 00’s to design structures to speculatively
parallelize codes with potential dependences. This effort included techniques to monitor for depen-
dence violations at run time, buffer multiple versions of the same data in caches, and squash and
restart when a dependence was violated. None of this is currently in use, although it has inspired
transactional memory. I hope that it will be used one day.
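To make the mechanism concrete, here is a minimal software sketch, in C, of the dependence
tracking just described: each speculative task records the addresses it reads, and when a logically
earlier task writes an address that a later task has already read, the later task is squashed.
The structure names, the read-set size, and the squash policy are illustrative assumptions, not
any shipped design.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define MAX_SPEC_READS 64   /* illustrative per-task read-set capacity */

    /* Per-task speculative state: the addresses read so far. */
    typedef struct {
        uintptr_t reads[MAX_SPEC_READS];
        size_t    nreads;
        bool      squashed;
    } spec_task;

    /* Record a speculative load so later stores can be checked against it. */
    static void record_read(spec_task *t, uintptr_t addr) {
        if (t->nreads < MAX_SPEC_READS)
            t->reads[t->nreads++] = addr;
    }

    /* When a logically earlier task commits a store to addr, any logically
       later task that already read addr consumed a stale value: mark it
       squashed so it can be restarted from a safe point. */
    static void check_store(spec_task *later, uintptr_t addr) {
        for (size_t i = 0; i < later->nreads; i++)
            if (later->reads[i] == addr) {
                later->squashed = true;
                return;
            }
    }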
[YS] What are the limitations of current cache coherence protocols in multicore architectures?
[JT] Over the years, the academic community has come up with many improvements for cache
coherence protocols. They include novel ways to make them scalable to large numbers of cores.
They also include novel designs to make them more economical in terms of time, space, or energy.
The main problem has been that multiprocessor companies have been constrained by legacy. The
cache coherence protocol is a complex hardware module, whose design involves careful crafting of
state machines and detailed analysis of timing issues. In addition, it interacts with multiple other
parts of the machine, including the caches, processor load-store unit, and bus or network modules.
As a result, it takes many person-years to debug a cache coherence protocol. When companies have
implemented a protocol that works, they are very reluctant to change it in any way.
Currently, each company has a few cache coherence protocol designs that have been shown to
work correctly. The full details of these designs are rarely revealed to people outside of the company.
In addition, they may not be fully documented even inside of the company. In practice, these
protocols are likely to be more complicated than needed, with states that were added to optimize
certain conditions that may not be relevant anymore. These states may in fact be slowing down the
operation of the common case.
The bottom line is that these protocols are suboptimal in performance, energy, complexity, and
scalability. However, it is very unclear what researchers can do to change this status quo.
[YS] Can you describe your research on bulk coherence? What problems does it attack and
what unique characteristics does it have?
[JT] Bulk coherence [1] is an effort to design a cache coherence protocol from new principles, in order
to enable a more programmable environment. In addition, it improves performance, and does
not increase hardware complexity.
Bulk coherence provides the software with high-performance sequential memory consistency,
which improves programmability. In addition, bulk coherence provides support for several novel
hardware primitives. These primitives can be used to build a sophisticated program development-
and-debugging environment, including low-overhead data-race detection, deterministic replay of
parallel programs, and high-speed disambiguation of sets of addresses. These primitives have
low enough overhead to remain always on during production runs.
The key idea in bulk coherence is twofold: first, the hardware automatically executes all soft-
ware as a series of dynamically-created atomic blocks of thousands of dynamic instructions called
chunks. Chunk execution is invisible to the software and, therefore, puts no restriction on the pro-
gramming language or model. Second, bulk coherence introduces the use of “hardware address
signatures” to operate on groups of addresses. Signatures are a low-overhead mechanism to ensure
atomic and isolated execution of chunks, and help provide high-performance sequential memory
consistency.
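As a rough illustration of such signatures, the following C sketch encodes cache-line addresses
into a Bloom-filter-style bit vector and detects chunk conflicts by intersecting signatures. The
signature width and hash functions here are illustrative assumptions; real designs use fixed,
hardware-friendly bit permutations.

    #include <stdbool.h>
    #include <stdint.h>

    #define SIG_BITS  1024                 /* illustrative signature width */
    #define SIG_WORDS (SIG_BITS / 64)

    /* A signature summarizes the set of cache-line addresses a chunk has
       read or written, at the cost of possible false positives. */
    typedef struct { uint64_t bits[SIG_WORDS]; } signature;

    /* Two simple hashes over the cache-line address (illustrative only). */
    static unsigned hash1(uintptr_t addr) {
        return (unsigned)((addr >> 6) % SIG_BITS);
    }
    static unsigned hash2(uintptr_t addr) {
        return (unsigned)(((addr >> 6) * 2654435761u) % SIG_BITS);
    }

    /* Add an address to a chunk's read or write signature. */
    static void sig_insert(signature *s, uintptr_t addr) {
        s->bits[hash1(addr) / 64] |= 1ull << (hash1(addr) % 64);
        s->bits[hash2(addr) / 64] |= 1ull << (hash2(addr) % 64);
    }

    /* A non-empty intersection of one chunk's write signature with another
       chunk's read or write signature signals a (possibly false-positive)
       conflict, so one of the chunks is squashed and re-executed. */
    static bool sig_conflict(const signature *a, const signature *b) {
        for (int i = 0; i < SIG_WORDS; i++)
            if (a->bits[i] & b->bits[i])
                return true;
        return false;
    }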
Bulk coherence enables higher performance because the processor hardware is free to aggres-
sively reorder and overlap the memory accesses of a program within chunks without risk of break-
ing their expected behavior in a multiprocessor environment. Moreover, the compiler can create the
chunks, and then further improve performance by heavily optimizing the instructions within each
chunk.
Finally, the Bulk multicore organization decreases hardware design complexity because memory-
consistency enforcement is largely decoupled from processor structures. In a conventional processor
that issues memory accesses out of order, supporting sequential consistency requires intrusive pro-
cessor modifications.
[1] J. Torrellas, L. Ceze, J. Tuck, C. Cascaval, P. Montesinos, W. Ahn, and M. Prvulovic. The Bulk
Multicore Architecture for Improved Programmability. CACM, December 2009.
[YS] What are the key challenges in the future in architecture support for effective parallel
execution that still need to be addressed?
[JT] Fortunately, there is no shortage of challenges in architectural support for effective parallel
execution. The first and most important one is that, as multiprocessor designers are more and more
concerned with energy efficiency and low power, there will be pressure to ignore programmability
issues. An example of this trend is the emergence of heterogeneity. Another example is the emer-
gence of 1,000-core manycore designs that may not be cache-coherent. The challenge in this area
is to design low-overhead, energy-efficient primitives for programmability that are compatible with
an energy-first philosophy.
Another challenge is how to support effective Quality of Service (QoS). In the future, we will
have large manycores with hundreds or a thousand cores. They will be running many applications
that will compete for resources. Effective QoS requires novel hardware support.
Another important challenge is the emergence of a new breed of dynamic languages called
scripting languages. Programmers like these languages because they allow quick prototyping. How-
ever, compilers find these languages hard to manage because there is little static information. Conse-
quently, there is a lot of work that the compilers need to do at runtime, which slows down execution.
This is a great opportunity for hardware support.
Hardware support for synchronization and fences needs to be redesigned. Existing hardware is
not designed for the large manycores that we expect to see in the next few years: it is too costly
and does not scale.
Finally, the emergence of security and privacy concerns presents a major challenge to hardware
designers. It is a totally open problem to identify the most cost-effective hardware primitives that
we need to add to uniprocessors and multiprocessors to ensure security and privacy.
[YS] Josep, I appreciate your time and insights in this interview. Thank you.