Index

Note: Page numbers followed by f indicate figures and t indicate tables.

A

Acknowledgment messages (ACKs) 20
Adaptive communication mechanism (ADCM) 323
adaptive algorithm implementation 337–340, 338f
baseline communication architecture 333, 334f
vs. ideal protocol 342
packet format 340–341, 341f
RBC table 337
receiver operations 333
RPU 337
sender operations 332
Adaptive routing 10, 11f
AllowableEVCs algorithm 202, 203f
Average hop count (AHP) 146f, 147

B

Balanced, adaptive multicast (BAM) routing 20
DOR 269
Duato's theory 268
evaluation 271
multicast routing algorithm 268
network partition 267f, 268
obligatory output ports 268
output port calculation logic 268–269, 269f
RPM types 267–268
Bubble flow control (BFC) 21
Bufferless flow control 20–21
Bus mode signal 119

C

Cache-coherent collective communications 20
BAM + Com 276
baseline configuration and variations 271, 272t
message combination mechanism 284
multicast destinations 281f, 282
multicast performance 278, 278f
multicast ratios 281, 281f
multicast-reduction transaction latency 260–261, 261f, 274–275, 275f
network sizes 281f, 282
NoC multicast routing algorithm 284–285
overall network performance 273, 273f
PARSEC benchmarks 260, 261f
power analysis 282–284, 283f
router microarchitecture 270, 271f
router pipeline 269–270, 270f
RPM + NonCom, PARSEC benchmarks 275–276, 276f
system configuration 272, 272t
unicast traffic performance 276–277, 277f
VC counts 280–281, 281f
virtual multicast tree 284–285
Channel dispensers 62–64, 63f
CMesh topology 
configuration 163–164, 164f
routing algorithm performance analysis 163–166, 165f
Collective communication 294
barrier operations 296–297, 298–299, 302–303, 302f, 310–311, 311f
broadcast operations 296–297, 298–299, 301–302, 302f, 308–310, 309f, 310f
reduce operation 296–297, 298–299, 300t, 302f, 303, 311–312, 312f
Communication-centric cross-layer optimization method 
interconnection layer 5–6
logic implementation layer 6
parallel programming paradigm layer 5
Communication customization architectures 
control messages 295
8 × 8 mesh topology 295, 295f
message buffer management 296
MU architecture (see MPI unit (MU)) 
VBON 296
Communication protocols 
buffered mode 304f, 305–306
packet format 305, 305f
synchronous mode 304f, 305–306
Content-addressable memory (CAM) 264
Critical bubble scheme (CBS) 21, 221–222

D

Dark silicon 6
Deadlock avoidance theory 179, 358
Deadlock-free flow control 
ALG and ALG + WPF 209–211, 210f
conventional design 218
DAMQ 209
dateline 220–221, 221f
Duato's theory 187
LBS 221, 221f
PSF 
algorithms 181, 183f
router microarchitecture 227–229, 228f
SFP ratio 219–220, 219f
tori rings (see 1D tori ring; see also 2D tori rings)
variable-size packets 222, 222f
VC reallocation 180
Destination-based adaptive routing (DBAR) 19
Destination-based selection strategy (DBSS) 359
application results 158–161, 160f
CMesh evaluation (see CMesh topology) 
congestion propagation network 149–150, 149f, 169–170
dimension preselection 151–152, 151f
hardware 163–168, 167f
in-depth analysis 162f, 166–169
irregular-region results 161–163, 162f
offered path diversity 151–154, 154f
propagation wires 168–169
router architecture 152f
scalability 169
selection metric computation 151, 151f, 152f
small regular region 147f, 161–162
synthetic traffic results 156–161, 159f, 160f
VC reallocation 153–155, 155f
Deterministic routing algorithm 10, 11f
Dimensional order routing (DOR) 10, 11f, 269
Dimension preselection (DP) 151–152, 151f
Duato's theory 18, 19, 358, 359
routing flexibility 180–181, 188
VC reallocation 180
Dynamically allocated multiqueue (DAMQ) buffer 22–23, 209
Dynamic virtual channel (DVC) routers 
congestion awareness scheme 81–82, 82f
congestion metric aggregation module 88, 89f
flow control module 89, 90f
head-of-line (HoL) blocking 79
microarchitecture design 85–86, 86f
router evaluation 96–98, 97f, 98f
structure 80–81, 80f
unified buffer 83, 84f
VC allocation module 90–91
VC control module 86–88, 87f
Dynamic voltage and frequency scaling (DVFS) 32

E

Element interconnect bus (EIB) 27, 27f
Escape virtual channels (EVCs) 
connected routing function 187
deadlock-free algorithm 187
direct and indirect dependency 187
Duato's theory 187
EVC2/AVC2 185–186
extended VC dependency graph 187
fairness issue 185–186, 186f
network path 187
routing subfunction 187

F

Fast arbiter components 64–65, 65f
Flit bubble flow control (FBFC) 22
critical design 225–226, 225f, 227f, 229–230
localized scheme 224–225, 225f, 226f, 229
wormhole flow control 223–224, 223f, 227–228, 228f
Flow control 
flit reservation 21
fully adaptive routing (see Fully adaptive routing algorithms)
SAF 12, 12f, 13t
VCT 12, 12f, 13t
virtual channels techniques 13, 13t
wormhole 12–13, 12f, 13t
Fully adaptive routing algorithms 
Alg and Alg+WPF 209–211
allowable EVCs 202, 203f
DAMQ 209
Duato's theory 179–180
fairness effect (see Latency distribution)
hybrid flow controls 209
off-chip network 176
packet length 208–209
PARSEC workloads 176, 177f, 199–201
performance analysis 203–204
physical/virtual networks 176, 177t
router microarchitecture (see Router microarchitecture)
routing flexibility 180–181, 188
synthetic traffic (see Synthetic traffic patterns)
VC reallocation 180

H

Hierarchical bit-line buffer (HiBB) 
area and power comparison 98–99, 99f
congestion avoidance scheme 84–85, 85f
generic router 100, 100f
hotspot and transpose patterns 101
input channels 91, 91f, 92
memory array 98–99
nonuniform VC configuration 83, 84f
performance and power consumption 100–102, 101f, 102f
run-time regulation scheme 95–96, 95f
VC allocation and output port allocation 92–95, 93f, 94f
VC control module 92, 93f
Hotspot traffic pattern 345–346, 347f, 348f

I

Ideal communication protocol 330–332, 331f
Intel's 80-core Teraflops chip 290

K

Knights Corner processor 34–36, 35f, 37t, 38t

L

Larrabee processor 33, 34f, 37t, 38t
Latency distribution 
FULLY+WA 206f, 207f, 208f, 209
FULLY+WPF 205f, 207f, 208–209, 208f
Localized bubble scheme (LBS) 221, 221f

M

Many-core processor 
computation/communication-centric method 6–7
design 5, 6f
NoC interconnection layer 7–8, 7f
parallel programming 5
ME. See MPI engine (ME)
Message combination table (MCT) 
ACK packet transmission 265–266, 266f
entry 267
format 117f, 264
logical ACK tree 262–263, 263f
logical multicast tree 262–263, 263f, 266
multicast packet 263
multicast packet transmission 264, 265f
Message passing interface (MPI) 20, 358
applications 307–308
buffered communication 294
buffered protocols 323–324, 324f
Cell Broadband Engine processor 292
communication customization (see Communication customization architectures)
communication design configurations 342–343, 343t
correctness problem 324–325, 325f
definition 290–291
hardware estimation 350–351, 351t
ideal communication protocol 330–332, 331f
MPICH 292
MPI/Pro 292
multicast operations 293
multi-FPGA system 292
NAS Parallel Benchmarks (NPB 2.4) suite 344
network configurations 306t, 307
NoC designs 293–294
optimization 329–330
performance improvement 312–313, 314f
performance problems 328f, 329
point-to-point communication 294, 296–297, 308, 309f
power consumption 313, 315f
real traffic pattern (see Real traffic pattern)
retry problems 325–328, 326f, 327f
rMPI 292
scalability 313–315, 315f
sensitivity analysis 350, 351f, 352f
SoC-MPI library 290–291, 292
STAR-MPI and HP-MPI 291–292
STORM system 292
synchronous communication 294
synchronous protocol 323–324, 324f
synthetic traffic patterns 307 (see also Synthetic traffic patterns)
SystemC-based cycle-level NoC simulator 307
TMD-MPE 291
Microarchitecture designs 
channel dispensers 62–64, 63f
fast arbiter components 64–65, 65f
SIG manager and controller 66, 67f
virtual channel arbitration components 64–65, 65f
Minimal routing 10, 10f
MIT Raw processor 23–24, 25f, 37t, 38t, 290–291
MPI engine (ME) 333–335, 335f
MPI processing unit (MPU) 291, 298, 300–301, 301f, 334–335, 336f
MPI unit (MU) 
collective communication (see Collective communication)
communication protocol 304f, 305–306, 305f
implementation 311t, 315–316
local processor core and NI 297, 297f
parameter registers 294t, 297–298
PPU 297
primitive functions 298, 299t
send and receive operations 298–299

N

Neighbors-on-path (NoP) design 19
Network configuration 127–128, 129t
Nondeterministic routing 10
Nonminimal routing 9–10, 10f

O

Oblivious routing algorithm 10
On-chip interconnection networks 
packet-based NoC 111, 111f
transaction-based bus 110–111, 111f, 113–114
On-chip network (OCN) 29–30
1D tori ring 
buffer utilization 232–233, 232f
performance analysis 230–232, 231f
short and long packets 233, 234f
Operand network (OPN) 28–29
O1 Turn routing 10, 11f

P

Parallel programming paradigms 5, 40–41
Parameter registers (PR) 294t, 297–298, 334–335
PARSEC workloads 
methodology and configuration 199, 201t
performance 199–201, 201f
PSF algorithms 201, 201f
SFP ratio 176, 177f
Performance metrics 16–17, 17f
Point-to-point communication 294, 308, 309f
Port-selection-first (PSF) algorithms 
network size 199
PARSEC workloads 201, 201f
routing flexibility 180–181, 183f, 188
SFP ratio 196
synthetic workloads 192
VC count 199
VC depth 196–197
Power processing element (PPE) 27
Preprocessing unit (PPU) 297, 334–335
Prevention flow control 21

R

Raw processor 23–24, 25f, 37t, 38t
Real traffic pattern 
ADCM prediction accuracy 348–349, 350f
execution time 347, 349f
message latency 347, 349f
network traffic 346, 348f
Receiving buffer credit (RBC) table 337
Receiving policy unit (RPU) 337
Recursive partitioning multicast (RPM) 20, 262
Redundant bus method 114–115
Regional congestion awareness (RCA) 
congestion propagation network 149–150
in-depth analysis 166–169
intraregion interference 145–147, 146f
irregular-region results 161–163, 162f
performance analysis, CMesh 163–166, 165f
small regular region 161–162
synthetic traffic results 156–161
Ring topology 8, 9f
Round-trip traffic pattern 344–345, 345f, 346f
Router architecture 
aggressive pipeline 57–58, 58f
channel dispenser unit 57
low-latency 56–57, 57f
packet transfers 58, 59–60, 59f
speculative router 56
Router microarchitecture 
cache-coherent collective communications 270, 271f
critical path delay and area 190, 190t
crossbar 15
DVC 85–91, 86f, 87f, 89f, 90f
FBFC routers 227–228, 228f
four-stage pipeline 15, 15f
HiBB 91–96, 91f, 93f, 94f, 95f
input units 13
MUX1 and DEMUX1 189, 190f
output unit 15
permissible port 189
research 22–23
routing computation 14
structure, virtual channel 13, 14f
switch allocation and traversal 188–189
three-stage pipeline 15–16, 15f
two-stage pipeline 15f, 16
VC allocation 189, 190f
VCT routers 228–229, 228f
virtual channel allocator 14
Routing algorithms 358
adaptive routing algorithm 144, 144f
average hop count 146f, 147
deadlock freedom 10–11
deterministic routing 10, 11f
inter-region interference 147–148, 148f
LOCAL 145
minimal routing 10, 10f
nondeterministic routing 10
nonminimal routing 9–10, 10f
off-chip networks 144
routing function evaluation 154–158, 157f
selection strategy 142–143
synthetic traffic patterns 156
system configuration 156, 156t
Routing function design 
offered path diversity 151–154, 154f
VC reallocation 153–155, 155f

S

Sending policy unit (SPU) 335–336
Shared memory programming 358
Single-chip Cloud Computer (SCC) processor 31–33, 32f, 37t, 38t, 292–293
Single-cycle router, wing channels 357–358
adaptive routing 71–73, 72f
area and power consumption 73f, 74
deterministic routing 69–71, 70f
fast arbitration 61–62, 62f
low-latency router 55
microarchitecture designs (see Microarchitecture designs)
packet transfer 60–61, 61f
pipeline delay analysis 67–68, 68f
principles 60
router architecture (see Router architecture)
simulation infrastructures 66
zero-load latency 68f, 69
Single-flit packet (SFP) ratio 219–220, 219f
Sony/Toshiba/IBM cell processor 27–28, 27f
Store-and-forward (SAF) flow control 12, 12f, 13t
Synchronization signals 359–360
Synthetic traffic patterns 
average throughput gains 193, 193t
baseline configuration and variations 195t, 196
bit reverse 191f, 192
buffer utilization 193–194
evaluation 191
FULLY and PSF 190–191
hotspot traffic pattern 191f, 192, 345–346, 347f, 348f
MPI 307
network size 199, 200f
round-trip traffic pattern 344–345, 345f, 346f
SFP ratios 195f, 196
single-level 4 × 4 mesh 129–132, 130f
transpose-1 191f, 192
transpose-2 191f, 192
VC count 196–198, 198f
VC depth 196, 197f
8 × 8 mesh structure 131f, 132, 133f
Systems-on-chip (SoC) 290–291, 292

T

Teraflops processor chip 30–31, 31f, 37t, 38t
TILE64 processor 24–26, 26f, 37t, 38t, 290
Topology design 17–18
Torus network 9, 9f
Traffic generation 128
Transaction-based bus communication 
advantages 113–114
baseline interconnect structure 110–111, 111f
Tree-based multicast routing 20
TRIPS processor 28–30, 29f, 37t, 38t
2D mesh topology 8, 9f
2D tori rings 
area result 247, 247f
buffer sizes 236–238, 237f
CBS and FBFC 240
large-scale systems 241–242, 242f
vs. meshes 247–250, 248f, 249f
message passing 241–242, 242f
methodology 242–243
PARSEC workloads 240, 240f
PDP results 245–246, 245f
power consumption 243, 244f, 246f
SFP ratios 235–236, 236f
4 × 4 torus 233–235, 235f
8 × 8 torus 238, 238f

U

Unicast routing designs 18–19

V

Valiant routing algorithm 9–10
Virtual bus on-chip network (VBON) 
arbitration algorithm 117–119, 118f, 119f
barrier operations 310–311, 311f
basic router 126–127, 128t
broadcast operations 308–310, 309f
EVC approach 109–110
interconnect structures 114–116, 115f, 116f
multicast communications 110
multicast latency 113
network configuration 127–128, 129t
NoC design 296
NOCHI EVC router 127, 128t
on-chip interconnection networks (see On-chip interconnection networks)
operation 121–123, 121f, 122f
overhead analysis 135, 136f
packet-based mesh NoC 111, 111f
packet format 120–121, 120f
performance improvement 312–313, 314f
performance results 132, 134f
power consumption 134f, 135, 313–315
router microarchitecture 124–126, 125f
row/column VB implementation 116–117, 117f
starvation and deadlock avoidance 124
synthetic traffic evaluations 129–132, 130f, 131f, 133f
traffic generation 128
unicast message latency 112–113
VB communication 123, 123f
Virtual channel arbitration components 64–65, 65f
Virtual cut-through (VCT) flow control 11, 12f, 13t

W

Whole packet forwarding (WPF) 19, 182–185, 358
Wire delay 115–116, 116f
Wormhole flow control 11, 12f, 13t