Chapter 20

3-D Circuit Architectures

Abstract

Exploiting the advantages of 3-D integration requires the development of novel circuit architectures. A 3-D version of a microprocessor memory system is a primary example of the architectures discussed in this chapter. Major improvements in throughput, power consumption, and cache miss rate are demonstrated. Communication centric architectures, such as networks-on-chip, are also discussed. On-chip networks are an important design paradigm for alleviating the interconnect bottleneck, where information is communicated among circuits in packets in an Internet-like fashion. The synergy between these two design paradigms, networks-on-chip and 3-D integration, which significantly improves performance while decreasing the power consumed in communications limited systems, is described.

Keywords

3-D microprocessor; 3-D cache memory; 3-D networks-on-chip; 3-D NoC power models; 3-D NoC performance models

Technological, physical, and thermal design methodologies have been presented in the previous chapters. The architectural implications of adding the third dimension in the integrated circuit (IC) design process are discussed in this chapter. Various wire limited integrated systems are explored where the third dimension can mitigate many interconnect issues. Primary examples of this circuit category are the microprocessor memory system, on-chip networks, and field programmable gate arrays (FPGAs). Although networks-on-chip (NoC) and FPGAs are generic communication fabrics as compared to microprocessors, the effects of the third dimension on the performance of a microprocessor are discussed here due to the importance of this circuit type.

The performance enhancements that originate from the 3-D implementation of wire dominated circuits are presented in the following sections. These results are based on analytic models and academic design tools under development with exploratory capabilities. Existing performance limitations of these circuits are summarized in the following section, where a categorization of these circuits is also provided. 3-D architectural choices and corresponding tradeoffs for microprocessors, memories, and microprocessor memory systems are discussed in Section 20.2. 3-D topologies for on-chip networks are presented and evaluated in Section 20.3. Both analytic expressions and simulation tools are utilized to explore these topologies. Finally, the extension to the third dimension of an important design solution, namely FPGAs, is analyzed in Section 20.4. A brief summary of the analysis of these 3-D architectures is offered in Section 20.5.

20.1 Classification of Wire Limited 3-D Circuits

Any 2-D IC can be vertically fabricated with one or more processes developed for 3-D circuits. The benefits, however, vary for different circuits [291]. Wire dominated circuits are good candidates for vertical integration since these circuits greatly benefit from the significant decrease in wirelength. Performance projections as previously discussed in Chapter 7, Interconnect Prediction Models, highlight this situation. Consequently, only communication centric circuits are emphasized in this discussion. The different circuit categories considered in this chapter are illustrated in Fig. 20.1.

Figure 20.1 Taxonomy of 3-D architectures for wire limited circuits.

The first category includes application specific ICs (ASICs) that are typically part of a larger computing system. A criterion for a circuit to belong to this category is whether the third dimension can considerably improve the primary performance characteristics of the circuit, such as speed, power, and area. A fast Fourier transform circuit is an example of a 3-D ASIC [705]. Memory arrays and microprocessors are other circuit examples that belong to this category. 3-D integration is also amenable to the interconnect structures that connect these components to form more complex integrated systems.

Communication fabrics, such as NoC and FPGAs, are particularly appropriate for vertical integration. For instance, several low latency and high throughput multi-dimensional topologies have been presented in the past for traditional interconnection networks, such as 3-D meshes and tori [706]. These topologies, depicted in Fig. 20.2, are not usually considered for on-chip networks due to the long interconnects that hamper the overall performance of the network. These limitations are naturally circumvented when a third degree of design freedom is added.

Figure 20.2 Popular interconnection network topologies, (A) 3-D mesh, and (B) 2-D torus.

In the case of FPGAs, a design style for an increasing number of applications, vertical integration offers a twofold opportunity: increased interconnectivity among the logic blocks (LBs) as well as shorter distances among these blocks. These advantages enhance the speed of FPGAs, which is the major impediment of this design style. Consequently, 3-D architectures are indispensable for contemporary FPGAs. Each of these circuit categories is successively reviewed in the remainder of the chapter. Related design tools are also discussed and further issues in the design of these circuits are highlighted.

20.2 3-D Microprocessors and Memories

Microprocessor and memory circuits constitute a fundamental component of every computing system. Due to the use of these circuits in myriads of applications, the effects of 3-D integration on this system are of significant interest. Both of these types of circuits are amenable to a variety of 3-D architectural alternatives. The partitioning scheme used on standard 2-D circuits drastically affects the characteristics of the resulting 3-D architectures. The different partitioning levels and related building elements based on these partitions are illustrated, respectively, in Figs. 20.3 and 20.4. A finer partitioning level typically requires a larger design effort and higher vertical interconnect densities. Consequently, each partitioning level is only compatible with a specific 3-D technology. Note that the intention here is not to determine an effective partitioning methodology specific to 3-D microprocessors and memories but rather to determine the architectural granularity that improves the performance of these circuits. The physical design techniques described in Chapter 9, Physical Design Techniques for Three-Dimensional ICs, can be used to achieve the optimum design objectives for a particular architecture.

Figure 20.3 Different partitioning levels and related design complexity vs the architectural granularity for 3-D microprocessors.
Figure 20.4 An example of different partitioning levels for a 3-D microprocessor system at the (A) core, (B) functional unit block (FUB), (C) macrocell, and (D) transistor levels.

Different architectures for several blocks not including the on-chip cache memory of a microprocessor are discussed in Section 20.2.1. The 3-D organization of the cache memory is described in Section 20.2.2. Finally, the 3-D integration of a combined microprocessor and memory is discussed in Section 20.2.3.

20.2.1 3-D Microprocessor Logic Blocks

The microprocessor circuit analyzed herein consists of one logic core and an on-chip cache memory. The architectures can be extended to multicore microprocessor systems. For a microprocessor circuit, partitioning the functional blocks or macrocells (i.e., circuits within the functional blocks) onto several tiers (see Fig. 20.4) is meaningful and can improve the performance of the individual functional blocks and the microprocessor. For a microprocessor system, partitioning at the core level is also possible. The resulting effects on a microprocessor are higher speed, more instructions per cycle (IPC), and a decrease in the number of pipeline stages.

In general, the 3-D implementation of a wire limited functional block (i.e., partitioning at the macrocell level) decreases the dissipated power and delay of the blocks. The benefits, however, are minimal for blocks that do not include relatively long wires. In this case, some performance improvement exists due to the decrease in the area of the block and, consequently, in the length of the wires traversing the block. In addition to the cache memory, many other blocks within the processor core can be usefully placed on more than one physical tier. These blocks include, for example, the instruction scheduler, arithmetic circuits, content addressable memories, and register files.

The instruction scheduler is a critical component that constrains the maximum clock frequency and dissipates considerable power [707]. Placing this block on two tiers lowers the delay and power by, respectively, 44% and 16% [708]. Additional savings in both delay and power are achieved by using three or four tiers; however, the savings saturate rapidly. This situation is more pronounced in the case of arithmetic units, such as adders and logarithmic shifters. Delay and power improvements for Brent Kung [709] and Kogge Stone [710] adders are listed in Table 20.1. As indicated by these results, the benefits of utilizing more than two tiers are negligible. Further reductions in delay are not possible for more than two tiers since the delay of the logic dominates the delay of the interconnect.

Table 20.1

Performance and Power Improvements of 3-D over 2-D Architectures [708]

                  Kogge Stone Adder                 Brent Kung Adder
# of Input Bits   16 bits              32 bits      32 bits
                  Delay     Power      Delay        Delay
Two tiers         20.2%     8%         9.6%         13.3%
Three tiers       23.6%     15%        20.0%        18.1%
Four tiers        32.7%     22%        20.0%        21.7%


Consequently, partitioning at the macrocell level is not necessarily helpful for every microprocessor component and should be carefully applied to ensure that only the wire dominated blocks are designed as 3-D blocks. Alternatively, another architectural approach does not split but simply stacks the functional blocks on adjacent physical tiers, decreasing the length of the wires shared by these blocks. As an example of this approach, consider the two tier 3-D design of the Intel Pentium 4 processor where 25% of the pipeline stages in the 2-D architecture are eliminated, improving performance by almost 15% [711]. The power consumption is also decreased by almost 15%. A similar architectural approach for the Alpha 21364 processor [712] resulted in a 7.3% and 10.3% increase in IPC for, respectively, two and four tiers [713].

A potential hurdle in the performance improvement is the introduction of new hotspots or an increase in the peak temperature as compared to a 2-D version of the microprocessor. Thermal analysis of the 3-D Intel Pentium 4 has shown that the maximum temperature is only 2 °C greater than that of the 2-D counterpart, reaching a temperature of 101 °C [713]. If maintaining the same thermal profile is a primary objective, the power supply can be scaled, partially limiting the improved performance. For this specific example, voltage scaling offset any temperature increase while providing an 8% performance improvement and a 34% decrease in power (for two tiers) [711]. In these case studies, a 2-D architecture has been redesigned into a 3-D microprocessor architecture. Greater enhancements can be realized if the microprocessor is designed from scratch, initially targeting a 3-D technology, as compared to simply migrating a 2-D circuit into a multi-tier stack. A considerable portion of the overall performance improvement in a microprocessor can be achieved by distributing the cache memory onto several physical tiers, as described in the following section.

20.2.2 3-D Design of Cache Memories

Data exchange between the processor logic core and the memory has traditionally been a fundamental performance bottleneck. Therefore, a small amount of memory, specifically cache memory, is integrated with the logic circuitry offering very fast data transfer, while the majority of the main memory is placed off-chip. The size and organization of the cache memory greatly depend upon the architecture of the microprocessor and has steadily increased over the past several microprocessor generations [714,715]. Due to the small size of the cache memory, a common problem is that data are often fetched from the main memory. This situation, widely known as a cache miss, is a high latency task. Increasing the cache memory size can partly lower the cache miss rate. 3-D integration supports both larger and faster cache memories. The former characteristic is achieved by adding more memory on the upper tiers of a 3-D stack, and the latter objective can be enhanced by constructing novel cache architectures with shorter interconnects.

A schematic view of a 2-D 32 KB cache is illustrated in Fig. 20.5, where only the data array is depicted. The memory is arranged into smaller arrays (subarrays) to decrease both the access time and power dissipation. Each memory subarray i is denoted as Block i. The size of the subarrays is determined by two parameters, Ndwl and Ndbl, which correspond, respectively, to the divisions of the initial number of word and bit lines. In this example, each subarray contains 128×256 bits. In addition to the SRAM arrays, other circuits are shown in Fig. 20.5. The local word line decoders and drivers are placed on the left side of each subarray. The multiplexers and sense amplifiers are located on the bottom and top side of each row. The word line predecoder is placed at the center of the entire memory.

Figure 20.5 2-D organization of a cache memory with additional circuitry [716].
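
To make the array division concrete, the short Python sketch below computes the subarray dimensions implied by Ndwl and Ndbl; the undivided array dimensions and the particular division values are assumptions chosen only to be consistent with the 128×256 bit subarrays described above.

```python
# Minimal sketch of the data array partitioning described above.
# The undivided array dimensions and the Ndwl/Ndbl values are
# assumptions, chosen only to reproduce the 128 x 256 bit subarrays.

TOTAL_ROWS = 256     # assumed bit line length of the undivided array
TOTAL_COLS = 1024    # assumed word line length of the undivided array
Ndwl, Ndbl = 4, 2    # divisions of the word lines and bit lines

rows = TOTAL_ROWS // Ndbl    # 128 cells per bit line segment
cols = TOTAL_COLS // Ndwl    # 256 cells per word line segment
subarrays = Ndwl * Ndbl      # 8 subarrays (Blocks) in the data array

assert (rows, cols) == (128, 256)
assert subarrays * rows * cols == 32 * 1024 * 8   # 32 KB of data bits
```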

Several ways exist to partition this structure into multiple tiers. An example of a 2-D and 3-D organization of a 32 Kb cache memory is schematically illustrated in Fig. 20.6. The memory can be stacked at the functional block, macrocell, and transistor levels, where a functional block is considered in this case to be equivalent to a memory subarray. Halving the memory and placing each half on a separate tier decreases the length of the global wires, such as the address input to the word line predecoder nets, the data output lines, and the wires used for synchronization. Partitioning at the functional block level does not, however, improve the delay and power of the subarrays, which still contribute to the total access time and power dissipation.

Figure 20.6 2-D and 3-D organization of a 32 Kb cache memory array. Nspd is the number of sets connected to a word line [713].

Dividing each subarray can therefore result in lower access time and power consumption. Partitioning occurs along the x and y directions, halving, respectively, the length of the bit and word lines at each division. An example of a subarray partition is depicted in Fig. 20.7. The number of partitions along each direction is characterized by the parameters Nx and Ny. For word line partitions (Nx > 1), the word lines are replicated on the upper tiers, as shown in Fig. 20.7. The length of the word lines, however, decreases, resulting in smaller device sizes within the drivers and local decoders. Furthermore, the area of the overall array decreases, leading to shorter global wires, such as the input address from the predecoder to the local decoders and data output lines.

Figure 20.7 Word line partitioning onto two tiers of the 2-D cache memory shown in Fig. 20.5 [713].

In the case of bit line partitioning (Ny > 1), the length of these lines and the number of pass transistors tied to each bit line are reduced, as illustrated in Fig. 20.8. The sense amplifiers can either be replicated on the upper tiers or shared among the bit lines from more than one tier. In the former case, the leakage current increases but the access time improves, while in the latter case, the power savings is greater but the speed enhancement is not significant due to the requirement for bit multiplexing.

Figure 20.8 Bit line partitioning onto two tiers of the 2-D cache memory shown in Fig. 20.5 [713].

In general, word line partitioning results in a smaller delay but not necessarily a larger power savings. More specifically, for high performance caches, partitioning the word lines offers greater savings in both delay and energy. For instance, a 1 MB cache where Nx=4 and Ny=1 is faster by 16.3% as compared to a partition where Nx=1 and Ny=4 [716]. Alternatively, bit line partitioning is more efficient for low power memories. When the memory is designed for low power, bit line partitioning decreases the power by approximately 14% as compared to the reduction achieved by word line partitioning [716]. This behavior can be explained by considering the original 2-D version of the cache memory. High performance memories favor wide arrays, which implies longer word lines, while low power memories exhibit a greater height resulting in longer bit lines.

Finally, partitioning is also possible at the transistor level where the basic six transistor SRAM cell can be split among the tiers of a 3-D stack. This extra fine granularity, however, has a negative effect on the total area of the memory, since the size of a TSV is typically larger than the area of an SRAM cell [717]. Consequently, partitioning at the macrocell level offers the greatest advantages for 3-D cache memory architectures when TSVs are utilized. However, monolithic 3-D circuits can support the effective integration of memories at the SRAM cell level [115].

Analyzing the performance of these architectures is a multivariable task and design aids to support this analysis are needed. PRACTICS [718] and 3-D CACTI [716] offer exploratory capabilities for cache memories. The cache cycle time models in both of these tools are based on CACTI, an exploratory tool for 2-D memories [719]. These tools utilize delay [252], energy [720], and thermal models [721] as well as extracted delay and power profiles from SPICE simulations to characterize a cache memory. Several parameters are treated as variables. The optimum value of these parameters is determined based on the primary design objective of the architecture, such as high speed or low power. Although these tools are not highly accurate, early architectural decisions can be explored. Choosing an appropriate 3-D architecture for the cache memory further increases the performance of the microprocessor in addition to the performance improvements offered by the 3-D design of specific blocks within the processor. In this case, the primary limitation of the system is the off-chip main memory discussed in the following section.

20.2.3 Architecting a 3-D Microprocessor Memory System

Although partitioning the cache memory on multiple tiers enhances the performance of the microprocessor, data transfer to and from the main memory remains a significant hindrance. The ultimate 3-D solution is to stack the main memory on the upper tiers of a 3-D microprocessor system. This option may be feasible for low performance processors with low memory requirements [722]. For modern computing systems with considerable memory demands, increasing the size of the on-chip cache, mainly the second level (L2) cache, is an efficient approach to improve performance [723]. Two systems based on two different microprocessors are considered; one approach contains a reduced instruction set computing (RISC) processor [724,725] while the other approach includes an Intel Core 2 Duo processor [726].

To evaluate the effectiveness of the RISC system, the average time per instruction is a useful metric. For this system, the main memory is within the 3-D stack. This practice offers a higher buss bandwidth between the main memory and the L2 cache which, in turn, decreases the time required to access the main memory. The reduction in the average time per instruction, however, is small, about 6.1% [722]. In addition, stacking many memory tiers within a 3-D stack can be technologically and thermally challenging.

A more practical approach is to increase the size of the L2 cache by utilizing a small number of either SRAM or DRAM tiers on top of the processor. An Intel Core 2 Duo system is shown in Fig. 20.9, where various configurations of the cache memory are illustrated [711]. Note that in some of these configurations, both the level one (L1) and L2 caches are included on the same tier. Increasing the size of the L2 cache increases the time to access this memory. The advantage of the reduced cache miss rate due to more data and instructions being available on-chip, however, considerably outweighs the increase in access time. Each of the architectures illustrated in Fig. 20.9 decreases the number of cycles per memory access for a large number of benchmarks [711]. The only exceptions are those benchmarks where the 4 MB memory in the baseline system is sufficient. Additionally, the power consumed by the microprocessor memory system decreases since a smaller number of transactions takes place over the off-chip high capacitance buss. Furthermore, the required bandwidth of the off-chip buss drops by as much as a factor of three as compared to a 2-D system [711].

Figure 20.9 Different organizations of a microprocessor system, (A) 2-D baseline system, (B) a second tier with 8 MB SRAM cache memory, (C) a second tier with 32 MB SRAM cache memory, and (D) a second tier with 64 MB DRAM cache memory [711].

The presence of a second memory tier naturally increases the on-chip power consumption as compared to a 2-D microprocessor. The estimated power consumed by each configuration, shown in Fig. 20.9, is listed in Table 20.2. Despite the increase in power dissipation, the highest increase in temperature as compared to the baseline system does not exceed 5 °C with a maximum (and manageable) temperature of 92.9 °C [711,727].

Table 20.2

Power Dissipation of the 3-D Microprocessor Architectures [711]

                    Power Consumption [W]
Architecture        Tier 1    Tier 2    Total
Fig. 20.9A (2-D)    92        —         92
Fig. 20.9B          92        14        106
Fig. 20.9C          85        3.1       88.1
Fig. 20.9D          92        6.2       98.2


Larger on-chip memories can decrease cache miss rates for single or double core microprocessors [728]. For large scale systems with tens of cores, however, the memory access time and the related buss bandwidth can be a performance bottleneck. An on-chip network is an effective way to overcome these issues. 3-D architectures for NoC are therefore the subject of the following section.

20.3 3-D Networks-on-Chip

NoC are a design paradigm developed to enhance communication within complex integrated systems. These networks have an interconnect structure that provides Internet-like communication among the various elements of the network; however, on-chip networks differ from traditional interconnection networks in that communication among the network elements is achieved through the on-chip routing layers rather than the metal tracks of the package or printed circuit board (PCB).

NoC offer high flexibility and regularity, supporting simpler interconnect models and greater fault tolerance. The canonical interconnect backbone of the network combined with appropriate communication protocols enhances the flexibility of these systems [37]. NoC provide communication among a variety of functional intellectual property (IP) blocks or processing elements (PEs), such as processor and digital signal processing (DSP) cores, memory blocks, FPGA blocks, and dedicated hardware, serving a plethora of applications that include image processing, personal devices, and mobile handsets [729–731] (the terms IP block and PE are used interchangeably in this chapter to describe functional structures connected by an NoC). The intra-PE delay, however, cannot be reduced by the network. Furthermore, the length of the communication channel is primarily determined by the area of the PE, which is typically unaffected by the network structure. By merging vertical integration with NoC, many of the individual limitations of 3-D ICs and NoC are circumvented, yielding a robust design paradigm with unprecedented capabilities.

Research in 3-D NoC has progressed considerably in the past few years where several 3-D topologies have been explored [732], and design methods and synthesis tools [733] have been published [503,734,735]. Several methodologies to manage vertical links with TSVs have also been reported [736].

Addo-Quaye [503] presented an algorithm for the thermal aware mapping and placement of 3-D NoC including regular mesh topologies. Li et al. [735] proposed a similar 3-D NoC topology employing a buss structure for communicating among PEs located on different physical tiers. Targeting multiprocessor systems, the proposed scheme in [735] considerably reduces cache latencies by utilizing the third dimension. Multidimensional interconnection networks have been studied under various constraints, such as constant bisection-width and pin-out constraints [706]. NoC differ from generic interconnection networks, however, in that NoC are not limited by the channel width or pin-out. Alternatively, physical constraints specific to 3-D NoC, such as the number of nodes that can be placed in the third dimension and the asymmetry in the length of the channels of the network, have to be considered.

In this chapter, various possible topologies for 3-D NoC are presented. Additionally, analytic models for the zero-load latency and power consumption with delay constraints of these networks that capture the effects of the topology on the performance of 3-D NoC are described [737]. Although these models are applied to mesh topologies, the interconnect models remain valid for other topologies as long as the parameters, such as the number of hops for a topology, are properly adapted to the features of the target topology. Optimum topologies are shown to exist that minimize the zero-load latency and power consumption of a network. These optimum topologies depend upon a number of parameters characterizing both the router and the communication channel, such as the number of ports of the network, the length of the communication channel, and the impedance characteristics of the interconnect. Several tradeoffs among these parameters that determine the minimum latency and power consumption topology of a network are described for different network sizes. A cycle accurate simulator for 3-D topologies is also discussed. This tool is used to evaluate the behavior of several 3-D topologies under broad traffic scenarios.

Several interesting topologies, which are the topic of this chapter, emerge by incorporating the third dimension in NoC. In the following section, several topological choices for 3-D NoC are reviewed. In Section 20.3.2, an analytic model of the zero-load latency of traditional interconnection networks is adapted for each of the proposed 3-D NoC topologies, while the power consumption model of these network topologies is described in Section 20.3.3. In Section 20.3.4, the 3-D NoC topologies are compared in terms of the zero-load network latency and power consumption with delay constraints, and guidelines for the optimum design of speed driven or power driven NoC structures are provided. An advanced NoC simulator, which is used to evaluate the performance of a broad variety of 3-D network topologies, is presented in Section 20.3.5.

20.3.1 3-D NoC Topologies

Several topologies for 3-D networks are presented and related terminology is introduced in this section. Mesh structures have been a popular network topology for conventional 2-D NoC [738,739]. A fundamental element of a mesh network is illustrated in Fig. 20.10A, where each PE is connected to the network through a router. A PE can be integrated either on a single physical tier (2-D IC) or on several physical tiers (3-D IC). Each router in a 2-D NoC is connected to a neighboring router in one of four directions. Consequently, each router has five ports. Alternatively, in a 3-D NoC, the router typically connects to two additional neighboring routers located on the adjacent physical tiers. The architecture of the router is considered here to be a canonical router with input and output buffering [740]. The combination of a PE and router is called a network node. For a 2-D mesh network, the total number of nodes N is N=n1×n2, where ni is the number of nodes included in the ith physical dimension.

Figure 20.10 Several NoC topologies (not to scale), (A) 2-D IC–2-D NoC, (B) 2-D IC–3-D NoC, (C) 3-D IC–2-D NoC, and (D) 3-D IC–3-D NoC.

Integration in the third dimension introduces a variety of topological choices for NoCs. For a 3-D NoC as shown in Fig. 20.10B, the total number of nodes is N=n1×n2×n3, where n3 is the number of nodes in the third dimension. In this topology, each PE is on a single yet possibly different physical tier (2-D IC–3-D NoC). Alternatively, a PE can be implemented on only one of the n3 physical tiers of the system and, therefore, the 3-D system contains n1×n2 PEs on each one of the n3 physical tiers such that the total number of nodes is N. This topology is discussed in [503] and [735]. A 3-D NoC topology is illustrated in Fig. 20.10C, where the interconnect network is contained within one physical tier (i.e., n3=1), while each PE is integrated on multiple tiers, notated as np (3-D IC–2-D NoC). Finally, a hybrid 3-D NoC based on the two previous topologies is depicted in Fig. 20.10D. In this NoC topology, both the interconnect network and the PEs can span more than one physical tier within the stack (3-D IC–3-D NoC). In the following section, latency expressions for each of the NoC topologies are described, assuming a zero-load model.

20.3.2 Zero-Load Latency for 3-D NoC

In this section, analytic models of the zero-load latency of each of the 3-D NoC topologies are described. The zero-load network latency is widely used as a performance metric in traditional interconnection networks [741]. The zero-load latency of a network is the latency where only one packet traverses the network. Although such a model does not consider contention among packets, the zero-load latency model can be used to describe the effect of a topology on the performance of a network. The zero-load latency of an NoC with wormhole switching is [741]

$$T_{network} = hops \cdot t_r + t_c + \frac{L_p}{b}, \qquad (20.1)$$

where the first term is the routing delay, tc is the propagation delay along the wires of the communication channel, which is also called a buss here for simplicity, and the third term is the serialization delay of the packet. hops is the average number of routers that a packet traverses to reach the destination node, tr is the router delay, Lp is the length of the packet in bits, and b is the bandwidth of the communication channel, defined as b ≡ wc·fc, where wc is the width of the channel in bits and fc is the inverse of the propagation delay of a bit along the longest communication channel.

Since the number of tiers that can be stacked in a 3-D NoC is constrained by the target technology, n3 is also constrained. Furthermore, n1, n2, and n3 are not necessarily equal. The average number of hops in a 3-D NoC is

$$hops = \frac{n_1 n_2 n_3 (n_1 + n_2 + n_3) - n_3 (n_1 + n_2) - n_1 n_2}{3(n_1 n_2 n_3 - 1)}, \qquad (20.2)$$

assuming dimension order routing to ensure that the minimum distance paths are used for the routing of packets between any source destination node pair. The number of hops in (20.2) can be divided into two components, the average number of hops within the two dimensions n1 and n2, and the average number of hops within the third dimension n3,

$$hops_{2\text{-}D} = \frac{n_3 (n_1 + n_2)(n_1 n_2 - 1)}{3(n_1 n_2 n_3 - 1)}, \qquad (20.3)$$

$$hops_{3\text{-}D} = \frac{n_1 n_2 (n_3^2 - 1)}{3(n_1 n_2 n_3 - 1)}. \qquad (20.4)$$
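
These hop count expressions are straightforward to implement and check numerically; the following Python sketch encodes (20.2) through (20.4) and verifies that the intratier and intertier components sum to the total.

```python
# Sketch of the hop count model of (20.2)-(20.4) for an n1 x n2 x n3
# mesh with dimension order routing.

def hops_2d(n1: int, n2: int, n3: int) -> float:
    """Average number of intratier hops, per (20.3)."""
    N = n1 * n2 * n3
    return n3 * (n1 + n2) * (n1 * n2 - 1) / (3 * (N - 1))

def hops_3d(n1: int, n2: int, n3: int) -> float:
    """Average number of intertier hops, per (20.4)."""
    N = n1 * n2 * n3
    return n1 * n2 * (n3 ** 2 - 1) / (3 * (N - 1))

def hops(n1: int, n2: int, n3: int) -> float:
    """Average total number of hops, per (20.2)."""
    N = n1 * n2 * n3
    return (N * (n1 + n2 + n3) - n3 * (n1 + n2) - n1 * n2) / (3 * (N - 1))

# The two components sum to the total, e.g., for a 128 node 8 x 4 x 4 mesh.
assert abs(hops(8, 4, 4) - hops_2d(8, 4, 4) - hops_3d(8, 4, 4)) < 1e-12
```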

The delay of the router tr is the sum of the delay of the arbitration logic ta and the delay of the switch ts, which in this chapter is considered to be a classic crossbar switch [741],

$$t_r = t_a + t_s. \qquad (20.5)$$

The delay of the arbiter can be described by the model in [742],

$$t_a = \left(\frac{21}{4}\log_2 p + \frac{14}{12} + 9\right)\tau, \qquad (20.6)$$

where p is the number of ports of the router and τ is the delay of a minimum sized inverter for the target technology. Note that (20.6) exhibits a logarithmic dependence on the number of router ports. The length of the crossbar switch also depends upon the number of router ports and the width of the buss,

$$l_s = 2(w_t + s_t)\, w_c\, p, \qquad (20.7)$$

where wt and st are, respectively, the width and spacing or, alternatively, the pitch of the interconnect, and wc is the width of the communication channel in bits. Consequently, the worst case delay of the crossbar switch is determined by the longest path within the switch, the length of which is given by (20.7).
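
A short sketch of these router delay terms follows; the coefficients in arbiter_delay() mirror (20.6) as reconstructed here, with τ taken from Table 20.3 and the wire pitch from the crossbar entry of Table 20.4.

```python
# Sketch of the router delay terms of (20.6) and (20.7).

from math import log2

TAU = 17e-12   # delay of a minimum sized inverter [s], Table 20.3

def arbiter_delay(p: int) -> float:
    """Arbitration delay t_a of a p port router, per (20.6)."""
    return (21 / 4 * log2(p) + 14 / 12 + 9) * TAU

def crossbar_length(p: int, wc: int = 64,
                    wt: float = 200e-9, st: float = 200e-9) -> float:
    """Length l_s of the longest crossbar path, per (20.7) [m]."""
    return 2 * (wt + st) * wc * p

# A 3-D mesh router has seven ports; a 2-D mesh router has five.
print(arbiter_delay(7), crossbar_length(7))   # ~4.2e-10 s, ~3.6e-4 m
```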

The delay of the communication channel tc is

$$t_c = t_v \cdot hops_{3\text{-}D} + t_h \cdot hops_{2\text{-}D}, \qquad (20.8)$$

where tv and th are, respectively, the delay of the vertical and horizontal channels (see Fig. 20.10B). Note that if n3=1, (20.8) describes the propagation delay of a 2-D NoC. Substituting (20.8) and (20.5) into (20.1), the overall zero-load network latency for a 3-D NoC is

$$T_{network} = hops\,(t_a + t_s) + hops_{2\text{-}D}\, t_h + hops_{3\text{-}D}\, t_v + \frac{L_p}{w_c}\, t_h. \qquad (20.9)$$
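
The pieces combine into a compact latency estimator; the sketch below implements (20.9) using the hop count functions from the previous sketch, with an illustrative packet length and channel width.

```python
# Sketch of the zero-load latency of (20.9), reusing hops(), hops_2d(),
# and hops_3d() from the previous sketch. All delays are in seconds;
# Lp = 512 bits and wc = 64 bits are illustrative values only.

def zero_load_latency(n1, n2, n3, t_a, t_s, t_h, t_v, Lp=512, wc=64):
    routing = hops(n1, n2, n3) * (t_a + t_s)             # hops * t_r
    channel = (hops_2d(n1, n2, n3) * t_h
               + hops_3d(n1, n2, n3) * t_v)              # t_c of (20.8)
    serialization = (Lp / wc) * t_h                      # Lp / b, b = wc / t_h
    return routing + channel + serialization
```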

To characterize ts, th, and tv, the models described in [743] are adopted, where repeaters implemented as simple inverters are inserted along the interconnect. According to these models, the propagation delay and rise time of a single interconnect stage for a step input, respectively, are

$$t_{di} = 0.377\, \frac{r_i c_i l_i^2}{k_i^2} + 0.693\left(R_{d0} C_0 + \frac{R_{d0} c_i l_i}{h_i k_i} + \frac{r_i l_i C_{g0} h_i}{k_i}\right), \qquad (20.10)$$

$$t_{ri} = 1.1\, \frac{r_i c_i l_i^2}{k_i^2} + 2.75\left(R_{r0} C_0 + \frac{R_{r0} c_i l_i}{h_i k_i} + \frac{r_i l_i C_{g0} h_i}{k_i}\right), \qquad (20.11)$$

where ri (ci) is the per unit length resistance (capacitance) of the interconnect and li is the total length of the interconnect. The index i is used to notate the different interconnect delays included in the network (i.e., i ∈ {s,v,h}). hi and ki denote, respectively, the size and number of the repeaters, and Cg0 and C0 represent, respectively, the gate and total input capacitance of a minimum sized device. C0 is the summation of the gate and drain capacitance of the device. Rd0 and Rr0 describe, respectively, the equivalent output resistance of a minimum sized device for the propagation delay and transition time of a minimum sized inverter, where the output resistance is approximated as

$$R_{r(d)0} = K_{r(d)}\, \frac{V_{dd}}{I_{dn0}}. \qquad (20.12)$$

K denotes a fitting coefficient and Idn0 is the drain current of an NMOS device with both Vds and Vgs equal to Vdd. The values of these device parameters are listed in Table 20.3. A 45 nm technology node is assumed and SPICE simulations of the predictive technology library are used to determine the individual parameters [252,432].

Table 20.3

Interconnect and Design Parameters, 45 nm Technology

Parameter    NMOS          PMOS
Wmin         100 nm        250 nm
Idsat/W      1115 μA/μm    349 μA/μm
Vdsat        478 mV        −731 mV
Vt           257 mV        −192 mV
a            1.04          1.33
Isub0        48.8 nA
Ig0          0.6 nA
Vdd          1.1 V
Temp.        110 °C
Kd           0.98
Kr           0.63
Cg0          512 fF
Cd0          487 fF
τ            17 ps


To include the effect of the input slew rate on the total delay of an interconnect stage, (20.10) and (20.11) are further refined by including an additional coefficient γ as in [744],

$$\gamma_r = \frac{1}{2} - \frac{1 - V_{tn}/V_{dd}}{1 + a_n}. \qquad (20.13)$$

By substituting the subscript n with p, the corresponding value for a falling transition is obtained. The average value γ of γr and γf is used to describe the effect of the transition time on the interconnect delay. The overall interconnect delay can therefore be described as

$$t_i = k_i\,(t_{di} + \gamma\, t_{ri}) = a_1\, \frac{r_i c_i l_i^2}{k_i} + a_2\left(R_0 C_0 k_i + \frac{R_0 c_i l_i}{h_i} + r_i l_i C_{g0} h_i\right), \qquad (20.14)$$

where R0, a1, and a2 are described in [655] and the index i denotes the different interconnect structures, such as the crossbar switch (i ≡ s), horizontal buss (i ≡ h), and vertical buss (i ≡ v).

For minimum delay, the size h and number k of repeaters are determined, respectively, by setting the partial derivative of ti with respect to hi and ki equal to zero and solving for hi and ki,

$$k_i^* = \sqrt{\frac{a_1 r_i c_i l_i^2}{a_2 R_0 C_0}}, \qquad (20.15)$$

$$h_i^* = \sqrt{\frac{R_0 c_i}{r_i C_{g0}}}. \qquad (20.16)$$
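
These two expressions translate directly into code; in the following sketch, a1, a2, and R0 follow [655] and are treated as inputs rather than derived.

```python
# Sketch of the minimum delay repeater insertion of (20.15) and (20.16).

from math import sqrt

def optimal_repeaters(r, c, l, a1, a2, R0, C0, Cg0):
    """Number k* and size h* of repeaters for a line with per unit
    length resistance r, capacitance c, and total length l."""
    k_opt = sqrt(a1 * r * c * l ** 2 / (a2 * R0 * C0))   # (20.15)
    h_opt = sqrt(R0 * c / (r * Cg0))                     # (20.16)
    return max(1.0, k_opt), max(1.0, h_opt)
```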

The expression in (20.14) only considers RC interconnects. An RC model is sufficiently accurate to characterize the delay of a crossbar switch since the length of the longest wire within the crossbar switch and the signal frequencies ensures that inductive behavior is not prominent. For the buss lines, however, inductive behavior can appear. For this case, suitable expressions for the delay and repeater insertion characteristics can be adopted from [655]. For the target operating frequencies (1 to 2 GHz) and buss length (<2 mm) considered in this chapter, an RC interconnect model provides sufficient accuracy [588]. Additionally, for the vertical buss, kv=1 and hv=1, meaning that no repeaters are inserted and minimum sized drivers are utilized. Repeaters are not necessary due to the short length of the vertical buss. Driving a buss with minimum sized inverters can affect the resulting minimum latency and power dissipation topology, as discussed in the following sections. Note that the latency expression includes the effects of the input slew rate. Additionally, since a repeater insertion methodology for minimum latency is applied, any further reduction in latency is due to the network topology.

The length of the vertical communication channel for the 3-D NoC shown in Fig. 20.10 is

$$l_v = \begin{cases} L_v, & \text{for 2-D IC–3-D NoC} \\ n_p L_v, & \text{for 3-D IC–3-D NoC} \\ 0, & \text{for 2-D IC–2-D NoC and 3-D IC–2-D NoC,} \end{cases} \qquad (20.17\text{a–c})$$

where Lv is the length of a through silicon (intertier) via connecting two routers on adjacent physical tiers. np is the number of physical tiers used to integrate each PE. The length of the horizontal communication channel is assumed to be

$$l_h = \begin{cases} \sqrt{A_{PE}}, & \text{for 2-D IC–2-D NoC and 2-D IC–3-D NoC} \\ 1.12\sqrt{A_{PE}/n_p}, & \text{for 3-D IC–2-D NoC and 3-D IC–3-D NoC } (n_p > 1), \end{cases} \qquad (20.18\text{a,b})$$

where APE is the area of the PE. The area of all of the PEs and, consequently, the length of each horizontal channel are assumed to be equal. For those cases where the PE is placed in multiple physical tiers, a coefficient is included to consider the effect of the intertier vias on the reduction in the ideal wirelength due to utilization of the third dimension. The value of this coefficient (= 1.12) is based on the layout of a crossbar switch manufactured in the fully depleted silicon-on-insulator (FD-SOI) 3-D technology from MIT Lincoln Laboratory (MITLL) [307]. The same coefficient is also assumed for the PEs placed on more than one physical tier. In the following section, expressions for the power consumption of a network with delay constraints are presented.
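
For reference, the case analysis of (20.17) and (20.18) can be collected into a single helper, sketched below; the topology tags are ad hoc labels introduced for this sketch, and Lv = 10 μm follows Table 20.4.

```python
# Sketch collecting the channel length cases of (20.17) and (20.18).
# A_PE is in square meters; the returned lengths are in meters.

from math import sqrt

def channel_lengths(topology, A_PE, Lv=10e-6, np_tiers=1):
    """Return (l_v, l_h) for a topology of Fig. 20.10."""
    # Per (20.18b), the 1.12 coefficient models the nonideal wirelength
    # reduction and applies only when the PE spans np > 1 tiers.
    lh_3d = sqrt(A_PE) if np_tiers == 1 else 1.12 * sqrt(A_PE / np_tiers)
    if topology == "2D-IC/2D-NoC":
        return 0.0, sqrt(A_PE)            # (20.17c), (20.18a)
    if topology == "2D-IC/3D-NoC":
        return Lv, sqrt(A_PE)             # (20.17a), (20.18a)
    if topology == "3D-IC/2D-NoC":
        return 0.0, lh_3d                 # (20.17c), (20.18b)
    if topology == "3D-IC/3D-NoC":
        return np_tiers * Lv, lh_3d       # (20.17b), (20.18b)
    raise ValueError(f"unknown topology: {topology}")
```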

20.3.3 Power Consumption in 3-D NoC

Power dissipation is a critical issue in 3-D circuits. Although the total power consumption of 3-D systems is expected to be lower than that of mainstream 2-D circuits (since the global interconnects are shorter [308]), the increased power density is a challenging issue for this novel design paradigm. Therefore, those 3-D NoC topologies that offer low power characteristics should be of significant interest.

The different power consumption components for interconnects with repeaters are briefly discussed in this section. Since the performance characteristics are specified, a low power design methodology with delay constraints for the interconnect in an NoC is adopted from [655]. An expression for the total power consumption per bit of a packet transferred between a source destination node pair is used as the basis for characterizing the power consumption of an NoC for the 3-D topologies.

The power consumption components of an interconnect line with repeaters are:

a. Dynamic power consumption is the dissipated power due to the charge and discharge of the interconnect and input gate capacitance during a signal transition, and can be described by

$$P_{di} = a_s f\,(c_i l_i + h_i k_i C_0)\, V_{dd}^2, \qquad (20.19)$$

where f is the clock frequency and as is the switching factor [745]. A value of 0.15 is assumed here; however, for NoC, the switching factor can vary considerably. This variation, however, does not affect the power comparison for the various topologies as the same switching factor is incorporated in each term for the total power consumed per bit of the network (the absolute value of the power consumption, however, changes).

b. Short-circuit power is due to the DC current path that exists in a CMOS circuit during a signal transition when the input signal voltage changes between Vtn and Vdd + Vtp. The power consumption due to this current is described as short-circuit power and is modeled in [746] by

$$P_{si} = \frac{4\, a_s f\, I_{d0}^2\, t_{ri}^2\, V_{dd}\, k_i h_i^2}{V_{dsat}\, G\, C_{eff_i} + 2 H I_{d0} t_{ri} h_i}, \qquad (20.20)$$

where Id0 is the average drain current of the NMOS and PMOS devices operating in the saturation region and the value of the coefficients G and H are described in [747]. Due to resistive shielding of the interconnect capacitance, an effective capacitance is used in (20.20) rather than the total interconnect capacitance. This effective capacitance is determined from the methodology described in [748] and [749].

c. Leakage current power comprises two components, due to the subthreshold and gate leakage currents. The subthreshold component is caused by the current Isub that flows when a device operates in the cut-off region (below threshold). The gate leakage component is due to the current Ig flowing through the gate oxide. The total leakage current power can be described as

$$P_{li} = h_i k_i V_{dd}\,(I_{sub0} + I_{g0}), \qquad (20.21)$$

where the average subthreshold Isub0 and gate Ig0 leakage currents of the NMOS and PMOS transistors are used in (20.21).
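
These three components map directly into code, as sketched below; the short circuit term uses the reconstruction of (20.20) given above, and G, H, and the effective capacitance Ceff are supplied by the caller per the cited references.

```python
# Sketch of the per line power components of (20.19)-(20.21).
# The device values map to Table 20.3.

def dynamic_power(a_s, f, c, l, h, k, C0, Vdd):
    """(20.19): switching of the wire and repeater input capacitance."""
    return a_s * f * (c * l + h * k * C0) * Vdd ** 2

def short_circuit_power(a_s, f, Id0, t_r, Vdd, k, h, Vdsat, G, H, Ceff):
    """(20.20), as reconstructed above."""
    return (4 * a_s * f * Id0 ** 2 * t_r ** 2 * Vdd * k * h ** 2
            / (Vdsat * G * Ceff + 2 * H * Id0 * t_r * h))

def leakage_power(h, k, Vdd, Isub0, Ig0):
    """(20.21): subthreshold and gate leakage of the repeaters."""
    return h * k * Vdd * (Isub0 + Ig0)
```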

The total power consumption with delay constraint T0 for a single line of a crossbar switch Pstotal, horizontal buss Phtotal, and vertical buss Pvtotal is, respectively,

$$P_{s_{total}}(T_0 - t_a) = P_{ds} + P_{ss} + P_{ls}, \qquad (20.22)$$

$$P_{h_{total}}(T_0) = P_{dh} + P_{sh} + P_{lh}, \qquad (20.23)$$

$$P_{v_{total}}(T_0) = P_{dv} + P_{sv} + P_{lv}. \qquad (20.24)$$

The power consumption of the arbitration logic is not included in (20.22), since most of the power is consumed by the crossbar switch and the buss interconnect, as discussed in [750]. Note that for a crossbar switch, the additional delay ta of the arbitration logic poses a stricter delay constraint on the power consumption of the switch, as shown in (20.22). The minimum power consumption with delay constraints is determined by the methodology described in [655], for which the optimum size h*powi and number k*powi of the repeaters for a single interconnect line is determined. Consequently, the minimum power consumption per bit between a source destination node pair in a NoC with a delay constraint is

$$P_{bit} = hops \cdot P_{s_{total}} + hops_{2\text{-}D} \cdot P_{h_{total}} + hops_{3\text{-}D} \cdot P_{v_{total}}. \qquad (20.25)$$
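
Expression (20.25) mirrors the structure of the latency model, weighting each single line power by the corresponding hop count, as the following sketch shows.

```python
# Sketch of the power per bit of (20.25), reusing the hop count
# functions of Section 20.3.2. P_s, P_h, and P_v are the delay
# constrained single line powers of (20.22)-(20.24).

def power_per_bit(n1, n2, n3, P_s, P_h, P_v):
    return (hops(n1, n2, n3) * P_s
            + hops_2d(n1, n2, n3) * P_h
            + hops_3d(n1, n2, n3) * P_v)
```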

The effect of resistive shielding is also considered in determining the effective interconnect capacitance. Furthermore, since the repeater insertion methodology in [655] minimizes the power consumed by the repeater system, any additional decrease in power consumption is due only to the network topology. In the following section, those 3-D NoC topologies that exhibit the maximum performance and minimum power consumption with delay constraints are presented. Tradeoffs in determining these topologies are discussed and the impact of the network parameters on the resulting optimum topologies are demonstrated for different network sizes.

20.3.4 Performance and Power Analysis for 3-D NoC

Several network parameters characterizing the topology of a network can significantly affect the speed and power of a system. The evaluation of these network parameters is discussed in subsection 20.3.4.1. The improvement in network performance achieved by the 3-D NoC topologies is explored in subsection 20.3.4.2. The distribution of nodes that produces the maximum performance is also discussed. The power consumption with delay constraints of a 3-D NoC and the topologies that yield the minimum power consumption of a 3-D NoC are presented in subsection 20.3.4.3.

20.3.4.1 Parameters of 3-D networks-on-chip

The physical layer of a 3-D NoC consists of different interconnect structures, such as a crossbar switch, the horizontal buss connecting neighboring nodes on the same physical tier and the vertical buss connecting nodes on different, not necessarily adjacent, physical tiers. The device parameters characterizing the receiver, driver, and repeaters are listed in Table 20.3. The interconnect parameters reported in Table 20.4 are different for each type of interconnect within a network.

Table 20.4

Interconnect Parameters

Interconnect Structure    Electrical                   Physical
Crossbar switch           ρ = 3.07 μΩ-cm               w = 200 nm
                          kILD = 2.7                   s = 200 nm
                          rs = 614 Ω/mm                t = 250 nm
                          cs = 157.6 fF/mm             h = 500 nm
Horizontal buss           ρ = 2.53 μΩ-cm               w = 500 nm
                          kILD = 2.7                   s = 250 (500) nm
                          rh = 46 Ω/mm                 t = 1100 nm
                          ch = 332.6 (192.5) fF/mm     h = 800 nm
                          a3-D = 1.02 (1.06)
Vertical buss             ρ = 5.65 μΩ-cm               w = 1050 nm
                          rv = 51.2 Ω/mm               Lv = 10 μm
                          cv = 600 fF/mm


A typical interconnect structure is shown in Fig. 20.11, where three parallel metal lines are sandwiched between two ground planes. This interconnect structure is considered for the crossbar switch (at the network nodes) where the intermediate metal layers are assumed to be utilized. The horizontal buss is implemented on the global metal layers and, therefore, only the lower ground plane is present in this structure for a 2-D NoC. For a 3-D NoC, however, the substrate (back-to-face tier bonding) or a global metal layer of an upper tier (face-to-face tier bonding) behaves as a second ground plane. To incorporate this additional ground plane, the horizontal bus capacitance is changed by the coefficient a3-D. A second ground plane decreases the coupling capacitance to an adjacent line, while the line-to-ground capacitance increases. The vertical buss is different from the other structures in that this buss uses through silicon vias. These intertier vias can exhibit significantly different impedance characteristics as compared to traditional horizontal interconnect structures, as discussed in [436] and also verified by extracted impedance parameters. The electrical interconnect parameters are extracted using a commercial impedance extraction tool [423], while the physical parameters are extrapolated from the predictive technology library [252], [432] and the 3-D integration technology developed by MITLL for a 45 nm technology node [307]. The physical and electrical interconnect parameters are listed in Table 20.4. For each of the interconnect structures, a buss width of 64 bits is assumed. In addition, n3 and np are constrained by the maximum number of physical tiers nmax that can be vertically stacked. A maximum of eight tiers is assumed. The constraints that apply for each of the 3-D NoC topologies shown in Fig. 20.10 are

$$n_3 \leq n_{max}, \quad \text{for 2-D IC–3-D NoC}, \qquad (20.26\text{a})$$

$$n_p \leq n_{max}, \quad \text{for 3-D IC–2-D NoC}, \qquad (20.26\text{b})$$

$$n_3\, n_p \leq n_{max}, \quad \text{for 3-D IC–3-D NoC}. \qquad (20.26\text{c})$$
Figure 20.11 Typical interconnect structure for intermediate metal layers.

A small set of parameters is used as variables to explore the performance and power consumption of the 3-D NoC topologies. This set includes the network size or, equivalently, the number of nodes within the network N, the area of each PE APE, which is directly related to the buss length as described in (20.18), and the maximum allowed interconnect delay when evaluating the minimum power consumption with delay constraints. The range of values for these variables is listed in Table 20.5. Depending upon the network size, the NoC are roughly divided into small (N=16 to 64 nodes), medium (N=128 to 256 nodes), and large (N=512 to 2048 nodes) networks. For multiprocessor SoC networks, sizes of up to N=256 are expected to be feasible in the near future [735,751], whereas for NoC with a finer granularity, where each PE corresponds to a hardware block of approximately 100,000 gates, network sizes of over a few thousand nodes are predicted at the 45 nm technology node [752]. Note that this classification of the networks is not strict and is only intended to facilitate the discussion in the following sections.

Table 20.5

Network Parameters

Parameter Values
N 16, 32, 64, 128, 256, 512, 1,024, 2,048
APE [mm2] 0.5, 0.64, 0.81, 1.00, 1.56, 2.25, 4.00
T0 [ps] 1,000, 500
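
To illustrate how these parameters interact, the latency model of Section 20.3.2 can be swept over all node distributions that satisfy (20.26c); the sketch below reuses zero_load_latency() and channel_lengths() from the earlier sketches, and the router and wire delay constants are placeholders rather than values from the text.

```python
# Sketch of an exhaustive search for the minimum latency 3-D IC-3-D NoC
# node distribution under n3 * np <= nmax, per (20.26c).

def best_topology(N, A_PE, nmax=8, t_a=420e-12, t_s=100e-12,
                  th_per_m=100e-9, tv_per_m=100e-9):
    best = None
    for n3 in (d for d in range(1, nmax + 1) if N % d == 0):
        planar = N // n3
        for n1 in (d for d in range(1, planar + 1) if planar % d == 0):
            n2 = planar // n1
            for np_t in range(1, nmax // n3 + 1):
                lv, lh = channel_lengths("3D-IC/3D-NoC", A_PE,
                                         np_tiers=np_t)
                T = zero_load_latency(n1, n2, n3, t_a, t_s,
                                      t_h=lh * th_per_m,
                                      t_v=lv * tv_per_m)
                if best is None or T < best[0]:
                    best = (T, (n1, n2, n3, np_t))
    return best

print(best_topology(N=128, A_PE=1e-6))   # 128 nodes, 1 mm^2 PEs
```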

20.3.4.2 Performance tradeoffs for 3-D NoC

The performance enhancements that can be achieved in NoC by utilizing the third dimension are discussed in this subsection. Each of the 3-D topologies decreases the zero-load latency of the network by reducing different delay components, as described in (20.9). In addition, the distribution of network nodes in each physical dimension that yields the minimum zero-load latency is shown to change significantly with the network and interconnect parameters.

2-D IC–3-D NoC

Utilizing the third dimension to implement a NoC directly results in a decrease in the average number of hops for packet switching. The average number of hops on the same tier hops2-D (the intratier hops) and the average number of hops in the third dimension hops3-D (the intertier hops) are also reduced. Interestingly, the distribution of nodes n1, n2, and n3 that yields the minimum total number of hops is not always the same as the distribution that minimizes the number of intratier hops. This situation occurs particularly for small and medium networks, while for large networks, the distribution of n1, n2, and n3 which minimizes the hops also minimizes hops2-D.

In a 3-D NoC, the number of router ports increases from five to seven, increasing, in turn, both the switch and arbiter delay. Furthermore, a short vertical buss generally exhibits a lower delay than a relatively long horizontal buss. In Fig. 20.12, the zero-load latency of the 2-D IC–3-D NoC is compared to that of the 2-D IC–2-D NoC for different network sizes. A decrease in latency of 15.7% and 20.1% can be observed for, respectively, N=128 and N=256 nodes with APE=0.81 mm2.

Figure 20.12 Zero-load latency for several network sizes. (A) APE=0.81 mm2 and ch=332.6 fF/mm, and (B) APE=4 mm2 and ch=332.6 fF/mm.

The node distribution that produces the lowest latency varies with network size. For example, n3 = nmax = 8 is not necessarily the optimum for small and medium networks, although by increasing n3, more hops occur through the short, low latency vertical channel. This result can be explained by considering the reduction in the number of hops that originates from utilizing the third dimension for packet switching. For small and medium networks, the decrease in the number of hops is small and cannot compensate for the increased routing delay due to the greater number of router ports in a 3-D NoC. As the horizontal buss length becomes longer, however (e.g., approaching 2 mm), n3 > 1, and a slight decrease in the number of hops significantly decreases the overall delay, despite the increase in the routing delay for a 3-D NoC. As an example, consider a network with log2N=4 and APE=0.81 mm2. The minimum latency node distribution is n1 = n2=4 and n3=1 (identical to a 2-D IC–2-D NoC, as shown in Fig. 20.12), while for APE=4 mm2, n1=n2=2 and n3=4.

The optimum node distribution can also be affected by the delay of the vertical channel. The repeater insertion methodology for minimum delay described in Section 20.3.2 can significantly reduce the delay of the horizontal buss by inserting large sized repeaters (i.e., h > 300). In this case, the delay of the vertical buss becomes comparable to that of the horizontal buss with repeaters. Consider a network with N=128 nodes. Two different node distributions yield the minimum average number of hops, specifically, n1=4, n2=4, and n3=8, and n1=8, n2=4, and n3=4. The first of the two distributions also results in the minimum number of intratier hops hops2-D, thereby reducing the latency of the horizontal buss, as described by (20.9). Simulation results, however, indicate that this distribution is not the minimum latency node distribution, as the delay due to the vertical channel is nonnegligible. For this reason, the latter distribution with n3=4 is preferable since a smaller number of hops3-D occurs, resulting in the minimum network latency.

3-D IC–2-D NoC

For this type of 3-D network, the PEs are allowed to span multiple physical tiers while the network effectively remains 2-D (i.e., n3=1). Consequently, the network latency is only reduced by decreasing the length of the horizontal buss, as described in (20.18). The routing delay component remains constant with this 3-D topology. Decreasing the horizontal buss length lowers both the communication channel delay and the serialization delay. In Fig. 20.13, the decrease in latency that can be achieved by a 3-D IC–2-D NoC is illustrated. A latency decrease of 30.2% and 26.4% can be observed for, respectively, N=128 and N=256 nodes with APE=2.25 mm2. The use of multiple physical tiers reduces the latency; therefore, the optimum value is np = nmax, regardless of the network size and buss length.

Figure 20.13 Zero-load latency for various network sizes. (A) APE=0.64 mm2 and ch=192.5 fF/mm, (B) APE=2.25 mm2 and ch=192.5 fF/mm.

In Figs. 20.14A and 20.14B, the improvement in the network latency over a 2-D IC–2-D NoC for several network sizes and for different PE areas (i.e., different horizontal buss lengths) is illustrated for, respectively, the 2-D IC–3-D NoC and 3-D IC–2-D NoC topologies. Note that for the 2-D IC–3-D NoC topology, the improvement in delay is smaller for PEs with a larger area or, equivalently, with longer buss lengths, independent of the network size. For longer buss lengths, the buss latency comprises a larger portion of the total network latency. Since for a 2-D IC–3-D NoC only the hop count is reduced, the improvement in latency is lower for longer buss lengths. Alternatively, for a 3-D IC–2-D NoC, the improvement in latency is greater for PEs with a larger area, independent of the network size. This situation is due to the significant reduction in the PE area (or buss length) achieved with this topology. Consequently, there is a tradeoff in the latency of an NoC that depends both on the network size and the area of the PEs. In Fig. 20.14A, the improvement is not significant for small networks (all of the curves converge to approximately zero) in a 2-D IC–3-D NoC, while this situation does not occur for a 3-D IC–2-D NoC. This behavior is due to the increase in the delay of the network router as the number of ports increases from five to seven for a 2-D IC–3-D NoC, which is a considerable portion of the network latency for small networks. Note that for a 3-D IC–2-D NoC, the network essentially remains 2-D and, therefore, the delay of the router for this topology does not increase. To achieve the minimum delay, a 3-D NoC topology that exploits these tradeoffs is described in the following subsection.

Figure 20.14 Improvement in zero-load latency for different network sizes and PE areas (i.e., buss lengths). (A) 2-D IC–3-D NoC, and (B) 3-D IC–2-D NoC.
3-D IC–3-D NoC

This topology offers the greatest decrease in latency over the aforementioned 3-D topologies. The 2-D IC–3-D NoC topology decreases the number of hops while the buss and serialization delays remain constant. With the 3-D IC–2-D NoC, the buss and serialization delay is smaller but the number of hops remains unchanged. With the 3-D IC–3-D NoC, all of the latency components can be decreased by assigning a portion of the available physical tiers for the network while the remaining tiers of the stack are used for the PE. The resulting decrease in network latency as compared to a standard 2-D IC–2-D NoC and the other two 3-D topologies is illustrated in Fig. 20.15. A decrease in latency of 40% and 36% can be observed, respectively, for N=128 and N=256 nodes with APE=4 mm2. Note that the 3-D IC–3-D NoC topology achieves the greatest savings in latency by optimally balancing n3 with np.

Figure 20.15 Zero-load latency for various network sizes. (A) APE=1 mm2 and ch=332.6 fF/mm, and (B) APE=4 mm2 and ch=332.6 fF/mm.

For certain network sizes, the performance of the 3-D IC–3-D NoC is identical to that of either the 2-D IC–3-D NoC or the 3-D IC–2-D NoC. This behavior occurs because for large network sizes, the delay due to the large number of hops dominates the total delay and, therefore, the latency can primarily be reduced by decreasing the average number of hops (n3=nmax). For small networks, the buss delay is large and the latency savings is typically achieved by reducing the buss length (np=nmax). For medium networks, though, the optimum topology is obtained by dividing nmax between n3 and np to ensure that (20.26c) is satisfied. This distribution of n3 and np as a function of the network size and buss length is illustrated in Fig. 20.16.

Figure 20.16 n3 and np values for minimum zero-load latency for various network sizes. (A) APE=1 mm2 and ch=332.6 fF/mm, and (B) APE=4 mm2 and ch=332.6 fF/mm.

Note the shift in the value of n3 and np as the PE area APE or, equivalently, the buss length increases. For long busses, the delay of the communication channel becomes dominant and therefore the smaller number of hops for medium-sized networks cannot significantly decrease the total delay. Alternatively, further decreasing the buss length by placing the PEs within a greater number of physical tiers leads to a larger savings in delay.

The suggested optimum topologies for different network sizes (namely, small, medium, and large networks) also depend upon the interconnect parameters of the network. Consequently, a change in the optimum topology for different network sizes can occur when different interconnect parameters are considered. Despite the sensitivity of the topologies to the interconnect parameters, the tradeoff between the number of hops and the buss length for different 3-D topologies (see Figs. 20.14 and 20.16) can be exploited to improve the performance of an NoC. In the following subsection, the topology that yields the minimum power consumption while satisfying the delay constraints is described. The distribution of nodes for that topology is also discussed.

20.3.4.3 Power consumption in 3-D NoC

The different power consumption components for the interconnect within an NoC are described in Section 20.3.3. The methodology presented in [655] is applied here to minimize the power consumption of these interconnects while satisfying the specified operating frequency of the network. Since a power minimization methodology is applied to the buss lines, the power consumed by the network can only be further reduced by the choice of network topology. The power consumption additionally depends upon the target operating frequency, as discussed later in this section.

As with the zero-load latency, each topology affects the power consumption of the network in a different way. From (20.25), the power consumption can be reduced either by decreasing the number of hops per packet or by decreasing the buss length. Note that reducing the buss length not only reduces the interconnect capacitance but also decreases the number and size of the repeaters required to drive the lines, resulting in a greater savings in power. The effect of each of the 3-D topologies on the power consumption of an NoC is investigated in this section.
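
As a rough illustration of this scaling, the sketch below estimates the zero-load energy per bit from the hop count and buss length. The repeater overhead, router energy, and supply voltage are assumed placeholders (ch = 332.6 fF/mm is the line capacitance used in the case studies of this section); this is not the power minimization methodology of [655].

```python
# A hedged sketch of how the interconnect energy per bit scales with the
# 3-D topology parameters n3 (network tiers) and np (tiers per PE).
# K_REP, E_ROUTER, and VDD are assumed placeholders; CH corresponds to
# the ch = 332.6 fF/mm line capacitance quoted in this section.
import math

CH = 332.6e-12      # buss capacitance per meter (332.6 fF/mm)
K_REP = 0.5         # assumed fractional capacitance added by repeaters
E_ROUTER = 1e-12    # assumed energy per bit per router traversal (J)
VDD = 1.0           # assumed supply voltage (V)

def energy_per_bit(N, n3, np, area_pe_mm2):
    """Zero-load energy of one bit crossing an N-node 3-D NoC."""
    n12 = math.sqrt(N / n3)                        # nodes per planar edge
    hops = (n12 + n12 + n3) / 3.0                  # average hop count
    buss_len = math.sqrt(area_pe_mm2 / np) * 1e-3  # buss length (m)
    e_link = (1.0 + K_REP) * CH * buss_len * VDD ** 2
    return hops * (E_ROUTER + e_link)

# Folding the network (n3 = 4) versus folding the PEs (np = 4):
print(energy_per_bit(128, n3=4, np=1, area_pe_mm2=1.0))
print(energy_per_bit(128, n3=1, np=4, area_pe_mm2=1.0))
```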

2-D IC–3-D NoC

Similar to the network latency, the power consumption of this topology is decreased by reducing the number of hops per packet. Again, the number of ports increases; however, the effect of this increase on the power consumption is less important than its effect on the network latency. A 3-D network, therefore, can reduce power even in small networks. The power savings achieved with this topology is depicted in Fig. 20.17 for several network sizes, where the savings is greater in larger networks. This situation occurs because the reduction in the average number of hops in a 3-D network increases with the network size. A power savings of 26.1% and 37.9% is achieved, respectively, for N=128 and N=512 with APE=1 mm2.

image
Figure 20.17 Power consumption with delay constraints for several network sizes. (A) APE=1 mm2, ch=332.6 fF/mm, and T0 = 500 ps, and (B) APE=4 mm2, ch=332.6 fF/mm, and T0 = 500 ps.
3-D IC–2-D NoC

With this topology, the number of hops in the network is the same as for a 2-D network. The horizontal buss length, however, is shorter since the PEs are placed on more than one physical tier. The greater the number of physical tiers that can be integrated into a 3-D system, the larger the power savings, meaning that the optimum value of np for this topology is always nmax, regardless of the network size and operating frequency. The savings is limited by the number of physical tiers that can be integrated in a 3-D technology. The power savings for various network sizes is shown in Fig. 20.18. Note that for this type of NoC, the maximum performance topology is identical to the minimum power consumption topology, as both objectives benefit solely from the shorter buss length. The savings in power is approximately 35% when APE=0.64 mm2 for every network size, as the per cent reduction in the buss length is the same for each network size.

image
Figure 20.18 Power consumption with delay constraints for several network sizes. (A) APE=0.64 mm2, ch=192.5 fF/mm, and T0 = 1000 ps, and (B) APE=2.25 mm2, ch=192.5 fF/mm, and T0 = 1000 ps.
3-D IC–3-D NoC

By allowing the available physical tiers to be utilized either for the third dimension of the network or for the PEs, the 3-D IC–3-D NoC scheme achieves the greatest savings in power in addition to the minimum delay, as discussed in the previous subsection. The distribution of nodes along the physical dimensions that produces the minimum latency, however, is not necessarily the same as the distribution that produces the minimum power consumption for every network size. This nonequivalence is due to the different degree of importance of the average number of hops and the buss length in determining the latency and the power consumption of a network. In Fig. 20.19, the power consumption of the 3-D IC–3-D NoC topology is compared to the previously discussed 3-D topologies. A power savings of 38.4% is achieved for N=128 with APE=1 mm2. For certain network sizes, the power consumption of the 3-D IC–3-D NoC topology is the same as that of the 2-D IC–3-D NoC or 3-D IC–2-D NoC topologies. The 2-D IC–3-D NoC dissipates less power by reducing the number of hops for packet switching, while the 3-D IC–2-D NoC dissipates less power by shortening the buss length. The former approach typically benefits large networks, while the latter approach dissipates less power in small networks. For medium sized networks, and depending upon the network and interconnect parameters, nonextreme values of the n3 and np parameters (e.g., 1 < n3 < nmax and 1 < np < nmax) are required to produce the minimum power consumption topology.

image
Figure 20.19 Power consumption with delay constraints for various network sizes. (A) APE=1 mm2, ch=332.6 fF/mm, and T0 = 500 ps, and (B) APE=4 mm2, ch=332.6 fF/mm, and T0 = 500 ps.

Note that this work emphasizes the latency and power consumption of a network, neglecting the performance requirements of the individual PEs. If the performance of the individual PEs is important, only one 3-D topology may be available; however, even with this constraint, a significant savings in latency and power can be achieved since in almost every case the network latency and power consumption can be decreased as compared to a 2-D IC–2-D NoC topology. Furthermore, as previously mentioned, if the available topology is the 2-D IC–3-D NoC, setting n3 equal to nmax is not necessarily the optimum choice.

The proposed zero-load network latency and power consumption expressions capture the effect of the topology; these models, however, do not incorporate the effects of the routing scheme and traffic load. These models can instead be treated as lower bounds for both the latency and the power consumption of the network. Since minimum distance paths and no contention are implicitly assumed in these expressions, nonminimal path routing schemes and heavy traffic loads increase both the latency and the power consumption of the network. The routing algorithm is managed by the upper layers of the communication protocol of the network (i.e., the layers above the physical layer). In addition, the traffic patterns depend upon the application executed on the network. The effect of each of these parameters on the performance of 3-D NoCs is explored in the following section by utilizing a network simulator.

20.3.4.4 Design aids for 3-D NoCs1

To evaluate the performance of emerging 3-D topologies for different applications, effective design aids are required. A network simulator that evaluates the effectiveness of different 3-D topologies is described in this section. Simulations of a broad variety of 3-D mesh- and torus-based topologies as well as traffic patterns are also discussed.

3-D NoC simulator

The core of the 3-D NoC simulator is based on the Worm_Sim NoC simulator [753], which has been extended to support 3-D topologies. In addition to 3-D meshes and tori, variants of these topologies are supported. Related routing schemes have also been adapted to provide routing in the vertical direction. An overview of the characteristics and capabilities of the simulator is depicted in Fig. 20.20.

image
Figure 20.20 An overview of the 3-D NoC simulator.

Several fundamental characteristics of a network topology are reported, including the energy consumption, average packet latency, and router area. The reported energy consumption includes the energy consumed by each component of the network, such as the crossbar switch and other circuitry within the network router, and the interconnect busses (i.e., link energy). Consequently, this simulator is a useful tool for exploring early decisions related to the system architecture.

An important capability of this tool is that variations of the basic 3-D mesh and torus topologies can be efficiently explored. These topologies are characterized by heterogeneous interconnectivity, combining 2-D and 3-D network routers within the same network. The primary difference between these routers is that a 3-D network router is connected to network routers on adjacent tiers and, therefore, has two more ports than a 2-D network router. Consequently, a 3-D network router consumes a larger area and dissipates more power, yet provides greater interconnectivity.

Although these topologies typically exhibit a greater delay than a straightforward 3-D topology, the savings in energy can be significant, in particular, for those applications where speed is not the primary objective. Another application area that can benefit from these topologies is applications where the data packets propagate over a small number of routers within the network or, equivalently, where the spatial distribution of the hops required to propagate a data packet within the on-chip network is narrow. Since different applications can produce diverse types of traffic, an efficient traffic model is required. The simulator includes the traffic model used within the Trident tool [754]. This model consists of several parameters that characterize the spatial and temporal traffic across a network. The temporal parameters include the number of packets and the rate at which these packets are injected into a router. The spatial parameters include the distance that a packet travels within the network and the portion of the total traffic injected by each router into the network [754].

To generate these heterogeneous mesh and torus topologies, a distribution of the 2-D and 3-D routers needs to be determined. The following combinations of 2-D and 3-D routers for a 3-D mesh and torus are considered [755]:

Uniform: The 3-D routers are uniformly distributed over the different tiers. In this scheme, the 3-D routers are placed on each physical tier of the 3-D NoC. If no 2-D routers are inserted within the network, the topology is a fully connected 3-D mesh or torus (see Figs. 20.10 and 20.21A). In the case where both 2-D and 3-D routers are used within the network, the location of the routers on each tier is determined as follows:

• Place the first 3-D router at position (X, Y, Z) on each tier.

• The four neighboring 2-D routers are placed at positions (X + r + 1, Y, Z), (X − r − 1, Y, Z), (X, Y + r + 1, Z), and (X, Y − r − 1, Z). The parameter r represents the periodicity (or frequency) of the 2-D routers within each tier. Consequently, a 2-D router is inserted every r 3-D routers in each direction within a tier. This scheme is exemplified in Fig. 20.21B, where one tier of a 3-D NoC with r=3 is shown (a sketch of these placement rules follows this list).

image
Figure 20.21 Position of the vertical interconnection links for each tier within a 3-D NoC (each tier is a 6 × 6 mesh), (A) fully connected 3-D NoC, (B) uniform distribution of vertical links, (C) vertical links at the center of the NoC, and (D) vertical links at the periphery of the NoC.

Center: The 3-D routers are located at the center of each tier, as shown in Fig. 20.21C. Since 3-D routers only exist at the center of a tier, only 2-D routers are placed along the outer region of each tier to connect the neighboring network nodes within the same tier.

Periphery: The 3-D routers are located at the periphery of each tier (see Fig. 20.21D). This combination is complementary to the Center placement.

Full Custom: The position of the 3-D routers is fully customized based on the requirements of the application, while minimizing the area occupied by the routers, since fewer 3-D routers are used. This configuration, however, produces an irregular structure that does not adapt well to changes in network functionality and application.
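
As a concrete illustration, the short sketch below generates per-tier 2-D/3-D router maps in the spirit of Fig. 20.21; the tier dimensions, the value of r, and the anchor positions are illustrative assumptions, and the uniform rule is a simplified rendering of the placement described above.

```python
# A sketch of per-tier 2-D/3-D router maps in the spirit of Fig. 20.21.
# True marks a 3-D router (i.e., a node with vertical links); the tier
# size, the value of r, and the anchor positions are illustrative.

def uniform_map(nx, ny, r):
    """Insert a 2-D router every r 3-D routers along each direction."""
    return [[(x % (r + 1) != r) and (y % (r + 1) != r)
             for y in range(ny)] for x in range(nx)]

def center_map(nx, ny, size):
    """3-D routers only within a size x size block at the tier center."""
    x0, y0 = (nx - size) // 2, (ny - size) // 2
    return [[x0 <= x < x0 + size and y0 <= y < y0 + size
             for y in range(ny)] for x in range(nx)]

def periphery_map(nx, ny, size):
    """Complement of the center placement."""
    return [[not v for v in row] for row in center_map(nx, ny, size)]

for row in uniform_map(6, 6, r=3):   # one 6 x 6 tier, as in Fig. 20.21
    print("".join("3" if v else "2" for v in row))
```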

As illustrated in Fig. 20.20, different routing schemes are supported by the simulator. These algorithms are extended from the Worm_Sim simulator to support routing in the third dimension; a sketch of the baseline dimension-order behavior follows the list:

i. XYZ-OLD, which is an extended version of the shortest path XY routing algorithm.

ii. XYZ, which is based on XY routing, where the algorithm determines which of the candidate directions produces the lower delay and forwards the packet in that direction.

iii. ODD–EVEN, the odd–even routing scheme presented in [756]. In this scheme, the turns that packets can take are restricted to avoid deadlock.
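
A minimal sketch of the dimension-order behavior underlying the first of these schemes is given below; it reproduces only shortest path XYZ ordering, not the adaptive XYZ or ODD–EVEN variants.

```python
# A minimal sketch of dimension-order (XYZ-OLD style) routing: the packet
# is forwarded along x, then y, then z until the offset in each dimension
# is exhausted. This illustrates only the shortest path ordering; the
# adaptive XYZ and ODD-EVEN schemes are not reproduced here.
def xyz_route(src, dst):
    """Yield the sequence of node coordinates visited from src to dst."""
    cur = list(src)
    for dim in range(3):                      # x first, then y, then z
        step = 1 if dst[dim] > cur[dim] else -1
        while cur[dim] != dst[dim]:
            cur[dim] += step
            yield tuple(cur)

print(list(xyz_route((0, 0, 0), (2, 1, 3))))
```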

Consequently, the performance of a 3-D NoC can be evaluated under a broad and diverse set of parameters, which reinforces the exploratory capabilities of the tool. In general, the following parameters of the simulator are configured:

• The NoC architecture, which can be a 2- or 3-D mesh or torus, where the dimensions of the network in the x, y, and z directions are, respectively, n1, n2, and n3.

• The type of input traffic and traffic load.

• The routing algorithm.

• The vertical link configuration file, which defines where a vertical link (required by 3-D routers) is present.

• The router model and related energy and delay models.

In the following section, several 3-D NoC topologies are explored under different traffic patterns and loads.

Evaluation of 3-D NoCs under different traffic scenarios

To compare the performance of 3-D topologies with conventional 2-D meshes and tori, two network sizes with 64 and 144 nodes are considered. Each of these case studies is evaluated in both two and three dimensions. In two dimensions, the network nodes are connected to form, respectively, 8×8 and 12×12 meshes and tori. Alternatively, the 3-D topologies of the target networks are placed on four physical tiers (i.e., n3=4). The dimensions of the two networks (n1×n2×n3), consequently, are, respectively, 4×4×4 and 6×6×4. Metrics for comparing the 2-D and 3-D topologies are the average packet latency, dissipated energy, and physical area. The area of the PEs is excluded since this area is the same in both the 2-D and 3-D topologies. Topologies that contain PEs on multiple tiers (see Fig. 20.10C) are not considered in this analysis.

Although the 3-D topologies exhibit superior performance as compared to the 2-D meshes and tori, a topology that combines 2-D and 3-D routers can be beneficial for certain traffic patterns and loads. Using fewer 3-D routers in a network results in a smaller area and, possibly, lower power. This situation is due to a 2-D router requiring two fewer ports than a 3-D router. Consequently, several combinations of 2-D and 3-D routers within a 3-D topology have been evaluated, some of which are illustrated in Fig. 20.21. Note that for each combination of 2-D and 3-D routers, the location of these routers within a 3-D on-chip network is the same on each tier of the network. In the case studies, ten different combinations of 2-D and 3-D routers are compared in terms of energy, delay, and area. These combinations are described below, where for the sake of clarity the number (and per cent in parentheses) of 3-D routers within a 4×4×4 NoC is provided:

• Full: All of the PEs are connected to 3-D routers [number of 3-D routers: 64 (100%)]. This combination corresponds to a fully connected mesh (see Fig. 20.10A) or torus 3-D network.

• Uniform-based: 2-D routers are connected to specific PEs within each tier of the 3-D network. The distribution of the 2-D routers is controlled by the parameter r, as discussed in the previous section. The chosen values of r are three (by_three), four (by_four), and five (by_five). The corresponding numbers of 3-D routers are, respectively, 44 (68.75%), 48 (75%), and 52 (81.25%). These combinations have a decreasing number of 2-D routers, approaching a fully connected 3-D mesh or torus.

• Odd: In this combination, all of the routers within a row are of the same type (i.e., either 2-D or 3-D). The type of router alternates among rows [number of 3-D routers: 32 (50%)].

• Edges: A portion of the PEs located at the center of each tier, forming a region with dimensions nx(2-D)×nx(2-D), is connected to 2-D routers, while the remaining network nodes are connected to 3-D routers. For the example network, nx(2-D)=2 and, consequently, the number of 3-D routers is 48 (75%).

• Center: A segment of the PEs located at the center of each tier, forming a region with dimensions nx(3-D)×nx(3-D), is connected to 3-D routers, while the remaining PEs are connected to 2-D routers. For the example network, the number of 3-D routers is 16 (25%).

• Side-based: The PEs along a side (e.g., an outer row) of each tier are connected to 2-D routers. The combinations have the PEs along one (one_side), two (two_side), or three (three_side) sides connected to 2-D routers. The number of 3-D routers for each pattern, consequently, is, respectively, 48 (75%), 36 (56.25%), and 24 (37.5%). These combinations have an increasing number of 2-D routers, approaching the same number of routers as in a 2-D mesh or torus.

The interconnectivity for the two network sizes is evaluated under different traffic patterns and loads. The parameters of the traffic model used in the 3-D NoC simulator have been adjusted to produce the following common types of traffic patterns [757]:

• Uniform: The traffic is uniformly distributed across the network with the network nodes receiving approximately the same number of packets.

• Transpose: In this traffic scheme, packets originating from a node at location (a, b, c) reach the node at the destination (n1 − a, n2 − b, n3 − c), where n1, n2, and n3 are the dimensions of the network.

• Hot spot: A small number of network nodes (i.e., hot spot nodes) receive a greater number of packets as compared to the majority of the nodes, which are modeled as uniformly receiving packets. The hot spot nodes within a 2-D NoC are positioned in the middle of each quadrant of the network. Alternatively, in a 3-D NoC, a hot spot is located in the middle of each tier.

Three traffic loads are considered: low, normal, and high. The high load has a 50% increase in traffic, whereas the low load has a 90% decrease in traffic, as compared to a normal load.
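
The sketch below illustrates how packet destinations can be drawn for these three patterns. Coordinates are zero indexed, so the transpose destination is written n − 1 − a rather than the one indexed n − a above; the hot spot injection probability is an assumed placeholder rather than a parameter of the Trident model.

```python
# A sketch of destination selection for the three traffic patterns.
# Coordinates are zero indexed, so transpose uses n - 1 - a (the text's
# n - a with one indexed nodes); p_hot is an assumed placeholder, not a
# parameter of the Trident traffic model.
import random

def transpose_dest(src, dims):
    return tuple(n - 1 - a for a, n in zip(src, dims))

def uniform_dest(src, dims, rng=random):
    while True:  # draw any node other than the source, all equally likely
        dst = tuple(rng.randrange(n) for n in dims)
        if dst != src:
            return dst

def hotspot_dest(src, dims, hotspots, p_hot=0.2, rng=random):
    # With probability p_hot the packet targets a hot spot node placed in
    # the middle of a tier; otherwise the traffic is uniform.
    if rng.random() < p_hot:
        return rng.choice(hotspots)
    return uniform_dest(src, dims, rng)

dims = (4, 4, 4)                                   # a 4 x 4 x 4 3-D NoC
hotspots = [(dims[0] // 2, dims[1] // 2, z) for z in range(dims[2])]
print(transpose_dest((0, 1, 2), dims))             # -> (3, 2, 1)
```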

The energy consumption in Joules and the average packet latency in cycles are compared for each of these patterns. The energy model of the NoC simulator is an architectural level model that determines the energy consumed by propagating a single bit across the network [758]. This model includes the energy of a bit passing through the various components of the network, such as the buffer and switch within a router and the interconnect buss between two neighboring routers. Note that only the dynamic component of the consumed energy is considered in this model [758]. Additionally, for each topology and combination of 2-D and 3-D routers, the total area of the routers is determined based on the gate equivalent of the switching fabric [759]. The objective is to determine which of these combinations results in higher network performance as compared to the 2-D and the fully connected 3-D NoC. All of the simulations are performed for 200,000 cycles.
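
A sketch of an architectural bit energy model in the spirit of [758] is shown below: the dynamic energy of a bit is accumulated over every buffer, switch, and link that the bit traverses. The component energies are assumed placeholders, not the values used by the simulator.

```python
# A sketch of an architectural bit energy model in the spirit of [758]:
# the dynamic energy of a bit is accumulated over every buffer, switch,
# and link it traverses. The component energies are assumed placeholders,
# not the values used by the 3-D NoC simulator.
E_BUFFER = 0.8e-12   # assumed buffer write + read energy per bit (J)
E_SWITCH = 0.6e-12   # assumed crossbar traversal energy per bit (J)
E_LINK = 1.2e-12     # assumed inter-router buss energy per bit (J)

def packet_energy(payload_bits, hops):
    """Dynamic energy of a packet traversing `hops` routers; a packet
    crossing h routers traverses h - 1 inter-router links."""
    per_bit = hops * (E_BUFFER + E_SWITCH) + (hops - 1) * E_LINK
    return payload_bits * per_bit

print(packet_energy(payload_bits=512, hops=6))
```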

The average packet latency of the differently interconnected NoCs for a torus architecture is depicted in Fig. 20.22. The latency is normalized to the average packet latency of a fully connected 3-D NoC under a normal load condition for each traffic scheme. As expected, the network latency increases with the traffic load.

image
Figure 20.22 Effect of traffic load on the latency of a 2-D and 3-D torus NoC for each type of traffic and XYZ routing.

Mesh topologies exhibit similar behavior, although the latency is higher due to the decreased connectivity as compared to the torus topologies. This behavior is depicted in Fig. 20.23, where the latencies of 64 node mesh and torus NoCs are compared (the basis for the latency normalization is the average packet latency of a fully connected 3-D torus). The mesh topologies exhibit a 34% higher packet latency as compared to the torus topologies for the same traffic pattern, traffic load, and routing algorithm.

image
Figure 20.23 Latency of 64 node 2-D and 3-D meshes and tori NoCs under uniform traffic, XYZ routing, and several traffic loads.

The results of employing partial vertical connectivity (i.e., a combination of 2-D and 3-D routers) within a 3-D mesh network with uniform traffic, a normal traffic load, and XYZ-OLD routing are illustrated in Fig. 20.24. The energy consumption, average packet latency, router area, and per cent of 2-D routers for the 4×4×4 and 6×6×4 mesh architectures are illustrated, respectively, in Figs. 20.24A and 20.24B. All of these metrics are normalized to a fully connected 3-D NoC.

image
Figure 20.24 Different performance metrics under uniform traffic and a normal traffic load of a 3-D NoC for alternative interconnection topologies with XYZ-OLD routing, (A) 64 network nodes, and (B) 144 network nodes.

The advantages of a 3-D NoC as compared to a 2-D NoC are depicted in Fig. 20.24A. In this case, the 8×8 mesh dissipates 39% more energy and exhibits a 29% higher packet delivery latency as compared to a fully connected 3-D NoC. The overall area of the routers, however, is 71% of the area of a fully connected 3-D NoC, since all of the routers are 2-D. Employing the by_five combination results in a 3% reduction in energy and a 5% increase in latency. In this combination, only 81% of the routers are 3-D, resulting in a 5% smaller area for the switching logic. For a larger network (see Fig. 20.24B), several router combinations are superior to a fully connected 3-D NoC. Summarizing, the overall performance of a 2-D NoC is significantly lower, exhibiting increases in both energy and latency of approximately 50%.

When the traffic load is increased by 50%, the performance of all of the router combinations degrades as compared to the fully connected 3-D NoC. This behavior occurs since, in a 3-D NoC containing both 2-D and 3-D routers, fewer 3-D routers are used to save energy, reducing, in turn, the interconnectivity within the network. This lower interconnectivity increases the number of hops required to propagate data packets, increasing the overall network latency.

In the case of low traffic loads, alternative combinations can be beneficial since the requirements for communication resources are low. The simulation results for 64 and 144 node 2-D and 3-D NoCs under low uniform traffic and XYZ routing are illustrated in Fig. 20.25. Under this load, most router combinations perform comparably to the fully connected 3-D NoC; an exception is the "edges" combination in the 64 node 3-D NoC (see Fig. 20.25A), where all of the 3-D routers reside along the edges of each tier within the 3-D NoC. This arrangement produces a 7% increase in packet latency. The performance of the 2-D NoC again decreases with increasing NoC dimensions. This behavior is depicted in Fig. 20.25B, where the 2-D NoC dissipates 38% more energy while the latency increases by 37%.

image
Figure 20.25 Several performance metrics under uniform traffic and a low traffic load of a 3-D NoC for alternative interconnection topologies with XYZ routing, (A) a 4 × 4 × 4 3-D mesh, and (B) a 6 × 6 × 4 3-D mesh.

Finally, the energy and latency of the various interconnection patterns are compared to those of a fully connected 3-D NoC in Table 20.6. The three types of traffic are listed in the first column. The minimum and maximum values of the energy dissipation are listed in the next two columns, and the minimum and maximum values of the average packet latency are reported, respectively, in the fourth and fifth columns. Although the latency only increases, which is expected as the alternative interconnection patterns decrease the interconnectivity of the network, certain traffic patterns produce a considerable savings in energy. This savings in energy is important due to the significance of thermal effects in 3-D circuits. In addition to on-chip networks, another design style that greatly benefits from vertical integration is the FPGA, the topic of the following section.

Table 20.6

Min–Max Variation in Energy and Latency of 3-D NoCs with Partial Vertical Connectivity as Compared to a Fully Connected 3-D NoC under Different Traffic Patterns and a Normal Traffic Load

Traffic Pattern    Energy (Min)    Energy (Max)    Latency (Min)    Latency (Max)
Uniform                 92%             108%              98%              113%
Transpose               88%             116%             100%              354%
Hotspot                 71%             116%             100%              134%

All values are normalized to the fully connected 3-D NoC.

20.4 3-D FPGAs

FPGAs are programmable ICs that implement arbitrary logic functions with a considerably smaller design turnaround time as compared to other design styles, such as application specific (ASIC) or full custom ICs. Due to this flexibility, the share of the IC market held by FPGAs has steadily increased. The tradeoff for the reduced time to market and the versatility of FPGAs is lower speed and increased power consumption as compared to ASICs. A traditional physical structure of an FPGA is depicted in Fig. 20.26, where the logic blocks (LBs) can implement any digital logic function with some sequential elements and arithmetic units [760]. The switch boxes (SBs) provide the interconnections among the LBs. The SBs include pass transistors, which connect (or disconnect) the incoming routing tracks to the outgoing routing tracks. Memory circuits control these pass transistors and program the LBs for a specific application. In FPGAs, the SBs constitute the primary component of the interconnect delay between the LBs and can consume a significant amount of power.

image
Figure 20.26 Typical FPGA architecture, (A) 2-D FPGA, (B) 2-D switch box, and (C) 3-D switch box. A routing track can connect three outgoing tracks in a 2-D SB, while in a 3-D SB, a routing track can connect five outgoing routing tracks.

Extending FPGAs to the third dimension can improve performance while decreasing power consumption as compared to conventional planar FPGAs. A generalization of FPGAs to the third dimension includes multiple planar FPGAs, wafer or die bonded to form a 3-D system. The crucial difference between a 2-D and a 3-D FPGA is that an SB provides communication to five neighboring LBs in a 3-D system rather than three neighboring LBs as in a 2-D FPGA (see Figs. 20.26B and 20.26C). Consequently, each incoming interconnect segment connects to five outgoing segments rather than three. The situation is somewhat different for the bottommost and topmost tiers of a 3-D FPGA but, for simplicity, this difference is neglected in the following discussion. Since the connectivity of a 3-D SB is greater, additional pass transistors are required in each SB, increasing the power consumption, the memory required to configure the SB, and, possibly, the interconnect delay. The decreased interconnect length and greater connectivity can, however, compensate for the added complexity and power of the 3-D SBs.
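
The added SB complexity can be estimated with a simple count. The sketch below assumes a disjoint style switch box in which each routing track requires one pass transistor per pair of connected channel sides; the channel width and the disjoint topology are illustrative assumptions, not the model of [761].

```python
# Back-of-envelope sketch of SB complexity, assuming a disjoint style
# switch box: a 2-D SB joins 4 channel sides, while a 3-D SB joins 6
# (the two extra sides connect the vertical neighbors). These counts
# are illustrative, not taken from [761].
from math import comb

def sb_pass_transistors(sides, tracks_per_channel):
    # One pass transistor per pair of connected sides, per routing track.
    return comb(sides, 2) * tracks_per_channel

W = 20   # assumed channel width
print(sb_pass_transistors(4, W))   # 2-D SB:  6W = 120
print(sb_pass_transistors(6, W))   # 3-D SB: 15W = 300, a 2.5x increase
```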

To estimate the size of the array beyond which the third dimension is beneficial, the shorter average interconnect length offered by the third dimension and the increased complexity of the SBs should be simultaneously considered [761]. Incorporating the hardware resources (e.g., the number of transistors) required for each SB and the average interconnect length for a 2-D and 3-D FPGA, the minimum number of LBs for a 3-D FPGA to outperform a 2-D FPGA is determined from the solution of the following equation,

$$\frac{2}{3}\,F_{s,2\text{-}D}\,N^{1/2} = F_{s,3\text{-}D}\,N^{1/3}, \qquad (20.27)$$

where Fs,2-D and Fs,3-D are, respectively, the connectivity of the SBs in a 2-D and a 3-D FPGA (i.e., the number of outgoing routing tracks to which each incoming track can connect, three and five, respectively, per Fig. 20.26), and N is the number of LBs. Solving (20.27) yields N=244, a number that is well exceeded in modern FPGAs.
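
The N=244 threshold follows directly from (20.27) with Fs,2-D=3 and Fs,3-D=5, as the short check below confirms.

```python
# Solving (20.27) with Fs,2-D = 3 and Fs,3-D = 5:
# (2/3) * 3 * N**(1/2) = 5 * N**(1/3)  =>  2 * N**(1/6) = 5
N = (5 / 2) ** 6
print(N)   # ~244.1, i.e., the N = 244 threshold quoted above
```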

Since the pass transistors employed in both 2-D and 3-D SBs contribute significantly to the interconnect delay, degrading the performance of an FPGA, interconnects that span more than one LB can be utilized to bypass intermediate SBs. These interconnect segments are named after the number of LBs traversed by each segment, as shown in Fig. 20.27. Wires that span two, four, or even six LBs are quite common in contemporary FPGAs. Interconnects that span one fourth to one half of the die edge are also possible [762].

image
Figure 20.27 Interconnects that span more than one logic block. Li denotes the length of these interconnects and i is the number of LBs traversed by these wires.

The opportunities that the third dimension offers to SRAM based 2-D FPGAs have also been investigated [763]. Analytic models that estimate the channel width in 2-D FPGAs have been extended to 3-D FPGAs. Hence, the channel width W for an FPGA with N LBs, consisting exclusively of unit length interconnect segments and implemented in n physical tiers, can be described by

$$W = \frac{\displaystyle\sum_{l=1}^{2\sqrt{N/n}-2+(n-1)d_v} \chi_{fpga}\, l\, f_{3\text{-}D}(l)}{\left(2N + (n-1)\dfrac{N}{n}\right) e_t}, \qquad (20.28)$$

where f3-D(l) is a stochastic interconnect length distribution similar to those discussed in Chapter 7, Interconnect Prediction Models, and dv is the length of a vertical connection between adjacent tiers (in LB pitches). The factor χfpga converts a point-to-point distance into an interconnect length, and et is the utilization parameter of the wiring tracks. These two factors can be determined from statistical data characterizing the placement and routing of benchmark circuits on FPGAs. Note that these factors depend both on the architecture of the FPGA and on the automated layout algorithm used to route the FPGA.

Several characteristics of FPGAs have been estimated from benchmark circuits and randomized netlists placed and routed with the SEGment Allocator (SEGA) [764] and the versatile place and route (VPR) [765] tools, as well as from analytic expressions such as (20.28). In these benchmarks, each FPGA is assumed to contain 20,000 four input LBs and is manufactured in a 0.25 μm CMOS technology. The area, channel density, and average wirelength (the latter measured in LB pitches, where an LB pitch is the distance between two adjacent LBs) are listed in Table 20.7 for different numbers of physical tiers.

Table 20.7

Area, Wirelength, and Channel Density Improvement in 3-D FPGAs

Number of Tiers    Area (cm2)    Channel Density    Avg. Wirelength (LB Pitches)
1 (2-D)                7.84             41                        8
2 (3-D)                3.1              24                        6
3 (3-D)                1.77             20                        5
4 (3-D)                1.21             18                        5

The improvement in the interconnect delay for different numbers of tiers is depicted in Fig. 20.28, both for average length wires and for wires with a length equal to the die edge. The wires that span multiple LBs use unit length segments (i.e., no SBs are interspersed along these wires), whereas the die edge long wires use interconnect segments with a length equal to a quarter of the die edge. A significant decrease in delay is projected; however, these gains diminish for more than four tiers, as indicated by the saturated portion of the delay curves depicted in Fig. 20.28. The components of the power dissipated in a 3-D FPGA, assuming a 2.5 volt power supply, are shown in Fig. 20.29. The power consumed by the LBs remains constant since the structure of the LBs does not vary with the third dimension. Due to the shorter interconnect length, however, the power dissipated by the interconnects is lower. This improvement is smaller than the improvement in the interconnect delay, as indicated by the slopes of the curves illustrated in Figs. 20.28 and 20.29. This behavior is attributed to the extra pass transistors in a 3-D SB, which increase the power consumption, compromising the benefit of the shorter interconnect length. Due to the reduced interconnect length, the power dissipated by the clock distribution network is also lower.

image
Figure 20.28 Interconnect delay for several number of physical tiers, (A) average length wires, and (B) die edge length interconnects.
image
Figure 20.29 Power dissipated by 2-D and 3-D FPGAs.

20.5 Summary

Different classes of circuit structures that can greatly benefit from 3-D integration are described in this chapter. Several architectures and related design aids for these circuits are discussed. The primary points of this discussion can be summarized as follows:

• Vertical integration can improve both the latency and power consumption of wire limited and communication centric circuits. These circuits include the microprocessor memory system, on-chip networks, and FPGAs.

• Implementing several components of a microprocessor across multiple physical tiers decreases the power consumption and improves the speed by utilizing fewer redundant pipeline stages.

• Partitioning a cache memory into a 3-D structure can reduce the time required to access the memory.

• Stacking additional memory tiers on a microprocessor can improve the overall performance of a microprocessor memory system without exceeding the thermal budget of the system.

• 3-D NoC are a natural evolution of 2-D NoC and exhibit superior performance.

• The minimum latency and power consumption can be achieved in 3-D NoC by reducing both the number of hops per packet and the length of the communications channels.

• Expressions for the zero-load network latency and power consumption are described for 3-D NoC.

• NoC implemented by a 3-D mesh reduce the number of hops required to propagate a data packet between two nodes within a network.

• 2-D IC–3-D NoC decrease the latency and power consumption by reducing the number of hops.

• 3-D IC–2-D NoC decrease the latency and power consumption by reducing the buss length or, equivalently, the distance between adjacent network nodes.

• Large networks are primarily benefited by the 2-D IC–3-D NoC topology, while small networks are enhanced by the 3-D IC–2-D NoC topology.

• The distribution of nodes corresponding to either the minimum latency or power consumption of a network depends upon both the interconnect parameters and network size.

• A tradeoff exists between the number of tiers utilized for a network and the number of PEs. Consequently, and not surprisingly, the 3-D IC–3-D NoC topology achieves the greatest improvement in latency and power consumption by most effectively exploiting the third dimension.

• A simulator that supports the exploration of 3-D NoC topologies with different routing schemes, traffic patterns, and traffic loads has been developed.

• A multi-tier on-chip network that includes both 2-D and 3-D routers can produce a significant savings in energy and area with a tolerable increase in latency as compared to a full 3-D topology.

• 3-D FPGAs have been proposed to improve the performance of 2-D FPGAs, where multiple planar FPGAs are bonded to form a 3-D stack of FPGAs.

• A critical issue in 3-D FPGAs is the greater complexity of the 3-D SB, which can negate the benefits of the shorter interconnect length and enhanced connectivity among the LBs.


1This section, 20.3.4.4, was contributed by Professor Dimitrios Soudris of the National Technical University of Athens.
