This chapter introduces several hardware implementations that are often used for block ciphers, focusing on AES. AES is widespread and plays a significant role in guaranteeing the security of many secure systems. AES hardware can be implemented on a wide variety of architectures with synchronous-style design, and there are several options depending on the requirements for speed performance, area cost, and so on. General discussions on hardware architecture are presented first, followed by an introduction to AES hardware implementations. Throughout this chapter, we explore the merits and demerits of each architecture in detail.
In order to achieve high throughput, a hardware implementation is designed to process several operations or modules simultaneously. This is called a parallel architecture and is often used in timing-critical and real-time systems. It can also be understood that designers aiming at such parallelism have to increase the degree of parallelism discussed in Section 2.6.2.
The power consumption of the parallel architecture is relatively large, as signal toggles increase in proportion to the number of parallelized modules. For instance, Figure 3.1 shows an example where four independent 8-bit additions are implemented in parallel. Compared to an implementation where only one 8-bit adder is used, the data throughput of the parallel architecture becomes four times higher.
As all the modules running in parallel have to perform addition independently, it has to be assumed that no data dependencies exist between modules. When there is a data dependency, the operations generally have to be performed sequentially. For instance, let us consider the case where a 32-bit addition is implemented with four 8-bit adder modules. The carry bits are propagated from each 8-bit adder to the next more significant one; therefore, the parallel architecture illustrated in Figure 3.1 cannot be employed, as the carry-in signals of the left three 8-bit FAs are fixed to zero.
Figure 3.2 illustrates a serial architecture for a 32-bit adder. It takes four cycles to complete the 32-bit addition since it is based on 8-bit additions. As can be seen from the figure, the architecture requires several selectors, which cause an extra area cost and path delay.1 Nevertheless, the critical-path delay is expected to be much lower than that of a hardware implementation with the combinatorial logic of a full 32-bit carry-ripple adder.
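The contrast between the two cases can be sketched in software (a hypothetical Python model, not part of the book; the function names are illustrative): four independent 8-bit additions have no shared signals, whereas a 32-bit addition built from 8-bit adders is forced into a carry chain.

```python
# Sketch contrasting four independent 8-bit additions (parallelizable,
# Figure 3.1) with a 32-bit addition built from four 8-bit adders,
# where the carry chain creates a data dependency (Figure 3.2).

def add8(a, b, cin=0):
    """8-bit adder module: returns (sum mod 2^8, carry-out)."""
    s = a + b + cin
    return s & 0xFF, s >> 8

# Four independent additions: no carries cross module boundaries,
# so all four add8 calls could run simultaneously in hardware.
pairs = [(0x12, 0x34), (0xFF, 0x01), (0x80, 0x80), (0x0A, 0xF0)]
independent = [add8(a, b)[0] for a, b in pairs]

# 32-bit addition with four 8-bit adders: each adder needs the
# carry-out of the previous one, forcing serial operation
# (one byte per cycle in the serial architecture).
def add32(x, y):
    result, carry = 0, 0
    for i in range(4):  # least significant byte first
        s, carry = add8((x >> 8 * i) & 0xFF, (y >> 8 * i) & 0xFF, carry)
        result |= s << 8 * i
    return result & 0xFFFFFFFF

assert add32(0x0000FFFF, 0x00000001) == 0x00010000  # carry ripples upward
```

The four loop iterations in `add32` correspond to the four cycles of the serial architecture in Figure 3.2.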
An algorithmic optimization sometimes solves the data-dependency problem or facilitates data parallelism. In a straightforward implementation, only the right-most adder module in Figure 3.1 can start a correct operation, as the carry-in signals of the other three modules are not yet determined. However, considering the fact that each carry-in value has only two possibilities, 0 or 1, the three adder modules can start operation assuming both cases, carry-in = 0 and carry-in = 1. The corresponding architecture is shown in Figure 3.3.
In this case, the total number of adder modules increases to seven. This adder is known as the carry-select adder. Although several selector logics are needed to select the addition results, it can be seen that this optimization makes the best use of the parallel processing of 8-bit adders.
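The carry-select principle can be modeled as follows (an illustrative Python sketch, not the book's code): the three upper stages each compute both speculative results, and the arriving carry merely selects between them.

```python
# Sketch of the carry-select adder of Figure 3.3: the upper three 8-bit
# stages are each computed twice (assuming carry-in 0 and carry-in 1),
# and the actual carry selects the correct result. Seven 8-bit adder
# evaluations in total: 1 + 2 * 3.

def add8(a, b, cin):
    s = a + b + cin
    return s & 0xFF, s >> 8

def carry_select_add32(x, y):
    xb = [(x >> 8 * i) & 0xFF for i in range(4)]
    yb = [(y >> 8 * i) & 0xFF for i in range(4)]
    result, c = add8(xb[0], yb[0], 0)      # stage 0: normal addition
    for i in range(1, 4):
        sum0, carry0 = add8(xb[i], yb[i], 0)  # speculative: cin = 0
        sum1, carry1 = add8(xb[i], yb[i], 1)  # speculative: cin = 1
        s, c = (sum1, carry1) if c else (sum0, carry0)  # selector logic
        result |= s << 8 * i
    return result & 0xFFFFFFFF

assert carry_select_add32(0xFFFFFFFF, 1) == 0  # carries select all the way up
```

In software the speculative sums are still computed one after another, but in hardware the two `add8` evaluations of every stage run concurrently, leaving only the selector chain on the critical path.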
In order to utilize hardware resources efficiently, iterative use of combinatorial logic is often seen in hardware implementations. Combined with DFFs (delay flip-flops), such hardware is called a loop architecture. A simplified block diagram of the loop architecture is illustrated in Figure 3.4.
The loop architecture is basically for reducing the hardware cost while maintaining the speed performance. As a simple example, let us consider a hardware implementation that performs the modular computation

S = X1 + X2 + · · · + X8 mod 2^8,

where X1, ..., X8 are all 8-bit positive integers. The straightforward (loop-unrolled) implementation for this example is to prepare seven 8-bit adder modules as combinatorial logic, as shown in Figure 3.5.
The eight operands for the addition, X1, ..., X8, are all assumed to be stored in 8-bit DFFs, as they should be provided to the combinatorial logic at the same timing. The operation result, S, is computed in a single clock cycle and stored in 8-bit DFFs after the addition. In this case, the critical-path delay becomes 7 × Tadd, where Tadd is the critical-path delay of the 8-bit adder module.
By changing the connection of the 8-bit FAs into a tree as shown in Figure 3.6, the critical-path delay is shortened to 3 × Tadd, whereas the area cost remains the same.
Alternatively, the same functionality is realized with only one 8-bit adder module, as shown in Figure 3.7. This architecture is easy to understand by dividing the computation into sequential ones as

S ← S + Xi mod 2^8 for i = 1, 2, ..., 8,

where S = 0 is assumed before the computation.
One of the two inputs of the 8-bit adder can be provided from an external memory such as SRAM, and the other input is connected to the output of the DFF storing the intermediate addition results and the final result. The critical-path delay becomes Tadd; however, it takes eight cycles to finish the computation. If the additional timing penalty caused by the selector, the clock skew, and the setup timing margin is small enough, the speed performance of the two architectures shown in Figures 3.5 and 3.7 can be regarded as similar at the maximum clock frequency.
The advantage of the latter loop architecture is obviously its area efficiency. Moreover, the loop architecture offers scalability; that is, more than eight operands can be handled with a slight modification of the select signals. The loop architecture thus enables us to perform more iterations, such as a 100-operand modular addition.
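The loop architecture of Figure 3.7 can be sketched as follows (a behavioral Python model under the assumption that one loop iteration corresponds to one clock cycle):

```python
# Sketch of the loop architecture of Figure 3.7: a single 8-bit adder
# reused over eight cycles, with an 8-bit register (DFF) holding the
# running sum between cycles.

def modular_sum_loop(operands):
    """Accumulate 8-bit operands with one adder, one operand per cycle."""
    s = 0                        # DFF initialized to zero before computation
    for x in operands:           # one loop iteration = one clock cycle
        s = (s + x) & 0xFF       # 8-bit adder; result written back to the DFF
    return s

xs = [0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88]
assert modular_sum_loop(xs) == sum(xs) & 0xFF

# Scalability: the same datapath handles 100 operands in 100 cycles.
assert modular_sum_loop([7] * 100) == (7 * 100) & 0xFF
```

The trade-off of the section is visible here: the loop-unrolled circuit of Figure 3.5 evaluates seven adders in one cycle, while this loop reuses one adder over eight cycles.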
The pipeline architecture is commonly used in high-performance RISC (reduced instruction set computer) CPUs (central processing units) to support flexible and efficient software execution. The grain size of the computation of instructions in a RISC CPU is small, as such CPUs support only basic operations. Accordingly, the operation of each pipeline stage becomes simple, and the clock frequency can be increased so that the throughput of the software is accelerated. Note that the pipeline will stall if there are data-dependent operations; this is called a pipeline stall. It is known that such stalls diminish the performance merit of the pipeline architecture.
The pipeline architecture can also be applied to hardware implementations for block ciphers in order to accelerate multiple independent operations with a low area cost. The number of the pipeline stages should be defined according to the degree of parallelism necessary for a target application.
Here, let us consider a cryptographic application that requires encryption hardware for a 10-round block cipher. Suppose that the round function f is divided into four subfunctions as

f = f4 ∘ f3 ∘ f2 ∘ f1

in order to operate at a high clock frequency. Additionally, assuming for simplicity that each round operation is identical, the 10-round encryption function is described as

C = f^10(P),

where P and C are the plaintext and ciphertext, respectively.
The combinatorial logic for the round function, f, is divided accordingly, and three more DFFs are inserted between the subfunctions, as shown in Figure 3.8. As a result, the latency of one round becomes four clock cycles.
On the basis of the four-stage architecture, a 10-round loop architecture is implemented as illustrated in Figure 3.9. The throughput performance can be improved by up to four times by executing multiple encryption operations in parallel. Note that, when executing the encryption of a single plaintext, the architecture has no merit in terms of throughput or latency.
Let us examine the effect of the pipeline architecture on the iterative operations in the timing waveform shown in Figure 3.10. The first plaintext, P1, is processed after it is stored in DFF_1, and the result is stored in DFF_2 in the next cycle. Suppose that the size of the plaintext is w bits. One round operation for P1 is completed in four cycles, and the result is stored again in DFF_1. This means that each of the operations f1, f2, f3, and f4 is used only once every four cycles during the round operation for P1. Therefore, the throughput becomes

w/40 bits per cycle (10 rounds × 4 cycles per round)

when operating on a single plaintext with the architecture illustrated in Figure 3.9. This is because a pipeline stall happens.
In order to make the best use of the architecture's capability, four different plaintexts, P1, P2, P3, and P4, are sent to DFF_1 in four consecutive cycles. In this case, the throughput is improved to four times that of the previous case, namely

4w/40 = w/10 bits per cycle.
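The cycle counts above can be checked with a small Python model (an illustrative calculation following the text's assumptions of 10 rounds, 4 pipeline stages, and w-bit blocks):

```python
# Toy throughput model for the four-stage pipelined loop architecture
# of Figure 3.9: a single plaintext stalls the pipeline, while four
# interleaved plaintexts keep every stage busy.

ROUNDS, STAGES, w = 10, 4, 128

# Single plaintext: the next round cannot start until the current one
# leaves the pipeline, so each round costs STAGES cycles (pipeline stall).
cycles_single = ROUNDS * STAGES              # 40 cycles for one w-bit block
throughput_single = w / cycles_single        # w/40 bits per cycle

# Four plaintexts: the pipeline stays full, so four blocks finish in the
# same 40 cycles.
cycles_four = ROUNDS * STAGES                # still 40 cycles
throughput_four = 4 * w / cycles_four        # w/10 bits per cycle

assert throughput_four == 4 * throughput_single
```

With w = 128, the single-plaintext throughput is 3.2 bits per cycle, and the fully pipelined throughput is 12.8 bits per cycle.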
In this section, we consider several different hardware implementations for AES-128 and compare their cost and speed performance.
One of the most straightforward implementations of AES-128 encryption hardware is shown in Figure 3.11. It performs the 10-round AES-128 encryption in one cycle. It starts from the operation that XORs the plaintext with the initial subkey, K0.
The round operation consists of four transformations: SubBytes, ShiftRows, MixColumns, and AddRoundKey. The 128-bit AES state is aligned in a byte-wise manner, and each byte is operated on independently in the SubBytes transformation. The ShiftRows transformation changes the alignment of the bytes, which can be realized simply by wiring in hardware implementations. Then, the MixColumns transformation performs a finite field operation in GF(2^8), transforming four-byte inputs into four-byte outputs (see the details in Section 1.5). Finally, the round operation generates 128-bit data after the AddRoundKey transformation, where a subkey, Ki, is added to the output of the MixColumns transformation. The round operation is performed 10 times in total, and the ciphertext is output. Note that the MixColumns transformation is not performed in the 10th round operation.
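The fact that ShiftRows is pure wiring is easy to see in a software sketch (a minimal Python model of the standard AES byte rotation; the state layout here is the usual 4 × 4 row representation):

```python
# Sketch of ShiftRows on the 4x4 AES state: row r is rotated left by
# r bytes. Since no byte value changes, the hardware realization is
# pure wiring with no logic gates.

def shift_rows(state):
    """state: list of 4 rows, each a list of 4 byte values."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]

state = [[r * 4 + c for c in range(4)] for r in range(4)]
out = shift_rows(state)
assert out[0] == state[0]        # row 0 is unchanged
assert out[1] == [5, 6, 7, 4]    # row 1 rotated left by one byte
```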
The AES-128 decryption hardware can be implemented by reversing the order of the encryption transformations, as shown in Figure 3.12. Notice that the inverse of the AddRoundKey transformation is the same as the AddRoundKey transformation itself, as they are both XOR operations. It is worth mentioning that the order of the InvShiftRows and InvSubBytes transformations in the round operation can be easily switched, as the InvShiftRows transformation is implemented with byte-wise wiring and InvSubBytes operates on each byte independently.2 Therefore, it is possible to switch the order of the InvShiftRows and InvSubBytes transformations.
Even after switching the order of InvShiftRows and InvSubBytes in Figure 3.12, the order of transformations in the round operation is still slightly different between Figures 3.11 and 3.12. That is, in order to have the same sequence of transformations for both AES encryption and decryption, ignoring the inverse properties, the order of the AddRoundKey and InvMixColumns transformations needs to be changed. In changing the order, the subkeys K1 to K9 are required to go through the InvMixColumns transformation, that is,

K'i = InvMixColumns(Ki),

because

InvMixColumns(I ⊕ Ki) = InvMixColumns(I) ⊕ InvMixColumns(Ki) = InvMixColumns(I) ⊕ K'i,

where I is an intermediate value output from the InvShiftRows and InvSubBytes transformations.
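The identity above relies on InvMixColumns being linear over XOR. This can be verified with a short sketch (an illustrative Python model using the textbook GF(2^8) multiplication and the standard InvMixColumns coefficients; the column values I and K are arbitrary examples):

```python
# Sketch verifying that InvMixColumns distributes over XOR:
# InvMixColumns(I xor K) == InvMixColumns(I) xor InvMixColumns(K).

def gmul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (AES polynomial)."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1B
        p &= 0xFF
        b >>= 1
    return p

def inv_mix_column(col):
    """InvMixColumns on one 4-byte column (circulant matrix 0E 0B 0D 09)."""
    m = [0x0E, 0x0B, 0x0D, 0x09]
    out = []
    for r in range(4):
        v = 0
        for c in range(4):
            v ^= gmul(m[(c - r) % 4], col[c])
        out.append(v)
    return out

I = [0x12, 0x34, 0x56, 0x78]   # example intermediate column
K = [0xA5, 0x5A, 0xC3, 0x3C]   # example subkey column
lhs = inv_mix_column([i ^ k for i, k in zip(I, K)])
rhs = [a ^ b for a, b in zip(inv_mix_column(I), inv_mix_column(K))]
assert lhs == rhs              # linearity over XOR
```

Linearity follows because GF(2^8) multiplication distributes over XOR (the field addition), so applying InvMixColumns to the subkeys in advance is equivalent to applying it after the key addition.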
Figure 3.13 shows another AES-128 decryption hardware implementation that has the same operation sequence as the AES-128 encryption shown in Figure 3.11. The modification in Figure 3.13 motivates us to implement a round operation that can be used for both encryption and decryption in order to optimize the hardware cost. That is, one optimization in hardware is to implement each transformation so that its inverse transformation can also be supported efficiently by sharing hardware resources.
Here, we have excluded the detailed description of the Key Scheduling Function, the module that generates the 128-bit subkeys provided to the AddRoundKey transformation. It can be performed off-line, that is, precomputed before the encryption and decryption, as long as the secret key is not changed frequently and there is enough memory space to store the subkeys. Alternatively, in order to reduce the hardware cost, the subkey generation can be performed round by round. In the following sections, we explore several different types of implementations for AES-128.
Figure 3.14 shows a loop architecture for AES. The first AddRoundKey operates on the plaintext and the encryption key, and the result is stored in DFF_1. Then, the round operations are iteratively performed nine times with the loop architecture including DFF_2 in the datapath. The result of the ninth round (9R) is stored in DFF_3, whereas DFF_2 keeps the result of the eighth round (8R). The 10th round operation is performed on the output of 9R, and the result is stored in DFF_C as the ciphertext. The corresponding timing waveform can be found in Figure 3.15.
This loop architecture requires three AddRoundKey blocks, two SubBytes and two ShiftRows blocks, and one MixColumns block. The total number of DFFs is 512 (= 128 × 4, for DFF_1, DFF_2, DFF_3, and DFF_C), excluding the DFFs for the encryption key. Here, we ignore the hardware cost of the Key Scheduling Function, as its implementation style varies with the use case, for example, when the encryption key is fixed for a long term.
The critical-path delay of this architecture can be estimated as

Tcritical = TSubBytes + TShiftRows + TMixColumns + TAddRoundKey + TMUX + TDEMUX + α,

where TSubBytes, TShiftRows, TMixColumns, and TAddRoundKey are the delay times of the SubBytes, ShiftRows, MixColumns, and AddRoundKey blocks, TMUX and TDEMUX are the delay times of the multiplexer and demultiplexer, and α takes the setup time and clock skew into consideration.
There are several ways to reduce the hardware cost. One alternative is shown in Figure 3.16, where two AddRoundKey blocks, one SubBytes block, one ShiftRows block, and one MixColumns block are used. The timing waveform is depicted in Figure 3.17. Note that the MixColumns block is skipped only in the 10th round. Only three DFFs are needed to support this loop architecture, excluding the DFFs for the encryption key.
However, the critical-path delay is increased owing to the additional selectors inserted in the loop datapath.
On the basis of the discussion in Section 3.3, we consider a two-level pipeline architecture for AES-128 encryption. The first-level pipeline is applied to the 10-round operation, and the second-level pipeline exploits the operations within each round.
When AES encryption is used in practice, a mechanism called a mode of operation is employed for security reasons.3 Among the various modes of operation, the counter mode (CTR mode) uses a counter value as the plaintext for AES encryption, and the message is XORed with the encryption result. Therefore, even for a long message, all the AES encryptions can be parallelized using the counter values. Accordingly, the throughput for the whole message is determined by that of the AES encryption hardware itself.
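Why CTR mode parallelizes so well can be seen in a short sketch (a hypothetical Python model; the toy block cipher below is an insecure stand-in for AES-128, used only to show the data flow):

```python
# Sketch of CTR mode: each keystream block depends only on the counter
# value, not on other message blocks, so all block encryptions are
# independent and can run in parallel (e.g., in pipeline stages).

def toy_block_cipher(key, block):
    """Insecure stand-in for AES-128; for illustration only."""
    return (block * 0x0123456789ABCDEF1 + key) % (1 << 128)

def ctr_encrypt(key, nonce, message_blocks):
    # Every iteration is independent of the others: the keystream block
    # for index i is computed from (nonce, i) alone.
    return [m ^ toy_block_cipher(key, (nonce << 64) | i)
            for i, m in enumerate(message_blocks)]

key, nonce = 0xDEADBEEF, 0x42
msg = [0x11111111, 0x22222222, 0x33333333]
ct = ctr_encrypt(key, nonce, msg)
# CTR decryption is the same operation (XOR with the same keystream).
assert ctr_encrypt(key, nonce, ct) == msg
```

Because the loop body has no dependency between iterations, a pipelined AES core can accept a new counter block every cycle, which is exactly the situation the architecture of Figure 3.18 targets.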
In this regard, the pipeline architecture shown in Figure 3.18 is appropriate, as the amount of data processed per cycle is large (128 bits/cycle) and a high clock frequency can be expected. If the architecture operates at 100 MHz, the throughput becomes 12.8 Gbps.
What can be done to further improve the throughput? One promising way is to reduce the critical-path delay by introducing additional pipeline stages in the critical block. In the case of AES encryption, SubBytes tends to be the critical block as long as its nonlinear transformation is implemented with combinatorial logic.4 Therefore, it is natural to divide the SubBytes transformation into several smaller operations, or to place an additional DFF immediately after it, as shown in Figure 3.19. Note that throughput improvements often come at the price of hardware cost.
It is also possible to apply the pipeline to the loop architecture, as shown in Figure 3.9. In this case, the area cost is much lower than that of the architecture in Figure 3.18.
In the previous discussions, the width of the datapath was fixed to 128 bits. As we can easily imagine from the fact that 8-bit or 32-bit CPUs can execute AES-128, the datapath can be narrowed by performing byte-wise operations. For instance, the architecture shown in Figure 3.20 employs a 32-bit (4-byte) datapath that performs 4-byte operations in one cycle. That is, it requires four cycles to complete a round operation.
All the intermediate values are stored in a memory module such as an SRAM. In the case of the architecture of Figure 3.20, the 4-byte data should be aligned in accordance with the ShiftRows transformation when writing to or reading from the SRAM. The merit of using a 4-byte datapath is that one MixColumns operation, which operates on 4 bytes, can be performed in one cycle.
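The cycle budget of the 32-bit datapath can be estimated with a simple calculation (an illustrative model that counts only round-function cycles, ignoring the initial key addition and memory-access overhead):

```python
# Cycle-count sketch for the 32-bit datapath of Figure 3.20: the 128-bit
# state is processed as four 4-byte words, so one round takes four
# cycles and a 10-round AES-128 encryption takes about 40 cycles.

WORDS_PER_STATE = 128 // 32      # four 4-byte words per AES state
ROUNDS = 10

cycles_per_round = WORDS_PER_STATE           # one word per cycle
total_cycles = ROUNDS * cycles_per_round
assert total_cycles == 40

# Estimated throughput at f MHz: 128 bits every total_cycles cycles.
f_mhz = 100
throughput_mbps = 128 * f_mhz / total_cycles
assert throughput_mbps == 320.0
```

Compared with the 128-bit pipeline of Figure 3.18 (12.8 Gbps at the same clock frequency), this illustrates the throughput sacrificed for the smaller datapath.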