Chapter 3
Hardware Implementations for Block Ciphers

This chapter introduces several hardware implementations that are often used for block ciphers, focusing on AES. AES is widespread and plays a significant role in guaranteeing the security of many secure systems. AES hardware can be implemented on a wide variety of architectures in a synchronous design style, and several options exist depending on the requirements for speed performance, area cost, and so on. We first give a general discussion of hardware architectures, then introduce the AES hardware implementations, and finally explore the merits and demerits of each architecture throughout this chapter.

3.1 Parallel Architecture

In order to achieve high-throughput performance, a hardware implementation is designed to process several operations or modules simultaneously. This is called a parallel architecture and is often used in timing-critical and real-time systems. It can also be understood that designers aiming at such parallelism have to increase the degree of parallelism discussed in Section 2.6.2.

3.1.1 Comparison between Serial and Parallel Architectures

Power consumption of the parallel architecture is relatively large, as signal toggles increase in proportion to the number of parallelized modules. For instance, Figure 3.1 shows an example where four independent 8-bit additions are implemented in parallel. Compared to an implementation where only one 8-bit adder is used, the data throughput of the parallel architecture becomes four times higher.


Figure 3.1 Parallel architecture of four 8-bit additions

As all the modules running in parallel have to perform their additions independently, it must be assumed that no data dependencies exist between modules. When there is a data dependency, the operations generally have to be performed sequentially. For instance, let us consider the case where a 32-bit addition is implemented with four 8-bit adder modules. The carry bits are propagated between adjacent modules, that is, the carry-out of each 8-bit adder becomes the carry-in of the next; the parallel architecture illustrated in Figure 3.1 cannot be employed because the carry-in signals of the left three 8-bit adders are fixed to zero.

Figure 3.2 illustrates a serial architecture for a 32-bit adder. It takes four cycles to complete the 32-bit addition since it is based on 8-bit additions. As can be seen from the figure, the architecture requires several selectors, which cause extra area cost and path delay.1 Nevertheless, the critical-path delay is expected to be much lower than that of the combinatorial logic of a full 32-bit carry-ripple adder.
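The behavior of this serial adder can be sketched in software as a minimal model (not a hardware description; the names `add8` and `serial_add32` are illustrative). Each loop iteration corresponds to one clock cycle, with the carry register linking consecutive cycles:

```python
def add8(a, b, cin):
    """One 8-bit adder module: returns (8-bit sum, carry-out)."""
    s = a + b + cin
    return s & 0xFF, s >> 8

def serial_add32(x, y):
    """Add two 32-bit values in four cycles, reusing a single 8-bit adder."""
    carry, result = 0, 0
    for cycle in range(4):              # one byte slice per clock cycle
        a = (x >> (8 * cycle)) & 0xFF   # selector picks the next slice
        b = (y >> (8 * cycle)) & 0xFF
        s, carry = add8(a, b, carry)    # carry register links the cycles
        result |= s << (8 * cycle)
    return result                       # final carry-out is dropped (32-bit result)
```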


Figure 3.2 Serial architecture for 32-bit adder (multi-cycle carry-ripple adder)

3.1.2 Algorithm Optimization for Parallel Architectures

An algorithmic optimization sometimes solves the data-dependency problem or facilitates data parallelism. In a straightforward implementation, only the right-most adder module in Figure 3.1 can start a correct operation, as the carry-in signals of the other three modules are not yet determined. However, considering the fact that a carry-in value has at most two possibilities, 0 or 1, the three adder modules can start operating by assuming both cases, carry-in = 0 and carry-in = 1. The corresponding architecture is shown in Figure 3.3.


Figure 3.3 Parallelized architecture for 32-bit adder (carry-select adder)

In this case, the number of adder modules increases to seven in total. This adder is well known as the carry-select adder. Although several selector logics are needed to select the correct addition results, it can be seen that this optimization makes the best use of the parallel processing of the 8-bit adders.
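The carry-select idea can be sketched as follows (illustrative names; 8-bit blocks as in the figure): the upper blocks compute both speculative sums, and the actual carry only drives the final selection.

```python
def add8(a, b, cin):
    """One 8-bit adder module: returns (8-bit sum, carry-out)."""
    s = a + b + cin
    return s & 0xFF, s >> 8

def carry_select_add32(x, y):
    """32-bit addition where each block speculates on both carry-in values."""
    carry, result = 0, 0
    for i in range(4):
        a = (x >> (8 * i)) & 0xFF
        b = (y >> (8 * i)) & 0xFF
        s0, c0 = add8(a, b, 0)   # speculative result for carry-in = 0
        s1, c1 = add8(a, b, 1)   # speculative result for carry-in = 1
        # selector: the real carry picks one of the precomputed results
        s, carry = (s1, c1) if carry else (s0, c0)
        result |= s << (8 * i)
    return result
```

Note that in hardware all speculative additions run simultaneously; the Python loop only models the short selection chain that remains on the critical path.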

3.2 Loop Architecture

In order to utilize hardware resources efficiently, iterative use of combinatorial logic is often seen in hardware implementations. Combined with DFFs (D flip-flops), such hardware is called a loop architecture. A simplified block diagram of the loop architecture is illustrated in Figure 3.4.


Figure 3.4 Loop architecture

3.2.1 Straightforward (Loop-Unrolled) Architecture

The loop architecture is basically for reducing the hardware cost efficiently while maintaining the speed performance. As a simple example, let us consider a hardware implementation that performs the modular computation

S = (x_0 + x_1 + ... + x_7) mod 2^8,   (3.1)

where x_0, ..., x_7 and S are all 8-bit positive integers. The straightforward (loop-unrolled) implementation for this example is to prepare seven 8-bit adder modules as combinatorial logic, as shown in Figure 3.5.


Figure 3.5 Straightforward (loop-unrolled) implementation of 8-operand modular addition

The eight operands for the addition, x_0, ..., x_7, are all assumed to be stored in 8-bit DFFs, as they should be provided to the combinatorial logic at the same timing. The operation result, S, is computed in a single clock cycle and stored in 8-bit DFFs after the addition. In this case, the critical-path delay becomes 7 × T_add, where T_add is the critical-path delay of the 8-bit adder module.

By changing the connection of the 8-bit adders into a balanced tree as shown in Figure 3.6, the critical-path delay is shortened to 3 × T_add, whereas the area cost remains the same.
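The difference between the serial chain of Figure 3.5 (depth 7) and the balanced tree of Figure 3.6 (depth 3) can be checked with a small model; both use seven modular adders and must produce the same result. The function names are illustrative:

```python
from functools import reduce

def add8mod(a, b):
    """One 8-bit adder module with the carry-out discarded (mod 2^8)."""
    return (a + b) & 0xFF

def chain_sum(xs):
    """Figure 3.5: seven adders in series, critical path of 7 adder delays."""
    return reduce(add8mod, xs)

def tree_sum(xs):
    """Figure 3.6: balanced adder tree, critical path of 3 adder delays."""
    level = list(xs)
    while len(level) > 1:       # each while-iteration is one tree level
        level = [add8mod(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]
```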


Figure 3.6 Optimized implementation of 8-operand modular addition

3.2.2 Basic Loop Architecture

Alternatively, the same functionality can be realized with only one 8-bit adder module, as shown in Figure 3.7. This architecture is easy to understand by dividing the computation into the sequential steps

S_i = (S_{i-1} + x_i) mod 2^8,  i = 0, 1, ..., 7,

where S_{-1} = 0 is assumed before the computation and S = S_7 is the final result.


Figure 3.7 Loop architecture for 8-operand modular addition

One of the two inputs of the 8-bit adder can be provided from an external memory such as SRAM, and the other input is connected to the output of the DFF storing the intermediate addition result and the final result. The critical-path delay becomes T_add; however, it takes eight cycles to finish the computation. If the additional timing penalty caused by the selector, the clock skew, and the setup timing margin is small enough, the speed performance of the two architectures shown in Figures 3.5 and 3.7 can be regarded as similar under the maximum clock frequency.

The advantage of the latter loop architecture is obviously its low area cost. Moreover, the loop architecture offers scalability; that is, more than eight operands can be used for modular addition with a slight modification of the select signals. The loop architecture thus enables us to perform more iterations, such as a 100-operand modular addition.
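A minimal model of the loop architecture of Figure 3.7, with one addition per clock cycle and the running sum standing in for the DFF contents (names illustrative); the same loop directly scales to 100 operands:

```python
def loop_sum(operands):
    """Accumulate operands mod 2^8 with a single 8-bit adder."""
    s = 0                      # DFF holding S_{i-1}, initialized to zero
    for x in operands:         # one clock cycle per operand
        s = (s + x) & 0xFF     # S_i = (S_{i-1} + x_i) mod 2^8
    return s
```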

3.3 Pipeline Architecture

Pipeline architecture is commonly used in high-performance RISC (reduced instruction set computer) CPUs (central processing units) to support flexible and efficient software execution. The grain size of the instructions in a RISC CPU is small, as they support only basic operations. Accordingly, the operation of each pipeline stage becomes simple, and the clock frequency can be increased so that the throughput of the software is accelerated. Note that the pipeline has to wait when there are data-dependent operations, which is called a pipeline stall. It is known that such stalls diminish the performance merit of the pipeline architecture.

3.3.1 Pipeline Architecture for Block Ciphers

The pipeline architecture can also be applied to hardware implementations of block ciphers in order to accelerate multiple independent operations at a low area cost. The number of pipeline stages should be defined according to the degree of parallelism necessary for the target application.

Here, let us consider a cryptographic application that requires encryption hardware for a 10-round block cipher. Suppose that the round function f is divided into four subfunctions as

f = f_4 ∘ f_3 ∘ f_2 ∘ f_1   (3.2)

in order to operate at a high clock frequency. Additionally, assuming for simplicity that each round operation is identical, the 10-round encryption function is described as

C = f^10(P) = f(f(... f(P) ...)),   (3.3)

where P and C are the plaintext and ciphertext, respectively.

The combinatorial logic for the round function, f, is divided accordingly, and three more DFFs are inserted between the subfunctions f_1, ..., f_4, as shown in Figure 3.8. As a result, the latency of one round becomes four clock cycles.


Figure 3.8 Four-stage architecture for the round function, f

3.3.2 Advanced Pipeline Architecture for Block Ciphers

On the basis of the four-stage architecture, a 10-round loop architecture is implemented as illustrated in Figure 3.9. The throughput performance can be improved by up to four times by executing multiple encryption operations in parallel. Note that, when executing the encryption of a single plaintext, the architecture has no merit with respect to throughput or latency.


Figure 3.9 Four-stage pipeline architecture for 10-round encryption

Let us see the details of the effect of the pipeline architecture on the iterative operations in the timing waveform shown in Figure 3.10. The first plaintext, P_1, is processed after it is stored in DFF_1, and the output of f_1 is stored in DFF_2 in the next cycle. Suppose that the size of a plaintext is w bits. Likewise, one round operation for P_1 is completed in four cycles, and the result is stored again in DFF_1. This means that each of the subfunctions f_1, f_2, f_3, and f_4 is used only once every four cycles for the round operation on P_1. Therefore, since the 10-round encryption takes 10 × 4 = 40 cycles, the throughput becomes

w/40 [bits/cycle]

when operating on a single plaintext with the architecture illustrated in Figure 3.9. This is because a pipeline stall happens.


Figure 3.10 Timing waveform for four-stage pipelined 10-round encryption

In order to make the best use of the architecture's capability, four different plaintexts, P_1, P_2, P_3, and P_4, are sent to DFF_1 in four consecutive cycles. In this case, the throughput is improved to four times that of the previous case:

4 × w/40 = w/10 [bits/cycle].

3.4 AES Hardware Implementations

In this section, we consider several different hardware implementations for AES-128 and compare their cost and speed performance.

3.4.1 Straightforward Implementation for AES-128

One of the most straightforward implementations of AES-128 encryption hardware is shown in Figure 3.11. It performs the 10-round AES-128 encryption in one cycle. It starts from the AddRoundKey operation that XORs the plaintext with the first subkey, K_0.


Figure 3.11 Straightforward implementation for AES-128 encryption

The round operation consists of four transformations: SubBytes, ShiftRows, MixColumns, and AddRoundKey. The 128-bit AES state is aligned in a byte-wise manner, and each byte is operated on independently in the SubBytes transformation. The ShiftRows transformation changes the alignment of the bytes, which can be realized simply by wiring in hardware implementations. Then, the MixColumns transformation performs a finite-field operation in GF(2^8), transforming four-byte inputs into four-byte outputs (see the details in Section 1.5). Finally, the round operation generates 128-bit data after the AddRoundKey transformation, where a subkey, K_r, is added to the output of the MixColumns transformation. The round operation is performed 10 times in total, and then the ciphertext is output. Note that the MixColumns transformation is not performed in the 10th round operation.
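As an illustration of the finite-field arithmetic inside the round, the following sketch implements a single column of the MixColumns transformation using the well-known xtime (multiply-by-x) trick; it is a software model only and omits SubBytes, ShiftRows, and AddRoundKey:

```python
def xtime(a):
    """Multiply a field element by x (i.e., by 2) in GF(2^8)."""
    a <<= 1
    return (a ^ 0x1B) & 0xFF if a & 0x100 else a

def mix_single_column(col):
    """Apply the MixColumns matrix to one 4-byte column of the state."""
    a, b, c, d = col
    t = a ^ b ^ c ^ d
    return [a ^ t ^ xtime(a ^ b),
            b ^ t ^ xtime(b ^ c),
            c ^ t ^ xtime(c ^ d),
            d ^ t ^ xtime(d ^ a)]
```

The result can be checked against the test vectors in FIPS 197; for example, the column (d4, bf, 5d, 30) maps to (04, 66, 81, e5).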

The AES-128 decryption hardware can be implemented by reversing the order of the encryption transformations, as shown in Figure 3.12. Notice that the inverse of the AddRoundKey transformation is identical to the AddRoundKey transformation itself, as both are XOR operations. It is also worth mentioning that the order of the InvShiftRows and InvSubBytes transformations in the round operation can easily be switched, as the InvShiftRows transformation is implemented with byte-wise wiring and InvSubBytes operates on each byte independently.2


Figure 3.12 Straightforward implementation for AES-128 decryption

Even if the order of InvShiftRows and InvSubBytes in Figure 3.12 is switched, the order of transformations in the round operation still differs slightly between Figures 3.11 and 3.12. That is, in order to have the same sequence of transformations for both AES encryption and decryption (ignoring the inverse properties), the order of the AddRoundKey and InvMixColumns transformations needs to be changed. In changing the order, the subkeys K_1 to K_9 are required to go through the InvMixColumns transformation, that is,

InvMixColumns(I ⊕ K_r) = InvMixColumns(I) ⊕ K'_r,   (3.4)

where K'_r = InvMixColumns(K_r), because

InvMixColumns(I ⊕ K_r) = InvMixColumns(I) ⊕ InvMixColumns(K_r)   (3.5)
                       = InvMixColumns(I) ⊕ K'_r,   (3.6)

where I is an intermediate value output from the InvShiftRows and InvSubBytes transformations.
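The identity above relies only on the linearity of InvMixColumns over XOR, which the following sketch checks on one column (a software model; `gmul` is a generic GF(2^8) multiplication and the names are illustrative):

```python
def gmul(a, b):
    """Multiplication in GF(2^8) with the AES polynomial x^8+x^4+x^3+x+1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1B
        b >>= 1
    return p

def inv_mix_column(col):
    """One column of InvMixColumns (matrix coefficients 0e 0b 0d 09)."""
    a, b, c, d = col
    return [gmul(a, 0x0E) ^ gmul(b, 0x0B) ^ gmul(c, 0x0D) ^ gmul(d, 0x09),
            gmul(a, 0x09) ^ gmul(b, 0x0E) ^ gmul(c, 0x0B) ^ gmul(d, 0x0D),
            gmul(a, 0x0D) ^ gmul(b, 0x09) ^ gmul(c, 0x0E) ^ gmul(d, 0x0B),
            gmul(a, 0x0B) ^ gmul(b, 0x0D) ^ gmul(c, 0x09) ^ gmul(d, 0x0E)]

# Linearity: InvMixColumns(I xor K) == InvMixColumns(I) xor InvMixColumns(K)
I = [0x12, 0x34, 0x56, 0x78]
K = [0x9A, 0xBC, 0xDE, 0xF0]
lhs = inv_mix_column([i ^ k for i, k in zip(I, K)])
rhs = [x ^ y for x, y in zip(inv_mix_column(I), inv_mix_column(K))]
assert lhs == rhs
```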

Figure 3.13 shows another AES-128 decryption hardware design that has the same operation sequence as the AES-128 encryption in Figure 3.11. The modification in Figure 3.13 motivates us to implement a round operation that can be used for both encryption and decryption in order to optimize the hardware cost. That is, one of the optimizations in hardware is to implement each transformation so that its inverse transformation can also be supported efficiently by sharing hardware resources.


Figure 3.13 Straightforward implementation for AES-128 decryption with modified key scheduling

Here, we have excluded the detailed description of the key scheduling function, that is, the KeyExpansion module that generates the 128-bit subkeys provided to the AddRoundKey transformation. It can be performed off-line, that is, precomputed before the encryption and decryption, as long as the secret key is not changed frequently and there is enough memory space to store the subkeys. Alternatively, in order to reduce the hardware cost, the subkey generation can be performed round by round. In the following sections, we explore several different types of implementations for AES-128.

3.4.2 Loop Architecture for AES-128

Figure 3.14 shows a loop architecture for AES. The first AddRoundKey operates on the plaintext and the encryption key, and the result is stored in DFF_1. Then, the round operations are iteratively performed nine times with the loop architecture including DFF_2 in the datapath. The result of the ninth round (9R) is stored in DFF_3, whereas DFF_2 keeps the result of the eighth round (8R). The 10th round operation is performed on the output of 9R, and the result is stored in DFF_C as the ciphertext. The corresponding timing waveform can be found in Figure 3.15.


Figure 3.14 Loop architecture (I) for AES-128 encryption


Figure 3.15 Timing waveform for loop architecture in Figure 3.14

This loop architecture requires three AddRoundKey blocks, two SubBytes blocks, two ShiftRows blocks, and one MixColumns block. The total number of DFFs is 4 × 128, or 512, excluding the DFFs for the encryption key. Here, we ignore the hardware cost of the key scheduling function, as its implementation style varies with the use case, for example, when the encryption key is fixed for a long term.

The critical-path delay of the architecture can be estimated as

T_critical = T_SB + T_SR + T_MC + T_ARK + T_MUX + T_DEMUX + α,   (3.7)

where T_SB, T_SR, T_MC, and T_ARK are the delay times of the SubBytes, ShiftRows, MixColumns, and AddRoundKey blocks, T_MUX and T_DEMUX are the delay times of the multiplexer and demultiplexer, and α takes the setup time and clock skew into consideration.

There are several ways to reduce the hardware cost. One alternative is shown in Figure 3.16, where two AddRoundKey blocks, one SubBytes block, one ShiftRows block, and one MixColumns block are used. The timing waveform is depicted in Figure 3.17. It can be seen that the MixColumns block is skipped only in the 10th round. Only three DFFs are needed to support this loop architecture, excluding the DFFs for the encryption key.


Figure 3.16 Loop architecture (II) for AES-128 encryption


Figure 3.17 Timing waveform for loop architecture in Figure 3.16

However, the critical-path delay increases to

T_critical = T_SB + T_SR + T_MC + T_ARK + 2 × T_MUX + T_DEMUX + α.   (3.8)

3.4.3 Pipeline Architecture for AES-128

On the basis of the discussion in Section 3.3, we consider a two-level pipeline architecture for AES-128 encryption. The first-level pipeline covers the 10-round operation, and the second level exploits the operations within each round.

3.4.3.1 High-Throughput Architecture

When AES encryption is used in practice, a mechanism called a mode of operation is employed for security reasons.3 Among the various modes of operation, the so-called counter (CTR) mode uses a counter value as the plaintext for AES encryption, and the message is XORed with the encryption result. Therefore, even for a long message, all the AES encryptions can be parallelized using the counter values, since there are no data dependencies between the block encryptions.

In this regard, the pipeline architecture shown in Figure 3.18 is appropriate, as the amount of data processed per cycle is large (128 bits/cycle) and a high clock frequency can be expected. If the architecture operates at 100 MHz, the throughput becomes 12.8 Gbps.
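The throughput figure follows directly from the 128 bits leaving the pipeline per cycle:

```python
bits_per_cycle = 128   # one 128-bit block leaves the pipeline per cycle
clock_hz = 100e6       # assumed 100 MHz operating frequency
throughput_bps = bits_per_cycle * clock_hz
# 128 * 100e6 = 1.28e10 bits/s = 12.8 Gbps
```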


Figure 3.18 High-throughput pipeline architecture for AES-128

What can be done for further improvements in throughput? One promising way is to reduce the critical-path delay by introducing additional pipeline stages in the critical block. In the case of AES encryption, SubBytes tends to be the critical block as long as its nonlinear transformation is implemented with combinatorial logic.4 Therefore, it is natural to divide the SubBytes transformation into several smaller operations, or to place an additional DFF immediately after SubBytes, as shown in Figure 3.19. Note that throughput improvements often come at a price in hardware cost.


Figure 3.19 Pipeline in the round operation

3.4.3.2 Combination with Loop Architecture

It is also possible to apply the pipeline to the loop architecture, as shown in Figure 3.9. In this case, the area cost is much lower than that of the architecture in Figure 3.18.

3.4.4 Compact Architecture for AES-128

In the previous discussions, the width of the datapath was fixed to 128 bits. As we can easily imagine from the fact that an 8-bit or 32-bit CPU can run AES-128, the datapath can be narrowed by performing byte-wise operations. For instance, the architecture shown in Figure 3.20 employs a 32-bit (4-byte) datapath that performs 4-byte operations in one cycle. That is, it requires four cycles to complete one round operation.


Figure 3.20 Round operation with 32-bit or 4-byte datapath

All the intermediate values are stored in a memory module such as an SRAM. In the case of the architecture of Figure 3.20, the 4-byte data should be aligned in accordance with the ShiftRows transformation when writing to or reading from the SRAM. The merit of the 4-byte datapath is that one MixColumns operation, which works on 4 bytes, can be performed in one cycle.
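One way to handle this alignment is to fold ShiftRows into the SRAM read addresses: when processing column c, the datapath fetches the byte of row r from column (c + r) mod 4. The following is a sketch of that addressing idea, not necessarily the book's exact scheme:

```python
def shifted_column_addresses(col):
    """(row, column) byte addresses to fetch for one 4-byte MixColumns input,
    with the ShiftRows rotation absorbed into the addressing."""
    return [(row, (col + row) % 4) for row in range(4)]
```

For example, column 0 gathers the main diagonal of the state, exactly as ShiftRows followed by a column read would.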

Further Reading

  1. Verbauwhede, I.M. (ed.) (2007) Secure Integrated Circuits and Systems. Springer-Verlag.

 
