This chapter introduces several hardware implementations that are often used for block ciphers, focusing on AES. AES is widespread and plays a significant role in guaranteeing the security of many secure systems. AES hardware can be implemented on a wide variety of architectures with synchronous-style design, and there are several options depending on the requirements for speed performance, area cost, and so on. General discussions on hardware architecture are presented first, followed by an introduction to AES hardware implementations. Throughout this chapter, we explore the merits and demerits of each architecture in detail.
In order to achieve high throughput, a hardware implementation is designed to process several operations or modules simultaneously. This is called a parallel architecture and is often used in timing-critical and real-time systems. It can also be understood that designers aiming at such parallelism have to increase the degree of parallelism discussed in Section 2.6.2.
The power consumption of the parallel architecture is relatively large, as signal toggles increase in proportion to the number of parallelized modules. For instance, Figure 3.1 shows an example where four independent 8-bit additions are implemented in parallel. Compared to an implementation where only one 8-bit adder is used, the data throughput of the parallel architecture becomes four times higher.
As all the modules running in parallel have to perform addition independently, it has to be assumed that no data dependencies exist between modules. When there is a data dependency, the operations generally have to be performed sequentially. For instance, let us consider the case where a 32-bit addition is implemented with four 8-bit adder modules. The carry bits are propagated from each 8-bit adder to the next more significant one; therefore, the parallel architecture illustrated in Figure 3.1 cannot be employed, as the carry-in signals of the left three 8-bit FAs are fixed to zero.
Figure 3.2 illustrates a serial architecture for a 32-bit adder. It takes four cycles to complete the 32-bit addition since it is based on 8-bit additions. As can be seen from the figure, the architecture requires several selectors, which cause an extra area cost and path delay.1 Nevertheless, the critical-path delay is expected to be much lower than that of a hardware implementation with the combinatorial logic of a full 32-bit carry-ripple adder.
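The contrast between the two cases can be sketched in software (a hypothetical Python model, not part of the book; the function names are illustrative): four independent 8-bit additions have no shared signals, whereas a 32-bit addition built from 8-bit adders is forced into a carry chain.

```python
# Sketch contrasting four independent 8-bit additions (parallelizable,
# Figure 3.1) with a 32-bit addition built from four 8-bit adders,
# where the carry chain creates a data dependency (Figure 3.2).

def add8(a, b, cin=0):
    """8-bit adder module: returns (sum mod 2^8, carry-out)."""
    s = a + b + cin
    return s & 0xFF, s >> 8

# Four independent additions: no carries cross module boundaries,
# so all four add8 calls could run simultaneously in hardware.
pairs = [(0x12, 0x34), (0xFF, 0x01), (0x80, 0x80), (0x0A, 0xF0)]
independent = [add8(a, b)[0] for a, b in pairs]

# 32-bit addition with four 8-bit adders: each adder needs the
# carry-out of the previous one, forcing serial operation
# (one byte per cycle in the serial architecture).
def add32(x, y):
    result, carry = 0, 0
    for i in range(4):  # least significant byte first
        s, carry = add8((x >> 8 * i) & 0xFF, (y >> 8 * i) & 0xFF, carry)
        result |= s << 8 * i
    return result & 0xFFFFFFFF

assert add32(0x0000FFFF, 0x00000001) == 0x00010000  # carry ripples upward
```

The four loop iterations in `add32` correspond to the four cycles of the serial architecture in Figure 3.2.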
An algorithmic optimization sometimes solves the data-dependency problem or facilitates data parallelism. In a straightforward implementation, only the right-most adder module in Figure 3.1 can start a correct operation, as the carry-in signals of the other three modules are not yet determined. However, considering the fact that each carry-in value has only two possibilities, 0 or 1, the three adder modules can start operation assuming both cases, carry-in = 0 and carry-in = 1. The corresponding architecture is shown in Figure 3.3.
In this case, the total number of adder modules increases to seven. This adder is known as the carry-select adder. Although several selector logics are needed to select the addition results, it can be seen that this optimization makes the best use of the parallel processing of 8-bit adders.
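The carry-select principle can be modeled as follows (an illustrative Python sketch, not the book's code): the three upper stages each compute both speculative results, and the arriving carry merely selects between them.

```python
# Sketch of the carry-select adder of Figure 3.3: the upper three 8-bit
# stages are each computed twice (assuming carry-in 0 and carry-in 1),
# and the actual carry selects the correct result. Seven 8-bit adder
# evaluations in total: 1 + 2 * 3.

def add8(a, b, cin):
    s = a + b + cin
    return s & 0xFF, s >> 8

def carry_select_add32(x, y):
    xb = [(x >> 8 * i) & 0xFF for i in range(4)]
    yb = [(y >> 8 * i) & 0xFF for i in range(4)]
    result, c = add8(xb[0], yb[0], 0)      # stage 0: normal addition
    for i in range(1, 4):
        sum0, carry0 = add8(xb[i], yb[i], 0)  # speculative: cin = 0
        sum1, carry1 = add8(xb[i], yb[i], 1)  # speculative: cin = 1
        s, c = (sum1, carry1) if c else (sum0, carry0)  # selector logic
        result |= s << 8 * i
    return result & 0xFFFFFFFF

assert carry_select_add32(0xFFFFFFFF, 1) == 0  # carries select all the way up
```

In software the speculative sums are still computed one after another, but in hardware the two `add8` evaluations of every stage run concurrently, leaving only the selector chain on the critical path.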
In order to utilize hardware resources efficiently, iterative use of combinatorial logic is often seen in hardware implementations. Combined with DFFs (delay flip-flops), such hardware is called a loop architecture. A simplified block diagram of the loop architecture is illustrated in Figure 3.4.
The loop architecture is basically for reducing the hardware cost while maintaining the speed performance. As a simple example, let us consider a hardware implementation that performs the modular computation

S = X1 + X2 + · · · + X8 mod 2^8,

where X1, ..., X8 are all 8-bit positive integers. The straightforward (loop-unrolled) implementation for this example is to prepare seven 8-bit adder modules as combinatorial logic, as shown in Figure 3.5.
The eight operands for the addition, X1, ..., X8, are all assumed to be stored in 8-bit DFFs, as they should be provided to the combinatorial logic at the same timing. The operation result, S, is computed in a single clock cycle and stored in 8-bit DFFs after the addition. In this case, the critical-path delay becomes 7 × Tadd, where Tadd is the critical-path delay of the 8-bit adder module.
By changing the connection of the 8-bit FAs into a tree as shown in Figure 3.6, the critical-path delay is shortened to 3 × Tadd, whereas the area cost remains the same.
Alternatively, the same functionality is realized with only one 8-bit adder module, as shown in Figure 3.7. This architecture is easy to understand by dividing the computation into sequential ones as

S ← S + Xi mod 2^8 for i = 1, 2, ..., 8,

where S = 0 is assumed before the computation.
One of the two inputs of the 8-bit adder can be provided from an external memory such as SRAM, and the other input is connected to the output of the DFF storing the intermediate addition results and the final result. The critical-path delay becomes Tadd; however, it takes eight cycles to finish the computation. If the additional timing penalty caused by the selector, the clock skew, and the setup timing margin is small enough, the speed performance of the two architectures shown in Figures 3.5 and 3.7 can be regarded as similar at the maximum clock frequency.
The advantage of the latter loop architecture is obviously its area efficiency. Moreover, the loop architecture offers scalability; that is, more than eight operands can be handled with a slight modification of the select signals. The loop architecture thus enables us to perform more iterations, such as a 100-operand modular addition.
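The loop architecture of Figure 3.7 can be sketched as follows (a behavioral Python model under the assumption that one loop iteration corresponds to one clock cycle):

```python
# Sketch of the loop architecture of Figure 3.7: a single 8-bit adder
# reused over eight cycles, with an 8-bit register (DFF) holding the
# running sum between cycles.

def modular_sum_loop(operands):
    """Accumulate 8-bit operands with one adder, one operand per cycle."""
    s = 0                        # DFF initialized to zero before computation
    for x in operands:           # one loop iteration = one clock cycle
        s = (s + x) & 0xFF       # 8-bit adder; result written back to the DFF
    return s

xs = [0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88]
assert modular_sum_loop(xs) == sum(xs) & 0xFF

# Scalability: the same datapath handles 100 operands in 100 cycles.
assert modular_sum_loop([7] * 100) == (7 * 100) & 0xFF
```

The trade-off of the section is visible here: the loop-unrolled circuit of Figure 3.5 evaluates seven adders in one cycle, while this loop reuses one adder over eight cycles.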
The pipeline architecture is commonly used in high-performance RISC (reduced instruction set computer) CPUs (central processing units) to support flexible and efficient software execution. The grain size of the computation of instructions in a RISC CPU is small, as such CPUs support only basic operations. Accordingly, the operation of each pipeline stage becomes simple, and the clock frequency can be increased so that the throughput of the software is accelerated. Note that the pipeline will stall if there are data-dependent operations; this is called a pipeline stall. It is known that such stalls diminish the performance merit of the pipeline architecture.
The pipeline architecture can also be applied to hardware implementations for block ciphers in order to accelerate multiple independent operations with a low area cost. The number of the pipeline stages should be defined according to the degree of parallelism necessary for a target application.
Here, let us consider a cryptographic application that requires encryption hardware for a 10-round block cipher. Suppose that the round function f is divided into four subfunctions as

f = f4 ∘ f3 ∘ f2 ∘ f1

in order to operate at a high clock frequency. Additionally, assuming for simplicity that each round operation is identical, the 10-round encryption function is described as

C = f^10(P),

where P and C are the plaintext and ciphertext, respectively.
The combinatorial logic for the round function, f, is divided accordingly, and three more DFFs are inserted between the subfunctions, as shown in Figure 3.8. As a result, the latency of one round becomes four clock cycles.
On the basis of the four-stage architecture, a 10-round loop architecture is implemented as illustrated in Figure 3.9. The throughput performance can be improved by up to four times by executing multiple encryption operations in parallel. Note that, when executing the encryption of a single plaintext, the architecture has no merit in terms of throughput or latency.
Let us examine the effect of the pipeline architecture on the iterative operations in the timing waveform shown in Figure 3.10. The first plaintext, P1, is processed after it is stored in DFF_1, and the result is stored in DFF_2 in the next cycle. Suppose that the size of the plaintext is w bits. One round operation for P1 is completed in four cycles, and the result is stored again in DFF_1. This means that each of the operations f1, f2, f3, and f4 is used only once every four cycles during the round operation for P1. Therefore, the throughput becomes

w/40 bits per cycle (10 rounds × 4 cycles per round)

when operating on a single plaintext with the architecture illustrated in Figure 3.9. This is because a pipeline stall happens.
In order to make the best use of the architecture's capability, four different plaintexts, P1, P2, P3, and P4, are sent to DFF_1 in four consecutive cycles. In this case, the throughput is improved to four times that of the previous case, namely

4w/40 = w/10 bits per cycle.
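The cycle counts above can be checked with a small Python model (an illustrative calculation following the text's assumptions of 10 rounds, 4 pipeline stages, and w-bit blocks):

```python
# Toy throughput model for the four-stage pipelined loop architecture
# of Figure 3.9: a single plaintext stalls the pipeline, while four
# interleaved plaintexts keep every stage busy.

ROUNDS, STAGES, w = 10, 4, 128

# Single plaintext: the next round cannot start until the current one
# leaves the pipeline, so each round costs STAGES cycles (pipeline stall).
cycles_single = ROUNDS * STAGES              # 40 cycles for one w-bit block
throughput_single = w / cycles_single        # w/40 bits per cycle

# Four plaintexts: the pipeline stays full, so four blocks finish in the
# same 40 cycles.
cycles_four = ROUNDS * STAGES                # still 40 cycles
throughput_four = 4 * w / cycles_four        # w/10 bits per cycle

assert throughput_four == 4 * throughput_single
```

With w = 128, the single-plaintext throughput is 3.2 bits per cycle, and the fully pipelined throughput is 12.8 bits per cycle.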
In this section, we consider several different hardware implementations for AES-128 and compare their cost and speed performance.
One of the most straightforward implementations of AES-128 encryption hardware is shown in Figure 3.11. It performs the 10-round AES-128 encryption in one cycle. It starts from the operation that XORs the plaintext with the initial subkey, K0.
The round operation consists of four transformations: SubBytes, ShiftRows, MixColumns, and AddRoundKey. The 128-bit AES state is aligned in a byte-wise manner, and each byte is operated on independently in the SubBytes transformation. The ShiftRows transformation changes the alignment of the bytes, which can be realized simply by wiring in hardware implementations. Then, the MixColumns transformation performs a finite field operation in GF(2^8), transforming four-byte inputs into four-byte outputs (see the details in Section 1.5). Finally, the round operation generates 128-bit data after the AddRoundKey transformation, where a subkey, Ki, is added to the output of the MixColumns transformation. The round operation is performed 10 times in total, and the ciphertext is output. Note that the MixColumns transformation is not performed in the 10th round operation.
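The fact that ShiftRows is pure wiring is easy to see in a software sketch (a minimal Python model of the standard AES byte rotation; the state layout here is the usual 4 × 4 row representation):

```python
# Sketch of ShiftRows on the 4x4 AES state: row r is rotated left by
# r bytes. Since no byte value changes, the hardware realization is
# pure wiring with no logic gates.

def shift_rows(state):
    """state: list of 4 rows, each a list of 4 byte values."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]

state = [[r * 4 + c for c in range(4)] for r in range(4)]
out = shift_rows(state)
assert out[0] == state[0]        # row 0 is unchanged
assert out[1] == [5, 6, 7, 4]    # row 1 rotated left by one byte
```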
The AES-128 decryption hardware can be implemented by reversing the order of the encryption transformations, as shown in Figure 3.12. Notice that the inverse of the AddRoundKey transformation is the same as the AddRoundKey transformation itself, as they are both XOR operations. It is worth mentioning that the order of the InvShiftRows and InvSubBytes transformations in the round operation can be easily switched, as the InvShiftRows transformation is implemented with byte-wise wiring and InvSubBytes operates on each byte independently.2 Therefore, it is possible to switch the order of the InvShiftRows and InvSubBytes transformations.
Even after switching the order of InvShiftRows and InvSubBytes in Figure 3.12, the order of transformations in the round operation is still slightly different between Figures 3.11 and 3.12. That is, in order to have the same sequence of transformations for both AES encryption and decryption, ignoring the inverse properties, the order of the AddRoundKey and InvMixColumns transformations needs to be changed. In changing the order, the subkeys K1 to K9 are required to go through the InvMixColumns transformation, that is,

K'i = InvMixColumns(Ki),

because

InvMixColumns(I ⊕ Ki) = InvMixColumns(I) ⊕ InvMixColumns(Ki) = InvMixColumns(I) ⊕ K'i,

where I is an intermediate value output from the InvShiftRows and InvSubBytes transformations.
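The identity above relies on InvMixColumns being linear over XOR. This can be verified with a short sketch (an illustrative Python model using the textbook GF(2^8) multiplication and the standard InvMixColumns coefficients; the column values I and K are arbitrary examples):

```python
# Sketch verifying that InvMixColumns distributes over XOR:
# InvMixColumns(I xor K) == InvMixColumns(I) xor InvMixColumns(K).

def gmul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (AES polynomial)."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1B
        p &= 0xFF
        b >>= 1
    return p

def inv_mix_column(col):
    """InvMixColumns on one 4-byte column (circulant matrix 0E 0B 0D 09)."""
    m = [0x0E, 0x0B, 0x0D, 0x09]
    out = []
    for r in range(4):
        v = 0
        for c in range(4):
            v ^= gmul(m[(c - r) % 4], col[c])
        out.append(v)
    return out

I = [0x12, 0x34, 0x56, 0x78]   # example intermediate column
K = [0xA5, 0x5A, 0xC3, 0x3C]   # example subkey column
lhs = inv_mix_column([i ^ k for i, k in zip(I, K)])
rhs = [a ^ b for a, b in zip(inv_mix_column(I), inv_mix_column(K))]
assert lhs == rhs              # linearity over XOR
```

Linearity follows because GF(2^8) multiplication distributes over XOR (the field addition), so applying InvMixColumns to the subkeys in advance is equivalent to applying it after the key addition.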
Figure 3.13 shows another AES-128 decryption hardware implementation that has the same operation sequence as the AES-128 encryption shown in Figure 3.11. The modification in Figure 3.13 motivates us to implement a round operation that can be used for both encryption and decryption in order to optimize the hardware cost. That is, one optimization in hardware is to implement each transformation so that its inverse transformation can also be supported efficiently by sharing hardware resources.
Here, we have excluded the detailed description of the Key Scheduling Function, the module that generates the 128-bit subkeys provided to the AddRoundKey transformation. It can be performed off-line, that is, precomputed before the encryption and decryption, as long as the secret key is not changed frequently and there is enough memory space to store the subkeys. Alternatively, in order to reduce the hardware cost, the subkey generation can be performed round by round. In the following sections, we explore several different types of implementations for AES-128.
Figure 3.14 shows a loop architecture for AES. The first AddRoundKey operates on the plaintext and the encryption key, and the result is stored in DFF_1. Then, the round operations are iteratively performed nine times with the loop architecture including DFF_2 in the datapath. The result of the ninth round (9R) is stored in DFF_3, whereas DFF_2 keeps the result of the eighth round (8R). The 10th round operation is performed on the output of 9R, and the result is stored in DFF_C as the ciphertext. The corresponding timing waveform can be found in Figure 3.15.
This loop architecture requires three AddRoundKey blocks, two SubBytes and two ShiftRows blocks, and one MixColumns block. The total number of DFFs is 512 (= 128 × 4, for DFF_1, DFF_2, DFF_3, and DFF_C), excluding the DFFs for the encryption key. Here, we ignore the hardware cost of the Key Scheduling Function, as its implementation style varies with the use case, for example, when the encryption key is fixed for a long term.
The critical-path delay of this architecture can be estimated as

Tcritical = TSubBytes + TShiftRows + TMixColumns + TAddRoundKey + TMUX + TDEMUX + α,

where TSubBytes, TShiftRows, TMixColumns, and TAddRoundKey are the delay times of the SubBytes, ShiftRows, MixColumns, and AddRoundKey blocks, TMUX and TDEMUX are the delay times of the multiplexer and demultiplexer, and α takes the setup time and clock skew into consideration.
There are several ways to reduce the hardware cost. One alternative is shown in Figure 3.16, where two AddRoundKey blocks, one SubBytes block, one ShiftRows block, and one MixColumns block are used. The timing waveform is depicted in Figure 3.17. Note that the MixColumns block is skipped only in the 10th round. Only three DFFs are needed to support this loop architecture, excluding the DFFs for the encryption key.
However, the critical-path delay is increased owing to the additional selectors inserted in the loop datapath.
On the basis of the discussion in Section 3.3, we consider a two-level pipeline architecture for AES-128 encryption. The first-level pipeline is applied to the 10-round operation, and the second-level pipeline exploits the operations within each round.
When AES encryption is used in practice, a mechanism called a mode of operation is employed for security reasons.3 Among the various modes of operation, the counter mode (CTR mode) uses a counter value as the plaintext for AES encryption, and the message is XORed with the encryption result. Therefore, even for a long message, all the AES encryptions can be parallelized using the counter values. Accordingly, the throughput for the whole message is determined by that of the AES encryption hardware itself.
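Why CTR mode parallelizes so well can be seen in a short sketch (a hypothetical Python model; the toy block cipher below is an insecure stand-in for AES-128, used only to show the data flow):

```python
# Sketch of CTR mode: each keystream block depends only on the counter
# value, not on other message blocks, so all block encryptions are
# independent and can run in parallel (e.g., in pipeline stages).

def toy_block_cipher(key, block):
    """Insecure stand-in for AES-128; for illustration only."""
    return (block * 0x0123456789ABCDEF1 + key) % (1 << 128)

def ctr_encrypt(key, nonce, message_blocks):
    # Every iteration is independent of the others: the keystream block
    # for index i is computed from (nonce, i) alone.
    return [m ^ toy_block_cipher(key, (nonce << 64) | i)
            for i, m in enumerate(message_blocks)]

key, nonce = 0xDEADBEEF, 0x42
msg = [0x11111111, 0x22222222, 0x33333333]
ct = ctr_encrypt(key, nonce, msg)
# CTR decryption is the same operation (XOR with the same keystream).
assert ctr_encrypt(key, nonce, ct) == msg
```

Because the loop body has no dependency between iterations, a pipelined AES core can accept a new counter block every cycle, which is exactly the situation the architecture of Figure 3.18 targets.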
In this regard, the pipeline architecture shown in Figure 3.18 is appropriate, as the amount of data processed per cycle is large (128 bits/cycle) and a high clock frequency can be expected. If the architecture operates at 100 MHz, the throughput becomes 12.8 Gbps.
What can be done to further improve the throughput? One promising way is to reduce the critical-path delay by introducing additional pipeline stages in the critical block. In the case of AES encryption, SubBytes tends to be the critical block as long as its nonlinear transformation is implemented with combinatorial logic.4 Therefore, it is natural to divide the SubBytes transformation into several smaller operations, or to place an additional DFF immediately after it, as shown in Figure 3.19. Note that throughput improvements often come at the price of hardware cost.
It is also possible to apply the pipeline to the loop architecture, as shown in Figure 3.9. In this case, the area cost is much lower than that of the architecture in Figure 3.18.
In the previous discussions, the width of the datapath was fixed to 128 bits. As we can easily imagine from the fact that 8-bit or 32-bit CPUs can execute AES-128, the datapath can be narrowed by performing byte-wise operations. For instance, the architecture shown in Figure 3.20 employs a 32-bit (4-byte) datapath that performs 4-byte operations in one cycle. That is, it requires four cycles to complete a round operation.
All the intermediate values are stored in a memory module such as an SRAM. In the case of the architecture of Figure 3.20, the 4-byte data should be aligned in accordance with the ShiftRows transformation when writing to or reading from the SRAM. The merit of using a 4-byte datapath is that one MixColumns operation, which operates on 4 bytes, can be performed in one cycle.
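The cycle budget of the 32-bit datapath can be estimated with a simple calculation (an illustrative model that counts only round-function cycles, ignoring the initial key addition and memory-access overhead):

```python
# Cycle-count sketch for the 32-bit datapath of Figure 3.20: the 128-bit
# state is processed as four 4-byte words, so one round takes four
# cycles and a 10-round AES-128 encryption takes about 40 cycles.

WORDS_PER_STATE = 128 // 32      # four 4-byte words per AES state
ROUNDS = 10

cycles_per_round = WORDS_PER_STATE           # one word per cycle
total_cycles = ROUNDS * cycles_per_round
assert total_cycles == 40

# Estimated throughput at f MHz: 128 bits every total_cycles cycles.
f_mhz = 100
throughput_mbps = 128 * f_mhz / total_cycles
assert throughput_mbps == 320.0
```

Compared with the 128-bit pipeline of Figure 3.18 (12.8 Gbps at the same clock frequency), this illustrates the throughput sacrificed for the smaller datapath.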