In 1977 the United States National Bureau of Standards (NBS) promulgated a new standard for encryption of unclassified data [NBS77] based on a proposal by IBM, the Design Encryption Standard (DES). Since then, this algorithm has become the de facto standard for data encryption worldwide. The algorithm is very suitable for implementation in hardware and several commercial products are available. In this section, we first outline the DES algorithm and secondly develop an implementation. DES provides an excellent example of an algorithm that can be speeded up greatly by bit-level hardware implementation. In this section, we develop a CAL implementation of DES: a more detailed presentation of this work appeared in [Kean89].
DES is a substitution cipher on 64-bit binary vectors based on a 56-bit key. The strength of DES lies in the complexity of the substitution. Two good introductions to DES are [Tanenb81], and for a more in-depth analysis [Konh85].
The DES algorithm is illustrated in Figure 6–4. All the f-boxes are identical, and it is obvious that a major design decision is whether to have one reusable f-box, or 16 separate ones to allow pipelining. We will consider the major components in the block diagram individually. We will try to give an idea of the size of the wiring channels (unfortunately channel density cannot be quoted because it depends on the ordering and spacing of the ports, which is implementation-dependent) and we will give the number of product terms in a minimized PLA implementation (determine using ESPRESSO) for the logic blocks.
Permutations IP and IP−1 The first step in the DES algorithm is to apply an initial permutation IP to the plaintext. The permutation is applied in reverse IP−1 as the final step. These permutations involve simple wire swapping and could be implemented using the channel router, but the resulting channel would be wide. The permutations are highly structured and more cost effective implementations than straightforward wire swapping can be found.
Design of the -Box The heart of the DES process is the f-box, which performs the function π :(L, R) → (R, L ⊕ T(R)), where L and R represent the left and right 32 bits of the 64-bit input. The transformation T is composed of an expansion (bit copying) to 48 bits modulo 2 addition with a key, substitution (in which each group of 6 bits is taken as representing a binary number, which is then substituted for another 4-bit binary number using a lookup table, bringing the size back to 32 bits) and a 32-bit permutation (wire swapping).
The major components in the DES f-box are illustrated in Figure 6–5. The important point to note is that the width of any hardware implementation is fixed by the number of bits to be operated on. We must concentrate on reducing the length of the unit. We will now consider the components of the f-box in order, left to right.
Before | After |
x0, x1, x2, x3 | x31, x0, x1, x2, x3, x4 |
x4, x5, x6, x7 | x3, x4, x5, x6, x7, x8 |
x8, x9, x10, x11 | x7, x8, x9, x10, x11, x12 |
x12, x13, x14, x15 | x11, x12, x13, x14, x15, x16 |
x16, x17, x18, x19 | x15, x16, x17, x18, x19, x20 |
x20, x21, x22, x23 | x19, x20, x21, x22, x23, x24 |
x24, x25, x26, x27 | x23, x24, x25, x26, x27, x28 |
x28, x29, x30, x31 | x27, x28, x29, x30, x31, x0 |
Final Swap The final swap has the effect of canceling out the swap in the final f-box. This cancellation is done to keep the cipher symmetrical so that decryption can be done by reencrypting with the same key; thus only one DES unit is required (not two, one for encryption and one for decryption).
Design of the Key Handler This logic controls the key that is applied to the f-box at each stage of the DES operation. This logic is determined by the key-schedule. The design of the key-handling unit is shown in Figure 6–6. We describe the major components in turn.
Floorplan When we consider the amount of logic required to implement an f-box, it becomes obvious that we cannot hope to have 16 of them in a reasonable sized array. Instead, we must use the same box for every stage of the cipher. We wish to avoid bringing the key in from the bottom because at least 48 cells will be required to input a 48-bit vector: if it is brought in from the side, then no extra cost is incurred because the unit must be at least 48 cells high anyway. We also note that IP and IP–1 can be implemented in the same wiring area. This suggests the floorplan shown in Figure 6–7.
Layout of -Box When we examine Figure 6–5 it appears that large wiring areas will be required for the L and R 32-bit words. It is also apparent that all the signals go from left to right: this would imply a large waste of cell resources since the basic cell can route right to left as well. After a little thought, it becomes clear that a much better layout of the f-box is possible (Figure 6–8). By using registers at the right-hand side we have avoided wasting large areas transporting L and R through the unit. By interleaving L and R we have made it extremely easy to XOR adjacent bits and swap over prior to the next stage. We are making use of the cell’s ability to route R right to left to bring it to the left-hand side for the initial expansion and XOR with the key. We now consider the design of the major f-box components in turn.
The S-boxes were implemented using the CAL ROM generator and required 32 product terms. Although the logic block is only nine cells wide, we choose to use an extra line of cells between each S-box unit, making its effective width 10 cells and fixing the height of the f-box at 80 cells. There are two advantages to this: first, by increasing the height of the wiring channels we simplify the routing problem, allowing a solution using fewer tracks; second, the extra width simplifies the design of the other f-box components.
The S-box outputs are generated in the order in which they appear on the opposite side of the P-box wiring channel rather than the “numerical” order of the standard. This reduces the number of tracks required to implement the P-box. Note that the right-to-left connection of R shown on the diagram is accomplished by overlaying a routing symbol on the automatically generated S-box design: this is possible because the right-to-left connections are not used by the ROM.
The layout of the whole unit is given in Figure 6–9. Note how the individual components fit together and that the signal flow scheme allows a very high utilization of cell resources.
Layout of Key Handler The layout of the key handler can follow the block diagram exactly. We will consider the implementation of the major subcomponents in turn.
The layout of the whole unit is given in Figure 6–12.
IP and IP–1 The initial permutation is given in Table 6-3. A straightforward implementation of this large wiring area would require at least 37 cells. The size of this requirement has to do with both the permutation itself and the fact that we are implementing both IP and IP–1 in the same channel. This means that there are 64 input and 64 output ports on both sides of an 80-cell channel heavily constraining the channel routing (there are very few “free” columns that the router could use to reorder nets). This channel is a problem for silicon implementations of DES as well, and we can borrow a technique from [Hoorn84], [Hoorn88] to implement it in a reasonable width. The technique relies on regularities in the IP permutation and replaces the large channel with a small one and some shift registers: it has the side effect of reducing the data rate, but this is not a problem since the speed of the implementation is constrained by the f-box circuits, which are used sixteen times for every input through IP. The design is shown in Figure 6–13 and the layout in Figure 6–14.
The design of DES encryptor just described would be unnecessarily slow because of the accumulated delays through the f-box. It was decided, therefore, to separate the f-box into four pipelined stages to increase throughput. Note that the delays within the key controller and the IP unit are relatively unimportant because the computation within the f-box is much more complex and is repeated sixteen times for every data input, and keys are changed relatively infrequently. Any reasonable design of these units will be able to keep the f-box supplied with data and keys.
It was decided to aim for a throughput of 500,000 encryptions per second with a constant key. To achieve this rate, one 64-bit ciphertext word must be calculated every 2 μs. Since the f-box unit is used sixteen times for each ciphering operation, partially ciphered data must leave the f-box every 125 ns. Table 6-4 shows the approximate delays through the major f-box components (we take routing delay through one cell as 2.7 ns and calculation and output routing delay as 10 ns—these figures are for 256-cell CAL chips designed in 1987, the CAL 1024 devices are faster). We can see that some of these components must be broken up into several pipeline stages to meet the performance objective. The last column of the table shows the number of stages required. Within the CL-ROM we choose to place pipeline registers between the AND and OR planes as well as horizontally across the array. This decouples the AND plane calculation time, which is quite high since there are six inputs. With pipelining between the planes as well as three stages across the array (two with 11 product terms and one with 10) we have a stage delay of (m – l)r + 11c = 3 × 2.7 + 11 × 10 = 118.1 ns. The ability to pipeline these large combinational logic units is a partial compensation for the loss in speed caused by using large numbers of 2-input gates rather than a single high fan-in gate. Pipelining within wiring channels is easy to provide using the function units of the cells within the channel.
With the suggested pipelining, the only stage that is close to the 125-ns limit is in the KEYXOR unit—if this proved to be a problem, a slightly more complex routing arrangement could split the 80-cell top-to-bottom wraparound connection among several of the right-to-left routing stages—extra pipeline stages would not be required. The total pipeline length is 10 stages. Extra registers must be provided at the right side of the f-box unit to store the ‘L’ vectors corresponding to the ten ‘R’ vectors at the various pipeline stages within the f-box. The pipeline registers themselves are fairly small, requiring only two cells for each master-slave register. They are clocked using the G1 and G2 global signals since propagation delay on the pipeline clock is an important consideration.
The time-line for the pipelining scheme is given as Figure 6–15. The delay through the f-box is 16 ∗ (10 ∗ 125 ns) = 20 μs. It seems natural to pipeline the input and output phases through IP, with this giving a total delay through the encryptor of 60 μs. In applications where large amounts of data have to be encrypted the throughput is of paramount importance, so this is a reasonable design. We could, however, reduce the delay to around 25 μs by removing the extra IP pipelining at a small cost in throughput.
The performance of DES implementations can be measured in two ways: the number of encryptions per second with the same key and different data, and the number of encryptions per second with a different key and the same data. The second metric corresponds to ‘key-trial’ in an attempt to break the cipher, the first to normal use of the cipher. The present design is optimized for normal use, so we will concentrate on this problem. Table 6-5 shows figures for some DES implementations: the figures for SUN workstations are using the UNIX crypt function in the C library. This implementation is not particularly efficient and it is likely that the times could be significantly improved by recoding in assembly language and taking advantage of space-time trade-offs by using large data arrays to compute S-box functions. The time for a single encryption is given, as well as the number of encryptions per second to show the effect of pipelining. The transputer array figure is a notional one assuming 1000 transputers and that each transputer has a good DES program, allowing encryption ten times faster than a SUN 3/260: it is intended to illustrate the limitations of conventional parallel computers in this application. The time for the custom chip is taken from [Hoorn88]. This claimed to be the fastest DES chip available with a throughput of 32M bits/sec (or 500,000 64 bit words/sec). Note that this chip has extra control logic to handle the more complex modes of DES defined in [NBS80] and also provides several registers for secure storage of keys on the chip. The CAL design makes use of several optimizations developed by the group that designed this chip. Although the CAL appears to be as fast, it should be pointed out that 3-μm processing technology was used in the custom chip, which would be expected to be slower than the 2-μm technology on the CAL, and the CAL figures are based on circuit simulations rather than measurements on fabricated chips.
The CAL implementation of DES (including pipeline registers) requires an array of cells 88 cells high by 90 cells wide (27 for the key controller, 60 for the f-box, and 13 for IP). This could be provided using a 6-by-6 array of 16 × 16 CAL chips or a 3-by-3 array of 32 × 32 CAL1024 chips. The entire DES encryptor would fit on a single printed circuit board even with low-density 16 × 16 chips.
13.59.34.87