4

Operations on Numbers

In order to perform numerical computations with a digital system, numbers need to be encoded as bit strings as explained in section 1.1.1. Arithmetic operations can then be implemented as special Boolean functions on the number codes and in turn be used as building blocks in numeric algorithms. The general scheme for arriving at the Boolean function realizing an arithmetic operation for a given encoding is shown in Listing 3.2 for the add operation. For binary and signed binary numbers, the circuit design from gates or CMOS switches is actually carried out for single bit numbers only. For larger numbers, the arithmetic operations have fairly simple algorithms reducing them to single bit operations. There is no need to resort to ROM tables for their realization, as one might consider for more irregular Boolean functions; an exception is the so-called distributed arithmetic where a complex arithmetic operation is realized with the aid of small tables. The exercises section plays a special role for studying the material in this chapter. Several of the arithmetic circuits and algorithms are taken up there, and VHDL implementations are given that also help the reader become more familiar with that language.

4.1 SINGLE BIT BINARY ADDERS AND MULTIPLIERS

If the Boolean 0 and 1 elements are identified with the numbers 0 and 1, then the result of the arithmetic add operation applied to two numbers r, s ∊ B ranges from 0 to 2. Its binary encoding is by two bits (q, o) so that:

r + s = 2·o + q

The Boolean function mapping (r,s) to (q,o) is called the ‘half adder’ function. According to its function table:

q = r XOR s,   o = r AND s

The half adder is thus realized by two standard gate functions. The sum of three bits r, s, t can still be represented as:

r + s + t = 2·o + q

and the function mapping (r, s, t) to (q, o) is called the full adder function. It can be realized by cascading two half adder circuits and using an OR operation to sum up their mutually exclusive ‘o’ outputs (see Listing 3.6 for the VHDL definition), i.e.:

q = (r XOR s) XOR t,   o = (r AND s) OR ((r XOR s) AND t)    (1)

Reciprocally, the half adder function is obtained from the full adder by setting t = 0.
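
As an illustration, behavioral VHDL definitions of the half and full adder might read as follows (a minimal sketch with names chosen here; the book's own full adder definition is Listing 3.6):

entity HALFADD is
   port (r,s: in bit; q,o: out bit);
end HALFADD;

architecture behav of HALFADD is
begin
   q <= r xor s;    -- sum bit
   o <= r and s;    -- carry bit
end behav;

entity FULLADD is
   port (r,s,t: in bit; q,o: out bit);
end FULLADD;

architecture behav of FULLADD is
begin
   q <= r xor s xor t;                    -- sum of the three input bits mod 2
   o <= (r and s) or (t and (r xor s));   -- OR of the mutually exclusive half adder carries
end behav;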

More generally, a (2^k − 1)-input adder can be defined that outputs a k-bit binary number. Its least significant bit is the result of the XOR operation (i.e., the parity function) applied to the inputs. E.g., the 7-input add circuit with a 3-bit result can be obtained by composing 4 full adder circuits.

The product of two numbers r, s ∊ B takes its values in B and is hence encoded by a single bit. It is computed by the AND gate.

4.2 FIXED POINT ADD, SUBTRACT, AND COMPARE

In this section, we consider the n-bit unsigned and signed add and subtract operations, their restrictions to inputs yielding n-bit results, and the mod(2^n) add and subtract operations. They turn out to be quite similar. These n-bit add and subtract operations and the related increment and decrement operations have algorithms based on the single bit add operations in section 4.1.

To start with the binary add operation, let two numbers a, b have the codes (a0,…,an−1), (b0,…,bn−1) and let (s0,…,sn) be the (n + 1)-bit code of their sum. We have

a + b = ∑i<n (ai + bi)·2^i

and

a + b = ∑i≤n si·2^i

The function q is the full adder operation (1) in section 4.1 and the bits c0,…,cn are defined recursively by c0 = 0 and ci+1 = o(ai, bi, ci) for i < n. Thus, si = q(ai, bi, ci) for i < n, and sn = cn.

The algorithm to obtain the si by computing the q(ai, bi, ci) and the ci+1 = o(ai, bi, ci) is the common bit-by-bit computation of the sum, the ci being the carry bits, and corresponds to the composite circuit ADD defined in section 3.4, using full adder circuits as building blocks. Note that it is the structural type of definition that corresponds to a particular Boolean algorithm. The circuit is known as the ripple-carry adder. It suffers from a long execution time T = n*Tf due to the connection of the full adder building blocks in series, Tf being the processing time of the full adder. The circuit complexity is in O(n). The first full adder stage actually performs as a half adder and could be substituted by the simpler half adder component, but maintaining the carry input to the first adder is useful for cascading several multi-bit adders.
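
A structural VHDL sketch of this ripple-carry adder for n = 8 (entity names assumed here; FULLADD is the full adder sketched above) could look as follows:

entity RIPPLE8 is
   port (a,b: in bit_vector(7 downto 0); ci: in bit;
         s: out bit_vector(7 downto 0); co: out bit);
end RIPPLE8;

architecture structural of RIPPLE8 is
   component FULLADD port (r,s,t: in bit; q,o: out bit); end component;
   signal c: bit_vector(8 downto 0);        -- the carry chain, c(0) = carry input
begin
   c(0) <= ci;
   G: for i in 0 to 7 generate
      F: FULLADD port map (a(i), b(i), c(i), s(i), c(i+1));
   end generate;
   co <= c(8);                              -- carry output / overflow bit
end structural;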

If one desires the result of the add operation to be encoded similarly to the operands, i.e. with n bits, too, which is usual in order to avoid the growth of the codes for every add operation, (s0,…,sn−1) is the result and the carry output cn = 1 is the overflow error condition of the result exceeding the range of the n-bit binary codes. The add operation with the n-bit result (s0,…,sn−1) irrespective of the carry is the add operation mod(2^n) that computes the binary code of the remainder of the sum a + b divided by 2^n. The n-bit adder with the carry input and output generalizes the full adder function by changing the inputs a, b and the output q to binary coded base-2^n digits. Every n-bit add circuit with an (n + 1)-bit result performs the mod(2^n) add operation and generates the overflow bit.

The carry input to the n-bit adder can also be thought of as a control input that selects between the alternative computations of a + b and a + b + 1. In fact, for n = k + m the n-bit ripple-carry adder can be regarded as the composition of a k-bit and an m-bit adder, so that the carry output of the k-bit adder selects between the two alternative computations of the m-bit adder, which is similar to executing them as the alternative branches of an ‘if’ control structure (section 1.2.2).

As the add operation may occur many times within an algorithm, a short execution time is desirable for it. There are various schemes to reduce the processing time by providing a faster carry propagation and by lowering the total depth of the adder circuit at the expense of an increased gate count. One is the carry select scheme that converts the control dependency of the m-bit add on the k-bit add into a data dependency by executing both branches in parallel and selecting with the carry bit of the k-bit adder (see section 1.4.2). Listing 4.1 shows the structural VHDL definition of a 16-bit adder architecture with k = m = 8. If n is a power of 2, the carry select scheme can be applied recursively to half size adders, which raises the gate count to O(n^1.6) but decreases the processing time to O(log n) (three half-size adders are needed, hence for n = 2^r a total of 3^r = n^(ld 3) ≈ n^1.6 1-bit adders are needed). Thus, the component ADD8 would be implemented similarly with three ADD4 components etc. For other circuit schemes to achieve a faster binary add operation, see [44].

entity ADD16 is
   port (a,b: in bit_vector(15 downto 0); ci:in bit; q:out bit_vector(15 downto 0); co:out bit);
end ADD16;

architecture structural of ADD16 is
     component ADD8
         port (a,b:in bit_vector(7 downto 0); ci:in bit; q:out bit_vector(7 downto 0); co:out bit);
     end component;

     component SEL port(a,b,s: in bit; q: out bit); end component;

     signal s0,s1: bit_vector(8 downto 0);
     signal cl: bit;

   begin

    low: ADD8 port map (a(7 downto 0), b(7 downto 0), ci, q(7 downto 0), cl);
    high0: ADD8 port map (a(15 downto 8),b(15 downto 8),'0',s0(7 downto 0),s0(8));
    high1: ADD8 port map (a(15 downto 8),b(15 downto 8),'1',s1(7 downto 0),s1(8));
    I: for n in 0 to 7 generate
              G: SEL port map (s0(n),s1(n),cl,q(n+8));
    end generate;
    GC: SEL port map (s0(8),s1(8),cl,co);

 end structural;

Listing 4.1 The carry-select adder

The operation of incrementing (i.e., the unary operation i → i + 1) is the special case of an add operation in which the second operand is the constant number 1, encoded as (1,0,…,0). Like the n-bit adder, the n-bit incrementer produces an (n + 1)-bit result. The lower n output bits encode the result of the mod(2^n) increment operation, and the upper bit indicates the overflow condition of the result not being representable with n bits (this only occurs for the all ones input encoding 2^n − 1, with the result of 2^n with zeroes in the lower n bits; the overflow bit ‘tests’ for the all ones code).

If the ripple-carry add algorithm is used for the increment operation, the full adders all have a zero operand and can be simplified to half adders (the first of them even to an inverter). If the first half adder is maintained and the ‘1’ input to it is changed to ‘0’, the trivial operation i → i + 0 = i is performed instead. Thus the increment operation can be controlled with very little extra hardware effort. The incrementer is faster than the ripple-carry adder but still needs the ‘linear’ time n*Th, Th being the execution time of the half adder. Now the carry into the ith stage occurs only if all lower input bits are ones, which can be determined using an AND function of these bits. These AND functions can be computed in ‘logarithmic’ time ld(i)*Ta by connecting the elementary 2-input AND gates as trees, and the increment function using such fast carry inputs to all the half adders achieves an execution time in O(log n).
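
The following behavioral sketch (entity and signal names assumed here) shows an 8-bit incrementer of this kind; the carry chain is written as a chain of AND gates, which a synthesis tool may balance into trees for logarithmic depth:

entity INC8 is
   port (a: in bit_vector(7 downto 0); q: out bit_vector(7 downto 0); ov: out bit);
end INC8;

architecture behav of INC8 is
   signal c: bit_vector(8 downto 0);
begin
   c(0) <= '1';                      -- a '0' here performs the trivial i -> i instead
   G: for i in 0 to 7 generate
      c(i+1) <= c(i) and a(i);       -- carry: all lower input bits are ones
      q(i)   <= a(i) xor c(i);       -- half adder sum bit
   end generate;
   ov <= c(8);                       -- overflow: input was the all ones code
end behav;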

We next consider the signed binary add operation. The add operation essentially remains the same if twos-complement encoding is used instead of the straight binary encoding. Using equation (3) in section 1.1.1 the numbers a, b are related to their n-bit twos-complement codes (a0,…,an−1) and (b0,…,bn−1) through the equations

a = ∑i<n ai·2^i − an−1·2^n,   b = ∑i<n bi·2^i − bn−1·2^n,   a + b = h + (cn − an−1 − bn−1)·2^n

where h denotes the result of the modulo 2n add operation applied to ∑ai2i and ∑bi2i and cn is its overflow bit. The only difference is in the overflow condition if operands and the result use the same word size. For the binary encoding, the final carry output cn signals the overflow. The sum of two signed n-bit numbers is a signed (n + 1)-bit number s, the sign bit sn of which is the sign of the result. The overflow in the twos-complement case is indicated by the sign sn of the result being unequal to sn−1. By (3) the computation of sn requires an extra XOR operation:

sn = an−1 XOR bn−1 XOR cn

From

sn−1 = an−1 XOR bn−1 XOR cn−1

it follows that

sn XOR sn−1 = cn XOR cn−1

The signed binary increment operation with an n-bit result is the same as the unsigned or the mod(2^n) one. It overflows for the maximum positive input code of (1, 1,…,1, 0) and then outputs the most negative code of (0, 0,…,0, 1).

Concerning the subtract operations for unsigned, signed and modulo 2^n binary numbers, we first remark that for a mod(2^n) binary number x, /x + 1 is the mod(2^n) additive inverse −x of x (/x denotes the complement of x). The negation overflows for the all zeroes code only and hence can be used to test for it. The binary subtract operation mod(2^n) is obtained from the equation

a − b = a + /b + 1 mod(2^n)    (6)

Adding the ‘1’ does not require an extra increment circuit but is achieved by placing a ‘1’ at the carry input of the first adder stage of the mod(2^n) adder. By selecting between b and /b the same n-bit adder can be used both for the add and subtract operations which is useful to build multi function compute circuits, in particular arithmetic units for programmable processors (see Figure 1.6).
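
A sketch of such a combined add/subtract circuit, reusing the RIPPLE8 entity assumed above (sub = '1' computes a − b = a + /b + 1 by complementing b and injecting the ‘1’ at the carry input):

entity ADDSUB8 is
   port (a,b: in bit_vector(7 downto 0); sub: in bit;
         q: out bit_vector(7 downto 0); co: out bit);
end ADDSUB8;

architecture structural of ADDSUB8 is
   component RIPPLE8
      port (a,b: in bit_vector(7 downto 0); ci: in bit;
            s: out bit_vector(7 downto 0); co: out bit);
   end component;
   signal bb: bit_vector(7 downto 0);
begin
   G: for i in 0 to 7 generate
      bb(i) <= b(i) xor sub;                    -- select between b and /b
   end generate;
   AD: RIPPLE8 port map (a, bb, sub, q, co);    -- sub also supplies the '+1'
end structural;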

For unsigned binary numbers a, b the difference a − b is defined for b ≤ a only and is then equal to the result of the binary mod(2^n) subtract operation. This condition is signaled by the mod(2^n) adder used to compute (6) through a ‘1’ at the carry output. The signed binary subtract operation also coincides with the mod(2^n) subtract operation for the lower n bits of the result, as the signed binary code of a number coincides with the unsigned code of its remainder mod(2^n), and taking remainders is compatible with the add and subtract operations. The full result can be represented as an (n + 1)-bit signed binary number that is obtained by applying equation (6) to a and b sign extended to n + 1 bits. This time the overflow condition is sn ≠ sn−1 again.

A special case of the subtract operations is the decrement operation, i.e. subtracting the constant 1. As for the increment operation, the subtract circuit can be simplified and sped up to logarithmic time.

If x is an n-bit signed binary number and x ≠ −2^(n−1), then so is −x, and −x = /x + 1. The negation overflows only for x = −2^(n−1) (= (0,…,0,1)), the most negative representable number, as −x = 2^(n−1) cannot be represented as a signed n-bit number but needs n + 1 bits. Again, the overflow condition is sn ≠ sn−1 and can be used to check for this special code.

The add, subtract, negate and compare operations for signed or unsigned binary numbers x are identical to the corresponding fixed point operations on the fixed-point numbers x/2^r, and hence the same algorithms can be used for these.

Related to the unsigned and signed fixed point subtract operations are the compare operations ‘a ≤ b’ and ‘a = b’ viewed as Boolean functions. For unsigned n-bit operands, ‘a ≤ b’ is equal to the carry output of the subtract operation b − a and is hence computed by any subtract or combined add and subtract circuit if the value 1 is interpreted as ‘true’. The comparison of signed numbers requires computing the sign of b − a instead of the carry. ‘a = b’ can be evaluated by comparing bit by bit and computing the conjunction of all bit comparisons with an AND gate tree, as the codes are equal if and only if the represented numbers are equal, or it can be reduced to the signed or unsigned ‘≤’ operations or to a test for zero using the equivalences

a = b  ⇔  (a ≤ b) and (b ≤ a)  ⇔  a − b = 0 mod(2^n)

Testing for the special values 0, −1, −2^(n−1) and 2^(n−1) − 1 is equivalent to testing for the overflow conditions of the increment and negate operations.

4.3 ADD AND SUBTRACT FOR REDUNDANT CODES

A still more effective way to speed up the add operation is obtained by using a redundant encoding for the numbers. We first consider the code using the 2n-tuple (u0,…,un−1, v0,…,vn−1) for the number

m = ∑i<n ui·2^i + ∑i<n vi·2^i    (7)

Every binary number (m0,…,mn) can be converted into a ‘double’ code by appending (0,…,0), and every double code can be reconverted to a binary one by performing the binary add operation according to equation (7). For two redundant codes a, b encoding numbers a′, b′, the conditions a = b and a′ = b′ are no longer equivalent.

If u and v are the binary numbers (u0,…,un−1) and (v0,…,vn−1), then the redundant code (u0,…,un−1, v0,…,vn−1) encoding their sum is obtained without any computational effort. An encoded add operation using a redundant code for the first argument and the result and a binary code for the second argument is any function B^(2n) × B^n → B^(2n+2) mapping a double code (u0,…,un−1,v0,…,vn−1) of a number a and the binary code (b0,…,bn−1) of a number b to a double code (s0,…,sn,t0,…,tn) of the number a + b. Encoded add operations are not uniquely defined as the double codes for the result are not unique. A possible choice is

si = q(ui, vi, bi),   ti+1 = o(ui, vi, bi)   for i < n,   with t0 = 0 and sn = 0

where q, o are the full adder outputs. This follows from equation (1) in section 4.1; the multiplication by 2 is realized by the t word being shifted by one bit position. The computation of the result code can thus be performed by n full adder circuits in parallel in the time Tf irrespective of n. From this add operation one derives an encoded add operation

B^(2n) × B^(2n) → B^(2n+4)

accepting double codes for both operands that also operates in the constant time 2*Tf, namely by adding the halves of the second operand one by one.
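
One row of this redundant (carry-save) add operation might be sketched in VHDL as follows (entity name assumed here): n = 8 full adders in parallel compress the double code (u, v) and the binary operand b into the new double code (s, t), the carry word t being shifted by one position:

entity CSROW8 is
   port (u,v,b: in bit_vector(7 downto 0);
         s,t: out bit_vector(8 downto 0));
end CSROW8;

architecture behav of CSROW8 is
begin
   s(8) <= '0';
   t(0) <= '0';                      -- the carry word is shifted up by one position
   G: for i in 0 to 7 generate
      s(i)   <= u(i) xor v(i) xor b(i);                        -- full adder sum
      t(i+1) <= (u(i) and v(i)) or (b(i) and (u(i) xor v(i))); -- full adder carry
   end generate;
end behav;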

These add operations can now be composed. If k redundant codes have to be added up to a single redundant result code, one can arrange them as a tree to obtain the time of 2*Tf*ld(k) for the total computation. The only disadvantage of the redundant add operation is the redundant encoding of the result which usually needs to be converted (added up) to a binary code. If, however, a set of k binary numbers have to be added up, only the final result would have to be converted.

The redundant encoding defined by equation (7) and the fast add operation generalize to codes (u0,…,un−1,v0,…,vn−1) where the ui are base-k digits and only the vi are binary digits. Listing 4.2 shows a 16-bit adder using this redundant kind of code for the first operand and the result and the non-redundant binary code for the second operand. Four 4-bit sections serve as base-16 adders.

entity XADD16 is
       port ( a,b: in bit_vector(15 downto 0); xa: in bit_vector(3 downto 0);
                 q: out bit_vector(15 downto 0); xq: out bit_vector(4 downto 1)
                   );
end XADD16;

architecture struct of XADD16 is
          component ADD4
	        port (a,b: in bit_vector(3 downto 0); ci: in bit; q: out bit_vector(3 downto 0); co: out bit);
          end component;
begin
     g0: ADD4 port map(a(3 downto 0),b(3 downto 0),xa(0),q(3 downto 0),xq(1));
     g1: ADD4 port map(a(7 downto 4),b(7 downto 4),xa(1),q(7 downto 4),xq(2));
     g2: ADD4 port map(a(11 downto 8),b(11 downto 8),xa(2),q(11 downto 8),xq(3));
     g3: ADD4 port map(a(15 downto 12),b(15 downto 12),xa(3),q(15 downto 12),xq(4));
end struct;

Listing 4.2 Redundant adder built from 4-bit sections

The signed digit encoding of a number m by the tuple (m0,…,mn−1) so that

m = ∑i<n mi·2^i

where the digits mi take the values −1, 0, 1 (and hence need at least two bits each for their encoding), also yields a fast add operation by using the redundancy in such a way that the carry does not propagate by more than one place. The negate operation on the signed digit code is performed by negating the digits individually in parallel. It is hence easy to convert a signed binary number into this representation; the reconversion to a signed binary code is done by splitting m into a difference of two positive binary numbers and performing the binary subtract operation.

A fast algorithm to compute a sum code (s0,…,sn) of two signed digit codes (u0,…,un−1) and (v0,…,vn−1) is as follows. For every i ≤ n − 1 one computes the conditions:

ui + vi ≥ 1,   ui + vi ≤ −1,   ui−1 + vi−1 ≥ 0   (with u−1 = v−1 = 0),

the carry digit

ci = 1 if ui + vi = 2, or if ui + vi = 1 and ui−1 + vi−1 ≥ 0;   ci = −1 if ui + vi = −2, or if ui + vi = −1 and ui−1 + vi−1 < 0;   ci = 0 otherwise,

and (setting c−1 = 0)

si = ui + vi − 2·ci + ci−1

Finally, sn = cn−1. Thus si only depends on ui, vi, ui−1, vi−1, ui−2, vi−2, and all can be computed in parallel (this also holds for the redundant add operation presented first). The fact that si does not depend on the lower bit positions is called the ‘on-line property’ in the literature [6]. It implies that they do not even need to be known at the time si is computed and that their computation may be pipelined with the computation of si.

4.4 BINARY MULTIPLICATION

The product of two numbers

a = ∑i<n ai·2^i,   b = ∑j<m bj·2^j

is the (n + m)-bit number

a*b = ∑i<n ∑j<m ai·bj·2^(i+j)

Thus the n*m single bit products ai·bj have to be computed, and the results for a given bit position i + j have to be added up by means of single bit adders passing carries as needed. This can be done in various ways, using as building blocks full adders extended by an AND gate (the single-bit multiplier) forming the product ai·bj. They are depicted in Figure 4.1 for the case of n = m = 5 by circles that output the higher (carry) bit to the left or straight down and the lower (sum) bit down on the diagonals to the right, and receive up to three inputs, a product ai·bj, and the others from other building blocks. The jth row receives the inputs an−1bj,…,a0bj, and the output bits h0…h2n−1 appear at the right and lower sides of the diagrams. A circle that only receives the product input just passes it, and if there is only one additional input the circle acts as a half adder. The version in Figure 4.1a corresponds to adding shifted versions of a with n ripple-carry adders operating in parallel and has the disadvantage of a fairly long processing time of (3n − 2)*T, T being the processing time of the full adder (the products are computed by the AND gates in parallel and only add a single AND gate delay which is neglected). The version shown in Figure 4.1b uses the same number of building blocks but a different carry routing for the nodes above the diagonal and multiplies in a time of (2n − 2)*T (the 1st row just passes the product inputs to the 2nd). The computation of the less significant half word of the result only takes (n − 1)*T. For a multiplier modulo 2^n that computes only the less significant word, only the adders on and above the diagonal are needed.


Figure 4.1 Multiplier networks with different carry propagation schemes


Figure 4.2 Extended k-bit adder component

The third propagation scheme in Figure 4.1c ends up with a redundant code for the upper half word of the result, i.e. two n-bit binary numbers that still need to be added up to the most significant word of the result, using e.g. a carry select adder. It takes the time of (n − 1)*T plus the carry select adder time and can be thought of as cascading n − 1 adders, each adding a redundant code and a row of products.

A still faster multiplier operating in logarithmic time can be obtained by using a tree of redundant adders instead of cascading them, the so-called Wallace tree [44], and adding up the upper word parts with a carry select adder or some other adder also operating in logarithmic time. For all of these schemes, the input data must be held unchanged throughout the processing time as they are used in all adder rows.

The multiplier networks in Figure 4.1 are regular structures constructed from a single type of component, the extended full adder including the AND multiplier. It adds the multiplier result and two more bits and outputs a 2-bit result. Current FPGA chips provide this function in a single cell. The extended adder component generalizes to the k-bit extended adder that is defined as a combined k-bit multiply and add circuit that forms the product of two k-bit numbers and adds it to two more k-bit inputs to arrive at a 2k-bit result (Figure 4.2). For n = m*k, the same network structures as in Figure 4.1 implement n-bit multipliers with m*m extended k-bit adder components. The extended k-bit adder (or add-subtract circuit) can also be used as a component to implement n-bit add (add/subtract) functions and could be used as a cell in a coarse-grained FPGA architecture.

The product of a twos-complement signed number a by an unsigned or signed number is formed according to the representations

a = ∑i<n ai·2^i − an−1·2^n,   b = ∑i<n bi·2^i − bn−1·2^n

and differs from the unsigned product in the most significant word only. If ‘∗s’ denotes the Boolean function computing the product of signed binary numbers, and ‘∗u’ the unsigned version explained before, then

a ∗s b = a ∗u b − an−1·2^n·b − bn−1·2^n·a   mod(2^(2n))

where the products by an−1 and bn−1 just make the subtract operations of b or a conditional on them, and the product by 2^n means that the subtract operations are applied to the upper half word. The same multiplier structures can be used, extended to conditionally subtract from the most significant word, or to perform a subtract operation instead of an add in the last row containing the products ai·bn−1, using the signed versions of the add and shift operations (the leftmost carry is replaced by the sign).

If one of the operands, say b, is a constant, the multiplier circuit can be simplified significantly. In fact, all ai·bj can be evaluated to ai or 0 depending on bj. Thereby all AND gates are eliminated, and one has to add up selected shifted versions of a, namely the versions 2^j*a for those j satisfying bj = 1. In the special case of just one of the bj being 1 no add operation needs to be performed at all, and the output is the shifted word 2^j*a. Obviously, the structure of the circuit now depends on the constant operand.
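
As a sketch (entity name and the constant chosen here for illustration; the numeric_std library is used for brevity, unlike the book's plain bit_vector listings), a multiplier by the constant b = 10 = 2^1 + 2^3 reduces to two shifted versions of a and one adder:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity MUL10 is
   port (a: in unsigned(7 downto 0); p: out unsigned(11 downto 0));
end MUL10;

architecture behav of MUL10 is
begin
   -- p = (a shifted by 1) + (a shifted by 3); all AND gates have disappeared
   p <= shift_left(resize(a, 12), 1) + shift_left(resize(a, 12), 3);
end behav;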

All of the above binary multiplier circuits for two variable n-bit operands use n^2 full adder building blocks to add up the products and have a gate count in O(n^2). A reduction of this complexity can be achieved by applying the formula

a*b = a1*b1·2^(2r) + ((a1 + a0)*(b1 + b0) − a1*b1 − a0*b0)·2^r + a0*b0,   where a = a1·2^r + a0, b = b1·2^r + b0    (10)

that reduces a (2r)-bit product to three r-bit products. The recursive application reduces the complexity to within O(n^1.6) if ripple carry adders are used, and to O(n^1.6·log(n)) with carry select adders. With the fast adders the processing time adds up to O(log(n)^2) using the fact that the three products in the formula can be computed in parallel. If log(n) multipliers can be used in parallel, the execution time per multiply becomes O(log(n)) as for the Wallace tree, but the gate count rises to O(n^1.6·log(n)^2) only. The Wallace tree uses its gates more efficiently but needs too many steps to perform the multiplication. The multiplication based on equation (10) pays off and is used for large bit sizes (>64) only. The complexity of multiplication algorithms can be as low as O(n·log(n)·log(log(n))) [54], by interpreting the multiplication as a convolution for which a fast algorithm exists (cf. section 8.2.3).
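
A small worked instance of equation (10) (numbers chosen here) with r = 4, a = 90 = 5*16 + 10 and b = 51 = 3*16 + 3:

a1*b1 = 5*3 = 15,   a0*b0 = 10*3 = 30,   (a1 + a0)*(b1 + b0) = 15*6 = 90,
a*b = 15·2^8 + (90 − 15 − 30)·2^4 + 30 = 3840 + 720 + 30 = 4590 = 90*51,

so only three 4-bit products are needed instead of four.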

For the computation of the product, an online property can be achieved if redundant codes are used for the adder implementation (at least for the upper half of the result word). This is done by applying still another scheme to connect the adders, given by the recursion

Aj*Bj = 4·Aj−1*Bj−1 + an−j·Bj + 2·bn−j·Aj−1

where Aj, Bj denote the binary numbers represented by (an−j,…,an−1) and (bn−j,…,bn−1) so that a = An, b = Bn. Adding Bj and 2Aj−1 may still cause a change in the bit position j of the 2j-bit result, but can be arranged not to do so in the higher ones, i.e. the upper j − 1 bits can be computed from the upper j bits of the operands [6]. This online property is useful in fixed point computations with a fixed number of places where, instead of the full product, a rounded version of the most significant part is needed.

4.5 SEQUENTIAL ADDERS, MULTIPLIERS AND MULTIPLY-ADD STRUCTURES

In this section we derive implementations of some of the add and multiply algorithms discussed before that serially reuse components for the sake of efficiency and provide cost-effective register arrangements to store the intermediate results and select them for later operations.


Figure 4.3 Serial adder

If h is the depth of the add or the multiply circuit, its processing time is h*T. As pointed out in section 1.5, if the multiplier is used at its maximum rate corresponding to its processing time, then the adder circuits performing the computation are used with an efficiency of 1/h only. Pipelining can also be used to raise the efficiency. The layered structure of the multiplier in Figure 4.1(c) can be used to pipeline its operation by inserting registers between the layers both for the intermediate results and for the operands. Then the multiplication still takes the same time (even a little more due to the registers) but subsequent multiplications can be started at the rate given by T that is independent of h, and the efficiency becomes close to 100% (with a proportional increase of the power consumption). The storage and power requirements become lower if the layers are grouped into sets of h′ layers and the pipelining is implemented for these sets only. Then pipelined multiplications can be started at a rate of h′*T and the efficiency rises to close to 1/h′.

The n-bit binary ripple-carry adder applies n identical full adder circuits at all bit positions. The full adders are connected in series via the carries. The full adder operations can be executed serially on a single full adder circuit, starting with bit 0, by using as the full adder inputs for the i-th application the bits ai, bi from the operands and the carry signal ci that has been computed as the overflow oi−1 in the previous application. oi−1 must be stored in a flip-flop in order to be able to use it in the subsequent step, but it is no longer used thereafter and the same flip-flop can be used to store all the carries in sequence (Figure 4.3). It must be cleared to zero at the start of the serial computation. This also eliminates the need to select the carry input from different sources during the sequence of steps.
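
A behavioral VHDL sketch of this serial adder (names assumed here; clr clears the carry flip-flop at the start of a word):

entity SERADD is
   port (clk, clr, ai, bi: in bit; si: out bit);
end SERADD;

architecture behav of SERADD is
   signal c: bit;                                       -- the carry flip-flop
begin
   si <= ai xor bi xor c;                               -- sum bit of the current position
   process (clk)
   begin
      if clk'event and clk = '1' then
         if clr = '1' then
            c <= '0';
         else
            c <= (ai and bi) or (c and (ai xor bi));    -- store the next carry
         end if;
      end if;
   end process;
end behav;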

The simplicity of the serial adder circuit depends on the input bit pattern being applied serially to the same sites and on the output pattern also being produced this way. In contrast, the inputs and outputs of the previous ‘parallel’ adder circuits were supposed to be applied in parallel. Auxiliary circuits are needed if the inputs to the adders are applied differently. A selection circuit for the input bits of the serial adder is needed if these are applied in parallel on n signal lines, using n − 1 select gates connected as a tree as in Figure 1.3 for each of the two input words. The processing time of these select trees is proportional to their depth ld(n) and drives up the processing time of the serial adder, and it also requires the generation of a select address. The extra processing time and the need for a select address are eliminated by receiving the input bits from two n-bit shift registers using the same clock signal as for the carry flip-flop (increasing the power consumption). The maximum clock rate is given by the processing time of the half adder plus the set-up and hold time of the flip-flop. As the shift registers need to be loaded in parallel from the input lines, they also include n − 1 select gates each, but this time the select operation is pipelined with the full adder operation and does not contribute to the processing time. Even without counting the input register flip-flops, which would probably also be used for the inputs to a parallel adder, one concludes that a need to select from the parallel input bits drives up the hardware costs to a level similar to that of the parallel adder. If the output bits from the serial adder are needed in parallel, another shift register can be used, again eliminating the need to select storage locations for the individual bits. One of the input shift registers can be used for this purpose (Figure 4.4), too. This figure does not show the signal selecting parallel input, nor a handshaking signal indicating n clock edges later that the result is ready. Their generation still requires an extra effort not needed for the parallel adder.


Figure 4.4 Serial adder with parallel input and output


Figure 4.5 Serial multiplier using the Horner scheme

As the n by n bit binary multiplier is also built up from n^2 identical full adder plus AND gate components, all their operations could be executed in sequence, this time requiring n bits of intermediate storage. More generally, the n-bit multiply operation can be broken up into m^2 identical k-bit extended add operations (for n = m*k) that can then be serialized.

The n by n bit multiplier in Figure 4.1(a) can also be represented as a cascade of n rows that are parallel adders and can be realized by using the same parallel adder in n sequential steps. It needs a single n-bit register to pass a partial sum to the next step (Figure 4.5), which avoids the overhead for selecting input bits. The first operand a is input in parallel to the adder while the bits of the second operand need to be input serially (e.g. using a shift register). The output bits come out serially at the least significant output bit of the adder according to the serial execution of the Horner scheme (where multiplying by ½ is equivalent to shifting down):

a*b·2^(−n) = 2^(−1)·(a·bn−1 + 2^(−1)·(a·bn−2 + … + 2^(−1)·(a·b0)…))

and may be moved into a shift register. After n steps, the register holds the most significant half of the result. Again, there is some extra effort for serializing b and for generating signals indicating the start and the end of the multiply operation, but this time the overall hardware costs are in O(n) and thus significantly less than for the ‘parallel’ multiplier circuits. The signed multiply is implemented by changing the cycle adding bn−1*a to a subtract cycle and by using a signed adder circuit. The circuit in Figure 4.5 generalizes to serially adding up k*n bit products using extended k-bit adders.
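
A behavioral sketch of this serial multiplier (names assumed here; numeric_std is used for brevity): in every cycle bi*a is added to the partial sum held in h, which is then shifted down by one position, the shifted-out bit entering the low word l; after 8 cycles h and l hold the two halves of the product:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity SERMUL8 is
   port (clk, start: in std_logic;
         a: in unsigned(7 downto 0);
         bi: in std_logic;                  -- bits of b, LSB first
         h: buffer unsigned(7 downto 0);    -- most significant half of the product
         l: buffer unsigned(7 downto 0));   -- least significant half
end SERMUL8;

architecture behav of SERMUL8 is
begin
   process (clk)
      variable s: unsigned(8 downto 0);
   begin
      if rising_edge(clk) then
         if start = '1' then
            h <= (others => '0'); l <= (others => '0');
         else
            if bi = '1' then
               s := ('0' & h) + a;          -- add bi*a to the partial sum
            else
               s := '0' & h;
            end if;
            h <= s(8 downto 1);             -- shift the partial sum down
            l <= s(0) & l(7 downto 1);      -- collect the shifted-out result bit
         end if;
      end if;
   end process;
end behav;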

Instead of multiplying via the AND gates, the multiplication step can also be realized by selecting between the adder output and directly passing the shifted input. This amounts to skipping the add operation and can be exploited to speed up the sequential multiplication in the case of the second operand containing many zeroes. Moreover, if the second operand contains a sequence of ones from bit positions k to l, then the sum of the corresponding terms 2^r*a is equal to 2^(l+1)*a − 2^k*a and can hence be computed with two add/subtract steps and some shifts (Booth's algorithm). The processing time for the multiplication becomes data dependent which is considered as a disadvantage in some applications.
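
For example (numbers chosen here), b = 01110000 contains ones in the bit positions 4 to 6, and

(2^4 + 2^5 + 2^6)*a = 112*a = (2^7 − 2^4)*a = 128*a − 16*a,

so one add and one subtract step (with the appropriate shifts) replace three add steps.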

The row of two-output extended full adders from Figure 4.1(c) can also be used in n sequential steps. It requires two n-bit registers and requires their contents to be summed up in a final step. The second register is for the carry words and gives this multiplier structure the name of a carry-save multiplier. The sequential steps can be performed at nearly the maximum rate of the individual extended full adder component. The carry-save multiplier performs in linear time and with a linear hardware effort. The adder components are used at close to 100% efficiency as in the case of the fully pipelined parallel multiplier. The carry-save multiplier is much faster than the structure in Figure 4.5 even if a fast adder is used there and if the final add to convert from the redundant code is performed serially. In comparison to the parallel circuit using n^2 adders, it cannot exploit the adder tree structure or be pipelined.

Adding bi*a to the right shifted output word in the serial multiplier of Figure 4.5 is easier to implement but equivalent to adding a left shifted version of it to the unshifted 2n-bit output register. If it is implemented this way, then at the start of the multiplication the register needs to be reset to zero. If it is not, the product is actually added to the previous contents. Thus at no extra hardware cost, this serial multiplier performs a multiply-and-add operation. The shifted output register version can be modified to perform it, too, and a variant of it is shown in Figure 4.6 using the carry-save multiplier scheme.

The upper part containing the AND multipliers, the full adders and the three registers C, L, H is the carry-save multiplier circuit outlined above, the C register holding the carry word. After the last cycle of a multiply operation the redundant code of the high word of the result would stand in the registers C and H while L would hold the low word of the result. The remaining registers serve to implement the multiply-add function. After the last cycle the next multiply operands a, b are applied, the redundant code of the high word is moved into the registers c, h instead, the low word is loaded into H, and C is cleared. The new multiplication then continues to add to H while at the same time the high word of the previous result is added up bit by bit in the serial adder consisting of the full adder and the D flip-flop and applied to the carry-save adder through the upper bit of the H register. The individual product then takes n cycles, and the add operation to convert from the redundant code is pipelined with the next multiplication. It is only at the end of k multiplications that a single redundant code remains to be added up to a binary one. A serial counter-circuit can be used to include a third result word accumulating the overflows occurring in a series of multiply-add operations.


Figure 4.6 Serial carry-save multiplier-accumulator (MAC)

4.6 DISTRIBUTED ARITHMETIC

Sums of products

y = ∑k<r Ak·xk

with constant coefficients Ak, i.e. considered as functions of the xk only, can be evaluated quite efficiently for a small number r of products using look-up tables, yet not using the entire multi-bit binary or fixed point codes of the xk to address the table, but the individual bits xk,i for the same arbitrary bit position i. Then fairly small look-up tables addressed by r bits suffice. For r = 3..6 such tables are offered by the cells of current FPGA chips. Therefore this approach yields an efficient implementation of multiple multiply-and-add operations on such chips. For all k,

xk = ∑i<n xk,i·2^i

hence

y = ∑k Ak·xk = ∑i<n (∑k Ak·xk,i)·2^i = ∑i<n gi·2^i

with

gi = F(x0,i, x1,i, …, xr−1,i)

F is the function of r Boolean inputs defined by F(b0,…,br−1) = ∑k bk*Ak. It outputs n′-bit words with n′ > n due to the multiple add function and is realized using a table. Then the gi are summed up serially according to the Horner scheme, using the sequential structure in Figure 4.7 which is similar to Figure 4.5:


Figure 4.7 Serial computation of a dot product using distributed arithmetic

y·2^(−n) = 2^(−1)·(gn−1 + 2^(−1)·(gn−2 + … + 2^(−1)·(g0)…))

The operand bits xk,i need to be input serially, using e.g. shift registers clocked synchronously with the add and shift steps.

For every bit of the result of F, a look-up table from a separate FPGA cell is used. Thus the required memory is taken from the FPGA configuration resources. This also allows an easy change of coefficients without having to redesign the circuit structure and its mapping. The number r of products is limited by the size of the available look-up tables. Longer input sequences can, however, be handled by breaking the sum into small parts and summing up the outputs of the corresponding tables, or by combining several cells to obtain larger look-up tables. It is not possible, however, to reuse the same length-r hardware circuit for the different sub-sequences, as the coefficients in the LUTs cannot be changed easily; changing them amounts to a reconfiguration which takes some time, if it is supported at all (only some Xilinx FPGA families can serially load a LUT from an application circuit). If, however, the same sum-of-products operation has to be applied to a block of vectors, the swapping of the LUTs needs to be performed once per block only and may become affordable. The hardware effort of the circuit in Figure 4.7 is considerably less than using r serial multipliers in parallel and adding up their results, and it is faster.
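
As a sketch of the table part (the coefficients A0 = 3, A1 = 5, A2 = 6 and all names are chosen here for illustration; numeric_std for brevity), for r = 3 the function F is a small ROM holding F(b) = b0*A0 + b1*A1 + b2*A2 for the 8 possible bit vectors:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity DALUT3 is
   port (b: in std_logic_vector(2 downto 0);   -- the bits x0,i, x1,i, x2,i
         f: out unsigned(3 downto 0));         -- n' = 4 output bits
end DALUT3;

architecture rom of DALUT3 is
   type table is array (0 to 7) of integer range 0 to 15;
   -- address b2 b1 b0:   000 001 010 011 100 101 110 111
   constant F_TAB: table := (0,   3,  5,  8,  6,  9, 11, 14);
begin
   f <= to_unsigned(F_TAB(to_integer(unsigned(b))), 4);
end rom;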

A trick to cut the required memory by a factor of two (cf. [3]) is to represent the signed xk inputs in the form:

xk = ½·(−(xk,n−1 − /xk,n−1)·2^(n−1) + ∑i<n−1 (xk,i − /xk,i)·2^i − 1)

and to replace F by:

F′(b0,…,br−1) = ½·∑k (bk − /bk)·Ak = F(b0,…,br−1) − ½·∑k Ak

Then:

y = −F′(x0,n−1,…,xr−1,n−1)·2^(n−1) + ∑i<n−1 F′(x0,i,…,xr−1,i)·2^i − ½·∑k Ak

F′ satisfies:

F′(/b0,…,/br−1) = −F′(b0,…,br−1)

The table for F′ only needs to provide the entries for x0 = 0. x0 is only used to complement the other bits and to change the add to a subtract operation using an add/subtract circuit instead of the adder.

4.7 DIVISION AND SQUARE ROOT

The division of an unsigned 2n-bit number a by an n-bit number b such that a < 2^n·b is computed in a sequence of operations conditionally subtracting products 2^k·b from a as long as the result is positive, for k decreasing from n − 1 to 0. The sum over those powers 2^k for which the subtraction was carried out is the result of the division. The condition of an intermediate difference a′ satisfying a′ ≥ 2^k·b corresponds to the carry bit being set after performing the binary subtract operation. Thus the divide step can be realized by using a binary subtract circuit and a selector between a and a − 2^k·b controlled by the carry bit, and outputting the carry bit to the kth bit position of the result. Due to the serial composition of the operations, a serial execution on the same conditional subtract circuit can be nearly as fast. The carry outputs during the time steps can be shifted into a result register. After performing the algorithm the remainder of the division is obtained as the result of the subtract operations. Equivalently to subsequently subtracting 2^k·b from an intermediate difference a′ one can subtract 2^n·b from 2^(n−k)·a′. This corresponds to shifting a′ to the left instead of shifting b to the right.
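
A worked example (numbers chosen here) with n = 3, a = 29 and b = 5, so that a < 2^3·b = 40:

k = 2:  29 − 2^2·5 = 9 ≥ 0   →  result bit 1, a′ = 9
k = 1:  9 − 2^1·5 < 0        →  result bit 0, a′ = 9
k = 0:  9 − 2^0·5 = 4 ≥ 0    →  result bit 1, a′ = 4

The result is 101 in binary, i.e. 5, with remainder 4; indeed 29 = 5*5 + 4.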

The conditional subtract operation can also be realized with an add/subtract circuit controlled by a function select input. If the subtract operation a′ − 2^k·b sets the carry, then in the next step a′ − 2^k·b will be changed by subtracting 2^(k−1)·b, otherwise by adding it, as a′ − 2^(k−1)·b = (a′ − 2^k·b) + 2^(k−1)·b. A common circuit for the sequential computation is shown in Figure 4.8, again shifting a to the left instead of b to the right. a is first loaded into the shift register L, the carry flip-flop C is set, and H is cleared. Then a first subtract is performed inputting from the lower n − 1 bits of H and the highest bit of L, storing the result in H, and shifting L to the left using the carry as input into the lowest bit of L (L is used for the result bits as well). After a total of n add/subtract and shift operations the result of the division stands in L. A final conditional add operation is required to obtain the remainder of the division in H. As in the circuits discussed before, the circuit starting the operation and generating the clock sequence is not shown, only the data registers and the add/subtract circuit (the so-called data path).

The same principle of performing a binary search for a maximum solution x of an equation f(x) ≤ a (f being a monotonic function, f(x) = b*x for the division) can also be applied to calculating the square root of a number a (here, f(x) = x*x). One adds decreasing powers 2^k starting with k = n/2 − 1 as long as the square of the sum is less than a, keeping track of the difference d between a and the square of the sum s. In every add step d decreases by 2^(k+1)·s + 2^(2k) (or, 2^(−1−k)·d decreases by s + 2^(k−1)), which can be realized with conditional subtract operations again.


Figure 4.8 Binary n-bit serial divide circuit

For the square root of fixed point numbers a with 0.25 ≤ a < 1, another common procedure is the approximation using the iteration b(i+1) = ½·(bi + a/bi) starting from b0 = 1, which converges quadratically but requires division.
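
For example, for a = 0.5 the iteration yields b1 = 0.75, b2 ≈ 0.70833 and b3 ≈ 0.70711, already accurate to about 5 decimal places of √0.5 ≈ 0.7071068; the number of valid digits roughly doubles in every step.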

4.8 FLOATING POINT OPERATIONS AND FUNCTIONS

Floating point operations are a lot more complex than fixed point ones, especially for the high precision types. They are, however, algorithmically derived from fixed point operations, and thus covered by the preceding sections. Special design tricks can be found in [44]. A common optimization in digital computations is to substitute floating point computations by fixed point ones after an analysis of the required precision. If the required frequency of floating point operations is low enough, they become a candidate for serial execution on a single or on a few processors and even for software implementations, i.e. for synthesizing them from fixed point operations executed on simple programmable processors.

The product of two normalized floating point numbers r1 = m1·2^s and r2 = m2·2^t with 1 ≤ m1, m2 < 2 is r1*r2 = (m1*m2)·2^(s+t). The following operations have to be performed:

  • Extract exponents and form e = s + t.
  • Extract mantissas and form the fixed point product m = m1*m2.
  • If m ≥ 2 then normalize m, increment e.
  • Round m.
  • Extract signs and derive sign of result.
  • Pack the bit fields.

Of these, the multiply operation is the most demanding one. It is a fixed point operation only requiring the most significant part of the product to be computed.

If s ≤ t, the sum is derived from the formula r1 + r2 = (m1·2^(s−t) + m2)·2^t and requires the following operations:

  • Extract exponents, order according to s ≤ t.
  • Denormalize m1 (multiply by 2^(s−t)).
  • Fixed point add/subtract operation according to the sign bits to get m = m1·2^(s−t) ± m2.
  • Normalize and round m and correct the exponent accordingly.
  • Pack the bit fields.

m may be small due to bit extinctions, and the normalization step requires a multi-bit shift depending on the number of sign bits. Extinction introduces an error that cannot be read off from the code.

While binary and signed binary adder and multiplier building blocks can be composed to yield add and multiply operations on words of arbitrary sizes, the floating point operations are not easily extended to higher precision operations. According to [54], the true sum of the rational numbers exactly represented by two floating point codes a, b with |a| ≥ |b| can be represented as

a + b = (a ⊕ b) + (b ⊖ ((a ⊕ b) ⊖ a))

where ⊕ and ⊖ denote the (rounded) floating point add and subtract operations. The rounding error of the floating point add operation corresponding to the places of b lost during the denormalization step can thus be computed in terms of floating point operations only (although not efficiently). Such a double word representation of a number as a sum of two floating point numbers can also be obtained for the true product of two numbers, using an operation that divides a floating point code into a sum of two half mantissa size codes for which the floating point product is exact (or one clearing the lower half of the mantissa). The operations to compute the rounding errors as floating point codes are useful floating point building blocks that can be generated as by-products of the implementations of the rounded floating point operations quite easily, but are not available on standard processors. Using them, full double precision operations can be derived from single precision ones, and the accumulation of rounding errors can be tracked to some degree.

For the non-normalized floating point format proposed in section 1.1.1 a simple heuristic to keep track of the precision is to form products to the minimum number of places of the operands. For the logarithmic encoding, fairly efficient add and multiply operations have been derived [82].

The square root and reciprocal functions are evaluated using the algorithms in section 4.7 or iterative procedures starting from low precision approximations taken from a ROM table. The standard transcendental functions are all evaluated from polynomial or rational approximations on certain intervals and extended from there by means of functional equations [47]. The sine function e.g. is approximated on [0, π/2] and extended using its symmetry and periodicity. Higher precision floating point formats require higher order polynomials, the allowed approximation error being of the order of magnitude of the least significant bits of the mantissa. For the sine and exponential functions there are special CORDIC algorithms exploiting their particular properties that compute their results by means of the fixed point add, subtract, compare and shift operations only, which makes them most suitable for FPGA implementation [46, 82] (cf. exercise 9).

For the evaluation of polynomials p(x) the Horner scheme is attractive as it minimizes the number of multiplications and eliminates the storage of powers of x:

p(x) = a0 + x·(a1 + x·(a2 + … + x·an)…)

It combines the multiply and add operations in a different way from the MAC structure in section 4.5, namely by multiplying to an intermediate result and then adding instead of adding to the intermediate result, and must be supported by a different circuit structure. The coefficients are constant and may be supposed to be scaled so that for arguments x ∊ [−1,1] the result may be represented in a given fixed point format on [−1, 1]. After every multiplication rounding is supposed to be applied so that all intermediate results can be represented in the same format. A circuit realizing this composite operation would be a useful building block for realizing functions through polynomial approximations. For rational approximations, other primitives, e.g. the expressions (a + 1/x) or (ax + b)/(cx + d) can be considered.

In some applications, it is sufficient to provide ROM function tables for selected arguments of certain functions only. Starting from a table, the values at intermediate arguments may be derived by some low-order polynomial interpolation to a desired precision.

4.9 POLYNOMIAL ARITHMETIC

For a primitive binary polynomial p(X) = ∑ pi·X^i of degree n the multiplication of polynomials mod(p) is the composite operation of taking the remainder of the product polynomial after its division by p. The powers X^k mod(p) of the polynomial X exhaust the m = 2^n − 1 non-zero polynomials of degree < n and can be used to encode the numbers k with 0 ≤ k < m by their coefficients, as proposed in section 1.1.1. For larger k, the remainders of the X^k repeat mod(m) (X^m = 1 mod(p)).

For this encoding, the increment operation is particularly simple. If

X^k = g0 + g1·X + … + gn−1·X^(n−1) mod(p),   then   X^(k+1) = gn−1·p0 + (g0 XOR gn−1·p1)·X + … + (gn−2 XOR gn−1·pn−1)·X^(n−1) mod(p)

The tuple representing the number k is thus shifted and XOR'ed with the shifted-out bit gn−1 at the positions of the non-zero coefficients of p. This is the operation of a shift register with XOR feedback from the output shown in Figure 4.9. The remainder of X^(m−1) increments to the polynomial 1, and the increment operation repeats after m steps. If the number of non-zero coefficients of p is small, the feedback shift register is less complex than e.g. a synchronous binary counter. Apart from the simplicity of these feedback circuits, shift register sequences generated in this way also have interesting statistical properties and can be used to generate pseudo-random numbers [83].
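
A behavioral sketch of the Figure 4.9 counter (entity name assumed here): a 3-bit shift register with XOR feedback for p(X) = 1 + X + X^3 stepping through the 7 non-zero codes:

entity LFSR3 is
   port (clk: in bit; q: out bit_vector(2 downto 0));
end LFSR3;

architecture behav of LFSR3 is
   signal g: bit_vector(2 downto 0) := "001";     -- code of X^0 = 1
begin
   q <= g;
   process (clk)
   begin
      if clk'event and clk = '1' then
         -- multiply by X and substitute X^3 by its remainder 1 + X
         g <= g(1) & (g(0) xor g(2)) & g(2);
      end if;
   end process;
end behav;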

The general add operation is more complex, however. It is the product X^r∗X^s = X^(r+s) that corresponds to the add operation modulo m of the exponents. Let X^r and X^s be the polynomials

X^r = g0 + g1·X + … + gn−1·X^(n−1) mod(p),   X^s = h0 + h1·X + … + hn−1·X^(n−1) mod(p)


Figure 4.9 mod(7) counter automaton based on the polynomial 1 + X + X^3


Figure 4.10 Polynomial multiplication ( * : AND, + : XOR )


Figure 4.11 Computing the remainder of the polynomial q divided by p(X) = 1 + X + X^3

Thus (g0,…,gn−1) and (h0,…,hn−1) are the codes of r, s. The product polynomial is given by

X^r·X^s = q0 + q1·X + … + q2n−2·X^(2n−2),   qk = ∑(i+j=k) gi·hj

The add operation in the field B is the XOR operation. The X^k with k ≥ n must be substituted by their remainders modulo p which depend on p. If the polynomial p(X) = 1 + X + X^n is used (for some n it is primitive), then X^k = X^(k−n) + X^(k−n+1) mod(p). The total number of XOR operations to compute the code of r + s from the codes of r and s is then n^2 − 1, and they can be arranged so that the processing time is logarithmic. An application would be to compute relative addresses in the case that subsequent address codes to read out a memory are generated by a shift register implementing the above increment operation (see section 5.1.3).

Multiplications of binary polynomials and remainder computations can be carried out serially using shift registers. The coefficients gk are applied to an (n − 1)-bit shift register in reverse order (starting with gn−1 at the input and all flip-flops reset to 0). Every shift operation moves the coefficients to the left so that gk+i appears at the output of the ith stage. After n − 1 clocks, g0 is applied to the input, and g1,…,gn−1 stand in the register. Thereafter, zeroes are applied to the input. The shift register taps are multiplied by the coefficients hn−1, hn−2,… to generate the coefficients qk+n−1 = ∑i hn−1−i·gk+i of the product g*h in reverse order, starting from q2n−2 (Figure 4.10). These are fed into a second shift register with feedback from the output according to the primitive polynomial p to perform the substitution of X^n appearing at the register output by its remainder (Figure 4.11). After 2n − 2 shift clocks, the result code stands at the inputs of the flip-flops of the second register.

4.10 SUMMARY

The Boolean functions realizing arithmetic operations on number codes have been found to have moderately complex algorithms. Binary arithmetic could be mostly constructed from full adder components and their generalizations, the mod(2^n) adders. With some hardware effort, faster add circuits could be obtained. While parallel multiply circuits do not use the gates efficiently, serial multipliers reusing a single mod(2^n) adder were found to use it quite efficiently. Other serial circuits were demonstrated for computing sums of products and for division.

EXERCISES

  1. Formally verify that the algorithms given for the n-bit unsigned and signed binary add and subtract operations with an (n + 1)-bit result are correct.
  2. Write out the behavioral and structural VHDL definitions of an n-bit adder with a carry input and an overflow output.
  3. Give a structural VHDL definition for a fast 16-bit incrementer using multiple AND gates to compute the carry bits.
  4. Verify the signed digit add algorithm in section 4.3, give a structural VHDL definition for a generic n-digit adder, and determine its gate count as a function of n.
  5. Work out a 16-bit serial MAC as in Figure 4.6 accumulating 1024 products with an additional 10-bit overflow word into a VHDL entity. The 16-bit operands will be read from external memories. Extend to signed multiply-add operations.
  6. Design a combined multiply-and-add circuit for the evaluation of polynomials similar to the structure in Figure 4.6. After the n carry-save cycles the result of the multiplication by x is added bit by bit to the coefficient and immediately applied as the b operand to the next multiply.
  7. Implement a 4-bit extended adder component with handshaking signals as a VHDL entity and use it as a component in an n-bit multiplier circuit.
  8. For a number p with 2^(n−1) < p ≤ 2^n the remainders mod(p) can be represented as binary n-bit numbers. Design a serial mod(p) multiplier that computes the product of two remainders in n cycles. Show how this multiply operation can be derived from the binary n-bit multiply operation with a 2n-bit result in the special case of p = 2^n − 1.
  9. A CORDIC algorithm to compute the trigonometric functions of some n-bit fixed-point input x ∊ [0, 1] starts by initializing a 2 × 2 matrix A to the unit matrix and subsequently multiplying the matrices ((1, −2^(−i)), (2^(−i), 1)) to it (slightly scaled rotations by αi = atan(2^(−i))) if xi ≥ αi, where the sequence (xi) is defined by x0 = x, xi+1 = xi − αi if xi ≥ αi, xi+1 = xi otherwise. After n steps, A is close to a scaled rotation by x, and sin(x), cos(x) and tan(x) are easily derived from it. The matrix multiplications only involve shift and add operations. Derive a serial data path performing the described operations.