In Section 6.1.6, we demonstrated how addition can be based on logic. Choosing two's complement representation for signed integers makes subtraction a special case of addition that is relatively easy to implement at the hardware level. That choice avoids complications associated with positive and negative variants of zero in sign and magnitude representation.
In this section, we extend the exploration of arithmetic founded on logic. First we look at an early algorithm that accomplishes multiplication by using addition. Then we look at division problems, where the divisor is a constant. The divisor is converted into a reciprocal to be used in a multiplication operation that is mathematically equivalent to the division.
Repeated addition can accomplish multiplication (N × X = X + X + X for N = 3, etc.), but requires an accumulator that is double the width of the operands. Repeated addition using a counted loop takes a long time for large values of N, and each addition can generate carries.
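As a point of reference, the repeated-addition scheme can be sketched in a few lines of Python (the function name is ours):

```python
def multiply_by_repeated_addition(n: int, x: int) -> int:
    """Compute n * x as n successive additions of x (n >= 0).

    In hardware, acc would need to be double the operand width."""
    acc = 0
    for _ in range(n):
        acc += x
    return acc
```

For large n the loop runs n times, which is exactly the performance problem noted above.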
Performing “long multiplication” on decimal numbers involves a combination of individual multiplication steps and an overall addition, with multiple carries, that is guided by positional weights (Section 1.8.1). Long multiplication for binary numbers works analogously. The multiplication steps are trivial (either 0 × X or 1 × X), but the “troublesome” carries (Section 6.1.6) are still present.
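Binary long multiplication reduces to shift-and-add, which can be sketched as follows (a minimal Python illustration of the idea, not from the text):

```python
def long_multiply(n: int, x: int) -> int:
    """Binary long multiplication for nonnegative integers:
    each 1 bit of the multiplier n contributes a copy of the
    multiplicand x shifted to that bit's position."""
    product = 0
    shift = 0
    while n:
        if n & 1:                  # trivial step: 1 * x
            product += x << shift  # the troublesome carries live here
        n >>= 1                    # move to the next multiplier bit
        shift += 1
    return product
```

The per-bit multiplication steps are trivial, but every `+=` can propagate carries across the full width of the partial product.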
During the course of his development of an early computer architecture, Andrew Donald Booth first explained a method in which the carries cannot build up. More importantly, the algorithm itself can proceed in identical fashion regardless of the signs or order of the two's complement operands (X = multiplicand, N = multiplier).
Our presentation of Booth's algorithm uses a notational approach introduced by John Peatman, which we demonstrate with 8-bit numbers. The multiplier can be written as:
N = –128ns + 64n6 + 32n5 + 16n4 + 8n3 + 4n2 + 2n1 + n0
where ns is the sign bit. This expression can be rewritten in the following “clever” way:
N = –128(ns–n6) – 64(n6–n5) – 32(n5–n4) – 16(n4–n3) – 8(n3–n2) – 4(n2–n1) – 2(n1–n0) – (n0–0)
where all of the terms now have the same form. In an 8-bit number, the sign bit ns is obviously n7.
Since each ni bit is a coefficient taking a value of 0 or 1 in a weighted positional notation, each symbolic difference in parentheses will be +1, –1, or 0. These values constitute the recipe for adding, subtracting, or ignoring each multiple 2^i X needed to compose the product P = N × X.
It is less obvious, but much more important, to observe that adjacent nonignorable operations alternate in sign (addition or subtraction) as a partial product P is being accumulated. Consider the number N = 0b00001011 with any X:
Criterion | Term | Partial product
---|---|---
 | starting condition | P = 0
10 | –1(n0–0) = –1 | P = –1X
11 | –2(n1–n0) = 0 | no change
01 | –4(n2–n1) = +4 | P = +3X
10 | –8(n3–n2) = –8 | P = –5X
01 | –16(n4–n3) = +16 | P = +11X
00 | –2^i(ni–ni–1) = 0 | all the rest yield 0
where the criterion column (examined below) juxtaposes bits ni and ni–1. Note also the special first case, which pairs n0 with 0. The correct sum is produced for the multiplier N = 11 (decimal) and any X.
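Peatman's rewriting can be checked mechanically. The following Python sketch (ours) sums the terms –2^i (ni – ni–1), with n–1 taken as 0, and recovers the two's complement value of any 8-bit multiplier:

```python
def recoded_value(n: int, w: int = 8) -> int:
    """Sum the Booth-recoded terms -2^i * (n_i - n_{i-1}),
    with the special first case n_{-1} = 0."""
    total = 0
    prev = 0                       # n_{-1} = 0
    for i in range(w):
        bit = (n >> i) & 1
        total += -(1 << i) * (bit - prev)
        prev = bit
    return total
```

For n = 0b00001011 this yields 11, and for n = 0b11111111 it yields –1, the two's complement reading.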
Recall that addition of numbers with unlike signs can never exceed the bit-width of the representation. The strict alternation of signs—that is, addition and subtraction—keeps the partial product P from growing any larger than the width of N. Stated differently, a widening P coupled with a narrowing N always fits in 2w bits, where w is the width of the numeric representation.
The schematic arrangement for the Booth algorithm in Figure 6-4 has the following features: one register of width w bits to hold the multiplicand X; a double register of width 2w bits, split into L and R halves, where L is initialized to zero and R contains the multiplier N; and additional storage for a single bit, ni–1, the bit most recently shifted out of the LR double register. That single bit is initialized to zero.
Shifting L (and R) to the right, relative to a stationary X, is equivalent to writing down the partial products of X that are shifted to the left in standard long multiplication.
The criterion for addition or subtraction can be recognized as the appropriate entry in Table 6-4, corresponding to the current multiplier bit ni and the previous multiplier bit ni–1.
Bit ni | Bit ni–1 | Action |
---|---|---|
0 | 0 | do nothing; then shift right |
0 | 1 | add X; then shift right |
1 | 0 | subtract X; then shift right |
1 | 1 | do nothing; then shift right |
With the stated initializations and an understanding of the presentation in Figure 6-4 and Table 6-4, Booth's algorithm simply becomes:
repeat w times:
    isolate the current ni for use in the next iteration
    L := L+X, L := L-X, or leave L alone    // Table 6-4
    shift the pattern in LR to the right one bit position
maintain sign extension for the pattern in LR at all times
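Under the register arrangement of Figure 6-4, the loop can be modeled in Python as follows (a sketch under our own naming; x and n are two's complement values of width w):

```python
def booth_multiply(x: int, n: int, w: int = 8) -> int:
    """Booth's algorithm over a double register L:R of width 2w.

    L starts at zero, R holds the multiplier pattern, and the
    single-bit memory (prev) starts at zero."""
    mask = (1 << w) - 1
    L, R, prev = 0, n & mask, 0
    for _ in range(w):
        cur = R & 1                       # isolate the current n_i
        if (cur, prev) == (0, 1):
            L = (L + (x & mask)) & mask   # 01: add X; drop carry-out
        elif (cur, prev) == (1, 0):
            L = (L - (x & mask)) & mask   # 10: subtract X
        # shift LR right one position, sign-extending L
        R = (R >> 1) | ((L & 1) << (w - 1))
        L = (L >> 1) | ((L >> (w - 1)) << (w - 1))
        prev = cur                        # update n_{i-1}
    result = (L << w) | R                 # 2w-bit pattern of the product
    return result - (1 << 2 * w) if result >> (2 * w - 1) else result
```

booth_multiply(13, 11) returns 143, matching the worked trace that follows.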
Carries from the leftmost bit of L are ignored. For N = 11 as given above, this plays out as follows for X = 13 = 0b00001101:
Iteration | L | R | ni–1 | Action
---|---|---|---|---
 | 00000000 | 00001011 | 0 | initialization
1 | 11110011 | 00001011 | 0 | 10: subtract X from L
 | 11111001 | 10000101 | 1 | shift LR (sign extended); update ni–1
2 | 11111001 | 10000101 | 1 | 11: do nothing
 | 11111100 | 11000010 | 1 | shift LR (sign extended); update ni–1
3 | 00001001 | 11000010 | 1 | 01: add X to L
 | 00000100 | 11100001 | 0 | shift LR (sign extended); update ni–1
4 | 11110111 | 11100001 | 0 | 10: subtract X from L
 | 11111011 | 11110000 | 1 | shift LR (sign extended); update ni–1
5 | 00001000 | 11110000 | 1 | 01: add X to L
 | 00000100 | 01111000 | 0 | shift LR (sign extended); update ni–1
6 | 00000100 | 01111000 | 0 | 00: do nothing
 | 00000010 | 00111100 | 0 | shift LR (sign extended); update ni–1
7 | 00000010 | 00111100 | 0 | 00: do nothing
 | 00000001 | 00011110 | 0 | shift LR (sign extended); update ni–1
8 | 00000001 | 00011110 | 0 | 00: do nothing
 | 00000000 | 10001111 | 0 | shift LR (sign extended); update ni–1
with a final result of 0b10001111 = 143 (decimal) = 13 × 11. The L portion is zero only because this particular product is a small positive number.
Can we now convert this algorithm into Itanium instructions using 64-bit operands? Our outline suggests roles for a double-width register, shifting of a double-width intermediate quantity that eventually becomes the product, and a criterion for predicating addition, subtraction, or neither at each stage in a counted loop. Representing the single-bit memory location with register rn, initialized to zero, we express Itanium instructions for the body of the loop:
     and     rt=0x1,rr;;        // Isolate lowest bit of R
     xor     ri=rn,rt;;         // ri = whether to act
     cmp.ne  pi,p0=0,ri         // pi = whether to act
     mov     rn=rt;;            // Save bit for next time
(pi) cmp.eq.unc pa,ps=0,rt;;    // Add, subtract, nop?
(pa) add     rl=rl,rx           // Add X to L
(ps) sub     rl=rl,rx;;         // Subtract X from L
     shrp    rr=rl,rr,1         // New R of shifted LR
     shr     rl=rl,1            // New L of shifted LR
There are countless ways to implement Booth's algorithm in any language or architecture. Some may offer shorter code sequences based on tight iterative loops, but EPIC principles can significantly improve performance by eliminating the branches inside those loops.
We can easily adapt the implementation of Booth's algorithm for circumstances where we know that the two operands are unsigned, rather than signed, integers. In order to see how to make this correction, look back at the leading term in the algebraic sum for the value of the multiplier N (see the beginning of Section 6.5.1).
When the operands are signed, as with the normal implementation of Booth's algorithm, the leading term has the general weighted value –2^(w–1) nw–1, where w is the bit width of the integer representation. When the operands are unsigned, that leading term should have the weighted value +2^(w–1) nw–1. The result from the normal implementation of Booth's algorithm therefore needs to be corrected by +2 × 2^(w–1) nw–1 = +2^w nw–1.
In pseudocode, X is added to L after the last shift operation of the loop if and only if the leading bit nw–1 of N is 1:
     // Unsigned multiplication only
     tbit.nz pn,p0=rn,63        // If bit 63 of N is 1,
(pn) add     rl=rl,rx           // add X after last shift
     // Unsigned multiplication only
Since our development treated X symbolically throughout, the leading (and no longer sign) bit of X can be ignored, as can any carry-out to the left in L.
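The correction can be checked numerically. The sketch below (our own, in Python) forms the signed product that Booth's algorithm would deliver and then adds 2^w X whenever the leading bit of N is set; for the check, the multiplicand X is kept below 2^(w–1), so its pattern reads the same signed or unsigned:

```python
W = 8                                         # illustration width

def signed_val(pattern: int, w: int = W) -> int:
    """Two's complement value of a w-bit pattern."""
    return pattern - (1 << w) if pattern >> (w - 1) else pattern

def unsigned_product(n: int, x: int, w: int = W) -> int:
    """Unsigned product via the signed Booth result plus the
    correction term +2^w * nw-1 * X described in the text."""
    p = signed_val(n, w) * signed_val(x, w)   # what the signed loop yields
    if (n >> (w - 1)) & 1:                    # leading bit of N set?
        p += (1 << w) * x                     # add X "after the last shift"
    return p & ((1 << 2 * w) - 1)             # keep 2w bits
```

For example, N = 0b11001100 (204 unsigned) times X = 13 gives 2652 after correction, where the uncorrected signed product would be –52 × 13 = –676.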
When we introduced integer arithmetic instructions in Chapter 4, we pointed out that Alpha, Itanium, and some other high-performance architectures lack an integer divide instruction. As Bhandarkar has noted, several methods can be used to implement integer division:
- Use of floating-point hardware to convert from integer into floating-point representation, to divide, and then to convert back. Accuracy might be limited by the precision of the floating-point representation (not a problem for the Itanium ISA);
- Iterative testing and subtraction to implement algorithms analogous to decimal long division; and
- Multiplication by the reciprocal of the divisor, and then a shift instruction to normalize the result.
This last method, using a reciprocal, is especially applicable for division by a constant that is known at the time the program is assembled or compiled.
Consider the very common need to divide a binary integer by 10 as part of an algorithm for formatting decimal numbers for printing. The reciprocal of 10 is the repeating fraction 0.0001100110011... (binary). In order to preserve maximum precision, we can instead use the normalized 64-bit fraction 0.CCCCCCCCCCCCCCCD (hexadecimal representation of 0.8) for the multiplication, with a subsequent division by 8. The final hexadecimal digit of the fraction is D rather than C in order to ensure normal rounding, since shifting as a means of division truncates, rather than rounds, the result. According to standard algebra, this division by 8 can be deferred until an extended-precision product has been formed through multiplication of the source number by the normalized fraction. That is, we can use the simple mathematical identities x/10 = 0.1x = 0.8x/8.
We now explain this method in more detail. Since one-tenth is less than one-eighth but greater than one-sixteenth, the representation of 0.1 (decimal) as a binary fraction begins with three zero bits before the first one bit (that is, no one-half, no one-quarter, no one-eighth, one one-sixteenth, etc.). The full binary representation for one-tenth is then found, by a continuation of this process, to contain a repeating bit pattern, …1100…, which corresponds to the hexadecimal character C. With binary numbers and fractions, each shift to the left by one bit position corresponds to a multiplication of the value by a factor of two, and each shift to the right by one bit position corresponds to a division of the value by a factor of two. If we shift the bit pattern representing one-tenth three bit positions to the left, so that those first three zeros are gone, the value becomes 2^3 = 8 times greater; thus the resulting 0.CCC… hexadecimal pattern represents the value eight tenths.
When we wish to divide a value x by ten, we first multiply that value by 0.8 to form an intermediate product 0.8x of highest precision. Then we divide the intermediate product 0.8x by 8 to obtain a result equivalent to x/10. Division by 8 is easily accomplished on any computer that supports an arithmetic shift to the right by three bit positions.
This algorithm is quite easily expressed using Itanium instructions, with the source number in register rx. We imagine some function, umpy8hi, which implements either Booth's algorithm or a sequence of Itanium floating-point instructions to perform 8-byte unsigned multiplication while retaining the high-order portion of a full 128-bit product. For reasons that will unfold below, we need an instruction that yields the high portion. Bhandarkar's suggestion can be expressed as:
DOT8 = 0xCCCCCCCCCCCCCCCD    // 0.8 (finite approx.) (code fragment)
     movl    rf = DOT8;;     // Get reciprocal factor
     umpy8hi rl = rf,rx;;    // rl = high 64 bits
     shr     rq = rl,3;;     // Divide by 8
We now have the desired quotient in register rq. The original source number is still in register rx. If we are doing a modulo calculation and need to know the remainder, we can go on to compute it from rx – 10*rq.
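The whole recipe, quotient and remainder, can be mimicked in Python, using native big-integer arithmetic in place of umpy8hi (the names are ours):

```python
DOT8 = 0xCCCCCCCCCCCCCCCD          # 64-bit fraction just above 0.8

def div10(x: int) -> tuple[int, int]:
    """Quotient and remainder of x divided by 10, for unsigned
    64-bit x, via multiply-by-reciprocal and a 3-bit shift."""
    high = (x * DOT8) >> 64        # high 64 bits of the 128-bit product
    q = high >> 3                  # divide 0.8x by 8
    return q, x - 10 * q           # remainder as in the text
```

The rounded-up final digit D makes the quotient exact for every x below 2^64.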
We can illustrate this method by asking for the result of integer division of 256 by 10 using this program fragment. Since 256 = 2^8, the multiplication of DOT8 by 256 will result in a 128-bit product based on a shift to the left by 8 bits, or 2 hexadecimal digits, where the
low-order 64 bits are:  CCCCCCCCCCCCCD00 (hexadecimal)
high-order 64 bits are: 00000000000000CC (hexadecimal)
Remember that our version of Booth's algorithm retains the low-order 64 bits in R and the high-order 64 bits in L (register rl in the pseudocode just above). Continuing, we now divide the high-order portion by 8,
CC (hexadecimal) = 1100 1100 (binary), which shifts right three places to give 0001 1001 (binary) = 19 (hexadecimal) = 25 (decimal),
thus obtaining the proper result of 25; integer division discards the remainder.
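These specific bit patterns are easy to confirm with a quick Python check (ours):

```python
DOT8 = 0xCCCCCCCCCCCCCCCD          # the reciprocal factor from the text
product = 256 * DOT8               # the 128-bit intermediate product
low = product & ((1 << 64) - 1)    # low-order half: 0xCCCCCCCCCCCCCD00
high = product >> 64               # high-order half: 0xCC
quotient = high >> 3               # divide by 8, giving 25
```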
In developing this methodology, we chose to treat the reciprocal of the known divisor as a binary fraction and the multiplicand (dividend) as a whole number. Accordingly, the split between the whole number and the fraction was located exactly between the high- and low-order 64-bit portions of the full product. Picking off the integer part of the quotient then simply meant choosing the contents of register L in our illustration of Booth's algorithm. Among computer architectures with machine instructions for multiplication, some provide the low- and high-order segments in separate registers from the execution of a single instruction, while others have two different opcodes that specify which segment is placed into a destination register.