Chapter 5

Multiplication and Squaring

5.1 The Multipliers

For most number theoretic problems, including certain public key cryptographic algorithms, the “multipliers” form the most important subset of algorithms of any multiple precision integer package. The set of multiplier algorithms includes integer multiplication, squaring, and modular reduction; in each of these algorithms single precision multiplication is the dominant operation performed. This chapter discusses integer multiplication and squaring, leaving modular reductions for the subsequent chapter.

The importance of the multiplier algorithms is for the most part driven by the fact that certain popular public key algorithms are based on modular exponentiation; that is, computing d ≡ a^b (mod c) for some arbitrary choice of a, b, and c. During a modular exponentiation the majority¹ of the processor time is spent performing single precision multiplications.

For centuries, general-purpose multiplication has required a lengthy O(n^2) process, whereby each digit of one multiplicand must be multiplied against every digit of the other multiplicand. Traditional long-hand multiplication is based on this process; while the techniques can differ, the overall algorithm used is essentially the same. Only “recently” have faster algorithms been studied. Karatsuba multiplication, discovered in 1962, can multiply two numbers with considerably fewer single precision multiplications than the long-hand approach. This technique led to the discovery of polynomial basis algorithms [5] and subsequently Fourier Transform based solutions.

5.2 Multiplication

5.2.1 The Baseline Multiplication

Computing the product of two integers in software can be achieved using a trivial adaptation of the standard O(n^2) long-hand multiplication algorithm that schoolchildren are taught. The algorithm is considered an O(n^2) algorithm, since for two n-digit inputs n^2 single precision multiplications are required. More specifically, for an m and n digit input, m · n single precision multiplications are required. To simplify most discussions, it will be assumed that the inputs have a comparable number of digits.

The “baseline multiplication” algorithm is designed to act as the “catch-all” algorithm, only to be used when the faster algorithms cannot be used. This algorithm does not use any particularly interesting optimizations and should ideally be avoided if possible. One important facet of this algorithm is that it has been modified to only produce a certain amount of output digits as resolution. The importance of this modification will become evident during the discussion of Barrett modular reduction. Recall that for an n and m digit input, the product will be at most n + m digits. Therefore, this algorithm can serve as a full multiplier by having it produce n + m digits of the product.

Recall from section 4.2.2 the definition of γ as the number of bits in the type mp_digit. We shall now extend the variable set to include α, which shall represent the number of bits in the type mp_word. This implies that 2^α > 2 · β^2. The constant δ = 2^(α − 2·lg(β)) will represent the maximal weight of any column in a product (see section 6.2 for more information).

Algorithm s_mp_mul_digs. This algorithm computes the unsigned product of two inputs a and b, limited to an output precision of digs digits. While it may seem a bit awkward to modify the function from its simple O(n2) description, the usefulness of partial multipliers will arise in a subsequent algorithm. The algorithm is loosely based on algorithm 14.12 from [2, pp. 595] and is similar to Algorithm M of Knuth [1, pp. 268]. Algorithm s_mp_mul_digs differs from these cited references since it can produce a variable output precision regardless of the precision of the inputs (Figure 5.1).


Figure 5.1 Algorithm s_mp_mul_digs

The first thing this algorithm checks for is whether a Comba multiplier can be used instead. If the minimum digit count of either input is less than δ, then the Comba method may be used instead. After the Comba method is ruled out, the baseline algorithm begins. A temporary mp_int variable t is used to hold the intermediate result of the product. This allows the algorithm to be used to compute products when either a = c or b = c without overwriting the inputs.

All of step 5 is the infamous O(n^2) multiplication loop, slightly modified to only produce up to digs digits of output. The pb variable is given the count of digits to read from b inside the nested loop. If pb ≤ 1, then no more output digits can be produced and the algorithm will exit the loop. The best way to think of the loops is as a series of pb × 1 multiplications: in each pass of the outer loop, a_ix is multiplied against b and the result is added (with an appropriate shift) to t.

For example, consider multiplying 576 by 241. That is equivalent to computing 10^0(1)(576) + 10^1(4)(576) + 10^2(2)(576), which is best visualized in Figure 5.2.


Figure 5.2 Long-Hand Multiplication Diagram

Each row of the product is added to the result after being shifted to the left (multiplied by a power of the radix) by the appropriate count. That is, in pass ix of the outer loop the product is added starting at the ix'th digit of the result.

Step 5.4.1 introduces the hat symbol (e.g., r̂), which represents a double precision variable. The multiplication on that step is assumed to be a double wide output single precision multiplication; that is, two single precision variables are multiplied to produce a double precision result. The step is somewhat optimized from a long-hand multiplication algorithm because the carry from the addition in step 5.4.1 is propagated through the nested loop. If the carry were not propagated immediately, it would overflow the single precision digit t_{ix+iy} and the result would be lost.

At step 5.5 the nested loop is finished and any carry that was left over should be forwarded. The carry does not have to be added to the ix + pb'th digit since that digit is assumed to be zero at this point. However, if ix + pb ≥ digs, the carry is not stored, as it would make the result exceed the precision requested.
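
To make the description concrete, the following is a minimal C sketch of the digs-limited baseline loop. The type names and the 28-bit digit mirror LibTomMath's conventions, but the flat-array interface is invented for this illustration.

   #include <stdint.h>

   typedef uint32_t mp_digit;   /* single precision digit, lg(beta) = 28 bits used */
   typedef uint64_t mp_word;    /* double precision accumulator */

   #define DIGIT_BIT 28
   #define MP_MASK   ((((mp_digit)1) << DIGIT_BIT) - 1)

   /* Compute the lowest `digs` digits of a * b by the long-hand method. */
   static void mul_digs(const mp_digit *a, int used_a,
                        const mp_digit *b, int used_b,
                        mp_digit *t, int digs)
   {
      for (int ix = 0; ix < digs; ix++) t[ix] = 0;

      for (int ix = 0; ix < used_a && ix < digs; ix++) {
         mp_word u = 0;                                   /* running carry */
         int pb = (used_b < digs - ix) ? used_b : digs - ix;
         for (int iy = 0; iy < pb; iy++) {
            /* double wide product plus previous digit plus carry */
            mp_word r = (mp_word)t[ix + iy]
                      + (mp_word)a[ix] * (mp_word)b[iy] + u;
            t[ix + iy] = (mp_digit)(r & MP_MASK);         /* keep the lower digit */
            u          = r >> DIGIT_BIT;                  /* forward the rest */
         }
         if (ix + pb < digs) {
            t[ix + pb] = (mp_digit)u;  /* step 5.5: forward the final carry */
         }
      }
   }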

[C source listing of s_mp_mul_digs omitted]

First, we determine (Line 31) whether the Comba method can be used, since it is faster. The conditions for using the Comba routine are that min(a.used, b.used) < δ and the number of digits of output is less than MP_WARRAY. This new constant is used to control the stack usage in the Comba routines. By default it is set to δ, but it can be reduced when memory is at a premium.

If we cannot use the Comba method, we proceed to set up the baseline routine. We allocate the destination mp_int t (Line 37) to the exact size of the output to avoid further reallocations. At this point, we begin the O(n^2) loop.

This implementation of multiplication has the caveat that it can be trimmed to produce only a variable number of digits of output. In each iteration of the outer loop, the pb variable is set (Line 49) to the maximum number of inner loop iterations.

Inside the inner loop we calculate r̂ as the mp_word product of the two mp_digits plus the carry from the previous iteration. A particularly important observation is that most modern optimizing C compilers (GCC, for instance) can recognize that an N × N → 2N multiplication is all that is required for the product. In x86 terms, for example, this means using the MUL instruction.
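
The pattern that lets the compiler do this is simply casting the single precision operands up before multiplying; a one-line sketch using the names from the earlier baseline sketch:

   mp_word r = (mp_word)t[ix + iy] + (mp_word)a[ix] * (mp_word)b[iy] + u;

Because both multiplicands are single precision values widened only at the operator, a compiler that recognizes the idiom can emit one N × N → 2N instruction rather than a full double precision multiplication routine.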

Each digit of the product is stored in turn (Line 69) and the carry propagated (Line 72) to the next iteration.

5.2.2 Faster Multiplication by the “Comba” Method

One of the huge drawbacks of the “baseline” algorithms is that at the O(n^2) level the carry must be computed and propagated upwards, which makes the nested loop very sequential and hard to unroll or parallelize. The “Comba” [5] method is named after the little-known (in cryptographic venues) Paul G. Comba, who described a method of implementing fast multipliers that do not require nested carry fix-up operations. As an interesting aside, it seems that Paul Barrett describes a similar technique in his 1986 paper [5], written five years before.

At the heart of the Comba technique is again the long-hand algorithm, except in this case a slight twist is placed on how the columns of the result are produced. In the standard long-hand algorithm, rows of products are produced and then added together to form the result. In the baseline algorithm, the columns are added together after each iteration to get the result instantaneously.

In the Comba algorithm, the columns of the result are produced entirely independently of each other; that is, at the O(n^2) level a simple multiplication and addition step is performed. The carries of the columns are propagated after the nested loop to reduce the amount of work required. Succinctly, the first step of the algorithm is to compute the product vector x as follows:

x_n = Σ_{i+j=n} a_i · b_j    (5.1)

where x_n is the n'th column of the output vector. Consider Figure 5.3, which computes the vector x for the multiplication of 576 and 241.


Figure 5.3 Comba Multiplication Diagram

At this point the vector x = 〈10, 34, 45, 31, 6〉 is the result of the first step of the Comba multiplier. Now the columns must be fixed by propagating the carries upwards. The resultant vector will have one extra dimension over the input vector, which is congruent to adding a leading zero digit (Figure 5.4).


Figure 5.4 Algorithm Comba Fixup
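
A minimal C sketch of the fixup step, using the types from the earlier baseline sketch and assuming W[0] holds the least significant column:

   /* Propagate the double precision column sums W[0..k-1] into single
    * precision digits x[0..k]; the result gains one extra digit. */
   static void comba_fixup(const mp_word *W, mp_digit *x, int k)
   {
      mp_word carry = 0;
      for (int n = 0; n < k; n++) {
         carry += W[n];
         x[n]   = (mp_digit)(carry & MP_MASK);  /* keep lg(beta) bits   */
         carry >>= DIGIT_BIT;                   /* forward the overflow */
      }
      x[k] = (mp_digit)carry;                   /* the extra digit      */
   }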

With that algorithm, k = 5, and β = 10, the vector x = 〈1, 3, 8, 8, 1, 6〉 is produced. In this case, 241 · 576 is in fact 138816 and the procedure succeeded. If the algorithm is correct and, as will be demonstrated shortly, more efficient than the baseline algorithm, why not simply always use this algorithm?

Column Weight.

At the nested O(n^2) level, the Comba method adds the product of two single precision variables to each column of the output independently. A serious obstacle arises if a carry is lost, due to lack of precision, before the algorithm has a chance to fix the carries. For example, in the multiplication of two three-digit numbers, the third column of output will be the sum of three single precision multiplications. If the precision of the accumulator for the output digits is less than 3 · (β − 1)^2, then an overflow can occur and the carry information will be lost. For any m and n digit inputs, the maximum weight of any column is min(m, n).

The maximum number of terms in any column of a product is known as the “column weight” and strictly governs when the algorithm can be used. Recall that a double precision type has α bits of resolution and a single precision digit has lg(β) bits of precision. Given these two quantities, we must not violate:

k · (β − 1)^2 < 2^α    (5.2)

which reduces to

k · (β^2 − 2β + 1) < 2^α    (5.3)

Let ρ = lg(β) represent the number of bits in a single precision digit. By further re-arrangement of the equation the final solution is found.

k < 2^α / (2^{2ρ} − 2^{ρ+1} + 1)    (5.4)

The defaults for LibTomMath are β = 2^28 and α = 2^64, which means that k is bounded by k < 257. In this configuration, the smaller input may not have more than 256 digits if the Comba method is to be used. This is quite satisfactory for most applications, since 256 digits would allow for numbers in the range of 0 ≤ x < 2^7168, which is much larger than most public key cryptographic algorithms require.

Algorithm fast_s_mp_mul_digs. This algorithm performs the unsigned multiplication of a and b using the Comba method limited to digs digits of precision (Figure 5.5).


Figure 5.5 Algorithm fast_s_mp_mul_digs

The outer loop of this algorithm is more complicated than that of the baseline multiplier. This is because inside the loop we want to produce one column per pass, which allows the accumulator _Ŵ to be placed in CPU registers and reduces the memory bandwidth to two mp_digit reads per iteration.

The ty variable is set to the minimum of ix and the index of the last digit of b. That way, if a has more digits than b, ty will be limited to b.used − 1. The tx variable is set to the distance ix extends past ty. This is used in the immediately subsequent statement where we find iy.

The variable iy is the minimum number of digits we can read from either a or b before running out. Computing one column at a time means we have to scan one integer upwards and the other downwards: a starts at tx and b starts at ty. In each pass we are producing the ix'th output column, and we note that tx + ty = ix. As we move tx upwards, we have to move ty downwards so the equality remains valid. The iy variable is the number of iterations until tx ≥ a.used or ty < 0 occurs.

After every inner pass we store the lower half of the accumulator into W_ix and then propagate the carry of the accumulator into the next round by dividing _Ŵ by β.
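
A sketch of one outer pass in C, using the conventions of the earlier sketches (the δ precondition guarantees the accumulator cannot overflow):

   mp_word _W = 0;
   for (int ix = 0; ix < digs; ix++) {
      int ty = (used_b - 1 < ix) ? used_b - 1 : ix;   /* MIN(ix, b.used - 1) */
      int tx = ix - ty;                               /* tx + ty == ix       */
      /* iterations until tx >= a.used or ty < 0 would occur */
      int iy = (used_a - tx < ty + 1) ? used_a - tx : ty + 1;

      for (int iz = 0; iz < iy; iz++) {
         _W += (mp_word)a[tx + iz] * (mp_word)b[ty - iz];
      }
      W[ix] = (mp_digit)(_W & MP_MASK);   /* store the lower half      */
      _W >>= DIGIT_BIT;                   /* carry to the next column  */
   }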

To measure the benefits of the Comba method over the baseline method, consider the number of operations that are required. If the cost in terms of time of a multiply and addition is p and the cost of a carry propagation is q, then a baseline multiplication would require O((p + q)n^2) time to multiply two n-digit numbers. The Comba method requires only O(pn^2 + qn) time; however, in practice the speed increase is actually much more. With O(n) space the algorithm can be reduced to O(pn + qn) time by implementing the n multiply and addition operations in the nested loop in parallel.

[C source listing of fast_s_mp_mul_digs omitted]

As per the pseudo-code, we first calculate pa (Line 48) as the number of digits to output. Next, we begin the outer loop to produce the individual columns of the product. We use the two aliases tmpx and tmpy (lines 62, 63) to point inside the two multiplicands quickly.

The inner loop (lines 71 to 73) of this implementation is where the trade-off comes into play. Originally, this Comba implementation was “row-major,” meaning it added to each of the columns in each pass. After the outer loop it would then fix the carries. This was very fast, except it had an annoying drawback: you had to read one mp_word and two mp_digits and write one mp_word per iteration. On processors such as the Athlon XP and P4 this did not matter much, since the cache bandwidth is high enough to keep the ALU fed with data. It did, however, matter on older and embedded CPUs, where cache is often slower and often does not exist. This new algorithm only performs two reads per iteration, under the assumption that the compiler has aliased _Ŵ to a CPU register.

After the inner loop we store the current accumulator in W and shift _Ŵ (lines 76, 79) to forward it as a carry for the next pass. After the outer loop we use the final carry (Line 76) as the last digit of the product.

5.2.3 Even Faster Multiplication

In the realm of O(n^2) multipliers, we can actually do better than Comba multipliers. In the case of the portable code, only lg(β) bits of each digit are being used, because accessing carry bits from the CPU flags is not efficient in portable C.

In the TomsFastMath² project, a triple-precision register is used to accumulate products, and the multiplication algorithm produces the digits of the result one at a time. The benefit of this algorithm is that we are packing more bits per digit, resulting in fewer single precision multiplications. For example, a 1024-bit multiplication on a 32-bit platform involves 32^2 = 1024 single precision multiplications with TomsFastMath and 37^2 = 1369 with LibTomMath (33% more).

Algorithm fast_mult. This algorithm performs a multiplication using the full precision of the digits (Figure 5.6). It is not strictly part of LibTomMath; instead, it is part of TomsFastMath. Quite literally, the TomsFastMath library was a port of LibTomMath.


Figure 5.6 Algorithm fast_mult

The first noteworthy change from our LibTomMath conventions is that we are indeed using the full precision of the digits. For example, on a 32-bit platform, a 1024-bit number would require 32 digits to be fully represented (instead of the 37 that LibTomMath would require).

The shuffle in step 3.4 is effectively a triple-precision shift right by the size of one digit. Similarly, in step 3.5.1, a double-precision product is being accumulated in the triple-precision array {c2 : c1 : c0}.

The TomsFastMath library gets its significant speed increase over LibTomMath not only due to the use of full precision digits, but also the fact that the multipliers are unrolled and use inline assembler. It unrolls the multipliers in steps of 1 through 16, 20, 24, 28, 32, 48, and 64 digits. The unrolling takes considerable space, but the savings in time from not having all of the loop control overhead is significant. The use of inline assembler also lets us perform the inner loop with code such as the following x86 assembler.

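A sketch modeled on TomsFastMath's MULADD macro (the operand constraints are reconstructed from memory, so treat the details as illustrative rather than authoritative):

   /* c0:c1:c2 accumulate one 96-bit column; i and j are 32-bit digits. */
   #define MULADD(i, j)                                       \
      asm("movl %6,%%eax  \n\t"   /* eax = i              */  \
          "mull %7        \n\t"   /* edx:eax = i * j      */  \
          "addl %%eax,%0  \n\t"   /* c0 += low word       */  \
          "adcl %%edx,%1  \n\t"   /* c1 += high word + CF */  \
          "adcl $0,%2     \n\t"   /* c2 += CF             */  \
          : "=r"(c0), "=r"(c1), "=r"(c2)                      \
          : "0"(c0), "1"(c1), "2"(c2), "m"(i), "m"(j)         \
          : "%eax", "%edx", "cc");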

This performs the 32 × 32 multiplication and accumulates it in the 96-bit array {c2 : c1 : c0}, as required in step 3.5.1. A particular feature of the TomsFastMath approach is to use these functional macro blocks instead of hand-tuning the implementation for a given platform. As a result, we can change the macro to the following and produce a math library for ARM processors.

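A corresponding sketch for ARM, again reconstructed from memory, using the UMULL long multiply:

   /* Same 96-bit accumulation with ARM's UMULL. */
   #define MULADD(i, j)                                        \
      asm("UMULL r0,r1,%6,%7  \n\t"   /* r1:r0 = i * j    */   \
          "ADDS  %0,%0,r0     \n\t"   /* c0 += low word   */   \
          "ADCS  %1,%1,r1     \n\t"   /* c1 += high + C   */   \
          "ADC   %2,%2,#0     \n\t"   /* c2 += C          */   \
          : "=r"(c0), "=r"(c1), "=r"(c2)                       \
          : "0"(c0), "1"(c1), "2"(c2), "r"(i), "r"(j)          \
          : "r0", "r1", "cc");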

In total, TomsFastMath supports four distinct hardware architectures covering x86, PPC32 and ARM platforms from a relatively consistent code base. Adding new ports for most platforms is usually a matter of implementing the macros, and then choosing a suitable level of loop unrolling to match the processor cache.

When fully unrolled, the x86 assembly code achieves very high performance on the AMD K8 series of processors. An “instructions per cycle” count close to 2 can be observed through 1024-bit multiplications. This means that, on average, more than one processor pipeline is actively processing opcodes. This is particularly significant due to the long delay of the single precision multiplication instruction.

Unfortunately, while this routine could be adapted to LibTomMath (using a more complicated right shift in step 3.4), it would not help as we still have to perform the same number of single precision multiplications. Readers are encouraged to investigate the TomsFastMath library on its own to see how far these optimizations can push performance.

5.2.4 Polynomial Basis Multiplication

To break the O(n^2) barrier in multiplication requires a completely different look at integer multiplication. In the following algorithms the use of polynomial basis representation for two integers a and b, as f(x) = Σ_{i=0}^{n} a_i · x^i and g(x) = Σ_{i=0}^{n} b_i · x^i respectively, is required. In this system, both f(x) and g(x) have n + 1 terms and are of the n'th degree.

The product a · b ≡ f(x)g(x) is the polynomial W(x) = Σ_{i=0}^{2n} w_i · x^i. The coefficients w_i will directly yield the desired product when β is substituted for x. The direct solution to solve for the 2n + 1 coefficients requires O(n^2) time and would in practice be slower than the Comba technique.

However, numerical analysis theory indicates that only 2n + 1 distinct points in W(x) are required to determine the values of the 2n + 1 unknown coefficients. This means by finding ζ_y = W(y) for 2n + 1 small values of y, the coefficients of W(x) can be found with Gaussian elimination. This technique is also occasionally referred to as the interpolation technique [5], since in effect an interpolation based on 2n + 1 points will yield a polynomial equivalent to W(x).

The coefficients of the polynomial W(x) are unknown, which makes finding W(y) for any value of y impossible directly. However, since W(x) = f(x)g(x), the equivalent ζ_y = f(y)g(y) can be used in its place. The benefit of this technique stems from the fact that f(y) and g(y) are much smaller than either a or b, respectively. As a result, finding the 2n + 1 relations required by multiplying f(y)g(y) involves multiplying integers that are much smaller than either of the inputs.

When you are picking points to gather relations, there are always three obvious points to choose: y = 0, 1, and ∞. The ζ_0 term is simply the product W(0) = w_0 = a_0 · b_0. The ζ_1 term is the product W(1) = f(1)g(1) = (Σ_{i=0}^{n} a_i)(Σ_{i=0}^{n} b_i). The third point, ζ_∞, is less obvious but rather simple to explain. The 2n + 1'th coefficient of W(x) is numerically equivalent to the most significant column in an integer multiplication. The point at ∞ is used symbolically to represent the most significant column, W(∞) = w_2n = a_n · b_n. Note that the points at y = 0 and ∞ yield the coefficients w_0 and w_2n directly.

If more points are required, they should be of small values and powers of two, such as 2^q and the related mirror points (2^q)^{2n} · ζ_{2^{−q}} for small values of q. The term “mirror point” stems from the fact that (2^q)^{2n} · ζ_{2^{−q}} can be calculated in the exact opposite fashion as ζ_{2^q}. For example, when n = 2 and q = 1, the following two equations are equivalent to the point ζ_2 and its mirror.

ζ_2 = f(2)g(2) = (4a_2 + 2a_1 + a_0)(4b_2 + 2b_1 + b_0)
16 · ζ_{1/2} = 4f(1/2) · 4g(1/2) = (a_2 + 2a_1 + 4a_0)(b_2 + 2b_1 + 4b_0)    (5.5)

Using such points will allow the values of f(y) and g(y) to be independently calculated using only left shifts. For example, when n = 2 the polynomial f(2^q) is equal to 2^q((2^q · a_2) + a_1) + a_0. This technique of evaluating a polynomial is known as Horner's method.
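
For instance, with n = 2, f(2^q) can be evaluated with two shifts and two additions. A sketch using LibTomMath's mp_mul_2d and mp_add (the helper function itself is hypothetical):

   /* r = f(2^q) = 2^q * ((2^q * a2) + a1) + a0 by Horner's method */
   int eval_f(mp_int *a0, mp_int *a1, mp_int *a2, int q, mp_int *r)
   {
      int err;
      if ((err = mp_mul_2d(a2, q, r)) != MP_OKAY) return err; /* r = a2 << q */
      if ((err = mp_add(r, a1, r))    != MP_OKAY) return err; /* r += a1     */
      if ((err = mp_mul_2d(r, q, r))  != MP_OKAY) return err; /* r <<= q     */
      return mp_add(r, a0, r);                                /* r += a0     */
   }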

As a general rule of the algorithm, when the inputs are split into n parts each, there are 2n − 1 multiplications. Each multiplication is of multiplicands that have n times fewer digits than the inputs. The asymptotic running time of this algorithm is O(k^{lg_n(2n−1)}) for k digit inputs (assuming they have the same number of digits). Figure 5.7 summarizes the exponents for various values of n.

Splits (n)    Exponent lg_n(2n − 1)
2             1.584962
3             1.464973
4             1.403677
5             1.365212
10            1.278753
100           1.149426
1000          1.100270
10000         1.075252

Figure 5.7 Asymptotic Running Time of Polynomial Basis Multiplication

At first, it may seem like a good idea to choose n = 1000, since the exponent is approximately 1.1. However, the overhead of solving for the 2001 terms of W(x) will certainly consume any savings the algorithm could offer for all but exceedingly large numbers.

Cutoff Point

The polynomial basis multiplication algorithms all require fewer single precision multiplications than a straight Comba approach. However, the algorithms incur an overhead (at the O(n) work level) since they require a system of equations to be solved. This makes the polynomial basis approach more costly to use with small inputs.

Let m represent the number of digits in the multiplicands (assume both multiplicands have the same number of digits). There exists a point y such that when m < y, the polynomial basis algorithms are more costly than Comba; when m = y, they are roughly the same cost; and when m > y, the Comba methods are slower than the polynomial basis algorithms.

The exact location of y depends on several key architectural elements of the computer platform in question.

1. The ratio of clock cycles for single precision multiplication versus other simpler operations such as addition, shifting, etc. For example on the AMD Athlon the ratio is roughly 17 : 1, while on the Intel P4 it is 29 : 1. The higher the ratio in favor of multiplication, the lower the cutoff point y will be.

2. The complexity of the linear system of equations (for the coefficients of W(x)): generally speaking, as the number of splits grows, the complexity of solving the system grows substantially. Ideally, solving the system will only involve additions, subtractions, and shifts of integers. This directly reflects on the ratio previously mentioned.

3. To a lesser extent, memory bandwidth and function call overhead affect the location of y. Provided the values and code are in the processor cache, this is less of an influence over the cutoff point.

A clean cutoff point separation occurs when a point y is found such that all of the cutoff point conditions are met. For example, if the point is set too low, there will be values of m such that m > y for which the Comba method is still faster. Finding the cutoff points is fairly simple when a high-resolution timer is available.

5.2.5 Karatsuba Multiplication

Karatsuba [5] multiplication, when originally proposed in 1962, was among the first set of algorithms to break the O(n^2) barrier for general-purpose multiplication.

Given two polynomial basis representations f(x) = ax + b and g(x) = cx + d, Karatsuba proved with light algebra [5] that the following polynomial is equivalent to the multiplication of the two integers the polynomials represent.

f(x) · g(x) = acx^2 + ((a + b)(c + d) − (ac + bd))x + bd    (5.6)

Using the observation that ac and bd can be re-used, only three half-sized multiplications are required to produce the product. Applying this algorithm recursively, the work factor becomes O(n^{lg(3)}), which is substantially better than the O(n^2) work factor of the Comba technique. It turns out that what Karatsuba did not know, or at least did not publish, is that this is simply polynomial basis multiplication with the points ζ_0, ζ_∞, and ζ_1. Consider the resultant system of equations.

ζ_0 = w_0
ζ_1 = w_2 + w_1 + w_0
ζ_∞ = w_2

By subtracting the first and last equations from the equation in the middle, the term w_1 can be isolated and all three coefficients solved for. The simplicity of this system of equations has made Karatsuba fairly popular. In fact, the cutoff point is often fairly low³, making it an ideal algorithm to speed up certain public key cryptosystems such as RSA and Diffie-Hellman.
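
Concretely, with ζ_0 = bd, ζ_∞ = ac, and ζ_1 = (a + b)(c + d):

w_1 = ζ_1 − ζ_0 − ζ_∞ = (a + b)(c + d) − bd − ac = ad + bc

which is exactly the middle coefficient of equation (5.6).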

Algorithm mp_karatsuba_mul. This algorithm computes the unsigned product of two inputs using the Karatsuba multiplication algorithm. It is loosely based on the description from Knuth [1, pp. 294-295] (Figure 5.8).


Figure 5.8 Algorithm mp_karatsuba_mul

To split the two inputs into their respective halves, a suitable radix point must be chosen. The radix point chosen must be used for both of the inputs, meaning that it must be smaller than the smaller input. Step 3 chooses the radix point B as half of the smaller input's used count. After the radix point is chosen, the inputs are split into lower and upper halves. Steps 4 and 5 compute the lower halves; steps 6 and 7 compute the upper halves.

After the halves have been computed, the three intermediate half-size products must be computed. Steps 8 and 9 compute the trivial products x0 · y0 and x1 · y1. The mp_int x0 is used as a temporary variable after x1 + x0 has been computed. By using x0 instead of an additional temporary variable, the algorithm avoids an additional memory allocation operation.

The remaining steps 13 through 18 compute the Karatsuba polynomial through a variety of digit shifting and addition operations.

[C source listing of mp_karatsuba_mul omitted]

The new coding element in this routine, not seen in previous routines, is the use of goto statements. The conventional wisdom is that goto statements should be avoided. This is generally true; however, when every single function call can fail, it makes sense to handle error recovery with a single piece of code. Lines 62 to 76 handle initializing all of the temporary variables required. Note how each of the if statements goes to a different label in case of failure, which allows the routine to correctly free only the temporaries that have been successfully allocated so far.

The temporary variables are all initialized using the mp_init_size routine, since they are expected to be large. This saves the additional reallocation that would otherwise have been necessary. Moreover, x0, x1, y0, and y1 have to be able to hold at least their respective number of digits for the next section of code.

The first algebraic portion of the algorithm is to split the two inputs into their halves. However, instead of using mp_mod_2d and mp_rshd to extract the halves, the respective code has been placed inline within the body of the function. To initialize the halves, the used and sign members are copied first. The for loop on Line 96 copies the lower halves; since they are both the same magnitude, it is simpler to calculate both lower halves in a single loop. The for loops on lines 102 and 107 calculate the upper halves x1 and y1, respectively.

By inlining the calculation of the halves, the Karatsuba multiplier has a slightly lower overhead and can be used for smaller magnitude inputs.

When Line 151 is reached, the algorithm has completed successfully. The “error status” variable err is set to MP_OKAY so the same code that handles errors can be used to clear the temporary variables and return.

5.2.6 Toom-Cook 3-Way Multiplication

The 3-Way multiplication scheme, usually known simply as Toom–Cook, is actually one member of the general Toom–Cook family of multiplication algorithms [1, pp. 296–299]. In the general approach, multiplication is essentially linearized by increasing the number of ways as the size of the inputs increases. The 3-Way approach is the polynomial basis algorithm for n = 2, except that the points are chosen such that the ζ values are easy to compute and the resulting system of equations easy to reduce. Here, the points ζ_0, 16 · ζ_{1/2}, ζ_1, ζ_2, and ζ_∞ make up the five required points to solve for the coefficients of W(x).

With the five relations Toom-Cook specifies, the following system of equations is formed.

ζ_0 = w_0
16 · ζ_{1/2} = w_4 + 2w_3 + 4w_2 + 8w_1 + 16w_0
ζ_1 = w_4 + w_3 + w_2 + w_1 + w_0
ζ_2 = 16w_4 + 8w_3 + 4w_2 + 2w_1 + w_0
ζ_∞ = w_4

A trivial solution to this matrix requires 12 subtractions, two multiplications by a small power of two, two divisions by a small power of two, two divisions by three, and one multiplication by three. All 19 of these sub-operations require less than quadratic time, meaning that the algorithm can be faster than a baseline multiplication. However, the greater complexity of this algorithm places the cutoff point (TOOM_MUL_CUTOFF) at which Toom-Cook becomes more efficient much higher than the Karatsuba cutoff point.

Algorithm mp_toom_mul. This algorithm computes the product of two mp_int variables a and b using the Toom-Cook approach. Compared to the Karatsuba multiplication, this algorithm has a lower asymptotic running time of approximately O(n^{1.464}), but at an obvious cost in overhead. In this description, several statements have been compounded to save space. The intention is that the statements are executed from left to right across any given step (Figure 5.9).


Figure 5.9 Algorithm mp_toom_mul

The two inputs a and b are first split into three k-digit integers a_0, a_1, a_2 and b_0, b_1, b_2, respectively. From these smaller integers the coefficients of the polynomial basis representations f(x) and g(x) are known and can be used to find the relations required.

The first two relations, w_0 and w_4, are the points ζ_0 and ζ_∞, respectively. The relations w_1, w_2, and w_3 correspond to the points 16 · ζ_{1/2}, ζ_2, and ζ_1, respectively. These are found using logical shifts to independently find f(y) and g(y), which significantly speeds up the algorithm.

After the five relations w_0, w_1, …, w_4 have been computed, the system they represent must be solved in order for the unknown coefficients w_1, w_2, and w_3 to be isolated. Steps 18 through 25 perform the system reduction required, as previously described. Each step of the reduction represents the comparable matrix operation that would be performed had this been done by hand. For example, step 18 indicates that row 1 must be subtracted from row 4, and simultaneously row 0 subtracted from row 3.
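
One way to see the reduction (not necessarily the exact step ordering of the pseudo-code): with w_0 = ζ_0 and w_4 = ζ_∞ known, define

A = 16 · ζ_{1/2} − 16w_0 − w_4 = 8w_1 + 4w_2 + 2w_3
B = ζ_1 − w_0 − w_4 = w_1 + w_2 + w_3
C = ζ_2 − w_0 − 16w_4 = 2w_1 + 4w_2 + 8w_3

Then w_2 = 5B − (A + C)/2 and w_1 − w_3 = (A − C)/6, while w_1 + w_3 = B − w_2, from which w_1 and w_3 follow with one final halving. Note the divisions by small powers of two and by three promised in the operation count above.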

Once the coefficients have been isolated, the polynomial W(x) = Σ_{i=0}^{4} w_i · x^i is known. By substituting β^k for x, the integer result a · b is produced.

[C source listing of mp_toom_mul omitted]

The first obvious thing to note is that this algorithm is complicated. The complexity is worth it if you are multiplying very large numbers. For example, a 10,000 digit multiplication takes approximately 99,282,205 fewer single precision multiplications with Toom-Cook than a Comba or baseline approach (a savings of more than 99%). For most “crypto” sized numbers this algorithm is not practical, as Karatsuba has a much lower cutoff point.

First, we split a and b into three roughly equal portions. This has been accomplished (lines 41 to 70) with combinations of mp_rshd() and mp_mod_2d() function calls. At this point, a = a2 · β^{2k} + a1 · β^k + a0, and similarly for b.

Next, we compute the five points w0, w1, w2, w3, and w4. Recall that w0 and w4 can be computed directly from the portions, so we get those out of the way first (lines 73 and 78). Next, we compute w1, w2, and w3 using Horner's method.

After this point, we solve for the actual values of w1, w2, and w3 by reducing the 5 × 5 system, which is relatively straightforward.

5.2.7 Signed Multiplication

Now that algorithms to handle multiplications of every useful dimension have been developed, a rather simple finishing touch is required. So far, all of the multiplication algorithms have been unsigned multiplications, which leaves only a signed multiplication algorithm to be established.

Algorithm mp_mul. This algorithm performs the signed multiplication of two inputs (Figure 5.10). It will make use of any of the three unsigned multiplication algorithms available when the input is of appropriate size. The sign of the result is not set until the end of the algorithm, since algorithm s_mp_mul_digs (Figure 5.1) will clear it.


Figure 5.10 Algorithm mp_mul

[C source listing of mp_mul omitted]

The implementation is rather simplistic and is not particularly noteworthy. Line 22 computes the sign of the result using the “?” operator from the C programming language. Line 48 computes δ using the fact that 1 << k is equal to 2^k.
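
The sign computation is the familiar conditional one-liner; a sketch of the pattern using LibTomMath's sign constants (variable names assumed):

   /* the product is negative only when exactly one input is negative */
   int neg = (a->sign == b->sign) ? MP_ZPOS : MP_NEG;
   /* ... unsigned multiplication of |a| and |b| into c ... */
   c->sign = (c->used > 0) ? neg : MP_ZPOS;  /* never give zero a sign */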

5.3 Squaring

Squaring is a special case of multiplication where both multiplicands are equal. At first, it may seem like there is no significant optimization available, but in fact there is. Consider the multiplication of 576 against 241. In total there will be nine single precision multiplications performed: 1 · 6, 1 · 7, 1 · 5, 4 · 6, 4 · 7, 4 · 5, 2 · 6, 2 · 7, and 2 · 5. Now consider the multiplication of 123 against 123. The nine products are 3 · 3, 3 · 2, 3 · 1, 2 · 3, 2 · 2, 2 · 1, 1 · 3, 1 · 2, and 1 · 1. On closer inspection some of the products are equivalent; for example, 3 · 2 = 2 · 3 and 3 · 1 = 1 · 3.

For any n-digit input, there are (n^2 + n)/2 possible unique single precision multiplications required, compared to the n^2 required for multiplication. Figure 5.11 gives an example of the operations required.


Figure 5.11 Squaring Optimization Diagram

Starting from zero and numbering the columns from right to left, you will see a very simple pattern become obvious. For the purposes of this discussion, let x represent the number being squared. The first observation is that in row k, the 2k'th column of the product has a (x_k)^2 term in it.

The second observation is that every column j in row k where j ≠ 2k is part of a double product. Every non-square term of a column will appear twice, hence the name “double product.” Every odd column is made up entirely of double products. In fact, every column is made up of double products and at most one square (see the Exercises section).

The third and final observation is that for row k the first unique non-square term (that is, one that hasn't already appeared in an earlier row) occurs at column 2k + 1. For example, in row 1 of the previous squaring, column one is part of the double product with column one from row zero. Column two of row one is a square, and column three is the first unique column.

5.3.1 The Baseline Squaring Algorithm

The baseline squaring algorithm is meant to be a catch-all squaring algorithm. It will handle any of the input sizes that the faster routines will not handle.

Algorithm s_mp_sqr. This algorithm computes the square of an input using the three observations on squaring. It is based fairly faithfully on algorithm 14.16 of HAC [2, pp. 596–597]. Similar to algorithm s_mp_mul_digs, a temporary mp_int is allocated to hold the result of the squaring. This allows the destination mp_int to be the same as the source mp_int (Figure 5.12).


Figure 5.12 Algorithm s_mp_sqr

The outer loop of this algorithm begins on step 4. It is best to think of the outer loop as walking down the rows of the partial results, while the inner loop computes the columns of the partial result. Steps 4.1 and 4.2 compute the square term for each row, and steps 4.3 and 4.4 propagate the carry and compute the double products.

The requirement that an mp_word be able to represent the range 0 ≤ x < 2β^2 arises from this very algorithm. The product a_ix · a_iy will lie in the range 0 ≤ x ≤ β^2 − 2β + 1, which is obviously less than β^2, meaning that when it is multiplied by two, it can still be properly represented by an mp_word.

Similar to algorithm s_mp_mul_digs, after every pass of the inner loop, the destination is correctly set to the sum of all of the partial results calculated so far. This involves expensive carry propagation, which will be eliminated in the next algorithm.
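
Before moving on, here is a compact C sketch of the three observations in action, using the conventions of the earlier multiplication sketch (t must hold 2 · used digits; the helper is illustrative, not the library routine):

   static void sqr_digs(const mp_digit *a, int used, mp_digit *t)
   {
      for (int ix = 0; ix < 2 * used; ix++) t[ix] = 0;

      for (int ix = 0; ix < used; ix++) {
         /* the square term of row ix lands in column 2*ix */
         mp_word r = (mp_word)t[2 * ix] + (mp_word)a[ix] * (mp_word)a[ix];
         t[2 * ix] = (mp_digit)(r & MP_MASK);
         mp_word u = r >> DIGIT_BIT;

         /* double products: iy starts at ix + 1, not 0 */
         for (int iy = ix + 1; iy < used; iy++) {
            mp_word p = (mp_word)a[ix] * (mp_word)a[iy];
            r = (mp_word)t[ix + iy] + p + p + u;  /* doubled via two additions */
            t[ix + iy] = (mp_digit)(r & MP_MASK);
            u = r >> DIGIT_BIT;
         }
         /* forward any leftover carry; a^2 fits in 2*used digits,
          * so this cannot run past t[2*used - 1] */
         for (int iy = ix + used; u != 0; iy++) {
            r = (mp_word)t[iy] + u;
            t[iy] = (mp_digit)(r & MP_MASK);
            u = r >> DIGIT_BIT;
         }
      }
   }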

[C source listing of s_mp_sqr omitted]

Inside the outer loop (line 34) the square term is calculated on Line 37. The carry (Line 44) has been extracted from the mp_word accumulator using a right shift. Aliases for a_ix and t_{ix+iy} are initialized (lines 47 and 50) to simplify the inner loop. The doubling is performed using two additions (Line 59), since it is usually faster than shifting, if not at least as fast.

The important observation is that the inner loop does not begin at iy = 0 as it does for multiplication. As such, the inner loops get progressively shorter as the algorithm proceeds. This is what leads to the savings compared to using a multiplication to square a number.

5.3.2 Faster Squaring by the “Comba” Method

A major drawback to the baseline method is the requirement for single precision shifting inside the O(n2) nested loop. Squaring has an additional drawback in that it must double the product inside the inner loop as well. As for multiplication, the Comba technique can be used to eliminate these performance hazards.

The first obvious solution is to make an array of mp_words that will hold all of the columns. This will indeed eliminate all of the carry propagation operations from the inner loop. However, the inner product must still be doubled O(n^2) times. The solution stems from the simple fact that 2a + 2b + 2c = 2(a + b + c). That is, the sum of all of the double products is equal to double the sum of all the products. For example, ab + ba + ac + ca = 2ab + 2ac = 2(ab + ac).

However, we cannot simply double all of the columns, since the squares appear only once per row. The most practical solution is to have two mp_word arrays: one to hold the squares and one to hold the double products. With both arrays, the doubling and carry propagation can be moved to an O(n) work level outside the O(n^2) level. In this case, we have an even simpler solution in mind.

Algorithm fast_s_mp_sqr. This algorithm computes the square of an input using the Comba technique. It is designed to be a replacement for algorithm s_mp_sqr when the number of input digits is less than MP_WARRAY and less than δ/2. This algorithm is very similar to the Comba multiplier, except with a few key differences we shall make note of (Figure 5.13).


Figure 5.13 Algorithm fast_s_mp_sqr

First, we have an accumulator and a carry variable, _Ŵ and Ŵ1, respectively. This is because the inner loop products are to be doubled; if we had mixed the previous carry in before doubling, we would be doubling too much. Next, we place an additional MIN condition on iy (step 5.5) to prevent overlapping digits. For example, a_3 · a_5 is equal to a_5 · a_3; in the multiplication case both orderings would be visited, but since we double the sum of the products just outside the inner loop, we have to avoid computing each product twice. This is also a good thing, since we perform fewer multiplications and the routine ends up being faster.

The last difference is the addition of the “square” term outside the inner loop (step 5.8). We add the square only to even-numbered output columns, and it is the square of the term at the ix/2 position.
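
A sketch of the resulting column loop, following the same idea (the inner sum of distinct pairs is doubled before being merged into the running accumulator; conventions as in the earlier sketches):

   mp_word _W = 0;
   for (int ix = 0; ix < 2 * used - 1; ix++) {
      int ty = (used - 1 < ix) ? used - 1 : ix;
      int tx = ix - ty;
      int iy = (used - tx < ty + 1) ? used - tx : ty + 1;
      /* count each unordered pair a[tx]*a[ty] only once */
      if (iy > ((ty - tx) + 1) / 2) iy = ((ty - tx) + 1) / 2;

      mp_word _W1 = 0;                       /* sum of this column's pairs  */
      for (int iz = 0; iz < iy; iz++) {
         _W1 += (mp_word)a[tx + iz] * (mp_word)a[ty - iz];
      }
      _W += _W1 + _W1;                       /* double the off-diagonal sum */
      if ((ix & 1) == 0) {                   /* even column: add the square */
         _W += (mp_word)a[ix / 2] * (mp_word)a[ix / 2];
      }
      W[ix] = (mp_digit)(_W & MP_MASK);
      _W >>= DIGIT_BIT;                      /* carry to the next column    */
   }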

[C source listing of fast_s_mp_sqr omitted]

This implementation is essentially a copy of Comba multiplication with the appropriate changes added to make it faster for the special case of squaring. The innermost loop (lines 72 to 74) computes the products the same way the multiplication routine does. The sum of the products is doubled separately (Line 77) outside the innermost loop. The square term is added if ix is even (lines 80 to 82), indicating a column with a square term.

5.3.3 Even Faster Squaring

Just like the case of algorithm fast_mult (Section 5.2.3), squaring can be performed using the full precision of single precision variables. The algorithm borrows much from the one in Figure 5.13, except that in this case we accumulate into a triple-precision accumulator. Similarly, loop unrolling can boost the performance of this operation significantly.

The TomsFastMath library incorporates fast squaring that is a direct port of algorithm fast_s_mp_sqr. Readers are encouraged to research this project to learn more.

5.3.4 Polynomial Basis Squaring

The same algorithm that performs optimal polynomial basis multiplication can be used to perform polynomial basis squaring. The minor exception is that ζ_y = f(y)g(y) is actually equivalent to ζ_y = f(y)^2, since f(y) = g(y). Instead of performing 2n + 1 multiplications to find the ζ relations, squaring operations are performed instead.

5.3.5 Karatsuba Squaring

Let f(x) = ax + b represent the polynomial basis representation of a number to square, and let h(x) = (f(x))^2 represent the square of the polynomial. The Karatsuba equation can be modified to square a number with the following equation.

h(x) = a^2·x^2 + ((a + b)^2 − (a^2 + b^2))·x + b^2    (5.7)

Upon closer inspection, this equation only requires the calculation of three half-sized squares: a^2, b^2, and (a + b)^2. As in Karatsuba multiplication, this algorithm can be applied recursively on the input and will achieve an asymptotic running time of O(n^{lg(3)}).

If the asymptotic times of Karatsuba squaring and multiplication are the same, why not simply use the multiplication algorithm instead? The answer arises from the cutoff point for squaring. As in multiplication, there exists a cutoff point at which the time required for a Comba-based squaring and a Karatsuba-based squaring meet. Due to the overhead inherent in the Karatsuba method, the cutoff point is fairly high. For example, on an AMD Athlon XP processor with β = 2^28, the cutoff point is around 127 digits.

Consider squaring a 200-digit number with this technique. It will be split into two 100-digit halves that are subsequently squared. The 100-digit halves will not be squared using Karatsuba, but instead using the faster Comba–based squaring algorithm. If Karatsuba multiplication were used instead, the 100-digit numbers would be squared with a slower Comba–based multiplication.

Algorithm mp_karatsuba_sqr. This algorithm computes the square of an input a using the Karatsuba technique. It is very similar to the Karatsuba–based multiplication algorithm with the exception that the three half-size multiplications have been replaced with three half-size squarings (Figure 5.14).


Figure 5.14 Algorithm mp_karatsuba_sqr

The radix point for squaring is simply placed exactly in the middle of the digits when the input has an odd number of digits; otherwise, it is placed just below the middle. Steps 3, 4, and 5 compute the two halves required using B as the radix point. The first two squares in steps 6 and 7 are straightforward, while the last square is of a more compact form.

By expanding (x1 + x0)^2, the x1^2 and x0^2 terms in the middle disappear; that is, (x1 + x0)^2 − (x1^2 + x0^2) = 2 · x1 · x0. Now, if 5n single precision additions and a squaring of n digits is faster than multiplying two n-digit numbers and doubling, then this method is faster. Assuming no further recursions occur, the difference can be estimated with the following inequality.

Let p represent the cost of a single precision addition and q the cost of a single precision multiplication, both in terms of time⁴.

5pn + q(n^2 + n)/2 < pn + qn^2    (5.8)

For example, on an AMD Athlon XP processor, p = 1/3 and q = 6. This implies that the following inequality should hold.

(5/3)n + 3(n^2 + n) < (1/3)n + 6n^2

which simplifies to (13/3)n < 3n^2, i.e., n > 13/9.

This results in a cutoff point around n = 2. As a consequence, it is actually faster to compute the middle term the “long way” on processors where multiplication is substantially slower⁵ than simpler operations such as addition.

[C source listing of mp_karatsuba_sqr omitted]

This implementation is largely based on the implementation of algorithm mp_karatsuba_mul. It uses the same inline style to copy and shift the input into the two halves. The loop from Line 54 to Line 70 has been modified since only one input exists. The used count of both x0 and x1 is fixed up, and x0 is clamped before the calculations begin. At this point, x1 and x0 are valid equivalents to the respective halves as if mp_rshd and mp_mod_2d had been used.

By inlining the copy and shift operations, the cutoff point for Karatsuba squaring can be lowered. On the Athlon, the cutoff point is exactly at the point where Comba squaring can no longer be used (128 digits). On slower processors such as the Intel P4, it is actually below the Comba limit (at 110 digits).

This routine uses the same error trap coding style as mp_karatsuba_mul. As the temporary variables are initialized, errors are redirected to the error trap higher up. If the algorithm completes without error, the error code is set to MP_OKAY, and the mp_clears are executed normally.

5.3.6 Toom-Cook Squaring

The Toom-Cook squaring algorithm mp_toom_sqr is heavily based on the algorithm mp_toom_mul, with the exception that squarings are used instead of multiplication to find the five relations. Readers are encouraged to read the description of the latter algorithm and try to derive their own Toom-Cook squaring algorithm.

5.3.7 High Level Squaring

Algorithm mp_sqr. This algorithm computes the square of the input using one of four different algorithms. If the input is very large and has at least TOOM_SQR_CUTOFF or KARATSUBA_SQR_CUTOFF digits, then either the Toom-Cook or the Karatsuba squaring algorithm is used. If neither of the polynomial basis algorithms applies, then either the Comba or the baseline algorithm is used (Figure 5.15).


Figure 5.15 Algorithm mp_sqr

[C source listing of mp_sqr omitted]

Exercises

[3] Devise an efficient algorithm for selection of the radix point to handle inputs that have different numbers of digits in Karatsuba multiplication.

[2] In section 5.3 we stated that every column of a squaring is made up of double products and at most one square. Prove this statement.

[3] Prove the equation for Karatsuba squaring.

[1] Prove that Karatsuba squaring requires O(n^{lg(3)}) time.

[3] Implement a threaded version of Comba multiplication (and squaring) where you compute subsets of the columns in each thread. Determine a cutoff point where it is effective, and add the logic to mp_mul() and mp_sqr().

[4] Same as the previous exercise, but also modify the Karatsuba and Toom-Cook multipliers. You must increase the throughput of mp_exptmod() for random odd moduli in the range 512 … 4096 bits significantly (> 2×) to complete this challenge.


¹ Roughly speaking, a modular exponentiation will spend about 40% of the time performing modular reductions, 35% of the time performing squaring, and 25% of the time performing multiplications.

² See http://tfm.libtomcrypt.com

³ With LibTomMath 0.18 it is 70 and 109 digits for the Intel P4 and AMD Athlon, respectively.

⁴ Or machine clock cycles.

⁵ On the Athlon there is a 1:17 ratio between clock cycles for addition and multiplication. On the Intel P4 processor this ratio is 1:29, making this method even more beneficial. The only common exception is the ARMv4 processor, which has a ratio of 1:7.
