In Practice

The ability to deduce tangible properties of operands for arithmetic instructions based solely on timing information is the most obvious, effective, and interesting vector for performing computational complexity attacks. Other techniques, such as cache hit and miss timing, usually require considerably more detailed analysis and reveal less information with each measurement.

It is clear that this problem would, to a degree, affect many software algorithms, such as large-number arithmetic libraries commonly used in cryptographic applications. But software algorithms and theory aside, a couple of important questions remain: how real is the execution time dependency on the hardware level, and how can it be measured?

An example is well within reach. At least a portion of the Intel IA32 architecture exhibits this behavior. The 80386 Programmer’s Reference Manual[55] describes a signed integer multiplication opcode, denoted by the mnemonic IMUL. The opcode, in its basic form, multiplies the value stored in the accumulator (a multipurpose working register known as [E]AX on this platform) by a value stored in another register. The result is then stored back in the accumulator.
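
For orientation, the same operation can be written out in AT&T assembler syntax, the notation used by the inline-assembly listing later in this section. The operand values below are arbitrary and serve only to illustrate the register roles:

  movl $100, %eax      # multiplicand in the accumulator
  movl $2,   %edx      # multiplier in another register
  imul %edx, %eax      # EAX = EAX * EDX = 200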

The documentation further explains:

The 80386 uses an early-out multiply algorithm. The actual number of clocks depends on the position of the most significant bit in the optimizing multiplier [...]. The optimization occurs for positive and negative values. Because of the early-out algorithm, clock counts given are minimum to maximum. To calculate the actual clocks, use the following formula:

Actual clock = if m <> 0 then max(ceiling(log2(m)), 3) + 6 clocks

Actual clock = if m = 0 then 9 clocks

Although this may look cryptic, its meaning is simple: the processor optimizes multiplication based on the value of the multiplier. Instead of stepping through every bit of the multiplier, it stops as soon as only the leading zeros of the operand remain.
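
For illustration, the quoted formula translates almost directly into C. The sketch below (with a made-up helper name, imul_clocks_386) simply mirrors the formula as printed, rather than the measured behavior of any particular chip; it must be linked against the math library:

#include <math.h>
#include <stdio.h>

/* 80386 IMUL clock count, transcribed from the quoted formula;
   m is the (absolute) value of the multiplier. */
static int imul_clocks_386(unsigned long m)
{
    if (m == 0)
        return 9;                              /* m = 0 case        */
    int bits = (int)ceil(log2((double)m));     /* ceiling(log2(m))  */
    return (bits > 3 ? bits : 3) + 6;          /* max(..., 3) + 6   */
}

int main(void)
{
    unsigned long samples[] = { 0, 2, 6, 100, 200 };
    for (int i = 0; i < 5; i++)
        printf("multiplier %3lu -> %d clocks\n",
               samples[i], imul_clocks_386(samples[i]));
    return 0;
}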

Early-Out Optimization

To understand the relevance of this tactic to integer multiplication, imagine a traditional iterative multiplication method taught in schools, except this time in binary. A hypothetical “dumb” implementation of this algorithm performs the following set of operations.

00000000 00000000 11001010 11111110     Multiplicand (P)
* 00000000 00000000 00000000 00000110     Multiplier (R)
 -------------------------------------
  00000000 00000000 00000000 00000000     P * R[0] = P * 0
  00000000 00000001 10010101 1111110      P * R[1] = P * 1
  00000000 00000011 00101011 111110       P * R[2] = P * 1
  00000000 00000000 00000000 00000        P * R[3] = P * 0
  00000000 00000000 00000000 0000         P * R[4] = P * 0
  00000000 00000000 00000000 000          P * R[5] = P * 0
  ...
+ 0                                       P * R[31] = P * 0
 -------------------------------------
  00000000 00000100 11000001 11110100

It should be obvious that a large number of these operations are completely unnecessary: once nothing but zeros remains in the higher-order bits of the multiplier, continuing the operation is simply pointless. A more reasonable approach is to skip them:

00000000 00000000 11001010 11111110      Multiplicand (P)
* 00000000 00000000 00000000 00000110      Multiplier (R) - optimizing
 -------------------------------------
  00000000 00000000 00000000 00000000      P * R[0] = P * 0
  00000000 00000001 10010101 1111110       P * R[1] = P * 1
+ 00000000 00000011 00101011 111110        P * R[2] = P * 1
  ...Bail out − ignore leading zeros of R!
 -------------------------------------
  00000000 00000100 11000001 11110100

And this is, in essence, the nature of the early-out optimization that Intel deployed.
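
The same idea is easy to express in software. The following sketch (function names are made up for illustration) contrasts a fixed 32-step shift-and-add loop with an early-out variant that bails out as soon as no set bits remain in the multiplier:

#include <stdint.h>

/* Naive shift-and-add: always walks all 32 bits of the multiplier. */
uint64_t mul_naive(uint32_t p, uint32_t r)
{
    uint64_t acc = 0;
    for (int i = 0; i < 32; i++)
        if (r & (1u << i))
            acc += (uint64_t)p << i;
    return acc;
}

/* Early-out variant: stops once only zeros remain in r, so the number
   of iterations tracks the position of the most significant set bit
   of the multiplier. */
uint64_t mul_early_out(uint32_t p, uint32_t r)
{
    uint64_t acc = 0;
    uint64_t shifted = p;
    while (r) {                 /* bail out when no set bits remain */
        if (r & 1)
            acc += shifted;
        shifted <<= 1;
        r >>= 1;
    }
    return acc;
}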

Note

This optimization makes multiplication asymmetric in time: 2*100 will compute more slowly than 100*2 (!), even though the result is obviously the same.

With early-out optimization, Intel processors require a variable number of cycles to perform multiplication, and the time is directly related to the position of the most significant bit set in the second operand, the multiplier. By applying the clock-count formula provided in the documentation, it is possible to determine the correlation between the multiplier value and IMUL time, as shown here:

Multiplier Value Range             Cycles to Complete
0 – 7                              9
8 – 15                             10
16 – 31                            11
32 – 63                            12
64 – 127                           13
128 – 255                          14
256 – 1,023                        15
1,024 – 2,047                      16
2,048 – 4,095                      17
4,096 – 8,191                      18
8,192 – 16,383                     19
16,384 – 32,767                    20
32,768 – 65,535                    21
65,536 – 131,071                   22
131,072 – 262,143                  23
262,144 – 524,287                  24
524,288 – 1,048,575                25
1,048,576 – 2,097,151              26
2,097,152 – 4,194,303              27
4,194,304 – 8,388,607              28
8,388,608 – 16,777,215             29
16,777,216 – 33,554,431            30
33,554,432 – 67,108,863            31
67,108,864 – 134,217,727           32
134,217,728 – 268,435,455          33
268,435,456 – 536,870,911          34
536,870,912 – 1,073,741,823        35
1,073,741,824 – 2,147,483,647      36

A similar dependency exists for negative multiplier values.

Working Code—Do It Yourself

The following code listing shows a practical implementation in C for Unix-type systems that can be used to confirm and measure differences in timing patterns. The program is invoked with two parameters: the multiplicand (which should not affect performance in any way) and the multiplier (presumably subject to early-out optimization and hence affecting the speed of the entire operation). The program performs 256 tests of 500 consecutive multiplications with the chosen parameters and returns the shortest measured time.

We run 256 tests and select the best result in order to compensate for cases in which execution is interrupted by the system for some period of time, a condition fairly common in multitasking environments. Although a single test can be affected by such an event, at least some of the tests in a rapid sequence of short runs can be expected to complete without interruption.

The code uses the system clock to measure execution time in microseconds.

Note

Several of today’s Intel chips feature a precise timing mechanism available through the RDTSC opcode. This method of accessing the internal clock cycle counter is not available on older platforms, and so we will not rely on it.
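
For reference, on processors that do support it, the counter can be read with a short inline-assembly helper along the following lines (a sketch with a made-up name, not used by the program below):

#include <stdint.h>

/* Read the CPU's time-stamp counter (requires RDTSC support). */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    asm volatile ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}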

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <limits.h>

int main(int argc,char** argv) {

  int shortest = INT_MAX;
  int i,p,r;

  if (argc != 3) {
    printf("Usage: %s multiplicand multiplier\n",argv[0]);
    exit(1);
  }

  p=atoi(argv[1]);
  r=atoi(argv[2]);

  for (i=0;i<256;i++) {
    int ct;
    struct timeval s;
    struct timeval e;

    gettimeofday(&s,NULL);

    asm(

      "  movl $500,%%ecx    \n"   /* Loop repetition counter (R) */
      "imul_loop:           \n"
      "  movl %%esi,%%eax   \n"
      "  movl %%edi,%%edx   \n"
      "  imul %%edx,%%eax   \n"   /* Comment out for first run */
      "  loop imul_loop     \n"
        :
        : "S" (p), "D" (r)
        : "ax", "cx", "dx", "cc");

    gettimeofday(&e,NULL);

    ct = ( e.tv_usec - s.tv_usec ) +
         ( e.tv_sec - s.tv_sec ) * 1000000;

    if (ct < shortest) shortest = ct;

  }

  printf("T[%d,%d] = %d usec
",p,r,shortest);
  return 0;
}

By compiling the code with the IMUL instruction initially commented out and invoking the program with arbitrary parameters, we can estimate the overhead of the timing code itself (T_idle). If the value falls outside the range of 10 to 100 microseconds (high enough to provide a fine-grained readout, but low enough to maximize the chance of not being interrupted by the operating system), readjust the loop repetition counter R, which is set to 500 by default.

After restoring the IMUL instruction, recompiling, and running the program with a chosen multiplicand D and repetition counter R, it is possible to use the returned time approximation T_D,R to estimate the number of CPU cycles spent on a single IMUL operation (C_D,R), as long as the operating frequency of the processor in megahertz (F_MHz) is known:

C_D,R = (T_D,R − T_idle) · F_MHz / R
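
For example, with purely hypothetical figures, a 200 MHz processor, R = 500, a measured T_D,R of 60 microseconds, and a T_idle of 20 microseconds would give C_D,R ≈ (60 − 20) · 200 / 500 = 16 cycles per multiplication.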

As expected, pipelining and branch predictors on newer and more advanced chips will kick in and skew the result slightly, but a good estimate can be made.

Note

On newer Intel processors, the time needed to complete multiplication is already constant.
