5.4. DOTLOOP: Using a Counted Loop

In the previous chapter, we presented the DOTPROD program that computed the dot product of two 3-component vectors without using a loop. A similar program, but with greater generality, would compute the dot product of two N-component vectors. Such a program, with the dimensionality N as a symbolic parameter at the top of the listing, is presented in Figure 5-1.

This more general program uses additional registers: r17 for the dimensionality and loop control, r14 as an address pointer for vector V, r15 as an address pointer for vector W, and r16 as an address pointer for the product P. While there is a little more overhead between first and top to get everything set up, the heart of the algorithm is simplified because the multiply, sign-extend, and add sequence occurs only once (inside the loop). Notice that the V and W address pointers must each be incremented by two bytes in order to advance to the next 2-byte component each time through the loop, while the copy of the dimensionality (i.e., the number of components) must be decremented by one (by adding –1 to r17). The pointer for P is not incremented, because it is used only once, after the loop, to store the final sum.

Figure 5-1. DOTLOOP: An illustration of a simple down-counted loop
// DOTLOOP       Scalar Product of N-vectors

// This program will compute the scalar product
// of two multielement vectors V and W.
         N       = 3              // N = dimensionality
         .data                    // Declare storage
         .align  8                // Desired alignment
P:       .skip   8                // Space for product
V:       data2   -1,+3,+5         // V1, V2, V3, etc.
W:       data2   -2,-4,+6         // W1, W2, W3, etc.
         .text                    // Section for code
         .align  32               // Desired alignment
         .global main             // These three lines
         .proc   main             //  mark the mandatory
main:                             //   'main' program entry
        .body                     // Now we really begin...
first:  movl     r14 = V;;        // Pointer for V
        movl     r15 = W;;        // Pointer for W
        movl     r16 = P;;        // Pointer for P
        mov      r17 = N;;        // Number of V components
        mov      r20 = 0;;        // r20 = running sum
top:    ld2      r21 = [r14],2;;  // Get Vi; bump pointer
        ld2      r22 = [r15],2;;  // Get Wi; bump pointer
        pmpy2.r  r21 = r21,r22;;  // Compute Vi times Wi
        sxt4     r21 = r21;;      // Extend 32 bits to 64
        add      r20 = r20,r21;;  // Update the sum
        add      r17 = -1,r17;;   // Decrement loop count
        cmp.gt   p6,p0 = r17,r0   // More to do?
        (p6) br.cond.sptk.few top;;  // Yes
        st8      [r16] = r20;;    // No, store the product
done:   mov      r8 = 0;;         // Signal all is normal
        br.ret.sptk.many b0       // Back to command line
        .endp    main             // Mark end of procedure

We can run this program using the debugger, with a breakpoint set at done. Examining the memory location P should reveal the correctly computed value of 20 (0x14), since (–1)(–2) + (3)(–4) + (5)(6) = 2 – 12 + 30 = 20.
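
Readers without access to an Itanium system may find it useful to check the loop logic in a higher-level language. The following is only a rough C sketch of the loop in Figure 5-1, not part of the original program; the function name dotloop_model and its parameters are hypothetical, chosen for illustration. It assumes, as the assembly does, that there is at least one component, because the loop body runs once before the count is tested.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C model of the DOTLOOP loop. The name and signature are
   illustrative only and do not come from the original listing. */
int64_t dotloop_model(const int16_t *v, const int16_t *w, int64_t n)
{
    /* Like the assembly, the body executes before the count is tested,
       so this model assumes at least one component. */
    assert(n >= 1);

    int64_t sum = 0;                  /* r20: running sum               */
    do {
        /* ld2 with post-increment: fetch a 16-bit component and advance
           each pointer by one element (2 bytes). The 16 x 16 -> 32-bit
           signed multiply stands in for pmpy2.r. */
        int32_t product = (int32_t)*v++ * (int32_t)*w++;

        sum += (int64_t)product;      /* sxt4, then add to running sum  */
        n -= 1;                       /* add r17 = -1,r17               */
    } while (n > 0);                  /* cmp.gt, then (p6) br.cond top  */

    return sum;                       /* value that st8 stores at P     */
}
```

Calling dotloop_model with the arrays {-1, 3, 5} and {-2, -4, 6} and a count of 3 returns 20, the same value expected at P. The do/while form was chosen because it matches the bottom-tested structure of the assembly loop, where the count is decremented and compared only after the first product has been accumulated.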
