4.7. DOTPROD: Using Data Access Instructions

We shall now illustrate the very common operation of referring to successive entries in a list using vector components. Three-component vectors occur frequently in physics and engineering problems. In vector algebra, the scalar product of two vectors (also called the inner product, or the dot product) is the sum of products of corresponding components:

P = V · W = (vx × wx) + (vy × wy) + (vz × wz)

It makes sense to store the x-, y-, and z-components of each vector in adjacent information units. We will select word-length storage for the components of two vectors, V and W, in our sample program (Figure 4-5), and the resulting scalar product, P, will be stored in a quad word.
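
For comparison, the same computation can be sketched in C. This is only an illustration, not part of the sample program: the function name dot3 and the parameter names are our own, int16_t stands in for a word-length component, and int64_t stands in for the quad word result.

```c
#include <stdint.h>

/* Scalar product of two 3-component vectors.
   int16_t models a word-length (2-byte) component;
   int64_t models the quad word result P. */
int64_t dot3(const int16_t v[3], const int16_t w[3])
{
    int64_t p = 0;                      /* running sum */
    for (int i = 0; i < 3; i++)
        p += (int64_t)v[i] * w[i];      /* widen, multiply, accumulate */
    return p;
}
```

Called with the data values of Figure 4-5, V = (-1, +3, +5) and W = (-2, -4, +6), the function returns 20.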

Figure 4-5. DOTPROD: An illustration of data access instructions
// DOTPROD       Scalar Product of 3-vectors

// This program will compute the scalar product
// of two three-element vectors V and W.
         .data                    // Declare storage
         .align  8                // Desired alignment
P:       .skip   8                // Space for product
V:       data2   -1,+3,+5         // Vx, Vy, Vz
W:       data2   -2,-4,+6         // Wx, Wy, Wz
         .text                    // Section for code
         .align  32               // Desired alignment
         .global main             // These three lines
         .proc   main             //  mark the mandatory
main:                             //   'main' program entry
         .body                    // Now we really begin...
first:   movl    r14 = V;;        // Pointer for V
         movl    r15 = W;;        // Pointer for W
         movl    r16 = P;;        // Pointer for P
         mov     r20 = 0;;        // R20 = running sum
         ld2     r21 = [r14],2;;  // Get Vx; bump pointer
         ld2     r22 = [r15],2;;  // Get Wx; bump pointer
         pmpy2.r r21 = r21,r22;;  // Compute Vx times Wx
         sxt4    r21 = r21;;      // Extend 32 bits to 64
         add     r20 = r20,r21;;  // Update the sum
         ld2     r21 = [r14],2;;  // Get Vy; bump pointer
         ld2     r22 = [r15],2;;  // Get Wy; bump pointer
         pmpy2.r r21 = r21,r22;;  // Compute Vy times Wy
         sxt4    r21 = r21;;      // Extend 32 bits to 64
         add     r20 = r20,r21;;  // Update the sum
         ld2     r21 = [r14],2;;  // Get Vz; bump pointer
         ld2     r22 = [r15],2;;  // Get Wz; bump pointer
         pmpy2.r r21 = r21,r22;;  // Compute Vz times Wz
         sxt4    r21 = r21;;      // Extend 32 bits to 64
         add     r20 = r20,r21;;  // Update the sum
         st8     [r16] = r20;;    // Store computed product
                                  // No more components...
done:    mov     r8 = 0;;         // Signal all is normal
         br.ret.sptk.many b0      // Back to command line
         .endp   main             // Mark end of procedure

Some computer architectures can map data structures using fixed offsets for the component values relative to a fixed base address for each vector, i.e., (V, V+2, V+4) for word-length components. The Itanium ISA, on the other hand, offers only register indirect addressing. We chose registers r14, r15, and r16 to point to V, W, and the result P, respectively.
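
The difference between the two addressing styles can be modeled in C. The sketch below is ours, with hypothetical function names load_at_offset and load_indirect; it only imitates the idea and says nothing about how any particular architecture encodes its instructions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Base-plus-displacement style: the address is the base plus a fixed
   byte offset (0, 2, or 4 for the x-, y-, and z-components). */
int16_t load_at_offset(const void *base, size_t byte_offset)
{
    int16_t value;
    memcpy(&value, (const unsigned char *)base + byte_offset, sizeof value);
    return value;
}

/* Register-indirect style: the only address available is the one held
   in the pointer, so the pointer is advanced to reach the next item. */
int16_t load_indirect(const unsigned char **reg)
{
    int16_t value;
    memcpy(&value, *reg, sizeof value);
    *reg += sizeof value;               /* move on by 2 bytes */
    return value;
}
```

Both routines reach the same three components of a vector; the second one does so by changing the pointer, much as r14 and r15 are changed in Figure 4-5.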

Each component is expressed as a 2-byte word. We used ld2 instructions that also perform zero-extension in the destination register. We took advantage of postincrementing with the Itanium load instructions, since the x-, y-, and z-components of each vector are stored as successive words. (We did not remove the increment of 2 from the last pair of load instructions. If we were to write a more general scheme utilizing a loop to compute the dot product of two N-component vectors, it would not be convenient to isolate the last component as a special case.)
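
A C model may help make the load-and-bump idea concrete, and it also shows why a loop over N components has no need to treat the last one specially. The names ld2_postinc and dotn are our own inventions, and the model is a sketch of the behavior described above rather than a definition of the instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of "ld2 r = [p],2": read a 2-byte word, zero-extend it to
   64 bits, then advance the pointer by 2 bytes. */
uint64_t ld2_postinc(const uint16_t **p)
{
    uint64_t r = **p;   /* uint16_t to uint64_t conversion zero-extends */
    *p += 1;            /* one uint16_t element is 2 bytes */
    return r;
}

/* N-component dot product: every iteration is identical, including
   the last, whose pointer bumps are simply never used. */
int64_t dotn(const uint16_t *v, const uint16_t *w, size_t n)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Reinterpret the low 16 bits as signed; this relies on the
           two's complement conversion of mainstream compilers. */
        int16_t a = (int16_t)ld2_postinc(&v);
        int16_t b = (int16_t)ld2_postinc(&w);
        sum += (int32_t)a * (int32_t)b;
    }
    return sum;
}
```

With the bit patterns of Figure 4-5 (for example, -1 stored as 0xFFFF), ld2_postinc returns 0xFFFF rather than a negative number, yet dotn still yields 20 because the components are reinterpreted as signed before multiplying.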

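Each multiplication of two word-length components using the pmpy2.r instruction yields a product expressed as a double word in the destination register. Because pmpy2.r treats its 16-bit operands as signed, the zero-extension performed by ld2 does no harm. The 32-bit product, however, is not sign-extended into the upper half of the register, so we extended each intermediate product to 64 bits using the sxt4 instruction; otherwise a negative product such as 3 × –4 would be added to the 64-bit running sum as a large positive value.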
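
The effect of the sign extension can be seen in a small C model. It is a sketch under stated assumptions: pmpy2_r_low is our own name and models only the low-order element of pmpy2.r as we understand it (signed 16-bit operands in bits 15:0, 32-bit product in bits 31:0), with the upper half taken as zero because ld2 zero-extended both operands.

```c
#include <stdint.h>

/* Model of the low element of pmpy2.r as used in this program:
   bits 15:0 of each source are multiplied as signed values and the
   32-bit product is placed in bits 31:0. Bits 63:32 are zero here
   only because both sources were zero-extended by ld2. */
uint64_t pmpy2_r_low(uint64_t a, uint64_t b)
{
    int32_t prod = (int32_t)(int16_t)(uint16_t)a *
                   (int32_t)(int16_t)(uint16_t)b;
    return (uint64_t)(uint32_t)prod;
}

/* Model of sxt4: sign-extend bits 31:0 to a full 64 bits. */
int64_t sxt4(uint64_t r)
{
    return (int64_t)(int32_t)(uint32_t)r;
}
```

For Vy × Wy, the operands are 3 and 0xFFFC (that is, -4). The modeled product register holds 0x00000000FFFFFFF4, which would count as 4294967284 in a 64-bit addition; after sxt4 it becomes -12, the value the running sum needs.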

With this background, you should have little difficulty in following the flow of the entire calculation. Using the debugger on a Linux system, we could proceed as follows:

L> gcc -Wall -O0 -o bin/dotprod dotprod.s
L> gdb bin/dotprod
[messages deleted here]
(gdb) break done
Breakpoint 1 at 0x40000000000005e0
(gdb) run
Starting program: /home/user/bin/dotprod

Breakpoint 1, 0x40000000000005e0 in done ()
(gdb) x/g &P
0x6000000000000770 <P>: 0x0000000000000014
(gdb) q
The program is running.  Exit anyway? (y or n) y
L>

The correct answer is (–1 × –2) + (+3 × –4) + (+5 × +6) = (+2) + (–12) + (+30) = 20₁₀ = 14₁₆. Alternatively, you could monitor the contents of registers r20 and r21 as you step through the sequence of instructions to the label done. Be attentive to the two's complement arithmetic operations.
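
If you do step through the program, the following C sketch predicts the bit patterns to expect in the running sum after each add instruction. The function name running_sum_trace is ours, and the sketch models the arithmetic only, not the debugger.

```c
#include <stdint.h>

/* Fill trace[0..2] with the 64-bit patterns that the running sum
   (r20 in Figure 4-5) would hold after each of the three additions. */
void running_sum_trace(const int16_t v[3], const int16_t w[3],
                       uint64_t trace[3])
{
    int64_t r20 = 0;
    for (int i = 0; i < 3; i++) {
        r20 += (int64_t)v[i] * w[i];    /* sign-extended product */
        trace[i] = (uint64_t)r20;       /* two's complement bit pattern */
    }
}
```

For the vectors of Figure 4-5 the three patterns are 0x2, 0xFFFFFFFFFFFFFFF6 (that is, -10), and 0x14.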

Using a label such as done, where output instructions would be inserted, works just as well in the HP-UX command-line environment:

H> cc +DD64 -o bin/dotprod dotprod.s
H> adb bin/dotprod
adb> done:b
adb> :r
Process 9619 Thread 9728 Execed
Breakpoint 1 set at address 0x4000980
main + 0xc0:
>       adds             r8=0,r0
        nop.f            0
        nop.b            0;;
Hit Breakpoint 1 at address 0x4000980
adb> P/jx
P:
                0x14
adb> q
H>

where P is the symbolic address for the quad word result in memory. In later chapters, we shall usually demonstrate the sample programs using either the GNU tools (Linux) or the HP-UX tools, but not both, in the interest of keeping the book concise and readable.
