20.3 FORWARD SUBSTITUTION (DIRECT TECHNIQUE)

The general form for a system of linear equations was given in Eq. 20.1. Forward substitution technique converts the square matrix A into a lower triangular form:

(20.10) c20e010

Consider the 5 × 5 lower triangular linear system:

(20.11) c20e011

If all lii ≠ 0, then we can determine the unknowns according to the equations

(20.12) c20e012

where x1 must be calculated before x2 could be evaluated and so on. Thus, it appears that the calculations are sequential, with small opportunity for parallelization. However, the techniques we discussed earlier will help us derive parallel multithreaded and systolic architectures.

20.3.1 Forward Substitution Dependence Graph

The iterations in Eq. 20.12 use two indices i and j and we can use the results of Chapters 10 or 11 to study the parallelization of the forward substitution algorithm. Figure 20.1 is the dependence graph of the iterations in Eq. 20.12. Note that the variable x is an input/output variable and this explains why we have two sets of x, one set for the output instances of x and the other set is for the input instances of x.

Figure 20.1 The dependence graph of the forward substitution algorithm.

c20f001

Input instance xin(k) is a copy of the output instance xout(k). This is shown by the bends in the lines that appear on the diagonal nodes.

20.3.2 Forward Substitution Scheduling Function and Directed Acyclic Graph (DAG)

The choice of the scheduling function is dictated by the need to maintain the proper sequence of evaluating xi. We know that xout(i) can be found only after xout(i − 1) has been evaluated. We note that output instance xout(i) is obtained at the diagonal node (i, i) and is used by all nodes whose coordinates are (i + k, i) where k > 0. Therefore, if the scheduling vector is s = [s1 s2], then we must have

(20.13) c20e013

(20.14) c20e014

and we can write the inequality

(20.15) c20e015

Since k > 0, we must have s1 > 0 too. We can choose s1 = 1 and we have our scheduling vector as

(20.16) c20e016

We can choose three possible scheduling vectors while satisfying inequality (Eq. 20.13):

(20.17) c20e017

(20.18) c20e018

(20.19) c20e019

The resulting DAGs for the three choices are shown in Fig. 20.2. The scheduling vector s1 implies diagonal-based calculations since each iteration requires simultaneous access to the elements of a diagonal. The choice of s1 will produce a DAG where output sample xout(i) is obtained on the left edge of the diagram. However, this output sample must be fed back to node (i, i) for use by later calculations. Depending on the projection vector chosen, we might have to provide communication between the left nodes and the diagonal nodes. The work W at each time step would start at N, then decrease by one at each time step thereafter.

Figure 20.2 The DAG graphs of the forward substitution algorithm for the three possible scheduling functions.

c20f002

The scheduling vector s2 implies row-based calculations since each iteration requires simultaneous access to the elements of a row. The work W at each time step would start at 1, then increase by one at each time step thereafter.

The scheduling vector s3 implies column-based calculations since each iteration requires simultaneous access to the elements of a column. The work W at each time step would start at 1 for the first two operations, then increase by two at the next two time steps. The maximum work is encountered halfway during the operation, and then work starts to decrease by 2 after each two time steps thereafter.

We can use nonlinear scheduling to control the total workload at each iteration. However, the work done at each time step will not be uniform.

20.3.3 Forward Substitution Projection Function

Three projection directions are possible:

(20.20) c20e020

(20.21) c20e021

(20.22) c20e022

The simplest projection directions to use would be d2 and d3.

Let us consider the case when d2 is used. The reduced DAG (c20ue002) is shown in Fig. 20.3. We can use nonlinear projection to control the workload of each thread or each processing element (PE) in the systolic array.

Figure 20.3 The c20ue025 graphs of the forward substitution algorithm for the three possible scheduling functions.

c20f003
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.224.53.202