11.8 DATA SCHEDULING

We discuss in this section how to divide the tasks of the algorithm into stages such that at each stage a group of tasks is executed in parallel while preserving the correctness of the results. The section also determines the stages at which data are fed to, or extracted from, a processor or thread. We need a function that takes the coordinates of a point p in the computation domain 𝒟 and assigns a time value to it. We use an affine scheduling function to specify the scheduling of the algorithm tasks. Affine scheduling functions are of the form [86]

(11.52) t(p) = sp − s

where s is a row vector of length n, called the scheduling vector, and the scalar s is an integer that biases the ordering to ensure nonnegative stage index values.

The main purpose of the scheduling function is to assign an execution time to each node in 𝒟. Consequently, this function determines the computational load the computing system must perform at each time step or execution sequence. This is a subtle but very important by-product. As we shall see, a linear or affine scheduling function affords us little control over the amount of that work; however, we still need it to perform the algorithm tasks correctly. Nonlinear scheduling techniques will allow us to control the total work assigned to the system during each time step. Another effect of the scheduling function is that it creates dependencies between tasks, from which the interthread and interprocessor communication can be determined.

Assigning time values to the nodes of 𝒟 transforms it into a serial–parallel algorithm (SPA), where the parallel tasks could be implemented using a thread pool for software implementations or parallel processors for hardware implementations. The different stages of the SPA are separated using barriers for software implementations or clocks for hardware implementations.
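As a minimal illustration (not part of the original text), the following Python sketch applies the affine schedule of Eq. 11.52 to every node of a small, assumed 2-D computation domain and groups the nodes into SPA stages; the scheduling vector s = [1 1] and the domain bounds are arbitrary choices for the example.

import numpy as np
from collections import defaultdict

def affine_schedule(points, s_vec, s_scalar=0):
    """Assign a stage index t(p) = s p - s to every node p of the domain."""
    stages = defaultdict(list)
    for p in points:
        t = int(np.dot(s_vec, p)) - s_scalar
        stages[t].append(tuple(p))
    return dict(sorted(stages.items()))

domain = [(i, j) for i in range(4) for j in range(4)]   # assumed 0 <= i, j < 4
for t, nodes in affine_schedule(domain, np.array([1, 1])).items():
    print(f"stage {t}: {nodes}")   # nodes in one stage may run in parallel, between barriers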

The following theorem proves that the scheduling function converts the dependence graph into a directed acyclic graph (DAG).

Theorem 11.2

The affine scheduling function changes a dependence graph into a DAG even if the dependence graph contains cycles.

Proof:

Given a dependence graph, we define a k-hop cycle as one involving k nodes, as shown in Fig. 11.5. We exclude one-hop loops since they represent local computations and are absorbed within a node. Without loss of generality, we assume a 1-D dependence graph where the nodes lie on a straight line with index i. Now assume that there are one or more k-hop cycles involving nodes whose indices are 0, 1, … , k − 1, where k > 1. The presence of a loop in the dependence graph implies that there are undirected links between the following pairs of nodes:

(0, 1), (1, 2), … , (k − 2, k − 1), and (k − 1, 0)

For our 1-D situation, the affine timing function is given by t(p) = i; the times associated with these points become

t(0) = 0, t(1) = 1, … , t(k − 1) = k − 1

Figure 11.5 Illustration of the presence of cycles in a dependence graph and a DAG. (a) Undirected links and cycles in the dependence graph. (b) Directed links only to produce a DAG.


The execution times associated with the points impose an ordering and thereby give a direction to the undirected links. We can write

t(0) < t(1) < … < t(k − 1)

The above inequalities give direction to the undirected edges of the dependence graph. We thus have the following directed links:

0 → 1, 1 → 2, … , (k − 2) → (k − 1), and 0 → (k − 1)

The dependence graph has now become a DAG, and the loopback edge between the first and last nodes, 0 and k − 1, is now directed from node 0 to node k − 1. Thus, our undirected dependence graph, which included cycles, has been transformed into a DAG.
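A small sketch of this argument in code form (an illustration only, with k = 5 assumed): orienting each undirected edge of the k-hop cycle from the node with the smaller time value to the node with the larger one leaves no directed cycle.

# Orient the undirected edges of the k-hop cycle of Fig. 11.5 using the
# 1-D schedule t(i) = i, then confirm every edge points forward in time,
# so a topological order exists and the graph is a DAG.
k = 5
undirected = [(i, i + 1) for i in range(k - 1)] + [(k - 1, 0)]   # the k-hop cycle
t = lambda i: i                                                  # affine schedule t(p) = i
directed = [(u, v) if t(u) < t(v) else (v, u) for (u, v) in undirected]
print(directed)
assert all(t(u) < t(v) for (u, v) in directed)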

The following theorem proves that the affine scheduling function converts the dependence graph into a SPA.

Theorem 11.3

The affine scheduling function changes a dependence graph into a SPA.

Proof:

Assume without loss of generality a two-dimensional (2-D) dependence graph. A node in the dependence graph is described by the coordinates i, j:

(11.53) p = [i j]^t

The scheduling function assigns an order of execution to each node given by the expression

(11.54) t(p) = sp − s = s1 i + s2 j − s

For given values of s1, s2, and s, we get

(11.55) s1 i + s2 j − s = c

where c is some constant.

The algorithm computation domain 𝒟 is defined in the integer space 𝒵² by the inequalities

(11.56) 0 ≤ i < I

(11.57) 0 ≤ j < J

where we assume that the indices in our 2-D case fall within the limits indicated above. The scheduling function imposes another restriction, given by Eq. 11.55. All nodes satisfying the above two inequalities as well as Eq. 11.55 describe a subset of 𝒟, and all of these nodes are assigned to stage c.

Changing the value of c to k identifies a new set of nodes that will be executed at stage k. Thus, we have divided the nodes in our computation domain 𝒟 into a set of sequential stages, and each stage executes a set of nodes in parallel. This is the definition of a SPA.

The affine scheduling function should satisfy five conditions in order to be a valid scheduling function:

(11.58) t(p) ≥ 0 for any point p ∈ 𝒟

(11.59) s e = 0

(11.60) s f ≠ 0

(11.61) s d ≠ 0

(11.62) s R > 0

where

e is a broadcast nullvector,
f is a pipelining nullvector,
d is the projection direction (discussed later), and
R is any extremal ray in our domain.

It is important to mention here that the restriction of Eq. 11.59, taken together with Eq. 11.61, precludes any broadcast direction from coinciding with a projection direction, defined in Section 11.9: since s e = 0 while s d ≠ 0, e cannot be parallel to d. The above conditions provide the minimum constraints that must be satisfied by a valid scheduling function. Further restrictions are imposed to narrow our choices, as will be explained in Section 11.8.1.
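As a quick illustration (not from the text), the following Python sketch checks a candidate scheduling vector against conditions 11.59–11.62; condition 11.58 is omitted since it requires the domain itself. The nullvectors passed in the example are the ones derived later in this section.

import numpy as np

def is_valid_schedule(s, broadcast=(), pipelined=(), projections=(), rays=()):
    """Check conditions 11.59-11.62 for a candidate scheduling vector s."""
    s = np.asarray(s)
    ok_e = all(np.dot(s, e) == 0 for e in broadcast)    # Eq. 11.59: s e = 0
    ok_f = all(np.dot(s, f) != 0 for f in pipelined)    # Eq. 11.60: s f != 0
    ok_d = all(np.dot(s, d) != 0 for d in projections)  # Eq. 11.61: s d != 0
    ok_r = all(np.dot(s, R) > 0 for R in rays)          # Eq. 11.62: s R > 0
    return ok_e and ok_f and ok_d and ok_r

print(is_valid_schedule([1, 0, 1],
                        broadcast=[[0, 1, 0]],              # nullvector of M2 (broadcast)
                        pipelined=[[0, 0, 1], [1, 0, 0]]))  # nullvectors of M1 and M3 (pipelined)
# prints True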

Since the point [0 0 0]^t ∈ 𝒟, we have s = 0. Thus, our scheduling function is simply given by

(11.63) t(p) = sp

or, more explicitly, we can write the above equation as

(11.64) t(p) = s1 i + s2 j + s3 k

We need to find values for the components s1, s2, and s3 based on the restrictions given by Eqs. 11.58–11.62.

To start, let us assume that we want to pipeline output variable M1 since this leads to faster and simpler hardware. For this, we need to know the nullvector associated with M1. The nullspace for output variable M1 is given in Eq. 11.35 as

(11.65) f1 = [0 0 1]^t

From restriction (Eq. 11.60), we can write

(11.66) s f1 = s3 ≠ 0

which implies s3 ≠ 0. Let us choose s3 = 1. Thus, our scheduling function so far is given by

(11.67) t(p) = [s1 s2 1] p = s1 i + s2 j + k

Next, let us assume that we want to broadcast input variable M2. For this, we need to know the nullvector associated with M2. The nullspace for input variable M2 is given in Eq. 11.37 as

(11.68) e2 = [0 1 0]^t

From restriction (Eq. 11.59), we can write

(11.69) s e2 = s2 = 0

which implies s2 = 0. Thus, our scheduling function so far is given by

(11.70) t(p) = [s1 0 1] p = s1 i + k

To find the component s1, we consider the third variable, M3. Let us pipeline that variable. For this, we need to know the nullvector associated with M3. The nullspace for input variable M3 is given in Eq. 11.35 as

(11.71) f3 = [1 0 0]^t

From restriction (Eq. 11.60), we can write

(11.72) s f3 = s1 ≠ 0

which implies s1 ≠ 0. Let us choose s1 = 1. Thus, our scheduling function is finally given by

(11.73) t(p) = [1 0 1] p = i + k

Equation 11.73 defines the valid scheduling function for our matrix multiplication algorithm given our choices for data broadcast and pipelining.
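As an illustration of what this schedule means in software (a sketch with assumed dimensions, not code from the text), the following Python fragment executes the matrix multiplication M1 = M2 × M3 stage by stage, where stage t contains exactly the nodes with i + k = t:

import numpy as np

I, J, K = 3, 4, 5                       # assumed dimensions: M2 is I x K, M3 is K x J
M2 = np.random.rand(I, K)
M3 = np.random.rand(K, J)
M1 = np.zeros((I, J))

for t in range(I + K - 1):              # stage index t = i + k, as in Eq. 11.73
    stage = [(i, j, k) for i in range(I) for j in range(J) for k in range(K)
             if i + k == t]
    for (i, j, k) in stage:             # node updates within one stage are independent
        M1[i, j] += M2[i, k] * M3[k, j]

assert np.allclose(M1, M2 @ M3)         # the schedule preserves correctness

In a threaded implementation, the inner loop over a stage would be distributed over a thread pool, with a barrier between consecutive values of t.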

11.8.1 Impact of Scheduling Function on Data Timing

The restrictions on our choice of a valid scheduling function were developed above.

The timing function we developed in Eq. 11.73 imposes certain restrictions on the timing of the output variable. The output data samples are indexed using two indices; for example,

(11.74) M1(i, j)

(11.75) 0 ≤ i < I

(11.76) 0 ≤ j < J

The feeding and extraction points for this variable were found in Eq. 11.51. These two points can be used to determine the timing of the variable. Consider the output sample M1(i, j). According to Eq. 11.51, the extraction point for this instance is given by

(11.77) p = [i j K − 1]^t

We can write the following time value for this element:

(11.78) t(M1(i, j)) = i + K − 1

This equation states that output elements in the same row of M1 are obtained from the processors or threads at the same time. The first row, with i = 0, is obtained at time instance K − 1; the second row is obtained at time K, and so on.

The reader can verify that input variable sample M2(i, k) is supplied to the array at time

(11.79) t(M2(i, k)) = i + k

Thus, the inputs for this variable are supplied such that all elements whose row and column index sum is equal to the same value are supplied simultaneously. Element M2(0, 0) is supplied at time 0. Elements M2(1, 0) and M2(0, 1) are supplied at time 1, and so on.

Similarly, input variable sample M3(k, j) is supplied to the array at time

(11.80) t(M3(k, j)) = k

All elements on the same row are supplied to the system at the same time. Element M3(0, j) is supplied at time 0. Element M3(1, j) is supplied at time 1, and so on.
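The following short Python sketch (an illustration with assumed dimensions I = J = K = 3) tabulates the extraction and feed times of Eqs. 11.78–11.80 and can be used to verify the patterns just described:

# Data timing implied by the schedule t(p) = i + k of Eq. 11.73.
I = J = K = 3
t_M1 = {(i, j): i + K - 1 for i in range(I) for j in range(J)}   # extraction times, Eq. 11.78
t_M2 = {(i, k): i + k     for i in range(I) for k in range(K)}   # feed times, Eq. 11.79
t_M3 = {(k, j): k         for k in range(K) for j in range(J)}   # feed times, Eq. 11.80

print(t_M1)   # every element of row i of M1 is extracted at time i + K - 1
print(t_M2)   # elements of M2 with the same index sum i + k are fed together
print(t_M3)   # every element of row k of M3 is fed at time k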
