11.8 DATA SCHEDULING
We discuss in this section how to divide the tasks of the algorithm into stages such that at each stage a group of tasks executes in parallel while preserving the correctness of the results. The section also determines the stages at which data are fed to or extracted from a processor or thread. We need a function that takes the coordinates of a point p in the computation domain D and assigns a time value to it. We use an affine scheduling function to specify the scheduling of the algorithm tasks. Affine scheduling functions are of the form [86]
t(p) = sp − s (11.52)
where s is a row vector of length n, called the scheduling vector, and the scalar s is an integer that biases the ordering to ensure non-negative stage index values.
The main purpose of the scheduling function is to assign an execution time to the nodes in D. Consequently, this function determines the computational load to be performed by the computing system at each time step or execution sequence. This is a subtle but very important by-product. As we shall see, the linear or affine scheduling function affords us little control over the amount of that work. However, we still need it to correctly order the algorithm tasks. Nonlinear scheduling techniques will allow us to control the total work assigned to the system during each time step. Another effect of the scheduling function is that it creates dependencies between tasks, from which interthread and interprocessor communication can be determined.
Assigning time values to the nodes of D transforms it into a serial–parallel algorithm (SPA), where the parallel tasks could be implemented using a thread pool or parallel processors for software or hardware implementations, respectively. The different stages of the SPA are delimited using barriers or clocks for software or hardware implementations, respectively.
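As a software sketch of this idea, assuming a Python thread pool (the stage contents and the per-node task below are placeholders, not from the text): tasks within one stage run in parallel, and waiting for the whole stage to finish acts as the barrier.

```python
# A sketch of executing a serial-parallel algorithm (SPA) with a thread pool:
# tasks in one stage run in parallel, and waiting for the stage to finish is
# the barrier. Stage contents are illustrative placeholders.

from concurrent.futures import ThreadPoolExecutor

def run_spa(stages, work):
    """stages: list of task lists; work: function applied to each task."""
    results = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        for stage in stages:
            # map() returns only after every task in the stage has finished,
            # so each loop iteration boundary acts as a barrier between stages.
            results.append(list(pool.map(work, stage)))
    return results

stages = [[(0, 0)], [(0, 1), (1, 0)], [(1, 1)]]  # nodes grouped by time value
out = run_spa(stages, lambda p: sum(p))          # trivial per-node "task"
print(out)  # [[0], [1, 1], [2]]
```

Because `pool.map` blocks until the whole stage completes, no task of stage k + 1 can start before every task of stage k has finished, which is exactly the barrier semantics the SPA requires.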
The following theorem proves that the scheduling function converts the dependence graph into a DAG.
Theorem 11.2
The affine scheduling function changes a dependence graph into a DAG even if the dependence graph contains cycles.
Proof:
Given a dependence graph, we define a k-hop cycle as one involving k nodes, as shown in Fig. 11.5. We exclude one-hop loops since they represent local computations and are absorbed within a node. Without loss of generality, we assume a 1-D dependence graph where the nodes lie on a straight line with index i. Now assume that there are one or more k-hop cycles involving nodes whose indices are 0, 1, … , k − 1, where k > 1. The presence of a loop in the dependence graph implies that there are undirected links between the following pairs of nodes:
(0, 1), (1, 2), … , (k − 2, k − 1), and (0, k − 1)
For our 1-D situation, the affine timing function is given by t(p) = i; the times associated with these points become
t(0) = 0, t(1) = 1, … , t(k − 1) = k − 1
The execution time associated with each point imparts an ordering on the times and thus a direction to the undirected links. We can write
t(0) < t(1) < … < t(k − 1)
The above inequalities give direction to the undirected edges of the dependence graph. We thus have the following directed links:
0 → 1, 1 → 2, … , k − 2 → k − 1, and 0 → k − 1
The dependence graph has now become a DAG, and the loopback edge between the first and last nodes, 0 and k − 1, is now directed from node 0 to node k − 1. Thus, our undirected dependence graph, which included cycles, is transformed into a DAG.
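The proof's construction can be sketched in code, assuming a small 4-hop cycle (the helper names are illustrative): orienting every undirected edge from the node with the smaller time value under t(p) = i leaves a graph that admits a topological order.

```python
# Sketch of the proof's construction: orient each undirected edge of a k-hop
# cycle from the earlier-scheduled node to the later one under t(p) = i, then
# confirm the result is acyclic.

def orient_edges(undirected_edges, t):
    """Direct each edge {a, b} from the node with smaller t to the larger."""
    return [(a, b) if t(a) < t(b) else (b, a) for a, b in undirected_edges]

def is_dag(nodes, edges):
    """Kahn's algorithm: a graph is a DAG iff all nodes can be topologically sorted."""
    indeg = {n: 0 for n in nodes}
    for _, b in edges:
        indeg[b] += 1
    ready = [n for n in nodes if indeg[n] == 0]
    seen = 0
    while ready:
        n = ready.pop()
        seen += 1
        for a, b in edges:
            if a == n:
                indeg[b] -= 1
                if indeg[b] == 0:
                    ready.append(b)
    return seen == len(nodes)

# A 4-hop cycle on nodes 0..3, including the loopback edge {0, 3}.
k = 4
nodes = list(range(k))
cycle = [(i, i + 1) for i in range(k - 1)] + [(0, k - 1)]
directed = orient_edges(cycle, t=lambda i: i)
print(directed)                 # every edge now points from lower to higher index
print(is_dag(nodes, directed))  # True
```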
The following theorem proves that the affine scheduling function converts the dependence graph into a SPA.
Theorem 11.3
The affine scheduling function changes a dependence graph into a SPA.
Proof:
Assume without loss of generality a two-dimensional (2-D) dependence graph. A node in the dependence graph is described by the coordinates i, j:
p = [i j]t (11.53)
The scheduling function assigns an order of execution to each node, given by the expression
t(p) = sp − s = s1 i + s2 j − s (11.54)
For given values of s1, s2, and s, we get
s1 i + s2 j − s = c (11.55)
where c is some constant.
The algorithm computation domain D is defined in the integer space Z2 by the inequalities
0 ≤ i < I (11.56)
0 ≤ j < J (11.57)
where we assumed that the indices in our 2-D case fall within the limits indicated above. The scheduling function imposes another restriction, given by Eq. 11.55. All nodes satisfying the above two inequalities and Eq. 11.55 describe a subset of D. All these nodes are assigned stage c.
Changing the value of c to k identifies a new set of nodes that will be executed at stage k. Thus, we have divided the nodes in our computation domain into a set of sequential stages, where each stage executes a set of nodes in parallel. This is the definition of a SPA.
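A minimal sketch of this partition, assuming a small 3 × 3 domain and the example values s = [1 1] and s = 0 (all illustrative): fixing c selects one set of nodes, and sweeping c enumerates the sequential stages of the SPA.

```python
# Sketch of the proof's construction for a 2-D domain: the level sets of
# s1*i + s2*j - s0 = c partition the domain into the stages of the SPA.
# Domain bounds, scheduling vector, and bias are illustrative.

I, J = 3, 3                      # 0 <= i < I, 0 <= j < J
s1, s2, s0 = 1, 1, 0             # example scheduling vector and bias

def stage(i, j):
    return s1 * i + s2 * j - s0

domain = [(i, j) for i in range(I) for j in range(J)]
stages = {}
for p in domain:
    stages.setdefault(stage(*p), []).append(p)

for c in sorted(stages):
    print(c, stages[c])

# Every node appears in exactly one stage: the stages partition the domain.
assert sorted(p for ps in stages.values() for p in ps) == sorted(domain)
```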
The affine scheduling function should satisfy five conditions in order to be a valid scheduling function:
t(p) ≥ 0, ∀p ∈ D (11.58)
se = 0 (11.59)
sf ≠ 0 (11.60)
sd ≠ 0 (11.61)
sR > 0 (11.62)
where
e    broadcast nullvector,
f    pipelining nullvector,
d    projection direction (discussed later), and
R    any extremal ray in our domain.
It is important to mention here that the restrictions implied by Eqs. 11.59 and 11.61 preclude any broadcast directions that coincide with the projection directions defined in Section 11.9. The above conditions are the minimum constraints that must be satisfied by a valid scheduling function. Further restrictions are imposed to narrow our choices, as will be explained in Section 11.8.1.
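As a sketch, these validity conditions can be checked numerically for a candidate scheduling vector. The helper below is illustrative (the function name, the projection direction, and the specific test vectors are assumptions, not from the text); the nullvectors anticipate the broadcast and pipelining choices made for the matrix multiplication example that follows.

```python
# Illustrative checker for the validity conditions on a scheduling vector s:
# s.e = 0 for broadcast nullvectors, s.f != 0 for pipelining nullvectors,
# s.d != 0 for the projection direction, s.R > 0 for extremal rays, and
# non-negative time values over the domain.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def valid_schedule(s, broadcast_vecs, pipeline_vecs, proj_dir, rays, domain, bias=0):
    """Check a candidate scheduling vector s against the validity conditions."""
    return (
        all(dot(s, e) == 0 for e in broadcast_vecs)     # broadcast nullvectors
        and all(dot(s, f) != 0 for f in pipeline_vecs)  # pipelining nullvectors
        and dot(s, proj_dir) != 0                       # projection direction
        and all(dot(s, R) > 0 for R in rays)            # extremal rays
        and all(dot(s, p) - bias >= 0 for p in domain)  # non-negative stages
    )

s = (1, 0, 1)                             # candidate scheduling vector
ok = valid_schedule(
    s,
    broadcast_vecs=[(0, 1, 0)],           # nullvector of broadcast variable
    pipeline_vecs=[(0, 0, 1), (1, 0, 0)], # nullvectors of pipelined variables
    proj_dir=(0, 0, 1),                   # assumed example projection direction
    rays=[],                              # none specified in this sketch
    domain=[(i, j, k) for i in range(2) for j in range(2) for k in range(2)],
)
print(ok)  # True
```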
Since the point [0 0 0]t ∈ D, we have s = 0. Thus, our scheduling function is simply given by
t(p) = sp (11.63)
or, more explicitly, we can write the above equation as
t(p) = s1 i + s2 j + s3 k (11.64)
We now need to find values for s1, s2, and s3 based on the restrictions given by Eqs. 11.58–11.62.
To start, let us assume that we want to pipeline output variable M1 since this leads to faster and simpler hardware. For this, we need to know the nullvector associated with M1. The nullspace of output variable M1 is given in Eq. 11.35 as
[0 0 1]t (11.65)
From restriction Eq. 11.60, we can write
s [0 0 1]t = s3 ≠ 0 (11.66)
which implies s3 ≠ 0. Let us choose s3 = 1. Thus, our scheduling function so far is given by
t(p) = s1 i + s2 j + k (11.67)
Next, let us assume that we want to broadcast input variable M2. For this, we need to know the nullvector associated with M2. The nullspace of input variable M2 is given in Eq. 11.37 as
[0 1 0]t (11.68)
From restriction Eq. 11.59, we can write
s [0 1 0]t = s2 = 0 (11.69)
which implies s2 = 0. Thus, our scheduling function so far is given by
t(p) = s1 i + k (11.70)
To find the component s1, we consider the third variable, M3. Let us pipeline that variable. For this, we need to know the nullvector associated with M3. The nullspace of input variable M3 is given in Eq. 11.35 as
[1 0 0]t (11.71)
From restriction Eq. 11.60, we can write
s [1 0 0]t = s1 ≠ 0 (11.72)
which implies s1 ≠ 0. Let us choose s1 = 1. Thus, our scheduling function is finally given by
t(p) = [1 0 1] p = i + k (11.73)
Equation 11.73 defines a valid scheduling function for our matrix multiplication algorithm, given our choices for data broadcast and pipelining.
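As a check on this result (a sketch with illustrative matrix sizes and contents, not from the text), executing the multiplication nodes wavefront by wavefront in the stage order t(p) = i + k reproduces the ordinary matrix product:

```python
# Sketch: execute the matrix multiplication nodes p = (i, j, k) grouped by
# stage t(p) = i + k and verify the result against a direct triple loop.
# Matrix sizes and element values are illustrative.

I = J = K = 3
M2 = [[i + 2 * k for k in range(K)] for i in range(I)]   # I x K input
M3 = [[k * j + 1 for j in range(J)] for k in range(K)]   # K x J input
M1 = [[0] * J for _ in range(I)]                         # I x J output

nodes = [(i, j, k) for i in range(I) for j in range(J) for k in range(K)]
for t in range(I + K - 1):                               # stages 0 .. I+K-2
    wavefront = [p for p in nodes if p[0] + p[2] == t]
    for i, j, k in wavefront:                            # these run in parallel
        M1[i][j] += M2[i][k] * M3[k][j]

# Compare against a direct triple-loop product.
ref = [[sum(M2[i][k] * M3[k][j] for k in range(K)) for j in range(J)]
       for i in range(I)]
print(M1 == ref)  # True
```

Note that within one wavefront no two nodes share the same (i, j) pair, since fixing t = i + k and i determines k; the parallel tasks of a stage therefore never race on an output element.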
11.8.1 Impact of Scheduling Function on Data Timing
The restrictions on our choice of a valid scheduling function were developed in the previous section.
The timing function we developed in Eq. 11.73 imposes certain restrictions on the timing of the algorithm variables. The input and output data samples are indexed using two indices, for example,
M1(i, j), 0 ≤ i < I, 0 ≤ j < J (11.74)
M2(i, k), 0 ≤ i < I, 0 ≤ k < K (11.75)
M3(k, j), 0 ≤ k < K, 0 ≤ j < J (11.76)
The feeding and extraction points for this variable were found in Eq. 11.51. These two points can be used to determine the timing of the variable. Consider the output sample M1(i, j). According to Eq. 11.51, the extraction point for this instance is given by
p = [i j K − 1]t (11.77)
We can write the following time value for this element:
t(M1(i, j)) = i + K − 1 (11.78)
This equation states that output elements in the same row of M1 are obtained from the processors or threads at the same time. The first row, with i = 0, is obtained at time instance K − 1; the second row is obtained at time K; and so on.
The reader can verify that input variable sample M2(i, k) is supplied to the array at time
t(M2(i, k)) = i + k (11.79)
Thus, the inputs for this variable are supplied such that all elements whose row and column index sum is equal to the same value are supplied simultaneously. Element M2(0, 0) is supplied at time 0. Elements M2(1, 0) and M2(0, 1) are supplied at time 1, and so on.
Similarly, input variable sample M3(k, j) is supplied to the array at time
t(M3(k, j)) = k (11.80)
All elements on the same row are supplied to the system at the same time. Element M3(0, j) is supplied at time 0. Element M3(1, j) is supplied at time 1, and so on.
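The three timing formulas above can be summarized in a small sketch (the value of K and the sample indices below are illustrative):

```python
# Data timing implied by t(p) = i + k for the matrix multiplication example:
# M1(i, j) is extracted at time i + K - 1, M2(i, k) is fed at time i + k,
# and M3(k, j) is fed at time k. K is an illustrative size.

K = 4

def t_M1(i, j):        # extraction time of output element M1(i, j)
    return i + K - 1   # independent of j: a whole row leaves at once

def t_M2(i, k):        # feeding time of input element M2(i, k)
    return i + k       # anti-diagonals of M2 are fed together

def t_M3(k, j):        # feeding time of input element M3(k, j)
    return k           # independent of j: a whole row is fed at once

print(t_M1(0, 0), t_M1(0, 5))              # row i = 0 leaves at time K - 1 = 3
print(t_M2(0, 0), t_M2(1, 0), t_M2(0, 1))  # 0 1 1
print(t_M3(0, 7), t_M3(1, 2))              # 0 1
```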