11.8 DATA SCHEDULING

We discuss in this section how to divide the tasks of the algorithm into stages such that at each stage a group of tasks is executed in parallel while preserving the correctness of the results. The section also determines the stages at which data are fed to, or extracted from, a processor or thread. We need a function that takes the coordinates of a point p in the computation domain 𝒟 and assigns a time value to it. We use an affine scheduling function to specify the scheduling of the algorithm tasks. Affine scheduling functions are of the form [86]

(11.52) t(p) = sp − s

where s is a row vector of length n, called the scheduling vector, and the scalar s is an integer that biases the ordering to ensure nonnegative stage index values.

The main purpose of the scheduling function is to assign an execution time to each node in 𝒟. Consequently, this function determines the computational load the computing system must perform at each time step or execution sequence. This is a subtle but very important by-product. As we shall see, a linear or affine scheduling function affords us little control over the amount of that work; however, we still need it to perform the algorithm tasks correctly. Nonlinear scheduling techniques will allow us to control the total work assigned to the system during each time step. Another effect of the scheduling function is that it creates dependencies between tasks, from which the interthread and interprocessor communication can be determined.

Assigning time values to the nodes of 𝒟 transforms it into a serial–parallel algorithm (SPA), where the parallel tasks could be implemented using a thread pool for software implementations or parallel processors for hardware implementations. The different stages of the SPA are separated using barriers for software implementations or clocks for hardware implementations.
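As a minimal illustration (not part of the original text), the following Python sketch applies the affine schedule of Eq. 11.52 to every node of a small, assumed 2-D computation domain and groups the nodes into SPA stages; the scheduling vector s = [1 1] and the domain bounds are arbitrary choices for the example.

import numpy as np
from collections import defaultdict

def affine_schedule(points, s_vec, s_scalar=0):
    """Assign a stage index t(p) = s p - s to every node p of the domain."""
    stages = defaultdict(list)
    for p in points:
        t = int(np.dot(s_vec, p)) - s_scalar
        stages[t].append(tuple(p))
    return dict(sorted(stages.items()))

domain = [(i, j) for i in range(4) for j in range(4)]   # assumed 0 <= i, j < 4
for t, nodes in affine_schedule(domain, np.array([1, 1])).items():
    print(f"stage {t}: {nodes}")   # nodes in one stage may run in parallel, between barriers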

The following theorem proves that the scheduling function converts the dependence graph into a directed acyclic graph (DAG).

Theorem 11.2

The affine scheduling function changes a dependence graph into a DAG even if the dependence graph contains cycles.

Proof:

Given a dependence graph, we define a k-hop cycle as one involving k nodes, as shown in Fig. 11.5. We exclude one-hop loops since they represent local computations and are absorbed within a node. Without loss of generality, we assume a 1-D dependence graph where the nodes lie on a straight line with index i. Now assume that there are one or more k-hop cycles involving nodes whose indices are 0, 1, … , k − 1, where k > 1. The presence of a loop in the dependence graph implies that there are undirected links between the following pairs of nodes:

(0, 1), (1, 2), … , (k − 2, k − 1), and (k − 1, 0)

For our 1-D situation, the affine timing function is given by t(p) = i; the times associated with these points become

t(0) = 0, t(1) = 1, … , t(k − 1) = k − 1

Figure 11.5 Illustration of the presence of cycles in a dependence graph and a DAG. (a) Undirected links and cycles in the dependence graph. (b) Directed links only to produce a DAG.


The execution times associated with the points impose an ordering and thereby give a direction to the undirected links. We can write

t(0) < t(1) < … < t(k − 1)

The above inequalities give direction to the undirected edges of the dependence graph. We thus have the following directed links:

0 → 1, 1 → 2, … , (k − 2) → (k − 1), and 0 → (k − 1)

The dependence graph has now become a DAG, and the loopback edge between the first and last nodes, 0 and k − 1, is now directed from node 0 to node k − 1. Thus, our undirected dependence graph, which included cycles, has been transformed into a DAG.
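A small sketch of this argument in code form (an illustration only, with k = 5 assumed): orienting each undirected edge of the k-hop cycle from the node with the smaller time value to the node with the larger one leaves no directed cycle.

# Orient the undirected edges of the k-hop cycle of Fig. 11.5 using the
# 1-D schedule t(i) = i, then confirm every edge points forward in time,
# so a topological order exists and the graph is a DAG.
k = 5
undirected = [(i, i + 1) for i in range(k - 1)] + [(k - 1, 0)]   # the k-hop cycle
t = lambda i: i                                                  # affine schedule t(p) = i
directed = [(u, v) if t(u) < t(v) else (v, u) for (u, v) in undirected]
print(directed)
assert all(t(u) < t(v) for (u, v) in directed)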

The following theorem proves that the affine scheduling function converts the dependence graph into a SPA.

Theorem 11.3

The affine scheduling function changes a dependence graph into a SPA.

Proof:

Assume without loss of generality a two-dimensional (2-D) dependence graph. A node in the dependence graph is described by the coordinates i, j:

(11.53) p = [i j]^t

The scheduling function assigns an order of execution to each node given by the expression

(11.54) t(p) = sp − s = s1 i + s2 j − s

For given values of s1, s2, and s, we get

(11.55) s1 i + s2 j − s = c

where c is some constant.

The algorithm computation domain 𝒟 is defined in the integer space 𝒵² by the inequalities

(11.56) 0 ≤ i < I

(11.57) 0 ≤ j < J

where we assume that the indices in our 2-D case fall within the limits indicated above. The scheduling function imposes another restriction, given by Eq. 11.55. All nodes satisfying the above two inequalities as well as Eq. 11.55 describe a subset of 𝒟, and all of these nodes are assigned to stage c.

Changing the value of c to k identifies a new set of nodes that will be executed at stage k. Thus, we have divided the nodes in our computation domain 𝒟 into a set of sequential stages, and each stage executes a set of nodes in parallel. This is the definition of a SPA.

The affine scheduling function should satisfy five conditions in order to be a valid scheduling function:

(11.58) t(p) ≥ 0 for any point p ∈ 𝒟

(11.59) s e = 0

(11.60) s f ≠ 0

(11.61) s d ≠ 0

(11.62) s R > 0

where

e is a broadcast nullvector,
f is a pipelining nullvector,
d is the projection direction (discussed later), and
R is any extremal ray in our domain.

It is important to mention here that the restriction of Eq. 11.59, taken together with Eq. 11.61, precludes any broadcast direction from coinciding with a projection direction, defined in Section 11.9: since s e = 0 while s d ≠ 0, e cannot be parallel to d. The above conditions provide the minimum constraints that must be satisfied by a valid scheduling function. Further restrictions are imposed to narrow our choices, as will be explained in Section 11.8.1.
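As a quick illustration (not from the text), the following Python sketch checks a candidate scheduling vector against conditions 11.59–11.62; condition 11.58 is omitted since it requires the domain itself. The nullvectors passed in the example are the ones derived later in this section.

import numpy as np

def is_valid_schedule(s, broadcast=(), pipelined=(), projections=(), rays=()):
    """Check conditions 11.59-11.62 for a candidate scheduling vector s."""
    s = np.asarray(s)
    ok_e = all(np.dot(s, e) == 0 for e in broadcast)    # Eq. 11.59: s e = 0
    ok_f = all(np.dot(s, f) != 0 for f in pipelined)    # Eq. 11.60: s f != 0
    ok_d = all(np.dot(s, d) != 0 for d in projections)  # Eq. 11.61: s d != 0
    ok_r = all(np.dot(s, R) > 0 for R in rays)          # Eq. 11.62: s R > 0
    return ok_e and ok_f and ok_d and ok_r

print(is_valid_schedule([1, 0, 1],
                        broadcast=[[0, 1, 0]],              # nullvector of M2 (broadcast)
                        pipelined=[[0, 0, 1], [1, 0, 0]]))  # nullvectors of M1 and M3 (pipelined)
# prints True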

Since the point [0 0 0]^t ∈ 𝒟, we have s = 0. Thus, our scheduling function is simply given by

(11.63) t(p) = sp

or, more explicitly, we can write the above equation as

(11.64) t(p) = s1 i + s2 j + s3 k

We need to find values for the components s1, s2, and s3 based on the restrictions given by Eqs. 11.58–11.62.

To start, let us assume that we want to pipeline output variable M1 since this leads to faster and simpler hardware. For this, we need to know the nullvector associated with M1. The nullspace for output variable M1 is given in Eq. 11.35 as

(11.65) f1 = [0 0 1]^t

From restriction (Eq. 11.60), we can write

(11.66) s f1 = s3 ≠ 0

which implies s3 ≠ 0. Let us choose s3 = 1. Thus, our scheduling function so far is given by

(11.67) t(p) = [s1 s2 1] p = s1 i + s2 j + k

Next, let us assume that we want to broadcast input variable M2. For this, we need to know the nullvector associated with M2. The nullspace for input variable M2 is given in Eq. 11.37 as

(11.68) e2 = [0 1 0]^t

From restriction (Eq. 11.59), we can write

(11.69) s e2 = s2 = 0

which implies s2 = 0. Thus, our scheduling function so far is given by

(11.70) t(p) = [s1 0 1] p = s1 i + k

To find the component s1, we consider the third variable, M3. Let us pipeline that variable. For this, we need to know the nullvector associated with M3. The nullspace for input variable M3 is given in Eq. 11.35 as

(11.71) f3 = [1 0 0]^t

From restriction (Eq. 11.60), we can write

(11.72) s f3 = s1 ≠ 0

which implies s1 ≠ 0. Let us choose s1 = 1. Thus, our scheduling function is finally given by

(11.73) t(p) = [1 0 1] p = i + k

Equation 11.73 defines the valid scheduling function for our matrix multiplication algorithm given our choices for data broadcast and pipelining.
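As an illustration of what this schedule means in software (a sketch with assumed dimensions, not code from the text), the following Python fragment executes the matrix multiplication M1 = M2 × M3 stage by stage, where stage t contains exactly the nodes with i + k = t:

import numpy as np

I, J, K = 3, 4, 5                       # assumed dimensions: M2 is I x K, M3 is K x J
M2 = np.random.rand(I, K)
M3 = np.random.rand(K, J)
M1 = np.zeros((I, J))

for t in range(I + K - 1):              # stage index t = i + k, as in Eq. 11.73
    stage = [(i, j, k) for i in range(I) for j in range(J) for k in range(K)
             if i + k == t]
    for (i, j, k) in stage:             # node updates within one stage are independent
        M1[i, j] += M2[i, k] * M3[k, j]

assert np.allclose(M1, M2 @ M3)         # the schedule preserves correctness

In a threaded implementation, the inner loop over a stage would be distributed over a thread pool, with a barrier between consecutive values of t.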

11.8.1 Impact of Scheduling Function on Data Timing

The restrictions on our choice of a valid scheduling function were developed above.

The timing function we developed in Eq. 11.73 imposes certain restrictions on the timing of the output variable. The output data samples are indexed using two indices; for example,

(11.74) M1(i, j)

(11.75) 0 ≤ i < I

(11.76) 0 ≤ j < J

The feeding and extraction points for this variable were found in Eq. 11.51. These two points can be used to determine the timing of the variable. Consider the output sample M1(i, j). According to Eq. 11.51, the extraction point for this instance is given by

(11.77) p = [i j K − 1]^t

We can write the following time value for this element:

(11.78) t(M1(i, j)) = i + K − 1

This equation states that output elements in the same row of M1 are obtained from the processors or threads at the same time. The first row, with i = 0, is obtained at time instance K − 1; the second row is obtained at time K, and so on.

The reader can verify that input variable sample M2(i, k) is supplied to the array at time

(11.79) t(M2(i, k)) = i + k

Thus, the inputs for this variable are supplied such that all elements whose row and column index sum is equal to the same value are supplied simultaneously. Element M2(0, 0) is supplied at time 0. Elements M2(1, 0) and M2(0, 1) are supplied at time 1, and so on.

Similarly, input variable sample M3(k, j) is supplied to the array at time

(11.80) t(M3(k, j)) = k

All elements on the same row are supplied to the system at the same time. Element M3(0, j) is supplied at time 0. Element M3(1, j) is supplied at time 1, and so on.
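The following short Python sketch (an illustration with assumed dimensions I = J = K = 3) tabulates the extraction and feed times of Eqs. 11.78–11.80 and can be used to verify the patterns just described:

# Data timing implied by the schedule t(p) = i + k of Eq. 11.73.
I = J = K = 3
t_M1 = {(i, j): i + K - 1 for i in range(I) for j in range(J)}   # extraction times, Eq. 11.78
t_M2 = {(i, k): i + k     for i in range(I) for k in range(K)}   # feed times, Eq. 11.79
t_M3 = {(k, j): k         for k in range(K) for j in range(J)}   # feed times, Eq. 11.80

print(t_M1)   # every element of row i of M1 is extracted at time i + K - 1
print(t_M2)   # elements of M2 with the same index sum i + k are fed together
print(t_M3)   # every element of row k of M3 is fed at time k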
