2 PARALLEL IN-CORE AND OUT-OF-CORE LU FACTORIZATION FOR SOLVING A MATRIX EQUATION AND THE CORRESPONDING PARALLEL MATRIX FILLING IN HOBBIES

2.0 SUMMARY

The parallel implementation of a Method of Moments (MoM) code involves filling up a complex, dense matrix in parallel followed by a parallel matrix solution. The approach used to fill the matrix in parallel is determined by the type of matrix equation solver being used, namely, in-core or out-of-core. To fill the matrix efficiently for a MoM code in parallel, it is important to understand how the data are distributed over different processors (cores) among the different computer nodes.

With this in mind, this chapter explores the details of solving the dense, complex matrix equation for MoM using the LU factorization method rather than describing the MoM procedure itself. After introducing the matrix equation for MoM, the serial versions of the LU decomposition method are investigated. Then, the discussion is focused on the parallel implementation of the LU algorithm. The data distribution pattern used by the ScaLAPACK (Scalable Linear Algebra PACKage) library packages is explained.

When the matrix from a MoM code is too large to fit into the random access memory (RAM) of the computer at hand, the space on the hard disk is used to extend the storage capability of the computer. An out-of-core solver, in addition to the RAM, uses the space on the hard disk, whereas an in-core solver uses only the RAM of the computer. A one-slab, left-looking, out-of-core algorithm is discussed. With the LU factors and forward/backward substitution algorithm, the matrix equation obtained from a MoM code can be efficiently solved in parallel.

Once the procedure for solving the matrix equation has been outlined, the parallel matrix filling scheme can be designed to match it. At the end of this chapter, the parallel in-core and parallel out-of-core matrix filling algorithms are presented for MoM using higher order basis functions.

2.1 MATRIX EQUATION RESULTING FROM A MOM CODE

By following the MoM procedure to solve an integral equation, a matrix equation can be obtained as

[Z] [I] = [V]    (2.1)

where [Z] is an M × M dense, complex matrix, in which M unknowns are used to represent the continuous electric and/or magnetic current over the surface of interest, [I] is an M × 1 vector of the unknown current coefficients to be solved for, and [V] is an M × 1 excitation vector. Here, [Z] is often referred to as the impedance matrix, [V] is termed the voltage or the excitation matrix, and [I] is referred to as the matrix containing the unknown current coefficients. Detailed steps for computing the elements of the matrices [Z] and [V], for different basis functions, will be given in the subsequent chapters. The following discussion is focused on solving Eq. (2.1).

The numerical solution of Eq. (2.1) generally proceeds by factoring the impedance matrix [Z] into a lower and an upper triangular matrix, called the LU factorization or decomposition. This LU-factored matrix is then used to solve for the induced current given an excitation. When a shared-memory machine is used, the impedance matrix is stored in the memory and operated on. When a distributed memory machine is used instead, the matrix must be broken up into pieces and distributed across the memory of all the processes involved [1-3]. The algorithms will be explained in detail in the following section.

2.2 IN-CORE MATRIX EQUATION SOLVER

For descriptive convenience, let us rewrite the dense linear system of Eq. (2.1), as

[A] [X] = [B]    (2.2)

where [A] ∈ ℂ^(M×M); that is, the elements of the matrix are complex quantities. If the use of pivoting during the LU decomposition for stability of the solution is ignored for the moment, it is customary to first perform the LU factorization, so that

[A] = [L] [U]    (2.3)

where [L] ∈ ℂ^(M×M) is a lower triangular matrix with unity elements on the diagonal, and [U] ∈ ℂ^(M×M) is an upper triangular matrix. The system of Eq. (2.2), with [A] factored as in Eq. (2.3), is then solved using a two-step procedure. First solve for [Y] by using

[L] [Y] = [B]    (2.4)

and then solve for [X] using the knowledge of [Y] from

[U] [X] = [Y]    (2.5)
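
As a concrete illustration of this two-step procedure (a minimal sketch using SciPy's general-purpose lu and solve_triangular routines, not the parallel HOBBIES implementation), the following Python fragment factors a small, random, complex matrix and then performs the forward and backward substitutions of Eqs. (2.4) and (2.5); the permutation matrix produced by partial pivoting is applied to the right-hand side before the forward substitution.

    import numpy as np
    from scipy.linalg import lu, solve_triangular

    rng = np.random.default_rng(0)
    M = 500
    Z = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))   # stand-in for [A]
    V = rng.standard_normal((M, 1)) + 1j * rng.standard_normal((M, 1))   # stand-in for [B]

    P, L, U = lu(Z)                                                   # [A] = [P][L][U], Eq. (2.3) with pivoting
    Y = solve_triangular(L, P.T @ V, lower=True, unit_diagonal=True)  # forward substitution, Eq. (2.4)
    X = solve_triangular(U, Y)                                        # backward substitution, Eq. (2.5)
    print(np.allclose(Z @ X, V))                                      # verifies the solution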

In the past, the solution of the matrix equation could have been easily accomplished by calling the appropriate LINPACK (LINear algebra PACKage) [4] library routines. However, on a typical processor with hierarchical memory, the brute-force use of LINPACK is inefficient because its memory access patterns disregard the multilayered memory hierarchies of the machines, and the code thereby spends too much time transferring data instead of carrying out useful floating-point operations. In essence, O(M) operations [where O(·) represents "of the order of"] are performed on O(M) data, and the processor cache, which allows high data access rates, cannot be optimally utilized [5].

LINPACK has been largely superseded by LAPACK (Linear Algebra PACKage) [6], which has been designed to run efficiently on shared-memory, vector supercomputers. LAPACK addresses the problem of spending too much time transferring data instead of performing useful floating-point operations by reorganizing the algorithms to use block matrix operations, such as matrix multiplication, in the innermost loops. These block matrix operations can be optimized to account for the memory hierarchy, and they make it possible to achieve high efficiency on diverse modern machines.

In LAPACK, matrix [A], [L] and [U] can be partitioned as

[A] = \begin{bmatrix} [A11] & [A12] \\ [A21] & [A22] \end{bmatrix},  [L] = \begin{bmatrix} [L11] & [0] \\ [L21] & [L22] \end{bmatrix},  [U] = \begin{bmatrix} [U11] & [U12] \\ [0] & [U22] \end{bmatrix}    (2.6)

where [A11], [L11], [U11] ∈ ℂ^(k×k), k is the block size (see also Section 2.3.2), and [0] is the null matrix. This leads to the following set of equations:

\begin{bmatrix} [A11] \\ [A21] \end{bmatrix} = \begin{bmatrix} [L11] \\ [L21] \end{bmatrix} [U11]    (2.7)

[A12] = [L11] [U12]    (2.8)

[A22] − [L21] [U12] = [L22] [U22]    (2.9)

The LU factorization algorithm can now be formulated by overwriting a panel of matrix elements of width k with its LU factorization [Eq. (2.7)], followed by solution of the triangular system with multiple right-hand sides [Eq. (2.8)]. Then the bulk of the computation is in updating [A22], using a matrix–matrix multiplication, followed by a recursive factorization of the result [Eq. (2.9)]. During the updating of the elements of [A22], 2k(M−k)² operations are performed on (M² − k²) data: the update requires (M−k)·k·(M−k) multiplications plus (M−k)·(k−1)·(M−k) additions for calculating [L21][U12], plus (M−k)² subtractions from the elements of [A22]. This allows the data to be brought into cache, after which a large amount of computation occurs before it is returned to the memory. Thus, in this procedure, the bottleneck to memory access is overcome [1].
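
The structure of Eqs. (2.7)–(2.9) can be sketched in a few lines of Python/NumPy, as shown below. Pivoting is deliberately omitted so that the three steps (panel factorization, triangular solve, trailing-matrix update) remain visible; the function names are illustrative only and this is not the LAPACK implementation.

    import numpy as np

    def lu_unblocked(A):
        # In-place LU of a panel without pivoting; A becomes {L\U}, with the unit
        # diagonal of L not stored.
        m, n = A.shape
        for c in range(min(m, n)):
            A[c+1:, c] /= A[c, c]
            A[c+1:, c+1:] -= np.outer(A[c+1:, c], A[c, c+1:])

    def lu_blocked(A, k=64):
        # Blocked LU factorization (no pivoting) following Eqs. (2.7)-(2.9).
        M = A.shape[0]
        for j in range(0, M, k):
            b = min(k, M - j)
            lu_unblocked(A[j:, j:j+b])                                  # Eq. (2.7): factor the panel [A11; A21]
            if j + b < M:
                L11 = np.tril(A[j:j+b, j:j+b], -1) + np.eye(b)          # unit lower triangular [L11]
                A[j:j+b, j+b:] = np.linalg.solve(L11, A[j:j+b, j+b:])   # Eq. (2.8): [U12]
                A[j+b:, j+b:] -= A[j+b:, j:j+b] @ A[j:j+b, j+b:]        # Eq. (2.9): update [A22]
        return A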

LAPACK obtains its high performance by using algorithms that do most of the work in calls to the BLAS, with the emphasis placed on matrix–matrix multiplications. (BLAS, the Basic Linear Algebra Subprograms, is an application programming interface standard for publishing libraries that perform basic linear algebra operations such as vector and matrix multiplication. They are used to build larger packages such as LAPACK.) Since the mid-1980s, researchers have been pursuing high-performance implementations of matrix multiplication. One way to accomplish this is to organize the computation around an "inner kernel," [C] = [A]^T [B] + [C], which keeps one of the operands in the L1 cache while streaming parts of the other operands through that cache. Variants include approaches that extend this principle to multiple levels of cache or that apply the same principle to the L2 cache while essentially ignoring the L1 cache. The purpose of this tuning is to minimize the time spent moving data between the layers of the memory hierarchy [7].

2.3 PARALLEL IMPLEMENTATION OF AN IN-CORE SOLVER

2.3.1 ScaLAPACK — Scalable Linear Algebra PACKage

ScaLAPACK is the largest and most flexible public-domain library with basic numerical operations for distributed-memory parallel systems to date. It can solve problems associated with systems of linear equations, linear least-squares problems, eigenvalue problems, and singular value decomposition. ScaLAPACK can also handle many associated computations such as matrix factorizations and estimation of the condition number of a matrix.

ScaLAPACK is a parallel version of LAPACK in both function and software design. Like LAPACK, the ScaLAPACK routines are based on block-partitioned algorithms in order to minimize the frequency of data movement between different levels of the memory hierarchy. The fundamental building blocks of the ScaLAPACK library are distributed-memory versions of the level 1, level 2, and level 3 BLAS [8], called the Parallel BLAS (PBLAS) [9], and a set of Basic Linear Algebra Communication Subprograms (BLACS) [10] for communication tasks that arise frequently in parallel linear algebra computations. In the ScaLAPACK routines, the majority of interprocessor communication occurs within the PBLAS, so the source code of the top software layer of ScaLAPACK resembles that of LAPACK.

Figure 2.1 describes the ScaLAPACK software hierarchy [11]. The components below the dashed line, labeled “Local,” are called on a single process, with arguments stored on a single process only. The components above the dashed line, labeled “Global,” are synchronous parallel routines, whose arguments include matrices and vectors distributed across multiple processes. The components below the solid line are machine-specific, while those above the solid line are machine-independent. Each component in Figure 2.1 is described in the following with the various acronyms defined.


Figure 2.1. ScaLAPACK software hierarchy.

MPI. The Message Passing Interface (MPI) is a library specification for message passing, proposed as a standard by a broadly based committee of vendors, implementers, and users. It is a language-independent communications protocol used to program parallel computers. The MPI interface is meant to provide essential virtual topology, synchronization, and communication functionality between a set of processes (that have been mapped to nodes/servers/computer instances) in a language-independent way, with language-specific syntax (bindings), plus a few features that are language-specific. MPI functions include, but are not limited to, point-to-point rendezvous-type send/receive operations, choosing between a Cartesian or graph-like logical process topology, exchanging data between process pairs (send/receive operations), combining partial results of computations (gather and reduction operations), synchronizing nodes (barrier operations), and obtaining network-related information (such as the number of processes and the neighboring processes accessible in a logical topology). MPI programs always work with processes, although usually people talk about processors. When one tries to achieve maximum performance, one process per processor/core is selected as part of the mapping activity. This mapping activity occurs at runtime, through the agent that starts the MPI program, normally called mpirun or mpiexec [12, 13].
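
As a minimal illustration of these concepts (a sketch using the mpi4py Python bindings, not code from HOBBIES), the fragment below creates a 2 × 3 Cartesian process grid with cyclic boundaries and combines partial results with a reduction; it assumes that it is launched on six processes, for example with mpirun -np 6.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # 2 x 3 logical process grid, periodic (cyclic) in both directions
    cart = comm.Create_cart(dims=[2, 3], periods=[True, True])
    my_row, my_col = cart.Get_coords(rank)

    # combine partial results of a computation (here, simply the ranks) by a global sum
    total = comm.allreduce(rank, op=MPI.SUM)
    print(f"process {rank} of {size} at grid position ({my_row}, {my_col}); sum of ranks = {total}")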

BLAS. BLAS (Basic Linear Algebra Subprograms [8]) include subroutines for common linear algebra computations such as dot product, matrix–vector multiplication, and matrix–matrix multiplication. An important aim of the BLAS is to provide a portability layer for computation. As is well known, using matrix–matrix operations (in particular, matrix multiplication) tuned for a particular architecture can mask the effects of the memory hierarchy (cache misses, Translation Look-aside Buffer (TLB) misses, etc.) and permit floating-point operations to be performed near peak speed of the machine. Optimized versions of the BLAS can be found at http://www.tacc.utexas.edu/resources/software/#blas.

BLACS. BLACS (Basic Linear Algebra Communication Subprograms [10]) is a message-passing library designed for linear algebra. An important aim of the BLACS is to provide a portable, linear-algebra-specific layer for communication. The computational model consists of a one- or two-dimensional process grid, where each process stores pieces of the matrices and vectors. The BLACS include synchronous send/receive routines to communicate a matrix or submatrix from one process to another, to broadcast submatrices to many processes, or to compute global reductions (sums, maxima, and minima). There are also routines to construct, change, or query the process grid. Since several ScaLAPACK algorithms require broadcasts or reductions among different subsets of processes, the BLACS permit a process to be a member of several overlapping or disjoint process grids, each one labeled by a context. Some message-passing systems, such as MPI, also include this context concept; MPI calls this a "communicator." The BLACS provide facilities for safe interoperation of system contexts and BLACS contexts.

LAPACK. LAPACK, or Linear Algebra PACKage [6], is a collection of routines for solving linear systems, least-squares problems, eigenvalue problems, and singular value problems. High performance is attained by using algorithms that do most of their work in calls to the BLAS, with an emphasis on matrix–matrix multiplication. Each routine has one or more performance tuning parameters, such as the sizes of the blocks operated on by the BLAS. These parameters are machine-dependent and are obtained from a table defined when the package is installed and referenced at runtime.

The LAPACK routines are written as a single thread of execution. LAPACK can accommodate shared-memory machines, provided parallel BLAS are available (in other words, the only parallelism is implicit in calls to BLAS). More detailed information about LAPACK can be found at http://www.netlib.org/lapack/.

PBLAS. To simplify the design of ScaLAPACK, and because BLAS have proved to be quite useful tools outside LAPACK, the authors of ScaLAPACK chose to build a Parallel set of BLAS, called PBLAS, which perform message passing and whose interface is as similar to the BLAS as possible. This decision has permitted the ScaLAPACK code to be quite similar, and sometimes nearly identical, to the analogous LAPACK code. Further details of PBLAS can be found in [9].

ScaLAPACK also contains additional libraries to treat distributed matrices and vectors. One is the tools library, which offers useful routines, for example, to find out which part of the global matrix a local process has in its memory or to identify the global index of a matrix element corresponding to its local index and vice versa.

2.3.2 ScaLAPACK: Two-Dimensional, Block-Cyclic Matrix Distribution

For reasons of performance and load balancing, ScaLAPACK uses a two-dimensional, block-cyclic distribution of full matrices (readers may refer to the ScaLAPACK User's Guide). First, the matrix is partitioned into blocks of size mb × nb. These blocks are then uniformly distributed across the Pr × Pc process grid in a cyclic manner. As a result, every process owns a collection of blocks, which are contiguously stored in a two-dimensional "column-major" array. This local storage convention allows the ScaLAPACK software to use local memory efficiently by calling BLAS3 routines for submatrices that may be larger than a single mb × nb block.

Here, the parallel matrix distribution of ScaLAPACK, known as block-cyclic distribution, is briefly reviewed. The Pr × Pc processes available to the application are viewed as filling a logical two-dimensional grid of Pr rows and Pc columns.

Given the matrix [A], the block [Aij] is mapped to the process at the {(i−1) mod Pr, (j−1) mod Pc} position of the process grid, i.e., the process in row {(i−1) mod Pr} and column {(j−1) mod Pc}, where “mod” represents the modulo operator.

For example, consider a 9 × 9 matrix that is distributed using a 2 × 2 block size over six processes, which are mapped into a 2 × 3 two-dimensional MPI virtual topology grid with a cyclic boundary condition along its two directions, as shown in Figure 2.2 (a). After this arrangement in Figure 2.2 (a), the 9 × 9 matrix is divided into 2 × 2 blocks (the boxes are marked by solid gray lines; the dashed gray lines associated with the rightmost and the bottommost boxes indicate that these blocks may not be fully filled); these blocks are then assigned to the six processes, and the resulting local matrix stored by each process is shown in Figure 2.2 (b).

Figure 2.2 (c) shows another case, in which the 9 × 9 matrix is divided into blocks of size 3 × 2, across a 2 × 2 process grid. Figure 2.2 (d) shows the matrix distribution on each process resulting from the distribution in Figure 2.2 (c). In Figure 2.2 (c), the outermost numbers denote the indices for the process rows and columns. Each solid gray line encloses a block, similar to that shown in Figure 2.2 (a), while the block size is changed to be 3 × 2; the dashed line in the rightmost boxes indicates that those blocks are not fully filled.
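
The mapping just described is easy to express in code. The short Python sketch below (the function name block_cyclic_owner is illustrative) computes, with 0-based indices, which process owns each element of the 9 × 9 matrix of Figure 2.2 (a) for 2 × 2 blocks on a 2 × 3 process grid; changing the block size and grid to 3 × 2 and 2 × 2 reproduces the case of Figure 2.2 (c).

    def block_cyclic_owner(i, j, mb, nb, Pr, Pc):
        # Process coordinates (pr, pc) owning global element (i, j), 0-based,
        # under a two-dimensional block-cyclic distribution with mb x nb blocks.
        return (i // mb) % Pr, (j // nb) % Pc

    M = N = 9
    owners = [[block_cyclic_owner(i, j, 2, 2, 2, 3) for j in range(N)] for i in range(M)]
    for row in owners:
        print(row)        # each entry is the (row, column) position in the process grid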


Figure 2.2. A matrix and its distribution using ScaLAPACK: (a) a 9×9 matrix is divided into 2×2-sized blocks, (b) the rearrangement of the blocks in (a) for each process in a 2×3 process grid, (c) a 9×9 matrix is divided into 3×2-sized blocks, (d) the rearrangement of the blocks in (c) for each process in a 2×2 process grid.

2.4 DATA DECOMPOSITION FOR AN OUT-OF-CORE SOLVER

The reason for developing an out-of-core algorithm is that the matrix is too large to be stored in the main memory of the system. When the matrices involved are too large to fit in the main memory of the computer, they must be stored on the hard disks. When filling the matrix, one portion of the matrix is computed at a time, and that portion is written to the hard disk, then another portion of the matrix is computed and written, and this procedure is repeated until the complete matrix is evaluated. For the in-core matrix filling algorithm, the entire matrix is filled in one step, but need not be written out. The main idea of designing an out-of-core filling algorithm is to modify the in-core filling algorithm structure and fill a portion of the matrix at a time instead of the whole matrix. When performing an out-of-core LU factorization, each portion of the matrix is read into the RAM and the LU decomposition is started. On completion, the result of the LU factorization of this portion of the matrix is written back to the hard disk. The code then proceeds with the next portion of the matrix until the entire matrix is LU factored. This is the basic mechanism of an out-of-core LU factorization, which will be further discussed in the following sections.

An important point for an out-of-core algorithm is that during the evaluation of the elements of the matrix and its solution through the LU decomposition, the original matrix is decomposed into a set of smaller matrices that can be fitted in-core. To do this, the following assumptions are necessary. The (double-precision complex) matrix has M rows and columns and requires Nstorage = M × M × 16 bytes of hard-disk storage. The computer system has MRAM bytes of in-core buffer available to each process. To complete the decomposition of the matrix, each process will handle a specific number of portions Islab, which must satisfy the equation

Islab = ⌈ Nstorage / (p × MRAM) ⌉    (2.10)

where p is the total number of processes, and the ceiling function ⌈·⌉ returns the smallest integer greater than or equal to its argument.

When the decomposition is oriented along the columns, each in-core fill handles a slab of the matrix, as shown in Figure 2.3, consisting of all M rows of the matrix by the number of columns that will fit in the total amount of in-core buffer available, p × MRAM. Note that the value of MRAM for the in-core buffer memory is not required to be the same as the physical memory available to each process on the computer. To obtain the best performance of an out-of-core code, the value of MRAM should always be less than the size of the physical memory available to each process. This is because the operating system needs some memory resources, and if the out-of-core code uses up all of the physical memory, the system will start to use virtual memory. Note that using the virtual memory of the computer degrades the performance.


Figure 2.3. Data decomposition for storing a matrix in the out-of-core mode.

The width of the ith out-of-core slab Ki, as illustrated in Figure 2.3, is bounded by

Ki × M × 16 ≤ p × MRAM    (2.11)

At the last out-of-core fill (i = Islab), the number of unfilled columns is

K_Islab = M − (K1 + K2 + ⋯ + K_(Islab−1))    (2.12)
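
These relations translate directly into a few lines of Python; the sketch below (with the illustrative name slab_partition and arbitrary sample numbers) returns the number of slabs and the width of each slab for a double-precision complex matrix of order M.

    import math

    def slab_partition(M, p, M_RAM, bytes_per_element=16):
        # Columns of the M x M matrix fitting into the combined in-core buffer p * M_RAM, per Eq. (2.11)
        cols_per_slab = (p * M_RAM) // (M * bytes_per_element)
        # Number of out-of-core slabs, Eq. (2.10) expressed in columns
        I_slab = math.ceil(M / cols_per_slab)
        # Slab widths K_i; the last slab takes the remaining columns, Eq. (2.12)
        widths = [min(cols_per_slab, M - i * cols_per_slab) for i in range(I_slab)]
        return I_slab, widths

    # Example: 100,000 unknowns, 8 processes, 2 GiB of in-core buffer per process
    print(slab_partition(100_000, 8, 2 * 2**30))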

When only one process is used, the out-of-core matrix filling can easily be done with a slight modification of the serial code. However, when more processes are used, the assignment to fill each slab is distributed over the p processes by using the block distribution scheme of ScaLAPACK. Thus, the matrix filling scheme needs to be designed in such a way as to avoid redundant calculations and obtain better parallel efficiency.

The parallel out-of-core matrix filling algorithm for MoM using higher order basis functions will be explained later in this chapter, based on the data distribution given here.

2.5 ONE-SLAB, LEFT-LOOKING, OUT-OF-CORE LU ALGORITHM

Assuming that the out-of-core filling of the matrix is completed, the next step involves performing the out-of-core LU factorization. It is well known that high performance, portable across different computer platforms, can be achieved by designing algorithms in terms of matrix–matrix multiplications. Such algorithms are called blocked algorithms.

With the relationships presented in Eq. (2.6), variants can be developed by postponing the formation of certain components and manipulating the order in which they are formed. Two natural variants occur: the left-looking and the right-looking algorithms. The terms “left” and “right” refer to the regions of the data access, as shown in Figure 2.4.

The shaded parts in Figure 2.4 indicate the matrix elements accessed when forming a block row or column. The darker shaded parts indicate the block row or column being modified. Assuming the matrix is partitioned into 3 × 3 blocks, the left-looking variant, shown in Figure 2.4 (a), computes one block column at a time by using previously computed columns, while the right-looking variant, shown in Figure 2.4 (b), computes one block row and one block column at each step and uses them to update the trailing submatrix [the trailing submatrix is the shaded block at the lower right of Figure 2.4 (b)]. The submatrices in the darker shaded parts represent the LU factors that will be obtained at the current step. By repartitioning the matrix after each step, the calculation can continue until the whole matrix is factorized.


Figure 2.4. Two different memory access patterns for LU factorization: (a) data access in the left-looking LU algorithm, (b) data access in the right-looking LU algorithm.

If reads from and writes to the hard disk take similar amounts of time, then the left-looking algorithm performs less input/output (I/O) than the right-looking algorithm. Therefore, a left-looking algorithm is better suited than a right-looking algorithm for an out-of-core LU factorization. In the following section, we will briefly review the one-slab, left-looking, out-of-core LU algorithm [14].

Given a general M × N matrix [A], its LU factorization, when employing partial pivoting, is given by

[P] [A] = [L] [U]    (2.13)

where [P] is a permutation matrix of order M, [L] is an M × N lower triangular matrix, and [U] is an N × N upper triangular matrix. Denote the computation of [P], [L], and [U] by

[{LU}, [p]] := LU([A])    (2.14)

where {LU} is the matrix whose strictly lower and upper triangular parts are [L] and [U] , respectively. Here it is recognized that [L] has ones on the diagonal, which need not be stored, so the diagonal elements of {LU} belong to [U] . The factors [L] and [U] can be stored by overwriting the original contents of [A]. The permutation matrix is generally stored in a vector [p] of M integers.

The term slab solver is derived from the fact that the matrix is viewed as a collection of slabs consisting of K adjacent columns. The first slab is brought from the hard disk (out-of-core) into the memory (in-core). It is first factored and written back to the hard disk. The second slab is brought into memory and its forward substitution with the first slab starts, requiring the first slab to be also loaded into the memory, although not all at once, as will be seen next. Once updated, the second slab is factored and written back to the disk. The procedure continues with all the remaining slabs. When slab j is reached, it is loaded into the memory. Forward substitution with slabs 1, …,j−1, continues in order, after which slab j is factored and written back to the disk.

The primary reason that the left-looking algorithm is chosen at the top level is that it requires less I/O than a right-looking algorithm. The bulk of the I/O lies in reading the prior slabs (the part of the matrix to the left of the current slab, which needs only to be read) during the forward substitution stage.
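
Before presenting the complete algorithm, the I/O pattern just described can be sketched in Python/NumPy with the matrix kept on disk through numpy.memmap. The sketch ignores pivoting and parallelism and keeps only the structure described above: the current slab is read, updated by the earlier slabs one K-wide panel at a time, factored in-core, and written back. The file layout (column-major, complex double precision) and the function name are assumptions for illustration only.

    import numpy as np

    def ooc_lu_left_looking(path, M, K):
        # One-slab, left-looking, out-of-core LU without pivoting.
        A = np.memmap(path, dtype=np.complex128, mode="r+", shape=(M, M), order="F")
        for j in range(0, M, K):                          # loop over the slabs
            w = min(K, M - j)
            slab = np.array(A[:, j:j+w])                  # read slab j into RAM
            for i in range(0, j, K):                      # forward substitution with prior slabs
                panel = np.array(A[:, i:i+K])             # read one earlier panel at a time
                L11 = np.tril(panel[i:i+K, :], -1) + np.eye(K)
                slab[i:i+K, :] = np.linalg.solve(L11, slab[i:i+K, :])
                slab[i+K:, :] -= panel[i+K:, :] @ slab[i:i+K, :]
            S = slab[j:, :]                               # factor the updated part of the slab in-core
            for c in range(w):
                S[c+1:, c] /= S[c, c]
                S[c+1:, c+1:] -= np.outer(S[c+1:, c], S[c, c+1:])
            A[:, j:j+w] = slab                            # write the factored slab back to disk
        A.flush()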

The complete left-looking, out-of-core algorithm is given in Figure 2.5. The following notation is used to characterize the following matrices: [ATL] and [ABL] are the top left and the bottom left partitions of [A], respectively; [ATR] and [ABR] are the top right and the bottom right partitions, respectively; [pT] and [pB] are the top and bottom partitions of the pivot vector [p], respectively; [BT] is the top partition and [BB] is the bottom partition of [B]; and n([ATL]) and n([A]) are the column dimensions of the matrices [ATL] and [A], respectively. K is called the block size in the algorithm, and it corresponds to the width of the slab.


Figure 2.5. One-slab blocked, left-looking algorithm for the LU factorization with partial pivoting: (a) one-slab, left-looking LU factorization algorithm, (b) forward substitution algorithm in the left-looking LU algorithm.

The thick and thin lines inside the matrices and vectors are used to explain how the matrix has been partitioned. At the end of each iteration of the repartitioning loop for the matrix, the thick line will move to the position of the thin line, as will be illustrated next.

In the rest of this chapter, the row and column indices of matrices start from 0 rather than 1 (which was used in the discussions above) for an easier C-language implementation of the algorithm. The left-looking, out-of-core algorithm is described as follows (ignoring pivoting for now).

Partition the matrices as

[A] = \begin{bmatrix} [A00] & [A01] & [A02] \\ [A10] & [A11] & [A12] \\ [A20] & [A21] & [A22] \end{bmatrix}

[L] = \begin{bmatrix} [L00] & [0] & [0] \\ [L10] & [L11] & [0] \\ [L20] & [L21] & [L22] \end{bmatrix}

[U] = \begin{bmatrix} [U00] & [U01] & [U02] \\ [0] & [U11] & [U12] \\ [0] & [0] & [U22] \end{bmatrix}

We then obtain

[A01] = [L00] [U01]
[A11] = [L10] [U01] + [L11] [U11]    (2.15)
[A21] = [L20] [U01] + [L21] [U11]

Now, assume that before and after each iteration of the loop, the submatrices to the left of the thick line have been overwritten with the corresponding submatrices of [L] and [U] [see Figure 2.5 (a), below the line marked "Repartition"]. In other words, [A00], [A10], and [A20] have been overwritten by {LU}00, [L10], and [L20], respectively:

[A] → \begin{bmatrix} \{LU\}00 & [A01] & [A02] \\ [L10] & [A11] & [A12] \\ [L20] & [A21] & [A22] \end{bmatrix}

To continue with the solution process, [U01], {LU}11, and [L21] must overwrite the corresponding parts of [A]. Equation (2.15) outlines the necessary computations. The current panel must first be updated with the results of the computations that precede it through:

[A01] ← [U01] = (tril[A00])^{-1} [A01]
[A11] ← [A11] − [A10] [A01]
[A21] ← [A21] − [A20] [A01]

The notation tril denotes the lower triangular part of a matrix with the diagonal elements replaced by ones, such that it becomes a unit lower triangular matrix, as required for matrix [L].

For the case where pivoting occurs in the computation that precedes the current panel, the computation is similar to the forward substitution, shown in the operation FSubLL of Figure 2.5 (b), where the computation is interleaved with the application of permutations. Subsequently, the updated matrix {[A11] [A21]}T is factored, and the thick lines are moved, since another panel of columns has now been completed [see Figure 2.5 (a), below "Continue with"]. Here { }T implies that the submatrices are in the transpose position.

During the forward substitution with FSubLL, the computation is interleaved with the periodic application of row exchanges [the multiplication by the permutation matrix P([p1])] . Therefore, the forward substitution routine in Figure 2.5 (b) corresponds to that in the right-looking algorithm in Figure 2.6 (b) assuming that pivoting has been applied before the computation. The lines inside the matrices and vectors in Figure 2.6 are used to show the partitioning of the matrix, which is similar to the lines given in Figure 2.5.

For the factorization of the slab, once the slab has been updated, it is better to use a right-looking algorithm. The blocked, in-core, right-looking algorithm for LU factorization with partial pivoting is shown in Figure 2.6 (a). The algorithm assumes that, at the start of the loop, [ATL] and [ATR] have been overwritten by {LU}TL and [UTR], respectively, and [ABL] by [LBL]. The matrix [PT] has been obtained, and [ABR] has been updated so that only its LU factorization remains to be computed. The algorithm proceeds by computing an LU factorization with partial pivoting of the current panel, overwriting [A11] and [A21] with {LU}11 and [L21], respectively. It then applies the row swaps to the remainder of the rows and updates [A12] with

[U12] = ([L11])^{-1} [A12]


Figure 2.6. Blocked right-looking algorithm for the LU factorization with partial pivoting: (a) right-looking LU factorization algorithm, (b) forward substitution algorithm in the right-looking LU algorithm.

Finally, [A22] is updated with [A22]−[L21][U12], after which [A22] must be factored during future iterations.
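
An in-core sketch of this blocked, right-looking factorization with partial pivoting is given below in Python/NumPy (it is not the ScaLAPACK routine): the current panel is factored with row interchanges, the same interchanges are then applied to the columns outside the panel, and finally the triangular solve and trailing-matrix update are performed.

    import numpy as np

    def lu_blocked_pivoted(A, k=64):
        # Blocked right-looking LU with partial pivoting; A is overwritten with {L\U}
        # and the pivot rows are returned in a list (one global row index per column).
        M = A.shape[0]
        piv = []
        for j in range(0, M, k):
            b = min(k, M - j)
            panel = A[j:, j:j+b]
            swaps = []
            for c in range(b):                        # unblocked, pivoted factorization of the panel
                p = c + int(np.argmax(np.abs(panel[c:, c])))
                swaps.append(p)
                if p != c:
                    panel[[c, p], :] = panel[[p, c], :]
                panel[c+1:, c] /= panel[c, c]
                panel[c+1:, c+1:] -= np.outer(panel[c+1:, c], panel[c, c+1:])
            for c, p in enumerate(swaps):             # apply the same row swaps outside the panel
                if p != c:
                    A[[j+c, j+p], :j] = A[[j+p, j+c], :j]
                    A[[j+c, j+p], j+b:] = A[[j+p, j+c], j+b:]
                piv.append(j + p)
            if j + b < M:
                L11 = np.tril(A[j:j+b, j:j+b], -1) + np.eye(b)
                A[j:j+b, j+b:] = np.linalg.solve(L11, A[j:j+b, j+b:])   # [U12] = ([L11])^{-1}[A12]
                A[j+b:, j+b:] -= A[j+b:, j:j+b] @ A[j:j+b, j+b:]        # update [A22]
        return A, piv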

The reason for using the right-looking algorithm in the one-slab, left-looking algorithm is somewhat subtle. On a sequential architecture, the right-looking algorithm involves most computation in terms of the update

[A22] ⇐ [A22] − [A21][A12]

where [A21] consists of b columns. This particular shape of matrix–matrix multiplication inherently achieves better performance than the shape encountered in the left-looking algorithm. On a parallel architecture, either shared memory or distributed memory, the shape of the matrix–matrix multiplication encountered by the right-looking algorithm also parallelizes more naturally, since it requires only the communication of parts of [A21] and [A12] among the processes, after which non-overlapping parts of [A22] can be updated in parallel. What makes the solution procedure in Figure 2.5 a one-slab solver is the observation that when slab j is being updated by slab i (i < j), it is not necessary to bring slab i completely into the memory. Instead, that slab is brought in one panel (b columns) at a time.

A distributed memory parallel implementation of the algorithm proceeds exactly as described above, except that the matrix to be factored is distributed among the processes and parallel implementations of the various in-core operations are employed.

The details of ScaLAPACK, a widely used parallel in-core linear algebra library package, have already been discussed. It uses a block-cyclic distribution to map an in-core matrix (in RAM) onto the nodes. The same block-cyclic distribution of matrices used in ScaLAPACK can also be used to map an out-of-core matrix (on the hard disk) onto the nodes. When bringing a slab or a panel into memory, by design it is guaranteed that the in-core copy of the submatrix is on the same node as the out-of-core copy, so that reading or writing needs only to access the local disk attached to that process.

2.6 SOLVING A MATRIX EQUATION USING THE OUT-OF-CORE LU MATRICES

After the matrix [A] has been factored into the [L] and [U] matrices, the matrix equation given by Eq. (2.2) can be solved by first performing a forward substitution given in Eq. (2.4), and then a backward substitution as specified in Eq. (2.5).

The out-of-core algorithm for solving the matrix equation using the LU-factored results is shown in Figure 2.7 (a), where a general matrix equation is assumed to have multiple vectors on the right-hand side. The thick and the thin lines inside the matrices and vectors are used as an aid to explain the partitioning of the matrix. At the end of each iteration of the repartitioning loop for the matrix, the thick line will move to the position of the thin line, as will be discussed next. Figure 2.7 (b) presents one possible backward substitution algorithm used in Figure 2.7 (a). The notation triu denotes the upper triangular part of a matrix, and [XT] and [XB] are the top and bottom partitions of [X]. The other notations, [ATL], [ABL], [ATR], [ABR], [BT], [BB], and K, are the same as those discussed in Section 2.5.

Since the same FSubLL, as described in Section 2.5, can be used as one possible forward substitution procedure, as shown in Figure 2.5, the forward substitution will not be treated further here. The focus will be on the backward substitution procedure [in Figure 2.7 (b)], which starts from partitioning the matrix as follows:

[A] = \begin{bmatrix} [A00] & [A01] & [A02] \\ [A10] & [A11] & [A12] \\ [A20] & [A21] & [A22] \end{bmatrix},  [B] = \begin{bmatrix} [B0] \\ [B1] \\ [B2] \end{bmatrix},  [X] = \begin{bmatrix} [X0] \\ [X1] \\ [X2] \end{bmatrix}    (2.16)

It results in the following equation:

triu([A00]) [X0] + [A01] [X1] + [A02] [X2] = [B0]
triu([A11]) [X1] + [A12] [X2] = [B1]    (2.17)
triu([A22]) [X2] = [B2]

Now, assume that before and after each iteration of the loop, the [B] submatrices below the thick line have been overwritten with the corresponding [X] submatrices. In other words, [B2] has been overwritten with [X2]. To continue with the solution procedure, [X1] must overwrite the corresponding parts of [B1].


Figure 2.7. An out-of-core matrix equation solver: (a) the algorithm for solving the matrix equation, (b) the backward substitution algorithm.

Equation (2.17) indicates the required computation; blocks [B1] and [B0] must first be updated through:

[B1] ⇐ [X1] = (triu[A11])^{-1} [B1], and [B0] ⇐ [B0] − [A01] [B1].

Subsequently, the thick lines are moved upward for [B] and [X], and upward and left for [A], since another [B] submatrix can now be processed.

Note that [X] and its partition, given in Eqs. (2.16) and (2.17), are employed only for a more accessible description of the procedure. The vector [X] does not necessarily exist in the code when programming the algorithm, because the memory used for [B1] can be reused to hold the corresponding [X1] at each step.
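
One possible blocked backward substitution, following the updates just described and overwriting [B] with [X], is sketched below in Python/NumPy. For simplicity the whole factored matrix is passed in as one array; in the out-of-core code only the blocks that are actually needed would be read from the hard disk.

    import numpy as np

    def backward_substitution_blocked(A, B, K):
        # Solve triu(A) X = B block by block, from the bottom up; B is overwritten with X.
        M = A.shape[0]
        start = ((M - 1) // K) * K                    # first column of the bottom-most block
        for j in range(start, -1, -K):
            b = min(K, M - j)
            U11 = np.triu(A[j:j+b, j:j+b])
            B[j:j+b, :] = np.linalg.solve(U11, B[j:j+b, :])   # [B1] <- [X1] = (triu[A11])^{-1}[B1]
            B[:j, :] -= A[:j, j:j+b] @ B[j:j+b, :]            # [B0] <- [B0] - [A01][B1]
        return B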

2.7 PARALLEL IN-CORE AND OUT-OF-CORE MATRIX FILLING SCHEMES

In applying the MoM, an integral equation is first discretized into a matrix equation by using a set of expansion and testing functions. To obtain the current distribution on the structure for a given excitation, this matrix equation needs to be solved. Hence, the parallelization of the solution procedure involves two steps. The first step is the matrix filling, and the second step is the solution of the matrix equation. Both of these must be handled efficiently. Furthermore, efficient parallel matrix filling for MoM with higher order basis functions introduces new challenges and is quite different from the procedure used in a MoM formulation using the more typical subdomain basis functions.

To parallelize the solution of the large, dense matrix in a MoM problem, one typically needs to divide the matrix between processes in such a way that two important conditions are fulfilled: (1) each node should store approximately the same amount of data, and (2) the computational load should be equally distributed among the processes that run on different nodes. How these conditions are fulfilled, which affects both the matrix filling and the matrix solution, is largely determined by the kind of solver chosen. In this chapter, the ScaLAPACK library package is selected to solve the matrix equation, and this choice determines the details of the matrix filling. In the next two sections, the parallel matrix filling scheme is presented, first for the in-core solver and then for the out-of-core solver.

2.7.1 Parallel In-Core Matrix Filling Scheme

The parallelization of the generation of the impedance matrix should be designed to produce a linear speedup with an increase in the number of processes. If the parallelization is carried out appropriately, this linear speedup will be evident in the execution of the code. Distributing the computation of the elements of the matrix enables a user to solve very large problems in a reasonable time. The impedance matrix has the dimension N × N, where N is the number of unknowns. The matrix is dense, and its elements are complex. Each node performing the computation of the elements of the matrix also stores a portion of the complete matrix. The minimum number of nodes required for the solution of the problem is determined by dividing the number of unknowns for the problem by the number of unknowns that can be held in the memory of a single node. A node is the entity that has an Internet Protocol (IP) address (internal or external); each node may have multiple processors, and in the case of multicore processors, each processor can run multiple instances of the code, one on each core. This type of configuration can efficiently utilize large amounts of RAM per node. An IP address is a numerical identification (logical address) assigned to devices participating in a computer network that uses the Internet Protocol for communication between its nodes.

The impedance matrix is constructed by looping over the number of geometric elements (number of wires and plates) and performing the calculation of the elements of the impedance matrix. There are an outer loop and an inner loop that cycle over the number of elements. The iterations of the outer loop are independent of each other, except for the accumulations of the final result that need to take place at the end of the inner loop.

A process is a single serial instance of the code: Several processes run simultaneously in a parallel code. Each process entering the loop participates in the iterative process and performs the task that it is assigned to do, and all the processes go through the iterations in parallel. The results prior to performing the accumulation are saved, and the actual accumulations are performed afterward. All of the calculations to generate the impedance matrix are distributed over the number of processes performing the generation of the complete impedance matrix. This parallel execution provides an increase in the computational speed for matrix filling.

The parallel filling of the matrix could be the same regardless of the choice of the basis functions. However, the matrix filling scheme will be most efficient if the characteristics of the basis functions are taken into account.

In the following paragraphs, a description of an efficient matrix filling scheme, using the higher order basis functions over wires and quadrilateral plates, is provided.

When using higher order basis functions over quadrilateral plates, the surface can be defined using two coordinate directions, p and s. The goal is to find the current components along the p- and s-directions. There is an advantage in using polynomial basis functions, because the intermediate results obtained in evaluating the elements of the impedance matrix for the lower order polynomials can be reused in the computation of the elements associated with the higher order polynomials. This advantage improves the efficiency of the matrix filling for the higher order basis functions employed over wires and quadrilaterals, and it can be implemented quite easily in both the serial and parallel codes.

For parallel matrix filling, an additional improvement can be made to further increase the efficiency of the code. The objective is to eliminate redundant calculations within each process. For the most efficient code, this concept can be applied regardless of the choice of the basis functions; however, the specific details of the implementation differ for different basis functions. The pseudocode describing an efficient scheme to fill the impedance matrix for higher order basis functions is shown in Figure 2.8. In this scheme, within each process, redundant calculations related to the evaluation of the potential functions are avoided by using a flag, set true or false, for each order of the polynomial on a geometric element. This flag is initially set to false. After a process performs an integration, the flag is changed to true. The flag is checked before each integration required for the elements of the impedance matrix, and if the flag is true, the redundant calculation is avoided and the previously calculated value is retrieved from the RAM.
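
The flag mechanism can be imitated in a few lines of Python, as in the toy sketch below. Here potential_integral is a hypothetical stand-in for the actual potential-function integration over a wire or a plate, the cache plays the role of the flag table together with the values stored in RAM, and the ownership test mirrors the block-cyclic distribution of Section 2.3.2; none of this is the HOBBIES implementation.

    import numpy as np
    from functools import lru_cache

    @lru_cache(maxsize=None)                 # cache = flag table plus stored integral values
    def potential_integral(element, order):
        # hypothetical stand-in for the potential-function integration
        return np.exp(1j * order) / (1.0 + element)

    def owner(m, n, mb, nb, Pr, Pc):
        # process coordinates owning entry (m, n) under the block-cyclic distribution
        return (m // mb) % Pr, (n // nb) % Pc

    def fill_local_part(N, my_coords, mb, nb, Pr, Pc, element_of_basis):
        # Each process computes only the impedance-matrix entries it owns; any integral
        # already evaluated for a given (element, order) pair is simply reused.
        local = {}
        for m in range(N):
            for n in range(N):
                if owner(m, n, mb, nb, Pr, Pc) == my_coords:
                    local[(m, n)] = (potential_integral(element_of_basis(m), 1)
                                     * potential_integral(element_of_basis(n), 2))
        return local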


Figure 2.8. A parallel in-core matrix filling scheme.

When dealing with the right-hand sides of the matrix equation in MoM, the parallel matrix filling algorithm is much easier to design, compared with the parallel filling algorithm for the impedance matrix.

Load balancing is critical to obtain an efficient operation of a parallel code. When performed correctly, the matrix filling is executed across all the available resources in an accurate and efficient manner. Little communication between nodes is necessary during the initial matrix filling, and parallel speedup can be carefully tracked to ensure proper implementation.

During the process of finding the solution of the matrix, balancing the computational load between all the processors is also important. However, it is less easy to track, since an increase in the number of unknowns or the number of nodes executing the solution can increase the amount of communication required between the nodes. This increase in communication will decrease the gains of parallel speedup. As a rule of thumb, more nodes typically means less wall time for solving large problems, even though the overall parallel efficiency may sometimes be increased by using fewer nodes to solve the same problem!

2.7.2 Parallel Out-of-Core Matrix Filling Scheme

The reason for developing an out-of-core matrix filling algorithm is to enable one to solve large matrix equations, where the impedance matrix may be too large to be stored in the main memory (RAM) of the system. Compared with the in-core matrix filling algorithm, where the matrix is filled once and kept in the RAM, the main idea of designing an out-of-core algorithm is to fill a portion of the matrix at a time and then write this portion to the hard disk rather than keeping it in the RAM. The pseudocode for the efficient parallel matrix filling scheme is displayed in Figure 2.9.

As shown in Figure 2.9, at the beginning of the matrix filling algorithm, the matrix is partitioned into different slabs. As discussed in Section 2.4, the number of slabs Islab and the width of each slab Ki are determined by the nature of the application and the particular system architecture of the computing platform on which the job is executed.

Each process goes through a loop over the slabs, from 1 to Islab. For the ith out-of-core slab, of width Ki, each process sets the global lower bound (nstart in Figure 2.9) and upper bound (nend in Figure 2.9) of the columns to be filled. For example, for the first slab (i = 1), the global lower bound is 1 and the upper bound is K1. For the second slab, the global lower bound is K1 + 1 and the upper bound is K1 + K2. Each process then fills its portion of the slab in the same way as in the in-core filling algorithm, simply ignoring the columns that fall outside the bounds of the current slab. After every process has filled its portion of the slab, a synchronous write is called to write this portion of the matrix into a file. Then, each process enters the loop iteration corresponding to the next slab. This procedure avoids the calculation of most of the redundant integrals related to the potential functions, which are used to calculate the elements of the impedance matrix; this is accomplished by using a flag for each polynomial order on a geometric element, as described previously for the in-core matrix filling scheme.
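
For a single process (no MPI), the slab loop of Figure 2.9 reduces to the Python sketch below; fill_entry is a hypothetical stand-in for the impedance-matrix integration, the file name is arbitrary, and the slab widths are the values Ki computed in Section 2.4.

    import numpy as np

    def fill_out_of_core(M, widths, fill_entry, path="Z.bin"):
        # The matrix is created on disk; one slab of columns is filled at a time
        # and flushed to the file before the next slab is started.
        Z = np.memmap(path, dtype=np.complex128, mode="w+", shape=(M, M), order="F")
        nstart = 0
        for K in widths:                      # loop over the slabs, i = 1 .. Islab
            nend = nstart + K                 # global column bounds of the current slab
            for n in range(nstart, nend):
                for m in range(M):            # a parallel code fills only locally owned entries
                    Z[m, n] = fill_entry(m, n)
            Z.flush()                         # the synchronous write of this slab
            nstart = nend
        return Z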


Figure 2.9. A parallel out-of-core matrix filling scheme.

Comparing with the in-core matrix filling algorithm, it can be seen that, for each slab, the algorithm is exactly the same. Most of the overhead for filling the out-of-core matrix, excluding that of the in-core matrix calculation for an individual slab, comes from two parts: (1) the calculation of the redundant integrals performed on each process for elements of the impedance matrix that belong to different slabs, and (2) writing the matrix elements to the hard disk.

2.8 CONCLUSION

In this chapter, the data distribution of ScaLAPACK is introduced for factoring a matrix using parallel processing. The data decomposition is then presented for an out-of-core matrix solver, where the storage needed for the matrix exceeds the capacity of the RAM. The one-slab, left-looking, out-of-core algorithm is described, in which the right-looking, in-core algorithm is used to factor each slab; thus, the advantages of both the left-looking and the right-looking algorithms are retained. The in-core and the out-of-core solvers are used in the Higher Order Basis Based Integral Equation Solver (HOBBIES) to solve the matrix equations that result from applying the MoM procedure to an integral equation in the frequency domain. The choice of solver automatically determines the pattern of the parallel in-core or out-of-core matrix filling presented in this chapter.

REFERENCES

[1] T. Cwik, R. A. van de Geijn, and J. E. Patterson, “The Application of Massively Parallel Computation to Integral Equation Models of Electromagnetic Scattering,” Journal of the Optical Society of America A, Vol. 11, pp. 1538–1545, 1994.

[2] T. Cwik and J. E. Patterson, “Computational Electromagnetic and Supercomputer Architecture,” Progress in Electromagnetics Research, Vol. 7, pp. 23–55, 1993.

[3] T. Cwik, “Parallel Decomposition Methods for the Solution of Electromagnetic Scattering Problems,” Electromagnetics, Vol. 12, pp. 343–357, 1992.

[4] Netlib Repository at UTK and ORNL, LINPACK, Collection of Fortran Subroutines. Available at: http://www.netlib.org/linpack/. Accessed Aug. 2008.

[5] J. J. Dongarra, I. S. Duff, D. C. Sorenson, and H. A. van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, SIAM, Philadelphia, 1991.

[6] Netlib Repository at UTK and ORNL, LAPACK — Linear Algebra PACKage. Available at: http://www.netlib.org/lapack/. Accessed Aug. 2008.

[7] Texas Advanced Computing Center, “Goto BLAS,” Software and Tools, The University of Texas at Austin, 2008. Available at: http://www.tacc.utexas.edu/resources/software/#blas. Accessed Aug. 2008.

[8] Netlib Repository at UTK and ORNL, Basic Linear Algebra Subprograms (BLAS). Available at: http://www.netlib.org/blas/. Accessed Aug. 2008.

[9] Netlib Repository at UTK and ORNL, The ScaLAPACK Project, Parallel Basic Linear Algebra Subprograms (PBLAS) Home Page. Available at: http://www.netlib.org/scalapack/pblas_qref.html. Accessed Aug. 2008.

[10] Netlib Repository at UTK and ORNL, Basic Linear Algebra Subprograms (BLACS). Available at: http://www.netlib.org/blacs/. Accessed Aug. 2008.

[11] L. S. Blackford, ScaLAPACK Tutorial. Available at: http://www.netlib.org/scalapack/tutorial/sld053.htm. Accessed Aug. 2008.

[12] Wikipedia contributors, “Message Passing Interface,” Wikipedia, The Free Encyclopedia. Available at: http://en.wikipedia.org/wiki/Message_Passing_Interface. Accessed Aug. 2008.

[13] Message Passing Interface Forum, MPI Documents. Available at: http://www.mpi-forum.org/docs/docs.html. Accessed Aug. 2008.

[14] Y. Zhang and T. K. Sarkar, Parallel Solution of Integral Equation Based EM Problems in the Frequency Domain. IEEE-Wiley Press, Hoboken, NJ, 2009.
