Chapter 9

Augmented Block Cimmino Distributed Algorithm for solving tridiagonal systems on GPU

Y.-C. Chen; C.-R. Lee    National Tsing Hua University, Hsinchu City, Taiwan

Abstract

Tridiagonal systems appear in many scientific and engineering problems, such as Alternating Direction Implicit methods, fluid simulation, and the Poisson equation. This chapter presents the parallelization of the Augmented Block Cimmino Distributed method for solving tridiagonal systems on graphics processing units (GPUs). Because of the special structure of tridiagonal matrices, we investigate the boundary padding technique to eliminate execution branches on GPUs. Various performance optimization techniques, such as memory coalescing, are also incorporated to further enhance the performance. We evaluate the performance of our GPU implementation and analyze the effectiveness of each optimization technique. Speedups of over 24 times are obtained on the GPU compared with the CPU version.

Keywords

Tridiagonal solver; GPU; Parallel algorithm; Numerical linear algebra; Performance optimization

1 Introduction

A tridiagonal matrix has nonzero elements only on the main diagonal, the diagonal above the main diagonal, and the diagonal below the main diagonal. This special structure appears often in scientific computing and computer graphics [1, 2]. Because many of these applications require real-time execution, the solver must compute the result quickly as well as accurately. Many parallel algorithms for solving a tridiagonal system have been proposed [2–6]. These algorithms can give answers with good accuracy and short execution time. Many applications and open-source tridiagonal solvers are based on these algorithms, such as Alternating Direction Implicit methods [1] and the cuSPARSE package [7].

The Augmented Block Cimmino Distributed (ABCD) method [8] is a new type of algorithm to solve linear systems. It partitions the matrix into block rows and constructs the augmented matrices to make each block row orthogonal. Although the augmentation increases the problem size, the orthogonalization makes the parallelization easy, since each augmented block row can be solved independently. Moreover, because of the orthogonal property, each block row can be solved stably without pivoting.

This chapter describes the implementation and performance optimization techniques of the ABCD algorithm for solving tridiagonal systems on a graphics processing unit (GPU). Although the ABCD algorithm can be parallelized easily at the coarse-grained level, fine-grained parallelization is more important for many-core devices such as GPUs. The major challenge is that each block row solver performs relatively few operations, so the optimization of every step in the implementation is critical to the overall performance.

The rest of the chapter is organized as follows: Section 2 introduces the ABCD method for solving tridiagonal systems. Section 3 describes the details of GPU implementation and performance optimization techniques. Section 4 presents the experimental results. The conclusion is given in the last section.

2 ABCD Solver for tridiagonal systems

The ABCD method is a redesign of the block Cimmino algorithm, an iterative method for solving large linear least squares problems. Let A be an m × n matrix and b an m-vector. The linear least squares problem is

\min_x \|b - Ax\|^2 = \min_x \sum_{i=1}^{m} \|e_i^T b - e_i^T A x\|^2

where e_i is the ith column of the m × m identity matrix I. The preceding equation can be written in block form,

\min_x \|b - Ax\|^2 = \min_x \sum_{i=1}^{p} \|E_i^T b - E_i^T A x\|^2

where E_i is the ith block column of the identity matrix. This is equivalent to partitioning the matrix A, as well as the vector b, into p equal-sized row blocks. Fig. 1 illustrates the partition. If we let b_i = E_i^T b and A_i = E_i^T A, the preceding equation becomes

Fig. 1 Partition of the system.

\min_x \|b - Ax\|^2 = \min_x \sum_{i=1}^{p} \|b_i - A_i x\|^2

The original block Cimmino method approximates the solution of each row block in parallel, and iteratively makes the approximations converge to the solution.

If the block rows A_i are mutually orthogonal, which means A_i A_j^T = O for i ≠ j, the exact solution can be obtained in one iteration. Of course, most matrices do not possess such a property. However, we can augment the matrix so that the augmented matrix has this property. For instance, suppose the original matrix A has four row blocks, A = (A_1 A_2 A_3 A_4)^T, in which case we can construct an augmented matrix

\hat{A} = \begin{pmatrix} \hat{A}_1 \\ \hat{A}_2 \\ \hat{A}_3 \\ \hat{A}_4 \end{pmatrix}
        = \begin{pmatrix}
            A_1 &  A_1 &  A_1 &  A_1 &      &      &      \\
            A_2 & -A_2 &      &      &  A_2 &  A_2 &      \\
            A_3 &      & -A_3 &      & -A_3 &      &  A_3 \\
            A_4 &      &      & -A_4 &      & -A_4 & -A_4
          \end{pmatrix}

in which Â_i Â_j^T = O if i ≠ j. Blank entries denote zero blocks, and each of the six augmentation block columns corresponds to one pair of row blocks. In the preceding example, the augmentation increases the matrix width seven times.

The augmentation can be much smaller if the matrix has special structure, such as a tridiagonal structure. Here is an example. Let A be a 9 × 9 tridiagonal matrix and let p = 3, so that A is partitioned into three row blocks.

[A 9 × 9 tridiagonal matrix partitioned into three row blocks A_1 (rows 1–3), A_2 (rows 4–6), and A_3 (rows 7–9).]

To make the block rows mutually orthogonal, we can simply augment the matrix as follows:

[The augmented matrix: for each pair of consecutive row blocks, the two columns they share are appended as augmentation columns, with opposite signs in the two blocks, so that the row blocks become mutually orthogonal.]

Because only a few elements overlap between consecutive row blocks of a tridiagonal matrix, only 2(p − 1) augmentation columns need to be added for p partitions, as sketched below.
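
To see why two columns per interface suffice, consider consecutive blocks A_i and A_{i+1} and let A_i|_S and A_{i+1}|_S denote their restrictions to the two columns S that they share; this notation and the sign convention are ours, introduced only for this sketch, and the actual arrangement in the figures may differ. Appending A_i|_S to block i and −A_{i+1}|_S to block i+1 gives

\hat{A}_i \hat{A}_{i+1}^T = A_i A_{i+1}^T + (A_i|_S)(-A_{i+1}|_S)^T = (A_i|_S)(A_{i+1}|_S)^T - (A_i|_S)(A_{i+1}|_S)^T = O

because every nonzero contribution to A_i A_{i+1}^T comes from the shared columns. Nonconsecutive row blocks share no columns and are already orthogonal, so the p − 1 interfaces require exactly 2(p − 1) extra columns.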

Suppose Ax = b is the original tridiagonal system to solve, where A is an m × m tridiagonal matrix, x is the unknown, and b is an m vector. The augmented system can be written as

\begin{pmatrix} A & C \\ O & I \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} b \\ o \end{pmatrix}

where o denotes a zero vector of appropriate size.

The x part of the solution is the same as the solution of the original problem. The only problem is that the augmented row block (O I) is not orthogonal to the other row blocks in (A C), which makes parallelization difficult. This can be solved by projecting (O I) onto the subspace orthogonal to the row space of (A C). Let Ā = (A C), and let Ā_1, Ā_2, …, Ā_p be the block rows of Ā. The oblique projector of Ā is

P = \sum_{i=1}^{p} \bar{A}_i^T (\bar{A}_i \bar{A}_i^T)^{-1} \bar{A}_i

Therefore we can construct the appending matrix

W = (B \;\; S) = (O \;\; I)(I - P)   (1)

and the augmented right-hand side

f = (B \;\; S) \begin{pmatrix} x \\ o \end{pmatrix}   (2)

The augmented system becomes

\begin{pmatrix} A & C \\ B & S \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} b \\ f \end{pmatrix}   (3)

The solution of Eq. (3) is

\begin{pmatrix} x \\ y \end{pmatrix} = \sum_{i=1}^{p} \bar{A}_i^T (\bar{A}_i \bar{A}_i^T)^{-1} b_i + W^T (W W^T)^{-1} f   (4)

whose upper part x is the solution of the original system.

The ABCD method is summarized in Algorithm 1.

Algorithm 1

Augmented Block Cimmino Distributed Method

1. Determine the number of partitions p and construct the augmented matrix C.

2. Compute u_i = Ā_i^T(Ā_iĀ_i^T)^{-1}b_i, S_i = C_i^T(Ā_iĀ_i^T)^{-1}C_i, f_i = (O I)u_i,
and W_i = Ā_i^T(Ā_iĀ_i^T)^{-1}C_i for i = 1, …, p in parallel.

3. Compute u = Σ_{i=1}^{p} u_i, S = I − Σ_{i=1}^{p} S_i, f = Σ_{i=1}^{p} f_i,
and W = (O I)^T − Σ_{i=1}^{p} W_i.

4. Solve Sz = f and compute v = Wz.

5. Return the upper part of u + v.

3 GPU implementation and optimization

Here we present the details of the GPU implementation of the ABCD method for solving tridiagonal systems and the performance optimization techniques.

3.1 QR Method and Givens Rotation

The major computation in Algorithm 1 is the calculation of

\bar{A}_i^T (\bar{A}_i \bar{A}_i^T)^{-1} \quad \text{and} \quad (\bar{A}_i \bar{A}_i^T)^{-1}   (5)

where Ā_i is a block row of the augmented matrix. Each block row has three nonzero diagonals and a few augmented elements. However, the calculation using Eq. (5) has some drawbacks. First, although the matrix is quite sparse, the direct calculation still requires many operations, especially the calculation of (Ā_iĀ_i^T)^{-1}. If Ā_i is a k × q matrix, Ā_iĀ_i^T is a k × k matrix. Because Ā_i is almost tridiagonal, Ā_iĀ_i^T is a band matrix with five diagonals. Second, the direct calculation of the oblique projector using Eq. (5) is numerically unstable [9, Chapter 14], even with some pivoting strategies. A better choice is the QR method.

The QR method first decomposes the matrix Ā_i^T into the product of an orthogonal matrix Q_i and an upper triangular matrix R_i, Ā_i^T = Q_iR_i. Based on that, Eq. (5) can be simplified as

\bar{A}_i^T (\bar{A}_i \bar{A}_i^T)^{-1} = Q_i R_i (R_i^T Q_i^T Q_i R_i)^{-1} = Q_i R_i (R_i^T R_i)^{-1} = Q_i R_i^{-T}   (6)

and

(\bar{A}_i \bar{A}_i^T)^{-1} = (R_i^T Q_i^T Q_i R_i)^{-1} = R_i^{-1} R_i^{-T}   (7)

assuming Ā_i, and hence R_i, has full rank. As can be seen, the formulas in Eqs. (6), (7) are much simpler and cleaner than those in Eq. (5). Applying them then reduces to triangular solves with R_i and applications of Q_i; a sketch of such a banded triangular solve is given below.
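
As an illustration only (this is our own sketch, not the chapter's kernel; the routine name, the three-vector storage of R, and the small test in main are assumptions), a forward substitution for R^T y = b with a banded upper triangular R might look like the following:

#include <stdio.h>

/* Forward substitution for R^T y = b, where R is upper triangular with only
   three nonzero diagonals, stored as d[j] = R(j,j), u1[j] = R(j,j+1), and
   u2[j] = R(j,j+2).  Marked __host__ __device__ so that the same routine can
   also be called from inside a CUDA kernel. */
__host__ __device__ static void solve_rt(const double *d, const double *u1,
                                         const double *u2, const double *b,
                                         double *y, int n)
{
    for (int j = 0; j < n; ++j) {
        double t = b[j];
        if (j >= 1) t -= u1[j - 1] * y[j - 1];   /* contribution of R(j-1,j) */
        if (j >= 2) t -= u2[j - 2] * y[j - 2];   /* contribution of R(j-2,j) */
        y[j] = t / d[j];
    }
}

int main(void)
{
    /* R = [2 1 1; 0 3 1; 0 0 4]; for y = (1, 1, 1)^T we have R^T y = (2, 4, 6)^T. */
    double d[3]  = {2, 3, 4};
    double u1[2] = {1, 1};
    double u2[1] = {1};
    double b[3]  = {2, 4, 6}, y[3];
    solve_rt(d, u1, u2, b, y, 3);
    printf("%g %g %g\n", y[0], y[1], y[2]);      /* prints 1 1 1 */
    return 0;
}

Applying R_i^{-1}, as needed in Eq. (7), is the analogous backward substitution, and Q_i is applied by replaying the stored rotations.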

There are several algorithms to carry out the QR decomposition [10, Chapter 5]. The most suitable one for the matrix Ā_i^T is the Givens rotation, because Ā_i^T, being nearly tridiagonal, is structurally very close to the upper triangular matrix R_i, except for one subdiagonal and a few augmented elements, and the Givens rotation method annihilates those nonzero elements one by one using rotation matrices. A rotation matrix for a two-vector v = (a b)^T is

G = \begin{pmatrix} c & s \\ -s & c \end{pmatrix}

where c = a/d, s = b/d, and d = √(a² + b²). Premultiplying v by G gives Gv = (√(a² + b²), 0)^T = (d, 0)^T. Note that G is an orthogonal matrix, so it preserves the length of the multiplicand: v and Gv have the same length.

The QR decomposition by Givens rotation uses the diagonal and subdiagonal elements to create rotation matrices that bring zeros to the subdiagonal. The final Q matrix is obtained by accumulating the rotation matrices.
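
A minimal sketch of how the rotation parameters can be computed is shown below; it is our own illustration rather than the chapter's kernel, and the function name and the use of hypot are assumptions.

#include <math.h>
#include <stdio.h>

/* Compute c, s, d such that the rotation G = [c s; -s c] maps (a, b) to
   (d, 0), with d = sqrt(a*a + b*b).  hypot avoids overflow/underflow in the
   square root.  __host__ __device__ lets the same routine be reused inside
   CUDA kernels. */
__host__ __device__ static void givens(double a, double b,
                                       double *c, double *s, double *d)
{
    if (b == 0.0) {                    /* nothing to annihilate */
        *c = 1.0; *s = 0.0; *d = a;
    } else {
        double r = hypot(a, b);
        *c = a / r;
        *s = b / r;
        *d = r;
    }
}

int main(void)
{
    double c, s, d;
    givens(3.0, 4.0, &c, &s, &d);
    printf("c=%g s=%g d=%g\n", c, s, d);   /* expect c=0.6 s=0.8 d=5 */
    return 0;
}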

Table 1 compares the operation counts of the Givens rotation and the direct method using Eq. (5). The size of Ā_i^T is n × k, and the size of C_i is k × ℓ. The operation count of the direct method is based on sparse matrix multiplication and Gaussian elimination for band matrices. As can be seen, the operation count of the Givens rotation for computing S_i is almost half that of the direct method.

Table 1

Operation Counts of Givens Rotation and Direct Multiplication

Equation                          Givens Rotation         Direct Method
u_i = Ā_i^T(Ā_iĀ_i^T)^{-1}b_i     41k + 24                46k − 3
S_i = C_i^T(Ā_iĀ_i^T)^{-1}C_i     5kℓ + 20k − 6ℓ + 6      9kℓ + 20k − 12ℓ + 6
W_i = Ā_i^T(Ā_iĀ_i^T)^{-1}C_i     2kℓ + 17k − 6           2kℓ + 6k + 14

3.2 Sparse Storage Format

Because A is tridiagonal and therefore extremely sparse, a special storage format should be used to reduce memory usage. A simple format for a tridiagonal matrix is to use three vectors to store the nonzero diagonals. In addition, we also need to consider the storage for intermediate results, such as the matrices Q and R.

Based on those requirements, we designed a 5n array as the major storage format, as shown in Fig. 2. Matrix A is stored in the first three n-vectors. Matrix R also has only three nonzero diagonals, all on or above the main diagonal. After the QR decomposition, matrix R can reuse matrix A's space, since A is no longer needed in the computation. Matrix Q is a block diagonal matrix, and storing it explicitly would take too much space. Instead, we keep Q in its decomposed form and store the rotation matrices G_i. Since each G_i has only two parameters, two elements suffice to store each G_i, so the last two n-vectors store the rotation parameters. Whenever the matrix Q is required in a computation, such as Eq. (6), we apply the stored G_i to the multiplicand directly.
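
A minimal sketch of how such a 5n workspace might be organized is given below; the struct, field names, and helper are ours, introduced only for illustration, and are not the chapter's actual data structure.

#include <stdlib.h>

/* Illustrative view of the 5n-element workspace: three n-vectors for the
   diagonals of A (reused for R after the QR step) and two n-vectors for the
   Givens parameters (c_i, s_i) that represent Q in decomposed form. */
typedef struct {
    double *lower;   /* subdiagonal of A              */
    double *diag;    /* main diagonal of A            */
    double *upper;   /* superdiagonal of A            */
    double *gc;      /* cosines of the rotations G_i  */
    double *gs;      /* sines of the rotations G_i    */
} TridiagWorkspace;

/* Carve the five n-vectors out of one contiguous allocation of 5n doubles
   (for example, a single cudaMalloc on the device or malloc on the host). */
static TridiagWorkspace make_workspace(double *base, int n)
{
    TridiagWorkspace w;
    w.lower = base;
    w.diag  = base + n;
    w.upper = base + 2 * n;
    w.gc    = base + 3 * n;
    w.gs    = base + 4 * n;
    return w;
}

int main(void)
{
    int n = 8;
    double *base = (double *)calloc(5 * n, sizeof(double));
    TridiagWorkspace w = make_workspace(base, n);
    w.diag[0] = 1.0;    /* example access to the main diagonal */
    free(base);
    return 0;
}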

Fig. 2 Sparse storage format of matrix A.

3.3 Coalesced Memory Access

We reformat the matrix storage to improve the performance on the GPU. As suggested in Ref. [11], the data accessed by the threads within a block should be stored contiguously, because the threads in a thread block read and write simultaneously. Contiguously stored data allows coalesced memory access on the GPU, which can read or write 128 bytes or more in a single memory transaction, depending on the GPU hardware.

The original storage format, which stores adjacent tridiagonal elements contiguously in memory, does not allow coalesced memory access. Therefore we modify the storage so that element k of the ith partition is placed right after element k of the (i − 1)th partition and right before element k of the (i + 1)th partition.

Fig. 3 shows an example. There are eight elements, and each block has two threads. The number of partitions is two, which means that each thread handles 8/2 = 4 elements. Elements are reformatted based on the number of threads in a block. If thread 1 needs to access elements 1 to 4 and thread 2 needs to access elements 5 to 8, we place elements 1 and 5 together, 2 and 6 together, and so on. The new layout lets the threads in a block access data simultaneously. Although the data reformatting also costs time, this technique fulfills the coalescing requirement on the GPU and leads to higher memory access efficiency in the subsequent steps, as sketched in the kernel below. Experimental results are shown in the next section.
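
The following sketch (our own, not the chapter's kernel; the kernel name, parameters, and the tiny host driver are assumptions) performs this interleaving for p partitions of len elements each, reproducing the layout of Fig. 3 for p = 2:

#include <stdio.h>
#include <cuda_runtime.h>

/* Reorder from the original layout, in which partition i owns the contiguous
   range src[i*len .. i*len+len-1], into the interleaved layout in which
   element k of partition i is stored at dst[k*p + i].  After the
   transformation, threads working on step k of different partitions touch
   consecutive addresses, which the hardware can coalesce. */
__global__ void interleave(const double *src, double *dst, int p, int len)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* one thread per partition */
    if (i >= p) return;
    for (int k = 0; k < len; ++k)
        dst[k * p + i] = src[i * len + k];
}

int main(void)
{
    const int p = 2, len = 4, n = p * len;
    double h_src[n] = {1, 2, 3, 4, 5, 6, 7, 8}, h_dst[n];
    double *d_src, *d_dst;
    cudaMalloc(&d_src, n * sizeof(double));
    cudaMalloc(&d_dst, n * sizeof(double));
    cudaMemcpy(d_src, h_src, n * sizeof(double), cudaMemcpyHostToDevice);
    interleave<<<1, 32>>>(d_src, d_dst, p, len);
    cudaMemcpy(h_dst, d_dst, n * sizeof(double), cudaMemcpyDeviceToHost);
    for (int k = 0; k < n; ++k)
        printf("%g ", h_dst[k]);                     /* prints 1 5 2 6 3 7 4 8 */
    printf("\n");
    cudaFree(d_src);
    cudaFree(d_dst);
    return 0;
}

Note that the writes of this one-time reformatting kernel are already coalesced while its reads are not, which is the reformatting cost mentioned above.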

Fig. 3 Coalesced format transformation.

3.4 Boundary Padding

One problem of single instruction, multiple data architectures such as GPUs is branch divergence, which occurs when a group of threads sharing the same program counter have different execution paths. On a Compute Unified Device Architecture (CUDA) device, the threads of a block are bundled into fixed-size warps for execution, and the threads within a warp follow the same instruction synchronously. If a kernel function encounters an if-then-else statement that some threads evaluate to true and others to false, a branch divergence occurs. Because the threads in a warp cannot follow different paths at the same time, the warp deactivates the threads whose condition is false, executes the true branch, and then reverses the roles to execute the false branch. In other words, branch divergence serializes all the possible execution paths, which can severely hurt GPU performance.

In the ABCD algorithm, branch divergence occurs in the boundary processing, which has to handle different data. To solve the problem, we propose the boundary padding technique, which adds redundant padding and uses a small amount of additional memory for those threads that would otherwise take different execution paths on the GPU. Boundary padding ensures that all the threads in the same warp perform the same operations at any moment, which eliminates the branch divergence; the sketch below illustrates the idea.
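
As a toy illustration (our own, not the chapter's kernels; the column-sum operation and all names are invented for the example), the two variants below contrast a divergent boundary check with the padded, branch-free form:

/* Without padding: the first and last partitions have fewer augmentation
   columns, so the trip count differs inside a warp and the two paths are
   serialized. */
__global__ void sum_aug_branchy(const double *aug, double *out, int p, int cols)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= p) return;
    int ncols = (i == 0 || i == p - 1) ? cols / 2 : cols;   /* divergent */
    double s = 0.0;
    for (int k = 0; k < ncols; ++k)
        s += aug[k * p + i];             /* interleaved layout, Section 3.3 */
    out[i] = s;
}

/* With boundary padding: the missing columns of the boundary partitions are
   stored as zeros, so every thread executes the same loop and the padded
   zeros do not change the result. */
__global__ void sum_aug_padded(const double *aug, double *out, int p, int cols)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= p) return;
    double s = 0.0;
    for (int k = 0; k < cols; ++k)       /* uniform trip count */
        s += aug[k * p + i];
    out[i] = s;
}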

3.4.1 Padding of the augmented matrix

Fig. 4 shows that the augmentations for the top and bottom partitions are different from those of the middle partitions, because they lack either the C_{i,j} block or the negative identity −I, which leads to divergence. Our implementation adds zero elements to the top and bottom parts of the augmented matrix. Those zero elements have no influence on the final solution, and the padding does not cost extra storage. Fig. 5 shows how the padding is stored in a compact format.

Fig. 4 Padding of augmented matrix C.
Fig. 5 The storage format of augmented matrix C.

3.4.2 Padding for Givens rotation

During the rotation steps, the first partition of ABCD differs from the other partitions: it rotates elements onto the main and lower diagonals, while the other partitions rotate onto the lower diagonal and the diagonal below it, which leads to branches. To reduce the branches, we add redundant elements to the first partition. In other words, the first partition will further rotate onto the lower diagonal and the diagonal below it. This technique doubles the work of the first partition but unifies the work among all threads. In our experiments, the total time for calculating the Givens rotations is reduced to about half of the original time. Fig. 6 shows the first partition before and after padding.

Fig. 6 The padding of the first partition for Givens rotation. (A) Before padding. (B) After padding.

4 Performance evaluation

We first compare the performance of the CPU and GPU implementations. Then we evaluate the speedup before and after applying the performance optimization techniques. The test matrices are collected from MATLAB standard libraries, Toeplitz matrices, and random generation.

The experimental platform is equipped with an Intel Xeon E5-2670 v2 CPU running at 2.50 GHz. The GPU is an NVIDIA Tesla K20m. The operating system is CentOS 6.4, and the NVIDIA driver version is 340.65.

4.1 Comparison With CPU Implementation

Fig. 7 shows the comparison between our implementations of the ABCD algorithm on the GPU and on the CPU. The dataflow is the same in both implementations. The CPU version is implemented in C; the GPU version is implemented in CUDA. As shown in Table 2, the execution times of the two implementations are similar when the matrix size is smaller than 2^17, because the data must be loaded onto GPU memory before the calculation starts, and this data movement overhead is large compared to the calculation time. However, the CPU time nearly doubles when the dimension of the matrix doubles, while the GPU time increases only slightly; this difference comes from the GPU's massively parallel architecture. When the dimension reaches 4 million, a 24.15 times speedup is obtained on the GPU. The increase in speedup between matrix sizes of 2 million and 4 million is small, which we believe is because the GPU reaches its hardware limits.

Fig. 7 Comparison of the execution time on the GPU and on the CPU.

Table 2

Performance Comparison of GPU and CPU Implementation

Matrix Size    GPU (ms)    CPU (ms)    Speedup (CPU Time/GPU Time)
32,768         4.0         10          2.50
65,536         5.9         20          3.39
131,072        5.9         60          10.17
262,144        10.6        120         11.32
524,288        13.8        230         16.66
1,048,576      26.1        430         16.47
2,097,152      40.8        910         22.30
4,194,304      82.0        1980        24.15


4.2 Speedup by Memory Coalescing

Table 3 lists the speedups of different kernels achieved by the coalesced storage format described in Section 3.3. The matrix size is 4 million. The speedup is the ratio of the time without the coalesced format to the time with the coalesced format. The coalesced format contributes a speedup of over five times for solving R_i^{-1} with multiple right-hand sides.

Table 3

Speedup by Coalesced Memory Access

Kernel (Functionality)                           Speedup (Original/Coalesced)
Assign augmented matrix                          1.180
Create Givens rotation                           1.519
Apply Givens rotation                            1.750
Solve R_i^{-1}                                   1.959
Solve R_i^{-1} (multiple right-hand sides)       5.082
Calculate u and f                                2.319
Calculate matrix S                               1.873
Solve S^{-1}                                     3.464

There are many factors influencing the possible speedup. The major one is the amount of memory accessed by the kernel: more memory accesses give a larger speedup, and when the data size is small, the effectiveness is limited. As shown in Table 3, only about a 1.2× speedup is obtained for the task of assigning the augmented matrix. Another factor is the data access pattern. With the original data format, increasing the number of threads cannot accelerate the computation, but with the coalesced data format, doubling the number of threads reduces the execution time by almost half. This is because the bottleneck of the program with the original data format is memory access, so more threads do not help to improve the performance. With the coalesced data format, memory access is no longer the bottleneck, so other performance-tuning techniques can take effect. Other factors influencing the speedups include the memory size, the total number of threads, and the operations performed on the memory.

4.3 Boundary Padding

Although padding adds useless work to some threads, all threads can perform the same instructions simultaneously, which increases the overall performance.

Table 4 shows the speedup of the boundary padding technique. The speedup comes from two major factors. First, the boundary padding technique reduces the branch divergence because all the threads have the same execution path. Second, the boundary padding technique increases the coalesced memory access, since the memory access pattern is unified.

Table 4

Speedup by Boundary Padding

Kernel (Functionality)                                Speedup (Times)
Add padding to the augmented matrix C                 1.254
Add padding for Givens rotation (create rotation)     2.548
Add padding for Givens rotation (apply rotation)      1.898

The padding for a Givens rotation can not only improve the performance of creating the rotation matrix but also accelerate the kernel that applies the Givens rotations, as shown in the results in Table 4.

5 Conclusion and future work

We present the GPU implementation of the ABCD algorithm, which offers a new approach to parallelism through augmentation. We focus on the problem of solving tridiagonal systems. Because of the special structure of tridiagonal matrices, two performance optimization techniques are proposed to accelerate the GPU implementation. The first is the coalesced data format, which significantly reduces the memory access time. The second is boundary padding, which adds redundant data to unify the execution paths and effectively reduces branch divergence on the GPU. Experiments show that a speedup of more than 24 times can be obtained by the GPU implementation.

Several future directions are worth investigating. First, the ABCD algorithm can be applied to general sparse matrices, but the matrix structures vary; how to handle general sparse matrices effectively is an open question. Second, we only consider parallelization on a single-GPU platform here. For better scalability, multi-GPU platforms or even heterogeneous platforms that combine various devices should be considered; how to design and optimize the algorithms for them is an interesting problem.

References

[1] Sakharnykh N. Efficient tridiagonal solvers for ADI methods and fluid simulation. GTC. 2010.

[2] Hockney R.W. A fast direct solution of Poisson's equation using Fourier analysis. J. ACM. 1965;12(1):95–113. doi:10.1145/321250.321259.

[3] Stone H.S. An efficient parallel algorithm for the solution of a tridiagonal linear system of equations. J. ACM. 1973;20(1):27–38. doi:10.1145/321738.321741.

[4] Gander W., Golub G.H. Cyclic reduction history and applications. In: Proceedings of the Workshop on Scientific Computing. 1997:10–12.

[5] Polizzi E., Sameh A.H. A parallel hybrid banded system solver: the SPIKE algorithm. Parallel Comput. 2006;32(2):177–194.

[6] Chang L.-W., Stratton J.A., Kim H.-S., Hwu W.W. A scalable, numerically stable, high-performance tridiagonal solver using GPUs. In: 2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC). 2012:1–11. doi:10.1109/SC.2012.12.

[7] CUDA toolkit, http://docs.nvidia.com/cuda/cusparse/.

[8] Duff I.S., Guivarch R., Ruiz D., Zenadi M. The augmented block Cimmino distributed method. SIAM J. Sci. Comput. 2015;37(3):A1248–A1269.

[9] Higham N.J. Accuracy and Stability of Numerical Algorithms. second ed. Philadelphia, PA: Society for Industrial and Applied Mathematics; 2002. ISBN 0-89871-521-0.

[10] Golub G.H., Van Loan C.F. Matrix Computations. third ed. Baltimore, MD: Johns Hopkins University Press; 1996. ISBN 0-8018-5414-8.

[11] CUDA C Programming Guide, http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html.
