Chapter 10

Parallel patterns: sparse matrix computation

An introduction to data compression and regularization

Abstract

This chapter introduces the parallel sparse matrix-vector computation pattern. It starts with the basic concepts of sparse matrix storage as a data compaction technique. It then introduces a kernel based on the Compressed Sparse Row (CSR) storage format for sparse matrices. The ELL format with data padding is then introduced as a regularization technique for reduced control divergence and improved memory coalescing. The COO format is then introduced as a complementary regularization technique that reduces padding overhead. Finally, the Jagged Diagonal Storage (JDS) format based on sorting is introduced to smooth out the variation from one row to the next, allowing one to further reduce padding overhead in the regularization process. The chapter shows that the same sparse matrix kernel code can exhibit very different performance behaviors on different data sets.

Keywords

sparse matrix computation; sparse matrix-vector multiplication; compressed sparse row format; ELL format; coordinate format; jagged diagonal storage format; divergence; transposition; iterative linear solvers; conjugate gradient solvers

Our next parallel pattern is sparse matrix computation. In a sparse matrix, the majority of the elements are zeros. Many important real-world problems involve sparse matrix computation. Storing and processing these zero elements are wasteful in terms of memory capacity, memory bandwidth, time, and energy. To address these problems, several sparse matrix storage formats and their corresponding processing methods have been proposed and widely used in the field. These approaches employ a compaction technique to avoid storing or processing zero elements at the cost of introducing a certain degree of irregularity into the data representation. Unfortunately, such irregularity can lead to underutilization of memory bandwidth, control flow divergence, and load imbalance in parallel computing. Striking a good balance between compaction and regularization is therefore important. Some storage formats achieve a high level of compaction at high levels of irregularity, whereas others attain a modest level of compaction while keeping the representation more regular. The parallel computation performance of their corresponding methods depends heavily on the distribution of nonzero elements in the sparse matrices. Understanding the wealth of work in sparse matrix storage formats and their corresponding parallel algorithms gives a parallel programmer an overview of how to address compaction and regularization challenges when solving related problems.

10.1 Background

A sparse matrix is a matrix where the majority of the elements are zeros. Sparse matrices arise in many science, engineering, and financial modeling problems. As seen in Chapter 6, Numerical Considerations, matrices can be used to represent the coefficients in a linear system of equations. Each row of the matrix represents one equation of the linear system. In various science and engineering problems, the large number of variables and equations involved are sparsely coupled; i.e., each equation involves only a small number of variables. This point is illustrated in Fig. 10.1, where each column of the matrix corresponds to the coefficients for a variable: column 0 for x0, column 1 for x1, etc. For instance, the fact that row 0 has nonzero elements in columns 0 and 2 indicates that variables x0 and x2 are involved in equation 0. It should be clear that none of the variables are present in equation 1, variables x1, x2 and x3 are present in equation 2, and finally variables x0 and x3 are present in equation 3.

Figure 10.1 A simple sparse matrix example.

Sparse matrices are typically stored in a format, or a representation, that avoids storing zero elements. We will start with the Compressed Sparse Row (CSR) storage format, which is illustrated in Fig. 10.2. CSR stores only nonzero values in a one-dimensional data storage, shown as data[] in Fig. 10.2. Array data[] stores all nonzero values in the sparse matrix in Fig. 10.1. The nonzero elements of Row 0 (3 and 1) are stored first, followed by the nonzero elements of Row 1 (none), the nonzero elements of Row 2 (2, 4, 1), and the nonzero elements of Row 3 (1, 1). The format compresses away all the zero elements.

Figure 10.2 Example of Compressed Sparse Row (CSR) format.

With the compressed format, we need to introduce two sets of markers to preserve the structure of the original sparse matrix in the compressed representation. The first set of markers forms a column index array, col_index[], in Fig. 10.2. This array gives the column index of every nonzero value in the original sparse matrix. Since we have squeezed away the zero elements of each row, we need these markers to remember the location of the remaining elements in the original rows of the sparse matrix; e.g., values 3 and 1 came from columns 0 and 2 of row 0 in the original sparse matrix. The col_index[0] and col_index[1] elements are assigned to store the column indices for these two elements. As another example, values 2, 4, and 1 came from columns 1, 2, and 3 of row 2 in the original sparse matrix. Therefore, col_index[2], col_index[3], and col_index[4] store indices 1, 2, and 3.

The second set of markers indicates the starting location of every row in the compressed format as the size of each row varies after the zero elements are removed. The starting location of each row in the compressed storage can no longer be identified using indexing based on the fixed row size. Fig. 10.2 shows a row_ptr[] array whose elements are the indices for the beginning locations of each row; row_ptr[0] indicates that Row 0 starts at location 0 of the data[] array, row_ptr[1] indicates that Row 1 starts at location 2, etc. Both row_ptr[1] and row_ptr[2] have a value of 2, implying that none of the elements of Row 1 is stored in the compressed format. This statement makes sense since Row 1 in Fig. 10.1 consists entirely of zero values. In addition, row_ptr[4] stores the starting location of a nonexistent “Row 4”. This choice is for convenience, as some algorithms need to use the starting location of the next row to delineate the end of the current row. This extra marker provides a convenient way to locate the ending location of Row 3.
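To make the layout concrete, the three CSR arrays for the small example can be written out as C initializers; the values below are transcribed directly from the description above and from Fig. 10.2.

float data[7]      = {3, 1, 2, 4, 1, 1, 1};   // nonzero values, stored row by row
int   col_index[7] = {0, 2, 1, 2, 3, 0, 3};   // original column of each nonzero value
int   row_ptr[5]   = {0, 2, 2, 5, 7};         // start of each row, plus one extra end marker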

As discussed in Chapter 6, Numerical Considerations, matrices are often used in solving a linear system of N equations in N variables in the form A*X+Y=0, where A is an N×N matrix, X is a vector of N variables, and Y is a vector of N constant values. The objective is to solve for the X values that satisfy all the equations. An intuitive approach is to invert the matrix so that X=A^(-1)*(-Y). This technique can be used for matrices of moderate size through methods such as Gaussian elimination, as illustrated in Chapter 6, Numerical Considerations. While these methods can theoretically be used to solve equations represented as sparse matrices, the sheer size of many sparse matrices can overwhelm this intuitive approach. Furthermore, an inverted sparse matrix is often much larger than the original because the inversion process tends to generate a large number of additional nonzero elements, called fill-ins. As a result, it is often impractical to compute and store the inverted matrix when solving real-world problems.

Instead, linear systems of equations represented as sparse matrices are better solved with an iterative approach. When the sparse matrix A is positive definite (i.e., x^T*A*x > 0 for all nonzero vectors x in R^n), the conjugate gradient method can be used to iteratively solve the corresponding linear system with guaranteed convergence to a solution [HS 1952]. The conjugate gradient method predicts a solution for X and performs A*X+Y. If the result is not close to a 0 vector, a gradient vector formula can be used to refine the predicted X, and another iteration of A*X+Y is performed.

The most time-consuming aspect of such iterative approaches is the evaluation of A*X+Y, which is a sparse matrix–vector multiplication and accumulation. Fig. 10.3 illustrates matrix–vector multiplication and accumulation, where A is a sparse matrix. The dark squares in A represent nonzero elements. By contrast, both X and Y are typically dense vectors; i.e., most of the elements of X and Y hold nonzero values. Owing to its importance, standardized library function interfaces have been created to perform this operation, which is referred to as Sparse Matrix–Vector (SpMV) multiplication and accumulation. We will use SpMV to illustrate the important tradeoffs between different storage formats in parallel sparse matrix computation.

Figure 10.3 An example of matrix–vector multiplication and accumulation.

A sequential implementation of SpMV based on the CSR format is quite straightforward, as shown in Fig. 10.4. We assume that the code has access to (1) num_rows, a function argument that specifies the number of rows in the sparse matrix, (2) a floating point data[] array holding the nonzero A elements, (3) floating point x[] and y[] arrays holding the X and Y elements, and (4) integer row_ptr[] and col_index[] arrays, as described in Fig. 10.2. The code consists of only seven lines. Line 1 is a loop that iterates through all rows of the matrix, with each iteration calculating the dot product of the current row and the vector x.

Figure 10.4 A sequential loop that implements SpMV based on the CSR format.

For each row, Line 2 first initializes the dot product to zero. The code then sets up the range of data[] array elements that belong to the current row. The starting and ending locations can be loaded from the row_ptr[] array, as illustrated in Fig. 10.5 for the small sparse matrix in Fig. 10.1. For row=0, row_ptr[row] is 0 and row_ptr[row+1] is 2. The two elements from Row 0 reside in data[0] and data[1]. That is, row_ptr[row] gives the starting position of the current row and row_ptr[row+1] gives the starting position of the next row, which is one after the ending position of the current row. This is reflected in the loop in Line 5, where the loop index iterates from the position given by row_ptr[row] to the position given by row_ptr[row+1]-1.

Figure 10.5 Illustration of the sequential SpMV loop when operating on the sparse matrix example in Fig. 10.1.

The loop body in Line 6 calculates the dot product for the current row. For each element, the loop body uses the loop index elem to access the matrix A element in data[elem]. The code also uses elem to retrieve the column index for the element from col_index[elem]. This column index is then used to access the appropriate x element for multiplication. To illustrate, the elements in data[0] and data[1] come from column 0 (col_index[0]=0) and column 2 (col_index[1]=2), respectively. Thus, the inner loop will perform the dot product for row 0 as data[0]*x[0]+data[1]*x[2]. The reader is encouraged to perform the dot product for other rows as an exercise.
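Putting the walkthrough together, the following is a minimal sketch of the sequential SpMV/CSR loop; it follows the description above, although the exact line numbering and variable names of Fig. 10.4 may differ.

void spmv_csr_sequential(int num_rows, float *data, int *col_index,
                         int *row_ptr, float *x, float *y) {
    for (int row = 0; row < num_rows; row++) {        // iterate over all rows
        float dot = 0;                                // initialize the dot product
        int row_start = row_ptr[row];                 // first element of the current row
        int row_end   = row_ptr[row + 1];             // one past the last element
        for (int elem = row_start; elem < row_end; elem++)
            dot += data[elem] * x[col_index[elem]];   // multiply by the matching x element
        y[row] += dot;                                // accumulate into y, computing A*X + Y
    }
}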

CSR completely removes all zero elements from the storage. It incurs storage overhead by introducing the col_index and row_ptr arrays. In our example where the number of zero elements is not much larger than the number of nonzero elements, the storage overhead is greater than the space saved by not storing the zero elements. However, for sparse matrices where the vast majority of elements are zeros, the overhead introduced is far smaller than the space saved by not storing zeros. For instance, in a sparse matrix where only 1% of the elements are nonzero values, the total storage for the CSR representation, including the overhead, would be around 2% of the space required to store both zero and nonzero elements.

Removing all zero elements from the storage also eliminates the need to fetch these zero elements from memory or to perform useless multiplication operations on these zero elements. This method can significantly reduce the consumption of memory bandwidth and computational resources.

Any SpMV code reflects the storage format that it assumes. Therefore, we will append the storage format to the name of a code version to clarify the combination used. The SpMV code in Fig. 10.4 will be referred to as sequential SpMV/CSR. With a good understanding of sequential SpMV/CSR, we are now ready to discuss parallel sparse matrix computation.

10.2 Parallel SpMV Using CSR

The dot product calculation for each row of the sparse matrix is independent of the dot product for other rows; i.e., all iterations of the outer loop (Line 1) in Fig. 10.4 are logically independent of each other. We can easily convert this sequential SpMV/CSR into a parallel CUDA kernel by assigning each iteration of the outer loop to a thread. Each thread calculates the inner product for a row of the matrix, which is illustrated in Fig. 10.6, where Thread 0 calculates the dot product for row 0, Thread 1 for row 1, and so on.

Figure 10.6 Example of mapping threads to rows in parallel SpMV/CSR.

A sparse matrix in a real-world application usually consists of thousands to millions of rows, each of which contains tens to hundreds of nonzero elements. The mapping in Fig. 10.6 therefore seems appropriate: it creates a large number of threads, each of which has a substantial amount of work. We present a parallel SpMV/CSR kernel in Fig. 10.7.

Figure 10.7 A parallel SpMV/CSR kernel.

It should be clear that the kernel appears almost identical to the sequential SpMV/CSR loop. The outermost loop construct has been removed and is replaced by the thread grid. In Line 2, the row index assigned to a thread is calculated as the familiar expression blockIdx.x*blockDim.x + threadIdx.x. With the need to manage any arbitrary number of rows, Line 3 checks whether the row index of a thread exceeds the number of rows. This method handles the situation where the number of rows is not a multiple of the thread block size.
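For reference, the following is a minimal sketch of a kernel along the lines just described; parameter names mirror the sequential version above and are not necessarily those used in Fig. 10.7.

__global__ void spmv_csr_kernel(int num_rows, float *data, int *col_index,
                                int *row_ptr, float *x, float *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per row
    if (row < num_rows) {                             // guard against extra threads
        float dot = 0;
        for (int elem = row_ptr[row]; elem < row_ptr[row + 1]; elem++)
            dot += data[elem] * x[col_index[elem]];
        y[row] += dot;
    }
}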

Despite its simplicity, the parallel SpMV/CSR kernel has two major shortcomings. First, the kernel does not make coalesced memory accesses. As shown in Fig. 10.5, adjacent threads make simultaneous accesses to nonadjacent memory locations. In our example, threads 0, 1, 2, and 3 will access data[0], none, data[2], and data[5] in the first iteration of their dot product loop. The same threads will then access data[1], none, data[3], and data[6] in the second iteration, and so on. Thus, the parallel SpMV/CSR kernel in Fig. 10.7 fails to use memory bandwidth efficiently.

The second shortcoming of the SpMV/CSR kernel is its potential to incur significant control flow divergence in all warps. The number of iterations performed by a thread in the dot product loop depends on the number of nonzero elements in the row assigned to the thread. Since the distribution of nonzero elements among rows can be random, adjacent rows can have varying numbers of nonzero elements. Consequently, a widespread control flow divergence can occur in most or even all warps.

Both the execution efficiency and the memory bandwidth efficiency of the parallel SpMV/CSR kernel depend on the distribution of nonzero elements in the input data matrix. This behavior differs somewhat from that of most of the kernels we have studied thus far. However, such data-dependent performance behavior is commonly observed in real-world applications, which is one reason why parallel SpMV is such an important parallel pattern. Although simple, parallel SpMV exhibits a behavior that is important in many complex parallel applications. We will discuss three important techniques in the next sections to address the noncoalesced memory accesses and control flow divergence in the parallel SpMV/CSR kernel.

10.3 Padding and Transposition

The problems of noncoalesced memory accesses and control divergence can be addressed by applying data padding and transposition on the sparse matrix data. These ideas were used in the ELL storage format, whose name came from the sparse matrix package in ELLPACK, a package for solving elliptic boundary value problems [RB 1984]. A simple way to understand the ELL format is to start with the CSR format, as is illustrated in Fig. 10.8.

Figure 10.8 ELL storage format.

From a CSR representation, we first determine the row with the maximal number of nonzero elements. We then add dummy (zero) elements to all other rows after their nonzero elements so that every row has the same length as the maximal row, thereby generating a rectangular matrix. For our small sparse matrix example, we determine that row 2 has the maximal number of elements. We then add one zero element to row 0, three zero elements to row 1, and one zero element to row 3 so that all rows are of equal length. These additional zero elements appear as squares with an * in Fig. 10.8. Note that the col_index array also needs to be padded in the same way to preserve its correspondence to the data values.

We can now lay the padded matrix out in column major order; i.e., we place all elements of column 0 in consecutive memory locations, followed by all elements of column 1, and so on. This layout is equivalent to transposing the rectangular matrix and then storing it in the row major order used by the C language. In terms of our small example, after transposition, data[0] through data[3] contain 3, *, 2, 1, the 0th elements of all rows, as illustrated in the bottom portion of Fig. 10.9. Similarly, col_index[0] through col_index[3] contain the column positions of the 0th elements of all rows. We no longer need row_ptr, since the beginning of row i is now simply data[i]. With the padded elements, it is also very easy to move from the current element of row i to its next element: we simply add the number of rows in the original matrix to the index. For example, the 0th element of row 2 is in data[2], and the next element is in data[2+4]=data[6], where 4 is the number of rows in the original matrix in our small example.

Figure 10.9 More details of our small example in ELL.
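As a concrete transcription of the layout in Fig. 10.9, the padded and transposed ELL arrays for the small example might be written as follows; padded entries (marked with * in the figure) are represented here as value 0 with column index 0, which is one common convention.

float data[12]      = {3, 0, 2, 1,    // 0th element of rows 0..3
                       1, 0, 4, 1,    // 1st element of rows 0..3
                       0, 0, 1, 0};   // 2nd element of rows 0..3 (mostly padding)
int   col_index[12] = {0, 0, 1, 0,    // column of the 0th element of rows 0..3
                       2, 0, 2, 3,    // column of the 1st element of rows 0..3
                       0, 0, 3, 0};   // column of the 2nd element of rows 0..3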

Using the ELL format, we show a parallel SpMV/ELL kernel in Fig. 10.10. The kernel receives slightly different arguments. The kernel no longer needs the row_ptr; instead, it needs an argument, num_elem, to determine the number of elements in each row after padding. Recall that num_elem is the maximal number of nonzero elements among all rows in the original sparse matrix.

Figure 10.10 A parallel SpMV/ELL kernel.

A first observation is that the SpMV/ELL kernel code is simpler than that of SpMV/CSR. With padding, all rows are now of the same length. In the dot product loop in Line 5, all threads can simply loop through the number of elements given by num_elem. Consequently, control flow divergence no longer occurs in warps: all threads now iterate exactly the same number of times in the dot product loop. When a dummy element is used in a multiplication and accumulation step, it does not affect the final result because its value is 0.

A second observation is that in the dot product loop body, each thread accesses its 0th element in data[row], and in general, its ith element in data[row+i*num_rows]. As we have seen in Fig. 10.9, by arranging the elements in the column major order, all adjacent threads are now accessing adjacent memory locations, enabling memory coalescing, thereby using memory bandwidth more efficiently.
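A minimal sketch of such a kernel, consistent with the two observations above (parameter names are assumptions rather than necessarily those of Fig. 10.10):

__global__ void spmv_ell_kernel(int num_rows, float *data, int *col_index,
                                int num_elem, float *x, float *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per row
    if (row < num_rows) {
        float dot = 0;
        for (int i = 0; i < num_elem; i++)            // same trip count for every thread
            dot += data[row + i * num_rows]           // column major (transposed) layout
                   * x[col_index[row + i * num_rows]];
        y[row] += dot;                                // padded entries contribute 0
    }
}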

By eliminating control flow divergence and enabling memory coalescing, SpMV/ELL should run faster than SpMV/CSR. Furthermore, SpMV/ELL is simpler, which would seem to make it an all-around winning approach. Unfortunately, SpMV/ELL has a potential downside. In situations where one or a small number of rows have an exceedingly large number of nonzero elements, the ELL format will result in an excessive number of padded elements. These padded elements require storage space, need to be fetched from memory, and take part in calculations even though they do not influence the final result; that is, they consume memory storage, memory bandwidth, and execution cycles. Consider our sample matrix: in the ELL format, we have replaced a 4×4 matrix with a padded 4×3 matrix, and with the overhead from the column indices we are storing more data than is contained in the original 4×4 matrix.

To illustrate, consider a 1000×1000 sparse matrix in which 1% of the elements are nonzero values. On average, each row has 10 nonzero elements. With the overhead, the size of a CSR representation would be about 2% of the uncompressed total size. Assume that one of the rows has 200 nonzero values while all other rows have fewer than 10. Using the ELL format, we would pad all other rows to 200 elements. This makes the ELL representation about 40% of the uncompressed total size and 20 times larger than the CSR representation. The excessively long row will extend the runtime of only one of the warps of the SpMV/CSR kernel, whereas the padding will extend the runtime of all warps of the SpMV/ELL kernel. With numerous padded dummy elements, an SpMV/ELL kernel can run more slowly than an SpMV/CSR kernel. This shortcoming calls for a method to control the number of padded elements when we convert from the CSR format to the ELL format.
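As a quick check of these figures (assuming, for simplicity, that a stored data value and a stored index each take one unit of storage): the matrix has about 10,000 nonzero elements, so CSR stores roughly 10,000 data values, 10,000 column indices, and 1001 row pointers, or about 21,000 units, which is roughly 2% of the 1,000,000 elements of the uncompressed matrix. Padding every row to 200 elements in ELL requires 1000×200 data values plus 1000×200 column indices, or 400,000 units, which is about 40% of the uncompressed size and roughly 20 times the CSR size.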

10.4 Using a Hybrid Approach to Regulate Padding

The root of the problem with excessive padding in the ELL representation is that one or a small number of rows have an exceedingly large number of nonzero elements. If we have a mechanism to “take away” some elements from these rows, we can reduce the number of padded elements in ELL. The Coordinate (COO) format provides such a mechanism.

The COO format is illustrated in Fig. 10.11, where each nonzero element is stored with both its column index and row index. We have both the col_index and row_index arrays to accompany the data array. To illustrate, A[0,0] of our small example is now stored with both its column index (0 in col_index[0]) and its row index (0 in row_index[0]). With the COO format, one can look at any element in the storage and know where the element came from in the original sparse matrix. As in the case of the ELL format, row_ptr is unnecessary because each element self-identifies its column and row positions.

Figure 10.11 Example of Coordinate (COO) format.
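Transcribing Fig. 10.11 for the small example, one natural ordering of the COO arrays (row by row, matching the CSR data array; only the first entry is explicitly confirmed by the text) would be:

float data[7]      = {3, 1, 2, 4, 1, 1, 1};   // all nonzero values
int   col_index[7] = {0, 2, 1, 2, 3, 0, 3};   // column of each nonzero value
int   row_index[7] = {0, 0, 2, 2, 2, 3, 3};   // row of each nonzero value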

Although the COO format involves additional storage cost for the row_index array, it brings the added benefit of flexibility. The elements in a COO format can be reordered arbitrarily without losing any information, provided that the data, col_index, and row_index arrays are reordered in the same manner, as illustrated in Fig. 10.12.

Figure 10.12 Reordering the Coordinate (COO) format.

In Fig. 10.12, we have reordered the elements of data, col_index, and row_index. Now, data[0] contains an element from row 3 and column 0 of the original sparse matrix. Because we have shifted the row index and column index values along with the data value, we can still correctly identify the location of this element in the original sparse matrix. The reader may ask why we would want to reorder these elements, since such reordering disturbs the locality and sequential access patterns that are needed for efficient use of memory bandwidth.

The answer lies in an important use case for the COO format: it can be used to curb the length of the longest rows in the CSR or ELL format. First, we make an important observation: in the COO format, we can process the elements in any desired order. For each element data[i], we can simply perform the operation y[row_index[i]]+=data[i]*x[col_index[i]]. The correct y element, identified by row_index[i], will receive the correct contribution from the product of data[i] and x[col_index[i]]. If this operation is performed for all elements of data, the correct final answer will be obtained regardless of the order in which these elements are processed.

Before converting a sparse matrix from the CSR format to the ELL format, we can remove some elements from the rows with exceedingly large numbers of nonzero elements and place them into a separate COO storage. We can use SpMV/ELL on the remaining elements. With the excess elements removed from the extra-long rows, the number of padded elements for the other rows can be significantly reduced. We can then use an SpMV/COO kernel to finish the job. This approach of employing two formats to collaboratively complete a computation is often referred to as a hybrid method.

A hybrid ELL and COO method for SpMV using our small sparse matrix is shown in Fig. 10.13. Row 2 has the largest number of nonzero elements. We remove the last nonzero element of row 2 from the ELL representation and move it into a separate COO representation. By removing the last element of row 2, we reduce the maximal number of nonzero elements among all rows in the small sparse matrix from 3 to 2. As shown in Fig. 10.13, the number of padded elements is reduced from 5 to 2. More importantly, all of the threads only need to perform 2 rather than 3 iterations, which can accelerate the parallel execution of the SpMV/ELL kernel by 50%.

Figure 10.13 Our small example in ELL and COO hybrid.

A typical way of using the ELL–COO hybrid method is for the host to convert a representation that is similar to the CSR format into the ELL format. During conversion, the host removes some nonzero elements from the rows with exceedingly large numbers of nonzero elements, places these elements into a COO representation, and then transfers the ELL representation of the data to the device. When the device completes the SpMV/ELL kernel, it transfers the resulting y values back to the host. These values are missing the contributions from the elements in the COO representation. The host then performs a sequential SpMV/COO pass on the COO elements to add their contributions to the y element values.

The reader may question whether the additional work done by the host to separate COO elements from an ELL format could incur excessive overhead. It depends. In situations where a sparse matrix is used in only one SpMV calculation, this extra work can indeed incur significant overhead. However, in a number of real-world applications, SpMV is performed on the same sparse matrix repeatedly in an iterative solver. In each iteration of the solver, the x and y vectors vary; however, the sparse matrix remains the same because its elements correspond to the coefficients of the linear system of equations being solved, and these coefficients do not change from iteration to iteration. Thus, the work done to produce both the hybrid ELL and COO representations can be amortized across many iterations. We will return to this point in the next section.

In our small example, the device finishes the SpMV/ELL kernel on the ELL portion of the data. The y values are then transferred back to the host, which adds the contribution of the COO element with the operation y[2]+=data[0]*x[col_index[0]], which evaluates to 1*x[3]. Note that, in general, the COO format includes multiple nonzero elements. Thus, we expect the host code to be a loop, as shown in Fig. 10.14.

Figure 10.14 A sequential loop that implements SpMV/COO.
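A minimal sketch of this loop, consistent with the operation described above (num_elem, an assumed parameter name, is the number of elements held in the COO representation):

void spmv_coo_sequential(int num_elem, float *data, int *col_index,
                         int *row_index, float *x, float *y) {
    for (int i = 0; i < num_elem; i++)                    // any processing order works
        y[row_index[i]] += data[i] * x[col_index[i]];     // scatter into the right y element
}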

The loop is extremely simple: it iterates through all the data elements and performs the multiply-and-accumulate operation on the appropriate x and y elements by using the accompanying col_index and row_index elements. We will not present a parallel SpMV/COO kernel in detail. It can be easily constructed by having each thread process a portion of the data elements and use an atomic operation to accumulate the results into the y elements. Atomic operations are needed because threads are no longer mapped to particular rows: many rows will likely be missing from the COO representation, since only the rows with exceedingly large numbers of nonzero elements contribute elements to it. Therefore, each thread takes a portion of the data elements and uses an atomic operation to ensure that no thread tramples the contribution of another thread.
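For concreteness, a hypothetical parallel SpMV/COO kernel along these lines might look as follows; here each thread handles a single element for simplicity (a grid-stride loop could give each thread a larger portion), atomicAdd provides the atomic accumulation, and the names are illustrative only.

__global__ void spmv_coo_kernel(int num_elem, float *data, int *col_index,
                                int *row_index, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;               // one thread per stored element
    if (i < num_elem)
        atomicAdd(&y[row_index[i]], data[i] * x[col_index[i]]);  // avoid lost updates to y
}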

The hybrid SpMV/ELL–COO method illustrates a productive use of both CPUs and GPUs in a heterogeneous computing system. The CPU can readily perform SpMV/COO by using its large cache memory. The GPU can quickly perform SpMV/ELL by using its coalesced memory accesses and large number of hardware execution units. The removal of some elements from the ELL format is a regularization technique: it reduces the disparity between long and short rows and improves the workload uniformity of all threads. Such enhanced uniformity provides certain benefits, including less control divergence in an SpMV/CSR kernel or less padding in an SpMV/ELL kernel.

10.5 Sorting and Partitioning for Regularization

While COO helps regulate the amount of padding in an ELL representation, we can further reduce the padding overhead by sorting and partitioning the rows of a sparse matrix. The idea is to sort the rows according to their length, e.g., from the longest to the shortest, as illustrated for our small sparse matrix in Fig. 10.15. Since the sorted matrix looks largely like a triangular matrix, the format is often referred to as the Jagged Diagonal Storage (JDS) format. As we sort the rows, we typically maintain an additional jds_row_index array that preserves the original index of each row. This array is similar to the row_ptr array in CSR in that it has one element for each row of the matrix. Whenever we exchange two rows during sorting, we also exchange the corresponding elements of the jds_row_index array, thereby keeping track of the original position of every row.

Figure 10.15 Sorting rows according to their length.
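One possible way to build the jds_row_index permutation on the host is to sort row indices by row length taken from a CSR row_ptr array. This is only an illustrative sketch under that assumption; the function and variable names are hypothetical.

#include <stdlib.h>

static const int *g_row_ptr;                       // row_ptr used by the comparator below

static int longer_row_first(const void *a, const void *b) {
    int ra = *(const int *)a, rb = *(const int *)b;
    int len_a = g_row_ptr[ra + 1] - g_row_ptr[ra]; // nonzeros in row ra
    int len_b = g_row_ptr[rb + 1] - g_row_ptr[rb]; // nonzeros in row rb
    return len_b - len_a;                          // longest rows come first
}

void build_jds_row_index(int num_rows, const int *row_ptr, int *jds_row_index) {
    for (int r = 0; r < num_rows; r++)
        jds_row_index[r] = r;                      // start with the identity permutation
    g_row_ptr = row_ptr;
    qsort(jds_row_index, num_rows, sizeof(int), longer_row_first);
}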

Once a sparse matrix is in the JDS format, we can partition the matrix into sections of rows. Since the rows have been sorted, all rows in a section will likely have similar numbers of nonzero elements. As shown in Fig. 10.15, the small matrix can be divided into three sections: the first section consists of one row with three elements, and the second section consists of two rows with two elements each. The third section consists of one row without any element. We can then generate an ELL representation for each section. Within each section, we only need to pad the rows to match the row with the maximal number of elements in that section. This method would reduce the number of padded elements. In our example, we do not even need to pad within any of the three sections. We can then transpose each section independently and launch a separate kernel on each section. In fact, we do not even need to launch a kernel for the section of rows with no nonzero elements.

Fig. 10.16 shows a JDS–ELL representation of our small sparse matrix, based on the sorting and partitioning results shown in Fig. 10.15. The first section has only one row, so its transposed layout is the same as the original. The second section is a 2×2 matrix and has been transposed. The third section consists of Row 1, which has no nonzero elements; this is reflected in the fact that its starting location and the starting location of the next section are identical.

Figure 10.16 JDS format and sectioned ELL.

We do not present an SpMV/JDS kernel in this chapter. One can simply use either an SpMV/CSR kernel on each section of a JDS-CSR representation or an SpMV/ELL kernel on each section of a JDS-ELL representation after padding. The host code required to create a JDS representation and to launch SpMV kernels on each section of the representation is left as an exercise.

Note that we want each section to have a large number of rows so that its kernel launch will be worthwhile. In extreme cases where a very small number of rows have extremely large numbers of nonzero elements, the COO hybrid with JDS can still be used to allow more rows in each section.

Once again, the reader should ask whether sorting rows will lead to incorrect solutions to the linear system of equations. Recall that we can freely reorder equations of a linear system without changing the solution. Provided that the y elements are reordered along with the rows, we are effectively reordering the equations. Therefore, we will obtain the correct solution. The only extra step is to reorder the final solution back to the original order by using the jds_row_index array.

As to whether sorting will incur a significant overhead, the answer is similar to what we saw for the hybrid method. Provided that the SpMV/JDS kernel is used in an iterative solver, the sorting and the reordering of the final solution x elements can be performed once, and their cost can be amortized over many iterations of the solver.

On relatively recent devices, the memory coalescing hardware has relaxed the address alignment requirement, which allows a JDS-CSR representation to be simply transposed. The jds_section_ptr array does not need to be adjusted after transposition. This further eliminates the need to pad rows in each section. As memory bandwidth increasingly becomes the limiting factor of performance, eliminating the need to store and fetch padded elements can be a significant advantage. Indeed, we have observed that while sectioned JDS–ELL tends to exhibit the best performance on older CUDA devices, transposed JDS-CSR tends to exhibit the best performance on Fermi and Kepler devices.

We would like to make an additional remark on the performance of sparse matrix computation compared with dense matrix computation. In general, the FLOPS achieved by either CPUs or GPUs is much lower for sparse matrix computation than for dense matrix computation. This finding is particularly true for SpMV, where there is no data reuse in the sparse matrix. The CGMA value (see Chapter 4: Memory and Data Locality) is essentially 1, limiting the attainable FLOPS to a small fraction of the peak performance. The various formats are important for CPUs and GPUs since both are limited by memory bandwidth when performing SpMV. Many have been surprised by the low FLOPS of this type of computation on both CPUs and GPUs. After reading this chapter, one should no longer be surprised.

10.6 Summary

In this chapter, we presented sparse matrix computation as an important parallel pattern. Sparse matrices are important in a number of real-world applications that involve modeling complex phenomena. Furthermore, sparse matrix computation is a simple example demonstrating the data-dependent performance behavior of many large real-world applications. Owing to the large number of zero elements, compaction techniques are used to reduce the amount of storage, memory accesses, and computation performed on these zero elements. Unlike most of the other kernels presented thus far in this book, the SpMV kernels are sensitive to the distribution of the data, specifically the nonzero elements, in sparse matrices. Not only can the performance of each kernel vary significantly across matrices; the kernels' relative merits can change significantly as well. Using this pattern, we introduced the concept of regularization through hybrid methods and sorting/partitioning. These regularization techniques are used in many real-world applications. Interestingly, some of the regularization techniques reintroduce zero elements into the compacted representations, and we use hybrid methods to mitigate the pathological cases in which too many zero elements would be reintroduced. Readers are referred to Bell and Garland [Bell 2009] and encouraged to experiment with different sparse data sets to gain additional insights into the data-dependent performance behavior of the various SpMV kernels presented in this chapter.

10.7 Exercises

1. Complete the host code to produce the hybrid ELL–COO format, launch the ELL kernel on the device, and complete the contributions of the COO elements.

2. Complete the host code for creating a JDS–ELL representation and launching one kernel for each section of the representation.

3. Consider the following sparse matrix:
1 0 7 0
0 0 8 0
0 4 3 0
2 0 0 1
Represent the matrix in each of the following formats: (a) COO, (b) CSR, and (c) ELL.

4. Given a sparse matrix of integers with m rows, n columns, and z nonzeros, how many integers are needed to represent the matrix in (a) COO, (b) CSR, and (c) ELL? If the information provided is insufficient, indicate what additional information is needed.

References

1. Bell, N. & Garland, M. (2009). Implementing sparse matrix–vector multiplication on throughput-oriented processors. In Proceedings of the ACM Conference on High-Performance Computing Networking Storage and Analysis (SC’09).

2. Hestenes, M. R., & Stiefel, E. (1952). Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49, 409–436.

3. Rice, J. R., & Boisvert, R. F. (1984). Solving Elliptic Problems Using ELLPACK. New York: Springer-Verlag.
