45.3. Algorithms, Implementations, and Evaluations (Detailed Description)
Most of the computation involved in the algorithm described in the previous section consists of simple elementwise operations on the image data. Given the availability of the highly optimized CUFFT library, the parallelization and optimization of our code are fairly straightforward. The exception is the wavelet transform, which required substantially more optimization than the other operations.
Our MRI pulse sequence undersamples the data in only two of the three dimensions. Undersampling in the first dimension, the readout direction, is not sensible. This is a consequence of the use of gradient fields by MRI scanners to manipulate k-space: acquiring an entire readout line is no more expensive than acquiring a single point along that line. Consequently, we can decouple the 3-D problem into many independent 2-D problems by performing an inverse Fourier transform along the readout direction. Each independent 2-D problem can be solved in parallel. Each 2-D problem is itself a substantial computation, and our implementation parallelizes the 2-D solver across a single GPU. In systems with multiple GPUs, we distribute 2-D problems round-robin to ensure load balance.
45.3.1. Calibration Consistency Operator
As described in Section 45.2, the application of the Calibration Consistency operator G is a convolution in the Fourier domain. It is well known that convolutions in the Fourier domain are equivalent to elementwise multiplications in the image domain. If the convolution kernel has m coefficients, and the image has n voxels, then the implementation of G in the Fourier domain is O(mn), while the implementation in the image domain is only O(n) after incurring an O(n log n) cost of performing Fourier transforms. Note that we must perform the Fourier transforms anyway; otherwise, we could not compute the wavelet transform and perform the soft-thresholding operation. Thus, the Fourier transforms effectively cost nothing, and an image-domain implementation is clearly more efficient. Furthermore, convolutions are more difficult to optimize than elementwise operations, and we can more easily produce a near-peak-performance image-domain implementation.
Figure 45.2 shows an example of the image-domain elementwise masks that are used to compute the Fourier-domain convolutions.
However, the image-domain implementation requires a substantially higher memory footprint than does the Fourier-domain implementation. In the Fourier domain, the convolution kernels are very small. Each kernel is typically 7³, and we typically have 8 × 8 of them. However, to represent these kernels in the image domain, we must zero-pad them to the size of the image (at least 128³) and perform an inverse Fourier transform. This increases memory usage by approximately three orders of magnitude: we must carefully manage this data, or we will overrun the memory capacity of the GPUs.
If there are eight channels in the MRI acquisition (typically there are eight, but there can be as many as 32), then we have 64 convolution kernels. For a typical volume size of 384 × 256 × 100 with eight channels, precomputing all of the image-domain masks would require 384 × 256 × 100 × 8 × 8 complex-valued, single-precision values: 4.6 GB of data. For a 32-channel acquisition, this would balloon to 75 GB. However, because the 3-D problem decouples into independent 2-D problems, we need to store the image-domain convolution masks only for as many 2-D problems as we have in flight simultaneously. Currently, we have one 2-D problem in flight per GPU in the system. This may change in future versions of the code, however, as we will discuss in Section 45.5. For each 256 × 100 2-D problem of a 32-channel acquisition, the 2-D convolution masks require 200 MB of memory and fit in the DRAM of any reasonably sized CUDA-capable GPU.
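These footprint figures follow directly from counting mask entries: each mask entry is a complex single-precision value (8 bytes), and a C-channel acquisition needs C × C masks, each the size of the (2-D or 3-D) image. A minimal sanity-check helper, with the illustrative name mask_bytes (not part of the chapter's code), makes the arithmetic explicit:

```c
#include <assert.h>
#include <math.h>

/* Bytes needed to precompute image-domain masks: one complex single-precision
 * value (8 bytes) per voxel, per (input channel, output channel) pair.
 * Purely illustrative; not part of the chapter's implementation. */
static double mask_bytes(long nx, long ny, long nz, long channels)
{
    return (double)nx * ny * nz * channels * channels * 8.0;
}
```

For the 384 × 256 × 100 volume this yields about 4.7 GiB with 8 channels and 75 GiB with 32 channels, and 200 MiB for a single 256 × 100 slice at 32 channels, matching the figures quoted above.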
Our implementation of the calibration consistency operator uses neither the texture cache nor the shared memory, because there is little data reuse to exploit. All data (the convolution masks and the image data itself) are stored in the GPU's DRAM and streamed through using standard coalesced memory operations. Because the code performs just under one complex floating-point operation per complex float loaded from DRAM, the computation rapidly becomes DRAM-bandwidth bound, and no further optimization is necessary.
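The streamed elementwise computation can be sketched on the CPU as follows. For every voxel v, each output channel is a per-voxel linear combination of all input channels: out[c][v] = Σc' mask[c][c'][v] · img[c'][v]. The function and array names here (apply_G_image_domain, mask, img, out) are illustrative, not taken from the chapter's source:

```c
#include <assert.h>
#include <complex.h>

/* CPU reference sketch (not the authors' CUDA kernel) of the image-domain
 * Calibration Consistency operator G.  mask is laid out [c][c'][v], img and
 * out are laid out [channel][voxel].  Each output value is a complex
 * multiply-accumulate over the input channels at the same voxel, so the
 * whole operator is a pure elementwise stream over the data. */
static void apply_G_image_domain(int channels, int voxels,
                                 const float complex *mask,
                                 const float complex *img,
                                 float complex *out)
{
    for (int c = 0; c < channels; c++)
        for (int v = 0; v < voxels; v++) {
            float complex acc = 0;
            for (int cp = 0; cp < channels; cp++)
                acc += mask[((long)c * channels + cp) * voxels + v]
                     * img[(long)cp * voxels + v];
            out[(long)c * voxels + v] = acc;
        }
}
```

Each output value touches channels mask entries and channels image samples and performs one complex multiply-add per load, which is why the GPU version is DRAM-bandwidth bound.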
45.3.4. Wavelet Transform
Figure 45.3 shows an example of the application of a wavelet transform to an MR image. The forward and inverse wavelet transforms are the most intricately optimized steps of our implementation. The algorithm we use to compute the wavelet transforms in ℓ1-SPIRiT is O(n lg n), where n is the number of pixels in the image; it repeatedly convolves the image with high- and low-pass filters, downsampling at every step. The result is a spatial separation of the image content corresponding to high- and low-frequency Fourier coefficients. The algorithm performs O(lg n) steps, during each of which the image undergoes several O(n) convolutions.
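The level-by-level structure just described can be sketched in plain C for the 1-D case, using the simplest (Haar) filter pair as a stand-in for the chapter's length-4 filters; the name haar_forward_1d is illustrative:

```c
#include <assert.h>
#include <math.h>
#include <string.h>

/* 1-D multilevel wavelet transform sketch using the Haar pair.  At each of
 * the O(lg n) steps, a length-len signal is split into len/2 low-pass
 * (approximation) and len/2 high-pass (detail) coefficients; the next step
 * recurses on the low-pass half only.  n must be a power of two. */
static void haar_forward_1d(float *x, int n)
{
    float tmp[n];                               /* C99 VLA scratch buffer */
    for (int len = n; len >= 2; len /= 2) {     /* O(lg n) steps */
        for (int j = 0; j < len / 2; j++) {     /* downsampled convolution */
            tmp[j]         = (x[2*j] + x[2*j + 1]) / sqrtf(2.0f); /* low  */
            tmp[len/2 + j] = (x[2*j] - x[2*j + 1]) / sqrtf(2.0f); /* high */
        }
        memcpy(x, tmp, (size_t)len * sizeof(float));
    }
}
```

A constant signal ends up with all of its energy in the single remaining low-pass coefficient and zeros everywhere else, illustrating the spatial separation into low- and high-frequency content.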
If, during one step of a wavelet transform, the image is H pixels tall and W pixels wide, our goal is to produce four H/2 by W/2 pixel images, one of which is high-pass filtered in both the X and Y dimensions, one of which is low-pass filtered in both dimensions, and two of which are a combination of low- and high-pass filtered in the two dimensions. This could be achieved by performing four 2-D convolutions and downsampling by a factor of two. However, a more efficient implementation leverages the separability of the wavelet filters. In particular, the 2-D convolutions are performed as pairs of 1-D convolutions: first convolving each length-W row with high- and low-pass filters, then convolving each length-H column. The downsampling operation can be combined with the convolution to reduce computation by a factor of 2×.
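The fused downsample-and-convolve idea can be sketched on the CPU as follows: rather than computing all n outputs of the circular convolution and discarding the odd-indexed ones, we compute only the n/2 retained outputs directly. The name downsample_convolve is illustrative; n is assumed to be a power of two so that masking with n - 1 implements the periodic boundary:

```c
#include <assert.h>

/* Downsampling circular convolution of a length-n signal with a length-P
 * filter.  Only the even-indexed outputs are computed, halving the work
 * relative to convolve-then-discard.  Requires n to be a power of two. */
static void downsample_convolve(const float *in, int n,
                                const float *filt, int P, float *out)
{
    for (int j = 0; j < n / 2; j++) {
        float y = 0.0f;
        for (int k = 0; k < P; k++)
            y += in[(2 * j + k) & (n - 1)] * filt[k];
        out[j] = y;
    }
}
```

This is the same indexing pattern used by the GPU kernel listed below, minus the shared-memory blocking.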
While we are performing a convolution on a row (or column) of the image, it is crucial to cache the input data to leverage the data reuse inherent in convolution. We are using length-P (usually four) filters and performing two down-sampling convolutions, so effective caching will reduce memory bandwidth requirements by a factor of P×. Again, we expect that an efficient implementation of the wavelet transform should be memory bound, owing to the extremely high floating-point throughput of the GPU. Note that when we are convolving in the non-unit-stride dimension (in our case, convolving the rows of the image), it is also necessary to block in the unit-stride dimension. In order for memory accesses to be coalesced, we must load up to 16 rows at a time into the GPU's shared memory. In fact, we found that blocking increased the performance of both our row and column convolutions, likely because of the increased thread-block size the blocking transformation enabled. Once the blocked data is in shared memory, we can perform our convolutions rapidly on the cached data.
The implementations of the row and column convolutions for both the forward and inverse transforms are similar. Here, we provide a partial listing of the source for the forward transform's row convolution, with some details omitted for clarity. Note that the shared_rows variable represents the data we are caching in the GPU's shared memory, and the constant K determines how many rows we load into a single thread block's memory. Because the data is column major, K must be large enough to ensure that global memory accesses are coalesced. The high- and low-pass filters are stored in the GPU's constant memory in arrays named hpf and lpf, respectively. Also, note that while the MRI data is complex valued, our wavelet filters are real valued. Thus, we perform the transform independently on the real and imaginary components of the data.
__global__ void forward_wavelet_rows()
{
    // some initialization omitted for brevity
    extern __shared__ float shared_rows[];
    for (int j = threadIdx.y; j < n; j += blockDim.y)
        shared_rows[threadIdx.x + j*blockDim.x] = X[i + M*j];
    __syncthreads();
    // Low-pass downsample
    for (int j = threadIdx.y; j < n/2; j += blockDim.y) {
        float y = 0;
        for (int k = 0; k < P; k++)
            y += shared_rows[threadIdx.x + K*((2*j+k)&(n-1))]*lpf[k];
        X[i + j*M] = y;
    }
    // High-pass downsample
    for (int j = threadIdx.y; j < n/2; j += blockDim.y) {
        float y = 0;
        for (int k = 0; k < P; k++)
            y += shared_rows[threadIdx.x + K*mod(2*j+1-k, n)]*hpf[k];
        X[i + (n/2)*M + j*M] = y;
    }
}
This code parallelizes the computation of each row's convolution among the blockDim.y threads in a thread block. Since K rows are loaded, the entire thread block computes K convolutions in parallel. However, a wavelet transform of a single image is insufficient to saturate the GPU for any appreciable amount of time. Additional parallelism is found in performing the transforms for all 8–32 channels simultaneously. Additionally, our wavelet filters are real valued, and transforms of the complex-valued MRI images further decompose into real and imaginary transforms, which can be performed in parallel.
void forward_wavelet(float *X_real, float *X_imag, int M, int N, int C, int L)
{
    int T = 256; // threads per thread block
    int min_size = (1 << L);
    // other variable initialization omitted for clarity
    for (int m = M, n = N, j = J; j > L; j--)
    {
        if (m > min_size)
        {
            for (K = MAX_K; K*m*sizeof(float) > SHARED_MEM_SIZE; K >>= 1);
            forward_wavelet_cols <<< dim3(2, C*(n/K)), dim3(T/K, K), K*m*sizeof(float) >>> (X_real, X_imag, M, m, N, n, C);
        }
        if (n > min_size)
        {
            for (K = MAX_K; K*n*sizeof(float) > SHARED_MEM_SIZE; K >>= 1);
            forward_wavelet_rows <<< dim3(2*(m/K), C), dim3(K, T/K), K*n*sizeof(float) >>> (X_real, X_imag, M, m, N, n, C);
        }
        if (m > min_size) m = m/2;
        if (n > min_size) n = n/2;
    }
}
In the preceding code, we call the column and row convolutions as separate kernels: all convolutions are performed in place in global DRAM, and the output of the column convolutions is the input to the row convolutions. The grid dimensions of dim3(2, C*(n/K)) and dim3(2*(m/K), C) specify that we want to perform convolutions for all C image channels in parallel, that we are grouping the n column convolutions (or m row convolutions) into groups of K, and that we multiply by two in order to process the real and imaginary components separately.