36.3. Algorithms, Implementations, and Evaluations
Obviously, a GPU implementation of de-mosaicing benefits from the massive parallelism of GPUs. Equally important to achieve high throughput is proper memory management and memory access patterns. Although GPUs have tremendous memory bandwidth, in excess of 100 Gb/s on the fastest cards, with the sheer amount of calculation necessary per pixel, the limit to achieve real-time performance is quickly surpassed when reading directly from uncached memory. Graphics shader programmers, on the other hand, will be familiar with using textures to read image data. Texture reads are cached and provide significant improvement over reading directly, but the cache size is limited, not under programmer control, and there is no provision for storing or sharing intermediate calculations without writing back to the GPU's external DRAM, or global memory (GMEM). In the CUDA architecture, NVIDIA added shared memory (SMEM), an extremely fast user-managed on-chip memory that can be shared by multiple threads. SMEM is perfect for storing source pixel information and will form the backbone of our de-mosaicing implementation.
The provided source contains implementations of GPU de-mosaicing using a few different methods: bilinear interpolation, Lanczos
[2]
interpolation, a gradient-modified interpolation
[3]
and the Patterned-Pixel Group method
[4]
. The particular details of the methods will be discussed later after a look at the overall application. Start by examining the file
main.cpp,
which contains a single
main()
function with the entire application control. The first step is to create an OpenGL window to display the output image. The details of OpenGL display or CUDA/OpenGL interop are beyond the scope of this chapter and are not discussed here
1
; a simple
CudaGLVideoWindow
class contains all the code necessary to draw the image on the screen.
Next, CPU memory is allocated to store the original raw image prior to transfer to the GPU. Pinned host memory, allocated with
cudaHostAlloc(),
is selected to facilitate fast, asynchronous DMA (Direct Memory Access) transfers of data to the GPU. In a real-time system, pinned memory would facilitate simultaneous data transfer and processing so that one frame could be processed, while the next frame is read and uploaded to the GPU. To avoid convoluting this sample with lots of file-parsing code, no attempt is made to read any standard raw file format; rather, several images are provided in a flat binary file, and all image properties are hard-coded above the declaration of
main().
Thus, users can easily replace this code with the necessary routines to acquire images from a particular source. Also, and consistent with our goal of simplicity, the sample implements only the common red-green1-green2-blue (or R-G1-G2-B) pixel arrangement shown in
Figure 36.1
. Best performance results when targeting the kernel to a specific pixel arrangement rather than including all the conditional logic to support possible variations.
The output image for this sample is an RGBA image with 32-bit/channel floating-point precision. GPUs efficiently handle 32-bit floating-point data, and this preserves the color precision and dynamic range of high-quality image sources; however, other output types are trivially substituted.
The quality of raw conversion algorithms is highly subjective. It is routinely debated, and depends on particular properties of specific cameras or the type of images acquired. Some algorithms, such as the PPG method, actually switch between calculations or vary coefficients based upon edge gradients or image content. Bilinear interpolation is chosen for simplicity and because it is regularly used when processing power is limited. Lanczos interpolation is also simple conceptually, but it provides superior output quality and can run on the GPU with only a modest decrease in throughput, despite a significant increase in overall computation load. The Malvar method
[3]
applies a gradient correction factor to the results of bilinear interpolation for dramatic improvement in quality with little additional complexity (and in this example we also apply the Malvar gradients to the Lanczos interpolation). Finally, the PPG method demonstrates a more complex conditional approach common in many of the most advanced de-mosaicing methods. This chapter does not advocate for the use of any particular method, or make claims about the comparative quality of different methods. Rather, this sample demonstrates the techniques to achieve high-performance de-mosaicing on an NVIDIA GPU, and developers can easily substitute their own preferred algorithms if desired (also see
[5]
).
The file
DeBayer.cu
contains the CUDA C kernels along with the kernel launch functions. Examine
DeBayer_Bilinear_16bit_RGGB_kernel(),
the bilinear de-mosaicing kernel. In this method, the missing color components for each pixel are computed from the average of the neighboring adjacent pixels of the other colors. More precisely refer to
Figure 36.3
and the following formulas:
Each thread will process one 2 × 2 pixel quad and outputs four RGBA F32 pixels. This strategy prevents thread divergence that would occur if each thread processed one output pixel and applied different formulas. Two pixels are read once as a
ushort2
value by each thread. The first two steps of the pipeline of
Figure 36.2
are implemented in one simple code statement:
fval = CLAMP((float)((int)inPix.x − black_level) * scale, 0,1.0f);
In our example case the black level is assumed uniform across the image. As previously mentioned, the black level is commonly spatially dependent upon row or column location owing to fixed pattern noise from manufacturing variations in the row and column read-out circuitry and A/D converter variability in the image sensor. In this case the variable
black_level
may be replaced with a row or column lookup-table. The resulting integer value is then cast to float and multiplied by a predetermined
scale
variable to convert the value to floating point in the 0.0f to 1.0f range. Again, for simplicity, the source data is assumed linear; if it is not then a linearization function may be applied. In this case a 1-D CUDA texture provides an ideal way to implement a lookup table with hardware linear interpolation.
The linear scaled floating-point raw values are now stored in the GPU's shared memory (SMEM). SMEM is extremely fast on-chip memory accessible to all threads within a thread block. It is analogous to CPU cache memory, but allocated and accessed under programmer control and thus can be relied upon for deterministic performance. SMEM serves the needs of de-mosaicing perfectly because the multiple threads in a block need to access the same source raw pixel values.
The thread block size dictates the input tile size stored. Additional apron source pixels are needed to interpolate pixels at edges of the tile, so the source reads are shifted up one row and left two columns with the left two threads in the thread block reading extra columns and the top two thread rows reading extra rows (two extra columns are read on each side rather than the one needed to maintain 32-bit read alignment).
2
For the first generation of NVIDIA GPUs with compute capability 1.x, a thread block size of 32 × 13 requires 34 × 14 × 4 pixels, or 7616 bytes total, just less than 50% of the total 16 kilobytes of SMEM. Thus, two thread blocks run concurrently per SM, ensuring good overlap between memory access and computation. On the newest compute capability 2.x GPUs, 48 kB of SMEM is available, so a thread block size of 32 × 24 is chosen. SMEM is no longer the limiting factor; rather, 768 threads is 50% of the maximum 1572 threads per SM.
Reading the source data and storing the values into SMEM compose the majority of the code in the sample, and unfortunately complicates the implementation quite a bit. The performance benefit far outweighs the complexity that arises from shifting the reads for the extra row and column apron pixels, and properly clamping on the image boarders. Also, note that the raw source pixels are stored in SMEM as four different color planes rather than a contiguous tile. This prevents a two-way SMEM bank conflict that would arise because every thread processes two output pixels per row. The arrangement of the source reading and storage in SMEM is shown in
Figure 36.4
. Also note that the blue and green-2 SMEM tiles contain one extra apron row on top and the red and green-1 tiles contain their extra apron row on bottom.
The read process may also be understood by referring to the following pseudo code, which directly mirrors the code for the bilinear interpolation in the accompanying source code:
Pseudo Code for Bilinear Interpolation reads to SMEM
Notes:
X
is the two-pixel column index in the image (Reading two 16-bit pixels at a time).
References to ‘X’ or ‘Column’ below are two physical columns.
Y
is the row index.
Tx
and
Ty
are the thread indices.
Boundary checking for off-of-image reads are omitted in this pseudo code for clarity.
// Step 1: Read Green2 and Blue pixels for the top left part of the tile
// Reads are shifted up one row and left one double-column
Column = X − 1
Row = Y − 1
Pixels = Image[Row][Column]
Linearize and Scale Pixels
Green2_SMEM_Tile[Ty][Tx] = Pixels[0]
Blue_SMEM_Tile[Ty][Tx] = Pixel[1]
// Step 2: Read the extra Green2 and Blue pixels for the right apron
If Tx< 2
Columnextra Green2 and Blue pixels = X + Thread Block Width − 1
Pixels = Image[Row][Column]
Linearize and Scale Pixels
Green2_SMEM_Tile[Ty][Tx + Thread Block Width] = Pixels[0]
Blue_SMEM_Tile[Ty][Tx + Thread Block Width] = Pixel[1]
//Step 3: Read the Red and Green 1 pixels from the next row
Row = Y
Column = X − 1
Pixels = Image[Row][Column]
Linearize and Scale Pixels
Red_SMEM_Tile[Ty][Tx + Thread Block Width] = Pixels[0]
Green1_SMEM_Tile[Ty][Tx + Thread Block Width] = Pixel[1]
// Step 4: Read the extra Red and Green 1 pixels for the right apron
If Tx< 2
Column = X + Thread Block Width − 1
Pixels = Image[Row][Column]
Linearize and Scale Pixels
Green2_SMEM_Tile[Ty][Tx + Thread Block Width] = Pixels[0]
Blue_SMEM_Tile[Ty][Tx + Thread Block Width] = Pixel[1]
// Step 5: Need to read two rows on the bottom to complete the apron
If Ty < 2
Column = X − 1 Row = (BlockIdx.y + 1) + BlockHeight*2 + Ty − 1
Pixels = Image[Row][Column]
Linearize and Scale Pixels
If Ty = 0
Green2_SMEM_Tile[Ty + Thread Block Height][Tx] = Pixels[0]
Blue_SMEM_Tile[Ty + Thread Block Height][Tx] = Pixel[1]
If Ty = 1
Red_SMEM_Tile[Ty + Thread Block Height][Tx] = Pixels[0]
Green1_SMEM_Tile[Ty + Thread Block Height][Tx] = Pixel[1]
// Step 6: Read the bottom right apron pixels
If Ty < 2 and Tx< 2
Column = X + Thread Block Width − 1
Row = (BlockIdx.y + 1) + BlockHeight*2 + Ty − 1
Pixels = Image[Row][Column]
Linearize and Scale Pixels
If Ty = 0
Green2_SMEM_Tile[Ty + Thread Block Height][Tx + Thread Block Width] = Pixels[0]
Blue_SMEM_Tile[Ty + Thread Block Height][Tx + Thread Block Width] = Pixel[1]
If Ty = 1
Red_SMEM_Tile[Ty + Thread Block Height][Tx + Thread Block Width] = Pixels[0]
Green1_SMEM_Tile[Ty + Thread Block Height][Tx + Thread Block Width] = Pixel[1]
Wait until all threads in thread block complete read.
Following the reading from global memory into SMEM, each thread block must synchronize with a
__syncthreads()
call to ensure all reads are complete before continuing. For this reason, the block sizes discussed previously are chosen to allow two blocks to run per SM, thereby giving the GPU compute work to do on one block, while another reads and waits for synchronization.
After synchronization each thread computes the missing color components for each output pixel in the 2 × 2 quad according to the preceding formulas. The output RGB value for each pixel is multiplied by the 3 × color conversion matrix for the target output color space. More sophisticated color conversion methods, such as a 3-D lookup table using 3-D textures, may be substituted here to account for different rendering intent. Finally, the output gamma is applied and the result is clamped in the 0 to 1.0 range. Note that clamping may be omitted if out-of-gamut colors are permitted. Here is the CUDA-C code for the interpolation of additional color channels of the red filtered (northwest) pixel in the quad. The variables
tile_R, tile_G1, tile_G2
, and
tile_B
refer to the SMEM memory tiles containing the source pixel values for each color channel.
// Bilinear Interpolation
float4 NW;
NW.x = tile_R[sy][sx];
NW.y = 0.25f * (tile_G2[sy][sx] + tile_G2[sy+1][sx] +
tile_G1[sy][sx-1] + tile_G1[sy][sx]);
NW.z = 0.25f * (tile_B[sy][sx-1] + tile_B[sy][sx] +
tile_B[sy+1][sx-1] + tile_B[sy+1][sx]);
NW.w = 1.0f;
// color correction
float r = ColorMatrix[0] * NW.x + ColorMatrix[1] * NW.y + ColorMatrix[2] * NW.z;
float g = ColorMatrix[3] * NW.x + ColorMatrix[4] * NW.y + ColorMatrix[5] * NW.z;
float b = ColorMatrix[6] * NW.x + ColorMatrix[7] * NW.y + ColorMatrix[8] * NW.z;
// gamma
NW.x = CLAMP(__powf(r,gamma),0,1.0f);
NW.y = CLAMP(__powf(g,gamma),0,1.0f);
NW.z = CLAMP(__powf(b,gamma),0,1.0f);
// write output
pOut[Y*2*OutPitch_in_pixels + X*2] = NW;
Each thread writes each pixel of the 2 × 2 quad back out to GPU GMEM after computation is completed. Because each thread processes 2 pixels per row, only every other 128-bit word is written at a time, and strict coalescing does not occur. Alternatively, these values could be stored until the entire quad is processed, written back to SMEM (which can now be reused because the processing of the source raw data is complete), and then written to GMEM in a coalesced fashion. Experimentally, it has been determined that just writing the values immediately performs better because the writes of the first 3 pixels per thread can overlap the processing of the subsequent pixels. It is quite common to perform sharpening, noise reduction, or other color adjustments as part of the raw conversion process. After interpolation pixels may be stored back to the now free SMEM and filters applied. Performing this processing may be more efficient than running a separate kernel later because extra passes through GMEM are eliminated.
3
When all additional filtering completes, then apply the final color conversion and gamma.
36.3.1. Improved Filtering
Bilinear interpolation leaves much to be desired in image quality that is especially noticeable at the edges in images. It is quickly evident that only a small portion of the kernel code in the preceding example relates to the actual interpolation computation, so presumably more complex methods may be used without a substantial performance penalty. Here, we discuss additional methods that improve upon the quality of the bilinear interpolation kernel.
In the sample, a second kernel,
DeBayer_Lanczos_16bit_RGGB_kernel()
, implements Lanczos interpolation. Lanczos interpolation is regarded as one of the highest-quality methods of image resampling. Coefficients for the Lanczos operator are given by:
Where
x
is the distance between each pixel in the neighborhood and the interpolated location and
a
is the order of the filter (a = 2 in our example). A plot of the Lanczos function is shown in
Figure 36.6
.
To interpolate the missing color values for each pixel compute a weighted sum of the neighboring pixels filtered in the desired color. The weights are the product of the x and y Lanczos function values for the distance between the output pixel and the neighbors in each axis. Consider interpolating the blue and green color values for a red filtered pixel.
Figure 36.7
shows the weights by which the neighboring green and blue pixels are multiplied and then summed to compute the corresponding blue and green color values. The sum is then normalized by multiplying by the reciprocal of the sum of the weights.
The Lanzcos interpolation CUDA kernel function is similar to the bilinear interpolation kernel, with the following substantial differences:
• A wider apron read is necessary to accommodate the larger neighborhood used to compute the interpolated values; thus, three extra rows above and below and four extra columns right and left are read. Because of the larger apron more SMEM is required, and thus, a smaller block size is used.
• The actual interpolation computation involves using several more input pixels and coefficients for each output pixel, but otherwise, follows the form for the bilinar kernel.
The CUDA-C code for computing the interpolated values of a red filtered pixel is as follows:
// Lanczos Interpolation
NW.x = tile_R[sy][sx];
NW.y = (−0.063684f * tile_G2[sy-1][sx] +
0.573159f * tile_G2[sy][sx] +
-0.063684f * tile_G1[sy][sx-2] + 0.573159f * tile_G1[sy][sx-1] +
0.573159f * tile_G1[sy][sx] + −0.063684f * tile_G1[sy][sx+1] +
0.573159f * tile_G2[sy+1][sx] + −0.063684f * tile_G2[sy+2][sx])
* 0.490701f;
NW.z = (0.004056f * tile_B[sy-1][sx-2] + −0.036501f * tile_B[sy-1][sx-1] +
-0.036501f * tile_B[sy-1][sx] + 0.004056f * tile_B[sy-1][sx+1] +
-0.036501f * tile_B[sy][sx-2] + 0.328511f * tile_B[sy][sx-1] +
0.328511f * tile_B[sy][sx] + −0.036501f * tile_B[sy][sx+1] +
-0.036501f * tile_B[sy+1][sx-2] + 0.328511f * tile_B[sy+1][sx-1] +
0.328511f * tile_B[sy+1][sx] + −0.036501f * tile_B[sy+1][sx+1] +
0.004056f * tile_B[sy-2][sx-2] + −0.047964f * tile_B[sy+2][sx-1] +
-0.047964f * tile_B[sy+2][sx] + 0.004056f * tile_B[sy+2][sx+1]) * 0.963151f;
NW.w = 1.0f;
Malvar, He, and Cutler improved upon basic interopolation by including a gradient-correction gain factor that exploits the correlation between the red, green, and blue channels in an image. A weighted luminance gradient of the actual filter color for each pixel is added to the output of the interpolated missing color. The specific gradient correction gain calculation forumulas for each particular interpolation case are given in
[3]
and in the accompanying source code. The gain correction factor uses an additional 5 to 9 additional pixel values in the calcualtion of the final interpolated color. However, because these values are already located in SMEM, the overall performance of the calcualtion is remarkably close to simple bilinear interpolation. We now list the calculation of the output of a red filtered pixel for comparison with the previously listed methods.
// Bilinear Interpolation + Gradient Gain Correction Factor
NW.x = tile_R[sy][sx];
NW.y = 0.25f * (tile_G2[sy][sx] + tile_G2[sy+1][sx]
+ tile_G1[sy][sx-1] + tile_G1[sy][sx]) + 0.5f * tile_R[sy][sx] +
-0.125f * (tile_R[sy-1][sx] + tile_R[sy+1][sx] + tile_R[sy][sx-1]
+ tile_R[sy][sx+1]);
NW.z = 0.25f * (tile_B[sy][sx-1] + tile_B[sy][sx] + tile_B[sy+1][sx-1]
+ tile_B[sy+1][sx]) + 0.75f * tile_R[sy][sx] +
-0.1875f * (tile_R[sy-1][sx] + tile_R[sy+1][sx] + tile_R[sy][sx-1]
+ tile_R[sy][sx+1]);
NW.w = 1.0f;
The Patterned Pixel gGrouping (PPgG) algorithm is used to interpolate red (R), green (G), and blue (B) values at each location, just as in the previous discussions, although the algorithm differs from the previous examples. PPG performs a two-phase computation that begins by interpolating all missing green values. For interpolation of the green values at blue or red pixels consider estimating
G
2,2
at
R
2
2
. First, the method calculates gradients in four directions, centered at pixel
R
2
2
:
Next, the algorithm determines which direction has the minimum gradient and then uses that direction to estimate green values at the red locations, such as at
G
2,2
, as:
This fully fills in the green channel at all locations, and those results can now be used in the next steps. For interpolation of the blue and red values at green pixels, the algorithm estimates
B
2,1
and
R
2,1
at
G
2,1
by a “hue transit” function, given a set of inputs:
where hue_;transit() is defined as:
hue_transit(l
1
,
l
2
,
l
3
,
v
1
,
v
3
){
if (l
1
<
l
2
<
l
3
||
l
1
>
l
2
>
l
3
)
return(v
1
+ (v
3
−
v
1
) × (l
2
×
l
1
)/(l
3
×
l
1
)
else
return(v
1
+
v
3
)/2 + (l
2
× 2 −
l
1
−
l
3
)/4
}
Finally, for interpolation of the blue or red values at red or blue pixels, consider estimating
B
2,2
at
R
2,2
:
Then, the final values for
B
2,2
and
R
2,2
are determined as:
if Δne ≤ Δnw
B
2,2
= hue_transit(G
3,2
,
G
2,2
,
G
1,3
,
G
3,2
,
G
1,3
)
else
B
2,2
= hue_transit(G
1,1
,
G
2,2
,
G
3,3
,
G
1,1
,
G
3,3
)
Each pixel processed in the PPG method is dependent only on a neighborhood of pixels around it. As in the previous examples, multiple reads and writes are issued to this neighborhood, and it can be stored in shared memory for faster repeated read access. In the case of PPG, the size of the region must be somewhat extended to accommodate for generating apron green values to avoid any need for a global synchronization (which would require a new kernel launch).
A tile of Bayer data and surrounding apron region can be copied from global memory into shared memory, and a group of threads can process a tile. If the CUDA-C algorithm was written to have each thread process a single element of the Bayer pattern, this would result in divergence because the operations to interpolate values for different colors vary at each location. So, as in the previous methods, the algorithm can be written so that each thread is responsible for processing a 2 × 2 region of the data corresponding to a “quad” of four data element locations: {
R
1
G
1
B
1
G
} = {
R
0,0
,
G
1,0
,
B
1,1
,
G
0,1
}. This way, all threads execute interpolations for
R
0,0
, then
G
1,0
, and so on, and so avoid divergent paths.
The PPG algorithms require substantially more SMEM than the prior methods for storing the interpolated green values from phase 1, and bank conflicts must again be avoided. Thus, for PPG analysis, we examined an alternative case where the input data is stored in SMEM as 16-bit short integer data. Linearization and black-level subtraction may still be performed upon the read of the raw data, but then values are converted back to short integers to conserve SMEM. Conveniently, when 16-bit source Bayer pattern values are written into shared memory and with each thread processing a “quad” of data, it can be seen that successive threads will each access different and successive banks of shared memory. This is a result of the banks accommodating two 16-bit elements per bank, and a row of Bayer data consisting of two 16-bit elements per “quad” element on a row, meaning the next quad's
R
1
value is spaced 32 bits, or one bank, away from the previous quad's
R
1
value, and so on. Thus, operating on 2 × 2 regions of 16-bit elements avoids divergence and bank conflicts. Finally, as seen in the preceding formulas the PPG method does not require any computation with precise floating-point weights; all calculations use integer math. Thus, the entire calculation and final output are written using unsigned short values.