The next step is the implementation of a CUDA kernel that processes regions of pixels at the original scale using the whole classifier cascade. The kernel receives the classifier cascade, the integral and squared integral images, the normalization map, and a pointer to a segment of global memory where the classification output is stored.
The simplest implementation of such a kernel maps the processing flow directly onto the CUDA grid and block configuration. The kernel is launched with as many threads as there are M × N regions to be classified in the image, that is, (height − M + 1) × (width − N + 1). Each CUDA thread works within its search region and sequentially calculates stages, classifiers, and features until the classification decision is made. It then writes the decision flag to an output detections mask of the same size as the input image.
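As a rough illustration, the naive kernel could look like the following sketch. All names here (classifyRegions, evaluateStage, c_numStages) are hypothetical and the per-stage weak classifier evaluation is elided; this is not the chapter's actual code.

__constant__ int c_numStages;   // cascade depth, uploaded by the host

// Per-stage evaluation (weak classifiers, features, normalization) is elided.
__device__ bool evaluateStage(int stage,
                              const unsigned int *integral,
                              const unsigned int *sqIntegral,
                              const float *normMap,
                              int x, int y, int width)
{
    // ... sum the weak classifier responses for the region anchored at (x, y)
    // and compare the sum against the stage threshold ...
    return true; // placeholder
}

// One thread per anchor: (height - M + 1) * (width - N + 1) threads total,
// laid out on a 2-D grid.
__global__ void classifyRegions(const unsigned int *integral,
                                const unsigned int *sqIntegral,
                                const float *normMap,
                                unsigned char *outMask,
                                int width, int height, int M, int N)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x > width - N || y > height - M) return;

    // Walk the cascade sequentially: stages -> classifiers -> features.
    bool passed = true;
    for (int s = 0; s < c_numStages && passed; ++s)
        passed = evaluateStage(s, integral, sqIntegral, normMap, x, y, width);

    // The decision flag goes to a mask of the same size as the input image.
    outMask[y * width + x] = passed ? 1 : 0;
}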
The output mask is expected to contain few detections, so for convenient iteration over them they are collected into a list of detections using the stream compaction algorithm.
Stream compaction is a data-manipulation primitive available in both the CUDPP and Thrust libraries (in the Stream Compaction and Reordering section). The algorithm takes an input vector of values and a compaction predicate, which indicates whether an element of the input vector should appear in the output. It then outputs the "compacted" vector of only those values for which the predicate evaluated to true.
Note that under certain conditions, if the input vector contains very few elements of interest, the use of atomic operations on global memory instead of a scan may give better performance.
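As an illustration of the compaction step, a minimal sketch using Thrust might look as follows; the predicate and the function name compactDetections are assumptions, not code from the chapter.

#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>

// Predicate: keep the anchors for which the cascade set the mask flag.
struct IsDetection
{
    __host__ __device__ bool operator()(unsigned char flag) const
    {
        return flag != 0;
    }
};

// mask holds one flag per anchor; the result is a compacted list of the
// linear indices of accepted anchors.
thrust::device_vector<int> compactDetections(
    const thrust::device_vector<unsigned char> &mask)
{
    thrust::device_vector<int> indices(mask.size());
    thrust::device_vector<int>::iterator end =
        thrust::copy_if(thrust::counting_iterator<int>(0),
                        thrust::counting_iterator<int>((int)mask.size()),
                        mask.begin(),          // stencil: the detection mask
                        indices.begin(),
                        IsDetection());
    indices.resize(end - indices.begin());
    return indices;
}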
Processing Optimizations
The approach to classifier cascade application with CUDA that maps each region to one CUDA thread is hereby called pixel-parallel processing (Figure 33.14). The CUDA block configuration is chosen to be linear for convenience. As a convention, the anchor of an M × N region is represented by its top-left pixel (the middle pixel in the illustrations of Figure 33.15). Because all regions have to be processed by the kernel, the threads within a CUDA block will together address an M × (blockDim.x + N − 1) region of the integral image. This suggests that any kind of cache would be extremely helpful for prefetching integral image values. The possibilities are shared memory, the texture cache, and the L1/L2 caches (starting with the Fermi architecture). Shared memory is not efficient on early architectures owing to the low ratio of shared memory per multiprocessor to the number of threads to launch.
The described processing is very simple to implement, but its performance falls far below expectations. The reason is so-called work starvation. At each stage the classifier cascade is expected to reject some percentage of the input hypotheses, which means that after just a few stages more than 50% of the CUDA threads become inactive (Figure 33.15). Only after all threads in a block terminate does the SM release the CUDA block and its resources and proceed with a new one. The worst-case scenario occurs when every CUDA block has one or more threads that are discarded only on the last stage of the cascade.
The efficiency of the processing can be improved if the classifier cascade is broken down into groups of stages (or even into separate classifiers), which are then executed in sequential kernel launches. The kernel needs to be modified to take the number of stages to process before forced termination. This idea raises several problems that have to be overcome.
Configuring CUDA blocks for the subsequent launches as before does not make work starvation go away, because the same threads address the same regions. This problem can be solved with the stream compaction algorithm: applied to the output mask of detections after the first group of stages, it produces a list of indices of the anchors that passed.
As a result, the subsequent launches need a different kernel that takes the list of indices as an argument and assigns one thread per anchor that has survived so far. This new kernel differs from the previous one in several ways. Instead of computing the anchor location directly from threadIdx and blockIdx, the new kernel has a one-dimensional block configuration and takes the anchor from the input list at the absolute thread ID in the grid. In addition, the kernel can output detections to the same locations of the output vector, relying on the compaction kernel, or it can do the compaction by itself: either each thread increments a global counter using atomicAdd and writes to the reserved slot, or a local compaction stage is held after all threads finish classification and the atomic operation is called only once per CUDA block to reserve the needed number of detections in the global array. A sketch of the latter scheme follows.
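Below is a hedged sketch of such a list-driven kernel with block-level compaction, so that only one global atomicAdd is issued per CUDA block. All names are illustrative, the block size is assumed to be at most 256 threads, and the stage evaluation is again elided.

// Stage evaluation for the region anchored at 'anchor' (elided).
__device__ bool evaluateStageAt(int stage, int anchor)
{
    return true; // placeholder
}

__global__ void classifyFromList(const int *anchorsIn, int numAnchors,
                                 int firstStage, int lastStage,
                                 int *anchorsOut, unsigned int *outCount)
{
    __shared__ int s_anchors[256];     // survivors of this block
    __shared__ unsigned int s_count;   // number of survivors
    __shared__ unsigned int s_base;    // reserved base slot in the output

    if (threadIdx.x == 0) s_count = 0;
    __syncthreads();

    // One-dimensional configuration: the anchor comes from the input list
    // at the absolute thread ID in the grid.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < numAnchors)
    {
        int anchor = anchorsIn[tid];
        bool passed = true;
        for (int s = firstStage; s <= lastStage && passed; ++s)
            passed = evaluateStageAt(s, anchor);
        if (passed) // local compaction into shared memory
            s_anchors[atomicAdd(&s_count, 1u)] = anchor;
    }
    __syncthreads();

    // A single global atomic per block reserves slots for all survivors.
    if (threadIdx.x == 0) s_base = atomicAdd(outCount, s_count);
    __syncthreads();

    if (threadIdx.x < s_count)
        anchorsOut[s_base + threadIdx.x] = s_anchors[threadIdx.x];
}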
Although the latter processing scheme keeps work starvation under control, the situation can still be improved by revisiting the parallelism of the CUDA implementation. The two disadvantages of the latter CUDA kernel are
• The regions processed by adjacent threads within a CUDA block overlap little, if at all; hence, the access pattern to the integral image is quite sparse; and
• Work starvation remains at a smaller scale: there is still a possibility that all threads but one reject their anchors and exit immediately, while the remaining one processes the cascade to the end.
The proposed new scheme is called stage-parallel processing (Figure 33.16), meaning that the parallelism has moved inside single-stage processing. The whole linear CUDA block works on one region, and one classifier stage is processed in the following way. Assuming that the stage contains more weak classifiers than the CUDA block dimension, each CUDA thread picks one corresponding weak classifier from the beginning of the stage and calculates its value for the given region. It then accumulates the value in a register and advances to the next weak classifier, located blockDim.x weak classifiers apart. When all weak classifiers have been evaluated, the threads perform a parallel reduction of the local accumulators, which yields the stage value; this in turn is compared with the stage threshold, and the classification decision is made.
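A minimal sketch of one stage evaluated this way is shown below; evaluateWeakClassifier and the fixed 256-thread block are assumptions (the block dimension must be a power of two for this reduction), not the chapter's actual code.

// Weak classifier evaluation for the block's region (elided).
__device__ float evaluateWeakClassifier(int classifier, int anchor)
{
    return 0.0f; // placeholder
}

// The whole block evaluates one stage for one region; every thread returns
// the same decision.
__device__ bool evaluateStageParallel(int firstWeak, int numWeak,
                                      int anchor, float stageThreshold)
{
    __shared__ float s_sum[256];

    // Stride over the stage's weak classifiers, blockDim.x apart.
    float acc = 0.0f;
    for (int w = threadIdx.x; w < numWeak; w += blockDim.x)
        acc += evaluateWeakClassifier(firstWeak + w, anchor);
    s_sum[threadIdx.x] = acc;
    __syncthreads();

    // Parallel tree reduction of the local accumulators.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (threadIdx.x < stride)
            s_sum[threadIdx.x] += s_sum[threadIdx.x + stride];
        __syncthreads();
    }

    // s_sum[0] is the stage value; compare it with the stage threshold.
    return s_sum[0] >= stageThreshold;
}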
The main advantage of this new kernel is its fine-grained parallelism, which solves both the work starvation problem and the sparse memory accesses. At the same time it is a disadvantage, because the kernel is efficient only on the later part of the classifier cascade, where stages contain a large number of weak classifiers. The performance of the kernel is low on the first stages of the face detection classifier cascade, which contain only a few weak classifiers.
The pros and cons of the two presented approaches dictate how they should be combined. First, the pixel-parallel kernel executes several stages in one or a few launches. After that the stage-parallel kernel processes the remaining hypotheses up to the final classification. A good discussion topic is how to break down the cascade into groups of stages. As before, there are several options:
• The analytical way is to hard-code the decision based on information about each classifier cascade, for instance, switching to the stage-parallel kernel once the ratio of passed anchors falls below some percentage threshold. This information can be obtained during the cascade-training phase or calculated from exhaustive testing.
• The formal way is to switch to the stage-parallel kernel as soon as the number of weak classifiers becomes greater than or equal to the number of threads with which the stage-parallel kernel is configured to run.
• The feedback way is to call the pixel-parallel kernel, restricted to process N (a parameter to tweak) stages at a time, several times until the percentage of passed hypotheses falls below the threshold or the number of weak classifiers on the next stage exceeds the stage-parallel kernel block dimension. In this case the first pixel-parallel kernel works in the traditional way and outputs the hypotheses count and the list of detections (or the mask, after which the host calls the compaction operation to generate the list). All subsequent pixel-parallel calls take input from the list generated by the previous kernel call (so-called kernel chaining). The feedback analysis is performed on the host by copying back the hypotheses count and comparing it against the total number of possible detections. If it becomes possible to do the feedback analysis on the device and call the next kernel directly from the device depending on some condition, this should work somewhat faster.
For every particular classifier cascade the fastest way needs to be determined experimentally, but in general the feedback way is superior owing to its robustness. A host-side sketch of this scheme is shown below.
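This hypothetical host-side driver illustrates the feedback way; the helper functions and the threshold values are assumptions introduced for illustration, not the chapter's API.

// Launch wrappers and compaction (elided): compactSurvivors() copies the
// survivor count back to the host and returns it.
void launchPixelParallel(int firstStage, int numStages);
unsigned int compactSurvivors(void);
void launchStageParallel(int firstStage, unsigned int blockDim);

void detectWithFeedback(int numStages, const int *weakPerStage,
                        unsigned int totalAnchors)
{
    const int          stagesPerLaunch   = 4;     // N, the parameter to tweak
    const float        passRateThreshold = 0.10f; // switch below 10% survivors
    const unsigned int stageParallelDim  = 256;   // stage-parallel blockDim.x

    unsigned int alive = totalAnchors;
    int stage = 0;
    while (stage < numStages)
    {
        // Switch criteria: too few hypotheses left, or the next stage is
        // wide enough to keep a stage-parallel block busy.
        if ((float)alive / totalAnchors < passRateThreshold ||
            (unsigned int)weakPerStage[stage] >= stageParallelDim)
            break;

        launchPixelParallel(stage, stagesPerLaunch); // kernel chaining
        alive = compactSurvivors();                  // feedback analysis
        stage += stagesPerLaunch;
    }
    if (stage < numStages && alive > 0)
        launchStageParallel(stage, stageParallelDim);
}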
Cascade Layout Optimizations
The object classifier cascade is the cornerstone of the object-detection pipeline. It is the most intensively accessed structure, and thus it has to be stored in a compact but well-structured representation. The OpenCV XML classifier representation (Table 33.11) does not meet these requirements; its main purpose is to store the cascade in a human-readable form that is convenient for browsing. The OpenCV Haar classifier loader parses it and stores the result in RAM. The cascade structures used in OpenCV after XML parsing can be found in the cv.hpp file. Although the internal representation is a structure of arrays, which is an encouraged technique in GPU computing, there is still a lot of room for optimization, especially GPU-specific optimization.
Table 33.11
OpenCV Haar classifier excerpt.
<opencv_storage><haarcascade_frontalface_alt type_id="opencv-haar-classifier">
  <size>20 20</size>
  <stages>
    <_>
      <!-- stage 0 -->
      <trees>
        <_>
          <!-- tree 0 -->
          <_>
            <!-- root node -->
            <feature>
              <rects>
                <_>3 7 14 4 -1.</_>
                <_>3 9 14 2 2.</_></rects>
              <tilted>0</tilted></feature>
            <threshold>4.0141958743333817e-003</threshold>
            <left_val>0.0337941907346249</left_val>
            <right_val>0.8378106951713562</right_val></_></_>
      </trees>
      <stage_threshold>0.8226894140243530</stage_threshold>
    </_>
  </stages>
</haarcascade_frontalface_alt></opencv_storage>
The classifier is built up from stages, which in turn are built up from trees (weak classifiers). Each tree consists of nodes, which encapsulate a number of rectangles with weights (the Haar feature), the feature threshold, and leaves. In Table 33.11 the leaves are the weak classifier response values. To store a tree-based classifier, the left_val and right_val tags can be changed to left_node and right_node, holding the ID of the referenced node instead of the classifier response.
To facilitate caching of the whole classifier cascade data, it should be stored in global memory with a texture reference bound to it. It is worth noting that a texture reference supports reading data types of up to 16 bytes (128 bits), e.g., uint4, so the best approach is to pack the data into several arrays of structures (with elements of either 32, 64, or 128 bits in size), or to store it as several arrays of independent values.
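A minimal sketch of such a binding with the legacy texture-reference API might look as follows; texNodes and the helper names are illustrative.

#include <cuda_runtime.h>

// 1-D texture reference over 128-bit elements (file scope is required).
texture<uint4, 1, cudaReadModeElementType> texNodes;

void bindCascadeNodes(const uint4 *d_nodes, size_t numNodes)
{
    // Bind the linear global-memory array so reads go through texture cache.
    cudaBindTexture(0, texNodes, d_nodes, numNodes * sizeof(uint4));
}

__device__ uint4 loadNode(int i)
{
    return tex1Dfetch(texNodes, i); // one cached 128-bit read per node
}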
Fortunately, all structures can be packed into the mentioned data types, though not without hacks. The first array contains all feature rectangles with weights (Figure 33.17). As long as the classifier window size does not exceed 255 pixels in either dimension, the rectangle's fields (top, left, width, height) fit into 8-bit unsigned integers. The weight is stored as a single-precision floating-point value, so the overall size of the rectangle with weight is 64 bits. The corresponding structure is named HaarFeaturePart64.
The second array contains descriptions of nodes. Each node is represented by the complete Haar feature, the threshold, and two leaves (or pointers to other nodes). Because the two leaves and the threshold are each stored as a 32-bit float (a leaf field can be reinterpreted as a 32-bit pointer to another node, which agrees with the leaf size), only 32 bits remain for the complete Haar feature if the structure is not to exceed the allowed 128 bits. For this purpose the HaarFeatureDescriptor32 structure is introduced. It contains the number of rectangles in the feature (8 bits) and the offset from the beginning of the array of HaarFeaturePart64 (24 bits). The whole 128-bit structure is named HaarClassifierNode128.
It is important to store HaarFeaturePart64 elements in exactly the same order as they appeared during XML cascade parsing, because HaarFeatureDescriptor32 relies on their sequential order.
The final (third) array contains the descriptions of the stages. A stage is characterized by the trees it contains and the stage threshold. The proposed representation of a stage is 16 bits for the number of trees, 16 bits for the offset of the first tree's root node in the array of HaarClassifierNode128, and 32 bits for the stage threshold. The corresponding data structure is named HaarStage64. These structures are sketched below.
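A possible C encoding of these packed structures follows; the field order and names track the text above, but the chapter's actual layout may differ in detail.

struct HaarFeaturePart64          // one weighted rectangle, 64 bits total
{
    unsigned char x, y, w, h;     // rectangle fields fit 8-bit integers
    float weight;                 // single-precision rectangle weight
};

struct HaarFeatureDescriptor32    // "the complete Haar feature" in 32 bits
{
    unsigned int desc;            // [31:24] number of rectangles,
                                  // [23:0]  offset into HaarFeaturePart64[]
};

struct HaarClassifierNode128      // one tree node, 128 bits total
{
    HaarFeatureDescriptor32 feature;
    float threshold;              // feature threshold
    float leftVal;                // leaf response, or pointer to another node
    float rightVal;               // (see the bit-twiddling hack below)
};

struct HaarStage64                // one stage, 64 bits total
{
    unsigned short numTrees;      // number of trees in the stage
    unsigned short firstRootNode; // offset of the first root node in the
                                  // HaarClassifierNode128 array
    float threshold;              // stage threshold
};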
The HaarStage64 representation relies on contiguous indexing of the trees in the array of HaarClassifierNode128. Unlike the array of HaarFeaturePart64, which requires sequential filling, the array of HaarClassifierNode128 has to be handled slightly differently: its first part stores all root nodes of the whole cascade, and after the last root node all intermediate nodes and leaves follow.
This layout keeps HaarStage64 at 64 bits without maintaining additional tree-indexing information. The downside of this storage scheme is a slightly more complex loading process, which has to maintain two separate arrays of nodes: one for root nodes and one for all others. After all nodes are loaded into the separate arrays, all far pointers to other nodes need to be corrected by adding the length of the root-node vector.
One more bit-twiddling hack is storing the information about how to interpret the children of a node, that is, whether the 32-bit value is a floating-point leaf or an integer pointer to some other node. Storing this flag in the lowest bit of the integer (which is also the least significant bit of the floating-point value) changes detection rates because of the resulting precision loss. A safe location for the flag is the highest bit of the exponent (bit number 30), because it is never used in the values of weak classifier responses: the highest exponent bit is set only for values with modulus greater than or equal to 2, which rarely happens in practice and never happens if the classifier was trained with normalized weights.
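A sketch of this trick is shown below; the macro name and the helper functions are illustrative.

// Bit 30 is the highest exponent bit of an IEEE 754 single-precision value;
// it is set only for values with modulus >= 2, so it is free in practice.
#define NODE_FLAG 0x40000000u

__device__ unsigned int packNodeIndex(unsigned int nodeIdx)
{
    return nodeIdx | NODE_FLAG;      // tag the 32-bit word as a node pointer
}

__device__ bool isLeaf(unsigned int raw)
{
    return (raw & NODE_FLAG) == 0;   // flag clear => floating-point leaf
}

__device__ float leafValue(unsigned int raw)
{
    return __int_as_float((int)raw); // leaf bits are stored untouched
}

__device__ unsigned int nodeIndex(unsigned int raw)
{
    return raw & ~NODE_FLAG;         // strip the flag to recover the index
}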
Another potential placement for the classifier cascade is CUDA's constant memory. It is known to be one of the fastest memory types available on the GPU. The classifier cascade can be easily moved from global memory to constant memory. However, this enhancement has some application limitations.
First of all, constant memory is efficient only when a single location is addressed by all threads within a warp; otherwise, warp serialization occurs. The worst warp serialization happens in the stage-parallel kernel, where every thread addresses a different weak classifier; thus, the optimization is not applicable there. The pixel-parallel kernel, however, meets the addressing requirements because all threads address the same parts of the cascade simultaneously. Warp serialization may still occur with a tree-based classifier, when different threads follow different paths depending on their calculations. No serialization problems occur with stump-based classifiers because there is no branching below the root node.
However, it turned out to be impossible to store the traditional face detection classifier cascade in constant memory: the representation described earlier in this chapter requires about 80 KB, whereas the capacity of constant memory is 64 KB (the size of the XML classifier representation is over 1 MB).
There is still room for compressing the classifier cascade to increase its chances of fitting into constant memory. This specific optimization covers only classifiers of limited window size (M × N). The core idea is that the array of HaarFeaturePart64 takes about half of the total size of the classifier cascade, and by crunching each element into 32 bits the cascade shrinks to about 75% of its original size. Two things need to be done. The first (trivial) step is storing the rectangle weight as an IEEE 754 half float (16 bits).
The second step is to learn how to store a feature rectangle in the remaining 2 bytes. Because of the symmetry in the definition of a rectangle [(x, width), (y, height)], the problem reduces to storing a tuple (x, width) in one byte (B), under the constraints from Table 33.12.
Table 33.12
Constraints on the rectangle representation for storing in 2 bytes.
x ∈ [0, N − 1] | width ∈ [1, N] | x + width ∈ [1, N]
The final step is to calculate the maximum N for which all possible tuples can be stored in 1 byte. The enumeration scheme for this is shown in Figure 33.18 (left). The number of valid tuples is the sum of elements from 1 to N, which evaluates to (N + 1) * N / 2; requiring this to stay within 256 yields a maximum N of 22 (253 tuples, whereas N = 23 would give 276). This means that the maximum allowed classifier window size with this technique is 22 × 22 pixels.
The enumeration scheme can be used for building the conversion between (x, width) and B, but it would involve a square root operation. Figure 33.18 (right) demonstrates an alternative enumeration scheme for which a very simple conversion exists (see Table 33.13). The idea is to restructure the enumeration into groups, each containing N + 1 elements, and perform the conversion with the use of intermediate values: the index of the group in which the element is located (G) and the index of the element within the group (K). On the tuple-decoding step the division can be replaced with a multiplication followed by arithmetic shifts, or a multiplication by the precalculated reciprocal.
Table 33.13
Coding (top) and decoding (bottom) tuple pseudo code.

Coding:
if (width <= N / 2)
{
    G = width - 1;
    K = x;
}
else
{
    G = N - width;
    K = x + (N - G);
}
B = G * (N + 1) + K;

Decoding:
G = B / (N + 1);
K = B - G * (N + 1);
if (K < N - G)
{
    x = K;
    width = G + 1;
}
else
{
    x = K - (N - G);
    width = N - G;
}
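Putting the pieces together, the following hedged sketch packs one feature rectangle into 32 bits: both tuples are encoded with the scheme of Table 33.13, and the weight is converted to a half float. The function names are illustrative, and the half-precision intrinsics assume cuda_fp16.h.

#include <cuda_fp16.h>

// Encode an (x, width) tuple into one byte, as in Table 33.13 (N <= 22).
__device__ unsigned char encodeTuple(int x, int width, int N)
{
    int G, K;
    if (width <= N / 2) { G = width - 1; K = x; }
    else                { G = N - width; K = x + (N - G); }
    return (unsigned char)(G * (N + 1) + K);
}

// Pack a weighted rectangle into 32 bits: two encoded tuples + half weight.
__device__ unsigned int packRect32(int x, int y, int width, int height,
                                   float weight, int N)
{
    unsigned int bx = encodeTuple(x, width, N);
    unsigned int by = encodeTuple(y, height, N);
    unsigned short hw = __half_as_ushort(__float2half_rn(weight));
    return bx | (by << 8) | ((unsigned int)hw << 16);
}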