8.4. Face Detection and Eye Localization

Face (and eye) detection can be viewed as a two-category classification problem. Given an image patch, a face detector decides whether the patch contains a human face (class ω1) or not (class ω0). Misclassification happens when the face detector misses a facial image patch (false rejection) or mistakenly raises a flag on a normal, nonfacial image (false acceptance). According to Bayes's decision theory, the decision rule for the two-category classification problem can be based on the likelihood ratio (see Eq. 7.4.2):

Equation 8.4.1

$$\frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_0)} \;\underset{\omega_0}{\overset{\omega_1}{\gtrless}}\; T$$
where T is the threshold and p(x|ωi) is the likelihood density of class ωi. For typical binary classification problems, the PDBNN allocates two subnets to estimate the likelihood densities of both categories. For detection problems, however, the PDBNN adopts a simpler structure:

Equation 8.4.2

$$p(\mathbf{x} \mid \omega_1) \;\underset{\omega_0}{\overset{\omega_1}{\gtrless}}\; T$$
Figure 8.1 shows this type of PDBNN. The simplification comes from the assumption that the likelihood density of the "nonface" class is uniform across the whole sample space. The single-subnet PDBNN estimates only the distribution of the face class samples. The decision threshold T controls the operating point on the detector's ROC (or DET) curve; a larger T value reduces the false acceptance rate (FAR) but may increase the false rejection rate (FRR). For applications such as automatic security systems, the decision threshold is usually set in favor of a lower FAR because the face detection function (and, similarly, the eye localization function) plays a "gatekeeper" role for the processes that follow. This allows the face recognition engine to remain dormant until a "very possible" face pattern is detected from the continuous video stream.
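To make the simplified rule concrete, the following minimal sketch implements the thresholded decision of Eq. 8.4.2, with a Gaussian-mixture score standing in for the single PDBNN face subnet (whose cluster parameters are actually trained by the procedures of Chapter 7); all function and variable names here are illustrative.

```python
import numpy as np

def face_log_likelihood(x, means, inv_covs, log_weights):
    """Mixture log-likelihood of the patch feature x; a stand-in for the
    confidence score produced by the single PDBNN face subnet."""
    comps = [lw - 0.5 * (x - mu) @ ic @ (x - mu)   # unnormalized log-Gaussian
             for mu, ic, lw in zip(means, inv_covs, log_weights)]
    return np.logaddexp.reduce(comps)

def is_face(x, model, T):
    """Eq. 8.4.2: declare class w1 (face) only if the score clears T;
    the uniform nonface density is absorbed into the threshold."""
    return face_log_likelihood(x, *model) > T
```

Raising T in this sketch trades a lower FAR for a higher FRR, which is exactly the operating-point choice discussed above.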

Figure 8.1. PDBNN-based face (or eye) detector. (a) Schematic diagram; (b) an example of pattern detection by PDBNN. o = patterns belonging to the target object; * = "nonobject" patterns.


8.4.1. Face Detection

Face detection is a difficult computer vision problem. Chapter 2 discussed many variation factors affecting detection accuracy. To simplify the task, many research groups focus only on developing detection algorithms for frontally viewed faces. For applications such as ATM access control, one can assume that rightful users will show their faces to the camera in an upright and mostly unoccluded fashion. Another useful application for frontal face detection is screening of police department photos of suspected criminals.

Although facial orientation may be constrained, frontal face detection remains a problem. Consider the picture on the left of Figure 8.2. It may look like a low-resolution image of a smiling face, but in fact it is just an enlarged image segment from the stadium crowd scene on the right. The face detector needs to carefully comb through arbitrary scenes to screen out false facial patterns like this one while also remaining responsive to the "true" face patches.

Figure 8.2. A pattern that looks like a face.


Sung and Poggio [343] developed a distribution-based system for face detection. Their system consisted of two components: distribution-based models for face and nonface patterns and a multi-layer perceptron classifier. In the training phase, each face and nonface example was first normalized and rescaled to a 19 × 19 pixel image and treated as a 361-dimensional feature vector. Next, the patterns were grouped into six face and six nonface clusters using the K-means algorithm. During the testing phase, two distance measures were applied. The first was the normalized Mahalanobis distance between the test image pattern and the centers of the prototype clusters, measured within a low-dimensional subspace spanned by the cluster's 75 leading eigenvectors. The second was the Euclidean distance between the test pattern and its projection onto the 75-dimensional subspace. The last step was to use a multi-layer perceptron (MLP) network to separate face patterns from nonface patterns using the 12 pairs of distances, one pair per face or nonface cluster. A database of 47,316 windowed patterns (4,150 face patterns, the rest nonface) was used to train the classifier with the standard back-propagation algorithm.
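The two distance measures can be sketched as follows; this is a hedged reconstruction from the description above, where the eigenvectors and eigenvalues are those of the cluster's covariance matrix and the names are illustrative:

```python
import numpy as np

def cluster_distances(x, center, eigvecs, eigvals):
    """Two-part distance to one of the 12 prototype clusters.

    x, center : 361-dimensional vectors (19 x 19 patches, flattened)
    eigvecs   : (361, 75) leading eigenvectors of the cluster covariance
    eigvals   : the 75 corresponding eigenvalues
    """
    d = x - center
    proj = eigvecs.T @ d                            # coordinates inside the 75-D subspace
    d1 = float(np.sum(proj ** 2 / eigvals))         # normalized Mahalanobis distance
    d2 = float(np.linalg.norm(d - eigvecs @ proj))  # distance from x to the subspace
    return d1, d2
```

Stacking the 12 (d1, d2) pairs yields the 24-dimensional input that the MLP maps to a face/nonface decision.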

Rowley et al. [323] also used a neural network model for face detection. A 20 × 20 window was used to scan the entire image. Each windowed pattern was preprocessed (normalization, histogram equalization) and then fed into a three-layer convolutional neural network. About 1,050 face samples of various sizes, orientations, positions, and intensities were used to train the network. The locations of the eyes, tip of the nose, and center of the mouth were labeled manually to normalize the faces. Multiple neural networks were trained, each on a different set of training images. In the testing phase, the decisions from these networks were merged and arbitrated by a simple scheme such as voting or logic operators (AND/OR). Rowley et al. reported that this arbitration scheme was less computationally expensive than Sung and Poggio's system [343] and achieved higher detection rates on a test set of 144 faces in 24 images.

Lin et al. [211] proposed a one-class PDBNN model for face detection. There were four clusters (neurons) in the network; the number of clusters was determined empirically. The input dimension of the network was 12 × 12 × 2 = 288 (12 × 12 image intensity plus x- and y-directional gradient feature vectors; compare Figures 8.9(c) and 8.9(d)). The network weighting parameters and thresholds were trained by the procedures described in Chapter 7.

Figure 8.9. (a) Facial region used for face recognition, (b) intensity feature extracted from (a), (c) x-directional gradient feature, and (d) y-directional gradient feature.
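A sketch of the feature assembly for one 12 × 12 window is given below. The count 12 × 12 × 2 = 288 implies two feature channels; following Figures 8.9(c) and 8.9(d), this sketch uses the x- and y-directional gradients, computed here with simple finite differences (an assumption, since the text does not fix the gradient operator).

```python
import numpy as np

def detector_features(patch):
    """Flatten a 12 x 12 patch into a 288-dimensional input vector:
    x-directional and y-directional gradient channels."""
    gy, gx = np.gradient(patch.astype(float))        # finite-difference gradients
    return np.concatenate([gx.ravel(), gy.ravel()])  # 144 + 144 = 288 values
```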


The training database contained 92 annotated images (each image generated approximately 25 virtual training patterns). The images were taken in normal indoor lighting conditions with a cluttered background (refer to Figure 8.8). The image size was 320 × 240 pixels, and the face size was approximately 140 × 100 pixels. The variation of head orientation was about 15 degrees in each of the four directions (up, down, right, left). The testing database contained 473 images taken under similar conditions. The testing performance was measured by the error (in pixels) between the detected face location and the true location. To make the measure size-invariant, errors were normalized under the assumption that the distance between the two eyes is 40 pixels, which is the average distance in the annotated images. Among all testing images, 98.5% of the errors were within 5 pixels and 100% were within 10 pixels in the original high-resolution image (which is less than a 1-pixel error in the low-resolution images).
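The size normalization amounts to rescaling each pixel error as if the inter-eye distance in the image were the 40-pixel database average; a one-line sketch (names illustrative):

```python
def normalized_error(err_pixels, eye_distance_pixels, ref_distance=40.0):
    """Report a localization error as if the eye-to-eye distance were
    40 pixels, making the measure independent of face size."""
    return err_pixels * ref_distance / eye_distance_pixels
```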

Figure 8.8. The thin white box represents the location found by the face detector. Based on face location, the searching windows for locating eyes were assigned, as illustrated by the two thick white boxes.


Table 8.1 lists the detection accuracy of the PDBNN face detector. For most face recognition applications, a 10-pixel error is acceptable, since the main purpose of face detection is to restrict the search area for eye localization. As for processing speed, the PDBNN face detector down-scaled the images (320 × 240 pixels) by a factor of 7. Working with low-resolution images (search range approximately 46 × 35 pixels, search step 1 pixel, and block size 12 × 12 pixels), the PDBNN pattern detector was reported to detect a face within 200 ms on a SPARC II workstation.
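A minimal sketch of this coarse search loop follows, assuming simple striding for the 7x downscale and an abstract score_fn in place of the trained subnet (both assumptions, not the book's implementation):

```python
import numpy as np

def search_face(image, score_fn, T, block=12, step=1, scale=7):
    """Scan every block x block window of the down-scaled frame and return
    the best position (in full-resolution coordinates), or None if no
    window's confidence clears the threshold T."""
    small = image[::scale, ::scale].astype(float)  # ~46 x 35 for a 320 x 240 frame
    best_score, best_pos = -np.inf, None
    h, w = small.shape
    for r in range(0, h - block + 1, step):
        for c in range(0, w - block + 1, step):
            s = score_fn(small[r:r + block, c:c + block])
            if s > best_score:
                best_score, best_pos = s, (r * scale, c * scale)
    return best_pos if best_score > T else None
```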

Table 8.1. Performance of the PDBNN face detector. Note: On a database of 473 frontal facial images, most faces were detected within a 5-pixel displacement.

Displacement from True Face Location | 0 to 5 pixels | 5 to 10 pixels | > 10 pixels
Percentage of test patterns          | 98.5%         | 1.5%           | 0%

The trained PDBNN face detector was applied to other image databases. Figure 8.3 shows its detection results on some sample images. The algorithm performed fairly well on images with faces of different sizes (from the anchorwoman to the pedestrians) and in different lighting conditions.

Figure 8.3. Some faces detected by the PDBNN detector.


Sung and Poggio's database was also used for comparison. The pictures in this database, which contained 155 faces, came from a wide variety of preexisting sources. Three face detection algorithms (PDBNN, Rowley et al., and Sung and Poggio) were compared on it. At similar false acceptance performance, three nonface patterns were falsely detected as faces by Rowley et al., five by Sung and Poggio, and six by PDBNN; the false acceptance rates of all three were below 10⁻⁵. Rowley et al. missed 34 faces, Sung and Poggio missed 36 faces, and the PDBNN face detector missed 43 faces (see Table 8.2).

Table 8.2. Comparing the PDBNN-based face detector with other face detection algorithms

System                | Missed Faces | False Detections
Ideal                 | 0 of 155     | 0 in 2,709,734
Rowley et al. [323]   | 34           | 3
Sung and Poggio [343] | 36           | 5
PDBNN                 | 43           | 6

Two reasons explain PDBNN's inferior performance. First, compared to the large number of training samples used by the other two groups (4,000 in Sung and Poggio [343] and 16,000 in Rowley et al. [323]), PDBNN's training set consisted of only 92 images. A more meaningful comparison could be made if a comparably sized training database for PDBNN were available. Second, PDBNN did not detect the "artificial faces" (e.g., faces on poker cards and hand-drawn faces; cf. Figure 8.4). Since the PDBNN face detector was mainly used in surveillance and security applications, this "discrimination" may actually be beneficial.

Figure 8.4. (a) Faces detected by Sung and Poggio's algorithm. (b) Face detected by PDBNN. Note that the artificial drawing (on the board) was marked as a face by (a) but not by (b).


8.4.2. Eye Localization

The eye localization module is activated whenever the face detector discovers facial patterns from the incoming image. Since the purpose of eye localization is to normalize facial patterns into a format the recognizer can accept, eye locations need to be pinpointed with much higher precision than face location.

The eye localizer must overcome several challenges. First of all, eye shape changes whenever people blink. Also, eyes are often occluded by eyeglass glare. Wu et al. [386] proposed a statistical inference method to "remove" eyeglasses by artificially synthesizing the face image without them. Although this method is not intended for face recognition applications, it would be interesting to determine whether the synthesized images could improve recognition accuracy.

Another challenge for the eye localizer is that the eye region usually occupies only a small portion of the captured image. In other words, the eye localizer deals with a smaller amount of image information but needs to generate more precise detection results. In Lin et al. [211], the PDBNN eye localizer used more examples to form the training set (250 annotated images) and a higher-resolution image patch for the input feature (14 × 14 × 2 = 392) than the face detector. The PDBNN eye localizer used one class subnet with four clusters to learn the distribution of the "eye" class. To simplify the training effort, only left-eye images were used to form the training data set. Right eyes were detected by searching the mirrored image. Table 8.3 shows the experimental results on a test database of 323 images. The errors were normalized under the assumption that the eye-to-eye distance is 40 pixels. For a face recognition system, a misalignment of 5 pixels or less is tolerable. For this test database, 96.4% of the errors were within 3 pixels, and 98.9% were within 5 pixels.
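The mirroring trick can be sketched as follows; the 14 × 14 block size follows the description above, while best_position and the score function are illustrative stand-ins for the trained eye subnet and its search procedure:

```python
import numpy as np

def best_position(img, score_fn, block=14):
    """Top-left corner of the highest-scoring block x block window."""
    h, w = img.shape
    candidates = ((score_fn(img[r:r + block, c:c + block]), (r, c))
                  for r in range(h - block + 1)
                  for c in range(w - block + 1))
    return max(candidates)[1]

def localize_eyes(face_region, left_eye_score_fn, block=14):
    """Detect the left eye directly; detect the right eye by searching the
    horizontally mirrored image with the same left-eye model and mapping
    the winning column back to the original coordinates."""
    left = best_position(face_region, left_eye_score_fn, block)
    r, c = best_position(np.fliplr(face_region), left_eye_score_fn, block)
    right = (r, face_region.shape[1] - block - c)   # undo the mirror
    return left, right
```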

Table 8.3. Performance of eye localization. Note: Most eyes were detected with an error of less than 3 pixels.

Displacement from True Eye Location | 0 to 3 pixels | 3 to 5 pixels | > 5 pixels
Percentage of test patterns         | 96.4%         | 2.5%          | 1.1%

Figure 8.5 shows several eye images detected by the PDBNN eye localizer, which is robust against variations in size, shape (including closed eyes), and orientation but somewhat vulnerable to eyeglass glare. Two failure cases are shown in Figure 8.6, in which the localizer mistook the glare for a real eye and thus reported the wrong eye location.

Figure 8.5. Eye images detected by the PDBNN eye localizer.


Figure 8.6. Detection failures of the PDBNN eye localizer. The failures were caused by the specular reflection on the eyeglasses.


8.4.3. Assisting Realtime Face Recognition

The application of face recognition technology can be categorized into two types: The first deals with controlled format photographs (e.g., photos in a police database). The number of images is usually small and additional images are not easy to obtain if more training images are needed. The second type of application receives realtime video streams (e.g., gateway security control), where the number of images can be very large. A video camera with a rate of 30 frames per second produces 1,800 images in one minute. The system developer therefore has the luxury of choosing many clear and distinguishable facial images to train the recognizer. Such an abundance of resources can greatly increase the chances of successful system training, but it also consumes a great deal of development time if human effort is required in the process of selecting "good" facial images. Thus, a complete realtime face recognition system demands not only a powerful face recognizer but also an effective, automatic scheme to acquire acceptable training samples.

The face detector and eye localizer play a crucial role in this acquisition scheme. The confidence score of the PDBNN provides a convenient and accurate criterion for selecting useful facial images from a large amount of image data. Because the confidence scores of the face detector and eye localizer faithfully reflect the correctness of the detected pattern positions, an image with high confidence scores in both modules can almost always generate qualified facial patterns for face recognition.

The following experiment shows more explicitly how to use the confidence scores of a face/eye detector to extract useful facial images. An in-house facial image database called the SCR 40x150 database was used for this study. This database was constructed by asking each tester to slowly move or rotate his or her head for 10 seconds in front of a video camera. Since the video frame rate was 15 frames per second, a total of 150 images were taken for each person. Forty people of different races, age groups, and genders participated in the construction of this database. While the 10-second videos were taken, testers were asked to rotate their heads not only through a wide angle (up to 45 degrees) but also along various axes (i.e., left-right, up-down, and tilted rotations). All of the images were taken in front of a uniformly illuminated white background. The face detector and eye localizer worked correctly for 75% of the 6,000 images in this database; these images were considered the valid data set. Although the face detector and eye localizer were trained only on frontal views, they nevertheless handled faces rotated up to 30 degrees reasonably well. Most of the failures occurred for faces with a large rotation/tilt angle (45 degrees or more). Moreover, the confidence scores of the face detector and eye localizer were very low for all the failure cases. Therefore, by thresholding the confidence scores, images with a large head rotation angle or of otherwise poor quality were automatically screened out.
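In code, the automatic screening might look like the following sketch; the detector and localizer are assumed to return a location together with a confidence score, and the thresholds are application-chosen (all names are illustrative):

```python
def build_valid_set(frames, detect_face, locate_eyes, t_face, t_eye):
    """Keep only frames in which both the face detector and the eye
    localizer are confident; everything else is screened out."""
    valid = []
    for frame in frames:
        face_box, face_conf = detect_face(frame)
        if face_conf < t_face:
            continue                      # e.g., large rotation or poor quality
        eyes, eye_conf = locate_eyes(frame, face_box)
        if eye_conf >= t_eye:
            valid.append((frame, face_box, eyes))
    return valid
```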

Twenty percent of the valid data set was selected as the training set for the PDBNN face recognizer. Table 8.4 shows its performance. The test image set was formed in two steps. First, 60 images were randomly selected from each individual's image data (excluding those used in the training set). Second, these 2,400 images were fed into the PDBNN face detector and eye localizer. By thresholding their confidence scores, 2,176 images were selected as the "valid test set."

Table 8.4. Performance of a PDBNN face recognition system on a database with large head orientations. Note: With the assistance of the confidence scores from the PDBNN face detector and eye localizer, useful training patterns can be selected automatically (the valid set). Identification accuracy improved substantially when the valid training set was used.

PDBNN             | Trained by Original Set | Trained by Valid Set
Recognition       | 84.64%                  | 98.34%
False rejection   | 10.03%                  | 0.97%
Misclassification | 5.33%                   | 0.69%

Table 8.4 shows the performance of the PDBNN face recognizer on the valid test set. Here a false rejection means that the confidence score generated by the face recognizer is below the threshold, and a misclassification means that the confidence score is above the threshold but the pattern is assigned to the wrong person. For the sake of comparison, another PDBNN was trained on a set of training images formed by purely random selection from the original data set (all 6,000 images). Poor facial images might have been selected into this training set, which would account for the decrease in the recognition rate.
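The two error types can be distinguished with a few lines; this sketch assumes the recognizer returns one confidence score per enrolled person:

```python
def outcome(scores, true_id, T):
    """Categorize one test pattern the way Table 8.4 does."""
    best_id = max(scores, key=scores.get)
    if scores[best_id] < T:
        return "false rejection"      # best confidence below threshold
    if best_id != true_id:
        return "misclassification"    # confident, but the wrong person
    return "recognition"
```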
