10.2. Sensor Fusion for Biometrics

Sensor fusion is an information processing technique (see [66, 125]) through which information produced by several sources can be optimally combined. The human brain is a good example of a complex, multisensor fusion system; it receives five different signals—sight, hearing, taste, smell, and touch—from five different sensors: eyes, ears, tongue, nose, and skin. Typically, it fuses signals from these sensors for decision making and motor control. The human brain also fuses signals at different levels for different purposes. For example, humans recognize objects by both seeing and touching them; humans also communicate by watching the speaker's face and listening to his or her voice at the same time. All of these phenomena suggest that the human brain is a flexible and complicated fusion system.

Research in sensor fusion can be traced back to the early 1980s [17, 348]. Sensor fusion can be applied to many tasks, such as detecting the presence of an object, recognizing an object, and tracking an object. This chapter focuses on sensor fusion for verification purposes. Information can be fused at two different levels: feature and decision. Decision-level fusion can be further divided into abstract fusion and score fusion. These fusion techniques are discussed in the following two subsections.

10.2.1. Feature-Level Fusion

In feature-level fusion, data from different modalities are combined at the feature level before being presented to a pattern classifier [60]. One possible approach is to concatenate the feature vectors derived from different modalities [60], as illustrated in Figure 10.1. The dimensionality of the concatenated vectors, however, is sometimes too large for a reliable estimation of a classifier's parameters, a problem known as the curse of dimensionality. Although dimensionality reduction techniques, such as PCA or LDA, can help alleviate the problem [60, 260], these techniques rely on the condition that data from each class contain only a single cluster. Classification performance might be degraded when the data from individual classes contain multiple clusters. Moreover, systems based on feature-level fusion are not very flexible because the system needs to be retrained whenever a new sensor is added. It is also important to synchronize different sources of information in feature-level fusion, which may introduce implementation difficulty in AV fusion systems.

Figure 10.1. Architecture of feature-level fusion—features are concatenated before fusion takes place.
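As a sketch of the approach above, the following fragment concatenates audio and visual feature vectors and then reduces the result with PCA (computed here via an SVD). The feature dimensions and random data are illustrative assumptions, not values from the text:

```python
# Feature-level fusion sketch: concatenate modality features, then reduce
# dimensionality with PCA to mitigate the curse of dimensionality.
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200
audio_feats = rng.normal(size=(n_samples, 39))   # e.g., MFCC-like vectors (hypothetical size)
visual_feats = rng.normal(size=(n_samples, 64))  # e.g., face features (hypothetical size)

# Concatenate along the feature axis (Figure 10.1)
fused = np.concatenate([audio_feats, visual_feats], axis=1)

# PCA via SVD of the mean-centered data: keep the top-k principal components
k = 20
centered = fused - fused.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:k].T  # fed to a single classifier afterward

print(fused.shape, reduced.shape)  # → (200, 103) (200, 20)
```

Note that PCA, being unsupervised, does not fix the single-cluster limitation discussed above; it only reduces the dimensionality of the concatenated vectors.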


10.2.2. Decision-Level Fusion

Unlike feature-level fusion, decision-level fusion attempts to combine the decisions made by multiple modality-dependent classifiers (see Figure 10.2). This fusion approach solves the curse of dimensionality problem by training modality-dependent classifiers separately. Combining the outputs of the classifiers, however, is an important issue. The architecture of the classifiers can be identical but the input features are different (e.g., one uses audio data as input and the other uses video data). Alternatively, different classifiers can work on the same features and their decisions are combined. There are also systems that use a combination of these two types.

Figure 10.2. Architecture of decision-level fusion—(a) abstract fusion in which the Yes/No decisions made by the classifiers and decision units are combined; (b) score fusion in which final decisions are based on the fused scores.


The two types of decision fusions are: abstract and score. In the former, the binary decisions made by individual classifiers are combined, as shown in Figure 10.2(a); in the latter, the scores (confidence) of the classifiers are combined, as in Figure 10.2(b).

In abstract fusion, the binary decisions can be combined by majority voting or by using AND and OR operators. In majority voting, the final decision is based on the number of votes cast by the individual classifiers [113, 182]. However, this voting method may fail to reach a decision when there is an even number of classifiers and half of them disagree with the other half.

Varshney [359] proposed using logical AND and OR operators for fusion. In AND fusion, the final decision is not reached until all the decisions made by the classifiers agree. This type of fusion is very strict and is therefore suitable only for systems that require low false acceptance. However, it has difficulty when the decisions made by different sensors are inconsistent, which is a serious problem in multiclass applications. Unlike AND fusion, the final decision in OR fusion is made as soon as one of the classifiers makes a decision. This type of fusion is suitable only for systems that can tolerate a loose security policy (i.e., one that allows a high false acceptance error). OR fusion suffers from the same problem as the voting method when the decisions of the individual classifiers do not agree with one another.
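The three abstract-fusion rules just described can be sketched in a few lines; the decisions below are hypothetical accept/reject outputs from three classifiers:

```python
# Abstract (decision-level) fusion sketch: combine binary accept/reject
# decisions from K classifiers by majority vote, AND, and OR rules.
def majority_vote(decisions):
    """Accept iff strictly more than half of the classifiers accept.
    With an even K, a tie yields rejection here; as noted in the text,
    ties are a genuine difficulty for this rule."""
    return sum(decisions) * 2 > len(decisions)

def and_fusion(decisions):
    # Strict: accept only if every classifier accepts (low false acceptance).
    return all(decisions)

def or_fusion(decisions):
    # Loose: accept as soon as any classifier accepts (high false acceptance).
    return any(decisions)

votes = [True, False, True]  # decisions from three modality-specific classifiers
print(majority_vote(votes), and_fusion(votes), or_fusion(votes))  # → True False True
```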

In score fusion, the scores of modality-specific classifiers are combined and the final score is used to make a decision (see Figure 10.2(b)). Typically, the outputs of the modality-specific classifiers are linearly combined through a set of fusion weights [182]. The final score is obtained from

Equation 10.2.1

s = \sum_{i=1}^{K} w_i s_i

where K is the number of modalities or experts, {w_i} is the set of fusion weights, and {s_i} are the scores obtained from the K modalities. This kind of fusion is also referred to as the sum rule [6, 182].

Scores can be interpreted as posterior probabilities in the Bayesian framework. Assuming that the scores from different modalities are statistically independent, the final score can be obtained by using the product rule [6, 182]:

Equation 10.2.2

s = \prod_{i=1}^{K} s_i

To account for the discriminative power and reliability of each modality, a set of weights can be introduced as follows:

Equation 10.2.3

s = \prod_{i=1}^{K} s_i^{w_i}

Although the independence assumption has been said to be unrealistic in many situations, it does hold for some applications. For example, in AV verification systems, facial and speech features are largely independent. Fusing audio and visual data at the score level is therefore a promising way to reduce verification error.
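The three score-fusion rules above (Equations 10.2.1 through 10.2.3) can be sketched as follows; the scores and weights are illustrative values only, not numbers from the text:

```python
# Score fusion sketch: combine modality-specific scores by the sum rule
# (Eq. 10.2.1), the product rule (Eq. 10.2.2), and the weighted product
# rule (Eq. 10.2.3).
import math

scores = [0.9, 0.6]   # e.g., audio score s1 and visual score s2 (hypothetical)
weights = [0.7, 0.3]  # fusion weights w1, w2, summing to 1 (hypothetical)

# Sum rule: s = sum_i w_i * s_i
s_sum = sum(w * s for w, s in zip(weights, scores))

# Product rule (statistical independence assumed): s = prod_i s_i
s_prod = math.prod(scores)

# Weighted product rule: s = prod_i s_i ** w_i
s_wprod = math.prod(s ** w for s, w in zip(scores, weights))

print(round(s_sum, 3), round(s_prod, 3), round(s_wprod, 3))  # → 0.81 0.54 0.797
```

Note that the weighted product rule is equivalent to a weighted sum of log-scores, which is how it is often implemented in practice.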

The fusion weights w_i can be either nonadaptive or adaptive. Nonadaptive weights are learned from training data and kept fixed during recognition. For example, in Potamianos and Neti [284] and Sanderson and Paliwal [330], the fusion weights were estimated by minimizing the misclassification error on a held-out set; in Pigeon et al. [277], the parameters of a logistic regression model were estimated from the dispersion between the means of the speakers' and impostors' scores. Nonadaptive weights, however, may not be optimal under mismatch conditions. Adaptive weights, on the other hand, are estimated from the observed data during recognition, for example according to the signal-to-noise ratio [239], the degree of voicing [260], the degree of mismatch between training and testing conditions [331], or the amount of estimation error present in each modality [371].
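As a sketch of SNR-driven adaptive weighting, the fragment below maps an audio SNR estimate to a fusion weight through a sigmoid. The mapping and its constants are assumptions chosen for illustration, not the estimators used in the cited work:

```python
# Adaptive-weight sketch: down-weight the audio modality as the estimated
# signal-to-noise ratio (SNR) drops. The sigmoid mapping and its constants
# are hypothetical.
import math

def audio_weight(snr_db, midpoint=10.0, slope=0.5):
    """Map an SNR estimate (dB) to an audio fusion weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-slope * (snr_db - midpoint)))

def fuse(audio_score, visual_score, snr_db):
    w = audio_weight(snr_db)
    return w * audio_score + (1.0 - w) * visual_score  # Eq. 10.2.1 with K = 2

# Clean audio dominates; noisy audio shifts the weight to the visual score.
print(fuse(0.9, 0.6, snr_db=25.0))  # close to the audio score
print(fuse(0.9, 0.6, snr_db=-5.0))  # close to the visual score
```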

Another important approach to adapting fusion weights exploits the trainable properties of neural networks. For example, Brunelli and Falavigna [37] proposed a person identification system based on acoustic and visual features, in which two classifiers based on acoustic features and three based on visual features provide data to an integration module. A novel technique for integrating multiple classifiers at a hybrid rank/measurement level was introduced using HyperBF networks. This research showed that the performance of the integrated system was superior to that of either the acoustic or the visual subsystem.

The linear combiners described above assume that the combined scores obtained from different classes are linearly separable. When this assumption cannot be met, the scores obtained from d experts can be treated as d-dimensional vectors, and a binary classifier (e.g., a support vector machine, multilayer perceptron, decision tree, Fisher's linear discriminant, or Bayesian classifier) can be trained on a held-out set to classify these vectors [22, 51, 96]. Experimental results showed that SVMs and Bayesian classifiers achieve about the same performance and outperform the other candidate classifiers.
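As a sketch of this approach, the fragment below trains Fisher's linear discriminant (one of the classifiers listed above) on two-expert score vectors; the client and impostor score distributions are synthetic assumptions:

```python
# Score-classifier sketch: treat the d expert scores for each trial as a
# d-dimensional vector and train Fisher's linear discriminant on a
# held-out set. The score data here are synthetic.
import numpy as np

rng = np.random.default_rng(1)
# Held-out score vectors (d = 2 experts): clients score high, impostors low.
clients = rng.normal(loc=[0.8, 0.7], scale=0.1, size=(100, 2))
impostors = rng.normal(loc=[0.3, 0.4], scale=0.1, size=(100, 2))

# Fisher direction: w ∝ Sw^{-1} (m1 - m0), with pooled within-class scatter.
m1, m0 = clients.mean(axis=0), impostors.mean(axis=0)
sw = np.cov(clients, rowvar=False) + np.cov(impostors, rowvar=False)
w = np.linalg.solve(sw, m1 - m0)
threshold = w @ (m1 + m0) / 2.0  # midpoint between the projected class means

def accept(score_vector):
    return float(np.dot(w, score_vector)) > threshold

print(accept([0.85, 0.75]), accept([0.25, 0.35]))  # → True False
```

A trained SVM or Bayesian classifier would replace the projection-and-threshold step with its own decision rule, but the input representation, a d-dimensional score vector per trial, is the same.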
