33.5 Support Vector Machines and Kernel-Based Approaches

In remote sensing image classification, one of the most widely used techniques is the maximum likelihood classifier (MLC), which is based on the Mahalanobis distance. Since MLC is essentially a weighted distance measure with the weighting matrix specified by the sample covariance matrix, it is not particularly designed for classification. Consequently, it does not necessarily yield the best classification. It has been shown in pattern classification that Fisher's linear discriminant analysis (FLDA) is one of the best classifiers because the Fisher ratio used by FLDA as its criterion is specifically designed for classification, where the Fisher ratio is defined as the ratio of the between-class scatter matrix to the within-class scatter matrix. In both cases, these two classifiers require second-order statistics to calculate the weighting matrix for MLC and the scatter matrices for FLDA. Accordingly, MLC or FLDA may not work effectively if either the data cannot be well characterized by second-order statistics or the second-order statistics used to characterize the data are not reliable. In the former case, classifiers must go beyond second-order statistics, for example, high-order statistics-based classifiers, projection pursuit, neural networks, etc. As for the latter case, two circumstances may occur. One is that the training samples used are not appropriately selected. The other is that the pool of selected training samples is too small to constitute credible statistics. Support vector machines (SVMs) arise from such needs. First of all, since an SVM is designed to find an optimal hyperplane that maximizes the margin between two classes of training samples, referred to as support vectors, it does not require a large number of training samples to achieve the maximal separation margin. As a result, SVM classification performance is very sensitive to the selection of support vectors. For example, consider Figure 33.7, where two datasets are shown: S1, which consists of all data sample vectors belonging to two classes marked by open circles and crosses, and S2, which is S1 with two data sample vectors, img and img, excluded, where the subscripts indicate the classes to which these support vectors belong. In addition, two sets of training samples, img and img, defined as support vectors are used for training SVMs.
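As a concrete illustration of the second-order statistics just mentioned, the following is a minimal sketch using made-up two-class data (not tied to any dataset in this chapter): it computes the within-class and between-class scatter matrices, evaluates the Fisher ratio of a candidate projection vector, and forms the FLDA direction that maximizes that ratio.

```python
# A minimal two-class sketch (synthetic data, illustrative only) of the
# second-order statistics discussed above: the within-class scatter S_W, the
# between-class scatter S_B, the Fisher ratio J(w) = (w^T S_B w)/(w^T S_W w),
# and the FLDA direction w* = S_W^{-1} (mu1 - mu2) that maximizes it.
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))   # class 1 samples
X2 = rng.normal(loc=[3.0, 1.0], scale=1.0, size=(100, 2))   # class 2 samples

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)  # within-class scatter
diff = (mu1 - mu2).reshape(-1, 1)
S_B = diff @ diff.T                                          # between-class scatter

def fisher_ratio(w):
    return (w @ S_B @ w) / (w @ S_W @ w)

w_flda = np.linalg.solve(S_W, mu1 - mu2)                     # FLDA projection vector
print("Fisher ratio of FLDA direction :", fisher_ratio(w_flda))
print("Fisher ratio of arbitrary [1,0]:", fisher_ratio(np.array([1.0, 0.0])))
```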

Figure 33.7 Different sets of support vectors used by SVMs.


Assume that the dotted line in Figure 33.7 is the optimal hyperplane. There are four cases worth discussing; a small experimental sketch illustrating the effect of support vector selection follows the list.

1. Implement an SVM using img as support vectors to classify the dataset S2, in which case it achieves perfect classification.
2. Implement an SVM using img as support vectors to classify the dataset S2, in which case the data sample vectors labeled img and img fall within the margin but still on the correct side of the optimal hyperplane.
3. Implement an SVM using img as support vectors to classify the dataset S1, in which case the SVM misclassifies img and img.
4. Implement an SVM using img as support vectors to classify the dataset S1, in which case the SVM misclassifies not only img and img but also img and img.
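The following hypothetical sketch mirrors Cases 1 and 2 above (Cases 3 and 4 require linearly nonseparable data and are illustrated after the next list). It trains a nearly hard-margin linear SVM on two different pairs of training samples, playing the roles of the two sets of support vectors, and reports how many samples of a made-up dataset S2 are misclassified or fall strictly inside the margin. All coordinates are invented for illustration and are not those of Figure 33.7.

```python
# A hypothetical recreation (not Figure 33.7 itself) of Cases 1 and 2: a linear
# SVM is trained on "worst-case" boundary samples versus interior samples, and
# we check how many points of dataset S2 are misclassified or fall strictly
# inside the margin (|decision value| < 1). All coordinates are made up.
import numpy as np
from sklearn.svm import SVC

# Dataset S2: two linearly separable classes.
S2 = np.array([[-3.0, 0.0], [-2.0, 1.0], [-2.0, -1.0],
               [ 3.0, 0.0], [ 2.0, 1.0], [ 2.0, -1.0]])
y2 = np.array([0, 0, 0, 1, 1, 1])

# Two candidate sets of support vectors: boundary (worst-case) vs. interior.
sv_sets = {
    "boundary SVs": (np.array([[-1.0, 0.5], [1.0, -0.5]]), np.array([0, 1])),
    "interior SVs": (np.array([[-3.0, 0.0], [3.0, 0.0]]), np.array([0, 1])),
}

for name, (Xt, yt) in sv_sets.items():
    svm = SVC(kernel="linear", C=1e6).fit(Xt, yt)        # large C ~ hard margin
    f = svm.decision_function(S2)
    misclassified = int(np.sum((f > 0).astype(int) != y2))
    inside_margin = int(np.sum(np.abs(f) < 1.0 - 1e-3))  # tolerance for the SVs
    print(f"{name}: misclassified = {misclassified}, inside margin = {inside_margin}")
```

With boundary samples as support vectors, no samples of S2 violate the margin, whereas the interior support vectors leave several samples inside the margin even though they remain on the correct side, which is exactly the distinction between Cases 1 and 2.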

Cases 1 and 2 demonstrate the significance of selecting appropriate support vectors. In addition, these two cases show that the SVM performs at its best when the worst sample vectors are selected as support vectors; in our example, the set of support vectors img consists of the worst sample vectors, whereas the set img does not. For this reason, the SVM can be considered a worst-case classifier, in contrast to the statistics-based MLC and FLDA, which can be considered average-case classifiers. To penalize training samples that are inappropriately selected, a set of slack variables ξi specified by (2.63) in Chapter 2 is introduced in Case 2 to account for the costs incurred by each of these training samples with img. The generic soft-margin formulation in which such slack variables appear is sketched below.
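For reference, the slack variables ξi enter the SVM through the standard soft-margin formulation sketched here; this is the generic textbook form, and its notation may differ from that of (2.63) in Chapter 2.

\[
\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\ \tfrac{1}{2}\|\mathbf{w}\|^{2} + C\sum_{i=1}^{N}\xi_{i}
\quad \text{subject to} \quad
y_{i}\!\left(\mathbf{w}^{T}\mathbf{x}_{i}+b\right) \ge 1-\xi_{i},\qquad \xi_{i}\ge 0,\ \ i=1,\ldots,N
\]

Here C controls how heavily each margin violation ξi is penalized relative to maximizing the margin 2/||w||.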

Cases 3 and 4 demonstrate the linear nonseparability encountered in pattern classification, where linear classifiers have difficulty classifying certain data sample vectors with linear decision boundaries regardless of how the training samples are selected. In Figure 33.7, img and img are such data sample vectors. As in Case 2, a set of slack variables img must be introduced in Cases 3 and 4 to penalize those training samples that are linearly nonseparable and to account for their costs with img and img, respectively. The four cases can be summarized as follows.

1. Case 1 achieves perfect classification without any need for slack variables.
2. Case 2 requires a set of slack variables img to penalize the inappropriately selected training samples img with img. However, if img and img are used as support vectors, the SVM can achieve perfect classification without using slack variables, as in Case 1.
3. Case 3 requires a set of slack variables img to penalize the linearly nonseparable training samples img and img with img.
4. Case 4 requires a set of slack variables img to penalize both the inappropriately selected training samples img and img and the linearly nonseparable training samples img and img: a set of slack variables img with img, as in Case 2, is used to penalize the inappropriately selected training samples img and img, and another set of slack variables with img, as in Case 3, is used to penalize the linearly nonseparable training samples img and img. In this particular case, a kernel must be used to resolve the linear nonseparability issue and to account for the use of a set of slack variables with img (see the kernel SVM sketch following this list).
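As a brief illustration of why a kernel is needed in Case 4, the following sketch (with a synthetic, linearly nonseparable two-ring dataset that is purely hypothetical) compares a linear SVM, which cannot separate the data even with slack variables, against an RBF-kernel SVM, which resolves the nonseparability in its kernel-mapped feature space.

```python
# A minimal illustration (not the book's experiment) of how an RBF kernel
# resolves a linearly nonseparable two-class problem that defeats a linear SVM.
# The synthetic "two concentric rings" layout is purely hypothetical.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)       # linear machine with slack variables
rbf_svm = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)  # kernel-mapped feature space

print("linear SVM training accuracy:", linear_svm.score(X, y))  # far below 1.0
print("RBF SVM training accuracy:   ", rbf_svm.score(X, y))     # close to 1.0
```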

As noted in Figure 33.7, it is impossible to find straight lines that separate the two training samples img and img, no matter how linear decision boundaries are drawn in the data space. This implies that using slack variables to penalize such training samples still cannot change the fact that the SVM is a linear machine. To resolve linearly nonseparable problems, kernel-based approaches have been developed. The key idea is to map the original data space into a high-dimensional feature space via a nonlinear kernel mapping so that linearly nonseparable problems in the original space can be resolved in this new feature space. Generally speaking, two types of kernel-based approaches have been investigated. One is kernel-based transformations, which map component analysis-based transformations into a feature space, for example, kernel-based PCA (KPCA) (Scholkopf et al., 1999b) described in Section 30.4.2, which is the most popular kernel-based component analysis transformation. Its idea is to kernelize the projection vectors produced by a component analysis in the original space. Using PCA as an illustrative example, the eigenvectors considered as the projection vectors of PCA are kernelized via Equations (30.9)–(30.13) using the sample covariance matrix formed by all the kernelized data sample vectors in the new feature space. However, one of the major difficulties with such kernel-based transformations is their computational complexity, which results from using all data sample vectors to perform the kernelized component analysis transformation. If the number of data sample vectors is huge, kernel-based transformations incur computational problems that require tremendous computing power to find the kernelized projection vectors. This is a severe obstacle for hyperspectral imagery, whose data size is generally very large, and may be one of the reasons that kernel-based transformations are not preferred in applications of hyperspectral data exploitation.
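The computational burden described above can be made concrete with a small back-of-the-envelope sketch, under an assumed image size of 614 × 512 pixels with 169 bands (illustrative numbers only): kernelizing a component analysis transform such as KPCA over the entire image requires an N × N Gram matrix over all N data sample vectors, whereas the same transform on a small subset is trivial.

```python
# A back-of-the-envelope sketch (assumed image size, illustration only) of why
# kernelized component analysis transforms such as KPCA are costly on whole
# hyperspectral images: they operate on an N x N Gram matrix over ALL pixels.
import numpy as np
from sklearn.decomposition import KernelPCA

rows, cols, bands = 614, 512, 169          # assumed scene dimensions
N = rows * cols                            # number of data sample vectors
gram_bytes = N * N * 8                     # double-precision N x N kernel matrix
print(f"full-image Gram matrix: {N} x {N} (~{gram_bytes / 1e12:.2f} TB)")

# By contrast, KPCA restricted to a small subset of sample vectors is easy,
# which is one motivation for the training sample-based kernelization below.
subset = np.random.default_rng(0).normal(size=(2000, bands))
kpca = KernelPCA(n_components=10, kernel="rbf", gamma=1.0 / bands)
print("KPCA components on subset:", kpca.fit_transform(subset).shape[1])
```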

As an alternative, a second kernel-based approach is to kernelize the training samples to be used for classification instead of kernelizing a component analysis transformation over the entire data space. One of the significant advantages of such training sample kernelization is that only the training samples, rather than the entire dataset, are required to perform the kernelization. This approach has turned out to be a major trend in developing new hyperspectral imaging algorithms, for example, kernel-based SVM (KSVM) and kernel-based FLDA (KFLDA) discussed in Chapter 2. Similarly, kernel-based FLSMA (KFLSMA) and kernel-based WACLSMA (KWACLSMA) can also be derived for FLSMA and WACLSMA, respectively, by the same treatment (Liu, 2011). Nevertheless, it should be noted that the benefit of this second, training sample-based approach to kernelization comes at the cost of requiring judicious selection of training samples, which involves two issues: how many training samples are sufficient and how these training samples are selected. In addition, there are two issues in using kernels as well. One is the selection of the kernels to be used along with appropriate values for their parameters. The other is computational complexity, which is generally very high since the kernel-mapped feature space usually has very high dimensionality. In supervised classification, the two issues in the selection of training samples are resolved by prior knowledge obtained from ground truth such as spectral libraries, databases, or visual inspection. But as noted in Section 33.2.2, supervised classification may not be reliable if the training samples are not representative or accurate. This is often the case in real applications, particularly for hyperspectral images, where many subtle material substances cannot be identified by ground truth or visual assessment. Accordingly, unsupervised classification may be a better approach. However, for unsupervised classification to be effective, the above-mentioned two issues must be appropriately addressed. In doing so, one approach is to reduce the required prior knowledge as much as possible so that its impact on classification is minimal. The constrained energy minimization (CEM) approach described in Chapter 2 is derived for this purpose, where only knowledge of the desired target is required. The other approach is anomaly detection, such as the RX detector (RXD) developed by Reed and Yu (1990) and discussed in Section 33.4, where no prior knowledge is required at all. Unfortunately, when CEM and RXD are extended to their kernel counterparts, KCEM and KRXD run into the same issue as the kernel-based component analysis transformations, namely that the sample covariance/correlation matrix must be calculated from the entire dataset to account for a posteriori information. To mitigate this dilemma, current research in the Remote Sensing Signal and Image Processing Laboratory (RSSIPL) at the University of Maryland, Baltimore County (UMBC), has been directed toward finding unsupervised training samples, which can eventually resolve the issues arising in kernel-based component analysis transformations. Most recently, a third approach is to kernelize classifiers instead of training samples, where LSMA can be further extended to kernel-based LSMA and the three spectral unmixing methods, LSOSP, NCLS, and FCLS, are extended to their kernel counterparts in Chapter 15.
Using similar ideas, the three least-squares unsupervised algorithms, ATGP, UNCLS, and UFCLS, developed for finding unsupervised target signal sources in Chapters 8 and 17, can also be extended to kernel-based ATGP (KATGP), kernel-based UNCLS (KUNCLS), and kernel-based UFCLS (KUFCLS) (Wong, 2011).
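To make the CEM dependence on whole-image statistics concrete, the following is a minimal sketch of CEM as it is commonly formulated: minimize the output energy wᵀRw subject to wᵀd = 1, giving w = R⁻¹d/(dᵀR⁻¹d). The random data and dimensions below are stand-ins; the point is only that R must be formed from every pixel, which is exactly the burden that KCEM inherits.

```python
# A minimal sketch (not the book's code) of the CEM detector as commonly
# formulated: w = R^{-1} d / (d^T R^{-1} d), where R is the sample correlation
# matrix of the WHOLE image and d is the desired target signature. The data
# below are random stand-ins; only the structure of the computation matters.
import numpy as np

rng = np.random.default_rng(0)
bands, N = 169, 10_000
X = rng.normal(size=(bands, N))            # image data, one column per pixel
d = rng.normal(size=(bands, 1))            # desired target signature (assumed known)

R = (X @ X.T) / N                          # sample correlation matrix from ALL pixels
Rinv_d = np.linalg.solve(R, d)
w = Rinv_d / (d.T @ Rinv_d)                # CEM finite impulse response filter
scores = w.T @ X                           # detector output for every pixel
print(scores.shape)                        # (1, N)
```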

Finally, an interesting approach to extending kernel-based classifiers through the kernels they use is worth pursuing (Liu, 2011). In Section 2.4.1, three types of commonly used kernels, the radial basis function (RBF) kernel, the polynomial kernel, and the sigmoid function kernel, are described. Among them, the RBF kernel given by

(33.27) $K(\mathbf{x},\mathbf{y}) = \exp\!\left(-\dfrac{\|\mathbf{x}-\mathbf{y}\|^{2}}{2\sigma^{2}}\right)$

is the most popular one used by many kernel-based classifiers. Analogous to the treatment of WACLSMA presented in Chapter 13, Equation (33.27) can be further extended to a weighted RBF kernel by including a weighting matrix A as follows:

(33.28) $K_{\mathbf{A}}(\mathbf{x},\mathbf{y}) = \exp\!\left(-(\mathbf{x}-\mathbf{y})^{T}\mathbf{A}(\mathbf{x}-\mathbf{y})\right)$

where A is a positive definite matrix similar to the one defined in (14.2). Three special cases can be derived from (33.28):

1. When $\mathbf{A} = \left(1/2\tilde{\sigma}^{2}\right)\mathbf{I}$, Equation (33.28) reduces to an RBF kernel of the form (33.27), where $\tilde{\sigma}^{2}$ may be different from the $\sigma^{2}$ in (33.27).
2. When $\mathbf{A} = \mathbf{R}^{-1}$ or $\mathbf{K}^{-1}$, Equation (33.28) can be considered a Mahalanobis distance kernel.
3. When A is set to $\mathbf{S}_{W}^{-1}$, the inverse of the within-class scatter matrix SW defined in (13.1), Equation (33.28) becomes a Fisher's kernel similar to the one developed for Fisher's LSMA (FLSMA) in Chapter 14. A numerical sketch of these three choices of A follows the list.
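As a numerical illustration of the three special cases above, the following sketch (with made-up data, labels, and σ, and with S_W computed from per-class sample covariances as a normalized stand-in for the within-class scatter in (13.1)) plugs the three choices of A into the weighted RBF kernel (33.28).

```python
# A hypothetical numerical sketch of the weighted RBF kernel (33.28) and its
# three special cases. The data, sigma, and class labels are made up; R and a
# normalized variant of S_W are computed from them in the usual way.
import numpy as np

def weighted_rbf(x, y, A):
    """K_A(x, y) = exp(-(x - y)^T A (x - y)) for a positive-definite A."""
    d = x - y
    return np.exp(-d @ A @ d)

rng = np.random.default_rng(0)
dim, sigma = 5, 1.5
X = rng.normal(size=(200, dim))                       # sample vectors (rows)
labels = rng.integers(0, 2, size=200)                 # two classes, used for S_W only
x, y = X[0], X[1]

# Case 1: A = (1 / (2 sigma^2)) I recovers the ordinary RBF kernel (33.27).
A_rbf = np.eye(dim) / (2.0 * sigma ** 2)
# Case 2: A = R^{-1} (inverse sample correlation matrix) -> Mahalanobis kernel.
R = (X.T @ X) / X.shape[0]
A_mahal = np.linalg.inv(R)
# Case 3: A = S_W^{-1}, with S_W taken here as the sum of per-class covariances
# (a normalized stand-in for the within-class scatter matrix) -> Fisher's kernel.
S_W = sum(np.cov(X[labels == c].T, bias=True) for c in (0, 1))
A_fisher = np.linalg.inv(S_W)

for name, A in [("RBF", A_rbf), ("Mahalanobis", A_mahal), ("Fisher", A_fisher)]:
    print(f"{name:12s} K_A(x, y) = {weighted_rbf(x, y, A):.4f}")
```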