Kernel-based approaches have recently received considerable interest for solving linearly nonseparable classification problems. Their main idea is to introduce a nonlinear kernel that maps the original data into a high-dimensional feature space in which the extracted features can be used to resolve the linear nonseparability encountered in the original data space. Mercer's theorem in Scholkopf and Smola (2002) then enables a technique called the kernel trick, which implicitly calculates dot products in the feature space F without actually mapping the data sample vectors into F; in this case the mapping itself need not even be identified.
Prior to introducing kernel-based classifiers, a kernelization procedure and the kernel trick need to be discussed. The idea of kernel-based techniques is to obtain a nonlinear version of a linear algorithm by implicitly mapping the original data to a higher-dimensional feature space. Suppose that X = {x_1, x_2, …, x_N} forms a data space and ϕ is a nonlinear function that maps the data space X into a feature space F with dimensionality yet to be determined by a kernel, that is,

ϕ: X → F, x ↦ ϕ(x) ∈ F  (2.84)
It is shown in Scholkopf and Smola (2002) that an effective kernel-based learning technique known as the "kernel trick" can be implemented. This technique implicitly computes dot products in F without mapping the original data sets into the feature space, which makes it feasible to implement classifiers in a higher-dimensional space. The kernel representation of dot products in F is expressed as

K(x_i, x_j) = ϕ(x_i)ᵀϕ(x_j)  (2.85)
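As a concrete check of the kernel trick, the degree-2 polynomial map in two dimensions satisfies ϕ(x)ᵀϕ(y) = (xᵀy)²; the explicit map below is one standard choice, used here only for illustration.

```python
import math

# Explicit degree-2 polynomial feature map for 2-D inputs (an illustrative choice):
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), so that phi(x).phi(y) = (x.y)^2.
def phi(x):
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, 0.5)

explicit = dot(phi(x), phi(y))   # dot product computed explicitly in F
implicit = dot(x, y) ** 2        # same value via the kernel K(x, y) = (x.y)^2

print(explicit, implicit)        # both equal 16.0
```

The two values agree, even though `implicit` never forms the feature vectors, which is exactly what makes the kernel trick attractive when F is very high (or infinite) dimensional.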
where K is a kernel function of the original data. Three types of kernel are commonly used: the polynomial kernel, K(x, y) = (xᵀy + 1)^d; the Gaussian radial basis function (RBF) kernel, K(x, y) = exp(−‖x − y‖²/2σ²); and the sigmoid kernel, K(x, y) = tanh(a xᵀy + b).
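The commonly used kernels can be sketched directly; the parameter names below (degree d, width σ, slope a and offset b) are generic choices, not the chapter's notation.

```python
import math

# Sketches of the three kernels most often quoted in the literature.
def poly_kernel(x, y, d=2):
    # polynomial kernel (x.y + 1)^d
    return (sum(xi * yi for xi, yi in zip(x, y)) + 1) ** d

def rbf_kernel(x, y, sigma=1.0):
    # Gaussian RBF kernel exp(-||x - y||^2 / (2 sigma^2))
    sq = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq / (2 * sigma ** 2))

def sigmoid_kernel(x, y, a=1.0, b=0.0):
    # sigmoid kernel tanh(a * x.y + b)
    return math.tanh(a * sum(xi * yi for xi, yi in zip(x, y)) + b)

x, y = (1.0, 0.0), (1.0, 0.0)
print(poly_kernel(x, y), rbf_kernel(x, y), sigmoid_kernel(x, y))
```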
The idea of kernel-based FLDA (KFLDA) is to first map the data matrix X via (2.84) using a nonlinear kernel ϕ into a higher-dimensional feature space F in which FLDA is performed, where the between-class and within-class scatter matrices in (2.40) and (2.41) can be reexpressed by

S_B^ϕ = Σ_{j=1}^{p} n_j (μ_j^ϕ − μ^ϕ)(μ_j^ϕ − μ^ϕ)ᵀ  (2.86)

S_W^ϕ = Σ_{j=1}^{p} Σ_{i=1}^{n_j} (ϕ(x_i^{(j)}) − μ_j^ϕ)(ϕ(x_i^{(j)}) − μ_j^ϕ)ᵀ  (2.87)
where μ_j^ϕ = (1/n_j) Σ_{i=1}^{n_j} ϕ(x_i^{(j)}) is the mean of the jth class in F and μ^ϕ = (1/N) Σ_{i=1}^{N} ϕ(x_i) is the global mean in F. Using (2.86) and (2.87), Fisher's ratio in the feature space F can be derived as follows:

J(w) = (wᵀ S_B^ϕ w) / (wᵀ S_W^ϕ w)  (2.88)
where w is a nonzero vector in F. According to the theory of reproducing kernels, any vector w in the feature space F can also be represented as a linear combination of {ϕ(x_i)}_{i=1}^{N} by

w = Σ_{i=1}^{N} α_i ϕ(x_i)  (2.89)
where α = (α_1, α_2, …, α_N)ᵀ is the coefficient vector. Substituting (2.89) in (2.86) and (2.87) yields

wᵀ S_B^ϕ w = αᵀ K_B α  (2.90)

wᵀ S_W^ϕ w = αᵀ K_W α  (2.91)
where the kernel matrices K_B and K_W are obtained by the kernel trick. Using (2.90) and (2.91), (2.88) can be reexpressed as

J(α) = (αᵀ K_B α) / (αᵀ K_W α)  (2.92)
The optimal solution to maximizing (2.92) is given by a set of p − 1 eigenvectors that solve the following generalized eigenvalue problem:

K_B α = λ K_W α  (2.93)
For a new data sample vector x in the feature space F, its projection onto the optimal jth discriminant vector w_j* is given by

(w_j*)ᵀ ϕ(x) = Σ_{i=1}^{N} α_i^{j*} K(x_i, x)  (2.94)

where α^{j*} = (α_1^{j*}, …, α_N^{j*})ᵀ is the jth eigenvector solving (2.93).
Let

W* = [w_1* w_2* ⋯ w_{p−1}*]  (2.95)

be the optimal discriminant feature matrix formed by the p − 1 discriminant vectors {w_j*}_{j=1}^{p−1}. The transformed data in the feature space F can be obtained via the optimal discriminant feature matrix W* specified by (2.95) as follows:

X̃ = (W*)ᵀ ϕ(X)  (2.96)
where X is the original data space. KFLDA then performs FLDA on X̃ and is expected to perform better than FLDA operating directly on the original space X.
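The KFLDA machinery above can be sketched numerically for the two-class case. The code below is a minimal illustration rather than the chapter's algorithm verbatim: the ring-shaped toy data, the Gaussian RBF kernel with σ = 1, and the small ridge term μ added for numerical stability are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class data that is not linearly separable: class A clusters at the
# origin, class B lies on a surrounding ring.
a = rng.normal(0.0, 0.3, size=(20, 2))
theta = rng.uniform(0, 2 * np.pi, size=20)
b = np.c_[3 * np.cos(theta), 3 * np.sin(theta)] + rng.normal(0, 0.1, size=(20, 2))
X = np.vstack([a, b])
n_a, n_b = len(a), len(b)

def rbf(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

K = np.array([[rbf(x, y) for y in X] for x in X])  # Gram matrix on the training pool

# Two-class kernel Fisher discriminant: the expansion coefficients alpha that
# maximize the kernelized Fisher ratio can be obtained from N^{-1}(M_a - M_b),
# with a small ridge mu for numerical stability (an assumption of this sketch).
M_a = K[:, :n_a].mean(axis=1)
M_b = K[:, n_a:].mean(axis=1)
K_a, K_b = K[:, :n_a], K[:, n_a:]
N = (K_a @ (np.eye(n_a) - np.full((n_a, n_a), 1 / n_a)) @ K_a.T
     + K_b @ (np.eye(n_b) - np.full((n_b, n_b), 1 / n_b)) @ K_b.T)
mu = 1e-3
alpha = np.linalg.solve(N + mu * np.eye(len(X)), M_a - M_b)

# Projection of each sample onto the discriminant: sum_i alpha_i K(x_i, x).
proj = K @ alpha
print(proj[:n_a].min(), proj[n_a:].max())
```

With the RBF kernel the two classes, inseparable by any line in the original space, project to disjoint intervals on the discriminant axis, which is the point of performing FLDA in F instead of X.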
The SVM discussed in Section 2.3.1.2 is a linear machine that performs a binary decision for linearly separable or nonseparable problems. As we have seen, in order for a linear SVM to solve linearly nonseparable problems, a set of slack variables is introduced to resolve the dilemma caused by confusing data sample vectors. In this section, we consider a rather different approach that generalizes the linear discriminant function to allow a much larger class of possible decision boundaries by transforming an original data sample vector r via a set of M predetermined nonlinear functions ϕ_j(r), referred to as basis functions or kernels. The output resulting from such a nonlinear transform can be represented by a linear combination of these functions as follows:

y_k(r) = Σ_{j=1}^{M} w_{kj} ϕ_j(r) + b_k  (2.97)
In order to absorb the bias b_k into the linear sum in (2.97), we can introduce an auxiliary function ϕ_0(r) = 1 so that Equation (2.97) can be reexpressed in a linear form by

y_k(r) = w_kᵀ φ(r)  (2.98)
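The bias-absorption step can be checked with toy numbers; the values of φ(r), w_k, and b_k below are hypothetical.

```python
# Absorbing the bias b_k into the weight vector: prepend phi_0(r) = 1 and
# w_k0 = b_k, so the affine form becomes a single dot product (toy numbers).
phi_r = [0.5, -1.0, 2.0]     # phi_1(r)..phi_M(r), hypothetical values
w = [0.3, 0.7, -0.2]         # w_k1..w_kM
b = 1.5                      # bias b_k

affine = sum(wi * pi for wi, pi in zip(w, phi_r)) + b

phi_aug = [1.0] + phi_r      # phi_0(r) = 1 prepended
w_aug = [b] + w              # w_k0 = b_k prepended
linear = sum(wi * pi for wi, pi in zip(w_aug, phi_aug))

print(affine, linear)        # identical values
```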
where φ(r) = (ϕ_0(r), ϕ_1(r), …, ϕ_M(r))ᵀ and the weight vector w_k is represented by w_k = (w_{k0}, w_{k1}, …, w_{kM})ᵀ with w_{k0} = b_k. By adapting (2.98) we can also obtain a kernel-based SVM specified by the weight vector w_SVM given by

w_SVM = Σ_{i=1}^{N} α_i d_i φ(r_i)  (2.99)

y(r) = w_SVMᵀ φ(r) = Σ_{i=1}^{N} α_i d_i φ(r_i)ᵀ φ(r)  (2.100)
If we further define the kernel K(r, r_i) as the inner product of φ(r) and φ(r_i) by

K(r, r_i) = φ(r)ᵀ φ(r_i)  (2.101)
we can form a kernel-based SVM (KSVM) as follows.
Given a training pool {(r_i, d_i)}_{i=1}^{N}, find a Lagrange multiplier vector α = (α_1, α_2, …, α_N)ᵀ that maximizes Q(α), which is modified from (2.60) and given by

Q(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j d_i d_j K(r_i, r_j)  (2.102)
subject to the following constraints:

Σ_{i=1}^{N} α_i d_i = 0 and 0 ≤ α_i ≤ C for i = 1, 2, …, N
An optimal solution to the above kernel-based SVM optimization problem is given by

w_SVM = Σ_{i=1}^{n_s} α_i* d_i φ(r_i)  (2.103)
where n_s is the number of support vectors.
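The dual maximization of Q(α) can be sketched numerically on a toy two-sample problem where the optimum is known in closed form (α_1 = α_2 = 1/2). The linear kernel, step size, and projected-gradient scheme below are illustrative assumptions, not a prescribed solver.

```python
# Minimal sketch: projected gradient ascent on the KSVM dual for two samples
# r1 = (1, 0) with d1 = +1 and r2 = (-1, 0) with d2 = -1 under a linear kernel.
# The analytic optimum is alpha_1 = alpha_2 = 0.5.
R = [(1.0, 0.0), (-1.0, 0.0)]     # training samples r_i
d = [1.0, -1.0]                   # class labels d_i

def K(u, v):                      # linear kernel K(r, r_i) = r . r_i
    return sum(a * b for a, b in zip(u, v))

n = len(R)
Q = [[d[i] * d[j] * K(R[i], R[j]) for j in range(n)] for i in range(n)]

alpha = [0.0] * n
eta = 0.1                         # step size (an assumption of this sketch)
for _ in range(2000):
    # gradient of Q(alpha) = sum_i alpha_i - 0.5 sum_ij alpha_i alpha_j d_i d_j K_ij
    grad = [1.0 - sum(Q[i][j] * alpha[j] for j in range(n)) for i in range(n)]
    alpha = [alpha[i] + eta * grad[i] for i in range(n)]
    # project back onto the constraints sum_i alpha_i d_i = 0 and alpha_i >= 0
    shift = sum(a * di for a, di in zip(alpha, d)) / n
    alpha = [max(0.0, alpha[i] - shift * d[i]) for i in range(n)]

print(alpha)   # both multipliers approach 0.5
```

Both samples end up as support vectors (nonzero multipliers), and the resulting w_SVM = Σ α_i d_i r_i = (1, 0) points across the obvious separating boundary.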
The key issue in solving the KSVM is to determine the kernel used in (2.103). Interestingly, according to Mercer's theorem, if a kernel-based inner product K(x, y) is continuous and symmetric in a closed interval a ≤ x, y ≤ b, then K(x, y) can be expanded in a series

K(x, y) = Σ_{i=1}^{∞} λ_i φ_i(x) φ_i(y)  (2.104)

where {λ_i} are nonnegative eigenvalues and {φ_i} are their associated eigenfunctions.
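In practice, Mercer's condition shows up as positive semidefiniteness of the Gram matrix formed on any finite sample, whose eigenvalues play the role of the λ_i restricted to that sample. A quick numerical check with a Gaussian RBF kernel (the sample data and kernel width are arbitrary choices for this sketch):

```python
import numpy as np

# A Mercer kernel must yield a positive semidefinite Gram matrix on any finite
# sample: all eigenvalues nonnegative (up to floating-point error).
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))

def rbf(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

G = np.array([[rbf(x, y) for y in X] for x in X])
eigvals = np.linalg.eigvalsh(G)   # eigenvalues of the symmetric Gram matrix
print(eigvals.min())              # nonnegative up to floating-point tolerance
```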
The KFLDA and KSVM discussed above are derived from two popular hard-decision classifiers, FLDA and SVM. Correspondingly, there should also be kernel versions of the soft-decision classifiers discussed in Section 2.3.2.1, such as OSP. However, their derivations are much more involved. Therefore, their kernel versions, referred to as kernel-based linear spectral mixture analysis (KLSMA), will be developed and discussed separately in Chapter 15, which includes kernel versions of various LSMA-based classifiers.