Chapter 5

Maximum Correntropy Criterion–Based Kernel Adaptive Filters

Badong Chen, Xin Wang, Jose C. Principe
Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an, China
University of Florida, Gainesville, FL, United States

Abstract

Kernel adaptive filters (KAFs) are a family of powerful online kernel learning methods that can learn nonlinear and nonstationary systems efficiently. Most of the existing KAFs are obtained by minimizing the well-known mean square error (MSE). The MSE criterion is computationally simple and very easy to implement but may suffer from a lack of robustness to non-Gaussian noises. To deal with the non-Gaussian noises robustly, the optimization criteria must go beyond the second-order framework. Recently, the correntropy, a novel similarity measure that involves all even-order moments, has been shown to be very robust to the presence of heavy-tailed non-Gaussian noises. A generalized correntropy has been proposed in a recent study. Combining the KAFs with the maximum correntropy criterion (MCC) provides a unified and efficient approach to handle nonlinearity, nonstationarity and non-Gaussianity. The major goal of this chapter is to briefly characterize such an approach. The KAFs under the generalized MCC will be investigated, and some simulation results will be presented.

Keywords

Kernel adaptive filters; Correntropy; Generalized correntropy; Maximum correntropy criterion; Robustness

Chapter Points

  •  Kernel adaptive filters (KAFs) are powerful in online nonlinear system modeling.
  •  Correntropy and generalized correntropy are local similarity measures, and the maximum correntropy criterion (MCC) is robust to impulsive non-Gaussian noises.
  •  Combining KAFs and MCC provides an efficient way to deal with nonlinear modeling in the presence of non-Gaussian noises.

5.1 Introduction

The features of nonlinearity, nonstationarity and non-Gaussianity in many practical signals (such as EEG and EMG signals) pose great challenges for signal processing methods [1–5]. To deal with nonlinearities, many nonlinear models have been proposed in the literature. Well-known examples include neural networks (e.g., MLP, RBF), linear-in-the-parameter nonlinear models (e.g., Volterra series, kernel methods, extreme learning machines) and block-oriented nonlinear models (e.g., Wiener, Hammerstein) [6–10]. To handle nonstationarities, a common approach is to build the models in an online manner, where the parameters or structures of the models are updated adaptively to track the changes in a time-varying process. For example, in the adaptive filtering domain, various adaptive algorithms have been developed to learn the models adaptively, such as the least mean square (LMS), recursive least squares (RLS) and affine projection algorithms (APA) and their variants [10–13]. To cope with non-Gaussianity, optimization criteria must go beyond the second-order framework, and statistical measures incorporating higher-order information should be used. In particular, information theoretic measures such as entropy, mutual information and divergences have been shown to be very efficient for extracting the non-Gaussian features of the processed data [14–17]. Recently, the concept of correntropy in information theoretic learning (ITL) has been successfully used for robust learning and signal processing in the presence of heavy-tailed non-Gaussian noises [18–27].

KAFs have recently attracted great attention due to their powerful learning capability for nonlinear and nonstationary system modeling [10,28–39]. The KAF algorithms are a family of online kernel learning methods, which are implemented as linear adaptive filters with the input data transformed into a high-dimensional feature space induced by a Mercer kernel [10]. By combining kernel methods with standard (linear) adaptive filtering algorithms, the KAFs inherit some desirable features such as a universal approximation capability, the absence of local minima during adaptation and reasonable computational requirements. To date, a number of kernel adaptive filtering algorithms have been developed, including the kernel LMS (KLMS) [28], kernel RLS (KRLS) [29], kernel APA (KAPA) [36] and kernel Kalman filter [39]. Generally, a KAF algorithm builds a growing RBF-like network with data-centered hidden nodes. In the literature, there are several approaches to constrain the network size of KAFs [10,31,40].

Most KAFs were developed under the popular mean square error (MSE) criterion, which is computationally simple and mathematically tractable but may suffer severe performance degradation in the presence of non-Gaussian noises. To improve the robustness of KAFs with respect to non-Gaussian noises, many efforts have been made to develop new KAF algorithms based on non-MSE (non-second-order) optimization criteria, such as the MCC [21,38], the minimum error entropy (MEE) criterion [41], the least mean p-power (LMP) criterion [42], the least mean mixed norm (LMMN) criterion [43] and the least mean kurtosis (LMK) criterion [44]. Increasing attention has recently been paid to the MCC, which aims at maximizing the correntropy between the model output and the desired response. Correntropy is a generalized correlation measure in kernel space, closely related to the Parzen estimator of Rényi's entropy [14]; its name comes from the words “correlation” and “entropy”. When used as an optimality criterion, correntropy has some desirable properties: smoothness, robustness and invexity [45,46]. Combining KAFs with the MCC provides an efficient way to handle nonlinearity, nonstationarity and non-Gaussianity in a unified manner.

In this chapter, we give a brief introduction to the MCC-based KAFs. In Section 5.2, we briefly review two basic KAF algorithms. In Section 5.3, we introduce the concepts of correntropy and MCC. In Section 5.4, we present two KAF algorithms under the generalized MCC. Experimental results are shown in Section 5.5 and, finally, a conclusion is given in Section 5.6.

5.2 Kernel Adaptive Filters

Kernel methods are a powerful nonparametric modeling tool that transforms the input data into a high-dimensional feature space via a reproducing kernel, such that the inner product operation in the feature space can be computed efficiently through kernel evaluations [47]. Using the linear structure of this feature space to implement a well-established linear adaptive filtering algorithm, one obtains a KAF algorithm [10]. Among the various KAF algorithms, the KLMS and KRLS are perhaps the most popular and widely used [28,29]. The KLMS is an LMS-type stochastic gradient–based learning algorithm which is computationally very simple yet efficient [28]. The KRLS is an RLS-type recursive learning algorithm which can outperform the KLMS in many practical situations but requires more computational resources [10].

5.2.1 KLMS Algorithm

Consider the learning of a nonlinear function $y = f(x)$ based on the sample sequence $\{x_i \in \mathbb{R}^d, y_i \in \mathbb{R}\}$, $i = 1, 2, \ldots$. First, we transform the input data into a feature space $F$ through a nonlinear mapping $\varphi(\cdot)$ induced by a Mercer kernel $\kappa(\cdot,\cdot)$. Then, applying the well-known LMS algorithm [12] to the transformed sample sequence $\{\varphi(x_i), y_i\}$, we have

$$\Omega_i = \Omega_{i-1} - \mu \nabla_{\Omega_{i-1}} e_i^2 = \Omega_{i-1} + \eta e_i \varphi(x_i), \tag{5.1}$$

where $\Omega_i$ denotes the weight vector in the feature space at iteration $i$, $\eta = 2\mu$ is the step size parameter and $e_i$ stands for the prediction error, that is,

$$e_i = y_i - \Omega_{i-1}^T \varphi(x_i). \tag{5.2}$$

Identifying $\varphi(x) = \kappa(\cdot, x)$ and $\Omega_i = f_i$, where $f_i$ denotes the estimated function at iteration $i$, and setting the initial function $f_0 = 0$, we obtain the following update equations for the KLMS [28]:

$$f_0 = 0, \quad e_i = y_i - f_{i-1}(x_i), \quad f_i = f_{i-1} + \eta e_i \kappa(x_i, \cdot). \tag{5.3}$$

Note that, for the reproducing kernel $\kappa(\cdot,\cdot)$, we have $\langle \kappa(\cdot, x), f_i \rangle_F = f_i(x)$. With a radial basis kernel (e.g., the Gaussian kernel), the KLMS creates a growing RBF-like network [10], with the input vectors as the centers and the prediction errors (scaled by $\eta$) as the output weights, as shown in Fig. 5.1.

Figure 5.1 Learning process of KLMS.
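To make the update (5.3) concrete, the following Python/NumPy sketch implements the KLMS recursion with a Gaussian kernel. The function name klms, the parameter names step_size and sigma, and the data layout (inputs as rows of a 2D array) are our own conventions for this illustration, not the authors' reference code.

import numpy as np

def klms(X, y, step_size=0.2, sigma=1.0):
    """KLMS, Eq. (5.3): one new center per sample, output weight = eta * error."""
    centers, weights, errors = [], [], []
    for i in range(len(y)):
        if centers:
            C = np.asarray(centers)                       # existing centers x_1..x_{i-1}
            k = np.exp(-np.sum((C - X[i]) ** 2, axis=1) / (2 * sigma ** 2))
            y_hat = float(np.dot(weights, k))             # f_{i-1}(x_i)
        else:
            y_hat = 0.0                                   # f_0 = 0
        e = y[i] - y_hat                                  # prediction error e_i
        centers.append(X[i])                              # x_i becomes a new center
        weights.append(step_size * e)                     # its output weight eta * e_i
        errors.append(e)
    return centers, np.array(weights), np.array(errors)

With a time-embedded input matrix X of shape (n_samples, d), the returned centers and weights define the estimated function as a weighted sum of Gaussian kernels, mirroring the RBF-like network of Fig. 5.1.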

5.2.2 KRLS Algorithm

Next, we present the KRLS algorithm [29]. Consider the following optimization cost function:

$$J(\Omega) = \sum_{j=1}^{i} \zeta^{i-j} e_j^2 + \zeta^i \gamma \|\Omega\|^2, \tag{5.4}$$

where $\zeta$ is the forgetting factor and $\gamma$ is the regularization factor. Setting $\partial J(\Omega)/\partial \Omega = 0$, one can derive

$$\Omega = (\Phi_i B_i \Phi_i^T + \gamma \zeta^i I)^{-1} \Phi_i B_i \mathbf{y}_i, \tag{5.5}$$

where $\Phi_i = [\varphi_1, \ldots, \varphi_i]$ with $\varphi_i = \varphi(x_i)$, $\mathbf{y}_i = [y_1, \ldots, y_i]^T$, $I$ is an identity matrix and $B_i = \mathrm{diag}\{\zeta^{i-1}, \zeta^{i-2}, \ldots, 1\}$ is a diagonal matrix. Using the matrix inversion lemma [10], we have

$$\Omega = \Phi_i (\Phi_i^T \Phi_i + \gamma \zeta^i B_i^{-1})^{-1} \mathbf{y}_i. \tag{5.6}$$

Let $\mathbf{a}_i = (\Phi_i^T \Phi_i + \gamma \zeta^i B_i^{-1})^{-1} \mathbf{y}_i$ be a coefficient vector. Then the weight vector $\Omega$ can be expressed explicitly as a linear combination of the transformed data, that is, $\Omega = \sum_{j=1}^{i} (\mathbf{a}_i)_j \varphi_j$, where $(\mathbf{a}_i)_j$ denotes the $j$th component of $\mathbf{a}_i$. Denote $Q_i = (\Phi_i^T \Phi_i + \gamma \zeta^i B_i^{-1})^{-1}$. We have

$$Q_i^{-1} = \begin{bmatrix} Q_{i-1}^{-1} & \mathbf{h}_i \\ \mathbf{h}_i^T & \varphi_i^T \varphi_i + \gamma \zeta^i \end{bmatrix}, \tag{5.7}$$

where $\mathbf{h}_i = \Phi_{i-1}^T \varphi_i$. Using the block matrix inversion identity [10], we have the following update rule for $Q_i$:

$$Q_i = r_i^{-1} \begin{bmatrix} Q_{i-1} r_i + \mathbf{z}_i \mathbf{z}_i^T & -\mathbf{z}_i \\ -\mathbf{z}_i^T & 1 \end{bmatrix}, \tag{5.8}$$

where $\mathbf{z}_i = Q_{i-1} \mathbf{h}_i$ and $r_i = \gamma \zeta^i + \varphi_i^T \varphi_i - \mathbf{z}_i^T \mathbf{h}_i$. Therefore, the coefficient vector $\mathbf{a}_i$ can be updated as

$$\mathbf{a}_i = \begin{bmatrix} \mathbf{a}_{i-1} - \mathbf{z}_i r_i^{-1} e_i \\ r_i^{-1} e_i \end{bmatrix}. \tag{5.9}$$

From the above derivation, the KRLS algorithm can be briefly described in Algorithm 1.

Algorithm 1 KRLS Algorithm.
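Since the pseudocode of Algorithm 1 appears only as a figure in the source, the Python/NumPy sketch below illustrates the recursion (5.7)–(5.9) with a Gaussian kernel. The function name krls and the parameters forget, reg and sigma are our own naming; treat it as an illustrative sketch rather than a reference implementation.

import numpy as np

def krls(X, y, forget=1.0, reg=1e-3, sigma=1.0):
    """KRLS recursion, Eqs. (5.7)-(5.9), with a Gaussian kernel (phi_i^T phi_i = 1)."""
    ker = lambda A, b: np.exp(-np.sum((np.atleast_2d(A) - b) ** 2, axis=1) / (2 * sigma ** 2))
    centers = [X[0]]
    Q = np.array([[1.0 / (1.0 + reg * forget)]])      # (phi_1^T phi_1 + gamma*zeta)^(-1)
    a = Q @ np.array([y[0]])                          # initial coefficient vector
    for i in range(1, len(y)):
        h = ker(np.asarray(centers), X[i])            # h_i = Phi_{i-1}^T phi_i
        e = y[i] - h @ a                              # prediction error
        z = Q @ h                                     # z_i = Q_{i-1} h_i
        r = reg * forget ** i + 1.0 - z @ h           # r_i = gamma*zeta^i + phi_i^T phi_i - z_i^T h_i
        Q = np.block([[Q * r + np.outer(z, z), -z[:, None]],
                      [-z[None, :], np.array([[1.0]])]]) / r   # Eq. (5.8)
        a = np.concatenate([a - z * (e / r), [e / r]])          # Eq. (5.9)
        centers.append(X[i])
    return centers, a

The prediction at a new input x is simply the kernel vector between x and the stored centers dotted with the coefficient vector a.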

Similar to the KLMS, the KRLS also produces a growing RBF-like network. In order to reduce the computational cost and memory requirements of KAFs, several approaches have been proposed to curb the network growth [10]. In particular, a simple online quantization approach was proposed in [31]. The basic idea of the quantization approach is to represent the input data with a smaller data set by quantizing each new input to the nearest code vector in a quantization codebook (or dictionary).

5.3 Maximum Correntropy Criterion

5.3.1 Correntropy

First, we give some background on information theoretic learning (ITL) [14,15]. There is a close relationship between kernel methods and ITL. Most ITL cost functions, when estimated using the Parzen kernel estimator, can be expressed in terms of inner products in the kernel space $F$. For example, the Parzen estimator of the quadratic information potential (QIP) from samples $x_1, x_2, \ldots, x_N \in \mathbb{R}$ can be obtained as [14]

$$\mathrm{QIP}(X) = \frac{1}{N^2 \sqrt{2\pi}\,\sigma} \sum_{i=1}^{N} \sum_{j=1}^{N} \kappa_\sigma(x_i - x_j), \tag{5.10}$$

where $\kappa_\sigma(\cdot)$ denotes the translation-invariant Gaussian kernel with bandwidth $\sigma$, given by

$$\kappa_\sigma(x - y) = \exp\!\left(-\frac{(x - y)^2}{2\sigma^2}\right). \tag{5.11}$$

Since $\kappa_\sigma(x, y) = \langle \varphi(x), \varphi(y) \rangle_F$, we have

$$\mathrm{QIP}(X) = \frac{1}{\sqrt{2\pi}\,\sigma} \left\| \frac{1}{N} \sum_{i=1}^{N} \varphi(x_i) \right\|_F^2 = \frac{1}{\sqrt{2\pi}\,\sigma} \left\langle \frac{1}{N} \sum_{i=1}^{N} \varphi(x_i),\; \frac{1}{N} \sum_{j=1}^{N} \varphi(x_j) \right\rangle_F, \tag{5.12}$$

where $\|\cdot\|_F$ stands for the norm in $F$. Thus, the estimated QIP represents the squared norm of the mean of the transformed data in the kernel space.

The intrinsic link between ITL and kernel methods has motivated researchers to define new similarity measures in kernel space. As a local similarity measure in ITL, the correntropy between two random variables X and Y is defined [18] by

$$V(X, Y) = \mathbf{E}[\kappa_\sigma(X - Y)] = \int \kappa_\sigma(x - y)\, dF_{XY}(x, y), \tag{5.13}$$

where the expectation is taken over the joint distribution $F_{XY}(x, y)$. Correntropy can easily be estimated from data as

$$\hat{V}(X, Y) = \frac{1}{N} \sum_{i=1}^{N} \kappa_\sigma(x_i - y_i). \tag{5.14}$$

Correntropy can be expressed in terms of an inner product as

$$V(X, Y) = \mathbf{E}\big[\langle \varphi(X), \varphi(Y) \rangle_F\big], \tag{5.15}$$

which shows that correntropy is a generalized correlation measure in kernel space; it is itself positive definite, i.e., it defines a new kernel space for inference. In addition, by a simple Taylor series expansion of the kernel, one can see that correntropy provides a number that is the sum of all the even-order statistical moments expressed by the kernel. In many applications this sum may be sufficient to quantify the relationships of interest better than correlation, and it is much simpler to estimate than the higher-order statistical moments. Therefore, it can be considered a new type of statistical descriptor. To date, correntropy has been successfully applied to develop new adaptive filters for non-Gaussian noises [21–24], a correntropy MACE filter for pattern matching [48], robust PCA methods [19], new ICA algorithms [49], pitch detection [50] and nonlinear Granger causality analysis [51], with very good results. It is worth noting that many other similarity measures can be defined in kernel space, such as the centered correntropy [14], the correntropy coefficient [14], the kernel risk-sensitive loss [52] and the kernel mean p-power error [53].

Based on the correntropy, one can define the correntropic loss (C-Loss) [52] as

$$C_L(X, Y) = 1 - \mathbf{E}[\kappa_\sigma(X - Y)], \tag{5.16}$$

which essentially is the MSE in kernel space since we have

$$C_L(X, Y) = \frac{1}{2} \mathbf{E}\big[\|\varphi(X) - \varphi(Y)\|_F^2\big]. \tag{5.17}$$

The empirical C-Loss can be calculated by

$$\hat{C}_L(\mathbf{X}, \mathbf{Y}) = 1 - \frac{1}{N} \sum_{i=1}^{N} \kappa_\sigma(x_i - y_i), \tag{5.18}$$

which is a function of the sample vectors $\mathbf{X} = [x_1, x_2, \ldots, x_N]^T$ and $\mathbf{Y} = [y_1, y_2, \ldots, y_N]^T$. When no confusion can arise, we simply denote the empirical C-Loss by $\hat{C}_L(\mathbf{X}, \mathbf{Y})$. The square root of $\hat{C}_L(\mathbf{X}, \mathbf{Y})$ is also called the correntropy induced metric (CIM) between $\mathbf{X}$ and $\mathbf{Y}$, since it defines a metric in the sample space [18].
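As a quick illustration of (5.14), (5.18) and the CIM, the snippet below (our own helper functions correntropy, closs and cim, with made-up example data) computes the empirical correntropy, C-Loss and CIM of two sample vectors with a Gaussian kernel.

import numpy as np

def correntropy(x, y, sigma=1.0):
    """Empirical correntropy, Eq. (5.14)."""
    e = np.asarray(x) - np.asarray(y)
    return np.mean(np.exp(-e ** 2 / (2 * sigma ** 2)))

def closs(x, y, sigma=1.0):
    """Empirical correntropic loss, Eq. (5.18)."""
    return 1.0 - correntropy(x, y, sigma)

def cim(x, y, sigma=1.0):
    """Correntropy induced metric: square root of the empirical C-Loss."""
    return np.sqrt(closs(x, y, sigma))

x = np.array([0.0, 0.1, 0.2, 5.0])    # the last sample acts as a large outlier
y = np.zeros(4)
print(cim(x, y, sigma=1.0))           # the outlier contributes at most 1/N to the C-Loss
print(np.linalg.norm(x - y))          # the L2 distance is dominated by the outlier

The comparison with the L2 norm already hints at the saturation (rectification) behavior discussed next: a single large error can dominate the Euclidean distance but adds at most 1/N to the empirical C-Loss.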

Fig. 5.2 shows the contours of the distance from $e = \mathbf{X} - \mathbf{Y}$ to the origin in a two-dimensional space. The interesting observation from the figure is as follows. When two points are close, the CIM behaves like an $L_2$ norm (which is clear from the Taylor expansion of the kernel function); we call this area the Euclidean zone. Outside the Euclidean zone the CIM behaves like an $L_1$ norm; this region is named the transition zone. Eventually, in the rectification zone, as two points move further apart, the metric becomes much less sensitive to the change of the distance and saturates. Specifically, the CIM can be used in at least two respects for robust signal processing: (1) the relative insensitivity to very large values and the $L_1$ equivalence in the transition zone enable its use as an outlier-robust error measure, and (2) its concave contours for large-norm vectors in the rectification zone can be used to impose sparsity on adaptive system coefficients, which is typically done using an $L_p$ ($p \le 1$) penalization of the weight-vector norm. In both cases, the CIM has the useful property that for small-signal/weight situations the traditional quadratic measures are approximated, while for large-signal/weight conditions it makes a smooth but sharp transition to the desirable and robust $L_p$ ($p \le 1$) penalization. The kernel size controls the speed of this transition and can be determined to best fit the data through the kernel density estimation connection. It is worth noting that there is a single free parameter in correntropy, the kernel bandwidth, which controls the behavior of the CIM and the MCC. Intuitively, if the data size is large and the problem dimension is low, a small kernel size can be used so that the MCC searches with high precision (small bias) for the maximum of the error PDF. However, if the data size is small and/or the problem dimension is large, the kernel size has to be chosen so that the bulk of the data lies inside the bandwidth to ensure efficient data usage, while the outliers stay outside to enable effective attenuation (small variance).

Figure 5.2 Contours of CIM(X,0) in 2D vector space (kernel size is set to 1.0).

The kernel function in correntropy can easily be extended to the generalized Gaussian function and the new measure is called the generalized correntropy [23]. Specifically, the generalized correntropy between random variables X and Y is defined by

$$V_{\alpha,\beta}(X, Y) = \mathbf{E}[G_{\alpha,\beta}(X - Y)], \tag{5.19}$$

with $G_{\alpha,\beta}(\cdot)$ being the generalized Gaussian density (GGD) function [23]:

$$G_{\alpha,\beta}(x) = \frac{\alpha}{2\beta\,\Gamma(1/\alpha)} \exp\!\left(-\left|\frac{x}{\beta}\right|^\alpha\right) = \gamma_{\alpha,\beta} \exp(-\lambda |x|^\alpha), \tag{5.20}$$

where $\Gamma(\cdot)$ denotes the gamma function, $\alpha > 0$ is the shape parameter, $\beta > 0$ is the scale parameter, $\lambda = 1/\beta^\alpha$ is the kernel parameter and $\gamma_{\alpha,\beta} = \alpha/(2\beta\,\Gamma(1/\alpha))$ is the normalization constant. Obviously, when $\alpha = 2$, the generalized correntropy is equivalent to the original correntropy. Note that the kernel function in the generalized correntropy is a Mercer kernel only when $0 < \alpha \le 2$.
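For later reference, the GGD kernel of Eq. (5.20) can be evaluated as in the small helper below (the name ggd_kernel and its argument conventions are ours).

import numpy as np
from scipy.special import gamma

def ggd_kernel(x, alpha=2.0, beta=1.0):
    """Generalized Gaussian density G_{alpha,beta}(x), Eq. (5.20)."""
    norm_const = alpha / (2.0 * beta * gamma(1.0 / alpha))   # gamma_{alpha,beta}
    lam = 1.0 / beta ** alpha                                # lambda = 1 / beta^alpha
    return norm_const * np.exp(-lam * np.abs(x) ** alpha)

With alpha = 2 and beta = sqrt(2)*sigma this reduces to the Gaussian density with standard deviation sigma, recovering the original correntropy kernel up to normalization.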

Similarly, one can define the generalized C-Loss (GC-Loss) and the generalized CIM (GCIM). The GCIM also behaves like different norms (from $L_\alpha$ to $L_0$) of the data in different regions (see [23] for more details).

5.3.2 Maximum Correntropy Criterion

The correntropy and generalized correntropy can be used as an optimization cost in estimation-related problems. Given two random variables, $X \in \mathbb{R}$, an unknown real-valued parameter to be estimated, and $Y \in \mathbb{R}^m$, a vector of observations (measurements), an estimator of $X$ can be defined as a function of $Y$: $\hat{X} = g(Y)$, where $g$ is solved by optimizing a certain cost function. Under the MCC, the estimator $g$ is solved by maximizing the correntropy (or minimizing the C-Loss) between $X$ and $\hat{X}$, that is [45],

$$g_{\mathrm{MCC}} = \arg\max_{g \in G} V(X, \hat{X}) = \arg\max_{g \in G} \mathbf{E}[\kappa_\sigma(X - g(Y))], \tag{5.21}$$

where $G$ stands for the collection of all measurable functions of $Y$. If, for every $y \in \mathbb{R}^m$, $X$ has a conditional PDF $p_{X|Y}(x|y)$, then the estimation error $e = X - \hat{X}$ has PDF

$$p_e(\varepsilon) = \int_{\mathbb{R}^m} p_{X|Y}(\varepsilon + g(y)\,|\,y)\, dF_Y(y), \tag{5.22}$$

where $F_Y(y)$ is the distribution function of $Y$. It follows that

$$g_{\mathrm{MCC}} = \arg\max_{g \in G} \int_{\mathbb{R}} \kappa_\sigma(\varepsilon) \int_{\mathbb{R}^m} p_{X|Y}(\varepsilon + g(y)\,|\,y)\, dF_Y(y)\, d\varepsilon. \tag{5.23}$$

In [45], it has been proven that the MCC estimate is a smoothed maximum a posteriori probability (MAP) estimate, which equals the mode of the smoothed a posteriori PDF (the point at which that PDF attains its maximum value). Similar results hold for the generalized MCC (GMCC) estimation.

Theorem 1

The GMCC estimator can be expressed as

$$g_{\mathrm{GMCC}}(y) = \arg\max_{x \in \mathbb{R}} \rho_{\alpha,\beta}(x|y), \quad \forall y \in \mathbb{R}^m, \tag{5.24}$$

where $\rho_{\alpha,\beta}(x|y) = G_{\alpha,\beta}(x) * p_{X|Y}(x|y)$ (“$*$” denotes the convolution operator with respect to $x$).

Proof

It is easy to derive

$$\begin{aligned}
V_{\alpha,\beta}(X, \hat{X}) &= \int_{\mathbb{R}} G_{\alpha,\beta}(\varepsilon) \int_{\mathbb{R}^m} p_{X|Y}(\varepsilon + g(y)\,|\,y)\, dF_Y(y)\, d\varepsilon \\
&= \int_{\mathbb{R}^m} \left\{ \int_{\mathbb{R}} G_{\alpha,\beta}(\varepsilon)\, p_{X|Y}(\varepsilon + g(y)\,|\,y)\, d\varepsilon \right\} dF_Y(y) \\
&= \int_{\mathbb{R}^m} \left\{ \int_{\mathbb{R}} G_{\alpha,\beta}(\varepsilon' - g(y))\, p_{X|Y}(\varepsilon'\,|\,y)\, d\varepsilon' \right\} dF_Y(y) \\
&\overset{(a)}{=} \int_{\mathbb{R}^m} \left\{ \int_{\mathbb{R}} G_{\alpha,\beta}(g(y) - \varepsilon')\, p_{X|Y}(\varepsilon'\,|\,y)\, d\varepsilon' \right\} dF_Y(y) \\
&= \int_{\mathbb{R}^m} \left\{ \big(G_{\alpha,\beta}(\cdot) * p_{X|Y}(\cdot\,|\,y)\big)(g(y)) \right\} dF_Y(y) \\
&= \int_{\mathbb{R}^m} \rho_{\alpha,\beta}(g(y)\,|\,y)\, dF_Y(y), \tag{5.25}
\end{aligned}$$

where $\varepsilon' = \varepsilon + g(y)$ and (a) comes from the symmetry of $G_{\alpha,\beta}(\cdot)$. Thus

$$g_{\mathrm{GMCC}} = \arg\max_{g \in G} \int_{\mathbb{R}^m} \rho_{\alpha,\beta}(g(y)\,|\,y)\, dF_Y(y) \;\Longrightarrow\; g_{\mathrm{GMCC}}(y) = \arg\max_{x \in \mathbb{R}} \rho_{\alpha,\beta}(x|y), \quad \forall y \in \mathbb{R}^m. \tag{5.26}$$

This completes the proof. 

Remark

Considering the conditional PDF $p_{X|Y}(x|y)$ as an a posteriori PDF given the observation $y$, the function $\rho_{\alpha,\beta}(x|y)$ is a smoothed (by convolution) a posteriori PDF. Therefore, the GMCC estimate is a smoothed MAP estimate.

Corollary 1

When $\beta \to 0^+$ (or $\lambda \to \infty$), the GMCC estimation becomes the MAP estimation.

Proof

As $\beta \to 0^+$, the GGD function $G_{\alpha,\beta}(\cdot)$ approaches the Dirac delta function and the smoothed conditional PDF $\rho_{\alpha,\beta}(x|y)$ reduces to the original conditional PDF $p_{X|Y}(x|y)$. In this case, the GMCC estimation is equivalent to the MAP estimation.

Theorem 2

The GMCC estimator can also be expressed as

$$g_{\mathrm{GMCC}} = \arg\max_{g \in G} p_e^{\alpha,\beta}(0), \tag{5.27}$$

where $p_e^{\alpha,\beta}(x) = p_e(x) * G_{\alpha,\beta}(x)$ is the smoothed error PDF.

Proof

It is easy to derive

$$\mathbf{E}[G_{\alpha,\beta}(e)] = \int_{\mathbb{R}} G_{\alpha,\beta}(\varepsilon)\, p_e(\varepsilon)\, d\varepsilon = \left( \int_{\mathbb{R}} G_{\alpha,\beta}(x - \varepsilon)\, p_e(\varepsilon)\, d\varepsilon \right)_{x=0} = \big(p_e(x) * G_{\alpha,\beta}(x)\big)_{x=0} = p_e^{\alpha,\beta}(0). \tag{5.28}$$

Hence

$$g_{\mathrm{GMCC}} = \arg\max_{g \in G} \mathbf{E}[G_{\alpha,\beta}(e)] = \arg\max_{g \in G} p_e^{\alpha,\beta}(0). \tag{5.29}$$

Theorem 3

When $\beta \to \infty$ (or $\lambda \to 0^+$), the GMCC estimation will be equivalent to the LMP estimation with $p = \alpha$.

Proof

Under the LMP criterion, the optimal estimator is solved by minimizing the mean p-power of the error:

$$g_{\mathrm{LMP}} = \arg\min_{g \in G} \mathbf{E}[|e|^p]. \tag{5.30}$$

On the other hand, we have $V_{\alpha,\beta}(X, \hat{X}) \approx \gamma_{\alpha,\beta}\big(1 - \lambda \mathbf{E}[|e|^\alpha]\big)$ as $\lambda \to 0^+$. It follows that

$$\max_{g} V_{\alpha,\beta}(X, \hat{X}) \;\Longleftrightarrow\; \min_{g} \mathbf{E}[|e|^\alpha], \tag{5.31}$$

that is, as $\lambda \to 0^+$, the GMCC estimation becomes, approximately, equivalent to the LMP estimation with $p = \alpha$. 

5.4 Kernel Adaptive Filters Under Generalized MCC

Most of the KAFs, such as the KLMS and KRLS, have been derived under the well-known MSE criterion. The MSE is computationally simple and mathematically tractable but may perform very poorly under non-Gaussian noise (especially impulsive noise) disturbances. An efficient approach to improve the robustness to non-Gaussian noises is to use MCC-based costs [21–24]. In the following, we present two KAF algorithms under the generalized MCC [54,55]. Of course, one can derive many other KAFs under the GMCC.

5.4.1 Generalized Kernel Maximum Correntropy

In a way very similar to the derivation of the KLMS, applying the GMCC cost to the transformed sample sequence $\{\varphi(x_i), y_i\}$, one can derive

$$\begin{aligned}
\Omega_i &= \Omega_{i-1} + \mu \frac{\partial}{\partial \Omega_{i-1}} \exp(-\lambda |e_i|^\alpha) \\
&= \Omega_{i-1} + \mu \lambda \alpha \exp(-\lambda |e_i|^\alpha)\, |e_i|^{\alpha-1} \mathrm{sign}(e_i)\, \varphi(x_i) \\
&= \Omega_{i-1} + \eta \exp(-\lambda |e_i|^\alpha)\, |e_i|^{\alpha-1} \mathrm{sign}(e_i)\, \varphi(x_i), \tag{5.32}
\end{aligned}$$

where the step size $\eta = \mu \lambda \alpha$. The above algorithm is called the generalized kernel maximum correntropy (GKMC) algorithm [54]. From (5.32), we have the following observations:

1) When $\alpha = 2$, the GKMC becomes

$$\Omega_i = \Omega_{i-1} + \eta \exp(-\lambda e_i^2)\, e_i\, \varphi(x_i), \tag{5.33}$$

which is the kernel maximum correntropy (KMC) algorithm developed in [21].

2) When $\lambda \to 0^+$, the GKMC becomes

$$\Omega_i = \Omega_{i-1} + \eta\, |e_i|^{\alpha-1} \mathrm{sign}(e_i)\, \varphi(x_i), \tag{5.34}$$

which is the kernel LMP (KLMP) algorithm developed in [42].

3) When $\alpha = 2$ and $\lambda \to 0^+$, the GKMC reduces to the KLMS algorithm

$$\Omega_i = \Omega_{i-1} + \eta\, e_i\, \varphi(x_i). \tag{5.35}$$

In addition, one can rewrite (5.32) as

$$\Omega_i = \Omega_{i-1} + \eta_i\, e_i\, \varphi(x_i), \tag{5.36}$$

where $\eta_i = \eta \exp(-\lambda |e_i|^\alpha)\, |e_i|^{\alpha-2}$ is a variable step size (VSS). Thus, the GKMC can be viewed as a KLMS with VSS $\eta_i$. Since $\eta_i \to 0$ as $|e_i| \to \infty$, the learning rate of the GKMC approaches zero when a large error occurs. This property ensures the robustness of the GKMC with respect to large outliers.

In summary, the GKMC algorithm is given in Algorithm 2, where $\mathbf{a}_i$ denotes the coefficient vector at iteration $i$, $(\mathbf{a}_i)_j$ is the $j$th component of $\mathbf{a}_i$ and $\mathcal{C}_i$ is the center set (or dictionary).

Algorithm 2 GKMC Algorithm.
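Because Algorithm 2 is reproduced only as a figure, the following Python/NumPy sketch illustrates the GKMC update (5.32), i.e., a KLMS with the error-dependent step size of (5.36). The function name gkmc and the default parameter values are our own choices.

import numpy as np

def gkmc(X, y, eta=0.5, lam=1.0, alpha=2.0, sigma=1.0):
    """GKMC, Eq. (5.32): KLMS whose new-center weight is scaled by exp(-lam*|e|^alpha)*|e|^(alpha-1)."""
    centers, weights = [], []
    for i in range(len(y)):
        if centers:
            C = np.asarray(centers)
            k = np.exp(-np.sum((C - X[i]) ** 2, axis=1) / (2 * sigma ** 2))
            y_hat = float(np.dot(weights, k))
        else:
            y_hat = 0.0
        e = y[i] - y_hat
        # new output weight: eta * exp(-lam*|e_i|^alpha) * |e_i|^(alpha-1) * sign(e_i), cf. Eq. (5.32)
        w_new = eta * np.exp(-lam * abs(e) ** alpha) * abs(e) ** (alpha - 1) * np.sign(e)
        centers.append(X[i])
        weights.append(w_new)
    return centers, np.array(weights)

Setting alpha = 2 recovers the KMC update (5.33), and letting lam approach zero recovers the KLMP update (5.34), as noted above.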

The network size of the GKMC grows linearly with the number of input data. One can use a simple quantization approach to curb the network growth [31]. In this way, the weight update Eq. (5.32) becomes

$$\Omega_i = \Omega_{i-1} + \eta \exp(-\lambda |e_i|^\alpha)\, |e_i|^{\alpha-1} \mathrm{sign}(e_i)\, \varphi(Q[x_i]), \tag{5.37}$$

where $Q[\cdot]$ denotes the quantization operator [31]:

$$Q[x_i] = \begin{cases} (\mathcal{C}_{i-1})_{j^*}, & \mathrm{dis}(x_i, \mathcal{C}_{i-1}) \le \varepsilon, \\ x_i, & \mathrm{dis}(x_i, \mathcal{C}_{i-1}) > \varepsilon, \end{cases} \tag{5.38}$$

with $\mathcal{C}_{i-1}$ being the codebook at iteration $i-1$, $\varepsilon$ the quantization size and $\mathrm{dis}(x_i, \mathcal{C}_{i-1})$ the distance between $x_i$ and $\mathcal{C}_{i-1}$. We have

$$\mathrm{dis}(x_i, \mathcal{C}_{i-1}) = \min_{1 \le j \le \mathrm{size}(\mathcal{C}_{i-1})} \| x_i - (\mathcal{C}_{i-1})_j \|, \tag{5.39}$$

where $(\mathcal{C}_{i-1})_j$ denotes the $j$th element (code vector) of the codebook $\mathcal{C}_{i-1}$ and $j^*$ is the index of the closest center:

$$j^* = \arg\min_{1 \le j \le \mathrm{size}(\mathcal{C}_{i-1})} \| x_i - (\mathcal{C}_{i-1})_j \|. \tag{5.40}$$

The quantized GKMC (QGKMC) algorithm is summarized in Algorithm 3.

Algorithm 3 QGKMC Algorithm.
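As Algorithm 3 is likewise given only as a figure, the vector quantization step (5.38)–(5.40) can be sketched as below; the helper name quantize and the way its result is consumed are our own simplified reading of [31].

import numpy as np

def quantize(x, codebook, eps):
    """Return (index, is_new): nearest code vector if within eps (Eq. (5.38)), else flag a new center."""
    if codebook:
        d = np.linalg.norm(np.asarray(codebook) - x, axis=1)   # distances, Eq. (5.39)
        j_star = int(np.argmin(d))                             # closest center, Eq. (5.40)
        if d[j_star] <= eps:
            return j_star, False
    return len(codebook), True                                 # x_i becomes a new code vector

In the QGKMC, when is_new is False the increment eta*exp(-lam*|e|^alpha)*|e|^(alpha-1)*sign(e) is added to the output weight of the existing center j_star; otherwise x_i is appended to the codebook with that increment as its weight, so the network only grows when a genuinely novel input region is visited.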

5.4.2 Generalized Kernel Recursive Maximum Correntropy

Next, we derive the generalized kernel recursive maximum correntropy (GKRMC) algorithm [55]. A weighted and regularized cost function under GMCC is proposed as

$$J(\Omega) = \sum_{j=1}^{i} \zeta^{i-j} G_{\alpha,\beta}(y_j - \Omega^T \varphi_j) - \frac{1}{2} \zeta^i \gamma \|\Omega\|^2, \tag{5.41}$$

where $\zeta$ is the forgetting factor, satisfying $0 < \zeta \le 1$, $\gamma$ is the regularization factor and $G_{\alpha,\beta}(y_j - \Omega^T \varphi_j) = \frac{\alpha}{2\beta\,\Gamma(1/\alpha)} \exp\!\left(-\left|\frac{y_j - \Omega^T \varphi_j}{\beta}\right|^\alpha\right)$. Setting $\partial J(\Omega)/\partial \Omega = 0$, one can obtain the solution

$$\Omega_i = (\Phi_i B_i \Phi_i^T + \gamma \zeta^i \beta^\alpha I)^{-1} \Phi_i B_i \mathbf{y}_i, \tag{5.42}$$

where

$$B_i = \mathrm{diag}\{b_1, b_2, \ldots, b_i\}$$

with

$$b_j = \zeta^{i-j} \times |y_j - \Omega^T \varphi_j|^{\alpha-2} \times \frac{\alpha^2}{2\beta\,\Gamma(1/\alpha)} \times \exp\!\left(-\left|\frac{y_j - \Omega^T \varphi_j}{\beta}\right|^\alpha\right), \quad j = 1, 2, \ldots, i.$$

With the matrix inversion lemma, we have

$$(\Phi_i B_i \Phi_i^T + \gamma \zeta^i \beta^\alpha I)^{-1} \Phi_i B_i = \Phi_i (\Phi_i^T \Phi_i + \gamma \zeta^i \beta^\alpha B_i^{-1})^{-1}. \tag{5.43}$$

Thus, (5.42) can be rewritten as

$$\Omega_i = \Phi_i (\Phi_i^T \Phi_i + \gamma \zeta^i \beta^\alpha B_i^{-1})^{-1} \mathbf{y}_i. \tag{5.44}$$

The above weight vector can be expressed as a linear combination of the transformed feature vectors, where the combination coefficients can be calculated recursively. We denote $\Omega_i = \Phi_i \mathbf{a}_i$ with $\mathbf{a}_i = (\Phi_i^T \Phi_i + \gamma \zeta^i \beta^\alpha B_i^{-1})^{-1} \mathbf{y}_i$ being the coefficient vector. In addition, let $Q_i = (\Phi_i^T \Phi_i + \gamma \zeta^i \beta^\alpha B_i^{-1})^{-1}$. Then we have

$$Q_i = \begin{bmatrix} \Phi_{i-1}^T \Phi_{i-1} + \gamma \zeta^i \beta^\alpha B_{i-1}^{-1} & \Phi_{i-1}^T \varphi_i \\ \varphi_i^T \Phi_{i-1} & \varphi_i^T \varphi_i + \gamma \zeta^i \beta^\alpha \theta_i^{-1} \end{bmatrix}^{-1}, \tag{5.45}$$

where $\theta_i = |y_i - \Omega^T \varphi_i|^{\alpha-2} \times \frac{\alpha^2}{2\beta\,\Gamma(1/\alpha)} \times \exp\!\left(-\left|\frac{y_i - \Omega^T \varphi_i}{\beta}\right|^\alpha\right)$. The above result can be converted to

$$Q_i^{-1} = \begin{bmatrix} Q_{i-1}^{-1} & \mathbf{h}_i \\ \mathbf{h}_i^T & \varphi_i^T \varphi_i + \gamma \zeta^i \beta^\alpha \theta_i^{-1} \end{bmatrix}, \tag{5.46}$$

where $\mathbf{h}_i = \Phi_{i-1}^T \varphi_i$. Based on the following block matrix inversion identity,

$$\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} = \begin{bmatrix} (A - B D^{-1} C)^{-1} & -A^{-1} B (D - C A^{-1} B)^{-1} \\ -D^{-1} C (A - B D^{-1} C)^{-1} & (D - C A^{-1} B)^{-1} \end{bmatrix}, \tag{5.47}$$

one can derive

$$Q_i = r_i^{-1} \begin{bmatrix} Q_{i-1} r_i + \mathbf{z}_i \mathbf{z}_i^T & -\mathbf{z}_i \\ -\mathbf{z}_i^T & 1 \end{bmatrix}, \tag{5.48}$$

where $r_i = \gamma \zeta^i \beta^\alpha \theta_i^{-1} + \varphi_i^T \varphi_i - \mathbf{z}_i^T \mathbf{h}_i$ and $\mathbf{z}_i = Q_{i-1} \mathbf{h}_i$. Therefore, the expansion coefficients are

$$\mathbf{a}_i = Q_i \mathbf{y}_i = r_i^{-1} \begin{bmatrix} Q_{i-1} r_i + \mathbf{z}_i \mathbf{z}_i^T & -\mathbf{z}_i \\ -\mathbf{z}_i^T & 1 \end{bmatrix} \begin{bmatrix} \mathbf{y}_{i-1} \\ y_i \end{bmatrix} = \begin{bmatrix} \mathbf{a}_{i-1} - \mathbf{z}_i r_i^{-1} e_i \\ r_i^{-1} e_i \end{bmatrix}, \tag{5.49}$$

where $e_i = y_i - \Omega_{i-1}^T \varphi_i$ is the prediction error. Thus we have obtained the GKRMC algorithm, whose pseudocode is given in Algorithm 4.

Remark

When $\alpha = 2$, the GKRMC algorithm reduces to the kernel recursive maximum correntropy (KRMC) algorithm developed in [38].

Algorithm 4 GKRMC Algorithm.
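With Algorithm 4 available only as a figure, the compact Python/NumPy sketch below traces the GKRMC recursion (5.45)–(5.49) with a Gaussian Mercer kernel. The names gkrmc, forget and reg are ours, the scalar weighting theta follows the reading of (5.45)–(5.46) adopted above, and the underflow guard is our own addition; it is an illustration, not the authors' implementation.

import numpy as np
from scipy.special import gamma

def gkrmc(X, y, alpha=2.0, beta=1.0, forget=1.0, reg=0.1, sigma=1.0):
    """GKRMC recursion, Eqs. (5.45)-(5.49), with a Gaussian Mercer kernel (phi_i^T phi_i = 1)."""
    ker = lambda A, b: np.exp(-np.sum((np.atleast_2d(A) - b) ** 2, axis=1) / (2 * sigma ** 2))
    c_ab = alpha ** 2 / (2.0 * beta * gamma(1.0 / alpha))    # constant appearing in b_j / theta_i
    centers = [X[0]]
    Q = np.array([[1.0 / (1.0 + reg * forget * beta ** alpha)]])
    a = Q @ np.array([y[0]])
    for i in range(1, len(y)):
        h = ker(np.asarray(centers), X[i])                   # h_i
        e = y[i] - h @ a                                     # prediction error e_i
        theta = c_ab * abs(e) ** (alpha - 2) * np.exp(-abs(e / beta) ** alpha)
        theta = max(theta, 1e-12)                            # guard against underflow for huge errors
        z = Q @ h                                            # z_i = Q_{i-1} h_i
        r = reg * forget ** i * beta ** alpha / theta + 1.0 - z @ h   # r_i
        Q = np.block([[Q * r + np.outer(z, z), -z[:, None]],
                      [-z[None, :], np.array([[1.0]])]]) / r # Eq. (5.48)
        a = np.concatenate([a - z * (e / r), [e / r]])       # Eq. (5.49)
        centers.append(X[i])
    return centers, a

Compared with the GKMC sketch, each step costs O(n^2) in the current network size n because of the Q update, which is the price for the faster convergence observed in Section 5.5.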

5.5 Simulation Results

In the following, we present some simulation results to demonstrate the performance of the GKMC and GKRMC algorithms.

5.5.1 Frequency Doubling Problem

First, consider the frequency doubling problem, where the input signal is a sine wave with 20 samples per period. The desired signal is another sine wave with double the frequency, disturbed by an additive noise $v_i$. With a time embedding dimension of 5, the input vector is $\mathbf{x}_i = [x_{i-4}, x_{i-3}, \ldots, x_i]$. In this example, we consider a mixture noise model that combines two independent noise processes, given by

$$v_i = (1 - \delta_i) A_i + \delta_i B_i, \tag{5.50}$$

where $\delta_i$ is a binary variable taking values in $\{0, 1\}$ with probabilities $p(\delta_i = 1) = c$ and $p(\delta_i = 0) = 1 - c$, where $0 \le c \le 1$ controls the occurrence probability of the two noise processes $A_i$ and $B_i$. In general, $A_i$ represents the “background noise” with small variance, while $B_i$ stands for large outliers that occur occasionally with larger variance. In the simulations below, unless otherwise mentioned, we set $c = 0.05$. $B_i$ is assumed to be a white Gaussian process with zero mean and variance 15. The Gaussian kernel is adopted as the Mercer kernel in the kernel adaptive filters and the bandwidth is set to $\sigma = 2.0$. The step sizes are chosen such that all the algorithms achieve almost the same initial convergence speed.
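For reproducibility of this noise setting, a possible generator for the mixture model (5.50) is sketched below; the name impulsive_noise and the use of NumPy's random generator are our own choices, with bg_std = 0.5 corresponding to the Gaussian background with variance 0.25 used in Fig. 5.3(A).

import numpy as np

def impulsive_noise(n, c=0.05, bg_std=0.5, outlier_std=np.sqrt(15.0), rng=None):
    """Mixture noise, Eq. (5.50): background A_i with probability 1-c, outlier B_i with probability c."""
    rng = np.random.default_rng() if rng is None else rng
    delta = rng.random(n) < c                       # delta_i ~ Bernoulli(c)
    A = rng.normal(0.0, bg_std, n)                  # small-variance background noise
    B = rng.normal(0.0, outlier_std, n)             # occasional large outliers (variance 15)
    return np.where(delta, B, A)

For the uniform background case, A can simply be replaced by rng.uniform(-0.5, 0.5, n).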

We choose 4000 noisy samples for training and 200 clean samples for testing (the filters are fixed during testing). The average learning curves of the KLMS and GKMC over 100 Monte Carlo runs in terms of the testing MSE, with different values of $\alpha$ ($\lambda = 2.0$) and different distributions of $A_i$ (Gaussian and uniform), are illustrated in Fig. 5.3, and the corresponding testing MSEs at the final iteration are summarized in Table 5.1. From the simulation results we observe: (i) the GKMC can outperform the KLMS significantly, showing much stronger robustness with respect to impulsive noises; (ii) with different values of $\alpha$, the GKMC achieves different performances, and the best performance may be obtained with $\alpha \neq 2.0$. Note that the GKMC reduces to the KMC algorithm when $\alpha = 2.0$.

Figure 5.3 Average learning curves with different values of α (α = 2, 4, 6) and distributions of Ai: (A) Gaussian distribution with zero mean and variance 0.25; (B) uniform distribution over [−0.5, 0.5].

Table 5.1

Testing MSEs at the final iteration in frequency doubling

         | KLMS            | GKMC α = 2      | GKMC α = 4      | GKMC α = 6
Gaussian | 0.2250 ± 0.1896 | 0.0353 ± 0.0168 | 0.0196 ± 0.0101 | 0.0260 ± 0.0121
Uniform  | 0.1781 ± 0.1842 | 0.0137 ± 0.0057 | 0.0037 ± 0.0016 | 0.0030 ± 0.0022

Further, we compare the performance of the GKMC and QGKMC. We set $\varepsilon = 0.03$, $\alpha = 4$, $\lambda = 2$ and $\eta = 0.3$ and assume that $A_i$ is Gaussian distributed with variance 0.04. The average learning curves over 100 runs are shown in Fig. 5.4, and the network size growth curve of the QGKMC is presented in Fig. 5.5. One can see that both algorithms achieve almost the same convergence performance, but the network size of the QGKMC is significantly curbed (only slightly larger than 30 at steady state).

Figure 5.4 Average learning curves of QGKMC and GKMC.
Figure 5.5 Network size growth curve of QGKMC.

5.5.2 Online Nonlinear System Identification

Next, we show the performance of the GKMC and GKRMC algorithms in online nonlinear system identification. Suppose the nonlinear system is of the form [41]

$$y_i = \big(0.8 - 0.5 \exp(-y_{i-1}^2)\big)\, y_{i-1} - \big(0.3 + 0.9 \exp(-y_{i-1}^2)\big)\, y_{i-2} + 0.1 \sin(3.1415926\, y_{i-1}) + v_i, \tag{5.51}$$

where $v_i$ is a disturbance noise described by (5.50). The average learning curves in terms of the testing MSE over 100 Monte Carlo runs, with different values of $\alpha$ and different distributions of $A_i$ (uniform and alpha-stable), are shown in Fig. 5.6, and the corresponding testing MSEs at the final iteration are listed in Table 5.2. In this simulation, 1000 noisy samples are used for training and 100 noise-free samples are used for testing (the filters are fixed during testing). The parameters of the GKMC and GKRMC are experimentally chosen such that both achieve a desirable performance. It can be clearly seen that the GKRMC outperforms the GKMC, with a faster convergence speed and a lower testing MSE. However, we should note that the GKRMC is computationally more expensive than the GKMC.
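A possible generator for the benchmark system (5.51) used above is sketched below; combined with impulsive_noise it reproduces the training data layout of this subsection. The function name generate_system and the zero initial conditions are our assumptions.

import numpy as np

def generate_system(n, noise=None):
    """Generate n samples from the nonlinear system of Eq. (5.51)."""
    y = np.zeros(n)
    v = np.zeros(n) if noise is None else noise
    for i in range(2, n):
        g = np.exp(-y[i - 1] ** 2)
        y[i] = ((0.8 - 0.5 * g) * y[i - 1]
                - (0.3 + 0.9 * g) * y[i - 2]
                + 0.1 * np.sin(3.1415926 * y[i - 1])
                + v[i])
    return y

One common choice, not spelled out in the chapter, is to use [y_{i-1}, y_{i-2}] as the filter input and y_i as the desired output when identifying this system online.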

Figure 5.6 Average learning curves with different values of α and distributions of Ai: (A) uniform distribution over [−0.5, 0.5]; (B) alpha-stable distribution with shape parameter 0.7 and scale parameter 0.05.

Table 5.2

Testing MSEs at the final iteration in online nonlinear system identification

             | GKMC α = 2      | GKMC α = 4      | GKMC α = 6      | GKRMC α = 2     | GKRMC α = 4     | GKRMC α = 6
Uniform      | 0.0383 ± 0.0250 | 0.0330 ± 0.0145 | 0.0298 ± 0.0132 | 0.0024 ± 0.0009 | 0.0014 ± 0.0007 | 0.0012 ± 0.0006
             | GKMC α = 1      | GKMC α = 2      | GKMC α = 4      | GKRMC α = 1     | GKRMC α = 2     | GKRMC α = 4
Alpha-stable | 0.0152 ± 0.0110 | 0.0286 ± 0.0194 | 0.0907 ± 0.0556 | 0.0007 ± 0.0004 | 0.0011 ± 0.0005 | 0.0017 ± 0.0009

5.5.3 Noise Cancellation

In this part, we consider a noise cancellation problem in signal processing. The basic structure of a noise cancellation system is shown in Fig. 5.7. The primary signal is $s_i$, whose noisy measurement $d_i$ serves as the desired signal of the system. The term $\rho_i$ represents an unknown white noise process and $x_i$ is its reference measurement. The signal $x_i$ can be used as the input of an adaptive filter such that the filter output is an estimate of the noise $\rho_i$. Clearly, the signal-to-noise ratio (SNR) can be improved by subtracting the estimated noise from $d_i$. In the simulation, the noise source is assumed to be uniformly distributed over $[-0.5, 0.5]$. The interference distortion function is

$$x_i = \rho_i - 0.2 x_{i-1} - x_{i-1} \rho_{i-1} + 0.1 \rho_{i-1} + 0.4 x_{i-2}, \tag{5.52}$$

which is nonlinear with an infinite impulse response. It is impossible to recover $\rho_i$ from a finite time delay embedding of $x_i$ alone. Indeed, one can rewrite the distortion function as

$$\rho_i = x_i + 0.2 x_{i-1} - 0.4 x_{i-2} + (x_{i-1} - 0.1) \rho_{i-1}. \tag{5.53}$$

Obviously, the present value of the noise source $\rho_i$ depends not only on the reference noise measurements $[x_i, x_{i-1}, x_{i-2}]$, but also on the past value $\rho_{i-1}$. However, the filter output $y_{i-1}$ can be used as a surrogate for $\rho_{i-1}$. In this case, the input vector of the adaptive filter is $[x_i, x_{i-1}, x_{i-2}, y_{i-1}]$. We assume the primary signal $s_i = 0$ during the training phase, so the system tries to reconstruct the noise source from the reference measurements. We use three nonlinear filters trained with the KLMS, KRLS and GKRMC, respectively. In the simulation, the step size and the Gaussian kernel bandwidth for the KLMS are 2.0 and 0.15, respectively. The regularization parameter for the KRLS and GKRMC is 0.1. The Gaussian kernel bandwidths for the KRLS and GKRMC are 0.17 and 0.1, respectively. The parameter $\beta$ for the GKRMC is set to 0.5. The ensemble learning curves over 100 Monte Carlo runs for the different algorithms are illustrated in Fig. 5.8. In addition, the noise reduction factor (NR), defined by $10 \log_{10}\{ \mathbf{E}[\rho_i^2] / \mathbf{E}[(\rho_i - y_i)^2] \}$, is given in Table 5.3. In this example, the GKRMC achieves the best performance when $\alpha = 6$.
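The distortion recursion (5.52) and the noise reduction factor reported in Table 5.3 can be computed as in the sketch below; the function names distort and noise_reduction_db are our own.

import numpy as np

def distort(rho):
    """Reference measurement x_i generated from the noise source rho_i via Eq. (5.52)."""
    x = np.zeros_like(rho)
    for i in range(2, len(rho)):
        x[i] = (rho[i] - 0.2 * x[i - 1] - x[i - 1] * rho[i - 1]
                + 0.1 * rho[i - 1] + 0.4 * x[i - 2])
    return x

def noise_reduction_db(rho, y):
    """NR = 10*log10(E[rho_i^2] / E[(rho_i - y_i)^2]), with y_i the filter's noise estimate."""
    return 10.0 * np.log10(np.mean(rho ** 2) / np.mean((rho - y) ** 2))

rng = np.random.default_rng(0)
rho = rng.uniform(-0.5, 0.5, 5000)     # uniform noise source over [-0.5, 0.5]
x = distort(rho)                       # reference measurements fed to the adaptive filter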

Figure 5.7 Basic structure of a noise cancellation system.
Figure 5.8 Learning curves of noise cancellation.

Table 5.3

Performance comparison of KLMS, KRLS and GKRMC in noise cancellation

Algorithm | KLMS            | KRLS             | GKRMC α = 2      | GKRMC α = 4      | GKRMC α = 6
NR (dB)   | 5.7356 ± 0.6441 | 11.6057 ± 1.3710 | 27.4908 ± 1.2422 | 30.5206 ± 1.4516 | 33.5343 ± 1.7440

5.6 Conclusion

In this chapter, we characterized a unified and efficient approach to online nonlinear modeling with non-Gaussian noises, which combines the powerful modeling capability of KAFs with the robustness of the MCC to heavy-tailed non-Gaussian noises. We focused mainly on KLMS- and KRLS-type algorithms, but the results can easily be extended to other KAFs. In particular, the GKMC and GKRMC algorithms were presented. Simulation results on frequency doubling, nonlinear system identification and noise cancellation were presented to illustrate the benefits of the new approach.

References

[1] G.R. Arce, Nonlinear Signal Processing: A Statistical Approach. John Wiley and Sons; 2005.

[2] W.J. Fitzgerald, R.L. Smith, A.T. Walden, P.C. Young, Nonlinear and Nonstationary Signal Processing. Cambridge University Press; 2000.

[3] S. Haykin, L. Li, Nonlinear adaptive prediction of nonstationary signals, IEEE Transactions on Signal Processing 1995;43(2):526–535.

[4] M.S. Arulampalam, S. Maskell, N. Gordon, T. Clapp, A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking, IEEE Transactions on Signal Processing 2002;50(2):174–188.

[5] S.C. Schwartz, J.B. Thomas, E.J. Wegman, Topics in Non-Gaussian Signal Processing. Springer-Verlag; 1989.

[6] R.D. Nowak, Nonlinear system identification, Circuits, Systems, and Signal Processing 2002;21(1):109–122.

[7] Oliver Nelles, Nonlinear System Identification: From Classical Approaches to Neural Networks and Fuzzy Models. Springer Science and Business Media; 2013.

[8] F. Giri, E.W. Bai, Block-Oriented Nonlinear System Identification. London: Springer; 2010.

[9] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing 2006;70(1):489–501.

[10] W. Liu, J.C. Principe, S. Haykin, Kernel Adaptive Filtering: A Comprehensive Introduction. John Wiley and Sons; 2010.

[11] A.H. Sayed, Fundamentals of Adaptive Filtering. John Wiley and Sons; 2003.

[12] S. Haykin, B. Widrow, Least-Mean-Square Adaptive Filters. J. Wiley and Sons; 2003;vol. 31.

[13] Paulo S.R. Diniz, Adaptive Filtering: Algorithms and Practical Implementation. Springer Science and Business Media; 2012.

[14] J.C. Principe, Information Theoretic Learning: Renyi's Entropy and Kernel Perspectives. Springer Publishing Company, Incorporated; 2010.

[15] J.C. Principe, D. Xu, J. Fisher, Information theoretic learning, Unsupervised Adaptive Filtering, vol. 1. 2000:265–319.

[16] K.E. Hild II, D. Erdogmus, K. Torkkola, J.C. Principe, Feature extraction using information-theoretic learning, IEEE Transactions on Pattern Analysis and Machine Intelligence 2006;28(9):1385–1392.

[17] B. Chen, Y. Zhu, J. Hu, J.C. Principe, System Parameter Identification: Information Criteria and Algorithms. Elsevier; 2013.

[18] W. Liu, P.P. Pokharel, J.C. Principe, Correntropy: properties and applications in non-Gaussian signal processing, IEEE Transactions on Signal Processing 2007;55(11):5286–5298.

[19] R. He, B. Hu, W. Zheng, X. Kong, Robust principal component analysis based on maximum correntropy criterion, IEEE Transactions on Image Processing 2011;20(6):1485–1494.

[20] R. He, W. Zheng, B. Hu, Maximum correntropy criterion for robust face recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 2011;33(8):1561–1576.

[21] S. Zhao, B. Chen, J.C. Principe, Kernel adaptive filtering with maximum correntropy criterion, Neural Networks (IJCNN), the 2011 International Joint Conference on. IEEE; 2011.

[22] B. Chen, L. Xing, J. Liang, N. Zheng, J.C. Principe, Steady-state mean-square error analysis for adaptive filtering under the maximum correntropy criterion, IEEE Signal Processing Letters 2014;21(7):880–884.

[23] B. Chen, L. Xing, H. Zhao, N. Zheng, J.C. Principe, Generalized correntropy for robust adaptive filtering, IEEE Transactions on Signal Processing 2016;64(13):3376–3387.

[24] W. Ma, H. Qu, G. Gui, L. Xu, J. Zhao, B. Chen, Maximum correntropy criterion based sparse adaptive filtering algorithms for robust channel estimation under non-Gaussian environments, Journal of the Franklin Institute 2015;352(7):2708–2727.

[25] B. Chen, X. Liu, H. Zhao, J.C. Principe, Maximum correntropy Kalman filter, Automatica 2017;76:70–77.

[26] X. Liu, B. Chen, B. Xu, Z. Wu, P. Honeine, Maximum correntropy unscented filter, International Journal of Systems Science 2017;48(8):1607–1615.

[27] Z. Wu, S. Peng, B. Chen, H. Zhao, Robust Hammerstein adaptive filtering under maximum correntropy criterion, Entropy 2015;17(10):7149–7166.

[28] W. Liu, P.P. Pokharel, J.C. Principe, The kernel least-mean-square algorithm, IEEE Transactions on Signal Processing 2008;56(2):543–554.

[29] Y. Engel, S. Mannor, R. Meir, The kernel recursive least-squares algorithm, IEEE Transactions on Signal Processing 2004;52(8):2275–2285.

[30] C. Richard, J.C.M. Bermudez, P. Honeine, Online prediction of time series data with kernels, IEEE Transactions on Signal Processing Mar. 2009;57(3):1058–1067.

[31] B. Chen, S. Zhao, P. Zhu, J.C. Principe, Quantized kernel least mean square algorithm, IEEE Transactions on Neural Networks and Learning Systems Jan. 2012;23(1):22–32.

[32] B. Chen, S. Zhao, P. Zhu, J.C. Principe, Quantized kernel recursive least squares algorithm, IEEE Transactions on Neural Networks and Learning Systems 2013;24(9):1484–1491.

[33] S. Zhao, B. Chen, P. Zhu, J.C. Principe, Fixed budget quantized kernel least-mean-square algorithm, Signal Processing 2013;93(9):2759–2770.

[34] M. Yukawa, Multikernel adaptive filtering, IEEE Transactions on Signal Processing 2012;60(9):4672–4682.

[35] W. Liu, I. Park, Y. Wang, J.C. Principe, Extended kernel recursive least squares algorithm, IEEE Transactions on Signal Processing 2009;57(10):3801–3814.

[36] W. Liu, J.C. Príncipe, Kernel affine projection algorithms, EURASIP Journal on Advances in Signal Processing 2008;2008(1):1–12.

[37] S. Van Vaerenbergh, M. Lazaro-Gredilla, I. Santamaría, Kernel recursive least-squares tracker for time-varying regression, IEEE Transactions on Neural Networks and Learning Systems 2012;23(8):1313–1326.

[38] Z. Wu, J. Shi, X. Zhang, W. Ma, B. Chen, Kernel recursive maximum correntropy, Signal Processing 2015;117:11–16.

[39] P. Zhu, B. Chen, J.C. Principe, Learning nonlinear generative models of time series with a Kalman filter in RKHS, IEEE Transactions on Signal Processing 2014;62:141–155.

[40] W. Liu, I. Park, J.C. Principe, An information theoretic approach of designing sparse kernel adaptive filters, IEEE Transactions on Neural Networks 2009;20(12):1950–1961.

[41] B. Chen, Zejian Yuan, Nanning Zheng, Jose C. Principe, Kernel minimum error entropy algorithm, Neurocomputing 2013;121:160–169.

[42] W. Ma, J. Duan, W. Man, H. Zhao, B. Chen, Robust kernel adaptive filters based on mean p-power error for noisy chaotic time series prediction, Engineering Applications of Artificial Intelligence 2017;58:101–110.

[43] J. Liu, H. Qu, B. Chen, W. Ma, Kernel robust mixed-norm adaptive filtering, Neural Networks (IJCNN), 2014 International Joint Conference on. IEEE; 2014:3021–3024.

[44] H. Qu, W. Ma, J. Zhao, B. Chen, Kernel least mean kurtosis based online chaotic time series prediction, Chinese Physics Letters 2013;30(11), 110505.

[45] B. Chen, J.C. Principe, Maximum correntropy estimation is a smoothed MAP estimation, IEEE Signal Processing Letters 2012;19(8):491–494.

[46] M.N. Syed, P.M. Pardalos, J.C. Principe, On the optimization properties of the correntropic loss function in data analysis, Optimization Letters 2014;3(8):823–839.

[47] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press; 2000.

[48] K.H. Jeong, W. Liu, S. Han, E. Hasanbelliu, J.C. Principe, The correntropy MACE filter, Pattern Recognition 2009;42(5):871–885.

[49] R. Li, W. Liu, J.C. Principe, A unifying criterion for instantaneous blind source separation based on correntropy, Signal Processing 2007;87(8):1872–1881.

[50] J. Xu, J.C. Principe, A pitch detector based on a generalized correlation function, IEEE Transactions on Audio, Speech, and Language Processing 2008;16(8):1420–1432.

[51] I. Park, J.C. Principe, Correntropy based Granger causality, Acoustics, Speech and Signal Processing, 2008, ICASSP 2008, IEEE International Conference on. IEEE; 2008:3605–3608.

[52] B. Chen, L. Xing, B. Xu, H. Zhao, N. Zheng, J.C. Principe, Kernel risk-sensitive loss: definition, properties and application to robust adaptive filtering, IEEE Transactions on Signal Processing 2017;65(11):2888–2901.

[53] B. Chen, L. Xing, X. Wang, J. Qin, N. Zheng, Robust learning with kernel mean p-power error loss, arXiv preprint arXiv:1612.07019; 2016.

[54] Y. He, F. Wang, J. Yang, B. Chen, Kernel adaptive filtering under generalized maximum correntropy criterion, Neural Networks (IJCNN), 2016 International Joint Conference on. IEEE; 2016:1738–1745.

[55] C. Zheng, L. Xing, T. Li, H. Yang, J. Cao, B. Chen, Z. Zhou, L. Zhang, Developing a robust colorectal cancer (CRC) risk predictive model with the big genetic and environment related CRC data, Bioinformatics and Biomedicine (BIBM), 2016 IEEE International Conference on. IEEE; 2016:1885–1893.
