Chapter 5

Maximum Correntropy Criterion–Based Kernel Adaptive Filters

Badong Chen, Xin Wang, Jose C. Principe
Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an, China
University of Florida, Gainesville, FL, United States

Abstract

Kernel adaptive filters (KAFs) are a family of powerful online kernel learning methods that can learn nonlinear and nonstationary systems efficiently. Most of the existing KAFs are obtained by minimizing the well-known mean square error (MSE). The MSE criterion is computationally simple and very easy to implement but may suffer from a lack of robustness to non-Gaussian noises. To deal with the non-Gaussian noises robustly, the optimization criteria must go beyond the second-order framework. Recently, the correntropy, a novel similarity measure that involves all even-order moments, has been shown to be very robust to the presence of heavy-tailed non-Gaussian noises. A generalized correntropy has been proposed in a recent study. Combining the KAFs with the maximum correntropy criterion (MCC) provides a unified and efficient approach to handle nonlinearity, nonstationarity and non-Gaussianity. The major goal of this chapter is to briefly characterize such an approach. The KAFs under the generalized MCC will be investigated, and some simulation results will be presented.

Keywords

Kernel adaptive filters; Correntropy; Generalized correntropy; Maximum correntropy criterion; Robustness

Chapter Points

  •  Kernel adaptive filters (KAFs) are powerful in online nonlinear system modeling.
  •  Correntropy and generalized correntropy are local similarity measures, and the maximum correntropy criterion (MCC) is robust to impulsive non-Gaussian noises.
  •  Combining KAFs and MCC provides an efficient way to deal with nonlinear modeling in the presence of non-Gaussian noises.

5.1 Introduction

The features of nonlinearity, nonstationarity and non-Gaussianity in many practical signals (such as EEG and EMG signals) pose great challenges for signal processing methods [1–5]. To deal with nonlinearities, many nonlinear models have been proposed in the literature. Well-known examples include neural networks (e.g., MLP, RBF), linear-in-the-parameter nonlinear models (e.g., Volterra series, kernel methods, extreme learning machines) and block-oriented nonlinear models (e.g., Wiener, Hammerstein) [6–10]. To handle nonstationarities, a common approach is to build the models in an online manner, where the parameters or structures of the models are updated adaptively to track the changes in a time-varying process. For example, in the adaptive filtering domain, various adaptive algorithms have been developed to learn the models adaptively, such as the least mean square (LMS), recursive least squares (RLS) and affine projection algorithms (APA) and their variants [10–13]. To cope with non-Gaussianity, optimization criteria must go beyond the second-order framework, and statistical measures incorporating higher-order information should be used. In particular, information theoretic measures such as entropy, mutual information and divergences have been shown to be very efficient for extracting the non-Gaussian features of the processed data [14–17]. Recently, the concept of correntropy in information theoretic learning (ITL) has been successfully used for robust learning and signal processing in the presence of heavy-tailed non-Gaussian noises [18–27].

KAFs have recently attracted great attention due to their powerful learning capability for nonlinear and nonstationary system modeling [10,28–39]. The KAF algorithms are a family of online kernel learning methods, which are implemented as linear adaptive filters with the input data transformed into a high-dimensional feature space induced by a Mercer kernel [10]. By combining kernel methods with standard (linear) adaptive filtering algorithms, the KAFs inherit some desirable features such as a universal approximation capability, the absence of local minima during adaptation and reasonable computational requirements. To date, a number of kernel adaptive filtering algorithms have been developed, including the kernel LMS (KLMS) [28], kernel RLS (KRLS) [29], kernel APA (KAPA) [36] and kernel Kalman filter [39]. Generally, a KAF algorithm builds a growing RBF-like network with data-centered hidden nodes. In the literature, there are several approaches to constrain the network size of KAFs [10,31,40].

Most KAFs were developed under the popular mean square error (MSE) criterion, which is computationally simple and mathematically tractable but may suffer severe performance degradation in the presence of non-Gaussian noises. To improve the robustness of KAFs with respect to non-Gaussian noises, many efforts have been made to develop new KAF algorithms based on non-MSE (non-second-order) optimization criteria, such as the MCC [21,38], the minimum error entropy (MEE) criterion [41], the least mean p-power (LMP) criterion [42], the least mean mixed norm (LMMN) criterion [43] and the least mean kurtosis (LMK) criterion [44]. Increasing attention has recently been paid to the MCC, which aims at maximizing the correntropy between the model output and the desired response. Correntropy is a generalized correlation measure in kernel space, closely related to the Parzen estimator of Rényi's entropy [14]; its name comes from the words “correlation” and “entropy”. When used as an optimality criterion, correntropy has some desirable properties: smoothness, robustness and invexity [45,46]. Combining KAFs with the MCC provides an efficient way to handle nonlinearity, nonstationarity and non-Gaussianity in a unified manner.

In this chapter, we give a brief introduction to the MCC-based KAFs. In Section 5.2, we briefly review two basic KAF algorithms. In Section 5.3, we introduce the concepts of correntropy and MCC. In Section 5.4, we present two KAF algorithms under the generalized MCC. Experimental results are shown in Section 5.5 and, finally, a conclusion is given in Section 5.6.

5.2 Kernel Adaptive Filters

Kernel methods are a powerful nonparametric modeling tool that transforms the input data into a high-dimensional feature space via a reproducing kernel, such that the inner product operation in the feature space can be computed efficiently through kernel evaluations [47]. Using the linear structure of this feature space to implement a well-established linear adaptive filtering algorithm, one obtains a KAF algorithm [10]. Among the various KAF algorithms, the KLMS and KRLS are perhaps the most popular and widely used [28,29]. The KLMS is an LMS-type stochastic gradient–based learning algorithm which is computationally very simple yet efficient [28]. The KRLS is an RLS-type recursive learning algorithm which can outperform the KLMS in many practical situations but requires more computational resources [10].

5.2.1 KLMS Algorithm

Consider the learning of a nonlinear function $y = f(x)$ based on the sample sequence $\{x_i \in \mathbb{R}^d, y_i \in \mathbb{R}\}$, $i = 1, 2, \ldots$. First, we transform the input data into a feature space $F$ through a nonlinear mapping $\varphi(\cdot)$ induced by a Mercer kernel $\kappa(\cdot,\cdot)$. Then, applying the well-known LMS algorithm [12] to the transformed sample sequence $\{\varphi(x_i), y_i\}$, we have

$$\Omega_i = \Omega_{i-1} - \mu \nabla_{\Omega_{i-1}} e_i^2 = \Omega_{i-1} + \eta e_i \varphi(x_i), \tag{5.1}$$

where $\Omega_i$ denotes the weight vector in the feature space at iteration $i$, $\eta = 2\mu$ is the step size parameter and $e_i$ stands for the prediction error, that is,

$$e_i = y_i - \Omega_{i-1}^T \varphi(x_i). \tag{5.2}$$

Identifying $\varphi(x) = \kappa(\cdot, x)$ and $\Omega_i = f_i$, where $f_i$ denotes the estimated function at iteration $i$, and setting the initial function $f_0 = 0$, we obtain the following update equations for the KLMS [28]:

$$f_0 = 0, \quad e_i = y_i - f_{i-1}(x_i), \quad f_i = f_{i-1} + \eta e_i \kappa(x_i, \cdot). \tag{5.3}$$

Note that, for the reproducing kernel $\kappa(\cdot,\cdot)$, we have $\langle \kappa(\cdot, x), f_i \rangle_F = f_i(x)$. With a radial basis kernel (e.g., the Gaussian kernel), the KLMS creates a growing RBF-like network [10], with the input vectors as the centers and the prediction errors (scaled by $\eta$) as the output weights, as shown in Fig. 5.1.

Figure 5.1 Learning process of KLMS.
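To make the update (5.3) concrete, the following Python/NumPy sketch implements the KLMS recursion with a Gaussian kernel. The function name klms, the parameter names step_size and sigma, and the data layout (inputs as rows of a 2D array) are our own conventions for this illustration, not the authors' reference code.

import numpy as np

def klms(X, y, step_size=0.2, sigma=1.0):
    """KLMS, Eq. (5.3): one new center per sample, output weight = eta * error."""
    centers, weights, errors = [], [], []
    for i in range(len(y)):
        if centers:
            C = np.asarray(centers)                       # existing centers x_1..x_{i-1}
            k = np.exp(-np.sum((C - X[i]) ** 2, axis=1) / (2 * sigma ** 2))
            y_hat = float(np.dot(weights, k))             # f_{i-1}(x_i)
        else:
            y_hat = 0.0                                   # f_0 = 0
        e = y[i] - y_hat                                  # prediction error e_i
        centers.append(X[i])                              # x_i becomes a new center
        weights.append(step_size * e)                     # its output weight eta * e_i
        errors.append(e)
    return centers, np.array(weights), np.array(errors)

With a time-embedded input matrix X of shape (n_samples, d), the returned centers and weights define the estimated function as a weighted sum of Gaussian kernels, mirroring the RBF-like network of Fig. 5.1.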

5.2.2 KRLS Algorithm

Next, we present the KRLS algorithm [29]. Consider the following optimization cost function:

$$J(\Omega) = \sum_{j=1}^{i} \zeta^{i-j} e_j^2 + \zeta^i \gamma \|\Omega\|^2, \tag{5.4}$$

where $\zeta$ is the forgetting factor and $\gamma$ is the regularization factor. Setting $\partial J(\Omega)/\partial \Omega = 0$, one can derive

$$\Omega = (\Phi_i B_i \Phi_i^T + \gamma \zeta^i I)^{-1} \Phi_i B_i \mathbf{y}_i, \tag{5.5}$$

where $\Phi_i = [\varphi_1, \ldots, \varphi_i]$ with $\varphi_i = \varphi(x_i)$, $\mathbf{y}_i = [y_1, \ldots, y_i]^T$, $I$ is an identity matrix and $B_i = \mathrm{diag}\{\zeta^{i-1}, \zeta^{i-2}, \ldots, 1\}$ is a diagonal matrix. Using the matrix inversion lemma [10], we have

$$\Omega = \Phi_i (\Phi_i^T \Phi_i + \gamma \zeta^i B_i^{-1})^{-1} \mathbf{y}_i. \tag{5.6}$$

Let $\mathbf{a}_i = (\Phi_i^T \Phi_i + \gamma \zeta^i B_i^{-1})^{-1} \mathbf{y}_i$ be a coefficient vector. Then the weight vector $\Omega$ can be expressed explicitly as a linear combination of the transformed data, that is, $\Omega = \sum_{j=1}^{i} (\mathbf{a}_i)_j \varphi_j$, where $(\mathbf{a}_i)_j$ denotes the $j$th component of $\mathbf{a}_i$. Denote $Q_i = (\Phi_i^T \Phi_i + \gamma \zeta^i B_i^{-1})^{-1}$. We have

$$Q_i^{-1} = \begin{bmatrix} Q_{i-1}^{-1} & \mathbf{h}_i \\ \mathbf{h}_i^T & \varphi_i^T \varphi_i + \gamma \zeta^i \end{bmatrix}, \tag{5.7}$$

where $\mathbf{h}_i = \Phi_{i-1}^T \varphi_i$. Using the block matrix inversion identity [10], we have the following update rule for $Q_i$:

$$Q_i = r_i^{-1} \begin{bmatrix} Q_{i-1} r_i + \mathbf{z}_i \mathbf{z}_i^T & -\mathbf{z}_i \\ -\mathbf{z}_i^T & 1 \end{bmatrix}, \tag{5.8}$$

where $\mathbf{z}_i = Q_{i-1} \mathbf{h}_i$ and $r_i = \gamma \zeta^i + \varphi_i^T \varphi_i - \mathbf{z}_i^T \mathbf{h}_i$. Therefore, the coefficient vector $\mathbf{a}_i$ can be updated as

$$\mathbf{a}_i = \begin{bmatrix} \mathbf{a}_{i-1} - \mathbf{z}_i r_i^{-1} e_i \\ r_i^{-1} e_i \end{bmatrix}. \tag{5.9}$$

From the above derivation, the KRLS algorithm can be briefly described in Algorithm 1.

Algorithm 1 KRLS Algorithm.
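Since the pseudocode of Algorithm 1 appears only as a figure in the source, the Python/NumPy sketch below illustrates the recursion (5.7)–(5.9) with a Gaussian kernel. The function name krls and the parameters forget, reg and sigma are our own naming; treat it as an illustrative sketch rather than a reference implementation.

import numpy as np

def krls(X, y, forget=1.0, reg=1e-3, sigma=1.0):
    """KRLS recursion, Eqs. (5.7)-(5.9), with a Gaussian kernel (phi_i^T phi_i = 1)."""
    ker = lambda A, b: np.exp(-np.sum((np.atleast_2d(A) - b) ** 2, axis=1) / (2 * sigma ** 2))
    centers = [X[0]]
    Q = np.array([[1.0 / (1.0 + reg * forget)]])      # (phi_1^T phi_1 + gamma*zeta)^(-1)
    a = Q @ np.array([y[0]])                          # initial coefficient vector
    for i in range(1, len(y)):
        h = ker(np.asarray(centers), X[i])            # h_i = Phi_{i-1}^T phi_i
        e = y[i] - h @ a                              # prediction error
        z = Q @ h                                     # z_i = Q_{i-1} h_i
        r = reg * forget ** i + 1.0 - z @ h           # r_i = gamma*zeta^i + phi_i^T phi_i - z_i^T h_i
        Q = np.block([[Q * r + np.outer(z, z), -z[:, None]],
                      [-z[None, :], np.array([[1.0]])]]) / r   # Eq. (5.8)
        a = np.concatenate([a - z * (e / r), [e / r]])          # Eq. (5.9)
        centers.append(X[i])
    return centers, a

The prediction at a new input x is simply the kernel vector between x and the stored centers dotted with the coefficient vector a.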

Similar to the KLMS, the KRLS also produces a growing RBF-like network. In order to reduce the computational cost and memory requirements of KAFs, several approaches have been proposed to curb the network growth [10]. In particular, a simple online quantization approach was proposed in [31]. The basic idea of the quantization approach is to represent the input data with a smaller data set by quantizing each new input to the nearest code vector in a quantization codebook (or dictionary).

5.3 Maximum Correntropy Criterion

5.3.1 Correntropy

First, we give some background on information theoretic learning (ITL) [14,15]. There is a close relationship between kernel methods and ITL. Most ITL cost functions, when estimated using the Parzen kernel estimator, can be expressed in terms of inner products in the kernel space $F$. For example, the Parzen estimator of the quadratic information potential (QIP) from samples $x_1, x_2, \ldots, x_N \in \mathbb{R}$ can be obtained as [14]

$$\mathrm{QIP}(X) = \frac{1}{N^2 \sqrt{2\pi}\,\sigma} \sum_{i=1}^{N} \sum_{j=1}^{N} \kappa_\sigma(x_i - x_j), \tag{5.10}$$

where $\kappa_\sigma(\cdot)$ denotes the translation-invariant Gaussian kernel with bandwidth $\sigma$, given by

$$\kappa_\sigma(x - y) = \exp\!\left(-\frac{(x - y)^2}{2\sigma^2}\right). \tag{5.11}$$

Since $\kappa_\sigma(x, y) = \langle \varphi(x), \varphi(y) \rangle_F$, we have

$$\mathrm{QIP}(X) = \frac{1}{\sqrt{2\pi}\,\sigma} \left\| \frac{1}{N} \sum_{i=1}^{N} \varphi(x_i) \right\|_F^2 = \frac{1}{\sqrt{2\pi}\,\sigma} \left\langle \frac{1}{N} \sum_{i=1}^{N} \varphi(x_i),\; \frac{1}{N} \sum_{j=1}^{N} \varphi(x_j) \right\rangle_F, \tag{5.12}$$

where $\|\cdot\|_F$ stands for the norm in $F$. Thus, the estimated QIP represents the squared norm of the mean of the transformed data in the kernel space.

The intrinsic link between ITL and kernel methods has motivated researchers to define new similarity measures in kernel space. As a local similarity measure in ITL, the correntropy between two random variables X and Y is defined [18] by

$$V(X, Y) = \mathbf{E}[\kappa_\sigma(X - Y)] = \int \kappa_\sigma(x - y)\, dF_{XY}(x, y), \tag{5.13}$$

where the expectation is taken over the joint distribution $F_{XY}(x, y)$. Correntropy can easily be estimated from data as

$$\hat{V}(X, Y) = \frac{1}{N} \sum_{i=1}^{N} \kappa_\sigma(x_i - y_i). \tag{5.14}$$

Correntropy can be expressed in terms of an inner product as

$$V(X, Y) = \mathbf{E}\big[\langle \varphi(X), \varphi(Y) \rangle_F\big], \tag{5.15}$$

which shows that correntropy is a generalized correlation measure in kernel space; it is itself positive definite, i.e., it defines a new kernel space for inference. In addition, by a simple Taylor series expansion of the kernel, one can see that correntropy provides a number that is the sum of all the even-order statistical moments expressed by the kernel. In many applications this sum may be sufficient to quantify the relationships of interest better than correlation, and it is much simpler to estimate than the higher-order statistical moments. Therefore, it can be considered a new type of statistical descriptor. To date, correntropy has been successfully applied to develop new adaptive filters for non-Gaussian noises [21–24], a correntropy MACE filter for pattern matching [48], robust PCA methods [19], new ICA algorithms [49], pitch detection [50] and nonlinear Granger causality analysis [51], with very good results. It is worth noting that many other similarity measures can be defined in kernel space, such as the centered correntropy [14], the correntropy coefficient [14], the kernel risk-sensitive loss [52] and the kernel mean p-power error [53].

Based on the correntropy, one can define the correntropic loss (C-Loss) [52] as

$$C_L(X, Y) = 1 - \mathbf{E}[\kappa_\sigma(X - Y)], \tag{5.16}$$

which essentially is the MSE in kernel space since we have

$$C_L(X, Y) = \frac{1}{2} \mathbf{E}\big[\|\varphi(X) - \varphi(Y)\|_F^2\big]. \tag{5.17}$$

The empirical C-Loss can be calculated by

$$\hat{C}_L(\mathbf{X}, \mathbf{Y}) = 1 - \frac{1}{N} \sum_{i=1}^{N} \kappa_\sigma(x_i - y_i), \tag{5.18}$$

which is a function of the sample vectors $\mathbf{X} = [x_1, x_2, \ldots, x_N]^T$ and $\mathbf{Y} = [y_1, y_2, \ldots, y_N]^T$. When no confusion can arise, we simply denote the empirical C-Loss by $\hat{C}_L(\mathbf{X}, \mathbf{Y})$. The square root of $\hat{C}_L(\mathbf{X}, \mathbf{Y})$ is also called the correntropy induced metric (CIM) between $\mathbf{X}$ and $\mathbf{Y}$, since it defines a metric in the sample space [18].
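As a quick illustration of (5.14), (5.18) and the CIM, the snippet below (our own helper functions correntropy, closs and cim, with made-up example data) computes the empirical correntropy, C-Loss and CIM of two sample vectors with a Gaussian kernel.

import numpy as np

def correntropy(x, y, sigma=1.0):
    """Empirical correntropy, Eq. (5.14)."""
    e = np.asarray(x) - np.asarray(y)
    return np.mean(np.exp(-e ** 2 / (2 * sigma ** 2)))

def closs(x, y, sigma=1.0):
    """Empirical correntropic loss, Eq. (5.18)."""
    return 1.0 - correntropy(x, y, sigma)

def cim(x, y, sigma=1.0):
    """Correntropy induced metric: square root of the empirical C-Loss."""
    return np.sqrt(closs(x, y, sigma))

x = np.array([0.0, 0.1, 0.2, 5.0])    # the last sample acts as a large outlier
y = np.zeros(4)
print(cim(x, y, sigma=1.0))           # the outlier contributes at most 1/N to the C-Loss
print(np.linalg.norm(x - y))          # the L2 distance is dominated by the outlier

The comparison with the L2 norm already hints at the saturation (rectification) behavior discussed next: a single large error can dominate the Euclidean distance but adds at most 1/N to the empirical C-Loss.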

Fig. 5.2 shows the contours of the distance from $e = \mathbf{X} - \mathbf{Y}$ to the origin in a two-dimensional space. The interesting observation from the figure is as follows. When two points are close, the CIM behaves like an $L_2$ norm (which is clear from the Taylor expansion of the kernel function); we call this area the Euclidean zone. Outside the Euclidean zone the CIM behaves like an $L_1$ norm; this region is named the transition zone. Eventually, in the rectification zone, as two points move further apart, the metric becomes much less sensitive to the change of the distance and saturates. Specifically, the CIM can be used in at least two respects for robust signal processing: (1) the relative insensitivity to very large values and the $L_1$ equivalence in the transition zone enable its use as an outlier-robust error measure, and (2) its concave contours for large-norm vectors in the rectification zone can be used to impose sparsity on adaptive system coefficients, which is typically done using an $L_p$ ($p \le 1$) penalization of the weight-vector norm. In both cases, the CIM has the useful property that for small-signal/weight situations the traditional quadratic measures are approximated, while for large-signal/weight conditions it makes a smooth but sharp transition to the desirable and robust $L_p$ ($p \le 1$) penalization. The kernel size controls the speed of this transition and can be determined to best fit the data through the kernel density estimation connection. It is worth noting that there is a single free parameter in correntropy, the kernel bandwidth, which controls the behavior of the CIM and the MCC. Intuitively, if the data size is large and the problem dimension is low, a small kernel size can be used so that the MCC searches with high precision (small bias) for the maximum of the error PDF. However, if the data size is small and/or the problem dimension is large, the kernel size has to be chosen so that the bulk of the data lies inside the bandwidth to ensure efficient data usage, while the outliers stay outside to enable effective attenuation (small variance).

Figure 5.2 Contours of CIM(X,0) in 2D vector space (kernel size is set to 1.0).

The kernel function in correntropy can easily be extended to the generalized Gaussian function and the new measure is called the generalized correntropy [23]. Specifically, the generalized correntropy between random variables X and Y is defined by

$$V_{\alpha,\beta}(X, Y) = \mathbf{E}[G_{\alpha,\beta}(X - Y)], \tag{5.19}$$

with $G_{\alpha,\beta}(\cdot)$ being the generalized Gaussian density (GGD) function [23]:

$$G_{\alpha,\beta}(x) = \frac{\alpha}{2\beta\,\Gamma(1/\alpha)} \exp\!\left(-\left|\frac{x}{\beta}\right|^\alpha\right) = \gamma_{\alpha,\beta} \exp(-\lambda |x|^\alpha), \tag{5.20}$$

where $\Gamma(\cdot)$ denotes the gamma function, $\alpha > 0$ is the shape parameter, $\beta > 0$ is the scale parameter, $\lambda = 1/\beta^\alpha$ is the kernel parameter and $\gamma_{\alpha,\beta} = \alpha/(2\beta\,\Gamma(1/\alpha))$ is the normalization constant. Obviously, when $\alpha = 2$, the generalized correntropy is equivalent to the original correntropy. Note that the kernel function in the generalized correntropy is a Mercer kernel only when $0 < \alpha \le 2$.
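For later reference, the GGD kernel of Eq. (5.20) can be evaluated as in the small helper below (the name ggd_kernel and its argument conventions are ours).

import numpy as np
from scipy.special import gamma

def ggd_kernel(x, alpha=2.0, beta=1.0):
    """Generalized Gaussian density G_{alpha,beta}(x), Eq. (5.20)."""
    norm_const = alpha / (2.0 * beta * gamma(1.0 / alpha))   # gamma_{alpha,beta}
    lam = 1.0 / beta ** alpha                                # lambda = 1 / beta^alpha
    return norm_const * np.exp(-lam * np.abs(x) ** alpha)

With alpha = 2 and beta = sqrt(2)*sigma this reduces to the Gaussian density with standard deviation sigma, recovering the original correntropy kernel up to normalization.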

Similarly, one can define the generalized C-Loss (GC-Loss) and the generalized CIM (GCIM). The GCIM also behaves like different norms (from $L_\alpha$ to $L_0$) of the data in different regions (see [23] for more details).

5.3.2 Maximum Correntropy Criterion

The correntropy and generalized correntropy can be used as an optimization cost in estimation-related problems. Given two random variables, $X \in \mathbb{R}$, an unknown real-valued parameter to be estimated, and $Y \in \mathbb{R}^m$, a vector of observations (measurements), an estimator of $X$ can be defined as a function of $Y$: $\hat{X} = g(Y)$, where $g$ is solved by optimizing a certain cost function. Under the MCC, the estimator $g$ is solved by maximizing the correntropy (or minimizing the C-Loss) between $X$ and $\hat{X}$, that is [45],

$$g_{\mathrm{MCC}} = \arg\max_{g \in G} V(X, \hat{X}) = \arg\max_{g \in G} \mathbf{E}[\kappa_\sigma(X - g(Y))], \tag{5.21}$$

where $G$ stands for the collection of all measurable functions of $Y$. If, for every $y \in \mathbb{R}^m$, $X$ has a conditional PDF $p_{X|Y}(x|y)$, then the estimation error $e = X - \hat{X}$ has PDF

$$p_e(\varepsilon) = \int_{\mathbb{R}^m} p_{X|Y}(\varepsilon + g(y)\,|\,y)\, dF_Y(y), \tag{5.22}$$

where $F_Y(y)$ is the distribution function of $Y$. It follows that

$$g_{\mathrm{MCC}} = \arg\max_{g \in G} \int_{\mathbb{R}} \kappa_\sigma(\varepsilon) \int_{\mathbb{R}^m} p_{X|Y}(\varepsilon + g(y)\,|\,y)\, dF_Y(y)\, d\varepsilon. \tag{5.23}$$

In [45], it has been proven that the MCC estimate is a smoothed maximum a posteriori probability (MAP) estimate, which equals the mode of the smoothed a posteriori PDF (the point at which that PDF attains its maximum value). Similar results hold for the generalized MCC (GMCC) estimation.

Theorem 1

The GMCC estimator can be expressed as

$$g_{\mathrm{GMCC}}(y) = \arg\max_{x \in \mathbb{R}} \rho_{\alpha,\beta}(x|y), \quad \forall y \in \mathbb{R}^m, \tag{5.24}$$

where $\rho_{\alpha,\beta}(x|y) = G_{\alpha,\beta}(x) * p_{X|Y}(x|y)$ (“$*$” denotes the convolution operator with respect to $x$).

Proof

It is easy to derive

$$\begin{aligned}
V_{\alpha,\beta}(X, \hat{X}) &= \int_{\mathbb{R}} G_{\alpha,\beta}(\varepsilon) \int_{\mathbb{R}^m} p_{X|Y}(\varepsilon + g(y)\,|\,y)\, dF_Y(y)\, d\varepsilon \\
&= \int_{\mathbb{R}^m} \left\{ \int_{\mathbb{R}} G_{\alpha,\beta}(\varepsilon)\, p_{X|Y}(\varepsilon + g(y)\,|\,y)\, d\varepsilon \right\} dF_Y(y) \\
&= \int_{\mathbb{R}^m} \left\{ \int_{\mathbb{R}} G_{\alpha,\beta}(\varepsilon' - g(y))\, p_{X|Y}(\varepsilon'\,|\,y)\, d\varepsilon' \right\} dF_Y(y) \\
&\overset{(a)}{=} \int_{\mathbb{R}^m} \left\{ \int_{\mathbb{R}} G_{\alpha,\beta}(g(y) - \varepsilon')\, p_{X|Y}(\varepsilon'\,|\,y)\, d\varepsilon' \right\} dF_Y(y) \\
&= \int_{\mathbb{R}^m} \left\{ \big(G_{\alpha,\beta}(\cdot) * p_{X|Y}(\cdot\,|\,y)\big)(g(y)) \right\} dF_Y(y) \\
&= \int_{\mathbb{R}^m} \rho_{\alpha,\beta}(g(y)\,|\,y)\, dF_Y(y), \tag{5.25}
\end{aligned}$$

where $\varepsilon' = \varepsilon + g(y)$ and (a) comes from the symmetry of $G_{\alpha,\beta}(\cdot)$. Thus

$$g_{\mathrm{GMCC}} = \arg\max_{g \in G} \int_{\mathbb{R}^m} \rho_{\alpha,\beta}(g(y)\,|\,y)\, dF_Y(y) \;\Longrightarrow\; g_{\mathrm{GMCC}}(y) = \arg\max_{x \in \mathbb{R}} \rho_{\alpha,\beta}(x|y), \quad \forall y \in \mathbb{R}^m. \tag{5.26}$$

This completes the proof. 

Remark

Considering the conditional PDF $p_{X|Y}(x|y)$ as an a posteriori PDF given the observation $y$, the function $\rho_{\alpha,\beta}(x|y)$ is a smoothed (by convolution) a posteriori PDF. Therefore, the GMCC estimate is a smoothed MAP estimate.

Corollary 1

When $\beta \to 0^+$ (or $\lambda \to \infty$), the GMCC estimation becomes the MAP estimation.

Proof

As $\beta \to 0^+$, the GGD function $G_{\alpha,\beta}(\cdot)$ approaches the Dirac delta function and the smoothed conditional PDF $\rho_{\alpha,\beta}(x|y)$ reduces to the original conditional PDF $p_{X|Y}(x|y)$. In this case, the GMCC estimation is equivalent to the MAP estimation.

Theorem 2

The GMCC estimator can also be expressed as

$$g_{\mathrm{GMCC}} = \arg\max_{g \in G} p_e^{\alpha,\beta}(0), \tag{5.27}$$

where $p_e^{\alpha,\beta}(x) = p_e(x) * G_{\alpha,\beta}(x)$ is the smoothed error PDF.

Proof

It is easy to derive

$$\mathbf{E}[G_{\alpha,\beta}(e)] = \int_{\mathbb{R}} G_{\alpha,\beta}(\varepsilon)\, p_e(\varepsilon)\, d\varepsilon = \left( \int_{\mathbb{R}} G_{\alpha,\beta}(x - \varepsilon)\, p_e(\varepsilon)\, d\varepsilon \right)_{x=0} = \big(p_e(x) * G_{\alpha,\beta}(x)\big)_{x=0} = p_e^{\alpha,\beta}(0). \tag{5.28}$$

Hence

$$g_{\mathrm{GMCC}} = \arg\max_{g \in G} \mathbf{E}[G_{\alpha,\beta}(e)] = \arg\max_{g \in G} p_e^{\alpha,\beta}(0). \tag{5.29}$$

Theorem 3

When $\beta \to \infty$ (or $\lambda \to 0^+$), the GMCC estimation will be equivalent to the LMP estimation with $p = \alpha$.

Proof

Under the LMP criterion, the optimal estimator is solved by minimizing the mean p-power of the error:

$$g_{\mathrm{LMP}} = \arg\min_{g \in G} \mathbf{E}[|e|^p]. \tag{5.30}$$

On the other hand, we have $V_{\alpha,\beta}(X, \hat{X}) \approx \gamma_{\alpha,\beta}\big(1 - \lambda \mathbf{E}[|e|^\alpha]\big)$ as $\lambda \to 0^+$. It follows that

$$\max_{g} V_{\alpha,\beta}(X, \hat{X}) \;\Longleftrightarrow\; \min_{g} \mathbf{E}[|e|^\alpha], \tag{5.31}$$

that is, as $\lambda \to 0^+$, the GMCC estimation becomes, approximately, equivalent to the LMP estimation with $p = \alpha$. 

5.4 Kernel Adaptive Filters Under Generalized MCC

Most of the KAFs, such as the KLMS and KRLS, have been derived under the well-known MSE criterion. The MSE is computationally simple and mathematically tractable but may perform very poorly under non-Gaussian noise (especially impulsive noise) disturbances. An efficient approach to improve the robustness to non-Gaussian noises is to use MCC-based costs [21–24]. In the following, we present two KAF algorithms under the generalized MCC [54,55]. Of course, one can derive many other KAFs under the GMCC.

5.4.1 Generalized Kernel Maximum Correntropy

In a way very similar to the derivation of the KLMS, applying the GMCC cost to the transformed sample sequence $\{\varphi(x_i), y_i\}$, one can derive

$$\begin{aligned}
\Omega_i &= \Omega_{i-1} + \mu \frac{\partial}{\partial \Omega_{i-1}} \exp(-\lambda |e_i|^\alpha) \\
&= \Omega_{i-1} + \mu \lambda \alpha \exp(-\lambda |e_i|^\alpha)\, |e_i|^{\alpha-1} \mathrm{sign}(e_i)\, \varphi(x_i) \\
&= \Omega_{i-1} + \eta \exp(-\lambda |e_i|^\alpha)\, |e_i|^{\alpha-1} \mathrm{sign}(e_i)\, \varphi(x_i), \tag{5.32}
\end{aligned}$$

where the step size $\eta = \mu \lambda \alpha$. The above algorithm is called the generalized kernel maximum correntropy (GKMC) algorithm [54]. From (5.32), we have the following observations:

1) When $\alpha = 2$, the GKMC becomes

$$\Omega_i = \Omega_{i-1} + \eta \exp(-\lambda e_i^2)\, e_i\, \varphi(x_i), \tag{5.33}$$

which is the kernel maximum correntropy (KMC) algorithm developed in [21].

2) When $\lambda \to 0^+$, the GKMC becomes

$$\Omega_i = \Omega_{i-1} + \eta\, |e_i|^{\alpha-1} \mathrm{sign}(e_i)\, \varphi(x_i), \tag{5.34}$$

which is the kernel LMP (KLMP) algorithm developed in [42].

3) When $\alpha = 2$ and $\lambda \to 0^+$, the GKMC reduces to the KLMS algorithm

$$\Omega_i = \Omega_{i-1} + \eta\, e_i\, \varphi(x_i). \tag{5.35}$$

In addition, one can rewrite (5.32) as

$$\Omega_i = \Omega_{i-1} + \eta_i\, e_i\, \varphi(x_i), \tag{5.36}$$

where $\eta_i = \eta \exp(-\lambda |e_i|^\alpha)\, |e_i|^{\alpha-2}$ is a variable step size (VSS). Thus, the GKMC can be viewed as a KLMS with VSS $\eta_i$. Since $\eta_i \to 0$ as $|e_i| \to \infty$, the learning rate of the GKMC approaches zero when a large error occurs. This property ensures the robustness of the GKMC with respect to large outliers.

In summary, the GKMC algorithm is given in Algorithm 2, where $\mathbf{a}_i$ denotes the coefficient vector at iteration $i$, $(\mathbf{a}_i)_j$ is the $j$th component of $\mathbf{a}_i$ and $\mathcal{C}_i$ is the center set (or dictionary).

Algorithm 2 GKMC Algorithm.
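Because Algorithm 2 is reproduced only as a figure, the following Python/NumPy sketch illustrates the GKMC update (5.32), i.e., a KLMS with the error-dependent step size of (5.36). The function name gkmc and the default parameter values are our own choices.

import numpy as np

def gkmc(X, y, eta=0.5, lam=1.0, alpha=2.0, sigma=1.0):
    """GKMC, Eq. (5.32): KLMS whose new-center weight is scaled by exp(-lam*|e|^alpha)*|e|^(alpha-1)."""
    centers, weights = [], []
    for i in range(len(y)):
        if centers:
            C = np.asarray(centers)
            k = np.exp(-np.sum((C - X[i]) ** 2, axis=1) / (2 * sigma ** 2))
            y_hat = float(np.dot(weights, k))
        else:
            y_hat = 0.0
        e = y[i] - y_hat
        # new output weight: eta * exp(-lam*|e_i|^alpha) * |e_i|^(alpha-1) * sign(e_i), cf. Eq. (5.32)
        w_new = eta * np.exp(-lam * abs(e) ** alpha) * abs(e) ** (alpha - 1) * np.sign(e)
        centers.append(X[i])
        weights.append(w_new)
    return centers, np.array(weights)

Setting alpha = 2 recovers the KMC update (5.33), and letting lam approach zero recovers the KLMP update (5.34), as noted above.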

The network size of the GKMC grows linearly with the number of input data. One can use a simple quantization approach to curb the network growth [31]. In this way, the weight update Eq. (5.32) becomes

$$\Omega_i = \Omega_{i-1} + \eta \exp(-\lambda |e_i|^\alpha)\, |e_i|^{\alpha-1} \mathrm{sign}(e_i)\, \varphi(Q[x_i]), \tag{5.37}$$

where $Q[\cdot]$ denotes the quantization operator [31]:

$$Q[x_i] = \begin{cases} (\mathcal{C}_{i-1})_{j^*}, & \mathrm{dis}(x_i, \mathcal{C}_{i-1}) \le \varepsilon, \\ x_i, & \mathrm{dis}(x_i, \mathcal{C}_{i-1}) > \varepsilon, \end{cases} \tag{5.38}$$

with $\mathcal{C}_{i-1}$ being the codebook at iteration $i-1$, $\varepsilon$ the quantization size and $\mathrm{dis}(x_i, \mathcal{C}_{i-1})$ the distance between $x_i$ and $\mathcal{C}_{i-1}$. We have

$$\mathrm{dis}(x_i, \mathcal{C}_{i-1}) = \min_{1 \le j \le \mathrm{size}(\mathcal{C}_{i-1})} \| x_i - (\mathcal{C}_{i-1})_j \|, \tag{5.39}$$

where $(\mathcal{C}_{i-1})_j$ denotes the $j$th element (code vector) of the codebook $\mathcal{C}_{i-1}$ and $j^*$ is the index of the closest center:

$$j^* = \arg\min_{1 \le j \le \mathrm{size}(\mathcal{C}_{i-1})} \| x_i - (\mathcal{C}_{i-1})_j \|. \tag{5.40}$$

The quantized GKMC (QGKMC) algorithm is summarized in Algorithm 3.

Algorithm 3 QGKMC Algorithm.
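As Algorithm 3 is likewise given only as a figure, the vector quantization step (5.38)–(5.40) can be sketched as below; the helper name quantize and the way its result is consumed are our own simplified reading of [31].

import numpy as np

def quantize(x, codebook, eps):
    """Return (index, is_new): nearest code vector if within eps (Eq. (5.38)), else flag a new center."""
    if codebook:
        d = np.linalg.norm(np.asarray(codebook) - x, axis=1)   # distances, Eq. (5.39)
        j_star = int(np.argmin(d))                             # closest center, Eq. (5.40)
        if d[j_star] <= eps:
            return j_star, False
    return len(codebook), True                                 # x_i becomes a new code vector

In the QGKMC, when is_new is False the increment eta*exp(-lam*|e|^alpha)*|e|^(alpha-1)*sign(e) is added to the output weight of the existing center j_star; otherwise x_i is appended to the codebook with that increment as its weight, so the network only grows when a genuinely novel input region is visited.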

5.4.2 Generalized Kernel Recursive Maximum Correntropy

Next, we derive the generalized kernel recursive maximum correntropy (GKRMC) algorithm [55]. A weighted and regularized cost function under GMCC is proposed as

$$J(\Omega) = \sum_{j=1}^{i} \zeta^{i-j} G_{\alpha,\beta}(y_j - \Omega^T \varphi_j) - \frac{1}{2} \zeta^i \gamma \|\Omega\|^2, \tag{5.41}$$

where $\zeta$ is the forgetting factor, satisfying $0 < \zeta \le 1$, $\gamma$ is the regularization factor and $G_{\alpha,\beta}(y_j - \Omega^T \varphi_j) = \frac{\alpha}{2\beta\,\Gamma(1/\alpha)} \exp\!\left(-\left|\frac{y_j - \Omega^T \varphi_j}{\beta}\right|^\alpha\right)$. Setting $\partial J(\Omega)/\partial \Omega = 0$, one can obtain the solution

$$\Omega_i = (\Phi_i B_i \Phi_i^T + \gamma \zeta^i \beta^\alpha I)^{-1} \Phi_i B_i \mathbf{y}_i, \tag{5.42}$$

where

$$B_i = \mathrm{diag}\{b_1, b_2, \ldots, b_i\}$$

with

$$b_j = \zeta^{i-j} \times |y_j - \Omega^T \varphi_j|^{\alpha-2} \times \frac{\alpha^2}{2\beta\,\Gamma(1/\alpha)} \times \exp\!\left(-\left|\frac{y_j - \Omega^T \varphi_j}{\beta}\right|^\alpha\right), \quad j = 1, 2, \ldots, i.$$

With the matrix inversion lemma, we have

$$(\Phi_i B_i \Phi_i^T + \gamma \zeta^i \beta^\alpha I)^{-1} \Phi_i B_i = \Phi_i (\Phi_i^T \Phi_i + \gamma \zeta^i \beta^\alpha B_i^{-1})^{-1}. \tag{5.43}$$

Thus, (5.42) can be rewritten as

$$\Omega_i = \Phi_i (\Phi_i^T \Phi_i + \gamma \zeta^i \beta^\alpha B_i^{-1})^{-1} \mathbf{y}_i. \tag{5.44}$$

The above weight vector can be expressed as a linear combination of the transformed feature vectors, where the combination coefficients can be calculated recursively. We denote $\Omega_i = \Phi_i \mathbf{a}_i$ with $\mathbf{a}_i = (\Phi_i^T \Phi_i + \gamma \zeta^i \beta^\alpha B_i^{-1})^{-1} \mathbf{y}_i$ being the coefficient vector. In addition, let $Q_i = (\Phi_i^T \Phi_i + \gamma \zeta^i \beta^\alpha B_i^{-1})^{-1}$. Then we have

$$Q_i = \begin{bmatrix} \Phi_{i-1}^T \Phi_{i-1} + \gamma \zeta^i \beta^\alpha B_{i-1}^{-1} & \Phi_{i-1}^T \varphi_i \\ \varphi_i^T \Phi_{i-1} & \varphi_i^T \varphi_i + \gamma \zeta^i \beta^\alpha \theta_i^{-1} \end{bmatrix}^{-1}, \tag{5.45}$$

where $\theta_i = |y_i - \Omega^T \varphi_i|^{\alpha-2} \times \frac{\alpha^2}{2\beta\,\Gamma(1/\alpha)} \times \exp\!\left(-\left|\frac{y_i - \Omega^T \varphi_i}{\beta}\right|^\alpha\right)$. The above result can be converted to

$$Q_i^{-1} = \begin{bmatrix} Q_{i-1}^{-1} & \mathbf{h}_i \\ \mathbf{h}_i^T & \varphi_i^T \varphi_i + \gamma \zeta^i \beta^\alpha \theta_i^{-1} \end{bmatrix}, \tag{5.46}$$

where $\mathbf{h}_i = \Phi_{i-1}^T \varphi_i$. Based on the following block matrix inversion identity,

$$\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} = \begin{bmatrix} (A - B D^{-1} C)^{-1} & -A^{-1} B (D - C A^{-1} B)^{-1} \\ -D^{-1} C (A - B D^{-1} C)^{-1} & (D - C A^{-1} B)^{-1} \end{bmatrix}, \tag{5.47}$$

one can derive

$$Q_i = r_i^{-1} \begin{bmatrix} Q_{i-1} r_i + \mathbf{z}_i \mathbf{z}_i^T & -\mathbf{z}_i \\ -\mathbf{z}_i^T & 1 \end{bmatrix}, \tag{5.48}$$

where $r_i = \gamma \zeta^i \beta^\alpha \theta_i^{-1} + \varphi_i^T \varphi_i - \mathbf{z}_i^T \mathbf{h}_i$ and $\mathbf{z}_i = Q_{i-1} \mathbf{h}_i$. Therefore, the expansion coefficients are

$$\mathbf{a}_i = Q_i \mathbf{y}_i = r_i^{-1} \begin{bmatrix} Q_{i-1} r_i + \mathbf{z}_i \mathbf{z}_i^T & -\mathbf{z}_i \\ -\mathbf{z}_i^T & 1 \end{bmatrix} \begin{bmatrix} \mathbf{y}_{i-1} \\ y_i \end{bmatrix} = \begin{bmatrix} \mathbf{a}_{i-1} - \mathbf{z}_i r_i^{-1} e_i \\ r_i^{-1} e_i \end{bmatrix}, \tag{5.49}$$

where $e_i = y_i - \Omega_{i-1}^T \varphi_i$ is the prediction error. Thus we have obtained the GKRMC algorithm, whose pseudocode is given in Algorithm 4.

Remark

When $\alpha = 2$, the GKRMC algorithm reduces to the kernel recursive maximum correntropy (KRMC) algorithm developed in [38].

Algorithm 4 GKRMC Algorithm.
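With Algorithm 4 available only as a figure, the compact Python/NumPy sketch below traces the GKRMC recursion (5.45)–(5.49) with a Gaussian Mercer kernel. The names gkrmc, forget and reg are ours, the scalar weighting theta follows the reading of (5.45)–(5.46) adopted above, and the underflow guard is our own addition; it is an illustration, not the authors' implementation.

import numpy as np
from scipy.special import gamma

def gkrmc(X, y, alpha=2.0, beta=1.0, forget=1.0, reg=0.1, sigma=1.0):
    """GKRMC recursion, Eqs. (5.45)-(5.49), with a Gaussian Mercer kernel (phi_i^T phi_i = 1)."""
    ker = lambda A, b: np.exp(-np.sum((np.atleast_2d(A) - b) ** 2, axis=1) / (2 * sigma ** 2))
    c_ab = alpha ** 2 / (2.0 * beta * gamma(1.0 / alpha))    # constant appearing in b_j / theta_i
    centers = [X[0]]
    Q = np.array([[1.0 / (1.0 + reg * forget * beta ** alpha)]])
    a = Q @ np.array([y[0]])
    for i in range(1, len(y)):
        h = ker(np.asarray(centers), X[i])                   # h_i
        e = y[i] - h @ a                                     # prediction error e_i
        theta = c_ab * abs(e) ** (alpha - 2) * np.exp(-abs(e / beta) ** alpha)
        theta = max(theta, 1e-12)                            # guard against underflow for huge errors
        z = Q @ h                                            # z_i = Q_{i-1} h_i
        r = reg * forget ** i * beta ** alpha / theta + 1.0 - z @ h   # r_i
        Q = np.block([[Q * r + np.outer(z, z), -z[:, None]],
                      [-z[None, :], np.array([[1.0]])]]) / r # Eq. (5.48)
        a = np.concatenate([a - z * (e / r), [e / r]])       # Eq. (5.49)
        centers.append(X[i])
    return centers, a

Compared with the GKMC sketch, each step costs O(n^2) in the current network size n because of the Q update, which is the price for the faster convergence observed in Section 5.5.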

5.5 Simulation Results

In the following, we present some simulation results to demonstrate the performance of the GKMC and GKRMC algorithms.

5.5.1 Frequency Doubling Problem

First, consider the frequency doubling problem, where the input signal is a sine wave with 20 samples per period. The desired signal is another sine wave with double the frequency, disturbed by an additive noise $v_i$. With a time embedding dimension of 5, the input vector is $\mathbf{x}_i = [x_{i-4}, x_{i-3}, \ldots, x_i]$. In this example, we consider a mixture noise model that combines two independent noise processes, given by

$$v_i = (1 - \delta_i) A_i + \delta_i B_i, \tag{5.50}$$

where $\delta_i$ is a binary variable taking values in $\{0, 1\}$ with probabilities $p(\delta_i = 1) = c$ and $p(\delta_i = 0) = 1 - c$, where $0 \le c \le 1$ controls the occurrence probability of the two noise processes $A_i$ and $B_i$. In general, $A_i$ represents the “background noise” with small variance, while $B_i$ stands for large outliers that occur occasionally with larger variance. In the simulations below, unless otherwise mentioned, we set $c = 0.05$. $B_i$ is assumed to be a white Gaussian process with zero mean and variance 15. The Gaussian kernel is adopted as the Mercer kernel in the kernel adaptive filters and the bandwidth is set to $\sigma = 2.0$. The step sizes are chosen such that all the algorithms achieve almost the same initial convergence speed.
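For reproducibility of this noise setting, a possible generator for the mixture model (5.50) is sketched below; the name impulsive_noise and the use of NumPy's random generator are our own choices, with bg_std = 0.5 corresponding to the Gaussian background with variance 0.25 used in Fig. 5.3(A).

import numpy as np

def impulsive_noise(n, c=0.05, bg_std=0.5, outlier_std=np.sqrt(15.0), rng=None):
    """Mixture noise, Eq. (5.50): background A_i with probability 1-c, outlier B_i with probability c."""
    rng = np.random.default_rng() if rng is None else rng
    delta = rng.random(n) < c                       # delta_i ~ Bernoulli(c)
    A = rng.normal(0.0, bg_std, n)                  # small-variance background noise
    B = rng.normal(0.0, outlier_std, n)             # occasional large outliers (variance 15)
    return np.where(delta, B, A)

For the uniform background case, A can simply be replaced by rng.uniform(-0.5, 0.5, n).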

We choose 4000 noisy samples for training and 200 clean samples for testing (the filters are fixed during testing). The average learning curves of the KLMS and GKMC over 100 Monte Carlo runs in terms of the testing MSE, with different values of $\alpha$ ($\lambda = 2.0$) and different distributions of $A_i$ (Gaussian and uniform), are illustrated in Fig. 5.3, and the corresponding testing MSEs at the final iteration are summarized in Table 5.1. From the simulation results we observe: (i) the GKMC can outperform the KLMS significantly, showing much stronger robustness with respect to impulsive noises; (ii) with different values of $\alpha$, the GKMC achieves different performances, and the best performance may be obtained with $\alpha \neq 2.0$. Note that the GKMC reduces to the KMC algorithm when $\alpha = 2.0$.

Figure 5.3 Average learning curves with different values of α (α = 2, 4, 6) and distributions of Ai: (A) Gaussian distribution with zero mean and variance 0.25; (B) uniform distribution over [−0.5, 0.5].

Table 5.1

Testing MSEs at the final iteration in frequency doubling

         | KLMS            | GKMC α = 2      | GKMC α = 4      | GKMC α = 6
Gaussian | 0.2250 ± 0.1896 | 0.0353 ± 0.0168 | 0.0196 ± 0.0101 | 0.0260 ± 0.0121
Uniform  | 0.1781 ± 0.1842 | 0.0137 ± 0.0057 | 0.0037 ± 0.0016 | 0.0030 ± 0.0022

Further, we compare the performance of the GKMC and QGKMC. We set $\varepsilon = 0.03$, $\alpha = 4$, $\lambda = 2$ and $\eta = 0.3$ and assume that $A_i$ is Gaussian distributed with variance 0.04. The average learning curves over 100 runs are shown in Fig. 5.4, and the network size growth curve of the QGKMC is presented in Fig. 5.5. One can see that both algorithms achieve almost the same convergence performance, but the network size of the QGKMC is significantly curbed (only slightly larger than 30 at steady state).

Figure 5.4 Average learning curves of QGKMC and GKMC.
Figure 5.5 Network size growth curve of QGKMC.

5.5.2 Online Nonlinear System Identification

Next, we show the performance of the GKMC and GKRMC algorithms in online nonlinear system identification. Suppose the nonlinear system is of the form [41]

$$y_i = \big(0.8 - 0.5 \exp(-y_{i-1}^2)\big)\, y_{i-1} - \big(0.3 + 0.9 \exp(-y_{i-1}^2)\big)\, y_{i-2} + 0.1 \sin(3.1415926\, y_{i-1}) + v_i, \tag{5.51}$$

where $v_i$ is a disturbance noise described by (5.50). The average learning curves in terms of the testing MSE over 100 Monte Carlo runs, with different values of $\alpha$ and different distributions of $A_i$ (uniform and alpha-stable), are shown in Fig. 5.6, and the corresponding testing MSEs at the final iteration are listed in Table 5.2. In this simulation, 1000 noisy samples are used for training and 100 noise-free samples are used for testing (the filters are fixed during testing). The parameters of the GKMC and GKRMC are experimentally chosen such that both achieve a desirable performance. It can be clearly seen that the GKRMC outperforms the GKMC, with a faster convergence speed and a lower testing MSE. However, we should note that the GKRMC is computationally more expensive than the GKMC.
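A possible generator for the benchmark system (5.51) used above is sketched below; combined with impulsive_noise it reproduces the training data layout of this subsection. The function name generate_system and the zero initial conditions are our assumptions.

import numpy as np

def generate_system(n, noise=None):
    """Generate n samples from the nonlinear system of Eq. (5.51)."""
    y = np.zeros(n)
    v = np.zeros(n) if noise is None else noise
    for i in range(2, n):
        g = np.exp(-y[i - 1] ** 2)
        y[i] = ((0.8 - 0.5 * g) * y[i - 1]
                - (0.3 + 0.9 * g) * y[i - 2]
                + 0.1 * np.sin(3.1415926 * y[i - 1])
                + v[i])
    return y

One common choice, not spelled out in the chapter, is to use [y_{i-1}, y_{i-2}] as the filter input and y_i as the desired output when identifying this system online.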

Figure 5.6 Average learning curves with different values of α and distributions of Ai: (A) uniform distribution over [−0.5, 0.5]; (B) alpha-stable distribution with shape parameter 0.7 and scale parameter 0.05.

Table 5.2

Testing MSEs at the final iteration in online nonlinear system identification

             | GKMC α = 2      | GKMC α = 4      | GKMC α = 6      | GKRMC α = 2     | GKRMC α = 4     | GKRMC α = 6
Uniform      | 0.0383 ± 0.0250 | 0.0330 ± 0.0145 | 0.0298 ± 0.0132 | 0.0024 ± 0.0009 | 0.0014 ± 0.0007 | 0.0012 ± 0.0006
             | GKMC α = 1      | GKMC α = 2      | GKMC α = 4      | GKRMC α = 1     | GKRMC α = 2     | GKRMC α = 4
Alpha-stable | 0.0152 ± 0.0110 | 0.0286 ± 0.0194 | 0.0907 ± 0.0556 | 0.0007 ± 0.0004 | 0.0011 ± 0.0005 | 0.0017 ± 0.0009

5.5.3 Noise Cancellation

In this part, we consider a noise cancellation problem in signal processing. The basic structure of a noise cancellation system is shown in Fig. 5.7. The primary signal is $s_i$, whose noisy measurement $d_i$ serves as the desired signal of the system. The term $\rho_i$ represents an unknown white noise process and $x_i$ is its reference measurement. The signal $x_i$ can be used as the input of an adaptive filter such that the filter output is an estimate of the noise $\rho_i$. Clearly, the signal-to-noise ratio (SNR) can be improved by subtracting the estimated noise from $d_i$. In the simulation, the noise source is assumed to be uniformly distributed over $[-0.5, 0.5]$. The interference distortion function is

$$x_i = \rho_i - 0.2 x_{i-1} - x_{i-1} \rho_{i-1} + 0.1 \rho_{i-1} + 0.4 x_{i-2}, \tag{5.52}$$

which is nonlinear with an infinite impulse response. It is impossible to recover $\rho_i$ from a finite time delay embedding of $x_i$ alone. Indeed, one can rewrite the distortion function as

$$\rho_i = x_i + 0.2 x_{i-1} - 0.4 x_{i-2} + (x_{i-1} - 0.1) \rho_{i-1}. \tag{5.53}$$

Obviously, the present value of the noise source $\rho_i$ depends not only on the reference noise measurements $[x_i, x_{i-1}, x_{i-2}]$, but also on the past value $\rho_{i-1}$. However, the filter output $y_{i-1}$ can be used as a surrogate for $\rho_{i-1}$. In this case, the input vector of the adaptive filter is $[x_i, x_{i-1}, x_{i-2}, y_{i-1}]$. We assume the primary signal $s_i = 0$ during the training phase, so the system tries to reconstruct the noise source from the reference measurements. We use three nonlinear filters trained with the KLMS, KRLS and GKRMC, respectively. In the simulation, the step size and the Gaussian kernel bandwidth for the KLMS are 2.0 and 0.15, respectively. The regularization parameter for the KRLS and GKRMC is 0.1. The Gaussian kernel bandwidths for the KRLS and GKRMC are 0.17 and 0.1, respectively. The parameter $\beta$ for the GKRMC is set to 0.5. The ensemble learning curves over 100 Monte Carlo runs for the different algorithms are illustrated in Fig. 5.8. In addition, the noise reduction factor (NR), defined by $10 \log_{10}\{ \mathbf{E}[\rho_i^2] / \mathbf{E}[(\rho_i - y_i)^2] \}$, is given in Table 5.3. In this example, the GKRMC achieves the best performance when $\alpha = 6$.
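The distortion recursion (5.52) and the noise reduction factor reported in Table 5.3 can be computed as in the sketch below; the function names distort and noise_reduction_db are our own.

import numpy as np

def distort(rho):
    """Reference measurement x_i generated from the noise source rho_i via Eq. (5.52)."""
    x = np.zeros_like(rho)
    for i in range(2, len(rho)):
        x[i] = (rho[i] - 0.2 * x[i - 1] - x[i - 1] * rho[i - 1]
                + 0.1 * rho[i - 1] + 0.4 * x[i - 2])
    return x

def noise_reduction_db(rho, y):
    """NR = 10*log10(E[rho_i^2] / E[(rho_i - y_i)^2]), with y_i the filter's noise estimate."""
    return 10.0 * np.log10(np.mean(rho ** 2) / np.mean((rho - y) ** 2))

rng = np.random.default_rng(0)
rho = rng.uniform(-0.5, 0.5, 5000)     # uniform noise source over [-0.5, 0.5]
x = distort(rho)                       # reference measurements fed to the adaptive filter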

Figure 5.7 Basic structure of a noise cancellation system.
Figure 5.8 Learning curves of noise cancellation.

Table 5.3

Performance comparison of KLMS, KRLS and GKRMC in noise cancellation

Algorithm | KLMS            | KRLS             | GKRMC α = 2      | GKRMC α = 4      | GKRMC α = 6
NR (dB)   | 5.7356 ± 0.6441 | 11.6057 ± 1.3710 | 27.4908 ± 1.2422 | 30.5206 ± 1.4516 | 33.5343 ± 1.7440

5.6 Conclusion

In this chapter, we characterized a unified and efficient approach to online nonlinear modeling with non-Gaussian noises, which combines the powerful modeling capability of KAFs with the robustness of the MCC to heavy-tailed non-Gaussian noises. We focused mainly on KLMS- and KRLS-type algorithms, but the results can easily be extended to other KAFs. In particular, the GKMC and GKRMC algorithms were presented. Simulation results on frequency doubling, nonlinear system identification and noise cancellation were presented to illustrate the benefits of the new approach.

References

[1] G.R. Arce, Nonlinear Signal Processing: A Statistical Approach. John Wiley and Sons; 2005.

[2] W.J. Fitzgerald, R.L. Smith, A.T. Walden, P.C. Young, Nonlinear and Nonstationary Signal Processing. Cambridge University Press; 2000.

[3] S. Haykin, L. Li, Nonlinear adaptive prediction of nonstationary signals, IEEE Transactions on Signal Processing 1995;43(2):526–535.

[4] M.S. Arulampalam, S. Maskell, N. Gordon, T. Clapp, A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking, IEEE Transactions on Signal Processing 2002;50(2):174–188.

[5] S.C. Schwartz, J.B. Thomas, E.J. Wegman, Topics in Non-Gaussian Signal Processing. Springer-Verlag; 1989.

[6] R.D. Nowak, Nonlinear system identification, Circuits, Systems, and Signal Processing 2002;21(1):109–122.

[7] Oliver Nelles, Nonlinear System Identification: From Classical Approaches to Neural Networks and Fuzzy Models. Springer Science and Business Media; 2013.

[8] F. Giri, E.W. Bai, Block-Oriented Nonlinear System Identification. London: Springer; 2010.

[9] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing 2006;70(1):489–501.

[10] W. Liu, J.C. Principe, S. Haykin, Kernel Adaptive Filtering: A Comprehensive Introduction. John Wiley and Sons; 2010.

[11] A.H. Sayed, Fundamentals of Adaptive Filtering. John Wiley and Sons; 2003.

[12] S. Haykin, B. Widrow, Least-Mean-Square Adaptive Filters. J. Wiley and Sons; 2003;vol. 31.

[13] Paulo S.R. Diniz, Adaptive Filtering: Algorithms and Practical Implementation. Springer Science and Business Media; 2012.

[14] J.C. Principe, Information Theoretic Learning: Renyi's Entropy and Kernel Perspectives. Springer Publishing Company, Incorporated; 2010.

[15] J.C. Principe, D. Xu, J. Fisher, Information theoretic learning, Unsupervised Adaptive Filtering, vol. 1. 2000:265–319.

[16] K.E. Hild II, D. Erdogmus, K. Torkkola, J.C. Principe, Feature extraction using information-theoretic learning, IEEE Transactions on Pattern Analysis and Machine Intelligence 2006;28(9):1385–1392.

[17] B. Chen, Y. Zhu, J. Hu, J.C. Principe, System Parameter Identification: Information Criteria and Algorithms. Elsevier; 2013.

[18] W. Liu, P.P. Pokharel, J.C. Principe, Correntropy: properties and applications in non-Gaussian signal processing, IEEE Transactions on Signal Processing 2007;55(11):5286–5298.

[19] R. He, B. Hu, W. Zheng, X. Kong, Robust principal component analysis based on maximum correntropy criterion, IEEE Transactions on Image Processing 2011;20(6):1485–1494.

[20] R. He, W. Zheng, B. Hu, Maximum correntropy criterion for robust face recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 2011;33(8):1561–1576.

[21] S. Zhao, B. Chen, J.C. Principe, Kernel adaptive filtering with maximum correntropy criterion, Neural Networks (IJCNN), the 2011 International Joint Conference on. IEEE; 2011.

[22] B. Chen, L. Xing, J. Liang, N. Zheng, J.C. Principe, Steady-state mean-square error analysis for adaptive filtering under the maximum correntropy criterion, IEEE Signal Processing Letters 2014;21(7):880–884.

[23] B. Chen, L. Xing, H. Zhao, N. Zheng, J.C. Principe, Generalized correntropy for robust adaptive filtering, IEEE Transactions on Signal Processing 2016;64(13):3376–3387.

[24] W. Ma, H. Qu, G. Gui, L. Xu, J. Zhao, B. Chen, Maximum correntropy criterion based sparse adaptive filtering algorithms for robust channel estimation under non-Gaussian environments, Journal of the Franklin Institute 2015;352(7):2708–2727.

[25] B. Chen, X. Liu, H. Zhao, J.C. Principe, Maximum correntropy Kalman filter, Automatica 2017;76:70–77.

[26] X. Liu, B. Chen, B. Xu, Z. Wu, P. Honeine, Maximum correntropy unscented filter, International Journal of Systems Science 2017;48(8):1607–1615.

[27] Z. Wu, S. Peng, B. Chen, H. Zhao, Robust Hammerstein adaptive filtering under maximum correntropy criterion, Entropy 2015;17(10):7149–7166.

[28] W. Liu, P.P. Pokharel, J.C. Principe, The kernel least-mean-square algorithm, IEEE Transactions on Signal Processing 2008;56(2):543–554.

[29] Y. Engel, S. Mannor, R. Meir, The kernel recursive least-squares algorithm, IEEE Transactions on Signal Processing 2004;52(8):2275–2285.

[30] C. Richard, J.C.M. Bermudez, P. Honeine, Online prediction of time series data with kernels, IEEE Transactions on Signal Processing Mar. 2009;57(3):1058–1067.

[31] B. Chen, S. Zhao, P. Zhu, J.C. Principe, Quantized kernel least mean square algorithm, IEEE Transactions on Neural Networks and Learning Systems Jan. 2012;23(1):22–32.

[32] B. Chen, S. Zhao, P. Zhu, J.C. Principe, Quantized kernel recursive least squares algorithm, IEEE Transactions on Neural Networks and Learning Systems 2013;24(9):1484–1491.

[33] S. Zhao, B. Chen, P. Zhu, J.C. Principe, Fixed budget quantized kernel least-mean-square algorithm, Signal Processing 2013;93(9):2759–2770.

[34] M. Yukawa, Multikernel adaptive filtering, IEEE Transactions on Signal Processing 2012;60(9):4672–4682.

[35] W. Liu, I. Park, Y. Wang, J.C. Principe, Extended kernel recursive least squares algorithm, IEEE Transactions on Signal Processing 2009;57(10):3801–3814.

[36] W. Liu, J.C. Príncipe, Kernel affine projection algorithms, EURASIP Journal on Advances in Signal Processing 2008;2008(1):1–12.

[37] S. Van Vaerenbergh, M. Lazaro-Gredilla, I. Santamaría, Kernel recursive least-squares tracker for time-varying regression, IEEE Transactions on Neural Networks and Learning Systems 2012;23(8):1313–1326.

[38] Z. Wu, J. Shi, X. Zhang, W. Ma, B. Chen, Kernel recursive maximum correntropy, Signal Processing 2015;117:11–16.

[39] P. Zhu, B. Chen, J.C. Principe, Learning nonlinear generative models of time series with a Kalman filter in RKHS, IEEE Transactions on Signal Processing 2014;62:141–155.

[40] W. Liu, I. Park, J.C. Principe, An information theoretic approach of designing sparse kernel adaptive filters, IEEE Transactions on Neural Networks 2009;20(12):1950–1961.

[41] B. Chen, Zejian Yuan, Nanning Zheng, Jose C. Principe, Kernel minimum error entropy algorithm, Neurocomputing 2013;121:160–169.

[42] W. Ma, J. Duan, W. Man, H. Zhao, B. Chen, Robust kernel adaptive filters based on mean p-power error for noisy chaotic time series prediction, Engineering Applications of Artificial Intelligence 2017;58:101–110.

[43] J. Liu, H. Qu, B. Chen, W. Ma, Kernel robust mixed-norm adaptive filtering, Neural Networks (IJCNN), 2014 International Joint Conference on. IEEE; 2014:3021–3024.

[44] H. Qu, W. Ma, J. Zhao, B. Chen, Kernel least mean kurtosis based online chaotic time series prediction, Chinese Physics Letters 2013;30(11), 110505.

[45] B. Chen, J.C. Principe, Maximum correntropy estimation is a smoothed MAP estimation, IEEE Signal Processing Letters 2012;19(8):491–494.

[46] M.N. Syed, P.M. Pardalos, J.C. Principe, On the optimization properties of the correntropic loss function in data analysis, Optimization Letters 2014;3(8):823–839.

[47] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press; 2000.

[48] K.H. Jeong, W. Liu, S. Han, E. Hasanbelliu, J.C. Principe, The correntropy MACE filter, Pattern Recognition 2009;42(5):871–885.

[49] R. Li, W. Liu, J.C. Principe, A unifying criterion for instantaneous blind source separation based on correntropy, Signal Processing 2007;87(8):1872–1881.

[50] J. Xu, J.C. Principe, A pitch detector based on a generalized correlation function, IEEE Transactions on Audio, Speech, and Language Processing 2008;16(8):1420–1432.

[51] I. Park, J.C. Principe, Correntropy based Granger causality, Acoustics, Speech and Signal Processing, 2008, ICASSP 2008, IEEE International Conference on. IEEE; 2008:3605–3608.

[52] B. Chen, L. Xing, B. Xu, H. Zhao, N. Zheng, J.C. Principe, Kernel risk-sensitive loss: definition, properties and application to robust adaptive filtering, IEEE Transactions on Signal Processing 2017;65(11):2888–2901.

[53] B. Chen, L. Xing, X. Wang, J. Qin, N. Zheng, Robust learning with kernel mean p-power error loss, arXiv preprint arXiv:1612.07019; 2016.

[54] Y. He, F. Wang, J. Yang, B. Chen, Kernel adaptive filtering under generalized maximum correntropy criterion, Neural Networks (IJCNN), 2016 International Joint Conference on. IEEE; 2016:1738–1745.

[55] C. Zheng, L. Xing, T. Li, H. Yang, J. Cao, B. Chen, Z. Zhou, L. Zhang, Developing a robust colorectal cancer (CRC) risk predictive model with the big genetic and environment related CRC data, Bioinformatics and Biomedicine (BIBM), 2016 IEEE International Conference on. IEEE; 2016:1885–1893.
