Chapter 12

Adaptive Filters

Vítor H. Nascimento and Magno T.M. Silva,    Department of Electronic Systems Engineering, University of São Paulo, São Paulo, SP, Brazil, [email protected]

Abstract

This chapter provides an introduction to adaptive signal processing, covering basic principles through the most important recent developments. After a brief motivating example, we present an overview of how adaptive filters work, in which we use only deterministic arguments and concepts from basic linear systems theory, followed by a description of a few common applications of adaptive filters. Later, we turn to a more general model for adaptive filters based on stochastic processes and optimum estimation. Then three of the most important adaptive algorithms—LMS, NLMS, and RLS—are derived and analyzed in detail. The chapter closes with a brief description of some important extensions to the basic adaptive filtering algorithms and some promising current research topics. Since adaptive filter theory brings together results from several fields, short reviews of most of the necessary material are provided in separate sections, which the reader may consult if needed.

Keywords

Adaptive signal processing; Digital filters; Parameter estimation; Adaptive equalizers; Blind equalizers; Deconvolution; System identification; Echo cancellers

Acknowledgment

The authors thank Prof. Wallace A. Martins (Federal University of Rio de Janeiro) for his careful reading and comments on the text.

1.12.1 Introduction

Adaptive filters are employed in situations in which the environment is constantly changing, so that a fixed system would not have adequate performance. As they are usually applied in real-time applications, they must be based on algorithms that require a small number of computations per input sample.

These algorithms can be understood in two complementary ways. The most straightforward way follows directly from their name: an adaptive filter uses information from the environment and from the very signal it is processing to change itself, so as to optimally perform its task. The information from the environment may be sensed in real time (in the form of a so-called desired signal), or may be provided a priori, in the form of previous knowledge of the statistical properties of the input signal (as in blind equalization).

On the other hand, we can think of an adaptive filter also as an algorithm to separate a mixture of two signals. The filter must have some information about the signals to be able to separate them; usually this is given in the form of a reference signal, related to only one of the two terms in the mixture. The filter has then two outputs, corresponding to each signal in the mixture (see Figure 12.1). As a byproduct of separating the signals, the reference signal is processed in useful ways. This way of viewing an adaptive filter is very useful, particularly when one is learning the subject.

image

Figure 12.1 Inputs and outputs of an adaptive filter. Left: detailed diagram showing inputs, outputs and internal variable. Right: simplified diagram.

In the following sections we give an introduction to adaptive filters, covering from basic principles to the most important recent developments. Since the adaptive filtering literature is vast, and the topic is still an active area of research, we were not able to treat all interesting topics. Throughout the text, we reference important textbooks, classic papers, and the most promising recent literature. Among the many very good textbooks on the subject, four of the most popular are [1–4].

We start with an application example in the next section, in which we show how an adaptive filter is applied to cancel acoustic echo in hands-free telephony. Throughout the text we return frequently to this example, to illustrate new concepts and methods. The example is completed by an overview of how adaptive filters work, in which we use only deterministic arguments and concepts from basic linear systems theory, in Section 1.12.1.2. This overview shows many of the compromises that arise in adaptive filter design, without the more complicated math. Section 1.12.1.3 gives a small glimpse of the wide variety of applications of adaptive filters, describing a few other examples. Of course, in order to fully understand the behavior of adaptive filters, one must use estimation theory and stochastic processes, which is the subject of Sections 1.12.2 and 1.12.3. The remaining sections are devoted to extensions to the basic algorithms and recent results.

Adaptive filter theory brings together results from several fields: signal processing, matrix analysis, control and systems theory, stochastic processes, and optimization. We tried to provide short reviews for most of the necessary material in separate links (“boxes”), that the reader may follow if needed.

1.12.1.1 Motivation—acoustic echo cancellation

Suppose that you are talking with a person who uses a hands-free telephone inside a car. Let us assume that the telephone does not have any acoustic echo canceler. In this case, the sound of your voice will reach the loudspeaker, propagate inside the car, and suffer reflections every time it encounters the walls of the car. Part of the sound can be absorbed by the walls, but part of it is reflected, resulting in an echo signal. This signal is fed back to the microphone and you will hear your own voice with delay and attenuation, as shown in Figure 12.2.

image

Figure 12.2 Echo in hands-free telephony. Click speech.wav to listen to a speech signal and echo.wav to listen to the corresponding echo signal.

The echo signal depends on the acoustic characteristics of each car and also on the location of the loudspeaker and microphone. Furthermore, the acoustic characteristics can change during a phone call, since the driver may open or close a window, for example. In order to follow the variations of the environment, the acoustic echo must be canceled by an adaptive filter, which performs on-line updating of its parameters through an algorithm. Figure 12.3 shows the insertion of an adaptive filter in the system of Figure 12.2 to cancel the echo signal. In this configuration, the adaptive filter must identify the echo path and provide at its output an estimate of the echo signal. Thus, a synthesized version of the echo is subtracted from the signal picked up by the car microphone. In an ideal scenario, the resulting signal will be free of echo and will contain only the signal of interest. The general set-up of an echo canceler is shown in Figure 12.4. The near-end signal (near with respect to the echo canceler) may be simply noise when the person inside the car is in silence, as considered in Figures 12.2 and 12.3, or it can be a signal of interest, that is, the voice of the user in the car.

image

Figure 12.3 Acoustic echo cancellation using an adaptive filter.

image

Figure 12.4 General echo canceler configuration.

You most likely have already heard the output of an adaptive echo canceler. However, you probably did not pay attention to the initial period, during which, if you listen carefully, you can hear the adaptive filter learning the echo impulse response. We illustrate this through a simulation, using a measured impulse response of 256 samples (sampled at 8 kHz), and an echo signal produced from a recorded voice signal. The measured impulse response is shown in Figure 12.5. In the first example, the near-end signal is low-power noise (more specifically, white Gaussian noise with zero mean and variance image). Listen to e_nlms.wav, and pay attention to the beginning. You will notice that the voice (the echo) fades away while the adaptive filter is learning the echo impulse response. This file is the output of an echo canceler adapted using the normalized least-mean squares algorithm (NLMS), which is described in Section 1.12.3.2.

image

Figure 12.5 Measured impulse response (256 samples).

Listen now to e_lsl.wav. In this case, the filter was adapted using the modified error-feedback least-squares lattice algorithm (EF-LSL) [5], which is a low-cost and stable version of the recursive least-squares algorithm (RLS) described in Section 1.12.3.3. Now you will notice that the echo fades away much faster. This fast convergence is characteristic of the EF-LSL algorithm, and is of course a very desirable property. On the other hand, the NLMS algorithm is simpler to implement and more robust against imperfections in the implementation. An adaptive filtering algorithm must balance several requirements, such as convergence rate, tracking capability, noise rejection, computational complexity, and robustness. These requirements are conflicting, so there is no best algorithm that outperforms all the others in every situation. The many algorithms available in the literature represent different compromises between these requirements [13].

Figure 12.6 shows 4 s of the echo signal, prior to cancellation (bottom) and a measure of the quality of echo cancellation, the so-called echo return loss enhancement (ERLE) (top). The ERLE is defined as the ratio

image

The higher the ERLE, the better the performance of the canceler. The ERLE drops when the original echo is small (there is nothing to cancel!), and increases again when the echo is strong. The figure shows that EF-LSL performs better than NLMS (but at a higher computational cost).

image

Figure 12.6 Echo return loss enhancement for NLMS image and EF-LSL image algorithms. Bottom trace: echo signal prior to cancellation. The meaning of the several parameters is explained in Sections 1.12.3.1, 1.12.3.2, and 1.12.3.3.
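Assuming the usual definition of the ERLE as the ratio (in dB) between the short-term power of the echo and the short-term power of the residual error after cancellation, a curve such as the one in Figure 12.6 could be computed as in the following minimal sketch. The window length and the synthetic test signals below are illustrative choices, not the ones used in the chapter's simulations.

import numpy as np

def erle_db(echo, error, window=2048, eps=1e-12):
    """Short-term ERLE in dB, estimated over non-overlapping windows."""
    n = (min(len(echo), len(error)) // window) * window
    p_echo = (echo[:n] ** 2).reshape(-1, window).mean(axis=1)
    p_err = (error[:n] ** 2).reshape(-1, window).mean(axis=1)
    return 10.0 * np.log10((p_echo + eps) / (p_err + eps))

# Synthetic example: a canceler that attenuates the echo by a factor of 100
# in amplitude should show roughly 40 dB of ERLE.
rng = np.random.default_rng(0)
echo = rng.standard_normal(8000)
residual = 0.01 * rng.standard_normal(8000)
print(erle_db(echo, residual, window=2000))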

Assuming now that the near-end signal is another recorded voice signal (click DT_echo.wav to listen to this signal), we applied the EF-LSL algorithm again to cancel the echo. This case simulates a person talking inside the car. Since the speech of this person is not correlated with the far-end signal, the output of the adaptive filter converges again to an estimate of the echo signal. Thus, a synthesized version of the echo is subtracted from the signal picked up by the car microphone, and the resulting signal converges to the speech of the person inside the car. This result can be verified by listening to the file e_lslDT.wav.

This is a typical example of use of an adaptive filter. The environment (the echo impulse response) changes constantly, requiring a constant re-tuning of the canceling filter. Although the echo itself is not known, we have access to a reference signal (the clean speech of the far-end user). This is the signal that is processed to obtain an estimate of the echo. Finally, the signal captured by the near-end microphone (the so-called desired signal) is a mixture of two signals: the echo, plus the near-end speech (or ambient noise, when the near-end user is silent). The task of the adaptive filter can be seen as to separate these two signals.

We should mention that the name desired signal is somewhat misleading, although widespread in the literature. Usually we are interested in separating the “desired” signal into two parts, one of which is of no interest at all!

We proceed now to show how an algorithm can perform this task, modeling all signals as periodic to simplify the arguments.

1.12.1.2 A quick tour of adaptive filtering

Before embarking on a detailed explanation of adaptive filtering and its theory, it is a good idea to present the main ideas in a simple setting, to help the reader build intuition. In this section, we introduce the least-mean squares (LMS) algorithm and some of its properties using only deterministic arguments and basic tools from linear systems theory.

1.12.1.2.1 Posing the problem

Consider again the echo-cancellation problem, seen in the previous section (see Figure 12.7). Let image represent the echo, image represent the voice of the near-end user, and image the voice of the far-end user (the reference). image represents the signal that would be received by the far-end user without an echo canceler (the mixture, or desired signal).

image

Figure 12.7 A model for the echo.

The echo path represents the changes that the far-end signal suffers when it goes through the digital-to-analog (D/A) converter, the loudspeaker, the air path between the loudspeaker and the microphone (including all reflections), the microphone itself and its amplifier, and the analog-to-digital (A/D) converter. The microphone signal image will always be a mixture of the echo image and an unrelated signal image, which is composed of the near-end speech plus noise. Our goal is to remove image from image. The challenge is that we have no means to measure image directly, and the echo path is constantly changing.

1.12.1.2.2 Measuring how far we are from the solution

How is the problem solved, then? Since we cannot measure image, the solution is to estimate it from the measurable signals image and image. This is done as follows: imagine for now that all signals are periodic, so they can be decomposed as Fourier series

image (12.1)

for certain amplitudes image and image, phases image and image, and frequencies image and image. The highest frequencies appearing in image and image must satisfy image, image, since we are dealing with sampled signals (Nyquist criterion). We assume for now that the echo path is fixed (not changing), and can be modeled by an unknown linear transfer function image. In this case, image would also be a periodic sequence with fundamental frequency image,

image

with image. The signal picked by the microphone is then

image

Since we know image, if we had a good approximation image to image, we could easily obtain an approximation image to image and subtract it from image, as in Figure 12.8.

image

Figure 12.8 Echo canceler.

How could we find image in real time? Recall that only image, and image can be observed. For example, how could we know that we have the exact filter, that is, that image, by looking only at these signals? To answer this question, take a closer look at image. Let the output of image be

image (12.2)

where image.

Assuming that the cosines in image and image have no frequencies in common, the simplest approach is to measure the average power of image. Indeed, if all frequencies image and image appearing in image are different from one another, then the average power of image is

image (12.3)

If image, then image is at its minimum,

image

which is the average power of image. In this case, image is at its optimum: image, and the echo is completely canceled.

It is important to see that this works because the two signals we want to separate, image and image, are uncorrelated, that is, they are such that

image (12.4)

This property is known as the orthogonality condition. A form of orthogonality condition will appear in general whenever one tries to minimize a quadratic function, as is the case here. This is discussed in more detail in Section 1.12.2.

The average power image could then be used as a measure of how good our approximation image is. The important point is that it is very easy to find a good approximation to image. It suffices to low-pass filter image; for example,

image (12.5)

is a good approximation if the window length image is large enough.
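As a concrete illustration, the sketch below computes a sliding-window time average of the squared error, in the spirit of (12.5). The window lengths and the test signal are arbitrary choices made for the example, not values from the chapter.

import numpy as np

def average_error_power(e, window):
    """Sliding-window time average of e^2(n): at each instant, the mean of the
    squared error over the last `window` samples (shorter at the start)."""
    e2 = np.asarray(e, dtype=float) ** 2
    csum = np.concatenate(([0.0], np.cumsum(e2)))
    n = np.arange(1, len(e2) + 1)
    start = np.maximum(0, n - window)
    return (csum[n] - csum[start]) / (n - start)

# The longer the window, the closer the estimate gets to the true average power.
rng = np.random.default_rng(1)
e = np.sqrt(0.5) * rng.standard_normal(100_000)   # true average power = 0.5
print(average_error_power(e, 100)[-1], average_error_power(e, 10_000)[-1])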

The adaptive filtering problem then becomes an optimization problem, with the particularity that the cost function that is being minimized is not known exactly, but only through an approximation, as in (12.5). In the following sections, we discuss in detail the consequences of this fact.

It is also important to remark that the average error power is not the only possible choice of cost function. Although it is the most popular choice for a number of reasons, in some applications other choices are more appropriate. Posing the adaptive filtering problem as an optimization problem is not the only alternative either, as shown by recent methods based on projections onto convex sets [6].

1.12.1.2.3 Choosing a structure for the filter

Our approximation image will be built by minimizing the estimated average error power (12.5) as a function of the echo path model image, that is, our estimated echo path will be the solution of

image (12.6)

Keep in mind that we are looking for a real-time way of solving (12.6), since image can only be obtained by measuring image for a certain amount of time. We must then be able to implement the current approximation image, so that image and image may be computed.

This will impose practical restrictions on the kind of function that image may be: since memory and processing power are limited, we must choose beforehand a structure for image. Memory and processing power will impose a constraint on the maximum order the filter may have. In addition, in order to write the code for the filter, we must decide if it will depend on past outputs (IIR filters) or only on past inputs (FIR filters). These choices must be made based on some knowledge of the kind of echo path that our system is likely to encounter.

To explain how these choices can be made, let us simplify the problem a little more and restrict the far-end signal image to a simple sinusoid (i.e., assume that image). In this case,

image

with image. Therefore, image must satisfy

image (12.7)

only for image—the value of image for other frequencies is irrelevant, since the input image has only one frequency. Expanding (12.7), we obtain

image (12.8)

The optimum image is defined through two conditions. Therefore an FIR filter with just two coefficients would be able to satisfy both conditions for any value of image, so we could define

image

and the values of image and image would be chosen to solve (12.6). Note that the argument generalizes for image: in general, if image has image harmonics, we could choose image as an FIR filter with length image, two coefficients per input harmonic.

This choice is usually not so simple: in practice, we would not know image, and in the presence of noise (as we shall see later), a filter with fewer coefficients might give better performance, even though it would not completely cancel the echo. The structure of the filter also affects the dynamic behavior of the adaptive filter, so simply looking at equations such as (12.7) does not tell the whole story. Choosing the structure for image is one of the most difficult steps in adaptive filter design. In general, the designer must test different options, initially perhaps using recorded signals, but ultimately building a prototype and performing some tests.
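The claim that two coefficients suffice for a single input frequency can be checked numerically. The sketch below writes the two-tap filter as w0 + w1 z^{-1} (an assumed notation) and solves the two real conditions implied by (12.7) and (12.8); the frequency and target response are arbitrary values chosen for the example.

import numpy as np

def two_tap_for_gain(target_gain, omega):
    """Return (w0, w1) such that w0 + w1*exp(-1j*omega) equals target_gain."""
    # Matching the real and imaginary parts gives two linear equations in (w0, w1).
    A = np.array([[1.0, np.cos(omega)],
                  [0.0, -np.sin(omega)]])
    b = np.array([target_gain.real, target_gain.imag])
    return np.linalg.solve(A, b)

omega = 0.3 * np.pi
G = 0.8 * np.exp(1j * 1.1)                 # arbitrary magnitude and phase
w0, w1 = two_tap_for_gain(G, omega)
print(np.allclose(w0 + w1 * np.exp(-1j * omega), G))   # True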

1.12.1.2.4 Searching for the solution

Returning to our problem, assume then that we have decided that an FIR filter with length image is adequate to model the echo path, that is,

image

Our minimization problem (12.6) now reduces to finding the coefficients image that solve

image (12.9)

Since image must be computed from measurements, we must use an iterative method to solve (12.9). Many algorithms could be used, as long as they depend only on measurable quantities (image, and image). We will use the steepest descent algorithm (also known as the gradient algorithm) as an example now; later we show other methods that could also be used. If you are not familiar with the gradient algorithm, see Box 1 for an introduction.

The cost function in (12.9) is

image

We need to rewrite this equation so that the steepest descent algorithm can be applied. Recall also that now the filter coefficients will change, so define the vectors

image

and

image

where image denotes transposition. At each instant, we have image, and our cost function becomes

image (12.10)

which depends on image. In order to apply the steepest descent algorithm, let us keep the coefficient vector image constant during the evaluation of image, as follows. Starting from an initial condition image, compute for image (The notation is explained in Boxes 2 and 6.)

image (12.11)

where image is a positive constant, and image. Our cost function now depends on only one image (compare with (12.10)):

image (12.12)

We now need to evaluate the gradient of image. Expanding the expression above, we obtain

image

so

image

and

image (12.13)

As we needed, this gradient depends only on measurable quantities, image and image, for image. Our algorithm for updating the filter coefficients then becomes

image (12.14)

where we introduced the overall step-size image.

We still must choose image and image. The choice of image is more complicated and will be treated in Section 1.12.1.2.5. The value usually employed for image may be a surprise: in almost all cases, one uses image, resulting in the so-called least-mean squares (LMS) algorithm

image (12.15)

proposed initially by Widrow and Hoff in 1960 [7] (Widrow [8] describes the history of the creation of the LMS algorithm). The question is, how can this work, if no average is being used for estimating the error power? An intuitive answer is not complicated: assume that image is a very small number so that image for a large image. In this case, we can approximate image from (12.15) as follows.

image

Since image, we have image. Therefore, we could approximate

image

so image steps of the LMS recursion (12.15) would result in

image

just what would be obtained from (12.14). The conclusion is that, although there is no explicit average being taken in the LMS algorithm (12.15), the algorithm in fact computes an implicit, approximate average if the step-size is small. This is exactly what happens, as can be seen in the animations in Figure 12.9. These simulations were prepared with

image

where image and image were chosen randomly in the interval image. Several step-sizes were used. The estimates computed with the LMS algorithm are marked by the crosses (red in the web version). The initial condition is at the left, and the theoretical optimum is at the right end of each figure. In the simulations, the true echo path was modeled by the fixed filter

image

For comparison, we also plotted the estimates that would be obtained by the gradient algorithm using the exact value of the error power at each instant. These estimates are marked with small dots (blue in the web version).

image

Figure 12.9 LMS algorithm for echo cancellation with sinusoidal input. Click LMS video files to see animations in your browser.

Note that when a small step-size is used, such as image in Figure 12.9a, the LMS filter stays close to the dots obtained assuming exact knowledge of the error power. However, as the step-size increases, the LMS estimates move farther away from the dots. Although convergence is faster, the LMS estimates do not reach the optimum and stay there: instead, they hover around the optimum. For larger step-sizes, the LMS estimates can go quite far from the optimum (Figure 12.9b and c). Finally, if the step-size is too large, the algorithm will diverge, that is, the filter coefficients grow without bounds (Figure 12.9d).
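For readers who want to experiment, the following minimal sketch reproduces the qualitative behavior described above, assuming an LMS update of the form w(n+1) = w(n) + mu*e(n)*u(n) for (12.15), where u(n) collects the last two input samples (assumed notation). The echo-path coefficients, input frequency, noise level, and step-size are illustrative; they are not the values used to generate Figure 12.9.

import numpy as np

rng = np.random.default_rng(0)
N, mu = 20_000, 0.01
w_true = np.array([0.5, -0.3])              # hypothetical fixed two-tap echo path
x = np.cos(0.2 * np.pi * np.arange(N))      # far-end (reference) signal
v = 0.01 * rng.standard_normal(N)           # near-end signal (low-power noise)

w = np.zeros(2)                             # adaptive filter coefficients
for n in range(1, N):
    u = np.array([x[n], x[n - 1]])          # regressor of the two last inputs
    d = w_true @ u + v[n]                   # desired signal: echo plus noise
    e = d - w @ u                           # cancellation error
    w = w + mu * e * u                      # LMS update

print(w, w_true)   # with a small step-size, w hovers close to w_true

Increasing mu speeds up convergence, but the estimates hover farther from w_true, and beyond a certain value the recursion diverges, as in Figure 12.9d.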

1.12.1.2.5 Tradeoff between speed and precision

The animations in Figure 12.9 illustrate an important problem in adaptive filters: even if you could magically choose the exact optimum solution as the initial condition, the adaptive filter coefficients would not stay there! This happens because the filter does not know the exact value of the cost function it is trying to minimize (in this case, the error power): since image is an approximation, noise (and the near-end signal in our echo cancellation example) would make the estimated gradient non-zero, and the filter would wander away from the optimum solution. This effect keeps the minimum error power obtainable using the adaptive filter always a little higher than the optimum value. The difference between the (theoretical, unattainable without perfect information) optimum and the actual error power is known as the excess mean-square error (EMSE), and the ratio between the EMSE and the optimum error is known as the misadjustment (see Eqs. (12.171) and (12.174)).

For small step-sizes the misadjustment is small, since the LMS estimates stay close to the estimates that would be obtained with exact knowledge of the error power. However, a small step-size also means slow convergence. This trade-off exists in all adaptive filter algorithms, and has been an intense topic of research: many algorithms have been proposed to allow faster convergence without increasing the misadjustment. We will see some of them in the next sections.

The misadjustment is central to evaluating the performance of an adaptive filter. In fact, if we want to eliminate the echo from the near-end signal, we would like that, after convergence, image. This is indeed what happens when the step-size is small (see Figure 12.10a). However, when the step-size is increased, although the algorithm converges more quickly, the performance after convergence is not very good, because of the wandering of the filter weights around the optimum (Figure 12.10b–d; in the last case, for image, the algorithm is diverging).

image

Figure 12.10 Compromise between convergence rate and misadjustment in LMS. Plots of image for several values of image.

We will stop this example here for the moment, and move forward to a description of adaptive filters using probability theory and stochastic processes. We will return to it during discussions of stability and model order selection.

The important message from this section is that an adaptive filter is able to separate two signals in a mixture by minimizing a well-chosen cost function. The minimization must be made iteratively, since the true value of the cost function is not known: only an approximation to it may be computed at each step. When the cost function is quadratic, as in the example shown in this section, the components of the mixture are separated such that they are in some sense orthogonal to each other.

Box 1:

Steepest descent algorithm

Given a cost function

image

with image (image denotes transposition), we want to find the minimum of image, that is, we want to find image such that

image

The solution can be found using the gradient of image, which is defined by

image

See Box 2 for a brief explanation about gradients and Hessians, that is, derivatives of functions of several variables.

The notation image is most convenient when we deal with complex variables, as in Box 6. We use it for real variables for consistency.

Since the gradient always points to the direction in which image increases most quickly, the steepest descent algorithm searches iteratively for the minimum of image taking at each iteration a small step in the opposite direction, i.e., towards image:

image (12.16)

where image is a step-size, i.e., a constant controlling the speed of the algorithm. As we shall see, this constant should be neither too small (or the algorithm will converge too slowly) nor too large (or the recursion will diverge).

As an example, consider the quadratic cost function with image

image (12.17)

with

image (12.18)

The gradient of image is

image (12.19)

Of course, for this example we can find the optimum image by equating the gradient to zero:

image (12.20)

Note that in this case the Hessian of image (the matrix of second derivatives, see Box 2) is simply image. As image is symmetric with positive eigenvalues image, it is positive-definite, and we can conclude that image is indeed a minimum of image (See Boxes 2 and 3. Box 3 lists several useful properties of matrices.)

Even though in this case we could compute the optimum solution through (12.20), we will apply the gradient algorithm with the intention of understanding how it works and what its main limitations are. From (12.16) and (12.17), we obtain the recursion

image (12.21)

The user must choose an initial condition image.

In Figure 12.11 we plot the evolution of the approximations computed through (12.21) against the level sets (i.e., curves of constant value) of the cost function (12.17). Different choices of step-size image are shown. In Figure 12.11a, a small step-size is used. The algorithm converges to the correct solution, but slowly. In Figure 12.11b, we used a larger step-size—convergence is now faster, but the algorithm still needs several iterations to reach the solution. If we try to increase the step-size even further, convergence at first becomes slower again, with the algorithm oscillating towards the solution (Figure 12.11c). For even larger step-sizes, such as that in Figure 12.11d, the algorithm diverges: the filter coefficients get farther and farther away from the solution.

image

Figure 12.11 Performance of the steepest descent algorithm for different step-sizes. The crosses (x) represent the successive approximations image, plotted against the level curves of (12.17). The initial condition is the point in the lower left.
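The experiment in Figure 12.11 can be reproduced, at least qualitatively, with a few lines of code. The quadratic cost used below has the generic form 0.5*w'Rw - p'w with an arbitrary symmetric positive-definite matrix; the actual matrix, vector, and scaling in (12.17) and (12.18) are not reproduced here, so the numbers are only illustrative.

import numpy as np

R = np.array([[2.0, 0.5],
              [0.5, 1.0]])                  # symmetric, positive-definite (illustrative)
p = np.array([1.0, 0.3])
w_opt = np.linalg.solve(R, p)               # gradient R@w - p vanishes at the optimum

def steepest_descent(mu, iters=200, w0=np.zeros(2)):
    w = w0.copy()
    for _ in range(iters):
        w = w - mu * (R @ w - p)            # small step against the gradient
    return w

for mu in (0.05, 0.5, 0.95, 1.2):
    w = steepest_descent(mu)
    print(mu, np.linalg.norm(w - w_opt))    # diverges once mu exceeds 2/lambda_max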

It is important to find the maximum step-size for which the algorithm (12.21) remains stable. For this, we need concepts from linear systems and linear algebra, in particular eigenvalues, their relation to stability of linear systems, and the fact that symmetric matrices always have real eigenvalues (see Box 3).

The range of allowed step-sizes is found as follows. Rewrite (12.21) as

image

which is a linear recursion in state-space form (image is the identity matrix).

Linear systems theory [9] tells us that this recursion converges as long as all eigenvalues of image have absolute value less than one. The eigenvalues of image are the roots of

image

Let image. Then image is an eigenvalue of image if and only if image is an eigenvalue of image. Therefore, if we denote by image the eigenvalues of image, the stability condition is

image

The stability condition for our example is thus image, in agreement with what we saw in Figure 12.11.

The gradient algorithm leads to relatively simple adaptive filtering algorithms; however, it has an important drawback. As you can see in Figure 12.11a, the gradient does not point directly towards the optimum. This effect is heightened when the level sets of the cost function are very elongated, as in Figure 12.12. This case corresponds to a quadratic function in which image has one eigenvalue much smaller than the other. In this example, we replaced the matrix image in (12.18) by

image

This matrix has eigenvalues image and image.

image

Figure 12.12 Performance of the steepest descent algorithm for a problem with large ratio of eigenvalues. The crosses (x) represent the successive approximations image, plotted against the level curves of (12.17). The initial condition is the point in the lower left. Step-size image.

Let us see why this is so. Recall from Box 3 that for every symmetric matrix there exists an orthogonal matrix image (that is, image) such that

image (12.22)

where we defined the diagonal eigenvalue matrix image.

Let us apply a change of coordinates to (12.19). The optimum solution to (12.17) is

image

Our first change of coordinates is to replace image by image in (12.19). Subtract (12.19) from image and replace image in (12.19) by image to obtain

image (12.23)

Next, multiply both sides of this recursion by image

image

Defining image, we have rewritten the gradient equation in a new set of coordinates, such that now the equations are uncoupled. Given that image is diagonal, the recursion is simply

image (12.24)

Note that, as long as image for image, both entries of image will converge to zero, and consequently image will converge to image.

The stability condition for this recursion is

image

When one of the eigenvalues is much larger than the other, say, when image, the rate of convergence for the direction relative to the smaller eigenvalue becomes

image

and even if one of the coordinates in image converges quickly to zero, the other will converge very slowly. This is what we saw in Figure 12.12. The ratio image is known as the eigenvalue spread of a matrix.

In general, the gradient algorithm converges slowly when the eigenvalue spread of the Hessian is large. One way of solving this problem is to use a different optimization algorithm, such as the Newton or quasi-Newton algorithms, which use the inverse of the Hessian matrix, or an approximation to it, to improve the search direction used by the gradient algorithm.

Although these algorithms converge very quickly, they require more computational power, compared to the gradient algorithm. We will see more about them when we describe the RLS (recursive least-squares) algorithm in Section 1.12.3.3.

For example, for the quadratic cost-function (12.17), the Hessian is equal to image. If we had a good approximation image to image, we could use the recursion

image (12.25)

This class of algorithms is known as quasi-Newton (if image is an approximation) or Newton (if image is exact). In the Newton case, we would have

image

that is, the algorithm moves at each step precisely in the direction of the optimum solution. Of course, this is not a very interesting algorithm for a quadratic cost function as in this example (the optimum solution is used in the recursion!). However, this method is very useful when the cost function is not quadratic and, in adaptive filtering, when the cost function is not known exactly.
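A short numerical comparison of the two search directions, under the same assumptions as the previous sketch (a generic quadratic cost 0.5*w'Rw - p'w with illustrative values), shows why the Newton step of (12.25) is attractive when the eigenvalue spread is large.

import numpy as np

R = np.diag([10.0, 0.1])                    # Hessian with eigenvalue spread of 100
p = np.array([1.0, 1.0])
w_opt = np.linalg.solve(R, p)

def run(newton, mu, iters=100):
    w = np.zeros(2)
    H_inv = np.linalg.inv(R) if newton else np.eye(2)
    for _ in range(iters):
        w = w - mu * H_inv @ (R @ w - p)    # Newton scales the gradient by the inverse Hessian
    return np.linalg.norm(w - w_opt)

print("gradient:", run(newton=False, mu=0.19))  # step-size limited by the largest eigenvalue
print("newton:  ", run(newton=True, mu=1.0))    # reaches the optimum in a single step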

Box 2:

Gradients

Consider a function of several variables image, where

image

Its gradient is defined by

image

The real value of the notation image will only become apparent when working with functions of complex variables, as shown in Box 6. We also define, for consistency,

image

As an example, consider the quadratic cost function

image (12.26)

The gradient of image is (verify!)

image (12.27)

If we use the gradient to find the minimum of image, it is also necessary to check the second-order derivative, to make sure the solution is not a maximum or a saddle point. The second-order derivative of a function of several variables is a matrix, the Hessian. It is defined by

image (12.28)

Note that for the quadratic cost-function (12.26), the Hessian is equal to image. It is equal to image if image is symmetric, that is, if image. This will usually be the case here. Note that image is always symmetric: in fact, since for well-behaved functions image it holds that

image

the Hessian is usually symmetric.

Assume that we want to find the minimum of image. The first-order conditions are then

image

The solution image is a minimum of image if the Hessian at image is a positive semi-definite matrix, that is, if

image

for all directions image.

Box 3:

Useful results from Matrix Analysis

Here we list without proof a few useful results from matrix analysis. Detailed explanations can be found, for example, in [10–13].

Fact 1

Traces

The trace of a matrix image is the sum of its diagonal elements:

image (12.29)

in which the image are the entries of image. For any two matrices image and image, it holds that

image (12.30)

In addition, if image, image are the eigenvalues of a square matrix image, then

image (12.31)

Fact 2

Singular matrices

A matrix image is singular if and only if there exists a nonzero vector image such that image.

The null space, or kernel, image, of a matrix image is the set of vectors image such that image. A square matrix image is nonsingular if, and only if, image, that is, if its null space contains only the null vector.

Note that, if image, with image, then image cannot be invertible if image. This is because when image, there always exists a nonzero vector image such that image for image, and thus image with image.

Fact 3

Inverses of block matrices

Assume the square matrix image is partitioned such that

image

with image and image square. If image and image are invertible (nonsingular), then

image

where image is the Schur complement of image in image. If image and image are invertible, then

image

where image.

Fact 4

Matrix inversion Lemma

Let image and image be two nonsingular image matrices, image be also nonsingular, and image be such that

image (12.32)

The inverse of image is then given by

image (12.33)

The lemma is most useful when image. In particular, when image is a scalar, we have

image
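Since the printed equations (12.32) and (12.33) carry the exact symbols, the sketch below only verifies numerically one common statement of the lemma, namely that if A = B^{-1} + C D^{-1} C^H then A^{-1} = B - B C (D + C^H B C)^{-1} C^H B; whether this matches the chapter's notation exactly is an assumption, and the matrices are randomly generated for the test.

import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 2                                  # the lemma pays off when m is much smaller than n
B = np.eye(n) + 0.1 * rng.standard_normal((n, n))
B = B @ B.T                                  # make B symmetric positive-definite (hence nonsingular)
D = np.eye(m)
C = rng.standard_normal((n, m))

A = np.linalg.inv(B) + C @ np.linalg.inv(D) @ C.T
A_inv_lemma = B - B @ C @ np.linalg.inv(D + C.T @ B @ C) @ C.T @ B
print(np.allclose(np.linalg.inv(A), A_inv_lemma))    # True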

Fact 5

Symmetric matrices

Let image be Hermitian symmetric (or simply Hermitian), that is, image. Then, it holds that

1. All eigenvalues image, image of image are real.

2. image has a complete set of orthonormal eigenvectors image, image. Since the eigenvectors are orthonormal, image, where image if image and 0 otherwise.

3. Arranging the eigenvectors in a matrix image, we find that image is unitary, that is, image. In addition,

image

4. If image, it is called simply symmetric. In this case, not only the eigenvalues, but also the eigenvectors are real.

5. If image is symmetric and invertible, then image is also symmetric.

Fact 6

Positive-definite matrices

If image is Hermitian symmetric and moreover for all image it holds that

image (12.34)

then image is positive semi-definite. Positive semi-definite matrices have non-negative eigenvalues: image, image. They may be singular, if one or more eigenvalues are equal to zero.

If the inequality (12.34) is strict, that is, if

image (12.35)

for all image, then image is positive-definite. All eigenvalues image of positive-definite matrices satisfy image. Positive-definite matrices are always nonsingular.

Finally, all positive-definite matrices admit a Cholesky factorization, i.e., if image is positive-definite, there is a lower-triangular matrix image such that image [14].

Fact 7

Norms and spectral radius

The spectral radius image of a matrix image is

image (12.36)

in which image are the eigenvalues of image. The spectral radius is important, because a linear system

image

is stable if and only if image, as is well known. A useful inequality is that, for any matrix norm image such that for any image

image

it holds that

image (12.37)

This property is most useful when used with norms that are easy to evaluate, such as the 1-norm,

image

where image are the entries of image. We could also use the infinity norm,

image
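A quick numerical check of the bound (12.37), using an arbitrary matrix and the two easily evaluated norms just mentioned (maximum absolute column sum and maximum absolute row sum):

import numpy as np

A = np.array([[0.5, -0.2, 0.1],
              [0.3,  0.4, 0.0],
              [0.1,  0.2, 0.6]])
rho = max(abs(np.linalg.eigvals(A)))                 # spectral radius
print(rho, np.linalg.norm(A, 1), np.linalg.norm(A, np.inf))  # rho does not exceed either norm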

1.12.1.3 Applications

Due to their ability to adjust themselves to different environments, adaptive filters can be used in many signal processing and control applications. They have been employed as a powerful tool in several fields, such as communications, radar, sonar, biomedical engineering, active noise control, and modeling. It is common to divide these applications into four groups:

1. interference cancellation;

2. system identification;

3. prediction; and

4. inverse system identification.

In the first three cases, the goal of the adaptive filter is to find an approximation image for the signal image, which is contained in the signal image. Thus, as image approaches image, the signal image approaches image. The difference between these applications lies in what we are interested in. In interference cancellation, we are interested in the signal image, as in the case of acoustic echo cancellation, where image is the speech of the person using the hands-free telephone (go back to Figures 12.4 and 12.7). In system identification, we are interested in the filter parameters, and in prediction, we may be interested in the signal image and/or in the filter parameters. Note that in all these applications (interference cancellation, system identification, and prediction), the signal image is an approximation for image and should not converge to zero, except when image. In inverse system identification, differently from the other applications, the signal at the output of the adaptive filter must be as close as possible to the signal image; thus, ideally, the signal image should be zeroed. In the sequel, we give an overview of these four groups in order to arrive at a common formulation that will simplify our analysis of adaptive filtering algorithms in later sections.

1.12.1.3.1 Interference cancellation

In a general interference cancellation problem, we have access to a signal image, which is a mixture of two other signals, image and image. We are interested in one of these signals, and want to separate it from the other (the interference) (see Figure 12.13). Even though we do not know image or image, we have some information about image, usually in the form of a reference signal image that is related to image through a filtering operation image. We may know the general form of this operation, for example, we may know that the relation is linear and well approximated by an FIR filter with 200 coefficients. However, we do not know the parameters (the filter coefficients) necessary to reproduce it. The goal of the adaptive filter is to find an approximation image to image, given only image and image. In the process of finding this image, the adaptive filter will construct an approximation for the relation image. A typical example of an interference cancellation problem is the echo cancellation example we gave in Section 1.12.1.1. In that example, image is the far-end voice signal, image is the near-end voice signal, and image is the echo. The relation image is usually well approximated by a linear FIR filter with a few hundred taps.

image

Figure 12.13 Scheme for interference cancellation and system identification. image and image are correlated to each other.

The approximation for image is a by-product that does not need to be very accurate, as long as it leads to a image close to image. This configuration is shown in Figure 12.13 and is called interference cancellation, since the interference image should be canceled. Note that we do not want to make image, otherwise we would be removing not only the interference image, but also the signal of interest image.

There are many applications of interference cancellation. In addition to acoustic and line echo cancellation, we can mention, for example, adaptive notch filters for cancellation of sinusoidal interference (a common application is removal of 50 or 60 Hz interference from the mains line), cancellation of the maternal electrocardiography in fetal electrocardiography, cancellation of echoes in long distance telephone circuits, active noise control (as in noise-canceling headphones), and active vibration control.

1.12.1.3.2 System identification

In interference cancellation, the coefficients of the adaptive filter converge to an approximation for the true relation image, but as mentioned before, this approximation is a by-product that does not need to be very accurate. However, there are some applications in which the goal is to construct an approximation as accurate as possible for the unknown relation image between image and image, thus obtaining a model for the unknown system. This is called system identification or modeling (the diagram in Figure 12.13 applies also for this case). The signal image is now composed of the output image of an unknown system image, plus noise image. The reference signal image is the input to the system which, when possible, is chosen to be white noise. In general, this problem is harder than interference cancellation, as we shall see in Section 1.12.2.1.5, due to some conditions that the reference signal image must satisfy. However, one point does not change: in the ideal case, the signal image will be equal to the noise image. Again, we do not want to make image, otherwise we would be trying to model the noise (however, the smaller the noise the easier the task, as one would expect).

In many control systems, an unknown dynamic system (also called a plant in control systems terminology) is identified online and the result is used in a self-tuning controller, as depicted in Figure 12.14. Both the plant and the adaptive filter have the same input image. In practical situations, the plant to be modeled is noisy, which is represented by the signal image added to the plant output. The noise image is generally uncorrelated with the plant input. The task of the adaptive filter is to minimize the model error image and track the time variations in the dynamics of the plant. The model parameters are continually fed back to the controller to obtain the controller parameters used in the self-tuning regulator loop [15].

image

Figure 12.14 Scheme for plant identification in a control system.

There are many applications that require the identification of an unknown system. In addition to control systems, we can mention, for example, the identification of the room impulse response, used to study the sound quality in concert rooms, and the estimation of the communication channel impulse response, required by maximum likelihood detectors and blind equalization techniques based on second-order statistics.

1.12.1.3.3 Prediction

In prediction, the goal is to find a relation between the current sample image and previous samples image, as shown in Figure 12.15. Therefore, we want to model image as a part image that depends only on image, and a part image that represents new information. For example, for a linear model, we would have

image

Thus, we in fact have again the same problem as in Figure 12.13. The difference is that now the reference signal is a delayed version of image, and the adaptive filter will try to find an approximation image for image, thereby separating the “predictable” part image of image from the “new information” image, to which the signal image should converge. The system image in Figure 12.13 now represents the relation between image and the predictable part of image.

image

Figure 12.15 Scheme for prediction. image is a delayed version of image.

Prediction finds application in many fields. We can mention, for example, linear predictive coding (LPC) and adaptive differential pulse code modulation (ADPCM) used in speech coding, adaptive line enhancement, autoregressive spectral analysis, etc.

For example, adaptive line enhancement (ALE) provides a solution to a classical detection problem, whose objective is to separate a narrowband signal image from a wideband signal image. This is the case, for example, of finding a low-level sine wave (predictable narrowband signal) in noise (non-predictable wideband signal). The signal image is the sum of these two signals. Using image as the reference signal and a delayed replica of it, i.e., image, as the input of the adaptive filter, as in Figure 12.15, the output image provides an estimate of the (predictable) narrowband signal image, while the error signal image provides an estimate of the wideband signal image. The delay image is also known as the decorrelation parameter of the ALE, since its main function is to remove the correlation between the wideband signal image present in the reference image and in the delayed predictor input image.
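A minimal ALE sketch is given below, assuming an LMS-adapted FIR predictor (the LMS algorithm is introduced in Section 1.12.1.2.4): the filter input is the delayed signal, the desired signal is the mixture itself, the output estimates the narrowband component, and the error estimates the wideband one. The filter length, delay, step-size, and test signals are illustrative choices, not values from the chapter.

import numpy as np

rng = np.random.default_rng(0)
N, M, Delta, mu = 50_000, 32, 1, 0.002
s = np.sin(0.1 * np.pi * np.arange(N))          # narrowband (predictable) component
v = 0.7 * rng.standard_normal(N)                # wideband (unpredictable) component
d = s + v                                       # observed mixture (desired signal)

w = np.zeros(M)
y = np.zeros(N)
for n in range(Delta + M, N):
    u = d[n - Delta - M + 1:n - Delta + 1][::-1]   # delayed regressor
    y[n] = w @ u                                   # narrowband estimate
    e = d[n] - y[n]                                # wideband estimate
    w = w + mu * e * u                             # LMS update

# After convergence, the first number (output vs. narrowband part) should be
# much smaller than the second (the wideband power).
print(np.mean((y[-5000:] - s[-5000:]) ** 2), np.var(v))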

1.12.1.3.4 Inverse system identification

Inverse system identification, also known as deconvolution, has been widely used in fields such as communications, acoustics, optics, image processing, and control, among others. In communications, it is also known as channel equalization and the adaptive filter is commonly called an equalizer. Adaptive equalizers play an important role in digital communications systems, since they are used to mitigate the inter-symbol interference (ISI) introduced by dispersive channels. Due to its importance, we focus on the equalization application, which is explained in the sequel.

A simplified baseband communications system is depicted in Figure 12.16. The signal image is transmitted through an unknown channel, whose model is constituted by an FIR filter with transfer function image and additive noise image. Due to the channel memory, the signal at the receiver contains contributions not only from image, but also from the previous symbols image, i.e.,

image (12.38)

Assuming that the overall channel-equalizer system imposes a delay of image samples, the adaptive filter will try to find an approximation image for image and for this purpose, the two summations in (12.38), which constitute the inter-symbol interference, must be mitigated. Of course, when you are transmitting information the receiver will not have access to image. The filter will adapt during a training phase, in which the transmitter sends a pre-agreed signal. You can see that in this case, the role of image and image is the reverse of that in system identification. In this case, you would indeed like to have image. The problem is that this is not possible, given the presence of noise in image. The role of the adaptive filter is to approximately invert the effect of the channel, at the same time trying to suppress the noise (or at least, not to amplify it too much). In one sense, we are back to the problem of separating two signals, but now the mixture is in the reference signal image, so what can and what cannot be done is considerably different from the other cases.

image

Figure 12.16 Simplified communications system with an adaptive equalizer in the training mode.

The scheme of Figure 12.16 is also called training mode, since the delayed version of the transmitted sequence image (training sequence) is known at the receiver. After the convergence of the filter, the signal image is changed to the estimate image obtained at the output of a decision device, as shown in Figure 12.17. In this case, the equalizer works in the so-called decision-directed mode. The decision device depends on the signal constellation—for example, if image, the decision device returns image for image, and image for image.

image

Figure 12.17 Adaptive equalizer in the decision-directed mode.
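For a binary {-1, +1} constellation, the decision device reduces to a sign detector, and the switch between training and decision-directed modes amounts to choosing which reference symbol is fed back to the adaptation. The sketch below illustrates only this logic (the symbol alphabet and function names are assumptions made for the example; it does not implement the equalizer itself).

import numpy as np

def decision(y):
    """Map the equalizer output to the nearest symbol of the {-1, +1} alphabet."""
    return np.where(y >= 0.0, 1.0, -1.0)

def reference(y, training_symbol=None):
    """During training, use the known (delayed) transmitted symbol; afterwards,
    switch to decision-directed mode and use the decision itself."""
    return training_symbol if training_symbol is not None else decision(y)

print(decision(np.array([0.3, -1.2, 0.9])))   # [ 1. -1.  1.]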

1.12.1.3.5 A common formulation

In all the four groups of applications, the inputs of the adaptive filter are given by the signals image and image and the output by image. Note that the input image appears effectively in the signal image, which is computed and fed back at each time instant image, as shown in Figure 12.18. In the literature, the signal image is referred to as desired signal, image as reference signal and image as error signal. These names unfortunately are somewhat misleading, since they give the impression that our goal is to recover exactly image by filtering (possibly in a nonlinear way) the reference image. In almost all applications, this is far from the truth, as previously discussed. In fact, except in the case of channel equalization, exactly zeroing the error would result in very poor performance.

image

Figure 12.18 Inputs and output of an adaptive filter.

The only application in which image is indeed a “desired signal” is channel equalization, in which image. Despite this particularity of channel equalization, the common feature in all adaptive filtering applications is that the filter must learn a relation between the reference image and the desired signal image (we will use this name to be consistent with the literature). In the process of building this relation, the adaptive filter is able to perform useful tasks, such as separating two mixed signals or recovering a distorted signal. Section 1.12.2 explains this idea in more detail.

1.12.2 Optimum filtering

We now need to extend the ideas of Section 1.12.1.2, which applied only to periodic (deterministic) signals, to more general classes of signals. For this, we will use tools from the theory of stochastic processes. The main ideas are very similar: as before, we need to choose a structure for our filter and a means of measuring how far from or how near to the solution we are. This is done by choosing a convenient cost function, whose minimum we must search for iteratively, based only on measurable signals.

In the case of periodic signals, we saw that minimizing the error power (12.3), repeated below,

image

would be equivalent to zeroing the echo. However, we were not able to compute the error power exactly—we used an approximation through a time-average, as in (12.5), repeated here:

image

where image is a convenient window length (in Section 1.12.1.2, we saw that choosing a large window length is approximately equivalent to choosing image and a small step-size). Since we used an approximated estimate of the cost, our solution was also an approximation: the estimated coefficients did not converge exactly to their optimum values, but instead hovered around the optimum (Figures 12.9 and 12.10).

The minimization of the time-average image also works in the more general case in which the echo is modeled as a random signal—but what is the corresponding exact cost function? The answer is given by the property of ergodicity that some random signals possess. If a random signal image is such that its mean image does not depend on image and its autocorrelation image depends only on image, it is called wide-sense stationary (WSS) [16]. If image is also (mean-square) ergodic, then

image (12.39)

Note that image is an ensemble average, that is, the expected value is computed over all possible realizations of the random signal. Relation (12.39) means that for an ergodic signal, the time average of a single realization of the process is equal to the ensemble-average of all possible realizations of the process. This ensemble average is the exact cost function that we would like to minimize.

In the case of adaptive filters, (12.39) holds approximately for image for a finite value of image if the environment does not change too fast, and if the filter adapts slowly. Therefore, for random variables we will still use the average error power as a measure of how well the adaptive filter is doing. The difference is that in the case of periodic signals, we could understand the effect of minimizing the average error power in terms of the amplitudes of each harmonic in the signal, but now the interpretation will be in terms of ensemble averages (variances).

Although the error power is not the only possible choice for cost function, it is useful to study this choice in detail. Quadratic cost functions such as the error power have a number of properties that make them popular. For example, they are differentiable, and so it is relatively easy to find a closed-form solution for the optimum, and in the important case where all signals are Gaussian, the optimum filters are linear. Of course, quadratic cost functions are not the best option in all situations: the use of other cost functions, or even of adaptive filters not based on the minimization of a cost function, is becoming more common. We will talk about some of the most important alternatives in Sections 1.12.5.2 and 1.12.5.4.

Given the importance of quadratic cost functions, in this section we study them in detail. Our focus is on what could and what could not be done if the filter were able to measure perfectly image at each instant. This discussion has two main goals: the first is to learn what is feasible, so we do not expect more from a filter than it can deliver. The second is to enable us to make more knowledgeable choices when designing a filter—for example, which filter order should be used, and for system identification, which input signal would result in better performance. In Section 1.12.4.4.3 we will study the performance of adaptive filters taking into consideration their imperfect knowledge of the environment.

1.12.2.1 Linear least-mean squares estimation

In the examples we gave in previous sections, a generic adaptive filtering problem reduces to the following: we are given two sequences, image and image, as depicted in Figure 12.18. It is known, from physical arguments, that image has two parts, one that is in some sense related to image, and another that is not, and that both parts are combined additively, that is,

image (12.40)

where image is an unknown relation, and image is the part “unrelated” to image. In our examples so far, image was linear, but we do not need to restrict ourselves to this case. Our objective is to extract image and image from image, based only on observations of image and image. We saw in Section 1.12.1.2 that the average power of the difference image between image and our current approximation image could be used as a measure of how close we are to the solution. In general, image is time-variant, i.e., depends directly on image. We will not write this dependence explicitly to simplify the notation.

In this section we consider, in some sense, the inverse problem, that is: given two sequences image and image, what sort of relation will be found between them if we use image as a standard? The difference is that we now do not assume a model such as (12.40); instead, we want to know what kind of model results from the exact minimization of image, where image is defined as the difference between image and a function of image.

We can answer this question in an entirely general way, without specifying beforehand the form of image. This is done in Box 4. However, if we have information about the physics of a problem and know that the relation image is of a certain kind, we can restrict the search for the solution image to this class. This usually reduces the complexity of the filter and increases its convergence rate (because less data is necessary to estimate a model with fewer unknowns, as we will see in Section 1.12.3). In this section we focus on this second option.

The first task is to describe the class image of allowed functions image. This may be done by choosing a relation that depends on a few parameters, such as (recall that the adaptive filter output is image)

image (12.41)

image (12.42)

image (12.43)

image (12.44)

In each of these cases, the relation between the input sequence image and image is constrained to a certain class, for example, linear length-image FIR filters in (12.41), first-order IIR filters in (12.42), and second-order Volterra filters in (12.43). Each class is described by a certain number of parameters: the filter coefficients image in the case of FIR filters, image and image in the case of IIR filters, and so on. The task of the adaptive filter will then be to choose the values of the parameters that best fit the data.

For several practical reasons, it is convenient if we make the filter output depend linearly on the parameters, as happens in (12.41) and in (12.43). It is important to distinguish linear in the parameters from input-output linear: (12.43) is linear in the parameters, but the relation between the input sequence image and the output image is nonlinear. What may come as a surprise is that the IIR filter of Eq. (12.42) is not linear in the parameters: in fact, image— you can see that image depends nonlinearly on image and image.

Linearly parametrized classes image, such as image (12.41) and image (12.43), are popular because in general it is easier to find the optimum parameters, both theoretically and in real time. In fact, when the filter output depends linearly on the parameters and the cost function is a convex function of the error, it can be shown that the optimal solution is unique (see Box 5). This is a very desirable property, since it simplifies the search for the optimal solution.

In the remainder of this section we will concentrate on classes of relations that are linear in the parameters. As image shows, this does not imply that we are restricting ourselves to linear models. There are, however, adaptive filtering algorithms that use classes of relations that are not linear in the parameters, such as IIR adaptive filters. On the other hand, blind equalization algorithms are based on non-convex cost functions.

Assume then that we have chosen a convenient class of relations image that depends linearly on its parameters. That is, we want to solve the problem

image (12.45)

We want to know which properties the solution to this problem will have when image depends linearly on a finite number of parameters. In this case, image is a linear combination of certain functions image of image

image (12.46)

where in general image, image. The vector image is known as regressor. In the case of length-image FIR filters, we would have

image

whereas in the case of second-order Volterra filters with memory 1, we would have

image

Our problem can then be written in general terms as: find image such that

image (12.47)

We omitted the dependence of the variables on image to lighten the notation. Note that, in general, image will also depend on image.

To solve this problem, we use the facts that image to expand the expected value

image

Recall that the weight vector image is not random, so we can take it out of the expectations.

Define the autocorrelation of image, and also the cross-correlation vector and autocorrelation matrix

image (12.48)

The cost function then becomes

image (12.49)

Differentiating image with respect to image, we obtain

image

and equating the result to zero, we see that the optimal solution must satisfy

image (12.50)

These are known as the normal, or Wiener-Hopf, equations. The solution, image, is known as the Wiener solution. A note of caution: the Wiener solution is not the same thing as the Wiener filter. The Wiener filter is the linear filter that minimizes the mean-square error, without restriction of filter order [17]. The difference is that the Wiener solution has the filter order pre-specified (and is not restricted to linear filters, as we saw).

When the autocorrelation matrix is non-singular (which is usually the case), the Wiener solution is

image (12.51)

Given image, the optimum error will be

image (12.52)

Note that the expected value of image is not necessarily zero:

image (12.53)

If, for example, image and image, then image. In practice, it is good to keep image, because we usually know that image should approximate a zero-mean signal, such as speech or noise. We will show shortly, in Section 1.12.2.1.4, how to guarantee that image has zero mean.
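To make the normal equations concrete, the short sketch below (in Python) estimates a Wiener solution from sample averages in a hypothetical FIR system-identification setup; the system h, the noise level, the filter length, and the data length are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: identify a length-3 FIR system from noisy observations.
h = np.array([0.5, -0.4, 0.2])           # "true" system (assumption for this sketch)
N, M = 10000, 3                          # number of samples and filter length
x = rng.standard_normal(N)               # white input
d = np.convolve(x, h)[:N] + 0.1 * rng.standard_normal(N)   # desired signal

# Regressors x(n) = [x(n), x(n-1), x(n-2)]^T, and sample estimates of R and p.
X = np.column_stack([np.concatenate((np.zeros(k), x[:N - k])) for k in range(M)])
R = X.T @ X / N                          # sample autocorrelation matrix
p = X.T @ d / N                          # sample cross-correlation vector

w_o = np.linalg.solve(R, p)              # Wiener solution: solves the normal equations R w = p
print(w_o)                               # close to h for this white input
```

Solving the linear system directly is generally preferable to forming the explicit inverse in (12.51), especially when the autocorrelation matrix is poorly conditioned.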

1.12.2.1.1 Orthogonality condition

A key property of the Wiener solution is that the optimum error is orthogonal to the regressor image, that is,

image (12.54)

We saw a similar condition in Section 1.12.1.2. It is very useful to remember this result, known as the orthogonality condition: from it, we can determine when the cost function (12.47) is adequate for the design of an adaptive filter. Note that when image has zero mean, (12.54) also implies that image and image are uncorrelated.

Remark 1

You should not confuse orthogonal with uncorrelated. Two random variables image and image are orthogonal if image, whereas they are uncorrelated if image. The two concepts coincide only if either image or image has zero mean.

Remark 2

The orthogonality condition is an intuitive result, if we remember that we can think of the space of random variables with finite variance as a vector space. In fact, define the cross-correlation image between two random variables image and image as the inner product. Then the autocorrelation of a random variable image, image, would be interpreted as the square of the “length” of image. In this case, our problem (12.47) is equivalent to finding the vector in the subspace spanned by image that is closest to image. As we know, the solution is such that the error is orthogonal to the subspace. So, the orthogonality condition results from the vector space structure of our problem and the quadratic nature of our cost function. See Figure 12.19.

image

Figure 12.19 Orthogonality in vector spaces with inner products.

Remark 3

The difference between (12.54) and the corresponding result obtained in Box 4, Eq. (12.105), is that here the optimum error is orthogonal only to the functions image of image that were included in the regressor (and their linear combinations), not to any function, as in (12.105). This is not surprising: in this section the optimization was made only over the functions in class image, and in Box 4 we allow any function. Both solutions, (12.106) and (12.51), will be equal if the general solution according to Box 4 is indeed in image.

Remark 4

The optimal mean-square error is (recall that image)

image (12.55)

Now that we have the general solution to (12.47), we turn to the question: for which class of problems is the quadratic cost function adequate?

1.12.2.1.2 Implicit vs. physical models

From (12.54), we see that the fact that we are minimizing the mean-square error between image and a linear combination of the regressor image induces a model

image (12.56)

in which image is orthogonal to image. This is not an assumption, as we sometimes see in the literature, but a consequence of the quadratic cost function we chose. In other words, there is always a relation such as (12.56) between any pair image, as long as both have finite second-order moments. This relation may make sense from a physical analysis of the problem (as we saw in the echo cancellation example), or be simply a consequence of solving (12.47) (see Section 1.12.2.1.3 for examples).

On the other hand, the orthogonality condition allows us to find out when the solution of (12.47) will be able to successfully solve a problem: assume now that we know beforehand, by physical arguments about the problem, that image and image must be related through

image (12.57)

where image is a certain coefficient vector and image is orthogonal to image. Will the solution to (12.47) be such that image and image?

To check if this is the case, we only need to evaluate the cross-correlation vector image assuming the model (12.57):

image (12.58)

Therefore, image also obeys the normal equations, and we can conclude that image, as long as image is nonsingular.

Even if image is singular, the optimal error image will always be equal to image (in a certain sense). In fact, if image is singular, then there exists a vector image such that image (see Fact 2 in Box 3), and thus any solution image of the normal equations (we know from (12.58) that there is at least one, image) will have the form

image (12.59)

where image is the null-space of image. In addition,

image

Therefore, image with probability one. Then,

image

with probability one. Therefore, although we cannot say that image always, we can say that they are equal with probability one. In other words, when image is singular we may not be able to identify image, but (with probability one) we are still able to separate image into its two components. More about this in Section 1.12.2.1.5.

1.12.2.1.3 Undermodeling

We just saw that the solution to the quadratic cost function (12.47) is indeed able to separate the two components of image when the regressor that appears in physical model (12.57) is the same image that we used to describe our class image, that is, when our choice of image includes the general solution from Box 4. Let us check now what happens when this is not the case.

Assume there is a relation

image (12.60)

in which image is orthogonal to image, but our class image is defined through a regressor image that is a subset of the “correct” one, image. This situation is called undermodeling: our class is not rich enough to describe the true relation between image and image.

Assume then that

image (12.61)

where image, with image. The autocorrelation of image is

image

where image. From our hypothesis that image is orthogonal to image, that is,

image

we conclude that image is also orthogonal to image, so

image

Assuming that image is nonsingular, solving the normal Eq. (12.50) we obtain

image (12.62)

Now we have two interesting special cases: first, if image and image are orthogonal, i.e., if image, then image contains the first image elements of image. Let us consider a specific example: assume that

image (12.63)

with image. Then image. This situation is very common in practice.

In this case, image when image is zero-mean white noise. This is one reason why white noise is the preferred input to be used in system identification: even in the (very likely in practice) case in which image, at least the first elements of image are identified without bias.

On the other hand, if image, then image is a mixture of the elements of image. This happens in (12.63) when the input sequence image is not white. In this case, the optimum filter image takes advantage of the correlation between the entries of image and image to estimate image given image, and uses these estimated values to obtain a better approximation for image. Sure enough, if you write down the problem of approximating image linearly from image, namely,

image

you will see that the solution is exactly image.

What is important to notice is that in both cases the optimum error image is not equal to image:

image

Partitioning image, with image, we have

image

Although image, you may check that indeed image.
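The effect described above is easy to reproduce numerically. The sketch below compares the undermodeled Wiener solution obtained with a white input and with a colored (AR(1)) input; the hypothetical system, the amount of undermodeling, and the AR coefficient are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
h = np.array([0.5, -0.4, 0.2, 0.1])      # hypothetical 4-tap system (assumption)
N, M = 50000, 2                          # we deliberately undermodel with M = 2 taps

def wiener(x, d, M):
    """Undermodeled Wiener solution from sample estimates of R and p."""
    X = np.column_stack([np.concatenate((np.zeros(k), x[:len(x) - k])) for k in range(M)])
    R = X.T @ X / len(x)
    p = X.T @ d / len(x)
    return np.linalg.solve(R, p)

# White input: the first M coefficients are identified without bias.
x_w = rng.standard_normal(N)
d_w = np.convolve(x_w, h)[:N]
print(wiener(x_w, d_w, M))               # close to [0.5, -0.4]

# Colored input (AR(1)): the optimum mixes in the unmodeled taps and is biased.
x_c = np.zeros(N)
u = rng.standard_normal(N)
for n in range(1, N):
    x_c[n] = 0.9 * x_c[n - 1] + u[n]
d_c = np.convolve(x_c, h)[:N]
print(wiener(x_c, d_c, M))               # no longer close to [0.5, -0.4]
```

With the white input the two estimated coefficients match the first two taps of the assumed system; with the colored input they do not, illustrating the bias discussed above.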

As another example, assume that we have two variables such that image, in which image has zero mean and is independent of image, and image and image are constants. Assume that we choose as regressor simply image, so we try to solve

image

In this case we have

image

since image. The optimum solution is thus

image

and the optimum error is

image

However, image is again orthogonal to image (but not to image, which we did not include in the regressor).

What happens in the opposite situation, when we include more entries in image than necessary? As one would expect, in this case the parameters related to these unnecessary entries will be zeroed in the optimum solution image. This does not mean, however, that we should hasten to increase image to make image as large as possible, and leave to the filter the task of zeroing whatever was not necessary. This is because increasing the number of parameters to be estimated has a cost. The first and more obvious cost is in terms of memory and number of computations. However, we saw in Section 1.12.1.2 that an actual adaptive filter usually does not converge to the exact optimum solution, but rather hovers around it. Because of this, increasing the number of parameters may in fact decrease the quality of the solution. We explain this in more detail in Section 1.12.4.

1.12.2.1.4 Zero and non-zero mean variables

Up to now we did not assume anything about the means of image and image. If both image and image have zero mean, we can interpret all autocorrelations as variances or covariance matrices, according to the case. To see this, assume that

image (12.64)

Then image, and thus

image

so the optimum error has zero mean. In this case, image is the variance of image, and image is the variance of image, so (12.55) becomes

image (12.65)

Since image is positive-definite, we conclude that the uncertainty in image (its variance) is never larger than the uncertainty in image (Box 3).

However, image may have a nonzero mean if either image or image has nonzero mean. This is verified as follows. Define

image (12.66)

then

image

in general (for example, if image, but image).

In many applications it is useful to guarantee that image has zero mean, either because image should approximate a signal that is constrained to have zero mean, such as speech or noise, or because image will be passed through a system with a high DC gain (such as an integrator). How can this be done if the means of image and image are not zero? This problem is quite common, particularly when image includes nonlinear functions of image. Fortunately, there are two easy solutions: First, if one knew the means of image and image, one could define zero-mean variables

image (12.67)

and use them instead of the original variables in (12.47). This is the simplest solution, but we often do not know the means image and image.

The second solution is used when the means are not known: simply add an extra term to image, defining an extended regressor (not to be confused with the extended regressor from Section 1.12.2.1.3)

image (12.68)

and an extended weight vector image. The first coefficient of image will take care of the means, as we show next. The extended cross-correlation vector and autocorrelation matrix are

image

Assuming that image and image are nonsingular, we can compute image if we evaluate the inverse of image using Schur complements (See Fact 3 in Box 3), as follows

image

Using this result, we can now evaluate image as follows (using the fact, from Box 3, that the inverse of a symmetric matrix is also symmetric):

image
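As a numerical illustration of this second solution, the sketch below appends a constant entry to the regressor so that the extra coefficient absorbs the means and the optimum error becomes (sample) zero-mean; the hypothetical relation between the two signals and all constants are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 20000

# Hypothetical data with nonzero means (all constants are assumptions for this sketch).
x = 1.0 + rng.standard_normal(N)                   # input with mean 1
d = 2.0 + 0.7 * x + 0.1 * rng.standard_normal(N)   # desired signal, also with nonzero mean

def lls(X, d):
    """Least-mean-squares solution from sample estimates: solves R w = p."""
    R = X.T @ X / len(d)
    p = X.T @ d / len(d)
    return np.linalg.solve(R, p)

# Plain regressor: the optimum error has a clearly nonzero mean.
w1 = lls(x[:, None], d)
e1 = d - x * w1[0]
print(e1.mean())                 # about 1.0 here, far from zero

# Extended regressor [1, x(n)]^T: the extra coefficient absorbs the means,
# and the optimum error becomes (sample) zero-mean.
Xe = np.column_stack([np.ones(N), x])
w2 = lls(Xe, d)
e2 = d - Xe @ w2
print(e2.mean())                 # essentially zero
```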

1.12.2.1.5 Sufficiently rich signals

We saw in Section 1.12.2.1.2 that even when image is singular, we can recover the optimum image (with probability one), but not the optimum image. Let us study this problem in more detail, since it is important for system identification. So, what does it mean in practice to have a singular autocorrelation matrix image? The answer to this question can be quite complicated when the input signals are nonstationary (i.e., when image is not constant), so in this case we refer the interested reader to texts in adaptive control (such as [18]), which treat this problem in detail (although usually from a deterministic point of view). We consider here only the case in which the signals image are jointly wide-sense stationary.

In this case, image may be singular because some entries of image are linear combinations of the others. For example, if we choose image and image, then

image

which is singular.

However, there are more subtle ways in which image may become singular. Assume for example that our input sequence is given by

image (12.69)

where image is the frequency, and image is a random phase, uniformly distributed in the interval image. This kind of model (sinusoidal signals with random phases), by the way, is the bridge between the periodic signals from Section 1.12.1.2 and the stochastic models discussed in this section. In this case, consider a vector image formed from a delay line with image as input,

image

If image, we have

image

The determinant of image is image if image.

However, for any image, the resulting image will be singular: we can write

image

and since image, we have

image

Since image is the sum of two rank-one matrices, its rank is at most two. It will be two when image, since in this case the vectors are linearly independent. We conclude that when the input signal is a sinusoid as in (12.69), the optimum solution to (12.47) will never be able to identify more than two coefficients. This argument can be generalized: if image is composed of image sinusoids of different frequencies, then image will be nonsingular if and only if image.
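This rank argument is easy to verify numerically. The sketch below builds the theoretical autocorrelation matrix of a delay line driven by random-phase sinusoids, using the standard result that a unit-amplitude sinusoid with random phase has autocorrelation r(k) = 0.5 cos(omega k); the frequencies and the delay-line length are assumptions made only for this example.

```python
import numpy as np

M = 6                                   # delay-line length (assumption for this sketch)
lags = np.arange(M)[:, None] - np.arange(M)[None, :]

# A unit-amplitude sinusoid with random phase has autocorrelation r(k) = 0.5 cos(omega k).
omega1 = 0.3 * np.pi
R1 = 0.5 * np.cos(omega1 * lags)
print(np.linalg.matrix_rank(R1))        # 2: at most two coefficients can be identified

# A second sinusoid at a different frequency adds two more degrees of freedom.
omega2 = 0.7 * np.pi
R2 = R1 + 0.5 * np.cos(omega2 * lags)
print(np.linalg.matrix_rank(R2))        # 4: richer, but still singular for M = 6

# White noise gives a full-rank (nonsingular) autocorrelation matrix for any M.
R3 = np.eye(M)                          # unit-variance white noise
print(np.linalg.matrix_rank(R3))        # 6: sufficiently rich for any 6-tap filter
```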

The general conclusion is that image will be singular or not, depending on which functions are chosen as inputs, and on the input signal. We say that the input is sufficiently rich for a given problem if it is such that image is nonsingular. We show a few examples next.

1.12.2.1.6 Examples

The concepts we saw in Sections 1.12.2.1.2–1.12.2.1.5 can be understood intuitively if we return to our example in Section 1.12.1.2. In that example, the true echo was obtained by filtering the input image by an unknown image, so we had

image

where image is the true order of the echo path, image and image. We considered only one harmonic for simplicity.

We are trying to cancel the echo by constructing an approximate transfer function image, which is modeled as an FIR filter with image coefficients. As we saw in Section 1.12.1.2, if image is a simple sinusoid, the echo will be completely canceled if we satisfy (12.8), reproduced below

image

If image, so image and image, this condition becomes

image

with a single solution image.

On the other hand, if the input were as before (a single sinusoid), but image, so image, the equations would be

image

and the solution would be unique (since image is nonsingular), but would not approximate image and image in general.

This is an example of an undermodeled system, as we saw in Section 1.12.2.1.3. Figure 12.20 shows the LMS algorithm derived in Section 1.12.1.2 applied to this problem. Note that, even though image do not converge to image, in this example the error image does approximate image well. This happens because in this case image is periodic, and the term image can be obtained exactly as a linear combination of image and image (check that, using the expressions for image from Section 1.12.2.1.3, we obtain image in this case).

image

Figure 12.20 LMS in undermodeled case.

If we tried to identify image by adding an extra coefficient to image, the equations would be

image (12.70)

Now image, and image is a solution, but not the only one. Depending on the initial condition, the filter would converge to a different solution, as can be seen in Figure 12.21 (the level sets are not shown, because they are not closed curves in this case). The input is not rich enough (in harmonics) to excite the unknown system image so that the adaptive filter might identify it, as we saw in Section 1.12.2.1.5. However, for all solutions the error converges to image.

image

Figure 12.21 LMS under non-sufficiently rich input.

If the input had a second harmonic image, we would add two more equations to (12.70), corresponding to image, and the filter would be able to identify systems with up to four coefficients.

The important message is that the adaptive filter only “sees” the error image. If a mismatch between the true echo path image and the estimated one image is not observable by looking only at the error signal, the filter will not converge to image. Whether a mismatch is observable or not through the error depends on both image and the input signal being used: the input must excite a large enough number of frequencies so that image can be estimated correctly. For this reason, white noise is a good input for system identification.

1.12.2.2 Complex variables and multi-channel filtering

Adaptive filters used in important applications such as communications and radar have complex variables as input signals. In this section, we extend the results of Section 1.12.2.1 to consider complex signals. We start with the general solution, then restrict the result to the more usual case of circular signals.

We first show that the general solution for complex signals is equivalent to an adaptive filter with two inputs and two outputs: Indeed, if all input variables are complex, then the error image will also be complex, that is, image, where image are the real and imaginary parts of image. We must then minimize image, the total power of the complex signal image. The quadratic cost function (12.45) must be changed to

image

where image is the complex conjugate of image. Now, we need to be careful about what exactly is the coefficient vector image in this case: If the regressor is image and image, in principle image should be such that both its real and imaginary parts depend on both the real and imaginary parts of image, that is,

image (12.71)

Note that there is no a priori reason for us to imagine that there should be a special relation between the pairs image and image. That is, we are dealing with two different coefficient vectors and one extended regressor:

image (12.72)

Let us proceed for the time being by separating the real and imaginary channels of our filter. Later, in Section 1.12.2.2.1, we return to complex algebra. With these definitions, we can write

image (12.73)

The cost function then becomes

image (12.74)

where the autocorrelation matrix and cross-correlation vectors are given by

image (12.75)

where we simplified the notation, writing image instead of image, image instead of image, and so on.

Differentiating (12.74) to find the minimum, we obtain

image (12.76)

Therefore, the optimum filters satisfy

image (12.77)

The optimum error is then

image (12.78)

There is also an orthogonality condition for this case: using (12.77), we obtain

image (12.79a)

image (12.79b)

The value of the cost function at the minimum can be obtained using this result. As in the case of real variables,

image

where we used (12.79a) in the second equality. Doing similarly for image, we obtain

image (12.80a)

image (12.80b)

image (12.80c)

What we have just described is a two-input/two-output optimal filter. In the next section, we re-derive it using complex algebra. Note that multi-channel filters have applications that are unrelated to complex variables, particularly in control [18], stereo echo cancellation [19,20], and active noise control [21–24].

1.12.2.2.1 Widely-linear complex least-mean squares

This general solution can be obtained in a more straightforward way if we use the complex gradients described in Box 6. Keeping the regressor image as in (12.72) and defining the extended complex weight vector as image, we can write the cost function as

image

where we defined image and image. Note that image represents the Hermitian transpose, that is, transpose followed by conjugation. Differentiating image with respect to image, as described in Box 6, we obtain

image (12.81)

The normal equations then become

image (12.82)

Expanding the real and imaginary parts of (12.82), we recover (12.77). Similarly, the optimum error and the minimum value of the cost are

image (12.83)

Comparing with (12.80), we see that image, and the value of image is the same as before. However, as you may have noticed, using a complex weight vector image and the complex derivatives of Box 6 makes all the expressions much simpler to derive.

The filter using a complex weight vector image and the real regressor image constitutes a version of what is known as widely-linear complex filter. This version was recently proposed in [25] as a reduced-complexity alternative to the widely-linear complex filter originally proposed in [26,27], which uses as regressor the extended vector

image (12.84)

composed of the original regressor image and its complex conjugate image (computational complexity is reduced because the number of operations required to compute image is about half of what is necessary to evaluate image). Widely-linear estimation has been receiving much attention lately, due to new applications that have appeared [28–31].

But why the name widely linear? The reason for this name is that the complex regressor image does not appear linearly in the expressions, but only by separating its real and imaginary parts (as in image) or by including its complex conjugate (as in image). Both these expressions, from the point of view of a complex variable, are nonlinear.

What would be a linear complex filter, then? This is the subject of the next section.

1.12.2.2.2 Linear complex least-mean squares

There are many applications in which the data and regressor are such that their real and imaginary parts are similarly distributed, so that

image (12.85)

Under these conditions, we say that the pair image is complex circular, or simply circular.

In this case, we can rewrite (12.77) as

image

Expand these equations as follows (notice the change in order in (12.86b))

image (12.86a)

image (12.86b)

Now we can see that, if image is a solution to (12.86a), then

image (12.87)

is a solution to (12.86b). Therefore, when the input signals are circular, the number of degrees of freedom in the problem is actually smaller. In this case, there is a very nice way of recovering the solutions, working only with complex algebra. Indeed, defining image, we have

image (12.88)

which is equivalent to (12.71) with the identities (12.87). Using this definition, it is possible to obtain the optimum solution (12.86a), (12.86b), and (12.87) using only complex algebra, following exactly the same steps as our derivation for the real case in the previous section.

This path is very useful: we can derive almost all results for real or complex circular adaptive filters using the same arguments. Let us see how it goes. First, define image, so our problem becomes

image (12.89)

Expanding the cost, recalling that now image, we obtain

image (12.90)

where we defined image.

We need to find the minimum of this cost. One easy solution is to use the rules for differentiation of real functions of complex variables, as described in Box 6. These rules are very useful, and surprisingly easy to use: basically, we can treat image and its conjugate image as if they were independent variables, that is

image (12.91)

Equating the gradient to zero, the solution is (assuming that image is nonsingular)

image (12.92)

which is exactly equal to (12.51). This is the advantage of this approach: the expressions for complex and real problems are almost equal. The important differences are the conjugates that appear in the definitions of image and image, and the absence of a factor of image in the gradient (12.91). This follows from the definition of complex derivative we used (see Box 6). Here this difference is canceled out in the solution (12.92), but further on we will find situations in which there will be a different factor for the case of real and for the case of complex variables.

Let us check that (12.92) is really equivalent to (12.86a). Expanding image and image and using the circularity conditions (12.85), we obtain

image (12.93a)

image (12.93b)

and

image (12.94)

so, equating the gradient (12.91) to zero, image must satisfy

image (12.95)

Comparing the real and imaginary parts of (12.95) with (12.86a), we see that they indeed correspond to the same set of equations.

A few important remarks:

Remark 5

The orthogonality condition still holds in the complex circular case, but with a small difference: defining the optimum error as before,

image

we obtain, noting that image,

image (12.96)

Remark 6

The complex gradient (12.91) seems strange because it does not include the factors of 2 that would appear in the real case. This difference will appear in many expressions in the following sections, whenever we need to differentiate between the real and complex cases. However, the difference is illusory: if you go back to (12.93b) and (12.94), you will see that the factor of two is actually there, just hidden by the definition of autocorrelation and cross-correlation under the circularity conditions.

Remark 7

The optimum error power can be obtained as in the real case, and is equal to

image (12.97)
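As a numerical illustration of (12.92), the sketch below estimates a complex Wiener solution from sample averages for a circular white input and checks the orthogonality condition (12.96). The hypothetical system, noise level, and data length are assumptions made only for this example; we adopt the convention that the filter output is the plain (unconjugated) inner product of the weight vector with the regressor, so the conjugates may sit in slightly different places than in the chapter's own definitions.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 20000, 3

# Circular (proper) complex white input and a hypothetical complex system.
x = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
h = np.array([0.5 + 0.2j, -0.3j, 0.1 - 0.1j])          # assumption for this sketch
v = 0.05 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
d = np.convolve(x, h)[:N] + v

# Convention used here: the filter output is w^T x(n) (no conjugation), so the
# normal equations read E[x* x^T] w = E[x* d]; other conventions move the
# conjugates around, but the resulting optimal filter is the same.
X = np.column_stack([np.concatenate((np.zeros(k, dtype=complex), x[:N - k])) for k in range(M)])
R = (X.conj().T @ X) / N
p = (X.conj().T @ d) / N

w_o = np.linalg.solve(R, p)
print(w_o)                                             # close to h

# Orthogonality check: the optimum error is (numerically) orthogonal to the regressor.
e_o = d - X @ w_o
print(np.abs(X.conj().T @ e_o) / N)                    # all entries close to zero
```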

Box 4:

Least mean-squares estimation

Here we consider the general problem of finding the optimum relation image between image and image, such that

image (12.98)

without restricting image to a particular class.

In order to understand this problem well, let us consider first the simpler case where the relation between image and image is memoryless, that is, our function depends only on the current value of image. Dropping the time index to simplify the notation, our problem then becomes how to find a function image that, applied to a certain random variable image, approximates a certain random variable image so that the mean-square error is minimized, that is, we want to solve

image (12.99)

Assume that the joint probability density function image of image and image is known (we are assuming that image and image are continuous random variables, but the final result in (12.100) also holds if they are discrete). Then,

image (12.100)

where we use the subscripts image and image to highlight that the expectations are taken with respect to image or image only.

Note that the inner integrand image in (12.100) is nonnegative. Therefore, if we minimize it for each value of image, we will obtain the minimum of the full expression. That is, the optimum image is the function defined by

image (12.101)

It is important to remember that, for fixed image, this is simply a number. This can be seen more easily by expanding the expectations in (12.101)

image

Differentiating with respect to image and equating to zero, we obtain the solution

image (12.102)

which is indeed a function of image, as desired.

Let us study the properties of this solution. First, define

image

Note that image and image are not necessarily equal to image and image from (12.40). Evaluating the average of image, we obtain

image

by an argument similar to that used to obtain (12.100). We conclude that image has zero mean. In addition, for any function image of image for which the expected values exist, we have

image (12.103)

since image. This result means that the error image is uncorrelated with any function of the reference data image, which is a very strong condition. It basically means that we have extracted all second-order information available in image about image.

We remark that (12.103) does not imply that image and image are independent. A simple counter-example is the following: consider two discrete-valued random variables image and image with joint probabilities given by Table 12.1. Then image, but image and image are not independent.

Table 12.1

Counter-Example: (12.103) does not Imply Independence

Image

In the more general case in which image is allowed to depend on previous samples of image, the arguments and the final conclusion are similar:

image (12.104)

and defining

image

it holds that image has zero mean and is uncorrelated with any function of image:

image (12.105)

Note that, in general, the optimal solution of (12.104) depends on image, so we should write

image (12.106)

to make the time dependence explicit.

We can now return to the question at the start of this section: From our results thus far, we see that minimizing image leads to a model which relates image and the sequence image such that

image

in which image is uncorrelated with any function image of image. In other words, the solution of (12.104) imposes such a model on our data, even if this model makes no physical sense at all. This is an important point: such a model will always result from the minimization of the mean-square error, at least as long as the statistics of image and image are such that the expected values in (12.104) are finite.

Now, what if we knew beforehand, by physical arguments about the data, that a model such as (12.40) holds between image and image, such that image has zero mean and is uncorrelated with any function of image? Will the solution of (12.104) then be such that

image

To answer this question, let us go back to (12.99), and use (12.40) to find the solution. We thus have (we omit the arguments of image and image for simplicity)

image

Since image is orthogonal to any function of image by assumption, it will be orthogonal to image and image in particular. Therefore, we can simplify the cost as

image (12.107)

Both terms in (12.107) are nonnegative, and only the first depends on image. Therefore, the solution is indeed image and image, as one would wish (to be precise, image and image could differ on a set of probability zero).

We conclude that we can use (12.102) as a criterion for separating the two parts image and image of our model for image if and only if image satisfies (12.105). This holds, for example, if image has zero mean and is independent of image for all image.

Note that since image is itself a function of image, (12.105) implies that

image

and we have

image

Recalling that image, if we add and subtract image to the last result, we conclude that

image (12.108)

where image and image are the variances of image and of image.

The solution we just found is very general (it assumes little about image), and there are ways of computing it adaptively from the data, using for example algorithms known as particle filters [32]. The solution requires, however, that one builds approximations for the joint probability distribution of image and image, which in general requires a large amount of data, resulting in relatively high complexity and relatively slow convergence. If you know more about the relation between image and image, this information might be used to constrain the search to a smaller class of problems, potentially leading to simpler algorithms with faster convergence. That is the goal of adaptive filters, and the reason why we restricted ourselves to this case in the main text.
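The gap between the general (conditional-mean) solution and a restricted, linearly parametrized class can be seen in the small Monte Carlo sketch below. The model relating the two variables (a square plus independent noise) and all constants are assumptions made only for this illustration; in this toy case the conditional mean is known in closed form, so it can be compared directly with the best linear estimator.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 200000

# Hypothetical model (assumption for this sketch): d = x^2 + v, with v independent of x.
x = rng.standard_normal(N)
v = 0.1 * rng.standard_normal(N)
d = x**2 + v

# Conditional-mean estimator (known in closed form here): g(x) = E[d | x] = x^2.
mse_cond = np.mean((d - x**2) ** 2)

# Best *linear* estimator of d from x alone (Wiener solution with regressor [1, x]).
X = np.column_stack([np.ones(N), x])
w = np.linalg.solve(X.T @ X / N, X.T @ d / N)
mse_lin = np.mean((d - X @ w) ** 2)

print(mse_cond)   # about 0.01 (the noise power): all structure in x was extracted
print(mse_lin)    # about 2.01: a linear-in-x class cannot capture the relation
```

Including the quadratic term in the regressor, as in a Volterra class, would recover the performance of the general solution in this example.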

Box 5:

Convex functions

Convex sets are such that the line segment between any two points in the set lies entirely in the set (see Figure 12.22). More formally, a set image is convex if

image

image

Figure 12.22 Convexity.

Convex functions, on the other hand, are functions image such that for all image and image,

image

The function is called strictly convex if the inequality is strict (< instead of image). For example, image is strictly convex. Figure 12.23 shows examples of convex and non-convex functions. An important property of strictly convex functions is that they have at most one point of minimum; in particular, they never have sub-optimal local minima.

image

Figure 12.23 Convex functions.

Box 6:

Wirtinger derivatives and minima of functions of a complex variable

Suppose you want to minimize a function

image (12.109)

that is, a function that takes a complex argument to a real value. The minimization problem is well defined, since the image of image is real. However, we cannot use standard arguments about derivatives and stationary points to find the minimum, because a function like this is never analytic, that is, the standard derivative used in complex analysis does not exist.

To see why, we need to think about image as a function of two variables, image and image, the real and imaginary parts of image. Take image for example. We can write it as a function of image and image as

image (12.110)

Since it is a function of two variables, the derivatives in principle depend on the direction we are moving along. Consider the derivative at a point image along the image axis, that is, keeping image constant:

image

On the other hand, along the image axis, we have

image

that is, the derivative at point image is different according to the direction we are moving. This is essentially what is meant by saying that the function is not analytic at image.

For contrast, consider now the analytic function image. Its derivative is the same, no matter which direction we take:

image

Similarly, differentiating along the image axis we obtain again the same result

image

When the derivative at a certain point image is the same along every direction, we can define derivatives of complex functions by writing simply

image (12.111)

since we know the limit does not depend on the direction along which image goes to zero.

It can be shown that a function is analytic at a given point only if the Cauchy-Riemann conditions hold at that point. Separating a generic complex function image into its real and imaginary parts image, the Cauchy-Riemann conditions are [33]

image

In the case of a real function of a complex variable, such as image defined as in (12.109), image, and of course the conditions could hold only for points in which the real part has zero partial derivatives.

This means that we cannot search for the stationary points of a function such as (12.109) using the derivative defined as in (12.111). We could of course always go back and interpret the function as a function of two variables, as we did in (12.110). In order to find the minimum of image, we would need to find the stationary points of image

image

This works just fine, of course, but there is a nice trick, which allows us to keep working with complex variables in a simple way. The idea is to define derivatives differently, such that optimization problems become easy to solve, using what is known as Wirtinger calculus.

Real functions of complex variables are usually written in terms of the variables and their conjugates, since image and image are both real. The idea is to define derivatives such that image and image can be treated as if they were independent variables, as follows. Despite the motivation, the new derivatives are defined for general functions image.

Define partial derivatives of a function image with respect to image and image as

image (12.112)

You can check that if image is analytic at a given point (so image satisfies the Cauchy-Riemann conditions at that point), then image at that point.

Consider, for example, image. Then,

image

As advertised, the final result is what we would obtain if we had treated image and image as two independent variables. Similarly, you can check that

image

An important special case is the quadratic function image, which assumes only real values if image and image are real. Using the previous result, we see that

image

Note that there is no factor of image in this case. This difference between Wirtinger derivatives and standard real derivatives will propagate to many expressions in this text.

Now, the definitions (12.112) are useful for minimizing real functions of complex variables as in (12.109), since from (12.112)

image

Applying this idea to our quadratic function, we see that

image

if image. In addition,

image

and we can conclude that image is a minimum as long as image.

The same conclusions could easily be obtained working directly with image. The Wirtinger derivatives are merely a shortcut that makes working with complex variables very similar to working with real variables. A good detailed description of Wirtinger derivatives can be found in [34].
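A quick numerical check of these rules for the scalar function f(z) = |z|^2 is sketched below, using the usual Wirtinger definitions with the 1/2 factor (assumed here to match (12.112)); the test point is arbitrary.

```python
import numpy as np

# Numerical check of the Wirtinger derivatives of f(z) = |z|^2 at an arbitrary point.
z0 = 0.7 - 1.3j          # test point (assumption for this sketch)
h = 1e-6

f = lambda z: (z * np.conj(z)).real

# Partial derivatives with respect to the real and imaginary parts of z.
df_dx = (f(z0 + h) - f(z0 - h)) / (2 * h)
df_dy = (f(z0 + 1j * h) - f(z0 - 1j * h)) / (2 * h)

# Wirtinger derivatives: d/dz = (d/dx - j d/dy)/2 and d/dz* = (d/dx + j d/dy)/2.
df_dz = 0.5 * (df_dx - 1j * df_dy)
df_dzc = 0.5 * (df_dx + 1j * df_dy)

print(df_dz, np.conj(z0))   # both equal z0*: treating z* as constant gives d|z|^2/dz = z*
print(df_dzc, z0)           # both equal z0 : treating z as constant gives d|z|^2/dz* = z
```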

In the vector case, the definitions and results are very similar. Given a function

image

with image, image, we define

image

Using our definition of gradients, the second derivative of image (its Hessian) is

image

The most important example is the quadratic function

image

where image is a real variable, and image is Hermitian symmetric. Then, we have

image

We see that a stationary point of image is such that image. In addition,

image

and we conclude that image is the single point of minimum of image if and only if image is positive-definite.

Note that in the complex case, the gradient and the Hessian do not have the factors of image that we saw in the real case (see Box 1).

1.12.3 Stochastic algorithms

In this section, we focus on three of the most important adaptive algorithms: the least-mean squares (LMS) algorithm, the normalized least-mean squares (NLMS) algorithm, and the recursive least-squares (RLS) algorithm, which are the basis of many other algorithms. These three stochastic algorithms are derived in detail and analyzed using different methods. We present different approaches for analyzing an adaptive filter, because the variety and complexity of their behavior makes it difficult for a single technique to be able to handle all situations. The analyses provided here allow us to predict the behavior of the algorithms in transient and steady-state, and choose values of parameters (such as the step-size) for which the filters operate adequately. In particular, we study conditions under which the filters will remain stable.

We assume linearly parametrized classes, such as FIR image and Volterra image filters (see Section 1.12.2.1). In these classes, each element of the input regressor vector is given by a certain function image of image. In the case of length-image FIR filters, we have

image (12.113)

whereas in the case of second-order Volterra filters with memory image and real-valued signals, we have

image (12.114)

As much as possible, our presentation in this section is independent of the choices of image and image. However, some algorithms are designed for a specific class image (the most common case is that of fast algorithms designed for the FIR filters (12.113)). In these cases, we will alert the reader to the restriction.

Using the results of Box 6, the derivation of algorithms for real or circular complex inputs is very similar, the only differences being the need to conjugate a few variables in some expressions, and the appearance of a factor image in the expressions of gradients of functions of real variables, and image in the complex case.

In general, the output of a length-image adaptive filter is given by

image (12.115)

In the case of Volterra filters, it is common to use the notation image for the weight related to the input image (see Eq. (12.43)). However, to obtain a common formulation independent of the class image, we denote the weight vector as

image (12.116)

The task of the adaptive filter will be to choose, in a recursive manner, the values of the parameters image that best fit the data at each time instant image.

Figure 12.24 shows the scheme of a length-image adaptive filter for linearly parametrized classes, where image denotes the set of functions image, image of image that generate the elements image of the input regressor vector image. The number of samples of the signal image depends on the class image and is given by image. An FIR filter with length image requires image samples. On the other hand, a Volterra filter with memory image, requires image samples, but the length of the corresponding adaptive filter depends on the order of the Volterra series expansion. At each time instant, the output of the adaptive filter image is compared with the desired signal image to compute the error image. In order to minimize a cost function of this error, a stochastic algorithm uses image to adjust the filter weights.

image

Figure 12.24 Scheme for a length-image adaptive filter assuming linearly parametrized classes.

1.12.3.1 LMS algorithm

The update equation of the steepest-descent algorithm is given by

image (12.117)

where image is a step-size and image (resp., image) for complex (resp., real) data (see Box 6). This algorithm requires exact measurements of the gradient vector. For the case of minimization of the mean-square error, that is, if image,

image (12.118)

at each iteration image (see Box 1 for an introduction to gradient algorithms). Note that the exact gradient requires prior knowledge of the correlation matrix image and the cross-correlation vector image, which is not feasible in real-time applications. For example, in the hands-free telephone application described in Section 1.12.1.1, the input signal of the adaptive filter is the far-end speech (see Figures 12.3 and 12.4). A speech signal is nonstationary and therefore, it is not possible to obtain exact estimates for image and image at each time instant. Furthermore, good estimates would demand much processing time, which could insert an inadmissible delay in the system, making this approach infeasible, except in off-line applications. Similar restrictions appear in virtually all adaptive filtering applications.

To circumvent this problem, a stochastic version of the steepest descent algorithm was proposed, the least-mean squares (LMS) algorithm. Instead of minimizing the exact mean-square error, the LMS algorithm minimizes a straightforward approximation to it: image. The idea, as we saw in Section 1.12.1.2.4, is that the algorithm will approximately average the cost-function if adaptation is slow (small step-size). The estimation error is

image (12.119)

so the gradient of the instantaneous cost function is

image (12.120)

Since image depends on the old estimate of the weight vector, it is called an a priori estimation error, as opposed to the a posteriori error that will be introduced in (12.129). Instead of minimizing the instantaneous cost function, we can obtain the stochastic gradient defined in (12.120) by replacing the instantaneous estimates

image (12.121)

and

image (12.122)

in (12.118).

Replacing (12.120) in (12.117), we arrive at the update equation for LMS, i.e.,

image (12.123)

As explained in Section 1.12.1.2.4 and illustrated in the animations of Figure 12.9, although there is no explicit average being taken in the LMS algorithm, it in fact computes an implicit approximate average if the step-size is small.
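A minimal sketch of the LMS recursion (12.123), for real-valued signals and a length-M FIR regressor, is given below; the hypothetical system to be identified, the noise level, and the step-size are assumptions made only for this example. For complex signals the update would involve a conjugated regressor, following the conventions of Box 6.

```python
import numpy as np

def lms(x, d, M, mu):
    """LMS with an M-tap FIR regressor (real-valued signals)."""
    N = len(x)
    w = np.zeros(M)
    e = np.zeros(N)
    for n in range(N):
        xn = np.array([x[n - k] if n - k >= 0 else 0.0 for k in range(M)])  # regressor
        y = w @ xn                      # filter output
        e[n] = d[n] - y                 # a priori estimation error
        w = w + mu * e[n] * xn          # stochastic-gradient update
    return w, e

# Hypothetical system-identification run (system, noise, and step-size are assumptions).
rng = np.random.default_rng(6)
h = np.array([0.5, -0.4, 0.2])
N = 5000
x = rng.standard_normal(N)
d = np.convolve(x, h)[:N] + 0.01 * rng.standard_normal(N)

w, e = lms(x, d, M=3, mu=0.01)
print(w)        # close to h after convergence
```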

The LMS algorithm is summarized in Table 12.2. Its computational cost increases linearly with the length of the filter, i.e., the computational cost is image. Table 12.3 shows the number of real multiplications and real additions per iteration for real and complex-valued signals. The quantities inside brackets were computed in previous steps. We assume that a complex multiplication requires 4 real multiplications and 2 real additions (it is possible to use fewer multiplications and more additions), and that a complex addition requires 2 real additions. Different ways of carrying out the calculations may result in a different computational cost. For example, if we first computed image followed by image, LMS would require image more real multiplications per iteration than the ordering of Table 12.3, considering real-valued signals. For complex-valued signals, the cost would be even higher, since LMS would require image and image more real multiplications and real additions, respectively.

Table 12.2

Summary of the LMS Algorithm

Image

Table 12.3

Computational Cost of the LMS Algorithm for Complex and Real-Valued Signals in Terms of Real Multiplications and Real Additions Per Iteration

Image

1.12.3.1.1 A deterministic approach for the stability of LMS

Finding the range of step-sizes such that the recursion (12.123) converges is a complicated task. There are two main approaches to finding the maximum step-size: an average approach and a deterministic (worst-case) approach. The average approach will be treated in Section 1.12.4.3. Since the deterministic approach is easy to explain intuitively, we address it in the sequel (although a precise proof is not so simple; see [35,36] for a full analysis).

For simplicity, assume that our filter is such that image. Although this is not a very likely situation to encounter in practice, it allows us to avoid technical details that would complicate considerably the discussion, and still arrive at the important conclusions.

Let us rewrite the LMS recursion, expanding the error with image as image to obtain

image (12.124)

This is a linear system, but not a time-invariant one. It will be stable if the eigenvalues image of the matrix image satisfy

image (12.125)

Be careful: if image were not identically zero, condition (12.125) would have to be slightly modified to guarantee stability (see, e.g., [36,37]).

The eigenvalues of image are easy to find, if we note that image is an eigenvector of image (the case image is trivial):

image

where image is the square of the Euclidean norm. The corresponding eigenvalue is image. All other eigenvalues are equal to one (the eigenvectors are all vectors orthogonal to image.) Therefore, the LMS algorithm (12.124) will be stable whenever

image (12.126)

The supremum (image) of a sequence is an extension of the maximum, defined to avoid a technicality. Consider the sequence image. Strictly speaking, it has no maximum, since there is no image for which image for all image. The supremum is defined so we can avoid this problem: image. Therefore, the supremum is equal to the maximum whenever the latter exists, but is also defined for some sequences that do not have a maximum. The infimum image has a similar relation to the minimum.

This condition is sufficient for stability, but is too conservative, since image is not at its highest value at all instants. A popular solution is to use normalization. The idea is simple: adopt a time-variant step-size image such that

image (12.127)

with image and where image is used to avoid division by zero. The resulting algorithm is known as normalized LMS (NLMS). We will talk more about it in the next section.

1.12.3.2 Normalized LMS algorithm

One problem with the LMS algorithm is how to choose the step-size. How large should it be to enable a high convergence rate, provide an acceptable misadjustment, and even ensure the stability of LMS? More importantly, how can this choice be made so that the filter will work correctly even for very different input signals? One answer to this problem is the normalized LMS algorithm, which we describe now. It uses a time varying step-size, which is particularly useful when the statistics of the input signals change quickly.

Thus, replacing the fixed step-size image by a time varying step-size image in (12.123), we obtain

image (12.128)

In order to increase the convergence speed, we could try to find image such that the a posteriori estimation error,

image (12.129)

is zeroed. Note that, in contrast to the a priori estimation error defined in (12.119), image depends on the updated estimate of the weight vector, hence the name a posteriori estimation error. Replacing (12.128) in (12.129), we obtain

image (12.130)

In order to enforce image at each time instant image, we must select

image (12.131)

Replacing (12.131) in (12.128), we obtain the update equation of the normalized least-mean squares (NLMS) algorithm, i.e.,

image (12.132)

Note that the time varying step-size image depends inversely on the instantaneous power of the input vector image. Indeed, we will show in Section 1.12.4.3 that the maximum value of the fixed step-size image to ensure the convergence and mean-square stability of LMS depends inversely on the power of the input signal.

In order to make (12.132) more reliable for practical implementations, it is common to introduce two positive constants: a fixed step-size image to control the rate of convergence (and the misadjustment) and the regularization factor image to prevent division by a small value when image is small. With these constants, the NLMS update equation reduces to

image (12.133)

Some authors refer to (12.132) as NLMS and to (12.133) as image-NLMS. However, for simplicity we will refer to both as NLMS. The name “normalized” is due to a most useful property of NLMS: the range of values of image for which the algorithm remains stable is independent of the statistics of image, i.e., image, as can be observed in Section 1.12.3.1.1.
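A minimal sketch of the regularized NLMS update (12.133) for real-valued signals is given below. The identification scenario, including a sudden change in the input power, is an assumption made only for this example, chosen to illustrate the benefit of the normalization.

```python
import numpy as np

def nlms(x, d, M, mu_tilde, eps=1e-6):
    """Regularized NLMS with an M-tap FIR regressor (real-valued signals)."""
    N = len(x)
    w = np.zeros(M)
    e = np.zeros(N)
    for n in range(N):
        xn = np.array([x[n - k] if n - k >= 0 else 0.0 for k in range(M)])
        e[n] = d[n] - w @ xn                                # a priori error
        w = w + (mu_tilde / (eps + xn @ xn)) * e[n] * xn    # normalized step-size
    return w, e

# Same kind of hypothetical identification problem as before, but with an input whose
# power changes abruptly; the normalization keeps the effective step-size well scaled.
rng = np.random.default_rng(7)
h = np.array([0.5, -0.4, 0.2])
N = 5000
x = rng.standard_normal(N)
x[N // 2:] *= 10.0                      # sudden 20 dB increase in input power
d = np.convolve(x, h)[:N] + 0.01 * rng.standard_normal(N)

w, e = nlms(x, d, M=3, mu_tilde=0.5)
print(w)                                # close to h
```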

As pointed out in [3], the NLMS algorithm can also be interpreted as a quasi-Newton algorithm. As briefly explained in Box 1, a quasi-Newton algorithm updates the weights using approximations for the Hessian matrix and gradient vector, i.e.,

image (12.134)

Assuming the following instantaneous approximations

image (12.135)

and

image (12.136)

and replacing them in (12.134), we arrive at the stochastic recursion

image (12.137)

The term image guarantees that (12.136) is invertible (a rank-one matrix such as image is never invertible).

To compute image, we can use the matrix inversion lemma (Box 3), which gives us an expression for the inverse of an image matrix of the form

image (12.138)

where image and image are two square matrices with dimensions image and image, respectively, and image is an image matrix. According to the matrix inversion lemma, the inverse of image is given by

image (12.139)

Thus, identifying

image

we obtain

image (12.140)

Multiplying (12.140) from the right by image, after some algebraic manipulations, we get

image (12.141)

Replacing (12.141) in (12.137), we arrive at (12.133). Since Newton-based algorithms converge faster than gradient-based algorithms, it is expected that NLMS exhibits a higher convergence rate than that of LMS, which is indeed the case in general.

The NLMS algorithm is summarized in Table 12.4. To obtain its computational cost, we can use the same procedure considered in the computation of the number of operations of LMS (see Table 12.3). For complex-valued signals, NLMS requires image real multiplications, image real additions and one real division per iteration. For real-valued signals, it requires image real multiplications and image real additions per iteration. Note that for input vectors with a shift structure such as the FIR filters of Eq. (12.113), its computational cost can be reduced if image is computed by the following recursion

image (12.142)

with initialization image.

Table 12.4

Summary of the NLMS Algorithm

Image

1.12.3.3 RLS algorithm

The convergence rate of LMS-type algorithms varies considerably with the input signal image, since they are based on steepest-descent optimization algorithms. If image is a white noise, the convergence rate is high. On the other hand, if the correlation between successive samples of image is high, LMS-type algorithms converge slowly. The RLS algorithm does not have this drawback. However, this advantage has a price: a considerable increase in the computational cost. Furthermore, numerical errors play a more important role in RLS and may even lead to divergence of the weight estimates. Different versions of RLS were proposed to circumvent this problem. In the sequel, we obtain the equations of the conventional RLS algorithm.

The RLS algorithm can be derived in different ways (much like NLMS). Each form of arriving at the algorithm highlights a different set of its properties: one that leads to many new algorithms and insights is the connection between RLS and Kalman filters described in [38].

We present here shorter routes, starting with a derivation based on deterministic arguments. In fact, RLS solves a deterministic least-squares problem, which is based on the weighted least-squares cost function

image (12.143)

where image is a constant known as forgetting factor,

image (12.144)

are a posteriori errors and

image (12.145)

The factor image emphasizes the most recent errors, forgetting the errors from the remote past. This is important: otherwise, the algorithm would give too much weight to the remote past and would not be able to track variations in the input signals. Note that image minimizes image at each time instant. Thus, RLS provides an exact solution for the least-squares problem at each time instant. Like the NLMS algorithm, RLS seeks to minimize the a posteriori error image, searching for an exact solution to an optimization problem. The difference is that RLS searches for a weight vector image that takes into consideration the whole past history of inputs, minimizing image at each time instant image, not only the current value image as in NLMS.

The error sequence image weighted by image in the interval image can be written in vector form, i.e.,

image (12.146)

where

image (12.147)

image (12.148)

and

image (12.149)

Note that the weighted least-squares cost function can be rewritten in terms of the vector image, i.e.,

image (12.150)

Now, we have to find the weight vector image that minimizes (12.150). The gradient of image with respect to image is given by

image (12.151)

where as usual, image (resp., image) for complex (resp., real) data. Defining

image (12.152)

and

image (12.153)

in which image is defined as before, (12.151) can be rewritten as

image (12.154)

Equating image to the null vector, we get the normal equations

image (12.155)

Note that these are a deterministic version of the normal equations we saw in Section 1.12.2.1. Therefore, to minimize image, the weight vector image must satisfy

image (12.156)

which requires the computation of the inverse correlation matrix image at each time instant. Note that this is only valid if image is nonsingular. In particular, the inverse does not exist if image (Box 3). Assuming image and that the signals image and image are jointly stationary and ergodic, we obtain the following steady-state relations

image

Therefore, the deterministic normal equations converge to the stochastic normal equations when image, and the weight vector obtained from (12.156) converges to the Wiener solution, i.e.,

image (12.157)

We should mention that with image, the convergence rate of RLS goes to zero, i.e., it loses the ability to follow the statistical variations of the observed data. An equivalent result is obtained for LMS or NLMS when the fixed step size image or image is replaced by image [39].

The computational cost of (12.156) is high due to the computation of the matrix inverse. In order to solve the normal equations in real time and at a lower cost, we rewrite (12.152) and (12.153) as recursions, i.e.,

image (12.158)

and

image (12.159)

with initializations image and image, where image is a small positive constant. Note that (12.159) is identical to (12.153), but (12.158) leads to the estimate

image (12.160)

and therefore the initialization guarantees the existence of the inverse at all instants. For image, the initialization image is forgotten and (12.158) becomes close to (12.152) as image increases. Furthermore, if you want the algorithm to forget the initialization quickly, besides using image, you should choose image small (image large). Note, however, that a certain amount of regularization may be useful in practice, to avoid problems when the input signal has long stretches with low power.
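To make the effect of the regularization explicit, note that in one common convention, with the correlation estimate initialized as \boldsymbol{\Phi}(-1) = \delta\,\mathbf{I} for a small \delta > 0 (which may differ in detail from the initialization used above), the recursion (12.158) yields

\boldsymbol{\Phi}(n) \;=\; \lambda^{\,n+1}\,\delta\,\mathbf{I} \;+\; \sum_{i=0}^{n} \lambda^{\,n-i}\,\mathbf{u}(i)\,\mathbf{u}^{H}(i),

so the term due to the initialization decays as \lambda^{n+1} and the estimate approaches the unregularized sum, which is the behavior discussed around (12.160).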

In order to obtain a recursion for the inverse matrix image, we can use the matrix inversion lemma (Box 3) by comparing (12.138) with (12.158) and identifying

image

Thus, using (12.139), we obtain

image (12.161)

Defining image and performing some algebraic manipulations in (12.161), we arrive at

image (12.162)

with initialization image. This equation allows us to compute the inverse correlation matrix recursively, avoiding the explicit computation of a matrix inverse. However, the implementation of (12.162) in finite precision arithmetic requires some care to avoid numerical problems, as explained in Section 1.12.3.3.1.
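In standard notation (again, our choice of symbols, which should correspond to (12.162) up to notation), the recursion obtained from the matrix inversion lemma has the form

\mathbf{P}(n) \;=\; \frac{1}{\lambda}\left[\mathbf{P}(n-1) \;-\; \frac{\mathbf{P}(n-1)\,\mathbf{u}(n)\,\mathbf{u}^{H}(n)\,\mathbf{P}(n-1)}{\lambda + \mathbf{u}^{H}(n)\,\mathbf{P}(n-1)\,\mathbf{u}(n)}\right],

where P(n) denotes the estimate of the inverse correlation matrix. Only matrix-vector products and one scalar division are required, so the cost per iteration is of order M^2 instead of the M^3 of a direct inversion.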

To simplify the following equations, it is convenient to define the column vector

image (12.163)

which is the so-called Kalman gain. Thus, (12.162) can be rewritten as

image (12.164)

Using (12.164) and some algebraic manipulations in (12.163), we arrive at

image (12.165)

Thus, using (12.159) and (12.165) in (12.156), the weight vector image can also be computed recursively, i.e.,

image (12.166)

Replacing (12.164) in (12.166), we arrive at

image (12.167)

The RLS algorithm was derived here as an exact solution to the least-squares problem. However, it can also be interpreted as a quasi-Newton algorithm, assuming the instantaneous approximation for the gradient vector as in (12.135) and considering image as an approximation for the inverse Hessian matrix. Replacing these approximations in (12.134), we arrive at the same stochastic recursion (12.167).

The RLS algorithm is summarized in Table 12.5. Its computational cost increases with image, much faster than those of LMS and NLMS. Different ways of carrying out the calculations may result in a slightly different computational cost, but all with the same order of magnitude, i.e., image. Performing the calculations in the following order [3]:

image

the conventional RLS requires image real multiplications, image real additions, and one real division per iteration for complex-valued signals. For real-valued data, it requires image real multiplications, image real additions, and one real division per iteration. Note that the quantities inside brackets were computed in previous steps.

Table 12.5

Summary of the Conventional RLS Algorithm

Image
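For concreteness, a minimal Python/NumPy sketch of one iteration of conventional exponentially weighted RLS is given below. It follows the standard textbook recursions, which should correspond, up to notation, to (12.163)–(12.167) and Table 12.5; the function and variable names (rls_update, lam, P, and so on) are our own, and the symmetrization of P at the end anticipates the numerical-robustness discussion of the next subsection.

import numpy as np

def rls_update(w, P, u, d, lam=0.99):
    # One conventional RLS iteration (hedged sketch of the standard form).
    # w : weight vector, shape (M,)     P : inverse correlation estimate, (M, M)
    # u : regressor, shape (M,)         d : desired sample (scalar)
    # lam : forgetting factor, 0 << lam <= 1 (assumed parameter name)
    Pu = P @ u                                   # P(n-1) u(n)
    denom = lam + np.real(np.vdot(u, Pu))        # lam + u^H P(n-1) u
    k = Pu / denom                               # Kalman gain k(n)
    e = d - np.vdot(w, u)                        # a priori error d - w^H u
    w_new = w + k * np.conj(e)                   # weight update
    P_new = (P - np.outer(k, np.conj(u)) @ P) / lam   # inverse correlation update
    P_new = 0.5 * (P_new + P_new.conj().T)       # keep P Hermitian (numerical hygiene)
    return w_new, P_new, e

# One common initialization (assumed, not taken from the text):
# w = np.zeros(M), P = np.eye(M) / delta for a small delta > 0,
# then call rls_update once per input sample.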

1.12.3.3.1 Practical implementation of RLS

The RLS algorithm must be implemented carefully, because it is sensitive to numerical errors, which may accumulate and make the algorithm unstable. The problem can be understood intuitively as follows. In the conventional RLS algorithm, the inverse of the autocorrelation matrix is computed recursively by (expanding (12.161))

image

i.e., as the difference between two positive (semi-) definite matrices. Note that, since image, the factor multiplying the current image is image. In infinite precision arithmetic, any growth due to this factor will be compensated by the second term. However, in finite precision arithmetic this compensation may not take place, and the factor image may make the recursion numerically unstable: usually, what happens is that image loses its positive-definite character, so the matrix acquires a negative eigenvalue, which in turn leads to the divergence of the coefficients. Thus, to avoid problems one would need to guarantee that image stays symmetric and positive-definite for all time instants image (these properties are known as backward consistency of RLS [40,41]). Symmetry can be guaranteed by computing only the upper or lower triangular part of image, and copying the result to obtain the rest of the elements. However, it is also necessary to guarantee positive-definiteness. Reference [42] describes how this can be achieved.
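A minimal sketch of the triangle-copy safeguard mentioned above is shown next (function name and implementation are ours). It keeps the propagated matrix exactly Hermitian in finite precision but, as discussed, does not by itself guarantee positive-definiteness.

import numpy as np

def enforce_hermitian(P):
    # Keep only the upper triangle of P (including the diagonal) and mirror it
    # onto the lower triangle, forcing the diagonal to be real, so that the
    # matrix stays exactly Hermitian after each update.
    upper = np.triu(P)
    return upper + upper.conj().T - np.diag(np.diag(P).real)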

Due to the fast convergence rate of RLS, a large literature is devoted to finding numerically stable and low-cost (i.e., image) versions of RLS. This is not an easy problem; it required the work of numerous researchers to solve. Nowadays the user has a very large number of different versions of RLS to choose from. There are versions with formal proofs of stability, versions that work well in practice but lack formal proofs, versions with image complexity, and versions with image complexity. The interested reader can find more references in Section 1.12.5.1.2.

One approach to ensure the consistency of image uses a property of symmetric and positive-definite matrices (see Box 3): the Cholesky factorization. A symmetric and positive-definite matrix image may always be factored as

image (12.168)

where image is a lower triangular image matrix, called the Cholesky factor of image. Thus, an algorithm that computes image instead of image can avoid the numerical instability of the conventional RLS algorithm. There are many algorithms based on this main idea; for some of them a precise stability proof is available [40]. A recent and very complete survey of QR-RLS algorithms is available in [43].

Another approach is available when the regressor sequence image has shift structure. In this case, it is possible to derive lattice-based RLS filters that are in practice stable, although no formal proof is available at this time, as described, for example, in [5].

A more practical solution for the implementation of the RLS algorithm was proposed recently in [44]. This approach avoids propagating the inverse correlation matrix, using instead an iterative method to solve (12.155) at each step. The computational complexity is kept low by using the previous solution as initialization, and by restricting the majority of the multiplications to multiplications by powers of two (which can be implemented as simple shifts). See Section 1.12.5.1.3.

1.12.3.4 Comparing convergence rates

To compare the convergence of LMS, NLMS, and RLS, we next present two examples. In Example 1, we consider a system identification application, and in Example 2, we use these algorithms to update the weights of an equalizer.

As we will see in more detail in Section 1.12.4, adaptive filters are compared based on how well they handle different classes of signals. For example, we could be interested in seeing how two filters handle white noise, or how they handle a certain level of correlation at the input. Given the stochastic nature of the inputs, the performance measures are defined in terms of averaged cost functions, most commonly the mean-square error (MSE), that is, the average of image over all possible inputs in a given class. These averages can be computed theoretically, as we show in Section 1.12.4, or via simulations.

For example, the MSE as a function of time, that is, the curve image, is called a filter’s learning curve. A comparison of learning curves obtained theoretically and from simulations can be seen in Figure 12.30 further ahead. When simulations are used, one runs a filter image times for an interval of image samples, always starting from the same initial conditions. For each run image, the error image is computed for image, and one obtains the so-called ensemble-average learning curve

image

Usually the ensemble-average learning curve will be a reasonable approximation of the true learning curve for a reasonably small value of image (say, from image to image), but in some situations this may not be true—see [45,46] for examples and a detailed explanation.
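To illustrate the procedure, the sketch below estimates an ensemble-average learning curve in Python/NumPy for a simple real-valued system-identification setup with white-noise input and an NLMS update written inline; the setup, the parameter values, and all names are our own choices and do not reproduce the experiments of Examples 1 and 2 below.

import numpy as np

def ensemble_learning_curve(L=200, N=1000, M=8, mu_tilde=0.5, delta=1e-4, sigma_v=0.01):
    # Average |e_l(n)|^2 over L independent runs of N samples each.
    rng = np.random.default_rng(0)
    w_o = rng.standard_normal(M)             # arbitrary "unknown" system
    mse = np.zeros(N)
    for _ in range(L):
        w = np.zeros(M)
        u = np.zeros(M)                      # tapped delay line (regressor)
        for n in range(N):
            u = np.roll(u, 1)
            u[0] = rng.standard_normal()     # white-noise input sample
            d = w_o @ u + sigma_v * rng.standard_normal()   # desired signal
            e = d - w @ u                    # a priori error
            w = w + (mu_tilde / (delta + u @ u)) * e * u    # NLMS update
            mse[n] += e ** 2
    return mse / L                           # ensemble-average learning curve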

Example 1

The aim of this example is to identify the system image, assuming that the regressor image is obtained from a process image generated with a first-order autoregressive model, whose transfer function is image. This model is fed with an i.i.d. Gaussian random process with unit variance. Moreover, i.i.d. noise image with variance image is added to form the desired signal.

Considering an adaptive filter with image coefficients, we can obtain a contour plot of the mean-square error cost as a function of image and image as shown in Figure 12.25. The animations in Figure 12.25 illustrate the behavior of LMS image, NLMS image, and RLS image initialized at image. The corresponding MSE curves along the iterations, estimated from the ensemble average of 1000 independent runs, are shown in Figure 12.26. As expected for a colored input signal, RLS converges much faster than NLMS and LMS. NLMS converges faster than LMS and approaches the solution along a different path. To obtain a good behavior of NLMS in this example, we had to choose a relatively large regularization factor image. This is always necessary when NLMS is used with few coefficients, since in this case image has a high probability of becoming too close to zero and destabilizing the algorithm. This is shown in the analysis of Section 1.12.4.

image

Figure 12.25 LMS image, NLMS image and RLS image initialized at image. Click the animation video file.

image

Figure 12.26 MSE along the iterations for LMS image, NLMS image and RLS image initialized at image, and ensemble-average of 1000 independent runs.

Example 2

Now, we assume the transmission of a QPSK (quadrature phase-shift keying) signal, whose symbols belong to the alphabet image. This signal is distorted by the channel (taken from [47])

image (12.169)

and by additive white Gaussian noise (AWGN) at a signal-to-noise ratio (SNR) of image. The equalizer is an adaptive filter with image coefficients initialized with zeros, and the desired signal is the transmitted sequence delayed by image samples (go back to Figure 12.16 to see the equalization scheme). The adaptation parameters image were chosen so as to obtain the same steady-state performance for LMS, NLMS, and RLS (note that with a longer filter, the regularization parameter image can be much smaller than in the previous example). Figure 12.27 shows the MSE along the iterations, estimated from the ensemble average of 1000 independent runs. Again, RLS presents the fastest convergence rate, followed by NLMS and LMS. These algorithms achieve the same steady-state MSE and therefore present the same performance after convergence, which can be observed in Figure 12.28, where the equalizer output signal constellations are shown. The Matlab file used in this simulation is available at http://www.lps.usp.br.

image

Figure 12.27 MSE along the iterations for LMS image, NLMS image, and RLS image, image initialized at image; QPSK, image, and ensemble-average of 1000 independent runs; equalization of the channel (12.169).

image

Figure 12.28 Equalizer output signal constellations after convergence for (a) LMS image, (b) NLMS image, and (c) RLS image; initialized at image; QPSK, image; equalization of the channel (12.169).
