Chapter 9

Diffusion Adaptation Over Networks*

Ali H. Sayed,    Electrical Engineering Department, University of California at Los Angeles, USA, [email protected]

Abstract

Adaptive networks are well-suited to perform decentralized information processing and optimization tasks and to model various types of self-organized and complex behavior encountered in nature. Adaptive networks consist of a collection of agents with processing and learning abilities. The agents are linked together through a connection topology, and they cooperate with each other through local interactions to solve distributed optimization, estimation, and inference problems in real-time. The continuous diffusion of information across the network enables agents to adapt their performance in relation to streaming data and network conditions; it also results in improved adaptation and learning performance relative to non-cooperative agents. This article provides an overview of diffusion strategies for adaptation and learning over networks, with emphasis on mean-square-error designs. Stability and performance analyses are provided and the benefits of cooperation are highlighted. Several supporting appendices are included in an effort to make the presentation self-contained for most readers.

Keywords

Distributed estimation; Distributed optimization; Diffusion adaptation; Distributed filtering; Diffusion strategy; Consensus strategy; Cooperative strategy; Adaptation; Learning; Adaptive networks; Cooperative networks; Diffusion networks; Steepest-descent; Stochastic gradient approximation; Mean-square-error; Energy conservation; LMS filter; RLS filter; Kalman filter; Combination rules; Graphs; Noisy links; Block maximum norm

Acknowledgments

The development of the theory and applications of diffusion adaptation over networks has benefited greatly from the insights and contributions of several UCLA Ph.D. students, and several visiting graduate students to the UCLA Adaptive Systems Laboratory (http://www.ee.ucla.edu/asl). The assistance and contributions of all students are hereby gratefully acknowledged, including Cassio G. Lopes, Federico S. Cattivelli, Sheng-Yuan Tu, Jianshu Chen, Xiaochuan Zhao, Zaid Towfic, Chung-Kai Yu, Noriyuki Takahashi, Jae-Woo Lee, Alexander Bertrand, and Paolo Di Lorenzo. The author is also particularly thankful to S.-Y. Tu, J. Chen, X. Zhao, Z. Towfic, and C.-K. Yu for their assistance in reviewing an earlier draft of this chapter.

Adaptive networks are well-suited to perform decentralized information processing and optimization tasks and to model various types of self-organized and complex behavior encountered in nature. Adaptive networks consist of a collection of agents with processing and learning abilities. The agents are linked together through a connection topology, and they cooperate with each other through local interactions to solve distributed optimization, estimation, and inference problems in real-time. The continuous diffusion of information across the network enables agents to adapt their performance in relation to streaming data and network conditions; it also results in improved adaptation and learning performance relative to non-cooperating agents. This chapter provides an overview of diffusion strategies for adaptation and learning over networks. The chapter is divided into several sections and includes appendices with supporting material intended to make the presentation rather self-contained for the benefit of the reader.

3.09.1 Motivation

Consider a collection of $N$ agents interested in estimating the same parameter vector, $w^o$, of size $M \times 1$. The vector is the minimizer of some global cost function, denoted by $J^{\mathrm{glob}}(w)$, which the agents seek to optimize, say,

$$\min_{w}\; J^{\mathrm{glob}}(w) \qquad (9.1)$$

We are interested in situations where the individual agents have access to partial information about the global cost function. In this case, cooperation among the agents becomes beneficial. For example, by cooperating with their neighbors, and by having these neighbors cooperate with their neighbors, procedures can be devised that would enable all agents in the network to converge towards the global optimum $w^o$ through local interactions. The objective of decentralized processing is to allow spatially distributed agents to achieve a global objective by relying solely on local information and on in-network processing. Through a continuous process of cooperation and information sharing with neighbors, agents in a network can be made to approach the global performance level despite the localized nature of their interactions.

3.09.1.1 Networks and neighborhoods

In this chapter we focus mainly on connected networks, although many of the results hold even if the network graph is separated into disjoint subgraphs. In a connected network, if we pick any two arbitrary nodes, then there will exist at least one path connecting them: the nodes may be connected directly by an edge if they are neighbors, or they may be connected by a path that passes through other intermediate nodes. Figure 9.1 provides a graphical representation of a connected network with $N$ nodes. Nodes that are able to share information with each other are connected by edges. The sharing of information over these edges can be unidirectional or bi-directional. The neighborhood of any particular node is defined as the set of nodes that are connected to it by edges; we include in this set the node itself. The figure illustrates the neighborhood of node 3, which consists of node 3 together with the nodes linked to it by edges. For each node, the size of its neighborhood defines its degree; nodes that are well connected have higher degrees. Note that we are denoting the neighborhood of an arbitrary node $k$ by $\mathcal{N}_k$ and its size by $n_k$. We shall also use the notation $|\mathcal{N}_k|$ to refer to $n_k$.
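As a small illustrative sketch (not part of the original development), the neighborhood and degree conventions above can be coded in a few lines; the edge list below is hypothetical and is not the topology of Figure 9.1:

```python
# Hypothetical 5-node topology; each undirected edge links two neighbors.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (2, 5)]
nodes = sorted({n for edge in edges for n in edge})

# The neighborhood N_k includes node k itself, per the convention in the text.
neighborhood = {k: {k} for k in nodes}
for (l, k) in edges:
    neighborhood[l].add(k)
    neighborhood[k].add(l)

# The degree n_k is the size of the neighborhood, n_k = |N_k|.
degree = {k: len(neighborhood[k]) for k in nodes}
```

For instance, with this particular edge list, node 3 has neighborhood {1, 2, 3, 4} and hence degree 4.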


Figure 9.1 A network consists of a collection of cooperating nodes. Nodes that are linked by edges can share information. The neighborhood of any particular node consists of all nodes that are connected to it by edges (including the node itself). The figure illustrates the neighborhood of node 3, which consists of node 3 and its immediate neighbors. Accordingly, node 3 has degree $n_3$, which is the size of its neighborhood $\mathcal{N}_3$.

The neighborhood of any node $k$ therefore consists of all nodes with which node $k$ can exchange information. We assume a symmetric situation in relation to neighbors so that if node $k$ is a neighbor of node $l$, then node $l$ is also a neighbor of node $k$. This does not necessarily mean that the flow of information between these two nodes is symmetrical. For instance, in future sections, we shall assign pairs of nonnegative weights $\{a_{lk}, a_{kl}\}$ to each edge connecting two neighboring nodes—see Figure 9.2. In particular, we will assign the coefficient $a_{lk}$ to denote the weight used by node $k$ to scale the data it receives from node $l$; this scaling can be interpreted as a measure of trustworthiness or reliability that node $k$ assigns to its interaction with node $l$. Note that we are using two subscripts, $lk$, with the first subscript denoting the source node (where information originates from) and the second subscript denoting the sink node (where information moves to) so that:

$$a_{lk} \;=\; \text{weight over the edge}\ \ l \longrightarrow k \qquad (9.2)$$

In this way, the alternative coefficient $a_{kl}$ will denote the weight used to scale the data sent from node $k$ to $l$:

$$a_{kl} \;=\; \text{weight over the edge}\ \ k \longrightarrow l \qquad (9.3)$$

The weights $\{a_{lk}, a_{kl}\}$ can be different, and one or both of them can be zero, so that the exchange of information over the edge connecting the neighboring nodes $\{l, k\}$ need not be symmetric. When one of the weights is zero, say, $a_{lk} = 0$, then this situation means that even though nodes $\{l, k\}$ are neighbors, node $k$ is either not receiving data from node $l$ or the data emanating from node $l$ is being annihilated before reaching node $k$. Likewise, when $a_{kk} > 0$, then node $k$ scales its own data, whereas $a_{kk} = 0$ corresponds to the situation when node $k$ does not use its own data. Usually, in graphical representations like those in Figure 9.1, edges are drawn between neighboring nodes that can share information. And, it is understood that the actual sharing of information is controlled by the values of the scaling weights that are assigned to the edge; these values can turn off communication in one or both directions and they can also scale one direction more heavily than the reverse direction, and so forth.


Figure 9.2 In the top part, and for emphasis purposes, we are representing the edge between nodes $k$ and $l$ by two separate directed links: one moving from $k$ to $l$ and the other moving from $l$ to $k$. Two nonnegative weights are used to scale the sharing of information over these directed links. The scalar $a_{kl}$ denotes the weight used to scale data sent from node $k$ to $l$, while $a_{lk}$ denotes the weight used to scale data sent from node $l$ to $k$. The weights $\{a_{lk}, a_{kl}\}$ can be different, and one or both of them can be zero, so that the exchange of information over the edge connecting any two neighboring nodes need not be symmetric. The bottom part of the figure illustrates directed links arriving to node $k$ from its neighbors (left) and leaving from node $k$ towards these same neighbors (right).
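To make the role of these weights concrete, here is a short sketch (an illustration only, using a hypothetical 5-node topology and the uniform averaging rule as one common choice of weights). The matrix entry `A[l, k]` stores $a_{lk}$, the weight node $k$ applies to data arriving from node $l$:

```python
import numpy as np

N = 5
# Hypothetical neighborhoods (each set includes the node itself).
neighbors = {0: {0, 1}, 1: {0, 1, 2}, 2: {1, 2, 3}, 3: {2, 3, 4}, 4: {3, 4}}

A = np.zeros((N, N))
for k in range(N):
    for l in neighbors[k]:
        A[l, k] = 1.0 / len(neighbors[k])   # uniform weights: one common choice

# Each column sums to one (node k forms a convex combination of received data),
# a_{lk} = 0 whenever l is not a neighbor of k, and A need not be symmetric.
col_sums = A.sum(axis=0)
```

Note that `A[0, 1]` equals 1/3 while `A[1, 0]` equals 1/2, illustrating that the two directions of an edge can be scaled differently.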

3.09.1.2 Cooperation among agents

Now, depending on the application under consideration, the solution vector $w^o$ from (9.1) may admit different interpretations. For example, the entries of $w^o$ may represent the location coordinates of a nutrition source that the agents are trying to find, or the location of an accident involving a dangerous chemical leak. The nodes may also be interested in locating a predator and tracking its movements over time. In these localization applications, the vector $w^o$ is usually two or three-dimensional. In other applications, the entries of $w^o$ may represent the parameters of some model that the network wishes to learn, such as identifying the model parameters of a biological process or the occupied frequency bands in a shared communications medium. There are also situations where different agents in the network may be interested in estimating different entries of $w^o$, or even different parameter vectors altogether, say, $w_k^o$ for node $k$, albeit with some relation among the different vectors so that cooperation among the nodes can still be rewarding. In this chapter, however, we focus exclusively on the special (yet frequent and important) case where all agents are interested in estimating the same parameter vector $w^o$.

Since the agents have a common objective, it is natural to expect cooperation among them to be beneficial in general. One important question is therefore how to develop cooperation strategies that can lead to better performance than when each agent attempts to solve the optimization problem individually. Another important question is how to develop strategies that endow networks with the ability to adapt and learn in real-time in response to changes in the statistical properties of the data. This chapter provides an overview of results in the area of diffusion adaptation with illustrative examples. Diffusion strategies are powerful methods that enable adaptive learning and cooperation over networks. There have been other useful works in the literature on the use of alternative consensus strategies to develop distributed optimization solutions over networks. Nevertheless, we explain in Appendix E why diffusion strategies outperform consensus strategies in terms of their mean-square-error stability and performance. For this reason, we focus in the body of the chapter on presenting the theoretical foundations for diffusion strategies and their performance.

3.09.1.3 Notation

In our treatment, we need to distinguish between random variables and deterministic quantities. For this reason, we use boldface letters to represent random variables and normal font to represent deterministic (non-random) quantities. For example, the boldface letter $\boldsymbol{d}$ denotes a random quantity, while the normal font letter $d$ denotes an observation or realization for it. We also need to distinguish between matrices and vectors. For this purpose, we use CAPITAL letters to refer to matrices and small letters to refer to both vectors and scalars; Greek letters always refer to scalars. For example, we write $R_u$ to denote a covariance matrix and $w$ to denote a vector of parameters. We also write $\sigma_v^2$ to refer to the variance of a random variable. To distinguish between a vector $w$ (small letter) and a scalar $d$ (also a small letter), we use parentheses to index scalar quantities and subscripts to index vector quantities. Thus, we write $d(i)$ to refer to the value of a scalar quantity $d$ at time $i$, and $w_i$ to refer to the value of a vector quantity $w$ at time $i$. Thus, $d(i)$ denotes a scalar while $w_i$ denotes a vector. All vectors in our presentation are column vectors, with the exception of the regression vector (denoted by the letter $u$), which will be taken to be a row vector for convenience of presentation. The symbol $(\cdot)^{\mathsf{T}}$ denotes transposition, and the symbol $(\cdot)^{*}$ denotes complex conjugation for scalars and complex-conjugate transposition for matrices. The notation $\mathrm{col}\{a, b\}$ denotes a column vector with entries $a$ and $b$ stacked on top of each other, and the notation $\mathrm{diag}\{a, b\}$ denotes a diagonal matrix with entries $a$ and $b$. Likewise, the notation $\mathrm{vec}(A)$ vectorizes its matrix argument and stacks the columns of $A$ on top of each other. The notation $\|x\|$ denotes the Euclidean norm of its vector argument, while $\|x\|_{b,\infty}$ denotes the block maximum norm of a block vector (defined in Appendix D).
Similarly, the notation $\|x\|_{\Sigma}^2$ denotes the weighted square value, $x^{*}\Sigma x$. Moreover, $\|A\|_{b,\infty}$ denotes the block maximum norm of a matrix (also defined in Appendix D), and $\rho(A)$ denotes the spectral radius of the matrix (i.e., the largest absolute magnitude among its eigenvalues). Finally, $I_M$ denotes the identity matrix of size $M \times M$; sometimes, for simplicity of notation, we drop the subscript $M$ from $I_M$ when the size of the identity matrix is obvious from the context. Table 9.1 provides a summary of the symbols used in the chapter for ease of reference.

Table 9.1

Summary of Notation Conventions Used in the Chapter


3.09.2 Mean-square-error estimation

Readers interested in the development of the distributed optimization strategies and their adaptive versions can move directly to Section 3.09.3. The purpose of the current section is to motivate the virtues of distributed in-network processing, and to provide illustrative examples in the context of mean-square-error estimation. Advanced readers may skip this section.

We start our development by associating with each agent $k$ an individual cost (or utility) function, $J_k(w)$. Although the algorithms presented in this chapter apply to more general situations, we nevertheless assume in our presentation that the cost functions $J_k(w)$ are strictly convex so that each one of them has a unique minimizer. We further assume that, for all costs $J_k(w)$, the minimum occurs at the same value $w^o$. Obviously, the choice of $J_k(w)$ is limitless and is largely dependent on the application. It is sufficient for our purposes to illustrate the main concepts underlying diffusion adaptation by focusing on the case of mean-square-error (MSE) or quadratic cost functions. In the sequel, we provide several examples to illustrate how such quadratic cost functions arise in applications and how cooperative processing over networks can be beneficial. At the same time, we note that most of the arguments in this chapter can be extended beyond MSE optimization to more general cost functions and to situations where the minimizers of the individual costs $J_k(w)$ need not agree with each other—as already shown in [13]; see also Section 3.09.10.4 for a brief summary.

In non-cooperative solutions, each agent would operate individually on its own cost function $J_k(w)$ and optimize it to determine $w^o$, without any interaction with the other nodes. However, the analysis and derivations in future sections will reveal that nodes can benefit from cooperation in terms of better performance (such as converging faster to $w^o$ or tracking a changing $w^o$ more effectively)—see, e.g., Theorems 9.6.3–9.6.5 and Section 3.09.7.3.

3.09.2.1 Application: autoregressive modeling

Our first example relates to identifying the parameters of an auto-regressive (AR) model from noisy data. Thus, consider a situation where agents are spread over some geographical region and each agent $k$ is observing realizations $\{d_k(i)\}$ of a zero-mean AR random process $\{\boldsymbol{d}_k(i)\}$, which satisfies a model of the form:

$$\boldsymbol{d}_k(i) \;=\; \sum_{m=1}^{M} w_m^o\, \boldsymbol{d}_k(i-m) \;+\; \boldsymbol{v}_k(i) \qquad (9.4)$$

The scalars $\{w_m^o,\ m = 1, \ldots, M\}$ represent the model parameters that the agents wish to identify, and $\boldsymbol{v}_k(i)$ represents an additive zero-mean white noise process with power:

$$\sigma_{v,k}^2 \;\triangleq\; E\,|\boldsymbol{v}_k(i)|^2 \qquad (9.5)$$

It is customary to assume that the noise process is temporally white and spatially independent so that noise terms across different nodes are independent of each other and, at the same node, successive noise samples are also independent of each other with a time-independent variance $\sigma_{v,k}^2$:

$$E\,\boldsymbol{v}_k(i)\,\boldsymbol{v}_l^{*}(j) \;=\; \sigma_{v,k}^2\,\delta_{kl}\,\delta_{ij} \qquad (9.6)$$

where $\delta_{kl}$ denotes the Kronecker delta. The noise process $\boldsymbol{v}_k(i)$ is further assumed to be independent of past signals $\{\boldsymbol{d}_l(j),\ j < i\}$ across all nodes $l$. Observe that we are allowing the noise power profile, $\sigma_{v,k}^2$, to vary with $k$. In this way, the quality of the measurements is allowed to vary across the network with some nodes collecting noisier data than other nodes. Figure 9.3 illustrates an example of a network consisting of $N$ nodes spread over a region in space; the figure also highlights the neighborhood of one of the nodes.


Figure 9.3 A collection of nodes, spread over a geographic region, observes realizations of an AR random process and cooperates to estimate the underlying model parameters $w^o$ in the presence of measurement noise. The noise power profile can vary over space.
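The data model (9.4)–(9.6) is easy to simulate. The following sketch (with illustrative, hypothetical parameter values) generates AR measurements at three nodes whose noise powers differ, mimicking a spatially varying noise profile:

```python
import numpy as np

rng = np.random.default_rng(0)
w_o = np.array([0.5, -0.3])        # true AR parameters (hypothetical, stable)
M = len(w_o)
sigma2_v = [0.01, 0.04, 0.09]      # noise power varies across the three nodes
T = 200

data = []
for s2 in sigma2_v:
    d = np.zeros(T + M)
    for i in range(M, T + M):
        # d_k(i) = sum_m w_m^o d_k(i-m) + v_k(i), as in (9.4)
        d[i] = d[i - M:i][::-1] @ w_o + np.sqrt(s2) * rng.standard_normal()
    data.append(d[M:])
```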

3.09.2.1.1 Linear model

To illustrate the difference between cooperative and non-cooperative estimation strategies, let us first explain that the data can be represented in terms of a linear model. To do so, we collect the model parameters $\{w_m^o\}$ into an $M \times 1$ column vector $w^o$:

$$w^o \;\triangleq\; \mathrm{col}\{w_1^o,\, w_2^o,\, \ldots,\, w_M^o\} \qquad (9.7)$$

and the past data into a $1 \times M$ row regression vector $\boldsymbol{u}_{k,i}$:

$$\boldsymbol{u}_{k,i} \;\triangleq\; \big[\;\boldsymbol{d}_k(i-1)\;\;\; \boldsymbol{d}_k(i-2)\;\;\cdots\;\; \boldsymbol{d}_k(i-M)\;\big] \qquad (9.8)$$

Then, we can rewrite the measurement equation (9.4) at each node $k$ in the equivalent linear model form:

$$\boldsymbol{d}_k(i) \;=\; \boldsymbol{u}_{k,i}\,w^o \;+\; \boldsymbol{v}_k(i) \qquad (9.9)$$

Linear relations of the form (9.9) are common in applications and arise in many other contexts (as further illustrated by the next three examples in this section).

We assume the stochastic processes $\{\boldsymbol{d}_k(i), \boldsymbol{u}_{k,i}\}$ in (9.9) have zero means and are jointly wide-sense stationary. We denote their second-order moments by:

$$\sigma_{d,k}^2 \;\triangleq\; E\,|\boldsymbol{d}_k(i)|^2 \qquad (9.10)$$

$$R_{u,k} \;\triangleq\; E\,\boldsymbol{u}_{k,i}^{*}\,\boldsymbol{u}_{k,i} \qquad (M \times M) \qquad (9.11)$$

$$r_{du,k} \;\triangleq\; E\,\boldsymbol{d}_k(i)\,\boldsymbol{u}_{k,i}^{*} \qquad (M \times 1) \qquad (9.12)$$

As was the case with the noise power profile, we are allowing the moments $\{\sigma_{d,k}^2, R_{u,k}, r_{du,k}\}$ to depend on the node index $k$ so that these moments can vary with the spatial dimension as well. The covariance matrix $R_{u,k}$ is, by definition, always nonnegative definite. However, for convenience of presentation, we shall assume that $R_{u,k}$ is actually positive-definite (and, hence, invertible):

$$R_{u,k} \;>\; 0 \qquad (9.13)$$

3.09.2.1.2 Non-cooperative mean-square-error solution

One immediate result that follows from the linear model (9.9) is that the unknown parameter vector $w^o$ can be recovered exactly by each individual node from knowledge of the local moments $\{R_{u,k}, r_{du,k}\}$ alone. To see this, note that if we multiply both sides of (9.9) by $\boldsymbol{u}_{k,i}^{*}$ and take expectations we obtain

$$r_{du,k} \;=\; R_{u,k}\,w^o \qquad (9.14)$$

so that

$$w^o \;=\; R_{u,k}^{-1}\,r_{du,k} \qquad (9.15)$$

It is seen from (9.15) that $w^o$ is the solution to a linear system of equations and that this solution can be computed by every node directly from its moments $\{R_{u,k}, r_{du,k}\}$. It is useful to re-interpret construction (9.15) as the solution to a minimum mean-square-error (MMSE) estimation problem [4,5]. Indeed, it can be verified that the quantity $R_{u,k}^{-1}\,r_{du,k}$ that appears in (9.15) is the unique solution to the following MMSE problem:

$$\min_{w}\; E\,|\boldsymbol{d}_k(i) - \boldsymbol{u}_{k,i}\,w|^2 \qquad (9.16)$$

To verify this claim, we denote the cost function that appears in (9.16) by

$$J_k(w) \;\triangleq\; E\,|\boldsymbol{d}_k(i) - \boldsymbol{u}_{k,i}\,w|^2 \qquad (9.17)$$

and expand it to find that

$$J_k(w) \;=\; \sigma_{d,k}^2 \;-\; w^{*}\,r_{du,k} \;-\; r_{du,k}^{*}\,w \;+\; w^{*}\,R_{u,k}\,w \qquad (9.18)$$

The cost function $J_k(w)$ is quadratic in $w$ and it has a unique minimizer since $R_{u,k} > 0$. Differentiating $J_k(w)$ with respect to $w$ we find its gradient vector:

$$\nabla_w\, J_k(w) \;=\; \big(R_{u,k}\,w \;-\; r_{du,k}\big)^{*} \qquad (9.19)$$

It is seen that this gradient vector is annihilated at the same value $w^o$ given by (9.15). Therefore, we can equivalently state that if each node $k$ solves the MMSE problem (9.16), then the solution vector coincides with the desired parameter vector $w^o$. This observation explains why it is often justified to consider mean-square-error cost functions when dealing with estimation problems that involve data that satisfy linear models similar to (9.9).

Besides $w^o$, the solution of the MMSE problem (9.16) also conveys information about the noise level in the data. Note that by substituting $w^o$ from (9.15) into expression (9.16) for the cost, the resulting minimum mean-square-error value that is attained by node $k$ is found to be:

$$J_{k,\min} \;\triangleq\; J_k(w^o) \;=\; E\,|\boldsymbol{d}_k(i) - \boldsymbol{u}_{k,i}\,w^o|^2 \;=\; E\,|\boldsymbol{v}_k(i)|^2 \;=\; \sigma_{v,k}^2 \qquad (9.20)$$

We shall use the notation $J_{k,\min}$ and $J_k(w^o)$ interchangeably to denote the minimum cost value of $J_k(w)$. The above result states that, when each agent $k$ recovers $w^o$ from knowledge of its moments $\{R_{u,k}, r_{du,k}\}$ using expression (9.15), then the agent attains an MSE performance level that is equal to the noise power at its location, $\sigma_{v,k}^2$. An alternative useful expression for the minimum cost can be obtained by substituting expression (9.15) for $w^o$ into (9.18) and simplifying the expression to find that

$$J_{k,\min} \;=\; \sigma_{d,k}^2 \;-\; r_{du,k}^{*}\,R_{u,k}^{-1}\,r_{du,k} \qquad (9.21)$$

This second expression is in terms of the data moments $\{\sigma_{d,k}^2, R_{u,k}, r_{du,k}\}$.
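The chain (9.15), (9.20), (9.21) can be checked numerically. In the sketch below (all values illustrative), the moments are estimated from synthetic data generated according to the linear model (9.9); solving the normal equations recovers the parameter vector, and the minimum cost (9.21) lands close to the noise power, as predicted by (9.20):

```python
import numpy as np

rng = np.random.default_rng(1)
M, T = 4, 200_000
w_o = rng.standard_normal(M)           # hypothetical true parameter vector
sigma2_v = 0.05                        # noise power at this node

U = rng.standard_normal((T, M))        # rows play the role of regressors u_i
d = U @ w_o + np.sqrt(sigma2_v) * rng.standard_normal(T)

# Sample estimates of the moments (9.10)-(9.12)
R_u = (U.T @ U) / T
r_du = (U.T @ d) / T
sigma2_d = np.mean(d ** 2)

w_hat = np.linalg.solve(R_u, r_du)     # (9.15): w^o = R_u^{-1} r_du
J_min = sigma2_d - r_du @ w_hat        # (9.21): should approximate sigma2_v
```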

3.09.2.1.3 Non-cooperative adaptive solution

The optimal MMSE implementation (9.15) for determining $w^o$ requires knowledge of the statistical information $\{R_{u,k}, r_{du,k}\}$. This information is usually not available beforehand. Instead, the agents are more likely to have access to successive time-indexed observations $\{d_k(i), u_{k,i}\}$ of the random processes $\{\boldsymbol{d}_k(i), \boldsymbol{u}_{k,i}\}$ for $i \geq 0$. In this case, it becomes necessary to devise a scheme that would allow each node to use its measurements to approximate $w^o$. It turns out that an adaptive solution is possible. In this alternative implementation, each node $k$ feeds its observations $\{d_k(i), u_{k,i}\}$ into an adaptive filter and evaluates successive estimates for $w^o$. As time passes, the estimates get closer to $w^o$.

The adaptive solution operates as follows. Let $w_{k,i}$ denote an estimate for $w^o$ that is computed by node $k$ at time $i$ based on all the observations $\{d_k(j), u_{k,j},\ j \leq i\}$ it has collected up to that time instant. There are many adaptive algorithms that can be used to compute $w_{k,i}$; some filters are more accurate than others (usually, at the cost of additional complexity) [47]. It is sufficient for our purposes to consider one simple (yet effective) filter structure, while noting that most of the discussion in this chapter can be extended to other structures. One of the simplest choices for an adaptive structure is the least-mean-squares (LMS) filter, where the data are processed by each node $k$ as follows:

$$e_k(i) \;=\; d_k(i) \;-\; u_{k,i}\,w_{k,i-1} \qquad (9.22)$$

$$w_{k,i} \;=\; w_{k,i-1} \;+\; \mu_k\,u_{k,i}^{*}\,e_k(i) \qquad (9.23)$$

Starting from some initial condition, say, $w_{k,-1} = 0$, the filter iterates over $i \geq 0$. At each time instant, $i$, the filter uses the local data $\{d_k(i), u_{k,i}\}$ at node $k$ to compute a local estimation error, $e_k(i)$, via (9.22). The error is then used to update the existing estimate from $w_{k,i-1}$ to $w_{k,i}$ via (9.23). The factor $\mu_k$ that appears in (9.23) is a constant positive step-size parameter, usually chosen to be sufficiently small to ensure mean-square stability and convergence, as discussed further ahead in the chapter. The step-size parameter can be selected to vary with time as well; one popular choice is to replace $\mu_k$ in (9.23) with the following construction:

$$\mu_k(i) \;=\; \frac{\mu_k}{\varepsilon + \|u_{k,i}\|^2} \qquad (9.24)$$

where $\varepsilon$ is a small positive parameter and $\mu_k > 0$. The resulting filter implementation is known as normalized LMS [5] since the step-size is normalized by the squared norm of the regression vector. Other choices include step-size sequences $\{\mu(i)\}$ that satisfy both conditions:

$$\sum_{i=0}^{\infty} \mu(i) \;=\; \infty, \qquad \sum_{i=0}^{\infty} \mu^2(i) \;<\; \infty \qquad (9.25)$$

Such sequences converge slowly towards zero. One example is the choice $\mu(i) = 1/(i+1)$. However, by virtue of the fact that such step-sizes die out as $i \to \infty$, these choices end up turning off adaptation. As such, step-size sequences satisfying (9.25) are not generally suitable for applications that require continuous learning, especially under non-stationary environments. For this reason, in this chapter, we shall focus exclusively on the constant step-size case (9.23) in order to ensure continuous adaptation and learning.
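For concreteness, here is a minimal sketch of the constant step-size LMS recursion (9.22)–(9.23) running at a single node on synthetic linear-model data; the dimensions, step-size, and noise level are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
M, T, mu = 5, 5000, 0.01
w_o = rng.standard_normal(M)          # hypothetical true model
sigma_v = 0.1

w = np.zeros(M)                       # initial condition w_{-1} = 0
for i in range(T):
    u = rng.standard_normal(M)        # regressor u_i (treated as a row vector)
    d = u @ w_o + sigma_v * rng.standard_normal()
    e = d - u @ w                     # (9.22): local estimation error
    w = w + mu * u * e                # (9.23): LMS update

err = np.linalg.norm(w - w_o)         # small after convergence
```

Replacing `mu` by `mu / (eps + u @ u)` inside the loop yields the normalized LMS variant of (9.24).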

Equations (9.22) and (9.23) are written in terms of the observed quantities $\{d_k(i), u_{k,i}, e_k(i), w_{k,i}\}$; these are deterministic values since they correspond to observations of the random processes $\{\boldsymbol{d}_k(i), \boldsymbol{u}_{k,i}\}$. Often, when we are interested in highlighting the random nature of the quantities involved in the adaptation step, especially when we study the mean-square performance of adaptive filters, it becomes more useful to rewrite the recursions using the boldface notation to highlight the fact that the quantities that appear in (9.22) and (9.23) are actually realizations of random variables. Thus, we also write:

$$\boldsymbol{e}_k(i) \;=\; \boldsymbol{d}_k(i) \;-\; \boldsymbol{u}_{k,i}\,\boldsymbol{w}_{k,i-1} \qquad (9.26)$$

$$\boldsymbol{w}_{k,i} \;=\; \boldsymbol{w}_{k,i-1} \;+\; \mu_k\,\boldsymbol{u}_{k,i}^{*}\,\boldsymbol{e}_k(i) \qquad (9.27)$$

where $\{\boldsymbol{e}_k(i), \boldsymbol{w}_{k,i}\}$ will be random variables as well.

The performance of adaptive implementations of this kind is well understood for both cases of stationary $w^o$ and changing $w^o$ [47]. For example, in the stationary case, if the adaptive implementation (9.26) and (9.27) were to succeed in having its estimator $\boldsymbol{w}_{k,i}$ tend to $w^o$ with probability one as $i \to \infty$, then we would expect the error signal $\boldsymbol{e}_k(i)$ in (9.26) to tend towards the noise signal $\boldsymbol{v}_k(i)$ (by virtue of the linear model (9.9)). This means that, under this ideal scenario, the variance of the error signal $\boldsymbol{e}_k(i)$ would be expected to tend towards the noise variance, $\sigma_{v,k}^2$, as $i \to \infty$. Recall from (9.20) that the noise variance is the least cost that the MSE solution can attain. Therefore, such limiting behavior by the adaptive filter would be desirable. However, it is well-known that there is always some loss in mean-square-error performance when adaptation is employed due to the effect of gradient noise, which is caused by the algorithm's reliance on observations (or realizations) $\{d_k(i), u_{k,i}\}$ rather than on the actual moments $\{r_{du,k}, R_{u,k}\}$. In particular, it is known that for LMS filters, the variance of $\boldsymbol{e}_k(i)$ in steady-state will be larger than $\sigma_{v,k}^2$ by a small amount, and the size of the offset is proportional to the step-size parameter $\mu_k$ (so that smaller step-sizes lead to better mean-square-error (MSE) performance albeit at the expense of slower convergence). It is easy to see why the variance of $\boldsymbol{e}_k(i)$ will be larger than $\sigma_{v,k}^2$ from the definition of the error signal in (9.26). Introduce the weight-error vector

$$\widetilde{\boldsymbol{w}}_{k,i} \;\triangleq\; w^o \;-\; \boldsymbol{w}_{k,i} \qquad (9.28)$$

and the so-called a priori error signal

$$\boldsymbol{e}_{a,k}(i) \;\triangleq\; \boldsymbol{u}_{k,i}\,\widetilde{\boldsymbol{w}}_{k,i-1} \qquad (9.29)$$

This second error measures the difference between the uncorrupted term $\boldsymbol{u}_{k,i}\,w^o$ and its estimator prior to adaptation, $\boldsymbol{u}_{k,i}\,\boldsymbol{w}_{k,i-1}$. It then follows from the data model (9.9) and from the defining expression (9.26) for $\boldsymbol{e}_k(i)$ that

$$\boldsymbol{e}_k(i) \;=\; \boldsymbol{e}_{a,k}(i) \;+\; \boldsymbol{v}_k(i) \qquad (9.30)$$

Since the noise component, $\boldsymbol{v}_k(i)$, is assumed to be zero-mean and independent of all other random variables, we conclude that

$$E\,|\boldsymbol{e}_k(i)|^2 \;=\; E\,|\boldsymbol{e}_{a,k}(i)|^2 \;+\; \sigma_{v,k}^2 \qquad (9.31)$$

This relation holds for any time instant $i$; it shows that the variance of the output error, $\boldsymbol{e}_k(i)$, is larger than $\sigma_{v,k}^2$ by an amount that is equal to the variance of the a priori error, $\boldsymbol{e}_{a,k}(i)$. We define the filter mean-square-error (MSE) and excess-mean-square-error (EMSE) as the following steady-state measures:

$$\mathrm{MSE}_k \;\triangleq\; \lim_{i\to\infty}\, E\,|\boldsymbol{e}_k(i)|^2 \qquad (9.32)$$

$$\mathrm{EMSE}_k \;\triangleq\; \lim_{i\to\infty}\, E\,|\boldsymbol{e}_{a,k}(i)|^2 \qquad (9.33)$$

so that, for the adaptive implementation (compare with (9.20)):

$$\mathrm{MSE}_k \;=\; \sigma_{v,k}^2 \;+\; \mathrm{EMSE}_k \qquad (9.34)$$

Therefore, the EMSE term quantifies the size of the offset in the MSE performance of the adaptive filter. We also define the filter mean-square-deviation (MSD) as the steady-state measure:

$$\mathrm{MSD}_k \;\triangleq\; \lim_{i\to\infty}\, E\,\|\widetilde{\boldsymbol{w}}_{k,i}\|^2 \qquad (9.35)$$

which measures how far $\boldsymbol{w}_{k,i}$ is from $w^o$ in the mean-square-error sense in steady-state. It is known that the MSD and EMSE of LMS filters of the form (9.26) and (9.27) can be approximated for sufficiently small step-sizes by the following expressions [47]:

$$\mathrm{MSD}_k \;\approx\; \frac{\mu_k\,\sigma_{v,k}^2\,M}{2} \qquad (9.36)$$

$$\mathrm{EMSE}_k \;\approx\; \frac{\mu_k\,\sigma_{v,k}^2\,\mathrm{Tr}(R_{u,k})}{2} \qquad (9.37)$$

It is seen that the smaller the step-size parameter is, the better the performance of the adaptive solution.
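The approximations (9.36) and (9.37) are straightforward to evaluate. The short sketch below (illustrative values, with white regressors so that $R_{u,k}$ is a scaled identity) tabulates the predicted steady-state figures and the resulting MSE from (9.34):

```python
import numpy as np

M = 10                      # filter length (illustrative)
mu = 0.005                  # step-size
sigma2_v = 0.01             # noise power at the node
R_u = 0.8 * np.eye(M)       # hypothetical regressor covariance

MSD = mu * sigma2_v * M / 2                  # (9.36)
EMSE = mu * sigma2_v * np.trace(R_u) / 2     # (9.37)
MSE = sigma2_v + EMSE                        # (9.34)
```

Halving `mu` halves both predicted MSD and EMSE, at the cost of slower convergence.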

3.09.2.1.4 Cooperative adaptation through diffusion

Observe from (9.36) and (9.37) that even if all nodes employ the same step-size, $\mu_k \equiv \mu$, and even if the regression data are spatially uniform so that $R_{u,k} \equiv R_u$ for all $k$, the mean-square-error performance across the nodes still varies in accordance with the variation of the noise power profile, $\sigma_{v,k}^2$, across the network. Nodes with larger noise power will perform worse than nodes with smaller noise power. However, since all nodes are observing data arising from the same underlying model $w^o$, it is natural to expect cooperation among the nodes to be beneficial. By cooperation we mean that neighboring nodes can share information (such as measurements or estimates) with each other as permitted by the network topology. Starting in the next section, we will motivate and describe algorithms that enable nodes to carry out adaptation and learning in a cooperative manner to enhance performance.

Specifically, we are going to see that one way to achieve cooperation is by developing adaptive algorithms that enable the nodes to optimize the following global cost function in a distributed manner:

$$\min_{w}\; J^{\mathrm{glob}}(w) \qquad (9.38)$$

where the global cost is the aggregate objective:

$$J^{\mathrm{glob}}(w) \;\triangleq\; \sum_{k=1}^{N} J_k(w) \;=\; \sum_{k=1}^{N} E\,|\boldsymbol{d}_k(i) - \boldsymbol{u}_{k,i}\,w|^2 \qquad (9.39)$$

Comparing (9.38) with (9.16), we see that we are now adding the individual costs, $J_k(w)$, from across all nodes. Note that since the desired $w^o$ satisfies (9.15) at every node $k$, then it also satisfies

$$\left(\sum_{k=1}^{N} R_{u,k}\right) w^o \;=\; \sum_{k=1}^{N} r_{du,k} \qquad (9.40)$$

But it can be verified that the optimal solution to (9.38) is given by the same $w^o$ that satisfies (9.40). Therefore, solving the global optimization problem (9.38) also leads to the desired $w^o$. In future sections, we will show how cooperative and distributed adaptive schemes for solving (9.38), such as (9.153) or (9.154) further ahead, lead to improved performance in estimating $w^o$ (in terms of smaller mean-square-deviation and faster convergence rate) than the non-cooperative mode (9.26) and (9.27), where each agent runs its own individual adaptive filter—see, e.g., Theorems 9.6.3–9.6.5 and Section 3.09.7.3.
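The claim that the global problem returns the common minimizer can be verified numerically: since each $r_{du,k} = R_{u,k}\,w^o$ by (9.14), the aggregate normal equations (9.40) are solved by the same $w^o$. A sketch with synthetic per-node moments (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
M, N = 3, 4
w_o = rng.standard_normal(M)            # common minimizer across nodes

R_list, r_list = [], []
for _ in range(N):
    X = rng.standard_normal((M, M))
    R_k = X @ X.T + M * np.eye(M)       # a positive-definite R_{u,k}
    R_list.append(R_k)
    r_list.append(R_k @ w_o)            # r_{du,k} = R_{u,k} w^o, from (9.14)

# (9.40): (sum_k R_{u,k}) w = sum_k r_{du,k} is solved by the same w^o
w_glob = np.linalg.solve(sum(R_list), sum(r_list))
```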

3.09.2.2 Application: tapped-delay-line models

Our second example to motivate MSE cost functions, $J_k(w)$, and linear models relates to identifying the parameters of a moving-average (MA) model from noisy data. MA models are also known as finite-impulse-response (FIR) or tapped-delay-line models. Thus, consider a situation where agents are interested in estimating the parameters of an FIR model, such as the taps of a communications channel or the parameters of some (approximate) model of interest in finance or biology. Assume the agents are able to independently probe the unknown model and observe its response to excitations in the presence of additive noise; this situation is illustrated in Figure 9.4, with the probing operation highlighted for one of the nodes (node 4).


Figure 9.4 The network is interested in estimating the parameter vector $w^o$ that describes an underlying tapped-delay-line model. The agents are assumed to be able to independently probe the unknown system, and observe its response to excitations, under noise, as indicated in the highlighted diagram for node 4.

The schematic inside the enlarged diagram in Figure 9.4 is meant to convey that each node $k$ probes the model with an input sequence $\{u_k(i)\}$ and measures the resulting response sequence, $\{d_k(i)\}$, in the presence of additive noise. The system dynamics for each agent $k$ is assumed to be described by an MA model of the form:

$$\boldsymbol{d}_k(i) \;=\; \sum_{m=1}^{M} w_m^o\,\boldsymbol{u}_k(i-m+1) \;+\; \boldsymbol{v}_k(i) \qquad (9.41)$$

In this model, the term image again represents an additive zero-mean noise process that is assumed to be temporally white and spatially independent; it is also assumed to be independent of the input process, image, for all image, and image. The scalars image represent the model parameters that the agents seek to identify. If we again collect the model parameters into an image column vector image:

image (9.42)

and the input data into a image row regression vector:

image (9.43)

then, we can again express the measurement equation (9.41) at each node image in the same linear model as (9.9), namely,

image (9.44)

As was the case with model (9.9), we can likewise verify that, in view of (9.44), the desired parameter vector image satisfies the same normal equations (9.15), i.e.,

image (9.45)

where the moments image continue to be defined by expressions (9.11) and (9.12) with image now defined by (9.43). Therefore, each node image can determine image on its own by solving the same MMSE estimation problem (9.16). This solution method requires knowledge of the moments image and, according to (9.20), each agent image would then attain an MSE level that is equal to the noise power level at its location.

Alternatively, when the statistical information image is not available, each agent image can estimate image iteratively by feeding data image into the adaptive implementation (9.26) and (9.27). In this way, each agent image will achieve the same performance level shown earlier in (9.36) and (9.37), with the limiting performance being again dependent on the local noise power level, image. Therefore, nodes with larger noise power will perform worse than nodes with smaller noise power. However, since all nodes are observing data arising from the same underlying model image, it is natural to expect cooperation among the nodes to be beneficial. As we are going to see, starting from the next section, one way to achieve cooperation and improve performance is by developing algorithms that optimize the same global cost function (9.38) in an adaptive and distributed manner, such as algorithms (9.153) and (9.154) further ahead.
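The non-cooperative adaptive route described above is easy to sketch in code. The following minimal LMS simulation identifies the taps of an FIR model from streaming noisy data at a single agent; all names, dimensions, and parameter values are illustrative assumptions, not quantities from the chapter:

```python
import numpy as np

rng = np.random.default_rng(0)

M = 5                                  # number of taps in the FIR model (illustrative)
w_o = rng.standard_normal(M)           # unknown tap vector the agent seeks to identify
mu = 0.01                              # constant step-size
sigma_v = 0.1                          # local noise standard deviation

w = np.zeros(M)                        # initial estimate
u_buf = np.zeros(M)                    # tapped-delay-line regressor

for i in range(20000):
    # shift a new input sample into the delay line
    u_buf = np.roll(u_buf, 1)
    u_buf[0] = rng.standard_normal()
    # noisy measurement of the model response
    d = u_buf @ w_o + sigma_v * rng.standard_normal()
    # LMS update: w <- w + mu * u * (d - u @ w)
    w = w + mu * u_buf * (d - u_buf @ w)

print(np.linalg.norm(w - w_o))         # small residual deviation, set by the noise level
```

As the text notes, the limiting accuracy of this stand-alone filter is governed by the local noise power; lowering sigma_v or mu reduces the residual deviation at the price of slower convergence.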

3.09.2.3 Application: target localization

Our third example relates to the problem of locating a destination of interest (such as the location of a nutrition source or a chemical leak) or locating and tracking an object of interest (such as a predator or a projectile). In several such localization applications, the agents in the network are allowed to move towards the target or away from it, in which case we would end up with a mobile adaptive network [8]. Biological networks, such as those formed by fish schools, bird formations, bee swarms, bacteria motility, and diffusing particles, behave in this manner [8–12]. The agents may move towards the target (e.g., when it is a nutrition source) or away from it (e.g., when it is a predator). In other applications, the agents may remain static and are simply interested in locating or tracking a target (such as a projectile).

To motivate mean-square-error estimation in the context of localization problems, we consider the situation corresponding to a static target and static nodes. Thus, assume that the unknown location of the target in the Cartesian plane is represented by the image vector image. The agents are spread over the same region of space and are interested in locating the target. The location of every agent image is denoted by the image vector image in terms of its image and image coordinates—see Figure 9.5. We assume the agents are aware of their location vectors. The distance between agent image and the target is denoted by image and is equal to:

image (9.46)

The image unit-norm direction vector pointing from agent image towards the target is denoted by image and is given by:

image (9.47)

Observe from (9.46) and (9.47) that image can be expressed in the following useful inner-product form:

image (9.48)

image

Figure 9.5 The distance from node image to the target is denoted by image and the unit-norm direction vector from the same node to the target is denoted by image. Node image is assumed to have access to noisy measurements of image.

In practice, agents have noisy observations of both their distance and direction vector towards the target. We denote the noisy distance measurement collected by node image at time image by:

image (9.49)

where image denotes noise that is assumed to be zero-mean, temporally white, and spatially independent with variance

image (9.50)

We also denote the noisy direction vector that is measured by node image at time image by image. This vector is a perturbed version of image. We assume that image continues to start from the location of the node at image, but that its tip is perturbed slightly either to the left or to the right relative to the tip of image—see Figure 9.6. The perturbation to the tip of image is modeled as being the result of two effects: a small deviation that occurs along the direction that is perpendicular to image, and a smaller deviation that occurs along the direction of image. Since we are assuming that the tip of image is only slightly perturbed relative to the tip of image, then it is reasonable to expect the amount of perturbation along the parallel direction to be small compared to the amount of perturbation along the perpendicular direction.

image

Figure 9.6 The tip of the noisy direction vector is modeled as being approximately perturbed away from the actual direction by two effects: a larger effect caused by a deviation along the direction that is perpendicular to image, and a smaller deviation along the direction that is parallel to image.

Thus, we write

image (9.51)

where image denotes a unit-norm row vector that lies in the same plane and whose direction is perpendicular to image. The variables image and image denote zero-mean independent random noises that are temporally white and spatially independent with variances:

image (9.52)

We assume the contribution of image is small compared to the contributions of the other noise sources, image and image, so that

image (9.53)

The random noises image are further assumed to be independent of each other.

Using (9.48) we find that the noisy measurements image are related to the unknown image via:

image (9.54)

where the modified noise term image is defined in terms of the noises in image and image as follows:

image (9.55)

since, by construction,

image (9.56)

and the contribution by image is assumed to be sufficiently small. If we now introduce the adjusted signal:

image (9.57)

then we arrive again from (9.54) and (9.55) at the following linear model for the available measurement variables image in terms of the target location image:

image (9.58)

There is one important difference in relation to the earlier linear models (9.9) and (9.44), namely, the variables image in (9.58) do not have zero means any longer. It is nevertheless straightforward to determine the first and second-order moments of the variables image. First, note from (9.49), (9.51), and (9.57) that

image (9.59)

Even in this case of non-zero means, and in view of (9.58), the desired parameter vector image can still be shown to satisfy the same normal equations (9.15), i.e.,

image (9.60)

where the moments image continue to be defined as

image (9.61)

To verify that (9.60) holds, we simply multiply both sides of (9.58) by image from the left, compute the expectations of both sides, and use the fact that image has zero mean and is assumed to be independent of image for all times image and nodes image. However, the difference in relation to the earlier normal equations (9.15) is that the matrix image is not the actual covariance matrix of image any longer. When image is not zero mean, its covariance matrix is instead defined as:

image (9.62)

so that

image (9.63)

We conclude from this relation that image is positive-definite (and, hence, invertible) so that expression (9.60) is justified. This is because the covariance matrix, image, is itself positive-definite. Indeed, some algebra applied to the difference image from (9.51) shows that

image (9.64)

where the matrix

image (9.65)

is full rank since the rows image are linearly independent vectors.

Therefore, each node image can determine image on its own by solving the same minimum mean-square-error estimation problem (9.16). This solution method requires knowledge of the moments image and, according to (9.20), each agent image would then attain an MSE level that is equal to the noise power level, image, at its location.

Alternatively, when the statistical information image is not available beforehand, each agent image can estimate image iteratively by feeding data image into the adaptive implementation (9.26) and (9.27). In this case, each agent image will achieve the performance level shown earlier in (9.36) and (9.37), with the limiting performance being again dependent on the local noise power level, image. Therefore, nodes with larger noise power will perform worse than nodes with smaller noise power. However, since all nodes are observing distances and direction vectors towards the same target location image, it is natural to expect cooperation among the nodes to be beneficial. As we are going to see, starting from the next section, one way to achieve cooperation and improve performance is by developing algorithms that solve the same global cost function (9.38) in an adaptive and distributed manner, by using algorithms such as (9.153) and (9.154) further ahead.
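To make the adjusted-signal construction concrete, the following sketch simulates one static agent estimating a static target location from noisy distance and direction measurements; the adjustment turns the measurements into the linear model of (9.58), after which an ordinary LMS recursion applies. All coordinates and noise levels are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

w_o = np.array([2.0, 3.0])             # unknown target location (illustrative)
x_k = np.array([1.0, -1.0])            # known location of agent k
mu = 0.01

r = np.linalg.norm(w_o - x_k)          # true distance to the target
u = (w_o - x_k) / r                    # true unit-norm direction vector
u_perp = np.array([-u[1], u[0]])       # in-plane perpendicular direction

w = np.zeros(2)                        # estimate of the target location
for i in range(50000):
    # noisy distance; noisy direction with dominant perpendicular perturbation
    d = r + 0.1 * rng.standard_normal()
    h = u + 0.1 * rng.standard_normal() * u_perp + 0.01 * rng.standard_normal() * u
    # adjusted signal: adding h @ x_k makes the model linear in the target location
    d_adj = d + h @ x_k
    # LMS update on the linear model d_adj ~ h @ w_o + noise
    w = w + mu * h * (d_adj - h @ w)

print(w)                               # approaches the target location w_o
```

Note that the small parallel perturbation is simply absorbed into the effective noise term, exactly as in the derivation: neglecting it introduces only a tiny bias of the order of its variance.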

3.09.2.3.1 Role of adaptation

The localization application helps highlight one of the main advantages of adaptation, namely, the ability of adaptive implementations to learn and track changing statistical conditions. For example, in the context of mobile networks, where nodes can move closer or further away from a target, the location vector for each agent image becomes time-dependent, say, image. In this case, the actual distance and direction vector between agent image and the target also vary with time and become:

image (9.66)

The noisy distance measurement to the target is then:

image (9.67)

where the variance of image now depends on time as well:

image (9.68)

In the context of mobile networks, it is reasonable to assume that the variance of image varies both with time and with the distance to the target: the closer the node is to the target, the less noisy the measurement of the distance is expected to be. Similar remarks hold for the variances of the noises image and image that perturb the measurement of the direction vector, say,

image (9.69)

where now

image (9.70)

The same arguments that led to (9.58) can be repeated to lead to the same model, except that now the means of the variables image become time-dependent as well:

image (9.71)

Nevertheless, adaptive solutions (whether cooperative or non-cooperative), are able to track such time-variations because these solutions work directly with the observations image and the successive observations will reflect the changing statistical profile of the data. In general, adaptive solutions are able to track changes in the underlying signal statistics rather well [4,5], as long as the rate of non-stationarity is slow enough for the filter to be able to follow the changes.

3.09.2.4 Application: collaborative spectral sensing

Our fourth and last example to illustrate the role of mean-square-error estimation and cooperation relates to spectrum sensing for cognitive radio applications. Cognitive radio systems involve two types of users: primary users and secondary users. To avoid causing harmful interference to incumbent primary users, unlicensed cognitive radio devices need to detect unused frequency bands even at low signal-to-noise ratio (SNR) conditions [13–16]. One way to carry out spectral sensing is for each secondary user to estimate the aggregated power spectrum that is transmitted by all active primary users, and to locate unused frequency bands within the estimated spectrum. This step can be performed by the secondary users with or without cooperation.

Thus, consider a communications environment consisting of image primary users and image secondary users. Let image denote the power spectrum of the signal transmitted by primary user image. To facilitate estimation of the spectral profile by the secondary users, we assume that each image can be represented as a linear combination of some basis functions, image, say, image of them [17]:

image (9.72)

In this representation, the scalars image denote the coefficients of the basis expansion for user image. The variable image denotes the normalized angular frequency measured in radians/sample. The power spectrum is often symmetric about the vertical axis, image, and therefore it is sufficient to focus on the interval image. There are many ways by which the basis functions, image, can be selected. The following is one possible construction for illustration purposes. We divide the interval image into image identical intervals and denote their center frequencies by image. We then place a Gaussian pulse at each location image and control its width through the selection of its standard deviation, image, i.e.,

image (9.73)

Figure 9.7 illustrates this construction. The parameters image are selected by the designer and are assumed to be known. For a sufficiently large number, image, of basis functions, the representation (9.72) can approximate well a large class of power spectra.
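As a hedged illustration of this construction (the grid size, number of basis functions, and pulse widths below are arbitrary choices, not values from the chapter), the following sketch builds the Gaussian basis on a frequency grid over the interval from 0 to pi and forms one candidate power spectrum as a nonnegative combination of the basis columns:

```python
import numpy as np

B = 8                                   # number of Gaussian basis functions
K = 64                                  # frequency grid points in [0, pi)
sigma = np.pi / (2 * B)                 # common width parameter (illustrative)

omega = np.linspace(0, np.pi, K, endpoint=False)    # frequency grid
centers = (np.arange(B) + 0.5) * np.pi / B          # center frequencies of the sub-intervals

# K x B matrix whose columns are the Gaussian pulses evaluated on the grid
Phi = np.exp(-(omega[:, None] - centers[None, :])**2 / (2 * sigma**2))

# any nonnegative combination of the columns approximates a power spectrum
alpha = np.array([0, 0, 1.0, 0.5, 0, 0, 0.8, 0])
S = Phi @ alpha
print(S.shape)                          # (64,)
```

Increasing B refines the approximation class, at the cost of a longer coefficient vector to be estimated by each secondary user.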

image

Figure 9.7 The interval image is divided into image sub-intervals of equal width; the center frequencies of the sub-intervals are denoted by image. A power spectrum image is approximated as a linear combination of Gaussian basis functions centered on the image.

We collect the combination coefficients image for primary user image into a column vector image:

image (9.74)

and collect the basis functions into a row vector:

image (9.75)

Then, the power spectrum (9.72) can be expressed in the alternative inner-product form:

image (9.76)

Let image denote the path loss coefficient from primary user image to secondary user image. When the transmitted spectrum image travels from primary user image to secondary user image, the spectrum that is sensed by node image is image. We assume in this example that the path loss factors image are known and that they have been determined during a prior training stage involving each of the primary users with each of the secondary users. The training is usually repeated at regular intervals of time to accommodate the fact that the path loss coefficients can vary (albeit slowly) over time. Figure 9.8 depicts a cognitive radio system with image primary users and image secondary users. One of the secondary users (user image) is highlighted and the path loss coefficients from the primary users to its location are indicated; similar path loss coefficients can be assigned to all other combinations involving primary and secondary users.

image

Figure 9.8 A network of secondary users in the presence of two primary users. One of the secondary users is highlighted and the path loss coefficients from the primary users to its location are indicated as image and image.

Each user image senses the aggregate effect of the power spectra that are transmitted by all active primary users. Therefore, adding the effect of all primary users, we find that the power spectrum that arrives at secondary user image is given by:

image (9.77)

where image denotes the receiver noise power at node image, and where we introduced the following vector quantities:

image (9.78)

image (9.79)

The vector image is the collection of all combination coefficients for all image primary users. The vector image contains the path loss coefficients from all primary users to user image. Now, at every time instant image, user image observes its received power spectrum, image, over a discrete grid of frequencies, image, in the interval image in the presence of additive measurement noise. We denote these measurements by:

image (9.80)

The term image denotes sampling noise and is assumed to have zero mean and variance image; it is also assumed to be temporally white, spatially independent, and independent of all other random variables. Since the row vectors image in (9.79) are defined in terms of the path loss coefficients image, and since these coefficients are estimated and subject to noisy distortions, we model the image as zero-mean random variables in (9.80) and use the boldface notation for them.

Observe that in this application, each node image collects image measurements at every time instant image and not only a single measurement, as was the case with the three examples discussed in the previous sections (AR modeling, MA modeling, and localization). The implication of this fact is that we now deal with an estimation problem that involves vector measurements instead of scalar measurements at each node. The solution structure continues to be the same. We collect the image measurements at node image at time image into vectors and introduce the image quantities:

image (9.81)

and the regression matrix:

image (9.82)

The time subscript in image is used to model the fact that the path loss coefficients can change over time due to the possibility of node mobility. With the above notation, expression (9.80) is equivalent to the linear model:

image (9.83)

Compared to the earlier examples (9.9), (9.44), and (9.58), the main difference now is that each agent image collects a vector of measurements, image, as opposed to the scalar image, and its regression data are represented by the matrix quantity, image, as opposed to the row vector image. Nevertheless, the estimation approach will continue to be the same. In cognitive network applications, the secondary users are interested in estimating the aggregate power spectrum of the primary users in order for the secondary users to identify vacant frequency bands that can be used by them. In the context of model (9.83), this amounts to determining the parameter vector image since knowledge of its entries allows each secondary user to reconstruct the aggregate power spectrum defined by:

image (9.84)

where the notation image denotes the Kronecker product operation, and image denotes a image vector whose entries are all equal to one.

As before, we can again verify that, in view of (9.83), the desired parameter vector image satisfies the same normal equations:

image (9.85)

where the moments image are now defined by

image (9.86)

image (9.87)

Therefore, each secondary user image can determine image on its own by solving the following minimum mean-square-error estimation problem:

image (9.88)

This solution method requires knowledge of the moments image and, in an argument similar to the one that led to (9.20), it can be verified that each agent image would attain an MSE performance level that is equal to the noise power level, image, at its location.

Alternatively, when the statistical information image is not available, each secondary user image can estimate image iteratively by feeding data image into an adaptive implementation similar to (9.26) and (9.27), such as the following vector LMS recursion:

image (9.89)

image (9.90)

In this case, each secondary user image will achieve performance levels similar to (9.36) and (9.37) with image replaced by image and image replaced by image. The performance will again be dependent on the local noise level, image. As a result, secondary users with larger noise power will perform worse than secondary users with smaller noise power. However, since all secondary users are observing data arising from the same underlying model image, it is natural to expect cooperation among the users to be beneficial. As we are going to see, starting from the next section, one way to achieve cooperation and improve performance is by developing algorithms that solve the following global cost function in an adaptive and distributed manner:
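A minimal sketch of such a vector LMS recursion follows; a synthetic random regression matrix stands in for the path-loss/basis structure of the text, and all dimensions and values are illustrative assumptions. The point is only that the whole residual vector enters each update:

```python
import numpy as np

rng = np.random.default_rng(2)

M = 6                                   # size of the parameter vector (illustrative)
Nw = 20                                 # number of frequency samples per scan
w_o = rng.random(M)                     # unknown combination coefficients
mu = 0.005

w = np.zeros(M)
for i in range(5000):
    # regression matrix for this scan (zero-mean, as in the text's model)
    H = rng.standard_normal((Nw, M))
    y = H @ w_o + 0.1 * rng.standard_normal(Nw)
    # vector LMS: the update uses the full residual vector y - H @ w
    w = w + mu * H.T @ (y - H @ w)

print(np.linalg.norm(w - w_o))          # small, noise-limited deviation
```

Because each scan supplies Nw equations at once, convergence is considerably faster per iteration than in the scalar-measurement examples, for a comparable step-size.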

image (9.91)

3.09.3 Distributed optimization via diffusion strategies

The examples in the previous section were meant to illustrate how MSE cost functions and linear models are useful design tools and how they arise frequently in applications. We now return to problem (9.1) and study the distributed optimization of global cost functions such as (9.39), where image is assumed to consist of the sum of individual components. Specifically, we are now interested in solving optimization problems of the type:

image (9.92)

where each image is assumed to be differentiable and convex over image. Although the algorithms presented in this chapter apply to more general situations, we shall nevertheless focus on mean-square-error cost functions of the form:

image (9.93)

where image is an image column vector, and the random processes image are assumed to be jointly wide-sense stationary with zero-mean and second-order moments:

image (9.94)

image (9.95)

image (9.96)

It is clear that each image is quadratic in image since, after expansion, we get

image (9.97)

A completion-of-squares argument shows that image can be expressed as the sum of two squared terms, i.e.,

image (9.98)

or, more compactly,

image (9.99)

where image denotes the minimizer of image and is given by

image (9.100)

and image denotes the minimum value of image when evaluated at image:

image (9.101)

Observe that this value is necessarily nonnegative since it can be viewed as the Schur complement of the following covariance matrix:

image (9.102)

and covariance matrices are nonnegative-definite.
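Since the rendered symbols are not reproduced in this text, it may help to spell out the completion-of-squares step in standard notation, writing the moments of (9.94)–(9.96) as \(\sigma_{d,k}^2\), \(r_{du,k}\), and \(R_{u,k}\) (an assumed, conventional labeling):

```latex
J_k(w) \;=\; \sigma_{d,k}^2 - w^* r_{du,k} - r_{du,k}^* w + w^* R_{u,k}\, w
       \;=\; J_{k,\min} \;+\; (w - w^o)^* R_{u,k}\, (w - w^o),
\qquad
w^o \;=\; R_{u,k}^{-1} r_{du,k},
\qquad
J_{k,\min} \;=\; \sigma_{d,k}^2 - r_{du,k}^*\, R_{u,k}^{-1}\, r_{du,k}.
```

Expanding the quadratic term reproduces the cross terms \(-w^* r_{du,k} - r_{du,k}^* w\) and cancels the \(r_{du,k}^* R_{u,k}^{-1} r_{du,k}\) contribution inside \(J_{k,\min}\), consistent with (9.98)–(9.101).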

The choice of the quadratic form (9.93) or (9.97) for image is useful for many applications, as was already illustrated in the previous section for examples involving AR modeling, MA modeling, localization, and spectral sensing. Other choices for image are of course possible and these choices can even be different for different nodes. It is sufficient in this chapter to illustrate the main concepts underlying diffusion adaptation by focusing on the useful case of MSE cost functions of the form (9.97); still, most of the derivations and arguments in the coming sections can be extended beyond MSE optimization to more general cost functions and to situations where the minimizers of the individual costs do not necessarily occur at the same location—as already shown in [13]; see also Section 3.09.10.4.

The positive-definiteness of the covariance matrices image ensures that each image in (9.97) is strictly convex, as well as image from (9.39). Moreover, all these cost functions have a unique minimum at the same image, which satisfies the normal equations:

image (9.103)

Therefore, given knowledge of image, each node can determine image on its own by solving (9.103). One then wonders about the need to seek distributed, cooperative, and adaptive solutions in this case. There are several reasons:

a. First, even for MSE cost functions, it is often the case that the required moments image are not known beforehand. In this case, the optimal image cannot be determined from the solution of the normal equations (9.103). The alternative methods that we shall describe will lead to adaptive techniques that enable each node image to estimate image directly from data realizations.

b. Second, since adaptive strategies rely on instantaneous data, these strategies possess powerful tracking abilities. Even when the moments vary with time due to non-stationary behavior (such as image changing with time), these changes will be reflected in the observed data and will in turn influence the behavior of the adaptive construction. This is one of the key advantages of adaptive strategies: they enable learning and tracking in real-time.

c. Third, cooperation among nodes is generally beneficial. When nodes act individually, their performance is limited by the noise power level at their location. In this way, some nodes can perform significantly better than other nodes. On the other hand, when nodes cooperate with their neighbors and share information during the adaptation process, we will see that performance can be improved across the network.
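Points (a) and (b) can be made concrete with a small simulation contrasting the two routes to the minimizer: solving the normal equations directly when the moments are known, versus running LMS on raw data realizations when they are not. All numerical values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

M = 4
R_u = np.diag([1.0, 2.0, 0.5, 1.5])     # regressor covariance (illustrative, assumed known in case (a))
w_o = np.array([1.0, -0.5, 0.25, 2.0])
r_du = R_u @ w_o                        # cross-covariance implied by the linear data model

# (a) moments known: solve the normal equations directly
w_closed = np.linalg.solve(R_u, r_du)

# (b) moments unknown: estimate from streaming realizations via LMS
mu = 0.01
w = np.zeros(M)
L = np.linalg.cholesky(R_u)
for i in range(20000):
    u = L @ rng.standard_normal(M)      # regressor realization with covariance R_u
    d = u @ w_o + 0.1 * rng.standard_normal()
    w = w + mu * u * (d - u @ w)

print(np.linalg.norm(w_closed - w_o))   # zero up to round-off
print(np.linalg.norm(w - w_o))          # small, noise-limited
```

Only route (b) remains viable when the moments drift over time, since it works directly with the observed data; this is the tracking advantage highlighted in point (b).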

3.09.3.1 Relating the global cost to neighborhood costs

Let us therefore consider the optimization of the following global cost function:

image (9.104)

where image is given by (9.93) or (9.97). Our strategy to optimize image in a distributed manner is based on two steps [2,18]. First, using a completion-of-squares argument (or, equivalently, a second-order Taylor series expansion), we approximate the global cost function (9.104) by an alternative local cost that is amenable to distributed optimization. Then, each node will optimize the alternative cost via a steepest-descent method.

To motivate the distributed diffusion-based approach, we start by introducing a set of nonnegative coefficients image that satisfy two conditions:

image (9.105)

where image denotes the neighborhood of node image. Condition (9.105) means that for every node image, the sum of the coefficients image that relate it to its neighbors is one. The coefficients image are free parameters that are chosen by the designer; obviously, as shown later in Theorem 9.6.8, their selection will have a bearing on the performance of the resulting algorithms. If we collect the entries image into an image matrix image, so that the imageth row of image is formed of image, then condition (9.105) translates into saying that each row of image adds up to one, i.e.,

image (9.106)

where the notation image denotes an image column vector with all its entries equal to one:

image (9.107)
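As an illustrative check of condition (9.106), the following sketch builds uniform combination weights over the neighborhoods of a small hypothetical network and verifies that every row of the resulting combination matrix adds up to one (the topology and weight rule are assumptions for illustration only):

```python
import numpy as np

# adjacency of a small 4-node network; each node is its own neighbor
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2, 3], 3: [2, 3]}
N = 4

# uniform ("averaging") combination weights over each neighborhood
C = np.zeros((N, N))
for k, nbrs in neighbors.items():
    for l in nbrs:
        C[k, l] = 1.0 / len(nbrs)       # row k collects the weights node k assigns

# right stochastic: every row adds up to one, i.e., C @ 1 = 1
ones = np.ones(N)
print(np.allclose(C @ ones, ones))      # True
```

Any other nonnegative rule with unit row sums (e.g., Metropolis-type weights) would satisfy (9.105)–(9.106) equally well; the choice affects performance, as discussed later.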

We say that image is a right stochastic matrix. Using the coefficients image so defined, we associate with each node image, a local cost function of the following form:

image (9.108)

This cost consists of a weighted combination of the individual costs of the neighbors of node image (including image itself)—see Figure 9.9. Since the image are all nonnegative and each image is strictly convex, then image is also strictly convex and its minimizer occurs at the same image. Using the alternative representation (9.99) for the individual image, we can re-express the local cost image as

image (9.109)

or, equivalently,

image (9.110)

where image corresponds to the minimum value of image at the minimizer image:

image (9.111)

and image is a positive-definite weighting matrix defined by:

image (9.112)

That is, image is a weighted combination of the covariance matrices in the neighborhood of node image. Equality (9.110) amounts to a (second-order) Taylor series expansion of image around image. Note that the right-hand side consists of two terms: the minimum cost and a weighted quadratic term in the difference image.

image

Figure 9.9 A network with image nodes. The nodes in the neighborhood of node 3 are highlighted with their individual cost functions, and with the combination weights image along the connecting edges; there is also a combination weight associated with node 3 itself, denoted by image. The expression for the local cost function, image, is also shown in the figure.

Now note that we can express image from (9.104) as follows:

image (9.113)

Substituting (9.110) into the second term on the right-hand side of the above expression gives:

image (9.114)

The last term in the above expression does not depend on image. Therefore, minimizing image over image is equivalent to minimizing the following alternative global cost:

image (9.115)

Expression (9.115) relates the optimization of the original global cost function, image or its equivalent image, to the newly-introduced local cost function image. The relation is through the second term on the right-hand side of (9.115), which corresponds to a sum of quadratic factors involving the minimizer image; this term tells us how the local cost image can be corrected to the global cost image. Obviously, the minimizer image that appears in the correction term is not known since the nodes wish to determine its value. Likewise, not all the weighting matrices image are available to node image; only those matrices that originate from its neighbors can be assumed to be available. Still, expression (9.115) suggests a useful way to replace image by another local cost that is closer to image. This alternative cost will be shown to lead to a powerful distributed solution to optimize image through localized interactions.

Our first step is to limit the summation on the right-hand side of (9.115) to the neighbors of node image (since every node image can only have access to information from its neighbors). We thus introduce the modified cost function at node image:

image (9.116)

The cost functions image and image are both associated with node image; the difference between them is that the expression for the latter is closer to the global cost function (9.115) that we want to optimize.

The weighting matrices image that appear in (9.116) may or may not be available because the second-order moments image may or may not be known beforehand. If these moments are known, then we can proceed with the analysis by assuming knowledge of the image. However, the more interesting case is when these moments are not known. This is generally the case in practice, especially in the context of adaptive solutions and problems involving non-stationary data. Often, nodes can only observe realizations image of the regression data image arising from distributions whose covariance matrices are the unknown image. One way to address the difficulty is to replace each of the weighted norms image in (9.116) by a scaled multiple of the un-weighted norm, say,

image (9.117)

where image is some nonnegative coefficient; we are even allowing its value to change with the node index image. The above substitution amounts to having each node image approximate the image from its neighbors by multiples of the identity matrix

image (9.118)

Approximation (9.117) is reasonable in view of the fact that all vector norms are equivalent [19–21]; this norm property ensures that we can bound the weighted norm image by some constants multiplying the un-weighted norm image, say, as:

image (9.119)

for some positive constants image. Using the fact that the image are Hermitian positive-definite matrices, and calling upon the Rayleigh-Ritz characterization of eigenvalues [19,20], we can be more specific and replace the above inequalities by

image (9.120)

We note that approximations similar to (9.118) are common in stochastic approximation theory and they mark the difference between using Newton’s iterative method and using a stochastic gradient method [5,22]; the former uses Hessian matrices as approximations for image while the latter uses multiples of the identity matrix. Furthermore, as the derivation will reveal, we do not need to worry at this stage about how to select the scalars image; they will end up being embedded into another set of coefficients image that will be set by the designer or adjusted by the algorithm—see (9.132) further ahead.

Thus, we replace (9.116) by

image (9.121)

The argument so far has suggested how to modify image from (9.108) and replace it by the cost (9.121) that is closer in form to the global cost function (9.115). If we replace image by its definition (9.108), we can rewrite (9.121) as

image (9.122)

With the exception of the variable image, this approximate cost at node image relies solely on information that is available to node image from its neighborhood. We will soon explain how to handle the fact that image is not known beforehand to node image.

3.09.3.2 Steepest-descent iterations

Node k can apply a steepest-descent iteration to minimize the cost (9.122). Let w_{k,i} denote the estimate for the minimizer w^o that is evaluated by node k at time i. Starting from an initial condition w_{k,−1}, node k can compute successive estimates iteratively as follows:

w_{k,i} = w_{k,i−1} − μ_k [∇_w J(w_{k,i−1})]*, i ≥ 0 (9.123)

where μ_k is a small positive step-size parameter, and the notation ∇_w J(w_{k,i−1}) denotes the gradient vector of the cost function in (9.122) relative to w, evaluated at w = w_{k,i−1}. The step-size parameter μ_k can be selected to vary with time as well. One choice that is common in the optimization literature [5,22,23] is to replace μ_k in (9.123) by step-size sequences μ_k(i) that satisfy the two conditions (9.25). However, such step-size sequences are not suitable for applications that require continuous learning because they turn off adaptation as i → ∞; the steepest-descent iteration (9.123) would stop updating since μ_k(i) would be tending towards zero. For this reason, we shall focus mainly on the constant step-size case described by (9.123) since we are interested in developing distributed algorithms that will endow networks with continuous adaptation abilities.

Returning to (9.123) and computing the gradient vector of (9.122) we get:

w_{k,i} = w_{k,i−1} − μ_k Σ_{l∈N_k} c_{lk} [∇_w J_l(w_{k,i−1})]* + μ_k Σ_{l∈N_k\{k}} b_{lk} (w^o − w_{k,i−1}) (9.124)

Using the expression for ∇_w J_l(w) from (9.97) we arrive at

w_{k,i} = w_{k,i−1} + μ_k Σ_{l∈N_k} c_{lk} (r_{du,l} − R_{u,l} w_{k,i−1}) + μ_k Σ_{l∈N_k\{k}} b_{lk} (w^o − w_{k,i−1}) (9.125)

This iteration indicates that the update from w_{k,i−1} to w_{k,i} involves adding two correction terms to w_{k,i−1}. Among many other forms, we can implement the update in two successive steps by adding one correction term at a time, say, as follows:

ψ_{k,i} = w_{k,i−1} + μ_k Σ_{l∈N_k} c_{lk} (r_{du,l} − R_{u,l} w_{k,i−1}) (9.126)

w_{k,i} = ψ_{k,i} + μ_k Σ_{l∈N_k\{k}} b_{lk} (w^o − w_{k,i−1}) (9.127)

Step (9.126) updates w_{k,i−1} to an intermediate value ψ_{k,i} by using local gradient vectors from the neighborhood of node k. Step (9.127) further updates ψ_{k,i} to w_{k,i}. Two issues stand out from examining (9.127):

a. First, iteration (9.127) requires knowledge of the minimizer w^o. Neither node k nor its neighbors know the value of the minimizer; each of these nodes is actually performing steps similar to (9.126) and (9.127) to estimate the minimizer. However, each node l has a readily available approximation for w^o, namely, its local intermediate estimate ψ_{l,i}. Therefore, we replace w^o in (9.127) by ψ_{l,i}. This step helps diffuse information throughout the network. This is because each neighbor of node k determines its estimate ψ_{l,i} by processing information from its own neighbors, which process information from their neighbors, and so forth.

b. Second, the intermediate value ψ_{k,i} at node k is generally a better estimate for w^o than w_{k,i−1} since it is obtained by incorporating information from the neighbors through the first step (9.126). Therefore, we further replace w_{k,i−1} in (9.127) by ψ_{k,i}. This step is reminiscent of incremental-type approaches to optimization, which have been widely studied in the literature [24–27].

With the substitutions described in items (a) and (b) above, we replace the second step (9.127) by

w_{k,i} = ψ_{k,i} + μ_k Σ_{l∈N_k\{k}} b_{lk} (ψ_{l,i} − ψ_{k,i}) (9.128)

Introduce the weighting coefficients:

a_{kk} ≜ 1 − μ_k Σ_{l∈N_k\{k}} b_{lk} (9.129)

a_{lk} ≜ μ_k b_{lk}, for l ∈ N_k\{k} (9.130)

a_{lk} ≜ 0, for l ∉ N_k (9.131)

and observe that, for sufficiently small step-sizes μ_k, these coefficients are nonnegative and, moreover, they satisfy the conditions:

a_{lk} ≥ 0, Σ_{l∈N_k} a_{lk} = 1, and a_{lk} = 0 if l ∉ N_k (9.132)

Condition (9.132) means that for every node k, the sum of the coefficients {a_{lk}} that relate it to its neighbors is one. Just like the {c_{lk}}, from now on, we will treat the coefficients {a_{lk}} as free weighting parameters that are chosen by the designer according to (9.132); their selection will also have a bearing on the performance of the resulting algorithms—see Theorem 9.6.8. If we collect the entries {a_{lk}} into an N × N matrix A, such that the kth column of A consists of {a_{lk}, l = 1, 2, …, N}, then condition (9.132) translates into saying that each column of A adds up to one:

Aᵀ𝟙 = 𝟙 (9.133)

where 𝟙 denotes the column vector with all entries equal to one. We say that A is a left-stochastic matrix.
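As a concrete illustration of condition (9.133), the following sketch builds a left-stochastic combination matrix from a network topology using a uniform averaging rule; the rule, the example topology, and all identifiers are illustrative choices, not prescribed by the text.

```python
import numpy as np

def uniform_left_stochastic(adjacency):
    """Build a left-stochastic combination matrix from a 0/1 adjacency
    matrix that includes self-loops: entry (l, k) receives 1/|N_k| for
    every neighbor l of node k, so each column sums to one."""
    N = adjacency.shape[0]
    A = np.zeros((N, N))
    for k in range(N):
        neighbors = np.flatnonzero(adjacency[:, k])
        A[neighbors, k] = 1.0 / len(neighbors)
    return A

# A 4-node line network 1-2-3-4, with self-loops on the diagonal.
adj = np.array([[1, 1, 0, 0],
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [0, 0, 1, 1]])
A = uniform_left_stochastic(adj)
print(A.sum(axis=0))  # every column adds up to one
```

Any other nonnegative choice with unit column sums (and zeros outside the neighborhoods) would equally qualify as a valid left-stochastic combination matrix.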

3.09.3.3 Adapt-then-Combine (ATC) diffusion strategy

Using the coefficients {a_{lk}} so defined, we replace (9.126) and (9.128) by the following recursions, for i ≥ 0:

ψ_{k,i} = w_{k,i−1} + μ_k Σ_{l∈N_k} c_{lk} (r_{du,l} − R_{u,l} w_{k,i−1})
w_{k,i} = Σ_{l∈N_k} a_{lk} ψ_{l,i}
(9.134)

for some nonnegative coefficients {c_{lk}, a_{lk}} that satisfy conditions (9.106) and (9.133), namely,

c_{lk} ≥ 0, Σ_{k=1}^N c_{lk} = 1, and c_{lk} = 0 if l ∉ N_k
a_{lk} ≥ 0, Σ_{l=1}^N a_{lk} = 1, and a_{lk} = 0 if l ∉ N_k
(9.135)

or, equivalently,

C𝟙 = 𝟙, Aᵀ𝟙 = 𝟙 (9.136)

To run algorithm (9.134), we only need to select the coefficients {a_{lk}, c_{lk}} that satisfy (9.135) or (9.136); there is no need to worry about the intermediate coefficients {b_{lk}} any longer since they have been blended into the {a_{lk}}. The scalars {a_{lk}, c_{lk}} that appear in (9.134) correspond to weighting coefficients over the edges linking node k to its neighbors l ∈ N_k. Note that two sets of coefficients are used to scale the data that are being received by node k: one set of coefficients, {c_{lk}}, is used in the first step of (9.134) to scale the moment data {r_{du,l}, R_{u,l}}, and a second set of coefficients, {a_{lk}}, is used in the second step of (9.134) to scale the estimates {ψ_{l,i}}. Figure 9.10 explains what the entries on the columns and rows of the combination matrices {A, C} stand for, using an example matrix A for illustration. When the combination matrix is right-stochastic (as is the case with C), each of its rows would add up to one. On the other hand, when the matrix is left-stochastic (as is the case with A), each of its columns would add up to one.

image

Figure 9.10 Interpretation of the columns and rows of combination matrices. The pair of entries {a_{lk}, a_{kl}} correspond to the weighting coefficients used over the edge connecting nodes k and l. When nodes k and l are not neighbors, these weights are zero.

At every time instant i, the ATC strategy (9.134) performs two steps. The first step is an information exchange step where node k receives from its neighbors their moments {r_{du,l}, R_{u,l}}. Node k combines this information and uses it to update its existing estimate w_{k,i−1} to an intermediate value ψ_{k,i}. All other nodes in the network are performing a similar step and updating their existing estimates {w_{l,i−1}} into intermediate estimates {ψ_{l,i}} by using information from their neighbors. The second step in (9.134) is an aggregation or consultation step where node k combines the intermediate estimates of its neighbors to obtain its updated estimate w_{k,i}. Again, all other nodes in the network are simultaneously performing a similar step. The reason for the name Adapt-then-Combine (ATC) strategy is that the first step in (9.134) will be shown to lead to an adaptive step, while the second step in (9.134) corresponds to a combination step. Hence, strategy (9.134) involves adaptation followed by combination, or ATC for short. The reason for the qualification “diffusion” is that the combination step in (9.134) allows information to diffuse through the network in real time. This is because each of the estimates ψ_{l,i} is influenced by data beyond the immediate neighborhood of node k.

In the special case when C = I_N, so that no information exchange is performed but only the aggregation step, the ATC strategy (9.134) reduces to:

ψ_{k,i} = w_{k,i−1} + μ_k (r_{du,k} − R_{u,k} w_{k,i−1})
w_{k,i} = Σ_{l∈N_k} a_{lk} ψ_{l,i}
(9.137)

where the first step relies solely on the moments {r_{du,k}, R_{u,k}} that are available locally at node k.
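To make the two steps of the ATC recursion concrete, here is a minimal simulation of the ATC steepest-descent strategy on a small ring network, assuming the moments {r_{du,l}, R_{u,l}} are known and generated from a common model w^o; the topology, step-size, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 4, 2                                 # number of nodes, filter length
w_o = rng.standard_normal((M, 1))           # common minimizer w^o

# Per-node moments satisfying r_{du,l} = R_{u,l} w^o (linear data model).
R_u = [np.diag(rng.uniform(1.0, 2.0, M)) for _ in range(N)]
r_du = [R @ w_o for R in R_u]

# Ring topology with self-loops; uniform combination weights.
adj = np.eye(N) + np.roll(np.eye(N), 1, axis=0) + np.roll(np.eye(N), -1, axis=0)
A = adj / adj.sum(axis=0, keepdims=True)    # left-stochastic (columns sum to 1)
C = adj / adj.sum(axis=1, keepdims=True)    # right-stochastic (rows sum to 1)

mu = 0.05
w = [np.zeros((M, 1)) for _ in range(N)]
for i in range(500):
    # Adaptation: each node k combines neighborhood moment information.
    psi = [w[k] + mu * sum(C[l, k] * (r_du[l] - R_u[l] @ w[k])
                           for l in range(N) if adj[l, k])
           for k in range(N)]
    # Combination: each node k aggregates its neighbors' intermediate estimates.
    w = [sum(A[l, k] * psi[l] for l in range(N) if adj[l, k])
         for k in range(N)]

print(max(np.linalg.norm(w[k] - w_o) for k in range(N)))  # residual near zero
```

With the true moments available and a step-size inside the stability range, every node's estimate approaches w^o, which is the behavior established later in Section 3.09.5.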

Observe in passing that the term that appears in the information exchange step of (9.134) is related to the gradient vectors of the local costs {J_l(w)} evaluated at w_{k,i−1}, i.e., it holds that

r_{du,l} − R_{u,l} w_{k,i−1} = −[∇_w J_l(w_{k,i−1})]* (9.138)

so that the ATC strategy (9.134) can also be written in the following equivalent form:

ψ_{k,i} = w_{k,i−1} − μ_k Σ_{l∈N_k} c_{lk} [∇_w J_l(w_{k,i−1})]*
w_{k,i} = Σ_{l∈N_k} a_{lk} ψ_{l,i}
(9.139)

The significance of this general form is that it is applicable to optimization problems involving more general local costs J_l(w) that are not necessarily quadratic in w, as detailed in [13]—see also Section 3.09.10.4. The top part of Figure 9.11 illustrates the two steps involved in the ATC procedure for a situation where node k has three other neighbors. In the first step, node k evaluates the gradient vectors of its neighbors at w_{k,i−1}, and subsequently aggregates the estimates {ψ_{l,i}} from its neighbors. The dotted arrows represent the flow of information towards node k from its neighbors. The solid arrows represent the flow of information from node k to its neighbors. The CTA diffusion strategy is discussed next.

image

Figure 9.11 Illustration of the ATC and CTA strategies for a node k with three other neighbors. The updates involve two steps: information exchange followed by aggregation in ATC, and aggregation followed by information exchange in CTA. The dotted arrows represent the data received from the neighbors of node k, and the solid arrows represent the data sent from node k to its neighbors.

3.09.3.4 Combine-then-Adapt (CTA) diffusion strategy

Similarly, if we return to (9.125) and add the second correction term first, then (9.126) and (9.127) are replaced by:

ψ_{k,i−1} = w_{k,i−1} + μ_k Σ_{l∈N_k\{k}} b_{lk} (w^o − w_{k,i−1}) (9.140)

w_{k,i} = ψ_{k,i−1} + μ_k Σ_{l∈N_k} c_{lk} (r_{du,l} − R_{u,l} w_{k,i−1}) (9.141)

Following similar reasoning to what we did before in the ATC case, we replace w^o in step (9.140) by w_{l,i−1} and replace w_{k,i−1} in (9.141) by ψ_{k,i−1}. We then introduce the same coefficients {a_{lk}} and arrive at the following Combine-then-Adapt (CTA) strategy:

ψ_{k,i−1} = Σ_{l∈N_k} a_{lk} w_{l,i−1}
w_{k,i} = ψ_{k,i−1} + μ_k Σ_{l∈N_k} c_{lk} (r_{du,l} − R_{u,l} ψ_{k,i−1})
(9.142)

where the nonnegative coefficients {c_{lk}, a_{lk}} satisfy the same conditions (9.106) and (9.133), namely,

c_{lk} ≥ 0, Σ_{k=1}^N c_{lk} = 1, and c_{lk} = 0 if l ∉ N_k
a_{lk} ≥ 0, Σ_{l=1}^N a_{lk} = 1, and a_{lk} = 0 if l ∉ N_k
(9.143)

or, equivalently,

C𝟙 = 𝟙, Aᵀ𝟙 = 𝟙 (9.144)

At every time instant i, the CTA strategy (9.142) also consists of two steps. The first step is an aggregation step where node k combines the existing estimates of its neighbors to obtain the intermediate estimate ψ_{k,i−1}. All other nodes in the network are simultaneously performing a similar step and aggregating the estimates of their neighbors. The second step in (9.142) is an information exchange step where node k receives from its neighbors their moments {r_{du,l}, R_{u,l}} and uses this information to update its intermediate estimate to w_{k,i}. Again, all other nodes in the network are simultaneously performing a similar information exchange step. The reason for the name Combine-then-Adapt (CTA) strategy is that the first step in (9.142) involves a combination step, while the second step will be shown to lead to an adaptive step. Hence, strategy (9.142) involves combination followed by adaptation, or CTA for short. The reason for the qualification “diffusion” is that the combination step of (9.142) allows information to diffuse through the network in real time.

In the special case when C = I_N, so that no information exchange is performed but only the aggregation step, the CTA strategy (9.142) reduces to:

ψ_{k,i−1} = Σ_{l∈N_k} a_{lk} w_{l,i−1}
w_{k,i} = ψ_{k,i−1} + μ_k (r_{du,k} − R_{u,k} ψ_{k,i−1})
(9.145)

where the second step relies solely on the moments {r_{du,k}, R_{u,k}} that are available locally at node k. Again, the CTA strategy (9.142) can be rewritten in terms of the gradient vectors of the local costs {J_l(w)} as follows:

ψ_{k,i−1} = Σ_{l∈N_k} a_{lk} w_{l,i−1}
w_{k,i} = ψ_{k,i−1} − μ_k Σ_{l∈N_k} c_{lk} [∇_w J_l(ψ_{k,i−1})]*
(9.146)

The bottom part of Figure 9.11 illustrates the two steps involved in the CTA procedure for a situation where node k has three other neighbors. In the first step, node k aggregates the estimates {w_{l,i−1}} from its neighbors, and subsequently performs information exchange by evaluating the gradient vectors of its neighbors at ψ_{k,i−1}.

3.09.3.5 Useful properties of diffusion strategies

Note that the structure of the ATC and CTA diffusion strategies (9.134) and (9.142) is fundamentally the same: the difference between the implementations lies in which variable we choose to correspond to the updated weight estimate w_{k,i}. In the ATC case, we choose the result of the combination step to be w_{k,i}, whereas in the CTA case we choose the result of the adaptation step to be w_{k,i}.

For ease of reference, Table 9.2 lists the steepest-descent diffusion algorithms derived in the previous sections. The derivation of the ATC and CTA strategies (9.134) and (9.142) followed the approach proposed in [18,28]. CTA estimation schemes were first proposed in the works [29–33], and later extended in [18,28,34,35]. The earlier versions of CTA in [29–31] used the choice C = I_N. This form of the algorithm with C = I_N, and with the additional constraint that the step-sizes μ_k should be time-dependent and decay towards zero as time progresses, was later applied by [36,37] to solve distributed optimization problems that require all nodes to reach consensus or agreement. Likewise, special cases of the ATC estimation scheme (9.134), involving an information exchange step followed by an aggregation step, first appeared in the work [38] on diffusion least-squares schemes and subsequently in the works [18,34,35,39–41] on distributed mean-square-error and state-space estimation methods. A special case of the ATC strategy (9.134) corresponding to the choice C = I_N with decaying step-sizes was adopted in [42] to ensure convergence towards a consensus state. Diffusion strategies of the form (9.134) and (9.142) (or, equivalently, (9.139) and (9.146)) are general in several respects:

1. These strategies do not only diffuse the local weight estimates; they can also diffuse the local gradient vectors. In other words, two sets of combination coefficients, {a_{lk}} and {c_{lk}}, can be used.

2. In the derivation that led to the diffusion strategies, the combination matrices C and A are only required to be right-stochastic (for C) and left-stochastic (for A). In comparison, it is common in consensus-type strategies to require the corresponding combination matrix to be doubly stochastic (i.e., its rows and columns should add up to one)—see, e.g., Appendix E and [36,43–45].

3. As the analysis in Section 3.09.6 will reveal, ATC and CTA strategies do not force nodes to converge to an agreement about the desired parameter vector w^o, as is common in consensus-type strategies (see Appendix E and [36,46–52]). Forcing nodes to reach agreement on w^o ends up limiting the adaptation and learning abilities of these nodes, as well as their ability to react to information in real-time. Nodes in diffusion networks enjoy more flexibility in the learning process, which allows their individual estimates, w_{k,i}, to tend to values that lie within a reasonable mean-square-deviation (MSD) level from the optimal solution, w^o. Multi-agent systems in nature behave in this manner; they do not require exact agreement among their agents (see, e.g., [8–10]).

4. The step-size parameters {μ_k} are not required to depend on the time index i and are not required to vanish as i → ∞ (as is common in many works on distributed optimization, e.g., [22,23,36,53]). Instead, the step-sizes can assume constant values, which is a critical property to endow networks with continuous adaptation and learning abilities. An important contribution in the study of diffusion strategies is to show that distributed optimization is still possible even for constant step-sizes, in addition to the ability to perform adaptation, learning, and tracking. Sections 3.09.5 and 3.09.6 highlight the convergence properties of the diffusion strategies—see also [13] for results pertaining to more general cost functions.

5. Even the combination weights {a_{lk}, c_{lk}} can be adapted, as we shall discuss later in Section 3.09.8.3. In this way, diffusion strategies allow multiple layers of adaptation: the nodes perform adaptive processing, the combination weights can be adapted, and even the topology can be adapted, especially for mobile networks [8].

Table 9.2

Summary of Steepest-Descent Diffusion Strategies for the Distributed Optimization of Problems of the form (9.92), and their Specialization to the Case of Mean-Square-Error (MSE) Individual Cost Functions Given by (9.93)

Image

3.09.4 Adaptive diffusion strategies

The distributed ATC and CTA steepest-descent strategies (9.134) and (9.142) for determining the w^o that solves (9.92) and (9.93) require knowledge of the statistical information {r_{du,l}, R_{u,l}}. These moments are needed in order to be able to evaluate the gradient vectors that appear in (9.134) and (9.142), namely, the terms:

Σ_{l∈N_k} c_{lk} (r_{du,l} − R_{u,l} w_{k,i−1}) (9.147)

Σ_{l∈N_k} c_{lk} (r_{du,l} − R_{u,l} ψ_{k,i−1}) (9.148)

for k = 1, 2, …, N. However, the moments {r_{du,l}, R_{u,l}} are often not available beforehand, which means that the true gradient vectors are generally not available. Instead, the agents have access to observations {d_l(i), u_{l,i}} of the underlying random processes. There are many ways by which the true gradient vectors can be approximated by using these observations. Recall that, by definition,

r_{du,l} = E d_l(i) u*_{l,i}, R_{u,l} = E u*_{l,i} u_{l,i} (9.149)

One common stochastic approximation method is to drop the expectation operator from the definitions of {r_{du,l}, R_{u,l}} and to use the following instantaneous approximations instead [47]:

r_{du,l} ≈ d_l(i) u*_{l,i}, R_{u,l} ≈ u*_{l,i} u_{l,i} (9.150)

In this case, the approximate gradient vectors become:

Σ_{l∈N_k} c_{lk} u*_{l,i} (d_l(i) − u_{l,i} w_{k,i−1}) (9.151)

Σ_{l∈N_k} c_{lk} u*_{l,i} (d_l(i) − u_{l,i} ψ_{k,i−1}) (9.152)

Substituting into the ATC and CTA steepest-descent strategies (9.134) and (9.142), we arrive at the following adaptive implementations of the diffusion strategies, for i ≥ 0:

ψ_{k,i} = w_{k,i−1} + μ_k Σ_{l∈N_k} c_{lk} u*_{l,i} (d_l(i) − u_{l,i} w_{k,i−1})
w_{k,i} = Σ_{l∈N_k} a_{lk} ψ_{l,i}
(9.153)

and

ψ_{k,i−1} = Σ_{l∈N_k} a_{lk} w_{l,i−1}
w_{k,i} = ψ_{k,i−1} + μ_k Σ_{l∈N_k} c_{lk} u*_{l,i} (d_l(i) − u_{l,i} ψ_{k,i−1})
(9.154)

where the coefficients {a_{lk}, c_{lk}} are chosen to satisfy:

c_{lk} ≥ 0, a_{lk} ≥ 0, C𝟙 = 𝟙, Aᵀ𝟙 = 𝟙, and c_{lk} = a_{lk} = 0 if l ∉ N_k (9.155)

The adaptive implementations usually start from the initial conditions w_{k,−1} = 0 for all k, or from some other convenient initial values. Clearly, in view of the approximations (9.151) and (9.152), the successive iterates {ψ_{k,i}, w_{k,i}} that are generated by the above adaptive implementations are different from the iterates that result from the steepest-descent implementations (9.134) and (9.142). Nevertheless, we shall continue to use the same notation for these variables for ease of reference. One key advantage of the adaptive implementations (9.153) and (9.154) is that they enable the agents to react to changes in the underlying statistical information {r_{du,l}, R_{u,l}} and to changes in w^o. This is because these changes end up being reflected in the data realizations {d_l(i), u_{l,i}}. Therefore, adaptive implementations have an innate tracking and learning ability that is of paramount significance in practice.
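The adaptive ATC recursion can be simulated directly from streaming data; a minimal sketch under an assumed stationary model d_k(i) = u_{k,i} w^o + v_k(i), with an illustrative fully connected network and C = I_N (no data sharing). All parameter values and identifiers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, T = 6, 3, 4000                 # nodes, filter length, iterations
w_o = rng.standard_normal((M, 1))    # unknown model to be estimated

A = np.full((N, N), 1.0 / N)         # fully connected, uniform (left-stochastic)
mu, sigma_v = 0.02, 0.1              # step-size and noise standard deviation

w = np.zeros((N, M, 1))              # initial conditions w_{k,-1} = 0
for i in range(T):
    # Streaming data: d_k(i) = u_{k,i} w_o + v_k(i), with row regressors u_{k,i}.
    U = rng.standard_normal((N, M))
    d = U @ w_o + sigma_v * rng.standard_normal((N, 1))
    # Adaptation step (C = I: each node uses only its own data), with the
    # instantaneous approximations replacing the true moments.
    psi = np.stack([w[k] + mu * U[k:k+1].T * (d[k, 0] - U[k:k+1] @ w[k])
                    for k in range(N)])
    # Combination step: convex combination of neighbors' intermediate estimates.
    w = np.stack([sum(A[l, k] * psi[l] for l in range(N)) for k in range(N)])

msd = np.mean([np.linalg.norm(w[k] - w_o) ** 2 for k in range(N)])
print(msd)  # small steady-state mean-square deviation
```

Note that, in contrast to the steepest-descent version, the estimates here hover around w^o at a small mean-square-deviation level set by the step-size and the noise, rather than converging to it exactly.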

We say that the stochastic gradient approximations (9.151) and (9.152) introduce gradient noise into each step of the recursive updates (9.153) and (9.154). This is because the updates (9.153) and (9.154) can be interpreted as corresponding to the following forms:

ψ_{k,i} = w_{k,i−1} − μ_k Σ_{l∈N_k} c_{lk} [∇̂_w J_l(w_{k,i−1})]*
w_{k,i} = Σ_{l∈N_k} a_{lk} ψ_{l,i}
(9.156)

and

ψ_{k,i−1} = Σ_{l∈N_k} a_{lk} w_{l,i−1}
w_{k,i} = ψ_{k,i−1} − μ_k Σ_{l∈N_k} c_{lk} [∇̂_w J_l(ψ_{k,i−1})]*
(9.157)

where the true gradient vectors, ∇_w J_l(·), have been replaced by the approximations ∇̂_w J_l(·)—compare with (9.139) and (9.146). The significance of the alternative forms (9.156) and (9.157) is that they are applicable to optimization problems involving more general local costs J_l(w) that are not necessarily quadratic, as detailed in [13]; see also Section 3.09.10.4. In the next section, we examine how gradient noise affects the performance of the diffusion strategies and how close the successive estimates {w_{k,i}} get to the desired optimal solution w^o. Table 9.3 lists several of the adaptive diffusion algorithms derived in this section.

Table 9.3

Summary of Adaptive Diffusion Strategies for the Distributed Optimization of Problems of the form (9.92), and their Specialization to the Case of Mean-Square-Error (MSE) Individual Cost Functions Given by (9.93). These Adaptive Solutions Rely on Stochastic Approximations

Image

The operation of the adaptive diffusion strategies is similar to the operation of the steepest-descent diffusion strategies of the previous section. Thus, note that at every time instant i, the ATC strategy (9.153) performs two steps, as illustrated in Figure 9.12. The first step is an information exchange step where node k receives from its neighbors their data {d_l(i), u_{l,i}}. Node k combines this information and uses it to update its existing estimate w_{k,i−1} to an intermediate value ψ_{k,i}. All other nodes in the network are performing a similar step and updating their existing estimates {w_{l,i−1}} into intermediate estimates {ψ_{l,i}} by using information from their neighbors. The second step in (9.153) is an aggregation or consultation step where node k combines the intermediate estimates {ψ_{l,i}} of its neighbors to obtain its updated estimate w_{k,i}. Again, all other nodes in the network are simultaneously performing a similar step. In the special case when C = I_N, so that no information exchange is performed but only the aggregation step, the ATC strategy (9.153) reduces to:

ψ_{k,i} = w_{k,i−1} + μ_k u*_{k,i} (d_k(i) − u_{k,i} w_{k,i−1})
w_{k,i} = Σ_{l∈N_k} a_{lk} ψ_{l,i}
(9.158)

image

Figure 9.12 Illustration of the adaptive ATC strategy, which involves two steps: information exchange followed by aggregation.

Likewise, at every time instant i, the CTA strategy (9.154) also consists of two steps—see Figure 9.13. The first step is an aggregation step where node k combines the existing estimates of its neighbors to obtain the intermediate estimate ψ_{k,i−1}. All other nodes in the network are simultaneously performing a similar step and aggregating the estimates of their neighbors. The second step in (9.154) is an information exchange step where node k receives from its neighbors their data {d_l(i), u_{l,i}} and uses this information to update its intermediate estimate to w_{k,i}. Again, all other nodes in the network are simultaneously performing a similar information exchange step. In the special case when C = I_N, so that no information exchange is performed but only the aggregation step, the CTA strategy (9.154) reduces to:

ψ_{k,i−1} = Σ_{l∈N_k} a_{lk} w_{l,i−1}
w_{k,i} = ψ_{k,i−1} + μ_k u*_{k,i} (d_k(i) − u_{k,i} ψ_{k,i−1})
(9.159)

image

Figure 9.13 Illustration of the adaptive CTA strategy, which involves two steps: aggregation followed by information exchange.

We further note that the adaptive ATC and CTA strategies (9.153) and (9.154) reduce to the non-cooperative adaptive solution (9.22) and (9.23), where each node k runs its own individual LMS filter, when the coefficients {a_{lk}, c_{lk}} are selected as

a_{lk} = c_{lk} = δ_{kl} (9.160)

where δ_{kl} denotes the Kronecker delta function:

δ_{kl} = 1 if k = l, and δ_{kl} = 0 otherwise (9.161)

In terms of the combination matrices A and C, this situation corresponds to setting

A = C = I_N (9.162)

3.09.5 Performance of steepest-descent diffusion strategies

Before studying in some detail the mean-square performance of the adaptive diffusion implementations (9.153) and (9.154), and the influence of gradient noise, we examine first the convergence behavior of the steepest-descent diffusion strategies (9.134) and (9.142), which employ the true gradient vectors. Doing so will help introduce the necessary notation and highlight some features of the analysis in preparation for the more challenging treatment of the adaptive strategies in Section 3.09.6.

3.09.5.1 General diffusion model

Rather than study the performance of the ATC and CTA steepest-descent strategies (9.134) and (9.142) separately, it is useful to introduce a more general description that includes the ATC and CTA recursions as special cases. Thus, consider a distributed steepest-descent diffusion implementation of the following general form, for i ≥ 0:

φ_{k,i−1} = Σ_{l∈N_k} a_{1,lk} w_{l,i−1} (9.163)

ψ_{k,i} = φ_{k,i−1} + μ_k Σ_{l∈N_k} c_{lk} (r_{du,l} − R_{u,l} φ_{k,i−1}) (9.164)

w_{k,i} = Σ_{l∈N_k} a_{2,lk} ψ_{l,i} (9.165)

where the scalars {a_{1,lk}, c_{lk}, a_{2,lk}} denote three sets of nonnegative real coefficients corresponding to the (l, k) entries of N × N combination matrices {A₁, C, A₂}, respectively. These matrices are assumed to satisfy the conditions:

Σ_{l∈N_k} a_{1,lk} = 1, Σ_{l∈N_k} a_{2,lk} = 1, Σ_{k=1}^N c_{lk} = 1 (9.166)

so that {A₁, A₂} are left stochastic and C is right-stochastic, i.e.,

A₁ᵀ𝟙 = 𝟙, A₂ᵀ𝟙 = 𝟙, C𝟙 = 𝟙 (9.167)

As indicated in Table 9.4, different choices for {A₁, A₂, C} correspond to different cooperation modes. For example, the choice A₁ = I_N and A₂ = A corresponds to the ATC implementation (9.134), while the choice A₁ = A and A₂ = I_N corresponds to the CTA implementation (9.142). Likewise, the choice C = I_N corresponds to the case in which the nodes only share weight estimates, and the distributed diffusion recursions (9.163)–(9.165) become:

φ_{k,i−1} = Σ_{l∈N_k} a_{1,lk} w_{l,i−1} (9.168)

ψ_{k,i} = φ_{k,i−1} + μ_k (r_{du,k} − R_{u,k} φ_{k,i−1}) (9.169)

w_{k,i} = Σ_{l∈N_k} a_{2,lk} ψ_{l,i} (9.170)

Furthermore, the choice A₁ = A₂ = C = I_N corresponds to the non-cooperative mode of operation, in which case the recursions reduce to the classical (stand-alone) steepest-descent recursion [47], where each node minimizes individually its own quadratic cost J_k(w), defined earlier in (9.97):

w_{k,i} = w_{k,i−1} + μ_k (r_{du,k} − R_{u,k} w_{k,i−1}), i ≥ 0 (9.171)
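The three-step general model, and how the choices of {A₁, A₂, C} select the cooperation mode, can be captured by a single routine; a hedged sketch with known moments, where all identifiers and parameter values are illustrative.

```python
import numpy as np

def diffusion_step(w, A1, A2, C, R_u, r_du, mu):
    """One iteration of the general model: pre-combination with A1,
    steepest-descent adaptation weighted by C, post-combination with A2."""
    N = len(w)
    phi = [sum(A1[l, k] * w[l] for l in range(N)) for k in range(N)]
    psi = [phi[k] + mu * sum(C[l, k] * (r_du[l] - R_u[l] @ phi[k])
                             for l in range(N)) for k in range(N)]
    return [sum(A2[l, k] * psi[l] for l in range(N)) for k in range(N)]

rng = np.random.default_rng(2)
N, M = 4, 2
w_o = rng.standard_normal((M, 1))
R_u = [np.eye(M) * rng.uniform(1.0, 2.0) for _ in range(N)]
r_du = [R @ w_o for R in R_u]

A = np.full((N, N), 1.0 / N)         # left- (indeed doubly) stochastic
I = np.eye(N)

# ATC: A1 = I, A2 = A.  CTA: A1 = A, A2 = I.  Non-cooperative: A1 = A2 = C = I.
w_atc = [np.zeros((M, 1)) for _ in range(N)]
w_cta = [np.zeros((M, 1)) for _ in range(N)]
for _ in range(300):
    w_atc = diffusion_step(w_atc, I, A, I, R_u, r_du, mu=0.1)
    w_cta = diffusion_step(w_cta, A, I, I, R_u, r_du, mu=0.1)

print(np.linalg.norm(w_atc[0] - w_o), np.linalg.norm(w_cta[0] - w_o))
```

Both specializations drive the estimates towards w^o here, consistent with the convergence analysis that follows.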

Table 9.4

Different Choices for the Combination Matrices {A₁, A₂, C} in (9.163)–(9.165) Correspond to Different Cooperation Strategies

Image

3.09.5.2 Error recursions

Our objective is to examine whether, and how fast, the weight estimates {w_{k,i}} from the distributed implementation (9.163)–(9.165) converge towards the solution w^o of (9.92) and (9.93). To do so, we introduce the M × 1 error vectors:

φ̃_{k,i−1} ≜ w^o − φ_{k,i−1} (9.172)

ψ̃_{k,i} ≜ w^o − ψ_{k,i} (9.173)

w̃_{k,i} ≜ w^o − w_{k,i} (9.174)

Each of these error vectors measures the residual relative to the desired minimizer w^o. Now recall from (9.100) that

r_{du,l} = R_{u,l} w^o (9.175)

Then, subtracting w^o from both sides of the relations in (9.163)–(9.165) we get

φ̃_{k,i−1} = Σ_{l∈N_k} a_{1,lk} w̃_{l,i−1} (9.176)

ψ̃_{k,i} = (I_M − μ_k Σ_{l∈N_k} c_{lk} R_{u,l}) φ̃_{k,i−1} (9.177)

w̃_{k,i} = Σ_{l∈N_k} a_{2,lk} ψ̃_{l,i} (9.178)

We can describe these relations more compactly by collecting the information from across the network into block vectors and matrices. We collect the error vectors from across all nodes into the following N × 1 block vectors, whose individual entries are of size M × 1 each:

φ̃_{i−1} ≜ col{φ̃_{1,i−1}, …, φ̃_{N,i−1}}, ψ̃_i ≜ col{ψ̃_{1,i}, …, ψ̃_{N,i}}, w̃_i ≜ col{w̃_{1,i}, …, w̃_{N,i}} (9.179)

The block quantities {φ̃_{i−1}, ψ̃_i, w̃_i} represent the state of the errors across the network at time i. Likewise, we introduce the following N × N block diagonal matrices, whose individual entries are of size M × M each:

ℳ ≜ diag{μ₁ I_M, μ₂ I_M, …, μ_N I_M} (9.180)

ℛ ≜ diag{Σ_{l∈N_1} c_{l1} R_{u,l}, …, Σ_{l∈N_N} c_{lN} R_{u,l}} (9.181)

Each block diagonal entry of ℛ, say, the kth entry, contains the combination of the covariance matrices in the neighborhood of node k. We can simplify the notation by denoting these neighborhood combinations as follows:

R_k ≜ Σ_{l∈N_k} c_{lk} R_{u,l} (9.182)

so that ℛ becomes

ℛ = diag{R₁, R₂, …, R_N} (9.183)

In the special case when C = I_N, the matrix ℛ reduces to

ℛ = diag{R_{u,1}, R_{u,2}, …, R_{u,N}} (9.184)

with the individual covariance matrices appearing on its diagonal; we denote ℛ in this special case by ℛₒ. We further introduce the Kronecker products

𝒜₁ ≜ A₁ ⊗ I_M, 𝒜₂ ≜ A₂ ⊗ I_M (9.185)

The matrix 𝒜₁ is an N × N block matrix whose (l, k) block is equal to a_{1,lk} I_M. Likewise for 𝒜₂. In other words, the Kronecker transformation defined above simply replaces the matrices {A₁, A₂} by the block matrices {𝒜₁, 𝒜₂}, where each entry {a_{1,lk}, a_{2,lk}} in the original matrices is replaced by the diagonal matrix {a_{1,lk} I_M, a_{2,lk} I_M}. For ease of reference, Table 9.5 lists the various symbols that have been defined so far, and others that will be defined in the sequel.
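The Kronecker-product construction of the extended block matrices can be checked in a couple of lines; a small sketch with an arbitrary left-stochastic matrix (sizes and names illustrative).

```python
import numpy as np

N, M = 3, 2
rng = np.random.default_rng(3)
A = rng.random((N, N))
A /= A.sum(axis=0, keepdims=True)     # normalize columns: left-stochastic

# Extended (block) matrix: each scalar entry a_{lk} becomes a_{lk} I_M.
A_ext = np.kron(A, np.eye(M))         # shape (N*M, N*M)

# Verify that the (l, k) block of the extended matrix equals a_{lk} I_M.
l, k = 1, 2
block = A_ext[l*M:(l+1)*M, k*M:(k+1)*M]
print(np.allclose(block, A[l, k] * np.eye(M)))  # True
```

Note that the extension preserves left-stochasticity: every column of the block matrix still adds up to one.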

Table 9.5

Definitions of Network Variables Used Throughout the Analysis

Image

Returning to (9.176)–(9.178), we conclude that the following relations hold for the block quantities:

φ̃_{i−1} = 𝒜₁ᵀ w̃_{i−1} (9.186)

ψ̃_i = (I_{MN} − ℳℛ) φ̃_{i−1} (9.187)

w̃_i = 𝒜₂ᵀ ψ̃_i (9.188)

so that the network weight error vector, w̃_i, ends up evolving according to the following dynamics:

w̃_i = 𝒜₂ᵀ (I_{MN} − ℳℛ) 𝒜₁ᵀ w̃_{i−1}, i ≥ 0 (9.189)

For comparison purposes, if each node in the network minimizes its own cost function, J_k(w), separately from the other nodes and uses the non-cooperative steepest-descent strategy (9.171), then the weight error vector across all N nodes would evolve according to the following alternative dynamics:

w̃_i = (I_{MN} − ℳℛₒ) w̃_{i−1}, i ≥ 0 (9.190)

where the matrices 𝒜₁ and 𝒜₂ do not appear, and ℛ is replaced by ℛₒ from (9.184). This recursion is a special case of (9.189) when A₁ = A₂ = C = I_N.
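The two coefficient matrices governing these error dynamics can be formed explicitly and their spectral radii compared; a sketch for the ATC special case (A₁ = I, A₂ = A, C = I) with a doubly stochastic A and uniform step-size, all values illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M, mu = 4, 2, 0.1
R_u = [np.eye(M) * rng.uniform(1.0, 2.0) for _ in range(N)]

A = np.full((N, N), 1.0 / N)            # doubly stochastic combination matrix
A_ext = np.kron(A, np.eye(M))           # extended block version of A
Mu_blk = mu * np.eye(N * M)             # block step-size matrix (uniform mu)
R_blk = np.zeros((N * M, N * M))        # block-diagonal covariance matrix
for k in range(N):
    R_blk[k*M:(k+1)*M, k*M:(k+1)*M] = R_u[k]

# Diffusion (ATC) versus non-cooperative coefficient matrices.
B_diff = A_ext.T @ (np.eye(N * M) - Mu_blk @ R_blk)
B_ncop = np.eye(N * M) - Mu_blk @ R_blk

rho = lambda B: max(abs(np.linalg.eigvals(B)))
print(rho(B_diff), rho(B_ncop))  # diffusion spectral radius is no larger
```

The smaller spectral radius of the diffusion coefficient matrix is the mechanism behind the faster error decay established in Theorem 9.5.3.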

3.09.5.3 Convergence behavior

Note from (9.189) that the evolution of the weight error vector involves block vectors and block matrices; this will be characteristic of the distributed implementations that we consider in this chapter. To examine the stability and convergence properties of recursions that involve such block quantities, it becomes useful to rely on a certain block vector norm. In Appendix D, we describe a so-called block maximum norm and establish some of its useful properties. The results of the appendix will be used extensively in our exposition. It is therefore advisable for the reader to review the properties stated in the appendix at this stage.

Using the result of Lemma D.6, we can establish the following useful statement about the convergence of the steepest-descent diffusion strategy (9.163)–(9.165). The result establishes that all nodes end up converging to the optimal solution w^o if the nodes employ positive step-sizes {μ_k} that are small enough; the lemma provides a sufficient bound on the {μ_k}.

Theorem 9.5.1

Convergence to Optimal Solution

Consider the problem of optimizing the global cost (9.92) with the individual cost functions given by (9.93). Pick a right stochastic matrix C and left stochastic matrices A₁ and A₂ satisfying (9.166) or (9.167); these matrices define the network topology and how information is shared over neighborhoods. Assume each node in the network runs the (distributed) steepest-descent diffusion algorithm (9.163)–(9.165). Then, all estimates {w_{k,i}} across the network converge to the optimal solution w^o if the positive step-size parameters {μ_k} satisfy

0 < μ_k < 2/λ_max(R_k), k = 1, 2, …, N (9.191)

where the neighborhood covariance matrix R_k is defined by (9.182).

Proof.

The weight error vector w̃_i converges to zero if, and only if, the coefficient matrix 𝒜₂ᵀ (I_{MN} − ℳℛ) 𝒜₁ᵀ in (9.189) is a stable matrix (meaning that all its eigenvalues lie strictly inside the unit disc). From property (9.605) established in Appendix D, we know that this matrix is stable if the block diagonal matrix I_{MN} − ℳℛ is stable. It is now straightforward to verify that condition (9.191) ensures the stability of I_{MN} − ℳℛ. It follows that

lim_{i→∞} w̃_i = 0 (9.192)

∎

Observe that the stability condition (9.191) does not depend on the specific combination matrices A₁ and A₂. Thus, as long as these matrices are chosen to be left-stochastic, the weight-error vectors will converge to zero under condition (9.191) no matter what {A₁, A₂} are. Only the combination matrix C influences the condition on the step-size, through the neighborhood covariance matrices {R_k}. Observe further that the statement of the lemma does not require the network to be connected. Moreover, when C = I_N, in which case the nodes only share weight estimates and do not share the neighborhood moments {r_{du,l}, R_{u,l}}, as in (9.168)–(9.170), condition (9.191) becomes

0 < μ_k < 2/λ_max(R_{u,k}), k = 1, 2, …, N (9.193)

in terms of the actual covariance matrices {R_{u,k}}. Conditions (9.191) and (9.193) are reminiscent of a classical result for stand-alone steepest-descent algorithms, as in the non-cooperative case (9.171), where it is known that the estimate by each individual node will converge to w^o if, and only if, its positive step-size satisfies

0 < μ_k < 2/λ_max(R_{u,k}) (9.194)

This is the same condition as (9.193) for the case C = I_N.
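The classical step-size bound can be verified numerically: the iteration matrix I − μR is stable precisely when μ lies inside the interval (0, 2/λ_max(R)). A small check with an arbitrary positive-definite covariance matrix (values illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
M = 3
X = rng.standard_normal((M, 10 * M))
R = X @ X.T / X.shape[1]              # a positive-definite covariance matrix
lam_max = max(np.linalg.eigvalsh(R))

rho = lambda mu: max(abs(np.linalg.eigvals(np.eye(M) - mu * R)))

mu_inside = 1.0 / lam_max             # satisfies 0 < mu < 2/lambda_max(R)
mu_outside = 2.5 / lam_max            # violates the bound
print(rho(mu_inside) < 1, rho(mu_outside) >= 1)
```

With μ inside the bound the spectral radius drops below one and the error recursion contracts; outside the bound, at least one mode of I − μR has magnitude one or larger and the recursion diverges.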

The following statement provides a bi-directional (necessary and sufficient) condition that ensures convergence of the (distributed) steepest-descent diffusion strategy (9.163)–(9.165) for any choice of left-stochastic combination matrices A₁ and A₂.

Theorem 9.5.2

Convergence for Arbitrary Combination Matrices

Consider the problem of optimizing the global cost (9.92) with the individual cost functions given by (9.93). Pick a right stochastic matrix C satisfying (9.166). Then, the estimates {w_{k,i}} generated by (9.163)–(9.165) converge to w^o, for all choices of left-stochastic matrices A₁ and A₂ satisfying (9.166), if, and only if,

0 < μ_k < 2/λ_max(R_k), k = 1, 2, …, N (9.195)

Proof.

The result follows from property (b) of Corollary D.1, which is established in Appendix D. ∎

More importantly, we can verify that, under fairly general conditions, employing the steepest-descent diffusion strategy (9.163)–(9.165) enhances the convergence rate of the error vector towards zero relative to the non-cooperative strategy (9.171). The next three results establish this fact when C is a doubly stochastic matrix, i.e., it has nonnegative entries and satisfies

C𝟙 = 𝟙, Cᵀ𝟙 = 𝟙 (9.196)

with both its rows and columns adding up to one. Compared to the earlier right-stochastic condition on C in (9.105), we are now also requiring

Σ_{l=1}^N c_{lk} = 1 (9.197)

For example, these conditions are satisfied when C is right stochastic and symmetric. They are also automatically satisfied for C = I_N, when only weight estimates are shared as in (9.168)–(9.170); this latter case covers the ATC and CTA diffusion strategies (9.137) and (9.145), which do not involve information exchange.

Theorem 9.5.3

Convergence Rate is Enhanced: Uniform Step-Sizes

Consider the problem of optimizing the global cost (9.92) with the individual cost functions given by (9.93). Pick a doubly stochastic matrix C satisfying (9.196) and left stochastic matrices A₁ and A₂ satisfying (9.166). Consider two modes of operation. In one mode, each node in the network runs the (distributed) steepest-descent diffusion algorithm (9.163)–(9.165). In the second mode, each node operates individually and runs the non-cooperative steepest-descent algorithm (9.171). In both cases, the positive step-sizes used by all nodes are assumed to be the same, say, μ_k = μ for all k, and the value of μ is chosen to satisfy the required stability conditions (9.191) and (9.194), which are met by selecting

0 < μ < min_{1≤k≤N} {2/λ_max(R_{u,k})} (9.198)

It then holds that the magnitude of the error vector, ‖w̃_i‖, in the diffusion case decays to zero more rapidly than in the non-cooperative case. In other words, diffusion cooperation enhances the convergence rate.

Proof.

Let us first establish that any positive step-size image satisfying (9.198) will satisfy both stability conditions (9.191) and (9.194). It is obvious that (9.194) is satisfied. We verify that (9.191) is also satisfied when image is doubly stochastic. In this case, each neighborhood covariance matrix, image, becomes a convex combination of individual covariance matrices image, i.e.,

image

where now

image

To proceed, we recall that the spectral norm (maximum singular value) of any matrix image is a convex function of image [54]. Moreover, for Hermitian matrices image, their spectral norms coincide with their spectral radii (largest eigenvalue magnitude). Then, Jensen’s inequality [54] states that for any convex function image it holds that

image

for Hermitian matrices image and nonnegative scalars image that satisfy

image

Choosing image as the spectral radius function, and applying it to the definition of image above, we get

image

In other words,

image

It then follows from (9.198) that

image

so that (9.191) is satisfied as well.

Let us now examine the convergence rate. To begin with, we note that the matrix image that appears in the weight-error recursion (9.189) is block diagonal:

image

and each individual block entry, image, is a stable matrix since image satisfies (9.191). Moreover, each of these entries can be written as

image

which expresses image as a convex combination of stable terms image. Applying Jensen’s inequality again we get

image

Now, we know from (9.189) that the rate of decay of image to zero in the diffusion case is determined by the spectral radius of the coefficient matrix image. Likewise, we know from (9.190) that the rate of decay of image to zero in the non-cooperative case is determined by the spectral radius of the coefficient matrix image. Then, note that

image

Therefore, the spectral radius of image is at most as large as the largest individual spectral radius in the non-cooperative case. □
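The Jensen-type bound used in the proof can be checked numerically. The sketch below uses hypothetical Hermitian covariance matrices and a hypothetical step-size (NumPy assumed); it confirms that the spectral radius of the convex combination never exceeds the largest spectral radius among the individual factors:

```python
import numpy as np

rng = np.random.default_rng(0)
M, mu = 4, 0.1

# Three hypothetical symmetric positive-definite covariance matrices R_l.
Rs = []
for _ in range(3):
    X = rng.standard_normal((M, M))
    Rs.append(X @ X.T + M * np.eye(M))

# Convex combination weights (one column of a right stochastic C).
c = np.array([0.5, 0.3, 0.2])
R_bar = sum(cl * Rl for cl, Rl in zip(c, Rs))

rho = lambda X: np.max(np.abs(np.linalg.eigvals(X)))
lhs = rho(np.eye(M) - mu * R_bar)                       # combined mode
rhs = max(rho(np.eye(M) - mu * Rl) for Rl in Rs)        # worst individual mode
# For symmetric matrices the spectral radius equals the spectral norm, which
# is convex, so Jensen's inequality guarantees lhs <= rhs.
assert lhs <= rhs + 1e-12
```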

The argument can be modified to handle different step-sizes across the nodes if we assume uniform covariance data across the network, as stated below.

Theorem 9.5.4

Convergence Rate is Enhanced: Uniform Covariance Data

Consider the same setting of Theorem 9.5.3. Assume the covariance data are uniform across all nodes, say, image is independent of image. Assume further that the nodes in both modes of operation employ step-sizes image that are chosen to satisfy the required stability conditions (9.191) and (9.194), which in this case are met by:

image (9.199)

It then holds that the magnitude of the error vector, image, in the diffusion case decays to zero more rapidly than in the non-cooperative case. In other words, diffusion enhances convergence rate.

Proof.

Since image for all image and image is doubly stochastic, we get image and image. Then,

image

□

The next statement considers the case of ATC and CTA strategies (9.137) and (9.145) without information exchange, which corresponds to the case image. The result establishes that these strategies always enhance the convergence rate over the non-cooperative case, without the need to assume uniform step-sizes or uniform covariance data.

Theorem 9.5.5

Convergence Rate is Enhanced when image

Consider the problem of optimizing the global cost (9.92) with the individual cost functions given by (9.93). Pick left stochastic matrices image and image satisfying (9.166) and set image. This situation covers the ATC and CTA strategies (9.137) and (9.145), which do not involve information exchange. Consider two modes of operation. In one mode, each node in the network runs the (distributed) steepest-descent diffusion algorithm (9.168)–(9.170). In the second mode, each node operates individually and runs the non-cooperative steepest-descent algorithm (9.171). In both cases, the positive step-sizes are chosen to satisfy the required stability conditions (9.193) and (9.194), which in this case are met by

image (9.200)

It then holds that the magnitude of the error vector, image, in the diffusion case decays to zero more rapidly than in the non-cooperative case. In other words, diffusion cooperation enhances convergence rate.

Proof.

When image, we get image and, therefore, image and image. Then,

image

□

The results of the previous theorems highlight the following important facts about the role of the combination matrices image in the convergence behavior of the diffusion strategy (9.163)–(9.165):

a. The matrix image influences the stability of the network through its influence on the bound in (9.191). This is because the matrices image depend on the entries of image. The matrices image do not influence network stability in that they can be chosen arbitrarily and the network will remain stable under (9.191).

b. The matrices image influence the rate of convergence of the network since they influence the spectral radius of the matrix image, which controls the dynamics of the weight error vector in (9.189).

3.09.6 Performance of adaptive diffusion strategies

We now move on to examine the behavior of the adaptive diffusion implementations (9.153) and (9.154), and the influence of both gradient noise and measurement noise on convergence and steady-state performance. Due to the random nature of the perturbations, it becomes necessary to evaluate the behavior of the algorithms on average, using mean-square convergence analysis. For this reason, we shall study the convergence of the weight estimates both in the mean and mean-square sense. To do so, we will again consider a general diffusion structure that includes the ATC and CTA strategies (9.153) and (9.154) as special cases. We shall further resort to the boldface notation to refer to the measurements and weight estimates in order to highlight the fact that they are now being treated as random variables. In this way, the update equations become stochastic updates. Thus, consider the following general adaptive diffusion strategy for image:

image (9.201)

image (9.202)

image (9.203)

As before, the scalars image are nonnegative real coefficients corresponding to the image entries of image combination matrices image, respectively. These matrices are assumed to satisfy the same conditions (9.166) or (9.167). Again, different choices for image correspond to different cooperation modes. For example, the choice image and image corresponds to the adaptive ATC implementation (9.153), while the choice image and image corresponds to the adaptive CTA implementation (9.154). Likewise, the choice image corresponds to the case in which the nodes only share weight estimates and the distributed diffusion recursions (9.201)–(9.203) become

image (9.204)

image (9.205)

image (9.206)

Furthermore, the choice image corresponds to the non-cooperative mode of operation, where each node runs the classical (stand-alone) least-mean-squares (LMS) filter independently of the other nodes [47]:

image (9.207)
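To make the contrast between these modes concrete, the following sketch simulates an ATC-style diffusion LMS network (adapt at every node, then combine the intermediate estimates) against the stand-alone LMS recursion (9.207). All numerical choices (network size, step-size, noise level, and the uniform fully connected combination matrix) are hypothetical and chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, mu, T = 10, 2, 0.02, 4000
w_opt = rng.standard_normal(M)      # unknown vector to be estimated
sigma_v = 0.1                       # measurement-noise standard deviation
A = np.full((N, N), 1.0 / N)        # hypothetical left stochastic combiner

w_atc = np.zeros((N, M))            # ATC diffusion estimates, one row per node
w_ind = np.zeros((N, M))            # non-cooperative LMS estimates
msd_atc = msd_ind = 0.0
T_avg = 1000                        # steady-state averaging window
for t in range(T):
    U = rng.standard_normal((N, M))                    # regressors u_{k,i}
    d = U @ w_opt + sigma_v * rng.standard_normal(N)   # d_k(i) = u w_opt + v
    # ATC: adaptation step (LMS update at every node) ...
    e = d - np.einsum('km,km->k', U, w_atc)
    psi = w_atc + mu * e[:, None] * U
    # ... followed by the combination step: w_k = sum_l a_{lk} psi_l.
    w_atc = A.T @ psi
    # Stand-alone LMS, run independently at every node.
    e0 = d - np.einsum('km,km->k', U, w_ind)
    w_ind = w_ind + mu * e0[:, None] * U
    if t >= T - T_avg:  # time-average squared deviations in steady state
        msd_atc += np.mean(np.sum((w_atc - w_opt) ** 2, axis=1)) / T_avg
        msd_ind += np.mean(np.sum((w_ind - w_opt) ** 2, axis=1)) / T_avg

assert msd_atc < msd_ind   # cooperation lowers the steady-state network MSD
```

With these (hypothetical) values, the diffusion network settles at a noticeably smaller steady-state deviation than the isolated filters, consistent with the performance analysis that follows.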

3.09.6.1 Data model

When we studied the performance of the steepest-descent diffusion strategy (9.163)–(9.165) we exploited result (9.175), which indicated how the moments image that appeared in the recursions related to the optimal solution image. Likewise, in order to be able to analyze the performance of the adaptive diffusion strategy (9.201)–(9.203), we need to know how the data image across the network relate to image. Motivated by the several examples presented earlier in Section 3.09.2, we shall assume that the data satisfy a linear model of the form:

image (9.208)

where image is measurement noise with variance image:

image (9.209)

and where the stochastic processes image are assumed to be jointly wide-sense stationary with moments:

image (9.210)

image (9.211)

image (9.212)

All variables are assumed to be zero-mean. Furthermore, the noise process image is assumed to be temporally white and spatially independent, as described earlier by (9.6), namely,

image (9.213)

The noise process image is further assumed to be independent of the regression data image for all image and image so that:

image (9.214)

We shall also assume that the regression data are temporally white and spatially independent so that:

image (9.215)

Although we are going to derive performance measures for the network under this independence assumption on the regression data, it turns out that the resulting expressions continue to match well with simulation results for sufficiently small step-sizes, even when the independence assumption does not hold (in a manner similar to the behavior of stand-alone adaptive filters) [4,5].
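The linear model (9.208) and the moment conditions above can be illustrated numerically. In the sketch below, the true vector, the covariance factor, and the noise level are all hypothetical; the point is that under the model, the cross-covariance between the data and the regressors equals the regression covariance times the true vector, so the normal equations recover it:

```python
import numpy as np

rng = np.random.default_rng(2)
M, T = 3, 200000
w_opt = np.array([1.0, -0.5, 0.25])   # hypothetical true model
sigma_v = 0.2

# Regressors drawn with an assumed covariance R_u = L L^T.
L = np.array([[1.0, 0.0, 0.0], [0.3, 0.9, 0.0], [0.1, -0.2, 0.8]])
R_u = L @ L.T
U = rng.standard_normal((T, M)) @ L.T                # zero-mean, cov R_u
d = U @ w_opt + sigma_v * rng.standard_normal(T)     # linear model (9.208)

# Sample estimate of the cross-covariance between d(i) and u_i.
r_du = (d[:, None] * U).mean(axis=0)
# Under the model, r_du = R_u w_opt, so solving the normal equations
# recovers the true vector (up to finite-sample error).
w_hat = np.linalg.solve(R_u, r_du)
assert np.allclose(w_hat, w_opt, atol=0.05)
```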

3.09.6.2 Performance measures

Our objective is to analyze whether, and how fast, the weight estimates image from the adaptive diffusion implementation (9.201)–(9.203) converge towards image. To do so, we again introduce the image weight error vectors:

image (9.216)

image (9.217)

image (9.218)

Each of these error vectors measures the residual relative to the desired image in (9.208). We further introduce two scalar error measures:

image (9.219)

image (9.220)

The first error measures how well the term image approximates the measured data, image; in view of (9.208), this error can be interpreted as an estimator for the noise term image. If node image is able to estimate image well, then image would get close to image. Therefore, under ideal conditions, we would expect the variance of image to tend towards the variance of image. However, as remarked earlier in (9.31), there is generally an offset term for adaptive implementations that is captured by the variance of the a priori error, image. This second error measures how well image approximates the uncorrupted term image. Using the data model (9.208), we can relate image as

image (9.221)

Since the noise component, image, is assumed to be zero-mean and independent of all other random variables, we recover (9.31):

image (9.222)

This relation confirms that the variance of the output error, image, is at least as large as image, exceeding it by an amount equal to the variance of the a priori error, image. Accordingly, in order to quantify the performance of any particular node in the network, we define the mean-square-error (MSE) and excess-mean-square-error (EMSE) for node image as the following steady-state measures:

image (9.223)

image (9.224)

Then, it holds that

image (9.225)

Therefore, the EMSE term quantifies the size of the offset in the MSE performance of each node. We also define the mean-square-deviation (MSD) of each node as the steady-state measure:

image (9.226)

which measures how far image is from image in the mean-square-error sense.

We indicated earlier in (9.36) and (9.37) how the MSD and EMSE of stand-alone LMS filters in the non-cooperative case depend on image. In this section, we examine how cooperation among the nodes influences their performance. Since cooperation couples the operation of the nodes, with data originating from one node influencing the behavior of its neighbors and their neighbors, the study of the network performance requires more effort than in the non-cooperative case. Nevertheless, when all is said and done, we will arrive at expressions that approximate well the network performance and reveal some interesting conclusions.
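Relation (9.225) can be observed directly in a simulation of a single stand-alone LMS node (all parameter values below are hypothetical): at steady state, the measured MSE exceeds the noise variance by an amount equal to the measured EMSE, since the output error is the a priori error plus independent noise:

```python
import numpy as np

rng = np.random.default_rng(3)
M, mu, T = 4, 0.05, 50000
w_opt = rng.standard_normal(M)
sigma_v = 0.3

w = np.zeros(M)
mse = emse = 0.0
count = 0
for t in range(T):
    u = rng.standard_normal(M)
    v = sigma_v * rng.standard_normal()
    d = u @ w_opt + v
    e_a = u @ (w_opt - w)     # a priori error: uncorrupted residual
    e = d - u @ w             # output error: e = e_a + v
    w = w + mu * e * u        # stand-alone LMS update, as in (9.207)
    if t >= 5000:             # discard the transient; average in steady state
        mse += e * e
        emse += e_a * e_a
        count += 1
mse /= count
emse /= count

# MSE = sigma_v^2 + EMSE, cf. (9.225): the EMSE is the offset above the
# noise floor.
assert abs(mse - (sigma_v**2 + emse)) < 0.01
```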

3.09.6.3 Error recursions

Using the data model (9.208) and subtracting image from both sides of the relations in (9.201)–(9.203) we get

image (9.227)

image (9.228)

image (9.229)

Comparing the second recursion with the corresponding recursion in the steepest-descent case (9.176)–(9.178), we see that two new effects arise: the effect of gradient noise, which replaces the covariance matrices image by the instantaneous approximation image, and the effect of measurement noise, image.

We again describe the above relations more compactly by collecting the information from across the network in block vectors and matrices. We collect the error vectors from across all nodes into the following image block vectors, whose individual entries are of size image each:

image (9.230)

The block quantities image represent the state of the errors across the network at time image. Likewise, we introduce the following image block diagonal matrices, whose individual entries are of size image each:

image (9.231)

image (9.232)

Each block diagonal entry of image, say, the imageth entry, contains a combination of rank-one regression terms collected from the neighborhood of node image. In this way, the matrix image is now stochastic and dependent on time, in contrast to the matrix image in the steepest-descent case in (9.181), which was a constant matrix. Nevertheless, it holds that

image (9.233)

so that, on average, image agrees with image. We can simplify the notation by denoting the neighborhood combinations as follows:

image (9.234)

so that image becomes

image (9.235)

Again, compared with the matrix image defined in (9.182), we find that image is now both stochastic and time-dependent. Nevertheless, it again holds that

image (9.236)

In the special case when image, the matrix image reduces to

image (9.237)

with

image (9.238)

where image was defined earlier in (9.184).

We further introduce the following image block column vector, whose entries are of size image each:

image (9.239)

Obviously, given that the regression data and measurement noise are zero-mean and independent of each other, we have

image (9.240)

and the covariance matrix of image is image block diagonal with blocks of size image:

image (9.241)

Returning to (9.227)–(9.229), we conclude that the following relations hold for the block quantities:

image (9.242)

image (9.243)

image (9.244)

where

image (9.245)

so that the network weight error vector, image, ends up evolving according to the following stochastic recursion:

image (9.246)

For comparison purposes, if each node operates individually and uses the non-cooperative LMS recursion (9.207), then the weight error vector across all image nodes would evolve according to the following stochastic recursion:

image (9.247)

where the matrices image and image do not appear, and image is replaced by image from (9.237).
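The block quantities above can be assembled with Kronecker products. The sketch below uses hypothetical left stochastic matrices, per-node covariances, and step-sizes, and takes the case without information exchange (so each neighborhood matrix reduces to the node's own covariance, as in (9.237)). It builds the block coefficient matrix of the mean error recursion and checks it against the node-by-node combine/adapt/combine steps:

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 3, 2
mu = np.array([0.05, 0.1, 0.08])    # hypothetical per-node step-sizes

# Hypothetical left stochastic A1, A2 (columns sum to one).
A1 = np.array([[0.6, 0.2, 0.1], [0.3, 0.5, 0.3], [0.1, 0.3, 0.6]])
A2 = np.array([[0.5, 0.4, 0.2], [0.3, 0.4, 0.3], [0.2, 0.2, 0.5]])
Rk = []                             # hypothetical per-node covariances R_k
for _ in range(N):
    X = rng.standard_normal((M, M))
    Rk.append(X @ X.T + np.eye(M))

# Block quantities: step-size matrix diag(mu_k I_M), block diag(R_k), and
# the extended combination matrices A1 (kron) I_M, A2 (kron) I_M.
I_M = np.eye(M)
M_blk = np.kron(np.diag(mu), I_M)
R_blk = np.zeros((N * M, N * M))
for k in range(N):
    R_blk[k * M:(k + 1) * M, k * M:(k + 1) * M] = Rk[k]
B = np.kron(A2.T, I_M) @ (np.eye(N * M) - M_blk @ R_blk) @ np.kron(A1.T, I_M)

# Cross-check against the node-by-node mean recursion.
w_err = rng.standard_normal((N, M))             # stacked error vectors
phi = A1.T @ w_err                              # first combination step
psi = np.stack([(I_M - mu[k] * Rk[k]) @ phi[k] for k in range(N)])
out = A2.T @ psi                                # second combination step
assert np.allclose(B @ w_err.reshape(-1), out.reshape(-1))
```

The same block construction, with the instantaneous regression products in place of the constant covariances, gives the stochastic coefficient matrix of the adaptive recursion.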

3.09.6.4 Convergence in the mean

Taking expectations of both sides of (9.246) we find that:

image (9.248)

where we used the fact that image and image are independent of each other in view of our earlier assumptions on the regression data and noise in Section 3.09.6.1. Comparing with the error recursion (9.189) in the steepest-descent case, we find that both recursions are identical with image replaced by image. Therefore, the convergence statements from the steepest-descent case can be extended to the adaptive case to provide conditions on the step-size to ensure stability in the mean, i.e., to ensure

image (9.249)

When (9.249) is guaranteed, we say that the adaptive diffusion solution is asymptotically unbiased. The following statements restate the results of Theorems 9.5.1–9.5.5 in the context of mean error analysis.

Theorem 9.6.1

Convergence in the Mean

Consider the problem of optimizing the global cost (9.92) with the individual cost functions given by (9.93). Pick a right stochastic matrix image and left stochastic matrices image and image satisfying (9.166) or (9.167). Assume each node in the network measures data that satisfy the conditions described in Section 3.09.6.1, and runs the adaptive diffusion algorithm (9.201)–(9.203). Then, all estimators image across the network converge in the mean to the optimal solution image if the positive step-size parameters image satisfy

image (9.250)

where the neighborhood covariance matrix image is defined by (9.182). In other words, image for all nodes image as image.

Observe again that the mean stability condition (9.250) does not depend on the specific combination matrices image and image that are being used. Only the combination matrix image influences the condition on the step-size through the neighborhood covariance matrices image. Observe further that the statement of the theorem does not require the network to be connected. Moreover, when image, in which case the nodes only share weight estimators and do not share neighborhood data image as in (9.204)–(9.206), condition (9.250) becomes

image (9.251)

Conditions (9.250) and (9.251) are reminiscent of a classical result for the stand-alone LMS algorithm, as in the non-cooperative case (9.207), where it is known that the estimator at each individual node would converge in the mean to image if, and only if, its step-size satisfies

image (9.252)

The following statement provides a necessary and sufficient condition for the mean convergence of the adaptive diffusion strategy, valid for any choice of left stochastic combination matrices image and image.

Theorem 9.6.2

Mean Convergence for Arbitrary Combination Matrices

Consider the problem of optimizing the global cost (9.92) with the individual cost functions given by (9.93). Pick a right stochastic matrix image satisfying (9.166). Assume each node in the network measures data that satisfy the conditions described in Section 3.09.6.1. Then, the estimators image generated by the adaptive diffusion strategy (9.201)–(9.203), converge in the mean to image, for all choices of left stochastic matrices image and image satisfying (9.166) if, and only if,

image (9.253)

As was the case with steepest-descent diffusion strategies, the adaptive diffusion strategy (9.201)–(9.203) also enhances the convergence rate of the mean of the error vector towards zero relative to the non-cooperative strategy (9.207). The next results restate Theorems 9.5.3–9.5.5; they assume image is a doubly stochastic matrix.

Theorem 9.6.3

Mean Convergence Rate is Enhanced: Uniform Step-Sizes

Consider the problem of optimizing the global cost (9.92) with the individual cost functions given by (9.93). Pick a doubly stochastic matrix image satisfying (9.196) and left stochastic matrices image and image satisfying (9.166). Assume each node in the network measures data that satisfy the conditions described in Section 3.09.6.1. Consider two modes of operation. In one mode, each node in the network runs the adaptive diffusion algorithm (9.201)–(9.203). In the second mode, each node operates individually and runs the non-cooperative LMS algorithm (9.207). In both cases, the positive step-sizes used by all nodes are assumed to be the same, say, image for all image, and the value of image is chosen to satisfy the required mean stability conditions (9.250) and (9.252), which are met by selecting

image (9.254)

It then holds that the magnitude of the mean error vector, image in the diffusion case decays to zero more rapidly than in the non-cooperative case. In other words, diffusion enhances convergence rate.

Theorem 9.6.4

Mean Convergence Rate is Enhanced: Uniform Covariance Data

Consider the same setting of Theorem 9.6.3. Assume the covariance data are uniform across all nodes, say, image is independent of image. Assume further that the nodes in both modes of operation employ step-sizes image that are chosen to satisfy the required stability conditions (9.250) and (9.252), which in this case are met by:

image (9.255)

It then holds that the magnitude of the mean error vector, image in the diffusion case also decays to zero more rapidly than in the non-cooperative case. In other words, diffusion enhances convergence rate.

The next statement considers the case of ATC and CTA strategies (9.204)–(9.206) without information exchange, which corresponds to the choice image. The result establishes that these strategies always enhance the convergence rate over the non-cooperative case, without the need to assume uniform step-sizes or uniform covariance data.

Theorem 9.6.5

Mean Convergence Rate is Enhanced when image

Consider the problem of optimizing the global cost (9.92) with the individual cost functions given by (9.93). Pick left stochastic matrices image and image satisfying (9.166) and set image. This situation covers the ATC and CTA strategies (9.204)–(9.206) that do not involve information exchange. Assume each node in the network measures data that satisfy the conditions described in Section 3.09.6.1. Consider two modes of operation. In one mode, each node in the network runs the adaptive diffusion algorithm (9.204)–(9.206). In the second mode, each node operates individually and runs the non-cooperative LMS algorithm (9.207). In both cases, the positive step-sizes are chosen to satisfy the required stability conditions (9.251) and (9.252), which in this case are met by

image (9.256)

It then holds that the magnitude of the mean error vector, image in the diffusion case decays to zero more rapidly than in the non-cooperative case. In other words, diffusion cooperation enhances convergence rate.

The results of the previous theorems again highlight the following important facts about the role of the combination matrices image in the convergence behavior of the adaptive diffusion strategy (9.201)–(9.203):

a. The matrix image influences the mean stability of the network through its influence on the bound in (9.250). This is because the matrices image depend on the entries of image. The matrices image do not influence network mean stability in that they can be chosen arbitrarily and the network will remain stable under (9.250).

b. The matrices image influence the rate of convergence of the mean weight-error vector over the network since they influence the spectral radius of the matrix image, which controls the dynamics of the weight error vector in (9.248).

3.09.6.5 Mean-square stability

It is not sufficient to ensure the stability of the weight-error vector in the mean sense. The error vectors, image, may be converging on average to zero but they may have large fluctuations around the zero value. We therefore need to examine how small the error vectors get. To do so, we perform a mean-square-error analysis. The purpose of the analysis is to evaluate how the variances image evolve with time and what their steady-state values are, for each node image.

In this section, we are particularly interested in evaluating the evolution of two mean-square-errors, namely,

image (9.257)

The steady-state values of these quantities determine the MSD and EMSE performance levels at node image and, therefore, convey critical information about the performance of the network. Under the independence assumption on the regression data from Section 3.09.6.1, it can be verified that the EMSE variance can be written as:

image (9.258)

in terms of a weighted square measure with weighting matrix image. Here we are using the notation image to denote the weighted square quantity image, for any column vector image and matrix image. Thus, we can evaluate mean-square-errors of the form (9.257) by evaluating the means of weighted square quantities of the following form:

image (9.259)

for an arbitrary Hermitian nonnegative-definite weighting matrix image that we are free to choose. By setting image to different values (say, image or image), we can extract various types of information about the nodes and the network, as the discussion will reveal. The approach we follow is based on the energy conservation framework of [4,5,55].

So, let image denote an arbitrary image block Hermitian nonnegative-definite matrix that we are free to choose, with image block entries image. Let image denote the image vector that is obtained by stacking the columns of image on top of each other, written as

image (9.260)

In the sequel, it will become more convenient to work with the vector representation image than with the matrix image itself.

We start from the weight-error vector recursion (9.246) and re-write it more compactly as:

image (9.261)

where the coefficient matrices image and image are short-hand representations for

image (9.262)

and

image (9.263)

Note that image is stochastic and time-variant, while image is constant. We denote the mean of image by

image (9.264)

where image is defined by (9.181). Now equating weighted square measures on both sides of (9.261) we get

image (9.265)

Expanding the right-hand side we find that

image (9.266)

Under expectation, the last two terms on the right-hand side evaluate to zero so that

image (9.267)

Let us evaluate each of the expectations on the right-hand side. The last expectation is given by

image (9.268)

where image is defined by (9.241) and where we used the fact that image for any two matrices image and image of compatible dimensions. With regards to the first expectation on the right-hand side of (9.267), we have

image (9.269)

where we introduced the nonnegative-definite weighting matrix

image (9.270)

where image is defined by (9.181) and the term image denotes the following factor, which depends on the square of the step-sizes, image:

image (9.271)

The evaluation of the above expectation depends on higher-order moments of the regression data. While we can continue with the analysis by taking this factor into account, as was done in [4,5,18,55], it is sufficient for the exposition in this chapter to focus on the case of sufficiently small step-sizes where terms involving higher powers of the step-sizes can be ignored. Therefore, we continue our discussion by letting

image (9.272)

The weighting matrix image is fully defined in terms of the step-size matrix, image, the network topology through the matrices image, and the regression statistical profile through image. Expression (9.272) tells us how to construct image from image. The expression can be transformed into a more compact and revealing form if we instead relate the vector forms image and image. Using the following equalities for arbitrary matrices image of compatible dimensions [5]:

image (9.273)

image (9.274)

and applying the vec operation to both sides of (9.272) we get

image

That is,

image (9.275)

where we are introducing the coefficient matrix of size image:

image (9.276)

A reasonable approximate expression for image for sufficiently small step-sizes is

image (9.277)

Indeed, if we replace image from (9.264) into (9.277) and expand terms, we obtain the same factors that appear in (9.276) plus an additional term that depends on the square of the step-sizes, image, whose effect can be ignored for sufficiently small step-sizes.

In this way, and using in addition property (9.274), we find that relation (9.267) becomes:

image (9.278)

The last term is dependent on the network topology through the matrix image, which is defined in terms of image, and the noise and regression data statistical profile through image. It is convenient to introduce the alternative notation image to refer to the weighted square quantity image, where image. We shall use these two notations interchangeably. The convenience of the vector notation is that it allows us to exploit the simpler linear relation (9.275) between image and image to rewrite (9.278) as shown in (9.279) below, with the same weight vector image appearing on both sides.
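The two vec properties (9.273) and (9.274) invoked above are the standard column-stacking identities, and they can be verified directly on random matrices (NumPy assumed; note that a column-stacking vec must be requested explicitly, since NumPy flattens row-wise by default):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 3
A, S, B, W = (rng.standard_normal((n, n)) for _ in range(4))
vec = lambda X: X.reshape(-1, order='F')   # column-stacking vec operation

# Property in the spirit of (9.273): vec(A S B) = (B^T kron A) vec(S).
assert np.allclose(vec(A @ S @ B), np.kron(B.T, A) @ vec(S))
# Property in the spirit of (9.274): Tr(S W) = [vec(W^T)]^T vec(S).
assert np.allclose(np.trace(S @ W), vec(W.T) @ vec(S))
```

These identities are what allow the matrix relation (9.272) to be rewritten as the compact linear mapping (9.275) between the weighting vectors.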

Theorem 9.6.6

Variance Relation

Consider the data model of Section 3.09.6.1 and the independence statistical conditions imposed on the noise and regression data, including (9.208)–(9.215). Assume further that sufficiently small step-sizes are used so that terms that depend on higher powers of the step-sizes can be ignored. Pick left stochastic matrices image and image and a right stochastic matrix image satisfying (9.166). Under these conditions, the weight-error vector image associated with a network running the adaptive diffusion strategy (9.201)–(9.203) satisfies the following variance relation:

image (9.279)

for any Hermitian nonnegative-definite matrix image with image, and where image are defined by (9.241), (9.263), and (9.277), and

image (9.280)

□

Note that relation (9.279) is not an actual recursion; this is because the weighting matrices image on both sides of the equality are different. The relation can be transformed into a true recursion by expanding it into a convenient state-space model; this argument was pursued in [4,5,18,55] and is not necessary for the exposition here, except to say that stability of the matrix image ensures the mean-square stability of the filter—this fact is also established further ahead through relation (9.327). By mean-square stability we mean that each term image remains bounded over time and converges to a steady-state image value. Moreover, the spectral radius of image controls the rate of convergence of image towards its steady-state value.

Theorem 9.6.7

Mean-Square Stability

Consider the same setting of Theorem 9.6.6. The adaptive diffusion strategy (9.201)–(9.203) is mean-square stable if, and only if, the matrix image defined by (9.276), or its approximation (9.277), is stable (i.e., all its eigenvalues lie strictly inside the unit disc). This condition is satisfied by sufficiently small positive step-sizes image that are also smaller than the following bound:

image (9.281)

where the neighborhood covariance matrix image is defined by (9.182). Moreover, the convergence rate of the algorithm is determined by the value image (the square of the spectral radius of image).

Proof.

Recall that, for two arbitrary matrices image and image of compatible dimensions, the eigenvalues of the Kronecker product image are formed of all product combinations image of the eigenvalues of image and image [19]. Therefore, for sufficiently small step-sizes, we can use expression (9.277) to note that image. It follows that image is stable if, and only if, image is stable. We already noted earlier in Theorem 9.6.1 that condition (9.281) ensures the stability of image. Therefore, step-sizes that ensure stability in the mean and are sufficiently small will also ensure mean-square stability. □
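The eigenvalue argument in the proof is easy to confirm numerically: for a Kronecker product of a matrix with its conjugate, as in approximation (9.277), the spectral radius equals the square of the spectral radius of the factor. The sketch uses a hypothetical matrix in place of the block coefficient matrix:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4
B = 0.2 * rng.standard_normal((n, n))   # hypothetical coefficient matrix

rho = lambda X: np.max(np.abs(np.linalg.eigvals(X)))
F = np.kron(B.T, B.conj())              # small-step-size approximation (9.277)
# Eigenvalues of a Kronecker product are all pairwise products of the
# factors' eigenvalues, so rho(F) = rho(B)^2: F is stable iff B is stable.
assert np.isclose(rho(F), rho(B) ** 2)
```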

Remark

More generally, had we not ignored the second-order term (9.271), the expression for image would have been the following. Starting from the definition image, we would get

image

so that

image (9.282)

image

Mean-square stability of the filter would then require the step-sizes image to be chosen such that they ensure the stability of this matrix image (in addition to condition (9.281) to ensure mean stability). □

3.09.6.6 Network mean-square performance

We can now use the variance relation (9.279) to evaluate the network performance, as well as the performance of the individual nodes, in steady-state. Since the dynamics are mean-square stable for sufficiently small step-sizes, we take the limit of (9.279) as image and write:

image (9.283)

Grouping terms leads to the following result.

Corollary 9.6.1

Steady-State Variance Relation

Consider the same setting of Theorem 9.6.6. The weight-error vector, image, of the adaptive diffusion strategy (9.201)–(9.203) satisfies the following relation in steady-state:

image (9.284)

for any Hermitian nonnegative-definite matrix image with image, and where image are defined by (9.277) and (9.280).

□

Expression (9.284) is a very useful relation; it allows us to evaluate the network MSD and EMSE through proper selection of the weighting vector image (or, equivalently, the weighting matrix image). For example, the network MSD is defined as the average value:

image (9.285)

which amounts to averaging the MSDs of the individual nodes. Therefore,

image (9.286)

This means that in order to recover the network MSD from relation (9.284), we should select the weighting vector image such that

image

Solving for image and substituting back into (9.284) we arrive at the following expression for the network MSD:

image (9.287)

Likewise, the network EMSE is defined as the average value

image (9.288)

which amounts to averaging the EMSEs of the individual nodes. Therefore,

image (9.289)

where image is the matrix defined earlier by (9.184), which we repeat below for ease of reference:

image (9.290)

This means that in order to recover the network EMSE from relation (9.284), we should select the weighting vector image such that

image (9.291)

Solving for image and substituting into (9.284) we arrive at the following expression for the network EMSE:

image (9.292)

3.09.6.7 Mean-square performance of individual nodes

We can also assess the mean-square performance of the individual nodes in the network from (9.284). For instance, the MSD of any particular node image is defined by

image (9.293)

Introduce the image block diagonal matrix with blocks of size image, where all blocks on the diagonal are zero except for an identity matrix on the diagonal block of index image, i.e.,

image (9.294)

Then, we can express the node MSD as follows:

image (9.295)

The same argument that was used to obtain the network MSD then leads to

image (9.296)
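The selector-matrix construction can be sketched numerically. With the same kind of synthetic stand-ins as before (a stable F, an arbitrary y; all names hypothetical), the node MSDs obtained through the block selectors average exactly to the network MSD, because the selectors sum to the identity and the steady-state relation is linear in the weighting vector.

```python
import numpy as np

# Synthetic stand-ins (hypothetical, not the chapter's actual quantities).
rng = np.random.default_rng(2)
N, M = 3, 2
n = (N * M) ** 2
F = rng.standard_normal((n, n))
F *= 0.9 / np.max(np.abs(np.linalg.eigvals(F)))   # force stability
y = rng.random(n)

vec = lambda X: X.reshape(-1, order="F")
steady = lambda s: y @ np.linalg.solve(np.eye(n) - F, s)

# Block-diagonal selector for node k: zero blocks except I_M at block k.
def selector(k):
    J = np.zeros((N * M, N * M))
    J[k * M:(k + 1) * M, k * M:(k + 1) * M] = np.eye(M)
    return J

node_msd = [steady(vec(selector(k))) for k in range(N)]
network_msd = steady(vec(np.eye(N * M)) / N)

# Averaging the individual node MSDs recovers the network MSD.
assert np.isclose(np.mean(node_msd), network_msd)
```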

Likewise, the EMSE of node image is defined by

image (9.297)

Introduce the image block diagonal matrix with blocks of size image, where all blocks on the diagonal are zero except for the diagonal block of index image whose value is image, i.e.,

image (9.298)

Then, we can express the node EMSE as follows:

image (9.299)

The same argument that was used to obtain the network EMSE then leads to

image (9.300)

We summarize the results in the following statement.

Theorem 9.6.8

Network Mean-Square Performance

Consider the same setting of Theorem 9.6.6. Introduce the image row vector image defined by

image (9.301)

where image are defined by (9.277) and (9.280). Then, the network MSD and EMSE, as well as the individual node performance measures, are given by

image (9.302)

image (9.303)

image (9.304)

image (9.305)

where image are defined by (9.294) and (9.298).

□

We can recover from the above expressions the performance of the nodes in the non-cooperative implementation (9.207), where each node performs its adaptation individually, by setting image.

We can express the network MSD (and, if desired, the network EMSE) in an alternative useful form involving a series representation.

Corollary 9.6.2

Series Representation for Network MSD

Consider the same setting of Theorem 9.6.6. The network MSD can be expressed in the following alternative series form:

image (9.306)

where

image (9.307)

image (9.308)

image (9.309)

Proof.

Since $\mathcal{F}$ is stable when the filter is mean-square stable, we can expand $(I - \mathcal{F})^{-1}$ as

$(I - \mathcal{F})^{-1} = \sum_{j=0}^{\infty} \mathcal{F}^{j}$

Substituting into (9.287) and using property (9.274), we obtain the desired result. □
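The expansion used in this proof is the standard Neumann series, which converges whenever the matrix is stable. A quick numerical sketch (with a synthetic stable matrix, not the chapter's) shows the truncated series approaching the exact inverse.

```python
import numpy as np

# Neumann series: (I - F)^{-1} = sum over j >= 0 of F^j, valid when rho(F) < 1.
rng = np.random.default_rng(3)
n = 10
F = rng.standard_normal((n, n))
F *= 0.7 / np.max(np.abs(np.linalg.eigvals(F)))   # synthetic stable matrix

exact = np.linalg.inv(np.eye(n) - F)
partial, term = np.zeros((n, n)), np.eye(n)
for _ in range(200):
    partial += term                               # accumulate F^j
    term = term @ F

# The truncated series matches the inverse to numerical precision.
assert np.allclose(partial, exact, atol=1e-8)
```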

3.09.6.8 Uniform data profile

We can simplify expressions (9.307)–(9.309) for image in the case when the regression covariance matrices are uniform across the network and all nodes employ the same step-size, i.e., when

image (9.310)

image (9.311)

and when the combination matrix image is doubly stochastic, so that

image (9.312)

Thus, under conditions (9.310)–(9.312), expressions (9.180), (9.181), and (9.263) for image simplify to

image (9.313)

image (9.314)

image (9.315)

Substituting these values into expression (9.309) for image we get

image (9.316)

where we used the useful Kronecker product identities:

image (9.317)

image (9.318)

for any matrices image of compatible dimensions. Likewise, introduce the image diagonal matrix with noise variances:

image (9.319)

Then, expression (9.241) for image becomes

image (9.320)

It then follows that we can simplify expression (9.307) for image as:

image (9.321)
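Standard Kronecker-product identities of the kind invoked in these simplifications (the mixed-product rule, the vec identity, and the trace factorization that drives the decoupling below) can be checked directly. The matrices here are arbitrary; they are not the chapter's quantities.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((2, 3)); C = rng.standard_normal((3, 2))
B = rng.standard_normal((4, 2)); D = rng.standard_normal((2, 4))

# Mixed-product rule: (A (x) B)(C (x) D) = (AC) (x) (BD)
assert np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D))

# vec identity: vec(A X E) = (E^T (x) A) vec(X), with column-major vec(.)
X = rng.standard_normal((3, 5)); E = rng.standard_normal((5, 4))
vec = lambda M: M.reshape(-1, order="F")
assert np.allclose(vec(A @ X @ E), np.kron(E.T, A) @ vec(X))

# Trace factorization: Tr(S (x) T) = Tr(S) Tr(T) for square S, T
S = rng.standard_normal((3, 3)); T = rng.standard_normal((4, 4))
assert np.isclose(np.trace(np.kron(S, T)), np.trace(S) * np.trace(T))
```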

Corollary 9.6.3

Network MSD for Uniform Data Profile

Consider the same setting of Theorem 9.6.6 with the additional requirement that conditions (9.310)–(9.312) for a uniform data profile hold. The network MSD is still given by the same series representation (9.306), where now

image (9.322)

image (9.323)

Using these expressions, we can decouple the network MSD expression (9.306) into two separate factors: one is dependent on the step-size and data covariance image, and the other is dependent on the combination matrices and noise profile image:

image (9.324)

□

Proof.

Using (9.306) and the given expressions (9.322) and (9.323) for image, we get

image

Result (9.324) follows from property (9.317). □

3.09.6.9 Transient mean-square performance

Before comparing the mean-square performance of various cooperation strategies, we pause to comment that the variance relation (9.279) can also be used to characterize the transient behavior of the network, and not just its steady-state performance. To see this, iterating (9.279) starting from image, we find that

image (9.325)

where

image (9.326)

in terms of the initial condition, image. If this initial condition happens to be image, then image. Comparing expression (9.325) at time instants image and image, we can relate image and image as follows:

image (9.327)

This recursion relates the same weighted square measure of the error vectors image at successive time instants; it therefore describes how this weighted square measure evolves over time. It is clear from the relation that, for mean-square stability, the matrix image needs to be stable so that the terms involving image do not grow unbounded.

The learning curve of the network describes the evolution of the network EMSE over time. At any time image, the network EMSE is denoted by image and measured as:

image (9.328)

image

The above expression indicates that image is obtained by averaging the EMSEs of the individual nodes at time image. Therefore,

image (9.329)

where image is the matrix defined by (9.290). This means that in order to evaluate the evolution of the network EMSE from relation (9.327), we simply select the weighting vector image such that

image (9.330)

Substituting into (9.327) we arrive at the learning curve for the network.

Corollary 9.6.4

Network Learning Curve

Consider the same setting of Theorem 9.6.6. Let image denote the network EMSE at time image, as defined by (9.328). Then, the learning curve of the network corresponds to the evolution of image with time and is described by the following recursion over image:

image (9.331)

where image are defined by (9.277), (9.280), and (9.290). □
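The recursive form of the learning curve is convenient computationally: each iteration adds one correction term, and the curve converges to the steady-state value. The sketch below uses synthetic stand-ins (a stable F, arbitrary y and weighting vector, zero initial conditions); it illustrates the structure of the recursion, not the chapter's actual network.

```python
import numpy as np

# Synthetic stand-ins for the learning-curve recursion with zero initial error:
# zeta(i) = zeta(i-1) + y^T F^i sigma, converging to y^T (I - F)^{-1} sigma.
rng = np.random.default_rng(5)
n = 16
F = rng.standard_normal((n, n))
F *= 0.8 / np.max(np.abs(np.linalg.eigvals(F)))   # force stability
y = rng.random(n)
sigma = rng.random(n)                             # stand-in EMSE weighting

zeta, Fi_sigma, curve = 0.0, sigma.copy(), []
for i in range(500):
    zeta += y @ Fi_sigma                          # add the i-th correction term
    Fi_sigma = F @ Fi_sigma
    curve.append(zeta)

# The learning curve settles at the steady-state value from the closed form.
steady_state = y @ np.linalg.solve(np.eye(n) - F, sigma)
assert np.isclose(curve[-1], steady_state, rtol=1e-8)
```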
