Chapter 8

Kernel-Based Inference of Functions Over Graphs

Vassilis N. Ioannidis, Meng Ma, Athanasios N. Nikolakopoulos, Georgios B. Giannakis — University of Minnesota, Minneapolis, MN, United States
Daniel Romero — University of Agder, Grimstad, Norway

Abstract

The study of networks has witnessed an explosive growth over the past decades, with several ground-breaking methods introduced. A particularly interesting problem—prevalent in several fields of study—is that of inferring a function defined over the nodes of a network. This work presents a versatile kernel-based framework for tackling this inference problem that naturally subsumes and generalizes the reconstruction approaches put forth recently by the signal processing on graphs community. Both the static and the dynamic settings are considered, along with effective modeling approaches for addressing real-world problems. The analytical discussion herein is complemented with a set of numerical examples, which showcase the effectiveness of the presented techniques as well as their merits relative to state-of-the-art methods.

Keywords

Signal processing on graphs; Kernel-based learning; Graph function reconstruction; Dynamic graphs; Kernel Kalman filter

Acknowledgements

The research was supported by NSF grants 1442686, 1500713, 1508993, 1509040, and 1739397.

8.1 Introduction

Numerous applications arising in diverse disciplines involve inference over networks [1]. Modeling nodal attributes as signals that take values over the vertices of the underlying graph allows the associated inference tasks to leverage node dependencies captured by the graph structure. In many real settings one can often afford only a limited number of node observations, due to restrictions inherent to the inference task at hand. In social networks, for example, individuals may be reluctant to share personal information; in sensor networks the nodes may report observations sporadically in order to save energy; in brain networks acquiring node samples may involve invasive procedures (e.g., electrocorticography). In this context, a frequently encountered challenge is that of inferring the attributes of every node in the network given the attributes of a subset of nodes. This is typically formulated as the task of reconstructing a function defined on the nodes [16], given information about some of its values.

Reconstruction of functions over graphs has been studied by the machine learning community, in the context of semisupervised learning, under the term transductive regression and classification [6–8]. Existing approaches assume "smoothness" with respect to the graph—in the sense that neighboring vertices have similar values—and devise nonparametric methods [2,3,6,9] targeting primarily the task of reconstructing binary-valued signals. Function estimation has also been investigated recently by the community of signal processing on graphs (SPoG) under the term signal reconstruction [10–17]. Most of these approaches adopt parametric estimation tools and rely on bandlimitedness, by which the signal of interest is assumed to lie in the span of the B principal eigenvectors of the graph's Laplacian (or adjacency) matrix.

This chapter cross-pollinates ideas arising from both communities and presents a unifying framework for tackling signal reconstruction problems both in the traditional time-invariant setting and in the more challenging time varying setting. We begin with a comprehensive presentation of kernel-based learning for solving problems of signal reconstruction over graphs (Section 8.2). Data-driven techniques are then presented based on multikernel learning (MKL), which enables optimally combining the kernels of a given dictionary while simultaneously estimating the graph function by solving a single optimization problem (Section 8.2.3). For the case where prior information is available, semiparametric estimators are discussed that can seamlessly incorporate structured prior information into the signal estimators (Section 8.2.4). We then move to the problem of reconstructing time evolving functions on dynamic graphs (Section 8.3). The kernel-based framework is extended to accommodate the time evolving setting, building on the notion of graph extension, specific choices of which lend themselves to a reduced-complexity online solver (Section 8.3.1). Next, a more flexible model is introduced that captures multiple forms of time dynamics, and kernel-based learning is employed to derive an online solver that effects online MKL by selecting the optimal combination of kernels on-the-fly (Section 8.3.2). Our analytical exposition, in both parts, is supplemented with a set of numerical tests based on both real and synthetic data that highlight the effectiveness of the methods, while providing examples of interesting realistic problems that they can address.

Notation: Scalars are denoted by lowercase characters, vectors by bold lowercase and matrices by bold uppercase; $(\mathbf{A})_{m,n}$ is the $(m,n)$th entry of matrix $\mathbf{A}$; superscripts $T$ and $\dagger$ respectively denote transpose and pseudoinverse. If $\mathbf{A} := [\mathbf{a}_1, \ldots, \mathbf{a}_N]$, then $\mathrm{vec}\{\mathbf{A}\} := [\mathbf{a}_1^T, \ldots, \mathbf{a}_N^T]^T := \mathbf{a}$. With $N \times N$ matrices $\{\mathbf{A}_t\}_{t=1}^T$ and $\{\mathbf{B}_t\}_{t=2}^T$ satisfying $\mathbf{A}_t = \mathbf{A}_t^T\ \forall t$, $\mathrm{btridiag}\{\mathbf{A}_1, \ldots, \mathbf{A}_T; \mathbf{B}_2, \ldots, \mathbf{B}_T\}$ represents the symmetric block tridiagonal matrix with diagonal blocks $\mathbf{A}_1, \ldots, \mathbf{A}_T$ and subdiagonal blocks $\mathbf{B}_2, \ldots, \mathbf{B}_T$. Symbols ⊙, ⊗ and ⊕, respectively, denote the element-wise (Hadamard) matrix product, the Kronecker product and the Kronecker sum, the latter being defined for $\mathbf{A} \in \mathbb{R}^{M \times M}$ and $\mathbf{B} \in \mathbb{R}^{N \times N}$ as $\mathbf{A} \oplus \mathbf{B} := \mathbf{A} \otimes \mathbf{I}_N + \mathbf{I}_M \otimes \mathbf{B}$. The $n$th column of the identity matrix $\mathbf{I}_N$ is represented by $\mathbf{i}_{N,n}$. If $\mathbf{A} \in \mathbb{R}^{N \times N}$ is positive definite and $\mathbf{x} \in \mathbb{R}^N$, then $\|\mathbf{x}\|_{\mathbf{A}}^2 := \mathbf{x}^T \mathbf{A}^{-1} \mathbf{x}$ and $\|\mathbf{x}\|^2 := \|\mathbf{x}\|_{\mathbf{I}_N}^2$. The cone of $N \times N$ positive definite matrices is denoted by $\mathbb{S}_+^N$. Finally, $\delta[\cdot]$ stands for the Kronecker delta and $\mathbb{E}$ for expectation.

8.2 Reconstruction of Functions Over Graphs

Before giving the formal problem statement, it is instructive to start with the basic definitions that will be used throughout this chapter.

Definitions: A graph can be specified by a tuple $\mathcal{G} := (\mathcal{V}, \mathbf{A})$, where $\mathcal{V} := \{v_1, \ldots, v_N\}$ is the vertex set and $\mathbf{A}$ is the $N \times N$ adjacency matrix, whose $(n, n')$th entry, $A_{n,n'} \geq 0$, denotes the nonnegative edge weight between vertices $v_n$ and $v_{n'}$. For simplicity, it is assumed that the graph has no self-loops, i.e., $A_{n,n} = 0\ \forall v_n \in \mathcal{V}$. This chapter focuses on undirected graphs, for which $A_{n,n'} = A_{n',n}\ \forall v_n, v_{n'} \in \mathcal{V}$. A graph is said to be unweighted if $A_{n,n'}$ is either 0 or 1. The edge set is defined as $\mathcal{E} := \{(v_n, v_{n'}) \in \mathcal{V} \times \mathcal{V} : A_{n,n'} \neq 0\}$. Two vertices $v_n$ and $v_{n'}$ are adjacent, connected or neighbors if $(v_n, v_{n'}) \in \mathcal{E}$. The Laplacian matrix is defined as $\mathbf{L} := \mathrm{diag}\{\mathbf{A}\mathbf{1}\} - \mathbf{A}$ and is symmetric and positive semidefinite [1, Chapter 2]. A real-valued function (or signal) on a graph is a map $f : \mathcal{V} \to \mathbb{R}$. The value $f(v)$ represents an attribute or feature of $v \in \mathcal{V}$, such as the age, political alignment or annual income of a person in a social network. Signal $f$ is thus represented by the vector $\mathbf{f} := [f(v_1), \ldots, f(v_N)]^T$.

Problem statement. Suppose that a collection of noisy samples (or observations) $\{y_s \,|\, y_s = f(v_{n_s}) + e_s\}_{s=1}^S$ is available, where $e_s$ models noise and $\mathcal{S} := \{n_1, \ldots, n_S\}$ contains the indices $1 \leq n_1 < \cdots < n_S \leq N$ of the sampled vertices, with $S \leq N$. Given $\{(n_s, y_s)\}_{s=1}^S$ and assuming knowledge of $\mathcal{G}$, the goal is to estimate $\mathbf{f}$. This will provide estimates of $f(v)$ both at observed and unobserved vertices. By defining $\mathbf{y} := [y_1, \ldots, y_S]^T$, the observation model is summarized as

$$\mathbf{y} = \mathbf{S}\mathbf{f} + \mathbf{e}, \tag{8.1}$$

where $\mathbf{e} := [e_1, \ldots, e_S]^T$ and $\mathbf{S}$ is a known $S \times N$ binary sampling matrix with entries $(s, n_s)$, $s = 1, \ldots, S$, set to one and the rest set to zero.
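To fix ideas, the following minimal NumPy sketch builds a sampling matrix $\mathbf{S}$ from a list of sampled vertex indices and generates noisy observations according to (8.1); the variable names and parameter values are illustrative only.

```python
import numpy as np

def sampling_matrix(N, sample_idx):
    """Binary S x N sampling matrix with entries (s, n_s) set to one, cf. (8.1)."""
    S = len(sample_idx)
    Smat = np.zeros((S, N))
    Smat[np.arange(S), sample_idx] = 1.0
    return Smat

# Illustrative use: N = 6 vertices, observe vertices 0, 2 and 5.
rng = np.random.default_rng(0)
N, sample_idx, noise_std = 6, [0, 2, 5], 0.1
f = rng.standard_normal(N)                    # true graph signal (placeholder)
Smat = sampling_matrix(N, sample_idx)
y = Smat @ f + noise_std * rng.standard_normal(len(sample_idx))   # y = S f + e
```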

8.2.1 Kernel Regression

Kernel methods constitute the “workhorse” of machine learning for nonlinear function estimation [18]. Their popularity can be attributed to their simplicity, flexibility and good performance. Here, we present kernel regression as a unifying framework for graph signal reconstruction along with the so-called representer theorem.

Kernel regression seeks an estimate of $f$ in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, which is the space of functions $f : \mathcal{V} \to \mathbb{R}$ defined as

$$\mathcal{H} := \Big\{ f : f(v) = \sum_{n=1}^N \alpha_n \kappa(v, v_n), \ \alpha_n \in \mathbb{R} \Big\}, \tag{8.2}$$

where the kernel map $\kappa : \mathcal{V} \times \mathcal{V} \to \mathbb{R}$ is any function defining a symmetric and positive semidefinite $N \times N$ matrix with entries $[\mathbf{K}]_{n,n'} := \kappa(v_n, v_{n'})$ [19]. Intuitively, $\kappa(v, v')$ is a basis function in (8.2) measuring similarity between the values of $f$ at $v$ and $v'$. (For a more detailed treatment of RKHSs, see, e.g., [3].)

Note that, for signals over graphs, the expansion in (8.2) is finite since $\mathcal{V}$ has finite cardinality. Thus, any $f \in \mathcal{H}$ can be expressed in the compact form

$$\mathbf{f} = \mathbf{K}\boldsymbol{\alpha} \tag{8.3}$$

for some $N \times 1$ vector $\boldsymbol{\alpha} := [\alpha_1, \ldots, \alpha_N]^T$.

Given two functions $f(v) := \sum_{n=1}^N \alpha_n \kappa(v, v_n)$ and $f'(v) := \sum_{n=1}^N \alpha_n' \kappa(v, v_n)$, their RKHS inner product is defined as¹

$$\langle f, f' \rangle_{\mathcal{H}} := \sum_{n=1}^N \sum_{n'=1}^N \alpha_n \alpha_{n'}' \kappa(v_n, v_{n'}) = \boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha}', \tag{8.4}$$

where $\boldsymbol{\alpha}' := [\alpha_1', \ldots, \alpha_N']^T$, and the reproducing property $\langle \kappa(\cdot, v_{n_0}), \kappa(\cdot, v_{n_0'}) \rangle_{\mathcal{H}} = \mathbf{i}_{N,n_0}^T \mathbf{K} \mathbf{i}_{N,n_0'} = \kappa(v_{n_0}, v_{n_0'})$ has been employed. The RKHS norm is defined by

$$\|f\|_{\mathcal{H}}^2 := \langle f, f \rangle_{\mathcal{H}} = \boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha} \tag{8.5}$$

and will be used as a regularizer to control overfitting and to cope with the underdetermined reconstruction problem. As a special case, setting $\mathbf{K} = \mathbf{I}_N$ recovers the standard inner product $\langle f, f' \rangle_{\mathcal{H}} = \mathbf{f}^T \mathbf{f}'$ and the Euclidean norm $\|f\|_{\mathcal{H}}^2 = \|\mathbf{f}\|_2^2$. Note that when $\mathbf{K} \succ \mathbf{0}$, the set of functions of the form (8.3) coincides with $\mathbb{R}^N$.

Given $\{y_s\}_{s=1}^S$, RKHS-based function estimators are obtained by solving functional minimization problems formulated as (see also, e.g., [18–20])

$$\hat{f} := \arg\min_{f \in \mathcal{H}} \mathcal{L}(\mathbf{y}, \bar{\mathbf{f}}) + \mu\, \Omega(\|f\|_{\mathcal{H}}), \tag{8.6}$$

where the loss $\mathcal{L}$ measures how the estimated function $f$ at the observed vertices $\{v_{n_s}\}_{s=1}^S$, collected in $\bar{\mathbf{f}} := [f(v_{n_1}), \ldots, f(v_{n_S})]^T = \mathbf{S}\mathbf{f}$, deviates from the data $\mathbf{y}$. The so-called square loss $\mathcal{L}(\mathbf{y}, \bar{\mathbf{f}}) := (1/S)\sum_{s=1}^S [y_s - f(v_{n_s})]^2$ constitutes a popular choice for $\mathcal{L}$. The increasing function Ω is used to promote smoothness, with typical choices including $\Omega(\zeta) = |\zeta|$ and $\Omega(\zeta) = \zeta^2$. The regularization parameter $\mu > 0$ controls overfitting. Substituting (8.3) and (8.5) into (8.6) shows that $\hat{\mathbf{f}}$ can be found as

$$\hat{\boldsymbol{\alpha}} := \arg\min_{\boldsymbol{\alpha} \in \mathbb{R}^N} \mathcal{L}(\mathbf{y}, \mathbf{S}\mathbf{K}\boldsymbol{\alpha}) + \mu\, \Omega\big((\boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha})^{1/2}\big), \tag{8.7a}$$

$$\hat{\mathbf{f}} = \mathbf{K}\hat{\boldsymbol{\alpha}}. \tag{8.7b}$$

An alternative form of (8.5) that will be frequently used in the sequel results upon noting that

$$\boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha} = \boldsymbol{\alpha}^T \mathbf{K} \mathbf{K}^{\dagger} \mathbf{K} \boldsymbol{\alpha} = \mathbf{f}^T \mathbf{K}^{\dagger} \mathbf{f}. \tag{8.8}$$

Thus, one can rewrite (8.6) as

$$\hat{\mathbf{f}} := \arg\min_{\mathbf{f} \in \mathcal{R}\{\mathbf{K}\}} \mathcal{L}(\mathbf{y}, \mathbf{S}\mathbf{f}) + \mu\, \Omega\big((\mathbf{f}^T \mathbf{K}^{\dagger} \mathbf{f})^{1/2}\big). \tag{8.9}$$

Although graph signals can be reconstructed from (8.7), such an approach involves optimizing over $N$ variables. Thankfully, the solution can be obtained by solving an optimization problem in $S$ variables (where typically $S \ll N$), by invoking the so-called representer theorem [19,21].

The representer theorem plays an instrumental role in the traditional infinite-dimensional setting where (8.6) cannot be solved directly; however, even when $\mathcal{H}$ comprises graph signals, it can still be beneficial to reduce the dimension of the optimization in (8.7). The theorem essentially asserts that the solution to the functional minimization in (8.6) can be expressed as

$$\hat{f}(v) = \sum_{s=1}^S \bar{\alpha}_s \kappa(v, v_{n_s}) \tag{8.10}$$

for some $\bar{\alpha}_s \in \mathbb{R}$, $s = 1, \ldots, S$.

The representer theorem shows the form of $\hat{f}$, but does not provide the optimal $\{\bar{\alpha}_s\}_{s=1}^S$, which are found by substituting (8.10) into (8.6) and solving the resulting optimization problem with respect to these coefficients. To this end, let $\bar{\boldsymbol{\alpha}} := [\bar{\alpha}_1, \ldots, \bar{\alpha}_S]^T$ and write $\boldsymbol{\alpha} = \mathbf{S}^T \bar{\boldsymbol{\alpha}}$ to deduce that

$$\hat{\mathbf{f}} = \mathbf{K}\boldsymbol{\alpha} = \mathbf{K}\mathbf{S}^T\bar{\boldsymbol{\alpha}}. \tag{8.11}$$

With $\bar{\mathbf{K}} := \mathbf{S}\mathbf{K}\mathbf{S}^T$ and using (8.7) and (8.11), the optimal $\bar{\boldsymbol{\alpha}}$ can be found as

$$\hat{\bar{\boldsymbol{\alpha}}} := \arg\min_{\bar{\boldsymbol{\alpha}} \in \mathbb{R}^S} \mathcal{L}(\mathbf{y}, \bar{\mathbf{K}}\bar{\boldsymbol{\alpha}}) + \mu\, \Omega\big((\bar{\boldsymbol{\alpha}}^T \bar{\mathbf{K}} \bar{\boldsymbol{\alpha}})^{1/2}\big). \tag{8.12}$$

Kernel ridge regression. For $\mathcal{L}$ chosen as the square loss and $\Omega(\zeta) = \zeta^2$, the $\hat{\mathbf{f}}$ in (8.6) is referred to as the kernel ridge regression (RR) estimate [18]. If $\bar{\mathbf{K}}$ is full rank, this estimate is given by $\hat{\mathbf{f}}_{\mathrm{RR}} = \mathbf{K}\mathbf{S}^T\hat{\bar{\boldsymbol{\alpha}}}$, where

$$\hat{\bar{\boldsymbol{\alpha}}} := \arg\min_{\bar{\boldsymbol{\alpha}} \in \mathbb{R}^S} \frac{1}{S}\|\mathbf{y} - \bar{\mathbf{K}}\bar{\boldsymbol{\alpha}}\|^2 + \mu\, \bar{\boldsymbol{\alpha}}^T \bar{\mathbf{K}} \bar{\boldsymbol{\alpha}} \tag{8.13a}$$

$$= (\bar{\mathbf{K}} + \mu S \mathbf{I}_S)^{-1}\mathbf{y}. \tag{8.13b}$$

Therefore, $\hat{\mathbf{f}}_{\mathrm{RR}}$ can be expressed as

$$\hat{\mathbf{f}}_{\mathrm{RR}} = \mathbf{K}\mathbf{S}^T(\bar{\mathbf{K}} + \mu S \mathbf{I}_S)^{-1}\mathbf{y}. \tag{8.14}$$

Section 8.2.2 shows that (8.14) generalizes a number of existing signal reconstructors upon properly selecting K.
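For concreteness, a minimal sketch of the ridge regression estimate (8.13b)–(8.14) follows; it assumes a given kernel matrix K and an invertible $\bar{\mathbf{K}} + \mu S \mathbf{I}_S$, and uses a linear solve rather than an explicit matrix inverse.

```python
import numpy as np

def krr_reconstruct(K, Smat, y, mu):
    """Kernel ridge regression f_RR = K S^T (Kbar + mu*S*I)^{-1} y, cf. (8.14)."""
    S = Smat.shape[0]                    # number of observed vertices
    Kbar = Smat @ K @ Smat.T             # S x S kernel restricted to the sampled vertices
    alpha_bar = np.linalg.solve(Kbar + mu * S * np.eye(S), y)   # (8.13b)
    return K @ Smat.T @ alpha_bar        # estimate at every vertex of the graph
```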

8.2.2 Kernels on Graphs

When estimating functions on graphs, conventional kernels such as the Gaussian kernel cannot be adopted because the underlying set where graph signals are defined is not a metric space. Indeed, no vertex addition $v_n + v_{n'}$, scaling $\beta v_n$ or norm $\|v_n\|$ can be naturally defined on $\mathcal{V}$. An alternative is to embed $\mathcal{V}$ into a Euclidean space via a feature map $\phi : \mathcal{V} \to \mathbb{R}^D$ and invoke a conventional kernel afterwards. However, for a given graph it is generally unclear how to explicitly design ϕ or select D. This motivates the adoption of kernels on graphs [3].

A common approach to designing kernels on graphs is to apply a transformation function on the graph Laplacian [3]. The term Laplacian kernel comprises a wide family of kernels obtained by applying a certain function r()Image to the Laplacian matrix L. Laplacian kernels are well motivated since they constitute the graph counterpart of the so-called translation-invariant kernels in Euclidean spaces [3]. This section reviews Laplacian kernels, provides beneficial insights in terms of interpolating signals and highlights their versatility in capturing information about the graph Fourier transform of the estimated signal.

The reason why the graph Laplacian constitutes one of the prominent candidates for regularization on graphs becomes clear upon recognizing that

$$\mathbf{f}^T \mathbf{L} \mathbf{f} = \frac{1}{2} \sum_{(n,n') \in \mathcal{E}} A_{n,n'} (f_n - f_{n'})^2, \tag{8.15}$$

where $A_{n,n'}$ denotes the weight associated with edge $(n, n')$. The quadratic form in (8.15) becomes larger when function values vary considerably among connected vertices and therefore quantifies the smoothness of $f$ on $\mathcal{G}$.

Let $0 = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_N$ denote the eigenvalues of the graph Laplacian matrix $\mathbf{L}$ and consider the eigendecomposition $\mathbf{L} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T$, where $\boldsymbol{\Lambda} := \mathrm{diag}\{\lambda_1, \ldots, \lambda_N\}$. A Laplacian kernel matrix is defined by

$$\mathbf{K} := r^{\dagger}(\mathbf{L}) := \mathbf{U}\, r^{\dagger}(\boldsymbol{\Lambda})\, \mathbf{U}^T, \tag{8.16}$$

where $r(\boldsymbol{\Lambda})$ is the result of applying a user-selected, scalar, nonnegative map $r : \mathbb{R} \to \mathbb{R}_+$ to the diagonal entries of $\boldsymbol{\Lambda}$, and $\dagger$ denotes (scalar) pseudoinversion of those diagonal entries. The selection of the map $r$ generally depends on desirable properties that the target function is expected to have. Table 8.1 summarizes some well-known examples arising for specific choices of $r$; an illustrative construction in code follows the table.

Table 8.1

Common spectral weight functions

| Kernel name | Function | Parameters |
| --- | --- | --- |
| Diffusion [2] | $r(\lambda) = \exp\{\sigma^2 \lambda / 2\}$ | $\sigma^2$ |
| p-step random walk [3] | $r(\lambda) = (a - \lambda)^{-p}$ | $a \geq 2$, $p \geq 0$ |
| Regularized Laplacian [3,22] | $r(\lambda) = 1 + \sigma^2 \lambda$ | $\sigma^2$ |
| Bandlimited [23] | $r(\lambda) = \begin{cases} 1/\beta, & \lambda \leq \lambda_{\max} \\ \beta, & \text{otherwise} \end{cases}$ | $\beta > 0$, $\lambda_{\max}$ |
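As a rough sketch of how the kernels of Table 8.1 can be instantiated numerically, the code below forms $\mathbf{K} = \mathbf{U}\,r^{\dagger}(\boldsymbol{\Lambda})\,\mathbf{U}^T$ for the diffusion and bandlimited spectral weights; the helper names are ours, and $r^{\dagger}(\lambda)$ is implemented as $1/r(\lambda)$ (with tiny values zeroed).

```python
import numpy as np

def laplacian_kernel(L, r):
    """K = U r_dagger(Lambda) U^T for a spectral weight function r, cf. (8.16)."""
    lam, U = np.linalg.eigh(L)                             # eigendecomposition of the Laplacian
    r_vals = np.array([r(l) for l in lam])
    r_dag = np.where(r_vals > 1e-12, 1.0 / r_vals, 0.0)    # scalar pseudoinverse of r(lambda)
    return (U * r_dag) @ U.T                               # equals U @ diag(r_dag) @ U.T

# Example spectral weights from Table 8.1 (sigma2, beta, lam_max are user-chosen).
diffusion   = lambda lam, sigma2=1.0: np.exp(sigma2 * lam / 2.0)
bandlimited = lambda lam, beta=1e3, lam_max=1.0: (1.0 / beta if lam <= lam_max else beta)

# Usage: given an adjacency matrix A, L = np.diag(A.sum(1)) - A, then
# K_diff = laplacian_kernel(L, diffusion).
```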

At this point, it is prudent to offer interpretations and insights on the operation of Laplacian kernels. Towards this objective, note first that the regularizer from (8.9) is an increasing function of

$$\boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha} = \boldsymbol{\alpha}^T \mathbf{K} \mathbf{K}^{\dagger} \mathbf{K} \boldsymbol{\alpha} = \mathbf{f}^T \mathbf{K}^{\dagger} \mathbf{f} = \mathbf{f}^T \mathbf{U}\, r(\boldsymbol{\Lambda})\, \mathbf{U}^T \mathbf{f} = \check{\mathbf{f}}^T r(\boldsymbol{\Lambda}) \check{\mathbf{f}} = \sum_{n=1}^N r(\lambda_n) |\check{f}_n|^2, \tag{8.17}$$

where $\check{\mathbf{f}} := \mathbf{U}^T\mathbf{f} := [\check{f}_1, \ldots, \check{f}_N]^T$ comprises the projections of $\mathbf{f}$ onto the eigenspace of $\mathbf{L}$ and is referred to as the graph Fourier transform of $\mathbf{f}$ in SPoG parlance [4]. Consequently, $\{\check{f}_n\}_{n=1}^N$ are called frequency components. The so-called bandlimited (BL) functions in SPoG are those whose frequency components only exist inside some band $B$, that is, $\check{f}_n = 0\ \forall n > B$.

By adopting the aforementioned SPoG notions, one can intuitively interpret the role of Laplacian kernels. Indeed, it follows from (8.17) that the regularizer strongly penalizes those $\check{f}_n$ for which the corresponding $r(\lambda_n)$ is large, thus promoting a specific structure in this "frequency" domain. Specifically, one prefers $r(\lambda_n)$ to be large whenever $|\check{f}_n|^2$ is small, and vice versa. The fact that $|\check{f}_n|^2$ is expected to decrease with $n$ for smooth $\mathbf{f}$ motivates the adoption of an increasing function $r$ [3]. From (8.17) it is clear that $r(\lambda_n)$ determines how heavily $\check{f}_n$ is penalized. Therefore, by setting $r(\lambda_n)$ to be small when $n \leq B$ and extremely large when $n > B$, one can expect the result to be a BL signal.

Observe that Laplacian kernels can capture forms of prior information richer than bandlimitedness [11,13,16,17] by selecting the function $r$ accordingly. For instance, using $r(\lambda) = \exp\{\sigma^2\lambda/2\}$ (diffusion kernel) accounts not only for smoothness of $\mathbf{f}$ as in (8.15), but also for the prior knowledge that $\mathbf{f}$ is generated by a process diffusing over the graph. Similarly, the use of $r(\lambda) = (a - \lambda)^{-1}$ (1-step random walk) can accommodate cases where the signal captures a notion of network centrality.²

So far, $\mathbf{f}$ has been assumed deterministic, which precludes accommodating certain forms of prior information that probabilistic models can capture, such as domain knowledge and historical data. Suppose without loss of generality that $\{f(v_n)\}_{n=1}^N$ are zero-mean random variables. The LMMSE estimator of $\mathbf{f}$ given $\mathbf{y}$ in (8.1) is the linear estimator $\hat{\mathbf{f}}_{\mathrm{LMMSE}}$ minimizing $\mathbb{E}\|\mathbf{f} - \hat{\mathbf{f}}_{\mathrm{LMMSE}}\|_2^2$, where the expectation is over all $\mathbf{f}$ and noise realizations. With $\mathbf{C} := \mathbb{E}[\mathbf{f}\mathbf{f}^T]$, the LMMSE estimate is

$$\hat{\mathbf{f}}_{\mathrm{LMMSE}} = \mathbf{C}\mathbf{S}^T\big[\mathbf{S}\mathbf{C}\mathbf{S}^T + \sigma_e^2\mathbf{I}_S\big]^{-1}\mathbf{y}, \tag{8.18}$$

where $\sigma_e^2 := (1/S)\,\mathbb{E}[\|\mathbf{e}\|_2^2]$ denotes the noise variance. Comparing (8.18) with (8.14) and recalling that $\bar{\mathbf{K}} := \mathbf{S}\mathbf{K}\mathbf{S}^T$, it follows that $\hat{\mathbf{f}}_{\mathrm{LMMSE}} = \hat{\mathbf{f}}_{\mathrm{RR}}$ if $\mu S = \sigma_e^2$ and $\mathbf{K} = \mathbf{C}$. In other words, the similarity measure $\kappa(v_n, v_{n'})$ embodied in such a kernel map is just the covariance $\mathrm{cov}[f(v_n), f(v_{n'})]$. A related observation was pointed out in [27] for general kernel methods.

In short, one can interpret kernel ridge regression as the LMMSE estimator of a signal f with covariance matrix equal to K; see also [28]. The LMMSE interpretation also suggests the usage of C as a kernel matrix, which enables signal reconstruction even when the graph topology is unknown. Although this discussion hinges on kernel ridge regression after setting K=CImage, any other kernel estimator of the form (8.7) can benefit from vertex covariance kernels too.
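The equivalence between (8.14) with $\mathbf{K} = \mathbf{C}$, $\mu S = \sigma_e^2$ and the LMMSE estimator (8.18) can be verified numerically in a few lines; the sketch below draws $\mathbf{f}$ from a zero-mean Gaussian with an arbitrary covariance purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, S, sigma_e = 8, 3, 0.2
H = rng.standard_normal((N, N))
C = H @ H.T + np.eye(N)                                 # a valid covariance (used as kernel)
Smat = np.eye(N)[rng.choice(N, S, replace=False)]       # random S x N sampling matrix
f = rng.multivariate_normal(np.zeros(N), C)
y = Smat @ f + sigma_e * rng.standard_normal(S)

# (8.18): LMMSE estimate.
f_lmmse = C @ Smat.T @ np.linalg.solve(Smat @ C @ Smat.T + sigma_e**2 * np.eye(S), y)
# (8.14) with K = C and mu chosen so that mu * S = sigma_e^2.
mu = sigma_e**2 / S
f_rr = C @ Smat.T @ np.linalg.solve(Smat @ C @ Smat.T + mu * S * np.eye(S), y)
assert np.allclose(f_lmmse, f_rr)
```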

8.2.3 Selecting Kernels From a Dictionary

The selection of the pertinent kernel matrix is of paramount importance to the performance of kernel-based methods [23,29]. This section presents an MKL approach that effects kernel selection in graph signal reconstruction. Two algorithms with complementary strengths will be presented. Both rely on a user-specified kernel dictionary, and the best kernel is built from the dictionary in a data-driven way.

The first algorithm, which we call RKHS superposition, is motivated by the fact that a specific $\mathcal{H}$ in (8.6) is determined by some κ; therefore, kernel selection is tantamount to RKHS selection. Consequently, a kernel dictionary $\{\kappa_m\}_{m=1}^M$ gives rise to an RKHS dictionary $\{\mathcal{H}_m\}_{m=1}^M$, which motivates estimates of the form³

$$\hat{f} = \sum_{m=1}^M \hat{f}_m, \qquad \hat{f}_m \in \mathcal{H}_m. \tag{8.19}$$

Upon adopting a criterion that promotes sparsity in this expansion, the "best" RKHSs will be selected. A reasonable approach is therefore to generalize (8.6) to accommodate multiple RKHSs. With $\mathcal{L}$ selected to be the square loss and $\Omega(\zeta) = |\zeta|$, one can pursue an estimate $\hat{f}$ by solving

$$\min_{\{f_m \in \mathcal{H}_m\}_{m=1}^M} \frac{1}{S}\sum_{s=1}^S \Big[y_s - \sum_{m=1}^M f_m(v_{n_s})\Big]^2 + \mu \sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}. \tag{8.20}$$

Invoking the representer theorem per $f_m$ establishes that the minimizers of (8.20) can be written as

$$\hat{f}_m(v) = \sum_{s=1}^S \bar{\alpha}_{ms}\, \kappa_m(v, v_{n_s}), \qquad m = 1, \ldots, M \tag{8.21}$$

for some coefficients $\bar{\alpha}_{ms}$. Substituting (8.21) into (8.20) suggests obtaining these coefficients as

$$\arg\min_{\{\bar{\boldsymbol{\alpha}}_m\}_{m=1}^M} \frac{1}{S}\Big\|\mathbf{y} - \sum_{m=1}^M \bar{\mathbf{K}}_m \bar{\boldsymbol{\alpha}}_m\Big\|^2 + \mu \sum_{m=1}^M \big(\bar{\boldsymbol{\alpha}}_m^T \bar{\mathbf{K}}_m \bar{\boldsymbol{\alpha}}_m\big)^{1/2}, \tag{8.22}$$

where $\bar{\boldsymbol{\alpha}}_m := [\bar{\alpha}_{m1}, \ldots, \bar{\alpha}_{mS}]^T$ and $\bar{\mathbf{K}}_m := \mathbf{S}\mathbf{K}_m\mathbf{S}^T$ with $(\mathbf{K}_m)_{n,n'} := \kappa_m(v_n, v_{n'})$. Letting $\check{\bar{\boldsymbol{\alpha}}}_m := \bar{\mathbf{K}}_m^{1/2}\bar{\boldsymbol{\alpha}}_m$, Eq. (8.22) becomes

$$\arg\min_{\{\check{\bar{\boldsymbol{\alpha}}}_m\}_{m=1}^M} \frac{1}{S}\Big\|\mathbf{y} - \sum_{m=1}^M \bar{\mathbf{K}}_m^{1/2}\check{\bar{\boldsymbol{\alpha}}}_m\Big\|^2 + \mu \sum_{m=1}^M \|\check{\bar{\boldsymbol{\alpha}}}_m\|_2. \tag{8.23}$$

Interestingly, (8.23) can be efficiently solved using the alternating-direction method of multipliers (ADMM) [30,31] after some necessary reformulation [23].

After obtaining $\{\check{\bar{\boldsymbol{\alpha}}}_m\}_{m=1}^M$, the sought function estimate can be recovered as

$$\hat{\mathbf{f}} = \sum_{m=1}^M \mathbf{K}_m\mathbf{S}^T\bar{\boldsymbol{\alpha}}_m = \sum_{m=1}^M \mathbf{K}_m\mathbf{S}^T\bar{\mathbf{K}}_m^{-1/2}\check{\bar{\boldsymbol{\alpha}}}_m. \tag{8.24}$$

This MKL algorithm can identify the best subset of RKHSs—and therefore kernels—but entails $MS$ unknowns (cf. (8.22)). Next, an alternative approach is discussed that reduces the number of variables to $M + S$, at the price of not being able to guarantee a sparse kernel expansion.

The alternative approach is to postulate a kernel of the form $\mathbf{K}(\boldsymbol{\theta}) = \sum_{m=1}^M \theta_m \mathbf{K}_m$, where $\{\mathbf{K}_m\}_{m=1}^M$ is given and $\theta_m \geq 0\ \forall m$. The coefficients $\boldsymbol{\theta} := [\theta_1, \ldots, \theta_M]^T$ can be found by jointly minimizing (8.12) with respect to $\boldsymbol{\theta}$ and $\bar{\boldsymbol{\alpha}}$ [32]. We have

$$(\hat{\boldsymbol{\theta}}, \hat{\bar{\boldsymbol{\alpha}}}) := \arg\min_{\boldsymbol{\theta}, \bar{\boldsymbol{\alpha}}} \mathcal{L}\big(\mathbf{y}, \bar{\mathbf{K}}(\boldsymbol{\theta})\bar{\boldsymbol{\alpha}}\big) + \mu\, \Omega\big((\bar{\boldsymbol{\alpha}}^T\bar{\mathbf{K}}(\boldsymbol{\theta})\bar{\boldsymbol{\alpha}})^{1/2}\big), \tag{8.25}$$

where $\bar{\mathbf{K}}(\boldsymbol{\theta}) := \mathbf{S}\mathbf{K}(\boldsymbol{\theta})\mathbf{S}^T$. Except for degenerate cases, problem (8.25) is not jointly convex in $\boldsymbol{\theta}$ and $\bar{\boldsymbol{\alpha}}$, but it is separately convex in each vector for a convex $\mathcal{L}$ [32]. Iterative algorithms for solving (8.23) and (8.25) are available in [23].
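A minimal sketch of one such iterative scheme for the kernel-superposition formulation (8.25) follows, assuming the square loss and $\Omega(\zeta) = \zeta^2$: it alternates between the closed-form ridge solution for $\bar{\boldsymbol{\alpha}}$ (cf. (8.13b)) and projected gradient steps on $\boldsymbol{\theta} \geq \mathbf{0}$. Step sizes and iteration counts are arbitrary illustrative choices, not the algorithm of [23].

```python
import numpy as np

def mkl_superposition(Kbars, y, mu, n_outer=20, n_pgd=50, step=1e-3):
    """Alternating minimization of (8.25) with K(theta) = sum_m theta_m K_m (square loss)."""
    M, S = len(Kbars), len(y)
    theta = np.ones(M) / M                                        # start from uniform weights
    for _ in range(n_outer):
        Kbar = sum(t * Kb for t, Kb in zip(theta, Kbars))
        alpha = np.linalg.solve(Kbar + mu * S * np.eye(S), y)     # ridge step for fixed theta
        for _ in range(n_pgd):                                    # projected gradient on theta >= 0
            resid = y - sum(t * Kb for t, Kb in zip(theta, Kbars)) @ alpha
            grad = np.array([-2.0 / S * resid @ (Kb @ alpha) + mu * alpha @ Kb @ alpha
                             for Kb in Kbars])
            theta = np.maximum(theta - step * grad, 0.0)
    return theta, alpha
```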

8.2.4 Semiparametric Reconstruction

The approaches discussed so far are applicable to various problems but they are certainly limited by the modeling assumptions they make. In particular, the performance of algorithms belonging to the parametric family [11,15,33] is restricted by how well the signals actually adhere to the selected model. Nonparametric models, on the other hand [2,3,6,34], offer flexibility and robustness but they cannot readily incorporate information available a priori.

In practice however, it is not uncommon that neither of these approaches alone suffices for reliable inference. Consider, for instance, an employment-oriented social network such as LinkedIn, and suppose the goal is to estimate the salaries of all users given information about the salaries of a few. Clearly, besides network connections, exploiting available information regarding the users' education level and work experience could benefit the reconstruction task. The same is true in problems arising in link analysis, where the exploitation of hierarchical web structure can aid the task of estimating the importance of web pages [35]. In recommender systems, inferring preference scores for every item, given the users' feedback about particular items, could be cast as a signal reconstruction problem over the item correlation graph. Data sparsity imposes severe limitations on the quality of pure collaborative filtering methods [64]. Exploiting side information about the items is known to alleviate such limitations [36], leading to considerably improved recommendation performance [37,38].

A promising direction to endow nonparametric methods with prior information relies on a semiparametric approach whereby the signal of interest is modeled as the superposition of a parametric and a nonparametric component [39]. While the former leverages side information, the latter accounts for deviations from the parametric part and can also promote smoothness using kernels on graphs. In this section we outline two simple and reliable semiparametric estimators with complementary strengths, as detailed in [39].

8.2.4.1 Semiparametric Inference

Function $f$ is modeled as the superposition⁴

$$\mathbf{f} = \mathbf{f}_P + \mathbf{f}_{NP}, \tag{8.26}$$

where $\mathbf{f}_P := [f_P(v_1), \ldots, f_P(v_N)]^T$ and $\mathbf{f}_{NP} := [f_{NP}(v_1), \ldots, f_{NP}(v_N)]^T$.

The parametric term $f_P(v) := \sum_{m=1}^M \beta_m b_m(v)$ captures the known signal structure via the basis $\mathcal{B} := \{b_m\}_{m=1}^M$, while the nonparametric term $f_{NP}$ belongs to an RKHS $\mathcal{H}$ and accounts for deviations from the span of $\mathcal{B}$. The goal of this section is to provide an efficient and reliable estimate of $\mathbf{f}$ given $\mathbf{y}$, $\mathbf{S}$, $\mathcal{B}$, $\mathcal{H}$ and $\mathcal{G}$.

Since $f_{NP} \in \mathcal{H}$, the vector $\mathbf{f}_{NP}$ can be represented as in (8.3). By defining $\boldsymbol{\beta} := [\beta_1, \ldots, \beta_M]^T$ and the $N \times M$ matrix $\mathbf{B}$ with entries $(\mathbf{B})_{n,m} := b_m(v_n)$, the parametric term can be written in vector form as $\mathbf{f}_P := \mathbf{B}\boldsymbol{\beta}$. The semiparametric estimates can be found as the solution of the following optimization problem:

$$\{\hat{\boldsymbol{\alpha}}, \hat{\boldsymbol{\beta}}\} = \arg\min_{\boldsymbol{\alpha}, \boldsymbol{\beta}} \frac{1}{S}\sum_{s=1}^S \mathcal{L}\big(y_s, f(v_{n_s})\big) + \mu\|f_{NP}\|_{\mathcal{H}}^2 \quad \text{s.t.} \quad \mathbf{f} = \mathbf{B}\boldsymbol{\beta} + \mathbf{K}\boldsymbol{\alpha}, \tag{8.27}$$

where the fitting loss $\mathcal{L}$ quantifies the deviation of $f$ from the data and $\mu > 0$ is the regularization scalar that controls overfitting of the nonparametric term. Using (8.27), the semiparametric estimate is expressed as $\hat{\mathbf{f}} = \mathbf{B}\hat{\boldsymbol{\beta}} + \mathbf{K}\hat{\boldsymbol{\alpha}}$.

Solving (8.27) entails minimization over $N + M$ variables. Clearly, when dealing with large-scale graphs this could lead to prohibitively large computational costs. To reduce complexity, the semiparametric version of the representer theorem [18,19] is employed, which establishes that

$$\hat{\mathbf{f}} = \mathbf{B}\hat{\boldsymbol{\beta}} + \mathbf{K}\mathbf{S}^T\hat{\bar{\boldsymbol{\alpha}}}, \tag{8.28}$$

where $\hat{\bar{\boldsymbol{\alpha}}} := [\hat{\bar{\alpha}}_1, \ldots, \hat{\bar{\alpha}}_S]^T$. The estimates $\hat{\bar{\boldsymbol{\alpha}}}, \hat{\boldsymbol{\beta}}$ are found as

$$\{\hat{\bar{\boldsymbol{\alpha}}}, \hat{\boldsymbol{\beta}}\} = \arg\min_{\bar{\boldsymbol{\alpha}}, \boldsymbol{\beta}} \frac{1}{S}\sum_{s=1}^S \mathcal{L}\big(y_s, f(v_{n_s})\big) + \mu\|f_{NP}\|_{\mathcal{H}}^2 \quad \text{s.t.} \quad \mathbf{f} = \mathbf{B}\boldsymbol{\beta} + \mathbf{K}\mathbf{S}^T\bar{\boldsymbol{\alpha}}, \tag{8.29}$$

where $\bar{\boldsymbol{\alpha}} := [\bar{\alpha}_1, \ldots, \bar{\alpha}_S]^T$. The RKHS norm in (8.29) is expressed as $\|f_{NP}\|_{\mathcal{H}}^2 = \bar{\boldsymbol{\alpha}}^T\bar{\mathbf{K}}\bar{\boldsymbol{\alpha}}$, with $\bar{\mathbf{K}} := \mathbf{S}\mathbf{K}\mathbf{S}^T$. Relative to (8.27), the number of optimization variables in (8.29) is reduced to the more affordable $S + M$, with $S \ll N$.

Next, two loss functions with complementary benefits will be considered: the square loss and the ϵ-insensitive loss. The square loss function is

$$\mathcal{L}\big(y_s, f(v_{n_s})\big) := \big[y_s - f(v_{n_s})\big]^2 \tag{8.30}$$

and (8.29) admits the following closed-form solution:

$$\hat{\bar{\boldsymbol{\alpha}}} = (\mathbf{P}\bar{\mathbf{K}} + \mu\mathbf{I}_S)^{-1}\mathbf{P}\mathbf{y}, \tag{8.31a}$$

$$\hat{\boldsymbol{\beta}} = (\bar{\mathbf{B}}^T\bar{\mathbf{B}})^{-1}\bar{\mathbf{B}}^T(\mathbf{y} - \bar{\mathbf{K}}\hat{\bar{\boldsymbol{\alpha}}}), \tag{8.31b}$$

where $\bar{\mathbf{B}} := \mathbf{S}\mathbf{B}$ and $\mathbf{P} := \mathbf{I}_S - \bar{\mathbf{B}}(\bar{\mathbf{B}}^T\bar{\mathbf{B}})^{-1}\bar{\mathbf{B}}^T$. The complexity of (8.31) is $\mathcal{O}(S^3 + M^3)$.
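The closed-form estimates (8.31) translate directly into a few lines of NumPy, sketched below under the assumption that $\bar{\mathbf{B}}^T\bar{\mathbf{B}}$ and $\mathbf{P}\bar{\mathbf{K}} + \mu\mathbf{I}_S$ are invertible.

```python
import numpy as np

def semiparametric_estimate(K, B, Smat, y, mu):
    """Square-loss semiparametric estimator, cf. (8.28) and (8.31)."""
    S = Smat.shape[0]
    Kbar, Bbar = Smat @ K @ Smat.T, Smat @ B
    P = np.eye(S) - Bbar @ np.linalg.solve(Bbar.T @ Bbar, Bbar.T)        # projector onto span(Bbar)'s complement
    alpha_bar = np.linalg.solve(P @ Kbar + mu * np.eye(S), P @ y)        # (8.31a)
    beta = np.linalg.solve(Bbar.T @ Bbar, Bbar.T @ (y - Kbar @ alpha_bar))  # (8.31b)
    return B @ beta + K @ Smat.T @ alpha_bar                             # f_hat = B beta + K S^T alpha_bar
```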

The ϵ-insensitive loss function is given by

$$\mathcal{L}\big(y_s, f(v_{n_s})\big) := \max\big(0, |y_s - f(v_{n_s})| - \epsilon\big), \tag{8.32}$$

where ϵ is tuned, e.g., via cross-validation, to minimize the generalization error; this loss has well-documented merits in signal estimation from quantized data [40]. Substituting (8.32) into (8.29) yields a convex nonsmooth quadratic problem that can be solved efficiently for $\bar{\boldsymbol{\alpha}}$ and $\boldsymbol{\beta}$ using, e.g., interior-point methods [18].

8.2.5 Numerical Tests

This section reports on the signal reconstruction performance of different methods using real as well as synthetic data. The performance of the estimators is assessed via Monte Carlo simulation by comparing the normalized mean square error (NMSE)

$$\mathrm{NMSE} = \mathbb{E}\left[\frac{\|\hat{\mathbf{f}} - \mathbf{f}\|^2}{\|\mathbf{f}\|^2}\right]. \tag{8.33}$$
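In the experiments, the expectation in (8.33) is approximated by a sample average over Monte Carlo runs; a minimal helper is sketched here.

```python
import numpy as np

def nmse(f_hat_runs, f_runs):
    """Sample-average estimate of NMSE = E[ ||f_hat - f||^2 / ||f||^2 ], cf. (8.33)."""
    ratios = [np.sum((fh - f) ** 2) / np.sum(f ** 2)
              for fh, f in zip(f_hat_runs, f_runs)]
    return np.mean(ratios)
```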

Multikernel reconstruction. The first data set contains departure and arrival information for flights among U.S. airports [41], from which $3 \times 10^6$ flights in the months of July, August and September of 2014 and 2015 were selected. We construct a graph with $N = 50$ vertices corresponding to the airports with the highest traffic, and we connect two airports with an edge whenever the number of flights between them exceeds 100 within the observation window.

A signal was constructed per day by averaging the arrival delay of all inbound flights per selected airport. A total of 184 signals were considered, of which the first 154 were used for training (July, August, September 2014 and July, August 2015) and the remaining 30 for testing (September 2015). The weights of the edges between airports were learned from the training data based on the technique described in [23].

Table 8.2 lists the NMSE and the RMSE in minutes for the task of predicting the arrival delay at 40 airports when the delay at a randomly selected collection of 10 airports is observed. The second row corresponds to the ridge regression estimator that uses the nearly optimal estimated covariance kernel. The next two rows correspond to the multikernel approaches in §8.2.3 with a dictionary of 30 diffusion kernels with values of $\sigma^2$ uniformly spaced between 0.1 and 7. The remaining rows pertain to graph BL estimators. Table 8.2 demonstrates the reliable performance of covariance kernels as well as of the discussed multikernel approaches relative to competing alternatives.

Table 8.2

Multikernel reconstruction

| | NMSE | RMSE [min] |
| --- | --- | --- |
| KRR with cov. kernel | 0.34 | 3.95 |
| Multikernel, RS | 0.44 | 4.51 |
| Multikernel, KS | 0.43 | 4.45 |
| BL for B = 2 | 1.55 | 8.45 |
| BL for B = 3 | 32.64 | 38.72 |
| BL, cut-off | 3.97 | 13.5 |

Semiparametric reconstruction. An Erdős–Rényi graph with probability of edge presence 0.6 and $N = 200$ nodes was generated, and $\mathbf{f}$ was formed by superimposing a BL signal [13,15] and a piece-wise constant signal [42]; that is,

$$\mathbf{f} = \sum_{i=1}^{10} \gamma_i \mathbf{u}_i + \sum_{i=1}^{6} \delta_i \mathbf{1}_{\mathcal{V}_i}, \tag{8.34}$$

where $\{\gamma_i\}_{i=1}^{10}$ and $\{\delta_i\}_{i=1}^{6}$ are standardized Gaussian random variables, $\{\mathbf{u}_i\}_{i=1}^{10}$ are the eigenvectors associated with the 10 smallest eigenvalues of the Laplacian matrix, $\{\mathcal{V}_i\}_{i=1}^{6}$ are the vertex sets of six clusters obtained via spectral clustering [43] and $\mathbf{1}_{\mathcal{V}_i}$ is the indicator vector with entries $(\mathbf{1}_{\mathcal{V}_i})_n := 1$ if $v_n \in \mathcal{V}_i$, and 0 otherwise. The parametric basis $\mathcal{B} = \{\mathbf{1}_{\mathcal{V}_i}\}_{i=1}^{6}$ was used by the estimators capturing the prior knowledge, and $S$ vertices were sampled uniformly at random. The subsequent experiments evaluate the performance of the semiparametric graph kernel estimators, SP-GK and SP-GK(ϵ), obtained by using (8.30) and (8.32) in (8.29), respectively; the parametric (P) estimator that considers only the parametric term in (8.26); the nonparametric (NP) estimator [2,3] that considers only the nonparametric term in (8.26); and the graph BL estimators from [13,15], which assume a BL model with bandwidth B. For all the experiments, the diffusion kernel (cf. Table 8.1) with parameter σ is employed. First, white Gaussian noise $e_s$ of variance $\sigma_e^2$ is added to each sample $f_s$ to yield the signal-to-noise ratio $\mathrm{SNR}_e := \|\mathbf{f}\|_2^2/(N\sigma_e^2)$. Fig. 8.1B presents the NMSE of the different methods. As expected, the limited flexibility of the parametric approaches, BL and P, affects their ability to capture the true signal structure. The NP estimator achieves smaller NMSE, but only when the number of available samples is adequate. Both semiparametric estimators were found to outperform the other approaches, exhibiting reliable reconstruction even with few samples.

Figure 8.1 NMSE of the synthetic signal estimates. (A) μ = 5 × 10⁻⁴, σ = 5 × 10⁻⁴, SNRe = 5 dB. (B) μ = 5 × 10⁻⁴, σ = 5 × 10⁻⁴, ϵ = 10⁻⁴ and SNRo = −5 dB.

To illustrate the benefits of employing the different loss functions (8.30) and (8.32), we compare the performance of SP-GK and SP-GK(ϵ) in the presence of outlying noise. Each sample $f_s$ is contaminated with Gaussian noise $o_s$ of large variance $\sigma_o^2$ with probability $p = 0.1$. Fig. 8.1A demonstrates the robustness of SP-GK(ϵ), which is attributed to the ϵ-insensitive loss function (8.32). Further experiments using real signals can be found in [39].

8.3 Inference of Dynamic Functions Over Dynamic Graphs

Networks that exhibit time varying connectivity patterns with time varying node attributes arise in a plethora of network science–related applications. Sometimes these dynamic network topologies switch between a finite number of discrete states, governed by sudden changes of the underlying dynamics [44,45]. A challenging problem that arises in this setting is that of reconstructing time evolving functions on graphs, given their values on a subset of vertices and time instants. Efficiently exploiting spatiotemporal dynamics can markedly impact sampling costs by reducing the number of vertices that need to be observed to attain a target performance. Such a reduction can be of paramount importance in certain applications, e.g., in monitoring time-dependent activity of different regions of the brain through invasive electrocorticography (ECoG), where observing a vertex requires the implantation of an intracranial electrode [44].

Although one could reconstruct a time varying function per time slot using the non- or semiparametric methods of §8.2, leveraging time correlations typically yields estimators with improved performance. Schemes tailored for time evolving functions on graphs include [46] and [47], which predict the function values at time $t$ given observations up to time $t - 1$. However, these schemes assume that the function of interest adheres to a specific vector autoregressive model. Other works target time-invariant functions and can only afford to track sufficiently slow variations. This is the case with the dictionary learning approach in [48] and the distributed algorithms in [49] and [50]. Unfortunately, the flexibility of these algorithms to capture spatial information is also limited, since [48] focuses on Laplacian regularization, whereas [49] and [50] require the signal to be BL.

Motivated by the aforementioned limitations, in what comes next we extend the framework presented in §8.2 to accommodate time varying function reconstruction over dynamic graphs. But before we delve into the time varying setting, a few definitions are in order.

Definitions: A time varying graph is a tuple $\mathcal{G}(t) := (\mathcal{V}, \mathbf{A}_t)$, where $\mathcal{V} := \{v_1, \ldots, v_N\}$ is the vertex set and $\mathbf{A}_t \in \mathbb{R}^{N \times N}$ is the adjacency matrix at time $t$, whose $(n, n')$th entry $A_{n,n'}(t)$ assigns a weight to the pair of vertices $(v_n, v_{n'})$ at time $t$. A time-invariant graph is a special case with $\mathbf{A}_t = \mathbf{A}_{t'}\ \forall t, t'$. Adopting common assumptions made in the related literature (e.g., [1, Chapter 2], [4,9]), we also require $\mathcal{G}(t)$ (i) to have nonnegative weights ($A_{n,n'}(t) \geq 0\ \forall t$ and $n \neq n'$), (ii) to have no self-edges ($A_{n,n}(t) = 0\ \forall n, t$) and (iii) to be undirected ($A_{n,n'}(t) = A_{n',n}(t)\ \forall n, n', t$).

A time varying function or signal on a graph is a map $f : \mathcal{V} \times \mathcal{T} \to \mathbb{R}$, where $\mathcal{T} := \{1, 2, \ldots\}$ is the set of time indices. The value $f(v_n, t)$ of $f$ at vertex $v_n$ and time $t$ can be thought of as the value of an attribute of $v_n \in \mathcal{V}$ at time $t$. The values of $f$ at time $t$ are collected in $\mathbf{f}_t := [f(v_1, t), \ldots, f(v_N, t)]^T$.

At time $t$, vertices with indices in the time-dependent set $\mathcal{S}_t := \{n_1(t), \ldots, n_{S(t)}(t)\}$, $1 \leq n_1(t) < \cdots < n_{S(t)}(t) \leq N$, are observed. The resulting samples can be expressed as $y_s(t) = f(v_{n_s(t)}, t) + e_s(t)$, $s = 1, \ldots, S(t)$, where $e_s(t)$ models the observation error. Letting $\mathbf{y}_t := [y_1(t), \ldots, y_{S(t)}(t)]^T$, the observations can be conveniently expressed as

$$\mathbf{y}_t = \mathbf{S}_t\mathbf{f}_t + \mathbf{e}_t, \qquad t = 1, 2, \ldots, \tag{8.35}$$

where $\mathbf{e}_t := [e_1(t), \ldots, e_{S(t)}(t)]^T$ and the $S(t) \times N$ sampling matrix $\mathbf{S}_t$ contains ones at positions $(s, n_s(t))$, $s = 1, \ldots, S(t)$, and zeros elsewhere.

The broad goal of this section is to "reconstruct" $f$ from the observations $\{\mathbf{y}_t\}_t$ in (8.35). Two formulations will be considered.

Batch formulation. In the batch reconstruction problem, one aims at finding $\{\mathbf{f}_t\}_{t=1}^T$ given $\{\mathcal{G}(t)\}_{t=1}^T$, the sample locations $\{\mathcal{S}_t\}_{t=1}^T$ and all observations $\{\mathbf{y}_t\}_{t=1}^T$.

Online formulation. At every time $t$, one is given $\mathcal{G}(t)$ together with $\mathcal{S}_t$ and $\mathbf{y}_t$, and the goal is to find $\mathbf{f}_t$. The latter can be obtained possibly based on a previous estimate of $\mathbf{f}_{t-1}$, but the complexity per time slot $t$ must be independent of $t$.

To solve these problems, we will rely on the assumption that f evolves smoothly over space and time, yet more structured dynamics can be incorporated if known.

8.3.1 Kernels on Extended Graphs

This section extends the kernel-based learning framework of §8.2 to subsume time evolving functions over possibly dynamic graphs through the notion of graph extension, by which the time dimension receives the same treatment as the spatial dimension. The versatility of kernel-based methods to leverage spatial information [23] is thereby inherited and broadened to account for temporal dynamics as well. This vantage point also accommodates time varying sampling sets and topologies.

8.3.1.1 Extended Graphs

An immediate approach to reconstructing time evolving functions is to apply (8.9) separately for each $t = 1, \ldots, T$. This yields the instantaneous estimator (IE)

$$\hat{\mathbf{f}}_t^{(\mathrm{IE})} := \arg\min_{\mathbf{f}} \frac{1}{S(t)}\|\mathbf{y}_t - \mathbf{S}_t\mathbf{f}\|_2^2 + \mu\, \mathbf{f}^T\mathbf{K}_t^{\dagger}\mathbf{f}. \tag{8.36}$$

Unfortunately, this estimator does not account for the possible relation between, e.g., $f(v_n, t)$ and $f(v_n, t-1)$. If, for instance, $f$ varies slowly over time, an estimate of $f(v_n, t)$ may as well benefit from leveraging observations $y_s(\tau)$ at time instants $\tau \neq t$. Exploiting temporal dynamics potentially reduces the number of vertices that have to be sampled to attain a target reconstruction performance, which in turn can markedly reduce sampling costs.

Incorporating temporal dynamics into kernel-based reconstruction, which so far handles a single snapshot (cf. §8.2), necessitates reformulating time evolving function reconstruction as the reconstruction of a time-invariant function. An appealing possibility is to replace $\mathcal{G}$ with its extended version $\tilde{\mathcal{G}} := (\tilde{\mathcal{V}}, \tilde{\mathbf{A}})$, where each vertex in $\mathcal{V}$ is replicated $T$ times to yield the extended vertex set $\tilde{\mathcal{V}} := \{v_n(t),\ n = 1, \ldots, N,\ t = 1, \ldots, T\}$, and the $(n + N(t-1), n' + N(t'-1))$th entry of the $TN \times TN$ extended adjacency matrix $\tilde{\mathbf{A}}$ equals the weight of the edge $(v_n(t), v_{n'}(t'))$. The time varying function $f$ can thus be replaced with its extended time-invariant counterpart $\tilde{f} : \tilde{\mathcal{V}} \to \mathbb{R}$ with $\tilde{f}(v_n(t)) = f(v_n, t)$.

Definition 1

Let $\mathcal{V} := \{v_1, \ldots, v_N\}$ denote a vertex set and let $\mathcal{G} := (\mathcal{V}, \{\mathbf{A}_t\}_{t=1}^T)$ be a time varying graph. A graph $\tilde{\mathcal{G}}$ with vertex set $\tilde{\mathcal{V}} := \{v_n(t),\ n = 1, \ldots, N,\ t = 1, \ldots, T\}$ and $NT \times NT$ adjacency matrix $\tilde{\mathbf{A}}$ is an extended graph of $\mathcal{G}$ if the $t$th $N \times N$ diagonal block of $\tilde{\mathbf{A}}$ equals $\mathbf{A}_t$.

In general, the diagonal blocks $\{\mathbf{A}_t\}_{t=1}^T$ do not provide a full description of the underlying extended graph. Indeed, one also needs to specify the off-diagonal blocks of $\tilde{\mathbf{A}}$ to capture the spatiotemporal dynamics of $f$.

As an example, consider an extended graph with

$$\tilde{\mathbf{A}} = \mathrm{btridiag}\{\mathbf{A}_1, \ldots, \mathbf{A}_T; \mathbf{B}_2^{(T)}, \ldots, \mathbf{B}_T^{(T)}\}, \tag{8.37}$$

where $\mathbf{B}_t^{(T)} \in \mathbb{R}_+^{N \times N}$ connects $\{v_n(t-1)\}_{n=1}^N$ to $\{v_n(t)\}_{n=1}^N$, $t = 2, \ldots, T$, and $\mathrm{btridiag}\{\mathbf{A}_1, \ldots, \mathbf{A}_T; \mathbf{B}_2, \ldots, \mathbf{B}_T\}$ represents the symmetric block tridiagonal matrix

$$\tilde{\mathbf{A}} = \begin{bmatrix} \mathbf{A}_1 & \mathbf{B}_2^T & \mathbf{0} & \cdots & \mathbf{0} & \mathbf{0} \\ \mathbf{B}_2 & \mathbf{A}_2 & \mathbf{B}_3^T & \cdots & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{B}_3 & \mathbf{A}_3 & \cdots & \mathbf{0} & \mathbf{0} \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ \mathbf{0} & \mathbf{0} & \mathbf{0} & \cdots & \mathbf{A}_{T-1} & \mathbf{B}_T^T \\ \mathbf{0} & \mathbf{0} & \mathbf{0} & \cdots & \mathbf{B}_T & \mathbf{A}_T \end{bmatrix}.$$

For instance, each vertex can be connected to its neighbors at the previous time instant by setting $\mathbf{B}_t^{(T)} = \mathbf{A}_{t-1}$, or it can be connected to its replicas at adjacent time instants by setting $\mathbf{B}_t^{(T)}$ to be diagonal. Such a graph is depicted in Fig. 8.2.
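For illustration, a minimal sketch of the extended adjacency matrix in (8.37) for the second choice above, $\mathbf{B}_t^{(T)} = b\,\mathbf{I}_N$ (replicas connected across adjacent time slots), follows; the connection weight $b$ is an illustrative parameter.

```python
import numpy as np

def extended_adjacency(A_list, b=1.0):
    """btridiag{A_1,...,A_T; b*I,...,b*I}: extended-graph adjacency, cf. (8.37)."""
    T, N = len(A_list), A_list[0].shape[0]
    A_ext = np.zeros((T * N, T * N))
    for t, At in enumerate(A_list):
        A_ext[t*N:(t+1)*N, t*N:(t+1)*N] = At              # diagonal blocks: per-slot topologies
    for t in range(1, T):
        A_ext[t*N:(t+1)*N, (t-1)*N:t*N] = b * np.eye(N)   # connect each vertex to its replica...
        A_ext[(t-1)*N:t*N, t*N:(t+1)*N] = b * np.eye(N)   # ...at the adjacent time slot (symmetric)
    return A_ext
```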

Figure 8.2 (A) Original graph and (B) extended graph $\tilde{\mathcal{G}}$ for diagonal $\mathbf{B}_t^{(T)}$. Edges connecting vertices at the same time instant are represented by solid lines, whereas edges connecting vertices at different time instants are represented by dashed lines.

8.3.1.2 Batch and Online Reconstruction via Space–Time Kernels

The extended graph enables a generalization of the estimators in §8.2 for time evolving functions. The rest of this subsection discusses two such KRR estimators.

Consider first the batch formulation, where all the $\tilde{S} := \sum_{t=1}^T S(t)$ samples in $\tilde{\mathbf{y}} := [\mathbf{y}_1^T, \ldots, \mathbf{y}_T^T]^T$ are available, and the goal is to estimate $\tilde{\mathbf{f}} := [\mathbf{f}_1^T, \ldots, \mathbf{f}_T^T]^T$. Directly applying the KRR criterion in (8.9) to reconstruct $\tilde{f}$ on the extended graph $\tilde{\mathcal{G}}$ yields

$$\hat{\tilde{\mathbf{f}}} := \arg\min_{\tilde{\mathbf{f}}} \|\tilde{\mathbf{y}} - \tilde{\mathbf{S}}\tilde{\mathbf{f}}\|_{\mathbf{D}}^2 + \mu\, \tilde{\mathbf{f}}^T\tilde{\mathbf{K}}^{\dagger}\tilde{\mathbf{f}}, \tag{8.38a}$$

where $\tilde{\mathbf{K}}$ is now a $TN \times TN$ "space–time" kernel matrix, $\tilde{\mathbf{S}} := \mathrm{bdiag}\{\mathbf{S}_1, \ldots, \mathbf{S}_T\}$ and $\mathbf{D} := \mathrm{bdiag}\{S(1)\mathbf{I}_{S(1)}, \ldots, S(T)\mathbf{I}_{S(T)}\}$. If $\tilde{\mathbf{K}}$ is invertible, (8.38a) can be solved in closed form as

$$\hat{\tilde{\mathbf{f}}} = \tilde{\mathbf{K}}\tilde{\mathbf{S}}^T(\tilde{\mathbf{S}}\tilde{\mathbf{K}}\tilde{\mathbf{S}}^T + \mu\mathbf{D})^{-1}\tilde{\mathbf{y}}. \tag{8.38b}$$

The "space–time" kernel $\tilde{\mathbf{K}}$ captures complex spatiotemporal dynamics. If the topology is time-invariant, $\tilde{\mathbf{K}}$ can be specified over a two-dimensional plane of spatiotemporal frequencies, similarly to §8.2.2.⁵
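One simple way to obtain a space–time kernel that is consistent with this framework—though not the only one, and not the designs of [51]—is to apply a Laplacian kernel (e.g., the diffusion kernel of Table 8.1) to the Laplacian of the extended graph. The sketch below builds such a kernel and evaluates the batch estimate (8.38b); it reuses the hypothetical helpers laplacian_kernel, extended_adjacency and sampling_matrix from the earlier snippets.

```python
import numpy as np

def batch_space_time_krr(A_list, sample_idx_per_t, y_per_t, mu, sigma2=1.0, b=1.0):
    """Batch estimate (8.38b), with K_tilde a diffusion kernel on the extended graph."""
    T, N = len(A_list), A_list[0].shape[0]
    A_ext = extended_adjacency(A_list, b)
    L_ext = np.diag(A_ext.sum(axis=1)) - A_ext
    K_tilde = laplacian_kernel(L_ext, lambda lam: np.exp(sigma2 * lam / 2.0))
    # Block-diagonal sampling matrix S_tilde and weight matrix D (cf. (8.38a)).
    S_blocks = [sampling_matrix(N, idx) for idx in sample_idx_per_t]
    S_tilde = np.zeros((sum(len(i) for i in sample_idx_per_t), T * N))
    D_diag, row = [], 0
    for t, Sb in enumerate(S_blocks):
        S_tilde[row:row + Sb.shape[0], t*N:(t+1)*N] = Sb
        D_diag += [Sb.shape[0]] * Sb.shape[0]
        row += Sb.shape[0]
    D = np.diag(D_diag)
    y_tilde = np.concatenate(y_per_t)
    G = S_tilde @ K_tilde @ S_tilde.T + mu * D
    f_tilde = K_tilde @ S_tilde.T @ np.linalg.solve(G, y_tilde)     # (8.38b)
    return f_tilde.reshape(T, N)                                    # row t is the estimate of f_t
```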

In the online formulation, one aims to estimate $\mathbf{f}_t$ after the $\tilde{S}(t) := \sum_{\tau=1}^t S(\tau)$ samples in $\tilde{\mathbf{y}}_t := [\mathbf{y}_1^T, \ldots, \mathbf{y}_t^T]^T$ become available. Based on these samples, the KRR estimate of $\tilde{\mathbf{f}}$, denoted as $\hat{\tilde{\mathbf{f}}}_{1:T|t}$, is clearly

$$\hat{\tilde{\mathbf{f}}}_{1:T|t} := \arg\min_{\tilde{\mathbf{f}}} \|\tilde{\mathbf{y}}_t - \tilde{\mathbf{S}}_t\tilde{\mathbf{f}}\|_{\mathbf{D}_t}^2 + \mu\, \tilde{\mathbf{f}}^T\tilde{\mathbf{K}}^{-1}\tilde{\mathbf{f}} \tag{8.39a}$$

$$= \tilde{\mathbf{K}}\tilde{\mathbf{S}}_t^T(\tilde{\mathbf{S}}_t\tilde{\mathbf{K}}\tilde{\mathbf{S}}_t^T + \mu\mathbf{D}_t)^{-1}\tilde{\mathbf{y}}_t, \tag{8.39b}$$

where $\tilde{\mathbf{K}}$ is assumed invertible for simplicity, $\mathbf{D}_t := \mathrm{bdiag}\{S(1)\mathbf{I}_{S(1)}, \ldots, S(t)\mathbf{I}_{S(t)}\}$ and $\tilde{\mathbf{S}}_t := [\mathrm{bdiag}\{\mathbf{S}_1, \ldots, \mathbf{S}_t\}, \mathbf{0}_{\tilde{S}(t) \times (T-t)N}] \in \{0, 1\}^{\tilde{S}(t) \times TN}$.

The estimate in (8.39) comprises the per-slot estimates $\{\hat{\mathbf{f}}_{\tau|t}\}_{\tau=1}^T$; that is, $\hat{\tilde{\mathbf{f}}}_{1:T|t} := [\hat{\mathbf{f}}_{1|t}^T, \ldots, \hat{\mathbf{f}}_{T|t}^T]^T$ with $\hat{\mathbf{f}}_{\tau|t} := [\hat{f}_1(\tau|t), \ldots, \hat{f}_N(\tau|t)]^T$, where $\hat{\mathbf{f}}_{\tau|t}$ (respectively $\hat{f}_n(\tau|t)$) is the KRR estimate of $\mathbf{f}_\tau$ (of $f(v_n, \tau)$) given the observations up to time $t$. With this notation, it follows that for all $t, \tau$

$$\hat{\mathbf{f}}_{\tau|t} = (\mathbf{i}_{T,\tau}^T \otimes \mathbf{I}_N)\,\hat{\tilde{\mathbf{f}}}_{1:T|t}. \tag{8.40}$$

Regarding $t$ as the present, (8.39) therefore provides estimates of past, present and future values of $f$. The solution to the online problem comprises the sequence of present KRR estimates $\{\hat{\mathbf{f}}_{t|t}\}_{t=1}^T$. This sequence can be obtained by solving (8.39a) in closed form per $t$ as in (8.39b) and then applying (8.40). However, such an approach does not yield a desirable online algorithm, since its complexity per time slot is $\mathcal{O}(\tilde{S}^3(t))$ and therefore grows with $t$, whereas the online formulation requires the per-slot complexity to be independent of $t$. An algorithm that satisfies this requirement, yet provides the exact KRR estimate, is presented next for the case where the kernel matrix is any positive definite matrix $\tilde{\mathbf{K}}$ satisfying

$$\tilde{\mathbf{K}}^{-1} = \mathrm{btridiag}\{\mathbf{D}_1, \ldots, \mathbf{D}_T; \mathbf{C}_2, \ldots, \mathbf{C}_T\} \tag{8.41}$$

for some $N \times N$ matrices $\{\mathbf{D}_t\}_{t=1}^T$ and $\{\mathbf{C}_t\}_{t=2}^T$. Kernels in this important family are designed in [51].

If $\tilde{\mathbf{K}}$ is of the form (8.41), then the kernel Kalman filter (KKF) in Algorithm 2 returns the sequence $\{\hat{\mathbf{f}}_{t|t}\}_{t=1}^T$, where $\hat{\mathbf{f}}_{t|t}$ is given by (8.40). The $N \times N$ matrices $\{\mathbf{P}_\tau\}_{\tau=2}^T$ and $\{\boldsymbol{\Sigma}_\tau\}_{\tau=1}^T$ are obtained offline by Algorithm 1, and $\sigma_e^2(\tau) = \mu S(\tau)\ \forall\tau$.

Algorithm 1 Recursion to set the parameters of the KKF.

The KKF generalizes the probabilistic KF, since the latter is recovered upon setting $\tilde{\mathbf{K}}$ to be the covariance matrix of $\tilde{\mathbf{f}}$ in the probabilistic KF. The assumptions made by the probabilistic KF are stronger than those involved in the KKF. Specifically, in the probabilistic KF, $\mathbf{f}_t$ must adhere to a linear state-space model, $\mathbf{f}_t = \mathbf{P}_t\mathbf{f}_{t-1} + \boldsymbol{\eta}_t$, with known transition matrix $\mathbf{P}_t$, where the state noise $\boldsymbol{\eta}_t$ is uncorrelated over time and has known covariance matrix $\boldsymbol{\Sigma}_t$. Furthermore, the observation noise $\mathbf{e}_t$ must be uncorrelated over time and have a known covariance matrix. Correspondingly, the performance guarantees of the probabilistic KF are also stronger: the resulting estimate is optimal in the mean square error sense among all linear estimators. Furthermore, if $\boldsymbol{\eta}_t$ and $\mathbf{y}_t$ are jointly Gaussian for $t = 1, \ldots, T$, then the probabilistic KF estimate is optimal in the mean square error sense among all (not necessarily linear) estimators. In contrast, the requirements of the proposed KKF are much weaker, since they only specify that $f$ must evolve smoothly with respect to a given extended graph. As expected, the performance guarantees are similarly weaker; see, e.g., [18, Chapter 5]. However, since the KKF generalizes the probabilistic KF, the reconstruction performance of the former for judiciously selected $\tilde{\mathbf{K}}$ cannot be worse than that of the latter under any given criterion. The caveat, however, is that such a selection is not necessarily easy.
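For reference, the code below sketches one predict/correct step of a standard (probabilistic) Kalman filter for the linear state-space model $\mathbf{f}_t = \mathbf{P}_t\mathbf{f}_{t-1} + \boldsymbol{\eta}_t$ with observations (8.35). The actual KKF (Algorithm 2) has the same predict/correct structure but obtains the matrices $\mathbf{P}_t$ and $\boldsymbol{\Sigma}_t$ from the space–time kernel via Algorithm 1, which is not reproduced here; variable names are ours.

```python
import numpy as np

def kalman_step(f_prev, M_prev, P_t, Sigma_t, S_t, y_t, sigma_e2):
    """One predict/correct step of a standard Kalman filter (generic sketch, not Algorithm 2 verbatim)."""
    # Prediction based on the state model f_t = P_t f_{t-1} + eta_t.
    f_pred = P_t @ f_prev
    M_pred = P_t @ M_prev @ P_t.T + Sigma_t
    # Correction using the observations y_t = S_t f_t + e_t.
    G = S_t @ M_pred @ S_t.T + sigma_e2 * np.eye(S_t.shape[0])
    K_gain = M_pred @ S_t.T @ np.linalg.solve(G, np.eye(S_t.shape[0]))
    f_filt = f_pred + K_gain @ (y_t - S_t @ f_pred)
    M_filt = M_pred - K_gain @ S_t @ M_pred
    return f_filt, M_filt
```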

For the rigorous statement and proof relating the celebrated KF [52, Chapter 17] and the optimization problem in (8.39a), see [51]. Algorithm 2 requires $\mathcal{O}(N^3)$ operations per time slot, whereas the complexity of evaluating (8.39b) for the $t$th time slot is $\mathcal{O}(\tilde{S}^3(t))$, which increases with $t$ and eventually becomes prohibitive. Since distributed versions of the Kalman filter are well studied [53], decentralized KKF algorithms can be pursued to further reduce the computational complexity.

Algorithm 2 Kernel Kalman filter (KKF).

8.3.2 Multikernel Kriged Kalman Filters

This section applies the KRR framework presented in §8.2 to obtain online data-adaptive estimators of $\mathbf{f}_t$. Specifically, a spatiotemporal model is presented that judiciously captures the dynamics over space and time. Based on this model, the KRR criterion over time and space is formulated, and an online algorithm with affordable computational complexity is derived for the case where the kernels are preselected. To bypass the need for selecting an appropriate kernel, the section then discusses a data-adaptive multikernel learning extension of the KRR estimator that learns the optimal kernel "on-the-fly."

8.3.2.1 Spatiotemporal Models

Consider modeling the dynamics of $\mathbf{f}_t$ separately over time and space as $f(v_n, t) = f^{(\nu)}(v_n, t) + f^{(\chi)}(v_n, t)$, or in vector form

$$\mathbf{f}_t = \mathbf{f}_t^{(\nu)} + \mathbf{f}_t^{(\chi)}, \tag{8.42}$$

where $\mathbf{f}_t^{(\nu)} := [f^{(\nu)}(v_1, t), \ldots, f^{(\nu)}(v_N, t)]^T$ and $\mathbf{f}_t^{(\chi)} := [f^{(\chi)}(v_1, t), \ldots, f^{(\chi)}(v_N, t)]^T$. The first term $\{\mathbf{f}_t^{(\nu)}\}_t$ captures only spatial dependencies and can be thought of as an exogenous input to the graph that does not affect the evolution of the function in time.

The second term $\{\mathbf{f}_t^{(\chi)}\}_t$ accounts for spatiotemporal dynamics. A popular approach [54, Chapter 3] models $\mathbf{f}_t^{(\chi)}$ with the state equation

$$\mathbf{f}_t^{(\chi)} = \mathbf{A}_{t,t-1}\mathbf{f}_{t-1}^{(\chi)} + \boldsymbol{\eta}_t, \qquad t = 1, 2, \ldots, \tag{8.43}$$

where $\mathbf{A}_{t,t-1}$ is a generic transition matrix that can be chosen, e.g., as the $N \times N$ adjacency matrix of a possibly directed "transition graph," with $\mathbf{f}_0^{(\chi)} = \mathbf{0}$ and $\boldsymbol{\eta}_t \in \mathbb{R}^N$ capturing the state error. The state transition matrix $\mathbf{A}_{t,t-1}$ can be selected in accordance with the prior information available. Simplicity in estimation is a major advantage of the random walk model [55], where $\mathbf{A}_{t,t-1} = \alpha\mathbf{I}_N$ with $\alpha > 0$. On the other hand, adherence to the graph prompts the selection $\mathbf{A}_{t,t-1} = \alpha\mathbf{A}$, in which case (8.43) amounts to a diffusion process on the time-invariant graph $\mathcal{G}$. The recursion in (8.43) is a vector autoregressive model (VARM) of order one and offers flexibility in tracking multiple forms of temporal dynamics [54, Chapter 3]. The model in (8.43) captures the dependence between $f^{(\chi)}(v_n, t)$ and the time-lagged values $\{f^{(\chi)}(v_{n'}, t-1)\}_{n'=1}^N$.

Next, a model with increased flexibility is presented to account for instantaneous spatial dependencies as well. We have

$$\mathbf{f}_t^{(\chi)} = \mathbf{A}_{t,t}\mathbf{f}_t^{(\chi)} + \mathbf{A}_{t,t-1}\mathbf{f}_{t-1}^{(\chi)} + \boldsymbol{\eta}_t, \qquad t = 1, 2, \ldots, \tag{8.44}$$

where $\mathbf{A}_{t,t}$ encodes the instantaneous relation between $f^{(\chi)}(v_n, t)$ and $\{f^{(\chi)}(v_{n'}, t)\}_{n' \neq n}$. The recursion in (8.44) amounts to a structural vector autoregressive model (SVARM) [44]. Interestingly, (8.44) can be rewritten as

$$\mathbf{f}_t^{(\chi)} = (\mathbf{I}_N - \mathbf{A}_{t,t})^{-1}\mathbf{A}_{t,t-1}\mathbf{f}_{t-1}^{(\chi)} + (\mathbf{I}_N - \mathbf{A}_{t,t})^{-1}\boldsymbol{\eta}_t, \tag{8.45}$$

where $\mathbf{I}_N - \mathbf{A}_{t,t}$ is assumed invertible. After defining $\tilde{\boldsymbol{\eta}}_t := (\mathbf{I}_N - \mathbf{A}_{t,t})^{-1}\boldsymbol{\eta}_t$ and $\tilde{\mathbf{A}}_{t,t-1} := (\mathbf{I}_N - \mathbf{A}_{t,t})^{-1}\mathbf{A}_{t,t-1}$, (8.44) boils down to

$$\mathbf{f}_t^{(\chi)} = \tilde{\mathbf{A}}_{t,t-1}\mathbf{f}_{t-1}^{(\chi)} + \tilde{\boldsymbol{\eta}}_t, \tag{8.46}$$

which has the same form as (8.43). This section focuses on deriving estimators based on (8.43), but (8.44) can also be accommodated using the aforementioned reformulation.

Modeling $\mathbf{f}_t$ as the superposition of a term $\mathbf{f}_t^{(\chi)}$, capturing the slow dynamics over time with a state-space equation, and a term $\mathbf{f}_t^{(\nu)}$, accounting for fast dynamics, is motivated by the application at hand [55–57]. In kriging terminology [56], $\mathbf{f}_t^{(\nu)}$ is said to model small-scale spatial fluctuations, whereas $\mathbf{f}_t^{(\chi)}$ captures the so-called trend. The decomposition (8.42) is often dictated by the sampling interval: while $\mathbf{f}_t^{(\chi)}$ captures dynamics that are slow relative to the sampling interval, fast variations are modeled with $\mathbf{f}_t^{(\nu)}$. Such a modeling approach is advised in the prediction of network delays [55], where $\mathbf{f}_t^{(\chi)}$ represents the queuing delay while $\mathbf{f}_t^{(\nu)}$ represents the propagation, transmission and processing delays. Likewise, when predicting prices across different stocks, $\mathbf{f}_t^{(\chi)}$ captures the daily evolution of the stock market, which is correlated across stocks and time samples, while $\mathbf{f}_t^{(\nu)}$ describes unexpected changes, such as abrupt drops of the stock market due to political statements, which are assumed uncorrelated over time.
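To make the decomposition (8.42)–(8.43) concrete, the following sketch simulates a random-walk trend $\mathbf{f}_t^{(\chi)}$ plus an exogenous component $\mathbf{f}_t^{(\nu)}$ drawn from a user-supplied covariance (e.g., a Laplacian kernel); all parameter values are illustrative.

```python
import numpy as np

def simulate_spatiotemporal(A, T, alpha=0.9, sigma_eta=0.1, K_nu=None, rng=None):
    """Simulate f_t = f_t^(nu) + f_t^(chi) with f_t^(chi) following the random-walk model (8.43)."""
    rng = np.random.default_rng() if rng is None else rng
    N = A.shape[0]
    K_nu = np.eye(N) if K_nu is None else K_nu       # covariance of the exogenous term
    f_chi = np.zeros(N)
    signals = []
    for _ in range(T):
        f_chi = alpha * f_chi + sigma_eta * rng.standard_normal(N)   # A_{t,t-1} = alpha * I (random walk)
        f_nu = rng.multivariate_normal(np.zeros(N), K_nu)            # spatially correlated exogenous input
        signals.append(f_nu + f_chi)                                 # superposition (8.42)
    return np.array(signals)                                         # T x N array of graph signals
```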

8.3.2.2 Kernel Kriged Kalman Filter

The spatiotemporal model in (8.42), (8.43) can represent multiple forms of spatiotemporal dynamics by judicious selection of the associated parameters. The batch KRR estimator over time yields

$$\arg\min_{\{\mathbf{f}_\tau^{(\chi)}, \boldsymbol{\eta}_\tau, \mathbf{f}_\tau^{(\nu)}, \mathbf{f}_\tau\}_{\tau=1}^t} \sum_{\tau=1}^t \frac{1}{S(\tau)}\|\mathbf{y}_\tau - \mathbf{S}_\tau\mathbf{f}_\tau\|^2 + \mu_1\sum_{\tau=1}^t \|\boldsymbol{\eta}_\tau\|_{\mathbf{K}_\tau^{(\eta)}}^2 + \mu_2\sum_{\tau=1}^t \|\mathbf{f}_\tau^{(\nu)}\|_{\mathbf{K}_\tau^{(\nu)}}^2$$
$$\text{s.t.}\quad \boldsymbol{\eta}_\tau = \mathbf{f}_\tau^{(\chi)} - \mathbf{A}_{\tau,\tau-1}\mathbf{f}_{\tau-1}^{(\chi)}, \qquad \mathbf{f}_\tau = \mathbf{f}_\tau^{(\nu)} + \mathbf{f}_\tau^{(\chi)}, \qquad \tau = 1, \ldots, t. \tag{8.47}$$

The first term in (8.47) penalizes the fitting error in accordance with (8.1). The scalars $\mu_1, \mu_2 \geq 0$ are regularization parameters controlling the effect of the kernel regularizers, while prior information about $\{\mathbf{f}_\tau^{(\nu)}, \boldsymbol{\eta}_\tau\}_{\tau=1}^t$ may guide the selection of the appropriate kernel matrices. The constraints in (8.47) enforce adherence to (8.43) and (8.42). Since $\mathbf{f}_\tau^{(\nu)}$ and $\boldsymbol{\eta}_\tau$ are defined over the time evolving $\mathcal{G}(\tau)$, a natural approach is to select Laplacian kernels for $\mathbf{K}_\tau^{(\nu)}$ and $\mathbf{K}_\tau^{(\eta)}$; see §8.2.2. Next, we rewrite (8.47) in a form amenable to online solvers, namely

$$\arg\min_{\{\mathbf{f}_\tau^{(\chi)}, \mathbf{f}_\tau^{(\nu)}\}_{\tau=1}^t} \sum_{\tau=1}^t \frac{1}{S(\tau)}\|\mathbf{y}_\tau - \mathbf{S}_\tau\mathbf{f}_\tau^{(\chi)} - \mathbf{S}_\tau\mathbf{f}_\tau^{(\nu)}\|^2 + \mu_1\sum_{\tau=1}^t \|\mathbf{f}_\tau^{(\chi)} - \mathbf{A}_{\tau,\tau-1}\mathbf{f}_{\tau-1}^{(\chi)}\|_{\mathbf{K}_\tau^{(\eta)}}^2 + \mu_2\sum_{\tau=1}^t \|\mathbf{f}_\tau^{(\nu)}\|_{\mathbf{K}_\tau^{(\nu)}}^2. \tag{8.48}$$

In batch form, the optimization in (8.48) yields $\{\hat{\mathbf{f}}_{\tau|t}^{(\nu)}, \hat{\mathbf{f}}_{\tau|t}^{(\chi)}\}_{\tau=1}^t$ per slot $t$ with complexity that grows with $t$. Fortunately, the filtered solutions $\{\hat{\mathbf{f}}_{\tau|\tau}^{(\nu)}, \hat{\mathbf{f}}_{\tau|\tau}^{(\chi)}\}_{\tau=1}^t$ of (8.48) are attained by the kernel kriged Kalman filter (KeKriKF) in an online fashion; for the proof the reader is referred to [58]. One iteration of the KeKriKF is summarized as Algorithm 3. This online estimator—with computational complexity $\mathcal{O}(N^3)$ per $t$—tracks the temporal variations of the signal of interest through (8.43) and promotes desired properties, such as smoothness over the graph, via $\mathbf{K}_t^{(\nu)}$ and $\mathbf{K}_t^{(\eta)}$. Different from existing KriKF approaches over graphs [55], the KeKriKF takes into account the underlying graph structure in estimating $\mathbf{f}_t^{(\nu)}$ as well as $\mathbf{f}_t^{(\chi)}$. Furthermore, by using $\mathbf{L}_t$ in (8.16), it can also accommodate dynamic graph topologies. Finally, it should be noted that the KeKriKF encompasses as a special case the KriKF, which relies on knowing the statistical properties of the function [55–57,60].

Algorithm 3 Kernel kriged Kalman filter (KeKriKF).

Lack of prior information prompts the development of data-driven approaches that efficiently learn the appropriate kernel matrix. Section 8.3.2.3 proposes an online MKL approach for achieving this goal.

8.3.2.3 Online Multikernel Learning

To cope with a lack of prior information about the pertinent kernel, the following dictionaries of kernels will be considered: $\mathcal{D}_\nu := \{\mathbf{K}^{(\nu)}(m) \in \mathbb{S}_+^N\}_{m=1}^{M_\nu}$ and $\mathcal{D}_\eta := \{\mathbf{K}^{(\eta)}(m) \in \mathbb{S}_+^N\}_{m=1}^{M_\eta}$. In what follows, assume that $\mathbf{K}_\tau^{(\nu)} = \mathbf{K}^{(\nu)}$, $\mathbf{K}_\tau^{(\eta)} = \mathbf{K}^{(\eta)}$ and $\mathbf{S}_\tau = \mathbf{S}\ \forall\tau$. Moreover, we postulate that the kernel matrices are of the form $\mathbf{K}^{(\nu)} = \mathbf{K}^{(\nu)}(\boldsymbol{\theta}^{(\nu)}) = \sum_{m=1}^{M_\nu}\theta^{(\nu)}(m)\mathbf{K}^{(\nu)}(m)$ and $\mathbf{K}^{(\eta)} = \mathbf{K}^{(\eta)}(\boldsymbol{\theta}^{(\eta)}) = \sum_{m=1}^{M_\eta}\theta^{(\eta)}(m)\mathbf{K}^{(\eta)}(m)$, where $\theta^{(\eta)}(m), \theta^{(\nu)}(m) \geq 0\ \forall m$.

Next, in accordance with §8.2.3, the coefficients $\boldsymbol{\theta}^{(\nu)} = [\theta^{(\nu)}(1), \ldots, \theta^{(\nu)}(M_\nu)]^T$ and $\boldsymbol{\theta}^{(\eta)} = [\theta^{(\eta)}(1), \ldots, \theta^{(\eta)}(M_\eta)]^T$ can be found by jointly minimizing (8.48) with respect to $\{\mathbf{f}_\tau^{(\chi)}, \mathbf{f}_\tau^{(\nu)}\}_{\tau=1}^t$, $\boldsymbol{\theta}^{(\nu)}$ and $\boldsymbol{\theta}^{(\eta)}$, which yields

$$\arg\min_{\{\mathbf{f}_\tau^{(\chi)}, \mathbf{f}_\tau^{(\nu)}\}_{\tau=1}^t,\ \boldsymbol{\theta}^{(\nu)}\geq\mathbf{0},\ \boldsymbol{\theta}^{(\eta)}\geq\mathbf{0}} \sum_{\tau=1}^t \frac{1}{S}\|\mathbf{y}_\tau - \mathbf{S}\mathbf{f}_\tau^{(\chi)} - \mathbf{S}\mathbf{f}_\tau^{(\nu)}\|^2 + \mu_1\sum_{\tau=1}^t\|\mathbf{f}_\tau^{(\chi)} - \mathbf{A}_{\tau,\tau-1}\mathbf{f}_{\tau-1}^{(\chi)}\|_{\mathbf{K}^{(\eta)}(\boldsymbol{\theta}^{(\eta)})}^2 + \mu_2\sum_{\tau=1}^t\|\mathbf{f}_\tau^{(\nu)}\|_{\mathbf{K}^{(\nu)}(\boldsymbol{\theta}^{(\nu)})}^2 + t\rho_\nu\|\boldsymbol{\theta}^{(\nu)}\|_2^2 + t\rho_\eta\|\boldsymbol{\theta}^{(\eta)}\|_2^2, \tag{8.49}$$

where $\rho_\nu, \rho_\eta \geq 0$ are regularization parameters that effect a ball constraint on $\boldsymbol{\theta}^{(\nu)}$ and $\boldsymbol{\theta}^{(\eta)}$, weighted by $t$ to account for the first three terms, which grow with $t$. Observe that the optimization problem in (8.49) gives time varying estimates $\boldsymbol{\theta}_t^{(\nu)}$ and $\boldsymbol{\theta}_t^{(\eta)}$, allowing one to track the optimal $\mathbf{K}^{(\nu)}$ and $\mathbf{K}^{(\eta)}$ as they change over time.

The optimization problem in (8.49) is not jointly convex in $\{\mathbf{f}_\tau^{(\chi)}, \mathbf{f}_\tau^{(\nu)}\}_{\tau=1}^t, \boldsymbol{\theta}^{(\nu)}, \boldsymbol{\theta}^{(\eta)}$, but it is separately convex in these variables. To solve (8.49), alternating minimization strategies will be employed, which optimize with respect to one variable while keeping the others fixed [61]. If $\boldsymbol{\theta}^{(\nu)}, \boldsymbol{\theta}^{(\eta)}$ are considered fixed, (8.49) reduces to (8.48), which can be solved by Algorithm 3 for the estimates $\hat{\mathbf{f}}_{t|t}^{(\chi)}, \hat{\mathbf{f}}_{t|t}^{(\nu)}$ at each $t$. With $\{\mathbf{f}_\tau^{(\chi)}, \mathbf{f}_\tau^{(\nu)}\}_{\tau=1}^t$ fixed and replaced by $\{\hat{\mathbf{f}}_{\tau|\tau}^{(\chi)}, \hat{\mathbf{f}}_{\tau|\tau}^{(\nu)}\}_{\tau=1}^t$ in (8.48), the time varying estimates of $\boldsymbol{\theta}^{(\nu)}, \boldsymbol{\theta}^{(\eta)}$ are found by

$$\hat{\boldsymbol{\theta}}_t^{(\eta)} = \arg\min_{\boldsymbol{\theta}^{(\eta)}\geq\mathbf{0}} \frac{1}{t}\sum_{\tau=1}^t \|\hat{\mathbf{f}}_{\tau|\tau}^{(\chi)} - \mathbf{A}_{\tau,\tau-1}\hat{\mathbf{f}}_{\tau-1|\tau-1}^{(\chi)}\|_{\mathbf{K}^{(\eta)}(\boldsymbol{\theta}^{(\eta)})}^2 + \frac{\rho_\eta}{\mu_1}\|\boldsymbol{\theta}^{(\eta)}\|_2^2, \tag{8.50a}$$

$$\hat{\boldsymbol{\theta}}_t^{(\nu)} = \arg\min_{\boldsymbol{\theta}^{(\nu)}\geq\mathbf{0}} \frac{1}{t}\sum_{\tau=1}^t \|\hat{\mathbf{f}}_{\tau|\tau}^{(\nu)}\|_{\mathbf{K}^{(\nu)}(\boldsymbol{\theta}^{(\nu)})}^2 + \frac{\rho_\nu}{\mu_2}\|\boldsymbol{\theta}^{(\nu)}\|_2^2. \tag{8.50b}$$

The optimization problems (8.50a) and (8.50b) are strongly convex, and iterative algorithms are available based on projected gradient descent (PGD) [62] or the Frank–Wolfe algorithm [63]. When the kernel matrices belong to the Laplacian family (8.16), efficient algorithms that exploit the common eigenspace of the kernels in the dictionary have been developed in [58]; these reduce the per-step computational complexity of PGD from a prohibitive $\mathcal{O}(N^3 M)$ for general kernels to a more affordable $\mathcal{O}(NM)$ for Laplacian kernels. The resulting algorithm, termed multikernel KriKF (MKriKF), alternates between computing $\hat{\mathbf{f}}_{t|t}^{(\chi)}$ and $\hat{\mathbf{f}}_{t|t}^{(\nu)}$ using the KeKriKF and estimating $\hat{\boldsymbol{\theta}}_t^{(\nu)}$ and $\hat{\boldsymbol{\theta}}_t^{(\eta)}$ by solving (8.50b) and (8.50a).
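A possible PGD update for (8.50b) is sketched below, assuming dense kernels: recalling the notation $\|\mathbf{x}\|_{\mathbf{K}}^2 = \mathbf{x}^T\mathbf{K}^{-1}\mathbf{x}$, the gradient uses $\partial/\partial\theta_m\,[\hat{\mathbf{f}}^T\mathbf{K}(\boldsymbol{\theta})^{-1}\hat{\mathbf{f}}] = -\hat{\mathbf{f}}^T\mathbf{K}(\boldsymbol{\theta})^{-1}\mathbf{K}_m\mathbf{K}(\boldsymbol{\theta})^{-1}\hat{\mathbf{f}}$ and projects onto the nonnegative orthant. This dense version is only illustrative; the implementation of [58] instead exploits the shared eigenspace of Laplacian kernels.

```python
import numpy as np

def pgd_theta(K_dict, f_hats, rho_over_mu, n_iter=100, step=1e-2):
    """Projected gradient descent for (8.50b): fit theta >= 0 of K(theta) = sum_m theta_m K_m."""
    M = len(K_dict)
    theta = np.ones(M) / M
    t_len = len(f_hats)
    for _ in range(n_iter):
        K_theta = sum(th * Km for th, Km in zip(theta, K_dict))
        K_inv = np.linalg.inv(K_theta + 1e-9 * np.eye(K_theta.shape[0]))   # small jitter for stability
        grad = np.zeros(M)
        for f in f_hats:                               # (1/t) sum_tau f^T K(theta)^{-1} f term
            w = K_inv @ f
            grad += np.array([-(w @ Km @ w) for Km in K_dict]) / t_len
        grad += 2.0 * rho_over_mu * theta              # gradient of the ridge term on theta
        theta = np.maximum(theta - step * grad, 0.0)   # projection onto theta >= 0
    return theta
```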

8.3.3 Numerical Tests

This section compares the performance of the methods we discussed in §8.3.1 and §8.3.2 with state-of-the-art alternatives and illustrates some of the trade-offs inherent to time varying function reconstruction through real-data experiments. The source code for the simulations is available at the authors' websites.

Unless otherwise stated, the compared estimators include distributed least squares reconstruction (DLSR) [49] with step size $\mu_{\mathrm{DLSR}}$ and parameter $\beta_{\mathrm{DLSR}}$; the least mean squares (LMS) algorithm in [50] with step size $\mu_{\mathrm{LMS}}$; the BL instantaneous estimator (BL-IE), which results from applying [11,13,15] separately per $t$; and the KRR instantaneous estimator (KRR-IE) in (8.36) with a diffusion kernel of parameter σ. DLSR, LMS and BL-IE also use a bandwidth parameter B.

Reconstruction via extended graphs. The first experiment relies on a data set from an epilepsy study [65] and showcases an example analysis of ECoG data, a standard tool in diagnosing epilepsy. The ECoG time series in [65] were collected from $N = 76$ electrodes implanted in a patient's brain before and after the onset of a seizure. A symmetric time-invariant adjacency matrix $\mathbf{A}$ was obtained using the method in [44] with the ECoG data before the onset of the seizure. The function $f(v_n, t)$ comprises the electrical signal at the $n$th electrode and $t$th sampling instant after the onset of the seizure, for a period of $T = 250$ samples. The values of $f(v_n, t)$ were normalized by subtracting the temporal mean of each time series before the onset of the seizure. The goal of the experiment is to illustrate the reconstruction performance of the KKF in capturing the complex spatiotemporal dynamics of brain signals.

Fig. 8.3A depicts $\mathrm{NMSE}(t, \{\mathcal{S}_\tau\}_{\tau=1}^t)$, averaged over all sets $\mathcal{S}_t = \mathcal{S}\ \forall t$, of size $S = 53$. For the KKF, a space–time kernel was created (see [51]) with $\mathbf{K}_t$ a time-invariant covariance kernel $\mathbf{K}_t = \hat{\boldsymbol{\Sigma}}$, where $\hat{\boldsymbol{\Sigma}}$ was set to the sample covariance matrix of the time series before the onset of the seizure, and with a time-invariant $\mathbf{B}^{(T)} = b^{(T)}\mathbf{I}$. The results clearly show the superior reconstruction performance of the KKF—which successfully exploits the statistics of the signal when available—over competing approaches, even with a small number of samples. This result suggests that ECoG-based diagnosis could be conducted efficiently with a smaller number of intracranial electrodes, which may have a positive impact on the patient's experience.

Figure 8.3 NMSE for real data simulations. (A) NMSE for the ECoG data set (σ = 1.2, μ = 10⁻⁴, μDLSR = 1.2, b(T) = 0.01, βDLSR = 0.5, μLMS = 0.6). (B) NMSE of temperature estimates (μ1 = 1, μ2 = 1, μDLSR = 1.6, βDLSR = 0.5, μLMS = 0.6, α = 10⁻³, μη = 10⁻⁵, rη = 10⁻⁶, μν = 2, rν = 0.5, Mν = 40, Mη = 40).

Reconstruction via KeKriKF. The second data set is provided by the National Climatic Data Center [66] and comprises hourly temperature measurements at $N = 109$ measuring stations across the continental United States in 2010. A time-invariant graph was constructed as in [51], based on geographical distances. The value $f(v_n, t)$ represents the temperature recorded at the $n$th station and $t$th day.

Fig. 8.3B reports the performance of the different reconstruction algorithms in terms of NMSE for $S = 40$. The KeKriKF (Algorithm 3) adopts a diffusion kernel for $\mathbf{K}^{(\nu)}$ with $\sigma = 1.8$ and $\mathbf{K}^{(\eta)} = s_\eta\mathbf{I}_N$ with $s_\eta = 10^{-5}$. The MKriKF is configured with a dictionary $\mathcal{D}_\nu$ containing $M_\nu$ diffusion kernels with parameters $\{\sigma(m)\}_{m=1}^{M_\nu}$ drawn from a Gaussian distribution with mean $\mu_\nu$ and variance $r_\nu$, and a dictionary $\mathcal{D}_\eta$ containing $M_\eta$ kernels $s_\eta\mathbf{I}_N$ with parameters $\{s_\eta(m)\}_{m=1}^{M_\eta}$ drawn from a Gaussian distribution with mean $\mu_\eta$ and variance $r_\eta$. The specific kernel selection for the KeKriKF leads to the smallest NMSE and was obtained via cross-validation. Observe that the MKriKF captures the spatiotemporal dynamics, successfully explores the pool of available kernels and achieves superior performance.

The third dataset is provided by the World Bank Group [67] and comprises the gross domestic product (GDP) per capita values for $N = 127$ countries for the years 1960–2016. A time-invariant graph was constructed using the correlation between the GDP values of different countries over the first 25 years. The graph function $f(v_n, t)$ denotes the GDP value reported for the $n$th country in the $t$th year, for $t = 1985, \ldots, 2016$. The graph Fourier transform of the GDP values shows that the graph frequencies $\check{f}_k$ take small values for $4 < k < 120$ and large values otherwise. Motivated by this observation, the KeKriKF is configured with a band-reject kernel $K^{(\nu)}$ that results after applying $r(\lambda_n) = \beta$ for $k \le n \le N - l$ and $r(\lambda_n) = 1/\beta$ otherwise in (8.16), with $k = 3$, $l = 6$, $\beta = 15$, and with $K^{(\eta)} = s_\eta I_N$ where $s_\eta = 10^{4}$. The MKriKF adopts a dictionary $\mathcal{D}_\nu$ that contains band-reject kernels with $k \in [2, 4]$, $l \in [3, 6]$, $\beta = 15$, resulting in $M_\nu = 12$ kernels, and a dictionary $\mathcal{D}_\eta$ that contains $\{s_\eta^{(m)} I_N\}_{m=1}^{40}$ with $s_\eta^{(m)}$ drawn from a Gaussian distribution with mean $\mu_\eta = 10^{5}$ and variance $r_\eta = 10^{6}$. Next, the performance of the different algorithms in tracking the GDP values is evaluated after sampling $S = 38$ countries.
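
The band-reject weighting just described can be sketched as follows (assuming, as with the other Laplacian kernels, that the kernel matrix is formed as $K = U \,\mathrm{diag}(1/r(\lambda_n))\, U^{\top}$; the exact normalization in (8.16) may differ, and the 0-based slicing is an approximate rendering of the 1-based range in the text):

```python
import numpy as np

def band_reject_kernel(L, k, l, beta):
    """Attenuate the middle graph frequencies (indices k..N-l) and amplify the rest."""
    lam, U = np.linalg.eigh(L)          # graph frequencies in ascending order
    N = L.shape[0]
    r = np.full(N, 1.0 / beta)          # r(lambda_n) = 1/beta outside the rejected band
    r[k - 1:N - l] = beta               # r(lambda_n) = beta inside the band (1-based k..N-l)
    return (U / r) @ U.T                # K = U diag(1/r(lambda)) U^T
```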

Fig. 8.4 illustrates the actual GDP as well as the GDP estimates for Greece, which is not among the sampled countries. Clearly, MKriKF, which learns the pertinent kernels from the data, achieves roughly the same performance as KeKriKF, which is configured manually so as to obtain the smallest possible NMSE.

Figure 8.4 GDP data set: tracking of GDP ($\mu_{\mathrm{DLSR}} = 1.6$, $\beta_{\mathrm{DLSR}} = 0.4$, $\mu_{\mathrm{LMS}} = 1.6$, $\rho_\nu = 10^{5}$, $\rho_\eta = 10^{5}$).

8.3.4 Summary

The task of reconstructing functions defined on graphs arises naturally in a plethora of applications. The kernel-based approach offers a clear, principled and intuitive way of tackling this problem. In this chapter, we gave a contemporary treatment of this framework, focusing on both time-invariant and time-evolving settings. The methods presented herein offer an expressive toolbox for tackling interesting real-world problems. Besides illustrating the effectiveness of the discussed approaches, our tests were also chosen to showcase interesting application areas as well as reasonable modeling approaches for interested readers to build upon. For further details about the models discussed here and their theoretical properties, the reader is referred to [23,39,58,59,68,69] and the references therein.

References

[1] E.D. Kolaczyk, Statistical Analysis of Network Data: Methods and Models. New York: Springer; 2009.

[2] R.I. Kondor, J. Lafferty, Diffusion kernels on graphs and other discrete structures, Proc. Int. Conf. Mach. Learn.. Sydney, Australia. 2002:315–322.

[3] A.J. Smola, R.I. Kondor, Kernels and regularization on graphs, Learning Theory and Kernel Machines. Springer; 2003:144–158.

[4] D.I. Shuman, S.K. Narang, P. Frossard, A. Ortega, P. Vandergheynst, The emerging field of signal processing on graphs: extending high-dimensional data analysis to networks and other irregular domains, IEEE Signal Processing Magazine 2013;30(3):83–98.

[5] A. Sandryhaila, J.M.F. Moura, Discrete signal processing on graphs, IEEE Transactions on Signal Processing 2013;61(7):1644–1656.

[6] O. Chapelle, B. Schölkopf, A. Zien, et al., Semi-Supervised Learning. Cambridge: MIT Press; 2006.

[7] O. Chapelle, V. Vapnik, J. Weston, Transductive inference for estimating values of functions, Proc. Advances Neural Inf. Process. Syst., vol. 12. Denver, Colorado. 1999:421–427.

[8] C. Cortes, M. Mohri, On transductive regression, Proc. Advances Neural Inf. Process. Syst.. Vancouver, Canada. 2007:305–312.

[9] M. Belkin, P. Niyogi, V. Sindhwani, Manifold regularization: a geometric framework for learning from labeled and unlabeled examples, Journal of Machine Learning Research 2006;7:2399–2434.

[10] S.K. Narang, A. Gadde, A. Ortega, Signal processing techniques for interpolation in graph structured data, Proc. IEEE Int. Conf. Acoust., Speech, Sig. Process.. Vancouver, Canada: IEEE; 2013:5445–5449.

[11] S.K. Narang, A. Gadde, E. Sanou, A. Ortega, Localized iterative methods for interpolation in graph structured data, Global Conf. Sig. Inf. Process.. Austin, Texas: IEEE; 2013:491–494.

[12] A. Gadde, A. Ortega, A probabilistic interpretation of sampling theory of graph signals, Proc. IEEE Int. Conf. Acoust., Speech, Sig. Process.. Brisbane, Australia. 2015:3257–3261 10.1109/ICASSP.2015.7178573.

[13] M. Tsitsvero, S. Barbarossa, P. Di Lorenzo, Signals on graphs: uncertainty principle and sampling, IEEE Transactions on Signal Processing 2016;64(18):4845–4860.

[14] S. Chen, R. Varma, A. Sandryhaila, J. Kovacevic, Discrete signal processing on graphs: sampling theory, IEEE Transactions on Signal Processing 2015;63(24):6510–6523.

[15] A. Anis, A. Gadde, A. Ortega, Efficient sampling set selection for bandlimited graph signals using graph spectral proxies, IEEE Transactions on Signal Processing 2016;64(14):3775–3789.

[16] X. Wang, P. Liu, Y. Gu, Local-set-based graph signal reconstruction, IEEE Transactions on Signal Processing 2015;63(9):2432–2444 10.1109/TSP.2015.2411217.

[17] A.G. Marques, S. Segarra, G. Leus, A. Ribeiro, Sampling of graph signals with successive local aggregations, IEEE Transactions on Signal Processing 2016;64(7):1832–1843.

[18] B. Schölkopf, A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press; 2002.

[19] B. Schölkopf, R. Herbrich, A.J. Smola, A generalized representer theorem, Computational Learning Theory. Springer; 2001:416–426.

[20] V.N. Vapnik, Statistical Learning Theory, vol. 1. New York: Wiley; 1998.

[21] G. Kimeldorf, G. Wahba, Some results on Tchebycheffian spline functions, Journal of Mathematical Analysis and Applications 1971;33(1):82–95.

[22] D. Zhou, B. Schölkopf, A regularization framework for learning from graph data, ICML Workshop Stat. Relational Learn. Connections Other Fields, vol. 15. Banff, Canada. 2004:67–68.

[23] D. Romero, M. Ma, G.B. Giannakis, Kernel-based reconstruction of graph signals, IEEE Transactions on Signal Processing 2017;65(3):764–778.

[24] L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web. [tech. rep.] Stanford InfoLab; 1999.

[25] A.N. Nikolakopoulos, A. Korba, J.D. Garofalakis, Random surfing on multipartite graphs, 2016 IEEE International Conference on Big Data (Big Data). 2016:736–745 10.1109/BigData.2016.7840666.

[26] A.N. Nikolakopoulos, J.D. Garofalakis, Random surfing without teleportation, Algorithms, Probability, Networks, and Games. Springer International Publishing; 2015:344–357.

[27] J.A. Bazerque, G.B. Giannakis, Nonparametric basis pursuit via kernel-based learning, IEEE Signal Processing Magazine 2013;28(30):112–125.

[28] V. Sindhwani, P. Niyogi, M. Belkin, Beyond the point cloud: from transductive to semi-supervised learning, Proc. Int. Conf. Mach. Learn.. ACM; 2005:824–831.

[29] M. Gönen, E. Alpaydın, Multiple kernel learning algorithms, Journal of Machine Learning Research 2011;12(Jul):2211–2268.

[30] J.A. Bazerque, G. Mateos, G.B. Giannakis, Group-lasso on splines for spectrum cartography, IEEE Transactions on Signal Processing 2011;59(10):4648–4663 10.1109/TSP.2011.2160858.

[31] G.B. Giannakis, Q. Ling, G. Mateos, I.D. Schizas, H. Zhu, Decentralized learning for wireless communications and networking, arXiv preprint arXiv:1503.08855; 2016.

[32] C.A. Micchelli, M. Pontil, Learning the kernel function via regularization, Journal of Machine Learning Research 2005:1099–1125.

[33] S. Segarra, A.G. Marques, G. Leus, A. Ribeiro, Reconstruction of graph signals through percolation from seeding nodes, IEEE Transactions on Signal Processing 2016;64(16):4363–4378.

[34] A.S. Zamzam, V.N. Ioannidis, N.D. Sidiropoulos, Coupled graph tensor factorization, Proc. Asilomar Conf. Sig., Syst., Comput.. Pacific Grove, CA. 2016:1755–1759.

[35] A.N. Nikolakopoulos, J.D. Garofalakis, NCDawareRank: a novel ranking method that exploits the decomposable structure of the web, Proceedings of the Sixth ACM International Conference on Web Search and Data Mining. ACM; 2013:143–152.

[36] A.N. Nikolakopoulos, J.D. Garofalakis, Top-n recommendations in the presence of sparsity: an NCD-based approach, Web Intelligence, vol. 13. IOS Press; 2015:247–265.

[37] A.N. Nikolakopoulos, M.A. Kouneli, J.D. Garofalakis, Hierarchical itemspace rank: exploiting hierarchy to alleviate sparsity in ranking-based recommendation, Neurocomputing 2015;163:126–136.

[38] A.N. Nikolakopoulos, J.D. Garofalakis, NCDREC: a decomposability inspired framework for top-n recommendation, 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), vol. 1. 2014:183–190 10.1109/WI-IAT.2014.32.

[39] V.N. Ioannidis, A.N. Nikolakopoulos, G.B. Giannakis, Semi-parametric graph kernel-based reconstruction, IEEE Global Conf. Sig. Inf. Process.. Montreal, Canada. 2017.

[40] V. Vapnik, The Nature of Statistical Learning Theory. Springer; 2013.

[41] Bureau of Transportation, United States, [online], available: http://www.transtats.bts.gov/; 2016.

[42] S. Chen, R. Varma, A. Singh, J. Kovačević, Signal representations on graphs: tools and applications, arXiv preprint arXiv:1512.05406; 2015 [online].

[43] U. Von Luxburg, A tutorial on spectral clustering, Statistics and Computing 2007;17(4):395–416.

[44] Y. Shen, B. Baingana, G.B. Giannakis, Nonlinear structural vector autoregressive models for inferring effective brain network connectivity, arXiv preprint arXiv:1610.06551; 2016.

[45] B. Baingana, G.B. Giannakis, Tracking switched dynamic network topologies from information cascades, IEEE Transactions on Signal Processing 2017;65(4):985–997.

[46] F.R. Bach, M.I. Jordan, Learning graphical models for stationary time series, IEEE Transactions on Signal Processing 2004;52(8):2189–2199.

[47] J. Mei, J.M.F. Moura, Signal processing on graphs: causal modeling of big data, arXiv preprint arXiv:1503.00173v3; 2016.

[48] P.A. Forero, K. Rajawat, G.B. Giannakis, Prediction of partially observed dynamical processes over networks via dictionary learning, IEEE Transactions on Signal Processing 2014;62(13):3305–3320.

[49] X. Wang, M. Wang, Y. Gu, A distributed tracking algorithm for reconstruction of graph signals, IEEE Journal of Selected Topics in Signal Processing 2015;9(4):728–740.

[50] P. Di Lorenzo, S. Barbarossa, P. Banelli, S. Sardellitti, Adaptive least mean squares estimation of graph signals, IEEE Transactions on Signal and Information Processing over Networks 2017;2(4):555–568.

[51] D. Romero, V.N. Ioannidis, G.B. Giannakis, Kernel-based reconstruction of space-time functions on dynamic graphs, IEEE Journal of Selected Topics in Signal Processing 2017;11(6):1–14.

[52] G. Strang, K. Borre, Linear Algebra, Geodesy, and GPS. Siam; 1997.

[53] I.D. Schizas, G.B. Giannakis, S.I. Roumeliotis, A. Ribeiro, Consensus in ad hoc WSNs with noisy links—part ii: distributed estimation and smoothing of random signals, IEEE Transactions on Signal Processing 2008;56(4):1650–1666.

[54] T.W. Anderson, An Introduction to Multivariate Statistical Analysis, vol. 2. New York: Wiley; 1958.

[55] K. Rajawat, E. Dall'Anese, G.B. Giannakis, Dynamic network delay cartography, IEEE Transactions on Information Theory 2014;60(5):2910–2920.

[56] C.K. Wikle, N. Cressie, A dimension-reduced approach to space-time Kalman filtering, Biometrika 1999:815–829.

[57] S.J. Kim, E. Dall'Anese, G.B. Giannakis, Cooperative spectrum sensing for cognitive radios using kriged Kalman filtering, IEEE Journal of Selected Topics in Signal Processing 2011;5(1):24–36 10.1109/JSTSP.2010.2053016.

[58] V.N. Ioannidis, D. Romero, G.B. Giannakis, Inference of spatio-temporal functions over graphs via multi-kernel kriged Kalman filtering, arXiv preprint arXiv:1711.09306; 2017.

[59] V.N. Ioannidis, D. Romero, G.B. Giannakis, Inference of spatiotemporal processes over graphs via kernel kriged Kalman filtering, Proc. European Sig. Process. Conf.. Kos, Greece. 2017.

[60] K.V. Mardia, C. Goodall, E.J. Redfern, F.J. Alonso, The kriged Kalman filter, Test 1998;7(2):217–282.

[61] I. Csiszár, G. Tusnády, Information geometry and alternating minimization procedures, Statistics and Decisions 1984:205–237.

[62] L. Zhang, D. Romero, G.B. Giannakis, Fast convergent algorithms for multi-kernel regression, Proc. Workshop Stat. Sig. Process.. Palma de Mallorca, Spain. 2016.

[63] L. Zhang, G. Wang, D. Romero, G.B. Giannakis, Randomized block Frank–Wolfe for convergent large-scale learning, IEEE Transactions on Signal Processing 2017;65(24):6448–6461.

[64] A.N. Nikolakopoulos, V. Kalantzis, E. Gallopoulos, J.D. Garofalakis, Factored proximity models for top-N recommendations, Proc. 2017 IEEE International Conference on Big Knowledge. ICBK, Hefei, China. 2017:80–87.

[65] M.A. Kramer, E.D. Kolaczyk, H.E. Kirsch, Emergent network topology at seizure onset in humans, Epilepsy Research 2008;79(2):173–186.

[66] 1981–2010 U.S. climate normals, [online], available: https://www.ncdc.noaa.gov.

[67] GDP per capita (current US), [online], available: https://data.worldbank.org/indicator/NY.GDP.PCAP.CD.

[68] V.N. Ioannidis, D. Romero, G.B. Giannakis, Kernel-based reconstruction of space-time functions via extended graphs, Proc. Asilomar Conf. Sig., Syst., Comput.. Pacific Grove, CA. 2016:1829–1833.

[69] D. Romero, M. Ma, G.B. Giannakis, Estimating signals over graphs via multi-kernel learning, Proc. Workshop Stat. Sig. Process.. Palma de Mallorca, Spain. 2016.


1  While $f$ denotes a function, $f(v)$ represents the scalar resulting from evaluating $f$ at vertex $v$.

2  Smola et al. [3], for example, discuss the connection between $r(\lambda) = (\alpha - \lambda)^{-1}$ and PageRank [24], whereby the sought-after signal is essentially defined as the limiting distribution of a simple underlying “random surfing” process. For more about random surfing processes, see also [25,26].

3  A sum is chosen here for tractability, but the right-hand side of (8.19) could in principle combine the functions $\{\hat{f}_m\}_m$ in different forms.

4  For simplicity, here we consider only the case of semiparametric partially linear models.

5  For general designs of space–time kernels $\tilde{K}$ for time-invariant as well as time varying topologies, see [51].
