Chapter 8

Kernel-Based Inference of Functions Over Graphs

Vassilis N. Ioannidis, Meng Ma, Athanasios N. Nikolakopoulos, Georgios B. Giannakis — University of Minnesota, Minneapolis, MN, United States
Daniel Romero — University of Agder, Grimstad, Norway

Abstract

The study of networks has witnessed an explosive growth over the past decades, with several ground-breaking methods introduced. A particularly interesting problem—prevalent in several fields of study—is that of inferring a function defined over the nodes of a network. This work presents a versatile kernel-based framework for tackling this inference problem that naturally subsumes and generalizes the reconstruction approaches put forth recently by the signal processing on graphs community. Both the static and the dynamic settings are considered, along with effective modeling approaches for addressing real-world problems. The analytical discussion herein is complemented with a set of numerical examples, which showcase the effectiveness of the presented techniques as well as their merits relative to state-of-the-art methods.

Keywords

Signal processing on graphs; Kernel-based learning; Graph function reconstruction; Dynamic graphs; Kernel Kalman filter

Acknowledgements

The research was supported by NSF grants 1442686, 1500713, 1508993, 1509040, and 1739397.

8.1 Introduction

Numerous applications arising in diverse disciplines involve inference over networks [1]. Modeling nodal attributes as signals that take values over the vertices of the underlying graph allows the associated inference tasks to leverage node dependencies captured by the graph structure. In many real settings one can often afford only a limited number of node observations, due to restrictions inherent to the inference task at hand. In social networks, for example, individuals may be reluctant to share personal information; in sensor networks the nodes may report observations sporadically in order to save energy; in brain networks acquiring node samples may involve invasive procedures (e.g., electrocorticography). In this context, a frequently encountered challenge is that of inferring the attributes of every node in the network given the attributes of a subset of nodes. This is typically formulated as the task of reconstructing a function defined on the nodes [16], given information about some of its values.

Reconstruction of functions over graphs has been studied by the machine learning community, in the context of semisupervised learning, under the term transductive regression and classification [6–8]. Existing approaches assume "smoothness" with respect to the graph—in the sense that neighboring vertices have similar values—and devise nonparametric methods [2,3,6,9] targeting primarily the task of reconstructing binary-valued signals. Function estimation has also been investigated recently by the community of signal processing on graphs (SPoG) under the term signal reconstruction [10–17]. Most of these approaches adopt parametric estimation tools and rely on bandlimitedness, by which the signal of interest is assumed to lie in the span of the B principal eigenvectors of the graph's Laplacian (or adjacency) matrix.

This chapter cross-pollinates ideas arising from both communities and presents a unifying framework for tackling signal reconstruction problems both in the traditional time-invariant setting and in the more challenging time varying setting. We begin with a comprehensive presentation of kernel-based learning for solving problems of signal reconstruction over graphs (Section 8.2). Data-driven techniques are then presented based on multikernel learning (MKL), which enables optimally combining the kernels of a given dictionary while simultaneously estimating the graph function by solving a single optimization problem (Section 8.2.3). For the case where prior information is available, semiparametric estimators are discussed that can seamlessly incorporate structured prior information into the signal estimators (Section 8.2.4). We then move to the problem of reconstructing time evolving functions on dynamic graphs (Section 8.3). The kernel-based framework is extended to accommodate the time evolving setting, building on the notion of graph extension, specific choices of which lend themselves to a reduced-complexity online solver (Section 8.3.1). Next, a more flexible model is introduced that captures multiple forms of time dynamics, and kernel-based learning is employed to derive an online solver that effects online MKL by selecting the optimal combination of kernels on-the-fly (Section 8.3.2). Our analytical exposition, in both parts, is supplemented with a set of numerical tests based on both real and synthetic data that highlight the effectiveness of the methods, while providing examples of interesting realistic problems that they can address.

Notation: Scalars are denoted by lowercase characters, vectors by bold lowercase and matrices by bold uppercase; $(\mathbf{A})_{m,n}$ is the $(m,n)$th entry of matrix $\mathbf{A}$; superscripts $T$ and $\dagger$ respectively denote transpose and pseudoinverse. If $\mathbf{A} := [\mathbf{a}_1, \ldots, \mathbf{a}_N]$, then $\mathrm{vec}\{\mathbf{A}\} := [\mathbf{a}_1^T, \ldots, \mathbf{a}_N^T]^T := \mathbf{a}$. With $N \times N$ matrices $\{\mathbf{A}_t\}_{t=1}^T$ and $\{\mathbf{B}_t\}_{t=2}^T$ satisfying $\mathbf{A}_t = \mathbf{A}_t^T\ \forall t$, $\mathrm{btridiag}\{\mathbf{A}_1, \ldots, \mathbf{A}_T; \mathbf{B}_2, \ldots, \mathbf{B}_T\}$ represents the symmetric block tridiagonal matrix with diagonal blocks $\mathbf{A}_1, \ldots, \mathbf{A}_T$ and subdiagonal blocks $\mathbf{B}_2, \ldots, \mathbf{B}_T$. Symbols ⊙, ⊗ and ⊕, respectively, denote the element-wise (Hadamard) matrix product, the Kronecker product and the Kronecker sum, the latter being defined for $\mathbf{A} \in \mathbb{R}^{M \times M}$ and $\mathbf{B} \in \mathbb{R}^{N \times N}$ as $\mathbf{A} \oplus \mathbf{B} := \mathbf{A} \otimes \mathbf{I}_N + \mathbf{I}_M \otimes \mathbf{B}$. The $n$th column of the identity matrix $\mathbf{I}_N$ is represented by $\mathbf{i}_{N,n}$. If $\mathbf{A} \in \mathbb{R}^{N \times N}$ is positive definite and $\mathbf{x} \in \mathbb{R}^N$, then $\|\mathbf{x}\|_{\mathbf{A}}^2 := \mathbf{x}^T \mathbf{A}^{-1} \mathbf{x}$ and $\|\mathbf{x}\|^2 := \|\mathbf{x}\|_{\mathbf{I}_N}^2$. The cone of $N \times N$ positive definite matrices is denoted by $\mathbb{S}_+^N$. Finally, $\delta[\cdot]$ stands for the Kronecker delta and $\mathbb{E}$ for expectation.

8.2 Reconstruction of Functions Over Graphs

Before giving the formal problem statement, it is instructive to start with the basic definitions that will be used throughout this chapter.

Definitions: A graph can be specified by a tuple $\mathcal{G} := (\mathcal{V}, \mathbf{A})$, where $\mathcal{V} := \{v_1, \ldots, v_N\}$ is the vertex set and $\mathbf{A}$ is the $N \times N$ adjacency matrix, whose $(n, n')$th entry, $A_{n,n'} \geq 0$, denotes the nonnegative edge weight between vertices $v_n$ and $v_{n'}$. For simplicity, it is assumed that the graph has no self-loops, i.e., $A_{n,n} = 0\ \forall v_n \in \mathcal{V}$. This chapter focuses on undirected graphs, for which $A_{n,n'} = A_{n',n}\ \forall v_n, v_{n'} \in \mathcal{V}$. A graph is said to be unweighted if $A_{n,n'}$ is either 0 or 1. The edge set is defined as $\mathcal{E} := \{(v_n, v_{n'}) \in \mathcal{V} \times \mathcal{V} : A_{n,n'} \neq 0\}$. Two vertices $v_n$ and $v_{n'}$ are adjacent, connected or neighbors if $(v_n, v_{n'}) \in \mathcal{E}$. The Laplacian matrix is defined as $\mathbf{L} := \mathrm{diag}\{\mathbf{A}\mathbf{1}\} - \mathbf{A}$ and is symmetric and positive semidefinite [1, Chapter 2]. A real-valued function (or signal) on a graph is a map $f : \mathcal{V} \to \mathbb{R}$. The value $f(v)$ represents an attribute or feature of $v \in \mathcal{V}$, such as the age, political alignment or annual income of a person in a social network. Signal $f$ is thus represented by the vector $\mathbf{f} := [f(v_1), \ldots, f(v_N)]^T$.

Problem statement. Suppose that a collection of noisy samples (or observations) $\{y_s \,|\, y_s = f(v_{n_s}) + e_s\}_{s=1}^S$ is available, where $e_s$ models noise and $\mathcal{S} := \{n_1, \ldots, n_S\}$ contains the indices $1 \leq n_1 < \cdots < n_S \leq N$ of the sampled vertices, with $S \leq N$. Given $\{(n_s, y_s)\}_{s=1}^S$ and assuming knowledge of $\mathcal{G}$, the goal is to estimate $\mathbf{f}$. This will provide estimates of $f(v)$ both at observed and unobserved vertices. By defining $\mathbf{y} := [y_1, \ldots, y_S]^T$, the observation model is summarized as

$$\mathbf{y} = \mathbf{S}\mathbf{f} + \mathbf{e}, \tag{8.1}$$

where $\mathbf{e} := [e_1, \ldots, e_S]^T$ and $\mathbf{S}$ is a known $S \times N$ binary sampling matrix with entries $(s, n_s)$, $s = 1, \ldots, S$, set to one and the rest set to zero.
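To fix ideas, the following minimal NumPy sketch builds a sampling matrix $\mathbf{S}$ from a list of sampled vertex indices and generates noisy observations according to (8.1); the variable names and parameter values are illustrative only.

```python
import numpy as np

def sampling_matrix(N, sample_idx):
    """Binary S x N sampling matrix with entries (s, n_s) set to one, cf. (8.1)."""
    S = len(sample_idx)
    Smat = np.zeros((S, N))
    Smat[np.arange(S), sample_idx] = 1.0
    return Smat

# Illustrative use: N = 6 vertices, observe vertices 0, 2 and 5.
rng = np.random.default_rng(0)
N, sample_idx, noise_std = 6, [0, 2, 5], 0.1
f = rng.standard_normal(N)                    # true graph signal (placeholder)
Smat = sampling_matrix(N, sample_idx)
y = Smat @ f + noise_std * rng.standard_normal(len(sample_idx))   # y = S f + e
```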

8.2.1 Kernel Regression

Kernel methods constitute the “workhorse” of machine learning for nonlinear function estimation [18]. Their popularity can be attributed to their simplicity, flexibility and good performance. Here, we present kernel regression as a unifying framework for graph signal reconstruction along with the so-called representer theorem.

Kernel regression seeks an estimate of $f$ in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, which is the space of functions $f : \mathcal{V} \to \mathbb{R}$ defined as

$$\mathcal{H} := \Big\{ f : f(v) = \sum_{n=1}^N \alpha_n \kappa(v, v_n), \ \alpha_n \in \mathbb{R} \Big\}, \tag{8.2}$$

where the kernel map $\kappa : \mathcal{V} \times \mathcal{V} \to \mathbb{R}$ is any function defining a symmetric and positive semidefinite $N \times N$ matrix with entries $[\mathbf{K}]_{n,n'} := \kappa(v_n, v_{n'})$ [19]. Intuitively, $\kappa(v, v')$ is a basis function in (8.2) measuring similarity between the values of $f$ at $v$ and $v'$. (For a more detailed treatment of RKHSs, see, e.g., [3].)

Note that, for signals over graphs, the expansion in (8.2) is finite since $\mathcal{V}$ has finite cardinality. Thus, any $f \in \mathcal{H}$ can be expressed in the compact form

$$\mathbf{f} = \mathbf{K}\boldsymbol{\alpha} \tag{8.3}$$

for some $N \times 1$ vector $\boldsymbol{\alpha} := [\alpha_1, \ldots, \alpha_N]^T$.

Given two functions $f(v) := \sum_{n=1}^N \alpha_n \kappa(v, v_n)$ and $f'(v) := \sum_{n=1}^N \alpha_n' \kappa(v, v_n)$, their RKHS inner product is defined as¹

$$\langle f, f' \rangle_{\mathcal{H}} := \sum_{n=1}^N \sum_{n'=1}^N \alpha_n \alpha_{n'}' \kappa(v_n, v_{n'}) = \boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha}', \tag{8.4}$$

where $\boldsymbol{\alpha}' := [\alpha_1', \ldots, \alpha_N']^T$, and the reproducing property $\langle \kappa(\cdot, v_{n_0}), \kappa(\cdot, v_{n_0'}) \rangle_{\mathcal{H}} = \mathbf{i}_{N,n_0}^T \mathbf{K} \mathbf{i}_{N,n_0'} = \kappa(v_{n_0}, v_{n_0'})$ has been employed. The RKHS norm is defined by

$$\|f\|_{\mathcal{H}}^2 := \langle f, f \rangle_{\mathcal{H}} = \boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha} \tag{8.5}$$

and will be used as a regularizer to control overfitting and to cope with the underdetermined reconstruction problem. As a special case, setting $\mathbf{K} = \mathbf{I}_N$ recovers the standard inner product $\langle f, f' \rangle_{\mathcal{H}} = \mathbf{f}^T \mathbf{f}'$ and the Euclidean norm $\|f\|_{\mathcal{H}}^2 = \|\mathbf{f}\|_2^2$. Note that when $\mathbf{K} \succ \mathbf{0}$, the set of functions of the form (8.3) coincides with $\mathbb{R}^N$.

Given $\{y_s\}_{s=1}^S$, RKHS-based function estimators are obtained by solving functional minimization problems formulated as (see also, e.g., [18–20])

$$\hat{f} := \arg\min_{f \in \mathcal{H}} \mathcal{L}(\mathbf{y}, \bar{\mathbf{f}}) + \mu\, \Omega(\|f\|_{\mathcal{H}}), \tag{8.6}$$

where the loss $\mathcal{L}$ measures how the estimated function $f$ at the observed vertices $\{v_{n_s}\}_{s=1}^S$, collected in $\bar{\mathbf{f}} := [f(v_{n_1}), \ldots, f(v_{n_S})]^T = \mathbf{S}\mathbf{f}$, deviates from the data $\mathbf{y}$. The so-called square loss $\mathcal{L}(\mathbf{y}, \bar{\mathbf{f}}) := (1/S)\sum_{s=1}^S [y_s - f(v_{n_s})]^2$ constitutes a popular choice for $\mathcal{L}$. The increasing function Ω is used to promote smoothness, with typical choices including $\Omega(\zeta) = |\zeta|$ and $\Omega(\zeta) = \zeta^2$. The regularization parameter $\mu > 0$ controls overfitting. Substituting (8.3) and (8.5) into (8.6) shows that $\hat{\mathbf{f}}$ can be found as

$$\hat{\boldsymbol{\alpha}} := \arg\min_{\boldsymbol{\alpha} \in \mathbb{R}^N} \mathcal{L}(\mathbf{y}, \mathbf{S}\mathbf{K}\boldsymbol{\alpha}) + \mu\, \Omega\big((\boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha})^{1/2}\big), \tag{8.7a}$$

$$\hat{\mathbf{f}} = \mathbf{K}\hat{\boldsymbol{\alpha}}. \tag{8.7b}$$

An alternative form of (8.5) that will be frequently used in the sequel results upon noting that

$$\boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha} = \boldsymbol{\alpha}^T \mathbf{K} \mathbf{K}^{\dagger} \mathbf{K} \boldsymbol{\alpha} = \mathbf{f}^T \mathbf{K}^{\dagger} \mathbf{f}. \tag{8.8}$$

Thus, one can rewrite (8.6) as

$$\hat{\mathbf{f}} := \arg\min_{\mathbf{f} \in \mathcal{R}\{\mathbf{K}\}} \mathcal{L}(\mathbf{y}, \mathbf{S}\mathbf{f}) + \mu\, \Omega\big((\mathbf{f}^T \mathbf{K}^{\dagger} \mathbf{f})^{1/2}\big). \tag{8.9}$$

Although graph signals can be reconstructed from (8.7), such an approach involves optimizing over $N$ variables. Thankfully, the solution can be obtained by solving an optimization problem in $S$ variables (where typically $S \ll N$), by invoking the so-called representer theorem [19,21].

The representer theorem plays an instrumental role in the traditional infinite-dimensional setting where (8.6) cannot be solved directly; however, even when $\mathcal{H}$ comprises graph signals, it can still be beneficial to reduce the dimension of the optimization in (8.7). The theorem essentially asserts that the solution to the functional minimization in (8.6) can be expressed as

$$\hat{f}(v) = \sum_{s=1}^S \bar{\alpha}_s \kappa(v, v_{n_s}) \tag{8.10}$$

for some $\bar{\alpha}_s \in \mathbb{R}$, $s = 1, \ldots, S$.

The representer theorem shows the form of $\hat{f}$, but does not provide the optimal $\{\bar{\alpha}_s\}_{s=1}^S$, which are found by substituting (8.10) into (8.6) and solving the resulting optimization problem with respect to these coefficients. To this end, let $\bar{\boldsymbol{\alpha}} := [\bar{\alpha}_1, \ldots, \bar{\alpha}_S]^T$ and write $\boldsymbol{\alpha} = \mathbf{S}^T \bar{\boldsymbol{\alpha}}$ to deduce that

$$\hat{\mathbf{f}} = \mathbf{K}\boldsymbol{\alpha} = \mathbf{K}\mathbf{S}^T\bar{\boldsymbol{\alpha}}. \tag{8.11}$$

With $\bar{\mathbf{K}} := \mathbf{S}\mathbf{K}\mathbf{S}^T$ and using (8.7) and (8.11), the optimal $\bar{\boldsymbol{\alpha}}$ can be found as

$$\hat{\bar{\boldsymbol{\alpha}}} := \arg\min_{\bar{\boldsymbol{\alpha}} \in \mathbb{R}^S} \mathcal{L}(\mathbf{y}, \bar{\mathbf{K}}\bar{\boldsymbol{\alpha}}) + \mu\, \Omega\big((\bar{\boldsymbol{\alpha}}^T \bar{\mathbf{K}} \bar{\boldsymbol{\alpha}})^{1/2}\big). \tag{8.12}$$

Kernel ridge regression. For $\mathcal{L}$ chosen as the square loss and $\Omega(\zeta) = \zeta^2$, the $\hat{\mathbf{f}}$ in (8.6) is referred to as the kernel ridge regression (RR) estimate [18]. If $\bar{\mathbf{K}}$ is full rank, this estimate is given by $\hat{\mathbf{f}}_{\mathrm{RR}} = \mathbf{K}\mathbf{S}^T\hat{\bar{\boldsymbol{\alpha}}}$, where

$$\hat{\bar{\boldsymbol{\alpha}}} := \arg\min_{\bar{\boldsymbol{\alpha}} \in \mathbb{R}^S} \frac{1}{S}\|\mathbf{y} - \bar{\mathbf{K}}\bar{\boldsymbol{\alpha}}\|^2 + \mu\, \bar{\boldsymbol{\alpha}}^T \bar{\mathbf{K}} \bar{\boldsymbol{\alpha}} \tag{8.13a}$$

$$= (\bar{\mathbf{K}} + \mu S \mathbf{I}_S)^{-1}\mathbf{y}. \tag{8.13b}$$

Therefore, $\hat{\mathbf{f}}_{\mathrm{RR}}$ can be expressed as

$$\hat{\mathbf{f}}_{\mathrm{RR}} = \mathbf{K}\mathbf{S}^T(\bar{\mathbf{K}} + \mu S \mathbf{I}_S)^{-1}\mathbf{y}. \tag{8.14}$$

Section 8.2.2 shows that (8.14) generalizes a number of existing signal reconstructors upon properly selecting K.
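For concreteness, a minimal sketch of the ridge regression estimate (8.13b)–(8.14) follows; it assumes a given kernel matrix K and an invertible $\bar{\mathbf{K}} + \mu S \mathbf{I}_S$, and uses a linear solve rather than an explicit matrix inverse.

```python
import numpy as np

def krr_reconstruct(K, Smat, y, mu):
    """Kernel ridge regression f_RR = K S^T (Kbar + mu*S*I)^{-1} y, cf. (8.14)."""
    S = Smat.shape[0]                    # number of observed vertices
    Kbar = Smat @ K @ Smat.T             # S x S kernel restricted to the sampled vertices
    alpha_bar = np.linalg.solve(Kbar + mu * S * np.eye(S), y)   # (8.13b)
    return K @ Smat.T @ alpha_bar        # estimate at every vertex of the graph
```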

8.2.2 Kernels on Graphs

When estimating functions on graphs, conventional kernels such as the Gaussian kernel cannot be adopted because the underlying set where graph signals are defined is not a metric space. Indeed, no vertex addition $v_n + v_{n'}$, scaling $\beta v_n$ or norm $\|v_n\|$ can be naturally defined on $\mathcal{V}$. An alternative is to embed $\mathcal{V}$ into a Euclidean space via a feature map $\phi : \mathcal{V} \to \mathbb{R}^D$ and invoke a conventional kernel afterwards. However, for a given graph it is generally unclear how to explicitly design ϕ or select D. This motivates the adoption of kernels on graphs [3].

A common approach to designing kernels on graphs is to apply a transformation function on the graph Laplacian [3]. The term Laplacian kernel comprises a wide family of kernels obtained by applying a certain function r()Image to the Laplacian matrix L. Laplacian kernels are well motivated since they constitute the graph counterpart of the so-called translation-invariant kernels in Euclidean spaces [3]. This section reviews Laplacian kernels, provides beneficial insights in terms of interpolating signals and highlights their versatility in capturing information about the graph Fourier transform of the estimated signal.

The reason why the graph Laplacian constitutes one of the prominent candidates for regularization on graphs becomes clear upon recognizing that

$$\mathbf{f}^T \mathbf{L} \mathbf{f} = \frac{1}{2} \sum_{(n,n') \in \mathcal{E}} A_{n,n'} (f_n - f_{n'})^2, \tag{8.15}$$

where $A_{n,n'}$ denotes the weight associated with edge $(n, n')$. The quadratic form in (8.15) becomes larger when function values vary considerably among connected vertices and therefore quantifies the smoothness of $f$ on $\mathcal{G}$.

Let $0 = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_N$ denote the eigenvalues of the graph Laplacian matrix $\mathbf{L}$ and consider the eigendecomposition $\mathbf{L} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T$, where $\boldsymbol{\Lambda} := \mathrm{diag}\{\lambda_1, \ldots, \lambda_N\}$. A Laplacian kernel matrix is defined by

$$\mathbf{K} := r^{\dagger}(\mathbf{L}) := \mathbf{U}\, r^{\dagger}(\boldsymbol{\Lambda})\, \mathbf{U}^T, \tag{8.16}$$

where $r(\boldsymbol{\Lambda})$ is the result of applying a user-selected, scalar, nonnegative map $r : \mathbb{R} \to \mathbb{R}_+$ to the diagonal entries of $\boldsymbol{\Lambda}$, and $\dagger$ denotes (scalar) pseudoinversion of those diagonal entries. The selection of the map $r$ generally depends on desirable properties that the target function is expected to have. Table 8.1 summarizes some well-known examples arising for specific choices of $r$; an illustrative construction in code follows the table.

Table 8.1

Common spectral weight functions

| Kernel name | Function | Parameters |
| --- | --- | --- |
| Diffusion [2] | $r(\lambda) = \exp\{\sigma^2 \lambda / 2\}$ | $\sigma^2$ |
| p-step random walk [3] | $r(\lambda) = (a - \lambda)^{-p}$ | $a \geq 2$, $p \geq 0$ |
| Regularized Laplacian [3,22] | $r(\lambda) = 1 + \sigma^2 \lambda$ | $\sigma^2$ |
| Bandlimited [23] | $r(\lambda) = \begin{cases} 1/\beta, & \lambda \leq \lambda_{\max} \\ \beta, & \text{otherwise} \end{cases}$ | $\beta > 0$, $\lambda_{\max}$ |
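As a rough sketch of how the kernels of Table 8.1 can be instantiated numerically, the code below forms $\mathbf{K} = \mathbf{U}\,r^{\dagger}(\boldsymbol{\Lambda})\,\mathbf{U}^T$ for the diffusion and bandlimited spectral weights; the helper names are ours, and $r^{\dagger}(\lambda)$ is implemented as $1/r(\lambda)$ (with tiny values zeroed).

```python
import numpy as np

def laplacian_kernel(L, r):
    """K = U r_dagger(Lambda) U^T for a spectral weight function r, cf. (8.16)."""
    lam, U = np.linalg.eigh(L)                             # eigendecomposition of the Laplacian
    r_vals = np.array([r(l) for l in lam])
    r_dag = np.where(r_vals > 1e-12, 1.0 / r_vals, 0.0)    # scalar pseudoinverse of r(lambda)
    return (U * r_dag) @ U.T                               # equals U @ diag(r_dag) @ U.T

# Example spectral weights from Table 8.1 (sigma2, beta, lam_max are user-chosen).
diffusion   = lambda lam, sigma2=1.0: np.exp(sigma2 * lam / 2.0)
bandlimited = lambda lam, beta=1e3, lam_max=1.0: (1.0 / beta if lam <= lam_max else beta)

# Usage: given an adjacency matrix A, L = np.diag(A.sum(1)) - A, then
# K_diff = laplacian_kernel(L, diffusion).
```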

At this point, it is prudent to offer interpretations and insights on the operation of Laplacian kernels. Towards this objective, note first that the regularizer from (8.9) is an increasing function of

$$\boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha} = \boldsymbol{\alpha}^T \mathbf{K} \mathbf{K}^{\dagger} \mathbf{K} \boldsymbol{\alpha} = \mathbf{f}^T \mathbf{K}^{\dagger} \mathbf{f} = \mathbf{f}^T \mathbf{U}\, r(\boldsymbol{\Lambda})\, \mathbf{U}^T \mathbf{f} = \check{\mathbf{f}}^T r(\boldsymbol{\Lambda}) \check{\mathbf{f}} = \sum_{n=1}^N r(\lambda_n) |\check{f}_n|^2, \tag{8.17}$$

where $\check{\mathbf{f}} := \mathbf{U}^T\mathbf{f} := [\check{f}_1, \ldots, \check{f}_N]^T$ comprises the projections of $\mathbf{f}$ onto the eigenspace of $\mathbf{L}$ and is referred to as the graph Fourier transform of $\mathbf{f}$ in SPoG parlance [4]. Consequently, $\{\check{f}_n\}_{n=1}^N$ are called frequency components. The so-called bandlimited (BL) functions in SPoG are those whose frequency components only exist inside some band $B$, that is, $\check{f}_n = 0\ \forall n > B$.

By adopting the aforementioned SPoG notions, one can intuitively interpret the role of Laplacian kernels. Indeed, it follows from (8.17) that the regularizer strongly penalizes those $\check{f}_n$ for which the corresponding $r(\lambda_n)$ is large, thus promoting a specific structure in this "frequency" domain. Specifically, one prefers $r(\lambda_n)$ to be large whenever $|\check{f}_n|^2$ is small, and vice versa. The fact that $|\check{f}_n|^2$ is expected to decrease with $n$ for smooth $\mathbf{f}$ motivates the adoption of an increasing function $r$ [3]. From (8.17) it is clear that $r(\lambda_n)$ determines how heavily $\check{f}_n$ is penalized. Therefore, by setting $r(\lambda_n)$ to be small when $n \leq B$ and extremely large when $n > B$, one can expect the result to be a BL signal.

Observe that Laplacian kernels can capture forms of prior information richer than bandlimitedness [11,13,16,17] by selecting the function $r$ accordingly. For instance, using $r(\lambda) = \exp\{\sigma^2\lambda/2\}$ (diffusion kernel) accounts not only for smoothness of $\mathbf{f}$ as in (8.15), but also for the prior knowledge that $\mathbf{f}$ is generated by a process diffusing over the graph. Similarly, the use of $r(\lambda) = (a - \lambda)^{-1}$ (1-step random walk) can accommodate cases where the signal captures a notion of network centrality.²

So far, $\mathbf{f}$ has been assumed deterministic, which precludes accommodating certain forms of prior information that probabilistic models can capture, such as domain knowledge and historical data. Suppose without loss of generality that $\{f(v_n)\}_{n=1}^N$ are zero-mean random variables. The LMMSE estimator of $\mathbf{f}$ given $\mathbf{y}$ in (8.1) is the linear estimator $\hat{\mathbf{f}}_{\mathrm{LMMSE}}$ minimizing $\mathbb{E}\|\mathbf{f} - \hat{\mathbf{f}}_{\mathrm{LMMSE}}\|_2^2$, where the expectation is over all $\mathbf{f}$ and noise realizations. With $\mathbf{C} := \mathbb{E}[\mathbf{f}\mathbf{f}^T]$, the LMMSE estimate is

$$\hat{\mathbf{f}}_{\mathrm{LMMSE}} = \mathbf{C}\mathbf{S}^T\big[\mathbf{S}\mathbf{C}\mathbf{S}^T + \sigma_e^2\mathbf{I}_S\big]^{-1}\mathbf{y}, \tag{8.18}$$

where $\sigma_e^2 := (1/S)\,\mathbb{E}[\|\mathbf{e}\|_2^2]$ denotes the noise variance. Comparing (8.18) with (8.14) and recalling that $\bar{\mathbf{K}} := \mathbf{S}\mathbf{K}\mathbf{S}^T$, it follows that $\hat{\mathbf{f}}_{\mathrm{LMMSE}} = \hat{\mathbf{f}}_{\mathrm{RR}}$ if $\mu S = \sigma_e^2$ and $\mathbf{K} = \mathbf{C}$. In other words, the similarity measure $\kappa(v_n, v_{n'})$ embodied in such a kernel map is just the covariance $\mathrm{cov}[f(v_n), f(v_{n'})]$. A related observation was pointed out in [27] for general kernel methods.

In short, one can interpret kernel ridge regression as the LMMSE estimator of a signal f with covariance matrix equal to K; see also [28]. The LMMSE interpretation also suggests the usage of C as a kernel matrix, which enables signal reconstruction even when the graph topology is unknown. Although this discussion hinges on kernel ridge regression after setting K=CImage, any other kernel estimator of the form (8.7) can benefit from vertex covariance kernels too.
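The equivalence between (8.14) with $\mathbf{K} = \mathbf{C}$, $\mu S = \sigma_e^2$ and the LMMSE estimator (8.18) can be verified numerically in a few lines; the sketch below draws $\mathbf{f}$ from a zero-mean Gaussian with an arbitrary covariance purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, S, sigma_e = 8, 3, 0.2
H = rng.standard_normal((N, N))
C = H @ H.T + np.eye(N)                                 # a valid covariance (used as kernel)
Smat = np.eye(N)[rng.choice(N, S, replace=False)]       # random S x N sampling matrix
f = rng.multivariate_normal(np.zeros(N), C)
y = Smat @ f + sigma_e * rng.standard_normal(S)

# (8.18): LMMSE estimate.
f_lmmse = C @ Smat.T @ np.linalg.solve(Smat @ C @ Smat.T + sigma_e**2 * np.eye(S), y)
# (8.14) with K = C and mu chosen so that mu * S = sigma_e^2.
mu = sigma_e**2 / S
f_rr = C @ Smat.T @ np.linalg.solve(Smat @ C @ Smat.T + mu * S * np.eye(S), y)
assert np.allclose(f_lmmse, f_rr)
```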

8.2.3 Selecting Kernels From a Dictionary

The selection of the pertinent kernel matrix is of paramount importance to the performance of kernel-based methods [23,29]. This section presents an MKL approach that effects kernel selection in graph signal reconstruction. Two algorithms with complementary strengths will be presented. Both rely on a user-specified kernel dictionary, and the best kernel is built from the dictionary in a data-driven way.

The first algorithm, which we call RKHS superposition, is motivated by the fact that a specific $\mathcal{H}$ in (8.6) is determined by some κ; therefore, kernel selection is tantamount to RKHS selection. Consequently, a kernel dictionary $\{\kappa_m\}_{m=1}^M$ gives rise to an RKHS dictionary $\{\mathcal{H}_m\}_{m=1}^M$, which motivates estimates of the form³

$$\hat{f} = \sum_{m=1}^M \hat{f}_m, \qquad \hat{f}_m \in \mathcal{H}_m. \tag{8.19}$$

Upon adopting a criterion that promotes sparsity in this expansion, the "best" RKHSs will be selected. A reasonable approach is therefore to generalize (8.6) to accommodate multiple RKHSs. With $\mathcal{L}$ selected to be the square loss and $\Omega(\zeta) = |\zeta|$, one can pursue an estimate $\hat{f}$ by solving

$$\min_{\{f_m \in \mathcal{H}_m\}_{m=1}^M} \frac{1}{S}\sum_{s=1}^S \Big[y_s - \sum_{m=1}^M f_m(v_{n_s})\Big]^2 + \mu \sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}. \tag{8.20}$$

Invoking the representer theorem per $f_m$ establishes that the minimizers of (8.20) can be written as

$$\hat{f}_m(v) = \sum_{s=1}^S \bar{\alpha}_{ms}\, \kappa_m(v, v_{n_s}), \qquad m = 1, \ldots, M \tag{8.21}$$

for some coefficients $\bar{\alpha}_{ms}$. Substituting (8.21) into (8.20) suggests obtaining these coefficients as

$$\arg\min_{\{\bar{\boldsymbol{\alpha}}_m\}_{m=1}^M} \frac{1}{S}\Big\|\mathbf{y} - \sum_{m=1}^M \bar{\mathbf{K}}_m \bar{\boldsymbol{\alpha}}_m\Big\|^2 + \mu \sum_{m=1}^M \big(\bar{\boldsymbol{\alpha}}_m^T \bar{\mathbf{K}}_m \bar{\boldsymbol{\alpha}}_m\big)^{1/2}, \tag{8.22}$$

where $\bar{\boldsymbol{\alpha}}_m := [\bar{\alpha}_{m1}, \ldots, \bar{\alpha}_{mS}]^T$ and $\bar{\mathbf{K}}_m := \mathbf{S}\mathbf{K}_m\mathbf{S}^T$ with $(\mathbf{K}_m)_{n,n'} := \kappa_m(v_n, v_{n'})$. Letting $\check{\bar{\boldsymbol{\alpha}}}_m := \bar{\mathbf{K}}_m^{1/2}\bar{\boldsymbol{\alpha}}_m$, Eq. (8.22) becomes

$$\arg\min_{\{\check{\bar{\boldsymbol{\alpha}}}_m\}_{m=1}^M} \frac{1}{S}\Big\|\mathbf{y} - \sum_{m=1}^M \bar{\mathbf{K}}_m^{1/2}\check{\bar{\boldsymbol{\alpha}}}_m\Big\|^2 + \mu \sum_{m=1}^M \|\check{\bar{\boldsymbol{\alpha}}}_m\|_2. \tag{8.23}$$

Interestingly, (8.23) can be efficiently solved using the alternating-direction method of multipliers (ADMM) [30,31] after some necessary reformulation [23].

After obtaining $\{\check{\bar{\boldsymbol{\alpha}}}_m\}_{m=1}^M$, the sought function estimate can be recovered as

$$\hat{\mathbf{f}} = \sum_{m=1}^M \mathbf{K}_m\mathbf{S}^T\bar{\boldsymbol{\alpha}}_m = \sum_{m=1}^M \mathbf{K}_m\mathbf{S}^T\bar{\mathbf{K}}_m^{-1/2}\check{\bar{\boldsymbol{\alpha}}}_m. \tag{8.24}$$

This MKL algorithm can identify the best subset of RKHSs—and therefore kernels—but entails $MS$ unknowns (cf. (8.22)). Next, an alternative approach is discussed that reduces the number of variables to $M + S$, at the price of not being able to guarantee a sparse kernel expansion.

The alternative approach is to postulate a kernel of the form $\mathbf{K}(\boldsymbol{\theta}) = \sum_{m=1}^M \theta_m \mathbf{K}_m$, where $\{\mathbf{K}_m\}_{m=1}^M$ is given and $\theta_m \geq 0\ \forall m$. The coefficients $\boldsymbol{\theta} := [\theta_1, \ldots, \theta_M]^T$ can be found by jointly minimizing (8.12) with respect to $\boldsymbol{\theta}$ and $\bar{\boldsymbol{\alpha}}$ [32]. We have

$$(\hat{\boldsymbol{\theta}}, \hat{\bar{\boldsymbol{\alpha}}}) := \arg\min_{\boldsymbol{\theta}, \bar{\boldsymbol{\alpha}}} \mathcal{L}\big(\mathbf{y}, \bar{\mathbf{K}}(\boldsymbol{\theta})\bar{\boldsymbol{\alpha}}\big) + \mu\, \Omega\big((\bar{\boldsymbol{\alpha}}^T\bar{\mathbf{K}}(\boldsymbol{\theta})\bar{\boldsymbol{\alpha}})^{1/2}\big), \tag{8.25}$$

where $\bar{\mathbf{K}}(\boldsymbol{\theta}) := \mathbf{S}\mathbf{K}(\boldsymbol{\theta})\mathbf{S}^T$. Except for degenerate cases, problem (8.25) is not jointly convex in $\boldsymbol{\theta}$ and $\bar{\boldsymbol{\alpha}}$, but it is separately convex in each vector for a convex $\mathcal{L}$ [32]. Iterative algorithms for solving (8.23) and (8.25) are available in [23].
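A minimal sketch of one such iterative scheme for the kernel-superposition formulation (8.25) follows, assuming the square loss and $\Omega(\zeta) = \zeta^2$: it alternates between the closed-form ridge solution for $\bar{\boldsymbol{\alpha}}$ (cf. (8.13b)) and projected gradient steps on $\boldsymbol{\theta} \geq \mathbf{0}$. Step sizes and iteration counts are arbitrary illustrative choices, not the algorithm of [23].

```python
import numpy as np

def mkl_superposition(Kbars, y, mu, n_outer=20, n_pgd=50, step=1e-3):
    """Alternating minimization of (8.25) with K(theta) = sum_m theta_m K_m (square loss)."""
    M, S = len(Kbars), len(y)
    theta = np.ones(M) / M                                        # start from uniform weights
    for _ in range(n_outer):
        Kbar = sum(t * Kb for t, Kb in zip(theta, Kbars))
        alpha = np.linalg.solve(Kbar + mu * S * np.eye(S), y)     # ridge step for fixed theta
        for _ in range(n_pgd):                                    # projected gradient on theta >= 0
            resid = y - sum(t * Kb for t, Kb in zip(theta, Kbars)) @ alpha
            grad = np.array([-2.0 / S * resid @ (Kb @ alpha) + mu * alpha @ Kb @ alpha
                             for Kb in Kbars])
            theta = np.maximum(theta - step * grad, 0.0)
    return theta, alpha
```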

8.2.4 Semiparametric Reconstruction

The approaches discussed so far are applicable to various problems but they are certainly limited by the modeling assumptions they make. In particular, the performance of algorithms belonging to the parametric family [11,15,33] is restricted by how well the signals actually adhere to the selected model. Nonparametric models, on the other hand [2,3,6,34], offer flexibility and robustness but they cannot readily incorporate information available a priori.

In practice however, it is not uncommon that neither of these approaches alone suffices for reliable inference. Consider, for instance, an employment-oriented social network such as LinkedIn, and suppose the goal is to estimate the salaries of all users given information about the salaries of a few. Clearly, besides network connections, exploiting available information regarding the users' education level and work experience could benefit the reconstruction task. The same is true in problems arising in link analysis, where the exploitation of hierarchical web structure can aid the task of estimating the importance of web pages [35]. In recommender systems, inferring preference scores for every item, given the users' feedback about particular items, could be cast as a signal reconstruction problem over the item correlation graph. Data sparsity imposes severe limitations on the quality of pure collaborative filtering methods [64]. Exploiting side information about the items is known to alleviate such limitations [36], leading to considerably improved recommendation performance [37,38].

A promising direction to endow nonparametric methods with prior information relies on a semiparametric approach whereby the signal of interest is modeled as the superposition of a parametric and a nonparametric component [39]. While the former leverages side information, the latter accounts for deviations from the parametric part and can also promote smoothness using kernels on graphs. In this section we outline two simple and reliable semiparametric estimators with complementary strengths, as detailed in [39].

8.2.4.1 Semiparametric Inference

Function $f$ is modeled as the superposition⁴

$$\mathbf{f} = \mathbf{f}_P + \mathbf{f}_{NP}, \tag{8.26}$$

where $\mathbf{f}_P := [f_P(v_1), \ldots, f_P(v_N)]^T$ and $\mathbf{f}_{NP} := [f_{NP}(v_1), \ldots, f_{NP}(v_N)]^T$.

The parametric term $f_P(v) := \sum_{m=1}^M \beta_m b_m(v)$ captures the known signal structure via the basis $\mathcal{B} := \{b_m\}_{m=1}^M$, while the nonparametric term $f_{NP}$ belongs to an RKHS $\mathcal{H}$ and accounts for deviations from the span of $\mathcal{B}$. The goal of this section is to provide an efficient and reliable estimate of $\mathbf{f}$ given $\mathbf{y}$, $\mathbf{S}$, $\mathcal{B}$, $\mathcal{H}$ and $\mathcal{G}$.

Since $f_{NP} \in \mathcal{H}$, the vector $\mathbf{f}_{NP}$ can be represented as in (8.3). By defining $\boldsymbol{\beta} := [\beta_1, \ldots, \beta_M]^T$ and the $N \times M$ matrix $\mathbf{B}$ with entries $(\mathbf{B})_{n,m} := b_m(v_n)$, the parametric term can be written in vector form as $\mathbf{f}_P := \mathbf{B}\boldsymbol{\beta}$. The semiparametric estimates can be found as the solution of the following optimization problem:

$$\{\hat{\boldsymbol{\alpha}}, \hat{\boldsymbol{\beta}}\} = \arg\min_{\boldsymbol{\alpha}, \boldsymbol{\beta}} \frac{1}{S}\sum_{s=1}^S \mathcal{L}\big(y_s, f(v_{n_s})\big) + \mu\|f_{NP}\|_{\mathcal{H}}^2 \quad \text{s.t.} \quad \mathbf{f} = \mathbf{B}\boldsymbol{\beta} + \mathbf{K}\boldsymbol{\alpha}, \tag{8.27}$$

where the fitting loss $\mathcal{L}$ quantifies the deviation of $f$ from the data and $\mu > 0$ is the regularization scalar that controls overfitting of the nonparametric term. Using (8.27), the semiparametric estimate is expressed as $\hat{\mathbf{f}} = \mathbf{B}\hat{\boldsymbol{\beta}} + \mathbf{K}\hat{\boldsymbol{\alpha}}$.

Solving (8.27) entails minimization over $N + M$ variables. Clearly, when dealing with large-scale graphs this could lead to prohibitively large computational costs. To reduce complexity, the semiparametric version of the representer theorem [18,19] is employed, which establishes that

$$\hat{\mathbf{f}} = \mathbf{B}\hat{\boldsymbol{\beta}} + \mathbf{K}\mathbf{S}^T\hat{\bar{\boldsymbol{\alpha}}}, \tag{8.28}$$

where $\hat{\bar{\boldsymbol{\alpha}}} := [\hat{\bar{\alpha}}_1, \ldots, \hat{\bar{\alpha}}_S]^T$. The estimates $\hat{\bar{\boldsymbol{\alpha}}}, \hat{\boldsymbol{\beta}}$ are found as

$$\{\hat{\bar{\boldsymbol{\alpha}}}, \hat{\boldsymbol{\beta}}\} = \arg\min_{\bar{\boldsymbol{\alpha}}, \boldsymbol{\beta}} \frac{1}{S}\sum_{s=1}^S \mathcal{L}\big(y_s, f(v_{n_s})\big) + \mu\|f_{NP}\|_{\mathcal{H}}^2 \quad \text{s.t.} \quad \mathbf{f} = \mathbf{B}\boldsymbol{\beta} + \mathbf{K}\mathbf{S}^T\bar{\boldsymbol{\alpha}}, \tag{8.29}$$

where $\bar{\boldsymbol{\alpha}} := [\bar{\alpha}_1, \ldots, \bar{\alpha}_S]^T$. The RKHS norm in (8.29) is expressed as $\|f_{NP}\|_{\mathcal{H}}^2 = \bar{\boldsymbol{\alpha}}^T\bar{\mathbf{K}}\bar{\boldsymbol{\alpha}}$, with $\bar{\mathbf{K}} := \mathbf{S}\mathbf{K}\mathbf{S}^T$. Relative to (8.27), the number of optimization variables in (8.29) is reduced to the more affordable $S + M$, with $S \ll N$.

Next, two loss functions with complementary benefits will be considered: the square loss and the ϵ-insensitive loss. The square loss function is

$$\mathcal{L}\big(y_s, f(v_{n_s})\big) := \big[y_s - f(v_{n_s})\big]^2 \tag{8.30}$$

and (8.29) admits the following closed-form solution:

$$\hat{\bar{\boldsymbol{\alpha}}} = (\mathbf{P}\bar{\mathbf{K}} + \mu\mathbf{I}_S)^{-1}\mathbf{P}\mathbf{y}, \tag{8.31a}$$

$$\hat{\boldsymbol{\beta}} = (\bar{\mathbf{B}}^T\bar{\mathbf{B}})^{-1}\bar{\mathbf{B}}^T(\mathbf{y} - \bar{\mathbf{K}}\hat{\bar{\boldsymbol{\alpha}}}), \tag{8.31b}$$

where $\bar{\mathbf{B}} := \mathbf{S}\mathbf{B}$ and $\mathbf{P} := \mathbf{I}_S - \bar{\mathbf{B}}(\bar{\mathbf{B}}^T\bar{\mathbf{B}})^{-1}\bar{\mathbf{B}}^T$. The complexity of (8.31) is $\mathcal{O}(S^3 + M^3)$.
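The closed-form estimates (8.31) translate directly into a few lines of NumPy, sketched below under the assumption that $\bar{\mathbf{B}}^T\bar{\mathbf{B}}$ and $\mathbf{P}\bar{\mathbf{K}} + \mu\mathbf{I}_S$ are invertible.

```python
import numpy as np

def semiparametric_estimate(K, B, Smat, y, mu):
    """Square-loss semiparametric estimator, cf. (8.28) and (8.31)."""
    S = Smat.shape[0]
    Kbar, Bbar = Smat @ K @ Smat.T, Smat @ B
    P = np.eye(S) - Bbar @ np.linalg.solve(Bbar.T @ Bbar, Bbar.T)        # projector onto span(Bbar)'s complement
    alpha_bar = np.linalg.solve(P @ Kbar + mu * np.eye(S), P @ y)        # (8.31a)
    beta = np.linalg.solve(Bbar.T @ Bbar, Bbar.T @ (y - Kbar @ alpha_bar))  # (8.31b)
    return B @ beta + K @ Smat.T @ alpha_bar                             # f_hat = B beta + K S^T alpha_bar
```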

The ϵ-insensitive loss function is given by

$$\mathcal{L}\big(y_s, f(v_{n_s})\big) := \max\big(0, |y_s - f(v_{n_s})| - \epsilon\big), \tag{8.32}$$

where ϵ is tuned, e.g., via cross-validation, to minimize the generalization error; this loss has well-documented merits in signal estimation from quantized data [40]. Substituting (8.32) into (8.29) yields a convex nonsmooth quadratic problem that can be solved efficiently for $\bar{\boldsymbol{\alpha}}$ and $\boldsymbol{\beta}$ using, e.g., interior-point methods [18].

8.2.5 Numerical Tests

This section reports on the signal reconstruction performance of different methods using real as well as synthetic data. The performance of the estimators is assessed via Monte Carlo simulation by comparing the normalized mean square error (NMSE)

$$\mathrm{NMSE} = \mathbb{E}\left[\frac{\|\hat{\mathbf{f}} - \mathbf{f}\|^2}{\|\mathbf{f}\|^2}\right]. \tag{8.33}$$
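In the experiments, the expectation in (8.33) is approximated by a sample average over Monte Carlo runs; a minimal helper is sketched here.

```python
import numpy as np

def nmse(f_hat_runs, f_runs):
    """Sample-average estimate of NMSE = E[ ||f_hat - f||^2 / ||f||^2 ], cf. (8.33)."""
    ratios = [np.sum((fh - f) ** 2) / np.sum(f ** 2)
              for fh, f in zip(f_hat_runs, f_runs)]
    return np.mean(ratios)
```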

Multikernel reconstruction. The first data set contains departure and arrival information for flights among U.S. airports [41], from which $3 \times 10^6$ flights in the months of July, August and September of 2014 and 2015 were selected. We construct a graph with $N = 50$ vertices corresponding to the airports with the highest traffic, and we connect two airports with an edge whenever the number of flights between them exceeds 100 within the observation window.

A signal was constructed per day by averaging the arrival delay of all inbound flights per selected airport. A total of 184 signals were considered, of which the first 154 were used for training (July, August, September 2014 and July, August 2015) and the remaining 30 for testing (September 2015). The weights of the edges between airports were learned from the training data based on the technique described in [23].

Table 8.2 lists the NMSE and the RMSE in minutes for the task of predicting the arrival delay at 40 airports when the delay at a randomly selected collection of 10 airports is observed. The second row corresponds to the ridge regression estimator that uses the nearly optimal estimated covariance kernel. The next two rows correspond to the multikernel approaches in §8.2.3 with a dictionary of 30 diffusion kernels with values of $\sigma^2$ uniformly spaced between 0.1 and 7. The remaining rows pertain to graph BL estimators. Table 8.2 demonstrates the reliable performance of covariance kernels as well as of the discussed multikernel approaches relative to competing alternatives.

Table 8.2

Multikernel reconstruction

| | NMSE | RMSE [min] |
| --- | --- | --- |
| KRR with cov. kernel | 0.34 | 3.95 |
| Multikernel, RS | 0.44 | 4.51 |
| Multikernel, KS | 0.43 | 4.45 |
| BL for B = 2 | 1.55 | 8.45 |
| BL for B = 3 | 32.64 | 38.72 |
| BL, cut-off | 3.97 | 13.5 |

Semiparametric reconstruction. An Erdős–Rényi graph with probability of edge presence 0.6 and $N = 200$ nodes was generated, and $\mathbf{f}$ was formed by superimposing a BL signal [13,15] and a piece-wise constant signal [42]; that is,

$$\mathbf{f} = \sum_{i=1}^{10} \gamma_i \mathbf{u}_i + \sum_{i=1}^{6} \delta_i \mathbf{1}_{\mathcal{V}_i}, \tag{8.34}$$

where $\{\gamma_i\}_{i=1}^{10}$ and $\{\delta_i\}_{i=1}^{6}$ are standardized Gaussian random variables, $\{\mathbf{u}_i\}_{i=1}^{10}$ are the eigenvectors associated with the 10 smallest eigenvalues of the Laplacian matrix, $\{\mathcal{V}_i\}_{i=1}^{6}$ are the vertex sets of six clusters obtained via spectral clustering [43] and $\mathbf{1}_{\mathcal{V}_i}$ is the indicator vector with entries $(\mathbf{1}_{\mathcal{V}_i})_n := 1$ if $v_n \in \mathcal{V}_i$, and 0 otherwise. The parametric basis $\mathcal{B} = \{\mathbf{1}_{\mathcal{V}_i}\}_{i=1}^{6}$ was used by the estimators capturing the prior knowledge, and $S$ vertices were sampled uniformly at random. The subsequent experiments evaluate the performance of the semiparametric graph kernel estimators, SP-GK and SP-GK(ϵ), obtained by using (8.30) and (8.32) in (8.29), respectively; the parametric (P) estimator that considers only the parametric term in (8.26); the nonparametric (NP) estimator [2,3] that considers only the nonparametric term in (8.26); and the graph BL estimators from [13,15], which assume a BL model with bandwidth B. For all the experiments, the diffusion kernel (cf. Table 8.1) with parameter σ is employed. First, white Gaussian noise $e_s$ of variance $\sigma_e^2$ is added to each sample $f_s$ to yield the signal-to-noise ratio $\mathrm{SNR}_e := \|\mathbf{f}\|_2^2/(N\sigma_e^2)$. Fig. 8.1B presents the NMSE of the different methods. As expected, the limited flexibility of the parametric approaches, BL and P, affects their ability to capture the true signal structure. The NP estimator achieves smaller NMSE, but only when the number of available samples is adequate. Both semiparametric estimators were found to outperform the other approaches, exhibiting reliable reconstruction even with few samples.

Figure 8.1 NMSE of the synthetic signal estimates. (A) μ = 5 × 10⁻⁴, σ = 5 × 10⁻⁴, SNRe = 5 dB. (B) μ = 5 × 10⁻⁴, σ = 5 × 10⁻⁴, ϵ = 10⁻⁴ and SNRo = −5 dB.

To illustrate the benefits of employing the different loss functions (8.30) and (8.32), we compare the performance of SP-GK and SP-GK(ϵ) in the presence of outlying noise. Each sample $f_s$ is contaminated with Gaussian noise $o_s$ of large variance $\sigma_o^2$ with probability $p = 0.1$. Fig. 8.1A demonstrates the robustness of SP-GK(ϵ), which is attributed to the ϵ-insensitive loss function (8.32). Further experiments using real signals can be found in [39].

8.3 Inference of Dynamic Functions Over Dynamic Graphs

Networks that exhibit time varying connectivity patterns with time varying node attributes arise in a plethora of network science–related applications. Sometimes these dynamic network topologies switch between a finite number of discrete states, governed by sudden changes of the underlying dynamics [44,45]. A challenging problem that arises in this setting is that of reconstructing time evolving functions on graphs, given their values on a subset of vertices and time instants. Efficiently exploiting spatiotemporal dynamics can markedly impact sampling costs by reducing the number of vertices that need to be observed to attain a target performance. Such a reduction can be of paramount importance in certain applications, e.g., in monitoring time-dependent activity of different regions of the brain through invasive electrocorticography (ECoG), where observing a vertex requires the implantation of an intracranial electrode [44].

Although one could reconstruct a time varying function per time slot using the non- or semiparametric methods of §8.2, leveraging time correlations typically yields estimators with improved performance. Schemes tailored for time evolving functions on graphs include [46] and [47], which predict the function values at time $t$ given observations up to time $t - 1$. However, these schemes assume that the function of interest adheres to a specific vector autoregressive model. Other works target time-invariant functions and can only afford to track sufficiently slow variations. This is the case with the dictionary learning approach in [48] and the distributed algorithms in [49] and [50]. Unfortunately, the flexibility of these algorithms to capture spatial information is also limited, since [48] focuses on Laplacian regularization, whereas [49] and [50] require the signal to be BL.

Motivated by the aforementioned limitations, in what comes next we extend the framework presented in §8.2 to accommodate time varying function reconstruction over dynamic graphs. But before we delve into the time varying setting, a few definitions are in order.

Definitions: A time varying graph is a tuple $\mathcal{G}(t) := (\mathcal{V}, \mathbf{A}_t)$, where $\mathcal{V} := \{v_1, \ldots, v_N\}$ is the vertex set and $\mathbf{A}_t \in \mathbb{R}^{N \times N}$ is the adjacency matrix at time $t$, whose $(n, n')$th entry $A_{n,n'}(t)$ assigns a weight to the pair of vertices $(v_n, v_{n'})$ at time $t$. A time-invariant graph is a special case with $\mathbf{A}_t = \mathbf{A}_{t'}\ \forall t, t'$. Adopting common assumptions made in the related literature (e.g., [1, Chapter 2], [4,9]), we also require $\mathcal{G}(t)$ (i) to have nonnegative weights ($A_{n,n'}(t) \geq 0\ \forall t$ and $n \neq n'$), (ii) to have no self-edges ($A_{n,n}(t) = 0\ \forall n, t$) and (iii) to be undirected ($A_{n,n'}(t) = A_{n',n}(t)\ \forall n, n', t$).

A time varying function or signal on a graph is a map $f : \mathcal{V} \times \mathcal{T} \to \mathbb{R}$, where $\mathcal{T} := \{1, 2, \ldots\}$ is the set of time indices. The value $f(v_n, t)$ of $f$ at vertex $v_n$ and time $t$ can be thought of as the value of an attribute of $v_n \in \mathcal{V}$ at time $t$. The values of $f$ at time $t$ are collected in $\mathbf{f}_t := [f(v_1, t), \ldots, f(v_N, t)]^T$.

At time $t$, vertices with indices in the time-dependent set $\mathcal{S}_t := \{n_1(t), \ldots, n_{S(t)}(t)\}$, $1 \leq n_1(t) < \cdots < n_{S(t)}(t) \leq N$, are observed. The resulting samples can be expressed as $y_s(t) = f(v_{n_s(t)}, t) + e_s(t)$, $s = 1, \ldots, S(t)$, where $e_s(t)$ models the observation error. Letting $\mathbf{y}_t := [y_1(t), \ldots, y_{S(t)}(t)]^T$, the observations can be conveniently expressed as

$$\mathbf{y}_t = \mathbf{S}_t\mathbf{f}_t + \mathbf{e}_t, \qquad t = 1, 2, \ldots, \tag{8.35}$$

where $\mathbf{e}_t := [e_1(t), \ldots, e_{S(t)}(t)]^T$ and the $S(t) \times N$ sampling matrix $\mathbf{S}_t$ contains ones at positions $(s, n_s(t))$, $s = 1, \ldots, S(t)$, and zeros elsewhere.

The broad goal of this section is to "reconstruct" $f$ from the observations $\{\mathbf{y}_t\}_t$ in (8.35). Two formulations will be considered.

Batch formulation. In the batch reconstruction problem, one aims at finding $\{\mathbf{f}_t\}_{t=1}^T$ given $\{\mathcal{G}(t)\}_{t=1}^T$, the sample locations $\{\mathcal{S}_t\}_{t=1}^T$ and all observations $\{\mathbf{y}_t\}_{t=1}^T$.

Online formulation. At every time $t$, one is given $\mathcal{G}(t)$ together with $\mathcal{S}_t$ and $\mathbf{y}_t$, and the goal is to find $\mathbf{f}_t$. The latter can be obtained possibly based on a previous estimate of $\mathbf{f}_{t-1}$, but the complexity per time slot $t$ must be independent of $t$.

To solve these problems, we will rely on the assumption that f evolves smoothly over space and time, yet more structured dynamics can be incorporated if known.

8.3.1 Kernels on Extended Graphs

This section extends the kernel-based learning framework of §8.2 to subsume time evolving functions over possibly dynamic graphs through the notion of graph extension, by which the time dimension receives the same treatment as the spatial dimension. The versatility of kernel-based methods to leverage spatial information [23] is thereby inherited and broadened to account for temporal dynamics as well. This vantage point also accommodates time varying sampling sets and topologies.

8.3.1.1 Extended Graphs

An immediate approach to reconstructing time evolving functions is to apply (8.9) separately for each $t = 1, \ldots, T$. This yields the instantaneous estimator (IE)

$$\hat{\mathbf{f}}_t^{(\mathrm{IE})} := \arg\min_{\mathbf{f}} \frac{1}{S(t)}\|\mathbf{y}_t - \mathbf{S}_t\mathbf{f}\|_2^2 + \mu\, \mathbf{f}^T\mathbf{K}_t^{\dagger}\mathbf{f}. \tag{8.36}$$

Unfortunately, this estimator does not account for the possible relation between, e.g., $f(v_n, t)$ and $f(v_n, t-1)$. If, for instance, $f$ varies slowly over time, an estimate of $f(v_n, t)$ may as well benefit from leveraging observations $y_s(\tau)$ at time instants $\tau \neq t$. Exploiting temporal dynamics potentially reduces the number of vertices that have to be sampled to attain a target reconstruction performance, which in turn can markedly reduce sampling costs.

Incorporating temporal dynamics into kernel-based reconstruction, which so far handles a single snapshot (cf. §8.2), necessitates reformulating time evolving function reconstruction as the reconstruction of a time-invariant function. An appealing possibility is to replace $\mathcal{G}$ with its extended version $\tilde{\mathcal{G}} := (\tilde{\mathcal{V}}, \tilde{\mathbf{A}})$, where each vertex in $\mathcal{V}$ is replicated $T$ times to yield the extended vertex set $\tilde{\mathcal{V}} := \{v_n(t),\ n = 1, \ldots, N,\ t = 1, \ldots, T\}$, and the $(n + N(t-1), n' + N(t'-1))$th entry of the $TN \times TN$ extended adjacency matrix $\tilde{\mathbf{A}}$ equals the weight of the edge $(v_n(t), v_{n'}(t'))$. The time varying function $f$ can thus be replaced with its extended time-invariant counterpart $\tilde{f} : \tilde{\mathcal{V}} \to \mathbb{R}$ with $\tilde{f}(v_n(t)) = f(v_n, t)$.

Definition 1

Let $\mathcal{V} := \{v_1, \ldots, v_N\}$ denote a vertex set and let $\mathcal{G} := (\mathcal{V}, \{\mathbf{A}_t\}_{t=1}^T)$ be a time varying graph. A graph $\tilde{\mathcal{G}}$ with vertex set $\tilde{\mathcal{V}} := \{v_n(t),\ n = 1, \ldots, N,\ t = 1, \ldots, T\}$ and $NT \times NT$ adjacency matrix $\tilde{\mathbf{A}}$ is an extended graph of $\mathcal{G}$ if the $t$th $N \times N$ diagonal block of $\tilde{\mathbf{A}}$ equals $\mathbf{A}_t$.

In general, the diagonal blocks $\{\mathbf{A}_t\}_{t=1}^T$ do not provide a full description of the underlying extended graph. Indeed, one also needs to specify the off-diagonal blocks of $\tilde{\mathbf{A}}$ to capture the spatiotemporal dynamics of $f$.

As an example, consider an extended graph with

$$\tilde{\mathbf{A}} = \mathrm{btridiag}\{\mathbf{A}_1, \ldots, \mathbf{A}_T; \mathbf{B}_2^{(T)}, \ldots, \mathbf{B}_T^{(T)}\}, \tag{8.37}$$

where $\mathbf{B}_t^{(T)} \in \mathbb{R}_+^{N \times N}$ connects $\{v_n(t-1)\}_{n=1}^N$ to $\{v_n(t)\}_{n=1}^N$, $t = 2, \ldots, T$, and $\mathrm{btridiag}\{\mathbf{A}_1, \ldots, \mathbf{A}_T; \mathbf{B}_2, \ldots, \mathbf{B}_T\}$ represents the symmetric block tridiagonal matrix

$$\tilde{\mathbf{A}} = \begin{bmatrix} \mathbf{A}_1 & \mathbf{B}_2^T & \mathbf{0} & \cdots & \mathbf{0} & \mathbf{0} \\ \mathbf{B}_2 & \mathbf{A}_2 & \mathbf{B}_3^T & \cdots & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{B}_3 & \mathbf{A}_3 & \cdots & \mathbf{0} & \mathbf{0} \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ \mathbf{0} & \mathbf{0} & \mathbf{0} & \cdots & \mathbf{A}_{T-1} & \mathbf{B}_T^T \\ \mathbf{0} & \mathbf{0} & \mathbf{0} & \cdots & \mathbf{B}_T & \mathbf{A}_T \end{bmatrix}.$$

For instance, each vertex can be connected to its neighbors at the previous time instant by setting $\mathbf{B}_t^{(T)} = \mathbf{A}_{t-1}$, or it can be connected to its replicas at adjacent time instants by setting $\mathbf{B}_t^{(T)}$ to be diagonal. Such a graph is depicted in Fig. 8.2.
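For illustration, a minimal sketch of the extended adjacency matrix in (8.37) for the second choice above, $\mathbf{B}_t^{(T)} = b\,\mathbf{I}_N$ (replicas connected across adjacent time slots), follows; the connection weight $b$ is an illustrative parameter.

```python
import numpy as np

def extended_adjacency(A_list, b=1.0):
    """btridiag{A_1,...,A_T; b*I,...,b*I}: extended-graph adjacency, cf. (8.37)."""
    T, N = len(A_list), A_list[0].shape[0]
    A_ext = np.zeros((T * N, T * N))
    for t, At in enumerate(A_list):
        A_ext[t*N:(t+1)*N, t*N:(t+1)*N] = At              # diagonal blocks: per-slot topologies
    for t in range(1, T):
        A_ext[t*N:(t+1)*N, (t-1)*N:t*N] = b * np.eye(N)   # connect each vertex to its replica...
        A_ext[(t-1)*N:t*N, t*N:(t+1)*N] = b * np.eye(N)   # ...at the adjacent time slot (symmetric)
    return A_ext
```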

Figure 8.2 (A) Original graph and (B) extended graph $\tilde{\mathcal{G}}$ for diagonal $\mathbf{B}_t^{(T)}$. Edges connecting vertices at the same time instant are represented by solid lines, whereas edges connecting vertices at different time instants are represented by dashed lines.

8.3.1.2 Batch and Online Reconstruction via Space–Time Kernels

The extended graph enables a generalization of the estimators in §8.2 for time evolving functions. The rest of this subsection discusses two such KRR estimators.

Consider first the batch formulation, where all the $\tilde{S} := \sum_{t=1}^T S(t)$ samples in $\tilde{\mathbf{y}} := [\mathbf{y}_1^T, \ldots, \mathbf{y}_T^T]^T$ are available, and the goal is to estimate $\tilde{\mathbf{f}} := [\mathbf{f}_1^T, \ldots, \mathbf{f}_T^T]^T$. Directly applying the KRR criterion in (8.9) to reconstruct $\tilde{f}$ on the extended graph $\tilde{\mathcal{G}}$ yields

$$\hat{\tilde{\mathbf{f}}} := \arg\min_{\tilde{\mathbf{f}}} \|\tilde{\mathbf{y}} - \tilde{\mathbf{S}}\tilde{\mathbf{f}}\|_{\mathbf{D}}^2 + \mu\, \tilde{\mathbf{f}}^T\tilde{\mathbf{K}}^{\dagger}\tilde{\mathbf{f}}, \tag{8.38a}$$

where $\tilde{\mathbf{K}}$ is now a $TN \times TN$ "space–time" kernel matrix, $\tilde{\mathbf{S}} := \mathrm{bdiag}\{\mathbf{S}_1, \ldots, \mathbf{S}_T\}$ and $\mathbf{D} := \mathrm{bdiag}\{S(1)\mathbf{I}_{S(1)}, \ldots, S(T)\mathbf{I}_{S(T)}\}$. If $\tilde{\mathbf{K}}$ is invertible, (8.38a) can be solved in closed form as

$$\hat{\tilde{\mathbf{f}}} = \tilde{\mathbf{K}}\tilde{\mathbf{S}}^T(\tilde{\mathbf{S}}\tilde{\mathbf{K}}\tilde{\mathbf{S}}^T + \mu\mathbf{D})^{-1}\tilde{\mathbf{y}}. \tag{8.38b}$$

The "space–time" kernel $\tilde{\mathbf{K}}$ captures complex spatiotemporal dynamics. If the topology is time-invariant, $\tilde{\mathbf{K}}$ can be specified over a two-dimensional plane of spatiotemporal frequencies, similarly to §8.2.2.⁵
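One simple way to obtain a space–time kernel that is consistent with this framework—though not the only one, and not the designs of [51]—is to apply a Laplacian kernel (e.g., the diffusion kernel of Table 8.1) to the Laplacian of the extended graph. The sketch below builds such a kernel and evaluates the batch estimate (8.38b); it reuses the hypothetical helpers laplacian_kernel, extended_adjacency and sampling_matrix from the earlier snippets.

```python
import numpy as np

def batch_space_time_krr(A_list, sample_idx_per_t, y_per_t, mu, sigma2=1.0, b=1.0):
    """Batch estimate (8.38b), with K_tilde a diffusion kernel on the extended graph."""
    T, N = len(A_list), A_list[0].shape[0]
    A_ext = extended_adjacency(A_list, b)
    L_ext = np.diag(A_ext.sum(axis=1)) - A_ext
    K_tilde = laplacian_kernel(L_ext, lambda lam: np.exp(sigma2 * lam / 2.0))
    # Block-diagonal sampling matrix S_tilde and weight matrix D (cf. (8.38a)).
    S_blocks = [sampling_matrix(N, idx) for idx in sample_idx_per_t]
    S_tilde = np.zeros((sum(len(i) for i in sample_idx_per_t), T * N))
    D_diag, row = [], 0
    for t, Sb in enumerate(S_blocks):
        S_tilde[row:row + Sb.shape[0], t*N:(t+1)*N] = Sb
        D_diag += [Sb.shape[0]] * Sb.shape[0]
        row += Sb.shape[0]
    D = np.diag(D_diag)
    y_tilde = np.concatenate(y_per_t)
    G = S_tilde @ K_tilde @ S_tilde.T + mu * D
    f_tilde = K_tilde @ S_tilde.T @ np.linalg.solve(G, y_tilde)     # (8.38b)
    return f_tilde.reshape(T, N)                                    # row t is the estimate of f_t
```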

In the online formulation, one aims to estimate $\mathbf{f}_t$ after the $\tilde{S}(t) := \sum_{\tau=1}^t S(\tau)$ samples in $\tilde{\mathbf{y}}_t := [\mathbf{y}_1^T, \ldots, \mathbf{y}_t^T]^T$ become available. Based on these samples, the KRR estimate of $\tilde{\mathbf{f}}$, denoted as $\hat{\tilde{\mathbf{f}}}_{1:T|t}$, is clearly

$$\hat{\tilde{\mathbf{f}}}_{1:T|t} := \arg\min_{\tilde{\mathbf{f}}} \|\tilde{\mathbf{y}}_t - \tilde{\mathbf{S}}_t\tilde{\mathbf{f}}\|_{\mathbf{D}_t}^2 + \mu\, \tilde{\mathbf{f}}^T\tilde{\mathbf{K}}^{-1}\tilde{\mathbf{f}} \tag{8.39a}$$

$$= \tilde{\mathbf{K}}\tilde{\mathbf{S}}_t^T(\tilde{\mathbf{S}}_t\tilde{\mathbf{K}}\tilde{\mathbf{S}}_t^T + \mu\mathbf{D}_t)^{-1}\tilde{\mathbf{y}}_t, \tag{8.39b}$$

where $\tilde{\mathbf{K}}$ is assumed invertible for simplicity, $\mathbf{D}_t := \mathrm{bdiag}\{S(1)\mathbf{I}_{S(1)}, \ldots, S(t)\mathbf{I}_{S(t)}\}$ and $\tilde{\mathbf{S}}_t := [\mathrm{bdiag}\{\mathbf{S}_1, \ldots, \mathbf{S}_t\}, \mathbf{0}_{\tilde{S}(t) \times (T-t)N}] \in \{0, 1\}^{\tilde{S}(t) \times TN}$.

The estimate in (8.39) comprises the per-slot estimates $\{\hat{\mathbf{f}}_{\tau|t}\}_{\tau=1}^T$; that is, $\hat{\tilde{\mathbf{f}}}_{1:T|t} := [\hat{\mathbf{f}}_{1|t}^T, \ldots, \hat{\mathbf{f}}_{T|t}^T]^T$ with $\hat{\mathbf{f}}_{\tau|t} := [\hat{f}_1(\tau|t), \ldots, \hat{f}_N(\tau|t)]^T$, where $\hat{\mathbf{f}}_{\tau|t}$ (respectively $\hat{f}_n(\tau|t)$) is the KRR estimate of $\mathbf{f}_\tau$ (of $f(v_n, \tau)$) given the observations up to time $t$. With this notation, it follows that for all $t, \tau$

$$\hat{\mathbf{f}}_{\tau|t} = (\mathbf{i}_{T,\tau}^T \otimes \mathbf{I}_N)\,\hat{\tilde{\mathbf{f}}}_{1:T|t}. \tag{8.40}$$

Regarding $t$ as the present, (8.39) therefore provides estimates of past, present and future values of $f$. The solution to the online problem comprises the sequence of present KRR estimates $\{\hat{\mathbf{f}}_{t|t}\}_{t=1}^T$. This sequence can be obtained by solving (8.39a) in closed form per $t$ as in (8.39b) and then applying (8.40). However, such an approach does not yield a desirable online algorithm, since its complexity per time slot is $\mathcal{O}(\tilde{S}^3(t))$ and therefore grows with $t$, whereas the online formulation requires the per-slot complexity to be independent of $t$. An algorithm that satisfies this requirement, yet provides the exact KRR estimate, is presented next for the case where the kernel matrix is any positive definite matrix $\tilde{\mathbf{K}}$ satisfying

$$\tilde{\mathbf{K}}^{-1} = \mathrm{btridiag}\{\mathbf{D}_1, \ldots, \mathbf{D}_T; \mathbf{C}_2, \ldots, \mathbf{C}_T\} \tag{8.41}$$

for some $N \times N$ matrices $\{\mathbf{D}_t\}_{t=1}^T$ and $\{\mathbf{C}_t\}_{t=2}^T$. Kernels in this important family are designed in [51].

If $\tilde{\mathbf{K}}$ is of the form (8.41), then the kernel Kalman filter (KKF) in Algorithm 2 returns the sequence $\{\hat{\mathbf{f}}_{t|t}\}_{t=1}^T$, where $\hat{\mathbf{f}}_{t|t}$ is given by (8.40). The $N \times N$ matrices $\{\mathbf{P}_\tau\}_{\tau=2}^T$ and $\{\boldsymbol{\Sigma}_\tau\}_{\tau=1}^T$ are obtained offline by Algorithm 1, and $\sigma_e^2(\tau) = \mu S(\tau)\ \forall\tau$.

Algorithm 1 Recursion to set the parameters of the KKF.

The KKF generalizes the probabilistic KF, since the latter is recovered upon setting $\tilde{\mathbf{K}}$ to be the covariance matrix of $\tilde{\mathbf{f}}$ in the probabilistic KF. The assumptions made by the probabilistic KF are stronger than those involved in the KKF. Specifically, in the probabilistic KF, $\mathbf{f}_t$ must adhere to a linear state-space model, $\mathbf{f}_t = \mathbf{P}_t\mathbf{f}_{t-1} + \boldsymbol{\eta}_t$, with known transition matrix $\mathbf{P}_t$, where the state noise $\boldsymbol{\eta}_t$ is uncorrelated over time and has known covariance matrix $\boldsymbol{\Sigma}_t$. Furthermore, the observation noise $\mathbf{e}_t$ must be uncorrelated over time and have a known covariance matrix. Correspondingly, the performance guarantees of the probabilistic KF are also stronger: the resulting estimate is optimal in the mean square error sense among all linear estimators. Furthermore, if $\boldsymbol{\eta}_t$ and $\mathbf{y}_t$ are jointly Gaussian for $t = 1, \ldots, T$, then the probabilistic KF estimate is optimal in the mean square error sense among all (not necessarily linear) estimators. In contrast, the requirements of the proposed KKF are much weaker, since they only specify that $f$ must evolve smoothly with respect to a given extended graph. As expected, the performance guarantees are similarly weaker; see, e.g., [18, Chapter 5]. However, since the KKF generalizes the probabilistic KF, the reconstruction performance of the former for judiciously selected $\tilde{\mathbf{K}}$ cannot be worse than that of the latter under any given criterion. The caveat, however, is that such a selection is not necessarily easy.
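For reference, the code below sketches one predict/correct step of a standard (probabilistic) Kalman filter for the linear state-space model $\mathbf{f}_t = \mathbf{P}_t\mathbf{f}_{t-1} + \boldsymbol{\eta}_t$ with observations (8.35). The actual KKF (Algorithm 2) has the same predict/correct structure but obtains the matrices $\mathbf{P}_t$ and $\boldsymbol{\Sigma}_t$ from the space–time kernel via Algorithm 1, which is not reproduced here; variable names are ours.

```python
import numpy as np

def kalman_step(f_prev, M_prev, P_t, Sigma_t, S_t, y_t, sigma_e2):
    """One predict/correct step of a standard Kalman filter (generic sketch, not Algorithm 2 verbatim)."""
    # Prediction based on the state model f_t = P_t f_{t-1} + eta_t.
    f_pred = P_t @ f_prev
    M_pred = P_t @ M_prev @ P_t.T + Sigma_t
    # Correction using the observations y_t = S_t f_t + e_t.
    G = S_t @ M_pred @ S_t.T + sigma_e2 * np.eye(S_t.shape[0])
    K_gain = M_pred @ S_t.T @ np.linalg.solve(G, np.eye(S_t.shape[0]))
    f_filt = f_pred + K_gain @ (y_t - S_t @ f_pred)
    M_filt = M_pred - K_gain @ S_t @ M_pred
    return f_filt, M_filt
```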

For the rigorous statement and proof relating the celebrated KF [52, Chapter 17] and the optimization problem in (8.39a), see [51]. Algorithm 2 requires $\mathcal{O}(N^3)$ operations per time slot, whereas the complexity of evaluating (8.39b) for the $t$th time slot is $\mathcal{O}(\tilde{S}^3(t))$, which increases with $t$ and eventually becomes prohibitive. Since distributed versions of the Kalman filter are well studied [53], decentralized KKF algorithms can be pursued to further reduce the computational complexity.

Algorithm 2 Kernel Kalman filter (KKF).

8.3.2 Multikernel Kriged Kalman Filters

This section applies the KRR framework presented in §8.2 to obtain online data-adaptive estimators of $\mathbf{f}_t$. Specifically, a spatiotemporal model is presented that judiciously captures the dynamics over space and time. Based on this model, the KRR criterion over time and space is formulated, and an online algorithm with affordable computational complexity is derived for the case where the kernels are preselected. To bypass the need for selecting an appropriate kernel, the section then discusses a data-adaptive multikernel learning extension of the KRR estimator that learns the optimal kernel "on-the-fly."

8.3.2.1 Spatiotemporal Models

Consider modeling the dynamics of $\mathbf{f}_t$ separately over time and space as $f(v_n, t) = f^{(\nu)}(v_n, t) + f^{(\chi)}(v_n, t)$, or in vector form

$$\mathbf{f}_t = \mathbf{f}_t^{(\nu)} + \mathbf{f}_t^{(\chi)}, \tag{8.42}$$

where $\mathbf{f}_t^{(\nu)} := [f^{(\nu)}(v_1, t), \ldots, f^{(\nu)}(v_N, t)]^T$ and $\mathbf{f}_t^{(\chi)} := [f^{(\chi)}(v_1, t), \ldots, f^{(\chi)}(v_N, t)]^T$. The first term $\{\mathbf{f}_t^{(\nu)}\}_t$ captures only spatial dependencies and can be thought of as an exogenous input to the graph that does not affect the evolution of the function in time.

The second term $\{\mathbf{f}_t^{(\chi)}\}_t$ accounts for spatiotemporal dynamics. A popular approach [54, Chapter 3] models $\mathbf{f}_t^{(\chi)}$ with the state equation

$$\mathbf{f}_t^{(\chi)} = \mathbf{A}_{t,t-1}\mathbf{f}_{t-1}^{(\chi)} + \boldsymbol{\eta}_t, \qquad t = 1, 2, \ldots, \tag{8.43}$$

where $\mathbf{A}_{t,t-1}$ is a generic transition matrix that can be chosen, e.g., as the $N \times N$ adjacency matrix of a possibly directed "transition graph," with $\mathbf{f}_0^{(\chi)} = \mathbf{0}$ and $\boldsymbol{\eta}_t \in \mathbb{R}^N$ capturing the state error. The state transition matrix $\mathbf{A}_{t,t-1}$ can be selected in accordance with the prior information available. Simplicity in estimation is a major advantage of the random walk model [55], where $\mathbf{A}_{t,t-1} = \alpha\mathbf{I}_N$ with $\alpha > 0$. On the other hand, adherence to the graph prompts the selection $\mathbf{A}_{t,t-1} = \alpha\mathbf{A}$, in which case (8.43) amounts to a diffusion process on the time-invariant graph $\mathcal{G}$. The recursion in (8.43) is a vector autoregressive model (VARM) of order one and offers flexibility in tracking multiple forms of temporal dynamics [54, Chapter 3]. The model in (8.43) captures the dependence between $f^{(\chi)}(v_n, t)$ and the time-lagged values $\{f^{(\chi)}(v_{n'}, t-1)\}_{n'=1}^N$.

Next, a model with increased flexibility is presented to account for instantaneous spatial dependencies as well. We have

$$\mathbf{f}_t^{(\chi)} = \mathbf{A}_{t,t}\mathbf{f}_t^{(\chi)} + \mathbf{A}_{t,t-1}\mathbf{f}_{t-1}^{(\chi)} + \boldsymbol{\eta}_t, \qquad t = 1, 2, \ldots, \tag{8.44}$$

where $\mathbf{A}_{t,t}$ encodes the instantaneous relation between $f^{(\chi)}(v_n, t)$ and $\{f^{(\chi)}(v_{n'}, t)\}_{n' \neq n}$. The recursion in (8.44) amounts to a structural vector autoregressive model (SVARM) [44]. Interestingly, (8.44) can be rewritten as

$$\mathbf{f}_t^{(\chi)} = (\mathbf{I}_N - \mathbf{A}_{t,t})^{-1}\mathbf{A}_{t,t-1}\mathbf{f}_{t-1}^{(\chi)} + (\mathbf{I}_N - \mathbf{A}_{t,t})^{-1}\boldsymbol{\eta}_t, \tag{8.45}$$

where $\mathbf{I}_N - \mathbf{A}_{t,t}$ is assumed invertible. After defining $\tilde{\boldsymbol{\eta}}_t := (\mathbf{I}_N - \mathbf{A}_{t,t})^{-1}\boldsymbol{\eta}_t$ and $\tilde{\mathbf{A}}_{t,t-1} := (\mathbf{I}_N - \mathbf{A}_{t,t})^{-1}\mathbf{A}_{t,t-1}$, (8.44) boils down to

$$\mathbf{f}_t^{(\chi)} = \tilde{\mathbf{A}}_{t,t-1}\mathbf{f}_{t-1}^{(\chi)} + \tilde{\boldsymbol{\eta}}_t, \tag{8.46}$$

which has the same form as (8.43). This section focuses on deriving estimators based on (8.43), but (8.44) can also be accommodated using the aforementioned reformulation.

Modeling $\mathbf{f}_t$ as the superposition of a term $\mathbf{f}_t^{(\chi)}$, capturing the slow dynamics over time with a state-space equation, and a term $\mathbf{f}_t^{(\nu)}$, accounting for fast dynamics, is motivated by the application at hand [55–57]. In kriging terminology [56], $\mathbf{f}_t^{(\nu)}$ is said to model small-scale spatial fluctuations, whereas $\mathbf{f}_t^{(\chi)}$ captures the so-called trend. The decomposition (8.42) is often dictated by the sampling interval: while $\mathbf{f}_t^{(\chi)}$ captures dynamics that are slow relative to the sampling interval, fast variations are modeled with $\mathbf{f}_t^{(\nu)}$. Such a modeling approach is advised in the prediction of network delays [55], where $\mathbf{f}_t^{(\chi)}$ represents the queuing delay while $\mathbf{f}_t^{(\nu)}$ represents the propagation, transmission and processing delays. Likewise, when predicting prices across different stocks, $\mathbf{f}_t^{(\chi)}$ captures the daily evolution of the stock market, which is correlated across stocks and time samples, while $\mathbf{f}_t^{(\nu)}$ describes unexpected changes, such as abrupt drops of the stock market due to political statements, which are assumed uncorrelated over time.
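To make the decomposition (8.42)–(8.43) concrete, the following sketch simulates a random-walk trend $\mathbf{f}_t^{(\chi)}$ plus an exogenous component $\mathbf{f}_t^{(\nu)}$ drawn from a user-supplied covariance (e.g., a Laplacian kernel); all parameter values are illustrative.

```python
import numpy as np

def simulate_spatiotemporal(A, T, alpha=0.9, sigma_eta=0.1, K_nu=None, rng=None):
    """Simulate f_t = f_t^(nu) + f_t^(chi) with f_t^(chi) following the random-walk model (8.43)."""
    rng = np.random.default_rng() if rng is None else rng
    N = A.shape[0]
    K_nu = np.eye(N) if K_nu is None else K_nu       # covariance of the exogenous term
    f_chi = np.zeros(N)
    signals = []
    for _ in range(T):
        f_chi = alpha * f_chi + sigma_eta * rng.standard_normal(N)   # A_{t,t-1} = alpha * I (random walk)
        f_nu = rng.multivariate_normal(np.zeros(N), K_nu)            # spatially correlated exogenous input
        signals.append(f_nu + f_chi)                                 # superposition (8.42)
    return np.array(signals)                                         # T x N array of graph signals
```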

8.3.2.2 Kernel Kriged Kalman Filter

The spatiotemporal model in (8.42), (8.43) can represent multiple forms of spatiotemporal dynamics by judicious selection of the associated parameters. The batch KRR estimator over time yields

$$\arg\min_{\{\mathbf{f}_\tau^{(\chi)}, \boldsymbol{\eta}_\tau, \mathbf{f}_\tau^{(\nu)}, \mathbf{f}_\tau\}_{\tau=1}^t} \sum_{\tau=1}^t \frac{1}{S(\tau)}\|\mathbf{y}_\tau - \mathbf{S}_\tau\mathbf{f}_\tau\|^2 + \mu_1\sum_{\tau=1}^t \|\boldsymbol{\eta}_\tau\|_{\mathbf{K}_\tau^{(\eta)}}^2 + \mu_2\sum_{\tau=1}^t \|\mathbf{f}_\tau^{(\nu)}\|_{\mathbf{K}_\tau^{(\nu)}}^2$$
$$\text{s.t.}\quad \boldsymbol{\eta}_\tau = \mathbf{f}_\tau^{(\chi)} - \mathbf{A}_{\tau,\tau-1}\mathbf{f}_{\tau-1}^{(\chi)}, \qquad \mathbf{f}_\tau = \mathbf{f}_\tau^{(\nu)} + \mathbf{f}_\tau^{(\chi)}, \qquad \tau = 1, \ldots, t. \tag{8.47}$$

The first term in (8.47) penalizes the fitting error in accordance with (8.1). The scalars $\mu_1, \mu_2 \geq 0$ are regularization parameters controlling the effect of the kernel regularizers, while prior information about $\{\mathbf{f}_\tau^{(\nu)}, \boldsymbol{\eta}_\tau\}_{\tau=1}^t$ may guide the selection of the appropriate kernel matrices. The constraints in (8.47) enforce adherence to (8.43) and (8.42). Since $\mathbf{f}_\tau^{(\nu)}$ and $\boldsymbol{\eta}_\tau$ are defined over the time evolving $\mathcal{G}(\tau)$, a natural approach is to select Laplacian kernels for $\mathbf{K}_\tau^{(\nu)}$ and $\mathbf{K}_\tau^{(\eta)}$; see §8.2.2. Next, we rewrite (8.47) in a form amenable to online solvers, namely

$$\arg\min_{\{\mathbf{f}_\tau^{(\chi)}, \mathbf{f}_\tau^{(\nu)}\}_{\tau=1}^t} \sum_{\tau=1}^t \frac{1}{S(\tau)}\|\mathbf{y}_\tau - \mathbf{S}_\tau\mathbf{f}_\tau^{(\chi)} - \mathbf{S}_\tau\mathbf{f}_\tau^{(\nu)}\|^2 + \mu_1\sum_{\tau=1}^t \|\mathbf{f}_\tau^{(\chi)} - \mathbf{A}_{\tau,\tau-1}\mathbf{f}_{\tau-1}^{(\chi)}\|_{\mathbf{K}_\tau^{(\eta)}}^2 + \mu_2\sum_{\tau=1}^t \|\mathbf{f}_\tau^{(\nu)}\|_{\mathbf{K}_\tau^{(\nu)}}^2. \tag{8.48}$$

In batch form, the optimization in (8.48) yields $\{\hat{\mathbf{f}}_{\tau|t}^{(\nu)}, \hat{\mathbf{f}}_{\tau|t}^{(\chi)}\}_{\tau=1}^t$ per slot $t$ with complexity that grows with $t$. Fortunately, the filtered solutions $\{\hat{\mathbf{f}}_{\tau|\tau}^{(\nu)}, \hat{\mathbf{f}}_{\tau|\tau}^{(\chi)}\}_{\tau=1}^t$ of (8.48) are attained by the kernel kriged Kalman filter (KeKriKF) in an online fashion; for the proof the reader is referred to [58]. One iteration of the KeKriKF is summarized as Algorithm 3. This online estimator—with computational complexity $\mathcal{O}(N^3)$ per $t$—tracks the temporal variations of the signal of interest through (8.43) and promotes desired properties, such as smoothness over the graph, via $\mathbf{K}_t^{(\nu)}$ and $\mathbf{K}_t^{(\eta)}$. Different from existing KriKF approaches over graphs [55], the KeKriKF takes into account the underlying graph structure in estimating $\mathbf{f}_t^{(\nu)}$ as well as $\mathbf{f}_t^{(\chi)}$. Furthermore, by using $\mathbf{L}_t$ in (8.16), it can also accommodate dynamic graph topologies. Finally, it should be noted that the KeKriKF encompasses as a special case the KriKF, which relies on knowing the statistical properties of the function [55–57,60].

Algorithm 3 Kernel kriged Kalman filter (KeKriKF).

Lack of prior information prompts the development of data-driven approaches that efficiently learn the appropriate kernel matrix. Section 8.3.2.3 proposes an online MKL approach for achieving this goal.

8.3.2.3 Online Multikernel Learning

To cope with a lack of prior information about the pertinent kernel, the following dictionaries of kernels will be considered: $\mathcal{D}_\nu := \{\mathbf{K}^{(\nu)}(m) \in \mathbb{S}_+^N\}_{m=1}^{M_\nu}$ and $\mathcal{D}_\eta := \{\mathbf{K}^{(\eta)}(m) \in \mathbb{S}_+^N\}_{m=1}^{M_\eta}$. In what follows, assume that $\mathbf{K}_\tau^{(\nu)} = \mathbf{K}^{(\nu)}$, $\mathbf{K}_\tau^{(\eta)} = \mathbf{K}^{(\eta)}$ and $\mathbf{S}_\tau = \mathbf{S}\ \forall\tau$. Moreover, we postulate that the kernel matrices are of the form $\mathbf{K}^{(\nu)} = \mathbf{K}^{(\nu)}(\boldsymbol{\theta}^{(\nu)}) = \sum_{m=1}^{M_\nu}\theta^{(\nu)}(m)\mathbf{K}^{(\nu)}(m)$ and $\mathbf{K}^{(\eta)} = \mathbf{K}^{(\eta)}(\boldsymbol{\theta}^{(\eta)}) = \sum_{m=1}^{M_\eta}\theta^{(\eta)}(m)\mathbf{K}^{(\eta)}(m)$, where $\theta^{(\eta)}(m), \theta^{(\nu)}(m) \geq 0\ \forall m$.

Next, in accordance with §8.2.3, the coefficients $\boldsymbol{\theta}^{(\nu)} = [\theta^{(\nu)}(1), \ldots, \theta^{(\nu)}(M_\nu)]^T$ and $\boldsymbol{\theta}^{(\eta)} = [\theta^{(\eta)}(1), \ldots, \theta^{(\eta)}(M_\eta)]^T$ can be found by jointly minimizing (8.48) with respect to $\{\mathbf{f}_\tau^{(\chi)}, \mathbf{f}_\tau^{(\nu)}\}_{\tau=1}^t$, $\boldsymbol{\theta}^{(\nu)}$ and $\boldsymbol{\theta}^{(\eta)}$, which yields

$$\arg\min_{\{\mathbf{f}_\tau^{(\chi)}, \mathbf{f}_\tau^{(\nu)}\}_{\tau=1}^t,\ \boldsymbol{\theta}^{(\nu)}\geq\mathbf{0},\ \boldsymbol{\theta}^{(\eta)}\geq\mathbf{0}} \sum_{\tau=1}^t \frac{1}{S}\|\mathbf{y}_\tau - \mathbf{S}\mathbf{f}_\tau^{(\chi)} - \mathbf{S}\mathbf{f}_\tau^{(\nu)}\|^2 + \mu_1\sum_{\tau=1}^t\|\mathbf{f}_\tau^{(\chi)} - \mathbf{A}_{\tau,\tau-1}\mathbf{f}_{\tau-1}^{(\chi)}\|_{\mathbf{K}^{(\eta)}(\boldsymbol{\theta}^{(\eta)})}^2 + \mu_2\sum_{\tau=1}^t\|\mathbf{f}_\tau^{(\nu)}\|_{\mathbf{K}^{(\nu)}(\boldsymbol{\theta}^{(\nu)})}^2 + t\rho_\nu\|\boldsymbol{\theta}^{(\nu)}\|_2^2 + t\rho_\eta\|\boldsymbol{\theta}^{(\eta)}\|_2^2, \tag{8.49}$$

where $\rho_\nu, \rho_\eta \geq 0$ are regularization parameters that effect a ball constraint on $\boldsymbol{\theta}^{(\nu)}$ and $\boldsymbol{\theta}^{(\eta)}$, weighted by $t$ to account for the first three terms, which grow with $t$. Observe that the optimization problem in (8.49) gives time varying estimates $\boldsymbol{\theta}_t^{(\nu)}$ and $\boldsymbol{\theta}_t^{(\eta)}$, allowing one to track the optimal $\mathbf{K}^{(\nu)}$ and $\mathbf{K}^{(\eta)}$ as they change over time.

The optimization problem in (8.49) is not jointly convex in $\{\mathbf{f}_\tau^{(\chi)}, \mathbf{f}_\tau^{(\nu)}\}_{\tau=1}^t, \boldsymbol{\theta}^{(\nu)}, \boldsymbol{\theta}^{(\eta)}$, but it is separately convex in these variables. To solve (8.49), alternating minimization strategies will be employed, which optimize with respect to one variable while keeping the others fixed [61]. If $\boldsymbol{\theta}^{(\nu)}, \boldsymbol{\theta}^{(\eta)}$ are considered fixed, (8.49) reduces to (8.48), which can be solved by Algorithm 3 for the estimates $\hat{\mathbf{f}}_{t|t}^{(\chi)}, \hat{\mathbf{f}}_{t|t}^{(\nu)}$ at each $t$. With $\{\mathbf{f}_\tau^{(\chi)}, \mathbf{f}_\tau^{(\nu)}\}_{\tau=1}^t$ fixed and replaced by $\{\hat{\mathbf{f}}_{\tau|\tau}^{(\chi)}, \hat{\mathbf{f}}_{\tau|\tau}^{(\nu)}\}_{\tau=1}^t$ in (8.48), the time varying estimates of $\boldsymbol{\theta}^{(\nu)}, \boldsymbol{\theta}^{(\eta)}$ are found by

$$\hat{\boldsymbol{\theta}}_t^{(\eta)} = \arg\min_{\boldsymbol{\theta}^{(\eta)}\geq\mathbf{0}} \frac{1}{t}\sum_{\tau=1}^t \|\hat{\mathbf{f}}_{\tau|\tau}^{(\chi)} - \mathbf{A}_{\tau,\tau-1}\hat{\mathbf{f}}_{\tau-1|\tau-1}^{(\chi)}\|_{\mathbf{K}^{(\eta)}(\boldsymbol{\theta}^{(\eta)})}^2 + \frac{\rho_\eta}{\mu_1}\|\boldsymbol{\theta}^{(\eta)}\|_2^2, \tag{8.50a}$$

$$\hat{\boldsymbol{\theta}}_t^{(\nu)} = \arg\min_{\boldsymbol{\theta}^{(\nu)}\geq\mathbf{0}} \frac{1}{t}\sum_{\tau=1}^t \|\hat{\mathbf{f}}_{\tau|\tau}^{(\nu)}\|_{\mathbf{K}^{(\nu)}(\boldsymbol{\theta}^{(\nu)})}^2 + \frac{\rho_\nu}{\mu_2}\|\boldsymbol{\theta}^{(\nu)}\|_2^2. \tag{8.50b}$$

The optimization problems (8.50a) and (8.50b) are strongly convex, and iterative algorithms are available based on projected gradient descent (PGD) [62] or the Frank–Wolfe algorithm [63]. When the kernel matrices belong to the Laplacian family (8.16), efficient algorithms that exploit the common eigenspace of the kernels in the dictionary have been developed in [58]; these reduce the per-step computational complexity of PGD from a prohibitive $\mathcal{O}(N^3 M)$ for general kernels to a more affordable $\mathcal{O}(NM)$ for Laplacian kernels. The resulting algorithm, termed multikernel KriKF (MKriKF), alternates between computing $\hat{\mathbf{f}}_{t|t}^{(\chi)}$ and $\hat{\mathbf{f}}_{t|t}^{(\nu)}$ using the KeKriKF and estimating $\hat{\boldsymbol{\theta}}_t^{(\nu)}$ and $\hat{\boldsymbol{\theta}}_t^{(\eta)}$ by solving (8.50b) and (8.50a).
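A possible PGD update for (8.50b) is sketched below, assuming dense kernels: recalling the notation $\|\mathbf{x}\|_{\mathbf{K}}^2 = \mathbf{x}^T\mathbf{K}^{-1}\mathbf{x}$, the gradient uses $\partial/\partial\theta_m\,[\hat{\mathbf{f}}^T\mathbf{K}(\boldsymbol{\theta})^{-1}\hat{\mathbf{f}}] = -\hat{\mathbf{f}}^T\mathbf{K}(\boldsymbol{\theta})^{-1}\mathbf{K}_m\mathbf{K}(\boldsymbol{\theta})^{-1}\hat{\mathbf{f}}$ and projects onto the nonnegative orthant. This dense version is only illustrative; the implementation of [58] instead exploits the shared eigenspace of Laplacian kernels.

```python
import numpy as np

def pgd_theta(K_dict, f_hats, rho_over_mu, n_iter=100, step=1e-2):
    """Projected gradient descent for (8.50b): fit theta >= 0 of K(theta) = sum_m theta_m K_m."""
    M = len(K_dict)
    theta = np.ones(M) / M
    t_len = len(f_hats)
    for _ in range(n_iter):
        K_theta = sum(th * Km for th, Km in zip(theta, K_dict))
        K_inv = np.linalg.inv(K_theta + 1e-9 * np.eye(K_theta.shape[0]))   # small jitter for stability
        grad = np.zeros(M)
        for f in f_hats:                               # (1/t) sum_tau f^T K(theta)^{-1} f term
            w = K_inv @ f
            grad += np.array([-(w @ Km @ w) for Km in K_dict]) / t_len
        grad += 2.0 * rho_over_mu * theta              # gradient of the ridge term on theta
        theta = np.maximum(theta - step * grad, 0.0)   # projection onto theta >= 0
    return theta
```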

8.3.3 Numerical Tests

This section compares the performance of the methods we discussed in §8.3.1 and §8.3.2 with state-of-the-art alternatives and illustrates some of the trade-offs inherent to time varying function reconstruction through real-data experiments. The source code for the simulations is available at the authors' websites.

Unless otherwise stated, the compared estimators include distributed least squares reconstruction (DLSR) [49] with step size $\mu_{\mathrm{DLSR}}$ and parameter $\beta_{\mathrm{DLSR}}$; the least mean squares (LMS) algorithm in [50] with step size $\mu_{\mathrm{LMS}}$; the BL instantaneous estimator (BL-IE), which results from applying [11,13,15] separately per $t$; and the KRR instantaneous estimator (KRR-IE) in (8.36) with a diffusion kernel of parameter σ. DLSR, LMS and BL-IE also use a bandwidth parameter B.

Reconstruction via extended graphs. The first experiment relies on a data set from an epilepsy study [65] and showcases an example analysis of ECoG data, a standard tool in diagnosing epilepsy. The ECoG time series in [65] were collected from $N = 76$ electrodes implanted in a patient's brain before and after the onset of a seizure. A symmetric time-invariant adjacency matrix $\mathbf{A}$ was obtained using the method in [44] with the ECoG data before the onset of the seizure. The function $f(v_n, t)$ comprises the electrical signal at the $n$th electrode and $t$th sampling instant after the onset of the seizure, for a period of $T = 250$ samples. The values of $f(v_n, t)$ were normalized by subtracting the temporal mean of each time series before the onset of the seizure. The goal of the experiment is to illustrate the reconstruction performance of the KKF in capturing the complex spatiotemporal dynamics of brain signals.

Fig. 8.3A depicts $\mathrm{NMSE}(t, \{\mathcal{S}_\tau\}_{\tau=1}^t)$, averaged over all sets $\mathcal{S}_t = \mathcal{S}\ \forall t$, of size $S = 53$. For the KKF, a space–time kernel was created (see [51]) with $\mathbf{K}_t$ a time-invariant covariance kernel $\mathbf{K}_t = \hat{\boldsymbol{\Sigma}}$, where $\hat{\boldsymbol{\Sigma}}$ was set to the sample covariance matrix of the time series before the onset of the seizure, and with a time-invariant $\mathbf{B}^{(T)} = b^{(T)}\mathbf{I}$. The results clearly show the superior reconstruction performance of the KKF—which successfully exploits the statistics of the signal when available—over competing approaches, even with a small number of samples. This result suggests that ECoG-based diagnosis could be conducted efficiently with a smaller number of intracranial electrodes, which may have a positive impact on the patient's experience.

Figure 8.3 NMSE for real data simulations. (A) NMSE for the ECoG data set (σ = 1.2, μ = 10⁻⁴, μDLSR = 1.2, b(T) = 0.01, βDLSR = 0.5, μLMS = 0.6). (B) NMSE of temperature estimates (μ1 = 1, μ2 = 1, μDLSR = 1.6, βDLSR = 0.5, μLMS = 0.6, α = 10⁻³, μη = 10⁻⁵, rη = 10⁻⁶, μν = 2, rν = 0.5, Mν = 40, Mη = 40).

Reconstruction via KeKriKF. The second data set is provided by the National Climatic Data Center [66] and comprises hourly temperature measurements at $N = 109$ measuring stations across the continental United States in 2010. A time-invariant graph was constructed as in [51], based on geographical distances. The value $f(v_n, t)$ represents the temperature recorded at the $n$th station and $t$th day.

Fig. 8.3B reports the performance of the different reconstruction algorithms in terms of NMSE for $S = 40$. The KeKriKF (Algorithm 3) adopts a diffusion kernel for $\mathbf{K}^{(\nu)}$ with $\sigma = 1.8$ and $\mathbf{K}^{(\eta)} = s_\eta\mathbf{I}_N$ with $s_\eta = 10^{-5}$. The MKriKF is configured with a dictionary $\mathcal{D}_\nu$ containing $M_\nu$ diffusion kernels with parameters $\{\sigma(m)\}_{m=1}^{M_\nu}$ drawn from a Gaussian distribution with mean $\mu_\nu$ and variance $r_\nu$, and a dictionary $\mathcal{D}_\eta$ containing $M_\eta$ kernels $s_\eta\mathbf{I}_N$ with parameters $\{s_\eta(m)\}_{m=1}^{M_\eta}$ drawn from a Gaussian distribution with mean $\mu_\eta$ and variance $r_\eta$. The specific kernel selection for the KeKriKF leads to the smallest NMSE and was obtained via cross-validation. Observe that the MKriKF captures the spatiotemporal dynamics, successfully explores the pool of available kernels and achieves superior performance.

The third dataset is provided by the World Bank Group [67] and comprises the gross domestic product (GDP) per capita values for $N = 127$ countries for the years 1960–2016. A time-invariant graph was constructed using the correlation between the GDP values of different countries over the first 25 years. The graph function $f(v_n, t)$ denotes the GDP value reported for the $n$th country in the $t$th year, for $t = 1985, \ldots, 2016$. The graph Fourier transform of the GDP values shows that the graph frequencies $\check{f}_k$ take small values for $4 < k < 120$ and large values otherwise. Motivated by this observation, the KeKriKF is configured with a band-reject kernel $K^{(\nu)}$ that results after applying $r(\lambda_n) = \beta$ for $k \le n \le N - l$ and $r(\lambda_n) = 1/\beta$ otherwise in (8.16), with $k = 3$, $l = 6$, $\beta = 15$, and with $K^{(\eta)} = s_\eta I_N$ where $s_\eta = 10^{4}$. The MKriKF adopts a dictionary $\mathcal{D}_\nu$ that contains band-reject kernels with $k \in [2, 4]$, $l \in [3, 6]$, $\beta = 15$, resulting in $M_\nu = 12$ kernels, and a dictionary $\mathcal{D}_\eta$ that contains $\{s_\eta^{(m)} I_N\}_{m=1}^{40}$ with $s_\eta^{(m)}$ drawn from a Gaussian distribution with mean $\mu_\eta = 10^{5}$ and variance $r_\eta = 10^{6}$. Next, the performance of the different algorithms in tracking the GDP values is evaluated after sampling $S = 38$ countries.
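
The band-reject weighting just described can be sketched as follows (assuming, as with the other Laplacian kernels, that the kernel matrix is formed as $K = U \,\mathrm{diag}(1/r(\lambda_n))\, U^{\top}$; the exact normalization in (8.16) may differ, and the 0-based slicing is an approximate rendering of the 1-based range in the text):

```python
import numpy as np

def band_reject_kernel(L, k, l, beta):
    """Attenuate the middle graph frequencies (indices k..N-l) and amplify the rest."""
    lam, U = np.linalg.eigh(L)          # graph frequencies in ascending order
    N = L.shape[0]
    r = np.full(N, 1.0 / beta)          # r(lambda_n) = 1/beta outside the rejected band
    r[k - 1:N - l] = beta               # r(lambda_n) = beta inside the band (1-based k..N-l)
    return (U / r) @ U.T                # K = U diag(1/r(lambda)) U^T
```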

Fig. 8.4 illustrates the actual GDP as well as the GDP estimates for Greece, which is not among the sampled countries. Clearly, MKriKF, which learns the pertinent kernels from the data, achieves roughly the same performance as KeKriKF, which is configured manually so as to obtain the smallest possible NMSE.

Figure 8.4 GDP data set: tracking of GDP ($\mu_{\mathrm{DLSR}} = 1.6$, $\beta_{\mathrm{DLSR}} = 0.4$, $\mu_{\mathrm{LMS}} = 1.6$, $\rho_\nu = 10^{5}$, $\rho_\eta = 10^{5}$).

8.3.4 Summary

The task of reconstructing functions defined on graphs arises naturally in a plethora of applications. The kernel-based approach offers a clear, principled and intuitive way of tackling this problem. In this chapter, we gave a contemporary treatment of this framework, focusing on both time-invariant and time-evolving settings. The methods presented herein offer an expressive toolbox for tackling interesting real-world problems. Besides illustrating the effectiveness of the discussed approaches, our tests were also chosen to showcase interesting application areas as well as reasonable modeling approaches for interested readers to build upon. For further details about the models discussed here and their theoretical properties, the reader is referred to [23,39,58,59,68,69] and the references therein.

References

[1] E.D. Kolaczyk, Statistical Analysis of Network Data: Methods and Models. New York: Springer; 2009.

[2] R.I. Kondor, J. Lafferty, Diffusion kernels on graphs and other discrete structures, Proc. Int. Conf. Mach. Learn.. Sydney, Australia. 2002:315–322.

[3] A.J. Smola, R.I. Kondor, Kernels and regularization on graphs, Learning Theory and Kernel Machines. Springer; 2003:144–158.

[4] D.I. Shuman, S.K. Narang, P. Frossard, A. Ortega, P. Vandergheynst, The emerging field of signal processing on graphs: extending high-dimensional data analysis to networks and other irregular domains, IEEE Signal Processing Magazine 2013;30(3):83–98.

[5] A. Sandryhaila, J.M.F. Moura, Discrete signal processing on graphs, IEEE Transactions on Signal Processing 2013;61(7):1644–1656.

[6] O. Chapelle, B. Schölkopf, A. Zien, et al., Semi-Supervised Learning. Cambridge: MIT Press; 2006.

[7] O. Chapelle, V. Vapnik, J. Weston, Transductive inference for estimating values of functions, Proc. Advances Neural Inf. Process. Syst., vol. 12. Denver, Colorado. 1999:421–427.

[8] C. Cortes, M. Mohri, On transductive regression, Proc. Advances Neural Inf. Process. Syst.. Vancouver, Canada. 2007:305–312.

[9] M. Belkin, P. Niyogi, V. Sindhwani, Manifold regularization: a geometric framework for learning from labeled and unlabeled examples, Journal of Machine Learning Research 2006;7:2399–2434.

[10] S.K. Narang, A. Gadde, A. Ortega, Signal processing techniques for interpolation in graph structured data, Proc. IEEE Int. Conf. Acoust., Speech, Sig. Process.. Vancouver, Canada: IEEE; 2013:5445–5449.

[11] S.K. Narang, A. Gadde, E. Sanou, A. Ortega, Localized iterative methods for interpolation in graph structured data, Global Conf. Sig. Inf. Process.. Austin, Texas: IEEE; 2013:491–494.

[12] A. Gadde, A. Ortega, A probabilistic interpretation of sampling theory of graph signals, Proc. IEEE Int. Conf. Acoust., Speech, Sig. Process.. Brisbane, Australia. 2015:3257–3261 10.1109/ICASSP.2015.7178573.

[13] M. Tsitsvero, S. Barbarossa, P. Di Lorenzo, Signals on graphs: uncertainty principle and sampling, IEEE Transactions on Signal Processing 2016;64(18):4845–4860.

[14] S. Chen, R. Varma, A. Sandryhaila, J. Kovacevic, Discrete signal processing on graphs: sampling theory, IEEE Transactions on Signal Processing 2015;63(24):6510–6523.

[15] A. Anis, A. Gadde, A. Ortega, Efficient sampling set selection for bandlimited graph signals using graph spectral proxies, IEEE Transactions on Signal Processing 2016;64(14):3775–3789.

[16] X. Wang, P. Liu, Y. Gu, Local-set-based graph signal reconstruction, IEEE Transactions on Signal Processing 2015;63(9):2432–2444 10.1109/TSP.2015.2411217.

[17] A.G. Marques, S. Segarra, G. Leus, A. Ribeiro, Sampling of graph signals with successive local aggregations, IEEE Transactions on Signal Processing 2016;64(7):1832–1843.

[18] B. Schölkopf, A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press; 2002.

[19] B. Schölkopf, R. Herbrich, A.J. Smola, A generalized representer theorem, Computational Learning Theory. Springer; 2001:416–426.

[20] V.N. Vapnik, Statistical Learning Theory, vol. 1. New York: Wiley; 1998.

[21] G. Kimeldorf, G. Wahba, Some results on Tchebycheffian spline functions, Journal of Mathematical Analysis and Applications 1971;33(1):82–95.

[22] D. Zhou, B. Schölkopf, A regularization framework for learning from graph data, ICML Workshop Stat. Relational Learn. Connections Other Fields, vol. 15. Banff, Canada. 2004:67–68.

[23] D. Romero, M. Ma, G.B. Giannakis, Kernel-based reconstruction of graph signals, IEEE Transactions on Signal Processing 2017;65(3):764–778.

[24] L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web. [tech. rep.] Stanford InfoLab; 1999.

[25] A.N. Nikolakopoulos, A. Korba, J.D. Garofalakis, Random surfing on multipartite graphs, 2016 IEEE International Conference on Big Data (Big Data). 2016:736–745 10.1109/BigData.2016.7840666.

[26] A.N. Nikolakopoulos, J.D. Garofalakis, Random surfing without teleportation, Algorithms, Probability, Networks, and Games. Springer International Publishing; 2015:344–357.

[27] J.A. Bazerque, G.B. Giannakis, Nonparametric basis pursuit via kernel-based learning, IEEE Signal Processing Magazine 2013;28(30):112–125.

[28] V. Sindhwani, P. Niyogi, M. Belkin, Beyond the point cloud: from transductive to semi-supervised learning, Proc. Int. Conf. Mach. Learn.. ACM; 2005:824–831.

[29] M. Gönen, E. Alpaydın, Multiple kernel learning algorithms, Journal of Machine Learning Research 2011;12(Jul):2211–2268.

[30] J.A. Bazerque, G. Mateos, G.B. Giannakis, Group-lasso on splines for spectrum cartography, IEEE Transactions on Signal Processing 2011;59(10):4648–4663 10.1109/TSP.2011.2160858.

[31] G.B. Giannakis, Q. Ling, G. Mateos, I.D. Schizas, H. Zhu, Decentralized learning for wireless communications and networking, arXiv preprint arXiv:1503.08855; 2016.

[32] C.A. Micchelli, M. Pontil, Learning the kernel function via regularization, Journal of Machine Learning Research 2005:1099–1125.

[33] S. Segarra, A.G. Marques, G. Leus, A. Ribeiro, Reconstruction of graph signals through percolation from seeding nodes, IEEE Transactions on Signal Processing 2016;64(16):4363–4378.

[34] A.S. Zamzam, V.N. Ioannidis, N.D. Sidiropoulos, Coupled graph tensor factorization, Proc. Asilomar Conf. Sig., Syst., Comput.. Pacific Grove, CA. 2016:1755–1759.

[35] A.N. Nikolakopoulos, J.D. Garofalakis, NCDawareRank: a novel ranking method that exploits the decomposable structure of the web, Proceedings of the Sixth ACM International Conference on Web Search and Data Mining. ACM; 2013:143–152.

[36] A.N. Nikolakopoulos, J.D. Garofalakis, Top-n recommendations in the presence of sparsity: an NCD-based approach, Web Intelligence, vol. 13. IOS Press; 2015:247–265.

[37] A.N. Nikolakopoulos, M.A. Kouneli, J.D. Garofalakis, Hierarchical itemspace rank: exploiting hierarchy to alleviate sparsity in ranking-based recommendation, Neurocomputing 2015;163:126–136.

[38] A.N. Nikolakopoulos, J.D. Garofalakis, NCDREC: a decomposability inspired framework for top-n recommendation, 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), vol. 1. 2014:183–190 10.1109/WI-IAT.2014.32.

[39] V.N. Ioannidis, A.N. Nikolakopoulos, G.B. Giannakis, Semi-parametric graph kernel-based reconstruction, IEEE Global Conf. Sig. Inf. Process.. Montreal, Canada. 2017.

[40] V. Vapnik, The Nature of Statistical Learning Theory. Springer; 2013.

[41] Bureau of Transportation, United States, [online], available: http://www.transtats.bts.gov/; 2016.

[42] S. Chen, R. Varma, A. Singh, J. Kovačević, Signal representations on graphs: tools and applications, arXiv preprint arXiv:1512.05406; 2015 [online].

[43] U. Von Luxburg, A tutorial on spectral clustering, Statistics and Computing 2007;17(4):395–416.

[44] Y. Shen, B. Baingana, G.B. Giannakis, Nonlinear structural vector autoregressive models for inferring effective brain network connectivity, arXiv preprint arXiv:1610.06551; 2016.

[45] B. Baingana, G.B. Giannakis, Tracking switched dynamic network topologies from information cascades, IEEE Transactions on Signal Processing 2017;65(4):985–997.

[46] F.R. Bach, M.I. Jordan, Learning graphical models for stationary time series, IEEE Transactions on Signal Processing 2004;52(8):2189–2199.

[47] J. Mei, J.M.F. Moura, Signal processing on graphs: causal modeling of big data, arXiv preprint arXiv:1503.00173v3; 2016.

[48] P.A. Forero, K. Rajawat, G.B. Giannakis, Prediction of partially observed dynamical processes over networks via dictionary learning, IEEE Transactions on Signal Processing 2014;62(13):3305–3320.

[49] X. Wang, M. Wang, Y. Gu, A distributed tracking algorithm for reconstruction of graph signals, IEEE Journal of Selected Topics in Signal Processing 2015;9(4):728–740.

[50] P. Di Lorenzo, S. Barbarossa, P. Banelli, S. Sardellitti, Adaptive least mean squares estimation of graph signals, IEEE Transactions on Signal and Information Processing over Networks 2017;2(4):555–568.

[51] D. Romero, V.N. Ioannidis, G.B. Giannakis, Kernel-based reconstruction of space-time functions on dynamic graphs, IEEE Journal of Selected Topics in Signal Processing 2017;11(6):1–14.

[52] G. Strang, K. Borre, Linear Algebra, Geodesy, and GPS. Siam; 1997.

[53] I.D. Schizas, G.B. Giannakis, S.I. Roumeliotis, A. Ribeiro, Consensus in ad hoc WSNs with noisy links—part ii: distributed estimation and smoothing of random signals, IEEE Transactions on Signal Processing 2008;56(4):1650–1666.

[54] T.W. Anderson, An Introduction to Multivariate Statistical Analysis, vol. 2. New York: Wiley; 1958.

[55] K. Rajawat, E. Dall'Anese, G.B. Giannakis, Dynamic network delay cartography, IEEE Transactions on Information Theory 2014;60(5):2910–2920.

[56] C.K. Wikle, N. Cressie, A dimension-reduced approach to space-time Kalman filtering, Biometrika 1999:815–829.

[57] S.J. Kim, E. Dall'Anese, G.B. Giannakis, Cooperative spectrum sensing for cognitive radios using kriged Kalman filtering, IEEE Journal of Selected Topics in Signal Processing 2011;5(1):24–36 10.1109/JSTSP.2010.2053016.

[58] V.N. Ioannidis, D. Romero, G.B. Giannakis, Inference of spatio-temporal functions over graphs via multi-kernel kriged Kalman filtering, arXiv preprint arXiv:1711.09306; 2017.

[59] V.N. Ioannidis, D. Romero, G.B. Giannakis, Inference of spatiotemporal processes over graphs via kernel kriged Kalman filtering, Proc. European Sig. Process. Conf.. Kos, Greece. 2017.

[60] K.V. Mardia, C. Goodall, E.J. Redfern, F.J. Alonso, The kriged Kalman filter, Test 1998;7(2):217–282.

[61] I. Csiszár, G. Tusnády, Information geometry and alternating minimization procedures, Statistics and Decisions 1984:205–237.

[62] L. Zhang, D. Romero, G.B. Giannakis, Fast convergent algorithms for multi-kernel regression, Proc. Workshop Stat. Sig. Process.. Palma de Mallorca, Spain. 2016.

[63] L. Zhang, G. Wang, D. Romero, G.B. Giannakis, Randomized block Frank–Wolfe for convergent large-scale learning, IEEE Transactions on Signal Processing 2017;65(24):6448–6461.

[64] A.N. Nikolakopoulos, V. Kalantzis, E. Gallopoulos, J.D. Garofalakis, Factored proximity models for top-N recommendations, Proc. 2017 IEEE International Conference on Big Knowledge. ICBK, Hefei, China. 2017:80–87.

[65] M.A. Kramer, E.D. Kolaczyk, H.E. Kirsch, Emergent network topology at seizure onset in humans, Epilepsy Research 2008;79(2):173–186.

[66] 1981–2010 U.S. climate normals, [online], available: https://www.ncdc.noaa.gov.

[67] GDP per capita (current US), [online], available: https://data.worldbank.org/indicator/NY.GDP.PCAP.CD.

[68] V.N. Ioannidis, D. Romero, G.B. Giannakis, Kernel-based reconstruction of space-time functions via extended graphs, Proc. Asilomar Conf. Sig., Syst., Comput.. Pacific Grove, CA. 2016:1829–1833.

[69] D. Romero, M. Ma, G.B. Giannakis, Estimating signals over graphs via multi-kernel learning, Proc. Workshop Stat. Sig. Process.. Palma de Mallorca, Spain. 2016.


1  While $f$ denotes a function, $f(v)$ represents the scalar resulting from evaluating $f$ at vertex $v$.

2  Smola et al. [3], for example, discuss the connection between $r(\lambda) = (\alpha - \lambda)^{-1}$ and PageRank [24], whereby the sought-after signal is essentially defined as the limiting distribution of a simple underlying “random surfing” process. For more about random surfing processes, see also [25,26].

3  A sum is chosen here for tractability, but the right-hand side of (8.19) could in principle combine the functions $\{\hat{f}_m\}_m$ in different forms.

4  For simplicity, here we consider only the case of semiparametric partially linear models.

5  For general designs of space–time kernels $\tilde{K}$ for time-invariant as well as time varying topologies, see [51].
