Chapter 7

A Random Fourier Features Perspective of KAFs With Application to Distributed Learning Over Networks

Pantelis Bouboulis^⁎; Sergios Theodoridis^⁎; Symeon Chouvardas^†
^⁎National and Kapodistrian University of Athens, Athens, Greece
^†Capital Fund Management, Paris, France

Abstract

A major problem in any typical online kernel-based scheme is that the model's solution is given as an expansion of kernel functions that grows linearly with time. Usually, some sort of pruning strategy is adopted to make the solution sparse for practical reasons. The key idea is to keep the most informative training data in the expansion (the so-called dictionary), while the rest is omitted. Although these strategies have been proven effective, they still consume computational resources due to their nature (e.g., they require a sequential search of the dictionary at each time instant) and they limit the design of kernel-based methods in more general settings, as for example in distributed systems. In this chapter, we show how one can employ random features of the kernel function to transform the original nonlinear problem, which lies in an infinite-dimensional Hilbert space, to a fixed-dimension Euclidean space, without significantly compromising performance. This paves the way for designing kernel-based methods for distributed systems based on their linear counterparts.

Keywords

Online kernel-based learning; Online distributed kernel-based learning; Random Fourier features

Chapter Points

• Provide an overview of the random Fourier features rationale, as a tool to approximate the kernel.
• Provide a review of existing online kernel-based schemes (e.g., KLMS, KRLS, PEGASOS), using the random Fourier features procedure.
• Propose elegant and efficient online distributed kernel-based schemes for learning over networks, using the random Fourier features rationale.

7.1 Introduction

During the last decade, there has been an increasing interest in online learning schemes based on the rationale of reproducing kernels, both for regression and classification. This has been sparked by the great success of Support Vector Machines (SVMs) and other related kernel methods [1–3] in addressing nonlinear tasks in a batch setting. Similar to the way that the kernel trick generalizes the classical linear SVM to the nonlinear case, these methods exploit the inner product structure of a Reproducing Kernel Hilbert Space (RKHS), $H$ , to modify well-established linear learning methods (such as the LMS and the RLS) to treat nonlinear tasks [4–7]. In essence, the data are transformed to a high-dimensional space equipped with an inner product (induced by the selected kernel) and then standard linear learning methods are applied to the transformed data. This is equivalent to nonlinear processing in the original space.

In general, these methods consider that the training data points $\{(x_{n}, y (n)), n = 1, 2, \dots\}$ , $x_{n} \in R^{d}$ , are generated by a nonlinear model of the form $y = g (x)$ , where g is an unknown nonlinear function, and that the data pairs arrive in a sequential (online) fashion. The objective is to learn a function $f \in H$ , to minimize a certain cost, i.e., $L (x_{n}, y (n), f)$ . Typically, a specific kernel function, κ, is selected and each data pair (arriving at the nth time instant) is mapped to $(Φ (x_{n}), y (n))$ , using the feature map $Φ : R^{d} \to H : Φ (x) = κ (x, \cdot)$ . Consequently, the system's output is modeled as ${〈 Φ (x_{n}), f 〉}_{H}$ and the data pair is used to update the previously available estimate of f. The result is that (after n iterations) the solution, at time instant n, is given in terms of all past training points, i.e.,

$f_{n} = \sum_{i = 1}^{n} w_{i} Φ (x_{i}) = \sum_{i = 1}^{n} w_{i} κ (x_{i}, \cdot) .$

(7.1)

Since n can grow indefinitely, it is evident that this scheme soon becomes impractical. This becomes very serious in big data applications, where online algorithms are the only viable possibility, e.g., [3]. Usually, some sort of pruning criterion is adopted to make the aforementioned linear expansion sparse. Thus, a dictionary is formed, which contains all valid training points (centers and coefficients) and the specific criterion is used to determine whether a new point will enter the dictionary or not. For example, if the novelty criterion is adopted, new arriving points enter the expansion only if the respective center, $x_{n}$ , is far away from all other centers of the dictionary. If the quantization criterion is adopted, the centers are quantized and each new data pair is used to update the coefficient of the dictionary's center with the same quantization value.

In this chapter, we propose a different approach. Based on the original idea of Rahimi and Recht [8], which has also been applied in [9,10] in a similar context, we approximate the kernel function using randomly selected Fourier features and transform the problem from the original RKHS, $H$ , to another Euclidean space $R^{D}$ . Thus, we can model the solution as a fixed-size vector in $R^{D}$ and directly apply any desired linear learning method. More importantly, we can use this rationale to develop learning methods over distributed networks. These are decentralized networks, which are represented by different nodes; each node collects the observations in a sequential fashion and communicates with a subset of nodes, which define its neighborhood, in order to reach a consensus regarding the solution. In such a setting, although the noise and observations, in each node, may follow different statistics, the parameters defining the generating model are assumed to be common for all nodes. The problem in this case is that, if one follows the rationale of Eq. (7.1), the typical sparsification methods become increasingly complicated, as each node has to transmit the entire sum at each time instant to its neighbors and the node has to, somehow, fuse together all received sums to compute the updated estimate. This is probably the reason that (until now) no practical solutions for online learning with kernels in distributed networks have been proposed. On the contrary, we will show that the newly proposed scheme can be directly applied to distributed environments and that the standard established linear methods can be exploited.

The rest of the chapter is organized as follows. Section 7.2 presents the kernel approximation rationale of Rahimi and Recht. In Section 7.3, it is shown how the aforementioned kernel approximation rationale can be exploited to generate fixed-budget kernel-based online learning methods (kernel adaptive filters). The section contains fixed-budget implementations of the KLMS, the KRLS and the kernel-PEGASOS algorithms, as well as simulations to demonstrate the effectiveness of the methods. Moreover, in Section 7.4 the kernel approximation rationale is employed to generate elegant and efficient distributed learning methods over networks, based on the combine-then-adapt strategy. Finally, Section 7.5 presents some concluding remarks.

7.2 Approximating the Kernel

Kernel-based learning techniques usually demand a large number of kernel computations between training samples. For example, the popular SVM algorithm requires the computation of a large kernel matrix consisting of the kernel evaluations between all pairs of training points. Hence, to alleviate the computational burden, one common line of research suggests using some sort of approximation of the kernel matrix. This can be done by the celebrated Nyström method [11,12] or by other similar techniques. In this chapter, we focus on the Fourier features approach proposed in [8,13], as this fits more naturally to the online setting. The following theorem plays a key role in this procedure.

Theorem 1

Consider a shift-invariant positive definite kernel $κ (x - y)$ defined on $R^{d}$ and its Fourier transform, $p (ω) = \frac{1}{{(2 π)}^{d}} \int_{R^{d}} κ (δ) e^{- i ω^{T} δ} d δ$ , which (according to Bochner's theorem) can be regarded as a probability density function. Then, defining $z_{ω, b} (x) = \sqrt{2} \cos (ω^{T} x + b)$ , it turns out that

$κ (x - y) = E_{ω, b} [z_{ω, b} (x) z_{ω, b} (y)],$

(7.2)

where ω is drawn from p, b from the uniform distribution on $[0, 2 π]$ and $E [\cdot]$ denotes the related expectation.
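
The identity in Eq. (7.2) can be checked in a few lines. The following is only a brief sketch, assuming that the kernel is real-valued and normalized so that $κ (0) = 1$ (which makes p integrate to one); the auxiliary symbols u, v and c are introduced here for illustration. Using $2 \cos u \cos v = \cos (u - v) + \cos (u + v)$ , we get

$z_{ω, b} (x) z_{ω, b} (y) = \cos (ω^{T} (x - y)) + \cos (ω^{T} (x + y) + 2 b) .$

Averaging over b, which is uniform on $[0, 2 π]$ , the second term vanishes, since $\frac{1}{2 π} \int_{0}^{2 π} \cos (c + 2 b) d b = 0$ for any constant c. Averaging the first term over ω gives

$E_{ω} [\cos (ω^{T} (x - y))] = Re \int_{R^{d}} p (ω) e^{i ω^{T} (x - y)} d ω = κ (x - y),$

because κ is real and is the inverse Fourier transform of p.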

Following Theorem 1, we choose to approximate $κ (x_{n} - x_{m})$ using D random Fourier features, $ω_{1}, ω_{2}, \dots, ω_{D}$ , (drawn from p) and D random numbers, $b_{1}, b_{2}, \dots, b_{D}$ (drawn uniformly from $[0, 2 π]$ ) that define a sample average:

$κ (x_{n} - x_{m}) \approx \frac{1}{D} \sum_{i = 1}^{D} z_{ω_{i}, b_{i}} (x_{m}) z_{ω_{i}, b_{i}} (x_{n}) .$

(7.3)

Hence, instead of relying on the implicit map, Φ, provided by the kernel trick, we can map the input data to a finite-dimensional Euclidean space (with dimension lower than that of $H$ , but still larger than that of the input space) using the randomized feature map $z_{Ω} : R^{d} \to R^{D}$ :

$z_{Ω} (x) = \sqrt{\frac{2}{D}} (\begin{matrix} \cos (ω_{1}^{T} x + b_{1}) \\ ⋮ \\ \cos (ω_{D}^{T} x + b_{D}) \end{matrix}),$

(7.4)

where Ω is the $(d + 1) \times D$ matrix defining the random Fourier features of the respective kernel, i.e.,

$Ω = (\begin{matrix} ω_{1} & ω_{2} & . . . & ω_{D} \\ b_{1} & b_{2} & . . . & b_{D} \end{matrix}),$

(7.5)

so that the kernel evaluations can be approximated as $κ (x_{n}, x_{m}) \approx z_{Ω} {(x_{n})}^{T} z_{Ω} (x_{m})$ . Details on the quality of this approximation, as well as other theoretical results, can be found in [8,13–15]. We note that for the Gaussian kernel, i.e., $κ_{σ} (x, y) = \exp (- {‖ x - y ‖}^{2} / (2 σ^{2}))$ , which is employed throughout the chapter, the respective Fourier transform becomes

$p (ω) = {(σ / \sqrt{2 π})}^{d} e^{- \frac{σ^{2} {‖ ω ‖}^{2}}{2}},$

(7.6)

which is actually the multivariate Gaussian distribution with mean $0_{d}$ and covariance matrix $\frac{1}{σ^{2}} I_{d}$ (recall that $ω \in R^{d}$ ).
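
As a minimal illustrative sketch (not part of the original text), the following Python code draws random Fourier features for the Gaussian kernel and compares the inner product of two mapped vectors, as in Eq. (7.4), with the exact kernel value. It assumes the kernel in the form $\exp (- {‖ x - y ‖}^{2} / (2 σ^{2}))$ , for which the frequencies are Gaussian with standard deviation $1 / σ$ per coordinate. The function names and the values of d, D and σ are illustrative choices.

```python
import math
import random

def draw_features(d, D, sigma, rng):
    """Draw D frequencies from N(0, I_d / sigma^2) and D phases from U[0, 2*pi]."""
    omegas = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(d)] for _ in range(D)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(D)]
    return omegas, phases

def rff_map(x, omegas, phases):
    """Randomized feature map z_Omega(x) of Eq. (7.4)."""
    scale = math.sqrt(2.0 / len(omegas))
    return [scale * math.cos(sum(wj * xj for wj, xj in zip(w, x)) + b)
            for w, b in zip(omegas, phases)]

def gaussian_kernel(x, y, sigma):
    """Gaussian kernel, assumed here in the form exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

rng = random.Random(0)
d, D, sigma = 5, 2000, 5.0          # illustrative values
omegas, phases = draw_features(d, D, sigma, rng)
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
y = [rng.gauss(0.0, 1.0) for _ in range(d)]

zx, zy = rff_map(x, omegas, phases), rff_map(y, omegas, phases)
approx = sum(a * b for a, b in zip(zx, zy))   # z(x)^T z(y)
exact = gaussian_kernel(x, y, sigma)
print(f"exact kernel = {exact:.4f}, RFF approximation = {approx:.4f}")
```

The approximation error of the sample average shrinks roughly like $1 / \sqrt{D}$ , so larger D gives a closer match at the cost of a longer feature vector.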

7.3 Online Kernel-Based Learning: A Random Fourier Features Perspective

Inspired by the randomized feature map, we propose an alternative approach for online learning in RKHS. The procedure can be summarized as follows:

• Map each data pair $(x_{n}, y (n))$ to $(z_{Ω} (x_{n}), y (n))$ .
• Approximate the desired output as $f (x) \approx θ^{T} z_{Ω} (x)$ , for some $θ \in R^{D}$ .
• Estimate θ using any standard linear learning method (e.g., LMS, RLS) on the transformed data.

In the following, we will demonstrate in detail three applications of the proposed procedure: (a) the Kernel LMS (KLMS), (b) the Kernel RLS (KRLS) and (c) the Primal Estimated sub-GrAdient SOlver for nonlinear SVM (kernel PEGASOS).

7.3.1 RFF-KLMS

Assuming a sequentially arriving data set, $\{(z_{Ω} (x_{n}), y (n)), n = 1, 2, \dots\}$ , the typical LMS algorithm models the desired output as $f (x) = \hat{y} = θ^{T} z_{Ω} (x)$ and estimates θ, so that the MSE cost function, $L (x_{n}, y (n), θ) = E [{(y (n) - θ^{T} z_{Ω} (x_{n}))}^{2}]$ , is minimized. The minimization is carried out via a gradient descent rationale, where the gradient of the cost is approximated by the current measurement, i.e.,

$\begin{matrix} \nabla_{θ} L (x_{n}, y (n), θ) & = - 2 E [(y (n) - θ^{T} z_{Ω} (x_{n})) z_{Ω} (x_{n})] \\ & \approx - 2 (y (n) - θ^{T} z_{Ω} (x_{n})) z_{Ω} (x_{n}) . \end{matrix}$

This method (RFF-KLMS) has been introduced independently in [9] and [10]. Algorithm 1 presents the steps in detail. It is a matter of elementary algebra to see that, starting from a zero initial vector, after $n - 1$ steps the estimate of the solution will be $θ_{n} = μ \sum_{i = 1}^{n - 1} e (i) z_{Ω} (x_{i})$ , where $e (i)$ denotes the estimation error at step i and μ the step size. This leads us to conclude that the RFF-KLMS scheme will produce, approximately, the same output as that of the standard KLMS (provided that D is sufficiently large):

$\hat{y} (n) = μ \sum_{i = 1}^{n - 1} e (i) z_{Ω} {(x_{i})}^{T} z_{Ω} (x_{n}) \approx μ \sum_{i = 1}^{n - 1} e (i) κ (x_{i}, x_{n}),$

(7.7)

where for the last relation (which gives the output of the standard KLMS [7,4]) we used Theorem 1. The major difference is that the RFF-KLMS provides a single vector θ of fixed dimensions, instead of a growing expansion of kernel functions, hence no pruning mechanisms are required.

Algorithm 1 The Random Fourier Features Kernel LMS algorithm.
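
The listing of Algorithm 1 is not reproduced here. As a rough sketch of the recursion described above (an LMS update applied to the transformed data), and not necessarily identical to the authors' listing, the Python code below runs RFF-KLMS on synthetic data. The helper names `make_rff` and `rff_klms`, the target function, the Gaussian kernel form $\exp (- {‖ x - y ‖}^{2} / (2 σ^{2}))$ and all parameter values are assumptions made for illustration.

```python
import math
import random

def make_rff(d, D, sigma, rng):
    """Return the map z_Omega for a Gaussian kernel exp(-||x-y||^2 / (2 sigma^2))."""
    omegas = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(d)] for _ in range(D)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(D)]
    scale = math.sqrt(2.0 / D)
    def z(x):
        return [scale * math.cos(sum(wj * xj for wj, xj in zip(w, x)) + b)
                for w, b in zip(omegas, phases)]
    return z

def rff_klms(stream, z, D, mu):
    """LMS on the transformed pairs (z(x_n), y(n)); returns theta and a priori errors."""
    theta = [0.0] * D
    errors = []
    for x, y in stream:
        zx = z(x)
        e = y - sum(t * zi for t, zi in zip(theta, zx))       # e(n) = y(n) - theta^T z
        theta = [t + mu * e * zi for t, zi in zip(theta, zx)]  # theta <- theta + mu e z
        errors.append(e)
    return theta, errors

rng = random.Random(1)
d, D, sigma, mu = 2, 200, 1.0, 0.5       # illustrative values
z = make_rff(d, D, sigma, rng)
center = [0.5, -0.5]                     # hypothetical center of the generating model

def target(x):
    sq_dist = sum((a - c) ** 2 for a, c in zip(x, center))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

stream = []
for _ in range(2000):
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    stream.append((x, target(x) + rng.gauss(0.0, 0.01)))

theta, errors = rff_klms(stream, z, D, mu)
early = sum(e * e for e in errors[:50]) / 50
late = sum(e * e for e in errors[-100:]) / 100
print(f"MSE over first 50 samples: {early:.5f}, over last 100 samples: {late:.5f}")
```

The state kept by the learner is the single vector `theta` of length D, regardless of how many samples have been processed.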

Working similarly to the case of the standard LMS, we can study the convergence properties of RFF-KLMS. Henceforth, we will assume that the data pairs are generated by

$y (n) = \sum_{m = 1}^{M} α_{m} κ (c_{m}, x_{n}) + η (n),$

(7.8)

where $c_{1}, \dots, c_{M}$ are fixed centers, $x_{n}$ are zero-mean independent and identically distributed samples drawn from the Gaussian distribution with covariance matrix $σ_{X}^{2} I_{d}$ and $η (n)$ are independent and identically distributed noise samples drawn from $N (0, σ_{η}^{2})$ . We note that $σ_{X}^{2}$ and $σ_{η}^{2}$ are the variances of the input and the noise, respectively, and they are not in any way related to the kernel parameter σ. Applying the approximation rationale of Theorem 1, we obtain

$κ (c_{m}, x_{n}) = E_{ω, b} [z_{ω, b} (c_{m}) z_{ω, b} (x_{n})] = z_{Ω} {(c_{m})}^{T} z_{Ω} (x_{n}) + ϵ_{m} (n),$

where $ϵ_{m} (n)$ is the error between the actual value of the kernel function and the approximated one. Thus, if we define

$Z_{C} = (z_{Ω} (c_{1}), \dots, z_{Ω} (c_{M})), α = {(α_{1}, \dots, α_{M})}^{T}, θ_{o} = Z_{C} α,$

we get

$y (n) = α^{T} Z_{C}^{T} z_{Ω} (x_{n}) + ϵ (n) + η (n) = θ_{o}^{T} z_{Ω} (x_{n}) + ϵ (n) + η (n),$

where $ϵ (n)$ stands for the approximation error between the noise-free component of $y (n)$ (evaluated only by the linear kernel expansion of Eq. (7.8)) and the approximation of this component using random Fourier features, i.e.,

$ϵ (n) = \sum_{m = 1}^{M} α_{m} κ (c_{m}, x_{n}) - θ_{o}^{T} z_{Ω} (x_{n}) = \sum_{m = 1}^{M} α_{m} ϵ_{m} (n) .$

Let $x \in R^{d}$ , $y \in R$ , be the random variables that generate the measurements and $R_{z z} = E [z_{Ω} (x) z_{Ω}^{T} (x)]$ the corresponding autocorrelation matrix. We can prove that $R_{z z}$ is positive definite, provided that all the random features are different from each other, i.e., $ω_{i} \neq ω_{j}$ , for all $i \neq j$ [16]. Thus, the cross-correlation vector takes the form

$E [z_{Ω} (x) y] = E [z_{Ω} (x) (z_{Ω} {(x)}^{T} θ_{o} + ϵ + η)] = R_{z z} θ_{o} + E [z_{Ω} (x) ϵ],$

where for the last relation we have used the fact that η is a zero-mean variable representing noise and that $z_{Ω} (x)$ and η are independent. For large enough D, we can assume that the approximation error, ϵ, approaches 0 [14], hence the optimal solution becomes

$θ_{⁎} = \arg \min_{θ} E [{(y - z_{Ω} {(x)}^{T} θ)}^{2}] = R_{z z}^{- 1} E [z_{Ω} (x) y] = R_{z z}^{- 1} (R_{z z} θ_{o} + E [z_{Ω} (x) ϵ]) = θ_{o} + R_{z z}^{- 1} E [z_{Ω} (x) ϵ] \approx θ_{o} .$

To summarize, the proposed rationale assumes that, for sufficiently large D, Eq. (7.8) can be closely approximated by $y (n) \approx α^{T} Z_{C}^{T} z_{Ω} (x_{n}) + η (n)$ . The major difference with the standard LMS case is that the transformed inputs, $z_{Ω} (x_{n})$ , can no longer be assumed to be generated by the Gaussian distribution. Applying similar assumptions as in the case of the standard LMS (e.g., independence between $x_{n}, x_{m}$ , for $n \neq m$ and between $x_{n}, η (n)$ ), we get several convergence properties, where the eigenvalues of the positive definite matrix $R_{z z}$ , i.e., $0 < λ_{1} ⩽ λ_{2} ⩽ \dots ⩽ λ_{D}$ , play a pivotal role:

1. If $0 < μ < 2 / λ_{D}$ , then RFF-KLMS converges in the mean, i.e., $E [θ_{n} - θ_{o}] \to 0$ .
2. The optimal MSE is given by

$J {(n)}^{opt} = σ_{η}^{2} + E [{ϵ (n)}^{2}] - E [ϵ (n) z_{Ω} {(x_{n})}^{T}] R_{z z}^{- 1} E [ϵ (n) z_{Ω} (x_{n})] .$

For large enough D, we have $J {(n)}^{opt} \approx σ_{η}^{2}$ .
3. The excess MSE is given by $J {(n)}^{ex} = J (n) - J {(n)}^{opt} = tr (R_{z z} A_{n})$ , where $A_{n} = E [(θ_{n} - θ_{o}) {(θ_{n} - θ_{o})}^{T}]$ .
4. If $0 < μ < 1 / λ_{D}$ , then $A_{n}$ converges.
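
The step-size conditions above depend on the largest eigenvalue $λ_{D}$ of $R_{z z}$ , which is not known in practice. As a hedged sketch, one could estimate it from a batch of transformed samples by power iteration on the sample autocorrelation matrix. The helper names, the use of an empirical estimate in place of the true $R_{z z}$ , the Gaussian kernel form $\exp (- {‖ x - y ‖}^{2} / (2 σ^{2}))$ and all parameter values are assumptions made for this illustration.

```python
import math
import random

def make_rff(d, D, sigma, rng):
    """Return the map z_Omega for a Gaussian kernel exp(-||x-y||^2 / (2 sigma^2))."""
    omegas = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(d)] for _ in range(D)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(D)]
    scale = math.sqrt(2.0 / D)
    def z(x):
        return [scale * math.cos(sum(wj * xj for wj, xj in zip(w, x)) + b)
                for w, b in zip(omegas, phases)]
    return z

def largest_eigenvalue(Z, iters=50):
    """Power iteration on the sample autocorrelation (1/N) sum_i z_i z_i^T."""
    N, D = len(Z), len(Z[0])
    v = [1.0 / math.sqrt(D)] * D             # unit-norm starting vector
    lam = 0.0
    for _ in range(iters):
        u = [0.0] * D
        for zi in Z:                          # u = R v without forming R explicitly
            proj = sum(a * b for a, b in zip(zi, v)) / N
            for j in range(D):
                u[j] += proj * zi[j]
        lam = math.sqrt(sum(a * a for a in u))   # ||R v|| with ||v|| = 1
        v = [a / lam for a in u]
    return lam

rng = random.Random(2)
d, D, sigma = 2, 40, 1.0                      # illustrative values
z = make_rff(d, D, sigma, rng)
Z = [z([rng.gauss(0.0, 1.0) for _ in range(d)]) for _ in range(300)]

lam_max = largest_eigenvalue(Z)
print(f"estimated lambda_D = {lam_max:.4f}")
print(f"mean convergence needs mu < {2.0 / lam_max:.3f}; "
      f"convergence of A_n needs mu < {1.0 / lam_max:.3f}")
```

Since the squared norm of $z_{Ω} (x)$ never exceeds 2, the trace of $R_{z z}$ , and hence $λ_{D}$ , is at most 2. The power-iteration value approaches the largest eigenvalue from below, so in practice a step size comfortably under the printed bounds is the safer choice.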

7.3.2 RFF-KRLS

Contrary to the stochastic approach employed by the LMS, the RLS utilizes information of all past data to estimate θ by minimizing the regularized risk cost function, i.e.,

$L (x_{n}, y (n), θ) = \sum_{k = 1}^{n} β^{n - k} {(y (k) - θ^{T} z_{Ω} (x_{k}))}^{2} + λ β^{n} {‖ θ ‖}^{2},$

for some chosen $λ > 0$ , where β is a weighting factor that is usually added as a forgetting mechanism and which makes the algorithm more sensitive to recent data. Solving $θ_{⁎} = \arg \min_{θ} L (x_{n}, y (n), θ)$ directly requires the inversion of a $D \times D$ matrix at each step. However, employing the matrix inversion lemma, it is possible to compute this solution recursively. Algorithm 2 presents the details. The procedure is identical to that of the standard RLS, except that we replace $x_{n}$ with the transformed data $z_{Ω} (x_{n})$ . Moreover, we should note that, unlike the LMS case, the proposed scheme provides a solution that is different from other implementations of KRLS (like the ones presented in [6] and [4]). Although all KRLS-like implementations, provided in the respective literature, estimate the system's output as a linear expansion of kernel functions centered at the past data, the estimation of the expansion's coefficients differs in the case presented here.

Algorithm 2 The Random Fourier Features Kernel RLS algorithm.
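
The listing of Algorithm 2 is not reproduced here. The sketch below applies the textbook exponentially weighted RLS recursion (obtained via the matrix inversion lemma) to the transformed data; it may differ in detail from the authors' listing. The helper names, the initialization $P_{0} = I / λ$ , the synthetic target, the Gaussian kernel form $\exp (- {‖ x - y ‖}^{2} / (2 σ^{2}))$ and all parameter values are assumptions made for illustration.

```python
import math
import random

def make_rff(d, D, sigma, rng):
    """Return the map z_Omega for a Gaussian kernel exp(-||x-y||^2 / (2 sigma^2))."""
    omegas = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(d)] for _ in range(D)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(D)]
    scale = math.sqrt(2.0 / D)
    def z(x):
        return [scale * math.cos(sum(wj * xj for wj, xj in zip(w, x)) + b)
                for w, b in zip(omegas, phases)]
    return z

def rff_krls(stream, z, D, lam=1e-2, beta=1.0):
    """Standard RLS recursion on (z(x_n), y(n)); P tracks the inverse D x D matrix."""
    theta = [0.0] * D
    P = [[(1.0 / lam if i == j else 0.0) for j in range(D)] for i in range(D)]
    errors = []
    for x, y in stream:
        zx = z(x)
        Pz = [sum(p * zj for p, zj in zip(row, zx)) for row in P]
        denom = beta + sum(zi * pzi for zi, pzi in zip(zx, Pz))
        gain = [pzi / denom for pzi in Pz]
        e = y - sum(t * zi for t, zi in zip(theta, zx))        # a priori error
        theta = [t + g * e for t, g in zip(theta, gain)]
        # P <- (P - gain (P z)^T) / beta; P stays symmetric
        P = [[(P[i][j] - gain[i] * Pz[j]) / beta for j in range(D)] for i in range(D)]
        errors.append(e)
    return theta, errors

rng = random.Random(3)
d, D, sigma = 2, 50, 1.0                       # illustrative values
z = make_rff(d, D, sigma, rng)
centers = [([0.5, -0.5], 2.0), ([-1.0, 1.0], -1.5)]   # hypothetical (center, coefficient)

def target(x):
    return sum(a * math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c))
                            / (2.0 * sigma ** 2)) for c, a in centers)

stream = []
for _ in range(500):
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    stream.append((x, target(x) + rng.gauss(0.0, 0.01)))

theta, errors = rff_krls(stream, z, D)
early = sum(e * e for e in errors[:20]) / 20
late = sum(e * e for e in errors[-100:]) / 100
print(f"MSE over first 20 samples: {early:.5f}, over last 100 samples: {late:.5f}")
```

Each update costs $O (D^{2})$ operations and no matrix is inverted explicitly. The default `beta=1.0` disables forgetting; values slightly below 1 weight recent data more heavily.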

7.3.3 RFF-PEGASOS

The methods presented in Sections 7.3.1 and 7.3.2 are suitable for online regression problems. In this section, we present an online classification method based on the PEGASOS scheme (see [17]). Consider a training set of the form $\{(x_{n}, y (n)), n = 1, 2, \dots, N\}$ , where $x_{n} \in R^{d}$ and $y (n) = \pm 1$ . PEGASOS is an SVM-like algorithm that processes the data sequentially as follows. On iteration t, the algorithm selects a random training example $(x_{n_{t}}, y (n_{t}))$ by picking an index $n_{t}$ uniformly at random from $\{1, 2, \dots, N\}$ . The scheme follows a stochastic gradient rationale, adopting the regularized hinge loss function, i.e.,

$L (x, y, f) = \max \{0, 1 - y {〈 f, Φ (x) 〉}_{H}\} + \frac{λ}{2} {‖ f ‖}_{H}^{2},$

where $H$ is the RKHS associated with a preselected kernel κ and Φ the respective feature map. Thus, the step update equation becomes:

$f_{t} = (1 - \frac{1}{t}) f_{t - 1} + 1_{+} (1 - y (n_{t}) {〈 f_{t - 1}, Φ (x_{n_{t}}) 〉}_{H}) \frac{y (n_{t})}{λ t} Φ (x_{n_{t}}) ,$

where $1_{+} (u)$ equals 1 if $u > 0$ and 0 otherwise.

(7.9)

After a predetermined number of iterations (say T), the algorithm provides the solution, which is given as a sparse linear expansion of kernel functions, similar to Eq. (7.1). The algorithm is implemented by keeping in memory an N-size vector that comprises all the coefficients of the expansion. During each iteration, at most one of these coefficients might change. The performance of the scheme has been shown to improve with data reuse, i.e., when the same data are used again and again for training.

Evidently, since the solution provided by the algorithm depends on the dataset's size, PEGASOS cannot be considered as an online scheme (where the actual size of the dataset is not known beforehand). However, employing the random Fourier features rationale, we can model the output as $f (x) = θ^{T} z_{Ω} (x)$ , where $θ \in R^{D}$ , and thus rewrite the step update equation as follows (see Algorithm 3):

$θ_{n} = (1 - \frac{1}{n}) θ_{n - 1} + 1_{+} (1 - y (n) θ_{n - 1}^{T} z_{Ω} (x_{n})) \frac{y (n)}{λ n} z_{Ω} (x_{n}),$

(7.10)

where we have assumed that the data arrive sequentially. Similar to the LMS case, the proposed algorithm can be seen as a linear PEGASOS on the dataset $\{(z_{Ω} (x_{n}), y (n)), n = 1, 2, \dots\}$ , hence all convergence properties of [17] hold in this case too. We will call this scheme RFF-PEGASOS.

Algorithm 3 The Random Fourier Features PEGASOS algorithm.
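
The listing of Algorithm 3 is not reproduced here. The following sketch implements the update of Eq. (7.10) directly and tries it on a synthetic two-class problem. The helper names, the circular decision boundary, the Gaussian kernel form $\exp (- {‖ x - y ‖}^{2} / (2 σ^{2}))$ and the parameter values are assumptions chosen only for illustration; they are not taken from the experiments reported in this chapter.

```python
import math
import random

def make_rff(d, D, sigma, rng):
    """Return the map z_Omega for a Gaussian kernel exp(-||x-y||^2 / (2 sigma^2))."""
    omegas = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(d)] for _ in range(D)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(D)]
    scale = math.sqrt(2.0 / D)
    def z(x):
        return [scale * math.cos(sum(wj * xj for wj, xj in zip(w, x)) + b)
                for w, b in zip(omegas, phases)]
    return z

def rff_pegasos(stream, z, D, lam):
    """One pass of the update in Eq. (7.10) over sequentially arriving pairs."""
    theta = [0.0] * D
    for n, (x, y) in enumerate(stream, start=1):
        zx = z(x)
        margin = y * sum(t * zi for t, zi in zip(theta, zx))
        shrink = 1.0 - 1.0 / n
        if margin < 1.0:                      # indicator 1_+(1 - y theta^T z) is active
            step = y / (lam * n)
            theta = [shrink * t + step * zi for t, zi in zip(theta, zx)]
        else:
            theta = [shrink * t for t in theta]
    return theta

def label(x):
    """Hypothetical ground truth: +1 inside a circle, -1 outside (roughly balanced)."""
    return 1 if sum(xi * xi for xi in x) < 2.0 * math.log(2.0) else -1

rng = random.Random(4)
d, D, sigma, lam = 2, 100, 1.0, 0.01          # illustrative values
z = make_rff(d, D, sigma, rng)

train = []
for _ in range(2000):
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    train.append((x, label(x)))
theta = rff_pegasos(train, z, D, lam)

test_points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(500)]
correct = 0
for x in test_points:
    score = sum(t * zi for t, zi in zip(theta, z(x)))
    correct += (1 if score > 0 else -1) == label(x)
accuracy = correct / len(test_points)
print(f"test accuracy after one pass: {accuracy:.3f}")
```

Because the learner stores only `theta`, the number of samples does not need to be known in advance, which is what allows the scheme to run in a truly online fashion.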

We should note that the proposed algorithm features several significant advantages with respect to the original form of kernel PEGASOS presented in [17]. Firstly, the complexity per iteration is $O (D d)$ and does not depend on the training database size, in contrast with the kernel-PEGASOS, which has complexity $O (M d)$ , M being the number of the support vectors. Moreover, RFF-PEGASOS solves the problem of the growing sum and thus it can work in a truly online fashion (where the dataset is not known beforehand), while kernel-PEGASOS was designed for batch processing (i.e., fixed datasets). Finally, it is evident that for datasets with many support vectors (where $M \gg D$ ) RFF-PEGASOS is significantly faster. However, in databases with a low number of support vectors ( $M \ll D$ ), it runs slower.

7.3.4 Simulations—Regression

In this section, we provide some simple simulations, which are indicative of the behavior of the proposed algorithms, compared to their typical implementations. For the regression case, we generate 5000 data pairs using the following model: $y (n) = \sum_{m = 1}^{M} a_{m} κ (c_{m}, x_{n}) + η (n)$ , where $x_{n} \in R^{5}$ are drawn from $N (0, I_{5})$ and the noise samples $η (n)$ are independent and identically distributed Gaussian with $σ_{η} = 0.1$ . The parameters of the expansion (i.e., $a_{1}, \dots, a_{M}$ ) are drawn from $N (0, 25)$ and the kernel parameter σ is set to 5. The specific parameters of the algorithms are shown in Table 7.1. We compare RFF-KLMS and RFF-KRLS with a standard implementation of KLMS using the quantization pruning technique and the KRLS version of [6]. Fig. 7.1 shows the evolution of the MSE for 100 realizations of the experiment for several values of D (i.e., the total number of random Fourier features used), where the number of the free parameters of the model is set to $M = 100$ . As this is a kernel approximation scheme, it is expected that the performance (in terms of MSE) would be somewhat reduced. However, we can see in Fig. 7.1 that the performance of both RFF-KLMS and RFF-KRLS is very close to that of their typical implementations. In this particular example, both algorithms can achieve performance similar to that of their standard counterparts earlier in time (see Table 7.2 and Figs. 7.1B, 7.1D for the KLMS and KRLS, respectively). Fig. 7.2 shows the evolution of the MSE for the same experiment using $M = 400$ parameters in the design model. In this case, it is evident that a larger number of Fourier features is required to achieve the same approximation quality. For example, the RFF-KRLS requires more than $D = 200$ random features and the RFF-KLMS requires $D = 2000$ features to achieve approximately the same steady-state MSE as their typical variants. To conclude:

• Both RFF-KLMS and RFF-KRLS can efficiently approximate their typical counterparts for sufficiently large D, i.e., for $D > D_{0}$ , where the threshold $D_{0}$ depends on the problem.
• The value of D needed (so that the RFF-KLMS and the RFF-KRLS approximate their counterparts) depends on the model complexity (it also depends on the size of the input space, d, for obvious reasons).
• The RFF-KRLS can achieve the same performance as the KRLS using a significantly smaller D than the one needed by the RFF-KLMS for comparable performance to KLMS. Possibly, this is due to the fact that the KRLS mechanism utilizes all past data in each iteration.

Table 7.1

The parameters of the KLMS and KRLS variants

QKLMS	RFF-KLMS	E-KRLS	RFF-KRLS
μ = 1, ϵ = 5	μ = 1	ν = 0.0005	λ = 0.00001, w = 1

Figure 7.1 Comparing the performances of KLMS and KRLS versus their random Fourier features variants (FouKLMS and FouKRLS, respectively). (A) D = 100. (B) D = 200. (C) D = 500. (D) D = 1000.

Table 7.2

Mean times needed for the KLMS and KRLS on a typical core i5 machine for M = 100 and M = 400

	$M = 100$				$M = 400$
	D = 100	D = 200	D = 500	D = 1000	D = 100	D = 200	D = 1000	D = 2000
RFF-KLMS	0.03 s	0.07 s	0.14 s	0.25 s	0.03 s	0.06 s	0.26 s	0.41 s
QKLMS	0.27 s	0.27 s	0.27 s	0.27 s	0.27 s	0.27 s	0.27 s	0.27 s
RFF-KRLS	0.35 s	1.00 s	11.5 s	58.2 s	0.39 s	1.0 s	53.2 s	140.2 s
KRLS	3.03 s	3.2 s	3.2 s	3.2 s	3.2 s	3.2 s	3.2 s	3.2 s

Figure 7.2 Comparing the performances of KLMS and KRLS versus their random Fourier features variants (FouKLMS and FouKRLS, respectively) on a more complex input–output model. (A) D = 100. (B) D = 200. (C) D = 1000. (D) D = 2000.

7.3.5 Simulations—Classification

For the classification case, we have performed two sets of experiments. In the first set, we have compared the performances of the standard kernel-PEGASOS and the RFF-PEGASOS on four datasets downloaded from Leon Bottou's LASVM web page [18]. The chosen datasets are (a) the Adult dataset, (b) the Banana dataset (where we have used the first 4000 points as training data and the remaining 1300 as testing data), (c) the Waveform dataset (where we have used the first 4000 points as training data and the remaining 1000 as testing data) and (d) the MNIST dataset (for the task of classifying the digit 8 versus the rest). The sizes of the datasets are given in Table 7.4. Table 7.5 reports the test errors for each dataset and each method, along with the total number of support vectors after training (only for the kernel-PEGASOS) and the respective time in seconds (in parentheses). The experiments were performed on an i7-3770 machine using a MATLAB implementation. The parameters used in each method are reported in Table 7.3. The parameters were selected after extensive trials to give the best possible accuracy. The number in parentheses after the name of each method indicates the number of times the algorithm passed through the entire dataset. As expected from the theoretical analysis of the corresponding complexities, for datasets with a large number of support vectors (such as Adult and Banana), RFF-PEGASOS requires significantly less time than its standard implementation. For datasets with a medium number of SVs (such as Waveform), this difference diminishes, while for datasets with very few SVs (e.g., MNIST) kernel-PEGASOS can be significantly faster. We note that, for the MNIST database, RFF-PEGASOS attains a 0.47% test error after 20 reruns (19000 s).

Table 7.3

Parameters for each classification method

Method	Adult	Banana	Waveform	MNIST
Kernel-PEGASOS	$\begin{matrix} σ = \sqrt{10} \\ λ = 0.0000307 \end{matrix}$	$\begin{matrix} σ = 0.7 \\ λ = \frac{1}{316} \end{matrix}$	$\begin{matrix} σ = \sqrt{10} \\ λ = 0.001 \end{matrix}$	$\begin{matrix} σ = 4 \\ λ = 10^{- 7} \end{matrix}$
RFF-PEGASOS	$\begin{matrix} σ = \sqrt{10} \\ λ = 0.0000307 \\ D = 2000 \end{matrix}$	$\begin{matrix} σ = 0.7 \\ λ = \frac{1}{316} \\ D = 200 \end{matrix}$	$\begin{matrix} σ = \sqrt{10} \\ λ = 0.001 \\ D = 2000 \end{matrix}$	$\begin{matrix} σ = 4 \\ λ = 10^{- 7} \\ D = 100000 \end{matrix}$

Table 7.4

Dataset Information

Method	Adult	Banana	Waveform	MNIST
Training size	32562	4000	4000	60000
Testing size	16282	1300	1000	10000
dimensions	123	2	21	784

Table 7.5

Comparing the performances of kernel-PEGASOS and RFF-PEGASOS

Method	Adult	Banana	Waveform	MNIST
kernel-PEGASOS (1)	16.52% (8754 sv/129 s)	9.46% (1510 sv/0.42 s)	9.2% (1069 sv/0.52 s)	0.85% (940 sv/182 s)
kernel-PEGASOS (2)	15.89% (11009 sv/464 s)	9.46% (1696 sv/1.11 s)	9.8% (1237 sv/1.38 s)	0.56% (1194 sv/548 s)
kernel-PEGASOS (5)	15% (12513 sv/1660 s)	9.38% (1760 sv/3.22 s)	8.8% (1306 sv/4.05 s)	0.52% (1432 sv/1947 s)
kernel-PEGASOS (10)	14.83% (13078 sv/3804 s)	9.46% (1784 sv/6.77 s)	8.1% (1336 sv/8.61 s)	0.43% (1555 sv/4692 s)
RFF-PEGASOS (1)	16.55% (2.53 s)	9.61% (0.06 s)	9.3% (0.26 s)	0.96% (1908 s)
RFF-PEGASOS (2)	15.84% (5.02 s)	9.55% (0.09 s)	10.1% (0.51 s)	0.69% (3800 s)
RFF-PEGASOS (5)	14.95% (12.59 s)	9.69% (0.19 s)	8.9% (1.24 s)	0.55% (9463 s)
RFF-PEGASOS (10)	14.75% (25 s)	9.62% (0.36 s)	8.7% (2.45 s)	0.51% (18986 s)

In the second set of experiments we tested the performance of RFF-PEGASOS and compared it with that of the LA-SVM [19]. LA-SVM is one of the most elegant and fast SVM training implementations. In order to compare running times we used a C implementation of the respective algorithms (as the C implementation of LA-SVM is publicly available on the web). The experiments were performed on an i5 650 machine running at 3.20 GHz with 6 GB of RAM. The parameters for LA-SVM (shown in Table 7.3) have been taken from [19]. The results are shown in Table 7.6. Once more, we observe that for datasets with a large number of support vectors RFF-PEGASOS is faster. This is not the case, however, for datasets with a lower number of support vectors, where RFF-PEGASOS can be significantly slower.

Table 7.6

Comparing the performances of LA-SVM and RFF-PEGASOS

Method	Adult	Banana	Waveform	MNIST
RFF-Peg (1)	16.34% (10 s)	9.62% (<1 s)	10% (1 s)	0.98% (1.5 h)
RFF-Peg (2)	15.67% (21 s)	9.46% (<1 s)	8.9% (1 s)	0.76% (3 h)
RFF-Peg (5)	14.97% (52 s)	9.46% (≈1 s)	8.7% (2 s)	0.62% (7 h)
RFF-Peg (10)	14.76% (104 s)	9.62% (≈1 s)	8.8% (5 s)	0.52% (14 h)
LA-SVM	14.94% (276 s)	9.92% (3 s)	8.8% (1 s)	0.46% (313 s)

Remark 1

We stress that the second set of experiments concerns classification tasks in a batch mode. The LA-SVM method is one of the most powerful SVM solvers, but it is not designed for online tasks, as is the case for the RFF-PEGASOS. The provided experiments demonstrate that in many cases the RFF-PEGASOS can be used in a batch mode as well.

7.4 Online Distributed Learning With Kernels

Although the RFF methods, presented in Section 7.3, constitute a competitive alternative to existing online learning techniques in RKHS, the obtained results do not offer any significant real benefits from a practical point of view. In this section, we discuss the scenario of online distributed learning in RKHS, where it will be shown that the random Fourier features rationale can solve major open problems and lead to simple and elegant solutions.

In the following, we consider K connected nodes, labeled $k \in N = \{1, 2, \dots, K\}$ , which operate in cooperation to solve a specific learning task. This is a more general learning scenario compared to the one considered in Section 7.3, as the nodes are able to communicate with each other and exchange valuable information. The nodes comprising the network and the corresponding topology are represented as an undirected connected graph, consisting of K vertices (i.e., the nodes) and a set of edges that represent the communication between those nodes (see Fig. 7.3). The nodes that interact with node k are called the “neighbors” of k and the respective subset (that comprises the neighbors of k) is denoted as $N_{k} \subseteq N$ . Moreover, we assign a nonnegative weight $a_{k, l}$ to the edge connecting node k to l, which is used by k to scale the data transmitted from l and vice versa (i.e., the confidence that node k assigns to node l). We collect all those weights into a $K \times K$ symmetric matrix $A = (a_{k, l})$ , such that the entries of the kth row of A contain the coefficients used by node k to scale the data arriving from its neighbors. Additionally, we assume that A is doubly stochastic, so that the weights of all incoming and outgoing “transmissions” sum to 1. Such a matrix can be generated using the Metropolis rule:

$a_{k, l} = \left\{ \begin{matrix} \frac{1}{\max \{| N_{k} |, | N_{l} |\}}, & \text{if } l \in N_{k} \text{ and } l \neq k, \\ 1 - \sum_{i \in N_{k} ∖ \{k\}} a_{k, i}, & \text{if } l = k, \\ 0, & \text{otherwise} . \end{matrix} \right.$
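
As a small sketch of the Metropolis rule, the following Python function builds the weight matrix from a list of undirected edges. It assumes, consistently with the sum over $N_{k} ∖ k$ above, that each neighborhood $N_{k}$ contains node k itself; the function name and the 4-node example graph are hypothetical.

```python
def metropolis_weights(edges, K):
    """Build the K x K weight matrix A = (a_{k,l}) using the Metropolis rule."""
    neighbors = [{k} for k in range(K)]        # assumption: N_k includes k itself
    for k, l in edges:
        neighbors[k].add(l)
        neighbors[l].add(k)
    A = [[0.0] * K for _ in range(K)]
    for k in range(K):
        for l in neighbors[k] - {k}:
            A[k][l] = 1.0 / max(len(neighbors[k]), len(neighbors[l]))
        A[k][k] = 1.0 - sum(A[k])              # remaining mass stays on the diagonal
    return A

# Hypothetical 4-node connected graph
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
A = metropolis_weights(edges, 4)
for row in A:
    print(["%.3f" % a for a in row])
```

Since the off-diagonal rule is symmetric in k and l and each row is completed to sum to one, the resulting matrix is symmetric and doubly stochastic.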

Note that the topology considered here is similar to that of Chapter 9, which also deals with nonlinear learning over networks. Finally, we assume that each node, k, receives streaming data $\{(x_{k, n}, y_{k} (n)), n = 1, 2, \dots\}$ that are generated by $y_{k} (n) = g (x_{k, n}) + η_{k} (n)$ , where $x_{k, n} \in R^{d}$ , $y_{k} (n) \in R$ and $η_{k} (n)$ represents the respective noise, for the regression task. For classification, we assume that $y_{k} (n) = ϕ (g (x_{k, n}))$ , where ϕ is a thresholding function, such that $y_{k} (n) \in \{- 1, 1\}$ . In both tasks, the goal of each node is to obtain an estimate of the nonlinear function g, working in cooperation with its neighbors, so that a specific cost function is minimized.

Figure 7.3 A network topology (with 8 nodes) and the respective weight matrix.

In the case where g is linear, i.e., it can be modeled as $g (x) = θ^{T} x$ , for some vector $θ \in R^{d}$ , the problem has been studied extensively and several learning methods have been proposed based on the diffusion scheme (e.g., the diffusion LMS and the diffusion RLS). In a nutshell, these methods can be summarized in the following steps (this rationale is usually called Combine-Then-Adapt or CTA for short):

• Each node, k, receives the new data pair, i.e., $(x_{k, n}, y_{k} (n))$ as well as the current estimates, $θ_{l, n - 1}$ , from all neighboring nodes, i.e., for all $l \in N_{k}$ .
• Consequently, the node combines these estimates to a single one, $ψ_{k, n - 1}$ , taking into account the weights between the nodes, for example, $ψ_{k, n - 1} = \sum_{l \in N_{k}} a_{k, l} θ_{l, n - 1}$ .
• Finally, the node exploits the combined estimate, $ψ_{k, n - 1}$ , as well as the newly received data to update its estimate. For example, in diffusion LMS, where the mean square cost is minimized using a gradient descent rationale, the respective update equation becomes $θ_{k, n} = ψ_{k, n - 1} + μ e_{k} (n) x_{k, n}$ , where $e_{k} (n) = y_{k} (n) - ψ_{k, n - 1}^{T} x_{k, n}$ is the estimation error.
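The three steps above can be sketched as follows for the linear case. This is a minimal illustration rather than a reference implementation: the function name `cta_lms_step`, the fully connected 3-node network with uniform weights, the true vector `theta_true`, the noise level and the step size are all assumptions chosen for the example.

```python
import random


def cta_lms_step(thetas, A, xs, ys, mu):
    """One Combine-Then-Adapt iteration of diffusion LMS for all nodes.

    thetas[k] is node k's previous estimate, A[k][l] the combination weight,
    (xs[k], ys[k]) the new data pair at node k, and mu the step size.
    """
    K, d = len(thetas), len(thetas[0])
    new_thetas = []
    for k in range(K):
        # Combine: weighted sum of the neighbors' previous estimates
        # (weights of non-neighbors are zero, so they drop out).
        psi = [sum(A[k][l] * thetas[l][j] for l in range(K)) for j in range(d)]
        # Adapt: LMS correction driven by the a priori error.
        err = ys[k] - sum(p * x for p, x in zip(psi, xs[k]))
        new_thetas.append([p + mu * err * x for p, x in zip(psi, xs[k])])
    return new_thetas


# Toy setup (illustrative values): 3 fully connected nodes, uniform weights.
rng = random.Random(0)
theta_true = [1.0, -2.0, 0.5]
A = [[1 / 3] * 3 for _ in range(3)]
thetas = [[0.0] * 3 for _ in range(3)]
for n in range(2000):
    xs = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(3)]
    ys = [sum(t * x for t, x in zip(theta_true, x_k)) + rng.gauss(0, 0.01)
          for x_k in xs]
    thetas = cta_lms_step(thetas, A, xs, ys, mu=0.05)
```

Note that all nodes use the estimates of the previous time instant in the combine step, which is why the new estimates are collected in a separate list before replacing the old ones.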

The implementation of such an approach in the context of RKHS presents significant challenges. The big difference lies in the fact that the estimation of the solution at each node is not a simple vector, but instead it is a function, which is expressed as a growing sum of kernel evaluations centered at the points observed by the specific node, i.e.,

$f_{k, n} = \sum_{i = 1}^{n} w_{k, i} κ (\cdot, x_{k, i}) .$

(7.11)

Hence, the implementation of a straightforward CTA strategy would require each node, k, to transmit its entire growing sum (i.e., the so-called dictionary, which comprises the coefficients $w_{k, i}$ as well as the respective centers $x_{k, i}$ , for all $i = 1, 2, \dots, n$ ) to all neighbors. This would significantly increase both the communication load among the nodes and the computational cost at each node, since the size of the transmitted data would grow as time evolves (at every time instant, each node gathers the centers transmitted by all its neighbors). Evidently, after a specific point in time this scheme would become so demanding that no machine would be able to carry on indefinitely. This is the rationale adopted in [20–22] for the case of KLMS. Alternatively, one could follow a pruning method similar to those applied for the standard KLMS and devise an efficient method to sparsify the solution at each node. However, this would require each node to somehow merge all the respective dictionaries (i.e., expansions of the form of Eq. (7.11)) transmitted by its neighbors, at each time instant. For example, each node would have to search all the dictionaries that it receives from the neighboring nodes for similar centers and treat them as a single one. Alternatively, one could adopt a single prearranged dictionary (i.e., a specific set of centers) for all nodes and then fuse each observed point with the best-suited center. However, no such strategy has appeared in the respective literature, perhaps due to its increased complexity and lack of theoretical elegance.

In the present chapter, we propose an alternative approach, which addresses the aforementioned problems. Once again, our starting point is the observation that the input–output relationship, $y = f (x)$ , can be closely approximated using the random Fourier features rationale (similar to Section 7.3). In particular, we see that

$\begin{aligned} f_{k, n} (x) & = \sum_{i = 1}^{n} w_{k, i} κ (x, x_{k, i}) \approx \sum_{i = 1}^{n} w_{k, i} z_{Ω} {(x)}^{T} z_{Ω} (x_{k, i}) \\ & = {\left(\sum_{i = 1}^{n} w_{k, i} z_{Ω} (x_{k, i})\right)}^{T} z_{Ω} (x) . \end{aligned}$
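The approximation $κ (x, y) \approx z_{Ω} {(x)}^{T} z_{Ω} (y)$ used in this relation can be checked numerically. The sketch below is written under assumptions that are not restated in this section: a Gaussian kernel $κ (x, y) = \exp (- {‖ x - y ‖}^{2} / (2 σ^{2}))$ , frequencies drawn from $N (0, σ^{- 2} I_{d})$ , phases drawn uniformly from $[0, 2 π]$ and the normalization $z_{Ω} (x) = \sqrt{2 / D} \, (\cos (ω_{1}^{T} x + b_{1}), \dots, \cos (ω_{D}^{T} x + b_{D}))^{T}$ . The helper names and the two test points are hypothetical.

```python
import math
import random


def draw_features(d, D, sigma, rng):
    """Draw D frequency vectors from N(0, I / sigma^2) and phases from U[0, 2*pi]."""
    omegas = [[rng.gauss(0, 1.0 / sigma) for _ in range(d)] for _ in range(D)]
    phases = [rng.uniform(0, 2 * math.pi) for _ in range(D)]
    return omegas, phases


def z_map(x, omegas, phases):
    """Random Fourier feature map z_Omega(x), a vector in R^D (assumed normalization)."""
    scale = math.sqrt(2.0 / len(omegas))
    return [scale * math.cos(sum(w * xi for w, xi in zip(omega, x)) + b)
            for omega, b in zip(omegas, phases)]


def gaussian_kernel(x, y, sigma):
    """Gaussian kernel with width sigma (assumed form of the kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))


rng = random.Random(1)
sigma, d, D = 2.0, 3, 2000
omegas, phases = draw_features(d, D, sigma, rng)
x, y = [0.3, -1.0, 0.5], [1.0, 0.2, -0.4]
exact = gaussian_kernel(x, y, sigma)
approx = sum(a * b for a, b in zip(z_map(x, omegas, phases),
                                   z_map(y, omegas, phases)))
```

The inner product is a Monte Carlo average over the D random features, so its deviation from the exact kernel value typically shrinks at a rate of order $1 / \sqrt{D}$ .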

Inspired by the last relation, we propose to approximate the desired input–output relationship as $y = θ^{T} z_{Ω} (x)$ and follow a two-step procedure: (a) we map each observed point $(x_{k, n}, y_{k} (n))$ to $(z_{Ω} (x_{k, n}), y_{k} (n))$ and then (b) we adopt a simple linear CTA diffusion strategy on the transformed points, where each node aims to estimate the vector $θ \in R^{D}$ minimizing a specific (convex) cost function, $L (x, y, θ)$ . We can summarize the proposed scheme for each node by the following relations:

$ψ_{k, n - 1} = \sum_{l \in N_{k}} a_{k, l} θ_{l, n - 1},$

(7.12)

$θ_{k, n} = ψ_{k, n - 1} - μ_{k, n} \nabla_{θ} L (z_{Ω} (x_{k, n}), y_{k} (n), ψ_{k, n - 1}),$

(7.13)

where $μ_{k, n}$ is the (possibly time varying) learning rate at the nth time instant on node k and $\nabla_{θ} L (z_{Ω} (x_{k, n}), y_{k} (n), ψ_{k, n - 1})$ is the gradient of $L (x, y, θ)$ with respect to θ, or any subgradient if the loss function is not differentiable. Eq. (7.12) refers to the fact that each node collects the past estimates from its neighbors to produce the combined estimate, $ψ_{k, n - 1}$ , while in Eq. (7.13) the most recent pair of observations is exploited to update the new estimate vector $θ_{k, n}$ , employing a gradient-descent-based rationale. We emphasize that $L$ need not be differentiable. Hence, a large family of loss functions can be adopted. However, in this section, we focus on the following two types of cost functions:

• Mean squared error (for Regression): $L (x, y, θ) = E [{(y - θ^{T} x)}^{2}]$ .
• Hinge loss (for Classification): $L (x, y, θ) = \max (0, 1 - y θ^{T} x)$ .
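For illustration, the instantaneous (sub)gradients of these two losses and the generic update of Eqs. (7.12) and (7.13) for a single node can be sketched as follows. The function names are hypothetical, the expectation in the mean squared error is replaced by its current sample (as is usual in LMS-type schemes), and the input `z` stands for the already transformed vector $z_{Ω} (x_{k, n})$ .

```python
def squared_loss_grad(z, y, theta):
    """Instantaneous gradient of (y - theta^T z)^2 with respect to theta."""
    err = y - sum(t * zi for t, zi in zip(theta, z))
    return [-2.0 * err * zi for zi in z]


def hinge_loss_subgrad(z, y, theta):
    """A subgradient of max(0, 1 - y * theta^T z) with respect to theta."""
    margin = y * sum(t * zi for t, zi in zip(theta, z))
    if margin < 1.0:
        return [-y * zi for zi in z]
    return [0.0] * len(z)  # zero is a valid subgradient when the margin is met


def cta_step(thetas, weights_k, z, y, mu, grad):
    """Eqs. (7.12)-(7.13) for one node: combine, then take a (sub)gradient step.

    thetas holds the previous estimates of all nodes and weights_k is the
    k-th row of A (zero entries for nodes that are not neighbors of k).
    """
    D = len(z)
    psi = [sum(a * th[j] for a, th in zip(weights_k, thetas)) for j in range(D)]
    g = grad(z, y, psi)
    return [p - mu * gj for p, gj in zip(psi, g)]


# Hypothetical usage: two nodes with equal weights, squared loss.
theta_new = cta_step([[0.0, 0.0], [2.0, 2.0]], [0.5, 0.5],
                     z=[1.0, 0.0], y=3.0, mu=0.5, grad=squared_loss_grad)
```

The factor 2 in the squared-loss gradient is commonly absorbed into the step size, which is why LMS-type updates are usually written without it.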

We can write Eq. (7.12) and Eq. (7.13) more compactly using block-matrices as follows:

${\underline{θ}}_{n} = \underline{A} {\underline{θ}}_{n - 1} - {\underline{M}}_{n} {\underline{G}}_{n},$

(7.14)

where ${\underline{θ}}_{n} : = {(θ_{1, n}^{T}, \dots, θ_{K, n}^{T})}^{T} \in R^{K D}$ is the collection of the estimates of all nodes, ${\underline{M}}_{n} : = diag \{μ_{1, n}, \dots, μ_{K, n}\} \otimes I_{D}$ is the collection of learning rates, $\underline{A} : = A \otimes I_{D}$ and ${\underline{G}}_{n} : = {[u_{1, n}^{T}, \dots, u_{K, n}^{T}]}^{T}$ , with $u_{k, n} = \nabla_{θ} L (z_{Ω} (x_{k, n}), y_{k} (n), ψ_{k, n - 1})$ , comprises all (sub)gradients. Here, the symbol ⊗ denotes the Kronecker (tensor) product of matrices.

The advantage of the proposed scheme is that each node transmits a single vector (i.e., its current estimate, $θ_{k, n}$ ) to its neighbors, instead of an entire dictionary of centers (and their respective coefficients) while the merging of the currently available estimates requires only a straightforward summation. However, we should point out that the new vector space, $R^{D}$ , usually has a significantly larger dimension than the original input space $R^{d}$ . This may pose some communication obstacles; to this end, various suboptimal techniques can be used and have been proposed, e.g., [23].

7.4.1 Diffusion KLMS

Adopting the mean square error in place of $L$ and estimating the gradient by its current measurement, the Random Fourier Features Diffusion KLMS (RFF-DKLMS) results, where Eq. (7.13) can be recast as

$θ_{k, n} = ψ_{k, n - 1} + μ ε_{k} (n) z_{Ω} (x_{k, n}),$

(7.15)

where $ε_{k} (n) = y_{k} (n) - ψ_{k, n - 1}^{T} z_{Ω} (x_{k, n})$ . Algorithm 4 describes the proposed method in detail.

Algorithm 4 The Random Fourier Features Diffusion KLMS algorithm.
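A compact sketch of the RFF-DKLMS recursion, i.e., Eq. (7.12) followed by Eq. (7.15), is given below. It is not a transcription of Algorithm 4: the ring network, the Gaussian kernel, the feature normalization $\sqrt{2 / D} \cos (\cdot)$ , the expansion coefficients and all parameter values are assumptions made for the example.

```python
import math
import random

rng = random.Random(0)
K, d, D, M = 4, 2, 60, 5             # nodes, input dim, features, centers
sigma, mu, steps = 1.5, 0.5, 600     # kernel width, step size, time instants

# Ring of 4 nodes: each neighborhood has 3 members (two neighbors and the
# node itself), so the Metropolis weights are all equal to 1/3.
A = [[1 / 3 if (k - l) % K in (0, 1, K - 1) else 0.0 for l in range(K)]
     for k in range(K)]

# Random Fourier features, drawn once and shared by all nodes.
omegas = [[rng.gauss(0, 1 / sigma) for _ in range(d)] for _ in range(D)]
phases = [rng.uniform(0, 2 * math.pi) for _ in range(D)]


def z_map(x):
    return [math.sqrt(2 / D) * math.cos(sum(w * xi for w, xi in zip(om, x)) + b)
            for om, b in zip(omegas, phases)]


# Data model of the form of Eq. (7.16): a fixed kernel expansion plus noise.
centers = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(M)]
alphas = [2.0, -1.5, 1.0, 2.5, -1.0]


def target(x):
    return sum(a * math.exp(-sum((ci - xi) ** 2 for ci, xi in zip(c, x))
                            / (2 * sigma ** 2))
               for a, c in zip(alphas, centers))


thetas = [[0.0] * D for _ in range(K)]
mse = []  # network-averaged squared a priori error per time instant
for n in range(steps):
    new_thetas, sq_err = [], 0.0
    for k in range(K):
        x = [rng.gauss(0, 1) for _ in range(d)]
        y = target(x) + rng.gauss(0, 0.1)
        z = z_map(x)
        # Combine step, Eq. (7.12): only neighbors have nonzero weights.
        psi = [sum(A[k][l] * thetas[l][j] for l in range(K) if A[k][l])
               for j in range(D)]
        # Adapt step, Eq. (7.15).
        eps = y - sum(p * zj for p, zj in zip(psi, z))
        new_thetas.append([p + mu * eps * zj for p, zj in zip(psi, z)])
        sq_err += eps ** 2
    thetas = new_thetas
    mse.append(sq_err / K)

early = sum(mse[:20]) / 20      # average error over the first 20 instants
late = sum(mse[-100:]) / 100    # average error over the last 100 instants
```

Each node stores and transmits a single vector of fixed length D, so the memory and communication cost per time instant stays constant regardless of how many samples have been processed.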

Under certain general conditions, we can establish consensus (i.e., that all nodes converge to the same solution), [16], following the results of the standard Diffusion LMS (e.g., [24,25]). To this end, we will assume that the data pairs are generated by

$y_{k} (n) = \sum_{m = 1}^{M} α_{m} κ (c_{m}, x_{k, n}) + η_{k} (n),$

(7.16)

where $c_{1}, \dots, c_{M}$ are fixed centers, $x_{k, n}$ are zero-mean i.i.d. samples drawn from the Gaussian distribution with covariance matrix $σ_{X}^{2} I_{d}$ and $η_{k} (n)$ are independent and identically distributed noise samples drawn from $N (0, σ_{η}^{2})$ . Hence, we have

$\begin{aligned} y_{k} (n) & = \sum_{m = 1}^{M} α_{m} E_{ω, b} [z_{ω, b} (c_{m}) z_{ω, b} (x_{k, n})] + η_{k} (n) \\ & \approx α^{T} Z_{Ω}^{T} z_{Ω} (x_{k, n}) + ϵ_{k} (n) + η_{k} (n) \\ & = θ_{o}^{T} z_{Ω} (x_{k, n}) + ϵ_{k} (n) + η_{k} (n), \end{aligned}$

where $Z_{Ω} = (z_{Ω} (c_{1}), \dots, z_{Ω} (c_{M}))$ , $α = {(α_{1}, \dots, α_{M})}^{T}$ , $θ_{o} = Z_{Ω} α$ and $ϵ_{k} (n)$ is the approximation error between the noise-free component of $y_{k} (n)$ (evaluated only by the linear kernel expansion of Eq. (7.16)) and the approximation of this component using random Fourier features, i.e., $ϵ_{k} (n) = \sum_{m = 1}^{M} α_{m} κ (c_{m}, x_{k, n}) - θ_{o}^{T} z_{Ω} (x_{k, n})$ . Using block-matrices we can derive the following, more compact, formulation for the entire network:

${\underline{y}}_{n} = {\underline{V}}_{n}^{T} {\underline{θ}}_{o} + {\underline{ϵ}}_{n} + {\underline{η}}_{n},$

(7.17)

where

• ${\underline{y}}_{n} : = {(y_{1} (n), y_{2} (n), \dots, y_{K} (n))}^{T}$ ,
• ${\underline{V}}_{n} : = diag (z_{Ω} (x_{1, n}), z_{Ω} (x_{2, n}), \dots, z_{Ω} (x_{K, n}))$ , is a $D K \times K$ matrix,
• ${\underline{θ}}_{o} = {(θ_{o}^{T}, θ_{o}^{T}, \dots, θ_{o}^{T})}^{T} \in R^{D K}$ ,
• ${\underline{ϵ}}_{n} = {(ϵ_{1} (n), ϵ_{2} (n), \dots, ϵ_{K} (n))}^{T} \in R^{K}$ ,
• ${\underline{η}}_{n} = {(η_{1} (n), η_{2} (n), \dots, η_{K} (n))}^{T} \in R^{K}$ .

Let $x_{1}, \dots, x_{K} \in R^{d}$ , $\underline{y} \in R^{K}$ , be the random variables that generate the measurements of the nodes and

$\underline{V} = diag (z_{Ω} (x_{1}), z_{Ω} (x_{2}), \dots, z_{Ω} (x_{K}))$

be the $D K \times K$ matrix that collects the transformed random variables for the whole network. Moreover, let $\underline{R} = E [\underline{V} {\underline{V}}^{T}]$ be the respective $D K \times D K$ autocorrelation matrix. It is not difficult to see that $\underline{R}$ can be given in a block form as $\underline{R} = E [\underline{V} {\underline{V}}^{T}] = diag (R_{z z}, R_{z z}, \dots, R_{z z})$ , where $R_{z z} = E [z_{Ω} (x_{k}) z_{Ω} {(x_{k})}^{T}]$ , for all $k = 1, 2, \dots, K$ . We can prove that, under certain general conditions, the autocorrelation matrix $\underline{R}$ is invertible (see [16] for the proof).

Lemma 1

Consider a selection of samples $ω_{1}, ω_{2}, \dots, ω_{D}$ , drawn from (7.6) such that $ω_{i} \neq ω_{j}$ , for any $i \neq j$ . Then both $R_{z z}$ and $\underline{R} = E [\underline{V} {\underline{V}}^{T}]$ are strictly positive definite (hence invertible) matrices.

Moreover, as $R_{z z}$ is positive definite (this has also been discussed in Section 7.4.1), its eigenvalues satisfy $0 < λ_{1} ⩽ λ_{2} ⩽ \dots ⩽ λ_{D}$ . In this case, the optimal solution is given by

$\begin{aligned} {\underline{θ}}_{⁎} & = {argmin}_{\underline{θ}} E [{‖ \underline{y} - {\underline{V}}^{T} \underline{θ} ‖}^{2}] \\ & = E {[\underline{V} {\underline{V}}^{T}]}^{- 1} E [\underline{V} \underline{y}] \\ & = E {[\underline{V} {\underline{V}}^{T}]}^{- 1} (E [\underline{V} ({\underline{V}}^{T} {\underline{θ}}_{o} + \underline{ϵ} + \underline{η})]) \\ & = E {[\underline{V} {\underline{V}}^{T}]}^{- 1} (E [\underline{V} {\underline{V}}^{T}] {\underline{θ}}_{o} + E [\underline{V} \underline{ϵ}] + E [\underline{V} \underline{η}]) \\ & = E {[\underline{V} {\underline{V}}^{T}]}^{- 1} (E [\underline{V} {\underline{V}}^{T}] {\underline{θ}}_{o} + E [\underline{V} \underline{ϵ}]), \end{aligned}$

where for the last relation we have used the fact that $\underline{η}$ is a zero-mean noise vector and that $\underline{V}$ and $\underline{η}$ are independent, so that $E [\underline{V} \underline{η}] = 0$ . For large enough D, the approximation error vector $\underline{ϵ}$ approaches $0_{K}$ , hence the optimal solution becomes

${\underline{θ}}_{⁎} = {\underline{θ}}_{o} + E {[\underline{V} {\underline{V}}^{T}]}^{- 1} E [\underline{V} \underline{ϵ}] \approx {\underline{θ}}_{o} .$

Since we work under the assumption that $\underline{ϵ}$ approaches $0_{K}$ , we can say that Eq. (7.17) can be closely approximated by ${\underline{y}}_{n} \approx {\underline{V}}_{n}^{T} {\underline{θ}}_{o} + {\underline{η}}_{n}$ . This actually means that the RFF-DKLMS is nothing more than the standard diffusion LMS applied to the data pairs $\{(z_{Ω} (x_{k, n}), y_{k} (n)), k = 1, \dots, K, n = 1, 2, \dots\}$ . However, the difference is that the input vectors $z_{Ω} (x_{k, n})$ may have nonzero mean and do not necessarily follow the Gaussian distribution. In fact we can prove that, if $x_{k, n} \sim N (0, σ_{X}^{2} I_{d})$ , then the entries of $R_{z z}$ can be evaluated as

$\begin{aligned} r_{i, j} = {} & \frac{1}{2} \exp \left(\frac{- {‖ ω_{i} - ω_{j} ‖}^{2} σ_{X}^{2}}{2}\right) \cos (b_{i} - b_{j}) \\ & + \frac{1}{2} \exp \left(\frac{- {‖ ω_{i} + ω_{j} ‖}^{2} σ_{X}^{2}}{2}\right) \cos (b_{i} + b_{j}) . \end{aligned}$
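The expression for $r_{i, j}$ follows from two elementary facts. The short derivation below is only a sketch, written for the unnormalized features $\cos (ω_{i}^{T} x + b_{i})$ with $x \sim N (0, σ_{X}^{2} I_{d})$ ; any constant factor included in the definition of $z_{Ω}$ simply multiplies $r_{i, j}$ by the square of that factor. First, for a fixed frequency ω and phase b, the characteristic function of the Gaussian distribution gives

$E [\cos (ω^{T} x + b)] = Re \left(e^{i b} E [e^{i ω^{T} x}]\right) = \exp \left(\frac{- {‖ ω ‖}^{2} σ_{X}^{2}}{2}\right) \cos (b) .$

Second, the product-to-sum identity yields

$\cos (ω_{i}^{T} x + b_{i}) \cos (ω_{j}^{T} x + b_{j}) = \frac{1}{2} \cos ({(ω_{i} - ω_{j})}^{T} x + b_{i} - b_{j}) + \frac{1}{2} \cos ({(ω_{i} + ω_{j})}^{T} x + b_{i} + b_{j}) .$

Taking the expectation of the right-hand side and applying the first relation to each term, once with frequency $ω_{i} - ω_{j}$ and phase $b_{i} - b_{j}$ and once with frequency $ω_{i} + ω_{j}$ and phase $b_{i} + b_{j}$ , produces the two terms of $r_{i, j}$ .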

Consequently, the available results regarding convergence and stability of diffusion LMS (e.g., [24,26]) cannot be applied here directly, as in these works the inputs are assumed to be zero mean Gaussian (to simplify the formulas related to stability). However, it is possible to end up with similar conclusions (the proofs are given in [16]).

Proposition 1

If the step update μ satisfies: $0 < μ < \frac{2}{λ_{D}}$ , where $λ_{D}$ is the maximum eigenvalue of $R_{z z}$ , then the RFF-DKLMS achieves asymptotic consensus in the mean, i.e.,

$\lim_{n \to \infty} E [θ_{k, n} - θ_{o}] = 0_{D}, \quad \text{for all } k = 1, 2, \dots, K .$
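In practice, the bound $2 / λ_{D}$ of Proposition 1 can be estimated from data. The following sketch forms a sample estimate of $R_{z z}$ and approximates its largest eigenvalue with power iteration. The feature normalization $\sqrt{2 / D} \cos (\cdot)$ , the Gaussian inputs, the sample size and all parameter values are assumptions made for the example.

```python
import math
import random

rng = random.Random(0)
d, D, sigma, sigma_x, samples = 3, 20, 2.0, 1.0, 1000

omegas = [[rng.gauss(0, 1 / sigma) for _ in range(d)] for _ in range(D)]
phases = [rng.uniform(0, 2 * math.pi) for _ in range(D)]


def z_map(x):
    return [math.sqrt(2 / D) * math.cos(sum(w * xi for w, xi in zip(om, x)) + b)
            for om, b in zip(omegas, phases)]


# Sample estimate of R_zz = E[z z^T].
R = [[0.0] * D for _ in range(D)]
for _ in range(samples):
    z = z_map([rng.gauss(0, sigma_x) for _ in range(d)])
    for i in range(D):
        for j in range(D):
            R[i][j] += z[i] * z[j] / samples

# Power iteration for the largest eigenvalue of the symmetric matrix R.
v = [1.0] * D
for _ in range(200):
    w = [sum(R[i][j] * v[j] for j in range(D)) for i in range(D)]
    norm = math.sqrt(sum(wi * wi for wi in w))
    v = [wi / norm for wi in w]
# Rayleigh quotient of the final unit vector.
lam_max = sum(v[i] * sum(R[i][j] * v[j] for j in range(D)) for i in range(D))
mu_bound = 2.0 / lam_max
```

Under the normalization assumed here, ${‖ z_{Ω} (x) ‖}^{2} ⩽ 2$ for every x, so the largest eigenvalue cannot exceed the trace of $R_{z z}$ , which is at most 2. Consequently, any step size $μ < 1$ satisfies the condition of Proposition 1 in this setting.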

Proposition 2

For stability in the mean square sense, we must ensure that both μ and A satisfy

$ρ (I_{D^{2} K^{2}} - μ (\underline{R} ⊠ I_{D K} + I_{D K} ⊠ \underline{R}) (\underline{A} ⊠ \underline{A})) < 1,$

where ⊠ denotes the unbalanced block Kronecker product (see [27]) and $ρ (\cdot)$ denotes the spectral radius of the respective matrix.

7.4.2 Diffusion PEGASOS

For the case of classification tasks, we can adopt the regularized hinge loss function, for a specific value of the regularization parameter, λ, and get a diffusion version of PEGASOS, in its full online version (see Algorithm 5). In this case, the gradient becomes

$\nabla_{θ} L (x, y, θ) = λ θ - I_{+} (1 - y θ^{T} z_{Ω} (x)) y z_{Ω} (x),$

where $I_{+}$ is the indicator function of $(0, + \infty)$ , which takes a value of 1, if its argument belongs in $(0, + \infty)$ , and zero otherwise. Hence the step update equation of the algorithm becomes

$θ_{k, n} = \left(1 - \frac{1}{n}\right) ψ_{k, n - 1} + I_{+} (1 - y_{k} (n) ψ_{k, n - 1}^{T} z_{Ω} (x_{k, n})) \frac{y_{k} (n)}{λ n} z_{Ω} (x_{k, n}),$

(7.18)

where, following [17], we have used a decreasing learning rate, $μ_{n} = \frac{1}{λ n}$ . Recall, however, that we can write the update scheme more compactly by Eq. (7.14), setting ${\underline{M}}_{n} = diag (1 / (λ n), \dots, 1 / (λ n)) \otimes I_{D}$ .

Algorithm 5 The Random Fourier Features Diffusion PEGASOS algorithm.
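The update of Eq. (7.18) can be sketched as follows. This is an illustration rather than a transcription of Algorithm 5: the fully connected 3-node network, the two-blob toy data, the Gaussian kernel with the feature normalization $\sqrt{2 / D} \cos (\cdot)$ and all parameter values are assumptions made for the example.

```python
import math
import random

rng = random.Random(0)
K, d, D = 3, 2, 100
sigma, lam, steps = 2.0, 0.01, 300
A = [[1 / 3] * K for _ in range(K)]   # fully connected graph, Metropolis weights

omegas = [[rng.gauss(0, 1 / sigma) for _ in range(d)] for _ in range(D)]
phases = [rng.uniform(0, 2 * math.pi) for _ in range(D)]


def z_map(x):
    return [math.sqrt(2 / D) * math.cos(sum(w * xi for w, xi in zip(om, x)) + b)
            for om, b in zip(omegas, phases)]


def sample():
    """Toy data: two Gaussian blobs centered at (+2, +2) and (-2, -2)."""
    y = rng.choice([-1, 1])
    return [2.0 * y + rng.gauss(0, 1) for _ in range(d)], y


thetas = [[0.0] * D for _ in range(K)]
for n in range(1, steps + 1):
    new_thetas = []
    for k in range(K):
        x, y = sample()
        z = z_map(x)
        # Combine step, Eq. (7.12).
        psi = [sum(A[k][l] * thetas[l][j] for l in range(K)) for j in range(D)]
        margin = y * sum(p * zj for p, zj in zip(psi, z))
        # Eq. (7.18): shrink by (1 - 1/n), then add the hinge term only
        # when the margin condition is violated (indicator I_+ equals 1).
        hit = 1.0 if margin < 1.0 else 0.0
        new_thetas.append([(1 - 1 / n) * p + hit * y / (lam * n) * zj
                           for p, zj in zip(psi, z)])
    thetas = new_thetas

# Evaluate the estimate of node 0 on fresh samples from the same model.
test_set = [sample() for _ in range(200)]
correct = sum(1 for x, y in test_set
              if y * sum(t * zj for t, zj in zip(thetas[0], z_map(x))) > 0)
accuracy = correct / len(test_set)
```

The shrinking factor $(1 - 1 / n)$ comes from the regularization term $λ θ$ of the gradient combined with the step size $μ_{n} = 1 / (λ n)$ , since $1 - μ_{n} λ = 1 - 1 / n$ .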

We can prove that all nodes converge to the same solution using the result from [16], where it is proven that the general update scheme described by Eq. (7.14) achieves asymptotic consensus, if the following assumptions are satisfied:

Assumption 1: The step size is time decaying and is bounded by the inverse square root of time, i.e., $μ_{k, n} = μ_{n} ⩽ μ n^{- 1 / 2}$ .
Assumption 2: The norm of the transformed input is bounded, i.e., there is $U_{1} > 0$ , such that $‖ z_{Ω} (x_{k, n}) ‖ ⩽ U_{1}, \forall k \in N, \forall n \in \mathbb{N}$ . Furthermore, $y_{k} (n)$ is bounded, i.e., there is $V > 0$ , such that $| y_{k} (n) | ⩽ V, \forall k \in N, \forall n \in \mathbb{N}$ .
Assumption 3: The estimates are bounded, i.e., there is $U_{2} > 0$ , such that $‖ θ_{k, n} ‖ ⩽ U_{2}, \forall k \in N$ , $\forall n \in \mathbb{N}$ .
Assumption 4: The matrix comprising the combination weights, i.e., A, is doubly stochastic.

Moreover, in [16], it is shown that the regret is sublinearly bounded, which means that on average the algorithm performs at least as well as the best fixed strategy.

Proposition 3

Under Assumptions 1–4 the network-wise regret is bounded by

$\sum_{i = 1}^{N} \sum_{k \in N} (L (x_{k, i}, y_{k, i}, ψ_{k, i}) - L (x_{k, i}, y_{k, i}, g)) ⩽ γ \sqrt{N} + δ,$

for all $g \in B_{[0_{D}, U_{2}]}$ , where $γ, δ$ are positive constants and $B_{[0_{D}, U_{2}]}$ is the closed ball with center $0_{D}$ and radius $U_{2}$ .

7.4.3 Simulations—Diffusion KLMS

In order to demonstrate that the estimation provided by the cooperative strategy is better than having each node working alone, we follow a similar experimental setup as in Section 7.3.4. Each realization of the experiments uses a different random connected graph with $K = 10$ nodes and with a probability of attachment per node equal to 0.25 (i.e., there is a 25% probability that a specific node k is connected to any other node l). The graphs are generated using MIT's random_graph routine (see [28]) and their adjacency matrices, A, using the Metropolis rule (resulting in graphs with mean algebraic connectivity around 0.69). For the noncooperative case, we simply used a graph that connects each node to itself, i.e., $A = I_{10}$ . All parameters were optimized (after trials) to give the lowest MSE. The algorithms were implemented in Matlab and the experiments were performed on an i7-3770 machine running at 3.4 GHz with 32 GB of RAM. Similar to Section 7.3.4, we generated 5000 data pairs for each node using the following model:

$y_{k} (n) = \sum_{m = 1}^{M} a_{m} κ (c_{m}, x_{k, n}) + η_{k} (n),$

where $x_{k, n} \in R^{5}$ are drawn from $N (0, I_{5})$ and the noise samples are independent and identically distributed Gaussian with $σ_{η} = 0.1$ . The parameters of the expansion (i.e., $a_{1}, \dots, a_{M}$ ) are drawn from $N (0, 25)$ , the kernel parameter σ is set to 5 and the learning rate is set to $μ = 1$ . Fig. 7.4 shows the evolution of the MSE over all network nodes for 100 realizations of the experiment (for various values of D and M). We note that the selected value of the step size satisfies the conditions of Proposition 1.

Figure 7.4 Comparing the performances of diffusion KLMS versus the uncooperative KLMS. (A) D = 500, M = 100. (B) D = 1000, M = 100. (C) D = 2000, M = 100. (D) D = 500, M = 500. (E) D = 1000, M = 500. (F) D = 2000, M = 500.

7.4.4 Simulations—Diffusion PEGASOS

We have tested the performance of Diffusion-PEGASOS versus the noncooperative kernel PEGASOS on the same four datasets that were used in Section 7.3.5. In all experiments, we generated random graphs (using MIT's random_graph routine) and compared the proposed diffusion method versus the respective noncooperative strategy (where each node works independently of the rest). Similar to Section 7.4.3, a different random connected graph with $K = 5$ or $K = 20$ nodes was generated for each realization of the experiments. The probability of attachment per node was set equal to 0.2 and the respective adjacency matrix, A, of each graph was generated using the Metropolis rule. For the noncooperative strategies, we used a graph that connects each node to itself, i.e., $A = I_{5}$ or $A = I_{20}$ , respectively. Moreover, for each realization, the corresponding dataset was randomly split into K subsets of equal size (one for every node). All parameters were optimized (after trials) to give the lowest number of test errors. Their values are reported in Table 7.9. The algorithms were implemented in Matlab and the experiments were performed on an i7-3770 machine running at 3.4 GHz with 32 GB of RAM. Tables 7.7 and 7.8 report the mean test errors obtained by both procedures. The number inside the parentheses indicates the times of data reuse (i.e., running the algorithm again over the same data, albeit with a continuously decreasing step size $μ_{n}$ ), which has been suggested to improve the classification accuracy of PEGASOS (see [17]). For example, the number 2 indicates that the algorithm runs over a dataset of double size that contains the same data pairs twice. For the first three datasets (Adult, Banana, Waveform) we have run 100 realizations of the experiment, while for the fourth (MNIST) we have run only 10 (to save time). With the exception of the Adult dataset, all simulations show that the distributed implementation significantly outperforms the noncooperative one. For that particular dataset, we observe that for a single run the noncooperative strategy behaves better (for $K = 20$ ), but as data reuse increases the distributed implementation reaches lower error floors.

Table 7.7

Comparing the performances of the Diffusion PEGASOS versus the noncooperative PEGASOS for graphs with K = 5 nodes

Method	Adult	Banana	Waveform	MNIST
Diffusion-PEGASOS (1)	19%	11.80%	11.82%	0.79%
Diffusion-PEGASOS (2)	17.43%	10.84%	10.49%	0.68%
Diffusion-PEGASOS (5)	15.87%	10.34%	9.56%	0.59%
Noncooperative-PEGASOS (1)	19.11%	14.52%	13.75%	1.42%
Noncooperative-PEGASOS (2)	18.31%	12.52%	12.59%	1.19%
Noncooperative-PEGASOS (5)	17.29%	11.32%	11.86%	1.01%

Table 7.8

Comparing the performances of the Diffusion PEGASOS versus the noncooperative PEGASOS for graphs with K = 20 nodes

Method	Adult	Banana	Waveform	MNIST
Diffusion-PEGASOS (1)	24.04%	16.38%	16.26%	1.03%
Diffusion-PEGASOS (2)	22.34%	13.23%	13.93%	0.77%
Diffusion-PEGASOS (5)	18.94%	10.83%	11.20%	0.57%
Noncooperative-PEGASOS (1)	20.81%	21.74%	18.40%	2.93%
Noncooperative-PEGASOS (2)	20.52%	18.64%	16.54%	2.19%
Noncooperative-PEGASOS (5)	19.88%	15.96%	14.86%	1.87%

Table 7.9

Parameters for each method

Method	Adult	Banana	Waveform	MNIST
Kernel-PEGASOS	$\begin{matrix} σ = \sqrt{10} \\ λ = 0.0000307 \end{matrix}$	$\begin{matrix} σ = 0.7 \\ λ = \frac{1}{316} \end{matrix}$	$\begin{matrix} σ = \sqrt{10} \\ λ = 0.001 \end{matrix}$	$\begin{matrix} σ = 4 \\ λ = 10^{- 7} \end{matrix}$
RFF-PEGASOS	$\begin{matrix} σ = \sqrt{10} \\ λ = 0.0000307 \\ D = 2000 \end{matrix}$	$\begin{matrix} σ = 0.7 \\ λ = \frac{1}{316} \\ D = 200 \end{matrix}$	$\begin{matrix} σ = \sqrt{10} \\ λ = 0.001 \\ D = 2000 \end{matrix}$	$\begin{matrix} σ = 4 \\ λ = 10^{- 7} \\ D = 100000 \end{matrix}$

7.5 Conclusions

In this chapter, we presented an alternative path for designing online kernel-based learning methods. Instead of relying on the typical growing sum, the presented approach approximates the solution using randomly sampled Fourier features of the kernel. This is equivalent to transforming the training data to a Euclidean space of larger dimension, thus leading to linear fixed-budget strategies. Moreover, the proposed approach can be exploited to derive online distributed schemes for learning over networks, where the nodes cooperate to find a combined solution, based on the combine-then-adapt rationale. We stress that this is the first practical scheme for kernel-based distributed learning that has been proposed in the literature so far. The provided simulations verify the validity of the proposed approach.
