Chapter 9

Online Nonlinear Modeling via Self-Organizing Trees

Nuri Denizcan Vanli, Massachusetts Institute of Technology, Laboratory for Information and Decision Systems, Cambridge, MA, United States
Suleyman Serdar Kozat, Bilkent University, Ankara, Turkey

Abstract

We study online supervised learning and introduce regression and classification algorithms based on self-organizing trees (SOTs), which adaptively partition the feature space into small regions and combine simple local learners defined in these regions. The proposed algorithms sequentially minimize the cumulative loss by learning both the partitioning of the feature space and the parameters of the local learners defined in each region. The output of the algorithm at each time instance is constructed by combining the outputs of a doubly exponential number (in the depth of the SOT) of different predictors defined on this tree, with reduced computational and storage complexity. The introduced methods are generic in that they can incorporate tree construction methods other than the ones presented in this chapter. We present a comprehensive experimental study under stationary and nonstationary environments using benchmark datasets and illustrate remarkable performance improvements with respect to the state-of-the-art methods in the literature.

Keywords

Self-organizing trees; Nonlinear learning; Online learning; Classification and regression trees; Adaptive nonlinear filtering; Nonlinear modeling; Supervised learning

Chapter Points

  •  We present a nonlinear modeling method for online supervised learning problems.
  •  Nonlinear modeling is introduced via SOTs, which adaptively partition the feature space to minimize the loss of the algorithm.
  •  Experimental validation shows substantial empirical performance improvements with respect to the state-of-the-art methods.

Acknowledgements

The authors would like to thank Huseyin Ozkan for his contributions in this work.

9.1 Introduction

Nonlinear adaptive learning is extensively investigated in the signal processing [1–4] and machine learning [5–7] literatures, especially for applications where linear modeling is inadequate and hence does not provide satisfactory results due to the structural constraint of linearity. Although nonlinear approaches can be more powerful than linear methods in modeling, they usually suffer from overfitting as well as stability and convergence issues [8], which considerably limit their application to signal processing and machine learning problems. These issues are especially exacerbated in adaptive filtering due to the presence of feedback, which is hard to control even for linear models [9]. Furthermore, for applications involving big data, which require the processing of input vectors with considerably large dimensions, nonlinear models are usually avoided due to an unmanageable increase in computational complexity [10]. To overcome these difficulties, tree-based nonlinear adaptive filters or regressors are introduced as elegant alternatives to linear models, since these highly efficient methods retain the breadth of nonlinear models while mitigating the overfitting and convergence issues [11–13].

In its most basic form, a tree defines a hierarchical or nested partitioning of the feature space [12]. As an example, consider the binary tree in Fig. 9.1, which partitions a two-dimensional feature space. On this tree, each node is constructed by a bisection of the feature space (where we use hyperplanes for separation), which results in a complete nested and disjoint partitioning of the feature space. After the partitions are defined, the local learners in each region can be chosen as desired. As an example, to solve a regression problem, one can train a linear regressor in each region, which yields an overall piecewise linear regressor. In this sense, tree-based modeling is a natural nonlinear extension of linear models via a tractable nested structure.

Figure 9.1 Feature space partitioning using a binary tree. The partitioning of a two-dimensional feature space using a complete tree of depth 2 with hyperplanes for separation. The feature space is first bisected by $s_{t,\lambda}$, which is defined by the hyperplane $\boldsymbol{\phi}_{t,\lambda}$, where the region in the direction of the $\boldsymbol{\phi}_{t,\lambda}$ vector corresponds to the child with label “1”. We then continue to bisect the children regions using $s_{t,0}$ and $s_{t,1}$, defined by $\boldsymbol{\phi}_{t,0}$ and $\boldsymbol{\phi}_{t,1}$, respectively.

Although nonlinear modeling using trees is a powerful and efficient method, there exist several algorithmic parameters and design choices that affect their performance in many applications [11]. Tuning these parameters is a difficult task for applications involving nonstationary data exhibiting saturation effects, threshold phenomena or chaotic behavior [14]. In particular, the performance of tree-based models heavily depends on a careful partitioning of the feature space. Selection of a good partition is essential to balance the bias and variance of the regressor [12]. As an example, even for a uniform binary tree, while increasing the depth of the tree improves the modeling power, such an increase usually results in overfitting [15]. To address this issue, there exist nonlinear modeling algorithms that avoid such a direct commitment to a particular partition but instead construct a weighted average of all possible partitions (or equivalently, piece-wise models) defined on a tree [6,7,16,17]. Note that a full binary tree of depth d defines a doubly exponential number of different partitions of the feature space [18]; for an example, see Fig. 9.2. Each of these partitions can be represented by a certain collection of the nodes of the tree, where each node represents a particular region of the feature space. Any of these partitions can be used to construct a nonlinear model, e.g., by training a linear model in each region, we can obtain a piece-wise linear model. Instead of selecting one of these partitions and fixing it as the nonlinear model, one can run all partitions in parallel and combine their outputs using a mixture-of-experts approach. Such methods are shown to mitigate the bias–variance tradeoff in a deterministic framework [6,7,16,19]. However, these methods are naturally constrained to work on a fixed partitioning structure, i.e., the partitions are fixed and cannot be adapted to data.

Figure 9.2 Example partitioning for a binary classification problem. The left figure shows an example partitioning of a two-dimensional feature space using a depth-2 tree. The active region corresponding to a node is shown colored, where the dashed line represents the separating hyperplane at that node, and the two different colored subregions in a node represent the local classifier trained in that region. The right figure shows all different partitions (and consequently classifiers) defined by the tree on the left.

Although there exist numerous methods to partition the feature space, the split criteria are typically chosen a priori and fixed, as in dyadic partitioning [20], and a specific loss (e.g., the Gini index [21]) is minimized separately for each node. For instance, multivariate trees are extended in [13] to allow the simultaneous use of functional inner and leaf nodes to draw a decision. Similarly, the node-specific individual decisions are combined in [22] via the context tree weighting method [23], and a piece-wise linear model for sequential classification is obtained. Since the partitions in these methods are fixed and chosen even before the processing starts, the nonlinear modeling capability of such methods is rather limited and significantly deteriorates in cases of high dimensionality [24].

To resolve this issue, we introduce self-organizing trees (SOTs) that jointly learn the optimal feature space partitioning to minimize the loss of the algorithm. In particular, we consider a binary tree where a separator (e.g., a hyperplane) is used to bisect the feature space in a nested manner, and an online linear predictor is assigned to each node. The sequential losses of these node predictors are combined (with their corresponding weights, which are sequentially learned) into a global loss that is parameterized by the separator functions and the parameters of the node predictors. We minimize this global loss using online gradient descent, i.e., by updating the complete set of SOT parameters (the separators, the node predictors and the combination weights) at each time instance. The resulting predictor is a highly dynamical SOT structure that jointly (and in an online and adaptive manner) learns the region classifiers and the optimal feature space partitioning and hence provides efficient nonlinear modeling with multiple learning machines. In this respect, the proposed method is remarkably robust to drifting source statistics, i.e., nonstationarity. Since our approach is essentially based on a finite combination of linear models, it generalizes well and does not overfit, or overfits only to a limited extent (as also shown by an extensive set of experiments).

9.2 Self-Organizing Trees for Regression Problems

In this section, we consider the sequential nonlinear regression problem, where we observe a desired signal $\{d_t\}_{t\geq 1}$, $d_t \in \mathbb{R}$, and regression vectors $\{\mathbf{x}_t\}_{t\geq 1}$, $\mathbf{x}_t \in \mathbb{R}^p$, and sequentially estimate $d_t$ by

$$\hat{d}_t = f_t(\mathbf{x}_t),$$

where $f_t(\cdot)$ is the adaptive nonlinear regression function defined by the SOT. At each time $t$, the regression error of the algorithm is given by

$$e_t = d_t - \hat{d}_t$$

and the objective of the algorithm is to minimize the squared error loss $\sum_{t=1}^{T} e_t^2$, where $T$ is the number of observed samples.

9.2.1 Notation

We first introduce a labeling for the tree nodes following [23]. The root node is labeled with an empty binary string $\lambda$ and, assuming that a node has a label $n$, where $n$ is a binary string, we label its upper and lower children as $n1$ and $n0$, respectively. Here we emphasize that a string can only take its letters from the binary alphabet $\{0,1\}$, where 0 refers to the lower child and 1 refers to the upper child of a node. We also introduce the notion of the prefix of a string. We say that a string $n' = q'_1 \cdots q'_{l'}$ is a prefix of the string $n = q_1 \cdots q_l$ if $l' \leq l$ and $q'_i = q_i$ for all $i = 1,\ldots,l'$, and the empty string $\lambda$ is a prefix of all strings. Let $P(n)$ represent the set of all prefixes of the string $n$, i.e., $P(n) \triangleq \{n_0,\ldots,n_l\}$, where $l \triangleq l(n)$ is the length of the string $n$, $n_i$ is the prefix with $l(n_i) = i$, and $n_0 = \lambda$ is the empty string, such that the first $i$ letters of the string $n$ form the string $n_i$ for $i = 0,\ldots,l$.

For a given SOT of depth $D$, we let $\mathcal{N}_D$ denote the set of all nodes of the SOT and $\mathcal{L}_D$ denote the set of its leaf nodes. We also let $\beta_D$ denote the number of partitions defined on this SOT; it satisfies the recursion $\beta_{j+1} = \beta_j^2 + 1$ for all $j \geq 0$, with the base case $\beta_0 = 1$. For a given partition $k$, we let $\mathcal{M}_k$ denote the set of all nodes in this partition.

For a node $n \in \mathcal{N}_D$ (defined on the SOT of depth $D$), we define $S_D(n) \triangleq \{n' \in \mathcal{N}_D : n \in P(n')\}$ as the set of all nodes of the SOT of depth $D$ whose set of prefixes includes node $n$.

For a node $n \in \mathcal{N}_D$ (defined on the SOT of depth $D$) with length $l(n) \geq 1$, the total number of partitions that contain $n$ can be found by the following recursion:

$$\gamma_D(l(n)) \triangleq \prod_{j=1}^{l(n)} \beta_{D-j}.$$

For the case where $l(n) = 0$ (i.e., for $n = \lambda$), one can clearly observe that there exists only one partition containing $\lambda$; therefore $\gamma_D(0) = 1$.

For two nodes $n, n' \in \mathcal{N}_D$ (defined on the SOT of depth $D$), we let $\rho(n,n')$ denote the number of partitions that contain both $n$ and $n'$. Trivially, if $n' = n$, then $\rho(n,n') = \gamma_D(l(n))$. If $n \neq n'$, then letting $\bar{n}$ denote the longest prefix common to both $n$ and $n'$, i.e., the longest string in $P(n) \cap P(n')$, we obtain

$$\rho(n,n') \triangleq \begin{cases} \gamma_D(l(n)), & \text{if } n = n', \\ \dfrac{\gamma_D(l(n))\,\gamma_{D-l(\bar{n})-1}\big(l(n')-l(\bar{n})-1\big)}{\beta_{D-l(\bar{n})-1}}, & \text{if } n \notin P(n') \cup S_D(n'), \\ 0, & \text{otherwise.} \end{cases} \tag{9.1}$$

Since $l(\bar{n})+1 \leq l(n)$ and $l(\bar{n})+1 \leq l(n')$ by the definition of the SOT, we naturally have $\rho(n,n') = \rho(n',n)$.
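To make these counting rules concrete, the short Python sketch below (our own illustration, not part of the original algorithm) evaluates the recursion for $\beta_d$, the quantity $\gamma_D(l(n))$ and the pair count $\rho(n,n')$ of (9.1), representing nodes as binary strings with the root as the empty string.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def beta(d):
    """Number of partitions defined on a depth-d tree: beta_{j+1} = beta_j**2 + 1."""
    return 1 if d == 0 else beta(d - 1) ** 2 + 1

def gamma(d, l):
    """Number of partitions of a depth-d tree that contain a node of depth l."""
    out = 1
    for j in range(1, l + 1):
        out *= beta(d - j)
    return out

def rho(n, n_prime, D):
    """Number of partitions of a depth-D tree containing both nodes n and n_prime (Eq. 9.1)."""
    if n == n_prime:
        return gamma(D, len(n))
    # if one node is a prefix (ancestor) of the other, no partition contains both
    if n_prime.startswith(n) or n.startswith(n_prime):
        return 0
    # longest common prefix of n and n_prime
    l_bar = 0
    while l_bar < min(len(n), len(n_prime)) and n[l_bar] == n_prime[l_bar]:
        l_bar += 1
    d_sub = D - l_bar - 1    # depth of the sibling subtree containing n_prime
    return gamma(D, len(n)) * gamma(d_sub, len(n_prime) - l_bar - 1) // beta(d_sub)

print(beta(2))            # 5 partitions for the depth-2 tree of Fig. 9.2
print(rho("0", "1", 2))   # only one partition contains both depth-1 nodes
```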

9.2.2 Construction of the Algorithm

For each node n on the SOT, we define a node predictor

$$\hat{d}_{t,n} = \mathbf{v}_{t,n}^T \mathbf{x}_t, \tag{9.2}$$

whose parameter $\mathbf{v}_{t,n}$ is updated using the online gradient descent algorithm. We also define a separator function for each node $n$ on the SOT except the leaf nodes (note that leaf nodes do not have any children) using the sigmoid function

$$s_{t,n} = \frac{1}{1 + e^{\boldsymbol{\phi}_{t,n}^T \mathbf{x}_t}}, \tag{9.3}$$

where $\boldsymbol{\phi}_{t,n}$ is the normal vector of the separating hyperplane. We then define the prediction of any partition, according to the hierarchical structure of the SOT, as the weighted sum of the predictions of the nodes in that partition, where the weight of each node's prediction is determined by the separator functions of its ancestors on the path from the root node to that node. In particular, the prediction of the $k$th partition at time $t$ is defined as follows:

$$\hat{d}_t^{(k)} = \sum_{n \in \mathcal{M}_k} \left( \hat{d}_{t,n} \prod_{i=0}^{l(n)-1} s_{t,n_i}^{q_i} \right), \tag{9.4}$$

where $n_i \in P(n)$ is the prefix of the string $n$ with length $i$, $q_i$ is the letter of $n$ that follows the prefix $n_i$, i.e., $n_{i+1} = n_i q_i$, and finally $s_{t,n_i}^{q_i}$ denotes the value of the separator function at node $n_i$ toward the direction $q_i$, such that

$$s_{t,n_i}^{q_i} \triangleq \begin{cases} s_{t,n_i}, & \text{if } q_i = 0, \\ 1 - s_{t,n_i}, & \text{otherwise,} \end{cases} \tag{9.5}$$

with $s_{t,n_i}$ defined as in (9.3). We emphasize that we dropped the $n$ dependency of $q_i$ and $n_i$ to simplify notation. Using these definitions, we can construct the final estimate of our algorithm as

$$\hat{d}_t = \sum_{k=1}^{\beta_D} w_t^{(k)} \hat{d}_t^{(k)}, \tag{9.6}$$

where $w_t^{(k)}$ represents the weight of partition $k$ at time $t$.

Having found a method to combine the predictions of all partitions to generate the final prediction of the algorithm, we next aim to obtain a low-complexity representation, since there are $O(1.5^{2^D})$ different partitions defined on the SOT and (9.6) requires a storage and computational complexity of $O(1.5^{2^D})$. To this end, we denote the product terms in (9.4) as follows:

$$\hat{\delta}_{t,n} \triangleq \hat{d}_{t,n} \prod_{i=0}^{l(n)-1} s_{t,n_i}^{q_i}, \tag{9.7}$$

where $\hat{\delta}_{t,n}$ can be viewed as the estimate of node $n$ at time $t$. Then (9.4) can be rewritten as follows:

$$\hat{d}_t^{(k)} = \sum_{n \in \mathcal{M}_k} \hat{\delta}_{t,n}.$$
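As a concrete illustration of (9.2), (9.3), (9.5) and (9.7), the following minimal sketch (a hypothetical setup with randomly initialized parameters, assuming a depth-2 SOT whose nodes are labeled by binary strings) computes the separator values and the node estimates $\hat{\delta}_{t,n}$ for a single feature vector.

```python
import numpy as np

D = 2                                            # depth of the SOT
nodes = ["", "0", "1", "00", "01", "10", "11"]   # all nodes N_D as binary strings
internal = [n for n in nodes if len(n) < D]      # nodes that carry a separator

rng = np.random.default_rng(0)
p = 2
v = {n: rng.normal(size=p) for n in nodes}       # node predictor parameters (9.2)
phi = {n: rng.normal(size=p) for n in internal}  # separator directions (9.3)

def separators(x):
    """s_{t,n} = 1 / (1 + exp(phi_{t,n}^T x)) for every internal node (9.3)."""
    return {n: 1.0 / (1.0 + np.exp(phi[n] @ x)) for n in internal}

def node_estimates(x, s):
    """delta_{t,n} = d_{t,n} * prod_i s^{q_i}_{t,n_i}, cf. (9.5) and (9.7)."""
    delta = {}
    for n in nodes:
        d_n = v[n] @ x                           # node prediction (9.2)
        weight = 1.0
        for i, q in enumerate(n):                # walk the prefixes n_0, ..., n_{l(n)-1}
            s_i = s[n[:i]]
            weight *= s_i if q == "0" else (1.0 - s_i)
        delta[n] = d_n * weight
    return delta

x_t = np.array([0.3, -1.2])
s_t = separators(x_t)
delta_t = node_estimates(x_t, s_t)
```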

Since we now have a compact form to represent the tree and the outputs of each partition, we next introduce a method to calculate the combination weights of the $O(1.5^{2^D})$ partitions in a simplified manner. For this, we assign a particular linear weight to each node. We denote the weight of node $n$ at time $t$ as $w_{t,n}$ and then define the weight of the $k$th partition as the sum of the weights of its nodes, i.e.,

$$w_t^{(k)} = \sum_{n \in \mathcal{M}_k} w_{t,n},$$

for all $k \in \{1,\ldots,\beta_D\}$. Since we use online gradient descent to update the weight of each partition, the weight of partition $k$ is recursively updated as

$$w_{t+1}^{(k)} = w_t^{(k)} + \mu_t e_t \hat{d}_t^{(k)}.$$

This yields the following recursive update on the node weights:

$$w_{t+1,n} = w_{t,n} + \mu_t e_t \hat{\delta}_{t,n}, \tag{9.8}$$

where $\hat{\delta}_{t,n}$ is defined as in (9.7). This result implies that instead of managing $O(1.5^{2^D})$ memory locations and making $O(1.5^{2^D})$ calculations, keeping track of the weight of every node is sufficient, and the number of nodes in a depth-$D$ model is $|\mathcal{N}_D| = 2^{D+1} - 1$. Therefore, we can reduce the storage and computational complexity from $O(1.5^{2^D})$ to $O(2^D)$ by performing the update in (9.8) for all $n \in \mathcal{N}_D$.

Using these node predictors and weights, we construct the final estimate of our algorithm as follows:

$$\hat{d}_t = \sum_{k=1}^{\beta_D} \left\{ \left( \sum_{n \in \mathcal{M}_k} w_{t,n} \right) \left( \sum_{n \in \mathcal{M}_k} \hat{\delta}_{t,n} \right) \right\}.$$

Here, we observe that for any two nodes $n, n' \in \mathcal{N}_D$, the product $w_{t,n} \hat{\delta}_{t,n'}$ appears $\rho(n,n')$ times in $\hat{d}_t$ (cf. (9.1)). Hence, the combination weight of the estimate of node $n$ at time $t$ can be calculated as follows:

$$\kappa_{t,n} = \sum_{n' \in \mathcal{N}_D} \rho(n,n') \, w_{t,n'}. \tag{9.9}$$

Using the combination weight (9.9), we obtain the final estimate of our algorithm as follows:

$$\hat{d}_t = \sum_{n \in \mathcal{N}_D} \kappa_{t,n} \hat{\delta}_{t,n}. \tag{9.10}$$

Note that (9.10) is equal to (9.6) with a storage and computational complexity of $O(4^D)$ instead of $O(1.5^{2^D})$.
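Continuing the sketch above (and reusing the function rho from the sketch in Section 9.2.1), the combination weights $\kappa_{t,n}$ of (9.9), the final estimate (9.10) and the node weight update (9.8) each reduce to a few lines; this is an illustrative sketch with placeholder values, not the authors' reference implementation.

```python
# Continues the previous sketch; rho(n, n_prime, D) is the function defined in
# the sketch of Section 9.2.1.
w = {n: 1.0 / len(nodes) for n in nodes}         # node weights w_{t,n} (placeholder init)
mu = 0.01                                        # learning rate mu_t of (9.8)

def combination_weights(w):
    """kappa_{t,n} = sum_{n'} rho(n, n') * w_{t,n'}   (9.9)."""
    return {n: sum(rho(n, m, D) * w[m] for m in nodes) for n in nodes}

def predict(delta, kappa):
    """Final SOT estimate (9.10)."""
    return sum(kappa[n] * delta[n] for n in nodes)

kappa_t = combination_weights(w)
d_hat = predict(delta_t, kappa_t)

d_t = 0.7                                        # observed desired signal (placeholder)
e_t = d_t - d_hat                                # regression error
for n in nodes:                                  # node weight update (9.8)
    w[n] += mu * e_t * delta_t[n]
```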

Having derived the update rules for the node weights and the parameters of the individual node predictors, it remains to provide an update scheme for the separator functions. To this end, we use the online gradient descent update

$$\boldsymbol{\phi}_{t+1,n} = \boldsymbol{\phi}_{t,n} - \frac{1}{2} \eta_t \nabla e_t^2(\boldsymbol{\phi}_{t,n}), \tag{9.11}$$

for all nodes $n \in \mathcal{N}_D \setminus \mathcal{L}_D$, where $\eta_t$ is the learning rate of the algorithm and $\nabla e_t^2(\boldsymbol{\phi}_{t,n})$ is the derivative of $e_t^2$ with respect to $\boldsymbol{\phi}_{t,n}$. After some algebra, we obtain

$$\begin{aligned} \boldsymbol{\phi}_{t+1,n} &= \boldsymbol{\phi}_{t,n} + \eta_t e_t \frac{\partial \hat{d}_t}{\partial s_{t,n}} \frac{\partial s_{t,n}}{\partial \boldsymbol{\phi}_{t,n}} \\ &= \boldsymbol{\phi}_{t,n} + \eta_t e_t \left\{ \sum_{n' \in \mathcal{N}_D} \kappa_{t,n'} \frac{\partial \hat{\delta}_{t,n'}}{\partial s_{t,n}} \right\} \frac{\partial s_{t,n}}{\partial \boldsymbol{\phi}_{t,n}} \\ &= \boldsymbol{\phi}_{t,n} + \eta_t e_t \left\{ \sum_{q=0}^{1} \sum_{n' \in S_D(nq)} (-1)^q \frac{\kappa_{t,n'} \hat{\delta}_{t,n'}}{s_{t,n}^q} \right\} \frac{\partial s_{t,n}}{\partial \boldsymbol{\phi}_{t,n}}, \end{aligned} \tag{9.12}$$

where we use the logistic regression classifier as our separator function, i.e., $s_{t,n} = \left(1 + e^{\mathbf{x}_t^T \boldsymbol{\phi}_{t,n}}\right)^{-1}$. Therefore, we have

$$\frac{\partial s_{t,n}}{\partial \boldsymbol{\phi}_{t,n}} = -\frac{e^{\mathbf{x}_t^T \boldsymbol{\phi}_{t,n}}}{\left(1 + e^{\mathbf{x}_t^T \boldsymbol{\phi}_{t,n}}\right)^2} \mathbf{x}_t = -s_{t,n}(1 - s_{t,n}) \mathbf{x}_t. \tag{9.13}$$

We emphasize that other separator functions can also be used in a similar manner by simply calculating the gradient with respect to the extended direction vector and plugging it into (9.12) and (9.13). From (9.13), we observe that $\nabla e_t^2(\boldsymbol{\phi}_{t,n})$ includes the product of the $s_{t,n}$ and $1 - s_{t,n}$ terms; hence, in order not to slow down the learning rate of our algorithm, we restrict $s^+ \leq s_t \leq 1 - s^+$ for some $0 < s^+ < 0.5$. In accordance with this restriction, we define the separator functions as follows:

$$s_t = s^+ + \frac{1 - 2s^+}{1 + e^{\mathbf{x}_t^T \boldsymbol{\phi}_t}}. \tag{9.14}$$

According to the update rule in (9.12), the computational complexity of the introduced algorithm is $O(p\,4^D)$. This concludes the construction of the algorithm; a pseudocode is given in Algorithm 1.

Algorithm 1 Self-Organizing Tree Regressor (SOTR).
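To complement the pseudocode, the sketch below (again our own illustration, continuing the previous snippets and using the plain sigmoid of (9.3); the clipped separator of (9.14) would only rescale the derivative) spells out the separator update of (9.12) and (9.13) with a placeholder learning rate.

```python
eta = 0.01                                       # separator learning rate eta_t (placeholder)

def update_separators(x, e, s, delta, kappa):
    """Gradient step (9.12) on phi_{t,n} for every internal node n."""
    for n in internal:
        # double sum over the two subtrees of n, cf. the last line of (9.12)
        grad_sum = 0.0
        for q in (0, 1):
            child = n + str(q)
            s_q = s[n] if q == 0 else (1.0 - s[n])
            descendants = [m for m in nodes if m.startswith(child)]  # S_D(nq)
            grad_sum += ((-1) ** q) * sum(kappa[m] * delta[m] for m in descendants) / s_q
        ds_dphi = -s[n] * (1.0 - s[n]) * x       # derivative of the sigmoid, cf. (9.13)
        phi[n] = phi[n] + eta * e * grad_sum * ds_dphi

update_separators(x_t, e_t, s_t, delta_t, kappa_t)
```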

9.2.3 Convergence of the Algorithm

For Algorithm 1, we have the following convergence guarantee, which implies that our regressor (given in Algorithm 1) asymptotically achieves the performance of the best linear combination of the $O(1.5^{2^D})$ different adaptive models that can be represented using a depth-$D$ tree, with a computational complexity of $O(p\,4^D)$. While constructing the algorithm, we refrain from any statistical assumptions on the underlying data, and our algorithm works for any sequence $\{d_t\}_{t\geq 1}$ of arbitrary length. Furthermore, one can use this algorithm to first learn the region boundaries and then feed this information to an algorithm with a fixed partitioning structure to reduce the computational complexity.

Theorem 1

Let $\{d_t\}_{t\geq 1}$ and $\{\mathbf{x}_t\}_{t\geq 1}$ be arbitrary, bounded and real-valued sequences. The predictor $\hat{d}_t$ given in Algorithm 1, when applied to these sequences, yields

$$\sum_{t=1}^{T} (d_t - \hat{d}_t)^2 - \min_{\mathbf{w} \in \mathbb{R}^{\beta_D}} \sum_{t=1}^{T} \left(d_t - \mathbf{w}^T \hat{\mathbf{d}}_t\right)^2 \leq O(\log T), \tag{9.15}$$

for all $T$, when $e_t^2(\mathbf{w})$ is strongly convex, where $\hat{\mathbf{d}}_t = [\hat{d}_t^{(1)}, \ldots, \hat{d}_t^{(\beta_D)}]^T$ and $\hat{d}_t^{(k)}$ represents the estimate of $d_t$ at time $t$ produced by the adaptive model $k = 1,\ldots,\beta_D$.

Proof of this theorem can be found in Appendix 9.A.1.

9.3 Self-Organizing Trees for Binary Classification Problems

In this section, we study online binary classification, where we observe feature vectors $\{\mathbf{x}_t\}_{t\geq 1}$ and determine their labels $\{y_t\}_{t\geq 1}$ in an online manner. In particular, the aim is to learn a classification function $f_t(\mathbf{x}_t)$, with $\mathbf{x}_t \in \mathbb{R}^p$ and $y_t \in \{-1,1\}$, such that, when applied in an online manner to any streaming data, the empirical loss of the classifier $f_t(\cdot)$, i.e.,

$$L_T(f_t) \triangleq \sum_{t=1}^{T} 1_{\{f_t(\mathbf{x}_t) \neq y_t\}}, \tag{9.16}$$

is asymptotically as small (after averaging over $T$) as the empirical loss of the best partition classifier defined over the SOT of depth $D$. To be more precise, we measure the relative performance of $f_t$ with respect to the performance of a partition classifier $f_t^{(k)}$, where $k \in \{1,\ldots,\beta_D\}$, using the following regret:

$$R_T\big(f_t; f_t^{(k)}\big) \triangleq \frac{L_T(f_t) - L_T\big(f_t^{(k)}\big)}{T}, \tag{9.17}$$

for any arbitrary length T. Our aim is then to construct an online algorithm with guaranteed upper bounds on this regret for any partition classifier defined over the SOT.

9.3.1 Construction of the Algorithm

Using the notation described in Section 9.2.1, the output of a partition classifier $k \in \{1,\ldots,\beta_D\}$ is constructed as follows. Without loss of generality, suppose that the feature $\mathbf{x}_t$ has fallen into the region represented by the leaf node $n \in \mathcal{L}_D$. Then $\mathbf{x}_t$ is contained in the nodes $n^0,\ldots,n^D$, where $n^d$ is the $d$-letter prefix of $n$, i.e., $n^D = n$ and $n^0 = \lambda$. For example, if node $n^d$ is contained in partition $k$, then one can simply set $f_t^{(k)}(\mathbf{x}_t) = f_{t,n^d}(\mathbf{x}_t)$. Instead of making such a hard selection, we allow an error margin for the classification output $f_{t,n^d}(\mathbf{x}_t)$ in order to be able to update the region boundaries later. To achieve this, for each node contained in partition $k$, we define a parameter called the path probability to measure the contribution of each node to the classification task at time $t$. This parameter is equal to the product of the separator functions of the nodes from the respective node up to the root node, and it represents the probability that $\mathbf{x}_t$ should be classified using the region classifier of node $n^d$. This path probability (similar to the node predictor definition in (9.7)) is defined as

$$P_{t,n^d}(\mathbf{x}_t) \triangleq \prod_{i=0}^{d-1} s_{t,n^i}^{q_{i+1}}(\mathbf{x}_t), \tag{9.18}$$

where $s_{t,n^i}^{q_{i+1}}(\cdot)$ represents the value of the separator function corresponding to node $n^i$ toward the $q_{i+1}$ direction as in (9.5), $q_{i+1}$ being the letter of $n$ that connects $n^i$ to $n^{i+1}$. We consider that the classification output of node $n^d$ can be trusted with probability $P_{t,n^d}(\mathbf{x}_t)$. This and the other probabilities in our development are defined for ease of exposition and to provide intuition; they are not related to the unknown data statistics in any way and cannot be regarded as assumptions on the data. Indeed, we do not make any assumptions about the data source.
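For concreteness, the following minimal sketch (our own illustration, with hyperplane separators as in (9.3), a depth-2 tree and a hard assignment of $\mathbf{x}_t$ to the child with the larger separator value, which is an assumption made here for illustration) computes the path probabilities $P_{t,n^d}(\mathbf{x}_t)$ of (9.18) along the leaf path of a feature vector.

```python
import numpy as np

D = 2
rng = np.random.default_rng(1)
p = 2
# separator directions for the internal nodes, indexed by binary strings
phi = {n: rng.normal(size=p) for n in ["", "0", "1"]}

def separator(n, x):
    """s_{t,n}(x) as in (9.3)."""
    return 1.0 / (1.0 + np.exp(phi[n] @ x))

def leaf_path(x):
    """Return the nodes n^0, ..., n^D that contain x (hard assignment by the separators)."""
    path = [""]
    for _ in range(D):
        n = path[-1]
        # child "0" is taken when s_{t,n}(x) >= 0.5, child "1" otherwise
        path.append(n + ("0" if separator(n, x) >= 0.5 else "1"))
    return path

def path_probabilities(x):
    """P_{t,n^d}(x) = prod_{i<d} s^{q_{i+1}}_{t,n^i}(x), cf. (9.18)."""
    path = leaf_path(x)
    probs = [1.0]                                # P for the root n^0 is the empty product
    for d in range(1, D + 1):
        parent, q = path[d - 1], path[d][-1]
        s = separator(parent, x)
        probs.append(probs[-1] * (s if q == "0" else 1.0 - s))
    return dict(zip(path, probs))

print(path_probabilities(np.array([0.4, -0.8])))
```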

Intuitively, the path probability is low when the feature vector is close to the region boundaries; hence we may consider classifying that feature vector with another node classifier (e.g., the classifier of the sibling node). Using these path probabilities, we aim to update the region boundaries by learning whether an efficient node classifier is used to classify $\mathbf{x}_t$, instead of directly assigning $\mathbf{x}_t$ to node $n^d$ and losing a significant degree of freedom. To this end, we define the final output of each node classifier according to a Bernoulli random variable with outcomes $\{-f_{t,n^d}(\mathbf{x}_t), f_{t,n^d}(\mathbf{x}_t)\}$, where the probability of the latter outcome is $P_{t,n^d}(\mathbf{x}_t)$. Although the final classification output of node $n^d$ is generated according to this Bernoulli random variable, we continue to call $f_{t,n^d}(\mathbf{x}_t)$ the final classification output of node $n^d$, with an abuse of notation. Then the classification output of the partition classifier is set to $f_t^{(k)}(\mathbf{x}_t) = f_{t,n^d}(\mathbf{x}_t)$.

Before constructing the SOT classifier, we first introduce certain definitions. Let the instantaneous empirical loss of the proposed classifier $f_t$ at time $t$ be denoted by $\ell_t(f_t) \triangleq 1_{\{f_t(\mathbf{x}_t) \neq y_t\}}$. Then the expected empirical loss of this classifier over a sequence of length $T$ can be found by

$$L_T(f_t) = E\left[\sum_{t=1}^{T} \ell_t(f_t)\right], \tag{9.19}$$

with the expectation taken with respect to the randomization parameters of the classifier $f_t$. We also define the effective region of each node $n^d$ at time $t$ as $R_{t,n^d} \triangleq \{\mathbf{x} : P_{t,n^d}(\mathbf{x}) \geq (0.5)^d\}$. According to the aforementioned structure of partition classifiers, the node $n^d$ classifies an instance $\mathbf{x}_t$ only if $\mathbf{x}_t \in R_{t,n^d}$. Therefore, the time accumulated empirical loss of any node $n$ during the data stream is given by

$$L_{T,n} \triangleq \sum_{t \leq T:\ \mathbf{x}_t \in R_{t,n}} \ell_t(f_{t,n}). \tag{9.20}$$

Similarly, the time accumulated empirical loss of a partition classifier $k$ is $L_T^{(k)} \triangleq \sum_{n \in \mathcal{M}_k} L_{T,n}$.

We then use a mixture-of-experts approach to achieve the performance of the best partition classifier that minimizes the accumulated classification error. To this end, we set the final classification output of our algorithm as $f_t(\mathbf{x}_t) = f_t^{(k)}(\mathbf{x}_t)$ with probability $w_t^{(k)}$, where

$$w_t^{(k)} = \frac{1}{Z_{t-1}} 2^{-J(k)} \exp\left(-b L_{t-1}^{(k)}\right),$$

$b > 0$ is a constant controlling the learning rate of the algorithm, $J(k) \leq 2|\mathcal{M}_k| - 1$ represents the number of bits required to code the partition $k$ (which satisfies $\sum_{k=1}^{\beta_D} 2^{-J(k)} = 1$) and $Z_t = \sum_{k=1}^{\beta_D} 2^{-J(k)} \exp\left(-b L_t^{(k)}\right)$ is the normalization factor.

Although this randomized method can be used as the SOT classifier, in its current form it requires a computational complexity of $O(1.5^{2^D} p)$, since the randomization $w_t^{(k)}$ is performed over the set $\{1,\ldots,\beta_D\}$ and $\beta_D \approx 1.5^{2^D}$. However, the set of all possible classification outputs of these partitions has a cardinality as small as $D+1$, since $\mathbf{x}_t \in R_{t,n^D}$ for the corresponding leaf node $n^D$ (in which $\mathbf{x}_t$ is included) and $f_t^{(k)} = f_{t,n^d}$ for some $d = 0,\ldots,D$, for every $k \in \{1,\ldots,\beta_D\}$. Hence, evaluating all the partition classifiers at the instance $\mathbf{x}_t$ to produce $f_t(\mathbf{x}_t)$ is unnecessary. In fact, the computational complexity of producing $f_t(\mathbf{x}_t)$ can be reduced from $O(1.5^{2^D} p)$ to $O(Dp)$ by performing the exact same randomization over the $f_{t,n^d}$'s using the new set of weights $w_{t,n^d}$, which can be straightforwardly derived as follows:

$$w_{t,n^d} = \sum_{k=1}^{\beta_D} w_t^{(k)} 1_{\left\{f_t^{(k)}(\mathbf{x}_t) = f_{t,n^d}(\mathbf{x}_t)\right\}}. \tag{9.21}$$

To efficiently calculate (9.21) with complexity $O(Dp)$, we consider the universal coding scheme and let

$$M_{t,n} \triangleq \begin{cases} \exp(-b L_{t,n}), & \text{if } n \text{ has depth } D, \\ \frac{1}{2}\left[M_{t,n0} M_{t,n1} + \exp(-b L_{t,n})\right], & \text{otherwise,} \end{cases} \tag{9.22}$$

for any node $n$, and observe that $M_{t,\lambda} = Z_t$ [23]. Therefore, we can use the recursion (9.22) to obtain the denominator of the randomization probabilities $w_t^{(k)}$. To efficiently calculate the numerator of (9.21), we introduce another intermediate parameter as follows. Letting $\bar{n}^d$ denote the sibling of node $n^d$, we recursively define

$$\kappa_{t,n^d} \triangleq \begin{cases} \frac{1}{2}, & \text{if } d = 0, \\ \frac{1}{2} M_{t-1,\bar{n}^d} \, \kappa_{t,n^{d-1}}, & \text{if } 0 < d < D, \\ M_{t-1,\bar{n}^d} \, \kappa_{t,n^{d-1}}, & \text{if } d = D, \end{cases} \tag{9.23}$$

for $d \in \{0,\ldots,D\}$, where $\mathbf{x}_t \in R_{t,n^D}$. Using the intermediate parameters in (9.22) and (9.23), it can be shown that

$$w_{t,n^d} = \frac{\kappa_{t,n^d} \exp\left(-b L_{t-1,n^d}\right)}{M_{t-1,\lambda}}. \tag{9.24}$$

Hence, we obtain the final output of the algorithm as $f_t(\mathbf{x}_t) = f_{t,n^d}(\mathbf{x}_t)$ with probability $w_{t,n^d}$, where $d \in \{0,\ldots,D\}$ (i.e., with a computational complexity of $O(D)$).
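A minimal sketch of the recursions (9.22)–(9.24), as reconstructed above, is given below; the node losses and the leaf path are placeholder inputs, and the resulting weights $w_{t,n^d}$ along the path sum to one.

```python
import math
import random

D = 2
b = 0.5                                          # learning-rate constant of the mixture
nodes = ["", "0", "1", "00", "01", "10", "11"]
L = {n: random.random() for n in nodes}          # placeholder accumulated losses L_{t-1,n}

def M(n):
    """Recursion (9.22); M("") equals the normalization factor Z."""
    if len(n) == D:
        return math.exp(-b * L[n])
    return 0.5 * (M(n + "0") * M(n + "1") + math.exp(-b * L[n]))

def node_weights(path):
    """Randomization weights w_{t,n^d} along the leaf path, via (9.23) and (9.24)."""
    kappa = [0.5]                                # kappa_{t,n^0}
    for d in range(1, D + 1):
        parent, last = path[d][:-1], path[d][-1]
        sibling = parent + ("1" if last == "0" else "0")
        factor = M(sibling) * kappa[d - 1]
        kappa.append(factor if d == D else 0.5 * factor)
    Z = M("")                                    # M_{t-1,lambda} = Z_{t-1}
    return [kappa[d] * math.exp(-b * L[path[d]]) / Z for d in range(D + 1)]

path = ["", "1", "10"]                           # example leaf path n^0, n^1, n^2
w = node_weights(path)
print(w, sum(w))                                 # the weights sum to 1
```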

We then use the final output of the introduced algorithm and update the region boundaries of the tree (i.e., organize the tree) to minimize the final classification error. To this end, we minimize the loss $E[\ell_t(f_t)] = E\left[1_{\{f_t(\mathbf{x}_t) \neq y_t\}}\right] = \frac{1}{4} E\left[(y_t - f_t(\mathbf{x}_t))^2\right]$ with respect to the region boundary parameters, i.e., we use the stochastic gradient descent method as follows:

$$\boldsymbol{\phi}_{t+1,n^d} = \boldsymbol{\phi}_{t,n^d} - \eta \nabla E[\ell_t(f_t)] = \boldsymbol{\phi}_{t,n^d} - (-1)^{q_{d+1}} \eta \left(y_t - f_t(\mathbf{x}_t)\right) s_{t,n^d}^{\bar{q}_{d+1}}(\mathbf{x}_t) \left[\sum_{i=d+1}^{D} f_{t,n^i}(\mathbf{x}_t)\right] \mathbf{x}_t, \tag{9.25}$$

for $d \in \{0,\ldots,D-1\}$, where $\eta$ denotes the learning rate of the algorithm and $\bar{q}_{d+1}$ represents the letter complementary to $q_{d+1}$ in the binary alphabet $\{0,1\}$. Defining a new intermediate variable

$$\pi_{t,n^d} \triangleq \begin{cases} f_{t,n^D}(\mathbf{x}_t), & \text{if } d = D-1, \\ \pi_{t,n^{d+1}} + f_{t,n^{d+1}}(\mathbf{x}_t), & \text{if } d < D-1, \end{cases} \tag{9.26}$$

one can perform the update in (9.25) with a computational complexity of $O(p)$ for each node $n^d$, where $d \in \{0,\ldots,D-1\}$, resulting in an overall computational complexity of $O(Dp)$, as follows:

$$\boldsymbol{\phi}_{t+1,n^d} = \boldsymbol{\phi}_{t,n^d} - (-1)^{q_{d+1}} \eta \left(y_t - f_t(\mathbf{x}_t)\right) \pi_{t,n^d} \, s_{t,n^d}^{\bar{q}_{d+1}}(\mathbf{x}_t) \, \mathbf{x}_t. \tag{9.27}$$

This concludes the construction of the algorithm and the pseudocode of the SOT classifier can be found in Algorithm 2.

Algorithm 2 Self-Organizing Tree Classifier (SOTC).

9.3.2 Convergence of the Algorithm

In this section, we show that the performance of Algorithm 2 is asymptotically as good as that of the best partition classifier, i.e., as $T \to \infty$, we have $R_T(f_t; f_t^{(k)}) \to 0$. Hence, Algorithm 2 asymptotically achieves the performance of the best partition classifier among the $O(1.5^{2^D})$ different classifiers that can be represented using the SOT of depth $D$, with a significantly reduced computational complexity of $O(Dp)$ and without any statistical assumptions on the data.

Theorem 2

Let $\{\mathbf{x}_t\}_{t\geq 1}$ and $\{y_t\}_{t\geq 1}$ be arbitrary sequences of real-valued feature vectors and their binary labels, respectively. Then Algorithm 2, when applied to these data sequences, sequentially yields

$$\max_{k \in \{1,\ldots,\beta_D\}} E\left[R_T\big(f_t; f_t^{(k)}\big)\right] \leq O\left(\sqrt{\frac{2^D}{T}}\right), \tag{9.28}$$

for all T with a computational complexity O(Dp)Image, where p represents the dimensionality of the feature vectors and the expectation is with respect to the randomization parameters.

Proof of this theorem can be found in Appendix 9.A.2.

9.4 Numerical Results

In this section, we illustrate the performance of SOTs under different scenarios with respect to state-of-the-art methods. The proposed method has a wide variety of application areas, such as channel equalization [26], underwater communications [27], nonlinear modeling in big data [28], speech and texture analysis [29, Chapter 7] and health monitoring [30]. Yet, in this section, we consider nonlinear modeling for fundamental regression and classification problems.

9.4.1 Numerical Results for Regression Problems

Throughout this section, “SOTR” represents the self-organizing tree regressor defined in Algorithm 1, “CTW” represents the context tree weighting algorithm of [16], “OBR” represents the optimal batch regressor, “VF” represents the truncated Volterra filter [1], “LF” represents the simple linear filter, “B-SAF” and “CR-SAF” represent the Bézier and the Catmull–Rom spline adaptive filters of [2], respectively, and “FNF” and “EMFNF” represent the Fourier and even mirror Fourier nonlinear filters of [3], respectively. Finally, “GKR” represents the Gaussian-kernel regressor; it is constructed using $n$ node regressors, say $\hat{d}_{t,1},\ldots,\hat{d}_{t,n}$, and a fixed Gaussian mixture weighting (selected according to the underlying sequence in hindsight), giving

$$\hat{d}_t = \sum_{i=1}^{n} f(\mathbf{x}_t; \boldsymbol{\mu}_i, \Sigma_i) \, \hat{d}_{t,i},$$

where $\hat{d}_{t,i} = \mathbf{v}_{t,i}^T \mathbf{x}_t$ and

$$f(\mathbf{x}_t; \boldsymbol{\mu}_i, \Sigma_i) \triangleq \frac{1}{2\pi \sqrt{|\Sigma_i|}} e^{-\frac{1}{2}(\mathbf{x}_t - \boldsymbol{\mu}_i)^T \Sigma_i^{-1} (\mathbf{x}_t - \boldsymbol{\mu}_i)},$$

for all $i = 1,\ldots,n$.

For a fair performance comparison, in the corresponding experiments in Subsection 9.4.1.2, the desired data and the regressor vectors are normalized to $[-1,1]$, since several algorithms require knowledge of such upper bounds for satisfactory performance (e.g., the B-SAF and the CR-SAF) and some require the data to lie in $[-1,1]$ (e.g., the FNF and the EMFNF). Moreover, in the corresponding experiments in Subsection 9.4.1.1, the desired data and the regressor vectors are normalized to $[-1,1]$ for the VF, the FNF and the EMFNF algorithms for the same reason. The regression errors of these algorithms are then scaled back to their original values for a fair comparison.

Considering the examples illustrated in the respective papers [2,3,16], the orders of the FNF and the EMFNF are set to 3 for the experiments in Subsection 9.4.1.1 and to 2 for the experiments in Subsection 9.4.1.2. The order of the VF is set to 2 for all experiments. Similarly, the depth of the trees for the SOTR and CTW algorithms is set to 2 for all experiments. For these tree-based algorithms, the feature space is initially partitioned by the direction vectors $\boldsymbol{\phi}_{t,n} = [\phi_{t,n}^{(1)},\ldots,\phi_{t,n}^{(p)}]^T$ for all nodes $n \in \mathcal{N}_D \setminus \mathcal{L}_D$, where $\phi_{t,n}^{(i)} = 1$ if $i \equiv l(n) \pmod{D}$ (and $\phi_{t,n}^{(i)} = 0$ otherwise); e.g., when $D = p = 2$, the four quadrants form the four leaf nodes of the tree. Finally, we use cubic B-SAF and CR-SAF algorithms, whose number of knots is set to 21 for all experiments. We emphasize that both these parameters and the learning rates of these algorithms are selected to provide comparable performance and convergence rates.
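The initial partitioning used for the tree-based algorithms can be written down explicitly; the sketch below (our illustration, assuming the rule stated above with zero entries elsewhere) builds the initial direction vectors for a depth-2 tree in a two-dimensional feature space, which yields the four quadrants.

```python
import numpy as np

D, p = 2, 2
internal = ["", "0", "1"]                        # nodes that carry a separator

def initial_direction(n):
    """phi^{(i)}_n = 1 if i = l(n) (mod D), 0 otherwise (1-indexed i)."""
    phi = np.zeros(p)
    for i in range(1, p + 1):
        if i % D == len(n) % D:
            phi[i - 1] = 1.0
    return phi

phi0 = {n: initial_direction(n) for n in internal}
# the root separates on x_2, the depth-1 nodes separate on x_1: the four quadrants
print(phi0)
```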

9.4.1.1 Mismatched Partitions

In this subsection, we consider the case where the desired data is generated by a piece-wise linear model that mismatches with the initial partitioning of the tree-based algorithms. Specifically, the desired signal is generated by the following piece-wise linear model:

$$d_t = \begin{cases} \mathbf{w}^T \mathbf{x}_t + \pi_t, & \text{if } \boldsymbol{\phi}_0^T \mathbf{x}_t \geq 0.5 \text{ and } \boldsymbol{\phi}_1^T \mathbf{x}_t \geq 1, \\ -\mathbf{w}^T \mathbf{x}_t + \pi_t, & \text{if } \boldsymbol{\phi}_0^T \mathbf{x}_t \geq 0.5 \text{ and } \boldsymbol{\phi}_1^T \mathbf{x}_t < 1, \\ \mathbf{w}^T \mathbf{x}_t + \pi_t, & \text{if } \boldsymbol{\phi}_0^T \mathbf{x}_t < 0.5 \text{ and } \boldsymbol{\phi}_2^T \mathbf{x}_t \geq 1, \\ -\mathbf{w}^T \mathbf{x}_t + \pi_t, & \text{if } \boldsymbol{\phi}_0^T \mathbf{x}_t < 0.5 \text{ and } \boldsymbol{\phi}_2^T \mathbf{x}_t < 1, \end{cases} \tag{9.29}$$

where $\mathbf{w} = [1,1]^T$, $\boldsymbol{\phi}_0 = [4,1]^T$, $\boldsymbol{\phi}_1 = [1,1]^T$, $\boldsymbol{\phi}_2 = [1,2]^T$, $\mathbf{x}_t = [x_{1,t}, x_{2,t}]^T$, $\pi_t$ is a sample function from a zero mean white Gaussian process with variance 0.1, and $x_{1,t}$ and $x_{2,t}$ are sample functions of a jointly Gaussian process with mean $[0,0]^T$ and covariance $\mathbf{I}_2$. The learning rates are set to 0.005 for the SOTR and CTW algorithms, 0.1 for the FNF, 0.025 for the B-SAF and CR-SAF, and 0.05 for the EMFNF and VF. Moreover, in order to match the underlying partition, the mass points of the GKR are set to $\boldsymbol{\mu}_1 = [1.4565, 1.0203]^T$, $\boldsymbol{\mu}_2 = [0.6203, 0.4565]^T$, $\boldsymbol{\mu}_3 = [0.5013, 0.5903]^T$ and $\boldsymbol{\mu}_4 = [1.0903, 1.0013]^T$, with a common covariance matrix.

Fig. 9.3 shows the normalized time accumulated regression error of the proposed algorithms. We emphasize that the SOTR algorithm achieves a better error performance compared to its competitors. Comparing the performances of the SOTR and CTW algorithms, we observe that the CTW algorithm fails to accurately predict the desired data, whereas the SOTR algorithm learns the underlying partitioning of the data, which significantly improves the performance of the SOTR. This illustrates the importance of the initial partitioning of the regressor space for tree-based algorithms to yield a satisfactory performance.

Figure 9.3 Regression error performances for the second-order piece-wise linear model in (9.29).

In particular, the CTW algorithm converges to the best batch regressor having the predetermined leaf nodes (i.e., the best regressor having the four quadrants of the two-dimensional space as its leaf nodes). However, that regressor is suboptimal since the underlying data is generated using a different constellation; hence its normalized time accumulated regression error remains bounded away from that of the globally optimal regressor by a constant. The SOTR algorithm, on the other hand, adapts its region boundaries and captures the underlying unevenly rotated and shifted partitioning of the regressor space perfectly. Fig. 9.4 shows how our algorithm updates its separator functions and illustrates the nonlinear modeling power of SOTs.

Figure 9.4 Changes in the boundaries of the leaf nodes of the SOT of depth 2 generated by the SOTR algorithm at time instances t = 0,1000,2000,5000,20000,50000. The separator functions adaptively learn the boundaries of the piece-wise linear model in (9.29).

9.4.1.2 Chaotic Signals

In this subsection, we illustrate the performance of the SOTR algorithm when estimating chaotic data generated by the Henon map and the Lorenz attractor [31].

First, we consider a zero mean sequence generated by the Henon map, a chaotic process given by

$$d_t = 1 - \zeta d_{t-1}^2 + \eta d_{t-2}, \tag{9.30}$$

known to exhibit chaotic behavior for the values $\zeta = 1.4$ and $\eta = 0.3$. The desired data at time $t$ is denoted by $d_t$, whereas the extended regressor vector is $\mathbf{x}_t = [d_{t-1}, d_{t-2}, 1]^T$, i.e., we consider a prediction framework. The learning rate is set to 0.025 for the B-SAF and CR-SAF, whereas it is set to 0.05 for the rest of the algorithms.
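The Henon series of (9.30) and the corresponding extended regressor vectors can be generated with a few lines; this is a plain data-generation sketch with arbitrary initial conditions, not tied to any particular implementation of the compared algorithms.

```python
import numpy as np

def henon_series(T, zeta=1.4, eta=0.3, d0=0.0, d1=0.0):
    """Generate d_t = 1 - zeta*d_{t-1}^2 + eta*d_{t-2} as in (9.30)."""
    d = np.empty(T)
    d[0], d[1] = d0, d1
    for t in range(2, T):
        d[t] = 1.0 - zeta * d[t - 1] ** 2 + eta * d[t - 2]
    return d

d = henon_series(10000)
# extended regressor x_t = [d_{t-1}, d_{t-2}, 1]^T and target d_t, for t >= 2
X = np.column_stack([d[1:-1], d[:-2], np.ones(len(d) - 2)])
y = d[2:]
```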

Fig. 9.5 (left plot) shows the normalized regression error performance of the proposed algorithms. One can observe that the algorithms whose basis functions do not include the necessary quadratic terms and the algorithms that rely on a fixed regressor space partitioning yield unsatisfactory performance. On the other hand, VF can capture the salient characteristics of this chaotic process since its order is set to 2. Similarly, FNF can also learn the desired data since its basis functions can well approximate the chaotic process. The SOTR algorithm, however, uses a piece-wise linear modeling and still achieves asymptotically the same performance as the VF algorithm, while outperforming the FNF algorithm.

Figure 9.5 Regression error performances of the proposed algorithms for the signal generated by the Henon map in (9.30) (left figure) and for the Lorenz attractor in (9.31) with parameters dt = 0.01, ρ = 28, σ = 10 and β = 8/3 (right figure).

Second, we consider the chaotic signal set generated using the Lorenz attractor [31] that is defined by the following three discrete-time equations:

$$\begin{aligned} x_t &= x_{t-1} + \sigma (y_{t-1} - x_{t-1})\, dt, \\ y_t &= y_{t-1} + \big(x_{t-1}(\rho - z_{t-1}) - y_{t-1}\big)\, dt, \\ z_t &= z_{t-1} + (x_{t-1} y_{t-1} - \beta z_{t-1})\, dt, \end{aligned} \tag{9.31}$$

where we set $dt = 0.01$, $\rho = 28$, $\sigma = 10$ and $\beta = 8/3$ to generate the well-known chaotic solution of the Lorenz attractor. In this experiment, $x_t$ is selected as the desired data and the two-dimensional vector $[y_t, z_t]^T$ is used as the regressor, that is, we try to estimate $x_t$ from $y_t$ and $z_t$. The learning rates are set to 0.01 for all algorithms.
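Similarly, the Lorenz data of (9.31) can be generated by the Euler recursion below (a sketch; the initial state is an arbitrary placeholder).

```python
import numpy as np

def lorenz_series(T, dt=0.01, rho=28.0, sigma=10.0, beta=8.0 / 3.0,
                  x0=1.0, y0=1.0, z0=1.0):
    """Euler discretization of the Lorenz attractor as in (9.31)."""
    x = np.empty(T); y = np.empty(T); z = np.empty(T)
    x[0], y[0], z[0] = x0, y0, z0
    for t in range(1, T):
        x[t] = x[t - 1] + sigma * (y[t - 1] - x[t - 1]) * dt
        y[t] = y[t - 1] + (x[t - 1] * (rho - z[t - 1]) - y[t - 1]) * dt
        z[t] = z[t - 1] + (x[t - 1] * y[t - 1] - beta * z[t - 1]) * dt
    return x, y, z

x, y, z = lorenz_series(50000)
# desired signal x_t, regressor vector [y_t, z_t]^T
targets = x
features = np.column_stack([y, z])
```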

Fig. 9.5 (right plot) illustrates the nonlinear modeling power of the SOTR algorithm even when estimating a highly nonlinear chaotic signal set. As can be observed from Fig. 9.5, the SOTR algorithm significantly outperforms its competitors and achieves a superior error performance since it tunes its region boundaries to the optimal partitioning of the regressor space, whereas the performances of the other algorithms directly rely on the initial selection of the basis functions and/or tree structures and partitioning.

9.4.2 Numerical Results for Classification Problems

9.4.2.1 Stationary Data

In this section, we consider stationary classification problems and compare the SOTC algorithm with the following methods: Perceptron, “PER” [25]; Online AdaBoost, “OZAB” [32]; Online GradientBoost, “OGB” [33]; Online SmoothBoost, “OSB” [34]; and Online Tree-Based Nonadaptive Competitive Classification, “TNC” [22]. The parameters of all of these compared methods are set as in their original proposals. For “OGB” [33], which uses K weak learners per each of M selectors, essentially resulting in MK weak learners in total, we use K = 1, as in [34], for a fair comparison, along with the logit loss, which has been shown to consistently outperform the other choices in [33]. The TNC algorithm is nonadaptive, i.e., not self-organizing, in terms of the space partitioning; we use it in our comparisons to illustrate the gain due to the proposed self-organizing structure. We use the Perceptron algorithm as the weak learner and node classifier in all algorithms. We set the learning rate of the SOTC algorithm to η = 0.05 in all of our stationary as well as nonstationary data experiments. We use N = 100 weak learners for the boosting methods, whereas we use a depth-4 tree in the SOTC and TNC algorithms, which corresponds to $2^5 - 1 = 31$ local node classifiers. The SOTC algorithm has linear complexity in the depth of the tree, whereas the compared methods have linear complexity in the number of weak learners.

As can be observed in Table 9.1, the SOTC algorithm consistently outperforms the compared methods. In particular, the compared methods essentially fail to classify the Banana and BMC datasets, which indicates that they cannot handle complex nonlinear classification problems. In contrast, the SOTC algorithm successfully models these complex nonlinear relations with piecewise linear curves and provides superior performance. In general, the SOTC algorithm has significantly better transient characteristics, and the TNC algorithm occasionally performs poorly (such as on the BMC and Banana datasets) depending on the mismatch between the initial partitions defined on the tree and the underlying optimal separation of the data. This illustrates the importance of learning the region boundaries in piecewise linear models.

Table 9.1

Average classification errors (in percentage) of algorithms on benchmark datasets

Data Set PER OZAB OGB OSB TNC SOTC
Heart 24.66 23.96 23.28 23.63 21.75 20.09
Breast cancer 5.77 5.44 5.71 5.23 4.84 4.65
Australian 20.82 20.26 19.70 20.01 15.92 14.86
Diabetes 32.25 32.43 33.49 31.33 26.89 25.75
German 32.45 31.86 32.72 31.86 28.13 26.74
BMC 47.09 45.72 46.92 46.37 25.37 17.03
Splice 33.42 32.59 32.79 32.81 18.88 18.56
Banana 48.91 47.96 48.00 48.84 27.98 17.60


9.4.2.2 Nonstationary Data: Concept Change/Drift

In this section, we apply the SOTC algorithm to nonstationary data, where there might be continuous or sudden/abrupt changes in the source statistics, i.e., concept change. Since the SOTC algorithm processes data in a sequential manner, we choose the Dynamic Weighted Majority (DWM) algorithm [35] with Perceptron (DWM-P) or naive Bayes (DWM-N) experts for the comparison, since the DWM algorithm is also an online algorithm. Although batch algorithms do not truly fit into our framework, we still devise an online version of the tree-based local space partitioning algorithm [24] (which also learns the space partitioning and the classifier using a coordinate ascent approach) using a sliding-window approach and abbreviate it as the WLSP algorithm. For the DWM method, which allows the addition and removal of experts during the stream, we set the initial number of experts to 1, where the maximum number of experts is bounded by 100. For the WLSP method, we use a window size of 100. The parameters of these compared methods are set as in their original proposals.

We run these methods on the BMC dataset (1200 instances, Fig. 9.6), where a sudden/abrupt concept change is obtained by rotating the feature vectors (clockwise around the origin) by 180° after the 600th instance. This is effectively equivalent to flipping the labels of the feature vectors; hence the resulting dataset is denoted as BMC-F. For a continuous concept drift, we rotate each feature vector by 180°/1200 = 0.15° starting from the beginning; the resulting dataset is denoted as BMC-C. In Fig. 9.6, we present the classification errors of the compared methods averaged over 1000 trials. At every 10th instance, we test the algorithms with 1200 instances drawn from the active set of statistics (active concept).
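The two nonstationary variants of the BMC data can be obtained by rotating the two-dimensional feature vectors; the sketch below illustrates one possible reading of the rotations described above, assuming X is an array of the 1200 two-dimensional features in stream order.

```python
import numpy as np

def rotate(X, degrees):
    """Rotate two-dimensional feature vectors clockwise around the origin."""
    theta = -np.deg2rad(degrees)                 # clockwise rotation
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return X @ R.T

def make_bmc_f(X):
    """Abrupt concept change: rotate instances 600 onward by 180 degrees (label flip)."""
    out = X.copy()
    out[600:] = rotate(out[600:], 180.0)
    return out

def make_bmc_c(X):
    """Continuous drift: the t-th instance is rotated by t * 180/1200 degrees."""
    angles = np.arange(len(X)) * 180.0 / 1200.0
    return np.array([rotate(x[None, :], a)[0] for x, a in zip(X, angles)])
```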

Figure 9.6 Performances of the algorithms in case of abrupt and continuous concept changes in the BMC dataset. In the left figure, at the 600th instance, there is a 180° clockwise rotation around the origin, which is effectively a label flip. In the right figure, at each instance, there is a 180°/1200 clockwise rotation around the origin.

Since the BMC dataset is non-Gaussian with strongly nonlinear class separations, the DWM method does not perform well on the BMC-F data. For instance, DWM-P operates with an error rate fluctuating around 0.48–0.49 (random guessing). This is because the performance of the DWM method directly depends on the success of its experts, and we observe that both base learners (Perceptron and naive Bayes) fail due to the high separation complexity in the BMC-F data. On the other hand, the WLSP method quickly converges to its steady state; however, it is also asymptotically outperformed by the SOTC algorithm in both experiments. Increasing the window size is clearly expected to boost the performance of WLSP, albeit at the expense of an increased computational complexity; it is already significantly slower than the SOTC method even when the window size is 100 (for a more detailed comparison, see Table 9.2). The performance of the WLSP method is significantly worse on the BMC-C data set than on the BMC-F data set, since in the former scenario WLSP is trained with batch data of a continuous mixture of concepts in the sliding windows. Under this continuous concept drift, the SOTC method always (not only asymptotically, as in the case of the BMC-F data set) performs better than the WLSP method. Hence, the sliding-window approach is sensitive to the continuous drift. Our discussion about the DWM method on the concept change data (BMC-F) remains valid in the case of the concept drift (BMC-C) as well. In these experiments, the power of self-organizing trees is evident, as the SOTC algorithm almost always outperforms the TNC algorithm. We also observe from Table 9.2 that the SOTC algorithm is computationally very efficient and that the cost of the region updates (compared to the TNC algorithm) does not significantly increase the computational complexity of the algorithm.

Table 9.2

Running times (in seconds) of the compared methods when processing the BMC data set on a daily-use machine (Intel(R) Core(TM) i5-3317U CPU @ 1.70 GHz with 4 GB memory)

PER OZAB OGB OSB TNC DWM-P DWM-N WLSP SOTC
0.06 12.90 3.57 3.91 0.43 2.06 6.91 68.40 0.62


Appendix 9.A

9.A.1 Proof of Theorem 1

For the SOT of depth $D$, suppose that the partition estimates $\hat{d}_t^{(k)}$, $k = 1,\ldots,\beta_D$, are obtained as described in Section 9.2.2. To achieve the upper bound in (9.15), we use the online gradient descent method and update the combination weights as

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \frac{1}{2} \eta_t \nabla e_t^2(\mathbf{w}_t) = \mathbf{w}_t + \eta_t e_t \hat{\mathbf{d}}_t, \tag{9.32}$$

where $\eta_t$ is the learning rate of the online gradient descent algorithm. We derive an upper bound on the sequential learning regret $R_T$, which is defined as

$$R_T \triangleq \sum_{t=1}^{T} e_t^2(\mathbf{w}_t) - \sum_{t=1}^{T} e_t^2(\mathbf{w}_T^*),$$

where $\mathbf{w}_T^*$ is the optimal weight vector over $T$, i.e.,

$$\mathbf{w}_T^* \triangleq \arg\min_{\mathbf{w} \in \mathbb{R}^{\beta_D}} \sum_{t=1}^{T} e_t^2(\mathbf{w}).$$

Following [36], using a Taylor series expansion, for some point $\mathbf{z}_t$ on the line segment connecting $\mathbf{w}_t$ to $\mathbf{w}_T^*$, we have

$$e_t^2(\mathbf{w}_T^*) = e_t^2(\mathbf{w}_t) + \left(\nabla e_t^2(\mathbf{w}_t)\right)^T (\mathbf{w}_T^* - \mathbf{w}_t) + \frac{1}{2} (\mathbf{w}_T^* - \mathbf{w}_t)^T \nabla^2 e_t^2(\mathbf{z}_t) (\mathbf{w}_T^* - \mathbf{w}_t). \tag{9.33}$$

According to the update rule in (9.32), at each iteration the weights are updated as $\mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\eta_t}{2} \nabla e_t^2(\mathbf{w}_t)$. Hence, we have

$$\|\mathbf{w}_{t+1} - \mathbf{w}_T^*\|^2 = \left\|\mathbf{w}_t - \frac{\eta_t}{2} \nabla e_t^2(\mathbf{w}_t) - \mathbf{w}_T^*\right\|^2 = \|\mathbf{w}_t - \mathbf{w}_T^*\|^2 - \eta_t \left(\nabla e_t^2(\mathbf{w}_t)\right)^T (\mathbf{w}_t - \mathbf{w}_T^*) + \frac{\eta_t^2}{4} \left\|\nabla e_t^2(\mathbf{w}_t)\right\|^2.$$

This yields

$$\left(\nabla e_t^2(\mathbf{w}_t)\right)^T (\mathbf{w}_t - \mathbf{w}_T^*) = \frac{\|\mathbf{w}_t - \mathbf{w}_T^*\|^2 - \|\mathbf{w}_{t+1} - \mathbf{w}_T^*\|^2}{\eta_t} + \frac{\eta_t \left\|\nabla e_t^2(\mathbf{w}_t)\right\|^2}{4}.$$

Under the mild assumptions that $\left\|\nabla e_t^2(\mathbf{w}_t)\right\|^2 \leq A^2$ for some $A > 0$ and that $e_t^2(\cdot)$ is $\lambda$-strongly convex for some $\lambda > 0$ [36], we obtain the following upper bound:

$$e_t^2(\mathbf{w}_t) - e_t^2(\mathbf{w}_T^*) \leq \frac{\|\mathbf{w}_t - \mathbf{w}_T^*\|^2 - \|\mathbf{w}_{t+1} - \mathbf{w}_T^*\|^2}{\eta_t} - \frac{\lambda}{2} \|\mathbf{w}_t - \mathbf{w}_T^*\|^2 + \frac{\eta_t A^2}{4}. \tag{9.34}$$

Selecting $\eta_t = \frac{2}{\lambda t}$ and summing the terms in (9.34) over $t$, we get

$$R_T = \sum_{t=1}^{T} \left\{ e_t^2(\mathbf{w}_t) - e_t^2(\mathbf{w}_T^*) \right\} \leq \sum_{t=1}^{T} \|\mathbf{w}_t - \mathbf{w}_T^*\|^2 \left( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} - \frac{\lambda}{2} \right) + \frac{A^2}{4} \sum_{t=1}^{T} \eta_t = \frac{A^2}{4} \sum_{t=1}^{T} \frac{2}{\lambda t} \leq \frac{A^2}{2\lambda} \left(1 + \log(T)\right).$$

9.A.2 Proof of Theorem 2

Since $Z_t$ is a summation of positive terms, we have $Z_T \geq 2^{-J(k)} \exp\left(-b L_T^{(k)}\right)$; taking the logarithm of both sides and rearranging the terms, we get

$$-\frac{1}{b} \log Z_T \leq L_T^{(k)} + \frac{J(k) \log 2}{b} \tag{9.35}$$

for all $k \in \{1,\ldots,\beta_D\}$ at the (last) iteration at time $T$. We then make the following observation:

$$Z_T = \prod_{t=1}^{T} \frac{Z_t}{Z_{t-1}} = \prod_{t=1}^{T} \sum_{k=1}^{\beta_D} \frac{2^{-J(k)} \exp\left(-b L_{t-1}^{(k)}\right)}{Z_{t-1}} \exp\left(-b \,\ell_t\big(f_t^{(k)}\big)\right) \leq \exp\left(-b L_T(f_t) + \frac{T b^2}{8}\right), \tag{9.36}$$

where the second equality follows from the definition of $Z_t$ and the final inequality follows from Hoeffding's inequality by treating the $w_t^{(k)} = 2^{-J(k)} \exp\left(-b L_{t-1}^{(k)}\right) / Z_{t-1}$ terms as the randomization probabilities. Note that $L_T(f_t)$ represents the expected loss of the final algorithm; cf. (9.19). Combining (9.35) and (9.36), we obtain

$$\frac{L_T(f_t)}{T} \leq \frac{L_T^{(k)}}{T} + \frac{J(k) \log 2}{T b} + \frac{b}{8},$$

and choosing $b = \sqrt{2^D / T}$, we obtain the desired upper bound in (9.28), since $J(k) \leq 2^{D+1} - 1$ for all $k \in \{1,\ldots,\beta_D\}$.

References

[1] M. Schetzen, The Volterra and Wiener Theories of Nonlinear Systems. NJ: John Wiley & Sons; 1980.

[2] M. Scarpiniti, D. Comminiello, R. Parisi, A. Uncini, Nonlinear spline adaptive filtering, Signal Processing 2013;93(4):772–783.

[3] A. Carini, G.L. Sicuranza, Fourier nonlinear filters, Signal Processing 2014;94(0):183–194.

[4] N.D. Vanli, M.O. Sayin, I. Delibalta, S.S. Kozat, Sequential nonlinear learning for distributed multiagent systems via extreme learning machines, IEEE Transactions on Neural Networks and Learning Systems March 2017;28(3):546–558.

[5] D.P. Helmbold, R.E. Schapire, Predicting nearly as well as the best pruning of a decision tree, Machine Learning 1997;27(1):51–68.

[6] E. Takimoto, A. Maruoka, V. Vovk, Predicting nearly as well as the best pruning of a decision tree through dynamic programming scheme, Theoretical Computer Science 2001;261:179–209.

[7] E. Takimoto, M.K. Warmuth, Predicting nearly as well as the best pruning of a planar decision graph, Theoretical Computer Science 2002;288:217–235.

[8] D.P. Bertsekas, Nonlinear Programming. Athena Scientific; 1999.

[9] A.H. Sayed, Fundamentals of Adaptive Filtering. NJ: John Wiley & Sons; 2003.

[10] S. Dasgupta, Y. Freund, Random projection trees for vector quantization, IEEE Transactions on Information Theory 2009;55(7):3229–3242.

[11] W.-Y. Loh, Classification and regression trees, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2011;1(1):14–23.

[12] L. Breiman, J. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees. Chapman & Hall; 1984.

[13] J. Gama, Functional trees, Machine Learning 2004;55(3):219–250.

[14] O.J.J. Michel, A.O. Hero, A.-E. Badel, Tree-structured nonlinear signal modeling and prediction, IEEE Transactions on Signal Processing Nov 1999;47(11):3027–3041.

[15] H. Ozkan, N.D. Vanli, S.S. Kozat, Online classification via self-organizing space partitioning, IEEE Transactions on Signal Processing Aug 2016;64(15):3895–3908.

[16] S.S. Kozat, A.C. Singer, G.C. Zeitler, Universal piecewise linear prediction via context trees, IEEE Transactions on Signal Processing 2007;55(7):3730–3745.

[17] N.D. Vanli, M.O. Sayin, S.S. Kozat, Predicting nearly as well as the optimal twice differentiable regressor, CoRR arXiv:1401.6413; 2014.

[18] A.V. Aho, N.J.A. Sloane, Some doubly exponential sequences, Fibonacci Quarterly 1973;11(4):429–437.

[19] N.D. Vanli, K. Gokcesu, M.O. Sayin, H. Yildiz, S.S. Kozat, Sequential prediction over hierarchical structures, IEEE Transactions on Signal Processing Dec 2016;64(23):6284–6298.

[20] C. Scott, R.D. Nowak, Minimax-optimal classification with dyadic decision trees, IEEE Transactions on Information Theory 2006;52(4):1335–1353.

[21] C. Strobl, A.-L. Boulesteix, T. Augustin, Unbiased split selection for classification trees based on the Gini index, Computational Statistics & Data Analysis 2007;52(1):483–501.

[22] H. Ozkan, M.A. Donmez, O.S. Pelvan, A. Akman, S.S. Kozat, Competitive and online piecewise linear classification, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). May 2013:3452–3456.

[23] F.M.J. Willems, Y.M. Shtarkov, T.J. Tjalkens, The context-tree weighting method: basic properties, IEEE Transactions on Information Theory 1995;41(3):653–664.

[24] J. Wang, V. Saligrama, Local supervised learning through space partitioning, Advances in Neural Information Processing Systems (NIPS). 2012:91–99.

[25] Y. Freund, R.E. Schapire, Large margin classification using the perceptron algorithm, Machine Learning 1999;37(3):277–296.

[26] K. Kim, N. Kalantarova, S.S. Kozat, A.C. Singer, Linear MMSE-optimal turbo equalization using context trees, IEEE Transactions on Signal Processing 2013;61(12):3041–3055.

[27] D. Kari, N.D. Vanli, S.S. Kozat, Adaptive and efficient nonlinear channel equalization for underwater acoustic communication, Physical Communication 2017;24:83–93.

[28] F. Khan, D. Kari, I.A. Karatepe, S.S. Kozat, Universal nonlinear regression on high dimensional data using adaptive hierarchical trees, IEEE Transactions on Big Data 2016;2(2):175–188.

[29] T. Kohonen, Self-Organizing Maps. 3rd edition New York: Springer-Verlag, Inc.; 2001.

[30] H.G. Basara, M. Yuan, Community health assessment using self-organizing maps and geographic information systems, International Journal of Health Geographics Dec 2008;7(1).

[31] E.N. Lorenz, Deterministic nonperiodic flow, Journal of the Atmospheric Sciences 1963;20(2):130–141.

[32] N.C. Oza, S. Russell, Online bagging and boosting, Artificial Intelligence and Statistics. 2001:105–112.

[33] C. Leistner, A. Saffari, P.M. Roth, H. Bischof, On robustness of on-line boosting-a competitive study, IEEE 12th International Conference on Computer Vision Workshops. 2009:1362–1369.

[34] S.-T. Chen, H.-T. Lin, C.-J. Lu, An online boosting algorithm with theoretical justifications, International Conference on Machine Learning. 2012.

[35] J.Z. Kolter, M.A. Maloof, Dynamic weighted majority: an ensemble method for drifting concepts, Journal of Machine Learning Research 2007;8:2755–2790.

[36] E. Hazan, A. Agarwal, S. Kale, Logarithmic regret algorithms for online convex optimization, Machine Learning 2007;69(2–3):169–192.
