Chapter 13

Time Series Data Classification

Dimitrios Kotsakos

Dept. of Informatics and Telecommunications
University of Athens
Athens, Greece
[email protected]

Dimitrios Gunopulos

Dept. of Informatics and Telecommunications
University of Athens
Athens, Greece
[email protected]

13.1 Introduction

Time-series data is one of the most common forms of data encountered in a wide variety of scenarios such as the stock markets, sensor data, fault monitoring, machine state monitoring, environmental applications, or medical data. The problem of classification finds numerous applications in the time series domain, such as determining which of a set of predefined groups of entities a time series entity of still unknown group is most similar to. Time-series classification has numerous applications in diverse problem domains:

  • Financial Markets: In financial markets, the values of the stocks represent time-series which continually vary with time. The classification of a new time series of unknown group can provide numerous insights into decisions about this specific time series.
  • Medical Data: Different kinds of medical data such as EEG readings are in the form of time series. The classification of such time series, e.g., for a new patient, can provide insights into similar treatment or aid the domain experts in the decisions that have to be made, as similar behavior may indicate similar diseases.
  • Machine State Monitoring: Numerous forms of machines create sensor data, which provides a continuous view of the states of these objects. This data can be used to characterize the underlying behaviors and to determine the group a time series belongs to.
  • Spatio-temporal Data: Trajectory data can be considered a form of multi-variate time series data, in which the X- and Y-coordinates of objects correspond to continuously varying series. The behavior in these series can be used in order to determine the class a trajectory belongs to, and then decide, e.g., if a specific trajectory belongs to a pedestrian or to a vehicle.

Time series data falls in the class of contextual data representations. Many kinds of data such as time series data, discrete sequences, and spatial data fall in this class. Contextual data contain two kinds of attributes:

  • Contextual Attribute: For the case of time-series data, this corresponds to the time dimension. The time dimension provides the reference points at which the behavioral values are measured. The timestamps could correspond to actual time values at which the data points are measured, or they could correspond to indices at which these values are measured.
  • Behavioral Attribute: This could correspond to any kind of behavior that is measured at the reference point. Some examples include stock ticker values, sensor measurements such as temperature, or other medical time series.

Given a time series object of unknown class, the determination of the class it belongs to is extremely challenging because of the difficulty in defining similarity across different time series, which may be scaled and translated differently both on the temporal and behavioral dimension. The concept of similarity is therefore central to time series data classification, and this chapter accordingly devotes a section to the problem of time series similarity measures.

In the classification problem the goal is to separate the classes that characterize the dataset by a function that is induced from the available labeled examples. The ultimate goal is to produce a classifier that classifies unseen examples, or examples of unknown class, with high precision and recall, while being able to scale well. A significant difference between time series data classification and classification of objects in Euclidean space is that the time series to be classified may not be of equal length to the time series already assigned to the pre-defined or pre-computed classes. When all time series are of equal length, standard classification techniques can be applied by representing each time series as a vector and using a traditional Lp-norm distance. With such an approach, only similarity in time can be exploited, while similarity in shape and similarity in change are disregarded. In this study we split the classification process into two basic steps: the first is the choice of the similarity measure to be employed, while the second concerns the classification algorithm to be followed. Xing et al. argue in their sequence classification survey [42] that the high dimensionality of the feature space and the sequential nature of the data make sequence classification a challenging task. These observations apply to the time series classification task as well, as time series are essentially sequences.

The majority of the studies and approaches proposed in the literature for time series classification are application dependent; in most of them the authors try to improve performance for a specific application. From this point of view, the more general time series classification approaches are of particular interest and importance.

13.2 Time Series Representation

In this study we employ the raw data representation, where a time series T of length n is represented as an ordered sequence of values T = [t1, t2, …, tn]. T is called the raw representation of the time series data. Regarding the dimensionality of the data points, each ti ∈ T can itself be d-dimensional, where d ≥ 1. For example, time series that represent trajectories of moving objects in the two-dimensional plane have d = 2, while financial time series that represent stock prices have d = 1.

13.3 Distance Measures

When it comes to time series data mining, the distance or similarity measure that will be used to compare time series objects must be carefully selected and defined, as time series exhibit different characteristics than traditional data types. In this section we briefly describe the most commonly used distance measures that have been used in the literature to facilitate similarity search on time series data. Ding et al. performed an extensive evaluation over a variety of representations and distance/similarity measures for time series, using a nearest neighbor classifier (1NN) on labeled data, where the correct class label of each time series is known a priori [9].

13.3.1 Lp-Norms

The Lp-norm is a distance metric, since it satisfies all of the non-negativity, identity, symmetry, and triangle inequality conditions. An advantage of the Lp-norm is that it can be computed in time linear in the length of the time series under comparison, so its time complexity is O(n), n being the length of the time series. In order to use the Lp-norm, the two time series under comparison must be of the same length.

The Minkowski distance of order p, or the Lp-norm distance, being the generalization of the Euclidean distance, is defined as follows:

$$L_p\text{-norm}(T_1, T_2) = D_M(T_1, T_2) = \sqrt[p]{\sum_{i=1}^{n} \left|T_{1,i} - T_{2,i}\right|^p} \qquad (13.1)$$

The Euclidean Distance between two one-dimensional time series T1 and T2 of length n is a special case of the Lp-norm for p = 2 and is defined as:

$$D_E(T_1, T_2) = L_2\text{-norm}(T_1, T_2) = \sqrt{\sum_{i=1}^{n} (T_{1,i} - T_{2,i})^2} \qquad (13.2)$$

The L1-norm (p = 1) is known as the Manhattan distance or city block distance.
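As a concrete illustration, the following is a minimal Python sketch of Eq. (13.1); the function name and test values are ours:

```python
import numpy as np

def lp_norm(t1, t2, p=2):
    """L_p-norm distance between two equal-length time series (Eq. 13.1)."""
    t1, t2 = np.asarray(t1, dtype=float), np.asarray(t2, dtype=float)
    if t1.shape != t2.shape:
        raise ValueError("Lp-norms require time series of equal length")
    return np.sum(np.abs(t1 - t2) ** p) ** (1.0 / p)

# p = 2 gives the Euclidean distance (Eq. 13.2), p = 1 the Manhattan distance.
print(lp_norm([1, 1, 5], [2, 2, 5], p=2))  # ~1.414
print(lp_norm([1, 1, 5], [2, 2, 5], p=1))  # 2.0
```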

13.3.2 Dynamic Time Warping (DTW)

Dynamic Time Warping (DTW) is a well-known and widely used shape-based distance measure. For two time series of lengths m and n, DTW computes the warping path W = w1, w2, …, wK, with max(m, n) ≤ K ≤ m + n − 1, of minimum distance.

DTW stems from the speech processing community [33] and has been very popular in the literature of time series distance measures. Using dynamic programming, DTW between two one-dimensional time series T1 and T2 of lengths m and n, respectively, can be computed as follows:

  1. $D_{DTW}(T_1, T_2) = 0$, if $m = n = 0$
  2. $D_{DTW}(T_1, T_2) = \infty$, if $m = 0$ or $n = 0$
  3. $D_{DTW}(T_1, T_2) = dist(T_{1,1}, T_{2,1}) + minFactor$, otherwise

where minFactor is computed as:

$$minFactor = \min \begin{cases} D_{DTW}(Rest(T_1), Rest(T_2)) \\ D_{DTW}(Rest(T_1), T_2) \\ D_{DTW}(T_1, Rest(T_2)) \end{cases}$$

where $dist(T_{1,1}, T_{2,1})$ is typically the L2-norm. The Euclidean distance, as well as all Lp-norms described in Section 13.3.1, performs a one-to-one mapping between the data points of the time series under comparison. Thus, it can be seen as a special case of the DTW distance, which performs a one-to-many mapping. DTW is a more robust distance measure than the Lp-norm because it allows time shifting and thus matches similar shapes even if they have a time phase difference. An important point is that DTW does not satisfy the triangle inequality, which can be a problem when indexing time series. However, several lower bounds in the literature serve as solutions for indexing DTW, offering faster performance [36]. Another advantage of DTW over the Lp-norm is that DTW can handle different sampling intervals in the time series. This is a very important feature, especially for long time series that span many years, as the data sampling strategy may change over such a long time.
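The recursion above translates directly into an O(mn) dynamic program. Below is a minimal sketch (variable names are ours), using the absolute difference as the local distance:

```python
import numpy as np

def dtw(t1, t2):
    """Dynamic Time Warping distance via an O(m*n) dynamic program."""
    m, n = len(t1), len(t2)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(t1[i - 1] - t2[j - 1])   # local distance dist(., .)
            # extend the cheapest of the three allowed warping steps
            D[i, j] = cost + min(D[i - 1, j - 1],  # match both elements
                                 D[i - 1, j],      # stretch t2
                                 D[i, j - 1])      # stretch t1
    return D[m, n]

print(dtw([1, 1, 5], [2, 2, 5]))           # 2.0
print(dtw([2, 2, 5], [2, 2, 2, 2, 2, 5]))  # 0.0
```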

A variation of DTW is the Fréchet distance. Intuitively, the Fréchet distance is defined as the minimum length of a leash a dog-owner needs in order to stay connected to his dog at all times, for a given trajectory of the owner and the dog, over all possible monotone movements of the owner and the dog along their trajectories.

When comparing two trajectories, or curves, using the Fréchet distance, they are treated as the trajectories of the dog and the owner. The Fréchet distance is used to calculate distances between curves and takes into account the location and ordering of the points along the curves [16], [2]. The Fréchet distance is essentially the L∞ equivalent of the Dynamic Time Warping distance. Eiter and Mannila proposed a discrete variation and a dynamic programming polynomial-time algorithm for the computation of the Fréchet distance [12]. Driemel et al. give almost linear time approximation algorithms for the Fréchet distance [11].

13.3.3 Edit Distance

EDIT distance (ED) comes from the field of string comparison and measures the minimum number of insert, delete, and replace operations needed to make two strings of possibly different lengths identical to each other. More specifically, the EDIT distance between two strings S1 and S2 of lengths m and n, respectively, is computed as follows:

  1. $D_{ED}(S_1, S_2) = m$, if $n = 0$
  2. $D_{ED}(S_1, S_2) = n$, if $m = 0$
  3. $D_{ED}(S_1, S_2) = D_{ED}(Rest(S_1), Rest(S_2))$, if $S_{1,1} = S_{2,1}$
  4. otherwise:

$$D_{ED}(S_1, S_2) = \min \begin{cases} D_{ED}(Rest(S_1), Rest(S_2)) + 1 \\ D_{ED}(Rest(S_1), S_2) + 1 \\ D_{ED}(S_1, Rest(S_2)) + 1 \end{cases}$$
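The recursion can be memoized or, as in the minimal bottom-up sketch below, unrolled into a table (names are ours):

```python
def edit_distance(s1, s2):
    """Classic edit distance between two strings, bottom-up dynamic program."""
    m, n = len(s1), len(s2)
    # D[i][j] = edit distance between prefixes s1[:i] and s2[:j]
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                              # delete all of s1[:i]
    for j in range(n + 1):
        D[0][j] = j                              # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                D[i][j] = D[i - 1][j - 1]        # characters match, no cost
            else:
                D[i][j] = 1 + min(D[i - 1][j - 1],  # replace
                                  D[i - 1][j],      # delete
                                  D[i][j - 1])      # insert
    return D[m][n]

print(edit_distance("kitten", "sitting"))  # 3
```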

Although ED for strings is proven to be a metric distance, the two ED-related time series distance measures, DTW and the Longest Common Subsequence (LCSS) distance described in the next subsection, are proven not to satisfy the triangle inequality. Lei Chen [7] proposed two extensions to EDIT distance, namely Edit distance with Real Penalty (ERP) to support local time shifting, and Edit Distance on Real sequence (EDR) to handle both local time shifting and noise in time series and trajectories. Both extensions have a high computational cost, so Chen proposes various lower bounds, indexing, and pruning techniques to retrieve similar time series more efficiently. Both ERP and DTW can handle local time shifting and measure the distance between two out-of-phase time series effectively. An advantage that ERP has over DTW is that the former is a metric distance function, whereas the latter is not: DTW does not obey the triangle inequality, and therefore traditional index methods cannot be used to improve efficiency in DTW-based applications. ERP, on the other hand, is proven to be a metric distance function [7], and therefore traditional access methods can be used. EDR, as a distance function, proves to be more robust than the Euclidean distance, DTW, and ERP, and more accurate than LCSS. EDR is not a metric, so the author proposes three pruning techniques (mean value Q-grams, near triangle inequality, and histograms) to improve retrieval efficiency.

13.3.4 Longest Common Subsequence (LCSS)

The Longest Common Subsequence (LCSS) distance is a variation of the EDIT distance described in Section 13.3.3 [37]. LCSS allows time series to stretch in the time axis and does not require all elements to be matched, thus being less sensitive to outliers than Lp-norms and DTW. Specifically, the LCSS distance between two real-valued sequences S1 and S2 of lengths m and n, respectively, is computed as follows:

(a) $LCSS_{\delta,\varepsilon}(S_1, S_2) = 0$, if $n = 0$ or $m = 0$

(b) $LCSS_{\delta,\varepsilon}(S_1, S_2) = 1 + LCSS_{\delta,\varepsilon}(HEAD(S_1), HEAD(S_2))$,

if $|S_{1,m} - S_{2,n}| < \varepsilon$ and $|m - n| \leq \delta$

(c) $LCSS_{\delta,\varepsilon}(S_1, S_2) = \max\{LCSS_{\delta,\varepsilon}(HEAD(S_1), S_2),\; LCSS_{\delta,\varepsilon}(S_1, HEAD(S_2))\}$, otherwise

where $HEAD(S_1)$ is the subsequence $[S_{1,1}, S_{1,2}, \ldots, S_{1,m-1}]$, δ is an integer that controls the maximum distance in the time axis between two matched elements, and ε is a real number $0 < \varepsilon < 1$ that controls the maximum distance two elements are allowed to have in order to be considered matched.
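A minimal bottom-up sketch of this recursion follows (names are ours). Note that the matching test is written with ≤ so that the worked example of Section 13.4.1, where ε = 1 and the series differ by exactly 1, comes out as stated there; the definition above uses a strict inequality:

```python
def lcss(s1, s2, delta, epsilon):
    """Length of the longest common subsequence under the (delta, epsilon)
    matching criterion of Section 13.3.4; bottom-up dynamic program."""
    m, n = len(s1), len(s2)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # two elements match if close in value (epsilon) and time (delta)
            if abs(s1[i - 1] - s2[j - 1]) <= epsilon and abs(i - j) <= delta:
                L[i][j] = 1 + L[i - 1][j - 1]
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[m][n]
```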

Apart from being used in time series classification, the LCSS distance is often used in domains like speech recognition and text pattern mining. Its main drawback is that one sequence often needs to be scaled or translated to match the other. A detailed study of LCSS variations and algorithms is presented in [5].

13.4 k-NN

The well-known and popular traditional k-NN classifier works no differently when it comes to time series classification. The most challenging part in this setting is again the choice of the distance or similarity measure used to compare two time series objects. The intuition is that a time series object is assigned the class label of the majority of its k nearest neighbors. k-NN has been very popular in the classification literature, due to both its simplicity and the highly accurate results it produces.
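A minimal sketch of the scheme follows, parameterized by any of the distance functions of Section 13.3 (for example, the dtw sketch above); the function names are ours:

```python
from collections import Counter

def knn_classify(query, train_series, train_labels, distance, k=1):
    """Assign `query` the majority class label among its k nearest
    neighbors under the supplied time series `distance` function."""
    neighbors = sorted(
        zip(train_series, train_labels),
        key=lambda pair: distance(query, pair[0]),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# e.g., knn_classify(q, X_train, y_train, distance=dtw, k=1)
```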

Prekopcsák and Lemire study the problem of choosing a suitable distance measure for nearest neighbor time series classification [28]. They compare variations of the Mahalanobis distance against Dynamic Time Warping (DTW), and find that the latter is superior in most cases, while the Mahalanobis distance is one to two orders of magnitude faster. The authors recommend learning one Mahalanobis distance per class in order to achieve better performance. They experimented with a traditional 1-NN classifier and concluded that the class-specific covariance-shrinkage estimate and the class-specific diagonal Mahalanobis distances achieve the best results regarding classification error and running time. Another interesting observation is that in 1-NN classification, DTW is at least two orders of magnitude slower than the Mahalanobis distance on all 17 datasets the authors experimented on.

Ding et al. [9] use the k-NN classifier with k = 1 on labeled data as a gold standard to evaluate the efficacy of a variety of distance measures, in their attempt to extensively validate representation methods and similarity measures over many time series datasets. Specifically, the idea is to predict the correct class label of each time series with a 1-NN classifier, as the label of the object that is most similar to the query object. With this approach the authors directly evaluate the effectiveness of each distance measure, whose choice is crucial for the k-NN classifier in all settings, let alone time series classification, where choosing a suitable distance measure is not a trivial task.

In a more recent work, similar in spirit to k-NN classification, Chen et al. propose a generative latent source model for online binary time series classification in a social network, with the goal of detecting trends as early as possible [6]. The proposed classifier uses weighted majority voting to classify an unlabeled time series, assigning it to the class indicated by the labeled data: the labeled time series vote in favor of the ground-truth model they belong to. The authors applied the described model to predict the virality of news topics on the Twitter social network, although they do not take the definitions of virality or trend into account and use them only as a black-box ground truth. The experimental results indicated that weighted majority voting can often detect or predict trends earlier than Twitter.

13.4.1 Speeding up the k-NN

In order to reduce the time needed to classify a time series using the k-NN classifier, an index structure can be used to store the labeled objects. However, it can be shown that neither DTW nor LCSS satisfies the triangle inequality, which is a basic metric property; thus they are not metrics and a traditional index cannot be used. More specifically, in an example scenario with three time series A, B, and C where:

$$A = \langle 1, 1, 5 \rangle; \quad B = \langle 2, 2, 5 \rangle; \quad C = \langle 2, 2, 2, 2, 2, 5 \rangle \qquad (13.3)$$

the DTW computation yields:

$$DTW(A, B) = 2, \quad DTW(B, C) = 0, \quad DTW(A, C) = 5 \qquad (13.4)$$

thus DTW(A, B) + DTW(B, C) < DTW(A, C). Similarly, if:

$$A = \langle 1, 1, 1, 1 \rangle; \quad B = \langle 2, 2, 2, 2 \rangle; \quad C = \langle 3, 3, 3, 3 \rangle \qquad (13.5)$$

with ε = 1, the LCSS computation yields:

$$LCSS_{\varepsilon}(A, B) = 4, \quad LCSS_{\varepsilon}(B, C) = 4, \quad LCSS_{\varepsilon}(A, C) = 0 \qquad (13.6)$$

As LCSS is not a distance function but a similarity measure, we can turn it into a distance:

$$D_{LCSS}(TS_i, TS_j) = 1 - \frac{LCSS(TS_i, TS_j)}{\max\{length(TS_i), length(TS_j)\}} \qquad (13.7)$$

By applying the above formula to the time series of the example, we obtain:

$$D_{LCSS}(A, B) = 0, \quad D_{LCSS}(B, C) = 0, \quad D_{LCSS}(A, C) = 1 \qquad (13.8)$$

Again, DLCSS(A,B) + DLCSS(B,C) < DLCSS(A,C).
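This violation can be checked directly with the lcss sketch from Section 13.3.4; the lcss_dist helper below is ours:

```python
def lcss_dist(t1, t2, delta, epsilon):
    # Eq. (13.7): turn the LCSS similarity into a distance in [0, 1]
    return 1.0 - lcss(t1, t2, delta, epsilon) / max(len(t1), len(t2))

A, B, C = [1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]
d_ab = lcss_dist(A, B, delta=4, epsilon=1)  # 0.0
d_bc = lcss_dist(B, C, delta=4, epsilon=1)  # 0.0
d_ac = lcss_dist(A, C, delta=4, epsilon=1)  # 1.0
assert d_ab + d_bc < d_ac  # the triangle inequality fails
```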

Alternatively, instead of using a traditional index structure that requires the distance measure to be a metric, Vlachos et al. [40] proposed to examine whether the distance measure satisfies a pseudo-metric property. In their work they show that LCSS indeed has a pseudo-metric property that allows us to bound DLCSS(TSi, TSj) if we know D(TSi, TSk) and D(TSk, TSj) for a third time series TSk. This fact allows tree-based index structures to be used and speeds up the identification of the k nearest neighbors of a query time series TSq.

Another speed-up technique is to lower bound the distance measure. For DTW, LB_Improved, proposed in [25], is one of the most recent lower bounding techniques, while other lower bounds have been proposed in [21] and [1]. Vlachos et al. have proposed lower bounding techniques for LCSS in [38] and [39].
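One widely used example is the envelope-based bound of Keogh and Ratanamahatana [21]: build an upper/lower envelope around the query for a warping window r and charge only the parts of the candidate that escape the envelope. A minimal sketch for equal-length series (names are ours):

```python
import numpy as np

def lb_keogh(query, candidate, r):
    """Lower bound on DTW with warping window r, for equal-length series."""
    q, c = np.asarray(query, dtype=float), np.asarray(candidate, dtype=float)
    total = 0.0
    for i, ci in enumerate(c):
        lo, hi = max(0, i - r), min(len(q), i + r + 1)
        upper, lower = q[lo:hi].max(), q[lo:hi].min()  # envelope around query
        if ci > upper:
            total += (ci - upper) ** 2
        elif ci < lower:
            total += (lower - ci) ** 2
    return np.sqrt(total)

# If lb_keogh(q, c, r) already exceeds the best distance found so far,
# the exact (and costlier) DTW computation for candidate c can be skipped.
```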

13.5 Support Vector Machines (SVMs)

Support Vector Machines (SVMs) [8] are a powerful machine learning approach that has been widely used in the literature for solving binary classification problems. The SVM algorithm constructs a hyperplane, or a set of hyperplanes, in a high-dimensional space without any assumptions regarding the distribution of the studied datasets. SVMs use a hypothesis space that consists of linear functions in a high-dimensional feature space, and the training phase derives from optimization theory. SVM classification achieves a global minimum solution, while involving a relatively fast training phase, which becomes increasingly important as datasets of ever larger scale become available. SVMs have also proven to handle overfitting problems well.

SVMs are reported to perform non-optimally when it comes to time series classification, because it is not trivial to identify intra-class similarities between the training objects, whereas distance-based classification methods like k-NN can overcome this difficulty by using the DTW distance. As described above, DTW is insensitive to shifts, length variations, and deformations, and captures the similarity between time series despite the presence of such characteristics.

More specifically, an SVM operates on a training dataset Dtraining that contains N training time series and their corresponding class labels, Dtraining = {(ts0, l0), (ts1, l1), …, (tsN−1, lN−1)}. The SVM algorithm separates the classes in a space of much higher dimensionality, since the classes may not be linearly separable in the original space. In order to perform this mapping, SVMs use kernel functions that represent a dot product of the input data points mapped into the higher dimensional feature space by a transformation ϕ.
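For equal-length series, one simple instantiation is an RBF kernel over the Euclidean distance, supplied to the SVM as a precomputed Gram matrix. The following is a hedged sketch using scikit-learn (function names and the gamma value are ours); elastic similarities such as DTW-based ones need extra care here, since they do not yield positive semi-definite kernels in general:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import euclidean_distances

def rbf_gram(X, Y, gamma=0.1):
    """K(x, y) = exp(-gamma * ||x - y||^2) over equal-length series."""
    return np.exp(-gamma * euclidean_distances(X, Y) ** 2)

def train_ts_svm(X_train, y_train, gamma=0.1):
    # X_train: (N, n) array of N equal-length series; y_train: class labels
    clf = SVC(kernel="precomputed")
    clf.fit(rbf_gram(X_train, X_train, gamma), y_train)
    return clf

def predict_ts_svm(clf, X_train, X_test, gamma=0.1):
    # test kernel matrix: rows = test series, columns = training series
    return clf.predict(rbf_gram(X_test, X_train, gamma))
```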

Köknar-Tezel and Latecki focus on the problem of classification of imbalanced time series datasets [23]. They argue that synthetic data points cannot always be inserted in feature spaces, and propose instead to insert synthetic data points in distance spaces. They then use a Support Vector Machines classifier for time series classification and report significant improvements in the classification rates of points belonging to minority classes, called rare events, as well as an overall improvement in the accuracy of the SVM after the insertion of the synthetic data points. The main contribution of [23] is the insertion of synthetic data points and the over-sampling of the minority class in imbalanced datasets. Synthetic data points can be added in all distance spaces, even non-metric ones, so that elastic sequence matching algorithms like DTW can be used.

Huerta et al. extend SVMs to classify multidimensional time series by using concepts from nonlinear dynamical systems theory [17], and more specifically the Rössler oscillator, which is a three-dimensional dynamical system. The authors experimented on both real and synthetic benchmark datasets, and the proposed dynamical SVMs achieved in several cases better results than many kernel-based approaches proposed for multidimensional time series classification.

Sasirekha and Kumar applied SVM classification to real-life long-term ECG time series recordings and concluded that SVMs can act as a good predictor of the heart rate signal [34]. In their work they propose a variety of feature extraction techniques and compare the respective accuracy values using the corresponding Heart Rate Variability (HRV) features. The experimental results show that SVM classification for heartbeat time series performs well in detecting different attacks and is comparable to results reported in the literature. Kampouraki et al. [20] also applied Gaussian kernel-based SVM classification with a variety of statistical and wavelet HRV features to the heartbeat time series classification problem, arguing that the SVM classifier performs better in this specific application than other proposed approaches based on neural networks. In their study, SVMs proved more robust in cases where the signal-to-noise ratio was very low. In the experimental evaluation, the authors compared SVM classification to a learning vector quantization (LVQ) neural network [22] and a Levenberg-Marquardt (LM) minimization back-propagation neural network [15], and the SVMs achieved better precision and recall values, even when the signals to be classified were noisy.

Grabocka et al. propose a method that improves SVM time series classification by enriching the training set with new training objects derived from the support vector instances [14]. In their experimental evaluation on many time series datasets, the enhanced method outperforms both the DTW-based k-NN and the traditional SVM classifiers. The authors propose a data-driven, localized, and selective new-instance generation method based on Virtual Support Vectors (VSV), which had previously been used for image classification. In brief outline, the method uses a deformation algorithm to transform the time series, splitting each time series instance into many regions and transforming each of them separately. The experimental results show an accuracy improvement between 5% and 11% over traditional SVM classification, along with better mean error rates than the DTW-based k-NN classifier. However, this boost in SVM accuracy comes at the cost of higher time complexity, owing to the larger number of training examples.

As all of the works mentioned above underline, the main issue in applying SVMs to time series classification is the choice of a kernel function that leads to the minimum generalization error while keeping the classification error on the time series training set low. Various kernel approaches have been proposed in the literature, some of which have been used to classify sequences with characteristics similar to time series (e.g., [18, 41]) or for handwriting recognition [4].

13.6 Classification Trees

Classification or decision trees, a widely used family of classifiers, create a model that predicts the value of a target variable based on several input variables [27]. Classification trees are popular mainly because of the comprehensibility they offer regarding the classification task. Starting with the root node, a classification tree building algorithm searches among the features for the binary distinction that provides the most information about the classes. The same process is then repeated for each of the resulting new nodes, continuing the recursion until a stopping criterion is reached. The resulting tree is often too large, which leads to overfitting problems, so a pruning process reduces the tree size in order to both improve classification accuracy and reduce classification complexity. The most popular decision tree algorithms are ID3 (Iterative Dichotomiser 3) [29], C4.5, which is an extension of ID3 [30], CART (Classification and Regression Tree) [27], and CHAID (CHi-squared Automatic Interaction Detector) [35], which performs multi-level splits when computing classification trees.

Douzal-Chouakria and Amblard describe an extension of the traditional decision tree classifier, in which the metric changes from one node to another and the split criterion in each node extracts the most characteristic subsequences of the time series to be split [10]. Geurts argues that many time series classification problems can be solved by identifying and exploiting local patterns in the time series. In [13] the author performs time series segmentation to extract prototypes that can be represented by sets of numerical values, so that a traditional classifier can be applied. More specifically, the proposed method extracts patterns from time series, which are then combined in decision trees to produce classification rules. The author applies the proposed approach to a variety of synthetic and real datasets and shows that pattern extraction used in this way can improve classification accuracy and interpretability, as the decision trees' rules can provide a user with a broad overview of classes and datasets.

With a similar goal in mind, namely interpretability of classification results, Rodríguez and Alonso present a method that uses decision trees for time series classification [32]. More specifically, two types of trees are presented: (a) interval-based trees, where each decision node evaluates a function over an interval, and (b) DTW-based trees, where each decision node holds a reference example against which the time series to classify is compared, e.g., DTW(tsi, tsref) < threshold. The decision trees can also be constructed by extracting global and local features (e.g., global: mean, length, maximum; local: local maximum, slope, etc.) or by extracting simple or complex patterns, where a complex pattern is a sequence of simple patterns and a simple pattern is a time series whose Euclidean distance to the examined one is small enough. A sketch of the feature-based route appears after this paragraph.

Most decision tree algorithms are known to have high variance error [19]. In their work, Jović et al. examined the performance of decision tree ensembles on biomedical datasets; the ensembles included AdaBoost.M1 and MultiBoost applied to the traditional decision tree algorithm C4.5. The classification accuracy improved, and the authors argue that decision tree ensembles are equivalent to, if not better than, the SVM-based classifiers that have been commonly used for time series classification in the biomedical domain, which is one of the most popular domains for time series classification.
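The following is a hedged sketch of the feature-extraction route: each series is summarized by a handful of global features in the spirit of those named above (mean, length, maximum, slope), and a standard tree is grown on them with scikit-learn. The particular feature set and parameter values are our illustrative choices:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def global_features(ts):
    """Summarize a series (length >= 2) by a few global features."""
    ts = np.asarray(ts, dtype=float)
    x = np.arange(len(ts))
    slope = np.polyfit(x, ts, 1)[0]  # slope of the overall linear trend
    return [ts.mean(), ts.std(), ts.max(), ts.min(), len(ts), slope]

def train_tree(series_list, labels, max_depth=5):
    X = np.array([global_features(ts) for ts in series_list])
    # the depth cap acts as a simple pruning mechanism against overfitting
    tree = DecisionTreeClassifier(max_depth=max_depth)
    return tree.fit(X, labels)

# prediction: tree.predict([global_features(new_series)])
```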

As in all cases where classification trees are chosen to classify data of unknown labels, in the time series domain the main issue is to find appropriate split criteria that separate the input dataset and avoid overfitting. Some time series specific issues regarding the choice of a split criterion relate to the many possible definitions of the distance between two time series objects. Time series classification trees that use the same split criterion for all nodes may fail to identify unique characteristics that distinguish time series classes in different parts of the tree. Moreover, distance-based approaches should not consider only distances computed on whole time series, since there may be cases where subsequences exhibit the relevant separation characteristics.

However, most of the works that use decision trees for time series classification focus on the comprehensibility of the classification results rather than on running time, accuracy, precision, or recall. Thus, a general conclusion that can be drawn regarding decision trees for time series classification is the following: classification trees are a good and suitable choice when interpretability of results is the main concern; when classification accuracy matters most, other classifiers, like k-NN or SVMs, achieve better results.

13.7 Model-Based Classification

A method to identify the most similar time series to a given one is to use a model that describes the time series. Kotsifakos et al. [24] used Hidden Markov Models (HMMs) to model the underlying structure of the time series. An HMM is a doubly stochastic process that contains a finite set of states, each one emitting values according to a different probability distribution. Moreover, an HMM is described by a set of transition probabilities that define how the process moves from one state to another. When a dataset consists of sets of similar time series, HMMs can be used to perform search or classification tasks: each group of similar time series forms a class, each class is represented by an HMM, and, given a query time series TSq, the classification algorithm assigns TSq to the class that maximizes the probability of having generated it. More specifically, regarding time series representation, an HMM is a dynamic probabilistic model that describes a time series and deals with uncertainty. It models the joint probability of a collection of hidden and observed discrete random variables, so it consists of directly unobservable variables that form the hidden layer of the model, each of them representing a different timestamp, under the Markov property. The Markov property holds when the conditional probability distribution of a future state of the studied process depends only on the present state and not on the values that preceded it. HMMs also incorporate an observable layer, consisting of observable variables that are affected by the corresponding hidden ones. For a time series TS = {tst}, t ∈ [0, N], the HMM representing TS is defined by:

  • a set of transition probabilities TP = {tpij} that represent the probability of transitioning from state i to state j, and
  • a set of observation probabilities OP = {opjk} that represent the probability of observing the value k at state j.

Moreover, there is a set of prior probabilities Pr = {pj} that represent the distribution of the first state of the model. Most proposed approaches assume a Gaussian probability distribution OPj for each state j. The most common approach for learning Hidden Markov Models is the Expectation Maximization (EM) algorithm [26], which, after a recursive estimation, converges to a local maximum of the likelihood. More specifically, the EM algorithm is used by the popular Baum-Welch algorithm in order to find the unknown parameters of an HMM [31]. It relies on the assumption that the i-th hidden variable depends only on the (i − 1)-th hidden variable and not on previous ones, and likewise for the current observation variable. So, given a set of observed time series that are transformed into a set of observed feature vectors, the Baum-Welch algorithm finds the maximum likelihood estimate of the parameters of the corresponding HMM. More formally, if Xt is a hidden random variable at time t, the probability P(Xt | Xt−1) does not depend on the value of t. A hidden Markov chain is described by θ = (TP, OP, Pr), and the Baum-Welch algorithm computes θ* = argmaxθ P(TS | θ), namely the parameters that maximize the probability of the observation sequence TS.
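A hedged sketch of this model-based scheme using the hmmlearn library: one Gaussian HMM is trained per class with Baum-Welch (EM), and a query is assigned to the class whose model gives it the highest log-likelihood. The hmmlearn API calls are real; everything else (names, state count) is our illustrative choice:

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_class_models(series_by_class, n_states=3):
    """Fit one Gaussian HMM per class with Baum-Welch (EM)."""
    models = {}
    for label, series_list in series_by_class.items():
        # hmmlearn expects one stacked (total_len, dim) array plus lengths
        X = np.concatenate([np.asarray(ts).reshape(-1, 1) for ts in series_list])
        lengths = [len(ts) for ts in series_list]
        models[label] = GaussianHMM(n_components=n_states, n_iter=50).fit(X, lengths)
    return models

def classify(models, ts):
    """Assign ts to the class whose HMM maximizes the log-likelihood."""
    x = np.asarray(ts).reshape(-1, 1)
    return max(models, key=lambda label: models[label].score(x))
```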

In [24], Kotsifakos et al. evaluated and compared the accuracy achieved by search-by-model and search-by-example methods in large time series databases. Search-by-example included DTW and constrained-DTW methods. In the search-by-model approach, the authors used a two-step training phase that consisted of (i) initialization and (ii) refinement. In the initialization step, an HMM is computed for each class in the dataset. In the refinement step, the Viterbi algorithm is applied to each training time series per class in order to identify the best state sequence for each of them, using the transition probabilities computed at the initialization step. After an HMM has been trained for each class in the training set, a probability distribution is defined that represents, for each time series in the testing set, the probability that it belongs to the corresponding class. Given a time series TS, the probability of TS belonging to class Ci, P(TS | Ci), i = 1, …, |C|, is computed with the dynamic programming Forward algorithm [31]. In the experimental evaluation on 20 time series datasets, the model-based classification method yielded more accurate results than the search-by-example technique when there were many training examples per model; however, when there were few training examples per class, search-by-example proved to be more accurate.

In [3], Antonucci and De Rosa also use HMMs for time series classification, while coping with the problems caused by the short length of time series in some applications: the EM algorithm calculates unreliable estimates when the amount of training data is small. To tackle this problem, the authors describe an imprecise version of the learning algorithm and propose an HMM-based classifier that returns multiple classes when the likelihood intervals corresponding to the candidate classes overlap.

13.8 Distributed Time Series Classification

Zeinalipour-Yazti et al. define and solve the distributed spatio-temporal similarity search problem: given a query trajectory Q, identify the trajectories that are most similar to Q [44]. Trajectories are essentially time series, in that they consist of multidimensional points ordered in the time dimension; they are usually two-dimensional time series, when movement in the two-dimensional plane is considered. In the described setting, each trajectory is segmented over a distributed network of nodes, such as sensors or vehicles. The authors propose two algorithms that compute local lower and upper bounds on the similarity between the distributed segments of the time series and the query time series Q. In a similar spirit, in a more recent work, Zeinalipour-Yazti et al. describe a setting where the distributed power of smartphones is used to provide a distributed solution to a complex time series classification problem [43]. The difference from the previously mentioned setting is that in the latter each trajectory or trace is stored in its entirety at a specific node, whereas in the former each trajectory was segmented across the network. Specifically, in the proposed framework the authors compare a query trace Q against a dataset of traces generated by distributed smartphones, and propose a top-K query processing algorithm that uses distributed trajectory similarity measures insensitive to noise, in order to identify the time series most similar to the query trace Q.

13.9 Conclusion

The time series data mining literature includes many works devoted to time series classification tasks. As time series datasets and databases grow larger in size and as applications evolve, the performance and accuracy of time series classification algorithms become ever more important for real-life applications. Although most time series classification approaches adopt and adapt techniques from general data classification studies, some essential modifications to these techniques have to be introduced in order to cope with the nature of time series distances and time series representations. Time series classification has found applications in economics, health care, and multimedia, among others. In this chapter, we highlighted the major issues in the time series classification task and the most influential corresponding techniques.

Acknowledgements

This work was supported by European Union and Greek National funds through the Operational Program “Education and Lifelong Learning” of the National Strategic Reference Framework (NSRF) — Research Funding Programs: Heraclitus II fellowship, THALIS — GeomComp, THALIS — DISFER, ARISTEIA — MMD, ARISTEIA — INCEPTION, and the EU FP7 project INSIGHT (www.insight-ict.eu).

Bibliography

[1] Ghazi Al-Naymat, Sanjay Chawla, and Javid Taheri. SparseDTW: A novel approach to speed up dynamic time warping. In Proceedings of the Eighth Australasian Data Mining Conference, Volume 101, pages 117–127. Australian Computer Society, Inc., 2009.

[2] Helmut Alt and Michael Godau. Computing the Fréchet distance between two polygonal curves. International Journal of Computational Geometry & Applications, 5(1–2):75–91, 1995.

[3] Alessandro Antonucci and Rocco De Rosa. Time series classification by imprecise hidden Markov models. In WIRN, pages 195–202, 2011.

[4] Claus Bahlmann, Bernard Haasdonk, and Hans Burkhardt. Online handwriting recognition with support vector machines—a kernel approach. In Proceedings of Eighth International Workshop on Frontiers in Handwriting Recognition, 2002, pages 49–54. IEEE, 2002.

[5] Lasse Bergroth, Harri Hakonen, and Timo Raita. A survey of longest common subsequence algorithms. In SPIRE 2000. Proceedings of Seventh International Symposium on String Processing and Information Retrieval, 2000, pages 39–48. IEEE, 2000.

[6] George H Chen, Stanislav Nikolov, and Devavrat Shah. A latent source model for online time series classification. arXiv preprint arXiv:1302.3639, 2013.

[7] Lei Chen, M Tamer Özsu, and Vincent Oria. Robust and fast similarity search for moving object trajectories. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pages 491–502. ACM, 2005.

[8] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[9] Hui Ding, Goce Trajcevski, Peter Scheuermann, Xiaoyue Wang, and Eamonn Keogh. Querying and mining of time series data: experimental comparison of representations and distance measures. Proceedings of the VLDB Endowment, 1(2):1542–1552, 2008.

[10] Ahlame Douzal-Chouakria and Cécile Amblard. Classification trees for time series. Pattern Recognition, 45(3):1076–1091, 2012.

[11] Anne Driemel, Sariel Har-Peled, and Carola Wenk. Approximating the Fréchet distance for realistic curves in near linear time. Discrete & Computational Geometry, 48(1):94–127, 2012.

[12] Thomas Eiter and Heikki Mannila. Computing discrete Fréchet distance. Technical Report CS-TR-2008-0010, Christian Doppler Laboratory for Expert Systems, Vienna, Austria, 1994.

[13] Pierre Geurts. Pattern extraction for time series classification. In Principles of Data Mining and Knowledge Discovery, pages 115–127. Springer, 2001.

[14] Josif Grabocka, Alexandros Nanopoulos, and Lars Schmidt-Thieme. Invariant time-series classification. In Machine Learning and Knowledge Discovery in Databases, pages 725–740. Springer, 2012.

[15] Martin T. Hagan and Mohammad B. Menhaj. Training feedforward networks with the Marquardt algorithm. IEEE Transactions on Neural Networks, 5(6):989–993, 1994.

[16] Sariel Har-Peled et al. New similarity measures between polylines with applications to morphing and polygon sweeping. Discrete & Computational Geometry, 28(4):535–569, 2002.

[17] Ramón Huerta, Shankar Vembu, Mehmet K. Muezzinoglu, and Alexander Vergara. Dynamical SVM for time series classification. In Pattern Recognition, pages 216–225. Springer, 2012.

[18] Tommi Jaakkola, David Haussler, et al. Exploiting generative models in discriminative classifiers. Advances in Neural Information Processing Systems, pages 487–493, 1999.

[19] Alan Jović, Karla Brkić, and Nikola Bogunović. Decision tree ensembles in biomedical time-series classification. In Pattern Recognition, pages 408–417. Springer, 2012.

[20] Argyro Kampouraki, George Manis, and Christophoros Nikou. Heartbeat time series classification with support vector machines. IEEE Transactions on Information Technology in Biomedicine, 13(4):512–518, 2009.

[21] Eamonn Keogh and Chotirat Ann Ratanamahatana. Exact indexing of dynamic time warping. Knowledge and Information Systems, 7(3):358–386, 2005.

[22] Teuvo Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9):1464–1480, 1990.

[23] Suzan Köknar-Tezel and Longin Jan Latecki. Improving SVM classification on unbalanced time series data sets with ghost points. Knowledge and Information Systems, 28(1):1–23, 2011.

[24] Alexios Kotsifakos, Vassilis Athitsos, Panagiotis Papapetrou, Jaakko Hollmén, and Dimitrios Gunopulos. Model-based search in large time series databases. In PETRA, page 36, 2011.

[25] Daniel Lemire. Faster retrieval with a two-pass dynamic-time-warping lower bound. Pattern Recognition, 42(9):2169–2180, 2009.

[26] Geoffrey McLachlan and Thriyambakam Krishnan. The EM Algorithm and Extensions, volume 382. John Wiley & Sons, 2007.

[27] Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.

[28] Zoltán Prekopcsák and Daniel Lemire. Time series classification by class-specific mahalanobis distance measures. Advances in Data Analysis and Classification, 6(3):185–200, 2012.

[29] J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.

[30] John Ross Quinlan. C4.5: Programs for Machine Learning, volume 1. Morgan Kaufmann, 1993.

[31] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

[32] Juan J Rodríguez and Carlos J Alonso. Interval and dynamic time warping-based decision trees. In Proceedings of the 2004 ACM symposium on Applied computing, pages 548–552. ACM, 2004.

[33] Nick Roussopoulos, Stephen Kelley, and Frédéric Vincent. Nearest neighbor queries. ACM SIGMOD Record, 24(2):71–79, 1995.

[34] A Sasirekha and P Ganesh Kumar. Support vector machine for classification of heartbeat time series data. International Journal of Emerging Science and Technology, 1(10):38–41, 2013.

[35] Merel van Diepen and Philip Hans Franses. Evaluating chi-squared automatic interaction detection. Information Systems, 31(8):814–831, 2006.

[36] Michail Vlachos, Dimitrios Gunopulos, and Gautam Das. Rotation invariant distance measures for trajectories. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 707–712. ACM, 2004.

[37] Michail Vlachos, Dimitrios Gunopulos, and George Kollios. Robust similarity measures for mobile object trajectories. In Proceedings of the 13th International Workshop on Database and Expert Systems Applications, pages 721–726. IEEE, 2002.

[38] Michail Vlachos, Marios Hadjieleftheriou, Dimitrios Gunopulos, and Eamonn Keogh. Indexing multi-dimensional time-series with support for multiple distance measures. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 216–225. ACM, 2003.

[39] Michail Vlachos, Marios Hadjieleftheriou, Dimitrios Gunopulos, and Eamonn Keogh. Indexing multidimensional time-series. The VLDB Journal—The International Journal on Very Large Data Bases, 15(1):1–20, 2006.

[40] Michail Vlachos, George Kollios, and Dimitrios Gunopulos. Discovering similar multidimensional trajectories. In Proceedings of the 18th International Conference on Data Engineering, pages 673–684. IEEE, 2002.

[41] Chris Watkins. Dynamic alignment kernels. Advances in Neural Information Processing Systems, pages 39–50, 1999.

[42] Zhengzheng Xing, Jian Pei, and Eamonn Keogh. A brief survey on sequence classification. SIGKDD Explor. Newsl., 12(1):40–48, November 2010.

[43] Demetrios Zeinalipour-Yazti, Christos Laoudias, Costandinos Costa, Michail Vlachos, M. Andreou, and Dimitrios Gunopulos. Crowdsourced trace similarity with smartphones. IEEE Transactions on Knowledge and Data Engineering, 25(6):1240–1253, 2012.

[44] Demetrios Zeinalipour-Yazti, Song Lin, and Dimitrios Gunopulos. Distributed spatio-temporal similarity search. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management, pages 14–23. ACM, 2006.
