Chapter 6
Analyzing Categorical Time Series

In Part II of this book, we shall be concerned with categorical processes $(X_t)_{t\in\mathbb{Z}}$; that is, the range of $X_t$ is not only assumed to be discrete, but also to be qualitative, consisting of a finite number $m+1$ of categories, with state space $\mathcal{S} = \{s_0, s_1, \ldots, s_m\}$. The time series $x_1, \ldots, x_T$ stemming from this kind of process are referred to as categorical time series. In some applications, the range of $X_t$ exhibits at least a natural ordering; it is then referred to as an ordinal range. In other cases, not even such an inherent order exists (a nominal range). In Brenčič et al. (2015), for instance, time series about atmospheric circulation patterns are analyzed. Each day is assigned one of 41 categories, called elementary circulation mechanisms (ECMs); although there are some relationships (similarities) between these categories (for example, they can be arranged in four groups), the categories do not exhibit an inherent ordering; that is, we are concerned with a nominal range. In contrast, Chang et al. (1984) consider time series of daily precipitation and distinguish between dry days and days with medium or strong precipitation, thus leading to an ordinal time series. Other examples of nominal “time” series are nucleotide sequences (a range of four DNA bases) and protein sequences (a range of twenty amino acids) (Churchill, 1989; Krogh et al., 1994; Dehnert et al., 2003), although again similarities exist within these types of nominal range (Taylor, 1986). The time series of electroencephalographic (EEG) sleep states (per minute), as analyzed by Stoffer et al. (2000), are also ordinal time series.

Here, unless stated otherwise, we shall consider the more general case of a nominal range. So even if there is some ordering, we do not make use of it, but assume that each random variable $X_t$ takes one of a finite number of unordered categories. To simplify notation, we adopt the convention from Appendix B.2 and assume the possible outcomes to be arranged in a certain lexicographical order, $\mathcal{S} = \{s_0, s_1, \ldots, s_m\}$.

As discussed in the context of Example A.3.3, a categorical random variable $X_t$ can be represented equivalently as a binary random vector $\mathbf{Y}_t \in \{0,1\}^{m+1}$, with the range consisting of the unit vectors $\mathbf{e}_0, \ldots, \mathbf{e}_m$, by defining $\mathbf{Y}_t = \mathbf{e}_i$ if $X_t = s_i$. We shall sometimes switch to this kind of representation, referred to as a binarization, if it allows us to simplify expressions.

6.1 Introduction to Categorical Time Series Analysis

For (stationary) real-valued time series, a huge toolbox for analysis and modeling is readily available and well known to a broad audience. To highlight a few basic approaches: the time series are visualized by simply plotting the observed values against time; marginal properties such as location and dispersion may be measured in terms of the mean/median and the variance/interquartile range, respectively; and serial dependence is commonly quantified in terms of autocorrelation; see also Section 2.4.

Things change if the time series is categorical. Since the elementary arithmetic operations are not defined on such a qualitative range, moments like the mean or the autocovariance can no longer be computed. In the ordinal case, at least a few methods can be preserved: a time series plot is still feasible by arranging the possible outcomes in their natural ordering along the y-axis, and the location can be measured by the median (more generally, quantiles and the cdf are defined for ordinal data). But in the purely nominal case (as mainly considered here), not even these basic analytic tools are applicable. Therefore, tailor-made solutions are required for visualizing such time series, and for quantifying location, dispersion, and serial dependence.
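To illustrate the plotting idea, each category can be mapped to an integer level on the y-axis; for ordinal data the natural ordering fixes this arrangement, while for nominal data any arrangement is admissible (and should not be over-interpreted). The following is a minimal sketch in Python; the function name and the use of matplotlib are our own choices, not tied to any particular package for categorical time series:

```python
import matplotlib.pyplot as plt

def plot_categorical_ts(x, categories):
    """Time series plot for categorical data: each category is mapped
    to an integer level on the y-axis; the order of `categories` fixes
    the arrangement (natural ordering in the ordinal case)."""
    level = {c: i for i, c in enumerate(categories)}
    plt.step(range(1, len(x) + 1), [level[v] for v in x], where="mid")
    plt.yticks(range(len(categories)), categories)
    plt.xlabel("t")
    plt.show()

# Hypothetical short example using the EEG sleep state labels:
plot_categorical_ts(["qt", "qt", "tr", "al", "ah", "al", "qt"],
                    ["qt", "qh", "tr", "al", "ah", "aw"])
```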

In the sequel, when calling a categorical process $(X_t)_{t\in\mathbb{Z}}$ stationary, we refer to the concept of strict stationarity according to Definition B.1.3. While specific models for such stationary categorical processes are discussed in Chapter 7, the particular instance of an i.i.d. categorical process will be of importance here, since it constitutes the benchmark when trying to uncover serial dependence.

Table 6.1 Frequency table of infant EEG sleep states data

State                qt   qh   tr   al   ah   aw
Absolute frequency   33    3   12   27   32    0

Figure 6.1 Time series plot of infant EEG sleep states (per minute); see Example 6.1.1.

An application leading to a visibly non-linear rate evolution graph is presented by Brenčič et al. (2015), who analyzed a time series about atmospheric circulation patterns. Other tools for visually analyzing a categorical time series, such as the IFS (iterated function systems) circle transformation (Weiß, 2008d), look for the occurrence of patterns; that is, the occurrence of tuples (“strings”) of successive categories, or of sets of such tuples. A comprehensive survey of tools for visualizing time series data in general (not restricted to the categorical case) is provided by Aigner et al. (2011).


Figure 6.3 Spectral envelope of wood pewee data; see Remark 6.1.3.

6.2 Marginal Properties of Categorical Time Series

Let $(X_t)_{t\in\mathbb{Z}}$ be a stationary categorical process with marginal distribution $\mathbf{p} = (p_0, \ldots, p_m)^\top$, where $p_i = P(X_t = s_i)$. Given the segment $x_1, \ldots, x_T$ from this process, we estimate $\mathbf{p}$ by the vector $\hat{\mathbf{p}} = (\hat{p}_0, \ldots, \hat{p}_m)^\top$ of relative frequencies computed from $x_1, \ldots, x_T$, which is also expressed as $\hat{\mathbf{p}} = \frac{1}{T} \sum_{t=1}^{T} \mathbf{y}_t$ by using the above binarization of the process. Especially if $m+1$ is large, the complete (estimated) marginal distribution might be difficult to interpret. So, as with real-valued data, it is necessary in practice to condense the full information about the marginal distribution into a few metrics that concentrate on features such as location and dispersion.
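As a small sketch of this estimation step (in Python with numpy; the helper names are ours), the binarization turns the series into a 0/1 matrix whose column means are exactly the relative frequencies $\hat{p}_i$:

```python
import numpy as np

def binarize(x, categories):
    """Binarization: row t of Y is the unit vector e_i if x_t equals
    the i-th category of the state space."""
    idx = {c: i for i, c in enumerate(categories)}
    Y = np.zeros((len(x), len(categories)), dtype=int)
    for t, v in enumerate(x):
        Y[t, idx[v]] = 1
    return Y

def marginal_frequencies(x, categories):
    """p_hat = (1/T) * sum_t y_t, the vector of relative frequencies."""
    return binarize(x, categories).mean(axis=0)
```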

Measuring the location of a categorical random variable $X$ (or estimating it from $x_1, \ldots, x_T$) is rather straightforward; see also Examples 6.1.1 and 6.1.2. In any case, it is possible to compute “the” (sample) mode, although such a mode is sometimes not uniquely determined. If the range of $X$ is even ordinal, then the median (or any other quantile) can be used to express the “center” of the distribution or of the sample, respectively.

Categorical dispersion is less obvious at first. Even in the ordinal case, a quantile-based dispersion measure such as the interquartile range (IQR) is not applicable, since a difference between categories is not defined (one might use the number of categories between the quartiles as a substitute). Therefore, let us first think about the intuitive meaning of dispersion. For a real-valued random variable $X$, measures such as the variance or the IQR ultimately aim at expressing uncertainty: the smaller the dispersion of $X$, the better we can predict the outcome of $X$. Adapting this intuitive understanding of dispersion to the categorical case, we have maximal dispersion if all probabilities $p_0, \ldots, p_m$ are equal to each other, because then every outcome is equally probable and a reasonable prediction is impossible. So a uniform distribution on $\mathcal{S}$ constitutes one extreme of categorical dispersion. At the other extreme, if $p_i = 1$ for one $i$ and $p_j = 0$ otherwise (one-point distribution, so $\mathbf{p}$ equals one of the unit vectors $\mathbf{e}_i$), then we are able to perfectly predict the outcome of $X$, so $X$ has minimal dispersion in this sense.

Now that the extremes of categorical dispersion are known, we can think of dispersion measures $\nu$ that map these extremes to the extremes of their range. In fact, several measures for this purpose are readily available in the literature; see the survey in Appendix A of Weiß & Göb (2008), for instance. Furthermore, any concentration index can be used as a measure of dispersion.

For the sake of simplicity, we consider measures $\nu$ with range $[0, 1]$, where 0 refers to minimal dispersion, and 1 to maximal dispersion. Two popular (and, in the author's opinion, quite useful) measures of categorical dispersion are the Gini index and the entropy. We define the theoretical and the (sample) Gini index as

$$\nu_G = \frac{m+1}{m}\Big(1 - \sum_{i=0}^{m} p_i^2\Big), \qquad \hat{\nu}_G = \frac{m+1}{m}\Big(1 - \sum_{i=0}^{m} \hat{p}_i^2\Big), \qquad (6.1)$$

respectively. The theoretical Gini index $\nu_G$ has range $[0, 1]$, where increasing values indicate increasing dispersion, with the extremes $\nu_G = 0$ iff $X$ has a one-point distribution, and $\nu_G = 1$ iff $X$ has a uniform distribution. The sample Gini index $\hat{\nu}_G$ is asymptotically normally distributed in the i.i.d. case, and its variance is approximated by $\frac{4}{T}\big(\frac{m+1}{m}\big)^2\big(\sum_{i=0}^{m} p_i^3 - (\sum_{i=0}^{m} p_i^2)^2\big)$. Furthermore, although it is a biased estimator of $\nu_G$, its bias is easily corrected in the i.i.d. case by considering $\frac{T}{T-1}\,\hat{\nu}_G$ instead (Weiß, 2011a).
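A direct implementation of (6.1), including the bias correction $\frac{T}{T-1}\hat{\nu}_G$, might look as follows; this is only a sketch, with `p_hat` being the vector of relative frequencies from above:

```python
import numpy as np

def gini_index(p_hat, T=None):
    """Sample Gini index nu_G = ((m+1)/m) * (1 - sum_i p_i^2), cf. (6.1).
    If the series length T is supplied, the bias-corrected version
    T/(T-1) * nu_G for the i.i.d. case is returned instead."""
    m1 = len(p_hat)                       # m+1 categories
    nu = m1 / (m1 - 1) * (1 - np.sum(np.asarray(p_hat) ** 2))
    return nu * T / (T - 1) if T is not None else nu
```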

As an alternative, we define the theoretical and the (sample) entropy as

$$\nu_E = -\frac{1}{\ln(m+1)} \sum_{i=0}^{m} p_i \ln p_i, \qquad \hat{\nu}_E = -\frac{1}{\ln(m+1)} \sum_{i=0}^{m} \hat{p}_i \ln \hat{p}_i, \qquad (6.2)$$

respectively, where we always use the convention $0 \cdot \ln 0 := 0$. The theoretical entropy $\nu_E$ has the same properties as mentioned for the theoretical Gini index. In the i.i.d. case, $\hat{\nu}_E$ is also asymptotically normally distributed, now with approximate variance $\frac{1}{T \ln^2(m+1)}\big(\sum_{i=0}^{m} p_i \ln^2 p_i - (\sum_{i=0}^{m} p_i \ln p_i)^2\big)$, but there is no simple way to exactly correct the bias of $\hat{\nu}_E$ (Weiß, 2013b).
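The entropy (6.2) is implemented analogously; the convention $0 \cdot \ln 0 := 0$ simply means that zero frequencies are skipped in the sum (again only a sketch):

```python
import numpy as np

def entropy_index(p_hat):
    """Sample entropy nu_E = -1/ln(m+1) * sum_i p_i ln p_i, cf. (6.2),
    with the convention 0 * ln 0 := 0 (zero frequencies are dropped)."""
    p = np.asarray(p_hat)
    p = p[p > 0]                          # implements 0 * ln 0 := 0
    return -np.sum(p * np.log(p)) / np.log(len(p_hat))
```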

If we would like to do a bias correction or to compute confidence intervals for the dispersion measures in Example 6.2.2, we first need to investigate the serial dependence structure of the available time series further, say, to establish possible i.i.d. behavior such that the above asymptotics can be used. Corresponding tools for measuring serial dependence are presented in the next section.

6.3 Serial Dependence of Categorical Time Series

For the count time series considered in Part I, we simply used the well-known autocorrelation function to analyze the serial dependence structure; see Section 2.4. But this function is not defined in the categorical case (neither nominal nor ordinal), so different approaches are required. Before presenting particular measures, let us again start with some more general thoughts. As for the autocorrelation function, we shall look at pairs $(X_t, X_{t-k})$ with time lag $k \in \mathbb{N}$ from the underlying stationary categorical process. If, after having observed $X_{t-k}$, it is possible to perfectly predict the outcome of $X_t$, then it would be plausible to refer to $X_t$ and $X_{t-k}$ as perfectly dependent. If, in contrast, knowledge about $X_{t-k}$ would not help in this respect, then $X_t$ and $X_{t-k}$ would seem to be independent.

To translate this intuition into formulae, let us introduce the notation $p_{ij}(k) := P(X_t = s_i,\, X_{t-k} = s_j)$ with $i, j \in \{0, \ldots, m\}$ for the lagged bivariate probabilities, with the sample counterpart $\hat{p}_{ij}(k)$ being the relative frequency of the pair $(s_i, s_j)$ among the pairs $(x_t, x_{t-k})$ for $t = k+1, \ldots, T$. Using the binarization, we can express the latter as $\hat{p}_{ij}(k) = \frac{1}{T-k} \sum_{t=k+1}^{T} y_{t,i}\, y_{t-k,j}$; that is, $\big(\hat{p}_{ij}(k)\big) = \frac{1}{T-k} \sum_{t=k+1}^{T} \mathbf{y}_t \mathbf{y}_{t-k}^\top$. The corresponding conditional bivariate probabilities are denoted as $p_{i|j}(k) := p_{ij}(k)\,/\,p_j$ for $i, j \in \{0, \ldots, m\}$; see also Appendix B.2. To avoid computational difficulties, we assume that all marginal probabilities are truly positive ($p_i > 0$ for all $i$); otherwise, we would first have to reduce the state space.
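In terms of the binarized series, the matrix of lagged pair frequencies is a single matrix product; a sketch (with `Y` being the $T \times (m+1)$ binarization matrix from above):

```python
import numpy as np

def lagged_pair_frequencies(Y, k):
    """Matrix of relative frequencies p_hat_ij(k): entry (i, j) counts
    the pairs (x_t, x_{t-k}) = (s_i, s_j) for t = k+1, ..., T,
    divided by the number T - k of such pairs."""
    T = Y.shape[0]
    return Y[k:].T @ Y[:T - k] / (T - k)
```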

Following Weiß & Göb (2008), we now say that:

  • we have perfect (unsigned) serial dependence at lag $k$ iff, for any $j$, the conditional distribution $\big(p_{i|j}(k)\big)_{i=0,\ldots,m}$ is a one-point distribution
  • we have perfect serial independence at lag $k$ iff $p_{ij}(k) = p_i\, p_j$ for any $i, j$ (or, equivalently, iff $p_{i|j}(k) = p_i$ for any $i, j$).

The term “unsigned” was used above for the following reason: the autocorrelation function may take positive or negative values, hence being a signed measure, and positive autocorrelation implies, amongst other things, that large values tend to be followed by large values (and small values by small values). This motivates us to introduce an analogous concept of signed categorical dependence, where positive dependence implies that the process tends to stay in the state it has reached (and negative dependence that it tends to leave it). So, again following Weiß & Göb (2008), and given that we have already established perfect serial dependence at lag $k$ (in the unsigned sense above), we now say that

  • we even have perfect positive serial dependence iff $p_{j|j}(k) = 1$ for all $j$, or
  • we even have perfect negative serial dependence iff $p_{j|j}(k) = 0$ for all $j$.

The latter implies that $X_t$ necessarily has to take a state other than the one observed for $X_{t-k}$. A number of measures of unsigned serial dependence have been proposed in the literature so far (Dehnert et al., 2003; Weiß & Göb, 2008; Biswas & Song, 2009; Weiß, 2013b). We shall consider one such measure here, namely Cramér's $v$, where the selection is motivated by the attractive properties of the theoretical $v(k)$ as well as of the sample version $\hat{v}(k)$ of this measure. It is defined by

$$v(k) = \sqrt{\frac{1}{m} \sum_{i,j=0}^{m} \frac{\big(p_{ij}(k) - p_i\, p_j\big)^2}{p_i\, p_j}}, \qquad (6.3)$$

with the sample version $\hat{v}(k)$ obtained by plugging in the estimates $\hat{p}_{ij}(k)$ and $\hat{p}_i$. Cramér's $v(k)$ has the range $[0, 1]$, where the boundaries 0 and 1 are reached iff we have perfect serial independence/dependence at lag $k$, respectively. The distribution of its sample counterpart $\hat{v}(k)$, in the case of an underlying i.i.d. process, is asymptotically approximated by a $\chi^2$-distribution (Weiß, 2013b): $T\, m\, \hat{v}^2(k)$ is approximately $\chi^2_{m^2}$-distributed.

This relationship is quite useful in practice, since it allows us to uncover significant serial dependence. If the null hypothesis of serial independence at lag $k$ is to be tested on (approximate) level $\alpha$, and if $\chi^2_{m^2;\,1-\alpha}$ denotes the $(1-\alpha)$-quantile of the $\chi^2_{m^2}$-distribution, then we will reject the null if $\hat{v}(k) > \sqrt{\chi^2_{m^2;\,1-\alpha}\,/\,(m\,T)}$. This critical value can also be plotted into a graph of $\hat{v}(k)$ against $k$, as a substitute for the ACF plot familiar from real-valued time series analysis; see also Remark 2.3.1. On the other hand, this asymptotic result also shows that $\hat{v}(k)$ is generally a biased estimator of $v(k)$.
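Combining the pieces, the following sketch computes the sample Cramér's $v$ of (6.3) together with the approximate critical value (using scipy for the $\chi^2$ quantile; the guard against unobserved states anticipates the workaround used for the EEG data below):

```python
import numpy as np
from scipy.stats import chi2

def cramers_v(Y, k):
    """Sample Cramer's v(k) according to (6.3), computed from the
    binarized series Y; summands with unobserved states are dropped."""
    T, m1 = Y.shape
    p = Y.mean(axis=0)
    pk = Y[k:].T @ Y[:T - k] / (T - k)    # lagged pair frequencies
    pipj = np.outer(p, p)
    mask = pipj > 0                       # skip zero-frequency states
    v2 = np.sum((pk[mask] - pipj[mask]) ** 2 / pipj[mask]) / (m1 - 1)
    return np.sqrt(v2)

def v_critical(T, m1, alpha=0.05):
    """Reject serial independence at lag k on approximate level alpha
    if v_hat(k) exceeds sqrt(chi2-quantile(1 - alpha; m^2) / (m * T))."""
    m = m1 - 1
    return np.sqrt(chi2.ppf(1 - alpha, m ** 2) / (m * T))
```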

As a measure of signed serial dependence, we consider the theoretical and the (sample) Cohen's $\kappa$,

$$\kappa(k) = \frac{\sum_{j=0}^{m} \big(p_{jj}(k) - p_j^2\big)}{1 - \sum_{j=0}^{m} p_j^2}, \qquad \hat{\kappa}(k) = \frac{\sum_{j=0}^{m} \big(\hat{p}_{jj}(k) - \hat{p}_j^2\big)}{1 - \sum_{j=0}^{m} \hat{p}_j^2}. \qquad (6.4)$$

The range of $\kappa(k)$ is given by $\big[-\tfrac{\sum_j p_j^2}{1 - \sum_j p_j^2},\; 1\big]$, where 0 corresponds to serial independence, with positive (negative) values indicating positive (negative) serial dependence at lag $k$. For the i.i.d. case, Weiß (2011a) showed that $\hat{\kappa}(k)$ is asymptotically normally distributed, with approximate mean $-1/T$ and a variance of order $1/T$ that is computed from the marginal probabilities. So there is only a small negative bias, which is easily corrected by adding $1/T$ to $\hat{\kappa}(k)$, and the asymptotic result can again be applied to test for significant dependence.
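A corresponding sketch for the sample Cohen's $\kappa$ of (6.4), with the optional bias correction by $1/T$:

```python
import numpy as np

def cohens_kappa(Y, k, bias_correct=False):
    """Sample Cohen's kappa(k) according to (6.4): the summed excess of
    the diagonal pair frequencies p_jj(k) over p_j^2, normalized by
    1 - sum_j p_j^2. Optionally adds 1/T to correct the small negative
    bias arising in the i.i.d. case."""
    T = Y.shape[0]
    p = Y.mean(axis=0)
    pk = Y[k:].T @ Y[:T - k] / (T - k)    # lagged pair frequencies
    kappa = (np.trace(pk) - np.sum(p ** 2)) / (1 - np.sum(p ** 2))
    return kappa + 1 / T if bias_correct else kappa
```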


Figure 6.4 Serial dependence plots of wood pewee data based on (a) Cramér's $v$, (b) Cohen's $\kappa$. See Example 6.3.1.

Table 6.3 Partial Cramér's $v$ and partial Cohen's $\kappa$ for wood pewee data

k                           1        2        3        4        5        6        7
Partial Cramér's $v$        0.626    0.665   −0.207    0.315    0.053   −0.027    0.024
Partial Cohen's $\kappa$   −0.542   −0.157   −0.564    0.431    0.143    0.137   −0.041

For completeness, the serial dependence plots for the EEG sleep state data (Example 6.1.1) are also shown, in Figure 6.5, although these data are even ordinal. This specific application illustrates a possible issue with Cramér's $v$: for short time series (here, we have $T = 107$), it may happen that some states are never observed (the state ‘aw’ in this case; see Table 6.1). To circumvent division by zero when computing (6.3), all summands related to ‘aw’ have been dropped while computing $\hat{v}(k)$. The dependence measure $\hat{\kappa}(k)$, in contrast, is robust with respect to zero frequencies.


Figure 6.5 Serial dependence plots of EEG sleep state data, based on (a) Cramér's $v$, (b) Cohen's $\kappa$; see Example 6.1.1.

Let us conclude this section with the special case of a binary process (that is, $m = 1$, so the state space consists of just two categories). If the range of $X_t$ is coded by 0 and 1, then a quantitative interpretation is possible, since each $X_t$ then simply follows a Bernoulli distribution with mean $E[X_t] = p_1$ and variance $V[X_t] = p_1(1 - p_1)$; see Example A.2.1.
