Chapter 6
Analyzing Categorical Time Series

In Part II of this book, we shall be concerned with categorical processes $c06-math-001$ ; that is, the range $c06-math-002$ of $c06-math-003$ is not only assumed to be discrete, but also to be qualitative, consisting of a finite number $c06-math-004$ of categories, with $c06-math-005$ (state space). The time series $c06-math-006$ stemming from this kind of process are referred to as categorical time series. In some applications, the range of $c06-math-007$ exhibits at least a natural ordering; it is then referred to as an ordinal range. In other cases, not even such an inherent order exists (a nominal range). In Brenčič et al. (2015), for instance, time series about atmospheric circulation patterns are analyzed. Each day is assigned 1 out of 41 categories, called elementary circulation mechanisms (ECMs); although there are some relationships (similarities) between these categories (for example, they can be arranged in four groups), the categories do not exhibit an inherent ordering; that is, we are concerned with a nominal range. In contrast, Chang et al. (1984) consider time series for daily precipitation and distinguish between dry days, days with medium or with strong precipitation, thus leading to an ordinal time series. Another example of nominal “time” series are nucleotide sequences (a range of four DNA bases) and protein sequences (twenty amino acids) (Churchill, 1989; Krogh et al., 1994; Dehnert et al., 2003), although again similarities exist within these types of nominal range (Taylor, 1986). The time series of electroencephalographic (EEG) sleep states (per minute), as analyzed by Stoffer et al. (2000), are also ordinal time series.

Here, unless stated otherwise, we shall consider the more general case of a nominal range. So even if there is some ordering, we do not make use of it but assume that each random variable $c06-math-008$ takes one of a finite number of unordered categories. To simplify notation, we adopt the convention from Appendix B.2 and assume the possible outcomes to be arranged in a certain lexicographical order, $c06-math-009$.

As discussed in the context of Example A.3.3, a categorical random variable $c06-math-010$ can be represented equivalently as a binary random vector $c06-math-011$ , with the range consisting of the unit vectors $c06-math-012$ , by defining $c06-math-013$ if $c06-math-014$ . We shall sometimes switch to this kind of representation, referred to as a binarization, if it allows us to simplify expressions.
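
The binarization described above can be sketched in a few lines of Python. This is only an illustration: the function name `binarize`, the toy state labels and the toy series are hypothetical, and the position of the 1 in each unit vector is assumed to follow the fixed ordering of the state space.

```python
from typing import Hashable, List, Sequence


def binarize(series: Sequence[Hashable], states: Sequence[Hashable]) -> List[List[int]]:
    """Map each observation to the unit vector of its state.

    The position of the single 1 is the index of the observed state
    in `states` (the fixed ordering of the state space).
    """
    index = {s: j for j, s in enumerate(states)}
    rows = []
    for x in series:
        row = [0] * len(states)
        row[index[x]] = 1  # raises KeyError for a value outside the state space
        rows.append(row)
    return rows


# Toy data (hypothetical, not taken from the book)
states = ["a", "b", "c"]
series = ["b", "a", "c", "c"]
vectors = binarize(series, states)
print(vectors)
```

Each row sums to 1, so averaging the rows over time yields the vector of relative frequencies of the states.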

6.1 Introduction to Categorical Time Series Analysis

For (stationary) real-valued time series, a huge toolbox for analysis and modeling is readily available and well known to a broad audience. To highlight a few basic approaches, the time series are visualized by simply plotting the observed values against time, the marginal properties such as location and dispersion may be measured in terms of mean/median and variance/quartile range, respectively, and serial dependence is commonly quantified in terms of autocorrelation; see also Section 2.4.

Things change if the time series is categorical. As an example, since the elementary mathematical operations are not applicable for such a qualitative range, moments like the mean or the autocovariance can no longer be computed. In the ordinal case, at least a few methods can be preserved. For example, a time series plot is still feasible by arranging the possible outcomes in their natural ordering along the Y-axis, and the location can be measured by the median (more generally, quantiles and cdf are defined for ordinal data). But in the purely nominal case (as mainly considered here), not even these basic analytic tools are applicable. Therefore, tailor-made solutions are required for visualizing such time series, or for quantifying location, dispersion and serial dependence.

Example 6.1.1 (Ordinal vs. nominal data)

In Table 1 of Stoffer et al. (2000)¹ we find a categorical time series of length $c06-math-015$ , which expresses the EEG sleep state (per minute) for an infant 24–36 hours after birth. The range is ordinal, with the six possible states (in their natural ordering):

‘qt’ (quiet sleep, trace alternant)
‘qh’ (quiet sleep, high voltage)
‘tr’ (transitional sleep)
‘al’ (active sleep, low voltage)
‘ah’ (active sleep, high voltage)
‘aw’ (awake).

The observed frequencies are as shown in Table 6.1.

So the median equals ‘al’, while the mode (most frequent category) is given by ‘qt’. The time series plot shown in Figure 6.1 has a meaningful interpretation, since the six states are arranged in their natural ordering along the $c06-math-016$ -axis. For a purely nominal range, in contrast, any arrangement of the states along the ordinate would be arbitrary and hence misleading. Different approaches are required for a visual representation.

In the sequel, when calling a categorical process $c06-math-017$ stationary, we refer to the concept of strict stationarity according to Definition B.1.3. While specific models for such stationary categorical processes are discussed in Chapter 7, the particular instance of an i.i.d. categorical process will be of importance here, since it constitutes the benchmark when trying to uncover serial dependence.

Table 6.1 Frequency table of infant EEG sleep states data

State	qt	qh	tr	al	ah	aw
Absolute frequency	33	3	12	27	32	0
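
The mode and median reported in Example 6.1.1 can be checked directly from the frequencies in Table 6.1. The sketch below assumes that the median is taken as the first state (in the natural ordering) whose cumulative relative frequency reaches 1/2; the function names are hypothetical.

```python
from typing import Dict, List


def mode_states(freq: Dict[str, int]) -> List[str]:
    """Return all states with maximal frequency (the mode need not be unique)."""
    top = max(freq.values())
    return [s for s, f in freq.items() if f == top]


def ordinal_median(freq: Dict[str, int]) -> str:
    """Return the first state (dict order = natural order) whose
    cumulative relative frequency reaches 1/2."""
    n = sum(freq.values())
    cum = 0
    for state, f in freq.items():
        cum += f
        if 2 * cum >= n:
            return state
    raise ValueError("empty frequency table")


# Absolute frequencies from Table 6.1, keys listed in their natural order
eeg = {"qt": 33, "qh": 3, "tr": 12, "al": 27, "ah": 32, "aw": 0}
print(mode_states(eeg), ordinal_median(eeg))
```

Note that `ordinal_median` depends on the order of the keys, which is exactly why it has no meaningful counterpart for a purely nominal range.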

Figure 6.1 Time series plot of infant EEG sleep states (per minute); see Example 6.1.1.

Example 6.1.2 (Rate evolution graph)

As already emphasized in Example 6.1.1, the widely-used time series plot cannot be applied to a purely nominal time series in a meaningful way. There are several proposals for a visual analysis of a nominal time series; see the survey by Weiß (2008d). Although none of them seems to be a perfect substitute for the time series plot, the rate evolution graph as suggested by Ribler (1997) is at least an easily implemented visual tool that can be used for stationarity analysis. If $c06-math-018$ denotes the binarization of the available time series, then component-wise graphs of the cumulated sums $c06-math-019$ – that is, the component series $c06-math-020$ for $c06-math-021$ – are plotted in one graph against time $c06-math-022$ . The slope of the graphs is an estimate for the corresponding marginal probability. If the process is stationary, then the graphs should be approximately linear in $c06-math-023$ , while visible violations of linearity indicate non-stationarity.
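
A minimal sketch of the computations behind a rate evolution graph, under stated assumptions: the curves are the cumulated counts per state, and the slope is obtained by an ordinary least-squares fit with intercept against time $t = 1, \ldots, n$ (the exact regression setup used for the figures is not specified here). The function names and the toy series are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence


def rate_evolution(series: Sequence[Hashable],
                   states: Sequence[Hashable]) -> Dict[Hashable, List[int]]:
    """Cumulated counts per state: curves[s][t-1] is the number of
    occurrences of state s among the first t observations."""
    curves = {s: [] for s in states}
    counts = {s: 0 for s in states}
    for x in series:
        counts[x] += 1
        for s in states:
            curves[s].append(counts[s])
    return curves


def ls_slope(y: Sequence[float]) -> float:
    """Least-squares slope of y against t = 1, ..., len(y)."""
    n = len(y)
    t_mean = (n + 1) / 2
    y_mean = sum(y) / n
    num = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y, start=1))
    den = sum((t - t_mean) ** 2 for t in range(1, n + 1))
    return num / den


# Toy series with a repeating pattern (hypothetical, not the wood pewee data)
toy = [1, 3, 1, 2] * 25
curves = rate_evolution(toy, states=[1, 2, 3])
slopes = {s: ls_slope(c) for s, c in curves.items()}
print(slopes)
```

Plotting each list in `curves` against time in a single chart gives the graph itself; for this perfectly periodic toy series the curves are close to straight lines with slopes near 0.5, 0.25 and 0.25.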

Table 6.2 Frequency table of wood pewee data

State	1	2	3
Absolute frequency	691	357	279

For illustration, let us consider a time series referring to the morning twilight song of the wood pewee, a North American song bird famous for its great vocal abilities. The time series data (length $c06-math-024$ ) are printed in Table 12 of Raftery & Tavaré (1994).² The data date back to Craig (1943) – apart from a few deviations, they correspond to Record 9 given there – and were analyzed afterwards by several authors, including Raftery & Tavaré (1994) and Berchtold (2002). The wood pewee song is composed of three different phrases, labeled ‘1’, ‘2’ and ‘3’ (Craig, 1943, p. 21):

The gliding phrases:
‘1’ “pee-ah-wee”
‘2’ “pee-oh”
The rhythmic phrase:
‘3’ “ah-di-dee”.

Figure 6.2 Rate evolution graph of wood pewee data; see Example 6.1.2.

So the range of the time series is of size $c06-math-025$. The observed frequencies are as shown in Table 6.2. Hence, the mode (the most frequent phrase) is given by ‘1’. The rate evolution graph shown in Figure 6.2 indicates stationary behavior (at least with respect to the marginal distribution), since the three graphs appear roughly linear. Their slopes, computed via linear regression, are about 0.509 for state ‘1’, 0.265 for ‘2’ and 0.210 for ‘3’, expressing the overall “rates” (estimated marginal probabilities) of the wood pewee's phrases.

An application leading to a visibly non-linear rate evolution graph is presented by Brenčič et al. (2015), who analyzed a time series about atmospheric circulation patterns. Other tools for visually analyzing a categorical time series, such as the IFS (iterated function systems) circle transformation (Weiß, 2008d), look for the occurrence of patterns; that is, the occurrence of tuples (“strings”) $c06-math-026$ or of sets of such tuples. A comprehensive survey of tools for visualizing time series data in general (not restricted to the categorical case) is provided by Aigner et al. (2011).

Remark 6.1.3 (Frequency-domain analysis)

At this point, it is worth mentioning the so-called spectral envelope developed by Stoffer et al. (1993, 2000); see also Section 7.9 in the textbook by Shumway & Stoffer (2011). The idea is to look at different numerical codings (called scalings) of the categorical process: for $c06-math-027$ , $c06-math-028$ represents the coding of the $c06-math-029$ range $c06-math-030$ by the numbers $c06-math-031$ . As a simple example, $c06-math-032$ implies that $c06-math-033$ is mapped onto a binary process, where ‘1’ occurs if either $c06-math-034$ or $c06-math-035$ is observed, and ‘0’ otherwise.

Depending on the particular coding, certain periodicities might be observed in the time series. For a given frequency $c06-math-036$ , the idea is now to determine the “most striking” $c06-math-037$ (in some sense). For this purpose, Stoffer et al. (1993, 2000) apply a Fourier transform³ and compute the spectral density $c06-math-038$ or a sample version of it for given time series data. If $c06-math-039$ denotes the variance of $c06-math-040$ , then $c06-math-041$ is chosen to maximize $c06-math-042$ . The corresponding maximal value, or

to be more precise, is called the spectral envelope of the process $c06-math-043$ . So $c06-math-044$ expresses the maximal proportion of the variance that can be explained by the frequency $c06-math-045$ , and this maximal proportion is reached if the optimal scaling $c06-math-046$ is used. More details on the computation of $c06-math-047$ and on corresponding sample versions $c06-math-048$ can be found in Stoffer et al. (1993, 2000) and Shumway & Stoffer (2011). If $c06-math-049$ is plotted against $c06-math-050$ , a visual frequency analysis of the categorical time series is possible. For illustration, Figure 6.3 shows the spectral envelope of the wood pewee data from Example 6.1.2. The plot was created by adapting Examples 7.17 and 7.18 in Shumway & Stoffer (2011). It can be seen that frequencies around 1/4 and 1/2 are dominant, an observation that will also be plausible in view of our analyses in Example 6.3.1 below.

Figure 6.3 Spectral envelope of wood pewee data; see Remark 6.1.3.

6.2 Marginal Properties of Categorical Time Series

Let $c06-math-051$ be a stationary categorical process with marginal distribution $c06-math-052$ . Given the segment $c06-math-053$ from this process, we estimate $c06-math-054$ by the vector $c06-math-055$ of relative frequencies computed from $c06-math-056$ , which is also expressed as $c06-math-057$ by using the above binarization of the process. Especially if $c06-math-058$ is large, the complete (estimated) marginal distribution might be difficult to interpret. So, as with real-valued data, it is necessary in practice to reduce the full information about the marginal distribution into a few metrics that concentrate on features such as location and dispersion.

Measuring the location of a categorical random variable $c06-math-059$ (or estimating it from $c06-math-060$ ) is rather straightforward; see also Examples 6.1.1 and 6.1.2. In any case, it is possible to compute “the” (sample) mode, although such a mode is sometimes not uniquely determined. If $c06-math-061$ is even ordinal, then the median (or any other quantile) can be used to express the “center” of $c06-math-062$ or $c06-math-063$ , respectively.

The notion of categorical dispersion is less obvious at first. Even in the ordinal case, a quantile-based dispersion measure such as the interquartile range (IQR) is not applicable, since a difference between categories is not defined (one might use the number of categories between the quartiles as a substitute). Therefore, let us first think about the intuitive meaning of dispersion. For a real-valued random variable $c06-math-064$ , measures such as variance or IQR ultimately aim at expressing uncertainty. The smaller the dispersion of $c06-math-065$ , the better we can predict the outcome of $c06-math-066$ . Adapting this intuitive understanding of dispersion to the categorical case, we have maximal dispersion if all probabilities $c06-math-067$ are equal to each other, because then every outcome is equally probable and a reasonable prediction is impossible. So a uniform distribution in $c06-math-068$ constitutes one extreme of categorical dispersion. At the other extreme, if $c06-math-069$ for one $c06-math-070$ and 0 otherwise (one-point distribution, so $c06-math-071$ equals one of the unit vectors $c06-math-072$ ), then we are able to perfectly predict the outcome of $c06-math-073$ , so $c06-math-074$ has minimal dispersion in this sense.

Now that the extremes of categorical dispersion are known, we can think of dispersion measures $c06-math-075$ that map these extremes to the extremes of their range. In fact, several measures for this purpose are readily available in the literature; see the survey in Appendix A of Weiß & Göb (2008), for instance. Furthermore, any concentration index can be used as a measure of dispersion.

For the sake of simplicity, we consider measures $c06-math-076$ with range $c06-math-077$ , where 0 refers to minimal dispersion, and 1 to maximal dispersion. Two popular (and, in the author's opinion, quite useful) measures of categorical dispersion are the Gini index and entropy. We define the (sample) Gini index as

6.1

respectively. The theoretical Gini index $c06-math-079$ has range $c06-math-080$ , where increasing values indicate increasing dispersion, with the extremes $c06-math-081$ iff $c06-math-082$ has a one-point distribution, and $c06-math-083$ iff $c06-math-084$ has a uniform distribution. The sample Gini index $c06-math-085$ is asymptotically normally distributed in the i.i.d. case, and the variance is approximated by $c06-math-086$ . Furthermore, although it is a biased estimator of $c06-math-087$ , its bias is easily corrected in the i.i.d. case by considering $c06-math-088$ instead (Weiß, 2011a).
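
For illustration, a normalized Gini index with the properties just described can be sketched as follows. The form used here is an assumption, namely the common normalization $\frac{K}{K-1}\bigl(1-\sum_j \hat{p}_j^2\bigr)$ for a state space with $K$ categories and relative frequencies $\hat{p}_j$ (symbols chosen for this sketch only). The bias correction by the factor $n/(n-1)$ rests on the fact that, for i.i.d. data, the expectation of $1-\sum_j \hat{p}_j^2$ equals $(1-1/n)\bigl(1-\sum_j p_j^2\bigr)$.

```python
from typing import Sequence


def gini_index(freq: Sequence[int]) -> float:
    """Normalized sample Gini index from absolute frequencies.

    Assumed form: K/(K-1) * (1 - sum of squared relative frequencies),
    with K = number of categories in the state space (K >= 2).
    """
    n = sum(freq)
    k = len(freq)
    sum_sq = sum((f / n) ** 2 for f in freq)
    return k / (k - 1) * (1.0 - sum_sq)


def gini_index_unbiased(freq: Sequence[int]) -> float:
    """Bias-corrected version; only justified for i.i.d. data (n >= 2)."""
    n = sum(freq)
    return n / (n - 1) * gini_index(freq)


pewee = [691, 357, 279]  # absolute frequencies from Table 6.2
print(round(gini_index(pewee), 3), round(gini_index_unbiased(pewee), 3))
```

A one-point distribution such as `[7, 0, 0]` gives the value 0 and equal frequencies give the value 1, matching the two extremes of categorical dispersion.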

As an alternative, we define the (sample) entropy as

6.2

respectively, where we always use the convention $c06-math-090$ . $c06-math-091$ has the same properties as mentioned for the theoretical Gini index. In the i.i.d. case, $c06-math-092$ is also asymptotically normally distributed, now with approximate variance $c06-math-093$ , but there is no simple way to exactly correct the bias of $c06-math-094$ (Weiß, 2013b).
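
A corresponding sketch for a normalized entropy, again under an assumed form: $-\frac{1}{\ln K}\sum_j \hat{p}_j \ln \hat{p}_j$ for $K$ categories, with the convention $0 \cdot \ln 0 = 0$ implemented by skipping unobserved states. The function name is hypothetical.

```python
import math
from typing import Sequence


def entropy(freq: Sequence[int]) -> float:
    """Normalized sample entropy from absolute frequencies.

    Assumed form: -(1 / ln K) * sum p_j ln p_j, with 0 * ln 0 := 0
    and K = number of categories in the state space (K >= 2).
    """
    n = sum(freq)
    k = len(freq)
    total = 0.0
    for f in freq:
        if f > 0:  # convention 0 * ln 0 = 0
            p = f / n
            total += p * math.log(p)
    return -total / math.log(k)


eeg = [33, 3, 12, 27, 32, 0]  # Table 6.1; the state 'aw' is never observed
pewee = [691, 357, 279]       # Table 6.2
print(round(entropy(eeg), 3), round(entropy(pewee), 3))
```

The EEG data show why the convention matters: without it, the unobserved state ‘aw’ would make the computation fail.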

If we would like to do a bias correction or compute confidence intervals for the dispersion measures in Example 6.2.2, we would first need to further investigate the serial dependence structure of the available time series, say to establish a possible i.i.d.-behavior such that the above asymptotics could be used. Corresponding tools for measuring serial dependence are presented in the next section.

6.3 Serial Dependence of Categorical Time Series

For the count time series considered in Part I, we simply used the well-known autocorrelation function to analyze the serial dependence structure; see Section 2.4. But this function is not defined in the categorical case (neither nominal nor ordinal), so different approaches are required. Before presenting particular measures, let us again start with some more general thoughts. As for the autocorrelation function, we shall look at pairs $c06-math-097$ with $c06-math-098$ from the underlying stationary categorical process. If, after having observed $c06-math-099$ , it is possible to perfectly predict $c06-math-100$ , then it would be plausible to refer to $c06-math-101$ and $c06-math-102$ as perfectly dependent. If, in contrast, knowledge about $c06-math-103$ would not help in these respects, then $c06-math-104$ and $c06-math-105$ would seem to be independent.

To translate this intuition into formulae, let us introduce the notation $c06-math-106$ with $c06-math-107$ for the lagged bivariate probabilities, with the sample counterpart $c06-math-108$ being the relative frequency of $c06-math-109$ within the pairs $c06-math-110$ . Using the binarization, we can express the latter as $c06-math-111$ ; that is, $c06-math-112$ . The corresponding conditional bivariate probabilities are denoted as $c06-math-113$ for $c06-math-114$ ; see also Appendix B.2. To avoid computational difficulties, we assume that all marginal probabilities are truly positive ( $c06-math-115$ for all $c06-math-116$ ); otherwise, we would first have to reduce the state space.
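
The sample lagged bivariate probabilities can be computed by simple counting. The sketch below assumes that the first index refers to the current observation $x_t$ and the second to the lagged observation $x_{t-k}$, and that the counts are divided by the number $n-k$ of available pairs; the function name and the toy series are hypothetical.

```python
from typing import Hashable, List, Sequence


def lagged_joint_freq(series: Sequence[Hashable], k: int,
                      states: Sequence[Hashable]) -> List[List[float]]:
    """Relative frequencies of the pairs (x_t, x_{t-k}), t = k+1, ..., n.

    Entry [i][j] is the share of pairs with x_t = states[i]
    and x_{t-k} = states[j].
    """
    n = len(series)
    if not 1 <= k < n:
        raise ValueError("lag k must satisfy 1 <= k < len(series)")
    index = {s: j for j, s in enumerate(states)}
    m = len(states)
    table = [[0.0] * m for _ in range(m)]
    for t in range(k, n):
        table[index[series[t]]][index[series[t - k]]] += 1.0
    return [[c / (n - k) for c in row] for row in table]


# Toy series (hypothetical): strictly alternating states
P = lagged_joint_freq(list("ababa"), k=1, states=["a", "b"])
print(P)
```

Dividing column `j` of this table by the relative frequency of `states[j]` gives the estimated conditional distribution of $x_t$ given $x_{t-k}$.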

Following Weiß & Göb (2008), we now say that:

we have perfect (unsigned) serial dependence at lag $c06-math-117$ iff for any $c06-math-118$ , the conditional distribution $c06-math-119$ is a one-point distribution
we have perfect serial independence at lag $c06-math-120$ iff $c06-math-121$ for any $c06-math-122$ (or, equivalently, if $c06-math-123$ ).

The term “unsigned” was used above for the following reason: the autocorrelation function may take positive or negative values, hence being a signed measure, and positive autocorrelation implies, amongst other things, that large values tend to be followed by large values (and vice versa). This motivates us to introduce an analogous concept of signed categorical dependence, where positive dependence implies that the process tends to stay in the state it has reached (and vice versa). So again following Weiß & Göb (2008), and given that we have already established perfect serial dependence at lag $c06-math-124$ (in the unsigned sense above), we now say that

we even have perfect positive serial dependence iff all $c06-math-125$ , or
we even have perfect negative serial dependence iff all $c06-math-126$ .

The latter implies that $c06-math-127$ necessarily has to take a state other than $c06-math-128$ . A number of measures of unsigned serial dependence have been proposed in the literature so far (Dehnert et al., 2003; Weiß & Göb, 2008; Biswas & Song, 2009; Weiß, 2013b). We shall consider one such measure here, namely Cramer's $c06-math-129$ , where the selection is motivated by the attractive properties of the theoretical $c06-math-130$ as well as of the sample version $c06-math-131$ of this measure. It is defined by

6.3

$c06-math-133$ has the range $c06-math-134$ , where the boundaries 0 and 1 are reached iff we have perfect serial independence/dependence at lag $c06-math-135$ . The distribution of its sample counterpart $c06-math-136$ , in the case of an underlying i.i.d. process, is asymptotically approximated by a $c06-math-137$ -distribution (Weiß, 2013b): $c06-math-138$ .

This relationship is quite useful in practice, since it allows us to uncover significant serial dependence. If the null of serial independence at lag $c06-math-139$ is to be tested on (approximate) level $c06-math-140$ , and if $c06-math-141$ denotes the $c06-math-142$ -quantile of the $c06-math-143$ -distribution, then we will reject the null if $c06-math-144$ . This critical value can also be plotted into a graph of $c06-math-145$ against $c06-math-146$ , as a substitute for the ACF plot familiar from real-valued time series analysis; see also Remark 2.3.1. On the other hand, this asymptotic result also shows that $c06-math-147$ is generally a biased estimator of $c06-math-148$ .
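
A sketch of the sample Cramer's measure at lag $k$, assuming the usual form $\sqrt{\frac{1}{K-1}\sum_{i,j}\frac{(\hat{p}_{ij}(k)-\hat{p}_i\hat{p}_j)^2}{\hat{p}_i\hat{p}_j}}$ for $K$ categories (the symbols are chosen for this sketch only). States that never occur are skipped to avoid division by zero; the function name and the toy series are hypothetical.

```python
import math
from typing import Hashable, Sequence


def cramers_v(series: Sequence[Hashable], k: int,
              states: Sequence[Hashable]) -> float:
    """Sample Cramer's v at lag k (assumed form, see the text above).

    Summands involving a state with zero frequency are dropped.
    """
    n = len(series)
    m = len(states)
    marg = {s: 0.0 for s in states}
    for x in series:
        marg[x] += 1.0 / n
    joint = {}
    for t in range(k, n):
        pair = (series[t], series[t - k])
        joint[pair] = joint.get(pair, 0.0) + 1.0 / (n - k)
    total = 0.0
    for si in states:
        for sj in states:
            prod = marg[si] * marg[sj]
            if prod > 0.0:
                total += (joint.get((si, sj), 0.0) - prod) ** 2 / prod
    return math.sqrt(total / (m - 1))


# Toy series (hypothetical): strict alternation, i.e. perfect dependence
x = list("ab" * 50)
print(round(cramers_v(x, 1, ["a", "b"]), 3))
```

Because the marginal frequencies are estimated from all $n$ observations but the pairs only from $n-k$ of them, the sample value can deviate slightly from the theoretical boundary 1 even under perfect dependence.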

As a measure of signed serial dependence, we consider the (sample) Cohen's $c06-math-149$

6.4

The range of $c06-math-151$ is given by $c06-math-152$ , where 0 corresponds to serial independence, with positive (negative) values indicating positive (negative) serial dependence at lag $c06-math-153$ . For the i.i.d. case, Weiß (2011a) showed that $c06-math-154$ is asymptotically normally distributed, with approximate mean $c06-math-155$ and variance $c06-math-156$ . So there is only a small negative bias, which is easily corrected by adding $c06-math-157$ to $c06-math-158$ , and the asymptotic result can again be applied to test for significant dependence.
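
A sketch of the sample Cohen's measure at lag $k$, assuming the form $\bigl(\sum_j \hat{p}_{jj}(k)-\sum_j \hat{p}_j^2\bigr)/\bigl(1-\sum_j \hat{p}_j^2\bigr)$, where $\sum_j \hat{p}_{jj}(k)$ is the share of pairs $(x_t, x_{t-k})$ in which the state is repeated (symbols chosen for this sketch only). The function name and the toy series are hypothetical.

```python
from typing import Hashable, Sequence


def cohens_kappa(series: Sequence[Hashable], k: int,
                 states: Sequence[Hashable]) -> float:
    """Sample Cohen's kappa at lag k (assumed form, see the text above).

    Undefined (division by zero) if only a single state is ever observed.
    """
    n = len(series)
    marg = {s: 0 for s in states}
    for x in series:
        marg[x] += 1
    sum_sq = sum((c / n) ** 2 for c in marg.values())
    # share of pairs (x_t, x_{t-k}) in which the state is repeated
    agree = sum(1 for t in range(k, n) if series[t] == series[t - k]) / (n - k)
    return (agree - sum_sq) / (1.0 - sum_sq)


# Toy series (hypothetical): strict alternation between two states
x = list("ab" * 50)
print(cohens_kappa(x, 1, ["a", "b"]), cohens_kappa(x, 2, ["a", "b"]))
```

For the alternating toy series, the process never stays in its state at lag 1 (perfect negative dependence) and always returns to it at lag 2 (perfect positive dependence), which the two values reflect through their signs.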

Example 6.3.1 (Serial dependence plots)

Let us have a look again at the wood pewee time series from Example 6.1.2. On level $c06-math-159$ , we want to test for serial dependence in the data. Instead of evaluating the results numerically, we will draw a “serial dependence plot” analogous to the common plots of the sample ACF. With $c06-math-160$ and $c06-math-161$ , it follows that the critical value for $c06-math-162$ equals about 0.060. For $c06-math-163$ , we also require information about the marginal distribution to compute the asymptotic distribution. So we plug $c06-math-164$ instead of $c06-math-165$ into the above asymptotic formula and obtain the two-sided critical values $c06-math-166$ and 0.038. The resulting plots are shown in Figure 6.4. Both indicate strong significant serial dependencies. As a consequence, returning to the discussion at the end of Section 6.2, confidence intervals for $c06-math-167$ or $c06-math-168$ could not be constructed with the asymptotics given there, since these rely on an i.i.d. assumption. Besides the serial dependencies being very strong, regular patterns can also be observed. $c06-math-169$ shows larger values at even lags $c06-math-170$ than at the adjacent odd lags. An even more complex pattern is observed for $c06-math-171$ , with negative values at odd lags, positive values at even lags, and with much larger values at lags of the form $c06-math-172$ than $c06-math-173$ . These periodicities appear plausible in view of our earlier discussion in Remark 6.1.3. They also match the analyses of Raftery & Tavaré (1994) and Berchtold (2002), who emphasize the repeated occurrence of certain patterns, especially the pattern “1312”.
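
The critical values quoted in this example can be approximately reproduced from Table 6.2 alone. The sketch below rests on several assumptions: a 5% level; for Cramer's measure, that $n\,(K-1)\,\hat{v}^2$ is approximately $\chi^2$-distributed with $(K-1)^2$ degrees of freedom for $K$ categories; and for Cohen's measure, an approximate mean $-1/n$ and variance $\frac{1}{n}\Bigl(1-\frac{1+2\sum_j p_j^3-3\sum_j p_j^2}{(1-\sum_j p_j^2)^2}\Bigr)$. The $\chi^2$ quantile is entered as a tabulated constant; all function names are hypothetical.

```python
import math
from statistics import NormalDist
from typing import Sequence, Tuple


def v_critical(n: int, k_states: int, chi2_quantile: float) -> float:
    """Critical value for Cramer's v under the i.i.d. null (assumed asymptotics)."""
    return math.sqrt(chi2_quantile / (n * (k_states - 1)))


def kappa_critical(freq: Sequence[int], alpha: float) -> Tuple[float, float]:
    """Two-sided critical values for Cohen's kappa under the i.i.d. null,
    with estimated marginal probabilities plugged in (assumed asymptotics)."""
    n = sum(freq)
    p = [f / n for f in freq]
    s2 = sum(q ** 2 for q in p)
    s3 = sum(q ** 3 for q in p)
    var = (1.0 - (1.0 + 2.0 * s3 - 3.0 * s2) / (1.0 - s2) ** 2) / n
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half = z * math.sqrt(var)
    return (-1.0 / n - half, -1.0 / n + half)


pewee = [691, 357, 279]  # absolute frequencies from Table 6.2
CHI2_4_095 = 9.488       # tabulated 95% quantile of chi-square with 4 df
v_crit = v_critical(sum(pewee), len(pewee), CHI2_4_095)
k_lo, k_hi = kappa_critical(pewee, alpha=0.05)
print(round(v_crit, 3), round(k_lo, 3), round(k_hi, 3))
```

Under these assumptions, the critical value for Cramer's measure and the upper critical value for Cohen's measure agree with the figures 0.060 and 0.038 given above to three decimals.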

At this point, it is also illuminating to look at the “partial versions” of $c06-math-174$ and $c06-math-175$ ; that is, at a partial Cramer's $c06-math-176$ and a partial Cohen's $c06-math-177$ , defined by exactly the same relation as in Theorem B.3.4, but replacing $c06-math-178$ by $c06-math-179$ or $c06-math-180$ ; see also the discussion in Section 7.2. For both partial measures, a rather abrupt decline after lag 4 can be observed (see Table 6.3).

This indicates some kind of fourth-order autoregressive dependence, which is in line with a result in Berchtold (2002), where a fourth-order Markov model performed quite well. Many details about Markov models are provided in Section 7.1.

Figure 6.4 Serial dependence plots of wood pewee data based on (a) Cramer's $c06-math-181$ , (b) Cohen's $c06-math-182$ . See Example 6.3.1.

Table 6.3 Partial Cramer's $c06-math-183$ and partial Cohen's $c06-math-184$ for wood pewee data

k	1	2	3	4	5	6	7
$c06-math-185$	0.626	0.665	−0.207	0.315	0.053	−0.027	0.024
$c06-math-186$	−0.542	−0.157	−0.564	0.431	0.143	0.137	−0.041

For completeness, the serial dependence plots for the EEG sleep state data (Example 6.1.1) are also shown, in Figure 6.5, although these data are even ordinal. This specific application illustrates a possible issue with Cramer's $c06-math-187$ : for short time series (here, we have $c06-math-188$ ), it may be that some states are not observed (the state ‘aw’ in this case). To circumvent division by zero when computing (6.3), all summands related to ‘aw’ have been dropped while computing $c06-math-189$ . The dependence measure $c06-math-190$ , in contrast, is robust with respect to zero frequencies.

Let us conclude this section with the special case of a binary process (that is, $c06-math-193$ ). If the range of $c06-math-194$ is coded by 0 and 1, then a quantitative interpretation is possible, since each $c06-math-195$ then simply follows a Bernoulli distribution with $c06-math-196$ and $c06-math-197$ ; see Example A.2.1.