Chapter 7
Models for Categorical Time Series

As in Part I, Markov models are very attractive for categorical processes, because of the ease of interpreting the model, making likelihood inferences (see Remark B.2.1.2 in the appendix) and making forecasts. However, without further restrictions concerning the conditional distributions, the number of model parameters becomes quite large. Therefore, Section 7.1 presents approaches for defining parsimoniously parametrized Markov models. One of these approaches is linked to a family of discrete ARMA models, which exhibit an ARMA-like serial dependence structure and also allow for non-Markovian forms of dependence; see Section 7.2. Note that the count data version of this model was discussed in Section 5.3. Two other approaches from Chapter 5, namely hidden-Markov models and regression models, can also be adapted to the categorical case, as described in Sections 7.3 and 7.4, respectively.

7.1 Parsimoniously Parametrized Markov Models

Perhaps the most obvious approach to model a categorical process $c07-math-001$ with state space $c07-math-002$ is to use a (homogeneous) pth-order Markov model; see (B.1) for the definition:

The idea of having a limited memory is often plausible in practice, and the Markov assumption is also advantageous, say for parameter estimation (see Remark B.2.1.2) or for forecasting. Concerning the latter application, it is important to note that a pth-order Markov process can always be transformed into a first-order one (for $c07-math-003$ , we speak of a Markov chain; see Appendix B.2) by considering the vector-valued process $c07-math-004$ with $c07-math-005$ . And for a Markov chain with transition matrix $c07-math-006$ , in turn, $c07-math-007$ -step-ahead transition probabilities are obtained as the entries of the matrix $c07-math-008$ . So, to summarize, $c07-math-009$ -step-ahead conditional distributions of the form $c07-math-010$ are calculated with relatively little effort for a pth-order Markov process.
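The matrix-power computation of multi-step transition probabilities can be sketched as follows. This is a minimal illustration, not the book's code: the transition matrix `P` is hypothetical, and it is assumed to be stored column-wise, that is, `P[i][j]` is the probability of moving to state `i` given the previous state `j`, so that each column sums to 1.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(p, h):
    """Return the h-th power of the square matrix p (h >= 0)."""
    n = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(h):
        result = mat_mul(result, p)
    return result


def forecast_distribution(p, current_state, h):
    """Column `current_state` of p**h: distribution of X_{T+h} given X_T."""
    ph = mat_power(p, h)
    return [ph[i][current_state] for i in range(len(p))]


# Hypothetical three-state chain (assumption, not taken from Example 7.1.1).
P = [[0.8, 0.1, 0.2],
     [0.1, 0.7, 0.2],
     [0.1, 0.2, 0.6]]

dist = forecast_distribution(P, current_state=0, h=3)
print(dist)
```

For a pth-order process, the same routine would be applied to the transition matrix of the vector-valued first-order chain.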

Then, point forecasts are computed as a mode of this conditional distribution, or, for an ordinal process, as the median. In the ordinal case, it is obvious how to define a prediction region on level $c07-math-011$ : for a two-sided region, the limits are defined based on the $c07-math-012$ - and the $c07-math-013$ -quantile of the conditional distribution, while one uses the $c07-math-014$ - or $c07-math-015$ -quantile, respectively, in case of a one-sided region (“best/worst-case scenario”). If $c07-math-016$ is very small, however, it may be that one ends up with a trivial region (say, the full range $c07-math-017$ ). The same problem may occur in the nominal case, but here, in addition, a reasonable definition of a prediction region is more difficult because of the lack of a natural ordering. Commonly, one defines a prediction set for $c07-math-018$ to consist of the most frequent states in $c07-math-019$ (most frequent with respect to the $c07-math-020$ -step-ahead conditional distribution) such that the required level $c07-math-021$ is ensured.
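The construction of a mode forecast and of a prediction set in the nominal case can be sketched as follows. The forecast distribution `dist` is hypothetical, and the function names are illustrative only.

```python
def mode_forecast(dist):
    """Index of a most probable state (the first one in case of ties)."""
    return max(range(len(dist)), key=lambda i: dist[i])


def prediction_set(dist, level):
    """Collect the most probable states until their total probability
    reaches the required level; returns (sorted states, covered probability)."""
    order = sorted(range(len(dist)), key=lambda i: dist[i], reverse=True)
    chosen, total = [], 0.0
    for i in order:
        chosen.append(i)
        total += dist[i]
        if total >= level - 1e-12:  # small tolerance for rounding error
            break
    return sorted(chosen), total


# Hypothetical h-step-ahead conditional distribution over four states.
dist = [0.55, 0.30, 0.10, 0.05]

print(mode_forecast(dist))
print(prediction_set(dist, 0.80))
print(prediction_set(dist, 0.99))  # level so high that the set is the full range
```

The last call illustrates the trivial region mentioned above: with a very demanding level, every state ends up in the prediction set.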

Example 7.1.1 (Three-state Markov chain)

Let us consider two examples of a three-state Markov chain ( $c07-math-022$ , $c07-math-023$ ), which are defined by their transition matrices

Solving the invariance equation (B.4), we obtain the corresponding stationary marginal distributions as

where the latter does not sum exactly to 1 because of rounding. $c07-math-024$ is closer to a uniform distribution than $c07-math-025$ ; this manifests itself in the Gini dispersions $c07-math-026$ and $c07-math-027$ , respectively.

The transition matrix $c07-math-030$ concentrates its probability mass along the diagonal; that is, each state tends to be followed by itself (positive dependence). This is illustrated by the directed graph in Figure 7.1a, where the thickness of the edges represents the size of the respective transition probabilities, and where the positive dependence causes the loops to be dominant. Different behavior is observed for the Markov chain $c07-math-031$ , where the diagonal probabilities are close to zero (negative dependence) and hence the loops in Figure 7.1b are rather thin. The most probable rules are “ $c07-math-032$ ”, “ $c07-math-033$ ” and “ $c07-math-034$ ”, as can be seen from the dominant edges in Figure 7.1b.

Figure 7.1 Visual representation of two three-state Markov chains (a) $c07-math-028$ and (b) $c07-math-029$ ; see Example 7.1.1.

The respective extent of serial dependence is illustrated by the serial dependence plots shown in Figure 7.2, where only Cohen's $c07-math-037$ is able to distinguish between the positive and negative dependencies. The positive values of $c07-math-038$ for $c07-math-039$ in part (d) are reasonable, since the above rules imply a probable return to the starting state after three time units.

The computation of the serial dependence measures (6.3), (6.4) for a Markov chain with transition matrix $c07-math-040$ and stationary marginal distribution $c07-math-041$ , as in Example 7.1.1, is done by first computing the matrices $c07-math-042$ of bivariate joint probabilities via $c07-math-043$ .
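A minimal sketch of this computation is given below. It assumes a column-wise transition matrix (`P[i][j]` is the probability of state `i` given previous state `j`), so that the lag-k joint probabilities are the entries of the k-th matrix power times the stationary probability of the earlier state. The formulas used for Cohen's kappa and Cramer's v are the commonly used definitions and are assumed here to agree with (6.4) and (6.3); both example matrices are hypothetical.

```python
import math


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def bivariate_probs(p, pi, k):
    """Entries P(X_t = i, X_{t-k} = j) = (P^k)[i][j] * pi[j]."""
    n = len(p)
    pk = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        pk = mat_mul(pk, p)
    return [[pk[i][j] * pi[j] for j in range(n)] for i in range(n)]


def cohens_kappa(joint, pi):
    """Signed measure: sum_j (p_jj(k) - pi_j^2) / (1 - sum_j pi_j^2)."""
    num = sum(joint[j][j] - pi[j] ** 2 for j in range(len(pi)))
    return num / (1.0 - sum(q * q for q in pi))


def cramers_v(joint, pi):
    """Unsigned measure based on the Pearson statistic, scaled by m = #states - 1."""
    n = len(pi)
    total = sum((joint[i][j] - pi[i] * pi[j]) ** 2 / (pi[i] * pi[j])
                for i in range(n) for j in range(n))
    return math.sqrt(total / (n - 1))


uniform = [1 / 3, 1 / 3, 1 / 3]
# Hypothetical chains; both are doubly stochastic, so the uniform distribution is stationary.
P_pos = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]  # mass on the diagonal
P_neg = [[0.1, 0.1, 0.8], [0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]  # cyclic moves

for name, P in (("positive", P_pos), ("negative", P_neg)):
    joint = bivariate_probs(P, uniform, 1)
    print(name, cohens_kappa(joint, uniform), cramers_v(joint, uniform))
```

In this sketch, only kappa changes sign between the two chains, whereas v is non-negative by construction.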

While a full Markov model might be a feasible approach if p is very small, higher orders p will cause a practical problem: a general pth-order Markov process (that is, where the conditional probabilities are not further restricted by parametric assumptions) has a huge number of model parameters, $c07-math-044$ , which increases exponentially in p. This problem becomes visible in Figure 7.3, where the possible paths in the past for increasing p are illustrated. Each path $c07-math-045$ (out of $c07-math-046$ such paths) requires $c07-math-047$ parameters to be specified to obtain the conditional distribution $c07-math-048$ , thus leading to altogether $c07-math-049$ parameters. For the EEG sleep state data from Example 6.1.1, for instance, we have $c07-math-050$ states such that a first-order Markov model would already require 30 parameters to be estimated (from only $c07-math-051$ observations available).
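The exponential growth can be made concrete with a short calculation. The sketch below counts one free probability fewer than the number of states for each possible past. Using six states for the sleep-state example is an inference from the stated count of 30 parameters, not a value quoted from the text.

```python
def n_params_full_markov(n_states, p):
    """Each of the n_states**p possible pasts needs n_states - 1 free probabilities."""
    return (n_states - 1) * n_states ** p


for order in range(1, 5):
    print(order, n_params_full_markov(6, order))
```

With four states, as for DNA sequences, the same count gives 12 parameters for a first-order model and 48 for a second-order one.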

To make Markov models more useful in practice, the number of parameters has to be reduced. One suggestion in this direction is due to Bühlmann & Wyner (1999): the variable-length Markov model (VLMM). Here, the “embedding Markov model” possibly has a rather large order p, but then several branches of the tree in Figure 7.3 are “cut” by assuming identical conditional distributions; that is, by assuming shortened tuples $c07-math-052$ with $c07-math-053$ such that

7.1

for all $c07-math-055$ and all $c07-math-056$. Hence, if the last $c07-math-057$ observations have been equal to $c07-math-058$, the remaining past is negligible. So the order $c07-math-059$ of the embedding Markov model determines the maximal length of the required memory, but depending on the observed past, a shorter memory may also suffice, which explains the name “variable-length” Markov model. While this kind of parameter reduction is reasonable at first glance, the number of possible VLMMs is very large, so the model choice is non-trivial. Algorithms for constructing VLMMs have been proposed by Ron et al. (1996).
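The idea of a variable-length memory can be illustrated with a small lookup routine. This is only a sketch under assumptions: the contexts and probabilities are hypothetical, the past is listed most recent observation first, and the stored contexts are assumed to form a proper tree, so that exactly one of them matches any observed past. It is not one of the construction algorithms cited above.

```python
def find_context(contexts, past, max_order):
    """Return the stored context that matches the most recent observations.

    `past` lists observations most recent first: (x_{t-1}, x_{t-2}, ...).
    """
    for length in range(1, max_order + 1):
        candidate = tuple(past[:length])
        if candidate in contexts:
            return candidate
    raise KeyError("no context matches the observed past")


# Hypothetical binary VLMM with embedding order 2: after a 0, the earlier
# past is irrelevant; after a 1, the observation before it still matters.
contexts = {
    (0,): [0.7, 0.3],
    (1, 0): [0.4, 0.6],
    (1, 1): [0.1, 0.9],
}

print(find_context(contexts, [0, 1], max_order=2))
print(contexts[find_context(contexts, [1, 0], max_order=2)])
```

Here three conditional distributions replace the four required by the full second-order binary model.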

Another proposal to reduce the number of parameters of the full Markov model aims at introducing parametric relations between the conditional probabilities: the pth-order mixture transition distribution (MTD(p)) model of Raftery (1985). The idea is to start with a (full) Markov chain with transition matrix $c07-math-060$ (Appendix B.2), and to define the pth-order conditional probabilities as a mixture of these transition probabilities:

7.2

Here, $c07-math-062$ is required. A common further restriction is to assume all $c07-math-063$ , but it would even be possible to allow some $c07-math-064$ to be negative (Raftery, 1985).
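The mixture construction can be sketched as follows, assuming that the conditional distribution is the weighted sum of the columns of the transition matrix selected by the past observations, with weights that sum to 1. The matrix `P` and the weights `lam` are hypothetical; `P[i][j]` is the probability of state `i` given state `j`.

```python
def mtd_conditional(p, lam, past):
    """Conditional distribution of X_t in an MTD(p) model.

    lam[j-1] is the weight of lag j and past[j-1] is the value of X_{t-j}.
    """
    n = len(p)
    return [sum(w * p[i][x] for w, x in zip(lam, past)) for i in range(n)]


# Hypothetical three-state transition matrix (columns sum to 1).
P = [[0.8, 0.1, 0.2],
     [0.1, 0.7, 0.2],
     [0.1, 0.2, 0.6]]
lam = [0.3, 0.7]  # hypothetical weights: lag 2 gets more weight than lag 1

dist = mtd_conditional(P, lam, past=[0, 2])  # X_{t-1} = 0, X_{t-2} = 2
print(dist)
```

Because each column of `P` is a probability vector and the weights are non-negative and sum to 1, the result is again a probability vector.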

While the general pth-order Markov model has $c07-math-065$ parameters, this number is now reduced to $c07-math-066$ ; that is, we have the number of parameters for a Markov chain (the ones from $c07-math-067$ ) plus only one additional parameter for each increment of p. Since the conditional probabilities are just some kind of weighted mean of transition probabilities from $c07-math-068$ , Raftery (1985) showed that the stationary marginal distribution $c07-math-069$ for positive $c07-math-070$ is simply obtained as the solution of

7.3

see also (B.4). Furthermore, Raftery (1985) showed that the bivariate probabilities $c07-math-072$ satisfy a set of Yule–Walker-type equations. Denoting $c07-math-073$ for $c07-math-074$ , and $c07-math-075$ , then

7.4

For $c07-math-077$ , for instance, we immediately obtain that $c07-math-078$ , and the remaining $c07-math-079$ s are computed via (7.4). In this way, (7.4) can be used to compute serial dependence measures such as Cramer's $c07-math-080$ from (6.3) or Cohen's $c07-math-081$ from (6.4), although simple closed-form results will usually not be available due to the general form of $c07-math-082$ . Note that because of (7.3), Equation 7.4 can also be rewritten in a centered version as

Approaches for parameter estimation are discussed by Raftery & Tavaré (1994). For a survey of results for MTD models, see Berchtold & Raftery (2001).

Figure 7.3 Tree structure of Markov model: past observations influencing current outcome.

Example 7.1.2 (Three-state MTD(2) processes)

Let us pick up Example 7.1.1. We define two three-state MTD(2) models by setting $c07-math-085$ and $c07-math-086$ , so the stationary marginal distributions remain $c07-math-087$ as before; see (7.3). We choose $c07-math-088$ and $c07-math-089$ such that lag 2 gets a higher weight than lag 1. Comparing the resulting serial dependence plots in Figure 7.4 with those of the corresponding Markov chains (the MTD(1) model) in Figure 7.2, we see increased values for even lags $c07-math-090$ but reduced ones for odd lags, as to be expected from the chosen weights. Furthermore, especially for model 1, we generally observe increased values for higher lags; that is, the second-order model indeed leads to a longer memory.

Figure 7.2 Serial dependence plots of two three-state Markov chains: (a, b) $c07-math-035$ and (c, d) $c07-math-036$ ; see Example 7.1.1.

Figure 7.4 Serial dependence plots of three-state MTD(2) models: (a, b) $c07-math-083$ and (c, d) $c07-math-084$ ; see Example 7.1.2.

A further reduction in the number of parameters is obtained by assuming parametric relations for $c07-math-091$ . Jacobs & Lewis (1978c) and Pegram (1980) suggest defining $c07-math-092$ with $c07-math-093$ such that

7.5

Jacobs & Lewis (1978c) require $c07-math-095$ and refer to these models as the discrete autoregressive models of order p (DAR(p)), while Pegram (1980) even allows some $c07-math-096$ to be negative (see above). This model has only $c07-math-097$ parameters. We shall provide a more detailed discussion of it in Section 7.2 when considering the extension to a full discrete ARMA model. Also, the discussion of another type of parsimoniously parametrized Markov model, namely the autoregressive logit model, is postponed until later; see Example 7.4.4.
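A sketch of the resulting conditional distribution is given below. It assumes the usual DAR(p) form, in which each lag weight is added to the state observed at that lag and the remaining weight is spread according to the marginal distribution. All numerical values are hypothetical.

```python
def dar_conditional(pi, phi, past):
    """Conditional distribution of X_t in a DAR(p) model.

    phi[j-1] is the weight of lag j, past[j-1] is the value of X_{t-j},
    and the remaining weight 1 - sum(phi) goes to the marginal pmf pi.
    """
    rest = 1.0 - sum(phi)
    dist = [rest * q for q in pi]
    for weight, x in zip(phi, past):
        dist[x] += weight
    return dist


pi = [0.5, 0.3, 0.2]   # hypothetical marginal distribution
phi = [0.4, 0.2]       # hypothetical non-negative weights for lags 1 and 2

print(dar_conditional(pi, phi, past=[0, 2]))
```

The model is fully specified by the marginal distribution and the lag weights, which is what keeps the parameter count small.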

With increasing p, not only does the number of parameters of a binary Markov model increase rapidly ( $c07-math-109$ ); in general, we also no longer have an AR(p)-like autocorrelation structure. An exception is the binary version of the MTD(p) model. If $c07-math-110$ is assumed to be parametrized as in (7.6) – that is, $c07-math-111$ – then it immediately follows that (7.5) holds. This implies an AR(p)-like autocorrelation structure; see Section 7.2 below.

A closely related approach to the binary MTD(p) and DAR(p) models is the binary AR(p) model of Kanter (1975), which was extended to a full binary ARMA(p, q) model by McKenzie (1981) and Weiß (2009d). The model recursion looks somewhat artificial, as it uses addition modulo 2. On the other hand, this definition also allows for negative autocorrelations; see the cited references for further details.

7.2 Discrete ARMA Models

Based on preliminary works (Jacobs & Lewis, 1978a–1978c) on discrete counterparts to the ARMA(1, q) and AR(p) model, respectively, Jacobs & Lewis (1983) introduced two types of discrete counterparts to the full ARMA(p, q) model. In particular the second of these models, referred to as the NDARMA model by Jacobs & Lewis (1983), is quite attractive, since its definition and serial dependence structure are very close to those of a conventional ARMA model; see also our discussion of the count data version of this model in Section 5.3. Let us pick up Definition 5.3.1 and adapt it to the categorical case (Weiß & Göb, 2008).

Note that the NDARMA(p, q) model has only $c07-math-130$ parameters. Although its recursion is written down in an “ARMA” style, $c07-math-131$ does nothing but choose the state of either $c07-math-132$ or … or $c07-math-133$ . So $c07-math-134$ is generated as a random choice from $c07-math-135$ (a random mixture). But as we shall see in (7.9), the ARMA-like notation is indeed adequate and reasonable for NDARMA processes.
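The random-choice mechanism can be sketched as a small simulation. The sketch assumes that, at each time point, exactly one of the p previous observations or the q + 1 most recent innovations is selected, with selection probabilities that sum to 1, and that the innovations are i.i.d. with the marginal distribution. The names `phi` and `psi`, the burn-in length and all numerical values are assumptions made for illustration.

```python
import random


def simulate_ndarma(pi, phi, psi, length, seed=0, burn_in=200):
    """Simulate a categorical NDARMA(p, q) path with states 0, ..., len(pi) - 1.

    phi: p selection probabilities for X_{t-1}, ..., X_{t-p}
    psi: q + 1 selection probabilities for eps_t, ..., eps_{t-q}
    """
    rng = random.Random(seed)
    states = list(range(len(pi)))
    p, q = len(phi), len(psi) - 1
    weights = list(phi) + list(psi)
    xs = rng.choices(states, weights=pi, k=p)   # arbitrary start values
    eps = rng.choices(states, weights=pi, k=q)  # past innovations
    for _ in range(burn_in + length):
        eps.append(rng.choices(states, weights=pi)[0])
        # Candidates: X_{t-1}, ..., X_{t-p}, eps_t, ..., eps_{t-q}
        candidates = [xs[-j] for j in range(1, p + 1)]
        candidates += [eps[-1 - j] for j in range(q + 1)]
        xs.append(rng.choices(candidates, weights=weights)[0])
    return xs[len(xs) - length:]


path = simulate_ndarma(pi=[0.2, 0.5, 0.3], phi=[0.5], psi=[0.3, 0.2], length=30)
print(path)
```

With a large weight on the previous observation, the simulated path tends to show runs of identical states.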

With the same argument as in Section 5.3, it follows that $c07-math-136$ and $c07-math-137$ have the same stationary marginal distribution; that is, $c07-math-138$ for all $c07-math-139$ . It is useful to know that an NDARMA model can always be represented by a $c07-math-140$ -dimensional finite Markov chain with a primitive transition matrix (Jacobs & Lewis, 1983; Weiß, 2013b). Using this Markov representation, Weiß (2013b) showed that an NDARMA process is $c07-math-141$ -mixing with exponentially decreasing weights (see Definition B.1.5) such that the central limit theorem (CLT) in Billingsley (1999, p. 200) is applicable. Conditional distributions of an NDARMA process are determined by

7.8

which reduces to an expression like the one in (7.5) for the DAR(p) case ( $c07-math-143$ ). Only in the latter case, we have Markov dependence (of order p), while an NDARMA(p, q) process $c07-math-144$ with $c07-math-145$ is not Markovian, although it can be represented as a $c07-math-146$ -dimensional Markov chain.

Let us now investigate the serial dependence structure of an NDARMA(p, q) process $c07-math-147$ using the concepts described in Section 6.3. Only positive dependence is possible (implying that NDARMA processes tend to show long runs of their states); that is, $c07-math-148$ , and it even holds that $c07-math-149$ (Weiß & Göb, 2008). These properties can be utilized to identify if an NDARMA model might be appropriate for a set of time series data (by looking at the sample versions $c07-math-150$ ). For example, the serial dependence measures for the wood pewee time series shown in Figure 6.4 clearly deviate from NDARMA behavior, with negative dependencies and strong deviation between $c07-math-151$ and $c07-math-152$ .

Since both measures lead to identical results anyway, let us focus on $c07-math-153$ in the following. $c07-math-154$ itself is obtained from a set of Yule–Walker-type equations (Weiß & Göb, 2008):

7.9

where the $c07-math-156$ satisfy

which implies $c07-math-157$ for $c07-math-158$ , and $c07-math-159$ . Note the analogy to (5.22) for the autocorrelation function in the count data case. The relation between $c07-math-160$ and the bivariate distributions with lag $c07-math-161$ is given by

7.10

see Weiß & Göb (2008). The equations (7.9) can now be applied in an analogous way to the use of the original Yule–Walker equations for an ARMA process; see Appendix B.3. For a DMA(q) model ( $c07-math-163$ ), we have $c07-math-164$ . Hence, $c07-math-165$ such that $c07-math-166$ vanishes after lag $c07-math-167$ ; see (B.11) for the corresponding MA(q) result. So the model order q can be estimated using $c07-math-168$ or $c07-math-169$ , as described in Appendix B.3.

For a DAR(p) model ( $c07-math-170$ ), in turn, we have $c07-math-171$ , which corresponds exactly to the AR(p) result (B.13). Hence, defining the partial Cohen's $c07-math-172$ (or partial Cramer's $c07-math-173$ ) with exactly the same relation as in Theorem B.3.4 (just replacing $c07-math-174$ by $c07-math-175$ or $c07-math-176$ ), we obtain a tool for identifying the autoregressive order p: $c07-math-177$ for lags $c07-math-178$ . These results also apply to the binary MTD(p) model (see Section 7.1), which was shown to have a DAR(p) dependence structure, and where $c07-math-179$ according to Example 6.3.1.
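The order-identification idea can be sketched numerically. The code below assumes that the DAR(2) dependence measure follows the AR(2)-type Yule–Walker recursion, and it uses the Durbin–Levinson recursion as a stand-in for the relation in Theorem B.3.4, which is not reproduced here. The weights 0.079 and 0.020 are the DAR(2) estimates reported in Table 7.1.

```python
def dar2_kappa(phi1, phi2, max_lag):
    """kappa(1), ..., kappa(max_lag) from the AR(2)-type Yule-Walker recursion."""
    kappa = [1.0, phi1 / (1.0 - phi2)]
    for k in range(2, max_lag + 1):
        kappa.append(phi1 * kappa[k - 1] + phi2 * kappa[k - 2])
    return kappa[1:max_lag + 1]


def partial_values(rho):
    """Durbin-Levinson recursion; rho[k-1] is the dependence measure at lag k."""
    partial, prev = [], []
    for k in range(1, len(rho) + 1):
        if k == 1:
            last = rho[0]
            cur = [last]
        else:
            num = rho[k - 1] - sum(prev[j] * rho[k - 2 - j] for j in range(k - 1))
            den = 1.0 - sum(prev[j] * rho[j] for j in range(k - 1))
            last = num / den
            cur = [prev[j] - last * prev[k - 2 - j] for j in range(k - 1)] + [last]
        partial.append(last)
        prev = cur
    return partial


kappa = dar2_kappa(0.079, 0.020, max_lag=5)
print(kappa)
print(partial_values(kappa))
```

In this sketch the partial values at lags 1 and 2 are non-zero, and they vanish (up to rounding error) from lag 3 onwards, which is the cut-off behavior used for identifying p.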

Now let us look at the sample measures of dispersion from Section 6.2 and the sample measures of serial dependence from Section 6.3. There, we presented the asymptotic properties of these measures if applied to i.i.d. categorical processes. Now let us analyze what changes if these measures are applied to an underlying NDARMA process.

The asymptotic results for the Gini index (6.1) and the entropy (6.2) are easily adapted. As shown by Weiß (2013b), they are still asymptotically normally distributed, and the i.i.d. variance just has to be inflated by the factor

7.12

While $c07-math-201$ in the i.i.d. case, it is given by $c07-math-202$ for a DAR(1) model, as an example. So altogether, we have

7.13

At least for the sample Gini index, an exact bias correction is possible. As shown by Weiß (2013b),

7.14

The latter formula provides a simple way to obtain an approximate bias correction, and it is exact in the i.i.d. case.

For the sample serial dependence measures from Section 6.3, asymptotic normality can also be established, and explicit expressions for the asymptotic variances can be derived. But these expressions are much more complex than in the i.i.d. case, so we refer the reader to Weiß (2013b) for further details.

Let us now look at applications of discrete ARMA models. An application of count time series to video traffic count data was sketched in Example 5.3.2. For categorical time series, Chang et al. (1984) and Delleur et al. (1989) used discrete ARMA models for modeling daily precipitation. While Chang et al. (1984) used a three-state model (that is, $c07-math-205$ ) representing either dry days, days with medium or with strong precipitation, Delleur et al. (1989) used a two-state model (that is, $c07-math-206$ ) to just distinguish between wet and dry days. If the rainfall quantity also has to be considered, Delleur et al. (1989) propose using a state-dependent exponential distribution, analogous to the corresponding approach with HMMs (Sections 5.2 and 7.3). Another field of application of discrete ARMA models is DNA sequence data; see, for example, Dehnert et al. (2003). Returning to this application, let us consider the DNA sequence corresponding to the bovine leukemia virus.

Example 7.2.4 (Bovine leukemia DNA data)

We consider a “time” series that was analyzed in Weiß & Göb (2008) and Weiß (2013b): the DNA sequence of the bovine leukemia virus, which was published by the National Center for Biotechnology Information (NCBI).¹ It is of length $c07-math-207$ , and its range consists of the $c07-math-208$ DNA bases ‘a’, ‘c’, ‘g’ and ‘t’ (adenine, cytosine, guanine and thymine, respectively). Certainly, such a biological sequence cannot be assumed to be a realization of a stochastic process, but corresponding stochastic models are commonly applied in practice as a tool for summarizing properties of the considered sequence; see Churchill (1989) and Dehnert et al. (2003), for instance.

For the bovine leukemia DNA data, the rate evolution graph shown in Figure 7.5a indicates stationary behavior. The estimated marginal distribution $c07-math-211$ is quite close to a uniform distribution (with the mode being ‘c’). So it is not surprising that the point estimates 0.988 and 0.987 for Gini index and entropy, respectively, are close to 1, indicating a strong degree of dispersion.
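The two dispersion values can be reproduced approximately with a short calculation. The sketch assumes the normalized forms of the Gini index and the entropy, both scaled to the range from 0 to 1. As a stand-in for the estimated marginal distribution, it uses the marginal probabilities of the fitted DAR(1) model from Table 7.1, so the result is only an approximation to the sample values.

```python
import math


def gini_index(pmf):
    """Normalized Gini index: equals 1 for the uniform distribution."""
    n = len(pmf)
    return n / (n - 1) * (1.0 - sum(q * q for q in pmf))


def entropy(pmf):
    """Normalized entropy: equals 1 for the uniform distribution."""
    return -sum(q * math.log(q) for q in pmf if q > 0) / math.log(len(pmf))


# Marginal probabilities for 'a', 'c', 'g', 't' of the fitted DAR(1) model (Table 7.1).
pmf = [0.220, 0.331, 0.208, 0.241]

print(round(gini_index(pmf), 3), round(entropy(pmf), 3))
```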

Figure 7.5 Bovine leukemia DNA data: (a) rate evolution graph; serial dependence plots based on (b) Cramer's $c07-math-209$ and (c) Cohen's $c07-math-210$ . See Example 7.2.4.

The serial dependence plots shown in Figure 7.5 both exhibit significant serial dependencies, especially for lags 1 and 2, although the absolute extent is rather small, even at lag 1. This, together with an analysis of the corresponding partial $c07-math-212$ and $c07-math-213$ , indicates that an AR(1)- or AR(2)-like model might be appropriate for the series. The values of $c07-math-214$ and $c07-math-215$ are also relatively close to each other, and the ones of $c07-math-216$ are mainly positive. So we shall try to fit the DAR(1) and DAR(2) models to the data, as well as the MTD(1) (full first-order Markov model) and MTD(2) model as further candidate models. In any case, final estimates are obtained using the conditional maximum likelihood (CML) approach (see Remark B.2.1.2), and they are computed with a numerical optimization routine.

Let us start with the DAR(1) and DAR(2) models. After obtaining initial estimates based on $c07-math-217$ and $c07-math-218$ (the “method of moments”), the CML estimates are computed by maximizing the respective conditional log-likelihood functions $c07-math-219$ from (B.6). The required transition probabilities are given by (7.8), with $c07-math-220$ and $c07-math-221$ , leading to the CML estimates shown in Table 7.1.

Table 7.1 Bovine leukemia DNA data: CML estimates for DAR(p) models, together with maximized log-likelihood and BIC

$c07-math-222$	$c07-math-223$	$c07-math-224$	$c07-math-225$	$c07-math-226$	$c07-math-227$	$c07-math-228$	$c07-math-229$	$c07-math-230$
$c07-math-231$	0.220	0.331	0.208	0.241	0.081		$c07-math-232$ 446	22 927
$c07-math-233$	0.219	0.331	0.209	0.241	0.079	0.020	$c07-math-234$ 440	22 926

It can be seen that there is just a little difference between the fitted DAR(1) and DAR(2) models, so the more parsimonious DAR(1) model appears to be preferable. It is a Markov chain, the transition matrix of which is easily computed; see Example 7.2.2. We just have to multiply the marginal probabilities $c07-math-235$ by $c07-math-236$ and to increase the diagonal elements by $c07-math-237$ , leading to

where some columns do not sum up to exactly 1 because of rounding. The columns of $c07-math-238$ are the 1-step-ahead conditional distributions and might be applied to determine the conditional modes as the point forecasts. Note that ‘c’ is always the most probable state; that is, ‘c’ is the 1-step-ahead point forecast independent of the previous observation. The $c07-math-239$ -step-ahead forecast distributions for $c07-math-240$ are computed in the same way but by replacing $c07-math-241$ by $c07-math-242$ ; see Example 7.2.2. However, since $c07-math-243$ is already rather close to 0, they barely differ from the marginal distribution.
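The construction of the DAR(1) transition matrix and of the resulting forecasts can be sketched as follows, using the DAR(1) estimates from Table 7.1. The function names are illustrative, and the matrix is stored column-wise, so that column j is the conditional distribution given the previous state j.

```python
def dar1_transition(pi, phi):
    """Scale the marginal probabilities by 1 - phi and add phi on the diagonal."""
    n = len(pi)
    return [[(1.0 - phi) * pi[i] + (phi if i == j else 0.0) for j in range(n)]
            for i in range(n)]


def dar1_forecast(pi, phi, state, h):
    """h-step-ahead distribution: the 1-step formula with phi replaced by phi**h."""
    matrix = dar1_transition(pi, phi ** h)
    return [row[state] for row in matrix]


states = "acgt"
pi = [0.220, 0.331, 0.208, 0.241]  # DAR(1) estimates from Table 7.1
phi = 0.081

P = dar1_transition(pi, phi)
modes = [states[max(range(4), key=lambda i: P[i][j])] for j in range(4)]
print(modes)
print(dar1_forecast(pi, phi, state=0, h=2))
```

The conditional mode is 'c' in every column, and the 2-step-ahead distribution is already very close to the marginal one because the squared dependence parameter is below 0.01.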

For the MTD models, it turned out that the MTD(2) model does not lead to a visible improvement compared to the MTD(1) model: the parameter estimate for $c07-math-244$ is nearly equal to 0. So we concentrate on the fitted MTD(1) model; that is, a standard Markov chain model with 12 parameters (transition probabilities). CML estimation leads to the following transition matrix and stationary marginal distribution (the latter being computed from the invariance equation (B.4)):

where again some columns do not sum exactly to 1 because of rounding. Considering the conditional modes as the point forecasts, we obtain the rules “ $c07-math-245$ ”, “ $c07-math-246$ ”, “ $c07-math-247$ ”, and “ $c07-math-248$ ”, which differ from the DAR(1) rule of always predicting ‘c’. But the 2-step-ahead conditional distributions – that is, the columns of $c07-math-249$ – are again very close to $c07-math-250$ , so the 2-step-ahead mode forecast is always equal to ‘c’. This is not surprising in view of the Perron–Frobenius theorem (Remark B.2.2.1), since the second largest eigenvalue is rather small, at about 0.156.

In terms of the BIC ( $c07-math-251$ ), the full Markov model should be preferred to the DAR(1) model. Comparing the marginal properties of the fitted models to the observed ones, the DAR(1) model does reproduce the observed dispersion slightly better (see Table 7.2), while the full Markov model does better in terms of the serial dependence structure. For instance, it has $c07-math-252$ and $c07-math-253$ (observed values 0.0804 and 0.1134, respectively), while $c07-math-254$ for the fitted DAR(1) model. For computing these serial dependence measures, we make use of the fact that the matrix $c07-math-255$ of bivariate probabilities of a Markov chain is computed as $c07-math-256$ ; see Appendix B.2.1.

Table 7.2 Gini index and entropy for bovine leukemia DNA data, and for fitted models

$c07-math-257$

$c07-math-258$

Altogether, the full Markov model seems preferable. Nevertheless, let us conclude this example with the following exercise. Using the fitted DAR(1) model, the constant $c07-math-262$ from (7.12) becomes $c07-math-263$ . Following (7.14), we compute the bias-corrected point estimate of the Gini index as $c07-math-264$ . The approximate standard errors are obtained from (7.13) as 0.00141 (Gini) and 0.00163 (entropy), so approximate 95% confidence intervals follow as $c07-math-265$ and $c07-math-266$ , respectively.

7.3 Hidden-Markov Models

Another important type of model for categorical processes with a non-Markovian serial dependence structure is the hidden-Markov model (HMM), which we are already familiar with from the count data case; see Section 5.2 as well as the book by Zucchini & MacDonald (2009) and the survey article by Ephraim & Merhav (2002). HMMs refer to a bivariate process $c07-math-267$ , in which the hidden states $c07-math-268$ (latent states) constitute a homogeneous Markov chain (Appendix B.2) with range $c07-math-269$ and $c07-math-270$ , and where the observable random variables $c07-math-271$ (now also categorical with state space $c07-math-272$ ) are generated conditionally independently, given the state process; see Figure 5.7 for an illustration of the data-generating mechanism.

As mentioned in Section 5.2, we may interpret an HMM as a “probabilistic function of a Markov chain” (Baum & Petrie (1966); see also Remark 7.3.3). While we shall again concentrate on HMMs for stationary processes, these models can also be used for non-stationary processes by, for example, including covariate information (Remark 5.2.6). HMMs for categorical processes have been applied in many contexts: in biological sequence analysis (Churchill, 1989; Krogh et al., 1994) and especially in fields related to natural languages:

speech recognition (Rabiner, 1989): transforming spoken into written text
text recognition (Makhoul et al., 1994; Natarajan et al., 2001): handwriting recognition or optical character recognition
part-of-speech tagging (Cutting et al., 1992; Thede & Harper, 1999): where each word in a text is assigned its correct part of speech.

As described in Section 5.2, an HMM for $c07-math-273$ is defined based on three assumptions:

the observation equation (5.8):
the state equation (5.9):
the Markov assumption with state transition probabilities (5.11):

As before, we denote the hidden states' transition matrix by $c07-math-274$ . The initial distribution $c07-math-275$ of $c07-math-276$ is assumed to be determined by the stationarity assumption; that is, $c07-math-277$ , where $c07-math-278$ satisfies the invariance equation $c07-math-279$ (see (B.4)). In contrast to Section 5.2, the time-homogeneous state-dependent distributions $c07-math-280$ – that is, $c07-math-281$ for all $c07-math-282$ – are categorical distributions, each having $c07-math-283$ parameters. So altogether, we have $c07-math-284$ parameters for the hidden states, plus $c07-math-285$ parameters related to the observations.

Most of the stochastic properties discussed in Section 5.2 directly carry over to the categorical case (certainly except those referring to moments, since the latter do not exist for categorical random variables). For instance, defining again the diagonal matrices $c07-math-286$ for $c07-math-287$ , the marginal pmf and the bivariate probabilities are given by (5.13); that is, by

7.15

Furthermore, maximum likelihood (ML) estimation is still possible, as described in Remark 5.2.3, the forecast distributions (5.19) and (5.20) remain valid, and both decoding schemes from Remark 5.2.4 are also applicable in the categorical case.

As the main difference, the serial dependence structure of the HMM's observations can no longer be described in terms of the ACF, and measures such as Cramer's $c07-math-289$ (6.3) or Cohen's $c07-math-290$ (6.4) have to be computed using (7.15). Measures of categorical dispersion (Section 6.2), such as the Gini index (6.1) or entropy (6.2), are also available in this way.
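A sketch of how the marginal and bivariate probabilities of the observations, and from them Cohen's kappa, might be computed for a stationary categorical HMM is given below. It assumes that the state-dependent probabilities are stored as `emis[x][q]` (observation `x` given hidden state `q`) and the hidden transition matrix column-wise as `a[q][r]` (state `q` given previous state `r`). The two-state, three-symbol model is hypothetical.

```python
def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def hmm_marginal(emis, pi):
    """P(X_t = x) = sum_q p(x | q) * pi[q]."""
    return [sum(row[q] * pi[q] for q in range(len(pi))) for row in emis]


def hmm_bivariate(emis, a, pi, k):
    """P(X_t = x, X_{t-k} = y) = sum_{q,r} p(x|q) * (A^k)[q][r] * pi[r] * p(y|r)."""
    n = len(pi)
    ak = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        ak = mat_mul(ak, a)
    return [[sum(ex[q] * ak[q][r] * pi[r] * ey[r]
                 for q in range(n) for r in range(n))
             for ey in emis] for ex in emis]


def cohens_kappa(joint, marg):
    num = sum(joint[x][x] - marg[x] ** 2 for x in range(len(marg)))
    return num / (1.0 - sum(m * m for m in marg))


# Hypothetical model: two hidden states, three observable states.
a = [[0.9, 0.2], [0.1, 0.8]]
pi = [2 / 3, 1 / 3]                           # stationary distribution of a
emis = [[0.7, 0.1], [0.2, 0.3], [0.1, 0.6]]   # each column sums to 1

marg = hmm_marginal(emis, pi)
kappa1 = cohens_kappa(hmm_bivariate(emis, a, pi, 1), marg)
print(marg, kappa1)
```

In this example the hidden two-state chain has a lag-1 kappa of 0.7, whereas the value for the observations is much smaller, which illustrates how the observation step can dampen the serial dependence.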

Example 7.3.2 (Three-state HMM)

We define two stationary three-state HMMs using the same state transition matrix $c07-math-306$ as in Example 5.2.2 (see also Example 7.1.1), but with categorical observations having the state space $c07-math-307$ ( $c07-math-308$ ) and the following state-dependent distributions:

So for both models, the hidden state ‘0’ tends to observation ‘ $c07-math-309$ ’, ‘1’ to observation ‘ $c07-math-310$ ’, and ‘2’ to observations ‘ $c07-math-311$ ’ or ‘ $c07-math-312$ ’. But while this tendency is rather weak for model 1 (here, the different state-dependent distributions are quite close to each other), it is very pronounced for model 2. As expected from Example 7.3.1, model 1 causes a strong damping of the underlying Markov chain's serial dependence structure (see Figure 7.2a,b):

The marginal distributions of the observations (7.15) are computed as $c07-math-313$ and $c07-math-314$ , respectively, so the first one shows more dispersion (Gini index 0.993 vs. 0.881).

Picking up Remark 7.2.3, it is obvious by model construction that any binarization $c07-math-315$ of the HMM's observation process $c07-math-316$ , with $c07-math-317$ for a $c07-math-318$ , follows an HMM again, now with state-dependent probabilities $c07-math-319$ and $c07-math-320$ . For such a binary HMM, in turn, moments and hence the ACF will be well-defined, where the relationship between ACF and $c07-math-321$ has already been investigated; see Example 6.3.2.

Remark 7.3.3 (Markov representation)

In Baum & Petrie (1966), an HMM is introduced as a “probabilistic function of a Markov chain”. But in the same article, it is also shown that an HMM can be expressed as a deterministic function of (another) Markov chain; see also Remark 7.2.3. The idea is quite simple: define the bivariate process $c07-math-322$ by $c07-math-323$ , then $c07-math-324$ is a finite Markov chain with transition probabilities

The observable random variables $c07-math-325$ are obtained from the Markovian variables $c07-math-326$ by applying the deterministic function $c07-math-327$ . As a benefit of this representation, if the finite Markov chain $c07-math-328$ can be shown to satisfy, say, some mixing properties (see Definition B.1.5 and Appendix B.2.2), these carry over to the observations process $c07-math-329$ , since this is obtained by just applying a deterministic function to $c07-math-330$ .
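The Markov representation can be sketched by building the transition matrix of the pair process explicitly. The sketch assumes that the pair consists of the current observation and the current hidden state, and that the transition probability of a pair factorizes into the state-dependent probability times the hidden transition probability. The encoding of pairs as integers and the example model are assumptions for illustration.

```python
def joint_chain(emis, a):
    """Transition matrix of the pair process (X_t, Q_t), stored column-wise.

    The pair (x, q) is encoded as index x * n_hidden + q. The entry for moving
    from (y, r) to (x, q) is p(x | q) * a[q][r]; it does not depend on y.
    """
    n_obs, n_hidden = len(emis), len(a)
    size = n_obs * n_hidden
    matrix = [[0.0] * size for _ in range(size)]
    for x in range(n_obs):
        for q in range(n_hidden):
            for y in range(n_obs):
                for r in range(n_hidden):
                    matrix[x * n_hidden + q][y * n_hidden + r] = emis[x][q] * a[q][r]
    return matrix


def observe(index, n_hidden):
    """Deterministic function recovering the observation from the pair index."""
    return index // n_hidden


# Hypothetical model: two hidden states, three observable states.
a = [[0.9, 0.2], [0.1, 0.8]]
emis = [[0.7, 0.1], [0.2, 0.3], [0.1, 0.6]]

M = joint_chain(emis, a)
print(len(M), observe(5, n_hidden=2))
```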

Example 7.3.4 (Bovine leukemia DNA data)

Let us continue Example 7.2.4, where we modeled the bovine leukemia DNA series with its $c07-math-331$ observable states ‘a’, ‘c’, ‘g’ and ‘t’. As an alternative to the models considered before, we shall now try to fit a two-state HMM. Since the time series is rather long ( $c07-math-332$ ), we are faced with the numerical issues mentioned at the end of Remark 5.2.3, so the likelihood computation has to be based on $c07-math-333$ instead of $c07-math-334$ ; see formula (5.18).
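One common way to work with the log-likelihood directly is to rescale the forward probabilities at every time step and to accumulate the logarithms of the scaling factors. The sketch below illustrates this technique. It is assumed, not confirmed, to correspond to formula (5.18), and the two-state model with four observable symbols uses hypothetical values rather than the fitted ones.

```python
import math


def hmm_log_likelihood(obs, emis, a, init):
    """Log-likelihood via the forward recursion with rescaling at every step.

    emis[x][q] = P(X_t = x | Q_t = q), a[q][r] = P(Q_t = q | Q_{t-1} = r),
    init[q] = P(Q_1 = q).
    """
    n = len(init)
    alpha = [init[q] * emis[obs[0]][q] for q in range(n)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [emis[obs[t]][q] * sum(a[q][r] * alpha[r] for r in range(n))
                     for q in range(n)]
        scale = sum(alpha)          # conditional probability of obs[t] given the past
        log_lik += math.log(scale)
        alpha = [v / scale for v in alpha]
    return log_lik


# Hypothetical two-state model for symbols 0='a', 1='c', 2='g', 3='t'.
a = [[0.6, 0.3], [0.4, 0.7]]
init = [3 / 7, 4 / 7]  # stationary distribution of a
emis = [[0.40, 0.10], [0.10, 0.45], [0.40, 0.10], [0.10, 0.35]]

long_sequence = [0, 1, 2, 3] * 2500  # 10,000 observations
print(hmm_log_likelihood(long_sequence, emis, a, init))
```

The unscaled likelihood of such a long sequence would underflow in double precision, whereas the accumulated logarithms remain finite.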

The (full) ML estimates of the two-state HMM are

for the hidden states' stationary Markov model, and the state-dependent distributions are estimated as

The maximal log-likelihood equals $c07-math-335$ , which is less than for the full Markov model from Example 7.2.4 ( $c07-math-336$ ), but better than for the DAR models fitted there. At this point, it should also be mentioned that a three-state HMM does not lead to a visible improvement in terms of model performance (see also the discussion below): the full Markov model remains the preferred choice. However, it is interesting to further interpret and analyze the fitted two-state HMM.

Looking at the state-dependent distributions $c07-math-337$ , it becomes clear that the hidden state 0 mainly leads to either observation ‘a’ or ‘g’, while state 1 goes along with ‘c’ and ‘t’. Such a separation is plausible, since the nucleotides ‘a’ and ‘g’ form the group of purines, while ‘c’ and ‘t’ are the pyrimidines. So state 0 might be interpreted as a “purine state”, while state 1 constitutes a “pyrimidine state”. The corresponding transition matrix $c07-math-338$ has maximal entries on the diagonal, so a purine tends to be followed by a purine again, for example, but the probability for changing between both groups is also rather large. Overall, $c07-math-339$ shows that the “pyrimidine state” is dominant.

The observations’ marginal distribution within the fitted model, given by equation (7.15), equals about $c07-math-340$ and thus agrees with the marginal frequencies up to three decimal places. There is, however, a visible discrepancy between the sample dependence measures $c07-math-341$ on the one hand (as plotted in Figure 7.5), and the theoretical $c07-math-342$ within the fitted model on the other hand. For the latter, we compute $c07-math-343$ and $c07-math-344$ , respectively, both being lower than the corresponding sample values. This confirms our earlier conclusion that the two-state HMM is not the optimal choice for the bovine leukemia DNA series.

Remark 7.3.5 (Higher-order HMM)

As pointed out in Remark 5.2.6, there are several ways of extending the basic HMM to higher-order models, for example by using a higher-order Markov model (such as an MTD or DAR model) for the hidden states (Zucchini & MacDonald, 2009, Section 8.3). With respect to categorical processes, the double-chain Markov model (DCMM), as proposed by Berchtold (1999), is particularly worth mentioning. Here, the current observation $c07-math-345$ is influenced by both the current state $c07-math-346$ and the past observation $c07-math-347$ . So the observation equation (5.8) is modified to

A further extension of this model was developed by Berchtold (2002), who allows for pth-order dependence with respect to past observations, and for qth-order Markov dependence concerning the hidden states; that is, (5.8) and (5.9) become

This DCMM $c07-math-348$ model was applied by Berchtold (2002) to a categorical time series representing the song of a wood pewee (as presented in Examples 6.1.2, 6.2.2 and 6.3.1), and to a categorical time series expressing the behavior of young rhesus monkeys (with four behaviors: “passive”, “explore”, “fear/disturb” and “play”).

7.4 Regression Models

In Section 5.1, we introduced regression models for time series of counts, the main advantage of which is their ability to easily incorporate covariate information, the latter being represented by the (possibly deterministic) vector-valued covariate process $c07-math-349$ . In our discussion, we focussed on generalized linear models (GLMs), where the observations’ conditional mean is linked to a linear expression of the “available information”. The available information is not necessarily limited to the covariate information, but it may also include past observations of the process, as in the case of the conditional regression models according to Definition 5.1.1. If, however, only the current covariate is required for “explaining” the current observation, then we referred to such a situation as a marginal regression model (5.5) (Fahrmeir & Tutz, 2001). In the sequel, we shall see that the regression approach can also be adapted to the case of the observations process $c07-math-350$ being categorical. A much more detailed discussion of such categorical regression models together with several real-data examples is provided in Chapters 2 and 3 of the book by Kedem & Fokianos (2002).

Before turning to the general categorical case, let us first look at the special situation where $c07-math-351$ is a binary process with the state space being coded as $c07-math-352$ (see Example 6.3.2); that is, where the $c07-math-353$ are Bernoulli random variables (Example A.2.1). The parameter of the Bernoulli distribution is its “success probability”, which is also equal to its mean. Hence, it is natural to proceed in the same way as in Section 5.1; that is, to define a GLM with respect to this mean parameter. So Definition 5.1.1 now reads as follows.

The definition (5.5) of a marginal regression model is adapted accordingly. Note that $c07-math-368$ is expressed equivalently as $c07-math-369$ ; this allows for a simplified notation of the (partial) likelihood function (Remark 5.1.5), since $c07-math-370$ now equals $c07-math-371$ . A detailed discussion of likelihood estimation for binary regression models is provided by Slud & Kedem (1994).

While in the count data case, we had to ensure that $c07-math-372$ always produces a positive value, here, we even have to ensure that the value of $c07-math-373$ is in the interval $c07-math-374$ . If one wants to avoid severe restrictions concerning the parameter range for $c07-math-375$ , it is recommended to use response functions $c07-math-376$ with range $c07-math-377$ ; for example, a (strictly monotonic increasing) cdf. The most common choice is the cdf of the standard logistic distribution, leading to a logit model.

Example 7.4.2 (Binary logit model)

The logit link, which is the canonical link function of the Bernoulli distribution, is given by

Looking at the definition of $c07-math-378$ , we may interpret a logit GLM as a log-linear model with respect to the odds $c07-math-379$ ; the quantity $c07-math-380$ is also referred to as the log-odds. In particular, the conditional odds are again determined multiplicatively as $c07-math-381$ . Further motivating arguments for the particular choice of a logit model are presented by Slud & Kedem (1994) and in Section 2.2.1 of Kedem & Fokianos (2002).
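The multiplicative effect on the odds can be checked numerically. The sketch below uses the standard logit function, ln(p / (1 - p)), and its inverse; the linear predictor with one covariate and the coefficient values in `beta` are hypothetical and serve only as an illustration.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Cdf of the standard logistic distribution (inverse of logit)."""
    return 1.0 / (1.0 + math.exp(-eta))

def odds(p):
    return p / (1.0 - p)

beta = [-0.5, 0.8]  # hypothetical intercept and slope

def success_prob(z):
    """Success probability for a single covariate value z."""
    return inv_logit(beta[0] + beta[1] * z)

# Raising the covariate by one unit multiplies the odds by exp(beta[1])
odds_ratio = odds(success_prob(1.0)) / odds(success_prob(0.0))
```

Here `odds_ratio` agrees with `math.exp(beta[1])` up to rounding error, which is the log-linear interpretation of the model with respect to the odds.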

A simple autoregressive logit model was defined by Kedem & Fokianos (2002) and Fokianos & Kedem (2004):

7.16
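To get a feeling for such a model, one may simulate from it. The sketch below assumes that the log-odds are an intercept plus a linear combination of the last p observations; this form, the function name `simulate_binary_ar_logit`, the zero initial values and the parameter values are assumptions for illustration and are not copied from (7.16).

```python
import math
import random

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def simulate_binary_ar_logit(n, beta0, betas, seed=0):
    """Simulate a binary series with
    logit P(X_t = 1 | past) = beta0 + sum_i betas[i-1] * X_{t-i}  (assumed form).
    """
    rng = random.Random(seed)
    p = len(betas)
    x = [0] * p  # arbitrary initial values (assumption)
    for _ in range(n):
        eta = beta0 + sum(b * x[-i] for i, b in enumerate(betas, start=1))
        x.append(1 if rng.random() < inv_logit(eta) else 0)
    return x[p:]

# Hypothetical first-order model with positive dependence
path = simulate_binary_ar_logit(200, beta0=-1.0, betas=[2.0], seed=42)
```

With these values, a 1 is followed by a 1 with probability `inv_logit(1.0)` and a 0 is followed by a 1 with probability `inv_logit(-1.0)`, so the simulated path shows runs of equal values.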

Binary logit models with a feedback mechanism (analogous to the INGARCH(1, 1) model from Example 4.1.4) are discussed by Moysiadis & Fokianos (2014); for example, the basic model

7.17

where the condition $c07-math-384$ ensures a stationary solution.

Possible alternatives to the logit approach are the probit model (based on the cdf of the standard normal distribution), the log–log model (cdf of standard maximum extreme value distribution), or the complementary log–log model (cdf of standard minimum extreme value distribution); see Section 2.2 in Kedem & Fokianos (2002) for further details.
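The four response functions mentioned here can be written down directly as cdfs. The following sketch uses only the Python standard library; the dictionary `RESPONSES` and the function names are hypothetical labels chosen for this illustration.

```python
import math

def logistic_cdf(x):
    """Standard logistic cdf (logit model)."""
    return 1.0 / (1.0 + math.exp(-x))

def normal_cdf(x):
    """Standard normal cdf (probit model)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def max_extreme_cdf(x):
    """Standard maximum extreme value cdf (log-log model)."""
    return math.exp(-math.exp(-x))

def min_extreme_cdf(x):
    """Standard minimum extreme value cdf (complementary log-log model)."""
    return 1.0 - math.exp(-math.exp(x))

RESPONSES = {
    "logit": logistic_cdf,
    "probit": normal_cdf,
    "log-log": max_extreme_cdf,
    "complementary log-log": min_extreme_cdf,
}

# Every response function maps a linear predictor into the interval (0, 1)
table = {name: [h(eta) for eta in (-2.0, 0.0, 2.0)]
         for name, h in RESPONSES.items()}
```

The logistic and normal cdfs are symmetric about 0, whereas the two extreme value cdfs are asymmetric, which is the main practical difference between these choices.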

To extend the methods described before to the general categorical case – that is, where $c07-math-385$ has the state space $c07-math-386$ – it is helpful to look at the binarization $c07-math-387$ of $c07-math-388$ , defined by $c07-math-389$ if $c07-math-390$ , with $c07-math-391$ being the unit vectors (see Example A.3.3). If $c07-math-392$ is distributed according to $c07-math-393$ (unit simplex, see Remark A.3.4), then $c07-math-394$ is multinomially distributed according to $c07-math-395$ , and its $c07-math-396$ th component $c07-math-397$ follows the Bernoulli distribution $c07-math-398$ . The basic idea in the sequel is to apply the above binary approaches (especially the logit approach) to these components $c07-math-399$ .

Because of the sum constraint $c07-math-400$ , it is reasonable to concentrate on the reduced vectors $c07-math-401$ (then $c07-math-402$ ), the distribution of which is denoted by $c07-math-403$ according to Example A.3.3. To further simplify the notation, we define the open $c07-math-404$ -part unit simplex

then $c07-math-405$ satisfies $c07-math-406$ , and we just write $c07-math-407$ .
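The binarization and its reduced version are easily coded. In the sketch below, states are coded as integers 0, 1, ..., and the reduced vector is obtained by dropping the last component; which component is dropped is a convention, and the choice made here is an assumption for illustration.

```python
def binarize(x, n_states):
    """Unit vector with a 1 at the position of state x."""
    y = [0] * n_states
    y[x] = 1
    return y

def reduce_vector(y):
    """Drop the last component (assumed convention); it is redundant
    because the components of a binarization sum to 1."""
    return y[:-1]

# A short categorical series with four states coded 0..3
series = [0, 3, 1, 1, 2]
binarized = [binarize(x, 4) for x in series]
reduced = [reduce_vector(y) for y in binarized]
```

The state whose component is dropped is then represented by the zero vector, here `reduced[1] == [0, 0, 0]` for state 3.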

Following Fahrmeir & Kaufmann (1987), Kedem & Fokianos (2002) and Fokianos & Kedem (2003), we now extend Definition 7.4.1 to a conditional categorical regression model.

Analogous to the binary case, the probabilities $c07-math-423$ are expressed in a simple way, namely as $c07-math-424$ (the component $c07-math-425$ of $c07-math-426$ expresses $c07-math-427$ ). These probabilities are required for (partial) likelihood computation. Likelihood estimation for categorical regression models is discussed by Fahrmeir & Kaufmann (1987) and Fokianos & Kedem (2003) as well as in the book by Kedem & Fokianos (2002).

To illustrate Definition 7.4.3, let us consider a particular instance of a categorical GLM: the categorical logit model as discussed by Kedem & Fokianos (2002) and Fokianos & Kedem (2003).

A particular example of the categorical logit model described in Example 7.4.4 is a $c07-math-452$ th-order autoregressive model (analogous to (7.16); see also Kedem & Fokianos (2002) and Fokianos & Kedem (2003)), where $c07-math-453$ is composed of $c07-math-454$ ; that is, of dimension $c07-math-455$ . Hence, the total number of model parameters is given by $c07-math-456$ , with the components’ parameter vectors $c07-math-457$ being of dimension $c07-math-458$ . So the $c07-math-459$ th-order autoregressive logit model can be understood as a parsimoniously parametrized $c07-math-460$ th-order Markov model; see also Section 7.1. Partitioning the components of $c07-math-461$ as $c07-math-462$ with the $c07-math-463$ consisting of $c07-math-464$ parameters, we can rewrite the autoregressive logit model’s recursion as

7.19
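A minimal sketch of how such a model produces conditional probabilities is given below. It assumes the usual multinomial logit form with a reference category: each of the first m states has its own parameter vector, and the last state serves as the reference with linear predictor 0. This form, the function names and the parameter values are assumptions for illustration and are not copied from (7.18) or (7.19).

```python
import math

def design_vector(past_states, m):
    """Intercept followed by the reduced binarizations of the past states
    (most recent first); state m is represented by the zero vector."""
    z = [1.0]
    for x in past_states:
        z.extend(1.0 if x == j else 0.0 for j in range(m))
    return z

def categorical_logit_probs(betas, z):
    """Probabilities of the m + 1 states; the last one is the reference.

    betas: list of m parameter vectors, each of the same length as z.
    """
    expo = [math.exp(sum(b * zi for b, zi in zip(beta, z))) for beta in betas]
    denom = 1.0 + sum(expo)
    return [e / denom for e in expo] + [1.0 / denom]

# Hypothetical first-order model with three states (m = 2)
betas = [[0.2, 1.0, -0.5],
         [-0.3, 0.4, 0.9]]
z = design_vector([0], m=2)            # previous observation was state 0
probs = categorical_logit_probs(betas, z)
```

For order p, the design vector has 1 + pm components, so the m parameter vectors together contain m(1 + pm) parameters.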

A non-Markovian categorical regression model with an additional feedback component, analogous to equation (7.17), was developed by Moysiadis & Fokianos (2014).

Example 7.4.5 (Bovine leukemia DNA data)

Let us continue Examples 7.2.4 and 7.3.4 about the bovine leukemia DNA series (length $c07-math-466$ ) with state space $c07-math-467$ of size $c07-math-468$ . In view of the AR-like serial dependence structure of these data, we try to fit the autoregressive logit model as described in (7.19). For model order $c07-math-469$ , it has $c07-math-470$ parameters. The $c07-math-471$ components of $c07-math-472$ refer to the states $c07-math-473$ , $c07-math-474$ and $c07-math-475$ , and $c07-math-476$ is represented by $c07-math-477$ .

For model order $c07-math-478$ , we have $c07-math-479$ parameters, which is exactly the same number as for a full Markov chain model; see Section 7.1. In fact, the first-order autoregressive logit model and a full Markov chain are equivalent to each other; the logit model just constitutes a reparametrization of the Markov chain. This reparametrization, however, is quite useful in practice, because it avoids constraints for the model parameters and thus simplifies their estimation. We also consider the model orders 2 and 3, leading to 21 and 30 model parameters, respectively. These numbers are already quite large, but much lower than for a full second- or third-order Markov model (48 and 192, respectively). The obtained (rounded) values for the maximized (conditional) log-likelihood as well as the AIC and BIC are shown in Table 7.3.
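The parameter counts quoted in this example can be reproduced with a short computation. The function names below are hypothetical; the formulas are m(1 + pm) for the autoregressive logit model and (m + 1)^p · m for a full pth-order Markov model with m + 1 states.

```python
def n_params_ar_logit(n_states, p):
    """m parameter vectors of dimension 1 + p*m, where m = n_states - 1."""
    m = n_states - 1
    return m * (1 + p * m)

def n_params_full_markov(n_states, p):
    """One free distribution (n_states - 1 parameters) per past p-tuple."""
    return n_states ** p * (n_states - 1)

# Four DNA states, model orders 1 to 3
logit_counts = [n_params_ar_logit(4, p) for p in (1, 2, 3)]      # [12, 21, 30]
markov_counts = [n_params_full_markov(4, p) for p in (1, 2, 3)]  # [12, 48, 192]
```

The counts agree for order 1 only; for higher orders, the logit model grows linearly in p while the full Markov model grows exponentially.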

**Table 7.3** Bovine leukemia DNA data: maximized log-likelihood, AIC and BIC for AR(p) logit models

So the BIC prefers the first-order logit model (full MC), while the AIC prefers the second-order one. Note that the values obtained for the first-order logit model slightly deviate from those in Example 7.2.4 for the Markov chain. This is caused by the use of different numerical optimization routines, namely unconstrained vs. constrained optimization.
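The model choice can be retraced from the AIC and BIC values reported in Table 7.3. The sketch below also states the two criteria in their usual form, AIC = -2ℓ + 2k and BIC = -2ℓ + k ln(n); that these are the exact definitions used for the table is an assumption, and the function names are hypothetical.

```python
import math

def aic_value(loglik, n_params):
    """Usual AIC: -2 * log-likelihood + 2 * number of parameters."""
    return -2.0 * loglik + 2.0 * n_params

def bic_value(loglik, n_params, n_obs):
    """Usual BIC: -2 * log-likelihood + number of parameters * ln(n_obs)."""
    return -2.0 * loglik + n_params * math.log(n_obs)

# Values from Table 7.3, indexed by the model order p
aic = {1: 22740, 2: 22703, 3: 22706}
bic = {1: 22824, 2: 22851, 3: 22917}

best_by_aic = min(aic, key=aic.get)  # order 2
best_by_bic = min(bic, key=bic.get)  # order 1
```

Because ln(n) exceeds 2 for any series of realistic length, the BIC penalizes additional parameters more strongly than the AIC, which explains why it favors the smaller model here.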

In view of the diffuse picture caused by the information criteria, let us compare some stochastic properties of the fitted first- and second-order logit models with the corresponding observed ones. The required transition probabilities $c07-math-495$ and $c07-math-496$ are computed from (7.18) and (7.19), respectively, by inserting the estimated parameter values (the latter are not shown here to save space). Marginal distribution and serial dependence measures for the first-order logit model (full MC) have already been checked in Example 7.2.4.

For the second-order logit model, it would again be possible to compute the considered characteristics exactly, by transforming the model into a bivariate Markov chain (see the discussion below (B.1) in Appendix B.1). But for the sake of simplicity, the fitted model was used to simulate a time series of length 10 million, the sample properties of which serve as an approximation to the true model's values. For the marginal distribution, we obtain about $c07-math-497$ (which does not sum up to exactly 1 because of rounding), which is very close to $c07-math-498$ as reported in Example 7.2.4. Furthermore, the serial dependence structure is represented very well now also for lag 2: $c07-math-499$ , $c07-math-500$ and $c07-math-501$ , $c07-math-502$ , respectively. So if the additional number of parameters compared to the first-order model is acceptable, the second-order logit model appears to be preferable.

We conclude this section by pointing out that the regression approach can also be used for ordinal time series; see Kedem & Fokianos (2002) and Fokianos & Kedem (2003). To simplify the presentation, we first discuss a marginal regression model, but the model is easily extended to a conditional regression model.

Example 7.4.6 (Ordinal regression model)

Let the states in $c07-math-503$ exhibit a natural ordering, $c07-math-504$ . The idea is to assume that $c07-math-505$ or $c07-math-506$ , respectively, is generated from a latent real-valued random variable $c07-math-507$ in the following way:

7.20

where $c07-math-509$ are threshold parameters. Here, $c07-math-510$ is the covariate information at time $c07-math-511$ , and the $c07-math-512$ are assumed to be i.i.d. If $c07-math-513$ denotes the cdf of $c07-math-514$ , then

7.21

The model parameters to be estimated are $c07-math-516$ as well as $c07-math-517$ .

In the special case of $c07-math-518$ being the cdf of the standard logistic distribution – that is, $c07-math-519$ – the resulting model is referred to as the proportional odds model or ordered logit model. Then (7.21) becomes

7.22

An example of the ordered logit model with $c07-math-521$ states $c07-math-522$ , and with threshold parameters $c07-math-523$ according to (7.20), is shown in Figure 7.6a. There, the probability density function (pdf) $c07-math-524$ of the standard logistic distribution is plotted together with the threshold values $c07-math-525$ as well as with the resulting probability masses. As an example, the probability that the latent random variable $c07-math-526$ takes a value between $c07-math-527$ and $c07-math-528$ (this corresponds to state ‘ $c07-math-529$ ’) equals $c07-math-530$ .
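The probability masses between consecutive thresholds can be computed as differences of the logistic cdf. In the sketch below, the threshold values are hypothetical (they are not those of Figure 7.6a), and the argument `shift` stands for the contribution of the linear predictor; its sign convention is an assumption and may differ from (7.20).

```python
import math

def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))

def ordered_logit_probs(thresholds, shift=0.0):
    """Probabilities of the ordered states for increasing thresholds.

    State j receives the mass F(eta_j + shift) - F(eta_{j-1} + shift),
    with F(-infinity) = 0 and F(+infinity) = 1 at the two ends.
    """
    cumulative = [logistic_cdf(t + shift) for t in thresholds] + [1.0]
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs

# Hypothetical symmetric thresholds for four ordered states
thresholds = [-1.5, 0.0, 1.5]
probs = ordered_logit_probs(thresholds)
shifted = ordered_logit_probs(thresholds, shift=-1.0)
```

With symmetric thresholds and no shift, the resulting distribution is symmetric; a negative `shift` moves probability mass toward the largest state.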

**Figure 7.6** (a) Standard logistic distribution with threshold values; see Example 7.4.6. Ordinal logit AR(1) model from Example 7.4.8: (b) Cohen's $c07-math-531$ , (c) simulated sample path.

The ordinal regression approach of Example 7.4.6 is easily modified to obtain an autoregressive model. As for model (7.19), we skip one component of the binarization $c07-math-534$ of $c07-math-535$ . But in view of the thresholds (7.20), this time, it is advantageous to define $c07-math-536$ following $c07-math-537$ (Example A.3.3). Combining approaches (7.19) and (7.22), we define an ordinal autoregressive logit model as (Fokianos & Kedem, 2003):

7.23

Note that the $c07-math-539$ -dimensional parameter vectors $c07-math-540$ do not depend on $c07-math-541$ (although such a dependence would also be possible), so that the model has only $c07-math-542$ parameters.

Example 7.4.8 (Ordinal logit autoregression)

Let us pick up the ordinal logit approach with $c07-math-543$ states $c07-math-544$ from Example 7.4.6, and let us construct an autoregressive model (7.23) of order $c07-math-545$ . Note that this model is a (parsimoniously parametrized) ordinal Markov chain with transition probabilities

see (7.21). In model (7.23), the event that $c07-math-546$ equals the largest state ‘ $c07-math-547$ ’ is represented by the vector $c07-math-548$ , so that the expression for the log-odds (7.23) reduces to $c07-math-549$ . If we aim at having a model with positive dependence, then a large state should be followed by a large state with a high probability, which is not the case for the choice of the threshold values in Example 7.4.6; see Figure 7.6a. Therefore, for the present example, we shift these values by $c07-math-550$ , a negative shift that leaves more probability mass above the largest threshold $c07-math-551$ . So we use $c07-math-552$ . As a result, the probabilities for falling between $c07-math-553$ and $c07-math-554$ (this is the conditional distribution for $c07-math-555$ given $c07-math-556$ ) become approximately $c07-math-557$ for $c07-math-558$ . In particular, ‘ $c07-math-559$ ’ is followed by ‘ $c07-math-560$ ’ with about 75% probability.

The components of the autoregressive parameter vector $c07-math-561$ , in turn, are chosen as positive values. Since the smallest state corresponds to $c07-math-562$ , the second smallest to $c07-math-563$ and so on, and since positive dependence requires that small values tend to be followed by small ones, we choose $c07-math-564$ : $c07-math-565$ . So altogether, the model's transition matrix becomes

where not all columns sum to 1 because of rounding. As an example, the first column gives the conditional distribution for $c07-math-566$ given $c07-math-567$ , which implies that ‘ $c07-math-568$ ’ is followed by ‘ $c07-math-569$ ’ with about 69.0% probability, and by the next largest state ‘ $c07-math-570$ ’ with about 15.6% probability. This “inertia” for the smallest and largest state, respectively, becomes visible in the simulated sample path shown in Figure 7.6c, and it causes Cohen's $c07-math-571$ in Figure 7.6b to take positive values.

The stationary marginal distribution is computed according to the invariance equation (B.4) in the appendix as $c07-math-572$ , so the boundary states ‘ $c07-math-573$ ’ and ‘ $c07-math-574$ ’ are the most probable ones. Overall, this distribution is quite close to a uniform distribution, which explains the large values for the Gini index (6.1) and entropy (6.2), given by 0.973 and 0.964, respectively.
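The computations of this paragraph can be sketched as follows. The transition matrix is taken to be column-stochastic (each column is a conditional distribution, as in the example above), the stationary distribution is approximated by iterating the invariance equation, and the dispersion measures are assumed to be the normalized Gini index (k / (k - 1)) · (1 - Σ π²) and the normalized entropy -Σ π ln π / ln k for k states. The matrix below is hypothetical, and whether these normalizations agree exactly with (6.1) and (6.2) is an assumption.

```python
import math

def stationary(T, n_iter=10_000, tol=1e-14):
    """Iterate pi <- T pi for a column-stochastic matrix T until convergence."""
    k = len(T)
    pi = [1.0 / k] * k
    for _ in range(n_iter):
        new = [sum(T[i][j] * pi[j] for j in range(k)) for i in range(k)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi

def gini_index(pi):
    """Normalized Gini index: 0 for a one-point, 1 for a uniform distribution."""
    k = len(pi)
    return k / (k - 1) * (1.0 - sum(p * p for p in pi))

def entropy(pi):
    """Normalized entropy with the same range as the Gini index."""
    k = len(pi)
    return -sum(p * math.log(p) for p in pi if p > 0) / math.log(k)

# Hypothetical chain with "inertia" in both states; columns sum to 1
T = [[0.9, 0.2],
     [0.1, 0.8]]
pi = stationary(T)
dispersion = (gini_index(pi), entropy(pi))
```

For this matrix, the invariance equation gives the stationary probabilities 2/3 and 1/3, and both dispersion measures equal 1 exactly when the distribution is uniform.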

For further information on ordinal regression models, consult Kedem & Fokianos (2002) and Fokianos & Kedem (2003).


|  | $c07-math-257$ | $c07-math-258$ | $c07-math-259$ |
| --- | --- | --- | --- |
| $c07-math-260$ | 0.98767 | 0.98763 | 0.98757 |
| $c07-math-261$ | 0.98736 | 0.98728 | 0.98725 |

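Table 7.3 (values):

| $c07-math-480$ |  |  | $c07-math-481$ |  |  | $c07-math-482$ |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $c07-math-483$ | $c07-math-484$ | $c07-math-485$ | $c07-math-486$ | $c07-math-487$ | $c07-math-488$ | $c07-math-489$ | $c07-math-490$ | $c07-math-491$ |
| $c07-math-492$ | $c07-math-493$ | $c07-math-494$ | 22 740 | 22 703 | 22 706 | 22 824 | 22 851 | 22 917 |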
