Chapter 2

Model Order Selection

Visa Koivunen and Esa Ollila, Department of Signal Processing and Acoustics, School of Electrical Engineering, Aalto University, Finland

Abstract

In this chapter we provide an overview of methods used for model order selection, where one looks for the best dimension for a parametric model given a set of observed data. The most widely used model order selection techniques stem either from statistical inference or from information theory. Among the commonly used statistical methods are the Bayesian Information Criterion (BIC) and generalized likelihood ratio tests (GLRT), whereas the Minimum Description Length (MDL) and the Akaike Information Criterion (AIC) have their roots in information and coding theory. We also show that the likelihood function plays an important role in most model selection techniques. Application examples showing how these methods may be applied in practice are provided.

Keywords

Model order estimation; Parametric model; Statistical inference; Information theory; Stochastic complexity; Likelihood function

3.02.1 Introduction

In model selection the main task is to choose a mathematical model for a phenomenon from a collection of models given some evidence, such as the observed data at hand. Informally, one aims at identifying a proper complexity for the model. Typical application examples include finding the number of radio frequency signals impinging on an antenna array, choosing the order of an ARMA model used for analyzing time series data, variable selection in a regression model in statistics, estimating the channel order in a wireless communication receiver, and selecting the order of a state-space model used for tracking a maneuvering target in radar.

Model order selection uses a set of observed data and seeks the best dimension for the parametric model used. This is related to the idea of the inverse experiment and maximum likelihood estimation, where one looks for the parameters that most likely have produced the observed data. Various methods for model selection or model order estimation stem from detection and estimation theory, statistical inference, information and coding theory, and machine learning. The outcome of model order selection is an integer value describing the dimension of the model.

One well-known heuristic approach used in model selection is Occam's razor (e.g., [1]). It favors parsimony, i.e., simpler explanations instead of more complex ones, while retaining the explanatory power of the model. Hence, Occam's razor chooses the simplest model among the candidate models. Simpler models tend to capture the underlying structure of the phenomenon (such as the signal of interest) better and consequently provide better performance in predicting the output when new observations are processed. More complex models tend to overfit the observed data and to be more sensitive to noise; moreover, such a model may not generalize well to new data. The same problem is addressed in statistical inference, where there is a trade-off between bias and variance when explaining the observed data. Too simple a model leads to a larger systematic error (bias), while too complex a model may lead to an increased variance of the estimated parameters and a higher Cramér-Rao bound, as well as to finding structures in the data that do not actually exist. Parsimony is also favored in probability theory, since the probability of making an error increases when additional assumptions that may be invalid or unnecessary are introduced.

Traditionally it is assumed that there is a single correct hypothesis within the set of hypotheses considered that describes the (true) model generating the given data. In practice, any model that we assume is only an approximation to the truth: for example, the data do not necessarily follow an exactly normal distribution. Yet the described criteria are still useful in selecting the "best model" from the set, i.e., the one most supported by the data within the mathematical paradigm of the selected information criterion. There are many practical consequences of a misspecified model order. In state-space models and Kalman filtering, a model of too low an order makes the innovation sequence correlated and the optimality of the filter is lost. On the other hand, if the order is too high, one may end up tracking the noise, which can lead to divergence of the Kalman filter. In regression, a model of too low an order may lead to biased estimates since it does not have enough flexibility to fit the data with high fidelity. On the other hand, too many parameters increase the variance of the estimates unnecessarily and cause a loss of optimality.

Model selection is of fundamental interest to the investigator seeking a better understanding of the phenomenon under study, and the literature on this subject is considerable. Several review articles [2–4] are notable, as well as full-length textbooks [1,5,6]. In this overview, we first review the classical model selection paradigms: the Bayesian Information Criterion (BIC) with an emphasis on its asymptotic approximation due to Schwarz [7], Akaike's Information Criterion (AIC) [8,9], stochastic complexity [10] and its early development, the Minimum Description Length (MDL) criterion due to Rissanen [11], and sequential hypothesis testing based on the generalized likelihood ratio test (GLRT) (see, e.g., [12]). Among the methods considered, the MDL and AIC have their foundations in information theory and coding, whereas the BIC and GLRT methods have their roots in the major statistical inference paradigms (the Bayesian and likelihood principles). The AIC is based on an asymptotic approximation of the Kullback-Leibler (K-L) divergence, a fundamental notion in information theory, whereas Rissanen's MDL is based on coding theory. However, both of these methods can also be linked with the likelihood principle, and the classification into statistical inference based and information theory based methods is provided mainly for pedagogic purposes. There are some more recent contributions, such as the exponentially embedded family (EEF) method by Kay [13] and Bozdogan's extensions of the AIC [14], among others, which are not reviewed herein.

This chapter is organized as follows. First, Section 3.02.2 describes the basic principles, challenges, and complexity of model selection problems with a simple example: variable selection in the multiple linear regression model. In particular, the need to compromise between goodness of fit and conciseness (the "principle of parsimony"), which is the key idea of model selection, is illustrated, and the forward stepwise regression variable selection principle using the AIC is explained along with the cross-validation principle. Section 3.02.3 addresses the statistical inference based methods (BIC, GLRT) in detail. Section 3.02.4 introduces the AIC and the MDL and their variants and enhancements. In Section 3.02.5, the model selection methods are derived and illustrated using a practical engineering example: determining the dimension of the signal subspace. Such problems arise commonly in sensor array processing and in harmonic retrieval; see, e.g., [15–18].

Notations

More specifically, model selection refers to the multihypothesis testing problem of determining which model best describes a given data set (measurement) $\mathbf{y}$ of size N. Let $\mathcal{H}_m$, $m = 1, \ldots, q$, denote the set of hypotheses, which may not be nested (i.e., $\mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots \subset \mathcal{H}_q$ need not hold), and suppose that each hypothesis specifies a probability density function (p.d.f.) $f_m(\mathbf{y} \mid \boldsymbol{\theta}_m)$ with a parameter vector $\boldsymbol{\theta}_m$ of dimension $k_m$ in the parameter space $\Theta_m$. For notational convenience we often suppress the subscript m (the model index) from $\boldsymbol{\theta}_m$, $k_m$, and $\Theta_m$, but the reader should keep in mind that these differ for each model. Model order selection refers to finding the true dimension $k_{m_0}$ of the model, commonly within a nested set of hypotheses. Let

$$L_m(\boldsymbol{\theta}_m) = \log f_m(\mathbf{y} \mid \boldsymbol{\theta}_m) \quad \text{and} \quad P(\mathcal{H}_m)$$

denote the log-likelihood function of the data under the mth model and the a priori probability of the mth hypothesis ($\sum_{m=1}^{q} P(\mathcal{H}_m) = 1$), respectively. Usually, all models are a priori equally likely, in which case $P(\mathcal{H}_m) = 1/q$ for $m = 1, \ldots, q$. In likelihood theory, the natural point estimate of $\boldsymbol{\theta}_m$ is the maximum likelihood estimate (MLE) $\hat{\boldsymbol{\theta}}_m = \arg\max_{\boldsymbol{\theta}_m \in \Theta_m} L_m(\boldsymbol{\theta}_m)$, which is assumed to be unique. Let us denote by

$$\mathcal{I}(\boldsymbol{\theta}) = \mathrm{E}\left[ -\frac{\partial^2 \log f(\mathbf{y} \mid \boldsymbol{\theta})}{\partial \boldsymbol{\theta}\, \partial \boldsymbol{\theta}^{\top}} \right] \quad \text{and} \quad \mathcal{J}(\boldsymbol{\theta}) = -\frac{\partial^2 \log f(\mathbf{y} \mid \boldsymbol{\theta})}{\partial \boldsymbol{\theta}\, \partial \boldsymbol{\theta}^{\top}} \tag{2.1}$$

the (expected) Fisher information matrix (FIM) and the observed FIM, respectively.

3.02.2 Example: variable selection in regression

In the multiple linear regression model the data $\mathbf{y} \in \mathbb{R}^N$ is modeled as $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where $\boldsymbol{\beta} \in \mathbb{R}^k$ denotes the unknown vector of regression coefficients, $\mathbf{X}$ is the known $N \times k$ matrix of explanatory variables ($N > k$), and $\boldsymbol{\varepsilon}$ is the N-variate noise vector with i.i.d. components from a normal distribution $\mathcal{N}(0, \sigma^2)$ with unknown variance $\sigma^2$. Let us call the columns of $\mathbf{X}$ (explanatory) variables, following the convention in statistics; i.e., we use $\mathbf{x}_i$ to denote both the ith column of $\mathbf{X}$ and the ith variable of the model. Thus $\mathbf{y} \sim \mathcal{N}_N(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I})$, and the p.d.f. and the log-likelihood are

$$f(\mathbf{y} \mid \boldsymbol{\theta}) = (2\pi\sigma^2)^{-N/2} \exp\left( -\frac{\| \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \|^2}{2\sigma^2} \right), \qquad L(\boldsymbol{\theta}) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{\| \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \|^2}{2\sigma^2},$$

where $\boldsymbol{\theta} = (\boldsymbol{\beta}^{\top}, \sigma^2)^{\top}$ denotes the parameter vector. Commonly there is a large set of explanatory variables and the main interest is in choosing the subset of explanatory variables that significantly contribute to the data (response variable) $\mathbf{y}$; see [19,20]. Thus, we consider the hypotheses

$$\mathcal{H}_I : \mathbf{y} = \mathbf{X}_I \boldsymbol{\beta}_I + \boldsymbol{\varepsilon},$$

where $I \subseteq \{1, \ldots, k\}$ denotes the index set of the contributing variables, $\mathbf{X}_I$ the corresponding columns of $\mathbf{X}$, and $\boldsymbol{\beta}_I$ the corresponding regression coefficients. Then, under $\mathcal{H}_I$, the MLEs of the regression coefficients and the noise variance correspond to the classic least squares (LS) estimate $\hat{\boldsymbol{\beta}}_I = (\mathbf{X}_I^{\top}\mathbf{X}_I)^{-1}\mathbf{X}_I^{\top}\mathbf{y}$ and the residual sample variance $\hat{\sigma}_I^2 = \mathrm{RSS}_I / N$, respectively, where $\mathrm{RSS}_I = \| \mathbf{y} - \mathbf{X}_I \hat{\boldsymbol{\beta}}_I \|^2$ is the residual sum of squares. Hence the $-2 \times$ (maximum log-likelihood) under hypothesis $\mathcal{H}_I$ becomes

$$-2 L(\hat{\boldsymbol{\theta}}_I) = N \log(2\pi \hat{\sigma}_I^2) + N, \tag{2.2}$$

which always decreases when we include additional variables in the model. This is because $\hat{\sigma}_{I'}^2 \leq \hat{\sigma}_I^2$, where $\hat{\sigma}_{I'}^2$ denotes the residual variance with one additional variable in the model. Indeed, $\hat{\sigma}_I^2 = 0$ if we introduce as many explanatory variables as responses. More generally, if two models being compared are nested, then the one with more parameters will always yield a larger maximum log-likelihood, even though it is not necessarily a better model.
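To see this monotonicity concretely, the following small sketch (in Python, using hypothetical simulated data; all names are our own) fits nested LS models of increasing size and prints the resulting RSS, which never increases as variables are added:

import numpy as np

rng = np.random.default_rng(42)
N, k = 50, 10
X = rng.standard_normal((N, k))
# only the first three variables actually contribute to the response
y = X[:, :3] @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(N)

for m in range(1, k + 1):
    beta, *_ = np.linalg.lstsq(X[:, :m], y, rcond=None)
    rss = np.sum((y - X[:, :m] @ beta) ** 2)
    print(m, rss)  # RSS is non-increasing in the number of variables m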

3.02.2.1 AIC and the stepwise regression

Previously, we noticed that minimizing $-2 L(\hat{\boldsymbol{\theta}}_I)$ alone cannot be used as the criterion. An intuitive approach is to introduce an additive term that penalizes overfitting. The penalty term needs to be an increasing function of the number of parameters, so that a decrease in $-2 \times$ (maximum log-likelihood) is traded off against an increase in the penalty term, yielding a compromise between goodness of fit and simplicity. This principle of parsimony is the fundamental idea of model selection. As we shall see, the classic approximations of information criteria reduce to such forms. For example, the AIC can be stated as

$$\mathrm{AIC}_m = -2 L_m(\hat{\boldsymbol{\theta}}_m) + 2 k_m,$$

that is, $\mathrm{AIC} = -2 \times (\text{maximum log-likelihood}) + 2 \times (\text{number of estimated parameters})$, and the regression model (hypothesis $\mathcal{H}_I$) that minimizes the criterion is chosen. Here, the penalty term is $2 k_m$, which increases with $k_m$ and hence penalizes selecting too many variables. In the regression setting,

$$\mathrm{AIC}_I = N \log \hat{\sigma}_I^2 + 2\,|I|,$$

where we have suppressed the irrelevant additive constants from (2.2) and the "plus one" term from $k_I = |I| + 1$, neither of which depends on the number of explanatory variables $|I|$ in the model. Note that robust versions of the AIC have been proposed, e.g., [21], which utilize robust M-estimation.

The AIC is not computationally feasible for exhaustive variable selection when the number of explanatory variable candidates k is large: we would need to consider all index sets I of size m for each $m = 1, \ldots, k$, and for each m there are $\binom{k}{m}$ possible index sets. A more practical and commonly used approach is a stepwise regression variable selection method. In the forward stepwise search strategy one starts with a model containing only an intercept term (or a subset of explanatory variables), and then includes the variable that yields the largest reduction in the AIC, i.e., the largest reduction in the RSS (and thus improves the fit the most). One stops when no variable produces a reduction greater than a threshold; a minimal sketch of this procedure is given below. There are several variants of stepwise regression which differ in the criterion used for inclusion/exclusion of variables; for example, instead of the AIC, Mallows' $C_p$, the BIC, etc. can be used. One particularly simple stepwise regression strategy is to include in each step the variable that is most correlated with the current residual $\mathbf{r} = \mathbf{y} - \hat{\mathbf{y}}$, where $\hat{\mathbf{y}}$ is the LS fit based on the current selection of variables. This strategy is called orthogonal matching pursuit [22] in the engineering literature. Some modern approaches for variable selection in regression that can be related to forward stepwise regression are the LASSO (least absolute shrinkage and selection operator) [23] and LARS (least angle regression) [24]. The LASSO minimizes the usual LS criterion (the sum of squared residuals) with an additional $\ell_1$-norm constraint on the regression coefficients. As a result, some regression coefficients are shrunk to(wards) zero. LARS is a forward stepwise regression method where the selection of variables is done differently: namely, the regression coefficient of a variable is increased until that variable is no longer the one most correlated with the residual $\mathbf{r}$. A simple modification of the LARS algorithm can be used to compute the LASSO solution [24].
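The following minimal sketch illustrates forward stepwise selection driven by the regression-form AIC, $N \log \hat{\sigma}_I^2 + 2|I|$. It is an illustration under our own naming conventions, not a reference implementation; an intercept column, if desired, should be included in X beforehand:

import numpy as np

def aic_regression(y, X_I):
    # AIC for the Gaussian linear model, up to additive constants:
    # N * log(RSS / N) + 2 * |I|
    N = len(y)
    beta, *_ = np.linalg.lstsq(X_I, y, rcond=None)
    rss = np.sum((y - X_I @ beta) ** 2)
    return N * np.log(rss / N) + 2 * X_I.shape[1]

def forward_stepwise(y, X):
    # Greedily add the variable yielding the largest AIC reduction;
    # stop when no candidate improves the criterion.
    selected, remaining = [], list(range(X.shape[1]))
    best_aic = np.inf
    while remaining:
        aic, j = min((aic_regression(y, X[:, selected + [j]]), j) for j in remaining)
        if aic >= best_aic:
            break
        best_aic = aic
        selected.append(j)
        remaining.remove(j)
    return selected, best_aic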

3.02.2.2 Cross-validation and bootstrap methods

Cross-validation (CV) is a widely used technique for model order selection in machine learning, regression analysis, pattern recognition, and density estimation; see [25]. In CV one is interested not only in finding a statistical model for the observed data but also in its prediction performance on other, independent data sets. The goal is to ensure that the model generalizes well and does not adapt too closely to the particular data set at hand. Unlike many other model selection techniques, CV can be considered a computer-based heuristic method that replaces some statistical analyses by repeated computations over subsets of the data.

In CV, one splits the data set into subsets: one training set for the initial statistical analysis and the other (potentially multiple) subsets for testing or validating the analysis. If the test data are independent and identically distributed (i.i.d.), they can be viewed as new data in prediction. If the sample size of the training set is small or the dimension of the initial model is high, it is likely that the obtained model will fit the set of observations with too high fidelity. Obviously, such a model will not fit the other test data sets equally well, and the predictions produced by the initial model fitted on the training set may deviate significantly from the expected values.

The prediction results from the testing (validation) sets are used to evaluate the goodness of the model optimized for the training set. The original model may then be adjusted by combining it with the models estimated from the test sets to produce a single model, for example by averaging. CV may also be used in cases where the response is an indicator of class membership instead of being continuous valued. This is the case in M-ary detection problems and pattern recognition, for example.

There are many ways to subdivide the data into training and test sets. The assignment may be random such that the training set and the test set have an equal number of observations; these two data sets may then be used for training and testing in an alternating manner. A very commonly used approach is so-called leave-one-out (LOO) cross-validation: each individual observation serves as the test set in turn while the remaining data form the training set. This is done exhaustively until every observation has been used as the test set. Obviously, LOO has a high computational complexity, since the training set is different and the training process is repeated each time.

Cross-validation can also be used for model order selection in regression. Let us assume that we have a "supermodel" with a large number of explanatory variables and we want to select the subset of explanatory (predictor) variables that gives the best model in terms of its prediction performance. By using the cross-validation approach with test sets of observations, one may find the subset of explanatory variables that gives the best prediction performance measured, for example, in terms of mean square error. If cross-validation is not used, adding more predictor variables to the model decreases the risk or error measure, such as the mean square error, the sum of squared residuals, or the classification error, on the training set, but the predictive performance on new data may be very different.
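As an illustration of LOO in the regression setting, the sketch below computes the LOO prediction error of a linear model. Rather than refitting N times, it uses the well-known closed-form identity for LS regression, by which the LOO residual equals the ordinary residual divided by one minus the leverage (this shortcut is specific to linear LS fits; the function name is our own):

import numpy as np

def loo_mse(y, X_I):
    # Leave-one-out mean squared prediction error (PRESS statistic / N)
    # for the LS fit y_hat = H y, computed without explicit refitting.
    H = X_I @ np.linalg.solve(X_I.T @ X_I, X_I.T)  # hat (projection) matrix
    e = y - H @ y                                  # ordinary LS residuals
    loo_resid = e / (1.0 - np.diag(H))             # LOO residuals via leverages
    return np.mean(loo_resid ** 2)

The subset of variables minimizing loo_mse can then be chosen, e.g., within the same forward stepwise search as above, with loo_mse replacing the AIC.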

A few remarks on the statistical performance of the CV method are in order. The error measure evaluated on the training set gives an overly optimistic view of the quality of the model optimized for that set: the risk or error measure, such as the MSE, will be smaller for the training set than for the test set. The variance of the cross-validation estimates depends on the data splitting scheme, and it is difficult to analyze the performance of the method in the general case; however, it has been established in some special cases, see [25]. If the model is estimated for each subset of the data, the values of the estimates will vary (the estimator is a random variable).

The CV estimators often have some bias that depends on the size of the training sample relative to the whole sample size; it tends to decrease as the training sample size increases. For the same size of training set the bias remains the same, but differences may be observed in the variance (accuracy). The overall performance of the CV estimators depends on many factors, including the size of the training data set, the size of the validation data set, and the number of data subsets.

Bootstrapping [26] is a resampling method that is typically used for characterizing confidence intervals. It has found applications in hypothesis testing and model order estimation as well; see, e.g., [27]. Assuming independent and identically distributed (i.i.d.) data samples, it resamples the observed data set with replacement to simulate new data, called bootstrap data. The parameter of interest, such as the variance or the mean, is then computed for each bootstrap sample. It is a computer-based method since it requires that a very large number of resamples be drawn and a bootstrap estimate be computed for each of them. The availability of highly powerful computing machinery has made the use of the bootstrap feasible in many applications and has allowed larger numbers of resamples to be used. The bootstrap method does not make any assumptions on the underlying distribution of the observed data: it is a nonparametric method and allows for characterizing the distribution of the observed data. This is particularly useful for non-Gaussian data where the distribution may not be symmetric or unimodal. An example of model order estimation in a sensor array signal processing application is provided in [27].
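The following sketch shows a basic percentile bootstrap confidence interval for a statistic of i.i.d. data (a minimal illustration; parameter defaults such as B = 2000 are arbitrary choices of ours):

import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(x, stat=np.mean, B=2000, alpha=0.05):
    # Resample the data with replacement B times, compute the statistic
    # for each bootstrap sample, and return the percentile interval.
    n = len(x)
    stats = np.array([stat(rng.choice(x, size=n, replace=True)) for _ in range(B)])
    return tuple(np.quantile(stats, [alpha / 2, 1 - alpha / 2]))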

3.02.3 Methods based on statistical inference paradigms

In this section, we review the model selection methods based on the two mainstream statistical inference paradigms (the Bayesian and likelihood principles). The widely used asymptotic approximation of the Bayesian Information Criterion (BIC) due to Schwarz [7] is reviewed first, followed by the model selection method based on the sequential use of the generalized likelihood ratio test (GLRT). The latter is one of the oldest approaches to model selection.

3.02.3.1 Bayesian Information Criterion (BIC)

We start by recalling the steps and assumptions that lead to the classic asymptotic approximation of the BIC due to Schwarz [7], which can be stated as follows:

$$\mathrm{BIC}_m = -2 L_m(\hat{\boldsymbol{\theta}}_m) + k_m \log N,$$

so that $\mathrm{BIC} = -2 \times (\text{maximum log-likelihood}) + (\text{number of estimated parameters}) \times \log N$, and the hypothesis $\mathcal{H}_m$ (mth model) that minimizes the criterion is chosen. This approximation is based on a second-order Taylor expansion of the log posterior density and some regularity assumptions on the log-likelihood and the log posterior. Recall from the earlier discussion on nested hypotheses that the term $-2 L_m(\hat{\boldsymbol{\theta}}_m)$ monotonically decreases with increasing m (model index) and hence cannot be used alone, as it leads to the choice of the model with the largest number of parameters. The second term is the penalty term that increases with $k_m$ and hence penalizes the use of too many parameters.

In the Bayesian framework, the parameter vector is taken to be a random vector (r.v.) with a prior density $p(\boldsymbol{\theta}_m \mid \mathcal{H}_m)$. For the mth model, let

$$\Lambda_m(\boldsymbol{\theta}_m) = L_m(\boldsymbol{\theta}_m) + \log p(\boldsymbol{\theta}_m \mid \mathcal{H}_m)$$

denote the log posterior density (neglecting the irrelevant additive constant term). The full BIC is obtained by integrating out the nuisance variable $\boldsymbol{\theta}_m$ in the posterior density of $\mathcal{H}_m$ given the data $\mathbf{y}$ and the model m:

$$\mathrm{BIC}_m = \int \exp\{\Lambda_m(\boldsymbol{\theta}_m)\}\, d\boldsymbol{\theta}_m = \int f_m(\mathbf{y} \mid \boldsymbol{\theta}_m)\, p(\boldsymbol{\theta}_m \mid \mathcal{H}_m)\, d\boldsymbol{\theta}_m. \tag{2.3}$$

The mth model that maximizes $\mathrm{BIC}_m$ over $m = 1, \ldots, q$ is selected for the data. In most cases the integral cannot be solved in closed form and approximations are sought.

In the Bayesian approach, the maximum a posteriori (MAP) estimate $\tilde{\boldsymbol{\theta}}_m = \arg\max_{\boldsymbol{\theta}_m} \Lambda_m(\boldsymbol{\theta}_m)$ is the natural point estimate of $\boldsymbol{\theta}_m$, and we assume that it is unique. An approximation of $\Lambda_m(\boldsymbol{\theta}_m)$ around the MAP estimate based on a second-order Taylor expansion yields

$$\Lambda_m(\boldsymbol{\theta}_m) \approx \Lambda_m(\tilde{\boldsymbol{\theta}}_m) - \frac{1}{2} (\boldsymbol{\theta}_m - \tilde{\boldsymbol{\theta}}_m)^{\top} \tilde{\mathbf{H}}_m (\boldsymbol{\theta}_m - \tilde{\boldsymbol{\theta}}_m).$$

Note that the first-order term vanishes when evaluated at the maximum. Let us assume that $\Lambda_m(\boldsymbol{\theta}_m)$ is twice continuously differentiable ($C^2$). This means that its second-order partial derivatives w.r.t. $\boldsymbol{\theta}_m$ exist on $\Theta_m$ and that the resulting $k_m \times k_m$ (negative) Hessian matrix $\tilde{\mathbf{H}}_m = -\nabla^2 \Lambda_m(\tilde{\boldsymbol{\theta}}_m)$ is symmetric. Substituting the approximation into (2.3) yields

$$\mathrm{BIC}_m \approx \exp\{\Lambda_m(\tilde{\boldsymbol{\theta}}_m)\} \int \exp\left\{ -\frac{1}{2} (\boldsymbol{\theta}_m - \tilde{\boldsymbol{\theta}}_m)^{\top} \tilde{\mathbf{H}}_m (\boldsymbol{\theta}_m - \tilde{\boldsymbol{\theta}}_m) \right\} d\boldsymbol{\theta}_m = \exp\{\Lambda_m(\tilde{\boldsymbol{\theta}}_m)\}\, (2\pi)^{k_m/2}\, |\tilde{\mathbf{H}}_m|^{-1/2},$$

where the last equality follows by noticing that the integrand is, up to the constant factor $(2\pi)^{k_m/2} |\tilde{\mathbf{H}}_m|^{-1/2}$, the density of the $k_m$-variate $\mathcal{N}(\tilde{\boldsymbol{\theta}}_m, \tilde{\mathbf{H}}_m^{-1})$ distribution, where $|\cdot|$ denotes the matrix determinant. Equivalently, we can minimize $-2$ times the log of the BIC. In logarithmic form, the BIC becomes

$$-2 \log \mathrm{BIC}_m \approx -2 \Lambda_m(\tilde{\boldsymbol{\theta}}_m) - k_m \log 2\pi + \log |\tilde{\mathbf{H}}_m|. \tag{2.4}$$

An alternative approximation for the BIC can be derived with the following assumption:

(A) the prior density $p(\boldsymbol{\theta}_m \mid \mathcal{H}_m)$ is sufficiently flat around the MAP estimate $\tilde{\boldsymbol{\theta}}_m$ and does not depend on N for $m = 1, \ldots, q$.

Due to (A), when evaluating the integral in (2.3), we may extract the prior term from the integrand as a constant $p(\hat{\boldsymbol{\theta}}_m \mid \mathcal{H}_m)$ (with the MAP estimate replaced by the MLE $\hat{\boldsymbol{\theta}}_m$), yielding the approximation

$$\mathrm{BIC}_m \approx p(\hat{\boldsymbol{\theta}}_m \mid \mathcal{H}_m) \int \exp\{L_m(\boldsymbol{\theta}_m)\}\, d\boldsymbol{\theta}_m \tag{2.5}$$

$$\approx p(\hat{\boldsymbol{\theta}}_m \mid \mathcal{H}_m)\, \exp\{L_m(\hat{\boldsymbol{\theta}}_m)\}\, (2\pi)^{k_m/2}\, |\mathcal{J}(\hat{\boldsymbol{\theta}}_m)|^{-1/2}, \tag{2.6}$$

where $\mathcal{J}(\hat{\boldsymbol{\theta}}_m)$ is the observed Fisher information matrix of the mth model evaluated at the unique MLE $\hat{\boldsymbol{\theta}}_m$. Note that (2.6) is an approximation of (2.5) using (similarly as earlier) a second-order Taylor expansion of $L_m(\boldsymbol{\theta}_m)$ around the MLE. Thus, in terms of logarithms, we have

$$\log \mathrm{BIC}_m \approx \log p(\hat{\boldsymbol{\theta}}_m \mid \mathcal{H}_m) + \frac{k_m}{2} \log 2\pi - \frac{1}{2} \log |\mathcal{J}(\hat{\boldsymbol{\theta}}_m)| + L_m(\hat{\boldsymbol{\theta}}_m). \tag{2.7}$$

Note that approximations (2.4) and (2.7) differ in that the former (resp. the latter) is obtained by expanding the full log posterior (resp. the log-likelihood) to second-order terms. Furthermore, (2.7) relies upon (A).

Next, note that under some regularity conditions and due to the consistency of the MLE $\hat{\boldsymbol{\theta}}_m$, we can in most cases approximate $\mathcal{J}(\hat{\boldsymbol{\theta}}_m)$ for a sufficiently large N as

$$\mathcal{J}(\hat{\boldsymbol{\theta}}_m) \approx N\, \mathcal{I}_1(\hat{\boldsymbol{\theta}}_m), \tag{2.8}$$

where $\mathcal{I}_1(\boldsymbol{\theta})$ denotes the FIM for a single observation. Substituting (2.8) into (2.7) gives

$$\log \mathrm{BIC}_m \approx \log p(\hat{\boldsymbol{\theta}}_m \mid \mathcal{H}_m) + \frac{k_m}{2} \log 2\pi - \frac{1}{2} \log |\mathcal{I}_1(\hat{\boldsymbol{\theta}}_m)| + L_m(\hat{\boldsymbol{\theta}}_m) - \frac{k_m}{2} \log N,$$

in which the three first terms are independent of N [recall assumption (A)]. We may therefore neglect them for large N, yielding, after multiplying by $-2$, the earlier stated classic approximation due to Schwarz:

$$\mathrm{BIC}_m = -2 L_m(\hat{\boldsymbol{\theta}}_m) + k_m \log N. \tag{2.9}$$

Let us recall the assumptions behind the above result. First of all, the log-likelihood function $L_m(\boldsymbol{\theta}_m)$ ought to be a $C^2$-function in a neighborhood of the unique MLE $\hat{\boldsymbol{\theta}}_m$. Second, the prior density is assumed to verify Assumption (A). Third, sufficient regularity conditions from likelihood theory, e.g., those guaranteeing consistency of $\hat{\boldsymbol{\theta}}_m$, are needed for the approximation (2.8) to hold. Note that the classic BIC approximation, in contrast to approximations (2.4) or (2.7), is based on asymptotic arguments, i.e., N is assumed to be sufficiently large; how large an N is required naturally depends on the application at hand. The benefit of such an approximation is its simplicity and the good performance observed in many applications. As we shall see in Section 3.02.4, Rissanen's MDL, derived using arguments from coding theory, can be asymptotically approximated in the same form as the BIC in (2.9). This sometimes leads to the misunderstanding that the MDL and the BIC are the same.

If we assume that the set of hypotheses $\{\mathcal{H}_1, \ldots, \mathcal{H}_q\}$ includes the true model $\mathcal{H}_{m_0}$ that generated the data, then the probability that the true $m_0$th model is selected can often be calculated analytically. For example, in the regression setting, the BIC estimate has been proven to be consistent [28], that is, $P(\hat{m} = m_0) \to 1$ as $N \to \infty$. This can be contrasted with the AIC estimate, which fails to be consistent due to a tendency to overfit [28].
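As a small illustration of the Schwarz approximation (2.9) in the regression setting, the sketch below selects a polynomial order by minimizing $N \log \hat{\sigma}_I^2 + |I| \log N$ over candidate orders (the simulated data and all names are our own; the true order here is 2):

import numpy as np

rng = np.random.default_rng(1)

def bic_regression(y, X_I):
    # BIC for the Gaussian linear model, up to additive constants:
    # N * log(RSS / N) + |I| * log(N)
    N = len(y)
    beta, *_ = np.linalg.lstsq(X_I, y, rcond=None)
    rss = np.sum((y - X_I @ beta) ** 2)
    return N * np.log(rss / N) + X_I.shape[1] * np.log(N)

N = 200
t = np.linspace(-1.0, 1.0, N)
y = 1.0 - 2.0 * t + 3.0 * t**2 + 0.3 * rng.standard_normal(N)
# np.vander(t, m + 1) contains the columns t**m, ..., t, 1
bic = [bic_regression(y, np.vander(t, m + 1)) for m in range(6)]
m_hat = int(np.argmin(bic))  # estimated polynomial order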

3.02.3.2 GLRT-based sequential hypothesis testing

Let us assume that the model p.d.f.'s $f_m(\mathbf{y} \mid \boldsymbol{\theta}_m)$, $m = 1, \ldots, q$, belong to the same parametric family (i.e., share the same functional form, so $f_m = f$ for all m) and are ordered in the sense that $k_1 < k_2 < \cdots < k_q$. Hence, we have a nested set of hypotheses $\mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots \subset \mathcal{H}_q$. We can consider $\mathcal{H}_q$ as the "full model" and $\mathcal{H}_m$, $m < q$, as a parsimonious model presuming that a subset of the parameters of the full model are equal to zero. This type of situation arises in many engineering applications; a practical example is given in Section 3.02.5. In such a scenario, the GLRT statistic is defined as $-2$ times the log of the ratio of the likelihood functions maximized under $\mathcal{H}_m$ and under the full model $\mathcal{H}_q$:

$$\ell_m = -2 \log \frac{\max_{\boldsymbol{\theta}_m} f_m(\mathbf{y} \mid \boldsymbol{\theta}_m)}{\max_{\boldsymbol{\theta}_q} f_q(\mathbf{y} \mid \boldsymbol{\theta}_q)} = 2 \left\{ L_q(\hat{\boldsymbol{\theta}}_q) - L_m(\hat{\boldsymbol{\theta}}_m) \right\}, \tag{2.10}$$

where $\hat{\boldsymbol{\theta}}_m$ and $\hat{\boldsymbol{\theta}}_q$ denote the MLEs under the respective models and $L_m$ denotes the log-likelihood. By general maximum likelihood theory, $\ell_m$ has a $\chi^2_{d_m}$ distribution asymptotically under $\mathcal{H}_m$ (i.e., a chi-squared distribution with $d_m$ degrees of freedom), where

$$d_m = k_q - k_m.$$

The null hypothesis $\mathcal{H}_m$ is rejected if $\ell_m$ exceeds a rejection threshold $\gamma_\alpha$, where $\gamma_\alpha$ is the $(1-\alpha)$th quantile of the $\chi^2_{d_m}$ distribution (i.e., $P(\ell_m > \gamma_\alpha) = \alpha$ when $\mathcal{H}_m$ holds). Such a detector has an asymptotic probability of false alarm (PFA) equal to $\alpha$. For example, choosing a conventional PFA such as $\alpha = 0.05$ is quite reasonable for many applications; however, in radar applications a smaller PFA is often desired.

The sequential GLRT-based model selection procedure chooses the first hypothesis $\mathcal{H}_m$ (and hence $k_m$ as the model order) that is not rejected by the GLRT. In other words, we first test $\mathcal{H}_1$, and if it is rejected by the GLRT statistic (2.10) ($\ell_1 > \gamma_\alpha$ for some predetermined PFA $\alpha$), we then test $\mathcal{H}_2$, etc., until acceptance, i.e., until $\ell_m \leq \gamma_\alpha$. In case all hypotheses are rejected, we accept the full model $\mathcal{H}_q$; a minimal sketch of this procedure is given below. Note that the above decision sequence is not statistically independent, even asymptotically. Hence, even though the asymptotic PFA of each individual GLRT for testing $\mathcal{H}_m$ is $\alpha$, it does not follow that the asymptotic PFA of the GLRT-based model selection is $\alpha$. It is also possible to use a backward testing strategy, i.e., to test the sequence of hypotheses from the reverse direction; then we select the model in the sequence tested just before the first rejection. That is, we first test $\mathcal{H}_{q-1}$, and if it is accepted, then test $\mathcal{H}_{q-2}$, and so on; we continue until the first rejection and select the hypothesis tested prior to the rejection. If it is known a priori that a large model order is more likely than a small model order, the backward strategy may be preferred from a purely computational point of view.
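A minimal sketch of the forward sequential procedure, assuming the maximized log-likelihoods and parameter counts of the nested models are already available (the SciPy chi-squared quantile function is used for the threshold; the function name is our own):

import numpy as np
from scipy.stats import chi2

def sequential_glrt(loglik_hat, dims, alpha=0.05):
    # loglik_hat[m]: maximized log-likelihood L_m(theta_hat_m), m = 0..q-1,
    # with the last entry corresponding to the full model.
    # dims[m]: number of parameters k_m (strictly increasing).
    q = len(loglik_hat)
    for m in range(q - 1):
        ell = 2.0 * (loglik_hat[-1] - loglik_hat[m])  # GLRT statistic (2.10)
        dof = dims[-1] - dims[m]
        if ell <= chi2.ppf(1.0 - alpha, dof):         # (1 - alpha) quantile threshold
            return m                                  # first non-rejected hypothesis
    return q - 1                                      # all rejected: accept full model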

3.02.4 Information and coding theory based methods

Information and coding theory based approaches to model order selection are frequently used. We consider two such techniques: the information criterion proposed by Akaike (AIC) [8,9] and the Minimum Description Length (MDL) methods proposed by Rissanen [10,11]. Enhanced versions of both methods have been introduced in order to provide more accurate results for finite sample sizes and, in some cases, to relax unnecessary assumptions on the underlying distribution models.

3.02.4.1 Akaike Information Criterion (AIC)

The AIC is based on an asymptotic approximation of the K-L divergence. Let $g(\mathbf{y})$ denote the true p.d.f. of the data and let $f(\mathbf{y} \mid \boldsymbol{\theta})$ denote the approximating model. The Kullback-Leibler (K-L) divergence (or relative entropy) can be defined as the integral (as $g$ and $f$ are continuous functions)

$$D(g \,\|\, f_{\boldsymbol{\theta}}) = \int g(\mathbf{y}) \log \frac{g(\mathbf{y})}{f(\mathbf{y} \mid \boldsymbol{\theta})}\, d\mathbf{y},$$

and it can be viewed as the information loss when the model $f$ is used to approximate $g$, or as a "distance" (discrepancy) between $f$ and $g$, since it verifies

$$D(g \,\|\, f_{\boldsymbol{\theta}}) \geq 0, \quad \text{with equality if and only if } f(\cdot \mid \boldsymbol{\theta}) \equiv g(\cdot).$$

Clearly, among the models in the set of hypotheses, the best one loses the least information (has the smallest value of $D(g \,\|\, f_{\boldsymbol{\theta}})$) relative to the other models. Note that the K-L divergence can be expressed as

$$D(g \,\|\, f_{\boldsymbol{\theta}}) = \mathrm{E}_g[\log g(\mathbf{y})] - \mathrm{E}_g[\log f(\mathbf{y} \mid \boldsymbol{\theta})],$$

where the model dependent part is $\mathrm{E}_g[\log f(\mathbf{y} \mid \boldsymbol{\theta})]$. Hence, finding the minimum of $D(g \,\|\, f_{\boldsymbol{\theta}})$ over the set of models considered is equivalent to finding the maximum of the function

$$T(f) = \mathrm{E}_g[\log f(\mathbf{y} \mid \boldsymbol{\theta})]$$

among all the models. The function $T(f)$, sometimes referred to as the relative K-L information, cannot be used directly in model selection because it requires full knowledge of the true data p.d.f. $g$ and of the true value of the parameter $\boldsymbol{\theta}$ in the approximating model $f$, both of which are unknown. Hence we settle for its estimate

$$T = \mathrm{E}_{\mathbf{y}}\big[ \mathrm{E}_{\mathbf{w}}[\log f(\mathbf{w} \mid \hat{\boldsymbol{\theta}}(\mathbf{y}))] \big], \tag{2.11}$$

where $\mathbf{w}$ is an (imaginary) independent data vector from the same unknown true distribution $g$ as the observed data $\mathbf{y}$, and $\hat{\boldsymbol{\theta}}(\mathbf{y})$ denotes the MLE of $\boldsymbol{\theta}$ based on the assumed model $f$ and the data $\mathbf{y}$. Note that the statistical expectations above are w.r.t. the true (unknown) p.d.f. $g$, and we maximize (2.11) instead of $T(f)$. The key finding is that for a large sample size N, the maximized log-likelihood $L(\hat{\boldsymbol{\theta}})$ is a biased estimate of (2.11), where the bias is approximately equal to $k$, the number of estimable parameters in the model. The maximum log-likelihood corrected by the asymptotic bias, i.e., $L(\hat{\boldsymbol{\theta}}) - k$, can be viewed (under regularity conditions) as an asymptotically unbiased estimator of (2.11) when N is large. Multiplying by $-2$ gives the famous Akaike Information Criterion

$$\mathrm{AIC} = -2 L(\hat{\boldsymbol{\theta}}) + 2k.$$

Unlike the BIC, the AIC is not consistent: the probability of correctly identifying the true model does not converge to one as the sample length grows. It can be shown that, for nested hypotheses and under fairly general conditions,

$$\lim_{N \to \infty} P(\hat{m} > m_0) > 0,$$

that is, the AIC tends to overfit, i.e., to choose a model order larger than the true model order; see, e.g., [15].

Some enhanced versions have been introduced in which the penalty term is larger in order to reduce the chance of overfitting. The corrected AIC rule [29,30],

$$\mathrm{AIC}_c = -2 L(\hat{\boldsymbol{\theta}}) + 2k\, \frac{N}{N - k - 1} \tag{2.12}$$

$$= \mathrm{AIC} + \frac{2k(k+1)}{N - k - 1}, \tag{2.13}$$

provides a finite-sample (second-order bias correction) adjustment to the AIC; asymptotically they are the same, since $N/(N-k-1) \to 1$ as $N \to \infty$, but $N/(N-k-1) > 1$ for finite N. Consequently, the penalty term of $\mathrm{AIC}_c$ is always larger than that of the AIC, which means that the criterion is less likely to overfit. On the other hand, this comes with an increased risk of underfitting, i.e., choosing a smaller model order than the true one. Since the underfitting risk does not seem to be unduly large in many practical experiments, many researchers recommend the use of $\mathrm{AIC}_c$ instead of the AIC. It should be noted, however, that $\mathrm{AIC}_c$ possesses a sound theoretical justification only in the regression model with normal errors, where it is an unbiased estimate of the relative K-L information; note that the AIC is by its construction only an asymptotically unbiased estimate of the relative K-L information. Outside the regression model, the reasoning for $\mathrm{AIC}_c$ rests merely on its property of reducing the risk of overfitting. The so-called generalized information criterion (GIC) [2], defined as

$$\mathrm{GIC}(\nu) = -2 L(\hat{\boldsymbol{\theta}}) + \nu\, k,$$

can often outperform the AIC for $\nu > 2$. For example, the choice $\nu = 2$ yields the AIC, and the choice $\nu = \log N$ yields the MDL criterion.
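In code, both criteria are one-liners once the maximized log-likelihood and the parameter count are available; a small sketch with our own function names:

import numpy as np

def aicc(loglik_hat, k, N):
    # corrected AIC (2.12): finite-sample adjustment of the AIC
    return -2.0 * loglik_hat + 2.0 * k * N / (N - k - 1)

def gic(loglik_hat, k, nu):
    # generalized information criterion: nu = 2 gives the AIC,
    # nu = log(N) gives the MDL/BIC form
    return -2.0 * loglik_hat + nu * k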

3.02.4.2 Minimum Description Length

The Minimum Description Length (MDL) method [11] stems from algorithmic information and coding theory. The basic idea is to find any regularity in the observed data that can be used to compress the data, allowing the data to be condensed so that fewer symbols are needed to represent it. The code used for compressing the data should be such that the total length (in bits) of the description of the code and the description of the data is minimal. There exist a simple and a refined version of the MDL method. The concept of stochastic complexity will be employed in estimating the model order; it defines the code length of the data with respect to the employed probability distribution model. It also establishes a connection to statistical model order estimation methods such as the BIC: in some special cases, the approximation of the stochastic complexity and the BIC lead to the same criterion. Detailed descriptions of the MDL method can be found in [1,4].

The MDL method does not make any assumptions on the underlying distribution of the data. In fact, the distribution is considered to be of lesser importance, since the task at hand is not to estimate the distribution that generated the data but rather to extract useful information or a model from the data by compressing it efficiently. However, codes used for compressing data and probability mass functions are related. The Kraft inequality establishes a correspondence between code lengths and probabilities, whereas the information inequality relates the optimal code, in terms of the minimum expected code length, to the distribution of the data. If the data obey the distribution $p(\mathbf{y})$, then a code with length $-\log p(\mathbf{y})$ achieves the minimum expected code length.

A simple version of the MDL principle may now be stated as follows. Let $\mathcal{M}$ be a class of candidate models or codes used to explain the data $\mathbf{y}$. The best model $M^{*} \in \mathcal{M}$ is the one that minimizes

$$\ell(M) + \ell(\mathbf{y} \mid M),$$

i.e., the sum of the length of the description of the model, $\ell(M)$, and the length of the data encoded using the model, $\ell(\mathbf{y} \mid M)$. The trade-off between the complexity of the code and the complexity of the data is thus built into the MDL method. Hence, it is less likely to overestimate the model order and overfit the data than, for example, the AIC.

A probability distribution in which the parameter $\boldsymbol{\theta}$ may vary defines a family of distributions. The task of interest is to encode the data optimally, using the whole family of distributions in order to make the process efficient. An obvious choice from the distribution family would be to use the code corresponding to $f(\mathbf{y} \mid \hat{\boldsymbol{\theta}}(\mathbf{y}))$, where $\hat{\boldsymbol{\theta}}(\mathbf{y})$ is the MLE of $\boldsymbol{\theta}$ for the observed data, i.e., the member of the family that most likely would have produced the data set. However, the code corresponding to the maximized likelihood then depends on the data, and hence the code cannot be specified before observing the data; this makes the approach impractical. One can avoid this problem by using a universal distribution that is representative of the entire family of distributions and allows coding the data set almost as efficiently as the code associated with the maximized likelihood. The refined MDL uses the normalized maximum likelihood approach for finding such a universal distribution to assist the coding. Using a universal distribution leads to an excess code length that is often called the regret. The Kullback-Leibler divergence $D(f_{\hat{\boldsymbol{\theta}}} \,\|\, q)$ may be used as a starting point, where $q$ is employed to approximate the maximized likelihood $f(\mathbf{y} \mid \hat{\boldsymbol{\theta}}(\mathbf{y}))$. Finding the optimal universal distribution $\bar{q}$ can be formulated as the minimax problem

$$\bar{q} = \arg\min_{q} \max_{\mathbf{y}} \log \frac{f(\mathbf{y} \mid \hat{\boldsymbol{\theta}}(\mathbf{y}))}{q(\mathbf{y})},$$

where the minimum is taken over the set of all feasible distributions $q$; the maximizing data set represents the worst case for the code based on $q$. Since $q$ is allowed to be almost any distribution within a class of feasible distributions, the assumption on the true distribution generating the data may be considered relaxed. The so-called normalized maximum likelihood (NML) distribution solves the above optimization problem. It may be written as [31]

$$\bar{q}(\mathbf{y}) = \frac{f(\mathbf{y} \mid \hat{\boldsymbol{\theta}}(\mathbf{y}))}{\sum_{Y} f(Y \mid \hat{\boldsymbol{\theta}}(Y))},$$

where $\hat{\boldsymbol{\theta}}(Y)$ denotes the maximum likelihood estimate computed from the observation set $Y$. The denominator is a normalizing factor: the sum (or, for continuous data, the integral) of the maximized likelihoods over all potential observation sets $Y$ that could be acquired in experiments, while the numerator is the maximized likelihood of the observed data. With this solution, the worst case regret is minimized. It has further been shown that by solving the related minimax problem

$$\min_{q} \max_{g} \mathrm{E}_{g}\left[ \log \frac{f(\mathbf{y} \mid \hat{\boldsymbol{\theta}}(\mathbf{y}))}{q(\mathbf{y})} \right],$$

the worst case of the expected regret is minimized instead of the regret for the observed data set; see [32].
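For discrete data the NML normalizer is a finite sum and can be computed directly. The sketch below evaluates the stochastic complexity (in nats) of a binary sequence under the Bernoulli model, grouping the 2^N sequences by their number of ones (a minimal illustration; the function name is our own):

import math

def nml_bernoulli(y):
    # y: sequence of 0/1 observations; theta_hat(y) = (number of ones) / N
    N, s = len(y), sum(y)

    def max_lik(s_, N_):
        t = s_ / N_
        return (t ** s_) * ((1.0 - t) ** (N_ - s_))  # 0**0 evaluates to 1

    # normalizer: sum of maximized likelihoods over all 2^N sequences,
    # grouped by the sufficient statistic s'
    C = sum(math.comb(N, s_) * max_lik(s_, N) for s_ in range(N + 1))
    return -math.log(max_lik(s, N)) + math.log(C)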

The code length associated with the normalized maximum likelihood is called the stochastic complexity of the data for the distribution family; as discussed above, it minimizes the worst case expected regret. It is given by

$$-\log \bar{q}(\mathbf{y}) = -\log f(\mathbf{y} \mid \hat{\boldsymbol{\theta}}(\mathbf{y})) + \log \sum_{Y} f(Y \mid \hat{\boldsymbol{\theta}}(Y)).$$

The MDL method first employs the normalized maximum likelihood distributions to order the candidate models and then chooses the model that gives the minimal stochastic complexity. An early result by Rissanen [11], and a frequently used approximation of the stochastic complexity, is the following:

$$-\log f(\mathbf{y} \mid \hat{\boldsymbol{\theta}}(\mathbf{y})) + \frac{k}{2} \log N,$$

which (after multiplying by 2) is the same result as the approximation of the BIC by Schwarz given earlier in this chapter. Other approximations of the stochastic complexity that are more accurate avoid some pitfalls of the above approximation, for example

$$-\log f(\mathbf{y} \mid \hat{\boldsymbol{\theta}}(\mathbf{y})) + \frac{k}{2} \log \frac{N}{2\pi} + \log \int_{\Theta} \sqrt{|\mathcal{I}_1(\boldsymbol{\theta})|}\, d\boldsymbol{\theta} + o(1),$$

where $\mathcal{I}_1(\boldsymbol{\theta})$ denotes the Fisher information matrix for sample size 1. The $o(1)$ term contains the higher order terms and vanishes as $N \to \infty$. The integral over the Fisher information term needs to be finite. See [33] for more detail on other approximations and their computation. The MDL principle and the normalized maximum likelihood (NML) lead to the same expression as the Bayesian Information Criterion (BIC) in some special cases, for example when Jeffreys' noninformative prior is used in the Bayesian model and N is large. In general, however, the NML and the BIC have different formulations. The MDL has been shown to provide consistent estimates of the model order, e.g., for estimating the dimensionality of the signal subspace in the presence of white noise [34]. This problem is reviewed in detail in the next section.

3.02.5 Example: estimating number of signals in subspace methods

We consider the problem of estimating the dimensionality of the signal subspace. Such a problem arises commonly in array processing, time-delay estimation, frequency estimation, and channel order estimation, for example. In these applications, if the signals are noncoherent and narrowband, the dimension of the signal subspace determines the number of signals.

We consider the model where the received data $\mathbf{y} \in \mathbb{C}^M$ consists of an (unobservable) random zero-mean signal vector $\mathbf{v}$ plus additive zero-mean white noise $\mathbf{n}$ ($\mathrm{cov}(\mathbf{n}) = \sigma^2 \mathbf{I}$ with $\sigma^2$ unknown) uncorrelated with the signal $\mathbf{v}$, which is supposed to lie in an m-dimensional subspace of $\mathbb{C}^M$ ($m < M$), called the signal subspace. The covariance matrix of the data $\mathbf{y} = \mathbf{v} + \mathbf{n}$ is then $\boldsymbol{\Sigma} = \boldsymbol{\Sigma}_s + \sigma^2 \mathbf{I}$, where the signal covariance matrix $\boldsymbol{\Sigma}_s = \mathrm{cov}(\mathbf{v})$ is of rank m. Let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_M$ denote the ordered eigenvalues of $\boldsymbol{\Sigma}$. If $m < M$, then the smallest eigenvalue $\lambda_M$ equals $\sigma^2$ and has multiplicity $M - m$, whereas the m largest eigenvalues are equal to the nonzero eigenvalues of $\boldsymbol{\Sigma}_s$ plus $\sigma^2$. Hence, testing that the signal subspace has dimensionality m ($0 \leq m \leq M - 1$) can be formulated as the hypothesis of the equality of the eigenvalues

$$\mathcal{H}_m : \lambda_{m+1} = \lambda_{m+2} = \cdots = \lambda_M, \tag{2.14}$$

and a GLRT can be formed against the general alternative of an arbitrary $\boldsymbol{\Sigma}$. Thus, we have a set of hypotheses $\{\mathcal{H}_0, \mathcal{H}_1, \ldots, \mathcal{H}_{M-1}\}$, and a model selection method can be used to determine the best model given the data. To this end, note that $\mathcal{H}_0$ corresponds to the noise-only hypothesis. Let us assume that $\mathbf{v}$ and $\mathbf{n}$ possess M-variate circular complex normal (CN) distributions, in which case the received data $\mathbf{y}$ has a zero-mean CN distribution with covariance matrix $\boldsymbol{\Sigma}$, i.e., $\mathbf{y} \sim \mathcal{CN}_M(\mathbf{0}, \boldsymbol{\Sigma})$. Further suppose that an i.i.d. random sample $\mathbf{y}_1, \ldots, \mathbf{y}_N$ is observed from this distribution.

For the above problem, the GLRT-based sequential test or the MDL and AIC criteria have been widely used in sensor array signal processing and subspace estimation methods (see, e.g., [15–17]), where the task at hand is to determine the number of signals m impinging on the sensor array. In this case, the signal $\mathbf{v}$ has the representation $\mathbf{v} = \sum_{i=1}^{m} \mathbf{a}(\vartheta_i) s_i = \mathbf{A}\mathbf{s}$, where $\mathbf{a}(\vartheta_i)$ denotes the array response (steering vector) for a random source signal $s_i$ impinging on the array from the direction-of-arrival (DOA) $\vartheta_i$, $i = 1, \ldots, m$. In this application, the sample $\mathbf{y}_1, \ldots, \mathbf{y}_N$ represents the array snapshots at time instants $t = 1, \ldots, N$, and the sample covariance matrix (SCM) is the conventional estimate of the array covariance matrix $\boldsymbol{\Sigma}$. The signal covariance matrix has the representation $\boldsymbol{\Sigma}_s = \mathbf{A}\, \mathrm{cov}(\mathbf{s})\, \mathbf{A}^{H}$, where $\mathbf{A} = [\mathbf{a}(\vartheta_1) \cdots \mathbf{a}(\vartheta_m)]$ denotes the array steering matrix and $\mathbf{s} = (s_1, \ldots, s_m)^{\top}$ the vector of source signals. Another type of problem, determining the degree of noncircularity of complex random signals, is treated in detail in [35].

Let us proceed by first constructing the sequential GLRT-based model order selection procedure for the problem at hand. The p.d.f. of $\mathbf{y}_i \sim \mathcal{CN}_M(\mathbf{0}, \boldsymbol{\Sigma})$ is $f(\mathbf{y}_i) = \pi^{-M} |\boldsymbol{\Sigma}|^{-1} \exp(-\mathbf{y}_i^{H} \boldsymbol{\Sigma}^{-1} \mathbf{y}_i)$, and thus the log-likelihood function of the sample is

$$L(\boldsymbol{\theta}) = -N \left\{ \log |\boldsymbol{\Sigma}| + \mathrm{tr}(\boldsymbol{\Sigma}^{-1} \hat{\boldsymbol{\Sigma}}) \right\} - NM \log \pi,$$

where $\hat{\boldsymbol{\Sigma}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{y}_i \mathbf{y}_i^{H}$ denotes the sample covariance matrix (SCM), $\mathrm{tr}(\cdot)$ denotes the matrix trace, and $\boldsymbol{\theta}$ denotes the vector consisting of the functionally independent real-valued parameters in $\boldsymbol{\Sigma}$. Naturally, $\boldsymbol{\theta}$ has a different parameter space $\Theta_m$ under each hypothesis $\mathcal{H}_m$ and hence also a different order $k_m$. Under the no-structure hypothesis (arbitrary $\boldsymbol{\Sigma}$), $k = M^2$. This follows as $\boldsymbol{\Sigma}$ is Hermitian, and hence there are $M(M-1)$ estimable parameters due to $\mathrm{Re}(\Sigma_{ij})$ and $\mathrm{Im}(\Sigma_{ij})$ for $i < j$, and M estimable real parameters $\Sigma_{ii}$, $i = 1, \ldots, M$. Under $\mathcal{H}_m$, $k_m$ is the number of functionally independent real parameters in $\boldsymbol{\Sigma}_s$ with rank m, plus 1 due to $\sigma^2$. The eigenvalue decomposition of $\boldsymbol{\Sigma}_s$ is $\boldsymbol{\Sigma}_s = \sum_{i=1}^{m} \delta_i \mathbf{u}_i \mathbf{u}_i^{H}$, where $\delta_1 \geq \cdots \geq \delta_m > 0$ represent its nonzero eigenvalues and $\mathbf{u}_1, \ldots, \mathbf{u}_m$ the corresponding eigenvectors. There are $2Mm$ real parameters in the eigenvectors, but they are tied by $2m$ constraints due to the scale and phase indeterminacy of each eigenvector1 and by an additional $m(m-1)$ constraints due to the orthogonality of the eigenvectors, $\mathbf{u}_i^{H} \mathbf{u}_j = 0$ for $i < j$. Since there are m eigenvalues $\delta_i$ and the noise variance $\sigma^2$, the total number of estimable parameters under $\mathcal{H}_m$ is

$$k_m = 2Mm - 2m - m(m-1) + m + 1 = m(2M - m) + 1 \tag{2.15}$$

for $m = 0, 1, \ldots, M-1$.

Let us then derive the GLRT statistic $\ell_m$ as defined in (2.10). Under $\mathcal{H}_m$, $m = 0, 1, \ldots, M-1$, the maximum log-likelihood is

$$L_m(\hat{\boldsymbol{\theta}}_m) = -N \left\{ \sum_{i=1}^{m} \log d_i + (M - m) \log \hat{\sigma}_m^2 \right\} - NM(\log \pi + 1), \tag{2.16}$$

where $d_1 \geq d_2 \geq \cdots \geq d_M$ denote the eigenvalues of the SCM $\hat{\boldsymbol{\Sigma}}$ and $\hat{\sigma}_m^2 = \frac{1}{M-m} \sum_{i=m+1}^{M} d_i$. The proof of (2.16) follows similarly as in Anderson [36, Theorem 2] for the real case. Given the expression (2.16), it is easy to verify that the GLRT statistic $\ell_m$ simplifies into the form

$$\ell_m = -2N(M - m) \log \Lambda_m, \tag{2.17}$$

where

$$\Lambda_m = \frac{\left( \prod_{i=m+1}^{M} d_i \right)^{1/(M-m)}}{\frac{1}{M-m} \sum_{i=m+1}^{M} d_i} \tag{2.18}$$

is the ratio of the geometric to the arithmetic mean of the noise subspace eigenvalues of the SCM. If $\mathcal{H}_m$ holds, this ratio tends to unity as $N \to \infty$. Now recall from Section 3.02.3.2 that the GLRT statistic $\ell_m$ in (2.17) has a limiting $\chi^2_{d_m}$ distribution, where

$$d_m = M^2 - k_m = (M - m)^2 - 1.$$

The (forward) GLRT-based model selection procedure then chooses the first $m \in \{0, 1, \ldots, M-1\}$ not rejected by the GLRT, i.e., the first m which verifies $\ell_m \leq \gamma_\alpha$, where (as discussed in Section 3.02.3.2) the detection threshold $\gamma_\alpha$ can be set as the $(1-\alpha)$th quantile of the $\chi^2_{(M-m)^2 - 1}$ distribution. This choice yields the asymptotic PFA $\alpha$ for testing the hypothesis (2.14).

Let us now derive the AIC and MDL criteria. First note that the $-NM(\log \pi + 1)$ term in (2.16) can be dropped as it does not depend on m, whereas any constant term not dependent on m (namely, $-2N \sum_{i=1}^{M} \log d_i$) can be added; thus $-2 L_m(\hat{\boldsymbol{\theta}}_m)$ in (2.16) can be written compactly as the GLRT statistic $\ell_m = -2N(M-m)\log \Lambda_m$ up to an irrelevant additive constant not dependent on m. Further noting that the "+1" can be dropped from the penalty term $k_m = m(2M - m) + 1$ in (2.15), we have the result that the AIC and the MDL criteria can be written in the following forms:

$$\mathrm{AIC}(m) = -2N(M - m) \log \Lambda_m + 2\, m(2M - m),$$

$$\mathrm{MDL}(m) = -N(M - m) \log \Lambda_m + \frac{1}{2}\, m(2M - m) \log N.$$

The AIC/MDL estimate of the model order, i.e., of the dimensionality of the signal subspace, is found by choosing the $m \in \{0, 1, \ldots, M-1\}$ that minimizes the respective criterion above. As already pointed out, on the scale of the GLRT statistic $\ell_m$, the penalty term of the MDL is $\frac{1}{2} \log N$ times the penalty term of the AIC. Since $\frac{1}{2} \log N > 1$ already at small sample lengths ($N \geq 8$), the MDL method is less prone to overfitting the data than the AIC. Note that the first term in the AIC criterion above is simply the GLRT statistic $\ell_m$ of (2.17) (scaled by 1/2 in the MDL). Wax and Kailath [15] showed that the MDL rule is consistent in choosing the true dimensionality, whereas the AIC tends to overestimate the dimensionality (in the large sample limit). Later, [34] proved the consistency of the MDL rule also when the CN assumption does not hold.
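A compact sketch of the resulting eigenvalue-based order estimators, in the spirit of Wax and Kailath [15] (the function name and data layout are our own conventions; Y holds the N snapshots as columns):

import numpy as np

def aic_mdl_num_signals(Y):
    M, N = Y.shape
    scm = (Y @ Y.conj().T) / N                  # sample covariance matrix
    d = np.sort(np.linalg.eigvalsh(scm))[::-1]  # eigenvalues d_1 >= ... >= d_M
    aic, mdl = [], []
    for m in range(M):                          # candidate orders m = 0..M-1
        noise = d[m:]
        # log Lambda_m: log of (geometric mean / arithmetic mean), Eq. (2.18)
        log_ratio = np.mean(np.log(noise)) - np.log(np.mean(noise))
        ell = -2.0 * N * (M - m) * log_ratio    # GLRT statistic (2.17)
        aic.append(ell + 2.0 * m * (2 * M - m))
        mdl.append(0.5 * ell + 0.5 * m * (2 * M - m) * np.log(N))
    return int(np.argmin(aic)), int(np.argmin(mdl))

For m = M - 1 the noise set contains a single eigenvalue, so log Lambda_{M-1} = 0 and the data term vanishes, as it should.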

3.02.6 Conclusion

In this chapter, we provided an overview of model order estimation methods. The majority of well-founded methods stem from statistical inference or from coding and information theory. The likelihood function plays a crucial role in most methods based on a solid theoretical foundation. The statistical and information theoretic methods stem from fundamentally different theoretical backgrounds but have been shown to lead to identical results under certain restrictive assumptions and model classes. A practical example of model order estimation was provided for the low-rank model that is commonly used in sensor array signal processing and in many other tasks requiring high-resolution capability.

References

1. Grünwald PD. The Minimum Description Length Principle. MIT Press 2007.

2. Stoica P, Selen Y. Model-order selection: a review of information criterion rules. IEEE Signal Process Mag. 2004;21(4):36–47.

3. Lanterman AD. Schwarz, Wallace and Rissanen: intertwining themes in theories of model selection. Int Stat Rev. 2001;69(2):185–212.

4. Grünwald PD, Myung I, Pitt M, eds. Advances in Minimum Description Length: Theory and Applications. MIT Press 2005.

5. Linhart H, Zucchini W. Model Selection. New York: Wiley; 1986.

6. Burnham K, Anderson D. Model Selection and Inference: A Practical Information-Theoretic Approach. New York: Springer; 1998.

7. Schwarz G. Estimating the dimension of a model. Ann Stat. 1978;6(2):461–464.

8. Akaike H. A new look at the statistical model identification. IEEE Trans Automat Control. 1974;19:716–723.

9. Akaike H. On the likelihood of a time series model. The Statistician. 1978;27(3):217–235.

10. Rissanen J. Stochastic complexity and modeling. Ann Stat. 1986;14(3):1080–1100.

11. Rissanen J. Modeling by the shortest data description. Automatica. 1978;14:465–471.

12. Kay SM. Fundamentals of Statistical Signal Processing: Detection Theory. Prentice Hall 1998.

13. Kay S. Exponentially embedded families—new approaches to model order estimation. IEEE Trans Aerosp Electron Syst. 2005;41(1):333–344.

14. Bozdogan H. Model selection and Akaike’s information criterion (AIC): the general theory and its analytical extensions. Psychometrika. 1987;52(3):345–370.

15. Wax M, Kailath T. Detection of signals by information theoretic criteria. IEEE Trans Acoust Speech Signal Process. 1985;33(2):387–392.

16. Zhang QT, Wong KM. Information theoretic criteria for the determination of the number of signals in spatially correlated noise. IEEE Trans Signal Process. 1993;41(4):1652–1663.

17. Xu G, Roy RHI, Kailath T. Detection of number of sources via exploitation of centro-symmetry property. IEEE Trans Signal Process. 1994;42(1):102–112.

18. Shah AA, Tufts DW. Determination of the dimension of a signal subspace from short data records. IEEE Trans Signal Process. 1994;42(9):2531–2535.

19. Weisberg S. Applied Linear Regression. New York: Wiley; 1980.

20. Draper NR, Smith H. Applied Regression Analysis. third ed. New York: Wiley; 1998.

21. Ronchetti E. Robust model selection in regression. Stat Prob Lett. 1985;3:21–23.

22. Tropp JA, Gilbert AC. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Trans Inform Theory. 2007;53(12):4655–4666.

23. Tibshirani R. Regression shrinkage and selection via the lasso. J Roy Stat Soc Ser B. 1996;58:267–288.

24. Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Ann Stat. 2004;32(2):407–451.

25. Arlot S, Celisse A. A survey of cross-validation procedures for model selection. Stat Surv. 2010;4:40–79.

26. Efron B. Bootstrap methods: another look at the jackknife. Ann Stat. 1979;7(1):1–26.

27. Brcich R, Zoubir AM, Pelin P. Detection of sources using bootstrap techniques. IEEE Trans Signal Process. 2002;50(2):206–215.

28. Nishii R. Asymptotic properties of criteria for selection of variables in multiple regression. Ann Stat. 1984;2:758–765.

29. Cavanaugh JE. Unifying the derivations for the Akaike and corrected Akaike information criteria. Stat Prob Lett. 1997;33:201–208.

30. Sugiura N. Further analysis of the data by Akaike information criterion and the finite corrections. Commun Stat Theory Methods. 1978;7:13–26.

31. Shtarkov Y. Block universal sequential coding of single messages. Prob Inform Transmission. 1987;23(3):175–186.

32. Myung J, Navarro D, Pitt M. Model selection by normalized maximum likelihood. J Math Psychol. 2006;50:167–179.

33. Kontkanen P, Buntine W, Myllymaki P, Rissanen J, Tirri H. Efficient computation of stochastic complexity. In: Proceedings of the International Conference on Artificial Intelligence and Statistics. 2003:233–238.

34. Zhao LC, Krishnaiah PR, Bai ZD. On detection of the number of signals in presence of white noise. J Multivar Anal. 1986;20(1):1–25.

35. Novey M, Ollila E, Adali T. On testing the extent of noncircularity. IEEE Trans Signal Process. 2011;59(11):5632–5637.

36. Anderson TW. Asymptotic theory for principal component analysis. Ann Math Stat. 1963;34(1):122–148.


1Recall that the scale indeterminacy is fixed by assuming $\|\mathbf{u}_i\| = 1$, and the phase of $\mathbf{u}_i$ can be fixed similarly by any suitable convention.
