Luiz Wagner Pereira Biscainho, DEL/Poli & PEE/COPPE, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil
This chapter is divided into three main parts. First, the main concepts of Probability are introduced. Next, they are encapsulated into the Random Variable framework; a brief discussion of estimators is provided as an application example. Finally, Random Processes and Sequences are tackled. After their detailed presentation, noise models, modulation, spectral characterization, and sampling of random processes are briefly discussed. The description of linear time-invariant processing of stationary processes and sequences leads to applications like Wiener filtering and special models for random signals.
Probability; Random variable; Random process; Random sequence; Linear time-invariant system
The author thanks Dr. Paulo A.A. Esquef for carefully reviewing this manuscript and Leonardo de O. Nunes for kindly preparing the illustrations.
Probability is an abstract concept useful to model chance experiments. The definition of a numerical representation for the outcomes of such experiments (the random variable) is essential to build a complete and general framework for probabilistic models. Such models can be extended to non-static outcomes in the form of time signals,1 leading to the so-called stochastic process, which can evolve along continuous or discrete time. Its complete description is usually too complicated to be applied to practical situations. Fortunately, a well-accepted set of simplifying properties like stationarity and ergodicity allows the modeling of many problems in areas as different as Biology, Economics, and Communications.
There are plenty of books on random processes,2 each one following some preferred order and notation, but covering essentially the same topics. This chapter is just one more attempt to present the subject in a compact manner. It is structured in the usual way: probability is first introduced, then described through random variables; within this framework, stochastic processes are presented, then associated with processing systems. Due to space limitations, proofs were avoided, and examples were kept to a minimum. No attempt was made to cover many families of probability distributions, for example. The preferred path was to define clearly and unambiguously the concepts and entities associated with the subject and, whenever possible, to give them simple and intuitive interpretations. Even at the risk of seeming redundant, the author decided to explicitly duplicate the formulations related to random processes and sequences (i.e., continuous- and discrete-time random processes); the idea was always to provide a direct answer to a consulting reader, instead of suggesting modifications in the given expressions.
Writing technical material always poses a difficult problem: which level of detail and depth will make the text useful? Our goal was to make the text approachable for an undergraduate student as well as a consistent reference for more advanced students or even researchers (re)visiting random processes. The author tried to be especially careful with notational consistency throughout the chapter in order to avoid confusion and ambiguity (which may easily occur in advanced texts). The choices of covered topics and order of presentation reflect several years of teaching the subject, and obviously match those of some preferred books. A selected list of didactic references on statistics [1,2], random variables and processes [3,4], and applications [5,6] is given at the end of the chapter. We hope you enjoy the reading.
Probabilistic models are useful to describe and study phenomena that cannot be precisely predicted. To establish a precise framework for the concept of probability, one should define a chance experiment, of which each trial yields an outcome s. The set of all possible outcomes is the sample space S. In this context, one speaks of the probability that one trial of the experiment yields an outcome s that belongs to a desired set (when one says the event A has occurred), as illustrated in Figure 4.1.
There are two different views of probability: the subjectivist and the objectivist. For subjectivists, the probability measures someone’s degree of belief in the occurrence of a given event, while for objectivists it results from concrete reality. Polemics aside, objectivism can be more rewarding didactically.
From the objectivist viewpoint, one of the possible ways to define probability is the so-called classical or a priori approach: given an experiment whose possible outcomes s are equally likely, the probability of an event A is defined as the ratio between the number $N_A$ of acceptable outcomes (elements of A) and the number $N_S$ of possible outcomes (elements of S):
$P(A) = \dfrac{N_A}{N_S}.$  (4.1)
The experiment of flipping a fair coin, with $S = \{\text{head}, \text{tail}\}$, is an example in which the probabilities of both individual outcomes can be theoretically set: $P(\text{head}) = P(\text{tail}) = 1/2$.
Another way to define probability is the relative frequency or a posteriori approach: the probability that the event A occurs after one trial of a given experiment can be obtained by taking the limit of the ratio between the number $n_A$ of successes (i.e., occurrences of A) and the number n of experiment trials, as the number of trials goes to infinity:

$P(A) = \lim_{n \to \infty} \dfrac{n_A}{n}.$  (4.2)
In the ideal coin flip experiment, one is expected to find equal probabilities for head and tail. On the other hand, this pragmatic approach allows modeling the non-ideal case, provided the experiment can be repeated.
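The a posteriori approach above can be sketched numerically: a minimal simulation (assuming an ideal coin modeled by a pseudo-random generator) shows the relative frequency of heads approaching 1/2 as the number of trials grows.

```python
import random

# Relative-frequency (a posteriori) estimate of P(head) for a fair coin:
# n_A / n should approach 1/2 as the number of trials n grows, Eq. (4.2).
random.seed(0)
n = 100_000
n_heads = sum(random.random() < 0.5 for _ in range(n))
p_head = n_heads / n
```

For a biased (non-ideal) coin, the same estimator works with the threshold changed, which is precisely what the a priori approach cannot model.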
A pure mathematical definition can provide a sufficiently general framework to encompass every conceptual choice: the axiomatic approach develops a complete probability theory from three axioms. For an experiment with sample space S:
1. $P(A) \ge 0$ for any event $A \subseteq S$;
2. $P(S) = 1$;
3. if the events $A_1, A_2, \ldots$ are mutually exclusive, then $P(A_1 \cup A_2 \cup \cdots) = P(A_1) + P(A_2) + \cdots$
Referring to an experiment with sample space S:
• The events $A_1, A_2, \ldots$ are said to be mutually exclusive when the occurrence of one prevents the occurrence of the others. They are the subject of the third axiom.
• The complement of a given event A, illustrated in Figure 4.2, is determined by the non-occurrence of A, i.e., $\bar{A} = S \setminus A$. From this definition, $P(\bar{A}) = 1 - P(A)$. Complementary events are also mutually exclusive.
It should be emphasized that all events A related to a given experiment are completely determined by the sample space S, since by definition $A \subseteq S$. Therefore, a set B of outcomes not in S is mapped to the impossible event $\emptyset$. For instance, for the experiment of rolling a fair die, $S = \{1, 2, 3, 4, 5, 6\}$; the event corresponding to “showing a 7” is $\emptyset$, with $P(\emptyset) = 0$.
According to the experiment, sample spaces may be countable or uncountable.3 For example, the sample space of the coin flip experiment is countable. On the other hand, the sample space of the experiment that consists in sampling with no preference any real number from a given interval is uncountable. Given an individual outcome $s \in S$, defining $A_s = \{s\}$,

$P(A_s) = 0.$  (4.3)

As a consequence, one should not be surprised to find an event $A \ne \emptyset$ with $P(A) = 0$ or an event $A \ne S$ with $P(A) = 1$. In the interval sampling experiment, any individual outcome $A_s$ has $P(A_s) = 0$; and its complement $\bar{A}_s$ has $P(\bar{A}_s) = 1$.
The joint probability of a set of events $A_1, A_2, \ldots, A_M$ is the probability of their simultaneous occurrence, $P(A_1 \cap A_2 \cap \cdots \cap A_M)$. Referring to Figure 4.3, given two events A and B, their joint probability can be found as

$P(A \cap B) = P(A) + P(B) - P(A \cup B).$  (4.4)
By rewriting this equation as
$P(A \cup B) = P(A) + P(B) - P(A \cap B),$  (4.5)
one finds an intuitive result: the term $P(A \cap B)$ is included twice in $P(A) + P(B)$, and should thus be discounted once. Moreover, when A and B are mutually exclusive, we arrive back at the third axiom. In the die experiment, such joint and union probabilities can be easily checked by counting outcomes.
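The counting argument for the die can be sketched as follows, with a hypothetical pair of events (“even number” and “4 or more”, not necessarily those of the original example):

```python
from fractions import Fraction

# Fair-die events (hypothetical choice): A = "even number", B = "4 or more".
S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}
B = {4, 5, 6}

def P(E):
    # Classical probability, Eq. (4.1): acceptable over possible outcomes.
    return Fraction(len(E & S), len(S))

# Eq. (4.5): P(A u B) = P(A) + P(B) - P(A n B)
p_union = P(A) + P(B) - P(A & B)
```

Direct counting over the set union `A | B` gives the same value, confirming Eq. (4.5).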
The conditional probability $P(A|B)$ is the probability of event A conditioned to the occurrence of event B, and can be computed as

$P(A|B) = \dfrac{P(A \cap B)}{P(B)}.$  (4.6)

The value of $P(A \cap B)$ accounts for the uncertainty of both A and B; the division by $P(B)$ discounts the uncertainty of B, since it is certain in this context. In fact, the conditioning event B is the new (reduced) sample space for the experiment. By rewriting this equation as
$P(A \cap B) = P(A|B)P(B),$  (4.7)

one gets another interpretation: the joint probability of A and B combines the uncertainty of B with the uncertainty of A when B is known to occur. Using the example in the last paragraph, $P(A|B)$ can be computed by direct counting in the reduced sample space B.
Since $P(A \cap B) = P(A|B)P(B) = P(B|A)P(A)$, the Bayes rule follows straightforwardly:

$P(A|B) = \dfrac{P(B|A)P(A)}{P(B)}.$  (4.8)
This formula allows computing one conditional probability $P(A|B)$ from its reverse $P(B|A)$. Using again the example in the last paragraph, one would arrive at the same result for $P(A|B)$ by using $P(B|A)$ in the last equation.
If the sample space S is partitioned into M disjoint sets $S_m$, $m = 1, 2, \ldots, M$, such that $\bigcup_{m=1}^{M} S_m = S$, then any event $B \subseteq S$ can be written as $B = \bigcup_{m=1}^{M} (B \cap S_m)$. Since $B \cap S_m$, $m = 1, 2, \ldots, M$, are disjoint sets,

$P(B) = \sum_{m=1}^{M} P(B|S_m) P(S_m),$  (4.9)
which is called the total probability of B.
For a single event of interest A and its complement $\bar{A}$, for example, the application of Eq. (4.9) to Eq. (4.8) yields

$P(A|B) = \dfrac{P(B|A)P(A)}{P(B|A)P(A) + P(B|\bar{A})P(\bar{A})}.$  (4.10)
Within the Bayes context, $P(A)$ and $P(A|B)$ are usually known as the a priori and a posteriori probabilities of A, respectively; $P(B|A)$ is called likelihood in the estimation context, and also transition probability in the communications context. In the latter case, A could refer to a symbol a sent by the transmitter and B, to a symbol b recognized by the receiver. Knowing $P(B|A)$, the probability of recognizing b when a is sent (which models the communication channel), and $P(A)$, the a priori probability of the transmitter sending a, allows one to compute $P(A|B)$, the probability of a having been sent given that b has been recognized. In the case of binary communication, we could partition the sample space either into the events TX0 (0 transmitted) and TX1 (1 transmitted), or into the events RX0 (0 recognized) and RX1 (1 recognized), as illustrated in Figure 4.4; the event “error” would be $(TX0 \cap RX1) \cup (TX1 \cap RX0)$.
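The binary-channel computation can be sketched numerically. All priors and transition probabilities below are made-up numbers for illustration, not taken from the chapter:

```python
# Hypothetical binary channel: priors P(TXa) and transition
# probabilities P(RXb | TXa) are made-up values for illustration.
p_tx = {0: 0.6, 1: 0.4}                      # a priori probabilities
p_rx_given_tx = {(0, 0): 0.9, (1, 0): 0.1,   # key (b, a) -> P(RXb | TXa)
                 (0, 1): 0.2, (1, 1): 0.8}

def posterior(a, b):
    """P(TXa | RXb) via Bayes' rule, Eq. (4.10): total probability in the denominator."""
    num = p_rx_given_tx[(b, a)] * p_tx[a]
    den = sum(p_rx_given_tx[(b, aa)] * p_tx[aa] for aa in p_tx)
    return num / den

# P(error) = P(RX1 | TX0)P(TX0) + P(RX0 | TX1)P(TX1)
p_error = p_rx_given_tx[(1, 0)] * p_tx[0] + p_rx_given_tx[(0, 1)] * p_tx[1]
```

With these numbers, receiving a 0 raises the probability that a 0 was sent from the prior 0.6 to about 0.87.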
The events $A_1, A_2, \ldots, A_M$ are said to be mutually independent when the occurrence of one does not affect the occurrence of any combination of the others.
For two events A and B, three equivalent tests can be employed: they are independent if and only if
• $P(A|B) = P(A)$;
• $P(B|A) = P(B)$;
• $P(A \cap B) = P(A)P(B)$.
The first two conditions follow directly from the definition of independence. Using any of them in Eq. (4.6), one arrives at the third one. The reader is invited to return to Eq. (4.7) and check its simplified form when A and B are independent events. Consider the experiment of rolling a special die with the numbers 1, 2, and 3 stamped in black on three faces and stamped in red on the other three; the events “a given number shows up” and “a given color shows up” are mutually independent.
Algorithmically, the best choice for testing the mutual independence of more than two events $A_1, A_2, \ldots, A_M$ is using the third condition for every combination of two or more events among them.
As a final observation, mutually exclusive events are not mutually independent; on the contrary, one could say they are maximally dependent, since the occurrence of one precludes the occurrence of the other.
The theory we have discussed so far can be applied when more than one experiment is performed at a time. The generalization to the multiple experiment case can be easily done by using Cartesian products.
Consider the experiments $E_1, E_2, \ldots, E_M$ with their respective sample spaces $S_1, S_2, \ldots, S_M$. Define the combined experiment E such that each of its trials is composed of one trial of each $E_m$. An outcome of E is the M-tuple $s = (s_1, s_2, \ldots, s_M)$, where $s_m \in S_m$, and the sample space of E can be written as $S = S_1 \times S_2 \times \cdots \times S_M$. Analogously, any event of E can be expressed as $A = A_1 \times A_2 \times \cdots \times A_M$, where $A_m$ is a properly chosen event of $E_m$.
In the special case when the sub-experiments are mutually independent, i.e., the outcomes of one do not affect the outcomes of the others, we have $P(A) = P(A_1)P(A_2)\cdots P(A_M)$. However, this is not the general case. Consider the experiment of randomly selecting a card from a 52-card deck, repeated twice: if the first card drawn is replaced, the sub-experiments are independent; if not, the first outcome affects the second experiment, since the probabilities of the remaining cards change.
At this point, an interesting counting experiment (called Bernoulli trials) can be defined. Take a random experiment E with sample space S and a desired event A (success) with $P(A) = p$, which also defines $\bar{A}$ (failure) with $P(\bar{A}) = 1 - p$. What is the probability of getting exactly k successes in N independent repeats of E? The solution can be easily found by noticing that the desired result is composed of k successes and $N - k$ failures, which may occur in $\binom{N}{k}$ different orders. Then,

$P(k \text{ successes in } N \text{ trials}) = \dbinom{N}{k} p^k (1-p)^{N-k},$  (4.11)

which, for large N and small p, can be approximated by

$P(k \text{ successes in } N \text{ trials}) \approx \dfrac{(Np)^k}{k!} e^{-Np}.$  (4.12)
Returning to the card deck experiment (with replacement): the probability of selecting exactly 2 aces of spades after 300 repeats is approximately 5.09% according to Eq. (4.11), while Eq. (4.12) provides the approximate value 5.20%. In a binary communication system where 0 and 1 are randomly transmitted with equal probabilities, the probability that exactly 2 bits 1 are sent among 3 transmitted bits is $\binom{3}{2}(1/2)^3 = 0.375$; this result can be easily checked by inspection of Figure 4.5.
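Both numbers above can be reproduced directly from Eq. (4.11); a minimal sketch:

```python
from math import comb

def binomial_pmf(k, n, p):
    # Eq. (4.11): P(exactly k successes in n independent Bernoulli trials)
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Card deck with replacement: exactly 2 aces of spades in 300 draws (~5.09%).
p_cards = binomial_pmf(2, 300, 1 / 52)
# Binary source: exactly 2 bits "1" among 3 equiprobable bits (3/8).
p_bits = binomial_pmf(2, 3, 0.5)
```

The Poisson-style approximation of Eq. (4.12) is accurate here because N = 300 is large and p = 1/52 is small.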
Mapping each outcome of a random experiment to a real number provides a different framework for the study of probabilistic models, amenable to simple interpretation and easy mathematical manipulation. This mapping is performed by the so-called random variable.
Given a random experiment with sample space S, a random variable $X$ is any function that maps each $s \in S$ into some $x \in \mathbb{R}$ (see Figure 4.6). The image $S_X$ of this transformation with domain S and co-domain $\mathbb{R}$, which results from the convenient choice of the mapping function, can be seen as the sample space of $X$; any event A can be described as a subset of $S_X$; and each mapped outcome x is called a sample of $X$. The following conditions should be satisfied by a random variable $X$:
• the set $\{X \le x\}$ must be an event for any $x \in \mathbb{R}$;
• $P(X = -\infty) = 0$ and $P(X = \infty) = 0$.
We will see later that these conditions allow the proper definition of the cumulative probability distribution function of $X$.
As seen in Section 1.04.2, sample spaces (and events) can be countable or uncountable, according to the nature of the random experiment’s individual outcomes. After being mapped into subsets of $\mathbb{R}$, countable events remain countable—e.g., one could associate with the coin flip experiment a random variable $X$ such that $X(\text{head}) = 0$ and $X(\text{tail}) = 1$. On the other hand, uncountable events may or may not remain uncountable. Consider the following four distinct definitions of a random variable associated with the interval experiment described immediately before Eq. (4.3):
The sample space of:
The classification of sample spaces as countable or uncountable leads directly to the classification of random variables as discrete, continuous, or mixed. A discrete random variable has a countable sample space $S_X$, each admissible value of which has a nonzero probability. A continuous random variable has an uncountable sample space $S_X$ and (to avoid ambiguity) $P(X = x) = 0$ for every individual value x. A mixed random variable has a sample space composed of the union of real intervals with continuously distributed probabilities, within which $P(X = x) = 0$, and discrete real values with nonzero probabilities. A sample space of this last kind should rather be treated as part uncountable, part countable than as simply uncountable.
From the conditions that must be satisfied by any random variable $X$, an overall description of its probability distribution can be provided by the so-called cumulative probability distribution function (shortened to CDF),

$F_X(x) = P(X \le x).$  (4.13)
Since $\{X \le \infty\} = S$, $F_X(\infty) = 1$; and since $P(X = -\infty) = 0$, $F_X(-\infty) = 0$. Moreover, $0 \le F_X(x) \le 1$, and $F_X(x)$ is a non-decreasing function of x, i.e., $F_X(x_1) \le F_X(x_2)$ whenever $x_1 < x_2$. Also, $F_X(x^+) = F_X(x)$, i.e., $F_X(x)$ is continuous from the right; this will be important in the treatment of discrete random variables. A typical CDF is depicted in Figure 4.7. One can use the CDF to calculate probabilities by noticing that

$P(x_1 < X \le x_2) = F_X(x_2) - F_X(x_1).$  (4.14)
For the random variable whose distribution is described in Figure 4.7, one can easily check such probabilities by inspection.
The CDF of the random variable associated above with the interval sampling experiment, assuming the sampled interval to be $(a, b)$, is

$F_X(x) = \begin{cases} 0, & x < a, \\ \dfrac{x - a}{b - a}, & a \le x < b, \\ 1, & x \ge b. \end{cases}$  (4.15)
The CDF of the random variable associated above with the coin flip experiment (with heads mapped to 0 and tails to 1) is

$F_X(x) = \begin{cases} 0, & x < 0, \\ 1/2, & 0 \le x < 1, \\ 1, & x \ge 1. \end{cases}$  (4.16)
Given a random variable $X$, any single value $x_i$ such that $P(X = x_i) = p > 0$ contributes a step4 of amplitude p to $F_X(x)$. Then, it is easy to conclude that for a discrete random variable with $P(X = x_i) = p_i$,

$F_X(x) = \sum_i p_i\, u(x - x_i),$  (4.17)

where $u(\cdot)$ denotes the unit step.
An even more informative function that can be derived from the CDF to describe the probability distribution of a random variable is the so-called probability density function (shortened to PDF),
$f_X(x) = \dfrac{dF_X(x)}{dx}.$  (4.18)

Since $F_X(x)$ is a non-decreasing function of x, it follows that $f_X(x) \ge 0$. From the definition,

$F_X(x) = \int_{-\infty}^{x} f_X(\xi)\, d\xi.$  (4.19)
Then,
$\int_{-\infty}^{\infty} f_X(x)\, dx = 1.$  (4.20)
The PDF corresponding to the CDF shown in Figure 4.7 is depicted in Figure 4.8. One can use the PDF to calculate probabilities by noticing that
$P(x_1 < X \le x_2) = \int_{x_1}^{x_2} f_X(x)\, dx.$  (4.21)
Again, for the random variable whose distribution is described in Figure 4.8, one can easily check such probabilities by inspection.
The PDF of the random variable associated above with the interval sampling experiment, assuming the sampled interval to be $(a, b)$, is

$f_X(x) = \begin{cases} \dfrac{1}{b - a}, & a < x \le b, \\ 0, & \text{otherwise}. \end{cases}$  (4.22)
The PDF of the random variable associated above with the coin flip experiment is

$f_X(x) = \dfrac{dF_X(x)}{dx} = \begin{cases} 0, & x \ne 0 \text{ and } x \ne 1, \\ \text{undefined}, & x \in \{0, 1\}, \end{cases}$  (4.23)
which is not well-defined. But coherently with what has been seen for the CDF, given a random variable $X$, any single value $x_i$ such that $P(X = x_i) = p > 0$ contributes an impulse5 of area p to $f_X(x)$. Then, it is easy to conclude that for a discrete random variable with $P(X = x_i) = p_i$,

$f_X(x) = \sum_i p_i\, \delta(x - x_i),$  (4.24)

where $\delta(\cdot)$ denotes the unit impulse. In particular, for the coin flip experiment the PDF is $f_X(x) = \frac{1}{2}\delta(x) + \frac{1}{2}\delta(x - 1)$.
In the case of discrete random variables, in order to avoid the impulses in the PDF, one can operate directly on the so-called probability mass function6:

$p_X(x_i) = P(X = x_i).$  (4.25)
In this case,
$\sum_i p_X(x_i) = 1.$  (4.26)
This chapter favors an integrated framework for continuous and discrete variables, based on CDFs and PDFs.
It is usual in the literature to refer (for short) to the CDF as “the distribution” and to the PDF as “the density” of the random variable. This text avoids this loose terminology, since the word “distribution” better applies to the overall probabilistic behavior of the random variable, whether expressed in the form of a CDF or a PDF.
The simplest continuous distribution is the so-called uniform. A random variable $X$ is said to be uniformly distributed between a and b if its PDF is

$f_X(x) = \begin{cases} \dfrac{1}{b - a}, & a \le x \le b, \\ 0, & \text{otherwise}, \end{cases}$  (4.27)
depicted in Figure 4.9. Notice that the inclusion or not of the interval bounds is unimportant here, since the variable is continuous. The error produced by uniform quantization of real numbers is an example of a uniformly distributed random variable.
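A quick numerical check of the uniform distribution (with a made-up interval): the mean and variance of a Uniform(a, b) variable are known to be $(a + b)/2$ and $(b - a)^2/12$, and sample estimates should approach them.

```python
import random

# Uniform(a, b): sample mean should approach (a + b)/2 and sample
# variance (b - a)**2 / 12 (interval chosen arbitrarily for illustration).
random.seed(4)
a, b = -1.0, 3.0
xs = [random.uniform(a, b) for _ in range(100_000)]
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
```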
Perhaps the most recurrent continuous distribution is the so-called Gaussian (or normal). A Gaussian random variable is described by the PDF
$f_X(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}.$  (4.28)
As seen in Figure 4.10, this function is symmetrical around $\mu$, with spread controlled by $\sigma$. These parameters, respectively called statistical mean and standard deviation of $X$, will be precisely defined in Section 1.04.3.4. The Gaussian distribution arises from the combination of several independent random phenomena, and is often associated with noise models.
There is no closed expression for the Gaussian CDF, which is usually tabulated for a normalized Gaussian random variable with $\mu = 0$ and $\sigma = 1$ such that

$F_X(x) = \dfrac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\xi^2/2}\, d\xi.$  (4.29)
In order to compute $P(x_1 < X \le x_2)$ for a Gaussian random variable $X$ with mean $\mu$ and standard deviation $\sigma$, one can build the auxiliary normalized variable $Z = (X - \mu)/\sigma$, whose probabilities can then be approximated by tabulated values of the normalized Gaussian CDF.
Notice once more that since the variable is continuous, the inclusion or not of the interval bounds has no influence on the result.
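The normalization step can be sketched with the error function, which expresses the standard Gaussian CDF in closed library form (the values of $\mu$, $\sigma$, and the interval below are illustrative):

```python
from math import erf, sqrt

def gauss_cdf(x, mu=0.0, sigma=1.0):
    # F_X(x) for a Gaussian(mu, sigma); normalization reduces it to the
    # tabulated standard normal CDF evaluated at (x - mu) / sigma.
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# P(x1 < X <= x2) = F_Z((x2 - mu)/sigma) - F_Z((x1 - mu)/sigma);
# here x1 = 0, x2 = 2 for X ~ Gaussian(mu=1, sigma=0.5), i.e. +-2 sigma.
p = gauss_cdf(2.0, mu=1.0, sigma=0.5) - gauss_cdf(0.0, mu=1.0, sigma=0.5)
```

The result matches the familiar two-sigma probability of about 95.45%.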
In Section 1.04.2.3, an important discrete random variable was implicitly defined. A random variable $X$ that follows the so-called binomial distribution is described by

$P(X = k) = \dbinom{N}{k} p^k (1-p)^{N-k}, \quad k = 0, 1, \ldots, N,$  (4.31)
and counts the number of occurrences of an event A which has probability p, after N independent repeats of a random experiment. Games of chance are related to binomial random variables.
The so-called Poisson distribution, described by

$P(X = k) = \dfrac{(\lambda T)^k}{k!}\, e^{-\lambda T}, \quad k = 0, 1, 2, \ldots,$  (4.32)

counts the number of occurrences of an event A that follows a mean rate of $\lambda$ occurrences per unit time, during a time interval of duration T. Traffic studies are related to Poisson random variables.
It is simple to derive the Poisson distribution from the binomial distribution. If an event A may occur with no preference anytime during a time interval of duration $T_0$, the probability that A occurs within a given sub-interval of duration T is $p = T/T_0$. If A occurs N times in $T_0$, the probability that it falls exactly k times in the sub-interval is given by Eq. (4.11). If $N \to \infty$ and $T_0 \to \infty$ while A follows a mean rate of $\lambda = N/T_0$ occurrences per unit time, then $Np = \lambda T$ and the probability of exactly k occurrences in a time interval of duration T becomes Eq. (4.32), with $\lambda T$ substituted for Np. If a central office receives 100 telephone calls per minute on average, the probability that more than 1 call arrives during 1 s is $1 - e^{-5/3}(1 + 5/3) \approx 49.6\%$.
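The telephone example can be reproduced from Eq. (4.32):

```python
from math import exp, factorial

lam, T = 100 / 60.0, 1.0        # 100 calls per minute -> rate per second
mu = lam * T                    # lambda * T, the Poisson mean

def poisson_pmf(k, mu):
    # Eq. (4.32): P(exactly k occurrences in an interval with mean mu)
    return mu**k * exp(-mu) / factorial(k)

# P(more than 1 call in 1 s) = 1 - P(0) - P(1)
p_more_than_one = 1 - poisson_pmf(0, mu) - poisson_pmf(1, mu)
```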
Due to space restrictions, this chapter does not detail other random distributions, which can be easily found in the literature.
In many instances, one is interested in studying the behavior of a given random variable under some constraints. A conditional distribution can be built to this effect, if the event B summarizes those constraints. The corresponding CDF of $X$ conditioned to B can be computed as

$F_X(x|B) = P(X \le x \mid B) = \dfrac{P(\{X \le x\} \cap B)}{P(B)}.$  (4.33)

The related PDF is simply

$f_X(x|B) = \dfrac{dF_X(x|B)}{dx}.$  (4.34)
Conditional probabilities can be straightforwardly computed from conditional CDFs or PDFs.
When the conditioning event is an interval $B = \{x_1 < X \le x_2\}$, it can be easily deduced that

$F_X(x|B) = \begin{cases} 0, & x \le x_1, \\ \dfrac{F_X(x) - F_X(x_1)}{F_X(x_2) - F_X(x_1)}, & x_1 < x \le x_2, \\ 1, & x > x_2, \end{cases}$  (4.35)

and

$f_X(x|B) = \begin{cases} \dfrac{f_X(x)}{P(x_1 < X \le x_2)}, & x_1 < x \le x_2, \\ 0, & \text{otherwise}. \end{cases}$  (4.36)
Both branches of Eq. (4.36) can be easily interpreted:
• Within $(x_1, x_2]$, the conditional PDF has the same shape as the original one; the normalization factor $P(x_1 < X \le x_2)$ ensures unit area over the restricted sample space.
• By definition, there is null probability of getting x outside $(x_1, x_2]$.
As an example, the random variable H defined by the non-negative outcomes of a normalized Gaussian variable (i.e., conditioned to the event $X \ge 0$, which has probability 1/2) follows the PDF

$f_H(x) = \begin{cases} \sqrt{\dfrac{2}{\pi}}\, e^{-x^2/2}, & x \ge 0, \\ 0, & x < 0. \end{cases}$  (4.37)
The results shown in this section can be promptly generalized to any conditioning event.
The concept of mean needs no special introduction: it is the single value that, substituted for each member of a set of numbers, produces the same total. In the context of probability, following a frequentist path, one could define the arithmetic mean of infinite samples of a random variable $X$ as its statistical mean or expected value $E[X]$.
Recall the random variable associated with the fair coin flip experiment, with $X(\text{head}) = 0$ and $X(\text{tail}) = 1$. After infinite repeats of the experiment, one gets 50% of heads ($x = 0$) and 50% of tails ($x = 1$); then, the mean outcome will be7 0.5. If another variable is associated with an unfair coin with probabilities $P(\text{head})$ and $P(\text{tail})$, the same reasoning leads to $E[X] = 0 \cdot P(\text{head}) + 1 \cdot P(\text{tail})$. Instead of averaging infinite outcomes, just summing the possible values of the random variable weighted by their respective probabilities also yields its statistical mean. Thus we can state that for any discrete random variable $X$ with sample space $S_X = \{x_i\}$ and $P(X = x_i) = p_i$,

$E[X] = \sum_i p_i x_i.$  (4.38)
This result can be generalized. Given a continuous random variable $X$, the probability of drawing a value in the infinitesimal interval $(x, x + dx]$ is given by $f_X(x)\, dx$. The weighted sum of all such values is simply

$E[X] = \int_{-\infty}^{\infty} x f_X(x)\, dx.$  (4.39)
By substituting the PDF of a discrete variable (see Eq. (4.24)) into this expression, one arrives at Eq. (4.38). Then, Eq. (4.39) is the analytic expression for the expected value of any random variable .
Suppose another random variable is built as a function $Y = g(X)$. Since the probability of getting each value y is the same as that of getting the respective x, we can deduce that

$E[Y] = E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x)\, dx.$  (4.40)
A complete family of measures based on expected values can be associated with a random variable. The so-called nth-order moment of $X$ (about the origin) is defined as

$m_n = E[X^n] = \int_{-\infty}^{\infty} x^n f_X(x)\, dx.$  (4.41)
The first two moments of $X$ are its statistical mean $m_1 = E[X]$ and its mean square value $m_2 = E[X^2]$.
A modified family of parameters can be formed by computing the moments about the mean. The so-called nth-order central moment of $X$ is defined as

$\mu_n = E[(X - m_1)^n] = \int_{-\infty}^{\infty} (x - m_1)^n f_X(x)\, dx.$  (4.42)
Subtracting the statistical mean from a random variable can be interpreted as disregarding its “deterministic” part (represented by the statistical mean itself). Three special cases deserve mention:
• $\mu_2 = \sigma_X^2$, known as the variance of $X$, which measures the spread of $X$ around $m_1$. The so-called standard deviation $\sigma_X = \sqrt{\mu_2}$ is a convenient measure of spread with the same dimension of $X$.
• $\mu_3$, whose standardized version $\mu_3/\sigma_X^3$ is the so-called skewness of $X$, which measures the asymmetry of $f_X(x)$.
• $\mu_4$, whose standardized version8 $\mu_4/\sigma_X^4$ minus three is the so-called kurtosis of $X$, which measures the peakedness of $f_X(x)$. One can say the distributions are measured against the Gaussian, which has null kurtosis.
Of course, analogous definitions apply to the moments computed over conditional distributions.
A useful expression, which will be recalled in the context of random processes, relates three of the measures defined above:

$\sigma_X^2 = E[X^2] - E^2[X] = m_2 - m_1^2.$  (4.43)
Rewritten as
$m_2 = m_1^2 + \sigma_X^2,$  (4.44)

it allows one to split the overall “intensity” of $X$, measured by its mean square value, into a deterministic part, represented by $m_1^2$, and a random part, represented by $\sigma_X^2$.
As an example, consider a discrete random variable $X$ distributed as shown in Table 4.1. Its parameters are:
• mean $m_1 > 0$, indicating the PDF of $X$ is shifted to the right of the origin;
• mean square value $m_2$, which measures the intensity of $X$;
• variance $\sigma_X^2$, which measures the random part of the intensity of $X$;
• standard deviation $\sigma_X$, which measures the spread of $X$ around its mean;
• negative skewness, indicating the PDF of $X$ is left-tailed;
• negative kurtosis, indicating the PDF of $X$ is less peaky than the PDF of a Gaussian variable.
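These expectation-based parameters can be computed directly from a probability mass function; a minimal sketch with made-up values and probabilities (not those of Table 4.1):

```python
# Hypothetical discrete distribution (made-up values and probabilities).
xs = [0.0, 1.0, 2.0, 4.0]
ps = [0.1, 0.4, 0.3, 0.2]

mean = sum(p * x for x, p in zip(xs, ps))                      # m1, Eq. (4.38)
msq  = sum(p * x**2 for x, p in zip(xs, ps))                   # m2
var  = msq - mean**2                                           # Eq. (4.43)
std  = var**0.5
skew = sum(p * (x - mean)**3 for x, p in zip(xs, ps)) / std**3
kurt = sum(p * (x - mean)**4 for x, p in zip(xs, ps)) / std**4 - 3
```

Eq. (4.44) can be verified on these numbers: the mean square value equals the squared mean plus the variance.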
At this point, we can discuss two simple transformations of a random variable whose effects can be summarized by low-order moments:
• A random variable $Y$ can be formed by adding a fixed offset c to each sample of $X$. As a consequence of this operation, the new PDF is a shifted version of the original one: $f_Y(y) = f_X(y - c)$, the offset being added to the mean of $X$: $m_Y = m_X + c$.
• A random variable $Y$ can be formed by scaling by a fixed factor c each sample of $X$. As a consequence of this operation, the new PDF is a scaled version of the original one: $f_Y(y) = \frac{1}{|c|} f_X\left(\frac{y}{c}\right)$, the standard deviation of $X$ being scaled by $|c|$: $\sigma_Y = |c|\,\sigma_X$.
Such transformations do not change the shape (and therefore the type) of the original distribution. In particular, one can generate:
• a zero-mean version of $X$ by making $Y = X - m_X$, which disregards the deterministic part of $X$;
• a unit-standard-deviation version of $X$ by making $Y = X/\sigma_X$, which enforces a standard statistical variability to $X$;
• a normalized version of $X$ by making $Y = (X - m_X)/\sigma_X$, which combines both effects.
We already defined a normalized Gaussian distribution in Section 1.04.3.2. The normalized version of the random variable of the last example would be distributed as shown in Table 4.2.
The main importance of these expectation-based parameters is to provide a partial description of an underlying distribution without the need to resort to the PDF. In a practical situation in which only a few samples of a random variable are available, it is easier to get reliable estimates for its moments (especially the low-order ones) than for the PDF itself. The same rationale applies to the use of certain auxiliary inequalities that avoid the direct computation of probabilities on a random variable (which would otherwise require the knowledge or estimation of its PDF) by providing upper bounds for them based on low-order moments. Two such inequalities are:
• Markov’s inequality: for a non-negative random variable $X$ and any $a > 0$, $P(X \ge a) \le \dfrac{E[X]}{a}$;
• Chebyshev’s inequality: for any random variable $X$ and any $a > 0$, $P(|X - m_X| \ge a) \le \dfrac{\sigma_X^2}{a^2}$.
The derivation of Markov’s inequality is quite simple:

$E[X] = \int_0^{\infty} x f_X(x)\, dx \ge \int_a^{\infty} x f_X(x)\, dx \ge a \int_a^{\infty} f_X(x)\, dx = a\, P(X \ge a).$  (4.45)
Chebyshev’s inequality follows directly by substituting $(X - m_X)^2$ for $X$ and $a^2$ for $a$ in Markov’s inequality. As an example, the probability of getting a sample $|x| \ge 3$ from the normalized Gaussian random variable is about 0.27%; Chebyshev’s inequality predicts $P(|X| \ge 3) \le 1/9 \approx 11.1\%$, an upper bound roughly 40 times greater than the actual probability.
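The looseness of the Chebyshev bound for the Gaussian case can be checked numerically:

```python
from math import erf, sqrt

# Chebyshev: P(|X - m| >= k*sigma) <= 1/k**2 for ANY distribution.
# For the normalized Gaussian and k = 3, compare bound and exact value.
k = 3.0
bound = 1 / k**2                                   # ~ 0.111
exact = 2 * (1 - 0.5 * (1 + erf(k / sqrt(2))))     # 2*(1 - Phi(3)) ~ 0.0027
ratio = bound / exact                              # how loose the bound is
```

The bound holds, as it must, but it is far from tight: distribution-specific knowledge (here, Gaussianity) buys a much smaller probability.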
Two representations of the distribution of a random variable in alternative domains provide interesting links with its statistical moments. The so-called characteristic function of a random variable is defined as
$\Phi_X(\omega) = E[e^{j\omega X}] = \int_{-\infty}^{\infty} f_X(x)\, e^{j\omega x}\, dx,$  (4.46)

and inter-relates the moments of $X$ by successive differentiations:

$m_n = (-j)^n \left.\dfrac{d^n \Phi_X(\omega)}{d\omega^n}\right|_{\omega = 0}.$  (4.47)
The so-called moment generating function of a random variable is defined as
$M_X(v) = E[e^{vX}] = \int_{-\infty}^{\infty} f_X(x)\, e^{vx}\, dx,$  (4.48)
and provides a more direct way to compute $m_n$:

$m_n = \left.\dfrac{d^n M_X(v)}{dv^n}\right|_{v = 0}.$  (4.49)
Consider the problem of describing a random variable $Y = g(X)$ obtained by transformation of another random variable $X$, illustrated in Figure 4.11. By definition,

$F_Y(y) = P(Y \le y) = P(g(X) \le y).$  (4.50)
If $X$ is a continuous random variable and $g(\cdot)$ is a differentiable function, the condition $g(X) \le y$ is equivalent to a set of conditions in the form $x_a < X \le x_b$. Since the probabilities of such intervals are given by $F_X(x)$, $F_Y(y)$ can be expressed as a function of $F_X(x)$.
Fortunately, an intuitive formula expresses $f_Y(y)$ as a function of $f_X(x)$:

$f_Y(y) = \sum_i \dfrac{f_X(x_i)}{\left|\frac{dg(x)}{dx}\right|_{x = x_i}}.$  (4.51)
Given that the transformation is not necessarily monotonic, the $x_i$ are all possible values mapped to y; therefore, their contributions must be summed up. It is reasonable that $f_Y(y)$ must be directly proportional to $f_X(x_i)$: the more frequent a given $x_i$, the more frequent its respective y. In turn, the derivative term accounts for the distortion imposed on $f_X(x)$ by $g(\cdot)$. For example, if this transformation is almost constant in a given region, no matter whether increasing or decreasing, its derivative in the denominator will be close to zero; this just reflects the fact that a wide range of values of $X$ will be mapped into a narrow range of values of $Y$, which will then become denser than $X$ in that region.
As an example, consider that a continuous random variable $X$ is transformed into a new variable $Y = aX + b$, with $a > 0$. For the CDF of $Y$,

$F_Y(y) = P(aX + b \le y) = F_X\!\left(\dfrac{y - b}{a}\right).$  (4.52)

By Eq. (4.51), or by differentiation of Eq. (4.52), one arrives at the PDF of $Y$:

$f_Y(y) = \dfrac{1}{a}\, f_X\!\left(\dfrac{y - b}{a}\right).$  (4.53)
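Eq. (4.51) also handles non-monotonic transformations. A sketch (assuming, for illustration, $Y = X^2$ with $X$ a normalized Gaussian): each $y > 0$ has the two roots $x = \pm\sqrt{y}$, with derivative magnitude $2\sqrt{y}$, and the resulting PDF can be cross-checked by Monte Carlo simulation.

```python
import random
from math import exp, pi, sqrt

def f_Y(y):
    # Eq. (4.51) for Y = X**2, X standard Gaussian: both roots contribute.
    f_X = lambda x: exp(-x * x / 2) / sqrt(2 * pi)   # standard Gaussian PDF
    dg = 2 * sqrt(y)                                 # |dg/dx| at x = +-sqrt(y)
    return (f_X(sqrt(y)) + f_X(-sqrt(y))) / dg

# Compare P(0.5 < Y <= 1.5) computed two ways.
random.seed(1)
samples = [random.gauss(0, 1) ** 2 for _ in range(200_000)]
p_mc = sum(0.5 < s <= 1.5 for s in samples) / len(samples)
p_int = sum(f_Y(0.5 + (i + 0.5) * 1e-3) * 1e-3 for i in range(1000))  # midpoint rule
```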
The case when real intervals of $S_X$ are mapped into single values of $Y$ requires additional care to treat the nonzero individual probabilities of the resulting values of the new variable. However, the case of a discrete random variable $X$ with $P(X = x_i) = p_i$ being transformed is trivial:

$P(Y = y_j) = \sum_{i\,:\,g(x_i) = y_j} p_i.$  (4.54)

Again, the $x_i$ such that $g(x_i) = y_j$ are all possible values mapped to $y_j$.
An interesting application of transformations of random variables is to obtain samples of a given random variable $Y$ from samples of another random variable $X$, both with known distributions. Assume that $g(\cdot)$ exists such that $Y = g(X)$. Then, imposing $F_Y(g(x)) = F_X(x)$, one finds $g = F_Y^{-1} \circ F_X$, which requires only the invertibility of $F_Y$.
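This recipe can be sketched for a hypothetical target distribution: with $X$ uniform in $(0, 1)$, $F_X(x) = x$, so $g = F_Y^{-1}$; choosing an exponential target with rate $\lambda$, $F_Y(y) = 1 - e^{-\lambda y}$ gives $g(x) = -\ln(1 - x)/\lambda$.

```python
import random
from math import log

# Inverse-transform sampling: X ~ Uniform(0,1), Y = F_Y^{-1}(X) has CDF F_Y.
# Target (illustrative): exponential with rate lam, F_Y^{-1}(u) = -ln(1-u)/lam.
random.seed(2)
lam = 2.0
samples = [-log(1 - random.random()) / lam for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)   # should approach 1/lam = 0.5
```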
A single random experiment E (composed or not) may give rise to a set of N random variables, if each individual outcome is mapped into several real values $x_1, x_2, \ldots, x_N$. Consider, for example, the random experiment of sampling the climate conditions at every point on the globe; daily mean temperature and relative air humidity can be considered as two random variables that serve as numerical summarizing measures. One must generalize the probability distribution descriptions to cope with this multiple random variable situation: it should be clear that the set of N individual probability distribution descriptions, one for each distinct random variable, provides less information than their joint probability distribution description, since the former cannot convey information about their mutual influences.
We start by defining a multiple or vector random variable $\mathbf{X} = [X_1\ X_2\ \cdots\ X_N]^T$ as the function that maps each $s \in S$ into $\mathbf{x} = [x_1\ x_2\ \cdots\ x_N]^T \in \mathbb{R}^N$, such that

$\{X_1 \le x_1, X_2 \le x_2, \ldots, X_N \le x_N\}$ is an event for any $\mathbf{x} \in \mathbb{R}^N$  (4.55)

and

$P(X_n = -\infty) = P(X_n = \infty) = 0, \quad n = 1, 2, \ldots, N.$  (4.56)
Notice that the values $x_1, x_2, \ldots, x_N$ are jointly produced by each outcome $s \in S$.
The joint cumulative probability distribution function of $X_1, X_2, \ldots, X_N$, or simply the CDF of $\mathbf{X}$, can be defined as

$F_{\mathbf{X}}(\mathbf{x}) = P(X_1 \le x_1, X_2 \le x_2, \ldots, X_N \le x_N).$  (4.57)
The following relevant properties of the joint CDF can be easily deduced:
• $F_{\mathbf{X}}(\mathbf{x}) = 0$ whenever any of its arguments is $-\infty$;
• $F_{\mathbf{X}}(\infty, \infty, \ldots, \infty) = 1$;
• $F_{\mathbf{X}}(\mathbf{x})$ is a non-decreasing function of each $x_n$.
Define $\mathbf{X}_1$ containing the variables of interest among the random variables in $\mathbf{X}$, and leave the remaining nuisance variables in $\mathbf{X}_2$. The so-called marginal CDF of $\mathbf{X}_1$ separately describes these variables’ distribution from the knowledge of the CDF of $\mathbf{X}$:

$F_{\mathbf{X}_1}(\mathbf{x}_1) = F_{\mathbf{X}}(\mathbf{x})\big|_{\mathbf{x}_2 = \boldsymbol{\infty}}.$  (4.58)

One says the variables in $\mathbf{X}_2$ have been marginalized out: the condition $X_n \le \infty$ simply means they must be somewhere in the sample space, and thus one does not need to care about them anymore.
The CDF of a discrete vector random variable $\mathbf{X}$ with $P(\mathbf{X} = \mathbf{x}_i) = p_i$, analogously to the single-variable case, is composed of steps at the admissible N-tuples:

$F_{\mathbf{X}}(\mathbf{x}) = \sum_i p_i\, u(x_1 - x_{1i})\, u(x_2 - x_{2i}) \cdots u(x_N - x_{Ni}).$  (4.59)
The joint probability density function of $X_1, X_2, \ldots, X_N$, or simply the PDF of $\mathbf{X}$, can be defined as

$f_{\mathbf{X}}(\mathbf{x}) = \dfrac{\partial^N F_{\mathbf{X}}(\mathbf{x})}{\partial x_1\, \partial x_2 \cdots \partial x_N}.$  (4.60)
Since $F_{\mathbf{X}}(\mathbf{x})$ is a non-decreasing function of each $x_n$, it follows that $f_{\mathbf{X}}(\mathbf{x}) \ge 0$. From the definition,

$F_{\mathbf{X}}(\mathbf{x}) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_N} f_{\mathbf{X}}(\boldsymbol{\xi})\, d\xi_N \cdots d\xi_1,$  (4.61)

which implies

$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_{\mathbf{X}}(\mathbf{x})\, dx_1 \cdots dx_N = 1.$  (4.62)
The probability of any event A can be computed as

$P(A) = \int_{\mathbf{x} \in A} f_{\mathbf{X}}(\mathbf{x})\, d\mathbf{x}.$  (4.63)
Once more we can marginalize the nuisance variables $\mathbf{X}_2$ to obtain the marginal PDF of $\mathbf{X}_1$ from the PDF of $\mathbf{X}$:

$f_{\mathbf{X}_1}(\mathbf{x}_1) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_{\mathbf{X}}(\mathbf{x})\, d\mathbf{x}_2.$  (4.64)

Notice that if $\mathbf{X}_1$ and $\mathbf{X}_2$ are statistically dependent, the marginalization of $\mathbf{X}_2$ does not “eliminate” the effect of $\mathbf{X}_2$ on $\mathbf{X}_1$: the integration is performed over $f_{\mathbf{X}}(\mathbf{x})$, which describes their mutual influences. For a discrete vector random variable with $P(\mathbf{X} = \mathbf{x}_i) = p_i$, the PDF consists of impulses at the admissible N-tuples:

$f_{\mathbf{X}}(\mathbf{x}) = \sum_i p_i\, \delta(x_1 - x_{1i})\, \delta(x_2 - x_{2i}) \cdots \delta(x_N - x_{Ni}).$  (4.65)
Suppose, for example, that we want to find the joint and marginal CDFs and PDFs of two random variables and jointly uniformly distributed in the region , i.e., such that
(4.66)
The marginalization of yields
(4.67)
The marginalization of yields
(4.68)
By definition,
(4.69)
The reader is invited to sketch the admissible region on the plane and try to solve Eq. (4.69) by visual inspection. The marginal CDF of x is given by
(4.70)
which could also have been obtained by direct integration of the marginal PDF of $X$. The marginal CDF of y is given by
(4.71)
which could also have been obtained by direct integration of the marginal PDF of $Y$.
Similarly to the univariate case discussed in Section 1.04.3.3, the conditional distribution of a vector random variable $\mathbf{X}$ restricted by a conditioning event B can be described by its respective CDF

$F_{\mathbf{X}}(\mathbf{x}|B) = \dfrac{P(\{X_1 \le x_1, \ldots, X_N \le x_N\} \cap B)}{P(B)}$  (4.72)

and PDF

$f_{\mathbf{X}}(\mathbf{x}|B) = \dfrac{\partial^N F_{\mathbf{X}}(\mathbf{x}|B)}{\partial x_1 \cdots \partial x_N}.$  (4.73)
A special case arises when B imposes a point conditioning: define $\mathbf{X}_2$ containing some of the variables in $\mathbf{X}$, such that $B = \{\mathbf{X}_2 = \mathbf{x}_2\}$, and gather the remaining ones in $\mathbf{X}_1$. It can be shown that

$f_{\mathbf{X}_1}(\mathbf{x}_1 | \mathbf{x}_2) = \dfrac{f_{\mathbf{X}}(\mathbf{x})}{f_{\mathbf{X}_2}(\mathbf{x}_2)},$  (4.74)
an intuitive result in light of Eq. (4.6)—to which one may directly refer in the case of discrete random variables. Returning to the last example, the point-conditioned PDFs can be shown to be
(4.75)
and
(4.76)
Based on the concept of statistical independence, discussed in Section 1.04.2.2, the mutual statistical independence of a set of random variables means that the probabilistic behavior of each one is not affected by the probabilistic behavior of the others. Formally, the random variables are independent if any of the following conditions is fulfilled:
Returning to the example developed in Section 1.04.3.6, x and y are clearly statistically dependent. However, if they were jointly uniformly distributed in the region defined by , they would be statistically independent—this verification is left to the reader.
The PDF of a random variable defined as the sum of N independent variables can be computed as9
(4.77)
The definition of statistical mean can be directly extended to a vector random variable with known PDF. The expected value of a real scalar function is given by:
(4.78)
We can analogously proceed to the definition of N-variable joint statistical moments. However, since the two-variable case is by far the most useful in practice, only 2-variable moments will be presented.
An th-order joint moment of and (about the origin) has the general form
(4.79)
The most important case corresponds to : is called the statistical correlation between and . Under a frequentist perspective, is the average of infinitely many products of samples x and y jointly drawn from , which links the correlation to an inner product between the random variables. Indeed, when , and are called orthogonal. One can say the correlation quantifies the overall linear relationship between the two variables. Moreover, when , and are called mutually uncorrelated; this “separability” in the mean is weaker than the distribution separability implied by independence. In fact, if and are independent,
(4.80)
i.e., they are also uncorrelated. The converse is not necessarily true.
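The asymmetry between uncorrelatedness and independence can be checked numerically. The following sketch (an illustration not taken from the text) uses a classical counterexample: x uniform on [-1, 1] and y = x², where y is a deterministic function of x and yet their covariance is zero.

```python
# Monte Carlo counterexample: x ~ Uniform(-1, 1) and y = x**2 are clearly
# dependent, but E[xy] - E[x]E[y] = E[x^3] = 0, i.e., they are uncorrelated.
import random

random.seed(0)
N = 200_000
xs = [random.uniform(-1.0, 1.0) for _ in range(N)]
ys = [x * x for x in xs]

mean_x = sum(xs) / N
mean_y = sum(ys) / N
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / N

# Covariance is near zero even though y is fully determined by x.
print(cov_xy)
```

The sample covariance converges to zero as N grows, while independence would require the joint PDF to factor, which it obviously does not here.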
An th-order joint central moment of and , computed about the mean, has the general form
(4.81)
Once more, the case is especially important: is called the statistical covariance between and , which quantifies the linear relationship between their random parts. Since10, and are uncorrelated when . If , one says and are positively correlated, i.e., the variations of their statistical samples tend to occur in the same direction; if , one says and are negatively correlated, i.e., the variations of their statistical samples tend to occur in opposite directions. For example, the age and the annual medical expenses of an individual are expected to be positively correlated random variables. A normalized covariance can be computed as the correlation between the normalized versions of and :
(4.82)
known as the correlation coefficient between and , which can be interpreted as the degree of correlation between x and y. Recalling the inner product interpretation, the correlation coefficient can be seen as the cosine of the angle between the statistical variations of the two random variables.
Consider, for example, the discrete variables and jointly distributed as shown in Table 4.3. Their joint second-order parameters are:
• correlation , indicating and are not orthogonal;
• covariance , indicating and tend to evolve in opposite directions;
• correlation coefficient , indicating this negative correlation is relatively strong.
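Since Table 4.3 is not reproduced here, the computation can still be sketched with a hypothetical joint PMF for two binary variables; the recipe (moments taken over the joint distribution) is the general one, and the chosen numbers also yield a fairly strong negative correlation.

```python
# Hypothetical joint PMF for x, y in {0, 1} (Table 4.3 is not reproduced;
# these probabilities are assumed for illustration only).
pmf = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.4, (1, 1): 0.1}

E = lambda f: sum(p * f(x, y) for (x, y), p in pmf.items())

mx, my = E(lambda x, y: x), E(lambda x, y: y)
R_xy = E(lambda x, y: x * y)                    # correlation (joint moment m_11)
C_xy = R_xy - mx * my                           # covariance
var_x = E(lambda x, y: (x - mx) ** 2)
var_y = E(lambda x, y: (y - my) ** 2)
rho = C_xy / (var_x ** 0.5 * var_y ** 0.5)      # correlation coefficient

print(R_xy, C_xy, rho)   # approximately 0.1, -0.15, -0.6
```

Here the covariance is negative (the variables tend to evolve in opposite directions) and the coefficient of -0.6 quantifies that tendency on a normalized scale.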
Returning to N-dimensional random variables, the characteristic function of a vector random variable is defined as
(4.83)
and inter-relates the moments of order by:
(4.84)
The moment generating function of a vector random variable is defined as
(4.85)
and allows the computation of as:
(4.86)
A result that partially justifies the ubiquitous use of Gaussian models, the Central Limit Theorem (CLT) states that (under mild conditions in practice) the distribution of a sum of N independent variables approaches a Gaussian distribution as . Having completely avoided the CLT proof, we can at least recall that in this case, (see Eq. (4.77)). Of course, and, by the independence property, .
Gaussian approximations for finite-N models are useful for sufficiently high N, but due to the shape of the Gaussian distribution, the approximation error grows as y moves away from .
The reader is invited to verify the validity of the approximation provided by the CLT for successive sums of independent uniformly distributed in .
Interestingly, the CLT also applies to the sum of discrete distributions: even if remains impulsive, the shape of approaches the shape of the CDF of a Gaussian random variable as N grows.
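The verification the reader is invited to perform can be sketched as follows: the sum of N independent variables uniform on [-1, 1] has mean 0 and variance N/3, and its empirical CDF should approach the corresponding Gaussian CDF.

```python
# CLT check: compare the empirical CDF of a sum of N uniforms on [-1, 1]
# with the CDF of a Gaussian of mean 0 and variance N/3.
import random
from statistics import NormalDist

random.seed(1)
N, trials = 12, 50_000
sums = [sum(random.uniform(-1.0, 1.0) for _ in range(N)) for _ in range(trials)]

sigma = (N / 3.0) ** 0.5
gauss = NormalDist(mu=0.0, sigma=sigma)

# Evaluate both CDFs at a few test points.
for point in (-2.0, 0.0, 2.0):
    empirical = sum(s <= point for s in sums) / trials
    print(point, empirical, gauss.cdf(point))
```

Already for N = 12 the two CDFs agree to about two decimal places at these points; increasing N tightens the match near the mean, in line with the remark above about the tails.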
The PDF of an N-dimensional Gaussian variable is defined as
(4.87)
where is an vector with elements , and is an matrix with elements , such that any related marginal distribution remains Gaussian. As a consequence, by Eq. (4.74), any conditional distribution with point conditioning B is also Gaussian.
By this definition, we can see the Gaussian distribution is completely defined by its first- and second-order moments, which means that if N jointly distributed random variables are known to be Gaussian, estimating their mean-vector and their covariance-matrix is equivalent to estimating their overall PDF. Moreover, if and are mutually uncorrelated, , then is diagonal, and the joint PDF becomes , i.e., will be mutually independent. This is a strong result inherent to Gaussian distributions.
The treatment of a general multiple random variable resulting from the application of the transformation to the multiple variable always starts by enforcing , with .
The special case when is invertible, i.e., (or ), admits a closed-form expression:
(4.88)
where
(4.89)
The reader should notice that this expression with reduces to Eq. (4.51) particularized to the invertible case.
Given the multiple random variable with known mean-vector and covariance-matrix , its linear transformation will have:
It is possible to show that the linear transformation of an N-dimensional Gaussian random variable is also Gaussian. Therefore, in this case the two expressions above completely determine the PDF of the transformed variable.
The generation of samples that follow a desired multivariate probabilistic distribution from samples that follow another known multivariate probabilistic distribution applies the same reasoning followed in the univariate case to the multivariate framework. The reader is invited to show that given the pair of samples from the random variables , jointly uniform in the region , it is possible to generate the pair of samples from the random variables , jointly Gaussian with the desired parameters by making
(4.90)
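Equation (4.90) is not reproduced above; the classical recipe matching its description is the Box-Muller transform, sketched here under that assumption: two independent uniform samples yield two independent standard Gaussian samples, which can then be scaled and shifted to any desired mean and variance.

```python
# Box-Muller sketch: two independent uniforms on (0, 1] -> two independent
# N(0, 1) samples. (Assumed to be the transform behind Eq. (4.90).)
import math
import random

def box_muller(u1, u2):
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)

random.seed(2)
samples = []
for _ in range(50_000):
    u1 = random.random() or 1.0   # random() lies in [0, 1); avoid log(0)
    z1, z2 = box_muller(u1, random.random())
    samples.extend((z1, z2))

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)   # close to 0 and 1
```

A pair with mean m and standard deviation s is then obtained as m + s*z1, m + s*z2, since linear transformations of Gaussian variables remain Gaussian.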
At first sight, defining a complex random variable may seem impossible, since the seminal event makes no sense (complex values admit no natural ordering). However, in the vector random variable framework this issue can be circumvented if one jointly describes the real and imaginary parts (or magnitude and phase) of .
The single complex random variable , is completely represented by , which allows one to compute the expected value of a scalar function as . Moreover, we can devise general definitions for the mean and variance of as, respectively:
An N-dimensional complex random variable can be easily tackled through a properly defined joint distribution, e.g., . The individual variables , , will be independent when
(4.91)
The following definitions must be generalized to cope with complex random variables:
Now, . As before, and will be:
One of the most encompassing applications of vector random variables is the so-called parameter estimation, itself an area supported by its own theory. The typical framework comprises using a finite set of measured data from a given phenomenon to estimate the parameters11 of its underlying model.
A set of N independent identically distributed (iid) random variables such that
(4.92)
can describe an N-size random sample of a population modeled by . Any function (called a statistic of ) can constitute a point estimator of some parameter such that given an N-size sample of , one can compute an estimate of .
A classical example of point estimator for a fixed parameter is the so-called sample mean, which estimates by the arithmetic mean of the available samples :
(4.93)
Defining a desired estimator is not always as trivial a task as in the previous example. In general, resorting to a proper analytical criterion is necessary. Suppose we are given the so-called likelihood function of , , which provides a probabilistic model for as an explicit function of . Given the sample , the so-called maximum likelihood (ML) estimator computes an estimate of by direct maximization of . This operation is meant to find the value of that would make the available sample most probable.
Sometimes the parameter itself can be a sample of a random variable described by an a priori PDF . For example, suppose one wants to estimate the mean value of a shipment of components received from a given factory; since the components may come from several production units, we can think of a statistical model for their nominal values. In such situations, any reliable information on the parameter distribution can be taken into account to better tune the estimator formulation; these so-called Bayesian estimators rely on the a posteriori PDF of :
(4.94)
Given a sample , the so-called maximum a posteriori (MAP) estimator computes an estimate of as the mode of . This choice is meant to find the most probable that would have produced the available data . Several applications favor the posterior mean estimator, which for a given sample computes , over the MAP estimator.
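The contrast between the ML and Bayesian viewpoints can be illustrated with the one case that admits simple closed forms: a Gaussian likelihood with known variance and a Gaussian prior on the mean. All the numbers below (sigma, prior parameters, sample size) are assumptions for illustration, not values from the text.

```python
# Illustration: for x_i ~ N(theta, sigma^2) with known sigma and a Gaussian
# prior theta ~ N(mu0, tau^2), the ML estimate is the sample mean, and the
# MAP estimate (= posterior mean here) shrinks it toward the prior mean.
import random

random.seed(3)
sigma, mu0, tau = 1.0, 0.0, 0.5   # assumed model parameters
theta_true = 2.0
data = [random.gauss(theta_true, sigma) for _ in range(20)]
N = len(data)

theta_ml = sum(data) / N

# The posterior of a Gaussian mean under a Gaussian prior is Gaussian; its
# mode is a precision-weighted average of the ML estimate and the prior mean.
w = (N / sigma**2) / (N / sigma**2 + 1 / tau**2)
theta_map = w * theta_ml + (1 - w) * mu0

print(theta_ml, theta_map)   # MAP lies between mu0 and the ML estimate
```

As the sample size grows, w approaches 1 and the prior's influence fades, so the MAP and ML estimates coincide asymptotically.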
The quality of an estimator can be assessed through some of its statistical properties.12 An overall measure of the estimator performance is its mean square error, which can be decomposed into two parts:
(4.95)
where is the estimator bias and is its variance. The bias measures the deterministic part of the error, i.e., how much the estimates deviate from the target in the mean, thus providing an accuracy measure: the smaller its bias, the more accurate the estimator. The variance, in turn, measures the random part of the error, i.e., how much the estimates spread about their mean, thus providing a precision measure: the smaller its variance, the more precise the estimator. Another property attributable to an estimator is consistency13: is said to be consistent when . Using the Chebyshev inequality (see Section 1.04.3.4), one finds that an unbiased estimator with is consistent. It can be shown that the sample mean defined in Eq. (4.93) has (thus is unbiased) and (thus is consistent). The similarly built estimator for , the unbiased sample variance, computes
(4.96)
The reader is invited to prove that this estimator is unbiased, while its more intuitive form with denominator N instead of is not.
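Short of the analytical proof, the claim can be checked numerically: averaging both variance estimators over many independent samples, the 1/(N-1) version approaches the true variance while the 1/N version underestimates it by the factor (N-1)/N.

```python
# Bias check for the two sample-variance estimators with N = 5 samples of
# unit-variance Gaussian noise, averaged over many independent trials.
import random

random.seed(4)
true_var, N, trials = 1.0, 5, 40_000

def var_hat(data, denom):
    m = sum(data) / len(data)
    return sum((x - m) ** 2 for x in data) / denom

biased, unbiased = 0.0, 0.0
for _ in range(trials):
    data = [random.gauss(0.0, true_var ** 0.5) for _ in range(N)]
    biased += var_hat(data, N) / trials
    unbiased += var_hat(data, N - 1) / trials

print(biased, unbiased)   # about 0.8 and 1.0 for N = 5
```

The gap is largest for small N, which is exactly where the correction matters in practice.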
One could question how reliable a point estimate is. In fact, if the range of is continuous, one can deduce that . However, perhaps we would feel safer if the output of an estimator were something like “ with probability p.” and are the confidence limits and p is the confidence of this interval estimate.
If the outcomes s of a given random experiment are amenable to variation in time t, each one can be mapped by some into a real function . This is a direct generalization of the random variable. As a result, we get an infinite ensemble of time-varying sample functions or realizations which form the so-called stochastic or random process. For a fixed , reduces to a single random variable, thus a random process can be seen as a time-ordered multiple random variable. An example of random process is the simultaneous observation of air humidity at every point of the globe, : each realization describes the humidity variation in time at a randomly chosen place on the earth whereas the random variable describes the distribution of air humidity over the earth at instant . A similar construction in the discrete time n domain would produce composed of realizations . If the hourly measures of air humidity substitute for its continuous measure in the former example, we get a new random process .
As seen, there can be continuous- or discrete-time random processes. Since random variables have already been classified as continuous or discrete, one analogously refers to continuous-valued and discrete-valued random processes. Combining these two classifications according to time and value is bound to be ambiguous or awkwardly lengthy; when this complete categorization is necessary, one favors the nomenclature random process for the continuous-time case and random sequence for the discrete-time case, using the words “continuous” and “discrete” only with reference to amplitude values.14 In the former examples, the continuous random process and random sequence would model the idealized actual air humidity as a physical variable, whereas the practical digitized measures of air humidity would be modeled by their discrete counterparts.
One could think of the complete description of a random process or sequence as a joint CDF (or PDF) encompassing every possible instant of time t or n, respectively, which is obviously impractical. However, their partial representation by Lth-order distributions (employing a slightly modified notation to include the reference to time) can be useful. The CDF of a random process can be defined as
(4.97)
(4.98)
The CDF of a random sequence can be defined as
(4.99)
with an associated PDF
(4.100)
This convention can be easily extended to the joint distributions of M random processes (sequences ), as will be seen in the next subsection. The notation for point-conditioned distributions can also be promptly inferred (see Section 1.04.4.12).
As with random variables, the mutual independence of random processes or sequences is tested by the complete separability of their respective distributions. The random processes , are independent when
(4.101)
for any choice of time instants. Similarly, the random sequences , are independent when
(4.102)
for any choice of time instants.
It is not necessary to redefine moments in the context of random processes and sequences, since we will always be tackling a set of random variables as in Section 1.04.3.8. But it is useful to revisit the first- and second-order cases with a slightly modified notation to include time information.
For a random process , the following definitions apply:
As seen for random variables,
(4.103)
which means that the mean instantaneous power of the process is the sum of a deterministic part and a random part.
Analogously, for a random sequence , the following definitions apply:
Again,
(4.104)
i.e., the mean instantaneous power of the sequence is the sum of a deterministic part and a random part.
Given two random processes and , one defines:
The processes and are said to be mutually:
The mean instantaneous power of is given by
(4.105)
which suggests the definition of as the mean instantaneous cross power between and .
Analogously, given two random sequences and , one defines:
The sequences and are said to be mutually:
The mean instantaneous power of is given by
(4.106)
which suggests the definition of as the mean instantaneous cross power between and .
The interpretation of this “cross power” is not difficult: it accounts for the constructive or destructive interaction of the two involved processes (sequences) as dictated by their mutual correlation.
We will dedicate some space later to the properties of second-order moments.
The stationarity of random processes and sequences refers to the time-invariance of their statistical properties, which makes their treatment considerably easier.
Strict-sense is the strongest class of stationarity. A strict-sense stationary (SSS) random process (sequence ) bears exactly the same statistical properties as (), which means that its associated joint distribution for any set of time instants does not change when they are all shifted by the same time lag. A random process (sequence) in which each realization is a constant sampled from some statistical distribution in time is SSS.
Under this perspective, we can define that a random process is Lth-order stationary when
(4.107)
or, alternatively,
(4.108)
Analogously, a random sequence is Lth-order stationary when
(4.109)
or, alternatively,
(4.110)
We can say that an SSS random process or sequence is stationary for any order L. Moreover, Lth-order stationarity implies (L−1)th-order stationarity, by marginalization on both sides of any of the four equations above.
First-order stationarity says the statistical distribution at any individual instant of time t (n) is the same, i.e., does not depend on t ( does not depend on n). A natural consequence is the time-invariance of its statistical mean: . Second-order stationarity says the joint statistical distribution at any two time instants t and for a fixed (n and for a fixed k) is the same. A natural consequence is the time-invariance of its auto-correlation, which can depend only on the time lag between the two time instants: .
A weaker class of stationarity called wide-sense is much used in practice for its simple testability. A random process (sequence ) is said to be wide-sense stationary (WSS) when its mean and auto-correlation are time-invariant. Such conditions do not imply stationarity of any order, although second-order stationarity implies wide-sense stationarity. As an extension of this definition, one says that two processes and (sequences and ) are jointly WSS when they are individually WSS and their cross-correlation is time-invariant, i.e., can depend only on the time lag between the two time instants: .
Wide-sense stationarity in random processes and sequences induces several interesting properties, some of which are listed below. For processes,
For sequences,
Even if the properties are not proven here, some interesting observations can be traced about them:
• The statistical behavior of the process (sequence) at a given instant of time is maximally correlated with itself, an intuitive result.
• The autocorrelation commutes.
• The statistical behaviors of a WSS process (sequence) with no periodic components at two time instants separated by are uncorrelated. Then, the autocorrelation tends to the product of two identical statistical means.
• Periodic components in the process (sequence) cause the correlation maximum and surrounding behavior to repeat at each fundamental period.
• The arithmetic and geometric mean properties are in a certain sense linked to the first property. The second property is more stringent than the first.
The proper use of random processes and sequences to model practical problems depends on the careful understanding of both their statistical and time variations. The main difference between these two contexts, which must be kept in mind, is the fact that time is inexorably ordered, thus allowing the inclusion in the model of some deterministic (predictable) time-structure. When we examine the statistical auto-correlation between two time instants of a random process or sequence, we learn how the statistical samples taken at those instants follow each other. In a WSS process or sequence, this measure for different lags can convey an idea of statistical periodicity.15 But what if one is concerned with the time structure and characteristics of each (or some) individual realization of a given process or sequence? The use of time averages can provide this complementary description.
The time average of a given continuous-time function can be defined as
(4.111)
Given the random process , the time average of any of its realizations is
(4.112)
The average power of is
(4.113)
The time auto-correlation of for a lag is defined as
(4.114)
The time cross-correlation between the realizations of process and of process for a lag is defined as
(4.115)
As in the statistical case, an average cross power between and , which accounts for their mutual constructive or destructive interferences due to their time structure, can be defined as
(4.116)
Computed for the complete ensemble, such measures produce the random variables , , , , and , respectively. Under mild convergence conditions,
(4.117)
(4.118)
(4.119)
(4.120)
(4.121)
which are respectively the overall mean value, mean power, and auto-correlation (for lag ) of process , and cross-correlation (for lag ) and mean cross power between processes and .
The time average of a given discrete-time function can be defined as
(4.122)
Given the random sequence , the time average of any of its realizations is
(4.123)
The average power of is
(4.124)
The time auto-correlation of for lag k is defined as
(4.125)
The time cross-correlation between the realizations of sequence and of sequence for lag k is defined as
(4.126)
As in the statistical case, an average cross power between and , which accounts for their mutual constructive or destructive interferences due to their time structure, can be defined as
(4.127)
Computed for the complete ensemble, such measures produce the random variables , , , and , respectively. Under mild convergence conditions,
(4.128)
(4.129)
(4.130)
(4.131)
and
(4.132)
which are respectively the overall mean value, mean power, and auto-correlation (for lag k) of sequence , and cross-correlation (for lag k) and mean cross power between sequences and .
As an example, consider that a deterministic signal (with ) is additively contaminated by random noise which can be modeled as a realization of the WSS random process (with and ), thus yielding the random signal . The corresponding process is obviously not WSS, since . Furthermore, (following the nomenclature we have adopted) we can compute its:
Defining the signal-to-noise ratio (SNR) as the ratio between signal and noise powers, we can conclude that .
Now, suppose that two differently contaminated versions of are available: and , and that the underlying noise processes and are jointly WSS and uncorrelated, with , , and . If an average signal is computed, we can guarantee and (i.e., noise is reduced) as long as .
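The noise-reduction argument above rests on a simple fact that can be verified directly: averaging two uncorrelated zero-mean noises of equal power halves the noise power, hence doubles the SNR of the averaged signal.

```python
# Averaging two uncorrelated unit-power noises: the average (n1 + n2)/2
# has power about 1/2, so the signal-to-noise ratio of the averaged
# observation doubles (+3 dB).
import random

random.seed(5)
M = 100_000
n1 = [random.gauss(0.0, 1.0) for _ in range(M)]
n2 = [random.gauss(0.0, 1.0) for _ in range(M)]

power = lambda seq: sum(v * v for v in seq) / len(seq)
avg_noise = [(a + b) / 2.0 for a, b in zip(n1, n2)]

print(power(n1), power(avg_noise))   # the second is about half the first
```

Were the noises correlated, the cross term would not vanish and the gain would be smaller, which is exactly the role of the uncorrelatedness hypothesis in the text.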
The so-called ergodicity is a property that allows interchanging statistical and temporal characteristics of some random processes (sequences)—which are then called ergodic. There are several levels of ergodicity, some of which are discussed below.
A random process with constant statistical mean is said to be mean-ergodic when any time average is equal to with probability 1, which requires . If is WSS, a necessary and sufficient condition for mean-ergodicity is
(4.133)
A random process with time-invariant auto-correlation is said to be auto-correlation-ergodic when any time auto-correlation is equal to with probability 1, which requires . Two processes and with time-invariant cross-correlation are cross-correlation-ergodic when any time cross-correlation is equal to with probability 1, which requires . Conditions for correlation-ergodicity of random processes involve 4th-order moments.
A random sequence with constant statistical mean is said to be mean-ergodic when any time average is equal to with probability 1, which requires . If is WSS, a necessary and sufficient condition for mean-ergodicity is
(4.134)
A random sequence with time-invariant auto-correlation is said to be auto-correlation-ergodic when any time auto-correlation is equal to with probability 1, which requires . Two sequences and with time-invariant cross-correlation are cross-correlation-ergodic when any time cross-correlation is equal to with probability 1, which requires . Conditions for correlation-ergodicity of random sequences involve 4th-order moments.
A process (sequence) that is mean- and auto-correlation-ergodic is called wide-sense ergodic. Two wide-sense ergodic processes (sequences) that are cross-correlation-ergodic are called jointly wide-sense ergodic.
A process (sequence) is distribution-ergodic when it is ergodic for every moment.
The SSS random process (sequence) formed by random constant realizations is not ergodic in any sense. From Eqs. (4.133) and (4.134), a WSS process (sequence ), statistically uncorrelated at any different time instants and ( and ), i.e., with (), is mean-ergodic.
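The contrast drawn in this paragraph can be sketched numerically: for a white (uncorrelated) WSS sequence, the time average of any single realization approaches the ensemble mean, whereas for the random-constant sequence each realization's time average is just its own constant, so no single realization reveals the ensemble mean.

```python
# Mean-ergodicity contrast: a white sequence versus a random-constant
# sequence, both with ensemble mean 1.0.
import random

random.seed(6)
mean, T = 1.0, 20_000

# White sequence: a fresh, independent sample at every time instant.
time_avg_white = sum(random.gauss(mean, 1.0) for _ in range(T)) / T

# Random-constant sequence: one draw, repeated along the whole time axis.
c = random.gauss(mean, 1.0)
time_avg_const = sum(c for _ in range(T)) / T

print(time_avg_white, time_avg_const)   # first near 1.0; second equals c
```

The white sequence satisfies the sufficient condition quoted from Eqs. (4.133) and (4.134); the random-constant sequence, although SSS, does not.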
In practical real-life situations, quite often just a single realization of an underlying random process (sequence) is available. In such cases, if the latter is known to be ergodic, we can make use of the complete powerful statistical modeling framework described in this chapter. But how can one guarantee the property is enjoyed by a process (sequence) of which only one sample function is known? This is not so stringent a requirement: in fact, one needs just to be sure that there can be an ergodic process (sequence) of which that time function is a realization. Strictly speaking, a given music recording additively contaminated by background noise cannot be modeled as a realization of a mean-ergodic process,16 since each member of the ensemble would combine the same with a different . The random process which describes the noisy signal can be written as ; thus, , while in practice we know that .
Define the random sequences and , such that:
• is a random variable with mean and variance ;
• is a random variable with mean and variance ;
• rad/sample is a real constant;
• is a random variable uniformly distributed between and rad;
• is a random variable uniformly distributed between and rad;
Compute their time and statistical means, mean powers, auto-correlations, cross-correlations, and mean cross-powers; and discuss their correlation, orthogonality, stationarity, and ergodicity.
Expected values:
(4.143)
(4.144)
(4.145)
then,
(4.146)
(4.147)
then,
(4.148)
(4.149)
Conclusions:
A very special class of random processes (sequences) is the Gaussian-distributed one. The corresponding Lth-order PDFs can be written as (see Section 1.04.3.10):
(4.150)
where is the mean-vector with elements , for , and is the covariance-matrix with elements , for ;
(4.151)
where is the mean-vector with elements , for , and is the covariance-matrix with elements , for .
From the definition above:
• A WSS Gaussian random process (sequence) is SSS: Since the PDF of a Gaussian process (sequence) is entirely described by first- and second-order moments, if these moments do not depend on t (n), the PDF itself does not depend on t (n).
• Uncorrelated Gaussian processes (sequences) are independent: A joint PDF of two uncorrelated Gaussian processes (sequences) can be easily factorized as the product of their individual PDFs, since the covariance-matrix becomes block diagonal.
These two strong properties make Gaussian models mathematically simpler to tackle, and the strict-sense stationarity and independence conditions easier to meet in practice.
The reader is invited to show that a stationary Gaussian process (sequence) whose auto-correlation is absolutely integrable in (summable in k) is ergodic.
The Poisson process is an example of a discrete random process whose distribution we have already met in Section 1.04.3.2. Each realization of counts the occurrences along time of a certain event whose mean occurrence rate is per time unit, provided that.
It can model the entry of clients in a store, for example. By convention:
Time origin can be shifted if necessary.
From Eq. (4.32), the first-order PDF of a Poisson process is given by
(4.152)
Applying the definition iteratively, one can show that for , the corresponding Lth-order PDF is
(4.153)
Analogously to what has been done for random variables, real random processes and sequences can be generalized to handle the complex case.
A complex random process can be described as , where and are real random processes. It is stationary in some sense as long as its real and imaginary parts are jointly stationary in that sense. The remaining definitions related to process stationarity are kept the same.
For a complex random process , the following definitions apply:
Given two random processes and , one defines:
The processes and are said to be mutually:
For a random sequence , the following definitions apply:
Given two random sequences and , one defines:
The sequences and are said to be mutually:
The simplest random processes (sequences) are those whose statistics at a given time are independent from every other time: a first-order distribution suffices to describe them. A less trivial model is found when the statistical time interdependency assumes a special recursive behavior such that the knowledge of a past conditioning state summarizes all previous history of the process (sequence). For the so-called Markov random process (sequence), the knowledge of the past does not affect the expectation of the future when the present is known.
Mathematically, a random process is said to be a Markov process when for ,
(4.154)
or for ,
(4.155)
We will restrict ourselves to the discrete-time case. A random sequence is said to be a Markov sequence when
(4.156)
which can be seen as a transition PDF. From the definition, one arrives at the chain rule
(4.157)
which means that the overall Markov sequence can be statistically described for from the knowledge of its distribution at and its subsequent transition distributions. Some interesting properties can be deduced from the expressions above:
A particularly important property of Markov sequences is the so-called Chapman-Kolmogorov equation: for ,
(4.158)
which provides a recursive way to compute arbitrary transition PDFs.
We can say a Markov sequence is stationary when and are shift-invariant. In this case, the overall sequence can be obtained from . A less trivial (and quite useful) model is provided by homogeneous Markov sequences, which are characterized only by a shift-invariant transition distribution; they are not stationary in general, but can be asymptotically stationary (i.e., for ) under certain conditions.
Discrete Markov processes and sequences are called Markov chains, which can assume a countable number of random states described by their state probabilities ( for sequences) and transition probabilities ( for sequences). Discrete-time Markov chains enjoy the following properties, for :
• , which totals all possible ways to leave state in ;
• , which totals all possible ways to arrive at state in ;
If the chain has a finite number of states, a matrix notation can be employed. The state probability vector , with elements , and the transition probability matrix , with elements , are related by .
The special class of homogeneous Markov chains enjoys the following additional properties:
Accordingly, the transition probability matrix becomes . Defining ,
(4.159)
i.e., . When asymptotic stationarity is reachable, one can find the steady-state probability vector such that .
Consider, for example, the homogeneous Markov chain depicted in Figure 4.12, with states 1 and 2, state probabilities and , and transition probabilities and . Its corresponding one-sample transition matrix is
(4.160)
such that
(4.161)
It can also be shown that the chain reaches the steady-state probability vector
(4.162)
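Since the numbers in Figure 4.12 are not reproduced here, the steady-state computation can be sketched with a hypothetical two-state homogeneous chain: iterating the recursion p <- p P drives the state-probability vector to the steady-state vector pi satisfying pi = pi P.

```python
# Hypothetical two-state homogeneous Markov chain (transition probabilities
# assumed for illustration; Figure 4.12's values are not reproduced).
P = [[0.8, 0.2],   # row i: transition probabilities out of state i
     [0.3, 0.7]]

def step(p, P):
    # One application of p <- p P (row vector times transition matrix).
    return [sum(p[i] * P[i][j] for i in range(len(p))) for j in range(len(P[0]))]

p = [1.0, 0.0]   # start surely in state 1
for _ in range(200):
    p = step(p, P)

print(p)   # converges to the steady state [0.6, 0.4] for this P
```

The fixed point can be confirmed by hand: 0.6*0.8 + 0.4*0.3 = 0.6 and 0.6*0.2 + 0.4*0.7 = 0.4, so pi = (0.6, 0.4) indeed satisfies pi = pi P.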
At first sight, the direct conversion of each realization () to the frequency domain seems the easiest way to spectrally characterize a random process (sequence ). However, it is difficult to guarantee the existence of the Fourier transform
(4.163)
(in the case of random processes) or
(4.164)
(in the case of random sequences) for every realization. And even if possible, we would get a random spectrum of limited applicability. In Section 1.04.4.5, we have found that the auto-correlation conveys information about every sinusoidal component found in the process (sequence). A correct interpretation of the Fourier transform17 suggests that the auto-correlation in fact conveys information about the overall spectrum of the process (sequence). Furthermore, it is a better-behaved function than the individual realizations, and thus amenable to being Fourier transformed.
In Section 1.04.4.6, Eq. (4.118), we defined the overall mean power of the random process , which for the general complex case, becomes
(4.165)
It can be shown that
(4.166)
where
(4.167)
and
(4.168)
We can then define a power spectral density
(4.169)
such that . Some additional algebraic manipulation yields
(4.170)
which directly relates the auto-correlation to the power spectral density of the process, as predicted. From the expressions above, is a non-negative real function of . Furthermore, if is real, then is an even function of .
In Section 1.04.4.6, Eq. (4.121), we also defined the overall mean cross-power between the processes and , which for the general complex case, becomes
(4.171)
Following the same steps as above, we arrive at
(4.172)
where is defined as before,
(4.173)
and
(4.174)
We can then define the corresponding cross-power spectral density
(4.175)
such that . In terms of the cross-correlation, the cross-power spectral density can be written as
(4.176)
As expected, the cross-power density of orthogonal processes is zero. From the expressions above, . Furthermore, if and are real, then the real part of is even and the imaginary part of is odd.
It should be noticed that in the WSS case, , , , and .
A similar development can be done for random sequences. In Section 1.04.4.6, Eq. (4.129), we defined the overall mean power of the random sequence , which for the general complex case, becomes
(4.177)
It can be shown that
(4.178)
where
(4.179)
and
(4.180)
We can then define a power spectral density
(4.181)
such that . Some additional algebraic manipulation yields
(4.182)
which directly relates the auto-correlation to the power spectral density of the random sequence, as expected. From the expressions above, is a non-negative real function of . Furthermore, if is real, then is an even function of .
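As a numerical sanity check of these definitions, the sketch below (Python/NumPy; the smoothing filter is an arbitrary illustrative choice) estimates the power spectral density of a WSS sequence by averaging periodograms and verifies that the overall mean power is recovered identically from the time domain and from the spectral estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 256, 2000

# Realizations of a WSS sequence: white noise smoothed by a short FIR filter
h = np.array([1.0, 0.5, 0.25])
x = np.stack([np.convolve(rng.standard_normal(N + len(h) - 1), h, 'valid')
              for _ in range(trials)])

# Averaged periodogram: an estimate of the power spectral density
S = np.mean(np.abs(np.fft.fft(x, axis=1))**2, axis=0) / N

# Overall mean power two ways: time average of x^2, and the average of S
# over the uniform frequency grid (a Riemann sum of the spectral integral)
P_time = np.mean(x**2)
P_freq = np.mean(S)
print(P_time, P_freq)   # both close to sum(h**2) = 1.3125
```

By Parseval's relation the two numbers agree exactly; both approach the theoretical mean power as the number of realizations grows.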
In Section 1.04.4.6, Eq. (4.132), we also defined the overall mean cross-power between the random sequences and , which for the general complex case, becomes
(4.183)
Following the same steps as above, we arrive at
(4.184)
where is defined as before,
(4.185)
and
(4.186)
We can then define the corresponding cross-power spectral density
(4.187)
such that . In terms of the cross-correlation, the cross-power spectral density can be written as
(4.188)
As expected, the cross-power density of orthogonal sequences is zero. From the expressions above, . Furthermore, if and are real, then the real part of is even and the imaginary part of is odd.
One usually refers to a disturbing signal as noise. Since noise signals are typically unpredictable, they are preferably modeled as realizations of some noise random process (sequence ).
A very special case is the so-called white noise (by analogy with white light), which is a uniform combination of all frequencies.
Continuous-time white noise is sampled from a random process characterized by the following properties:
From the last property, continuous-time white noise is not physically realizable. Furthermore, WSS continuous-time white noise has .
Discrete-time white noise is sampled from a random sequence characterized by the following properties:
For WSS discrete-time white noise, , , and .
Unless stated otherwise, white noise is implicitly assumed to be generated by a WSS random process (sequence). Notice, however, that the sequence , where rad and is WSS white noise with unit variance, for example, satisfies all conditions to be called white noise, even if .
A common (more stringent) model of white noise imposes that the values of the underlying process (sequence) at different time instants be statistically independent and identically distributed (i.i.d.).
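A minimal numerical illustration of this model, assuming NumPy and Gaussian i.i.d. samples: the estimated auto-correlation equals the variance at lag zero and is approximately zero at every other lag.

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.standard_normal(100_000)   # i.i.d. zero-mean, unit-variance white noise

def autocorr(x, m):
    """Biased estimate of r(m) = E[x[n] x[n-m]] by time averaging."""
    return np.mean(x[m:] * x[:len(x) - m])

print(autocorr(w, 0))    # ~1: the noise variance
print(autocorr(w, 1))    # ~0: samples at different instants are uncorrelated
print(autocorr(w, 10))   # ~0
```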
Any random noise whose associated power spectral density is not constant is said to be colored noise. One can find in the literature several identified colored noises (pink, grey, etc.), each one with a pre-specified spectral behavior. It should also be noticed that any band-limited approximation of white noise destroys its non-correlation property. Consider generated by a WSS random process such that
(4.189)
It is common practice to call “white noise” of bandwidth W and overall mean power P. Its auto-correlation is
(4.190)
An interesting application of the spectral description of random processes is modeling of AM (amplitude modulation). Assuming a carrier signal modulated by a real random signal sampled from a process , the resulting modulated random process has auto-correlation
(4.191)
If is WSS, then
(4.192)
and the corresponding power spectral density is
(4.193)
We conclude that the AM structure of is conveyed by its auto-correlation, which provides a simple statistical model for AM.
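A discrete-time numerical check of this model, assuming a carrier with uniformly distributed random phase (a common assumption that renders the modulated process WSS; under it, the modulated auto-correlation is half the baseband auto-correlation times a cosine at the carrier frequency) and an illustrative moving-average baseband process:

```python
import numpy as np

rng = np.random.default_rng(6)
N, trials, wc = 4096, 500, 0.3 * np.pi

# Ensemble of AM realizations: WSS baseband noise times a random-phase carrier
r_est = np.zeros(4)
for _ in range(trials):
    x = np.convolve(rng.standard_normal(N + 4), np.ones(5) / 5, 'valid')
    th = rng.uniform(0, 2 * np.pi)
    y = x * np.cos(wc * np.arange(N) + th)
    for m in range(4):
        r_est[m] += np.mean(y[m:] * y[:N - m])
r_est /= trials

# Theory: r_y(m) = 0.5 * r_x(m) * cos(wc m); here r_x(m) = (5 - m) / 25
m = np.arange(4)
r_theory = 0.5 * (5 - m) / 25 * np.cos(wc * m)
print(r_est, r_theory)
```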
Quite often we find ourselves dealing with band-limited random signals. In the AM discussion above, typically has bandwidth . For a random process whose spectrum is at least concentrated around (baseband process), we can attribute to it an RMS (root-mean-square) bandwidth such that
(4.194)
If the spectrum is concentrated around a centroid
(4.195)
is given by
(4.196)
If the process bandwidth excludes , it is called18 a “bandpass” process.
It can be shown that if a band-limited baseband WSS random process with bandwidth W, i.e., such that for , is sampled at rate to generate the random sequence
(4.197)
then can be recovered from with zero mean-squared-error. This is the stochastic version of the Nyquist criterion for lossless sampling.
Statistical signal modeling is often employed in the context of signal processing, and thus the interaction between random signals and processing systems calls for a systematic approach. In this chapter, we will restrict ourselves to linear time-invariant (LTI) systems.
A system defined by the operation between input x and output y is called linear if any linear combination of m inputs produces as output the same linear combination of the m corresponding outputs:
For a linear system, one can define an impulse response h as the system output to a unit impulse applied to the system input at a given time instant:
• for a continuous-time system, a unit impulse applied at instant , produces an impulse response ,
• for a discrete-time system, a unit impulse19 applied at instant k, produces an impulse response .
A system is called time-invariant if a given input x applied at different time instants produces the same output y, time-shifted accordingly:
The output of an LTI system to any input x is the convolution between the input and the impulse response h of the system:
In the frequency domain, the output Y is related to the input X by the system frequency response H:
The product XH embodies the concept of filtering: H defines how much of each frequency component of the input X passes to the output Y. From now on, when referring to the processing of a random process (sequence) by a system, we imply that each realization of the process is filtered by that system.
The filtering of a random process by an LTI system with impulse response results in another random process such that . Assuming WSS and possibly complex, as well as , then will also be WSS and20:
The processing of a random sequence by an LTI system with impulse response results in another random sequence such that . Assuming WSS and possibly complex, as well as , then will also be WSS and21:
Among the above results, one of the most important is the effect of filtering on the power spectral density of the input WSS random process (sequence): it is multiplied by the squared magnitude of the frequency response of the LTI system to produce the power spectral density of the output WSS random process (sequence). As a direct consequence, any WSS random process (sequence) with known power spectral density S can be modeled by the output of an LTI system with squared magnitude response whose input is white noise.
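This key result can be checked numerically. The sketch below (Python/NumPy; the FIR impulse response is an arbitrary example) passes unit-variance white noise through an LTI system and compares the averaged periodogram of the output with the squared magnitude of the frequency response:

```python
import numpy as np

rng = np.random.default_rng(2)
N, trials = 512, 4000
h = np.array([0.5, 1.0, 0.5])    # FIR impulse response of the LTI system

# Pass unit-variance white noise through the system and average periodograms
Sy = np.zeros(N)
for _ in range(trials):
    w = rng.standard_normal(N + len(h) - 1)
    y = np.convolve(w, h, 'valid')
    Sy += np.abs(np.fft.fft(y))**2 / N
Sy /= trials

# Theory: S_y(Omega) = |H(Omega)|^2 * S_w(Omega), with S_w = 1 here
H2 = np.abs(np.fft.fft(h, N))**2
print(np.max(np.abs(Sy - H2)))   # small residual
```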
In order to qualitatively illustrate this filtering effect:
• Figure 4.13 shows 200 samples of white noise ;
• Figure 4.14 shows 200 samples of , a slowly evolving signal obtained by low-pass filtering of to 20% of its original bandwidth;
• Figure 4.15 shows 200 samples of , an almost sinusoidal signal obtained by band-pass filtering of to the central 4% of its original bandwidth.
From the area of optimum filtering, whose overall goal is finding the best filter to perform a defined task, we can bring an important application of the concepts presented in this section. The general discrete Wiener filter H, illustrated in Figure 4.16, is the linear time-invariant filter which minimizes the mean quadratic value of the error between the desired signal and the filtered version of the input signal . Assuming that and are realizations of two jointly WSS random sequences and , if the optimum filter is not constrained to be causal, one finds that it must satisfy the equation
(4.198)
i.e., its frequency response is
(4.199)
If one needs a causal FIR filter, a different but equally simple solution can be found for .
We can solve a very simple example where we compute an optimum zero-order predictor for a given signal of which a noisy version is available. In this case, (a simple gain to be determined), the input (signal additively corrupted by noise ) and . The solution is
(4.200)
which yields the minimum mean quadratic error
(4.201)
Notice that if no noise is present,
(4.202)
and
(4.203)
The reader should be aware that this example studies a theoretical probabilistic model, which depends on parameters that would probably not be available in practice and would instead have to be estimated. In fact, it points out the solution a practical system should pursue.
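To make the example concrete, the sketch below simulates the zero-order predictor for assumed variances (4 for the signal and 1 for the noise, illustrative values only), with zero-mean uncorrelated signal and noise; in that case the optimum gain reduces to the ratio between the signal variance and the total input variance, and the empirical mean quadratic error is indeed minimized near that gain:

```python
import numpy as np

rng = np.random.default_rng(3)
var_d, var_n, N = 4.0, 1.0, 200_000

d = np.sqrt(var_d) * rng.standard_normal(N)       # desired signal
x = d + np.sqrt(var_n) * rng.standard_normal(N)   # noisy observation

# Zero-order Wiener gain for uncorrelated zero-mean signal and noise
h_opt = var_d / (var_d + var_n)                   # = 0.8 here

# Empirical mean quadratic error as a function of the gain: minimum at h_opt
gains = np.linspace(0, 1, 101)
mse = [np.mean((d - g * x)**2) for g in gains]
print(h_opt, gains[np.argmin(mse)])
```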
The Wiener filter is a kind of benchmark for adaptive filters [7], which optimize themselves recursively.
A direct way to find the impulse response of an LTI system is to apply white noise with power spectral density to the system input, which yields . From an estimate of the cross-correlation, one can obtain an estimate of the desired impulse response: . In practice, assuming ergodicity, the following approximations are employed:
Furthermore, the operations are often performed in the discrete-time domain.
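A discrete-time sketch of this identification procedure (Python/NumPy; the “unknown” impulse response is an arbitrary example), assuming ergodicity so that the cross-correlation can be estimated by a time average:

```python
import numpy as np

rng = np.random.default_rng(4)
h_true = np.array([1.0, -0.5, 0.25, 0.1])   # "unknown" impulse response
N = 500_000

w = rng.standard_normal(N)                  # white input with unit variance
y = np.convolve(w, h_true)[:N]              # system output

# With white input of variance S0 = 1, r_yx(m) = S0 * h(m);
# estimate the cross-correlation by time averaging
h_est = np.array([np.mean(y[m:] * w[:N - m]) for m in range(len(h_true))])
print(h_est)   # close to h_true
```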
The effective bandwidth of an LTI system with frequency response can be estimated by its noise bandwidth. For a low-pass system, one looks for an equivalent ideal low-pass filter with bandwidth around , with bandpass gain ; is computed to guarantee that both systems deliver the same mean output power when the same white noise is applied to both their inputs. A straightforward algebraic development yields
(4.204)
For a band-pass system with magnitude response centered about , one looks for an equivalent ideal band-pass filter with bandwidth around , with bandpass gain ; is computed to guarantee that both systems deliver the same mean output power when the same white noise is applied to both their inputs. For this case,
(4.205)
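As an illustration, the noise bandwidth of a first-order low-pass squared magnitude response 1/(1 + (f/f_c)²) (an assumed example, not taken from the text) can be computed numerically and compared with its analytic value (π/2)f_c:

```python
import numpy as np

fc = 1000.0                                 # 3 dB cutoff of a first-order low-pass
f = np.linspace(0.0, 1000.0 * fc, 2_000_000)
H2 = 1.0 / (1.0 + (f / fc)**2)              # |H(f)|^2, with |H(0)| = 1

# Noise bandwidth: width of the ideal filter with gain |H(0)| that delivers the
# same mean output power as H when white noise is applied to both inputs
df = f[1] - f[0]
W_N = np.sum(H2) * df / H2[0]
print(W_N, np.pi * fc / 2)                  # numerical vs analytic value
```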
A real “bandpass” random process (see Section 1.04.4.15) about center frequency can be expressed as , where the envelope and the phase are both baseband random processes. We can write
(4.206)
where and are the quadrature components of . In the typical situation where is a zero-mean WSS random process with auto-correlation , we can explicitly find the first- and second-order moments of the quadrature components that validate the model. First, define the Hilbert transform of a given signal as , where
(4.207)
From , we can find and proceed to write:
In the frequency domain, if we define
(4.208)
then
In the particular case where is Gaussian, it can be shown that and are Gaussian, and thus completely defined by their first- and second-order moments above.
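The envelope implied by this representation can be extracted with a discrete Hilbert transform computed in the frequency domain (an FFT-based approximation that zeroes the negative frequencies, assuming NumPy; the bandpass realization below is an illustrative noise-modulated carrier, not an example from the text):

```python
import numpy as np

rng = np.random.default_rng(7)
N, wc = 8192, 0.4 * np.pi

# A real "bandpass" realization: slowly varying baseband noise times a carrier
a = np.convolve(rng.standard_normal(N + 9), np.ones(10) / 10, 'valid')
x = a * np.cos(wc * np.arange(N))

# Discrete Hilbert transform via FFT: keep only the non-negative frequencies
X = np.fft.fft(x)
X[N // 2 + 1:] = 0
X[1:N // 2] *= 2
z = np.fft.ifft(X)        # analytic signal x + j * (Hilbert transform of x)
envelope = np.abs(z)
print(np.mean(np.abs(envelope - np.abs(a))))   # small: envelope tracks |a|
```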
A discrete-time signal with spectrum can be modeled as the output of a linear time-invariant system whose frequency response is equal to when excited by a unit impulse . In this case, the impulse response of the system is expected to be .
In the context of statistical models, an analogous model can be built: now we look for a linear time-invariant system which produces at its output the WSS random sequence with power spectral density when excited by a unit-variance white-noise random sequence . As before, the frequency response of the system alone is expected to shape the output spectrum, this time in the mean-power sense, i.e., the system must be such that .
Assume the overall modeling system is described by the difference equation
(4.209)
where is white noise with . The corresponding transfer function is
(4.210)
The output of this general system model with at its input is called ARMA (auto-regressive moving-average22) process of order . It can be described by the following difference equation in terms of its auto-correlation:
(4.211)
The values of for can be easily found by symmetry.
If one restricts the system to be FIR (i.e., to have a finite-duration impulse response), then , i.e., for , and
(4.212)
The output of this “all-zero”23 system model with at its input is called an MA (moving-average) process of order q. Its auto-correlation can be found to be
(4.213)
In this case, for or . This model is better suited to modeling notches in the random sequence power spectrum.
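A quick numerical check of this property (Python/NumPy, with illustrative MA(2) coefficients): the estimated auto-correlation matches the finite sum of coefficient products up to lag q and vanishes beyond it.

```python
import numpy as np

rng = np.random.default_rng(8)
b = np.array([1.0, 0.7, -0.3])     # hypothetical MA(2) coefficients
N = 500_000

w = rng.standard_normal(N)
x = np.convolve(w, b)[:N]          # MA(2) process driven by unit-variance white noise

r = np.array([np.mean(x[m:] * x[:N - m]) for m in range(5)])
# Theory: r(m) = sum_k b[k] b[k+m] for m <= q, and exactly 0 beyond q = 2
print(r)   # ~ [1.58, 0.49, -0.3, 0, 0]
```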
If one restricts the system to be “all-pole,”24 then , i.e., for , and
(4.214)
The output of this system model with at its input is called an AR (auto-regressive) process of order p. Its auto-correlation follows the difference equation below:
(4.215)
Again, the values of for can be easily found by symmetry. This model is better suited to modeling peaks in the random sequence power spectrum, which are typical of quasi-periodic signals (such as audio signals). It should be noticed that, unlike in the ARMA and MA cases, the equation for the AR process auto-correlation is linear in the system coefficients, which makes their estimation easier. If is known for , and recalling that , one can:
• solve the pth-order linear equation system obtained by substituting in Eq. (4.215) to find ;
• compute from Eq. (4.215) with .
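The two steps above can be sketched numerically. Assuming NumPy and an illustrative stable AR(2) model (sign conventions for the coefficients vary; here the recursion is written with plus signs), the code below estimates the auto-correlation from one long realization and solves the resulting linear (Yule-Walker) system for the coefficients:

```python
import numpy as np

rng = np.random.default_rng(5)
a_true = np.array([1.5, -0.7])   # stable AR(2): x[n] = 1.5 x[n-1] - 0.7 x[n-2] + w[n]
N = 200_000

w = rng.standard_normal(N)
x = np.zeros(N)
for n in range(2, N):
    x[n] = a_true[0] * x[n - 1] + a_true[1] * x[n - 2] + w[n]

# Sample auto-correlation at lags 0, 1, 2
r = np.array([np.mean(x[k:] * x[:N - k]) for k in range(3)])

# Second-order Yule-Walker system, linear in the AR coefficients
R = np.array([[r[0], r[1]],
              [r[1], r[0]]])
a_est = np.linalg.solve(R, r[1:3])
print(a_est)   # close to a_true
```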
All-pole modeling of audio signals is extensively used in restoration systems [8].
Relevant Theory: Signal Processing Theory
1. Deep R. Probability and Statistics. San Diego: Academic; 2006.
2. Casella G, Berger RL. Statistical Inference. second ed. Pacific Grove: Duxbury; 2001.
3. Papoulis A, Pillai SU. Probability, Random Variables and Stochastic Processes. fourth ed. New York: McGraw-Hill; 2002.
4. Peebles Jr PZ. Probability, Random Variables and Random Signal Principles. fourth ed. New York: McGraw-Hill; 2001.
5. Shanmugan KS, Breipohl AM. Random Signals: Detection, Estimation and Data Analysis. New York: Wiley; 1988.
6. Hayes MH. Statistical Digital Signal Processing and Modeling. Hoboken: Wiley; 1996.
7. Diniz PSR. Adaptive Filtering: Algorithms and Practical Implementation. New York: Springer; 2012.
8. Godsill SJ, Rayner PJW. Digital Audio Restoration: A Statistical Model Based Approach. London: Springer; 1998.
1Of course, other independent variables (e.g., space) can substitute for time in this context.
2Even if the literature sometimes prefers to employ “stochastic” for processes and “random” for signals, the expression “random processes” became common use and will be used throughout the chapter.
3One could think of mixed cases, but this discussion is better conveyed in the random variable framework, which comprises every possible mapping from the original experiment to a subset of the real numbers.
4Unit step function:
5Unit impulse distribution, or Dirac delta: For and .
6This unusual notation is employed here for the sake of compatibility with the other definitions.
7Notice that the expected value of a random variable is not necessarily meaningful for the associated experiment.
8The classical definition does not subtract three.
9Convolution integral: .
10Equation (4.43) can be seen as the next expression computed for , since and .
11We will tackle only scalar parameters, without loss of generality.
12For the sake of simplicity, we refer only to the estimation of a fixed parameter .
13There are other definitions of consistence.
14In this text, for completeness we opted for always explicitly writing both formulations, for processes and sequences.
15We hinted at this point when discussing the properties of second-order moments, in Section 1.04.4.5.
16Even if the author has done so many times.
17The Fourier transform represents a time signal as a linear combination of continuously distributed sinusoids.
18With considerable freedom of nomenclature.
19Unit impulse function, or Kronecker delta: For , and .
20Where notation was used.
21Where notation was used.
22Auto-regressive for its output feedback, moving average for its weighted sum of past input samples.
23This is a misnomer, since the poles of the system are at the origin. In fact, only the zeros are responsible for shaping the frequency response of the system.
24This is another misnomer, since the zeros of the system are at the origin. In fact, only the poles are responsible for shaping the frequency response of the system.