5

Quantifying preconceptions

Abstract

Chapter 5, Quantifying Preconceptions, argues that we usually know things about the systems we are studying that can be used to supplement actual observations. Temperatures often lie in specific ranges governed by freezing and boiling points. Chemical gradients often vary smoothly in space, owing to the process of diffusion. Energy and momentum obey conservation laws. The methodology through which this prior information can be incorporated into the models is developed in this chapter. Called generalized least squares, it is applied to several substantial examples in which prior information is used to fill in data gaps in datasets.

Keywords

prior information; prior covariance; generalized least squares; smoothness; flatness; roughness; Bayes theorem; normal distribution; weighted error; damped least squares

5.1 When least squares fails

The least-squares solution fails when [GᵀG] has no inverse or, equivalently, when its determinant is zero. In the straight-line case, [GᵀG] is 2 × 2 and the determinant, D, can readily be computed from Equation (4.29):

D = N \sum_{i=1}^{N} x_i^2 - \left( \sum_{i=1}^{N} x_i \right)^2

Two different scenarios lead to the determinant being zero. If only one observation is available (i.e., N = 1), then

D = x_1^2 - (x_1)^2 = 0

This case corresponds to the problem of trying to fit a straight line to a single point. The determinant is also zero when N > 1, but the data are all measured at the same value of xi (say xi = x*). Then,

D = N \, N (x^*)^2 - (N x^*)^2 = 0

This case corresponds to the problem of trying to fit a straight line to many points, all with the same x. In both cases, the problem is that more than one choice of m has minimum error. In the first case, any line that passes through the point (x1, d1) has zero error, regardless of its slope (Figure 5.1A). In the second case, all lines that pass through the point, (x*, d*), where d* is an arbitrary value of d, will have the same error, regardless of the slope, and one of these will correspond to the minimum error (d* = d̄, actually) (Figure 5.1B).

Figure 5.1 (A) All lines passing through (x1, d1) have zero error. (B) All lines passing through (x*, d*) have the same error.

In general, the method of least squares fails when the data do not uniquely determine the model parameters. The problem is associated with the data kernel, G, which describes the geometry or structure of the experiment, and not with the actual values of the data, d, themselves. Nor is the problem limited to the case where [GᵀG] is exactly singular. Solutions when [GᵀG] is almost singular are useless as well, because the covariance of the model parameters is proportional to [GᵀG]⁻¹, which has very large elements in this case. If almost no data constrain the value of a model parameter, then its value is very sensitive to measurement noise. In these cases, the matrix, GᵀG, is said to be ill-conditioned.

Methods are available to spot deficiencies in G that lead to GᵀG being ill-conditioned. However, they are usually of little practical value, because while they can identify the problem, they offer no remedy for it. We take a different approach here, which is to assume that most experiments have deficiencies that lead to at least a few model parameters being poorly determined.

We will not concern ourselves too much with which model parameters are causing the problem. Instead, we will use a modified form of the least-squares methodology that leads to a solution in all cases. This methodology will, in effect, fill in gaps in information, but without providing much insight into the nature of those gaps.

5.2 Prior information

Usually, we know something about the model parameters, even before we perform any observations. Even before we measure the density of a soil sample, we know that its density will be around 1500 kg/m³, give or take 500 or so, and that negative densities are nonsensical. Even before we measure a topographic profile across a range of hills, we know that it can contain no impossibly high and narrow peaks. Even before we measure the chemical components of an organic substance, we know that they should sum to 100%. Further, even before we measure the concentration of a pollutant in an underground reservoir, we know that its dispersion is subject to the diffusion equation.

These are, of course, just preconceptions about the world, and as such, they are more than a little dangerous. Observations might prove them to be wrong. On the other hand, most are based on experience, years of observations that have shown that, at least on Earth, most physical parameters commonly behave in well-understood ways. Furthermore, we often have a good idea of just how good a preconception is. Experience has shown that the range of plausible densities for sea water, for example, is much more restricted than, say, that for crude oil.

These preconceptions embody prior information about the results of observations. They can be used to supplement observations. In particular, they can be used to fill in the gaps in the information content of a dataset that prevent least squares from working.

We will express prior information probabilistically, using the Normal probability density function.

This choice gives us the ability to represent both the information itself, through the mean of the probability density function, and our uncertainty about the information, through its covariance matrix. The simplest case is when we know that the model parameters, m, are near the values, m̄, where the uncertainty of the nearness is quantified by a prior covariance matrix, Cmᵖ. Then, the prior information can be represented as the probability density function:

p_p(m) = \frac{1}{(2\pi)^{M/2} |C_m^p|^{1/2}} \exp\left\{ -\frac{1}{2}(m-\bar{m})^T [C_m^p]^{-1} (m-\bar{m}) \right\} = \frac{\exp\{-\frac{1}{2}E_p(m)\}}{(2\pi)^{M/2} |C_m^p|^{1/2}}
\quad\text{with}\quad E_p(m) = (m-\bar{m})^T [C_m^p]^{-1} (m-\bar{m})    (5.1)

Note that we interpret the argument of the exponential as depending on a function, Ep(m), which quantifies the degree to which the prior information is satisfied. It can be thought of as a measure of the error in the prior information (compare with Equation 4.24).

In the soil density case above, we would choose m̄ = 1500 kg/m³ and Cmᵖ = σm²I, with σm = 500 kg/m³. In this case, we view the prior information as uncorrelated, so Cmᵖ is proportional to I.
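As a minimal sketch (not from the text), the soil-density prior might be set up in MatLab as follows, assuming M model parameters; Ep is the prior error of Equation (5.1) evaluated for a trial model:

M = 10;                             % number of model parameters (assumed for illustration)
mbar = 1500*ones(M,1);              % prior mean, kg/m^3
sigmam = 500;                       % prior standard deviation, kg/m^3
Cmp = (sigmam^2)*eye(M);            % uncorrelated prior covariance, Cmp = sigmam^2 * I
m = 1400*ones(M,1);                 % a trial model
Ep = (m-mbar)'*inv(Cmp)*(m-mbar);   % prior error Ep(m), Equation (5.1)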

Note that the prior covariance matrix, Cmᵖ, is not the same as the covariance matrix of the estimated model parameters, Cm (which is called the posterior covariance matrix). The matrix, Cmᵖ, expresses the uncertainty in the prior information about the model parameters, before we make any observations. The matrix, Cm, expresses the uncertainty of the estimated model parameters, after we include the observations.

A more general case is when the prior information can be represented as a linear function of the model parameters:

\text{a linear function of the model parameters} = \text{a known value} \quad\text{or}\quad Hm = \bar{h}    (5.2)

where H is a K × M matrix and K is the number of rows of prior information. This more general representation can be used in the chemical component case mentioned above, where the concentrations need to sum to unity (100%). This is a single piece of prior information, so K = 1 and the equation for the prior information has the form

\text{sum of model parameters} = \text{unity} \quad\text{or}\quad [1\ 1\ 1\ \cdots\ 1]\,m = 1 \quad\text{or}\quad Hm = \bar{h}    (5.3)

The prior probability density function of the prior information is then

p_p(h) = \frac{1}{(2\pi)^{K/2} |C_h|^{1/2}} \exp\left\{ -\frac{1}{2}(Hm-\bar{h})^T [C_h]^{-1} (Hm-\bar{h}) \right\} = \frac{\exp\{-\frac{1}{2}E_p(m)\}}{(2\pi)^{K/2} |C_h|^{1/2}}
\quad\text{where}\quad E_p(m) = (Hm-\bar{h})^T [C_h]^{-1} (Hm-\bar{h})
\quad\text{note that}\quad p_p(m) = p_p[h(m)]\,J(m) \propto p_p[h(m)]    (5.4)

Here the covariance matrix, Ch, expresses the uncertainty with which the model parameters obey the linear equation, Hm = h̄. Note that the Normal probability density function contains the quantity, Ep(m), which is zero when the prior information, Hm = h̄, is satisfied exactly, and positive otherwise. Ep(m) quantifies the error in the prior information. The probability density function for m is proportional to the probability density function for h, as the Jacobian determinant, J(m), is constant (see Note 5.1).
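For the sum-to-unity case, a minimal sketch of the corresponding MatLab setup might look as follows; the value of sigmah is an assumption, chosen here only for illustration:

H = ones(1,M);        % K = 1 row of prior information: the sum of the model parameters
hbar = 1;             % the known value: the sum is unity
sigmah = 0.01;        % assumed uncertainty of the prior information
Ch = sigmah^2;        % 1-by-1 covariance matrix of the prior information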

5.3 Bayesian inference

Our objective is to combine prior information with observations. Bayes theorem (Equation 3.25) provides the methodology through the equation

p(m|d) = \frac{p(d|m)\,p(m)}{p(d)}    (5.5)

We can interpret this equation as a rule for updating our knowledge of the model parameters. Let us ignore the factor of p(d) on the right hand side, for the moment. Then the equation reads as follows:

the probability of the model parameters, m, given the data, d
is proportional to
the probability that the data, d, were observed, given a particular set of model parameters, m
multiplied by
the prior probability of that set of model parameters, m    (5.6)

We identify p(m) with pp(m), that is, our best estimate of the probability of the model parameters, before the observations are made. The conditional probability density function, p(d|m), is the probability that data, d, are observed, given a particular choice for the model parameters, m. We assume, as we did in Equation (4.23), that this probability density function is Normal:

p(d|m) = \frac{1}{(2\pi)^{N/2} |C_d|^{1/2}} \exp\left\{ -\frac{1}{2}(Gm-d)^T [C_d]^{-1} (Gm-d) \right\} = \frac{\exp\{-\frac{1}{2}E(m)\}}{(2\pi)^{N/2} |C_d|^{1/2}}
\quad\text{where}\quad E(m) = (Gm-d)^T [C_d]^{-1} (Gm-d)    (5.7)

where Cd is the covariance matrix of the observations. (Previously, we assumed that Cd = σd²I, but now we allow the general case.) Note that the Normal probability density function contains the quantity, E(m), which is zero when the data are exactly satisfied and positive when they are not. This quantity is the total data error, as defined in Equation (4.24), except that the factor [Cd]⁻¹ acts to weight each of the component errors. Its significance will be discussed later in the section.

We now return to the factor of p(d) on the right-hand side of Bayes theorem (Equation 5.5). It is not a function of the model parameters, and so acts only as a normalization factor. Hence, we can write

p(m|d) \propto p(d|m)\,p(m) \propto \exp\left\{ -\frac{1}{2}(Gm-d)^T [C_d]^{-1}(Gm-d) - \frac{1}{2}(Hm-\bar{h})^T [C_h]^{-1}(Hm-\bar{h}) \right\}
= \exp\left\{ -\frac{1}{2}[E(m)+E_p(m)] \right\} = \exp\left\{ -\frac{1}{2}E_T(m) \right\}
\quad\text{with}\quad E_T(m) = E(m) + E_p(m)    (5.8)

Note that p(m|d) contains the quantity, ET(m), which is the sum of two errors: the error in fitting the data and the error in satisfying the prior information. We call it the generalized error. We do not need the overall normalization factor, because the only operation that we will perform with this probability density function is the computation of its mode (point of maximum likelihood), which (as in Equation 4.24) we will identify as the best estimate, mest, of the model parameters. An example for the very simple N = 1, M = 2 case is shown in Figure 5.2. However, before proceeding with more complex problems, we need to discuss an important issue associated with products of Normal probability density functions (as in Equation 5.8).
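A minimal sketch of this procedure for a hypothetical M = 2, N = 1 problem is given below; the numerical values are assumptions for illustration, loosely patterned on Figure 5.2. The generalized error, ET(m), is evaluated on a grid of trial models and the grid point of smallest ET is taken as mest:

mbar = [20; 10];  Cmp = diag([6^2, 10^2]);   % assumed prior mean and covariance
G = [1, -1];  dobs = 0;  Cd = 3^2;           % one assumed observation: m1 - m2 = 0
m1 = (0:0.1:40)';  m2 = (0:0.1:40)';         % grid of trial models
ET = zeros(length(m1),length(m2));
for i = 1:length(m1)
 for j = 1:length(m2)
  m = [m1(i); m2(j)];
  E  = (G*m-dobs)'*inv(Cd)*(G*m-dobs);       % data error, as in Equation (5.7)
  Ep = (m-mbar)'*inv(Cmp)*(m-mbar);          % prior error, as in Equation (5.1)
  ET(i,j) = E + Ep;                          % generalized error, Equation (5.8)
 end
end
[~, k] = min(ET(:));                         % smallest generalized error ...
[i, j] = ind2sub(size(ET), k);
mest = [m1(i); m2(j)];                       % ... is the point of maximum likelihood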

Figure 5.2 Example of the application of Bayes theorem to a N = 1, M = 2 problem. (A) The prior probability density function, pp(m), for the model parameters has its maximum at (20,10) and is uncorrelated with variance (6², 10²). (B) The conditional probability density function, p(d|m), is for one observation, m1 − m2 = d1 = 0, with a variance of 3². Note that this observation, by itself, is not sufficient to uniquely determine two model parameters. The conditional probability density distribution, p(m|d) ∝ p(d|m)pp(m), has its maximum at (m1est, m2est) = (13,15). The estimated model parameters do not exactly satisfy the observation, m1est − m2est ≠ d1, reflecting the observational error represented in the probability density function, p(d|m). They do not exactly satisfy the prior information, either, reflecting the uncertainty represented in pp(m). MatLab script eda05_01.

5.4 The product of Normal probability density distributions

The conditional probability density function, p(m|d), in Equation (5.8) is the product of two Normal probability density functions. One of the many useful properties of Normal probability density functions is that their products are themselves Normal (Figure 5.3). To verify that this is true, we start with three Normal probability density functions, pa(m), pb(m), and pc(m):

p_a(m) \propto \exp\left\{ -\frac{1}{2}(m-\bar{a})^T C_a^{-1} (m-\bar{a}) \right\}
p_b(m) \propto \exp\left\{ -\frac{1}{2}(m-\bar{b})^T C_b^{-1} (m-\bar{b}) \right\}
p_c(m) \propto \exp\left\{ -\frac{1}{2}(m-\bar{c})^T C_c^{-1} (m-\bar{c}) \right\} = \exp\left\{ -\frac{1}{2}\left( m^T C_c^{-1} m - 2 m^T C_c^{-1} \bar{c} + \bar{c}^T C_c^{-1} \bar{c} \right) \right\}    (5.9)

Figure 5.3 The product of two Normal distributions is itself a Normal distribution. (A) A Normal distribution, pa(m1, m2). (B) A Normal distribution, pb(m1, m2). (C) The product, pc(m1, m2) = pa(m1, m2) pb(m1, m2). MatLab script eda05_02.

Note that the second version of pc(m) is just the first with the expression within the braces expanded out. We now compute the product of the first two:

p_a(m)\,p_b(m) \propto \exp\left\{ -\frac{1}{2}(m-\bar{a})^T C_a^{-1}(m-\bar{a}) - \frac{1}{2}(m-\bar{b})^T C_b^{-1}(m-\bar{b}) \right\}
= \exp\left\{ -\frac{1}{2}\left( m^T [C_a^{-1}+C_b^{-1}] m - 2 m^T [C_a^{-1}\bar{a}+C_b^{-1}\bar{b}] + [\bar{a}^T C_a^{-1}\bar{a} + \bar{b}^T C_b^{-1}\bar{b}] \right) \right\}    (5.10)

We now try to choose c̄ and Cc in Equation (5.9) so that pc(m) in Equation (5.9) matches pa(m)pb(m) in Equation (5.10). The choice

C_c^{-1} = C_a^{-1} + C_b^{-1}    (5.11)

matches the first pair of terms (the ones quadratic in m) and gives, for the second pair of terms (the ones linear in m)

2 m^T (C_a^{-1}+C_b^{-1})\,\bar{c} = 2 m^T (C_a^{-1}\bar{a} + C_b^{-1}\bar{b})    (5.12)

Solving for c̄, we find that these terms are equal when

\bar{c} = (C_a^{-1}+C_b^{-1})^{-1}(C_a^{-1}\bar{a} + C_b^{-1}\bar{b})    (5.13)

Superficially, these choices do not make the third pair of terms (the ones that do not contain m) equal. However, as these terms do not depend on m, they correspond only to multiplicative factors that affect the normalization of the probability density function. We can always remove the discrepancy by absorbing it into the normalization. Thus, up to a normalization factor, pc(m) = pa(m)pb(m); that is, a product of two Normal probability density functions is a Normal probability density function.

In the uncorrelated, equal variance case, these rules simplify to

\sigma_c^{-2} = \sigma_a^{-2} + \sigma_b^{-2} \quad\text{and}\quad \bar{c} = (\sigma_a^{-2}+\sigma_b^{-2})^{-1}(\sigma_a^{-2}\bar{a} + \sigma_b^{-2}\bar{b})    (5.14)

Note that in the case where one of the component probability density functions, say pa(m), contains no information (i.e., when Ca⁻¹ → 0), the multiplication has no effect on the covariance matrix or the mean (i.e., Cc⁻¹ = Cb⁻¹ and c̄ = b̄). In the case where both pa(m) and pb(m) contain information, the covariance of the product will, in general, be smaller than the covariance of either probability density function (Equation 5.11), and the mean, c̄, will be somewhere on a line connecting ā and b̄ (Equation 5.13).
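A minimal sketch of these rules in MatLab, for two hypothetical Normal probability density functions (the means and covariances below are assumptions for illustration):

abar = [20; 10];  Ca = diag([6^2, 10^2]);    % mean and covariance of pa(m), assumed
bbar = [5; 15];   Cb = diag([4^2, 4^2]);     % mean and covariance of pb(m), assumed
CcI = inv(Ca) + inv(Cb);                     % inverse covariance of the product, Equation (5.11)
Cc = inv(CcI);
cbar = Cc*(inv(Ca)*abar + inv(Cb)*bbar);     % mean of the product, Equation (5.13)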

Thus, p(m|d) in Equation (5.8), being the product of two Normal probability density functions, is itself a Normal probability density function.

5.5 Generalized least squares

We now return to the matter of deriving an estimate of model parameters that combines both observations and prior information by finding the peak (mode) of the Normal distribution in Equation (5.8). This Normal distribution depends on the generalized error, ET(m):

p(m|d) \propto \exp\left\{ -\frac{1}{2} E_T(m) \right\} \quad\text{where}
E_T(m) = (Hm-\bar{h})^T C_h^{-1} (Hm-\bar{h}) + (Gm-d^{obs})^T C_d^{-1} (Gm-d^{obs})    (5.15)

The expression for the generalized error can be simplified by defining a matrix F and a vector f such that:

F = \begin{bmatrix} C_d^{-1/2} G \\ C_h^{-1/2} H \end{bmatrix} \quad\text{and}\quad f^{obs} = \begin{bmatrix} C_d^{-1/2} d^{obs} \\ C_h^{-1/2} \bar{h} \end{bmatrix}    (5.16)

Here, Cd⁻½ is the square root of Cd⁻¹ (which obeys Cd⁻¹ = Cd⁻½ Cd⁻½) and Ch⁻½ is the square root of Ch⁻¹. In the commonly encountered case where Cd⁻¹ and Ch⁻¹ are diagonal matrices, the square root is computed simply by taking the square root of the diagonal elements. The generalized error is then:

E_T(m) = (f^{obs} - Fm)^T C_f^{-1} (f^{obs} - Fm) \quad\text{with}\quad C_f^{-1} = I    (5.17)

This equivalence can be shown by substituting Equation (5.16) into Equation (5.17) and multiplying out the expression. The generalized error has been manipulated into a form identical to the ordinary least squares error, implying that the solution is the ordinary least squares solution:

[F^T F]\, m^{est} = F^T f^{obs} \quad\text{and}\quad m^{est} = [F^T F]^{-1} F^T f^{obs}    (5.18)

The covariance matrix Cm is computed by the usual rules of error propagation:

C_m = \left\{ [F^T F]^{-1} F^T \right\} C_f \left\{ [F^T F]^{-1} F^T \right\}^T = [F^T F]^{-1} \quad\text{since}\quad C_f = I    (5.19)

These results are due to Tarantola and Valette (1982) and are further discussed by Menke (1989). When we substitute Equation (5.16) into Equation (5.19), we obtain the expressions:

m^{est} = [G^T C_d^{-1} G + H^T C_h^{-1} H]^{-1} [G^T C_d^{-1} d^{obs} + H^T C_h^{-1} \bar{h}]
C_m = [G^T C_d^{-1} G + H^T C_h^{-1} H]^{-1}    (5.20)

However, these expressions are cumbersome and almost never necessary. Instead, we construct the equation:

\begin{bmatrix} C_d^{-1/2} G \\ C_h^{-1/2} H \end{bmatrix} m^{est} = \begin{bmatrix} C_d^{-1/2} d^{obs} \\ C_h^{-1/2} \bar{h} \end{bmatrix}    (5.21)

directly and solve it by ordinary least squares. In this equation, the rows of the data equation, Gm = dobs, and the rows of the prior information equation, Hm = h̄, are combined into a single matrix equation, Fmest = fobs, with the N rows of Gmest = dobs weighted by the certainty of the data (that is, by the factor σd⁻¹) and the K rows of Hmest = h̄ weighted by the certainty of the prior information (that is, by the factor σh⁻¹). Observations and prior information play symmetrical roles in this generalized least squares solution. Provided that enough prior information is added to “fill in the gaps”, the generalized least squares solution, mest = [FᵀF]⁻¹Fᵀfobs, will be well-behaved, even when the ordinary least squares solution, mest = [GᵀG]⁻¹Gᵀdobs, fails. The prior information regularizes the matrix, [FᵀF].
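A minimal sketch of this construction in MatLab, assuming that G, H, dobs, and hbar have already been defined and that the data and prior information are uncorrelated with uniform variances sigmad^2 and sigmah^2:

F = [ G/sigmad; H/sigmah ];          % weighted data kernel stacked on weighted prior information
f = [ dobs/sigmad; hbar/sigmah ];    % weighted observations stacked on weighted prior values
mest = (F'*F)\(F'*f);                % generalized least squares solution, Equation (5.18)
Cm = inv(F'*F);                      % posterior covariance, Equation (5.19)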

One type of prior information that always regularizes a generalized least squares problem is the model parameters being close to a constant, m̄. This is the case where K = M, H = I, and h̄ = m̄. The special case of h̄ = m̄ = 0 is called damped least squares, and corresponds to the solution:

m^{est} = [G^T G + \varepsilon^2 I]^{-1} G^T d \quad\text{with}\quad \varepsilon^2 = \sigma_d^2/\sigma_m^2    (5.22)

The attractiveness of damped least squares is the ease with which it can be used. One merely adds a small number, ε², to the main diagonal of [GᵀG]. However, while easy, damped least squares is only warranted when there is good reason to believe that the model parameters are actually near-zero.
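A minimal sketch of damped least squares in MatLab, assuming G, d, and the two variances are already defined:

epsilon2 = sigmad^2/sigmam^2;              % damping parameter, Equation (5.22)
mest = (G'*G + epsilon2*eye(M))\(G'*d);    % a small number added to the main diagonal of G'*G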

In the generalized least squares formulation, all model parameters are affected by the prior information, even those that are well-determined by the observations. Unfortunately, alternative methods that target prior information at only underdetermined or poorly-determined model parameters are much more cumbersome to implement and are, in general, computationally unsuited to problems with a large number of model parameters (e.g., M > 10³ or so). On the other hand, by choosing the magnitude of the elements of Ch⁻¹ to be sufficiently small, a similar result can be achieved, though often trial-and-error is required to determine how small is small.

As an aside, we mention an interesting interpretation of the equation for the generalized least squares solution, in the special case where M = K and H⁻¹ exists, so we can write m̄ = H⁻¹h̄. Then, if we subtract [GᵀCd⁻¹G + HᵀCh⁻¹H]m̄ from both sides of Equation (5.20), we obtain:

[G^T C_d^{-1} G + H^T C_h^{-1} H](m^{est} - \bar{m}) = G^T C_d^{-1} (d - G\bar{m})    (5.23)

which involves the deviatoric quantities Δm = mest − m̄ and Δd = d − Gm̄. In this view, the generalized least squares solution determines the deviation, Δm, of the solution away from the prior model parameters, m̄, using the deviation, Δd, of the data away from the prediction, Gm̄, of the prior model parameters.

5.6 The role of the covariance of the data

Generalized least squares (Equation 5.21) adds an important nuance to the estimation of model parameters, even in the absence of prior information, because it weights the contribution of an observation, d, to the error, E(m), according to its certainty (the inverse of its variance):

E(m) = (Gm-d)^T [C_d]^{-1} (Gm-d) = e^T [C_d]^{-1} e    (5.24)

This effect is more apparent in the special case where the data are uncorrelated with variance, σdi². Then, Cd is a diagonal matrix and the error is

E(m) = e^T \begin{bmatrix} \sigma_{d1}^{-2} & 0 & \cdots & 0 \\ 0 & \sigma_{d2}^{-2} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{dN}^{-2} \end{bmatrix} e = \sum_{i=1}^{N} \frac{e_i^2}{\sigma_{di}^2}    (5.25)

Thus, poorly determined data contribute less to the total error than well-determined data and the resulting solution better fits the data with small variance (Figure 5.4).

Figure 5.4 Example of least-squares fitting of a line to N = 50 data of unequal variance. The data values (circles) in (A) and (B) are the same, but their variance (depicted here by 2σd error bars) is different. (A) The variance of the first 25 data is much greater than that of the second 25 data. (B) The variance of the first 25 data is much less than that of the second 25 data. The best-fit straight line (solid line) is different in the two cases, and in each case more closely fits the half of the dataset with the smaller error. MatLab scripts eda05_03 and eda05_04.

The special case of generalized least squares that weights the data according to its certainty but includes no prior information is called weighted least squares. In MatLab, the solution is computed as

mest = (G'*Cdi*G)\(G'*Cdi*d);

(MatLab eda05_03)

where Cdi is the inverse of the covariance matrix of the data, Cd⁻¹. In many cases, however, the covariance is diagonal, as in Equation (5.25). Then, defining a column vector, sd, with elements, σdi, Equation (5.18) can be used, as follows:

for i=[1:N]
 F(i,:)=G(i,:)./sd(i);
end
f=d./sd;
mest = (F'*F)\(F'*f);

(MatLab eda05_04)

where sd is a column vector with elements, σdi.

5.7 Smoothness as prior information

An important type of prior information is the belief that the model parameter vector, m, is smooth. This notion implies some sort of natural ordering of the model parameters in time or space, because smoothness characterizes how model parameters vary from one position or time to another one nearby. The simplest case is when the model parameters vary with one coordinate, such as position, x. They are then a discrete version of a function, m(x), and their roughness (the opposite of smoothness) can be quantified by the second derivative, d2m/dx2. When the model parameters are evenly spaced in x, say with spacing Δx, the first and second derivative can be approximated with the finite differences (see Section 1.9):

\left.\frac{dm}{dx}\right|_{x_i} \approx \frac{m(x_i+\Delta x) - m(x_i)}{\Delta x} = \frac{1}{\Delta x}\left[ m_{i+1} - m_i \right]
\left.\frac{d^2 m}{dx^2}\right|_{x_i} \approx \frac{m(x_i+\Delta x) - 2m(x_i) + m(x_i-\Delta x)}{(\Delta x)^2} = \frac{1}{(\Delta x)^2}\left[ m_{i+1} - 2m_i + m_{i-1} \right]    (5.26)

The smoothness condition implies that the roughness is small. We represent roughness with the equation, Hm = h̄ = 0, where each row of the equation corresponds to a second derivative centered at a different x-position. A typical row of H has elements proportional to

[\ \cdots\ 0\ 0\ 1\ -2\ 1\ 0\ 0\ \cdots\ ]

However, a problem arises with the first and last row, because the model parameters m0 and mM+1 are unavailable. We can either omit these rows, in which case H will contain only M − 2 pieces of information, or use different prior information there. A natural choice is to require the slope (i.e., the first derivative, dm/dx) to be small at the ends (i.e., the ends are flat), which leads to

H=1(Δx)2[ΔxΔx00001210000121000012100001210000ΔxΔx]

si79_e  (5.27)

The vector, h̄, is taken to be zero, as our intent is to make the roughness and steepness—the opposites of smoothness and flatness—as small as possible.
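A minimal sketch of the construction of this H in MatLab, assuming M model parameters with grid spacing Dx (the interior rows carry the roughness information; the first and last rows carry the flatness information):

H = zeros(M,M);
H(1,1) = -1/Dx;  H(1,2) = 1/Dx;           % flatness (first derivative) at the left end
for i = [2:M-1]
 H(i,i-1:i+1) = [1, -2, 1]/(Dx^2);        % roughness (second derivative) in the interior
end
H(M,M-1) = -1/Dx;  H(M,M) = 1/Dx;         % flatness at the right end
hbar = zeros(M,1);                        % the prior values are all zero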

One simple application of smoothness information is the filling in of data gaps. The idea is to have the model parameters represent the values of a function on a grid, with the data representing the values on a subset of grid points whose values have been observed. The other grid points represent data gaps. The equation, Gm = d, reduces to di = mj, which has a G as follows:

G = \begin{bmatrix} 0 & 1 & 0 & 0 & 0 & \cdots & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & \cdots & 0 & 0 & 0 \\ \vdots & & & & & \ddots & & & \vdots \\ 0 & 0 & 0 & 0 & 0 & \cdots & 0 & 1 & 0 \end{bmatrix}    (5.28)

Each row of G has M − 1 zeros and a single 1, positioned to match up an observation with the model parameter at the same value of x. In MatLab, the matrix, F, and vector, f, are created in two stages. The first stage creates the top part of F and f (i.e., the part containing G and d):

L=N+M;
F=zeros(L,M);
f=zeros(L,1);
for p = [1:N]
 F(p,rowindex(p)) = 1;
 f(p)=d(p);
end

(MatLab eda05_05)

Here, rowindex is a column vector that specifies the model parameter corresponding to each observation, dp. For simplicity, we assume that the observations have unit variance, and so omit factors of σd⁻¹. The second stage creates the bottom part of F and f (i.e., the part containing H and h̄):

shi = 1e-6;
for p = [1:M-2]
 F(p+N,p) = shi/Dx2;
 F(p+N,p+1) = -2*shi/Dx2;
 F(p+N,p+2) = shi/Dx2;
 f(p+N)=0.0;
end
F(L-1,1)=-shi*Dx;
F(L-1,2)=shi*Dx;
f(L-1)=0;
F(L,M-1)=-shi*Dx;
F(L,M)=shi*Dx;
f(L)=0;

(MatLab eda05_05)

Here, we assume that the prior information is uncorrelated and of equal variance, so we can use a single variable, shi, to represent σh⁻¹. We set it to a value much smaller than unity so that it will have much less weight in the solution than the data. This way, the solution will favor satisfying the data at points where data are available. A for loop is used to create this part of the matrix, F, which corresponds to smoothness. Finally, the flatness information is put in the last two rows of F. The estimated model parameters are then calculated by solving Fm = f in the least squares sense:

mest = (F'*F)\(F'*f);

(MatLab eda05_05)

An example is shown in Figure 5.5.

Figure 5.5 The model parameters, mi, consist of the values of an unknown function, m(x), evaluated at M = 100 equally spaced points, xi. The data, di, consist of observations (circles) of the function at N = 40 of these points. Prior information, that the function is smooth, is used to fill in the gaps and produce an estimate (solid line) of the function at all points, xi. MatLab script eda05_05.

Crib Sheet 5.1

Generalized least squares

Step 1: State the problem in words

How are the data related to the model?

Step 2: Organize the problem in standard form

identify the data, d (length N), and the model parameters, m (length M)

define the data kernel, G, so that dobs = Gm

Step 3: Examine the data

make plots of the data

Step 4: Establish the accuracy of the data

state a prior variance, σd², based on the accuracy of the measurement technique

Step 5: State the prior information in words, for example:

the model parameters are close to known values, hpri

the mean of the model parameters is close to a known value

the model parameters vary smoothly with space and time

Step 6: Organize the prior information in standard form:

h^{pri} = Hm

Step 7: Establish the accuracy of the prior information

state a prior variance, σh², based on the accuracy of the prior information

Step 8: Estimate the model parameters, mest, and their covariance, Cm

m^{est} = [F^T F]^{-1} F^T f^{obs} \quad\text{and}\quad C_m = [F^T F]^{-1}
\text{with}\quad F = \begin{bmatrix} \sigma_d^{-1} G \\ \sigma_h^{-1} H \end{bmatrix} \quad\text{and}\quad f^{obs} = \begin{bmatrix} \sigma_d^{-1} d^{obs} \\ \sigma_h^{-1} h^{pri} \end{bmatrix}

Step 9: State estimates and their 95% confidence intervals

m_i^{true} = m_i^{est} \pm 2\sigma_{m_i}\ (95\%) \quad\text{with}\quad \sigma_{m_i} = \sqrt{[C_m]_{ii}}

Step 10: Examine the individual errors

d^{pre} = G m^{est} \quad\text{and}\quad e = d^{obs} - d^{pre}
h^{pre} = H m^{est} \quad\text{and}\quad e_p = h^{pri} - h^{pre}

plot ei vs. i and plot epi vs. i

scatter plot of dipre vs. diobs and scatter plot of hipre vs. hipri

any unusually large errors?

Step 11: Examine the total error ET

E_T = E + E_p \quad\text{with}\quad E = \sigma_d^{-2} e^T e \quad\text{and}\quad E_p = \sigma_h^{-2} e_p^T e_p

use a chi-squared test on ET to assess the likelihood of the Null Hypothesis that ET is different from what is expected only because of random variation

Step 12: Two different models?

use an F-test on the E's of the two models to assess the likelihood of the Null Hypothesis that the E's are different from each other only because of random variation

5.8 Sparse matrices

Many perfectly reasonable problems have a large number of model parameters—hundreds of thousands or even more. The gap-filling scenario discussed in the previous section is one such example. If, for instance, we were to use it to fill in the gaps in the Black Rock Forest dataset (see Chapter 2), we would have N ≈ M ≈ 10⁵. The L × M matrix, F, where L = M + N, would then have about 2N² ≈ 2 × 10¹⁰ elements—enough to tax the memory of a notebook computer, at least! On the other hand, only about 3N ≈ 3 × 10⁵ of these elements are nonzero. Such a matrix is said to be sparse. A computer's memory is wasted storing the zero elements and its processing power is wasted multiplying other quantities by them (as the result is a foregone conclusion—zero). An obvious solution is to omit storing the zero elements of sparse matrices and to omit any multiplications involving them. However, such a solution requires special software support to properly organize the matrix's elements and to optimize arithmetic operations involving them.

In MatLab, a matrix needs to be defined as sparse, but once defined, MatLab more or less transparently handles all array-element access and arithmetic operations. The command

L=M+N;
F=spalloc(L,M,4*N);

(MatLab eda05_06)

creates a L × M sparse matrix, F, capable of holding 4N nonzero elements. MatLab will properly process the command:

mest = (F'*F)\(F'*f);

(MatLab eda05_05)

Nevertheless, we do not recommend solving for mest this way, except when M is very small, because it does not utilize all the inherent efficiencies of the generalized least-squares equation, FᵀFm = Fᵀf. Our preferred technique is to use MatLab's bicg() function, which solves the matrix equation by the biconjugate gradient method. The simplest way to use this function is

mest=bicg(F'*F,F'*f,1e-10,3*L);

(MatLab eda05_06)

As you can see, two extra arguments are present, in addition to the matrix, F'*F, and the vector, F'*f. They are a tolerance (set here to 1e-10) and a maximum number of iterations (set here to 3*L). The bicg() function works by iteratively improving an initial guess for the solution, with the tolerance specifying when the error is small enough for the solution to be considered done, and the maximum number of iterations specifying that the method should terminate after this limit is reached, regardless of whether or not the error is smaller than the tolerance. The actual choice of these two parameters needs to be adjusted by trial and error to suit a particular problem. Each time it is used, the bicg() function displays a line of information that can be useful in determining the accuracy of the solution.

This simple way of calling bicg() has one defect—it requires the computation of the quantity, FᵀF. This is undesirable, for while FᵀF is sparse, it is typically not nearly as sparse as F, itself. Fortunately, the biconjugate gradient method utilizes FᵀF in only one simple way: it multiplies various internally constructed vectors to form products such as FᵀFv. However, this product can be performed as Fᵀ(Fv); that is, v is first premultiplied by F and the resulting vector is then premultiplied by Fᵀ, so that the matrix FᵀF is never actually calculated. MatLab provides a way to modify the bicg() function to perform the multiplication in this very efficient fashion. However, in order to use it, we must first write a MatLab function, stored in a separate file, that performs the two multiplications (see Note 5.2). We call this function, afun, and the corresponding file, afun.m:

function y = afun(v,transp_flag)
% multiplies a vector, v, by F and then by F', so that F'*F is never formed
global F;
temp = F*v;
y = F'*temp;
return

(MatLab afun.m)

We have not said anything about the MatLab function command so far, and will say little about it here (however, see Note 5.2). Briefly, MatLab provides a mechanism for a user to define functions of his or her own devising that act in analogous fashion to built-in functions such as sin() and cos(). However, as the afun() function will not need to be modified, the user is free to consider it a black box. In order to use this function, the two commands

clear F;
global F;

(MatLab eda05_06)

need to be placed at the top of the script that uses the bicg() function. They ensure that MatLab understands that the matrix, F, in the main script and in the function refers to the same variable. Then the bicg() function is called as follows:

mest=bicg(@afun,F'*f,1e-10,3*L);

(MatLab eda05_06)

Note that only the first argument is different than in the previous version, and that this argument is a reference (a function handle) to the afun() function, indicated with the syntax, @afun. Incidentally, we gave the function the name, afun(), to match the example in the MatLab help page for bicg() (which you should read). A more descriptive name might have been better.

5.9 Reorganizing grids of model parameters

Sometimes, the model parameters have a natural organization that is more complicated than can be represented naturally with a column vector, m. For example, the model parameters may represent the values of a function, m(x, y), on a two-dimensional (x, y) grid, in which case they are more naturally ordered into a matrix, A, whose elements are Aij = m(xi, yj). Unfortunately, the model parameters must still be arranged into a column vector, m, in order to use the formulas of least squares, at least as they have been developed in this book. One possible solution is to unwrap (reorganize) the matrix into a column vector as follows:

A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \rightarrow m = \begin{bmatrix} A_{11} \\ A_{12} \\ A_{21} \\ A_{22} \end{bmatrix} \quad\text{or}\quad m_k = A_{ij} \quad\text{with}\quad k = (i-1)J + j    (5.29)

Here, A is assumed to be an I × J matrix so that m is of length, M = IJ. In MatLab, the conversions from k to (i,j) and back to k are given by

k = (i-1)*J+j;
i = floor((k-1)/J)+1;
j = k-(i-1)*J;

(MatLab eda05_07)

The floor() function rounds down to the nearest integer. See Note 5.3 for a discussion of several advanced MatLab functions that can be used as alternatives to these formulas.
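As one possible alternative (a sketch, not necessarily the functions discussed in Note 5.3), MatLab's built-in sub2ind() and ind2sub() functions can perform the same conversions. Because MatLab stores matrices column-wise, the size arguments are given in the order [J, I] so that the resulting index matches the row-wise ordering k = (i − 1)J + j of Equation (5.29):

k = sub2ind([J, I], j, i);       % from (i,j) to k = (i-1)*J + j
[j, i] = ind2sub([J, I], k);     % from k back to (i,j)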

As an example, we consider a scenario in which spatial variations of pore pressure cause fluid flow in an aquifer (Figure 5.6). The aquifer is a thin layer, so pressure, p(x, y), varies only in the (x, y) plane. The pressure is measured in N wells, each located at (xi, yi). These measurements constitute the data, d. The problem is to fill in the data gaps, that is, to estimate the pressure on an evenly spaced grid, Aij, of (xi, yj) points. These gridded pressure values constitute an I × J matrix, A, which can be unwrapped into a column vector of model parameters, m, as described above. The prior information is the belief that the pressure satisfies a known differential equation, in this case, Laplace's equation, ∂²p/∂x² + ∂²p/∂y² = 0. This equation is appropriate when the fluid flow obeys Darcy's law and the hydraulic properties of the aquifer are spatially uniform.

Figure 5.6 Scenario for aquifer example. Ground water is flowing through the permeable aquifer (shaded layer), driven by variations in pore pressure. The pore pressure is measured in N wells (cylinders).

The problem is structured in a manner similar to the previous one-dimensional gap-filling problem. Once again, the equation, Gm = d, reduces to di = mj, where index, j, matches up the location, (xi, yi), of the i-th datum to the location, (xj, yj), of the j-th model parameter. The differential equation contains only second derivatives, which have been discussed earlier (Equation 5.26). The only nuance is that one derivative is taken in the x-direction and the other in the y-direction so that the value of pressure at five neighboring grid points is needed (Figure 5.7):

\left.\frac{\partial^2 m}{\partial x^2}\right|_{x_i, y_j} \approx \frac{A_{i+1,j} - 2A_{i,j} + A_{i-1,j}}{(\Delta x)^2} \quad\text{and}\quad \left.\frac{\partial^2 m}{\partial y^2}\right|_{x_i, y_j} \approx \frac{A_{i,j+1} - 2A_{i,j} + A_{i,j-1}}{(\Delta y)^2}    (5.30)

Figure 5.7 The expression, ∂²p/∂x² + ∂²p/∂y², is calculated by summing finite difference approximations for ∂²p/∂x² and ∂²p/∂y². The approximation for ∂²p/∂x² involves the column of three grid points (circles) parallel to the i-axis and the approximation for ∂²p/∂y² involves the row of three grid points parallel to the j-axis.

Note that the central point, Aij, is common to the two derivatives, so five, and not six, grid points are involved. While these five points are neighboring elements of A, they do not correspond to neighboring elements of m, once A is unwrapped.

Once again, a decision needs to be made about what to do on the edges of the grid. One possibility is to assume that the pressure derivative in the direction perpendicular to the edge of the grid is zero (which is the two-dimensional analog to the previously discussed one-dimensional case). This corresponds to the choice, ∂p/∂y = 0, on the left and right edges of the grid, and ∂p/∂x = 0 on the top and bottom edges. Physically, these equations imply that the pore fluid is not flowing across the edges of the grid (an assumption that may or may not be sensible, depending on the circumstances). The four corners of the grid require special handling, as two edges are coincident at these points. One possibility is to compute the first derivative along the grid's diagonals at these four points.

In the exemplary MatLab script, eda05_08, the equation, Fm = f, is built up row-wise, in a series of steps: (1) the N “data” rows; (2) the (I − 2)(J − 2) Laplace's equation rows; (3) the (J − 2) first-derivative rows along the top of the grid; (4) the (J − 2) first-derivative rows along the bottom of the grid; (5) the (I − 2) first-derivative rows along the left of the grid; (6) the (I − 2) first-derivative rows along the right of the grid; and (7) the four first-derivative rows at the grid corners (a sketch of step 2 is given below). When debugging a script such as this, a few exemplary rows of F and f from each section should be displayed and carefully examined, to ensure correctness.
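A minimal sketch of step (2), the interior Laplace's equation rows, follows. It assumes that F and f have already been allocated, that the factor shi represents σh⁻¹, that Dx2 and Dy2 hold (Δx)² and (Δy)², and that a row counter, q, starts just after the N data rows:

q = N;                                     % rows 1..N already hold the data
for i = [2:I-1]
 for j = [2:J-1]
  q = q+1;
  k = (i-1)*J+j;                           % unwrapped index of A(i,j), Equation (5.29)
  F(q,k)   = -2*shi/Dx2 - 2*shi/Dy2;       % central point, common to both second derivatives
  F(q,k-J) = shi/Dx2;                      % A(i-1,j)
  F(q,k+J) = shi/Dx2;                      % A(i+1,j)
  F(q,k-1) = shi/Dy2;                      % A(i,j-1)
  F(q,k+1) = shi/Dy2;                      % A(i,j+1)
  f(q) = 0.0;
 end
end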

The script, eda05_08, is set up to work on a file of test data, pressure.txt, that is created with a separate script, eda05_09. The data are simulated or synthetic data, meaning that they are calculated from a formula and that no actual measurements are involved. The test script evaluates a known solution of Laplace's equation

p(x, y) = P_0 \sin(\kappa x) \exp(-\kappa y) \quad\text{where } \kappa \text{ and } P_0 \text{ are constants}    (5.31)

on N randomly selected grid points and adds a random noise to them to simulate measurement error. Random grid points can be selected and utilized as follows:

rowindex=unidrnd(I,N,1);
xobs = x(rowindex);
colindex=unidrnd(J,N,1);
yobs = y(colindex);
kappa = 0.1;
dtrue = 10*sin(kappa*xobs).*exp(-kappa*yobs);

(MatLab eda05_09)

Here, the function, unidrnd(), returns an N × 1 array, rowindex, of random integers in the range (1, I). A column vector, xobs, of the x-coordinates of the data is then created from the grid coordinate vector, x, with the expression, xobs = x(rowindex). A similar pair of expressions creates a column vector, yobs, of the y-coordinates of the data. Finally, the (x, y) coordinates of the data are used to evaluate Equation (5.31) to construct the “true” synthetic data, dtrue.

Normally distributed random noise can be added to the true data to simulate measurement error:

sigmad = 0.1;
dobs = dtrue + random('normal',0.0,sigmad,N,1);

(MatLab eda05_09)

Here, the random() function returns an N × 1 column vector of random numbers with zero mean and variance, σd².

Results of the eda05_08 script are shown in Figure 5.8. Note that the predicted pressure is a good match to the synthetic data, as it should be, for the data are, except for noise, exactly a solution of Laplace's equation. This step of testing a script on synthetic data should never be omitted. A series of tests with synthetic data is more likely to reveal problems with a script than a single application to real data. Such a series of tests should vary several parameters, including the grid spacing, the parameter, κ, and the noise level.

Figure 5.8 Filling in gaps in pressure data, p(x, y), using the prior information that the pressure satisfies Laplace's equation, ∂²p/∂x² + ∂²p/∂y² = 0. (A) Observed pressure, pobs(x, y). (B) Predicted pressure, ppre(x, y). MatLab script eda05_08.

Problems

5.1 The first paragraph of Section 5.2 mentions one type of prior information that cannot be implemented by a linear equation of the form, Hm = h̄. What is it?

5.2 What happens in the eda05_05 script if it is applied to an inconsistent dataset (meaning a dataset containing multiple points with the same xs but different ds)? Modify the script to implement a test of this scenario. Comment on the results.

5.3 Modify the eda05_05 script to fill in the gaps of the cleaned version of the Black Rock Forest temperature dataset. Make plots of selected data gaps and comment on how well the method filled them in. Suggestions: First create a short version of the dataset for test purposes. It should contain a few thousand data points that bracket one of the data gaps. Do not run your script on the complete dataset until it works on the short version. Only the top part of the script needs to be changed. First, the data must be read using the load() function. Second, you must check whether all the times are equally spaced. Missing times must be inserted and the corresponding data set to zero. Third, a vector, rowindex, that gives the row index of the good data but excludes the zero data, hot spikes, and cold spikes must be computed with the find() function.

5.4 Run the eda05_08 script in a series of tests in which you vary the parameter, κ, and the noise level, σd. You will need to edit the eda05_09 script to produce the appropriate file, pressure.txt, of synthetic data. Comment on your results.

5.5 Suppose that the water in a shallow lake flows only horizontally (i.e., in the (x, y) plane) and that the two components of fluid velocity, vx and vy, are measured at a set of N observation points, (xi, yi). Water is approximately incompressible, so a reasonable type of prior information is that the divergence of the fluid velocity is zero; that is, ∂vx/∂x + ∂vy/∂y = 0. Furthermore, if the grid covers the whole lake, then the condition that no water flows across the edges is a reasonable one, implying that the perpendicular component of velocity is zero at the edges. (A) Sketch out how scripts eda05_07 and eda05_08 might be modified to fill in the gaps of fluid velocity data.

5.6 In the example shown in Figure 5.8, a two-dimensional pressure field, p(x, y), is reconstructed using sparse data and the prior information that the field satisfies Laplace's equation. Consider an alternative scenario in which the pressure is believed to vary smoothly but the equation that it satisfies is unknown. In that case, we might opt to use a combination of flatness:

\frac{\partial p}{\partial x} \approx 0 \quad\text{and}\quad \frac{\partial p}{\partial y} \approx 0

(say with variance σs²) and smallness, p ≈ 0 (say with variance σm²), to create a smooth solution. Modify eda05_06.m to solve this problem and adjust σs² and σm² by trial and error to produce a solution that is a reasonable compromise between smoothness and goodness of fit. How much worse is the error, compared to the solution that employs Laplace's equation?

References

Menke, W., 1989. Geophysical Data Analysis: Discrete Inverse Theory, Revised Edition. Academic Press, New York.

Tarantola, A., Valette, B., 1982. Inverse problems = quest for information. J. Geophys. 50, 159–170.
