11

“Approximate” is not a pejorative word

Abstract

"Approximate" is not a pejorative word explores the use of approximations to make data analysis simpler, faster, and more adaptable. Taylor's theorem is derived and used to linearize nonlinear functions. The method is used to create small number approximations, to estimate the variance of nonlinear functions, and to solve nonlinear least squares problems. The gradient of the error is identified as the key quantity controlling the solutions and its further application to the gradient method is explored. The lookup table is introduced as an approximate method of evaluating functions of one or two variables and applied to speeding up iterative calculations. Finally, the artificial neural network is developed as an approximation technique that shares the adaptability of a lookup table while providing smoothness and enhanced flexibility. The properties of several simple network designs are illustrated and the back-propagation algorithm for training them is derived. The power of a neural network is demonstrated by predicting the nonlinear response of a river network to precipitation.

Keywords

Approximation; Taylor series; Small number approximation; Newton's method; Iterative least squares; Gradient method; Lookup table; Artificial neural net; Back-propagation

11.1 The value of approximation

As schoolchildren, we were taught to admire the exactness of mathematics. The length of the hypotenuse of a right triangle with legs of 3 and 4 m is 5 m exactly, not a little less or a little more. Archimedes's 250 B.C.E. value of 22/7 for π was merely an upper bound for the ratio of a circle's circumference to its diameter. The exact value, known to one million digits by 1973, is something different and worth striving for.

Later in life, we learned that transcribing all 10 digits off the screen of a calculator is not really necessary; 3 or 4 is sufficient for most purposes. Yet we know that we can recompute all 10, should the need arise. We also came to appreciate the value of a back-of-the-envelope calculation, where the goal is merely to estimate the correct order of magnitude of the answer. Yet we view such calculations as imperfect—a tool to guide the development of a more accurate solution.

The situation in data analysis is fundamentally different. The accuracy of a result is a combination of methodological error, meaning error associated with the underlying theory and its computation, and observational error. Once methodological error is reduced to a level well below observation error, further improvement accomplishes little. The appropriate goal is to develop a methodology with acceptably small error, not one with none.

Of course, one might decide to err on the safe side; that is, to use only exact theories and to carry out all calculations to as many significant digits as possible. Superficially, such a strategy would not seem to be a bad one and, indeed, solutions achieved this way are certainly no worse than those that employ approximations. Yet the approach has serious limitations, over and above the obvious one that exact theories are unavailable for many imperfectly understood real-world phenomena. Approximations can bring simplicity, speed of computation, and adaptability to a solution which, in its exact form, is inscrutable, slow, and inflexible.

A simple method is one whose structure is sufficiently well-understood that its general behavior can be anticipated and its pitfalls identified and avoided. The grid search procedure (Section 4.6) for finding the minimum error solution is simple in this sense. This exhaustive search is easy to understand (and to code) and, although not completely foolproof, fails only when the search limits have been so poorly chosen that they fail to enclose the solution or when the solution varies over scale lengths smaller than the grid spacing. Linear theories of the form Gm = d are simple, too. The matrix G may be large and complicated, yet it is straightforward to create and interpret. Furthermore, its least squares solution (Section 4.7) will succeed except in cases that can be easily detected (e.g., when the matrix inverse is near-singular) and corrected (e.g., by adding prior information). As we will see below, approximations that linearize an exact nonlinear theory have wide application.

Approximations can also improve the speed of computation. Speed is an extremely important issue in very large problems that challenge the capabilities of the current generation of computers and in real-time applications, where a solution is needed immediately after the data becomes available. A Global Positioning System (GPS) receiver would be of limited utility in an automobile if it took 10 min to determine a location!

Finally, some data analysis scenarios need to adapt to evolving conditions that affect the character of the data. The character of a river hydrograph, for example, may evolve significantly over the years as its watershed becomes urbanized. The methodology needed to predict discharge from precipitation may need to be correspondingly updated. Some approximate methods are particularly well-suited for incorporating machine learning.

11.2 Polynomial approximations and Taylor series

A polynomial can always be designed to behave similarly to a smooth function y(t) in the vicinity of a specified point t0. This assertion can be demonstrated by constructing a polynomial yp(t) that has the same value, first derivative, second derivative, etc. at this point. The polynomial becomes a better and better approximation to the function as derivatives of higher and higher order are matched (Figure 11.1). A polynomial with unknown coefficients ci is

\[ y_p(t) = c_0 + c_1 (t - t_0) + c_2 (t - t_0)^2 + c_3 (t - t_0)^3 + \cdots \qquad (11.1) \]

and its derivatives are

\[ \frac{dy_p}{dt} = c_1 + 2 c_2 (t - t_0) + 3 c_3 (t - t_0)^2 + \cdots \quad\text{and}\quad \frac{d^2 y_p}{dt^2} = 2 c_2 + 6 c_3 (t - t_0) + \cdots \quad\text{etc.} \qquad (11.2) \]

Figure 11.1 The function y(t) = 1/(1 − t) (gray curve) approximated by a sequence of polynomials of increasing degree. Note that the approximations become more accurate, especially near the point t0, as the degree is increased. MatLab script eda11_01.

When evaluated at the point t = t0, these functions each depend upon a single constant:

\[ y_p(t_0) = c_0 \quad\text{and}\quad \left.\frac{dy_p}{dt}\right|_{t_0} = c_1 \quad\text{and}\quad \left.\frac{d^2 y_p}{dt^2}\right|_{t_0} = 2 c_2 \quad\text{etc.} \qquad (11.3) \]

After solving for the constants and inserting them into Equation (11.1), a polynomial yp(t) is obtained, whose value and derivatives match the smooth function y(t) at the point t0:

\[ y_p(t) = y(t_0) + \left.\frac{dy}{dt}\right|_{t_0} (t - t_0) + \frac{1}{2} \left.\frac{d^2 y}{dt^2}\right|_{t_0} (t - t_0)^2 + \cdots \qquad (11.4) \]

This polynomial can be shown to converge to the smooth function when the number of terms becomes indefinitely large; that is, yp(t) → y(t) (at least for a finite interval of the t axis containing t0). This result is known as Taylor's theorem and Equation (11.4) is known as the Taylor series representation of the function y(t). As we will show below, Taylor series play a critical role in many approximation methods.

11.3 Small number approximations

An important class of approximations uses just the first two terms of the Taylor series to represent a function in the neighborhood of a point t = t0, since the higher order terms are typically negligible there. Consider the function y(t) = (1 ± t)^n. At the point t0 = 0, it has value y(0) = 1 and first derivative dy/dt|0 = ±n(1 ± t)^(n−1)|0 = ±n. Inserting these values into Equation (11.4) yields

\[ (1 \pm t)^n \approx 1 \pm n t \qquad (11.5) \]

This approximation is surprisingly accurate. For instance, (1 + 0.04)^(1/2) = 1.0198, whereas 1 + ½ × 0.04 = 1.02; the difference is only 0.02%.
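
This check is easy to script. The following sketch (illustrative only; it is not one of the eda scripts) compares the exact value with the two-term approximation:

% check of the small number approximation (1+t)^n ~ 1 + n*t
t = 0.04;                             % the small quantity
n = 0.5;                              % the exponent
exact  = (1+t)^n;                     % exact value, about 1.0198
approx = 1 + n*t;                     % two-term Taylor approximation, 1.02
relerr = abs(approx-exact)/exact;     % relative error, about 0.02%
fprintf('exact %.4f  approx %.4f  relative error %.2e\n', exact, approx, relerr);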

Two additional examples are the functions sin t and cos t in the neighborhood of the point t0 = 0. The sine has a value of 0, a first derivative of 1, and a second derivative of 0 at this point, while the cosine has a value of 1, a first derivative of 0, and a second derivative of −1 there. Inserting these values into Equation (11.4) yields

\[ \sin t \approx t \quad\text{and}\quad \cos t \approx 1 - \tfrac{1}{2} t^2 \qquad (11.6) \]

These and other small number approximations can be used to simplify and speed up data analysis problems (see Crib Sheet 11.1).

Crib Sheet 11.1

Small number approximations

valid when |x| ≪ 1:

(1 ± x)^n ≈ 1 ± nx
sin x ≈ x
cos x ≈ 1 − ½x²
tan x ≈ x
exp(x) ≈ 1 + x
ln(1 ± x) ≈ ±x
a^x ≈ 1 + x ln a
sinh x ≈ x
cosh x ≈ 1 + ½x²
tanh x ≈ x

11.4 Small number approximation applied to distance on a sphere

A common problem encountered when working with geographical data is to determine the great circle distance r between two points on a sphere (such as the Earth). Spherical trigonometry can be used to show that this distance, measured in degrees of arc, is given by

\[ \cos r = \sin\lambda_1 \sin\lambda_2 + \cos\lambda_1 \cos\lambda_2 \cos\Delta L \qquad (11.7) \]

Here, λ1 and λ2 are the latitudes of the two points, respectively; L1 and L2 are their longitudes; and ΔL = L2 − L1. Six distinct trigonometric functions must be evaluated in order to determine r—a time-consuming process. A small r approximation can reduce the count of functions. The first step is to define the difference in latitude as Δλ = λ2 − λ1 and the mean latitude as λ̄ = ½(λ2 + λ1). Standard trigonometric identities are then used to rewrite Equation (11.7) in terms of these quantities:

\[ \cos r = \tfrac{1}{2}\cos\Delta\lambda - \cos^2\bar{\lambda} + \tfrac{1}{2} + \tfrac{1}{2}\cos\Delta\lambda\,\cos\Delta L + \cos^2\bar{\lambda}\,\cos\Delta L - \tfrac{1}{2}\cos\Delta L \qquad (11.8) \]

By assumption, ΔL, Δλ, and r are all small. After applying the small number approximations of Equation (11.6) and after considerable algebra, the approximation is found to be

\[ r^2 \approx \Delta\lambda^2 + \Delta L^2 \cos^2\bar{\lambda} - \Delta\lambda^2\, \Delta L^2 \left( \tfrac{1}{4} - \tfrac{1}{6}\cos^2\bar{\lambda} \right) \qquad (11.9) \]

The first two terms on the right-hand side represent the squared Euclidean distance between the points; that is, distance as measured on a plane. The last term corrects this flat-earth distance for the curvature of the surface. Only two functions, a cosine and a square root, need to be evaluated in order to determine r, which is a factor-of-three reduction in effort compared with the exact formula. Furthermore, the approximation is very accurate, at least out to distances of a few degrees (Figure 11.2).
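
The saving is easy to see in code. The following MatLab sketch evaluates both the exact formula (Equation 11.7) and the approximation (Equation 11.9) for an arbitrarily chosen pair of nearby points (the coordinates are illustrative values only; this is not the eda11_02 script):

% exact versus approximate great-circle distance; all angles in radians
lat1 = 40.0*pi/180;  lon1 = -105.0*pi/180;   % first point
lat2 = 42.0*pi/180;  lon2 = -102.0*pi/180;   % second point
dL   = lon2 - lon1;                          % difference in longitude
dlat = lat2 - lat1;                          % difference in latitude
mlat = 0.5*(lat1+lat2);                      % mean latitude
% exact formula: six trigonometric evaluations
r_exact = acos( sin(lat1)*sin(lat2) + cos(lat1)*cos(lat2)*cos(dL) );
% approximation: one cosine and one square root
c2 = cos(mlat)^2;
r_approx = sqrt( dlat^2 + dL^2*c2 - dlat^2*dL^2*(1/4 - c2/6) );
fprintf('exact %.6f rad  approximate %.6f rad\n', r_exact, r_approx);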

Figure 11.2 (A) Linear (dotted curve) and quadratic (black curve) approximation for distance r on a sphere compares well with the exact value (gray curve). (B) The corresponding error. This calculation is for the case ΔL = Δλ. MatLab script eda11_02.

11.5 Small number approximation applied to variance

A commonly encountered problem is that of estimating the variance of a nonlinear function of the model parameters. As an example, suppose that a least squares procedure estimates an angular frequency mest with variance σm2, but we would rather state confidence bounds for the period T = 2π/m. As discussed in Chapter 3, this problem can be solved exactly by assuming that m is normally distributed, working out the probability density function p(T), and then calculating its mean and variance. However, the exact procedure is difficult and time-consuming; the approximation that is developed here is accurate when mest ≫ σm.

The idea is to examine how a small fluctuation Δm in frequency causes a small fluctuation ΔT in period. Writing m = mest + Δm and T = Test + ΔT and inserting them into the formula relating frequency and period yield

\[ T = \frac{2\pi}{m} = 2\pi \left(m^{est} + \Delta m\right)^{-1} = \frac{2\pi}{m^{est}}\left(1 + \frac{\Delta m}{m^{est}}\right)^{-1} \approx \frac{2\pi}{m^{est}}\left(1 - \frac{\Delta m}{m^{est}}\right) = \frac{2\pi}{m^{est}} - \frac{2\pi}{(m^{est})^2}\,\Delta m \qquad (11.10) \]

After equating Test with 2π/mest, small fluctuations in period and frequency are found to be related through

\[ \Delta T \approx -\frac{2\pi}{(m^{est})^2}\,\Delta m \qquad (11.11) \]

If the fluctuations in Δm have a variance of σm2, then according to the usual rules of error propagation, the fluctuations in ΔT will have variance:

\[ \sigma_T^2 \approx \left(\frac{2\pi}{(m^{est})^2}\right)^2 \sigma_m^2 \qquad (11.12) \]

The period T has the same variance, since it differs from ΔT only by an additive constant. As an example, we consider the case mest = 2π × 1.3 rad/s and σm = 0.2 rad/s, which leads to Test = 0.769 s and σT = 0.019 s, estimates that agree well with a numerical simulation (Figure 11.3).
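
A minimal sketch of this comparison (not the eda11_03 script) propagates the variance with Equation (11.12) and checks it against a simple Monte Carlo simulation:

mest   = 2*pi*1.3;                     % estimated angular frequency, rad/s
sigmam = 0.2;                          % its standard deviation, rad/s
Test   = 2*pi/mest;                    % estimated period, s
sigmaT = (2*pi/mest^2)*sigmam;         % linearized error propagation, Eq. (11.12)
mr = mest + sigmam*randn(10000,1);     % realizations of the frequency
Tr = 2*pi./mr;                         % corresponding periods
fprintf('linearized sigmaT %.4f s  Monte Carlo sigmaT %.4f s\n', sigmaT, std(Tr));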

Figure 11.3 (A) Normal probability density function p(m) for angular frequency m with mean m̄ and variance σm2 (gray curve). An empirical probability density function (dashed curve) based on a histogram of 10,000 realizations of mi closely matches this function. (B) Normal probability density function p(T) for period T = 2π/m, with mean T̄ and variance σT2 calculated using the linear approximation discussed in the text (gray curve). An empirical probability density function (dashed curve) based on a histogram of Ti = 2π/mi closely matches this function. MatLab script eda11_03.

11.6 Taylor series in multiple dimensions

In a typical data analysis problem, the predicted data dpre(m) is a function of model parameters m. The Taylor series for dpre(m) is a straightforward generalization of the one-dimensional series and can be shown to be

\[ d_i^{pre}(\mathbf{m}) = d_i^{pre}(\mathbf{m}_0) + \sum_{j=1}^{M} \left.\frac{\partial d_i^{pre}}{\partial m_j}\right|_{\mathbf{m}_0} (m_j - m_{0j}) + \frac{1}{2} \sum_{j=1}^{M} \sum_{k=1}^{M} \left.\frac{\partial^2 d_i^{pre}}{\partial m_j \partial m_k}\right|_{\mathbf{m}_0} (m_j - m_{0j})(m_k - m_{0k}) + \cdots \qquad (11.13) \]

The sequence of terms is the same as in the one-dimensional case: a constant term, a linear term, a quadratic term, and so forth. Derivatives of the same order appear in the same places in both formulas, as well: the linear term is multiplied by a first derivative; the quadratic term by a second derivative; and so forth. The main complication is that both the data and the model parameters are subscripted quantities, so summations are involved. Note that the first derivative has two subscripts and is therefore a matrix, say G(0):

\[ G_{ij}^{(0)} = \left.\frac{\partial d_i^{pre}}{\partial m_j}\right|_{\mathbf{m}_0} \qquad (11.14) \]

The matrix G is called the linearized data kernel. The Taylor series is a little simpler when a scalar function of the model parameters, such as the prediction error E, is being considered:

\[ E(\mathbf{m}) = E(\mathbf{m}_0) + \sum_{j=1}^{M} \left.\frac{\partial E}{\partial m_j}\right|_{\mathbf{m}_0} (m_j - m_{0j}) + \frac{1}{2} \sum_{j=1}^{M} \sum_{k=1}^{M} \left.\frac{\partial^2 E}{\partial m_j \partial m_k}\right|_{\mathbf{m}_0} (m_j - m_{0j})(m_k - m_{0k}) + \cdots \qquad (11.15) \]

The first derivative has one subscript and so is a vector, say b(0), and the second derivative has two subscripts and so is a matrix, say B(0):

\[ b_j^{(0)} = \left.\frac{\partial E}{\partial m_j}\right|_{\mathbf{m}_0} \quad\text{and}\quad B_{jk}^{(0)} = \left.\frac{\partial^2 E}{\partial m_j \partial m_k}\right|_{\mathbf{m}_0} \qquad (11.16) \]

The quantities b and B are called the gradient vector and the curvature matrix, respectively. Equations (11.13) and (11.15) can then be written in vector notation as

\[ \mathbf{d}^{pre} \approx \mathbf{d}_0^{pre} + \mathbf{G}^{(0)} \Delta\mathbf{m} \quad\text{and}\quad E \approx E_0 + [\mathbf{b}^{(0)}]^T \Delta\mathbf{m} + \tfrac{1}{2}\,\Delta\mathbf{m}^T \mathbf{B}^{(0)} \Delta\mathbf{m} \]
\[ \text{with}\quad \mathbf{d}_0^{pre} = \mathbf{d}^{pre}(\mathbf{m}_0) \quad\text{and}\quad E_0 = E(\mathbf{m}_0) \quad\text{and}\quad \Delta\mathbf{m} = \mathbf{m} - \mathbf{m}_0 \qquad (11.17) \]

11.7 Small number approximation applied to covariance

The procedure for estimating variance (Section 11.5) can be generalized to the nonlinear vector function T(m), providing a way to estimate the covariance matrix CT. The first step is to expand the function T(m) in a Taylor series around the point mest and keep only the first two terms. The resulting linear approximation relates small fluctuations in m to small fluctuations in T:

\[ \Delta\mathbf{T} = \mathbf{M}\,\Delta\mathbf{m} \quad\text{with}\quad \Delta\mathbf{m} = \mathbf{m} - \mathbf{m}^{est} \quad\text{and}\quad \Delta\mathbf{T} = \mathbf{T} - \mathbf{T}(\mathbf{m}^{est}) \quad\text{and}\quad M_{ij} = \left.\frac{\partial T_i}{\partial m_j}\right|_{\mathbf{m}^{est}} \qquad (11.18) \]

The standard error propagation rule (Section 3.12) then gives CT = M Cm M^T.
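
As a minimal illustration (the function and the numbers below are hypothetical), the rule can be applied to a two-element nonlinear function T(m) = [m1 m2, m1/m2]^T:

mest = [3; 2];                               % estimated model parameters
Cm   = [0.01, 0.002; 0.002, 0.04];           % their covariance
M = [ mest(2),   mest(1);                    % M(i,j) = dTi/dmj evaluated at mest
      1/mest(2), -mest(1)/mest(2)^2 ];
CT = M*Cm*M';                                % approximate covariance of T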

11.8 Solving nonlinear problems with iterative least squares

The error E(m) and its derivatives can be computed as long as the theory dpre(m) is known, irrespective of whether it is linear or nonlinear. Consequently, we can apply the principle of least squares to nonlinear problems; the solution mest is the one that minimizes the error E(m), or equivalently, the one that solves ∂E/∂mi = 0. Unfortunately, this set of nonlinear equations is usually difficult to solve exactly. However, if we can guess a trial solution m0 that is close to the minimum error solution, then we can use Taylor's theorem to approximate E(m) as a low order polynomial centered about this point. If m0 is close enough to the minimum that a quadratic approximation will suffice, then the process of solving ∂E/∂mi = 0 is very simple, because the derivative of a quadratic function is a linear function. Differentiating the expression for E in Equation (11.17) and setting the result to zero yield

\[ \mathbf{0} \approx \mathbf{b}^{(0)} + \mathbf{B}^{(0)} \Delta\mathbf{m} \quad\text{or}\quad \Delta\mathbf{m} \approx -[\mathbf{B}^{(0)}]^{-1}\,\mathbf{b}^{(0)} \qquad (11.19) \]

This equation enables us to find an updated solution, m = m0 + Δm, that is closer to the minimum than is m0. This process can be iterated indefinitely, to provide a sequence of ms that, under favorable circumstances, will converge to the minimum error solution. It is sometimes referred to as Newton's method. A limitation is that the curvature matrix B must be calculated, in addition to the gradient vector b. Calculating this matrix may be difficult.

In the special case of the linear theory dpre = Gm, the error is

\[ E = (\mathbf{d}^{obs} - \mathbf{G}\mathbf{m})^T (\mathbf{d}^{obs} - \mathbf{G}\mathbf{m}) = \mathbf{d}^{obs\,T} \mathbf{d}^{obs} - 2\,\mathbf{d}^{obs\,T} \mathbf{G}\mathbf{m} + \mathbf{m}^T \mathbf{G}^T \mathbf{G}\mathbf{m} \qquad (11.20) \]

The unknown quantities m0, b, and B are found by comparison with Equation (11.17):

\[ \mathbf{m}_0 = \mathbf{0} \quad\text{and}\quad \mathbf{b} = -2\,\mathbf{G}^T \mathbf{d}^{obs} \quad\text{and}\quad \mathbf{B} = 2\,\mathbf{G}^T \mathbf{G} \qquad (11.21) \]

Inserting these values into Equation (11.19) yields the familiar least squares solution: m = (G^T G)^(−1) G^T dobs. This correspondence suggests the strategy of approximating the curvature matrix as B(0) ≈ 2[G(0)]^T G(0), in which case the procedure for updating a trial solution m(k) is (Tarantola and Valette, 1982; see also Menke, 2014)

\[ \Delta\mathbf{d} = \mathbf{d}^{obs} - \mathbf{d}^{pre}(\mathbf{m}^{(k)}) \quad\text{and}\quad G_{ij}^{(k)} = \left.\frac{\partial d_i^{pre}}{\partial m_j}\right|_{\mathbf{m}^{(k)}} \]
\[ \Delta\mathbf{m} = \left([\mathbf{G}^{(k)}]^T \mathbf{G}^{(k)}\right)^{-1} [\mathbf{G}^{(k)}]^T \Delta\mathbf{d} \quad\text{and}\quad \mathbf{m}^{(k+1)} = \mathbf{m}^{(k)} + \Delta\mathbf{m} \qquad (11.22) \]

The covariance of the solution is approximately

\[ \mathbf{C_m} = \sigma_d^2 \left([\mathbf{G}^{(k_{max})}]^T \mathbf{G}^{(k_{max})}\right)^{-1} \qquad (11.23) \]

where kmax signifies the final iteration.

This procedure is known as iterative least squares. An important issue is when to stop iterating. Possibilities include when the error of the fit, as quantified by its posterior variance, declines below a specified value that represents the fit being acceptably good; or when the error no longer decreases significantly between iterations, that is, when ΔE = E(m(k)) − E(m(k+1)) declines below a specified value; or when the solution no longer changes significantly between iterations, that is, when {[Δm]^T[Δm]}^(1/2) declines below a specified value.
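
A skeleton of the procedure, written in MatLab, might look as follows. The functions dpre_of_m() and G_of_m() are hypothetical placeholders for the problem-specific theory and its linearized data kernel, and the quantities m0, dobs, maxit, tol, and sd2 (the data variance) are assumed to have been defined beforehand:

m = m0;                                  % trial solution
for k = 1:maxit
    dd = dobs - dpre_of_m(m);            % prediction error, Delta d
    G  = G_of_m(m);                      % linearized data kernel at m
    dm = (G'*G)\(G'*dd);                 % least squares perturbation, Eq. (11.22)
    m  = m + dm;                         % updated solution
    if( sqrt(dm'*dm) < tol )             % stop when the solution stops changing
        break;
    end
end
Cm = sd2 * inv(G'*G);                    % approximate covariance, Eq. (11.23)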

Prior information, represented by the linear equation Hm = hpri, can be added to the procedure in order to generalize it. The only nuance is that the prior information is satisfied by the model parameters m(k+1) and not by the model perturbation Δm. Inserting m(k+1) = m(k) + Δm into the prior information equation yields

\[ \mathbf{H}\,\Delta\mathbf{m} = \mathbf{h}^{pri} - \mathbf{H}\mathbf{m}^{(k)} \qquad (11.24) \]

See Crib Sheet 11.2 for the complete formulation.

Crib Sheet 11.2

Iterative generalized least squares

data equation: g(m) = dobs and prior information: Hm = hpri
trial solution: m(k)
linearized equations: G(k) Δm(k) = Δd(k) and H Δm(k) = Δh(k)
with Gij(k) = ∂gi/∂mj evaluated at m(k)
and with Δm(k) = m − m(k) and Δd(k) = dobs − g(m(k)) and Δh(k) = hpri − Hm(k)
combined equations: F(k) Δm(k) = f(k) with
F(k) = [ Cd^(−1/2) G(k) ; Ch^(−1/2) H ] and f(k) = [ Cd^(−1/2) Δd(k) ; Ch^(−1/2) Δh(k) ]
solution: Δm(k) = ([F(k)]^T F(k))^(−1) [F(k)]^T f(k)
update rule: m(k+1) = m(k) + Δm(k)
covariance: Cm = ([F(k)]^T F(k))^(−1)

11.9 Fitting a sinusoid of unknown frequency

Suppose that the data consists of a single sinusoid superimposed on a background level, and that the amplitude, phase, and frequency of the sinusoid, as well as the background level, are unknown. This scenario is described by an equation that is nonlinear in the model parameters:

\[ d_i^{pre} = m_1 \bar{d} + m_2 A \sin(m_4 \omega_a t_i) + m_3 A \cos(m_4 \omega_a t_i) \qquad (11.25) \]

The constants d̄, A, and ωa have been introduced in order to normalize the size of the model parameters, which will be in the ±1 range (or close to it) when these constants are properly chosen. The matrix G(k) has elements:

\[ G_{i1}^{(k)} = \bar{d} \quad\text{and}\quad G_{i2}^{(k)} = A \sin(m_4^{(k)} \omega_a t_i) \quad\text{and}\quad G_{i3}^{(k)} = A \cos(m_4^{(k)} \omega_a t_i) \]
\[ \text{and}\quad G_{i4}^{(k)} = m_2^{(k)} A\, \omega_a t_i \cos(m_4^{(k)} \omega_a t_i) - m_3^{(k)} A\, \omega_a t_i \sin(m_4^{(k)} \omega_a t_i) \qquad (11.26) \]

This model can be applied to the Black Rock Forest temperature data, which has a pronounced annual cycle. We set d̄ to the mean of the data, A to half its range, and ωa to the frequency 2π/365 rad/day. A reasonable trial solution is m = [1.0, 0.1, 0.9, 1.0]^T. (The choice m2 = 0.1, m3 = 0.9 implies that the highest temperature is in summer.) The iterative procedure of Equation (11.22) is then used to refine this estimate. The iteration process is terminated when the change in model parameters, from one iteration to the next, drops below a predetermined value (in this example, 10^−4).
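
A sketch of a single iteration of this procedure (illustrative only; not the eda11_04 script) is given below. The time vector t (a column vector), the data dobs, and the constants dbar, A, and wa are assumed to have been set up as described above:

m = [1.0; 0.1; 0.9; 1.0];                         % trial solution
dpre = m(1)*dbar + m(2)*A*sin(m(4)*wa*t) + m(3)*A*cos(m(4)*wa*t);
G = [ dbar*ones(length(t),1), A*sin(m(4)*wa*t), A*cos(m(4)*wa*t), ...
      m(2)*A*wa*t.*cos(m(4)*wa*t) - m(3)*A*wa*t.*sin(m(4)*wa*t) ];    % Eq. (11.26)
dm = (G'*G)\(G'*(dobs-dpre));                     % least squares perturbation
m = m + dm;                                       % update; repeat until norm(dm) < 1e-4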

This method produces an estimated period of 365.61 ± 0.06 days (95%), where the confidence interval was computed by the method of Section 11.5. Whether the nonlinear fit (Figure 11.4) provides a better estimate of the period than can be determined from standard amplitude spectral density analysis is debatable. The annual peak in the amplitude spectral density is broad, owing to the relatively few annual oscillations in the temperature time series (only 11). Nevertheless, its center is very close to the value determined by the nonlinear analysis (Figure 11.5).

Figure 11.4 Black Rock Forest temperature data: observed (black dots); initial model (dark gray curve); final model (light gray curve). See text for further discussion. MatLab script eda11_04.
Figure 11.5 Amplitude spectral density (curve with circles) of the Black Rock Forest temperature dataset, for a restricted period range centered on the annual period of 365 days. Note that this curve has a broad peak centered on the annual period. The period (vertical bar) determined by the nonlinear least squares fit is close to the center of this peak. See text for further discussion. MatLab script eda11_04.

11.10 The gradient method

The gradient vector bi(k) = ∂E/∂mi evaluated at m(k) represents the slope of the error function E(m) in the vicinity of the trial solution m(k). It points in the direction in which the error most rapidly increases, which is to say, directly away from the direction in which m(k) must be perturbed to lower the error. For this reason, its appearance in the rule for updating a trial solution m(k) (Equation 11.19) is unsurprising. This relationship is further examined by rewriting the update rule as

\[ \Delta\mathbf{m} = -[\mathbf{B}^{(k)}]^{-1} \mathbf{b}^{(k)} = \mathbf{M}^{(k)} \boldsymbol{\nu}^{(k)} \quad\text{with}\quad \mathbf{M}^{(k)} = |\mathbf{b}^{(k)}|\,[\mathbf{B}^{(k)}]^{-1} \quad\text{and}\quad \boldsymbol{\nu}^{(k)} = -\mathbf{b}^{(k)} / |\mathbf{b}^{(k)}| \qquad (11.27) \]

Here |b| is the length of b. This rule has a simple interpretation: the vector ν(k) is the downslope direction; that is, the direction in which m(k) should be perturbed to most reduce the error. The matrix M(k) encodes information about the size of the perturbation needed to reach the minimum in the error E(m), and although it also can change the direction of the perturbation, this effect is usually secondary.

The gradient vector b(k) can be computed easily from the linearized data kernel G(k):

\[ b_i^{(k)} = \left.\frac{\partial E}{\partial m_i}\right|_{\mathbf{m}^{(k)}} = \frac{\partial}{\partial m_i} \sum_{j=1}^{N} \left(d_j^{obs} - d_j^{pre}\right)^2 = -2 \sum_{j=1}^{N} G_{ji}^{(k)} \left(d_j^{obs} - d_j^{pre}\right) \quad\text{or}\quad \mathbf{b}^{(k)} = -2\,[\mathbf{G}^{(k)}]^T \left(\mathbf{d}^{obs} - \mathbf{d}^{pre}\right) \qquad (11.28) \]

The calculation of the matrix M(k) is more difficult because it involves taking a matrix inverse. A trial solution would be easier to update if it could be omitted from the formula. The gradient method approximates M(k) as αI, where α is an unknown step size. An initial guess for α is made at the start of the error minimization process and is used to tentatively update the trial solution according to the rule m(k+1) = m(k) + αν(k). The iteration process continues if the error E(k+1) has decreased. However, if the error has increased, the new solution has overshot the minimum because the step size is too big. The step size is decreased, say by replacing α with α/2, and another tentative solution is tried.
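
A sketch of this step-halving strategy (with the same kind of hypothetical placeholder functions, dpre_of_m() and G_of_m(), standing in for the problem-specific theory and its linearized data kernel) is:

alpha = 0.1;                                 % initial step size (problem dependent)
m = m0;                                      % trial solution
E = sum( (dobs - dpre_of_m(m)).^2 );         % its error
for k = 1:maxit
    dd = dobs - dpre_of_m(m);                % prediction error
    b  = -2*G_of_m(m)'*dd;                   % gradient vector, Eq. (11.28)
    nu = -b/norm(b);                         % downslope unit vector
    mtrial = m + alpha*nu;                   % tentative update
    Etrial = sum( (dobs - dpre_of_m(mtrial)).^2 );
    if( Etrial < E )                         % error decreased: accept the step
        m = mtrial;  E = Etrial;
    else                                     % overshoot: halve the step size
        alpha = alpha/2;
    end
end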

The main limitation of the gradient method, compared to iterative least squares, is that more iterations are needed to achieve an acceptable solution. While each step moves the solution towards the minimum, the step size has not been correctly set to move it to the minimum, so convergence is much slower. When applied to the Black Rock Forest temperature problem described in the previous section, about 3500 iterations of the gradient method are needed to provide an acceptable solution (Figure 11.6), whereas fewer than 10 iterations of the iterative least squares method are needed for a solution with similar accuracy.

Figure 11.6 Black Rock Forest temperature dataset (black dots) modeled as a single sinusoid of unknown period. (A) Iterative least squares fit. (B) Gradient method. (C) Stochastic gradient method. See text for further discussion. MatLab scripts eda11_05 and eda11_06.

The least squares error E(m) is the sum of many individual errors:

\[ E(\mathbf{m}) = \sum_{i=1}^{N} e_i^2 \quad\text{where}\quad e_i = d_i^{obs} - d_i^{pre}(\mathbf{m}) \qquad (11.29) \]

In many practical cases, the number of individual errors is very large (N ≈ 100,000 in the Black Rock Forest case). The updating rule m(k+1) = m(k) + αν(k) is only approximate. The omission of some of the individual errors from the formula for ν(k) makes it somewhat more approximate, but as long as ν(k) is not too inaccurate, each step will still move the solution towards the minimum. Furthermore, if a different random subset of individual errors is used in each step, the entire dataset can be cycled through during the course of the solution. This approximation is called the stochastic gradient method. In the case of the Black Rock Forest dataset, a reduction of the number of individual errors by a factor of 1000 still leads to an acceptable solution (Figure 11.6).
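
A sketch of the subset selection for one step (again using hypothetical placeholder functions, here assumed to accept a list of the rows to be used) is:

Nsub = 100;                              % size of the random subset of errors
isub = randperm(N, Nsub);                % rows used in this step
dd = dobs(isub) - dpre_of_m(m, isub);    % subset of the prediction errors
G  = G_of_m(m, isub);                    % corresponding rows of the data kernel
b  = -2*G'*dd;                           % approximate gradient vector
m  = m + alpha*(-b/norm(b));             % downslope step, as before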

11.11 Precomputation of a function and table lookups

Many data analysis methods require that the same function be evaluated numerous times. If the function is complicated but has only one or two arguments of known range, then precalculating it and storing its values in a table may prove computationally efficient. As long as the time needed to perform a table lookup is less than the time needed to evaluate the function, computation time is reduced. Some accuracy is lost, too, since a table lookup provides only an approximate value of the function, but in many cases this inaccuracy is overwhelmed by other sources of error (and especially measurement error) that enter into the problem.

The lookup table for a one-dimensional function d(x) is just the time series dk = d(xk) with xk = (k − 1)Δx + xmin and 1 ≤ k ≤ K. The range of the xk must encompass all values of x that need to be estimated, and the sampling must be small enough to faithfully represent fluctuations in the function. A simple estimate of the function d(x) is the value of the corresponding table entry, dn with n = floor((x − xmin)/Δx) + 1. Here the function floor(x) is the largest integer smaller than x. Alternatively, d(x) could be estimated by linearly interpolating between the two bracketing values dn and dn+1, though the interpolation process adds computation time.
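
A minimal sketch of building and interrogating such a table (the tabulated function is an arbitrary example) is:

xmin = 0;  xmax = 2;  K = 1001;             % table limits and length
Dx = (xmax-xmin)/(K-1);                     % table spacing
xk = xmin + Dx*[0:K-1]';                    % table abscissas
dtable = sin(pi*sin(pi*xk));                % precomputed table of an example function
x = 1.2345;                                 % a value strictly inside the table's range
n = floor((x-xmin)/Dx) + 1;                 % index of the bracketing table entry
dapprox = dtable(n);                        % table lookup estimate of d(x)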

As an example, we consider the problem of using a grid search to determine two model parameters (m1, m2) where the data di(ti) obeys

\[ d_i(t_i) = \sin\!\left(\pi \sin(\pi x_i) \sin^2(\pi(1 - x_i))\right) \quad\text{with}\quad x_i = m_1 + m_2 t_i \qquad (11.30) \]

Note that di(xi) is a complicated function of just one variable xi(ti). In order to set up the lookup table, we need to know the range of plausible values of the model parameters (m1, m2) and time t. Let us suppose that they are 0 ≤ m1 ≤ 1, 0 ≤ m2 ≤ 1, and 0 ≤ t ≤ 1. These ranges allow us to compute the range for the xs in the lookup table as 0 ≤ x ≤ 2. The version of the grid search that uses the lookup table executes in only about half the time of the version that directly computes di(ti) each time it is needed (Figure 11.7).

Figure 11.7 Results of grid search, showing observed data dobs (dots), true data dtrue (light gray curve), estimated data dpre using a standard grid search (dark gray curve), and estimated data using a grid search with a table lookup (black curve). MatLab script eda11_07.

Lookup tables are applicable to functions of two or more variables, but tables of more than two dimensions are seldom useful, because the time needed to construct them is very large.

11.12 Artificial neural networks

Suppose that the function d(x) is not accurately known. One strategy is to start with an initially poor estimate of it and improve it as new information becomes available. One advantage of using a lookup table to represent the function is that a table is easily updated. The values of its elements can be trivially changed to reflect new information. However, a table is also inflexible, in the sense that, once defined, the width Δx of its elements cannot easily be changed to reflect new information about the scale length over which d(x) varies. Furthermore, the value of d(x) jumps abruptly as an interval boundary is crossed, which may violate prior information about the function's smoothness. In this section, a new approximation, called the artificial neural network (or neural net, for short), is introduced. It shares some similarity with a lookup table but is more adaptable and smooth. (Notice that the attribute of the lookup table upon which we are focusing is updatability, not computation speed; neural nets are not especially speedy.)

An arbitrary function d(x) can be represented schematically as a row of three boxes, with interconnecting lines that represent information flowing from left to right (Figure 11.8A). In this network representation, the left box represents input x, the right box represents output d(x), and the middle box represents the workings of the function, which may be arbitrarily complicated. The middle box encapsulates (or hides) all the details of the function, and for this reason is said to be hidden. The table approximation replaces the workings in the middle box with a simple table lookup (Figure 11.8B). Since each element of the table specifies the value of the function d(x) in a distinct interval of the x-axis, the process of interrogating the table lookup can be viewed as summing boxcar or tower functions, each of which is constant in the interval and zero outside of it (Figure 11.8C).

Figure 11.8 Schematic representation of a function, in which information flows from left to right. (A) The function d(x). (B) The table lookup approximation to the function d(x). (C) The middle box of the table lookup in (B) is subdivided into several distinct boxcars (or towers) each of which represents one element of the table. Note that information diverging out of a box is duplicated along the several connections and information from multiple connections converging on a box is summed.

The left and right edges of each table entry can be made separate by viewing a tower as the sum of two step functions:

\[ d_c\, H(w_L x + b_L) - d_c\, H(w_R x + b_R) \qquad (11.31) \]

Here H(z) (with z = wx + b) is the Heaviside step function, defined as zero when z < 0 and unity when z > 0. The weights w and the biases b parameterize the location of the edges, with the left edge at xL = −bL/wL and the right edge at xR = −bR/wR. The constant dc specifies the height of the tower (Figure 11.9). In a lookup table, the edges of adjacent towers match one another; that is, xL(i) = xR(i−1) and xR(i) = xL(i+1) (where the superscript consecutively numbers the towers). However, in a neural net, this requirement is relaxed and step functions are allowed to overlap in any fashion.

Figure 11.9 (A) Neural net, with one tower function singled out. (B) Graph of the tower function. (C) Neural net with the tower function replaced by two step functions. (D) and (E) Graphs of the two step functions.

The issue of smoothness is addressed by replacing the step function H(z) by the sigmoid function:

\[ \sigma(z) = \frac{1}{1 + \exp(-z)} \qquad (11.32) \]

Like the step function, this function asymptotes to zero and unity as z → −∞ and z → +∞, respectively, and has a transition centered on z = 0 (or x0 = −b/w). Unlike the step function, the sigmoid function is smooth. The parameter w controls the sharpness of the transition, with a maximum slope at x0 of s = dσ/dx = w/4 (Figure 11.10).

Figure 11.10 Exemplary sigmoid functions, σ(wx + b). (A) Slope s = w/4 = 100, b = 0. (B) Slope s = 3, b = 0. (C) Slope s = 1, b = 0. (D) Slope s = 0.1, b = 0. Note that the transition is sharper for larger values of s. MatLab script eda11_08.

11.13 Information flow in a neural net

A neural net consists of L columns (or layers) of boxes, with the kth layer containing N(k) boxes. Because the neural net approximation was motivated by studies of information flow in animal brains, the boxes are called neurons. Information flows from left to right along connections, denoted by line segments connecting the boxes in adjacent layers. Neuron i in layer k has one parameter associated with it, its bias bi(k). The connection between neuron i in layer k and neuron j in layer k − 1 has one parameter associated with it, its weight wij(k); the absence of a connection is indicated by a zero weight. Neuron i in layer k also has an input value zi(k) and an output value (or activity) ai(k). The activities of neurons in layer 1 represent the input to the function and the activities of the neurons in layer L represent the output of the function. In the one-dimensional case considered above, layer 1 has one neuron with activity x and layer L has one neuron with activity d(x). However, neural nets can handle a multidimensional function d(x) simply by having multiple input and output neurons. The rule for propagating information between one layer and the next is

(input of neuron i on layer k) = (weighted sum of activities of neurons on layer k − 1) + (bias of neuron i on layer k):
\[ z_i^{(k)} = \sum_{j=1}^{N^{(k-1)}} w_{ij}^{(k)}\, a_j^{(k-1)} + b_i^{(k)} \]
(activity of neuron i on layer k) = (its input when k = L; the sigmoid function applied to its input when k < L):
\[ a_i^{(k)} = \begin{cases} z_i^{(k)} & \text{when } k = L \\ \sigma\!\left(z_i^{(k)}\right) & \text{when } k < L \end{cases} \qquad (11.33) \]

These relationships are depicted in Figure 11.11. Once the activities in layer 1 are specified, this rule can be used to propagate information layer by layer through the neural network. A simple MatLab function that performs this operation is provided:

Figure 11.11 Information flow in a neural net. See text for further discussion.
a = eda_net_eval(N,w,b,a);
% set the activities of the layer 1 before calling this function
% N(i): column-vector with number of neurons in each layer;
% biases: b(1:N(i),i) is a column-vector that gives the
% biases of the N(i) neurons on the i-th layer
% weights: w(1:N(i),1:N(i-1),i) is a matrix that gives the
%  weights between the N(i) neurons in the i-th layer
%  and the N(i-1) neurons on the (i-1)-th layer.
% activity: a(1:N(i),i) is a column-vector that gives the
% activities of the N(i) neurons on the i-th layer

(MatLab eda_net_eval)
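
For reference, a minimal sketch of the forward-propagation rule of Equation (11.33), written with the same array conventions as described in the comments above (it is not the library function itself), is:

L = length(N);                              % number of layers
for k = 2:L
    for i = 1:N(k)
        z = b(i,k);                         % bias of neuron i on layer k
        for j = 1:N(k-1)
            z = z + w(i,j,k)*a(j,k-1);      % weighted sum of layer k-1 activities
        end
        if( k == L )
            a(i,k) = z;                     % output layer: activity equals input
        else
            a(i,k) = 1/(1+exp(-z));         % hidden layers: sigmoid of the input
        end
    end
end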

The neural nets for smoothed versions of a step function and a tower function are shown in Figure 11.12. Since the sigmoid function is omitted on the output layer, the activity of this layer is just a weighted sum of its inputs. This feature allows the easy amalgamation of networks. For instance, several single-tower networks can be amalgamated to yield a sum of towers (Figure 11.13).

Figure 11.12 Simple neural nets. (A) Neural net for a smoothed version of a step function d(x) with amplitude A, transition at x0, and a slope s. All values not shown are zero. (B) Graph for the A = 1, x0 = 2, s = 250 case. (C) Graph for the A = 1, x0 = 2, s = 1 case. (D) Neural net for a smoothed version of a tower function d(x) with amplitude A, transitions at x1 and at x2, and slope s. (E) Graph for the A = 1.5, x1 = 1.5, x2 = 2.5, s = 250 case. (F) Graph for the A = 1.5, x1 = 1.5, x2 = 2.5, s = 1 case. MatLab scripts eda11_09 and eda11_10.
Figure 11.13 (A) A neural net with six towers (six pairs of sigmoid functions) is used to approximate a function d(x). (B) The true function d(x), a polynomial. (C) The neural net approximation. MatLab script eda11_11.

A two-dimensional tower, say for input (xy), can be created by amalgamating two one-dimensional tower networks, a tower in x and a tower in y (Figure 11.14A). However, their sum is not quite a two-dimensional tower function but rather consists of two intersecting ridges, one parallel to the x-axis and the other parallel to the y-axis, with a hump where the tower should be (Figure 11.14B). A fourth layer, implementing a step function with a transition higher than the ridges but lower than the hump, is needed to delete the ridges and form the tower (Figure 11.14C).

Figure 11.14 (A) A neural net for a two-dimensional tower function d(x, y) with unit amplitude and nonzero between 1 < x < 2 and 1 < y < 2. (B) The input to the third layer has low-amplitude ridges (gray) that intersect to produce a high amplitude hump (black). (C) The fourth layer deletes the ridges but leaves the hump. Its output is the tower. MatLab script eda11_12.

A function can be approximated to any degree of fidelity by summing together enough towers. Thus, a neural network is capable of approximating any continuous function. However, a network constructed in this fashion may not be the smallest capable of adequately approximating the function. In the next section we will develop a technique to train an arbitrary network to produce the best approximation of a function of which it is capable. This is a simple form of machine learning.

11.14 Training a neural net

The goal in training a neural net is to match the output dipre = dpre(xi) of the neural net to some set of observed values (xi, diobs); that is, to find a set of weights and biases such that dipre ≈ diobs. This problem can be solved by defining an error ei = diobs − dipre and finding the weights and biases that minimize the total error E = e^T e. Since the neural net is inherently nonlinear, the error minimization must be performed with the iterative least squares method (Section 11.8) or the gradient method (Section 11.10). The process refines an initial guess of the model parameters (the combined set of weights and biases) into a final estimate that minimizes the prediction error.

Both procedures require the derivatives:

\[ \frac{\partial d_i^{pre}}{\partial b_n^{(k)}} \quad\text{and}\quad \frac{\partial d_i^{pre}}{\partial w_{nm}^{(k)}} \qquad (11.34) \]

At first glance, computing these derivatives appears to be a daunting task. However, recall that dipre is just the activity of neuron 1 in layer L, and the formula for computing activities (Equation 11.33) is simple and easily differentiated. The chain rule (see Note 11.1) implies

\[ \frac{\partial d_i^{pre}}{\partial b_j^{(k)}} = \frac{\partial a_1^{(L)}}{\partial b_j^{(k)}} = \frac{\partial z_j^{(k)}}{\partial b_j^{(k)}}\, \frac{\partial a_j^{(k)}}{\partial z_j^{(k)}}\, \frac{\partial a_1^{(L)}}{\partial a_j^{(k)}} \quad\text{and}\quad \frac{\partial d_i^{pre}}{\partial w_{jm}^{(k)}} = \frac{\partial a_1^{(L)}}{\partial w_{jm}^{(k)}} = \frac{\partial z_j^{(k)}}{\partial w_{jm}^{(k)}}\, \frac{\partial a_j^{(k)}}{\partial z_j^{(k)}}\, \frac{\partial a_1^{(L)}}{\partial a_j^{(k)}} \qquad (11.35) \]

While the derivatives of Equation (11.35) are of varying complexity, they are all tractable. The easiest is

\[ \frac{\partial z_j^{(k)}}{\partial b_j^{(k)}} = 1 \qquad (11.36) \]

The two next most complicated are

\[ \frac{\partial z_j^{(k)}}{\partial w_{jm}^{(k)}} = a_m^{(k-1)} \quad\text{and}\quad \frac{\partial z_j^{(k)}}{\partial a_m^{(k-1)}} = w_{jm}^{(k)} \qquad (11.37) \]

More complicated still is

\[ \frac{\partial a_j^{(k)}}{\partial z_j^{(k)}} = \begin{cases} \sigma'\!\left(z_j^{(k)}\right) = \dfrac{\exp(-z_j^{(k)})}{\left[1 + \exp(-z_j^{(k)})\right]^2} = a_j^{(k)}\left(1 - a_j^{(k)}\right) & k < L \\ 1 & k = L \end{cases} \qquad (11.38) \]

Finally, the ∂ai(L)/∂aj(k) derivative is built layer by layer, using another application of the chain rule. The derivative that connects changes in activities on adjacent layers is

\[ \frac{\partial a_i^{(k)}}{\partial a_j^{(k-1)}} = \frac{\partial a_i^{(k)}}{\partial z_i^{(k)}}\, \frac{\partial z_i^{(k)}}{\partial a_j^{(k-1)}} \qquad (11.39) \]

A succession of these derivatives can be used to connect the activities on the Lth layer to all the other layers, a process called back-propagation, since it starts with the Lth layer and works backward towards the first layer (Figure 11.15) (Werbos, 1974):

\[ \frac{\partial a_i^{(L)}}{\partial a_j^{(L-1)}} = \frac{\partial a_i^{(L)}}{\partial z_i^{(L)}}\, \frac{\partial z_i^{(L)}}{\partial a_j^{(L-1)}} \quad\text{and}\quad \frac{\partial a_i^{(L)}}{\partial a_j^{(L-2)}} = \sum_{p=1}^{N^{(L-1)}} \frac{\partial a_i^{(L)}}{\partial a_p^{(L-1)}}\, \frac{\partial a_p^{(L-1)}}{\partial a_j^{(L-2)}} \quad\text{and}\quad \frac{\partial a_i^{(L)}}{\partial a_j^{(L-3)}} = \sum_{p=1}^{N^{(L-2)}} \frac{\partial a_i^{(L)}}{\partial a_p^{(L-2)}}\, \frac{\partial a_p^{(L-2)}}{\partial a_j^{(L-3)}} \quad\text{etc.} \qquad (11.40) \]

The summation arises because a perturbation in the activity of a neuron on the kth layer is due to the combined effect of perturbations of the activities of all the neurons on the (k − 1)th layer that connect with it (see Note 11.1).

Figure 11.15 The back-propagation process follows the chain of events that lead to a change in the error Δe, backwards, from effect to cause. A perturbation in Δe is caused by a perturbation in the activity a1(4) = dpre, which in turn is caused by perturbations Δa1(3) and Δa2(3), which in turn are caused by perturbation Δa1(2), which in turn is caused by input perturbation Δz1(2), which in turn is caused by a perturbation Δw11(2) in a specific weight.

A MatLab function that calculates the derivatives is provided:

[daLmaxdw, daLmaxdb] = eda_net_deriv(N,w,b,a);
% update the activities by calling a=eda_net_eval(N,w,b,a);
% before calling this function
% daLmaxdb(i,j,k): the change in the activity of the i-th neuron
% in layer Lmax due to a change in the bias of the of
% the j-th neuron in the layer k
% daLmaxdw(i,j,k,l): the change in the activity of the i-th
% neuron in layer Lmax due to a change in the weight
% connecting the j-th neuron in the layer l with the k-th
% neuron of layer l-1

(MatLab eda_net_deriv)

The weights and biases together constitute the model parameters m of the iterative least squares problem (Section 11.8). The matrix Gij(p) contains the derivatives, and has one row i for each observation and one column j for each model parameter. Careful bookkeeping is necessary to keep track of the correspondence between the jth model parameter and a particular weight or bias. Index tables can be used to translate back and forth between these two representations. We provide a MatLab function eda_net_map() that creates them but do not describe it further, here (see the comments in the script). An important issue is the starting values of the weights and activities. For simple networks that represent towers or other simple behaviors, it may be possible to choose the weights and biases so that initial behavior is somewhat close to matching the observations. In this case, training merely refines the behavior and only a few iterations are necessary. In more complicated cases, the weights and biases may need to be initialized to random numbers, with the hope that the training will converge eventually to the best-fitting values. In both cases, imposing prior information that the correction Δm(k) is small can help prevent overshoots and improve the rate of convergence.

As an example, a network consisting of a single tower is trained to match observations dobs(x, y) that have the shape of a two-dimensional Normal function (Figure 11.16A). The initial weights and biases are chosen so that the tower is centered on the peak of the Normal function but has edges that, while smooth, fall off more quickly than is observed (Figure 11.16B). The iterative least squares procedure converges rapidly and reduces the error to 6 × 10^−5 of its starting value (Figure 11.16C).

Figure 11.16 A single-tower network is trained to match an observed function dobs(x, y). (A) The observed function dobs(x, y), a Normal function. (B) The initial estimate dpre(x, y), based on a single-tower neural net with a slope s = 1.25. (C) The prediction dpre(x, y), after training. MatLab script eda11_13.

11.15 Neural net for a nonlinear response

In Section 7.4, river discharge d(t) is predicted from precipitation x(t) by developing a filter c(t) such that d(t) = c(t) ∗ x(t). A filter is appropriate when the response of the river to precipitation is linear. It assumes that the river responds similarly to a shower and a storm; the latter is merely a scaled up version of the former. In reality, few rivers are completely linear in this sense, because they behave differently in periods of low and high water. Storms cause a river to widen and, in extreme cases, to overflow its banks. Both effects lead to nonlinearities in the river's response to precipitation.

Consider the following very simple model of a river, in which the function y(t) represents the volume of water stored in its watershed:

\[ \frac{dy}{dt} = A_w\, x(t) - d(t) \quad\text{with}\quad d(t) = A_r\, v(t) \qquad (11.41) \]

Here v(t) is the mean velocity of water in the river, Aw is the area of the watershed, and Ar is the cross-sectional area of the river. Precipitation x(t) increases stored water and river discharge d(t) decreases it. Now assume that the velocity is linear in the volume of stored water; that is, v(t) = c1 y(t), where c1 is a constant. The model becomes

\[ \frac{dy}{dt} = A_w\, x(t) - c_1 A_r\, y(t) \qquad (11.42) \]

This differential equation is linear in the unknown y(t) and will therefore have a linear response to an impulsive precipitation event (the response to Aw x(t) = δ(t) can be shown to be y(t) = exp(−c1 Ar t) for t > 0 and zero otherwise). Since discharge is proportional to stored water, it too has a linear response. A filter can completely capture the relationship between precipitation and discharge.

The model is now modified into a nonlinear one by assuming that the cross-sectional area of the river is proportional to stored water, so that Ar = c2 y(t), where c2 is a constant. The model becomes

\[ \frac{dy}{dt} = A_w\, x(t) - c_1 c_2\, y^2(t) \qquad (11.43) \]

This differential equation is nonlinear in the unknown y(t) and consequently does not have a linear response. While an impulse in precipitation will cause a transient discharge signal, the signal for a storm will not have the same shape as the signal for a shower. Furthermore, the response to several closely spaced precipitation events will not linearly superimpose. As we will demonstrate below, a neural net is capable of predicting this nonlinear behavior.
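
A minimal sketch of such a calculation, using a simple Euler time step (the constants are illustrative and are not necessarily the values used in the text's test dataset), is:

Dt = 1.0;  N = 1000;                           % time increment and series length
Aw = 1.0;  cf = 0.2;                           % watershed area and nonlinearity constant c1*c2
x = zeros(N,1);                                % synthetic precipitation time series
ispike = randperm(N,20);                       % times of a few spiky precipitation events
x(ispike) = rand(20,1);                        % their random amplitudes
y = zeros(N,1);  d = zeros(N,1);               % stored water and discharge
for i = 2:N
    dydt = Aw*x(i-1) - cf*y(i-1)^2;            % right-hand side of Eq. (11.43)
    y(i) = y(i-1) + Dt*dydt;                   % Euler update of the stored water
    d(i) = cf*y(i)^2;                          % discharge, quadratic in stored water
end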

Initially, a neural net is constructed that is capable of representing convolution by the filter c(t). Later, this network is modified by adding connections that enhance its ability to represent nonlinearities. The filtering operation d(t) = c(t) ∗ x(t) is just a multidimensional generalization of the linear function d(x) = cx + h (with h = 0):

\[ d(t_i) = c(t) * x(t)\big|_{t_i} = c_1 x(t_i) + c_2 x(t_{i-1}) + c_3 x(t_{i-2}) + \cdots \qquad (11.44) \]

Therefore, a linear function-emulating network is first created (Figure 11.17) and K of these networks are amalgamated into a filter-emulating network that implements convolution by a length-K filter. The linear function-emulating network exploits the property of the sigmoid function σ(wx + b) that, for small weight w, its output is a linear function of its input, at least near its transition point x0 = −b/w (see Figure 11.10D). This linear behavior is demonstrated by expanding the sigmoid function in a Taylor series:

\[ \sigma(wx + b) \approx \tfrac{1}{2} + \tfrac{1}{4}\, w\,(x - x_0) \quad\text{for}\quad x \approx x_0 \quad\text{and}\quad w \ll 1 \qquad (11.45) \]

The linear function-emulating network has four layers containing six parameters (three weights and three biases). These parameters can be initialized so that the network behaves linearly; subsequent training enables it to capture more complicated nonlinear behaviors.

Figure 11.17 (A) Four layer neural net that implements the function d(x) = h + cx. (B) Graph of dtrue(x) = 2 + 1.5x (gray curve) and dpre(x) (black curve). MatLab script eda11_14.

The utility of the filter-emulating network is demonstrated using a precipitation-discharge test dataset. A synthetic precipitation time series x(t), of length N = 1000, is created from a random pattern of spiky precipitation events. The corresponding discharge time series dobs(t) is determined by solving Equation (11.43) numerically. The nonlinear nature of the relationship between precipitation and discharge is apparent; large precipitation events produce disproportionately large peaks in discharge. A K = 10 filter-emulating network is initialized to the filter c(t) = A exp(−bt), where the parameters A and b are determined by least squares (see Section 7.4). Several additional connections are added to the network to allow it to better approximate a nonlinear response function (Figure 11.18) and these new connections are assigned small, randomly chosen weights. The final network contains M = 69 unknown weights and biases. When trained on the first half of the dataset, the network achieves an error reduction of a factor of about 10 when compared to the filter (Figure 11.19). Finally, without any retraining, the network is used to predict the second half of the test dataset. The network predicts it equally well, suggesting that it has successfully captured the nonlinear and, from the point of view of the test, unknown dynamics of the river system.

Figure 11.18 Exemplary neural net for approximating a nonlinear response. Connections in black are the minimum number needed to implement a filter of length K = 10. Connections in gray have been added to improve the net's accuracy of prediction. MatLab script eda11_15.
Figure 11.19 Application of a neural net to a nonlinear response problem. (A) First half of a synthetic observed precipitation time series x(t). (B) Corresponding observed river discharge time series dobs(t), obtained by solving Equation (11.43) (gray) and its neural net prediction dpre(t) (solid) after training with these observations. (C) Second half of synthetic precipitation time series x(t). (D) Corresponding observed river discharge time series dobs(t) (gray) and its prediction dpre(t) (solid) (without retraining). MatLab script eda11_15.

While not pursued here, the neural net could in principle be retrained on a daily basis, as more data arrives, using a rolling window of several years of data (see Problem 11.7). Such a strategy—a form of machine learning—would allow the network to adapt to unquantified factors that affect river dynamics, such as land-use changes. In the case we have been considering, the problem is small enough that full iterative least squares could be used for each daily update. In larger problems or ones in which very frequent updates are needed, the stochastic gradient method (Section 11.10) might prove preferable.

Problems

11.1. Derive the small number approximation for exp(x) near the point x = 0. Plot the function and its approximation. What is the range of x for which the error is less than 5%?

11.2. Suppose that uncorrelated normally distributed random variables x and y have means x̄ and ȳ, respectively, and variances σx2 and σy2, respectively, with σx ≪ x̄ and σy ≪ ȳ. Use small number approximations to estimate the covariance matrix Cuv of u = xy and v = x/y.

11.3. Modify eda11_05 to include both annual and diurnal frequencies of oscillation. By how much is the error reduced?

11.4. Suppose that the time needed to evaluate a function of one variable is Tf, the time needed to look up its value in a table is Tt = Tf/10, and the number of table entries is K. How many times must the function be evaluated in a grid search in order to achieve an 80% reduction in computation time? Be sure to include the time needed to create the table in your estimate.

11.5. Modify eda11_13.m to fit the function:

dobs(x, y) = sin(x) cos(y) with 0 ≤ x ≤ π and 0 ≤ y ≤ π

with a neural net containing two towers.

11.6. (A) Modify the neural net in eda11_11.m to model the parabola with just two towers of slope s = 12.5. (B) Add code, modeled on the iterative least squares procedure in eda11_13.m, to train the network. How good is the fit?

11.7. (This problem is especially difficult.) The test dataset in the nonlinear response example (Section 11.15) was generated by the script nonlinearresponse.m. This script solves Equation (11.43) with c1c2 = cf = 0.2. Modify this script to generate a test dataset of length 10,000 in which the coefficient cf systematically increases from 0.2 at the start of the dataset to 0.5 at its end, emulating land-use changes in the watershed. Then modify the network training script eda11_15. Retain the section that uses iterative least squares to train on the first 500 points, but then predict the data in two different ways: by applying the neural net, as is, to the entire dataset; and by retraining the network each time step using a moving window of the previous 500 points. Use the same iterative least squares method, but with only one iteration, to retrain. Compare the results of the two methods. Can the retrained network keep up with changing river conditions?

References

Menke W. Review of the generalized least squares method. Surv. Geophys. 2014;36:1–25.

Tarantola A., Valette B. Generalized non-linear inverse problems solved using the least squares criterion. Rev. Geophys. Space Phys. 1982;20:219–232.

Werbos P.J. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Ph.D. Thesis, Harvard University; 1974.
