Data analysis with MatLab

Abstract

Chapter 1, Data Analysis with MatLab, is a brief introduction to MatLab as a data analysis environment and scripting language. It is meant to teach the reader barely enough to understand the MatLab scripts in the book and to begin to start using and modifying them. While MatLab is a fully featured programming language, Environmental Data Analysis with MatLab is not a book on computer programming. It teaches scripting mainly by example and avoids long discussions on programming theory.

Keywords

MatLab; script; m-file; variable; matrix; command; function; plot; graph

1.1 Why MatLab?

Data analysis requires computer-based computation. While a person can learn much of the theory of data analysis by working through short pencil-and-paper examples, he or she cannot become proficient in the practice of data analysis that way—for reasons both good and bad. Real datasets, which are almost always too large to handle manually, are inherently richer and more interesting than stripped-down examples. They have more to offer, but an expanded skill set is required to successfully tackle them. In particular, a new kind of judgment is required for selecting the analysis technique that is right for the problem at hand. These are good reasons. Unfortunately, the practice of data analysis is littered with bad reasons, too, most of which are related to the very steep learning curve associated with using computers. Many practitioners of data analysis find that they spend rather too many frustrating hours solving computer-related problems that have very little to do with data analysis, per se. That's bad, especially in a classroom setting where time is limited and where frustration gets in the way of learning.

One approach to dealing with this problem is to conduct all the data analysis within a single software environment—to limit the damage. Frustrating software problems will still arise, but fewer than if data were being shuffled between several different environments. Furthermore, in a group setting such as a classroom, the memory and experience of the group can help individuals solve commonly encountered problems. The trick is to select a single software environment that is capable of supporting real data analysis.

The key decision is whether to go with a spreadsheet or a scripting language-type software environment. Both are viable environments for computer-based data analysis. Stable implementations of both are available for most types of computers from commercial software developers at relatively modest prices (and especially for those eligible for student discounts). Both provide support for the data analysis itself, as well as associated tasks such as loading and writing data to and from files and plotting them on graphs. Spreadsheets and scripting languages are radically different in approach, and each has advantages and disadvantages.

In a spreadsheet-type environment, typified by Microsoft Excel, data are presented as one or more tables. Data are manipulated by selecting the rows and columns of a table and operating on them with functions selected from a menu and with formulas entered into the cells of the table itself. The immediacy of a spreadsheet is both its greatest advantage and its weakness. You see the data and all the intermediate results as you manipulate the table. You are, in a sense, touching the data, which gives you a great sense of what the data are like. More of a problem, however, is keeping track of what you did in a spreadsheet-type environment, as is transferring useful procedures from one spreadsheet-based dataset to another.

In a scripting language, typified by The MathWorks MatLab, data are presented as one or more named variables (in the same sense that the “c” and “d” in the formula c = πd are named variables). Data are manipulated by typing formulas that create new variables from old ones and by running scripts, that is, sequences of formulas stored in a file. Much of data analysis is simply the application of well-known formulas to novel data, so the great advantage of this approach is that the formulas that you type usually have a strong similarity to those printed in a textbook. Furthermore, scripts provide a way of both documenting the sequence of formulas used to analyze a particular dataset and transferring the overall data analysis procedure from one dataset to another. The main disadvantage of a scripting language environment is that it hides the data within the variable—not absolutely, but a conscious effort is nonetheless needed to display it as a table or as a graph. Things can go badly wrong in a script-based data analysis scheme without the practitioner being aware of it. Another disadvantage is that the parallel between the syntax of the scripting language and the syntax of standard mathematical notation is nowhere near perfect. One needs to learn to translate one into the other.

While both spreadsheets and scripting languages have pros and cons, our opinion is that, on balance, a scripting language wins out, at least for the data analysis scenarios encountered in Environmental Science. In our experience, these scenarios often require a long sequence of data manipulation steps before a final result is achieved. Here, the self-documenting aspect of the script is paramount. It allows the practitioner to review the data processing procedure both as it is being developed and years after it has been completed. It provides a way of communicating what you did, a process that is at the heart of science.

We have chosen MatLab, a commercial software product of The MathWorks, Inc., as our preferred software environment for several reasons, some having to do with its design and others more practical. The most persuasive design reason is that its syntax fully supports both linear algebra and complex arithmetic, both of which are important in data analysis. Practical considerations include the following: it is a long-lived and stable product, available since the mid-1980s; implementations are available for most commonly used types of computers; its price, especially for students, is fairly modest; and it is widely used, at least in university settings.

1.2 Getting started with MatLab

We cannot walk you through the installation of MatLab, for procedures vary from computer to computer and quickly become outdated, anyway. Furthermore, we will avoid discussion of the appearance of MatLab on your computer screen, because its Graphical User Interface has evolved significantly over the years and can be expected to continue to do so. We will assume that you have successfully installed MatLab and that you can identify the Command Window, the place where MatLab formulas and commands are typed.

You might try typing

date

in this window. If MatLab responds by displaying the current date, you're on track!

All the MatLab commands that we use are in MatLab scripts that are provided as a companion to this book. This one is named eda01_01 and is in a MatLab script file (m-file, for short) named eda01_01.m (conventionally, m-files have file names that end with “.m”). In this case, the script is pretty boring, as it contains just this one command, date, together with a couple of comment lines (which start with the character “%”):

% eda01_01
% displays the current date
date

(MatLab eda01_01)

After you install MatLab, you should copy the eda folder, provided with this book, to your computer's file system. Put it in some convenient and easy-to-remember place that you are not going to accidentally delete!

1.3 Getting organized

Files proliferate at an astonishing rate, even in the most trivial data management project. You start with a file of data, but then write m-scripts, each of which has its own file. You will usually output final results of the data analysis to a file, and you may well output intermediate results to files, too. You will probably have files containing graphs and files containing notes as well. Furthermore, you might decide to analyze your data in several different ways, so you may have several versions of some of these files.

A practitioner of data analysis may find that a little organization up front saves quite a bit of confusion down the line.

As data analysis scenarios vary widely, there can be no set rule regarding organization of the files associated with them. The goal should simply be to create a system of folders (directories), subfolders (sub-directories), and file names that are sufficiently systematic so that files can be located easily and they are not confused with one another. Predictability in both the pattern of filenames and in the arrangement of folders and subfolders is an extremely important part of the design.

By way of example, the files associated with this book are in a three-tiered folder/subfolder structure modeled on the chapter and section format of the book itself (Figure 1.1). Most of the files, such as the m-files, are in the chapter folders. However, some chapters have longish case studies that use a large number of files, and in those instances, section folders are used. Folder and file names are systematic. The chapter folder names are always of the form chNN, where NN is the chapter number. The section folder names are always of the form secNN_MM, where NN is the chapter number and MM is the section number. We have chosen to use leading zeros in the naming scheme (for example, ch01) so that filenames appear in the correct order when they are sorted alphabetically (as when listing the contents of a folder).

Figure 1.1 Folder (directory) structure used for the files accompanying this book.
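The naming scheme described above can be generated programmatically. The following is a minimal sketch, assuming the built-in sprintf() function with a zero-padded format; the variable names (chapter, section, chname, secname) are illustrative and do not come from the book's scripts. It also shows why the leading zeros matter when names are sorted alphabetically.

```matlab
% Illustrative sketch: build zero-padded folder names of the
% form chNN and secNN_MM (variable names are hypothetical).
chapter = 1;
section = 4;
chname = sprintf('ch%02d', chapter);                  % 'ch01'
secname = sprintf('sec%02d_%02d', chapter, section);  % 'sec01_04'

% With padding, alphabetical order matches numerical order.
padded = sort({sprintf('ch%02d', 10), sprintf('ch%02d', 2)});
% Without padding, ch10 sorts ahead of ch2.
unpadded = sort({'ch10', 'ch2'});
```

Here padded is {'ch02', 'ch10'}, whereas unpadded is {'ch10', 'ch2'}, which is the misordering that the leading zeros avoid.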

1.4 Navigating folders

The MatLab command window supports a number of commands that enable you to navigate from folder to folder, list the contents of folders, and so on. For example, when you type

pwd

(for “print working directory”) in the Command Window, MatLab responds by displaying the name of the current folder. Initially, this is almost invariably the wrong folder, so you will need to cd (for “change directory”) to the folder where you want to be—the ch01 folder in this case. The pathname will, of course, depend on where you copied the eda folder, but will end in eda/ch01. On our computer, typing

cd c:/menke/docs/eda/ch01

does the trick. If you have spaces in your pathname, just surround it with single quotes:

cd 'c:/menke/my docs/eda/ch01'

You can check if you are in the right folder by typing pwd again. Once in the ch01 folder, typing

eda01_01

will run the eda01_01 m-script, which displays the current date. You can move to the folder above the current one by typing

cd ..

and to one below it by giving just the folder name. For example, if you are in the eda folder you can move to the ch01 folder by typing

cd ch01

Finally, the command dir (for “directory”), lists the files and subfolders in the current directory.

dir

(MatLab eda01_02)
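The navigation commands can also be used inside a script. The sketch below is an illustration, not one of the book's scripts: it uses the function form of cd, which takes the pathname as a character string (so pathnames with spaces need no special treatment), and the built-in tempdir as a stand-in destination folder. The variable names are hypothetical.

```matlab
% Illustrative sketch: remember the starting folder, move
% somewhere else, and then return.
startdir = pwd;     % name of the current folder, as a string
cd(tempdir);        % function form of cd; tempdir is a built-in scratch folder
there = pwd;        % name of the folder we moved to
cd(startdir);       % return to the starting folder
back = pwd;         % same as startdir
```

Saving the starting folder in a variable is a simple way for a script to leave the working directory as it found it.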

1.5 Simple arithmetic and algebra

The MatLab commands for simple arithmetic and algebra closely parallel standard mathematical notation. For instance, the command sequence

a = 3.5;
b = 4.1;
c = a+b;
c

(MatLab eda01_03)

evaluates the formula c = a + b for the case a = 3.5 and b = 4.1 to obtain c = 7.6. Only the semicolons require explanation. By default, MatLab displays the value of every formula typed into the Command Window. A semicolon at the end of the formula suppresses the display. Hence, the first three lines, which end with semicolons, are evaluated but not displayed. Only the final line, which lacks the semicolon, causes MatLab to print the final result, c.

A perceptive reader might have noticed that the m-script could have been made shorter by one line, simply by omitting the semicolon in the formula, c = a + b. That is,

a = 3.5;
b = 4.1;
c = a+b

However, we recommend against such cleverness. The reason is that many intermediate results will need to be temporarily displayed and then un-displayed in the process of developing and debugging a long m-script. When this is accomplished by adding and then deleting the semicolon at the end of a functioning—and important—formula in the script, the formula can be inadvertently damaged by deleting one or more extra characters. Editing a line of the code that has no function other than displaying a value is safer.

Note that MatLab variables are static, meaning that they persist in MatLab's Workspace until you explicitly delete them or exit the program. Variables created by one script can be used by subsequent scripts. At any time, the value of a variable can be examined, either by displaying it in the Command Window (as we have done above) or by using the spreadsheet-like display tools available through MatLab's Workspace Window. The persistence of MatLab variables can sometimes lead to scripting errors, as described in Note 1.1.
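The persistence of a variable can be checked directly. This is a minimal sketch, assuming the built-in exist() and clear commands; the names existsBefore and existsAfter are illustrative.

```matlab
% Illustrative sketch: a variable persists until it is cleared.
a = 3.5;
existsBefore = exist('a', 'var');   % 1 while a is in the Workspace
clear a;                            % explicitly delete a
existsAfter = exist('a', 'var');    % 0 once a has been deleted
```

Clearing a variable before reusing its name is one way to guard against a script silently picking up a stale value left behind by an earlier script.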

The four commands discussed above can be run as a unit by typing eda01_03. Now open the m-file eda01_03 in MatLab, using the File/Open menu. MatLab will bring up a text-editor type window. First save it as a new file, say myeda01_03, edit it in some simple way, say by changing the 3.5 to 4.5, save the edited file, and run it by typing myeda01_03 in the Command Window. The value of c that is displayed will have changed appropriately.

A somewhat more complicated MatLab formula is

$c = \sqrt{a^{2} + b^{2}}$ with $a = 3$ and $b = 4$

a = 3;
b = 4;
c = sqrt(a^2 + b^2);
c

(MatLab eda01_04)

Note that the MatLab syntax for a² is a^2 and that the square root is computed using the function, sqrt(). This is an example of MatLab's syntax differing from standard mathematical notation.

A final example is

$c = \sin \frac{n π (x - x_{0})}{L}$ with $n = 2$, $x = 3$, $x_{0} = 1$, $L = 5$

n = 2; x = 3; x0 = 1; L = 5;
c = sin(n*pi*(x-x0)/L);
c

(MatLab eda01_05)

Note that several formulas separated by semicolons can be typed on the same line. Variables, such as x0 and pi above, can have names consisting of more than one character, and can contain numerals as well as letters (although they must start with a letter). MatLab has a variety of predefined mathematical constants, including pi, which is the usual mathematical constant, π.

1.6 Vectors and matrices

Vectors and matrices are fundamental to data analysis both because they provide a convenient way to organize data and because many important operations on data can be very succinctly expressed using linear algebra (that is, the algebra of vectors and matrices).

Vectors and matrices are very easy to define in MatLab. For instance, the quantities

$r = [\begin{matrix} 2 & 4 & 6 \end{matrix}]$ and $c = [\begin{matrix} 1 \\ 3 \\ 5 \end{matrix}] = {[\begin{matrix} 1 & 3 & 5 \end{matrix}]}^{T}$ and $M = [\begin{matrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{matrix}]$

are defined with the following commands:

r = [2, 4, 6];
c = [1, 3, 5]';
M =[ [1, 4, 7]', [2, 5, 8]', [3, 6, 9]'];

(MatLab eda01_06)

Note that the column-vector, c, is created by first defining a row vector, [1, 3, 5], and then converting it to a column vector by taking its transpose, which in MatLab is indicated by a single quote. Note, also, that the matrix, M, is being constructed from a “row vector of column vectors”.

Although MatLab allows both column-vectors and row-vectors to be defined with ease, our experience is that using both creates serious opportunities for error. A formula that requires a column-vector will usually yield incorrect results if a row-vector is substituted into it, and vice-versa. Consequently, we adhere to a protocol where all vectors defined in this book are column vectors. Row vectors are created when needed—and as close as possible to where they are used in the script—by transposing the equivalent column vector. We also adhere to the convention that vectors have lower-case names and matrices have upper-case names (or, at least, names that start with an upper-case letter).
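The orientation of a vector can be checked with the size() function. The sketch below is illustrative only (the variable names are not from the book's scripts): it defines a column vector, makes a row vector from it by transposition, and shows one case where mixing up the two is caught by MatLab as an error. Many other mix-ups are not caught, which is the motivation for the protocol.

```matlab
% Illustrative sketch: define vectors as columns and transpose
% only where a row vector is needed.
c = [1, 3, 5]';      % column vector
r = c';              % row vector made by transposition
sizec = size(c);     % [3, 1]: three rows, one column
sizer = size(r);     % [1, 3]: one row, three columns

% Multiplying two column vectors is not a valid matrix product,
% so MatLab reports an error that we catch here.
try
    bad = c*c;
    failed = false;
catch
    failed = true;
end
```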

1.7 Multiplication of vectors and matrices

MatLab performs all multiplicative operations with ease. For example, suppose column vectors a and b, and matrices M and N are defined as

$a = [\begin{matrix} 1 \\ 3 \\ 5 \end{matrix}]$ and $b = [\begin{matrix} 2 \\ 4 \\ 6 \end{matrix}]$ and $M = [\begin{matrix} 1 & 0 & 2 \\ 0 & 1 & 0 \\ 2 & 0 & 1 \end{matrix}]$ and $N = [\begin{matrix} 1 & 0 & - 1 \\ 0 & 2 & 0 \\ - 1 & 0 & 3 \end{matrix}]$

Then,

$s = a^{T} b = {[\begin{matrix} 1 \\ 3 \\ 5 \end{matrix}]}^{T} [\begin{matrix} 2 \\ 4 \\ 6 \end{matrix}] = [\begin{matrix} 1 & 3 & 5 \end{matrix}] [\begin{matrix} 2 \\ 4 \\ 6 \end{matrix}] = 1 \times 2 + 3 \times 4 + 5 \times 6 = 44$

$T = a b^{T} = [\begin{matrix} 1 \\ 3 \\ 5 \end{matrix}] {[\begin{matrix} 2 \\ 4 \\ 6 \end{matrix}]}^{T} = [\begin{matrix} 2 \times 1 & 4 \times 1 & 6 \times 1 \\ 2 \times 3 & 4 \times 3 & 6 \times 3 \\ 2 \times 5 & 4 \times 5 & 6 \times 5 \end{matrix}] = [\begin{matrix} 2 & 4 & 6 \\ 6 & 12 & 18 \\ 10 & 20 & 30 \end{matrix}]$

$c = M a = [\begin{matrix} 1 & 0 & 2 \\ 0 & 1 & 0 \\ 2 & 0 & 1 \end{matrix}] [\begin{matrix} 1 \\ 3 \\ 5 \end{matrix}] = [\begin{matrix} 1 \times 1 + 0 \times 3 + 2 \times 5 \\ 0 \times 1 + 1 \times 3 + 0 \times 5 \\ 2 \times 1 + 0 \times 3 + 1 \times 5 \end{matrix}] = [\begin{matrix} 11 \\ 3 \\ 7 \end{matrix}]$

$P = M N = [\begin{matrix} 1 & 0 & 2 \\ 0 & 1 & 0 \\ 2 & 0 & 1 \end{matrix}] [\begin{matrix} 1 & 0 & - 1 \\ 0 & 2 & 0 \\ - 1 & 0 & 3 \end{matrix}] = [\begin{matrix} - 1 & 0 & 5 \\ 0 & 2 & 0 \\ 1 & 0 & 1 \end{matrix}]$

corresponds to

s = a'*b;
T = a*b';
c = M*a;
P = M*N;

(MatLab eda01_07)

In MatLab, standard vector and matrix multiplication is performed just by using the normal multiplication sign, * (the asterisk). There are cases, however, where one needs to violate these rules and multiply the quantities element-wise (for example, create a vector, d, with elements d_i = a_i b_i). MatLab provides a special element-wise version of the multiplication sign, denoted .* (a period followed by an asterisk):

d = a.*b;

(MatLab eda01_07)
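The products in this section can be checked numerically by defining the vectors and matrices first. The sketch below is a self-contained restatement of the commands above, using the values given in the text; it is offered as a way to experiment and is not a copy of the script eda01_07.

```matlab
% Define the column vectors and matrices used in this section.
a = [1, 3, 5]';
b = [2, 4, 6]';
M = [ [1, 0, 2]', [0, 1, 0]', [2, 0, 1]' ];
N = [ [1, 0, -1]', [0, 2, 0]', [-1, 0, 3]' ];

s = a'*b;     % inner product: a scalar
T = a*b';     % outer product: a 3 x 3 matrix
c = M*a;      % matrix times column vector: a column vector
P = M*N;      % matrix times matrix: a 3 x 3 matrix
d = a.*b;     % element-wise product: a column vector
```

Note that the sum of the elements of d equals s, since the inner product is the sum of the element-wise products.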

1.8 Element access

Individual elements of vectors and matrices can be accessed by specifying the relevant row and column indices in parentheses; for example, a(2) is the second element of the column vector a and M(2,3) is the second row, third column element of the matrix, M. Ranges of rows and columns can be specified using the : operator; for example, M(:,2) is the second column of matrix, M, M(2,:) is the second row of matrix, M, and M(2:3,2:3) is the 2 × 2 submatrix in the lower right-hand corner of the 3 × 3 matrix, M (the expression, M(2:end,2:end), would work as well). These operations are further illustrated below:

$a = [\begin{matrix} 1 \\ 2 \\ 3 \end{matrix}]$ and $M = [\begin{matrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{matrix}]$

$s = a_{2} = 2$ and $t = M_{23} = 6$ and $b = [\begin{matrix} M_{12} \\ M_{22} \\ M_{32} \end{matrix}] = [\begin{matrix} 2 \\ 5 \\ 8 \end{matrix}]$

$c = {[\begin{matrix} M_{21} & M_{22} & M_{23} \end{matrix}]}^{T} = [\begin{matrix} 4 \\ 5 \\ 6 \end{matrix}]$ and $T = [\begin{matrix} M_{22} & M_{23} \\ M_{32} & M_{33} \end{matrix}] = [\begin{matrix} 5 & 6 \\ 8 & 9 \end{matrix}]$

correspond to:

s = a(2);
t = M(2,3);
b = M(:,2);
c = M(2,:)';
T = M(2:3,2:3);

(MatLab eda01_08)

The colon notation can be used in other contexts as well. For instance, [1:4] is the row vector [1, 2, 3, 4]. The syntax, 1:4, which omits the square brackets, works fine in MatLab. However, we usually use square brackets, as they draw attention to the presence of a vector. Finally, we note that two colons can be used in sequence to indicate the spacing of elements in the resulting vector. For example, the expression [1:2:9] is the row vector [1, 3, 5, 7, 9] and the expression [10:-1:1] is a row vector whose elements are in the reverse order from [1:10].
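The colon expressions described above can be tried out as follows. This is a short illustrative sketch (the variable names v1 through v4 are hypothetical). One point worth noting is that a descending sequence needs an explicit negative step; without one, the result is an empty vector.

```matlab
% Illustrative sketch of the colon notation.
v1 = [1:4];        % [1, 2, 3, 4]
v2 = [1:2:9];      % [1, 3, 5, 7, 9]
v3 = [10:-1:1];    % counts down from 10 to 1
v4 = [10:1];       % empty: the default step of +1 cannot count down

% The keyword end stands for the last index along a dimension.
M = [ [1, 4, 7]', [2, 5, 8]', [3, 6, 9]' ];
T = M(2:end, 2:end);   % lower right-hand 2 x 2 submatrix
```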

1.9 Representing functions

A vector can be used to represent an arbitrary function x(t). First one prepares a time vector t, which consists of N values of time, running from some minimum value t_min to some maximum value t_max, with even spacing $Δ t = (t_{\max} - t_{\min}) / (N - 1)$. Next, one prepares a vector x that gives the value of the function at these times (Figure 1.2). The MatLab code below is for the exemplary function $x (t) = \sin (π t)$:

Figure 1.2 Time series representation of a function x(t). The time axis is divided into N intervals of length $Δ t$ between t_min and t_max. The time vector t has elements $t_{i} = (i - 1) Δ t$ and the corresponding time series vector x corresponds to the values of the function at these times. It has elements $x_{i} = x (t_{i})$. MatLab script eda01_09.

% independent variable t
N = 21;
tmin = 0;
tmax = 1;
Dt = (tmax-tmin)/(N-1);
t = tmin + Dt*[0:N-1]';
% exemplary function
x = sin(pi*t);

(MatLab eda01_09)

Many operations on functions are especially easy to perform when they are represented in this fashion, which is called a time series. For instance, the slope (first derivative) $s (t) = d x / d t$ can be approximated by the slope of the line segments connecting adjacent points in the time series (Figure 1.3 A):

$s (t_{0}) = {\frac{d x}{d t}|}_{t = t_{0}} \approx \frac{x (t_{0} + Δ t) - x (t_{0})}{Δ t}$

Figure 1.3 Approximations for the derivative and integral of a function x(t). (A) The smooth function x(t) (gray curve) is represented as a time series; that is, by its values (circles) at a sequence of equally-spaced values of t (vertical bars). The derivative s(t₀) is approximated as the slope of a line segment (bold) connecting two adjacent values, the leftmost of which is at t₀. The integral a(t₀) is approximated as the sum of the areas $Δ a$ of all the rectangles subtending the curve up to the position t₀ (shaded). (B) The resulting approximation for the derivative (black curve) closely approximates s(t) (grey curve). (C) The resulting approximation for the integral (black curve) closely approximates a(t) (grey curve). The value of the integral $a (t_{0} = 1)$ at the right hand end of the interval is shown (circle). MatLab script eda01_09.

A time series s of slopes (Figure 1.3 B) is calculated as:

sapprox = (x(2:N)-x(1:N-1))/Dt;

(MatLab eda01_09)

Note that the vector s is of length N − 1, since no value of the function is available for time $t_{\max} + Δ t$.

The area (integral) $a (t_{0}) = \int_{0}^{t_{0}} x (t) d t$ can be computed using the Riemann Summation approximation of the integral:

$a (t_{0}) = \int_{0}^{t_{0}} x (t) d t \approx Δ t \sum_{i = 1}^{K} x (t_{i})$ with $K = t_{0} / Δ t$

This approximation corresponds to adding up the areas of rectangles of width $Δ t$ that extend below the function, from the leftmost rectangle to the one at position t₀ (Figure 1.3 A). A time series a of areas (Figure 1.3 C) is calculated as:

aapprox = Dt*cumsum(x);

(MatLab eda01_09)

The MatLab function cumsum(x) returns the cumulative sum (running sum) of the elements of the vector x. The last element of a corresponds to the total area under the curve. It can be calculated more simply as (Figure 1.3 C):

atotal = Dt*sum(x);

(MatLab eda01_09)

The MatLab function sum(x) returns the sum of elements of the vector x.
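For the exemplary function, the quality of these approximations can be judged against the exact results, which are dx/dt = π cos(πt) for the derivative and 2/π for the integral from 0 to 1. The sketch below repeats the commands above and adds the comparison. Evaluating the exact derivative at the midpoints between samples is a choice made here for illustration, as are the names tmid, strue, serr, and aerr; none of these come from the script eda01_09.

```matlab
% Time series for x(t) = sin(pi*t), as defined above.
N = 21;
tmin = 0;
tmax = 1;
Dt = (tmax-tmin)/(N-1);
t = tmin + Dt*[0:N-1]';
x = sin(pi*t);

% Approximate derivative, running integral, and total area.
sapprox = (x(2:N)-x(1:N-1))/Dt;
aapprox = Dt*cumsum(x);
atotal = Dt*sum(x);

% Exact derivative at the midpoints between adjacent samples
% (an assumption made here for the purpose of comparison).
tmid = t(1:N-1) + Dt/2;
strue = pi*cos(pi*tmid);
serr = max(abs(sapprox - strue));   % largest error in the slope

% Exact total area under sin(pi*t) between 0 and 1 is 2/pi.
aerr = abs(atotal - 2/pi);          % error in the total area
```

Both errors are small compared with the size of the quantities being approximated, and they shrink further as N is increased.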

1.10 To loop or not to loop

MatLab provides a looping mechanism, the for command, which can be useful when the need arises to sequentially access the elements of vectors and matrices. Thus, for example,

M = [ [1, 4, 7]', [2, 5, 8]', [3, 6, 9]' ];
for i = [1:3]
 a(i) = M(i,i);
end

(MatLab eda01_10)

executes the a(i) = M(i,i) formula three times, each time with a different value of i (in this case, i = 1, i = 2, and i = 3). The net effect is to copy the diagonal elements of the matrix M to the vector, a, that is, a_i = M_ii. Note that the end statement indicates the position of the bottom of the loop. Subsequent commands are not part of the loop and are executed only once.

Loops can be nested; that is, one loop can be inside another. Such an arrangement is necessary for accessing all the elements of a matrix in sequence. For example,

M = [ [1, 4, 7]', [2, 5, 8]', [3, 6, 9]'];
for i = [1:3]
for j = [1:3]
 N(i,4-j) = M(i,j);
end
end

(MatLab eda01_11)

copies the elements of the matrix, M, to the matrix, N, but reverses the order of the elements in each row; that is, N_i,j = M_i,4−j. Loops are especially useful in conjunction with conditional commands. For example,

a = [ 1, 2, 1, 4, 3, 2, 6, 4, 9, 2, 1, 4 ]';
for i = [1:12]
 if ( a(i) >= 6 )
 b(i) = 6;
 else
 b(i) = a(i);
 end
end

(MatLab eda01_12)

sets b_i = a_i if a_i < 6 and sets b_i = 6 otherwise (a process called clipping a vector, for it lops off parts of the vector that are larger than 6).
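One detail of the loop is worth attention. When a vector that has not been defined beforehand is filled in element by element, MatLab creates it as a row vector, even when the input is a column vector. The sketch below is a variation on the clipping loop, not the script eda01_12 itself: it defines b ahead of the loop with the built-in zeros() function so that it is a column vector of the same length as a.

```matlab
% Illustrative variation on the clipping loop, with b
% preallocated as a column vector.
a = [ 1, 2, 1, 4, 3, 2, 6, 4, 9, 2, 1, 4 ]';
N = length(a);
b = zeros(N,1);          % N x 1 column vector of zeros
for i = [1:N]
    if ( a(i) >= 6 )
        b(i) = 6;        % clip values of 6 or more
    else
        b(i) = a(i);     % copy smaller values unchanged
    end
end
```

Defining b first also protects against stale values left in b by an earlier script.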

A purist might point out that MatLab syntax is so flexible that for loops are almost never really necessary. In fact, all three examples, above, can be computed with one-line formulas that omit for loops:

a = diag(M);
N = fliplr(M);
b = a.*(a<6)+6.*(a>=6);

(MatLab eda01_13)

The first two formulas are quite simple, but rely on the MatLab functions diag() and fliplr(), whose existence we have not heretofore mentioned. One of the problems of a script-based environment is that learning the complete syntax of the scripting language can be pretty daunting. Writing a long script, such as one containing a for loop, will often be faster than searching through MatLab help files for a predefined function that implements the desired functionality in a single line of the script. The third formula points out a different problem: MatLab syntax is often pretty inscrutable. In this case, the expression (a<6) creates a column-vector of ones and zeros, depending on whether a given element of a is less-than or greater-than-or-equal-to 6. Element-wise multiplication is then used to create a vector a.*(a<6) whose elements are either a_i or 0. Similarly, 6.*(a>=6) is a vector whose elements are either 0 or 6. Their sum is a vector whose elements are either a_i or 6, depending on whether a_i is less-than or greater-than-or-equal-to 6. That's pretty complicated!

Because MatLab's syntax is so powerful, the same functionality can often be achieved in several different ways. Thus, for example, the commands

b = a;
b(find(a>6)) = 6;

(MatLab eda01_13)

will also clip the elements of the vector. The find() function returns a column-vector of the indices of the vector, a, that match the condition, and then that list is used to reset just those elements of b to 6, leaving the other elements unchanged.

When deciding between alternative ways of implementing a given functionality, you should always choose the one which you find clearest. Scripts that are terse or even computationally efficient are not necessarily a virtue, especially if they are difficult to debug. You should avoid creating formulas that are so inscrutable that you are not sure whether they will function correctly. Of course, the degree of inscrutability of any given formula will depend on your level of familiarity with MatLab. Your repertoire of techniques will grow as you become more practiced.
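When several formulations are available, a quick way to gain confidence in them is to check that they give the same answer on a small test vector. The sketch below compares the two clipping formulas from this section with two further alternatives, logical indexing and the built-in min() function. These last two are additions made here for illustration and are not part of the book's scripts.

```matlab
% Illustrative comparison of four ways to clip a vector at 6.
a = [ 1, 2, 1, 4, 3, 2, 6, 4, 9, 2, 1, 4 ]';

b1 = a.*(a<6) + 6.*(a>=6);   % arithmetic with ones and zeros

b2 = a;
b2(find(a>6)) = 6;           % list of indices from find()

b3 = a;
b3(a>6) = 6;                 % logical indexing, without find()

b4 = min(a, 6);              % element-wise minimum of a and 6

% All four results should be identical.
same = isequal(b1, b2) && isequal(b2, b3) && isequal(b3, b4);
```

Which of these is clearest is a matter of taste and experience; the point of the check is only that the choice does not change the result.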

1.11 The matrix inverse

Recall that the matrix inverse is defined only for square matrices, and that it has the following properties:

$A^{- 1} A = A A^{- 1} = I$

(1.1)

Here, I is the identity matrix, that is, a matrix with ones on its main diagonal and zeros elsewhere. In MatLab, the matrix inverse is computed as

B = inv(A);

(MatLab eda01_14)

In many of the formulas of data analysis, the matrix inverse either premultiplies or postmultiplies other quantities; for instance,

$c = A^{- 1} b$ and $D = B A^{- 1}$

These cases do not actually require the explicit calculation of A⁻¹; the combinations A⁻¹b and BA⁻¹, which are computationally simpler, are sufficient. MatLab provides generalizations of the division operator that implement these two cases:

c = A\b;
D = B/A;

(MatLab eda01_15)
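The agreement between the explicit inverse and the two division operators, backslash (left division) and slash (right division), can be checked on a small example. In the sketch below the matrices A and B and the vector b are arbitrary choices made for illustration, and the names c1, c2, D1, D2, errc, errD, and resid are hypothetical.

```matlab
% Illustrative check: division operators versus explicit inverse.
A = [ [1, 0, 2]', [0, 1, 0]', [2, 0, 1]' ];
B = [ [1, 0, -1]', [0, 2, 0]', [-1, 0, 3]' ];
b = [1, 3, 5]';

c1 = inv(A)*b;   % premultiplication by the explicit inverse
c2 = A\b;        % left division, without forming inv(A)

D1 = B*inv(A);   % postmultiplication by the explicit inverse
D2 = B/A;        % right division, without forming inv(A)

errc = max(abs(c1 - c2));          % difference between the two c's
errD = max(abs(D1(:) - D2(:)));    % difference between the two D's
resid = max(abs(A*c2 - b));        % c2 should satisfy A*c = b
```

The differences are at the level of rounding error, so the two approaches are interchangeable in this small case.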

1.12 Loading data from a file

MatLab can read and write files with a variety of formats, but we start here with the simplest and most common one, the text file.

As an example, we load a hydrological dataset of stream flow from the Neuse River near Goldsboro NC. Our recommendation is that you always keep a file of notes about any dataset that you work with, and that these notes include information on where you obtained the dataset and any modifications that you subsequently made to it. Bill Menke provides the following notes for this one (Figure 1.4):

Figure 1.4 Preliminary plot of the Neuse River discharge dataset. MatLab script eda01_18.

I downloaded stream flow data from the US Geological Survey's National Water Information Center for the Neuse River near Goldsboro NC for the time period, 01/01/1974-12/31/1985. These data are in the file, neuse.txt. It contains two columns of data, time (in days starting on January 1, 1974) and discharge (in cubic feet per second, cfs). The data set contains 4383 rows of data. I also saved information about the data in the file neuse_header.txt.

We reproduce the first few lines of neuse.txt, here:

1	1450
2	2490
3	3250
….	….

The data is read into MatLab as follows:

D = load('neuse.txt');
t = D(:,1);
d = D(:,2);

(MatLab eda01_16)

The load() function reads the data into a 4383 × 2 array, D. Note that the filename, neuse.txt, needs to be surrounded by single quotes to indicate that it is a character string and not a variable name. The subsequent two lines break out D into two separate column-vectors, t, of time and d, of discharge. Strictly speaking, this step is not necessary, but our opinion is that fewer mistakes will be made if each of the different variables in the dataset has its own name.
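A useful habit after loading a file is to confirm that the array has the expected shape. The sketch below does this on a tiny stand-in file containing only the three rows quoted above; the filename neuse_sample.txt is hypothetical, and with the real neuse.txt one would expect 4383 rows rather than 3.

```matlab
% Write a tiny tab-delimited stand-in file holding the three rows quoted
% above. The filename neuse_sample.txt is hypothetical.
fid = fopen('neuse_sample.txt', 'w');
fprintf(fid, '%d\t%d\n', [1, 2, 3; 1450, 2490, 3250]);
fclose(fid);

% Load it exactly as in the text and check its dimensions
D = load('neuse_sample.txt');
[N, M] = size(D);          % N rows, M columns
t = D(:,1);                % time, in days
d = D(:,2);                % discharge, in cfs
fprintf('%d rows, %d columns\n', N, M);
```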

1.13 Plotting data

One of the best things to do after loading a new dataset is to make a quick plot of it, just to get a sense of what it looks like. Such plots are very easily created in MatLab:

plot(t,d);

The resulting plot is quite functional, but lacks some graphical niceties such as labeled axes and a title. These deficiencies are easy to correct:

set(gca,'LineWidth',2);
plot(t,d,'k-','LineWidth',2);
title('Neuse River Hydrograph');
xlabel('time in days');
ylabel('discharge in cfs');

(MatLab eda01_17)

The set command resets the line width of the axes, to make them easier to see. Several new arguments have been added to the plot() function. The 'k-' changes the plot color from its default value (a blue line, at least on our computer) to a black line. The 'LineWidth', 2 makes the line thicker (which is important if you print the plot to paper). A quick review of the plot indicates that the Neuse River discharge has some interesting properties, such as a pattern of highs and lows that repeats every few hundred days. We will discuss it more extensively in Chapter 2.

Data can span a very large range of values, and in some cases the differences among the small values are every bit as important as the differences among the larger values. The simple, linear plot that was described above is ineffective in such a case, because small values are squeezed into one tiny corner of the graph. A logarithmic plot is preferred, because its axes give the same space to each order of magnitude of values. The 10–100 decade, for example, is given the same prominence on the graph as the 100–1,000 decade (Figure 1.5).

Figure 1.5 Exemplary horizontal axis on a log-log plot. Note that each decade is the same length and that the distance between the minor tick marks is variable.
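The equal spacing of decades follows directly from the logarithm. As a minimal sketch, the position of a value x along a logarithmic axis is proportional to log₁₀x, so the decade boundaries 10, 100, and 1,000 fall at equally spaced positions. The variable names here are our own.

```matlab
x = [10, 100, 1000];   % decade boundaries
p = log10(x);          % positions along a logarithmic axis
w = diff(p);           % width of the 10-100 and 100-1,000 decades
disp(p);
disp(w);               % the two widths are equal
```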

As an example, we consider an earthquake dataset in which the strength of each earthquake is characterized by its seismic energy:

I downloaded earthquake data from the US Geological Survey's earthquake database for the time period 01/01/2000-12/31/2010 and for the 5–10 magnitude range. The resulting dataset contains 13,258 earthquakes, each of which is described by 15 parameters, including its seismic magnitude. I then created a separate file containing just the list of the magnitudes, ordered chronologically. I converted these magnitudes to energy using the Gutenberg-Richter Energy-Magnitude Formula (Gutenberg, B. and C.F. Richter, Magnitude and energy of earthquakes, Ann. Geofis., 9, 1-15, 1956).

The earthquakes span a large range of energies, from about $2 \times 10^{5}$ to $6 \times 10^{12}$ joules. The number of earthquakes in a given energy range varies widely, too. Several hundred of the least energetic earthquakes occur per year, in contrast to only a few of the most energetic. The logarithmic plot (Figure 1.6) reveals an interesting pattern: the rate of occurrence r of earthquakes decreases systematically with energy, making approximately a straight line on the logarithmic plot. This pattern implies the power law $r = a E^{-b}$, where a and b are constants, since then $\log_{10} r = \log_{10} a - b \log_{10} E$ is a linear function of $\log_{10} E$.

Figure 1.6 Earthquake rate, in number of events per year, plotted against their energy, in joules. MatLab script eda01_19.

A log-log plot is easy to create in MatLab:

set(gca,'Xscale','log','Yscale','log');
hold on;
plot( E, r, 'ko', 'LineWidth', 2);
xlabel('energy (joules)');
ylabel('earthquakes rate (per year)');
title('Earthquake rate vs energy');

(MatLab eda01_18)

The set command specifies that both the x and y axes are logarithmic. The hold on command then prevents subsequent commands from overriding these settings. The plot command works the same way as described previously, except that we have set the plot symbol to black circles with the 'ko' parameter and the boldness of the lines to 2 points with the 'LineWidth' parameter.
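The straight-line pattern on a log-log plot can also be checked numerically. The sketch below builds synthetic data that follow a power law exactly, then recovers the constants by fitting a straight line to log₁₀r versus log₁₀E with the MatLab function polyfit(). The values of a and b are made up for illustration and are not estimates from the earthquake dataset.

```matlab
% Synthetic (made-up) power-law data; a and b are hypothetical values
a = 1e9;
b = 0.7;
E = logspace(5, 12, 8)';      % energies, in joules
r = a * E.^(-b);              % rates obeying r = a E^(-b)

% Fit a straight line to log10(r) versus log10(E)
p = polyfit(log10(E), log10(r), 1);
bEst = -p(1);                 % the slope of the line is -b
aEst = 10^p(2);               % the intercept is log10(a)
fprintf('estimated b = %.3f\n', bEst);
```

With real data the points scatter about the line, so the fitted constants are only estimates.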

1.14 Saving data to a file

Data is saved to a text file in a process that is more or less the reverse of the one used to read it. Suppose, for instance, that we want a version of neuse.txt that contains discharge in the metric units of m³/s. After looking up the conversion factor, f = 35.3146, between cubic feet and cubic meters, we perform the conversion and write the data to a new file:

f = 35.3146;
dm = d/f;
Dm(:,1) = t;
Dm(:,2) = dm;
dlmwrite('neuse_metric.txt',Dm,'\t');

(MatLab eda01_19)

The function dlmwrite() (for “delimited write”) writes the matrix, Dm, to the file neuse_metric.txt, putting a tab character (which in MatLab is represented with the symbol \t) between the columns as a delimiter. Note that the filename and the delimiter are quoted; they are character strings.
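One way to gain confidence in a saved file is to read it back and compare it with the matrix that was written. The sketch below does this for three stand-in rows of data; the filename neuse_metric_check.txt is hypothetical. It uses the name-value form of dlmwrite() with an explicit 'precision' format, on the assumption (which you should confirm with help dlmwrite) that the default output keeps only a limited number of significant digits.

```matlab
% Hypothetical stand-in data: the three rows quoted from neuse.txt
t = [1; 2; 3];               % time, in days
d = [1450; 2490; 3250];      % discharge, in cfs

f = 35.3146;                 % cubic feet per cubic meter
Dm = [t, d/f];               % time and discharge in m^3/s

% Write with an explicit precision; the filename is hypothetical
dlmwrite('neuse_metric_check.txt', Dm, 'delimiter', '\t', 'precision', '%.6f');

% Read the file back and compare it with what was written
Dr = load('neuse_metric_check.txt');
err = max(max(abs(Dr - Dm)));
disp(err);                   % limited by the six decimal places written
```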

1.15 Some advice on writing scripts

Practices that reduce the likelihood of scripting mistakes (“bugs”) are almost always worthwhile, even though they may seem to slow you down a bit. They save time in the long run, as you will spend much less time debugging your scripts.

1.15.1 Think before you type

Think about what you want to do before starting to type in a script. Block out the necessary steps on a piece of scratch paper. Without some forethought, you can type for an hour and then realize that what you have been doing makes no sense at all.

1.15.2 Name variables consistently

MatLab automatically creates a new variable whenever you type a new variable name. That is convenient, but it means that a misspelled variable becomes a new variable. For instance, if you begin calling a quantity xmin but accidentally switch to minx halfway through, you will unknowingly have two variables in your script, and it will not function correctly. Do not tempt fate by creating two variables, such as xmin and miny, with an inconsistent naming pattern.
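The hazard can be seen in a few lines. In this sketch, an update that was meant for xmin is mistakenly assigned to minx; MatLab raises no error, and the later calculation silently uses the stale value. All names and values are hypothetical.

```matlab
xmin = 0;                    % the intended variable
minx = -5;                   % misspelled update: silently creates a new variable
x = [3, 1, 4];

lowest = min([xmin, x]);     % still uses the old value of xmin
disp(lowest);
disp(exist('minx', 'var'));  % nonzero: the stray variable exists
```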

1.15.3 Save old scripts

Cannibalize an old script to make a new one, but keep a copy of the old one too, and make sure that the names are sufficiently different so that you will not confuse them with each other.

1.15.4 Cut and paste sparingly

Cutting and pasting segments of code from one script to another, tempting though it may be, is especially prone to error, particularly when variable names need to be changed. Read through the cut-and-pasted material carefully, to make sure that all necessary changes have been made.

1.15.5 Start small

Build scripts in small sections. Test each section thoroughly before going on to the next. Check intermediate results by displaying variables in the Command Window, examining them with the spreadsheet tool in the Workspace Window, or plotting them, to ensure that they look right.

1.15.6 Test your scripts

Build test datasets with known properties to test whether or not your scripts give the right answers. Test a script on a small, simple dataset before running it on large complicated datasets.
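As a minimal sketch of this idea, the test dataset below is a sinusoid that oscillates about a known mean over a whole number of cycles, so a calculation of the mean has a known right answer. The mean value, amplitude, and period are hypothetical choices.

```matlab
% Test dataset with known properties: a sinusoid about a known mean
N = 1000;
t = (0:N-1)';
dtrue = 100;                        % known mean (hypothetical value)
d = dtrue + 20*sin(2*pi*t/100);     % exactly 10 full cycles

% The calculation under test: the mean of d
dmean = sum(d)/N;

% Compare with the known answer
err = abs(dmean - dtrue);
disp(err);                          % should be near rounding error
```

Because the sinusoid completes whole cycles, its contribution to the mean cancels, and any sizable error points to a mistake in the script rather than in the data.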

1.15.7 Comment your scripts

Use comments to communicate the big picture, not the minutiae. Consider the two scripts in Figure 1.7. Which of the two styles of commenting code do you suppose will make a script easier to understand 2 years down the line?

Figure 1.7 The same script, commented in two different ways.

1.15.8 Don't be too clever

An inscrutable script is very prone to error.

Problems

1.1 Write MatLab scripts to evaluate the following equations:

(A) $y = a x^{2} + b x + c$ with $a = 2$, $b = 4$, $c = 8$, $x = 3.5$

(B) $p = p_{0} \exp(-c x)$ with $p_{0} = 1.6$, $c = 4$, $x = 3.5$

(C) $z = h \sin\theta$ with $h = 4$, $\theta = 31^{\circ}$

(D) $v = \pi h r^{2}$ with $h = 6.9$, $r = 3.7$

1.2 Write a MatLab script that defines a column vector, a, of length N = 12 whose elements are the number of days in the 12 months of the year, for a nonleap year. Create a similar column vector, b, for a leap year. Then merge a and b together into an N × M = 12 × 2 matrix, C.

1.3 Write a MatLab script that solves the following linear equation, y = M x, for x:

$M = \begin{bmatrix} 1 & -1 & 0 & 0 \\ 0 & 1 & -1 & 0 \\ 0 & 0 & 1 & -1 \\ 0 & 0 & 0 & 1 \end{bmatrix}$ and $y = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 5 \end{bmatrix}$

You may find the MatLab function zeros(N, N) useful; it creates an N × N matrix of zeros. Be sure to check that your x solves the original equation.

1.4 Create a 50 × 50 version of M, above. One possibility is to use a for loop. Another is to use the MatLab function toeplitz(), as M has the form of a Toeplitz matrix, that is, a matrix with constant diagonals. Type help toeplitz in the Command Window for details on how this function is called.

1.5 Rivers always flow downstream. Write a MatLab script to check that none of the Neuse River discharge data is negative.
