2

A first look at data

Abstract

Chapter 2, A First Look at Data, leads students through the steps that, in our view, should be taken when first confronted with a new dataset. Time plots, scatter plots, and histograms, as well as simple calculations, are used to examine the data with the aim of both understanding their general character and spotting problems. We take the position that all datasets have problems—errors, data gaps, inconvenient units of measurement, and so forth. Such problems should not scare a person away from data analysis! The chapter champions the use of the reality check—checking that observations of a particular parameter really have the properties that we know it must possess. Several case study datasets are introduced, including a hydrograph from the Neuse River (North Carolina, USA), air temperature from Black Rock Forest (New York), and chemical composition from the floor of the Atlantic Ocean.

Keywords

dataset; histogram; reality check; derivative; scatter plot; discharge; drop-out; outlier

2.1 Look at your data!

When presented with a new dataset, the most important action that a practitioner of data analysis can take is to look closely and critically at it. This examination has three important objectives:

Objective 1: Understanding the general character of the dataset.

Objective 2: Understanding the general behavior of individual parameters.

Objective 3: Detecting obvious problems with the data.

These objectives are best understood through examples, so we look at a sample dataset of temperature observations from the Black Rock Forest weather station (Cornwall, NY) that is in the file brf_temp.txt. It contains two columns of data. The first is time in days after January 1, 1997, and the second is temperature in degrees Celsius.

We would expect that a weather station would record more parameters than just temperature, so a reasonable assumption is that this file is not the complete Black Rock Forest dataset, but rather some portion extracted from it. If you asked the person who provided the file—Bill Menke, in this case—he would perhaps say something like this:

I downloaded the weather station data from the International Research Institute (IRI) for Climate and Society at Lamont-Doherty Earth Observatory, which is the data center used by the Black Rock Forest Consortium for its environmental data. About 20 parameters were available, but I downloaded only hourly averages of temperature. My original file, brf_raw.txt, has time in a format that I thought would be hard to work with, so I wrote a MatLab script, brf_convert.m, that converted it into time in days, and wrote the results into the file that I gave you (see Notes 2.1 and 2.2).

So our dataset is neither complete nor original. The issue of originality is important, because mistakes can creep into a dataset every time it is copied, and especially when it is reformatted. A purist might go back to the data center and download his or her own unadulterated copy—not a bad idea, but one that would require him or her to deal with the time format problem. In many instances, however, a practitioner of data analysis has no choice but to work with the data as it is supplied, regardless of its pedigree.

Any information about how one's particular copy of the dataset came about can be extremely useful, especially when diagnosing problems with the data. One should always keep a file of notes that includes a description of how the data was obtained and any modifications that were made to it. Unaltered data file(s) should also be kept, in case one needs to check that a format conversion was correctly made.

Developing some expectations about the data before actually looking at it has value. We know that the Black Rock Forest data are sampled every hour, so the time index, which is in days, should increment by unity every 24 points. As New York's climate is moderate, we expect that the temperatures will range from perhaps −20 °C (on a cold winter night) to around +40 °C (on a hot summer day). The actual temperatures may, of course, stray outside of this range during cold snaps and heat waves, but probably not by much. We would also expect the temperatures to vary with the diurnal cycle (24 h) and with the annual cycle (8760 h), and be hottest in the daytime and the summer months of those cycles, respectively.

As the data is stored in a tabular form in a text file, we can make use of the load() function to read it into MatLab:

D=load('brf_temp.txt');
t=D(:,1);
d=D(:,2);
Ns=size(D);
N=Ns(1);
M=Ns(2);
N
M

(MatLab eda02_01)

The load() function reads the data into the matrix, D. We then copy time into the column vector t, and temperature into the column vector d. Knowing how much data was actually read is useful, so we query the size of D with the size() function. It returns a vector of the number of rows and columns, which we break out into the variables N and M and display. MatLab informs us that we read in a table of N = 110430 rows and M = 2 columns. That is about 4600 days or 12.6 years of data, at one observation per hour. A display of the first few data points, produced with the command, D(1:5,:), yields the following:

0        17.2700
0.0417   17.8500
0.0833   18.4200
0.1250   18.9400
0.1667   19.2900

The first column, time, does indeed increment by 1/24 = 0.0417 of a day. The temperature data seems to have been recorded with the precision of hundredths of a °C.
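The check on the time increment can also be automated rather than done by eye. The following is a minimal sketch that uses a short synthetic table in place of brf_temp.txt; the values, the tolerance tol, and the variable names dt and Nbad are assumptions made for illustration and are not part of the book's scripts.

```matlab
% Synthetic stand-in for the two-column table (invented values,
% not the Black Rock Forest data): hourly samples for 3 days.
Dt = 1/24;                         % expected sampling interval, in days
t = Dt*(0:71)';                    % 72 hourly time values
d = 10 + 5*sin(2*pi*t);            % made-up temperatures
D = [t, d];

[N, M] = size(D);                  % rows and columns in a single call
dt = diff(D(:,1));                 % spacing between successive times
tol = 1e-6;                        % assumed tolerance for rounding
Nbad = sum(abs(dt - Dt) > tol);    % number of irregular time steps

disp(sprintf('N=%d M=%d irregular steps=%d', N, M, Nbad));
```

A nonzero Nbad would point to gaps or duplicated times that deserve a closer look. If the times in a file are rounded to a few decimal places, the tolerance would need to be loosened accordingly.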

We are now ready to plot the data:

clf;
set(gca,'LineWidth',2);
hold on;
plot(t,d,'k-','LineWidth',2);

(MatLab eda02_02)

The resulting graph is shown in Figures 2.1 and 2.2. Most of the data range from about −20 to +35 °C, as was expected. The data are oscillatory, and about 12 major cycles (presumably annual cycles) are visible. The scale of the plot is too small for diurnal cycles to be detectable, but they presumably contribute to the fuzziness of the curve. The graph contains several unexpected features: Two brief periods of cold temperatures, or cold spikes, occur at around 400 and 750 days. In each case, the temperature dips below −50 °C. Even though they occur during the winter parts of cycles, such cold temperatures are implausible for New York, which suggests some sort of error in the data. A hot spike, with a temperature of about +40 °C, occurs around the time of the second cold spike. While not impossible, it too is suspicious. Finally, two periods of constant (and zero) temperature occur, one in the 1400−1500 day range and the other in the 4600−4700 day range. These are some sort of data drop-outs, time periods where the weather station was either not recording data at all or not properly receiving input from the thermometer and substituting a zero instead.

Figure 2.1 Preliminary plot of temperature against time. MatLab script eda02_02.
Figure 2.2 Annotated plot of temperature against time. MatLab script eda02_02.

Reality checks such as these should be performed on all datasets very early in the data analysis process. They will focus one's mind on what the data mean and help reveal misconceptions that one might have about the content of the data set as well as errors in the data themselves.
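Reality checks of this kind can be partly automated. The sketch below plants a cold spike and a zero-valued drop-out in an invented temperature series and then flags them; the plausible range (dlo, dhi) and the rule that a drop-out is a run of two or more exact zeros are assumptions chosen for illustration, not criteria taken from the text.

```matlab
% Invented temperature series with one cold spike and one zero-valued
% drop-out planted in it (not the Black Rock Forest data).
t = (0:239)'/24;                       % 10 days of hourly times
d = 15 + 8*sin(2*pi*t);                % plausible diurnal signal
d(50) = -55;                           % planted cold spike
d(100:120) = 0;                        % planted drop-out (21 samples)

dlo = -40; dhi = 50;                   % assumed plausible range, deg C
spikes = find(d < dlo | d > dhi);      % indices of implausible values

% flag zeros that belong to a run (a neighboring sample is also zero)
z = (d == 0);
zrun = z & ([false; z(1:end-1)] | [z(2:end); false]);
Ndrop = sum(zrun);                     % number of suspected drop-out samples

disp(sprintf('%d spike(s), %d drop-out sample(s)', length(spikes), Ndrop));
```

A single isolated zero is left unflagged here, because 0 °C is a perfectly legitimate temperature; only repeated exact zeros are treated as suspicious.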

The next step is to plot portions of the data on a finer scale. MatLab's Figure Window has a tool for zooming in on portions of the plot. However, rather than use it, we illustrate here a different technique: we create a new plot, at a different scale, that enlarges a section of the data selected with a mouse click. The advantage of the method is that the scale of several successive enlarged sections can be made exactly the same, which helps when making comparisons. The script is:

w=1000; % width of new plot in samples
[tc, dc] = ginput(1); % detect mouse click
i=find((t>=tc),1); % find index i corresponding to click
figure(2);
clf;
set(gca,'LineWidth',2);
hold on;
plot( t(i-w/2:i+w/2), d(i-w/2:i+w/2), 'k-','LineWidth',2);
plot( t(i-w/2:i+w/2), d(i-w/2:i+w/2), 'k.','LineWidth',2);
title('Black Rock Forest Temp');
xlabel('time in days after Jan 1, 1997');
ylabel('temperature');
figure(1);

(MatLab eda02_03)

This script needs careful explanation. The function ginput() waits for a mouse click and then returns its time and temperature in the variables tc and dc. The find() function returns the index, i, of the first element of the time vector, t, that is greater than or equal to tc, that is, an element near the time coordinate of the click. We now plot segments of the data, from i−w/2 to i+w/2, where w=1000, to a new figure window. The figure() function opens this new window, and the clf command (for ‘clear figure’) clears any previous contents. The rest of the plotting is pretty standard, except that we plot the data twice, once with black lines (the ‘k−’ argument) and then again with black dots (the ‘k.’ argument). We need to place a hold on command before the two plot functions, so that the second does not erase the plot made by the first. Finally, at the end, we call figure() again to switch back to Figure 1 (so that when the script is rerun, it will again put the cursor on Figure 1). The results are shown in Figure 2.3.

Figure 2.3 Enlargements of two segments of the temperature versus time data. (A) Segment with a cold spike. (B) Section with a drop-out. MatLab script eda02_03.

The purpose behind plotting the data with both lines and symbols is to allow us to see the actual data points. Note that the cold spike in Figure 2.3A consists of two anomalously cold data points. The drop-out in Figure 2.3B consists of a sequence of zero-valued data, although examination of other portions of the dataset uncovers instances of missing data as well. The diurnal oscillations, each with a couple of dozen data points, are best developed in the left part of Figure 2.3B. A more elaborate version of this script is given in eda02_04.
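One practical caveat with the enlargement script is that a click within w/2 samples of either end of the record would make the index range i-w/2 to i+w/2 run outside the data. A possible guard, offered here only as a sketch, is to clamp the two end indices; the names i1 and i2 and the sample values of N and i are hypothetical and are not taken from eda02_03 or eda02_04.

```matlab
% Clamp the plotting window so that it never runs off either end of
% the data. N, w and i stand in for the values in the script.
N = 5000;                   % hypothetical number of samples
w = 1000;                   % window width in samples
i = 120;                    % hypothetical index returned by find()

i1 = max(1, i - w/2);       % first index, no smaller than 1
i2 = min(N, i + w/2);       % last index, no larger than N

% the segment would then be plotted as t(i1:i2), d(i1:i2)
disp(sprintf('plot samples %d through %d', i1, i2));
```

The price of clamping is that a window near an edge is narrower than w samples, so its horizontal scale no longer matches that of the other enlargements.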

A complementary technique for examining the data is through its histogram, a plot of the frequency at which different values of temperature occur. The overall temperature range is divided into a modest number, say Lh, of bins, and the number of observations in each bin is counted up. In MatLab, a histogram is computed as follows:

Lh = 100;
dmin = min(d);
dmax = max(d);
bins = dmin + (dmax-dmin)*[0:Lh-1]'/(Lh-1);
dhist = hist(d, bins)';

(MatLab eda02_05)

Here we use the min() and max() functions to determine the overall range of the data. The formula dmin+(dmax-dmin)*[0:Lh-1]'/(Lh-1) creates a column vector of length Lh of temperature values that are equally spaced between these two extremes. The histogram function hist() does the actual counting, returning a column vector dhist whose elements are counts, that is, the number of observations in each of the bins. The value of Lh needs to be chosen with some care: too large and the bin size will be very small and the resulting histogram very rough; too small and the bins will be very wide and the histogram will lack detail. We use Lh=100, which divides the temperature range into bins about 1 °C wide.
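To make the counting step concrete, the sketch below tallies a handful of invented values by assigning each one to its nearest bin center. With Lh centers spanning dmin to dmax, the spacing between centers is (dmax-dmin)/(Lh-1). The use of round() and accumarray() is a choice made for this illustration; it is not a description of how hist() is implemented, and values that fall exactly halfway between two centers may be assigned differently.

```matlab
% Count observations by nearest bin center, using invented data.
d = [-3.0; -2.9; 0.3; 0.1; 0.2; 4.9; 5.0];   % made-up temperatures
Lh = 5;                                       % number of bins
dmin = min(d);
dmax = max(d);
Dd = (dmax - dmin)/(Lh - 1);                  % spacing of bin centers
bins = dmin + Dd*(0:Lh-1)';                   % centers: -3 -1 1 3 5

k = round((d - dmin)/Dd) + 1;                 % index of nearest center
dhist = accumarray(k, 1, [Lh, 1]);            % tally counts per bin

disp([bins, dhist]);                          % centers beside counts
```

Because every observation is assigned to exactly one bin, the counts sum to the number of observations, which is a quick sanity check on any histogram.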

The results (Figure 2.4) confirm our previous conclusions that most of the data fall in the range −20 to +35 °C and that the near-zero temperature bin is heavily overrepresented in the dataset. The histogram does not clearly display the two cold-spike outliers (although two tiny peaks are visible in the −60 to −40 °C range).

Figure 2.4 Histogram of the Black Rock Forest temperature data. Note the peak at a temperature of 0 °C. MatLab script eda02_05.

The histogram can also be displayed as a grey-shaded column vector (Figure 2.5B). Note that the coordinate axes in this figure have been rotated with respect to those in Figure 2.4. The origin is in the upper-left and the positive directions are down and right. The intensity (darkness) of the grey-shade is proportional to the number of counts, as is shown by the color bar at the right of the figure. This display technique is most useful when only the pattern of variability, and not the numerical value, is of interest. Reading numerical values off a grey-scale plot is much less accurate than reading them off a standard graph! In many subsequent cases, we will omit the color bar, as only the pattern of variation, and not the numerical values, will be of interest. The MatLab commands needed to create this figure are described in the next section.

Figure 2.5 Alternate ways to display a histogram. (A) A graph. (B) A grey-shaded column vector. MatLab script eda02_06.

An important variant of the histogram is the moving-window histogram. The idea is to divide the overall dataset into smaller segments, and compute the histogram of each segment. The resulting histograms can then be plotted side-by-side using the grey-shaded column-vector technique, with time increasing from left to right (Figure 2.6). The advantage is that the changes in the overall shape of the distribution with time are then easy to spot (as in the case of drop-outs). The MatLab code is as follows:

Figure 2.6 Moving-window histogram, where the counts scale with the intensity (darkness) of the grey. MatLab script eda02_07.

offset=1000;
Lw=floor(N/offset)-1;
Dhist = zeros(Lh, Lw);
for i = [1:Lw];
 j=1+(i-1)*offset;
 k=j+offset-1;
 Dhist(:,i) = hist(d(j:k), bins)';
end

(MatLab eda02_07)

Each segment of 1000 observations is offset by 1000 samples from the next (the variable, offset=1000), so the segments do not overlap. The number of segments, Lw, is based on the total length, N, of the dataset divided by the offset. As the quotient may be fractional, we round it down to an integer using the floor() function; the script then subtracts one. Thus, we compute Lw histograms, each of length Lh, and store them in the columns of the matrix Dhist. We first create an Lh × Lw matrix of zeros with the zeros() function. We then loop Lw times, each time creating one histogram from one segment, and copying the results into the proper column of Dhist. The integers j and k are the beginning and ending indices, respectively, of segment i, that is, d(j:k) is the i-th segment of data.
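The index arithmetic can be checked on its own, without any data. The sketch below tabulates the first and last index of every segment for an invented record length; N = 5300 and the name Nused are assumptions made for illustration.

```matlab
% Index arithmetic for non-overlapping segments, with an invented N.
N = 5300;                          % hypothetical number of observations
offset = 1000;                     % segment length and spacing
Lw = floor(N/offset) - 1;          % number of segments, as in the script

j = 1 + ((1:Lw)' - 1)*offset;      % first index of each segment
k = j + offset - 1;                % last index of each segment
Nused = k(end);                    % last observation that is covered

disp([j, k]);                      % one row per segment
disp(sprintf('%d of %d observations used', Nused, N));
```

With these numbers, four segments cover observations 1 through 4000. The subtraction of one in the formula for Lw means that the final complete segment (4001 through 5000) and the partial one after it are both left out.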

2.2 More on MatLab graphics

MatLab graphics are very powerful, but accessing that power requires learning what might, at first, seem to be a bewildering plethora of functions. Rather than attempting to review them in any detail, we provide some further examples. First consider

% create sample data, d1 and d2
N=51;
Dt = 1.0;
t = [0:N-1]';
tmax=t(N);
d1 = sin(pi*t/tmax);
d2 = sin(2*pi*t/tmax);
% plot the sample data
figure(7);
clf;
set(gca,'LineWidth',2);
hold on;
axis xy;
axis([0, tmax, -1.1, 1.1]);
plot(t,d1,'k-','LineWidth',2);
plot(t,d2,'k:','LineWidth',2);
title('data consisting of sine waves');
xlabel('time');
ylabel('data');

(MatLab eda02_08)

The resulting plot is shown in Figure 2.7. Note that the first section of the code creates a time variable, t, and two data variables, d1 and d2, the sine waves sin(πt/L) and sin(2πt/L), with L a constant (the variable tmax in the script). MatLab can create as many figure windows as needed. We plot these data in a new figure window, numbered 7, created using the figure() function. We first clear its contents with the clf command. We then use a hold on, which informs MatLab that we intend to overlay plots, so the second plot should not erase the first. The axis xy command indicates that the origin of the coordinate system is in the lower-left of the plot. Heretofore, we have been letting MatLab auto-scale plots, but now we explicitly set the limits with the axis() function. We then plot the two sine waves against time, with the first a solid black line (set with the 'k-') and the second a dotted black line (set with the 'k:'). Finally, we label the plot and axes.

Figure 2.7 Plot of the functions sin(πt/L) (solid curve) and sin(2πt/L) (dotted curve), where L = 50. MatLab script eda02_08.

MatLab can draw two side-by-side plots in the same Figure Window, as is illustrated below:

figure(8);
clf;
subplot(1,2,1);
set(gca,'LineWidth',2);
hold on;
axis([-1.1, 1.1, 0, tmax]);
axis ij;
plot(d1,t,'k-');
title('d1');
ylabel('time');
xlabel('data');
subplot(1,2,2);
set(gca,'LineWidth',2);
hold on;
axis ij;
axis([-1.1, 1.1, 0, tmax]);
plot(d2,t,'k-');
title('d2');
ylabel('time');
xlabel('data');

(MatLab eda02_09)

Here the subplot(1,2,1) function splits the Figure Window into 1 row and 2 columns of subwindows and directs MatLab to plot into the first of them. We plot data into this subwindow in the normal way. After finishing with the first dataset, the subplot(1,2,2) function directs MatLab to plot the second dataset into the second subwindow. Note that we have used an axis ij command, which sets the origin of the plots to the upper-left (in contrast to axis xy, which sets it to the lower-left). The resulting plot is shown in Figure 2.8.

Figure 2.8 Plot of the functions sin(πt/L) (left plot) and sin(2πt/L) (right plot), where L = 50. MatLab script eda02_09.

MatLab plots grey-scale and color images through the use of a color-map, that is, a 3-column table that converts a data value into the intensities of the red, green, and blue colors of a pixel on a computer screen. If the data range from dmin to dmax, then the first row of the table gives the red, green, and blue values associated with dmin and the last row gives the red, green, and blue values associated with dmax, each of which range from 0 to 1 (Figure 2.9). The number of rows in the table corresponds to the smoothness of the color variation within the dmin to dmax range, with a larger number of rows corresponding to a smoother color variation. We generally use tables of 256 rows, as most computer screens can only display 256 distinct intensities of each color.

Figure 2.9 The data values are converted into color values through the color map.

While MatLab is capable of displaying a complete spectrum of colors, we use only black-and-white color maps in this book. A black-and-white color map has equal red, green, and blue intensities for any given data value. We normally use

% create grey-scale color map
bw=0.9*(256-[0:255]')/256;
colormap([bw,bw,bw]);

(MatLab eda02_06)

which colors the minimum data value, dmin, a light grey and the maximum data value, dmax, black. In this example, the column vector bw is of length 256 and ranges from 0.9 (light grey) to nearly 0 (black). A vector, dhist (as in Figure 2.5B), can then be plotted as a grey-shade image using the following script:

axis([0, 1, 0, 1]);
hold on;
axis ij;
axis off;
imagesc( [0.4, 0.6], [0, 1], dhist);
text(0.66,-0.2,'dhist');
colorbar('vert');

(MatLab eda02_06)

Here, we set the axis to a simple 0 to 1 range using the axis() function, place the origin in the upper left with the axis ij command, and turn off the plotting of the axis and tick marks with the axis off command. The function,

imagesc( [0.4, 0.6], [0, 1], dhist);

plots the image. The quantities [0.4, 0.6] and [0, 1] are vectors x and y, respectively, which together indicate where dhist is to be plotted. They specify the positions of opposite corners of a rectangular area in the figure. The first element of dhist is plotted at the (x1, y1) corner of the rectangle and the last at (x2, y2). The text() function is used to place text (a caption, in this case) at an arbitrary position in the figure. Finally, the color bar is added with the colorbar() function.
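Returning briefly to the grey-scale color map defined earlier in this section, its contents can be inspected numerically before anything is drawn. The sketch below rebuilds the table with the same formula and prints its size and its first and last rows; the variable names cmap, first_row and last_row are introduced here only for illustration.

```matlab
% Inspect the grey-scale table without drawing anything.
bw = 0.9*(256 - (0:255)')/256;     % same formula as in the text
cmap = [bw, bw, bw];               % equal red, green and blue columns

first_row = cmap(1, :);            % lightest entry of the table
last_row = cmap(end, :);           % darkest entry of the table

disp(size(cmap));                  % number of rows and columns
disp(first_row);
disp(last_row);
```

The first row is 0.9 in all three columns (light grey) and the last is 0.9/256, or about 0.0035 (very nearly black), with the intensities decreasing steadily in between.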

A grey-shaded matrix, such as Dhist (Figure 2.6), can also be plotted with the imagesc() function:

figure(1);
clf;
axis([-Lw/8, 9*Lw/8, -Lh/8, 9*Lh/8]);
hold on;
axis ij;
axis equal;
axis off;
imagesc( [0, Lw-1], [0, Lh-1], Dhist);
text(6*Lw/16,17*Lw/16,'Dhist');

(MatLab eda02_06)

Here, we make the axes a little bigger than the matrix, which is Lh×Lw in size (Lh rows and Lw columns). Note the axis equal command, which ensures that a unit of distance along the x axis and a unit of distance along the y axis have the same length on the computer screen. Note also that the two vectors in the imagesc() function have been chosen so that the matrix plots in a square region of the window, as contrasted to the narrow and high rectangular area that was used in the previous case of a vector.

2.3 Rate information

We return now to the Neuse River Hydrograph (Figure 1.4). This dataset exhibits an annual cycle, with the river level being lowest in autumn. The data are quite spiky. An enlargement of a portion of the data (Figure 2.10A) indicates that the dataset contains many short periods of high discharge, each ~5 days long and presumably corresponding to a rain storm. Most of these storm events seem to have an asymmetric shape, with a rapid rise followed by a slower decline. The asymmetry is a consequence of the river rising rapidly after the start of the rain, but falling slowly after its end, as water slowly drains from the land.

Figure 2.10 (A) Portion of the Neuse River Hydrograph. (B) Corresponding rate of change of discharge with time. (C) Histogram of rates for entire hydrograph. MatLab script eda02_10.

This qualitative assessment can be made more quantitative by estimating the time rate of change of the discharge (the discharge rate) using its finite-difference approximation:

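$$\frac{dd}{dt} \approx \frac{\Delta d}{\Delta t} = \frac{d(t+\Delta t) - d(t)}{\Delta t} \quad\text{or}\quad \left[\frac{dd}{dt}\right]_i \approx \frac{d_{i+1} - d_i}{t_{i+1} - t_i} \tag{2.1}$$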

The corresponding MatLab script is

dddt=(d(2:L)-d(1:L-1)) ./ (t(2:L)-t(1:L-1));

(MatLab eda02_10)

Note that while the discharge data is of length L, its rate is of length L−1. The rate curve (Figure 2.10B) also contains numerous short events, although these are two-sided in shape. If, indeed, the typical storm event consists of a rapid rise followed by a long decline, we would expect that the discharge rate would be negative more often than positive. This hypothesis can be tested by computing the histogram of discharge rate, and examining whether or not it is centered about a rate of zero. The histogram (Figure 2.10C) peaks at negative rates, lending support to the hypothesis that the typical storm event is asymmetric.
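The effect of an asymmetric event on the sign of the rate can be seen in a small worked example. The sketch below applies Equation (2.1) to an invented "storm event" and counts the rising and falling intervals; the data values are made up, and the use of diff() as an equivalent shorthand is an addition here rather than part of eda02_10.

```matlab
% Invented "storm event": a fast rise followed by a slow decline.
t = (0:9)';                                   % time in days
d = [1; 1; 5; 4; 3.4; 2.8; 2.2; 1.6; 1; 1];   % made-up discharge
L = length(d);

dddt = (d(2:L) - d(1:L-1)) ./ (t(2:L) - t(1:L-1));   % as in the text
dddt2 = diff(d) ./ diff(t);                          % same, via diff()

Npos = sum(dddt > 0);      % number of rising intervals
Nneg = sum(dddt < 0);      % number of falling intervals

disp(sprintf('%d rising and %d falling intervals', Npos, Nneg));
```

The single rising interval is steep, while the decline is spread over six intervals, so negative rates outnumber positive ones even though the discharge returns to its starting value.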

We can also use rate information to examine whether the river rises (or falls) faster at high water (large discharges) than at low water (small discharges). We segregate the data into two sets of (discharge, discharge rate) pairs, depending on whether the discharge rate is positive or negative, and then make scatter plots of the two sets (Figure 2.11). The MatLab code is as follows:

Figure 2.11 Scatter plot of discharge rate against discharge. (A) Positive rates. (B) Negative rates. MatLab script eda02_11.

pos = find(dddt>0);
neg = find(dddt<0);
- - -
plot(d(pos),dddt(pos),'k.');
- - -
plot(d(neg),dddt(neg),'k.');

(MatLab eda02_11)

Here, the “- - -” means that we have omitted lines of the script (standard plot setup commands, in this case). The find() function returns a column vector of indices of dddt that match the given test condition. For example, pos contains the indices of dddt for which dddt>0. Note that the quantities d(pos) and dddt(pos) are arrays of just the elements of d and dddt whose indices are contained in pos. Results are shown in Figure 2.11. Only the negative rates appear to correlate with discharge, that is, the river falls faster at high water than at low water. This pattern is related to the fact that a river tends to run faster when it is deeper, and can carry away water added by a rain storm more quickly. The positive rates, which show no obvious correlation, are more influenced by meteorological conditions (e.g., the intensity and duration of the storms) than river conditions.
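The same selection can be written with logical masks instead of find(). The sketch below does so on a few invented values; the helper variable dlead, which holds the discharge at the start of each interval so that it has the same length as dddt, is introduced here only for illustration.

```matlab
% Invented discharge and rate values.
d    = [10; 12; 30; 25; 18; 14];     % discharge (length L)
dddt = [ 2; 18; -5; -7; -4];         % rate (length L-1)
L = length(d);

dlead = d(1:L-1);          % discharge at the start of each interval
pos = (dddt > 0);          % logical mask, true where the river rises
neg = (dddt < 0);          % logical mask, true where the river falls

dpos = dlead(pos);         % discharges paired with rising rates
dneg = dlead(neg);         % discharges paired with falling rates

disp(dpos');
disp(dneg');
```

Applying find() to a mask recovers the index list used in the text, so the two approaches select the same elements; which to use is largely a matter of taste.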

2.4 Scatter plots and their limitations

Both the Black Rock Forest temperature and the Neuse River discharge datasets are time series, that is, data that are organized sequentially in time. Many datasets lack this type of organization. An example is the Atlantic Rock Sample dataset, provided in the file, rocks.txt. Here are notes provided by Bill Menke, who created the file:

I downloaded rock chemistry data from PetDB's website at www.petdb.org. Their database contains chemical information about ocean floor igneous and metamorphic rocks. I extracted all samples from the Atlantic Ocean that had the following chemical species: SiO2, TiO2, Al2O3, FeOtotal, MgO, CaO, Na2O, and K2O. My original file, rocks_raw.txt, included a description of the rock samples, their geographic location, and other textual information. However, I deleted everything except the chemical data from the file, rocks.txt, so it would be easy to read into MatLab. The order of the columns is as given above and the units are weight percent.

Note that this Atlantic Rock dataset is just a fragment of the total data in the PetDB database. After loading the file, we determine that it contains N = 6356 chemical analyses.

Scatter plots (Figure 2.12) are a reasonably effective means to quickly review the data. In this case, the number, M, of columns of data is small enough that we can exhaustively review all of the M(M−1)/2 pairwise combinations of chemical species. A MatLab script that runs through every combination uses a pair of nested for loops:

Figure 2.12 Scatter plot of four combinations of chemical components of the Atlantic Ocean rock sample dataset. MatLab script eda02_12.

D = load('rocks.txt');
Ns = size(D);
N = Ns(1);
M = Ns(2);
for i = [1:M-1]
for j = [i+1:M]
 clf;
 axis xy;
 hold on;
 plot( D(:,i), D(:,j), 'k.' );
 xlabel(sprintf('element %d',i));
 ylabel(sprintf('element %d',j));
 [x, y]=ginput(1);
end
end

(MatLab eda02_12)

This nested for loop plots all combinations of species i with species j. Note that we can limit ourselves to the j>i cases, as the j=i case corresponds to plotting a species against itself, and the j<i plots are redundant. Accordingly, the outer for loop variable, i, ranges from 1 to M−1 and the inner for loop variable, j, ranges over the interval from i+1 to M. The pause between successive plots is implemented with the ginput() command; clicking on the figure signals that it is time for the next graph to be displayed.
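The number of plots that the nested loops produce can be confirmed by running the same loop structure with a counter in place of the plotting commands. This is a minimal sketch; the counter Nplots is introduced for illustration, and M = 8 is taken from the eight chemical species listed in the notes above.

```matlab
% Count the plots produced by the nested loops for M columns.
M = 8;                         % eight chemical species, as in rocks.txt
Nplots = 0;
for i = 1:M-1
    for j = i+1:M
        Nplots = Nplots + 1;   % one scatter plot per (i,j) pair
    end
end
disp(sprintf('%d plots; M*(M-1)/2 = %d', Nplots, M*(M-1)/2));
```

The count grows roughly as the square of M, which is why an exhaustive review of this kind is practical only when the number of columns is small.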

Note, also, the use of the sprintf() function (for “string print formatted”). It creates a character string that includes both text and the value of a variable. This is a useful, although fairly inscrutable, function, and we refer readers to the MatLab help pages for a detailed description. Briefly, the function uses placeholders that start with the character % to indicate where in the character string the value of the variable should be placed. Thus,

i=2;
sprintf('element %d',i);

returns the character string ‘element 2’. The %d is the placeholder for an integer. It is replaced with ‘2’, the value of i. Several placeholders can be used in the same format string; for example, the script

i=2;
j=4;
sprintf('row %d column %d', i, j);

returns the character string ‘row 2 column 4’. If the variable is fractional, as contrasted to integer, the floating-point placeholder, %f, is used instead. For example,

a=4.71;
sprintf('a=%f', a);

returns the character string 'a=4.710000', because %f prints six digits after the decimal point by default; a precision can be specified, as in %.2f, which yields 'a=4.71'. The sprintf() command can be used in any function that expects a character string, including title(), xlabel(), ylabel(), load(), dlmwrite(), and disp(). In the special case of disp(), an alternative is also available, with the command

fprintf('a=%f\n', a);

being equivalent to:

disp(sprintf('a=%f', a));

The \n (for newline) at the end of the format string 'a=%f\n' indicates that subsequent characters should be written to a new line in the command window, rather than being appended to the end of the current line. In this book, our preference is to use disp(sprintf()), as it preserves regularity of usage.
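The placeholders discussed above can be tried side by side. The sketch below stores each formatted string in a variable and then displays it; the variable names s1, s2 and s3 are introduced here only for illustration.

```matlab
a = 4.71;
s1 = sprintf('a=%f', a);                  % default: six digits after the point
s2 = sprintf('a=%.2f', a);                % precision of two digits
s3 = sprintf('row %d column %d', 2, 4);   % two integer placeholders

disp(s1);
disp(s2);
disp(s3);
```

Specifying the precision explicitly is usually worthwhile in plot titles and axis labels, where the trailing zeros produced by a bare %f add clutter.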

Four of the resulting 28 plots are shown in Figure 2.12. Note their effectiveness in identifying both overall patterns in the data and data outliers that depart from the pattern. Figure 2.12A and Figure 2.12D both have a single outlier, in K2O and TiO2, respectively. We do not know whether they represent an unusual rock composition or an error in the data, but this issue could possibly be resolved with further information about the data. Note that two distinct groupings, or populations, of data are present in Figure 2.12C, whereas only one is evident in Figure 2.12A. Figure 2.12B has a well-defined linear variation of MgO with Al2O3, but Figure 2.12D has a more complicated Y-shaped relationship of Al2O3 and TiO2.

On the one hand, this preliminary inspection has yielded interesting patterns that would be worth pursuing in a more detailed analysis. On the other hand, it has revealed one of the limitations of scatter plots when they are applied to multivariate data: the plots all look different! Some pairs of parameters (chemical species, in this case) seem uncorrelated, while others have strong correlations. Some have a single population, others two or even more. The problem becomes even worse when one considers that plots can also be made of combinations of parameters (e.g., a plot of MgO + FeO against Na2O + K2O). The problem is that the patterns within the dataset are inherently multidimensional, but a scatter plot reduces that pattern to just two dimensions.

This problem points to the need for more advanced data analysis tools that can get at the underlying multidimensional patterns—tools that we will discuss later in this book (e.g., the factor analysis discussed in Chapter 8).

Crib Sheet 2.1

When first examining a dataset

Archive your data

Keep an unaltered copy of your data, together with notes recording its source and any other useful information that you come across.

Questions to ask yourself

What was the purpose for which the data was collected?

What is the physical significance of each data type?

How were they measured?

What are the units of measurement?

How are missing data represented?

Look at the numerical values

How many significant digits are recorded?

Make preliminary plots

time series plots

symbols or lines or both?

scatter plots

linear or logarithmic axes?

histograms

color images

plot error bars whenever possible!

Reality Checks

Does the data make sense in the context of what you already know?

sign

magnitude

minimum and maximum values

scale of variation with time and distance

daily, annual and other periodicities

Problems

2.1. Plot the Black Rock Forest temperature data on a graph whose time units are years. Check whether the prominent cycles are really annual.

2.2. What is the largest hourly change in temperature in the Black Rock Forest dataset? Ignore the changes that occur at the temperature spikes and drop-outs.

2.3. Examine the diurnal cycles in the Black Rock Forest dataset. Qualitatively, does their pattern vary with time of year?

2.4. Adapt the eda02_03 script to plot segments of the Neuse River Hydrograph dataset.

2.5. Create histograms for the eight chemical species in the Atlantic Rock dataset.
