Chapter 13. Basic techniques of graphical analysis

This chapter covers

  • Investigating relationships
  • Using logarithmic scales
  • Representing point distributions
  • Visualizing ranked data
  • Organizing your work

This chapter and the next largely take gnuplot and its features for granted and instead concentrate on applying gnuplot to specific goals that commonly arise when graphing data. At the same time, I’ll take the opportunity to show you how certain tasks can be accomplished using gnuplot’s features. In this chapter, I’ll discuss some basic activities that are generally useful in graphical analysis; the next chapter will cover some more specialized tasks in greater depth.

When faced with a new data set, two questions usually dominate. The first is, how does one quantity depend on some other quantity—how does y vary with x? The second question (for data sets that include some form of statistical noise) asks, how is some quantity distributed—where are data points located, and what’s the character of the randomness? These are the two primary questions for this chapter. In the course of it, we’ll also revisit logarithmic plots and their uses and examine more applications of smooth approximations to noisy data sets—topics first introduced in chapter 3.

13.1. Representing relationships

One of the conceptually easiest and most familiar types of graphs shows the dependence of one quantity on another—so much so that creating a plot of this type is what is commonly meant by “graphing data.” You’ve seen several examples in this book, like figure 2.4, figure 3.1, and many more.

13.1.1. Scatter plots

For many data sets, it’s natural to ask whether one quantity depends on another and, if so, how: does y grow as x grows, or does it fall, or does y not depend on x to begin with? A scatter plot is the first step in finding the answer. A scatter plot shows unconnected symbols located at the position given by x and y. It’s an easy way to get a feeling for an otherwise unknown data set.

A Simple Example: Car Data

Listing 13.1 shows the first few lines from a sample data set containing 26 attributes for 205 different car models that were imported into the US in 1985.[1] The 14th column gives the curb-weight in pounds, and the last (26th) column the price (in 1985 dollars). A scatter plot like the one in figure 13.1 shows how weight varies as a function of price.

1

This example comes from the “Automobile” data set, available from the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/Automobile.

Figure 13.1. Curb weight versus price for 205 different cars. See listing 13.1.

Listing 13.1. Incomplete data for figure 13.1 (file: imports-85.data)
1,158,audi,gas,turbo,four,sedan,fwd,front,105.80,192.70,71.40,55.90,
0,?,audi,gas,turbo,two,hatchback,4wd,front,99.50,178.20,67.90,52.00,
2,192,bmw,gas,std,two,sedan,rwd,front,101.20,176.80,64.80,54.30,2395
0,192,bmw,gas,std,four,sedan,rwd,front,101.20,176.80,64.80,54.30,239
0,188,bmw,gas,std,two,sedan,rwd,front,101.20,176.80,64.80,54.30,2710
...

In this case, the input file is not whitespace-separated but comma-separated. Instead of transforming the input file to a whitespace-separated format, it’s more convenient to use gnuplot’s set datafile separator option to plot this data file. The open circle (pt 6) is used because it remains easiest to distinguish even when data points overlap:

set datafile separator ","
plot [2500:] "imports-85.data" u 26:14 pt 6

You can clearly see that weight goes up as the price increases, which is reasonable. You should also note that there are many more low-price/low-weight cars than heavy, premium vehicles. For budget cars, weight seems to increase in step (linearly) with price for a while; but for higher-priced vehicles, the gain in weight levels off. This observation may have a simple explanation: the price of mass-market vehicles is largely determined by the cost of materials, so that a car that’s twice as big (as measured by its overall mass) is also twice as expensive, whereas the price of luxury cars is determined by higher quality (fancier materials such as leather seats, and additional options such as more electronics) rather than by sheer bulk.

Using Scatter Plots

This example demonstrates what to look for when examining a scatter plot. The first question usually concerns the nature of the relationship between x and y. Does y fall as x grows or vice versa? Do the points fall approximately onto a straight line or not? Is there an oscillatory component? Whatever it is, take note of it.

The second question concerns the strength of the relationship, or, put another way, the amount of noise in the data. Do the data points jump around unpredictably as you go from one x value to the next? Are there outliers that seem to behave differently than the majority of the points? Detecting outliers is important: gone unnoticed, they’ll mess up most statistical quantities (such as the mean) you may want to calculate later. And sometimes, outliers indicate an interesting effect—maybe a subgroup of points follows different rules than the majority. Outliers in scatter plots should never go uninvestigated.

A third aspect to look for in a scatter plot is the distribution of points in either dimension. Are points distributed rather uniformly, or do they cluster in a few locations? If so, do you understand the reason for the clustering, or is this something you should investigate further? There may be a lot of information even in a humble scatter plot!

13.1.2. Highlighting trends

The primary reason for preparing scatter plots is to determine whether there is any relationship between the plotted quantities, and what the nature of this relationship might be. In particular when the data is noisy, it may be difficult to recognize the dominant behavior from the individual data points alone, and so there is a need for graphical tools to highlight trends that may exist in the data. Gnuplot offers weighted cubic splines for this purpose. Let’s study an example before discussing splines in more detail.

Worked Example: Marathon Winning Times

Figure 13.2 shows the finishing times of the winners in a marathon event from when the event was first conducted until 1990.[2] In general, you see that the finishing times have decreased over time—the top athletes are getting better every year. The changes are particularly dramatic for the women’s results since they started competing in 1966.

2

This example was inspired by the book Graphic Discovery by Howard Wainer (Princeton University Press, 2005).

Figure 13.2. Finishing times (in minutes) for the winner of a marathon (up to the year 1990), together with the best straight-line fit. Will women overtake men in the coming years?

Also shown are the best-fit straight-line approximations to the data (as found by the stats command), and they seem to represent the data quite well. The only issue is that according to those fits, women should overtake men sometime in the early 1990s—and then continue to get dramatically faster. Is this a reasonable conclusion?

This example demonstrates two important points about working with data. The first is the need to be sensitive to the structure and quality of the data. For the data in figure 13.2, fitting a straight line provides only a very coarse—and, as you’ll see, misleading—approximation of the real behavior.

You need to remember that by fitting a straight line, you’ve chosen one specific model to represent the data. At this point, you’re no longer analyzing the data with the intention of revealing the information contained in it, but are making a specific statement about its surmised behavior. Before making such a strong statement, you should have sufficient evidence for the applicability of the particular model selected. And coming back to the current example, there certainly doesn’t seem to be any strong theoretical reason why athletic performance should follow a straight line as a function of time.

To understand the structure of the data, it might be a better idea to represent the data by using “soft” local approximations, such as weighted splines. Some experimentation with the weights will tell you much about the structure of the data: does the overall shape of the approximation curve change drastically as you vary the weighting? Which features are the most robust, and which disappear most quickly? Typically, significant features tend to be rather robust under transformations, whereas less relevant features are more transient.
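For instance, you might compare several weights side by side in one plot. The following is only a sketch (it assumes the men-women file and the column layout used in the plot command below); as the weight shrinks, the spline approaches the straight regression line, and as it grows, the spline follows the individual points more closely:

plot "men-women" u 1:2 t "Men" w p pt 4, \
     "" u 1:2:(1e-4) t "w = 1e-4" s acs lw 2, \
     "" u 1:2:(1e-2) t "w = 1e-2" s acs lw 2, \
     "" u 1:2:(1.0)  t "w = 1"    s acs lw 2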

Figure 13.3 shows the same data as figure 13.2, but instead of a straight line, a soft spline is used to represent the data, like so:

plot [1890:2017.5][115:215] "men-women" u 1:2 t "Men" w p pt 4, \
                         "" u 1:3 t "Women" w p pt 6, \
                         "" u 1:2:(1e-2) t '' s acs lt 1 lw 2, \
                         "" u 1:3:(1e-2) t '' s acs lt 2 lw 2
Figure 13.3. The same data as in figure 13.2, together with a weighted-splines fit. The fit is based only on points prior to 1990, but the actual finishing times for the following years are also shown. The softer spline clearly reveals the leveling off of the women’s results well before 1990.

This approximation suggests that women’s performance starts to level off in the late 1980s, and the results from years after 1990 corroborate this observation. Note that the spline approximation in the graph is based only on years up to and including 1990, but not on later data points.

More on splines

Splines are a way to provide a smooth approximation to a set of points.[a] Splines are constructed from piecewise polynomial functions, which are joined together in a smooth fashion; the joints are referred to as knots. In the case of interpolating splines, the knots are made to coincide with the actual data points; the resulting curve is therefore forced to pass exactly through all data points. In the case of smoothing or approximating splines, the vertical position of each knot is left to vary, so the resulting curve in general does not pass through the individual data points. Because in the latter case the curve doesn’t have to pass through any data points exactly, it can be less wiggly.

a

I’d like to thank Lucas Hart and Clark Gaylord for helpful correspondence regarding this topic.

Both interpolating and approximating splines must fulfill the same smoothness conditions, but in addition, the approximating spline must strike a balance between the following two conditions:

  • Passing close to the data points
  • Not being too wiggly

These conditions are expressed in the following functional, which is minimized by the approximating spline s(x):

J[s] = \int \left( s''(x) \right)^2 \, dx + \sum_i w_i \left( y_i - s(x_i) \right)^2

where (xi, yi) are the coordinates of the data points, the wi are the weights attached to each data point, and the primes indicate derivatives with respect to x. In this functional, the first term is large if s(x) is wiggly, and the second term is large if s(x) doesn’t pass close to the data points. (The form of the first term comes from a physical analogy: if the spline were made out of a real material, such as a thin strip of wood or metal, the first term would be related to the total bending energy of the strip.)

The balance between these two terms is controlled through the weight parameters wi: if the wi are small, the first term dominates, and the resulting spline approaches a straight line (which happens to coincide with the least-squares linear regression line for the data set). If the weights are large, the second term dominates, and the spline approaches the interpolating spline (which passes exactly through all data points).

Another way to think about the weights is to write wi = 1/di², where di is a measure for the uncertainty in the data of point i (such as the standard deviation in this point). One would expect the spline to pass through the interval [yi - di, yi + di] at xi. The higher the confidence in one point, the smaller you can choose to make this interval, and therefore the larger the weight wi will be. By choosing di = 0 for one of the points, you can even force the curve to pass through this point exactly, although you may let the spline float more freely for the other points. In principle it’s possible to choose a different weight for each point, but if all points are known to the same accuracy, then all weights should be the same. This is what was done for all examples in this book.
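As a sketch only (the file name measurements and its three-column layout of x, y, and per-point uncertainty di are assumptions, not from the book), such per-point weights can be computed on the fly from an uncertainty column:

plot "measurements" u 1:2:3 t "data" w yerrorbars, \
     "" u 1:2:(1./($3**2)) t "weighted spline" s acs lw 2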

One important remark: the way J[s] is written above, the size of the second term depends on the number of data points—if you double the size of the data set, the size of the second term will be (approximately) twice as large. By contrast, the first term does not depend on the number of data points. If the number of data points grows, the second term will therefore become larger relative to the first one, and the resulting spline will be more wiggly.

To maintain the original balance between wiggliness and closeness of approximation, the weights must be increased accordingly for data sets containing a larger number of points. Equivalently, you might want to take the number of points into account explicitly by writing wi = ui/N, where ui is the actual weight and N is the number of knots. With this choice for wi, the balance between both terms will be maintained regardless of the number of knots in the data set.[b]
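A minimal sketch of this convention, again using the hypothetical measurements file and an assumed base weight u0: let stats count the points, so the balance between the two terms no longer depends on N:

u0 = 100.                        # assumed base weight
stats "measurements" u 1 noout
plot "measurements" u 1:2:(u0/STATS_records) t "spline, w = u0/N" s acs lw 2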

b

You can find more information on splines in chapter 1 of Handbook on Splines for the User by E. V. Shikin and A. I. Plis (CRC Press, 1995).

13.2. Logarithmic plots

Logarithmic scales are one of the most versatile tools in the graphical analyst’s toolbox. I introduced them in section 3.4 and discussed how they work. Now let’s put them into action.

Logarithmic scales serve three purposes when plotting:

  • They rein in large variations in the data.
  • They turn multiplicative deviations into additive ones.
  • They reveal exponential and power-law behavior.

13.2.1. Large variations in data

To understand the meaning of the first two items, let’s study the daily traffic pattern at a website. Figure 13.4 shows the number of hits per day over approximately three months. There’s tremendous variation in the data, with alternating periods of high and low traffic. During periods of high traffic, daily hit counts may reach close to half a million hits but then fall to very little traffic shortly thereafter. On the scale of the graph, the periods of low traffic seem barely different from zero, with little fluctuation.

Figure 13.4. Traffic patterns for a website. Daily hit count versus day of the year. Note the extreme variation in traffic over time.

Figure 13.5 displays the same data, but now on a semi-logarithmic scale. The logarithmic scale helps to dampen the extreme variation of the original data set (two orders of magnitude) so you can see the structure during both the high- and the low-traffic seasons. That’s the first effect of logarithmic plots: they help to make data spanning extreme ranges visible by suppressing high-value outliers and enhancing low-value background.
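A semi-logarithmic view like figure 13.5 takes only one extra command. This is a sketch that assumes the webtraffic file used later in this section, with the daily hit count in column 2:

set logscale y
plot [0:365] "webtraffic" u 2 t "hits per day" w l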

Figure 13.5. The same data as in figure 13.4, but on a semi-logarithmic scale. Note how the high-traffic outliers have been suppressed and the low-traffic background has been enhanced. In this presentation, data spanning two orders of magnitude can be compared easily.

Furthermore, you can see that the relative size of the day-to-day fluctuations is about equal during both phases. The absolute size of the fluctuations is quite different, but their size as a percentage of the average value is roughly the same (very approximately, during low season, traffic varies between 2,000 and 20,000 hits a day, a factor of 10; whereas during high season it varies between 30,000 and 300,000 hits a day, again a factor of 10). That’s the second effect of logarithmic plots: they turn multiplicative variations into additive ones.

Figure 13.6 tries to demonstrate the last point in a different way. The bottom panel shows the web traffic on consecutive days (like figure 13.4), displaying great seasonal variance, but the top panel shows the change in traffic between consecutive days, divided by the total traffic—(current day - previous day)/(current day + previous day)—which does not exhibit a seasonal pattern. This is further proof that the daily fluctuation, viewed as a percentage of the overall traffic, is constant throughout. The calculation for the top panel can be performed as an inline transformation, by the way:

prv=40000
plot [0:365][-1:1] "webtraffic" u(y=($2-prv)/($2+prv), prv=$2, y) w l

Figure 13.6. Bottom panel: hits per day over time (as in figure 13.4); top panel: change in traffic between consecutive days, divided by the total traffic. Note how the relative change (top panel) doesn’t exhibit any seasonal pattern, indicating that the relative size of the variation is constant.
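If you want both views in a single figure, as in figure 13.6, multiplot can stack the two panels. The following is only a sketch of one way to do it, reusing the inline transformation from above together with the same assumed starting value prv:

set multiplot layout 2,1
prv = 40000                      # assumed value for the day before the first record
plot [0:365][-1:1] "webtraffic" u (y=($2-prv)/($2+prv), prv=$2, y) t "relative change" w l
plot [0:365] "webtraffic" u 2 t "hits per day" w l
unset multiplot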

13.2.2. Power-law behavior

Finally, let’s look at a curious example that brings together two benefits of logarithmic plots: the ability to display and compare data of very different magnitude, and the ability to turn power-law behavior into straight lines. Mammals come in all shapes and sizes, from tiny rodents (the smallest known land mammal is the pygmy shrew, which weighs only a few grams, but some bats found in Thailand are apparently smaller still) to the largest of whales (weighing well over a hundred tons). It’s a curious empirical fact that there seem to be fixed relationships between different metabolic quantities—basically, the larger an animal is, the slower its bodily functions progress. Figure 13.7 shows an example: the duration (in seconds) of a single resting heartbeat as a function of the typical body mass. The regularity of the data is remarkable, spanning eight orders of magnitude for the mass of the animal. What’s even more amazing is how well the data is represented by the simple function T ~ m^(1/4). This law isn’t limited to the examples shown in the graph: if you added animals to the list, they’d also fall close to the straight line (I didn’t just pick the best ones).

Figure 13.7. Allometric scaling: the duration of an average resting heartbeat as a function of the typical body mass for several mammals. Note how the data points seem to fall on a straight line with slope 1/4.

The existence of such scaling relations in biological systems has been known for a long time and seems to hold generally. For example, it turns out that the typical lifetime of a mammal also obeys a quarter-power scaling relation with body mass, leading to the surprising conclusion that the total number of heartbeats in the life of a single organism is fixed—no matter what the typical resting heart rate is. (In case you care, the number comes out to about 1.5 billion heartbeats during a typical lifetime.)

These observations have been explained in terms of the geometrical constraints that must exist in the vascular networks (the veins and arteries), which supply nutrients to all parts of the organism.[3] As it turns out, you can derive the quarter-power scaling laws starting from only three simple assumptions: that the support network must be a space-filling fractal, reaching all parts of the organism; that the terminal capillaries where nutrients are exchanged are the same size in all animals; and that organisms have evolved in such a way that the energy required for the transport of nutrients through their bodies is minimized. I think it’s amazing how such a powerful result can be derived from such simple assumptions. But, on the other hand, we shouldn’t be surprised: generally applicable laws (such as the quarter-power scaling in this example) must stem from very fundamental assumptions, disregarding any specifics.

3

The original reference is the paper “A General Model for the Origin of Allometric Scaling Laws in Biology” by G. B. West, J. H. Brown, and B. J. Enquist, Science 276 (1997): 122. Additional references can be found on the web.

Let’s come back to figure 13.7. The double-logarithmic scales make it possible to follow the data over eight orders of magnitude. (Had I used linear scales, all animals except for the whale would be squished against the left side of the graph—literally crushed by the whale.) So again, logarithmic scales can help to deal with data spanning a wide range of values. In addition, the double-logarithmic plot turns the power law relationship T ~ m^(1/4) into a straight line and makes it possible to read off the exponent from the slope of the line. I explained how this works in detail in section 3.4 and won’t repeat it here.
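If you want a numerical estimate of the exponent rather than reading it off the slope, a straight-line fit in log-log space will do. This is a sketch only; it assumes the mammals file used in the plot command below, with body mass in grams in column 2 and resting heart rate in beats per minute in column 3:

f(x) = a*x + b                   # straight line in log-log space; a is the exponent
a = 0.25; b = 0.0                # starting values for the fit
fit f(x) "mammals" u (log($2/1000.)):(log(60./$3)) via a, b
print a                          # should come out close to 1/4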

Finally, figure 13.7 is a nice example for the power of gnuplot’s with labels plot style. The graph was generated using

plot [][:5] "mammals" u ($2/1000):(60/$3) w p pt 7, \
         "" u ($2/1000):(60/$3*1.125):1 w labels

The first part of the command draws the symbols (with points); the second adds the labels. Inline transformations are used to express the weight in kilograms (instead of grams) and to transform the heart rate (given in beats per minute) into the duration of each beat (measured in seconds). All the labels are shifted a bit upward so as not to obscure the symbols. In this example, the vertical offset is multiplicative, because of the logarithmic scale of the graph (remember: logarithms turn multiplicative offsets into additive ones).

13.3. Point distributions

Besides detecting relationships between quantities, you may want to understand how data points that are somehow random are distributed. Are data points spread out evenly, or are they clustered in a few spots? Are distributions symmetric, or are they skewed? How much weight is contained in the tails of a distribution, compared to its center?

Let’s say you have a file containing a set of measurements. They can be anything: interarrival times for requests at a web server, completion times of database queries, weights of potatoes, heights of people—whatever. What can be said about them?

13.3.1. Summary statistics and box plots

Two of gnuplot’s facilities give a first answer: the stats command and the with boxplot style (see sections 5.6 and E.2, respectively). Both share the advantage that they have no adjustable parameters—they are as simple as can be. Both also share the disadvantage that they lose a lot of information about the distribution of individual points, because both aggregate the entire distribution into a handful of summary statistics. (A box-and-whisker plot is, after all, nothing more than a graphical representation of the results from running the stats command.)

In short: to get a first idea of how the values in a new data set are distributed, both of these facilities are convenient because of their extreme simplicity. But both hide too much information to develop a detailed understanding.
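For reference, here is a minimal sketch of both, using the jitter file that appears in the next subsection; neither takes any tunable parameters:

stats "jitter" u 1               # prints min, max, mean, quartiles, and more
plot "jitter" u (0.0):1 w boxplot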

13.3.2. Jitter plots and histograms

One easy way to get a visualization of a collection of random points is to generate a jitter plot, which is really a one-dimensional scatter plot, but with a twist. The bottom part of figure 13.8 was created by shifting each data point vertically by a random amount. (The rand(0) function returns a random number in the range [0:1].) If I’d plotted the data in a true, one-dimensional fashion, too many of the points would’ve overlapped, making it difficult to detect clustering. Such jittering by a random amount is a good trick to remember whenever you’re creating scatter plots of larger data sets:

plot "jitter" u 1:(0.25*rand(0)-.35)
Figure 13.8. Three ways to represent a distribution of random points: jitter plot (bottom), histogram (with boxes), and cumulative distribution function (dashed line)

You can see that the distribution of points is skewed. It’s strictly bounded by 0 on the left, with points clustering around 1; and as you move to the right, points become increasingly sparse. But it’s hard to say something more definite just by looking at the jitter plot. For instance, is there a second cluster of points between 3 and 4? This does seem possible, but it’s hard to tell for sure using this representation.

The next step when investigating the properties of a distribution usually involves drawing a histogram. To create a histogram, you assign data points to buckets or bins and count how many events fall into each bin. It’s easiest to make all bins have equal width, but with proper normalization per bin, you can make a histogram containing bins of differing widths. This is sometimes useful out in the tails of a distribution where the number of events per bin is small.

Gnuplot doesn’t have an explicit histogramming function, but you can use the smooth frequency functionality (see section 3.5.3) to good effect. Recall: smooth frequency sorts the x values by size and then plots the sum of y values per x value. That’s what you need to build a histogram.

The next example introduces a function bin(x,s) that takes two parameters. The first parameter is the x value you’d like to bin, and the second parameter is the bin width. The return value of the function is the position of the left edge of the bin—you can use the binc(x,s) function for bins centered on the x value.

The smooth frequency feature forms the sum of all y values falling into each bin. If all you care about is the overall shape of the histogram, you may supply any constant, such as (1); but if you want to obtain a normalized histogram (one with a total surface area equal to unity), you need to take into account the number of points in the sample and the bin width. You can convince yourself easily that the proper y value for a normalized histogram is

1/(bin-width × total-number-of-points-in-sample)

You can use the with boxes style to draw a histogram (see figure 13.8), but you want to fix the width of the boxes in the graph to coincide with the bin width. (By default, the boxes expand to touch their neighbors, which leads to a faulty graphical representation if some of the internal bins are empty.) Choose a bin width of 0.1, and use the stats command to find the number of points in the data set. The plot command uses the binc() function, because the with boxes style positions its boxes centered at the supplied position:

bin(x,s)  = s*floor(x/s)
binc(x,s) = s*(floor(x/s)+0.5)

set boxwidth 0.1

stats "jitter" u 1 noout
plot "jitter"
  u (binc($1,0.1)):(1./(0.1*STATS_records)) smooth frequency with boxes

Figure 13.8 also includes a curve for the cumulative distribution function of the data set. We’ll come back to it in section 13.3.4.

13.3.3. Kernel density estimates and rug plots

The apparent simplicity of the histogramming method hides some pitfalls. The first concerns the width of the bins: make them too narrow, and the resulting histogram will be bumpy; make them too wide, and you lose relevant features. There’s also ambiguity in regard to the placement of the bins: is the first bin centered at 0 (or any other value) or flush left there? In particular for sparse data sets, the overall appearance of the histogram can depend quite sensitively on these details!

A better method to generate distribution curves from individual data points goes under the name kernel density estimation. Rather than count how many data points fall into each bin, a strongly peaked but smooth function (a kernel) is placed at the location of each data point. Then the contributions from all these curves are summed up and the result is plotted. Mathematically, the kernel estimate f(x) for a data set consisting of N points xi is

f(x) = \frac{1}{N h} \sum_{i=1}^{N} K\left( \frac{x - x_i}{h} \right)

Here, K(x) is any smooth, peaked, normalized function, and h is the bandwidth: a measure of the width of the kernel function. A popular example is the Gaussian kernel:

K(x) = \frac{1}{\sqrt{2\pi}} \, e^{-x^2/2}

As you already saw in section 3.5.2, gnuplot can generate such curves using the smooth kdensity functionality. It works in much the same way as the smooth frequency feature discussed earlier:

plot "jitter" u 1:(1./STATS_records) smooth kdensity bandwidth 0.05

The first column specifies the location; the second gives the weight each point should have. For a normalized histogram, this should be the inverse of the number of data points—because the kernel functions are normalized themselves, you don’t have to worry about the bandwidth at this point as you did for histograms using smooth frequency. The bandwidth parameter (preceded by the bandwidth keyword) is optional. If it’s omitted (or negative), gnuplot calculates a default bandwidth, which would be optimal if the data were normally distributed. This default bandwidth tends to be conservative (which means, rather broad).

Figure 13.9 shows several curves drawn using kdensity for the same data set you saw in figure 13.8, for a variety of bandwidth parameters. Studying this graph carefully, you may conclude that there is indeed a second cluster of points, located near 3.5. Note how the choice of bandwidth can hide or reveal features in the distribution of points.
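A plot like figure 13.9 can be built up by repeating the kdensity plot with different bandwidths; the particular values below are chosen for illustration only:

stats "jitter" u 1 noout
plot [0:11] "jitter" u 1:(1./STATS_records) s kdensity bandwidth 0.1 t "bw 0.1", \
     "" u 1:(1./STATS_records) s kdensity bandwidth 0.3 t "bw 0.3", \
     "" u 1:(1./STATS_records) s kdensity bandwidth 1.0 t "bw 1.0"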

Figure 13.9. An alternative to histograms: kernel density estimates using smooth kdensity. Curves for three different bandwidths are shown. A bandwidth of 0.3 seems to give the best trade-off between smoothing action and retention of details. Note how it brings out the secondary cluster near x=3.5. Individual data points are represented through the rug plot along the bottom.

One more thing: figure 13.9 uses a different representation for the individual data points. Instead of using symbols as in the jitter plot included in figure 13.8, this figure uses a rug plot along the bottom edge. Each data point is represented through a short, vertical line. In gnuplot, this can be accomplished through the with vectors style:

plot [:11][-0.1:0.8] "jitter" u 1:(-0.1):(0):(0.05) t '' w vec nohead

I prefer jitter plots, because lines in a rug plot tend to obscure each other, making it difficult to identify individual data points. But rug plots may give a somewhat neater appearance to your graphs and appear less noisy.

13.3.4. Cumulative distribution functions

Histograms and density estimates have the advantage of being intuitive: they show you directly the probability for a certain value to occur. But they have some disadvantages when it comes to making quantitative statements. For example, based on the histogram in figure 13.8, it’s hard to determine how much relative weight is in the tail of the distribution: how likely are values larger than 4 to occur? How about values larger than 6? You can guess that the probability is small, but it’s hard to be more precise. To answer such questions, you need to know the area under the histogram within certain bounds. In other words, you want to look at the cumulative distribution function (or distribution function for short).

The value of the cumulative distribution function at position x is the fraction of events that have occurred with xi less than x. In figure 13.8, I showed the distribution function together with the histogram. To repeat: the value of the cumulative distribution function at position x is equal to the area under the (normalized) histogram from its left border to position x.

Cumulative distribution functions are accessible using smooth cumulative. The smooth cumulative feature is similar to smooth frequency: first all points are sorted in order of ascending x value, and then the sum of all y values to the left of the current position is plotted as a smoothed value. To obtain a normalized distribution function, you must supply 1/number-of-points as the y value. In contrast to histograms or density estimates, distribution functions don’t depend on a width parameter:

stats "jitter" u 1 noout
plot "jitter" u 1:(1./STATS_records) smooth cumulative

Cumulative distribution functions can be a little unintuitive at first, but they’re well worth becoming familiar with. They make it easy to answer questions such as those raised at the beginning of this section. From figure 13.8, you can immediately see that there’s a 3% chance of finding a point at x > 6 and about a 15% chance for x > 4. You can also find more proof for the second cluster of points between 3 and 4: here the distribution function seems to make a jump, indicating an accumulation of points in this interval.

The cumulative distribution function is also very important theoretically and often the basis for further tests and calculations. The discussion of probability plots in section F.6 is a good example and also shows how the results from smooth cumul can be captured into a heredoc for further processing.

13.4. Ranked data

Imagine that I give you a list of the countries in the European Union, together with their land area (in square kilometers) and population numbers. How would you represent this information? How would you represent it graphically?

The particular challenge here is that the independent variable has no intrinsic ordering. What does this mean?

Given the name of a country, the value of the area measure is fixed; hence the name is the independent variable, and the area is the dependent variable. Usually, the independent variable is plotted along the x axis, and the dependent variable is plotted as a function of it. But in this case, there’s no natural ordering of the independent variable. Sure, you can order the countries alphabetically by their names, but this ordering is arbitrary and bears no relationship to the data. (You wouldn’t expect the size of a country to change if you gave it a different name, would you?) Also, the ordering might change if you were to translate the names to a different language. But the information that is to be displayed depends on the areas and shouldn’t be affected by the spelling of the country names.

For data like this, the only ordering that’s intrinsic to the data is in the values of the dependent variable. Therefore, a graphical representation of this data should be ordered by the dependent variable, not the independent one. Such plots are often called dotplots, but I tend to think of them as rank-order plots. Figure 13.10 shows an example.

Figure 13.10. A rank-order plot. Because there’s no natural ordering in the independent variable (in this case, the country names), the data has been sorted by the dependent variable to emphasize the structure in the data.

If the input file is sorted by the appropriate quantity, you can generate such plots easily using gnuplot’s facility for reading tic labels from the input file. Given an input file containing the names and areas in two columns, such as this[4]

4

The data can be found on the Wikipedia page for the European Union.

Malta               316
Luxembourg         2586
Cyprus             9251
Slovenia          20273
...

the entire plot can be generated using the following command:

plot "europeanunion" using ($2/1000):0:ytic(1) w p pt 7

The ytic(1) function selects the values in column 1 as tic labels for the y axis (see section 8.3.5); and the pseudocolumn 0, which evaluates to the line number in the input file, is used as the corresponding vertical coordinate (see section 4.4). Inverting the y range (for example, with set yrange [*:*] reverse) places the first line in the file at the top of the graph instead of the bottom.

This is the basic idea. You could plot the country names along the x axis instead, but then you’d need to rotate the labels to make sure they don’t overlap. Unfortunately, rotating the labels by 90 degrees (so that they run vertically) makes them hard to read. A good trick is to rotate them by some angle so that they run diagonally; but the initial layout, with the names running down the y axis, is the easiest to read.

What if you want to show and compare multiple data sets, such as the land area and the population? The best strategy is to declare a primary data set, which determines the ordering for all others. Figure 13.11 shows an example. The points of the secondary data set (the population in millions) have been connected with lines to make them stand out more. Additionally, the x axis has been scaled logarithmically, which is often useful with dotplots of this sort. You can see that overall the population count follows the area, but there are some notable outliers: the Nordic countries Sweden and Finland are thinly populated, whereas the so-called Benelux countries (Belgium, Netherlands, and Luxembourg) have an exceptionally high population density.
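A plot along the lines of figure 13.11 could be sketched as follows; the assumption here (not shown in the listing above) is that the population, in millions, sits in column 3 of the same file, which remains sorted by area:

set logscale x
plot "europeanunion" u ($2/1000.):0:ytic(1) t "Area (1000 km^2)" w p pt 7, \
     "" u 3:0 t "Population (millions)" w lp pt 6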

Figure 13.11. A rank-order plot displaying a primary and a secondary data set for comparison. The country names are sorted according to the primary data set (the area); the points in the secondary data are connected by lines to make them easier to distinguish. Note the logarithmic scale for the horizontal axes.

Dot- or rank-order plots are useful whenever the independent variable has no natural ordering. On the other hand, if the independent variable can be ordered—even if it’s nonnumeric, such as the categories Strong Dislike, Dislike, Neutral, Like, and Strong Like—you should use the information in your graphs and order data points by the independent variable.

13.5. Pie charts

Pie charts are a way to visualize how an entity breaks down into its constituent parts. Their strongest point (and the only reason to use them) is that, by construction, they represent all components of the overall entity, without any pieces that might possibly be missing. Aside from that, pie charts have many well-known problems. Specifically, comparing the size of different components with similar magnitudes is difficult, as is the representation of more than a handful of components. For such tasks, some variant of rank-order plot (see section 13.4) is usually a better solution.

Gnuplot doesn’t have pie charts built in, but they can be constructed through judicious use of existing gnuplot features. Given the following data set

# Name Percent
A       55
B       25
C       10
D       7
E       3

you can generate the pie chart in figure 13.12 using the following commands:

set size square
set angles degrees
unset key; unset tics; unset border

beg(x) = (b=a, a=a+x, 360.*b)
end(x) = (c=c+x, 360.*c)
med(x) = (u=v+x/2, v=v+x, 360.*u)

a=b=c=u=v=0
plot [-1:1][-1:1] \
  "pie-data" u (0):(0):(1):(beg($2/100.)):(end($2/100.)) w circ, \
  "" u (0.75*cos(med($2/100.))):(0.75*sin(360.*u)):1 w labels
Figure 13.12. A pie chart

The challenge is that the with circles style requires the beginning and end angle of each pie segment (each wedge). But the data set only gives the percentage for each component, which corresponds to the angle enclosed (or subtended) by the wedge. The function end(x) therefore keeps a running total of all percentages encountered so far: this is the leading edge of the current slice. The function beg(x) does the same, but it always reports the most recent value: this is the trailing edge of the current slice. The function med(x) reports a value halfway between the leading and the trailing edges: this is where the label is placed. Don’t forget to set (or reset) all variables to 0 before each plot!

13.6. Organizational issues

The basic techniques of graphical analysis don’t just range over types of graphs and when to use them, but also include some recurring administrative tasks and organizational questions, mostly relating to the handling of data sets and the files containing them. In this section, I’ll make some recommendations on that.

I also want to stress that graphs are prepared for different purposes. Graphs that are prepared merely for personal use and discovery require little polishing, but this isn’t true for presentation graphics that are intended to communicate results to an audience. Presentation graphics are important: they document completed work. I’ll close this section with some recommendations for the preparation of attractive and valuable presentation graphics.

13.6.1. The lifecycle of a graph

It’s helpful to have a sense for the life expectancy of your graphs: short (seconds to days, for interactive exploration and ongoing work), intermediate (days to weeks, for intermediate results), and long (weeks to infinity, for final results and public consumption). Treat graphs differently, based on their expected lives: for short-lived graphs, ease of use is the predominant concern. Nothing should stop you from redrawing the graph, changing the range or plotting style, or applying a transformation. Any amount of polishing is too much.

For graphs in the intermediate bracket, polishing is still not required, but contextual descriptions (title, units, key) become important: not necessarily to communicate such details to others, but to serve as reminders to yourself, should you come back to your work after a few days or weeks of absence. (It’s amazing how quickly even essential details can be forgotten.)

For long-lived graphs, and those published or presented publicly, different rules apply. Such graphs belong to presentation graphics proper, and I’ll have a bit more to say about that topic in section 13.7.

13.6.2. Input data files

Data files should be reasonably self-contained and self-explanatory. Let me explain.

When I was looking for data sets to use as examples for this book, I checked out quite a few publicly accessible data-set libraries on the web. In one of them, I found a data set that, according to the description on the website, contained annual sunspot numbers for a certain specified range of years. I downloaded the corresponding file together with several other data sets from the website, and only then started to examine the contents of each file in detail.

Each file consisted of only a single column, containing the dependent variable—and nothing else! Looking at these files alone, it was no longer possible to determine which one contained what, be it sunspot numbers or the price of pork bellies, or some other data set. Because the independent variable was missing as well, I couldn’t even tell whether I was looking at monthly or yearly data, and, in any case, for what time frame. In other words, the mere act of downloading a file had turned it into instant garbage by irrevocably separating it from the information that gave it meaning!

To avoid such embarrassments, it’s generally a good idea to keep the basic information that’s necessary to understand the data contained in the data set as part of the data file, typically as a block of comments near the beginning of the file. (I find it more convenient to have this information at the top of the file than at the end.) The information contained in such a header is more or less the same information you’d put onto a graph in the form of textual labels and descriptions. Here are some pointers:

  • Most important is a description of the file format (see the sketch after this list). It should at least include a description of the content of all columns, together with the units used. It’s nice if it also contains a brief overall headline description of the file. If ancillary information would be required to re-create the data set, it should also be included: things such as parameter settings for the experimental apparatus, or starting values and the version number for the simulation program used to calculate the data. If the data was downloaded from the web, I always make sure to keep a reference to the source URL in the file.
  • I also recommend being generous when it comes to the inclusion of redundant information. The sunspot data I mentioned earlier is an interesting example. The lack of data for the independent variable made it more difficult to use and understand the contents of the file. Given the starting value and the increment, it’s trivial to reproduce the values, but it’s generally much more convenient to have all this information available already.
  • The ability to reproduce a data file if necessary is critical. If you combine data from several sources into a single data set, or manually edit a data set to remove glitches, keep the originals. And stick a note in the combined file explaining its origins. More likely than not, you’ll have to come back and do it all over again at some point. Remember: if you delete the originals, they’re gone forever.
  • Unless there are extenuating circumstances (and even in most cases when there are), prefer plain text files over binary formats. The advantages of plain text are just too numerous: plain text is portable across all platforms (despite the annoying newline character issue) and can be manipulated using standard tools (editors, spreadsheets, scripting languages). A corrupted text file can be repaired easily—not so for a binary file. A text file is by construction more self-explanatory than a binary file will ever be. And finally, text files compress nicely and therefore don’t take up much disk space.
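As an illustration only, a self-describing header for the sunspot file mentioned earlier might have looked something like this (the details are made up):

# Annual sunspot numbers, yearly means
# Source: <URL of the original data set>
# Column 1: year
# Column 2: mean annual sunspot number (dimensionless relative index)
...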

Some statistics packages keep their data sets in proprietary binary formats. I find the constant need to invoke a special, proprietary program just to view the contents of a file a major inconvenience, and the dependence on the operations provided by the package for all data-manipulation tasks is an unwelcome hindrance to my work.

If you have a legitimate need for a binary file format, at least use an existing standard, for which quality libraries are easily available. Ad hoc binary formats are almost certainly a bad idea.

Something we tend not to think about often is clerical errors: typos, incorrectly named files, data entered into the wrong column, wrongly recorded units—that sort of thing. They’re apparently more common than you might think: Cleveland and Wainer[5] give interesting examples of odd entries in well-known data sets and provide convincing arguments that explain these entries as due to clerical errors, such as the interchange of digits or the inadvertent switching of columns for the data points in question.

5

See section 6.4 in Visualizing Data by W. S. Cleveland (Hobart Press, 1993) and the introduction to Graphic Discovery by H. Wainer (Princeton University Press, 2005) for details.

Finally, make sure critical data is backed up frequently. (I once accidentally deleted data that had taken weeks to compute. Praise your system administrators!)

13.6.3. Output files

The most important advice (again) is to make plots reproducible. Don’t just export to a printable format and move on. It’s almost guaranteed that you’ll want to redraw the graph, with minor modifications, as more data becomes available or the understanding of it grows. Always save the plotting commands and options to a file before moving on.
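Gnuplot’s save and load commands are one convenient way to do this; the file name below is arbitrary:

save "weight-vs-price.gp"        # writes the current settings and the last plot command
# later: load "weight-vs-price.gp" re-creates the graph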

Use an appropriate file format. Bitmaps (such as PNG) are simple and portable but can’t be resized without loss of quality. Whenever there is a chance that you (or whoever is using your graphs) might need to resize a graph, make sure to use a vector format (such as PDF). This is particularly true for print publishing: not only will graphs routinely be resized, but only vector formats will take advantage of the higher resolution offered by printing devices.

Decide whether you need to prepare your graphs in color, black and white, or halftone (grayscale). Consider the use of stylesheets (see section 12.6) to target graphs to different display channels. I routinely create different versions of the same graph (using different stylesheets) so I’m set for all possible applications. Adopting a naming convention helps to keep different versions organized.

13.7. Presentation graphics

This book isn’t primarily about presentation graphics, but about graphical analysis. There’s already plenty of advice on presentation graphics, and you’ll have no difficulty finding it, but not much of it appears to be based on rigorous studies. Nevertheless, the advice is often worded assertively, if not emotionally, and there’s an unfortunate (and ultimately unhelpful) tendency toward the derision of work considered inadequate. Given the lack of rigorous evidence, tastes and personal opinions naturally play a large role.

I don’t intend to add to this debate. Instead, I’d like to present a list of reminders concerning details that are easily (and inadvertently) forgotten when preparing a graph for publication. Most of them concern the inclusion of contextual information, which can’t be inferred from the graph itself, but which is nevertheless important for understanding. By publication I mean any form of wider distribution of the graph: anything leaving the immediate group of coworkers who were involved in its creation, and in particular any use with a long expected lifetime.

These aren’t commandments, but reminders. Use your own good judgment:

  • Axes should be labeled. The labels should describe the quantity plotted and should include the units of measurement. Don’t leave this information to a separate figure caption. Keep in mind that the graph may be separated from its context, in which case information contained only in the caption may be lost. The caption should be used to describe salient features of the graph, not to give vital information about the plotted data.
  • Choose meaningful, self-explanatory labels. If you must use abbreviations that aren’t in common use, explain them, ideally on the graph itself, not in the caption (see previous item). (In a recent book on data graphing, I found a figure of a histogram in which the buckets were labeled Married, Nvd, Dvd, Spd, and Wdd. I marveled about the possible meaning of Nvd for quite a while. The abbreviations were explained neither in the text nor in the figure caption.)
  • If there’s more than one line in a graph, explain what each of the lines represents, either through a key or by using an explicit label referring to each line. If you use a key, make sure the key entries are meaningful.
  • If there’s ancillary information, consider placing it onto the graph, rather than supplying it only in the caption.
  • If at all possible, make sure text labels, including the key, don’t obscure the data. If necessary, move or rewrite them.
  • When publishing a false-color plot, always include the associated color scale in the graph. No matter how intuitive the chosen palette may appear to you, remember that there’s no universal and unambiguous mapping from numbers to colors and vice versa.
  • Describe the meaning of error bars. Do they show the calculated standard deviation of the sample population? Do they represent interquartile ranges? Or do they indicate the limits of resolution of your experimental apparatus? This information can’t be inferred from the graph but must be explained through textual information.
  • Use an appropriate measure of uncertainty. Don’t show standard deviations for highly skewed or multimodal populations just because they’re easy to calculate.
  • Choose meaningful plot ranges. The graph should display the relevant part of the data, and the data should make use of the entire available space. It may be necessary to use more than one graph to convey all the relevant detail: it isn’t unreasonable to have one graph to exhibit the overall shape of the data and one that zooms in on a particular region at greater resolution. Keep in mind that the relevant information in a data set is sometimes the trend and sometimes the absolute value. Make sure the chosen plot range is appropriate given the context.
  • Consider logarithmic scales if the variation in the data is too great to be represented otherwise.
  • Don’t be shy about choosing a different font if the default font looks ugly. Given the importance of textual information on a graph (labels, tic marks, keys), make the necessary effort to ensure that all text is clearly readable even after the graph has been reproduced and possibly reduced in size a few times. (On the other hand, making labels too big or too bold can easily ruin the overall appearance of a plot. Experiment!)
  • In general, sans serif fonts (such as Helvetica) are preferred for standalone pieces of text, whereas serif fonts (such as Times Roman) are considered more suitable for body text. Because labels on a graph tend to be short, this suggests using a good sans serif font in plots. (I also find that sans serif fonts enhance the visual clarity of a graph, whereas serif fonts don’t, but others may disagree. Judge for yourself.)
  • Use an appropriate aspect ratio. Human perception seems to have an affinity to figures with an aspect ratio of roughly 10:7, in landscape orientation (that is, wider than tall). In particular, it’s generally not a good idea to use a square graph unless the symmetry of the data demands it.
  • Don’t use bitmapped graphics formats (PNG, GIF, JPG) in print publications. Use vector formats such as PDF, PostScript, and EPS instead.
  • Proofread your graphs. Common error spots include typos in textual labels, switched data sets or interchanged labels, and omitted qualifiers (milli-, kilo-, and so on) for units.

13.8. Summary

Understanding data with graphs typically involves certain recurrent tasks and questions. How one quantity depends on another is such a question; how a bunch of data points are distributed is another. Other recurrent problems have to do with data that spans many orders of magnitude, and data that isn’t strictly numeric.

In this chapter, we discussed the standard types of graphs to investigate such questions, and showed how to create them using gnuplot. All these methods and plots are very general—in the next chapter, we’ll concentrate on some more specific problems and explore them in depth.
