Chapter 4. Managing data sets and files

This chapter covers

Data-file formats and options
Managing structured data sets
Accessing columns and pseudocolumns
Pseudofiles and input/output redirection
Metadata in data files

Working with data is what gnuplot is all about. Data is usually presented to gnuplot as files, and in this chapter you’ll learn a range of useful tricks that can make you more productive when working with gnuplot.

We’ll begin with a review of gnuplot’s standard file format and discuss options to modify its defaults. We’ll then look at some special gnuplot syntax to pick out only parts from larger files. This is required when dealing with data sets that are larger or more complicated than the straightforward ones you’ve encountered so far, and it’s often useful for simpler files as well.

Next, we’ll systematically consider all the ways you can access columns—in particular, in conjunction with inline transformations. In the process, you’ll also encounter pseudocolumns, which are synthetic columns that gnuplot automatically provides when reading a file. Pseudocolumns hold useful information (such as line numbers) and often come in handy when you’re constructing inline transformations. Then we’ll turn to files that aren’t: pseudofiles are data sources that don’t live on the disk but that you can treat almost as if they were regular files.

But data files often contain information beyond the actual data that goes onto a plot. Such metadata typically consists of textual information that you may want to place onto a plot in the form of text labels. In this chapter, where we talk about data files in general, we’ll merely summarize the kind of metadata that gnuplot can take from a file and use in a plot. The details will follow in their proper context when we discuss how to handle text labels and similar auxiliary information in a plot in chapters 6 through 8.

4.1. Quickstart: the standard data-file format

Data is usually fed to gnuplot as plain text in a simple, whitespace-separated format: gnuplot neither uses nor requires special proprietary binary formats or weird markup. Nor do you need to use special tools or libraries to generate data that constitutes suitable input for gnuplot.

Input files may also contain comments, text strings, and other information, in addition to the data. Furthermore, many aspects of this format can be changed through options. We’ll consider each of these topics in turn.

By default, gnuplot data files are text files that contain data in whitespace-separated columns, with one line per record. This is a natural tabular format—basically, the way you would lay out information in a spreadsheet or the way it would result from a database query. Listing 4.1 shows the beginning of a data set with information on customer orders as received by an online retailer. There’s a lot of information in such a file; in particular, the relationships between different columns are often interesting. Figure 4.1 shows four different plots, using different plot types, based on this file. It includes a time-series plot of the number of orders received per day, a histogram of the number of items per order, and a scatter plot investigating the relationship between the weight of an order and its value. The last panel in figure 4.1 consists of a parallel-coordinates plot: a special graphical technique to identify relationships in multivariate data sets. (For the sake of completeness, the commands to generate figure 4.1 are given in listing 4.2. Don’t worry if you don’t understand all this yet; we’ll cover these commands in the chapters to come.)

Figure 4.1. Different types of graphs, all based on a single data file. Clockwise from top left: a time-series plot of the number of orders per day, a histogram showing the distribution of the number of items per order, a parallel-coordinates plot of several order attributes (with ground-shipping orders highlighted), and a scatter plot of order value versus order weight.

Listing 4.1. Incomplete generic data file—see figure 4.1 (file: orders)

# ID      Date         CustomerID   ItemCount   Weight   Value   ShipMethod
oid0012   2014-12-01   cid59        2           2.13     11.39   1-Regular
oid0014   2014-12-01   cid61        4           7.11     15.44   0-Ground
oid0015   2014-12-01   cid30        2           2.33     11.11   0-Ground
oid0025   2014-12-01   cid87        4           4.48     16.55   2-Express
oid0034   2014-12-01   cid08        4           4.35     16.95   1-Regular
...

Listing 4.2. Commands for figure 4.1 (file: orders.gp)

4.1.1. Comments and header lines

Lines beginning with a # are considered comment lines and are ignored. You can also instruct gnuplot to ignore the first n lines of a file by using the skip directive in the plot command:

plot "data" skip 2 using 1:2

Now gnuplot will discard the first two lines of the file, regardless of whether they’re commented out. This is a convenient way to get rid of header lines without having to comment them out explicitly.

Tip

You can use the new skip keyword to discard the first n lines of a file as it’s being plotted.

4.1.2. Selecting columns

You can pick out any column and plot its values against any other column by means of the using directive. If no using directive is found, gnuplot plots the second column against the first (or, if the file contains only one column, that column against the line number); if only a single column is specified in the using directive, gnuplot plots that column against the line number.

For the most part, input-file parsing with gnuplot is very robust and works without much tinkering. One good piece of advice is to always specify all required columns explicitly via the using directive. If this is done, gnuplot silently skips any garbage (fields it can’t parse) in the file, treating them as missing values. If you rely on the default columns (without the using directive), gnuplot will instead silently bail when it encounters an unparsable field. This is most likely to happen when you’re doing casual work with small files containing only two columns, and it can sometimes lead to mysterious failures. My advice: make it a habit always to specify all columns with using.

Tip

If you specify all columns to plot explicitly in the using directive, gnuplot will correctly parse even files that are partially garbled or that contain garbage.

Time and date information in a data file constitutes a special case. Gnuplot provides special facilities to parse timestamps, which we’ll discuss in detail in section 8.4.2. Do not attempt to parse time and date information yourself—use the appropriate options!^[1]

¹
In listing 4.2, I chose to manipulate the date value explicitly using string functions, but only because in the crowded multipanel figure, I needed to exert finer control over the appearance of the tic marks.

Plain text as data format

Using text files as primary data stores has a number of advantages: you don’t need special tools to generate or read them, and it’s easy to write scripts to pre- or post-process them. If they’re large, it’s simple to compress them. And if necessary, you can even load them into any text editor (or a spreadsheet!) and manipulate them by hand.

Furthermore, text files are largely portable across most current computer architectures (so that files generated on Unix, say, can be opened and read on Windows)—except for the choice of the linebreak indicator (newline character). Gnuplot expects lines in input files to be terminated by whatever the local operating system (or rather, the local C library) considers the “native” newline character. If you encounter problems reading files generated on a different platform, try converting newlines to the local format. (The same argument applies to the files gnuplot writes: lines are terminated with the native newline character on the given system.)

Gnuplot string handling is based around Unicode and UTF-8. We’ll have more to say about both topics in the sidebar in section 5.1 in the next chapter.

Gnuplot currently does not have built-in support for data files in a hierarchical or markup format, such as XML, JSON, or HDF5. If you want to plot such data sets, you’ll have to extract them to plain, column-oriented text files first.

4.2. Managing structured data sets

Sometimes data sets have more internal structure than the simple one-record-per-line model you’ve seen so far. Two cases in particular are reasonably common and must be dealt with: files containing data blocks—that is, different data sets one after another in a single file—and files containing records that span several lines each. Let’s look at how to handle each case in turn.

4.2.1. Multiple data sets per file: index

Here’s a common scenario: monitoring web traffic across multiple hosts. A script logs in to each host, pulls the log files for the last few days, summarizes the hits per day, and writes the results to an output file before moving on to the next host. You want to use gnuplot to look at the contents of the resulting file.

The important issue here is that the resulting file doesn’t contain individual data points, but entire data sets: one for each host. Each data set spans several rows: one for each day. The file might look something like the one shown in the following listing.^[2]

²
The data in this data set is dorky. This is intentional. I want to make sure you can distinguish data from each of the three hosts by looking at the data (the number of hits) itself. This will be important when we compare this data set with the one in listing 4.4.

Listing 4.3. Data file containing several sets (file: traffic1)

This data file looks as if several distinct data files (for different hosts) had been appended to one another. Because this situation is sufficiently common, gnuplot provides a way to handle it. But first, we need to look at the meaning of blank lines in a data file.

For gnuplot, blank lines in a data file are significant. A single blank line indicates a discontinuity in the data. The data above and below the blank line is treated as belonging to the same data set (and is shown using the same line color and line style), but no connecting line is drawn between the records before and after the blank.

In contrast, double blank lines are used to distinguish data sets in the file. Each set can be addressed in the plot command as if it were in a separate file using the index directive to plot. Table 4.1 summarizes these differences.

Table 4.1. Meaning of blank lines in data files

Separator	Meaning
Single blank line	Indicates a discontinuity in the data. No connecting line is drawn between points separated by a single blank line.
Double blank line	Indicates a different data set (or data block) in the same file. Different data blocks can be selected using the index keyword with the plot command.

The index directive follows immediately after the filename in the plot syntax and takes at least one argument specifying which data set to select from the file. The argument can be either numeric or a string.

Selecting data sets by position

A numeric argument is treated as an index into an array, following the C language convention of counting from 0 (zero). Therefore, to plot only the traffic for the staging site, you can use

plot "traffic1" index 1 using 1:2 w linespoints

This picks out only the data set with index 1 and shows it with linespoints.

The index directive can be abbreviated as i. It can take up to three arguments, separated by colons (similar to the syntax familiar from using):

index {int:start}[:{int:end}][:{int:step}]

If only a single argument is given, only the corresponding data set is plotted. If two arguments are present, they’re treated as the index of the first and last data set (inclusive) to be shown: plot "data" index 2:5 plots four data sets total. A third argument is interpreted as a step size. Accordingly, plot "data" index 2:5:2 plots only the data in sets 2 and 4. Only the first argument is mandatory.
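
As a minimal sketch of positional selection, the following script embeds a small, made-up traffic file as an inline data block. The name $traffic and all numbers are invented for illustration and aren’t the contents of listing 4.3; the sketch assumes gnuplot 5 or later.

```gnuplot
# Hypothetical data: three data sets separated by double blank lines.
$traffic << EOD
1 100
2 110
3 120


1 200
2 210
3 220


1 300
2 310
3 320
EOD

# Only the second data set (index counts from zero)
plot $traffic index 1 using 1:2 with linespoints

# First and third data sets: start 0, end 2, step 2
plot $traffic index 0:2:2 using 1:2 with linespoints
```

Each plot command replaces the previous graph, so run them one at a time to compare the results.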

Selecting data sets by name

Gnuplot 5 offers a new way to pick out a single data set from a file: if the data set is preceded by a comment line that includes the ID or name of the data set, you can use this identifier in the index directive. The comment character and any leading whitespace are removed; the following characters are treated as the “name” of the data set that follows the comment line in the file.

Referring again to the file shown in listing 4.3, the following two commands are equivalent:

plot "traffic1" index 1 using 1:2 w linespoints
plot "traffic1" index "host=staging" using 1:2 w linespoints

Only a single data set can be selected this way. If the “name” given is ambiguous, then gnuplot plots the first data set in the file that matches. (Matching is case-sensitive.) Referring again to listing 4.3, the following three commands are all unambiguous and result in an identical plot:

plot "traffic1" index 2 using 1:2 w linespoints
plot "traffic1" index "host=t" using 1:2 w linespoints
plot "traffic1" index "host=test" using 1:2 w linespoints

In contrast, plot "traffic1" index "host" u 1:2 w lp is ambiguous, and gnuplot will plot the first matching data set, which is simply the first data set in the file (the one for the main host).
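
Selection by name can be sketched the same way. The data block $traffic and its numbers are again invented; only the host=... comment convention is taken from the text. The sketch assumes that named selection behaves for inline data blocks as it does for files.

```gnuplot
# Hypothetical data: each data set is preceded by a comment naming it.
$traffic << EOD
# host=main
1 100
2 110


# host=staging
1 200
2 210


# host=test
1 300
2 310
EOD

# Select the data set whose preceding comment reads "host=staging"
plot $traffic index "host=staging" using 1:2 with linespoints
```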

4.2.2. Records spanning multiple lines: the every directive

Whereas the index directive lets you select consecutive sets of data from a file, the every option, which we discuss now, solves a different problem. Consider the same arrangement of hosts as in the previous section, together with an auxiliary script that gathers traffic data from all hosts and dumps it into a single file. But in contrast to the previous scenario, the script now retrieves only a single day’s worth of traffic and appends it to the output file before moving on to the next host. After a few days, the output file might look like the one shown next.

Listing 4.4. Data file containing interleaved data sets (file: traffic2)

Here, each record for a single day spans three lines: one for each host. (Compare this file against the one in listing 4.3: the data in both is exactly the same!) If you want to plot the traffic for each host separately, you can use the every directive to pick up only the relevant subset of all lines. The following command, for instance, plots only the traffic for the staging host:

plot "traffic2" every 3::1 using 1:2 with lp

Using the every directive, you can control how you step through individual lines. The syntax looks similar to the syntax used for index, except that individual arguments are separated by two colons. Unfortunately, this similarity is somewhat deceiving, because the order of the arguments isn’t the same for every as it is for index:

every {int:step}[::{int:start}[::{int:end}]]

The first argument is the increment, followed (optionally) by the first and last line numbers. Line numbers are counted from zero. Don’t forget to use double colons with the every directive: single colons won’t generate an error message but will lead to strange and hard-to-predict behavior.^[3]

³
I’m simplifying here. Gnuplot recognizes an additional concept known as a data block in a file: a set of consecutive lines delimited from each other using single blank lines. Data blocks are functionally redundant with data sets (delimited by double blank lines). Data blocks can be selected through additional arguments to the every directive, which are placed between the double colons. This is why it’s not illegal to use single colons in this context. If you want to know more about data blocks, check the standard gnuplot reference documentation.

You can play other tricks with the every feature. For instance, the following command skips the first two lines of every data set in the file:

plot "data" every ::2 using 1:2

This is similar to, but not quite the same as, the skip facility: skip only discards lines at the beginning of the file, not at the beginning of every data set in the file. Moreover, every only applies to data lines, whereas skip also skips comments. Feel free to devise other creative uses for the every keyword!
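
To make the stepping concrete, here is a small sketch with invented, interleaved data in the spirit of listing 4.4 (the name $traffic2 and the numbers are assumptions, not the listing’s contents). Each plot element uses the same step but a different starting line.

```gnuplot
# Hypothetical interleaved data: day, hits.
# Lines cycle through the hosts main, staging, test.
$traffic2 << EOD
1 100
1 200
1 300
2 110
2 210
2 310
3 120
3 220
3 320
EOD

# Step 3, starting at line 0, 1, or 2 (line numbers count from zero)
plot $traffic2 every 3::0 using 1:2 with linespoints title "main", \
     $traffic2 every 3::1 using 1:2 with linespoints title "staging", \
     $traffic2 every 3::2 using 1:2 with linespoints title "test"
```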

Tip

I recommend that you use a file format suitable for the index facility when you need to combine several data sets in a single file. The every keyword is a bit of a legacy feature, and data handling based on index is more robust and transparent.

4.3. File format options in detail

You can control several aspects of the file format using options. For instance, you can choose additional characters to indicate comment lines in data files. In this section, we look at the fine points of input-file formats: numbers, missing values, comments, and strings.

4.3.1. Number formats

Gnuplot can read both integers and floating-point numbers, as well as numbers in scientific notation (strictly speaking, scientific E notation): a floating-point number, followed by an uppercase or lowercase character e, followed by an integer, which is interpreted as a power of 10. The numeric value of such a field is obtained by multiplying the floating-point part by 10 raised to the appropriate power. A couple of examples will make this clear: in scientific notation, the value 35,100 is encoded 3.51e4; the value -0.0001 is written -1e-4.

You can also allow the letters d and q (both uppercase and lowercase) instead of e or E for Fortran D and Q constants, by setting

set datafile fortran

This option is off by default, because it requires additional parsing; it should be enabled only if needed.

4.3.2. Comments

You can include comments in a data file on lines starting with the comment character (#). The line must start with the comment character and is ignored entirely. If gnuplot encounters a # in any location other than the first one in the line, it isn’t interpreted as a comment character, and any text following it is interpreted as additional data. This isn’t a problem as long as only columns preceding it are specified in the using declaration of the plot command.

You can make gnuplot interpret additional characters as comment characters by using the set datafile commentschars command:

set datafile commentschars ["{str:chars}"]

For example, to tell gnuplot that the exclamation point indicates a comment line in a data file, you can say

set datafile commentschars "!"

The string can contain any number of characters, each of which is interpreted as a comment character if found at the beginning of a line. Resetting this option to a new value overrides all previous settings.
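
Because a new value replaces the old one, you need to list # again if you want to keep it. A minimal sketch, using an invented data block named $notes:

```gnuplot
# Treat both # and ! as comment characters. Setting only "!" would
# replace the previous setting, so # would no longer mark a comment.
set datafile commentschars "#!"

# Hypothetical data with comment lines in two styles
$notes << EOD
# x  y
! a remark written in a different comment style
1 2
2 4
3 3
EOD

plot $notes using 1:2 with linespoints
```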

4.3.3. Field separator

By default, fields (columns) are separated from one another by whitespace, which means any number of space or tab characters. You can change the field separator using the set datafile separator command:

set datafile separator [ "{str:char}" | whitespace | tab | comma ]

You can use the symbolic names whitespace, tab, and comma, or specify a string of characters. If you provide an explicit string, then each character in the string will be treated as a separator character.

For example, to make both the colon and the pipe character individually be field separators, you’d use

set datafile separator ":|"

To let gnuplot parse comma-separated (CSV) files, either of the following two commands will work:

set datafile separator ","
set datafile separator comma

Separator characters aren’t interpreted as separators when inside quoted strings: quoted strings are always interpreted as the entry of a single column.

Each character in the separator string acts as a field separator by itself; it isn’t possible to specify a multicharacter pattern as a field separator. (Whitespace is the exception: any run of consecutive spaces or tabs counts as a single separator.) To reset to gnuplot’s default behavior, you can issue set datafile separator whitespace or set datafile separator without an argument, so that columns are split on whitespace again.
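
The following sketch reads a few comma-separated records from an inline data block. The name $csv and the records are invented for illustration; note the quoted first field, which contains a comma but still counts as a single column.

```gnuplot
# Hypothetical CSV data: name, x, y
$csv << EOD
"Smith, J",1,10.5
"Jones, A",2,12.0
"Brown, K",3,9.25
EOD

set datafile separator comma
plot $csv using 2:3 with linespoints

# Restore the default so that later plots split on whitespace again
set datafile separator whitespace
```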

4.3.4. Missing values

You can use the set datafile missing command to specify a string to be used in a data file to denote missing data:

set datafile missing ["{str:str}"]

An example is set datafile missing "NaN", which interprets the IEEE floating-point indicator NaN (“not a number”) as a missing value. There’s no default value for this parameter.

Having an indicator for missing values is important when you’re using a whitespace-separated file format: if the missing value were left blank, gnuplot wouldn’t recognize it as a column value and would use the value from the next column instead.

Tip

In a whitespace-separated format, you must use an explicit missing-value indicator. It isn’t permissible to leave a missing value blank.

The interpretation of missing values in a data set depends on the precise syntax of the using directive. Let’s look at two examples. The following listing shows a file containing a missing value.

Listing 4.5. Data file containing a missing value (file: missing)

First, let’s assume that datafile missing has been set to "NaN". If you now plot this file using

plot "missing" using 1:2 with linespoints

then the fifth record (the one containing the missing value) is ignored and the data is plotted with one continuous, unbroken line. In contrast, if you use the command

plot "missing" using 1:($2) with linespoints

or if the datafile missing option doesn’t equal "NaN" (either because this option is undefined or because it’s been set to a different value), then gnuplot also ignores the fifth record but treats it as a blank line and therefore doesn’t draw a connecting line across the gap (see figure 4.2).

Figure 4.2. Gnuplot treats missing values differently, depending on the value of the `datafile missing` option and the `plot` syntax. The data file is shown in listing 4.5.
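
The two cases can be tried side by side with a small inline data block. The data below is invented (it isn’t listing 4.5), and the treatment of missing values has changed between gnuplot releases, so the comments restate the behavior described above rather than guarantee it for every version.

```gnuplot
# Hypothetical data: the fifth record has no value in column 2.
$missing << EOD
1 10
2 12
3 11
4 13
5 NaN
6 14
7 12
EOD

set datafile missing "NaN"

# Column given by number: record skipped, line drawn without a break
plot $missing using 1:2 with linespoints

# Column given as an expression: record skipped, gap left in the line
plot $missing using 1:($2) with linespoints
```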

4.3.5. Strings in data files

Gnuplot can read and process text fields found in input files. A valid text field can be any string of printable characters that doesn’t include blank spaces. If the string contains blanks, it must be enclosed in double quotes to prevent gnuplot from interpreting the blanks as column separators. (Single quotes don’t work!) The enclosing double quotes are stripped off and aren’t part of the field’s value. If a field contains whitespace and is protected by enclosing double quotes, it must not contain double quotes as part of the string value. If you need to use quotation marks and blanks in the same string, you must use single quotes inside the string and double quotes to enclose the entire field. If you’ve designated a non-whitespace character as a column separator using set datafile separator (see the previous section), the same considerations apply: strings containing the separator must be protected with double quotes. Listing 4.6 shows some ways you can use strings in a data file.

Tip

You can include strings in data files. If the strings contain whitespace, they must be enclosed in double quotes. Single quotes won’t work.

Listing 4.6. Strings in data files need quotes only if they contain whitespace.

For more information on string handling, check section 5.1 in the next chapter.
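
As a minimal sketch, the following data block mixes unquoted and quoted text fields. The names and numbers are invented and aren’t the contents of listing 4.6. The xticlabels() function used here to display the strings belongs to the material on labels in later chapters; it appears only to show that each quoted field is read as one column.

```gnuplot
# Hypothetical data: a text field containing blanks needs double quotes.
$cities << EOD
Boston          1  4.6
"New York"      2  8.4
"Los Angeles"   3  3.9
EOD

# Columns 2 and 3 are numeric; column 1 supplies the x-axis tic labels
plot $cities using 2:3:xticlabels(1) with linespoints
```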

4.4. Accessing columns and pseudocolumns

In chapters 2 and 3, you saw how to access columns in using directives. Here, you’ll review what you know already and learn some new features. Table 4.2 summarizes all methods of accessing columns in gnuplot.

Table 4.2. Column-access methods, pseudocolumns, and column-access functions. (An entry in the first column in italics is a placeholder for any matching value. Non-italic entries must be used verbatim.)

Specifier	Example	Description
number	using 1:2	Accesses a column by its horizontal position in the file.
string	using "Height":"Weight"	Accesses a column by its name as given in the first non-comment line in the file.
0	using 0:2	Pseudocolumn, containing the record number (starting from zero) in the current data set.
-1	using -1:2	Pseudocolumn, containing the number (starting from zero) of the current block of lines within the data set. Incremented by a single blank line; reset by a double blank line.
-2	using -2:2	Pseudocolumn, containing the index (starting from zero) of the current data set. Incremented by a double blank line.
(expression)	using 1:($2+$3)	Inline transformation. The expression in parentheses is evaluated, and the value of the expression is plotted. The column values for the current record are available through the shorthands $1, $2, and so on.
(constant)	using 1:(1)	Same as the previous, but the expression is a constant, independent of the values in the data file.
(column(expression))	using 1:(column($3+1))	The argument to the column() function should be an expression that evaluates to an integer. The value of this expression is interpreted as a column number, and the column() function returns the value of the current record for the desired column as a numeric value. (For example, column(3) returns the current value of the third column as a numeric value.)
(column(string))	using 1:(column("Weight"))	If the argument to the column() function is a string, it’s interpreted as the name of a column (as given in the first non-comment line in the data file). The column() function returns the value of the current record for the desired column as a numeric value.
(stringcolumn(expression)) or (strcol(expression))	using 1:(stringcolumn( $3+1))	The argument to the stringcolumn() function should be an expression that evaluates to an integer. The stringcolumn() function returns the value of the current record for the desired column as a string value.
(stringcolumn(string)) or (strcol(string))	using 1:(stringcolumn( "ID"))	If the argument to the stringcolumn() function is a string, it’s interpreted as the name of a column (as given in the first non-comment line in the data file). The stringcolumn() function returns the value of the current record for the desired column as a string value.
(timecolumn(expression, formatstring))	using 1:(timecolumn(3, "%Y-%m-%d"))	Parses a column entry as a timestamp and returns the corresponding Unix epoch second as numeric value. The first argument must evaluate to an integer, identifying a column. The second argument specifies a time format, according to table 8.4 or 8.5.
(valid(expression))	using 1:(valid(2)?$2:$3)	The argument to the valid() function must be an expression that evaluates to an integer (not a string). The valid() function returns 1 if the current record for the identified column is a valid number or zero otherwise.

4.4.1. Accessing columns by position or name

Columns are usually accessed by their position in the file (counting left to right, starting at one):

plot "data" using 1:2

This is convenient for files with only a few columns, but it can also work well for truly large files, because it’s easy to iterate over numeric column specifiers with gnuplot’s new inline looping feature. The following command plots columns 2 through 24 against the first column in a single plot (more on this kind of looping construct in section 5.4; we’ll treat loops in greater detail in chapter 11):

plot for [j=2:24] "data" u 1:j

You can also specify a column by name if the data file contains a set of column labels in the first row of the file (or the data set, if the file contains multiple data sets addressed using index). For example, for the file shown in listing 4.7, the following commands are equivalent:

plot "grains" skip 1 using 1:2
plot "grains" using 1:"Wheat"
plot "grains" using "Year":"Wheat"

Listing 4.7. Column names in a data file (file: grains)

Observe that the line containing the labels must not be a comment line. Furthermore, if any of the entries in the using phrase are strings, the entire first line is interpreted as labels and isn’t included in the plot.
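
A small sketch of access by name, using an invented data block. Only the column names Year and Wheat come from the commands above; the other columns and all values are made up for illustration.

```gnuplot
# Hypothetical data with column labels in the first (non-comment) row
$grains << EOD
Year  Wheat  Barley  Rye
2010  10.2   5.1     2.3
2011  11.0   4.8     2.1
2012   9.7   5.4     2.6
EOD

# Columns selected by name; the label row itself isn't plotted
plot $grains using "Year":"Wheat" with linespoints, \
     $grains using "Year":"Rye"   with linespoints
```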

4.4.2. Pseudocolumns

Gnuplot supplies three pseudocolumns for each file it reads. They’re numbered 0, -1, and -2.

The pseudocolumn 0 contains the line number in the current file (or data set), starting at zero, without counting any comment, label, or skipped lines. You can also access this column in inline transformations using $0. (This counter is reset to zero when it encounters the double blank line that separates data sets.)

The pseudocolumn -1 counts the blocks of lines within the current data set, starting at zero: it’s incremented by a single blank line and reset to zero by a double blank line. (This is relevant if the data file is in grid format; see appendix C for details.)

The pseudocolumn -2 contains the index of the current data set within the data file. When a double blank line is encountered in the file, the line number (corresponding to the value of pseudocolumn 0) resets to zero, and the index is incremented.

You can use these pseudocolumns, for instance, like this:
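
As a minimal sketch (the data block $values and its numbers are invented for illustration), pseudocolumns can stand in for an x column that the file doesn’t contain:

```gnuplot
# Hypothetical single-column data in two data sets
$values << EOD
3
5
4


7
6
8
EOD

# Pseudocolumn 0 as x: the record number within each data set
plot $values using 0:1 with linespoints

# Pseudocolumn -2 as x: 0 for the first data set, 1 for the second
plot $values using -2:1 with points
```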

4.4.3. Column-access functions

You can use the column() function whenever an expression has become too complicated for the using syntax, or in contexts where the $ shorthand isn’t available. The function evaluates its argument to a column identifier and returns the value of the identified column as a numeric value. (There is also a stringcolumn() [or strcol()] function that returns the column value as a string. See the next chapter for more information on strings.)

The argument to column() (or stringcolumn()) can be either numeric or a string. A numeric argument is evaluated into a column number. A string argument is only suitable if the file contains a non-comment header line with column identifiers (as in listing 4.7): if that’s the case, the argument should evaluate into one of the column names. An error results if no column with the supplied identifier can be found.

The column() function and pseudocolumns work well together. For instance, this example plots all values from a file but adds a constant vertical offset of 1.5 to values from different data sets (to separate curves from different data sets from each other, so that you can distinguish them more easily):

plot "data" using 1:($2 + 1.5*column(-2)) with lines

The following example (found in the standard reference documentation) plots each data set in a different color (see chapter 9 to learn more about linecolor variable):

plot "data" u 1:2:(column(-2)) linecolor variable

Further column-access functions

If a column contains date and time values, then the timecolumn() function may be useful. It takes two arguments: the first must be an integer identifying a column (it can’t be a string!), and the second must be a string specifying the format of the time-stamp in the data file. The function returns the Unix epoch seconds corresponding to the parsed value as a number. This value may, for example, be formatted into a text label using the strftime() function.

The format specifiers are based on those used by the strftime() and strptime() functions from the C standard library. We’ll discuss the handling of date and time information in section 8.4.2 in the context of time-series plotting; here we present only a brief preview. Imagine a file containing dates and values like this:

2014-12-01   3
2014-12-02   5
2014-12-05   1

The following command converts the dates in the first column to their corresponding epoch seconds and uses them as horizontal positions:

plot "data" u (timecolumn(1, "%Y-%m-%d")):2 w lp

One other function can be useful for inspecting column values: valid(). Called as valid(x), it returns 1 (true) only if the value in column number x of the current record is a valid number, and 0 otherwise. You can use this function to test values from messy files and only plot those that are valid.

4.5. Pseudofiles

Typically, the data rendered by the plot command is read from a file, but in some situations it makes sense to accept data from a different source. Gnuplot includes a number of special filenames that you can read from as if they were proper files, although they don’t exist on any disk.

As of gnuplot 5, data can be embedded in command files and accessed through a variable-like identifier called a heredoc. Data can also be read from the command line interactively and stored in a heredoc—this can be useful when you want to add a few data points to an existing graph.

On platforms that support pipes, gnuplot can read data from standard input—this is particularly useful when you’re scripting gnuplot to run in batch mode. When they’re available, you can also send output to another process via pipes.

Finally, two special filenames let you apply the full flexibility of the using facility to function plots. Table 4.3 lists all these special filenames and describes their meaning.

Table 4.3. Pseudofiles

Pseudofile	Description
"" (empty filename)	Reuses the most recently encountered filename.
"-"	Reads from standard input. In an interactive session, reads from the command window. (Terminate input with the EOF character or with e on a line by itself.)
"+"	Generates equally spaced x values covering the entire current plot range. The number of points is controlled by set samples.
"++"	Generates equally spaced x and y values covering the entire current plot range. (This is the two-dimensional equivalent of "+".) The number of points in the x direction is controlled by set samples, and the number of points in the y direction is controlled by set isosamples.
"< name"	Reads from a subprocess via pipe.^a
"| name"	Writes to a subprocess via pipe.^a
$heredoc	Reads from a heredoc (no quotes around the name).

^a Only available on Unix-like platforms.

4.5.1. Reading data from standard input

When given the special filename - (as in: plot "-" u 1:2), gnuplot attempts to read data from standard input, which in an interactive session is the command window. After you run this plot command, gnuplot shows a prompt at which you can type data. Finish each line by pressing Enter. Gnuplot will keep prompting for data until either an end-of-file (EOF) character (typically Ctrl-d) is encountered or the character e is entered on a line by itself.

You can even read data from standard input multiple times within the same plot command: plot '-', '-' reads data until an EOF character is encountered and then expects to read more data (for the second “file”) until it finds a second EOF character. Of course, the data entered at a prompt this way can have multiple columns, from which you can select some with using; all the other features of the plot command can be used as well.

Although this feature can be used interactively, it’s mostly intended for situations where gnuplot is used in batch mode as part of larger scripts (we’ll talk more about that in chapter 11). When used interactively, this feature quickly becomes inconvenient, because (as explained earlier) gnuplot doesn’t maintain data sets in memory and therefore all data has to be manually reentered every single time you want to plot or replot the graph.
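
In a command file, the data simply follows the plot command. The following sketch, with made-up numbers, reads two inline data sets in a single plot command; each "-" consumes the lines up to the next e on a line by itself.

```gnuplot
# Two inline data sets in one plot command (the values are invented).
plot "-" using 1:2 with linespoints title "first", \
     "-" using 1:2 with linespoints title "second"
1 0.5
2 0.75
3 0.99
e
1 0.9
2 0.6
3 0.2
e
```

Because the data is consumed as it is read, a subsequent replot would require the same lines to be supplied again.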

4.5.2. Heredocs

Gnuplot 5 contains a new feature that makes it possible to embed data in a command file (as opposed to a data file) or to enter data in the command window and store the data in a variable for the duration of the gnuplot session. This is somewhat of a radical departure, because traditionally gnuplot never maintained data in memory—data was read from file every time the plot command was executed. This new feature is referred to as heredocs because its syntax is modeled after the eponymous feature in Perl and in Unix shells. (The gnuplot standard reference also calls them data blocks, but this term is already used in other contexts. I therefore use the unambiguous heredoc terminology for this feature, which seems justified given that these data structures are intended to be populated through a corresponding facility.)

Defining and using heredocs

You define a heredoc by preceding a variable identifier with a $ (dollar sign), followed by the << redirection operator and an arbitrary sequence of characters that will mark the end of the data section:

$d << EOD

Now follows the actual data section, just as if it were a data file. The data is terminated by the end-of-data indicator (on a line by itself; the indicator must match exactly). The entire affair may look like the following listing.

Listing 4.8. Defining a heredoc

$d << EOD
1     0.5
2     0.75
3     0.99
EOD

You can now use the heredoc as you’d use a file. In particular, you can plot its contents (note that there are no quotes around the heredoc identifier):

plot $d using 1:2

Listing 4.8 could be part of a command file; in this case, it could be loaded (with load). Alternatively, it could be entered interactively during a gnuplot session.

Several gnuplot commands that usually write to a file can write to a heredoc, instead; both set print and set table accept a heredoc identifier (see section 5.3). This makes it possible to have gnuplot commands “write” to a heredoc from within a gnuplot session.
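
As a rough sketch of what this looks like in practice (section 5.3 has the details), the following commands fill one heredoc with printed values and a second one with the sample points of a function. The names $squares and $sine are hypothetical, chosen only for this illustration.

```gnuplot
# Redirect print output into a heredoc and fill it with computed values.
set print $squares
do for [i = 1:5] {
    print i, i**2
}
set print            # send print output back to its default destination

# Capture the points of a function plot in a second heredoc.
set samples 11
set table $sine
plot [0:pi] sin(x)
unset table

# Both heredocs can now be plotted like files.
plot $squares using 1:2 with linespoints
plot $sine using 1:2 with lines
```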

No specific commands exist to populate a heredoc from a generic data file or to persist the contents of a heredoc to disk. In contrast to other information that exists in the session (such as gnuplot variables), the contents of a heredoc are not saved by the save command. Using set print and set table, you can circumvent these hurdles (we’ll come back to this in section 5.3).

A heredoc, once defined, occupies memory. To release these resources, use the undefine command: undefine $d. Using undefine $* drops all currently existing heredocs.

Understanding heredocs

Heredocs are a new and somewhat controversial feature, and some of their aspects may have changed by the time you read this. Heredocs break with several of gnuplot’s longstanding and fundamental design principles. It’s important to understand how they came about.

Heredocs arose from a desire to embed data in command files, in order to enable command files that are self-contained and don’t require a separate data file. But because traditionally the contents of gnuplot command files can be issued without change at the gnuplot command prompt, the syntax chosen for gnuplot’s heredocs can be issued interactively, as well. This gives the impression that heredocs are akin to session variables, but that isn’t how they’re meant to be used.

The result is strangely halfhearted and contradictory. Heredocs live in the session and occupy memory, just like gnuplot variables—yet they aren’t persisted with save. Heredocs can stand in for files in the plot and splot commands, in set print and set table—yet they aren’t intended to be populated from generic data files.

Heredocs break with the tradition that gnuplot doesn’t maintain any memory of the data: data had to be read from file every time it was used so that gnuplot was stateless (with regard to data). With heredocs, data lives in the gnuplot session, giving the impression of a data set or data frame—yet there exist no operations to manipulate heredocs (in addition to the inability to populate them from a generic data file).

Heredocs are intended to be defined in command files, where they require a special (albeit simple) syntax, thus breaking with the principle that gnuplot data files are plain, unstructured, and generic text files.

Ultimately, the problem is that the heredoc design is too flexible and feature-rich for the intended purpose. What was intended was a static (read-only) data block as part of a command file, similar to the DATA filehandle in Perl. But the ability to define heredocs at runtime and to maintain them in the session, independent of file access, gives the appearance of a general-purpose data structure and invites such use. Because heredocs were never intended as general-purpose data sets, some of the functionality you might expect is missing.

4.5.3. Reading data from a subprocess

On platforms that support input/output redirection, gnuplot can read from a subprocess via a pipe. Let’s say that a.out is a program that prints data suitable for gnuplot to standard output. The following command runs a.out and captures and plots its output without accessing the file system:

plot "< a.out" using 1:2

The < character must always be the first character of the filename, even when reading from an entire pipeline. For instance, if the output of the first program a.out is piped through a second program b.out, the gnuplot command is plot "< a.out | b.out".
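
If you have no suitable program at hand, a small shell pipeline can stand in for a.out. The following sketch assumes a Unix-like system with the seq and awk utilities on the path; the pipeline prints the pairs (n, n*n) for n from 1 to 10, and gnuplot plots them directly.

```gnuplot
# Assumption: seq and awk are installed (Unix-like systems only).
# seq emits the numbers 1..10; awk appends the square of each number.
plot "< seq 1 10 | awk '{ print $1, $1*$1 }'" using 1:2 with linespoints
```

Inside the double-quoted gnuplot string, the $1 is passed to awk unchanged; gnuplot interprets $1 as a column reference only within a using expression.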

You need to watch out for several things when reading from a subprocess like this. First, keep in mind that data might be buffered by the operating system, so that gnuplot doesn’t plot anything until the buffer is flushed or the process has finished (which is guaranteed to flush the buffer), at which point gnuplot plots all the data received at once. If you want to monitor a live stream with data trickling in over time, you need to use one of the methods from section 11.6. Furthermore, if you attempt to read from the same subprocess twice (for instance, plot "< a.out" u 1:2, "" u 1:3), gnuplot runs the program twice, and you’ll have to wait until both instances complete. Repeating a plot command (or using the replot command) starts the subprocesses anew.

For these reasons, it’s usually a better idea to write the process output to a file, which can then be plotted at leisure and as often as necessary. Alternatively, you might consider running gnuplot itself as a subprocess and piping data to it—we’ll come back to this idea in chapter 11.

4.5.4. Writing to a pipe

If pipes are available, gnuplot can also write to them. Because gnuplot’s output consists primarily of graphs, this feature is mostly of interest for post-processing image files. A single example will suffice.

Among gnuplot’s terminals based on the cairo library (see section 10.3), there isn’t one that can generate files in GIF format. The following sequence of commands uses gnuplot’s pngcairo terminal to create a graph in the PNG format, which is then piped through the convert utility from ImageMagick (www.imagemagick.org) to create the desired GIF:

set terminal pngcairo
set output "| convert - graph.gif"
plot sin(x)
set output

The pipe character sends gnuplot’s output to a pipe; the dash instructs convert to read from standard input. The final set output is necessary to flush gnuplot’s output channel. (See chapter 10 for additional information.)

4.5.5. Generating data

When gnuplot plots a function (as in plot sin(x)), gnuplot evaluates the function on a set of equally spaced points, which are evenly distributed over the entire plot range. The number of points is controlled by the set samples option (see section 3.1), and the plot range is either given (in square brackets) as part of the plot command (see chapter 2) or specified using set xrange (see chapter 8).

Reading from the special filename "+" is like reading from a file that contains exactly those x positions at which the function would be evaluated, in a single column. In other words, the following two commands are equivalent:

plot sin(x)
plot "+" using 1:(sin($1)) w l

In the second command, you must specify the with lines style explicitly, because—as far as gnuplot is concerned—you’re plotting data (not a function), and hence the default data style applies (see chapter 9).
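
One situation where this pays off is a plot style that needs more than one value per point, which a plain function plot can't supply. The following sketch (the range and sample count are arbitrary choices) fills the region between two functions by computing both from the same generated x values.

```gnuplot
set xrange [0:2*pi]
set samples 200

# Column 1 holds the generated x values; the two expressions supply
# the two boundary curves of the filled region.
plot "+" using 1:(sin($1)):(cos($1)) with filledcurves title "between sin(x) and cos(x)"
```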

The special file "++" is the two-dimensional equivalent of "+". It generates regularly spaced positions on a two-dimensional grid as used by the splot command. (See appendix C for more information about splot.)
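
A minimal sketch of "++" in a two-dimensional plot follows. The function, ranges, and grid size are arbitrary, and the with image style is used here only as a convenient way to display values on a grid. Both ranges are set explicitly because the grid is generated over the current plot range.

```gnuplot
set xrange [-pi:pi]
set yrange [-pi:pi]
set samples 50        # number of grid points in the x direction
set isosamples 50     # number of grid points in the y direction

# Columns 1 and 2 hold the generated x and y positions;
# the third entry computes the value shown at each grid point.
plot "++" using 1:2:(sin($1)*cos($2)) with image
```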

You probably won’t need the "+" and "++" pseudofiles often—they’re special-purpose tools for handling certain edge cases. Sections 6.3.4, 8.2.2, 9.1.5, and F.5.1 discuss potential applications.

4.6. Metadata in data files

Gnuplot can read a variety of (mostly textual) information from a data file in addition to the data points being plotted. We haven’t discussed gnuplot’s string-handling facilities in detail yet (we’ll do so in the next chapter), nor have we introduced the context in which the need for metadata arises (we’ll do so in chapters 7, on decorations that can be placed on a plot, and 8, on axes), so this section is mostly a teaser.

Strings can be read from a data file and placed directly onto a plot using the with labels style (see section 5.1.3 in the next chapter). With the xticlabels() family of functions (see chapter 8), you can use strings read from a file as labels on the axes of a plot (see section 8.3.5 and the examples in section 13.4). Special rules apply if the data in the file represents calendar dates and times (see the section on time series in 8.4.2). Finally, you can use the values of the first line in an input file (the column labels) as entries in the legend or key of a plot (see section 7.4.5). Table 4.4 summarizes all the metadata that can be found in a data file.

Table 4.4. Metadata in data files

Feature	Section	Description
#	4.3.2	If found at the beginning of a line, treats the entire line as a comment.
plot skip n	4.1.1	Skips the first n lines.
index name	4.2.1	Identifies a data set in a file by a unique identifier in the comment line immediately preceding the data block.
Named column	4.4.1	Identifies a column by a name, which is found in the first non-comment line of the file.
Key entry explanations	7.4.5	Reads explanations to be used in the graph’s key from the first non-comment line of the file.
with labels	6.3.5	Reads text labels from the data file and places them on the graph.
xticlabels(col)	8.3.5	Reads tic labels from column col in the data file.
set xdata time	8.4.2	In time-series mode, reads and parses appropriately formatted strings as timestamps.
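
As a teaser for how several of these pieces fit together, the following sketch reads key entries from the header line and tic labels from the first column. The data block $grains and its numbers are invented for this illustration; the details of each feature follow in the sections listed in the table.

```gnuplot
# Hypothetical data: a comment line, a header line, then the records.
$grains << EOD
# Made-up harvest figures
Year  Wheat  Barley
1990  8      6
1991  10     5
1992  9      7
EOD

# Take the key entries from the first non-comment line of the data.
set key autotitle columnheader

# Pseudocolumn 0 (the record number) provides the x position;
# the strings in column 1 become the tic labels on the x axis.
plot $grains using 0:2:xticlabels(1) with linespoints, \
     $grains using 0:3 with linespoints
```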

4.7. Other file formats

In passing, let’s mention some other file formats used by gnuplot that you haven’t seen. It’s possible to parse more complicated record formats than the ones we’ve discussed so far by passing a format string, which describes the format of each record, to using. The format string must be compatible with the scanf() family of functions, familiar from the standard C library. Check the standard gnuplot reference documentation if you believe this is relevant to you; but given the well-known fussiness of scanf(), this is rarely the best path forward. If a file has a format that can’t be parsed normally by gnuplot, it’s usually a better idea to convert it to a gnuplot-compatible format using a small conversion program in Perl, awk, or a similar tool.
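
For the record, here is a small sketch of what such a format string looks like. The record layout (x=... y=...) and the data block $records are made up for this illustration. The format string follows the using columns, and each %lf conversion supplies one column.

```gnuplot
# Hypothetical records with a decorated layout.
$records << EOD
x=1 y=0.5
x=2 y=0.75
x=3 y=0.99
EOD

# The literal text in the format must match the record exactly.
plot $records using 1:2 "x=%lf y=%lf" with linespoints
```

Even in this tiny example, any deviation in the records (a missing space, a different label) would cause the match to fail, which is why a separate conversion step is usually the more robust choice.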

Date and time strings are handled in a special way: don’t attempt to parse them using a scanf()-like format string. Use the timecolumn() function (see table 4.2) or the special commands described in section 8.4, instead, to parse and process such data.

Tip

Time and date information constitutes a special case. Use gnuplot’s dedicated facilities for handling such data (see section 8.4.2).

In addition to the regular, column-oriented, tabular file format you’ve seen up to this point, gnuplot can also handle data on a two-dimensional grid. There are two formats to choose from: one that contains the data values together with the coordinates of the grid points, and a packed “matrix” format. We’ll discuss them in appendix C on multidimensional plots.

Finally, gnuplot can read certain binary packed file formats. Again, if this is relevant to you, I suggest consulting the standard gnuplot reference documentation. Unless you have particular needs, I recommend that you stick with text files.

4.8. Summary

Gnuplot is all about graphing data, and in this chapter we covered how to get data into gnuplot. You learned about data files and formats. Gnuplot reads data from text files, which are portable, easy to generate, and easy to modify.

You learned how to keep multiple data sets in a single data file and select from them with index, and how to select a subset of records with every. You also learned different ways to identify and access individual columns in a data file.

We discussed pseudofiles, which are data sources that don’t live on the disk, and we also mentioned that gnuplot can read data from another process (if it’s installed on a platform that permits it). Finally, we took a brief look at other information you can find in a data file and mentioned alternative file formats in addition to the whitespace-separated columns we’ve concerned ourselves with exclusively until now.

This concludes what you need to know to get started doing data analysis using gnuplot. In the next chapter, we’ll turn away from gnuplot’s core functionality of dealing with data and generating plots, and instead talk about a host of useful features that make your work with gnuplot easier.
