Chapter 10, A Library of awk Functions,
presents the idea that reading programs in a language contributes to
learning that language. This chapter continues that theme, presenting a potpourri of
awk
programs for your reading enjoyment.
There are three sections. The first describes how to
run the programs presented in this chapter.
The second presents awk
versions of
several common POSIX utilities. These are programs that you are hopefully
already familiar with, and therefore whose problems are understood. By
reimplementing these programs in awk
, you
can focus on the awk
-related aspects of
solving the programming problems.
The third is a grab bag of interesting programs. These solve a number
of different data-manipulation and management problems. Many of the programs
are short, which emphasizes awk
’s ability
to do a lot in just a few lines of code.
Many of these programs use library functions presented in Chapter 10.
To run a given program, you would typically do something like this:
awk -f program -- options files
Here, program
is the name of the awk
program (such as cut.awk
), the
options
are any command-line options for the
program that start with a ‘-
’, and the
files
are the actual datafiles.
If your system supports the ‘#!
’
executable interpreter mechanism (see Executable awk Programs), you
can instead run your program directly:
cut.awk -c1-8 myfiles > results
If your awk
is not gawk
, you may instead need to use this:
cut.awk -- -c1-8 myfiles > results
This section presents a number of POSIX utilities implemented in
awk
. Reinventing these programs in
awk
is often enjoyable, because the
algorithms can be very clearly expressed, and the code is usually very
concise and simple. This is true because awk
does so much for you.
It should be noted that these programs are not necessarily intended
to replace the installed versions on your system. Nor may all of these
programs be fully compliant with the most recent POSIX standard. This is
not a problem; their purpose is to illustrate awk
language programming for “real-world”
tasks.
The programs are presented in alphabetical order.
The cut
utility selects, or
“cuts,” characters or fields from its standard input and sends them to its
standard output. Fields are separated by TABs by default, but you may
supply a command-line option to change the field
delimiter (i.e., the field-separator character).
cut
’s definition of fields is less
general than awk
’s.
A common use of cut
might be to
pull out just the login names of logged-on users from the output of
who
. For example, the following
pipeline generates a sorted, unique list of the logged-on users:
who | cut -c1-8 | sort | uniq
The options for cut
are:
-c
list
Use list
as the list of
characters to cut out. Items within the list may be separated by
commas, and ranges of characters can be separated with dashes. The
list ‘1-8,15,22-35
’ specifies
characters 1 through 8, 15, and 22 through 35.
-f
list
Use list
as the list of fields to
cut out.
-d
delim
Use delim
as the field-separator
character instead of the TAB character.
-s
Suppress printing of lines that do not contain the field delimiter.
The awk
implementation of
cut
uses the getopt()
library
function (see Processing Command-Line Options) and the join()
library function (see Merging an Array into a String).
The program begins with a comment describing the options, the
library functions needed, and a usage()
function that prints out a usage message and exits. usage()
is called if invalid arguments are
supplied:
# cut.awk --- implement cut in awk
#
# Options:
#    -f list     Cut fields
#    -d c        Field delimiter character
#    -c list     Cut characters
#
#    -s          Suppress lines without the delimiter
#
# Requires getopt() and join() library functions

function usage()
{
    print("usage: cut [-f list] [-d c] [-s] [files...]") > "/dev/stderr"
    print("usage: cut [-c list] [files...]") > "/dev/stderr"
    exit 1
}
Next comes a BEGIN
rule that
parses the command-line options. It sets FS
to a single TAB character, because that is
cut
’s default field
separator. The rule then sets the output field separator to be the
same as the input field separator. A loop using getopt()
steps through the command-line
options. Exactly one of the variables by_fields
or by_chars
is set to true, to indicate that
processing should be done by fields or by characters, respectively. When
cutting by characters, the output field separator is set to the null string:
BEGIN {
    FS = "\t"    # default
    OFS = FS
    while ((c = getopt(ARGC, ARGV, "sf:c:d:")) != -1) {
        if (c == "f") {
            by_fields = 1
            fieldlist = Optarg
        } else if (c == "c") {
            by_chars = 1
            fieldlist = Optarg
            OFS = ""
        } else if (c == "d") {
            if (length(Optarg) > 1) {
                printf("cut: using first character of %s" \
                       " for delimiter\n", Optarg) > "/dev/stderr"
                Optarg = substr(Optarg, 1, 1)
            }
            FS = Optarg
            OFS = FS
            if (FS == " ")    # defeat awk semantics
                FS = "[ ]"
        } else if (c == "s")
            suppress = 1
        else
            usage()
    }

    # Clear out options
    for (i = 1; i < Optind; i++)
        ARGV[i] = ""
The code must take special care when the field delimiter is a
space. Using a single space (" "
)
for the value of FS
is
incorrect—awk
would separate fields
with runs of spaces, TABs, and/or newlines, and we want them to be
separated with individual spaces. Also remember that after getopt()
is through (as described in Processing Command-Line Options), we have to clear out all the elements of
ARGV
from 1 to Optind
, so that awk
does not try to process the command-line
options as filenames.
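If you want to see the difference for yourself, a quick test such as the following (the sample input line, with two consecutive spaces, is invented) shows how the two values of FS behave:

$ echo 'a  b' | awk -F' ' '{ print NF }'
2
$ echo 'a  b' | awk -F'[ ]' '{ print NF }'
3

With FS set to a single space, awk uses its default splitting rules and finds two fields; with "[ ]", the empty field between the two spaces is preserved and awk finds three.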
After dealing with the command-line options, the program verifies
that the options make sense. Only one or the other of
-c
and -f
should be used, and both
require a field list. Then the program calls either set_fieldlist()
or set_charlist()
to pull apart the list of fields or characters:
    if (by_fields && by_chars)
        usage()

    if (by_fields == 0 && by_chars == 0)
        by_fields = 1    # default

    if (fieldlist == "") {
        print "cut: needs list for -c or -f" > "/dev/stderr"
        exit 1
    }

    if (by_fields)
        set_fieldlist()
    else
        set_charlist()
}
set_fieldlist()
splits the
field list apart at the commas into an array. Then, for each element of
the array, it looks to see if the element is actually a range, and if
so, splits it apart. The function checks the range to make sure that the
first number is smaller than the second. Each number in the list is
added to the flist
array, which
simply lists the fields that will be printed. Normal field splitting is
used. The program lets awk
handle the
job of doing the field splitting:
function set_fieldlist( n, m, i, j, k, f, g) { n = split(fieldlist, f, ",") j = 1 # index in flist for (i = 1; i <= n; i++) { if (index(f[i], "-") != 0) { # a range m = split(f[i], g, "-") if (m != 2 || g[1] >= g[2]) { printf("cut: bad field list: %s ", f[i]) > "/dev/stderr" exit 1 } for (k = g[1]; k <= g[2]; k++) flist[j++] = k } else flist[j++] = f[i] } nfields = j - 1 }
The set_charlist()
function is
more complicated than set_fieldlist()
. The idea here is to use gawk
’s
FIELDWIDTHS
variable (see Reading Fixed-Width Data), which describes constant-width input. When
using a character list, that is exactly what we have.
Setting up FIELDWIDTHS
is more
complicated than simply listing the fields that need to be printed. We
have to keep track of the fields to print and also the intervening
characters that have to be skipped. For example, suppose you wanted
characters 1 through 8, 15, and 22 through 35. You would use ‘-c 1-8,15,22-35
’. The necessary value for
FIELDWIDTHS
is "8 6 1 6 14"
. This yields five
fields, and the fields to print are $1
, $3
, and
$5
. The intermediate fields are
filler, which is stuff in between the desired
data. flist
lists the fields to
print, and t
tracks the complete
field list, including filler fields:
function set_charlist( field, i, j, f, g, n, m, t, filler, last, len) { field = 1 # count total fields n = split(fieldlist, f, ",") j = 1 # index in flist for (i = 1; i <= n; i++) { if (index(f[i], "-") != 0) { # range m = split(f[i], g, "-") if (m != 2 || g[1] >= g[2]) { printf("cut: bad character list: %s ", f[i]) > "/dev/stderr" exit 1 } len = g[2] - g[1] + 1 if (g[1] > 1) # compute length of filler filler = g[1] - last - 1 else filler = 0 if (filler) t[field++] = filler t[field++] = len # length of field last = g[2] flist[j++] = field - 1 } else { if (f[i] > 1) filler = f[i] - last - 1 else filler = 0 if (filler) t[field++] = filler t[field++] = 1 last = f[i] flist[j++] = field - 1 } } FIELDWIDTHS = join(t, 1, field - 1) nfields = j - 1 }
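To make the FIELDWIDTHS arithmetic concrete, here is a tiny stand-alone sketch (gawk-only, and the input is assumed to be at least 35 characters wide) showing how the value computed for ‘-c 1-8,15,22-35’ splits each record; only the odd-numbered fields carry wanted data:

# fieldwidths-demo.awk --- illustrate splitting with FIELDWIDTHS (gawk only)
BEGIN { FIELDWIDTHS = "8 6 1 6 14" }
{
    # $1, $3, and $5 hold characters 1-8, 15, and 22-35;
    # $2 and $4 are the filler in between
    printf("<%s> <%s> <%s>\n", $1, $3, $5)
}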
Next is the rule that processes the data. If the
-s
option is given, then suppress
is true. The first if
statement makes sure that the input record
does have the field separator. If cut
is processing fields, suppress
is
true, and the field separator character is not in the record, then the
record is skipped.
If the record is valid, then gawk
has split the data into fields, either
using the character in FS
or using
fixed-length fields and FIELDWIDTHS
.
The loop goes through the list of fields that should be printed. The
corresponding field is printed if it contains data. If the next field
also has data, then the separator character is written out between the
fields:
{
    if (by_fields && suppress && index($0, FS) == 0)
        next

    for (i = 1; i <= nfields; i++) {
        if ($flist[i] != "") {
            printf "%s", $flist[i]
            if (i < nfields && $flist[i+1] != "")
                printf "%s", OFS
        }
    }
    print ""
}
This version of cut
relies on
gawk
’s FIELDWIDTHS
variable to do the character-based
cutting. It is possible in other awk
implementations to use substr()
(see
String-Manipulation Functions), but it is also extremely painful.
The FIELDWIDTHS variable supplies
an elegant solution to the problem of picking the input line apart by
characters.
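If you do need character-based cutting in an awk without FIELDWIDTHS, the painful substr() route looks roughly like the following sketch. It assumes the option-parsing code has been changed to fill two hypothetical parallel arrays, starts and lens, with the starting position and length of each requested range (cut.awk as shown does not build these):

# sketch: print character ranges with substr() instead of FIELDWIDTHS
{
    out = ""
    for (i = 1; i <= nranges; i++)
        out = out substr($0, starts[i], lens[i])
    print out
}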
The egrep
utility searches
files for patterns. It uses regular expressions that are almost identical to
those available in awk
(see Chapter 3). You invoke it as follows:
egrep [ options ] 'pattern' files …
The pattern
is a regular expression. In
typical usage, the regular expression is quoted to prevent the shell
from expanding any of the special characters as filename wildcards.
Normally, egrep
prints the lines that
matched. If multiple filenames are provided on the command line, each
output line is preceded by the name of the file and a colon.
The options to egrep
are as
follows:
-c
Print out a count of the lines that matched the pattern, instead of the lines themselves.
-s
Be silent. No output is produced and the exit value indicates whether the pattern was matched.
-v
Invert the sense of the test. egrep
prints the lines that do
not match the pattern and exits successfully
if the pattern is not matched.
-i
Ignore case distinctions in both the pattern and the input data.
-l
Only print (list) the names of the files that matched, not the lines that matched.
-e
pattern
Use pattern
as the regexp to
match. The purpose of the -e
option is to allow
patterns that start with a ‘-
’.
This version uses the getopt()
library function (see Processing Command-Line Options) and the file transition library program (see Noting Datafile Boundaries).
The program begins with a descriptive comment and then a BEGIN
rule that processes the command-line
arguments with getopt()
. The
-i
(ignore case) option is particularly easy with gawk
; we just use the IGNORECASE
predefined variable (see Predefined Variables):
# egrep.awk --- simulate egrep in awk # # Options: # -c count of lines # -s silent - use exit value # -v invert test, success if no match # -i ignore case # -l print filenames only # -e argument is pattern # # Requires getopt and file transition library functions BEGIN { while ((c = getopt(ARGC, ARGV, "ce:svil")) != -1) { if (c == "c") count_only++ else if (c == "s") no_print++ else if (c == "v") invert++ else if (c == "i") IGNORECASE = 1 else if (c == "l") filenames_only++ else if (c == "e") pattern = Optarg else usage() }
Next comes the code that handles the egrep
-specific behavior. If no pattern is
supplied with -e
, the first nonoption on the command
line is used. The awk
command-line
arguments up to ARGV[Optind]
are
cleared, so that awk
won’t try to
process them as files. If no files
are specified, the standard input is used, and if multiple files are
specified, we make sure to note this so that the filenames can precede
the matched lines in the output:
if (pattern == "") pattern = ARGV[Optind++] for (i = 1; i < Optind; i++) ARGV[i] = "" if (Optind >= ARGC) { ARGV[1] = "-" ARGC = 2 } else if (ARGC - Optind > 1) do_filenames++ # if (IGNORECASE) # pattern = tolower(pattern) }
The last two lines are commented out, as they are not needed in
gawk
. They should be uncommented if
you have to use another version of awk
.
The next set of lines should be uncommented if you are not using
gawk
. This rule translates all the
characters in the input line into lowercase if the -i
option is specified.[69] The rule is commented out as it is not necessary
with gawk
:
#{
#    if (IGNORECASE)
#        $0 = tolower($0)
#}
The beginfile()
function is
called by the rule in ftrans.awk
when each new file is processed. In this case, it is very simple; all it does is
initialize a variable fcount
to zero.
fcount
tracks how many lines in the
current file matched the pattern. Naming the parameter junk
shows we know that beginfile()
is called with a parameter, but
that we’re not interested in its value:
function beginfile(junk)
{
    fcount = 0
}
The endfile()
function is
called after each file has been processed. It affects the output only when the user wants a count of
the number of lines that matched. no_print
is true only if the exit status is
desired. count_only
is true if line
counts are desired. egrep
therefore
only prints line counts if printing and counting are enabled. The output
format must be adjusted depending upon the number of files to process.
Finally, fcount
is added to total
, so that we know the total number of
lines that matched the pattern:
function endfile(file)
{
    if (! no_print && count_only) {
        if (do_filenames)
            print file ":" fcount
        else
            print fcount
    }
    total += fcount
}
The BEGINFILE
and ENDFILE
special patterns (see The BEGINFILE and ENDFILE Special Patterns) could be used, but then the program
would be gawk
-specific. Additionally,
this example was written before gawk
acquired BEGINFILE
and ENDFILE
.
The following rule does most of the work of matching lines. The
variable matches
is true if the line
matched the pattern. If the user wants lines that did not match, the
sense of matches
is
inverted using the ‘!
’
operator. fcount
is incremented with
the value of matches
, which is either
one or zero, depending upon a successful or unsuccessful match. If the
line does not match, the next
statement just moves on to the next record.
A number of additional tests are made, but they are only done if
we are not counting lines. First, if the user only wants the exit status
(no_print
is true), then it is enough
to know that one line in this file matched, and we
can skip on to the next file with nextfile
. Similarly, if we are only printing
filenames, we can print the filename, and then skip to the next file
with nextfile
. Finally, each line is
printed, with a leading filename and colon if necessary:
{
    matches = ($0 ~ pattern)
    if (invert)
        matches = ! matches

    fcount += matches    # 1 or 0

    if (! matches)
        next

    if (! count_only) {
        if (no_print)
            nextfile

        if (filenames_only) {
            print FILENAME
            nextfile
        }

        if (do_filenames)
            print FILENAME ":" $0
        else
            print
    }
}
The END
rule takes care of
producing the correct exit status. If there are no matches, the exit
status is one; otherwise, it is zero:
END {
    exit (total == 0)
}
The usage()
function
prints a usage message in case of invalid options, and then
exits:
function usage() { print("Usage: egrep [-csvil] [-e pat] [files ...]") > "/dev/stderr" print(" egrep [-csvil] pat [files ...]") > "/dev/stderr" exit 1 }
The id
utility lists a user’s
real and effective user ID numbers, real and effective
group ID numbers, and the user’s group set, if any. id
only prints the effective user ID and group
ID if they are different from the real ones. If possible, id
also supplies the corresponding user and
group names. The output might look like this:
$ id
uid=1000(arnold) gid=1000(arnold) groups=1000(arnold),4(adm),7(lp),27(sudo)
This information is part of what is provided by gawk
’s PROCINFO
array (see Predefined Variables). However, the id
utility provides a more palatable output
than just individual numbers.
Here is a simple version of id
written in awk
. It uses the user
database library functions (see Reading the User Database) and
the group database library functions (see Reading the Group Database) from Chapter 10.
The program is fairly straightforward. All the work is done in the
BEGIN
rule. The user and group ID
numbers are obtained from PROCINFO
.
The code is repetitive. The entry in the user database for the real user
ID number is split into parts at the ‘:
’. The name is
the first field. Similar code is used for the effective user ID
number and the group numbers:
# id.awk --- implement id in awk # # Requires user and group library functions # output is: # uid=12(foo) euid=34(bar) gid=3(baz) # egid=5(blat) groups=9(nine),2(two),1(one) BEGIN { uid = PROCINFO["uid"] euid = PROCINFO["euid"] gid = PROCINFO["gid"] egid = PROCINFO["egid"] printf("uid=%d", uid) pw = getpwuid(uid) pr_first_field(pw) if (euid != uid) { printf(" euid=%d", euid) pw = getpwuid(euid) pr_first_field(pw) } printf(" gid=%d", gid) pw = getgrgid(gid) pr_first_field(pw) if (egid != gid) { printf(" egid=%d", egid) pw = getgrgid(egid) pr_first_field(pw) } for (i = 1; ("group" i) in PROCINFO; i++) { if (i == 1) printf(" groups=") group = PROCINFO["group" i] printf("%d", group) pw = getgrgid(group) pr_first_field(pw) if (("group" (i+1)) in PROCINFO) printf(",") } print "" } function pr_first_field(str, a) { if (str != "") { split(str, a, ":") printf("(%s)", a[1]) } }
The test in the for
loop is
worth noting. Any supplementary groups in the PROCINFO
array have the indices "group1"
through "group
for some
N
"N
(i.e., the total number of supplementary
groups). However, we don’t know in advance how many of these groups
there are.
This loop works by starting at one, concatenating the value with
"group"
, and then using in
to see if that value is in the array (see
Referring to an Array Element). Eventually, i
is incremented past the last group in the
array and the loop exits.
The loop is also correct if there are no supplementary groups; then the condition is false the first time it’s tested, and the loop body never executes.
The pr_first_field()
function
simply isolates out some code that is used repeatedly, making the
whole program shorter and cleaner. In particular, moving the check for
the empty string into this function saves several lines of
code.
The split
program splits large
text files into smaller pieces. Usage is as follows:[70]
split [-count] [file] [prefix]
By default, the output files are named xaa
, xab
, and so on. Each file has 1,000 lines in
it, with the likely exception of the last file. To change the number of
lines in each file, supply a number on the command line preceded with a
minus sign (e.g., ‘-500
’ for files
with 500 lines in them instead of 1,000). To change the names of the
output files to something like myfileaa
, myfileab
, and so on, supply an additional
argument that specifies the filename prefix.
Here is a version of split
in
awk
. It uses the ord()
and chr()
functions presented in Translating Between Characters and Numbers.
The program first sets its defaults, and then tests to make sure there are not too many arguments. It then looks at each argument in turn. The first argument could be a minus sign followed by a number. If it is, this happens to look like a negative number, so it is made positive, and that is the count of lines. The datafile name is skipped over and the final argument is used as the prefix for the output filenames:
# split.awk --- do split in awk # # Requires ord() and chr() library functions # usage: split [-count] [file] [outname] BEGIN { outfile = "x" # default count = 1000 if (ARGC > 4) usage() i = 1 if (i in ARGV && ARGV[i] ~ /^-[[:digit:]]+$/) { count = -ARGV[i] ARGV[i] = "" i++ } # test argv in case reading from stdin instead of file if (i in ARGV) i++ # skip datafile name if (i in ARGV) { outfile = ARGV[i] ARGV[i] = "" } s1 = s2 = "a" out = (outfile s1 s2) }
The next rule does most of the work. tcount
(temporary count) tracks how many lines
have been printed to the output file so far. If it is greater than
count
, it is time to close the
current file and start a new one. s1
and s2
track the current suffixes for
the filename. If they are both ‘z
’,
the file is just too big. Otherwise, if s2 has reached ‘z’, s1 moves to the next letter in the alphabet and s2 starts over again at ‘a’; if not, only s2 advances:
{
    if (++tcount > count) {
        close(out)
        if (s2 == "z") {
            if (s1 == "z") {
                printf("split: %s is too large to split\n",
                       FILENAME) > "/dev/stderr"
                exit 1
            }
            s1 = chr(ord(s1) + 1)
            s2 = "a"
        } else
            s2 = chr(ord(s2) + 1)
        out = (outfile s1 s2)
        tcount = 1
    }
    print > out
}
The usage()
function simply
prints an error message and exits:
function usage() { print("usage: split [-num] [file] [outname]") > "/dev/stderr" exit 1 }
This program is a bit sloppy; it relies on awk
to automatically close the last file
instead of doing it in an END
rule.
It also assumes that letters are contiguous in the character set, which
isn’t true for EBCDIC systems.
You might want to consider how to eliminate the use of ord()
and chr()
; this can be done in such a way as to
solve the EBCDIC issue as well.
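One way to avoid ord() and chr(), sketched below, is to keep the suffix letters in a string and step through it with index() and substr(); because no arithmetic is ever done on character codes, the same code also works on EBCDIC systems. The function name is invented for this illustration:

# next_suffix --- return the letter after c, or "" when the alphabet is used up
# (a sketch; avoids ord()/chr() and so is safe on EBCDIC as well)
function next_suffix(c,    letters, pos)
{
    letters = "abcdefghijklmnopqrstuvwxyz"
    pos = index(letters, c)
    if (pos == 0 || pos == length(letters))
        return ""        # caller treats "" as overflow
    return substr(letters, pos + 1, 1)
}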
The tee
program is known as a
“pipe fitting.” tee
copies its
standard input to its standard output and also duplicates
it to the files named on the command line. Its usage is as follows:
tee [-a] file …
The -a
option tells tee
to append to the named files, instead of
truncating them and starting over.
The BEGIN
rule first makes a
copy of all the command-line arguments into an array named copy
. ARGV[0]
is not needed, so it is not copied.
tee
cannot use ARGV
directly, because awk
attempts to process each filename in
ARGV
as input data.
If the first argument is -a
, then the flag
variable append
is set to true, and
both ARGV[1]
and copy[1]
are deleted. If ARGC
is less than two, then no filenames were
supplied and tee
prints a usage
message and exits. Finally, awk
is
forced to read the standard input by setting ARGV[1]
to "-"
and ARGC
to two:
# tee.awk --- tee in awk # # Copy standard input to all named output files. # Append content if -a option is supplied. # BEGIN { for (i = 1; i < ARGC; i++) copy[i] = ARGV[i] if (ARGV[1] == "-a") { append = 1 delete ARGV[1] delete copy[1] ARGC-- } if (ARGC < 2) { print "usage: tee [-a] file ..." > "/dev/stderr" exit 1 } ARGV[1] = "-" ARGC = 2 }
The following single rule does all the work. Because there is no pattern, it is executed for each line of input. The body of the rule simply prints the line into each file on the command line, and then to the standard output:
{
    # moving the if outside the loop makes it run faster
    if (append)
        for (i in copy)
            print >> copy[i]
    else
        for (i in copy)
            print > copy[i]
    print
}
It is also possible to write the loop this way:
    for (i in copy)
        if (append)
            print >> copy[i]
        else
            print > copy[i]
This is more concise, but it is also less efficient. The ‘if
’ is tested for each record and for each
output file. By duplicating the loop body, the ‘if
’ is only tested once for each input record.
If there are N
input records and
M
output files, the first method only
executes N
‘if
’ statements, while the second executes
N
*
M
‘if
’ statements.
Finally, the END
rule cleans up
by closing all the output files:
END {
    for (i in copy)
        close(copy[i])
}
The uniq
utility reads sorted
lines of data on its standard input, and by default removes duplicate
lines. In other words, it only prints unique lines—hence the
name. uniq
has a number of options.
The usage is as follows:
uniq [-udc [-n]] [+n] [ inputfile [ outputfile ]]
The options for uniq
are:
-d
Print only repeated (duplicated) lines.
-u
Print only nonrepeated (unique) lines.
-c
Count lines. This option overrides -d
and
-u
. Both repeated and nonrepeated lines are
counted.
-n
Skip n
fields before comparing
lines. The definition of fields is similar to awk
’s default: nonwhitespace characters
separated by runs of spaces and/or TABs.
+n
Skip n
characters before
comparing lines. Any fields specified with ‘-
n
’ are
skipped first.
inputfile
Data is read from the input file named on the command line, instead of from the standard input.
outputfile
The generated output is sent to the named output file, instead of to the standard output.
Normally uniq
behaves as if
both the -d
and -u
options are
provided.
uniq
uses the getopt()
library function (see Processing Command-Line Options) and the join()
library function (see Merging an Array into a String).
The program begins with a usage()
function and then a brief outline of
the options and their meanings in comments. The BEGIN
rule deals with the command-line
arguments and options. It uses a trick to get getopt()
to handle options of the form
‘-25
’, treating such an option as the
option letter ‘2
’ with an argument of
‘5
’. If indeed two or more digits are
supplied (Optarg
looks like a
number), Optarg
is concatenated with
the option digit and then the result is added to zero to make it into a
number. If there is only one digit in the option, then Optarg
is not needed. In this case, Optind
must be decremented so that getopt()
processes it next time. This code is
admittedly a bit tricky.
If no options are supplied, then the default is taken, to print
both repeated and nonrepeated lines. The output file, if provided, is assigned to outputfile
. Early on, outputfile
is initialized to the standard
output, /dev/stdout
:
# uniq.awk --- do uniq in awk # # Requires getopt() and join() library functions function usage() { print("Usage: uniq [-udc [-n]] [+n] [ in [ out ]]") > "/dev/stderr" exit 1 } # -c count lines. overrides -d and -u # -d only repeated lines # -u only nonrepeated lines # -n skip n fields # +n skip n characters, skip fields first BEGIN { count = 1 outputfile = "/dev/stdout" opts = "udc0:1:2:3:4:5:6:7:8:9:" while ((c = getopt(ARGC, ARGV, opts)) != -1) { if (c == "u") non_repeated_only++ else if (c == "d") repeated_only++ else if (c == "c") do_count++ else if (index("0123456789", c) != 0) { # getopt() requires args to options # this messes us up for things like -5 if (Optarg ~ /^[[:digit:]]+$/) fcount = (c Optarg) + 0 else { fcount = c + 0 Optind-- } } else usage() } if (ARGV[Optind] ~ /^+[[:digit:]]+$/) { charcount = substr(ARGV[Optind], 2) + 0 Optind++ } for (i = 1; i < Optind; i++) ARGV[i] = "" if (repeated_only == 0 && non_repeated_only == 0) repeated_only = non_repeated_only = 1 if (ARGC - Optind == 2) { outputfile = ARGV[ARGC - 1] ARGV[ARGC - 1] = "" } }
The following function, are_equal()
, compares the current line,
$0
, to the previous line, last
. It handles skipping fields and
characters. If no field count and no character count are specified,
are_equal()
returns one or zero
depending upon the result of a simple string comparison of last
and $0
.
Otherwise, things get more complicated. If fields have to be
skipped, each line is broken into an array using split()
(see String-Manipulation Functions); the desired fields are then joined back
into a line using join()
. The joined
lines are stored in clast
and
cline
. If no fields are skipped,
clast
and cline
are set to last
and $0
, respectively. Finally, if characters are
skipped, substr()
is used to strip off the leading charcount
characters in clast
and cline
. The two strings are then compared and
are_equal()
returns the
result:
function are_equal( n, m, clast, cline, alast, aline) { if (fcount == 0 && charcount == 0) return (last == $0) if (fcount > 0) { n = split(last, alast) m = split($0, aline) clast = join(alast, fcount+1, n) cline = join(aline, fcount+1, m) } else { clast = last cline = $0 } if (charcount) { clast = substr(clast, charcount + 1) cline = substr(cline, charcount + 1) } return (clast == cline) }
The following two rules are the body of the program. The first one
is executed only for the very first line of data. It sets last
equal to $0
, so that subsequent lines of text have
something to be compared to.
The second rule does the work. The variable equal
is one or zero, depending upon the
results of are_equal()
’s comparison.
If uniq
is counting repeated lines,
and the lines are equal, then it increments the count
variable. Otherwise, it prints the line
and resets count
, because the two
lines are not equal.
If uniq
is not counting, and if
the lines are equal, count
is
incremented. Nothing is printed, as the point is to remove duplicates.
Otherwise, if uniq
is counting
repeated lines and more than one line is seen, or if uniq
is counting nonrepeated lines and only
one line is seen, then the line is printed, and count
is reset.
Finally, similar logic is used in the END
rule to print the final line of input
data:
NR == 1 { last = $0 next } { equal = are_equal() if (do_count) { # overrides -d and -u if (equal) count++ else { printf("%4d %s ", count, last) > outputfile last = $0 count = 1 # reset } next } if (equal) count++ else { if ((repeated_only && count > 1) || (non_repeated_only && count == 1)) print last > outputfile last = $0 count = 1 } } END { if (do_count) printf("%4d %s ", count, last) > outputfile else if ((repeated_only && count > 1) || (non_repeated_only && count == 1)) print last > outputfile close(outputfile) }
The logic for choosing which lines to print represents a state machine, which is “a device that can be in one of a set number of stable conditions depending on its previous condition and on the present values of its inputs.”[71] Brian Kernighan suggests that “an alternative approach to state machines is to just read the input into an array, then use indexing. It’s almost always easier code, and for most inputs where you would use this, just as fast.” Consider how to rewrite the logic to follow this suggestion.
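A minimal sketch of that approach, ignoring the field- and character-skipping options and the -d/-u distinction, simply stores every line and compares neighbors afterward (the variable names are invented):

# sketch: uniq -c by reading everything into an array first
{ line[NR] = $0 }

END {
    count = 1
    for (i = 2; i <= NR + 1; i++) {
        if (i <= NR && line[i] == line[i-1])
            count++
        else {
            printf("%4d %s\n", count, line[i-1])
            count = 1
        }
    }
}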
The wc
(word count) utility
counts lines, words, and characters in one or more input
files. Its usage is as follows:
wc [-lwc] [files …]
If no files are specified on the command line, wc
reads its standard input. If there are
multiple files, it also prints total counts for all the files. The
options and their meanings are as follows:
-l
Count only lines.
-w
Count only words. A “word” is a contiguous sequence of
nonwhitespace characters, separated by spaces and/or TABs.
Luckily, this is the normal way awk
separates fields in its input
data.
-c
Count only characters.
Implementing wc
in awk
is particularly elegant, because awk
does a lot of the work for us; it splits
lines into words (i.e., fields) and counts them, it counts lines (i.e.,
records), and it can easily tell us how long a line is.
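In fact, if you strip away the option handling and per-file reporting, the core of the job fits in a few lines (just a sketch; the real program below does much more):

# sketch: count lines, words, and characters with no options
{
    words += NF
    chars += length($0) + 1    # + 1 for the newline
}
END { print NR, words, chars }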
This program uses the getopt()
library function (see Processing Command-Line Options) and the
file-transition functions (see Noting Datafile Boundaries).
This version has one notable difference from traditional versions
of wc
: it always prints the counts in
the order lines, words, and characters. Traditional versions note the
order of the -l
, -w
, and
-c
options on the command line, and print the counts in
that order.
The BEGIN
rule does the
argument processing. The variable print_total
is true if more than one file is
named on the command line:
# wc.awk --- count lines, words, characters # Options: # -l only count lines # -w only count words # -c only count characters # # Default is to count lines, words, characters # # Requires getopt() and file transition library functions BEGIN { # let getopt() print a message about # invalid options. we ignore them while ((c = getopt(ARGC, ARGV, "lwc")) != -1) { if (c == "l") do_lines = 1 else if (c == "w") do_words = 1 else if (c == "c") do_chars = 1 } for (i = 1; i < Optind; i++) ARGV[i] = "" # if no options, do all if (! do_lines && ! do_words && ! do_chars) do_lines = do_words = do_chars = 1 print_total = (ARGC - i > 2) }
The beginfile()
function
is simple; it just resets the counts of lines, words, and
characters to zero, and saves the current filename in fname
:
function beginfile(file)
{
    lines = words = chars = 0
    fname = FILENAME
}
The endfile()
function adds the
current file’s numbers to the running totals of lines,
words, and characters. It then prints out those numbers for the file
that was just read. It relies on beginfile()
to reset the numbers for the
following datafile:
function endfile(file)
{
    tlines += lines
    twords += words
    tchars += chars
    if (do_lines)
        printf "\t%d", lines
    if (do_words)
        printf "\t%d", words
    if (do_chars)
        printf "\t%d", chars
    printf "\t%s\n", fname
}
There is one rule that is executed for each line. It adds the length of the record, plus one, to chars
.[72] Adding one plus the record length is needed because the
newline character separating records (the value of RS
) is not part of the record itself, and thus
not included in its length. Next, lines
is incremented for each line read, and
words
is incremented by the value of NF
, which
is the number of “words” on this line:
# do per line
{
    chars += length($0) + 1    # get newline
    lines++
    words += NF
}
Finally, the END
rule simply
prints the totals for all the files:
END {
    if (print_total) {
        if (do_lines)
            printf "\t%d", tlines
        if (do_words)
            printf "\t%d", twords
        if (do_chars)
            printf "\t%d", tchars
        print "\ttotal"
    }
}
This section is a large “grab bag” of miscellaneous programs. We hope you find them both interesting and enjoyable.
A common error when writing large amounts of prose is to accidentally duplicate words. Typically you will see this in text as something like “the the program does the following…” When the text is online, often the duplicated words occur at the end of one line and the beginning of another, making them very difficult to spot.
This program, dupword.awk
,
scans through a file one line at a time and looks for adjacent
occurrences of the same word. It also saves the last word on a line (in
the variable prev
) for comparison
with the first word on the next line.
The first two statements make sure that the line is all lowercase,
so that, for example, “The” and “the” compare equal to each other. The
next statement replaces nonalphanumeric and nonwhitespace characters
with spaces, so that punctuation does not affect the comparison either.
The characters are replaced with spaces so that formatting controls
don’t create nonsense words (e.g., the Texinfo ‘@code{NF}
’ becomes ‘codeNF
’ if punctuation is simply deleted). The
record is then resplit into fields, yielding just the actual words on
the line, and ensuring that there are no empty fields.
If there are no fields left after removing all the punctuation, the current record is skipped. Otherwise, the program loops through each word, comparing it to the previous one:
# dupword.awk --- find duplicate words in text
{
    $0 = tolower($0)
    gsub(/[^[:alnum:][:blank:]]/, " ");
    $0 = $0         # re-split
    if (NF == 0)
        next
    if ($1 == prev)
        printf("%s:%d: duplicate %s\n",
               FILENAME, FNR, $1)
    for (i = 2; i <= NF; i++)
        if ($i == $(i-1))
            printf("%s:%d: duplicate %s\n",
                   FILENAME, FNR, $i)
    prev = $NF
}
Nothing cures insomnia like a ringing alarm clock.
—Arnold Robbins
Sleep is for web developers.
—Erik Quanstrom
The following program is a simple “alarm clock” program. You give it a time of day and an optional message. At the specified time, it prints the message on the standard output. In addition, you can give it the number of times to repeat the message as well as a delay between repetitions.
This program uses the getlocaltime()
function from Managing the Time of Day.
All the work is done in the BEGIN
rule. The first part is argument
checking and setting of defaults: the delay, the count, and the message
to print. If the user supplied a message without the ASCII BEL character
(known as the “alert” character, "a"
), then it is added to the message.
(On
many systems, printing the ASCII BEL generates an audible alert. Thus,
when the alarm goes off, the system calls attention to itself in case
the user is not looking at the computer.) Just for a change, this
program uses a switch
statement (see
The switch Statement), but the processing could be done
with a series of if
-else
statements instead. Here is the
program:
# alarm.awk --- set an alarm # # Requires getlocaltime() library function # usage: alarm time [ "message" [ count [ delay ] ] ] BEGIN { # Initial argument sanity checking usage1 = "usage: alarm time ['message' [count [delay]]]" usage2 = sprintf(" (%s) time ::= hh:mm", ARGV[1]) if (ARGC < 2) { print usage1 > "/dev/stderr" print usage2 > "/dev/stderr" exit 1 } switch (ARGC) { case 5: delay = ARGV[4] + 0 # fall through case 4: count = ARGV[3] + 0 # fall through case 3: message = ARGV[2] break default: if (ARGV[1] !~ /[[:digit:]]?[[:digit:]]:[[:digit:]]{2}/) { print usage1 > "/dev/stderr" print usage2 > "/dev/stderr" exit 1 } break } # set defaults for once we reach the desired time if (delay == 0) delay = 180 # 3 minutes if (count == 0) count = 5 if (message == "") message = sprintf("aIt is now %s!a", ARGV[1]) else if (index(message, "a") == 0) message = "a" message "a"
The next section of code turns the alarm time into hours and minutes, converts it (if necessary) to a 24-hour clock, and then turns that time into a count of the seconds since midnight. Next it turns the current time into a count of seconds since midnight. The difference between the two is how long to wait before setting off the alarm:
# split up alarm time split(ARGV[1], atime, ":") hour = atime[1] + 0 # force numeric minute = atime[2] + 0 # force numeric # get current broken down time getlocaltime(now) # if time given is 12-hour hours and it's after that # hour, e.g., `alarm 5:30' at 9 a.m. means 5:30 p.m., # then add 12 to real hour if (hour < 12 && now["hour"] > hour) hour += 12 # set target time in seconds since midnight target = (hour * 60 * 60) + (minute * 60) # get current time in seconds since midnight current = (now["hour"] * 60 * 60) + (now["minute"] * 60) + now["second"] # how long to sleep for naptime = target - current if (naptime <= 0) { print "alarm: time is in the past!" > "/dev/stderr" exit 1 }
Finally, the program uses the system()
function
(see Input/Output Functions) to call the sleep
utility. The sleep
utility
simply pauses for the given number of seconds. If the exit status is not
zero, the program assumes that sleep
was interrupted and exits. If sleep
exited with an OK status (zero), then the program prints the message in
a loop, again using sleep
to delay
for however many seconds are necessary:
# zzzzzz..... go away if interrupted if (system(sprintf("sleep %d", naptime)) != 0) exit 1 # time to notify! command = sprintf("sleep %d", delay) for (i = 1; i <= count; i++) { print message # if sleep command interrupted, go away if (system(command) != 0) break } exit 0 }
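Assuming the script has been made executable using the ‘#!’ mechanism described earlier, an invocation might look like this (the time, message, and counts are just placeholders):

alarm.awk 5:30 'Time for tea!' 3 60

This prints ‘Time for tea!’ three times starting at 5:30, pausing 60 seconds between repetitions.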
The system tr
utility
transliterates characters. For example, it is often used to map
uppercase letters into lowercase for further
processing:
generate data | tr 'A-Z' 'a-z' | process data …
tr
requires two lists of
characters.[73] When processing the input, the first character in the
first list is replaced with the first character in the second list, the
second character in the first list is replaced with the second character
in the second list, and so on. If there are more characters in the
“from” list than in the “to” list, the last character of the “to” list
is used for the remaining characters in the “from” list.
Once upon a time, a user proposed
adding a transliteration function to gawk
.
The following program was written to
prove that character transliteration could be done with a user-level
function. This program is not as complete as the system tr
utility, but it does most of the
job.
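A typical invocation might look like the following; the sample input and output are only illustrative, and remember that both lists must be spelled out in full because ranges are not supported:

$ echo 'hello, world' | awk -f translate.awk 'lo' 'LO'
heLLO, wOrLd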
The translate
program was
written long before gawk
acquired the
ability to split each character in a string into separate array
elements. Thus, it makes repeated use of the substr()
, index()
, and gsub()
built-in functions (see String-Manipulation Functions). There are two functions. The first,
stranslate()
, takes three arguments:
from
A list of characters from which to translate
to
A list of characters to which to translate
target
The string on which to do the translation
Associative arrays make the translation part fairly easy. t_ar
holds the “to” characters, indexed by the
“from” characters. Then a simple loop goes through from
, one character at a time. For each
character in from
, if the character
appears in target
, it is replaced
with the corresponding to
character.
The translate()
function calls
stranslate()
, using $0
as the target. The main program sets two
global variables, FROM
and TO
, from the command line, and then changes
ARGV
so that awk
reads from the standard input.
Finally, the processing rule simply calls translate()
for each record:
# translate.awk --- do tr-like stuff # Bugs: does not handle things like tr A-Z a-z; it has # to be spelled out. However, if `to' is shorter than `from', # the last character in `to' is used for the rest of `from'. function stranslate(from, to, target, lf, lt, ltarget, t_ar, i, c, result) { lf = length(from) lt = length(to) ltarget = length(target) for (i = 1; i <= lt; i++) t_ar[substr(from, i, 1)] = substr(to, i, 1) if (lt < lf) for (; i <= lf; i++) t_ar[substr(from, i, 1)] = substr(to, lt, 1) for (i = 1; i <= ltarget; i++) { c = substr(target, i, 1) if (c in t_ar) c = t_ar[c] result = result c } return result } function translate(from, to) { return $0 = stranslate(from, to, $0) } # main program BEGIN { if (ARGC < 3) { print "usage: translate from to" > "/dev/stderr" exit } FROM = ARGV[1] TO = ARGV[2] ARGC = 2 ARGV[1] = "-" } { translate(FROM, TO) print }
It is possible to do character transliteration in a user-level
function, but it is not necessarily efficient, and we (the gawk
developers) started to consider adding a
built-in function. However, shortly after writing this program, we
learned that Brian Kernighan had added the toupper()
and
tolower()
functions to his awk
(see String-Manipulation Functions).
These functions handle the vast majority of the cases where character
transliteration is necessary, and so we chose to simply add those
functions to gawk
as well and then
leave well enough alone.
An obvious improvement to this program would be to set up the
t_ar
array only once, in a BEGIN
rule. However, this assumes that the
“from” and “to” lists will never change throughout the lifetime of the
program.
Another obvious improvement is to enable the use of ranges, such
as ‘a-z
’, as allowed by the tr
utility. Look at the code for cut.awk
(see Cutting Out Fields and Columns)
for inspiration.
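A sketch of that improvement is a small helper that expands a specification such as ‘a-e’ into the literal string ‘abcde’ before it reaches stranslate(). The function name is invented, it needs the ord() and chr() library functions, and it shares split.awk’s assumption that letters are contiguous in the character set:

# expand_ranges --- turn "a-e0-3" into "abcde0123" (a sketch)
function expand_ranges(spec,    out, i, c, lo, hi, k)
{
    out = ""
    for (i = 1; i <= length(spec); i++) {
        c = substr(spec, i, 1)
        if (c == "-" && out != "" && i < length(spec)) {
            lo = ord(substr(spec, i - 1, 1)) + 1
            hi = ord(substr(spec, i + 1, 1))
            for (k = lo; k <= hi; k++)
                out = out chr(k)
            i++        # the character after '-' has now been consumed
        } else
            out = out c
    }
    return out
}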
Here is a “real-world”[74] program. This script reads lists of names and addresses and generates mailing labels. Each page of labels has 20 labels on it, two across and 10 down. The addresses are guaranteed to be no more than five lines of data. Each address is separated from the next by a blank line.
The basic idea is to read 20 labels’ worth of data. Each line of
each label is stored in the line
array. The single rule takes care of filling the line
array and printing the page when 20
labels have been read.
The BEGIN
rule simply sets
RS
to the empty string, so that
awk
splits records at blank lines
(see How Input Is Split into Records). It sets MAXLINES
to 100, because 100 is the maximum
number of lines on the page (20 ⋅ 5 = 100).
Most of the work is done in the printpage()
function. The label lines are
stored sequentially in the line
array. But they have to print horizontally: line[1]
next to line[6]
, line[2]
next to line[7]
, and so on. Two loops accomplish this.
The outer loop, controlled by i
,
steps through every 10 lines of data; this is each row of labels. The
inner loop, controlled by j
, goes
through the lines within the row. As j
goes from 0 to 4, ‘i+j
’ is the j
th line in the row, and ‘i+j+5
’ is the entry next to it. The output
ends up looking something like this:
line 1          line 6
line 2          line 7
line 3          line 8
line 4          line 9
line 5          line 10
…
The printf
format string
‘%-41s
’ left-aligns the data and
prints it within a fixed-width
field.
As a final note, an extra blank line is printed at lines 21 and 61, to keep the output lined up on the labels. This is dependent on the particular brand of labels in use when the program was written. You will also note that there are two blank lines at the top and two blank lines at the bottom.
The END
rule arranges to flush
the final page of labels; there may not have been an even multiple of 20
labels in the data:
# labels.awk --- print mailing labels # Each label is 5 lines of data that may have blank lines. # The label sheets have 2 blank lines at the top and 2 at # the bottom. BEGIN { RS = "" ; MAXLINES = 100 } function printpage( i, j) { if (Nlines <= 0) return printf " " # header for (i = 1; i <= Nlines; i += 10) { if (i == 21 || i == 61) print "" for (j = 0; j < 5; j++) { if (i + j > MAXLINES) break printf " %-41s %s ", line[i+j], line[i+j+5] } print "" } printf " " # footer delete line } # main rule { if (Count >= 20) { printpage() Count = 0 Nlines = 0 } n = split($0, a, " ") for (i = 1; i <= n; i++) line[++Nlines] = a[i] for (; i <= 5; i++) line[++Nlines] = "" Count++ } END { printpage() }
When working with large amounts of text, it can be interesting to know how often different words appear. For example, an author may overuse certain words, in which case he or she might wish to find synonyms to substitute for words that appear too often. This subsection develops a program for counting words and presenting the frequency information in a useful format.
At first glance, a program like this would seem to do the job:
# wordfreq-first-try.awk --- print list of word frequencies

{
    for (i = 1; i <= NF; i++)
        freq[$i]++
}

END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}
The program relies on awk
’s
default field-splitting mechanism to break each line up into “words” and
uses an associative array named freq
,
indexed by each word, to count the number of times the word occurs. In
the END
rule, it prints the
counts.
This program has several problems that prevent it from being useful on real text files:
The awk
language considers
upper- and lowercase characters to be distinct. Therefore,
“bartender” and “Bartender” are not treated as the same word. This
is undesirable, because words are capitalized if they begin
sentences in normal text, and a frequency analyzer should not be
sensitive to capitalization.
Words are detected using the awk
convention that fields are separated
just by whitespace. Other characters in the input (except newlines)
don’t have any special meaning to awk
. This means that punctuation
characters count as part of words.
The output does not come out in any useful order. You’re more likely to be interested in which words occur most frequently or in having an alphabetized table of how frequently each word occurs.
The first problem can be solved by using tolower()
to remove case distinctions. The
second problem can be solved by using gsub()
to remove punctuation characters.
Finally, we solve the third problem by using the system sort
utility to process the output of the
awk
script. Here is the new version
of the program:
# wordfreq.awk --- print list of word frequencies

{
    $0 = tolower($0)    # remove case distinctions
    # remove punctuation
    gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
    for (i = 1; i <= NF; i++)
        freq[$i]++
}

END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}
The regexp /[^[:alnum:]_[:blank:]]/
might have been
written /[[:punct:]]/
, but then
underscores would also be removed, and we want to keep them.
Assuming we have saved this program in a file named wordfreq.awk
, and that the data is in
file1
, the following
pipeline:
awk -f wordfreq.awk file1 | sort -k 2nr
produces a table of the words appearing in file1
in order of decreasing
frequency.
The awk
program suitably
massages the data and produces a word frequency table, which is not
ordered. The awk
script’s output is
then sorted by the sort
utility and
printed on the screen.
The options given to sort
specify a sort that uses the second field of each input line (skipping one field), that the sort keys should be treated as numeric quantities
(otherwise ‘15
’ would come before
‘5
’), and that the sorting should be
done in descending (reverse) order.
The sort
could even be done
from within the program, by changing the END
action to:
END { sort = "sort -k 2nr" for (word in freq) printf "%s %d ", word, freq[word] | sort close(sort) }
This way of sorting must be used on systems that do not have true
pipes at the command-line (or batch-file) level. See the general
operating system documentation for more information on how to use the
sort
program.
The uniq
program (see Printing Nonduplicated Lines of Text) removes duplicate lines from
sorted data.
Suppose, however, you need to remove duplicate lines from a datafile but that you want to preserve the order the lines are in. A good example of this might be a shell history file. The history file keeps a copy of all the commands you have entered, and it is not unusual to repeat a command several times in a row. Occasionally you might want to compact the history by removing duplicate entries. Yet it is desirable to maintain the order of the original commands.
This simple program does the job. It uses two arrays. The data
array is indexed by the text of each
line. For each line, data[$0]
is
incremented. If a particular line has not been seen before, then
data[$0]
is zero. In this case, the
text of the line is stored in lines[count]
. Each element of lines
is a unique command, and the indices of
lines
indicate the order in which
those lines are encountered. The END
rule simply prints out the lines, in order:
# histsort.awk --- compact a shell history file
# Thanks to Byron Rakitzis for the general idea

{
    if (data[$0]++ == 0)
        lines[++count] = $0
}

END {
    for (i = 1; i <= count; i++)
        print lines[i]
}
This program also provides a foundation for generating other
useful information. For example, using the following print
statement in the END
rule indicates how often a particular
command is used:
print data[lines[i]], lines[i]
This works because data[$0]
is
incremented each time a line is seen.
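Combining this with the sorting trick from the word-frequency program lists the most heavily used commands first. A sketch of such an END rule (the pipeline to sort is the same idea used in wordfreq.awk):

END {
    sort = "sort -k 1nr"
    for (i = 1; i <= count; i++)
        printf("%d\t%s\n", data[lines[i]], lines[i]) | sort
    close(sort)
}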
Both this chapter and the previous chapter (Chapter 10) present a large number of awk
programs. If you want to experiment with
these programs, it is tedious to type them in by hand. Here we present a
program that can extract parts of a Texinfo input file into separate
files.
This book is written in Texinfo, the GNU Project’s document formatting language. A single Texinfo source file can be used to produce both printed documentation, with TeX, and online documentation. (Texinfo is fully documented in the book Texinfo—The GNU Documentation Format, available from the Free Software Foundation, and also available online.)
For our purposes, it is enough to know three things about Texinfo input files:
The “at” symbol (‘@
’) is
special in Texinfo, much as the backslash (‘\’) is in C or
awk
. Literal ‘@
’ symbols are represented in Texinfo
source files as ‘@@
’.
Comments start with either ‘@c
’ or ‘@comment
’. The file-extraction program
works by using special comments that start at the beginning of a
line.
Lines containing ‘@group
’
and ‘@end group
’ commands bracket
example text that should not be split across a page boundary.
(Unfortunately, TeX isn’t always smart enough to do things exactly
right, so we have to give it some help.)
The following program, extract.awk
, reads through a Texinfo source
file and does two things, based on the special comments. Upon seeing
‘@c system …
’, it runs a
command, by extracting the command text from the control line and
passing it on to the system()
function (see Input/Output Functions). Upon seeing
‘@c file filename’, each subsequent line is
sent to the file filename
, until ‘@c endfile
’ is encountered. The rules in
extract.awk
match either ‘@c
’ or ‘@comment
’ by letting the ‘omment
’ part be optional. Lines containing
‘@group
’ and ‘@end group
’ are simply removed. extract.awk
uses the join()
library function (see Merging an Array into a String).
The example programs in the online Texinfo source for
Effective awk Programming (gawktexi.in
) have all been bracketed inside
‘file
’ and ‘endfile
’ lines. The gawk
distribution uses a copy of extract.awk
to extract the sample programs
and install many of them in a standard directory where gawk
can find them. The Texinfo file looks
something like this:
…
This program has a @code{BEGIN} rule
that prints a nice message:

@example
@c file examples/messages.awk
BEGIN @{ print "Don't panic!" @}
@c end file
@end example

It also prints some final advice:

@example
@c file examples/messages.awk
END @{ print "Always avoid bored archaeologists!" @}
@c end file
@end example
…
extract.awk
begins by setting
IGNORECASE
to one, so that mixed
upper- and lowercase letters in the directives won’t matter.
The first rule handles calling system()
, checking that a command is given
(NF
is at least three), and also
checking that the command exits with a zero exit status, signifying
OK:
# extract.awk --- extract files and run programs from Texinfo files BEGIN { IGNORECASE = 1 } /^@c(omment)?[ ]+system/ { if (NF < 3) { e = ("extract: " FILENAME ":" FNR) e = (e ": badly formed `system' line") print e > "/dev/stderr" next } $1 = "" $2 = "" stat = system($0) if (stat != 0) { e = ("extract: " FILENAME ":" FNR) e = (e ": warning: system returned " stat) print e > "/dev/stderr" } }
The variable e
is used so that
the rule fits nicely on the page.
The second rule handles moving data into files. It verifies that a
filename is given in the directive. If the file named is not the current
file, then the current file is closed. Keeping the current file open
until a new file is encountered allows the use of the ‘>
’ redirection for printing the contents,
keeping open-file management simple.
The for
loop does the work. It
reads lines using getline
(see Explicit Input with getline). For an unexpected end-of-file, it calls the
unexpected_eof()
function. If the line is an “endfile” line, then it breaks out of the
loop. If the line is an ‘@group
’ or
‘@end group
’ line, then it ignores it
and goes on to the next line. Similarly, comments within examples are
also ignored.
Most of the work is in the following few lines. If the line has no
‘@
’ symbols, the program can print it
directly. Otherwise, each leading ‘@
’
must be stripped off. To remove the ‘@
’
symbols, the line is split into separate elements of the array a
, using the split()
function
(see String-Manipulation Functions). The ‘@
’ symbol is used as the separator character.
Each element of a
that is empty
indicates two successive ‘@
’ symbols
in the original line. For each two empty elements (‘@@
’ in the original file), we have to add a
single ‘@
’ symbol back in.
When the processing of the array is finished, join()
is called with the value of SUBSEP
(see Multidimensional Arrays), to rejoin the pieces back into a single
line. That line is then printed to the output file:
/^@c(omment)?[ ]+file/ { if (NF != 3) { e = ("extract: " FILENAME ":" FNR ": badly formed `file' line") print e > "/dev/stderr" next } if ($3 != curfile) { if (curfile != "") close(curfile) curfile = $3 } for (;;) { if ((getline line) <= 0) unexpected_eof() if (line ~ /^@c(omment)?[ ]+endfile/) break else if (line ~ /^@(end[ ]+)?group/) continue else if (line ~ /^@c(omment+)?[ ]+/) continue if (index(line, "@") == 0) { print line > curfile continue } n = split(line, a, "@") # if a[1] == "", means leading @, # don't add one back in. for (i = 2; i <= n; i++) { if (a[i] == "") { # was an @@ a[i] = "@" if (a[i+1] == "") i++ } } print join(a, 1, n, SUBSEP) > curfile } }
An important thing to note is the use of the ‘>
’ redirection. Output done with ‘>
’ only opens the file once; it stays open
and subsequent output is appended to the file (see Redirecting Output of print and printf). This makes it easy to mix program text and
explanatory prose for the same sample source file (as has been done
here!) without any hassle. The file is only closed when a new datafile
name is encountered or at the end of the input file.
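This open-once behavior of ‘>’ is easy to verify with a toy program (the output filename is arbitrary):

# sketch: '>' truncates only when the file is first opened
BEGIN {
    print "first"  > "demo.out"
    print "second" > "demo.out"   # file is already open, so this appends
    close("demo.out")             # demo.out now holds both lines
}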
Finally, the function unexpected_eof()
prints an
appropriate error message and then exits. The END
rule handles
the final cleanup, closing the open file:
function unexpected_eof()
{
    printf("extract: %s:%d: unexpected EOF or error\n",
           FILENAME, FNR) > "/dev/stderr"
    exit 1
}

END {
    if (curfile)
        close(curfile)
}
The sed
utility is a
stream editor, a program that reads a stream of
data, makes changes to it, and passes it on. It is often used to make global changes to a large file
or to a stream of data generated by a pipeline of commands. Although sed
is a
complicated program in its own right, its most common use is to perform
global substitutions in the middle of a pipeline:
command1 < orig.data | sed 's/old/new/g' | command2 > result
Here, ‘s/old/new/g
’ tells
sed
to look for the regexp ‘old
’ on each input line and globally replace
it with the text ‘new
’ (i.e., all the
occurrences on a line). This is similar to awk
’s gsub()
function (see String-Manipulation Functions).
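For comparison, a rough awk equivalent of that sed invocation (a sketch, not one of the programs in this section) can be written with gsub():

awk '{ gsub(/old/, "new"); print }' orig.data > result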
The following program, awksed.awk
, accepts at least two command-line
arguments: the pattern to look for and the text to replace it with. Any
additional arguments are treated as datafile names to process. If none
are provided, the standard input is used:
# awksed.awk --- do s/foo/bar/g using just print
#    Thanks to Michael Brennan for the idea

function usage()
{
    print "usage: awksed pat repl [files...]" > "/dev/stderr"
    exit 1
}

BEGIN {
    # validate arguments
    if (ARGC < 3)
        usage()

    RS = ARGV[1]
    ORS = ARGV[2]

    # don't use arguments as files
    ARGV[1] = ARGV[2] = ""
}

# look ma, no hands!
{
    if (RT == "")
        printf "%s", $0
    else
        print
}
The program relies on gawk
’s
ability to have RS
be a regexp, as
well as on the setting of RT
to the
actual text that terminates the record (see How Input Is Split into Records).
The idea is to have RS
be the
pattern to look for. gawk
automatically sets $0
to the text
between matches of the pattern. This is text that we want to keep,
unmodified. Then, by setting ORS
to
the replacement text, a simple print
statement outputs the text we want to keep, followed by the replacement
text.
There is one wrinkle to this scheme, which is what to do if the
last record doesn’t end with text that matches RS
. Using a print
statement unconditionally prints the
replacement text, which is not correct. However, if the file did not end
in text that matches RS
, RT
is set to the null string. In this case, we
can print $0
using printf
(see Using printf Statements for Fancier Printing).
The BEGIN
rule handles the
setup, checking for the right number of arguments and calling usage()
if there is a problem. Then it sets
RS
and ORS
from the command-line arguments and sets
ARGV[1]
and ARGV[2]
to the null string, so that they are
not treated as filenames (see Using ARGC and ARGV).
The usage()
function prints an
error message and exits. Finally, the single rule handles the printing
scheme outlined earlier, using print
or printf
as appropriate, depending
upon the value of RT
.
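For example, assuming the program is saved in a file named awksed.awk, a typical invocation might look like this (a hypothetical command line, not from the original text):

gawk -f awksed.awk 'old' 'new' orig.data > result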
In Including Other Files into Your Program, we saw how gawk
provides a built-in file-inclusion
capability. However, this is a gawk
extension. This section provides the motivation for making file
inclusion available for standard awk
,
and shows how to do it using a combination of shell and awk
programming.
Using library functions in awk
can be very beneficial. It encourages code reuse and the writing of
general functions. Programs are smaller and therefore clearer. However,
using library functions is only easy when writing awk
programs; it is painful when running them,
requiring multiple -f
options. If gawk
is unavailable, then so too is the
AWKPATH
environment variable
and the ability to put awk
functions into a library directory (see
Command-Line Options). It would be nice to be able to write
programs in the following manner:
# library functions
@include getopt.awk
@include join.awk
…

# main program
BEGIN {
    while ((c = getopt(ARGC, ARGV, "a:b:cde")) != -1)
        …
    …
}
The following program, igawk.sh
, provides this service. It simulates gawk
’s
searching of the AWKPATH
variable and also allows nested includes (i.e., a
file that is included with @include
can contain further @include
statements). igawk
makes an effort to
include files only once, so that
nested includes don’t accidentally include a library function
twice.
igawk
should behave just like
gawk
externally. This means it should
accept all of gawk
’s command-line
arguments, including the ability to have multiple source files specified
via -f
and the ability to mix command-line and library
source files.
The program is written using the POSIX Shell (sh
)
command language.[75] It works as
follows:
Loop through the arguments, saving anything that doesn’t
represent awk
source code for
later, when the expanded program is run.
For any arguments that do represent awk
text, put the arguments into a shell
variable that will be expanded. There are two cases:
Literal text, provided with -e
or
--source
. This text is just appended
directly.
Source filenames, provided with -f
. We
use a neat trick and append ‘@include filename’ to the shell
variable’s contents. Because the file-inclusion program works
the way gawk does, this gets
the text of the file included in the program at the correct
point.
Run an awk
program
(naturally) over the shell variable’s contents to expand @include
statements. The expanded
program is placed in a second shell variable.
Run the expanded program with gawk
and any other original command-line
arguments that the user supplied (such as the datafile
names).
This program uses shell variables extensively: for storing
command-line arguments and the text of the awk
program that will expand the user’s
program, for the user’s original program, and for the expanded program.
Doing so removes some potential problems that might arise were we to use
temporary files instead, at the cost of making the script somewhat more
complicated.
The initial part of the program turns on shell tracing if the
first argument is ‘debug
’.
The next part loops through all the command-line arguments. There are several cases of interest:
--
This ends the arguments to igawk
. Anything else should be passed on
to the user’s awk
program
without being evaluated.
-W
This indicates that the next option is specific to gawk
. To make argument processing
easier, the -W
is glued onto the front of the next argument and the loop continues. (This is an sh
programming trick. Don’t worry about
it if you are not familiar with sh
.)
-v
, -F
These are saved and passed on to gawk
.
-f
, --file
,
--file=
, -Wfile=
The filename is appended to the shell variable program
with an @include
statement. The expr
utility is used to remove the
leading option part of the argument (e.g., ‘--file=
’). (Typical sh
usage would be to use the echo
and sed
utilities to do this work.
Unfortunately, some versions of echo
evaluate escape sequences in their
arguments, possibly mangling the program text. Using expr
avoids this problem.)
--source
, --source=
,
-Wsource=
The source text is appended to program
.
--version
, -Wversion
igawk
prints its version
number, runs ‘gawk --version
’
to get the gawk
version
information, and then exits.
If none of the -f
, --file
,
-Wfile
, --source
, or
-Wsource
arguments are supplied, then the first
nonoption argument should be the awk
program. If there are no command-line arguments left, igawk
prints an error message and exits.
Otherwise, the first argument is appended to program
. In any case, after the arguments have
been processed, the shell variable program
contains the complete text of the
original awk
program.
The program is as follows:
#! /bin/sh
# igawk --- like gawk but do @include processing

if [ "$1" = debug ]
then
    set -x
    shift
fi

# A literal newline, so that program text is formatted correctly
n='
'

# Initialize variables to empty
program=
opts=

while [ $# -ne 0 ] # loop over arguments
do
    case $1 in
    --)     shift
            break ;;

    -W)     shift
            # The ${x?'message here'} construct prints a
            # diagnostic if $x is the null string
            set -- -W"${@?'missing operand'}"
            continue ;;

    -[vF])  opts="$opts $1 '${2?'missing operand'}'"
            shift ;;

    -[vF]*) opts="$opts '$1'" ;;

    -f)     program="$program$n@include ${2?'missing operand'}"
            shift ;;

    -f*)    f=$(expr "$1" : '-f\(.*\)')
            program="$program$n@include $f" ;;

    -[W-]file=*)
            f=$(expr "$1" : '-.file=\(.*\)')
            program="$program$n@include $f" ;;

    -[W-]file)
            program="$program$n@include ${2?'missing operand'}"
            shift ;;

    -[W-]source=*)
            t=$(expr "$1" : '-.source=\(.*\)')
            program="$program$n$t" ;;

    -[W-]source)
            program="$program$n${2?'missing operand'}"
            shift ;;

    -[W-]version)
            echo igawk: version 3.0 1>&2
            gawk --version
            exit 0 ;;

    -[W-]*) opts="$opts '$1'" ;;

    *)      break ;;
    esac
    shift
done

if [ -z "$program" ]
then
    program=${1?'missing program'}
    shift
fi

# At this point, `program' has the program.
The awk
program to process
@include
directives is stored in the
shell variable expand_prog
. Doing
this keeps the shell script readable. The awk
program reads through the user’s program,
one line at a time, using getline
(see Explicit Input with getline). The input filenames and @include
statements are managed using a stack.
As each @include
is encountered, the
current filename is “pushed” onto the stack and the file named in the
@include
directive becomes the
current filename. As each file is finished, the stack is “popped,” and
the previous input file becomes the current input file again. The
process is started by making the original file the first one on the
stack.
The pathto()
function does the
work of finding the full path to a file. It simulates gawk
’s behavior when searching the
AWKPATH
environment variable (see The AWKPATH Environment Variable). If a filename has a ‘/
’ in it, no path search is done. Similarly,
if the filename is "-"
, then that
string is used as-is. Otherwise, the filename is concatenated with the
name of each directory in the path, and an attempt is made to open the
generated filename. The only way to test if a file can be read in
awk
is to go ahead and try to read it
with getline
; this is what pathto()
does.[76] If the file can be read, it is closed and the filename is
returned:
expand_prog='

function pathto(file,    i, t, junk)
{
    if (index(file, "/") != 0)
        return file

    if (file == "-")
        return file

    for (i = 1; i <= ndirs; i++) {
        t = (pathlist[i] "/" file)
        if ((getline junk < t) > 0) {
            # found it
            close(t)
            return t
        }
    }
    return ""
}
The main program is contained inside one BEGIN
rule. The first thing it does is set up
the pathlist
array that pathto()
uses. After splitting the path on
‘:
’, null elements are replaced with
"."
, which represents the current
directory:
BEGIN {
    path = ENVIRON["AWKPATH"]
    ndirs = split(path, pathlist, ":")
    for (i = 1; i <= ndirs; i++) {
        if (pathlist[i] == "")
            pathlist[i] = "."
    }
The stack is initialized with ARGV[1]
, which will be "/dev/stdin"
. The main loop comes next. Input
lines are read in succession. Lines that do not start with @include
are printed verbatim. If the line
does start with @include
, the
filename is in $2
. pathto()
is called to generate the full path.
If no path is found, the program prints an error message and continues.
The next thing to check is if the file is included already. The
processed
array is indexed by the
full filename of each included file and it tracks this information for
us. If the file is seen again, a warning message is printed. Otherwise,
the new filename is pushed onto the stack and processing
continues.
Finally, when getline
encounters the end of the input file, the file is closed and the stack
is popped. When stackptr
is less than
zero, the program is done:
    stackptr = 0
    input[stackptr] = ARGV[1]   # ARGV[1] is first file

    for (; stackptr >= 0; stackptr--) {
        while ((getline < input[stackptr]) > 0) {
            if (tolower($1) != "@include") {
                print
                continue
            }
            fpath = pathto($2)
            if (fpath == "") {
                printf("igawk: %s:%d: cannot find %s\n",
                       input[stackptr], FNR, $2) > "/dev/stderr"
                continue
            }
            if (! (fpath in processed)) {
                processed[fpath] = input[stackptr]
                input[++stackptr] = fpath   # push onto stack
            } else
                print $2, "included in", input[stackptr],
                    "already included in", processed[fpath] > "/dev/stderr"
        }
        close(input[stackptr])
    }
}'  # close quote ends `expand_prog' variable

processed_program=$(gawk -- "$expand_prog" /dev/stdin << EOF
$program
EOF
)
The shell construct ‘command << marker’ is called a here
document. Everything in the shell script up to the
marker is fed to
command as input. The shell processes the
contents of the here document for variable and command substitution (and
possibly other things as well, depending upon the shell).
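For instance, this small, hypothetical example (not part of igawk.sh) feeds two lines to wc -l; the shell expands $HOME inside the here document before wc sees it:

wc -l << EOF
line one, with $HOME expanded by the shell
line two
EOF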
The shell construct ‘$(…)
’ is
called command substitution. The output of the command inside the parentheses is
substituted into the command line. Because the result is used in a variable assignment, it is
saved as a single string, even if the results contain whitespace.
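A small, hypothetical illustration (the variable name is made up):

today=$(date)               # capture the output of date
echo "It is now: $today"    # the whole output is kept as one string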
The expanded program is saved in the variable processed_program
. It’s done in these steps:
Run gawk
with the @include
-processing program (the value of
the expand_prog
shell variable)
reading standard input.
Standard input is the contents of the user’s program, from the
shell variable program
. Feed its
contents to gawk
via a here
document.
Save the results of this processing in the shell variable
processed_program
by using
command substitution.
The last step is to call gawk
with the expanded program, along with the original options and
command-line arguments that the user supplied:
eval gawk $opts -- '"$processed_program"' '"$@"'
The eval
command is a shell
construct that reruns the shell’s parsing process. This keeps things
properly quoted.
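Here is a small, hypothetical illustration of the effect (none of these names come from igawk.sh). Without eval, the single quotes stored inside $args would reach gawk as literal characters; eval makes the shell parse the command line a second time, so those quotes group ‘msg=hello world’ into a single argument:

args="-v 'msg=hello world'"
eval gawk $args '"BEGIN { print msg }"'    # prints: hello world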
This version of igawk
represents the fifth version of this program. There are four key
simplifications that make the program work better:
Using @include
even for the
files named with -f
makes building the initial
collected awk
program much
simpler; all the @include
processing can be done once.
Not saving the line read with getline in the pathto()
function (when testing for the file’s accessibility) for later use
by the main program simplifies things considerably.
Using a getline
loop in the
BEGIN
rule does it all in one
place. It is not necessary to call out to a separate loop for
processing nested @include
statements.
Instead of saving the expanded program in a temporary file,
putting it in a shell variable avoids some potential security
problems. This has the disadvantage that the script relies upon more
features of the sh
language,
making it harder to follow for those who aren’t familiar with
sh
.
Also, this program illustrates that it is often worthwhile to
combine sh
and awk
programming. You can usually
accomplish quite a lot, without having to resort to low-level
programming in C or C++, and it is frequently easier to do certain kinds
of string and argument manipulation using the shell than it is in
awk
.
Finally, igawk
shows that it is
not always necessary to add new features to a program; they can often be
layered on top.[77]
An interesting programming challenge is to search for
anagrams in a word list (such as /usr/share/dict/words
on many GNU/Linux
systems). One word is an anagram of another if both words contain
the same letters (e.g., “babbling” and “blabbing”).
Column 2, Problem C, of Jon Bentley’s Programming Pearls, Second Edition, presents an elegant algorithm. The idea is to give words that are anagrams a common signature, sort all the words together by their signatures, and then print them. Dr. Bentley observes that taking the letters in each word and sorting them produces those common signatures.
The following program uses arrays of arrays to bring together words with the same signature and array sorting to print the words in sorted order:
# anagram.awk --- An implementation of the anagram-finding algorithm
#                 from Jon Bentley's "Programming Pearls," 2nd edition.
#                 Addison Wesley, 2000, ISBN 0-201-65788-0.
#                 Column 2, Problem C, section 2.8, pp 18-20.

/'s$/ { next }   # Skip possessives
The program starts with a header, and then a rule to skip possessives in the dictionary file. The next rule builds up the data structure. The first dimension of the array is indexed by the signature; the second dimension is the word itself:
{
    key = word2key($1)    # Build signature
    data[key][$1] = $1    # Store word with signature
}
The word2key()
function creates
the signature. It splits the word apart into individual letters, sorts
the letters, and then joins them back together:
# word2key --- split word apart into letters, sort, and join back together

function word2key(word,    a, i, n, result)
{
    n = split(word, a, "")
    asort(a)
    for (i = 1; i <= n; i++)
        result = result a[i]
    return result
}
Finally, the END
rule traverses
the array and prints out the anagram lists. It sends the output to the
system sort
command because otherwise
the anagrams would appear in arbitrary order:
END {
    sort = "sort"
    for (key in data) {
        # Sort words with same key
        nwords = asorti(data[key], words)
        if (nwords == 1)
            continue
        # And print. Minor glitch: trailing space at end of each line
        for (j = 1; j <= nwords; j++)
            printf("%s ", words[j]) | sort
        print "" | sort
    }
    close(sort)
}
Here is some partial output when the program is run:
$ gawk -f anagram.awk /usr/share/dict/words | grep '^b'
…
babbled blabbed
babbler blabber brabble
babblers blabbers brabbles
babbling blabbing
babbly blabby
babel bable
babels beslab
babery yabber
…
The following program was written by Davide Brini and
is published on his
website. It serves as his signature in the Usenet group comp.lang.awk
. He supplies the following
copyright terms:
Copyright © 2008 Davide Brini
Copying and distribution of the code published in this page, with or without modification, are permitted in any medium without royalty provided the copyright notice and this notice are preserved.
Here is the program:
awk 'BEGIN{O="~"~"~";o="=="=="==";o+=+o;x=O""O;while(X++<=x+o+o)c=c"%c"; printf c,(x-O)*(x-O),x*(x-o)-o,x*(x-O)+x-O-o,+x*(x-O)-x+o,X*(o*o+O)+x-O, X*(X-x)-o*o,(x+X)*o*o+o,x*(X-x)-O-O,x-O+(O+o+X+x)*(o+O),X*X-X*(x-O)-x+O, O+X*(o*(o+O)+O),+x+O+X*o,x*(x-o),(o+X+x)*o*o-(x-O-O),O+(X-x)*(X+O),x-O}'
We leave it to you to determine what the program does. (If you are truly desperate to understand it, see Chris Johansen’s explanation, which is embedded in the Texinfo source file for this book.)
The programs provided in this chapter continue on the theme that reading programs is an excellent way to learn Good Programming.
Using ‘#!
’ to make awk
programs directly runnable makes them
easier to use. Otherwise, invoke the program using ‘awk -f …
’.
Reimplementing standard POSIX programs in awk
is a pleasant exercise; awk
’s expressive power lets you write such
programs in relatively few lines of code, yet they are functionally
complete and usable.
One of standard awk
’s
weaknesses is working with individual characters. The ability to use
split()
with the empty string as
the separator can considerably simplify such tasks (see the short sketch after this summary).
The examples here demonstrate the usefulness of the library functions from Chapter 10 for a number of real (if small) programs.
Besides reinventing POSIX wheels, other programs solved a selection of interesting problems, such as finding duplicate words in text, printing mailing labels, and finding anagrams.
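As a brief illustration of the point about split() with an empty separator, here is a minimal sketch (not part of any program in this chapter) that breaks a word into its individual letters:

BEGIN {
    n = split("babbling", letters, "")
    for (i = 1; i <= n; i++)
        printf("%d: %s\n", i, letters[i])
}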
[69] It also introduces a subtle bug; if a match happens, we output the translated line, not the original.
[70] This is the traditional usage. The POSIX usage is different, but not relevant for what the program aims to demonstrate.
[71] This is the definition returned from entering define: state machine
into Google.
[72] Because gawk
understands
multibyte locales, this code counts characters, not bytes.
[73] On some older systems, including Solaris, the system version
of tr
may require that the lists
be written as range expressions enclosed in square brackets
(‘[a-z]
’) and quoted, to prevent
the shell from attempting a filename expansion. This is not a
feature.
[74] “Real world” is defined as “a program actually used to get something done.”
[75] Fully explaining the sh
language is beyond the scope of this book. We provide some minimal
explanations, but see a good shell programming book if you wish to
understand things in more depth.
[76] On some very old versions of awk
, the test ‘getline junk < t
’ can loop forever if
the file exists but is empty.
[77] gawk
does @include
processing itself in order to
support the use of awk
programs
as Web CGI scripts.