As you have already seen, each awk rule consists of a pattern with an associated action. This chapter describes how you build patterns and actions, what kinds of things you can do within actions, and awk's predefined variables.
The pattern–action rules and the statements available for use within actions form the core of awk programming. In a sense, everything covered up to here has been the foundation that programs are built on top of. Now it's time to start building something useful.
Patterns in awk control the execution of rules—a rule is executed when its pattern matches the current input record. The following is a summary of the types of awk patterns:

/regular expression/
A regular expression. It matches when the text of the input record fits the regular expression. (See Chapter 3.)
expression
A single expression. It matches when its value is nonzero (if a number) or non-null (if a string). (See Expressions as Patterns.)
begpat, endpat
A pair of patterns separated by a comma, specifying a range of records. The range includes both the initial record that matches begpat and the final record that matches endpat. (See Specifying Record Ranges with Patterns.)
BEGIN
END
Special patterns for you to supply startup or cleanup actions for your awk program. (See The BEGIN and END Special Patterns.)
BEGINFILE
ENDFILE
Special patterns for you to supply startup or cleanup actions to be done on a per-file basis. (See The BEGINFILE and ENDFILE Special Patterns.)
empty
The empty pattern matches every input record. (See The Empty Pattern.)
Regular expressions are one of the first kinds of patterns presented in this book. This kind of pattern is simply a regexp constant in the pattern part of a rule. Its meaning is '$0 ~ /pattern/': the pattern matches when the input record matches the regexp. For example:

/foo|bar|baz/ { buzzwords++ }
END           { print buzzwords, "buzzwords seen" }
Any awk expression is valid as an awk pattern. The pattern matches if the expression's value is nonzero (if a number) or non-null (if a string). The expression is reevaluated each time the rule is tested against a new input record. If the expression uses fields such as $1, the value depends directly on the new input record's text; otherwise, it depends only on what has happened so far in the execution of the awk program.
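For instance, an expression pattern need not examine the record's text at all. A minimal sketch (the sample data is invented and fed in with printf rather than read from a file):

```shell
# NR <= 2 is an ordinary expression pattern: it is nonzero (true)
# while reading the first two records, so only they are printed.
printf 'alpha\nbeta\ngamma\n' | awk 'NR <= 2'
```

Because NR does not depend on the record's text, this pattern's truth value changes only as records are counted.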
Comparison expressions, using the comparison operators described in Variable Typing and Comparison Expressions, are a very common kind of pattern. Regexp matching and nonmatching are also very common expressions. The left operand of the '~' and '!~' operators is a string. The right operand is either a constant regular expression enclosed in slashes (/regexp/), or any expression whose string value is used as a dynamic regular expression (see Using Dynamic Regexps). The following example prints the second field of each input record whose first field is precisely 'li':
$ awk '$1 == "li" { print $2 }' mail-list
(There is no output, because there is no person with the exact name 'li'.) Contrast this with the following regular expression match, which accepts any record with a first field that contains 'li':
$ awk '$1 ~ /li/ { print $2 }' mail-list
555-5553
555-6699
A regexp constant as a pattern is also a special case of an expression pattern. The expression /li/ has the value one if 'li' appears in the current input record. Thus, as a pattern, /li/ matches any record containing 'li'.
Boolean expressions are also commonly used as patterns. Whether the pattern matches an input record depends on whether its subexpressions match. For example, the following command prints all the records in mail-list that contain both 'edu' and 'li':
$ awk '/edu/ && /li/' mail-list
Samuel 555-3430 [email protected] A
The following command prints all records in mail-list that contain either 'edu' or 'li' (or both, of course):
$ awk '/edu/ || /li/' mail-list
Amelia 555-5553 [email protected] F
Broderick 555-0542 [email protected] R
Fabius 555-1234 [email protected] F
Julie 555-6699 [email protected] F
Samuel 555-3430 [email protected] A
Jean-Paul 555-2127 [email protected] R
The following command prints all records in mail-list that do not contain the string 'li':
$ awk '! /li/' mail-list
Anthony 555-3412 [email protected] A
Becky 555-7685 [email protected] A
Bill 555-1675 [email protected] A
Camilla 555-2912 [email protected] R
Fabius 555-1234 [email protected] F
Martin 555-6480 [email protected] A
Jean-Paul 555-2127 [email protected] R
The subexpressions of a Boolean operator in a pattern can be constant regular expressions, comparisons, or any other awk expressions. Range patterns are not expressions, so they cannot appear inside Boolean patterns. Likewise, the special patterns BEGIN, END, BEGINFILE, and ENDFILE, which never match any input record, are not expressions and cannot appear inside Boolean patterns.
The precedence of the different operators that can appear in patterns is described in Operator Precedence (How Operators Nest).
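Since any expression can serve as a subexpression, a comparison and a regexp match can be combined in a single Boolean pattern. A small sketch with invented data:

```shell
# Print records whose first field exceeds 10 AND whose text
# contains "ok". Only the second record satisfies both conditions.
printf '5 ok\n20 ok\n30 bad\n' | awk '$1 > 10 && /ok/'
```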
A range pattern is made of two patterns separated by a comma, in the form 'begpat, endpat'. It is used to match ranges of consecutive input records. The first pattern, begpat, controls where the range begins, while endpat controls where it ends. For example, the following:

awk '$1 == "on", $1 == "off"' myfile

prints every record in myfile between 'on'/'off' pairs, inclusive.
A range pattern starts out by matching begpat against every input record. When a record matches begpat, the range pattern is turned on, and the range pattern matches this record as well. As long as the range pattern stays turned on, it automatically matches every input record read. The range pattern also matches endpat against every input record; when this succeeds, the range pattern is turned off again for the following record. Then the range pattern goes back to checking begpat against each record.
The record that turns on the range pattern and the one that turns
it off both match the range pattern. If you don’t want to operate on
these records, you can write if
statements in the rule’s action to distinguish them from the records you
are interested in.
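The on/off behavior is easy to observe with a few inline records (sample data invented for the sketch):

```shell
# The range is turned on by "on" and off by "off"; both delimiter
# records match the range pattern, as does everything between them.
printf 'a\non\nb\noff\nc\n' | awk '$1 == "on", $1 == "off"'
```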
It is possible for a pattern to be turned on and off by the same
record. If the record satisfies both conditions, then the action is
executed for just that record. For example, suppose there is text
between two identical markers (e.g., the ‘%
’ symbol), each on its own line, that should
be ignored. A first attempt would be to combine a range pattern that
describes the delimited text with the next
statement (not discussed yet, see The next Statement). This causes awk
to skip any further processing of the
current record and start over
again with the next input record. Such a program looks like this:
/^%$/,/^%$/ { next }
{ print }
This program fails because the range pattern is both turned on and
turned off by the first line, which just has a ‘%
’ on it. To accomplish this task, write the
program in the following manner, using a flag:
/^%$/     { skip = ! skip; next }
skip == 1 { next } # skip lines with `skip' set
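With an explicit print rule appended (the fragment itself prints nothing), the flag technique can be exercised on inline data:

```shell
# Lines between (and including) the "%" markers are skipped;
# the trailing { print } rule emits everything else.
printf 'keep1\n%%\nskip me\n%%\nkeep2\n' |
awk '/^%$/     { skip = ! skip; next }
     skip == 1 { next }
     { print }'
```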
In a range pattern, the comma (‘,
’) has the lowest precedence of all the
operators (i.e., it is evaluated last). Thus, the following program
attempts to combine a range pattern with another, simpler test:
echo Yes | awk '/1/,/2/ || /Yes/'
The intent of this program is ‘(/1/,/2/)
|| /Yes/
’. However, awk
interprets this as ‘/1/, (/2/ ||
/Yes/)
’. This cannot be changed or worked around; range
patterns do not combine with other patterns:
$ echo Yes | gawk '(/1/,/2/) || /Yes/'
error→ gawk: cmd. line:1: (/1/,/2/) || /Yes/
error→ gawk: cmd. line:1: ^ syntax error
As a minor point of interest, although it is poor style, POSIX allows you to put a newline after the comma in a range pattern. (d.c.)
All the patterns described so far are for matching input
records. The BEGIN
and END
special patterns are different. They supply startup and cleanup actions
for awk
programs. BEGIN
and END
rules must have actions; there is no default action for these rules because there
is no current record when they run. BEGIN
and END
rules are often referred to as “BEGIN
and END
blocks” by longtime awk
programmers.
A BEGIN
rule is executed once
only, before the first input record is read. Likewise, an END
rule is executed once only, after all the input is read. For
example:
$ awk '
> BEGIN { print "Analysis of \"li\"" }
> /li/  { ++n }
> END   { print "\"li\" appears in", n, "records." }' mail-list
Analysis of "li"
"li" appears in 4 records.
This program finds the number of records in the input file
mail-list
that contain the string
‘li
’. The BEGIN
rule prints a title for the report.
There is no need to use the BEGIN
rule to initialize the counter n
to
zero, as awk
does this
automatically (see Variables). The second rule
increments the variable n
every
time a record containing the pattern ‘li
’ is read. The END
rule prints the value of n
at the end of the run.
The special patterns BEGIN
and END
cannot be used in ranges or
with Boolean operators (indeed, they cannot be used with any
operators). An awk
program may have
multiple BEGIN
and/or END
rules. They are executed in the order in
which they appear: all the BEGIN
rules at startup and all the END
rules at termination. BEGIN
and
END
rules may be intermixed with
other rules. This feature was added in the 1987 version of awk
and is included in the POSIX standard.
The original (1978) version of awk
required the BEGIN
rule to be
placed at the beginning of the program, the END
rule to be placed at the end, and only
allowed one of each. This is no longer required, but it is a good idea
to follow this template in terms of program organization and
readability.
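The ordering rule is easy to observe directly; a throwaway sketch reading from /dev/null:

```shell
# Both BEGIN rules run first, in program order; both END rules run
# last, also in program order, even though the rules are intermixed.
awk 'BEGIN { print "BEGIN 1" }
     END   { print "END 1" }
     BEGIN { print "BEGIN 2" }
     END   { print "END 2" }' /dev/null
```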
Multiple BEGIN
and END
rules are useful for writing library
functions, because each library file can have its own BEGIN
and/or END
rule to do its own initialization and/or
cleanup. The order in which library functions are named on the command
line controls the order in which their BEGIN
and END
rules are executed. Therefore, you have
to be careful when writing such rules in library files so that the
order in which they are executed
doesn’t matter. See Command-Line Options for more information on
using library functions. See Chapter 10 for
a number of useful library functions.
If an awk
program has only
BEGIN
rules and no other rules,
then the program exits after the BEGIN
rules are run. However, if an END
rule exists, then the input is read, even if there are no other rules
in the program. This is necessary in case the END
rule checks the FNR
and NR
variables.
There are several (sometimes subtle) points to be aware of when
doing I/O from a BEGIN
or END
rule. The first has to do with the value of $0
in a BEGIN
rule. Because BEGIN
rules are executed before any input is
read, there simply is no input record, and therefore no fields, when
executing BEGIN
rules. References
to $0
and the fields yield a null
string or zero, depending upon the
context. One way to give $0
a real
value is to execute a getline
command without a variable (see Explicit Input with getline). Another
way is simply to assign a value to $0
.
The second point is similar to the first, but from the other
direction. Traditionally, due largely to implementation issues,
$0
and NF
were undefined
inside an END
rule. The POSIX standard specifies that NF
is available in an END
rule. It contains the number of fields
from the last input record. Most probably due to an oversight, the
standard does not say that $0
is
also preserved, although logically one would think that it should be.
In fact, all of BWK awk
, mawk
, and gawk
preserve the value of $0
for use in END
rules. Be aware, however, that some
other implementations and many older versions of Unix awk
do not.
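The POSIX behavior for NF is easy to verify in gawk, mawk, or BWK awk (the data is invented):

```shell
# NF in the END rule still holds the field count of the last
# record read, here "c d e" with three fields.
printf 'a b\nc d e\n' | awk 'END { print NF }'
```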
The third point follows from the first two. The meaning of ‘print
’ inside a BEGIN
or END
rule is the same as always: ‘print $0
’. If $0
is the null string, then this prints an
empty record. Many longtime awk
programmers use an unadorned ‘print
’ in BEGIN
and END
rules, to mean ‘print ""
’, relying on $0
being null. Although one might generally
get away with this in BEGIN
rules,
it is a very bad idea in END
rules,
at least in gawk
. It is also poor
style, because if an empty line is needed in the output, the program
should print one explicitly.
Finally, the next
and
nextfile
statements are not allowed
in a BEGIN
rule, because the
implicit read-a-record-and-match-against-the-rules loop has not started yet. Similarly, those statements are
not valid in an END
rule, because
all the input has been read. (See The next Statement
and The nextfile Statement.)
This section describes a gawk
-specific feature.
Two special kinds of rule, BEGINFILE
and ENDFILE
, give you “hooks” into gawk
’s command-line file processing loop. As with the BEGIN
and END
rules (see the previous section), all
BEGINFILE
rules in a program are
merged, in the order they are read by gawk
, and all ENDFILE
rules are merged as well.
The body of the BEGINFILE
rules
is executed just before gawk
reads
the first record from a file. FILENAME
is set to the name of the current
file, and FNR
is set to zero.
The BEGINFILE
rule provides you
the opportunity to accomplish two tasks that would otherwise be
difficult or impossible to perform:
You can test if the file is readable. Normally, it is a fatal error if a file named on the command line cannot be opened for reading. However, you can bypass the fatal error and move on to the next file on the command line.
You do this by checking if the ERRNO
variable is not the empty string; if so, then gawk
was not able to open the file. In
this case, your program can execute the nextfile
statement (see The nextfile Statement). This causes gawk
to skip the file entirely. Otherwise,
gawk
exits with the usual fatal
error.
If you have written extensions that modify the record handling
(by inserting an “input parser”; see Customized input parsers), you can invoke them at this point,
before gawk
has started
processing the file. (This is a very advanced
feature, currently used only by the gawkextlib
project.)
The ENDFILE
rule is called when
gawk
has finished processing the last
record in an input file. For the last input file, it will be called
before any END
rules. The ENDFILE
rule is executed even for empty input
files.
Normally, when an error occurs when reading input in the normal
input-processing loop, the error is fatal. However, if an ENDFILE
rule is present, the error becomes
non-fatal, and instead ERRNO
is set.
This makes it possible to catch and process I/O errors at the level of
the awk
program.
The next
statement (see The next Statement) is not allowed inside either a BEGINFILE
or an ENDFILE
rule. The nextfile
statement
is allowed only inside a BEGINFILE
rule, not inside an ENDFILE
rule.
The getline
statement (see
Explicit Input with getline) is restricted inside both BEGINFILE
and ENDFILE
: only redirected forms of getline
are allowed.
BEGINFILE
and ENDFILE
are gawk
extensions. In most other awk
implementations, or if gawk
is in compatibility mode (see Command-Line Options), they are not special.
awk
programs are often used as
components in larger programs written in shell. For example, it is very common to use a shell variable to
hold a pattern that the awk
program
searches for. There are two ways to get the value of the shell variable
into the body of the awk
program.
A common method is to use shell quoting to substitute the variable’s value into the program inside the script. For example, consider the following program:
printf "Enter search pattern: "
read pattern
awk "/$pattern/ "'{ nmatches++ }
     END { print nmatches, "found" }' /path/to/data
The awk
program consists of two
pieces of quoted text that are concatenated together to form the program.
The first part is double-quoted, which allows substitution of the pattern
shell variable inside the quotes. The
second part is single-quoted.
Variable substitution via quoting works, but can potentially be messy. It requires a good understanding of the shell’s quoting rules (see Shell Quoting Issues), and it’s often difficult to correctly match up the quotes when reading the program.
A better method is to use awk
’s
variable assignment feature (see Assigning variables on the command line) to
assign the shell variable’s value to an awk
variable. Then use dynamic regexps to match the pattern (see Using Dynamic Regexps). The following shows how to redo the
previous example using this technique:
printf "Enter search pattern: "
read pattern
awk -v pat="$pattern" '$0 ~ pat { nmatches++ }
     END { print nmatches, "found" }' /path/to/data
Now, the awk
program is just one
single-quoted string. The assignment ‘-v
pat="$pattern"
’ still requires double quotes, in case there is
whitespace in the value of $pattern
.
The awk
variable pat
could be named pattern
too, but that would be more confusing.
Using a variable also provides more flexibility, as the variable can be
used anywhere inside the program—for printing, as an array subscript, or
for any other use—without requiring the quoting tricks at every point in
the program.
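A self-contained variant of the -v approach, with the interactive read replaced by a fixed pattern and inline data (both invented for the sketch):

```shell
pattern="li"    # stands in for the value read from the user
# The awk text stays entirely in single quotes; the shell value
# reaches awk only through the -v assignment.
printf 'Amelia\nAnthony\nJulie\n' |
awk -v pat="$pattern" '$0 ~ pat { nmatches++ }
                       END { print nmatches, "found" }'
```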
An awk
program or script consists
of a series of rules and function definitions interspersed. (Functions are described later. See User-Defined Functions.) A rule
contains a pattern and an action, either of which (but not both)
may be omitted. The purpose of the action is to
tell awk
what to do once a match for
the pattern is found. Thus, in outline, an awk
program generally looks like this:
[pattern]  { action }
pattern  [{ action }]
…
function name(args) { … }
…
An action consists of one or more awk
statements, enclosed
in braces (‘{…}
’). Each statement
specifies one thing to do. The statements are separated by newlines
or semicolons. The braces around an action must be used even if the action
contains only one statement, or if it contains no statements at all.
However, if you omit the action entirely, omit the braces as well. An
omitted action is equivalent to ‘{ print $0
}
’:
/foo/  { }    match foo, do nothing --- empty action
/foo/         match foo, print the record --- omitted action
The following types of statements are supported in awk:
Expressions
Call functions or assign values to variables (see Chapter 6). Executing this kind of statement simply computes the value of the expression. This is useful when the expression has side effects (see Assignment Expressions).
Control statements
Specify the control flow of awk programs. The awk language gives you C-like constructs (if, for, while, and do) as well as a few special ones (see Control Statements in Actions).
Compound statements
Enclose one or more statements in braces. A compound statement is used in order to put several statements together in the body of an if, while, do, or for statement.
Input statements
Use the getline command (see Explicit Input with getline). Also supplied in awk are the next statement (see The next Statement) and the nextfile statement (see The nextfile Statement).
Output statements
Such as print and printf. See Chapter 5.
Deletion statements
For deleting array elements. See The delete Statement.
Control statements, such as if
, while
,
and so on, control the flow of execution in awk
programs. Most of awk
’s control statements are patterned after similar statements in C.
All the control statements start with special keywords, such as
if
and while
, to distinguish them from simple
expressions. Many control statements contain other statements. For
example, the if
statement contains
another statement that may or may not be executed. The contained statement
is called the body. To include more than one
statement in the body, group them into a single compound
statement with braces, separating them with newlines or
semicolons.
The if-else statement is awk's decision-making statement. It looks like this:

if (condition) then-body [else else-body]
The condition
is an expression that
controls what the rest of the statement does. If the
condition
is true,
then-body
is executed; otherwise,
else-body
is executed. The else
part of the statement is optional. The
condition is considered false if its value is zero or the null string;
otherwise, the condition is true. Refer to the following:
if (x % 2 == 0)
    print "x is even"
else
    print "x is odd"
In this example, if the expression ‘x % 2
== 0
’ is true (i.e., if the value of x
is evenly divisible by two), then the first
print
statement is executed;
otherwise, the second print
statement
is executed. If the else
keyword
appears on the same line as then-body
and
then-body
is not a compound statement (i.e.,
not surrounded by braces), then a semicolon must separate
then-body
from the else
. To illustrate this, the previous example can be rewritten
as:
if (x % 2 == 0) print "x is even"; else
    print "x is odd"
If the ‘;
’ is left out,
awk
can’t interpret the statement and
it produces a syntax error. Don’t actually write programs this way,
because a human reader might fail to see the else
if it is not the first thing on its
line.
In programming, a loop is a part of a
program that can be executed two or more times in succession. The while
statement is
the simplest looping statement in awk
. It repeatedly executes a statement as
long as a condition is true. For example:
while (condition)
    body
body
is a statement called the
body of the loop, and
condition
is an expression that controls how
long the loop keeps running. The first thing the while
statement does is test the
condition
. If the
condition
is true, it executes the statement
body
. After body
has been executed, condition
is tested again,
and if it is still true, body
executes again.
This process repeats until the condition
is
no longer true. If the condition
is initially
false, the body of the loop never executes and awk
continues with the statement following the
loop. This example prints the first three fields of each record, one per
line:
awk '
{
    i = 1
    while (i <= 3) {
        print $i
        i++
    }
}' inventory-shipped
The body of this loop is a compound statement enclosed in braces,
containing two statements. The loop works in the following manner:
first, the value of i
is set to one.
Then, the while
statement tests
whether i
is less than or equal to three.
This is true when i
equals one, so the
i
th field is printed. Then the ‘i++
’
increments the value of i
and the
loop repeats. The loop terminates when i
reaches four.
A newline is not required between the condition and the body; however, using one makes the program clearer unless the body is a compound statement or else is very simple. The newline after the open brace that begins the compound statement is not required either, but the program is harder to read without it.
The do
loop is a variation of
the while
looping statement.
The do
loop executes
the body
once and then repeats the
body
as long as the
condition
is true. It looks like this:
do
    body
while (condition)
Even if the condition
is false at the
start, the body
executes at least once (and
only once, unless executing body
makes
condition
true). Contrast this with the
corresponding while
statement:
while (condition)
    body
This statement does not execute the
body
even once if the
condition
is false to begin with. The
following is an example of a do
statement:
{
    i = 1
    do {
        print $0
        i++
    } while (i <= 10)
}
This program prints each input record 10 times. However, it isn’t
a very realistic example, because in this case an ordinary while
would do just as well. This situation
reflects actual experience; only occasionally is there a real use for a
do
statement.
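The difference is visible when the condition is false from the start; a minimal sketch:

```shell
# The do loop's body runs once before the condition (i <= 10,
# false here) is ever tested; a while loop would print nothing.
echo x | awk '{
    i = 100
    do {
        print "body ran, i =", i
        i++
    } while (i <= 10)
}'
```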
The for
statement makes it more
convenient to count iterations of a loop. The general form of the for
statement looks like this:
for (initialization; condition; increment)
    body
The initialization
,
condition
, and
increment
parts are arbitrary awk
expressions, and
body
stands for any awk
statement.
The for
statement starts by
executing initialization
. Then, as long as
the condition
is true, it repeatedly executes
body
and then
increment
. Typically,
initialization
sets a variable to either zero or
one, increment
adds one to it, and
condition
compares it against the desired
number of iterations. For example:
awk '
{
    for (i = 1; i <= 3; i++)
        print $i
}' inventory-shipped
This prints the first three fields of each input record, with one field per line.
It isn’t possible to set more than one variable in the
initialization
part without using a multiple
assignment statement such as ‘x = y =
0
’. This makes sense only if all the initial values are equal.
(But it is possible to initialize additional variables by writing their
assignments as separate statements preceding the for
loop.)
The same is true of the increment
part.
Incrementing additional variables requires separate statements at the
end of the loop. The C compound expression, using C’s comma operator, is
useful in this context, but it is not supported in awk
.
Most often, increment
is an increment
expression, as in the previous example. But this is not required; it can
be any expression whatsoever. For example, the following statement
prints all the powers of two between 1 and 100:
for (i = 1; i <= 100; i *= 2)
    print i
If there is nothing to be done, any of the three expressions in
the parentheses following the for
keyword may be omitted. Thus, ‘for (; x >
0;)
’ is equivalent to ‘while (x > 0)
’. If the
condition
is omitted, it is treated as true,
effectively yielding an infinite loop (i.e., a
loop that never terminates).
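The equivalence can be checked directly; a small sketch counting down with both forms:

```shell
# "for (; x > 0;)" with all three parts omitted or moved elsewhere
# behaves exactly like "while (x > 0)": both loops print 3 2 1.
awk 'BEGIN {
    x = 3
    for (; x > 0;) {
        printf "%d ", x
        x--
    }
    print ""
    x = 3
    while (x > 0) {
        printf "%d ", x
        x--
    }
    print ""
}'
```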
In most cases, a for
loop is an
abbreviation for a while
loop, as
shown here:
initialization
while (condition) {
    body
    increment
}
The only exception is when the continue
statement (see The continue Statement) is used inside the loop. Changing a
for
statement to a while
statement in this way can change the
effect of the continue
statement
inside the loop.
The awk
language has a for
statement in addition to a while
statement because a for
loop is
often both less work to type and more natural to think of. Counting the
number of iterations is very common in loops. It can be easier to think
of this counting as part of looping rather than as something to do
inside the loop.
There is an alternative version of the for
loop, for iterating over all the indices of an array:
for (i in array)
    do something with array[i]
See Scanning All Elements of an Array for more information on
this version of the for
loop.
This section describes a gawk
-specific feature. If gawk
is in compatibility mode (see Command-Line Options), it is not
available.
The switch
statement allows the
evaluation of an expression and the execution of statements based on a
case
match. Case statements are checked for a match in the order they
are defined. If no suitable case
is
found, the default
section is
executed, if supplied.
Each case
contains a single
constant, be it numeric, string, or regexp. The switch
expression is evaluated, and then each
case
’s constant is compared against
the result in turn. The type of constant determines the comparison:
numeric or string do the usual comparisons. A regexp
constant does a regular expression match against the string value of the
original expression. The general form of the switch
statement looks like this:
switch (expression) {
case value or regular expression:
    case-body
default:
    default-body
}
Control flow in the switch
statement works as it does in C. Once a match to a given case is made,
the case statement bodies execute until a break
, continue
, next
, nextfile
, or exit
is encountered, or the end of the
switch
statement itself. For example:
while ((c = getopt(ARGC, ARGV, "aksx")) != -1) {
    switch (c) {
    case "a":
        # report size of all files
        all_files = TRUE;
        break
    case "k":
        BLOCK_SIZE = 1024       # 1K block size
        break
    case "s":
        # do sums only
        sum_only = TRUE
        break
    case "x":
        # don't cross filesystems
        fts_flags = or(fts_flags, FTS_XDEV)
        break
    case "?":
    default:
        usage()
        break
    }
}
Note that if none of the statements specified here halt execution
of a matched case
statement,
execution falls through to the next case
until execution halts. In this example,
the case
for "?"
falls through to the default
case, which is to call a function
named usage()
. (The
getopt()
function being called here
is described in Processing Command-Line Options.)
The break
statement jumps out
of the innermost for
, while
, or do
loop that encloses it. The following example finds the smallest divisor of any
integer, and also identifies prime numbers:
# find smallest divisor of num
{
    num = $1
    for (divisor = 2; divisor * divisor <= num; divisor++) {
        if (num % divisor == 0)
            break
    }
    if (num % divisor == 0)
        printf "Smallest divisor of %d is %d\n", num, divisor
    else
        printf "%d is prime\n", num
}
When the remainder is zero in the first if
statement, awk
immediately breaks
out of the containing for
loop. This means that awk
proceeds immediately to the statement following the loop and continues
processing. (This is very different from the exit
statement, which stops the entire
awk
program. See The exit Statement.)
The following program illustrates how the
condition
of a for
or while
statement could be replaced with a break
inside an if
:
# find smallest divisor of num
{
    num = $1
    for (divisor = 2; ; divisor++) {
        if (num % divisor == 0) {
            printf "Smallest divisor of %d is %d\n", num, divisor
            break
        }
        if (divisor * divisor > num) {
            printf "%d is prime\n", num
            break
        }
    }
}
The break
statement is also
used to break out of the switch
statement. This is discussed in The switch Statement.
The break
statement has no
meaning when used outside the body of a loop or switch
. However, although it was never
documented, historical implementations of awk
treated the break
statement outside of a loop as if it
were a next
statement (see The next Statement). (d.c.) Recent versions of BWK awk
no longer allow this usage, nor does
gawk
.
Similar to break
, the continue
statement is used only inside
for
, while
, and do
loops. It skips over the rest of the loop
body, causing the next cycle around the loop to begin immediately.
Contrast this with break
, which jumps
out of the loop altogether.
The continue
statement in a
for
loop directs awk
to skip the rest of the body of the loop
and resume execution with the increment-expression of the for
statement. The
following program illustrates this fact:
BEGIN {
    for (x = 0; x <= 20; x++) {
        if (x == 5)
            continue
        printf "%d ", x
    }
    print ""
}
This program prints all the numbers from 0 to 20—except for 5, for
which the printf
is skipped. Because
the increment ‘x++
’ is not skipped,
x
does not remain stuck at 5.
Contrast the for
loop from the
previous example with the following while
loop:
BEGIN {
    x = 0
    while (x <= 20) {
        if (x == 5)
            continue
        printf "%d ", x
        x++
    }
    print ""
}
This program loops forever once x
reaches 5, because the increment (‘x++
’) is never reached.
The continue
statement has no
special meaning with respect to the switch
statement, nor does it have any meaning
when used outside the body of a loop. Historical versions of awk
treated a continue
statement outside a loop the same way
they treated a break
statement
outside a loop: as if it were a next
statement (discussed in the following section). (d.c.) Recent versions of
BWK awk
no longer work this way, nor
does gawk
.
The next
statement forces
awk
to immediately stop processing
the current record and go on to the next record. This means that no further rules are executed for the
current record, and the rest of the current rule’s action isn’t
executed.
Contrast this with the effect of the getline
function (see Explicit Input with getline). That also causes awk
to read the next record immediately, but
it does not alter the flow of control in any way (i.e., the rest of the
current action executes with a new input record).
At the highest level, awk
program execution is a loop that reads an input record and then tests
each rule’s pattern against it. If you think of this loop as a for
statement whose body contains the rules,
then the next
statement is analogous
to a continue
statement. It skips to
the end of the body of this implicit loop and executes the increment
(which reads another record).
For example, suppose an awk
program works only on records with four fields, and it shouldn’t fail
when given bad input. To avoid complicating the rest of the program,
write a “weed out” rule near the beginning, in the following
manner:
NF != 4 {
    printf("%s:%d: skipped: NF != 4\n", FILENAME, FNR) > "/dev/stderr"
    next
}
Because of the next
statement,
the program’s subsequent rules won’t see the bad record. The error
message is redirected to the standard error output stream, as error
messages should be. For more detail, see Special Filenames in gawk.
If the next
statement causes
the end of the input to be reached, then the code in any END
rules is executed. See The BEGIN and END Special Patterns.
The next
statement is not
allowed inside BEGINFILE
and ENDFILE
rules. See The BEGINFILE and ENDFILE Special Patterns.
According to the POSIX standard, the behavior is undefined if the
next
statement is used in a BEGIN
or END
rule. gawk
treats it as a
syntax error. Although POSIX does not disallow it, most other awk
implementations don’t allow the next
statement inside function bodies (see
User-Defined Functions). Just as with any other next
statement, a next
statement inside a function body reads
the next record and starts processing it with the first rule in the
program.
The nextfile
statement is
similar to the next
statement.
However, instead of abandoning processing of the current record, the nextfile
statement instructs awk
to stop processing the current
datafile.
Upon execution of the nextfile
statement, FILENAME
is updated to the
name of the next datafile listed on the command line, FNR
is reset to 1, and processing starts over
with the first rule in the program. If the nextfile
statement causes the end of the input
to be reached, then the code in any END
rules is executed. An exception to this is
when nextfile
is invoked during
execution of any statement in an END
rule; in this case, it causes the program to stop immediately. See
The BEGIN and END Special Patterns.
The nextfile
statement is
useful when there are many datafiles to process but it isn’t necessary
to process every record in every file. Without nextfile
, in order to move on to the next
datafile, a program would have to continue scanning the unwanted
records. The nextfile
statement
accomplishes this much more efficiently.
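As a minimal sketch of this idea, the following shell session (the filenames are made up for illustration) prints only the first line of each datafile and then skips the rest of that file. It assumes an awk that supports nextfile, as gawk, mawk, and recent BWK awk all do:

```shell
# Create two small sample files (hypothetical names).
printf 'alpha\nbeta\n'  > file1.txt
printf 'gamma\ndelta\n' > file2.txt

# Print the first record of each file, then jump to the next file.
awk 'FNR == 1 { print FILENAME ": " $0; nextfile }' file1.txt file2.txt

rm -f file1.txt file2.txt
```

Without nextfile, the same program would still read (and discard) every remaining record of each file.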
In gawk
, execution of nextfile
causes additional things to happen: any ENDFILE
rules are executed if gawk
is not currently in an END
or BEGINFILE
rule, ARGIND
is incremented, and any BEGINFILE
rules are executed. (ARGIND
hasn’t been introduced yet. See Predefined Variables.)
With gawk
, nextfile
is useful inside a BEGINFILE
rule to skip over a file that would
otherwise cause gawk
to exit with a
fatal error. In this case, ENDFILE
rules are not executed. See The BEGINFILE and ENDFILE Special Patterns.
Although it might seem that ‘close(FILENAME)
’ would accomplish the same as
nextfile
, this isn’t true. close()
is reserved for closing files, pipes,
and coprocesses that are opened with redirections. It is not related to
the main processing that awk
does
with the files listed in ARGV
.
For many years, nextfile
was
a common extension. In September 2012, it was accepted for inclusion
into the POSIX standard. See the Austin Group
website.
The current version of BWK awk
and mawk
also support nextfile
. However, they don’t allow the nextfile
statement inside function bodies (see
User-Defined Functions). gawk
does; a nextfile
inside a function body reads the next
record and starts processing it with the first rule in the program, just
as any other nextfile
statement.
The exit
statement causes
awk
to immediately stop executing the
current rule and to stop processing input; any remaining input is
ignored. The exit
statement is
written as follows:
exit [return code]
When an exit
statement is
executed from a BEGIN
rule, the
program stops processing everything immediately. No input records are
read. However, if an END
rule is
present, as part of executing the exit
statement, the END
rule is executed (see The BEGIN and END Special Patterns). If exit
is used in the body of an END
rule,
it causes the program to stop immediately.
An exit
statement that is not
part of a BEGIN
or END
rule stops the execution of any further
automatic rules for the current record, skips reading any remaining
input records, and executes the END
rule if there is one. gawk
also skips
any ENDFILE
rules; they do not
execute.
In such a case, if you don’t want the END
rule to do its job, set a variable to a
nonzero value before the exit
statement and check that variable in the END
rule. See Assertions for an example that does this.
If an argument is supplied to exit
, its value is used as the exit status
code for the awk
process. If no
argument is supplied, exit
causes
awk
to return a “success” status. In
the case where an argument is supplied to a first exit
statement, and then exit
is called a second time from an END
rule with no argument, awk
uses the previously supplied exit value.
(d.c.) See gawk’s Exit Status for more information.
For example, suppose an error condition occurs that is difficult
or impossible to handle. Conventionally, programs report this by exiting
with a nonzero status. An awk
program
can do this using an exit
statement
with a nonzero argument, as shown in the following example:
BEGIN {
    if (("date" | getline date_now) <= 0) {
        print "Can't get system date" > "/dev/stderr"
        exit 1
    }
    print "current date is", date_now
    close("date")
}
For full portability, exit values should be between zero and 126, inclusive. Negative values, and values of 127 or greater, may not produce consistent results across different operating systems.
Most awk
variables are available
to use for your own purposes; they never change unless your
program assigns values to them, and they never affect anything unless your
program examines them. However, a few variables in awk
have special built-in meanings. awk
examines some of these automatically, so
that they enable you to tell awk
how to
do certain things. Others are set automatically by awk
, so that they carry information from the
internal workings of awk
to your
program.
This section documents all of gawk
’s predefined variables, most of which are
also documented in the chapters describing their areas of activity.
The following is an alphabetical list of variables that you can
change to control how awk
does
certain things.
The variables that are specific to gawk
are marked with a pound sign (‘#
’). These variables are gawk
extensions. In other awk
implementations or if gawk
is in compatibility mode (see Command-Line Options), they are not special. (Any exceptions are noted in
the description of each variable.)
BINMODE #
On non-POSIX systems, this variable specifies use of binary
mode for all I/O. Numeric values of one, two, or three specify that input
files, output files, or all files, respectively, should use binary
I/O. A numeric value less than zero is treated as zero, and a numeric
value greater than three is treated as three. Alternatively, string values
of "r"
or "w"
specify that input files and output
files, respectively, should use binary I/O. A string value of
"rw"
or "wr"
indicates that all files should use
binary I/O. Any other string value is treated the same as "rw"
, but causes gawk
to generate a warning message.
BINMODE
is described in more
detail in Using gawk on PC operating systems. mawk
(see Other Freely Available awk Implementations) also supports this variable, but only
using numeric values.
CONVFMT
A string that controls the conversion of numbers to strings (see Conversion of Strings and Numbers). It
works by being passed, in effect, as the first argument to the
sprintf()
function (see String-Manipulation Functions). Its default value is "%.6g"
. CONVFMT
was introduced by the POSIX
standard.
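A small sketch of CONVFMT in action: concatenating a number with a string forces a number-to-string conversion, which goes through CONVFMT (print's own numeric output goes through OFMT instead):

```shell
awk 'BEGIN {
    CONVFMT = "%.2f"
    x = 3.14159
    s = x ""          # concatenation triggers number-to-string conversion
    print s           # prints 3.14
}'
```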
FIELDWIDTHS #
A space-separated list of columns that tells gawk
how to split input with fixed
columnar boundaries. Assigning a value to FIELDWIDTHS
overrides the use of
FS
and FPAT
for field splitting. See Reading Fixed-Width Data for more information.
FPAT #
A regular expression (as a string) that tells gawk
to create the fields based on text
that matches the regular expression. Assigning a value to FPAT
overrides the use of FS
and FIELDWIDTHS
for field splitting. See
Defining Fields by Content for more
information.
FS
The input field separator (see Specifying How Fields Are Separated). The value is a single-character string or a
multicharacter regular expression that matches the separations
between fields in an input record. If the value is the null string
(""
), then each character in
the record becomes a separate field. (This behavior is a gawk
extension. POSIX awk
does not specify the behavior when
FS
is the null string.
Nonetheless, some other versions of awk
also treat ""
specially.)
The default value is " ", a string consisting of a single space. As
a special exception, this value means that any sequence of spaces,
TABs, and/or newlines is a single separator.[35] It also causes spaces, TABs, and newlines at the
beginning and end of a record to be ignored.
You can set the value of FS
on the command line using the
-F
option:
awk -F, 'program' input-files
If gawk
is using FIELDWIDTHS
or FPAT
for field splitting, assigning a
value to FS
causes gawk
to return to the normal, FS
-based field splitting. An easy way to
do this is to simply say ‘FS =
FS
’, perhaps with an explanatory comment.
IGNORECASE #
If IGNORECASE
is nonzero
or non-null, then all string comparisons and all regular expression
matching are case-independent.
This applies to
regexp matching with
‘~
’ and ‘!~
’, the gensub()
, gsub()
, index()
, match()
, patsplit()
, split()
, and sub()
functions, record termination with
RS
, and field splitting with
FS
and FPAT
.
However, the value of IGNORECASE
does not
affect array subscripting and it does not affect field splitting
when using a single-character field separator. See Case Sensitivity in Matching.
LINT #
When this variable is true (nonzero or non-null), gawk
behaves as if the
--lint
command-line option is in effect (see Command-Line Options). With a
value of "fatal"
, lint warnings
become fatal errors. With a value of "invalid"
, only warnings about things
that are actually invalid are issued. (This is not fully
implemented yet.) Any other true value prints nonfatal warnings.
Assigning a false value to LINT
turns off the lint warnings.
This variable is a gawk
extension. It is not special in other awk
implementations. Unlike with the
other special variables, changing LINT
does affect the production of lint
warnings, even if gawk
is in
compatibility mode. Much as the --lint
and
--traditional
options independently control
different aspects of gawk
’s
behavior, the control of lint warnings during program execution is
independent of the flavor of awk
being executed.
OFMT
A string that controls the conversion of numbers to strings
(see Conversion of Strings and Numbers) for printing with the print
statement. It works by being
passed as the first argument to the sprintf()
function (see String-Manipulation Functions). Its default
value is "%.6g"
. Earlier
versions of awk
used OFMT
to specify the format for
converting numbers to strings in general expressions; this is now
done by CONVFMT
.
OFS
The output field separator (see Output Separators). It is output between the fields
printed by a print
statement.
Its default value is " "
,
a string consisting of a single space.
ORS
The output record separator. It is output at the end of every print
statement. Its default value is
"\n", the newline character.
(See Output Separators.)
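A sketch showing OFS and ORS together. Note that OFS only appears in output that awk assembles itself; assigning to a field (here, the no-op ‘$1 = $1’) forces the record to be rebuilt with OFS between fields:

```shell
# Rebuild the record so print joins fields with OFS and ends with ORS.
echo "a b c" | awk 'BEGIN { OFS = "-"; ORS = "!\n" } { $1 = $1; print }'
# prints a-b-c!
```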
PREC #
The working precision of arbitrary-precision floating-point numbers, 53 bits by default (see Setting the Precision).
ROUNDMODE #
The rounding mode to use for arbitrary-precision arithmetic
on numbers, by default "N"
(roundTiesToEven
in the IEEE
754 standard; see Setting the Rounding Mode).
RS
The input record separator. Its default value is a string containing a single newline character, which means that an input record consists of a single line of text. It can also be the null string, in which case records are separated by runs of blank lines. If it is a regexp, records are separated by matches of the regexp in the input text. (See How Input Is Split into Records.)
The ability for RS
to be
a regular expression is a gawk
extension. In most other awk
implementations, or if gawk
is
in compatibility mode (see Command-Line Options), just the
first character of RS
’s value
is used.
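A sketch of the null-string case ("paragraph mode"), which is standard awk behavior: records are separated by runs of blank lines, and newline also separates fields by default:

```shell
printf 'one\ntwo\n\nthree\n' |
awk 'BEGIN { RS = "" } { print NR ": " $1 }'
# prints:
# 1: one
# 2: three
```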
SUBSEP
The subscript separator. It has the default value of "\034"
and is used to separate the parts
of the indices of a multidimensional array. Thus, the expression
‘foo["A", "B"]
’
really accesses foo["A\034B"]
(see Multidimensional Arrays).
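The joined-string nature of such subscripts can be seen directly: split() with SUBSEP as the separator recovers the parts of the index. A minimal sketch:

```shell
awk 'BEGIN {
    a["A", "B"] = 1           # really the single index "A" SUBSEP "B"
    for (k in a) {
        n = split(k, parts, SUBSEP)
        print n, parts[1], parts[2]    # prints 2 A B
    }
}'
```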
TEXTDOMAIN #
Used for internationalization of programs at the awk
level. It sets the default text domain for specially marked string constants
in the source text, as well as for the dcgettext()
, dcngettext()
, and bindtextdomain()
functions (see Chapter 13). The default value of TEXTDOMAIN
is "messages"
.
The following is an alphabetical list of variables that awk
sets automatically on certain occasions in order to provide information to your
program.
The variables that are specific to gawk
are marked with a pound sign (‘#
’). These variables are gawk
extensions. In other awk
implementations or if gawk
is in compatibility mode (see Command-Line Options), they are not special:
ARGC
, ARGV
The command-line arguments available to awk
programs are stored in an array
called ARGV
. ARGC
is the
number of command-line arguments present. See Other Command-Line Arguments. Unlike most awk
arrays, ARGV
is indexed from 0 to ARGC
− 1. In the following
example:
$ awk 'BEGIN {
>        for (i = 0; i < ARGC; i++)
>            print ARGV[i]
> }' inventory-shipped mail-list
awk
inventory-shipped
mail-list
ARGV[0]
contains
‘awk
’, ARGV[1]
contains ‘inventory-shipped
’, and ARGV[2]
contains ‘mail-list
’. The value of ARGC
is three, one more than the index of
the last element in ARGV
,
because the elements are numbered from zero.
The names ARGC
and
ARGV
, as well as the convention
of indexing the array from 0 to ARGC
− 1, are derived from the C
language’s method of accessing command-line arguments.
The value of ARGV[0]
can
vary from system to system. Also, you should note that the program
text is not included in ARGV
, nor are any of awk
’s command-line options. See Using ARGC and ARGV for information about how awk
uses these variables. (d.c.)
ARGIND #
The index in ARGV
of the
current file being processed. Every time gawk
opens a new datafile for processing, it sets ARGIND
to the index in ARGV
of the filename. When gawk
is processing the input files,
‘FILENAME == ARGV[ARGIND]
’ is
always true.
This variable is useful in file processing; it allows you to tell how far along you are in the list of datafiles as well as to distinguish between successive instances of the same filename on the command line.
While you can change the value of ARGIND
within your awk
program, gawk
automatically sets it to a new
value when it opens the next file.
ENVIRON
An associative array containing the values of the
environment. The array indices are the environment variable
names; the elements are the values of the particular environment
variables. For example, ENVIRON["HOME"]
might be "/home/arnold"
. Changing this array does
not affect the environment passed on to any programs that awk
may spawn via redirection or the
system()
function. (In a future
version of gawk
, it may do
so.)
Some operating systems may not have environment variables.
On such systems, the ENVIRON
array is empty (except for ENVIRON["AWKPATH"]
and
ENVIRON["AWKLIBPATH"]
;
see The AWKPATH Environment Variable and The AWKLIBPATH Environment Variable).
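A quick sketch of ENVIRON (the variable name here is invented for the example):

```shell
# Pass an environment variable into the awk program.
GREETING=hello awk 'BEGIN { print ENVIRON["GREETING"] }'   # prints hello
```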
ERRNO #
If a system error occurs during a redirection for getline
, during a read for getline
, or during a close()
operation, then ERRNO
contains a string describing the
error.
In addition, gawk
clears
ERRNO
before opening each
command-line input file. This enables checking if the file is
readable inside a BEGINFILE
pattern (see The BEGINFILE and ENDFILE Special Patterns).
Otherwise, ERRNO
works
similarly to the C variable errno
. Except for the case just
mentioned, gawk
never clears it (sets it to zero or ""
). Thus, you should only expect its
value to be meaningful when an I/O operation returns a failure
value, such as getline
returning −1. You are, of course, free to clear it yourself before
doing an I/O operation.
FILENAME
The name of the current input file. When no datafiles are listed on the command line,
awk
reads from the standard
input and FILENAME
is set to
"-"
. FILENAME
changes each time a new file is
read (see Chapter 4). Inside a BEGIN
rule, the value of FILENAME
is ""
, because there are no input files
being processed yet.[36] (d.c.) Note, though, that using getline
(see Explicit Input with getline)
inside a BEGIN
rule can give
FILENAME
a value.
FNR
The current record number in the current file. awk
increments FNR
each time it reads a new record (see
How Input Is Split into Records). awk
resets FNR
to zero each time it starts a new
input file.
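The difference between FNR and NR is easiest to see with two files (the filenames are made up for the sketch):

```shell
printf 'a\nb\n' > in1.txt
printf 'c\n'    > in2.txt

# FNR restarts in each file; NR keeps counting across files.
awk '{ print FILENAME, FNR, NR }' in1.txt in2.txt
# prints:
# in1.txt 1 1
# in1.txt 2 2
# in2.txt 1 3

rm -f in1.txt in2.txt
```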
NF
The number of fields in the current input record. NF
is set each
time a new record is read, when a new field is created, or when
$0
changes (see Examining Fields).
Unlike most of the variables described in this subsection,
assigning a value to NF
has the
potential to affect awk
’s
internal workings. In particular, assignments to NF
can be used to create fields in or
remove fields from the current record. See Changing the Contents of a Field.
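A sketch of the field-removal case: lowering NF rebuilds the record without the trailing fields:

```shell
echo "a b c d" | awk '{ NF = 2; print }'   # prints a b
```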
FUNCTAB #
An array whose indices and corresponding values are the names of all the built-in, user-defined, and extension functions in the program.
Attempting to use the delete
statement with the FUNCTAB
array causes a fatal error.
Any attempt to assign to an element of FUNCTAB
also causes a fatal
error.
NR
The number of input records awk
has processed since the beginning of the program’s
execution (see How Input Is Split into Records). awk
increments NR
each time it reads a new
record.
PROCINFO #
The elements of this array provide access to information about the running awk
program. The following elements
(listed alphabetically) are guaranteed to be available:
PROCINFO["egid"]
The value of the getegid()
system call.
PROCINFO["euid"]
The value of the geteuid()
system call.
PROCINFO["FS"]
This is "FS"
if
field splitting with FS
is in effect, "FIELDWIDTHS"
if field splitting
with FIELDWIDTHS
is in
effect, or "FPAT"
if
field matching with FPAT
is in effect.
PROCINFO["identifiers"]
A subarray, indexed by the names of all identifiers
used in the text of the awk
program. An
identifier is simply the name of a
variable (be it scalar or array), built-in function,
user-defined function, or extension function. For each
identifier, the value of the element is one of the
following:
"array"
The identifier is an array.
"builtin"
The identifier is a built-in function.
"extension"
The identifier is an extension function loaded
via @load
or
-l
.
"scalar"
The identifier is a scalar.
"untyped"
The identifier is untyped (could be used as a
scalar or an array; gawk
doesn’t know
yet).
"user"
The identifier is a user-defined function.
The values indicate what gawk
knows about the identifiers
after it has finished parsing the program; they are
not updated while the program
runs.
PROCINFO["gid"]
The value of the getgid()
system call.
PROCINFO["pgrpid"]
The process group ID of the current process.
PROCINFO["pid"]
The process ID of the current process.
PROCINFO["ppid"]
The parent process ID of the current process.
PROCINFO["sorted_in"]
If this element exists in PROCINFO
, its value controls the
order in which array indices will be processed by
‘for (indx in array)’ loops. This is
an advanced feature, so we defer the full description until
later; see Scanning All Elements of an Array.
PROCINFO["strftime"]
The default time format string for strftime()
. Assigning a new value
to this element changes the default. See Time Functions.
PROCINFO["uid"]
The value of the getuid()
system call.
PROCINFO["version"]
The version of gawk
.
The following additional elements in the array are available
to provide information about the MPFR and GMP libraries if your
version of gawk
supports
arbitrary-precision arithmetic (see Chapter 15):
PROCINFO["mpfr_version"]
The version of the GNU MPFR library.
PROCINFO["gmp_version"]
The version of the GNU MP library.
PROCINFO["prec_max"]
The maximum precision supported by MPFR.
PROCINFO["prec_min"]
The minimum precision required by MPFR.
The following additional elements in the array are available
to provide information about the version of the extension API, if
your version of gawk
supports
dynamic loading of extension functions (see Chapter 16):
PROCINFO["api_major"]
The major version of the extension API.
PROCINFO["api_minor"]
The minor version of the extension API.
On some systems, there may be elements in the array, "group1"
through "groupN" for some N. N is
the number of supplementary groups that the process has. Use the
in
operator to test for these
elements (see Referring to an Array Element).
The PROCINFO
array has
the following additional uses:
It may be used to provide a timeout when reading from any open input file, pipe, or coprocess. See Reading Input with a Timeout for more information.
It may be used to cause coprocesses to communicate over pseudo-ttys instead of through two-way pipes; this is discussed further in Two-Way Communications with Another Process.
RLENGTH
The length of the substring matched by the match()
function (see String-Manipulation Functions). RLENGTH
is set by invoking the match()
function. Its value is the length of the matched string, or
−1 if no match is found.
RSTART
The start index in characters of the substring that is matched by the match()
function (see String-Manipulation Functions). RSTART
is set by invoking the match()
function. Its value is
the position of the string where the matched substring starts, or
zero if no match was found.
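As a sketch, both variables are set as side effects of match():

```shell
awk 'BEGIN {
    if (match("foobar", /ob/))
        print RSTART, RLENGTH    # prints 3 2: match starts at 3, length 2
}'
```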
RT #
The input text that matched the text denoted by RS
, the record separator. It is set every time a record is read.
SYMTAB #
An array whose indices are the names of all defined global
variables and arrays in the program. SYMTAB
makes gawk
’s symbol table visible to the
awk
programmer. It is built as
gawk
parses the program and is
complete before the program starts to run.
The array may be used for indirect access to read or write the value of a variable:
foo = 5
SYMTAB["foo"] = 4
print foo    # prints 4
The isarray()
function
(see Getting Type Information) may be used to test if an
element in SYMTAB
is an array.
Also, you may not use the delete
statement with the SYMTAB
array.
You may use an index for SYMTAB
that is not a predefined
identifier:
SYMTAB["xxx"] = 5
print SYMTAB["xxx"]
This works as expected: in this case SYMTAB
acts just like a regular array.
The only difference is that you can’t then delete SYMTAB["xxx"]
.
The SYMTAB
array is more
interesting than it looks. Andrew Schorr points out that it
effectively gives awk
data
pointers. Consider his example:
# Indirect multiply of any variable by amount, return result
function multiply(variable, amount)
{
    return SYMTAB[variable] *= amount
}
In order to avoid severe time-travel paradoxes,[37] neither FUNCTAB
nor SYMTAB
is available as an
element within the SYMTAB
array.
Built-in Variables That Convey Information presented the following program
describing the information contained in ARGC
and ARGV
:
$ awk 'BEGIN {
>        for (i = 0; i < ARGC; i++)
>            print ARGV[i]
> }' inventory-shipped mail-list
awk
inventory-shipped
mail-list
In this example, ARGV[0]
contains ‘awk
’, ARGV[1]
contains ‘inventory-shipped
’, and ARGV[2]
contains ‘mail-list
’. Notice that the awk
program is not entered in ARGV
. The other command-line options, with
their arguments, are also not entered. This includes variable
assignments done with the -v
option (see Command-Line Options). Normal variable assignments on the command line
are treated as arguments and do show up in the
ARGV
array. Given the following
program in a file named showargs.awk
:
BEGIN {
    printf "A=%d, B=%d\n", A, B
    for (i = 0; i < ARGC; i++)
        printf "\tARGV[%d] = %s\n", i, ARGV[i]
}
END   { printf "A=%d, B=%d\n", A, B }
Running it produces the following:
$ awk -v A=1 -f showargs.awk B=2 /dev/null
A=1, B=0
        ARGV[0] = awk
        ARGV[1] = B=2
        ARGV[2] = /dev/null
A=1, B=2
A program can alter ARGC
and
the elements of ARGV
. Each time
awk
reaches the end of an input file,
it uses the next element of ARGV
as
the name of the next input file. By storing a different string there, a
program can change which files are read. Use "-"
to represent the standard input. Storing
additional elements and incrementing ARGC
causes additional files to be
read.
If the value of ARGC
is
decreased, that eliminates input files from the end of the list. By
recording the old value of ARGC
elsewhere, a program can treat the eliminated arguments as something
other than filenames.
To eliminate a file from the middle of the list, store the null
string (""
) into ARGV
in place of the file’s name. As a special
feature, awk
ignores filenames that
have been replaced with the null string. Another option is to use the
delete
statement to remove elements
from ARGV
(see The delete Statement).
All of these actions are typically done in the BEGIN
rule, before actual processing of the
input begins. See Splitting a Large File into Pieces and Duplicating Output into Multiple Files for examples of each way of removing elements
from ARGV
.
To actually get options into an awk
program, end the awk
options with --
and then
supply the awk
program’s options, in
the following manner:
awk -f myprog.awk -- -v -q file1 file2 …
The following fragment processes ARGV
in order to examine, and then remove, the
previously mentioned command-line options:
BEGIN {
    for (i = 1; i < ARGC; i++) {
        if (ARGV[i] == "-v")
            verbose = 1
        else if (ARGV[i] == "-q")
            debug = 1
        else if (ARGV[i] ~ /^-./) {
            e = sprintf("%s: unrecognized option -- %c",
                    ARGV[0], substr(ARGV[i], 2, 1))
            print e > "/dev/stderr"
        } else
            break
        delete ARGV[i]
    }
}
Ending the awk
options with
--
isn’t necessary in gawk
.
Unless --posix
has been specified, gawk
silently puts any unrecognized options
into ARGV
for the awk
program to deal with. As soon as it sees
an unknown option, gawk
stops looking
for other options that it might otherwise recognize. The previous command line with gawk
would be:
gawk -f myprog.awk -q -v file1 file2 …
Because -q
is not a valid gawk
option, it and the following
-v
are passed on to the awk
program. (See Processing Command-Line Options for an awk
library function that parses command-line
options.)
When designing your program, you should choose options that don’t
conflict with gawk
’s, because it will
process any options that it accepts before passing the rest of the
command line on to your program. Using ‘#!
’ with the -E
option may
help (see Executable awk Programs and Command-Line Options).
Pattern–action pairs make up the basic elements of an awk
program. Patterns are either normal
expressions, range expressions, or regexp constants; one of the
special keywords BEGIN
, END
, BEGINFILE
, or ENDFILE
; or empty. The action executes if
the current record matches the pattern. Empty (missing) patterns match
all records.
I/O from BEGIN
and END
rules has certain constraints. This is
also true, only more so, for BEGINFILE
and ENDFILE
rules. The latter two give you
“hooks” into gawk
’s file
processing, allowing you to recover from a file that otherwise would
cause a fatal error (such as a file that cannot be opened).
Shell variables can be used in awk
programs by careful use of shell
quoting. It is easier to pass a shell variable into awk
by using the -v
option
and an awk
variable.
Actions consist of statements enclosed in curly braces. Statements are built up from expressions, control statements, compound statements, input and output statements, and deletion statements.
The control statements in awk
are if
-else
, while
, for
, and do
-while
.
gawk
adds the switch
statement. There are two flavors of
for
statement: one for performing
general looping, and the other for iterating through an array.
break
and continue
let you exit early or start the
next iteration of a loop (or get out of a switch
).
next
and nextfile
let you read the next record and
start over at the top of your program or skip to the next input file
and start over, respectively.
The exit
statement terminates
your program. When executed from an action (or function body), it
transfers control to the END
statements. From an END
statement
body, it exits immediately. You may pass an optional numeric value to
be used as awk
’s exit
status.
Some predefined variables provide control over awk
, mainly for I/O. Other variables convey
information from awk
to your
program.
ARGC
and ARGV
make the command-line arguments
available to your program. Manipulating them from a BEGIN
rule lets you control how awk
will process the provided
datafiles.
[34] The original version of awk
kept reading and ignoring input
until the end of the file was seen.
[35] In POSIX awk
,
newline does not count as whitespace.
[36] Some early implementations of Unix awk
initialized FILENAME
to "-"
, even if there were datafiles to
be processed. This behavior was incorrect and should not be
relied upon in your programs.
[37] Not to mention difficult implementation issues.