This book describes the GNU implementation of awk, which follows the POSIX specification. Many longtime awk users learned awk programming with the original awk implementation in Version 7 Unix. (This implementation was the basis for awk in Berkeley Unix, through 4.3-Reno. Subsequent versions of Berkeley Unix, and, for a while, some systems derived from 4.4BSD-Lite, used various versions of gawk for their awk.) This chapter briefly describes the evolution of the awk language, with cross-references to other parts of the book where you can find more information.
To save space, we have omitted information on the history of features in gawk from this edition. You can find it in the online documentation.
The awk language evolved considerably between the release of Version 7 Unix (1978) and the new version that was first made generally available in System V Release 3.1 (1987). This section summarizes the changes, with cross-references to further details:
The requirement for ‘;’ to separate rules on a line (see awk Statements Versus Lines)
User-defined functions and the return statement (see User-Defined Functions)
The delete statement (see The delete Statement)
The do-while statement (see The do-while Statement)
The built-in functions atan2(), cos(), sin(), rand(), and srand() (see Numeric Functions)
The built-in functions gsub(), sub(), and match() (see String-Manipulation Functions)
The built-in functions close() and system() (see Input/Output Functions)
The ARGC, ARGV, FNR, RLENGTH, RSTART, and SUBSEP predefined variables (see Predefined Variables)
Assignable $0 (see Changing the Contents of a Field)
The conditional expression using the ternary operator ‘?:’ (see Conditional Expressions)
The expression ‘indx in array’ outside of for statements (see Referring to an Array Element)
The exponentiation operator ‘^’ (see Arithmetic Operators) and its assignment operator form ‘^=’ (see Assignment Expressions)
C-compatible operator precedence, which breaks some old awk programs (see Operator Precedence (How Operators Nest))
Regexps as the value of FS (see Specifying How Fields Are Separated) and as the third argument to the split() function (see String-Manipulation Functions), rather than using only the first character of FS
Dynamic regexps as operands of the ‘~’ and ‘!~’ operators (see Using Dynamic Regexps)
The escape sequences ‘\b’, ‘\f’, and ‘\r’ (see Escape Sequences)
Redirection of input for the getline function (see Explicit Input with getline)
Multiple BEGIN and END rules (see The BEGIN and END Special Patterns)
Multidimensional arrays (see Multidimensional Arrays)
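As a rough illustration of several items from this list, the following sketch exercises a user-defined function with return, the ternary operator, an ‘indx in array’ test outside a for statement, the ‘^’ operator, gsub(), and multiple BEGIN rules. The shell wrapper svr31_demo and the awk names max, seen, and msg are made up for this example; it should run under any POSIX-compliant awk:

```shell
# Hypothetical demo wrapper; the names here are illustrative only.
svr31_demo() {
    awk '
    # User-defined function with a return statement and the ternary operator
    function max(a, b) { return (a > b) ? a : b }

    # Two BEGIN rules; they run in the order in which they appear
    BEGIN { seen["x"] = 1 }
    BEGIN {
        # "indx in array" used as an ordinary expression
        msg = ("x" in seen) ? "x is present" : "x is absent"
        print msg
        print max(3, 2 ^ 3)        # exponentiation operator
        s = "a-b-c"
        n = gsub(/-/, "+", s)      # gsub() returns the number of replacements
        print n, s
    }'
}
svr31_demo
```

Because the program consists only of BEGIN rules, awk does not read any input.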
The System V Release 4 (1989) version of Unix awk added these features (some of which originated in gawk):
The ENVIRON array (see Predefined Variables)
Multiple -f options on the command line (see Command-Line Options)
The -v option for assigning variables before program execution begins (see Command-Line Options)
The -- signal for terminating command-line options
The ‘\a’, ‘\v’, and ‘\x’ escape sequences (see Escape Sequences)
A defined return value for the srand() built-in function (see Numeric Functions)
The toupper() and tolower() built-in string functions for case translation (see String-Manipulation Functions)
A cleaner specification for the ‘%c’ format-control letter in the printf function (see Format-Control Letters)
The ability to dynamically pass the field width and precision ("%*.*d") in the argument list of printf and sprintf() (see Format-Control Letters)
The use of regexp constants, such as /foo/, as expressions, where they are equivalent to using the matching operator, as in ‘$0 ~ /foo/’ (see Using Regular Expression Constants)
Processing of escape sequences inside command-line variable assignments (see Assigning variables on the command line)
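A few of these additions can be seen together in a short sketch. The wrapper svr4_demo, the variable prefix, and the environment variable GREETING are invented for illustration; the example uses the -v option, the ENVIRON array, toupper(), and a bare regexp constant as a pattern:

```shell
# Hypothetical demo wrapper; GREETING and prefix are illustrative names.
svr4_demo() {
    printf 'foo bar\nbaz\n' |
    GREETING=hello awk -v prefix='>> ' '
        # A bare regexp constant as a pattern is shorthand for $0 ~ /foo/
        /foo/ { print prefix toupper($0) }
        # ENVIRON holds the environment that awk was started with
        END   { print ENVIRON["GREETING"], NR }
    '
}
svr4_demo
```

The -v assignment takes effect before any BEGIN rule runs, which is what distinguishes it from a variable assignment placed among the filename arguments.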
The POSIX Command Language and Utilities standard for awk (1992) introduced the following changes into the language:
The use of -W for implementation-specific options (see Command-Line Options)
The use of CONVFMT for controlling the conversion of numbers to strings (see Conversion of Strings and Numbers)
The concept of a numeric string and tighter comparison rules to go with it (see Variable Typing and Comparison Expressions)
The use of predefined variables as function parameter names is forbidden (see Function Definition Syntax)
More complete documentation of many of the previously undocumented features of the language
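To make the role of CONVFMT concrete, here is a minimal sketch (the wrapper name convfmt_demo and the variables x and s are made up). It contrasts CONVFMT, which governs number-to-string conversion in expressions such as concatenation, with OFMT, which governs how print formats numbers:

```shell
# Hypothetical demo wrapper showing CONVFMT versus OFMT.
convfmt_demo() {
    awk 'BEGIN {
        x = 3.14159265
        CONVFMT = "%.2f"
        s = x ""        # concatenation converts the number using CONVFMT
        print s
        OFMT = "%.4f"
        print x         # print converts numbers using OFMT instead
    }'
}
convfmt_demo
```

Integral values are always converted as integers, so neither variable affects a value such as 42.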
In 2012, a number of extensions that had been commonly available for many years were finally added to POSIX. They are:
The fflush() built-in function for flushing buffered output (see Input/Output Functions)
The nextfile statement (see The nextfile Statement)
The ability to delete all of an array at once with ‘delete array’ (see The delete Statement)
See Common Extensions Summary for a list of common extensions not permitted by the POSIX standard.
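Two of the 2012 additions, ‘delete array’ and fflush(), can be tried with a small sketch like the following. The wrapper posix2012_demo and the array name seen are hypothetical, and the example assumes an awk recent enough to include these features:

```shell
# Hypothetical demo wrapper for features standardized in 2012.
posix2012_demo() {
    awk 'BEGIN {
        seen["a"] = 1
        seen["b"] = 1
        delete seen                 # remove every element at once
        n = 0
        for (k in seen) n++         # count whatever is left
        print "elements left:", n
        fflush()                    # flush buffered output
    }'
}
posix2012_demo
```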
The 2008 POSIX standard can be found online at http://www.opengroup.org/onlinepubs/9699919799/.
Brian Kernighan has made his version available via his home page (see Other Freely Available awk Implementations).
This section describes common extensions that originally appeared in his version of awk:
The ‘**’ and ‘**=’ operators (see Arithmetic Operators and Assignment Expressions)
The use of func as an abbreviation for function (see Function Definition Syntax)
The fflush() built-in function for flushing buffered output (see Input/Output Functions)
See Common Extensions Summary for a full list of the extensions available in his awk.
The GNU implementation, gawk, adds a large number of features. They can all be disabled with either the --traditional or --posix options (see Command-Line Options).
A number of features have come and gone over the years. This section summarizes the additional features over POSIX awk that are in the current version of gawk.
Additional predefined variables:
The ARGIND, BINMODE, ERRNO, FIELDWIDTHS, FPAT, IGNORECASE, LINT, PROCINFO, RT, and TEXTDOMAIN variables (see Predefined Variables)
Special files in I/O redirections:
The /dev/stdin, /dev/stdout, /dev/stderr, and /dev/fd/N special filenames (see Special Filenames in gawk)
The /inet, /inet4, and /inet6 special files for TCP/IP networking using ‘|&’ to specify which version of the IP protocol to use (see Using gawk for Network Programming)
Changes and/or additions to the language:
The ‘\x’ escape sequence (see Escape Sequences)
Full support for both POSIX and GNU regexps (see Chapter 3)
The ability for FS and for the third argument to split() to be null strings (see Making Each Character a Separate Field)
The ability for RS to be a regexp (see How Input Is Split into Records)
The ability to use octal and hexadecimal constants in awk program source code (see Octal and hexadecimal numbers)
The ‘|&’ operator for two-way I/O to a coprocess (see Two-Way Communications with Another Process)
Indirect function calls (see Indirect Function Calls)
Directories on the command line produce a warning and are skipped (see Directories on the Command Line)
New keywords:
The BEGINFILE and ENDFILE special patterns (see The BEGINFILE and ENDFILE Special Patterns)
The switch statement (see The switch Statement)
Changes to standard awk functions:
The optional second argument to close() that allows closing one end of a two-way pipe to a coprocess (see Two-Way Communications with Another Process)
POSIX compliance for gsub() and sub() with --posix
The length() function accepts an array argument and returns the number of elements in the array (see String-Manipulation Functions)
The optional third argument to the match() function for capturing text-matching subexpressions within a regexp (see String-Manipulation Functions)
Positional specifiers in printf formats for making translations easier (see Rearranging printf Arguments)
The split() function’s additional optional fourth argument, which is an array to hold the text of the field separators (see String-Manipulation Functions)
Additional functions only in gawk:
The gensub(), patsplit(), and strtonum() functions for more powerful text manipulation (see String-Manipulation Functions)
The asort() and asorti() functions for sorting arrays (see Controlling Array Traversal and Array Sorting)
The mktime(), systime(), and strftime() functions for working with timestamps (see Time Functions)
The and(), compl(), lshift(), or(), rshift(), and xor() functions for bit manipulation (see Bit-Manipulation Functions)
The isarray() function to check if a variable is an array or not (see Getting Type Information)
The bindtextdomain(), dcgettext(), and dcngettext() functions for internationalization (see Internationalizing awk Programs)
Changes and/or additions in the command-line options:
The AWKPATH environment variable for specifying a path search for the -f command-line option (see Command-Line Options)
The AWKLIBPATH environment variable for specifying a path search for the -l command-line option (see Command-Line Options)
The -b, -c, -C, -d, -D, -e, -E, -g, -h, -i, -l, -L, -M, -n, -N, -o, -O, -p, -P, -r, -S, -t, and -V short options. Also, the ability to use GNU-style long-named options that start with --; and the --assign, --bignum, --characters-as-bytes, --copyright, --debug, --dump-variables, --exec, --field-separator, --file, --gen-pot, --help, --include, --lint, --lint-old, --load, --non-decimal-data, --optimize, --posix, --pretty-print, --profile, --re-interval, --sandbox, --source, --traditional, --use-lc-numeric, and --version long options (see Command-Line Options)
Support for the following obsolete systems was removed from the code and the documentation for gawk version 4.0:
Amiga
Atari
BeOS
Cray
MIPS RiscOS
MS-DOS with the Microsoft Compiler
MS-Windows with the Microsoft Compiler
NeXT
SunOS 3.x, Sun 386 (Road Runner)
Tandem (non-POSIX)
Prestandard VAX C compiler for VAX/VMS
GCC for VAX and Alpha has not been tested for a while.
Support for the following obsolete system was removed from the code for gawk version 4.1:
Ultrix
The following table summarizes the common extensions supported by gawk, Brian Kernighan’s awk, and mawk, the three most widely used freely available versions of awk (see Other Freely Available awk Implementations).
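Feature | BWK awk | mawk | gawk | Now standard |
‘\x’ escape sequence | ✓ | ✓ | ✓ | |
FS as null string | ✓ | ✓ | ✓ | |
/dev/stdin special file | ✓ | ✓ | ✓ | |
/dev/stdout special file | ✓ | ✓ | ✓ | |
/dev/stderr special file | ✓ | ✓ | ✓ | |
delete without subscript | ✓ | ✓ | ✓ | ✓ |
fflush() function | ✓ | ✓ | ✓ | ✓ |
length() of an array | ✓ | ✓ | ✓ | |
nextfile statement | ✓ | ✓ | ✓ | ✓ |
** and **= operators | ✓ | | ✓ | |
func keyword | ✓ | | ✓ | |
BINMODE variable | | ✓ | ✓ | |
RS as regexp | | ✓ | ✓ | |
Time-related functions | | ✓ | ✓ | |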
This section describes the confusing history of ranges within regular expressions and their interactions with locales, and how this affected different versions of gawk.
The original Unix tools that worked with regular expressions defined character ranges (such as ‘[a-z]’) to match any character between the first character in the range and the last character in the range, inclusive. Ordering was based on the numeric value of each character in the machine’s native character set. Thus, on ASCII-based systems, ‘[a-z]’ matched all the lowercase letters, and only the lowercase letters, as the numeric values for the letters from ‘a’ through ‘z’ were contiguous. (On an EBCDIC system, the range ‘[a-z]’ includes additional nonalphabetic characters as well.)
Almost all introductory Unix literature explained range expressions as working in this fashion, and in particular, would teach that the “correct” way to match lowercase letters was with ‘[a-z]’, and that ‘[A-Z]’ was the “correct” way to match uppercase letters. And indeed, this was true.[104]
The 1992 POSIX standard introduced the idea of locales (see Where You Are Makes a Difference). Because many locales include other letters besides the plain 26 letters of the English alphabet, the POSIX standard added character classes (see Using Bracket Expressions) as a way to match different kinds of characters besides the traditional ones in the ASCII character set.
However, the standard changed the interpretation of range expressions. In the "C" and "POSIX" locales, a range expression like ‘[a-dx-z]’ is still equivalent to ‘[abcdxyz]’, as in ASCII. But outside those locales, the ordering was defined to be based on collation order.
What does that mean? In many locales, ‘A’ and ‘a’ are both less than ‘B’. In other words, these locales sort characters in dictionary order, and ‘[a-dx-z]’ is typically not equivalent to ‘[abcdxyz]’; instead, it might be equivalent to ‘[ABCXYabcdxyz]’, for example.
This point needs to be emphasized: much literature teaches that you should use ‘[a-z]’ to match a lowercase character. But on systems with non-ASCII locales, this also matches all of the uppercase characters except ‘A’ or ‘Z’! This was a continuous cause of confusion, even well into the twenty-first century.
To demonstrate these issues, the following example uses the sub() function, which does text replacement (see String-Manipulation Functions). Here, the intent is to remove trailing uppercase characters:
$ echo something1234abc | gawk-3.1.8 '{ sub("[A-Z]*$", ""); print }'
something1234a
This output is unexpected, as the ‘bc’ at the end of ‘something1234abc’ should not normally match ‘[A-Z]*’. This result is due to the locale setting (and thus you may not see it on your system).
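For comparison, here is a sketch of two ways to write the same replacement so that it does not depend on how the locale orders characters. The first uses the POSIX character class ‘[[:upper:]]’; the second keeps the range but forces the "C" locale through the LC_ALL environment variable. The function names are made up for this example, and the first variant assumes an awk that supports POSIX character classes:

```shell
# Hypothetical helpers; each reads lines on standard input.

# Variant 1: a POSIX character class matches uppercase letters in any locale.
strip_upper_class() {
    awk '{ sub(/[[:upper:]]*$/, ""); print }'
}

# Variant 2: in the "C" locale, [A-Z] has its traditional ASCII meaning.
strip_upper_c_locale() {
    LC_ALL=C awk '{ sub(/[A-Z]*$/, ""); print }'
}

echo something1234abc | strip_upper_class
echo something1234ABC | strip_upper_class
echo something1234abc | strip_upper_c_locale
```

With either variant, the lowercase ‘abc’ is left alone and only a trailing run of uppercase letters is removed.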
Similar considerations apply to other ranges. For example, ‘["-/]’ is perfectly valid in ASCII, but is not valid in many Unicode locales, such as en_US.UTF-8.
Early versions of gawk used regexp matching code that was not locale-aware, so ranges had their traditional interpretation.
When gawk switched to using locale-aware regexp matchers, the problems began; especially as both GNU/Linux and commercial Unix vendors started implementing non-ASCII locales, and making them the default. Perhaps the most frequently asked question became something like, “Why does ‘[A-Z]’ match lowercase letters?!?”
This situation existed for close to 10 years, if not more, and the gawk maintainer grew weary of trying to explain that gawk was being nicely standards-compliant, and that the issue was in the user’s locale. During the development of version 4.0, he modified gawk to always treat ranges in the original, pre-POSIX fashion, unless --posix was used (see Command-Line Options).[105]
Fortunately, shortly before the final release of gawk 4.0, the maintainer learned that the 2008 standard had changed the definition of ranges, such that outside the "C" and "POSIX" locales, the meaning of range expressions was undefined.[106]
By using this lovely technical term, the standard gives license to implementors to implement ranges in whatever way they choose. The gawk maintainer chose to apply the pre-POSIX meaning both with the default regexp matching and when --traditional or --posix are used. In all cases gawk remains POSIX-compliant.
Always give credit where credit is due.
—Anonymous
This section names the major contributors to gawk and/or this book, in approximate chronological order:
Dr. Alfred V. Aho, Dr. Peter J. Weinberger, and Dr. Brian W. Kernighan, all of Bell Laboratories, designed and implemented Unix awk, from which gawk gets the majority of its feature set.
Paul Rubin did the initial design and implementation in 1986, and wrote the first draft (around 40 pages) of this book.
Jay Fenlason finished the initial implementation.
Diane Close revised the first draft of this book, bringing it to around 90 pages.
Richard Stallman helped finish the implementation and the initial draft of this book. He is also the founder of the FSF and the GNU Project.
John Woods contributed parts of the code (mostly fixes) in the initial version of gawk.
In 1988, David Trueman took over primary maintenance of gawk, making it compatible with “new” awk, and greatly improving its performance.
Conrad Kwok, Scott Garfinkle, and Kent Williams did the initial ports to MS-DOS with various versions of MSC.
Pat Rankin provided the VMS port and its documentation.
Hal Peterson provided help in porting gawk to Cray systems. (This is no longer supported.)
Kai Uwe Rommel provided the initial port to OS/2 and its documentation.
Michal Jaegermann provided the port to Atari systems and its documentation. (This port is no longer supported.) He continues to provide portability checking, and has done a lot of work to make sure gawk works on non-32-bit systems.
Fred Fish provided the port to Amiga systems and its documentation. (With Fred’s sad passing, this is no longer supported.)
Scott Deifik currently maintains the MS-DOS port using DJGPP.
Eli Zaretskii currently maintains the MS-Windows port using MinGW.
Juan Grigera provided a port to Windows32 systems. (This is no longer supported.)
For many years, Dr. Darrel Hankerson acted as coordinator for the various ports to different PC platforms and created binary distributions for various PC operating systems. He was also instrumental in keeping the documentation up to date for the various PC platforms.
Christos Zoulas provided the extension() built-in function for dynamically adding new functions. (This was obsoleted at gawk 4.1.)
Jürgen Kahrs contributed the initial version of the TCP/IP networking code and documentation, and motivated the inclusion of the ‘|&’ operator.
Stephen Davies provided the initial port to Tandem systems and its documentation. (However, this is no longer supported.) He was also instrumental in the initial work to integrate the byte-code internals into the gawk code base.
Matthew Woehlke provided improvements for Tandem’s POSIX-compliant systems.
Martin Brown provided the port to BeOS and its documentation. (This is no longer supported.)
Arno Peters did the initial work to convert gawk to use GNU Automake and GNU gettext.
Alan J. Broder provided the initial version of the asort() function as well as the code for the optional third argument to the match() function.
Andreas Buening updated the gawk port for OS/2.
Isamu Hasegawa, of IBM in Japan, contributed support for multibyte characters.
Michael Benzinger contributed the initial code for switch statements.
Patrick T.J. McPhee contributed the code for dynamic loading in Windows32 environments. (This is no longer supported.)
Anders Wallin helped keep the VMS port going for several years.
Assaf Gordon contributed the code to implement the --sandbox option.
John Haque made the following contributions:
The modifications to convert gawk into a byte-code interpreter, including the debugger
The addition of true arrays of arrays
The additional modifications for support of arbitrary-precision arithmetic
The initial text of Chapter 15
The work to merge the three versions of gawk into one, for the 4.1 release
Improved array internals for arrays indexed by integers
The improved array sorting features were also driven by John, together with Pat Rankin
Panos Papadopoulos contributed the original text for Including Other Files into Your Program.
Efraim Yawitz contributed the original text for Chapter 14.
The development of the extension API first released with gawk 4.1 was driven primarily by Arnold Robbins and Andrew Schorr, with notable contributions from the rest of the development team.
John Malmberg contributed significant improvements to the OpenVMS port and the related documentation.
Antonio Giovanni Colombo rewrote a number of examples in the early chapters that were severely dated, for which I am incredibly grateful.
Arnold Robbins has been working on gawk since 1988, at first helping David Trueman, and as the primary maintainer since around 1994.
The awk language has evolved over time. The first release was with V7 Unix, circa 1978. In 1987, for System V Release 3.1, major additions, including user-defined functions, were made to the language. Additional changes were made for System V Release 4, in 1989. Since then, further minor changes have happened under the auspices of the POSIX standard.
Brian Kernighan’s awk provides a small number of extensions that are implemented in common with other versions of awk.
gawk provides a large number of extensions over POSIX awk. They can be disabled with either the --traditional or --posix options.
The interaction of POSIX locales and regexp matching in gawk has been confusing over the years. Today, gawk implements Rational Range Interpretation, where ranges of the form ‘[a-z]’ match only the characters numerically between ‘a’ through ‘z’ in the machine’s native character set. Usually this is ASCII, but it can be EBCDIC on IBM S/390 systems.
Many people have contributed to gawk development over the years. We hope that the list provided in this chapter is complete and gives the appropriate credit where credit is due.
[104] And Life was good.
[105] And thus was born the Campaign for Rational Range Interpretation (or RRI). A number of GNU tools have already implemented this change, or will soon. Thanks to Karl Berry for coining the phrase “Rational Range Interpretation.”
[106] See the standard and its rationale.