You need to count all the occurrences of several different strings, including some strings whose values you don’t know beforehand. That is, you’re not trying to count the occurrences of a pre-determined set of strings. Rather, you are going to encounter some strings in your data and you want to count these as-yet-unknown strings.
Use awk’s associative arrays (also known as hashes) for your counting.
For our example, we’ll count how many files are owned by various users on our system. The username shows up as the third field in ls -l output, so we’ll use that field ($3) as the index of the array and increment that member of the array:
    #
    # cookbook filename: asar.awk
    #
    NF > 7 {
        user[$3]++
    }
    END {
        for (i in user)
        {
            printf "%s owns %d files\n", i, user[i]
        }
    }
We invoke awk a bit differently here. Because this awk script is a bit more complex, we’ve put it in a separate file. We use the -f option to tell awk where to get the script file:
    $ ls -lR /usr/local | awk -f asar.awk
    bin owns 68 files
    albing owns 1801 files
    root owns 13755 files
    man owns 11491 files
    $
We use the condition NF > 7 as a qualifier on the first part of the awk script to weed out the lines of ls -lR output that do not contain filenames. Those lines are useful for readability (blank lines separate the directories, and each subdirectory gets a total count), but they don’t have as many fields (or words). The expression NF > 7 that precedes the opening brace is not enclosed in slashes, which is to say that it is not a regular expression. It’s a logical expression, much like you would use in an if statement, and it evaluates to true or false. NF is a special built-in variable that holds the number of fields in the current line of input. So a line of input is processed by the statements within the braces only if it has more than seven fields (words of text).
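To see what the NF test lets through, here is a minimal sketch run against a few made-up lines shaped like ls -lR output. The sample text and the variable names (sample, nf_report, kept) are assumptions for illustration only; real ls output varies from system to system.

```shell
# Sample text imitating ls -lR output (made up for illustration).
sample='/usr/local/bin:
total 8
-rwxr-xr-x 1 root wheel 1024 Jan  1 12:00 tool
-rwxr-xr-x 1 albing staff 2048 Jan  2 12:00 script

/usr/local/man:
total 0'

# Print the field count (NF) in front of every line.
nf_report=$(printf '%s\n' "$sample" | awk '{ print NF ": " $0 }')
printf '%s\n' "$nf_report"

# A pattern with no action prints each line for which the pattern is true,
# so this keeps only the lines with more than seven fields.
kept=$(printf '%s\n' "$sample" | awk 'NF > 7')
printf '%s\n' "$kept"
```

The directory headers, the total lines, and the blank line have at most two fields, so only the two file entries should survive the filter.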
The key line, however, is this one:
user[$3]++
Here the username (e.g., bin) is used as the index to the array. It’s called an associative array, because a hash table (or similar mechanism) is being used to associate each unique string with a numerical index. awk is doing all that work for you behind the scenes; you don’t have to write any string comparisons and lookups and such.
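A small trace can make the increment easier to picture. This sketch feeds four made-up names to awk, one per line (so the name is $1 here rather than $3), and prints the running count after each increment. The variable name trace is hypothetical.

```shell
# Four made-up owner names, one per line; root appears three times.
# An element that has not been seen before starts out empty, which awk
# treats as 0 in a numeric context, so the first ++ makes it 1.
trace=$(printf '%s\n' root bin root root |
    awk '{ user[$1]++; print $1, "->", user[$1] }')
printf '%s\n' "$trace"
```

Each new name gets its own counter the first time it appears, and repeated names keep adding to the counter they already have.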
Once you’ve built such an array, it might seem difficult to get the values back out. For this, awk has a special form of the for loop. Instead of the numeric for(i=0; i<max; i++) that awk also supports, there is a particular syntax for associative arrays:
for (i in user)
In this expression, the variable i will take on successive values (in no particular order) from the various values used as indexes to the array user. In our example, this means that i will take on the values bin, albing, man, and root, one on each iteration of the loop. If you haven’t seen associative arrays before, then we hope that you’re surprised and impressed. It is a very powerful feature of awk (and Perl).
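Because the for (i in user) loop visits the indexes in no particular order, the report can come out in a different order from one awk implementation to another. One way to get a predictable listing, sketched here on made-up data, is to pipe the result through sort. The sample lines and the variable names listing and report are assumptions for illustration.

```shell
# Made-up lines in ls -l format; the third field is the owner.
listing='-rw-r--r-- 1 root wheel 10 Jan  1 12:00 a
-rw-r--r-- 1 bin wheel 10 Jan  1 12:00 b
-rw-r--r-- 1 root wheel 10 Jan  1 12:00 c
-rw-r--r-- 1 albing staff 10 Jan  1 12:00 d
-rw-r--r-- 1 root wheel 10 Jan  1 12:00 e
-rw-r--r-- 1 albing staff 10 Jan  1 12:00 f'

# Count files per owner, then sort on the third word of each report line
# (the count), numerically, largest first.
report=$(printf '%s\n' "$listing" |
    awk 'NF > 7 { user[$3]++ }
         END { for (i in user) printf "%s owns %d files\n", i, user[i] }' |
    sort -k3,3nr)
printf '%s\n' "$report"
```

Sorting outside of awk keeps the script itself unchanged; the sort key here relies on the count being the third word of each output line.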
man awk
Effective awk Programming by Arnold Robbins (O’Reilly)
sed & awk by Arnold Robbins and Dale Dougherty (O’Reilly)