Counting String Values

Problem

You need to count all the occurrences of several different strings, including some strings whose values you don’t know beforehand. That is, you’re not trying to count the occurrences of a pre-determined set of strings. Rather, you are going to encounter some strings in your data and you want to count these as-yet-unknown strings.

Solution

Use awk’s associative arrays (also known as hashes) for your counting.

For our example, we’ll count how many files are owned by various users on our system. The username shows up as the third field in the output of ls -l, so we’ll use that field ($3) as the index of the array and increment that member of the array:

#
# cookbook filename: asar.awk
#
NF > 7 {
    user[$3]++
}
END {
    for (i in user) {
        printf "%s owns %d files\n", i, user[i]
    }
}

We invoke awk a bit differently here. Because this awk script is a bit more complex, we’ve put it in a separate file. We use the -f option to tell awk where to get the script file:

$ ls -lR /usr/local | awk -f asar.awk
bin owns 68 files
albing owns 1801 files
root owns 13755 files
man owns 11491 files
$

Discussion

We use the condition NF > 7 to qualify part of the awk script so that it skips lines that do not contain filenames. Such lines appear in the ls -lR output to improve readability: blank lines separate the directories, and each subdirectory gets a total count. These lines don’t have as many fields (or words). The expression NF > 7 that precedes the opening brace is not enclosed in slashes, which is to say that it is not a regular expression. It’s a logical expression, much like you would use in an if statement, and it evaluates to true or false. NF is a special built-in variable that holds the number of fields in the current line of input. So a line of input is processed by the statements within the braces only if it has more than seven fields (words of text).
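To see the filter at work, here is a small sketch that prints the field count of each line and whether NF > 7 would select it. The sample text is hypothetical and only imitates the shape of ls -lR output; the variable names sample and field_report are ours.

```shell
# Hypothetical sample resembling `ls -lR` output (not taken from a real system).
sample='/usr/local/bin:
total 8
-rwxr-xr-x 1 root root 1024 Jan  1 12:00 tool
-rwxr-xr-x 1 bin  bin  2048 Jan  1 12:00 helper

/usr/local/man:
total 0'

# For each line, print the number of fields and whether NF > 7 selects it.
# The comparison is parenthesized so awk does not read ">" as output redirection.
field_report=$(printf '%s\n' "$sample" |
    awk '{ print NF, (NF > 7 ? "counted" : "skipped") }')

printf '%s\n' "$field_report"
```

In this sample only the two file entries have more than seven fields; the directory names, the total lines, and the blank line are all skipped.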

The key line, however, is this one:

 user[$3]++

Here the username (e.g., bin) is used as the index to the array. It’s called an associative array because a hash table (or similar mechanism) is used to associate each unique string with its own array element, which here holds the count for that user. awk does all that work for you behind the scenes; you don’t have to write any string comparisons or lookups yourself.
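As a minimal sketch of that behavior, the following feeds awk a made-up list of owners (not the real ls output) and increments one array element per name. It relies on the fact that an array element that has never been assigned is treated as zero in a numeric context, so no initialization is needed. The names count and owner_counts are ours.

```shell
# Hypothetical input: one owner name per line.
owner_counts=$(printf '%s\n' root bin root albing root | awk '
    { count[$1]++ }     # an unseen key starts out empty, which ++ treats as 0
    END { print count["root"], count["bin"], count["albing"] }
')

printf '%s\n' "$owner_counts"
```

With this input, root appears three times and the other two names once each.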

Once you’ve built such an array, it might seem difficult to get the values back out. For this, awk has a special form of the for loop. Instead of the numeric for(i=0; i<max; i++) that awk also supports, there is a particular syntax for associative arrays:

for (i in user)

In this expression, the variable i takes on successive values (in no particular order) from the strings used as indexes to the array user. In our example, this means that i will take on the values bin, albing, man, and root, one for each iteration of the loop. If you haven’t seen associative arrays before, then we hope that you’re surprised and impressed. It is a very powerful feature of awk (and Perl).
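Because the loop visits the indexes in no particular order, the report can come out in a different order from one awk implementation to another. One way to get predictable output, sketched here with made-up input, is to print the count first and pipe the result through sort. The variable name sorted_counts is ours.

```shell
# Hypothetical input: one owner name per line.
sorted_counts=$(printf '%s\n' root bin root albing root bin | awk '
    { user[$1]++ }
    END {
        # The traversal order of "for (i in user)" is unspecified ...
        for (i in user) printf "%d %s\n", user[i], i
    }
' | sort -rn)   # ... so sort numerically, largest count first

printf '%s\n' "$sorted_counts"
```

Sorting outside awk keeps the script itself unchanged and works with any awk.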

See Also
