Some data, read in from tape or from a spreadsheet, has no obvious field separators but does have fixed-width columns. The substr function is useful for preprocessing this type of data. (The files in this section are found on the CD in directory chap07/OddsAndEnds.)
In the following example, the fields are of a fixed width but are not separated by a field separator. The substr function is used to create the fields. For gawk, see "The FIELDWIDTHS Variable" on page 241.
    % cat fixed
    031291ax5633(408)987-0124
    021589bg2435(415)866-1345
    122490de1237(916)933-1234
    010187ax3458(408)264-2546
    092491bd9923(415)134-8900
    112990bg4567(803)234-1456
    070489qr3455(415)899-1426

    % awk '{printf substr($0,1,6)" ";printf substr($0,7,6)" "; print substr($0,13,length)}' fixed
    031291 ax5633 (408)987-0124
    021589 bg2435 (415)866-1345
    122490 de1237 (916)933-1234
    010187 ax3458 (408)264-2546
    092491 bd9923 (415)134-8900
    112990 bg4567 (803)234-1456
    070489 qr3455 (415)899-1426
Explanation
The first field is the substring of the entire record that starts at the first character and is 6 characters long. Next, a space is printed. The second field is the substring that starts at position 7 and is also 6 characters long, followed by a space. The last field is the substring that starts at position 13 and runs to the end of the record: the third argument is the value returned by length, which is larger than the number of characters remaining, so substr simply stops at the end of the line. (The length function returns the length of the current line ($0) if it does not have an argument.)
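One caveat about the command above: printf treats its first argument as a format string, so a record that happened to contain a % character could be misinterpreted. The following is a minimal sketch of a more defensive variant that passes the substrings as arguments to an explicit format. The shell function name split_fixed and the second sample record are made up for illustration.

```shell
# split_fixed: hypothetical helper; reads fixed-width records on stdin
# and prints the three fields separated by single spaces.
split_fixed() {
    awk '{
        # An explicit format keeps any % in the data from being
        # interpreted as a conversion specifier.
        printf "%s %s %s\n", substr($0, 1, 6), substr($0, 7, 6), substr($0, 13)
    }'
}

printf '%s\n' '031291ax5633(408)987-0124' '10%offbg2435(415)866-1345' | split_fixed
```

Omitting the third argument of substr, as in the last call, returns everything from the starting position to the end of the record.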
If the data is stored in fixed-width fields, it is possible that some of the fields are empty. In the following example, the substr function is used to preserve the fields, whether or not they contain data.
    % cat db
    xxx xxx
    xxx abc xxx
    xxx a   bbb
    xxx     xx

    % cat awkfix
    # Preserving empty fields. Field width is fixed.
    {
        f[1]=substr($0,1,3)
        f[2]=substr($0,5,3)
        f[3]=substr($0,9,3)
        line=sprintf("%-4s%-4s%-4s ", f[1],f[2], f[3])
        print line
    }

    % awk -f awkfix db
    xxx xxx
    xxx abc xxx
    xxx a   bbb
    xxx     xx
Explanation
The contents of the file db are printed. There are empty fields in the file.
The elements of the f array are assigned three-character substrings of the record: f[1] starts at position 1, f[2] at position 5, and f[3] at position 9.
The elements of the array are assigned to the user-defined variable line after being formatted by the sprintf function.
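To see why substr is needed here, it may help to compare it with awk's default whitespace splitting, which silently shifts later fields to the left when a column is blank. The sketch below assumes a record whose second column is empty; the helper name show_split is hypothetical.

```shell
# show_split: hypothetical helper; prints field 2 as seen by default
# whitespace splitting and by fixed-width extraction, each in brackets.
show_split() {
    awk '{ print "[" $2 "]", "[" substr($0, 5, 3) "]" }'
}

# Columns 5-7 of this record are blank; the third column holds "xx".
printf 'xxx     xx\n' | show_split
```

With default splitting, $2 picks up the contents of the third column, while substr returns the three blanks that actually occupy the second column.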
The FIELDWIDTHS variable (gawk only) can be used if the file has fields of a fixed width. The value assigned to this variable is a space-separated list of numbers, in which each number represents the number of characters in a field. The value of FS is ignored if FIELDWIDTHS is set.
    % cat fixedfile
    abc1245556
    xxxyyyzzzz

    % gawk 'BEGIN{FIELDWIDTHS="3 3 4"}{print $2}' fixedfile
    124
    yyy
Explanation
Gawk includes a variable called FIELDWIDTHS to govern the way it splits a line into fields. The variable is assigned a space-separated list, 3 3 4, meaning that each record consists of three fixed-width fields: the first is 3 characters long, the second is 3 characters long, and the last is 4 characters long. Even though there are no separators in fixedfile, the records are split into fields according to the value assigned to FIELDWIDTHS.
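For versions of awk that do not support FIELDWIDTHS, roughly the same effect can be had with substr. The following is only a sketch of that idea, not a description of how gawk implements the variable; the helper name fixed_fields and its two arguments (a list of widths and the field number to print) are assumptions made for this example.

```shell
# fixed_fields: hypothetical helper emulating FIELDWIDTHS with substr.
# $1 = space-separated widths, $2 = field number to print; data on stdin.
fixed_fields() {
    awk -v widths="$1" -v want="$2" '
    BEGIN { nw = split(widths, w, " ") }
    {
        pos = 1
        for (i = 1; i <= nw; i++) {
            field[i] = substr($0, pos, w[i])   # i-th fixed-width field
            pos += w[i]                        # start of the next field
        }
        print field[want]
    }'
}

printf '%s\n' abc1245556 xxxyyyzzzz | fixed_fields "3 3 4" 2
```

Run on the two records of fixedfile, this prints the same second field that the gawk command prints.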
In the following example, the price field contains a dollar sign and comma. The script must eliminate these characters to add up the prices to get the total cost. This is done using the gsub function.
    % cat vendor
    access tech:gp237221:220:vax789:20/20:11/01/90:$1,043.00
    alisa systems:bp262292:280:macintosh:new updates:06/30/91:$456.00
    alisa systems:gp262345:260:vax8700:alisa talk:02/03/91:$1,598.50
    apple computer:zx342567:240:macs:e-mail:06/25/90:$575.75
    caci:gp262313:280:sparc station:network11.5:05/12/91:$1,250.75
    datalogics:bp132455:260:microvax2:pagestation maint:07/01/90:$1,200.00
    dec:zx354612:220:microvax2:vms sms:07/20/90:$1,350.00

    % awk -F: '{gsub(/\$/,"");gsub(/,/,""); cost +=$7}; END{print "The total cost is $" cost}' vendor
    The total cost is $7474
Explanation
The first gsub function globally substitutes the literal dollar sign ($) with the null string, and the second gsub function substitutes commas with a null string. The user-defined cost variable is then totalled by adding the seventh field to cost and assigning the result back to cost. In the END block, the string "The total cost is $" is printed, followed by the value of cost.[a]
[a] For details on how to put commas back into the output, see The AWK Programming Language by Alfred Aho, Brian Kernighan, and Peter Weinberger, Addison-Wesley, 1988, p. 72.
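As a rough illustration of what restoring the commas can look like, the sketch below formats the total with two decimal places and then repeatedly inserts a comma in front of the last unseparated group of three digits. It is one possible approach under these assumptions, and the names total_cost and addcomma are chosen only for this example.

```shell
# total_cost: hypothetical helper; sums colon-separated field 7 after
# stripping "$" and ",", then prints the total with commas restored.
total_cost() {
    awk -F: '
    # addcomma: illustrative function; turns 7474 into "7,474.00"
    function addcomma(x,    s) {
        s = sprintf("%.2f", x)
        # While four digits still run together, put a comma before the
        # three digits that precede the decimal point or an earlier comma.
        while (s ~ /[0-9][0-9][0-9][0-9]/)
            sub(/[0-9][0-9][0-9][,.]/, ",&", s)
        return s
    }
    { gsub(/[$,]/, "", $7); cost += $7 }
    END { print "The total cost is $" addcomma(cost) }'
}

printf '%s\n' \
    'access tech:gp237221:220:vax789:20/20:11/01/90:$1,043.00' \
    'alisa systems:bp262292:280:macintosh:new updates:06/30/91:$456.00' |
    total_cost
```

Inside a bracket expression such as [$,], the dollar sign is a literal character, so no backslash is needed there.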
In The AWK Programming Language by Alfred Aho, Brian Kernighan, and Peter Weinberger, the program to bundle files together is very short and to the point. The goal is to combine several files into one file to save disk space, to send files through electronic mail, and so forth. The following awk command prints every line of each file, preceded by the filename.
% awk '{ print FILENAME, $0 }' file1 file2 file3 > bundled
Explanation
The name of the current input file, FILENAME, is printed, followed by the record ($0) for each line of input in file1. After file1 has reached the end of file, awk will open the next file, file2, and do the same thing, and so on. The output is redirected to a file called bundled.
The following example shows how to unbundle the files, that is, how to put them back into separate files.
% awk '$1 != previous { close(previous); previous=$1}; {print substr($0, index($0, " ") + 1) > $1}' bundled
Explanation
The first field is the name of the file. If it is not equal to the value of the user-defined variable previous (initially null), the action block is executed: the file named in previous is closed, and previous is assigned the value of the first field. Then, for every record, the index function returns the position of the first space, and substr extracts everything after that space (that is, the original line without the filename prefix). The result is redirected to the file named in the first field.
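The bundle and unbundle commands can be checked with a small round trip. The sketch below works in a scratch directory; the file contents and the directory name out are made up for the example, and it assumes filenames that contain no spaces, since the unbundle command treats the first space on each line as the end of the filename.

```shell
# Round trip in a scratch directory (all file contents here are made up).
work=$(mktemp -d)
cd "$work" || exit 1
printf 'alpha\nbeta gamma\n' > file1
printf 'delta\n' > file2

# Bundle: prefix every line with the name of the file it came from.
awk '{ print FILENAME, $0 }' file1 file2 > bundled

# Unbundle inside a separate directory so the originals are untouched.
mkdir out
( cd out && awk '$1 != previous { close(previous); previous = $1 }
                 { print substr($0, index($0, " ") + 1) > $1 }' ../bundled )

# cmp is silent when the files are identical.
cmp file1 out/file1 && cmp file2 out/file2 && echo "round trip OK"
```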
To bundle the files so that the filename appears on a line by itself, above the contents of the file, use the following command:
% awk '{if(FNR==1){print FILENAME;print $0} else print $0}' file1 file2 file3 > bundled
The following command will unbundle the files:
% awk 'NF==1{filename=$NF} ; NF != 1{print $0 > filename}' bundled
In the sample data files used so far, each record is on a line by itself. In the following sample datafile, called checkbook, the records are separated by blank lines and the fields are separated by newlines. To process this file, the record separator (RS) is assigned a value of null, and the field separator (FS) is assigned the newline.
(The Input File)

    % cat checkbook
    1/1/99
    #125
    -695.00
    Mortgage

    1/1/99
    #126
    -56.89
    PG&E

    1/2/99
    #127
    -89.99
    Safeway

    1/3/99
    +750.00
    Pay Check

    1/4/99
    #128
    -60.00
    Visa

(The Script)

    % cat awkchecker
    BEGIN{RS=""; FS="\n";ORS="\n\n"}
    {print NR, $1,$2,$3,$4}

(The Output)

    % awk -f awkchecker checkbook
    1 1/1/99 #125 -695.00 Mortgage

    2 1/1/99 #126 -56.89 PG&E

    3 1/2/99 #127 -89.99 Safeway

    4 1/3/99 +750.00 Pay Check

    5 1/4/99 #128 -60.00 Visa
Explanation
In the BEGIN block, the record separator (RS) is assigned null, the field separator (FS) is assigned a newline, and the output record separator (ORS) is assigned two newlines. Now each line is a field and each output record is separated by two newlines.
The number of the record is printed, followed by each of the fields.
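Once each line of a record is a field, the fields can be used in calculations as usual. The following sketch, which is not part of the original script, sums the amounts in checkbook-style records. It assumes that the amount is the third field when a check number (a field starting with #) is present and the second field otherwise; the helper name balance is hypothetical.

```shell
# balance: hypothetical helper; reads blank-line-separated records whose
# fields are on separate lines and sums the amount field.
balance() {
    awk 'BEGIN { RS = ""; FS = "\n" }
    {
        # The amount is field 3 when a check number is present,
        # field 2 otherwise (as in the deposit record).
        amount = ($2 ~ /^#/) ? $3 : $2
        total += amount
    }
    END { printf "%d records, balance %.2f\n", NR, total }'
}

printf '1/1/99\n#125\n-695.00\nMortgage\n\n1/3/99\n+750.00\nPay Check\n' | balance
```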
The following example is modified from a program in The AWK Programming Language. The tricky part is keeping track of what is actually being processed. The input file is called data.form. It contains just the data, with fields separated by colons. The other file is called form.letter. It is the actual form that will be used to create the letter. This file is loaded into awk's memory with the getline function, and each line of the form letter is stored in an array. The program gets its data from data.form, and the letter is created by substituting real data for the special strings preceded by # and @ found in form.letter. A temporary variable, temp, holds the actual line that will be displayed after the data has been substituted. This program allows you to create personalized form letters for each person listed in data.form.
(The Awk Script)

    % cat form.awk
    # form.awk is an awk script that requires access to 2 files: The
    # first file is called "form.letter". This file contains the
    # format for a form letter. The awk script uses another file,
    # "data.form", as its input file. This file contains the
    # information that will be substituted into the form letters in
    # place of the numbers preceded by pound signs. Today's date
    # is substituted in the place of "@date" in "form.letter".
    BEGIN{ FS=":"; n=1
        while((getline < "form.letter") > 0)
            form[n++] = $0    # Store lines from form.letter in an array
        "date" | getline d; split(d, today, " ")
            # Output of date is Sun Mar 2 14:35:50 PST 1999
        thisday=today[2]". "today[3]", "today[6]
    }
    { for( i = 1; i < n; i++ ){
          temp=form[i]
          for ( j = 1; j <=NF; j++ ){
              gsub("@date", thisday, temp)
              gsub("#" j, $j , temp )
          }
          print temp
      }
    }

The form letter, form.letter, looks like this:

    % cat form.letter
    *********************************************************
    Subject: Status Report for Project "#1"
    To: #2
    From: #3
    Date: @date
    This letter is to tell you, #2, that project "#1" is up to date.
    We expect that everything will be completed and ready for
    shipment as scheduled on #4.

    Sincerely,

    #3
    **********************************************************

The file, data.form, is awk's input file containing the data that will replace the #1-4 and the @date in form.letter.

    % cat data.form
    Dynamo:John Stevens:Dana Smith, Mgr:4/12/1999
    Gallactius:Guy Sterling:Dana Smith, Mgr:5/18/99

(The Command Line)

    % awk -f form.awk data.form
    *********************************************************
    Subject: Status Report for Project "Dynamo"
    To: John Stevens
    From: Dana Smith, Mgr
    Date: Mar. 2, 1999
    This letter is to tell you, John Stevens, that project "Dynamo" is up to date.
    We expect that everything will be completed and ready for
    shipment as scheduled on 4/12/1999.

    Sincerely,

    Dana Smith, Mgr
    **********************************************************
    *********************************************************
    Subject: Status Report for Project "Gallactius"
    To: Guy Sterling
    From: Dana Smith, Mgr
    Date: Mar. 2, 1999
    This letter is to tell you, Guy Sterling, that project "Gallactius" is up to date.
    We expect that everything will be completed and ready for
    shipment as scheduled on 5/18/99.

    Sincerely,

    Dana Smith, Mgr
    **********************************************************
Explanation
In the BEGIN block, the field separator (FS) is assigned a colon; a user-defined variable n is assigned 1.
In the while loop, the getline function reads a line at a time from the file called form.letter. If getline fails to find the file, it returns -1. When it reaches the end of file, it returns zero. Therefore, by testing for a return value greater than zero, we know that the function has read in a line from the input file.
Each line from form.letter is assigned to an array called form.
The output from the UNIX date command is piped to the getline function and assigned to the user-defined variable d. The split function then splits up the variable d with white space, creating an array called today.
The user-defined variable thisday is assigned the month, day, and year.
The BEGIN block ends.
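The for loop runs once for each line stored in the form array, that is, n - 1 times, because n was incremented past the last line that was read.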
The user-defined variable temp is assigned a line from the form array.
If a # and a number are found in the line stored in temp, the gsub function will replace the # and number with the value of the corresponding field in the input file, data.form. For example, while the first record of data.form is being processed, #1 is replaced with Dynamo, #2 with John Stevens, #3 with Dana Smith, Mgr, and #4 with 4/12/1999. In the same loop, @date is replaced with the value of thisday.
The line stored in temp is printed after the substitutions.
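The getline return values described in the explanation above can also be used to tell a missing file from an empty one. The following is a minimal sketch of that check, separate from form.awk; the helper name count_lines and the variable names file, line, and status are hypothetical.

```shell
# count_lines: hypothetical helper; loads a file with getline in a BEGIN
# block and reports how many lines were read, or an error if unreadable.
count_lines() {
    awk -v file="$1" 'BEGIN {
        n = 0
        # getline returns 1 for each line read, 0 at end of file, -1 on error
        while ((status = (getline line < file)) > 0)
            form[++n] = line
        if (status < 0) {
            print "cannot read " file
            exit 1
        }
        print n " lines read"
    }'
}

tmp=$(mktemp)
printf 'a\nb\nc\n' > "$tmp"
count_lines "$tmp"
count_lines /no/such/form.letter || echo "failed as expected"
```

Parenthesizing the getline expression before comparing it with zero avoids any ambiguity about how the < and > operators are grouped.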
Now that you have seen how awk works, you will find that awk is a very powerful utility when writing shell scripts. You can embed one-line awk commands or awk scripts within your shell scripts. The following is a sample Bash shell program with embedded awk commands.
    #!/bin/bash
    # Scriptname: bash.sc
    # This bash shell script will collect data for awk to use in
    # generating form letter(s). See above.
    echo "Hello $LOGNAME. "
    echo "This report is for the month and year:"
    cal | awk 'NR==1{print $0}'
    if [[ -f data.form || -f formletter? ]]
    then
        rm data.form formletter? 2> /dev/null
    fi
    let num=1
    while true
    do
        echo "Form letter #$num:"
        echo -n "What is the name of the project? "
        read project
        echo -n "Who is the status report from? "
        read sender
        echo -n "Who is the status report to? "
        read recipient
        echo -n "When is the completion date scheduled? "
        read due_date
        echo $project:$recipient:$sender:$due_date > data.form
        echo -n "Do you wish to generate another form letter? "
        read answer
        if [[ "$answer" != [Yy]* ]]
        then
            break
        else
            awk -f form.awk data.form > formletter$num
        fi
        (( num+=1 ))
    done
    awk -f form.awk data.form > formletter$num
Explanation
The output of the Linux cal command is piped to awk. The first line, which contains the current month and year, is printed.
The awk script form.awk generates form letters, which are redirected to a UNIX file.
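A shell script can also hand its variables to awk with the -v option instead of building the record with echo. The sketch below shows one way to assemble a colon-separated record in the layout that form.awk expects; the helper name make_record and the single-letter awk variable names are assumptions made for this example.

```shell
# make_record: hypothetical helper; joins four values into one
# colon-separated record (project:recipient:sender:due date).
make_record() {
    awk -v p="$1" -v r="$2" -v s="$3" -v d="$4" \
        'BEGIN { OFS = ":"; print p, r, s, d }'
}

make_record "Dynamo" "John Stevens" "Dana Smith, Mgr" "4/12/1999"
```

Because each value is passed as its own quoted argument, spaces and commas inside a value reach awk intact. Note that awk processes backslash escapes in values assigned with -v.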