7.13. Odds and Ends

Some data, read in from tape or from a spreadsheet, may not have obvious field separators, but the data does have fixed-width columns. To preprocess this type of data, the substr function is useful. (The files in this section are found on the CD in directory chap07/OddsAnd-Ends.)

7.13.1. Fixed Fields

In the following example, the fields are of a fixed width but are not separated by a field separator. The substr function is used to create the fields. For gawk, see "The FIELDWIDTHS Variable" later in this section.

Example 7.85.
% cat fixed
						031291ax5633(408)987–0124
						021589bg2435(415)866–1345
						122490de1237(916)933–1234
						010187ax3458(408)264–2546
						092491bd9923(415)134–8900
						112990bg4567(803)234–1456
						070489qr3455(415)899–1426

% awk '{printf substr($0,1,6)" ";printf substr($0,7,6)" ";
						print substr($0,13,length)} ' fixed
						031291  ax5633  (408)987–0124
						021589  bg2435  (415)866–1345
						122490  de1237  (916)933–1234
						010187  ax3458  (408)264–2546
						092491  bd9923  (415)134–8900
						112990  bg4567  (803)234–1456
						070489  qr3455  (415)899–1426
					

Explanation

The first field is the substring of the record that starts at the first character and is 6 characters long. Next, a space is printed. The second field is the substring that starts at position 7 and is also 6 characters long, followed by a space. The last field is the substring that starts at position 13; the length requested is the length of the whole line, which is more than what remains, so substr returns everything up to the end of the record. (The length function returns the length of the current line ($0) if it does not have an argument.)
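The same split can also be written with an explicit printf format string, so that a % character in the data is never taken as a format specification. The following is a minimal sketch, not the book's script: the scratch directory and the shell variable out are illustrative names, and the two-argument form of substr is used to return everything from position 13 to the end of the record.

```shell
# Build a two-line sample in a scratch directory (hypothetical location).
tmp=$(mktemp -d)
printf '%s\n' '031291ax5633(408)987-0124' '021589bg2435(415)866-1345' > "$tmp/fixed"

# Columns 1-6, columns 7-12, and everything from column 13 onward.
out=$(awk '{ printf "%s %s %s\n", substr($0,1,6), substr($0,7,6), substr($0,13) }' "$tmp/fixed")
echo "$out"
rm -r "$tmp"
```

Because the data is passed as an argument to printf rather than used as the format itself, the command behaves the same way whatever characters the records contain.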

Empty Fields

If the data is stored in fixed-width fields, it is possible that some of the fields are empty. In the following example, the substr function is used to preserve the fields, whether or not they contain data.

Example 7.86.
1   % cat db
							xxx xxx
							xxx abc xxx
							xxx a   bbb
							xxx     xx

    % cat awkfix
    # Preserving empty fields. Field width is fixed.
    {
2   f[1]=substr($0,1,3)
3   f[2]=substr($0,5,3)
4   f[3]=substr($0,9,3)
5   line=sprintf("%-4s%-4s%-4s\n", f[1],f[2], f[3])
6   print line
    }
    % awk -f awkfix db
							xxx xxx
							xxx abc xxx
							xxx a   bbb
							xxx     xx
						

Explanation

  1. The contents of the file db are printed. There are empty fields in the file.

  2. The first element of the f array is assigned the substring of the record that starts at position 1 and is 3 characters long.

  3. The second element of the f array is assigned the substring of the record that starts at position 5 and is 3 characters long.

  4. The third element of the f array is assigned the substring of the record that starts at position 9 and is 3 characters long.

  5. The elements of the array are assigned to the user-defined variable line after being formatted by the sprintf function.

  6. The value of line is printed and the empty fields are preserved.
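To see why substr is needed here, it may help to compare it with awk's default field splitting on a record whose middle column is empty. The sketch below assumes a single hypothetical record with the same layout as the last line of db; the variable names line, by_fs, and by_col are illustrative.

```shell
# "xxx", an empty 3-character column, then "xx", laid out in 4-character slots.
line=$(printf '%-4s%-4s%s' xxx '' xx)

# Default splitting on whitespace sees only two fields.
by_fs=$(printf '%s\n' "$line" | awk '{ print NF }')

# substr keeps all three columns, including the blank one.
by_col=$(printf '%s\n' "$line" |
    awk '{ printf "[%s][%s][%s]\n", substr($0,1,3), substr($0,5,3), substr($0,9,3) }')

echo "$by_fs"
echo "$by_col"
```

With default splitting, the third column would be reported as $2, which is why the fixed positions are used instead.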

The FIELDWIDTHS Variable

The FIELDWIDTHS variable (gawk only) can be used if the file has fields of a fixed width. The value assigned to this variable is a space-separated list of numbers, in which each number represents the number of characters in a field. The value of FS is ignored while FIELDWIDTHS is in effect.

Example 7.87.
% cat fixedfile
abc1245556
xxxyyyzzzz

% awk 'BEGIN{FIELDWIDTHS="3 3 4"}{print $2}' fixedfile
							124
							yyy
						

Explanation

Gawk includes a variable called FIELDWIDTHS to govern the way it splits up a line into fields. The variable is assigned a space-separated list, 3 3 4, meaning that each record consists of three fixed-width fields: the first is 3 characters in length, the second is 3 characters, and the last is 4 characters. Even though there are no separators in fixedfile, the records will be split into fields according to the value assigned to FIELDWIDTHS.
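For versions of awk that do not have FIELDWIDTHS, a similar effect can be approximated with substr in a loop. This is only a sketch under stated assumptions: widths is a hypothetical variable standing in for FIELDWIDTHS, and the columns are stored in an array f rather than in $1, $2, and so on.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'abc1245556' 'xxxyyyzzzz' > "$tmp/fixedfile"

# widths is an illustrative name; it plays the role of FIELDWIDTHS.
out=$(awk -v widths="3 3 4" '
BEGIN { n = split(widths, w, " ") }
{
    pos = 1
    for (i = 1; i <= n; i++) {      # carve out each fixed-width column
        f[i] = substr($0, pos, w[i])
        pos += w[i]
    }
    print f[2]                      # the second column of each record
}' "$tmp/fixedfile")
echo "$out"
rm -r "$tmp"
```

For the two sample records this prints the same second column that the gawk example produces with print $2.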

Numbers with $, Commas, or Other Characters

In the following example, the price field contains a dollar sign and comma. The script must eliminate these characters to add up the prices to get the total cost. This is done using the gsub function.

Example 7.88.
% cat vendor
							access tech:gp237221:220:vax789:20/20:11/01/90:$1,043.00
							alisa systems:bp262292:280:macintosh:new updates:06/30/91:$456.00
							alisa systems:gp262345:260:vax8700:alisa talk:02/03/91:$1,598.50
							apple computer:zx342567:240:macs:e–mail:06/25/90:$575.75
							caci:gp262313:280:sparc station:network11.5:05/12/91:$1,250.75
							datalogics:bp132455:260:microvax2:pagestation maint:07/01/90:$1,200.00
							dec:zx354612:220:microvax2:vms sms:07/20/90:$1,350.00

% awk -F: '{gsub(/\$/,"");gsub(/,/,""); cost +=$7};
							END{print "The total cost is $" cost}' vendor
							The total cost is $7474
						

Explanation

The first gsub function globally substitutes the literal dollar sign ($) with the null string, and the second gsub function substitutes commas with a null string. The user-defined cost variable is then totalled by adding the seventh field to cost and assigning the result back to cost. In the END block, the string "The total cost is $" is printed, followed by the value of cost.[a]

[a] For details on how commas can be put back into the numbers, see The AWK Programming Language by Alfred Aho, Brian Kernighan, and Peter Weinberger, Addison-Wesley, 1988, p. 72.
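A variation on the example restricts the substitution to the price field and removes both characters with a single bracket expression, inside which the dollar sign is always literal. The following is a sketch rather than the original script: it uses only three of the vendor records, and the variable name price and the %.2f output format are choices made for illustration.

```shell
tmp=$(mktemp -d)
cat > "$tmp/vendor" <<'EOF'
access tech:gp237221:220:vax789:20/20:11/01/90:$1,043.00
alisa systems:bp262292:280:macintosh:new updates:06/30/91:$456.00
apple computer:zx342567:240:macs:e-mail:06/25/90:$575.75
EOF

out=$(awk -F: '
{
    price = $7
    gsub(/[$,]/, "", price)     # strip dollar signs and commas from the copy
    cost += price
}
END { printf "The total cost is $%.2f\n", cost }' "$tmp/vendor")
echo "$out"
rm -r "$tmp"
```

Working on a copy of the seventh field leaves the rest of the record untouched, which matters if other fields may legitimately contain commas.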

7.13.2. Bundling and Unbundling Files

The Bundle Program

In The AWK Programming Language by Alfred Aho, Brian Kernighan, and Peter Weinberger, the program to bundle files together is very short and to the point. The goal is to combine several files into one file to save disk space, to send files through electronic mail, and so forth. The following awk command will print every line of each file, preceded by the filename.

Example 7.89.
% awk '{ print FILENAME, $0 }' file1 file2 file3 > bundled
						

Explanation

The name of the current input file, FILENAME, is printed, followed by the record ($0) for each line of input in file1. After file1 has reached the end of file, awk will open the next file, file2, and do the same thing, and so on. The output is redirected to a file called bundled.

Unbundle

The following example shows how to unbundle files, that is, put them back into separate files.

Example 7.90.
% awk '$1 != previous { close(previous); previous=$1};
							{print substr($0, index($0, " ") + 1) > $1}' bundled
						

Explanation

The first field is the name of the file. If it is not equal to the value of the user-defined variable previous (initially null), the action block is executed: the file named in previous is closed, and previous is assigned the value of the first field. Then the rest of the record is written to the file named in the first field. The index function returns the position of the first space in the record, so substr, starting one position past it, returns everything after the filename.
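The bundle and unbundle commands can be checked against each other with a round trip. The sketch below assumes two small hypothetical files and unbundles into a separate directory, restored, so that the originals are not overwritten; it also assumes the filenames contain no spaces, since the first space in each bundled line marks the end of the name.

```shell
start=$PWD
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'alpha one\nalpha two\n' > file1
printf 'beta one\n' > file2

# Bundle: prefix every line with the name of the file it came from.
awk '{ print FILENAME, $0 }' file1 file2 > bundled

# Unbundle inside a subdirectory so the original files are left alone.
mkdir restored
( cd restored && awk '
$1 != previous { if (previous != "") close(previous); previous = $1 }
{ print substr($0, index($0, " ") + 1) > $1 }' ../bundled )

# cmp -s is silent and succeeds only when the two files are identical.
if cmp -s file1 restored/file1 && cmp -s file2 restored/file2; then
    result="round trip ok"
else
    result="round trip failed"
fi
echo "$result"
cd "$start" && rm -r "$tmp"
```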

To bundle the files so that the filename appears on a line by itself, above the contents of the file, use the following command:

% awk '{if(FNR==1){print FILENAME;print $0}
							else print $0}'  file1 file2 file3 > bundled
						

The following command will unbundle the files:

% awk 'NF==1{filename=$NF} ;
							NF != 1{print $0 > filename}' bundled
						

7.13.3. Multiline Records

In the sample data files used so far, each record is on a line by itself. In the following sample datafile, called checkbook, the records are separated by blank lines and the fields are separated by newlines. To process this file, the record separator (RS) is assigned a value of null, and the field separator (FS) is assigned the newline.

Example 7.91.
(The Input File)
   % cat checkbook
						1/1/99
						#125
						-695.00
						Mortgage

						1/1/99
						#126
						-56.89
						PG&E

						1/2/99
						#127
						-89.99
						Safeway

						1/3/99
						+750.00
						Pay Check

						1/4/99
						#128
						-60.00
						Visa

(The Script)
   % cat awkchecker
1  BEGIN{RS=""; FS="\n";ORS="\n\n"}
2  {print  NR, $1,$2,$3,$4}

(The Output)
   % awk –f awkchecker checkbook
						1  1/1/99  #125  -695.00  Mortgage
						2  1/1/99  #126  -56.89  PG&E
						3  1/2/99  #127  -89.99  Safeway
						4  1/3/99  +750.00  Pay Check
						5  1/4/99  #128  -60.00  Visa
					

Explanation

  1. In the BEGIN block, the record separator (RS) is assigned null, the field separator (FS) is assigned a newline, and the output record separator (ORS) is assigned two newlines. Now each line is a field and each output record is separated by two newlines.

  2. The number of the record is printed, followed by each of the fields.
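The effect of setting RS to null and FS to a newline can be seen in isolation with a two-record sample. This is a minimal sketch, assuming a hypothetical scratch copy of checkbook that holds only the first and fourth records; it reports how many fields each record has and what its last field is.

```shell
tmp=$(mktemp -d)
# Two records separated by one blank line; each line is one field.
printf '1/1/99\n#125\n-695.00\nMortgage\n\n1/3/99\n+750.00\nPay Check\n' > "$tmp/checkbook"

out=$(awk 'BEGIN { RS = ""; FS = "\n" }
{ print NR ": " NF " fields, last = " $NF }' "$tmp/checkbook")
echo "$out"
rm -r "$tmp"
```

Because the field separator is the newline and not white space, Pay Check stays together as a single field.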

7.13.4. Generating Form Letters

The following example is modified from a program in The AWK Programming Language. The tricky part is keeping track of what is actually being processed. The input file is called data.form. It contains just the data, with the fields separated by colons. The other file is called form.letter. It is the actual form that will be used to create the letter. This file is loaded into awk's memory with the getline function, and each line of the form letter is stored in an array. The program gets its data from data.form, and the letter is created by substituting real data for the special strings preceded by # and @ found in form.letter. A temporary variable, temp, holds the line that will be displayed after the data has been substituted. This program allows you to create personalized form letters for each person listed in data.form.

Example 7.92.
(The Awk Script)
% cat form.awk
# form.awk is an awk script that requires access to 2 files: The
# first file is called "form.letter". This file contains the
# format for a form letter. The awk script uses another file, 
# "data.form", as its input file. This file contains the
# information that will be substituted into the form letters in
# place of the numbers preceded by pound signs. Today's date
# is substituted in the place of "@date" in "form.letter".
1   BEGIN{ FS=":"; n=1
2   while((getline < "form.letter") > 0)
3       form[n++] = $0   #Store lines from form.letter in an array
4   "date" | getline d; split(d, today, " ")
        # Output of date is Sun Mar 2 14:35:50   PST 1999
5   thisday=today[2]". "today[3]", "today[6]
6   }
7   { for( i = 1; i < n; i++ ){
8       temp=form[i]
9       for ( j = 1; j <=NF; j++ ){
            gsub("@date", thisday, temp)
10          gsub("#" j, $j , temp )
        }
11 print temp
   }
   }

The form letter, form.letter, looks like this:

% cat form.letter
*********************************************************
   Subject: Status Report for Project "#1"
   To: #2
   From: #3
   Date: @date
   This letter is to tell you, #2, that project "#1" is up to 
   date.
   We expect that everything will be completed and ready for 
   shipment as scheduled on #4.

   Sincerely,

   #3
**********************************************************

The file data.form is awk's input file. It contains the data that will replace #1 through #4 in form.letter; @date is replaced with a date built from the output of the date command.

% cat data.form
   Dynamo:John Stevens:Dana Smith, Mgr:4/12/1999
   Gallactius:Guy Sterling:Dana Smith, Mgr:5/18/99

(The Command Line)

   % awk -f form.awk data.form
   *********************************************************
   Subject: Status Report for Project "Dynamo"
						To: John Stevens
						From: Dana Smith, Mgr
						Date: Mar. 2, 1999
						This letter is to tell you, John Stevens, that project
						"Dynamo" is up to date.
						We expect that everything will be completed and ready for
						shipment as scheduled on 4/12/1999.
						Sincerely,
						Dana Smith, Mgr
						Subject: Status Report for Project "Gallactius"
						To: Guy Sterling
						From: Dana Smith, Mgr
						Date: Mar. 2, 1999
						This letter is to tell you, Guy Sterling, that project
						"Gallactius" is up to date.
						We expect that everything will be completed and ready for
						shipment as scheduled on 5/18/99.
						Sincerely,
						Dana Smith, Mgr
					

Explanation

  1. In the BEGIN block, the field separator (FS) is assigned a colon; a user-defined variable n is assigned 1.

  2. In the while loop, the getline function reads a line at a time from the file called form.letter. If getline fails to find the file, it returns -1. When it reaches the end of file, it returns zero. Therefore, by testing for a return value greater than zero, we know that the function has read in a line from the input file.

  3. Each line from form.letter is assigned to an array called form.

  4. The output from the UNIX date command is piped to the getline function and assigned to the user-defined variable d. The split function then splits up the variable d with white space, creating an array called today.

  5. The user-defined variable thisday is assigned the month, day, and year.

  6. The BEGIN block ends.

  7. The for loop executes once for each line stored in the form array, that is, n-1 times.

  8. The user-defined variable temp is assigned a line from the form array.

  9. The nested for loop iterates over the fields of the current line from the input file, data.form, NF times. The line stored in the temp variable is checked for the string @date. If @date is matched, the gsub function replaces it with today's date (the value stored in thisday).

  10. If a # and a number are found in the line stored in temp, the gsub function will replace the # and number with the value of the corresponding field in the input file, data.form. For example, while the first line of data.form is being processed, #1 is replaced with Dynamo, #2 with John Stevens, #3 with Dana Smith, Mgr, and #4 with 4/12/1999.

  11. The line stored in temp is printed after the substitutions.
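The core of the technique, loading a template with getline and then substituting fields into it, can be tried on a much smaller scale. The sketch below is not form.awk itself: it assumes a hypothetical two-line template and a single data record, passes the template's path in a variable named template, and leaves out the @date handling.

```shell
tmp=$(mktemp -d)
printf 'To: #2\nRe: project "#1"\n' > "$tmp/form.letter"
printf 'Dynamo:John Stevens\n' > "$tmp/data.form"

out=$(awk -F: -v template="$tmp/form.letter" '
BEGIN {
    # Parentheses keep the comparison separate from the redirection.
    while ((getline line < template) > 0)
        form[++n] = line
    close(template)
}
{
    for (i = 1; i <= n; i++) {
        temp = form[i]
        for (j = 1; j <= NF; j++)
            gsub("#" j, $j, temp)   # replace #1, #2, ... with field j
        print temp
    }
}' "$tmp/data.form")
echo "$out"
rm -r "$tmp"
```

One caveat with this approach is that an ampersand in the data has a special meaning in the replacement string of gsub, so a field such as PG&E would need its ampersand escaped before being substituted.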

7.13.5. Interaction with the Shell

Now that you have seen how awk works, you will find that awk is a very powerful utility when writing shell scripts. You can embed one-line awk commands or awk scripts within your shell scripts. The following is a sample Bash script with embedded awk commands.

Example 7.93.
#!/bin/bash
# Scriptname: bash.sc
# This bash shell script will collect data for awk to use in
# generating form letter(s). See above.
echo "Hello $LOGNAME. "
echo "This report is for the month and year:"
1 cal | awk 'NR==1{print $0}'

  if [[ -f data.form  || -f formletter? ]]
  then
       
     rm data.form formletter?  2> /dev/null
  fi
  let num=1
  while true
  do

    echo "Form letter #$num:"
    echo -n "What is the name of the project? "
    read project
    echo -n "Who is the status report from? "
    read sender
    echo -n "Who is the status report to? "
    read recipient
    echo -n "When is the completion date scheduled? "
    read due_date
    echo $project:$recipient:$sender:$due_date > data.form
    echo -n "Do you wish to generate another form letter? "
    read answer
    if [[ "$answer" != [Yy]* ]]
    then
       break
    else
2      awk -f form.awk  data.form  > formletter$num
    fi
    (( num+=1 ))
  done
  awk -f form.awk data.form > formletter$num
					

Explanation

  1. The output of the Linux cal command is piped to awk. The first line, which contains the current month and year, is printed.

  2. The awk script form.awk generates form letters, which are redirected to a UNIX file.
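Shell variables can also be handed to awk directly with the -v option instead of going through a file. The following is a small sketch, assuming two hypothetical shell variables like those read by the script; the awk variable names p and r are illustrative.

```shell
project="Dynamo"
recipient="John Stevens"

# -v assigns each shell value to an awk variable before the program runs.
out=$(awk -v p="$project" -v r="$recipient" 'BEGIN { print p ":" r }')
echo "$out"
```

Note that awk processes backslash escape sequences in values assigned with -v, so input containing backslashes may need extra care.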
