Chapter 13. Parsing and Similar Tasks

This is a chapter of tasks that programmers might recognize. The recipes here aren’t necessarily more advanced than the other bash script recipes in the book, but if you are not a programmer, these tasks might seem obscure or irrelevant to your use of bash. We won’t do much explaining of the reasons why you’d find yourself in these situations (as a programmer, you’ll recognize some if not all of them). Even if you don’t recognize the situations, though, you should read them for what you can learn about bash.

Some of the recipes in this chapter include the parsing of command-line arguments. Recall that the typical way to specify options on a shell script is to have a leading minus sign and a single letter. For example, an option for your script to give fewer messages might use -q as a flag to mean quiet mode. Sometimes an option might take an argument. For example, a user option where you need to specify a username might use -u followed by the username. This distinction will be made clear in this chapter’s first recipe.

Some Linux commands also allow long-form options. Using the previous example of a short-format -u option, a command might also support a long format like --user=username. We will not be showing any long-format options, though they could be used for some of the techniques that we show. The best way to parse long arguments is to use the getopt (note no s) command.

13.1 Parsing Arguments for Your Shell Script

Problem

You want to have some options on your shell script, some flags that users can use to alter its behavior. You could do the parsing directly, using ${#} to tell you how many arguments have been supplied and using ${1:0:1} to test the first character of the first argument to see if it is a minus sign. You would need some if/then or case logic to identify which option it is and whether it takes an argument, though. And what if the user doesn’t supply a required argument, or calls your script with two options combined (e.g., -ab)? Will you also parse for that? The need to parse options for a shell script is a common situation. Lots of scripts have options. Isn’t there a more standard way to do this?

Solution

Use bash’s builtin getopts command to help parse options.

Example 13-1, based largely on the example in the manpage for getopts, illustrates.

Example 13-1. ch13/getopts_example

#!/usr/bin/env bash
# cookbook filename: getopts_example
#
# using getopts
#
aflag=
bflag=
while getopts 'ab:' OPTION
do
    case $OPTION in
        a) aflag=1
           ;;
        b) bflag=1
           bval="$OPTARG"
           ;;
        ?) printf "Usage: %s: [-a] [-b value] args\n" ${0##*/} >&2
           exit 2
           ;;
    esac
done
shift $(($OPTIND - 1))

if [ "$aflag" ]
then
    printf "Option -a specified\n"
fi
if [ "$bflag" ]
then
    printf 'Option -b "%s" specified\n' "$bval"
fi
printf "Remaining arguments are: %s\n" "$*"

Discussion

There are two kinds of options supported here. The first and simpler kind is an option that stands alone. It typically represents a flag to modify a command’s behavior. An example of this sort of option is the -l option on the ls command. The second kind of option requires an argument. An example of this is the mysql command’s -u option, which requires that a username be supplied, as in mysql -u sysadmin. Let’s look at how getopts supports the parsing of both kinds.

getopts takes two arguments:

getopts 'ab:' OPTION

The first is a list of option letters. The second is the name of a shell variable. In our example we are defining -a and -b as the only two valid options, so the first argument to getopts has just those two letters…and a colon. What does the colon signify? It indicates that -b needs an argument, just like -u username or -f filename might be used. The colon needs to be adjacent to any option letter taking an argument. For example, if only -a took an argument we would need to write a:b instead.

The getopts builtin will set the variable named in the second argument to the value that it finds when it parses the shell script’s argument list ($1, $2, etc.). If it finds an argument with a leading minus sign, it will treat that as an option argument and put the letter into the given variable ($OPTION in our example). Then it returns true (i.e., 0) so that the while loop will process the option, then continue to parse options by repeated calls to getopts until it runs out of arguments (or encounters a double minus, --, which allows users to put an explicit end to the options). Then getopts returns false (i.e., a nonzero value) and the while loop ends.

Inside the loop, when the parsing has found an option letter for processing, we use a case statement on the variable $OPTION to set flags or otherwise take action when the option is encountered. For options that take arguments, that argument is placed in the shell variable $OPTARG (a fixed name not related to our use of $OPTION as our variable). We need to save that value by assigning it to another variable because as the parsing continues to loop, the variable $OPTARG will be reset on each call to getopts.

The third case of our case statement is a question mark, a shell pattern that matches any single character. When getopts finds an option that is not in the set of expected options ('ab:' in our example), it will return a literal question mark in the variable ($OPTION in our example). So, we could have made our case statement read \?) or '?') for an exact match, but the ? as a pattern match of any single character provides a convenient default. It will match a literal question mark as well as matching any other single character.

In the usage message that we print, we have made two changes from the example script in the manpage. First, we use ${0##*/} to give the name of the script without the pathname that may have been part of how it was invoked. Secondly, we redirect this message to standard error (>&2) because that is really where such messages belong. All of the error messages from getopts that occur when an unknown option or missing argument is encountered are written to standard error; we add our usage message to that chorus.

When the while loop terminates, we see the next line to be executed is:

shift $(($OPTIND - 1))

which is a shift statement used to move the positional parameters of the shell script from $1, $2, etc. down a given number of positions (tossing the lower ones). The variable $OPTIND is an index into the arguments that getopts uses to keep track of where it is when it parses. Once we are done parsing, we can toss all the options that we’ve processed by executing this shift statement. For example, if we had this command line:

myscript -a -b alt plow harvest reap

then after parsing for options, $OPTIND would be set to 4. By shifting three positions ($OPTIND - 1) we would get rid of the options, and then a quick echo $* would give this:

plow harvest reap

The remaining (nonoption) arguments will then be ready for use in our script (in a for loop, perhaps). In our example script, the last line is a printf showing all the remaining arguments.
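
To watch $OPTIND at work without writing a whole script, here is a minimal sketch that runs the same getopts loop inside a function against the sample command line above. The function name parse_demo and the uppercase result variables are our own inventions for illustration; the local OPTIND=1 is there so that the function starts parsing from its first argument each time it is called.

```bash
#!/usr/bin/env bash
# Sketch: run the getopts loop in a function and record where it stopped.
# parse_demo and the uppercase result variables are hypothetical names.
parse_demo() {
    local OPTION OPTIND=1 aflag= bval=
    while getopts 'ab:' OPTION
    do
        case $OPTION in
            a) aflag=1 ;;
            b) bval=$OPTARG ;;
            ?) return 2 ;;
        esac
    done
    INDEX_AFTER=$OPTIND        # index of the first nonoption argument
    shift $((OPTIND - 1))      # toss the options that were just parsed
    REMAINING="$*"
    AFLAG=$aflag
    BVAL=$bval
}

parse_demo -a -b alt plow harvest reap
echo "OPTIND=$INDEX_AFTER a=$AFLAG b=$BVAL rest=$REMAINING"
```

Because the options occupy the first three positions, getopts leaves $OPTIND at 4, and the shift leaves only the three nonoption words.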

13.2 Parsing Arguments with Your Own Error Messages

Problem

You are using getopts to parse the options for your shell script, but you don’t like the error messages that it writes when it encounters bad input. Can you still use getopts but write your own error handling?

Solution

If you just want getopts to be quiet and not report any errors at all, just assign OPTERR=0 before you begin parsing. But if you want getopts to give you more information without the error messages, then begin the option list with a colon, as shown in the script in Example 13-2. (The quotes around the option list are optional.)

Example 13-2. ch13/getopts_custom

#!/usr/bin/env bash
# cookbook filename: getopts_custom
#
# using getopts - with custom error messages
#
aflag=
bflag=
# since we don't want getopts to generate error
# messages, but want this script to issue its
# own messages, we will put, in the option list, a
# leading ':' to silence getopts.
while getopts :ab: FOUND
do
    case $FOUND in
        a)  aflag=1
            ;;
        b)  bflag=1
            bval="$OPTARG"
            ;;
        \:) printf "argument missing from -%s option\n" $OPTARG
            printf "Usage: %s: [-a] [-b value] args\n" ${0##*/}
            exit 2
            ;;
        \?) printf "unknown option: -%s\n" $OPTARG
            printf "Usage: %s: [-a] [-b value] args\n" ${0##*/}
            exit 2
            ;;
    esac >&2
done
shift $(($OPTIND - 1))

if [ "$aflag" ]
then
    printf "Option -a specified\n"
fi
if [ "$bflag" ]
then
    printf 'Option -b "%s" specified\n' "$bval"
fi
printf "Remaining arguments are: %s\n" "$*"

Discussion

The script is very similar to the one in Recipe 13.1; see that recipe’s Discussion section for more background. One difference here is that getopts may now return a colon. It does so when an option’s argument is missing (e.g., when the user invokes the script with -b but without an argument for it). In that case, it puts the option letter into $OPTARG so that you know which option was missing its argument.

Similarly, if an unsupported option is given (e.g., if the user tries -d when invoking the script) getopts returns a question mark as the value for $FOUND, and puts the letter (the d in this case) into $OPTARG so that it can be used in the error messages.

We put a backslash in front of both the colon and the question mark to indicate that these are literals and not any special patterns or shell syntax. While not necessary for the colon, it looks better to have the parallel construction with the two punctuation marks both being escaped.

We added an I/O redirection on the esac (the end of the case statement), so that all output from the various printf commands will be redirected to standard error. This is in keeping with the purpose of standard error and is just easier to put it here than remembering to put it on each printf individually.
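
If you want to see for yourself what getopts hands back when the option list begins with a colon, a small sketch like the following may help. The function report_opts is a hypothetical helper of ours; it only reports what it sees and always returns success, so it is not a replacement for the error handling in the recipe.

```bash
#!/usr/bin/env bash
# Sketch: show what getopts returns in "silent" mode (leading colon).
# report_opts is a hypothetical helper; it only prints what it finds.
report_opts() {
    local FOUND OPTIND=1
    while getopts ':ab:' FOUND
    do
        case $FOUND in
            a|b) ;;                                          # valid option
            \:)  echo "missing argument for -$OPTARG"; return ;;
            \?)  echo "unknown option -$OPTARG"; return ;;
        esac
    done
    echo "ok"
}

GOOD=$(report_opts -a -b value)
NOARG=$(report_opts -a -b)
BADOPT=$(report_opts -d)
printf '%s\n' "$GOOD" "$NOARG" "$BADOPT"
```

In both failure cases the offending letter arrives in $OPTARG, which is what lets the script word its own message.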

13.3 Parsing Some HTML

Problem

You want to pull the strings out of some HTML. For example, you’d like to get at the href="urlstringstuff"-type strings from the <a> tags within a chunk of HTML.

Solution

For a quick and easy shell parse of HTML, provided it doesn’t have to be foolproof, you might want to try something like this:

cat $1 | sed -e 's/>/>\
/g' | grep '<a' | while IFS='"' read a b c ; do echo $b; done

Discussion

Parsing HTML from bash is pretty tricky, mostly because bash tends to be very line-oriented whereas HTML was designed to treat newlines like whitespace. So, it’s not uncommon to see tags split across two or more lines, as in:

<a href="blah..." rel="blah..." media="blah..."
  target= "blah..." >

There are also two ways to write <a> tags, one with a separate ending </a> tag and one without, where instead the singular <a> tag itself ends with a />. Between this and the potential for multiple tags on a line and tags split across lines, it’s a bit messy to parse, and our simple bash technique for this is often not foolproof.

Here are the steps involved in our solution. First, break the multiple tags on one line into at most one line per tag:

cat file | sed -e 's/>/>\
/g'

Yes, that’s a newline right after the backslash so that it substitutes each end-of-tag character (i.e., the >) with that same character and then a newline. That will put tags on separate lines, with maybe a few extra blank lines. The trailing g tells sed to do the search and replace globally; i.e., multiple times on a line if need be.

Then you can pipe that output into grep to grab just the <a tag lines:

cat file | sed -e 's/>/>\
/g' | grep '<a'

or maybe just lines with double quotes:

cat file | sed -e 's/>/>\
/g' | grep '".*"'

The single quotes tell the shell to take the inner characters literally and not do any shell expansion on them, and the rest is a regular expression to match a double quote followed by any character (.) any number of times (*), followed by another double quote. (This won’t work if the string itself is split across lines.)

To parse out the contents of what’s inside the double quotes, one trick is to use the shell’s internal field separator ($IFS) to tell it to use the double quote (") as the separator. You can do a similar thing with awk and its -F (field separator) option.

For example:

cat $1 | sed -e 's/>/>\
/g' | grep '".*"' | awk -F'"' '{ print $2}'

(Or use grep '<a' if you just want <a tags and not all quoted strings.)

If you want to use the $IFS shell trick rather than awk, it would be:

cat $1 | sed -e 's/>/>\
/g' | grep '<a' | while IFS='"' read PRE URL POST ; do echo $URL; done

where the grep output is piped into a while loop that reads the input into three fields (PRE, URL, and POST). By preceding the read command with IFS='"', we set that environment variable just for the read command, not for the entire script. Thus, it will parse with the quotes as its notion of what separates the words of the input line. It will set PRE to be everything up to the first quote, URL to be everything from there to the next quote, and POST to be everything thereafter. Then the script just echoes the second variable, URL; that’s all the characters between the quotes.
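
To try the pipeline without hunting for an HTML file, here is a minimal sketch that feeds it a made-up snippet held in a variable. The sample markup and the variable names HTML and URLS are assumptions of ours; the pipeline itself is the one from the Solution, with -r added to read so that backslashes in the input are left alone.

```bash
#!/usr/bin/env bash
# Sketch: run the sed/grep/read pipeline over a made-up HTML snippet.
HTML='<p>See <a href="http://example.com/one">one</a> and <a href="http://example.com/two">two</a>.</p>'

# 1. sed puts a newline after every '>' so each tag ends its own line
# 2. grep keeps only the lines that contain an <a tag
# 3. read splits each line on double quotes; the second field is the URL
URLS=$(printf '%s\n' "$HTML" | sed -e 's/>/>\
/g' | grep '<a' | while IFS='"' read -r PRE URL POST ; do echo "$URL"; done)

echo "$URLS"
```

Remember that this is still the quick-and-dirty approach: a tag whose attributes are split across lines would defeat it, as described earlier.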

13.4 Parsing Output into an Array

Problem

You want the output of some program or script to be put into an array.

Solution

Example 13-3 illustrates how to use an array to parse the output into words.

Example 13-3. ch13/parseViaArray

#!/usr/bin/env bash
# cookbook filename: parseViaArray
#
# find the file size
# use an array to parse the ls -l output into words

LSL=$(ls -ld $1)

declare -a MYRA
MYRA=($LSL)

echo the file $1 is ${MYRA[4]} bytes.

Discussion

In our example, we take the output from the ls -l command and parse the words by putting them into an array. Then we can just refer to each array element to get at each word. (Remember that the arrays are zero-based, so an index of 4 gives us the fifth element.) The typical output from the ls -l command looks like this (yours may vary due to locale):

-rw-r--r-- 1 albing users 113 2006-10-10 23:33 mystuff.txt

Arrays are easy to initialize if you know the values as you write the script. The format is simple. We begin by declaring the variable to be an array, and then we assign it values:

declare -a MYRA
MYRA=(first second third home)

The same can be done by using a variable inside those parentheses. Just be sure not to use quotes around the variable. Writing MYRA=("$LSL") will put the entire string into the first element, since it is all contained as one quoted string. Then ${MYRA[0]} will be the only array element, and it will contain the entire string, which is not what you wanted.

We also could have shortened this script by combining the steps like this:

declare -a MYRA
MYRA=($(ls -ld $1))

If you want to know how many elements you have in your new array, just reference the variable ${#MYRA[*]} or ${#MYRA[@]}, either of which is a lot of special characters to type.
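
The difference between the quoted and unquoted forms is easy to check with a fixed string in place of real ls output. This is only a sketch; the sample line is the one shown earlier in this discussion, and the array name QUOTED is ours.

```bash
#!/usr/bin/env bash
# Sketch: compare unquoted and quoted array assignment on a canned line.
LSL='-rw-r--r-- 1 albing users 113 2006-10-10 23:33 mystuff.txt'

declare -a MYRA QUOTED
MYRA=($LSL)        # unquoted: the shell splits the string into words
QUOTED=("$LSL")    # quoted: the whole string becomes a single element

echo "unquoted: ${#MYRA[@]} elements, size field is ${MYRA[4]}"
echo "quoted:   ${#QUOTED[@]} element"
```

The unquoted form also subjects the words to filename expansion, so it is best kept for output that you know contains no wildcard characters.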

13.5 Parsing Output with a Function Call

Problem

You want to parse the output of some program into various variables to be used elsewhere in your program. Arrays are great when you are looping through the values, but not very readable if you want to refer to each separately, rather than by an index.

Solution

Use a function call to parse the words, as shown in Example 13-4.

Example 13-4. ch13/parseViaFunc

#!/usr/bin/env bash
# cookbook filename: parseViaFunc
#
# parse ls -l via function call
# an example of output from ls -l follows:
# -rw-r--r--  1 albing users 126 Jun 10 22:50 fnsize

function lsparts ()
{
    PERMS=$1
    LCOUNT=$2
    OWNER=$3
    GROUP=$4
    SIZE=$5
    CRMONTH=$6
    CRDAY=$7
    CRTIME=$8
    FILE=$9
}

lsparts $(ls -l "$1")

echo $FILE has $LCOUNT 'link(s)' and is $SIZE bytes long.

Here’s what it looks like when it runs:

$ ./fnsize fnsize
fnsize has 1 link(s) and is 311 bytes long.
$

Discussion

We can let bash do the work of parsing by putting the text to be parsed in a function call. Calling a function is much like calling a shell script. bash parses the words into separate variables and assigns them to $1, $2, etc. Our function can just assign each positional parameter to a separate variable. If the variables are not declared locally then they are available outside as well as inside the function.

We put quotes around the reference to $1 in the ls command in case the filename supplied has spaces in its name. The quotes keep it all together so that ls sees it as a single filename and not as a series of separate filenames.

We use quotes in the expression 'link(s)' to avoid special treatment of the parentheses by bash. Alternatively, we could have put the entire phrase (except for the echo itself) inside of double quotes—double, not single, so that the variable substitution (for $FILE, etc.) still occurs.

Warning

You might need to adjust the field list depending on how your computer and ls command present the date, or add options to your ls command to modify its output. For example, ls -l --time-style="long-iso" will produce a slightly different output format where month and day are replaced by a YYYY-MM-DD format date; you would then want to replace CRMONTH and CRDAY with a single variable, say, CRDATE, and adjust the field numbers accordingly.
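
As a rough illustration of that adjustment, here is how the function might look for the long-iso layout. The name lsparts_iso is hypothetical, and we feed it a canned line rather than real ls output so that the field positions are easy to see.

```bash
#!/usr/bin/env bash
# Sketch: the field list adjusted for a YYYY-MM-DD date column.
# lsparts_iso is a hypothetical name; SAMPLE stands in for the output of
# something like: ls -l --time-style="long-iso" "$1"
lsparts_iso() {
    PERMS=$1
    LCOUNT=$2
    OWNER=$3
    GROUP=$4
    SIZE=$5
    CRDATE=$6      # one date field replaces CRMONTH and CRDAY
    CRTIME=$7
    FILE=$8        # so the filename moves up from $9 to $8
}

SAMPLE='-rw-r--r--  1 albing users 126 2006-10-10 22:50 fnsize'
lsparts_iso $SAMPLE

echo "$FILE has $LCOUNT link(s), is $SIZE bytes long, dated $CRDATE."
```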

13.6 Parsing Text with a read Statement

Problem

There are many ways to parse text with bash. What if you don’t want to use a function? Is there another way?

Solution

Use the read statement, as in Example 13-5.

Example 13-5. ch13/parseViaRead

#!/usr/bin/env bash
# cookbook filename: parseViaRead
#
# parse ls -l with a read statement
# an example of output from ls -l follows:
# -rw-r--r--  1 albing users 126 2006-10-10 22:50 fnsize

ls -l "$1" | { read PERMS LCOUNT OWNER GROUP SIZE CRDATE CRTIME FILE ;
                 echo $FILE has $LCOUNT 'link(s)' and is $SIZE bytes long. ;
             }

Discussion

Here we let read do all the parsing. It will break apart the input into words, where words are separated by whitespace, and assign each word to the variables named in the read statement. Actually, you can even change the separator, by setting the bash $IFS (internal field separator) variable to whatever character you want for parsing; just remember to set it back!

As you can see from the sample output of ls -l, we have tried to choose names that get at the meaning of each word in the output. Since FILE is the last word, any extra fields will also be part of that variable. That way if the name has whitespace in it, like “Beethoven Fifth Symphony,” then all three words will end up in $FILE.
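
Both points can be tried on canned input. In this sketch the two sample lines and the passwd-style field names are made up by us; a here-string (<<<) supplies the input so that read runs in the current shell and its variables are still set afterward. Putting the IFS assignment on the same line as read changes the separator for that one command only, so there is nothing to set back.

```bash
#!/usr/bin/env bash
# Sketch: read splits on whitespace; the last variable takes the leftovers.
LINE='-rw-r--r--  1 albing users 126 2006-10-10 22:50 Beethoven Fifth Symphony'
read -r PERMS LCOUNT OWNER GROUP SIZE CRDATE CRTIME FILE <<< "$LINE"
echo "$FILE is $SIZE bytes long."

# A different separator, scoped to this single read command.
# The field names are our own; ENTRY is a made-up passwd-style line.
ENTRY='jp:*:1000:100:JP:/home/jp:/usr/pkg/bin/bash'
IFS=':' read -r LOGIN PW UIDNUM GIDNUM GECOS HOMEDIR LOGINSHELL <<< "$ENTRY"
echo "$LOGIN lives in $HOMEDIR and runs $LOGINSHELL"
```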

13.7 Parsing with read into an Array

Problem

You’ve got a varying number of words on each line of input, so you can’t just assign each word to a predetermined variable.

Solution

Use the -a option on the read command, and the words will be read into an array variable:

read -a MYRAY

Discussion

Whether coming from user input or a pipeline, read will parse the input into words, putting each word in its own array element. The variable does not need to be declared as an array—using it in this fashion is enough to make it into an array. Each element can be referenced with the bash array syntax. Arrays in bash are zero-based, so the second word on a line of input will be put into ${MYRAY[1]} in our example. The number of words will determine the size of the array. In our example, the size of the array is ${#MYRAY[@]}.
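
Here is a short sketch of read -a in a loop over lines with differing word counts. The input lines and the SIZES array, which we use only to remember each line's word count, are made up for the illustration.

```bash
#!/usr/bin/env bash
# Sketch: read each line into an array, however many words it holds.
SIZES=()    # our own bookkeeping: the word count of every line seen
while read -r -a MYRAY
do
    echo "${#MYRAY[@]} words, second is '${MYRAY[1]}'"
    SIZES+=("${#MYRAY[@]}")
done <<'EOF'
one two three
alpha beta gamma delta epsilon
EOF
```

Each pass through the loop replaces the previous contents of MYRAY, so copy out anything you need before the next read.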

13.8 Reading an Entire File

Problem

You want to read in a whole file and then parse it. Must you do this using a for loop and reading one line at a time, or is there a shorthand?

Solution

Use the mapfile or readarray command in bash. They are identical commands that take the same arguments and let you read an entire file into an array, one array entry for each line of the file, with one statement.

The choice of command, either readarray or mapfile, seems to be one of perspective—are you thinking about the destination (the array), or the source (the datafile)? Use whichever makes more sense to you. They are interchangeable.

Here’s a sample mapfile command, part of a fuller example in the discussion:

mapfile -t -s 1 -n 1500 -C showprg -c 100 BIGDATA  < /tmp/myfile.data

This command will discard (i.e., skip) the first line (-s 1) of input, reading up to 1,500 lines (-n 1500) and discarding the newline at the end of each line (-t). Every 100 lines (-c 100) it will call a user-defined function called showprg (to show progress in reading the file; the default is every 5,000 lines). The data is put into the array called BIGDATA, one line of input per entry. Input is redirected from the file as shown.

Discussion

Here’s the first part of an example use of mapfile (or readarray, if you prefer). It reads the file and shows progress as it reads. Then it prints out how many lines it read—i.e., the size of the array:

# use mapfile to read in $1

# show progress with dots
function showprg ()
{
    printf "."
}

# create a large datafile for our use
ls -l /usr/bin  > /tmp/myfile.data

# a.k.a. readarray; load up BIGDATA
mapfile -t -s 1 -n 1500 -C showprg -c 100 BIGDATA  < /tmp/myfile.data

# put a newline at the end of the showprg output
echo

# how many lines did we read?
siz=${#BIGDATA[@]}
echo "size: ${siz}"

The showprg function will print a dot (but no newline) each time it is called. This will show progress when reading in a large file. You could do something much fancier if you wanted; it’s whatever function you want, after all.

So now that the file has been read into the array, what might we do with all that data? In this case it’s a very long output from the ls command. We could now go through the file one line at a time and print out some of the data:

# number the lines as we print them out
for((i=0; i<siz; i++))
do
    ALINE=${BIGDATA[i]}
    if [[ ${ALINE:0:1} == 'l' ]]   # only symbolic links
    then
        # print the relevant substring
        printf "%4d: %s\n" $i "${ALINE:48}"
    fi
done

rm /tmp/myfile.data   # clean up

In this case, the script will look at the first character of a line and, if it’s an l, print out the line, beginning at character 48 (zero-based). Since the data in the file is the “long” output of an ls command, such a first character indicates that we are looking at a symbolic link. (Similarly, a d would indicate a directory, but we don’t use that here.)

Here is an excerpt of the output that might result from running the whole script. The first line shows the dots that appear as the progress of the read:

 ...............
 size: 1500
 (other output, and then)
 1307: rsh -> /etc/alternatives/rsh
 1311: rtstat -> lnstat
 1315: rview -> /etc/alternatives/rview
 (even more output)
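
If you would like to check what -s and -n do without generating a large file, a scaled-down sketch such as this one may be useful. It assumes bash 4 or newer (where mapfile exists); the five-line datafile and the names DATAFILE and DATA are ours.

```bash
#!/usr/bin/env bash
# Sketch: mapfile on a tiny temporary file, skipping one line and
# keeping at most three. DATAFILE and DATA are made-up names.
DATAFILE=$(mktemp)
printf '%s\n' 'header line' 'first' 'second' 'third' 'fourth' > "$DATAFILE"

mapfile -t -s 1 -n 3 DATA < "$DATAFILE"
rm -f "$DATAFILE"    # clean up

echo "size: ${#DATA[@]}"
for ((i = 0; i < ${#DATA[@]}; i++))
do
    printf '%4d: %s\n' "$i" "${DATA[i]}"
done
```

The header line is skipped and the fourth data line is never stored, so the array ends up with three entries and no trailing newlines.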

13.9 Getting Your Plurals Right

Problem

You want to use a plural noun when you have more than one of an object. But you don’t want to scatter if statements all through your code.

Solution

Example 13-6 illustrates a way to make words plural.

Example 13-6. ch13/pluralize

#!/usr/bin/env bash
# cookbook filename: pluralize
#
# A function to make words plural by adding an s
# when the value ($2) is != 1 or -1.
# It only adds an 's'; it is not very smart.
#
function plural ()
{
    if [ $2 -eq 1 -o $2 -eq -1 ]
    then
        echo ${1}
    else
        echo ${1}s
    fi
}

while read num name
do
    echo $num $(plural "$name" $num)
done

Discussion

The function, though only set to handle the simple addition of an s, will do fine for many nouns. The function doesn’t do any error checking of the number or contents of the arguments. If you wanted to use this script in a serious application, you might want to add those kinds of checks.

We put the name in quotes when we call the plural function in case there are embedded blanks in the name. It did, after all, come from the read statement, and the last variable in a read statement gets all the remaining text from the input line. You can see that in the following example.

We put the script in Example 13-6 into a file named pluralize and ran it against the following data:

$ cat input.file
1 hen
2 duck
3 squawking goose
4 limerick oyster
5 corpulent porpoise

$ ./pluralize < input.file
1 hen
2 ducks
3 squawking gooses
4 limerick oysters
5 corpulent porpoises

“Gooses” isn’t correct English, but the script did what was intended. If you like the C-like syntax better, you could write the if statement like this:

if (( $2 == 1 || $2 == -1 ))

The square bracket (i.e., the test builtin) is the older form, more common across the various versions of bash, but either should work. Use whichever form’s syntax is easiest for you to remember.

We don’t expect you would keep a file like pluralize around, but the plural function might be handy to have as part of a larger scripting project. Then whenever you report on the count of something you could use the plural function as part of the reference, as shown in the while loop in the script.
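
For that larger project you might want the error checking mentioned earlier. The following is one possible sketch, not the only way to do it; the name plural_checked and its messages are our own. It insists on exactly two arguments and on a whole number (with an optional leading minus sign) for the count.

```bash
#!/usr/bin/env bash
# Sketch: the plural function with some argument checking added.
# plural_checked is a hypothetical name.
plural_checked() {
    if [ $# -ne 2 ]
    then
        echo "usage: plural_checked word count" >&2
        return 1
    fi
    case $2 in
        ''|-|*[!0-9-]*|?*-*)      # empty, lone minus, nondigit, stray minus
            echo "plural_checked: not an integer: $2" >&2
            return 1
            ;;
    esac
    if [ "$2" -eq 1 ] || [ "$2" -eq -1 ]
    then
        printf '%s\n' "$1"
    else
        printf '%ss\n' "$1"
    fi
}

echo "2 $(plural_checked duck 2)"
echo "1 $(plural_checked 'squawking goose' 1)"
```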

13.10 Taking It One Character at a Time

Problem

You have some parsing to do, and for whatever reason nothing else will do—you need to take your strings apart one character at a time.

Solution

The substring function for variables will let you take things apart, and another feature tells you how long a string is. Example 13-7 demonstrates their use.

Example 13-7. ch13/onebyone

#!/usr/bin/env bash
# cookbook filename: onebyone
#
# parsing input one character at a time

while read ALINE
do
    for ((i=0; i < ${#ALINE}; i++))
    do
        ACHAR=${ALINE:i:1}
        # do something here, e.g. echo $ACHAR
        echo $ACHAR
    done
done

Discussion

The read statement will take input from standard input and put it, a line at a time, into the variable $ALINE. Since there are no other variables in the read statement, it takes the entire line and doesn’t divvy it up, but it will remove leading and trailing $IFS whitespace unless you use IFS= read or just read by itself and later reference the default $REPLY variable.

The for loop will loop once for each character in the $ALINE variable. We can compute how many times to loop by using ${#ALINE}, which returns the length of the contents of $ALINE.

Each time through the loop we assign $ACHAR the value of the one-character substring of $ALINE that begins at the ith position. That’s simple enough.
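
As one example of "doing something" with each character, here is a sketch that counts how often a given character appears in its input. The function count_char is a made-up helper; it uses IFS= read -r so that leading blanks and backslashes reach the loop untouched.

```bash
#!/usr/bin/env bash
# Sketch: count occurrences of one character, a character at a time.
# count_char is a hypothetical helper, not part of the recipe.
count_char() {
    local needle=$1 count=0 ALINE i
    while IFS= read -r ALINE
    do
        for ((i = 0; i < ${#ALINE}; i++))
        do
            if [[ ${ALINE:i:1} == "$needle" ]]
            then
                count=$((count + 1))
            fi
        done
    done
    echo "$count"
}

AS=$(printf '  banana\nbandana\n' | count_char a)
BLANKS=$(printf '  banana\nbandana\n' | count_char ' ')
echo "a: $AS, blanks: $BLANKS"
```

With a plain read ALINE the two leading blanks would have been stripped before the loop ever saw them.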

13.11 Cleaning Up an SVN Source Tree

Problem

Subversion’s svn status command shows all the files that have been modified, but if you have scratch files or other garbage lying around in your source tree, svn will list those, too. It would be useful to have a way to clean up your source tree, removing those files unknown to Subversion.

Warning

Subversion won’t know about new files unless and until you do an svn add command. Don’t run this script until you’ve added any new source files, or they’ll be gone for good.

Solution

You can grep output from the svn status command and read that to create a list of files to delete:

svn status src | grep '^?' | 
  while read status filename; do echo "$filename"; rm -rf "$filename"; done

Discussion

The svn status output lists one file per line. It puts an M as the first character of a line for files that have been modified, an A for newly added (but not yet committed) files, and a question mark for those about which it knows nothing. We just grep for those lines beginning with a question mark. We process the output with a read statement in a while loop. The echo isn’t strictly necessary, but it’s useful to see what’s being removed, just in case there is a mistake or an error. You can at least see that it’s gone for good. When we do the remove, we use the -rf options in case the file is a directory, but mostly just to keep the remove quiet. Problems encountered with permissions and such are squelched by the -f option; it just removes the file as best as your permissions allow. We put the reference to the filename in quotes ("$filename") in case there are special characters, like spaces, in the filename.
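
Given how final rm -rf is, you may want a dry run first. This sketch runs the same grep and read steps over a canned sample of status output, which we made up so that the example works without a Subversion checkout, and only lists the names that would be removed.

```bash
#!/usr/bin/env bash
# Sketch: a dry run of the cleanup on made-up svn status output.
# Nothing is deleted here; the rm has been left out on purpose.
SAMPLE_STATUS='M       src/main.c
A       src/new.c
?       src/scratch.tmp
?       src/old notes.txt'

DOOMED=$(printf '%s\n' "$SAMPLE_STATUS" | grep '^?' |
    while read -r status filename; do echo "$filename"; done)

echo "would remove:"
echo "$DOOMED"
```

Against a real working copy you would replace the printf with svn status src and, once the list looks right, put the rm back.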

13.12 Setting Up a Database with MySQL

Problem

You want to create and initialize several databases using MySQL. You want them all to be initialized using the same SQL commands. Each database needs its own name, but each database will have the same contents, at least at initialization. You may need to do this setup over and over, as in the case where these databases are used as part of a test suite that needs to be reset when tests are rerun.

Solution

The simple bash script in Example 13-8 can help with this administrative task.

Example 13-8. ch13/dbiniter

#!/usr/bin/env bash
# cookbook filename: dbiniter
#
# initialize databases from a standard file
# creating databases as needed

DBLIST=$(mysql -e "SHOW DATABASES;" | tail -n +2)
select DB in $DBLIST "new..."
do
    if [[ $DB == "new..." ]]
    then
        printf "%b" "name for new db: "
        read DB rest
        echo creating new database $DB
        mysql -e "CREATE DATABASE IF NOT EXISTS $DB;"
    fi

    if [ -n "$DB" ]
    then
        echo Initializing database: $DB
        mysql $DB < ourInit.sql
    fi
done

Discussion

The tail -n +2 is added to remove the heading from the list of databases (see Recipe 2.12).

The select creates the menus showing the existing databases. We added the literal "new..." as an additional choice (see Recipe 3.7 and Recipe 6.16).

When the user wants to create a new database, we prompt for and read a new name, but we use two fields in the read statement as a bit of error handling. If the user types more than one name on the line, we only use the first name—it gets put into the variable $DB while the rest of the input is put into $rest and ignored. (We could add an error check to see if $rest is null.)

Whether created anew or chosen from the list of extant databases, if the $DB variable is not empty, it will invoke mysql one more time to feed it the set of SQL statements that we’ve put into the file ourInit.sql as our standardized initialization sequence.

If you’re going to use a script like this, you might need to add parameters to your mysql command, such as -u to specify a username and -p to prompt for a password. It will depend on how your database and its permissions are configured, or whether you have a file named .my.cnf with your MySQL defaults.

We could also have added an error check after the creation of the new database to see if it succeeded; if it did not succeed, we could unset DB, thereby bypassing the initialization. However, as many a math textbook has said, “we leave that as an exercise for the reader.”
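
For readers who would like a head start on that exercise, here is one rough shape it could take. So that the sketch can run without a database server, it defines a stand-in function named mysql that merely pretends to fail for names containing a dash; that stand-in and the helper create_db are entirely our inventions and say nothing about how the real client behaves. Delete the stand-in to try it against a real server.

```bash
#!/usr/bin/env bash
# Sketch: check whether the CREATE succeeded and unset DB if it did not.

# Stand-in for the real client, ONLY so this sketch runs anywhere.
# It fails when the SQL text contains a dash; the real mysql does not
# work this way.
mysql() {
    case $2 in
        *-*) return 1 ;;
    esac
    return 0
}

create_db() {    # hypothetical helper
    DB=$1
    if mysql -e "CREATE DATABASE IF NOT EXISTS $DB;"
    then
        echo "created database $DB"
    else
        echo "could not create database $DB" >&2
        unset DB     # the later [ -n "$DB" ] test will skip initialization
    fi
}

create_db testsuite1
create_db bad-name
```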

13.13 Isolating Specific Fields in Data

Problem

You need to extract one or more fields from each line of output.

Solution

Use cut if there are delimiters you can easily pick out, even if they are different for the beginning and end of the field you need:

# Here's an easy one - what users, home directories and shells do
# we have on this NetBSD system?
$ cut -d':' -f1,6,7 /etc/passwd
root:/root:/bin/csh
toor:/root:/bin/sh
daemon:/:/sbin/nologin
operator:/usr/guest/operator:/sbin/nologin
bin:/:/sbin/nologin
games:/usr/games:/sbin/nologin
postfix:/var/spool/postfix:/sbin/nologin
named:/var/chroot/named:/sbin/nologin
ntpd:/var/chroot/ntpd:/sbin/nologin
sshd:/var/chroot/sshd:/sbin/nologin
smmsp:/nonexistent:/sbin/nologin
uucp:/var/spool/uucppublic:/usr/libexec/uucp/uucico
nobody:/nonexistent:/sbin/nologin
jp:/home/jp:/usr/pkg/bin/bash

# What is the most popular shell on the system?
$ cut -d':' -f7 /etc/passwd | sort | uniq -c | sort -rn
10 /sbin/nologin
2 /usr/pkg/bin/bash
1 /bin/csh
1 /bin/sh
1 /usr/libexec/uucp/uucico

# Now let's see the first two directory levels
$ cut -d':' -f6 /etc/passwd | cut -d'/' -f1-3 | sort -u
/
/home/jp
/nonexistent
/root
/usr/games
/usr/guest
/var/chroot
/var/spool

Use awk to split on multiples of whitespace, or if you need to rearrange the order of the output fields. Note the → denotes a tab character in the output. The default output separator is a space, but you can change that by setting awk’s OFS variable:

# Users, home directories, and shells, but swap the last two
# and use a tab delimiter
$ awk 'BEGIN {FS=":"; OFS="\t"; } { print $1,$7,$6; }' /etc/passwd
root → /bin/csh → /root
toor → /bin/sh → /root
daemon → /sbin/nologin → /
operator → /sbin/nologin → /usr/guest/operator
bin → /sbin/nologin → /
games → /sbin/nologin → /usr/games
postfix → /sbin/nologin → /var/spool/postfix
named → /sbin/nologin → /var/chroot/named
ntpd → /sbin/nologin → /var/chroot/ntpd
sshd → /sbin/nologin → /var/chroot/sshd
smmsp → /sbin/nologin → /nonexistent
uucp → /usr/libexec/uucp/uucico → /var/spool/uucppublic
nobody → /sbin/nologin → /nonexistent
jp → /usr/pkg/bin/bash → /home/jp

# Multiples of whitespace and swapped, first field removed
$ grep '^# [1-9]' /etc/hosts | awk '{print $3,$2}'
10.255.255.255 10.0.0.0
172.31.255.255 172.16.0.0
192.168.255.255 192.168.0.0
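For small inputs, the same kind of field isolation can also be sketched in pure bash by setting IFS for a read statement. The function name user_shell_home and the sample records below are made up for illustration; they are not from the original recipe.

```bash
#!/usr/bin/env bash
# Print user, shell, and home directory (fields 1, 7, 6) from
# passwd-style lines on stdin, separated by tabs. Pure bash.
user_shell_home() {
    local user pw uid gid gecos home shell
    while IFS=: read -r user pw uid gid gecos home shell; do
        printf '%s\t%s\t%s\n' "$user" "$shell" "$home"
    done
}

# Sample data in /etc/passwd format (invented for this sketch)
user_shell_home <<'EOF'
root:*:0:0:Charlie &:/root:/bin/csh
jp:*:1000:100:JP:/home/jp:/usr/pkg/bin/bash
EOF
```

This avoids starting an external process, but it is much slower than cut or awk on large files, so treat it as a convenience for short inputs.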

Use grep -o to display just the part that matched your pattern. This is particularly handy when you can’t express delimiters in a way that lends itself to the solutions shown here. For example, say you need to extract all IP addresses from a file, no matter where they are. Note we use egrep because the pattern uses extended regular expression (regex) syntax, but -o should work with whichever GNU grep flavor you use (it is probably not supported on non-GNU versions; check your documentation):

$ cat has_ipas
This is line 1 with 1 IPA: 10.10.10.10
Line 2 has 2; they are 10.10.10.11 and 10.10.10.12.
Line three is ftp_server=10.10.10.13:21.

$ egrep -o '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' has_ipas
10.10.10.10
10.10.10.11
10.10.10.12
10.10.10.13

Discussion

The possibilities are endless, and we haven’t even scratched the surface here. This is the very essence of what the Unix toolchain idea is all about: take a number of small tools that do one thing well and combine them as needed to solve problems.

Also, the regex we used for IP addresses is naive and could match other things, including invalid addresses. For a much better pattern, use the Perl Compatible Regular Expressions (PCRE) regex from Mastering Regular Expressions, 3rd Edition, by Jeffrey E. F. Friedl (O’Reilly), if your grep supports -P:

$ grep -oP '([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])' has_ipas
10.10.10.10
10.10.10.11
10.10.10.12
10.10.10.13
$

Or use Perl:

$ perl -ne 'while ( m/([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])/g ) { print qq($1.$2.$3.$4\n); }' has_ipas
10.10.10.10
10.10.10.11
10.10.10.12
10.10.10.13
$
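If neither grep -P nor Perl is available, one possible approach is to keep the naive pattern for finding candidates and then range-check each octet in bash. The following is a sketch under that assumption; is_valid_ipv4 is a hypothetical helper name, not something from the original recipe.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if $1 is a dotted quad whose four
# octets are each in the range 0-255.
is_valid_ipv4() {
    local ip=$1 octet
    local re='^([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})$'
    [[ $ip =~ $re ]] || return 1
    for octet in "${BASH_REMATCH[@]:1}"; do
        # 10# forces base 10 so a leading zero is not treated as octal
        (( 10#$octet <= 255 )) || return 1
    done
    return 0
}

# Try it on a few candidates
for candidate in 10.10.10.13 256.1.1.1 1.2.3; do
    if is_valid_ipv4 "$candidate"; then
        echo "valid:   $candidate"
    else
        echo "invalid: $candidate"
    fi
done
```

Doing the numeric comparison in arithmetic keeps the regex short and readable, at the cost of one extra step per candidate.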

13.14 Updating Specific Fields in Datafiles

Problem

You need to extract certain parts (fields) of a line (record) and update them.

Solution

In the simple case, you want to extract a single field from a line, then perform some operation on it. For that, you can use cut or awk. See Recipe 13.13 for details.

For the more complicated case, you need to modify a field in a datafile without extracting it. If it’s a simple search and replace, use sed.

For example, let’s switch everyone from csh to sh on this NetBSD system:

$ grep csh /etc/passwd
root:*:0:0:Charlie &:/root:/bin/csh

$ sed 's;/csh$;/sh;' /etc/passwd | grep '^root'
root:*:0:0:Charlie &:/root:/bin/sh

You can use awk if you need to do arithmetic on a field or modify a string only in a certain field:

$ cat data_file
Line 1 ends
Line 2 ends
Line 3 ends
Line 4 ends
Line 5 ends

$ awk '{print $1, $2+5, $3}' data_file
Line 6 ends
Line 7 ends
Line 8 ends
Line 9 ends
Line 10 ends

# If the second field contains '3', change it to '8' and mark it
$ awk '{ if ($2 == "3") print $1, $2+5, $3, "Tweaked" ; else print $0; }' data_file
Line 1 ends
Line 2 ends
Line 8 ends Tweaked
Line 4 ends
Line 5 ends
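The same idea applies to delimited data such as /etc/passwd: by comparing and assigning a single field, awk can change a string only where it appears in that field. The sketch below uses a hypothetical function name, swap_shell, and invented sample records; note that the second record mentions csh in other fields and is left alone.

```bash
#!/usr/bin/env bash
# Replace the login shell (field 7) when it is exactly $1, writing $2
# in its place. Reads passwd-style lines on stdin.
swap_shell() {
    awk -F: -v OFS=: -v old="$1" -v new="$2" \
        '$7 == old { $7 = new } { print }'
}

# Sample data (invented for this sketch)
swap_shell /bin/csh /bin/sh <<'EOF'
root:*:0:0:Charlie &:/root:/bin/csh
csh:*:1001:100:User named csh:/home/csh:/bin/bash
EOF
```

Setting OFS to the same character as the input separator matters here, because assigning to $7 makes awk rebuild the record using OFS.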

Discussion

The possibilities here are as endless as your data, but hopefully these examples will give you enough of a start to easily modify your data.

13.15 Trimming Whitespace

Problem

You need to trim leading and/or trailing whitespace from lines or fields of data.

Solution

These solutions rely on a bash-specific treatment of read and $REPLY. See the end of the discussion for an alternate solution.

First, we’ll show a file with some leading and trailing whitespace. Note that we add ~~ to show the whitespace, and that the → denotes a literal tab character in the output:

# Show the whitespace in our sample file
$ while read; do echo ~~"$REPLY"~~; done < whitespace
~~ This line has leading spaces.~~
~~This line has trailing spaces. ~~
~~ This line has both leading and trailing spaces. ~~
~~ → Leading tab.~~
~~Trailing tab. → ~~
~~ → Leading and trailing tab. → ~~
~~ → Leading mixed whitespace.~~
~~Trailing mixed whitespace. →     ~~
~~ → Leading and trailing mixed whitespace. →     ~~
$

To trim both leading and trailing whitespace, use $IFS and the builtin $REPLY variable (see the discussion for why this works):

$ while read REPLY; do echo ~~"$REPLY"~~; done < whitespace
~~This line has leading spaces.~~
~~This line has trailing spaces.~~
~~This line has both leading and trailing spaces.~~
~~Leading tab.~~
~~Trailing tab.~~
~~Leading and trailing tab.~~
~~Leading mixed whitespace.~~
~~Trailing mixed whitespace.~~
~~Leading and trailing mixed whitespace.~~
$

To trim only leading or only trailing spaces, use a simple pattern match:

# Leading spaces only
$ while read; do echo "~~${REPLY## }~~"; done < whitespace
~~This line has leading spaces.~~
~~This line has trailing spaces. ~~
~~This line has both leading and trailing spaces. ~~
~~ → Leading tab.~~
~~Trailing tab. ~~
~~ → Leading and trailing tab. → ~~
~~ → Leading mixed whitespace.~~
~~Trailing mixed whitespace. →     ~~
~~ → Leading and trailing mixed whitespace. →     ~~

# Trailing spaces only
$ while read; do echo "~~${REPLY%% }~~"; done < whitespace
~~ This line has leading spaces.~~
~~This line has trailing spaces.~~
~~ This line has both leading and trailing spaces.~~
~~ → Leading tab.~~
~~Trailing tab. ~~
~~ → Leading and trailing tab. → ~~
~~     → Leading mixed whitespace.~~
~~Trailing mixed whitespace. →     ~~
~~     → Leading and trailing mixed whitespace. →     ~~

Trimming only leading or only trailing whitespace (including tabs) is a bit more complicated:

# You need this either way
$ shopt -s extglob

# Leading whitespace only
$ while read; do echo "~~${REPLY##+([[:space:]])}~~"; done < whitespace
~~This line has leading spaces.~~
~~This line has trailing spaces. ~~
~~This line has both leading and trailing spaces. ~~
~~Leading tab.~~
~~Trailing tab. ~~
~~Leading and trailing tab. → ~~
~~Leading mixed whitespace.~~
~~Trailing mixed whitespace.     → ~~
~~Leading and trailing mixed whitespace.     → ~~
$

# Trailing whitespace only
$ while read; do echo "~~${REPLY%%+([[:space:]])}~~"; done < whitespace
~~ This line has leading spaces.~~
~~This line has trailing spaces.~~
~~ This line has both leading and trailing spaces.~~
~~ → Leading tab.~~
~~Trailing tab.~~
~~ → Leading and trailing tab.~~
~~     → Leading mixed whitespace.~~
~~Trailing mixed whitespace.~~
~~    → Leading and trailing mixed whitespace.~~

Discussion

OK, at this point you are probably looking at these lines and wondering how we’re going to make this comprehensible. It turns out there’s a simple, if subtle, explanation.

Here we go. The first example used the default $REPLY variable that read uses when you do not supply your own variable name(s). Chet Ramey (maintainer of bash) made a design decision to “[if] there are no variables, save the text of the line read to the variable $REPLY,” unchanged:

while read; do echo ~~"$REPLY"~~; done < whitespace

But when we supply one or more variable names to read, it does parse the input, using the values in $IFS (which are space, tab, and newline by default). One step of that parsing process is to trim leading and trailing whitespace—just what we want:

while read REPLY; do echo ~~"$REPLY"~~; done < whitespace

Trimming leading or trailing spaces (but not both) is easy using the ${##} or ${%%} operators (see Recipe 6.7):

while read; do echo "~~${REPLY## }~~"; done < whitespace
while read; do echo "~~${REPLY%% }~~"; done < whitespace

Covering tabs is a little harder. If we had only tabs, we could use the ${##} or ${%%} operators and insert literal tabs using the Ctrl-V Ctrl-I key sequence. But that’s risky since it’s probable there’s a mix of spaces and tabs, and some text editors or unwary users may strip out the tabs. So, we turn on extended globbing and use a character class to make our intent clear. The [:space:] character class would work without extglob, but we need to say “one or more occurrences” using +() or else it will trim a single space or tab, but not multiples of both on the same line. If you only care about space or tab, you could use the [:blank:] character class instead, since [:space:] includes other characters like the vertical tab (\v) and DOS CR (carriage return, \r):

# This works, need extglob for +() part
$ shopt -s extglob
...
$ while read; do echo "~~${REPLY##+([[:space:]])}~~"; done < whitespace
...
$ while read; do echo "~~${REPLY%%+([[:space:]])}~~"; done < whitespace
...

# This doesn't
$ while read; do echo "~~${REPLY##[[:space:]]}~~"; done < whitespace
~~This line has leading spaces.~~
~~This line has trailing spaces. ~~
~~This line has both leading and trailing spaces. ~~
~~Leading tab.~~
~~Trailing tab. ~~
~~Leading and trailing tab. ~~
~~    → Leading mixed whitespace.~~
~~Trailing mixed whitespace.     → ~~
~~    → Leading and trailing mixed whitespace.     → ~~

Here’s a different take, exploiting the same $IFS parsing, but to parse out fields (or words) instead of records (or lines):

$ for i in $(cat white_space); do echo ~~$i~~; done
~~This~~
~~line~~
~~has~~
~~leading~~
~~white~~
~~space.~~
~~This~~
~~line~~
~~has~~
~~trailing~~
~~white~~
~~space.~~
~~This~~
~~line~~
~~has~~
~~both~~
~~leading~~
~~and~~
~~trailing~~
~~white~~
~~space.~~
$

Finally, although the original solutions rely on Chet’s design decision about read and $REPLY, this solution does not:

shopt -s extglob

while IFS= read -r line; do
echo "None: ~~$line~~" # preserve all whitespaces
echo "Ld: ~~${line##+([[:space:]])}~~" # trim leading whitespace
echo "Tr: ~~${line%%+([[:space:]])}~~" # trim trailing whitespace
line="${line##+([[:space:]])}" # trim leading and...
line="${line%%+([[:space:]])}" # ...trailing whitespace
echo "All: ~~$line~~" # Show all trimmed
done < whitespace
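If you need this in more than one place, the two expansions can be wrapped in a small function. This is a minimal sketch; the name trim is our own choice and not part of the recipe.

```bash
#!/usr/bin/env bash
shopt -s extglob    # needed for the +() patterns used below

# Hypothetical helper: print $1 with leading and trailing whitespace
# (spaces, tabs, and the rest of [:space:]) removed.
trim() {
    local s=$1
    s=${s##+([[:space:]])}    # strip leading whitespace
    s=${s%%+([[:space:]])}    # strip trailing whitespace
    printf '%s\n' "$s"
}

# Interior whitespace is left untouched
echo "~~$(trim $'  \t both ends \t  ')~~"
```

Because the value is passed as an argument rather than read from a file, this works on a single field as well as on a whole line.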

13.16 Compressing Whitespace

Problem

You have runs of whitespace in a file (perhaps it is fixed-length, space-padded) and you need to compress the spaces down to a single character or delimiter.

Solution

Use tr or awk as appropriate.

Discussion

If you are trying to compress runs of whitespace down to a single character, you can use tr, but be aware that you may damage the file if it is not well formed. For example, if fields are delimited by multiple whitespace characters but internally have spaces, compressing multiple spaces down to one space will remove that distinction. Imagine if the _ characters in the following example were spaces instead. Note the → denotes a literal tab character in the output:

$ cat data_file
Header1             Header2              Header3
Rec1_Field1         Rec1_Field2          Rec1_Field3
Rec2_Field1         Rec2_Field2          Rec2_Field3
Rec3_Field1         Rec3_Field2          Rec3_Field3

$ cat data_file | tr -s ' ' '\t'
Header1 → Header2 → Header3
Rec1_Field1 → Rec1_Field2 → Rec1_Field3
Rec2_Field1 → Rec2_Field2 → Rec2_Field3
Rec3_Field1 → Rec3_Field2 → Rec3_Field3

If your field delimiter is more than a single character, tr won’t work since it translates single characters from its first set into the matching single character in the second set. You can use awk to combine or convert field separators. awk’s internal field separator FS accepts regular expressions, so you can separate on pretty much anything. There is a handy trick to this as well: an assignment to any field causes awk to reassemble the record using the output field separator, OFS, so assigning field 1 to itself and then printing the record has the effect of translating FS to OFS without you having to worry about how many fields there are in the data.

In this example, multiple spaces delimit fields, but fields also have internal spaces, so the more simple case of:

awk 'BEGIN {OFS="\t"} {$1=$1; print }' data_file1

won’t work. Here is a datafile:

$ cat data_file1
Header1             Header2              Header3
Rec1 Field1         Rec1 Field2          Rec1 Field3
Rec2 Field1         Rec2 Field2          Rec2 Field3
Rec3 Field1         Rec3 Field2          Rec3 Field
$

In the next example, we assign two spaces to FS and the tab to OFS. We then make an assignment ($1 = $1) so awk rebuilds the record, but that results in strings of tabs replacing the double spaces, so we use gsub to squash the tabs, then we print. Note the → denotes a literal tab character in the output. The output is a little hard to read, so there is a hex dump as well. Recall that ASCII tab is 09 while ASCII space is 20:

$ awk 'BEGIN { FS = "  "; OFS = "\t" } { $1 = $1; gsub(/\t+ ?/, "\t"); print }' \
> data_file1
Header1 → Header2 → Header3
Rec1 Field1 → Rec1 Field2 → Rec1 Field3
Rec2 Field1 → Rec2 Field2 → Rec2 Field3
Rec3 Field1 → Rec3 Field2 → Rec3 Field3

$ awk 'BEGIN { FS = "  "; OFS = "\t" } { $1 = $1; gsub(/\t+ ?/, "\t"); print }' \
> data_file1 | hexdump -C
00000000 48 65 61 64 65 72 31 09  48 65 61 64 65 72 32 09 |Header1.Header2.|
00000010 48 65 61 64 65 72 33 0a  52 65 63 31 20 46 69 65 |Header3.Rec1 Fie|
00000020 6c 64 31 09 52 65 63 31  20 46 69 65 6c 64 32 09 |ld1.Rec1 Field2.|
00000030 52 65 63 31 20 46 69 65  6c 64 33 0a 52 65 63 32 |Rec1 Field3.Rec2|
00000040 20 46 69 65 6c 64 31 09  52 65 63 32 20 46 69 65 | Field1.Rec2 Fie|
00000050 6c 64 32 09 52 65 63 32  20 46 69 65 6c 64 33 0a |ld2.Rec2 Field3.|
00000060 52 65 63 33 20 46 69 65  6c 64 31 09 52 65 63 33 |Rec3 Field1.Rec3|
00000070 20 46 69 65 6c 64 32 09  52 65 63 33 20 46 69 65 | Field2.Rec3 Fie|
00000080 6c 64 0a                                         |ld.|
00000083

You can use awk to trim leading and trailing whitespace in the same way, but as noted previously, this will replace your field separators unless they are already spaces:

awk '{ $1 = $1; print }' white_space
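The same effect can be sketched in pure bash by letting read split each line into words and then joining them with single spaces. The function name squeeze_ws is hypothetical. As with the awk one-liner, any original separators (including tabs) are replaced by single spaces, so this only suits data whose fields have no internal whitespace.

```bash
#!/usr/bin/env bash
# Hypothetical helper: for each line on stdin, trim leading and
# trailing whitespace and compress interior runs to a single space.
squeeze_ws() {
    local line words
    while IFS= read -r line || [[ -n $line ]]; do
        read -ra words <<< "$line"      # split on default $IFS
        printf '%s\n' "${words[*]}"     # join with single spaces
    done
}

printf '  Header1 \t  Header2     Header3  \n' | squeeze_ws
```

The expansion "${words[*]}" joins the array elements with the first character of $IFS, which is a space by default.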

13.17 Processing Fixed-Length Records

Problem

You need to read and process data that is in a fixed-length (also called fixed-width) form.

Solution

Use Perl or gawk 2.13 or greater. Given a file like:

$ cat fixed-length_file
Header1-----------Header2-------------------------Header3---------
Rec1 Field1       Rec1 Field2                     Rec1 Field3
Rec2 Field1       Rec2 Field2                     Rec2 Field3
Rec3 Field1       Rec3 Field2                     Rec3 Field3

you can process it using GNU’s gawk, by setting FIELDWIDTHS to the correct field lengths, setting OFS as desired, and making an assignment so gawk rebuilds the record using this OFS. However, gawk does not remove the spaces used in padding the original record, so we use two gsubs to do that, one for all the internal fields and the other for the last field in each record. Finally, we just print. Note the → denotes a literal tab character in the output. The output is a little hard to read, so there is a hex dump as well. Recall that ASCII tab is 09 while ASCII space is 20:

$ gawk 'BEGIN { FIELDWIDTHS = "18 32 16"; OFS = "\t" }
> { $1 = $1; gsub(/ +\t/, "\t"); gsub(/ +$/, ""); print }' fixed-length_file
Header1----------- → Header2------------------------- → Header3---------
Rec1 Field1 → Rec1 Field2 → Rec1 Field3
Rec2 Field1 → Rec2 Field2 → Rec2 Field3
Rec3 Field1 → Rec3 Field2 → Rec3 Field3

$ gawk 'BEGIN { FIELDWIDTHS = "18 32 16"; OFS = "\t" }
> { $1 = $1; gsub(/ +\t/, "\t"); gsub(/ +$/, ""); print }' fixed-length_file \
> | hexdump -C
00000000 48 65 61 64 65 72 31 2d 2d 2d 2d 2d 2d 2d 2d 2d |Header1---------|
00000010 2d 2d 09 48 65 61 64 65 72 32 2d 2d 2d 2d 2d 2d |--.Header2------|
00000020 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d |----------------|
00000030 2d 2d 2d 09 48 65 61 64 65 72 33 2d 2d 2d 2d 2d |---.Header3-----|
00000040 2d 2d 2d 2d 0a 52 65 63 31 20 46 69 65 6c 64 31 |----.Rec1 Field1|
00000050 09 52 65 63 31 20 46 69 65 6c 64 32 09 52 65 63 |.Rec1 Field2.Rec|
00000060 31 20 46 69 65 6c 64 33 0a 52 65 63 32 20 46 69 |1 Field3.Rec2 Fi|
00000070 65 6c 64 31 09 52 65 63 32 20 46 69 65 6c 64 32 |eld1.Rec2 Field2|
00000080 09 52 65 63 32 20 46 69 65 6c 64 33 0a 52 65 63 |.Rec2 Field3.Rec|
00000090 33 20 46 69 65 6c 64 31 09 52 65 63 33 20 46 69 |3 Field1.Rec3 Fi|
000000a0 65 6c 64 32 09 52 65 63 33 20 46 69 65 6c 64 33 |eld2.Rec3 Field3|
000000b0 0a                                              |.|
000000b1

If you don’t have gawk, you can use Perl, which is more straightforward anyway. We use a nonprinting while input loop (-n), unpack each record ($_) as it’s read, and turn the resulting list back into a scalar by joining the elements with a tab. We then print each record, adding a newline at the end:

$ perl -ne 'print join("\t", unpack("A18 A32 A16", $_) ) . "\n";' \
> fixed-length_file
Header1----------- → Header2------------------------- → Header3---------
Rec1 Field1 → Rec1 Field2 → Rec1 Field3
Rec2 Field1 → Rec2 Field2 → Rec2 Field3
Rec3 Field1 → Rec3 Field2 → Rec3 Field3
$ perl -ne 'print join("\t", unpack("A18 A32 A16", $_) ) . "\n";' \
> fixed-length_file |
> hexdump -C
00000000 48 65 61 64 65 72 31 2d 2d 2d 2d 2d 2d 2d 2d 2d |Header1---------|
00000010 2d 2d 09 48 65 61 64 65 72 32 2d 2d 2d 2d 2d 2d |--.Header2------|
00000020 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d |----------------|
00000030 2d 2d 2d 09 48 65 61 64 65 72 33 2d 2d 2d 2d 2d |---.Header3-----|
00000040 2d 2d 2d 2d 0a 52 65 63 31 20 46 69 65 6c 64 31 |----.Rec1 Field1|
00000050 09 52 65 63 31 20 46 69 65 6c 64 32 09 52 65 63 |.Rec1 Field2.Rec|
00000060 31 20 46 69 65 6c 64 33 0a 52 65 63 32 20 46 69 |1 Field3.Rec2 Fi|
00000070 65 6c 64 31 09 52 65 63 32 20 46 69 65 6c 64 32 |eld1.Rec2 Field2|
00000080 09 52 65 63 32 20 46 69 65 6c 64 33 0a 52 65 63 |.Rec2 Field3.Rec|
00000090 33 20 46 69 65 6c 64 31 09 52 65 63 33 20 46 69 |3 Field1.Rec3 Fi|
000000a0 65 6c 64 32 09 52 65 63 33 20 46 69 65 6c 64 33 |eld2.Rec3 Field3|
000000b0 0a                                              |.|
000000b1

See the Perl documentation for the pack and unpack template formats.
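If neither gawk nor Perl is at hand, bash substring expansion can do the same job for modest amounts of data. The following is a sketch that assumes the same 18, 32, and 16 column widths as the sample file; the function name fixed_to_tsv is hypothetical.

```bash
#!/usr/bin/env bash
shopt -s extglob    # needed for the +( ) pattern used below

# Split each line of stdin into fields of width 18, 32, and 16,
# strip the trailing space padding, and print them tab-separated.
fixed_to_tsv() {
    local line f1 f2 f3
    while IFS= read -r line; do
        f1=${line:0:18}
        f2=${line:18:32}
        f3=${line:50:16}
        printf '%s\t%s\t%s\n' \
            "${f1%%+( )}" "${f2%%+( )}" "${f3%%+( )}"
    done
}

# Build one padded record to try it out
printf '%-18s%-32s%-16s\n' 'Rec1 Field1' 'Rec1 Field2' 'Rec1 Field3' |
    fixed_to_tsv
```

The offsets are the running totals of the widths (0, 18, and 50), so they have to be recalculated by hand if the record layout changes.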

Discussion

Anyone with any Unix background will automatically use some kind of delimiter in output, since the textutils toolchain is never far from mind, so fixed-length (a.k.a. fixed-width) records are rare in the Unix world. They are very common in the mainframe world, however, so they will occasionally crop up in large applications that originated on big iron, such as some applications from SAP. As we’ve just seen, it’s no problem to handle them.

One caveat to this recipe is that it requires each record to end in a newline. Many old mainframe record formats don’t, in which case you can use Recipe 13.18 to add newlines to the end of each record before processing.

13.18 Processing Files with No Line Breaks

Problem

You have a large file with no line breaks, and you need to process it.

Solution

Preprocess the file and add line breaks in appropriate places. For example, OpenOffice’s OpenDocument Format (ODF) files are basically zipped XML files. It is possible to unzip them and grep the XML, which we did a lot while writing this book. See Recipe 12.5 for a more comprehensive treatment of ODF files. In this example, we insert a newline after every closing angle bracket (>). That makes it much easier to process the file using grep or other textutils. Note that we must enter a backslash followed immediately by the Enter key to embed an escaped newline in the sed script:

$ wc -l content.xml
1 content.xml

$ sed -e 's/>/>\
> /g' content.xml | wc -l
1687

If you have fixed-length records with no newlines, do this instead, where 48 is the length of the record:

$ cat fixed-length
Line_1_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_2__
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_3__
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_4__
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_5__
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_6__
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_7__
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_8__
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_9__
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_10_
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_11_
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_12_
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ

$ wc -l fixed-length
  1 fixed-length

$ sed 's/.\{48\}/&\
> /g;' fixed-length
Line_1__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_2__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_3__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_4__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_5__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_6__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_7__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_8__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_9__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_10_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_11_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_12_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ

$ perl -pe 's/(.{48})/$1\n/g;' fixed-length
Line_1__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_2__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_3__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_4__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_5__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_6__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_7__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_8__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_9__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_10_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_11_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_12_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ

Discussion

This happens often when people create output programmatically, especially using canned modules and especially with HTML or XML output.

Note the sed substitutions have an odd construct that allows an embedded newline. In sed, an unescaped ampersand (&) on the righthand side (RHS) of a substitution is replaced by the entire expression matched on the lefthand side (LHS), and the trailing \ on the first line escapes the newline so you don’t get an error like “sed: -e expression #1, char 11: unterminated ‘s’”. This is needed because sed does not portably recognize \n as a metacharacter on the RHS of s/// (GNU sed does, but other versions may not).
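For the fixed-length case, the standard fold utility is another option that avoids the embedded-newline construct altogether. This is a sketch rather than part of the original recipe; split_records is a hypothetical wrapper name, and the -b option makes fold count bytes rather than display columns.

```bash
#!/usr/bin/env bash
# Break a stream with no line breaks into records of a fixed length.
# $1 is the record length in bytes (48 in the example above).
split_records() {
    fold -b -w "$1"
}

# Three 4-byte records run together, with no newlines in the input
printf '%s' 'AAAABBBBCCCC' | split_records 4
echo    # the last record has no newline of its own when the input had none
```

Unlike the sed version, fold only inserts breaks between records, so an input without a final newline produces output without one too.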

13.19 Converting a Datafile to CSV

Problem

You have a data file that you need to convert to a Comma Separated Values (CSV) file.

Solution

Use awk to convert the data into CSV format:

$ awk 'BEGIN { FS="\t"; OFS="\",\"" } { gsub(/"/, "\"\""); $1 = $1;
> printf "\"%s\"\n", $0}' tab_delimited
"Line 1","Field 2","Field 3","Field 4","Field 5 with ""internal"" double-quotes"
"Line 2","Field 2","Field 3","Field 4","Field 5 with ""internal"" double-quotes"
"Line 3","Field 2","Field 3","Field 4","Field 5 with ""internal"" double-quotes"
"Line 4","Field 2","Field 3","Field 4","Field 5 with ""internal"" double-quotes"
$

You can do the same thing in Perl also:

$ perl -naF'\t' -e 'chomp @F; s/"/""/g for @F; print q(").join(q(","),@F)
> .qq("\n);' tab_delimited
"Line 1","Field 2","Field 3","Field 4","Field 5 with ""internal"" double-quotes"
"Line 2","Field 2","Field 3","Field 4","Field 5 with ""internal"" double-quotes"
"Line 3","Field 2","Field 3","Field 4","Field 5 with ""internal"" double-quotes"
"Line 4","Field 2","Field 3","Field 4","Field 5 with ""internal"" double-quotes"
$

Discussion

First of all, it’s tricky to define exactly what CSV really means. There is no formal specification, and various vendors have implemented various versions. Our version here is very simple, and should hopefully work just about anywhere. We place double quotes around all fields (some implementations only quote strings, or strings with internal commas), and we double internal double quotes.

To do that, we have awk split up the input fields using a tab as the field separator and set the output field separator (OFS) to ",", which will provide the trailing quote for each field and then the leading quote for the next field as well as the comma in between them. We then globally replace any double quotes with two double quotes, make an assignment so awk rebuilds the record with our specified OFS (see the awk trick in Recipe 13.16), and print out the record with leading and trailing double quotes. We have to escape double quotes in several places, which looks a little cluttered, but otherwise this is very straightforward.
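When the fields are already in shell variables rather than in a file, the same quoting rules can be sketched directly in bash. The function name csv_record is hypothetical; it quotes every field and doubles internal double quotes, as described above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the arguments as one CSV record, with
# every field quoted and embedded double quotes doubled.
csv_record() {
    local field out='' sep=''
    for field in "$@"; do
        field=${field//\"/\"\"}     # double any internal double quotes
        out+="${sep}\"${field}\""   # wrap the field in double quotes
        sep=','
    done
    printf '%s\n' "$out"
}

csv_record 'Line 1' 'Field 2' 'Field 5 with "internal" double-quotes'
```

Like the awk version, this makes no attempt to handle fields that contain newlines.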

13.20 Parsing a CSV Datafile

Problem

You have a Comma Separated Values datafile that you need to parse.

Solution

Unlike the previous recipe for converting to CSV, there is no easy way to do this, since it’s tricky to define exactly what CSV really means.

Possible solutions for you to explore are:

sed: http://sed.sourceforge.net/sedfaq4.html#s4.12.
awk: http://lorance.freeshell.org/csv/.
Perl: Mastering Regular Expressions, 3rd Edition, by Jeffrey E. F. Friedl (O’Reilly) has a regex to do this; see also CPAN, the Comprehensive Perl Archive Network, for various modules.
Load the CSV file into a spreadsheet (LibreOffice’s Calc and Microsoft’s Excel both work), then copy and paste the contents into a text editor; you should get tab-delimited output that you can now use easily.

Discussion

As noted in Recipe 13.19, there is no formal specification for CSV, and that fact, combined with data variations, makes this task much harder than it sounds.
