This is a chapter of tasks that programmers might recognize. The recipes here aren’t necessarily more advanced than the other bash script recipes in the book, but if you are not a programmer, these tasks might seem obscure or irrelevant to your use of bash. We won’t do much explaining of the reasons why you’d find yourself in these situations (as a programmer, you’ll recognize some if not all of them). Even if you don’t recognize the situations, though, you should read them for what you can learn about bash.
Some of the recipes in this chapter include the parsing of command-line arguments. Recall that the typical way to specify options on a shell script is to have a leading minus sign and a single letter. For example, an option for your script to give fewer messages might use -q as a flag to mean quiet mode. Sometimes an option might take an argument. For example, a user option where you need to specify a username might use -u followed by the username. This distinction will be made clear in this chapter’s first recipe.
Some Linux commands also allow long-form options. Using the previous example of a short-format -u option, a command might also support a long format like --user=username. We will not be showing any long-format options, though they could be used for some of the techniques that we show. The best way to parse long arguments is to use the getopt (note no s) command.
You want to have some options on your shell script, some flags that users can use to alter its behavior. You could do the parsing directly, using ${#} to tell you how many arguments have been supplied and using ${1:0:1} to test the first character of the first argument to see if it is a minus sign. You would need some if/then or case logic to identify which option it is and whether it takes an argument, though. And what if the user doesn’t supply a required argument, or calls your script with two options combined (e.g., -ab)? Will you also parse for that? The need to parse options for a shell script is a common situation. Lots of scripts have options. Isn’t there a more standard way to do this?
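As a rough illustration of the do-it-yourself approach described above, the following sketch checks only whether the first argument looks like an option. The function name looks_like_option is made up for this example and is not a bash builtin; a real parser would still need all of the extra logic the paragraph mentions.

```shell
#!/usr/bin/env bash
# Hand-rolled check of the first argument (illustrative only).
# looks_like_option is a hypothetical helper name.
function looks_like_option ()
{
    # ${#} is the number of arguments; ${1:0:1} is the first
    # character of the first argument
    if [ ${#} -gt 0 ] && [ "${1:0:1}" = "-" ]
    then
        echo "option"
    else
        echo "not an option"
    fi
}

looks_like_option -q          # prints: option
looks_like_option file.txt    # prints: not an option
```

Note that this says nothing about which option was given, whether it needs an argument, or how to handle combined options such as -ab.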
Use bash’s builtin getopts command to help parse options.
Example 13-1, based largely on the example in the manpage for getopts, illustrates.
#!/usr/bin/env bash
# cookbook filename: getopts_example
#
# using getopts
#
aflag=
bflag=
while getopts 'ab:' OPTION
do
    case $OPTION in
    a)  aflag=1
        ;;
    b)  bflag=1
        bval="$OPTARG"
        ;;
    ?)  printf "Usage: %s: [-a] [-b value] args\n" ${0##*/} >&2
        exit 2
        ;;
    esac
done
shift $(($OPTIND - 1))

if [ "$aflag" ]
then
    printf "Option -a specified\n"
fi
if [ "$bflag" ]
then
    printf 'Option -b "%s" specified\n' "$bval"
fi
printf "Remaining arguments are: %s\n" "$*"
There are two kinds of options supported here. The first and simpler kind is an option that stands alone. It typically represents a flag to modify a command’s behavior. An example of this sort of option is the -l option on the ls command. The second kind of option requires an argument. An example of this is the mysql command’s -u option, which requires that a username be supplied, as in mysql -u sysadmin. Let’s look at how getopts supports the parsing of both kinds.
getopts takes two arguments:
getopts 'ab:' OPTION
The first is a list of option letters. The second is the name of a shell variable. In our example we are defining -a and -b as the only two valid options, so the first argument to getopts has just those two letters…and a colon. What does the colon signify? It indicates that -b needs an argument, just like -u username or -f filename might be used. The colon needs to be adjacent to any option letter taking an argument. For example, if only -a took an argument we would need to write a:b instead.
The getopts builtin will set the variable named in the second argument to the value that it finds when it parses the shell script’s argument list ($1, $2, etc.). If it finds an argument with a leading minus sign, it will treat that as an option argument and put the letter into the given variable ($OPTION in our example). Then it returns true (i.e., 0) so that the while loop will process the option, then continue to parse options by repeated calls to getopts until it runs out of arguments (or encounters a double minus, --, which allows users to put an explicit end to the options). Then getopts returns false (i.e., a nonzero value) and the while loop ends.
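The loop behavior described above can be tried out in isolation. This is a minimal sketch, not the cookbook script: show_opts is a hypothetical function name, and OPTIND is declared local so that each call to the function starts parsing from the first argument.

```shell
#!/usr/bin/env bash
# show_opts is a hypothetical name used only for this sketch.
function show_opts ()
{
    local OPTION OPTARG OPTIND=1    # start parsing afresh on every call
    while getopts 'ab:' OPTION
    do
        case $OPTION in
        a)  echo "flag a" ;;
        b)  echo "b=$OPTARG" ;;
        esac
    done
    # getopts has returned false: drop what it consumed, show the rest
    shift $((OPTIND - 1))
    echo "rest: $*"
}

# the double minus ends option parsing, so the second -a is left alone
show_opts -a -b alt -- -a plow
```

The second -a is reported as a remaining argument rather than as a flag, because getopts stopped at the double minus.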
Inside the loop, when the parsing has found an option letter for processing, we use a case statement on the variable $OPTION to set flags or otherwise take action when the option is encountered. For options that take arguments, that argument is placed in the shell variable $OPTARG (a fixed name not related to our use of $OPTION as our variable). We need to save that value by assigning it to another variable because as the parsing continues to loop, the variable $OPTARG will be reset on each call to getopts.
The third case of our case statement is a question mark, a shell pattern that matches any single character. When getopts finds an option that is not in the set of expected options ('ab:' in our example), it will return a literal question mark in the variable ($OPTION in our example). So, we could have made our case statement read \?) or '?') for an exact match, but the ? as a pattern match of any single character provides a convenient default. It will match a literal question mark as well as matching any other single character.
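To see the difference between the escaped and unescaped question mark, here is a small sketch. The function name classify and its messages are invented for the illustration.

```shell
#!/usr/bin/env bash
# classify is a hypothetical name; it only demonstrates case patterns.
function classify ()
{
    case $1 in
    a)   echo "option a" ;;
    \?)  echo "literal question mark" ;;
    ?)   echo "some other single character" ;;
    *)   echo "anything else" ;;
    esac
}

classify a      # prints: option a
classify '?'    # prints: literal question mark
classify x      # prints: some other single character
classify xyz    # prints: anything else
```

Because case uses the first pattern that matches, the escaped form has to come before the bare ? if both are present.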
In the usage message that we print, we have made two changes from the example script in the manpage. First, we use ${0##*/} to give the name of the script without the pathname that may have been part of how it was invoked. Secondly, we redirect this message to standard error (>&2) because that is really where such messages belong. All of the error messages from getopts that occur when an unknown option or missing argument is encountered are written to standard error; we add our usage message to that chorus.
When the while loop terminates, we see the next line to be executed is:
shift $(($OPTIND - 1))
which is a shift statement used to move the positional parameters of the shell script from $1, $2, etc. down a given number of positions (tossing the lower ones). The variable $OPTIND is an index into the arguments that getopts uses to keep track of where it is when it parses. Once we are done parsing, we can toss all the options that we’ve processed by executing this shift statement. For example, if we had this command line:
myscript -a -b alt plow harvest reap
then after parsing for options, $OPTIND would be set to 4. By doing three ($OPTIND - 1) shifts we would get rid of the options, and then a quick echo $* would give this:
plow harvest reap
The remaining (nonoption) arguments will then be ready for use in our script (in a for loop, perhaps). In our example script, the last line is a printf showing all the remaining arguments.
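The arithmetic in that example can be checked with a short sketch. The function name count_opts is hypothetical, and the loop body is deliberately empty because only the value of OPTIND matters here.

```shell
#!/usr/bin/env bash
# count_opts is a hypothetical name used only for this sketch.
function count_opts ()
{
    local OPTION OPTARG OPTIND=1
    while getopts 'ab:' OPTION
    do
        :    # a real script would act on $OPTION here
    done
    echo "OPTIND=$OPTIND"
    shift $((OPTIND - 1))
    echo "$*"
}

count_opts -a -b alt plow harvest reap
# prints:
# OPTIND=4
# plow harvest reap
```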
help case
help getopts
help getopt
Recipe 13.2, “Parsing Arguments with Your Own Error Messages”
If you just want getopts to be quiet and not report any errors at all, just assign OPTERR=0 before you begin parsing. But if you want getopts to give you more information without the error messages, then begin the option list with a colon, as shown in the script in Example 13-2. (The quotes around the option list are optional.)
#!/usr/bin/env bash
# cookbook filename: getopts_custom
#
# using getopts - with custom error messages
#
aflag=
bflag=
# since we don't want getopts to generate error
# messages, but want this script to issue its
# own messages, we will put, in the option list, a
# leading ':' to silence getopts.
while getopts :ab: FOUND
do
    case $FOUND in
    a)  aflag=1
        ;;
    b)  bflag=1
        bval="$OPTARG"
        ;;
    \:) printf "argument missing from -%s option\n" $OPTARG
        printf "Usage: %s: [-a] [-b value] args\n" ${0##*/}
        exit 2
        ;;
    \?) printf "unknown option: -%s\n" $OPTARG
        printf "Usage: %s: [-a] [-b value] args\n" ${0##*/}
        exit 2
        ;;
    esac >&2
done
shift $(($OPTIND - 1))

if [ "$aflag" ]
then
    printf "Option -a specified\n"
fi
if [ "$bflag" ]
then
    printf 'Option -b "%s" specified\n' "$bval"
fi
printf "Remaining arguments are: %s\n" "$*"
The script is very similar to the one in Recipe 13.1; see that recipe’s Discussion section for more background. One difference here is that getopts may now return a colon. It does so when an option’s argument is missing (e.g., when the user invokes the script with -b but without an argument for it). In that case, it puts the option letter into $OPTARG so that you know what option it was that was missing its argument.
Similarly, if an unsupported option is given (e.g., if the user tries -d when invoking the script) getopts returns a question mark as the value for $FOUND, and puts the letter (the d in this case) into $OPTARG so that it can be used in the error messages.
We put a backslash in front of both the colon and the question mark to indicate that these are literals and not any special patterns or shell syntax. While not necessary for the colon, it looks better to have the parallel construction with the two punctuation marks both being escaped.
We added an I/O redirection on the esac (the end of the case statement), so that all output from the various printf commands will be redirected to standard error. This is in keeping with the purpose of standard error and is just easier to put it here than remembering to put it on each printf individually.
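The two silent-mode cases and the redirection on esac can be exercised without writing a whole script. In this sketch the function name quiet_parse and the message wording are made up; the getopts behavior is as described above.

```shell
#!/usr/bin/env bash
# quiet_parse is a hypothetical name used only for this sketch.
function quiet_parse ()
{
    local FOUND OPTARG OPTIND=1
    while getopts ':ab:' FOUND
    do
        case $FOUND in
        a)   echo "flag a" ;;
        b)   echo "b=$OPTARG" ;;
        \:)  echo "missing argument for -$OPTARG" ;;
        \?)  echo "unknown option -$OPTARG" ;;
        esac >&2    # everything printed inside the case goes to stderr
    done
}

quiet_parse -b    # stderr: missing argument for -b
quiet_parse -d    # stderr: unknown option -d
```

Running either call with 2>/dev/null shows nothing at all, which confirms that the single redirection covers every branch.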
help case
help getopts
help getopt
Parsing HTML from bash is pretty tricky, mostly because bash tends to be very line-oriented whereas HTML was designed to treat newlines like whitespace. So, it’s not uncommon to see tags split across two or more lines, as in:
<a href="blah..." rel="blah..." media="blah..."
target="blah..." >
There are also two ways to write <a> tags, one with a separate ending </a> tag and one without, where instead the singular <a> tag itself ends with a />. Between this and the potential for multiple tags on a line and tags split across lines, it’s a bit messy to parse, and our simple bash technique for this is often not foolproof.
Here are the steps involved in our solution. First, break the multiple tags on one line into at most one line per tag:
cat file | sed -e 's/>/>\
/g'
Yes, that’s a newline right after the backslash so that it substitutes each end-of-tag character (i.e., the >) with that same character and then a newline. That will put tags on separate lines, with maybe a few extra blank lines. The trailing g tells sed to do the search and replace globally; i.e., multiple times on a line if need be.
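Here is the same substitution wrapped in a function so that it can be tried on a one-line sample. The function name one_tag_per_line and the sample markup are invented for this sketch.

```shell
#!/usr/bin/env bash
# one_tag_per_line is a hypothetical name; the sed command is the
# one from the text, with the newline directly after the backslash.
function one_tag_per_line ()
{
    sed -e 's/>/>\
/g'
}

printf '%s\n' '<p><a href="x">link</a></p>' | one_tag_per_line
```

Each > now ends a line, and the final one leaves behind an empty line, which is the kind of extra blank line mentioned above.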
Then you can pipe that output into grep to grab just the <a tag lines:
cat file | sed -e 's/>/>\
/g' | grep '<a'
or maybe just lines with double quotes:
cat file | sed -e 's/>/>\
/g' | grep '".*"'
The single quotes tell the shell to take the inner characters literally and not do any shell expansion on them, and the rest is a regular expression to match a double quote followed by any character (.) any number of times (*), followed by another double quote. (This won’t work if the string itself is split across lines.)
To parse out the contents of what’s inside the double quotes, one trick is to use the shell’s internal field separator ($IFS) to tell it to use the double quote (") as the separator. You can do a similar thing with awk and its -F (field separator) option.
For example:
cat $1 | sed -e 's/>/>\
/g' | grep '".*"' | awk -F'"' '{ print $2}'
(Or use grep '<a' if you just want <a tags and not all quoted strings.)
If you want to use the $IFS shell trick rather than awk, it would be:
cat $1 | sed -e 's/>/>\
/g' | grep '<a' | while IFS='"' read PRE URL POST ; do echo $URL; done
where the grep output is piped into a while loop that reads the input into three fields (PRE, URL, and POST). By preceding the read command with IFS='"', we set that environment variable just for the read command, not for the entire script. Thus, it will parse with the quotes as its notion of what separates the words of the input line. It will set PRE to be everything up to the first quote, URL to be everything from there to the next quote, and POST to be everything thereafter. Then the script just echoes the second variable, URL; that’s all the characters between the quotes.
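The field splitting can be checked on a single sample line. In this sketch the function name first_quoted and the example.com URL are placeholders, and -r is added to read so that backslashes in the input are left alone.

```shell
#!/usr/bin/env bash
# first_quoted is a hypothetical name used only for this sketch.
function first_quoted ()
{
    local PRE URL POST
    # IFS is changed only for the read command itself
    while IFS='"' read -r PRE URL POST
    do
        echo "$URL"
    done
}

printf '%s\n' '<a href="http://example.com/page" target="_blank">' | first_quoted
# prints: http://example.com/page
```

Everything after the second quote, including the target attribute, lands in POST and is ignored.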
man sed
man grep
Example 13-3 illustrates how to use an array to parse the output into words.
#!/usr/bin/env bash
# cookbook filename: parseViaArray
#
# find the file size
# use an array to parse the ls -l output into words
LSL=$(ls -ld $1)

declare -a MYRA
MYRA=($LSL)

echo the file $1 is ${MYRA[4]} bytes.
In our example, we take the output from the ls -l command and parse the words by putting them into an array. Then we can just refer to each array element to get at each word. (Remember that the arrays are zero-based, so an index of 4 gives us the fifth element.) The typical output from the ls -l command looks like this (yours may vary due to locale):
-rw-r--r-- 1 albing users 113 2006-10-10 23:33 mystuff.txt
Arrays are easy to initialize if you know the values as you write the script. The format is simple. We begin by declaring the variable to be an array, and then we assign it values:
declare -a MYRA
MYRA=(first second third home)
The same can be done by using a variable inside those parentheses. Just be sure not to use quotes around the variable. Writing MYRA=("$LSL") will put the entire string into the first element, since it is all contained as one quoted string. Then ${MYRA[0]} will be the only array element, and it will contain the entire string, which is not what you wanted.
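The effect of the quotes is easy to measure by counting elements. This sketch reuses the sample ls -l line from the text; the two function names are invented for the illustration.

```shell
#!/usr/bin/env bash
# elements_unquoted and elements_quoted are hypothetical names.
function elements_unquoted ()
{
    local -a arr
    arr=($1)        # word splitting: one element per word
    echo ${#arr[@]}
}

function elements_quoted ()
{
    local -a arr
    arr=("$1")      # quoted: the whole string is one element
    echo ${#arr[@]}
}

LSL='-rw-r--r-- 1 albing users 113 2006-10-10 23:33 mystuff.txt'
elements_unquoted "$LSL"    # prints: 8
elements_quoted "$LSL"      # prints: 1
```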
We also could have shortened this script by combining the steps like this:
declare -a MYRA
MYRA=($(ls -ld $1))
If you want to know how many elements you have in your new array, just reference the variable ${#MYRA[*]} or ${#MYRA[@]}, either of which is a lot of special characters to type.
Use a function call to parse the words, as shown in Example 13-4.
#!/usr/bin/env bash
# cookbook filename: parseViaFunc
#
# parse ls -l via function call
# an example of output from ls -l follows:
# -rw-r--r-- 1 albing users 126 Jun 10 22:50 fnsize

function lsparts ()
{
    PERMS=$1
    LCOUNT=$2
    OWNER=$3
    GROUP=$4
    SIZE=$5
    CRMONTH=$6
    CRDAY=$7
    CRTIME=$8
    FILE=$9
}

lsparts $(ls -l "$1")

echo $FILE has $LCOUNT 'link(s)' and is $SIZE bytes long.
Here’s what it looks like when it runs:
$ ./fnsize fnsize
fnsize has 1 link(s) and is 311 bytes long.
$
We can let bash do the work of parsing by putting the text to be parsed in a function call. Calling a function is much like calling a shell script. bash parses the words into separate variables and assigns them to $1, $2, etc. Our function can just assign each positional parameter to a separate variable. If the variables are not declared locally then they are available outside as well as inside the function.
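The point about local declarations can be seen by comparing two tiny functions. The names parse_global and parse_local are hypothetical, and the sample line is the one from the comment in Example 13-4.

```shell
#!/usr/bin/env bash
# parse_global and parse_local are hypothetical names.
unset SIZE OWNER    # start from a known state for the demonstration

function parse_global ()
{
    SIZE=$5            # no "local": still set after the function returns
}

function parse_local ()
{
    local OWNER=$3     # "local": discarded when the function returns
}

sample='-rw-r--r-- 1 albing users 126 Jun 10 22:50 fnsize'
parse_global $sample
parse_local $sample
echo "SIZE=${SIZE:-unset} OWNER=${OWNER:-unset}"
# prints: SIZE=126 OWNER=unset
```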
We put quotes around the reference to $1 in the ls command in case the filename supplied has spaces in its name. The quotes keep it all together so that ls sees it as a single filename and not as a series of separate filenames.
We use quotes in the expression 'link(s)' to avoid special treatment of the parentheses by bash. Alternatively, we could have put the entire phrase (except for the echo itself) inside of double quotes (double, not single), so that the variable substitution (for $FILE, etc.) still occurs.
You might need to adjust the field list depending on how your computer and ls command present the date, or add options to your ls command to modify its output. For example, ls -l --time-style="long-iso" will produce a slightly different output format where month and day are replaced by a YYYY-MM-DD format date; you would then want to replace CRMONTH and CRDAY with a single variable, say, CRDATE, and adjust the field numbers accordingly.
Use the read statement, as in Example 13-5.
#!/usr/bin/env bash
# cookbook filename: parseViaRead
#
# parse ls -l with a read statement
# an example of output from ls -l follows:
# -rw-r--r-- 1 albing users 126 2006-10-10 22:50 fnsize

ls -l "$1" | { read PERMS LCOUNT OWNER GROUP SIZE CRDATE CRTIME FILE ;
               echo $FILE has $LCOUNT 'link(s)' and is $SIZE bytes long. ;
             }
Here we let read do all the parsing. It will break apart the input into words, where words are separated by whitespace, and assign each word to the variables named in the read statement. Actually, you can even change the separator, by setting the bash $IFS (internal field separator) variable to whatever character you want for parsing; just remember to set it back!
As you can see from the sample output of ls -l, we have tried to choose names that get at the meaning of each word in the output. Since FILE is the last word, any extra fields will also be part of that variable. That way if the name has whitespace in it, like “Beethoven Fifth Symphony,” then all three words will end up in $FILE.
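The way the last variable soaks up the remaining words can be shown with a made-up line of ls -l output. The function name show_file_field is hypothetical, and the sample line combines the fields from the comment in Example 13-5 with the filename used in the text.

```shell
#!/usr/bin/env bash
# show_file_field is a hypothetical name used only for this sketch.
function show_file_field ()
{
    local PERMS LCOUNT OWNER GROUP SIZE CRDATE CRTIME FILE
    read -r PERMS LCOUNT OWNER GROUP SIZE CRDATE CRTIME FILE
    echo "$FILE"
}

echo '-rw-r--r-- 1 albing users 126 2006-10-10 22:50 Beethoven Fifth Symphony' |
    show_file_field
# prints: Beethoven Fifth Symphony
```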
Use the -a option on the read command, and the words will be read into an array variable:
read -a MYRAY
Whether coming from user input or a pipeline, read will parse the input into words, putting each word in its own array element. The variable does not need to be declared as an array; using it in this fashion is enough to make it into an array. Each element can be referenced with the bash array syntax. Arrays in bash are zero-based, so the second word on a line of input will be put into ${MYRAY[1]} in our example. The number of words will determine the size of the array. In our example, the size of the array is ${#MYRAY[@]}.
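A quick way to try this without typing input is a here-string. The three sample words are arbitrary.

```shell
#!/usr/bin/env bash
# feed read -a from a here-string instead of the keyboard
read -a MYRAY <<< 'alpha beta gamma'

echo "second word: ${MYRAY[1]}"      # prints: second word: beta
echo "word count: ${#MYRAY[@]}"      # prints: word count: 3
```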
Use the mapfile or readarray command in bash. They are identical commands that take the same arguments and let you read an entire file into an array, one array entry for each line of the file, with one statement.
The choice of command, either readarray or mapfile, seems to be one of perspective—are you thinking about the destination (the array), or the source (the datafile)? Use whichever makes more sense to you. They are interchangeable.
Here’s a sample mapfile command, part of a fuller example in the discussion:
mapfile -t -s 1 -n 1500 -C showprg -c 100 BIGDATA < /tmp/myfile.data
This command will discard (i.e., skip) the first line (-s 1) of input, reading up to 1,500 lines (-n 1500) and discarding the newline at the end of each line (-t). Every 100 lines (-c 100) it will call a user-defined function called showprg (to show progress in reading the file; the default is every 5,000 lines). The data is put into the array called BIGDATA, one line of input per entry. Input is redirected from the file as shown.
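The skip and count options are easier to follow on a tiny input. This sketch assumes a bash new enough to have mapfile (version 4 or later); the function name load_demo, the array name, and the four sample lines are made up, and the callback options are left out to keep it short.

```shell
#!/usr/bin/env bash
# load_demo is a hypothetical name used only for this sketch.
function load_demo ()
{
    local -a LINES_READ
    # skip 1 line, keep at most 2, strip the trailing newlines
    mapfile -t -s 1 -n 2 LINES_READ < <(printf '%s\n' header one two three)
    echo "${#LINES_READ[@]} lines: ${LINES_READ[0]} ${LINES_READ[1]}"
}

load_demo
# prints: 2 lines: one two
```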
Here’s the first part of an example use of mapfile (or readarray, if you prefer). It reads the file and shows progress as it reads. Then it prints out how many lines it read—i.e., the size of the array:
# use mapfile to read in $1
# show progress with dots
function showprg ()
{
    printf "."
}

# create a large datafile for our use
ls -l /usr/bin > /tmp/myfile.data

# a.k.a. readarray; load up BIGDATA
mapfile -t -s 1 -n 1500 -C showprg -c 100 BIGDATA < /tmp/myfile.data

# put a newline at the end of the showprg output
echo

# how many lines did we read?
siz=${#BIGDATA[@]}
echo "size: ${siz}"
The showprg function will print a dot (but no newline) each time it is called. This will show progress when reading in a large file. You could do something much fancier if you wanted; it’s whatever function you want, after all.
So now that the file has been read into the array, what might we do with all that data? In this case it’s a very long output from the ls command. We could now go through the file one line at a time and print out some of the data:
# number the lines as we print them out
for ((i=0; i<siz; i++))
do
    ALINE=${BIGDATA[i]}
    if [[ ${ALINE:0:1} == 'l' ]]    # only symbolic links
    then
        # print the relevant substring
        printf "%4d: %s\n" $i "${ALINE:48}"
    fi
done

rm /tmp/myfile.data    # clean up
In this case, the script will look at the first character of a line and, if it’s an l, print out the line, beginning at character 48 (zero-based). Since the data in the file is the “long” output of an ls command, such a first character indicates that we are looking at a symbolic link. (Similarly, a d would indicate a directory, but we don’t use that here.)
Here is an excerpt of the output that might result from running the whole script. The first line shows the dots that appear as the progress of the read:
...............
size: 1500
(other output, and then)
1307: rsh -> /etc/alternatives/rsh
1311: rtstat -> lnstat
1315: rview -> /etc/alternatives/rview
(even more output)
Example 13-6 illustrates a way to make words plural.
#!/usr/bin/env bash
# cookbook filename: pluralize
#
# A function to make words plural by adding an s
# when the value ($2) is != 1 or -1.
# It only adds an 's'; it is not very smart.
#
function plural ()
{
    if [ $2 -eq 1 -o $2 -eq -1 ]
    then
        echo ${1}
    else
        echo ${1}s
    fi
}

while read num name
do
    echo $num $(plural "$name" $num)
done
The function, though only set to handle the simple addition of an s, will do fine for many nouns. The function doesn’t do any error checking of the number or contents of the arguments. If you wanted to use this script in a serious application, you might want to add those kinds of checks.
We put the name in quotes when we call the plural function in case there are embedded blanks in the name. It did, after all, come from the read statement, and the last variable in a read statement gets all the remaining text from the input line. You can see that in the following example.
We put the script in Example 13-6 into a file named pluralize and ran it against the following data:
$ cat input.file
1 hen
2 duck
3 squawking goose
4 limerick oyster
5 corpulent porpoise

$ ./pluralize < input.file
1 hen
2 ducks
3 squawking gooses
4 limerick oysters
5 corpulent porpoises
“Gooses” isn’t correct English, but the script did what was intended. If you like the C-like syntax better, you could write the if statement like this:
if (( $2 == 1 || $2 == -1 ))
The square bracket (i.e., the test builtin) is the older form, more common across the various versions of bash, but either should work. Use whichever form’s syntax is easiest for you to remember.
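For comparison, here is the whole function written with the arithmetic form of the test. It is a sketch of the same behavior as the plural function in Example 13-6, with only the condition changed.

```shell
#!/usr/bin/env bash
# Same plural function, using (( )) instead of [ ] for the test.
function plural ()
{
    if (( $2 == 1 || $2 == -1 ))
    then
        echo ${1}
    else
        echo ${1}s
    fi
}

echo "1 $(plural hen 1)"      # prints: 1 hen
echo "2 $(plural duck 2)"     # prints: 2 ducks
```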
We don’t expect you would keep a file like pluralize around, but the plural function might be handy to have as part of a larger scripting project. Then whenever you report on the count of something you could use the plural function as part of the reference, as shown in the while loop in the script.
The substring function for variables will let you take things apart, and another feature tells you how long a string is. Example 13-7 demonstrates their use.
#!/usr/bin/env bash
# cookbook filename: onebyone
#
# parsing input one character at a time

while read ALINE
do
    for ((i=0; i < ${#ALINE}; i++))
    do
        ACHAR=${ALINE:i:1}
        # do something here, e.g. echo $ACHAR
        echo $ACHAR
    done
done
The read statement will take input from standard input and put it, a line at a time, into the variable $ALINE. Since there are no other variables in the read statement, it takes the entire line and doesn’t divvy it up, but it will remove leading and trailing $IFS whitespace unless you use IFS= read or just read by itself and later reference the default $REPLY variable.
The for loop will loop once for each character in the $ALINE variable. We can compute how many times to loop by using ${#ALINE}, which returns the length of the contents of $ALINE.
Each time through the loop we assign $ACHAR the value of the one-character substring of $ALINE that begins at the ith position. That’s simple enough.
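If the leading and trailing whitespace matters, the IFS= read form mentioned above keeps it. This variation is a sketch only: the function name spell_out is made up, and it brackets each character so that the spaces are visible.

```shell
#!/usr/bin/env bash
# spell_out is a hypothetical name used only for this sketch.
function spell_out ()
{
    local ALINE ACHAR i
    # IFS= keeps leading/trailing whitespace; -r keeps backslashes
    while IFS= read -r ALINE
    do
        for ((i=0; i < ${#ALINE}; i++))
        do
            ACHAR=${ALINE:i:1}
            printf '[%s]' "$ACHAR"
        done
        printf '\n'
    done
}

printf ' ab \n' | spell_out
# prints: [ ][a][b][ ]
```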
Subversion’s svn status command shows all the files that have been modified, but if you have scratch files or other garbage lying around in your source tree, svn will list those, too. It would be useful to have a way to clean up your source tree, removing those files unknown to Subversion.
Subversion won’t know about new files unless and until you do an svn add command. Don’t run this script until you’ve added any new source files, or they’ll be gone for good.
The svn status output lists one file per line. It puts an M as the first character of a line for files that have been modified, an A for newly added (but not yet committed) files, and a question mark for those about which it knows nothing. We just grep for those lines beginning with a question mark. We process the output with a read statement in a while loop. The echo isn’t strictly necessary, but it’s useful to see what’s being removed, just in case there is a mistake or an error. You can at least see that it’s gone for good. When we do the remove, we use the -rf options in case the file is a directory, but mostly just to keep the remove quiet. Problems encountered with permissions and such are squelched by the -f option; it just removes the file as best as your permissions allow. We put the reference to the filename in quotes ("$fn") in case there are special characters, like spaces, in the filename.
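The filtering step can be tried safely on canned input before anything is deleted. The following is only a sketch of that one step, not the recipe’s script: the function name unknown_to_svn and the sample status lines are invented, and the sed expression assumes that the path is separated from the status character by whitespace.

```shell
#!/usr/bin/env bash
# unknown_to_svn is a hypothetical name; it only lists, never removes.
function unknown_to_svn ()
{
    # keep lines that start with "?", then strip the status
    # character and the padding that follows it
    grep '^?' | sed 's/^?[[:space:]]*//'
}

printf '%s\n' \
    'M       src/main.c' \
    '?       scratch.tmp' \
    'A       src/new.c' \
    '?       old notes.txt' | unknown_to_svn
# prints:
# scratch.tmp
# old notes.txt
```

Once the list looks right, the names could be fed to a while read loop that echoes and removes each "$fn", as the discussion describes.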
You want to create and initialize several databases using MySQL. You want them all to be initialized using the same SQL commands. Each database needs its own name, but each database will have the same contents, at least at initialization. You may need to do this setup over and over, as in the case where these databases are used as part of a test suite that needs to be reset when tests are rerun.
The simple bash script in Example 13-8 can help with this administrative task.
#!/usr/bin/env bash
# cookbook filename: dbiniter
#
# initialize databases from a standard file
# creating databases as needed

DBLIST=$(mysql -e "SHOW DATABASES;" | tail -n +2)
select DB in $DBLIST "new..."
do
    if [[ $DB == "new..." ]]
    then
        printf "%b" "name for new db: "
        read DB rest
        echo creating new database $DB
        mysql -e "CREATE DATABASE IF NOT EXISTS $DB;"
    fi

    if [ -n "$DB" ]
    then
        echo Initializing database: $DB
        mysql $DB < ourInit.sql
    fi
done
The tail -n +2 is added to remove the heading from the list of databases (see Recipe 2.12).
The select creates the menus showing the existing databases. We added the literal "new..." as an additional choice (see Recipe 3.7 and Recipe 6.16).
When the user wants to create a new database, we prompt for and read a new name, but we use two fields in the read statement as a bit of error handling. If the user types more than one name on the line, we only use the first name: it gets put into the variable $DB while the rest of the input is put into $rest and ignored. (We could add an error check to see if $rest is null.)
Whether created anew or chosen from the list of extant databases, if the $DB variable is not empty, it will invoke mysql one more time to feed it the set of SQL statements that we’ve put into the file ourInit.sql as our standardized initialization sequence.
If you’re going to use a script like this, you might need to add parameters to your mysql command, such as -u and -p to prompt for a username and password. It will depend on how your database and its permissions are configured, or whether you have a file named .my.cnf with your MySQL defaults.
We could also have added an error check after the creation of the new database to see if it succeeded; if it did not succeed, we could unset DB, thereby bypassing the initialization. However, as many a math textbook has said, “we leave that as an exercise for the reader.”
Use cut if there are delimiters you can easily pick out, even if they are different for the beginning and end of the field you need:
# Here's an easy one - what users, home directories and shells do
# we have on this NetBSD system?
$ cut -d':' -f1,6,7 /etc/passwd
root:/root:/bin/csh
toor:/root:/bin/sh
daemon:/:/sbin/nologin
operator:/usr/guest/operator:/sbin/nologin
bin:/:/sbin/nologin
games:/usr/games:/sbin/nologin
postfix:/var/spool/postfix:/sbin/nologin
named:/var/chroot/named:/sbin/nologin
ntpd:/var/chroot/ntpd:/sbin/nologin
sshd:/var/chroot/sshd:/sbin/nologin
smmsp:/nonexistent:/sbin/nologin
uucp:/var/spool/uucppublic:/usr/libexec/uucp/uucico
nobody:/nonexistent:/sbin/nologin
jp:/home/jp:/usr/pkg/bin/bash

# What is the most popular shell on the system?
$ cut -d':' -f7 /etc/passwd | sort | uniq -c | sort -rn
  10 /sbin/nologin
   2 /usr/pkg/bin/bash
   1 /bin/csh
   1 /bin/sh
   1 /usr/libexec/uucp/uucico

# Now let's see the first two directory levels
$ cut -d':' -f6 /etc/passwd | cut -d'/' -f1-3 | sort -u
/
/home/jp
/nonexistent
/root
/usr/games
/usr/guest
/var/chroot
/var/spool
Use awk to split on multiples of whitespace, or if you need to rearrange the order of the output fields. Note the → denotes a tab character in the output. The default is a space, but you can change that using $OFS:
# Users, home directories, and shells, but swap the last two
# and use a tab delimiter
$ awk 'BEGIN {FS=":"; OFS="\t"; } { print $1,$7,$6; }' /etc/passwd
root → /bin/csh → /root
toor → /bin/sh → /root
daemon → /sbin/nologin → /
operator → /sbin/nologin → /usr/guest/operator
bin → /sbin/nologin → /
games → /sbin/nologin → /usr/games
postfix → /sbin/nologin → /var/spool/postfix
named → /sbin/nologin → /var/chroot/named
ntpd → /sbin/nologin → /var/chroot/ntpd
sshd → /sbin/nologin → /var/chroot/sshd
smmsp → /sbin/nologin → /nonexistent
uucp → /usr/libexec/uucp/uucico → /var/spool/uucppublic
nobody → /sbin/nologin → /nonexistent
jp → /usr/pkg/bin/bash → /home/jp

# Multiples of whitespace and swapped, first field removed
$ grep '^# [1-9]' /etc/hosts | awk '{print $3,$2}'
10.255.255.255 10.0.0.0
172.31.255.255 172.16.0.0
192.168.255.255 192.168.0.0
Use grep -o to display just the part that matched your pattern. This is particularly handy when you can’t express delimiters in a way that lends itself to the solutions shown here. For example, say you need to extract all IP addresses from a file, no matter where they are. Note we use egrep because of the regular expression (regex), but -o should work with whichever GNU grep flavor you use (it is probably not supported on non-GNU versions; check your documentation):
$ cat has_ipas
This is line 1 with 1 IPA: 10.10.10.10
Line 2 has 2; they are 10.10.10.11 and 10.10.10.12.
Line three is ftp_server=10.10.10.13:21.

$ egrep -o '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' has_ipas
10.10.10.10
10.10.10.11
10.10.10.12
10.10.10.13
The possibilities are endless, and we haven’t even scratched the surface here. This is the very essence of what the Unix toolchain idea is all about: take a number of small tools that do one thing well and combine them as needed to solve problems.
Also, the regex we used for IP addresses is naive and could match other things, including invalid addresses. For a much better pattern, use the Perl Compatible Regular Expressions (PCRE) regex from Mastering Regular Expressions, 3rd Edition, by Jeffrey E. F. Friedl (O’Reilly), if your grep supports -P:
$ grep -oP '([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])' has_ipas
10.10.10.10
10.10.10.11
10.10.10.12
10.10.10.13
$
$ perl -ne 'while ( m/([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])/g ) { print qq($1.$2.$3.$4\n); }' has_ipas
10.10.10.10
10.10.10.11
10.10.10.12
10.10.10.13
$
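If neither grep -P nor Perl is available, a candidate address can also be range-checked in bash itself. This is a different technique from the regex above, offered only as a sketch: the function name valid_ipv4 is made up, and it checks a single string rather than extracting addresses from a file.

```shell
#!/usr/bin/env bash
# valid_ipv4 is a hypothetical name; it checks one candidate string.
function valid_ipv4 ()
{
    local re='^[0-9]{1,3}(\.[0-9]{1,3}){3}$'
    local -a octets
    local o

    # must be four dot-separated groups of one to three digits
    [[ $1 =~ $re ]] || return 1

    IFS=. read -r -a octets <<< "$1"
    for o in "${octets[@]}"
    do
        # 10# forces base 10 so a leading zero is not read as octal
        (( 10#$o <= 255 )) || return 1
    done
    return 0
}

valid_ipv4 10.10.10.13  && echo "10.10.10.13 looks valid"
valid_ipv4 999.10.10.13 || echo "999.10.10.13 does not"
```

The naive pattern from the egrep example would accept both strings; the range check is what rejects the second one.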
man cut
man awk
man grep
Mastering Regular Expressions, 3rd Edition, by Jeffrey E. F. Friedl (O’Reilly)
Recipe 17.16, “Finding Lines That Appear in One File but Not in Another”
In the simple case, you want to extract a single field from a line, then perform some operation on it. For that, you can use cut or awk. See Recipe 13.13 for details.
For the more complicated case, you need to modify a field in a datafile without extracting it. If it’s a simple search and replace, use sed.
For example, let’s switch everyone from csh to sh on this NetBSD system:
$ grep csh /etc/passwd
root:*:0:0:Charlie &:/root:/bin/csh

$ sed 's;/csh$;/sh;' /etc/passwd | grep '^root'
root:*:0:0:Charlie &:/root:/bin/sh
You can use awk if you need to do arithmetic on a field or modify a string only in a certain field:
$ cat data_file
Line 1 ends
Line 2 ends
Line 3 ends
Line 4 ends
Line 5 ends

$ awk '{print $1, $2+5, $3}' data_file
Line 6 ends
Line 7 ends
Line 8 ends
Line 9 ends
Line 10 ends

# If the second field contains '3', change it to '8' and mark it
$ awk '{ if ($2 == "3") print $1, $2+5, $3, "Tweaked" ; else print $0; }' data_file
Line 1 ends
Line 2 ends
Line 8 ends Tweaked
Line 4 ends
Line 5 ends
The possibilities here are as endless as your data, but hopefully these examples will give you enough of a start to easily modify your data.
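The sed example edits any line that ends in /csh, which is fine for passwd data but relies on the shell being the last thing on the line. When the change must be limited to one specific field, awk can test and assign that field directly. This is a minimal sketch under our own assumptions: the function name change_shell is hypothetical, and the input is assumed to be colon-delimited with the shell in field 7.

```bash
#!/usr/bin/env bash
# change_shell is a hypothetical name. It reads passwd-style lines on stdin
# and rewrites the login shell (field 7) only when it equals $1 exactly.
change_shell() {
    awk -F: -v OFS=: -v old="$1" -v new="$2" '
        $7 == old { $7 = new }   # assigning to a field rebuilds $0 with OFS
        { print }
    '
}

printf '%s\n' \
    'root:*:0:0:Charlie &:/root:/bin/csh' \
    'csh:*:1001:1001:Odd name:/home/csh:/bin/bash' |
    change_shell /bin/csh /bin/sh
```

Only the first line changes; the second is printed untouched even though "csh" appears elsewhere in it.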
These solutions rely on a bash-specific treatment of read and $REPLY. See the end of the discussion for an alternate solution.
First, we’ll show a file with some leading and trailing whitespace. Note that we add ~~ to show the whitespace, and that the → denotes a literal tab character in the output:
# Show the whitespace in our sample file
$ while read; do echo ~~"$REPLY"~~; done < whitespace
~~ This line has leading spaces.~~
~~This line has trailing spaces. ~~
~~ This line has both leading and trailing spaces. ~~
~~ → Leading tab.~~
~~Trailing tab. → ~~
~~ → Leading and trailing tab. → ~~
~~ → Leading mixed whitespace.~~
~~Trailing mixed whitespace. → ~~
~~ → Leading and trailing mixed whitespace. → ~~
$
To trim both leading and trailing whitespace, use $IFS and the builtin $REPLY variable (see the discussion for why this works):
$ while read REPLY; do echo ~~"$REPLY"~~; done < whitespace
~~This line has leading spaces.~~
~~This line has trailing spaces.~~
~~This line has both leading and trailing spaces.~~
~~Leading tab.~~
~~Trailing tab.~~
~~Leading and trailing tab.~~
~~Leading mixed whitespace.~~
~~Trailing mixed whitespace.~~
~~Leading and trailing mixed whitespace.~~
$
To trim only leading or only trailing spaces, use a simple pattern match:
# Leading spaces only
$ while read; do echo "~~${REPLY## }~~"; done < whitespace
~~This line has leading spaces.~~
~~This line has trailing spaces. ~~
~~This line has both leading and trailing spaces. ~~
~~ → Leading tab.~~
~~Trailing tab. ~~
~~ → Leading and trailing tab. → ~~
~~ → Leading mixed whitespace.~~
~~Trailing mixed whitespace. → ~~
~~ → Leading and trailing mixed whitespace. → ~~

# Trailing spaces only
$ while read; do echo "~~${REPLY%% }~~"; done < whitespace
~~ This line has leading spaces.~~
~~This line has trailing spaces.~~
~~ This line has both leading and trailing spaces.~~
~~ → Leading tab.~~
~~Trailing tab. ~~
~~ → Leading and trailing tab. → ~~
~~ → Leading mixed whitespace.~~
~~Trailing mixed whitespace. → ~~
~~ → Leading and trailing mixed whitespace. → ~~
Trimming only leading or only trailing whitespace (including tabs) is a bit more complicated:
# You need this either way
$ shopt -s extglob

# Leading whitespace only
$ while read; do echo "~~${REPLY##+([[:space:]])}~~"; done < whitespace
~~This line has leading spaces.~~
~~This line has trailing spaces. ~~
~~This line has both leading and trailing spaces. ~~
~~Leading tab.~~
~~Trailing tab. ~~
~~Leading and trailing tab. → ~~
~~Leading mixed whitespace.~~
~~Trailing mixed whitespace. → ~~
~~Leading and trailing mixed whitespace. → ~~
$

# Trailing whitespace only
$ while read; do echo "~~${REPLY%%+([[:space:]])}~~"; done < whitespace
~~ This line has leading spaces.~~
~~This line has trailing spaces.~~
~~ This line has both leading and trailing spaces.~~
~~ → Leading tab.~~
~~Trailing tab.~~
~~ → Leading and trailing tab.~~
~~ → Leading mixed whitespace.~~
~~Trailing mixed whitespace.~~
~~ → Leading and trailing mixed whitespace.~~
OK, at this point you are probably looking at these lines and wondering how we’re going to make this comprehensible. It turns out there’s a simple, if subtle, explanation.
Here we go. The first example used the default $REPLY variable that read uses when you do not supply your own variable name(s). Chet Ramey (maintainer of bash) made a design decision to “[if] there are no variables, save the text of the line read to the variable $REPLY,” unchanged:
while read; do echo ~~"$REPLY"~~; done < whitespace
But when we supply one or more variable names to read, it does parse the input, using the values in $IFS (which are space, tab, and newline by default). One step of that parsing process is to trim leading and trailing whitespace, which is just what we want:
while read REPLY; do echo ~~"$REPLY"~~; done < whitespace
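The difference between the two forms is easy to see side by side on a single string. This is a minimal sketch; both function names are hypothetical, and we add -r (not used in the recipe) so that backslashes in the input are left alone.

```bash
#!/usr/bin/env bash
# Both function names are hypothetical; each reads one line from its argument.
keep_whitespace() {
    local REPLY
    read -r <<< "$1"          # no variable name: the line lands in $REPLY as is
    printf '%s' "$REPLY"
}

trim_whitespace() {
    local field
    read -r field <<< "$1"    # a variable name: $IFS whitespace is trimmed
    printf '%s' "$field"
}

sample=$'  \tpadded text \t '
echo "~~$(keep_whitespace "$sample")~~"
echo "~~$(trim_whitespace "$sample")~~"
```

The first echo shows the padding intact between the markers; the second shows only "padded text".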
Trimming leading or trailing spaces (but not both) is easy using the ${##} or ${%%} operators (see Recipe 6.7):
while read; do echo "~~${REPLY## }~~"; done < whitespace
while read; do echo "~~${REPLY%% }~~"; done < whitespace
Covering tabs is a little harder. If we had only tabs, we could use the ${##} or ${%%} operators and insert literal tabs using the Ctrl-V Ctrl-I key sequence. But that’s risky since it’s probable there’s a mix of spaces and tabs, and some text editors or unwary users may strip out the tabs. So, we turn on extended globbing and use a character class to make our intent clear. The [:space:] character class would work without extglob, but we need to say “one or more occurrences” using +(), or else it will trim only a single space or tab rather than a whole run of mixed whitespace. If you only care about space or tab, you could use the [:blank:] character class instead, since [:space:] includes other characters like the vertical tab (\v) and DOS CR (carriage return, \r):
# This works, need extglob for +() part
$ shopt -s extglob
...
$ while read; do echo "~~${REPLY##+([[:space:]])}~~"; done < whitespace
...
$ while read; do echo "~~${REPLY%%+([[:space:]])}~~"; done < whitespace
...

# This doesn't
$ while read; do echo "~~${REPLY##[[:space:]]}~~"; done < whitespace
~~This line has leading spaces.~~
~~This line has trailing spaces. ~~
~~This line has both leading and trailing spaces. ~~
~~Leading tab.~~
~~Trailing tab. ~~
~~Leading and trailing tab. ~~
~~ → Leading mixed whitespace.~~
~~Trailing mixed whitespace. → ~~
~~ → Leading and trailing mixed whitespace. → ~~
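The same "one character versus a whole run" distinction applies to plain spaces, so it is worth checking against your own data. A pattern with no repetition operator matches exactly one character, which means ${REPLY## } removes a single leading space and leaves any others behind. A minimal sketch, with variable names of our own choosing:

```bash
#!/usr/bin/env bash
shopt -s extglob    # needed for the +( ) "one or more" pattern

padded='   three leading spaces'

one_off="${padded## }"        # plain pattern: matches exactly one space
whole_run="${padded##+( )}"   # extglob pattern: matches the entire run

echo "~~${one_off}~~"
echo "~~${whole_run}~~"
```

The first line still shows two spaces after the opening markers; the second shows none.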
Here’s a different take, exploiting the same $IFS parsing, but to parse out fields (or words) instead of records (or lines):
$ for i in $(cat white_space); do echo ~~$i~~; done
~~This~~
~~line~~
~~has~~
~~leading~~
~~white~~
~~space.~~
~~This~~
~~line~~
~~has~~
~~trailing~~
~~white~~
~~space.~~
~~This~~
~~line~~
~~has~~
~~both~~
~~leading~~
~~and~~
~~trailing~~
~~white~~
~~space.~~
$
Finally, although the original solutions rely on Chet’s design decision about read and $REPLY, this solution does not:
shopt -s extglob

while IFS= read -r line; do
    echo "None: ~~$line~~"                       # preserve all whitespaces
    echo "Ld:   ~~${line##+([[:space:]])}~~"     # trim leading whitespace
    echo "Tr:   ~~${line%%+([[:space:]])}~~"     # trim trailing whitespace
    line="${line##+([[:space:]])}"               # trim leading and...
    line="${line%%+([[:space:]])}"               # ...trailing whitespace
    echo "All:  ~~$line~~"                       # Show all trimmed
done < whitespace
Use tr or awk as appropriate.
If you are trying to compress runs of whitespace down to a single character, you can use tr, but be aware that you may damage the file if it is not well formed. For example, if fields are delimited by multiple whitespace characters but internally have spaces, compressing multiple spaces down to one space will remove that distinction. Imagine if the _ characters in the following example were spaces instead. Note the → denotes a literal tab character in the output:
$ cat data_file
Header1    Header2    Header3
Rec1_Field1    Rec1_Field2    Rec1_Field3
Rec2_Field1    Rec2_Field2    Rec2_Field3
Rec3_Field1    Rec3_Field2    Rec3_Field3

$ cat data_file | tr -s ' ' '\t'
Header1 → Header2 → Header3
Rec1_Field1 → Rec1_Field2 → Rec1_Field3
Rec2_Field1 → Rec2_Field2 → Rec2_Field3
Rec3_Field1 → Rec3_Field2 → Rec3_Field3
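To see the damage described above, count the fields after the conversion. This is a minimal sketch; the function name spaces_to_tab is hypothetical and the sample records are made up for illustration.

```bash
#!/usr/bin/env bash
# spaces_to_tab is a hypothetical name: turn each run of spaces into one tab.
spaces_to_tab() {
    tr -s ' ' '\t'
}

# Underscores protect the would-be internal spaces, so two fields survive.
printf 'Rec1_Field1    Rec1_Field2\n' |
    spaces_to_tab | awk -F'\t' '{ print NF, "fields" }'

# With real internal spaces, every space becomes a delimiter: four fields.
printf 'Rec1 Field1    Rec1 Field2\n' |
    spaces_to_tab | awk -F'\t' '{ print NF, "fields" }'
```

The first pipeline reports 2 fields and the second reports 4, which is the loss of distinction the text warns about.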
If your field delimiter is more than a single character, tr won’t work since it translates single characters from its first set into the matching single character in the second set. You can use awk to combine or convert field separators. awk’s internal field separator FS accepts regular expressions, so you can separate on pretty much anything. There is a handy trick to this as well: an assignment to any field causes awk to reassemble the record using the output field separator, OFS, so assigning field 1 to itself and then printing the record has the effect of translating FS to OFS without you having to worry about how many fields there are in each record.
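The effect of the self-assignment is easiest to see with and without it. A minimal sketch follows; the function name resep and the colon-delimited sample are our own, chosen only for illustration.

```bash
#!/usr/bin/env bash
# resep is a hypothetical name: resep IN_SEP OUT_SEP < input
# Keep both separators free of backslashes, since awk -v processes escapes.
resep() {
    awk -v FS="$1" -v OFS="$2" '{ $1 = $1; print }'
}

# With the assignment, the record is rebuilt using OFS.
echo 'a:b:c' | resep ':' ','

# Without it, awk prints the original record and OFS is never applied.
echo 'a:b:c' | awk -v FS=':' -v OFS=',' '{ print }'
```

The first command prints a,b,c while the second prints a:b:c unchanged.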
In the data file below, multiple spaces delimit fields, but fields also have internal spaces, so the simpler case of:
awk 'BEGIN { OFS="\t" } { $1=$1; print }' data_file1
won’t work. Here is a datafile:
$ cat data_file1
Header1         Header2         Header3
Rec1 Field1     Rec1 Field2     Rec1 Field3
Rec2 Field1     Rec2 Field2     Rec2 Field3
Rec3 Field1     Rec3 Field2     Rec3 Field
$
In the next example, we assign two spaces to FS and the tab to OFS. We then make an assignment ($1 = $1) so awk rebuilds the record, but that results in strings of tabs replacing the double spaces, so we use gsub to squash the tabs, then we print. Note the → denotes a literal tab character in the output. The output is a little hard to read, so there is a hex dump as well. Recall that ASCII tab is 09 while ASCII space is 20:
$ awk 'BEGIN { FS = "  "; OFS = "\t" } { $1 = $1; gsub(/\t+ ?/, "\t"); print }' data_file1
Header1 → Header2 → Header3
Rec1 Field1 → Rec1 Field2 → Rec1 Field3
Rec2 Field1 → Rec2 Field2 → Rec2 Field3
Rec3 Field1 → Rec3 Field2 → Rec3 Field

$ awk 'BEGIN { FS = "  "; OFS = "\t" } { $1 = $1; gsub(/\t+ ?/, "\t"); print }' data_file1 | hexdump -C
00000000  48 65 61 64 65 72 31 09  48 65 61 64 65 72 32 09  |Header1.Header2.|
00000010  48 65 61 64 65 72 33 0a  52 65 63 31 20 46 69 65  |Header3.Rec1 Fie|
00000020  6c 64 31 09 52 65 63 31  20 46 69 65 6c 64 32 09  |ld1.Rec1 Field2.|
00000030  52 65 63 31 20 46 69 65  6c 64 33 0a 52 65 63 32  |Rec1 Field3.Rec2|
00000040  20 46 69 65 6c 64 31 09  52 65 63 32 20 46 69 65  | Field1.Rec2 Fie|
00000050  6c 64 32 09 52 65 63 32  20 46 69 65 6c 64 33 0a  |ld2.Rec2 Field3.|
00000060  52 65 63 33 20 46 69 65  6c 64 31 09 52 65 63 33  |Rec3 Field1.Rec3|
00000070  20 46 69 65 6c 64 32 09  52 65 63 33 20 46 69 65  | Field2.Rec3 Fie|
00000080  6c 64 0a                                          |ld.|
00000083
You can use awk to trim leading and trailing whitespace in the same way, but as noted previously, this will replace your field separators unless they are already spaces:
awk '{ $1 = $1; print }' white_space
Effective awk Programming, 4th Edition, by Arnold Robbins (O’Reilly)
sed & awk, 2nd Edition, by Arnold Robbins and Dale Dougherty (O’Reilly)
Use Perl or gawk 2.13 or greater. Given a file like:
$ cat fixed-length_file
Header1-----------Header2-------------------------Header3---------
Rec1 Field1       Rec1 Field2                     Rec1 Field3     
Rec2 Field1       Rec2 Field2                     Rec2 Field3     
Rec3 Field1       Rec3 Field2                     Rec3 Field3     
you can process it using GNU’s gawk, by setting FIELDWIDTHS to the correct field lengths, setting OFS as desired, and making an assignment so gawk rebuilds the record using this OFS. However, gawk does not remove the spaces used in padding the original record, so we use two gsubs to do that, one for all the internal fields and the other for the last field in each record. Finally, we just print. Note the → denotes a literal tab character in the output. The output is a little hard to read, so there is a hex dump as well. Recall that ASCII tab is 09 while ASCII space is 20:
$ gawk 'BEGIN { FIELDWIDTHS = "18 32 16"; OFS = "\t" }
> { $1 = $1; gsub(/ +\t/, "\t"); gsub(/ +$/, ""); print }' fixed-length_file
Header1----------- → Header2------------------------- → Header3---------
Rec1 Field1 → Rec1 Field2 → Rec1 Field3
Rec2 Field1 → Rec2 Field2 → Rec2 Field3
Rec3 Field1 → Rec3 Field2 → Rec3 Field3

$ gawk 'BEGIN { FIELDWIDTHS = "18 32 16"; OFS = "\t" }
> { $1 = $1; gsub(/ +\t/, "\t"); gsub(/ +$/, ""); print }' fixed-length_file \
> | hexdump -C
00000000  48 65 61 64 65 72 31 2d  2d 2d 2d 2d 2d 2d 2d 2d  |Header1---------|
00000010  2d 2d 09 48 65 61 64 65  72 32 2d 2d 2d 2d 2d 2d  |--.Header2------|
00000020  2d 2d 2d 2d 2d 2d 2d 2d  2d 2d 2d 2d 2d 2d 2d 2d  |----------------|
00000030  2d 2d 2d 09 48 65 61 64  65 72 33 2d 2d 2d 2d 2d  |---.Header3-----|
00000040  2d 2d 2d 2d 0a 52 65 63  31 20 46 69 65 6c 64 31  |----.Rec1 Field1|
00000050  09 52 65 63 31 20 46 69  65 6c 64 32 09 52 65 63  |.Rec1 Field2.Rec|
00000060  31 20 46 69 65 6c 64 33  0a 52 65 63 32 20 46 69  |1 Field3.Rec2 Fi|
00000070  65 6c 64 31 09 52 65 63  32 20 46 69 65 6c 64 32  |eld1.Rec2 Field2|
00000080  09 52 65 63 32 20 46 69  65 6c 64 33 0a 52 65 63  |.Rec2 Field3.Rec|
00000090  33 20 46 69 65 6c 64 31  09 52 65 63 33 20 46 69  |3 Field1.Rec3 Fi|
000000a0  65 6c 64 32 09 52 65 63  33 20 46 69 65 6c 64 33  |eld2.Rec3 Field3|
000000b0  0a                                                |.|
000000b1
If you don’t have gawk, you can use Perl, which is more straightforward anyway. We use a nonprinting while input loop (-n), unpack each record ($_) as it’s read, and turn the resulting list back into a scalar by joining the elements with a tab. We then print each record, adding a newline at the end:
$ perl -ne 'print join("\t", unpack("A18 A32 A16", $_) ) . "\n";' \
> fixed-length_file
Header1----------- → Header2------------------------- → Header3---------
Rec1 Field1 → Rec1 Field2 → Rec1 Field3
Rec2 Field1 → Rec2 Field2 → Rec2 Field3
Rec3 Field1 → Rec3 Field2 → Rec3 Field3

$ perl -ne 'print join("\t", unpack("A18 A32 A16", $_) ) . "\n";' \
> fixed-length_file |
> hexdump -C
00000000  48 65 61 64 65 72 31 2d  2d 2d 2d 2d 2d 2d 2d 2d  |Header1---------|
00000010  2d 2d 09 48 65 61 64 65  72 32 2d 2d 2d 2d 2d 2d  |--.Header2------|
00000020  2d 2d 2d 2d 2d 2d 2d 2d  2d 2d 2d 2d 2d 2d 2d 2d  |----------------|
00000030  2d 2d 2d 09 48 65 61 64  65 72 33 2d 2d 2d 2d 2d  |---.Header3-----|
00000040  2d 2d 2d 2d 0a 52 65 63  31 20 46 69 65 6c 64 31  |----.Rec1 Field1|
00000050  09 52 65 63 31 20 46 69  65 6c 64 32 09 52 65 63  |.Rec1 Field2.Rec|
00000060  31 20 46 69 65 6c 64 33  0a 52 65 63 32 20 46 69  |1 Field3.Rec2 Fi|
00000070  65 6c 64 31 09 52 65 63  32 20 46 69 65 6c 64 32  |eld1.Rec2 Field2|
00000080  09 52 65 63 32 20 46 69  65 6c 64 33 0a 52 65 63  |.Rec2 Field3.Rec|
00000090  33 20 46 69 65 6c 64 31  09 52 65 63 33 20 46 69  |3 Field1.Rec3 Fi|
000000a0  65 6c 64 32 09 52 65 63  33 20 46 69 65 6c 64 33  |eld2.Rec3 Field3|
000000b0  0a                                                |.|
000000b1
See the Perl documentation for the pack and unpack template formats.
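If neither gawk nor Perl is available, the same split can be approximated with the substr function found in any POSIX awk. This is a minimal sketch rather than one of the recipe's own solutions: the function name fixed_to_tsv is hypothetical, and the widths 18, 32, and 16 are hardcoded from the example above.

```bash
#!/usr/bin/env bash
# fixed_to_tsv is a hypothetical name. The widths 18, 32, and 16 are the ones
# used in this recipe; adjust the substr() offsets for other layouts.
fixed_to_tsv() {
    awk '{
        f1 = substr($0,  1, 18)
        f2 = substr($0, 19, 32)
        f3 = substr($0, 51, 16)
        # Strip the space padding from the end of each field.
        sub(/ +$/, "", f1); sub(/ +$/, "", f2); sub(/ +$/, "", f3)
        print f1 "\t" f2 "\t" f3
    }' "$@"
}

# Build one padded record with printf and convert it.
printf '%-18s%-32s%-16s\n' 'Rec1 Field1' 'Rec1 Field2' 'Rec1 Field3' |
    fixed_to_tsv
```

Trimming each field separately avoids the two-pass gsub cleanup, at the cost of spelling out every offset by hand.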
Anyone with any Unix background will automatically use some kind of delimiter in output, since the textutils toolchain is never far from mind, so fixed-length (a.k.a. fixed-width) records are rare in the Unix world. They are very common in the mainframe world, however, so they will occasionally crop up in large applications that originated on big iron, such as some applications from SAP. As we’ve just seen, it’s no problem to handle them.
One caveat to this recipe is that it requires each record to end in a newline. Many old mainframe record formats don’t, in which case you can use Recipe 13.18 to add newlines to the end of each record before processing.
Preprocess the file and add line breaks in appropriate places. For example, OpenOffice’s OpenDocument Format (ODF) files are basically zipped XML files. It is possible to unzip them and grep the XML, which we did a lot while writing this book. See Recipe 12.5 for a more comprehensive treatment of ODF files. In this example, we insert a newline after every closing angle bracket (>). That makes it much easier to process the file using grep or other textutils. Note that we must enter a backslash followed immediately by the Enter key to embed an escaped newline in the sed script:
$ wc -l content.xml
1 content.xml

$ sed -e 's/>/>\
> /g' content.xml | wc -l
1687
If you have fixed-length records with no newlines, do this instead, where 48 is the length of the record:
$ cat fixed-length Line_1_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_2__ aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_3__ aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_4__ aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_5__ aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_6__ aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_7__ aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_8__ aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_9__ aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_10_ aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_11_ aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_12_ aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ $ wc -l fixed-length 1 fixed-length $ sed 's/.{48}/& > /g;' fixed-length Line_1__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_2__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_3__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_4__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_5__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_6__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_7__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_8__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_9__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_10_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_11_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_12_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ $ perl -pe 's/(.{48})/$1 /g;' fixed-length Line_1__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_2__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_3__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_4__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_5__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_6__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_7__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_8__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_9__aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_10_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_11_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_12_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
This happens often when people create output programmatically, especially using canned modules and especially with HTML or XML output.
Note the sed substitutions have an odd construct that allows an embedded newline. In sed, a literal ampersand (&) on the righthand side (RHS) of a substitution is replaced by the entire expression matched on the lefthand side (LHS), and the trailing \ on the first line escapes the newline so you don’t get an error like “sed: -e expression #1, char 11: unterminated ‘s’”. This is needed because not every version of sed recognizes \n as a newline on the RHS of s/// (GNU sed does, but it is not portable).
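For the fixed-length case, the standard fold utility offers another route that avoids the embedded-newline construct altogether. This is a different tool from the sed and Perl commands in the solution, shown here only as a minimal sketch: the function name split_records is hypothetical, the sample uses 16-byte records for brevity, and -b counts bytes, so it assumes single-byte characters.

```bash
#!/usr/bin/env bash
# split_records is a hypothetical name: split_records LENGTH < input
# fold -b -w N breaks the input after every N bytes.
split_records() {
    fold -b -w "$1"
}

# Two 16-byte records with no newline between or after them.
printf 'Line_1__aaaaZZZZLine_2__bbbbZZZZ' | split_records 16
```

Check whether the last record ends in a newline before passing the result to tools that expect one, since fold only inserts breaks between records.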
Effective awk Programming, 4th Edition, by Arnold Robbins (O’Reilly)
sed & awk, 2nd Edition, by Arnold Robbins and Dale Dougherty (O’Reilly)
Use awk to convert the data into CSV format:
$ awk 'BEGIN { FS="\t"; OFS="\",\"" } { gsub(/"/, "\"\""); $1 = $1;
> printf "\"%s\"\n", $0}' tab_delimited
"Line 1","Field 2","Field 3","Field 4","Field 5 with ""internal"" double-quotes"
"Line 2","Field 2","Field 3","Field 4","Field 5 with ""internal"" double-quotes"
"Line 3","Field 2","Field 3","Field 4","Field 5 with ""internal"" double-quotes"
"Line 4","Field 2","Field 3","Field 4","Field 5 with ""internal"" double-quotes"
$
You can do the same thing in Perl also:
$ perl -naF'\t' -e 'chomp @F; s/"/""/g for @F; print q(").join(q(","),@F)
> .qq("\n);' tab_delimited
"Line 1","Field 2","Field 3","Field 4","Field 5 with ""internal"" double-quotes"
"Line 2","Field 2","Field 3","Field 4","Field 5 with ""internal"" double-quotes"
"Line 3","Field 2","Field 3","Field 4","Field 5 with ""internal"" double-quotes"
"Line 4","Field 2","Field 3","Field 4","Field 5 with ""internal"" double-quotes"
$
First of all, it’s tricky to define exactly what CSV really means. There is no formal specification, and various vendors have implemented various versions. Our version here is very simple, and should hopefully work just about anywhere. We place double quotes around all fields (some implementations only quote strings, or strings with internal commas), and we double internal double quotes.
To do that, we have awk split up the input fields using a tab as the field separator and set the output field separator (OFS) to ",", which will provide the trailing quote for each field and then the leading quote for the next field as well as the comma in between them. We then globally replace any double quotes with two double quotes, make an assignment so awk rebuilds the record with our specified OFS (see the awk trick in Recipe 13.15), and print out the record with leading and trailing double quotes. We have to escape double quotes in several places, which looks a little cluttered, but otherwise this is very straightforward.
Unlike the previous recipe for converting to CSV, there is no easy way to do this, since it’s tricky to define exactly what CSV really means.
Possible solutions for you to explore are:
Perl: Mastering Regular Expressions, 3rd Edition, by Jeffrey E. F. Friedl (O’Reilly) has a regex to do this; see also CPAN, the Comprehensive Perl Archive Network, for various modules.
Load the CSV file into a spreadsheet (LibreOffice’s Calc and Microsoft’s Excel both work), then copy and paste the contents into a text editor; you should get tab-delimited output that you can now use easily.
As noted in Recipe 13.19, there is no formal specification for CSV, and that fact, combined with data variations, makes this task much harder than it sounds.