Chapter 6. Processing Text with sed

When you need to edit a file, you typically open up your favorite editor, perform the change, and then save the file and exit. Editors are great for modifying files and seem to be suitable for any type of editing needed. However, imagine you have a web site with a couple of thousand HTML files that need the copyright year at the bottom changed from 2004 to 2005. The interactive nature of editors would require you to type every change that you need to make. You would launch your editor and open each file individually and, like an automaton, make the change, save, exit, repeat. After spending hours performing the same change on thousands of files, you realize you've forgotten about a whole section of the web site and actually have several thousand more, and next year you will need to do this again, with more files. There has to be a better way.

Fortunately, there is. This chapter introduces you to sed, an intelligent text-processing tool that will save you not only time but also, more important, your sanity. The sed command gives you the power to perform these changes on the command line, or in a shell script, with very little headache. Even better, sed will allow you to repeat the advanced batch editing of files simply. Sed can be run on the command line and is a powerful addition to any shell scriptwriter's toolbox. Learning the building blocks of sed will enable you to create tools to solve complex problems automatically and efficiently.

This chapter introduces you to the building blocks of sed by covering the following subjects:

  • Getting and installing sed

  • Methods of invoking sed

  • Selecting lines to operate on

  • Performing substitutions with sed

  • Advanced sed invocation

  • Advanced addressing

  • Common one-line sed scripts

Introducing sed

In this chapter, I give you a gentle introduction to sed and its powerful editing capabilities. Learning sed can take some time, but the investment pays off tenfold in time saved. It can be frustrating to figure out how to use sed to do what you need automatically, and at times you may decide you could do the rote changes interactively in less time. However, as you sharpen your skills, you'll find yourself using sed more frequently and in better ways. Soon you will revel in the divine realm of automation and reach the programmer's version of nirvana.

The name sed means stream editor. It's designed to perform edits on a stream of data. Imagine a bubbling stream of cool mountain water filled with rainbow trout. You know that this stream empties into a sewage system a few miles down, and although you aren't a big trout fan you want to save the fish. So you do a little work to reroute the stream through a pipe to drain into a healthy lake. (The process of piping is discussed in Chapter 8.) With sed, you can do a little magic while the stream is flowing through the pipe to the lake. With a simple sed statement, the stream and all the trout would flow into the pipe, and out would come the same icy mountain stream, filled with catfish instead of trout. You could also change the cool water into iced tea, but that won't help the fish. Using your traditional text editor is like manually replacing each trout in the stream with catfish by hand; you'd be there forever, fish frustratingly slipping out of your grasp. With sed, it's a relatively simple and efficient task to make global changes to all the data in your stream.

Sed is related to a number of other Unix utilities, and what you learn in this chapter about sed will be useful for performing similar operations using utilities such as vi and grep. Sed is derived originally from the basic line editor ed, an editor you will find on most every Unix system but one that is rarely used because of its difficult user interface. (Although unpopular as an editor, ed continues to be distributed with Unix systems because the requirements to use this editor are very minimal, and thus it is useful in dire recovery scenarios when all other editors may fail because their required libraries are not available.)

Sed is shell independent; it works with whatever shell you want to use it with. Because the default shell on most systems is Bash, the examples here are based on the Bash shell.

Sed can be ugly and frightening, but it is also quite powerful. Maybe you've seen some unintelligible, scary sed, such as the following line, and are wary of learning it because it looks like gibberish:

sed '/\n/!G;s/\(.\)\(.*\n\)/&\2\1/;//D;s/.//' myfile.txt

Even someone who knows sed well would have to puzzle over this line for a while before they understood that this reversed the order of every character in every line of myfile.txt, effectively creating a mirror image of the file. This line makes most people's heads spin. But don't worry. I'll start off with the basics, giving you a strong foundation so that you'll no longer find sed frightening and unintelligible.

sed Versions

Sed is a brick-and-mortar Unix command. It comes standard with nearly every Unix that exists, including Linux and Mac OS X, and it generally does not need to be installed, as it is such an essential shell command. However, it is possible that some systems don't ship with sed or come with a version that doesn't have the same features as others.

In fact, there is a dizzying array of sed implementations. There are free versions, shareware versions, and commercial versions. Different versions may have different options, and some of the examples in this chapter, especially the more advanced ones, may not work as presented with every version.

The most common version is arguably GNU sed, currently at revision 4.1.2. This version is used in the examples throughout this chapter. The GNU sed has a number of extensions that the POSIX sed does not have, making things that used to be difficult much simpler. If you need multibyte support for Japanese characters, there is a BSD implementation that offers those extensions. There is a version of sed called ssed (super-sed) that has more features than GNU sed and is based on the GNU sed code-base. There are versions of sed that are designed for constrained environments so they are small and fast (minised), versions that can be plugged into the Apache web server (mod_sed), and color versions of sed (csed). Most implementations will do the basics of what I cover here, so it is not necessary that you have GNU sed; however, you may find that the extensions that GNU sed offers will make your life easier.

Mac OS X comes with the BSD version of sed; GNU/Linux tends to distribute GNU sed. If your operating system is something else, you will be able to find a version of sed that works for you. The sed Frequently Asked Questions has an entire section devoted to the different versions of sed and where you can find one that works for your operating system (see http://sed.sourceforge.net/sedfaq2.html#s2.2).

Not all sed implementations are without cost. Commercial versions of sed are available, useful because many of them include support or provide sed for an esoteric or outdated operating system. Aside from that reason, they don't offer much more than GNU sed, probably have fewer features, and do not adhere as strictly to POSIX standards.

Sed is generally found at /bin/sed or /usr/bin/sed.

To see what version you have on your system, type the following command:

$ sed --version
GNU sed version 4.1.2
Copyright (C) 2003 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE,
to the extent permitted by law.

If this doesn't work, try just typing sed by itself on the command line to see if you get anything at all. You may have to specify /bin/sed or /usr/bin/sed. If the --version argument is not recognized by your system, you are not running GNU sed. In that case, try the following command to get the current version number:

$ strings /bin/sed | grep -i ver

Installing sed

If you find that you don't have any version of sed installed, I recommend getting a version targeted for your operating system or the one directly from GNU. Mac OS X comes with a BSD version of sed, but you can easily install the GNU version through fink (http://fink.sourceforge.net/). On Debian GNU/Linux you can install sed as root by typing apt-get install sed.

Installing GNU sed by hand is not very difficult. The process is even less difficult if you have a system that already has another version of sed installed, as the GNU sed installation requires some form of sed installed to install itself. This sounds like a chicken-and-egg problem, but the GNU sed provides the necessary bootstrap sed as part of the installation to resolve this. You can get the latest .tar.gz file of GNU sed from ftp://ftp.gnu.org/pub/gnu/sed/. After you have obtained the sed tar, you uncompress it as you would any normal tar file:

$ tar -zxf sed-4.1.2.tar.gz
$ cd sed-4.1.2

Read the README file that is included with the source for specific instructions.

Bootstrap Installation

If you are building sed on a system that has no preexisting version of sed, you need to follow a bootstrap procedure outlined in README.boot. (If you have the BSD version of sed, you won't need to do this and can skip to the section Configuring and Installing sed.) This is because the process of making sed requires sed itself. The standard GNU autoconf configure script uses sed to determine system-dependent variables and to create Makefiles.

To bootstrap the building of sed, you run the shell script bootstrap.sh. This attempts to build a basic version of sed that works for the configure script. This version of sed is not fully functional and should not be used typically for anything other than bootstrapping the build process.

You should see output like the following when you run bootstrap.sh:

$ sh ./bootstrap.sh
Creating basic config.h...
+ rm -f 'lib/*.o' 'sed/*.o' sed/sed
+ cd lib
+ rm -f regex.h
+ cc -DHAVE_CONFIG_H -I.. -I. -c alloca.c

It continues building and may report a number of compiler warnings. Don't worry about these; however, if you get errors and the bootstrap version of sed fails to build, you will need to edit the config.h header file that was created for your system. Read the README.boot file and the comments that are contained in the config.h file to determine how to solve this. On most systems, however, this bootstrap version of sed should build fine.

Once the build has completed, you need to install the bootstrapped version of sed somewhere in your $PATH so you can build the full version. To do this, simply copy the sed binary that was built in the sed directory to somewhere in your $PATH. In the following example you create the bin directory in your home directory, append that path to your existing $PATH environment variable, and then copy the sed binary that was created in the bootstrap procedure into your $HOME/bin directory. This will make this version of sed available for the remainder of the build process.

$ mkdir $HOME/bin
$ export PATH=$PATH:$HOME/bin
$ cp sed/sed $HOME/bin

Configuring and Installing sed

If you already have a version of sed installed on your system, you don't need to bootstrap the installation but can simply use the following command to configure sed:

$ sh ./configure
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking for gcc... gcc

This will continue to run through the GNU autoconf configuration, analyzing your system for various utilities, variables, and parameters that need to be set or exist on your system before you can compile sed. This can take some time before it finishes. If this succeeds, you can continue with compiling sed itself. If it doesn't, you will need to resolve the configuration problem before proceeding.

To compile sed, simply issue a make command:

$ make
make  all-recursive
make[1]: Entering directory `/home/micah/working/sed-4.1.2'
Making all in intl
make[2]: Entering directory `/home/micah/working/sed-4.1.2/intl'

This will continue to compile sed, which shouldn't take too long. On my system the configuration took longer than the compile. If this succeeds, you can install the newly compiled sed simply by issuing the make install command as root:

$ su
Password:
# make install

Sed will be put in the default locations in your file system. By default, make install installs all the files in /usr/local/bin, /usr/local/lib, and so on. You can specify an installation prefix other than /usr/local using --prefix when running configure; for instance, sh ./configure --prefix=$HOME will make sed so that it will install in your home directory.

Note

Warning! Be very careful that you do not overwrite your system-supplied sed, if it exists. Some underlying systems may depend on that version, and replacing it with something other than what the vendor supplied could result in unexpected behavior. Installing into /usr/local or into your personal home directory is perfectly safe.

How sed Works

Because sed is a stream editor, it does its work on a stream of data it receives from stdin, such as through a pipe, writing its results as a stream of data on stdout (often just your screen). You can redirect this output to a file if that is what you want to do (see Chapter 8 for details on redirecting). Sed doesn't typically modify an original input file; instead you send the contents of your file through a pipe to be processed by sed. This means that you don't need to have a file on the disk with the data you want changed; this is particularly useful if you have data coming from another process rather than already written in a file.
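To make the stream idea concrete, here is a minimal sketch. The file names stream.txt and lake.txt and the variable names are made up for illustration, and the s (substitute) command it uses is covered later in this chapter. The point is only that sed reads from the pipe, writes to stdout, and leaves the original file alone:

```sh
# Work in a throwaway directory so nothing on the system is touched.
workdir=$(mktemp -d)
printf 'one trout\ntwo trout\n' > "$workdir/stream.txt"

# sed reads the stream from the pipe and writes the edited copy to stdout,
# which is redirected into a new file here.
cat "$workdir/stream.txt" | sed 's/trout/catfish/' > "$workdir/lake.txt"

original=$(cat "$workdir/stream.txt")   # the input file is unchanged
edited=$(cat "$workdir/lake.txt")       # the redirected output has the edit

printf 'original:\n%s\nedited:\n%s\n' "$original" "$edited"
rm -r "$workdir"
```

Because the edit happens on the stream, the same sed command would work just as well on the output of any other program in place of cat.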

Invoking sed

Before you get started with some of the examples that follow, you will need some data to work with. The /etc/passwd file, available on all Unix derivatives, contains some useful data to parse with sed. Everyone will have a slightly different /etc/passwd file, so your results may vary slightly. The output that is shown in the following examples and exercises will be based on the following lines from my /etc/passwd file; you can copy this and save it or download it from the Wrox web site.

If for some reason your system's version of /etc/passwd produces unrecognizably different output from the examples, try using this version instead. If you use this version instead of the file on your system, you will need to change the path in each example from /etc/passwd to the specific location where you put this file.

root:x:0:0:root user:/root:/bin/sh
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
bin:x:2:2:bin:/bin:/bin/sh
sys:x:3:3:sys:/dev:/bin/sh
sync:x:4:65534:sync:/bin:/bin/sync
games:x:5:60:games:/usr/games:/bin/sh
man:x:6:12:man:/var/cache/man:/bin/sh
mail:x:8:8:mail:/var/mail:/bin/sh
news:x:9:9:news:/var/spool/news:/bin/sh
backup:x:34:34:backup:/var/backups:/bin/sh

As mentioned previously, sed can be invoked by sending data through a pipe to it. Take a look at how this works by piping your password file through sed using the following command. You should see output similar to that shown here, which lists the command usage description for sed:

$ cat /etc/passwd | sed
Usage: sed [OPTION]... {script-only-if-no-other-script} [input-file]...

  -n, --quiet, --silent
                 suppress automatic printing of pattern space
  -e script, --expression=script
                 add the script to the commands to be executed
  -f script-file, --file=script-file
                 add the contents of script-file to the commands to be executed
  -i[SUFFIX], --in-place[=SUFFIX]
                 edit files in place (makes backup if extension supplied)
  -l N, --line-length=N
                 specify the desired line-wrap length for the `l' command
  --posix
                 disable all GNU extensions.
  -r, --regexp-extended
                 use extended regular expressions in the script.
  -s, --separate
                 consider files as separate rather than as a single continuous
                 long stream.
  -u, --unbuffered
                 load minimal amounts of data from the input files and flush
                 the output buffers more often
      --help     display this help and exit
      --version  output version information and exit

If no -e, --expression, -f, or --file option is given, then the first
non-option argument is taken as the sed script to interpret.  All
remaining arguments are names of input files; if no input files are
specified, then the standard input is read.

E-mail bug reports to: [email protected] .
Be sure to include the word ``sed'' somewhere in the ``Subject:'' field.

Many of the different sed options you see here are covered later in this chapter. The important thing to note right now is that because you did not tell sed what to do with the data you sent to it, sed felt that you needed to be reminded about how to use it.

This command dumps the contents of /etc/passwd to sed through the pipe into sed's pattern space. The pattern space is the internal work buffer that sed uses to do its work, like a workbench where you lay out what you are going to work on.

Simply putting something on a workbench doesn't do anything at all; you need to know what you are going to do with it. Similarly, dumping data into sed's pattern space doesn't do anything at all; you need to tell sed to do something with it. Sed expects to always do something with its pattern space, and if you don't tell it what to do, it considers that an invocation error. Because you incorrectly invoked sed in this example, you found that it spit out to your screen its command usage.

Editing Commands

Sed expects you to provide an editing command. An editing command is what you want sed to do to the data in the pattern space. The following Try It Out example uses the delete-line editing command, known to sed as d. This command will delete each line in the pattern buffer.
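As a minimal sketch of the d command, the following feeds three lines of the sample data to sed through a pipe. The variable names sample and deleted are my own, chosen for illustration:

```sh
# Three lines borrowed from the sample passwd data shown earlier.
sample='root:x:0:0:root user:/root:/bin/sh
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
bin:x:2:2:bin:/bin:/bin/sh'

# d deletes the pattern space for every line, so the automatic print
# at the end of each cycle has nothing left to show.
deleted=$(printf '%s\n' "$sample" | sed 'd')

if [ -z "$deleted" ]; then
    echo "sed printed nothing"
fi
```

Nothing comes out because every line is deleted from the pattern space before sed gets a chance to print it. The data fed into the pipe is not harmed.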

Invoking sed with the -e Flag

Instead of invoking sed by sending a file to it through a pipe, you can instruct sed to read the data from a file, as in the following example.
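A minimal sketch of both invocation styles, using a scratch copy of two sample lines rather than your real /etc/passwd. The file name passwd.sample and the variable names are assumptions for illustration:

```sh
workdir=$(mktemp -d)
cat > "$workdir/passwd.sample" <<'EOF'
root:x:0:0:root user:/root:/bin/sh
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
EOF

# Through a pipe, as before.
from_pipe=$(cat "$workdir/passwd.sample" | sed -e 'd')

# With -e, the editing command follows the flag and the file name is
# given as an ordinary argument; no cat or pipe is needed.
from_file=$(sed -e 'd' "$workdir/passwd.sample")

# sed edited the stream, not the file, so the file is still intact.
on_disk=$(cat "$workdir/passwd.sample")

printf 'pipe: [%s] file: [%s]\n' "$from_pipe" "$from_file"
printf 'still on disk:\n%s\n' "$on_disk"
rm -r "$workdir"
```

Both forms produce the same (empty) output, and the file on disk keeps both of its lines.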

The -n, --quiet, and --silent Flags

As you saw in the preceding examples, sed by default prints out the pattern space at the end of processing its editing commands and then repeats that process.

The -n flag disables this automatic printing so that sed will instead print lines only when it is explicitly told to do so with the p command.

The p command simply means to print the pattern space. If sed prints out the pattern space by default, why would you want to specify it? The p command is generally used only in conjunction with the -n flag; otherwise, you will end up printing the pattern space twice, as demonstrated in the following Try It Out.
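The doubling is easy to see with two lines of the sample data. This is a minimal sketch; the variable names doubled and once are made up for illustration:

```sh
sample='root:x:0:0:root user:/root:/bin/sh
daemon:x:1:1:daemon:/usr/sbin:/bin/sh'

# Without -n: p prints the line, then the automatic print repeats it.
doubled=$(printf '%s\n' "$sample" | sed 'p')

# With -n: the automatic print is off, so only the explicit p shows up.
once=$(printf '%s\n' "$sample" | sed -n 'p')

printf 'sed p:\n%s\n\nsed -n p:\n%s\n' "$doubled" "$once"
```

The first command prints each line twice; the second prints each line exactly once.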

The -n flag has a couple of synonyms; if you find it easier to remember --quiet or --silent, these flags do the same thing.

These are the basic methods for invoking sed. Knowing these will allow you to move forward and use sed in a more practical way. When you are more familiar with some of sed's editing capabilities, you'll be ready for the more advanced methods covered in the Advanced sed Invocation section of this chapter.

sed Errors

It is easy to incorrectly specify your sed editing commands, as the syntax requires attention to detail. If you miss one character, you can produce vastly different results than expected or find yourself faced with a rather cryptic error message.

Sed is not friendly with its error messages, and unfortunately, different versions of sed have different cryptic errors for the same problems. GNU sed tends to be more helpful in indicating what was missing, but it is often very difficult for sed to identify the source of the error, and so it may spit out something that doesn't help much in fixing the problem. I explained in the previous section how GNU sed will output its command usage if you incorrectly invoke it, and you may get other strange errors as well.

Selecting Lines to Operate On

Sed also understands something called addresses. Addresses are either particular locations in a file or a range where a particular editing command should be applied. When sed encounters no addresses, it performs its operations on every line in the file.

The following command adds a basic address to the sed command you've been using:

$ cat /etc/passwd | sed '1d' |more
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
bin:x:2:2:bin:/bin:/bin/sh
sys:x:3:3:sys:/dev:/bin/sh
sync:x:4:65534:sync:/bin:/bin/sync
games:x:5:60:games:/usr/games:/bin/sh
man:x:6:12:man:/var/cache/man:/bin/sh
mail:x:8:8:mail:/var/mail:/bin/sh
news:x:9:9:news:/var/spool/news:/bin/sh
backup:x:34:34:backup:/var/backups:/bin/sh

Notice that the number 1 is added before the delete edit command. This tells sed to perform the editing command on the first line of the file. In this example, sed will delete the first line of /etc/passwd and print the rest of the file. Because your /etc/passwd file may have so many lines in it that the top of the file scrolls by, you can send the result to the pager more. Notice that the following line is missing from the output:

root:x:0:0:root user:/root:/bin/sh

Most Unix systems have the root user as the first entry in the password file, but after performing this command you will see the entire password file, with the root user line missing from the top. If you replace the number 1 with a 2, only the second line is removed.

Address Ranges

So what if you want to remove more than one line from a file? Do you have to tell sed every single line you want to remove? Fortunately not; you can specify a range of lines that you want to be removed by telling sed a starting line and an ending line to perform your editing commands on, as in the following Try It Out.
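Here is a minimal sketch of a range delete on the first five lines of the sample data. The range 2,4 and the variable names are chosen only for illustration:

```sh
sample='root:x:0:0:root user:/root:/bin/sh
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
bin:x:2:2:bin:/bin:/bin/sh
sys:x:3:3:sys:/dev:/bin/sh
sync:x:4:65534:sync:/bin:/bin/sync'

# 2,4d: start deleting at line 2 and stop after line 4.
# Lines 1 and 5 fall outside the range and are printed as usual.
ranged=$(printf '%s\n' "$sample" | sed '2,4d')

printf '%s\n' "$ranged"
```

Only the root and sync lines survive, because the starting and ending lines of a range are both included in it.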

Now that you have a better understanding of how sed applies its commands to its pattern buffer, what do you think sed does if you specify a line number or range that doesn't exist in the file? Sed dutifully looks for the lines that you specify to apply its command, but it never finds them, so you get the entire file printed out with nothing omitted.
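You can check this for yourself with a short sketch. The line numbers 20 and 30 are arbitrary values well past the end of this two-line input:

```sh
sample='root:x:0:0:root user:/root:/bin/sh
daemon:x:1:1:daemon:/usr/sbin:/bin/sh'

# Neither address ever matches a two-line input, so d never runs.
single=$(printf '%s\n' "$sample" | sed '20d')
range=$(printf '%s\n' "$sample" | sed '20,30d')

if [ "$single" = "$sample" ] && [ "$range" = "$sample" ]; then
    echo "nothing was deleted"
fi
```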

If you forget to complete your address range, you receive an error from sed. The cryptic nature of sed's errors means that sed will tell you something is wrong, but not in a helpful way. For example, if you forgot the number after the comma, sed won't understand that you were trying to specify an address range and will complain about the comma:

$ cat /etc/passwd | sed '1,d'
sed: -e expression #1, char 3: unexpected `,'

If you forgot the number before the comma in your address range, sed thinks that you are trying to specify the comma as an editing command and tells you that there is no such command:

$ cat /etc/passwd | sed ',10d'
sed: -e expression #1, char 1: unknown command: `,'

You can also instruct sed to match an address line and certain numbers following that first match.

Suppose you want to match line 4 and the five lines following line 4. You do this by placing a plus sign before the second address number, as in the following command:

$ cat /etc/passwd | sed '4,+5d'
root:x:0:0:root user:/root:/bin/sh
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
bin:x:2:2:bin:/bin:/bin/sh
backup:x:34:34:backup:/var/backups:/bin/sh

This will match line 4 in the file, delete that line, continue to delete the next five lines, and then cease its deletion and print the rest.

Address Negation

By appending an exclamation mark at the end of any address specification, you negate that address match. To negate an address match means to match only those lines that do not match the address range.
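A minimal sketch of negation, again on the first five lines of the sample data. The addresses and variable names are picked for illustration:

```sh
sample='root:x:0:0:root user:/root:/bin/sh
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
bin:x:2:2:bin:/bin:/bin/sh
sys:x:3:3:sys:/dev:/bin/sh
sync:x:4:65534:sync:/bin:/bin/sync'

# 2,3!d: delete every line that is NOT in the range 2 through 3.
kept_range=$(printf '%s\n' "$sample" | sed '2,3!d')

# 1!d: delete every line that is NOT line 1.
kept_first=$(printf '%s\n' "$sample" | sed '1!d')

printf 'kept by 2,3!d:\n%s\n\nkept by 1!d:\n%s\n' "$kept_range" "$kept_first"
```

If you type these at an interactive prompt instead of putting them in a script, keep the single quotes; some shells treat an unquoted exclamation mark specially.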

Address Steps

GNU sed has a feature called address steps that allows you to do things such as selecting every odd line, every third line, every fifth line, and so on.

Address steps are specified in the same way that you specify a delete range, except instead of using a comma to separate the numbers, you use a tilde (~). The number before the tilde is the number that you want the stepping to begin from. If you want to start stepping from the beginning of the file, you use the number 1. The number that follows the tilde is what is called the step increment. The step increment tells sed how many lines to step. The following Try It Outs provide examples of address steps.
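The following is a minimal sketch of address steps. It assumes GNU sed, because the tilde syntax is a GNU extension and other versions of sed will most likely reject it. The variable names are made up for illustration:

```sh
sample='root:x:0:0:root user:/root:/bin/sh
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
bin:x:2:2:bin:/bin:/bin/sh
sys:x:3:3:sys:/dev:/bin/sh
sync:x:4:65534:sync:/bin:/bin/sync'

# 1~2d (GNU sed only): start at line 1 and step by 2,
# so lines 1, 3, and 5 are deleted.
no_odd=$(printf '%s\n' "$sample" | sed '1~2d')

# 1~3d (GNU sed only): start at line 1 and step by 3,
# so lines 1 and 4 are deleted.
no_third=$(printf '%s\n' "$sample" | sed '1~3d')

printf 'after 1~2d:\n%s\n\nafter 1~3d:\n%s\n' "$no_odd" "$no_third"
```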

Substitution

This section introduces you to one of the more useful editing commands available in sed, the substitution command. This command is probably the most important command in sed and has a lot of options.

The substitution command, denoted by s, will substitute any string that you specify with any other string that you specify. To substitute one string with another, you need to have some way of telling sed where your first string ends and the substitution string begins. This is traditionally done by bookending the two strings with the forward slash (/) character.
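As a minimal sketch, here is the root line of the sample data with and without the g flag. The replacement string toor and the variable names are chosen only for illustration:

```sh
line='root:x:0:0:root user:/root:/bin/sh'

# s/root/toor/ : the slashes mark where the search string ends and the
# replacement begins. Only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/root/toor/')

# Adding g after the last slash replaces every match on the line.
all_matches=$(printf '%s\n' "$line" | sed 's/root/toor/g')

printf '%s\n%s\n' "$first_only" "$all_matches"
```

The first command changes only the user name at the start of the line; the second changes all three occurrences of root.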

Substitution Flags

There are a number of other useful flags that can be passed in addition to the g flag, and you can specify more than one at a time.

The following is a full table of all the flags that can be used with the s substitution command.

Flag          Meaning
g             Replace all matches, not just the first match.
NUMBER        Replace only NUMBERth match.
p             If substitution was made, print pattern space.
w FILENAME    If substitution was made, write result to FILENAME. GNU sed
              additionally allows writing to /dev/stderr and /dev/stdout.
I or i        Match in a case-insensitive manner.
M or m        In addition to the normal behavior of the special regular
              expression characters ^ and $, this flag causes ^ to match the
              empty string after a newline and $ to match the empty string
              before a newline.
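The p and w flags are easiest to understand next to the -n option. The following is a minimal sketch; the file name changed.txt and the variable names are assumptions for illustration:

```sh
workdir=$(mktemp -d)
sample='root:x:0:0:root user:/root:/bin/sh
daemon:x:1:1:daemon:/usr/sbin:/bin/sh'

# -n turns off the automatic print, so the p flag prints only the
# lines on which a substitution was actually made.
printed=$(printf '%s\n' "$sample" | sed -n 's/root/toor/p')

# The w flag writes those same lines to a file instead. Everything after
# the w up to the end of the command is taken as the file name.
printf '%s\n' "$sample" | sed -n "s/root/toor/w $workdir/changed.txt"
written=$(cat "$workdir/changed.txt")

printf 'printed: %s\nwritten: %s\n' "$printed" "$written"
rm -r "$workdir"
```

The daemon line contains no root, so it appears in neither the printed output nor the file.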

A useful flag to pass to a substitution command is i or its capital incarnation, I. Both indicate to sed to be case insensitive and match either the uppercase or lowercase of the characters you specify. If the /etc/passwd file had both the strings Root and root, the previous sed operation would match only the lowercase version. To get both throughout the entire file, you specify both the i flag and the g flag, as in the following example. Type the following sed substitution command:

$ cat /etc/passwd | sed 's/Root/toor/ig'
toor:x:0:0:toor user:/toor:/bin/sh
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
bin:x:2:2:bin:/bin:/bin/sh
sys:x:3:3:sys:/dev:/bin/sh
sync:x:4:65534:sync:/bin:/bin/sync
games:x:5:60:games:/usr/games:/bin/sh
man:x:6:12:man:/var/cache/man:/bin/sh
mail:x:8:8:mail:/var/mail:/bin/sh
news:x:9:9:news:/var/spool/news:/bin/sh
backup:x:34:34:backup:/var/backups:/bin/sh

If you specify any number as a flag (NUMBER flag), this tells sed to act on the instance of the string that matched that number. The /etc/passwd file has three instances of the string root in the first line, so if you want to replace only the third match, add the number 3 after the final substitution delimiter, as in the following:

$ cat /etc/passwd | sed 's/root/toor/3' | head -2
root:x:0:0:root user:/toor:/bin/sh
daemon:x:1:1:daemon:/usr/sbin:/bin/sh

With this command, sed searches each line for the third instance of the string root and substitutes the string toor. Only the first line of /etc/passwd has three instances, so only that line changes. (I piped the output through the Unix command head with the flag -2 to limit the output to the first two lines for brevity.)

The POSIX standard doesn't specify what should happen when the NUMBER flag is specified with the g flag, and there is no wide agreement on how this should be interpreted amongst the different sed implementations. The GNU implementation of sed ignores the matches before the NUMBER and then matches and replaces all matches from that NUMBER on.
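A minimal sketch of the GNU behavior follows. It assumes GNU sed; other versions may do something different with this combination of flags or refuse it outright:

```sh
line='root:x:0:0:root user:/root:/bin/sh'

# 2g (GNU sed behavior): leave the first match alone, then replace the
# second match and every match after it.
from_second=$(printf '%s\n' "$line" | sed 's/root/toor/2g')

printf '%s\n' "$from_second"
```

With GNU sed the user name at the start of the line stays root, while the description and the home directory both change to toor.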

Using an Alternative String Separator

You may find yourself having to do a substitution on a string that includes the forward slash character. In this case, you can specify a different separator by providing the designated character after the s. Suppose you want to change the home directory of the root user in the passwd file. It is currently set to /root, and you want to change it to /toor. To do this, you specify a different separator to sed. I use a colon (:) in this example:

$ cat /etc/passwd | sed 's:/root:/toor:' | head -2
root:x:0:0:root user:/toor:/bin/sh
daemon:x:1:1:daemon:/usr/sbin:/bin/sh

Notice this is exactly like doing string substitution with the slash character as a separator; the first string to look for is /root; the replacement is /toor.

It is possible to use the string separator character in your string, but sed can get ugly quickly, so you should try to avoid it by using a different string separator if possible. If you find yourself in the situation where you do need to use the string separator, you can do so by escaping the character. To escape a character means to put a special character in front of the string separator to indicate that it should be used as part of the string, rather than the separator itself. In sed you escape the string separator by putting a backslash before it, like so:

$ cat /etc/passwd | sed 's/\/root/\/toor/' | head -2
root:x:0:0:root user:/toor:/bin/sh
daemon:x:1:1:daemon:/usr/sbin:/bin/sh

This performs exactly the same search and replace as the previous example, this time using the slash as a string separator and escaping the slash that appears in the string /root so it is interpreted properly. If you do not escape this slash, you will have an error in your command, because there will be too many slashes presented to sed and it will spit out an error. The error will vary depending on where in the process sed encounters it, but it will be another example of sed's rather cryptic errors:

sed: -e expression #1, char 10: unknown option to `s'

You can use any separator that you want, but by convention people use the slash separator until they need to use something else, as in this case.

String substitution is not limited to single words. The string you specify is limited only by the string separator that you use, so it is possible to substitute a whole phrase, if you like. The following command replaces the string root user with absolutely power corrupts:

$ cat /etc/passwd | sed 's/:root user/:absolutely power corrupts/g' | head -2
root:x:0:0:absolutely power corrupts:/root:/bin/sh
daemon:x:1:1:daemon:/usr/sbin:/bin/sh

It is often useful to replace strings of text with nothing. This is a funny way of saying deleting words or phrases. For example, to remove a word you simply replace it with an empty string, as in the following Try It Out.
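As a minimal sketch, the following removes the word user (and the space before it) from the root line of the sample data. The choice of word is mine, for illustration:

```sh
line='root:x:0:0:root user:/root:/bin/sh'

# Nothing between the second and third slash means the replacement
# string is empty, so the matched text simply disappears.
trimmed=$(printf '%s\n' "$line" | sed 's/ user//')

printf '%s\n' "$trimmed"
```

Including the leading space in the search string keeps the field from ending in a stray blank.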

Address Substitution

As with deletion, it is possible to perform substitution only on specific lines or on a specific range of lines if you specify an address or an address range to the command.

If you want to substitute the string sh with the string quiet only on line 10, you can specify it as follows:

$ cat /etc/passwd | sed '10s/sh/quiet/g'
root:x:0:0:root user:/root:/bin/sh
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
bin:x:2:2:bin:/bin:/bin/sh
sys:x:3:3:sys:/dev:/bin/sh
sync:x:4:65534:sync:/bin:/bin/sync
games:x:5:60:games:/usr/games:/bin/sh
man:x:6:12:man:/var/cache/man:/bin/sh
mail:x:8:8:mail:/var/mail:/bin/sh
news:x:9:9:news:/var/spool/news:/bin/sh
backup:x:34:34:backup:/var/backups:/bin/quiet

This is just like the line-specific delete command, but you are performing a substitution instead of a deletion. As you can see from the output of this command, the substitution replaces the sh string with quiet only on line 10.

Similarly, to do an address range substitution, you could do something like the following:

$ cat /etc/passwd | sed '1,5s/sh/quiet/g'
root:x:0:0:root user:/root:/bin/quiet
daemon:x:1:1:daemon:/usr/sbin:/bin/quiet
bin:x:2:2:bin:/bin:/bin/quiet
sys:x:3:3:sys:/dev:/bin/quiet
sync:x:4:65534:sync:/bin:/bin/sync
games:x:5:60:games:/usr/games:/bin/sh
man:x:6:12:man:/var/cache/man:/bin/sh
mail:x:8:8:mail:/var/mail:/bin/sh
news:x:9:9:news:/var/spool/news:/bin/sh
backup:x:34:34:backup:/var/backups:/bin/sh

As you can see from the output, the substitution was applied only to the first five lines: sh became quiet on lines 1 through 4, line 5 contains no sh string and so is unchanged, and the rest of the lines were left untouched.

Advanced sed Invocation

You often want to make more than one substitution or deletion or do more complex editing using sed. To use sed most productively requires knowing how to invoke sed with more than one editing command at a time.

You can specify multiple editing commands on the command line in three different ways, explained in the following Try It Out. The editing commands are concatenated and are executed in the order that they appear. You must specify the commands in appropriate order, or your script will produce unexpected results.
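The three ways of passing several editing commands on the command line are the -e flag, semicolons, and a multiline quoted string. A minimal sketch of all three follows; the sample input line and the two substitutions are chosen only for illustration.

```shell
# Sample input; the line is illustrative.
input='root:x:0:0:root user:/root:/bin/sh'

# Method 1: one -e flag per editing command.
out1=$(printf '%s\n' "$input" | sed -e 's/root/toor/' -e 's/sh$/quiet/')

# Method 2: commands separated by semicolons in a single script string.
out2=$(printf '%s\n' "$input" | sed 's/root/toor/; s/sh$/quiet/')

# Method 3: one command per line inside a multiline quoted string.
out3=$(printf '%s\n' "$input" | sed '
s/root/toor/
s/sh$/quiet/')

printf '%s\n' "$out1" "$out2" "$out3"
```

All three invocations run the same two commands in the same order, so each prints the same edited line.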

These three methods are ways to invoke sed by specifying editing commands on the command line. The other method of invoking sed is to specify a file that contains all the commands that you want sed to run. This is useful when your commands become cumbersome on the command line or if you want to save the commands for use in the future. If you make a mistake on the command line, it can be confusing to try to fix that mistake, but if your commands are specified in a file you can simply re-edit that file to fix your mistake.

To specify the file containing the editing commands you want sed to perform, you simply pass the -f flag followed immediately by the file name containing the editing commands, as in the following Try It Out.
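As a rough sketch of the -f invocation, the following stores two editing commands in a file and hands that file to sed. The scratch directory, the file name edits.sed, and the sample input are assumptions made for this example.

```shell
# Work in a scratch directory; all file names here are illustrative.
workdir=$(mktemp -d)
script="$workdir/edits.sed"

# Store the editing commands in a file, one per line.
cat > "$script" <<'EOF'
s/root/toor/
s/sh$/quiet/
EOF

# -f tells sed to read its editing commands from the named file.
out=$(printf '%s\n' 'root:x:0:0:root user:/root:/bin/sh' | sed -f "$script")
printf '%s\n' "$out"

rm -r "$workdir"
```

If a command turns out to be wrong, you edit the script file and rerun the same sed invocation.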

The comment Command

As in most programming languages, it's useful to include comments in your script to remember what different parts do or to provide information for others who might be trying to decipher your script. To add a comment in a sed script, you do what you do in other shell scripting environments: precede the line with the # character. The comment then continues until the next newline.

There are two caveats with the comment command: the first is that comments are not portable to non-POSIX versions of sed. If someone is running a version of sed that is not POSIX-conformant, their sed may not like comments anywhere in your sed script except on the very first line.

The second caveat with the comment command is that if the first two characters of your sed script are #n, the -n (no auto-print) option is automatically enabled. If you find yourself in the situation where your comment on the first line should start with the n character, simply use a capital N or place a space between the # and the n:

# Not going to enable -n
#Not going to enable -n
#no doubt, this will enable -n

The insert, append, and change Commands

The insert and append commands are almost as harmless as the comment command. Both of these commands simply output the text you provide. Insert (i) outputs the text immediately, before the current line is printed, and append (a) outputs the text after the current line is printed.

A classic example that illustrates these commands is converting a standard text file into an HTML file. HTML files have a few tags at the beginning of the files, followed by the body text and then the closing tags at the end. Using i and a, you can create a simple sed script that will add the opening and closing tags to any text file.
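One possible shape for such a script is sketched below. The tag set is deliberately minimal and is an assumption, as are the scratch directory and the sample input; a real page would likely need more tags in its header.

```shell
# Scratch directory; the file names and the minimal tag set are illustrative.
workdir=$(mktemp -d)

# A sed script that wraps any text in opening and closing HTML tags.
cat > "$workdir/txt2html.sed" <<'EOF'
# Insert the opening tags before the first line.
1i\
<html>\
<body>
# Append the closing tags after the last line.
$a\
</body>\
</html>
EOF

html=$(printf '%s\n' 'line one' 'line two' | sed -f "$workdir/txt2html.sed")
printf '%s\n' "$html"

rm -r "$workdir"
```

The address 1 restricts the insert to the first line, and the address $ restricts the append to the last line, so the tags appear once regardless of how long the input is.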

Advanced Addressing

Knowing exactly where in your file you want to perform your sed operations is not always possible. It's not so easy to know the exact line number or range of numbers you want the command(s) to act upon. Fortunately, sed allows you to apply your knowledge of regular expressions (regexps) to make your addressing much more powerful and useful.

In the Selecting Lines to Operate On section, you learned how to specify addresses and address ranges by specifying a line number or range of line numbers. When addresses are specified in this manner, the supplied editing command affects only the lines that you explicitly denoted in the address.

The same behavior is found when you use regular expressions in addresses; only those lines that match the regular expression will have the editing command applied to them.

Regular Expression Addresses

To specify an address with a regular expression, you enclose the regular expression in slashes. The following example shows the top of my /etc/syslog.conf file (yours may be slightly different):

#  /etc/syslog.conf     Configuration file for syslogd.
#
#                       For more information see syslog.conf(5)
#                       manpage.

#
# First some standard logfiles.  Log by facility.
#

auth,authpriv.*                 /var/log/auth.log

As you can see, the file contains a number of comments and blank lines, followed by the lines that configure syslog. The following Try It Out shows you how to use a simple regular expression to remove all the comments in this file.
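A minimal sketch of removing the comments with a regular expression address follows. To keep it self-contained, it works on a copy of the sample lines held in a variable named sample (an illustrative name) rather than on the real /etc/syslog.conf.

```shell
# A few lines modeled on the top of the sample syslog.conf shown above.
sample=$(printf '%s\n' \
  '#  /etc/syslog.conf     Configuration file for syslogd.' \
  '#' \
  '' \
  '# First some standard logfiles.  Log by facility.' \
  '' \
  'auth,authpriv.*                 /var/log/auth.log')

# /^#/ addresses every line that begins with #; d deletes those lines.
out=$(printf '%s\n' "$sample" | sed '/^#/d')
printf '%s\n' "$out"
```

Only the comment lines are removed; the blank lines and the configuration line pass through unchanged.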

The following table lists four special characters that are very useful in regular expressions.

Character    Description

^            Matches the beginning of lines
$            Matches the end of lines
.            Matches any single character
*            Matches zero or more occurrences of the previous character

In the following Try It Out examples, you use a few of these to get a good feel for how regular expressions work with sed.
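The following sketch exercises each of the four special characters on a small word list. The words and the variable names are made up for this illustration.

```shell
# Sample input: four words and one empty line.
text=$(printf '%s\n' 'bat' '' 'bit' 'boot' 'bt')

# ^ and $ together match an empty line; delete those lines.
no_blank=$(printf '%s\n' "$text" | sed '/^$/d')

# . matches exactly one character, so b.t matches bat and bit only.
one_char=$(printf '%s\n' "$text" | sed -n '/^b.t$/p')

# * allows zero or more of the previous character, so bo*t matches boot and bt.
zero_or_more=$(printf '%s\n' "$text" | sed -n '/^bo*t$/p')

printf '%s\n' "$no_blank" '--' "$one_char" '--' "$zero_or_more"
```

Anchoring each pattern with ^ and $ makes it match the whole line rather than any substring of it.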

Character Class Keywords

Some special keywords are commonly available to regexps, especially GNU utilities that employ regexps. These are very useful for sed regular expressions as they simplify things and enhance readability.

For example, the characters a through z as well as the characters A through Z constitute one such class of characters that has the keyword [[:alpha:]], meaning all alphabetic characters. Instead of having to specify every character in a regular expression, you can simply use this keyword instead, as in the following example.

Using the alphabet character class keyword, this command prints only those lines in the /etc/syslog.conf file that start with a letter of the alphabet:

$ cat /etc/syslog.conf | sed -n '/^[[:alpha:]]/p'
auth,authpriv.*                 /var/log/auth.log

If you instead delete all the lines that start with alphabetic characters, you can see what doesn't fall within the [[:alpha:]] character class keyword:

$ cat /etc/syslog.conf | sed '/^[[:alpha:]]/d'
#  /etc/syslog.conf     Configuration file for syslogd.
#
#                       For more information see syslog.conf(5)
#                       manpage.

#
# First some standard logfiles.  Log by facility.
#

The following table is a complete list of the available character class keywords in GNU sed.

Character Class Keyword    Description

[[:alnum:]]                Alphanumeric [a-z A-Z 0-9]
[[:alpha:]]                Alphabetic [a-z A-Z]
[[:blank:]]                Blank characters (spaces or tabs)
[[:cntrl:]]                Control characters
[[:digit:]]                Numbers [0-9]
[[:graph:]]                Any visible characters (excludes whitespace)
[[:lower:]]                Lowercase letters [a-z]
[[:print:]]                Printable characters (noncontrol characters)
[[:punct:]]                Punctuation characters
[[:space:]]                Whitespace
[[:upper:]]                Uppercase letters [A-Z]
[[:xdigit:]]               Hex digits [0-9 a-f A-F]

Character classes are very useful and should be used whenever possible. They adapt much better to non-English character sets, such as accented characters.

Regular Expression Address Ranges

I demonstrated in the Address Ranges section that specifying two line numbers, separated by a comma, defines a range of lines over which the editing command will be executed.

The same behavior applies with regular expressions. You can specify two regular expressions, separated by a comma, and sed will match all of the lines from the first line that matches the first regular expression all the way up to, and including, the line that matches the second regular expression. The following Try It Out demonstrates this behavior.
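A small sketch of a regular expression address range follows. The BEGIN and END marker lines are invented for the example; any two patterns that match lines in your file would work the same way.

```shell
# Sample input with a marked block in the middle.
config=$(printf '%s\n' 'before' 'BEGIN' 'inside 1' 'inside 2' 'END' 'after')

# Delete from the first line matching BEGIN through the next line matching END.
trimmed=$(printf '%s\n' "$config" | sed '/^BEGIN$/,/^END$/d')

# Print only that range instead (-n suppresses the default output).
block=$(printf '%s\n' "$config" | sed -n '/^BEGIN$/,/^END$/p')

printf '%s\n' "$trimmed" '--' "$block"
```

Both the line that opens the range and the line that closes it are included in the range.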

Combining Line Addresses with regexps

If you want to use a line address in combination with a regular expression, sed won't stop you. In fact, this is an often-used addressing scheme.

Simply specify the line number in the file where you want the action to start working and then use the regular expression to stop the work.
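As a minimal sketch, the following starts the range at line 2 and ends it at the first later line matching a pattern. The sample lines and the STOP marker are assumptions made for the example.

```shell
# Sample input; the STOP marker is illustrative.
lines=$(printf '%s\n' 'alpha' 'beta' 'gamma' 'STOP' 'delta')

# Start at line 2 and continue through the first later line matching STOP.
kept=$(printf '%s\n' "$lines" | sed '2,/STOP/d')
printf '%s\n' "$kept"
```

Line 1 comes before the range and the last line comes after it, so only those two lines survive the delete.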

Advanced Substitution

Doing substitutions with regular expressions is a powerful technique.

Using address ranges with regular expressions simply required taking what you already knew about address ranges and using regular expressions in place of simple line numbers. The same one-to-one mapping works with substitution and regular expressions. You already know that to substitute the string trout with the string catfish throughout the stream.txt file, you simply do the following:

$ cat stream.txt | sed 's/trout/catfish/g'
Imagine a quaint bubbling stream of cool mountain water filled with rainbow catfish and elephants drinking iced tea.

To do regular expression substitutions, you simply map a regular expression onto the literal strings as you mapped the regular expression on top of the literal line numbers in the previous section. Suppose you have a text file with a number of paragraphs separated by blank lines. You can change those blank lines into HTML <p> markers, using a regular expression substitution command:

sed 's/^$/<p>/g'

The first part of the substitution looks for blank lines and replaces them with the HTML <p> paragraph marker.

Add this sed command to the beginning of your txt2html.sed file. Now your HTML converter will add all the necessary headers, convert any blank lines into <p> markers so that they will be rendered better in your browser, and then append the closing HTML tags.
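A sketch of what the combined script might look like follows. The header and footer tags are a minimal assumption, not necessarily the full set your converter uses, and the scratch directory and sample paragraphs are illustrative.

```shell
# Scratch directory; file names and the minimal tag set are illustrative.
workdir=$(mktemp -d)

cat > "$workdir/txt2html.sed" <<'EOF'
# Turn each blank line into a paragraph marker.
s/^$/<p>/
# Opening tags before the first line.
1i\
<html>\
<body>
# Closing tags after the last line.
$a\
</body>\
</html>
EOF

page=$(printf '%s\n' 'First paragraph.' '' 'Second paragraph.' |
  sed -f "$workdir/txt2html.sed")
printf '%s\n' "$page"

rm -r "$workdir"
```

The g flag is omitted here because an empty line can match ^$ only once.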

Referencing Matched regexps with &

Matching by regular expression is useful; however, you sometimes want to reuse what you matched in the replacement. That's not hard if you are matching a literal string that you can identify exactly, but when you use regular expressions you don't always know exactly what you matched. Being able to reuse the matched text is very useful when what your regular expression matches varies.

The sed metacharacter & represents the contents of the pattern that was matched. For instance, say you have a file called phonenums.txt full of phone numbers, such as the following:

5555551212
5555551213
5555551214
6665551215
6665551216
7775551217

You want to surround the area code (the first three digits) with parentheses for easier reading. To do this, you can use the ampersand replacement character, like so:

$ sed -e 's/^[[:digit:]][[:digit:]][[:digit:]]/(&)/g' phonenums.txt
(555)5551212
(555)5551213
(555)5551214
(666)5551215
(666)5551216
(777)5551217

Let's unpack this; it's a little dense. The easy part is that you are doing this sed operation on the file phonenums.txt, which contains the numbers listed. You are doing a regular expression substitution, so the first part of the substitution is what you are looking for, namely ^[[:digit:]][[:digit:]][[:digit:]]. This says that you are looking for a digit at the beginning of the line and then two more digits. Because an area code in the United States is composed of the first three digits, this construction will match the area code. The second half of the substitution is (&). Here, you are using the replacement ampersand metacharacter and surrounding it by parentheses. This means to put in parentheses whatever was matched in the first half of the command. This will turn all of the phone numbers into what was output previously.

This looks nicer, but it would be even nicer if you also included a dash after the second set of three numbers, so try that out.
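One way to add the dash using only the ampersand is sketched below. It assumes two sample numbers in the same format as phonenums.txt and chains a second substitution onto the first; the variable names are illustrative.

```shell
# Two sample numbers in the same format as phonenums.txt.
numbers=$(printf '%s\n' 5555551212 6665551215)

# First wrap the area code in parentheses, then match the closing
# parenthesis plus the next three digits and append a dash to that match.
formatted=$(printf '%s\n' "$numbers" |
  sed -e 's/^[[:digit:]][[:digit:]][[:digit:]]/(&)/' \
      -e 's/)[[:digit:]][[:digit:]][[:digit:]]/&-/')
printf '%s\n' "$formatted"
```

The second command relies on the parenthesis added by the first, which is why the order of the two commands matters.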

Back References

The ampersand metacharacter is useful, but even more useful is the ability to define specific regions in a regular expression so you can reference them in your replacement strings. By defining specific parts of a regular expression, you can then refer back to those parts with a special reference character.

To do back references, you have to first define a region and then refer back to that region. To define a region you insert backslashed parentheses around each region of interest. The first region that you surround with backslashed parentheses is then referenced by \1, the second region by \2, and so on.
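A minimal sketch of back references follows, assuming ten-digit numbers like those in phonenums.txt. The three regions and the output layout are choices made for this example.

```shell
# Split each ten-digit number into three regions and rearrange them:
# \1 is the area code, \2 the next three digits, \3 the last four.
regions=$(printf '%s\n' 5555551212 7775551217 |
  sed 's/^\([[:digit:]]\{3\}\)\([[:digit:]]\{3\}\)\([[:digit:]]\{4\}\)$/(\1) \2-\3/')
printf '%s\n' "$regions"
```

Unlike the ampersand, which always stands for the whole match, each numbered reference can be placed independently in the replacement.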

Hold Space

Like the pattern space, the hold space is another workbench that sed has available. The hold space is a temporary space to put things while you do other things, or look for other lines. Lines in the hold space cannot be operated on; you can only put things in the hold space and take things out from it. Any actual work you want to do on lines has to be done in the pattern space. It's the perfect place to put a line that you found from a search, do some other work, and then pull out that line when you need it. In short, it can be thought of as a spare pattern buffer.

There are a couple of sed commands that allow you to copy the contents of the pattern space into the hold space. (Later, you can use other commands to copy what is in the hold space into the pattern space.) The most common use of the hold space is to make a duplicate of the current line while you change the original in the pattern space.

The following table details the three basic commands that are used for operating with the hold space.

Command    Description of Command's Function

h or H     Overwrite (h) or append (H) the hold space with the contents of the pattern space. In other words, it copies the pattern buffer into the hold buffer.
g or G     Overwrite (g) or append (G) the pattern space with the contents of hold space.
x          Exchange the pattern space and the hold space; note that this command is not useful by itself.

Each of these commands can be used with an address or address range.

The classic way of illustrating the use of the hold space is to take a text file and reverse the order of its lines so that the last line is first and the first is last, as in the following Try It Out.
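A widely circulated way to do this is sketched below. It is offered as one possible form of the line reversal, and the three sample lines are illustrative.

```shell
# 1!G  on every line except the first, append the hold space to the pattern space
# h    copy the resulting pattern space back into the hold space
# $p   on the last line, print the accumulated (reversed) text
reversed=$(printf '%s\n' 'first' 'second' 'third' | sed -n '1!G;h;$p')
printf '%s\n' "$reversed"
```

Each new line is placed in front of everything collected so far, so by the last line the hold space contains the whole input in reverse order.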

More sed Resources

Refer to the following resources to learn even more about sed:

  • You can find the source code for GNU sed at ftp://ftp.gnu.org/pub/gnu/sed.

  • The sed one-liners (see the following section) are fascinating sed commands that are done in one line: http://sed.sourceforge.net/sed1line.txt.

  • The sed FAQ is an invaluable resource: http://sed.sourceforge.net/sedfaq.html.

  • Sed tutorials and other odd things, including a full-color, ASCII breakout game written only in sed, are available at http://sed.sourceforge.net/grabbag/scripts/.

  • The sed-users mailing list is available at http://groups.yahoo.com/group/sed-users/.

  • The man sed and info sed pages have the best information and come with your sed installation.

Common One-Line sed Scripts

The following code contains several common one-line sed commands. These one-liners are widely circulated on the Internet, and there is a more comprehensive list of one-liners available at http://sed.sourceforge.net/sed1line.txt.

The comments indicate the purpose of each script. Most of these scripts take a specific file name immediately following the script itself, although the input may also come through a pipe or redirection:

   # Double space a file
   sed G file

   # Triple space a file
   sed 'G;G' file

   # Under UNIX: convert DOS newlines (CR/LF) to Unix format
   sed 's/.$//' file    # assumes that all lines end with CR/LF
   sed 's/^M$//' file   # in bash/tcsh, press Ctrl-V then Ctrl-M

   # Under DOS: convert Unix newlines (LF) to DOS format
   sed 's/$//' file                     # method 1
   sed -n p file                        # method 2

   # Delete leading whitespace (spaces/tabs) from front of each line
   # (this aligns all text flush left). '^t' represents a true tab
   # character. Under bash or tcsh, press Ctrl-V then Ctrl-I.
   sed 's/^[ ^t]*//' file

   # Delete trailing whitespace (spaces/tabs) from end of each line
   sed 's/[ ^t]*$//' file               # see note on '^t', above

   # Delete BOTH leading and trailing whitespace from each line
   sed 's/^[ ^t]*//;s/[ ^t]*$//' file   # see note on '^t', above

   # Substitute "foo" with "bar" on each line
   sed 's/foo/bar/' file        # replaces only 1st instance in a line
   sed 's/foo/bar/4' file       # replaces only 4th instance in a line
   sed 's/foo/bar/g' file       # replaces ALL instances within a line

   # Substitute "foo" with "bar" ONLY for lines which contain "baz"
   sed '/baz/s/foo/bar/g' file

   # Delete all CONSECUTIVE blank lines from file except the first.
   # This method also deletes all blank lines from top and end of file.
   # (emulates "cat -s")
   sed '/./,/^$/!d' file       # this allows 0 blanks at top, 1 at EOF
   sed '/^$/N;/\n$/D' file     # this allows 1 blank at top, 0 at EOF

   # Delete all leading blank lines at top of file (only).
   sed '/./,$!d' file

   # Delete all trailing blank lines at end of file (only).
   sed -e :a -e '/^\n*$/{$d;N;};/\n$/ba' file

   # If a line ends with a backslash, join the next line to it.
   sed -e :a -e '/\\$/N; s/\\\n//; ta' file

   # If a line begins with an equal sign, append it to the previous
   # line (and replace the "=" with a single space).
   sed -e :a -e '$!N;s/\n=/ /;ta' -e 'P;D' file

Common sed Commands

In addition to the substitution command, which is used most frequently, the following table lists the most common sed editing commands.

Editing Command          Description of Command's Function

#                        Comment. If the first two characters of a sed script are #n, then the -n (no auto-print) option is forced.
{ COMMANDS }             A group of COMMANDS may be enclosed in curly braces to be executed together. This is useful when you have a group of commands that you want executed on an address match.
[address[,address2]]d    Deletes line(s) from pattern space.
n                        If auto-print was not disabled (-n), print the pattern space, and then replace the pattern space with the next line of input. If there is no more input, sed exits.

Less Common sed Commands

The remaining list of commands that are available to you in sed are much less frequently used but are still very useful and are outlined in the following table.

Command                           Usage

: label                           Label a line to reference later for transfer of control via the b and t commands.
[address[,address2]]a\ text       Append text after each line matched by address or address range.
[address[,address2]]b[label]      Branch (transfer control unconditionally) to :label.
[address[,address2]]c\ text       Delete the line(s) matching address and then output the lines of text that follow this command in place of the last line.
[address[,address2]]D             Delete the first part of a multiline pattern space (created by the N command), up to the first newline.
g                                 Replace the contents of the pattern space with the contents of the hold space.
G                                 Add a newline to the end of the pattern space and then append the contents of the hold space to that of the pattern space.
h                                 Replace the contents of the hold space with the contents of the pattern space.
H                                 Add a newline to the end of the hold space and then append the contents of the pattern space to the end of the hold space.
[address[,address2]]i\ text       Immediately output the lines of text that follow this command; each line but the last ends with a "\", which is not printed.
l N                               Print the pattern space using N characters as the line-wrap length. Nonprintable characters and the \ character are printed in C-style escaped form. Long lines are split with a trailing "\" to indicate the split; the end of each line is marked with "$".
N                                 Add a newline to the pattern space and then append the next line of input into the pattern space. If there is no more input, sed exits.
P                                 Print the pattern space up to the first newline.
[address[,address2]]r FILENAME    Read the contents of FILENAME and insert them into the output stream at the end of the current cycle. If FILENAME cannot be read, it is treated as an empty file and nothing is appended. GNU sed also accepts the special file /dev/stdin to read from standard input.
[address[,address2]]w FILENAME    Write to FILENAME the pattern space. The special file names /dev/stderr and /dev/stdout are available to GNU sed. The file is created before the first input line is read. All w commands that refer to the same FILENAME are output without closing and reopening the file.
x                                 Exchange the contents of the hold and pattern spaces.

GNU sed-Specific sed Extensions

The following table is a list of the commands specific to GNU sed. They provide enhanced functionality but reduce the portability of your sed scripts. If you are concerned about your scripts working on other platforms, use these commands carefully!

Editing Command    Description of Command's Function

e [COMMAND]        Without parameters, executes command found in pattern space, replacing pattern space with its output. With parameter COMMAND, interprets COMMAND and sends output of command to output stream.
L N                Fills and joins lines in pattern space to produce output lines of N characters (at most). This command will be removed in future releases.
Q [EXIT-CODE]      Same as common q command, except that it does not print the pattern space. It provides the ability to return an EXIT-CODE.
R FILENAME         Reads in a line of FILENAME and inserts it into the output stream at the end of a cycle. If file name cannot be read or end-of-file is reached, no line is appended. Special file /dev/stdin can be provided to read a line from standard input.
T LABEL            Branch to LABEL if there have been no successful substitutions (s) since last input line was read or branch taken. If LABEL is omitted, the next cycle is started.
v VERSION          This command fails if GNU sed extensions are not supported. You can specify the VERSION of GNU sed required; default is 4.0, as this is the version that first supports this command.
W FILENAME         Write to FILENAME the pattern space up to the first newline. See standard w command regarding file handles.

Summary

As you use sed more and more, you will become more familiar with its quirky syntax and you will be able to dazzle people with your esoteric and cryptic-looking commands, performing very powerful text processing with a minimum of effort.

In this chapter, you learned:

  • The different available versions of sed.

  • How to compile and install GNU sed, even on a system that doesn't have a working version.

  • How to use sed with some of the available editing commands.

  • Different ways to invoke sed: on the command line with the -e flag, separated by semicolons, with the bash multiline method, and by writing sed scripts.

  • How to specify addresses and address ranges by line number, as well as address negation, address stepping, and regular expression addressing.

  • How to perform substitution, the bread and butter of sed: using substitution flags, changing the substitution delimiter, restricting substitution to addresses and address ranges, and substituting with regular expressions.

  • Some of the other basic sed commands: the comment, insert, append, and change commands.

  • What character class keywords are and how to use them.

  • About the & metacharacter and how to do numerical back references.

  • How to use the hold space to give you a little breathing room in what you are trying to do in the pattern space.

The next chapter covers how to read and manipulate text from files using awk. Awk was designed for text processing and works well when called from shell scripts.

Exercises

  1. Use an address range negation to print only the fifth line of your /etc/passwd file. Hint: Use the delete editing command.

  2. Use an address step to print every fifth line of your /etc/passwd file, starting with the tenth.

  3. Use an address step to delete the tenth line of your /etc/passwd file and no other line.

  4. Write a sed command that takes the output from ls -l issued in your home directory and changes the owner of all the files from your username to the reverse. Make sure not to change the group if it is the same as your username.

  5. Do the same substitution as Exercise 4, except this time, change only the first ten entries and none of the rest.

  6. Add some more sed substitutions to your txt2html.sed script. In HTML you have to escape certain characters so that they are displayed properly. Change any occurrences of the ampersand (&) character into &amp; for proper HTML printing. Hint: You will need to escape your replacement. Once you have this working, add a substitution that converts the less than and greater than characters (< and >) to &lt; and &gt; respectively.

  7. Change your txt2html.sed script so that any time it encounters the word trout, it makes it bold by surrounding it with the HTML bold tags (<b> and the closing </b>). Also make the script insert the HTML paragraph marker (<p>) for any blank space it finds.

  8. Come up with a way to remove the dash that follows the second group of digits, so that instead of printing Area code: (555) Second: 555- Third: 1212, you print Area code: (555) Second: 555 Third: 1212.

  9. Take the line reversal sed script shown in the Hold Space section and refactor it so it doesn't use the -n flag and is contained in a script file instead of on the command line.
