Searching and replacing throughout multiple documents with sed

Back in Chapter 6, we talked about sed and how to use it to search and replace throughout files, one file at a time. Although we're sure you're still coming down off of the power rush from doing that, we'll now show you how to combine sed with shell scripts and loops. In doing this, you can take your search-and-replace criteria and apply them to multiple documents. For example, you can search through all of the .html documents in a directory and make the same change to all of them. In this example (Figure 16.5), we strip out all of the <BLINK> tags, which are offensive to some HTML purists.

Figure 16.5. Create a script to search and replace in multiple documents.


Before you get started, you might have a look at Chapter 6 for a review of sed basics and Chapter 10 for a review of scripts and loops.

To search and replace throughout multiple documents:

1.
vi thestinkinblinkintag

Use the editor of your choice to create a new script. Name the file whatever you want.

2.
#! /bin/sh

Start the shell script with the name of the program that should run the script.

3.
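for i in `ls -1 *.htm*`

Start a loop. The backquotes run ls -1 (that's the digit one, which lists one file name per line) and hand the resulting names to the loop. In this case, the loop will process all of the .htm or .html documents in the current directory.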

4.
do

Indicate the beginning of the loop content.

5.
cp $i $i.bak

Make a backup copy of each file before you change it. Remember, Murphy is watching you.

6.
sed "s/<\/*BLINK>//g" $i.bak > $i

Specify your search criteria and replacement text. A lot is happening in this line, but don't panic. From the left, this command contains sed followed by

  • ", which starts the command.

  • s/, which tells sed to search for something.

  • <, which is the first character to be searched for.

  • \/, which allows you to search for the /. (The \ escapes the / so the / can be used in the search.)

  • *, which specifies zero or more of the previous character (/), which takes care of both the opening and closing tags (with and without a / at the beginning).

  • BLINK>, which indicates the rest of the text to search for. Note that this only searches for capital letters. You'll want to add a line if your HTML documents might use lowercase tags.

  • //, which ends the search section and the replace section (there's nothing in the replace section because the tag will be replaced with nothing).

  • g, which tells sed to make the change in all occurrences (globally), not just in the first occurrence on each line.

  • ", which closes the command.

  • $i.bak, which names the backup copy from the previous step as the input. ($i is replaced with each file name in turn as the loop runs.)

  • > $i, which indicates that the output is redirected back to the original file name. Read from the backup rather than from $i itself: the shell empties the output file before sed starts, so reading and writing the same file would leave it empty.

(See Code Listing 16.1.)

Code Listing 16.1. You can even use sed to strip out bad HTML tags, as shown here.
[ejr@hobbes scripting]$ more thestinkinblinkintag
#! /bin/sh
for i in `ls -1 *.htm*`
do
cp $i $i.bak
sed "s/<\/*BLINK>//g" $i.bak > $i
echo "$i is done!"
done
[ejr@hobbes scripting]$ chmod u+x thestinkinblinkintag
[ejr@hobbes scripting]$ ./thestinkinblinkintag
above.htm is done!
file1.htm is done!
file2.htm is done!
html.htm is done!
temp.htm is done!
[ejr@hobbes scripting]$

7.
echo "$i is done!"

Optionally, print a status message onscreen, which can be reassuring if there are a lot of files to process.

8.
done

Indicate the end of the loop.

9.
Save your script and close the editor.

10.
Try it out.

Remember to make your script executable with chmod u+x and the file name, then run it with ./thestinkinblinkintag. In our example, we'll see the "success reports" for each of the HTML documents processed (Code Listing 16.1).
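
The steps above can also be sketched in a slightly more defensive form. This is a variation under stated assumptions, not the script from the listing: it lets the shell expand the file names itself instead of calling ls, quotes "$i" so that file names containing spaces survive, matches only names ending in .htm or .html (so a second run doesn't pick up the .bak copies), and adds a second sed expression for lowercase tags. The function name strip_blink is hypothetical.

```shell
#!/bin/sh
# strip_blink: hypothetical helper (not from the original listing).
# Removes <BLINK> and </BLINK> tags, upper- or lowercase, from every
# .htm and .html file in the given directory (default: current directory).
strip_blink() {
    dir=${1:-.}
    for i in "$dir"/*.htm "$dir"/*.html
    do
        # If a pattern matched nothing, the shell leaves it as-is; skip it.
        [ -f "$i" ] || continue
        # Back up first; leave the file alone if the backup fails.
        cp "$i" "$i.bak" || continue
        # Read from the backup and write to the original file name.
        sed -e 's/<\/*BLINK>//g' -e 's/<\/*blink>//g' "$i.bak" > "$i"
        echo "$i is done!"
    done
}
```

Saved in a script, a final line such as strip_blink . would process the current directory. Mixed-case tags such as <Blink> would still need their own expression.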

Tip

You could perform any number of other operations on the files within the loop, if you wanted. For example, you could strip out other codes, replace a former Webmaster's address with your own, or automatically insert comments and last-update dates.
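
As a rough illustration of that idea, the sketch below swaps one e-mail address for another and appends a last-updated comment to each file. Everything named here is an assumption made for the example: the function name update_pages, the variables OLD_ADDR and NEW_ADDR, and both placeholder addresses.

```shell
#!/bin/sh
# update_pages: hypothetical helper illustrating extra per-file operations.
# OLD_ADDR and NEW_ADDR are placeholder addresses, not real ones.
# The dot in OLD_ADDR is escaped so sed treats it as a literal dot.
OLD_ADDR='oldmaster@example\.com'
NEW_ADDR='webmaster@example.com'

update_pages() {
    dir=${1:-.}
    for i in "$dir"/*.htm "$dir"/*.html
    do
        # Skip patterns that matched no files.
        [ -f "$i" ] || continue
        cp "$i" "$i.bak" || continue
        # Replace the old address with the new one, reading from the backup.
        sed "s/$OLD_ADDR/$NEW_ADDR/g" "$i.bak" > "$i"
        # Append an HTML comment recording the date the file was processed.
        echo "<!-- Last updated: $(date '+%Y-%m-%d') -->" >> "$i"
        echo "$i is done!"
    done
}
```

Calling update_pages with a directory name processes the files there; with no argument it works in the current directory.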


Code Listing 16.2. Use awk to generate quick reports.
[ejr@hobbes /home]$ ls -la | awk '{print $9 " owned by " $3 } END { print NR " Total Files" }'
owned by
. owned by root
.. owned by root
admin owned by admin
anyone owned by anyone
asr owned by asr
awr owned by awr
bash owned by bash
csh owned by csh
deb owned by deb
debray owned by debray
ejr owned by ejr
ejray owned by ejray
ftp owned by root
httpd owned by httpd
lost+found owned by root
merrilee owned by merrilee
oldstuff owned by 1000
pcguest owned by pcguest
raycomm owned by pcguest
samba owned by root
shared owned by root
22 Total Files
[ejr@hobbes /home]$
