Cleaning Up HTML Documents with tidy

If you ever have to develop HTML documents—when developing personal Web sites, completing a class project, or creating Web pages on the job—the tidy utility can be a handy resource for you. If you’re creating HTML pages by hand, you’ll likely make occasional errors. These errors probably won’t cause significant problems with using the pages, but they might make the pages harder to read, harder to maintain, and harder to subject to the scrutiny of your peers. Not to worry; tidy can help!

tidy is not usually included with Linux or Unix distributions, but you can download it from http://tidy.sourceforge.net and install it using the instructions in Chapter 14.

To Clean Up HTML Documents with tidy:

1.
vi sampledoc.html

Use the editor of your choice to create an HTML document. Our sample document is called, well, sampledoc.html (Figure 17.1). Don’t worry about getting the tagging or syntax exactly right; tidy will take care of the details. Save and close your document.

Figure 17.1. Even a flawed HTML document, like this one, can be fixed by tidy.


2.
tidy sampledoc.html

The tidy utility will apply HTML formatting rules and then output a massaged version of your document that is technically correct (Code Listing 17.1). Cool, huh?

Code Listing 17.1. The tidy command is handy for cleaning up HTML documents.
[jdoe@frazz public_html]$ tidy sampledoc.html

Tidy (vers 4th August 2000) Parsing
→ "sampledoc.html" line 10 column 6 -
→ Warning: discarding unexpected </ul>

sampledoc.html: Document content looks
→ like HTML 2.0
1 warnings/errors were found!

<!DOCTYPE html PUBLIC "-//IETF//DTD
→ HTML 2.0//EN">
<html>
<head>
<meta name="generator" content="HTML
→ Tidy, see www.w3.org">
<title>Jdoe's Home Page</title>
</head>
<body>
<h1>Making Unix Work, One Day at a
→ Time</h1>

<p>Read these tips, when I get around to
→ writing them, and weep.</p>

<ul>
<li>To be written</li>

<li>To be written later</li>

<li>To be written next week</li>
</ul>

<address>[email protected]</address>
</body>
</html>

HTML & CSS specifications are available
→ from http://www.w3.org/
To learn more about Tidy see
→ http://www.w3.org/People/Raggett/tidy/
Please send bug reports to Dave Raggett
→ care of <[email protected]>
Lobby your company to join W3C, see
→ http://www.w3.org/Consortium
[jdoe@frazz public_html]$

3.
tidy sampledoc.html > fixedupdoc.html

If you like the results, redirect the output to a new filename, as shown here, or use tidy -m sampledoc.html to replace the original document.
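One caution about the redirect: the shell creates or empties the target file before tidy runs, so a failed run can leave you with an empty fixedupdoc.html. The following is a minimal sketch of one way around that, not something tidy provides. The function name tidy_copy and the temporary filename are our own inventions, and the sketch assumes tidy's usual exit statuses (0 for a clean run, 1 for warnings, 2 for errors).

```shell
#!/bin/sh
# Hypothetical helper (not part of tidy): clean INPUT into OUTPUT,
# leaving any existing OUTPUT untouched if tidy reports errors.
tidy_copy() {
    input=$1
    output=$2
    tmp="$output.tmp.$$"    # scratch file next to the real target

    # The cleaned HTML goes to stdout; tidy's messages go to stderr.
    tidy_status=0
    tidy "$input" > "$tmp" || tidy_status=$?

    # Assumed exit statuses: 0 = clean, 1 = warnings only, 2 = errors.
    if [ "$tidy_status" -ge 2 ]; then
        rm -f "$tmp"
        echo "tidy reported errors in $input; $output not written" >&2
        return "$tidy_status"
    fi

    # Only now replace (or create) the real output file.
    mv "$tmp" "$output"
}
```

After defining the function, you would run tidy_copy sampledoc.html fixedupdoc.html in place of the plain redirect.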

✓ Tips

  • For even spiffier results, we like using tidy -indent -quiet --doctype loose -modify sampledoc.html, which tidily indents the output, suppresses the informative messages from tidy, makes the output an HTML 4 document, and replaces the original with the modified file (Code Listing 17.2). All that, in only one command.

  • Consider using tidy with the sed script (described in the next section) to do a lot of cleanup at once.
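If you have a whole directory of pages to clean up, the flags from the first tip can be wrapped in a loop. This is only a sketch under our own assumptions: tidy_all is a hypothetical function name, the .bak suffix is an arbitrary choice, and the exit statuses are assumed to be 0 for a clean run, 1 for warnings, and 2 for errors.

```shell
#!/bin/sh
# Hypothetical helper: tidy each named file in place, saving FILE.bak first.
tidy_all() {
    failed=0
    for f in "$@"; do
        # Keep a copy of the original, because -modify overwrites the file.
        cp -p "$f" "$f.bak" || { failed=1; continue; }

        tidy_status=0
        tidy -indent -quiet --doctype loose -modify "$f" || tidy_status=$?

        # Assumed exit statuses: 0 = clean, 1 = warnings only, 2 = errors.
        if [ "$tidy_status" -ge 2 ]; then
            echo "tidy reported errors in $f; original kept as $f.bak" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

Call it as tidy_all *.html from the directory that holds your pages, and delete the .bak files once you are happy with the results.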


Code Listing 17.2. The tidy command, with the appropriate flags, performs miracles—almost.
[jdoe@frazz public_html]$ tidy -indent -quiet --doctype loose sampledoc.html
line 10 column 6 -- Warning: discarding
→ unexpected </ul>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML
→ 4.01 Transitional//EN">
<html>
  <head>
    <meta name="generator" content="HTML
    → Tidy, see www.w3.org">
    <title>
      Jdoe's Home Page
    </title>
  </head>
  <body>
    <h1>
      Making Unix Work, One Day at a Time
    </h1>
    <p>
     Read these tips, when I get around
     → to writing them, and weep.
    </p>
    <ul>
      <li>
        To be written
      </li>
      <li>
        To be written later
      </li>
      <li>
        To be written next week
      </li>
     </ul>
     <address>
       [email protected]
     </address>
   </body>
 </html>

HTML & CSS specifications are available
→ from http://www.w3.org/
To learn more about Tidy see
→ http://www.w3.org/People/Raggett/tidy/
Please send bug reports to Dave Raggett
→ care of <[email protected]>
Lobby your company to join W3C, see
→ http://www.w3.org/Consortium
[jdoe@frazz public_html]$
