Chapter 2. Tools

Automatic tools are a critical component of refactoring. Although you can perform most refactoring manually with a text editor, and although I will sometimes demonstrate refactoring that way for purposes of illustration, in practice we almost always use software to help us. To my knowledge no major refactoring browsers are available for HTML at the time of this writing. However, a lot of tools can assist in many of the processes. In this chapter, I’ll explain some of them.

Backups, Staging Servers, and Source Code Control

Throughout this book, I’m going to show you some very powerful tools and techniques. As the great educator Stan Lee taught us, “With great power comes great responsibility.” Your responsibility is to not irretrievably break anything while using these techniques. Some of the tools I’ll show can misbehave. Some of them have corner cases where they get confused. A huge amount of bad HTML is out there, not all of which the tools discussed here have accounted for. Consequently, refactoring HTML requires at least a five-step process.

  1. Identify the problem.

  2. Fix the problem.

  3. Verify that the problem has been fixed.

  4. Check that no new problems have been introduced.

  5. Deploy the solution.

Because things can go wrong, you should not use any of these techniques on a live site. Instead, make a local copy of the site before making any changes. After making changes to your local copy, carefully verify all pages once again before you deploy.

Most large sites today already use staging or development servers where content can be deployed and checked before the public sees it. If you’re just working on a small personal static site, you can make a local copy on your hard drive instead; but by all means work on a copy and check the changes before deploying them. How to check the changes is the subject of the next section.

Of course, even with the most careful checks, sometimes things slip by and are first noticed by end-users. Sometimes a site works perfectly well on a staging server and then has weird problems on the production server due to unrecognized configuration differences. Thus, it’s a very good idea to have a complete backup of the production site that you can restore in case the newly deployed site doesn’t behave as expected. Regular, reliable, tested backups are a must.
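Even a simple mirroring command run on a schedule is better than no backup at all. For example, something along these lines copies a production document root into a dated local directory (the host name, paths, and date here are placeholders for whatever fits your setup):

$ rsync -a www.example.com:/var/www/ ~/backups/www-20080101/

Whichever tool you use, practice restoring from the backup at least once before you need it in an emergency.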

Finally, you should very seriously consider storing all your code, including all your HTML, CSS, and images, in a source code control system. Programmers have been using source code control for decades, but it’s a relatively uncommon tool for web developers and designers. It’s time for that to change. The more complex a site is, the more likely it is that subtle problems will slip in unnoticed at first. When refactoring, it is critical to be able to go back to previous versions, maybe even from months or years ago, to find out which change introduced a bug. Source code control also provides timestamped backups so that it’s possible to revert your site to its state at any given point in time.

I strongly recommend Subversion for web development, mostly because of its strong support for moving files from one directory to another, though its excellent Unicode support and decent support for binary files are also helpful. Most source code control systems are set up for programmers who rarely bother to move files from one directory to another. By contrast, web developers frequently reorganize site structures (more frequently than they should, in fact). Consequently, a system really needs to be able to track histories across file moves. If your organization has already set up some other source code control system such as CVS, Visual SourceSafe, ClearCase, or Perforce, you can use that system instead; but Subversion is likely to work better and cause you fewer problems in the long run.
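In particular, when you reorganize a site, move files with Subversion’s own commands rather than with the file system so that their history moves with them. A minimal sketch, with hypothetical file names:

$ svn move javafaq/oldnews.html javafaq/news/news2002.html
$ svn commit -m "Move old news into the news directory"
$ svn log javafaq/news/news2002.html

Because the move is recorded as a copy with history, svn log on the new path still reports revisions from before the reorganization.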

The topic of managing Subversion could easily fill a book on its own; and indeed, several such books are available. (My favorite is Pragmatic Version Control Using Subversion by Mike Mason [The Pragmatic Bookshelf, 2006].) Many large sites hire people whose sole responsibility is to manage the source code control repository. However, don’t be scared off. Ultimately, setting up Subversion or another source code control repository is no harder than setting up Apache or another web server. You’ll need to read a little documentation. You’ll need to tweak some config files, and you may need to ask for help from a newsgroup or conduct a Google search to get around a rough spot. However, it’s eminently doable, and it’s well worth the time invested.

You can check files into or out of Subversion from the command line if necessary. However, life is usually simpler if you use an editor such as BBEdit that has built-in support for Subversion. Plug-ins are available that add Subversion support to editors such as Dreamweaver that don’t natively support it. Furthermore, products such as TortoiseSVN and SCPlugin integrate Subversion support directly into Windows Explorer or the Mac Finder.

Some content management systems (CMSs) have built-in version control. If yours does, you may not need to use an external repository. For instance, MediaWiki stores a record of all changes that have been made to all pages. It is possible at any point to see what any given page looked like at any moment in time and to revert to that appearance. This is critical for MediaWiki’s model installation at Wikipedia, where vandalism is a real problem. However, even private sites that are not publicly editable can benefit greatly from a complete history of the site over time. Although Wikis are the most common use of version control on the Web, some other CMSs such as Siteline also bundle this functionality.

Validators

There really are standards for HTML, even if nobody follows them. One way to find out whether a site follows HTML standards is to run a page through a validation service. The results can be enlightening. They will provide you with specific details to fix, as well as a good idea of how much work you have ahead of you.

The W3C Markup Validation Service

For public pages, the validator of choice is the W3C’s Markup Validation Service, at http://validator.w3.org/. Simply enter the URL of the page you wish to check, and see what it tells you. For example, Figure 2.1 shows the result of validating my blog against this service.

Figure 2.1. The W3C Markup Validation Service

It seems I had misremembered the syntax of the blockquote element. I had mistyped the cite attribute as the source attribute. This was actually better than I expected. I fixed that and rechecked, as shown in Figure 2.2. Now the page is valid.

Figure 2.2. Desired results: a valid page

It is usually not necessary to check each and every page on your site for these sorts of errors. Most errors repeat themselves. Generally, once you identify a class of errors, it becomes possible to find and automatically fix the problems. For example, here I could simply search for <blockquote source= to find other places where I had made the same mistake.
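If your editor doesn’t do multifile searches, a quick command-line search over your local copy serves the same purpose. For example:

$ grep -rl '<blockquote source=' .

This lists every file containing the same mistake so that you can fix them all in one pass.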

This page was actually cleaner than normal to begin with. I next tried one of the oldest and least-maintained pages I could find on my sites. This generated 144 different errors.

Private Validation

The W3C is reasonably trustworthy. Still, you may not want to submit private pages from your intranet to the site. Indeed, for people dealing with medical, financial, and other personal data, it may be illegal to do so. You can download a copy of the HTML validator scripts from http://validator.w3.org/docs/install and install them locally. Then your pages don’t need to travel outside your firewall when validating.

This style of validation is good when authoring. It is also extremely useful for spot-checking a site to see how much work you’re likely to do, or for working on a single page. However, at some point, you’ll want to verify all of the pages on a site, not just a few of them. For this, an automatic batch validator is a necessity.

The Log Validator

The W3C Markup Validation Service has given birth to the Log Validator (www.w3.org/QA/Tools/LogValidator/), a command-line tool written in Perl that can check an entire site. It can also use a web server’s log files to determine which pages are the most popular and start its analysis there. Obviously, you care a lot more about fixing the page that gets 100 hits a minute than the one that gets 100 hits a year. The Log Validator will provide you with a convenient list of problems to fix, prioritized by seriousness and popularity of page. Listing 2.1 shows the beginning of one such list.

Example 2.1. Log Validator Output

Results for module HTMLValidator
****************************************************************
Here are the 10 most popular invalid document(s) that I could
find in the logs for www.elharo.com.

 Rank   Hits   #Error(s)                                 Address
------ ------ ----------- -------------------------------------
1      2738   21          http://www.elharo.com/blog/feed/atom/
2      1355   21          http://www.elharo.com/blog/feed/
3      1231   3           http://www.elharo.com/blog/
4      1127   6           http://www.elharo.com/
6      738    3           http://www.elharo.com/blog/networks
/2006/03/18/looking-for-a-router/feed/
11     530    3           http://www.elharo.com/journal
/fruitopia.html
20     340    1           http://www.elharo.com/blog
/wp-comments-post.php
23     305    3           http://www.elharo.com/blog/birding
/2006/03/15/birding-at-sd/
25     290    4           http://www.elharo.com/journal
/fasttimes.html
26     274    1           http://www.elharo.com/journal/

The first two pages in this list are Atom feed documents, not HTML files at all. Thus, it’s no surprise that they show up as invalid. The third and fourth ones are embarrassing, though, since they’re my blog’s main page and my home page, respectively. They’re definitely worth fixing. The fifth most visited page on my site is valid, however, so it doesn’t show up in the list. Numbers 11, 25, and 26 are very old pages that predate XML, much less XHTML. It’s no surprise that they’re invalid, but because they’re still getting hits, it’s worth fixing them.

Number 20 is also a false positive. That’s just the comments script used to post comments. When the validator tries to GET it rather than POST to it, it receives a blank page. That’s not a real problem, though I might want to fix it one of these days to show a complete list of the comments. Or perhaps I should simply set up the script to return “HTTP error 405 Method Not Allowed” rather than replying with “200 OK” and a blank document.

After these, various other pages that aren’t as popular are listed. Just start at the top and work your way down.

xmllint

You can also use a generic XML validator such as xmllint, which is bundled on many UNIX machines and is also available for Windows. It is part of libxml2, which you can download from http://xmlsoft.org/.

There are advantages and disadvantages to using a generic XML validator to check HTML. One advantage is that you can separate well-formedness checking from validity checking. It is usually easier to fix well-formedness problems first and then fix validity problems. Indeed, that is the order in which this book is organized. Well-formedness is also more important than validity.

The first disadvantage of using a generic XML validator is that it won’t catch HTML-specific problems that are not spelled out in the DTD. For instance, it won’t notice an a element nested inside another a element (though that problem doesn’t come up a lot in practice). The second disadvantage is that it has to actually read and resolve the DTD before it can check anything, because it assumes nothing about the document it’s checking.

Using xmllint to check for well-formedness is straightforward. Just point it at the local file or remote URL you wish to check from the command line. Use the --noout option to say that the document itself shouldn’t be printed and --loaddtd to allow entity references to be resolved. For example:

$ xmllint --noout --loaddtd http://www.aw.com
http://www.aw-bc.com/:118: parser error : Specification mandate
value for attribute nowrap
<TD class="headerBg" bgcolor="#004F99" nowrap align="left">
                                              ^
http://www.aw-bc.com/:118: parser error :
attributes construct error
<TD class="headerBg" bgcolor="#004F99" nowrap align="left">
                                              ^
http://www.aw-bc.com/:118: parser error : Couldn't find end of
Start Tag TD line 118
<TD class="headerBg" bgcolor="#004F99" nowrap align="left">
                                              ^
http://www.aw-bc.com/:120: parser error : Opening and ending
tag mismatch: IMG line 120 and A
 Benjamin Cummings" WIDTH="84" HEIGHT="64" HSPACE="0"
VSPACE="0" BORDER="0"></A>
...

When you first run a report such as this, the number of error messages can be daunting. Don’t despair—start at the top and fix the problems one by one. Most errors fall into certain common categories that we will discuss later in the book, and you can fix them en masse. For instance, in this example, the first error is a valueless nowrap attribute. You can fix this simply by searching for nowrap and replacing it with nowrap="nowrap". Indeed, with a multifile search and replace, you can fix this problem on an entire site in less than five minutes. (I’ll get to the details of that a little later in this chapter.)

After each change, you run the validator again. You should see fewer problems with each pass, though occasionally a new one will crop up. Simply iterate and repeat the process until there are no more well-formedness errors. The next problem is an IMG element that uses a start-tag rather than an empty-element tag. This one isn’t quite as easy, but you can fix most occurrences by searching for BORDER="0"> and replacing it with border="0" />. That won’t catch all of the problems with IMG elements, but it will fix a lot of them.

It is important to start with the first error in the list, though, and not pick an error randomly. Often, one early mistake can cause multiple well-formedness problems. This is especially true for omitted start-tags and end-tags. Fixing an early problem often removes the need to fix many later ones.

Once you have achieved well-formedness, the next step is to check validity. You simply add the --valid switch on the command line, like so:

$ xmllint --noout --loaddtd --valid valid_aw.html

This will likely produce many more errors to inspect and fix, though these are usually not as critical or problematic. The basic approach is the same, though: Start at the beginning and work your way through until all the problems are solved.

Editors

Many HTML editors have built-in support for validating pages. For example, in BBEdit you can just go to the Markup menu and select Check/Document Syntax to validate the page you’re editing. In Dreamweaver, you can use the context menu that offers a Validate Current Document item. (Just make sure the validator settings indicate XHTML rather than HTML.) In essence, these tools just run the document through a parser such as xmllint to see whether it’s error-free.

If you’re using Firefox, you should install Chris Pederick’s Web Developer plug-in (https://addons.mozilla.org/firefox/60/). Once you’ve done that, you can validate any page by going to Tools/Web Developer/Tools/Validate HTML. This loads the current page in the W3C validator. The plug-in also provides a lot of other useful options in Firefox.

Whatever tool or technique you use to find the markup mistakes, validating is the first step to refactoring into XHTML. Once you see what the problems are, you’re halfway to fixing them.

Testing

In theory, refactoring should not break anything that isn’t already broken. In practice, it isn’t always so reliable. To some extent, the catalog later in this book shows you what changes you can safely make. However, both people and tools do make mistakes, and it’s always possible that refactoring will introduce new bugs. Thus, the refactoring process really needs a good automated test suite. After every refactoring, you’d like to be able to press a button and see at a glance whether anything broke.

Although test-driven development has been a massive success among traditional programmers, it is not yet so common among web developers, especially those working on the front end. In fact, any automated testing of web sites is probably the exception rather than the rule, especially when it comes to HTML. It is time for that to change. It is time for web developers to start to write and run test suites and to use test-driven development.

The basic test-driven development approach is as follows.

  1. Write a test for a feature.

  2. Code the simplest thing that can possibly work.

  3. Run all tests.

  4. If tests passed, goto 1.

  5. Else, goto 2.

For refactoring purposes, it is very important that this process be as automatic as possible. In particular:

  • The test suite should not require any complicated setup. Ideally, you should be able to run it with the click of a button. You don’t want developers to skip running tests because they’re too hard to run.

  • The tests should be fast enough that they can be run frequently; ideally, they should take 90 seconds or less to run. You don’t want developers to skip running tests because they take too long.

  • The result must be pass or fail, and it should be blindingly obvious which it is. If the result is fail, the failed tests should generate more output explaining what failed. However, passing tests should generate no output at all, except perhaps for a message such as “All tests passed.” In particular, you want to avoid the common problem in which one or two failing tests get lost in a sea of output from passing tests.

Writing tests for web applications is harder than writing tests for classic applications. Part of this is because the tools for web application testing aren’t as mature as the tools for traditional application testing. Part of this is because any test that involves looking at something and figuring out whether it looks right is hard for a computer. (It’s easy for a person, but the goal is to remove people from the loop.) Thus, you may not achieve the perfect coverage you can in a Java or .NET application. Nonetheless, some testing is better than none, and you can in fact test quite a lot.

One thing you will discover is that refactoring your code to web standards such as XHTML is going to make testing a lot easier. Going forward, it is much easier to write tests for well-formed and valid XHTML pages than for malformed ones. This is because it is much easier to write code that consumes well-formed pages than malformed ones. It is much easier to see what the browser sees, because all browsers see the same thing in well-formed pages and different things in malformed ones. Thus, one benefit of refactoring is improving testability and making test-driven development possible in the first place. Indeed, with a lot of web sites that don’t already have tests, you may need to refactor them enough to make testing possible before moving forward.

You can use many tools to test web pages, ranging from decent to horrible and from free to very expensive. Some of these are designed for programmers, some for web developers, and some for business domain experts. They include JUnit, HtmlUnit, HttpUnit, JWebUnit, FitNesse, and Selenium, each of which is covered in the sections that follow.

In practice, the rough edges on these tools make it very helpful to have an experienced agile programmer develop the first few tests and the test framework. Once you have an automated test suite in place, it is usually easier to add more tests yourself.

JUnit

JUnit (www.junit.org/) is the standard Java framework for unit testing and the one on which a lot of the more specific frameworks such as HtmlUnit and HttpUnit are built. There’s no reason you can’t use it to test web applications, provided you can write Java code that pretends to be a browser. That’s actually not as hard as it sounds.

For example, one of the most basic tests you’ll want to run is one that tests whether each page on your site is well-formed. You can test this simply by parsing the page with an XML parser and seeing whether it throws any exceptions. Write one method to test each page on the site, and you have a very nice automated test suite for what we checked by hand in the previous section.

Listing 2.2 demonstrates a simple JUnit test that checks the well-formedness of my blog. All this code does is throw a URL at an XML parser and see whether it chokes. If it doesn’t, the test passes. This version requires Sun’s JDK 1.5 or later, and JUnit 3.8 or later somewhere in the classpath. You may need to make some modifications to run this in other environments.

Example 2.2. A JUnit Test for Web Site Well-Formedness

import java.io.IOException;
import junit.framework.TestCase;
import org.xml.sax.*;
import org.xml.sax.helpers.XMLReaderFactory;

public class WellformednessTests extends TestCase {

    private XMLReader reader;

    public void setUp() throws SAXException {
        reader = XMLReaderFactory.createXMLReader(
       "com.sun.org.apache.xerces.internal.parsers.SAXParser");
    }

    public void testBlogIndex()
      throws SAXException, IOException {
        reader.parse("http://www.elharo.com/blog/");
    }

}

You can run this test from inside an IDE such as Eclipse or NetBeans, or you can run it from the command line like so:

$ java -cp .:junit.jar
  junit.swingui.TestRunner WellformednessTests

If all tests pass, you’ll see a green bar as shown in Figure 2.3.

Figure 2.3. All tests pass.

To test additional pages for well-formedness, you simply add more methods, each of which looks exactly like testBlogIndex, just with a different URL. Of course, you can also write more complicated tests. You can test for validity by setting the http://xml.org/sax/features/validation feature on the parser and attaching an error handler that throws an exception if a validity error is detected.
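For instance, the setUp() method from Listing 2.2 could be extended along these lines. This is only a sketch, but the feature string and the ErrorHandler interface are standard SAX, and the imports in Listing 2.2 already cover them:

    public void setUp() throws SAXException {
        reader = XMLReaderFactory.createXMLReader(
       "com.sun.org.apache.xerces.internal.parsers.SAXParser");
        // Ask the parser to validate against the document's DTD.
        reader.setFeature(
          "http://xml.org/sax/features/validation", true);
        // Rethrow validity errors so that the test fails.
        reader.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException ex) {}
            public void error(SAXParseException ex)
              throws SAXException {
                throw ex;
            }
            public void fatalError(SAXParseException ex)
              throws SAXException {
                throw ex;
            }
        });
    }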

You can use DOM, XOM, SAX, or some other API to load the page and inspect its contents. For instance, you could write a test that checks whether all links on a page are reachable. If you use TagSoup as the parser, you can even write these sorts of tests for non-well-formed HTML pages.

You can submit forms using the HttpURLConnection class or run JavaScript using the Rhino engine built into Java 6. This is all pretty low-level stuff, and it’s not trivial to do, but it’s absolutely possible to do it. You just have to roll up your sleeves and start coding.
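For example, a helper method along these lines posts a single form field the way a browser would and returns the HTTP status code, which a test could then assert on. The URL and field name here are hypothetical:

import java.io.*;
import java.net.*;

public class FormPoster {

    // POST one name=value pair; the caller asserts on the result.
    public static int postComment(String comment) throws IOException {
        // Hypothetical comment script; substitute your own form's URL.
        URL url = new URL("http://www.example.com/comments");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type",
          "application/x-www-form-urlencoded");
        String body = "comment=" + URLEncoder.encode(comment, "UTF-8");
        OutputStream out = conn.getOutputStream();
        out.write(body.getBytes("US-ASCII"));
        out.close();
        return conn.getResponseCode();
    }

}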

If nondevelopers are making regular changes to your site, you can set up the test suite to run periodically with cron and to e-mail you if anything unexpectedly breaks. (It’s probably not reasonable to expect each author or designer to run the entire test suite before every check-in.) You can even run the suite continuously using a product such as Hudson or CruiseControl. However, that may fill your logs with a lot of test traffic you probably don’t want to count, so you may wish to run it against the development server instead.
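For example, a crontab entry along these lines runs the suite every night at 3:00 a.m. using JUnit’s text-mode runner and mails the output only when the runner exits with a failure. The paths and the address are placeholders:

0 3 * * * cd /home/elharo/tests && java -cp .:junit.jar junit.textui.TestRunner WellformednessTests > /tmp/tests.out || mail -s "Web tests failed" webmaster@example.com < /tmp/tests.out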

Many similar test frameworks are available for other languages and platforms: PyUnit for Python, CppUnit for C++, NUnit for .NET, and so forth. Collectively these go under the rubric xUnit. Whichever one you and your team are comfortable working with is fine for writing web test suites. The web server doesn’t care what language your tests are written in. As long as you have a one-button test harness and enough HTTP client support to write tests, you can do what needs to be done.

HtmlUnit

HtmlUnit (http://htmlunit.sourceforge.net/) is an open source JUnit extension designed to test HTML pages. It will be most familiar and comfortable to Java programmers already using JUnit for test-driven development. HtmlUnit provides two main advantages over pure JUnit.

  • The WebClient class makes it much easier to pretend to be a web browser.

  • The HTMLPage class has methods for inspecting common parts of an HTML document.

For example, HtmlUnit will run JavaScript that’s specified by an onLoad handler before it returns the page to the client, just like a browser would. Simply loading the page with an XML parser as Listing 2.2 did would not run the JavaScript.

Listing 2.3 demonstrates the use of HtmlUnit to check that all the links on a page are not broken. I could have written this using a raw parser and DOM, but it would have been somewhat more complex. In particular, methods such as getAnchors to find all the a elements in a page are very helpful.

Example 2.3. An HtmlUnit Test for a Page’s Links

import java.io.IOException;
import java.net.*;
import java.util.*;
import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;

import junit.framework.TestCase;

public class LinkCheckTest extends TestCase {

    public void testBlogIndex()
      throws FailingHttpStatusCodeException, IOException {
        WebClient webClient = new WebClient();
        URL url = new URL("http://www.elharo.com/blog/");
        HtmlPage page = (HtmlPage) webClient.getPage(url);
        List links = page.getAnchors();
        Iterator iterator = links.iterator();
        while (iterator.hasNext()) {
            HtmlAnchor link = (HtmlAnchor) iterator.next();
            // Resolve the href against the page URL so that
            // relative links work too.
            URL u = new URL(url, link.getHrefAttribute());
            // Check that we can download this page.
            // If we can't, getPage throws an exception and
            // the test fails.
            webClient.getPage(u);
        }
    }

}

This test is more than a unit test. It checks all the links on a page, whereas a real unit test would check only one. Furthermore, it makes connections to external servers. That’s very unusual for a unit test. Still, this is a good test to have, and it will let us know that we need to fix our pages if an external site breaks links by reorganizing its pages.

HttpUnit

HttpUnit (http://httpunit.sourceforge.net/) is another open source JUnit extension designed to test HTML pages. It is also best suited for Java programmers already using JUnit for test-driven development and is in many ways quite similar to HtmlUnit. Some programmers prefer HttpUnit, and others prefer HtmlUnit. If there’s a difference between the two, it’s that HttpUnit is somewhat lower-level. It tends to focus more on the raw HTTP connection, whereas HtmlUnit more closely imitates a browser. HtmlUnit has somewhat better support for JavaScript, if that’s a concern. However, there’s certainly a lot of overlap between the two projects.

Listing 2.4 demonstrates an HttpUnit test that verifies that a page has exactly one H1 header, and that its text matches the web page’s title. That may not be a requirement for all pages, but it is a requirement for some. For instance, it would be a very apt requirement for a newspaper site.

Example 2.4. An HttpUnit Test That Matches the Title to a Unique H1 Heading

import java.io.IOException;
import org.xml.sax.SAXException;
import com.meterware.httpunit.*;
import junit.framework.TestCase;

public class TitleChecker extends TestCase {

    public void testFormSubmission()
      throws IOException, SAXException {

        WebConversation wc = new WebConversation();
        WebResponse     wr = wc.getResponse(
          "http://www.elharo.com/blog/");
        HTMLElement[]   h1 = wr.getElementsWithName("h1");
        assertEquals(1, h1.length);
        String title = wr.getTitle();
        assertEquals(title, h1[0].getText());

    }

}

I could have written this test in HtmlUnit, too, and I could have written Listing 2.3 with HttpUnit. Which one you use is mostly a matter of personal preference. Of course, these are hardly the only such frameworks. There are several more, including ones not written in Java. Use whichever one you like, but by all means use something.

JWebUnit

JWebUnit is a higher-level API that sits on top of HtmlUnit and JUnit. Generally, JWebUnit tests involve more assertions and less straight Java code. These tests are somewhat easier to write without a large amount of Java expertise, and they may be more accessible to a typical web developer. Furthermore, tests can very easily extend over multiple pages as you click links, submit forms, and in general follow an entire path through a web application.

Listing 2.5 demonstrates a JWebUnit test for the search engine on my web site. It fills in the search form on the main page and submits it. Then it checks that one of the expected results is present.

Example 2.5. A JWebUnit Test for Submitting a Form

import junit.framework.TestCase;
import net.sourceforge.jwebunit.junit.*;

public class LinkChecker extends TestCase {

    private WebTester tester;

    public LinkChecker(String name) {
        super(name);
        tester = new WebTester();
        tester.getTestContext().setBaseUrl(
          "http://www.elharo.com/");
    }

    public void testFormSubmission() {

        // start at this page
        tester.beginAt("/blog/");

        // check that the form we want is on the page
        tester.assertFormPresent("searchform");

        /// check that the input element we expect is present
        tester.assertFormElementPresent("s");

        // type something into the input element
        tester.setTextField("s", "Linux");

        // send the form
        tester.submit();

        // we're now on a different page; check that the
        // text on that page is as expected.
        tester.assertTextPresent("Windows Vista");
    }

}

FitNesse

FitNesse (http://fitnesse.org/) is a Wiki designed to enable business users to write tests in table format. Business users like spreadsheets, and the basic idea of FitNesse is that tests can be written as tables, much like a spreadsheet. Thus, FitNesse tests are not written in Java; they are written as tables in Wiki pages.

You do need a programmer to install and configure FitNesse for your site. However, once it’s running and a few sample fixtures have been written, it is possible for savvy business users to write more tests. FitNesse works best in a pair environment, though, where one programmer and one business user can work together to define the business rules and write tests for them.

For web app acceptance testing, you install Joseph Bergin’s HtmlFixture (http://fitnesse.org/FitNesse.HtmlFixture). It too is based on HtmlUnit. It supplies instructions that are useful for testing web applications such as typing into forms, submitting forms, checking the text on a page, and so forth.

Listing 2.6 demonstrates a simple FitNesse test that checks the http-equiv meta tag in the head to make sure it’s properly specifying UTF-8. The first three lines set the classpath. Then, after a blank line, the next line identifies the type of fixture as an HtmlFixture. (There are several other kinds, but HtmlFixture is the common one for testing web applications.)

The external page at http://www.elharo.com/blog/ is then loaded. In this page, we focus on the element named meta that has an id attribute with the value charset. This will be the subject for our tests.

The test then looks at two attributes of this element. First it inspects the content attribute and asserts that its value is text/html; charset=utf-8. Next it checks the http-equiv attribute of the same element and asserts that its value is content-type.

Example 2.6. A FitNesse Test for <meta id="charset" http-equiv="Content-Type" content="text/html; charset=UTF-8" />

!path fitnesse.jar
!path htmlunit-1.5/lib/*.jar
!path htmlfixture20050422.jar

!|com.jbergin.HtmlFixture|
|http://www.elharo.com/blog/|
|Element Focus|charset   |meta|
|Attribute    |content   |text/html; charset=utf-8|
|Attribute    |http-equiv|content-type|

This test would be embedded in a Wiki page. You can run it from a web browser just by clicking the Test button, as shown in Figure 2.4. If all of the assertions pass, and if nothing else goes wrong, the test will appear green after it is run. Otherwise, it will appear pink. You can use other Wiki markup elsewhere in the page to describe the test.

Figure 2.4. A FitNesse page

Selenium

Selenium is an open source browser-based test tool designed more for functional and acceptance testing than for unit testing. Unlike with HttpUnit and HtmlUnit, Selenium tests run directly inside the web browser. The page being tested is embedded in an iframe and the Selenium test code is written in JavaScript. It is mostly browser and platform independent, though the IDE for writing tests, shown in Figure 2.5, is limited to Firefox.

Figure 2.5. The Selenium IDE

Although you can write tests by hand using Selenium Remote Control, Selenium is really designed more as a traditional GUI record and playback tool. This makes it more suitable for testing an application that has already been written and less suitable for doing test-driven development.

Selenium is likely to be more comfortable to front-end developers who are accustomed to working with JavaScript and HTML. It is also likely to be more palatable to professional testers because it’s similar to some of the client GUI testing tools they’re already familiar with.

Listing 2.7 demonstrates a Selenium test that verifies that www.elharo.com shows up in the first page of results from a Google search for “Elliotte.” This script was recorded in the Selenium IDE and then edited a little by hand. You can load it into and then run it from a web browser. Unlike the other examples given here, this is not Java code, and it does not require major programming skills to maintain. Selenium is more of a macro language than a programming language.

Example 2.7. Test That elharo.com Is in the Top Search Results for Elliotte

<html>
<head>
<meta http-equiv="Content-Type"
      content="text/html; charset=UTF-8">
<title>elharo.com is in the top search results for Elliotte</title>
</head>
<body>
<table cellpadding="1" cellspacing="1" border="1">
<thead>
<tr><td rowspan="1" colspan="3">New Test</td></tr>
</thead><tbody>
<tr>
      <td>open</td>
      <td>/</td>
      <td></td>
</tr>
<tr>
      <td>type</td>
      <td>q</td>
      <td>elliotte</td>
</tr>
<tr>
      <td>clickAndWait</td>
      <td>btnG</td>
      <td></td>
</tr>
<tr>
      <td>verifyTextPresent</td>
      <td>www.elharo.com/</td>
      <td></td>
</tr>

</tbody></table>
</body>
</html>

Obviously, Listing 2.7 is a real HTML document. You can open this with the Selenium IDE in Firefox and then run the tests. Because the tests run directly inside the web browser, Selenium helps you find bugs that occur in only one browser or another. Given the wide variation in browser CSS, HTML, and JavaScript support, this capability is very useful. HtmlUnit, HttpUnit, JWebUnit, and the like use their own JavaScript engines that do not always have the same behavior as the browsers’ engines. Selenium uses the browsers themselves, not imitations of them.

The IDE can also export the tests as C#, Java, Perl, Python, or Ruby code so that you can integrate Selenium tests into other environments. This is especially important for test automation. Listing 2.8 shows the same test as in Listing 2.7, but this time in Ruby. However, this will not necessarily catch all the cross-browser bugs you’ll find by running the tests directly in the browser.

Example 2.8. Automated Test That elharo.com Is in the Top Search Results for Elliotte

require "selenium"
require "test/unit"

class GoogleSearch < Test::Unit::TestCase
  def setup
    @verification_errors = []
    if $selenium
      @selenium = $selenium
    else
      @selenium = Selenium::SeleneseInterpreter.new("localhost",
        4444, "*firefox", "http://localhost:4444", 10000);
      @selenium.start
    end
    @selenium.set_context("test_google_search", "info")
  end

  def teardown
    @selenium.stop unless $selenium
    assert_equal [], @verification_errors
  end

  def test_google_search
    @selenium.open "/"
    @selenium.type "q", "elliotte"
    @selenium.click "btnG"
    @selenium.wait_for_page_to_load "30000"
    begin
        assert @selenium.is_text_present("www.elharo.com/")
    rescue Test::Unit::AssertionFailedError
        @verification_errors << $!
    end
  end
end

Getting Started with Tests

Because you’re refactoring, you already have a web site or application; and if it’s like most I’ve seen, it has limited, if any, front-end tests. Don’t let that discourage you. Pick the tool you like and start to write a few tests for some basic functionality. Any tests at all are better than none. At the early stages, testing is linear. Every test you write makes a noticeable improvement in your code coverage and quality. Don’t get bogged down thinking you have to test everything. That’s great if you can do it, but if you can’t, you can still do something.

Before refactoring a particular page, subdirectory, or path through a site, take an hour and write at least two or three tests for that section. If nothing else, these are smoke tests that will let you know if you totally muck up everything. You can expand on these later when you have time.

If you find a bug, by all means write a test for the bug before fixing it. That will help you know when you’ve fixed the bug, and it will prevent the bug from accidentally reoccurring in the future after other changes. Because front-end tests aren’t very unitary, it’s likely that this test will indirectly test other things besides the specific bit of buggy code.

Finally, for new features and new developments beyond refactoring, by all means write your tests first. This will guarantee that the new parts of the site are tested, and tests will often leak over into the older pages and scripts as well.

Automatic testing is critical to developing a robust, scalable application. Developing a test suite can seem daunting at first, but it’s worth doing. The first test is the hardest. Once you’ve set up your test framework and written your first test, subsequent tests will flow much more easily. Just as you can improve a site linearly through small, automatic refactorings that add up over time, so too can you improve your test suite by adding just a few tests a week. Sooner than you know, you’ll have a solid test suite that helps to ensure reliability by telling you when things are broken and showing you what to fix.

Regular Expressions

Manually inspecting and changing each file on even a small site is tedious and often cost-prohibitive. It is much more effective to let the computer do the work by searching for mistakes and, when possible, automatically fixing them. A number of tools support this, including command-line tools such as grep, egrep, and sed; text editors such as jEdit, BBEdit, TextPad, and PSPad; and programming languages such as Java, Perl, and PHP. All these tools provide a specialized search syntax known as regular expressions. Although there are small differences from one tool to the next, the basic regular expression syntax is much the same.

For purposes of illustration, I’m going to use the jEdit text editor as my search and replace tool in this section. I chose it because it provides pretty much all the features you need, it has a reasonable GUI, it’s open source, and it’s written in Java, so it runs on essentially any platform you’re likely to want. You can download a copy from http://jedit.org/.

However, the techniques I’m showing here are by no means limited to that one editor. In my work, I normally use BBEdit instead because it has a slightly nicer interface. However, it’s payware and only runs on the Mac. There are numerous other choices. If you prefer a different program, by all means use it. What you’ll need are

  • Full regular expression search and replace

  • The ability to recursively search a directory

  • The ability to filter the files you search

  • A tool that shows you what it has changed, but does not require you to manually approve each change

  • Automatic recognition of different character encodings and line-ending conventions

Any tool that meets these criteria should be sufficient.

Searching

The first goal of a regular expression is to find things that may be wrong. For example, I recently noticed that I had mistyped some dates as 20066 instead of 2006 in one of my files. That’s an error that’s likely to have happened more than once, so I checked for it by searching for that string.

In jEdit, you perform a multifile search using the Search/Search in Directory menu item. Selecting this menu item brings up the dialog shown in Figure 2.6. This is normally configured more or less as shown here.

  • The string you’re searching for (the target string) goes in the first text field.

  • The string that will replace the target string goes in the second text field. Here I’m just going to find, not replace, so I haven’t entered a replacement string.

    Figure 2.6. jEdit multifile search

  • The Directory radio button is checked to indicate that you’re going to search multiple files. You can also search just in the current file, or even the current selection.

  • The filter is set to *.html to search only those files that end in .html. You can modify this to search different kinds or subsets of files. For instance, I often want to search only my old news files, which are named news2000.html, news2001.html, news2002.html, and so on. In that case, I would set the filter to news2.*html. I could search even older files, including news1999.html, by rewriting the filter regular expression in a form such as news\d\d\d\d.html.

  • I specify the directory where I’ve stored my local copy of the files I’m searching. In my case, this is /Users/elharo/Cafe au Lait/javafaq.

  • “Search subdirectories” is checked. If it weren’t, jEdit would search only the javafaq directory, but not any directories that directory contains.

  • “Keep dialog” is checked. This keeps the dialog box open after the search is completed.

  • “Ignore case” is checked. This will allow the regular expression to match regardless of case. This isn’t always what you want, but more often than not, it is.

  • “Regular expressions” is checked. You don’t need to check this when you’re only searching for a constant string, as here. However, most searches are more complex than that.

  • HyperSearch is checked. This will bring up a window showing all matches, rather than just finding the next match.

Fortunately, that particular problem seems to have been isolated. However, I also recently noticed another, more serious problem. For some unknown reason, I somehow had managed to write links with double equals signs, as shown here, throughout one of my sites:

<a href=="../../index.html">Cafe au Lait</a>

Consequently, links were breaking all over the place. The first step was to find out how broad the problem was. In this case, the mistaken string was constant and was unlikely to appear in correct text, so it was easy to search for. This problem turned up 4,475 times in 476 files, as shown in the HyperSearch results in Figure 2.7.

Figure 2.7. jEdit search results

When there aren’t a lot of mistakes, you can click on each one to open the document and fix it manually. Sometimes this is needed. Sometimes this is even the easiest solution. However, when there are thousands of mistakes, you have to fix them with a tool. In this case, the solution is straightforward. Put href= in the “Replace with” field, and then click the “Replace all” button.

Do be careful when performing this sort of operation, though. A small mistake can cause bigger problems. A bad search and replace likely caused this problem in the first place. You should test your regular expression search and replace on a few files first before trying it on an entire site.

Most important, always work on a backup copy of the site, always run your test suite after each change, and always spot-check at least some of the files that have been changed to make sure nothing went wrong. If something does go wrong, an editor with undo capability can be very useful. Not all editors support multifile undo with a buffer that’s large enough to handle thousands of changes. If yours doesn’t, be ready to delete your working copy and replace it with the original in case the search goes wrong. Like any other complex bit of code, sometimes you have to try several times to fully debug a regular expression.

Search Patterns

Often, you don’t know exactly what you’re searching for, but you do know its general pattern. For example, if you’re searching for years in the recent past, you might want to find any four-digit number beginning with 200. You may want to search for attribute name=value pairs, but you’re not sure whether they’re in the format name=value, name='value', or name="value". You may want to search for all <p> start-tags, whether they have attributes or not. These are all good candidates for regular expressions.

In a regular expression, certain characters and patterns stand in for a set of other characters. For example, \d means any digit. Thus, to search for any year from 2000 to 2009, one could use the regular expression 200\d. This would match 2000, 2001, 2002, and so on through 2009.

However, the regular expression 200\d also matches 12000, 200032, 12320056, and other strings that are probably not years at all. (To be precise, it matches the substrings in the form 200\d, not the entire string.) Thus, you might want to indicate that the string you’re matching must be preceded and trailed by whitespace of some kind. The metacharacter \s matches whitespace, so we can now rewrite the expression as \s200\d\s to match only those strings that look like years in this decade.

Of course, there’s still no guarantee that every string you match in this form is a year. It could be a price, a population, a score, a movie title, or something else. You’ll want to scan the list of matches to verify that it is what you expect. False positives are a real concern, especially for simple cases such as this. However, it’s normally possible to either further refine the regular expression to avoid any false positives or manually remove the accidental matches.

There usually are other ways to do many things. For instance, we could write this search as \b200\d\b. The metacharacter \b matches the boundary at the beginning or end of a word, without actually selecting any characters. This avoids requiring literal whitespace before and after the year, and it also allows us to recognize a year that comes at the end of a sentence right before a period, as in “This is 2008.” However, it can’t distinguish periods from decimal points and would also match the 2005 in 2005.3124.

You could even simply list the years separated by the OR operator, |, like so:

2000|2001|2002|2003|2004|2005|2006|2007|2008|2009

However, this still has the word boundary problems of the previous matches.
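If you want to experiment with a pattern before unleashing it on an entire site, the same expressions work in most programming languages as well as in editors. Here is a small sketch in Java; note the doubled backslashes that Java string literals require:

import java.util.regex.*;

public class YearFinder {

    public static void main(String[] args) {
        Pattern year = Pattern.compile("\\b200\\d\\b");
        Matcher m = year.matcher(
          "Copyright 2004, revised in 2008. Not 12005.");
        while (m.find()) {
            System.out.println(m.group()); // prints 2004 and 2008
        }
    }

}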

Sometimes you stop with a search. In particular, if the content is generated automatically from a CMS, template page, or other program, the search is used merely to find bugs: places where the program is generating incorrect markup. You then must change the program to generate correct markup. If this is the case, false positives don’t worry you nearly so much because all changes will be performed manually anyway. The search only identifies the bug. It doesn’t fix it.

If you don’t stop with a search, and you go on to a replacement, you need to be cautious. Regular expressions can be tricky, and ones involving HTML are often much trickier than the textbook examples. Nonetheless, they are invaluable tools in cleaning up HTML.

Note

If you don’t have a lot of experience with regular expressions, please refer to Appendix 1 for many more examples. I also recommend Mastering Regular Expressions, 3rd Edition, by Jeffrey E.F. Friedl (O’Reilly, 2006).

Tidy

Regular expressions are well and good for individual, custom changes, but they can be tedious and difficult to use for large quantities of changes. In particular, they are designed more to work with plain text than with semistructured HTML text. For batch changes and automated corrections of common mistakes, it helps to have tools that take advantage of the markup in HTML. The first such tool is Dave Raggett’s Tidy (www.w3.org/People/Raggett/tidy/), the original HTML fixer-upper. It’s a simple, multiplatform command-line program that can correct most HTML mistakes.

-asxhtml

For purposes of this book, you want to use Tidy with the -asxhtml command-line option. For example, this command converts the file index.html to well-formed XHTML and stores the result back into the same file (-m):

$ tidy -asxhtml -m index.html

Frankly, you could do worse than just running Tidy across all your HTML files and calling it a day, but please don’t stop reading just yet. Tidy has a few more options that can improve your code further, and there are some problems it can’t handle or handles incorrectly. For example, when I used this command on one of my older pages that I hadn’t looked at in at least five years, Tidy generated the following error messages:

line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 7 column 1 - Warning: <body> attribute "bgcolor" has
invalid value "#fffffff"
line 16 column 2 - Warning: <table> lacks "summary" attribute
line 230 column 1 - Warning: <table> lacks "summary" attribute
line 14 column 91 - Warning: trimming empty <p>
Info: Document content looks like XHTML 1.0 Transitional
5 warnings, 0 errors were found!

These are problems that Tidy mostly didn’t know how to fix. It actually was able to supply a DOCTYPE because I specified XHTML mode, which has a known DOCTYPE. However, it doesn’t know what to do with bgcolor="#fffffff". The problem here is an extra f which should be removed, or perhaps the entire bgcolor attribute should be removed and replaced with CSS.

Tip

Once you’ve identified a problem such as this, it’s entirely possible that the same problem crops up in multiple documents. Having noticed this in one file, it’s worth doing a search and replace across the entire directory tree to catch any other occurrences. Given the prevalence of copy and paste coding, few mistakes occur only once.

The next two problems are tables that lack a summary attribute. This is an accessibility problem, and you should correct it. Tidy actually prints some further details about this:

The table summary attribute should be used to describe
the table structure. It is very helpful for people using
non-visual browsers. The scope and headers attributes for
table cells are useful for specifying which headers apply
to each table cell, enabling non-visual browsers to provide
a meaningful context for each cell.

For further advice on how to make your pages accessible
see http://www.w3.org/WAI/GL. You may also want to try
"http://www.cast.org/bobby/" which is a free Web-based
service for checking URLs for accessibility.

Table summaries are certainly a good thing, and you should definitely add one. However, Tidy can’t summarize your table itself. You’ll have to do that.

The final message warns you that Tidy found an empty paragraph element and threw it away. This common message is probably misleading and definitely worth a second look. What it meant in this case (and in almost every other one you’ll see) is that the <P> tag was being used as an end-tag rather than a start-tag. That is, the markup looked like this:

Blah blah blah<P>
Blah blah blah<P>
Blah blah blah<P>

Tidy reads it as this:

Blah blah blah
<P>Blah blah blah</P>
<P>Blah blah blah</P>
<P></P>

Thus, it throws away that last empty paragraph. However, what you almost certainly wanted was this:

<P>Blah blah blah</P>
<P>Blah blah blah</P>
<P>Blah blah blah</P>

There’s no easy search and replace for this flaw, though XHTML strict validation will at least alert you to the files in which the problem lies. You can use XSLT (discussed shortly) to fix some of these problems; but if there aren’t too many of them, it’s safer and not hugely onerous to manually edit these files.

If you specify the --enclose-text yes option, Tidy will wrap any such unparented text in a p element. For example:

$ tidy -asxhtml --enclose-text yes example.html

Tidy may alert you to a few other serious problems that you’ll still have to fix manually. These include

  • A missing end quote for an attribute value, for example, <p id="c1>

  • A missing > to close a tag, for example, <p or </p

  • Misspelled element and attribute names, for example, <tabel> instead of <table>

-clean

The next important option is -clean. This replaces deprecated presentational elements such as i and font with CSS markup. For example, when I used -clean on the same document, Tidy added these CSS rules to the head of my document:

<style type="text/css">
/*<![CDATA[*/
 body {
  background-color: #fffffff;
  color: #000000;
 }
 p.c2 {text-align: center}
 h1.c1 {text-align: center}
/*]]>*/
</style>

It also added the necessary class attributes to the elements that had previously used a center element. For example:

<h1 class="c1">Java Virtual Machines</h1>

That’s as far as Tidy goes with CSS, but you might want to consider revisiting all these files to see if you can extract a common external stylesheet that can be shared among the documents on your site, rather than including the rules in each separate page. Furthermore, you should probably consider more semantic class names rather than Tidy’s somewhat prosaic default.

For example, I’ve sometimes used the i element to indicate that I’m talking about a word rather than using the word. For example:

<p><i>May</i> can be very ambiguous in English, meaning might,
can, or allowed, depending on context.</p>

Here the italics are not used for emphasis, and replacing them with em is not appropriate. Instead, they should be indicated with a span and a class, like so:

<p><span class='wordasword'>May</span> can be very ambiguous
in English, meaning might, can, or allowed, depending on
context.</p>

Then, add a CSS rule to the stylesheet to format this as italic:

span.wordasword { font-style: italic; }

Tidy, however, is not smart enough to figure this out and will need your help. Still, Tidy is a good first step.

Encodings

Tidy is surprisingly bad at detecting the character encoding of HTML documents, despite the relatively rich metadata in most HTML documents for specifying exactly this. If your content is anything other than ASCII or ISO-8859-1 (Latin-1), you’d best tell Tidy that with the --input-encoding option. For example, if you’ve saved your documents in UTF-8, invoke Tidy thusly:

$ tidy -asxhtml --input-encoding utf8 index.html

Tidy generates ASCII text as output unless you tell it otherwise. It will escape non-ASCII characters using named entities when available, and numeric character references when not. However, Tidy supports several other common encodings. The only other one I recommend is UTF-8. To get it use the --output-encoding option:

$ tidy -asxhtml --output-encoding utf8 index.html

The input encoding does not have to be the same as the output encoding. However, if it is you can just specify -utf8 instead:

$ tidy -asxhtml -utf8 index.html

For various reasons, I strongly recommend that you stick to either ASCII or UTF-8. Other encodings do not transfer as reliably when documents are exchanged across different operating systems and locales.

Pretty Printing

Tidy also has a couple of options that don’t have a lot to do with the HTML itself, but do make documents prettier to look at and thus easier to work with when you open them in a text editor.

The -i option indents the text so that it’s easier to see how the elements nest. Tidy is smart enough not to indent whitespace-significant elements such as pre.

The -wrap option wraps text at a specified column. Usually about 80 columns are nice.

$ tidy -asxhtml -utf8 -i -wrap 80 index.html

Generated Code

Tidy has limited support for working on PHP, JSP, and ASP pages. Basically, it will ignore the content inside the PHP, ASP, or JSP sections and try to work on the rest of the HTML markup. However, that is very tricky to do. In particular, most templating languages do not respect element boundaries. If code is generating half of an element or a start-tag for an element that is later closed by a literal end-tag, it is very easy for Tidy to get confused. I do not recommend using Tidy directly on these sorts of pages.

Instead, download some fully rendered pages from your web site after processing by the template engine. Run Tidy on a representative sample of these pages, and then compare the results to the original pages. By looking at the differences, you can usually figure out what needs to be changed in your templates; then make the changes manually.
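One way to make that comparison, with a placeholder URL standing in for one of your own template-generated pages:

$ wget -O rendered.html http://www.example.com/page
$ tidy -asxhtml rendered.html > tidied.html
$ diff rendered.html tidied.html

The lines diff reports are the places where your template engine is emitting markup that Tidy had to correct.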

Although this approach requires more manual work and human judgment, when each template generates many static pages it can actually finish sooner than semiautomated processing of large numbers of static HTML pages.

Use As a Library

TidyLib is a C library version of Tidy that you can integrate into your own programs. This might be useful for writing scripts that tidy an entire site, for example. Personally, though, I do not find C the most convenient language for simple scripting. I usually prefer to write a shell or Perl script that invokes the Tidy command line directly.
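For example, a short shell script along these lines can tidy every HTML file in a local copy of a site in place (a sketch only; the directory path is a placeholder, and as always you should run it on a copy, never the live site):

#!/bin/sh
# Convert every .html file under the staging copy to XHTML in place.
# The -m option tells Tidy to overwrite each input file with its output.
find ~/staging/mysite -name '*.html' | while read file
do
  tidy -asxhtml -utf8 -i -wrap 80 -m "$file"
done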

TagSoup

John Cowan’s TagSoup (http://home.ccil.org/~cowan/XML/tagsoup/) is an open source HTML parser written in Java that implements the Simple API for XML, or SAX. Cowan describes TagSoup as “a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty, and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML.”

TagSoup is not intended as an end-user tool, but it does have a basic command-line interface. It’s also straightforward to hook it up to any number of XML tools that accept input from SAX. Once you’ve done that, feed in HTML, and out will come well-formed XHTML. For example:

$ java -jar tagsoup.jar index.html
<?xml version="1.0" standalone="yes"?>
<html lang="en-US" xmlns="http://www.w3.org/1999/
xhtml"><head><title>Java Virtual Machines</title><meta
name="description" content="A Growing
list of Java virtual machines and their capabilities">
</meta></head><body bgcolor="#ffffff" text="#000000">

<h1 align="center">Java Virtual Machines</h1>
...

You can improve its output a little bit by adding the --omit-xml-declaration and -nodefaults command-line options:

$ java -jar tagsoup.jar --omit-xml-declaration
       -nodefaults index.html
<html lang="en-US" xmlns="http://www.w3.org/1999/
xhtml"><head><title>Java Virtual Machines</title><meta
name="description" content="A Growing
list of Java virtual machines and their capabilities"></meta>
</head><body bgcolor="#ffffff" text="#000000">

<h1 align="center">Java Virtual Machines</h1>
...

This will remove a few pieces that are likely to confuse one browser or another.

You can use the --encoding option to specify the character encoding of the input document. For example, if you know the document is written in Latin-1 (ISO-8859-1), you could run it like so:

$ java -jar tagsoup.jar --encoding=ISO-8859-1 index.html

TagSoup’s output is always UTF-8.

Finally, you can use the --files option to write new copies of the input files with the extension .xhtml. Otherwise, TagSoup prints its output on stdout, from which you can redirect it to any convenient location. TagSoup cannot change a file in place the way Tidy can.
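For example, assuming index.html and faq.html are the files you want to convert (the names here are hypothetical), the following command would leave index.xhtml and faq.xhtml alongside the originals:

$ java -jar tagsoup.jar --files index.html faq.html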

However, TagSoup is primarily designed for use as a library. Its command-line output leaves something to be desired compared to Tidy’s. In particular:

  • It does not convert presentational markup to CSS.

  • It does not include a DOCTYPE declaration, which is needed before some browsers will recognize XHTML.

  • It does include an XML declaration, which needlessly confuses older browsers.

  • It uses start-tag and end-tag pairs for empty elements such as br and hr, which may confuse some older browsers.

TagSoup does not guarantee absolutely valid XHTML (though it does guarantee well-formedness). There are a few things it cannot handle. Most important, XHTML requires all img elements to have an alt attribute. If the alt attribute is empty, the image is purely presentational and should be ignored by screen readers. If the attribute is not empty, it is used in place of the image by screen readers. TagSoup has no way of knowing whether any given img with an omitted alt attribute is presentational or not, so it does not insert any such attributes. Similarly, TagSoup does not add summaries to tables. You’ll have to do that by hand, and you’ll want to validate after using TagSoup to make sure you catch all these instances.
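Because TagSoup’s output is well-formed XML, standard XML tools can at least find these cases for you. For instance, recent versions of xmllint can evaluate an XPath expression that lists every img element missing an alt attribute (a sketch, assuming you’ve saved TagSoup’s output as index.xhtml):

$ xmllint --xpath '//*[local-name()="img"][not(@alt)]' index.xhtml

You’ll still have to decide what each alt attribute should say, but at least you’ll know where to look.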

However, despite these limits, TagSoup does do a huge amount of work for you at very little cost.

TagSoup versus Tidy

For an end-user, the primary difference between TagSoup and Tidy is one of philosophy. Tidy will sometimes give up and ask for help: there are some things it does not know how to fix and will not try to fix. TagSoup never gives up. It always produces well-formed XHTML as output, though not always perfectly valid XHTML. By the same token, TagSoup does not warn you about anything it couldn’t handle, so you get no list of problems to fix manually; its assumption is that you don’t care that much. If that’s not true, you might prefer Tidy instead: Tidy is more careful, and if it isn’t reasonably sure what a document means, it won’t give you anything. TagSoup will always give you something.

For a programmer, the differences are a little more striking. First, TagSoup is written in Java and Tidy is written in C. That alone may be enough to help you decide which one to use (though there is a Java port of Tidy, called JTidy). Another important difference is that TagSoup operates in streaming mode. Rather than working on the entire document at once, it works on just a little piece of it at a time, moving from start to finish. That makes it very fast and allows it to process very large documents. However, it can’t do things such as add a style rule to the head that applies to the last paragraph of the document. Because HTML documents are rarely very large (usually a few hundred kilobytes at most), I think a whole-document approach such as Tidy’s is more powerful.

XSLT

XSLT (Extensible Stylesheet Language Transformations) is one of many XML tools that work well on HTML documents once they have first been converted into well-formed XHTML. In fact, it is one of my favorite such tools, and the first thing I turn to for many tasks. For instance, I use it to automatically generate a lot of content, such as RSS and Atom feeds, by screen-scraping my HTML pages. Indeed, the possibility of using XSLT on my documents is one of my main reasons for refactoring documents into well-formed XHTML. XSLT can query documents for things you need to fix and automate some of the fixes.

When refactoring XHTML with XSLT, you usually leave more alone than you change. Thus, most refactoring stylesheets start with the identity transformation shown in Listing 2.9.

Listing 2.9. The Identity Transformation in XSLT

<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
  version='1.0'
  xmlns:html='http://www.w3.org/1999/xhtml'
  xmlns='http://www.w3.org/1999/xhtml'
  exclude-result-prefixes='html'>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

This merely copies the entire document from the input to the output. You then modify this basic stylesheet with a few extra rules to make the changes you desire. For example, suppose you want to change all the presentational <i> elements to <em> elements. You would add this rule to the stylesheet:

<xsl:template match='html:i'>
  <em>
     <xsl:apply-templates select="@*|node()"/>
  </em>
</xsl:template>

Notice that the XPath expression in the match attribute must use a namespace prefix, even though the element it’s matching uses the default namespace. This is a common source of confusion when transforming XHTML documents. You always have to assign the XHTML namespace a prefix when you’re using it in an XPath expression.
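To apply a stylesheet like this, any XSLT 1.0 processor will do. For example, with xsltproc (assuming you’ve saved the stylesheet as i2em.xsl and that index.xhtml is already well-formed XHTML; the file names are arbitrary):

$ xsltproc i2em.xsl index.xhtml > index2.xhtml

Saxon, Xalan, and most other XSLT processors work equally well.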

Note

Several good introductions to XSLT are available in print and on the Web. First, I’ll recommend two I’ve written myself. Chapter 15 of The XML 1.1 Bible (Wiley, 2003) covers XSLT in depth and is available on the Web at www.cafeconleche.org/books/bible3/chapters/ch15.html. XML in a Nutshell, 3rd Edition, by Elliotte Rusty Harold and W. Scott Means (O’Reilly, 2004), provides a somewhat more concise introduction. Finally, if you want the most comprehensive coverage available, I recommend Michael Kay’s XSLT: Programmer’s Reference (Wrox, 2001) and XSLT 2.0: Programmer’s Reference (Wrox, 2004).
