Chapter 8. Content

So far, we’ve focused almost exclusively on the markup of the page and relatively little on the content. Markup may be where the code is, but content is why readers come to a site in the first place. Chances are there are some significant improvements you can make in your content too that will pay immediate dividends.

Correct Spelling

Check all text with a spell checker.

 In the new milennium, the inteligent playright endeavers to         
 avoid the embarassment that arises from superceding good            
 judgement with exceedingly dificult to spell words of foriegn       
 derivasion at every occurence.                                      

                        Correct Spelling

 In the new millennium, the intelligent playwright endeavors to      
 avoid the embarrassment that arises from superseding good           
 judgment with exceedingly difficult to spell words of foreign       
 derivation at every occurrence.                                     

Motivation

Proper spelling makes a site appear more professional. It enhances the trust readers have in your site. Sites that can’t spell look bad and drive away readers. This is especially critical for any site attempting to close a sale with a visitor. Web surfers have learned to associate poor spelling (and grammar) with hackers and impostors. They are far less likely to hand over a credit card number to a site that’s full of spelling errors.

Proper spelling also improves search engine placement. You get no Google juice for a term you’ve misspelled (unless everyone else is misspelling it, too).

Potential Trade-offs

None.

Mechanics

Most decent HTML editors, such as BBEdit and Dreamweaver, have built-in spell checkers. By all means, use them. Unlike the spell checkers built into products such as Microsoft Word, they’re smart enough to realize that kbd is misspelled but <kbd> isn’t. Make sure all people editing any page on your site have their spell checker turned on and running so that it draws a little squiggly red line under every misspelled word as soon as it’s typed. Do not give writers the opportunity to forget to spell-check their content before submitting it.

Of course, for the content that already exists on your site, you’ll need something a little more automated. It’s well worth spending a day or two on a complete and thorough proofread of a site. If you don’t have a solid day, work on one page at a time, but by all means, devote some effort and energy to this.

The trick is to generate a good custom dictionary that matches your site and all its unique terms, brand names, proper names, and other words the default dictionary does not contain. Although you can certainly check each page individually and build the dictionary as you go, I find it more efficient to work in larger batches. The basic procedure is as follows.

  1. Generate a list of all possibly misspelled words in all documents.

  2. Delete all actually misspelled words from the list. This requires the services of a native speaker who is an excellent speller. What remains is a custom dictionary for your site.

  3. Rerun the spell checker on one file at a time using the custom dictionary. This time, any words it flags should be genuine spelling errors, so you should fix them.

Be sure to store the dictionary you create for later use. You will occasionally need to add new words to it as the site grows and changes.

At least for English, the gold standard is the GNU Project’s Aspell. This is really a library more than an end-user program, but you can make it work by stringing together a few UNIX commands. Here’s how I use it.

First I check an entire directory of files with this command:

$ cat *.html | aspell --mode=html list | sort | uniq

This types all HTML files in the directory, passes the output into the spell checker, sorts the results, and uniquifies them (deletes duplicates). The result is a list of all the misspelled words in the directory, such as this:

AAAS
AAC
AAL
ABA
ABCDEFG
ABCNEWS
ACGNJ
...
ystart
yt
yvalue
zephyrfalcon
zigbert
ziplib
zlib
zparser

Of course, looking at a list such as this, it will immediately strike you that most of these words are not in fact misspelled. They are proper names, technical terms, foreign words, coinages, and other things the spell checker doesn’t recognize. Thus, next you inspect the output and use it to build a custom dictionary.

Pipe or copy the output into a text editor and delete all clearly misspelled words. (I am assuming here that you’re a solid speller. If not, hire someone who is. This is especially important if you’re not a native speaker of the language you’re checking.) If you’re in doubt about a word, delete it. You’ll want to look at it in context before deciding.
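
For example, you can capture the suspect list in a scratch file and edit it there (the file name here is just a placeholder):

$ cat *.html | aspell --mode=html list | sort | uniq > suspects.txt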

Save the remaining correctly spelled words in a file called customdict.txt. If the file is too large to inspect manually, you may want to start with a smaller sample. Then compile this text file into a custom dictionary, like so:

$ aspell --lang=en create master ./webdict < customdict.txt

This creates the file webdict in the current working directory.

Now run the command again with the --add-extra-dicts=./webdict option, like so:

$ cat *.html | aspell --mode=html --add-extra-dicts=./webdict list | sort | uniq

This time, Aspell will generate a list of words that are much more likely to be actually misspelled. When I recently checked my site, my initial list comprised more than 11,000 misspelled words. After scanning them and creating a custom dictionary, the potential spelling errors were reduced to 1,138, a much more manageable number.

At this point, I would take each word in this new shorter list and search for it using this regular expression:

\bmisspelling\b

The \b on both ends limits the search to word boundaries so that I don’t accidentally find it in the middle of other words. For example, if I’m correcting adn to and, I don’t want to also change sadness to sandess.

If I’m uncertain about whether a word is correctly spelled, I may open the file and fix it manually. If I know it’s obviously wrong, I’ll just replace it.
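
If you have GNU grep and sed handy, the obvious cases can be scripted. This is only a sketch—the \b word-boundary syntax is GNU-specific, and you should back up or version-control the files before running an in-place replacement:

$ grep -rnw 'adn' --include='*.html' .
$ sed -i 's/\badn\b/and/g' *.html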

The alternative to this approach, and one that may be more accessible to some people, is to use a traditional GUI spell checker, such as those built into BBEdit and Dreamweaver, and check files one at a time. That’s certainly possible, but in my experience it takes quite a bit longer, and the larger the site, the longer it takes. Spelling errors are not independent. The same ones tend to crop up again and again.

A nice compromise position is to use Aspell to build up a custom dictionary and then use that custom dictionary as you check individual files, whether with Aspell or with some other GUI tools. Most decent tools should be able to import a custom dictionary saved as a plain text file.
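
With Aspell itself, checking a single file interactively against the custom dictionary looks like this (the file name is just an example):

$ aspell --mode=html --add-extra-dicts=./webdict check faq.html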

For many sites, that may be enough. Anything this process catches is certainly something you’ll want to fix. However, professional sites that convey your image to the world are worth a little more effort. Some things a machine can’t catch. For instance, I know of one site where a spell checker did not notice an omitted l in the word public with consequently embarrassing results. If at all possible, I recommend hiring a professional proofreader to catch these and similar mistakes, as well as errors of grammar, meaning, and style that a computer program just won’t recognize.

Although I usually say that hiring a professional proofreader is optional, there is one exception. If you are publishing a site in anything other than your primary native language, professional native assistance is mandatory. Even well-educated, truly bilingual people rarely have formal education in more than one tongue. Any commercial site publishing in a non-native language should insist on a native proofreader.

However you go about this, correcting spelling errors can take a while. Once your initial custom dictionary is set up, the effort involved is roughly linear in the size of the site. As usual, you don’t have to do it all at once. Start with your home page and other frequently accessed pages, as indicated by your server logs, and work your way out from there. Every error you correct is one less error for site visitors to notice and think less of you for.

Repair Broken Links

Repair broken links if possible. Delete them if not.

 <a href="http://deadsite.example.com/">Dotcom, Inc.</a>                
 <a href="http://www.example.com/reorganized/site">Learn More</a>       

                         Repair Broken Links

 Dotcom, Inc.                                                           
 <a href="http://www.example.com/new/location">Learn More</a>           

Motivation

Dead links annoy users and waste their time. In the worst case, they can be offensive. Many disgusting spammers make a habit of buying up abandoned domain names of failed companies and replacing them with pages of ads for subprime mortgages, get-rich-quick schemes, and outright pornography. Many web sites are pointing to porn and don’t even know it.

Dead links also reduce search engine placement for both your site and the sites linked to it.

Potential Trade-offs

None.

Mechanics

Checking links is fairly easy to automate. As a result, many tools will do this for you. Some are built into authoring tools, and they usually work on a single page. Others are stand-alone programs that run on your computer. Still others are web-based services. For a quick check on one page, I’ll either use what’s built into my editor or hop over to the web-based checker at http://validator.w3.org/checklink. Googling for “online link checker” will find many similar tools.

For more automated testing of an entire site, on Windows I use Xenu Link Sleuth, http://home.snafu.de/tilman/xenulink.html; and on UNIX I use Linklint, www.linklint.org/. Once again, these are just two choices. There are many others. Each can scan a site remotely or locally and attempt to follow each link it finds. If a link can’t be followed, or if it is redirected, an error message is logged. For example, here’s some output from checking one of my sites with Linklint:

$ ./linklint -http -host www.cafeaulait.org -doc results /@

Checking links via http://www.cafeaulait.org
that match: /@
1 seed: /

Seed:    /
    checking robots.txt for www.cafeaulait.org
Checking /
Checking /oldnews/news2007March26.html
Checking /mailinglists.html
Checking /javafaq.html
...
-----    /oldnews/news2007March30.html
-----    /oldnews/news2007March23.html

Processing ...

writing files to results
wrote 21 txt files
wrote 19 html files
wrote index file index.html

found  64 default indexes
found 112 cgi files
found 923 html files
found   2 java archive files
found 130 image files
found  86 applet files
found 593 other files
found 1076 http links
found  26 ftp links
found 249 mailto links
found  13 news links
found 542 named anchors
-----   3 actions skipped
----- 100 files skipped
ERROR   1 missing directory
ERROR   4 missing cgi files
ERROR  28 missing html files
ERROR   1 missing java archive file
ERROR  12 missing image files
ERROR   6 missing applet files
ERROR  19 missing other files
ERROR 104 missing named anchors

Linklint found 1910 files and checked 1000 html files.
There were 71 missing files. 160 files had broken links.
175 errors, no warnings.

This tool writes its results into a directory specified on the command line (here, the results directory) in both plain text and HTML format. The most useful results file is errorF.html, which lists the files that contain broken links, along with the broken links themselves. Typical output looks like this:

file: errorF.txt
host: www.cafeaulait.org
date: Sun, 01 Apr 2007 14:55:15 (local)
Linklint version: 2.3.5

#------------------------------------------------------------
# ERROR 160 files had broken links
#------------------------------------------------------------
/
    had 1 broken link
    /&

/books.html
    had 3 broken links
    /&
    /books/beans/a
    /javafaq/

/books/
    had 1 broken link
    /books/&

/books/beans/
    had 2 broken links
    /books/javasecrets.html
    /index.html

Most such tools are highly configurable and have numerous options for specifying exactly what to check and not to check. Usually the defaults are fine. One option you may want to consider is differentiating between checking for broken internal links and broken external links. Broken external links are bad, but broken internal links are ten times worse (and are usually easier to fix). In the case of Linklint, the default is to check only for internal links. To check external links as well, add the -net option:

$ linklint -http -host www.cafeaulait.org -net /2000march.html

Checking links via http://www.cafeaulait.org
that match: /2000march.html
1 seed: /2000march.html

Seed:    /2000march.html
    checking robots.txt for www.cafeaulait.org
Checking /2000march.html
-----    /1999november.html
-----    /1999may.html
-----    /1999september.html
-----    /1999december.html
-----    /1999october.html
-----    /2000february.html
-----    /1999july.html
-----    /1999august.html
-----    /2000january.html
-----    /1999june.html

Processing ...

found  11 html files
found   2 image files
found  61 http links
-----  10 files skipped
ERROR   1 missing other file
ERROR   1 missing named anchor

Linklint found 13 files and checked 1 html file.
There was 1 missing file. 1 file had broken links.
2 errors, no warnings.

checking 61 urls ...

bluej.monash.edu/
  could not find ip address
conferences.oreilly.com/java/
  moved
  conferences.oreillynet.com/java/
  not found (404)
crushftp.bizland.com/
  access forbidden (403)
developer.apple.com/java/text/download.html
  moved
  developer.apple.com/java/download.html
  ok
developer.apple.com/mkt/swl/
  moved
  developer.apple.com/softwarelicensing/index.html
  ok
fourier.dur.ac.uk/%7Edma3mjh/jsci/index.html
  timed out connecting to host
fred.lavigne.com/
  ok
homepage.mac.com/mheun/jEditForMac.html
  moved
  www.mac.com/account_error.html
  ok
java.sun.com/aboutJava/communityprocess/maintenance/JLS/index.html
  moved
  jcp.org/aboutJava/communityprocess/maintenance/JLS/index.html
  ok
java.sun.com/products/personaljava/index.html
  not an http link
java.sun.com/products/personaljava/pj-cc.html
  ok
java.sun.com/products/personaljava/pj-emulation.html
...
found  39 urls: ok
-----  12 urls: moved permanently (301)
-----  14 urls: moved temporarily (302)
ERROR   2 urls: access forbidden (403)
ERROR   1 url: could not find ip address
ERROR   9 urls: not found (404)
ERROR   2 urls: timed out connecting to host
warn    8 urls: not an http link

Linklint checked 61 urls:
    39 were ok, 14 failed. 26 urls moved.
    3 hosts failed: 3 urls could be retried.
1 file had failed urls.
There were 2 files with broken links.

Of course, Linklint does not spider the remote pages. It merely checks that they’re where they’re expected to be. As you can see, a link can be broken for several reasons. These include:

  • could not find ip address

    The entire host has been removed from the Net and has not been replaced. In some cases, you can find the replacement host with a little Google work. Otherwise, you should delete the link.

  • moved

    The page has moved to a new location, but the host is still there. If the ultimate response is OK, you can update the link, but it’s not essential and not an immediate problem. If the location is not found, though, you need to fix the link.

  • access forbidden

    Usually this means the directory has been deleted. You’ll need to fix or delete the link.

  • timed out

    The host is still there, but it doesn’t seem to be responding at the moment. It may be a temporary glitch, or the site may be gone for good. Try again tomorrow.

The exact terms vary from one tool to the next, but the reasons and responses are the same.

Note that this process does not just find broken links. It also finds redirected links—that is, links where the server sends the browser to a new page. These are worth a second look, especially if the server the user is being redirected to is not the original server. Too often, the browser is being redirected to a spam page that’s been set up at a dead domain. Other times it’s the home page of the correct site, but the page you were actually linking to is missing. You may need to verify these manually.

In many ways, checking for broken links—especially external links—is one of the most annoying refactorings. Most of the refactorings described in this book have the advantage of being stable. That is, once you fix a page, it stays fixed, at least until someone edits it. Not so with link checking. A page can be perfectly fine one minute and have two dozen broken links the next, and there’s not a lot you can do about it. The best you can hope for is to notice and quickly fix any problems that do arise. For example, set up a cron job that runs Linklint periodically and e-mails you the results. You can’t stop other sites from breaking your links to them, but you can at least repair the problems as time permits.
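
For example, a crontab entry along these lines runs Linklint early every Sunday morning and mails you the report of files with broken links (the host, paths, schedule, and address are all placeholders; errorF.txt is the report file shown earlier):

# minute hour day-of-month month day-of-week  command
0 3 * * 0  linklint -http -host www.example.com -doc /var/linkcheck /@ && mail -s "Weekly link check" webmaster@example.com < /var/linkcheck/errorF.txt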

Repairing Links

Sometimes you can fix links automatically with search and replace. For example, when Sun’s Java site changed its URL from java.sun.com to www.javasoft.com, it was easy for me to replace all the old links with new ones just by searching for java.sun.com and changing it to www.javasoft.com. Then when Sun changed the host name back to java.sun.com a few years later, I just did the search and replace in reverse.
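
That kind of host rename is easy to script. Here’s a sketch using GNU grep and sed (again, back up the files before an in-place replacement):

$ grep -rl 'java\.sun\.com' --include='*.html' . | xargs sed -i 's/java\.sun\.com/www.javasoft.com/g'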

Most changes aren’t this easy. You’ll often need to spend some time surfing the targeted site and Googling for new page locations. Sometimes you’ll find them. Sometimes you won’t. If you do find them, updating the old link to point to the new location is easy. If you don’t, delete the link. Depending on context, you can delete the entire a element, delete just the <a> start-tag and </a> end-tag, or remove only the href attribute. If you delete only the href attribute, you can move the old URL into the title attribute for archival purposes, so you’ll still have it somewhere if the site comes back:

<a title="http://www.example.com/foo/">Foo Corp.</a>

There’s one important exception. If the link points into your own site, rather than to an external site, you should either delete the entire element or fix it to point to the new page. However, there’s something else you have to do, too. You’ll want to set up a redirect so that other sites and bookmarks linking to this page will be redirected to the new location as well. I’ll take this up in the next section.

Move a Page

Reorganize your URL structure to be more transparent to developers, visitors, and search engines, but always make sure the old URLs for those pages still work.

 http://www.example.com/framework/wo/9t5oReW4DX1d7/0.0.1.172.1.1       
 http://www.example.com/boredofdirectors/bios/                         
 http://www.example.com/2007/05/23/1756?id=p12893                      

                         Move a Page

 http://www.example.com/products/carburetors/                          
 http://www.example.com/boardofdirectors/bios/                         
 http://www.example.com/pr/TIC-patents-new-fuel-efficient-engine       

Motivation

In the words of the W3C, “Cool URIs don’t change.” Once you publish a URI, you should endeavor to make sure that content remains at that URI forever. Other sites link to the page. Users bookmark it. Every time you reorganize your URL structure you are throwing away traffic.

However, you do need to move pages. Sometimes it’s a matter of search engine optimization. Sometimes it’s mandated by a change to a new content management system (CMS) or server-side framework. Sometimes it’s necessary just to keep the development process sane by keeping related static files close together.

Thus, the compromise: Move pages as necessary, but leave redirects behind at the old URLs to point users to the new URLs. Done right, users will mostly never notice that this has happened.

Potential Trade-offs

As long as you don’t actively break links, most external sites that point to you won’t bother to update their links. Then again, most of them won’t update their links no matter what you do. That’s why you need to keep the old links working.

Mechanics

Every time you move a page, set up a redirect from the old page to the new page. In other words, configure your web server so that rather than sending a 404 Not Found error when a visitor arrives at the old location, it sends a 301 Moved Permanently or 302 Found response.

For example, when I published my first book, I put the examples at www.cafeaulait.org/examples/. When I published my second book, I needed to split that up by book, so I moved those files to www.cafeaulait.org/books/jdr/examples/. However, if you go to www.cafeaulait.org/examples/, the server sends this response:

HTTP/1.1 302 Found
Date: Fri, 06 Apr 2007 17:34:27 GMT
Server: Apache/2
Location: http://www.cafeaulait.org/books/jdr/examples
Content-Length: 298
Content-Type: text/html; charset=iso-8859-1

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved
<a href="http://www.cafeaulait.org/books/jdr/examples">here</a>
</p>
<hr>
<address>Apache/2 Server at www.cafeaulait.org Port 80</address>
</body></html>

Most users never see this. Instead, their browser silently and immediately redirects them to www.cafeaulait.org/books/jdr/examples/. I was able to move the page but keep the old links working.
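
You can verify a redirect like this yourself by asking for just the response headers, for example with curl:

$ curl -I http://www.cafeaulait.org/examples/

The status line and the Location header tell you whether the redirect is in place and where it points.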

Consult your server documentation to determine exactly how this is accomplished with different servers. In the most popular server, Apache, this is achieved with mod_rewrite rules placed in the httpd.conf or .htaccess files. For example, the preceding redirect is accomplished with this rule:

RewriteEngine On
RewriteOptions MaxRedirects=10 inherit
RewriteBase   /
RewriteRule   ^examples(.*) books/jdr/examples$1 [R]

Regular expressions indicate exactly what is rewritten and how. In this case, after turning on the engine and setting a couple of options to prevent infinite redirect loops, the base for the rewrites is set to /, the root of the URL hierarchy. This is where all further matches begin.

The actual rule looks for all URLs whose path component begins with /examples, followed by any number of characters. The ^ character anchors this expression to the rewrite base of / set in the previous line. The .* matches everything that comes after /examples, and the parentheses around .* enable us to refer back to those matched characters in the next part of the expression as $1.

The third string on the last line, books/jdr/examples$1, is the replacement. It replaces the matched path with books/jdr/examples followed by whatever the (.*) in the search pattern matched.

Finally, [R] means that the client should be told about the redirect by sending a 302 response. If we left it out, the new data would still be sent to the client, but it would appear as though it had come from the old URL. That’s usually not what you want, because keeping the same page at several URLs reduces your search engine impact. (However, such silent redirects are very useful if you’re trying to put a sensible URL structure on top of a messy internal database-backed system. WordPress uses this technique extensively, for example.) If you’d rather send a 301 Moved Permanently response, change [R] to [R=301].

This one rule creates many redirects. For instance:

  • /examples/chapters/01 is redirected to /books/jdr/examples/chapters/01.

  • /examples is redirected to /books/jdr/examples.

  • /examples/chapters/01/HelloWorld.java is redirected to /books/jdr/examples/chapters/01/HelloWorld.java.

By adjusting the regular expression and the base, you can create other redirects. For example, we could redirect requests for .html files but not .java files:

RewriteRule   ^examples(.*)\.html books/jdr/examples$1.html [R]

Or suppose you’ve changed all your HTML files to end with .xhtml instead of .html. This rule redirects all requests for .html to new names with .xhtml instead:

RewriteRule   ^(.*)\.html $1.xhtml [R]

Rules aren’t always so generic. Sometimes you just want to redirect a single file. For example, suppose you discover you’ve published a document at /vacation/picures.html and it should really be /vacation/pictures.html. You can rename the file easily enough and then insert this rule in your .htaccess file so that requests for /vacation/picures.html are redirected to /vacation/pictures.html:

RewriteRule   /vacation/picures.html /vacation/pictures.html [R]

Another common case is when the site name changes. For example, suppose your company changes its name from Foo Corp to Bar Corp. Of course, you’ll continue to hold on to the www.foo.com domain, but you do want to send users to the new www.bar.com domain. This rule does that:

RewriteCond %{HTTP_HOST}  !^www\.bar\.com [NC]
RewriteRule ^/?(.*)       http://www.bar.com/$1 [L,R=301]

This says that if the host is not www.bar.com (ignoring case), the request should be redirected to www.bar.com, from which the same path and query string will be requested.

Remove the Entry Page

Put the content on the front page.

Figure 8.1. Flash intro page for IBiblio

Figure 8.2. The real IBiblio home page

Motivation

Don’t waste visitors’ time. Every extra click you put in their way is one more opportunity for them to leave your site and go elsewhere. Put everything people need to start using your site on the front page. If it’s too complicated to put everything on the front page, put the first step on the front page.

This will also make the site simpler for users to navigate because they’ll have a clear root.

Potential Trade-offs

None.

Mechanics

Many sites make you click through some sort of entry page before they let you do what you came to do. At best, these pages are a minor annoyance. At worst, they make you sit through an all-singing, all-dancing Flash extravaganza before you can actually get any work done. These pages are occasionally impressive (though usually the main people they impress are the site’s own designers), but more commonly they’re just annoying.

Remember, for most sites repeat visitors are far more important than first-time customers. Someone who has already bought your product or viewed your site and then comes back a second time is far more likely to do it again. I don’t care how brilliantly designed your front page is or how clever a Flash animation you’ve put there. Users will be bored with even the best after the first time. (For all but the best, many users will get bored and leave the first time.)

Ask yourself why people come to your site, and make sure they can do it on page one. If they come to read news, make sure the first page is where they’ll find it. If they need to log in, put the login box on the front page, not hiding behind a link. If they want to shop, make sure they can begin browsing and adding to their shopping cart right away. And because search is one of the preeminent ways users navigate a site, make sure the search box is prominently displayed on the front page.

Amazon and Google are two examples of sites that get this right. You go to Google to search, and that’s what you do on Google’s very first page. Amazon is a more complex site, but it lets you start shopping right away without even logging in. Sites that get this wrong are a lot less common than they used to be, though I do notice that musician and artist sites are disproportionately fond of entry pages. I suspect they view the entry page as art in itself. Maybe that makes some sense for them. For the rest of the world, though, move the content up front. Visitors arrive for the content, not to admire a beautifully orchestrated entry page.

In the short term, the simplest way to support this is to set up a mod_rewrite rule that automatically transfers visitors from the entry page to the real front page—that is, the page users used to go to after clicking on the entry page. For example, this rule in .htaccess redirects from the root to realcontent.html:

RewriteRule ^/$ /realcontent.html

It does not use an [R], so the change is transparent to the end-user. The redirect happens exclusively on the server.

It doesn’t take that much longer to go in the other direction. Move the real home page from its old location to the root of the filesystem. Then set up a redirect so that anyone going to the old home page is redirected to the root of the site:

RewriteRule ^realcontent.html$ / [R=301]

This example specifies a permanent redirect code so that bookmarks can be updated.

Finally, search for all links that pointed to the old location and update them with links to the root.
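
A quick grep turns up the internal links that still need updating (the file name is the one from this example):

$ grep -rn 'realcontent\.html' --include='*.html' .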

Hide E-mail Addresses

E-mail addresses published on web pages should be encoded to prevent spambots from harvesting them.

 <a href="mailto:[email protected]">E-mail Elliotte Harold</a>
 [email protected]

                         Hide E-mail Addresses

 <a href="&#109;&#97;&#105;&#108;&#116;&#111;&#58;                      
 elharo%40metalab%2Eunc%2Eedu">E-mail Elliotte Harold</a>               
 elharo&#x40;macfaq&#x2E;&#x43;om                                       

Motivation

Spammers run spiders that screen-scrape HTML pages for e-mail addresses to spam. However, the spiders aren’t especially smart, don’t follow the relevant specifications, and thus can usually be fooled fairly easily.

Potential Trade-offs

Taken to extremes, hiding e-mail addresses from spambots hides them from your legitimate customers and readers, too. You don’t want to do this. Don’t go overboard. Whatever obfuscation you apply, make sure real people can still find and contact you. No solution will be perfect. You cannot block all spammers and let in all humans, but it is far more important not to block any humans than it is to keep out the last 1% of spam robots.

Mechanics

Finding e-mail addresses is fairly straightforward. This regular expression will pick up most of them:

[\w.+-]+@([\w-]+\.)+[a-zA-Z]{2,7}

You can also search for mailto: to find mailto links. Indeed, it is the ease of mechanically extracting e-mail addresses from text that makes spambots so effective. Most spambots don’t do anything more sensitive than this very search. That’s what makes it possible to fool them.
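
For instance, a rough pass over a site with grep might look like this; the character class below approximates the pattern above in POSIX syntax:

$ grep -rnE '[[:alnum:]_.+-]+@([[:alnum:]_-]+\.)+[A-Za-z]{2,7}' --include='*.html' .
$ grep -rn 'mailto:' --include='*.html' .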

The first and most obvious technique is to break the address in a way that’s easy for a human to repair but hard for a robot. For example:

elharo /at/ metalab.unc.edu

or:

elharo at metalab dot unc dot edu

The problem is that this keeps the addresses from being copied and pasted without manual editing. Thus, I prefer not to do this.

Some people embed the e-mail address in an image instead:

<img src="elharoemail.png" width="167" height="28" />

Of course, this is now completely opaque to a blind user. You can add an alt attribute to rectify that:

<img src="elharoemail.png" width="167" height="28"
     alt="[email protected]"/>

However, spambots that are just matching a regular expression will find this, and this still has several disadvantages for legitimate visitors. First, it prevents the address from being copied and pasted. Second, it doesn’t work in a mailto link. Finally, it has the additional disadvantage of being relatively difficult to implement on a site-wide basis because you need to create a new picture for each address you publish.

The approach I recommend, at least until spammers start to recognize it, is to use some HTML and XML encoding tricks that the browser will slip right through but the regular expression engines used by various spambots will ignore. In particular, I make use of numeric character references, either decimal or hexadecimal, or a mix of both, like so:

&#x65;&#x6c;&#x68;&#x61;&#x72;&#x6f;&#x40;
&#x6d;&#x65;&#x74;&#x61;
&#x6c;&#x61;&#x62;&#x2e;&#x75;&#x6e;&#x63;
&#x2e;&#x65;&#x64;&#x75;
&#101;&#108;&#104;&#97;&#114;&#111;&#64;
&#109;&#101;&#116;&#97;&#108;
&#97;&#98;&#46;&#117;&#110;&#99;&#46;&#101;&#100;&#117;

Very few regular expressions will recognize those as e-mail addresses, but browsers will.
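
You don’t have to write these references by hand; any scripting language can emit them. Here’s a one-line sketch with Perl, using the same address as above:

$ printf '%s' '[email protected]' | perl -ne 'print map { sprintf "&#x%x;", ord } split //'
&#x65;&#x6c;&#x68;&#x61;&#x72;&#x6f;&#x40;&#x6d;&#x65;&#x74;&#x61;&#x6c;&#x61;&#x62;&#x2e;&#x75;&#x6e;&#x63;&#x2e;&#x65;&#x64;&#x75;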

When the e-mail address appears inside a URL, as in a mailto link, a third level of escaping is possible. We can replace some of the characters with percent escapes. For example, here I escape the @ sign and the periods:

<a href="mailto:elharo%40metalab%2Eunc%2Eedu">E-mail Me</a>

Of course, you can combine these techniques. This link hides the mailto scheme in decimal character references while encoding the e-mail part of the URL with percent escaping:

<a href="&#109;&#97;&#105;&#108;&#116;&#111;
&#58;elharo%40metalab%2Eunc%2Eedu">E-mail Me</a>

This isn’t impenetrable by any means, but it’s enough to fool most spambots.

If you like, you can generate the e-mail addresses out of JavaScript instead of including them directly in the page. Few if any spambots will run the scripts. For example:

<script type="text/javascript">
<!--
address=('elharo@' + 'metalab.unc.edu')
document.write(
  '<a href="mailto:' + address + '">' + address + '</a>')
 //-->
</script>

The disadvantage to this approach is that many users disable JavaScript for a variety of reasons. This technique hides the e-mail address from those users as well. One of those JavaScript-disabled users is Googlebot, so users won’t be able to search for the e-mail addresses. I tend to think this is a bad thing because I often search for e-mail addresses, but you may feel differently.
