Exploiting URL Namespaces and Doctitle Namespaces

We’ve seen how a docbase’s URL namespace can encode information that helps both programs and people categorize the subsets of that namespace that search engines return. A complementary namespace can carry an additional information load. The <TITLE> tag enclosed by an HTML document’s <HEAD> is an invaluable but often underutilized resource. Text placed there doesn’t appear in the body of the web page; it becomes the title of the window in which the browser displays the page. As I mentioned in Chapter 5, you may be disappointed if you rely on the window title to display information that is essential to users. I’ve found that people don’t regard the window title as part of a web page, so you have to repeat it in the body of the page to get people to notice it. But while this doctitle namespace may not be very interesting to people, it’s enormously useful to search-results scripts. Figure 8-2 shows what a search-results page looked like on the BYTE site.

Figure 8-2. Multidocbase search results

The result set draws from three different docbases, but everything fits into a common abstract pattern:

DATE
  TYPE SUBTYPE TITLE
  ABSTRACT

When you control both the search engine and the docbase, you can always achieve this effect. The question is: With how much effort? Careful design of the URL and doctitle namespaces will yield search results that integrate easily and comfortably into this kind of structure. The trick is to ensure that the two namespaces, in combination, can map as completely as possible to the abstract markers—that is, DATE, TYPE, SUBTYPE, and so on.

In this case, the search-results structure requires a creation date for each result. You can’t just rely on the file’s modification date that some search tools can report. For our purposes here, the creation date must be fixed—we want the age of the document, not a last-modified date that changes when someone edits the file or when a filter program transforms the entire docbase.

However, as is typical when you try to map multiple docbases into a common results architecture, the notion of a creation date is open to interpretation. In this example, for documents in the magazine archive, it really means issue date—that is, the month in which the article appeared, not the month in which it was written. For conference messages and press releases, the creation date really means what it says—but it also says less than it could. For records of these two types, the creation date is known not merely to the month, but to the day. Because daily grouping of results didn’t map cleanly across all three docbases and because monthly aggregation was simpler yet sufficient, I took the latter approach.
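The monthly aggregation described above can be sketched as follows. This is only an illustration in Python, not the script used on the BYTE site: the `hits` list, the `group_by_month` name, the placeholder day for month-only dates, and the newest-first ordering are all assumptions.

```python
from collections import defaultdict
from datetime import date

# Hypothetical hits: (creation date, title). Magazine articles are known only
# to the month, so day 1 stands in as a placeholder for them.
hits = [
    (date(1997, 4, 1), "Cheaper Computing"),
    (date(1995, 8, 29), "WinFrame for Networks"),
    (date(1997, 4, 17), "A hypothetical conference message"),
]

def group_by_month(hits):
    """Bucket hits by (year, month), discarding the day."""
    buckets = defaultdict(list)
    for created, title in hits:
        buckets[(created.year, created.month)].append(title)
    # Assumed presentation order: newest month first.
    return sorted(buckets.items(), reverse=True)

grouped = group_by_month(hits)
```

Discarding the day is what lets day-precision records and month-precision records share one grouping key.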

Where did the creation date come from? That depended on the docbase. For magazine articles and press releases, it was included in each record’s HTML document title. The URL /art/9704/sec6/art1.htm, for example, corresponded to the doctitle BYTE / April 1997 / Cover Story / Cheaper Computing. Likewise, the URL /vpr/000439.htm corresponded to the doctitle VPR / Citrix / WinFrame for Networks / 95-08-29. For each docbase, the set of doctitles formed a kind of virtual database. Knowing the schema of that database, the search-results script could pick out fields and use them to structure a results page. Note that the date appears twice for hits from the magazine archive—as the URL component 9704 and the doctitle component April 1997. That’s OK; there’s more than one way to do it. It doesn’t matter which of these namespaces carries the marker you need, so long as at least one of them does.
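To make the "virtual database" idea concrete, here is a minimal sketch of how a script might split a doctitle into named fields. The field names in `SCHEMAS`, the helper names, and the rule that two-digit years belong to the 1900s are assumptions inferred from the two example doctitles, not the original implementation.

```python
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Assumed field order per docbase, keyed by the first doctitle component.
SCHEMAS = {
    "BYTE": ["docbase", "date", "section", "title"],
    "VPR": ["docbase", "company", "title", "date"],
}

def parse_doctitle(doctitle):
    """Split a ' / '-delimited doctitle into a dict of named fields."""
    parts = [p.strip() for p in doctitle.split(" / ")]
    schema = SCHEMAS[parts[0]]
    if len(parts) != len(schema):
        raise ValueError(f"doctitle does not fit schema: {doctitle!r}")
    return dict(zip(schema, parts))

def month_of(raw):
    """Return (year, month) from 'April 1997' or '95-08-29'."""
    if "-" in raw:
        yy, mm, _dd = raw.split("-")
        return 1900 + int(yy), int(mm)  # assumes two-digit years are 19xx
    name, year = raw.split()
    return int(year), MONTHS.index(name) + 1

article = parse_doctitle("BYTE / April 1997 / Cover Story / Cheaper Computing")
release = parse_doctitle("VPR / Citrix / WinFrame for Networks / 95-08-29")
```

A title that itself contains " / " would break this simple split, so a scheme like this depends on the delimiter being reserved across the docbase.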

It’s useful to think of the URL and doctitle namespaces as complementary. For example, when we made creation date part of every Docbase URL, we took some of the pressure off the doctitle namespace. If the doctitles don’t need to display creation date, they can instead display some other useful dimension—say, author. It’s the union of the namespaces that matters to a search-results script, and you can use them in combination to carry the maximum information load.
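A sketch of that union, under stated assumptions: the URL pattern is generalized from the single example /art/9704/sec6/art1.htm, the doctitle layout with an author in place of the date is hypothetical, and "A. Author" is a placeholder rather than a real byline.

```python
import re

# Assumed URL shape for the magazine archive: /art/YYMM/secN/artN.htm
ART_URL = re.compile(r"^/art/(\d{2})(\d{2})/sec(\d+)/art(\d+)\.htm$")

# Hypothetical doctitle schema in which author replaces the date.
TITLE_SCHEMA = ["docbase", "section", "title", "author"]

def fields_from_url(url):
    """Pull the creation month and position numbers out of the URL."""
    m = ART_URL.match(url)
    if not m:
        return {}
    yy, mm, sec, art = m.groups()
    # Assumes two-digit years are 19xx.
    return {"date": f"19{yy}-{mm}", "section_no": int(sec), "article_no": int(art)}

def fields_from_doctitle(doctitle):
    """Split the doctitle into the fields named by TITLE_SCHEMA."""
    parts = [p.strip() for p in doctitle.split(" / ")]
    return dict(zip(TITLE_SCHEMA, parts))

def merged_record(url, doctitle):
    """Union of both namespaces; the URL wins if both carry a field."""
    record = fields_from_url(url)
    for key, value in fields_from_doctitle(doctitle).items():
        record.setdefault(key, value)
    return record

record = merged_record("/art/9704/sec6/art1.htm",
                       "BYTE / Cover Story / Cheaper Computing / A. Author")
```

Which namespace takes precedence when both carry the same marker is a design choice; here the URL is arbitrarily preferred.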
