When URL and Doctitle Namespaces Don’t Suffice

You want to cram as much as possible into the URL and doctitle namespaces, because these are what you get back “for free” from search engines. But inevitably, there will be missing pieces. Consider the newsgroup results in Figure 8.2. There’s clearly a value that will map well to creation date—namely, the posting date of each newsgroup message. However, that value appears neither in the document’s URL nor in its document title. In fact, documents in the primary conferencing docbase had no HTML document titles, because they weren’t HTML documents; they were newsgroup messages that carried their fielded information in NNTP headers.

In these cases, you have to dig out the information another way. How? That depends on the nature of the docbase—whether it resides remotely on a server you don’t control or locally on a server you do control, whether you have access to the server’s file system, and possibly other factors.

Note that the newsgroup search results in Figure 8.2 presented two links. The N link’s address was a news:// URL such as news://dev4.byte.com/[email protected], and the W link pointed to a mirrored web page such as http://dev4.byte.com/syscon/02137.html. Given this mirrored-docbase situation, there were many possible ways to get hold of a creation date for a newsgroup search result.

One solution would have been to reengineer the doctitle namespace in the web mirror of the newsgroup, just as was done for the other docbases, and then point the indexers at that web mirror rather than the NNTP server’s spool directory.

Another solution, and the one I in fact used in this case, relied on access to the NNTP server’s spool directory. In that namespace, a reference to a news message looked like /syscon/1793—that is, a directory corresponding to the newsgroup named syscon, and a numbered file containing the message. The search-results script used this reference to open the message file, absorb its NNTP headers, and parse the Date: field.
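
The spool-directory lookup can be sketched in a few lines. This is a minimal illustration in Python, not the script that actually ran on the site: the function name message_date, the spool_root parameter, and the sample path in the usage note are all assumptions made for the example. It maps a reference such as /syscon/1793 onto a file under the spool root, reads only the header block, and parses the Date: field.

```python
import os
from email.parser import BytesParser
from email.utils import parsedate_to_datetime


def message_date(spool_root, reference):
    """Return the Date: header of a spooled news message as a datetime.

    Hypothetical helper: spool_root is wherever the news server keeps its
    spool, and reference looks like "/syscon/1793". Returns None when the
    header is missing or cannot be parsed.
    """
    # Map "/syscon/1793" onto <spool_root>/syscon/1793.
    path = os.path.join(spool_root, *reference.strip("/").split("/"))

    # Headers end at the first blank line, so stop reading there
    # instead of pulling in the whole message body.
    header_lines = []
    with open(path, "rb") as f:
        for line in f:
            if line in (b"\n", b"\r\n"):
                break
            header_lines.append(line)

    headers = BytesParser().parsebytes(b"".join(header_lines), headersonly=True)
    raw = headers.get("Date")
    if raw is None:
        return None
    try:
        return parsedate_to_datetime(str(raw))
    except (TypeError, ValueError):
        # Malformed date strings raise different errors across Python versions.
        return None
```

A search-results script might call message_date("/var/spool/news", "/syscon/1793") for each hit and format the returned datetime for display, falling back to a blank date column when the function returns None. The spool path shown here is only a placeholder.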

There are lots of ways to skin the cat, but it’s always best when you don’t have to. Use the URL and doctitle namespaces for all they’re worth.
