Multiple Engines, Multiple Docbases

A groupware system will often involve docbases of varying types. In this chapter, we’ll assume that the analysts who file reports to the ProductAnalysis docbase also collaborate in a searchable private newsgroup. A search for some term, say LDAP, should return results from both the docbase and the newsgroup. It should also organize the results from these different sources according to some normalized schema, so that, for example, reports filed by Jon Udell and newsgroup postings from Jon Udell will cluster together in a by-author view of the search results.
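To make the idea of a normalized schema concrete, here is a minimal sketch in Python. It is an illustration only, not the implementation developed in this chapter: the raw field names (`analyst`, `From`, `Subject`, and so on), the sample hits, and the three helper functions are all hypothetical. Each source gets its own small adapter that maps raw hits onto a shared record shape, after which a by-author grouping becomes straightforward.

```python
from collections import defaultdict
from email.utils import parseaddr

# Hypothetical raw hits; the field names and values are illustrative assumptions.
docbase_hits = [
    {"analyst": "Jon Udell", "title": "LDAP directory roundup",
     "url": "/ProductAnalysis/ldap-roundup.html"},
]
newsgroup_hits = [
    {"From": "Jon Udell <udell@example.com>", "Subject": "Re: LDAP schema question",
     "msgid": "<123@example.com>"},
]

def normalize_docbase(hit):
    """Map a docbase hit onto the shared schema."""
    return {"source": "docbase", "author": hit["analyst"],
            "title": hit["title"], "ref": hit["url"]}

def normalize_newsgroup(hit):
    """Map a newsgroup hit onto the shared schema, keeping only the display name."""
    name, _address = parseaddr(hit["From"])
    return {"source": "newsgroup", "author": name,
            "title": hit["Subject"], "ref": hit["msgid"]}

def group_by_author(records):
    """Cluster normalized records under their author."""
    groups = defaultdict(list)
    for record in records:
        groups[record["author"]].append(record)
    return dict(groups)

records = ([normalize_docbase(h) for h in docbase_hits]
           + [normalize_newsgroup(h) for h in newsgroup_hits])
by_author = group_by_author(records)

for author, hits in by_author.items():
    print(author)
    for hit in hits:
        print("  [%s] %s" % (hit["source"], hit["title"]))
```

Because the view code sees only the shared record shape, adding a third source would mean writing one more adapter rather than touching the grouping logic.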

Groupware systems may also involve multiple search engines. This might be because, over time, you switch from one engine to another. Or it might be because you run engines in parallel. The BYTE site, for example, comprised three primary docbases: the magazine archive, the public conferences, and the Virtual Press Room. These three docbases were searchable separately or in combination using either of two engines: Excite and SWISH. Why two engines? We found them to be complementary and neither alone to be sufficient. Figure 8-1 shows what the search page looked like for this multiengine, multidocbase system.

Figure 8-1. A multiengine, multidocbase search form

Excite Versus SWISH

Why use Excite? It’s free, and it’s also very powerful. Excite grew out of an academic research project. Its inventors knew that, from a formal library science perspective, search is an exacting discipline. They studied the use of academic search systems and learned that the average search expression involved about a dozen terms. At the same time they studied web search systems and found that the typical search expression involved only one or two terms. How, they wondered, could the complex queries needed for effective search be compressed into the simple queries that web users are willing and able to perform? Their answer was the Excite search engine. Its ingenious concept search feature enables every search hit to launch, with just a mouse click, a follow-on search that looks for “more articles like this one.”[9] The follow-on search uses just a single term—the document ID associated with each search hit. Excite stores a statistical profile for each indexed document. In aggregate these profiles map out a kind of conceptual space, within which each document is surrounded by a set of neighbors. When you perform a “more like this” search, based on any search hit, Excite returns that document’s conceptual neighbors.

Excite’s concept search delivers the crucial benefit of serendipity. That is, it helps you to find things that you didn’t know you were looking for. Let’s say, for example, that you’re searching a corpus for documents about the file format used by LDAP-based directory software. If you know that LDIF (LDAP Data Interchange Format) is the relevant acronym, you’re in great shape. That’s just the sort of unusual and highly discriminatory search term that gets good results with most search engines. But what if you didn’t know about LDIF? What if the very existence of a thing called LDIF was the thing you needed to discover? Here’s where Excite’s concept search can really shine. A concept search based on any of the hits produced by an initial search for LDAP might return a document called LDIF file format, even if this latter document doesn’t contain LDAP! How? The match isn’t based solely on the seed term LDAP but rather on a cluster of terms that Excite judged significant in the document that launched the follow-on concept search. Shared terms such as X.500 or organizational unit or distinguished name could bring the two documents into conceptual proximity, even when they don’t share the seed term LDAP.
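The general idea of a “more like this” search can be sketched in a few lines of Python. This is not Excite’s algorithm, which isn’t described here; as a stand-in it uses plain term counts as each document’s statistical profile and cosine similarity as the measure of conceptual proximity. The three tiny documents, their IDs, and the function names are invented for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical corpus; note that "ldif-format" never mentions LDAP.
DOCS = {
    "ldap-intro": "LDAP directory X.500 distinguished name organizational unit server",
    "ldif-format": "LDIF file format distinguished name organizational unit X.500 entry",
    "unix-dirs": "filesystem directory path folder permissions",
}

def profile(text):
    """Build a crude statistical profile: a count of each term in the text."""
    return Counter(re.findall(r"[a-z0-9.]+", text.lower()))

PROFILES = {doc_id: profile(text) for doc_id, text in DOCS.items()}

def cosine(a, b):
    """Cosine similarity between two term-count profiles."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def more_like_this(doc_id, limit=5):
    """Return the nearest neighbors of doc_id. The only query term is the document ID."""
    seed = PROFILES[doc_id]
    scored = [(other, cosine(seed, prof))
              for other, prof in PROFILES.items() if other != doc_id]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]

for neighbor, score in more_like_this("ldap-intro"):
    print("%-12s %.2f" % (neighbor, score))
```

In this toy corpus the LDIF document ranks as the closest neighbor of the LDAP document purely on the strength of shared terms such as X.500 and distinguished name, while the filesystem document, which shares only the word directory, falls well behind.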

A related benefit of this method is that users can get good results no matter what seed term they choose. For example, directory is far less discriminating than LDAP and might on the first pass produce documents about directories in the filesystem sense as well as in the LDAP sense. It hardly matters. The user need only scan the first result set and click on any LDAP-oriented hit in order to refocus the search in the LDAP direction.

If Excite’s so wonderful (and it is), then why did I bother with SWISH? There’s more than one way to do it. Sometimes a directed walk through conceptual space is just the ticket. Sometimes, though, you really do want to search for the literal term LDIF. In these cases, I found SWISH to be more effective than Excite, even when Excite ran in its literal mode. Moreover, for any large corpus, no single search tool will work best for every query. Finally, Excite doesn’t support fielded search based on <meta> tags. In this chapter we’ll use a rewrite of the original SWISH, called SWISH-E, which adds this invaluable feature.
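For readers unfamiliar with fielded search, the following Python sketch shows what it means to search on <meta> tags. It says nothing about how SWISH-E indexes documents or what its query syntax looks like; it simply pulls name/content pairs out of each page with the standard library’s HTML parser and matches a query against one named field. The page names, field names, and sample values are assumptions made up for the example.

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            name, content = attr.get("name"), attr.get("content")
            if name and content:
                self.fields[name.lower()] = content

def meta_fields(html):
    """Return the meta fields of one page as a dictionary."""
    parser = MetaExtractor()
    parser.feed(html)
    return parser.fields

# Hypothetical pages; the field names and values are illustrative assumptions.
PAGES = {
    "report1.html": '<html><head><meta name="author" content="Jon Udell">'
                    '<meta name="product" content="Directory Server"></head>'
                    '<body>Notes on LDAP.</body></html>',
    "report2.html": '<html><head><meta name="author" content="Another Analyst">'
                    '<meta name="product" content="Directory Server"></head>'
                    '<body>Jon Udell asked about LDIF.</body></html>',
}

def fielded_search(pages, field, value):
    """Return the pages whose named meta field equals value, ignoring case."""
    return [name for name, html in pages.items()
            if meta_fields(html).get(field, "").lower() == value.lower()]

print(fielded_search(PAGES, "author", "Jon Udell"))
```

A plain full-text search for Jon Udell would match both pages, because the second page mentions the name in its body. Restricting the match to the author field returns only the report that Jon Udell actually filed.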

Of course SWISH and Excite aren’t the only choices. There are a variety of noncommercial and commercial tools available; see http://www.searchtools.com/ for an overview of the field. The question isn’t which tool is best but rather which tool—or combination of tools—meets your requirements. And that answer may evolve as your requirements do. So let’s look at how a groupware system can flexibly connect a range of search tools to a range of docbases.



[9] The “more like this” feature isn’t unique to Excite. Other search engines that can do this include Verity’s Topic and Thunderstone’s Texis.
