A Search Engine’s Web API

We’ll talk about three different search engines in this chapter: Excite, SWISH-E, and Microsoft Index Server. Each packages its functionality in a different way. The Excite engine is command-line driven, but it comes wrapped in a layer of Perl. That Perl is generated by a web application that you use to customize your web interface to Excite. SWISH-E is a plain command-line program. To integrate it into your site, you have to script your own wrapper around it or use one of the canned wrappers available for it. Microsoft Index Server, unlike the other two, has no command-line interface. It’s a Dynamic-Link Library (DLL) that works closely with Internet Information Server. To customize its web interface, you write proprietary scripts and templates.

Once integrated into a site, though, these engines are more alike than different. And, crucially for our purposes here, they’re all plug-compatible with one another. How can that be? Indexing strategies and query languages aside, every search engine returns a set of URLs in response to a query. One way or another, you are guaranteed to be able to intercept and process that set of URLs. In some cases, you can wrap your own scripts around the engine itself or around the default scripts that come with it. If that’s not possible, because (as with Microsoft Index Server) the engine runs as a deeply intertwined extension of a web server, you can still wrap it using a web-client script.
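To make the idea of intercepting a set of URLs concrete, here is a minimal sketch. It assumes only that an engine's output, whether an HTML results page or plain command-line text, contains absolute http or https URLs. The function name extract_urls and the sample strings are hypothetical and are not taken from any of the three engines.

```perl
use strict;
use warnings;

# Pull the distinct http(s) URLs out of whatever text an engine returns,
# keeping them in the order of first appearance.
sub extract_urls {
    my ($text) = @_;
    my (%seen, @urls);
    while ($text =~ m{(https?://[^\s"'<>]+)}g) {
        my $url = $1;
        push @urls, $url unless $seen{$url}++;
    }
    return @urls;
}

# Hypothetical sample outputs: an HTML fragment and a plain-text listing.
my $html  = '<a href="http://example.com/a.html">A</a> '
          . '<a href="http://example.com/b.html">B</a>';
my $plain = "1000 http://example.com/b.html \"B\" 2048\n";

print "$_\n" for extract_urls($html . "\n" . $plain);
```

A pattern match like this is enough to show the principle, but it ignores relative links. A production wrapper would more likely parse the engine's actual output format.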

Web-Client Scripting

You can write web-client scripts in Perl, Python, Java, or any other language that can send a URL to a server and fetch its output. The URL might be the address of a page or, more interestingly in many cases, the equivalent of a submitted form, expressed as a CGI call with arguments. Here, for example, is one way to use Perl to search AltaVista for “Jon Udell”:

use LWP::Simple;
print get "http://www.altavista.com/cgi-bin/query?pg=q&kl=XX&q=%22jon+udell%22";

The implications of this little nugget are just astonishing. In effect, every web site is a scriptable component, and the Web as a whole is a vast library of such components. You can invoke these individually from any scripting language that can issue HTTP requests and interpret the responses.
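The query URL in the AltaVista example is simply a form submission written out by hand. As a rough sketch of how a script might assemble such a URL from named arguments, the following uses two hypothetical helpers, form_encode and query_url. It uses core Perl only and assumes that the values are plain ASCII.

```perl
use strict;
use warnings;

# Encode one form value the way a browser does for a GET form:
# spaces become "+", other unsafe characters become %XX.
sub form_encode {
    my ($value) = @_;
    $value =~ s/([^A-Za-z0-9\-_.~ ])/sprintf("%%%02X", ord($1))/ge;
    $value =~ tr/ /+/;
    return $value;
}

# Build a CGI-style URL from a base address and an ordered list of
# name/value pairs.
sub query_url {
    my ($base, @pairs) = @_;
    my @parts;
    while (my ($name, $value) = splice(@pairs, 0, 2)) {
        push @parts, form_encode($name) . '=' . form_encode($value);
    }
    return $base . '?' . join('&', @parts);
}

print query_url('http://www.altavista.com/cgi-bin/query',
                pg => 'q', kl => 'XX', q => '"jon udell"'), "\n";
```

Run as written, this prints the same URL that appears in the example. The resulting string can be handed straight to LWP::Simple's get function.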

What’s more, you can join components to achieve novel effects. For example, I’ve used Yahoo! and AltaVista in combination to measure the “mindshare” of web sites in specific categories. To do that, I wrote a Perl script that uses Yahoo!’s namespace API to unroll the subdirectories under a node of the Yahoo! directory tree, yielding a consolidated list of URLs belonging to some category, such as /Science/Nanotechnology/. Then the script feeds that list of URLs, one at a time, to AltaVista, using its CGI API to ask, for each site, how many other pages in the AltaVista index refer to that site. The ranked list of these citation counts measures what I call the web mindshare of the sites.

Yahoo! wasn’t designed to produce an unrolled list of sites in a category, but its web API can be made to do it. Likewise, AltaVista wasn’t designed to count references to each of the sites in such a list, but its web API can be made to do it. These two macrocomponents, driven remotely by a 100-line Perl script (see http://www.byte.com/features/1999/03/udellmindshare.html), can be joined to create a new application that measures web mindshare.
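The final step of that pipeline, turning per-site citation counts into a ranked list, might look something like the sketch below. It is not the script from the article. The function name rank_by_mindshare is hypothetical, and the site names and counts are made-up stand-ins for the live Yahoo! and AltaVista lookups, so the example runs without a network connection.

```perl
use strict;
use warnings;

# Rank sites by citation count, highest first. $count_refs is a code
# reference that takes a site URL and returns how many pages refer to
# it; in a real script it would query a search engine over HTTP.
sub rank_by_mindshare {
    my ($sites, $count_refs) = @_;
    my %count;
    $count{$_} = $count_refs->($_) for @$sites;
    return map  { [ $_, $count{$_} ] }
           sort { $count{$b} <=> $count{$a} || $a cmp $b }
           keys %count;
}

# Stand-in data (invented): three sites and their citation counts.
my @sites = ('http://site-a.example/', 'http://site-b.example/',
             'http://site-c.example/');
my %fake  = ('http://site-a.example/' => 12,
             'http://site-b.example/' => 340,
             'http://site-c.example/' => 57);

for my $row (rank_by_mindshare(\@sites, sub { $fake{ $_[0] } })) {
    printf "%6d  %s\n", $row->[1], $row->[0];
}
```

Passing the counting step in as a code reference keeps the ranking logic independent of any particular search engine, which is the same separation the two macrocomponents provide.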

This is heady stuff, and we’ll see a lot more of it in chapters to come. For now, let’s just notice that web search engines can be yoked to applications in a number of ways. Because one of those ways necessarily exposes a web API, search engines are guaranteed to be able to plug their results into applications that add value to those results.

In this chapter, I’ll develop such an application. It comprises a family of Perl modules that abstract two kinds of macrocomponents: docbases and search engines. These modules embody a method of classifying and organizing search results that generalizes to any docbase and to any search engine.
