Plugging in the Microsoft Index Server

Included with the freely available NT 4.0 Option Pack, the Microsoft Index Server is a powerful and feature-packed search tool. It’s also convenient, because it largely automates the indexing process. When installed, it registers with NT to receive filesystem change alerts and indexes on demand when the directories that it monitors change. That’s really helpful, especially for volatile docbases such as newsgroups that can acquire new content hourly. Even for archival docbases, it’s convenient not to have to do manual updates or to write scripts to automate scheduled updates. The simplest way to use Index Server is to mark your web server’s virtual root as indexable. To be more selective, turn off indexing at the root and enable it only for specific docbases. A docbase doesn’t have to reside in the web tree. If you define a virtual root for a docbase that you want to index, the corresponding physical path can be on any local or LAN-attached drive. Or you can specify a URL and use the indexer in web-spider mode. Figure 8.7 shows the setup for our analyst newsgroups.

Figure 8-7. Index Server: Selecting a subtree for indexing

The virtual root /analyst maps to the Microsoft NNTP spool directory \nntpfile\root\analyst. If we set up another root for the ProductAnalysis docbase, then run the indexer, the Property pane in Index Server’s management console will look like Figure 8.8.

Figure 8-8. Index Server custom properties

Working with Index Server Custom Properties

Figure 8.8 shows that there’s a set of properties for each of our docbases. Index Server has very cleverly mapped NNTP headers from the newsgroups into properties like MsgFrom and NewsSubject, and <meta> tags from the HTML docbase into properties like analyst and company. Note, however, that the newsgroup mappings only work in conjunction with the Microsoft NNTP service. Index Server’s filter DLLs, which encapsulate per-docbase knowledge of custom fields, trigger on file extensions. The MS NNTP service, which stores news messages using names like 1000000.nws, plays into this strategy. But if you point Index Server at a standard INN spool directory or at a Collabra Server spool directory, it won’t recognize any custom properties.

To perform a query that uses these properties, or any query for that matter, you need a trio of related files: a form, a control file, and a results template. Example 8.8 shows a bare-bones search form, which we’ll call query.htm.

Example 8-8. A Basic Index Server Search Form

<html><body>

<form action="query.idq" method="get">

<p>Enter your query:
<br><input type="text" name="cirestriction" size="40">
<br><input type="submit" value="go">

<input type="hidden" name="ciscope" value="/">
<input type="hidden" name="cimaxrecordsperpage" value="100">
<input type="hidden" name="templatename" value="query">
<input type="hidden" name="cisort" value="rank[d]">
<input type="hidden" name="htmlqueryform" value="query.htm">
</form>

</body></html>

The form names a control file, query.idq, which invokes an Internet Services API (ISAPI) DLL to run the search. Example 8.9 shows a simple version of that file.

Example 8-9. A Basic Index Server Control File

[Names]
PAanalyst( DBTYPE_WSTR|DBTYPE_BYREF) = d1b5d3f0-c0b3-11cf-9a92-00a0c908dbf1 analyst
PAcompany( DBTYPE_WSTR|DBTYPE_BYREF) = d1b5d3f0-c0b3-11cf-9a92-00a0c908dbf1 company
PAproduct( DBTYPE_WSTR|DBTYPE_BYREF) = d1b5d3f0-c0b3-11cf-9a92-00a0c908dbf1 product
PAduedate( DBTYPE_WSTR|DBTYPE_BYREF) = d1b5d3f0-c0b3-11cf-9a92-00a0c908dbf1 duedate

[Query]
CiColumns=PAcompany,PAanalyst,PAproduct,PAduedate,NewsGroup,MsgFrom,NewsMsgId,
NewsDate,NewsSubject,rank,characterization,vpath,DocTitle

CiTemplate=query.htx

The [Names] section maps the <meta> tags defined in the ProductAnalysis docbase to a Component Object Model (COM) class ID—that’s the nasty-looking string of 32 hex characters—and thence to aliases such as PAcompany. These aliases, in turn, appear in the [Query] section of the control file and enable you to issue a query like:

(@PAanalyst "Jon Udell") or (@MsgFrom "Jon Udell")

This query asks for ProductAnalysis records that contain <META NAME="analyst" CONTENT="Jon Udell"> or conference messages that contain From: Jon Udell. Why doesn’t MsgFrom appear in the [Names] section of query.idq? NNTP properties are among the property sets that Index Server intrinsically understands.
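When the restriction is generated by a program rather than typed into the form, the same syntax can be assembled from property/value pairs. The following is a minimal sketch; build_restriction is a hypothetical helper, and it simply drops embedded double quotes instead of trying to escape them.

```perl
use strict;
use warnings;

# Build an Index Server restriction from [property, value] pairs joined with "or".
# Property names are aliases from the [Names] section of query.idq, or
# intrinsic properties such as MsgFrom. (build_restriction is a hypothetical name.)
sub build_restriction {
    my @pairs = @_;
    my @terms;
    for my $pair (@pairs) {
        my ($property, $value) = @$pair;
        $value =~ s/"//g;    # assumption: strip embedded quotes rather than escape them
        push @terms, qq{(\@$property "$value")};
    }
    return join ' or ', @terms;
}

print build_restriction([PAanalyst => 'Jon Udell'], [MsgFrom => 'Jon Udell']), "\n";
```

The result is the same string as the query shown above, ready to be submitted as the cirestriction value.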

The third query-related file, named query.htx in the control file, is the results template. Example 8.10 shows a minimal query.htx.

Example 8-10. A Basic Index Server Results Template

<%begindetail%>
<p>Url: <%vpath%>
<br>Title: <%DocTitle%>
<br>Newsgroup: <%NewsGroup%>
<br>NewsSubject: <%NewsSubject%>
<br>NewsDate: <%NewsDate%>
<br>NewsMsgId: <%NewsMsgId%>
<br>MsgFrom: <%MsgFrom%>
<br>Company: <%PAcompany%>
<br>Product: <%PAproduct%>
<br>Analyst: <%PAanalyst%>
<br>Duedate: <%PAduedate%>
<%enddetail%>

Now the CiColumns entry in the control file comes into play. Custom properties listed there can be interpolated into the results page if you mention them in the template file. However, if you run a query against our two docbases using this query.htm/query.idq/query.htx combination, you’ll be puzzled to find that only the intrinsic NNTP properties appear in the output, not the HTML <meta> tag properties. The query (@PAanalyst "Jon Udell") will find the right set of records, and the result page will include values for <%vpath%> (the URL) and <%DocTitle%>, but <%PAanalyst%> will be blank.

Dealing with Cached Properties

Index Server has a notion of cached properties, and it will only interpolate values into properties that are cached. To make <%PAanalyst%> appear in the output, you have to follow this procedure:

1. Cache the property. In the Index Server’s management console, select each custom property you want to cache, right-click it, choose Properties, and configure it as shown in Figure 8.9.

2. Commit these changes. To do that, right-click the catalog’s Properties folder, and select Commit Changes.

3. Stop Index Server and restart it. This shouldn’t be necessary, according to the documentation, but it is.

Figure 8-9. Caching Index Server custom properties

Now both the intrinsic NNTP properties and our own user-defined ProductAnalysis properties will receive interpolated values in the search-results page.

Using the .htx Template Language

Index Server’s query system is flexible, but not flexible enough to accommodate the multidocbase architecture we’ve designed in this chapter. Although the .htx template is programmable, there are limits to what it can do. Here’s how you can attack the problem of mapping results to the kind of abstract structure we’ve defined for our SearchResults module:

<% if NewsMsgId eq "" %>    <!-- it's a ProductAnalysis record -->
  <br>   TYPE: Docbase-ProductAnalysis
  <br>SUBTYPE: <%PAcompany%>
  <br> AUTHOR: <%PAanalyst%>
<% else %>                  <!-- it's a conference record -->
  <br>   TYPE: Conference-analyst
  <br>SUBTYPE: <%Newsgroup%>
  <br> AUTHOR: <%MsgFrom%>
<% endif %>

This sort of works, because Index Server’s aggressive approach to custom properties gives the template language access to almost everything that it needs. We’ll be in trouble, though, if we need to dig down for something that isn’t available as a property. For example, in the ProductAnalysis docbase, we could map SUMMARY to <%Characterization%>. Index Server loads that property with the first few hundred characters from each document. But in fact, this docbase affords a more precise mapping—to the CSS-tagged summary section of each record. The .htx template language can’t go after that data. There’s another limitation, too. Note how the query form in Example 8.8 assigns the value rank[d] to the hidden variable CiSort. That tells Index Server to sort results in descending order by relevance. Quite conveniently, we could change that to NewsDate[d] to sort conference records by date. But there’s no way to accommodate the date style of the ProductAnalysis docbase. Nor, for that matter, can the .htx template language extract the date from the pathname that includes it.

Querying the Result Set Using SQL

An Active Server Pages (ASP) script can manipulate Index Server query results much more effectively. Like the Perl library we’ve developed, a server-based ASP script can parse the pathname, extract the date from it, and dig into individual result files to retrieve values not exportable as custom properties. Most interestingly, it affords a much more powerful sorting solution than the one we’ve devised. Because Index Server comes with what’s called an OLE DB provider, its result set can emulate an SQL result set. So the ASP script can simply say:

select * from recordset order by author, subtype, date desc

That’s pretty hot stuff! If Index Server were the only search engine that mattered to you, there would be no need to bother with the library we’ve developed here. But suppose you do want the best of both worlds—the comprehensive field indexing of Index Server and the concept search of Excite. Is there a way to integrate Index Server into what we’ve built so far, without building a complete ASP-based system in parallel? Sure. The simplest approach would be to capture its output using a new Classifier module, Search::MicrosoftIndexClassifier, which would extract the pathname and doctitle from each record and then reuse the existing Mappers.
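For comparison, the ordering that the SQL statement expresses can be approximated in Perl over a list of SearchResults-style hashes. This is only a sketch: sort_results is a hypothetical helper, the sample records are made up, and it assumes DATE holds a string that sorts chronologically (such as YYYY-MM-DD).

```perl
use strict;
use warnings;

# Order records by AUTHOR, then SUBTYPE, then DATE descending.
# Assumption: DATE is a string whose lexical order matches date order.
sub sort_results {
    my @records = @_;
    return sort {
           $a->{AUTHOR}  cmp $b->{AUTHOR}
        or $a->{SUBTYPE} cmp $b->{SUBTYPE}
        or $b->{DATE}    cmp $a->{DATE}      # operands swapped for descending order
    } @records;
}

# Made-up sample data, for illustration only.
my @sample = (
    { AUTHOR => 'Analyst B', SUBTYPE => 'Company Y', DATE => '1998-07-01', TITLE => 'a' },
    { AUTHOR => 'Analyst A', SUBTYPE => 'Company X', DATE => '1998-05-10', TITLE => 'b' },
    { AUTHOR => 'Analyst B', SUBTYPE => 'Company X', DATE => '1998-06-15', TITLE => 'c' },
    { AUTHOR => 'Analyst B', SUBTYPE => 'Company X', DATE => '1998-08-02', TITLE => 'd' },
);

print join(' ', map { $_->{TITLE} } sort_results(@sample)), "\n";    # prints "b d c a"
```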

Integrating Index Server into the SearchResults System

In the case of SWISH-E, the search driver runs the search engine, captures the result as a string, and passes that to Search::SwishClassifier. Index Server’s engine, however, is a DLL that you access indirectly by way of .idq or .asp files. What to do? Black-box the whole thing. The URL left in the browser’s location field after a search is the key to this puzzle:

http://localhost/msidx/query.idq?cirestriction=%28%40PAanalyst+%22Jon+Udell%22%29
&ciscope=%2F&cimaxrecordsperpage=100&templatename=query&cisort=rank%5Bd%5D
&htmlqueryform=query.htm
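A URL of this shape can also be assembled in code. The following is a minimal sketch using only core Perl; form_encode and query_url are hypothetical helper names, and the encoder assumes ASCII input.

```perl
use strict;
use warnings;

# Encode a value the way an HTML form submission does: spaces become "+",
# and characters outside a small safe set become %XX. Assumes ASCII input.
sub form_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9_.* -])/sprintf('%%%02X', ord($1))/ge;
    $text =~ tr/ /+/;
    return $text;
}

# Join a base URL with an ordered, flat list of name/value pairs.
sub query_url {
    my ($base, @params) = @_;
    my @parts;
    while (my ($name, $value) = splice(@params, 0, 2)) {
        push @parts, form_encode($name) . '=' . form_encode($value);
    }
    return $base . '?' . join('&', @parts);
}

my $url = query_url(
    'http://localhost/msidx/query.idq',
    cirestriction       => '(@PAanalyst "Jon Udell")',
    ciscope             => '/',
    cimaxrecordsperpage => 100,
    templatename        => 'query',
    cisort              => 'rank[d]',
    htmlqueryform       => 'query.htm',
);
print "$url\n";
```

Passing the parameters as an ordered list rather than a hash keeps them in the same order as the hidden fields in the search form.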

Any URL-aware programming language can use this web interface to treat Index Server as a component. In Java and Python, web-client capability is built into the standard kit. In Perl, you can use the Library for WWW access in Perl (LWP) module to programmatically fetch URLs. Whatever the method, it’s a simple matter to capture Index Server’s output so that we can feed it into our search-results kit. To produce record-per-line output analogous to what SWISH-E and Excite emit, we can rewrite the .htx template like this:

<%begindetail%>
<Url><%vpath%><Title><%DocTitle%><Newsgroup><%NewsGroup%><NewsSubject>
     <%NewsSubject%><NewsDate><%NewsDate%><NewsMsgId><%NewsMsgId%><MsgFrom>
     <%MsgFrom%><Company><%PAcompany%><Product><%PAproduct%><Analyst>
     <%PAanalyst%><Duedate><%PAduedate%>
<%enddetail%>

This template produces output that’s meaningless to humans but that’s just right for a Search::MicrosoftIndexClassifier that wants to parse results a line at a time. Elements like <Url> would be poor choices if the results page were intended for human consumption. Browsers see these elements as bogus HTML and don’t render them. But for a page that’s only used by a robotic search-results processor, they’re fine.
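The capture step might look like the following minimal sketch. It assumes that LWP::Simple is installed and that the template is written so each record occupies one physical line (the listing above is wrapped to fit the page). The names fetch_results and record_lines are hypothetical, and the sample page is made up.

```perl
use strict;
use warnings;

# Fetch the results page for a query URL.
# LWP::Simple is loaded only when this routine is actually called.
sub fetch_results {
    my ($url) = @_;
    require LWP::Simple;
    my $page = LWP::Simple::get($url);
    die "No response from $url\n" unless defined $page;
    return $page;
}

# Keep only the lines that look like records: those that start with <Url>.
sub record_lines {
    my ($page) = @_;
    return grep { /^<Url>/ } split /\r?\n/, $page;
}

# A canned page stands in for a live server here. With a live server:
#   my $page = fetch_results($query_url);
my $page = "<html>\r\n"
         . "<Url>/analyst/1000000.nws<Title>first<MsgFrom>Jon Udell\r\n"
         . "<Url>/analyst/1000001.nws<Title>second<MsgFrom>Jon Udell\r\n"
         . "</html>\r\n";

my @records = record_lines($page);
print scalar(@records), " records\n";
```

The resulting list of lines is what a classifier would then hand, one record at a time, to the Mappers.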

Exploiting Index Server’s Aggressive Indexing of Custom Properties

Since the Mappers can already produce the abstract SearchResults structure using pathnames and document titles from their respective docbases, we could stop right here. But it seems a shame to throw away all the work that Index Server has already done. Why revisit the docbases to look up fields that Index Server has already found? Instead, let’s expand the interface that the Mappers implement. To the methods isRecord( ) and mapResult( ), we’ll add the method mapFullySpecifiedResults( ). This per-docbase method will receive a delimited record, pick out the items that pertain to itself, and map them. Here’s the ProductAnalysisMapper version:

sub mapFullySpecifiedResults
  {
  my ($result,$spec) = @_;

  if ( $spec =~ m#(<Url>)([^<]+)# )
    { $result = fieldsFromPathname($result,$2);}

  if ( $spec =~ m#(<Title>)([^<]+)# )
    { $result = fieldsFromDoctitle($result,$2); }
 
  if ( $spec =~ m#(<Analyst>)([^<]+)# )
    { $result->{AUTHOR} = $2; }

  $result = fieldsFromDocbase($result);

  return $result;
  }

Note that it simply reuses the fieldsFromPathname( ) and fieldsFromDoctitle( ) routines, passing along the values for pathname and doctitle that it receives from Search::MicrosoftIndexClassifier. It fills the AUTHOR slot with the value of the analyst <meta> tag, thus avoiding the need to peek into the docbase record as the standard fieldsViaPathname( ) method must do.

The ConferenceMapper version of this method works similarly:

sub mapFullySpecifiedResults
  {
  my ($result,$spec) = @_;
  if ( $spec =~ m#(<Url>)([^<]+)# )
    {   $result->{PATH} = $2; }
  if ( $spec =~ m#(<Newsgroup>)([^<]+)# )
    {   $result->{SUBTYPE} = $2; }
  if ( $spec =~ m#(<NewsSubject>)([^<]+)# )
    {   $result->{TITLE} = $2; }
  if ( $spec =~ m#(<NewsDate>)([^<]+)# )
    {   $result->{DATE} = $2; }
  if ( $spec =~ m#(<MsgFrom>)([^<]+)# )
    {   $result->{AUTHOR} = $2; }
  $result = fieldsFromDocbase($result);
  return $result;
  }

It gets everything it needs from the pathname plus the NNTP headers transmitted by way of Index Server-specific custom properties. Note that there is nothing Index Server specific about this behavior. Any engine that can supply a complete set of field values in response to a search can use mapFullySpecifiedResults( ) to bypass the fallback mechanism that digs the values out of their original docbase records. Note also how the field-at-a-time parsing of the fully specified record into a hashtable-based accumulator isolates the Mapper from any dependency on the order or composition of the record. If we add a third docbase that exports its own set of custom fields, we only need to ensure that none of them conflict with existing fields. Recognition of the new fields will be encapsulated in the new docbase’s Mapper. Neither it nor the existing Mappers will care whether any given field exists, and no Mapper will care where any of its fields appear in the record.
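The order independence described above can be seen in a small self-contained sketch. The field helper is hypothetical, not part of the Mappers, and the sample records (including the newsgroup name) are made up for illustration.

```perl
use strict;
use warnings;

# Pull one tagged field out of a delimited record; return undef if it is absent.
sub field {
    my ($spec, $tag) = @_;
    return $spec =~ m#<\Q$tag\E>([^<]+)# ? $1 : undef;
}

# The same fields in two different orders; the second also carries a field
# that nothing asks for.
my $one = '<Url>/analyst/1000000.nws<Newsgroup>analyst<MsgFrom>Jon Udell';
my $two = '<MsgFrom>Jon Udell<Extra>ignored<Url>/analyst/1000000.nws<Newsgroup>analyst';

for my $spec ($one, $two) {
    my %result;
    for my $tag (qw(Url Newsgroup MsgFrom Analyst)) {
        my $value = field($spec, $tag);
        $result{$tag} = $value if defined $value;    # missing fields are skipped
    }
    print join(', ', map { "$_=$result{$_}" } sort keys %result), "\n";
}
```

Both records produce the same accumulated hash: the unrequested field is ignored, and the absent Analyst field is skipped.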

Using the optimized Mappers

Example 8.11 shows a new Classifier, Search::MicrosoftIndexClassifier.

Example 8-11. A Different Kind of Classifier

package MicrosoftIndexClassifier;
use Classifier;

use ConferenceMapper;
my $con = ConferenceMapper->new();

use ProductAnalysisMapper;
my $pa = ProductAnalysisMapper->new();

@ISA = ('Classifier');

sub new 
  {
  my ($pkg) = @_;
  my $self =  {};
  bless $self,$pkg;
  return $self;
  }

sub classify
  {
  my ($self,$results) = @_;
  my @resultlist = split (/\n/,$results);
  foreach (@resultlist)
    {
    foreach $obj ($pa, $con)
      {
      if ( $obj->isRecord($_) )
        { 
        my $href = $obj->mapFullySpecifiedResults($_); 
        $self->addResult($href);
        }
      }
    }
  }

1;

It’s a tad simpler than SwishClassifier, because it doesn’t need to parse out each record’s pathname and doctitle. It just hands the whole delimited record—a superset of all the elements in each docbase—to the Mappers.
