CHAPTER 31

Search and Apache Solr Integration

by Peter Wolanin

This chapter will discuss the Apache Solr Search Integration module in terms of how it implements the Drupal core Search module hooks, as an example of how to make a custom search, and in terms of its functionality. This chapter also highlights some of the additional hooks that allow the module's behavior to be customized and extended. This module can be seen as an example of integrating Drupal with a web service, and it makes use of some object-oriented code.

The Search module in Drupal core provides a framework and API for modules to provide search functionality. The Search module itself does very little except provide a search form and some administrative configuration options. In order to have the search form show up, one or more modules must implement the Search module hooks. Within Drupal core, both the Node module and User module implement the search hooks.

The Node module provides the ability to do keyword searches of content. Since it is part of Drupal core, it uses the SQL database as the storage and searching mechanism. While the Node module's search implementation provides very good keyword matching, its use of the database can cause significant performance issues for larger sites. In addition, while it can handle some filtering (for example, by user or taxonomy term via the advanced search form), this is rather limited.

The Apache Solr Search Integration module (found at drupal.org/project/apachesolr) provides an alternative to, and replacement for, the Node module's search for indexing and searching content. A wider range of content indexing and filtering functionality is available, and a single Solr server can be accessed from many Drupal sites. A number of the enhancements and changes to the Search module API in Drupal 7 were driven by the limitations encountered when creating Apache Solr Search Integration with the Drupal 6 Search module. There are several key features that distinguish it from the Node module search:

  • Faceted search using Facet API module.
  • Multiple user-selectable sorts for results.
  • Highly customizable boosting that allows you to tune the relevancy score for search results to control what is listed first.
  • Fast searches for large amounts of content (e.g., > 10,000 nodes).
  • The potential to do multi-site searches or federated searches, such as showing user and content searches in the same result set.

Note: The term “facet” refers to an attribute of the documents in the search results (or the whole search index), such as a taxonomy term. “Faceted search” refers to an interface for filtering search results based on facets, which is a technique for helping the user find the desired result and avoid dead-end searches.

Apache Solr is its own open-source project. Actually, it's a part of the Lucene Java project: Lucene is the actual search library that Solr is built upon. Apache Solr provides the HTTP interface and hence can be integrated with Drupal (or almost any other application) while residing on the same server or a totally separate server. The fact that Solr can be run on a separate server is one reason for its popularity in the Drupal community: it allows a site administrator to reduce the load on the database server and get fast search results even for sites with hundreds of thousands of nodes. Solr also has built-in support for master-slave replication, so it can easily provide high availability for search requests; it can also scale horizontally in the event that search traffic exceeds the capacity of one server. The Drupal module is primarily intended to work with the stable Solr 1.4.x release series, though most or all functionality should work with the forthcoming Solr releases.

If you want to run Solr, you have the following options:

  • Run it yourself. Generally this option is suited for those with at least one dedicated server or VPS. It requires, at the least, the ability to deploy Solr in a Java servlet container (like Jetty or Tomcat) and control access to it via firewall rules, HTTP authentication, or other authentication.
  • Pay for a hosted Solr index. Acquia provides a Solr index for every customer with a Drupal support subscription. Other companies provide more generic services.
  • Pool resources via a non-profit or cooperative such as May First People Link.

The Apache Solr project comes with a simple-to-run Jetty deployment that almost anyone interested in trying the module can get running in a few minutes on a local machine. The steps are outlined in the README.txt that comes with the Apache Solr Search Integration module. However, this simple kind of deployment does not include any authentication mechanism, so access to Solr needs to be protected by a firewall, at minimum, when used for a production site on a public server.

Search Module Administrative Options

The administrative interface for the Search module provides some key configuration options in Drupal 7 that were not available in Drupal 6. Both the Search module and Apache Solr Search Integration settings are found in the Search and metadata section at the admin/config path, as shown in Figure 31–1.

Figure 31–1. The Search settings in the Admin Configuration screen

Of particular note for the Search module, shown in Figure 31–2, the form at admin/config/search/settings allows you to selectively enable any or all of the modules that implement the search hooks. You can also choose any one of them to be the default search (the search that is run from the search block form and the default tab). If you want to use Apache Solr Search Integration as the main search, you should make it the default and likely disable the Node module search.

Figure 31–2. Search module configuration options

Search Results and Facet Blocks

Further configuration will be discussed later in the chapter. Once the appropriate filters are enabled and blocks are configured, the default search shows the current keywords and filters, so you can narrow your search or broaden it by removing one of the current filters. The enabled filters that are relevant for the current search results show up in a facet block. Each link in this block applies one additional filter to narrow the result set. The default settings show the facets with a checkbox (this is added via JavaScript). Once a facet filter is applied to the search results, you can also check the Retain current filters checkbox, which lets you search again with different keywords while keeping the current filters (see Figure 31–3).

Figure 31–3. Search results with current search block, facet blocks, and the checkbox to retain filters when using new keywords

Search Module API

The Search module in Drupal core provides a framework that other modules can use. In particular, they can take advantage of the search interface and standard result formatting from the Search module.

Much more detail on these hooks can be found in the search.api.php file that comes with the Drupal 7 Search module; the same documentation can be accessed online at api.drupal.org. This section will focus on the hooks implemented by the Apache Solr Search Integration module, which are basically the minimal set of hooks any module would need to implement to create a custom search implementation.

Hooks Implementations Required to Create a Search

The following hooks are essential to define a new search that shows up as a search tab. It's a conscious limitation that a single module can only define one search—this helps keep the API simpler.

hook_search_info()
hook_search_execute()

With just these two hooks, your new search functionality can appear. hook_search_info() lets the Search module know about your search implementation, what title to give the search tab, what path you want to use to run searches, and (optionally) the name of a callback function that adds other conditions like filters to the search in addition to keywords. The other essential hook, hook_search_execute(), is called when a user visits the search path and either keywords or conditions are present. Note that while the search form submits via a POST request, the Search module actually takes all the search parameters from the URL. So you can, for example, bookmark a search and visit the URL later to run it again and find new results. The return value from hook_search_execute() is an array of results, each of which is an array with certain key/value pairs expected by the theme function.
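As a minimal sketch of these two hooks, assuming a hypothetical module named MYMODULE, the following defines a search tab and returns one hard-coded result. The 'link', 'title', and 'snippet' keys are among those documented for hook_search_execute() in search.api.php; the tab title, path, and result values are placeholders. A real implementation would query a search backend and would normally build the link with url() and escape text with check_plain().

```php
<?php
/**
 * Implements hook_search_info().
 *
 * MYMODULE, the tab title, and the path are hypothetical placeholders.
 */
function MYMODULE_search_info() {
  return array(
    'title' => 'Example',
    'path' => 'example',
  );
}

/**
 * Implements hook_search_execute().
 *
 * Returns an array of results; each result is itself an array of
 * key/value pairs used when the results are themed.
 */
function MYMODULE_search_execute($keys = NULL, $conditions = NULL) {
  $results = array();
  if (isset($keys) && trim($keys) !== '') {
    // A hard-coded result stands in for a real backend query.
    $results[] = array(
      'link' => 'http://example.com/node/1',
      'title' => 'Example result',
      'snippet' => 'Matched keywords: ' . htmlspecialchars($keys, ENT_QUOTES, 'UTF-8'),
    );
  }
  return $results;
}
```

Because a module can define only one search, the module name alone is enough for the Search module to pair the info and execute hooks.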

Note: The fact that all the search parameters are passed in via the URL with a GET request has benefits beyond allowing you to bookmark searches. For example, since Drupal uses the page URL as a cache key, you can benefit from Drupal page caching for search pages for any commonly run searches on your site, such as providing users with links to particular search URLs.

Additional Search Module Hooks

The following hooks are optional but they allow your module more control over the search process and the indexing process (if using the search module's indexing facilities), or they allow you to add to the Search module administrative interface:

hook_search_access()
hook_search_reset()
hook_search_status()
hook_search_admin()
hook_search_page()
hook_search_preprocess()
hook_update_index()

If you implement hook_search_page(), you can take total control over the processing and display of search results, in which case the return format from hook_search_execute() can be altered for your convenience rather than conforming to the format expected by the Search module. The implementation shown in Listing 31–1 mirrors closely what's in the core search module but adds an additional possible output to browse all facets' blocks.

The apachesolr_search module, part of Apache Solr Search Integration, implements five Search module hooks (the two required hooks plus three of the optional ones) as well as the callback that is optionally specified in hook_search_info(). This callback is important because it allows the code to pull additional filter parameters out of the query string and to use these to run a search even when there are no keywords. It was added to the core Search module based on this need for Apache Solr Search Integration. You'll notice also that the info returned for hook_search_info() is actually the content of a variable, though this variable is not (currently) exposed for configuration in the user interface. This allows developers to change the name and path for the search tab without needing to use hook_menu_alter().

Listing 31–1. Basic Search Module with Additional Possible Output

/**
 * Implementation of hook_search_info()
 */
function apachesolr_search_search_info() {
  return variable_get('apachesolr_search_search_info', array(
    'title' => 'Site',
    'path' => 'site',
    'conditions_callback' => 'apachesolr_search_conditions',
  ));
}

/**
 * Implementation of hook_search_execute()
 */
function apachesolr_search_search_execute($keys = NULL, $conditions = NULL) {
  $filters = isset($conditions['filters']) ? $conditions['filters'] : '';
  $solrsort = isset($_GET['solrsort']) ? $_GET['solrsort'] : '';

  try {
    return apachesolr_search_run($keys, $filters, $solrsort, 'search/' . arg(1), pager_find_page());
  }
  catch (Exception $e) {
    watchdog('Apache Solr', nl2br(check_plain($e->getMessage())), NULL, WATCHDOG_ERROR);
    apachesolr_failure(t('Solr search'), $keys);
  }
}

/**
 * Implementation of a search_view() conditions callback
 */
function apachesolr_search_conditions() {
  $conditions = array();

  if (isset($_GET['filters']) && trim($_GET['filters'])) {
    $conditions['filters'] = trim($_GET['filters']);
  }
  if (variable_get('apachesolr_search_browse', 'browse') == 'results') {
    // Set a condition so the search is triggered.
    $conditions['apachesolr_search_browse'] = 'results';
  }
  return $conditions;
}

/**
 * Implementation of hook_search_reset()
 */
function apachesolr_search_search_reset() {
  apachesolr_clear_last_index('apachesolr_search');
}

/**
 * Implementation of hook_search_status().
 */
function apachesolr_search_search_status() {
  return apachesolr_index_status('apachesolr_search');
}

/**
 * Implements hook_search_page()
 */
function apachesolr_search_search_page($results) {
  if (!empty($results['apachesolr_search_browse'])) {
    // Show facet browsing blocks.
    $output = apachesolr_search_page_browse($results['apachesolr_search_browse']);
  }
  elseif ($results) {
    $output = array(
      '#theme' => 'search_results',
      '#results' => $results,
      '#module' => 'apachesolr_search',
    );
  }
  else {
    // Give the user some custom help text
    $output = array('#markup' => theme('apachesolr_search_noresults'));
  }
  return $output;
}

Obviously the code here mostly wraps calls to other internal module functions; the status and reset hooks are implemented simply to allow status and reset operations to work with the Search module administrative page as well as within the Apache Solr Search Integration administrative pages. Note that hook_search_page() is implemented so that it can provide either facet block browsing or customized help text when there are no search results. The code to format normal search results is the same as the default implementation in Search module.

Apache Solr Search Configuration

The administrative interface for the Apache Solr Search Integration provides a number of configuration options that will meet the needs for most initial customization. In particular, by configuring the boost setting and doing some basic tests of end user satisfaction with the ordering of results, you can help make the search results become more relevant.

Enabled Filters

To let end users navigate via a particular facet, you need to follow a two-step process. First, you enable the filter via the Apache Solr Search Integration settings; then you enable the corresponding block via the normal Block module interface. Enabling a filter means that extra processing is performed by the Solr server and additional data is returned. Thus, you should only enable those filters where you will use the block or will use the data for some other purpose. For example, in order to make a facet block available for the Tags field, the last filter needs to be enabled (see Figure 31–4) and then the block is configured.

Figure 31–4. Enabling a filter makes an additional facet available in the search results.

Type Biasing and Exclusion

A common need for sites is that content of a certain type should receive a boost in search results or a certain content type should not be added to the search index at all. For example, you may wish to steer users toward blog posts. Alternately, you may want them to first find documentation represented by book nodes. Initially, all content types are treated equally. By setting a value to something other than “Ignore” you indicate that a certain node type within your site content has greater importance and should receive a higher score in search results. In contrast, there may be some content that should not be indexed at all. This may be true for nodes of a type that is automatically generated or represents data rather than actual site content.

The administrative interface lets you configure boosting and exclusion per content type (see Figure 31–5). The module will attempt to immediately delete from the search index all relevant nodes if you add a type to the excluded list, so do not make this change casually.

Figure 31–5. Setting the search result bias and exclusion settings for specific content types

Apache Solr Search Customization

The Apache Solr Search Integration module is only a starting point if you want an interface that is fully optimized for your Drupal site. In addition, the filtering and sorting capabilities of Solr make it attractive to use as a data source for certain kinds of listing pages, such as those on ecommerce sites, library sites, or on drupal.org itself for the page that lists all modules. There is a wide range of hooks documented in apachesolr.api.php, but only a few of them are necessary for most typical customizations.

Hooks for Getting Data into Solr

When indexing a node, Apache Solr Search Integration will add certain fields to the document by default. If you want to do custom filtering, boosting, etc., you will want to add additional fields to the document in the index. To do so, you can implement hook_apachesolr_update_index($document, $entity, $namespace). This hook is used to add more data to a document before it's sent to Solr; it can also be used to alter or replace data added to the document by Apache Solr Search Integration or another module. It works like an alter hook, although there's no need to pass the variable by reference because the document is an object. When adding data to the Solr index, it's helpful to look at the schema.xml file to see the names and types of the dynamic fields. You can control how the data is indexed simply by naming a property on the document with the right prefix. For example, you could add a single-valued field like so:

function MYMODULE_apachesolr_update_index($document, $node, $namespace) {
  if ($node->type == 'site_product' && $document->entity_type == 'node') {
    // Add an additional custom node field to the index.
    $document->fs_price = $node->price;
  }
}

There are several ways to get searchable data into the index. The simplest is to add more content to the node to be rendered at index time. Another approach is to implement hook_node_view($node, $view_mode, $langcode) and look for a $view_mode of 'search_index'. Yet another option is to add content via hook_node_update_index($node). Any content returned from that hook is appended to the content sent to the search index. However, in the latter two cases, this content will simply be found as part of a keyword search and can't be used to create facets or sorts.
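The hook_node_update_index() option can be sketched as follows. MYMODULE, the 'site_product' content type, and the $node->sku property are hypothetical; a real module would typically escape the value with check_plain() rather than calling htmlspecialchars() directly.

```php
<?php
/**
 * Implements hook_node_update_index().
 *
 * The 'site_product' type and the $node->sku property are hypothetical.
 */
function MYMODULE_node_update_index($node) {
  if ($node->type == 'site_product' && !empty($node->sku)) {
    // The returned text is appended to the content sent to the index,
    // so it is matched by keyword searches only.
    return 'SKU ' . htmlspecialchars($node->sku, ENT_QUOTES, 'UTF-8');
  }
}
```

Text added this way becomes part of the searchable body, which is why it cannot drive a facet or a sort.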

A big feature of the Drupal 7 core release is the Fields API. The Apache Solr Search Integration module has built-in support for indexing the fields on nodes or (potentially) other entity types, based on either the field type or even on a per-field basis. This feature was created based on the support for Content Construction Kit (CCK) fields in the 6.x-2.x version of the module; for 7.x, it has been extended to include handling the taxonomy term reference fields. By default, only taxonomy and all the list-type fields (e.g. list_text) will be indexed as separate fields in the Solr document. If you need to add to or change this indexing, you can implement hook_apachesolr_field_mappings_alter(&$mappings). See apachesolr.api.php for more details.

A last thing to consider is actually keeping data out of the search index. Previously, you saw the administrative interface for excluding all nodes of a certain type, but you might need to exclude content on a more selective basis. In that case, you can implement hook_apachesolr_node_exclude($node, $namespace). If any module returns TRUE, the node is not sent to the index.
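A minimal sketch of such an exclusion hook might look like the following. MYMODULE, the 'site_product' content type, and the $node->internal_only property are hypothetical names used only for illustration.

```php
<?php
/**
 * Implements hook_apachesolr_node_exclude().
 *
 * The 'site_product' type and $node->internal_only are hypothetical.
 */
function MYMODULE_apachesolr_node_exclude($node, $namespace) {
  // Returning TRUE keeps this node out of the Solr index.
  if ($node->type == 'site_product' && !empty($node->internal_only)) {
    return TRUE;
  }
  return FALSE;
}
```

Because a single TRUE from any module is enough to exclude a node, this hook can only remove content from the index; it cannot override another module's exclusion.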

Hooks for Altering Queries and Results

The first and most common reason to alter the query sent to Solr is to retrieve an additional field from the document in the search result. This is the complement to adding an extra field to the document via hook_apachesolr_update_index($document, $node, $namespace). Usually when you modify a query, you don't want the modification to be visible to the end user in the facet links, etc. In this case, you should use hook_apachesolr_query_alter($query) and append your field name to the 'fl' parameter sent to Solr, like so:

function MYMODULE_apachesolr_query_alter($query) {
  // Also return any price data from the index in the results.
  $query->addParam('fl', 'fs_price');
}

hook_apachesolr_query_alter() can also be used to add filters to a search that are not visible to the end user. This is important, for example, in the implementation of the Apache Solr node access module. This module adds filters to search queries based on the node access system using node_access_grants(). It also uses hook_apachesolr_update_index(), as described previously, to index the node access information as additional fields on each Solr document derived from a node.

A very similar hook is hook_apachesolr_query_prepare($query). Any changes made using this hook may end up being visible to the user on the search results page, so its use is much more limited than hook_apachesolr_query_alter().

There are also several hooks (and theme hooks) that can be used to modify or enrich the search results before they are displayed to the user. The most common one is hook_apachesolr_search_result_alter($doc, $extra), which allows each document in the result set to be individually altered.

Integrating with the Apache Solr Server

To understand how and why the Search module hook implementations in Apache Solr Search Integration are written as they are, it's useful to have a broad conceptual understanding of how the Apache Solr server works. Drupal interacts with Solr via HTTP requests, which for Drupal means using the drupal_http_request() function (though it could also be done in other ways in PHP, including the curl functions or file_get_contents(), depending on the PHP install). Solr has a REST-like API but, at least in version 1.4.x, it doesn't use the full range of HTTP methods as verbs, so it's not a true REST interface. Instead, different URL paths are used (and can be configured per search index); POST requests are used for data-changing operations and GET requests are used for querying.

A PHP library was adapted to provide some of the low-level logic of getting data into and out of Solr. That library can be found at code.google.com/p/solr-php-client. The Apache Solr Search Integration module provides a class, DrupalApacheSolrService, which extends the main library class Apache_Solr_Service and alters its behavior. The most important alteration is to use drupal_http_request() instead of file_get_contents() so that all Drupal sites can work with the module. The document class is used from the library with minimal modification.

Managing Data in the Solr Index

Data is added to the Solr index as XML documents sent via POST request to the /update path on the Solr server. Solr stores data as documents, and each document must have a unique string ID value. Solr does not have any native concept of relationships between documents, nor any ability to JOIN documents together. In this sense, it's like document-based NoSQL databases, such as MongoDB, so you have to store together all attributes of the node or other entity that you wish to be able to search on or retrieve in the search result. Documents are deleted with a POST request of an XML document that specifies either the ID of the document to delete or a query that deletes all matched documents.
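To make the shape of these XML messages concrete, here is a small sketch in plain PHP that builds an add message and the two kinds of delete message as strings. The function names, the document ID format, and the 'title' field are illustrative assumptions, since the available field names depend on your schema.xml. In the module itself this work is handled by the Solr PHP client library rather than by code like this.

```php
<?php
/**
 * Escapes a value for use in XML text or a double-quoted attribute.
 */
function example_solr_xml_escape($value) {
  return htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8');
}

/**
 * Builds a Solr <add> message containing one document.
 *
 * @param array $fields
 *   Field values keyed by field name; must include the unique ID field.
 */
function example_solr_add_xml(array $fields) {
  $xml = '<add><doc>';
  foreach ($fields as $name => $value) {
    $xml .= '<field name="' . example_solr_xml_escape($name) . '">'
      . example_solr_xml_escape($value) . '</field>';
  }
  return $xml . '</doc></add>';
}

/**
 * Builds a <delete> message that removes one document by its ID.
 */
function example_solr_delete_by_id_xml($id) {
  return '<delete><id>' . example_solr_xml_escape($id) . '</id></delete>';
}

/**
 * Builds a <delete> message that removes every document matching a query.
 */
function example_solr_delete_by_query_xml($query) {
  return '<delete><query>' . example_solr_xml_escape($query) . '</query></delete>';
}

// Hypothetical document; the ID format and field names are placeholders.
$add = example_solr_add_xml(array(
  'id' => 'abc123/node/1',
  'title' => 'Tom & Jerry',
));
// $add is:
// <add><doc><field name="id">abc123/node/1</field><field name="title">Tom &amp; Jerry</field></doc></add>
```

Each of these strings would be sent as the body of a POST request to the /update path.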

Searching and Analysis

Normal searches are run based just on the URL path and query string. Depending on your configuration, different paths may be used for different searches, such as a keyword search versus a “more like this” search. If something is not working as expected, having Solr running locally is very helpful since you can type your query directly into the URL or use other features of the Solr administrative interface. In particular, the analysis feature is useful to help understand how indexed content or search keywords are transformed by the analyzers and filters configured in the Solr schema.xml. If running Solr locally using the example Jetty deployment, you'll be able to get to the interface at http://localhost:8983/solr/admin/. Figure 31–6 shows the admin interface: in parentheses at the top is the name of the schema in use, then the link to the analysis interface and the text box where you can initiate a search.

Figure 31–6. The Apache Solr admin page, including the link to the analysis interface
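When debugging against a local Solr instance, it can help to build the query URL by hand and paste it into a browser. The sketch below assumes the example Jetty deployment at http://localhost:8983/solr and the /select request path; both may differ in your configuration. The function name is hypothetical, and fs_price is reused from the earlier indexing example. The q, fl, and rows parameters are standard Solr query parameters.

```php
<?php
/**
 * Builds a Solr query URL from a base URL and query parameters.
 *
 * The /select path is an assumption; search paths are configurable.
 */
function example_solr_select_url($base_url, array $params) {
  // Pass '&' explicitly so the result does not depend on the
  // arg_separator.output ini setting.
  return rtrim($base_url, '/') . '/select?' . http_build_query($params, '', '&');
}

$url = example_solr_select_url('http://localhost:8983/solr', array(
  'q' => 'drupal search',
  'fl' => 'id,fs_price',
  'rows' => 10,
));
// $url is:
// http://localhost:8983/solr/select?q=drupal+search&fl=id%2Cfs_price&rows=10
```

Requesting such a URL directly shows the raw response that Solr returns before the module processes it.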

Summary

By using Apache Solr Search Integration, you can enhance the quality of the search results and the search interface on your site, which will help keep users on your site and help them find what they are looking for.

The Drupal 7 version of Apache Solr Search Integration will be enhanced as indexing for additional entities like users and files is available and as the Drupal 7 version of Views is released.

Tip: Find updates at dgd7.org/solr.
