Chapter 9. Searching

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 9. Searching

This chapter covers

The Zend_Search_Lucene component
How to add a search page to a real-world application
The Observer pattern and how it helps code separation

One feature that separates an excellent website from a mediocre one is search. It doesn’t matter how good your navigation system is, your users are used to Google and expect that they can search your website to find what they’re looking for. If they can’t search, or if the search doesn’t return useful results, they’ll try another site and another until they find one that does. A good search system is hard to write, but Zend Framework provides the Zend_Search_Lucene component to make it easier.

In this chapter, we’ll look at how search should work for the user and at how the Zend_Search_Lucene component works. We’ll then integrate search into our Places website, looking at how to index a model and how to present results to the user.

9.1. Benefits of Search

For most of us, we get the best from a website when we successfully read the information we came for. For an e-commerce site, this generally means finding the item and purchasing it. For other sites, we’re generally looking for relevant articles on the topics that interest us. Searching is one way to help a user to quickly find the results they’re looking for. A good search system will provide the most relevant results at the top.

9.1.1. Key Usability Issue for Users

Users want only one thing from their search: to find the right thing with the minimum amount of effort. Obviously, this isn’t easy, and it’s a major reason why Google is the number one search engine. Generally, a single text field is all that is required. The user enters his search term, presses the submit button, and gets a list of relevant answers.

9.1.2. Ranking Results is Important

For a search system to be any good, it’s important that relevant results are displayed, which requires ranking each result in order of relevance to the search query. To do this, a full-text search engine is used. Full-text search engines work by creating a list of keywords, an index, for each page of the site. Along with the list of keywords, other relevant data is also stored, such as a title, the date of the document, the author, and information on how to retrieve the page, such as the URL. When the user runs a query, each document in the index is ranked based on how many of the requested keywords are in the document. The results are then displayed in order, with the most relevant pages at the top.

Providing a simple search system using a simple database query is quite easy using PHP. The difficultly lies in ranking the results with the most relevant coming first. Zend_Search_Lucene is designed to solve this problem.

9.2. Introducing Zend_Search_Lucene

Zend Framework’s search component, Zend_Search_Lucene, is a very powerful tool. It’s a full-text search engine based on the popular Apache Lucene project, which is a search engine for Java. The index files created by Zend_Search_Lucene are compatible with Apache Lucene, so any of the index-management utilities written for Apache Lucene will work with Zend_Search_Lucene too.

Zend_Search_Lucene creates an index that consists of documents. The documents are each instances of Zend_Search_Lucene_Document, and they each contain Zend_Search_Lucene_Field objects that contain the actual data. A visual representation is shown in figure 9.1.

Figure 9.1. A `Zend_Search_Lucene` index consists of multiple documents, each containing multiple fields. The data in some fields is not stored, and some fields may contain data for display rather than searching.

Queries can then be issued against the index, and an ordered array of results (each of type Zend_Search_Lucene_Search_QueryHit) is returned.

The first part of implementing a solution using Zend_Search_Lucene is to create an index.

What’s a full-text search engine?

A full-text search engine searches through a separate index file of the contents of the website to find what the user is looking for. This means that the content information can be stored very efficiently, because it’s only used for finding the URL (or the means to work out the URL) of the page required. The search engine can also order the search results based on more complex algorithms than are available when using a simple database query. Hopefully, this means that the user is presented with more-relevant results at the top.

One consequence is that all of the content that is to be searched needs to be in the index file, or it won’t be found. If your website contains static HTML as well as database content, this needs to be added to the search engine’s index file.

9.2.1. Creating a Separate Search Index for your Website

Creating an index for Zend_Search_Lucene is a matter of calling the create() method:

$index = Zend_Search_Lucene::create('path/to/index'),

The index created is actually a directory that contains a few files in a format that is compatible with the main Apache Lucene project. This means that if you want to, you could create index files using the Java or .Net applications, or conversely, create indexes with Zend_Search_Lucene and search them using Java.

Having created the index, the next thing to do is put data into it for searching on. This is where it gets complicated and we have to pay attention! Adding data is easily done using the addDocument() method, but you need to set up the fields within the document, and each field has a type. Here’s an example:

$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', $url));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $contents));
$doc->addField(Zend_Search_Lucene_Field::Text('desc', $desc));

$index->addDocument($doc);

As you can see in this example, the URL data is of field type UnIndexed, contents are UnStored, and the description is Text. Each field type is treated differently by the search engine to determine whether it needs to store the data or use it for creating the index file. Table 9.1 shows the key field types and what the differences are.

Table 9.1. Lucene field types for adding fields to an index

Name	Indexed	Stored	Tokenized	Description
Keyword	Yes	Yes	No	Use for storing and indexing data that is already split into separate words for searching.
UnIndexed	No	Yes	No	Use for data that isn’t searched on, but is displayed in the search results, such as dates or database ID fields.
Text	Yes	Yes	Yes	Use for storing data that is both indexed and used in the search results display. This data is split into separate words for indexing.
Unstored	Yes	No	Yes	Use for indexing the main content of the document. The actual data isn’t stored because you won’t be using it for search results display.
Binary	No	Yes	No	Use for storing binary data that is associated with the document, such as a thumbnail.

There are two reasons for adding a field to the search index: providing search data and providing content to be displayed with the results. The data field types identified in table 9.1 cover both these operations, and choosing the correct field type for a given piece of information is crucial for the correct operation of the search engine.

Let’s first look at storing the data for searching. This is known as indexing, and the Keyword, Text, and Unstored fields are relevant. The main difference between Keyword and Text/Unstored is the concept of tokenizing, which is when the indexing engine analyzes the data to determine the actual words in use. The Keyword field isn’t tokenized, which means that each word is used exactly as spelled for the purposes of search matching. The Text and Unstored fields are tokenized, which means that each word is analyzed and the underlying base word is used for search matching. For example, punctuation and plurals are removed for each word.

When using a search engine, the list of results needs to provide enough information for users to determine whether a given result is the one they’re looking for. The field types that are stored can help with this, because they are returned by the search engine. These are the Keyword, UnIndexed, Text, and Binary field types. The Unindexed and Binary field types exist solely for storing data used in the display of results. Typically, the Binary field type would be used for storing a thumbnail image that relates to the record, and the UnIndexed field type is used to store items such as a summary of the result or data related to finding the result, such as its database table, ID, or URL.

Zend_Search_Lucene will automatically optimize the search index as new documents are added. You can also force optimization by calling the optimize() method if required.

Now that we have created an index file, we can perform searches on the index. As you’d expect, Zend_Search_Lucene provides a variety of powerful mechanisms for building queries that produce the desired results, and we’ll explore them next.

9.2.2. Powerful Queries

Searching a Zend_Search_Lucene index is as simple as this:

$index = Zend_Search_Lucene::open('path/to/index'),
$index->find($query);

The $query parameter may be a string or you can build a query using Zend_Search_Lucene objects and pass in an instance of Zend_Search_Lucene_Search_Query.

Clearly, passing in a string is easier, but for maximum flexibility, using the Zend_Search_Lucene_Search_Query object can be very useful. The string is converted to a search query object using a query parser, and a good rule of thumb is that you should use the query parser for data from users and the query objects directly when pro-grammatically creating queries. This implies that for an advanced search system, which most websites provide, a hybrid approach would be used. We’ll explore that later.

Let’s look first at the query parser for strings.

String Queries

All search engines provide a very simple search interface for their users: a single text field. This makes them very easy to use, but at first glance this approach seems to make it harder to provide a complex query. Like Google, Zend_Search_Lucene has a query parser that can convert what’s typed into a single text field into a powerful query. When you pass a string to the find() method, the Zend_Search_Lucene_Search_QueryParser::parse() method is called behind the scenes. This class implements the Lucene query parser syntax, as supported by Apache Lucene 2.0.

To do its work, the parser breaks down the query into terms, phrases, and operators. A term is a single word, and a phrase is multiple words grouped using quotation marks, such as “hello world”. An operator is a Boolean word (such as AND) or symbol used to provide more complex queries. Wildcards are also supported using the asterisk (*) and question mark (?) symbols. A question mark represents a single character, and the asterisk represents several characters. For instance, searching for frame* will find frame, framework, frameset, and so on. Table 9.2 lists the key modifier symbols that can be applied to a search term.

Table 9.2. Search-term modifiers for controlling how the parser uses the search term

Modifier	Symbol	Example	Description
Wildcards	? and *	H?t h*t	? is a single character placeholder, and * represents multiple characters.
Proximity	~x	"php power"~10	Terms must be within a certain number of words apart (10 in the example).
Inclusive range	fieldname:[x TO y]	category:[skiing TO surfing]	Finds all documents whose field values are between the upper and lower bounds specified.
Exclusive range	fieldname:{x To y}	published:[20070101 TO 20080101]	Finds all documents whose field values are greater than the specified lower bound and less than the upper bound; the field values found don’t include the bounds.
Term boost	^x	"Rob Allen"^3	Increases the relevance of a document containing this term; the number determines how much of an increase in relevance is given.

Each modifier in table 9.2 affects only the term to which it’s attached. Operators, on the other hand, affect the makeup of the query. The key operators are listed in table 9.3.

Table 9.3. Boolean operators for refining the search

Operator	Symbol	Example	Description
Required	+	+php power	The term after the + symbol must exist in the document.
Prohibit	-	php -html	Documents containing the term after the - symbol are excluded from the search results.
AND	AND or &&	php and power	Both terms must exist somewhere in the document.
OR	OR or \|\|	php or html	Either term must be present in all returned documents.
NOT	NOT or !	php not java	Only includes documents that contain the first term but not the second term.

Careful use of the Boolean operators can create very complex queries, but few users in the real world will use more than one or two operators in any given query. The supported Boolean operators are pretty much the same as those used by major search engines, such as Google, so your users will have a degree of familiarity. We’ll now look at the other way to create a search query: programmatic creation.

Programmatically Creating Search Queries

A programmatic query involves instantiating the correct objects and putting them together. There are a lot of different objects available that can be combined to make up a query. Table 9.4 lists search query objects that can be used.

Table 9.4. Search query objects

Type	Example and description
Term query	A term query is used to search for a single term. Example search: category:php $term = new Zend_Search_Lucene_Index_Term('php', 'category'), $query = new Zend_Search_Lucene_Search_Query_Term($term); $hits = $index->find($query); Finds the term “php” within the category field. If the second term of the Zend_Search_Lucene_Index_Term constructor is omitted, all fields will be searched.
Multiterm query	Multiterm queries allow for searching through many terms. For each term, you can optionally apply the required or prohibited modifier. All terms are applied to each document when searching. Example search: +php power -java $query = new Zend_Search_Lucene_Search_Query_MultiTerm(); $query->addTerm(new Zend_Search_Lucene_Index_Term('php'), true); $query->addTerm(new Zend_Search_Lucene_Index_Term('power'), null); $query->addTerm(new Zend_Search_Lucene_Index_Term('java'), false); $hits = $index->find($query); The second parameter controls the modifier: true for required, false for prohibited, and null for no modifier. In this example, the query will find all documents that contain the word “php” and may contain the word “power” but don’t contain the word “java”.
Wildcard query	Wildcard queries are used for searching for a set of terms where only some of the letters are known. Example search: super* $term = new Zend_Search_Lucene_Index_Term('super'), $query = new Zend_Search_Lucene_Search_Query_Wildcard($term); $hits = $index->find($query); As you can see, the standard Zend_Search_Lucene_Index_Term object is used, then passed to a Zend_Search_Lucene_Search_Query_Wildcard object to ensure that the (or ?) modifier is interpreted correctly. If you pass a wildcard term into an instance of Zend_Search_Lucene_Search_Query_Term, it’ll be treated as a literal.
Phrase query	Phrase queries allow for searching for a specific multiword phrase. Example search: “php is powerful” $query = new Zend_Search_Lucene_Search_Query_Phrase(array('php', 'is', 'powerful')); $hits = $index->find($query); or // separate Index_Term objects $query = new Zend_Search_Lucene_Search_Query_Phrase(); $query->addTerm(new Zend_Search_Lucene_Index_Term('php')); $query->addTerm(new Zend_Search_Lucene_Index_Term('is')); $query->addTerm(new Zend_Search_Lucene_Index_Term('powerful')); $hits = $index->find($query); The words that make up the query can be either specified in the constructor of the Zend_Search_Lucene_Search_Query_Phrase object, or each term can be added separately. You can also search with word wildcards: $query = new Zend_Search_Lucene_Search_Query_Phrase(array('php', 'powerful'), array(0, 2)); $hits = $index->find($query); This example will find all three word phrases that have “php” as the first word and “powerful” as the last. This is known as “slop,” and you can also use the setSlop() method to achieve the same effect, like this: $query = new Zend_Search_Lucene_Search_Query_Phrase(array('php', 'powerful'), $query->setSlop(2); $hits = $index->find($query);
Range query	Range queries find all records within a particular range—usually a date range, but it can be anything. Example search: published_date: [20070101 to 20080101] $from = new Zend_Search_Lucene_Index_Term('20070101', 'published_date'), $to = new Zend_Search_Lucene_Index_Term('20080101', 'published_date'), $query = new Zend_Search_Lucene_Search_Query_Range($from, $to, true /* inclusive */); $hits = $index->find($query); Either boundary may be null to imply “from the beginning” or “until the end,” as appropriate.

The benefit of using the programmatic interface to Zend_Search_Lucene over the string parser is that it’s easier to express search criteria exactly, and you can allow the user to access an advanced search web form to refine his search.

This covers our look at what you can do with Zend_Search_Lucene, and as you can see, it’s a very powerful tool. Before we look at how to implement searching within a website, let’s look at how to get the best out of Zend_Search_Lucene.

9.2.3. Best Practices

We have covered all you need to know about using Zend_Search_Lucene, but it’s useful to look at a few best practices for using Zend_Search_Lucene.

First, don’t use id or score as document field names, because this will make them harder to retrieve. For other field names, you can do this:

$hits = $index->find($query);
foreach ($hits as $hit) {
    // Get 'title' document field
    $title = $hit->title;
}

But to retrieve a field called id, you would have to do this:

$id = $hit->getDocument()->id;

This is only required for the fieldnames called id and score, so it’s best to use different names, such as doc_id and doc_score.

Second, you need to be aware of memory usage. Zend_Search_Lucene uses a lot of memory! The memory usage increases if you have a lot of unique terms, which occurs if you have a lot of untokenized phases as field values. This means that you’re indexing a lot of non-text data. From a practical point of view, this means that Zend_Search_Lucene works best for searching text, which isn’t a problem for most websites. Note that indexing uses a bit more memory too, and this can be controlled with the MaxBufferedDocs parameter. You’ll find further details in the Zend Framework manual.

Last, Zend_Search_Lucene uses UTF-8 character coding internally, and if you’re indexing non-ASCII data, it’s wise to specify the encoding of the data fields when adding to the index. This is done using the optional third parameter of the field-creation methods. Here’s an example:

$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('body', $body, 'iso-8859-1'));

Now let’s do something interesting with Zend_Search_Lucene and integrate searching into the Places website.

9.3. Adding Search to Places

As Places is a community site, a search facility will be expected by its members. We’ll implement a simple one-field search on every page in the sidebar so that it’s very easy for users to find what they want. The search results will be displayed as in figure 9.2.

Figure 9.2. The search results page for Places. Each entry has a title that is linked to the indexed page, followed by a summary. The results are ranked with the most relevant at the top.

To achieve this, we’re first going to create the index files used by the search, then we’ll write the simple form and search the results page, finishing with the advanced search form.

9.3.1. Updating the Index as new Content is Added

We have two choices about when we create index files: we can do it when data is added to the database or as a scheduled task using cron or another scheduler. The traffic that the site receives largely determines how much additional stress these two methods cause.

For Places, we want to use both methods. Adding entries to the index as new data is added to the site is vital for a useful website. We also want the ability to re-index all the data in one hit so we can optimize the search index files as the amount of data on the site gets larger.

We’ll start by looking at adding to the index as we go along because we can leverage that work for the full re-indexing later.

Designing the Index

We need to consider the fields that we’re going to create in our index. The search index is going to contain records with different content types, such as places, reviews, or user profile data, but the results will be presented to the user as a list of web pages to view. We need to unify the set of fields in the index so that the results will make sense to the user. Table 9.5 shows the set of fields we’ll use.

Table 9.5. Lucene field types for adding fields to an index

Field name	Type	Notes
class	UnIndexed	The class name of the stored data. We need this on retrieval to create a URL to the correct page in the results list.
key	UnIndexed	The key of the stored data. Usually this is the ID of the data record. We need this on retrieval to create a URL to the correct page in the results list.
docRef	Keyword	A unique identifier for this record. We need this to find the record for updates or deletions.
title	Text	The title of the data. We’ll search on and display this within the results.
contents	UnStored	The main content for searching. This isn’t displayed.
summary	UnIndexed	The summary, which contains information about the search result to display within the results. It isn’t used for searching.
createdBy	Text	The author of the record. This is used for searching and display. We use the Keyword type to preserve the author’s name exactly when searching.
dateCreated	Keyword	The date created. This is used for searching and display. We use the Keyword type because we don’t want Lucene to parse the data.

The basic code for creating the Zend_Search_Lucene_Document to be indexed is the same no matter what type of data we’re going to be indexing. It looks something like the code in listing 9.1.

Listing 9.1. Adding a document to the search index

This method creates a Zend_Search_Lucene document, called $doc, then we add each field’s data to the document, specifying its type and name , before calling addDocument() .

Because we’re going to be adding documents for every page on the website, we should make the creation of the document as easy as possible. We’ll extend Zend_Search_Lucene_Document so we can instantiate the object with the right data in one line of code. This is shown in listing 9.2.

Listing 9.2. Extending `Zend_Search_Lucene_Document` for easier creation

class Places_Search_Lucene_Document extends Zend_Search_Lucene_Document
{

   public function __construct($class, $key, $title,
      $contents, $summary, $createdBy, $dateCreated)
   {
      $this->addField(Zend_Search_Lucene_Field::Keyword(
           'docRef', "$class:$key"));
      $this->addField(Zend_Search_Lucene_Field::UnIndexed(
           'class', $class));
      $this->addField(Zend_Search_Lucene_Field::UnIndexed(
           'key', $key));
      $this->addField(Zend_Search_Lucene_Field::Text(
           'title', $title));

      $this->addField(Zend_Search_Lucene_Field::UnStored(
           'contents', $contents));
      $this->addField(Zend_Search_Lucene_Field::UnIndexed(
           'summary', $summary));
      $this->addField(Zend_Search_Lucene_Field::Keyword(
           'createdBy', $createdBy));
      $this->addField(Zend_Search_Lucene_Field::Keyword(
           'dateCreated', $dateCreated));

   }

}

This class has a constructor that simply adds all the data to the fields we want to create. Using the new Places_Search_Lucene_Document class is extremely easy, as shown in listing 9.3.

Listing 9.3. Adding to the index

We now need to write this code within each model class that is to be searched over, and we immediately hit a design brick wall. The model clearly knows all about the data to be searched, but it should not know how to add that data to the search index. If it did, it would be tightly coupled to the search system, which would make our lives much harder if we ever want to change the search engine to another one. Fortunately, programmers before us have had the same basic problem, and it turns out that it’s so common that there’s a design pattern named after the solution: the Observer pattern.

Using the Observer Pattern to Decouple Indexing from the Model

The Observer design pattern describes a solution that uses the concept of notifications. An observer object registers interest in an observable object, and when something happens to the observable, the observer is notified and can take appropriate action. In our case, this is shown in figure 9.3.

Figure 9.3. The Observer design pattern allows us to decouple the search indexing from the model data, making it easy to add new models to be searched or to change the way we index for searching.

Our observable classes are our models, so we want to allow observers to register themselves with the models. The data we’re interested in are instances of Zend_Db_Table_Row_Abstract, so we’ll create an extension class called Places_Db_Table_Row_Observable that will contain the methods that allow us to register and notify observers. Listing 9.4 shows the class skeleton.

Listing 9.4. The `Places_Db_Table_Row_Observable` class

The first two methods we need are the core of the Observer pattern. The attach-Observer() method allows an observer to attach itself. We use a static method because the list of observers is independent of a specific model class. Similarly, the array of observers is static because it needs to be accessible from every model class.

The _notifyObservers() method notifies all observers that have been attached to the class. This method iterates over the list and calls a static method, observerTable-Row() within the observer class . The observer can then use the information about what the event is and the data from the model to perform whatever action is necessary. In this case, we’ll update the search index.

Zend_Db_Table_Row_Abstract provides a number of hook methods that allow us to perform processing before and after the database row is inserted, updated, or deleted. We use the _postInsert(), _postUpdate(), and _postDelete() methods to call _notifyObservers(), in order to update the search index. Each method is the same; _postInsert() is shown in listing 9.5 (except for the notification string).

Listing 9.5. Notifying the observer after an insert

Let’s look now at the observer class. This class is called SearchIndexer and, because it’s a model, it’s stored in application/models. The code to register it with the observable class is in the Bootstrap class in application/boostrap.php, and it looks like this:

SearchIndexer::setIndexDirectory(ROOT_DIR . '/var/search_index'),
Places_Db_Table_Row_Observable::attachObserver('SearchIndexer'),

As you can see, it’s simply a case of setting the directory to store the search index files and attaching the class to the list of observers in Places_Db_Table_Row_Observable using the name of the class. The three main methods in SearchIndexer are shown in listings 9.6, 9.7, and 9.8.

Listing 9.6. The notification hook method: `SearchIndexer::observeTableRow()`

The notification hook method, observeTableRow(), checks the event type, and if the event indicates that new data has been written to the database, it retrieves the data from the model and updates the search index files . All that’s left to do is retrieve the data from the model.

Adding the Model’s Data to the Index

The process of retrieving the data from the model is a separate method, because it’ll be used when re-indexing the entire database. This method is shown in listing 9.7.

Listing 9.7. Retrieving the field information: `SearchIndexer::getDocument()`

The SearchIndexer is only interested in models that need to be searched. As the observation system can be used for many observers, it’s possible that some models may be sending out notifications that are not relevant to the SearchIndexer. To test for this, getDocument() checks that the model implements the getSearchIndexFields() method . If it does, we call the method to retrieve the data from the model in a format that is suitable for our search index , and then we create the document ready to be added to the index .

We now need to add a document to the search index. This is done in _addToIndex() as shown in listing 9.8.

Listing 9.8. Adding to the search index: `SearchIndexer::_addToIndex()`

Adding the search document to the index in _addToIndex() is really easy. All we need to do is open the index using the directory that was set up in the bootstrap , then add the document . Note that we need to commit the index to ensure that it’s saved for searching on. This isn’t necessary if you add lots of documents, because there is an automatic commit system that will sort it out for you.

One problem is that Zend_Search_Lucene doesn’t handle updating a document. If you want to update a document, you need to delete it and re-add it. We don’t want to ever end up with the same document in the index twice, so we’ll create a class, Places_Search_Lucene, as an extension of Zend_Search_Lucene and override addDocument() to do a delete first if required. The code for Places_Search_Lucene is shown in listing 9.9.

Listing 9.9. Deleting a document before adding it

This method does exactly what we need, but it doesn’t work! This is because the static method Zend_Search_Lucene::open() creates an instance of Zend_Search_Lucene, not Places_Search_Lucene. We need to override the open() and create() methods, as shown in listing 9.10.

Listing 9.10. Overriding `open()` so that it all works

The create() and open() methods in listing 9.10 are very simple but also very necessary. We need to update SearchIndexer::_addToIndex() to reference Places_Search_Lucene, as shown in listing 9.11, and everything will work as expected.

Listing 9.11. Correcting `SearchIndexer::_addToIndex()`

The method in listing 9.11 is the same as that in listing 9.8, with the exception of calling Places_Search_Lucene::open() . This means that the call to addDocument() now calls our newly written method so we can be sure that there are no duplicate pages in the search index.

We now have a working system for adding updated content to our search index. We need only to implement the search system on the frontend so our users can find what they’re looking for.

Re-Indexing the Entire Site

Now that we have the ability to update the index as new content is added, we can utilize all the code we have written to easily re-index all the data in the site. This is useful for supporting bulk inserting of data and also as a recovery strategy if the index is damaged or accidentally deleted.

The easiest way to support re-indexing is to create a controller action, search/rein-dex, which is shown in listing 9.12.

Listing 9.12. The re-indexing controller

The reindex action is very simple. It uses Zend_Search_Lucene’s create() method to start a new index, effectively overwriting any current index. To add each document, we take advantage of the SearchIndexer’s getDocument() method, which creates a document using the model’s getSearchIndexFields() method, reusing the code and making this method very simple.

Tip

The re-indexing action should be protected using Zend_Acl if deployed on a live server. This is because a malicious user could repeatedly call this action, which would result in significant CPU usage and probably cause the website to slow to a halt.

We now have complete control over creating search indexes within the Places website, and we can look at creating a search form to allow the user to search and then view the results.

9.3.2. Creating the Search form and Displaying the Results

The search form for Places is very simple. It consists only of a single input text field and a Go button. The form is available on every page, so the HTML for it is contained in views/layouts/_search.phtml as shown in listing 9.13.

Listing 9.13. A simple search form in HTML

The search form’s action attribute points to the index action in the search controller, which is where the searching takes places. Because Places follows the MVC pattern, the searching takes place in the SearchController::indexAction() method, and the display of the search results is separated into the associated view file, views/scripts/search/index.phtml.

Let’s look at the controller first.

Processing a Search Request in the Controller

This method performs the search and assigns the results to the view. It also validates and filters the user’s input to ensure that we don’t accidentally introduce XSS security holes. The controller action is shown in listing 9.14.

Listing 9.14. Filtering and validating for the search form

To use data provided by a user, we first have to ensure that it’s safe to do so. The Zend_Filter_Input component provides both filtering and validation functionality. We use filtering to remove any whitespace padding on the search term and also to remove any HTML with the StripTags filter . The only validation we do on the search term is to ensure that the user has provided it , because searching for an empty string won’t return useful results! Zend_Filter_Input’s isValid() method filters the data and checks that the validation passes . On success, we collect the search query text and reassign it to the view to display.

Having checked that the data provided by the user is okay to use, we can now perform the search. As usual, with Zend_Search_Lucene, we first open the index , then call the find() method to do the work . In this case, we can use the built-in string query parser, because the user can provide a very simple search query (such as “zoo” to find all zoos, or a more complicated one such as “warwickshire -zoo” to find all attractions in Warwickshire except zoos).

If the validation fails, we collect the reason from Zend_Filter_Input by assigning the return value of getMessages() to the view. Now that we have generated either a result set or failed validation, we need to display this information to the user in the view.

Displaying the Search Results in the View

The view has two responsibilities: to display any failure messages to the user and display the search results. To display the error messages, we simply iterate over the list and echo within a list. This is shown in listing 9.15.

Listing 9.15. Displaying error messages from `Zend_Filter_Input`

This is very straightforward, and the sole thing to note is that we iterate over only the ‘q’ array within messages because we know that there is only one form field in this form. For a more complicated search form, we’d have to iterate over all the form fields.

The second half of the view script displays the search results. The fields we have available are limited to those we set up in Places_Search_Lucene_Document, back in listing 9.2 and we use these fields in the output as shown in listing 9.16.

Listing 9.16. Displaying error messages from `Zend_Filter_Input`

As with any response to a user action, we provide important feedback about what the user searched for and how many results were found. We then iterate over the results array, displaying each item’s information within an unsigned list. The search results don’t contain the URL to the page containing the result, so we need to work it out from the class and key fields that we do have in the search index. This is outsourced to a view helper, getSearchResultUrl() (shown in listing 9.17), to keep the code contained. The results are ordered by the score field, which shows the weighting of each result. The user isn’t interested in this, but we may be; it’s included as a comment so it can be inspected using the view source command when investigating search queries. Obviously, this can be omitted from a production application.

Listing 9.17. View helper to retrieve the search result’s URL

The initial version of getSearchResultUrl() is very simple because there is a one-to-one mapping from the model’s class name to the controller action. That is, for a model called Places, the controller used is places/index. It’s likely that this would change when more models are introduced into the application. As this happens, the complexity of mapping from the model to the URL will increase and be completely contained within the view helper. This will help make long-term maintenance that much easier.

9.4. Summary

This chapter has introduced one of the exceptional components of Zend Framework. Zend_Search_Lucene is a very comprehensive full-text search engine written entirely in PHP, and it easily enables a developer to add a search facility to a website. We have looked in detail at the way Zend_Search_Lucene works and at how queries can be either a simple string like a Google search or programmatically constructed using the rich API to allow for very complex queries.

To put Zend_Search_Lucene in context, we have also integrated search into the Places community website. The search in Places is relatively simplistic, as it only has one model that requires indexing. However, we have future-proofed our code using the Observer pattern to separate the search indexing from the models where the data is stored. The result is a search engine that performs a ranked search algorithm and helps your users find the information they’re looking for quickly, with all the benefits that this brings to your website.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Chapter 9. Searching

Create new playlist

Sign In

Sign Up

Chapter 9. Searching

9.1. Benefits of Search

9.1.1. Key Usability Issue for Users

9.1.2. Ranking Results is Important

9.2. Introducing Zend_Search_Lucene

Figure 9.1. A Zend_Search_Lucene index consists of multiple documents, each containing multiple fields. The data in some fields is not stored, and some fields may contain data for display rather than searching.

9.2.1. Creating a Separate Search Index for your Website

Table 9.1. Lucene field types for adding fields to an index

9.2.2. Powerful Queries

String Queries

Table 9.2. Search-term modifiers for controlling how the parser uses the search term

Table 9.3. Boolean operators for refining the search

Programmatically Creating Search Queries

Table 9.4. Search query objects

9.2.3. Best Practices

9.3. Adding Search to Places

Figure 9.2. The search results page for Places. Each entry has a title that is linked to the indexed page, followed by a summary. The results are ranked with the most relevant at the top.

9.3.1. Updating the Index as new Content is Added

Designing the Index

Table 9.5. Lucene field types for adding fields to an index

Listing 9.1. Adding a document to the search index

Listing 9.2. Extending Zend_Search_Lucene_Document for easier creation

Listing 9.3. Adding to the index

Using the Observer Pattern to Decouple Indexing from the Model

Figure 9.3. The Observer design pattern allows us to decouple the search indexing from the model data, making it easy to add new models to be searched or to change the way we index for searching.

Listing 9.4. The Places_Db_Table_Row_Observable class

Listing 9.5. Notifying the observer after an insert

Listing 9.6. The notification hook method: SearchIndexer::observeTableRow()

Adding the Model’s Data to the Index

Listing 9.7. Retrieving the field information: SearchIndexer::getDocument()

Listing 9.8. Adding to the search index: SearchIndexer::_addToIndex()

Listing 9.9. Deleting a document before adding it

Listing 9.10. Overriding open() so that it all works

Listing 9.11. Correcting SearchIndexer::_addToIndex()

Re-Indexing the Entire Site

Listing 9.12. The re-indexing controller

Tip

9.3.2. Creating the Search form and Displaying the Results

Listing 9.13. A simple search form in HTML

Processing a Search Request in the Controller

Listing 9.14. Filtering and validating for the search form

Displaying the Search Results in the View

Listing 9.15. Displaying error messages from Zend_Filter_Input

Listing 9.16. Displaying error messages from Zend_Filter_Input

Listing 9.17. View helper to retrieve the search result’s URL

9.4. Summary

Table of Contents for
Chapter 9. Searching

Figure 9.1. A `Zend_Search_Lucene` index consists of multiple documents, each containing multiple fields. The data in some fields is not stored, and some fields may contain data for display rather than searching.

Listing 9.2. Extending `Zend_Search_Lucene_Document` for easier creation

Listing 9.4. The `Places_Db_Table_Row_Observable` class

Listing 9.6. The notification hook method: `SearchIndexer::observeTableRow()`

Listing 9.7. Retrieving the field information: `SearchIndexer::getDocument()`

Listing 9.8. Adding to the search index: `SearchIndexer::_addToIndex()`

Listing 9.10. Overriding `open()` so that it all works

Listing 9.11. Correcting `SearchIndexer::_addToIndex()`

Listing 9.15. Displaying error messages from `Zend_Filter_Input`

Listing 9.16. Displaying error messages from `Zend_Filter_Input`