Haystack search for Django

Haystack is a general-purpose search application for Django. It supports multiple search engine backends through a standardized integration layer. To install Haystack, download it from http://github.com/toastdriven/django-haystack and run setup.py.

Haystack currently supports three search engine backends: Solr, Whoosh, and Xapian. These backends are specified to Haystack with the Django setting HAYSTACK_SEARCH_ENGINE. We will be using Haystack with the Whoosh search engine, so our settings file will need to include the following:

HAYSTACK_SEARCH_ENGINE='whoosh'
HAYSTACK_WHOOSH_PATH='/path/to/indexes'
HAYSTACK_SITECONF='project.search_sites'

As we discussed in the previous section, Whoosh stores indexes in files on the filesystem. When used with Haystack the location of these indexes is defined by the HAYSTACK_WHOOSH_PATH setting. Now that we've configured our search engine backend, we can move on to configuring the rest of Haystack.

Haystack uses a common Django design known as the registration pattern. This is the same approach as with Django's built-in admin application. It involves using a register method to attach functionality to your models. This allows code to be reused through subclassing and simplifies configuration. For more information, examine the ModelAdmin class in the django.contrib.admin module and the admin site documentation.
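The registration pattern itself can be illustrated without Django. What follows is a minimal sketch in plain Python, not Haystack's or the admin's actual implementation; SimpleSite, BaseIndex, and AlreadyRegistered are hypothetical names chosen for illustration, and Product is a stand-in for a real model class.

```python
class AlreadyRegistered(Exception):
    """Raised when the same model is registered twice."""


class BaseIndex:
    """Hypothetical base class; subclasses customize behavior."""

    def __init__(self, model):
        self.model = model


class SimpleSite:
    """Hypothetical registry mapping model classes to index instances."""

    def __init__(self):
        self._registry = {}

    def register(self, model, index_class):
        if model in self._registry:
            raise AlreadyRegistered(model.__name__)
        # Keep one index instance per registered model.
        self._registry[model] = index_class(model)

    def get_index(self, model):
        return self._registry[model]


class Product:
    """Stand-in for a Django model."""


class ProductIndex(BaseIndex):
    """Behavior specific to Product would be defined here."""


site = SimpleSite()
site.register(Product, ProductIndex)
```

The registry keeps configuration out of the model itself: the model class stays unchanged, and all search-specific behavior lives in the registered subclass.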

When you add haystack to the INSTALLED_APPS setting of your project, it will trigger a search for a search_indexes.py file in each of your app modules. This file defines the indexes on your models.

In addition, Haystack needs a global configuration file, specified by the Django setting HAYSTACK_SITECONF. Typically this site conf file will be very simple:

import haystack

haystack.autodiscover()

Haystack requires us to specify our indexes, just as in Sphinx and Whoosh. Haystack uses a more Django-like syntax, however, and it will automatically translate and create our index definitions to the format appropriate for our chosen search engine backend. Just as our Django models are defined by subclassing django.db.models.Model, our indexes are defined by subclassing haystack.indexes.SearchIndex. These subclasses then specify the fields we want to index along with their type.

To index our Product model using Haystack, we would write the following SearchIndex subclass in our app's search_indexes.py:

from haystack import indexes
from haystack import site
from coleman.products.models import Product

class ProductIndex(indexes.SearchIndex):
    # Primary document field, rendered from a template
    text = indexes.CharField(document=True, use_template=True)
    name = indexes.CharField(model_attr='name')
    slug = indexes.CharField(model_attr='slug')
    description = indexes.CharField(model_attr='description')

site.register(Product, ProductIndex)

Haystack's approach to search differs significantly from what we've seen before. The indexes we define will be fed into our search engine, but queries and results are all handled by the Haystack application logic, which translates everything automatically. Every Haystack SearchIndex requires one field to be marked document=True. This field should have the same name across all your indexes (for different models). It will be treated as the primary field for searching.

In our product index, this field is called text. It has an additional special option: use_template=True. This indexes the output of a simple template as the model's data, rather than just a single model attribute. For each remaining field, the attribute named by the model_attr keyword argument is looked up and used as index data. This can be a normal object attribute or a callable, and it can even follow relationships using the double-underscore syntax (category__name__id).
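To make the double-underscore lookup concrete, here is a small sketch of how such a path could be resolved. It is an illustration in plain Python, not Haystack's own code; resolve_attr is a hypothetical helper, and the sample objects are stand-ins for model instances.

```python
from types import SimpleNamespace


def resolve_attr(obj, lookup):
    """Follow a double-underscore path such as 'category__name'."""
    current = obj
    for part in lookup.split("__"):
        current = getattr(current, part)
        # Call methods and other callables so their return value is used.
        if callable(current):
            current = current()
    return current


# Stand-ins for a Product and its related Category.
category = SimpleNamespace(name="Cookware")
product = SimpleNamespace(
    name="Skillet",
    category=category,
    display_name=lambda: "Cast Iron Skillet",
)

plain = resolve_attr(product, "name")
related = resolve_attr(product, "category__name")
called = resolve_attr(product, "display_name")
```

Each segment of the path is one attribute hop, so category__name reads the category attribute first and then the name attribute of the related object.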

This index definition may seem awkward at first due to the extra field and document=True, but it is required in part to standardize an interface with multiple backends. Because the text field will be present in all indexes, our search engine knows for certain it can index this field for all our data. It's a little complex, but if you're interested in learning more the Haystack documentation is very complete.

One last thing about our index definition: the use_template=True argument. This argument lets us define a template for the data to be indexed in the text field. The template must be named according to our model and the indexed field, and Haystack looks for it under search/indexes/ followed by the app label in our template directories. In the ProductIndex example we would create a product_text.txt template that looks like this:

{{ object.name }} 
{{ object.manufacturer }}
{{ object.description }}

This gives us additional flexibility in defining the data we want indexed. Think of this template as allowing you to construct a more formal "document" out of your Django models. Haystack will render this template for all Product objects it processes and pass the resulting string to the search engine to use as the indexed data. Standard template filters are available here and we can also call methods on our object to include their results in our index.
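The idea of building one indexable document from several attributes can be sketched as follows. This is only an approximation: Haystack renders a real Django template, while this example substitutes str.format so that it runs without Django. DOCUMENT_TEMPLATE and build_document are hypothetical names, and the sample product data is made up.

```python
from types import SimpleNamespace

# Stand-in for product_text.txt, written in str.format syntax
# rather than Django template syntax.
DOCUMENT_TEMPLATE = "{object.name}\n{object.manufacturer}\n{object.description}"


def build_document(obj):
    """Combine several attributes into one string to be indexed."""
    return DOCUMENT_TEMPLATE.format(object=obj)


product = SimpleNamespace(
    name="Skillet",
    manufacturer="Coleman",
    description="A 10-inch cast iron skillet.",
)
document = build_document(product)
```

The resulting string is what a backend would tokenize and index, which is why a search for the manufacturer name can match a product even though no separate manufacturer field is declared on the index.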

Haystack searches

Haystack includes a default URLs file that defines a stock set of views. You can get search up and running on your Django site simply by adding the Haystack URLs to your root URLConf:

(r'^search/', include('haystack.urls')),

Haystack also needs us to define a search template, called search/search.html by default. This template will receive a paginator object that contains the list of search results. Unlike django-sphinx, which provides a QuerySet of model objects, our Haystack search results will be instances of the Haystack SearchResult class. This class provides access to the model instance through its object attribute, but note that using it triggers a database query and affects performance.

In addition, the SearchResult object has all of the fields included in the SearchIndex we previously defined. In order to save on database hits, we can also store data in the index that is available as part of our SearchResult objects. This means we can store a model attribute, for example, directly in our SearchIndex and not need to access the object attribute in SearchResults (causing a database hit).

To do this we can specify the stored=True argument, which is the default for all SearchIndex field attributes. In our ProductIndex example, when we process the SearchResult list we can access the slug data via result.slug or result.object.slug, but the latter will incur a performance penalty.
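The trade-off between stored fields and the object attribute can be sketched with a lazy-loading result class. This is an illustration under stated assumptions rather than Haystack's SearchResult: FakeDatabase and Result are hypothetical, and the hit counter exists only to make the cost visible.

```python
from types import SimpleNamespace


class FakeDatabase:
    """Hypothetical stand-in for the database that counts lookups."""

    def __init__(self, rows):
        self.rows = rows
        self.hits = 0

    def get(self, pk):
        self.hits += 1
        return self.rows[pk]


class Result:
    """Sketch of a search result with stored fields and a lazy object."""

    def __init__(self, db, pk, **stored):
        self._db = db
        self.pk = pk
        self._object = None
        # Stored index fields become plain attributes; no lookup is needed.
        self.__dict__.update(stored)

    @property
    def object(self):
        # Load the full record on first access only, then reuse it.
        if self._object is None:
            self._object = self._db.get(self.pk)
        return self._object


db = FakeDatabase({1: SimpleNamespace(slug="skillet", price="24.99")})
result = Result(db, 1, slug="skillet")

slug_from_index = result.slug          # served from stored data
hits_before = db.hits
slug_from_model = result.object.slug   # forces a database lookup
hits_after = db.hits
```

In a results page with many rows, reading only stored fields keeps the lookup count at zero, while touching the object attribute costs one lookup per result in this sketch.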

Now that our indexes have been defined and we've added a search view, we need to actually generate the index files. To simplify this task, Haystack provides a Django management command, which we can pass to the django-admin.py or manage.py utilities. This command is called rebuild_index and it will cause Haystack to process the models in our database and create indexes for those that define one.

The rebuild command will completely recreate our indexes (by clearing them and updating against all model data). An alternative is the update_index management command, which will only update, but not clear, our indexes. The update command can take an age parameter, which restricts the update to model objects from a certain time period. For example, running ./manage.py update_index --age=24 will refresh the index for model objects updated in the last 24 hours. To use this, we need to add a get_updated_field method to our SearchIndex definition that returns the name of a model field to use for checking age. Often this can be a pub_date DateTime field on our model, as in the following:

class ProductIndex(indexes.SearchIndex):
    ...

    def get_updated_field(self):
        # Name of the model field used to check an object's age
        return 'pub_date'
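The age check amounts to comparing each object's timestamp against a cutoff. The sketch below shows that logic in plain Python; it is not the code Haystack runs, objects_to_update is a hypothetical helper, and the dates are invented for the example.

```python
from datetime import datetime, timedelta
from types import SimpleNamespace


def objects_to_update(objects, updated_field, age_hours, now):
    """Return objects whose timestamp falls within the last age_hours."""
    cutoff = now - timedelta(hours=age_hours)
    return [o for o in objects if getattr(o, updated_field) >= cutoff]


# A fixed "current time" keeps the example deterministic.
now = datetime(2010, 6, 2, 2, 0)
products = [
    SimpleNamespace(name="Skillet", pub_date=datetime(2010, 6, 1, 12, 0)),
    SimpleNamespace(name="Lantern", pub_date=datetime(2010, 5, 20, 9, 0)),
]
recent = objects_to_update(products, "pub_date", 24, now)
```

Only the Skillet falls inside the 24-hour window here. Note that a field like pub_date only captures edits if it is refreshed on every save; a field that records creation time alone would miss later changes.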

You will need to schedule a periodic call to the update_index command to ensure that new data is indexed and will appear in search results. This is necessary for any search engine indexer. In the case of Haystack, we can use our server's cron daemon to schedule a nightly run of update_index with this example crontab line:

0 2 * * * django-admin.py update_index --age=24

It is not necessary to schedule the index update this way, especially if your data does not change frequently (for example, your e-commerce site doesn't add new products on a daily basis). In these cases you can do periodic, manual index updates.

However, there are other reasons you may wish to update an index besides capturing new data. If you store field information for your models in the SearchField objects using stored=True, you will need to reindex to refresh this stored data. Search results may use this stored data to avoid a database lookup on actual models, so reindexing ensures any changes (to price, for example) are reflected in search results on a regular basis.

Haystack for real-time search

One of the unique features of Haystack is that it can be used to build a real-time search engine for your data. In the typical approach to search, including the Haystack usage described above, indexes are built at some regular interval. Any new data added to the site will not be indexed until the next run of the update_index command, either manually or via a scheduled cron job. However, Haystack offers an alternative to this approach, which is the RealTimeSearchIndex.

Using the RealTimeSearchIndex is simple: instead of writing your indexes as subclasses of SearchIndex, you subclass indexes.RealTimeSearchIndex. Using this real-time class attaches signal handlers to your model's post_save and post_delete signals.

These signal handlers cause Haystack to immediately update your indexes whenever a model object is saved or deleted. This means your indexes are updated in real time and search results will reflect any changes instantly. This functionality could be useful in e-commerce applications by allowing staff to search over product sales data in real time or for cases where the price of a good or service changes many times throughout the day.
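The signal-driven approach can be sketched in a few lines. This example does not use Django or Haystack: Signal is a minimal stand-in for Django's signal class, RealTimeIndex is a hypothetical in-memory index, and objects are represented as plain dictionaries.

```python
class Signal:
    """Minimal stand-in for a Django signal."""

    def __init__(self):
        self._receivers = []

    def connect(self, receiver):
        self._receivers.append(receiver)

    def send(self, instance):
        for receiver in self._receivers:
            receiver(instance)


post_save = Signal()
post_delete = Signal()


class RealTimeIndex:
    """Sketch of an index kept current by listening to signals."""

    def __init__(self):
        self.documents = {}
        # Every save or delete reaches the index immediately.
        post_save.connect(self.update_object)
        post_delete.connect(self.remove_object)

    def update_object(self, instance):
        self.documents[instance["id"]] = instance["name"]

    def remove_object(self, instance):
        self.documents.pop(instance["id"], None)


index = RealTimeIndex()
post_save.send({"id": 1, "name": "Skillet"})          # adds the document
post_save.send({"id": 1, "name": "Iron Skillet"})     # replaces it
after_saves = dict(index.documents)
post_delete.send({"id": 1, "name": "Iron Skillet"})   # removes it
```

Because every save and delete performs index work synchronously in this sketch, the cost of indexing is paid on each write rather than in a nightly batch.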

Real-time search of this nature causes significant load on your server resources if you have many object updates and deletions. Before deploying this solution, you should make sure your search engine and database server can handle the increased load that frequent indexing will create.

Haystack is extremely powerful considering how unassuming it appears at first glance. It can be used to produce very sophisticated search strategies across even the largest of Django sites. It also follows many of the Django conventions that have evolved over the past couple of releases and is a great project to reference as an example reusable application. Because it supports multiple search engine backends it has the performance and flexibility to fit many development needs. For these reasons and more, it is a highly recommended tool.
