Aggregations and filters

If you have ever used Amazon (or any other large e-commerce site), you might remember that on the left-hand side of the search results, these sites provide a list of filters that users can select to narrow down the results. These filters are generated dynamically based on the results being shown, and selecting one further narrows the search. It's easier to show what I mean with a screenshot. On Amazon, if you perform a search, you'll see something similar to this on the left-hand side of the screen:

[Screenshot: filter options shown alongside Amazon search results]

If you select any of the options listed here, you will further refine your search and see results relevant to only that option. They also provide the user with instant feedback, letting them know at a glance how many results they can expect to see if they select one of the available options.

We want to implement something similar in our application. Elasticsearch provides a feature called aggregations to help us do just this. Let's see what aggregations are first.

Aggregations provide a way to get statistics about our search results. There are two types of aggregations that can be used to get two different kinds of data about the search results: bucket aggregations and metric aggregations.

Bucket aggregations are like GROUP BY SQL queries. They gather documents into groups, or buckets, based on certain dimensions and calculate some metrics for each of these groups. The simplest aggregation is a terms aggregation. You give it a field name, and for each unique value of that field, Elasticsearch returns the count of documents where the field contains that value.

For example, let's say that you have five documents in your index:

{"name": "Book 1", "category": "web"}
{"name": "Book 2", "category": "django"}
{"name": "Book 3", "category": "java"}
{"name": "Book 4", "category": "web"}
{"name": "Book 5", "category": "django"}

If we run a terms aggregation on this data using the category field, we will get back results giving us the count of books in each category: two in web, two in django, and one in java.
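Conceptually, a terms aggregation on these five documents just counts the distinct values of the field. Here is a rough illustration in plain Python (this is not how Elasticsearch computes it internally, only the equivalent result):

```python
from collections import Counter

# The five example documents from above
docs = [
    {"name": "Book 1", "category": "web"},
    {"name": "Book 2", "category": "django"},
    {"name": "Book 3", "category": "java"},
    {"name": "Book 4", "category": "web"},
    {"name": "Book 5", "category": "django"},
]

# A terms aggregation on "category" yields one bucket per distinct value,
# with the number of documents that have that value
buckets = Counter(doc["category"] for doc in docs)
print(buckets)  # Counter({'web': 2, 'django': 2, 'java': 1})
```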

First, we will add aggregations for the categories in our products list and allow the user to filter their search based on these categories.

Category aggregation

The first step is to add an aggregation to our search object and pass the results from this aggregation to our template. Change HomeView in main/views.py to match the following code:

class HomeView(View):
    def get(self, request):
        form = SearchForm(request.GET)

        ctx = {
            "form": form
        }

        if form.is_valid():
            name_query = form.cleaned_data.get("name")
            if name_query:
                s = Search(index="daintree").query("match", name=name_query)
            else:
                s = Search(index="daintree")

            min_price = form.cleaned_data.get("min_price")
            max_price = form.cleaned_data.get("max_price")
            if min_price is not None or max_price is not None:
                price_query = dict()

                if min_price is not None:
                    price_query["gte"] = min_price

                if max_price is not None:
                    price_query["lte"] = max_price

                s = s.query("range", price=price_query)

            # Add aggregations
            s.aggs.bucket("categories", "terms", field="category")

            result = s.execute()

            ctx["products"] = result.hits
            ctx["aggregations"] = result.aggregations

        return render(request, "home.html", ctx)

There are just two new lines here. The first is as follows:

s.aggs.bucket("categories", "terms", field="category")

This line adds a bucket type aggregation to our search object. In Elasticsearch, each aggregation needs a name, and the aggregation results are associated with that name in the response. We give our aggregation the name categories. The next parameter to the method is the type of aggregation that we want. As we want to count the number of documents for each distinct category term, we use the terms aggregation. As we'll see later on, Elasticsearch has many different aggregation types covering almost any use case you can think of. After the second parameter, all keyword arguments become part of the aggregation definition. Each type of aggregation requires different parameters; the terms aggregation only needs the name of the field to aggregate on, which is category in our documents.
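For reference, the aggregation portion of the request body that this one line contributes to the search can be sketched as a plain Python dict, showing how the name, the type, and the parameters nest (a sketch of the generated JSON, not code you need to write):

```python
# The structure that s.aggs.bucket("categories", "terms", field="category")
# adds to the request body sent to Elasticsearch
aggs_body = {
    "aggs": {
        "categories": {               # the name we chose for the aggregation
            "terms": {                # the aggregation type
                "field": "category",  # parameters for this aggregation type
            }
        }
    }
}
```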

The next line is as follows:

ctx["aggregations"] = result.aggregations

This line adds the results from our aggregations to our template context, where we will use them when rendering the template. The format of the aggregation results is similar to this:

{
    "categories": {
        "buckets": [
            {
                "key": "CATEGORY 1",
                "doc_count": 10
            },

            {
                "key": "CATEGORY 2",
                "doc_count": 50
            },

            .
            .
            .
        ]
    }
}

The top-level dictionary contains a key for each aggregation that we added, with the same name as the one we added it with. In our case, the name is categories. The value for each key is the result of that aggregation. For a bucket aggregation, like the terms one that we have used, the result is a list of buckets. Each bucket has a key, which is a distinct category name, and the number of documents that have this category.
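If you ever need to work with this structure outside of a template, picking the counts out of it is straightforward. A small sketch using the sample response shape above (the category names here are placeholders):

```python
# A response shaped like the sample above
aggregations = {
    "categories": {
        "buckets": [
            {"key": "CATEGORY 1", "doc_count": 10},
            {"key": "CATEGORY 2", "doc_count": 50},
        ]
    }
}

# Map each distinct category to the number of documents in it
counts = {
    bucket["key"]: bucket["doc_count"]
    for bucket in aggregations["categories"]["buckets"]
}
print(counts)  # {'CATEGORY 1': 10, 'CATEGORY 2': 50}
```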

Let's display this data in our template first. Change main/templates/home.html to match the following code:

{% extends "base.html" %}

{% block content %}
<h2>Search</h2>
<form action="" method="get">
    {{ form.as_p }}
    <input type="submit" value="Search" />
</form>

{% if aggregations.categories.buckets %}
<h2>Categories</h2>
<ul>
{% for bucket in aggregations.categories.buckets %}
    <li>{{ bucket.key }} ({{ bucket.doc_count }})</li>
{% endfor %}
</ul>
{% endif %}

<ul>
    {% if products %}
        {% for product in products %}
        <li>
            Name: <b>{{ product.name }}</b> <br />
            <i>Category: {{ product.category }}</i> <br />
            <i>Price: {{ product.price }}</i> <br />
            {% if product.tags.all %}
                Tags: (
                {% for tag in product.tags.all %}
                    {{ tag.name }}
                    {% if not forloop.last %}
                    ,
                    {% endif %}
                {% endfor %}
                )
            {% endif %}
        </li>
        {% endfor %}
    {% else %}
        No results found. Please try another search term
    {% endif %}
</ul>
{% endblock %}

Again, the new code is the Categories block near the top of the template. Having seen the format of the preceding output, it should be simple to understand: we loop over each bucket and display the category name along with the number of documents in that category.

Let's take a look at the results. Open up the home page in your browser and perform a search; you should see something similar to this:

[Screenshot: search page showing the list of category aggregations]

We now have a list of categories displayed. But wait, what's this? If you look closer, you'll see that none of the category names make sense (beyond the fact that they are in Latin). None of the categories we see match the categories our products actually have. How come?

What's happened here is that Elasticsearch took our list of categories, broke them up into individual words, and then ran the aggregation. For example, if three products had categories web development, django development, and web applications, this aggregation would have given us the following results:

  • web (2)
  • development (2)
  • django (1)
  • applications (1)

However, this is not useful for our use case. Our category names should be treated as a unit and not broken up into individual words. Also, we never asked Elasticsearch to do any such thing when we were indexing our data. So what happened? To understand this, we need to understand how Elasticsearch works with textual data.
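To make the effect concrete, here is a small Python sketch of what happened (a rough approximation of Elasticsearch's standard analyzer, which lowercases text and splits it into words):

```python
from collections import Counter

categories = ["web development", "django development", "web applications"]

# Aggregating on the analyzed field counts individual tokens...
analyzed = Counter(
    token for category in categories for token in category.lower().split()
)
print(analyzed)
# Counter({'web': 2, 'development': 2, 'django': 1, 'applications': 1})

# ...whereas what we want is one bucket per whole category name
not_analyzed = Counter(categories)
print(not_analyzed)  # each category name counted as a single unit
```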

Full text search and analysis

Elasticsearch is based on Lucene, which is a very powerful library for creating full text search applications. Full text search is a bit like using Google on your own documents. You have probably used the Find functionality in word processors such as Microsoft Word or on web pages; that approach to search is called exact matching. For example, take this piece of text from the preface to Stories from The Arabian Nights:

Scheherazadè, the heroine of the Thousand and one Nights, ranks among the great story-tellers of the world much as does Penelope among the weavers. Procrastination was the basis of her art; for though the task she accomplished was splendid and memorable, it is rather in the quantity than the quality of her invention—in the long spun-out performance of what could have been done far more shortly—that she becomes a figure of dramatic interest.

If you search for the term memorable quantity using exact matching, it will not show any results. That's because the exact term "memorable quantity" is not found in this text.

A full text search, however, would return you this text because even though the exact term memorable quantity is not seen anywhere in the text, the two words memorable and quantity do appear in the text. Even if you search for something like memorable Django, this text would still be returned because the word memorable is still present in the text, even though Django is not. This is how most users expect search to work on the web, especially on e-commerce sites.
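The difference between the two approaches can be sketched in a few lines of Python (a toy illustration, not how Lucene actually implements it):

```python
import re

# An excerpt of the passage quoted above
text = ("for though the task she accomplished was splendid and memorable, "
        "it is rather in the quantity than the quality of her invention")

def tokens(s):
    # Crude word tokenizer: lowercase the text and keep alphabetic runs
    return set(re.findall(r"[a-z]+", s.lower()))

query = "memorable quantity"

exact_match = query in text                      # False: exact phrase absent
full_text_match = tokens(query) <= tokens(text)  # True: both words appear
```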

If you are searching for Django web development books on our site and we do not have something with that exact title, but we do have a book called Django Blueprints, you would still expect to see it in the search results.

This is what Elasticsearch does when you use a full text search. It breaks up your search term into words and then uses these to find search results that contain them. To do this efficiently, Elasticsearch also breaks up your documents when you index them so that it can run the search faster later on. This process is called analyzing the document, and by default it happens at index time for all string fields.
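The index-time half of this is what makes search fast: each analyzed field is broken into tokens when a document is indexed, and every token points back to the documents containing it (this structure is called an inverted index). A minimal sketch of the idea, with made-up documents:

```python
from collections import defaultdict

docs = {
    1: "Django web development",
    2: "Django Blueprints",
    3: "Learning Python",
}

# Index time: analyze each document and record, for every token,
# which documents contain it
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        inverted_index[token].add(doc_id)

# Search time: analyze the query the same way and collect every
# document that contains any of the query tokens
def search(query):
    hits = set()
    for token in query.lower().split():
        hits |= inverted_index.get(token, set())
    return hits

print(search("django web development books"))  # {1, 2}
```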

This is why, when we get the aggregations for our category field, we get individual words instead of complete category names in the result. While full text search is very useful in most cases, such as the name query search that we have, for fields like category names it gives us unexpected results.

As mentioned before, the analysis process in which Elasticsearch breaks up the text (tokenization is the technical term for this) happens at indexing time. To make sure that our category names are not analyzed, we need to change our ESProduct DocType subclass and reindex all our data.

First, let's change our ESProduct class in main/es_docs.py. Note the following line:

category = String(required=True)

Change this to be as follows:

category = String(required=True, index="not_analyzed")

However, if we now try to update the mapping, we will run into a problem. Elasticsearch can only create a mapping for a field, not update it. This is because if we were allowed to change the mapping of a field after the index already contains data, the old data might no longer make sense under the new mapping. Our only option is to delete the existing index and recreate it with the correct mapping.

To delete our existing Elasticsearch index, run the following command in the command line:

> curl -XDELETE 'localhost:9200/daintree'
{"acknowledged":true}

Next, we want to create our new index and add the ESProduct mapping. We could do what we did before and create the index from the Python shell. Instead, let's modify our index_all_data command to automatically create the index when it is run. Change the code in main/management/commands/index_all_data.py to match the following:

import elasticsearch_dsl
import elasticsearch_dsl.connections

from django.core.management import BaseCommand

from main.models import Product
from main.es_docs import ESProduct


class Command(BaseCommand):
    help = "Index all data to Elasticsearch"

    def handle(self, *args, **options):
        elasticsearch_dsl.connections.connections.create_connection()
        ESProduct.init(index='daintree')

        for product in Product.objects.all():
            esp = ESProduct(meta={'id': product.pk}, name=product.name, description=product.description,
                            price=product.price, category=product.category.name)
            for tag in product.tags.all():
                esp.tags.append(tag.name)
            
            esp.save(index='daintree')

The change is just the addition of a new line calling the ESProduct.init method, which creates the index and puts our mapping in place. Finally, let's run our command:

> python manage.py index_all_data

After running the command, let's make sure that our new mapping was inserted correctly. Let's see what mapping Elasticsearch has now by running the following in the command line:

> curl "localhost:9200/_mapping?pretty=1"
{
  "daintree" : {
    "mappings" : {
      "products" : {
        "properties" : {
          "category" : {
            "type" : "string",
            "index" : "not_analyzed"
          },
          "description" : {
            "type" : "string"
          },
          "name" : {
            "type" : "string"
          },
          "price" : {
            "type" : "long"
          },
          "tags" : {
            "type" : "string"
          }
        }
      }
    }
  }
}

If you look at the mapping for the category field, it is now not analyzed. Let's try that last search again and see if this fixes our category aggregations issue. You should now see something similar to this:

[Screenshot: search page showing whole, unsplit category names in the aggregation list]

As you can see, we no longer have our category names split up into individual words. Instead, we get a list of unique category names, which is what we wanted from the start. Now let's give our users the ability to select one of these categories to limit their search to just the selected category.
