Controlling the number of terms needed to match

Imagine that you run an e-commerce bookstore and want a search algorithm that brings the best results to your customers. However, you notice that many customers enter queries with too many words, which results in an empty result list. So, you decide to build a query that requires only some of the words the user entered to match, rather than all of them. This recipe will show you how to do it.

Getting ready

Before we continue, it is crucial to mention that the following method can only be used with the dismax or edismax query parser. For the list of available query parsers, refer to http://wiki.apache.org/solr/QueryParser.

How to do it...

Follow these steps to control the number of terms needed to match:

  1. Let's begin with creating our index structure. For our simple use case, we will only have documents with the identifier (the id field) and title (the title field). We define the index structure by adding the following section to the schema.xml file:
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="title" type="text" indexed="true" stored="true" />
  2. Now, let's look at the example data:
    <add>
     <doc>
      <field name="id">1</field>
      <field name="title">Solrcook book revised</field>
     </doc>
     <doc>
      <field name="id">2</field>
      <field name="title">Some book that was revised</field>
     </doc>
     <doc>
      <field name="id">3</field>
      <field name="title">Another revised book</field>
     </doc>
    </add>
  3. The third step is to make a query that will satisfy the requirements. For example, let's imagine that we want 100 percent of the terms matched for queries that have three or fewer terms in them, and only 25 percent of the terms matched for queries that have four or more terms. Such a query might look like this:
    http://localhost:8983/solr/cookbook/select?q=book+revised+another+different+word+that+doesnt+count&defType=dismax&mm=3<25%25

This query will return the following results:

<?xml version="1.0" encoding="UTF-8"?>
<response>
 <lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">1</int>
  <lst name="params">
   <str name="mm">3&lt;25%</str>
   <str name="q">book revised another different word that doesnt count</str>
   <str name="defType">dismax</str>
  </lst>
 </lst>
 <result name="response" numFound="3" start="0">
  <doc>
   <str name="id">3</str>
   <str name="title">Another revised book</str>
   <long name="_version_">1470445837694795776</long></doc>
  <doc>
   <str name="id">2</str>
   <str name="title">Some book that was revised</str>
   <long name="_version_">1470445837693747200</long></doc>
  <doc>
   <str name="id">1</str>
   <str name="title">Solrcook book revised</str>
   <long name="_version_">1470445837648658432</long></doc>
 </result>
</response>

On the other hand, let's look at the following query:

http://localhost:8983/solr/cookbook/select?q=book+revised+another&defType=dismax&mm=3<25%25

This query will return the following results:

<?xml version="1.0" encoding="UTF-8"?>
<response>
 <lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">1</int>
  <lst name="params">
   <str name="mm">3&lt;25%</str>
   <str name="q">book revised another</str>
   <str name="defType">dismax</str>
  </lst>
 </lst>
 <result name="response" numFound="1" start="0">
  <doc>
   <str name="id">3</str>
   <str name="title">Another revised book</str>
   <long name="_version_">1470445837694795776</long></doc>
 </result>
</response>

As you can see, only a single document was returned. Now, let's see how it works.

How it works...

The index structure and data are fairly simple. Every book is described by using two fields:

  • The unique identifier
  • The title

The data itself is not complicated, so let's skip discussing it.

The query is the thing that we are interested in. The first query sends eight terms in the q parameter, and no document contains all of them. However, since we are using the dismax query parser (the defType=dismax parameter) and added the mm (minimum should match) parameter, Solr returned three documents. This is because we specified mm=3<25%25 (which is in fact mm=3<25%, but we needed to URL-encode the percent sign). The value tells Solr the following: if there are three or fewer terms in the query, they must all match; if there are more, at least 25 percent of the query terms must be found in a document for it to be considered a match. For our eight-term query, 25 percent is two terms, and each of the three documents contains at least two of them (book and revised).
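Encoding the mm value by hand is easy to get wrong, so it can help to let a library build the URL. The following is a minimal Python sketch, assuming the same host and core name as in the recipe; the build_query_url helper is a hypothetical name introduced only for this illustration. Note that urlencode also encodes the < character as %3C, which Solr decodes back to the same value.

```python
from urllib.parse import urlencode

# Base URL copied from the recipe; adjust the host and core name for your setup.
SELECT_URL = "http://localhost:8983/solr/cookbook/select"


def build_query_url(user_query, mm="3<25%"):
    """Build a dismax select URL with a properly encoded mm parameter.

    Hypothetical helper for illustration; it only builds the URL and
    does not contact Solr.
    """
    params = {"q": user_query, "defType": "dismax", "mm": mm}
    # urlencode turns spaces into '+', '<' into '%3C' and '%' into '%25'.
    return SELECT_URL + "?" + urlencode(params)


first_url = build_query_url("book revised another different word that doesnt count")
second_url = build_query_url("book revised another")
print(first_url)
print(second_url)
```

Both URLs carry the same mm value; only the q parameter differs, as in the two queries shown earlier.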

Now, if we look at the second query, we notice that it has three terms in it, and it only returns a single document, the one with all three terms matched. Apart from the q parameter, all the other parameters, including mm, stayed the same. As you remember, we set the mm parameter so that queries with three or fewer terms must have all of them matched in a document for that document to be returned in the results. This is exactly the case in the second example.
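To make the rule concrete, here is a small Python sketch that mimics the mm=3<25% decision for the three example documents. It is an approximation under stated assumptions, not Solr's implementation: terms are produced by lowercasing and splitting on whitespace (the real text field type may analyze differently, for example by removing stop words), and the percentage is assumed to be rounded down. The names required_matches and matching_ids are hypothetical.

```python
# Example titles from the recipe, keyed by document id.
TITLES = {
    1: "Solrcook book revised",
    2: "Some book that was revised",
    3: "Another revised book",
}


def required_matches(num_terms, threshold=3, percent=25):
    """How many terms must match under a spec such as '3<25%'.

    Up to `threshold` terms: all of them. Above it: `percent` percent of
    the terms, rounded down (an assumption of this sketch).
    """
    if num_terms <= threshold:
        return num_terms
    return num_terms * percent // 100


def matching_ids(query):
    """Return ids of documents whose title contains enough query terms."""
    terms = query.lower().split()
    needed = required_matches(len(terms))
    ids = []
    for doc_id, title in TITLES.items():
        title_tokens = set(title.lower().split())
        hits = sum(1 for term in set(terms) if term in title_tokens)
        if hits >= needed:
            ids.append(doc_id)
    return ids


print(required_matches(8))  # eight terms: two must match
print(matching_ids("book revised another different word that doesnt count"))
print(matching_ids("book revised another"))
```

With the eight-term query, two matching terms are enough, so all three documents qualify; with the three-term query, all three terms are required, and only document 3 qualifies. This agrees with the numFound values in the responses shown earlier.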

Before we finish, let's get back to the first query and its results. Note that the document that has three words matched is at the top of the list. The relevance algorithm is still at work, which means that documents with more words matching the query will be higher on the result list than those with fewer matching words.
