Using the n-gram approach to do performant trailing wildcard searches

Many users coming from traditional RDBMS systems are used to wildcard searches. The most common is the % character in SQL, which matches zero or more characters (in Lucene and Solr query syntax, the * character plays the same role). If you have used SQL databases, you have probably seen searches such as this:

AND name LIKE 'ABC12%'

However, wildcard searches are not very efficient when it comes to Solr. This is because Solr needs to enumerate all the terms matching the wildcard pattern when the query is executed. So, how do we prepare our Solr deployment to handle trailing wildcard characters efficiently? This recipe will show you how to prepare your data and run efficient searches.
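To see why trailing wildcards are costly, consider a simplified model (this is not Solr's actual internal code, just a sketch of the idea): a query such as name:abc12* has to walk the term dictionary and collect every indexed term starting with the given prefix before any matching can happen. The term list below is invented for illustration.

```python
# Simplified model of wildcard query expansion (not Solr internals):
# a trailing-wildcard query must enumerate every indexed term that
# starts with the given prefix and rewrite itself as a disjunction
# over all of them.
index_terms = sorted([
    "abc11xyz", "abc12poi", "abc12qwe", "abd99aaa", "xyz1234abc12poi",
])

def expand_trailing_wildcard(prefix):
    """Collect all dictionary terms matching 'prefix*'."""
    return [t for t in index_terms if t.startswith(prefix)]

print(expand_trailing_wildcard("abc12"))  # ['abc12poi', 'abc12qwe']
```

The more distinct terms share the prefix, the more work the query has to do; the n-gram approach below trades index size for turning this into a single exact term lookup.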

How to do it...

There are a few steps we need to follow to implement efficient wildcard searches using the n-gram approach:

  1. The first step is to create a proper index structure. Let's assume we have the following fields defined in the schema.xml file:
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="name" type="text_wildcard" indexed="true" stored="true" />
  2. Now, let's define our text_wildcard type, again in the schema.xml file:
    <fieldType name="text_wildcard" class="solr.TextField">
     <analyzer type="index">	
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
      <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
     <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
    </fieldType>
  3. The third step is to index example data that looks like this:
    <add>
     <doc>
      <field name="id">1</field>
      <field name="name">XYZ1234ABC12POI</field>
     </doc>
    </add>
  4. Now, send the following query to Solr:
    http://localhost:8983/solr/cookbook/select?q=name:XYZ1

    The Solr response for this query is as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
     <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">0</int>
      <lst name="params">
       <str name="q">name:XYZ1</str>
      </lst>
     </lst>
     <result name="response" numFound="1" start="0">
      <doc>
       <str name="id">1</str>
       <str name="name">XYZ1234ABC12POI</str>
       <long name="_version_">1468270390243491840</long></doc>
     </result>
    </response>

As you see, the document has been found, so our setup is working as intended.

How it works...

First, let's look at our index structure defined in the schema.xml file. We have two fields, one holding the unique identifier of the document (the id field) and one holding the name of the document (the name field), the latter being the field we are actually interested in.

The name field is based on the new type we defined, text_wildcard. This type is responsible for enabling trailing wildcard behavior, that is, for allowing queries similar to LIKE 'WORD%' in SQL. As you can see, the field type is divided into two analyzers, one for data analysis during indexing and the other for query processing.

The query analyzer is straightforward: it just tokenizes the data on whitespace characters (using the solr.WhitespaceTokenizerFactory tokenizer) and lowercases it (using the solr.LowerCaseFilterFactory filter). We don't want the query-time analysis to include n-grams, because we will provide only a part of the word, its first letters. For example, in our case, we passed XYZ1, a part of the whole XYZ1234ABC12POI name.

Now, the index-time analysis (of the name field, of course) is similar to the query-time analysis. During indexing, the data is tokenized on whitespace characters (using the same solr.WhitespaceTokenizerFactory tokenizer), but there is also an additional filter defined. The solr.EdgeNGramFilterFactory filter is responsible for generating so-called edge n-grams. In our setup, we tell Solr that the minimum length of an n-gram is 1 (the minGramSize attribute) and the maximum length is 25 (the maxGramSize attribute). We also use the solr.LowerCaseFilterFactory filter to lowercase the n-gram output. So, what will Solr do with our example data? It will create the following tokens from the example text: x, xy, xyz, xyz1, xyz12, and so on. It builds each token by appending the next character of the string to the previous token, up to the maximum n-gram length given in the filter configuration. Note that only prefixes are generated, so a term such as yz1 is never indexed and a query for it won't match.
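The two analysis chains above can be sketched in a few lines of Python. This is only a model of what the configured tokenizer and filters do, not Solr code:

```python
def index_analyzer(text, min_gram=1, max_gram=25):
    """Mimic WhitespaceTokenizer -> EdgeNGramFilter -> LowerCaseFilter."""
    tokens = []
    for tok in text.split():                         # whitespace tokenization
        for n in range(min_gram, min(max_gram, len(tok)) + 1):
            tokens.append(tok[:n].lower())           # edge n-gram + lowercase
    return tokens

def query_analyzer(text):
    """Mimic WhitespaceTokenizer -> LowerCaseFilter (no n-grams)."""
    return [tok.lower() for tok in text.split()]

indexed = index_analyzer("XYZ1234ABC12POI")
print(indexed[:5])        # ['x', 'xy', 'xyz', 'xyz1', 'xyz12']
print("xyz1" in indexed)  # True  -> the query name:XYZ1 matches
print("yz1" in indexed)   # False -> only prefixes are stored
```

Because the whole lowercased query term xyz1 is among the indexed edge n-grams, the prefix query becomes a single exact term lookup instead of a term enumeration.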

So, when we send the example query, we can be sure that the example document will be found, because the n-gram filter is defined in the index-time analysis of the field. We deliberately left the n-gram filter out of the query-time analysis: analyzing the query that way would produce many short grams and could lead to false positive hits, which we don't want.
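To see the false positives that query-time n-grams would cause, compare the two approaches below (again a sketch, not Solr internals; the query XYZ9 is an invented non-matching prefix):

```python
def edge_ngrams(token, min_gram=1, max_gram=25):
    """Lowercased edge n-grams of a single token."""
    token = token.lower()
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

indexed = set(edge_ngrams("XYZ1234ABC12POI"))

# Correct query-time analysis: the whole lowercased prefix, one term.
print("xyz9" in indexed)  # False -> no match, as expected

# Hypothetical n-gram analysis at query time: any shared short gram matches.
query_grams = edge_ngrams("XYZ9")
print(any(g in indexed for g in query_grams))  # True -> false positive
```

The document does not start with XYZ9, yet the n-grammed query would still hit it through the short grams x, xy, and xyz, which is exactly why the query analyzer omits the filter.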

By the way, the functionality described here can also be used to provide autocomplete (if you are not familiar with the autocomplete feature, take a look at http://en.wikipedia.org/wiki/Autocomplete) for your application.

Remember that using n-grams will make your index noticeably larger. Because of this, you should avoid using n-grams on all the fields in the index; carefully decide which fields need them and which do not.
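The growth is easy to estimate: with our settings, every token expands into roughly one indexed term per character, capped by maxGramSize. A quick back-of-the-envelope check:

```python
def gram_count(token, min_gram=1, max_gram=25):
    """Number of edge n-grams produced for a single token."""
    return max(0, min(max_gram, len(token)) - min_gram + 1)

# A 15-character token expands into 15 indexed terms instead of 1.
print(gram_count("XYZ1234ABC12POI"))  # 15
```

Raising minGramSize (for example, to 2 or 3) is a simple way to trim the shortest, least selective grams if index size becomes a concern.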
