Many users working with traditional RDBMS systems are used to wildcard searches. The most common among them are the ones using the * character, which matches zero or more characters. If you have used SQL databases, you have probably seen searches such as this:
AND name LIKE 'ABC12%'
However, wildcard searches are not very efficient when it comes to Solr. This is because Solr needs to enumerate all the matching terms before the query is executed. So, how do we prepare our Solr deployment to handle trailing wildcard characters in an efficient way? This recipe will show you how to prepare your data and make such searches efficient.
We need to follow a few steps to handle trailing wildcards efficiently using the n-gram approach. First, we define our two fields in the schema.xml file:
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="name" type="text_wildcard" indexed="true" stored="true" />
Next, we define the text_wildcard type, again in the schema.xml file:
<fieldType name="text_wildcard" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Now we can index our example document, which looks as follows:
<add> <doc> <field name="id">1</field> <field name="name">XYZ1234ABC12POI</field> </doc> </add>
Finally, we run a query to check whether our setup works, for example:
http://localhost:8983/solr/cookbook/select?q=name:XYZ1
The Solr response for this query is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="q">name:XYZ1</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">1</str>
      <str name="name">XYZ1234ABC12POI</str>
      <long name="_version_">1468270390243491840</long>
    </doc>
  </result>
</response>
As you can see, the document was found, so our setup works as intended.
First, let's look at our index structure defined in the schema.xml file. We have two fields: one holding the unique identifier of the document (the id field) and the other holding the name of the document (the name field), which are the fields we are interested in.
The name field is based on the new type we defined, text_wildcard. This type is responsible for enabling trailing wildcard behavior, allowing us to run queries similar to LIKE 'WORD%' in SQL. As you can see, the field type defines two analyzers: one for data analysis during indexing and the other for query processing.
The query analyzer is straightforward: it just tokenizes the data on whitespace characters (using the solr.WhitespaceTokenizerFactory tokenizer) and lowercases the tokens (using the solr.LowerCaseFilterFactory filter). We don't want the query-time analysis to include n-grams, because we will provide only a part of the word, its first letters. For example, in our case, we passed the XYZ1 part of the whole XYZ1234ABC12POI name.
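The query-time analysis chain can be sketched in plain Python (a simplified simulation for illustration, not Solr's actual implementation):

```python
def query_analyze(text):
    """Simulate the query-time chain: whitespace tokenizer, then lowercase filter."""
    return [token.lower() for token in text.split()]

# The query term is only split on whitespace and lowercased; no n-grams are produced.
print(query_analyze("XYZ1"))  # ['xyz1']
```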
The index-time analysis (of the name field, of course) is similar to the query-time analysis. During indexing, the data is tokenized on whitespace characters (using the same solr.WhitespaceTokenizerFactory tokenizer), but there is an additional filter defined. The solr.EdgeNGramFilterFactory filter is responsible for generating so-called edge n-grams. In our setup, we tell Solr that the minimum length of an n-gram is 1 (the minGramSize attribute) and the maximum length is 25 (the maxGramSize attribute). We also use the solr.LowerCaseFilterFactory filter to lowercase the n-gram output. So, what will Solr do with our example data? It will create the following tokens from the example text: X, XY, XYZ, XYZ1, XYZ12, and so on. It builds each token by appending the next character of the string to the previous token, up to the maximum n-gram length given in the filter configuration. As you can see, a term such as YZ1 won't match, because edge n-grams are generated only from the beginning of the token.
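To make the index-time behavior concrete, here is a small Python sketch (an approximation of the analysis chain, not Solr code) that generates edge n-grams the way our configuration does:

```python
def edge_ngrams(token, min_gram=1, max_gram=25):
    """Generate prefixes of the token, from min_gram up to max_gram characters."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

def index_analyze(text):
    """Whitespace tokenizer + edge n-gram filter + lowercase filter."""
    return [gram.lower() for token in text.split() for gram in edge_ngrams(token)]

tokens = index_analyze("XYZ1234ABC12POI")
print(tokens[:5])        # ['x', 'xy', 'xyz', 'xyz1', 'xyz12']
print("xyz1" in tokens)  # True - the example query term is an indexed prefix
print("yz1" in tokens)   # False - non-prefix substrings are never generated
```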
So, by running the example query, we can be sure that the example document will be found, because the n-gram filter is defined in the field's index-time analysis. We deliberately did not define the n-gram filter in the query-time analysis, because analyzing the query in such a way could lead to false positive hits.
By the way, the functionality described here can also be used to provide autocomplete features for your application (if you are not familiar with autocomplete, take a look at http://en.wikipedia.org/wiki/Autocomplete).
Remember that using n-grams will make your index larger. Because of this, you should avoid using n-grams on every field in the index; carefully decide which fields should use them and which should not.
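As a rough illustration of that growth (back-of-the-envelope arithmetic, not a measurement of a real index): with minGramSize set to 1, a single token produces as many indexed terms as it has characters, capped at maxGramSize:

```python
def ngram_term_count(token, min_gram=1, max_gram=25):
    # Number of edge n-grams produced for one token with the given settings.
    return max(0, min(len(token), max_gram) - min_gram + 1)

# Our 15-character example name yields 15 indexed terms instead of 1.
print(ngram_term_count("XYZ1234ABC12POI"))  # 15
# A very long token is capped by maxGramSize.
print(ngram_term_count("X" * 40))           # 25
```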