As you might already know, Solr supports UTF-8 encoding and can therefore handle data in many languages. However, if you have ever needed to sort text in a language that has characters specific to it, you probably know that this doesn't work well with the standard Solr string type. This recipe will show you how to sort such languages correctly in Solr.
Start with the index structure; add the following fields to the schema.xml file:
  <field name="id" type="string" indexed="true" stored="true" required="true" />
  <field name="name" type="text_general" indexed="true" stored="true" />
  <field name="name_sort_bad" type="string" indexed="true" stored="true" />
  <field name="name_sort_good" type="text_sort" indexed="true" stored="true" />
Next, copy fields populate the name_sort_bad and name_sort_good fields. Here is how they are defined (again, we only need to add the following section to the schema.xml file):
  <copyField source="name" dest="name_sort_bad" />
  <copyField source="name" dest="name_sort_good" />
The last addition to the schema.xml file is the new type. The text_sort definition looks as follows:
  <fieldType name="text_sort" class="solr.CollationField" language="pl" country="PL" strength="primary" />
The example data looks as follows:
<add>
  <doc>
    <field name="id">1</field>
    <field name="name">Łąka</field>
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="name">Lalka</field>
  </doc>
  <doc>
    <field name="id">3</field>
    <field name="name">Ząb</field>
  </doc>
</add>
First, sort the documents on the name_sort_bad field:
http://localhost:8983/solr/cookbook/select?q=*:*&sort=name_sort_bad+asc
The response returned for the preceding query looks as follows:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="q">*:*</str>
      <str name="sort">name_sort_bad asc</str>
    </lst>
  </lst>
  <result name="response" numFound="3" start="0">
    <doc>
      <str name="id">2</str>
      <str name="name">Lalka</str>
      <str name="name_sort_bad">Lalka</str>
      <str name="name_sort_good">Lalka</str>
      <long name="_version_">1481928342372352000</long></doc>
    <doc>
      <str name="id">3</str>
      <str name="name">Ząb</str>
      <str name="name_sort_bad">Ząb</str>
      <str name="name_sort_good">Ząb</str>
      <long name="_version_">1481928342372352001</long></doc>
    <doc>
      <str name="id">1</str>
      <str name="name">Łąka</str>
      <str name="name_sort_bad">Łąka</str>
      <str name="name_sort_good">Łąka</str>
      <long name="_version_">1481928342282174464</long></doc>
  </result>
</response>
Now, run the same query sorted on the name_sort_good field:
http://localhost:8983/solr/cookbook/select?q=*:*&sort=name_sort_good+asc
The results returned by Solr are as follows:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="q">*:*</str>
      <str name="sort">name_sort_good asc</str>
    </lst>
  </lst>
  <result name="response" numFound="3" start="0">
    <doc>
      <str name="id">2</str>
      <str name="name">Lalka</str>
      <str name="name_sort_bad">Lalka</str>
      <str name="name_sort_good">Lalka</str>
      <long name="_version_">1481928342372352000</long></doc>
    <doc>
      <str name="id">1</str>
      <str name="name">Łąka</str>
      <str name="name_sort_bad">Łąka</str>
      <str name="name_sort_good">Łąka</str>
      <long name="_version_">1481928342282174464</long></doc>
    <doc>
      <str name="id">3</str>
      <str name="name">Ząb</str>
      <str name="name_sort_bad">Ząb</str>
      <str name="name_sort_good">Ząb</str>
      <long name="_version_">1481928342372352001</long></doc>
  </result>
</response>
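As you can see, the order is different, and this time it is correct: in the Polish alphabet, Ł comes directly after L and before Z, so Łąka belongs between Lalka and Ząb. Now let's see how it works.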
Every document in the index is built of four fields. The id field holds the unique identifier of the document. The name field holds the name of the document. The last two fields are used for sorting.
The name_sort_bad field is nothing new; it's just a string-based field that is used to perform sorting. The name_sort_good field is based on a new type, the text_sort field type, which in turn is based on the solr.CollationField class. This type allows Solr to sort text according to the rules of the defined language. We used three attributes while defining the type. First, the language attribute tells Solr about the language of the field. The second attribute is country, which tells Solr about the country variant (this can be skipped if necessary). The strength attribute informs Solr about the collation strength used. More information about these parameters can be found in the JDK documentation for the java.text.Collator class. One crucial point: you need to create an appropriate field type, with appropriate attribute values, for every non-English language you want to sort.
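To get a feel for what the strength attribute controls, it can help to experiment with java.text.Collator directly. The following is a minimal sketch in plain Java, not Solr code: the class name CollationStrengthDemo and the helper polishCollator are made up for illustration, and the sketch assumes that strength="primary" corresponds to Collator.PRIMARY and that the JDK in use ships Polish collation data.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class; not part of Solr.
class CollationStrengthDemo {

    // Builds a collator for Polish (language "pl", country "PL")
    // with the given strength, for example Collator.PRIMARY.
    static Collator polishCollator(int strength) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag("pl-PL"));
        collator.setStrength(strength);
        return collator;
    }

    public static void main(String[] args) {
        Collator primary = polishCollator(Collator.PRIMARY);
        Collator tertiary = polishCollator(Collator.TERTIARY);

        // Letter case is a tertiary difference, so a PRIMARY collator
        // treats the two spellings as equal while a TERTIARY one does not.
        System.out.println("primary  lalka vs Lalka: " + primary.compare("lalka", "Lalka"));
        System.out.println("tertiary lalka vs Lalka: " + tertiary.compare("lalka", "Lalka"));

        // \u0141 is the Polish letter L with stroke. Whether it is a separate
        // base letter at PRIMARY strength depends on the locale's rules.
        System.out.println("primary  L vs L-with-stroke: " + primary.compare("L", "\u0141"));
    }
}
```

A lower strength makes the comparison more forgiving (case and, in many locales, accents are ignored), which is usually what you want for a sort field; a higher strength distinguishes more variants of the same word.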
The two queries in the examples differ in one thing: the field used for sorting. The first query uses the string-based name_sort_bad field. Sorting on this field produces an incorrect document order when non-English characters are present. However, when sorting on the name_sort_good field, everything is in the correct order, as shown in the preceding example.
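The difference between the two sort fields can be reproduced outside Solr with plain JDK classes. The following sketch is an illustration under stated assumptions, not Solr's actual implementation: the class PolishSortDemo and its method names are hypothetical, plain String ordering stands in for the string type, and a Polish java.text.Collator with primary strength stands in for the text_sort type. It also assumes the JDK ships Polish collation data. The names are written with Unicode escapes so the code does not depend on the source file encoding.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; not part of Solr.
class PolishSortDemo {

    // Sorts by raw UTF-16 code unit values (String.compareTo),
    // which knows nothing about any language's alphabet.
    static List<String> sortByCodeUnit(List<String> names) {
        List<String> copy = new ArrayList<>(names);
        Collections.sort(copy);
        return copy;
    }

    // Sorts with Polish collation rules at primary strength,
    // mirroring language="pl" country="PL" strength="primary".
    static List<String> sortPolish(List<String> names) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag("pl-PL"));
        collator.setStrength(Collator.PRIMARY);
        List<String> copy = new ArrayList<>(names);
        copy.sort(collator);
        return copy;
    }

    public static void main(String[] args) {
        // The three names from the example data: Łąka, Lalka, Ząb.
        List<String> names = Arrays.asList("\u0141\u0105ka", "Lalka", "Z\u0105b");
        System.out.println("code unit order: " + sortByCodeUnit(names));
        System.out.println("Polish order:    " + sortPolish(names));
    }
}
```

The code unit sort puts Łąka last because the code point of Ł (U+0141) is larger than that of any basic Latin letter, which matches the order returned for name_sort_bad. With Polish collation rules available, the collator is expected to place Łąka between Lalka and Ząb, as in the name_sort_good response.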