Configuring sorting for non-English languages

As you might already know, Solr supports UTF-8 encoding and thus can handle data in many languages. However, if you have ever needed to sort text in a language that has its own specific characters, you probably know that sorting doesn't work well with the standard Solr string type. This recipe will show you how to sort such data correctly in Solr.

How to do it...

  1. For the purpose of this recipe, let's assume that we have to sort text that contains Polish characters. To show both good and bad sorting behavior, we need to create the following index structure (add this to your schema.xml file):
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="name" type="text_general" indexed="true" stored="true" />
    <field name="name_sort_bad" type="string" indexed="true" stored="true" />
    <field name="name_sort_good" type="text_sort" indexed="true" stored="true" />
  2. Now let's define some copy fields to automatically fill the name_sort_bad and name_sort_good fields. Here is how they are defined (again, we only need to add the following section to the schema.xml file):
    <copyField source="name" dest="name_sort_bad" />
    <copyField source="name" dest="name_sort_good" />
  3. The last change to the schema.xml file is the new field type. The text_sort definition looks as follows:
    <fieldType name="text_sort" class="solr.CollationField" language="pl" country="PL" strength="primary" />
  4. The test data that needs to be indexed looks as follows (note that the file with the following data needs to be encoded in UTF-8):
    <add>
     <doc>
      <field name="id">1</field>
      <field name="name">Łąka</field>
     </doc>
     <doc>
      <field name="id">2</field>
      <field name="name">Lalka</field>
     </doc>
     <doc>
      <field name="id">3</field>
      <field name="name">Ząb</field>
     </doc>
    </add>
  5. First, let's take a look at what the incorrect sorting order looks like. To do this, we will send the following query to Solr:
    http://localhost:8983/solr/cookbook/select?q=*:*&sort=name_sort_bad+asc

    The response returned for the preceding query looks as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
     <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">1</int>
      <lst name="params">
       <str name="q">*:*</str>
       <str name="sort">name_sort_bad asc</str>
      </lst>
     </lst>
     <result name="response" numFound="3" start="0">
      <doc>
       <str name="id">2</str>
       <str name="name">Lalka</str>
       <str name="name_sort_bad">Lalka</str>
       <str name="name_sort_good">Lalka</str>
       <long name="_version_">1481928342372352000</long></doc>
      <doc>
       <str name="id">3</str>
       <str name="name">Ząb</str>
       <str name="name_sort_bad">Ząb</str>
       <str name="name_sort_good">Ząb</str>
       <long name="_version_">1481928342372352001</long></doc>
      <doc>
       <str name="id">1</str>
       <str name="name">Łąka</str>
       <str name="name_sort_bad">Łąka</str>
       <str name="name_sort_good">Łąka</str>
       <long name="_version_">1481928342282174464</long></doc>
     </result>
    </response>
  6. Now let's send the query that should return the documents sorted in the correct order. The query looks as follows:
    http://localhost:8983/solr/cookbook/select?q=*:*&sort=name_sort_good+asc

    The results returned by Solr are as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
     <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">1</int>
      <lst name="params">
       <str name="q">*:*</str>
       <str name="sort">name_sort_good asc</str>
      </lst>
     </lst>
     <result name="response" numFound="3" start="0">
      <doc>
       <str name="id">2</str>
       <str name="name">Lalka</str>
       <str name="name_sort_bad">Lalka</str>
       <str name="name_sort_good">Lalka</str>
       <long name="_version_">1481928342372352000</long></doc>
      <doc>
       <str name="id">1</str>
       <str name="name">Łąka</str>
       <str name="name_sort_bad">Łąka</str>
       <str name="name_sort_good">Łąka</str>
       <long name="_version_">1481928342282174464</long></doc>
      <doc>
       <str name="id">3</str>
       <str name="name">Ząb</str>
       <str name="name_sort_bad">Ząb</str>
       <str name="name_sort_good">Ząb</str>
       <long name="_version_">1481928342372352001</long></doc>
     </result>
    </response>

As you can see, the order is different this time, and it is the correct one: in the Polish alphabet, Ł comes after L and before Z. Now let's see how it works.

How it works...

Every document in the index consists of four fields. The id field holds the unique identifier of the document. The name field holds the name of the document. The last two fields are used for sorting.

The name_sort_bad field is nothing new; it's just a string-based field that is used to perform sorting. The name_sort_good field is based on a new type, the text_sort field type, which in turn is based on the solr.CollationField class. This class allows Solr to sort text according to the rules of the configured language. We used three attributes while defining the type. The language attribute tells Solr about the language of the field. The country attribute tells Solr about the country variant (this can be skipped if necessary). The strength attribute informs Solr about the collation strength to use; with primary, as in our example, differences such as letter case are ignored during comparison. More information about these parameters can be found in the JDK documentation of the java.text.Collator class. One crucial thing to remember is that you need to create an appropriate field type, with the appropriate attribute values, for every non-English language you want to sort.
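To get a feel for what the strength attribute changes, the following is a minimal sketch that uses the JDK's java.text.Collator directly, outside of Solr. The class name CollationStrengthDemo and the helper polishCollator are hypothetical names made up for this illustration, and the sketch assumes that the JDK collator behaves the same way as the one configured through the text_sort type.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class; not part of Solr.
public class CollationStrengthDemo {

    // Builds a JDK collator for Polish (language "pl", country "PL")
    // with the given strength, e.g. Collator.PRIMARY or Collator.TERTIARY.
    static Collator polishCollator(int strength) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag("pl-PL"));
        collator.setStrength(strength);
        return collator;
    }

    public static void main(String[] args) {
        Collator primary = polishCollator(Collator.PRIMARY);
        Collator tertiary = polishCollator(Collator.TERTIARY);

        // At primary strength, letter case is not a difference.
        System.out.println(primary.compare("lalka", "Lalka") == 0);  // prints true

        // At tertiary strength, letter case is a difference.
        System.out.println(tertiary.compare("lalka", "Lalka") != 0); // prints true
    }
}
```

In other words, a field type defined with strength="primary" should treat lalka and Lalka as equal for sorting purposes, while a stricter strength would keep them apart.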

The two queries that you can see in the examples have one difference: the field used for sorting. The first query uses the string-based name_sort_bad field. When sorting on this field, the document order will be incorrect whenever non-English characters are present, because a string field orders values by their encoded character values and not by the alphabet of a particular language. However, when sorting on the name_sort_good field, everything will be in the correct order, as shown in the preceding example.
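The difference between the two orderings can be reproduced outside of Solr. The following is a rough sketch, not Solr's actual implementation: it assumes that plain String comparison in Java is a fair stand-in for sorting on the string type, and that a JDK collator for Polish is a fair stand-in for the text_sort type. The class and method names are hypothetical.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; save the file as UTF-8, just like the test data.
public class PolishSortDemo {

    // Sorts by String.compareTo, that is, by raw UTF-16 code unit values.
    static List<String> sortByCodeUnit(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(String::compareTo);
        return sorted;
    }

    // Sorts with a JDK collator for Polish at primary strength,
    // mirroring language="pl" country="PL" strength="primary".
    static List<String> sortWithPolishCollator(List<String> names) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag("pl-PL"));
        collator.setStrength(Collator.PRIMARY);
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(collator);
        return sorted;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Łąka", "Lalka", "Ząb");

        // Ł (U+0141) has a higher code than Z (U+005A), so it ends up last.
        System.out.println(sortByCodeUnit(names)); // prints [Lalka, Ząb, Łąka]

        // With Polish collation rules available in the JDK, this is expected
        // to match the order returned for name_sort_good: Lalka, Łąka, Ząb.
        System.out.println(sortWithPolishCollator(names));
    }
}
```

The first list matches the order returned when sorting on name_sort_bad. The second depends on the locale data shipped with the JDK, which is why no exact output is given for it here.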
