Storing data outside of Solr index

Although Solr's partial update API lets us update a single field of a document, behind the scenes Solr still reindexes the whole document. There are situations where such reindexing is not practical. For example, we may have an index containing articles about published books, and we may want to record how many users visited and read each article. If the number of users is so high that we receive thousands of updates per second, sending that many updates can be demanding for Solr. Instead, we can store such information in an external file and use it for boosting or sorting. This recipe will show how to do this.

How to do it...

The following steps are needed to achieve our requirements:

  1. First of all, we will create the index structure by adding the following field definitions to our schema.xml file:
    <field name="name" type="text_general" indexed="true" stored="true" />
    <field name="visits" type="visitsType" />
  2. Next, we will define the visitsType field type by adding the following section to the schema.xml file:
    <fieldType name="visitsType" class="solr.ExternalFileField" keyField="id" defVal="0" stored="false" indexed="false" valType="float"/>
  3. We also need to put a file called external_visits in the directory that contains the Solr index directory (this is usually the data directory, not the data/index directory). The contents of the external_visits file look like this:
    1=1.0
    2=5.0
  4. Our example data looks as follows:
    <add>
     <doc>
      <field name="id">1</field>
      <field name="name">Solr Cookbook released</field>
     </doc>
     <doc>
      <field name="id">2</field>
      <field name="name">Elasticsearch server released</field>
     </doc>
    </add>
  5. Finally, we can run our query, for example a query that returns all the documents with the released term in the name field, sorted in descending order by the number of visits:
    http://localhost:8983/solr/cookbook/select?q=name:released&sort=field(visits)+desc
  6. The results returned by Solr will be as follows:
    <?xml version="1.0" encoding="UTF-8"?>
    <response>
     <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">15</int>
      <lst name="params">
       <str name="q">name:released</str>
       <str name="sort">field(visits) desc</str>
      </lst>
     </lst>
     <result name="response" numFound="2" start="0">
      <doc>
       <str name="id">2</str>
       <str name="name">Elasticsearch server released</str>
       <long name="_version_">1470198928794189824</long></doc>
      <doc>
       <str name="id">1</str>
       <str name="name">Solr Cookbook released</str>
       <long name="_version_">1470198928742809600</long></doc>
     </result>
    </response>

Now, let's see how it works.

How it works...

Our index structure is built of three fields: the id field holds the unique identifier of each article, the name field holds the article's name, and the visits field holds the number of visits for each document.

The visits field is the one we are interested in the most. It uses a new type, the visitsType field type. We defined the type by using the solr.ExternalFileField class, which tells Solr that we will store the values for this field in an external file. To use this type, we need to provide a few properties specific to the field type:

  • keyField: This is the name of the field that is used to differentiate documents. Usually, we set the value of the property to the name of the primary key, but in general, it should point Solr to a field that can be used to differentiate documents.
  • defVal: This is the default value of a field using the field type, used when no value for the given document is found in the external file. So, in our case, if a document identifier can't be found in the external file, the document will be given a value of 0.
  • valType: This is the name of the type that will be used for values in the external file. It can be any float-based field type; in our case, it is one of the default, simple types provided in the example Solr schema.
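To make the roles of keyField, defVal, and valType more concrete, the following Python sketch models the lookup. It is only an illustration of the behavior described above, not Solr's actual implementation; the function names parse_external_file and visits_for are hypothetical, and the document with identifier 3 is an assumed extra document that has no entry in the file.

```python
DEF_VAL = 0.0  # mirrors defVal="0" from the field type definition


def parse_external_file(text):
    """Parse 'key=value' lines into a dict of document key -> float value."""
    values = {}
    for line in text.splitlines():
        # Splitting on the last '=' is a choice made for this sketch.
        key, sep, raw = line.strip().rpartition("=")
        if not sep:
            continue  # skip blank lines and lines without '='
        values[key] = float(raw)  # valType is float-based
    return values


def visits_for(doc_id, values, default=DEF_VAL):
    """Return the external value for a document, or the default if absent."""
    return values.get(doc_id, default)


values = parse_external_file("1=1.0\n2=5.0\n")
docs = ["1", "2", "3"]  # "3" is hypothetical and missing from the file
# Sort in descending order by visits, like sort=field(visits) desc.
ranked = sorted(docs, key=lambda d: visits_for(d, values), reverse=True)
print(ranked)  # ['2', '1', '3']
```

The document without an entry falls back to the default value and therefore ends up last in a descending sort.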

Finally, we have the external_visits file. As mentioned earlier, this file needs to be placed in the directory that contains the index directory of the collection (or core), which is usually the data directory. Solr loads the file during startup and reloads it each time a searcher is reopened (by a hard commit that reopens the searcher or by a soft commit). The naming scheme of the file is simple: the constant prefix external_ followed by the name of the field that uses the external file field type; in our case, this gives external_visits. The contents are not complicated either: each line holds a document identifier (matching a value of the field defined by the keyField property) and a float value, which in our case is the number of visits, joined by the = character. We don't need to sort the entries in the file, but Solr will work slightly faster when they are sorted by document identifier.
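In practice, the file is usually generated by a script from whatever system counts the visits. The following Python sketch shows one possible way to do this. The function name write_external_file, the temporary file name, and the decision to sort identifiers as plain strings are assumptions made for the illustration; the write-then-rename step is a general precaution against a half-written file being read, not something the recipe requires.

```python
import os
import tempfile


def write_external_file(data_dir, field_name, visits):
    """Write id=value lines to <data_dir>/external_<field_name>."""
    target = os.path.join(data_dir, "external_" + field_name)
    # Hypothetical temporary name that does not start with "external_".
    tmp_path = os.path.join(data_dir, ".tmp_" + field_name)
    with open(tmp_path, "w", encoding="utf-8") as out:
        # Assumption: identifiers are ordered as plain strings.
        for doc_id in sorted(visits, key=str):
            out.write(f"{doc_id}={float(visits[doc_id])}\n")
    # Replace the old file in one step once the new one is complete.
    os.replace(tmp_path, target)
    return target


# Demonstration in a throwaway directory instead of a real Solr data directory.
with tempfile.TemporaryDirectory() as data_dir:
    path = write_external_file(data_dir, "visits", {"2": 5, "1": 1})
    file_name = os.path.basename(path)
    with open(path, encoding="utf-8") as f:
        contents = f.read()

print(file_name)
print(contents)
```

After the file is replaced, the new values become visible once a searcher is reopened, as described above.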

As you can see in the query result, the data is sorted properly. We can also use the value for boosting, but we can't search on the data stored in the external file; Solr just doesn't allow this.
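As for boosting, one possible approach is to wrap the query in Solr's boost query parser and pass a function of the visits field. The following Python sketch only builds such a request URL; the function name boosted_query_url is hypothetical, and adding 1 to the number of visits is an assumption made here so that documents with the default value of 0 do not have their score multiplied by zero.

```python
from urllib.parse import urlencode, urlparse, parse_qs

BASE_URL = "http://localhost:8983/solr/cookbook/select"


def boosted_query_url(user_query, boost_function="sum(field(visits),1)"):
    """Build a select URL whose score is multiplied by the boost function."""
    # {!boost b=...} is a local-params prefix placed in front of the query.
    params = {"q": "{!boost b=" + boost_function + "}" + user_query}
    # urlencode takes care of escaping braces, spaces, and parentheses.
    return BASE_URL + "?" + urlencode(params)


url = boosted_query_url("name:released")
print(url)

# Decode the URL again to show the q parameter Solr would receive.
decoded_q = parse_qs(urlparse(url).query)["q"][0]
print(decoded_q)  # {!boost b=sum(field(visits),1)}name:released
```

Whether a multiplicative boost like this suits a given ranking is something to verify against real data.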
