Automatically expiring Solr documents

There are use cases that require documents to expire after a certain amount of time: they should either be deleted or marked as inactive. For example, let's assume that we have a web application that works as a link shortening service. One can paste a long link and get a short version of it. However, we would like the links to expire one hour after their creation. Of course, we could develop a periodic job on the application side to make this happen, but we can also use Solr for this. This recipe will show you how to achieve such functionality with Solr.

How to do it...

For the purpose of this recipe, let's assume that we want our documents to expire 5 minutes after they are sent for indexing.

  1. We will start with the structure of the index, which looks as follows (we add it to our schema.xml file):
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="url" type="string" indexed="false" stored="true" />
    <field name="short" type="string" indexed="false" stored="true" />
    <field name="user" type="string" indexed="false" stored="true" />
    <field name="expiration_time" type="date" indexed="true" stored="true" />
  2. The second step is to define a new update request processor chain, which looks as follows (we put it in the solrconfig.xml file):
    <updateRequestProcessorChain default="true">
      <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
        <int name="autoDeletePeriodSeconds">10</int>
        <str name="expirationFieldName">expiration_time</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>
  3. Now, we can index the first document using the following command:
    curl 'http://localhost:8983/solr/cookbook/update?_ttl_=%2B5MINUTES' -H 'Content-type:application/xml' --data-binary '<add>
     <doc>
      <field name="id">1</field>
      <field name="url">http://solr.pl/en/2014/10/31/lucene-solr-4-10-2/</field>
      <field name="short">http://solr.pl/short/1</field>
      <field name="user">gr0</field>
     </doc>
    </add>'
    
  4. A minute later, we index the following document:
    curl 'http://localhost:8983/solr/cookbook/update?_ttl_=%2B5MINUTES' -H 'Content-type:application/xml' --data-binary '<add>
     <doc>
      <field name="id">2</field>
      <field name="url">http://solr.pl/en/2014/09/04/apache-lucene-and-solr-4-10/</field>
      <field name="short">http://solr.pl/short/2</field>
      <field name="user">gr0</field>
     </doc>
    </add>'
    
  5. Now, let's commit these documents by running the following command:
    curl 'http://localhost:8983/solr/cookbook/update' -H 'Content-type:application/xml' --data-binary '<commit/>'
    
  6. After indexing them, we try running the following query:
    http://localhost:8983/solr/cookbook/select?q=*:*

    The response will be as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">0</int>
      <lst name="params">
        <str name="q">*:*</str>
      </lst>
    </lst>
    <result name="response" numFound="2" start="0">
      <doc>
        <str name="id">1</str>
        <str name="url">http://solr.pl/en/2014/10/31/lucene-solr-4-10-2/</str>
        <str name="short">http://solr.pl/short/1</str>
        <str name="user">gr0</str>
        <date name="expiration_time">2014-11-03T12:09:15.002Z</date>
        <long name="_version_">1483752084606025728</long></doc>
      <doc>
        <str name="id">2</str>
        <str name="url">http://solr.pl/en/2014/09/04/apache-lucene-and-solr-4-10/</str>
        <str name="short">http://solr.pl/short/2</str>
        <str name="user">gr0</str>
        <date name="expiration_time">2014-11-03T12:09:55.963Z</date>
        <long name="_version_">1483752127556747264</long></doc>
    </result>
    </response>
  7. Now, we wait until 5 minutes have passed since the first document was indexed and run the same query again:
    http://localhost:8983/solr/cookbook/select?q=*:*

    Now, the response is different:

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">0</int>
      <lst name="params">
        <str name="q">*:*</str>
      </lst>
    </lst>
    <result name="response" numFound="1" start="0">
      <doc>
        <str name="id">2</str>
        <str name="url">http://solr.pl/en/2014/09/04/apache-lucene-and-solr-4-10/</str>
        <str name="short">http://solr.pl/short/2</str>
        <str name="user">gr0</str>
        <date name="expiration_time">2014-11-03T12:09:55.963Z</date>
        <long name="_version_">1483752127556747264</long></doc>
    </result>
    </response>

    Also, we can see the following messages in the Solr logs:

    213361 [autoExpireDocs-10-thread-1] INFO  org.apache.solr.update.processor.LogUpdateProcessor  – [collection1] {deleteByQuery={!cache=false}expiration_time:[* TO 2014-11-03T12:05:25.794Z] (-1483752158835769344),commit=} 0 4
    213361 [autoExpireDocs-10-thread-1] INFO  org.apache.solr.update.processor.DocExpirationUpdateProcessorFactory  – Finished periodic deletion of expired docs

This means that everything is working as it should and Solr deleted the first document. If we waited longer, we would see that the second document was deleted as well. Now, let's see how it works.

How it works...

As usual, we start with the structure of the index we are going to use. We need the identifier of the document, which is represented by the id field, the long URL address (the url field), the shortened URL (the short field), and the user who registered the URL address (the user field). Finally, the last field, expiration_time, is the one that Solr will use to check whether the document should be deleted or not. This field must use a date-based type, which in our case is the date type.

The next thing we do is define a custom update request processor chain that we set to be the default one (default="true"). The first processor in the chain, solr.processor.DocExpirationUpdateProcessorFactory, is the one responsible for document expiration. We use two properties to define its behavior. The first one, autoDeletePeriodSeconds, tells Solr how often it should look for expired documents; in our case, every 10 seconds. Each time, Solr runs a delete by query on the collection that removes all the documents whose expiration date is not later than the current time. The second property, expirationFieldName, tells Solr which field holds the expiration date of the document and should be used by the delete by query. In our case, it is our expiration_time field. The two remaining processors are common ones: one logs the update process (solr.LogUpdateProcessorFactory) and the other runs the update itself (solr.RunUpdateProcessorFactory).
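To make the periodic deletion more concrete, the following sketch builds a delete by query similar to the one visible in the Solr logs in this recipe. It is only an approximation for illustration: the processor itself uses a concrete timestamp and the {!cache=false} local parameter, while this sketch uses NOW, and the run_expiry_delete function name and the commit=true parameter are assumptions made for this example.

```shell
# Field that holds the expiration date (matches expirationFieldName in solrconfig.xml).
expiration_field='expiration_time'

# Delete every document whose expiration date is not later than the current time.
delete_payload="<delete><query>${expiration_field}:[* TO NOW]</query></delete>"
echo "$delete_payload"

# Hypothetical helper: sends the delete to the cookbook collection used in this recipe.
# It is only defined here; call it yourself against a running Solr instance.
run_expiry_delete() {
  curl 'http://localhost:8983/solr/cookbook/update?commit=true' \
    -H 'Content-type:application/xml' --data-binary "$delete_payload"
}
```

Running such a command by hand should not be needed when the processor is configured; it only shows what kind of request Solr issues for us every autoDeletePeriodSeconds seconds.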

The next interesting thing is how we index our documents. As you can see, in addition to the document itself, we also provide an additional request parameter, _ttl_. It stands for time to live and specifies when the documents in the update request should expire. The value of the parameter can use the whole date math syntax that Solr supports (described at https://cwiki.apache.org/confluence/display/solr/Working+with+Dates). We set it to %2B5MINUTES, which is the URL-encoded form of +5MINUTES. This means that Solr will take the current date and time, add 5 minutes to it, and write the result into the field named by the expirationFieldName property of the update processor.
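The plus sign has to be percent-encoded because a literal + in a query string is decoded as a space. The following minimal sketch shows one way to build the update URL from a plain date math expression; the variable names are made up for this example, and the sed call handles only the + character, not general URL encoding.

```shell
# Date math expression: expire 5 minutes after indexing.
ttl='+5MINUTES'

# Encode only the plus sign; other reserved characters would need their own handling.
encoded_ttl=$(printf '%s' "$ttl" | sed 's/+/%2B/g')

# Update URL for the cookbook collection used in this recipe.
update_url="http://localhost:8983/solr/cookbook/update?_ttl_=${encoded_ttl}"
echo "$update_url"
```

The resulting URL is the same one that we passed to curl when indexing our two documents.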

As we can see, after some time Solr deleted the expired document and automatically refreshed the collection using a soft commit operation with the openSearcher=true property.
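Because the deletion runs periodically, a document can remain visible for up to autoDeletePeriodSeconds after its expiration date has passed. If that matters, one option is to filter expired documents out at query time. The sketch below builds such a query; the fq filter is an addition made for illustration and is not part of the original recipe. Note that it would also exclude documents that have no expiration_time value at all.

```shell
# Select handler of the cookbook collection used in this recipe.
base_url='http://localhost:8983/solr/cookbook/select'

# fq=expiration_time:[NOW TO *], with brackets and spaces percent-encoded.
query_url="${base_url}?q=*:*&fq=expiration_time:%5BNOW%20TO%20*%5D"
echo "$query_url"
```

Opening the printed URL against a running Solr instance should return only those documents whose expiration date is still in the future.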

There's more...

There is one more thing that I would like to mention when it comes to automatic document expiration.

Changing the time to live parameter name

If we want, we can also change the name under which the time to live information is provided. The processor has two properties for this: ttlParamName sets the name of the update request parameter, and ttlFieldName sets the name of the document field from which the time to live can be read; both default to _ttl_. We add the chosen property to the solr.processor.DocExpirationUpdateProcessorFactory definition in our solrconfig.xml file. For example, if we would like documents to carry their time to live in a field called expireAfter instead of _ttl_, we should configure our update request processor chain as follows:

<updateRequestProcessorChain default="true">
  <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
    <int name="autoDeletePeriodSeconds">10</int>
    <str name="ttlFieldName">expireAfter</str>
    <str name="expirationFieldName">expiration_time</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
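
With ttlFieldName set to expireAfter, the time to live can be given inside the document itself. The following is a sketch only: the document with identifier 3 and its URLs are made up for this example, and it assumes that the expireAfter field is either defined in schema.xml or removed from the document before indexing (for example, with solr.IgnoreFieldUpdateProcessorFactory).

```shell
# Hypothetical document; the expireAfter field carries the date math expression.
# No URL encoding is needed here because the value travels in the request body.
doc_payload='<add>
 <doc>
  <field name="id">3</field>
  <field name="url">http://solr.pl/en/</field>
  <field name="short">http://solr.pl/short/3</field>
  <field name="user">gr0</field>
  <field name="expireAfter">+5MINUTES</field>
 </doc>
</add>'
echo "$doc_payload"

# Hypothetical helper: sends the document to the cookbook collection.
# It is only defined here; call it yourself against a running Solr instance.
index_with_ttl_field() {
  curl 'http://localhost:8983/solr/cookbook/update' \
    -H 'Content-type:application/xml' --data-binary "$doc_payload"
}
```

Solr should then compute the expiration date from the expireAfter value and store it in the expiration_time field, just as it did for the _ttl_ request parameter.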