Some use cases require documents to expire after a certain amount of time: they should either be deleted or marked as inactive once a given period has passed. For example, let's assume that we have a web application that works as a link shortening service. One can paste a long link and get a short version of it. However, we would like the links to expire one hour after their creation. Of course, we could develop a periodic job on the application side to make this happen, but we can also use Solr for this. This recipe will show you how to achieve such functionality with Solr.
For the purpose of this recipe, let's assume that we want our documents to expire 5 minutes after they were sent for indexing.
We start with the index structure. We need the following fields (add them to the schema.xml file):
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="url" type="string" indexed="false" stored="true" />
<field name="short" type="string" indexed="false" stored="true" />
<field name="user" type="string" indexed="false" stored="true" />
<field name="expiration_time" type="date" indexed="true" stored="true" />
Next, we configure the update request processor chain (add it to the solrconfig.xml file):
<updateRequestProcessorChain default="true">
  <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
    <int name="autoDeletePeriodSeconds">10</int>
    <str name="expirationFieldName">expiration_time</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
Now, we index two documents, passing the _ttl_ request parameter with each update request:
curl 'http://localhost:8983/solr/cookbook/update?_ttl_=%2B5MINUTES' -H 'Content-type:application/xml' --data-binary '<add>
  <doc>
    <field name="id">1</field>
    <field name="url">http://solr.pl/en/2014/10/31/lucene-solr-4-10-2/</field>
    <field name="short">http://solr.pl/short/1</field>
    <field name="user">gr0</field>
  </doc>
</add>'
curl 'http://localhost:8983/solr/cookbook/update?_ttl_=%2B5MINUTES' -H 'Content-type:application/xml' --data-binary '<add>
  <doc>
    <field name="id">2</field>
    <field name="url">http://solr.pl/en/2014/09/04/apache-lucene-and-solr-4-10/</field>
    <field name="short">http://solr.pl/short/2</field>
    <field name="user">gr0</field>
  </doc>
</add>'
Then, we commit the changes:
curl 'http://localhost:8983/solr/cookbook/update' -H 'Content-type:application/xml' --data-binary '<commit/>'
After that, we run a simple query that returns all the documents:
http://localhost:8983/solr/cookbook/select?q=*:*
The response will be as follows:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="q">*:*</str>
    </lst>
  </lst>
  <result name="response" numFound="2" start="0">
    <doc>
      <str name="id">1</str>
      <str name="url">http://solr.pl/en/2014/10/31/lucene-solr-4-10-2/</str>
      <str name="short">http://solr.pl/short/1</str>
      <str name="user">gr0</str>
      <date name="expiration_time">2014-11-03T12:09:15.002Z</date>
      <long name="_version_">1483752084606025728</long></doc>
    <doc>
      <str name="id">2</str>
      <str name="url">http://solr.pl/en/2014/09/04/apache-lucene-and-solr-4-10/</str>
      <str name="short">http://solr.pl/short/2</str>
      <str name="user">gr0</str>
      <date name="expiration_time">2014-11-03T12:09:55.963Z</date>
      <long name="_version_">1483752127556747264</long></doc>
  </result>
</response>
Once the expiration time of the first document has passed, we run the same query again:
http://localhost:8983/solr/cookbook/select?q=*:*
Now, the response is different:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="q">*:*</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">2</str>
      <str name="url">http://solr.pl/en/2014/09/04/apache-lucene-and-solr-4-10/</str>
      <str name="short">http://solr.pl/short/2</str>
      <str name="user">gr0</str>
      <date name="expiration_time">2014-11-03T12:09:55.963Z</date>
      <long name="_version_">1483752127556747264</long></doc>
  </result>
</response>
We can also see the following messages in the Solr logs:
213361 [autoExpireDocs-10-thread-1] INFO org.apache.solr.update.processor.LogUpdateProcessor – [collection1] {deleteByQuery={!cache=false}expiration_time:[* TO 2014-11-03T12:05:25.794Z] (-1483752158835769344),commit=} 0 4
213361 [autoExpireDocs-10-thread-1] INFO org.apache.solr.update.processor.DocExpirationUpdateProcessorFactory – Finished periodic deletion of expired docs
This means that everything is working as it should and Solr deleted the first document. If we waited longer, we would see that the second document was deleted as well. Now, let's see how it works.
As usual, we start with the structure of the index we are going to use. We need the identifier of the document, which is represented by the id field, the long URL address (the url field), the shortened URL (the short field), and the user who registered the URL address (the user field). Finally, the last field, expiration_time, is the one that Solr will use to check whether the document should be deleted or not. This field should be based on a date and time type, which in our case is the date type.
The next thing we do is define a custom update request processor chain that we set to be the default one (default="true"). Within it, we first have the update processor that is responsible for document expiration: solr.processor.DocExpirationUpdateProcessorFactory. We use two properties to define its behavior. The first is autoDeletePeriodSeconds, which tells Solr how often it should look for expired documents. In our case, it will be every 10 seconds. To do so, Solr periodically runs a delete by query on the collection that removes all the documents whose expiration date is not later than the current time; the autoDeletePeriodSeconds property defines how often that query is run. The second property, expirationFieldName, tells Solr which field holds the expiration date of the document and should therefore be used by the delete by query. In our case, it is our expiration_time field. The two additional processors are common ones: one logs the update process (solr.LogUpdateProcessorFactory) and the other runs the update itself (solr.RunUpdateProcessorFactory).
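To make the periodic deletion more concrete, the following Python sketch builds a query string of the same shape as the deleteByQuery entry in the log message shown earlier. It is only an illustration of the format, not Solr's actual implementation; the function name expiration_delete_query is hypothetical.

```python
from datetime import datetime, timezone


def expiration_delete_query(field: str, now: datetime) -> str:
    """Build a delete-by-query string shaped like the one in the Solr log.

    Hypothetical helper for illustration only; Solr builds this internally.
    """
    utc = now.astimezone(timezone.utc)
    # Solr dates are ISO 8601 in UTC with millisecond precision
    stamp = utc.strftime("%Y-%m-%dT%H:%M:%S") + f".{utc.microsecond // 1000:03d}Z"
    # Inclusive range: everything that expires at or before 'now'
    return f"{{!cache=false}}{field}:[* TO {stamp}]"


now = datetime(2014, 11, 3, 12, 5, 25, 794000, tzinfo=timezone.utc)
print(expiration_delete_query("expiration_time", now))
# {!cache=false}expiration_time:[* TO 2014-11-03T12:05:25.794Z]
```

With autoDeletePeriodSeconds set to 10, a query of this shape would be issued roughly every 10 seconds, each time with a newer upper bound.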
The next interesting thing is how we index our documents. As you can see, in addition to the document itself, we also provide an additional request parameter, _ttl_. It stands for time to live and specifies when the documents in the update request should be deleted. The value of the parameter can use the whole date math syntax that Solr supports (described at https://cwiki.apache.org/confluence/display/solr/Working+with+Dates). We set it to %2B5MINUTES (which is +5MINUTES when URL-decoded). This means that Solr will take the current date and time, add 5 minutes to it, and write the result into the field defined by the expirationFieldName property of the update processor.
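The encoding of the parameter value and the resulting expiration date can be sketched in a few lines of Python. The plus sign has to be percent-encoded because an unencoded + in a query string is commonly decoded as a space. The helper names and the indexing timestamp below are assumptions made for the example; the actual date arithmetic is done by Solr itself.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import quote


def ttl_query_value(ttl: str) -> str:
    """Percent-encode a date math expression for use in a URL query string."""
    return quote(ttl, safe="")


def expected_expiration(indexed_at: datetime, minutes: int) -> str:
    """Approximate the value Solr stores for a TTL of +<minutes>MINUTES."""
    expires = indexed_at.astimezone(timezone.utc) + timedelta(minutes=minutes)
    return expires.strftime("%Y-%m-%dT%H:%M:%S") + f".{expires.microsecond // 1000:03d}Z"


print(ttl_query_value("+5MINUTES"))  # %2B5MINUTES

# Assumed indexing time, chosen so the result matches the first example document
indexed_at = datetime(2014, 11, 3, 12, 4, 15, 2000, tzinfo=timezone.utc)
print(expected_expiration(indexed_at, 5))  # 2014-11-03T12:09:15.002Z
```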
As we can see, after waiting for some time, Solr deleted the expired document and automatically refreshed the collection using a soft commit operation with the openSearcher=true property.
There is one more thing that I would like to mention when it comes to automatic document expiration.
If we want, we can also change the names used to provide the time to live information. Besides the request parameter, the processor also looks for a time to live field in each indexed document, and the ttlFieldName property sets the name of that field (the name of the request parameter itself is controlled by the separate ttlParamName property). We add the property to the solr.processor.DocExpirationUpdateProcessorFactory configuration in our solrconfig.xml file. For example, if we would like to use the expireAfter field instead of _ttl_, we should configure our update request processor chain as follows:
<updateRequestProcessorChain default="true">
  <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
    <int name="autoDeletePeriodSeconds">10</int>
    <str name="ttlFieldName">expireAfter</str>
    <str name="expirationFieldName">expiration_time</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
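As a rough illustration of how a client might use such a configuration, the following Python sketch builds an update message in which the time to live travels inside the document itself. It assumes that ttlFieldName names a field the processor looks up in each document, as the Solr reference documentation describes; the helper name build_add_xml and the document with identifier 3 are made up for the example. Depending on the schema, the expireAfter field may also need to be declared or dropped by a later processor, which is not shown here.

```python
import xml.etree.ElementTree as ET


def build_add_xml(fields: dict, ttl: str, ttl_field: str = "expireAfter") -> str:
    """Build an <add> message whose document carries its own TTL field.

    Hypothetical helper; 'expireAfter' mirrors the ttlFieldName value above.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # Regular fields first, then the per-document time to live
    for name, value in {**fields, ttl_field: ttl}.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


body = build_add_xml(
    {"id": "3", "url": "http://solr.pl/en/", "short": "http://solr.pl/short/3", "user": "gr0"},
    ttl="+5MINUTES",
)
print(body)
```

The resulting string could be sent with the same curl command as before, this time without any time to live parameter in the URL.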