Solr lets us prevent duplicate or near-duplicate documents from being indexed by using a signature (fingerprint) field. It provides this kind of deduplication natively through the Signature class, which can also be extended to implement new hash and signature algorithms.
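To make the idea of a signature field concrete, here is a minimal Python sketch of an exact fingerprint computed over selected fields. It is only an illustration: the function name `exact_signature` and the use of MD5 are assumptions of this sketch, not Solr's own hashing code.

```python
import hashlib

def exact_signature(doc, fields=None):
    """Return an MD5 hex digest over the selected field values of doc."""
    # When no fields are given, hash every field (sorted for a stable order).
    names = fields if fields is not None else sorted(doc)
    digest = hashlib.md5()
    for name in names:
        value = doc.get(name, "")
        # Include the field name and a separator so values cannot run together.
        digest.update(f"{name}={value}\x00".encode("utf-8"))
    return digest.hexdigest()

a = {"songId": "100000001", "songName": "Cool for the Summer", "year": 2015}
b = dict(a)             # exact copy of a
c = dict(a, year=2016)  # differs from a in one field

print(exact_signature(a) == exact_signature(b))  # True
print(exact_signature(a) == exact_signature(c))  # False
# Hashing only songName treats a and c as duplicates.
print(exact_signature(a, ["songName"]) == exact_signature(c, ["songName"]))  # True
```

The last line shows why the choice of hashed fields matters: the fewer fields that feed the fingerprint, the more documents count as duplicates.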
Let's see how we can implement deduplication in Solr. We'll use the musicCatalog core from the previous chapter and modify it as follows:
Copy the musicCatalog core and create a new core called musicCatalog-dedupe from it. After we have created the new core, we'll change schema.xml to add a signature field that will contain the document signature/fingerprint:
<!-- Field to store the fingerprint/signature -->
<field name="signature" type="string" indexed="true" stored="true" required="true" multiValued="false" />
Next, add an UpdateRequestProcessor chain element to the solrconfig.xml configuration file, which will detect and overwrite duplicate documents:
<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">true</bool>
    <str name="signatureField">signature</str>
    <str name="fields">songId,songName,artistName,albumArtist,albumName,songDuration,composer,rating,year,genre</str>
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
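The chain above selects TextProfileSignature, which Solr's documentation describes as a fuzzy hash intended for near-duplicate detection. The following Python sketch shows the general idea of hashing a quantized token-frequency profile. It is loosely modeled on that approach and is not Solr's implementation; the function name and the parameters `quant_rate` and `min_token_len` are assumptions of this sketch.

```python
import hashlib
import re
from collections import Counter

def fuzzy_signature(text, quant_rate=0.01, min_token_len=2):
    """Hash a quantized token-frequency profile of text (illustrative only)."""
    # Lowercase the text and keep runs of letters and digits as tokens.
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    max_freq = max(counts.values(), default=0)
    # Quantization step derived from the most frequent token.
    quant = round(max_freq * quant_rate)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1
    profile = []
    for token, count in counts.items():
        quantized = (count // quant) * quant  # round down to a multiple of quant
        if quantized >= quant:                # drop tokens that are too rare
            profile.append((token, quantized))
    # Sort by descending frequency, then by token, for a stable profile.
    profile.sort(key=lambda item: (-item[1], item[0]))
    rendered = "\n".join(f"{token} {count}" for token, count in profile)
    return hashlib.md5(rendered.encode("utf-8")).hexdigest()

# Case and punctuation do not change the profile.
print(fuzzy_signature("Cool for the Summer") == fuzzy_signature("cool for the SUMMER!!"))  # True
# A different token produces a different profile.
print(fuzzy_signature("Cool for the Summer") == fuzzy_signature("Cool for the Winter"))    # False
# Rare tokens are dropped once one token repeats, so small edits are tolerated.
print(fuzzy_signature("the cat sat on the mat the end") ==
      fuzzy_signature("the dog sat on the mat the end"))                                   # True
```

The third comparison also shows how coarse this kind of profile becomes on very short text, which is why a fuzzy signature is usually a better fit for longer fields.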
We've used the SignatureUpdateProcessorFactory class that comes with Solr to detect and overwrite duplicate documents. The following properties can be set on SignatureUpdateProcessorFactory:
signatureClass: This is an implementation of the org.apache.solr.update.processor.Signature abstract class, for example, org.apache.solr.update.processor.Lookup3Signature or solr.processor.Lookup3Signature.
fields: These are the names of the fields that are used to generate the hash. By default, all fields of the document are used.
signatureField: This is the name of the field that will hold the hash. It should be defined in schema.xml.
enabled: This is used to enable or disable deduplication.
overwriteDupes: If this is set to true, the matching document will be overwritten.
We'll create a new file (duplicateAlbumData.xml), which contains two duplicate documents, and will send it to Solr for indexing:
<add>
  <doc>
    <field name="songId">100000001</field>
    <field name="songName">Cool for the Summer</field>
    <field name="artistName">Dem Lovato</field>
    <field name="albumArtist">Demi Lovato</field>
    <field name="albumName">Cool for the Summer</field>
    <field name="songDuration">3.24</field>
    <field name="composer"/>
    <field name="rating">3.5</field>
    <field name="year">2015</field>
    <field name="genre">Pop, Music, Rock, Dance, Electronic</field>
  </doc>
  <doc>
    <field name="songId">100000001</field>
    <field name="songName">Cool for the Summer</field>
    <field name="artistName">Dem Lovato</field>
    <field name="albumArtist">Demi Lovato</field>
    <field name="albumName">Cool for the Summer</field>
    <field name="songDuration">3.24</field>
    <field name="composer"/>
    <field name="rating">3.5</field>
    <field name="year">2015</field>
    <field name="genre">Pop, Music, Rock, Dance, Electronic</field>
  </doc>
</add>
We can use the following command to send the XML file that contains these two duplicate documents for indexing:
$ curl 'http://localhost:8983/solr/musicCatalog-dedupe/update?update.chain=dedupe&commit=true' -H "Content-Type: text/xml" --data-binary @duplicateAlbumData.xml
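The same request can be issued from a script. The following Python sketch builds the update URL with the `update.chain` and `commit` parameters used in the curl command and posts the file with the standard library. The helper names `build_update_url` and `post_xml` are hypothetical, and the actual POST is left commented out because it needs a running Solr instance.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def build_update_url(base_url, core, chain, commit=True):
    """Build the /update URL that routes documents through the given chain."""
    params = urlencode({"update.chain": chain, "commit": str(commit).lower()})
    return f"{base_url}/{core}/update?{params}"

def post_xml(url, path):
    """POST an XML file to Solr and return the response body as text."""
    with open(path, "rb") as handle:
        body = handle.read()
    request = Request(url, data=body,
                      headers={"Content-Type": "text/xml"}, method="POST")
    with urlopen(request) as response:
        return response.read().decode("utf-8")

url = build_update_url("http://localhost:8983/solr", "musicCatalog-dedupe", "dedupe")
print(url)
# post_xml(url, "duplicateAlbumData.xml")  # requires a running Solr instance
```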
After executing this command, we'll get the following response from Solr:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">383</int>
  </lst>
</response>
After we execute this command, we can use the q=*:* query to view the data that was indexed into Solr.
Open the Solr query browser, or use the following URL: http://localhost:8983/solr/musicCatalog-dedupe/select?q=*%3A*&wt=json&indent=true
We'll receive the following response from Solr:
{
  "responseHeader": {
    "status": 0,
    "QTime": 3
  },
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [
      {
        "songId": "100000001",
        "songName": "Cool for the Summer",
        "artistName": "Dem Lovato",
        "albumArtist": "Demi Lovato",
        "albumName": "Cool for the Summer",
        "songDuration": 3.24,
        "composer": "",
        "rating": 3.5,
        "year": 2015,
        "genre": "Pop, Music, Rock, Dance, Electronic",
        "signature": "c105d2c9b431932e0e662b513f328aaa"
      }
    ]
  }
}
As we can see from this response, numFound is 1: the first document was overwritten by its duplicate, so only one document is indexed in Solr.
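The overwrite behavior can be pictured with a small in-memory model. The sketch below is a stand-in for a core, not Solr's real data structures: the class `TinyIndex` is hypothetical, it uses a plain MD5 over the configured fields instead of a Solr signature class, and it ignores any uniqueKey handling that a real core performs.

```python
import hashlib

class TinyIndex:
    """In-memory stand-in for a core, used only to illustrate overwriteDupes."""

    def __init__(self, signature_field, fields, overwrite_dupes=True):
        self.signature_field = signature_field
        self.fields = fields
        self.overwrite_dupes = overwrite_dupes
        self.docs = []

    def _signature(self, doc):
        # Join the configured field values and hash them.
        joined = "\x00".join(str(doc.get(name, "")) for name in self.fields)
        return hashlib.md5(joined.encode("utf-8")).hexdigest()

    def add(self, doc):
        stored = dict(doc)
        signature = self._signature(doc)
        stored[self.signature_field] = signature
        if self.overwrite_dupes:
            # Remove every earlier document that carries the same signature.
            self.docs = [d for d in self.docs
                         if d[self.signature_field] != signature]
        self.docs.append(stored)

song = {"songId": "100000001", "songName": "Cool for the Summer", "year": 2015}

index = TinyIndex("signature", ["songId", "songName", "year"])
index.add(song)
index.add(dict(song))   # duplicate of the first document
print(len(index.docs))  # 1

keep_all = TinyIndex("signature", ["songId", "songName", "year"], overwrite_dupes=False)
keep_all.add(song)
keep_all.add(dict(song))
print(len(keep_all.docs))  # 2
```

With overwriting enabled, the second add replaces the first document, which mirrors the single result returned by the query above.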