Removing duplicate documents (deduplication)

Solr can prevent duplicate or near-duplicate documents from being indexed by storing a signature (fingerprint) in a dedicated field. It provides this deduplication technique natively through the Signature class, which can also be extended to add new hash and signature implementations.
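To make the idea concrete before we touch any configuration, here is a minimal Python sketch of signature-based deduplication. It is an illustration only and not Solr's implementation: the function names (exact_signature, fuzzy_signature, index_with_dedupe) are hypothetical, MD5 is used simply because it is convenient, and the "fuzzy" variant is a simple normalization that stands in for the general idea of a near-duplicate signature.

```python
import hashlib
import re


def exact_signature(doc, fields):
    """MD5 over the raw values of the chosen fields; any change alters the hash."""
    joined = "\x00".join(str(doc.get(name, "")) for name in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def fuzzy_signature(doc, fields):
    """MD5 over lowercased alphanumeric tokens, so case and punctuation are ignored."""
    text = " ".join(str(doc.get(name, "")) for name in fields)
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return hashlib.md5(" ".join(tokens).encode("utf-8")).hexdigest()


def index_with_dedupe(docs, fields, signature=exact_signature):
    """Keep one document per signature; a later duplicate overwrites the earlier one."""
    index = {}
    for doc in docs:
        sig = signature(doc, fields)
        index[sig] = {**doc, "signature": sig}
    return list(index.values())


# Two documents that differ only in case and punctuation (made-up sample data).
docs = [
    {"songName": "Cool for the Summer", "artistName": "Demi Lovato"},
    {"songName": "cool for the summer!", "artistName": "Demi Lovato"},
]
fields = ["songName", "artistName"]

print(len(index_with_dedupe(docs, fields)))                   # 2: exact hashes differ
print(len(index_with_dedupe(docs, fields, fuzzy_signature)))  # 1: treated as duplicates
```

Only the listed fields feed the hash, so two documents that differ in an unlisted field still collide. The fields property plays the same role in the Solr configuration.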

Let's see how we can implement deduplication in Solr. We'll start from the musicCatalog core that we used in the previous chapter and modify it:

  1. Copy the musicCatalog core and create a new core called musicCatalog-dedupe from it. After we have created the new core, we'll change schema.xml to add a signature field that will contain the document signature/fingerprint:
    <!-- Field to store the fingerprint/signature -->
    <field name="signature" type="string" indexed="true" stored="true" required="true" multiValued="false" />
  2. After adding the field, we'll add a new updateRequestProcessorChain element to the solrconfig.xml configuration file, which will detect and overwrite duplicate documents:
    <updateRequestProcessorChain name="dedupe">
      <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
        <bool name="enabled">true</bool>
        <bool name="overwriteDupes">true</bool>
        <str name="signatureField">signature</str>
        <str name="fields">songId,songName,artistName,albumArtist,albumName,songDuration,composer,rating,year,genre</str>
        <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

    We've used the SignatureUpdateProcessorFactory class that ships with Solr to detect and overwrite duplicate documents. The following properties can be set on SignatureUpdateProcessorFactory:

    • signatureClass: This is an implementation of the org.apache.solr.update.processor.Signature abstract class, for example, org.apache.solr.update.processor.Lookup3Signature or solr.processor.Lookup3Signature.
    • fields: These are the names of the fields that are used to generate the hash. By default, all fields of the document will be used to generate the hash.
    • signatureField: This is the name of the field that will hold the hash. This should be defined in schema.xml.
    • enabled: This is used to enable/disable deduplication.
    • overwriteDupes: If this is set to true, the matching document will be overwritten.
  3. After we add the update request processor chain, we'll use the following XML document (duplicateAlbumData.xml), which contains two identical documents, and send it to Solr for indexing:
    <add>
      <doc>
        <field name="songId">100000001</field>
        <field name="songName">Cool for the Summer</field>
        <field name="artistName">Dem Lovato</field>
        <field name="albumArtist">Demi Lovato</field>
        <field name="albumName">Cool for the Summer</field>
        <field name="songDuration">3.24</field>
        <field name="composer"/>
        <field name="rating">3.5</field>
        <field name="year">2015</field>
        <field name="genre">Pop, Music, Rock, Dance, Electronic</field>
      </doc>
      <doc>
        <field name="songId">100000001</field>
        <field name="songName">Cool for the Summer</field>
        <field name="artistName">Dem Lovato</field>
        <field name="albumArtist">Demi Lovato</field>
        <field name="albumName">Cool for the Summer</field>
        <field name="songDuration">3.24</field>
        <field name="composer"/>
        <field name="rating">3.5</field>
        <field name="year">2015</field>
        <field name="genre">Pop, Music, Rock, Dance, Electronic</field>
      </doc>
    </add>

We can use the following command to send the XML file that contains these two duplicate documents for indexing:

$ curl 'http://localhost:8983/solr/musicCatalog-dedupe/update?update.chain=dedupe&commit=true' -H "Content-Type: text/xml" --data-binary @duplicateAlbumData.xml

After executing this command, we'll get the following response from Solr:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">383</int></lst>
</response>
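If we would rather send the file from a script than from curl, the same request can be sketched with Python's standard library. This is a minimal sketch and not part of the original recipe: build_update_request and post_documents are hypothetical helper names, and the host, core name, and chain name simply mirror the curl command shown earlier.

```python
import urllib.parse
import urllib.request


def build_update_request(xml_bytes,
                         base_url="http://localhost:8983/solr",
                         core="musicCatalog-dedupe",
                         chain="dedupe"):
    """Build a POST to /update that selects the dedupe chain and commits."""
    query = urllib.parse.urlencode({"update.chain": chain, "commit": "true"})
    url = f"{base_url}/{core}/update?{query}"
    # Supplying data makes urllib issue a POST instead of a GET.
    return urllib.request.Request(
        url, data=xml_bytes, headers={"Content-Type": "text/xml"}
    )


def post_documents(xml_path):
    """Send the XML file to Solr and return the raw response body as text."""
    with open(xml_path, "rb") as handle:
        request = build_update_request(handle.read())
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Usage (requires a running Solr instance with the musicCatalog-dedupe core):
# print(post_documents("duplicateAlbumData.xml"))
```

The update.chain parameter is what routes the documents through the dedupe chain; without it, the request would go through the core's default update chain instead.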

After running this command, we can use the q=*:* query to view the data that was indexed into Solr.

Open the Solr query browser, or use the following URL: http://localhost:8983/solr/musicCatalog-dedupe/select?q=*%3A*&wt=json&indent=true.

We'll receive the following response from Solr:

{
  "responseHeader": {
    "status": 0,
    "QTime": 3
  },
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [
      {
        "songId": "100000001",
        "songName": "Cool for the Summer",
        "artistName": "Dem Lovato",
        "albumArtist": "Demi Lovato",
        "albumName": "Cool for the Summer",
        "songDuration": 3.24,
        "composer": "",
        "rating": 3.5,
        "year": 2015,
        "genre": "Pop, Music, Rock, Dance, Electronic",
        "signature": "c105d2c9b431932e0e662b513f328aaa"
      }
    ]
  }
}

As we can see from this response, numFound is 1: the second document overwrote the first because both produced the same signature, so only one document is present in the index.
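The same check can be automated. The following sketch, again using only Python's standard library, builds the select URL and pulls numFound and the stored signatures out of a JSON response. The helper names build_select_url, summarize_results, and fetch_summary are hypothetical, and the code assumes the response has the responseHeader/response layout returned with wt=json.

```python
import json
import urllib.parse
import urllib.request


def build_select_url(base_url="http://localhost:8983/solr",
                     core="musicCatalog-dedupe"):
    """Build a match-all query URL that asks for an indented JSON response."""
    query = urllib.parse.urlencode({"q": "*:*", "wt": "json", "indent": "true"})
    return f"{base_url}/{core}/select?{query}"


def summarize_results(response_text):
    """Return (numFound, signatures of the returned docs) from a JSON response."""
    body = json.loads(response_text)["response"]
    signatures = [doc.get("signature") for doc in body["docs"]]
    return body["numFound"], signatures


def fetch_summary():
    """Query a running Solr instance and summarize what was indexed."""
    with urllib.request.urlopen(build_select_url()) as response:
        return summarize_results(response.read().decode("utf-8"))


# Usage (requires a running Solr instance with the musicCatalog-dedupe core):
# found, signatures = fetch_summary()
# print(found, signatures)
```

If deduplication worked for the two documents we posted, the count should be 1 and the list should hold a single signature.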
