Time for action – finding copies of the same files with deduplication

What if we added the same file more than once? This is possible, particularly when indexing a large number of files. One interesting case is trying to find if a document has already been added. Deduplication is the name we use for the process by which redundant information or duplicated files can be found, so that we can delete the copies and maintain an archive with only a single instance of every document. This can be very important, particularly in the context of a document management system, a shared documentation repository, or similar business cases.

We can easily create a unique key based on the content of the file. This new field can be used to find a document that has been added more than once:

  1. We need to add the new field (let's call it uid) in our schema.xml file:
    <field name='uid' type='string' indexed='true' stored='true' multiValued='false' />
    <uniqueKey>uid</uniqueKey>
  2. We can define a specific updateRequestProcessorChain for computing this particular uid unique identifier, modifying our solrconfig.xml file:
    <requestHandler name='/update/extract' class='solr.extraction.ExtractingRequestHandler'>
      <lst name='defaults'>
        <str name='captureAttr'>true</str>
        <str name='lowernames'>true</str>
        <str name='overwrite'>true</str>
        <str name='literalsOverride'>true</str>
        <str name='fmap.a'>link</str>
        <str name='update.chain'>deduplication</str>
      </lst>
    </requestHandler>
    
    <updateRequestProcessorChain name='deduplication'>
      <processor class='org.apache.solr.update.processor.SignatureUpdateProcessorFactory'>
        <bool name='overwriteDupes'>false</bool>
        <str name='signatureField'>uid</str>
        <bool name='enabled'>true</bool>
        <str name='fields'>content</str>
        <str name='minTokenLen'>10</str>
        <str name='quantRate'>.2</str>
        <str name='signatureClass'>org.apache.solr.update.processor.TextProfileSignature</str>
      </processor>
      <processor class='solr.LogUpdateProcessorFactory' />
      <processor class='solr.RunUpdateProcessorFactory' />
    </updateRequestProcessorChain>

As you can see, we introduce a new deduplication processor chain designed to manage this process, and reference it by name in the defaults of the /update/extract request handler, so that it is executed every time we index metadata extracted from a source.
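The signatureClass configured here, TextProfileSignature, produces a fuzzy hash so that near-identical texts can collapse to the same signature. The following Python sketch only loosely mirrors that idea and is not Solr's actual implementation (the exact quantization rules differ): tokens shorter than minTokenLen are dropped, the remaining token frequencies are quantized by a factor derived from quantRate, and the resulting profile is hashed.

```python
import hashlib
import re

def text_profile_signature(text, min_token_len=10, quant_rate=0.2):
    """Illustrative sketch of a fuzzy content signature (NOT Solr's exact
    TextProfileSignature): keep only tokens of at least min_token_len
    characters, quantize their frequencies so that near-duplicate texts
    tend to collapse to the same profile, and hash the sorted profile."""
    tokens = re.findall(r"\w+", text.lower())
    freqs = {}
    for token in tokens:
        if len(token) >= min_token_len:
            freqs[token] = freqs.get(token, 0) + 1
    if not freqs:
        return hashlib.md5(b"").hexdigest()
    # Quantization step: frequencies are rounded down to multiples of quant,
    # so small differences in token counts can yield the same signature.
    quant = max(1, round(max(freqs.values()) * quant_rate))
    profile = sorted((t, f // quant * quant) for t, f in freqs.items())
    return hashlib.md5(str(profile).encode("utf-8")).hexdigest()
```

Indexing the same text twice yields the same signature, while genuinely different content yields a different one, which is exactly the property the deduplication chain relies on.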

What just happened?

The new processor chain needs to be activated. We do this by adding its invocation to the defaults section of the /update/extract request handler:

<str name='update.chain'>deduplication</str>

Once a uid field is introduced, we can execute the same command multiple times:

>> curl -X POST 'http://localhost:8983/solr/pdfs_4/update/extract?extractFormat=text&literal.uid=11111&commit=true' -F 'file=@/path/to/document.pdf'

By changing the uid value provided, we can index the same file a number of times. You can also try to copy the file and index the copied one as well; after some additions, a simple default query would otherwise return more than one result. If we execute a simple query as follows:

>> curl -X GET 'http://localhost:8983/solr/pdfs_4/select?q=*:*&start=0&rows=0&wt=json&indent=true'

We can see from the results that no duplicated copies of the metadata indexed from the same file exist. At the end of the extraction phase, our deduplication processor starts and creates a 'footprint' of the content field in the form of a hash key, saved into the uid field, which acts as the signatureField. With the parameter fields=content (<str name='fields'>content</str>), we define the source used for calculating the hash, so every uid is derived from the content of a single document. Indexing the same document several times produces the same uid, which we have also defined as the uniqueKey; the final result will therefore contain only a single copy of the document, regardless of whether it was inserted first or last.
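The collapse described above can be sketched in a few lines of Python. This is an illustration of the mechanism, not Solr code: the signature computed from the content field is used as the unique key, so re-adding the same content overwrites the existing document (here a plain MD5 hash stands in for the configured signature class).

```python
import hashlib

def signature(content: str) -> str:
    # Stand-in for Solr's signature step: hash the content field.
    return hashlib.md5(content.encode("utf-8")).hexdigest()

def index(store: dict, doc: dict) -> None:
    # The uid doubles as the uniqueKey: re-adding a document with the
    # same content overwrites the previous copy instead of duplicating it.
    uid = signature(doc["content"])
    store[uid] = {**doc, "uid": uid}

store = {}
index(store, {"title": "report-v1.pdf", "content": "quarterly results ..."})
index(store, {"title": "report-copy.pdf", "content": "quarterly results ..."})
len(store)  # 1: the second add replaced the first, as with uniqueKey in Solr
```

As with Solr's uniqueKey semantics, the last insertion wins: only one document per signature survives, whichever copy was added most recently.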

Please note how the lst tag is used to define a collection of parameter values, each of them defined in a tag that reflects its type (str for string, bool for boolean, and so on).

If you want to have some fun with the data, it's possible to use a specific kind of query to count how many documents indexed from the same file exist. We can run this query to verify that all is going well (we expect a single document, with its own uid, for every file). We can also run the same query after disabling the deduplication chain and indexing the same resources many times, to see the counters grow:

>> curl -X GET 'http://localhost:8983/solr/pdfs_4/select?q=*:*&start=0&rows=0&wt=json&indent=true&facet=true&facet.field=stream_source_info'
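If you want to inspect these counts programmatically, note that with wt=json Solr returns each facet field as a flat alternating list of values and counts. The following Python helper reshapes that list and reports the sources indexed more than once; the sample response is fabricated for illustration.

```python
def duplicate_sources(facet_response: dict) -> dict:
    """Turn Solr's flat facet list [value, count, value, count, ...]
    into a dict of sources that were indexed more than once."""
    flat = facet_response["facet_counts"]["facet_fields"]["stream_source_info"]
    # Pair up alternating values and counts from the flat list.
    pairs = dict(zip(flat[::2], flat[1::2]))
    return {src: n for src, n in pairs.items() if n > 1}

# Fabricated sample mimicking the shape of a wt=json faceted response:
sample = {"facet_counts": {"facet_fields": {
    "stream_source_info": ["a.pdf", 3, "b.pdf", 1]}}}
duplicate_sources(sample)  # {'a.pdf': 3}
```

With deduplication enabled we expect this helper to return an empty dict, since every source should appear exactly once.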

This is actually a faceted query on the field stream_source_info. We will discuss faceted queries in detail and how to use them in Chapter 6, Using Faceted Search – from Searching to Finding.
