What if we added the same file more than once? This is possible, particularly when indexing a large number of files. One interesting case is trying to find if a document has already been added. Deduplication is the name we use for the process by which redundant information or duplicated files can be found, so that we can delete the copies and maintain an archive with only a single instance of every document. This can be very important, particularly in the context of a document management system, a shared documentation repository, or similar business cases.
We can easily create a unique key based on the content of the file, and use this new field to find a document that has been added more than once. We first declare the field (we will call it uid) in our schema.xml file:

<field name='uid' type='string' indexed='true' stored='true' multiValued='false' />
<uniqueKey>uid</uniqueKey>
We then define an updateRequestProcessorChain for computing this particular uid unique identifier, modifying our solrconfig.xml file:

<requestHandler name='/update/extract' class='solr.extraction.ExtractingRequestHandler'>
  <lst name='defaults'>
    <str name='captureAttr'>true</str>
    <str name='lowernames'>true</str>
    <str name='overwrite'>true</str>
    <str name='literalsOverride'>true</str>
    <str name='fmap.a'>link</str>
    <str name='update.chain'>deduplication</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name='deduplication'>
  <processor class='org.apache.solr.update.processor.SignatureUpdateProcessorFactory'>
    <bool name='overwriteDupes'>false</bool>
    <str name='signatureField'>uid</str>
    <bool name='enabled'>true</bool>
    <str name='fields'>content</str>
    <str name='minTokenLen'>10</str>
    <str name='quantRate'>.2</str>
    <str name='signatureClass'>solr.update.processor.TextProfileSignature</str>
  </processor>
  <processor class='solr.LogUpdateProcessorFactory' />
  <processor class='solr.RunUpdateProcessorFactory' />
</updateRequestProcessorChain>
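Conceptually, an update processor chain is just an ordered pipeline of steps applied to each document before it reaches the index: here a signature step, then a logging step, then the actual indexing. The following Python sketch is only a simplified analogy of that flow, not Solr code; the function names are invented, and a plain MD5 hash stands in for Solr's TextProfileSignature:

```python
# Toy analogy of an update processor chain: each processor is a function
# that receives a document, possibly modifies it, and passes it on.
import hashlib

def signature_processor(doc):
    # Compute a hash 'footprint' of the content field and store it in uid,
    # loosely mimicking SignatureUpdateProcessorFactory with signatureField=uid.
    doc["uid"] = hashlib.md5(doc["content"].encode("utf-8")).hexdigest()
    return doc

def log_processor(doc):
    # Loosely mimics LogUpdateProcessorFactory: report what is being indexed.
    print(f"indexing document with uid={doc['uid']}")
    return doc

def run_chain(doc, processors):
    # Processors run strictly in the order they are declared, just as the
    # <processor> elements are listed inside <updateRequestProcessorChain>.
    for processor in processors:
        doc = processor(doc)
    return doc

chain = [signature_processor, log_processor]
doc = run_chain({"content": "some extracted text"}, chain)
```

The order matters: the signature must be computed before the document is logged and stored, which is why the signature processor comes first in the chain.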
As you can see, we introduce a new deduplication processor chain designed to manage this process, and reference it by name in the /update/extract request handler, so that it will be executed every time we try to index metadata extracted from a source.
The new processor chain needs to be activated. We can do this by adding its invocation to the defaults of the /update/extract request handler:

<str name='update.chain'>deduplication</str>
Once the uid field is introduced, we can execute the same command multiple times:
>> curl -X POST 'http://localhost:8983/solr/pdfs_4/update/extract?extractFormat=text&literal.uid=11111&commit=true' -F 'file=@/path/to/your.pdf'
By changing the uid value provided, we can index the same file a number of times. You can also copy the file and index the copy as well; after some additions, a simple default query will return more than one result. If we execute a simple query as follows:
>> curl -X GET 'http://localhost:8983/solr/pdfs_4/select?q=*:*&start=0&rows=0&wt=json&indent=true'
We can see from the results that there are no copies of the metadata indexed from the same file. At the end of the extraction phase, our deduplication processor starts and creates a 'footprint' of the content field in the form of a hash key, saved into the uid field, which acts as the signatureField. With the parameter fields=content (<str name='fields'>content</str>), we defined the source used for calculating the hash. This way, every uid represents the connection to a single document's content. Indexing the same document several times will produce the same uid, which we also defined as the uniqueKey, so the final result will contain only a single copy of the document, regardless of whether it was inserted first or last.
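This overwrite behavior can be sketched with a plain dictionary: because the unique key is derived from the content, re-indexing identical content always lands on the same key. The following is only an illustrative Python sketch (a dict standing in for the index, and MD5 standing in for Solr's TextProfileSignature; the file names are made up):

```python
import hashlib

index = {}  # stands in for the Solr index, keyed by the uniqueKey (uid)

def add_document(content, source):
    # The uid is derived from the content, so identical content always
    # produces the same key and overwrites the previous entry.
    uid = hashlib.md5(content.encode("utf-8")).hexdigest()
    index[uid] = {"content": content, "stream_source_info": source}

add_document("same pdf text", "a.pdf")
add_document("same pdf text", "copy-of-a.pdf")  # duplicate content
add_document("different text", "b.pdf")

print(len(index))  # 2: the duplicate collapsed onto a single entry
```

Note that a fuzzy signature such as TextProfileSignature can also map near-identical texts to the same key, which a plain content hash like MD5 cannot do.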
Please note how the lst tag is used to define a collection of parameter values, each of them defined in a tag that reflects its type (str for string, bool for boolean, and so on).
If you want to have some fun with the data, you can use a specific kind of query to count how many documents indexed from the same file exist. We can run this query to verify that all is going well (we expect a single document, with its own uid, for every file), and we can also run the same query after disabling the deduplication part and indexing the same resources many times, to see the counter grow:
>> curl -X GET 'http://localhost:8983/solr/pdfs_4/select?q=*:*&start=0&rows=0&wt=json&indent=true&facet=true&facet.field=stream_source_info'
This is actually a faceted query on the stream_source_info field. We will discuss faceted queries, and how to use them, in detail in Chapter 6, Using Faceted Search – from Searching to Finding.
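Conceptually, faceting on a field amounts to counting documents grouped by each distinct value of that field. A minimal Python sketch of what facet.field=stream_source_info computes (illustrative only, with made-up document values; Solr does this work inside the index):

```python
from collections import Counter

# A handful of indexed documents (made-up values for illustration).
docs = [
    {"stream_source_info": "a.pdf"},
    {"stream_source_info": "a.pdf"},
    {"stream_source_info": "b.pdf"},
]

# Faceting on a field is a count of documents per distinct field value.
facets = Counter(doc["stream_source_info"] for doc in docs)
print(facets)  # Counter({'a.pdf': 2, 'b.pdf': 1})
```

With deduplication enabled we would expect every source to appear with a count of 1; counts greater than 1, as for a.pdf above, are what you would see after disabling the deduplication chain.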