Writing the full configuration for our PDF index example

Now let's see how the configuration files look after all the modifications.

If you look at the provided sources, you'll find the full configuration at /SolrStarterBook/solr-app/chp02/pdfs/conf, the index data at /SolrStarterBook/solr-app/chp02/pdfs/data/index, and some scripts useful for testing in the companion directory /test/chp02/pdfs. In case you miss something while reading, you will also find the step-by-step versions of the updated configurations in the same /chp02/ folder.

Writing the solrconfig.xml file

Our configuration will now look as shown in the following piece of code:

<?xml version='1.0' encoding='UTF-8' ?>
<config>
  <luceneMatchVersion>LUCENE_45</luceneMatchVersion>
  <directoryFactory name='DirectoryFactory' class='solr.MMapDirectoryFactory' />
  <codecFactory name='CodecFactory' class='solr.SchemaCodecFactory' />
  <lib dir='${solr.core.instanceDir}/../lib' />
  <requestHandler name='standard' class='solr.StandardRequestHandler' default='true' />
  <requestHandler name='/update' class='solr.UpdateRequestHandler'>
    <lst name='defaults'>
      <str name='update.chain'>deduplication</str>
    </lst>
  </requestHandler>
  <requestHandler name='/update/extract' class='solr.extraction.ExtractingRequestHandler'>
    <lst name='defaults'>
      <str name='captureAttr'>true</str>
      <str name='lowernames'>true</str>
      <str name='overwrite'>true</str>
      <str name='literalsOverride'>true</str>
      <str name='fmap.a'>link</str>
      <str name='update.chain'>deduplication</str>
    </lst>
  </requestHandler>
  <updateRequestProcessorChain name='deduplication'>
    <processor class='org.apache.solr.update.processor.SignatureUpdateProcessorFactory'>
      <bool name='overwriteDupes'>false</bool>
      <str name='signatureField'>uid</str>
      <bool name='enabled'>true</bool>
      <str name='fields'>content</str>
      <str name='minTokenLen'>10</str>
      <str name='quantRate'>.2</str>
      <str name='signatureClass'>solr.update.processor.TextProfileSignature</str>
    </processor>
    <processor class='solr.LogUpdateProcessorFactory' />
    <processor class='solr.RunUpdateProcessorFactory' />
  </updateRequestProcessorChain>
  <requestHandler name='/admin/' class='org.apache.solr.handler.admin.AdminHandlers' />

  <admin><defaultQuery>*:*</defaultQuery></admin>
</config>

In the following chapters we will not transcribe full configuration files, to keep the examples readable. This chapter gives us a first look at a complete file: it is still very simple, but it will grow more complex, and seeing it in full once helps us understand the whole process more clearly.
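
To try the /update/extract handler defined above, a PDF can be posted to it over HTTP. The following is only a minimal sketch, not one of the scripts shipped with the sources: the host and port (localhost:8983), the core name (pdfs), and the helper names build_extract_url and post_pdf are assumptions made for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(base_url, core, params):
    """Build the URL of the /update/extract handler for a given core."""
    query = urlencode(params)
    return f"{base_url.rstrip('/')}/{core}/update/extract?{query}"


def post_pdf(url, pdf_path):
    """POST the raw bytes of a PDF file to the given handler URL.

    Returns the HTTP status code. Needs a running Solr instance.
    """
    with open(pdf_path, "rb") as handle:
        body = handle.read()
    request = Request(url, data=body, headers={"Content-Type": "application/pdf"})
    with urlopen(request) as response:
        return response.status


# Assumed values: a local Solr on the default port and a core named 'pdfs'.
url = build_extract_url("http://localhost:8983/solr", "pdfs", {"commit": "true"})
print(url)

# Hypothetical usage, left commented out because it requires a running server:
# post_pdf(url, "sample.pdf")
```

Because the handler's defaults already name the deduplication update chain, the request does not need to pass a uid explicitly in this configuration.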

Writing the schema.xml file

This file now contains all the types used in the examples, with their analysis (tokenization, case transformation):

<?xml version='1.0' encoding='UTF-8' ?>
<schema name='pdfs' version='1.1'>
<types>
<fieldtype name='string' class='solr.StrField' postingsFormat='SimpleText' />
<fieldtype name='text' class='solr.TextField' postingsFormat='SimpleText'>
  <analyzer>
    <tokenizer class='solr.WhitespaceTokenizerFactory' />
    <filter class='solr.LowerCaseFilterFactory' />
  </analyzer>
</fieldtype>
</types>
<fields>
  <field name='uid' type='string' indexed='true' stored='true' multiValued='false' />
  <dynamicField name='*' type='string' multiValued='true' indexed='true' stored='true' />
  <copyField source='*' dest='fullText' />
  <field name='fullText' type='text' multiValued='true' />
</fields>
<defaultSearchField>fullText</defaultSearchField>
<solrQueryParser defaultOperator='OR' />
<uniqueKey>uid</uniqueKey>
</schema>
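
The analysis chain of the text type above (whitespace tokenization followed by lowercasing) can be approximated in a few lines of Python. This is only a rough illustration of the idea, not Solr's actual implementation, and the function name analyze is hypothetical.

```python
def analyze(text):
    """Approximate the 'text' field type: split on whitespace, then lowercase."""
    return [token.lower() for token in text.split()]


# Punctuation is kept inside tokens, because only whitespace separates them.
print(analyze("Apache Solr  PDF-Index"))
# ['apache', 'solr', 'pdf-index']
```

Note that a term such as pdf-index stays a single token, so a search for index alone would not match it with this simple chain.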

Dynamic fields have been introduced to add some flexibility: the catch-all pattern directly indexes every field that Tika exposes as metadata.
