Time for action – writing a simple solrconfig.xml file with an update handler

One of the essential parts of the solrconfig.xml file is the definition of an update handler for posting documents.

  1. We can start with a basic example, in order to add some more complicated components later:
    <config>
      <luceneMatchVersion>LUCENE_45</luceneMatchVersion>
      <directoryFactory name='DirectoryFactory' class='solr.MMapDirectoryFactory' />
    
      <requestHandler name='standard' class='solr.StandardRequestHandler' default='true' />
      <requestHandler name='/update' class='solr.UpdateRequestHandler' />
      <requestHandler name='/admin/' class='org.apache.solr.handler.admin.AdminHandlers' />
      <admin>
        <defaultQuery>*:*</defaultQuery>
      </admin>
    
    </config>
  2. As you can see, this basic configuration includes a standard /update handler, which receives the posted documents and indexes them.

What just happened?

This solrconfig.xml configuration defines only the default handler for accepting queries and exposes the simplest update handler. Using this handler in conjunction with our schemaless approach lets us index basically every new field posted to Solr.

Testing the PDF file core with dummy data and an example query

The best way to test whether a core is working correctly is to post data to it and verify the results of simple queries. We can start by posting some dummy data to our new index:

>> curl -X POST 'http://localhost:8983/solr/pdfs/update?commit=true&wt=json' -H 'Content-Type: application/xml' -d '<add><doc><field name="title">Dummy Test Document</field><field name="text">Hello World</field></doc></add>'
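
The body sent with -d is a Solr XML add message. As a minimal sketch, assuming you would rather build that message programmatically than type it by hand, the following Python snippet produces the same payload with the standard library. The helper name build_add_message is hypothetical and is not part of Solr; the snippet only builds the string and does not contact the server.

```python
import xml.etree.ElementTree as ET


def build_add_message(fields):
    """Build a Solr XML <add> message for one document.

    `fields` maps field names to string values. ElementTree escapes
    characters such as & and < in the values for us.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(add, encoding="unicode")


# The same dummy document used in the curl example
payload = build_add_message({"title": "Dummy Test Document", "text": "Hello World"})
print(payload)
```

The printed string can be saved to a file and posted with curl, which avoids any shell quoting problems around the attribute values.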

Now we can easily see if the data has been correctly indexed:

>> curl -X GET 'http://localhost:8983/solr/pdfs/select?q=*:*&wt=json&indent=true'

Once we have added at least two or three documents, we can repeat this process, changing the example values, and search for the documents containing the term hello. How do we search for it? Remember that the parameter q=*:* matches every term in every field, and so returns all the documents; a basic search for hello can be written as q=fullText:hello or even as q=fullText:"hello", where the double quotation marks are used for querying on an exact sequence of words (we will see this later with a phrase search) but are unnecessary here:

>> curl -X GET 'http://localhost:8983/solr/pdfs/select?q=fullText:hello&wt=json&indent=true'
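
Once a query contains quotation marks or spaces, as a phrase search does, it has to be URL-encoded before it goes into the address. The following Python sketch shows one way to assemble the select URL with the standard library. It assumes the same host, port, and core name (pdfs) used in the commands above; the helper name select_url is hypothetical, and the snippet only builds the URL without sending a request.

```python
from urllib.parse import urlencode

# Assumed to match the core used in the curl examples
SOLR_SELECT = "http://localhost:8983/solr/pdfs/select"


def select_url(query, **extra_params):
    """Return a select URL for `query`, with JSON output and indentation."""
    params = {"q": query, "wt": "json", "indent": "true"}
    params.update(extra_params)
    # urlencode percent-encodes reserved characters and turns spaces into +
    return SOLR_SELECT + "?" + urlencode(params)


print(select_url("fullText:hello"))
print(select_url('fullText:"hello world"'))
```

The second call encodes the colon, the quotation marks, and the space, so the resulting URL can be passed to curl unchanged.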

If we had simply written q=hello, the defaultSearchField would have been used. Note that searching for the term hello will not work at the moment, since we have adopted a string type for the fields, which indexes each textual value as a single term. For this example to work correctly, we need to split the text into separate terms (to find Hello as a term on its own) and ignore the case (so that Hello is recognized as a match for hello). We will add these capabilities in the next section.

Defining a new tokenized field for fulltext

In this section we will introduce a simple analyzer configuration for a fulltext field.

We can now update our schema.xml file, defining a new field type that we will call text. This internally uses two very common elements for the types, a tokenizer and a filter:

<types>
...
<fieldtype name='text' class='solr.TextField'>
  <analyzer>
    <tokenizer class='solr.WhitespaceTokenizerFactory' />
    <filter class='solr.LowerCaseFilterFactory' />
  </analyzer>
</fieldtype>
...
</types>

The WhitespaceTokenizerFactory is one of the most common tokenizers. A tokenizer is a component that splits an entire text fragment (a string) into several terms. This particular tokenizer splits the text on whitespace; for example, the string Hello World will be split into the terms Hello and World.

The LowerCaseFilterFactory is a filter that transforms the case of all the terms to lowercase.
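
To make the effect of the two components concrete, here is a small Python sketch that imitates the chain: split on whitespace, then lowercase each term. It is only an illustration of the idea and is not how Solr implements analysis; all the function names are hypothetical. For comparison, analyze_string mimics a string field, which keeps the whole value as one term.

```python
def whitespace_tokenize(text):
    """Split a text fragment into terms on runs of whitespace."""
    return text.split()


def lowercase_filter(terms):
    """Lowercase every term produced by the tokenizer."""
    return [term.lower() for term in terms]


def analyze_text(value):
    """Imitate the 'text' type: whitespace tokenizer plus lowercase filter."""
    return lowercase_filter(whitespace_tokenize(value))


def analyze_string(value):
    """Imitate a string type: the whole value is indexed as a single term."""
    return [value]


value = "Hello World"
print(analyze_text(value))    # ['hello', 'world']
print(analyze_string(value))  # ['Hello World']

# A search for the term hello only finds an exact indexed term
print("hello" in analyze_text(value))    # True
print("hello" in analyze_string(value))  # False
```

The last two lines show why the earlier query returned nothing with a string field and why it can succeed once the field uses the new type.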

Using these two combined and yet simple components, we are able to perform better searches. The best place to use this new field type in our example is probably the fullText field itself:

<field name='fullText' type='text' multiValued='true' />

Now we should be able to search for the simple term hello:

  1. Stop and restart our Solr instance, to load the updated configuration containing the new type definition.
  2. Index the example document again, in order to have its values analyzed with the new configuration.
  3. Retry the query to obtain our document in the results, as expected.

With this simple addition, we are now able to perform queries that search for a single term.
