Optimizing search schema

When Solr is used for a specific requirement (for example, log search for an enterprise application), it holds a specific schema that is defined in schema.xml and copied over to the nodes. The indexes are built from the attributes defined in this schema, so the schema plays a vital role in the performance of your Solr instance.

Specifying default search field

In the schema.xml file of the Solr configuration, you can specify the <defaultSearchField> parameter. This parameter controls which field is searched when a query does not name a field explicitly. It is optional; if it is not specified, queries that do not provide a field name are run against all the available fields in the schema. This not only consumes more CPU time but also slows down overall search performance.
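As a minimal sketch, the element might appear in schema.xml as follows. The field name text is an assumption for illustration and must match a field declared in your own schema. Newer Solr releases deprecate this element in favor of the df request parameter, so check the version you run.

```xml
<!-- schema.xml: queries without an explicit field name search this field. -->
<!-- "text" is a hypothetical field name; use one declared in your schema. -->
<defaultSearchField>text</defaultSearchField>
```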

Configuring search schema fields

In a custom schema, a larger number of indexed fields has a direct impact on the index size and on the amount of memory needed to create your index and segments. You can control which fields are indexed by specifying indexed=true or indexed=false for each schema attribute. Avoid indexing fields that you do not intend to search on.

Similarly, you can set stored=false for fields that are not returned in search results. This will not stop you from querying these fields, but you won't be able to retrieve their original values. For larger fields, this brings significant savings in disk space and in lookup speed.
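The following sketch shows how the indexed and stored attributes might be combined for a log-search schema. The field names are hypothetical, and the text_general and string types are assumed to be defined elsewhere in the schema, as they are in Solr's example configuration.

```xml
<!-- Hypothetical log-search fields; names and types are illustrative. -->

<!-- Searched and returned in results. -->
<field name="message" type="text_general" indexed="true" stored="true"/>

<!-- Searched but never returned: saves stored-field disk space. -->
<field name="stack_trace" type="text_general" indexed="true" stored="false"/>

<!-- Returned but never searched: kept out of the inverted index. -->
<field name="raw_payload" type="string" indexed="false" stored="true"/>
```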

Larger fields are difficult to fit into memory while indexing, so you have to ensure that all the fields of a document fit into memory. Each field can have maxFieldLength in the schema configuration, which helps you control the size of the fields.

Stop words

We have already covered stop words in Chapter 2, Understanding Apache Solr; Appendix, Use Cases for Big Data Search, provides more details about them. They play a significant role in optimizing your Solr instance for performance. While creating the inverted index, Solr skips the stop words because they do not add any value to your search. The stop words can be listed in any file, and that file can be referenced from the schema.xml file of the Solr configuration.
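A stop word file is typically wired in through a filter in the field type's analyzer. The fragment below is a sketch under assumptions: the type name text_general and the file name stopwords.txt follow Solr's example configuration and may differ in your setup.

```xml
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Drops every token listed in stopwords.txt (one word per line). -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the same analyzer applies at index and query time here, stop words are removed from both the index and incoming queries.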

A large set of stop words can significantly reduce the index size. You can use some of the common stop words of the English language by accessing the following example links:

Stemming

Stemming is the process of reducing a derived word to its root form. Enabling word stemming in Apache Solr not only reduces the number of distinct terms in the index but also improves your query performance. Stemming also improves the accuracy of the results, because a query for one form of a word matches its other forms. For example, words such as walking, walked, and walks can be stemmed to walk. Appendix, Use Cases for Big Data Search, provides a detailed explanation of protwords.txt, which lists the words that are protected from stemming, along with some examples. Choose the right stemming algorithm for your instance based on your requirements. Here are some of the available algorithms for stemming:

| Algorithm | Description |
|-----------|-------------|
| Porter | This rule-based algorithm transforms any form of an English word into its stem. For example, talking and talked are reduced to talk. |
| KStem | Similar to Porter, but less aggressive. |
| Snowball | A string-processing language for writing stemmers, with support for many languages. Using it, you can create new stemming algorithms. |
| Hunspell | An OpenOffice dictionary-based algorithm. It works with all languages; the only condition is the quality of the dictionary. |
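As a hedged sketch, a stemmer is enabled by adding its filter to the analyzer chain of a field type. The type name text_stemmed is hypothetical; the filter classes are the ones shipped with Solr, and protwords.txt is assumed to sit in the same configuration directory as schema.xml.

```xml
<fieldType name="text_stemmed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Words listed in protwords.txt are left unstemmed. -->
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <!-- Swap in solr.KStemFilterFactory for less aggressive stemming. -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

The protection filter has to come before the stemmer in the chain; otherwise the words are already stemmed by the time they are checked.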

Overall, the field attributes required for each use case are shown in the following table. A TRUE value indicates that the attribute must be set when defining the field, and a FALSE value indicates that it cannot be used for the given use case. For example, a multi-valued attribute cannot be used in unique keys. An empty value indicates that the attribute can be either true or false. We have already explained the terms multi-valued, omit-norms, term vector, and so on in Chapter 2, Understanding Apache Solr.

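| Use Case | Indexed | Stored | Multi-valued | Omit Norms | Term Vectors | Term Positions | Term Offsets |
|----------|---------|--------|--------------|------------|--------------|----------------|--------------|
| Search within field | TRUE | | | | | | |
| Retrieve contents | | TRUE | | | | | |
| Use as unique key | TRUE | | FALSE | | | | |
| Sort on field | TRUE | | FALSE | TRUE | | | |
| Use field boosts | | | | FALSE | | | |
| Document boosts affect searches within field | | | | FALSE | | | |
| Highlighting | TRUE | TRUE | | | | | |
| Faceting | TRUE | | | | | | |
| Add multiple values, maintaining order | | | TRUE | | | | |
| Field length affects doc score | | | | FALSE | | | |
| MoreLikeThis | | TRUE | | | TRUE | | |
| Term frequency | | | | | TRUE | | |
| Document frequency | | | | | TRUE | | |
| tf*idf | | | | | TRUE | | |
| Term positions | | | | | TRUE | TRUE | TRUE |
| Term offsets | | | | | TRUE | TRUE | TRUE |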
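To make a few of these combinations concrete, the sketch below maps some use cases to field definitions. All field names are hypothetical, and the string, date, and text_general types are assumed to exist in your schema; attributes left empty in the table are set here to arbitrary values.

```xml
<!-- Unique key: indexed and single-valued. -->
<field name="id" type="string" indexed="true" stored="true" multiValued="false"/>
<uniqueKey>id</uniqueKey>

<!-- Sort on field: indexed, single-valued, norms omitted. -->
<field name="timestamp" type="date" indexed="true" stored="false"
       multiValued="false" omitNorms="true"/>

<!-- Highlighting: indexed and stored. -->
<field name="message" type="text_general" indexed="true" stored="true"/>

<!-- MoreLikeThis: stored, with term vectors enabled. -->
<field name="summary" type="text_general" indexed="true" stored="true"
       termVectors="true"/>

<!-- Multiple values, order maintained. -->
<field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>
```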
