Optimizing search schema

When Solr is used for a specific requirement (for example, log search for an enterprise application), it works against a specific schema that is defined in schema.xml and copied over to the nodes. The schema determines which attributes are indexed and how, and thus plays a vital role in the performance of your Solr instance.

Specifying default search field

The schema.xml file of the Solr configuration lets you specify the <defaultSearchField> parameter. This parameter controls which field is searched when a query does not name a field explicitly. The parameter is optional; if it is not specified, queries that do not provide a field name are run against all the available fields in the schema. This not only consumes more CPU time but also slows down overall search performance.
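As a minimal sketch, the parameter might appear in schema.xml as follows. The field name text is an assumption chosen for illustration; it must match a field that your own schema defines. Newer Solr releases deprecate this element in favor of the df request parameter, so treat the fragment as version-dependent.

```xml
<!-- Fragment of schema.xml; field and field type definitions are omitted. -->
<!-- "text" is a hypothetical field name used only for illustration. -->
<!-- A query without an explicit field name, such as q=timeout, -->
<!-- is then searched against this field. -->
<defaultSearchField>text</defaultSearchField>
```

With this in place, a query such as q=timeout behaves like q=text:timeout, while a query such as q=host:web01 still searches the field it names.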

Configuring search schema fields

In a custom schema, the number of indexed fields has a direct impact on the index size and on the memory needed to build the index and its segments. You control which fields are indexed by setting indexed=true or indexed=false on each schema attribute. Avoid indexing fields that you do not intend to search on.

Similarly, you can set stored=false for fields that do not need to be returned in search results. You can still query these fields, but you cannot retrieve their original values. For larger fields, this yields significant savings in disk space and lookup speed.

Larger fields are harder to fit into memory while indexing, so you have to ensure that all the fields of a document fit into memory. The maxFieldLength setting in the Solr configuration limits how much of a field is indexed, which in turn helps you control the sizing of the fields.
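The indexed and stored attributes discussed above can be sketched for a log search schema as follows. All field names and the field types string and text_general are assumptions for illustration; substitute the names and types that your schema actually defines.

```xml
<fields>
  <!-- Unique key: searchable and returned in results. -->
  <field name="id" type="string" indexed="true" stored="true" required="true"/>

  <!-- Large log body: searchable, but the original text is not kept, -->
  <!-- which saves disk space. It cannot be returned in results. -->
  <field name="message" type="text_general" indexed="true" stored="false"/>

  <!-- Display-only value: returned in results, never searched, -->
  <!-- so it adds nothing to the inverted index. -->
  <field name="source_path" type="string" indexed="false" stored="true"/>
</fields>
```

The trade-off is per field: message can be matched by queries but only source_path and id come back in the response.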

Stop words

We have already covered stop words in Chapter 2, Understanding Apache Solr; Appendix, Use Cases for Big Data Search, provides more details about them. They play a significant role in optimizing your Solr instance for performance. While building the inverted index, Solr skips stop words because they add no value to your search. Stop words can be listed in any file, which is then referenced from the schema.xml file of the Solr configuration.
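One way to reference a stop word file from schema.xml is through a stop filter in the analyzer of a field type. The following is a sketch under assumptions: the field type name text_logs and the file name stopwords.txt are illustrative, and the file is expected to sit in the same configuration directory as schema.xml, with one stop word per line.

```xml
<!-- "text_logs" is a hypothetical field type name. -->
<fieldType name="text_logs" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Drops every token listed in stopwords.txt before it reaches the index. -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because a single analyzer element applies at both index and query time, the same stop words are also removed from incoming queries, which keeps the two sides consistent.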

A large set of stop words can significantly reduce the index size. You can use some of the common stop words of the English language by accessing the following example links:

Stemming

Stemming is the process of reducing a derived word to its root form. Enabling word stemming in Apache Solr shrinks the number of distinct terms in the index, which improves query performance, and it also improves the accuracy of the results because different forms of a word match each other. For example, words such as walking, walked, and walks can be stemmed to walk. Appendix, Use Cases for Big Data Search, provides a detailed explanation of protwords.txt, which lists the words that are protected from stemming, along with some examples. Choose the stemming algorithm that suits the requirements of your instance. Here are some of the available algorithms for stemming:

Algorithm	Description
Porter	This rule-based algorithm transforms any form of an English word into its stem. For example, talking and talked are reduced to talk.
KStem	Similar to Porter, but less aggressive.
Snowball	A string processing language for writing stemming algorithms, with stemmers available for many languages. Using this, you can create new stemming algorithms.
Hunspell	Open Office dictionary-based algorithm. Works with any language for which a dictionary exists; the quality of the results depends on the quality of the dictionary.
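A stemmer is enabled by adding its filter to the analyzer of a field type. The sketch below uses the Porter stemmer together with protwords.txt; the field type name text_stemmed is an assumption for illustration, and the filter order shown is one reasonable arrangement rather than the only valid one.

```xml
<!-- "text_stemmed" is a hypothetical field type name. -->
<fieldType name="text_stemmed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Marks the words listed in protwords.txt so the stemmer leaves them unchanged. -->
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <!-- Rule-based English stemmer: walking, walked, and walks are indexed as walk. -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

To try a less aggressive stemmer, the last filter could be swapped for solr.KStemFilterFactory; the keyword marker filter has to stay ahead of whichever stemmer is used for the protection to take effect.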

Overall, the field attributes required for common use cases are shown in the following table. A TRUE value indicates that the attribute must be set when defining the field, and a FALSE value indicates that it cannot be used for the given use case. For example, a multi-valued attribute cannot be used in unique keys. An empty value indicates that the attribute can be either true or false. We have already explained the terms multi-valued, omit-norms, term vector, and so on in Chapter 2, Understanding Apache Solr.

Use Case	Indexed	Stored	Multi-valued	Omit Norms	Term Vectors	Term Positions	Term Offsets
Search within field	TRUE
Retrieve contents		TRUE
Use as unique key	TRUE		FALSE
Sort on field	TRUE		FALSE	TRUE
Use field boosts				FALSE
Document boosts affect searches within field				FALSE
Highlighting	TRUE	TRUE
Faceting	TRUE
Add multiple values, maintaining order			TRUE
Field length affects doc score				FALSE
MoreLikeThis		TRUE			TRUE
Term frequency					TRUE
Document frequency					TRUE
tf*idf					TRUE
Term positions					TRUE	TRUE	TRUE
Term offsets					TRUE	TRUE	TRUE
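A few rows of the table can be read as field definitions, as in the following sketch. The field names and the types string and text_general are assumptions for illustration only; the attribute values follow the corresponding table rows.

```xml
<fields>
  <!-- "Sort on field": indexed, single-valued, norms omitted. -->
  <field name="level_sort" type="string" indexed="true" stored="false"
         multiValued="false" omitNorms="true"/>

  <!-- "Highlighting": the field must be both indexed and stored. -->
  <field name="summary" type="text_general" indexed="true" stored="true"/>

  <!-- "Add multiple values, maintaining order": multi-valued. -->
  <field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>

  <!-- "Term positions" and "Term offsets": term vectors with positions and offsets. -->
  <field name="body" type="text_general" indexed="true" stored="true"
         termVectors="true" termPositions="true" termOffsets="true"/>
</fields>
```

Term vectors enlarge the index, so it is usually worth enabling them only on the fields whose use case calls for them.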
