Configuring Solr

Apache Solr allows extensive configuration to meet the needs of the consumer. Configuring the instance revolves around the following:

  • Defining a schema
  • Configuring Solr parameters

First, let's understand the Apache Solr structure, and then look at each of these steps in turn.

Understanding the Solr structure

The Apache Solr home folder mainly contains the configuration and index-related data. The following are the major folders in a Solr collection:

Directory | Purpose
--- | ---
conf/ | This folder contains all the configuration files of Apache Solr and is mandatory. Among them, solrconfig.xml and schema.xml are the important configuration files.
data/ | This folder stores the data related to indexes generated by Solr. This is the default location for Solr to store this information. This location can be overridden by modifying conf/solrconfig.xml.
lib/ | This folder is optional. If it exists, Solr will load any JARs found in this folder and use them to resolve any "plugins" referenced in solrconfig.xml (Analyzers, RequestHandlers, and so on). Alternatively, you can use the <lib> syntax in conf/solrconfig.xml to direct Solr to your plugins.

Defining the Solr schema

In an enterprise, the data is generated from all the software systems that participate in day-to-day operations. This data has different formats, and bringing in this data for big-data processing requires a storage system that is flexible enough to accommodate the data with varying data models. Traditional relational databases allow users to define a strict data structure and an SQL-based querying mechanism.

By design, Solr supports loading any data into the search engine through different handlers, making it data-format agnostic. Solr can easily be scaled on top of commodity hardware; hence, it is one of the most efficient NoSQL-style search platforms available today. The data can be stored in Solr indexes and can be queried through Lucene search APIs. Solr does not rely on relational-style joins, because its data is denormalized. The overall schema file (schema.xml) is structured in the following manner:

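<schema name="example" version="1.5">  <!-- name, version, and the values below are illustrative -->
  <types><!-- fieldType definitions --></types>
  <fields><!-- field and dynamicField definitions --></fields>
  <uniqueKey>id</uniqueKey>
  <defaultSearchField>text</defaultSearchField>
  <solrQueryParser defaultOperator="OR"/>
  <copyField source="name" dest="text"/>
</schema>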

Solr fields

Apache Solr's basic unit of information is a document, which is a set of data that describes something. Each document in Solr is composed of fields. Apache Solr allows you to define the structure of your data so that searching can go beyond traditional keyword searching. You let Solr understand the structure of your data (coming from various sources) by defining fields in the schema definition file. These fields, once defined, are available at the time of data import or data upload. The schema is stored in the schema.xml file in the conf/ folder of Apache Solr.

Apache Solr ships with a default schema.xml file, which you have to change to fit your needs.

Tip

If you change schema.xml in a Solr instance that already holds indexed data, you must regenerate the Solr index with the new schema for the change to take effect.

In the schema configuration, you can define field types (for example, string, integer, and date) and map them to their respective Java classes. A field then refers to one of these types by name through its type attribute:

<field name="id" type="integer" indexed="true" stored="true" required="true"/>
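The field above refers to a type named integer, which must itself be declared in the schema. As a minimal sketch (the class and attribute values are assumptions modeled on the stock Solr 4.x example schema, not taken from this text), such a type could be mapped to its Java class as follows:

```xml
<!-- Assumed declaration: maps the type name "integer" to Solr's trie-based integer class -->
<fieldType name="integer" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
```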

Because types are declared separately from fields, you can define custom types, should you wish to, and then define fields whose type attribute points to one of the defined types. A field in Solr has the following major attributes:

Name | Description
--- | ---
default | Sets the default value to use if none is read while importing a document.
indexed | When true, the field is indexed (that is, it can be searched and sorted, and can have facets created).
stored | When true, the field is stored in the index store and is accessible while displaying results.
compressed | When true, the field is zipped (using gzip). This is applicable to text-based fields.
multiValued | When true, the field can contain multiple values in the same import cycle of the document/row.
omitNorms | When true, omits the norms associated with the field (such as length normalization and index-time boosting). Two related attributes exist: omitTermFreqAndPositions, which omits term frequency, positions, and payloads from the postings for the field (this can be a performance boost for fields that don't require this information, and it reduces the storage space required for the index), and omitPositions.
termVectors | When true, stores term vector metadata for the field in each document and returns this metadata when queried.

Solr 4.2 introduced a new feature for fields called DocValues. DocValues are a way of building an index that is more efficient for purposes like sorting and faceting. While Apache Solr normally relies on an inverted index (a term-to-document mapping), DocValues storage uses a column-oriented field structure: a document-to-value mapping built at index time. This column-oriented approach reduces memory usage and improves the overall speed of such operations. DocValues can be enabled on specific fields in Solr in the following fashion:

<field name="test_outcome" type="string" indexed="false" stored="false" docValues="true" />

If the data was indexed before DocValues was enabled, it has to be re-indexed to gain the benefits of DocValues.

Dynamic fields in Solr

In addition to static fields, you can use Solr dynamic fields for flexibility when you do not know the schema upfront. Use the <dynamicField> declaration to create a field rule that tells Solr which datatype to use. In the following sample, any imported field whose name ends with _no (for example, id_no and vehicle_no) will be read as an integer by Solr. In the pattern *_no, the * represents a wildcard.

The following code snippet shows how you can create a dynamic field:

<dynamicField name="*_no" type="integer" indexed="true" stored="true"/>

Tip

Although it is not mandatory, it is recommended that each Solr instance have a unique identifier field for its data. Note that the field specified as the unique key cannot be multivalued.

Copying the fields

You can also index the same data into multiple fields by using the <copyField> directive. This is typically needed when you want to index the same data in more than one way. For example, if you have data for a refrigerator with the company name followed by the model number (WHIRLPOOL-1000LTR, SAMSUNG-980LTR, and others), you can have these indexed separately by applying your own tokenizers to different fields. You might generate indexes for two different fields, namely Company Name and Model Number, and define tokenizers specific to their field types. Here is the sample copyField from schema.xml:

<copyField source="cat" dest="text"/>
<copyField source="name" dest="text"/>
<copyField source="manu" dest="text"/>
<copyField source="features" dest="text"/>
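For the refrigerator example, one possible arrangement is sketched here. The field and type names (product, product_company, text_company) and the regular expression are hypothetical, and the sketch assumes the company name is everything before the first hyphen:

```xml
<!-- Hypothetical type: keeps the whole value as one token, then strips the hyphen and everything after it -->
<fieldType name="text_company" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="-.*$" replacement="" replace="all"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- The raw value, for example WHIRLPOOL-1000LTR -->
<field name="product" type="string" indexed="true" stored="true"/>
<!-- Receives a copy of product and indexes only the company part -->
<field name="product_company" type="text_company" indexed="true" stored="false"/>

<copyField source="product" dest="product_company"/>
```

Because copyField copies the raw source value before any analysis runs, each destination field can apply its own analyzer to the same input. A model-number field could be built the same way with a different pattern.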

Dealing with field types

You can define your own field types in Apache Solr that cater to your requirements for data processing. The field type includes four types of information:

  • Name
  • Implementation class name (a class derived from org.apache.solr.schema.FieldType)
  • If the field type is TextField, a description of the field analysis for the field type
  • Field attributes

The following XML snippet shows a sample field type:

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

The class attribute indicates which Java class the given field type is associated with. The positionIncrementGap attribute determines the position gap that Solr inserts between the values of a multivalued field. For example, if the author field has "John Doe" and "Jack Williams" as values, then with a positionIncrementGap of zero, a phrase search for "Doe Jack" will match, because Solr treats the field as John Doe Jack Williams. To keep the values of a multivalued field separate, you can specify a high positionIncrementGap value, such as the 100 used in the sample. The name attribute indicates the name of the field type; later, when a field is defined, it uses the type attribute to denote the associated field type, as shown in the following code snippet:

<field name="name" type="text_ws" indexed="true" stored="true"/>

Additional metadata configuration

There are other files where metadata can be specified. These files again appear in the conf folder of Apache Solr. These files are given in the following table:

File name | Description
--- | ---
protwords.txt | In this file, you can specify protected words that you do not wish to get stemmed. For example, a stemmer might stem the word catfish to cat or fish; listing catfish here prevents that.
currency.xml | Stores the current mapping of exchange rates between different currencies; this is helpful when your application is accessed by people from different countries.
elevate.xml | With this file, you can influence the search results and get your own results to rank among the top-ranked results. This overrides Lucene's standard ranking scheme, taking into account the elevations from this file.
spellings.txt | In this file, you can provide spelling suggestions to the end user.
synonyms.txt | Using this file, you can specify your own synonyms. For example, cost => money, money => dollars.
stopwords.txt | Stopwords are words that will not be indexed or used by Solr in the applications; this is particularly helpful when you wish to get rid of certain words. For example, in the string "Jamie and Joseph," the word "and" can be marked as a stopword.
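These word-list files take effect only when an analyzer refers to them. A minimal sketch of such an analysis chain follows; the type name text_en_sketch is hypothetical, and the filter order is an assumption modeled on the stock Solr 4.x example schema:

```xml
<!-- Hypothetical field type wiring the word-list files from conf/ into an analysis chain -->
<fieldType name="text_en_sketch" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Drop the words listed in stopwords.txt -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <!-- Apply the mappings from synonyms.txt -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Mark the words in protwords.txt so that the stemmer leaves them alone -->
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

The keyword marker filter has to come before the stemmer in the chain, otherwise the protected words would already have been stemmed.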

Other important elements of the Solr schema

The following table describes the different elements in schema.xml:

Name | Description | Example
--- | --- | ---
Unique key | The uniqueKey element specifies which field is the unique identifier for documents. A uniqueKey is needed, for example, if you ever update a document in the index. | <uniqueKey>id</uniqueKey>
Default search field | If you are using the Lucene query parser, queries that don't specify a field name will use the defaultSearchField. This element has been deprecated since Apache Solr 3.6 in favor of the df request parameter. | <defaultSearchField></defaultSearchField>
Similarity | Similarity is a Lucene class responsible for scoring the matched results. Solr allows you to override the default similarity behavior through the <similarity> declaration. Similarity can be configured at the global level; Solr 4.0 and later also allow it to be configured per field type. | <similarity class="solr.DFRSimilarityFactory"><str name="basicModel">P</str><str name="afterEffect">L</str><str name="normalization">H2</str><float name="c">7</float></similarity>
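As a sketch of similarity configured below the global level, the following fragment attaches a similarity to a single field type. The type name text_bm25 and the parameter values are hypothetical; the sketch assumes Solr 4.x, where the global similarity must be solr.SchemaSimilarityFactory for such per-type settings to be honored:

```xml
<!-- Global similarity that delegates to the similarity declared on each field type -->
<similarity class="solr.SchemaSimilarityFactory"/>

<!-- Hypothetical field type that scores with BM25 instead of the default -->
<fieldType name="text_bm25" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <similarity class="solr.BM25SimilarityFactory">
    <float name="k1">1.2</float>
    <float name="b">0.75</float>
  </similarity>
</fieldType>
```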

Configuration files of Apache Solr

Apache Solr storage mainly holds metadata and the actual index information. It is typically a set of files stored locally, at a location set in the Apache Solr configuration. The default Solr installation package comes with a Jetty server, and the Solr configuration files can be found in the solr.home/conf folder of the Solr install. There are three major configuration files in Solr:

File name | Description
--- | ---
solrconfig.xml | This is the main configuration file of your Solr install. Using this, you can control everything possible, from caching and custom handlers to codecs and commit options.
schema.xml | This file is responsible for defining a Solr schema for your application. For example, a Solr implementation for log management would have a schema with log-related attributes, that is, log levels, severity, message type, container name, application name, and so on.
solr.xml | Using solr.xml, you can configure Solr cores (single or multiple) for your setup. It also provides additional parameters such as the ZooKeeper timeout and the transient cache size.

The Apache Solr index (built on the underlying Lucene) is a specially designed data structure, stored in the file system as a set of index files. The index uses a specific format designed to maximize query performance.

Once the schema is configured, the immediate next step is to configure the instance itself to work with your enterprise. There are two major configurations that comprise the Solr configuration, namely solrconfig.xml and solr.xml. Let's look at them one by one.

Working with solr.xml and Solr core

The solr.xml configuration resides in the $SOLR_HOME folder and mainly maintains the configuration for logging, cloud setup, and Solr cores. The Apache Solr 4.x code line uses solr.xml to identify the cores defined by users. In Solr 5.x (planned), the current solr.xml structure (which contains the <core> element and so on) will not be supported, and an alternative structure will be used.
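For orientation, a minimal sketch of the Solr 4.x style solr.xml described here might look as follows. The core name collection1 and the attribute values are assumptions modeled on the stock example, not taken from this text:

```xml
<!-- Solr 4.x style solr.xml: each <core> element registers one core -->
<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="collection1">
    <!-- instanceDir is resolved relative to the Solr home folder -->
    <core name="collection1" instanceDir="collection1"/>
  </cores>
</solr>
```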

Instance configuration with solrconfig.xml

The solrconfig.xml file primarily provides you access to request handlers, listeners, and request dispatchers. Let's look at the solrconfig.xml file and understand all the important declarations you'd be using frequently:

Directive | Description
--- | ---
luceneMatchVersion | Specifies which version of Lucene/Solr this configuration file targets. When upgrading your Solr instances, you need to modify this value.
lib | If you create any plugins for Solr, you need to put a library reference here so that it gets picked up. The libraries are loaded in the order in which they are configured. The paths are relative; you can also specify regular expressions. For example: <lib dir="../../../contrib/velocity/lib" regex=".*\.jar" />
dataDir | By default, Solr uses the ./data folder for storing indexes; however, this can be overridden by specifying a different data folder with this directive.
indexConfig | This directive is of the complex type, and it allows you to change some of the internal indexing settings of Solr.
filter | You can specify different filters to be run at the time of index creation.
writeLockTimeout | Denotes the maximum time to wait for the IndexWriter write lock.
maxIndexingThreads | Denotes the maximum number of threads that can index concurrently in the IndexWriter; if more threads arrive, they have to wait. The default value is 8.
ramBufferSizeMB | The maximum amount of RAM that can be used for buffering during index creation before the buffered data is flushed to the filesystem.
maxBufferedDocs | Limits the number of documents buffered.
lockType | When indexes are generated and stored in files, this setting decides which file-locking mechanism is used to manage concurrent reads and writes. There are three types of file-locking mechanisms: single (one process at a time), native (native operating system driven), and simple (based on locking using plain files).
unlockOnStartup | When true, it releases at startup all the write locks held in the past.
jmx | Solr can expose runtime statistics through MBeans. This can be enabled or disabled through this directive.
updateHandler | The update handler is responsible for managing the updates to Solr. The entire configuration for updateHandler forms a part of this directive.
updateLog | You can specify the folder and other configuration for the transaction logs written during index updates.
autoCommit | Enables automatic commits while updates are happening. An automatic commit can be triggered based on the number of documents added or the time elapsed.
listener | Using this directive, you can subscribe to update events when IndexWriters are updating the index. The listeners can be run at the time of either postCommit or postOptimize.
query | This directive is mainly responsible for controlling different parameters at query time.
requestDispatcher | By setting parameters in this directive, you can control how a request will be processed by SolrDispatchFilter.
requestHandler | Request handlers are responsible for handling different types of requests with specific logic for Apache Solr. These are described in a separate section.
searchComponent | Search components in Solr enable additional logic that can be used by the search handler to provide a better searching experience. These are described in Appendix, Use Cases for Big Data Search.
updateRequestProcessorChain | The update request processor chain defines how update requests are processed; you can define your own updateRequestProcessor to perform tasks such as cleaning up data and optimizing text fields.
queryResponseWriter | Each query response is formatted and written back to the user through a queryResponseWriter. You can extend your Solr instance to return responses in XML, JSON, PHP, Ruby, Python, CSV, and so on by enabling the respective predefined writers. If you have a custom requirement for a certain type of response, it can easily be extended.
queryParser | The query parser directive tells Apache Solr which query parser is to be used for parsing the query and creating Lucene Query objects. Apache Solr contains predefined query parsers such as lucene (default), DisMax (based on weights of fields), edismax (similar to DisMax with some additional features), and others.
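To show how several of these directives fit together, here is a minimal solrconfig.xml sketch. All values (the version string, buffer size, commit thresholds, and property placeholders) are assumptions modeled on the stock Solr 4.x example configuration and should be tuned for a real deployment:

```xml
<config>
  <!-- Version of Lucene/Solr this file targets (assumed value) -->
  <luceneMatchVersion>LUCENE_43</luceneMatchVersion>

  <!-- Falls back to ./data when the solr.data.dir property is not set -->
  <dataDir>${solr.data.dir:}</dataDir>

  <indexConfig>
    <ramBufferSizeMB>100</ramBufferSizeMB>
    <maxIndexingThreads>8</maxIndexingThreads>
    <lockType>native</lockType>
  </indexConfig>

  <!-- Expose runtime statistics through MBeans -->
  <jmx/>

  <updateHandler class="solr.DirectUpdateHandler2">
    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>
    <!-- Commit after 10000 documents or 15 seconds, whichever comes first -->
    <autoCommit>
      <maxDocs>10000</maxDocs>
      <maxTime>15000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
  </updateHandler>
</config>
```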

Understanding the Solr plugin

Apache Solr provides easy extensions to its current architecture through Solr plugins. Using Solr plugins, you can load your own code to perform a variety of tasks within Solr, from custom request handlers that process searches to custom analyzers and token filters for text fields. Typically, plugins can be developed in any IDE by importing apache-solr*.jar as a library.

The following types of plugins can be created with Apache Solr:

Component | Description
--- | ---
Search components | These plugins operate on the result set of a query. The results that they produce typically appear at the end of the search request.
Request handler | Request handlers are used to provide a REST endpoint from the Solr instance to get some work done.
Filters | Filters are the chain of agents that analyze the text for various filtering criteria, such as lowercasing and stemming. You can introduce your own filter and package it along with the plugin JAR file.

Once the plugin is developed, it has to be declared in solrconfig.xml, with a library reference pointing to your JAR.
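A minimal sketch of such a declaration is shown here. The JAR name pattern, the /hello endpoint, and the class com.example.solr.HelloRequestHandler are all hypothetical placeholders for your own plugin:

```xml
<!-- Load the plugin JAR from a lib folder (hypothetical file name pattern) -->
<lib dir="./lib" regex="my-solr-plugin-.*\.jar"/>

<!-- Register the custom request handler under a relative URL (hypothetical class) -->
<requestHandler name="/hello" class="com.example.solr.HelloRequestHandler"/>
```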

Other configuration

Request handlers in Solr are responsible for handling requests. Each request handler can be associated with one relative URL: for example, /search, /select. A request handler that provides search capabilities is called a search handler. There are more than 25 request handlers available with Solr by default, and you can see the complete list here: http://lucene.apache.org/solr/api/org/apache/solr/request/SolrRequestHandler.html.

There are search handlers that provide searching capabilities on a Solr-based index (for example, DisMaxRequestHandler and SearchHandler); similarly, there are update handlers that provide support for uploading documents to Solr (for example, DataImportHandler and CSVUpdateRequestHandler). RealTimeGetHandler provides the latest stored fields of any document. UpdateRequestHandlers are responsible for processing updates to an index. Similarly, CSVRequestHandler and JsonUpdateRequestHandler take responsibility for updating the indexes from the CSV and JSON formats. ExtractingRequestHandler uses Apache Tika to extract text from different file formats.
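As an illustration of how a request handler is bound to a relative URL, the following sketch registers a search handler at /select in solrconfig.xml. The default parameter values, including the default field name text, are assumptions modeled on the stock Solr 4.x example configuration:

```xml
<!-- A search handler reachable at the relative URL /select -->
<requestHandler name="/select" class="solr.SearchHandler">
  <!-- Defaults apply when the request does not supply these parameters -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">text</str>
  </lst>
</requestHandler>
```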
