Leveraging Solr cores

Recall from Chapter 2, Schema Design, that you can either put different types of data into a single index or use separate indexes. Up to this point, the only way you would know how to use separate indexes is to actually run multiple instances of Solr. However, adding another complete instance of Solr for each type of data you want to index is rather time consuming and unnecessary.

A single Solr server instance can host multiple separate indexes (cores), and it brings features like hot core reloading and swapping that make administration easier. In fact, the MusicBrainz setup with this book has six cores. In a request URL, the core name immediately follows the /solr/ part and precedes the request handler (for example, /select). In SolrCloud mode, this spot is the collection name. For example, this URL searches the mbartists core:

http://localhost:8983/solr/mbartists/select?q=dave%20matthews

Other than the introduction of the core name in the URL, you still perform all of your management tasks, searches, and updates in the same way as you always did in a single core setup.
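To make the URL layout concrete, here is a minimal Python sketch that assembles a search URL for a given core. The function name core_select_url and its base parameter are our own inventions for illustration; only the host, port, core name, and /select handler come from the example above.

```python
from urllib.parse import quote, urlencode


def core_select_url(core, query, base="http://localhost:8983/solr"):
    """Build <base>/<core>/select?q=<query> (hypothetical helper)."""
    # quote_via=quote encodes spaces as %20 instead of '+',
    # matching the style of the URL shown in the text.
    params = urlencode({"q": query}, quote_via=quote)
    return f"{base}/{quote(core)}/select?{params}"


print(core_select_url("mbartists", "dave matthews"))
```

Switching to another core only changes the path segment after /solr/; the rest of the request stays the same.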

Configuring solr.xml

Since Solr started supporting multiple cores, solr.xml, located in the solr.home directory, has been how Solr finds all the cores. Starting in 4.4, Solr can auto-discover cores as an alternative mechanism. At startup, Solr looks through all the subdirectories below solr.home, no matter how many levels deep, and whenever it finds a file named core.properties, it knows that it has found a directory with configuration information for a core to be loaded. The core.properties file only has to exist; it doesn't need to have any content, although it can contain core-specific configuration parameters. In Solr 5.0, solr.xml will no longer list cores; it will only contain properties related to running Solr, such as SolrCloud-related properties. We have included the old style solr.xml in the example code in ./cores/solr_legacy.xml; this will look familiar to folks who have used earlier versions of Solr. The ./cores/solr.xml reflects the new approach. You might remember an attribute called persistent="true" in solr.xml; it has been removed, as this file is now immutable.
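The discovery rule described above can be illustrated with a short Python sketch. This is not Solr's implementation, only a simplified model of the idea that any directory below solr.home containing a core.properties file is treated as a core; the function name discover_cores is hypothetical.

```python
import os


def discover_cores(solr_home):
    """Return the directories under solr_home that hold a core.properties file.

    Simplified model of core discovery: the file only has to exist,
    so its content is never read here.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(solr_home):
        if "core.properties" in filenames:
            found.append(dirpath)
    return sorted(found)


# Prints an empty list if ./cores does not exist on your machine.
print(discover_cores("./cores"))
```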

Some of the configuration options are:

  • sharedLib="lib": This specifies the path to the lib directory containing shared JAR files for all the cores. On the other hand, if you have a core with its own specific JAR files, then you would place them in the core/lib directory. For example, the karaoke core uses Solr Cell (see Chapter 4, Indexing Data) for indexing rich content, so the JARs for parsing and extracting data from rich documents are located in ./examples/cores/karaoke/lib/.
  • shareSchema: This allows you to use a single in-memory representation of the schema for all the cores that use the same instanceDir. This will cut down on your memory use and startup time, especially in situations where you have many cores. I have seen Solr run with dozens of cores with no issues beyond increased startup time as each index is opened.
  • solrCloud: This is a stanza of XML for configuring SolrCloud across all collections deployed in SolrCloud. There are a number of options, such as distribUpdateConnTimeout, distribUpdateSoTimeout, leaderVoteWait, leaderConflictResolveWait, and zkClientTimeout, that are all related to managing timeouts. In general, the defaults should be fine, but if your SolrCloud has many collections, runs on a slow network, or has nodes on multiple networks, then you may need to increase the timeouts.
  • shardHandler: This is a stanza that also deals with the HTTP layer and has options that expose the Apache HTTP Client library settings such as socketTimeout and connTimeout that you may need to change.

Each core is configured via a fairly obvious set of properties provided in core.properties. This file is mutable, and you should put your custom properties into it:

  • name: This specifies the name of the core, and therefore what to put in the URL to access the core.
  • configSet: This specifies the name of a shared configuration that you want to use for the core. See Chapter 10, Scaling Solr for more about using configuration sets.
  • instanceDir: This specifies the path to the directory that contains the conf directory for the core, and data directory too, by default. A relative path is relative to solr.home. In a basic single-core setup, this is typically set to the same place as solr.home. In the preceding example, we have three cores using the same configuration directory, and two that have their own specific configuration directories.
  • dataDir: This specifies where to store the indexes and any other supporting data, like spell check dictionaries. If you don't define it, then by default each core stores its information in the <instanceDir>/data directory.
  • You can also provide your own properties just by defining them in the file and then referencing them in your Solr configuration.

Some of the most interesting properties in the core.properties file are loadOnStartup and transient. If you have a Solr node with hundreds or thousands of cores (for example, one core per user interacting with the system), then you would only want to load the cores of the people who are actively using your system; otherwise, Solr could run out of memory. By default, loadOnStartup is true so that each core will load, but in this use case, you would want it to be false and only load the core when the user logs in. Conversely, setting the transient property to true allows Solr to start unloading cores if too many users are logged on at the same time. If you are wondering how to load the core in response to the user login action, check the RELOAD command (see the Managing cores section later in this chapter) that is part of the Solr Core Admin API.
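As a rough illustration of the one-core-per-user scenario, the following Python sketch writes a core.properties file with loadOnStartup set to false and transient set to true. The helper write_core_properties, the core name user_42, and the dataDir value are hypothetical; only the property names come from the text. It also ignores the escaping rules of Java properties files, so treat it as a sketch for simple values only.

```python
import tempfile
from pathlib import Path


def write_core_properties(core_dir, name, load_on_startup=True,
                          transient=False, extra=None):
    """Write <core_dir>/core.properties and return the text written."""
    props = {
        "name": name,
        # Java properties expect lowercase true/false.
        "loadOnStartup": str(load_on_startup).lower(),
        "transient": str(transient).lower(),
    }
    props.update(extra or {})  # for example dataDir or custom properties
    core_dir = Path(core_dir)
    core_dir.mkdir(parents=True, exist_ok=True)
    text = "".join(f"{key}={value}\n" for key, value in props.items())
    (core_dir / "core.properties").write_text(text, encoding="utf-8")
    return text


if __name__ == "__main__":
    # Write into a throwaway directory so the demo leaves nothing behind.
    with tempfile.TemporaryDirectory() as home:
        print(write_core_properties(Path(home) / "user_42", "user_42",
                                    load_on_startup=False, transient=True,
                                    extra={"dataDir": "data"}))
```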

Property substitution

Property substitution allows you to externalize configuration values, which can be very useful for customizing your Solr install with environment-specific values. For example, in production, you might want to store your indexes on a separate solid state drive; you would then specify it as a property: dataDir="${ssd.dir}". You can also supply a default value to use if the property hasn't been set: dataDir="${ssd.dir:/tmp/solr_data}". This property substitution works in solr.xml, solrconfig.xml, schema.xml, and DIH configuration files.

Properties can be defined in core.properties or as Java system properties. To set a Java system property, use the -D parameter like this: -Dssd.dir=/Volumes/ssd.
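The ${name} and ${name:default} syntax can be modeled with a few lines of Python. This is only a sketch of the substitution rule as described here, not Solr's own resolver; the substitute function and the choice to raise an error for an unset property without a default are our assumptions.

```python
import re

# ${name} or ${name:default}; the name may not contain ':' or '}'.
_PLACEHOLDER = re.compile(r"\$\{([^}:]+)(?::([^}]*))?\}")


def substitute(text, props):
    """Replace each placeholder with its property value or its default."""
    def repl(match):
        name, default = match.group(1), match.group(2)
        if name in props:
            return props[name]
        if default is not None:
            return default
        # Assumption: an unset property with no default is an error.
        raise KeyError(f"no value or default for property {name!r}")
    return _PLACEHOLDER.sub(repl, text)


print(substitute('dataDir="${ssd.dir:/tmp/solr_data}"', {}))
print(substitute('dataDir="${ssd.dir:/tmp/solr_data}"',
                 {"ssd.dir": "/Volumes/ssd"}))
```

The first call falls back to the default path, while the second uses the supplied value, much as passing -Dssd.dir=/Volumes/ssd would.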

Include fragments of XML with XInclude

XInclude stands for XML Inclusions and is a W3C standard for merging a chunk of XML into another document. Solr has support for using XInclude tags in solrconfig.xml to incorporate a chunk of XML at load time.

In ./examples/cores/karaoke/conf/solrconfig.xml, we have externalized the <query/> configuration into three flavors: a default query cache setup, a no caching setup, and a big query cache setup:

<xi:include href="cores/karaoke/conf/${karaoke.xinclude.query}" parse="xml" xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:fallback>
    <xi:include href="cores/karaoke/conf/solrconfig-query-default.xml"/>
  </xi:fallback>
</xi:include>

The ${karaoke.xinclude.query} property is defined in the core definition:

<core name="karaoke" instanceDir="karaoke" dataDir="../../cores_data/karaoke">
  <property name="karaoke.xinclude.query" value="solrconfig-query-nocache.xml"/>
</core>

If the XML file defined by the href attribute isn't found, then the file named in the xi:fallback element is included instead. The fallback mechanism is primarily useful if you are including XML files that are loaded via HTTP and might not be available due to network issues.
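The selection logic behind the fallback can be pictured with the following Python sketch. It is not an XInclude processor and does not parse any XML; it only mimics the decision "use the requested file if it exists, otherwise use the fallback". The function resolve_include is hypothetical.

```python
from pathlib import Path


def resolve_include(conf_dir, requested, fallback):
    """Return the path that would be included: requested if present, else fallback."""
    candidate = Path(conf_dir) / requested
    if candidate.is_file():
        return candidate
    # Mirrors the xi:fallback branch: the primary href could not be loaded.
    return Path(conf_dir) / fallback


print(resolve_include("cores/karaoke/conf",
                      "solrconfig-query-nocache.xml",
                      "solrconfig-query-default.xml"))
```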

Managing cores

While there isn't a nice GUI for managing Solr cores the way there is for some other options, the URLs you use to issue commands to Solr cores are very straightforward, and they can easily be integrated into other management applications. The response by default is XML, but you can also return results in JSON by appending wt=json to the command.

We'll cover a couple of the common commands using the example Solr setup in ./examples. The individual URLs listed here are stored in plain text files in ./examples/11/ to make it easier to follow along in your own browser:

  • STATUS: Getting the status of the current cores is done through http://localhost:8983/solr/admin/cores?action=STATUS. You can select the status of a specific core, such as mbartists, through http://localhost:8983/solr/admin/cores?action=STATUS&core=mbartists. The STATUS command provides a nice summary of the various cores, and it is an easy way to monitor statistics showing the growth of your various cores.
  • CREATE: You can generate a new core called karaoke_test based on the karaoke core, on the fly, using the CREATE command through http://localhost:8983/solr/admin/cores?action=CREATE&name=karaoke_test&instanceDir=karaoke&config=solrconfig.xml&schema=schema.xml&dataDir=./examples/cores_data/karaoke_test. If you create a new core that has the same name as an old core, then the existing core serves up requests until the new one is generated, and then the new one takes over.
  • RENAME: Renaming a core can be useful when you have fixed names of cores in your client, and you want to make a core fit that name. To rename the mbartists core to the more explicit core name music_brainz_artists, use the URL http://localhost:8983/solr/admin/cores?action=RENAME&core=mbartists&other=music_brainz_artists. This naming change only happens in memory, as it doesn't update the filesystem paths for the index and configuration directories.
  • SWAP: Swapping two cores is one of the key benefits of using Solr cores. Swapping allows you to have an offline "on deck" core that is fully populated with updated data. In a single fast, atomic operation, you can swap out the current live core that is servicing requests with your freshly populated "on deck" core. As it's an atomic operation, there isn't any chance of mixed data being sent to the client. As an example, we can swap the mbtracks core with the mbreleases core through http://localhost:8983/solr/admin/cores?action=SWAP&core=mbreleases&other=mbtracks. You can verify that the swap occurred by going to the mbtracks admin page and checking that Solr home is displayed as cores/mbreleases/.
  • RELOAD: As you make minor changes to a core's configuration through solrconfig.xml, schema.xml, and supporting files, you don't want to be stopping and starting Solr constantly. In an environment with even a couple of cores, it can take some tens of seconds to restart all the cores, during which Solr is unavailable. By using the RELOAD command, you can trigger a reload of just one specific core without impacting the others. An example of this is if you use synonyms.txt for query time synonym expansion. If you modify it, you can just reload the affected core! A simple example for mbartists is http://localhost:8983/solr/admin/cores?action=RELOAD&core=mbartists.
  • UNLOAD: Just like you would expect, the unload action allows you to remove an existing core from Solr. Currently running queries are completed, but no new queries are allowed. A simple example for mbartists is http://localhost:8983/solr/admin/cores?action=UNLOAD&core=mbartists.
  • MERGEINDEXES: (For advanced users) The merge command allows you to merge one or more indexes into yet another core. This can be very useful if you've split data across multiple cores and now want to bring them together without re-indexing the source data all over again. It can also be used as the final step of an offline indexing process in which index data is added (merged) into a live index. You need to issue commits to the individual indexes that are sources for data. After merging, issue another commit to make the searchers aware of the new data. This all happens at the Lucene index level on the filesystem, so functions such as deduplication that work through update request processors are not invoked. The full set of commands using curl is listed in ./11/MERGE_COMMAND.txt.
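Because every command above is just a URL with an action parameter, it is easy to generate them from a management script. The following Python sketch builds the URLs only and does not send any request; the helper core_admin_url is hypothetical, while the endpoint, actions, and parameters are those shown in the list above.

```python
from urllib.parse import urlencode

ADMIN_URL = "http://localhost:8983/solr/admin/cores"


def core_admin_url(action, **params):
    """Build a Core Admin URL such as ...?action=RELOAD&core=mbartists."""
    query = {"action": action.upper(), **params}
    return f"{ADMIN_URL}?{urlencode(query)}"


print(core_admin_url("STATUS", core="mbartists"))
print(core_admin_url("SWAP", core="mbreleases", other="mbtracks"))
# Ask for a JSON response instead of the default XML.
print(core_admin_url("RELOAD", core="mbartists", wt="json"))
```

You could paste the printed URLs into a browser or pass them to curl, just like the URLs stored in the example files.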

Some uses of multiple cores

Solr's support of multiple cores in a single instance enables you to serve multiple indexes of data in a single Solr instance. Multiple cores also address some key needs for maintaining Solr in a production environment:

  • Rebuilding an index: While Solr has a lot of features to handle tasks such as sparse updates to an index with minimal impact on performance, occasionally you need to bulk update significant amounts of your data. This invariably leads to performance issues, as your searchers are constantly being reopened. By supporting the ability to populate a separate index in a bulk fashion, you can optimize the offline index for updating content. Once the offline index has been fully populated, you can use the SWAP command to take the offline index and make it the live index.
  • Testing configuration changes: Configuration changes can have very different impacts depending on the type of data you have. If your production Solr has massive amounts of data, moving that to a test or development environment may not be possible. By using the CREATE and the MERGEINDEXES commands, you can make a copy of a core and test it in relative isolation from the core being used by your end users. Use the RELOAD command to restart your test core to validate your changes. Once you are happy with your changes, you can either SWAP the cores or just reapply your changes to your live core and RELOAD it.
  • Merging separate indexes together: You will find that over time you have more separate indexes than you need, and you want to merge them together. You can use the MERGEINDEXES command to merge two cores together into a third core. However, note that you need to do a commit on both cores and ensure that no new data is indexed while the merge is happening.
  • Renaming cores at runtime: You can build multiple versions of the same basic core and control which one is accessed by your clients using the RENAME command to rename a core to match the URL the clients are connecting to.
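The merge workflow described in the list above (commit the sources, merge, then commit the target) can be laid out as an ordered list of URLs. The Python sketch below only builds that list; it sends nothing to Solr. The srcCore parameter and the update?commit=true form are written from our recollection of the Core Admin and update APIs rather than taken from this chapter, and the target core name mbmerged is hypothetical, so check them against the commands shipped with the book's examples before relying on them.

```python
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr"


def merge_plan(target, sources):
    """Return the URLs to call, in order, to merge source cores into target."""
    # 1. Commit each source core so that its index is up to date on disk.
    steps = [f"{BASE}/{src}/update?commit=true" for src in sources]
    # 2. Merge; srcCore is repeated once per source (assumed parameter name).
    params = [("action", "MERGEINDEXES"), ("core", target)]
    params += [("srcCore", src) for src in sources]
    steps.append(f"{BASE}/admin/cores?{urlencode(params)}")
    # 3. Commit the target so that its searchers see the merged data.
    steps.append(f"{BASE}/{target}/update?commit=true")
    return steps


for url in merge_plan("mbmerged", ["mbartists", "mbreleases"]):
    print(url)
```

Remember that no new data should be indexed into the source cores while these steps run.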

You can learn more about Solr core related features at https://cwiki.apache.org/confluence/display/solr/Core+Admin and https://cwiki.apache.org/confluence/display/solr/Core-Specific+Tools.
