Adding more considerations about tuning

Using the DataImportHandler is usually a good opportunity to think about the performance and tuning of a Solr installation. For example, while indexing we can run into problems that mostly depend on memory usage (of both the index and the documents to be posted), merge time, and optimization time.

These kinds of problems can be reduced by using the omitNorms="true" parameter in schema.xml and by choosing a good value for the mergeFactor parameter in solrconfig.xml. Merging is critical because it introduces an overhead, and optimization poses a similar problem.
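As a sketch of where these two parameters live, omitNorms is set per field in schema.xml, while mergeFactor belongs in the indexing section of solrconfig.xml (the field name, type, and values here are only illustrative; check the defaults for your Solr version):

```xml
<!-- schema.xml: disable index-time norms on a field that does not
     need length normalization or index-time boosting -->
<field name="description" type="text_general"
       indexed="true" stored="true" omitNorms="true"/>

<!-- solrconfig.xml, inside <indexConfig>: how many segments of
     roughly equal size may accumulate before being merged -->
<mergeFactor>10</mergeFactor>
```

A higher mergeFactor speeds up indexing (fewer merges) at the cost of slower searches over more segments, so it is a trade-off rather than a value to maximize.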

A good list of the factors that affect performance, and that we should therefore understand, can be assembled from the performance-related pages of the Solr wiki.

The critical elements are typically an extensive use of faceting, a large number of sort parameters, and the memory required by a huge index or by very big documents to be indexed.

Understanding the Java Virtual Machine, threads, and Solr

Solr runs on a Java Virtual Machine (JVM), and its performance relies on a good configuration of the JVM itself, covering aspects such as memory, encoding, and thread management.

In order to manage memory allocation correctly, one of the best approaches is to use a tool such as jconsole connected to the running JVM. It gives us very detailed information about memory usage and, most important of all, about the minimum memory requirement; identifying good values for this is indeed a very common problem.

To avoid problems with memory consumption and the free memory available at runtime, we can easily pass specific parameters to the JVM when starting Solr. For example:

>> java -Xms512M -Xmx1024M -XX:ThreadStackSize=128 -jar start.jar

This command tells the virtual machine to start with an initial heap size of 512 MB (-Xms) and a maximum heap size of 1024 MB (-Xmx). -XX:ThreadStackSize (or -Xss) can be used to manage the memory allocated for the local resources of each thread.
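Besides jconsole, a quick programmatic check of the heap limits actually in force can be done from inside the JVM itself. This is a minimal sketch using the standard java.lang.management API (the class name is ours, not part of Solr):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapCheck {

    // Maximum heap size (in MB) the JVM will try to use,
    // that is, the value controlled by -Xmx.
    public static long maxHeapMb() {
        MemoryUsage heap =
            ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getMax() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("Max heap (MB): " + maxHeapMb());
    }
}
```

Running this inside the same JVM as Solr (or any JVM started with the flags above) confirms whether the -Xmx value was actually picked up.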

Tip

Reading the articles at http://www.vogella.com/articles/JavaPerformance/article.html makes it clear that another critical element is the garbage collection policy. If you can choose between more than one JVM implementation, consider reading about the kind of garbage collection strategy each one uses. If you are a programmer, try to write code that is stateless and as concurrent as possible. Allocating and releasing memory for many short-lived objects is more garbage-collector friendly than holding on to big, long-lived referenced objects that are modified from time to time.

I found an improvement in performance with the adoption of an actor system such as Akka, which supports both Java and Scala, for writing concurrent code to post new data to Solr. The data isolation provided by actors avoids typical synchronization problems.
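If an actor framework is not available, the same idea (independent workers, no shared mutable state beyond a counter) can be sketched with plain java.util.concurrent. This is not Akka, and postDocument here is a hypothetical stand-in for the actual HTTP call to Solr's update handler:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ConcurrentPoster {

    private final AtomicInteger posted = new AtomicInteger();

    // Hypothetical stand-in: a real implementation would POST the
    // document to Solr's /update handler over HTTP.
    void postDocument(String doc) {
        posted.incrementAndGet();
    }

    // Posts every document using a fixed pool of worker threads and
    // returns how many documents were actually posted.
    public int postAll(List<String> docs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String doc : docs) {
            pool.submit(() -> postDocument(doc));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return posted.get();
    }
}
```

Each worker owns its in-flight document, so no locking is needed around the posting logic itself; an actor system gives you the same isolation with supervision and mailboxes on top.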

Choosing the correct directory implementation

Ideally, we could imagine Solr handling all documents in memory. Even if this kind of implementation does exist (solr.RAMDirectoryFactory), it is only useful for testing purposes, as it is volatile by nature and does not scale at all! What we have actually adopted in our previous examples are disk-based implementations, in which a binary format is used for fast access to the records on disk, without the need to keep them all in memory at all times. In this scenario, there are two crucial points of possible failure: the need for caching to improve performance, and the underlying filesystem where we are actually writing our data. We have mostly adopted solr.MMapDirectoryFactory, as it is designed to work on 64-bit systems and internally exploits the I/O caching of the operating system's virtual memory, so it is both fast and well behaved with multiple threads. This implementation replaced the previous local implementations, which either had problems with lots of threads (solr.SimpleFSDirectoryFactory) or were very slow on Microsoft platforms (solr.NIOFSDirectoryFactory).

If you are unsure about which implementation to adopt, you can always use the default, solr.StandardDirectoryFactory, as described at http://wiki.apache.org/solr/SolrCaching.

From a distributed and real-time point of view, things are evolving fast. At the time of writing this book, we have solr.NRTCachingDirectoryFactory at our disposal, which is designed to extend the capabilities of MMapDirectoryFactory in the direction of near real-time search by storing part of the data in a transparent in-memory cache.
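Switching between directory implementations is just a matter of declaring the factory in solrconfig.xml; a minimal sketch (check which factories and parameters your Solr version supports):

```xml
<!-- solrconfig.xml: use the near real-time caching wrapper
     around the default MMap-based directory -->
<directoryFactory name="DirectoryFactory"
                  class="solr.NRTCachingDirectoryFactory"/>
```

Replacing the class attribute with solr.StandardDirectoryFactory (or leaving the element out entirely) falls back to the platform default.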

Tip

For reading about near real-time search, I would suggest you start from the CAP theorem at http://en.wikipedia.org/wiki/CAP_theorem, which is used for designing the strategy behind many current NoSQL and distributed filesystem products.

A simple workaround for these problems is to adopt SolrCloud and spread a large index over several shards, so that every single instance will suffer fewer heap problems. Other experiments were done in the past on storing indexes on a database backend, but as far as I know they intrinsically produce bad performance. In the past few years, there have also been many experiments on using distributed filesystems or even NoSQL storage, so we can expect really good improvements in the future.

Adopting Solr caches

Solr provides three levels of cache management: the Filter, Document, and Query caches. You have probably noticed that what is really missing here is a Field-level cache; it does exist, but it is an internal caching level managed directly by Lucene. So, we cannot handle it directly, unless we want to provide our own custom implementation.

However, most of the caching needed at the Field level can really be handled by the Document cache. This component manages a cache of documents and their stored fields. There is no need to cache fields that are only indexed, as they are already used to construct the index, and caching them makes no sense.
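The Document cache is declared in solrconfig.xml much like the other caches. Note that, since internal document IDs change between commits, autowarming does not apply to this cache (sizes here are only illustrative):

```xml
<documentCache class="solr.LRUCache"
    initialSize="4096" size="16384"/>
```

A common sizing rule of thumb is to make it larger than the maximum expected number of results per query times the number of concurrent queries, so that a document is never re-fetched during a single request.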

A typical configuration for the Filter cache is, for example:

<filterCache class="solr.LRUCache"
    initialSize="4096" size="16384"
    autowarmCount="4096"/>

The other cache types are configured very similarly, and they always include a minimum (initial) allocation and a dynamic maximum size. The class used depends on the chosen caching strategy, and the autowarmCount parameter is really important: it controls how many of the most commonly used entries are copied over from the old cache when a new searcher is opened, reducing the need to recompute them later.

The Filter cache, in contrast, is more focused on caching the most frequently used filter combinations and the most recurring terms from faceting.

The Query cache is probably the most intuitive one, and it behaves as we would expect: it simply caches the results of queries.
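In solrconfig.xml this cache is declared as queryResultCache, and it accepts the same sizing parameters as the Filter cache shown earlier (the values here are only illustrative):

```xml
<queryResultCache class="solr.LRUCache"
    initialSize="512" size="1024"
    autowarmCount="256"/>
```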

If you want to know more about caching, visit http://wiki.apache.org/solr/SolrCaching.
