Chapter 2. Understanding Apache Solr

In the previous chapter, we discussed how big data has evolved to help organizations deal with very large volumes of data. Working with data of different shapes brings further challenges. For example, application server log files and Microsoft Word documents hold semi-structured or unstructured data, which makes them difficult to store in traditional relational storage. The challenge of handling such data is not just one of storage: there is also the big question of how to access the required information. Enterprise search engines are designed to address this problem.

Today, finding the required information within a specified timeframe has become more crucial than ever. Enterprises without information retrieval capabilities suffer from problems such as lost employee productivity, poor decisions based on faulty or incomplete information, duplicated effort, and so on. Given these scenarios, it is evident that enterprise search is necessary in any enterprise.

Apache Solr is an open source enterprise search platform, designed to handle these problems in an efficient and scalable way. Apache Solr is built on top of Apache Lucene, which provides an open source information search and retrieval library. Today, many professional enterprise search market leaders, such as LucidWorks and PolySpot, have built their search platforms using Apache Solr. In this chapter, we will learn more about Apache Solr by looking at the following aspects:

  • Setting up Apache Solr
  • Apache Solr architecture
  • Configuring Solr
  • Loading data in Apache Solr
  • Querying for information in Solr

Setting up Apache Solr

We will be going through the Apache Solr architecture in the next section; for now, let's install Apache Solr on our machines. Apache Solr is a Java Servlet web application that runs on Apache Lucene, Tika, and other open source libraries. Apache Solr ships with a demo server on Jetty, so you can run it straight from the command line and get a Solr instance up quickly. However, you can choose to customize it and deploy it in your own environment. Apache Solr does not ship with an installer; it has to be run as a J2EE application.

Prerequisites for setting up Apache Solr

Apache Solr requires Java 1.6 or later to run, so make sure you have the correct version of Java by running java -version, as shown in the following screenshot:

[Screenshot: Prerequisites for setting up Apache Solr]

Note

With Apache Solr 4.0 or later, JDK 1.5 is no longer supported; Apache Solr 4.0+ runs on JDK 1.6 or later. Instead of relying on the JDK that ships with your operating system, download the full JDK from http://www.oracle.com/technetwork/java/javase/downloads/index.html?ssSourceSiteId=otnjp. This enables full support for international character sets. Apache Solr 4.10.1 requires a minimum of JDK 7.
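The version check described above can also be scripted. The following is a minimal sketch in Python, not part of Solr itself; the function names and the sample banner are illustrative assumptions. It reads the version string that java -version prints and reduces it to a major version number, treating the legacy 1.x scheme (1.6, 1.7) as Java 6 and Java 7.

```python
import re
import subprocess


def java_major_version(version_output: str) -> int:
    """Extract the major Java version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no version string found in output")
    first = int(match.group(1))
    second = match.group(2)
    # Legacy scheme: "1.6.0_45" means Java 6 and "1.7.0_67" means Java 7.
    if first == 1 and second is not None:
        return int(second)
    return first


def installed_java_major() -> int:
    """Run `java -version` and parse its banner (normally written to stderr)."""
    result = subprocess.run(
        ["java", "-version"], capture_output=True, text=True, check=True
    )
    return java_major_version(result.stderr + result.stdout)


# Illustrative banner only; the exact text depends on the installed JDK.
sample = 'java version "1.7.0_67"\nJava(TM) SE Runtime Environment'
print(java_major_version(sample))
```

Calling installed_java_major() on your own machine and comparing the result with 6 (or 7 for Solr 4.10.1) gives a quick yes/no answer before you start Solr.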

Once you have the correct Java version, you need a servlet container such as Tomcat, Jetty, Resin, Glassfish, or Weblogic installed on your machine. If you intend to use the Jetty-based demo server, you do not need a separate container.

Running Apache Solr on Jetty

The Apache Solr distribution comes as a single compressed archive. You can download the stable release from http://lucene.apache.org/solr/ or pick up a nightly build from the same site. For Windows, download the .zip file from the Apache mirror site; for Linux, UNIX, and other such flavors, download the .gzip/.tgz version. In Windows, you can simply unzip the file, and in UNIX, you can run the following command:

$ tar -xvzf solr-<major-minor version>.tgz

Another way is to build Apache Solr from source. This is required if you are going to modify or extend the Apache Solr source with your own handlers, plugins, and so on. You need Java SE 7 JDK (which stands for Java Development Kit) or JRE (which stands for Java Runtime Environment), an Apache Ant distribution (1.8.2 or later), and Apache Ivy (2.2.0+). You can compile the source by simply navigating to the Solr folder and running ant from there.

Note

More information can be found at https://wiki.apache.org/solr/HowToCompileSolr

When you unzip Solr, it extracts the following folders:

  • contrib/: This folder contains all the libraries that are additional to Solr, and they can be included on demand. They provide libraries for data import handler, MapReduce, Apache UIMA, velocity template, and so on.
  • dist/: This folder provides the distributions of Solr and other useful libraries such as SolrJ, UIMA, and MapReduce. We will be looking at this in the next chapter.
  • docs/: This folder contains documentation for Apache Solr.
  • example/: This folder provides Jetty-based Solr web apps that can be used directly. We are going to use this folder for running Apache Solr.
  • licenses/: This folder contains all the licenses of the underlying libraries used by Solr.

Now, set $JAVA_HOME to point to your JDK/JRE. You will find the Jetty server in the solr<version>/example folder. Once you have extracted solr-<major-minor version>.tgz, all you need to do is go to solr<version>/example and run the following command:

$ $JAVA_HOME/bin/java -jar start.jar

Note

If you are using the latest release of Solr (Solr 5.0), you need to go to the solr-5.0.0 folder and run the following command:

$ bin/solr start

The instructions for Solr 5.0 are available at:

https://cwiki.apache.org/confluence/display/solr/Solr+Start+Script+Reference

The default Jetty instance runs on port 8983, and you can access the Solr instance by visiting the following URL: http://localhost:8983/solr/browse. It shows a default search screen, as shown in the following screenshot:

[Screenshot: Running Apache Solr on Jetty]

If your system's default locale or character set is not English (that is, not en/en-US), you can play it safe and override the system defaults for Solr by passing -Duser.language=en -Duser.country=US to Jetty, which ensures that Solr runs smoothly.
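As a rough illustration of how the start command and the locale overrides fit together, the following Python sketch assembles the argument list for launching the Jetty demo server. The function name and the /opt/jdk path are hypothetical; only the java -jar start.jar command and the two -D flags come from the text above.

```python
import os


def jetty_start_command(java_home, language="en", country="US"):
    """Build the argument list for starting the Jetty demo server.

    The -D system properties must come before -jar so that the JVM,
    not start.jar, receives them.
    """
    java = os.path.join(java_home, "bin", "java")
    return [
        java,
        f"-Duser.language={language}",
        f"-Duser.country={country}",
        "-jar",
        "start.jar",
    ]


# "/opt/jdk" is a placeholder for whatever $JAVA_HOME points to.
command = jetty_start_command("/opt/jdk")
print(" ".join(command))
```

The resulting list can be handed to subprocess.run with the working directory set to the example folder of your Solr installation, which is where start.jar lives.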

Running Solr on other J2EE containers

It is relatively easy to set up Apache Solr on any J2EE container. You deploy the Apache Solr application WAR file using the container's standard J2EE application deployment. The one additional thing the Apache Solr application needs is the location of the Apache Solr home folder. This can be set through Java options, either by exporting the following environment variable or by updating the container startup script:

$ export JAVA_OPTS="$JAVA_OPTS -Dsolr.solr.home=/opt/solr/example"

Alternatively, you can configure JNDI lookup for the java:comp/env/solr/home resource by pointing it to the Solr home folder. In Tomcat, this can be done by creating a context XML file with a chosen name (context.xml) in $CATALINA_HOME/conf/Catalina/localhost/context.xml, and adding the following entries:

<?xml version="1.0" encoding="utf-8"?>
<Context docBase="<solr-home>/example/solr/solr.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="<solr-home>/example/solr" override="true"/>
</Context>
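If you deploy to several machines, the context file can be generated rather than edited by hand. The following Python sketch builds the same two elements with the standard xml.etree.ElementTree module; the function name and the /opt/solr installation folder are assumptions for illustration, and the attribute values simply mirror the entries shown above.

```python
import xml.etree.ElementTree as ET


def solr_context_xml(install_dir: str) -> str:
    """Return a Tomcat <Context> fragment pointing at a Solr installation."""
    solr_home = f"{install_dir}/example/solr"
    context = ET.Element(
        "Context",
        {
            "docBase": f"{solr_home}/solr.war",
            "debug": "0",
            "crossContext": "true",
        },
    )
    # JNDI entry that Solr looks up as java:comp/env/solr/home.
    ET.SubElement(
        context,
        "Environment",
        {
            "name": "solr/home",
            "type": "java.lang.String",
            "value": solr_home,
            "override": "true",
        },
    )
    return ET.tostring(context, encoding="unicode")


# "/opt/solr" is a hypothetical installation folder.
print(solr_context_xml("/opt/solr"))
```

Building the XML through an API rather than string concatenation keeps attribute quoting and escaping correct if the path contains unusual characters.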

Hello World with Apache Solr!

Once you are done with the installation of Apache Solr, you can simply run the examples by going to the example/exampledocs folder and running:

java -jar post.jar solr.xml monitor.xml

post.jar is a utility provided by Solr to upload the data to Apache Solr for indexing. When it is run, the post.jar file simply uploads all the files that are passed as a parameter to Apache Solr for indexing, and Solr indexes these files and stores them in its repository. Now, try accessing your instance by typing http://localhost:8983/solr/browse; you should find a sample search interface with some information in it, as shown in the following screenshot:

[Screenshot: Hello World with Apache Solr!]
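To make the role of post.jar more concrete, here is a minimal Python sketch of the kind of HTTP request such an upload involves. It assumes that the XML update handler is reachable at http://localhost:8983/solr/update and accepts application/xml; both are assumptions about the demo setup, and the function names are hypothetical. The sketch only builds the requests at import time; post_file sends one when you call it against a running server.

```python
import urllib.request

# Assumed update endpoint of the Jetty demo server.
UPDATE_URL = "http://localhost:8983/solr/update"


def build_update_request(xml_payload: bytes, url: str = UPDATE_URL):
    """Wrap an XML update message in an HTTP POST request."""
    return urllib.request.Request(
        url,
        data=xml_payload,
        headers={"Content-Type": "application/xml"},
        method="POST",
    )


def post_file(path: str) -> int:
    """Send one XML file to Solr and return the HTTP status code."""
    with open(path, "rb") as handle:
        request = build_update_request(handle.read())
    with urllib.request.urlopen(request) as response:
        return response.status


# A commit message makes previously posted documents searchable.
commit = build_update_request(b"<commit/>")
print(commit.get_method(), commit.full_url)
```

With Solr running, calling post_file("solr.xml") and post_file("monitor.xml") from the exampledocs folder, followed by sending the commit request with urllib.request.urlopen(commit), would approximate what the post.jar invocation does.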

Understanding Solr administration

Apache Solr provides an excellent user interface for administering the server, which can be accessed at http://localhost:8983/solr. Apache Solr has the concepts of collections and cores. A collection in Apache Solr is a set of Solr documents that represents one complete logical index. A Solr core is an execution unit of Solr that runs with its own configuration and other metadata. A collection can be created for each index. Similarly, you can run Solr in multicore mode.

Option            Purpose
----------------  ------------------------------------------------------------
Dashboard         This shows information related to version, memory consumption, JVM, and so on.
Logging           Shows log outputs with the latest logs on top
Logging | Level   Shows the current log configuration for packages, that is, for which packages the logs are enabled
Core Admin        Shows information about core, and allows its administration
Java Properties   Shows different Java properties set when Solr is running
Thread Dump       Describes the stack trace with information on CPU and user time; also enables a detailed stack trace
collection1       Demonstrates different parameters of collection, and all the activities you can perform, such as running queries and ping status

Solr navigation

The following table shows some of the important URLs configured with Apache Solr by default:

URL                 Purpose
------------------  ------------------------------------------------------------
/select             Processes search queries. The primary request handler provided with Solr is SearchHandler, which delegates to a sequence of search components.
/query              The same SearchHandler, for JSON-based requests.
/get                Real-time get handler, guaranteed to return the latest stored fields of any document, without the need to commit or open a new searcher. The current implementation relies on the updateLog feature being enabled; responses are in the JSON format.
/browse             Provides a faceted, web-based search interface.
/update/extract     Solr accepts posted XML messages that add/replace, commit, delete, and delete by query through the /update URL; /update/extract uses ExtractingRequestHandler to index rich documents through Apache Tika.
/update/csv         Specific to CSV messages; uses CSVRequestHandler.
/update/json        Specific to messages in the JSON format; uses JsonUpdateRequestHandler.
/analysis/field     Provides an interface for analyzing fields. It lets you specify multiple field types and field names in the same request, and outputs index-time and query-time analysis for each of them. It uses FieldAnalysisRequestHandler internally.
/analysis/document  Provides an interface for analyzing documents.
/admin              AdminHandler, which provides the administration of Solr. AdminHandler has multiple sub-handlers defined; /admin/ping is for the health checkup.
/debug/dump         DumpRequestHandler: echoes the request content back to the client.
/replication        Supports replicating indexes across different Solr servers; used by masters and slaves for data sharing. Uses ReplicationHandler.
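As an example of how a client addresses one of these handlers, the following Python sketch builds a search URL for /select. The base URL (including the collection1 core name), the function name, and the sample query text are assumptions for illustration; q, rows, and wt are standard Solr query parameters.

```python
from urllib.parse import urlencode

# Assumed base URL of the demo core; adjust the host, port, and core name.
BASE_URL = "http://localhost:8983/solr/collection1"


def select_url(query, rows=10, base=BASE_URL):
    """Build a /select URL that asks for JSON results."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base}/select?{params}"


print(select_url("monitor"))
```

Opening the printed URL in a browser, or fetching it with urllib.request.urlopen while Solr is running, returns the matching documents. The same pattern applies to the other handlers: only the path and the parameters change.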

Common problems and solutions

In this section, we will try and understand the common problems faced while running Solr instances:

  • When I run Apache Solr, I get the following error:
    java.lang.UnsupportedClassVersionError: org.apache.solr.servlet.SolrDispatchFilter : Unsupported major.minor version 51.0
    

    This error is caused by a mismatch between the Java version you are running and the Java version Apache Solr was compiled with. In this case, you need Java Version 7 or later. The following list maps Java versions to class file versions:

    J2SE 8 = 52,
    J2SE 7 = 51,
    J2SE 6.0 = 50,
    J2SE 5.0 = 49,
    JDK 1.4 = 48,
    JDK 1.3 = 47,
    JDK 1.2 = 46,
    JDK 1.1 = 45

    So, you need to use Java Version 7 or later to run Apache Solr. If another Java runtime is already set up on your machine for existing applications and you do not wish to disturb it, simply download a JRE into a separate folder and run the Solr start command (as explained in the previous section) using the java executable of the new JRE.

  • While running Solr, I got java.lang.OutOfMemoryError. How do I fix it?

    The out-of-memory error is thrown by the Java Virtual Machine (JVM) running Apache Solr when there is not enough memory available for the heap or for PermGen (the Permanent Generation space, which holds metadata about user classes and methods). When you get such an error, you need to restart the container, and when you do, make sure that you increase the memory available to the JVM. For PermGen, this can be done by adding the following JVM arguments:

    export JVM_ARGS="-Xmx1024m -XX:MaxPermSize=256m"
    

    For correcting the heap space error, you can specify the following JVM arguments:

    export JVM_ARGS="-Xms1024m -Xmx1024m"
    

    Please note that the memory sizes shown here are only examples; choose values that suit your data and the memory available on your machine. Visit http://jvmmemory.com/ to generate JVM arguments with the correct settings.
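The class version mapping listed under the first problem can be checked directly against a compiled class. Every Java class file starts with the magic number 0xCAFEBABE, followed by a two-byte minor version and a two-byte major version, all big-endian. The Python sketch below reads that header and looks the major version up in the mapping; the function name is hypothetical and the sample bytes are hand-built for illustration rather than taken from a Solr class.

```python
import struct

# Class file major version -> Java release, as listed in this section.
CLASS_VERSION_TO_JAVA = {
    52: "J2SE 8",
    51: "J2SE 7",
    50: "J2SE 6.0",
    49: "J2SE 5.0",
    48: "JDK 1.4",
    47: "JDK 1.3",
    46: "JDK 1.2",
    45: "JDK 1.1",
}


def class_file_java_version(header: bytes) -> str:
    """Map the first 8 bytes of a .class file to a Java release name."""
    if len(header) < 8:
        raise ValueError("need at least 8 bytes of the class file")
    # Layout: u4 magic, u2 minor_version, u2 major_version (big-endian).
    magic, _minor, major = struct.unpack(">IHH", header[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a Java class file")
    return CLASS_VERSION_TO_JAVA.get(major, f"unknown (major version {major})")


# Hand-built header with major version 0x33 (decimal 51).
print(class_file_java_version(bytes.fromhex("cafebabe00000033")))
```

To inspect a real class, read the first eight bytes of the .class file (for example, one extracted from a Solr JAR) in binary mode and pass them to the function.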
