In the previous chapter, we discussed how big data has evolved to cater to the needs of various organizations that have to deal with humongous data sizes. There are many other challenges when working with data of different shapes. For example, data such as application server log files (which are semi-structured) or Microsoft Word documents is difficult to store in traditional relational storage. The challenge of handling such data is not just related to storage: there is also the big question of how to access the required information. Enterprise search engines are designed to address this problem.
Today, finding the required information within a specified timeframe has become more crucial than ever. Enterprises without information retrieval capabilities suffer from problems such as lost employee productivity, poor decisions based on faulty or incomplete information, duplicated effort, and so on. Given these scenarios, it is evident that enterprise search is a necessity for any enterprise.
Apache Solr is an open source enterprise search platform, designed to handle these problems in an efficient and scalable way. Apache Solr is built on top of Apache Lucene, which provides an open source information search and retrieval library. Today, many professional enterprise search market leaders, such as LucidWorks and PolySpot, have built their search platform using Apache Solr. We will be learning more about Apache Solr in this chapter, and we will be looking at the following aspects of Apache Solr:
We will be going through the Apache Solr architecture in the next section; for now, let's install Apache Solr on our machines. Apache Solr is a Java Servlet web application that runs on Apache Lucene, Tika, and other open source libraries. Apache Solr ships with a demo server on Jetty, so one can simply run it through the command line. This helps users to run the Solr instance quickly. However, you can choose to customize it and deploy it in your own environment. Apache Solr does not ship with any installer; it has to be run as part of a J2EE application.
Apache Solr requires Java 1.6 or later to run, so it is important to make sure you have the correct version of Java by calling java -version, as shown in the following screenshot:
Apache Solr 4.0 and later no longer support JDK 1.5; they run on JDK 1.6 or later. Instead of going for the JDK pre-shipped with your operating system, go for the full version of the JDK by downloading it from http://www.oracle.com/technetwork/java/javase/downloads/index.html?ssSourceSiteId=otnjp. This will enable full support for international character sets. Apache Solr 4.10.1 requires a minimum of JDK 7.
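The version check above can also be scripted. The following is a minimal sketch, assuming the version string has the form reported by java -version, such as 1.7.0_67 (older 1.x numbering) or 11.0.2 (newer numbering). The function names java_major_version and java_version_ok are hypothetical and are not part of Solr or Java.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the major Java version from a version string.
# "1.7.0_67" yields 7 (older 1.x numbering), "11.0.2" yields 11 (newer numbering).
java_major_version() {
  local version="$1"
  local first="${version%%.*}"        # text before the first dot
  if [ "$first" = "1" ]; then
    local rest="${version#*.}"        # drop the leading "1."
    echo "${rest%%.*}"                # the next component is the major version
  else
    echo "${first%%[!0-9]*}"          # strip suffixes such as "-ea"
  fi
}

# Hypothetical helper: succeed if version string $1 meets the minimum major version $2.
java_version_ok() {
  local major
  major="$(java_major_version "$1")"
  [ -n "$major" ] && [ "$major" -ge "$2" ]
}

# Example usage (java -version writes to standard error, hence 2>&1):
#   installed="$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p')"
#   java_version_ok "$installed" 7 || echo "Java 7 or later is required"
```

Keeping the parsing separate from the call to java makes it easy to try the check against sample version strings on a machine without Java installed.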
Once you have the correct Java version, you need a servlet container such as Tomcat, Jetty, Resin, Glassfish, or Weblogic installed on your machine. If you intend to use the Jetty-based demo server, you do not need a separate container.
The Apache Solr distribution comes as a single zipped folder. You can download the stable release from http://lucene.apache.org/solr/ or pick a nightly build from the same site. To run Solr on Windows, download the zip file from the Apache mirror site. For Linux, UNIX, and other such flavors, you can download the .gzip/.tgz version. In Windows, you can simply unzip your file, and in UNIX, you can run the following command:
$ tar -xvzf solr-<major-minor version>.tgz
Another way is to build Apache Solr from source. This is required if you are going to modify or extend the Apache Solr source for your own handlers, plugins, and so on. You need the Java SE 7 JDK (Java Development Kit) or JRE (Java Runtime Environment), an Apache Ant distribution (1.8.2 or more), and Apache Ivy (2.2.0+). You can compile the source by simply navigating to the Solr folder and running ant from there.
More information can be found at https://wiki.apache.org/solr/HowToCompileSolr
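Before building, it can help to confirm that the installed tools meet the minimum versions mentioned above. The following is a minimal sketch of a dotted version comparison, assuming purely numeric version components such as 1.8.2 or 2.2.0. The function name version_ge is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if dotted version $1 is greater than or equal to $2.
# Assumes purely numeric components, for example 1.8.2 or 2.2.0.
version_ge() {
  local IFS=.
  local -a have=($1) want=($2)        # split both versions on dots
  local i
  for ((i = 0; i < ${#have[@]} || i < ${#want[@]}; i++)); do
    local h="${have[i]:-0}" w="${want[i]:-0}"   # missing components count as 0
    if ((10#$h > 10#$w)); then return 0; fi
    if ((10#$h < 10#$w)); then return 1; fi
  done
  return 0                            # all components are equal
}

# Example usage, with the version reported by your Ant installation:
#   version_ge "1.9.4" "1.8.2" && echo "Ant is recent enough"
```

Comparing component by component avoids the trap of plain string comparison, where 1.10.0 would sort before 1.8.2.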
When you unzip Solr, it extracts the following folders:
contrib/: This folder contains all the libraries that are additional to Solr, and they can be included on demand. They provide libraries for data import handler, MapReduce, Apache UIMA, velocity template, and so on.
dist/: This folder provides the distributions of Solr and other useful libraries such as SolrJ, UIMA, and MapReduce. We will be looking at this in the next chapter.
docs/: This folder contains documentation for Apache Solr.
example/: This folder provides Jetty-based Solr web apps that can be directly used. We are going to use this folder for running Apache Solr.
licenses/: This folder contains all the licenses of the underlying libraries used by Solr.
Now, declare $JAVA_HOME to point to your JDK/JRE. You will find the Jetty server in the solr<version>/example folder. Once you unzip solr-<major-minor version>.tgz, all you need to do is go to solr<version>/example and run the following command:
$ $JAVA_HOME/bin/java -jar start.jar
If you are using the latest release of Solr (Solr 5.0), you need to go to the solr-5.0.0
folder and run the following command:
$ bin/solr start
The instructions for Solr 5.0 are available at:
https://cwiki.apache.org/confluence/display/solr/Solr+Start+Script+Reference
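The two ways of starting Solr described above can be told apart by looking at the folder layout. The following is a minimal sketch, assuming that a bin/solr script indicates the newer start script and that example/start.jar indicates the Jetty launcher. The function name solr_start_command is hypothetical, and it only prints the command rather than running it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the start command that matches a Solr folder,
# based on which launcher the distribution contains.
solr_start_command() {
  local solr_dir="$1"
  if [ -x "$solr_dir/bin/solr" ]; then
    # Start script, as described above for Solr 5.0
    echo "$solr_dir/bin/solr start"
  elif [ -f "$solr_dir/example/start.jar" ]; then
    # Jetty launcher in the example folder
    echo "cd $solr_dir/example && \$JAVA_HOME/bin/java -jar start.jar"
  else
    echo "no known Solr launcher found in $solr_dir" >&2
    return 1
  fi
}

# Example usage:
#   solr_start_command /opt/solr-5.0.0
```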
The default Jetty instance will run on port 8983, and you can access the Solr instance by visiting the following URL: http://localhost:8983/solr/browse. It shows a default search screen as shown in the following screenshot:
If your system's default locale or character set is not English (that is, not en/en-US), you can, for the sake of safety, override your system defaults for Solr by passing -Duser.language=en -Duser.country=US to your Jetty to ensure smooth running of Solr.
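The locale override can be applied conditionally. The following is a minimal sketch, assuming the locale is given in the usual form of the LANG environment variable, such as en_US.UTF-8 or tr_TR.UTF-8. The function name locale_overrides is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the JVM flags that force an English locale,
# or print nothing if the given locale (for example the value of $LANG)
# is already English.
locale_overrides() {
  case "$1" in
    en|en_*|en-*) ;;                                  # already English, no override needed
    *) echo "-Duser.language=en -Duser.country=US" ;;
  esac
}

# Example usage when starting the Jetty-based demo server:
#   $JAVA_HOME/bin/java $(locale_overrides "$LANG") -jar start.jar
```

Note that an unset or non-English value (including C or POSIX) results in the override being printed, which errs on the side of safety.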
It is relatively easy to set up Apache Solr on any J2EE container. It requires deploying the Apache Solr application war file using the standard J2EE application deployment of the container. One additional step is to tell the Apache Solr application the location of the Apache Solr home folder. This can be set through Java options, either by setting the following environment variable or by updating the container startup script:
$ export JAVA_OPTS="$JAVA_OPTS -Dsolr.solr.home=/opt/solr/example"
Alternatively, you can configure JNDI lookup for the java:comp/env/solr/home resource by pointing it to the Solr home folder. In Tomcat, this can be done by creating a context XML file with a name of your choice (for example, $CATALINA_HOME/conf/Catalina/localhost/context.xml) and adding the following entries:
<?xml version="1.0" encoding="utf-8"?>
<Context docBase="<solr-home>/example/solr/solr.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="<solr-home>/example/solr" override="true"/>
</Context>
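The placeholders in the context file can be filled in by a small script. The following is a minimal sketch that prints the same descriptor with the war path and Solr home substituted. The function name solr_context_xml and the example paths are hypothetical; use the locations from your own installation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a Tomcat context descriptor for Solr.
# $1 = path to solr.war, $2 = Solr home folder (both supplied by the caller).
solr_context_xml() {
  local war="$1" solr_home="$2"
  cat <<EOF
<?xml version="1.0" encoding="utf-8"?>
<Context docBase="$war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="$solr_home" override="true"/>
</Context>
EOF
}

# Example usage (paths are placeholders):
#   solr_context_xml /opt/solr/example/solr/solr.war /opt/solr/example/solr \
#     > "$CATALINA_HOME/conf/Catalina/localhost/context.xml"
```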
Once you are done with the installation of Apache Solr, you can simply run examples by going to the example/exampledocs folder and running:
java -jar post.jar solr.xml monitor.xml
post.jar is a utility provided by Solr to upload the data to Apache Solr for indexing. When it is run, the post.jar file simply uploads all the files that are passed as a parameter to Apache Solr for indexing, and Solr indexes these files and stores them in its repository. Now, try accessing your instance by typing http://localhost:8983/solr/browse; you should find a sample search interface with some information in it, as shown in the following screenshot:
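If your Solr instance does not listen on the default address, post.jar can be pointed at a different update URL. The following is a minimal sketch, assuming the url system property (-Durl) that the post.jar utility reads for its target. The function name post_command is hypothetical, and it only prints the command so that you can inspect it before running it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (not run) the post.jar command for a set of files.
# $1 = update URL, remaining arguments = files to index.
post_command() {
  local url="$1"
  shift
  echo "java -Durl=$url -jar post.jar $*"
}

# Example usage from the exampledocs folder:
#   post_command http://localhost:8983/solr/update solr.xml monitor.xml
```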
Apache Solr provides an excellent user interface for administering the server, which can be accessed at http://localhost:8983/solr. Apache Solr has the concepts of collections and cores. A collection in Apache Solr is a set of Solr documents that represents one complete logical index. A Solr core is an execution unit of Solr that runs with its own configuration and other metadata. Apache Solr collections can be created for each index. Similarly, you can run Solr in multiple core modes.
Option | Purpose |
---|---|
Dashboard | This shows information related to version, memory consumption, JVM, and so on. |
Logging | Shows log outputs with the latest logs on top |
Logging > Level | Shows the current log configuration for packages, that is, for which packages the logs are enabled |
Core Admin | Shows information about core, and allows its administration |
Java Properties | Shows different Java properties set when Solr is running |
Thread Dump | Describes the stack trace with information on CPU and user time; also enables a detailed stack trace |
collection1 | Demonstrates different parameters of collection, and all the activities you can perform, such as running queries and ping status |
The following table shows some of the important URLs configured with Apache Solr by default:
| URL | Purpose |
|---|---|
| | For processing search queries, the primary request handler provided with Solr is SearchHandler. It delegates to a sequence of search components. |
| | Same SearchHandler for JSON-based requests. |
| | Real-time get handler, guaranteed to return the latest stored fields of any document, without the need to commit or open a new searcher. The current implementation relies on the updateLog feature being enabled in the JSON format. |
| | This URL provides a faceted web-based search, primary interface. |
| | Solr accepts posted XML messages that Add/Replace, Commit, Delete, and Delete by query, by using the |
| | This URL is specific for CSV messages, |
| | This URL is specific for messages in the JSON format, |
| | This URL provides an interface for analyzing the fields. It provides the ability to specify multiple field types and field names in the same request, and outputs index-time and query-time analysis for each of them. It also uses FieldAnalysisRequestHandler internally. |
| | This URL provides an interface for analyzing the documents. |
| | |
| | |
| | Supports replicating indexes across different Solr servers, used by masters and slaves for data sharing. Uses |
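To try the search handler from the command line, you can build a query URL and pass it to a tool such as curl. The following is a minimal sketch, assuming the search handler is mapped to /select under the core name (an assumption here, since the handler paths are not listed in the table) and using the wt=json parameter to request a JSON response. The function name solr_select_url is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a search URL for a core.
# Assumes the search handler is mapped to /select (not confirmed by the table above).
# $1 = Solr base URL, $2 = core name, $3 = query string.
solr_select_url() {
  local base="$1" core="$2" query="$3"
  local encoded="${query// /+}"       # minimal encoding: spaces only
  echo "$base/$core/select?q=$encoded&wt=json"
}

# Example usage:
#   curl "$(solr_select_url http://localhost:8983/solr collection1 '*:*')"
```

The encoding here only handles spaces; queries with other reserved characters would need full URL encoding.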
In this section, we will try and understand the common problems faced while running Solr instances:
java.lang.UnsupportedClassVersionError: org.apache.solr.servlet.SolrDispatchFilter : Unsupported major.minor version 51.0
This error occurs when the Java runtime is older than the Java version that Apache Solr was compiled for. In this case, you need Java Version 7 or more. The following values are the Java versions with class version mapping:
J2SE 8 = 52, J2SE 7 = 51, J2SE 6.0 = 50, J2SE 5.0 = 49, JDK 1.4 = 48, JDK 1.3 = 47, JDK 1.2 = 46, JDK 1.1 = 45
So, you need to use Java Version 7 to run Apache Solr. If you have any other Java run-time setup on your machine for the existing applications, and do not wish to disturb it, simply download JRE in a folder and run the Solr start command (as explained in the previous section) by calling Java of the new JRE.
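The class version number in the error message can be translated back to a Java release with simple arithmetic. The following is a minimal sketch that follows the mapping listed above, where each release is the class version minus 44. The function name class_version_to_java is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a class file major version to the Java release,
# following the mapping above (45 is JDK 1.1 up to 48 is JDK 1.4,
# then 49 is 5.0, 50 is 6.0, 51 is 7, and 52 is 8).
class_version_to_java() {
  local major="$1"
  if [ "$major" -lt 45 ]; then
    echo "unknown"
    return 1
  fi
  local release=$((major - 44))
  if [ "$release" -le 4 ]; then
    echo "1.$release"                 # JDK 1.1 to JDK 1.4
  else
    echo "$release"                   # Java 5 and later
  fi
}

# Example usage for the error shown above:
#   class_version_to_java 51
```

For the error in this section, version 51 maps to Java 7, which matches the requirement stated above.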
java.lang.OutOfMemoryError. How to fix it?
The Out-of-Memory error is thrown by the Java Virtual Machine (JVM) running Apache Solr when there is not enough memory available for the heap or for PermGen (the Permanent Generation space, which holds metadata regarding user classes and methods). When you get such an error, you need to restart the container. However, while restarting the container, you must make sure that you increase the memory of the JVM. This can be done by adding the following JVM arguments for PermGen:
export JVM_ARGS="-Xmx1024m -XX:MaxPermSize=256m"
For correcting the heap space error, you can specify the following JVM arguments:
export JVM_ARGS="-Xms1024m -Xmx1024m"
Please note that the memory sizes shown here are only examples; choose values that suit your own environment. Visit http://jvmmemory.com/ to create these JVM arguments for setting the correct JVM variables.
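The two sets of arguments above can be produced from the sizes you choose. The following is a minimal sketch that builds the heap arguments and, optionally, the PermGen argument. The function name build_jvm_args is hypothetical. Note that PermGen exists only up to Java 7, so the MaxPermSize argument applies to those versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the JVM memory arguments described above.
# $1 = heap size in MB, $2 = optional PermGen size in MB.
build_jvm_args() {
  local heap_mb="$1" permgen_mb="${2:-}"
  local args="-Xms${heap_mb}m -Xmx${heap_mb}m"
  if [ -n "$permgen_mb" ]; then
    args="$args -XX:MaxPermSize=${permgen_mb}m"
  fi
  echo "$args"
}

# Example usage:
#   export JVM_ARGS="$(build_jvm_args 1024 256)"
```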