SolrJ is Solr's Java client API that insulates you from the dirty details of parsing and sending messages back and forth between your application and Solr. More than just a client API, it is also the way to run Solr embedded in your code instead of communicating to one remotely over HTTP—more on that later.
Although you don't have to use SolrJ to communicate with Solr, it has some great performance features that may even tempt non-Java applications to use a little Java (or run on a JVM) in order to use it.
Aside from performance, SolrJ has some other nice qualities too. It can automatically map Solr documents to your Java classes, and vice versa, simply by following JavaBean naming conventions and/or using annotations. And it has API convenience methods for most of Solr's search and response format options, which are often absent in other client APIs.
To illustrate the use of SolrJ, we've written some sample code in ./examples/9/solrj/ to index and search the web pages downloaded from MusicBrainz.org. The web pages were crawled with Heritrix, an open source web crawler used by the Internet Archive. Heritrix is included with the code supplement to the book at ./examples/9/crawler/heritrix-2.0.2/, as well as an ARC archive file of a crawl deep within the jobs/ subdirectory.
Most of the code is found in BrainzSolrClient.java, which has a main method. There is also a JUnit test that calls main() with various arguments. You will notice that BrainzSolrClient demonstrates different ways of connecting to Solr and different ways of searching for and indexing documents.
Many Java projects are built with Apache Maven, and even if yours isn't, the information here is adaptable. Solr's artifacts and required dependencies are all published to Maven Central with decent POMs. Over time, the dependency information has gotten better, but it is nonetheless subject to change, and you might find it necessary to add exclusions related to logging or other concerns. For example, SolrJ 4.8 declares a dependency on Log4j even though it doesn't use it directly; SolrJ 4.9 doesn't declare this dependency.
If your code needs SolrJ to communicate remotely to Solr, then declare a dependency on SolrJ:
<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-solrj</artifactId>
  <version>${solr.version}</version>
  <scope>compile</scope>
</dependency>
Due to transitive dependency resolution, this will automatically inherit SolrJ's dependencies: commons-io, httpclient, httpcore, and httpmime (from the Apache HttpComponents project), zookeeper (only used for SolrCloud), noggit, and wstx-asl. SolrJ 4.8 erroneously includes Log4j too. The wstx-asl dependency (Woodstox) isn't actually needed; it has been included with SolrJ since the days of Java 1.6, when Java's built-in XML processing implementation was substantially slower than Woodstox. Speaking of which, SolrJ 4.7 onwards requires Java 7.
If you wish to use EmbeddedSolrServer for embedding Solr, then add the solr-core artifact instead. Note that this brings in a ton of transitive dependencies, since you're running Solr in-process; some of these might have versions incompatible with the ones your application uses.
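The corresponding Maven declaration follows the same shape as the SolrJ one; this sketch assumes the same ${solr.version} property:

```xml
<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-core</artifactId>
  <version>${solr.version}</version>
  <scope>compile</scope>
</dependency>
```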
If you wish to write plugins for Solr and test them using Solr's excellent test infrastructure, then add a test dependency on the solr-test-framework artifact before solr-core or other Lucene/Solr artifacts. If the ordering is wrong, you should see a helpful error.
Unfortunately, the world of logging in Java is a mess of frameworks. Java includes one, but few use it, for a variety of reasons. What just about any Java application should do (particularly the ones built with Maven or that produce a POM to publish dependency metadata) is explicitly declare the logging dependencies; don't leave it to transitive resolution. If you prefer Log4j, the most popular one, then the dependency list is slf4j-api, jcl-over-slf4j, slf4j-log4j12, and finally log4j.
If the project is a direct plugin into Solr, then declare none, except perhaps for testing purposes, since the plugin will inherit Solr's. If the project is a library/module that doesn't wish to insist that its clients use a particular framework, then depend on just slf4j-api and jcl-over-slf4j, and declare any others as optional or scope them to test so that they aren't transitively required.
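As a sketch, the Log4j stack described above might be declared like this in a POM; the version properties are illustrative placeholders, while the group and artifact IDs are those used on Maven Central:

```xml
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-api</artifactId>
  <version>${slf4j.version}</version>
</dependency>
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>jcl-over-slf4j</artifactId>
  <version>${slf4j.version}</version>
</dependency>
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-log4j12</artifactId>
  <version>${slf4j.version}</version>
</dependency>
<dependency>
  <groupId>log4j</groupId>
  <artifactId>log4j</artifactId>
  <version>${log4j.version}</version>
</dependency>
```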
The first class in SolrJ's API to learn about is the SolrServer class (package org.apache.solr.client.solrj). In Solr 5, it was renamed SolrClient, with its subclasses following suit. As its name suggests, it represents an instance of Solr. Usually it's a client to a remote instance but, in the case of EmbeddedSolrServer, it's the real thing. SolrServer is an abstract class with multiple implementations to choose from:
HttpSolrServer: This is generally the default choice for communicating with Solr.

ConcurrentUpdateSolrServer: This wraps HttpSolrServer, handling document additions asynchronously with a buffer and multiple concurrent threads that independently stream data to Solr for high indexing throughput. It is ideal for bulk-loading data (that is, a re-index), even for SolrCloud. In Solr 3, this class was named StreamingUpdateSolrServer.

LBHttpSolrServer: This wraps multiple HttpSolrServers with load-balancing behavior, using a round-robin algorithm and temporary host blacklisting when connection problems occur. It's usually inappropriate for indexing purposes.

CloudSolrServer: This wraps LBHttpSolrServer but communicates with the ZooKeeper ensemble that manages a SolrCloud cluster to make intelligent routing decisions for both searching and indexing. Compared to HttpSolrServer, this reduces latency and has enhanced resiliency when a replica becomes unavailable. If you are using SolrCloud, this is the implementation to use.

EmbeddedSolrServer: This is a real local Solr instance without HTTP, and less (but some) message serialization. More on this later.

With the exception of EmbeddedSolrServer, they are easy to instantiate with simple constructors. Here's how to instantiate HttpSolrServer, whose only parameter is the URL to the Solr instance, including the core or collection name:
public SolrServer createRemoteSolr() {
  return new HttpSolrServer("http://localhost:8983/solr/crawler");
}
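If you are running SolrCloud, instantiating CloudSolrServer is similarly brief. Here's a sketch, assuming a ZooKeeper ensemble reachable at localhost:2181 and a collection named crawler; adjust both to your deployment:

```java
public SolrServer createCloudSolr() {
  // The constructor takes the ZooKeeper connection string, not a Solr URL.
  CloudSolrServer server = new CloudSolrServer("localhost:2181");
  // Requests that don't name a collection go to this one.
  server.setDefaultCollection("crawler");
  return server;
}
```

Because CloudSolrServer routes via cluster state from ZooKeeper, you don't point it at any particular Solr node, and it adapts automatically as replicas come and go.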
SolrJ uniquely supports the ability to communicate with Solr using a custom binary format it calls javabin, which is more compressed and efficient to read and write than XML. However, the javabin format has changed on occasion, and when it does, it can force you to use the same (or sometimes newer) version on the client. By default, SolrJ sends requests in XML and it asks for responses back in javabin. Here's a code snippet to consistently toggle XML versus javabin for both request and responses:
if (useXml) { // XML, very compatible
  solrServer.setRequestWriter(new RequestWriter()); // XML
  solrServer.setParser(new XMLResponseParser());
} else { // javabin, sometimes Solr-version sensitive
  solrServer.setRequestWriter(new BinaryRequestWriter());
  solrServer.setParser(new BinaryResponseParser());
}
Performing a search is very straightforward:
SolrQuery solrQuery = new SolrQuery("Smashing Pumpkins");
solrQuery.setRequestHandler("/select");
QueryResponse response = solrServer.query(solrQuery);
SolrDocumentList docList = response.getResults();
SolrQuery extends SolrParams to add convenience methods around some common query parameters. SolrDocumentList is a List<SolrDocument> plus the numFound, start, and maxScore metadata. For an alternative to working with a SolrDocument, see the Annotating your JavaBean – an alternative section ahead. A little-known alternative to the query method is queryAndStreamResponse, which takes a callback that SolrJ calls for each document it parses from the underlying stream. It can be used to stream large responses from Solr more efficiently, reducing latency and memory use, although it only applies to the returned documents, not to any other response information.
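As a rough sketch of that callback approach, using SolrJ's StreamingResponseCallback (the query and the url field here are arbitrary choices for illustration):

```java
SolrQuery solrQuery = new SolrQuery("*:*");
solrServer.queryAndStreamResponse(solrQuery, new StreamingResponseCallback() {
  @Override
  public void streamDocListInfo(long numFound, long start, Float maxScore) {
    // Called once with the list metadata before any documents arrive.
    System.out.println("numFound: " + numFound);
  }

  @Override
  public void streamSolrDocument(SolrDocument doc) {
    // Called once per document as it's parsed; process it here rather
    // than buffering the whole result set in memory.
    System.out.println(doc.getFieldValue("url"));
  }
});
```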
Here's another example of adding faceting to find out the most popular hosts and paths indexed by the crawler:
SolrQuery solrQuery = new SolrQuery("*:*");
solrQuery.setRows(0); // just facets, no docs
solrQuery.addFacetField("host", "path"); // facet on both
solrQuery.setFacetLimit(10);
solrQuery.setFacetMinCount(2);
QueryResponse response = solr.query(solrQuery);
for (FacetField facetField : response.getFacetFields()) {
  System.out.println("Facet: " + facetField.getName());
  for (FacetField.Count count : facetField.getValues()) {
    System.out.println("  " + count.getName() + ":" + count.getCount());
  }
}
The QueryResponse class has a lot of methods to access the various aspects of a Solr search response; it's pretty straightforward. One method of interest is getResponse, which returns a NamedList. If there is some information in Solr's response that doesn't have a convenience method, you'll have to resort to using that method to traverse the response tree to get the data you want.
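For instance, here's a sketch of pulling the QTime value out of the standard responseHeader section of the raw response tree (both keys are standard parts of a Solr response):

```java
// Drill into the raw response for data lacking a convenience method.
NamedList<Object> raw = response.getResponse();
NamedList<?> header = (NamedList<?>) raw.get("responseHeader");
Integer qTime = (Integer) header.get("QTime");
System.out.println("Query time (ms): " + qTime);
```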
To index a document with SolrJ, you need to create a SolrInputDocument, populate it, and give it to the SolrServer. What follows is an excerpt from the code for the book that indexes a web-crawled document:
void indexAsSolrDocument(ArchiveRecordHeader meta, String htmlStr) throws Exception {
  SolrInputDocument doc = new SolrInputDocument();
  doc.setField("url", meta.getUrl(), 1.0f);
  doc.setField("mimeType", meta.getMimetype(), 1.0f);
  doc.setField("docText", htmlStr);
  URL url = new URL(meta.getUrl());
  doc.setField("host", url.getHost());
  doc.setField("path", url.getPath());
  solrServer.add(doc); // or could batch in a collection
}
If one of these fields were multivalued, then we would call addField for each value instead of setField, which is what the preceding code uses.
Depending on your commit strategy, you may want to call commit()
. The semantics of committing documents are described in Chapter 4, Indexing Data.
Unless you are using ConcurrentUpdateSolrServer, you will want to do some amount of batching. This means passing a Java Collection of documents to the add method instead of passing just one at a time. The Sending data to Solr in bulk section of Chapter 10, Scaling Solr, shows how much this improved performance in a benchmark.
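A minimal sketch of such batching follows; the batch size of 100 is an arbitrary illustration, and docsToIndex stands in for wherever your documents come from:

```java
// Accumulate documents and send them to Solr in groups.
Collection<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
for (SolrInputDocument doc : docsToIndex) { // docsToIndex: your source of documents
  batch.add(doc);
  if (batch.size() >= 100) {
    solrServer.add(batch); // one HTTP request for the whole batch
    batch.clear();
  }
}
if (!batch.isEmpty()) {
  solrServer.add(batch); // flush the remainder
}
```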
If you already have a class holding the data to index under your control (versus a third-party library), you may prefer to annotate your class's setters or fields with SolrJ's @Field annotation instead of working with SolrInputDocument and SolrDocument. It might be easier to maintain and less code, if a little slower. Here's an excerpt from the book's sample code with an annotated class RecordItem:
package solrbook;

import org.apache.solr.client.solrj.beans.Field;

public class RecordItem {
  //@Field("url") COMMENTED to show you can put the annotation on a setter
  String id;
  @Field String mimeType;
  @Field("docText") String html;
  @Field String host;
  @Field String path;

  public String getId() {
    return id;
  }

  @Field("url")
  void setId(String id) {
    this.id = id;
  }

  //… other getters and setters
}
To search and retrieve RecordItem instances instead of SolrDocuments, you simply call this method on QueryResponse:
List<RecordItem> items = response.getBeans(RecordItem.class);
Indexing a RecordItem is simple too:
solrServer.addBean(item);
One of the most interesting aspects of SolrJ is that, because Solr and SolrJ are both written in Java, you can instantiate Solr and interact with it directly instead of starting up Solr as a separate process. This eliminates the HTTP layer and request serialization too. The response is still serialized; however, the returned documents can avoid this by using queryAndStreamResponse, as mentioned earlier. We'll describe further why you might or might not want to embed Solr, but let's start with a code example. As you can see, starting up an embedded Solr is more complex than any other type:
public static SolrServer createEmbeddedSolr(String instanceDir) throws Exception {
  String coreName = new File(instanceDir).getName();
  // note: this is more complex than it should be. See SOLR-4502
  SolrResourceLoader resourceLoader = new SolrResourceLoader(instanceDir);
  CoreContainer container = new CoreContainer(resourceLoader,
      ConfigSolr.fromString(resourceLoader, "<solr />"));
  container.load();
  Properties coreProps = new Properties();
  //coreProps.setProperty(CoreDescriptor.CORE_DATADIR, dataDir);//"dataDir" (optional)
  CoreDescriptor descriptor = new CoreDescriptor(container, coreName,
      instanceDir, coreProps);
  SolrCore core = container.create(descriptor);
  container.register(core, false); // not needed in Solr 4.9+
  return new EmbeddedSolrServer(container, coreName);
}
Keep in mind that your application embedding Solr will then take on all of Solr's dependencies, of which there are many.
In my opinion, a great use case for embedding Solr is unit testing! Starting up an embedded Solr configured to put its data into memory via RAMDirectoryFactory is efficient, and it's much easier to incorporate into tests than awkwardly attempting to use a real Solr instance. Note that using EmbeddedSolrServer in tests implies that your application shouldn't hardcode how it instantiates its SolrServer, since tests will want to supply it. If you wish to test while communicating with Solr over HTTP, then take a look at JettySolrRunner, a convenience class in the same package as EmbeddedSolrServer, with a main method that starts Jetty and Solr. Depending on how you use this class, this is another way to embed Solr without having to manage another process. Yet another option to be aware of is mostly relevant when testing custom extensions to Solr. In that case, you won't use a SolrServer abstraction; your test will extend SolrTestCaseJ4, which embeds Solr and has a ton of convenience methods. For more information, review the variety of Solr's tests that use that class and learn by example.
What about using it in other places besides tests? No application needs to embed Solr, but some apps may find it preferable. Fundamentally, embedded Solr is in-process (with the application) and doesn't listen on a TCP/IP port. It's easy to see that standalone Java-based desktop applications may prefer this model. Another use case seen in Solr's MapReduce contrib module and, in at least a couple of open source projects in the wild, is to decouple index building from the search server. The process that produces the document indexes it to disk with an embedded Solr instead of communicating remotely to one. Communicating to a standalone Solr process would of course also work but it's operationally more awkward.
After the index is built, it's copied to where a standalone Solr search process is running (this can be skipped for shared filesystems). If the index needs to be merged into an existing index instead of replacing or updating one, then it is merged with the MERGEINDEXES core admin command. A final commit to the search process will make the new index data visible in search results.
One particular reason people seek to embed Solr is an anticipated performance increase, particularly during indexing. However, there is rarely a convincing performance win in doing so, because the savings are usually negligible compared to all the indexing work that Solr has to do, such as tokenizing text, inverting documents, and of course writing to disk. Nonetheless, there are always exceptions (such as when leveraging Hadoop as a builder), and you might have such a case.