SolrJ – Solr's Java client API

SolrJ is Solr's Java client API that insulates you from the dirty details of parsing and sending messages back and forth between your application and Solr. More than just a client API, it is also the way to run Solr embedded in your code instead of communicating to one remotely over HTTP—more on that later.

Although you don't have to use SolrJ to communicate with Solr, it's got some great performance features that may even tempt non-Java applications to use a little Java (or run on a JVM) in order to use it. The following are the features:

  • It communicates with Solr via an efficient and compact binary message format (still over HTTP) called javabin. It can still do standard XML if desired (useful if your client and your server are different versions).
  • It streams documents to Solr asynchronously in multiple threads for maximizing indexing throughput. This is not the default but it's easy to configure it this way.
  • It routes documents and search requests to a SolrCloud cluster intelligently by examining the cluster state in ZooKeeper. A document can be delivered directly to the leader of the appropriate shard, and searches are load-balanced against available replicas, possibly obviating the need for an independent load balancer. Read more about SolrCloud in Chapter 10, Scaling Solr.

Aside from performance, SolrJ has some other nice qualities too. It can automatically map Solr documents to your Java class, and vice versa, simply by following JavaBean naming conventions and/or using annotations. And it has API convenience methods for most of Solr's search and response format options that are often absent in other client APIs.

The sample code – BrainzSolrClient

To illustrate the use of SolrJ, we've written some sample code in ./examples/9/solrj/ to index and search the web pages downloaded from MusicBrainz.org. The web pages were crawled with Heritrix, an open source web crawler used by the Internet Archive. Heritrix is included with the code supplement to the book at ./examples/9/crawler/heritrix-2.0.2/, as well as an ARC archive file of a crawl deep within the jobs/ subdirectory.

Most of the code is found in BrainzSolrClient.java, which has a main method. There is also a JUnit test that calls main() with various arguments. You will notice that BrainzSolrClient demonstrates different ways of connecting to Solr and different ways of indexing and searching for documents.

Dependencies and Maven

Many Java projects are built with Apache Maven, and even if yours isn't, the information here is adaptable. Solr's artifacts and required dependencies are all published to Maven Central with decent POMs. Over time, the dependency information has gotten better but nonetheless it is subject to change and you might find it necessary to add exclusions related to logging or something else. For example, SolrJ 4.8 declares a dependency on Log4j even though it doesn't use it directly; SolrJ 4.9 doesn't declare this dependency.

Tip

Run mvn dependency:tree to see what your project's dependencies are and look for problems, such as incompatible or missing logging jars.

If your code needs SolrJ to communicate remotely to Solr, then declare a dependency on SolrJ:

<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-solrj</artifactId>
  <version>${solr.version}</version>
  <scope>compile</scope>
</dependency>

Due to transitive dependency resolution, this will automatically inherit SolrJ's dependencies: commons-io, httpclient, httpcore, and httpmime (from the Apache HttpComponents project), zookeeper (only used for SolrCloud), noggit, and wstx-asl. SolrJ 4.8 erroneously includes Log4j too. The wstx-asl dependency (Woodstox) isn't actually needed; it has been included with SolrJ since the days of Java 1.6 when Java's built-in XML processing implementation was substantially slower than Woodstox. Speaking of which, SolrJ 4.7 and onwards requires Java 7.

Note

SolrJ has additional logging dependencies that won't transitively resolve: jcl-over-slf4j (a commons-logging bridge) and an SLF4J target logger. See the next subsection on logging.

If you wish to use EmbeddedSolrServer for embedding Solr, then declare a dependency on the solr-core artifact instead. Note that this brings in a ton of transitive dependencies since you're running Solr in-process; some of these might have versions incompatible with the ones your application uses.

If you wish to write plugins to Solr and test Solr using Solr's excellent test infrastructure, then add a test dependency on the artifact solr-test-framework before solr-core or other Lucene/Solr artifacts. If the ordering is wrong, then you should see a helpful error.
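Sketching what this looks like in the POM, with the test framework declared first so its jars win on the test classpath (the same ${solr.version} property is assumed):

<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-test-framework</artifactId>
  <version>${solr.version}</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-core</artifactId>
  <version>${solr.version}</version>
  <scope>compile</scope>
</dependency>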

Declaring logging dependencies

Unfortunately, the world of logging in Java is a mess of frameworks. Java includes one but few use it for a variety of reasons. What just about any Java application should do (particularly the ones built with Maven or that produce a POM to publish dependency metadata) is explicitly declare the logging dependencies; don't leave it to transitive resolution. If you prefer Log4j, the most popular one, then the dependency list is slf4j-api, jcl-over-slf4j, slf4j-log4j12, and finally log4j.
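As a sketch, those declarations might look like this in a POM, with a ${slf4j.version} property you define yourself (pick versions appropriate to your project):

<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-api</artifactId>
  <version>${slf4j.version}</version>
</dependency>
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>jcl-over-slf4j</artifactId>
  <version>${slf4j.version}</version>
</dependency>
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-log4j12</artifactId>
  <version>${slf4j.version}</version>
</dependency>
<dependency>
  <groupId>log4j</groupId>
  <artifactId>log4j</artifactId>
  <version>1.2.17</version>
</dependency>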

If the project is a direct plugin into Solr, then declare none, except perhaps for testing purposes, since the plugin will inherit Solr's. If the project is a library/module that doesn't wish to insist that its clients use a particular framework, then just depend on slf4j-api and jcl-over-slf4j and then declare any others as optional or scope them to test scope so that they aren't transitively required.

The SolrServer class

The first class in SolrJ's API to learn about is the SolrServer class (package org.apache.solr.client.solrj). In Solr 5, it was renamed as SolrClient, with its subclasses following suit. As its name suggests, it represents an instance of Solr. Usually it's a client to a remote instance but, in the case of EmbeddedSolrServer, it's the real thing. SolrServer is an abstract class with multiple implementations to choose from:

  • HttpSolrServer: This is generally the default choice for communicating to Solr.
  • ConcurrentUpdateSolrServer: This wraps HttpSolrServer, handling document additions asynchronously with a buffer and multiple concurrent threads that independently stream data to Solr for high indexing throughput. It is ideal for bulk-loading data (that is, a re-index), even for SolrCloud. In Solr 3, this class was named StreamingUpdateSolrServer.

    Tip

    Don't forget to override handleError(); add() usually won't throw an error if something goes wrong.

  • LBHttpSolrServer: This wraps multiple HttpSolrServers with load-balancing behavior using a round-robin algorithm and temporary host blacklisting when connection problems occur. It's usually inappropriate for indexing purposes.
  • CloudSolrServer: This wraps LBHttpSolrServer but communicates to the ZooKeeper ensemble that manages a SolrCloud cluster to make intelligent routing decisions for both searching and indexing. Compared to HttpSolrServer, this reduces latency and has enhanced resiliency when a replica becomes unavailable. If you are using SolrCloud, this is the implementation to use.
  • EmbeddedSolrServer: This is a real local Solr instance without HTTP and with less (but still some) message serialization. More on this later.

    Tip

    Remember to call shutdown() or close() on the SolrServer when finished to properly release resources.

With the exception of EmbeddedSolrServer, they are easy to instantiate with simple constructors. Here's how to instantiate HttpSolrServer, whose only parameter is the URL to the Solr instance, which should include the core or collection name:

public SolrServer createRemoteSolr() {
  return new HttpSolrServer("http://localhost:8983/solr/crawler");
}
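
The other remote implementations are instantiated similarly. Here's a sketch (the ZooKeeper address, collection name, queue size, and thread count are placeholder values to adapt):

public SolrServer createCloudSolr() {
  CloudSolrServer cloud = new CloudSolrServer("zkhost1:2181");//ZooKeeper address
  cloud.setDefaultCollection("crawler");
  return cloud;
}

public SolrServer createConcurrentUpdateSolr() {
  //buffers up to 1000 documents, streamed to Solr by 4 threads
  return new ConcurrentUpdateSolrServer(
      "http://localhost:8983/solr/crawler", 1000, 4);
}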

Using javabin instead of XML for efficiency

SolrJ uniquely supports the ability to communicate with Solr using a custom binary format it calls javabin, which is more compressed and efficient to read and write than XML. However, the javabin format has changed on occasion, and when it does, it can force you to use the same (or sometimes newer) version on the client. By default, SolrJ sends requests in XML and it asks for responses back in javabin. Here's a code snippet to consistently toggle XML versus javabin for both request and responses:

if (useXml) {// xml, very compatible
  solrServer.setRequestWriter(new RequestWriter());//xml
  solrServer.setParser(new XMLResponseParser());
} else {//javabin, sometimes Solr-version sensitive
  solrServer.setRequestWriter(new BinaryRequestWriter());
  solrServer.setParser(new BinaryResponseParser());
}

Tip

We recommend that you make the XML / javabin choice configurable as we saw earlier, with the default being javabin. During an upgrade of Solr, your Solr client could be toggled to use XML temporarily.

Searching with SolrJ

Performing a search is very straightforward:

SolrQuery solrQuery = new SolrQuery("Smashing Pumpkins");
solrQuery.setRequestHandler("/select");
QueryResponse response = solrServer.query(solrQuery);
SolrDocumentList docList = response.getResults();

SolrQuery extends SolrParams to add convenience methods around some common query parameters. SolrDocumentList is a List<SolrDocument> plus the numFound, start, and maxScore metadata. For an alternative to working with a SolrDocument, see the Annotating your JavaBean – an alternative section ahead. A little known alternative to the query method is queryAndStreamResponse, which takes a callback that SolrJ calls for each document it parses from the underlying stream. It can be used to more efficiently stream large responses from Solr to reduce latency and memory, although it only applies to the returned documents, not to any other response information.
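Here's a sketch of that streaming approach using StreamingResponseCallback:

solrServer.queryAndStreamResponse(solrQuery,
    new StreamingResponseCallback() {
  @Override
  public void streamDocListInfo(long numFound, long start, Float maxScore) {
    System.out.println("numFound: " + numFound);
  }
  @Override
  public void streamSolrDocument(SolrDocument doc) {
    System.out.println(doc.getFieldValue("url"));//process each doc as it arrives
  }
});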

Here's another example of adding faceting to find out the most popular hosts and paths indexed by the crawler:

SolrQuery solrQuery = new SolrQuery("*:*");
solrQuery.setRows(0);//just facets, no docs
solrQuery.addFacetField("host","path");//facet on both
solrQuery.setFacetLimit(10);
solrQuery.setFacetMinCount(2);
QueryResponse response = solrServer.query(solrQuery);
for (FacetField facetField : response.getFacetFields()) {
  System.out.println("Facet: "+facetField.getName());
  for (FacetField.Count count : facetField.getValues()) {
    System.out.println(" " +count.getName()+":"+count.getCount());
  }
}

The QueryResponse class has a lot of methods to access the various aspects of a Solr search response; it's pretty straightforward. One method of interest is getResponse, which returns a NamedList. If there is some information in Solr's response that doesn't have a convenience method, you'll have to resort to using that method to traverse the response tree to get the data you want.
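
For example, here's a sketch of pulling the query time out of the response header by traversing the NamedList directly (this particular value also has a convenience method, getQTime, but it illustrates the approach):

NamedList<Object> raw = response.getResponse();
NamedList<?> header = (NamedList<?>) raw.get("responseHeader");
Integer qTime = (Integer) header.get("QTime");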

Indexing with SolrJ

To index a document with SolrJ, you need to create a SolrInputDocument, populate it, and give it to the SolrServer. What follows is an excerpt from the code for the book that indexes a web-crawled document:

void indexAsSolrDocument(ArchiveRecordHeader meta,
    String htmlStr) throws Exception {
  SolrInputDocument doc = new SolrInputDocument();
  doc.setField("url", meta.getUrl(), 1.0f);
  doc.setField("mimeType", meta.getMimetype(), 1.0f);
  doc.setField("docText", htmlStr);
  URL url = new URL(meta.getUrl());
  doc.setField("host", url.getHost());
  doc.setField("path", url.getPath());
  solrServer.add(doc); // or could batch in a collection
}

If one of these fields were multivalued, then we would call addField for each value instead of the setField calls seen in the preceding code.

Depending on your commit strategy, you may want to call commit(). The semantics of committing documents are described in Chapter 4, Indexing Data.

Tip

Unless you are using ConcurrentUpdateSolrServer, you will want to do some amount of batching. This means passing a Java Collection of documents to the add method instead of passing just one at a time. In the Sending data to Solr in bulk section of Chapter 10, Scaling Solr, there is more information showing how much it improved performance in a benchmark.
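
Here's a sketch of batching with an arbitrary batch size of 100, followed by a commit (docs stands in for wherever your documents come from):

Collection<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
for (SolrInputDocument doc : docs) {
  batch.add(doc);
  if (batch.size() >= 100) {//an arbitrary batch size
    solrServer.add(batch);
    batch.clear();
  }
}
if (!batch.isEmpty()) {
  solrServer.add(batch);
}
solrServer.commit();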

Deleting documents

Deleting documents is simple with SolrJ. In this example, we'll delete everything (*:* is the query that matches all documents):

solrServer.deleteByQuery( "*:*" );

To delete documents by their ID, call deleteById. As with the add method, it's overloaded to take a commitWithin argument, a number of milliseconds.
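
For example, here's a sketch that deletes a document by its ID (the url field serves as the unique key in the crawler schema) and asks Solr to commit within 10 seconds; the URL value is just a placeholder:

solrServer.deleteById("http://musicbrainz.org/", 10000);//commitWithin 10s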

Annotating your JavaBean – an alternative

If you already have a class holding the data to index under your control (versus a third-party library), you may prefer to annotate your class's setters or fields with SolrJ's @Field annotation instead of working with SolrInputDocument and SolrDocument. It might be easier to maintain and less code, if a little slower. Here's an excerpt from the book's sample code with an annotated class RecordItem:

package solrbook;
import org.apache.solr.client.solrj.beans.Field;

public class RecordItem {
  //@Field("url")  COMMENTED to show you can put the annotation on a setter
  String id;

  @Field String mimeType;

  @Field("docText") String html;

  @Field String host;

  @Field String path;

  public String getId() { return id; }

  @Field("url") void setId(String id) { this.id = id; }

  //… other getters and setters
}

To search and retrieve a RecordItem instance instead of a SolrDocument, you simply call this method on QueryResponse:

List<RecordItem> items = response.getBeans(RecordItem.class);

Indexing RecordItem is simple too:

solrServer.addBean(item);
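
And as with plain documents, beans can be batched; addBeans takes a collection:

solrServer.addBeans(items);//items is a Collection<RecordItem>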

Embedding Solr

One of the most interesting aspects of SolrJ is that, because Solr and SolrJ are both written in Java, you can instantiate Solr and interact with it directly instead of starting up Solr as a separate process. This eliminates the HTTP layer and request serialization too. The response is serialized; however, the returned documents can avoid it by using queryAndStreamResponse as mentioned earlier. We'll say more later about why you might or might not want to embed Solr, but let's start with a code example. As you can see, starting up an embedded Solr is more complex than any other type:

public static SolrServer createEmbeddedSolr(String instanceDir)
    throws Exception {
  String coreName = new File(instanceDir).getName();
  // note: this is more complex than it should be. See SOLR-4502
  SolrResourceLoader resourceLoader =
      new SolrResourceLoader(instanceDir);
  CoreContainer container = new CoreContainer(resourceLoader,
      ConfigSolr.fromString(resourceLoader, "<solr />"));
  container.load();
  Properties coreProps = new Properties();
  //coreProps.setProperty(CoreDescriptor.CORE_DATADIR, dataDir);//"dataDir" (optional)
  CoreDescriptor descriptor = new CoreDescriptor(
      container, coreName, instanceDir, coreProps);
  SolrCore core = container.create(descriptor);
  container.register(core, false);//not needed in Solr 4.9+
  return new EmbeddedSolrServer(container, coreName);
}

Note

A nonobvious limitation of instances of EmbeddedSolrServer is that it only enables you to interact with one SolrCore. Curiously, the constructor takes a core container, yet only the core named by the second parameter is accessible.

Keep in mind that your application embedding Solr will then take on all of Solr's dependencies, of which there are many.

When should you use embedded Solr? Tests!

In my opinion, a great use case for embedding Solr is unit testing! Starting up an embedded Solr configured to keep its data in memory with RAMDirectoryFactory is efficient, and it's much easier to incorporate into tests than awkwardly attempting to use a real Solr instance. Note that using EmbeddedSolrServer in tests implies that your application shouldn't hardcode how it instantiates its SolrServer, since tests will want to supply it. If you wish to test while communicating with Solr over HTTP, then take a look at JettySolrRunner, a convenience class in the same package as EmbeddedSolrServer, with a main method that starts Jetty and Solr. Depending on how you use this class, this is another way to embed Solr without having to manage another process. Yet another option to be aware of is mostly relevant when testing custom extensions to Solr. In that case, you won't use a SolrServer abstraction; your test will extend SolrTestCaseJ4, which embeds Solr and has a ton of convenience methods. For more information on this, review a variety of Solr's tests that use that class and learn by example.
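
For instance, here's a minimal JUnit 4 sketch; the BrainzSolrClient.createEmbeddedSolr call refers to the method shown earlier, and the test core path under src/test/resources is an assumption to adapt to your project layout:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class EmbeddedSolrTest {
  private static SolrServer solrServer;

  @BeforeClass
  public static void startSolr() throws Exception {
    //assumed path to a test core; configure RAMDirectoryFactory in its solrconfig.xml
    solrServer = BrainzSolrClient.createEmbeddedSolr("src/test/resources/solr/crawler");
  }

  @AfterClass
  public static void stopSolr() throws Exception {
    solrServer.shutdown();//release resources
  }

  @Test
  public void indexAndSearch() throws Exception {
    SolrInputDocument doc = new SolrInputDocument();
    doc.setField("url", "http://musicbrainz.org/");
    solrServer.add(doc);
    solrServer.commit();
    assertEquals(1,
        solrServer.query(new SolrQuery("*:*")).getResults().getNumFound());
  }
}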

What about using it in other places besides tests? No application needs to embed Solr, but some apps may find it preferable. Fundamentally, embedded Solr is in-process (with the application) and doesn't listen on a TCP/IP port. It's easy to see that standalone Java-based desktop applications may prefer this model. Another use case, seen in Solr's MapReduce contrib module and in at least a couple of open source projects in the wild, is to decouple index building from the search server. The process that produces the documents indexes them to disk with an embedded Solr instead of communicating remotely to one. Communicating to a standalone Solr process would of course also work, but it's operationally more awkward.

After the index is built, it's copied to where a standalone Solr search process is running (this can be skipped for shared filesystems). If the index needs to get merged into an existing index instead of replacing or updating one, then it is merged with the MERGEINDEXES core admin command. A final commit to the search process will make the new index data visible in search results.

One particular case in which people seek to embed Solr is for an anticipated performance increase, particularly during indexing. However, there is rarely a convincing performance win in doing so because the savings are usually negligible compared to all the indexing work that Solr has to do, such as tokenizing text, inverting documents, and of course writing to disk. Nonetheless, there are always exceptions (such as when leveraging Hadoop to build indexes), and you might have such a case.

Note

An alternative way to achieve in-process indexing is to write a custom RequestHandler class (possibly extending ContentStreamHandlerBase) that fetches and processes your data to your liking. It could be more convenient than using EmbeddedSolrServer depending on your use case.
