Index operations

This section shows you the basic commands needed for updating an index, by adding or removing documents. As a general note, each command we will see can be issued in at least two ways: using the command line, through the cURL tool, for example (a built-in tool in a lot of Linux distributions and available for all platforms); and using code (that is, SolrJ or some other client API). When you want to add documents, it's also possible to run those commands from the administration console.

Note

SolrJ and client APIs will be covered later in a dedicated chapter.

Another common aspect of these interactions is the Solr response, which always contains a status and a QTime attribute. The status is the return code of the executed command, and it is always 0 if the operation succeeds. The QTime attribute is the elapsed time of the execution, in milliseconds. This is an example of the response in XML format:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">97</int>
  </lst>
</response>
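
A client usually checks these two values before doing anything else. The following is a minimal Python sketch, using only the standard library, that reads them from a response like the preceding one. The function name parse_response_header is hypothetical; it is not part of Solr or of any client API.

```python
import xml.etree.ElementTree as ET

# The sample response shown above, embedded so the sketch runs without a server.
SAMPLE_RESPONSE = """<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">97</int>
  </lst>
</response>"""


def parse_response_header(xml_text):
    """Return (status, qtime) read from a Solr XML response (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    header = root.find("lst[@name='responseHeader']")
    if header is None:
        raise ValueError("responseHeader not found")
    # Collect every <int name="..."> child into a name -> value mapping.
    values = {node.get("name"): int(node.text) for node in header.findall("int")}
    return values["status"], values["QTime"]


status, qtime = parse_response_header(SAMPLE_RESPONSE)
print("succeeded" if status == 0 else "failed", "in", qtime, "ms")
```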

Add

The command sends one or more documents to add to Solr. The documents that are added are not visible until a commit or an optimize command is issued.

We already saw that documents are the unit of information in Solr. Here, depending on the format of the data, one or more documents are sent using the proper representation.

Since the attributes and the content of the message will be the same regardless of the format, the formal description of the message structure will be given once. The following is an add command in XML format:

<add commitWithin="10000" overwrite="true">
  <doc boost="1.9">
    <field name="id">12020</field>
    <field name="title" boost="2.2">Round around midnight</field>
    <field name="subject">Music</field>
    <field name="subject">Jazz</field>
  </doc>
  …
</add>

Let's discuss the preceding command in detail:

  • <add>: This is the root tag of the XML document and indicates the operation.
  • commitWithin: This is an alternative to the autocommit features we saw previously. Using this optional attribute, the requestor asks Solr to ensure that the documents will be committed within the given period of time, expressed in milliseconds.
  • overwrite: This tells Solr to check for existing documents with the same uniqueKey and, if any are found, overwrite them. If you don't have a uniqueKey, or you're confident that you won't ever add the same document twice, you can get some indexing performance improvements by explicitly setting this flag to false.
  • <doc>: This represents the document to be added.
  • boost (on <doc>): This is an optional attribute that specifies the boost for the whole document (that is, for each field). It defaults to 1.0.
  • <field>: This is a field of the document with just one value. If the field is multivalued, there will be several <field> elements with the same name and different values.
  • boost (on <field>): This is an optional attribute that specifies the boost for the specific field. It defaults to 1.0.
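
The structure just described can also be produced programmatically. The following is a minimal Python sketch, using only the standard library, that assembles an add message for a single document. The function build_add_command and its parameter names are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def build_add_command(fields, commit_within=None, overwrite=None,
                      doc_boost=None, field_boosts=None):
    """Build an <add> message holding one <doc> (hypothetical helper).

    fields maps a field name to one value or to a list of values;
    a list produces one <field> element per value (multivalued field).
    """
    add = ET.Element("add")
    if commit_within is not None:
        add.set("commitWithin", str(commit_within))
    if overwrite is not None:
        add.set("overwrite", "true" if overwrite else "false")

    doc = ET.SubElement(add, "doc")
    if doc_boost is not None:
        doc.set("boost", str(doc_boost))

    for name, value in fields.items():
        values = value if isinstance(value, list) else [value]
        for single in values:
            field = ET.SubElement(doc, "field", {"name": name})
            if field_boosts and name in field_boosts:
                field.set("boost", str(field_boosts[name]))
            field.text = str(single)
    return ET.tostring(add, encoding="unicode")


print(build_add_command(
    {"id": 12020, "title": "Round around midnight", "subject": ["Music", "Jazz"]},
    commit_within=10000, overwrite=True,
    doc_boost=1.9, field_boosts={"title": 2.2}))
```

Building the message with an XML library, rather than by string concatenation, means that special characters in field values are escaped automatically.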

The same data can be expressed in JSON as follows:

{
  "add": {
    "commitWithin": 10000,
    "overwrite": true,
    "boost": 1.9,
    "doc": {
      "id": 12020,
      "title": {
        "value": "Round around midnight",
        "boost": 2.2
      },
      "subject": ["Music", "Jazz"]
    }
  }
}

As you can see, the information is the same as in the previous example. The difference is in the encoding of the information according to the JSON format.
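
As an illustration of that encoding, the following minimal Python sketch builds a JSON add command with the standard json module. The function build_json_add is hypothetical, and the document-level boost is deliberately left out to keep the sketch short.

```python
import json


def build_json_add(fields, commit_within=None, overwrite=None, field_boosts=None):
    """Build a JSON add command for a single document (hypothetical helper)."""
    doc = {}
    for name, value in fields.items():
        if field_boosts and name in field_boosts:
            # A boosted field is written as a {"value": ..., "boost": ...} map.
            doc[name] = {"value": value, "boost": field_boosts[name]}
        else:
            # A Python list becomes a JSON array, that is, a multivalued field.
            doc[name] = value

    add = {}
    if commit_within is not None:
        add["commitWithin"] = commit_within
    if overwrite is not None:
        add["overwrite"] = overwrite
    add["doc"] = doc
    return json.dumps({"add": add}, indent=2)


print(build_json_add(
    {"id": 12020, "title": "Round around midnight", "subject": ["Music", "Jazz"]},
    commit_within=10000, overwrite=True, field_boosts={"title": 2.2}))
```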

Sending add commands

We can issue an add command in several ways: using cURL, the administration console, and a client API such as SolrJ.

The cURL tool is a command-line tool used to transfer data with URL syntax. Among other protocols, it supports HTTP and HTTPS, so it's perfect for sending commands to Solr. These are some examples of add commands sent using cURL:

# curl http://127.0.0.1:8983/solr/update -H "Content-type: text/xml" --data-binary @datafile.xml

# curl http://127.0.0.1:8983/solr/update -H "Content-type: text/xml" --data-binary '<add commitWithin="10000" overwrite="true">
  <doc boost="1.9">
    <field name="id">12020</field>
    <field name="subject">Jazz</field>
  </doc>
</add>'

The first example uses data contained in a file. The second (useful for short requests) embeds the documents directly in the value of the --data-binary option. The same approach works for JSON and CSV documents as well; only the data format and the Content-type header change.
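
The same request can be prepared from code. The following minimal Python sketch uses the standard urllib.request module to build the POST request that the cURL examples send. It assumes the same local endpoint as those examples; the function build_update_request is hypothetical. The request is only constructed here, because actually sending it requires a running Solr instance.

```python
import urllib.request

# Assumption: the same local Solr endpoint used in the cURL examples.
UPDATE_URL = "http://127.0.0.1:8983/solr/update"


def build_update_request(payload, content_type="text/xml"):
    """Prepare, without sending, a POST request for the update handler."""
    return urllib.request.Request(
        UPDATE_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": content_type},
        method="POST",
    )


request = build_update_request(
    '<add><doc><field name="id">12020</field></doc></add>')
print(request.get_method(), request.full_url)

# Sending the request needs a running Solr instance:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

For a JSON payload, the only change would be the content_type argument, in the same way as the Content-type header changes in the cURL command.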

Delete

A delete command will mark one or more documents as deleted. This means the target documents are not immediately removed from the index. Instead, a kind of tombstone is placed on them; when the next commit event happens, that data will be removed. Commits and optimizes are commands that make the update changes visible and available. In other words, they make those changes effectively part of the Solr index. We will see both of them later.

Solr allows us to identify the target documents in two different ways: by specifying a set of identifiers or by deleting all documents matched by a query. In the same way as we sent add commands, we can use cURL to issue delete commands:

# curl http://127.0.0.1:8983/solr/update -H "Content-type: text/xml" --data-binary @datafile_with_deletes.xml

# curl http://127.0.0.1:8983/solr/update -H "Content-type: text/xml" --data-binary '<delete>
  <id>92392</id>
  <query>publisher:"Ashler"</query>
</delete>'

In the second example, we issued a command to delete:

  • The document whose uniqueKey is 92392
  • All documents whose publisher field has the value Ashler
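
A delete message that combines identifiers and queries can be assembled in the same way as an add message. The following is a minimal Python sketch using only the standard library; the function build_delete_command is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_delete_command(ids=(), queries=()):
    """Build a <delete> message from identifiers and queries (hypothetical helper)."""
    delete = ET.Element("delete")
    for doc_id in ids:
        # One <id> element per uniqueKey value to delete.
        ET.SubElement(delete, "id").text = str(doc_id)
    for query in queries:
        # One <query> element per delete-by-query expression.
        ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


print(build_delete_command(ids=[92392], queries=['publisher:"Ashler"']))
```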

Commit, optimize, and rollback

Changes resulting from add and delete operations are not immediately visible. They must be committed first; that is, a commit command has to be sent.

We already explored hard and soft unsolicited commits in the Index configuration section. The same command can be explicitly sent to Solr by clients.

Although we previously described the difference between hard and soft commits, it's important to remember that a hard commit is an expensive operation, causing changes to be permanently flushed to disk. Soft commits operate exclusively in memory, and are therefore very fast but transient; so, in the event of a JVM crash, softly committed data is lost.

Tip

In a prototype I'm working on, we index data coming from traffic sensors in Solr. As you can imagine, the input flow is continuous; it can happen several times in a second. A control system needs to execute a given set of queries at short periodic intervals, for example, every few seconds. In order to make the most updated data available to that system, we issue a soft commit every second and a hard commit every 20 minutes. At the moment, this seems to be a good compromise between the availability of fresh data and the risk of data loss (it could still happen during those 20 minutes).

For those interested, the Solr extension we will use in that project is available on GitHub, at https://github.com/agazzarini/SolRDF. It allows Solr to index RDF data, and it is a good example of the capabilities of Solr in the realm of customization.

A third kind of commit, which is actually a hard commit, is the so-called optimize. In addition to producing the same results as a hard commit, optimize makes Solr merge the current index segments into a single segment, resulting in a set of intensive I/O operations. Segment merging usually occurs in the background and is controlled by parameters such as merge scheduler, merge policy, and merge factor. Like the hard commit, optimize is a very expensive operation in terms of I/O because, apart from costing the same as a hard commit, it must have some temporary space available on the disk to perform the merge.

It is possible to send the commit or the optimize command together with the data to be indexed:

# curl "http://127.0.0.1:8983/solr/update?commit=true" -H "Content-type: text/xml" --data-binary @datafile.xml
# curl "http://127.0.0.1:8983/solr/update?optimize=true" -H "Content-type: text/xml" --data-binary @datafile.xml

The message payload can also be a commit command:

# curl http://127.0.0.1:8983/solr/update -H "Content-type: text/xml" --data-binary '<commit/>'

A commit has a few additional Boolean parameters that can be specified to customize the service behavior:

Parameter      Description
waitSearcher   The command won't return until a new searcher is opened and registered as the main searcher.
waitFlush      The command won't return until uncommitted changes are flushed to disk.
softCommit     If this is true, a soft commit will be executed.
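
As a sketch of how such flags might be combined, the following Python snippet builds an update URL that asks for a commit. It assumes that the parameters listed above can be passed as request parameters next to commit=true, in the same way as the commit=true URL shown earlier. The function build_commit_url is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the same local Solr endpoint used in the cURL examples.
UPDATE_URL = "http://127.0.0.1:8983/solr/update"


def build_commit_url(soft_commit=False, wait_searcher=True):
    """Return an update URL that requests a commit (hypothetical helper)."""
    params = {
        "commit": "true",
        "softCommit": "true" if soft_commit else "false",
        "waitSearcher": "true" if wait_searcher else "false",
    }
    return UPDATE_URL + "?" + urlencode(params)


print(build_commit_url(soft_commit=True))
```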

Before committing any pending change, it's possible to issue a rollback to remove uncommitted add and delete operations. The following are examples of rollback requests:

# curl "http://127.0.0.1:8983/solr/update?rollback=true"
# curl http://127.0.0.1:8983/solr/update -H "Content-type: text/xml" --data-binary '<rollback/>'