Reindexing from a remote cluster

The snapshot and restore APIs are very fast and the preferred way to back up data, but they have some limitations, such as:

  • The backup is a safe copy of the Lucene index, so it depends on the Elasticsearch version used. If you are migrating from a version of Elasticsearch prior to 5.x, it's not possible to restore the old indices.
  • It's not possible to restore backups of a newer Elasticsearch version in an older version. The restore is only forward-compatible.
  • It's not possible to restore partial data from a backup.

To copy data in these scenarios, the solution is to use the reindex API with a remote server as the source.

Getting ready

You need an up-and-running Elasticsearch installation as we described in the Downloading and installing Elasticsearch recipe in Chapter 2, Downloading and Setup.

To execute curl from the command line, you need to install curl for your operating system.

How to do it...

To copy an index from a remote server, we need to execute the following steps:

  1. We need to add the remote server address (host and port, without the protocol) to the reindex.remote.whitelist setting in config/elasticsearch.yml, with a line similar to the following:
    reindex.remote.whitelist: ["192.168.1.227:9200"]
    
  2. After restarting the Elasticsearch node so that it picks up the new configuration, we can call the reindex API to copy the data of the test-source index into test-dest via the remote REST endpoint, as follows:
    curl -XPOST "http://localhost:9200/_reindex" -H 'Content-Type: application/json' -d'
    {
      "source": {
        "remote": {
          "host": "http://192.168.1.227:9200"
        },
        "index": "test-source"
      },
      "dest": {
        "index": "test-dest"
      }
    }'
    

The result will be similar to a local reindex that we have already seen in the Reindex an index recipe in Chapter 4, Basic Operations.

How it works...

The reindex API allows you to read from a remote cluster. Practically every version of the Elasticsearch server is supported as a source (1.x or above).

The reindex API executes a scan (scroll) query on the remote index and writes the data into the current cluster. This process can take a long time, depending on the amount of data that needs to be copied and the time required to index it.
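
Because the copy can run for a long time, it can be convenient to start it in the background and check on it later. The following is a minimal sketch, assuming the local node answers on localhost:9200 and the remote host is already whitelisted. The function names and the ES_URL variable are our own helpers for illustration, not part of Elasticsearch. It relies on the wait_for_completion=false URL parameter of the reindex API and on the task API:

```shell
# Assumption: the local cluster is reachable at this address.
ES_URL="${ES_URL:-http://localhost:9200}"

# Same request body as in the recipe.
REINDEX_BODY='{
  "source": {
    "remote": { "host": "http://192.168.1.227:9200" },
    "index": "test-source"
  },
  "dest": { "index": "test-dest" }
}'

# Start the reindex without waiting for it to finish.
# The JSON response contains a "task" identifier.
start_remote_reindex() {
  curl -s -XPOST "$ES_URL/_reindex?wait_for_completion=false" \
       -H 'Content-Type: application/json' -d "$REINDEX_BODY"
}

# Ask the task API about the task whose identifier is passed as $1.
check_reindex_task() {
  curl -s -XGET "$ES_URL/_tasks/$1?pretty"
}
```

We would call start_remote_reindex first, and then pass the task identifier from its response to check_reindex_task to follow the progress of the copy.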

The source section contains important parameters to control the fetched data, such as:

  • remote: This is a section that contains information on the remote cluster connection.
  • index: This is the remote index that must be used to fetch the data. It can also be an alias or multiple indices via globs.
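  • query: This parameter is optional. It's a standard query that can be used to select the documents that must be copied.
  • size: This parameter is optional. It is the number of documents to be used for each bulk read/write batch. The batch is held in an on-heap buffer of limited size (100 MB by default), so a smaller value is needed if the remote documents are very large.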
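
As an illustration of these parameters, the following sketch copies only a subset of the remote documents in batches of 500. The status field and its active value are hypothetical and stand in for whatever field your documents actually contain. The reindex_subset function and the ES_URL variable are our own helpers, and the local node is assumed to be on localhost:9200:

```shell
# Assumption: the local cluster is reachable at this address.
ES_URL="${ES_URL:-http://localhost:9200}"

# "size" sets the batch size; "query" restricts the documents that are copied.
# The "status" field is a hypothetical example field.
SUBSET_BODY='{
  "source": {
    "remote": { "host": "http://192.168.1.227:9200" },
    "index": "test-source",
    "size": 500,
    "query": { "term": { "status": "active" } }
  },
  "dest": { "index": "test-dest" }
}'

# Send the request to the local cluster, which pulls from the remote one.
reindex_subset() {
  curl -s -XPOST "$ES_URL/_reindex" \
       -H 'Content-Type: application/json' -d "$SUBSET_BODY"
}
```

Running reindex_subset sends the request. Any standard query can replace the term query used here.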

The remote section of the configuration is composed of the following parameters:

  • host: The remote REST endpoint of the cluster
  • username: The username to be used for copying the data (optional)
  • password: The password for the user to access the remote cluster (optional)
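
When the remote cluster is protected by basic authentication, the credentials go in the remote section. The following is a minimal sketch: the REMOTE_USER and REMOTE_PASSWORD variables, their default values, and the function names are hypothetical helpers, and the values are assumed not to contain characters that need JSON escaping (such as quotes or backslashes):

```shell
# Assumption: the local cluster is reachable at this address.
ES_URL="${ES_URL:-http://localhost:9200}"

# Hypothetical credentials; override them from the environment.
REMOTE_USER="${REMOTE_USER:-reindex_user}"
REMOTE_PASSWORD="${REMOTE_PASSWORD:-changeme}"

# Print the request body with the credentials filled in.
# Note: over plain http the password travels unencrypted.
build_auth_body() {
  printf '{
  "source": {
    "remote": {
      "host": "http://192.168.1.227:9200",
      "username": "%s",
      "password": "%s"
    },
    "index": "test-source"
  },
  "dest": { "index": "test-dest" }
}\n' "$REMOTE_USER" "$REMOTE_PASSWORD"
}

# Send the authenticated reindex request to the local cluster.
reindex_with_auth() {
  curl -s -XPOST "$ES_URL/_reindex" \
       -H 'Content-Type: application/json' -d "$(build_auth_body)"
}
```

Calling build_auth_body on its own prints the body, which is handy for checking it before reindex_with_auth sends it.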

This approach has several advantages over the standard snapshot and restore, including:

  • Ability to copy data from older clusters (from version 1.x or above).
  • Ability to use a query to copy only a selection of documents. This is very handy for copying data from a production cluster to a dev/test one.

See also
