Migrating data to another collection

There are times when migrating data from one collection to another is a good option. For example, suppose you store the data of multiple clients in different shards. Some of the clients pay for faster searches and more indexing throughput, and you would like to migrate their data to another collection so that it can be moved to new, more powerful nodes. If we used routing during indexing, Solr has a nice feature for us: the Collections API and its MIGRATE command. This recipe will show you how to use it.

Getting ready

Before continuing, you should read the Using routing recipe in Chapter 7, In the Cloud. It describes how to use routing, which is essential to fully understand this recipe. We also assume that we have two collections: one called customers that will hold our data, and a second, empty one called important_customers. Both collections were created using the same configuration shown in this recipe. If you want to know more about how to create a new SolrCloud cluster and a collection, refer to the Creating a new SolrCloud cluster recipe in Chapter 7, In the Cloud.

How to do it...

For the purpose of this recipe, we will use the following index structure (we need to add the following section to our schema.xml file):

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="title" type="text_general" indexed="true" stored="true" />
<field name="customer" type="string" indexed="true" stored="true" />

For a customer with the name customer_1, we have the following data (stored in a file called data_customer_1.xml):

<add>
 <doc>
  <field name="id">customer_1!1</field>
  <field name="title">Customer document 1</field>
  <field name="customer">customer_1</field>
 </doc>
 <doc>
  <field name="id">customer_1!2</field>
  <field name="title">Customer document 2</field>
  <field name="customer">customer_1</field>
 </doc>
</add>

For a customer with the name customer_2, we have the following data (stored in a file called data_customer_2.xml):

<add>
 <doc>
  <field name="id">customer_2!3</field>
  <field name="title">Customer document 3</field>
  <field name="customer">customer_2</field>
 </doc>
 <doc>
  <field name="id">customer_2!4</field>
  <field name="title">Customer document 4</field>
  <field name="customer">customer_2</field>
 </doc>
</add>
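
The two files above can be sent to the customers collection through the same /update handler that this recipe uses later for the commit. The following is a minimal sketch, assuming Solr listens on localhost:8983 and both files are in the current directory. The SOLR variable and the update_url and index_file function names are illustrative and not part of Solr.

```shell
#!/bin/sh
# Assumption: Solr is reachable at this address; adjust as needed.
SOLR='http://localhost:8983/solr'

# Print the update URL for a collection.
# commit=true asks Solr to commit right after the documents are added.
update_url() {
  printf '%s/%s/update?commit=true' "$SOLR" "$1"
}

# Post one XML file ($2) to a collection ($1).
# Defined only; it is not called automatically.
index_file() {
  curl "$(update_url "$1")" --data-binary "@$2" -H 'Content-type:application/xml'
}

# Usage (requires a running SolrCloud cluster):
#   index_file customers data_customer_1.xml
#   index_file customers data_customer_2.xml
```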

We assume that this data has already been indexed into the collection called customers. The migration then takes the following steps:

  1. Let's now try moving the data of customer_2 to another collection called important_customers, which we have already created and which is empty. To do this, we will run the following command:
    curl 'localhost:8983/solr/admin/collections?action=MIGRATE&collection=customers&target.collection=important_customers&split.key=customer_2!&forward.timeout=60'
    
  2. After the command has executed, we run the commit command on the target collection to force the index reader to reload. We will do this using the following command:
    curl 'http://localhost:8983/solr/important_customers/update' --data-binary '<commit/>' -H 'Content-type:application/xml'
    
  3. We can now check the contents of the new collection by running the following query:
    http://localhost:8983/solr/important_customers/select?q=*:*

    The response should be as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
     <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">1</int>
      <lst name="params">
       <str name="q">*:*</str>
      </lst>
     </lst>
     <result name="response" numFound="2" start="0">
      <doc>
       <str name="id">customer_2!3</str>
       <str name="title">Customer document 3</str>
       <str name="customer">customer_2</str>
       <long name="_version_">1481976223112364032</long></doc>
      <doc>
       <str name="id">customer_2!4</str>
       <str name="title">Customer document 4</str>
       <str name="customer">customer_2</str>
       <long name="_version_">1481976223113412608</long></doc>
     </result>
    </response>

As we can see, the new collection contains only the data for the customer we wanted. We can now remove this data from the original collection ourselves, because Solr does not do that as part of the migration.
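
One possible way to remove the migrated documents from the source collection is a delete-by-query on the customer field defined in our schema. This is only a sketch: the SOLR variable and the delete_payload and delete_customer function names are illustrative, and the host and port are assumptions. Whether a delete sent during the forward.timeout window is also forwarded to the target is not covered here, so verifying the target collection first and waiting for that window to pass is the cautious choice.

```shell
#!/bin/sh
# Assumption: Solr is reachable at this address; adjust as needed.
SOLR='http://localhost:8983/solr'

# Build a delete-by-query payload matching every document of one customer.
delete_payload() {
  printf '<delete><query>customer:%s</query></delete>' "$1"
}

# Send the delete to a collection ($1) for a customer ($2) and commit.
# Defined only; it is not called automatically.
delete_customer() {
  curl "$SOLR/$1/update?commit=true" \
    --data-binary "$(delete_payload "$2")" \
    -H 'Content-type:application/xml'
}

# Usage (requires a running cluster; check the target collection first):
#   delete_customer customers customer_2

# Show the payload that would be sent.
delete_payload customer_2; echo
```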

Let's now see how it works.

How it works...

Our index structure is very simple. Each of the documents has three fields: one for the document identifier (the id field), one for the title of the document (the title field), and one called customer that will be used for filtering. Each identifier is prefixed with the customer name followed by the ! character. We talked about this already in Chapter 7, In the Cloud; when using composite routing, we can prefix the document identifier with a value followed by the ! character. This means that Solr will use that prefix (customer_1 or customer_2 in our case) to determine in which shard the document will be indexed.
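
To make the identifier format concrete, the sketch below splits an identifier into its routing prefix and derives the value we later pass as split.key. It only handles the simple prefix!id form used in this recipe, and the route_prefix and split_key function names are illustrative, not part of Solr.

```shell
#!/bin/sh
# Extract the routing prefix: everything before the first "!" in the identifier.
route_prefix() {
  printf '%s' "${1%%!*}"
}

# Build the split.key value for MIGRATE: the prefix followed by "!".
split_key() {
  printf '%s!' "$(route_prefix "$1")"
}

route_prefix 'customer_2!3'; echo   # prints customer_2
split_key 'customer_2!3'; echo      # prints customer_2!
```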

We assumed that the data is indexed, and we want to migrate the data of the second customer to another collection that we created up front. We do this by sending the MIGRATE command to the Collections API (action=MIGRATE sent to the /solr/admin/collections REST endpoint). We provide the source collection, which in our case is the customers collection (the collection=customers parameter), and the target collection to which the data should be migrated (the target.collection=important_customers parameter). In addition to this, we need the routing key that we used during indexing, which for our second customer is customer_2! (the split.key parameter). Finally, we define the forward.timeout parameter, which controls for how long, in seconds, Solr will forward write requests for that routing key from the source collection to the target one.

It is the user's responsibility to switch read and write operations to the target collection after the migration has been done. Note that the source collection will not be modified by the migrate request. The migration of documents is a synchronous operation, and it is advised to keep the timeout on the client side high. Even so, the HTTP request might time out when a large number of documents need to be migrated. This doesn't mean that the operation has failed: Solr will continue the migration in the background, and you should check the logs to see whether any errors occurred.
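
The parameters described above can be assembled as in the following sketch. The SOLR variable and the migrate_url and run_migrate function names are illustrative, and the host and port are assumptions. curl applies no overall time limit unless one is given, so the --max-time value here only makes the client-side limit explicit.

```shell
#!/bin/sh
# Assumption: Solr is reachable at this address; adjust as needed.
SOLR='http://localhost:8983/solr'

# Print the MIGRATE request URL.
# $1 source collection, $2 target collection,
# $3 split.key, $4 forward.timeout in seconds
migrate_url() {
  printf '%s/admin/collections?action=MIGRATE&collection=%s&target.collection=%s&split.key=%s&forward.timeout=%s' \
    "$SOLR" "$1" "$2" "$3" "$4"
}

# Run the migration with an explicit, generous client-side limit
# (--max-time is in seconds). Defined only; not called automatically.
run_migrate() {
  curl --max-time 3600 "$(migrate_url "$@")"
}

# Usage (requires a running cluster):
#   run_migrate customers important_customers 'customer_2!' 60

# Show the URL that would be requested.
migrate_url customers important_customers 'customer_2!' 60; echo
```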

There are a few things that we need to remember when migrating data between collections:

  • The migration can be performed on multiple shards at once if the split.key parameter value spans multiple shards. Solr will do this automatically.
  • Because the migration is a synchronous operation, it can take a long time on large collections.
  • Multiple temporary collections can be created during the execution of the migrate command, although they should be removed once the command finishes executing.
  • The command only works with collections that use the compositeId router.
  • The target collection shouldn't receive any updates during the migration process, because this might lead to data loss.
  • Deduplication is not done as a part of the migration process, so if the target collection contains data with the same identifiers as the ones in the source collection, you might end up with duplicates.
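
Given the warning about duplicates, it can be worth counting a customer's documents in both collections before migrating. The sketch below builds a query that returns only the hit count (rows=0). The SOLR variable and the count_url function name are illustrative, and the host and port are assumptions.

```shell
#!/bin/sh
# Assumption: Solr is reachable at this address; adjust as needed.
SOLR='http://localhost:8983/solr'

# Print a select URL that counts a customer's documents
# in a collection without returning them (rows=0).
# $1 collection, $2 customer name
count_url() {
  printf '%s/%s/select?q=customer:%s&rows=0' "$SOLR" "$1" "$2"
}

# Usage (requires a running cluster):
#   curl "$(count_url important_customers customer_2)"   # ideally numFound="0" before migrating
#   curl "$(count_url customers customer_2)"

# Show the URL that would be requested.
count_url important_customers customer_2; echo
```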

Finally, after the command succeeds, we send the commit command to refresh the index reader so that the data is visible. As we can see in the response, everything went well, and we can now remove the data from the original collection.
