Indexing a document

In Elasticsearch there are two vital operations: index and search.

Indexing means storing one or more documents in an index, a concept similar to inserting records into a relational database.

In Lucene, the core engine of Elasticsearch, inserting a document and updating one have the same cost, because in both Lucene and Elasticsearch an update means a replace.

Getting ready

You need an up-and-running Elasticsearch installation, as used in the Downloading and installing Elasticsearch recipe in Chapter 2, Downloading and Setup.

To execute curl from the command line, you need to install curl for your operating system.

To correctly execute the following commands, use the index and mapping created in the Putting a mapping in an index recipe.

How to do it...

To index a document, several REST entry points can be used:

Method     URL
POST       http://<server>/<index_name>/<type>
PUT/POST   http://<server>/<index_name>/<type>/<id>
PUT/POST   http://<server>/<index_name>/<type>/<id>/_create

To index a document, we need to perform the following steps:

  1. Using the order type from the previous chapter, the call to index a document is as follows:
            curl -XPOST \
            'http://localhost:9200/myindex/order/2qLrAfPVQvCRMe7Ku8r0Tw' -d \
            '{
                "id" : "1234",
                "date" : "2013-06-07T12:14:54",
                "customer_id" : "customer1",
                "sent" : true,
                "in_stock_items" : 0,
                "items":[
                    {"name":"item1", "quantity":3, "vat":20.0},
                    {"name":"item2", "quantity":2, "vat":20.0},
                    {"name":"item3", "quantity":1, "vat":10.0}
                ]
            }'
    
  2. If the index operation was successful, the result returned by Elasticsearch should be as follows:
            {
              "_index" : "myindex",
              "_type" : "order",
              "_id" : "2qLrAfPVQvCRMe7Ku8r0Tw",
              "_version" : 1,
              "forced_refresh" : false,
              "_shards" : {
                "total" : 2,
                "successful" : 1,
                "failed" : 0
              },
              "created" : true
            }
    

Some additional information is returned from the index operation, such as:

  • The ID of the document (2qLrAfPVQvCRMe7Ku8r0Tw in this example); if no ID is specified in the call, an auto-generated one is returned
  • The version of the indexed document, used for Optimistic Concurrency Control (the version is 1 because this is the first time the document was saved)
  • Whether the record has been created ("created" : true in this example)
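
As a minimal sketch of how a client might read these fields, the following Python snippet parses the response shown in step 2. The function name summarize_index_response and the all_shards_ok key are illustrative assumptions, not part of any Elasticsearch client.

```python
import json

# Response body copied from step 2 of this recipe.
response_body = """
{
  "_index" : "myindex",
  "_type" : "order",
  "_id" : "2qLrAfPVQvCRMe7Ku8r0Tw",
  "_version" : 1,
  "forced_refresh" : false,
  "_shards" : { "total" : 2, "successful" : 1, "failed" : 0 },
  "created" : true
}
"""


def summarize_index_response(body):
    """Extract the ID, version, and created flag from an index response.

    Hypothetical helper for illustration only.
    """
    result = json.loads(body)
    return {
        "id": result["_id"],
        "version": result["_version"],
        "created": result["created"],
        # True when no shard reported a failure for this operation.
        "all_shards_ok": result["_shards"]["failed"] == 0,
    }


summary = summarize_index_response(response_body)
print(summary["id"], summary["version"], summary["created"])
```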

How it works...

Indexing is one of the most used APIs in Elasticsearch. Internally, indexing a JSON document consists of the following steps:

  1. Routing the call to the correct shard based on the ID or on the routing/parent metadata. If the ID is not supplied by the client, a new one is created (see the Managing your data recipe in Chapter 1, Getting Started for details).
  2. Validating the sent JSON.
  3. Processing the JSON according to the mapping. If new fields are present in the document (and the mapping can be updated), they are added to the mapping.
  4. Indexing the document in the shard. If a document with the same ID already exists, it is updated.
  5. Extracting any nested documents and processing them separately.
  6. Returning information about the saved document (ID and versioning).
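
The routing, ID generation, and replace-on-update steps above can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: CRC32 stands in for the hash function Elasticsearch actually uses, a random UUID stands in for its ID generator, and ToyIndex, pick_shard, and NUM_SHARDS are hypothetical names. Validation, mapping, and nested-document handling are left out.

```python
import uuid
import zlib

NUM_SHARDS = 5  # assumed number of primary shards


def pick_shard(routing_value, num_shards=NUM_SHARDS):
    """Map a routing value (the ID by default) to a shard number.

    CRC32 is a stand-in; Elasticsearch uses its own hash function.
    """
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


class ToyIndex:
    """In-memory stand-in for an index split into shards."""

    def __init__(self, num_shards=NUM_SHARDS):
        self.shards = [dict() for _ in range(num_shards)]

    def index(self, source, doc_id=None, routing=None):
        # A new ID is created when the client does not supply one
        # (a random UUID here, as a stand-in).
        if doc_id is None:
            doc_id = uuid.uuid4().hex
        # Route on the explicit routing value, otherwise on the ID.
        shard = self.shards[pick_shard(routing or doc_id, len(self.shards))]
        previous = shard.get(doc_id)
        version = 1 if previous is None else previous["_version"] + 1
        # Update means replace: the old source is discarded entirely.
        shard[doc_id] = {"_version": version, "_source": dict(source)}
        return {"_id": doc_id, "_version": version, "created": previous is None}


toy = ToyIndex()
print(toy.index({"sent": True}, doc_id="1234"))
print(toy.index({"sent": False}, doc_id="1234"))
```

Indexing the same ID twice returns version 1 with created set to True, then version 2 with created set to False, mirroring the fields of the real response.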

It's important to choose the correct ID for indexing your data. If you don't provide an ID, Elasticsearch automatically assigns a new one to your document during the indexing phase. To improve performance, IDs should generally be of the same character length, which improves the balancing of the data tree that stores them.

Because the index call is a REST call and the ID travels in the URL, pay attention when using non-ASCII characters, which are subject to URL encoding and decoding (or make sure that the client framework you use escapes them correctly).
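
A minimal sketch of safe URL construction, using only the Python standard library, could look as follows. The helper name index_url and the sample ID are assumptions made for illustration; the URL layout follows the entry points listed in this recipe.

```python
from urllib.parse import quote, urlencode


def index_url(server, index_name, doc_type, doc_id=None, **params):
    """Build an index API URL with every path segment percent-encoded.

    Hypothetical helper for illustration only.
    """
    parts = [index_name, doc_type]
    if doc_id is not None:
        parts.append(doc_id)
    # safe="" also escapes "/" so an ID cannot be read as a path separator.
    path = "/".join(quote(part, safe="") for part in parts)
    url = "http://%s/%s" % (server, path)
    if params:
        # urlencode escapes query values, for example ":" becomes %3A.
        url += "?" + urlencode(params)
    return url


print(index_url("localhost:9200", "myindex", "order", "café/1"))
# http://localhost:9200/myindex/order/caf%C3%A9%2F1
print(index_url("localhost:9200", "myindex", "order", routing="1"))
# http://localhost:9200/myindex/order?routing=1
```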

Depending on the mappings, other actions take place during the indexing phase: propagation to replicas, nested processing, and the percolator.

The document becomes available for standard search calls after a refresh (either forced with an API call or performed automatically after the time slice of 1 second, which makes search Near Real-Time). GET API calls on the document don't require a refresh, so the document is instantly available to them.

The refresh can also be forced by specifying the refresh parameter during indexing.
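
The difference between GET and search visibility can be illustrated with a toy model. This is a simplified sketch, not how Elasticsearch is implemented: ToyShard and its methods are hypothetical, and the automatic 1-second refresh is replaced by an explicit refresh call.

```python
class ToyShard:
    """Toy model: GET sees every write at once, search sees the last refresh."""

    def __init__(self):
        self.store = {}       # every indexed document, readable by ID at once
        self.searchable = {}  # snapshot seen by search, updated on refresh

    def index(self, doc_id, source, refresh=False):
        self.store[doc_id] = source
        if refresh:
            # Mirrors passing the refresh parameter during indexing.
            self.refresh()

    def get(self, doc_id):
        return self.store.get(doc_id)

    def refresh(self):
        # Expose everything written so far to search.
        self.searchable = dict(self.store)

    def search(self, field, value):
        return sorted(
            doc_id
            for doc_id, source in self.searchable.items()
            if source.get(field) == value
        )


shard = ToyShard()
shard.index("1", {"customer_id": "customer1"})
print(shard.get("1"))                             # found without a refresh
print(shard.search("customer_id", "customer1"))   # [] until a refresh happens
shard.refresh()
print(shard.search("customer_id", "customer1"))   # ['1']
```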

There's more...

Elasticsearch accepts several query parameters on the index API URL to control how the document is indexed. The most used ones are as follows:

  • routing: This controls the shard to be used for indexing, for example:
        curl -XPOST 'http://localhost:9200/myindex/order?routing=1'
  • parent: This defines the parent of a child document and uses this value to apply routing. The parent object must be specified in the mappings:
        curl -XPOST 'http://localhost:9200/myindex/order?parent=12'
  • timestamp: This is the timestamp to be used in indexing the document. It must be activated in the mappings:
        curl -XPOST 'http://localhost:9200/myindex/order?timestamp=2013-01-25T19%3A22%3A22'
  • consistency (one/quorum/all): By default, an index operation succeeds if a quorum (>replica/2+1) of active shards is available. The consistency value can be changed for the index action:
        curl -XPOST 'http://localhost:9200/myindex/order?consistency=one'
  • replication (sync/async): By default, Elasticsearch returns from an index operation when all the shards of the current replication group have executed it. Setting replication to async executes the index action synchronously only on the primary shard and asynchronously on the secondary shards, so the API call returns its response faster:
        curl -XPOST 'http://localhost:9200/myindex/order?replication=async' ...
  • version: This allows us to use Optimistic Concurrency Control (http://en.wikipedia.org/wiki/Optimistic_concurrency_control), a way to manage concurrency in every insert or update operation. The first time a document is indexed, its version is set to 1, and the value is incremented at every update. The passed version value is the last seen version (usually returned by a get or a search). The index operation happens only if the current version of the document in the index is equal to the passed one:
        curl -XPOST 'http://localhost:9200/myindex/order?version=2' ...
  • op_type: This can be used to force a create on a document. If a document with the same ID exists, the index operation fails:
        curl -XPOST 'http://localhost:9200/myindex/order?op_type=create' ...
  • refresh: This forces a refresh after the document has been indexed, so that the document is ready for search immediately after indexing:
        curl -XPOST 'http://localhost:9200/myindex/order?refresh=true' ...
  • timeout: This defines how long to wait for the primary shard to become available. Sometimes the primary shard is not in a writable state (it is relocating or recovering from a gateway), and by default the write operation times out after one minute:
        curl -XPOST 'http://localhost:9200/myindex/order?timeout=5m' ...
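
The version check behind Optimistic Concurrency Control can be sketched with a small in-memory store. This is a toy model under stated assumptions, not Elasticsearch code: VersionedStore and VersionConflict are hypothetical names, and the check simply compares the passed version with the stored one.

```python
class VersionConflict(Exception):
    """Raised when the passed version is not the current one."""


class VersionedStore:
    """Toy store that applies an optimistic version check on write."""

    def __init__(self):
        self.docs = {}  # doc_id -> (version, source)

    def index(self, doc_id, source, version=None):
        # A document that does not exist yet is treated as version 0.
        current_version = self.docs.get(doc_id, (0, None))[0]
        # With a version passed, write only if it matches the last seen one.
        if version is not None and version != current_version:
            raise VersionConflict(
                "expected version %s, current is %s" % (version, current_version)
            )
        new_version = current_version + 1
        self.docs[doc_id] = (new_version, source)
        return new_version


store = VersionedStore()
print(store.index("1", {"sent": False}))             # first index
print(store.index("1", {"sent": True}, version=1))   # matches the last seen version
try:
    # A second writer that still holds version 1 is rejected.
    store.index("1", {"sent": False}, version=1)
except VersionConflict as error:
    print("conflict:", error)
```

The rejected writer would typically fetch the document again, reapply its change on top of the new version, and retry.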

See also

  • The Getting a document recipe in this chapter to learn how to retrieve a stored document
  • The Deleting a document recipe in this chapter to learn how to delete a document
  • The Updating a document recipe in this chapter to learn how to update fields in a document
  • For Optimistic Concurrency Control, the Elasticsearch way to manage concurrency on a document, a good reference can be found at http://en.wikipedia.org/wiki/Optimistic_concurrency_control