Chapter 3. Putting Elasticsearch into Action

We have covered a lot of ground on Elasticsearch architecture, indexes, analyzers, and mappings. It's time to start learning about the indexing of data and the querying of Elasticsearch using its rich Query-DSL.

In this chapter, we will cover the following topics:

  • CRUD operations using the Elasticsearch Python client
  • CRUD operations using the Elasticsearch Java client
  • Creating a search database
  • Introducing Query-DSL
  • Search requests using Python
  • Search requests using Java
  • Sorting data
  • Document routing

CRUD operations using elasticsearch-py

Elasticsearch is written in Java, but it is interoperable with non-JVM languages too. In this book, we will use its Python client, elasticsearch-py, as well as its Java client to perform all the operations. The best part of this Python client library is that it communicates over HTTP and lets you write your settings, mappings, and queries as plain JSON objects and pass them in the body parameter of the requests. To read more about this client, you can visit this URL: http://elasticsearch-py.readthedocs.org/en/master/.

All the examples used in this book are based on Python 2.7, but they are also compatible with Python 3.

Setting up the environment

In this section, you will learn how to set up Python environments on Ubuntu using pip and virtualenv.

Installing Pip

Pip is a package installer for Python modules. It can be installed using the following commands:

sudo apt-get install python-pip python-dev build-essential
sudo pip install --upgrade pip 

Installing virtualenv

While developing programs in Python, it is good practice to create a virtual environment. The virtualenv command creates a directory that stores a private copy of Python and all the default Python packages. Virtual environments are of great help when working with several projects and different versions of Python on a single system: you can create a separate virtual environment for each project and activate it whenever you work on that project. To install virtualenv, use the following command:

sudo pip install --upgrade virtualenv

Once virtualenv is installed, you can create a directory and set up a virtual environment inside it, with its own copy of Python, using these commands:

mkdir venv
virtualenv venv

After this, you can activate this environment with the following command:

source venv/bin/activate

Once this environment is activated, all the packages that you install will be inside this venv directory. A virtual environment can be deactivated using just the deactivate command.
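To confirm from Python itself that the interpreter you are running belongs to a virtual environment, a small check like the following can help. This is only a sketch: the in_virtualenv helper is a hypothetical name, and it relies on the attributes that virtualenv and Python's own venv module are known to set.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases set sys.real_prefix on the private interpreter.
    if hasattr(sys, 'real_prefix'):
        return True
    # Newer virtualenv releases and Python 3's venv make sys.base_prefix
    # differ from sys.prefix instead.
    return getattr(sys, 'base_prefix', sys.prefix) != sys.prefix


print(in_virtualenv())
```

Run it once before and once after activating the environment to see the difference.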

Installing elasticsearch-py

elasticsearch-py can be easily installed using pip in the following way:

pip install elasticsearch

You can verify the installation using the following command:

pip freeze | grep elasticsearch

The output shows which version of the Elasticsearch client has been installed:

elasticsearch==1.6.0

Note

The version can be different depending on the latest release. The preceding command installs the latest version. If you want to install a specific version, then you can specify the version using the == operator. For example, pip install elasticsearch==1.5.
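The installed version can also be checked from Python. The following sketch assumes that the package exposes its version through the __version__ attribute of the elasticsearch module; the client_version helper is a hypothetical name used only for illustration.

```python
def client_version():
    """Return the installed elasticsearch-py version, or None if it is missing."""
    try:
        import elasticsearch
    except ImportError:
        # The client is not installed in the active environment.
        return None
    # Assumption: the package publishes its version as a tuple such as (1, 6, 0).
    return elasticsearch.__version__


print(client_version())
```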

Performing CRUD operations

You will learn to perform CRUD operations in the upcoming sections, but before that, let's start with the creation of indexes using Python code.

Note

Since elasticsearch-py communicates over HTTP, it takes JSON data (settings, mappings, and queries) in the body parameter of the requests. It is advisable to use the Sense plugin (which comes with Marvel or as an extension) to write queries, settings, mappings, and all other requests, because Sense offers a lot of help with its autosuggestion functionality. Once the correct JSON data is created, you can simply store it inside a variable in your Python code and pass it as a function's body parameter.

Request timeouts

The default timeout for any request sent to Elasticsearch is 10 seconds, but a request might not complete within that time because of the complexity of the query, the load on Elasticsearch, or network latency. You have two options to control timeouts:

  • Global timeout: This involves using the timeout parameter while creating a connection.
  • Per-request timeout: This involves using the request_timeout parameter (in seconds) while hitting separate requests. When request_timeout is used, it overrides the global timeout value for that particular request.
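The two options can be sketched as follows. The helper names make_client and get_with_timeout are hypothetical, the values of 30 and 5 seconds are arbitrary, and the doc_type argument assumes the 1.x client used in this chapter.

```python
def make_client(host='localhost:9200', timeout=30):
    """Create a client whose requests all default to `timeout` seconds."""
    # Imported inside the function so that the helper below can be tried
    # without the library installed.
    from elasticsearch import Elasticsearch
    return Elasticsearch(host, timeout=timeout)


def get_with_timeout(es, index, doc_type, doc_id, seconds):
    """Fetch one document, overriding the global timeout for this call only."""
    return es.get(index=index, doc_type=doc_type, id=doc_id,
                  request_timeout=seconds)
```

For example, es = make_client(timeout=30) followed by get_with_timeout(es, 'books', 'search', '1', 5) applies a 5-second limit to that single request while every other request keeps the 30-second default.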

Creating indexes with settings and mappings

Create a Python file with index settings and mappings and save it with the name config.py. It will have two variables, index_settings and doc_mapping:

index_settings = {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    "index": {
        "analysis": {
            "analyzer": {
                "keyword_analyzed": {
                    "type": "custom",
                    "filter": [
                        "lowercase",
                        "asciifolding"
                    ],
                    "tokenizer": "keyword"
                }
            }
        }
    }
}

doc_mapping = {
    "_all": {
        "enabled": False
    },
    "properties": {
        "skills": {
            "type": "string",
            "index": "analyzed",
            "analyzer": "standard"
        }
    }
}

Now create another file, es_operations.py, and follow these steps:

  1. Import the Elasticsearch module and the time module to your Python file:
    from elasticsearch import Elasticsearch
    import time
  2. Import the index_settings and doc_mapping variables from the config file:
    from config import index_settings, doc_mapping
  3. Initialize the client:
    es = Elasticsearch('localhost:9200')
  4. Declare variables for the index name, doc type, and body. The body will contain the settings and the mapping:
    index_name = 'books'
    doc_type = 'search'
    body = {}
    mapping = {}
    mapping[doc_type] = doc_mapping
    body['settings'] = index_settings
    body['mappings'] = mapping
  5. Check whether the index exists; otherwise, create the index:
    if not es.indices.exists(index=index_name):
        print('index does not exist, creating the index')
        es.indices.create(index=index_name, body=body)
        # Give Elasticsearch a moment to finish creating the index
        time.sleep(2)
        print('index created successfully')
    else:
        print('An index with this name already exists')
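The steps above can also be wrapped into two small reusable functions. This is a sketch rather than part of the client library: build_index_body and ensure_index are hypothetical names, and es is assumed to be a client created as in step 3.

```python
def build_index_body(doc_type, settings, mapping):
    """Combine settings and a per-type mapping into a create-index body."""
    return {'settings': settings, 'mappings': {doc_type: mapping}}


def ensure_index(es, index_name, body):
    """Create the index if it is missing; return True only when it was created."""
    if es.indices.exists(index=index_name):
        return False
    es.indices.create(index=index_name, body=body)
    return True
```

With the variables from config.py, the call would be ensure_index(es, 'books', build_index_body('search', index_settings, doc_mapping)).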

Indexing documents

Let's create a document and store it inside a doc1 variable, which is a dictionary:

doc1 = {
    'name': 'Elasticsearch Essentials',
    'category': ['Big Data', 'search engines', 'Analytics'],
    'Publication': 'Packt-Pub',
    'Publishing Date': '2015-12-31'
}

Once the document is created, it can be indexed using the index function:

es.index(index=index_name, doc_type=doc_type, body=doc1, id='1')

Note

If you want the unique ID to be autogenerated, pass None as the id parameter or omit the parameter altogether.
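Both ways of indexing, with an explicit ID and with an autogenerated one, can be sketched with a single helper. The index_document function is a hypothetical wrapper, and it assumes that the response of the index call carries the stored ID under the _id key.

```python
def index_document(es, index_name, doc_type, doc, doc_id=None):
    """Index `doc` and return the ID Elasticsearch stored it under."""
    # With doc_id=None, the client lets Elasticsearch generate the ID.
    response = es.index(index=index_name, doc_type=doc_type,
                        body=doc, id=doc_id)
    return response['_id']
```

Calling index_document(es, 'books', 'search', doc1) returns the generated ID, which you then need to keep if you want to fetch the document later.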

Retrieving documents

Document retrieval is done using the get function, which takes the following parameters:

response = es.get(index=index_name, doc_type=doc_type, id='1', ignore=404)
print(response)

The ignore parameter is used to avoid exceptions in case the document does not exist in the index. The response will look as follows:

[Figure: Retrieving documents]

Note

In the response, every string is prefixed with u, which denotes a Unicode string in Python 2. Normally, this prefix has no effect on tasks performed with the response data. However, in some cases you might need the response as plain JSON. To get it, import Python's json module and call json.dumps(response).
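As an illustration of this conversion, the following sketch serializes a response dictionary with json.dumps. The response shown here is a hypothetical, hand-written one that only mimics the shape of a get result; it is not actual output.

```python
import json

# Hypothetical response, shaped like the result of es.get for an existing document.
response = {
    u'_index': u'books',
    u'_type': u'search',
    u'_id': u'1',
    u'found': True,
    u'_source': {u'name': u'Elasticsearch Essentials'},
}

# sort_keys keeps the key order stable; indent makes the output readable.
json_text = json.dumps(response, sort_keys=True, indent=2)
print(json_text)
```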

All the fields are returned inside the _source object and a particular field can be accessed using this:

response.get('_source').get(field_name)
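When ignore=404 is used and the document is missing, the response may not contain a _source object at all, so the chained call above would fail on None. A more defensive accessor could look like the following sketch; get_field is a hypothetical helper name.

```python
def get_field(response, field_name, default=None):
    """Return one field from a get response, or `default` if it is unavailable."""
    # Fall back to an empty dictionary when the response has no '_source' key.
    source = response.get('_source') or {}
    return source.get(field_name, default)
```

For example, get_field(response, 'name') returns the value of the name field when the document was found and None otherwise.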

Updating documents

As we saw in Chapter 1, Getting Started with Elasticsearch, a partial document update can be done using scripts with the _update API. With the Python client, this is done using the update function. An update can either completely replace the value of a field or append a value to that field. To use scripts to update documents, make sure you have dynamic scripting enabled.

Replacing the value of a field completely

To replace the value of an existing field, simply assign the new value to that field on ctx._source inside the script, in the following way:

script = {"script": "ctx._source.category = 'data analytics'"}
es.update(index=index_name, doc_type=doc_type, body=script, id='1', ignore=404)

After this request, the category field will contain only one value, data analytics.

Appending a value in an array

Sometimes you need to preserve the original value and append some new data to it. Groovy scripting supports the use of parameters through the params attribute of the request body, which helps us to achieve this task:

script = {
    "script": "ctx._source.category += tag",
    "params": {
        "tag": "Python"
    }
}
es.update(index=index_name, doc_type=doc_type, body=script, id='1', ignore=404)

After this request, the category field will contain two values: data analytics and Python.
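Both kinds of scripted update can pass the new value through params instead of writing it into the script text, which avoids quoting problems. The following is only a sketch: replace_script and append_script are hypothetical helpers, they assume that dynamic Groovy scripting is enabled, and field must be a plain identifier without spaces.

```python
def replace_script(field, value):
    """Build an update body that overwrites `field` with `value`."""
    return {
        'script': 'ctx._source.%s = value' % field,
        'params': {'value': value},
    }


def append_script(field, value):
    """Build an update body that appends `value` to `field`."""
    return {
        'script': 'ctx._source.%s += value' % field,
        'params': {'value': value},
    }
```

The returned dictionary is passed as the body parameter of the update function, for example body=append_script('category', 'Python').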

Updates using doc

Partial updates can also be done without a script by placing a doc key inside the body, where doc is a dictionary that holds the fields to be updated. This is the preferred way to do partial updates, as shown in the following example:

es.update(index=index_name, doc_type=doc_type, body={'doc': {'new_field': 'doing partial update with a new field'}}, id='1', ignore=404)
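If you do many partial updates, the request can be wrapped in a small function so that callers only supply the fields to change. This is a sketch under the assumption that the partial document is sent under a doc key in the request body; partial_update is a hypothetical name.

```python
def partial_update(es, index_name, doc_type, doc_id, fields):
    """Merge `fields` into an existing document and return the raw response."""
    # ignore=404 suppresses the exception when the document does not exist.
    return es.update(index=index_name, doc_type=doc_type, id=doc_id,
                     body={'doc': fields}, ignore=404)
```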

Note

In most cases where many documents need to be updated, re-indexing the documents makes more sense than updating them through a script or with doc.

Checking document existence

If you need to check whether a document exists, use the exists function, which returns either True or False:

es.exists(index=index_name, doc_type=doc_type, id='1')

Deleting a document

Document deletion can be done using the delete function in the following way:

es.delete(index=index_name, doc_type=doc_type, id='1', ignore=404)
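The exists and delete functions can be combined when you want to know whether a delete was actually issued. The following is a minimal sketch with a hypothetical helper name, delete_if_exists. Another process could remove the document between the two calls, so passing ignore=404 to delete remains the simpler choice when that matters.

```python
def delete_if_exists(es, index_name, doc_type, doc_id):
    """Delete a document only when it exists; return True if a delete was issued."""
    if not es.exists(index=index_name, doc_type=doc_type, id=doc_id):
        return False
    es.delete(index=index_name, doc_type=doc_type, id=doc_id)
    return True
```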