Using the ingest attachment plugin

In Elasticsearch prior to 5.x, it was easy to make a cluster unresponsive using the attachment mapper: extracting metadata from a document is a CPU-intensive operation, and if you are ingesting a lot of documents, the whole cluster comes under heavy load.

To prevent this scenario, Elasticsearch introduced the ingest node. An ingest node can sustain very high pressure without causing problems for the rest of the Elasticsearch cluster.

The attachment processor allows us to use the document extraction capabilities of Tika in an ingest node.

Getting ready

You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 2, Downloading and Setup.

To execute curl via the command line, you need to install curl for your operating system.

How to do it...

To be able to use the ingest attachment processor, perform the following steps:

  1. You need to install it as a plugin via:
            bin/elasticsearch-plugin install ingest-attachment
    
  2. The output will be similar to the following:
            -> Downloading ingest-attachment from elastic
            [=================================================] 100%
            @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
            @     WARNING: plugin requires additional permissions     @    
            @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
            * java.lang.RuntimePermission getClassLoader
            * java.lang.reflect.ReflectPermission suppressAccessChecks
            * java.security.SecurityPermission createAccessControlContext
            * java.security.SecurityPermission insertProvider
            * java.security.SecurityPermission putProviderProperty.BC
            Continue with the installation? [y/n] y
            -> Installed ingest-attachment
    

    You must accept the security permissions to complete the installation successfully.

    Note

    See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html for more details on the allowed permissions and the associated risks.

  3. After installing a new plugin, your node must be restarted to be able to load it. Now you can create an ingest pipeline with the attachment processor:
            curl -XPUT 'http://127.0.0.1:9200/_ingest/pipeline/attachment' \
            -d '{
              "description" : "Extract data from an attachment via Tika",
              "processors" : [ 
                 {
                  "attachment" : {
                    "field" : "data"
                  }
                }
              ],
              "version":1
            }'
    
  4. If everything is okay, you should receive an acknowledged response:
            {"acknowledged":true}
    
  5. Now we can index a document via a pipeline:
            curl -XPUT 'http://127.0.0.1:9200/my_index/my_type/my_id?pipeline=attachment' \
            -d '{
              "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
            }'
    
  6. And we can recall it:
            curl -XGET 'http://127.0.0.1:9200/my_index/my_type/my_id?pretty'
    
  7. The result will be as follows:
            {
              "_index" : "my_index",
              "_type" : "my_type",
              "_id" : "my_id",
              "_version" : 2,
              "found" : true,
              "_source" : {
                "data" :     
              "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQp
               ccGFyIH0=",
                "attachment" : {
                  "content_type" : "application/rtf",
                  "language" : "ro",
                  "content" : "Lorem ipsum dolor sit amet",
                  "content_length" : 28
                }
              }
            }
    

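The indexing call in step 5 can be sketched in Python. This is a hedged example: the helper name `build_attachment_body` is hypothetical, while the `data` field name and the sample RTF payload follow the recipe; the endpoint assumes Elasticsearch running on 127.0.0.1:9200 with the pipeline from step 3.

```python
import base64
import json

# Hypothetical helper (not part of Elasticsearch or its clients): builds
# the request body expected by the "attachment" pipeline from step 3,
# which reads the base64-encoded binary from the "data" field.
def build_attachment_body(raw_bytes):
    return json.dumps({"data": base64.b64encode(raw_bytes).decode("ascii")})

# The same tiny RTF document used in step 5.
rtf = b"{\\rtf1\\ansi\r\nLorem ipsum dolor sit amet\r\n\\par }"
body = build_attachment_body(rtf)

# PUT this body to:
#   http://127.0.0.1:9200/my_index/my_type/my_id?pipeline=attachment
print(body)
```

Encoding the binary yourself, rather than relying on the client, makes it explicit that the attachment processor never sees the raw bytes: it only ever receives base64 text in the configured field.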
How it works...

The attachment ingest processor is provided by a separate plugin that must be installed.

After installing it, it works like any other processor. The properties that control it are as follows:

  • field: This is the field that will contain the base64 representation of the binary data.
  • target_field: This will hold the attachment information (default attachment).
  • indexed_chars: The maximum number of characters to extract, to prevent very large fields. If it is set to -1, all the characters are extracted (default 100000).
  • properties: Other metadata fields of the document that need to be extracted. They can be content, title, name, author, keywords, date, content_type, content_length, and language (default all).
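As a sketch, these optional parameters can be combined in a single pipeline definition. The processor and parameter names come from the plugin; the pipeline name attachment_meta, the target field doc_meta, the 5000-character cap, and the properties subset are illustrative choices, not values from this recipe.

```python
import json

# Illustrative pipeline definition exercising the optional parameters;
# doc_meta, 5000, and the chosen properties are arbitrary examples.
pipeline = {
    "description": "Extract selected metadata, capped at 5000 characters",
    "processors": [
        {
            "attachment": {
                "field": "data",             # source field with the base64 binary
                "target_field": "doc_meta",  # overrides the default "attachment"
                "indexed_chars": 5000,       # cap on extracted characters
                "properties": ["content", "title", "content_type", "language"],
            }
        }
    ],
    "version": 1,
}

# PUT this definition to:
#   http://127.0.0.1:9200/_ingest/pipeline/attachment_meta
print(json.dumps(pipeline, indent=2))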