It's easy to make a cluster unresponsive in Elasticsearch prior to 5.x when using the attachment mapper: extracting metadata from a document is a very CPU-intensive operation, and if you are ingesting a lot of documents, your cluster comes under heavy load.
To prevent this scenario, Elasticsearch introduced the ingest node. An ingest node can sustain very high pressure without causing problems to the rest of the Elasticsearch cluster.
The attachment processor allows us to use the document extraction capabilities of Tika in an ingest node.
You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 2, Downloading and Setup.
To execute curl via the command line, you need to install curl for your operating system.
To be able to use the ingest attachment processor, perform the following steps:
bin/elasticsearch-plugin install ingest-attachment
-> Downloading ingest-attachment from elastic
[=================================================] 100%
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@     WARNING: plugin requires additional permissions     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
* java.lang.RuntimePermission getClassLoader
* java.lang.reflect.ReflectPermission suppressAccessChecks
* java.security.SecurityPermission createAccessControlContext
* java.security.SecurityPermission insertProvider
* java.security.SecurityPermission putProviderProperty.BC
Continue with the installation? [y/n] y
-> Installed ingest-attachment
You must accept the security permissions to complete the installation successfully.
See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html for more details on the allowed permissions and the associated risks.
curl -XPUT 'http://127.0.0.1:9200/_ingest/pipeline/attachment' -d '{
  "description": "Extract data from an attachment via Tika",
  "processors": [
    {
      "attachment": {
        "field": "data"
      }
    }
  ],
  "version": 1
}'
{"acknowledged":true}
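The data value used in the next step is the source document encoded as base64. As a sketch, on a Unix-like system with the standard base64 utility you can build the same payload yourself (lorem.rtf is a hypothetical file name for this example):

```shell
# Create a small example RTF document and encode it as base64
# so it can be used as the "data" field of the ingest request.
# (lorem.rtf is a hypothetical file name.)
printf '{\\rtf1\\ansi\r\nLorem ipsum dolor sit amet\r\n\\par }' > lorem.rtf
base64 < lorem.rtf | tr -d '\n'
# -> e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=
```

The tr -d '\n' removes the line wrapping that some base64 implementations add, so the value can be pasted directly into the JSON body.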
curl -XPUT 'http://127.0.0.1:9200/my_index/my_type/my_id?pipeline=attachment' -d '{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}'
curl -XGET 'http://127.0.0.1:9200/my_index/my_type/my_id?pretty'
{
  "_index" : "my_index",
  "_type" : "my_type",
  "_id" : "my_id",
  "_version" : 2,
  "found" : true,
  "_source" : {
    "data" : "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
    "attachment" : {
      "content_type" : "application/rtf",
      "language" : "ro",
      "content" : "Lorem ipsum dolor sit amet",
      "content_length" : 28
    }
  }
}
The attachment ingest processor is provided by a separate plugin that must be installed.
Once installed, it works like every other processor. The properties that control it are as follows:
field: This is the field that will contain the base64 representation of the binary data.
target_field: This will hold the attachment information (default attachment).
indexed_chars: The number of characters to be extracted, to prevent very large fields. If it is set to -1, all the characters are extracted (default 100000).
properties: Other metadata fields of the document that need to be extracted. They can be content, title, name, author, keywords, date, content_type, content_length, and language (default all).