ElasticSearch allows extending its core types to cover new requirements with native plugins that provide new mapping types. One of the most used custom field types is the attachment type.
It allows indexing and searching the contents of common document files, such as Microsoft Office formats, OpenDocument formats, PDF, ePub, and many others.
You need a working ElasticSearch cluster with the attachment plugin (https://github.com/elasticsearch/elasticsearch-mapper-attachments) installed.
It can be installed from the command line with the following command:
bin/plugin -install elasticsearch/elasticsearch-mapper-attachments/1.9.0
The plugin version is related to the current ElasticSearch version. Check the GitHub page for further details.
To map a field as an attachment, it's necessary to set the type to attachment.
Internally, the attachment field is a multifield that takes binary data encoded as base64.
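The base64 step can be sketched in a few lines of Python. This is only a minimal illustration, not part of the plugin: the helper names encode_file and decode_payload are hypothetical, and the sketch assumes the whole file fits in memory.

```python
import base64
from pathlib import Path


def encode_file(path: str) -> str:
    """Read a file and return its bytes as a base64 ASCII string.

    The result is plain text, so it can be used as a JSON string value.
    """
    raw = Path(path).read_bytes()
    return base64.b64encode(raw).decode("ascii")


def decode_payload(payload: str) -> bytes:
    """Inverse of encode_file: recover the original bytes from base64 text."""
    return base64.b64decode(payload)
```

The encoded string is roughly one third larger than the original file, which is worth keeping in mind when indexing large attachments.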
A mapping for an e-mail that stores an attachment looks as follows:
{
  "email": {
    "properties": {
      "sender": {
        "type": "string",
        "store": "yes",
        "index": "not_analyzed"
      },
      "date": {
        "type": "date",
        "store": "no",
        "index": "not_analyzed"
      },
      "document": {
        "type": "attachment",
        "fields": {
          "file": { "store": "yes", "index": "analyzed" },
          "date": { "store": "yes" },
          "author": { "store": "yes" },
          "keywords": { "store": "yes" },
          "content_type": { "store": "yes" },
          "title": { "store": "yes" }
        }
      }
    }
  }
}
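A document that fits this mapping could be assembled as in the following Python sketch. The field names sender, date, and document come from the mapping; the function name build_email_document and the ISO 8601 date string are assumptions made for illustration, and sending the body to the cluster is left out.

```python
import base64
import json


def build_email_document(sender: str, date: str, attachment: bytes) -> str:
    """Return a JSON body for the email mapping.

    The raw attachment bytes are base64-encoded and passed as the value
    of the "document" field, which is mapped with the attachment type.
    """
    body = {
        "sender": sender,
        # Assumption: the date is supplied as an ISO 8601 string.
        "date": date,
        "document": base64.b64encode(attachment).decode("ascii"),
    }
    return json.dumps(body)
```

For example, build_email_document("alice@example.com", "2013-05-01T10:00:00", pdf_bytes) would return a JSON string ready to be indexed into the email type, where pdf_bytes holds the raw file content.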
The attachment plugin internally uses Apache Tika, a library specialized in text extraction from documents. The list of supported document types is available on the Apache Tika website (http://tika.apache.org/1.3/formats.html), and it covers all common file types.
The attachment type field receives a base64 binary stream that is processed by the Tika metadata and text extractors. The field can be seen as a multifield that stores different contents in its subfields:
file: This stores the content of the file
date: This stores the file creation date, extracted by Tika metadata
author: This stores the file author, extracted by Tika metadata
keywords: This stores the file keywords, extracted by Tika metadata
content_type: This stores the file content type
title: This stores the file title, extracted by Tika metadata
The attachment type is an example of how it's possible to extend ElasticSearch with custom types. This is very important when we are managing different types of data that need custom processing before they can be delivered.
The attachment plugin is very useful for indexing documents, e-mails, and any other kind of unstructured document. A good example of an application that uses it is ScrutMyDocs (http://www.scrutmydocs.org/).