Mapping an attachment field

ElasticSearch lets you extend its core types with native plugins that provide new mapping types. One of the most commonly used custom field types is attachment.

It allows indexing and searching the contents of common document files, such as Microsoft Office formats, OpenDocument formats, PDF, ePub, and many others.

Getting ready

You need a working ElasticSearch cluster with the attachment plugin (https://github.com/elasticsearch/elasticsearch-mapper-attachments) installed.

It can be installed from the command line with the following command:

 bin/plugin -install elasticsearch/elasticsearch-mapper-attachments/1.9.0

The plugin version must match the ElasticSearch version you are running. Check the GitHub page for the compatibility details.

How to do it...

To map a field as an attachment, it's necessary to set the type to attachment.

Internally, the attachment field is a multifield that takes binary data encoded in base64.

A mapping for an e-mail type that stores an attachment looks like this:

{
  "email": {
    "properties": {
      "sender": {
        "type": "string",
        "store": "yes",
        "index": "not_analyzed"
      },
      "date": {
        "type": "date",
        "store": "no",
        "index": "not_analyzed"
      },
      "document": {
        "type": "attachment",
        "fields": {
          "file": {
            "store": "yes",
            "index": "analyzed"
          },
          "date": {
            "store": "yes"
          },
          "author": {
            "store": "yes"
          },
          "keywords": {
            "store": "yes"
          },
          "content_type": {
            "store": "yes"
          },
          "title": {
            "store": "yes"
          }
        }
      }
    }
  }
}
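
As a minimal sketch of how a document matching this mapping might be prepared, the following Python snippet base64-encodes the raw bytes of a file and places the result in the document field. The function name build_email_document and the sample values are hypothetical and used only for illustration; sending the body to the cluster is left out.

```python
import base64
import json


def build_email_document(sender, date, attachment_bytes):
    """Return a JSON body for the email type (hypothetical helper).

    The attachment field expects base64 text, so the raw bytes are
    encoded before being placed in the "document" field.
    """
    encoded = base64.b64encode(attachment_bytes).decode("ascii")
    return json.dumps({
        "sender": sender,
        "date": date,
        "document": encoded,
    })


if __name__ == "__main__":
    # Sample values are made up for illustration.
    print(build_email_document("alice@example.com",
                               "2013-05-01T10:00:00",
                               b"Hello attachment"))
```

In practice the bytes would come from reading a file in binary mode, and the resulting JSON would be sent as the body of a normal index request.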

How it works...

The attachment plugin internally uses Apache Tika, a library specialized in extracting text from documents. The full list of supported document types is available on the Apache Tika website (http://tika.apache.org/1.3/formats.html); it covers all common file types.

The attachment field receives a base64-encoded binary stream, which is processed by Tika's metadata and text extractors. The field can be seen as a multifield that stores different contents in its subfields:

  • file: This stores the content of the file
  • date: This stores the file creation date, extracted by Tika metadata
  • author: This stores the file author, extracted by Tika metadata
  • keywords: This stores the file keywords, extracted by Tika metadata
  • content_type: This stores the file content type
  • title: This stores the file title, extracted by Tika metadata

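Because the metadata subfields in the mapping are stored, they can be searched and returned on their own. The sketch below builds a search body that matches on the author subfield and asks for the title and content type back. It assumes the subfields are addressable as document.author, document.title, and so on; the helper name build_author_search is hypothetical, and the exact field paths should be checked against the installed plugin version.

```python
import json


def build_author_search(author, return_fields=("document.title",
                                                "document.content_type")):
    """Build a search body over an attachment metadata subfield.

    Assumption: subfields are addressed as "<field>.<subfield>",
    for example "document.author" for the mapping in this recipe.
    """
    return {
        "fields": list(return_fields),
        "query": {"match": {"document.author": author}},
    }


if __name__ == "__main__":
    print(json.dumps(build_author_search("Jane Doe"), indent=2))
```
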
    Tip

    By default, the attachment plugin extracts the first 100000 characters of a document. This limit can be changed globally with the index setting index.mapping.attachment.indexed_chars, or for a single document by passing a value in the _indexed_chars property when indexing it.
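
To illustrate the per-document form, the sketch below builds a document in which the attachment is given as an object rather than a plain base64 string, so that _indexed_chars can travel with it. The _content key for the base64 payload is an assumption based on the plugin's documentation and should be verified for your version; the helper name build_limited_attachment is hypothetical.

```python
import base64
import json


def build_limited_attachment(attachment_bytes, indexed_chars):
    """Build a document that limits extraction for this attachment only.

    Assumption: the plugin accepts an object with "_content" (base64
    text) and "_indexed_chars" (maximum characters to extract).
    """
    return {
        "document": {
            "_content": base64.b64encode(attachment_bytes).decode("ascii"),
            "_indexed_chars": indexed_chars,
        }
    }


if __name__ == "__main__":
    # Extract at most 500 characters from this particular attachment.
    print(json.dumps(build_limited_attachment(b"Hello attachment", 500)))
```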

There's more...

The attachment type is an example of how ElasticSearch can be extended with custom types. This is important when we manage kinds of data that need custom processing before they can be indexed.

The attachment plugin is very useful for indexing documents, e-mails, and any other kind of unstructured document. A good example of an application that uses it is ScrutMyDocs (http://www.scrutmydocs.org/).

See also
