Optimization of mapping definition

If your search requirements allow it, you can tune the mapping definition of your index to improve indexing performance. This section looks at the available tips.

Norms

Scoring is the process of calculating the score of a document in the scope of a particular query, and it is an important part of the query process in Lucene. The score indicates how well the document matches the query; in other words, it shows how close the document is to what you are looking for. The higher the score, the more relevant the document. Several factors determine the score. One of them is the norms.

Lucene takes field length into account in the default relevance calculation. When a searched term is found in a short field, Lucene considers it more likely that the content of that field is about the term than if the same term is found in a long field. Therefore, Lucene stores the length of each field for later use at query time. This stored value is called the field-length norm. It is a number that represents the relative field length combined with the index-time boost setting (that is, it is a match weight factor). Norms are useful for scoring and important for full-text search, but this functionality comes at a cost: norms consume approximately 1 byte per string field per document in an index, which adds up to quite a lot of memory. Hence, if you don't need scoring on a specific field, you should disable norms on that field. You can disable norms before indexing with an explicit mapping on the field, as follows:

curl -XPUT localhost:9200/talent -d '{
  "mappings": {
    "talented": {
      "properties": {
        "email": {
          "type": "string",
          "norms": {
            "enabled": false
          }
        }
      }
    }
  }
}'
{"acknowledged":true}

Alternatively, you can disable norms after indexing by using the PUT mapping API, as follows:

curl -XPUT localhost:9200/talent/_mapping/talented -d '{
  "properties": {
    "email": {
      "type": "string",
      "norms": {
        "enabled": false
      }
    }
  }
}'
{"acknowledged":true}

When norms are disabled on a field, scoring no longer takes the field-length norm of that field into account.
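To make the effect of the field-length norm concrete, the following minimal Python sketch computes it for a short and a long field. It assumes Lucene's classic TF/IDF similarity, where the norm is the boost divided by the square root of the number of terms in the field. The function name field_length_norm is hypothetical, and the sketch does not model the lossy compression of the result into a single byte.

```python
import math


def field_length_norm(num_terms: int, boost: float = 1.0) -> float:
    """Classic TF/IDF length norm: boost / sqrt(number of terms in the field)."""
    if num_terms <= 0:
        raise ValueError("num_terms must be positive")
    return boost / math.sqrt(num_terms)


short_field = field_length_norm(4)    # for example, a four-term title
long_field = field_length_norm(400)   # for example, a 400-term body

print(short_field)  # 0.5
print(long_field)   # 0.05
```

A match in the four-term field carries ten times the length weight of a match in the 400-term field. With norms disabled, this difference no longer influences the score.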

Note

Norms are not removed instantly after you disable them. They are removed gradually as you continue indexing new documents, because old segments are merged into new segments along the way. In addition, keep in mind that norms cannot be re-enabled once they have been disabled.

The index_options feature of the string type

Elasticsearch enables some features for string types by default. If you don't need these features, you can improve indexing performance and save memory by disabling them. The index_options parameter allows you to set what is stored in the index. It defaults to positions for analyzed fields, which means that doc numbers, term frequencies, and positions are indexed. Possible values and their meanings are shown in the following table:

Value        Meaning
docs         Only doc numbers are indexed
freqs        Doc numbers and term frequencies are indexed
positions    Doc numbers, term frequencies, and positions are indexed

The term frequency is a weight factor that shows how often the term appears in this document. You can disable term frequencies by executing the following command:

curl -XPUT localhost:9200/talent -d '{
  "mappings": {
    "talented": {
      "properties": {
        "zip_code": {
          "type":          "string",
          "index_options": "docs"
        }
      }
    }
  }
}'

The preceding mapping disables both term frequencies and term positions on the zip_code field, so only doc numbers are indexed. Keep in mind that with this setting the zip_code field does not record how many times a term appears. Also, phrase and proximity queries will not work on the field, because these queries need term positions. In addition, not_analyzed fields use this setting by default.
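The difference between the three values is easier to see on a toy inverted index. The following Python sketch is only an illustration of what each level keeps per term. It is not how Lucene stores postings, and the build_postings function and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_postings(docs, index_options="positions"):
    """Build a toy inverted index: term -> per-document information."""
    if index_options not in ("docs", "freqs", "positions"):
        raise ValueError("unsupported index_options: " + index_options)

    # First collect full position lists, then keep only what the option allows.
    positions = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            positions[term][doc_id].append(pos)

    index = {}
    for term, per_doc in positions.items():
        if index_options == "docs":
            index[term] = sorted(per_doc)  # doc numbers only
        elif index_options == "freqs":
            index[term] = {d: len(p) for d, p in per_doc.items()}  # doc -> count
        else:
            index[term] = {d: list(p) for d, p in per_doc.items()}  # doc -> positions
    return index


docs = ["quick brown fox", "brown fox brown"]
print(build_postings(docs, "docs")["brown"])       # [0, 1]
print(build_postings(docs, "freqs")["brown"])      # {0: 1, 1: 2}
print(build_postings(docs, "positions")["brown"])  # {0: [1], 1: [0, 2]}
```

With docs, nothing records that "brown" occurs twice in the second document or where it occurs, which is why term-frequency scoring and phrase matching are unavailable at that level.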

Exclude unnecessary fields

We mentioned in Chapter 3, Basic Concepts of Mapping, that Elasticsearch takes the text of the other fields of an indexed document and concatenates it into one big string in the _all field. By default, the _all field is enabled. Therefore, you may be able to exclude some fields from the _all field to improve indexing performance and save disk space. For example, suppose we have email and emailverification fields and we expect the same content in both. In this case, there is no practical benefit in including the emailverification field in the _all field, because the email field is already included and this is sufficient to search for an email on the _all field. In such cases, you can exclude the redundant fields. Doing so improves indexing performance, reduces storage costs, and avoids unnecessary I/O operations.

Note

Please refer to the _all section in Chapter 3, Basic Concepts of Mapping, about how to exclude a field in the _all field.
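As a quick reminder of what such an exclusion looks like, the following Python sketch builds a mapping body for the email example above. It assumes the per-field include_in_all mapping parameter and reuses the talented type from the earlier listings. Treat it as an illustration rather than the exact mapping from Chapter 3.

```python
import json

# Hypothetical mapping body: email stays in _all, emailverification is excluded.
mapping = {
    "properties": {
        "email": {"type": "string"},
        "emailverification": {"type": "string", "include_in_all": False},
    }
}

# Serialize to the JSON that would be sent with the PUT mapping API.
body = json.dumps(mapping, indent=2)
print(body)
```

The printed JSON can be sent as the request body of a PUT mapping call, in the same way as the norms example earlier in this section.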

Extension of the automatic index refresh time

When data has to be persisted, hard disk drives can become a bottleneck for I/O operations. To avoid this bottleneck while still making new documents searchable in near real time, Elasticsearch uses the filesystem cache that sits between itself and the disk. A new segment is written to the filesystem cache first, and is flushed to disk by Elasticsearch later. (If you are not familiar with segments, you may want to read the Segments and merging policies section before this section.) This lightweight process of writing and opening a new segment is called a refresh in Elasticsearch. By default, every shard (and therefore every index) is refreshed once per second. This is how Elasticsearch supports near real-time search.

Of course, refreshing shards every second has a cost, especially when working with large volumes of data. You can change the automatic refresh interval or turn automatic refresh off. This can be done in the following two ways:

  • For all indices in your cluster, by setting the index.refresh_interval parameter in the configuration file
  • On a per-index basis, by updating the index settings

When you want to set an automatic refresh value for all indices in your cluster, you must make the following adjustment in the elasticsearch.yml file:

index.refresh_interval: 30s

In the preceding definition, the refresh interval is set to 30 seconds for all indices in the cluster. The index.refresh_interval setting defines how often the refresh operation is executed on our indices. It defaults to 1s. Setting the value to -1 turns automatic refresh off. You can set the refresh interval for a single index as follows:

curl -XPUT localhost:9200/my_index/_settings -d '{
  "index": {
    "refresh_interval": "30s"
  }
}'

Extending the automatic refresh interval enables faster indexing, because fewer refreshes mean fewer new segments to write and open, which reduces I/O operations. Note, however, that newly created documents and changes to existing documents will not appear in searches until the next refresh takes place.
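The trade-off between refresh cost and search visibility can be sketched with a small simulation. The following Python code is a simplified model, not Elasticsearch behavior: it assumes that refreshes fire at exact multiples of the interval, and the functions visible_at and refresh_count as well as the sample timestamps are hypothetical.

```python
import math


def visible_at(indexed_at: float, refresh_interval: float) -> float:
    """Time of the first refresh at or after the moment a document was indexed."""
    return math.ceil(indexed_at / refresh_interval) * refresh_interval


def refresh_count(duration: float, refresh_interval: float) -> int:
    """Number of refreshes that fire within `duration` seconds."""
    return int(duration // refresh_interval)


indexed_at = [0.2, 1.5, 12.0, 29.9]  # seconds at which documents were indexed

for interval in (1, 30):
    delays = [visible_at(t, interval) - t for t in indexed_at]
    # interval, refreshes per minute, worst-case wait until searchable
    print(interval, refresh_count(60, interval), round(max(delays), 1))
# prints:
# 1 60 0.8
# 30 2 29.8
```

In this model, moving from 1s to 30s cuts the number of refreshes per minute from 60 to 2, while the longest wait before a document becomes searchable grows from under a second to almost 30 seconds.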
