ICU analysis plugin

Elasticsearch offers an ICU analysis plugin. You can use this plugin to apply the normalization forms mentioned in the previous section, thereby ensuring that all of your tokens are in the same form. Note that the plugin version must be compatible with the version of Elasticsearch installed on your machine:

bin/plugin install elasticsearch/elasticsearch-analysis-icu/2.7.0

After installation, the plugin registers itself by default under the names icu_normalizer or icuNormalizer. You can see an example of its usage as follows:

curl -XPUT localhost:9200/my_index -d '{
  "settings": {
    "analysis": {
      "filter": {
        "nfkc_normalizer": {
          "type": "icu_normalizer",
          "name": "nfkc"
        }
      },
      "analyzer": {
        "my_normalizer": {
          "tokenizer": "icu_tokenizer",
          "filter":  [ "nfkc_normalizer" ]
        }
      }
    }
  }
}'

The preceding configuration normalizes all tokens into the NFKC normalization form.
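What the icu_normalizer filter does to each token can be previewed with Python's standard unicodedata module, which implements the same Unicode normalization forms. This is only a conceptual sketch of NFKC, not the plugin itself:

```python
import unicodedata

# NFKC replaces compatibility characters with their canonical
# equivalents and composes combining marks into single code points.
samples = ["\ufb01le",      # "file" written with the "fi" ligature
           "caf\u00e9",     # "café" with a precomposed é
           "cafe\u0301"]    # "café" with e + combining acute accent

for s in samples:
    print(repr(s), "->", repr(unicodedata.normalize("NFKC", s)))
```

After NFKC, the ligature form and both spellings of café collapse into the same byte sequences, which is exactly why normalizing at index time makes tokens comparable.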

Note

If you want more information about the ICU, refer to http://site.icu-project.org. If you want to examine the plugin, refer to https://github.com/elastic/elasticsearch-analysis-icu.

ASCII Folding token filter

The ASCII Folding token filter converts alphabetic, numeric, and symbolic Unicode characters that are not in the first 127 ASCII characters into their ASCII equivalents, if such equivalents exist.

To see how it works, run the following command:

curl -XGET 'localhost:9200/_analyze?tokenizer=standard&filters=asciifolding&pretty' -d "Le déjà-vu est la sensation d'avoir déjà été témoin ou d'avoir déjà vécu une situation présente"
{
  "tokens" : [ {
    "token" : "Le",
    "start_offset" :0,
    "end_offset" :2,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "deja",
    "start_offset" :3,
    "end_offset" :7,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "vu",
    "start_offset" :8,
    "end_offset" :10,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "est",
    "start_offset" :11,
    "end_offset" :14,
    "type" : "<ALPHANUM>",
    "position" : 4
  }, {
    "token" : "la",
    "start_offset" :15,
    "end_offset" :17,
    "type" : "<ALPHANUM>",
    "position" : 5
  }, {
    "token" : "sensation",
    "start_offset" :18,
    "end_offset" :27,
    "type" : "<ALPHANUM>",
    "position" : 6
  }, {
    "token" : "d'avoir",
    "start_offset" :28,
    "end_offset" :35,
    "type" : "<ALPHANUM>",
    "position" : 7
  }, {
    "token" : "deja",
    "start_offset" :36,
    "end_offset" :40,
    "type" : "<ALPHANUM>",
    "position" : 8
  }, {
    "token" : "ete",
    "start_offset" :41,
    "end_offset" :44,
    "type" : "<ALPHANUM>",
    "position" : 9
  }, {
    "token" : "temoin",
    "start_offset" :45,
    "end_offset" :51,
    "type" : "<ALPHANUM>",
    "position" : 10
  }, {
    "token" : "ou",
    "start_offset" :52,
    "end_offset" :54,
    "type" : "<ALPHANUM>",
    "position" : 11
  }, {
    "token" : "d'avoir",
    "start_offset" :55,
    "end_offset" :62,
    "type" : "<ALPHANUM>",
    "position" : 12
  }, {
    "token" : "deja",
    "start_offset" :63,
    "end_offset" :67,
    "type" : "<ALPHANUM>",
    "position" : 13
  }, {
    "token" : "vecu",
    "start_offset" :68,
    "end_offset" :72,
    "type" : "<ALPHANUM>",
    "position" : 14
  }, {
    "token" : "une",
    "start_offset" :73,
    "end_offset" :76,
    "type" : "<ALPHANUM>",
    "position" : 15
  }, {
    "token" : "situation",
    "start_offset" :77,
    "end_offset" :86,
    "type" : "<ALPHANUM>",
    "position" : 16
  }, {
    "token" : "presente",
    "start_offset" :87,
    "end_offset" :95,
    "type" : "<ALPHANUM>",
    "position" : 17
  } ]
}
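The accent-stripping part of this behavior can be approximated in plain Python: decompose each character with NFKD and then drop the combining marks. This is only a rough sketch of what asciifolding does for accented letters; the real filter also maps many symbols and punctuation variants that this approach does not cover:

```python
import unicodedata

def ascii_fold(text):
    # NFKD splits accented letters into a base letter plus combining marks;
    # dropping the combining marks leaves the plain ASCII base letter.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

for word in ["déjà", "été", "témoin", "présente"]:
    print(word, "->", ascii_fold(word))
```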

As you can see, even though a user may enter déjà, the filter converts it to deja; likewise, été is converted to ete. The ASCII Folding token filter does not require any configuration, but, if desired, you can include it directly in a custom analyzer as follows:

curl -XPUT localhost:9200/my_index -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding": {
          "tokenizer": "standard",
          "filter":  [ "lowercase", "asciifolding" ]
        }
      }
    }
  }
}'
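The folding analyzer defined above chains the standard tokenizer with the lowercase and asciifolding filters. Its effect on a token stream can be sketched in Python, again using NFKD decomposition for the folding step; the whitespace split below is only a crude stand-in for the standard tokenizer, which also splits on punctuation such as hyphens:

```python
import unicodedata

def ascii_fold(text):
    # Decompose with NFKD, then remove combining accent marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def folding_analyze(text):
    # Rough model of the "folding" analyzer:
    # tokenize, then apply lowercase and asciifolding in order.
    return [ascii_fold(token.lower()) for token in text.split()]

print(folding_analyze("Déjà Vu"))
```

With this chain, a query for deja vu and a document containing Déjà Vu produce the same tokens, which is the point of applying the same analyzer at both index time and search time.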