Elasticsearch has an ICU analysis plugin. You can use this plugin to apply the normalization forms mentioned in the previous section, thereby ensuring that all of your tokens are in the same form. Note that the plugin version must be compatible with the version of Elasticsearch on your machine:
bin/plugin install elasticsearch/elasticsearch-analysis-icu/2.7.0
After installing, the plugin registers itself by default under icu_normalizer or icuNormalizer. You can see an example of its usage as follows:
curl -XPUT localhost:9200/my_index -d '{
  "settings": {
    "analysis": {
      "filter": {
        "nfkc_normalizer": {
          "type": "icu_normalizer",
          "name": "nfkc"
        }
      },
      "analyzer": {
        "my_normalizer": {
          "tokenizer": "icu_tokenizer",
          "filter": [ "nfkc_normalizer" ]
        }
      }
    }
  }
}'
The preceding configuration normalizes all tokens into the NFKC normalization form.
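Outside of Elasticsearch, the effect of NFKC normalization can be sketched with Python's standard unicodedata module. This illustrates the Unicode normalization form itself, not the plugin:

```python
import unicodedata

# NFKC replaces compatibility characters with their canonical
# equivalents: the single "fi" ligature character becomes the
# two letters "fi", and fullwidth digits become ASCII digits.
print(unicodedata.normalize("NFKC", "\ufb01le"))   # -> file
print(unicodedata.normalize("NFKC", "１２３"))      # -> 123
```

With all tokens reduced to one form like this, two spellings that only differ in their Unicode representation match at search time.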
For more information about ICU, refer to http://site.icu-project.org. To examine the plugin itself, refer to https://github.com/elastic/elasticsearch-analysis-icu.
The ASCII Folding token filter converts alphabetic, numeric, and symbolic Unicode characters that are not in the first 127 ASCII characters into their corresponding ASCII equivalents, if such equivalents exist.
To see how it works, run the following command:
curl -XGET 'localhost:9200/_analyze?tokenizer=standard&filters=asciifolding&pretty' -d "Le déjà-vu est la sensation d'avoir déjà été témoin ou d'avoir déjà vécu une situation présente"
{
  "tokens" : [
    { "token" : "Le", "start_offset" : 0, "end_offset" : 2, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "deja", "start_offset" : 3, "end_offset" : 7, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "vu", "start_offset" : 8, "end_offset" : 10, "type" : "<ALPHANUM>", "position" : 3 },
    { "token" : "est", "start_offset" : 11, "end_offset" : 14, "type" : "<ALPHANUM>", "position" : 4 },
    { "token" : "la", "start_offset" : 15, "end_offset" : 17, "type" : "<ALPHANUM>", "position" : 5 },
    { "token" : "sensation", "start_offset" : 18, "end_offset" : 27, "type" : "<ALPHANUM>", "position" : 6 },
    { "token" : "d'avoir", "start_offset" : 28, "end_offset" : 35, "type" : "<ALPHANUM>", "position" : 7 },
    { "token" : "deja", "start_offset" : 36, "end_offset" : 40, "type" : "<ALPHANUM>", "position" : 8 },
    { "token" : "ete", "start_offset" : 41, "end_offset" : 44, "type" : "<ALPHANUM>", "position" : 9 },
    { "token" : "temoin", "start_offset" : 45, "end_offset" : 51, "type" : "<ALPHANUM>", "position" : 10 },
    { "token" : "ou", "start_offset" : 52, "end_offset" : 54, "type" : "<ALPHANUM>", "position" : 11 },
    { "token" : "d'avoir", "start_offset" : 55, "end_offset" : 62, "type" : "<ALPHANUM>", "position" : 12 },
    { "token" : "deja", "start_offset" : 63, "end_offset" : 67, "type" : "<ALPHANUM>", "position" : 13 },
    { "token" : "vecu", "start_offset" : 68, "end_offset" : 72, "type" : "<ALPHANUM>", "position" : 14 },
    { "token" : "une", "start_offset" : 73, "end_offset" : 76, "type" : "<ALPHANUM>", "position" : 15 },
    { "token" : "situation", "start_offset" : 77, "end_offset" : 86, "type" : "<ALPHANUM>", "position" : 16 },
    { "token" : "presente", "start_offset" : 87, "end_offset" : 95, "type" : "<ALPHANUM>", "position" : 17 }
  ]
}
As you can see, even though a user may enter déjà, the filter converts it to deja; likewise, été is converted to ete. The ASCII Folding token filter doesn't require any configuration but, if desired, you can include it directly in a custom analyzer as follows:
curl -XPUT localhost:9200/my_index -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "asciifolding" ]
        }
      }
    }
  }
}'
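The core idea behind ASCII folding, decomposing accented characters and dropping the combining marks, can be sketched in Python. This is a simplified illustration of the technique, not the actual Lucene filter, which also handles many characters that have no Unicode decomposition:

```python
import unicodedata

def ascii_fold(text):
    # Decompose characters into base letter + combining marks (NFD),
    # e.g. "é" -> "e" + COMBINING ACUTE ACCENT, then drop the marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if not unicodedata.combining(ch))

print(ascii_fold("déjà"))  # -> deja
print(ascii_fold("été"))   # -> ete
```

Combined with the lowercase filter, as in the preceding analyzer, this is why a query for deja vu matches a document containing Déjà-vu.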