Standard analyzer

The Standard Analyzer is suitable for many languages and situations, and it can be customized for the underlying language or use case. The Standard Analyzer comprises the following components:

Tokenizer:

  • Standard tokenizer: A grammar-based tokenizer that splits text on word boundaries, as defined by the Unicode Text Segmentation algorithm. Note that it does not simply split on whitespace; it also removes most punctuation.

Token filters:

  • Standard token filter: A placeholder token filter within the Standard Analyzer. It does not change any of the input tokens, but is reserved for possible future use.
  • Lowercase token filter: Makes all tokens in the input lowercase.
  • Stop token filter: Removes the specified stopwords. By default, the stopword list is set to _none_, so no stopwords are removed.
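The pipeline described above can be sketched in a few lines of Python. This is only a rough approximation for illustration: the regular expression `\w+` stands in for the real Unicode Text Segmentation tokenizer, and the function name `standard_analyze` is our own, not an Elasticsearch API.

```python
import re

def standard_analyze(text, stopwords=frozenset()):
    """Rough approximation of the Standard Analyzer pipeline:
    tokenize on word boundaries, lowercase, then drop stopwords."""
    tokens = re.findall(r"\w+", text)                 # crude stand-in for the standard tokenizer
    tokens = [t.lower() for t in tokens]              # lowercase token filter
    return [t for t in tokens if t not in stopwords]  # stop token filter (empty set by default)

print(standard_analyze("The Standard Analyzer works this way."))
# ['the', 'standard', 'analyzer', 'works', 'this', 'way']
```

With the default empty stopword set, every word survives as a lowercase token, which mirrors what we will see from the _analyze API below.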

Let's see how Standard analyzer works by default with an example:

PUT index_standard_analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std": {
          "type": "standard"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "type": "text",
        "analyzer": "std"
      }
    }
  }
}

Here, we created an index, index_standard_analyzer. There are two things to notice here:

  • Under the settings element, we explicitly defined one analyzer with the name std. The type of analyzer is standard. Apart from this, we did not do any additional configuration on Standard Analyzer.
  • Under the mappings element, we explicitly set a field-level analyzer on the only field, my_text.

Let's check how Elasticsearch will do the analysis for the my_text field whenever any document is indexed in this index. We can do this test using the _analyze API, as we saw earlier:

POST index_standard_analyzer/_analyze
{
  "field": "my_text",
  "text": "The Standard Analyzer works this way."
}

The output of this command shows the following tokens:

{
  "tokens": [
    {
      "token": "the",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "standard",
      "start_offset": 4,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "analyzer",
      "start_offset": 13,
      "end_offset": 21,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "works",
      "start_offset": 22,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "this",
      "start_offset": 28,
      "end_offset": 32,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "way",
      "start_offset": 33,
      "end_offset": 36,
      "type": "<ALPHANUM>",
      "position": 5
    }
  ]
}

Please note that, in this case, the field-level analyzer for the my_text field was set to Standard Analyzer explicitly. Even if it hadn't been set explicitly, Standard Analyzer is the default analyzer when no other analyzer is specified.

As you can see, all of the tokens in the output are lowercase. Although the Standard Analyzer includes a stop token filter, its stopword list defaults to _none_, so no tokens are filtered out. This is why the _analyze output contains every word as a token.

Let's create another index that uses English language stopwords:

PUT index_standard_analyzer_english_stopwords
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "type": "text",
        "analyzer": "std"
      }
    }
  }
}

Notice the difference here: this new index uses the _english_ stopword list. You can also specify a list of stopwords directly, such as "stopwords": ["a", "an", "the"]. The _english_ value includes all such common English words.

When you try the _analyze API on the new index, you will see that it removes the stopwords, such as the and this:

POST index_standard_analyzer_english_stopwords/_analyze
{
  "field": "my_text",
  "text": "The Standard Analyzer works this way."
}

It returns a response like the following:

{
  "tokens": [
    {
      "token": "standard",
      "start_offset": 4,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "analyzer",
      "start_offset": 13,
      "end_offset": 21,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "works",
      "start_offset": 22,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "way",
      "start_offset": 33,
      "end_offset": 36,
      "type": "<ALPHANUM>",
      "position": 5
    }
  ]
}

English stopwords such as the and this are removed. Notice also that the surviving tokens keep their original positions (way is still at position 5), leaving gaps where stopwords were dropped. As you can see, with a little configuration, Standard Analyzer can be used for English and many other languages.
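The stop filter's behavior of preserving positions can be sketched as follows. This is an illustrative approximation, not Elasticsearch code: the stopword set below is the commonly cited Lucene English stop set (verify it against your Elasticsearch version), and `analyze_with_stopwords` is a hypothetical helper of our own.

```python
import re

# The commonly cited Lucene English stopword set backing _english_
# (check your Elasticsearch version's documentation for the exact list).
ENGLISH_STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
    "if", "in", "into", "is", "it", "no", "not", "of", "on", "or",
    "such", "that", "the", "their", "then", "there", "these", "they",
    "this", "to", "was", "will", "with",
}

def analyze_with_stopwords(text):
    """Lowercase each token and drop stopwords, but keep each surviving
    token's original position -- mirroring the _analyze output above."""
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    return [(tok, pos) for pos, tok in enumerate(tokens)
            if tok not in ENGLISH_STOPWORDS]

print(analyze_with_stopwords("The Standard Analyzer works this way."))
# [('standard', 1), ('analyzer', 2), ('works', 3), ('way', 5)]
```

Preserving positions in this way matters for phrase queries: the engine can still tell that standard and analyzer were adjacent, while way sits two positions after works.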

Let's go through a practical application of creating a custom analyzer.
