Inverted indexes

An inverted index is the core data structure of Elasticsearch and of any other system that supports full-text search. It is similar to the index at the end of a book: it maps each term that appears in the documents to the documents that contain it.

For example, you may build an inverted index from the following strings:

| Document ID | Document                            |
|-------------|-------------------------------------|
| 1           | It is Sunday tomorrow.              |
| 2           | Sunday is the last day of the week. |
| 3           | The choice is yours.                |

When these three documents are indexed, Elasticsearch builds the following data structure, which is called an inverted index:

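| Term     | Frequency | Documents (postings) |
|----------|-----------|----------------------|
| choice   | 1         | 3                    |
| day      | 1         | 2                    |
| is       | 3         | 1, 2, 3              |
| it       | 1         | 1                    |
| last     | 1         | 2                    |
| of       | 1         | 2                    |
| sunday   | 2         | 1, 2                 |
| the      | 3         | 2, 3                 |
| tomorrow | 1         | 1                    |
| week     | 1         | 2                    |
| yours    | 1         | 3                    |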

Notice the following things:

  • Documents were broken down into terms after punctuation was removed and the text was converted to lowercase.
  • Terms are sorted alphabetically.
  • The Frequency column captures how many times the term appears in the entire document set. For example, the term "the" has a frequency of 3 even though it appears in only two documents, because it occurs twice in document 2.
  • The third column captures the documents in which the term was found. It may also contain the exact locations (offsets within the document) where the term was found.
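The points above can be sketched in a few lines of Python. This is only an illustration built on the three example documents; the names `tokenize` and `build_inverted_index` are hypothetical, and the regular-expression tokenizer is an assumed stand-in for the text analysis that Elasticsearch actually performs.

```python
import re
from collections import defaultdict

# The three example documents, keyed by document ID.
DOCUMENTS = {
    1: "It is Sunday tomorrow.",
    2: "Sunday is the last day of the week.",
    3: "The choice is yours.",
}


def tokenize(text):
    """Lowercase the text and return its alphanumeric terms (punctuation is dropped)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Return {term: {"frequency": total occurrences, "postings": sorted doc IDs}}."""
    frequency = defaultdict(int)
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            frequency[term] += 1         # count every occurrence of the term
            postings[term].add(doc_id)   # remember which document contains it
    # Emit the terms in alphabetical order, as in the table.
    return {
        term: {"frequency": frequency[term], "postings": sorted(postings[term])}
        for term in sorted(frequency)
    }


index = build_inverted_index(DOCUMENTS)
for term, entry in index.items():
    print(f"{term:<10}{entry['frequency']:<4}{entry['postings']}")
```

Running the sketch prints one row per term, matching the Term, Frequency, and Documents columns of the table.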

When searching for a term, locating the documents in which it appears is very fast. If the user searches for the term sunday, looking up sunday in the Term column is quick because the terms are sorted, and a sorted list can be searched in a number of steps proportional to the logarithm of its size. Even with millions of terms, a lookup takes only a few dozen comparisons at most.
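To make the benefit of sorted terms concrete, here is a minimal sketch that uses binary search from Python's standard `bisect` module. The `TERMS` and `POSTINGS` lists are copied from the table, and `lookup` is a hypothetical helper. The term dictionary inside Elasticsearch is more elaborate than a plain sorted list, so treat this as an illustration of the idea only.

```python
from bisect import bisect_left

# Sorted terms and their postings lists, in the same order as the table.
TERMS = ["choice", "day", "is", "it", "last", "of",
         "sunday", "the", "tomorrow", "week", "yours"]
POSTINGS = [[3], [2], [1, 2, 3], [1], [2], [2],
            [1, 2], [2, 3], [1], [2], [3]]


def lookup(term):
    """Binary-search the sorted term list; return the postings, or [] if absent."""
    position = bisect_left(TERMS, term)
    if position < len(TERMS) and TERMS[position] == term:
        return POSTINGS[position]
    return []


print(lookup("sunday"))  # [1, 2]
print(lookup("monday"))  # []
```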

Next, consider a scenario in which the user searches for two words, for example, last sunday. The inverted index is used to look up last and sunday individually. Document 2 contains both terms, so it is a better match than document 1, which contains only sunday.
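The two-word search can be sketched as follows. The `search` function is hypothetical, and ranking documents by the number of matching query terms is a deliberate simplification assumed for illustration; Elasticsearch computes a proper relevance score rather than a simple count.

```python
from collections import Counter

# Postings lists copied from the table (term -> IDs of documents containing it).
POSTINGS = {
    "choice": [3], "day": [2], "is": [1, 2, 3], "it": [1], "last": [2],
    "of": [2], "sunday": [1, 2], "the": [2, 3], "tomorrow": [1],
    "week": [2], "yours": [3],
}


def search(query):
    """Return (doc_id, matched_term_count) pairs, best match first."""
    matches = Counter()
    # Look up each distinct query term individually and tally the documents.
    for term in set(query.lower().split()):
        for doc_id in POSTINGS.get(term, []):
            matches[doc_id] += 1
    # More matching terms first; ties are broken by the lower document ID.
    return sorted(matches.items(), key=lambda item: (-item[1], item[0]))


print(search("last sunday"))  # [(2, 2), (1, 1)]
```

Document 2 is listed first because it matches both terms, while document 1 matches only sunday.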

The inverted index is the building block for fast searches. It also makes it easy to look up how many times a term occurs in the index, which is a simple count aggregation. Of course, Elasticsearch adds many optimizations on top of the bare inverted index explained here, and it caters to both search and analytics.
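As a rough sketch of why counting is cheap, the counts can be read straight from the index without touching the original documents. The `INDEX` dictionary below is copied from the table, and the helper names `term_count` and `document_count` are hypothetical; they are not Elasticsearch APIs.

```python
# term -> (frequency, postings), copied from the table.
INDEX = {
    "choice": (1, [3]), "day": (1, [2]), "is": (3, [1, 2, 3]),
    "it": (1, [1]), "last": (1, [2]), "of": (1, [2]),
    "sunday": (2, [1, 2]), "the": (3, [2, 3]), "tomorrow": (1, [1]),
    "week": (1, [2]), "yours": (1, [3]),
}


def term_count(term):
    """Total number of occurrences of the term across all documents."""
    frequency, _ = INDEX.get(term, (0, []))
    return frequency


def document_count(term):
    """Number of distinct documents that contain the term."""
    _, postings = INDEX.get(term, (0, []))
    return len(postings)


print(term_count("the"))      # 3
print(document_count("the"))  # 2
print(sum(frequency for frequency, _ in INDEX.values()))  # 16 terms in total
```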

By default, Elasticsearch builds an inverted index on all the fields in the document, pointing back to the Elasticsearch document in which the field was present. 
