Understanding the structure of an inverted index

A Lucene index is essentially an inverted flat index. When Lucene indexes the text of a resource we are interested in, it builds an internal representation that records every term found, the number of times it occurs, and the documents in which it is found.

So, the real internal structure for an index is somewhat similar to the following diagram:

[Diagram: the structure of an inverted index]

This structure is what is generally called an inverted index. It explains why Lucene is so fast at returning results for complex full-text searches and at creating and saving indexes, and why it generally has a limited memory footprint. The structure also suggests that once a textual value has been analyzed and its frequencies and positions have been saved, we don't necessarily need to store the original value when updating a field instance in the index. This is why Lucene generally retrieves documents very quickly at search time, but can require a certain amount of time for a full update of a very big index.
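To make the idea concrete, here is a minimal sketch of a toy in-memory inverted index in Python. It is not how Lucene stores its data; the function names (`build_inverted_index`, `search`, `term_frequency`) and the sample documents are made up for illustration, and the analysis step is assumed to be a plain lowercase-and-split.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive analysis (an assumption): lowercase and split on whitespace.
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


def search(index, term):
    """Return the sorted ids of the documents that contain the term."""
    return sorted(index.get(term.lower(), {}))


def term_frequency(index, term, doc_id):
    """Return how many times the term occurs in the given document."""
    return len(index.get(term.lower(), {}).get(doc_id, []))


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "the quick dog chases the fox",
}
index = build_inverted_index(docs)
```

Looking up a term is a single dictionary access that yields the matching documents directly, without scanning any text, which is the property that makes searches fast. Note also that the original strings are not needed to answer `search` or `term_frequency` once the index has been built.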

Note

You can get a complete introduction to the Lucene syntax at http://lucene.apache.org/core/4_5_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html.

You can get a gradual and clear introduction at http://www.lucenetutorial.com/basic-concepts.html.

You can find useful and clear materials to study at http://www.lucenetutorial.com/lucene-query-syntax.html.

Understanding how optimization affects the segments of an index

After running an optimization, it is very common to find that the number of segment files has changed. As a result of the optimization process, the segments are usually merged into fewer files, giving a more compact index. A small recipe for requesting an optimization appears a little later, so I suggest you try these recipes on your own data as an exercise.
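The effect of merging can be pictured with a small Python sketch. This is only a toy model under stated assumptions: each segment is represented as a dictionary from term to a sorted list of document ids, and `merge_segments` is a hypothetical helper, not Lucene's actual merge logic or file format.

```python
def merge_segments(segments):
    """Merge several toy segments (term -> sorted doc ids) into one."""
    merged = {}
    for segment in segments:
        for term, doc_ids in segment.items():
            # Collect the postings for the same term from every segment.
            merged.setdefault(term, set()).update(doc_ids)
    # Keep terms and postings sorted, as a compact single segment would.
    return {term: sorted(ids) for term, ids in sorted(merged.items())}


# Three hypothetical small segments, as if written by three separate commits.
segments = [
    {"fox": [1], "quick": [1]},
    {"dog": [2], "lazy": [2]},
    {"dog": [3], "fox": [3], "quick": [3]},
]

# After the toy "optimization", a single segment holds all the postings.
optimized = [merge_segments(segments)]
```

Before the merge, a lookup for a term has to consult every segment; afterwards there is only one place to look, which is the sense in which the merged index is more compact.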
