Text

Text indexes are special indexes on string-valued fields that support text searches. This book is based on version 3 of the text index functionality, available since MongoDB 3.2.

A text index is specified in the same way as a regular index, except that the index sort order (-1 or 1) is replaced with the word text, as follows:

> db.books.createIndex({"name": "text"})

A collection can have at most one text index. This index can cover multiple fields, whether they are indexed as text or not, but it cannot include other special index types such as multikey or geospatial. A text index cannot be used to sort results, even when the text field is only one part of a compound index.
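Once the text index on name exists, it is queried through the $text operator. The following is a minimal sketch rather than a complete recipe: the search term "mongodb" is an arbitrary example, and the call on db only runs inside the mongo shell, where db is defined.

```javascript
// Query document for a text search against the collection's text index.
// "mongodb" is an arbitrary example search term.
var textQuery = { $text: { $search: "mongodb" } };

// Optional projection that exposes the relevance score of each match.
var withScore = { score: { $meta: "textScore" } };

// `db` only exists inside the mongo shell, so guard the call.
if (typeof db !== "undefined") {
  db.books.find(textQuery, withScore).forEach(printjson);
}
```

Because a collection has only one text index, the query does not name a field; $text always searches whatever fields that index covers.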

Since a collection can have only one text index, we need to choose its fields wisely. Rebuilding a text index can take quite some time, and having only one per collection makes maintenance tricky, as we will see towards the end of this chapter.

Luckily, this index can also be a compound index:

> db.books.createIndex( { "available": 1, "meta_data.page_count": 1,  "$**": "text" } )

A compound index with text fields follows stricter rules than the regular compound indexes explained earlier in this chapter. To run a text search with this index, the query must include equality conditions on every key that precedes the text key, here available and meta_data.page_count. As noted above, the index cannot supply the sort order for the results.
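As a sketch of how the compound index above might be queried: MongoDB expects equality conditions on the keys that precede the text key before a $text search can use the index. The values true and 256 and the search term are made-up examples, not taken from the book's data set.

```javascript
// Equality conditions on the two leading keys, followed by the text search.
// The values (true, 256) and the search term are illustrative only.
var compoundQuery = {
  "available": true,
  "meta_data.page_count": 256,
  "$text": { "$search": "mongodb" }
};

// `db` only exists inside the mongo shell, so guard the call.
if (typeof db !== "undefined") {
  db.books.find(compoundQuery).forEach(printjson);
}
```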

We can also blindly index as text each and every field in a document that contains strings:

> db.books.createIndex( { "$**": "text" } )

This can result in unbounded index size and should generally be avoided, but it can be useful for unstructured data, for example data coming straight from application logs, where we don't know in advance which fields will be useful and want to be able to query as many of them as possible.

Text indexes apply stemming (removing common suffixes, such as the plural s or es in English words) and remove stop words (a, an, the, and so on) from the index.
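To make the effect of stemming and stop-word removal concrete, here is a deliberately simplified illustration in plain JavaScript. It is not MongoDB's algorithm: the server uses language-specific stemmers and stop-word lists, whereas the toyStem and toyIndexTerms functions and the short STOP_WORDS sample below are hypothetical helpers written only for this example.

```javascript
// Toy illustration only: MongoDB uses language-specific stemmers and
// stop-word lists, not this simplified logic.
var STOP_WORDS = ["a", "an", "the", "and", "of"]; // small sample, not the real list

// Strip a trailing plural "es" or "s" (very rough English stemming).
function toyStem(word) {
  if (word.length > 4 && /(ss|sh|ch|x)es$/.test(word)) {
    return word.slice(0, -2); // boxes -> box
  }
  if (word.length > 3 && /[^s]s$/.test(word)) {
    return word.slice(0, -1); // books -> book
  }
  return word;
}

// Lowercase, split on whitespace, drop stop words, stem what is left.
function toyIndexTerms(text) {
  return text
    .toLowerCase()
    .split(/\s+/)
    .filter(function (w) { return w.length > 0 && STOP_WORDS.indexOf(w) === -1; })
    .map(toyStem);
}

var terms = toyIndexTerms("The books and the boxes of an author");
// terms is ["book", "box", "author"]
```

The practical consequence is that a search for "book" and a search for "books" reach the same index entry, while words such as "the" never match anything.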

Text indexing supports more than 20 languages, including Spanish, Chinese, Urdu, Persian, and Arabic. For languages other than English, the index must be told which language to use so that the correct stemming and stop-word rules are applied.
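A minimal sketch of that configuration follows. It assumes a hypothetical collection named libros whose titulo field holds Spanish text; neither name comes from the book's data set. default_language is the createIndex option that selects the language used for stemming and stop words.

```javascript
// Hypothetical collection and field names, used only for illustration.
var spanishKeys = { "titulo": "text" };

// default_language selects the stemmer and stop-word list for the index.
var spanishOptions = { "default_language": "spanish" };

// `db` only exists inside the mongo shell, so guard the call.
if (typeof db !== "undefined") {
  db.libros.createIndex(spanishKeys, spanishOptions);
}
```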

Case insensitivity and diacritic insensitivity: A text index is case- and diacritic-insensitive. Version 3 of the text index (the default since MongoDB 3.2) supports the common C, simple S, and special T case foldings as described in Unicode Character Database 8.0 case folding. In addition to case insensitivity, version 3 of the text index supports diacritic insensitivity, which extends insensitivity to accented characters in both lowercase and uppercase form. For example, e, è, é, ê, ë and their capital letter counterparts all compare as equal when using a text index. In previous versions of the text index, these were treated as different strings.
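The folding described above can be approximated in plain JavaScript. The foldForComparison helper below is a hypothetical name, and the technique (Unicode NFD decomposition, stripping combining marks, lowercasing) is only a stand-in for the Unicode 8.0 case-folding rules the index actually follows, so edge cases may differ.

```javascript
// Approximation only: decompose accented letters (NFD), drop the combining
// marks, then lowercase. The real index follows Unicode 8.0 case folding.
function foldForComparison(s) {
  return s
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "") // remove combining diacritical marks
    .toLowerCase();
}

var variants = ["e", "è", "é", "ê", "ë", "E", "È", "É", "Ê", "Ë"];
var folded = variants.map(foldForComparison);
// Every entry of folded is "e".
```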

Tokenization delimiters: Version 3 of the text index uses as tokenization delimiters the characters categorized under Dash, Hyphen, Pattern_Syntax, Quotation_Mark, Terminal_Punctuation, and White_Space in the Unicode Character Database 8.0 Prop List.
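As a rough illustration of splitting on these categories, JavaScript regular expressions can match most of them through Unicode property escapes. This is a sketch, not the server's tokenizer: toyTokenize is a hypothetical helper, and Hyphen is left out because it is a deprecated Unicode property that, as far as this sketch assumes, JavaScript does not expose (the ASCII "-" is covered by Dash anyway).

```javascript
// Character class built from the Unicode binary properties named above.
// Hyphen is omitted (assumed unavailable in JavaScript); "-" is in Dash.
var DELIMITERS = /[\p{Dash}\p{Pattern_Syntax}\p{Quotation_Mark}\p{Terminal_Punctuation}\p{White_Space}]+/u;

// Split on runs of delimiter characters and discard empty pieces.
function toyTokenize(text) {
  return text.split(DELIMITERS).filter(function (t) { return t.length > 0; });
}

var tokens = toyTokenize('full-text "search": fast, simple!');
// tokens is ["full", "text", "search", "fast", "simple"]
```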
