Chapter 8

Indexing and Ordering Data Sets

WHAT’S IN THIS CHAPTER?

  • Creating indexes that help enhance query performance
  • Creating and maintaining indexes in document databases and column-family databases
  • Ordering NoSQL data sets
  • Making effective design choices to create optimal indexes and ordering patterns

You have already learned the essentials of querying NoSQL databases. In this chapter, you take the next step to ensure that your queries are fast and efficient. In relational databases, a common way to optimize query performance is to leverage database indexes. Similar concepts apply to the world of NoSQL as well.

Indexes exist to increase data access performance. In theory, they are similar in behavior to the index in a book. When you need to search for a term or a word in a book, you have two options, namely:

  • Scan the entire book page by page to search for the term or word.
  • Look up the index at the end to find the pages where the term or word can be found and then browse to those specific pages.

Between these two options it’s a no-brainer to look up the index as opposed to a page-by-page scan to find the term. It makes the job easy and saves time.

In an analogous manner you have two choices when accessing a record in a database:

  • Look through the entire collection or data set item by item.
  • Leverage the index to get to the relevant data quickly.

Obviously, again the index lookup is the preferred choice. Although the book index and the database index are analogous, stretching the similarity too far can cause confusion. A book index is built over free text, so the words or terms indexed are restricted to an important subset of all possible terms. A database index, on the other hand, covers every item in a collection: it is created on an item identifier or a specific property, and each item's value for that key appears in the index.

ESSENTIAL CONCEPTS BEHIND A DATABASE INDEX

There is no single universal formula for creating an index but most useful methods hinge on a few common ideas. The building blocks of these common ideas reside in hash functions and B-tree and B+-tree data structures. In this section you peruse these ideas to understand the underlying theory.

A hash function is a well-defined mathematical function to convert a large, and often variable-sized and complex, data value to a single integer or a set of bytes. The output of a hash function is known by various names, including hash code, hash value, hash sum, and checksum. A hash code is often used as the key for an associative array, also called a hash map. Hash functions come in handy when you are mapping complex database property values to hash codes for index creation.
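
To make the idea concrete, here is a minimal sketch of a toy hash function in JavaScript (the same language the mongo shell examples in this chapter use). The multiplier 31 and the bucket count are illustrative choices, and real databases use far stronger hash functions:

function toyHash(str, buckets) {
    // Fold each character into an accumulator and take the remainder,
    // so every input string lands deterministically in one bucket.
    var h = 0;
    for (var i = 0; i < str.length; i++) {
        h = (h * 31 + str.charCodeAt(i)) % buckets;
    }
    return h;
}

toyHash("Toy Story (1995)", 64); // the same input always yields the same bucket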

A tree data structure distributes a set of values in a tree-like structure. Values are structured in a hierarchical manner with links or pointers between certain nodes in the tree. A binary tree is a tree in which each node has at most two child nodes: one on the left and the other on the right. A node can be a parent, in which case it has at most two child nodes, or it can be a leaf, in which case it's the last node in a chain. At the base of a tree structure is the root node. See Figure 8-1 to understand a binary tree data structure.

A B-tree is a generalization of a binary tree. It allows more than two child nodes for a parent node. A B-tree keeps the data sorted and therefore allows for efficient search and data access. A B+-tree is a special variant of a B-tree. In a B+-tree, all records are stored in the leaves and leaves are sequentially linked. A B+-tree is the most common tree structure used to store a database index.
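
To make the tree traversal concrete, here is a minimal binary search tree sketch in JavaScript. It is only the simplest relative of the B-tree family (a real database index typically uses a B+-tree with many children per node), but it shows how keeping values sorted lets a lookup discard a whole subtree at every step:

function insert(node, value) {
    // Descend left for smaller values, right for larger; insert at a leaf.
    if (node === null) return { value: value, left: null, right: null };
    if (value < node.value) node.left = insert(node.left, value);
    else node.right = insert(node.right, value);
    return node;
}

function contains(node, value) {
    // Each comparison eliminates one subtree, so search cost tracks tree height.
    if (node === null) return false;
    if (value === node.value) return true;
    return contains(value < node.value ? node.left : node.right, value);
}

var root = null;
[42, 7, 99, 23].forEach(function (v) { root = insert(root, v); });
contains(root, 23); // true, found in at most tree-height comparisons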

If you are itching to learn more about the details of B-trees and B+-trees, plenty of material is available online. For a more structured treatment, consider reading Cormen, Leiserson, Rivest, and Stein's Introduction to Algorithms, ISBN 0-262-03384-4.

Though the essential building blocks are the same, indexes are created and applied differently in different NoSQL products. In the subsequent sections in this chapter, indexes in MongoDB, CouchDB, and Apache Cassandra are explained. Effective data sorting is also covered as a part of the exposition on indexes, because the two are closely related.

INDEXING AND ORDERING IN MONGODB

MongoDB provides a wide range of rich options around indexing collections to enhance query performance. By default, it creates an index on the _id property for all collections it contains.

Indexing is best explained in the context of examples. I start out with the MongoDB movie-ratings collection, introduced in Chapter 6. If you don’t have the movie-ratings collection in your MongoDB instance, follow along with the example in Chapter 6 to set up and load the collection. Once it is set up you should have three collections, namely, movies, ratings, and users, at your disposal.

To understand the significance and impact of an index you also need to have a few tools to measure query performance with and without an index. In MongoDB, measuring query performance is facilitated by built-in tools that explain query plans and identify slow-running queries.

A query plan describes what the database server has to do to run a given query. To get started, run the explain plan utility before you delve deeper into its output and what it conveys. To get all the items in the ratings collection, you can query like so:

db.ratings.find();

movielens_indexes.txt

To run explain plan for this query you can run this query:

db.ratings.find().explain();

movielens_indexes.txt

The output of the explain plan would be something like this:

{
    "cursor" : "BasicCursor",
    "nscanned" : 1000209,
    "nscannedObjects" : 1000209,
    "n" : 1000209,
    "millis" : 1549,
    "indexBounds" : {
        
    }
}

The output says it took 1,549 milliseconds to return 1,000,209 (more than 1 million) documents. In returning these 1,000,209 documents, 1,000,209 items were examined. It also states that BasicCursor was used.

As is evident, the output of the explain function is a document as well. Its properties, as shown in the previous example, are as follows:

  • cursor — The cursor used to return the result sets of the query. A cursor can be of two types: basic cursor and B-tree cursor. A basic cursor implies a scan of the entire collection (a table scan), whereas a B-tree cursor means an index was used.
  • nscanned — Number of entries scanned. When using an index, it would correspond to the number of index entries.
  • nscannedObjects — Number of documents scanned.
  • n — The number of documents returned.
  • millis — The time, in milliseconds, taken to run the query.
  • indexBounds — Represents the minimum and maximum index keys within which the query was matched. This field is relevant only when an index is used.

The next example queries for a subset of the ratings. The ratings collection consists of ratings (on a scale of 1 to 5) for a set of movies by a set of users. To filter the ratings collection, restrict it to a subset that relates to a particular movie. The ratings collection has only movie IDs, so to correlate IDs to names you need to look up the value in the movies collection. I use the original Toy Story (that is, Toy Story 1) movie as an example. You can choose to pick another one!

To get the document that relates to Toy Story you can leverage a good old regular expression. You know of these query-filtering techniques from Chapter 6. If you feel unsure, don’t hesitate to refer to that chapter to review the concepts. All documents relating to Toy Story in the movies collection can be queried like so:

db.movies.find({title: /Toy Story/i});

movielens_indexes.txt

The output should be as follows:

{ "_id" : 1, "title" : "Toy Story (1995)", "genres" : [ "Animation", 
"Children's", "Comedy" ] }
{ "_id" : 3114, "title" : "Toy Story 2 (1999)", "genres" : [ "Animation", 
"Children's", "Comedy" ] }

I guess Toy Story 3 wasn’t released when these ratings were compiled. That’s why you don’t see that in the list. Next, take the movie ID for "Toy Story", which happens to be 1, and use that to find all the relevant ratings from all the users. Before you do that, though, run the explain plan function to view how the database ran the regular expression query to find Toy Story in the movies collection. You can run the explain plan like so:

db.movies.find({title: /Toy Story/i}).explain();

movielens_indexes.txt

The output should be as follows:

{
    "cursor" : "BasicCursor",
    "nscanned" : 3883,
    "nscannedObjects" : 3883,
    "n" : 2,
    "millis" : 6,
    "indexBounds" : {
        
    }
}

Run a count, using db.movies.count();, on the movies collection to verify the number of documents and you will observe that it matches the nscanned and nscannedObjects values of the query explanation. This means the regular expression query led to a table scan, which isn't efficient. The number of documents was limited to 3,883 so the query still ran fast enough, taking only 6 milliseconds. In a short bit you will see how you could leverage indexes to make this query more efficient, but for now return to the ratings collection to get the subset that relates to Toy Story.

To list all ratings that relate to Toy Story (more accurately Toy Story (1995)) you can query as follows:

db.ratings.find({movie_id: 1});

movielens_indexes.txt

To see the query plan for the previous query run explain as follows:

db.ratings.find({movie_id: 1}).explain();

movielens_indexes.txt

The output should be as follows:

{
    "cursor" : "BasicCursor",
    "nscanned" : 1000209,
    "nscannedObjects" : 1000209,
    "n" : 2077,
    "millis" : 484,
    "indexBounds" : {
        
    }
}

At this stage it’s evident that the query is not running optimally because the nscanned and nscannedObjects count reads 1,000,209, which is all the documents in the collection. This is a good point to introduce indexes and optimize things.

CREATING AND USING INDEXES IN MONGODB

The ensureIndex keyword does most of the index creation magic in MongoDB. The last query filtered the ratings collection based on movie_id, so creating an index on that property should transform the lookup from a table scan to a B-tree index traversal. First, verify that the theory holds.

Create the index by running the following command:

db.ratings.ensureIndex({ movie_id:1 });

movielens_indexes.txt

This creates an index on movie_id and sorts the keys in the index in an ascending order. To create an index with keys sorted in descending order use the following:

db.ratings.ensureIndex({ movie_id:-1 });

movielens_indexes.txt

Then rerun the earlier query as follows:

db.ratings.find({movie_id: 1});

movielens_indexes.txt

Verify the query plan after that as follows:

db.ratings.find({movie_id: 1}).explain();

movielens_indexes.txt

The output should be:

{
    "cursor" : "BtreeCursor movie_id_1",
    "nscanned" : 2077,
    "nscannedObjects" : 2077,
    "n" : 2077,
    "millis" : 2,
    "indexBounds" : {
        "movie_id" : [
            [
                1,
                1
            ]
        ]
    }
}

At first glance, it's clear that the number of items (and documents) looked up has been reduced from 1,000,209 (the total number of documents in the collection) to 2,077 (the number of documents that match the filter criteria). This is a huge performance boost. In algorithmic terms, the document search has gone from linear time to logarithmic time, courtesy of the B-tree that backs the index. Accordingly, the total time to run the query dropped from 484 ms to 2 ms, over a 99-percent reduction.

From the query plan cursor value, it's clear that the index movie_id_1 was used. Next, create an index with keys sorted in descending order and rerun the query and the query plan. Before you run the query, though, look at the list of indexes on the ratings collection and learn how to force the use of a particular index.

Getting a list, or more accurately an array, of all indexes is easy. You can query as follows:

db.ratings.getIndexes();

movielens_indexes.txt

There are two indexes on movie_id, one sorted in ascending order and one in descending order, plus the default index on _id, so the list should contain these three. The output of getIndexes is as follows:

[
    {
        "name" : "_id_",
        "ns" : "mydb.ratings",
        "key" : {
            "_id" : 1
        }
    },
    {
        "_id" : ObjectId("4d02ef30e63c3e677005636f"),
        "ns" : "mydb.ratings",
        "key" : {
            "movie_id" : -1
        },
        "name" : "movie_id_-1"
    },
    {
        "_id" : ObjectId("4d032faee63c3e6770056370"),
        "ns" : "mydb.ratings",
        "key" : {
            "movie_id" : 1
        },
        "name" : "movie_id_1"
    }
]

You have already created a descending order index on movie_id with the following command:

db.ratings.ensureIndex({ movie_id:-1 });

movielens_indexes.txt

If required, you could force a query to use a particular index using the hint method. To force the descending order index on movie_id to get ratings related to “Toy Story (1995)” you can query as follows:

db.ratings.find({ movie_id:1 }).hint({ movie_id:-1 });

movielens_indexes.txt

Soon after running this query, you can verify the query plan to see which index was used and how it performed. A query plan for the last query using the descending order index on movie_id can be accessed as follows:

db.ratings.find({ movie_id:1 }).hint({ movie_id:-1 }).explain();

movielens_indexes.txt

The output of the query explain plan is as follows:

{
    "cursor" : "BtreeCursor movie_id_-1",
    "nscanned" : 2077,
    "nscannedObjects" : 2077,
    "n" : 2077,
    "millis" : 17,
    "indexBounds" : {
        "movie_id" : [
            [
                1,
                1
            ]
        ]
    }
}

The explain plan output confirms that the descending order index on movie_id, identified by movie_id_-1, was used. It also shows that like the ascending order index, the descending order index accessed only 2,077 items.

There is one peculiarity in the output, though. Although an index was used and only a select few documents were scanned, it took 17 ms to return the result set. This is much less than the 484 ms the table scan needed, but substantially more than the 2 ms the ascending order index took. The difference is possibly because movie_id 1 sits at the beginning of the ascending order index, and because results were still cached from the previous query. Neither direction deterministically wins: an ascending order index does not reliably outperform a descending order one for documents near the beginning of the list, nor does a descending order index reliably win for documents near the end. In most cases, especially for items somewhere near the middle, both index types perform equally well. To test this claim, use both indexes to search for ratings of a movie whose movie_id is at the other end of the range.

The movie_id field (or property) of the ratings collection corresponds to the _id field of the movies collection. The _id field (and of course the movie_id field) has integer values so finding the movie_id at the top of the descending order sort is the same as finding the maximum value for the _id field in the movies collection. One way to find the maximum value of _id in the movies collection is to sort it in descending order as follows:

db.movies.find().sort({ _id:-1 });

movielens_indexes.txt

The JavaScript console returns only 20 documents at a time so it’s easy to find the maximum value, which is 3,952, at a quick glance. If you are running this query using a language API or any other mechanism you may want to limit the number of items in the result. Because only one item is required, you could simply run the query like so:

db.movies.find().sort({ _id:-1 }).limit(1);

movielens_indexes.txt

If you are wondering why the limit method and not the findOne method was used to return the top item in the list sorted in descending order, you may benefit from knowing that sort doesn’t work with findOne. This is because findOne can return only a single document and the concept of sorting a single document has no meaning. On the other hand, the limit method restricts only the final output to a subset of the total result set.

The movie_id 3952 corresponds to Contender, The (2000). To get ratings for the movie The Contender, you could use either the ascending or the descending ordered index on movie_id. Because the objective here is to analyze how both of these indexes perform for an item that satisfies boundary conditions, you can use both of them one after the other. In both cases you can also run the query plans. The query and query plan commands for the ascending order movie_id index are as follows:

db.ratings.find({ movie_id:3952 }).hint({ movie_id:1 });
db.ratings.find({ movie_id:3952 }).hint({ movie_id:1 }).explain();

movielens_indexes.txt

The output of the query plan is like so:

{
    "cursor" : "BtreeCursor movie_id_1",
    "nscanned" : 388,
    "nscannedObjects" : 388,
    "n" : 388,
    "millis" : 2,
    "indexBounds" : {
        "movie_id" : [
            [
                3952,
                3952
            ]
        ]
    }
}

The query and query plan commands for the descending order movie_id index are as follows:

db.ratings.find({ movie_id:3952 }).hint({ movie_id:-1 });
db.ratings.find({ movie_id:3952 }).hint({ movie_id:-1 }).explain();
{
    "cursor" : "BtreeCursor movie_id_-1",
    "nscanned" : 388,
    "nscannedObjects" : 388,
    "n" : 388,
    "millis" : 0,
    "indexBounds" : {
        "movie_id" : [
            [
                3952,
                3952
            ]
        ]
    }
}

movielens_indexes.txt

Multiple runs of these queries support the theory that values at the extremes don't necessarily benefit from an index sorted to start at the corresponding end. Bear in mind, however, that query plan output is not deterministic; every run can produce different numbers. For example, values could be cached, so a rerun may never hit the underlying data structures at all. Also, for smaller data sets, as is the case with the movies collection, the difference is negligible, and extraneous overheads like I/O lag substantially affect response time. In general, though, and especially for large data sets, choose a sort order that favors the items you query most.

Sometimes, after numerous modifications to a collection it may be worthwhile to rebuild indexes. To rebuild all indexes for the ratings collection you can run this command:

db.ratings.reIndex();

movielens_indexes.txt

You can alternatively use runCommand to reindex:

db.runCommand({ reIndex:'ratings' });

movielens_indexes.txt

Rebuilding indexes is not required in most cases unless the size of the collection has changed in a considerable way or the index seems to be occupying an unusually large amount of disk space.

Sometimes, you may want to drop and create new indexes instead of rebuilding the ones that exist. Indexes can be dropped with the dropIndex command:

db.ratings.dropIndex({ movie_id:-1 });

movielens_indexes.txt

This command drops the descending order movie_id index. You can also drop all indexes if need be. All indexes (except the one on the _id field) can be dropped as follows:

db.ratings.dropIndexes();

movielens_indexes.txt

Compound and Embedded Keys

So far, you have created indexes on only a single field or property. It's also possible to create compound indexes that involve multiple fields. For example, you may choose to create an index on the movie_id and rating fields together. The command to create such an index is:

db.ratings.ensureIndex({ movie_id:1, rating:-1 });

movielens_indexes.txt

This creates a compound index on movie_id (ascending order sorted) and rating (descending order sorted). You can create three more indexes out of the four total possible compound indexes involving movie_id and rating. The four possibilities arise due to possible ascending and descending order sorts of the two keys. The order of the sort can have an impact on queries that involve sorting and range queries so keep the order in mind when you define the compound indexes for your collection.

A compound index involving movie_id and rating can be used to query for documents that are matched on both these keys and the first key (that is, movie_id) alone. When using this index to filter documents on the basis of movie_id alone, the behavior is similar to a single field index on movie_id.
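
As a quick sketch of this prefix rule, the first two of the following queries can use the compound index, whereas the third cannot:

db.ratings.find({ movie_id:1, rating:5 });  // matched on both keys: uses the index
db.ratings.find({ movie_id:1 });            // matched on the movie_id prefix: uses the index
db.ratings.find({ rating:5 });              // rating alone is not a prefix: collection scan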

Compound keys are not restricted to two keys. You can include as many keys as you like. A compound index for movie_id, rating, and user_id can be created like so:

db.ratings.ensureIndex({ movie_id:1, rating:-1, user_id:1 });

movielens_indexes.txt

This index can be used to query for any of these following combinations:

  • movie_id, rating, and user_id
  • movie_id and rating
  • movie_id

Compound indexes can also include nested (or embedded) fields. Before you see how compound indexes involve nested fields, I cover how to create a single index involving a nested field. To illustrate, a collection of people (named people2) is used. An element of the people2 collection is as follows:

I already have a collection called people so I named a second one people2. You can choose to call it something else if you prefer.

{
    "_id" : ObjectId("4d0688c6851e434340b173b7"),
    "name" : "joe",
    "age" : 27,
    "address" : {
        "city" : "palo alto",
        "state" : "ca",
        "zip" : "94303",
        "country" : "us"
    }
}

You can create an index on the zip field of the nested address document as follows:

db.people2.ensureIndex({ "address.zip":1 });

movielens_indexes.txt

Next, you can create a compound index for the name and address.zip fields:

db.people2.ensureIndex({ name:1, "address.zip":1 });

movielens_indexes.txt

You can also choose the entire sub-document as the key of an index so you can create a single index for the address field:

db.people2.ensureIndex({ address:1 });

movielens_indexes.txt

This indexes the entire address sub-document and not just its zip field. Such an index is used when an entire matching sub-document is passed as a query document to get a subset of the collection.
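
For instance, a query that passes the whole address document as the filter can be served by this index. Note that such a match is exact, so the query document must list the same fields, in the same order, as the stored sub-document:

db.people2.find({ address: { city:"palo alto", state:"ca", zip:"94303", country:"us" } });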

A MongoDB collection field can also contain an array instead of a document. You can index such fields as well. Now consider another example of an orders collection to illustrate how array properties can be indexed. An element of the orders collection is as follows:

{
    "_id" : ObjectId("4cccff35d3c7ab3d1941b103"),
    "order_date" : "Sat Oct 30 2010 22:30:12 GMT-0700 (PDT)",
    "line_items" : [
        {
            "item" : {
                "name" : "latte",
                "unit_price" : 4
            },
            "quantity" : 1
        },
        {
            "item" : {
                "name" : "cappuccino",
                "unit_price" : 4.25
            },
            "quantity" : 1
        },
        {
            "item" : {
                "name" : "regular",
                "unit_price" : 2
            },
            "quantity" : 2
        }
    ]
}

You could index on line_items:

db.orders.ensureIndex({ line_items:1 });

movielens_indexes.txt

When an indexed field contains an array, each element of the array is added to the index.

In addition, you could index by the item property of the line_items array:

db.orders.ensureIndex({ "line_items.item":1 });

movielens_indexes.txt

You could go one level further and index it by the name property of the item document contained in the line_items array as follows:

db.orders.ensureIndex({ "line_items.item.name":1 });

movielens_indexes.txt

So, you could query by this nested name field as follows:

db.orders.find({ "line_items.item.name":"latte" });

movielens_indexes.txt

Run the query plan to confirm that the cursor value used for the query is BtreeCursor line_items.item.name_1, which as you know indicates the use of the nested index.

Creating Unique and Sparse Indexes

By now you must be convinced that MongoDB provides a wide array of options to index documents and achieve efficient query performance. In addition to enhancing query performance, indexes can also serve the purpose of imposing constraints.

You could create a sparse index by explicitly specifying it as follows:

db.ratings.ensureIndex({ movie_id:1 }, { sparse:true });

movielens_indexes.txt

A sparse index implies that those documents that have a missing indexed field are completely ignored and left out of the index. This may sometimes be desirable but you need to be aware that a sparse index may not reference all the documents in the collection.
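
A quick, hypothetical illustration of that caveat (the user_id value is made up): a document inserted without the indexed field stays in the collection but never enters the sparse index, so a query answered entirely from that index can miss it:

db.ratings.insert({ user_id: 12345, rating: 4 });   // no movie_id field: left out of the sparse index
db.ratings.find().sort({ movie_id:1 });             // may skip the new document if served from the sparse index
db.ratings.find({ movie_id: { $exists: false } });  // the document is still present in the collection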

MongoDB also provides a facility for creating unique indexes. You could create a unique index on the title field of the movies collection as follows:

db.movies.ensureIndex({ title:1 }, { unique:true });

movielens_indexes.txt

No two items in the movies collection have the same title, but if duplicates did exist, the unique index could not be created unless you explicitly specified that all duplicates after the first entry be dropped. Such explicit specification can be done as follows:

db.movies.ensureIndex({ title:1 }, { unique:true, dropDups : true });

movielens_indexes.txt

If a collection contains documents with a missing value for the indexed field, a null value is indexed in place of the missing value. Unlike with a sparse index, the document is not skipped. As a consequence, if two documents are missing the indexed field, only the first one can be saved; with dropDups in effect, the rest are ignored, because their null index entries would duplicate the first.
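
To see this in action, consider a hypothetical pair of inserts (the _id values are made up) into the movies collection after the unique, non-sparse index on title is in place:

db.movies.insert({ _id: 90001, genres: ["Drama"] });   // no title field: indexed under null
db.movies.insert({ _id: 90002, genres: ["Comedy"] });  // rejected: a second null would violate the unique constraint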

Keyword-based Search and Multikeys

So far, a lot has been said about indexes in MongoDB. All the essential concepts have been covered and most of the nuances illustrated. To wrap up this section and move on to another document database, namely CouchDB, I present one last example: the regular expression-based search in a text field. Earlier in this chapter, the search for the movie ID corresponding to Toy Story prompted the following query:

db.movies.find({title: /Toy Story/i});

A query plan was also run and it showed that all 3,883 documents were scanned to get the response back in 6 ms. The collection of movies is small so a table scan wasn’t that expensive. However, running this same query on a large collection could have been much slower.

To enhance the query performance you could simply create an index like so:

db.movies.ensureIndex({ title:1 });

In some cases, though, creating a traditional index may not be enough, especially when you don’t want to rely on regular expressions and need to do a full text search. You have already seen that a field that contains an array of values can be indexed. In such instances, MongoDB creates multikeys: one for each unique value in the array. For example, you could save a set of blogposts in a collection, named blogposts, where each element could be as follows:

{
    "_id" : ObjectId("4d06bf4c851e434340b173c3"),
    "title" : "NoSQL Sessions at Silicon Valley Cloud Computing Meetup in January
 2011",
    "creation_date" : "2010-12-06",
    "tags" : [
        "amazon dynamo",
        "big data",
        "cassandra",
        "cloud",
        "couchdb",
        "google bigtable",
        "hbase",
        "memcached",
        "mongodb",
        "nosql",
        "redis",
        "web scale"
    ]
}

Now, you could easily create a multikey index on the tags field as follows:

db.blogposts.ensureIndex({ tags:1 });

So far it’s like any other index but next you could search by any one of the tag values like so:

db.blogposts.find({ tags:"nosql" });

This feature can be used to build out a complete keyword-based search. As with tags, you would need to extract the keywords and save them in an array field. The extraction of the keywords itself is not done automatically by MongoDB; you need to build that part of the system yourself.
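
A minimal sketch of that extraction step could look like the following (the keywords field name and the tokenizing rules are illustrative choices, not a MongoDB facility):

var post = db.blogposts.findOne();
post.keywords = post.title.toLowerCase().split(/\W+/).filter(function (w) {
    return w.length > 2;  // drop empty and very short tokens
});
db.blogposts.save(post);

db.blogposts.ensureIndex({ keywords:1 });    // multikey index: one entry per keyword
db.blogposts.find({ keywords:"sessions" });  // keyword lookup served by the index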

Maintaining a large array and querying through numerous documents that each hold a large array could impose a performance drag on the database. To identify and preemptively correct some of the slow queries you can leverage the MongoDB database profiler. In fact, you can use the profiler to log all the operations.

The profiler lets you define three levels:

  • 0 — Profiler is off
  • 1 — Slow operations (greater than 100 ms) are logged
  • 2 — All operations are logged

To log all operations you can set the profiler level to 2 like so:

db.setProfilingLevel(2);

The profiler logs themselves are available as a MongoDB collection, which you can view using a query as follows:

db.system.profile.find();
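
Each profile document records fields such as millis and a ts timestamp, so, for example, you can narrow the log to the most recent slow operations:

db.system.profile.find({ millis: { $gt: 100 } }).sort({ ts: -1 });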

If you have been following along until now, you have theoretically learned almost everything there is to learn about indexes and sorting in MongoDB. Next, you use the available tools to tune the query to optimal performance as you access data from your collections.

INDEXING AND ORDERING IN COUCHDB

You have seen the RESTful query mechanisms in CouchDB. Now, you delve a bit into how the values are indexed to make queries efficient. Unlike in MongoDB, the indexing features in CouchDB are automatic: indexing is triggered for all changed data the first time it is read after the change. To understand this further, step back a bit to review the essential mechanics of data access in CouchDB. CouchDB follows the MapReduce style of data manipulation.

A map function emits key/value pairs based on the collection's data, and those pairs make up the view results. When such a view is accessed for the first time, a B-tree index is built out of this data. On subsequent querying, the data is returned from the B-tree and the underlying data is untouched. This means every query beyond the very first one leverages the B-tree index.
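
For illustration, a map function over ratings-style documents (the movie_id and rating field names are assumptions about the stored documents) might look like this; the first read of the view triggers the B-tree build over the emitted keys:

function (doc) {
    // Emit one key/value pair per rated movie; CouchDB keeps the emitted
    // keys sorted in the view's B-tree index.
    if (doc.movie_id && doc.rating) {
        emit(doc.movie_id, doc.rating);
    }
}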

The B-tree Index in CouchDB

A B-tree index scales well for large amounts of data. Despite huge data growth, the height of a B-tree remains in single digits and allows for fast data retrieval. In CouchDB, the B-tree implementation has specialized features like MultiVersion Concurrency Control and append-only design.

MultiVersion Concurrency Control (MVCC) implies that multiple reads and writes can occur in parallel without the need for exclusive locking. The simplest parallel is a distributed version control system like Git. All writes are sequenced, and reads are not impacted by writes. CouchDB has a _rev property that holds the most current revision value. In a scheme similar to optimistic locking, writes and reads are coordinated based on the _rev value.

Each reader, therefore, sees the latest version as of the moment it starts reading the data. As documents are modified or deleted, the indexes in the view results are updated.

The couchdb-lucene project (https://github.com/rnewson/couchdb-lucene) provides full text search capability using Lucene, the open-source search engine, and CouchDB.

INDEXING IN APACHE CASSANDRA

Column-oriented databases like HBase and Hypertable have a default row-key-based order and index. Indexes on column values, which are often called secondary indexes, are typically not available out of the box in these databases. HBase has some minimal support for secondary indexes. Hypertable intends to support secondary indexes by the time of its version 1.0 release, which is slated to be available later this year.

Apache Cassandra is a hybrid between a column-oriented database and a pure key/value data store. It incorporates ideas from Google Bigtable and Amazon Dynamo. Like column-oriented databases, Cassandra supports row-key-based order and index by default. In addition, Cassandra also supports secondary indexes.

Secondary index support in Cassandra is explained using a simple example. You may recall the Cassandra database example with the CarDataStore keyspace and the Cars column-family from Chapter 2. The same example is revisited here to explain secondary indexes.

To follow along, start the Cassandra server using the cassandra program in the bin directory of the Cassandra distribution. Then connect to Cassandra using the CLI as follows:

PS C:\applications\apache-cassandra-0.7.4> .\bin\cassandra-cli -host localhost
Starting Cassandra Client
Connected to: "Test Cluster" on localhost/9160
Welcome to cassandra CLI.
 
Type 'help;' or '?' for help. Type 'quit;' or 'exit;' to quit.

The CarDataStore should already be in the local database if you followed along with the examples in Chapter 2. If not, then please revisit Chapter 2 and set up the keyspace and column-family as required. When your setup is complete, make CarDataStore the current keyspace as follows:

[default@unknown] use CarDataStore;
Authenticated to keyspace: CarDataStore

Use the following command to verify that the data you added earlier exists in your local Cassandra data store:

[default@CarDataStore] get Cars['Prius'];
=> (column=make, value=746f796f7461, timestamp=1301824068109000)
=> (column=model, value=70726975732033, timestamp=1301824129807000)
Returned 2 results.

The Cars column-family has two columns: make and model. To make querying by values in the make column more efficient, create a secondary index on the values in that column. Since the column already exists, modify the definition to include an index. You can update the column-family and column definition as follows:

[default@CarDataStore] update column family Cars with comparator=UTF8Type
...     and column_metadata=[{column_name: make, validation_class: UTF8Type,
 index_type: KEYS},
...     {column_name: model, validation_class: UTF8Type}];
9f03d6cb-7923-11e0-aa26-e700f669bcfc
Waiting for schema agreement...
... schemas agree across the cluster

cassandra_secondary_index.txt

The update command created an index on the make column. The index created is of type KEYS, which Cassandra defines as an index that resembles a simple hash of key/value pairs.

Now, query for all values that have a make value of toyota. Use the familiar SQL-like syntax as follows:

[default@CarDataStore] get Cars where make = 'toyota';
-------------------
RowKey: Prius
=> (column=make, value=toyota, timestamp=1301824068109000)
=> (column=model, value=prius 3, timestamp=1301824129807000)
-------------------
RowKey: Corolla
=> (column=make, value=toyota, timestamp=1301824154174000)
=> (column=model, value=le, timestamp=1301824173253000)
 
2 Rows Returned.

cassandra_secondary_index.txt

Try another query, but this time filter the Cars data by model value of prius 3 as follows:

[default@CarDataStore] get Cars where model = 'prius 3';
No indexed columns present in index clause with operator EQ

cassandra_secondary_index.txt

The query that filters by make works smoothly but the one that filters by model fails. This is because there is an index on make but not on model. Try another query where you combine both make and model as follows:

[default@CarDataStore] get Cars where model = 'prius 3' and make = 'toyota';
-------------------
RowKey: Prius
=> (column=make, value=toyota, timestamp=1301824068109000)
=> (column=model, value=prius 3, timestamp=1301824129807000)
 
1 Row Returned.

cassandra_secondary_index.txt

The index works again because at least one of the filter criteria has an indexed set of values.

The example at hand doesn't have any numerical values in its columns, so showing a greater-than or less-than filter is not possible. However, if you did want to leverage such an inequality comparator-based query via the index, you would be out of luck: the rudimentary KEYS index does not have the capability to perform range queries. Range queries via indexes may be supported in the future if Cassandra includes a B-tree, or a similar, index type.

SUMMARY

In this chapter you delved into the details of indexing documents and their fields in MongoDB. You also had a chance to learn about the automatic view indexing in CouchDB. The one prominent theme that emerged was that both databases support indexes and that these indexes aren’t drastically different from the indexes you see in relational databases.

You also learned about special features like how arrays in MongoDB are indexed as multikeys and how CouchDB provisions for automatic indexing of all documents that have changed since the last read.

In addition to indexes in document databases you also learned about indexing capabilities in Apache Cassandra, a popular column-family database.
