Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

4. Indexes

Dejan Miličić¹

(1)

Novi Sad, Serbia

In the previous chapter, we saw how to run queries that will return data that you need. This chapter will cover what happens behind the scenes when you run such queries and how the database executes them. We will also show some challenges database needs to solve and how databases are coping with ever-growing data. This chapter will present some possible optimization solutions. Finally, we will show how RavenDB prevents common problems with queries.

Queries from the Perspective of a Database

As users and developers, we execute queries against the database to get results back. This action is so standard that we do not think about underlying mechanisms and tools that deliver results. Additionally, working with a relational database, you probably use Object-Relational Mapper, representing an additional layer of abstraction between your code and a database. Over time, as we work with databases, we tend to forget what happens behind the scene.

Unbounded Queries

The first important notion is the concept of the unbounded query. Such queries will return whole collections without any limitations. When executed, they will return all documents from a collection, no matter how many you have. Even if you have one million employees, you will get them all back. Listing 4-1 shows a simple unbounded query.

from Employees

Listing 4-1

Unbounded query over Employees collection

The database engine will fetch all employees from the disk and return them. Unbounded queries can be dangerous, as things are not under your control. Results of execution are not deterministic, but they depend on the number of documents. If we are processing employees, we do not know if we will iterate over 10 or 10,000 employees. Furthermore, for all visual representations of our company employees, we will rarely need all of them .

Paging

Usually, we want to present paged list or some variant of paging (e.g., fetching batches as the user scrolls down the list). As shown in the previous chapter, RavenDB has built-in support for paging in the following form:

from [Collection] limit [skip], [take]

so query from Employees limit 0, 2 will return first two employees.

However, besides paged result set, it is also possible to get a total number of documents in a collection in the same round trip to the database. As shown in Figure 4-1, there is a chain icon next to the Results heading.

Clicking on this icon will display raw JSON that RavenDB Server returns:

{

TotalResults: 9,

LongTotalResults: 9,

CappedMaxResults: 2,

...

The client interprets this response, in this case, RavenDB Studio, to show you query results. Not all properties of returned JSON are rendered in Studio – as you can see, the first property contains a total number of results. So, you can display the total number of results in paged query with just one request to the server .

Filtering

However, this simple request is what we rarely need – we usually want to filter out employees by a specific field. So, let’s see which one of these nine employees in our database live and work in London, as defined in the query in Listing 4-2.

from Employees

where Address.City = 'London'

Listing 4-2

The query for employees living in London

Figure 4-2 shows the result of executing this query .

As you can see, the engine managed to find those living in London. So, if you had to perform the same task, how would you do it?

Let’s imagine that you have nine papers representing each of the employees. Taking one by one, you would look at the Address, checking if City is London. If this is the case, you will stack paper to the right. Otherwise, you would put it to the left. After you process all documents, the stack to the right will contain all four employees living in London.

What you just did is very similar to what a database engine might do in some cases, and it is called sequential scan or full table scan. It might seem like a logical move at first sight. However, what happens as the number of employees grows in your database? Or, if you filter orders in a retail operation, you have thousands and thousands of orders after a couple of years. How long will it take to filter orders by some of their properties? No matter how fast the disk or CPU, your application will start suffering from a full table scan sooner or later.

In algorithmic complexity terms, this operation has O(n) complexity – if you increase the amount of data ten times, filtering will take ten times longer. It is evident that this approach is not a sustainable one and that we need a different one. However, no matter what method we use, the database engine still needs to visit every employee to check the living city .

Indexes

So, if we cannot eliminate this sequential scan, one conceptual solution would be to perform it once and save results for subsequent calls. This approach is how all modern databases are solving this challenge. You can specify a field, and the engine will go over all documents looking at the content of this field. Discovered values will be stored in a particular data structure called Index. An index is an additional data structure (metadata, or “data about your data”) stored along with your documents. The database will derive the content of the index from your content, and it will never alter it.

In our example, we would define index over Address.City field of an Employee document, and we would say that this field is indexed. When computing the content of this index, the database would extract all distinct values (all cities) and store them in a specialized data structure. Various indexing structures were developed over the years, but all of them have a clear purpose – to provide a fast way to fetch documents based on some criteria.

After forming Employee.Address.City index and populating it with all cities where employees live, the database engine will be able to answer our queries swiftly, whatever value may be. When we execute the quest for all London employees, the engine will not go over documents. Instead, it will consult the index. There it will find entries shown in Table 4-1.

Table 4-1

Content of Employee.Address.City Index

Address.City	ID
Kirkland	employees/3-A
London	employees/9-A employees/5-A employees/6-A employees/7-A
Redmond	employees/4-A
Seattle	employees/1-A employees/8-A
Tacoma	employees/2-A

Obtaining a list of all employees from London is now easy. Their identifiers will be collected, and those documents will be fetched from the disk and delivered. You could say that we prepared ahead of time for all possible Address.City queries. We precomputed results and stored them side-by-side with documents. As a result, our queries will be fast. There are no more table scans: no more need to go over documents and load them to check if they should be included in the result set. We have a list of IDs that can be read directly from the disk, with certainty that this is a complete result set.

Types of Indexes

If a computation is inevitable, there is no way to avoid performing it. However, you can do it just once and save results to a disk ahead of query time. Indexes are precomputed answers stored in a specialized data structure that provides a fast way to fetch results .

In Table 4-1, we presented an index that contains document IDs. Such an index is called the secondary index .

When a secondary index is used to service a query, one more step is needed before the list of employees is returned to you. The database needs to use IDs obtained from the secondary index to fetch complete employee documents. This operation needs to be efficient, so the database will maintain one more index that tracks the exact storage position of each document based on its identifier. This index is called the key-value index , and if you have been working with relational databases, you will recognize a similar internal data structure as the primary index .

Overall, a query like Listing 4-2 will require a database to perform two operations. First, taking the value of an indexed property, a secondary index will be consulted to obtain a list of identifiers. Second, the primary index will be used to fetch employee documents from database storage efficiently.

However, in certain situations, this can be reduced to just one operation. Namely, you can expand the index to contain the whole document instead of just an identifier. Such index is called clustered index, and you can see its structure in Table 4-2 .

Table 4-2

Content of Clustered Index

City	ID	Document
Kirkland	employees/3-A	{id: “employees/3-A”, ”FirstName”: “Janet”..}
London	employees/9-A employees/5-A employees/6-A employees/7-A	{id: “employees/9-A”, ”FirstName”: “Anne”..} {id: “employees/5-A”, ”FirstName”: “Steven”..} {id: “employees/6-A”, ”FirstName”: “Michael”..} {id: “employees/7-A”, ”FirstName”: “Robert”..}
Redmond	employees/4-A	{id: “employees/4-A”, ”FirstName”: “Margaret”..}
Seattle	employees/1-A employees/8-A	{id: “employees/1-A”, ”FirstName”: “Nancy”..} {id: “employees/8-A”, ”FirstName”: “Laura”..}
Tacoma	employees/2-A	{id: “employees/2-A”, ”FirstName”: “Andrew”..}

The database engine can fetch the list of employee documents with just one round trip to the storage with a clustered index. Unfortunately, such an index is expensive storage-wise since it will essentially duplicate all records from the primary data storage into metadata structure. Various relational databases are using different approaches to overcome this challenge. SQL Server allows just one clustered index per table; MySQL is implementing primary index as clustered, with secondary index entries pointing to primary keys.

Midway between nonclustered index (storing only references) and clustered one (storing complete data) is covering index, also known as an index with included columns. Let’s observe the following query:

from Employees

where Address.City = 'London'

select FirstName

In this case, we are not asking for a complete employee document. We are interested in the first name of employees from London. Names are usually small pieces of data, and we can expand the nonclustered secondary index from Table 4-1 into the form shown in Table 4-3.

Table 4-3

Content of Covering Index

Address.City	ID	FirstName
Kirkland	employees/3-A	Janet
London	employees/9-A employees/5-A employees/6-A employees/7-A	Anne Steven Michael Robert
Redmond	employees/4-A	Margaret
Seattle	employees/1-A employees/8-A	Nancy Laura
Tacoma	employees/2-A	Andrew

Covering index is a compromise that enables the database to answer some of the queries using index alone (hence the term covered, since such index covers some of the queries) without allocating too much space .

Downsides of Indexing

Indexes sound like a great idea, but there are also downsides. Storing additional data structures on the disk will inevitably take extra space. In our example, we defined the index on just one field of the Employee document. If we wanted to provide users with efficient filtering on other fields like FirstName or LastName, we would have to expand EmployeeIndex with entries for these two fields. Over time, as you define new indexes, expand existing ones with additional fields, and increase the number of documents in your database, the total size of indexes will inevitably grow, allocating more and more space for this metadata.

However, there is another much more challenging task – index maintenance. Every time we update an employee’s city of living or any other indexed field, we need to update the Index. Every time we create a new employee in the database, the Index needs to be updated. As a result, each write to a database will be accompanied by additional one or more index update operations. Both of them, writing data and index updates, will be performed together, making your writing operation slower.

This results in an interesting trade-off: we will introduce indexes to speed up queries , but these will slow down writes. We want to have fast queries, but at the same time, we do not want to slow down writes. Hence, we should carefully select which fields should be indexed based on the applications’ usage patterns. This trade-off explains why databases are not indexing everything by default – this approach would introduce unnecessary overhead.

Even though querying via indexes (opposed to querying raw data) is what every developer should be doing, a lack of indexes is probably the most common anti-pattern among programmers. There are various contributing factors to this sad fact, but the need for the manual creation of indexes is the most significant one. It is easy to forget about it or make an omission, even when you are informed and aware. Additionally , this mistake is usually well-hidden in a small amount of data. Developers testing an application on a small dataset will not experience the dreadful consequences of a full table scan. Without load testing, a developer will deliver the database without indexes to production.

Eventually, a lack of indexes will result in a significant and annoying slow-up of the application. With the small amount of data in the database, the application will be fast, but over time, as the amount of data grows, performance will start to degrade linearly. This gradual degradation is a severe obstacle to the scaling of your application and will make you go back and fix performance problems, which will cost you both time and money .

RavenDB’s Indexing Philosophy

The general philosophy of RavenDB design as a database is not to help you solve problems and challenges but to prevent them. This approach has unique consequences – your database will make bad things impossible, or when you make an omission, RavenDB will react and correct your actions.

As a part of this prevention philosophy, in RavenDB, all queries are always executed via indexes. It is technically impossible for a query to avoid execution via index. This way, the manual omission is not possible, and all filtering queries will be efficient.

So, what happens when we attempt to run a query from Listing 4-2 without defining the index upfront? Will we get an error or warning? Figure 4-3 shows the result of query execution .

Not only that an error was not displayed, but we also got the actual result set. How come?

Looking closer at the top left corner of the Results panel, you can see that RavenDB Studio indicates the name of the index used to serve this query. Figure 4-4 shows a close-up of this part of the screen with the index name and execution time .

Index with the name Auto/Employees/ByAddress.City was used, and it took 62ms to get results. We can inspect this index by expanding the dropdown and selecting the first option, View Index, as shown in Figure 4-5.

You can now see mapping structure – documents from Employees collection are processed, and values are extracted from Address.City field. Figure 4-6 shows the structure of this index.

RavenDB studio provides a way to inspect the content of this index. If you click on the Terms button in the upper right corner, you will be presented with Index terms, as shown in Figure 4-7.

Expanding both Address.City and id() sections will reveal extracted values from those fields. It is easy to see that nine employees have unique identifiers (hence we have nine identifiers), but some live in the same city since we have five different cities. We can execute a filtering query against this index by hovering over any of the fields and clicking on magnifying glass icon, as shown in Figure 4-8.

Result of executing query

from index 'Auto/Employees/ByAddress.City'

where 'Address.City' = 'london'

is shown in Figure 4-9.

Figure 4-9
Results of Executing Query over Index

As observed in Figure 4-9, we are now querying the index directly, and the results are the same as when we executed the query from Listing 4-2. This is expected since the same index was used when we queried the collection directly.

Studio also provides a way to see the content of this index. Clicking on Settings and then turning on Show the raw index entries instead of matching documents will produce insight into the inner data structure of this index, as shown in Figure 4-10.

Looking again at Table 4-1, we can see that this is the precisely same set of entries for “London.” Modifying this query and running it for the rest of the cities from Figure 4-8 will return the remaining items from Table 4-1.

Try running from Employees where Address.City = 'London' query again. You are getting the expected result set, but what is essential here is the execution time, shown in Figure 4-11.

We did not mention it earlier, but this Query execution time measures server-side execution time. Even if you experienced a long time waiting for results to appear in your browser, you were waiting for the result set to be delivered to your browser. The measurement shown here is the actual time it took the database to generate these results.

Comparing query timing for this attempt with the previous one from Figure 4-3, you can see that we are much faster this time. On the first run, after failing to find an appropriate index, RavenDB had to create one. This operation took some time – the database had to fetch all nine employees from the disk, process their Address.City field, extract values, and form an index out of those values.

On the second run that we just did, a suitable index was detected right away and used to deliver results. The same scenario will happen on any subsequent run – precomputed values will be read from the disk and returned .

Summary

In this chapter, we introduced the concept of an index as a specialized data structure. An index is metadata placed side by side with your data inside a database, and it is one of the possible solutions for speeding up queries. We showed how RavenDB’s design approach would help you prevent common problems with indexes.

In the next chapter, we will show how you can control index creation and define it yourself instead of letting RavenDB create them automatically.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 4. Indexes

Create new playlist

Sign In

Sign Up

4. Indexes

Queries from the Perspective of a Database

Unbounded Queries

Paging

Filtering

Indexes

Types of Indexes

Downsides of Indexing

RavenDB’s Indexing Philosophy

Summary

Table of Contents for
4. Indexes