Chapter 3. Handling Indexes

This chapter covers one of the most important topics in database work: indexing. In many cases, missing or wrong indexes are the main source of trouble, leading to bad performance, unexpected behavior, and a great deal of frustration. To help you avoid these issues, this chapter presents the information you need to make your life as easy as possible.

The following topics will be covered:

  • Understanding PostgreSQL indexing
  • Types of indexes
  • Managing operator classes
  • Detecting missing indexes
  • Reading execution plans
  • Fixing LIKE
  • Fixing regular expressions
  • Improving indexes and functions
  • Indexes and insertions
  • Indexes and foreign keys

Understanding indexes in PostgreSQL

In this section, you will learn the fundamental concepts of indexing. As already mentioned, indexes are highly important; their impact on performance can hardly be overestimated. Therefore, it is important for you to understand what indexes are good for and how they can be used efficiently without harming performance. To illustrate how indexes work, the first thing to do is to introduce the concept of so-called execution plans.

Execution plans are vital to understanding PostgreSQL's performance as a whole. They are your window into the inner workings of the database system, which is why execution plans and indexes are best understood together.

Using a simple index

The basic idea of SQL is that you write a query and the database decides on the best strategy to execute the request. Deciding on the best strategy is the task of the so-called optimizer. The idea behind the optimizer is to figure out which strategy is expected to impose the lowest cost on the system, and to go for that plan. Consider the following example:

test=# CREATE TABLE t_test (id int4);
CREATE TABLE

A table with only one column has been created. Now 10 million rows are added:

test=# INSERT INTO t_test 
  SELECT * FROM generate_series(1, 10000000);
INSERT 0 10000000

Once the data has been added, a simple query looking for a single row can be issued. The point here is that the query is prefixed with the explain keyword:

test=# explain SELECT * FROM t_test WHERE id = 423425;
                         QUERY PLAN                          
-------------------------------------------------------------
 Seq Scan on t_test  (cost=0.00..169247.71 rows=1 width=4)
   Filter: (id = 423425)
 Planning time: 0.143 ms
(3 rows)

As a result, PostgreSQL returns the so-called execution plan. It tells you how the system plans to execute the query. What is an execution plan? Basically, PostgreSQL will execute SQL in four stages:

  • Parser: At this stage, the syntax of the query will be checked. The parser is what complains about syntax errors (a quick example follows this list).
  • Rewrite system: The rewrite system will rewrite the query, handle rules, and so on.
  • Optimizer: At this point, PostgreSQL will decide on a strategy and come up with a so-called execution plan, which represents what it believes to be the cheapest way to compute the result set.
  • Executor: The plan generated by the optimizer will be executed by the executor, and the result will be returned to the end user.
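
To see the parser in action, it is enough to send an obviously broken statement to the server. The following is just a quick illustration; the exact wording of the error message may vary slightly between PostgreSQL versions:

test=# SELCT 1;
ERROR:  syntax error at or near "SELCT"
LINE 1: SELCT 1;
        ^

Only statements that pass the parser make it to the rewrite system, the optimizer, and finally the executor.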

The execution plan is exactly what has been created here. It tells us how PostgreSQL plans to execute the query. What we see is a so-called sequential scan, which means that the entire table is read and PostgreSQL checks every row to see whether it matches the condition or not. As our table is really large, it takes quite a while to execute the query.

To receive some more detailed information, explain analyze can be used:

test=# explain analyze SELECT * 
  FROM   t_test 
  WHERE id = 423425;
                         QUERY PLAN                          
-------------------------------------------------------------
 Seq Scan on t_test  (cost=0.00..169247.71 rows=1 width=4)
                     (actual time=60.842..1168.681 rows=1 loops=1)
   Filter: (id = 423425)
   Rows Removed by Filter: 9999999
 Planning time: 0.050 ms
 Execution time: 1168.706 ms
(5 rows)

The total runtime reported by explain analyze is roughly 1.2 seconds, which is quite a lot for such a small query. Of course, if the query were executed without explain analyze, it would be a little faster, but it would still take close to a second. Close to a second is certainly not what end users should expect from a modern database.
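
By the way, if you want to measure the raw runtime without the instrumentation overhead of explain analyze, you can simply switch on client-side timing in psql. This is only a small convenience sketch; once timing is enabled, psql prints the elapsed wall-clock time after every statement:

test=# \timing
Timing is on.
test=# SELECT * FROM t_test WHERE id = 423425;
   id   
--------
 423425
(1 row)

The time reported this way will typically be a bit lower than the explain analyze figure, but without an index it will still be far too high for comfort.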

This is where an index can help. Here is how we create an index:

test=# CREATE INDEX idx_id ON t_test (id);
CREATE INDEX

Creating the index solves the problem in an almost miraculous way. Now the query executes in a fraction of a millisecond, thousands of times faster than before:

test=# explain analyze SELECT * 
    FROM   t_test 
    WHERE id = 423425;
                     QUERY PLAN  
-------------------------------------------------------
 Index Only Scan using idx_id on t_test    
  (cost=0.43..8.45 rows=1 width=4) 
  (actual time=0.013..0.014 rows=1 loops=1)
   Index Cond: (id = 423425)
   Heap Fetches: 1
 Planning time: 0.059 ms
 Execution time: 0.034 ms
(5 rows)

The important point here is that the optimizer has found out that using the index is a lot more beneficial. How does this work? Internally, PostgreSQL performs so-called cost-based optimization. This means that an index is not used simply because it exists; PostgreSQL carefully calculates the cost of every feasible plan (just an abstract number of penalty points) and takes the cheapest one, which promises the best execution time. In the preceding example, the total cost of the plan using the index is just 8.45, whereas the sequential scan comes up with a stunning estimate of 169247.71 penalty points. Logically, the index is chosen because it promises (and provides) far better response times.

It is also important to see that PostgreSQL returns two times: the time needed to plan the query and the time needed for the actual execution. In the case of short queries, the planning time starts to become relevant.
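
Where do these cost numbers actually come from? Roughly speaking, they are derived from a handful of configurable planner parameters, which are multiplied by the estimated number of pages and rows a plan has to touch. As a small illustration, you can inspect two of the most important ones; the values shown here are the stock defaults and may differ on a tuned system:

test=# SHOW seq_page_cost;
 seq_page_cost 
---------------
 1
(1 row)

test=# SHOW random_page_cost;
 random_page_cost 
------------------
 4
(1 row)

In other words, the planner assumes by default that a randomly fetched page is four times as expensive as a sequentially read one, which is exactly the kind of trade-off behind figures such as 8.45 and 169247.71.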

How an index works

So far, only a simple B-tree has been created. How does a B-tree work internally? In PostgreSQL, Lehman-Yao trees are used. These are balanced trees that have some very nice features. First of all, a B-tree has logarithmic runtime behavior. This means that as the tree grows, the time needed to query it does not increase proportionally. Storing one million times more data might make an index lookup only around 20 times slower (the exact figure depends on factors such as the distribution of values, the amount of RAM available, clock speed, and so on).

To find out more about binary trees, check out http://en.wikipedia.org/wiki/Binary_tree.
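
If you want to peek at the tree structure yourself, the pageinspect contrib extension exposes the B-tree metapage. The following is only a sketch and assumes that you are allowed to install the extension (superuser privileges are required); for roughly ten million int4 values, the reported depth is typically a small number such as 2, which nicely illustrates how shallow the tree stays:

test=# CREATE EXTENSION pageinspect;
CREATE EXTENSION
test=# SELECT level FROM bt_metap('idx_id');
 level 
-------
     2
(1 row)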

There is more. B-trees in PostgreSQL have another positive feature: Lehman-Yao trees (the variant used in PostgreSQL's B-tree implementation) can be modified by many sessions at the same time. This makes them a wonderful data structure for scaling up operations. The ability to do things concurrently is important in modern systems, and not just for databases.
