Avoiding trouble with indexes

Indexes are not always the solution to a problem; sometimes they are the problem themselves. The following example outlines a common pitfall that should be avoided at all costs:

test=# CREATE TABLE t_test (id int, x text);
CREATE TABLE
test=# INSERT INTO t_test SELECT x, 'house' 
  FROM generate_series(1, 10000000) AS x;
INSERT 0 10000000
test=# CREATE INDEX idx_x ON t_test (x);
CREATE INDEX

Before taking a look at the way the index is used, it makes sense to inspect the size of the table as well as the size of the index:

test=# SELECT
   pg_size_pretty(pg_relation_size('t_test')), 
  pg_size_pretty(pg_relation_size('idx_x'));
 pg_size_pretty | pg_size_pretty 
----------------+----------------
 422 MB         | 214 MB
(1 row)

The table created in this example is 422 MB in size (the table alone). On top of that, the index takes up another 214 MB, so the table plus its index amounts to more than 600 MB.
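If you do not want to add up the two numbers yourself, PostgreSQL's pg_total_relation_size function reports the size of a table including all of its indexes (and TOAST data) in one call. A quick cross-check might look like this:

SELECT pg_size_pretty(pg_total_relation_size('t_test'));  -- table + indexes + TOAST

The result should come out slightly above the 600 MB computed previously.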

The problem is that the index is of no use in this example:

test=# explain SELECT * FROM t_test WHERE x = 'house';
                      QUERY PLAN                            
-------------------------------------------------------
 Seq Scan on t_test  (cost=0.00..179054.03 rows=10000000
   width=10)
   Filter: (x = 'house'::text)
 (2 rows)

Despite the index, PostgreSQL will still use a sequential scan. The reason is simple: all values in the column are the same, so the column is not selective at all, and therefore, the index is more or less useless. Why doesn't PostgreSQL use the index at all? Let's think about it for a moment. The goal of an index is to reduce I/O. Since this query matches every row, we have to read the entire table anyway; if we used the index, we would have to read all the index entries on top of the table. Reading the index plus the entire table is a lot more expensive than just reading the table, so PostgreSQL decides on the cheaper strategy.

Note

Keep in mind that PostgreSQL only has to read those parts of the index that are relevant to the query. This example is really a special case designed to require the entire table and the entire index.

However, the index is not always useless:

test=# explain SELECT * FROM t_test WHERE x = 'y';
                   QUERY PLAN                              
-------------------------------------------------------
 Index Scan using idx_x on t_test  (cost=0.43..4.45 
  rows=1 width=10)
   Index Cond: (x = 'y'::text)
  Planning time: 0.096 ms
(3 rows)

In this example, we choose to select a value that is not present in the table (or is expected to be pretty rare). In this case, the index is chosen. In other words, selectivity is the key to the usefulness of an index. If a field is not selective, an index will be pointless.
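If you are unsure how selective a column really is, the pg_stats system view shows what the optimizer knows about it. The following query is only a sketch and assumes that ANALYZE has run recently, so the statistics for t_test are up to date:

SELECT n_distinct,         -- estimated number of distinct values
       most_common_vals,   -- the most frequent values ...
       most_common_freqs   -- ... and the fraction of rows they cover
FROM   pg_stats
WHERE  tablename = 't_test'
  AND  attname   = 'x';

For the table above, n_distinct can be expected to be 1, which is exactly why the planner refuses to touch the index for x = 'house'.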

Detecting missing indexes

In this chapter, the importance of indexing has already been demonstrated. However, what is the best way to figure out whether indexes are missing or not? The pg_stat_user_tables view is a system view containing all the relevant information. Here is a query that will definitely be beneficial to you:

SELECT schemaname, relname, seq_scan, seq_tup_read,
    idx_scan, seq_tup_read / seq_scan 
FROM   pg_stat_user_tables 
WHERE   seq_scan > 0 
ORDER BY seq_tup_read DESC;

The query provides us with schemaname and relname to identify the table inside the database. The seq_scan field returns the number of times the table has been read sequentially, and during those sequential scans, the database had to read seq_tup_read rows. The idx_scan field informs us about the number of index scans, and finally, the average number of rows read per sequential scan is displayed.

The beauty of this query is that those tables that may need an index will show up on top. How do you know? A table that is read sequentially over and over again and that contains a large number of rows is definitely a candidate.
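A slightly refined version of the previous query makes the candidates easier to spot. The row threshold used here is just an assumption for illustration; adjust it to the size of your installation:

SELECT schemaname, relname, seq_scan, seq_tup_read,
       idx_scan, seq_tup_read / seq_scan AS avg_per_seq_scan
FROM   pg_stat_user_tables
WHERE  seq_scan > 0
  AND  seq_tup_read > 1000000   -- assumed threshold: ignore small tables
ORDER BY seq_tup_read DESC
LIMIT 20;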

It obviously makes sense to go through the top candidates and check each table. Keep in mind that PostgreSQL can tell you which tables may be a problem, but it won't tell you which columns have to be indexed. Some basic knowledge about the application is needed. Otherwise, you will end up doing guesswork. It is really necessary to figure out which columns your application filters on. At this point, there is no automatic algorithm to check that.

Tip

Make sure that you are not falling into the trap of simply indexing everything. Creating too many indexes won't do any good, as shown in the next section.

Removing useless indexes

Some of you might wonder why too many indexes are bad. While reading data, indexes are no problem at all in 99 percent of all cases (unless the planner makes a bad decision or some caches are too small). However, when it comes to insertion, indexes are a major performance bottleneck.

To compare the two, a simple test can be run: 1 million rows are added to the indexed table as well as to an empty, unindexed table. To compare runtimes, \timing is used:

test=# \timing
Timing is on.

The beauty of \timing is that the time needed to execute a query is displayed right at the end of the command. First, 1 million rows are added to the indexed table:

test=# INSERT INTO t_test SELECT * 
    FROM generate_series(1, 1000000);
INSERT 0 1000000
Time: 6346.756 ms

It takes around 6 seconds to do this. What if the same data is added to a freshly created table without an index?

test=# CREATE TABLE t_fast (id int, x text);
CREATE TABLE
Time: 92.034 ms
test=# INSERT INTO t_fast SELECT * 
    FROM generate_series(1, 1000000);
INSERT 0 1000000
Time: 2078.413 ms

Doing the same thing without an index is around three times faster. Keep in mind that the main bottleneck in this example is actually the generation of the data itself; if generate_series were free, the difference would be a lot larger.

Every index that does not yield any benefit is harmful: it needs space on disk and, more importantly, it slows down writes dramatically. If your application is write-bound, additional indexes can be a killer. However, in my professional career, I have seen that write-bound applications are really a minority. Therefore, overindexing can be as dangerous as underindexing, but underindexing is the more obvious problem because you instantly see that certain queries are slow. If there are too many indexes, the performance problems are usually a bit more subtle.
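One practical consequence of this write overhead: when a large amount of data has to be loaded, it is usually cheaper to build the index after the load instead of maintaining it row by row during the load. The following is only a sketch of that approach; t_bulk and idx_bulk_x are names made up for this illustration:

-- load into a fresh table that carries no index yet ...
CREATE TABLE t_bulk (id int, x text);
INSERT INTO t_bulk
    SELECT x, 'house' FROM generate_series(1, 1000000) AS x;

-- ... and build the index in a single pass afterwards
CREATE INDEX idx_bulk_x ON t_bulk (x);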

To prevent problems caused by too many indexes, it makes sense to check for useless indexes periodically. PostgreSQL has all the onboard means to do that. The best place to start is a system view called pg_stat_user_indexes, which tells you which indexes are used and how often:

test=# \d pg_stat_user_indexes
View "pg_catalog.pg_stat_user_indexes"
    Column     |  Type  | Modifiers 
---------------+--------+-----------
 relid         | oid    | 
 indexrelid    | oid    | 
 schemaname    | name   | 
 relname       | name   | 
 indexrelname  | name   | 
 idx_scan      | bigint | 
 idx_tup_read  | bigint | 
 idx_tup_fetch | bigint | 

The relevant field here is called idx_scan. It tells us how often a certain index has been utilized. If this index is rarely used, it might make sense to drop it.
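To turn this into a list of candidates, idx_scan can be combined with the size of each index. The query below is only a starting point: it deliberately skips unique indexes, because those enforce constraints and must not be dropped merely because they are never scanned, and remember that the counters only cover the period since the statistics were last reset:

SELECT s.schemaname, s.relname, s.indexrelname, s.idx_scan,
       pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size
FROM   pg_stat_user_indexes AS s
JOIN   pg_index AS i ON i.indexrelid = s.indexrelid
WHERE  s.idx_scan = 0           -- never used since the last stats reset
  AND  NOT i.indisunique        -- keep indexes that enforce constraints
ORDER BY pg_relation_size(s.indexrelid) DESC;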

Note

Keep in mind that dropping an index on a very small table (maybe on a lookup table) or on a table that is supposed to grow anyway might not be a good idea. It really makes sense to focus on big tables and their indexes.

Dropping indexes can free up a lot of space, reduce I/O, and boost writes.
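Once you are sure an index is not needed, removing it is a one-liner. On a busy system, the CONCURRENTLY variant of DROP INDEX avoids holding a blocking lock on the table while the index is removed:

DROP INDEX CONCURRENTLY idx_x;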
