Chapter 4. Query Optimization

Now that you are familiar with the SQL syntax implemented by SQLite, let's take a look at the way in which your tables are created and how the way your queries are written can affect the performance of your database.

Keys and Indexes

In the preceding chapter you met the CREATE INDEX statement, used to add an index to one or more columns of a database table.

Indexes are used to preserve a particular sort order on a column within the database itself, enabling queries in which the WHERE clause references that column directly to operate much faster. Rather than scanning a table's data from top to bottom and pulling out just the rows that match the given criteria, the index tells SQLite exactly where to look in that table to find the matching rows.

What an Index Does

A database index is not an abstract concept; you certainly have come across one before. Let's consider the index in this book, for instance, and think about when you would use it.

If you were trying to find references to the API call sqlite_exec(), for example, you might already expect to find this in Chapter 6, “The C/C++ Interface”—but bear in mind that a computer program cannot think for itself. So for a moment pretend you don't have the benefit of this knowledge.

To make sure you find every reference in the book, the systematic and thorough approach is to start from page one, line one and read each page a line at a time until the end of the book, noting each reference as you come across it. Clearly, this would be a time-consuming process.

Fortunately the production team behind this book has created a comprehensive index, with topics and terms used throughout the book listed alphabetically and page references given for each entry. You've seen an index before and know how to use one, so you already know that it's much quicker to look up sqlite_exec() in the index and jump to each page in turn to find the topic you are looking for than to scan every single printed word.

How Indexes Work in SQLite

A database index works much the same way as the index in a book. Suppose you have a table containing a list of names and telephone numbers, and the data has been built up over time as you come into contact with new people. It's very likely that some of your acquaintances will share a surname, but certainly your records were not added to the table in alphabetical order.

Imagine you want to find the details for someone called Brown. The SQL SELECT statement might look like this:

SELECT * FROM contacts
WHERE last_name = 'Brown';

Without an index, the database has to look at each record in the database in turn and compare it to the string Brown. The more popular you are, the larger your contacts database will be and the longer this query will take to complete. This process is known as a full table scan.

However, having an index on the last_name column means that SQLite will know how to sort the records in that table alphabetically by surname, and consequently it can pick out the matching values of last_name without even looking at the other rows in the table.

Such an index could be created on the contacts table as follows:

CREATE INDEX contacts_last_name ON contacts(last_name);

Remember that each index has to be given its own unique identifier, and that an index name cannot share the same name as a table, view, or trigger in the same database.

Note

Although the CREATE INDEX syntax allows you to specify a sort order—ASC or DESC—all indexes are currently created in ascending order. This feature is reserved for possible future use.

Indexing Multiple Columns

An index can be created on more than one column at a time—known as a clustered index. The order in which the columns are specified in the CREATE INDEX statement determines the sort order for the index—records are sorted on the first column initially, then on the second column where values of the first column are the same.

Think of how a telephone directory is organized. People appear in order of their surname, but because there are many people with the same surname, the listings are further ordered by first name or initials.

In SQLite, this kind of index for the contacts table would look like this:

CREATE INDEX contacts_name ON contacts(last_name, first_name);

A WHERE clause with conditions on both columns would find results very quickly using this index. In fact, because the primary sort order is last_name, a query that only references last_name would also perform well using this index.

The ordering of the columns for a clustered index is important. A condition imposed on first_name only would be no better off using this index than it would doing a full table scan.

Unique Indexes

You can specify the optional keyword UNIQUE in the CREATE INDEX statement to indicate that each value in the indexed column—or each permutation of values in a clustered index—only appears in the table once.

Using CREATE UNIQUE INDEX is the same as imposing a UNIQUE constraint in the table schema, but it can be done at any time after the table has been created. If you attempt to insert a duplicate value, SQLite will return an error and the insert will fail.

In fact, using the UNIQUE attribute in the CREATE TABLE statement causes SQLite to create a unique index automatically on this column as shown in the following example. Note that the sql column is empty because no CREATE INDEX command was entered.

sqlite> create table mytable (
   ...>   a INTEGER PRIMARY KEY
   ...> );
sqlite> select * from sqlite_master where type = 'index';
type        name                   tbl_name    rootpage    sql
----------  ---------------------  ----------  ----------  ----------
index       (mytable autoindex 1)  mytable     3

Note

If a column is declared as INTEGER PRIMARY KEY, it is used internally as the key for the table structure, so the records in the table are ordered on that column by default. Although it does not cause an index to show up in sqlite_master, the column will behave as if it were indexed. Explicitly creating an index on an INTEGER PRIMARY KEY column uses additional database resources without providing any extra benefits.

When to Create an Index

An index can be created or dropped at any time without affecting the data contained in that table. A number of factors determine when it becomes beneficial to create an index.

Usually you will create indexes on columns used in conditions in WHERE statements. Whether the purpose of the index is to restrict a dataset from one table or to perform a join, generally speaking SQLite will be able to do its work much faster if key fields are indexed.

However, very small tables are often not made any faster by adding an index and in some cases, the index can actually slow things down. It can be just as fast for SQLite to do a quick scan of all rows than to look up the sort order from the index.

If a table has data inserted frequently but not queried very often, use indexes sparingly. Although indexes can speed up queries—and UPDATE statements with a WHERE clause that references the indexed column—inserting data is slower when an index is present. In addition to adding a record to the table, an INSERT requires the index to be updated to reflect the position of the new data within the sort order of each indexed column.

Think of the filing cabinet in a traditional office, with customer files arranged alphabetically in the drawers. When a new document is filed, it takes a moment to locate the correct drawer and file and make sure that the item is put away in the correct place. However, when it comes to accessing documents in the future, everything relating to a particular customer is in the same place and can be found very quickly.

The alternative—a large stack of papers—lets you file things away much faster by just placing them on top of the pile, but looking for specific information becomes a daunting task. Just as most offices use a filing cabinet rather than an unsorted pile, most database tables can benefit from indexes when used correctly.

Don't go overboard with the number of indexes you create. The biggest performance benefits come from careful index selection based on the nature of the data stored in tables and the ways in which the application interacts with it.

If you do have too many columns indexed, it is possible that SQLite will not pick the best one for any particular query. Only one index can be used on each table at a time, so a clustered index is the only way to apply the capabilities of an index across two columns simultaneously. We will see an example of this later.

When Indexes Can Be Used

You also need to bear in mind what kinds of conditions can and cannot make use of an index. Most operators will use an index if one is available provided that the column value is not modified before it is compared.

The following examples will all make use of an index on mytable.myfield if there is one.

SELECT * FROM mytable WHERE myfield = 'xyz';
SELECT * FROM mytable WHERE myfield > 2000;
SELECT * FROM mytable WHERE myfield IN ('val1', 'val2'),
SELECT * FROM mytable WHERE myfield IN (SELECT somefield FROM anothertable)

However, the BETWEEN and LIKE operators will not use an index at any time. Such queries can usually be rewritten if using an available index would be beneficial. For instance

SELECT * FROM mytable WHERE myfield BETWEEN 10 and 20;

becomes

SELECT * FROM mytable WHERE myfield >= 10 AND myfield <= 20;

Similarly, a LIKE operator that matches the first few characters of a string can be written using two inequality operators.

SELECT * FROM mytable WHERE myfield LIKE 'sql%';

becomes

SELECT * FROM mytable WHERE myfield >= 'sql' AND myfield < 'sqm';

Note

A LIKE operator with characters after a wildcard symbol cannot be rewritten as a pair of inequalities. This would be like trying to find people whose surnames end with a “y” in the telephone directory.

These examples show that two conditions on the same column combined with an AND can still make use of an index. However, this is not true of the OR keyword—the IN operator should be used instead wherever possible.

SELECT * FROM mytable WHERE myfield = 'abc' OR myfield = 'xyz';

becomes

SELECT * FROM mytable WHERE myfield IN ('abc', 'xyz'),

If some operation is performed on a column value in the condition, SQLite will not be able to use an index. For instance, the following queries will be evaluated using a full table scan, regardless of any indexes on myfield:

SELECT * FROM mytable WHERE myfield % 2 = 1;
SELECT * FROM mytable WHERE substr(myfield, 0, 1) = 'w';
SELECT * FROM mytable WHERE length(myfield) < 5;

Tweaking the Table List

To join two or more tables together, the rows have to be combined before the results are returned, and the SQLite engine does this using nested loops. The hierarchy of the loops is determined by the order in which table names are specified in the FROM list, so that the leftmost table becomes the outer loop and the rightmost table becomes the inner loop.

Knowing a little about the tables you are performing the SELECT on can help you use this feature of SQLite to increase the speed of the query. If you use the tables that will produce the smallest number of rows—after any WHERE clause conditions that directly apply to that table have been applied—the outer loop will be as small as possible, and conditions evaluated on the inner loops will be performed a smaller number of times.

Benchmarking

Listing 4.1 shows a simple benchmarking shell script that can be used to execute a query repeatedly within sqlite and using the Unix time command to time execution.

Example 4.1. benchmark.sh

#!/bin/sh
if [ $# -ne 3 ]; then
  echo "Usage: $0 database-file sql-file num-loops"
  exit;
fi
dbname=$1;
sqlfile=$2;
numloops=$3
tmpfile=/tmp/sqlite_benchmark.sql
> $tmpfile
echo "Generating temporary SQL file"
for i in 'seq 1 $numloops'; do
  cat $sqlfile >> $tmpfile
done
echo "Executing query $numloops times"
time sqlite $dbname < $tmpfile > /dev/null
rm -f $tmpfile

This script is quite crude but will do enough to tell us how fast one query is in relation to another. A shell script has been used here so as not to tie it to any particular programming API, but after you have read the chapters in Part II of this book, “Using SQLite Programming Interfaces,” you will be able to create a benchmarking process that suits the environment you work in.

Note

The script in Listing 4.1 uses the seq command, which is part of the GNU coreutils package, included on Linux and certain other Unix systems as standard. If your operating system does not include this command, the package can be obtained from http://ftp.gnu.org/pub/gnu/coreutils/.

The shell script version can be quite slow because it involves a lot of disk writes to generate the temporary SQL file. Using a programming API, the query could be executed repeatedly in a loop construct within the same SQLite connection.

However, to perform a loop in the shell with sqlite called multiple times to execute the same query over and over again would carry a large overhead as the program has to start up, and open and close the database file for each iteration. As we are only interested in the time taken to perform the query, the file containing multiple queries is built up before sqlite is invoked and this work is not timed.

Create a text file containing the query to be analyzed and execute the program using the following syntax. Note that because the SQL query is passed to sqlite many times from one file, the terminating semicolon must be present.

./benchmark.sh database-file sql-file num-loops

The output will look something like this:

$ ./benchmark.sh words somefile.sql 10000
Generating temporary SQL file
Executing query 10000 times
real    0m6.814s
user    0m2.360s
sys     0m1.220s

Ideally the number of loops should be as high as you can bear to wait for the result to obtain a reliable average execution time. This will depend on the actual execution time of your query. The preceding example looped 10,000 times and took just a few seconds, but for a slow query even one loop might take longer than this.

Some Examples

The examples that follow use some relatively large sample tables to demonstrate that indexes can be used effectively. You can download a sample of the database from the Sams Publishing website at http://www.samspublishing.com/. Enter this book's ISBN (without the hyphens) in the Search box and click Search. When the book's title is displayed, click the title to go to a page where you can download the code if you want to work through these examples.

The first example has three tables, each containing 10,000 rows. This is not large compared to the sizes of databases SQLite is capable of handling, but is big enough for us to see the difference that an index can make.

The tables each contain a numeric integer between 1 and 10,000 and one word pulled randomly from a dictionary file. Table t1 uses an INTEGER PRIMARY KEY and contains each value from 1 to 10,000 in the id column. The schema for t1 is as follows:

CREATE TABLE t1 (
  num INTEGER PRIMARY KEY,
  word TEXT NOT NULL
);

Though word is not defined as UNIQUE in t1, the sample table was created in such a way that values of word are not duplicated. Therefore this table contains a list of 10,000 random English words, numbered sequentially.

CREATE TABLE t2 (
  num INTEGER NOT NULL,
  word TEXT NOT NULL
);

Table t2 contains one record for each value of word in t1; however, the num column contains a random integer between 1 and 10,000. As there is no UNIQUE constraint on num, the random way in which the data was generated has resulted in some numbers being duplicated and some being omitted from this table.

CREATE TABLE t3 (
  num INTEGER NOT NULL,
  word TEXT NOT NULL
);

The final table, t3, has a similar schema to t1 but with no INTEGER PRIMARY KEY. In fact data has been inserted so far in such a way that num is unique and every value from 1 to 10,000 is present, but to start with we do not want this column to be indexed.

Values of word were inserted randomly from the same dictionary file, which actually contains more than 15,000 words. Therefore some words from t1 appear twice or more in t3, some are omitted, and some values that do not appear in either of the other tables have found their way into t3.

This data is trivial, but it allows us to create some reasonably large joins on both num and word to see how indexes could be used. In the queries that follow, we use count(*) as the selected output because the number of rows returned will usually be large and returning the count will not significantly affect the query's execution time.

Let's start by looking at how the INTEGER PRIMARY KEY in t1 makes things faster, even though there is no visible index. Because the records in t1 are stored in the order of the num column, a comparison on this column is much faster than on t2 where the ordering of num is random.

Listing 4.2 shows query t1.sql for performing a very simple benchmark on these tables using the equals operator.

Example 4.2. t1.sql

SELECT count(*) FROM t1
WHERE num = 3500 AND 4000;

The result of executing this query 1000 times is

$ ./benchmark.sh words t1.sql 1000
Generating temporary SQL file
Executing query 1000 times
real    0m0.007s
user    0m0.000s
sys     0m0.010s

File t2.sql performs the exact same query on table t2. Here are the results:

$ ./benchmark.sh words t2.sql 1000
Generating temporary SQL file
Executing query 1000 times
real    0m11.261s
user    0m9.270s
sys     0m1.930s

As you can see, the difference the INTEGER PRIMARY KEY makes is vast because SQLite knows for certain what the order of the records will be.

When table t3 was created, records were actually inserted in order of column num, but without the column being indexed, SQLite still has to perform a full table scan to be sure that it does not leave out any rows meeting the condition.

$ ./benchmark.sh words t3.sql 1000
Generating temporary SQL file
Executing query 1000 times
real    0m10.898s
user    0m8.880s
sys     0m1.970s

If we create an index on t3.num, the benchmark results show speeds comparable to the those of the original query using the INTEGER PRIMARY KEY:

sqlite> CREATE INDEX t3_num_idx ON t3(num);
$ ./benchmark.sh words t3.sql 1000
Generating temporary SQL file
Executing query 1000 times
real    0m0.029s
user    0m0.030s
sys     0m0.000s

Joining any two of these tables produces a potential 100 million rows, so this is where indexes can make a real impact. Consider the query in Listing 4.3, which joins tables t1 and t2 on the num field. As t2.num is not indexed, a full table scan must be performed on t2 for each row in t1.

Example 4.3. joint1t2.sql

SELECT count(*) FROM t1, t2
WHERE t1.num = t2.num;

This query is unacceptably slow, in fact so slow that it's not even worth running multiple times with the benchmark script. Just one execution took over a minute to complete.

$ time sqlite words < joint1t2.sql
10000
real    1m3.822s
user    1m1.000s
sys     0m0.170s

If we join t1 and t3 instead using the same condition, this is much faster. As each row in t1 is iterated, SQLite can find the corresponding record in t3 almost immediately using the index we just created.

$ ./benchmark.sh words joint1t3.sql 100
Generating temporary SQL file
Executing query 100 times
real    0m4.721s
user    0m4.400s
sys     0m0.310s

Remember that we said the order in which tables appear in the FROM list is significant—here's an example. Let's take Listing 4.3 and reverse the table order as shown in Listing 4.4. This time, each row in t2 is examined in turn and joined to t1 using the condition specified. Because t1.num is an INTEGER PRIMARY KEY, SQLite can jump straight to the row in question.

Example 4.4. joint2t1.sql

SELECT count(*) FROM t2, t1
WHERE t1.num = t2.num;
$ ./benchmark.sh words joint2t1.sql 100
Generating temporary SQL file
Executing query 100 times
real    0m4.435s
user    0m3.960s
sys     0m0.380s

Let's see what difference an index can make to the word column, which contains a variable-length character string, using the query shown in Listing 4.5.

Example 4.5. t1word.sql

SELECT count(*) FROM t1
WHERE word = 'tickles';
$ ./benchmark.sh words t1word.sql 1000
Generating temporary SQL file
Executing query 1000 times
real    0m8.533s
user    0m6.370s
sys     0m2.070s

If we add an index on t1.word and rerun the same test, the performance is much, much better.

sqlite> CREATE INDEX t1_word_idx ON t1(word);
$ ./benchmark.sh words t1word.sql 1000
Generating temporary SQL file
Executing query 1000 times
real    0m0.185s
user    0m0.160s
sys     0m0.030s

Our sample query just looks for a single word; in t1 and t2 the word “tickles” appears exactly once, so we could use a unique index if we want to. However, in t3 it actually appears three times, so trying to create a unique index gives a warning:

sqlite> CREATE UNIQUE INDEX t3_word_idx on t3(word);
SQL error: indexed columns are not unique

Instead we have to use a nonunique index on t3.word.

sqlite> CREATE INDEX t3_word_idx on t3(word);

The EXPLAIN Statement

There is a way to compare how different indexes affect the speed of certain queries without having to perform endless benchmark tests. However, it requires a little knowledge of the inner workings of SQLite.

The EXPLAIN command can be used to find out how an SQL query is parsed by SQLite and from that you can determine the way in which it will actually be executed. The output from EXPLAIN is a series of opcodes from the Virtual Database Engine, which we'll look at in more detail in Chapter 10, “General Database Administration.”

When using sqlite the .explain command can be used to make the format of the output of EXPLAIN more readable. The following example shows the opcodes used to process a query on t2 using a WHERE condition on the non-indexed num column.

sqlite> .explain
sqlite> EXPLAIN SELECT word FROM t2 WHERE num = 1234;
addr  opcode        p1          p2          p3
----  ------------  ----------  ----------  -----------------------------------
0     ColumnName    0           0           word
1     Integer       0           0
2     OpenRead      0           4           t2
3     VerifyCookie  0           2979
4     Rewind        0           11
5     Column        0           0
6     Integer       1234        0           1234
7     Ne            1           10
8     Column        0           1
9     Callback      1           0
10    Next          0           5
11    Close         0           0
12    Halt          0           0

To work out whether an index is used or not, knowing the actual meaning of all these opcodes is not necessary. If you compare the preceding output to that for a similar query on t3 where num is indexed, you'll see the name of the index in the p3 column alongside an OpenRead opcode.

sqlite> EXPLAIN SELECT word FROM t3 WHERE num = 1234;
addr  opcode        p1          p2          p3
----  ------------  ----------  ----------  -----------------------------------
0     ColumnName    0           0           word
1     Integer       0           0
2     OpenRead      0           5           t3
3     VerifyCookie  0           2979
4     Integer       0           0
5     OpenRead      1           1077        t3_num_idx
6     Integer       1234        0           1234
7     NotNull       -1          10
8     Pop           1           0
9     Goto          0           22
10    MakeKey       1           0           n
11    MemStore      0           0
12    MoveTo        1           22
13    MemLoad       0           0
14    IdxGT         1           22
15    RowKey        1           0
16    IdxIsNull     1           21
17    IdxRecno      1           0
18    MoveTo        0           0
19    Column        0           1
20    Callback      1           0
21    Next          1           13
22    Close         0           0
23    Close         1           0
24    Halt          0           0

Because table and column names from the query will also appear in the p3 column, it is a good idea to have a strong naming convention for your indexes so that they stand out in this output. In this case we used t3_num_idx, which tells us exactly which table and column the index applies to.

We can use EXPLAIN to see which index SQLite chooses to use where two are available—remember that only one index can be used per table in a query. In the following examples, only the lines of output from EXPLAIN that correspond to an index name are given.

sqlite> EXPLAIN SELECT * FROM t3
   ...> WHERE num = 4000 AND word = 'soccer';
...
6     OpenRead      1           1682        t3_word_idx
...

So of the two indexes available on table t3, SQLite has chosen the index on t3.word to use for this query.

Let's take a moment to recap what indexes are in place on these tables. You can always find this information by querying the sqlite_master table:

sqlite> SELECT name, sql FROM sqlite_master
   ...> WHERE type = 'index';
name              sql
----------------  ------------------------------------
t3_num_idx        CREATE INDEX t3_num_idx ON t3(num)
t1_word_idx       CREATE INDEX t1_word_idx ON t1(word)
t3_word_idx       CREATE INDEX t3_word_idx on t3(word)

In fact the index that SQLite will choose where more than one is equally suitable is the one that appears last in the sqlite_master table—usually the one most recently created. A brand new index will also take precedence over older indexes for the duration of the current SQLite session and the database schema has not been re-read. A clustered index is considered more suitable than an index on a single column if all the columns in the index form part of the WHERE clause.

There may be occasions when you want to prevent use of the index SQLite would otherwise choose in order to force another index to be used. This is achieved by using an operator that does not affect the column's value.

To make sure the index on t3.word is used, add zero to the num column:

sqlite> EXPLAIN SELECT * FROM t3
   ...> WHERE num+0 = 4000 AND word = 'soccer';
...
6     OpenRead      1           1695        t3_word_idx
...

Similarly, to force use of the index on t3.num, use the concatenation operator to append an empty string to word:

sqlite> EXPLAIN SELECT * FROM t3
   ...> WHERE num = 4000 AND word||'' = 'soccer';
...
6     OpenRead      1           1077        t3_num_idx
...

If multiple indexes can be used in a query, each one will appear in the output from EXPLAIN. The following example shows that indexes are also used for a LEFT OUTER JOIN.

sqlite> EXPLAIN SELECT count(*) FROM t1
   ...> LEFT OUTER JOIN t3 ON t1.num = t3.num
   ...> WHERE t1.word in ('apple', 'banana'),
11    OpenRead      2           927         t1_word_idx
15    OpenRead      3           2042        t3_num_idx

Using Transactions

You saw in Chapter 3, “SQLite Syntax and Use,” that every change to the database must take place within a transaction, and that if one is not started explicitly with the BEGIN TRANSACTION command, SQLite will begin an implicit transaction whenever an INSERT, UPDATE, or DELETE statement is processed.

However, taking the time to create your own transactions can be well worth the effort. Each transaction requires separate low-level operations to open and close the journal file, and this is an unnecessary overhead when a number of changes to the database can be made in a single transaction.

Therefore rather than letting SQLite start several implicit transactions for you when performing a series of changes to the database, get into the habit of using BEGIN TRANSACTION and COMMIT TRANSACTION wherever such statements can be grouped together.

Note

If you are writing to a temporary table, the impact of using transactions is not so drastic. Because temporary tables are not designed to be stored beyond the current session, the disk writes are not flushed to the database file as regularly.

The VACUUM Statement

Any kind of disk access is faster if the data being read is contiguous on the drive, and fetching rows from SQLite is no exception.

When records are deleted from a table, SQLite blanks out the space they used to occupy but does not remove it from the database file. The same applies when a table or index is dropped. Because additional disk space does not need to be reallocated to the database file, a future INSERT statement that can occupy this space is slightly quicker, but not all data records are the same size, and before long the database will become fragmented.

The VACUUM command is SQLite's defragmenter. It works by systematically copying the contents of the database to a temporary file—leaving no empty space—and then reloading the original file from this copy.

The SQL syntax allows a table name or index name argument to be passed to VACUUM for backward compatibility with much older versions of SQLite, however any argument given is now simply ignored. Issuing a VACUUM command will always clean up the entire database.

Tuning the Database Itself

This chapter has mostly covered optimizations that can be performed on individual SQL statements. Another way to improve the performance of SQLite is to tune certain database parameters to suit your application.

We will examine system-wide performance considerations in Chapter 10.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.14.147.123