Using Indexing

We'll discuss indexing first because it's the most important tool you have for speeding up your queries. There are other techniques available to you, too, but generally the one thing that will make the most difference is the proper use of indexes. On the MySQL mailing list, people often ask for help in making a query run faster. In a surprisingly large number of cases, there are no indexes on the tables in question, and adding indexes often solves the problem immediately. It doesn't always work like that because optimization isn't always simple. Nevertheless, if you don't use indexes, then in many cases you're just wasting your time trying to improve performance by other means. Use indexing first to get the biggest performance boost and then see what other techniques might be helpful.

This section describes what an index is, how it improves query performance, when indexes may degrade performance, and how to choose indexes for your tables. In the next section, we'll discuss MySQL's query optimizer. It's good to have some understanding of the optimizer in addition to knowing how to create indexes because then you'll be better able to take advantage of the indexes you create. Certain ways of writing queries actually prevent your indexes from being useful, and generally you'll want to avoid having that happen. (Not always, though. Sometimes you'll want to override the optimizer's behavior. We'll cover some of these cases, too.)

Benefits of Indexing

Let's consider how an index works by beginning with a table that has no indexes. An unindexed table is simply an unordered collection of rows. For example, Figure 4.1 shows the ad table that we first saw in Chapter 1, "Introduction to MySQL and SQL." There are no indexes on this table, so if we're looking for the rows for a particular company, we must examine each row in the table to see if it matches the desired value. This involves a full table scan, which is slow, as well as tremendously inefficient if the table contains only a few records matching the search criteria.

Figure 4.1. Unindexed ad table.


Figure 4.2 shows the same table, but with the addition of an index on the company_num column in the ad table. The index contains an entry for each row in the ad table, but the index entries are sorted on company_num value. Now, instead of searching through the table row by row looking for items that match, we can use the index. Suppose we're looking for all rows for company 13. We begin scanning the index and find three rows for that company. Then we reach the row for company 14, a value higher than the one we're looking for. Index values are sorted, so when we read the record containing 14, we know we won't find any more matches and can quit looking. If we were looking for a value that doesn't occur until some point midway into the index table, there are positioning algorithms for finding the first matching index entry without doing a linear scan of the table (for example, a binary search). That way, we can quickly position to the first matching value and save a lot of time in the search. Databases use various techniques for positioning to index values quickly, but it's not so important here what those techniques are. What's important is that they work and that indexing is a good thing.

Figure 4.2. Indexed ad table.


You may be asking why we don't just sort the data file and dispense with the index file? Wouldn't that produce the same type of improvement in search speed? You're right, it would—if you had a single index. But you might want a second index, and you can't sort the data file two different ways at once. (For example, you might want one index on customer names and another on customer ID numbers or phone numbers.) Using indexes as entities separate from the data file solves the problem and allows multiple indexes to be created. In addition, rows in the index are generally shorter than data rows. When you insert or delete new values, it's easier to move around shorter index values to maintain the sort order than to move around the longer data rows.

The example corresponds to the way MySQL indexes tables. A table's data rows are kept in a data file, and index values are kept in an index file. You can have more than one index on a table; if you do, they're all stored in the same index file. Each index in the index file consists of a sorted array of key records that are used for fast access into the data file.

The preceding discussion describes the benefit of an index in the context of single-table queries, where the use of an index speeds searches significantly by eliminating the need for full table scans. However, indexes are even more valuable when you're running queries involving joins on multiple tables. In a single-table query, the number of values you need to examine per column is the number of rows in the table. In a multiple-table query, the number of possible combinations skyrockets because it's the product of the number of rows in the tables.

Suppose you have three unindexed tables, t1, t2, and t3, each containing a column c1, c2, and c3, respectively, and each consisting of 1000 rows that contain the numbers 1 through 1000. (This example is contrived, of course. It's simple to make a point; nevertheless, the problems it will illustrate are real.) A query to find all combinations of table rows in which the values are equal looks like this:

SELECT c1, c2, c3
FROM t1, t2, t3
WHERE c1 = c2 AND c1 = c3

The result of this query should be 1000 rows, each containing three equal values. If we process the query in the absence of indexes, we have no idea which rows contain which values. Consequently, we must try all combinations to find the ones that match the WHERE clause. The number of possible combinations is 1000 × 1000 × 1000 (1 billion!), which is a million times more than the number of matches. That's a lot of wasted effort, and this query is likely to be very slow, even for a database such as MySQL that is very fast. And that is with only 1000 rows per table. What happens when you have tables with millions of rows? You can see that this quickly leads to very poor performance. If we index each table, we can speed things up considerably because indexing allows the query to be processed like this:

  1. Select the first row from table t1 and see what value the row contains.

  2. Using the index on table t2, go directly to the row that matches the value from t1. Similarly, using the index on table t3, go directly to the row that matches the value from t1.

  3. Proceed to the next row of table t1 and repeat the preceding procedure until all rows in t1 have been examined.

In this case, we're still performing a full scan of table t1, but we're able to do indexed lookups on t2 and t3 to pull out rows from those tables directly. The query runs about a million times faster this way—literally.

MySQL uses indexes as just described to speed up searches for rows matching terms of a WHERE clause or rows that match rows in other tables when performing joins. It also uses indexes to improve the performance of other types of operations:

  • The smallest or largest value for an indexed column can be found quickly without examining every row when you use the MIN() or MAX() functions.

  • MySQL can often use indexes to perform sorting operations quickly for ORDER BY clauses.

  • Sometimes MySQL can avoid reading the data file entirely. Suppose you're selecting values from an indexed numeric column, and you're not selecting other columns from the table. In this case, by reading an index value, you've already got the value you'd get by reading the data file. There's no reason to read values twice, so the data file need not even be consulted.

Disadvantages of Indexing

In general, if MySQL can figure out how to use an index to process a query more quickly, it will. This means that, for the most part, if you don't index your tables, you're hurting yourself. You can see that I'm painting a rosy picture of the benefits of indexing. Are there disadvantages? Yes, there are. In practice, these drawbacks tend to be outweighed by the advantages, but you should know what they are.

First, the index file takes up disk space. If you have lots of indexes, the index file may reach the maximum file size more quickly than the data file. Second, indexes speed up retrievals but slow down inserts and deletes, as well as updates of values in indexed columns (that is, most operations involving writing) because a write affects not only the data row, but often the indexes as well. The more indexes a table has, the greater the average performance degradation for write operations. In the section "Loading Data Efficiently," we'll go into more detail about this performance problem and what you can do about it.

Choosing Indexes

The syntax for creating indexes was covered in the section "Creating and Dropping Indexes" of Chapter 3, "MySQL SQL Syntax and Use." I assume here that you've read that section. But knowing syntax doesn't in itself help you determine how your tables should be indexed. That requires some thought about the way you use your tables. This section gives some guidelines on how to identify and select candidate columns for indexing:

  • Index columns that you search for, not columns you select. In other words, the best candidate columns for indexing are the columns that appear in your WHERE clause or columns named in join clauses, not columns that appear in the selection list following the SELECT keyword:

    SELECT
        col_a   ← not a candidate
    FROM
        tbl1 LEFT JOIN tbl2
        ON tbl1.col_b = tbl2.col_c  ← candidates
    WHERE
        col_d = expr  ← a candidate
    

    The columns that you select and the columns you use in the WHERE clause might be the same, of course. The point is that appearance of a column in the selection list is not a good indicator that it should be indexed.

    Columns that appear in join clauses or in expressions of the form col1 = col2 are especially good candidates for indexing. col_b and col_c in the query just shown are examples of this. If MySQL can optimize a query using joined columns, it cuts down the potential table-row combinations quite a bit by eliminating full table scans.

  • Use unique indexes. Consider the spread of values in a column. Indexes work best for columns with unique values, and most poorly with columns that have many duplicate values. For example, if a column contains ages and has several different values, an index will differentiate rows readily. An index will not help as much if a column is used to record sex and contains only the two values "M" and "F" (whichever value you search for, you're still going to get about half of the rows).

  • Use short indexes. If you're indexing a string column, specify a prefix length whenever it's reasonable to do so. For example, if you have a CHAR(200) column, don't index the entire column if most values are unique within the first 10 or 20 characters. Indexing the first 10 or 20 characters will save a lot of space in the index, and probably will make your queries faster as well. A smaller index involves less disk I/O, and shorter values can be compared more quickly. More importantly, with shorter key values, blocks in the index cache hold more key values, so MySQL can hold more keys in memory at once. This improves the likelihood of locating rows without reading additional blocks from the index. (You want to use some common sense, of course. Indexing just the first character from a column isn't likely to be that helpful because there won't be very many distinct values in the index.)

  • Take advantage of leftmost prefixes. When you create an n-column index, you actually create n indexes that MySQL can use. A multiple-column index serves as several indexes because any leftmost set of columns in the index can be used to match rows. Such a set is called a leftmost prefix.[1]

    [1] This is different than indexing a prefix of a column, which is using the first n characters of the column for index values.

    Suppose you have a table with an index on columns named state, city, and zip. Rows in the index are sorted in state/city/zip order, so they're automatically sorted on state/city order and state order as well. This means that MySQL can take advantage of the index even if you specify only state values in a query, or only state and city values. Thus, the index can be used to search the following combinations of columns:

    state, city, zip
    state, city
    state
    

    MySQL cannot use the index for searches that don't involve a leftmost prefix. For example, if you search by city or by zip, the index isn't used. If you're searching for a given state 0and a particular zip code (columns 1 and 3 of the index), the index can't be used for the combination of values. However, MySQL can narrow the search using the index to find rows that match the state.

  • Don't over-index. Don't index everything in sight based on the assumption "the more, the better." That's a mistake. Every additional index takes extra disk space and hurts performance of write operations, as has already been mentioned. Indexes must be updated and possibly reorganized when you modify the contents of your tables, and the more indexes you have, the longer this takes. If you have an index that is rarely or never used, you're slowing down table modifications unnecessarily. In addition, MySQL considers indexes when generating an execution plan for retrievals. Creating extra indexes creates more work for the query optimizer. It's also possible (if unlikely) that MySQL will fail to choose the best index to use when you have too many indexes. Maintaining only the indexes you need helps the query optimizer avoid making such mistakes.

    If you're thinking about adding an index to a table that is already indexed, consider whether or not the index you're thinking about adding is a leftmost prefix of an existing multiple-column index. If so, don't bother adding the index because, in effect, you already have it.

  • Consider the type of comparisons you perform on a column. Indexes are used for '<', '<=', '=d', '>=', '>', and BETWEEN operations. Indexes are also used for LIKE operations, when the pattern has a literal prefix. If you use a column only for other kinds of operations, such as STRCMP(), there is no value in indexing it.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.141.41.187