Deciding Whether to Create an Index

Overview

An index can reduce the time required to locate a set of rows, especially for a large data file. However, there are costs associated with creating, storing, and maintaining the index. When deciding whether to create an index, you must weigh any benefits in performance improvement against the costs of increased resource usage.
Note: This chapter discusses the benefits and costs that are associated with using indexes specifically with PROC SQL. To learn about the costs and benefits of using indexes with other SAS procedures, see the SAS documentation.

PROC SQL Queries That Can Be Optimized by an Index

To use indexes effectively with PROC SQL, it is important to know the classes of queries that can be processed more efficiently by using an index. The classes of queries that can be optimized are specified below.
Query performance is optimized when the key column occurs in ...
Example
a WHERE clause expression that contains
  • a comparison operator
  • the TRIM or SUBSTR function
  • the CONTAINS operator
  • the LIKE operator.
proc sql;
    select empid, jobcode, salary
       from sasuser.payrollmaster
       where jobcode='FA3'
       order by empid;
Key Column(s): JobCode
a subquery returning values to the IN operator.
proc sql;
    select empid, lastname, firstname,
            city, state
       from sasuser.staffmaster
       where empid in
          (select empid
             from sasuser.payrollmaster
             where salary>40000);
Key Column(s): EmpID
a correlated subquery, in which the column being compared with the correlated reference is indexed
proc sql;
    select lastname, firstname
       from sasuser.staffmaster
       where 'NA'=
          (select jobcategory
             from sasuser.supervisors
             where staffmaster.empid =
                    supervisors.empid);
Key Column(s): Supervisors.EmpID
a join in which
  • the join expression contains the equals (=) operator (an equijoin)
  • all the columns in the join expression are indexed in one of the tables being joined.
proc sql;
    select *
       from sasuser.payrollmaster as p,
            sasuser.staffmaster as s
       where p.empid =
             s.empid
       order by jobcode;
Key Column(s): Payrollmaster.EmpID or Staffmaster.EmpID

Benefits of Using an Index

For PROC SQL, there are three main benefits to using an index to access data directly (instead of reading the data sequentially):
  • A small subset of data (<15% of rows) can be accessed more quickly. (As the size of the subset increases, the advantage of using an index decreases.)
  • Equijoins can be performed without internal sorts.
  • Uniqueness can be enforced by creating a unique index.

Example: Using an Index to Access a Small Subset of Data

Suppose you are writing a query that references the table Work.Payrollmaster. (Work.Payrollmaster is a duplicate of the table Sasuser.Payrollmaster.) Work.Payrollmaster stores payroll information for employees, and a simple index is defined on the column JobCode. Your query's WHERE clause expression references the key column:
proc sql;
   select empid, jobcode, salary
      from work.payrollmaster
      where jobcode='FA3'
      order by empid;
If the value of JobCode for most of the rows in the table is FA3, then the use of an index does not significantly improve the efficiency of the following query. In fact, performance might be degraded.
However, if only 10% of the rows have a value of FA3, then PROC SQL can process the query more efficiently by using the index.
Note: In this chapter, if you want to submit any sample code that references a temporary table (a table that is stored in the Work library), you first need to create the temporary table by copying the table in the Sasuser library that has the same name.

Understanding the Costs of Using an Index

When you are deciding whether to create an index, you should consider the associated increase in resource usage, which includes the following:
  • Additional CPU time is necessary to create an index, to maintain the index when the table is modified, and to use an index to read a row from a table.
  • Using an index to read rows from a table might require additional I/O (input/output) requests when compared to reading the table sequentially.
  • Using an index requires additional memory for buffers into which the index pages and code are loaded for processing.
  • Additional disk space is required to store the index file, which can show up as a separate file (in the Windows and UNIX operating environments, for example) or can appear to be part of the data file (in the z/OS operating environment).

Guidelines for Creating Indexes

To use indexes effectively, follow these guidelines for creating indexes:
  • Keep the number of indexes to a minimum to reduce disk storage and update costs.
  • Do not create an index for small tables. Sequential access is faster on small tables.
  • Do not create an index based on columns that have a very small number of distinct values, low cardinality ( for example, a Gender column that contains only the two values Male and Female).
  • Use indexes for queries that retrieve a relatively small subset of rows — that is, less than 15%.
  • Do not create more than one index that is based on the same column as the primary key.
Tip
Many factors affect the processing of SAS programs. The most accurate way to find out whether to create an index for a particular table or column is to perform benchmarking tests.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.129.249.92