Deciding Whether to Create an Index

In previous sections, you learned how SAS determines whether sequential access or direct access is likely to be most efficient for WHERE processing. You also learned about a variety of factors that you can assess to determine which access method is most efficient. After you have made your determination, you can use the following guidelines to decide whether it is efficient to create an index.

Guidelines for Deciding Whether to Create an Index

  • Minimize the number of indexes to reduce disk storage and update costs. Create indexes only on variables that are often used in queries or (when data cannot be sorted) in BY-group processing.
  • Create an index when you intend to retrieve a small subset of observations from a large data file.
  • Do not create an index if the data file's page count is less than three pages. It is faster to access the data sequentially.
  • Create indexes on variables that are discriminating. Discriminating variables have many different values that precisely identify observations. A WHERE expression that subsets based on a discriminating variable results in a smaller subset than a WHERE expression that references a non-discriminating variable. A non-discriminating variable is one that has only two values (for example, gender does not make a good variable on which to create an index).
  • To minimize I/O operations, sort the data by the key variable before creating the index. Then, to improve performance, maintain the data file in sorted order by the key variable.
    Note: If you choose not to use an index and the data set is large, it is still more efficient to sort the data set on the variable or variables that are specified in the WHERE statement.
  • Consider how often your applications use an index. An index must be used often in order to compensate for the resources that are used in creating and maintaining it.
  • Consider the cost of an index for a data file that is frequently changed.
  • When you create an index to process a WHERE expression, do not try to create one index to satisfy all queries.
Consider three sample queries to see how you can apply the guidelines that are listed in the previous section. These queries illustrate the effect of one factor—the size of the subset relative to the size of the data set—on the choice of an access method. For each query, you learn the following:
  • which access method SAS is likely to select
  • whether you could improve performance by creating an index

Example: Selecting Subsets of Various Sizes from Data Sets of Various Sizes

Suppose you are working with the following two data sets. Each contains information about a company's orders:
Data Set Name
Pages
Observations
Company.Orders_large
285,500
19,033,380
Company.Orders_small
2
140
You want to create queries to generate three subset detail reports, one for each of the following types of subsets:
  • small subset from a large data set
  • large subset from a large data set
  • small subset from a small data set
In all three queries, the WHERE expression specifies the variable Order_Date. You know that this variable is used frequently in queries, and that it is a discriminating variable. According to the guidelines in the previous section, these are both criteria for creating an index on the variable. However, there is currently no index defined on this variable in either data set.

Query 1: Small Subset from a Large Data Set

The first report shows all orders in Company.Orders_large that were made on January 10, 1998. Your query is shown below, along with the subset size that you have estimated:
Query
Subset Size
data _null_;
    set company.orders_large;
    where order_date='10JAN1998'd;
run;
2232 observations (out of 19,033,380)< .02% of the data set
Because the subset is less than 3% of the entire data set, using an index on Order_Date should be more efficient than using sequential access. SAS uses an index for WHERE processing, if an index is available. To improve performance, you should create an index on Order_Date before running this program.

Query 2: Large Subset from a Large Data Set

The second report shows all orders in Company.Orders_large that were made before January 1, 2000. Your query and the estimated subset size are shown below:
Query
Subset Size
data _null_;
   set company.orders_large;
   where order_date<'01JAN2000'd;
run;
12,752,365 observations (out of 19,033,380) =approximately 67% of the data set
Because the subset is more than 33% of the entire data set, using the index is probably less efficient than using sequential access. SAS probably will not use the index for WHERE processing.

Query 3: Small Subset from a Small Data Set

The third report shows all orders in the smaller data set Company.Orders_small that were made on June 30, 1998. Your query and the estimated subset size are shown below:
Query
Subset Size
data _null_;
   set company.orders_small;
   where order_date'30JUN1998'd;
run;
2 observations (out of 140) =< 2 % of the data set
Because the subset is less than 3% of the entire data set, SAS will use the index for WHERE processing. However, the data file's page count is less than three pages, so it is more efficient to use sequential access. In this situation, it is best not to create an index.

Using the IDXWHERE= and IDXNAME= Data Set Options

In most situations, it is best to let SAS determine whether to use an index for WHERE processing. However, sometimes you might want to control whether SAS uses an existing index. For example, if you know that your query selects a large subset and that indexed access therefore is not efficient, you can tell SAS to ignore any index and to satisfy the conditions of the WHERE expression with a sequential search of the data set. Alternatively, if your query selects a small subset and there are multiple available indexes, you can make sure that SAS uses a particular index to process your WHERE statement. Finally, you might want to force SAS to use (or not use) an index when you are benchmarking.
The data set options IDXWHERE= and IDXNAME= control index usage.
Option
Action
IDXWHERE=
specifies whether SAS should use an index to process the WHERE expression, no matter which access method SAS estimates is faster.
You cannot use IDXWHERE= to override the use of an index for processing a BY statement.
IDXNAME=
causes SAS to use a specific index.
Note: You can use either IDXWHERE= or IDXNAME=, but not both at the same time.
For more information about the IDXWHERE= and IDXNAME= data set options, see Creating and Managing Indexes Using PROC SQL.

Specifying MSGLEVEL=I to Determine Whether SAS Used an Index

To determine whether SAS used an index to process a WHERE expression, specify I as the value of the MSGLEVEL= system option. MSGLEVEL=I causes SAS to display information about index usage in the SAS log.
Note: To make the most efficient use of resources, use MSGLEVEL=I only for debugging and for verifying index usage.
Note: For more information about the MSGLEVEL= system option, see Creating Indexes or Creating and Managing Indexes Using PROC SQL.

Example: Using IDXWHERE=NO to Prevent Index Usage

Suppose you write the following query, which lists all employees who work in the Sales department.
proc print data=company.organization;
   where department='Sales';
run;
Now suppose an index is defined on the variable Department in the data set Company.Organization. You know that Department has the value Sales in 65% of the observations, so it is not efficient for SAS to use an index for WHERE processing. To ensure that SAS does not use an index, specify IDXWHERE=NO after the data set name. At the beginning of the program, you can also add an OPTIONS statement that specifies MSGLEVEL=I to display a message about index usage in the SAS log. The revised program is shown below:
options msglevel=i;
proc print data=company.organization (idxwhere=no);
   where department='Sales';
run;
When you run this program, the SAS log indicates that the index was not used for processing.
Table 23.1 SAS Log
INFO: Data set option (IDXWHERE=NO) forced a sequential pass of the data 
rather than use of an index for where-clause processing.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.216.88.54