• Create an index when you intend to retrieve a small subset of observations from a
large data file (for example, less than 25% of all observations). When this occurs, the
cost of processing data file pages is lower than the overhead of sequentially reading
the entire data file. The smaller the subset, the larger the performance gains.
• To reduce the number of I/Os performed when you create an index, first sort the data
by the key variable. Then to improve performance, maintain the data file in sorted
order by the key variable. This technique reduces the I/Os by grouping like values
together. That is, the more ordered the data file is with respect to the key variable, the
more efficient the use of the index. If the data file has more than one index, sort the
data by the most frequently used key variable.
• An index might not be necessary to optimize a WHERE expression if the data is
sorted appropriately in order to satisfy the condition. To process a WHERE
expression without an index, SAS first checks for the sort indicator that is stored
with the file from a previous SORT procedure. If the sort indicator is appropriate,
SAS stops reading the file once there are no more values that satisfy the WHERE
expression. For example, consider a file that is sorted by Age, without an index. To
process the expression where age le 25, SAS stops reading observations after it
finds an observation that is greater than 25. Note that while SAS can determine when
to stop reading observations, if there is no index, there is no indication where to
begin. Without an index, SAS always begins with the first observation, which can
require reading a lot of observations.
Index Use Considerations
• Keep the number of indexes per data file to a minimum to reduce disk storage and to
reduce update costs.
• Consider how often your applications use an index. An index must be used often in
order to make up for the resources that are used in creating and maintaining it. That
is, do not rely solely on resource savings from processing a WHERE expression.
Take into consideration the resources that it takes to actually create the index and to
maintain it every time the data file is changed.
• When you create an index to process a WHERE expression, do not try to create one
index that is used to satisfy all queries. If there are several variables that appear in
queries, those queries might be best satisfied with simple indexes on the most
discriminating of those variables.
Key Variable Candidates
In most cases, multiple variables are used to query a data file. However, it probably
would be a mistake to index all variables in a data file, as certain variables are better
candidates than others:
• The variables to be indexed should be variables that are used in queries. That is, your
application should require selecting small subsets from a large file, and the most
common selection variables should be considered as candidate key variables.
• A variable is a good candidate for indexing when the variable can be used to
precisely identify the observations that satisfy a WHERE expression. That is, the
variable should be discriminating, which means that the index should select the
fewest possible observations. For example, variables such as Age, FirstName, and
Gender are not discriminating because it is possible for a large representation of the
data to have the same age, first name, and gender. However, a variable such as
LastName is a good choice because it is less likely that many employees share the
same last name.
For example, consider a data file with variables LastName and Gender.
Understanding SAS Indexes 645