Estimating the Number of Observations

Overview

It is more efficient to use indexed access for a small subset and to use sequential access for a large subset. Therefore, after identifying any available indexes and evaluating the conditions in the WHERE expression, SAS estimates the number of observations that will be qualified by the index. Whether SAS uses an index depends on the percentage of observations that are qualified (the size of the subset relative to the size of the data set), as shown below:
SAS index
  • If the subset is less than 3% of the data set, direct access is almost certainly more efficient than sequential access, and SAS will use an index. In this situation, SAS does not go on to compare probable resource usage.
  • If the subset is between 3% and 33% of the data set, direct access is likely to be more efficient than sequential access, and SAS probably uses an index.
  • If the subset is greater than 33% of the data set, it is less likely that direct access is more efficient than sequential access, and SAS might or might not use an index.
When multiple indexes exist, SAS selects the one that produces the fewest qualified observations (the smallest subset). SAS does this even when each index returns a subset that is less than 3% of the data set.

Printing Centile Information

To help SAS estimate the number of observations that would be selected by a WHERE expression, each index stores 21 statistics called cumulative percentiles, or centiles. Centiles provide information about the distribution of values for the indexed variable.
Understanding the distribution of values in a data set can help you improve the efficiency of WHERE processing in your programs. You can print centile information for an indexed data file by specifying the CENTILES option in either of these places:
  • the CONTENTS procedure
  • the CONTENTS statement in the DATASETS procedure
PROC CONTENTS <options>;
RUN;
PROC DATASETS <options>;    
CONTENTS <options>;
QUIT;

Example

The following SAS program prints centile information for the data set Company.Organization:
proc contents data=company.organization centiles;
run;
Partial output from this program is shown below. As indicated on the left, an index is defined on the variable Employee_ID. The 21 centile values are listed on the right.
Figure 23.1 Partial PROC CONTENTS Output
Alphabetic List of Indexes and Attributes
The 21 centile values consist of the following:
Position in List
Value Shown in Output Above
Description
1 (first)
120101
the minimum value of the indexed variable (0% of values are lower than this value)
2-20
120152 - 121097
each value is greater than or equal to all other values in one of the 19 percentiles that range from the bottom 5% to the bottom 95% of values, in increments of 5%
21 (last)
99999999
the maximum value of the indexed variable (100% of values are lower than or equal to this value)
Note: For information about updating and refreshing centiles for a data file, see the SAS documentation.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.219.249.210