Summary

Using an Index for Efficient WHERE Processing

When processing a WHERE expression, SAS determines whether it is more efficient to access observations sequentially, by examining all observations, or directly, by using an index to access specific observations. Using an index to process a WHERE expression might improve performance and is referred to as optimizing the WHERE expression. By deciding whether to create an index, you play a role in determining which access method SAS can use.
In order to decide whether to use an index, you must evaluate the benefits and costs of using an index.
SAS performs a series of steps to determine whether to process a WHERE expression by using an index or by reading all observations sequentially.

Identifying Available Indexes

First, SAS determines whether there are any existing indexes that might be used to process the WHERE expression. Specifically, SAS checks the variable in each condition in the WHERE expression to determine whether the variable is either a key variable in a simple index or the first key variable in a composite index. No matter how many indexes are available, SAS can use only one index to process a WHERE expression. Therefore, if multiple indexes are available, SAS must choose between them.
It is most common for SAS to use an index to process just one condition in a WHERE expression. However, in a process called compound optimization, SAS can use a composite index to optimize multiple conditions on multiple variables, which are joined with a logical operator such as AND.

Identifying Conditions That Can Be Optimized

Second, SAS looks for operators and functions that can be optimized in the WHERE conditions that contain key variables. There are also certain operators and functions that cannot be optimized. For compound optimization, WHERE conditions must meet slightly different criteria in order to be candidates for optimization.

Estimating the Number of Observations

Third, SAS estimates how many observations are qualified by the index. When multiple indexes exist, SAS selects the one that seems to produce the fewest qualified observations (the smallest subset). Whether SAS uses an index depends on the percentage of observations that are qualified (the size of the subset relative to the size of the data set). It is more efficient to use direct access for a small subset and sequential access for a large subset. If SAS estimates that the number of qualified observations is less than 3% of the data file, SAS automatically uses the index and does not compare probable resource usage.
To help SAS estimate how many observations would be selected by a WHERE expression, each index stores 21 statistics called cumulative percentiles, or centiles. Centiles provide information about the distribution of values for the indexed variable.

Comparing Probable Resource Usage

Fourth, SAS decides whether it is more efficient to satisfy the WHERE expression by using the index or by reading all observations sequentially. To make the decision, SAS predicts how many I/O operations are required to satisfy the WHERE expression for each access method, and then compares the two resource costs.
Several factors affect the number of I/O operations that are required for WHERE processing, including the following:
  • subset size relative to data set size
  • number of pages in the data file
  • order of the data
  • cost to uncompress a compressed file for a sequential read
Data type and length are two other factors that affect index efficiency.

Deciding Whether to Create an Index

When you use a WHERE expression to select a subset, you can use specific guidelines to decide whether it is efficient to create an index. Depending on factors such as the size of the subset relative to the size of the data set, you might or might not choose to create an index.
In most situations, it is best to let SAS determine whether to use an index for WHERE processing. However, sometimes you might want to control whether SAS uses an existing index. You can use either of the data set options IDXWHERE= or IDXNAME=, but not both at the same time, to control index usage. You can specify MSGLEVEL=I to tell SAS to display information about index usage in the SAS log.

Comparing Procedures That Produce Detail Reports

When you produce a detail report, you can choose between the PRINT procedure and the SQL procedure. To perform a particular task, a single-purpose tool like PROC PRINT generally uses fewer computer resources than a multi-purpose tool like PROC SQL.
For detail reports, a PROC PRINT step often, but not always, uses fewer resources than a PROC SQL step:
  • PROC PRINT is usually more efficient than PROC SQL for generating a simple detail report, a subset detail report, and a sorted detail report.
  • PROC PRINT and PROC SQL are likely to use similar resources for generating a sorted subset detail report.

Comparing Tools for Summarizing Data

SAS provides a variety of tools for summarizing data, including the MEANS procedure (or SUMMARY procedure), the TABULATE procedure, the REPORT procedure, the SQL procedure, and the DATA step.
If you summarize data for one class variable, the tools in each of the following groups are similar in resource usage:
  • PROC MEANS (or PROC SUMMARY), PROC REPORT, and PROC TABULATE
  • PROC SQL and the DATA step
However, the relative efficiency of the two groups of tools varies according to the shape of the data.
You can use PROC MEANS in a variety of ways to produce summary statistics for combinations of class variables. Each combination of class variables is called a type.
To summarize data for all combinations of all class variables, you can use a basic PROC MEANS step (or PROC SUMMARY step). To produce summary statistics for specific combinations of class variables, you can use PROC MEANS in the following ways :
  • the TYPES statement in a PROC MEANS step
  • the NWAY option in multiple PROC MEANS steps
  • the WHERE= output data set option in a PROC MEANS step
These three techniques vary in efficiency; the TYPES statement in PROC MEANS is the most efficient.
You can also use the WAYS statement in PROC MEANS to produce summary statistics for specific combinations of class variables.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.118.144.248