Comparing Probable Resource Usage

Overview

After SAS estimates the number of qualified observations and selects the index that qualifies the fewest observations, SAS must determine whether it is faster (more efficient) to satisfy the WHERE expression by using the index or by reading all of the observations sequentially. Specifically, SAS predicts how many I/O operations each access method requires to satisfy the WHERE expression, and then compares the two resource costs.
Note: Remember, if SAS estimates that a subset contains fewer than 3% of the observations in the data set, SAS does not need to estimate resource usage. In this situation, SAS uses the index to process the WHERE statement.

How SAS Compares Resource Usage

To compare resource usage, SAS performs the following steps:
  1. SAS predicts how many I/O operations are required if it uses the index to satisfy the WHERE expression. To do so, SAS positions the index at the first entry that contains a qualified value. SAS then runs a buffer management simulation, which takes into account the current number of available buffers, and processes the RIDs (record identifiers) on that index page to determine how many I/O operations are required to read the qualified observations from the data file.
  2. SAS calculates the I/O cost of a sequential pass of the entire data file.
  3. SAS compares the two resource costs and determines which access method has a lower cost.
Note: If comparing resource costs results in a tie, SAS chooses the index.
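Although this cost comparison happens automatically, you can observe which access method SAS chose by setting the MSGLEVEL= system option to I, which writes INFO messages about index usage to the log. A minimal sketch, assuming a hypothetical data set work.sales that has a simple index on the variable region:

```sas
/* Write INFO messages about index usage to the log */
options msglevel=i;

data subset;
   set work.sales;          /* assumes work.sales has a simple index on region */
   where region = 'West';   /* SAS decides: index access or sequential read */
run;

/* If SAS chooses the index, the log contains a message such as:
   INFO: Index region selected for WHERE clause optimization. */

options msglevel=n;         /* restore the default message level */
```

If no INFO message appears for the WHERE expression, SAS determined that a sequential read was cheaper.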

Factors That Affect I/O

Several factors affect the number of I/O operations that are required for WHERE processing, including the following:
  • subset size relative to data set size
  • number of pages in the data file
  • order of the data
  • cost to uncompress a compressed file for a sequential read
These factors are discussed in more detail below.

Subset Size Relative to Data Set Size

As explained earlier in this chapter, SAS is more likely to use an index to access a small subset of observations. The process of retrieving data with an index is inherently more complicated than sequentially processing the data.
For small subsets, however, the benefit of reading only a few observations outweighs the cost of the complex processing. The smaller the subset, the larger the performance gains. Remember that SAS uses an index if the subset is less than 3% of the data set, and SAS probably uses an index if the subset is between 3% and 33% of the data set.
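As an illustration with hypothetical names: if work.claims contains 100,000 observations and a WHERE expression qualifies about 2,000 of them (2%), the subset falls below the 3% threshold, so SAS uses the index without comparing resource costs.

```sas
/* Create a data set with a simple index (hypothetical names) */
data work.claims(index=(claimid));
   set work.raw_claims;
run;

/* Suppose this WHERE expression qualifies about 2% of the
   observations: below the 3% threshold, SAS uses the index
   without estimating resource usage. */
data small;
   set work.claims;
   where claimid < 2000;
run;
```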

Number of Pages in the Data File

For a small data file, sequential processing is often just as efficient as index processing. If the data file occupies fewer than three pages, then sequential access is faster even when the subset contains less than 3% of the observations in the data set.
Note: The amount of data that can be transferred to one buffer in a single I/O operation is referred to as the page size. To see how many pages are in a data file, use either the CONTENTS procedure or the CONTENTS statement in the DATASETS procedure. For more information about reporting the page size for a data file, see Controlling Memory Usage.
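For example, either of the following steps reports the data set page size and the number of data set pages (the data set name is hypothetical):

```sas
/* Report page size and number of data set pages with PROC CONTENTS */
proc contents data=work.sales;
run;

/* Equivalent report, using the CONTENTS statement in PROC DATASETS */
proc datasets library=work nolist;
   contents data=sales;
quit;
```

In the output, look for the Data Set Page Size and Number of Data Set Pages values in the engine/host information.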

Order of the Data

The order of the data (sort order) affects the number of I/O operations as follows:
  • Observations are randomly distributed throughout the data file: qualified observations are located on a larger number of data file pages. Because an I/O operation is required each time that SAS loads a page, the more random the data in the data file, the more I/O operations are needed to use the index.
  • Observations are sorted on the indexed variable or variables: the data is ordered more like the index (in ascending value order), so qualified observations are located on fewer data file pages. Therefore, the less random the data in the data file, the fewer I/O operations are needed to use the index.
Note: In general, sorting the data set by the key variable before indexing makes index use more efficient: the more ordered the data file is with respect to the key variable, the fewer pages SAS must load. If the data file has more than one index, then sorting the data by the most frequently used key variable is most efficient. Sorting the data set results in more efficient WHERE processing even when SAS does not use an index.
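A sketch of this approach, with hypothetical data set and variable names: sort by the most frequently used key variable first, then create the index.

```sas
/* Sort the data by the key variable before indexing (hypothetical names) */
proc sort data=work.sales;
   by region;
run;

/* Create a simple index on the sorted key variable */
proc datasets library=work nolist;
   modify sales;
   index create region;
quit;
```

Because the data file is now ordered on region, qualified observations for a WHERE expression on region cluster on fewer pages, reducing the I/O cost of using the index.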

Cost to Uncompress a Compressed File for a Sequential Read

When SAS reads a compressed data file, SAS automatically uncompresses the observations as they are read into the program data vector. This requires additional CPU resources, but fewer I/O operations are required because there are fewer data set pages. When performing a sequential read of a compressed data file, SAS must uncompress all observations in the file. However, when using direct access, SAS must uncompress only the qualified observations. Therefore, the resource cost of uncompressing observations is greater for a sequential read than for direct access.
Note: Compressing a file is a process that reduces the number of bytes that are required for representing each observation. By default, a SAS data file is not compressed. For more information about compressing files, see Controlling Data Storage Space.
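For example, the COMPRESS= data set option requests compression when the file is created; the names here are hypothetical:

```sas
/* Create a compressed copy of a data file (hypothetical names).
   Observations are uncompressed automatically as they are read. */
data work.sales_c(compress=yes);
   set work.sales;
run;
```

For a sequential read of work.sales_c, every observation must be uncompressed; with indexed (direct) access, only the qualified observations are uncompressed.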

Other Factors That Affect Resource Usage

Data type and length are two other factors that can affect index efficiency.
Numeric key variables typically result in more CPU usage than character key variables, because numeric variables must be converted to formats that can be sequenced when values are read into the index or retrieved from the index. Character values are already in a format that can be sequenced.