Understanding SAS Indexes (2/5)

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

using, and maintaining an index. The following topics provide information about

resource usage and give you some guidelines for creating indexes.

CPU Cost

Additional CPU time is necessary to create an index as well as to maintain the index

when the data file is modified. That is, for an indexed data file, when a value is added,

deleted, or modified, it must also be added, deleted, or modified in the appropriate

index(es).

When SAS uses an index to read an observation from a data file, there is also increased

CPU usage. The increased usage results from SAS using a more complicated process

than is used when SAS retrieves data sequentially. Although CPU usage is greater, you

benefit from SAS reading only those observations that meet the conditions. Note that

increased CPU usage is why using an index is more expensive when there is a larger

number of observations that meet the conditions.

Note: To compare CPU usage with and without an index, for some operating

environments, you can issue the STIMER or FULLSTIMER system options in order

to write performance statistics to the SAS log.

I/O Cost

Using an index to read observations from a data file can increase the number of I/O

(input/output) requests compared to reading the data file sequentially. For example,

processing a BY statement with an index might increase I/O count, but you save in not

having to issue the SORT procedure. For WHERE processing, SAS considers I/O count

when deciding whether to use an index.

1. SAS does a binary search on the index file and positions the index to the first entry

that contains a qualified value.

2. SAS uses the value's RID (identifier) to directly access the observation containing

the value. SAS transfers the observation between external storage to a buffer, which

is the memory into which data is read or from which data is written. The data is

transferred in pages, which is the amount of data (the number of observations) that

can be transferred for one I/O request; each data file has a specified page size.

3. SAS then continues the process until the WHERE expression is satisfied. Each time

SAS accesses an observation, the data file page containing the observation must be

read into memory if it is not already there. Therefore, if the observations are on

multiple data file pages, an I/O operation is performed for each observation.

The result is that the more random the data, the more I/Os are required to use the index.

If the data is ordered more like the index, which is in ascending value order, a smaller

number of I/Os are required to access the data.

The number of buffers determines how many pages of data can simultaneously be in

memory. Frequently, the larger the number of buffers, the smaller the number of I/Os

that are required. For example, if the page size is 4096 bytes and one buffer is allocated,

then one I/O transfers 4096 bytes of data (or one page). To reduce I/Os, you can increase

the page size but you need a larger buffer. To reduce the buffer size, you can decrease the

page size but you use more I/Os.

For information about data file characteristics like the data file page size and the number

of data file pages, issue the CONTENTS procedure (or use the CONTENTS statement in

the DATASETS procedure). With this information, you can determine the data file page

size and experiment with different sizes. Note that the information that is available from

PROC CONTENTS depends on the operating environment.

Understanding SAS Indexes 643

The BUFSIZE= data set option (or system option) sets the permanent page size for a

data file when it is created. The page size is the amount of data that can be transferred

for an I/O operation to one buffer. The BUFNO= data set option (or system option)

specifies how many buffers to allocate for a data file and for the overall system for a

given execution of SAS. That is, BUFNO= is not stored as a data set attribute.

Buffer Requirements

In addition to the resources that are used to create and maintain an index, SAS also

requires additional memory for buffers when an index is actually used. Opening the data

file opens the index file but none of the indexes. The buffers are not required unless SAS

uses the index but they must be allocated in preparation for the index that is being used.

The number of buffers that are allocated depends on the number of levels in the index

tree and in the data file open mode. If the data file is open for input, the maximum

number of buffers is three; for update, the maximum number is four. (Note that these

buffers are available for other uses; they are not dedicated to indexes.)

The IBUFSIZE= system option specifies the page size on disk for an index file when it

is created. The default setting causes SAS to use the minimum optimal page size for the

operating environment. Typically, you do not need to specify an index page size.

However, there are situations that could require a different page size. For more

information, see the “IBUFSIZE= System Option” in SAS System Options: Reference.

The IBUFNO= system option specifies an optional number of extra buffers to be

allocated when navigating an index file. SAS automatically allocates a minimal number

of buffers. Typically, you do not need to specify extra buffers. However, using

IBUFNO= to specify extra buffers could improve execution time by limiting the number

of input/output operations that are required for a particular index file. The improvement

in execution time, however, comes at the expense of increased memory consumption.

For more information, see the “IBUFNO= System Option” in SAS System Options:

Reference.

Disk Space Requirements

Additional disk space is required to store the index file. This file might show up as a

separate file or appear to be part of the data file, depending on the operating

environment.

For information about the index file size, issue the CONTENTS procedure (or the

CONTENTS statement in the DATASETS procedure). Note that the available

information from PROC CONTENTS depends on the operating environment.

Guidelines for Creating Indexes

Data File Considerations

• For a small data file, sequential processing is often just as efficient as index

processing. Do not create an index if the data file page count is less than three pages.

It would be faster to access the data sequentially. To see how many pages are in a

data file, use the CONTENTS procedure (or use the CONTENTS statement in the

DATASETS procedure). Note that the information that is available from PROC

CONTENTS depends on the operating environment.

• Consider the cost of an index for a data file that is frequently changed. If you have a

data file that changes often, the overhead associated with updating the index after

each change can outweigh the processing advantages you gain from accessing the

data with an index.

644 Chapter 26 • SAS Data Files

• Create an index when you intend to retrieve a small subset of observations from a

large data file (for example, less than 25% of all observations). When this occurs, the

cost of processing data file pages is lower than the overhead of sequentially reading

the entire data file. The smaller the subset, the larger the performance gains.

• To reduce the number of I/Os performed when you create an index, first sort the data

by the key variable. Then to improve performance, maintain the data file in sorted

order by the key variable. This technique reduces the I/Os by grouping like values

together. That is, the more ordered the data file is with respect to the key variable, the

more efficient the use of the index. If the data file has more than one index, sort the

data by the most frequently used key variable.

• An index might not be necessary to optimize a WHERE expression if the data is

sorted appropriately in order to satisfy the condition. To process a WHERE

expression without an index, SAS first checks for the sort indicator that is stored

with the file from a previous SORT procedure. If the sort indicator is appropriate,

SAS stops reading the file once there are no more values that satisfy the WHERE

expression. For example, consider a file that is sorted by Age, without an index. To

process the expression where age le 25, SAS stops reading observations after it

finds an observation that is greater than 25. Note that while SAS can determine when

to stop reading observations, if there is no index, there is no indication where to

begin. Without an index, SAS always begins with the first observation, which can

require reading a lot of observations.

Index Use Considerations

• Keep the number of indexes per data file to a minimum to reduce disk storage and to

reduce update costs.

• Consider how often your applications use an index. An index must be used often in

order to make up for the resources that are used in creating and maintaining it. That

is, do not rely solely on resource savings from processing a WHERE expression.

Take into consideration the resources that it takes to actually create the index and to

maintain it every time the data file is changed.

• When you create an index to process a WHERE expression, do not try to create one

index that is used to satisfy all queries. If there are several variables that appear in

queries, those queries might be best satisfied with simple indexes on the most

discriminating of those variables.

Key Variable Candidates

In most cases, multiple variables are used to query a data file. However, it probably

would be a mistake to index all variables in a data file, as certain variables are better

candidates than others:

• The variables to be indexed should be variables that are used in queries. That is, your

application should require selecting small subsets from a large file, and the most

common selection variables should be considered as candidate key variables.

• A variable is a good candidate for indexing when the variable can be used to

precisely identify the observations that satisfy a WHERE expression. That is, the

variable should be discriminating, which means that the index should select the

fewest possible observations. For example, variables such as Age, FirstName, and

Gender are not discriminating because it is possible for a large representation of the

data to have the same age, first name, and gender. However, a variable such as

LastName is a good choice because it is less likely that many employees share the

same last name.

For example, consider a data file with variables LastName and Gender.

Understanding SAS Indexes 645

• If many queries against the data file include LastName, then indexing LastName

could prove to be beneficial because the values are usually discriminating.

However, the same reasoning would not apply if you issued a large number of

queries that included Gender. The Gender variable is not discriminating (because

perhaps half the population is male and half is female).

• However, if queries against the data file most often include both LastName and

Gender as shown in the following WHERE expression, then creating a composite

index on LastName and Gender could improve performance.

where lastname='LeVoux' and gender='F';

Note that when you create a composite index, the first key variable should be the

most discriminating.

Creating an Index

Overview of Creating Indexes

You can create one index for a data file, which can be either a simple index or a

composite index, and you can create multiple indexes, which can be multiple simple

indexes, multiple composite indexes, or a combination of both simple and composite.

1. You request to create an index for one or multiple variables using a method such as

the INDEX CREATE statement in the DATASETS procedure.

2. SAS reads the data file one observation at a time, extracts values and RIDs for each

key variable, and places them in the index file.

SAS ensures that the values that are placed in the index are successively the same or

increasing. SAS determines whether the data is already sorted by the key variables in

ascending order. It determines this by checking the sort indicator in the data file, which

is an attribute of the file that indicates how the data is sorted. The sort indicator is stored

with the SAS data file descriptor information and is set from a previous SORT procedure

or SORTEDBY= data set option.

If the values in the sort indicator are in ascending order, SAS does not sort the values for

the index file and avoids the resource. Note that SAS always validates that the data is

sorted as indicated. If not, the index is not created. For example, if the sort indicator was

set from a SORTEDBY= data set option and the data is not sorted as indicated, an error

occurs. A message is written to the SAS log stating that the index was not created

because values are not sorted in ascending order.

If the values in the sort indicator are not in ascending order, SAS sorts the data that is

included in the index file in ascending value order. To sort the data, SAS follows this

procedure:

1. SAS first attempts to sort the data using the thread-enabled sort. By dividing the

sorting into separately executable processes, the time to sort the data can be reduced.

To use the thread-enabled sort, the index must be sufficiently large (which is

determined by SAS), the SAS system option CPUCOUNT= must be set to more than

one processor, and the THREADS system option must be enabled. Adequate

memory must be available for the thread-enabled sort. If not enough memory is

available, SAS reduces the number of threads to one and begins the sort process

again, which increases the time to create the index.

2. If the thread-enabled sort cannot be done, SAS uses the unthreaded sort.

646 Chapter 26 • SAS Data Files

Note: To display messages regarding what type of sort is used, memory and resource

information, and the status of the index being created, set the SAS system option

MSGLEVEL=I; that is:

options msglevel=i;

Using the DATASETS Procedure

The DATASETS procedure provides statements that enable you to create and delete

indexes. In the following example, the MODIFY statement identifies the data file, the

INDEX DELETE statement deletes two indexes, and the two INDEX CREATE

statements specify the variables to index, with the first INDEX CREATE statement

specifying the options UNIQUE and NOMISS:

proc datasets library=mylib;

modify employee;

index delete salary age;

index create empnum / unique nomiss;

index create names=(lastname firstname);

Note: If you delete and create indexes in the same step, place the INDEX DELETE

statement before the INDEX CREATE statement so that space occupied by deleted

indexes can be reused during index creation.

Using the INDEX= Data Set Option

To create indexes in a DATA step when you create the data file, use the INDEX= data set

option. The INDEX= data set option also enables you to include the NOMISS and

UNIQUE options. The following example creates a simple index on the variable Stock

and specifies UNIQUE:

data finances(index=(stock /unique));

The next example uses the variables SSN, City, and State to create a simple index named

SSN and a composite index named CitySt:

data employee(index=(ssn cityst=(city state)));

Using the SQL Procedure

The SQL procedure supports index creation and deletion and the UNIQUE option. Note

that the variable list requires that variable names be separated by commas (which is an

SQL convention) instead of blanks (which is a SAS convention).

The DROP INDEX statement deletes indexes. The CREATE INDEX statement specifies

the UNIQUE option, the name of the index, the target data file, and the variable or

variables to be indexed. For example:

drop index salary from employee;

create unique index empnum on employee (empnum);

create index names on employee (lastname, firstname);

Using Other SAS Products

You can also create and delete indexes using other SAS utilities and products, such as

SAS/CONNECT software, SAS/IML software, SAS Component Language, and

SAS/Warehouse Administrator.

Understanding SAS Indexes 647

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Understanding SAS Indexes (2/5)

Create new playlist

Sign In

Sign Up

Table of Contents for
Understanding SAS Indexes (2/5)