using, and maintaining an index. The following topics provide information about
resource usage and give you some guidelines for creating indexes.
CPU Cost
Additional CPU time is necessary to create an index as well as to maintain the index
when the data file is modified. That is, for an indexed data file, when a value is added,
deleted, or modified, it must also be added, deleted, or modified in the appropriate
index(es).
When SAS uses an index to read an observation from a data file, there is also increased
CPU usage. The increased usage results from SAS using a more complicated process
than is used when SAS retrieves data sequentially. Although CPU usage is greater, you
benefit from SAS reading only those observations that meet the conditions. Note that
increased CPU usage is why using an index is more expensive when there is a larger
number of observations that meet the conditions.
Note: To compare CPU usage with and without an index, for some operating
environments, you can issue the STIMER or FULLSTIMER system options in order
to write performance statistics to the SAS log.
I/O Cost
Using an index to read observations from a data file can increase the number of I/O
(input/output) requests compared to reading the data file sequentially. For example,
processing a BY statement with an index might increase I/O count, but you save in not
having to issue the SORT procedure. For WHERE processing, SAS considers I/O count
when deciding whether to use an index.
1. SAS does a binary search on the index file and positions the index to the first entry
that contains a qualified value.
2. SAS uses the value's RID (identifier) to directly access the observation containing
the value. SAS transfers the observation between external storage to a buffer, which
is the memory into which data is read or from which data is written. The data is
transferred in pages, which is the amount of data (the number of observations) that
can be transferred for one I/O request; each data file has a specified page size.
3. SAS then continues the process until the WHERE expression is satisfied. Each time
SAS accesses an observation, the data file page containing the observation must be
read into memory if it is not already there. Therefore, if the observations are on
multiple data file pages, an I/O operation is performed for each observation.
The result is that the more random the data, the more I/Os are required to use the index.
If the data is ordered more like the index, which is in ascending value order, a smaller
number of I/Os are required to access the data.
The number of buffers determines how many pages of data can simultaneously be in
memory. Frequently, the larger the number of buffers, the smaller the number of I/Os
that are required. For example, if the page size is 4096 bytes and one buffer is allocated,
then one I/O transfers 4096 bytes of data (or one page). To reduce I/Os, you can increase
the page size but you need a larger buffer. To reduce the buffer size, you can decrease the
page size but you use more I/Os.
For information about data file characteristics like the data file page size and the number
of data file pages, issue the CONTENTS procedure (or use the CONTENTS statement in
the DATASETS procedure). With this information, you can determine the data file page
size and experiment with different sizes. Note that the information that is available from
PROC CONTENTS depends on the operating environment.
Understanding SAS Indexes 643
The BUFSIZE= data set option (or system option) sets the permanent page size for a
data file when it is created. The page size is the amount of data that can be transferred
for an I/O operation to one buffer. The BUFNO= data set option (or system option)
specifies how many buffers to allocate for a data file and for the overall system for a
given execution of SAS. That is, BUFNO= is not stored as a data set attribute.
Buffer Requirements
In addition to the resources that are used to create and maintain an index, SAS also
requires additional memory for buffers when an index is actually used. Opening the data
file opens the index file but none of the indexes. The buffers are not required unless SAS
uses the index but they must be allocated in preparation for the index that is being used.
The number of buffers that are allocated depends on the number of levels in the index
tree and in the data file open mode. If the data file is open for input, the maximum
number of buffers is three; for update, the maximum number is four. (Note that these
buffers are available for other uses; they are not dedicated to indexes.)
The IBUFSIZE= system option specifies the page size on disk for an index file when it
is created. The default setting causes SAS to use the minimum optimal page size for the
operating environment. Typically, you do not need to specify an index page size.
However, there are situations that could require a different page size. For more
information, see the “IBUFSIZE= System Option” in SAS System Options: Reference.
The IBUFNO= system option specifies an optional number of extra buffers to be
allocated when navigating an index file. SAS automatically allocates a minimal number
of buffers. Typically, you do not need to specify extra buffers. However, using
IBUFNO= to specify extra buffers could improve execution time by limiting the number
of input/output operations that are required for a particular index file. The improvement
in execution time, however, comes at the expense of increased memory consumption.
For more information, see the “IBUFNO= System Option” in SAS System Options:
Reference.
Disk Space Requirements
Additional disk space is required to store the index file. This file might show up as a
separate file or appear to be part of the data file, depending on the operating
environment.
For information about the index file size, issue the CONTENTS procedure (or the
CONTENTS statement in the DATASETS procedure). Note that the available
information from PROC CONTENTS depends on the operating environment.
Guidelines for Creating Indexes
Data File Considerations
For a small data file, sequential processing is often just as efficient as index
processing. Do not create an index if the data file page count is less than three pages.
It would be faster to access the data sequentially. To see how many pages are in a
data file, use the CONTENTS procedure (or use the CONTENTS statement in the
DATASETS procedure). Note that the information that is available from PROC
CONTENTS depends on the operating environment.
Consider the cost of an index for a data file that is frequently changed. If you have a
data file that changes often, the overhead associated with updating the index after
each change can outweigh the processing advantages you gain from accessing the
data with an index.
644 Chapter 26 SAS Data Files
Create an index when you intend to retrieve a small subset of observations from a
large data file (for example, less than 25% of all observations). When this occurs, the
cost of processing data file pages is lower than the overhead of sequentially reading
the entire data file. The smaller the subset, the larger the performance gains.
To reduce the number of I/Os performed when you create an index, first sort the data
by the key variable. Then to improve performance, maintain the data file in sorted
order by the key variable. This technique reduces the I/Os by grouping like values
together. That is, the more ordered the data file is with respect to the key variable, the
more efficient the use of the index. If the data file has more than one index, sort the
data by the most frequently used key variable.
An index might not be necessary to optimize a WHERE expression if the data is
sorted appropriately in order to satisfy the condition. To process a WHERE
expression without an index, SAS first checks for the sort indicator that is stored
with the file from a previous SORT procedure. If the sort indicator is appropriate,
SAS stops reading the file once there are no more values that satisfy the WHERE
expression. For example, consider a file that is sorted by Age, without an index. To
process the expression where age le 25, SAS stops reading observations after it
finds an observation that is greater than 25. Note that while SAS can determine when
to stop reading observations, if there is no index, there is no indication where to
begin. Without an index, SAS always begins with the first observation, which can
require reading a lot of observations.
Index Use Considerations
Keep the number of indexes per data file to a minimum to reduce disk storage and to
reduce update costs.
Consider how often your applications use an index. An index must be used often in
order to make up for the resources that are used in creating and maintaining it. That
is, do not rely solely on resource savings from processing a WHERE expression.
Take into consideration the resources that it takes to actually create the index and to
maintain it every time the data file is changed.
When you create an index to process a WHERE expression, do not try to create one
index that is used to satisfy all queries. If there are several variables that appear in
queries, those queries might be best satisfied with simple indexes on the most
discriminating of those variables.
Key Variable Candidates
In most cases, multiple variables are used to query a data file. However, it probably
would be a mistake to index all variables in a data file, as certain variables are better
candidates than others:
The variables to be indexed should be variables that are used in queries. That is, your
application should require selecting small subsets from a large file, and the most
common selection variables should be considered as candidate key variables.
A variable is a good candidate for indexing when the variable can be used to
precisely identify the observations that satisfy a WHERE expression. That is, the
variable should be discriminating, which means that the index should select the
fewest possible observations. For example, variables such as Age, FirstName, and
Gender are not discriminating because it is possible for a large representation of the
data to have the same age, first name, and gender. However, a variable such as
LastName is a good choice because it is less likely that many employees share the
same last name.
For example, consider a data file with variables LastName and Gender.
Understanding SAS Indexes 645
If many queries against the data file include LastName, then indexing LastName
could prove to be beneficial because the values are usually discriminating.
However, the same reasoning would not apply if you issued a large number of
queries that included Gender. The Gender variable is not discriminating (because
perhaps half the population is male and half is female).
However, if queries against the data file most often include both LastName and
Gender as shown in the following WHERE expression, then creating a composite
index on LastName and Gender could improve performance.
where lastname='LeVoux' and gender='F';
Note that when you create a composite index, the first key variable should be the
most discriminating.
Creating an Index
Overview of Creating Indexes
You can create one index for a data file, which can be either a simple index or a
composite index, and you can create multiple indexes, which can be multiple simple
indexes, multiple composite indexes, or a combination of both simple and composite.
1. You request to create an index for one or multiple variables using a method such as
the INDEX CREATE statement in the DATASETS procedure.
2. SAS reads the data file one observation at a time, extracts values and RIDs for each
key variable, and places them in the index file.
SAS ensures that the values that are placed in the index are successively the same or
increasing. SAS determines whether the data is already sorted by the key variables in
ascending order. It determines this by checking the sort indicator in the data file, which
is an attribute of the file that indicates how the data is sorted. The sort indicator is stored
with the SAS data file descriptor information and is set from a previous SORT procedure
or SORTEDBY= data set option.
If the values in the sort indicator are in ascending order, SAS does not sort the values for
the index file and avoids the resource. Note that SAS always validates that the data is
sorted as indicated. If not, the index is not created. For example, if the sort indicator was
set from a SORTEDBY= data set option and the data is not sorted as indicated, an error
occurs. A message is written to the SAS log stating that the index was not created
because values are not sorted in ascending order.
If the values in the sort indicator are not in ascending order, SAS sorts the data that is
included in the index file in ascending value order. To sort the data, SAS follows this
procedure:
1. SAS first attempts to sort the data using the thread-enabled sort. By dividing the
sorting into separately executable processes, the time to sort the data can be reduced.
To use the thread-enabled sort, the index must be sufficiently large (which is
determined by SAS), the SAS system option CPUCOUNT= must be set to more than
one processor, and the THREADS system option must be enabled. Adequate
memory must be available for the thread-enabled sort. If not enough memory is
available, SAS reduces the number of threads to one and begins the sort process
again, which increases the time to create the index.
2. If the thread-enabled sort cannot be done, SAS uses the unthreaded sort.
646 Chapter 26 SAS Data Files
Note: To display messages regarding what type of sort is used, memory and resource
information, and the status of the index being created, set the SAS system option
MSGLEVEL=I; that is:
options msglevel=i;
Using the DATASETS Procedure
The DATASETS procedure provides statements that enable you to create and delete
indexes. In the following example, the MODIFY statement identifies the data file, the
INDEX DELETE statement deletes two indexes, and the two INDEX CREATE
statements specify the variables to index, with the first INDEX CREATE statement
specifying the options UNIQUE and NOMISS:
proc datasets library=mylib;
modify employee;
index delete salary age;
index create empnum / unique nomiss;
index create names=(lastname firstname);
Note: If you delete and create indexes in the same step, place the INDEX DELETE
statement before the INDEX CREATE statement so that space occupied by deleted
indexes can be reused during index creation.
Using the INDEX= Data Set Option
To create indexes in a DATA step when you create the data file, use the INDEX= data set
option. The INDEX= data set option also enables you to include the NOMISS and
UNIQUE options. The following example creates a simple index on the variable Stock
and specifies UNIQUE:
data finances(index=(stock /unique));
The next example uses the variables SSN, City, and State to create a simple index named
SSN and a composite index named CitySt:
data employee(index=(ssn cityst=(city state)));
Using the SQL Procedure
The SQL procedure supports index creation and deletion and the UNIQUE option. Note
that the variable list requires that variable names be separated by commas (which is an
SQL convention) instead of blanks (which is a SAS convention).
The DROP INDEX statement deletes indexes. The CREATE INDEX statement specifies
the UNIQUE option, the name of the index, the target data file, and the variable or
variables to be indexed. For example:
drop index salary from employee;
create unique index empnum on employee (empnum);
create index names on employee (lastname, firstname);
Using Other SAS Products
You can also create and delete indexes using other SAS utilities and products, such as
SAS/CONNECT software, SAS/IML software, SAS Component Language, and
SAS/Warehouse Administrator.
Understanding SAS Indexes 647
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.137.220.92