Chapter 12: Tuning for Performance and Efficiency

Introduction. 391

Understanding Performance Tuning. 391

Sorting and Performance. 392

User-Specified Sorting (SORTPGM= System Options) 392

Automatic Sorting. 393

Grouping and Performance. 393

Splitting Tables. 393

Indexes and Performance. 394

Reviewing CONTENTS Output and System Messages. 395

Optimizing WHERE Clause Processing with Indexes. 398

Constructing Efficient Logic Conditions. 398

Avoiding UNIONs. 401

Summary. 404

Introduction

A book on PROC SQL would not be complete without some discussion of query optimization and performance. Enabling a query to run efficiently involves writing code that can take advantage of the PROC SQL query optimizer. Because PROC SQL is designed to handle a variety of processes while accommodating small to large database environments, this chapter presents a number of query tuning strategies, techniques, and options to help you to write more efficient PROC SQL code. In this chapter, you will find tips and suggestions to help identify areas where a query’s inefficiencies might exist and to conduct the tuning process to achieve the best performance possible.

Understanding Performance Tuning

Performance tuning is the process of improving the way a program operates. It involves taking a program and seeing what can be done to improve performance in an intelligent, controlled manner. As you might imagine, a tuned program is one that gets the most from the existing hardware and software environment.

Performance tuning involves measuring, evaluating, and modifying a program until it uses the minimum amount of computer resources to complete its execution. The biggest problem with the tuning process is that it is sometimes difficult to determine the amount of computer resources that a program uses. Complicating matters further, adequate and complete information about resource utilization is often unavailable. In fact, no simple formula exists to determine how efficiently a program runs. Often the only way to assess whether a program is running efficiently is to evaluate its performance under varying conditions, such as during interactive use or during shortages of specific resources including memory and storage.

Performance issues might be difficult to identify. It is possible to have a program that operates without any apparent problem, but does not perform as efficiently as it could. In fact, a program might perform well in one environment and poorly in another. Take, for example, an organization that has a shortage of Direct Access Storage Device (DASD). A program that uses excessive amounts of this resource might be deemed a poor performer under these circumstances. But if the same program were run in an environment that had adequate levels of DASD, it might not be suspected or tagged as a poor performer. This distinction demonstrates the subjectivity that is frequently used to determine how a program performs and how it is linked to the specific needs (related to resource issues) that an organization has at any point in time.

Sorting and Performance

Sorting data in the SQL procedure, as in other parts of SAS, involves CPU and memory-intensive operations. When the table size is large, and CPU and/or memory resources are in short supply or simply not available, the number of sorts in programs must be minimized. You can minimize problems by understanding your query’s requirements and remembering a few simple guidelines.

Performance bottlenecks can occur if sorts are performed on disk as opposed to in memory because processing on disk is typically slower than processing in memory. The most logical and efficient place to perform a sort is in memory. If the sort requires more space than can fit in available memory, the sort must be performed on disk. The objective is to determine how much space a sort will require as well as where the sort will be performed before the sort is executed.

The performance of a sort is most influenced by the number of rows selected in a query. To reduce the row count, select specific parts of a table for processing rather than selecting the entire table. Another performance consideration is to reduce the number of columns that are specified in an ORDER BY clause. The technique of concatenating two or more columns as opposed to specifying each column individually can be used. Finally, sort performance can be influenced by the size of the columns that are specified in an ORDER BY clause. To reduce a column’s length, built-in functions such as SUBSTR and TRIM can be used.

User-Specified Sorting (SORTPGM= System Options)

You can control what sort utility SAS uses when performing sorts. By specifying the SORTPGM= system option, you can direct SAS to use the best possible sort utility for the environment in question. The SORTPGM= system options are displayed in Table 12.1.

Table 12.1: SORTPGM= System Options

Sort Option

Purpose

BEST

The BEST option uses the sort utility that is best suited to the data.

HOST

The HOST option tells SAS to use the host sort utility that is available on your host computer. This option might be the most efficient for large tables that contain many rows of data.

SAS

The SAS option tells SAS to use the sort utility that is supplied with SAS.

The next example illustrates how to use the SORTPGM= option to select the sort utility that is most suited to the data. Both options use the name that is specified in the SORTNAME= option.

OPTIONS SORTPGM=BEST;

 

<or>

 

OPTIONS SORTPGM=HOST;

Automatic Sorting

Using the SELECT DISTINCT clause invokes an internal sort to remove duplicate rows. The single exception is when an index exists. If an index exists, then the index is used to eliminate the duplicate rows.

The results of a grouped query are automatically sorted using the grouping columns. When the SELECT clause contains only the columns that are listed in the GROUP BY clause along with any summary functions, then the duplicates in each group based on the grouping columns are removed as soon as any defined summary functions are performed. If additional columns then appear in the SELECT clause, the rows are not collapsed and therefore duplicates are not removed.

Grouping and Performance

As with the ORDER BY clause, processes such as the GROUP BY clause can also impact the speed of a query. Because the GROUP BY clause can trigger sort processing, much of what was discussed with the ORDER BY clause also applies here. In addition, it is recommended that you avoid grouping redundant columns by keeping the number of grouping columns to a minimum.

Splitting Tables

Splitting tables involves moving some of the rows from one table to another table. Data is split for the purpose of separating some predetermined range of data, such as historical data from current data, so that query performance is improved. This reduces the burden imposed on queries that only access current data. The next example shows the current year’s data being copied and then removed from a table that contains five years of data.

SQL Code

PROC SQL;

  CREATE TABLE INVENTORY_CURRENT AS

    SELECT *

      FROM INVENTORY

        WHERE YEAR(ORDDATE) = YEAR(TODAY());

  DELETE FROM INVENTORY

    WHERE YEAR(ORDDATE) = YEAR(TODAY());

QUIT;

Indexes and Performance

Indexes can be used to allow rapid access to table rows. Rather than physically sorting a table (as performed by the ORDER BY clause or PROC SORT), an index is designed to set up a logical arrangement for the data without the need to physically sort it. This has the advantage of reducing CPU, I/O, and memory requirements. It also reduces data access time when using WHERE clause processing (which is discussed in the “Optimizing WHERE Clause Processing with Indexes” section).

Indexes are useful, but they do have drawbacks. As data in a table is inserted, modified, or deleted, an index must be updated to address the changes. This automatic feature requires additional CPU resources to process any changes to a table. Also, as a separate structure in its own right, an index can consume considerable storage space. As a consequence, care should be exercised not to create too many indexes but to assign indexes to the most discriminating variables in a table. Here are a few suggestions for creating indexes:

         Sort data in ascending order on the key column prior to creating the index.

         Sort data by the key variable first to achieve the greatest performance improvement.

         Sort data in ascending order by the key variable before it is appended to the table.

         Create simple indexes, when possible, to be used by most queries.

         Avoid creating one single index for all queries.

         Assign indexes to the most discriminating of variables (see Table 6.1, Rules of Cardinality, in Chapter 6).

         Select columns that are frequently the subject of summary functions (COUNT, SUM, AVG, MIN, MAX, etc.).

         Only create indexes that are actually needed.

         Avoid taxing CPU resources that are associated with index maintenance (maintaining an index during inserts, modifications, and deletions) by selecting columns that do not change frequently.

         On some operating systems, indexes are stored as a separate file on disk, which uses additional memory and disk space to store the structure.

         To avoid excessive and unnecessary I/O operations, prior to creating an index, sort data in ascending order by the most discriminating key column.

         Attempt to define composite indexes by using the most discriminating of the variables as your first variable in the index.

         Select columns that do not have numerous null values because this results in a large percentage of rows with the same value.

Note:  Indexes should only be created on tables where query search time needs to be optimized. Any unnecessary indexes might force SAS to expend resources needlessly—updating and reorganizing after insert, update, and delete operations are performed. And even worse, the SQL optimizer might accidentally use an index when it should not.

Note:  When creating an index is impractical or not feasible, consider storing the data in ascending (or descending) order depending on whether the query needs to access data from the first or second half of the table. This often results in a more efficient query because once a WHERE-clause expression is satisfied, the query stops processing.

Reviewing CONTENTS Output and System Messages

While no two organizations are alike, it is not surprising to find numerous causes for a program to run at less than peak efficiency. Performance is frequently affected by the specific needs of an organization or its lack of resources. SAS users need to learn as many techniques as possible to correct problems that are associated with poorly performing queries. Attention should be given to individual program functions, because poor query performance often points to one or more inefficient techniques being used.

Two methods can be used to better understand potential performance issues. The first approach uses PROC CONTENTS output to examine engine/host information and library data sets (tables). The output provides information to determine whether a table is large enough. (The page count in the following output shows the performance improvements offered by an index). The general rule that the SQL processor adheres to is that when a table is relatively small (usually fewer than three pages), there is no real advantage to using an index. In fact, using an index with a small table can actually degrade performance levels because in these situations sequential processing would be just as fast as using an index.

Results

image shown here

The second approach uses PROC SQL to access the dictionary tables, TABLES and COLUMNS, to determine whether a table is large enough to take advantage of the performance improvements that are offered by an index. See the output below.

SQL Code

PROC SQL;

  SELECT MEMNAME, NPAGE

    FROM DICTIONARY.TABLES

      WHERE LIBNAME=’WORK’ AND

        MEMNAME=’INVENTORY’;

  SELECT VARNUM, NAME, TYPE, LENGTH, FORMAT,

         INFORMAT, LABEL

    FROM DICTIONARY.COLUMNS

      WHERE LIBNAME=’WORK’ AND

        MEMNAME=’INVENTORY’;

QUIT;

Results

image shown here

Table 12.2 compares sequential table access with indexed table access. Although performance gains are data dependent, the greatest gains are realized when an index is applied to a small subset of data in a WHERE clause.

Table 12.2: Sequential versus Indexed Table Access

Condition

Sequential

Index

Page count < 3 pages (from CONTENTS output)

Yes

No

Small table

Yes

No

Frequent updates to table

Yes

No

Large subset of data based on WHERE processing

Yes

No

Infrequent access of table

Yes

No

Limited memory and disk space

Yes

No

Small subset of data (1%–25% of population)

No

Yes

System messages are displayed to provide information that can help tune the indexes that are associated with any data sets. Setting the MSGLEVEL= system option to “I” allows SAS to display vital information (if available) that is related to the presence of one or more indexes for optimization of WHERE clause processing. With the MSGLEVEL= option turned on, the SAS log shows that the simple index INVENQTY was selected in the optimization of WHERE clause processing.

 

SAS Log

      PROC SQL;

        CREATE INDEX INVENQTY ON INVENTORY;

 NOTE: Simple index invenqty has been defined.

 NOTE: PROCEDURE SQL used:

       real time           0.04 seconds

 

        SELECT *

          FROM INVENTORY

            WHERE invenqty < 3;

 INFO: Index invenqty selected for WHERE clause
 optimization.

      QUIT;

 NOTE: PROCEDURE SQL used:

       real time           0.65 seconds

Optimizing WHERE Clause Processing with Indexes

To get the best possible performance from programs that contain SQL procedure code, an index and WHERE clause can be used together (see the list below). Using a WHERE clause restricts processing in a table to a subset of selected rows (for more information, see Chapter 2, “Working with Data in PROC SQL”). When an index exists, the SQL processor determines whether to take advantage of it during WHERE clause processing. Although the SQL optimizer determines whether using an index will ultimately benefit performance, when it does use an index the result can be an improvement in processing speeds.

         Comparison operators such as EQ (=), LT (<), GT (>), LE (<=), GE (>=), and NOT

         Comparison operators with the IN operator

         Comparison operators with the colon modifier (for example, NOT = :“Ab”)

         CONTAINS operator

         IS NULL or IS MISSING operator

         Pattern-matching operators such as LIKE and NOT LIKE

Constructing Efficient Logic Conditions

Constructing efficient logic conditions has a direct effect on processing costs. Because the SQL optimizer evaluates a series of AND expressions from left to right, a chain of AND conditions in a WHERE clause should be specified with the most restrictive expression first. Specified this way, fewer resources are expended by bypassing rows that do not satisfy the first conditional value in the WHERE clause. For example, the SQL query below might expend more resources and run slower because the first condition, “SOFTWARE”, occurs even when products cost more than $99.00.

SQL Code (Less Efficient)

 PROC SQL;

   SELECT *

     FROM PRODUCTS

       WHERE UPCASE(PRODTYPE) = ‘SOFTWARE’ AND

             PRODCOST < 100.00;

 QUIT;

For this data, a more efficient way to produce the same results as the previous example, while reducing CPU resources, is to code the least likely condition first so that it appears as shown here.

SQL Code (More Efficient)

 PROC SQL;

   SELECT *

     FROM PRODUCTS

       WHERE PRODCOST < 100.00 AND

             UPCASE(PRODTYPE) = ‘SOFTWARE’;

 QUIT;

Another popular construct uses a series of OR condition equality tests or the IN predicate to select rows that match the multiple conditions. Programmers often specify these kinds of lists in order of magnitude or alphabetically to make the lists easier to read and/or maintain. A better and more efficient way is to construct a list of constants in the order of the most frequently occurring value to the least frequently occurring value.

Note:  One way to determine the frequency of product type (PRODTYPE) values is to submit the following code:

 PROC SQL;    

   SELECT PRODTYPE, COUNT(PRODTYPE) AS Product_Frequency

       FROM PRODUCTS

         GROUP BY PRODTYPE;

 QUIT;   

Results

image shown here

With the frequencies determined, the list of conditions can be specified in that order. In this way, fewer resources are spent locating frequently occurring values because it has to perform fewer steps to return a value of TRUE. It is also important to remove duplicate values in the constant list because the first identified duplicate value will automatically return a value of TRUE, which results in the second occurrence of the duplicate value being ignored.

The next query illustrates logic conditions that could require more processing resources because the first condition “LAPTOP” occurs less frequently than the value “SOFTWARE”. Consequently, the SQL processor needs to expend more resources to process the second condition in order to find a match of TRUE being returned.

SQL Code (Less Efficient)

 PROC SQL STIMER;

   SELECT *

     FROM PRODUCTS

       WHERE UPCASE(PRODTYPE) IN (‘LAPTOP’, ‘SOFTWARE’);

 QUIT;

SAS Log Results

PROC SQL STIMER;

NOTE: SQL Statement used (Total process time):

      real time           0.00 seconds

      cpu time            0.01 seconds

 

  SELECT *

    FROM PRODUCTS

      WHERE UPCASE(PRODTYPE) IN ('LAPTOP', 'SOFTWARE');

NOTE: SQL Statement used (Total process time):

      real time           0.08 seconds

      cpu time            0.04 seconds

A more efficient way to process the same data while generating the same results would be to specify the condition, “SOFTWARE”, first as follows:

SQL Code (More Efficient)

 PROC SQL STIMER;

   SELECT *

     FROM PRODUCTS

       WHERE UPCASE(PRODTYPE) IN (‘SOFTWARE’, ‘LAPTOP’);

 QUIT;

SAS Log Results

PROC SQL STIMER;

  SELECT *

    FROM PRODUCTS

      WHERE UPCASE(PRODTYPE) IN ('SOFTWARE', 'LAPTOP');

NOTE: SQL Statement used (Total process time):

      real time           0.02 seconds

      cpu time            0.03 seconds

 

QUIT;

 

Avoiding UNIONs

UNIONs are executed by creating two internal sets, then merge-sorting the results together. Duplicate rows are automatically eliminated from the final results. The Venn diagram for a UNION of two sets represents all distinct elements in the collection, as illustrated in Figure 12.1.

Figure 12.1: Venn Diagram

Figure 12.1: Venn Diagram

For example, the SQL procedure code that represents a UNION set first constructs the two result sets from each query, merges and sorts the two sets together, and then eliminates duplicate rows from the final results.

SQL Code (Less Efficient)

 OPTIONS FULLSTIMER;

 PROC SQL;

   SELECT *

     FROM PRODUCTS

       WHERE UPCASE(PRODTYPE) = ‘LAPTOP’

   UNION

   SELECT *

     FROM PRODUCTS

       WHERE UPCASE(PRODTYPE) = ‘SOFTWARE’;

 QUIT;

SAS Log Results

OPTIONS FULLSTIMER;

PROC SQL;

  SELECT *

    FROM PRODUCTS

      WHERE UPCASE(PRODTYPE) = 'LAPTOP'

 

  UNION

 

 

  SELECT *

    FROM PRODUCTS

      WHERE UPCASE(PRODTYPE) = 'SOFTWARE';

QUIT;

NOTE: PROCEDURE SQL used (Total process time):

      real time           0.16 seconds

      user cpu time       0.00 seconds

      system cpu time     0.04 seconds

      memory              634.09k

      OS Memory           6768.00k

      Timestamp           06/03/2012 04:59:57 PM

To improve UNION performance, SQL procedure code can be converted to a single query using OR conditions in a WHERE clause. The next example illustrates the previous SQL procedure code being made more efficient by converting the UNION to a single query using an OR operator in a WHERE clause.

SQL Code (Efficient)

 OPTIONS FULLSTIMER;

 PROC SQL;

   SELECT DISTINCT *

     FROM PRODUCTS

       WHERE UPCASE(PRODTYPE) = ‘SOFTWARE’ OR

             UPCASE(PRODTYPE) = ‘LAPTOP’;

 QUIT;

SAS Log Results

OPTIONS FULLSTIMER;

PROC SQL;

  SELECT DISTINCT *

    FROM PRODUCTS

      WHERE UPCASE(PRODTYPE) = 'SOFTWARE' OR

            UPCASE(PRODTYPE) = 'LAPTOP';

QUIT;

NOTE: PROCEDURE SQL used (Total process time):

      real time           0.10 seconds

      user cpu time       0.04 seconds

      system cpu time     0.01 seconds

      memory              565.60k

      OS Memory           6768.00k

      Timestamp           06/03/2012 05:06:55 PM

The next example illustrates the previous SQL procedure code being made more efficient by converting the UNION to a single query using an IN predicate in a WHERE clause.

SQL Code (Efficient)

 OPTIONS FULLSTIMER;

 PROC SQL;

   SELECT DISTINCT *

     FROM PRODUCTS

 

       WHERE UPCASE(PRODTYPE) IN (‘SOFTWARE’, ‘LAPTOP’);

 QUIT;

SAS Log Results

OPTIONS FULLSTIMER;

PROC SQL;

  SELECT DISTINCT *

    FROM PRODUCTS

      WHERE UPCASE(PRODTYPE) IN ('SOFTWARE', 'LAPTOP');

QUIT;

NOTE: PROCEDURE SQL used (Total process time):

      real time           0.07 seconds

      user cpu time       0.01 seconds

      system cpu time     0.04 seconds

      memory              561.41k

      OS Memory           6768.00k

      Timestamp           06/03/2012 05:12:56 PM

Another approach that can improve the way a query with a SET operator performs is to specify the ALL keyword, as long as duplicates are not an issue or the rows in the table are all unique. Because the ALL keyword prevents SQL from processing the data twice and does not remove duplicate rows, CPU resources might be improved. The next example shows the UNION ALL coding construct being used to perform what amounts to an append operation, thereby bypassing the sort altogether because the duplicate rows are not removed.

SQL Code (Less Efficient)

 OPTIONS FULLSTIMER;

 PROC SQL;

   SELECT *

     FROM PRODUCTS

       WHERE UPCASE(PRODTYPE) = ‘LAPTOP’

   UNION ALL

   SELECT *

     FROM PRODUCTS

       WHERE UPCASE(PRODTYPE) = ‘SOFTWARE’;

 QUIT;

 

SAS Log Results

OPTIONS FULLSTIMER;

PROC SQL;

  SELECT *

    FROM PRODUCTS

      WHERE UPCASE(PRODTYPE) = 'LAPTOP'

  UNION ALL

  SELECT *

    FROM PRODUCTS

      WHERE UPCASE(PRODTYPE) = 'SOFTWARE';

QUIT;

NOTE: PROCEDURE SQL used (Total process time):

      real time           0.06 seconds

      user cpu time       0.00 seconds

      system cpu time     0.04 seconds

      memory              226.42k

      OS Memory           6768.00k

      Timestamp           06/03/2012 05:22:14 PM

Summary

1.       Performance tuning involves measuring, evaluating, and modifying a query’s execution to achieve an optimal balance between competing computer resources (see the “Understanding Performance Tuning” section).

2.       Avoid specifying an ORDER BY clause when creating a table or view (see the “Sorting and Performance” section).

3.       When sorting is necessary, specify the SORTPGM= system option to instruct the SAS System to use the best possible sort utility relative to the size of the database environment (see the “User-Specified Sorting” section).

4.       It is recommended to keep the number of grouping columns with the GROUP BY clause as small as possible (see the “Splitting Tables” section).

5.       Care should be exercised to assign indexes to only those discriminating variables in a table and to avoid creating too many indexes (see the “Indexes and Performance” section).

6.       There is no advantage in creating or using an index when a table is relatively small (usually fewer than three pages) (see the “Reviewing CONTENTS Output and System Messages” section).

7.       Setting the MSGLEVEL= system option to “I” allows SAS to display vital information (if available) relative to the presence of one or more indexes for optimization of WHERE clause processing (see the “Reviewing CONTENTS Output and System Messages” section).

8.       Apply WHERE clause processing to restrict the number of rows of the result table (see the “Optimizing WHERE Clause Processing with Indexes” section).

9.       When constructing a chain of AND conditions in a WHERE clause, specify the most restrictive conditional values first (see the “Constructing Efficient Logic Conditions” section).

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
13.58.120.57