Chapter 2. Waging War

Accessing Databases Efficiently

Il existe un petit nombre de principes fondamentaux de la guerre, dont on ne saurait s'écarter sans danger, et dont l’application au contraire a été presque en tous temps couronnée par le succès.

There exist a small number of fundamental principles of war, which it is dangerous to ignore: indeed, following these principles has almost invariably led to success.

Général Antoine-Henri de Jomini (1779-1869) Précis de l’Art de la Guerre

Anybody who has ever been involved in the switch from development to production of a critical system knows how much it can feel like the noise and tumult of battle. Very often, a few weeks before D-Day, performance tests will show that the new system is going to fall short of expectations. Experts are brought in, SQL statements are fine-tuned, and database and system administrators are called to contribute to a succession of crisis meetings. Finally, performance vaguely comparable to the previous system is obtained on hardware that is now twice as expensive as the original installation.

Tactics are often used as a substitute for a strategic approach. The latter demands the adoption of a sound overall architecture and design. As in war, the basic principles here are also few, but too often ignored. Architectural mistakes can prove extremely costly, and the SQL programmer must enter the battle fully prepared, knowing where to go and how to get there. In this chapter, we are going to review the key goals that will increase our chances of success in writing programs that access databases efficiently.

Query Identification

For centuries, the only means that a general had to check the progress of his troops during the heat of battle was to observe the position of his units as indicated by the color of the soldiers’ uniforms and the flags they were carrying. When some process in the database environment is consuming an inordinate amount of CPU, it is often possible to identify which piece of SQL code is actually running. But it is very often much more difficult, especially in a large and complicated system that includes dynamically built queries, to identify which precise part of a given application issued that statement and needs reviewing. Despite the fact that many products have good monitoring facilities, it is sometimes surprisingly difficult to relate an SQL statement to its broader environment. Therefore, you should adopt the habit of identifying your programs and critical modules whenever possible by inserting comments into your SQL to help identify where in the programs a given query is used. For instance:

    /* CUSTOMER REGISTRATION */ select blah ...

These identifying comments can be important and helpful in subsequently tracking down any erroneous code. They can also be helpful when trying to determine how much load is put on a server by a single application, especially when some localized increase in activity is expected and when you are trying to assess whether the current hardware can absorb the surge.

Some products have special registration facilities that can spare you the admittedly tedious step of commenting each and every statement. Oracle’s dbms_application_info package allows you to register a program using a 48-character module name, a 32-character action name, and a 64-character client information field. The content of those fields is left to your discretion. In an Oracle environment, you can use this package to keep track not only of which application is running, but also what that application is doing at any given time. This is because you can easily query the information that your application passes to the package through the Oracle V$ dynamic views that show what is currently happening in memory.
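For example, a registration call and the corresponding check of the dynamic views might look like the following minimal sketch (the module and action names used here are hypothetical):

    -- Register what the program is about to do (hypothetical names):
    begin
      dbms_application_info.set_module(module_name => 'CUSTOMER REGISTRATION',
                                       action_name => 'INSERT CUSTOMER');
    end;
    /

    -- What each session has registered can then be read from V$SESSION:
    select sid, module, action, client_info
    from v$session
    where module is not null;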

Important

Identifiable statements make the identification of performance issues easier.

Stable Database Connections

A new database connection can be created quickly and easily, but this ease can disguise the high cost of making repeated connections. You must manage the use of database connections with great care. The consequences of allowing multiple connections to occur, perhaps hidden within an application, can be substantial, as the next example illustrates.

Some time ago I came across an application in which numerous small files of up to an arbitrary maximum of 100 lines were being processed. Each line in these small text files contained both data and the identification of the database instance into which that data had to be loaded. In this particular case, there was a single server, but the principle being illustrated is exactly the same as if there were a hundred database instances.

The process for each file was coded as follows:

    Open the file
    Until the end of file is reached
          Read a row
          Connect to the server specified by the row
          Insert the data
          Disconnect
    Close the file

This process worked quite satisfactorily, except for the occasional circumstance in which a large number of small files would arrive in a very short space of time, and at a rate greater than the ability of the application to process them. This resulted in a substantial backlog, which took considerable time to clear.

With the help of a simple program (written in C) that emulated the existing application, I explained to the customer the performance degradation caused by frequent connections and disconnections. Table 2-1 gives the results from that demonstration.

Note

The program generating the results in Table 2-1 used a conventional insert statement. I mentioned in passing to the customer the existence of direct-loading techniques that are even faster.

Table 2-1. Result of connect/disconnect performance tests

Test                                                                  Results
Connect/disconnect for each line in turn                              7.4 lines loaded per second
Connect once, all candidate lines individually inserted               1,681 lines loaded per second
Connect once, all candidate lines inserted in arrays of 10 lines      5,914 lines loaded per second
Connect once, all candidate lines inserted in arrays of 100 lines     9,190 lines loaded per second

The demonstration showed the importance of trying to minimize the number of separate database connections that had to be made. Thus, there was an obvious and enormous advantage in applying a simple check to determine whether the “next” insert was into the same database as the previous one. The rationalization could go further, as the number of database instances was of course finite. You could likely achieve further performance gain by setting up an array of handlers, one for each specific database connection, opening a new connection each time a new database is referenced, and thus connecting at most once to each database. As Table 2-1 shows, the simple technique of connecting only once (or a very few times) improved performance by a factor of more than 200 with very little additional effort.

Of course, this was an excellent opportunity to show that minimizing the number of round-trips between a program and the database kernel, using arrays and populating them with incoming data, can also lead to spectacular improvements in performance. By inserting several rows at once, the throughput could be radically improved—by another factor of five. The results in Table 2-1 demonstrate that improvements in the process could reach a modest factor of 1,200.

Why such a dramatic improvement?

The reason for the first and biggest improvement is that a database connection is fundamentally a “heavy,” or high-resource operation.

In the familiar client/server environment (which is still very widely used), the simple connection routine hides the fact that the client program first has to establish contact with a listener program on a remote machine; and then, depending on whether shared servers are being used on this machine, the listener must either spawn another process or thread and make it run some database kernel program, or hand the request, directly or indirectly, to an existing server process.

Whatever the number of system operations involved (process spawning or thread creation, and the start of execution), your database system must also create a new environment for each session in order to keep track of what it does. Your DBMS will need to check the password provided against the encrypted password of the account for which a new session is to be created. Your DBMS may also have to execute the code for some logon trigger. It may have to execute some initialization code for stored procedures or packages the first time they are called. And all of this comes on top of the handshaking protocols between client and server processes. This is the reason techniques that keep database connections permanently open, such as connection pooling, are so important to performance.

The reason for the second improvement is that a round-trip between your program (and even a stored procedure) and the database also has its costs.

Even when you are connected and maintain a connection, context switches between your program and the DBMS kernel take their toll. Therefore if your DBMS allows you to communicate through an array interface of some kind, you should not hesitate to use it. If, as sometimes happens, the array interface is implicit (the application program interface [API] uses arrays when you use only scalar values), it is wise to check the default array size that is used and perhaps tailor it to your particular needs. And of course, any row-by-row logic suffers the same context-switch mechanisms and is a cardinal sin—as you shall have several opportunities to see throughout this chapter.
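As a minimal PL/SQL sketch of an explicit array interface (the table target_table and its column n are hypothetical), the forall statement below sends the whole array to the SQL engine in a single bulk operation instead of issuing one hundred individual inserts:

    declare
      type num_tab is table of number index by binary_integer;
      v_values  num_tab;
    begin
      -- Populate the array in memory first
      for i in 1 .. 100
      loop
        v_values(i) := i;
      end loop;
      -- One bulk operation instead of 100 single-row round-trips
      forall i in 1 .. v_values.count
        insert into target_table(n) values (v_values(i));
      commit;
    end;
    /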

Important

Database connections and round-trips are like Chinese Walls—the more you have, the longer it takes to receive the correct message.

Strategy Before Tactics

Strategy defines the tactics, not the other way round. A skillful developer doesn’t think of a process in terms of little steps, but in terms of the final result. The most efficient way to obtain that result may not be to proceed in the order specified in the business rules, but rather to follow a less obvious approach. The following example will show how paying too much attention to the procedural processes within a business can distract one’s attention from the most efficient solution.

Some years ago I was given a stored procedure to try to optimize; “try” is the operative word here. Two attempts at optimization had already been made, first by the original authors and then by a self-styled Oracle expert. Despite these efforts, the procedure was still taking 20 minutes to run, which was unacceptable to the users.

The purpose of the procedure was to compute quantities of raw materials to be ordered by a central factory unit, based on existing stocks and on orders that were coming from a number of different sources. Basically, the data from several identical tables for each data source had to be aggregated inside one master table. The procedure consisted of a succession of similar statements simplified as follows. First, all data from each distinct source table were inserted into the single master table. Second, an aggregate/update was applied to each instance of raw material in that master table. Finally, the spurious data not relevant to the aggregate result was deleted from the table. These stages were repeated in sequence inside the procedure for every distinct source table. None of the SQL statements were particularly complex, and none of them could be described as being particularly inefficient.

It took the better part of a day to understand the process, which eventually prompted the question: why was this process being done in multiple steps? A subquery in a from clause with a union operator would allow the aggregation of all the various sources. A single select statement could provide in one step the result set that had to be inserted into the target table. The difference in performance was so impressive—from 20 minutes down to 20 seconds—that it took some time to verify that the final result was indeed identical to that previously obtained.
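To make the idea concrete, here is a minimal sketch of the single-statement approach; the table and column names (orders_site1, orders_site2, raw_material_requirements, and so on) are hypothetical, not those of the actual application:

    insert into raw_material_requirements(material_id, required_quantity)
    select material_id, sum(quantity)
    from (select material_id, quantity from orders_site1
          union all
          select material_id, quantity from orders_site2
          union all
          select material_id, quantity from orders_site3) all_sources
    group by material_id;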

Extraordinary skills were not required to achieve the tremendous performance improvement just described, but merely an ability to think outside the box. Previous attempts to improve this process had really been hindered by the participants allowing themselves to get too close to the problem. One needed to take a fresh look, to stand back, and try to see the bigger picture. The key questions to ask were “What do we have when we enter this procedure?” and “Which result do we want when we return from it?” Together with some fresh thinking, the answers to those questions led to a dramatically improved process.

Important

Stand back from your problem to get the wider picture before plunging into the details of the solution.

Problem Definition Before Solution

A little knowledge can be a dangerous thing. Frequently, people have read or heard about new or unusual techniques—which in some cases can indeed be quite interesting—and then they will try to fit their problem to one of these new solutions. Ordinary developers and architects often jump quickly onto such “solutions,” which often turn out to be at the root of many subsequent problems.

At the top of the list of ready-made solutions, we usually meet denormalization. Blissfully unaware of the update nightmare that it turns out to be in practice, denormalization advocates often suggest it at an early stage in the hunt for “performance,” and in fact often at a point in the development cycle when better design (or learning how to use joins) is still an option. A particular type of denormalization, the materialized view, is also often seen as something of a panacea. (Materialized views are sometimes referred to as snapshots, a less impressive term, but one that is closer to the sad reality: copies of data at one point in time.) This is not to say that theoretically questionable techniques can never be used as a last resort. To quote Franz Kafka: “Logic is doubtless unshakable, but it cannot withstand a man who wants to go on living.”

But the immense majority of problems can be solved using fairly traditional techniques in an intelligent manner. Learn first how to get the best of simple, traditional techniques. It’s only when you can fully master them that you will be able to appreciate their limitations, and then to truly be able to judge the potential advantage (if any) of new technical solutions.

All technological solutions are merely means to an end; the great danger for the inexperienced developer is that the attractions of the latest technology become an end in themselves. And the danger is all the greater for enthusiastic, curious, and technically minded individuals!

Important

Foundations before Fashion: learn your craft before playing with the latest tools.

Stable Database Schema

The use of data definition language (DDL) to create, alter, or drop database objects inside an application is a very bad practice that in most cases should be banned. There is no reason to dynamically create, alter, or drop objects, with the possible exception of partitions (which I describe in Chapter 5) and of temporary tables that are known to the DBMS to be temporary tables. (We shall also meet another major exception to this rule in Chapter 10.)

All DDL operations work against the core data dictionary of the database. Since this dictionary is central to all database operations, any activity against it introduces global locks that can have massive performance consequences. The only DDL operation that is acceptable in regular operations is truncate table, which is a very fast way of emptying a table of all its rows (without the protection of rollback recovery, remember!).

Important

Creating, altering, or dropping database objects belongs to application design, not to regular operations.

Operations Against Actual Data

Many developers like to create temporary work tables into which they extract lists of data for subsequent processing, before they begin with the serious stuff. This approach is often questionable and may reflect an inability to think beyond the details of the business processes. You must remember that temporary tables cannot offer storage options of the same degree of sophistication as permanent tables (you see some of these options in Chapter 5). Their indexing, if they are indexed, may be less than optimal. As a result, queries that use temporary tables may perform less efficiently than well-written statements against permanent tables, with the additional overhead of having to fill temporary tables as a prerequisite to any query.

Even when the use of temporary tables is justified, they should never be implemented as permanent tables masquerading as work tables if the number of rows to be stored in them is or can be large. One of the problems lies in the automated collection of statistics: when statistics are not collected in real time, they are typically gathered by the DBMS at a time of zero or low activity. The nature of work tables is that they will probably be empty at such slack times, thus giving a wholly erroneous indicator to the optimizer. The result of this incorrect, and biased, statistical data can be totally inappropriate execution plans that not surprisingly lead to dismal performance. If you really have to use temporary storage, use tables that the database can recognize as being temporary.
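With Oracle, for instance, a table that the DBMS knows to be temporary can be declared as in the following minimal sketch (table and column names are hypothetical); its rows are private to each session and vanish at commit:

    create global temporary table session_order_lines
    (   order_id   number,
        line_no    number,
        amount     number
    )
    on commit delete rows;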

Important

Temporary work tables mean more byte-pushing to less suitable storage.

Set Processing in SQL

SQL processes data in complete sets. For most update or delete operations against a database (assuming one is not operating against the entire table contents), one has to define precisely the set of rows in that table that will be affected by the process. This defines the granularity of the impending process, which may be described as coarse if a large number of rows will be affected, or as fine if only a few rows will be involved.

Any attempt to process a large amount of data in small chunks is usually a very bad idea and can be massively inefficient. Chunking can be defended only when the changes are so extensive that a single transaction would, first, consume an enormous amount of space for storing prior values in case of a transaction rollback and, second, take a very long time to roll back should any attempted change fail. Many people would argue that where very considerable changes are to be made, regular commit statements should be scattered throughout the data manipulation language (DML) code. However, regular commit statements may not help when resuming a file upload that has failed. From a strictly practical standpoint, it is often much easier, simpler, and faster to resume a process from the start rather than to locate where and when the failure occurred and then skip over what has already been committed.

Concerning the size of the log required to roll back transactions in case of failure, it can also be argued that the physical database layout has to accommodate processes, and not that processes have to make do with a given physical implementation. If the amount of undo storage that is required is really enormous, perhaps the question should be raised as to the frequency with which changes are applied. Switching from massive monthly updates to not-so-massive weekly ones, or even smaller daily ones, may provide an effective solution.

Important

Thousands of statements in a cursor loop for endless batch processing, multiple statements applied to the same data for users doomed to wait, and one sweeping statement to outperform them all.

Action-Packed SQL Statements

SQL is not a procedural language. Although procedural logic can be applied to SQL, such approaches should be used with caution. The confusion between procedural and declarative processing is most frequently seen when data is extracted from the database, processed, and then re-inserted into the database. When a program—or a function within a program—is provided with some input value, it is all too common to see that input value used to retrieve one or several other values from the database, followed by a loop or some conditional logic (usually if...then...else) driving yet more statements against the database. In most cases, this behavior is the result of deeply ingrained bad habits or a poor knowledge of SQL, combined with a slavish obsession with functional specifications. Many relatively complex operations can be accomplished in a single SQL statement. If the user provides some value, try to get the result set that is of interest without decomposing the process into multiple statements fetching intermediate results of only minimal relevance to the final output.

There are two main reasons for shunning procedural logic in SQL:

Any access to the database means crossing quite a number of software layers, some of which may include network accesses.

Even when no network is involved, there will be interprocess communications; more accesses mean more function calls, more bandwidth, and more time waiting for the answer. As soon as those calls are repeated a fair number of times, the impact on process performance can become distinctly perceptible.

Procedural means that performance and future maintenance burdens fall to your program.

Most database systems incorporate sophisticated algorithms for executing operations such as joins, and for transforming queries so as to execute them in a more efficient way. Cost-based optimizers (CBOs) are complex pieces of software that have sometimes grown from being totally unusable when originally introduced to becoming mature products, capable of giving excellent results in most cases. A good CBO can be extremely efficient in choosing the most suitable execution plan. However, the scope of operation of the CBO is the SQL statement, nothing more. By doing as much as possible in a single statement, you shift the burden of achieving the best possible performance from your program to the DBMS kernel. You enable your program to take advantage of any improvement to the DBMS code, and therefore you are indirectly shifting a large part of the future maintenance of your program to the DBMS vendor.

As ever, there will be exceptions to the general rule that you should shun procedural logic, where in some cases procedural logic may indeed help make things faster. The monstrous all-singing-and-dancing SQL statement is not always a model for efficiency. However, the procedural logic that glues together successive statements that work on the same data and hit the same rows can often be pushed into one SQL statement. The CBO can consider a single statement that stays close to the sound rules of the relational model as a whole and can execute it in the most efficient way.

Important

Leave as much as you possibly can to the database optimizer to sort out.

Profitable Database Accesses

When you plan a visit to several shops, the first step is to decide what purchases have to be made at each shop. From this point, a trip is planned that will ensure minimum repetitive walking backward and forward between different shops. The first shop is then visited, the purchase completed, and then the next closest shop is visited. This is only common sense, and yet the principle underlying this obvious approach is not seen in the practical implementation of many database programs.

When several pieces of information are required from a single table—even if it appears as if they are “unrelated” (which in fact is unlikely to be the case)—it is highly inefficient to retrieve this data in several separate visits to the database. For example, do not fetch row values column by column if multiple columns are required: do the work in one operation.
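For instance, with hypothetical table and column names, rather than issuing one query per attribute, fetch all the required columns in a single statement:

    -- Several visits for a single customer (wasteful):
    select customer_name    from customers where customer_id = :id;
    select customer_address from customers where customer_id = :id;
    select customer_phone   from customers where customer_id = :id;

    -- One visit that brings everything back at once:
    select customer_name, customer_address, customer_phone
    from customers
    where customer_id = :id;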

Unfortunately, good object-oriented (OO) practice makes a virtue out of defining one method for returning each attribute. But do not confuse OO methods with relational database processing. It is a fatal mistake to mix relational and object-oriented concepts and to consider tables to be classes with columns as the attributes.

Important

Maximize each visit to the database to complete as much work as can reasonably be achieved for every visit.

Closeness to the DBMS Kernel

The nearer to the DBMS kernel your code can execute, the faster it will run. This is where the true strength of the database lies. For example, several database management products allow you to extend them by adding new functions, which can sometimes be written in comparatively low-level languages such as C. The snag with a low-level language that manipulates pointers is that if you mishandle a pointer, you can end up corrupting memory. It would be bad enough if you were the only user affected, but the trouble with a database server is that, as the name implies, it serves a large number of users: if you corrupt the server memory, you can corrupt data handled by another, totally innocent program.

As a consequence, responsible DBMS kernels run such code in a kind of sandbox, where it can crash without taking everything down with it. For instance, Oracle implements a complicated communication mechanism between external functions and itself, in some ways similar to the mechanism that controls database links, by which communication between two (or more) database instances on separate servers is managed. If the overall gain achieved by running tightly tailored C functions rather than stored PL/SQL procedures is greater than the cost of setting up the external environment and of context switching, use external functions. But do not use them if you intend to call a function for every row of a very large table. It is a question of balance, of knowing the full implications of the alternative strategies available to solve any given problem.

If functions are to be used, try to always use those that are provided by the DBMS. It is not merely a matter of not reinventing the wheel: built-in functions always execute much closer to the database kernel than any code a third-party programmer can construct, and are accordingly far more efficient.

Here is a simple example using Oracle’s SQL that will demonstrate the efficiencies to be gained by using Oracle functions. Let’s assume we have some text data that has been manually input and that contains multiple instances of adjacent “space” characters. We require a function that will replace any sequence of two or more spaces by a single space. Ignoring the regular expressions available since Oracle Database 10g, our function might be written as follows:

    create or replace function squeeze1(p_string in varchar2)
    return varchar2
    is
      v_string varchar2(512) := '';
      c_char   char(1);
      n_len    number := length(p_string);
      i        binary_integer := 1;
      j        binary_integer;
    begin
      while (i <= n_len)
      loop
        c_char := substr(p_string, i, 1);
        v_string := v_string || c_char;
        if (c_char = ' ')
        then
          j := i + 1;
          while (substr(p_string || 'X', j, 1) = ' ')
          loop
            j := j + 1;
          end loop;
          i := j;
        else
          i := i + 1;
        end if;
      end loop;
      return v_string;
    end;
    /

As a side note, 'X' is concatenated to the string in the inner loop to avoid testing j against the length of the string.

There are alternate ways of writing a function to eliminate multiple spaces, which can make use of some of the string functions provided by Oracle. Here’s one alternative:

    create or replace function squeeze2(p_string in varchar2)
    return varchar2
    is
      v_string varchar2(512) := p_string;
      i        binary_integer := 1;
    begin
      i := instr(v_string, '  ');
      while (i > 0)
      loop
        v_string := substr(v_string, 1, i)
                    || ltrim(substr(v_string, i + 1));
        i := instr(v_string, '  ');
      end loop;
      return v_string;
    end;
    /

And here’s a third way to do it:

    create or replace function squeeze3(p_string in varchar2)
    return varchar2
    is
      v_string varchar2(512) := p_string;
      len1     number;
      len2     number;
    begin
      len1 := length(p_string);
      v_string := replace(p_string, '  ', ' ');
      len2 :=  length(v_string);
      while (len2 < len1)
      loop
        len1 := len2;
        v_string := replace(v_string, '  ', ' ');
        len2 :=  length(v_string);
      end loop;
      return v_string;
    end;
    /

When these three alternative methods are tested on a simple example, each behaves exactly as specified, and there is no visible performance difference:

    SQL> select squeeze1('azeryt  hgfrdt   r')
      2  from dual
      3  /
    azeryt hgfrdt r

    Elapsed: 00:00:00.00
    SQL> select squeeze2('azeryt  hgfrdt   r')
      2  from dual
      3  /
    azeryt hgfrdt r

    Elapsed: 00:00:00.01
    SQL> select squeeze3('azeryt  hgfrdt   r')
      2  from dual
      3  /
    azeryt hgfrdt r

    Elapsed: 00:00:00.00

Assume now that this operation of stripping out multiple spaces is to be called many thousands of times each day. You can use the following code to create and populate a test table with random data, by which you can examine whether there are differences in performance among these three space-stripping functions under a more realistic load:

    create table squeezable(random_text  varchar2(50))
    /

    declare
        i         binary_integer;
        j         binary_integer;
        k         binary_integer;
        v_string  varchar2(50);
    begin
      for i in 1 .. 10000
      loop
        j := dbms_random.value(1, 100);
        v_string := dbms_random.string('U', 50);
        while (j < length(v_string))
        loop
          k := dbms_random.value(1, 3);
          v_string := substr(substr(v_string, 1, j) || rpad(' ', k)
                      || substr(v_string, j + 1), 1, 50);
          j := dbms_random.value(1, 100);
        end loop;
        insert into squeezable
        values(v_string);
      end loop;
      commit;
    end;
    /

This script creates a total of 10,000 rows in the test table (a fairly modest total when it is considered how many times some SQL statements are executed). The test can now be run as follows:

    select squeeze_func(random_text)
    from squeezable;

When I ran this test, headers and screen display were all switched off. Getting rid of output operations ensured that the results reflected the space-reduction algorithm and not the time needed to display the results. The statements were executed several times to ensure that there was no caching effect.

Table 2-2 shows the results on the test machine.

Table 2-2. Time to trim spaces from 10,000 rows

Function    Mechanism                     Time
squeeze1    PL/SQL loop on chars          0.86 seconds
squeeze2    instr( ) + ltrim( )           0.39 seconds
squeeze3    replace( ) called in a loop   0.48 seconds

Even though all functions can be called 10,000 times in under one second, squeeze3 is 1.8 times as fast as squeeze1, and squeeze2 almost 2.2 times as fast. Why? Simply because PL/SQL is not “as close to the kernel” as is a SQL function. The performance difference may look like a tiny thing when functions are executed once in a while, but it can make quite a difference in a batch program—or on a heavily loaded OLTP server.

Important

Code loves the SQL kernel—the closer they get, the hotter the code.

Doing Only What Is Required

Developers often use count(*) for no purpose other than to implement an existence test. This usually happens as a result of a specification such as:

    If there are rows meeting a certain condition
    Then do something to them

which immediately becomes:

    select count(*)
    into counter
    from table_name
    where <certain_condition>

    if (counter > 0) then

Of course, in 90% of cases the count(*) is totally superfluous, as in the above example. If an action is required to operate on a number of rows, just do it. If no row is affected, so what? No harm is done. Moreover, if the process to be applied to those hypothetical rows is complex, the very first operation will tell you how many of them were affected, either in a system variable (@@ROWCOUNT with Transact-SQL, SQL%ROWCOUNT with PL/SQL, and so forth), in a special field of the SQL Communication Area (SQLCA) when using embedded SQL, or through special APIs such as mysql_affected_rows( ) in PHP. The number of processed rows is also sometimes directly returned by the function that interacts with the database, such as the JDBC executeUpdate( ) method. Counting rows very often achieves nothing other than doubling your total search effort, because it applies a process twice to the same data.
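As a minimal PL/SQL sketch (the table, column, and condition are hypothetical), the statement itself tells us afterwards whether it found anything to do:

    begin
      update orders
      set    status = 'PROCESSED'
      where  status = 'PENDING';
      -- No preliminary count(*): the row count is a free by-product
      if sql%rowcount = 0 then
        null;   -- nothing matched; react here only if the business requires it
      end if;
    end;
    /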

Further, do not forget that if your purpose is to update or insert rows (a frequent case when rows are counted first to check whether the key already exists), some database systems provide dedicated statements (for instance, Oracle 9i Database’s MERGE statement) that operate far more efficiently than you can ever achieve by executing redundant counts.
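Here is a minimal sketch of such an upsert written with merge (Oracle syntax; table and column names are hypothetical), which spares us both the preliminary count and the choice between insert and update:

    merge into customer_totals t
    using (select :customer_id as customer_id,
                  :amount      as amount
           from dual) s
    on (t.customer_id = s.customer_id)
    when matched then
      update set t.total_amount = t.total_amount + s.amount
    when not matched then
      insert (customer_id, total_amount)
      values (s.customer_id, s.amount);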

Important

There is no need to code explicitly what the database performs implicitly.

SQL Statements Mirror Business Logic

Most database systems provide monitoring facilities that allow you to check statements currently being executed, as well as to monitor how many times they are executed. At the same time, you should have an idea of how many “business units” are being processed—activities such as orders or claims to be processed, customers to be billed, or anything else that makes sense to the business managers. You should review whether there is a reasonable (not absolutely precise) correlation between the two classes of activities. In other words, for a given number of customers, is the same number of activities being initiated against the database? If a query against the customers table is executed 20 times more than the number of customers being processed at the same time, it is a certainty that there is a problem somewhere. This situation would suggest that instead of going once to the table to find required information, repeated (and superfluous) visits are being made to the same rows in the same table.
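With Oracle, for instance, a rough picture of how often each statement runs can be obtained from the shared pool and compared to the business volumes (a minimal sketch):

    select sql_text, executions
    from v$sqlarea
    order by executions desc;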

Important

Check that your database activity is reasonably consistent with the business requirements currently being addressed.

Program Logic into Queries

There are several ways to achieve procedural logic in a database application. It’s possible to put some degree of procedurality inside an SQL statement (even if a statement should say what, and not how). Even when using a well-integrated host language within which SQL statements are embedded, it is still preferable to embed as much procedural logic as possible within an actual SQL statement, rather than in the host language. Of the two alternatives, embedding logic in the SQL statement will yield higher performance than embedding it in the application. Procedural languages are characterized by the ability to iterate (loops) and to perform conditional logic (if...then...else constructs). SQL doesn’t need looping, since in essence it operates on sets; all it requires is the ability to test logically for some conditions.

Obtaining conditional logic breaks down into two components—IF and ELSE. Achieving IF is easy enough—the where condition provides the capability. What is difficult is to obtain the ELSE logic. For example, we may need to retrieve a set of rows, and then apply different transformations to different subsets. The case expression (Oracle has also long provided a functionally equivalent operator in decode( )[*]) makes it easy to simulate some logic: it allows us to change on the fly the values that are returned to the result set by testing on row values. In pseudocode, the case construct operates like this:[†]

    CASE
       WHEN condition THEN <return something to the result set>
       WHEN condition THEN <return something else>
       ...
       WHEN condition THEN <return still something else>
       ELSE <fall back on this value>
    END

Comparing numerical values or dates is straightforward. With strings, functions such as Oracle’s greatest( ) or least( ) or MySQL’s strcmp( ) can be useful. It is also sometimes possible to add some logic to insert statements, through multiple table inserts and conditional inserts,[*] and by using the merge statement. Don’t hesitate to use such statements if they are available with your DBMS. In other words, a lot of logic can be pushed into SQL statements; although the benefit may be small when executing only one of several statements, the gain can be much greater if you can manage to use case or merge or similar functionality to combine several statements into one.
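As a minimal sketch of the kind of logic that can be pushed into an insert, here is a conditional multitable insert in Oracle syntax (table and column names are hypothetical):

    insert first
      when amount >= 10000 then
        into large_orders(order_id, amount)
        values (order_id, amount)
      else
        into regular_orders(order_id, amount)
        values (order_id, amount)
    select order_id, amount
    from incoming_orders;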

Important

Wherever possible, try to embed your conditional logic within your SQL statements rather than in an associated host language.

Multiple Updates at Once

My basic assertion here is that successive updates to a single table are acceptable if they affect disjoint sets of rows; otherwise they should be combined. For example, here is some code from an actual application:[†]

    update tbo_invoice_extractor
    set pga_status = 0
    where pga_status in (1,3)
      and inv_type = 0;

    update tbo_invoice_extractor
    set rd_status = 0
    where rd_status in (1,3)
      and inv_type = 0;

Two successive updates are being applied to the same table. Will the same rows be hit twice? There is no way to tell. The question is, how efficient are the search criteria? Any attribute with a name like type or status is typically a column with a totally skewed distribution. It is quite possible that both updates may result in two successive full scans of the same table. One update may use an index efficiently, and the second update may result in an unavoidable full table scan. Or, fortuitously, both may be able to make efficient use of an index. In any case, there is almost nothing to lose and everything to win by trying to combine both updates into a single statement:

    update tbo_invoice_extractor
    set pga_status = (case pga_status
                        when 1 then 0
                        when 3 then 0
                        else pga_status
                      end),
         rd_status = (case rd_status
                        when 1 then 0
                        when 3 then 0
                        else rd_status
                       end)
    where (pga_status in (1,3)
           or rd_status in (1, 3))
      and inv_type = 0;

There is indeed the possibility of some slight overhead due to the update of some columns with exactly the same contents they already have. But in most cases, one update is a lot faster than several separate ones. Notice also how, in line with the previous section on pushing logic into queries, we have used implicit conditional logic, by virtue of the case expression, to process only those rows that meet the update criteria, irrespective of how many different update criteria there may be.

Important

Apply updates in one fell swoop if possible; try to minimize repeated visits to the same table.

Careful Use of User-Written Functions

When a user-written function is embedded in a statement, the function may be called a large number of times. If the function appears within the select list, it is called for each returned row. If it appears within the where clause, it is called for each and every row that has successfully passed the filtering criteria previously evaluated. This may be a considerable number of times if the other criteria are not very selective.

Consider what happens if that same function executes a query. The query is executed each time the function is called; in practice, the result is exactly the same as a correlated subquery, except that the function is an excellent way to prevent the cost-based optimizer from executing the main query more intelligently! Precisely because the subquery is hidden within the function, the database optimizer cannot take any account of this query. Moreover, the stored procedure is not as close to the SQL execution engine as is a correlated subquery, and it will consequently be even less efficient.

Now I shall present an example demonstrating the dangers of hiding SQL code away inside a user-written function. Consider a table flights that describes commercial flights, with columns for flight number, departure time, arrival time, and the usual three-letter IATA[*] codes for airports. The translation of those codes (over 9,000 of them) is stored in a reference table that contains the name of the city (or of the particular airport when there are several located in one city), and of course the name of the country, and so on. Quite obviously any display of flight information should include the name of the destination city airport rather than the rather austere IATA code.

Here we come to one of the contradictions in modern software engineering. What is often regarded as “good practice” in programming is modularity, with many insulated software layers. That principle is fine in the general case, but in the context of database programming, in which code is a shared activity between the developer and the database engine itself, the desirability of code modularity is less clear. For example, we can follow the principle of modularity by building a small function to look up IATA codes and present the full airport name whenever the function is cited in a query:

    create or replace function airport_city(iata_code in char)
    return varchar2
    is
      city_name  varchar2(50);
    begin
      select city
      into city_name
      from iata_airport_codes
      where code = iata_code;
      return(city_name);
    end;
    /

For readers unfamiliar with Oracle syntax, trunc(sysdate) in the following query refers to today at midnight (00:00), and date arithmetic is based on days; the condition on departure times therefore refers to times between 8:30 a.m. and 4:00 p.m. today. Queries using the airport_city function might be very simple. For example:

    select flight_number,
           to_char(departure_time, 'HH24:MI') DEPARTURE,
           airport_city(arrival) "TO"
    from flights
    where departure_time between trunc(sysdate) + 17/48
                             and trunc(sysdate) + 16/24
    order by departure_time
    /

This query executes with satisfactory speed; on a random sample on my machine, 77 rows were returned in 0.18 seconds (the average of several runs), the kind of time that leaves users happy (statistics indicate that 303 database blocks were accessed, 53 read from disk—and there is one recursive call per row).

As an alternative to using a look-up function we could simply write a join, which of course looks slightly more complicated:

    select f.flight_number,
           to_char(f.departure_time, 'HH24:MI') DEPARTURE,
           a.city "TO"
    from flights f,
         iata_airport_codes a
    where a.code = f.arrival
      and departure_time between trunc(sysdate) + 17/48
                             and trunc(sysdate) + 16/24
    order by departure_time
    /

This query runs in only 0.05 seconds (the same statistics, but there are no recursive calls). It may seem petty and futile to be more than three times as fast for a query that runs for less than a fifth of a second. However, it is quite common in large systems (particularly in the airline world) to have extremely fast queries running several hundred thousand times in one day. Let’s say that a query such as the one above runs only 50,000 times per day. Using the query with the lookup function, the query time will amount to a total of 2:30 hours. Without the lookup function, it will be under 42 minutes. This maintains an improvement ratio of well over 300%, which in a high traffic environment represents real and tangible savings that may ultimately translate into a financial saving. Very often, the use of lookup functions makes the performance of batch programs dreadful. Moreover, they increase the “service time” of queries for no benefit—which means that fewer concurrent users can use the same box, as you shall see in Chapter 9.

Important

The code of user-written functions is beyond the examination of the optimizer.

Succinct SQL

The skillful developer will attempt to do as much as possible with as few SQL statements as possible. By contrast, the ordinary developer tends to closely follow the different functional stages that have been specified; here is an actual example:

    -- Get the start of the accounting period
    select closure_date
    into dtPerSta
    from tperrslt
    where fiscal_year=to_char(Param_dtAcc,'YYYY')
      and rslt_period='1' || to_char(Param_dtAcc,'MM');

    -- Get the end of the period out of closure
    select closure_date
    into dtPerClosure
    from tperrslt
    where fiscal_year=to_char(Param_dtAcc,'YYYY')
      and rslt_period='9' || to_char(Param_dtAcc,'MM');

This is an example of very poor code, even if in terms of raw speed it is probably acceptable. Unfortunately, this quality of code is typical of much of the coding that performance specialists encounter. Two values are being collected from the very same table. Why are they being collected through two different, successive statements? This particular example uses Oracle, and a bulk collect of the two values into an array can easily be implemented. The key to doing that is to add an order by clause on rslt_period, as follows:

    select closure_date
    bulk collect into dtPerStaArray
    from tperrslt
    where fiscal_year=to_char(Param_dtAcc,'YYYY')
      and rslt_period in ('1' || to_char(Param_dtAcc,'MM'),
                          '9' || to_char(Param_dtAcc,'MM'))
    order by rslt_period;

The two dates are stored respectively into the first and second positions of the array. bulk collect is specific to the PL/SQL language but the same reasoning applies to any language allowing an explicit or implicit array fetch.

Note that an array is not even required, and the two values can be retrieved into two distinct scalar variables using the following little trick:[*]

    select max(decode(substr(rslt_period, 1, 1), -- Check the first character
                       '1', closure_date,
                            -- If it's '1' return the date we want
                           to_date('14/10/1066', 'DD/MM/YYYY'))),
                            -- Otherwise something old
           max(decode(substr(rslt_period, 1, 1),
                       '9', closure_date, -- The date we want
                            to_date('14/10/1066', 'DD/MM/YYYY')))
    into dtPerSta, dtPerClosure
    from tperrslt
    where fiscal_year=to_char(Param_dtAcc,'YYYY')
      and rslt_period in ('1' || to_char(Param_dtAcc,'MM'),
                          '9' || to_char(Param_dtAcc,'MM'));

In this example, since we expect two rows to be returned, the problem is to retrieve in one row and two columns what would naturally arrive as two rows of a single column each (as in the array fetch example). We do that by checking each time the column that allows distinction between the two rows, rslt_period. If the row is the required one, the date of interest is returned. Otherwise, we return a date (here the arbitrary date is that of the battle of Hastings), which we know to be in all cases much older (smaller in terms of date comparison) than the one we want. By taking the maximum each time, we can be ensured that the correct date is obtained. This is a very practical trick that can be applied equally well to character or numerical data; we shall study it in more detail in Chapter 11.

Important

SQL is a declarative language, so try to distance your code from the procedurality of business specifications.

Offensive Coding with SQL

Programmers are often advised to code defensively, checking the validity of all parameters before proceeding. In reality, when accessing a database, there is a real advantage in coding offensively, trying to do several things simultaneously.

A good example is a succession of various checks, designed to flag up an exception whenever the criterion required by any of these checks fails to be met. Let’s assume that some kind of payment by a credit card has to be processed. There are a number of steps involved. It may be necessary to check that the customer id and card number that have been submitted are valid, and that they are correctly associated one with the other. The card expiration date must also be validated. Finally, the current purchase must not exceed the credit limit for the card. If everything is correct, the debit operation may proceed.

An unskilled developer may write as follows:

    select count(*)
    from customers
    where customer_id = provided_id

and will check the result.

Then the next stage will be something like this:

    select card_num, expiry_date, credit_limit
    from accounts
    where customer_id = provided_id

These returns will be checked against appropriate error codes.

The financial transaction will then proceed.

A skillful developer will do something more like the following (assuming that today( ) is the function that returns the current date):

    update accounts
    set balance = balance - purchased_amount
    where balance >= purchased_amount
      and credit_limit >= purchased_amount
      and expiry_date > today(  )
      and customer_id = provided_id
      and card_num = provided_cardnum

Then the number of rows updated will be checked. If the result is 0, the reason can be determined in a single operation, by executing:

    select c.customer_id, a.card_num, a.expiry_date,
           a.credit_limit, a.balance
    from customers c
         left outer join accounts a
              on a.customer_id = c.customer_id
              and a.card_num = provided_cardnum
    where c.customer_id = provided_id

If the query returns no row, the inference is that the value of customer_id is wrong, if card_num is null the card number is wrong, and so on. But in most cases this query will not even be executed.

Note

Did you notice the use of count(*) in the first piece of novice code? This is a perfect illustration of the misuse of count(*) to perform an existence test.

The essential characteristic of offensive coding is to proceed on the basis of reasonable probabilities. For example, there is little point in checking whether the customer exists—if they don’t, they won’t be in the database in the first place! Assume nothing will fail, and if it does, have mechanisms in place that will address the problem at that point and only that point. Interestingly, this approach is analogous to the “optimistic concurrency control” method adopted in some database systems: update conflicts are assumed not to occur, and it is only when they do that control structures are brought into play. The result is much higher throughput than for systems using pessimistic methods.

Important

Code on a probabilistic basis. Assume the most likely outcome and fall back on exception traps only when strictly necessary.

Discerning Use of Exceptions

There is a thin line between courage and rashness; when I recommend coding aggressively, my model is not the charge of the Light Brigade at Balaclava.[*] Programming by exception can also be the consequence of an almost foolhardy bravado, in which our proud developers determine to “go for it.” They have an overriding confidence that testing and the ability to handle exceptions will see them through. Ah, the brave die young!

As their name implies, exceptions should be exceptional occurrences. In the particular case of database programming, not all exceptions require the same computer resources—and this is probably the key point to understand if they are to be used intelligently. There are good exceptions, conditions that are raised before anything has been done, and bad exceptions, which are raised only when the full extent of the disaster has actually happened.

For instance, a query against a primary key that finds no row will take minimal resources—the situation is detected while searching the index. However, if the query cannot use an index, then you have to carry out a full table scan before being able to tell positively that no data has been found. For a very large table, a total sequential read can represent a disaster on a machine near maximum capacity.

Some exceptions are extremely costly, even in the best-case scenario; take the detection of duplicate keys. How is uniqueness enforced? Almost always by creating a unique index, and it is when a key is submitted for entry into that index that any constraint violation of that unique index will be revealed. However, when an index entry is created, the physical address of the row must be provided, which means that the insertion into the table takes place prior to the insertion into the index. The constraint violation requires that the partial insert must be undone, together with the identification of the exact constraint violated being returned as an error message. All of these activities carry some significant processing cost. But the greatest sin is trying to fight at the individual exception level. Here, one is forced to think about individual rows rather than data sets—the very antithesis of relational database processing. The consequence of repeated constraint violations can be a serious deterioration in performance.

Let’s look at an Oracle example of the previous points. Assume that following the merger of two companies, email addresses are standardized on the <Initial><Name> pattern, on 12 characters at most, with all spaces or quotes replaced by an underscore character.

Let’s assume that a new employee table is created with the new email addresses obtained from a 3,000-row employee_old table. We want each employee to have a unique email address. We must therefore assign, for instance, flopez to Fernando Lopez, and flopez2 to Francisco Lopez (no relation). In fact, in our test data, a total of 33 potential duplicate entries exist, which is the reason for the following result:

    SQL> insert into employees(emp_num, emp_name,
                               emp_firstname, emp_email)
      2  select emp_num,
      3         emp_name,
      4         emp_firstname,
      5         substr(substr(EMP_FIRSTNAME, 1, 1)
      6               ||translate(EMP_NAME, ' ''', '__'), 1, 12)
      7  from employees_old;

    insert into employees(emp_num, emp_name, emp_firstname, emp_email)
    *
    ERROR at line 1:
    ORA-00001: unique constraint (EMP_EMAIL_UQ) violated


    Elapsed: 00:00:00.85

Thirty-three duplicates out of 3,000 is about 1%, so perhaps it would be possible to quietly process the conformant 99% and handle the rest through exceptions? After all, it would seem that a 1% load could be accommodated with some additional exception processing which should not be too significant. Following is the code for this optimistic approach:

    SQL> declare
      2     v_counter    varchar2(12);
      3     b_ok         boolean;
      4     n_counter    number;
      5     cursor c is  select emp_num,
      6                         emp_name,
      7                         emp_firstname
      8                  from employees_old;
      9  begin
     10    for rec in c
     11    loop
     12      begin
     13        insert into employees(emp_num, emp_name,
     14                              emp_firstname, emp_email)
     15        values (rec.emp_num,
     16                rec.emp_name,
     17                rec.emp_firstname,
     18                substr(substr(rec.emp_firstname, 1, 1)
     19                ||translate(rec.emp_name, ' ''', '__'), 1, 12));
     20      exception
     21       when dup_val_on_index then
     22         b_ok := FALSE;
     23         n_counter := 1;
     24         while not b_ok
     25         loop
     26           begin
     27             v_counter := ltrim(to_char(n_counter));
     28             insert into employees(emp_num, emp_name,
     29                                   emp_firstname, emp_email)
     30             values (rec.emp_num,
     31                     rec.emp_name,
     32                     rec.emp_firstname,
     33                     substr(substr(rec.emp_firstname, 1, 1)
     34                       ||translate(rec.emp_name, ' ''', '__'), 1,
     35                       12 - length(v_counter)) || v_counter);
     36             b_ok := TRUE;
     37           exception
     38            when dup_val_on_index then
     39              n_counter := n_counter + 1;
     40           end;
     41         end loop;
     42       end;
     43    end loop;
     44  end;
     45  /

    PL/SQL procedure successfully completed.

    Elapsed: 00:00:18.41

But what exactly is the cost of this exception handling? If the same exercise is attempted after removing the “problem” rows, the comparison between the loop with duplicates and the loop without duplicates shows that the cost of processing exceptions in the loop is fairly negligible: without duplicates, the procedure also takes about 18 seconds to run. However, when we run the insert...select of our first attempt against the data without duplicates, it is considerably faster than the loop: we discover that the switch to one-row-at-a-time logic adds close to 50% to processing time. But in such a case, is it possible to avoid the row-at-a-time process? Yes, but only by avoiding exceptions. It is the decision to deal with problem rows through exception handling that forced our adoption of sequential row processing.

Alternatively, there might be value in attempting to identify those rows that contain email addresses subject to contention, and assigning those addresses some arbitrary number to achieve uniqueness.

It is easy to determine how many rows are involved in this contention by adding a group by clause to the SQL statement. However, assigning numbers might be a difficult thing to do without the analytical functions available in the major database systems (Oracle calls them analytical functions, DB2 knows them as online analytical processing, or OLAP, functions, and SQL Server calls them ranking functions). It is worthwhile to explore the solution to this problem in terms of pure SQL.

Each email address can be assigned a unique number: 1 for the oldest employee whose first-name initial and surname result in the given email address, 2 for the second oldest, and so on. By pushing this result into a subquery, it is possible to concatenate nothing to the first email address in each group, and the sequence number (not in the Oracle sense of the word) to the following ones. The following code shows how our logic can be applied:

    SQL> insert into employees(emp_num, emp_firstname,
      2                        emp_name, emp_email)
      3  select emp_num,
      4         emp_firstname,
      5         emp_name,
      6         decode(rn, 1, emp_email,
      7                       substr(emp_email,
      8                       1, 12 - length(ltrim(to_char(rn))))
      9                        || ltrim(to_char(rn)))
     10  from (select emp_num,
     11               emp_firstname,
     12               emp_name,
     13               substr(substr(emp_firstname, 1, 1)
     14                 ||translate(emp_name, ' ''', '__'), 1, 12)
     15                          emp_email,
     16               row_number( )
     17                  over (partition by
     18                        substr(substr(emp_firstname, 1, 1)
     19                         ||translate(emp_name,' ''','__'),1,12)
     20                        order by emp_num) rn
     21        from employees_old)
     22  /

    3000 rows created.

    Elapsed: 00:00:11.68

We avoid the costs of row-at-a-time processing, and this solution requires only 60% of the original time.

Important

Exception handling forces the adoption of procedural logic. Always try to anticipate possible exceptions by remaining within declarative SQL.



[*] decode( ) is a bit more rudimentary than case and may require the use of additional functions such as sign( ) to obtain the same results.

[†] There are two variants of the case construct; the example shown is the most sophisticated variant.

[*] Available, for instance, in Oracle since release 9.2.

[†] Table names have been changed.

[*] International Air Transport Association.

[*] The Oracle function decode( ) works like case. What is compared is the first argument. If it is equal to the second argument, then the third one is returned; if there is no fifth parameter, then the fourth one corresponds to else; otherwise, if the first argument is equal to the fourth one, the fifth one is returned and so on as long as we have pairs of values.

[*] During the Crimean War of 1854 that saw England, France, and Turkey fight against Russia, a poorly specified order and personal enmity between some of the commanders led more than 600 British cavalry men to charge down a valley in full line of fire of the Russian guns. Around 120 men and half the horses were killed, for no result. The bravery of the men, celebrated in a poem by Tennyson and (later) several Hollywood movies, helped turn a stupid military action into a myth.
