Writing good SQL is really about database performance. Although tuning the database can help with performance, bad SQL will impact performance in ways that tuning alone cannot cure. Once the database administrators have done their job, it is time for user training to begin. All relational databases have their quirks, and it is paramount that users be aware of them and know how to either bypass them or write SQL that takes advantage of the database design and features.
Db2 is designed to hold and process huge amounts of data. Because of that, many of the design goals and features are quite different from other relational databases. The SQL optimizer is one of the most sophisticated in the industry. Database data can be spread over a wide variety of devices and an array of computational entities. All of this power does not come without some cost, and this chapter concerns the design and use of SQL.
This chapter will address some simple ways to structure your SQL to take advantage of the Db2 design. It will cover a variety of areas so that you have a good understanding of how Db2 works and what you can do to make your SQL perform better.
Relational Theory
Relational databases are designed around the mathematical theory of sets. Set theory is a long-established and well-understood theory of relationships between entities or, in our case, data. All relational databases have set theory at their heart.
Because of this design, the storage of data in the database bears practically no resemblance to traditional file systems. This makes it hard for both programmers and users to understand why they should not do things in the traditional way. The very first thing to teach new users is that doing things the traditional way will ALWAYS have a detrimental impact on database performance. In some cases, the performance will be worse than sequentially reading a file for the same data.
Relational databases have a tremendous amount of power built into them. When working with Db2, you should always try to make use of that power mainly because it will almost always be better than trying to design your own solution.
One of the mistakes that new users make is thinking of a database table as a file. Tables are not files, and even the data in a table row is not stored together. The reason for this goes back to the set theory underlying the design. It will always be easier to structure data on the fly than to break structured data apart and restructure it into a result set.
The next mistake new users make is thinking that construction of a result set happens a little at a time. This is a big mistake. By the time the first result row is presented to the user, the complete result set has already been built by the system. Thus, processing the result set a line at a time delays the destruction of the result set and ties up valuable system scratch space.
These are the two really big problems for new users and programmers, but another mistake, made especially by programmers, involves processing the data to produce a new result after fetching it. New programmers sometimes assume that the database system is inefficient at dynamically modifying a result, such as adding a fixed numerical value to a result column. This is just not the case. The database system is much faster at this sort of calculation than any program could be.
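As a hedged sketch, assuming the EMPLOYEE table from Db2's SAMPLE database and a fixed adjustment of 1,000 (an illustrative value), the calculation belongs in the statement itself:

```sql
-- Let the database engine do the arithmetic instead of post-processing
-- fetched rows in the program. EMPLOYEE and its columns come from the
-- Db2 SAMPLE database; the 1000 adjustment is an assumption.
SELECT empno,
       lastname,
       salary + 1000 AS adjusted_salary
FROM   employee;
```

The engine computes the derived column as it builds the result set, so no second pass over the fetched rows is needed in application code.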
New users are also sometimes afraid of creating complex SQL, thinking that the system will find it hard to interpret for a valid result. The SQL optimizer in Db2 is very good at breaking apart the input SQL and interpreting and executing it efficiently.
Another common mistake users make is coding SQL that asks for data they already know. For instance, if you ask for data on Dept. 4, do not ask for the department number as part of the result set. This is just a waste of time and resources.
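For example (a hedged sketch against the SAMPLE database's EMPLOYEE table, with 'D04' standing in as a hypothetical department code):

```sql
-- The department is fixed by the predicate, so returning WORKDEPT in the
-- result set would only repeat a value the user already knows.
SELECT empno, lastname
FROM   employee
WHERE  workdept = 'D04';
```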
The last thing to cover is the data types used in Db2 and contrasting them to the native data types of the hardware systems. For instance, very few smaller systems natively support decimal data. Sometimes there are libraries available that can process this data, but even then, it will require some programming to accommodate the decimal data.
Reduce Passes Through Data
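As a hedged sketch, assuming the EMPLOYEE table from Db2's SAMPLE database and an acceptable salary range of 30,000 to 90,000 (both bounds are illustrative), a query that checks both ends of the range at once might look like this:

```sql
-- A single pass over EMPLOYEE finds everyone outside the salary range,
-- rather than running one query for "below" and another for "above".
SELECT empno, lastname, salary
FROM   employee
WHERE  salary < 30000
   OR  salary > 90000;
```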
This query will make a single pass through the data to list all the employees that are either above or below the range.
It is really important to always step back from a question when it takes two queries to obtain the data. Try to figure out the fundamental question you are asking as that usually leads to a single query that really answers your question.
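A common multi-pass pattern is a UNION of near-identical catalog queries. As a hedged sketch (the TYPE codes shown are illustrative values for tables, views, aliases, and summary tables), it might look like this:

```sql
-- Four separate SELECTs, each forcing its own pass over the catalog table.
SELECT name, type FROM sysibm.systables WHERE type = 'T'
UNION ALL
SELECT name, type FROM sysibm.systables WHERE type = 'V'
UNION ALL
SELECT name, type FROM sysibm.systables WHERE type = 'A'
UNION ALL
SELECT name, type FROM sysibm.systables WHERE type = 'S';
```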
This simple statement causes four passes through the sysibm.systables table. The result output shows all the table-like objects in the database catalog. However, the statement can be simplified so that only a single pass is necessary over the sysibm.systables table.
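A single-pass form of the same catalog query folds the four predicates into one IN list (the TYPE codes are illustrative):

```sql
-- One pass over SYSIBM.SYSTABLES; a single predicate covers all four
-- object types at once.
SELECT name, type
FROM   sysibm.systables
WHERE  type IN ('T', 'V', 'A', 'S');
```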
This new query provides the same results as the previous one but only requires a single pass through the sysibm.systables table. It is also much easier to read and understand.
Using Indexes to Increase Performance
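As a hedged sketch against the SAMPLE database's EMPLOYEE table (where EMPNO is assumed to carry an index, and '000010' is an illustrative employee number):

```sql
-- The predicate on the indexed EMPNO column lets Db2 locate the row
-- through the index; note that EMPNO itself is not in the select list.
SELECT lastname, workdept
FROM   employee
WHERE  empno = '000010';
```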
The preceding statement will now utilize the index to locate the employee row(s) without a full table scan. The column empno has an index on it; thus, Db2 can use it to locate the rows to be returned. Note that it is not necessary to retrieve the indexed column to utilize its index. A reference in the WHERE clause is sufficient to prevent a full table scan.
The indexes that are created are only temporary and will be deleted when the query is complete.
Sorting and Grouping
Sorting is sometimes a confusing topic for new SQL users because there are some underlying rules that govern it. Also, sorting can be invoked by the ORDER BY clause or by the DISTINCT keyword on the SELECT clause.
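For example (a hedged sketch using the EMPLOYEE table from the SAMPLE database), a column can drive the sort without appearing in the select list:

```sql
-- The rows come back ordered by WORKDEPT even though that column
-- is not returned in the result set.
SELECT empno, lastname
FROM   employee
ORDER BY workdept;
```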
As you can see, the workdept does not appear in the output, but the data will be ordered by that value.
You should also limit the number of columns to be sorted. The more columns you include, the longer the sort will take.
The DISTINCT keyword always causes a sort, even if there are no duplicate rows. You should be very careful when using DISTINCT as the overhead for it can be high. However, this is not a recommendation to never use it as there may be very valid reasons for doing so.
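As a hedged illustration against the SAMPLE database, a legitimate use of DISTINCT is collapsing a column that genuinely contains duplicates:

```sql
-- Each department appears once per employee in EMPLOYEE; DISTINCT sorts
-- the result and removes the duplicates. Worth the sort here, but wasteful
-- on a column that is already unique.
SELECT DISTINCT workdept
FROM   employee;
```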
Programs Containing SQL
Programs that contain SQL are sometimes called “black box code” because the SQL is hidden from view of the users who run the programs. These programs are created by programmers who work from a set of requirements created by their users. Problems arise when the programmers take shortcuts when writing the SQL for the program. Let us look at an example of a shortcut that makes the SQL, and thus the program, inefficient.
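As a hedged sketch, instead of issuing a targeted query for each user requirement, the shortcut buries one catch-all statement in the program:

```sql
-- One query to serve every requirement: fetch every column of every row
-- and let the program sift out what each user actually asked for.
SELECT * FROM employee;
```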
This is not nearly as efficient as the targeted queries it replaces. It is also bad form because it will always return columns in which the users are uninterested. The bottom line is this is a terrible idea. While it may seem to save programmer time, it wastes user time and database resources every time the query is run.
Use Db2 Utilities Where Possible
Programmers and users alike should always investigate the possibility of using a Db2 utility in place of a custom-made program. In most cases, the Db2 utility will be much more efficient than a programmer-created equivalent. The vast majority of Db2 utilities have existed for decades and have had 99% of their bugs already removed. Plus, the efficiency of these utilities has been honed over the years, far longer than any locally created program.
The load and import utilities can be used to add or replace data in existing tables. These utilities are very efficient, especially the load utility. They can also deal with very large input data sets. There are any number of options supported by these utilities that can modify how the data is loaded into a table including skipping some input data.
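As a hedged sketch of the Db2 command-line syntax (the file path and table name are assumptions), the two utilities look like this:

```sql
-- Append rows from a delimited file using IMPORT:
IMPORT FROM /tmp/newemps.del OF DEL INSERT INTO employee;

-- Replace the table's contents in bulk using LOAD, which is generally
-- much faster for large input files:
LOAD FROM /tmp/newemps.del OF DEL REPLACE INTO employee;
```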
The unload utility can extract data from a table in mass amounts very quickly. It uses an SQL statement to determine the data to be unloaded and thus is easily modified by users to match their requirements.
The MERGE SQL statement, while not a Db2 utility, is another way to take data from two tables and merge it into a single table. It is a very powerful SQL statement with a number of options for matching the data to be merged.
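A hedged sketch of MERGE, assuming a hypothetical staging table EMPLOYEE_STAGE with the same key and columns as EMPLOYEE:

```sql
-- Update existing employees from the staging table and insert the new
-- ones, all in a single statement.
MERGE INTO employee AS t
USING employee_stage AS s
  ON t.empno = s.empno
WHEN MATCHED THEN
  UPDATE SET salary = s.salary
WHEN NOT MATCHED THEN
  INSERT (empno, lastname, salary)
  VALUES (s.empno, s.lastname, s.salary);
```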
The TRUNCATE SQL statement is another tool that is not a Db2 utility. This one statement can remove all the data in a table and optionally reclaim the storage space occupied by the table.
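For example (EMPLOYEE_STAGE is an assumed table name):

```sql
-- Remove every row and hand the allocated storage back to the system;
-- IMMEDIATE makes the truncation take effect without a deletable log
-- of individual rows.
TRUNCATE TABLE employee_stage DROP STORAGE IMMEDIATE;
```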
Db2 Functions
The Db2 system provides a large collection of built-in functions (BIFs) that can be used in your SQL statements to improve the retrieval of the exact data you are looking for. These functions are as much a part of SQL and the Db2 optimizer as the SQL statements themselves. Using them in your SQL statements usually does not cause performance issues.
From a broad perspective, there are two types of functions provided by Db2. There are the built-in functions that are supplied by the database and user-developed functions that are supplied by users and programmers. The built-in functions are very efficient and can be used by anyone developing a SQL statement for Db2. The user-defined functions must be developed, usually by programmers, in order to be available for SQL statements.
Aggregate functions: These compute a single summary value from a column across one or more rows.
Scalar functions: These operate on one or more input values and return a single value for each row processed.
Table functions: These return a result table and are referenced in the FROM clause of a query.
Row functions: These are not available yet in Db2 for LUW.
A selection of the current Db2 BIFs is listed here:
ABS: Returns the absolute value of a numeric argument.
AVG: Computes the average of a set of column values.
CEILING: Returns the smallest integer value greater than or equal to the argument.
CHAR: Returns a fixed-length character string of the argument.
COALESCE: Returns the value of the first non-NULL expression.
CONCAT: Returns the concatenation of two compatible string arguments.
CORRELATION: Returns the coefficient of correlation of a set of number pairs.
COUNT: Counts the number of rows in a column result.
COVARIANCE: Returns the population covariance of a set of number pairs.
DATE: Returns the Db2 date derived from an expression.
IDENTITY_VAL_LOCAL: Returns the most recently assigned value for an identity column.
INT: Returns an integer for a number or a character string representation of an integer.
LENGTH: Returns the length of a value.
LOCATE: Returns the position where the first occurrence of an argument starts within another argument.
LOWER: Returns the argument string in lowercase.
LPAD: Returns a string padded on the left with blanks (or another specified string).
LTRIM: Returns a string with leading blanks removed.
MAX: Returns the maximum value in a column from a set of rows.
MIN: Returns the minimum value in a column from a set of rows.
POSITION: Returns the position of the first occurrence of an argument within another argument.
RAND: Returns a random floating-point number between 0 and 1.
REPLACE: Returns a string in which all occurrences of a substring are replaced with another string.
ROUND: Returns a number rounded to the specified number of digits to the right or left of the decimal point.
RPAD: Returns a string padded on the right with blanks (or another specified string).
RTRIM: Returns a string with trailing blanks removed.
SQRT: Returns the square root of the argument.
STDDEV: Returns the standard deviation of a set of column values.
SUBSTR: Returns a substring of a string argument.
SUM: Returns the sum of a set of column values.
UPPER: Returns the argument string in uppercase.
UNNEST: Returns a result table that includes a row for each element of an array.
VARIANCE: Returns the variance of a set of column values.
Multiple Ways to Code SQL
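The same question can often be coded in more than one way. As a hedged sketch against the SAMPLE database (EMPLOYEE and DEPARTMENT, with 'NEW YORK' as an illustrative location), here are two equivalent formulations of one query:

```sql
-- Form 1: a subquery selects the qualifying departments first.
SELECT empno, lastname
FROM   employee
WHERE  workdept IN (SELECT deptno
                    FROM   department
                    WHERE  location = 'NEW YORK');

-- Form 2: a join expresses the same question in one block.
SELECT e.empno, e.lastname
FROM   employee e
       JOIN department d ON d.deptno = e.workdept
WHERE  d.location = 'NEW YORK';
```

Both return the same rows; the optimizer may even choose the same access plan for both, but stating the question in the form that matches your intent makes the SQL easier to verify and maintain.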
This will at least give you correctly targeted results, and it can improve performance as well.
Summary
This chapter has introduced several individual topics all designed to help you write good SQL for the Db2 database. Good SQL is defined as efficient, logical, and easy to read. The result of writing good SQL is that it is easily understood and easy to maintain.