© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2021
W. D. AshleyFoundation Db2 and Pythonhttps://doi.org/10.1007/978-1-4842-6942-8_7

7. Writing Good SQL for Db2

W. David Ashley1  
(1)
Austin, TX, USA
 

Writing good SQL is really about database performance. Although tuning of the database can help with performance, bad SQL will impact performance in ways that tuning the database just cannot cure. When the database administrator(s) has done their job, it is time for the training of the users to begin. All relational databases have their quirks, and it is paramount for the users to be aware of these and have the knowledge to either bypass them or create SQL that takes advantage of the database design and features.

Db2 is designed to hold and process huge amounts of data. Because of that, many of the design goals and features are quite different from other relational databases. The SQL optimizer is one of the most sophisticated in the industry. Database data can be spread over a wide variety of devices and an array of computational entities. All of this power does not come without some cost, and this chapter concerns the design and use of SQL.

This chapter will address some simple ways to structure your SQL to take advantage of the Db2 design. It will cover a variety of areas so that you have a good understanding of how Db2 works and what you can do to make your SQL perform better.

Relational Theory

Relational databases are designed around the mathematical theory of sets. Set theory is a very old and understood theory of relationships between entities or, in our case, data. All relational databases have set theory at their heart.

Because of this design, the storage of data in the database has practically no resemblance to traditional file systems. This makes it really hard for both programmers and users to understand why you should not do things in the traditional way. The very first thing to teach new users is that doing things in the traditional ways will ALWAYS cause detrimental impacts on database performance. And in some cases, the performance will be worse than sequentially accessing a file for the same data.

Relational databases have a tremendous amount of power built into them. When working with Db2, you should always try to make use of that power mainly because it will almost always be better than trying to design your own solution.

One of the mistakes that new users make is thinking of a database table as a file. They are not files, and even the data in a table row is not stored together. The reason for this goes back to the set theory of design. It will always be easier to structure data on the fly rather than by breaking structured data apart and restructuring it as a result set.

The next mistake new users make is thinking that construction of a result set happens a little at a time. This is a big mistake. When the first result row is presented to the user, the complete result set has already been completed by the system. Thus, processing the result set a line at a time delays the destruction of the result set and uses up valuable system scratch space.

These are the two really big problems for new users and programmers, but another mistake made especially by programmers involves processing the data to produce a new result after fetching the data. New programmers sometimes assume that the database system is inefficient in dynamically modifying a result – like adding a fixed numerical value to a result column. This is just not the case. The database system is much faster at this sort of calculation than any program could be.

New users are also sometimes afraid of creating SQL that is complex, thinking that the system will find it hard to interpret for a valid result. The SQL optimizer for Db2 is very efficient at breaking apart the input SQL and efficiently interpreting and executing it.

The last mistake users make is to code SQL to ask for data you already know. For instance, if you ask for data on Dept. 4, do not ask for the department number as a part of the result set. This is just a waste of time and resources.

The last thing to cover is the data types used in Db2 and contrasting them to the native data types of the hardware systems. For instance, very few smaller systems natively support decimal data. Sometimes there are libraries available that can process this data, but even then, it will require some programming to accommodate the decimal data.

Reduce Passes Through Data

One of the easiest ways to reduce the processing of SQL queries is to eliminate multiple passes through the same data. For instance, if you are looking for employees whose salaries are out of range, you might create two queries to return the data:
SELECT firstnme, midinit, empno
   FROM emp
   WHERE salary > 75000;
SELECT firstnme, midinit, empno
   FROM emp
   WHERE salary < 10000;
These two queries will return the employees whose salaries are either above the salary range or below it. If the number of employees is really large, each query will take some time to process. The problem here is the question we really want answered is what “employees are out of range.” This fundamental question can be answered with a single query:
SELECT firstnme, midinit, empno
   FROM emp
   WHERE salary > 75000 OR salary < 10000;

This query will make a single pass through the data to list all the employees that are either above or below the range.

It is really important to always step back from a question when it takes two queries to obtain the data. Try to figure out the fundamental question you are asking as that usually leads to a single query that really answers your question.

The following shows another example of how multiple passes can creep into your SQL coding:
SELECT creator, name, 'Table '
   FROM sysibm.systables
   WHERE type = 'T'
UNION
SELECT creator, name, 'View  '
   FROM sysibm.systables
   WHERE type = 'V'
UNION
SELECT creator, name, 'Alias '
   FROM sysibm.systables
   WHERE type = 'A'
UNION
SELECT creator, name, 'Global'
   FROM sysibm.systables
   WHERE type = 'G'
ORDER BY creator, name;

This simple statement causes four passes through the sysibm.systables table. The result output shows all the table-like objects in the database catalog. However, the statement can be simplified so that only a single pass is necessary over the sysibm.systables table.

When joining two or more tables, it will be very useful for the WHERE clause to refer to an indexed column in each table you reference. All of this is to prevent a complete table scan looking for the proper rows in each table:
SELECT creator, name,
   CASE type
      WHEN 'T' THEN 'Table '
      WHEN 'V' THEN 'View  '
      WHEN 'A' THEN "Alias '
      WHEN 'G' THEN 'Global'
   END
   FROM sysibm.systables
   ORDER BY creator, name;

This new query provides the same results as the previous one but only requires a single pass through the sysibm.systables table. It is also much easier to read and understand.

The following shows how you can make mass updates with only a single pass through a table:
UPDATE emp
SET salary = CASE workdept
      WHEN 'A01' THEN salary * .04
      WHEN 'C01' THEN salary * .02
      WHEN 'D11' THEN salary * .10
      WHEN 'D21' THEN salary * .50
      ELSE salary * .01
   WHERE edlevel > 15;

Using Indexes to Increase Performance

When querying data from the database, it is really important to utilize indexes wherever possible. The increase in performance can be quite dramatic. You do NOT need to necessarily fetch columns that have an index on them, but instead try to utilize them wherever possible in the SQL WHERE clause. For instance, if the empno column has an index on it, you should attempt to utilize the index in your queries:
SELECT firstnme, midinit, lastnme, dept
   FROM emp
   WHERE empno > 10000

The preceding statement will now utilize the index to locate the employee row(s) without a full table scan. The column empno has an index on it; thus, Db2 can use it to locate the rows to be returned. Note that it is not necessary to retrieve the indexed column to utilize its index. A reference in the WHERE clause is sufficient to prevent a full table scan.

Also, you should be aware that Db2 can create an index on the fly to help query a table or tables. For instance, the following WHERE clause will cause dynamic indexes to be created on the salary, bonus, and comm columns to help reduce the access time for the data:
SELECT empno, firstnme, lastnme
FROM emp
WHERE salary + bonus + comm > 100000.00;

The indexes that are created are only temporary and will be deleted when the query is complete.

Sorting and Grouping

Sorting is sometimes a confusing topic for new SQL users because there are some underlying rules that govern it. Also, sorting can be invoked by the ORDER BY clause or by the DISTINCT keyword on the SELECT clause.

One aspect of the ORDER BY clause is that the column(s) named in that clause does NOT have to be retrieved in the output (appears after the SELECT keyword). Many new users are always surprised by that aspect of ordering the output data. An example is shown here:
SELECT empno, lastnme, firstnme
FROM emp
ORDER BY workdept;

As you can see, the workdept does not appear in the output, but the data will be ordered by that value.

You should also limit the number of columns to be sorted. The more columns you include, the longer the sort will take.

The DISTINCT keyword always causes a sort, even if there are no duplicate rows. You should be very careful when using DISTINCT as the overhead for it can be high. However, this is not a recommendation to never use it as there may be very valid reasons for doing so.

Grouping with SQL can be extremely helpful when you want to return aggregate data. Consider the following query:
SELECT workdept, SUM(salary)
   FROM emp
   GROUP BY workdept;
This causes the output to contain a single line for each workdept along with the aggregate sum of salaries for all the employees in the workdept. The order of the workdepts will be random since no ORDER BY clause was used. Additionally, the HAVING clause can eliminate departments that are of no use to you:
SELECT workdept, SUM(salary)
   FROM emp
   GROUP BY workdept
   HAVING AVG(salary) < 15000;

Programs Containing SQL

Programs that contain SQL are sometimes called “black box code” because the SQL is hidden from view by the users that run the programs. These programs are created by programmers who work from a set of requirements created by their users. Problems arise when the programmer(s) takes shortcuts when writing the SQL for the program. Let us look at an example that will make the SQL inefficient, and thus the program will be inefficient.

If the program requires three different kinds of data from the customer table, the best way to do this is to write three different queries against the customer table:
SELECT firstnme, lastnme, address, city, state, zipcode
    FROM customer
    WHERE areacode = :HV-AC;
SELECT custid, firstnme, lastnme, phoneno
   FROM customer
   WHERE areacode = :HV-AC;
SELECT custid, firstnme, lastnme, custtype
   FROM customer
   WHERE areacode = :HV-AC;
All of the preceding queries are very efficient and require very few resources from the database. But, in the name of reduced programming effort, the programmer(s) may decide to code them as follows to reduce the number of SQL statements in the program:
SELECT custid, firstnme, lastnme, address, city, state, zipcode, phonno, custtype
    FROM customer
    WHERE areacode = :HV-AC;

This is not nearly as efficient as the three previous queries. It is also bad form because it will always return columns in which the users are uninterested. The bottom line is this is a terrible idea. While it may seem to be saving programmer time, it wastes user time and database resources every time the query is run.

Use Db2 Utilities Where Possible

Programmers and users alike should always investigate the possibilities of using a Db2 utility in place of a custom-made program. In most cases, the Db2 utility will be much more efficient than a programmer-created program. The vast majority of utilities in Db2 have existed for decades and have 99% of all bugs already removed. Plus, the efficiency of these utilities has been honed over the years, far longer than any other locally created program.

The load and import utilities can be used to add or replace data in existing tables. These utilities are very efficient, especially the load utility. They can also deal with very large input data sets. There are any number of options supported by these utilities that can modify how the data is loaded into a table including skipping some input data.

The unload utility can extract data from a table in mass amounts very quickly. It uses an SQL statement to determine the data to be unloaded and thus is easily modified by users to match their requirements.

The MERGE SQL statement, while not a Db2 utility, is another way to take data from two tables and merge it into a single table. It is a very powerful SQL statement with a number of options for matching the data to be merged.

The TRUNCATE SQL statement is another tool that is not a Db2 utility. This one statement can remove all the data in a table and optionally reclaim the storage space occupied by the table.

Db2 Functions

The Db2 system provides a large collection of built-in functions (BIFs) that can be used in your SQL statements that can improve the retrieval of the exact data you are looking for. These functions are as much a part of SQL and the Db2 optimizer as the SQL statements themselves. Using them in your SQL statements usually does not cause performance issues.

From a broad perspective, there are two types of functions provided by Db2. There are the built-in functions that are supplied by the database and user-developed functions that are supplied by users and programmers. The built-in functions are very efficient and can be used by anyone developing a SQL statement for Db2. The user-defined functions must be developed, usually by programmers, in order to be available for SQL statements.

There are many types of BIFs, a few of which are listed here:
  • Aggregate functions: These are used to compute summary information from a column contained in one or more rows and summarize it in some form.

  • Scalar functions: These functions return a subset of column data in a single row.

  • Table functions: These functions return a set of columns from within the WHERE clause.

  • Row functions: These are not available yet in Db2 for LUW.

The current Db2 BIFs are listed here:

ABS: Returns the absolute value of a column from a set of rows.

AVG: Computes the average of a set of column values.

CEILING: Returns the smallest integer value greater than or equal to the argument.

CHAR: Returns a fixed-length character string of the argument.

COALESCE: Returns the value of the first non-NULL expression.

CONCAT: Returns the concatenation of two compatible string arguments.

CORRELATION: An expression that returns a value of any built-in numeric data type.

COUNT: Counts the number of rows in a column result.

COVARIANCE: Returns the population (covariance) of a set of number pairs.

DATE: Returns the Db2 date derived from an expression.

IDENTITY_VAL_LOCAL: Returns the most recently assigned value for an identity element.

INT: Returns an integer for a number or a character string representation of an integer.

LENGTH: Returns the length of a value.

LOCATE: Returns the position where the first occurrence of an argument starts within another argument.

LOWER: Returns the argument string in lowercase.

LPAD: Returns a padded string argument on the left with blanks (or another string).

LTRIM: Returns a string with bytes removed from the left of the string.

MAX: Returns the maximum value in a column from a set of rows.

MIN: Returns the minimum value in a column from a set of rows.

POSITION: Returns the position of the first occurrence of an argument within another argument.

RAND: Returns a random floating-point number between 0 and 1.

REPLACE: Returns a string with all occurrences of a string with another string.

ROUND: Returns a number rounded to the specified number of digits to the right or left of the decimal point.

RPAD: Returns a padded string argument on the right with blanks (or another string).

RTRIM: Returns a string with bytes removed from the right of the string.

SQRT: Returns the square root of the argument.

STDDEV: Returns the standard deviation in a column from a set of rows.

SUBSTR: Returns a substring of a string argument.

SUM: Returns the sum of all columns from a set of rows.

UPPER: Returns the argument string in uppercase.

UNNEST: Returns a result table that includes a row for each element of an array.

VARIANCE: Returns the variance of all columns from a set of rows.

Multiple Ways to Code SQL

With SQL, just like any other language, there is more than one way to obtain the results you are looking for from a query. Examine the following statement:
SELECT empno, lastname
    FROM emp
    WHERE workdept LIKE 'A%';
While that may get the result you are looking for, it may return a far greater set of rows than you are expecting. The preceding statement assumes you know that there are only three departments starting with the letter “A.” But in the future, there may be far more departments that start with that letter. A better alternative is to code what you know to prevent problems in the future and to improve efficiency:
SELECT empno, lastname
    FROM emp
    WHERE workdept IN('A00', 'A01', 'A02');

This will at least give you correctly targeted results, but it will improve performance as well.

Summary

This chapter has introduced several individual topics all designed to help you write good SQL for the Db2 database. Good SQL is defined as efficient, logical, and easy to read. The result of writing good SQL is that it is easily understood and easy to maintain.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.135.183.187