CHAPTER 5


It’s about the Question

“It’s not about the query; it’s about the question.” This is one of my favorite sayings when it comes to writing SQL. Regardless of your level of proficiency, writing SQL well is as much about questions as it is about queries.

There are many ways that questions play an important role when writing SQL. First, understanding the question behind the query is often more important than the query syntax itself. If you start with the question the SQL is intended to answer, you are more likely to think through and understand how best to formulate the query to get the desired result. Second, it is critical to be able to ask good questions to clarify what the SQL is intended to do, and to gather all the pertinent information you need to write SQL that is not only functionally correct, but also efficient. Last, you must be able to create well-formed logical expressions that help answer the questions behind the SQL.

In this chapter, I cover how to go about ferreting out all the information you need to write the query in the best way possible. The way you do this is by asking good questions. Regardless of whether you are writing a new SQL statement or modifying an existing one, questions are the heart of the process.

Asking Good Questions

Asking good questions is an intellectual habit, and habits don’t form overnight. Long ago, I read that it takes between 21 and 28 days to form a new habit. A 2009 research study published in the European Journal of Social Psychology1 suggests, however, that forming a new habit actually takes an average of 66 days, and that the more complex the behavior, the longer it takes for that behavior to become a habit. So, if you’re not already in the habit of asking good questions, it’s important to understand that learning to do so takes specific effort on your part to gain proficiency.

You may be wondering what any of this has to do with writing SQL. I believe knowing how to ask good questions and, even more specifically, knowing how to ask questions that allow you to determine the correct question your SQL statement is intended to answer, is a crucial habit you need to form if you really want to elevate your SQL skills to the next level.

To write any SQL statement, begin with a question you need to answer. The answer is a result set composed of one or more rows of data from the tables in your database. As a starting point, you may be given the answer being sought in the form of a sample report or screen display. At other times, you may be given a more complete specification for what the query needs to deliver. You shouldn’t be surprised when I tell you that you get weak query specifications more often than you get strong, detailed ones. No matter how much information you are given about the queries you need to write, make sure to ask good questions that ensure you have everything you need to write SQL that does what it is supposed to—and does it quickly and efficiently.

The Purpose of Questions

Questions help you clarify the request and help you probe assumptions that either you or the requestor may hold. Questions also help you to gather evidence and work out the implications or consequences of implementing code in certain ways. Questions are gold. Well, I suppose you could say that the answers are the gold, but questions are the tools you need to mine the gold.

To ask questions that get you the information you need to write functionally correct and optimally performing SQL, you must be able to formulate your questions properly. Regardless of how much you know, or think you know, about what you’ve been asked to code, it can be helpful to start with a blank slate and ask questions as if you know nothing. By doing so, you are more likely to reach greater levels of detail and avoid making assumptions.

Many people think that asking questions makes them appear ignorant. I believe questions are a magic tool. Asking intelligent, well-crafted questions causes people to think. And when someone thinks, the door is open for new ideas, new answers, and new possibilities to emerge. When you ask people a question, particularly those who want something from you, you are letting them know you care about what they want and that you want to serve their request in the best way possible. Keeping silent out of fear of looking dumb has more potential to backfire. If you don’t ask questions and then deliver something that doesn’t satisfy the request effectively, you call far more negative attention to yourself than asking questions ever could.

I want to point out that you should ask questions even if you ask them only to yourself. As odd as this may sound, if you happen to be in a situation in which there is no good resource at your disposal, you still need to ask questions and get answers. The answers have to come from research you do, but if you start by preparing a good list of questions, you can direct your research more clearly.

Categories of Questions

Questions can be categorized in many ways, but the primary distinction is between open and closed questions. The type you choose depends on whether you want a longer, detailed answer or a short, limited one.

Open questions are intended to open a dialogue and help you engage in a conversation. Answers to open questions usually provide more detail and can’t be answered with a simple yes or no. Questions that begin with What, How, Who, When, and Why are open questions. Just be careful when asking Why questions because they may come across as confrontational. Remember that your questions should be intended to help you get all the detail you need, but not put anyone on the defensive. For example, asking, “Why would anyone ever choose to do it that way?” has a very different feeling than “What rationale prompted that choice?” Even if you discover something questionable, or just plain wrong, you can provide feedback directly and may not need to use Why questions very often.

Most of the time, your questions should be aimed at digging out facts. Objective, open questions ask for specific information and tend to be answered with facts. However, you must take care to make sure you are getting facts and not opinions. Asking someone what they think about something invites a subjective response rather than an objective one, and the difference can be a critical one.

Some open questions are intended to get responses to help you get ideas—in particular, ideas about actions you should take. These are problem-solving questions. These types of questions are great to aid in brainstorming different approaches to take. Your colleagues are great people sources for the answers to these types of questions. After you have the detail you need, don’t hesitate to bounce things off other developers. They often offer solutions you never would have thought of.

The two most common types of questions you should ask when developing SQL are objective and problem-solving questions. Here are a few examples:

  • What is the data model and is a data dictionary or entity relationship diagram (ERD) available?
  • How have other related queries, if any, been written?
  • Who is the subject matter expert for this application?
  • What are the response time requirements for this query?
  • How would you implement the request?
  • What steps should I take next?
  • What resources do you suggest I review?

If you need a yes or no response or just a short answer, closed questions suit that purpose best. Questions that begin with Are, Can, Did, or Do elicit short, direct responses. These types of questions should not be ambiguous. You want to make sure you ask the question so that you don’t end up getting a long response if all you really want is a yes or no. These kinds of questions are intended to prevent or inhibit long discussions.

Closed questions can be separated into three types: identification, selection, and yes/no. When you use an identification-type question, you want to know a specific answer but don’t provide choices. A selection-type question provides a list of two or more choices. The yes/no type asks for a simple yes or no response only.

To demonstrate the differences between these three types, I ask the same question in three different ways:

  • Identification: What kind of table is employees?
  • Selection: Is the employees table a heap table or an Index-Organized Table (IOT)?
  • Yes/no: Is the employees table a heap table?

Of these types, the selection type is the one you need to formulate most carefully. In this example, I provided only two selections: heap and IOT. But what if the table is a clustered table type? If you don’t include that option in the selection list, you could end up getting a yes/no answer to the original question. The person answering the question might (rudely) answer with a simple “No,” and then you have to follow up with an identification question to get the answer you need.

Selecting the right type of question is almost as important as the question itself. You want to get the details needed as expeditiously as possible. So, remember to use closed questions if you want to keep answers short, and to use open questions if you want to open up a discussion and get more detail. The most common mistake is asking a closed question when you really want more detail. For example, “Will you tell me about the project?” is technically a closed question that should return a yes or no answer. Most people have learned to provide a polite response (with detail) even when asked the wrong type of question. But, it is truly your responsibility to ask the correct type and style of question to make it easier for the responder to provide you with the appropriate answer.

Questions about the Question

Developing new queries is usually easier than trying to modify a query that someone else has already written. This is because when you write a brand new query, you don’t have to worry about interpreting the meaning of someone else’s code. But, what you do have to worry about is the query specification. Regardless of whether it’s detailed, it’s your job to make sure you code the SQL to deliver the answer to the question you’ve been handed.

Let’s walk through an example of how this process might work. I’ll play the role of business user and you play the role of application developer. My request is for you to write a query that provides a list of employees who have held more than one job in the company. I’d like the output to display only the employee_id and a count of how many total jobs each employee has held. Listing 5-1 shows the query you create to satisfy my request.

Listing 5-1.  List of Employees Who Have Held Multiple Jobs

SQL> desc job_history
 Name                          Null?    Type
 ----------------------------- -------- --------------------
 EMPLOYEE_ID                   NOT NULL NUMBER(6)
 START_DATE                    NOT NULL DATE
 END_DATE                      NOT NULL DATE
 JOB_ID                        NOT NULL VARCHAR2(10)
 DEPARTMENT_ID                          NUMBER(4)

SQL> select employee_id, count(*) job_ct
  2  from job_history
  3  group by employee_id
  4  having count(*) > 1;

    EMPLOYEE_ID          JOB_CT
--------------- ---------------
            101               2
            176               2
            200               2

3 rows selected.

Note that this query assumes the job_history table contains the history of all jobs, including each employee’s current one. That assumption isn’t accurate for this dataset, but for the purposes of this example, we’ll proceed as if it were. So, that was pretty simple, right? You complete your testing and deliver this code. However, I come back to you and say it’s wrong. The list is missing some employees who have held more than one job. I produced this list manually before I came to seek your help, and I know that the following employees should be displayed: 101, 102, 114, 122, 176, 200, and 201.

What went wrong? This seems like a fairly simple query, doesn’t it? It went wrong because the solution was developed without any questions being asked. By not asking questions, you made some assumptions (regardless of whether you realized it). The assumptions you made caused you to write the query as you did. The way you wrote the query didn’t provide the result I expected. Admittedly, I could have helped you out more by giving you a more detailed specification or by providing you with the expected result set initially. Regardless of the quality of the query specification you have, never forget that it is your job to ferret out the details and make sure you develop code that answers specifically the real question being asked.

Let’s start over. The query specification I provided asked you to write a query that provides a list of employees who have held more than one job in the company, displaying the employee_id and a count of how many total jobs they’ve held. Although at first glance the query request seems straightforward, the apparent simplicity hides several nuances that you can’t be aware of unless you ask some questions. The following list includes a few questions you could ask to help clarify the request:

  • Should the query consider the employee’s current job as part of the count or only jobs held other than the current position?
  • Where are the data that satisfy the request stored (in other words, in one table or several)?
  • What is the data model and can I get a copy of the data dictionary or an ERD if one exists?
  • Is there an expected typical size of the result set?
  • How are the data stored? In particular, are they stored in some sorted order or not?
  • Must this query meet any response time SLA?
  • How frequently will the query execute?

If you receive a request from a business user, it might not be feasible to ask him or her all these questions. Asking a business user which tables contain the data he or she wants, or whether you can get a copy of the ERD, might be met with a blank stare, because these details typically aren’t within the business user’s domain of knowledge. It is important to note whether the request for the query is coming from an application user or an application technical architect. Many of these questions can be answered only by someone with an understanding of the application from the technical perspective. Therefore, learn who the “go-to” people are when you need detailed technical information. This may be the database administrator (DBA), the data architect, or perhaps a developer who worked initially on other code in this particular application. Over time, you build the knowledge you need to determine many of these answers for yourself, but it’s always good to know who the subject matter experts are for any application you support.

Getting answers to the first three questions in the previous list is the most important step initially. You must know more than just a description of what the query needs to ask for. Being as familiar as possible with the data model is the starting point. When writing the original query, an assumption was made that the only table containing the information needed to satisfy the query was the job_history table. If you had asked the first three questions, you’d have found out that the job_history table is truly a history table; it contains only historical data, not current data. The employees table contains a job_id column that holds each employee’s current position. Therefore, to determine how many positions an employee has held, you need to get the current job from the employees table and the previous jobs from the job_history table. With this information, you might rewrite the query as shown in Listing 5-2.

Listing 5-2.  The Rewritten Employee Jobs Query

SQL> select employee_id, count(*) job_ct
  2  from
  3  (
  4  select e.employee_id, e.job_id
  5  from employees e
  6  union all
  7  select j.employee_id, j.job_id
  8  from job_history j
  9  )
 10  group by employee_id
 11  having count(*) > 1;

 
    EMPLOYEE_ID          JOB_CT
--------------- ---------------
            102               2
            201               2
            101               3
            114               2
            200               2
            176               2
            122               2

7 rows selected.

It looks like the answer is correct now. It’s at this point that the answers to the next questions come into play. Knowing what to ask for is certainly important, and the first three questions helped describe the data the query needed to return. Most people would stop here. However, knowing how to get the data I’m after is just as important. This is contrary to what most of us are taught about relational databases in general. In one of my college courses on RDBMSs, I was taught that SQL is used to access data, and that there is no requirement to know anything about where or how the data are stored or how the RDBMS processes a SQL statement to access that data. In other words, SQL is used to describe what is done, not how it is done.

The reality is that knowing how your data are stored and accessed is just as important as describing the data your query retrieves. Let’s say you need to book a trip from Washington, DC, to Los Angeles, California. You call your travel agent to handle the booking for you. If the only information you provide to the agent is your departure and arrival cities and that you want the least expensive fare possible, what could happen? Well, it’s possible that the least expensive fare involves leaving at 5:30 am from Washington, DC, then making stopovers in Atlanta, Chicago, and Dallas before finally connecting into Los Angeles at midnight (Los Angeles time, which means it would be 3 am in DC time). Would that be OK with you? Probably not. Personally, I’d be willing to pay extra to get a direct flight from DC to Los Angeles. Think about it. If you could get a direct flight leaving from DC at 8 am and arriving into Los Angeles at 10 am, wouldn’t it be worth quite a bit to you vs. making multiple stopovers and spending nearly a full day to complete the trip? And what if the direct flight cost only 10 percent more than the nightmare flight? Your original request to book the least expensive fare didn’t include any conditions under which you’d be willing to pay more, so your request was satisfied but you probably won’t be happy with the outcome.

Knowing how the data are stored and how they should be accessed ensures your query not only returns the correct answer, but also does so as quickly and efficiently as possible. That’s why questions like, “How big is the typical expected result set?” and “How are the data stored?” and “How fast and how frequently do they need to execute?” must be asked. Without the answers to these questions, your query may get the correct answer but still be a failure because of poor performance. Simply getting the right result isn’t enough. To be successful, your query must be right and it must be fast.

Questions about Data

I hope at this point you agree that you do need to concern yourself with how data are stored and how they should be accessed. Where do you find this information? The database can give you most of the answers you need when you execute a few simple queries. After you have this information, you then need to determine how data should be accessed, which comes from understanding how the various access and join methods work and when it is appropriate to use each. I’ve already covered access and join methods, so you’ve got the information you need to help you there, but how do you discover how the data are stored? Let’s walk through the questions you need to ask and queries you can execute to get the answers.

As a first step, try to think like the optimizer. The optimizer needs statistics and instance parameter values to be able to compute a plan. Therefore, it’s a good idea to put yourself in the optimizer’s place and gather the information it uses to formulate the execution plan. Always seek out the answers to the following questions about the data (a sketch of simple data dictionary queries that can answer most of them follows the list):

  • Which tables are needed to gather all the data required?
  • Are any of the tables partitioned and, if so, how are the partitions defined?
  • What columns are in each table?
  • What indexes are available in each table?
  • What are the statistics for each table, column, and index?
  • Are there histograms on any of the columns?
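If you don’t have a consolidated script at hand, a few simple queries against the standard Oracle data dictionary views can answer most of these questions. The following is a minimal sketch only; the table name SALES is an example, and it assumes you are connected as the table’s owner:

-- Table-level statistics and whether the table is partitioned
select table_name, partitioned, num_rows, blocks, last_analyzed
  from user_tables
 where table_name = 'SALES';

-- Column-level statistics, including histogram information
select column_name, data_type, num_distinct, num_nulls, density, histogram
  from user_tab_columns
 where table_name = 'SALES';

-- Indexes and index statistics
select index_name, index_type, blevel, leaf_blocks, distinct_keys, clustering_factor
  from user_indexes
 where table_name = 'SALES';

-- Partition definitions, if the table is partitioned
select partition_name, high_value, num_rows, blocks
  from user_tab_partitions
 where table_name = 'SALES';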

Statistics help the optimizer paint a picture of how the various ways of accessing and joining data perform. You can know what the optimizer knows. All you need to be able to do is query the information from the data dictionary. One thing to keep in mind when you’re reviewing statistics is that they may or may not represent your data accurately. If the statistics are stale, missing, or poorly collected, they may paint the wrong picture. The optimizer can only know what the statistics tell it. You, on the other hand, can determine whether the statistics make sense. For example, if a date column in one of your tables has a high value of six months ago, yet you know rows with current dates exist, you can see quickly that the statistics are out of date. This kind of visual inspection helps you determine whether statistics need to be updated, but you can’t know these things unless you look. A key question you must always ask is whether the statistics represent your data accurately. Listing 5-3 uses a single script named st-all.sql (previously used in Chapter 2) to answer all the questions listed previously in one pass, giving you a single source to review when verifying how representative the available statistics really are.

Listing 5-3.  Getting All the Statistics Information You Need

SQL> @st-all
Enter the owner name: sh
Enter the table name: sales
======================================================================================
  TABLE STATISTICS
======================================================================================
Owner         : sh
Table name    : sales
Tablespace    : EXAMPLE
Partitioned   : yes
Last analyzed : 05/31/2013 20:17:03
Sample size   : 918843
Degree        : 1
# Rows        : 918843
# Blocks      : 1769
Empty Blocks  : 0

Avg Space     : 0
Avg Row Length: 29
Monitoring?   : yes
======================================================================================
  PARTITION INFORMATION
======================================================================================
 Part# Partition Name      Sample Size          # Rows        # Blocks
------ --------------- --------------- --------------- ---------------
     1 SALES_1995      .                             0               0
     2 SALES_1996      .                             0               0
     3 SALES_H1_1997   .                             0               0
     4 SALES_H2_1997   .                             0               0
     5 SALES_Q1_1998             43687           43687              90
...
    28 SALES_Q4_2003   .                             0               0

 Part# Partition Name  Partition Bound
------ --------------- -------------------------------------------------------------
     1 SALES_1995      TO_DATE(' 1996-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', ...
     2 SALES_1996      TO_DATE(' 1997-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', ...
     3 SALES_H1_1997   TO_DATE(' 1997-07-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', ...
     4 SALES_H2_1997   TO_DATE(' 1998-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', ...
     5 SALES_Q1_1998   TO_DATE(' 1998-04-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', ...
...
    28 SALES_Q4_2003   TO_DATE(' 2004-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', ...
======================================================================================
  COLUMN STATISTICS
======================================================================================
 Name           Null?  NDV      Density  # Nulls   # Bkts  AvgLen  Lo-Hi Values
======================================================================================
amount_sold     N      3586     .000279  0         1       5       6.4 | 1782.72
channel_id      N      4        .250000  0         1       3       2 | 9
cust_id         N      7059     .000142  0         1       5       2 | 101000
prod_id         N      72       .000001  0         72      4       13 | 148
promo_id        N      4        .000001  0         4       4       33 | 999
quantity_sold   N      1       1.000000  0         1       3       1 | 1
time_id         N      1460     .000685  0         1       8       01/01/1998 00:00:00 |
                                                                   12/31/2001 00:00:00

======================================================================================
  HISTOGRAM STATISTICS     Note: Only columns with buckets containing > 5% are shown.
======================================================================================

PROMO_ID (4 buckets)
1 97%

======================================================================================
  INDEX INFORMATION
======================================================================================

                                           Dstnct Lf/Blks Dt/Blks
Index Name          BLevel Lf Blks # Rows    Keys    /Key    /Key  CluF Unq? Type Part?
------------------ ------- ------- ------ ------ ------- ------- ----- ---- ---- -----
SALES_CHANNEL_BIX        1      47     92      4      11      23    92 NO   BITM YES
SALES_CUST_BIX           1     475  35808   7059       1       5 35808 NO   BITM YES
SALES_PROD_BIX           1      32   1074     72       1      14  1074 NO   BITM YES
SALES_PROMO_BIX          1      30     54      4       7      13    54 NO   BITM YES
SALES_TIME_BIX           1      59   1460   1460       1       1  1460 NO   BITM YES

Index Name                           Pos# Order Column Name
------------------------------ ---------- ----- ------------------------------
sales_channel_bix                       1 ASC   channel_id

sales_cust_bix                          1 ASC   cust_id

sales_prod_bix                          1 ASC   prod_id

sales_promo_bix                         1 ASC   promo_id

sales_time_bix                          1 ASC   time_id

======================================================================================
  PARTITIONED INDEX INFORMATION
======================================================================================

Index: SALES_CHANNEL_BIX
                                           Dst LfBlk DtBlk
Part# Partition Name  BLevel LfBlks # Rows Keys /Key  /Key CluF Partition Bound
----- --------------- ------ ------ ------ ---- ----  ---- ---- ----------------------
    1 SALES_1995           0      0      0    0    0     0    0 TO_DATE('1996-01-01...
    2 SALES_1996           0      0      0    0    0     0    0 TO_DATE('1997-01-01...
    3 SALES_H1_1997        0      0      0    0    0     0    0 TO_DATE('1997-07-01...
    4 SALES_H2_1997        0      0      0    0    0     0    0 TO_DATE('1998-01-01...
    5 SALES_Q1_1998        1      2      5    4    1     1    5 TO_DATE('1998-04-01...
...
   28 SALES_Q4_2003        0      0      0    0    0     0    0 TO_DATE('2004-01-01...

Index: SALES_CUST_BIX

    1 SALES_1995           0      0      0    0    0     0    0 TO_DATE('1996-01-01...
    2 SALES_1996           0      0      0    0    0     0    0 TO_DATE('1997-01-01...
    3 SALES_H1_1997        0      0      0    0    0     0    0 TO_DATE('1997-07-01...
    4 SALES_H2_1997        0      0      0    0    0     0    0 TO_DATE('1998-01-01...
    5 SALES_Q1_1998        1     28   3203 3203    1     1 3203 TO_DATE('1998-04-01...
...
   28 SALES_Q4_2003        0      0      0    0    0     0    0 TO_DATE('2004-01-01...

Index: SALES_PROD_BIX

    1 SALES_1995           0      0      0    0    0     0    0 TO_DATE('1996-01-01...
    2 SALES_1996           0      0      0    0    0     0    0 TO_DATE('1997-01-01...
    3 SALES_H1_1997        0      0      0    0    0     0    0 TO_DATE('1997-07-01...
    4 SALES_H2_1997        0      0      0    0    0     0    0 TO_DATE('1998-01-01...
    5 SALES_Q1_1998        1      2     60   60    1     1   60 TO_DATE('1998-04-01...
...
   28 SALES_Q4_2003        0      0      0    0    0     0    0 TO_DATE('2004-01-01...

Index: SALES_PROMO_BIX

    1 SALES_1995           0      0      0    0    0     0    0 TO_DATE('1996-01-01...
    2 SALES_1996           0      0      0    0    0     0    0 TO_DATE('1997-01-01...
    3 SALES_H1_1997        0      0      0    0    0     0    0 TO_DATE('1997-07-01...
    4 SALES_H2_1997        0      0      0    0    0     0    0 TO_DATE('1998-01-01...
    5 SALES_Q1_1998        0      1      3    2    1     1    3 TO_DATE('1998-04-01...
...
   28 SALES_Q4_2003        0      0      0    0    0     0    0 TO_DATE('2004-01-01...

Index: SALES_TIME_BIX

    1 SALES_1995           0      0      0    0    0     0    0 TO_DATE('1996-01-01...
    2 SALES_1996           0      0      0    0    0     0    0 TO_DATE('1997-01-01...
    3 SALES_H1_1997        0      0      0    0    0     0    0 TO_DATE('1997-07-01...
    4 SALES_H2_1997        0      0      0    0    0     0    0 TO_DATE('1998-01-01...
    5 SALES_Q1_1998        1      3     90   90    1     1   90 TO_DATE('1998-04-01...
...
   27 SALES_Q3_2003        0      0      0    0    0     0    0 TO_DATE('2003-10-01...
   28 SALES_Q4_2003        0      0      0    0    0     0    0 TO_DATE('2004-01-01...

With this information, you can answer almost any question about the data. It is best if these statistics come from your production database, where the SQL you are writing is executed. If your development database doesn’t have a copy of the production statistics, it’s a good idea to request that the production stats be imported into the development database so that the optimizer formulates plans based on information that is as close to production as possible. Even if the data don’t match, remember that it’s the statistics that the optimizer uses to determine the plan.
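How you move the statistics depends on your environment, but the DBMS_STATS package supports exporting and importing them through a staging table. The following is only a rough sketch; the schema name SH and the staging table name STATS_STAGE are assumptions:

-- On production: create a staging table and export the schema statistics into it
exec DBMS_STATS.CREATE_STAT_TABLE(ownname => 'SH', stattab => 'STATS_STAGE')
exec DBMS_STATS.EXPORT_SCHEMA_STATS(ownname => 'SH', stattab => 'STATS_STAGE')

-- Copy the STATS_STAGE table to development (for example, with Data Pump), then import
exec DBMS_STATS.IMPORT_SCHEMA_STATS(ownname => 'SH', stattab => 'STATS_STAGE')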

Now that you’ve obtained the statistics, you can use the information to ask, and answer, questions about what you expect the optimizer to do with your SQL. For example, if you were writing a query that needed to return all sales data for a specified customer (cust_id), you might want to know how many rows the optimizer estimates the query to return. With the statistics information you have queried, you could compute the number of rows estimated to be returned by the query to be 130 (918,843 total rows × 1/7059 distinct values). You can see there is an index on cust_id, so the proper access operation to use to satisfy the query should be the SALES_CUST_BIX index. When you execute the query, you can verify this operation is selected by checking the execution plan.
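You can also check that arithmetic against the optimizer’s own estimate before running the query. Here is a quick sketch; the cust_id value is made up for illustration, and it assumes you are connected as the SH schema owner:

-- Expected cardinality is roughly num_rows x density: 918,843 x (1/7059), or about 130 rows
explain plan for
select * from sales where cust_id = 100280;

select * from table(dbms_xplan.display);
-- The Rows estimate for the SALES access step should be close to 130, and you would
-- expect the plan to access the table through the SALES_CUST_BIX bitmap index.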

In Chapter 3, I discussed the index statistic called clustering factor. This statistic helps the optimizer compute how many blocks of data are accessed. Basically, the closer the clustering factor is to the number of blocks in the table, the fewer the estimated number of blocks to be accessed when using the index. The closer the clustering factor is to the number of rows in the table, the greater the estimated number of blocks. The fewer blocks to be accessed, the lower the cost of using that index and the more likely it is that the optimizer chooses that index for the plan. Therefore, you can check this statistic to determine how favorable the index will appear. Listing 5-4 shows the clustering factor statistics for the SALES table.

Listing 5-4.  Index clustering_factor

SQL> select t.table_name||'.'||i.index_name idx_name,
 2          i.clustering_factor, t.blocks, t.num_rows
 3     from user_indexes i, user_tables t
 4    where i.table_name = t.table_name
 5      and t.table_name = 'SALES'
 6    order by t.table_name, i.index_name;

IDX_NAME                 Clustering Factor        # Blocks          # Rows
------------------------ ----------------- --------------- ---------------
SALES.SALES_CHANNEL_BIX                 92            1769          918843
SALES.SALES_CUST_BIX                 35808            1769          918843
SALES.SALES_PROD_BIX                  1074            1769          918843
SALES.SALES_PROMO_BIX                   54            1769          918843
SALES.SALES_TIME_BIX                  1460            1769          918843

5 rows selected.

In this case, the clustering factors for all the indexes for the SALES table have a low value (in other words, closer to the number of blocks in the table). This is a good indication that when the optimizer computes the cost of using these indexes, they are not weighted too heavily based on the estimated number of blocks they return if used.

In addition to using statistics, you can execute actual queries against the tables to get an idea of the data and number of rows to be accessed or returned from a single table. Regardless of how complex a statement is, you can do just what the optimizer does and break down the statement into single table accesses. For each table involved, simply execute one or more queries to count and review the data to be returned using the filter conditions your SQL uses. Always think “divide and conquer.” Breaking down a statement into small increments helps you understand how best to put it together in the most efficient way to arrive at the final result.
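For example, before assembling a larger statement against the SH schema, you might probe each filter condition on its own. This is just a hypothetical sketch; the filter values are made up:

-- How many rows does each single-table filter return?
select count(*) from sales     where cust_id = 100280;
select count(*) from sales     where time_id >= date '2001-01-01';
select count(*) from customers where cust_id = 100280;

-- Review a sample of the rows as well, not just the counts
select * from sales where cust_id = 100280 and rownum <= 10;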

Building Logical Expressions

When you understand the question that the statement you are writing needs to answer, you have to be able to build the SQL to provide the answer. There are often many possible ways to express the same predicate logic. Being able to formulate the conditions in a way that is easy to read and efficient requires you to think in ways you may not be used to. Remember when I discussed the idea of thinking in sets vs. thinking procedurally in Chapter 4? There is a similar thought shift you may need to make to be able to build predicates for your SQL statements most efficiently.

The key is to learn some good Boolean logic techniques so that you don’t have to rely on only one way to express conditional logic. You may find that using Boolean logic expressions doesn’t always produce the most efficient plan operation choices (make sure to test alternatives thoroughly), so it’s good to know how to formulate different alternatives so you aren’t stuck with a single way to do things.

When I say conditional logic, I mean an expression something like, “if X then Y” where X and Y are both conditions. In a WHERE clause, you might want to have a condition like if :GetAll <> 1 then empno = :empno. In other words, if the value of the input bind variable named :GetAll is 1, then you want to return all rows, but if :GetAll is not 1, then only return rows in which empno is equal to the :empno bind variable supplied. A WHERE clause to express this logic might be coded like this:

WHERE empno = CASE WHEN :GetAll <> 1 THEN :empno ELSE empno END

This logic works, but it’s a bit counterintuitive to me. Why would you even want to check empno = empno? There are other problems with this kind of formulation as well. If you need to check multiple columns, then you need multiple CASE statements. Plus, if empno is null, this check is going to fail, or at the very least give you a result you don’t expect.
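To see the NULL problem concretely, consider the following small sketch. The inline row set is hypothetical, and the bind variables are assumed to be defined the same way as in Listing 5-5 later in this chapter:

with t as (
  select 100 empno, 'A' ename from dual
  union all
  select null empno, 'B' ename from dual
)
select empno, ename
  from t
 where empno = case when :GetAll <> 1 then :empno else empno end;
-- With :GetAll = 1, the intent is "return every row," yet the row with the NULL empno
-- disappears because the ELSE branch reduces to NULL = NULL, which is not TRUE.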

The key is to change this expression to use a regular Boolean expression that uses only AND, OR, and NOT so that your “if X then Y” condition is translated to “(Not X) or Y,” which becomes the following:

WHERE (:GetAll = 1) OR (empno = :empno)

What you are covering with this expression is that if :GetAll = 1, then you don’t even want to bother with checking any more of the expression. Always remember that when using an OR condition, if one condition evaluates to TRUE, then the whole expression is TRUE. There is no need to check the remaining expression. This “short-circuit” mechanism can save time by not requiring some of the code path to be evaluated, which means you won’t burn as many CPU cycles overall. Only if the first condition that the optimizer chooses to test evaluates to FALSE would the other expression need to be evaluated. Regardless of the method you choose, each of the options just discussed results in the optimizer using a full table scan and filtering the data based on your condition.

Although you’re not looking at expressions involving ANDed conditions in these examples, you can apply similar thinking to the use of ANDed predicates. When using an AND condition, if the first condition evaluates to FALSE, then the whole expression is FALSE. There is no need to evaluate the second expression because both conditions must evaluate to TRUE for the whole condition to be TRUE. So, when you’re using AND conditions, it’s a good idea to write the condition so the expression that is most likely to evaluate to FALSE is placed first. Doing so allows the second expression evaluation to be short-circuited with similar savings as noted when placing a TRUE expression first in an OR condition. In reality, the cost-based optimizer evaluates and determines the order in which the conditions are evaluated. When you write the conditions in a certain order, you are simply helping yourself define your expectations of what should happen.
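For example, when you write the conditions yourself, placing the condition most likely to be FALSE first documents that intent, even though the optimizer may still reorder the evaluation. The table, column names, and function in this sketch are made up purely for illustration:

select order_id
  from orders
 where status = 'CANCELED'            -- cheap check that rejects most rows
   and slow_check(order_id) = 'Y';    -- expensive check, ideally evaluated only when needed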

A similar way of approaching this type of conditional expression is to use a single bind variable instead of two. In this case, you would say “if X is not null then Y = X.” This becomes as follows:

WHERE empno = NVL(:empno, empno)

This is basically the same as writing the CASE expression from the earlier example and can be converted to the following:

WHERE (:empno is null) OR (empno = :empno)

In both these cases, the optimizer may have a bit of a dilemma with determining the optimal plan. The reason is that if the binds you use cause the comparison to end up returning all rows, then the plan operation best suited for that would likely be a full table scan. However, if you specify binds that end up limiting the result set, an index scan might be best. Because you’re using bind variables, each time you execute the query, the input bind values could change. So, the optimizer has to choose a plan that covers both situations. Most likely, you end up with a full table scan. Listings 5-5 and 5-6 demonstrate one scenario for each of the similar alternatives I covered and show the execution plan output for each.

Listing 5-5.  Using Two Bind Variables to Create a Conditional WHERE Clause

SQL> variable empno number
SQL> variable getall number
SQL>
SQL> exec :empno := 7369;

PL/SQL procedure successfully completed.

SQL>
SQL> exec :getall := 1;

PL/SQL procedure successfully completed.

SQL>
SQL> select /* opt1 */ empno, ename from emp
  2  where empno = CASE WHEN :GetAll <> 1 THEN :empno ELSE empno END;

          EMPNO ENAME
--------------- ----------
           7369 SMITH
           7499 ALLEN
           7521 WARD
           7566 JONES
           7654 MARTIN
           7698 BLAKE
           7782 CLARK
           7788 SCOTT
           7839 KING
           7844 TURNER
           7876 ADAMS
           7900 JAMES
           7902 FORD
           7934 MILLER

14 rows selected.

SQL>
SQL> @pln opt1

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------
SQL_ID  gwcmrzfqf8cu2, child number 0
-------------------------------------
select /* opt1 */ empno, ename from emp where empno = CASE WHEN :GetAll
<> 1 THEN :empno ELSE empno END

Plan hash value: 3956160932

-----------------------------------------------------------------------
| Id  | Operation         | Name | Starts | E-Rows | A-Rows | Buffers |
-----------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |      1 |        |     14 |       8 |
|*  1 |  TABLE ACCESS FULL| EMP  |      1 |      1 |     14 |       8 |
-----------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("EMPNO"=CASE  WHEN (:GETALL<>1) THEN :EMPNO ELSE "EMPNO" END )

19 rows selected.

Listing 5-6.  Using One Bind Variable to Create a Conditional WHERE Clause

SQL> exec :getall := 0;

PL/SQL procedure successfully completed.

SQL>
SQL> select /* opt5 */ empno, ename from emp
  2  where empno = NVL(:empno, empno);

          EMPNO ENAME
--------------- ----------
           7369 SMITH

1 row selected.

SQL>
SQL> @pln opt5

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------
SQL_ID  605p3gyjbw82b, child number 0
-------------------------------------
select /* opt5 */ empno, ename from emp where empno = NVL(:empno, empno)

Plan hash value: 1977813858

-------------------------------------------------------------------------------------
| Id  | Operation                     | Name   | Starts | E-Rows | A-Rows | Buffers |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |        |      1 |        |      1 |       2 |
|   1 |  CONCATENATION                |        |      1 |        |      1 |       2 |
|*  2 |   FILTER                      |        |      1 |        |      0 |       0 |
|   3 |    TABLE ACCESS BY INDEX ROWID| EMP    |      0 |     14 |      0 |       0 |
|*  4 |     INDEX FULL SCAN           | PK_EMP |      0 |     14 |      0 |       0 |
|*  5 |   FILTER                      |        |      1 |        |      1 |       2 |
|   6 |    TABLE ACCESS BY INDEX ROWID| EMP    |      1 |      1 |      1 |       2 |
|*  7 |     INDEX UNIQUE SCAN         | PK_EMP |      1 |      1 |      1 |       1 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(:EMPNO IS NULL)
   4 - filter("EMPNO" IS NOT NULL)
   5 - filter(:EMPNO IS NOT NULL)
   7 - access("EMPNO"=:EMPNO)

27 rows selected.

SQL>
SQL> select /* opt6 */ empno, ename from emp
  2  where (:empno is null) OR (:empno = empno);

          EMPNO ENAME
--------------- ----------
           7369 SMITH

1 row selected.

SQL>
SQL> @pln opt6

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------
SQL_ID  gng6x7nrrrhy9, child number 0
-------------------------------------
select /* opt6 */ empno, ename from emp where (:empno is null) OR
(:empno = empno)

Plan hash value: 3956160932

-----------------------------------------------------------------------
| Id  | Operation         | Name | Starts | E-Rows | A-Rows | Buffers |
-----------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |      1 |        |      1 |       8 |
|*  1 |  TABLE ACCESS FULL| EMP  |      1 |      2 |      1 |       8 |
-----------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter((:EMPNO IS NULL OR "EMPNO"=:EMPNO))

Note that if you used the format WHERE (:GetAll = 1) OR (empno = :empno) instead, you get the same plan shown in Listing 5-5. There is no real difference between the two.

In this example, note that there is a difference between the plan when you use WHERE empno = NVL(:empno, empno) and the plan when you use WHERE (:empno is null) OR (:empno = empno). For the example in Listing 5-5, in which there are two bind variables, the optimizer chooses a full table scan operation. But notice what happens with the single bind variable in Listing 5-6: the optimizer uses a CONCATENATION plan for the NVL predicate and a full table scan for the Boolean expression. The CONCATENATION plan is the best in this case because when the bind variable is null, the plan executes the INDEX FULL SCAN operation to get all the rows, and when the bind variable is not null, the plan executes the INDEX UNIQUE SCAN operation to get just the one row that is needed. In this way, both cases use an optimal execution path.

In this case, the Boolean logic didn’t give you the best plan, so it’s good to know several alternative ways to formulate the predicate so you can work to achieve the best possible plan. With this in mind, you actually could have written the query as shown in Listing 5-7.

Listing 5-7.  Using UNION ALL to Handle Conditional Logic

SQL> select /* opt9 */ empno, ename from emp
  2  where :empno is null
  3  union all
  4  select empno, ename from emp
  5  where :empno = empno;

          EMPNO ENAME
--------------- ----------
           7369 SMITH

1 row selected.

SQL>
SQL> @pln opt9

PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------
SQL_ID  ab0juatnpc5ug, child number 0
-------------------------------------
select /* opt9 */ empno, ename from emp where :empno is null union all
select empno, ename from emp where :empno = empno

Plan hash value: 2001993376

------------------------------------------------------------------------------------
| Id  | Operation                    | Name   | Starts | E-Rows | A-Rows | Buffers |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |        |      1 |        |      1 |       2 |
|   1 |  UNION-ALL                   |        |      1 |        |      1 |       2 |
|*  2 |   FILTER                     |        |      1 |        |      0 |       0 |
|   3 |    TABLE ACCESS FULL         | EMP    |      0 |     14 |      0 |       0 |
|   4 |   TABLE ACCESS BY INDEX ROWID| EMP    |      1 |      1 |      1 |       2 |
|*  5 |    INDEX UNIQUE SCAN         | PK_EMP |      1 |      1 |      1 |       1 |
------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(:EMPNO IS NULL)
   5 - access("EMPNO"=:EMPNO)

Similar to the CONCATENATION plan, in this case you get a plan in which two separate subplans are “unioned” together to get the result. If the bind variable is null, you get a full scan operation and get all rows returned. When the bind variable is not null, you get the unique index scan and return only the one row needed. The FILTER operation acts to determine whether the first subplan should be executed. Notice the Predicate Information section in which step 2 shows filter(:EMPNO IS NULL), indicating that only if the bind is null does the operation actually happen.

In general, you’ll find that the optimizer is able to make better plan operation choices when AND conditions are used. As mentioned earlier, this is because an OR condition means there could be two different possible operations that could be used based on how the expression evaluates. With an AND condition, it is more likely that only a single choice, or at least choices that are not opposite in nature, are considered. So, if you can figure out a way to formulate your predicates to use ANDed conditions only, you may find that the SQL produces more efficient plans and is even easier to maintain.

Also, if you are writing SQL statements inside a larger code body, such as in a PL/SQL procedure, use conditional constructs in the language and don’t put that logic in the SQL. The simpler you can make your SQL, and the fewer conditions that have to be handled in the statement directly, the less complexity the optimizer needs to sort through to determine an optimal plan.
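For instance, rather than embedding the “:empno is null” logic in a single statement, a PL/SQL block can branch first and run a simple query in each branch. This is only a sketch; the procedure name and the use of DBMS_OUTPUT are assumptions:

create or replace procedure print_emp (p_empno in number) is
begin
  if p_empno is null then
    for r in (select empno, ename from emp) loop
      dbms_output.put_line(r.empno || ' ' || r.ename);
    end loop;
  else
    for r in (select empno, ename from emp where empno = p_empno) loop
      dbms_output.put_line(r.empno || ' ' || r.ename);
    end loop;
  end if;
end print_emp;
/

Each embedded query now contains a single, simple predicate, so the optimizer can choose the full scan for one cursor and the unique index scan for the other independently.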

Summary

Questions are an important part of the process of writing good SQL. You begin by understanding the question the SQL needs to answer, then you follow up by asking questions about the data to formulate a SQL statement that is functionally correct as well as optimized for performance. The ability to ask good questions is an intellectual habit that must be developed over time. The more you work to ask questions that clarify and enhance your understanding of what you need to do, the greater your skills as a writer of high-quality, high-performing SQL become.

1 See www3.interscience.wiley.com/journal/122513384/abstract?CRETRY=1&SRETRY=0, www.telegraph.co.uk/health/healthnews/5857845/It-takes-66-days-to-form-a-habit.html,  and www.spring.org.uk/2009/09/how-long-to-form-a-habit.php.
