CHAPTER 16

Unlearning Some Things We Thought We Knew

“It ain't what you don't know that gets you into trouble. It's what you know for sure that just ain't so.”

~ Mark Twain

Oracle does some things very differently when running on Exadata than when running on non-Exadata platforms. The optimizations provided by Exadata are designed to take a different approach than Oracle has traditionally followed. This change means that we need to attack some problems with a completely different mindset. That is not to say that everything is different. In fact, most of the fundamental principles remain unchanged. After all, the same database software runs on Exadata that runs on other platforms. But there are some things that are just fundamentally different.

This chapter focuses on how we should change our thinking when running databases on Exadata.

A Tale of Two Systems

The way we think about systems running on Exadata largely depends on the workload being executed. OLTP-oriented workloads focus our attention on using flash cache to speed up small reads. Frankly, though, this type of workload cannot exploit most of the optimizations Exadata provides. DW-oriented workloads focus us on making use of Smart Scans at every opportunity and on trying to use all the available resources (CPU on both the storage and database tiers, for example). This is where Exadata's built-in advantages can have the most impact.

Unfortunately, most systems exhibit characteristics of both DW and OLTP workloads. These “mixed workloads” are the most difficult to manage and, ironically, the most common. They are the most common because it is rare to see an OLTP-oriented system that doesn't have some reporting component that produces long-running, throughput-sensitive queries, and it is also quite common to see DW-oriented workloads with OLTP-like trickle feeds. These combination systems require the most difficult thought process because, depending on the issue at hand, you will have to keep resetting your approach to managing performance. They are also difficult because DW workloads are generally constrained by the data-flow dynamics that control throughput, while OLTP systems are usually constrained by latency. So for mixed workloads, you need to train yourself to evaluate each scenario and categorize it as either latency-sensitive or throughput-sensitive before beginning any analysis.

OLTP-Oriented Workloads

There is little to say about running OLTP workloads on Exadata. Since Exadata runs standard Oracle database software, you should not have to adjust your basic approach significantly. There are only a handful of points to keep in mind with this type of system.

Exadata Smart Flash Cache (ESFC)

The key component of Exadata when it comes to OLTP workloads is Exadata Smart Flash Cache (ESFC), which can significantly reduce disk access times for small reads. For that reason, it is important to verify that ESFC is working correctly. For this type of workload you should also expect a large percentage of physical I/O operations to be satisfied by ESFC. This can be inferred fairly easily by looking at the average single-block read times. A single-block read should take around 0.5 ms if it is satisfied by flash cache; by contrast, single-block reads average around 5 ms if they are satisfied by actual disk reads. Standard AWR reports provide both average values and a histogram of wait events. If the average single-block read times are well above the 1 ms range, you should look for a systemic problem, such as flash cards that are not working, or a critical table that has been defined to never be cached (using the CELL_FLASH_CACHE NONE storage clause). The histograms should be used as well, to verify that the average is not covering up a significant number of outliers. Here's the cellcli syntax to check the status of the flash cards:

CellCLI> list flashcache detail
      name:                   dm01cel03_FLASHCACHE
      cellDisk:
FD_00_dm01cel03,FD_01_dm01cel03,FD_02_dm01cel03,FD_03_dm01cel03,
FD_04_dm01cel03,FD_05_dm01cel03,FD_06_dm01cel03,FD_07_dm01cel03,
FD_08_dm01cel03,FD_09_dm01cel03,FD_10_dm01cel03,FD_11_dm01cel03,
FD_12_dm01cel03,FD_13_dm01cel03,FD_14_dm01cel03,FD_15_dm01cel03
      creationTime:           2010-03-22T17:39:46-05:00
      id:                     850be784-714c-4445-91a8-d3c961ad924b
      size:                   365.25G
      status:                 critical

Note that the status attribute on this cell is critical. As you might expect, this is not a good thing. On this particular system, the flash cache had basically disabled itself. We noticed it because the single-block read times had slowed down. This example is from an early version of cellsrv. Later versions include a little more information. Here is an example from cellsrv 11.2.2.3.1:

CellCLI> list flashcache detail
         name:                   enkcel03_FLASHCACHE
         cellDisk:
FD_14_cell03,FD_08_cell03,FD_13_cell03,FD_11_cell03,FD_02_cell03,
FD_04_cell03,FD_01_cell03,FD_00_cell03,FD_15_cell03,FD_03_cell03,
FD_12_cell03,FD_10_cell03,FD_05_cell03,FD_09_cell03,FD_07_cell03,
FD_06_cell03
         creationTime:           2011-03-27T07:51:04-05:00
         degradedCelldisks:
         effectiveCacheSize:     365.25G
         id:                     7f60871e-d5f2-4990-b7e5-30c4a238de12
         size:                   365.25G
         status:                 normal

Notice the new attribute degradedCelldisks. Also notice that the flash cache on this cell shows a status of normal. Monitoring storage software behavior is covered in more detail in Chapter 12.
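
If you prefer to check read latencies from the database side, the wait event histogram views offer a quick sanity check. Here is a minimal sketch; the event name is the standard Exadata single-block read event, and the 1 ms threshold is our rule of thumb rather than an Oracle-defined limit:

SELECT event, wait_time_milli, wait_count
  FROM v$event_histogram
 WHERE event = 'cell single block physical read'
 ORDER BY wait_time_milli;

A healthy OLTP system with a warm flash cache should show the bulk of the wait counts in the 1 ms and lower buckets; a pile-up in the 4 ms to 16 ms buckets suggests the reads are going to spinning disk.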

Scalability

Another thing to keep in mind when dealing with OLTP workloads is that the Exadata platform provides exceptional scalability. Upgrading from a half rack to a full rack doubles the number of CPUs at both the database layer and the storage layer. The amount of ESFC is also doubled, as is the available memory. This allows Exadata to scale in a nearly linear fashion for many systems.

Kevin Says: The authors are correct that doubling from a half rack to a full rack is perfect hardware scale-up. I urge caution, however, in presuming that the software—specifically Real Application Clusters (RAC)—will scale linearly with all applications. RAC has been around for over ten years, so let's not throw away our understanding of its scalability characteristics when considering Exadata. There is nothing about Exadata architecture that changes the intrinsic RAC scalability characteristics.

Write-Intensive OLTP Workloads

Write-intensive workloads are a subset of OLTP-oriented systems. Some systems just bang away at single-row inserts. These systems are often limited by the speed at which commits can be done, which often depends on the speed with which writes to the log files can be accomplished. This is one area where Exadata competes with other platforms on a fairly even playing field. There are no major enhancements that make Exadata run orders of magnitude faster for systems that are bottlenecked on write operations. This means there is no magic bullet; traditional methods of tuning, such as minimizing commits, are appropriate for these types of systems.

Kevin Says: Write-intensive workloads warrant deep thought. The authors point out that there are no major enhancements that make Exadata perform orders of magnitude better than Oracle systems connected to conventional storage. There are, in fact, no attributes of Exadata that favor write-intensive workloads as of the publication date of this book. The V2 and X2 models are capable of servicing over 1 million random single-block reads per second—from Exadata Smart Flash Cache on the full rack configuration. Random writes, on the other hand, have to go to spinning disks, of which there are 168 in the full rack configurations. The high-performance SAS drives can service roughly 300 random write IOPS—as long as the seeks are short. However, writes must be redundant, and thus the net random write bandwidth available to databases is roughly 25,000 IOPS. To scale an application to 1 million read IOPS on Exadata, the imposed read:write ratio is 40 to 1. That is, Exadata is 40-fold more read-capable than write-capable for random I/O. Applications that lean heavily on random writes will be the first to suffer this inherent imbalance. This imbalance is the reason some deployments require the flash assets of the cells to be provisioned as flash grid disks. Committing flash capacity in the static grid disk fashion, however, eats into the value proposition of Exadata: every byte of storage committed to flash grid disk is less capacity available for the dynamic, intelligent flash cache algorithms. Deploying Oracle is often a balancing act, even in the Exadata environment. This book should equip readers with sufficient skills to use Exadata assets wisely.

DW-Oriented Workloads

Exadata was designed to speed up long-running queries against large volumes of data. So it should come as no surprise that data warehousing is where it's important that we change some of our basic thought processes. Of course, one of the major techniques is to be constantly looking for opportunities to allow Exadata optimizations to kick in. This means making sure that the application can take advantage of Smart Scans.

Enabling Smart Scans

The most important concept to keep in mind when dealing with DW-oriented workloads is that long-running statements should usually be offloaded. The steps to follow are:

  1. Determine whether Smart Scans are being used.
  2. If Smart Scans are not being used, fix things so that they will be used.

These points seem so obvious that we shouldn't need to repeat them. Yet a large portion of the optimizations built into Exadata work only when Smart Scans are used. So one of the first changes you need to make in the way you think is to train yourself to constantly consider whether Smart Scans are being used appropriately. This means you need a good understanding of which statements (or parts of statements) can be offloaded, and the ability to determine whether statements are actually being offloaded. We have covered the requirements for Smart Scans, and the techniques for verifying whether they are being performed, throughout the book. But at the risk of being repetitive, we'll mention them again here.

There are two main requirements that must be met in order for Smart Scans to occur. The first is that the optimizer must choose to do a full scan of a table or a materialized view, or the optimizer must choose to do a fast full scan of an index. Note that Smart Scans are not limited to queries or even to subqueries. The optimizer can also choose to use full scans for DELETEs and UPDATEs when a large percentage of the rows will be affected. However, if your application is doing this, you might want to consider modifying it to do something like a truncate and rebuild.

The second requirement for Smart Scans is that the scans must be performed using the direct path read mechanism. Note that we intentionally didn't mention the optimizer in the description of the second requirement. This is because the optimizer does not make the decision about whether to use direct path reads or not. It is a heuristic decision that is made after the plan has been determined. As such, it is not directly exposed by EXPLAIN PLAN or similar plan-display tools. What this means in practice is that it's easy to verify that the first requirement has been met, but more challenging to verify the second.
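
One practical way to confirm that both requirements were met is to check your session's statistics immediately after running the statement in question. This is only a sketch (statistic names can vary slightly between versions), using the same v$mystat join that appears elsewhere in this chapter:

SELECT n.name, s.value
  FROM v$mystat s, v$statname n
 WHERE n.statistic# = s.statistic#
   AND n.name IN ('cell scans',
                  'cell physical IO bytes eligible for predicate offload',
                  'cell physical IO interconnect bytes returned by smart scan');

If the cell scans statistic does not increase when your query runs, no Smart Scan occurred, regardless of what the execution plan suggested.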

On most Exadata implementations, a fairly high percentage of long-running queries are offloaded. You can check what percentage of your long-running SQL statements have been offloaded by selecting all the statements from v$sql (or from the AWR table DBA_HIST_SQLSTAT) that have an average run time over some number of seconds, or an average logical I/O value greater than some reasonable threshold. Logical I/O is actually the better metric, as some offloaded statements run very quickly and may not meet your minimum time criterion, which would give you a distorted perspective on the percentage of statements being offloaded. Here's an example (note that the scripts are included in the online code repository):

SYS@SANDBOX1> @offload_percent
Enter value for sql_text:
Enter value for min_etime:
Enter value for min_avg_lio: 500000

     TOTAL  OFFLOADED OFFLOADED_%
---------- ---------- -----------
        13         11      84.62%

SYS@SANDBOX1> /
Enter value for sql_text: SELECT%
Enter value for min_etime:
Enter value for min_avg_lio: 500000

     TOTAL  OFFLOADED OFFLOADED_%
---------- ---------- -----------
        11         11        100%

The listing makes use of the offload_percent.sql script, which calculates the percentage of statements currently in the shared pool that have been offloaded. We initially used it to evaluate all statements that had over 500,000 logical I/Os. We then ran it a second time and limited the investigation to statements that begin with the word SELECT. In the next listing we've made use of a script (fsxo.sql) that lets us see the actual statements contributing to the OFFLOADED_% calculated in the previous listing.

SYS@SANDBOX1> @fsxo
Enter value for sql_text:
Enter value for sql_id:
Enter value for min_etime:
Enter value for min_avg_lio: 500000
Enter value for offloaded:

SQL_ID         EXECS  AVG_ETIME OFFLOAD IO_SAVED_% SQL_TEXT
----------------------------------------------------------------------------
0bvt5z48t18by      1        .10 Yes         100.00 select count(*) from skew3 whe
0jytfr1y0jdr1      1        .09 Yes         100.00 select count(*) from skew3 whe
12pkwt8zjdhbx      1        .09 Yes         100.00 select count(*) from skew3 whe
2zbt555tg123s      2       4.37 Yes          71.85 select /*+ parallel (a 8) */ a
412n404yughsy      1        .09 Yes         100.00 select count(*) from skew3 whe
5zruc4v6y32f9      5      51.13 No             .00 DECLARE job BINARY_INTEGER :=
6dx247rvykr72      1        .10 Yes         100.00 select count(*) from skew3 whe
6uutdmqr72smc      2      32.83 Yes          71.85 select /* avgskew3.sql */ avg(
7y09dtyuc4dbh      1       2.87 Yes          71.83 select avg(pk_col) from kso.sk
b6usrg82hwsa3      5      83.81 No             .00 call dbms_stats.gather_databas
fvx3v0wpvxvwt      1      11.05 Yes          99.99 select count(*) from skew3 whe
gcq9a53z7szjt      1        .09 Yes         100.00 select count(*) from skew3 whe
gs35v5t21d9yf      1       8.02 Yes          99.99 select count(*) from skew3 whe

13 rows selected.

The fsxo.sql script provides the same limiting factors as the offload_percent.sql script, namely a minimum average elapsed time and a minimum average logical I/O. It also optionally allows you to limit the statements to only those that are offloaded or those that are not offloaded. Please refer to the scripts for further details and keep in mind that these techniques can also be applied to the data recorded by AWR for a historical perspective.
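
For the historical perspective, the AWR repository carries offload-related columns that support the same kind of analysis. As a minimal sketch (assuming a version that populates these columns and the appropriate licensing), something like the following shows which captured statements were eligible for offload and how much data actually came back from the cells:

SELECT sql_id,
       SUM(io_offload_elig_bytes_total)   elig_bytes,
       SUM(io_offload_return_bytes_total) return_bytes
  FROM dba_hist_sqlstat
 GROUP BY sql_id
HAVING SUM(io_offload_elig_bytes_total) > 0
 ORDER BY elig_bytes DESC;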

In the next section we'll discuss some issues that can complicate your efforts to enable Smart Scans.

Things that Can Cripple Smart Scans

There are several common coding “techniques” that either disable Smart Scans completely or cause them to be much less effective than they could be. Some of the techniques are just bad practices regardless of whether you are on the Exadata platform or not. Others don't carry as significant a penalty on non-Exadata platforms, but when run on Exadata they can prevent the storage software from doing all that it could do. We've discussed many of these issues throughout this book. Here are a few that you should keep in mind because of the fundamentally different behavior on the Exadata platform.

Functions in WHERE Clauses

Oracle provides a large set of functions that can be applied directly in SQL statements. As we discussed in Chapter 2, not all of those functions are offloadable. Knowing which functions are not offloadable is important because the use of those functions in WHERE clauses disables predicate filtering that may otherwise provide a massive reduction in the amount of data to be transferred back to the database layer. Obviously, custom-written PL/SQL functions fall into the category of “non-offloadable” functions as well.

This issue is somewhat counterintuitive, since we are often doing full table scans anyway with data warehouse systems. On non-Exadata platforms, applying a function in the WHERE clause of a statement that is executed via a full table scan does not impose much of a penalty with regard to the amount of data that must be returned, because the database must already return all blocks from the table to the database server. With Exadata, though, applying a function that disables predicate filtering can impose a huge performance penalty. By the way, using custom PL/SQL functions in a WHERE clause is generally a bad idea on non-Exadata platforms too, because additional CPU is required to run PL/SQL for each row, as opposed to Oracle's built-in functions, which are implemented in optimized C code.

Note: You can query V$SQLFN_METADATA to see which functions are offloadable.
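
For example, a quick check of a few common functions might look like this (the OFFLOADABLE column contains YES or NO):

SELECT DISTINCT name, offloadable
  FROM v$sqlfn_metadata
 WHERE name IN ('UPPER', 'DECODE', 'SYSDATE')
 ORDER BY name;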

Furthermore, “offloadable” functions can also impose large performance penalties. Here's a very simple example showing the negative effect of an offloadable function in a WHERE clause:

SYS@SANDBOX1> select count(*) from kso.skew3 where col2='KSO';

  COUNT(*)
----------
         9

Elapsed: 00:00:00.13
SYS@SANDBOX1> select count(*) from kso.skew3 where upper(col2)='KSO';

  COUNT(*)
----------
         9

Elapsed: 00:00:13.16
SYS@SANDBOX1> @fsx4
Enter value for sql_text: select count(*) from kso.skew3 where%KSO%
Enter value for sql_id:

SQL_ID         CHILD OFFLOAD IO_SAVED_%  AVG_ETIME SQL_TEXT
--------------------------------------------------------------------------------------
0u8gnxmm7cbdk      0 Yes         100.00        .12 select count(*) from kso.skew3 where col
37t7fzkajz2fc      0 Yes          99.99      13.16 select count(*) from kso.skew3 where upp

Elapsed: 00:00:00.08

So UPPER is an offloadable function. Yet when we used it in the WHERE clause of this particular statement, the result was a large degradation in performance. If you've been following along closely, you may already have a pretty good idea why.

SYS@SANDBOX1> select name, value from v$mystat s, v$statname n
  2  where n.statistic# = s.statistic#
  3  and name like '%storage%';

NAME                                                                        VALUE
--------------------------------------------------------------------------------
cell physical IO bytes saved by storage index                                   0

SYS@SANDBOX1> set timing on
SYS@SANDBOX1> select count(*) from kso.skew3 where upper(col2)='KSO';


  COUNT(*)
----------
         9

Elapsed: 00:00:13.29
SYS@SANDBOX1> select name, value from v$mystat s, v$statname n
  2  where n.statistic# = s.statistic#
  3  and name like '%storage%';

NAME                                                                        VALUE
--------------------------------------------------------------------------------
cell physical IO bytes saved by storage index                                   0

Elapsed: 00:00:00.01
SYS@SANDBOX1> select count(*) from kso.skew3 where col2='KSO';

  COUNT(*)
----------
         9

Elapsed: 00:00:00.10
SYS@SANDBOX1> select name, value from v$mystat s, v$statname n
  2  where n.statistic# = s.statistic#
  3  and name like '%storage%';

NAME                                                                        VALUE
--------------------------------------------------------------------------------
cell physical IO bytes saved by storage index                         15998312448

Elapsed: 00:00:00.01

Storage indexes are disabled by functions, just like regular indexes. This is not too surprising, but again, when we're already doing a full scan, we've trained ourselves to not worry about functions in the WHERE clause that could disable indexes. Exadata is different.

Chained Rows

This is a very broad generalization, but basically any Oracle processing that requires reading an extra block to complete a row causes the Exadata storage software to revert to block shipping, or pass-through mode. We have discussed this in several places in the previous chapters. A simple example is a chained row, but there are other situations that can cause Oracle to revert to pass-through mode. What this means in practice is that some operations that cause only slight delays on non-Exadata platforms can severely impact performance on Exadata. The primary diagnostic symptom of this issue is the presence of many single-block-read wait events in combination with cell smart scan wait events. In such situations, you may find that you are better off not using offloading for the statements in question. Here is an example showing where Oracle spends its time when selecting from a table with chained rows:


SYS@SANDBOX1> select num_rows, chain_cnt, avg_row_len from dba_tables
  2  where table_name = 'T_CHAINED2';

    NUM_ROWS    CHAIN_CNT  AVG_ROW_LEN
---------- ---------- -----------
      100000       100000        12037

1 row selected.

SYS@SANDBOX> alter system flush buffer_cache;

System altered.

SYS@SANDBOX> select avg(length(c)) from tanel.t_chained2;

AVG(LENGTH(C))
--------------
          4000

1 row selected.


Execution Plan

----------------------------------------------------------
Plan hash value: 1927022845

-----------------------------------------------------------------------------------------
| Id  | Operation                  | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |            |     1 |  4000 | 47533   (1)| 00:08:43 |
|   1 |  SORT AGGREGATE            |            |     1 |  4000 |            |          |
|   2 |   TABLE ACCESS STORAGE FULL| T_CHAINED2 |   100K|   381M| 47533   (1)| 00:08:43 |
-----------------------------------------------------------------------------------------


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
     332688  consistent gets
     332683  physical reads
          0  redo size
        341  bytes sent via SQL*Net to client
        448  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed


==== tkprof Output ====

SQL ID: 3upcma5nsnbcd Plan Hash: 1927022845

select avg(length(c))
from
 tanel.t_chained2


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        2      0.00       0.00          0          0          0           0
Execute      2      0.00       0.00          0          0          0           0
Fetch        4     23.39     110.56     332683     665392          0           2
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        8     23.39     110.56     332683     665392          0           2

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: SYS
Number of plan statistics captured: 2

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         1          1          1  SORT AGGREGATE (cr=332696 ...
    100000     100000     100000   TABLE ACCESS STORAGE FULL T_CHAINED2 (cr=332696 ...


Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                       4        0.00          0.00
  SQL*Net message from client                     4        0.10          0.21
  cell single block physical read            166208        0.05         93.25 <== wow!
  reliable message                                1        0.00          0.00
  enq: KO - fast object checkpoint                2        0.00          0.00
  cell smart table scan                         256        0.01          0.11
********************************************************************************

In this example we ran a query against a table in which every row was chained (admittedly a bit of an extreme example). We generated a 10046 trace during the execution of the statement and then ran tkprof on the trace file, which produced the list of wait events just shown. Notice that the statement had an elapsed time of almost two minutes (110 seconds), of which more than 93 seconds was spent doing single-block physical reads. The statement used a full table scan and was clearly offloaded, as indicated by the cell smart table scan wait events, but it spent most of its time doing single-block reads. Of course, the single-block reads were a result of the chained rows. The next listing shows the difference between running the query with offloading and without offloading.

==== turn off smart scans

SYS@SANDBOX1> alter session set "_serial_direct_read"=true;

Session altered.

Elapsed: 00:00:00.01
SYS@SANDBOX1> alter session set cell_offload_processing=false;

Session altered.

Elapsed: 00:00:00.00
SYS@SANDBOX1> alter system flush buffer_cache;

System altered.

Elapsed: 00:00:01.99
SYS@SANDBOX1> select avg(length(c)) from tanel.t_chained2;

AVG(LENGTH(C))
--------------
          4000

Elapsed: 00:00:08.78 <== chained rows, direct path reads, no smart scan


==== turn off direct path reads

SYS@SANDBOX1> alter session set "_serial_direct_read"=false;

Session altered.

Elapsed: 00:00:00.01
SYS@SANDBOX1> alter system flush buffer_cache;

System altered.

Elapsed: 00:00:00.14
SYS@SANDBOX1> select avg(length(c)) from tanel.t_chained2;

AVG(LENGTH(C))
--------------
          4000

Elapsed: 00:00:09.35 <== chained rows, single block reads, no smart scan

==== turn on smart scans

SYS@SANDBOX1> alter session set "_serial_direct_read"=true;

Session altered.

Elapsed: 00:00:00.00
SYS@SANDBOX1> alter session set cell_offload_processing=true;

Session altered.

Elapsed: 00:00:00.00
SYS@SANDBOX1> alter system flush buffer_cache;

System altered.

Elapsed: 00:00:00.08
SYS@SANDBOX1> select avg(length(c)) from tanel.t_chained2;

AVG(LENGTH(C))
--------------
          4000

Elapsed: 00:01:59.24 <== OUCH! smart scan is not good for lots of chained rows

SYS@SANDBOX1> @fsx4
Enter value for sql_text: select avg(length(c)) from tanel.t_chained2
Enter value for sql_id:

SQL_ID         CHILD OFFLOAD IO_SAVED_%  AVG_ETIME SQL_TEXT
--------------------------------------------------------------------------------------
3upcma5nsnbcd      0 Yes        -27.41     159.24 select avg(length(c)) from tanel.t_chain

Elapsed: 00:00:00.06

Notice that the statement was offloaded and that the IO_SAVED_% calculation produced a negative number. This shows that more bytes were returned by the storage cells than the total number of bytes that the table occupies. It's clear from this example that chained rows can be much more detrimental to performance when attempting to make use of Exadata's Smart Scans than to queries that are handled in normal block-shipping mode. It's worth noting that this is an extreme example, as every row was chained. In fact, many of the records spanned three blocks. In this extreme case, it is clear that avoiding offloading actually improves the performance significantly. Of course, eliminating the chained rows would be the best solution.
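
Because of this, it is worth checking for row chaining before assuming a Smart Scan will help a given table. As a sketch: the CHAIN_CNT column in DBA_TABLES is populated by ANALYZE (not by DBMS_STATS), and the session statistic table fetch continued row climbs as chained rows are processed.

-- Populate CHAIN_CNT for a suspect table. ANALYZE is deprecated for
-- gathering optimizer statistics, but it is still the way to get chain counts.
ANALYZE TABLE tanel.t_chained2 COMPUTE STATISTICS;

SELECT table_name, num_rows, chain_cnt
  FROM dba_tables
 WHERE table_name = 'T_CHAINED2';

-- Or watch the chained-row statistic for your own session.
SELECT n.name, s.value
  FROM v$mystat s, v$statname n
 WHERE n.statistic# = s.statistic#
   AND n.name = 'table fetch continued row';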

Very Busy Cells

When a storage cell becomes very busy with CPU operations, it is possible for the storage software to begin refusing to perform offload activities on some portion of the requests. That is, if a particular cell becomes extremely busy and is getting offloadable requests from a database server that is not particularly busy, it may decide to send some data back to the database server that has not been processed, or has been only partially processed. In some cases column projection may be done but not filtering; in other cases, cellsrv may revert to shipping entire blocks back to the database layer. So while this issue does not result in a complete shutdown of offloading, it can reduce the amount of work that is done on the storage tier.

This is an extremely complex topic, and it is difficult to observe the behavior directly. The goal of the feature is to utilize available CPU resources regardless of whether they are on the database tier or the storage tier. The behavior was introduced relatively recently, in cellsrv version 11.2.2.3.1 with databases running 11.2.0.2 with bundle patch 6 or later. A statistic called cell physical IO bytes pushed back due to excessive CPU on cell (available in 11.2.0.2 with BP6 and later) shows when this is happening. Note that the statistic name may change in a future version to something referring to “balancing CPU usage.” The feature is designed to improve throughput on very busy systems, but it may also cause some degree of instability in the performance of certain statements. It is possible to disable the feature if your cellsrv is erroneously deciding it is too busy to take on additional work; but in general, if you observe this behavior, you are probably getting close to the limits of the system. Adding additional resources at the storage layer (more storage cells) may be a viable option.
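
A quick way to see whether push-back has been occurring since instance startup is to look for the statistic in v$sysstat. The LIKE pattern below is deliberately loose because, as noted, the statistic name may change between versions:

SELECT name, value
  FROM v$sysstat
 WHERE name LIKE 'cell physical IO bytes pushed back%';

A nonzero value means the cells have been returning unprocessed (or partially processed) data to the database tier because of CPU pressure on the cells.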

Kevin Says: The authors are correct in their description of this new “feature.” However, there is one point I cannot stress enough, and that is that the most compute-intensive work in query processing must occur in the database grid; yet there are significantly fewer processors there than in the storage grid (for example, 1:1.75 in the X2-2 and 1:1.31 in the X2-8 models). In the end, the only thing that matters is whether the users' experience is satisfactory. However, diagnosing an unsatisfactory user experience may prove difficult when this feature is invoked. The feature has the propensity to pass as much as 40% of the physical I/O payload to the cells either completely unprocessed or only lightly processed (as in, for example, EHCC projection). So, there are some scenarios I would be particularly concerned about. Allow me to explain.

We are no longer living in the world of the 1990s, when Unix systems were physically constrained by card slots for memory and CPU, nor in the era of “ample CPU” bottlenecked by front-side-bus limitations. We have systems with vast amounts of memory (a rapidly increasing trend) and ample CPU. In short, the industry is delivering to us huge, fast, balanced systems. The servers in the Exadata Database Machine are no exception. The significant cost of Exadata includes a feature known as in-memory parallel query and other features that can service users' needs without touching storage in the cells. I'm referring to features that lean heavily on host-tier resources and, in fact, rely very little on offload processing. If your Exadata system hosts an application that exhibits bursts of database-tier, CPU-intensive processing intermixed with use cases that drive cells to processor saturation, I urge close monitoring of this push-back (aka pass-through) feature. Heuristics that govern the “when” and “how much” for such a feature are very difficult to get right. If the heuristics are triggering this feature, the performance of certain applications may suffer—probably at a time when you really aren't in the mood for dealing with unpredictable system performance. However, as the authors point out, there are metrics available, and the feature can be tuned—or even disabled if necessary.

With this feature in the mix, I know the first thing I'd check should users complain about response time: when this feature is triggered, you should expect your database grid processors to ratchet up to complete saturation. If your normal host CPU utilization level is not 100% when users are happy, then perhaps a little deductive reasoning will go a long way toward figuring out what has gone wrong should the complaints come in. Having said all that, let's not forget for a moment that DW/BI workloads must burn CPU. That is the nature of the beast, as it were. Life is a never-ending series of choices.

CASE Statements

CASE statements (and their older cousin the DECODE function) are often used to avoid repeated scans of the same table. Generally speaking this is a reasonable technique, although analytic functions now provide another way to deal with this issue. Nevertheless, it is quite common to see code that applies this trick. The whole idea is to replace WHERE clauses with CASE statements. This is perfectly reasonable when all the blocks must be shipped back to the database server to be evaluated. But with Exadata's ability to do filtering at the storage layer, this technique may not be the best solution. Here's a quick example:

SYS@SANDBOX1> alter session set cell_offload_processing=false;

Session altered.

Elapsed: 00:00:00.00
SYS@SANDBOX1> -- Without Smart Scans
SYS@SANDBOX1> select /*+ full(a) */ count(*) from kso.skew a where col1 =1;

  COUNT(*)
----------
   3199971

Elapsed: 00:00:11.04
SYS@SANDBOX1> select /*+ full(a) */ count(*) from kso.skew a where col1 = 999999;

  COUNT(*)
----------
        32

Elapsed: 00:00:10.52
SYS@SANDBOX1> select sum(case when col1 = 1 then 1 else 0 end) how_many_ones,
  2                  sum(case when col1 = 999999 then 1 else 0 end) how_many_nines
  3  from kso.skew;

HOW_MANY_ONES HOW_MANY_NINES
------------- --------------
      3199971             32

Elapsed: 00:00:12.83

So you can see that without offloading, combining the statements basically cuts the elapsed time in half. This is what we're used to seeing on a non-Exadata platform. Now let's see what happens when we enable Smart Scans:

SYS@SANDBOX1> alter session set cell_offload_processing=true;

Session altered.

Elapsed: 00:00:00.00
SYS@SANDBOX1> alter session set "_serial_direct_read"=true;

Session altered.

Elapsed: 00:00:00.00
SYS@SANDBOX1> -- With Smart Scans
SYS@SANDBOX1> select /*+ full(a) */ count(*) from kso.skew a where col1 =1;

  COUNT(*)
----------
   3199971

Elapsed: 00:00:01.06
SYS@SANDBOX1> select /*+ full(a) */ count(*) from kso.skew a where col1 = 999999;

  COUNT(*)
----------
        32

Elapsed: 00:00:00.08
SYS@SANDBOX1> select sum(case when col1 = 1 then 1 else 0 end) how_many_ones,
  2                  sum(case when col1 = 999999 then 1 else 0 end) how_many_nines
  3  from kso.skew;

HOW_MANY_ONES HOW_MANY_NINES
------------- --------------
      3199971             32

Elapsed: 00:00:04.33

Offloading completely reverses the results. The total elapsed time to run the two individual statements (1.14 seconds) is about 25 percent of the time required to complete the combined statement (4.33 seconds). Notice also that the combined statement runs in about a third of the time it took without offloading, but the individual statements completed in a tenth or less of the time required without offloading. The combined statement was not as effective as the individual statements because the CASE expressions cause every record in the table to be returned to the database server for evaluation. So while the statement does benefit from column projection, it does not benefit from predicate filtering or storage indexes. Just to be clear, CASE statements don't disable Smart Scans, but they are often used in a way that reduces their effectiveness by eliminating the opportunity to filter rows at the storage layer.

Hinted Code

Hints are very useful for coercing the optimizer to do what you want it to do. Unfortunately, hints are not well documented and even less well understood. In many cases, hints are used to resolve a problem that is caused by some misconfiguration of the database. Their intended purpose is to allow humans to help the optimizer make the right choices in situations where it just can't do the job effectively (or consistently) on its own. This happens in situations where the optimizer is just not smart enough (yet) to arrive at the best execution plan in a specific situation. However, even when hints are used appropriately and are generating the expected behavior, they can prevent Exadata from taking advantage of some of its built-in abilities. So the best approach when migrating to Exadata is to allow ample time for testing. And if your application makes use of hints, one of the important steps in the test plan should be to test its behavior without the hints. This can easily be accomplished by setting the hidden parameter _optimizer_ignore_hints to true.
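
For example, in a test session (hidden parameters should not be changed in production without Oracle Support's guidance):

ALTER SESSION SET "_optimizer_ignore_hints" = TRUE;

Then rerun the application's key statements and compare the plans and elapsed times against the hinted baseline.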

Indexes

This may seem like a strange topic, but indexes can work against Smart Scans as well. The optimizer will try to use indexes if they are available. In a true data warehouse environment, indexes may not be necessary at all. We'll have more to say about indexes in the next section on mixed workloads, but it is important to understand that index usage often means that offloading is less likely to occur.

Row-at-a-Time Processing

It is amazing how often we see row-at-a-time processing applied to very large data sets. This type of coding is rarely a good idea on non-Exadata platforms, and it is definitely not a good idea on Exadata. In fact, the differences in performance can be even more dramatic on Exadata because of the optimizations built into the platform.

Other Things to Keep in Mind

There are a few other things that you should keep in mind when working with DW systems: the use of Exadata Smart Flash Cache, compression, and partitioning.

Exadata Smart Flash Cache: Keep

Exadata Smart Flash Cache (ESFC) is thought of primarily as providing benefit to latency-sensitive SQL statements. It's important to remember that it can dramatically improve scan performance as well. By default, Smart Scans will ignore ESFC and scan only the disks. However, if an object is defined with the CELL_FLASH_CACHE attribute set to KEEP, Smart Scans will use both the disks and the ESFC for scanning. Obviously, overriding the default caching behavior for a large object will require some thought and testing, but this is definitely something worth considering. Here's an example:

SYS@SANDBOX1> alter session set cell_offload_processing=false;

Session altered.

SYS@SANDBOX1> select count(*) from kso.skew3;

  COUNT(*)
----------
 384000048

Elapsed: 00:00:50.11
SYS@SANDBOX1> alter session set cell_offload_processing=true;

Session altered.

Elapsed: 00:00:00.06
SYS@SANDBOX1> select count(*) from kso.skew3;

  COUNT(*)
----------
 384000048

Elapsed: 00:00:26.12

SYS@SANDBOX1> alter table kso.skew3 STORAGE (CELL_FLASH_CACHE KEEP);

Table altered.

Elapsed: 00:00:00.07
SYS@SANDBOX1> select count(*) from kso.skew3;

  COUNT(*)
----------
 384000048

Elapsed: 00:00:28.74
SYS@SANDBOX1> select count(*) from kso.skew3;

  COUNT(*)
----------
 384000048

Elapsed: 00:00:09.46
SYS@SANDBOX1> /

  COUNT(*)
----------
 384000048

Elapsed: 00:00:08.06
SYS@SANDBOX1> alter table kso.skew3 STORAGE (CELL_FLASH_CACHE DEFAULT);

Table altered.

Elapsed: 00:00:00.21
SYS@SANDBOX1> select count(*) from kso.skew3;

  COUNT(*)
----------
 384000048

Elapsed: 00:00:26.10

In this example we turned off Smart Scans and did a query without a WHERE clause. The query took 50 seconds. We then enabled Smart Scans and ran the query again. This time the query took only 26 seconds. Next we altered the table, telling Oracle that we wanted ESFC to be more aggressive in caching this table (CELL_FLASH_CACHE KEEP). We then executed the statement again; this did not result in an improvement in performance, because the table's blocks were not stored in ESFC at that point. The subsequent executions, though, took only 8 to 9 seconds, showing the benefit of the caching. Finally, we disabled the aggressive caching and the statement execution returned to 26 seconds. This short example shows that making use of ESFC for offloaded queries can have a significant impact on performance.

Compression

Exadata's Hybrid Columnar Compression (HCC) is a big step forward in the ability to reduce the size of data stored inside Oracle databases. The compression ratios achievable with HCC turn the concept of information lifecycle management on its head. HCC makes it practical to consider using compression instead of tiered storage or archiving and purging strategies. Because partitions of a table can be defined to use different compression methods, the combination of partitioning and compression can provide a much more robust solution for “archiving” data than actually purging it from the database.

You should remember, though, that HCC is not appropriate for data that is being actively updated. A better approach is to partition data such that HCC can be applied to data that is no longer being changed. This leads us to the next topic, partitioning.

Partitioning

Partitioning remains a key component of data warehousing systems. The optimizations provided by Exadata do not alleviate the need for a well-thought-out partitioning strategy. Date-based strategies are very useful from a management standpoint, and being able to use more aggressive compression on older data is often a good approach. But partition elimination is still a technique you will want to use. And of course, storage indexes can work well with partitioning, providing behavior comparable to partition elimination on additional columns.

You should keep in mind that the sizes of partitions can affect Oracle's decision to use Smart Scans. When performing a serial scan on a partitioned object, the decision to do direct path reads is based on the individual partition size, not the overall size of the object. This can result in situations where scans of some partitions are offloaded, while scans of others are not.
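
Putting the last two topics together, a date-partitioned table in which older partitions carry heavier HCC levels might be declared as follows. This is only a sketch; the table and column names are made up, and the right compression level for each partition depends entirely on your access patterns:

CREATE TABLE sales_hist (
  sale_id   NUMBER,
  sale_date DATE,
  amount    NUMBER
)
PARTITION BY RANGE (sale_date) (
  -- Cold data: rarely read, never updated.
  PARTITION p_2009 VALUES LESS THAN (TO_DATE('2010-01-01','YYYY-MM-DD'))
    COMPRESS FOR ARCHIVE HIGH,
  -- Warm data: read by reports, no longer updated.
  PARTITION p_2010 VALUES LESS THAN (TO_DATE('2011-01-01','YYYY-MM-DD'))
    COMPRESS FOR QUERY HIGH,
  -- Current data: actively loaded and updated, so no HCC.
  PARTITION p_cur VALUES LESS THAN (MAXVALUE)
);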

Mixed Workloads

There is a third type of system: a combination of the other two. In fact, we could argue that pure OLTP and pure DW systems rarely exist in the real world. Most systems don't fall neatly into the two main categories we've already described; they display characteristics of both.

Combining long-running, throughput-sensitive queries with fast, latency-sensitive statements definitely introduces some additional issues that must be dealt with. One of the main issues in systems of this type is how to deal with indexes.

To Index or Not to Index?

One of the biggest debates we've had during Exadata implementations is whether to drop indexes or not. Access paths that use indexes are generally not able to take advantage of Exadata-specific optimizations. We say generally, because offloading can occur in cases where the optimizer chooses to execute a Fast Full Scan on an index, but this is not the most common usage pattern for indexes. The more common pattern is to use them for retrieving relatively few records from a table using an Index Range Scan, and this operation is not currently offloadable. Generally speaking, we want to use index range scans on selective predicates, but since Exadata is so effective at scanning disk, in many cases the index-based access paths are no longer faster than the scan-based access operations. It's really a case of getting our bearings all over again with respect to when we want to use indexes and when we would expect a full scan to perform better.
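
One low-risk way to get those bearings is to experiment with invisible indexes, an 11g feature: the index continues to be maintained by DML but is ignored by the optimizer, so you can test scan-based plans and fall back instantly. The index name here is hypothetical:

-- Hide the index from the optimizer; it is still maintained.
ALTER INDEX sales_cust_ix INVISIBLE;

-- Individual sessions can still use it for comparison testing.
ALTER SESSION SET optimizer_use_invisible_indexes = TRUE;

-- If the scan-based plans hold up, consider dropping the index;
-- otherwise, simply make it visible again.
ALTER INDEX sales_cust_ix VISIBLE;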

One of the things we commonly heard when Exadata was first starting to appear at customer sites was that indexes were no longer necessary and that they should be dropped. For pure data warehouse workloads, this may actually be pretty good advice. However, we rarely see anything pure. Most systems have a mix of access patterns, with one set of statements hoping for low latency and another set hoping for high throughput. In these cases, dropping all indexes just will not work. This is why we saved this discussion for this section. The problem with mixed workloads, where it is necessary to keep indexes for specific sets of statements, is that the optimizer is not as well equipped to choose between using and ignoring them as one might hope. We'll discuss the optimizer's limitations next.

Kevin Says: I know the origin of the mistaken notion that indexes are not necessary with Exadata. You see, Oracle Database enjoyed majority market share of the worldwide data warehousing market (as per industry analysts) before Exadata came to market. The methodology required to coerce Oracle Database into delivering adequate performance in the pre-Exadata era required an immense amount of “index trickery.” Oracle's competition (most notably Netezza [now IBM] and Greenplum [now EMC]) focused their marketing message on this “Achilles heel.” However, this over-indexing phenomenon (with all the associated complexities) was never an Oracle database weakness per se. Before the original release of Exadata in 2008, customers did not have the richness of systems technology available for configuring a high-bandwidth, balanced DW/BI system. Imbalance and low bandwidth drove the need to avoid doing physical I/O, and the only way to avoid physical I/O in that era was to fall into the “index death-spiral,” as it became known in certain cliques within Oracle. While Exadata certainly addresses the I/O bottlenecks, there remains one all-encompassing truth: it is always faster not to do something than to do something fast. To that end, Exadata administrators need to remember that some of the tools their company has paid for include on-disk indexing technology. Always use the right tools for the job at hand.

To close this train of thought I'd like to put something into perspective. The software that put Oracle in the market leadership position for DW/BI prior to Exadata was Oracle Database 10g Release 2. If you were to deploy Oracle Database 10g Release 2 on the same number of processor cores you used in the pre-Exadata timeframe (circa 2007), you could easily enjoy as much as a tenfold performance increase for process-intensive use cases. And, since the methodology for deploying Oracle for DW/BI use cases in that timeframe used every trick in the book to avoid physical I/O, it is fair to presume that those systems were generally CPU-bound. Indeed, systems that are neither CPU-intensive nor I/O-intensive are probably not considered critical.

In our day-to-day work as IT professionals it is easy to miss just how rapidly technology is advancing all around us—particularly when we are naturally conservative about migrating production applications to state-of-the-art technology, and the state of the art—with or without Exadata—has been advancing at a phenomenal pace.

The Optimizer Doesn't Know

We've made the point several times that the optimizer is not aware that it is running on Exadata (yet). In general, the principles that guide the optimizer decisions are sound regardless of the storage platform. The fact that the code on the database tier is identical regardless of whether it's running on Exadata or not means that an application will behave similarly on Exadata in terms of plan selection. So you shouldn't expect any application to experience a large number of changes in the plans caused simply by moving to Exadata. This is a good thing from a stability standpoint.

The downside is that the optimizer is not aware that Exadata has optimizations that can cause full scans to perform much better than on other platforms. So mixed-workload systems that have many indexes make the optimizer's job more challenging. In fact, as you might expect, the optimizer will tend to pick index-oriented plans in preference to scan-based plans in situations where indexes are available, despite the fact that the scan-based plans are often much faster.

There are several ways to deal with the optimizer's tendency to prefer index access over full table scans. System statistics, optimizer parameters, and hints all come to mind as potential solutions.

Kevin Says: Although the authors are correct to point out the current lack of Exadata-aware optimizer code, I do expect that to change. There should certainly be Exadata-aware optimizer improvements showing up over time.

System Statistics

System statistics provide the optimizer with additional information about the “system,” including how long it takes to do a single-block read (typical of index lookups) and how long it takes to do a multi-block read (typical of full table scans). This may appear to be an ideal mechanism for manipulating the optimizer by giving it the additional information it needs to make the right decisions. Unfortunately, Smart Scans are not based on multi-block reads; in fact, Smart Scans can be orders of magnitude faster than multi-block reads. So modifying system statistics is probably not the best option in this case.
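
If you want to see what the optimizer is currently working with, the gathered (or default) values are visible in the data dictionary:

SELECT sname, pname, pval1
  FROM sys.aux_stats$;

With only the default NOWORKLOAD values in place, the optimizer uses its built-in I/O timing model, which is consistent with the recommendation in the note below.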

Note: The issue of whether to gather system statistics on Exadata comes up fairly often. System statistics are designed to provide additional information to the optimizer about the behavior characteristics of the hardware. In particular, they provide values for average single-block and average multi-block read times. Since Exadata often avoids actually performing multi-block reads, using Smart Scans instead, it seems counterintuitive to provide information to the optimizer about an operation that will seldom be used, particularly when this is one of the main inputs into the optimizer's cost calculations. If that argument is not enough for you, then please review MOS Note 1094934.1, which specifically states:

“It is currently not necessary to gather system statistics on the database machine.”

Optimizer Parameters

There are a couple of initialization parameters that can push the optimizer toward or away from index usage. The parameters OPTIMIZER_INDEX_CACHING and OPTIMIZER_INDEX_COST_ADJ can both be used for this purpose. While these are big knobs that affect core optimizer functionality, they were designed for the very purpose of making indexes look more or less attractive to the optimizer. Using the parameters in a limited way, such as with an ALTER SESSION command before running large batch processes, is a viable approach in some cases. These parameters can also be set at the statement level using the OPT_PARAM hint. Here's a very simple example:

SYS@SANDBOX> @bloom_join2.sql

COL2                           SUM(A.COL1)
------------------------------ -----------
asddsadasd                       153598416
2342                                   144

Elapsed: 00:03:21.39
SYS@SANDBOX> alter session set optimizer_index_cost_adj=10000;

Session altered.

Elapsed: 00:00:00.07
SYS@SANDBOX> @bloom_join2.sql

COL2                           SUM(A.COL1)
------------------------------ -----------
asddsadasd                       153598416
2342                                   144

Elapsed: 00:01:08.64
SYS@SANDBOX1> @dplan
Enter value for sql_id: 09m6t5qpgkywx
Enter value for child_no:

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------
SQL_ID  09m6t5qpgkywx, child number 0
-------------------------------------
select /*+ bloom join 2  use_hash (skew temp_skew) */ a.col2,
sum(a.col1) from kso.skew3 a, kso.skew2 b where a.pk_col = b.pk_col and
b.col1 = 1 group by a.col2

Plan hash value: 466947137

-----------------------------------------------------------------------------------------
| Id  | Operation                             | Name       | Rows  | Bytes | Cost (%CPU)|
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |            |       |       | 37239 (100)|
|   1 |  PX COORDINATOR                       |            |       |       |            |
|   2 |   PX SEND QC (RANDOM)                 | :TQ10002   |     2 |    66 | 37239   (1)|
|   3 |    HASH GROUP BY                      |            |     2 |    66 | 37239   (1)|
|   4 |     PX RECEIVE                        |            |     2 |    66 | 37239   (1)|
|   5 |      PX SEND HASH                     | :TQ10001   |     2 |    66 | 37239   (1)|
|   6 |       HASH GROUP BY                   |            |     2 |    66 | 37239   (1)|
|*  7 |        HASH JOIN                      |            |  1706 | 56298 | 37238   (1)|
|   8 |         BUFFER SORT                   |            |       |       |            |
|   9 |          PX RECEIVE                   |            |   142 |  1562 |   131   (0)|
|  10 |           PX SEND BROADCAST           | :TQ10000   |   142 |  1562 |   131   (0)|
|  11 |            TABLE ACCESS BY INDEX ROWID| SKEW2      |   142 |  1562 |   131   (0)|
|* 12 |             INDEX RANGE SCAN          | SKEW2_COL1 |   142 |       |     3   (0)|
|  13 |         PX BLOCK ITERATOR             |            |   384M|  8056M| 37020   (1)|
|* 14 |          TABLE ACCESS STORAGE FULL    | SKEW3      |   384M|  8056M| 37020   (1)|
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   7- access("A"."PK_COL"="B"."PK_COL")
  12- access("B"."COL1"=1)
  14- storage(:Z>=:Z AND :Z<=:Z AND SYS_OP_BLOOM_FILTER(:BF0000,"A"."PK_COL"))
       filter(SYS_OP_BLOOM_FILTER(:BF0000,"A"."PK_COL"))

SQL_ID  09m6t5qpgkywx, child number 1
-------------------------------------
select /*+ bloom join 2  use_hash (skew temp_skew) */ a.col2,
sum(a.col1) from kso.skew3 a, kso.skew2 b where a.pk_col = b.pk_col and
b.col1 = 1 group by a.col2

Plan hash value: 2628392092

-------------------------------------------------------------------------------------
| Id  | Operation                           | Name     | Rows  | Bytes | Cost (%CPU)|
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |          |       |       | 49465 (100)|
|   1 |  PX COORDINATOR                     |          |       |       |            |
|   2 |   PX SEND QC (RANDOM)               | :TQ10002 |     2 |    66 | 49465   (1)|
|   3 |    HASH GROUP BY                    |          |     2 |    66 | 49465   (1)|
|   4 |     PX RECEIVE                      |          |     2 |    66 | 49465   (1)|
|   5 |      PX SEND HASH                   | :TQ10001 |     2 |    66 | 49465   (1)|
|   6 |       HASH GROUP BY                 |          |     2 |    66 | 49465   (1)|
|*  7 |        HASH JOIN                    |          |  1706 | 56298 | 49464   (1)|
|   8 |         PX RECEIVE                  |          |   142 |  1562 | 12357   (1)|
|   9 |          PX SEND BROADCAST          | :TQ10000 |   142 |  1562 | 12357   (1)|
|  10 |           PX BLOCK ITERATOR         |          |   142 |  1562 | 12357   (1)|
|* 11 |            TABLE ACCESS STORAGE FULL| SKEW2    |   142 |  1562 | 12357   (1)|
|  12 |         PX BLOCK ITERATOR           |          |   384M|  8056M| 37020   (1)|
|* 13 |          TABLE ACCESS STORAGE FULL  | SKEW3    |   384M|  8056M| 37020   (1)|
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   7- access("A"."PK_COL"="B"."PK_COL")
  11- storage(:Z>=:Z AND :Z<=:Z AND "B"."COL1"=1)
       filter("B"."COL1"=1)
  13- storage(:Z>=:Z AND :Z<=:Z AND SYS_OP_BLOOM_FILTER(:BF0000,"A"."PK_COL"))
       filter(SYS_OP_BLOOM_FILTER(:BF0000,"A"."PK_COL"))

So in this simple example, pushing the optimizer away from indexes with the alter session caused the optimizer to pick a plan that was considerably faster. The plans show that the improvement in elapsed time was a result of doing a full table scan, instead of using the index.

Hints

Of course hints can also be used to help the optimizer make the right choices, but that is somewhat of a slippery slope. Nevertheless, telling Oracle that you would prefer to do a hash join or ignore a specific index is an option. As mentioned in the previous section, the OPT_PARAM hint can also prove useful for setting some initialization parameters that can influence the optimizer's decisions.
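
As a sketch, the session-level parameter change from the previous listing could be scoped to a single statement with the OPT_PARAM hint (hypothetical query, same tables as before):

SELECT /*+ opt_param('optimizer_index_cost_adj' 10000) */
       a.col2, SUM(a.col1)
  FROM kso.skew3 a, kso.skew2 b
 WHERE a.pk_col = b.pk_col
   AND b.col1 = 1
 GROUP BY a.col2;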

Using Resource Manager

It's a commonly held belief that Oracle databases can't be configured to adequately handle both DW and OLTP workloads at the same time. And in truth, keeping them on separate systems does make them easier to manage. The downside of this approach is that it is expensive; many companies dedicate the majority of their computing resources to moving data between platforms. The power of Exadata makes it tempting to combine these environments. Keep in mind that Exadata has additional capabilities for dividing resources between multiple databases that are not available on other platforms. I/O Resource Manager (IORM) can prevent long-running DW queries from crippling latency-sensitive statements that are running on the same system. Having a good understanding of Oracle's resource management capabilities should change the way you think about what is possible in a mixed-workload or consolidated environment. Resource management is covered in depth in Chapter 7.
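
As a taste of what that looks like, here is a minimal interdatabase IORM plan sketch, set on each storage cell via CellCLI. The database names are hypothetical, and Chapter 7 covers the full syntax, including the category and intradatabase layers:

CellCLI> ALTER IORMPLAN                                    -
         dbPlan=((name=oltpdb, level=1, allocation=70),    -
                 (name=dwdb,   level=1, allocation=30),    -
                 (name=other,  level=2, allocation=100))

This guarantees the OLTP database 70 percent of the disk I/O on each cell when there is contention, while still allowing either database to use idle capacity when there is none.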

Summary

Exadata is different. To make the best use of it you'll need to think differently.
