C H A P T E R  12

Monitoring Exadata Performance

By now you have learned about the key Exadata performance features and the related performance metrics. Let’s see how we can use these for everyday tasks. We will look at standard tools available for database-layer and cell performance monitoring, and how to interpret their output.

Oracle Database and the Exadata cells provide a huge variety of different metrics, but before monitoring any metrics, you should ask yourself why you are monitoring them. Additionally, you should know what your action would be if a metric crosses some threshold. This leads to the follow-up question: which exact threshold should prompt some action from you—and why? In other words, you should know what you are trying to achieve (good response times for users) and how performance metrics relate to that.

The monitoring tasks covered here can be divided into the following categories:

  • SQL statement response time monitoring
  • Database layer utilization and efficiency monitoring
  • Storage cell layer utilization and efficiency monitoring
  • Advanced metrics and monitoring for Exadata performance troubleshooting

Note that we will focus on the Exadata-specific performance topics here and not the whole wide range of other Oracle performance topics, like lock contention or general SQL performance issues.

A Systematic Approach

Whatever metrics you monitor, you should have a purpose in mind. In other words, don’t collect and display metrics just because they are available; this will lead you nowhere or potentially even mislead you to fixing the wrong problem. Note that the term “performance” is vague—different people may mean different things when they use it. From an IT system user’s perspective, performance is ultimately only about one thing— response time. And not some response times of individual wait events measured at a low level; the end users don’t care about that. They do care about how much they have to wait for their business task to complete, like the time it takes from the report submission to actually seeing its output. This time is measured in regular wall-clock time; it’s as simple as that.

If your purpose in monitoring is to ensure good response times for your application users, then you should measure what matters—response time as your end-user experiences it. This would be the ideal entry point to performance monitoring. In addition to this entry point, you should measure more detailed lower-level metrics to break the end user response time down into individual components, like time spent in an application server and database time. Your application instrumentation and monitoring tools should keep track of which database sessions were used for which end-user task, so you can report what these exact database sessions were doing when a user experienced unacceptable response times.

Note: We deliberately said unacceptable response times here instead of just “user experienced performance problems.” Whenever anyone complains about a performance problem, you should clarify what they actually mean by that and how it was measured. Does any user actually experience far too long response times in their application, or did some monitoring system merely raise an alarm about “too high” CPU utilization or any other secondary metric like that? Your subsequent actions would depend on the problem you’re trying to solve. Ideally, you should not use a performance tool or Oracle metric for determining whether you have a performance problem. Your starting point should be the users (who report a problem) or application-level metrics, which see the database response time from the application perspective. No matter how good the database metrics look, if the application waits for the report completion for ages, you have a performance problem to drill down into. Conversely, no matter how “bad” your database metrics seem to be, if your application response times are satisfactory, you don’t have an acute need to start fixing anything.

When examining performance metrics because of an ongoing problem (of too long response times), you should start by identifying the sessions servicing this slow application or job and then drilling down into that response time. It would be more correct to call this performance troubleshooting, not just monitoring.

Note that there are other kinds of performance monitoring tasks, which you may want to do—for example, proactive utilization and efficiency monitoring. Performing these tasks allows you to keep an eye on the utilization headroom left in the servers and detect any anomalies and sudden spikes in system utilization and low-level response times, possibly even before users notice a response time difference. Yet another reason for collecting and monitoring performance and utilization data is for capacity planning—but capacity planning is outside the scope of this book. Also, because this is a database book, we won’t dive into any end-to-end performance measurement topics, which would involve identifying time spent in application servers, on the network, and so on.

We will start by looking at how to identify where a long-running query is spending most of its time. We’ll also look at how to tell whether a query is taking full advantage of Exadata’s performance features.

Monitoring SQL Statement Response Time

The best tool for monitoring long-running queries is Oracle 11g’s Real-Time SQL Monitoring page. It gathers all the key performance information onto a single page, even in the case of parallel execution across multiple RAC instances.

The SQL Monitoring feature requires you to have the Diagnostics and Tuning Pack licenses. SQL Monitoring kicks in automatically if you run your query with parallel execution or when a serial query consumes more than 5 seconds of I/O and CPU time in total. Additionally, you can control the monitoring feature with the MONITOR and NO_MONITOR hints. If you want to monitor your frequently executed short-running queries, the best approach is to use ASH data and list the top wait events and top rowsources from there (using the SQL_PLAN_LINE columns in the V$ACTIVE_SESSION_HISTORY view).
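
For example, a minimal sketch of such an ASH query might look like the following; the &sql_id substitution variable and the one-hour time window are just placeholders that you would adapt to your situation:

SELECT
    ash.sql_plan_line_id,
    ash.sql_plan_operation,
    ash.sql_plan_options,
    NVL(ash.event, 'ON CPU')  AS activity,
    COUNT(*)                  AS ash_samples
FROM
    v$active_session_history ash
WHERE
    ash.sql_id = '&sql_id'
AND ash.sample_time > SYSDATE - 1/24      -- recent samples still in memory
GROUP BY
    ash.sql_plan_line_id,
    ash.sql_plan_operation,
    ash.sql_plan_options,
    NVL(ash.event, 'ON CPU')
ORDER BY
    ash_samples DESC;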

If you are already aware of a performance problem (perhaps your users are already complaining about overly long response times), then you should use a top-down approach for monitoring. You should identify the session(s) of the problematic users’ applications or reports, drill down into what these sessions are doing with ASH (which gives you the SQL_IDs of the top SQL statements for these sessions) and, when needed, drill down further into the top statements with SQL Monitoring reports.
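
A sketch of such a session-level drilldown, again based on ASH samples, is shown below; the &sid and &inst_id substitution variables are placeholders for the session you have identified, and the 15-minute window is arbitrary:

SELECT
    ash.sql_id,
    NVL(ash.event, 'ON CPU')  AS activity,
    COUNT(*)                  AS ash_samples
FROM
    gv$active_session_history ash
WHERE
    ash.session_id = &sid
AND ash.inst_id    = &inst_id
AND ash.sample_time > SYSDATE - 15/1440   -- last 15 minutes
GROUP BY
    ash.sql_id,
    NVL(ash.event, 'ON CPU')
ORDER BY
    ash_samples DESC;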

Monitoring SQL Statements with Real-Time SQL Monitoring Reports

When you click the SQL Monitoring link in the Enterprise Manager performance page, you will see the latest monitored queries. The SQL Monitoring reports are present in Grid Control 11g R1 or 10g R5 (10.2.0.5). If you are not using Grid Control, then you can either use the built-in Enterprise Manager Database Console or run the SQL Monitoring reports manually from SQL*Plus as explained shortly. Note that the SQL Monitoring feature requires Diagnostics and Tuning Pack licenses.

Figure 12-1 shows the entry page to the SQL Monitoring reports. You can get there by clicking the SQL Monitoring link at the bottom of the Performance page in Grid Control. The SQL Monitoring page, as seen in Figure 12-1, lists the currently running, queued, and recently completed monitored SQL executions, with some key performance metrics and details, such as the degree of parallelism.


Figure 12-1. Enterprise Manager’s overview of Monitored SQL Executions

The Status column shows an icon with one of four statuses—running, done, error, or queued. When you click the status icon, you will be taken to the SQL statement’s detailed monitoring page. One of the most important pieces of information is the Duration column, showing how long a statement has been active. The duration is the wall-clock time from the start of the statement execution through the finish, or to the current time in the event the statement is still executing. Figure 12-2 illustrates the difference between duration (wall-clock time) and database time in statement execution.


Figure 12-2. Statement Duration compared to Database Time in the SQL Monitoring page

The Duration column shows what users care about, which is the response time of a query since the SQL execution start. This is the time they have had to wait for the SQL to complete. Of course, the end users may have had to wait for much longer than the duration of the query, as the page in Figure 12-2 shows only the database response time. Time may have also been spent in the application layer or the network connection in between the user and the application server.

It is important to know that the duration measures time from SQL execution start all the way until the cursor is closed or cancelled (for example, when all data is fetched from it). This means that if your database can process the query in 30 seconds, but then millions of rows are fetched a few at a time, your query will take a long time as far as the application is concerned (thanks to the network ping-pong), but in fact only a little time is spent processing within the database. The Duration column still shows long query “runtime,” as the cursor is still kept open for fetching of the data. Remember, the duration measures time from when the cursor is executed all the way to when it is finally closed after all the fetches are done or the application has fetched enough.

This leads us to the next important metric—Database Time, seen in the sixth column in Figure 12-2. The Database Time metric shows the total time your query spent executing in the database. So, if you run a serial DML that runs for 60 seconds and spends all of the time executing inside the database, you will end up seeing 60 seconds of database time, too. However, if you are running some SELECT statement and are fetching a lot of rows, causing your session to spend (let’s say) 50% of its time executing in the database and another 50% waiting for the next fetch command (once it has sent an array of rows back to the application), you would see only half of that total 60-second duration as database time. In other words, you would see 30 seconds of database time, as the database session has been servicing your request only for 30 seconds and the rest of the time it was idle.

Looking at the second entry in Figure 12-2, you see that the duration of the query (the time from when the execution phase started) is 4.4 minutes, the query has executed in serial (the fifth column shows that), and it has consumed only 2.4 minutes of database time. So this indicates that the executing session has been doing something else for 2 minutes. It was either idle (probably waiting for the next fetch request) or executing some other statement (while the first statement’s cursor was still open).

The example we just discussed was about a single serial query. When you run a parallel query, you have multiple sessions executing pieces of the same query for you. Then the database time might end up being much higher than the duration (response time) of your query. If you look at the first entry in Figure 12-2, you see a statement with a duration of 19 seconds, but the total time spent in the database is 9.5 minutes. When you look at the parallel column, you see that the statement was executed with parallel degree 32, which means that you had multiple sessions actively working in the database for your query. All of these parallel sessions’ database time plus the query coordinator’s time is added into the Database Time column. This database time gives an idea of how much work was done in the database. But because the statement was parallelized, you didn’t have to wait that long. If you ran that same query in serial mode, then the database time would be in the ballpark of how much time you might have to wait. (Please take that last statement with a grain of salt, as in practice many other things may happen in SQL execution when switching from parallel execution to serial, so you might have some pleasant or unpleasant surprises in the response time).

Note that these entries in the SQL Monitoring overview page are not SQL statement-specific, but SQL statement execution-specific. So, if two user sessions were running the same statement, you would see two separate entries in this list. This allows you to examine exactly what the problem is with a specific user (who is complaining) as opposed to looking at statement-level aggregate metrics (like those V$SQL provides) and trying to figure out your user’s problem from there.
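
Each of these entries corresponds to rows in GV$SQL_MONITOR, so if you prefer to compare duration and database time without Grid Control, you can approximate the same numbers with a query along the lines of the sketch below. The ELAPSED_TIME, CPU_TIME, and USER_IO_WAIT_TIME columns are reported in microseconds, and for parallel queries each PX process gets its own row, which you could aggregate by PX_QCSID and PX_QCINST_ID:

SELECT
    inst_id,
    sid,
    sql_id,
    sql_exec_id,
    status,
    ROUND((SYSDATE - sql_exec_start) * 86400) AS duration_sec,  -- wall-clock so far
    ROUND(elapsed_time      / 1e6)            AS db_time_sec,
    ROUND(cpu_time          / 1e6)            AS cpu_sec,
    ROUND(user_io_wait_time / 1e6)            AS io_wait_sec
FROM
    gv$sql_monitor
WHERE
    status = 'EXECUTING'
ORDER BY
    duration_sec DESC;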

Real-Time Monitoring of a Single Long-Running Query

Once you have identified your query of interest from the list of all long-running queries, you can click on its “running” icon on the left side of the list, and you will be taken to the Monitored SQL Execution Details page shown in Figure 12-3. This page has a number of sub-tabs, so feel free to explore and see all the information that’s available there.


Figure 12-3. Monitored SQL statement execution details

The Monitored SQL Execution Details page has a lot of information in it. This page is pretty self-explanatory if you have read SQL execution plans before, so we won’t go through every single detail here, but will focus on the most important metrics. Compared to old-fashioned DBMS_XPLAN output, the SQL Monitoring page is very interactive, so make sure that you hover your mouse over and click on almost everything on the screen, to see the full array of functionality available to you.

Let’s start from the top section of the page. At the top of the screen, right after the Monitored SQL Execution Details header, you see a little icon, which shows the status of the selected SQL statement’s execution. A little circle means that the statement is still being executed (the cursor has been neither closed nor cancelled).

The Overview section, illustrated in Figure 12-4, will show some basic details, like which SQL ID is being executed, by whom, and so on. You can see the SQL text when you click on the little (i) icon next to the SQL ID. You will see the SQL statement text and, beginning with version 11.2.0.2, you will also see the bind variable values used for the execution. Figure 12-5 shows the display that you get by clicking that (i) icon.


Figure 12-4. SQL Monitoring overview section


Figure 12-5. Bind variable values in the SQL Monitoring detail page

Of course, whether you should use bind variables for your long-running reports and DW queries is an entirely separate question. You generally shouldn’t use binds for long-running queries in a DW; you would rather sacrifice some response time to hard-parse a new query plan for each combination of literal values and possibly get an optimal plan for each value set. Nevertheless, monitoring the bind variable values of an already running query is easy with Oracle 11.2.0.2 and later, because you no longer have to resort to the ORADEBUG ERRORSTACK command.

The Time & Wait Statistics section in Figure 12-6 shows the familiar metrics of Duration and Database Time of a statement. Move your mouse over the different database time components to see how much time was spent waiting inside the database compared to running on CPU. The Wait Activity % bar shows the breakdown of wait events. Note that the 100% in this bar means 100% of the wait time component of database time, not of the entire database time (which also includes CPU time).


Figure 12-6. Time and Wait Statistics in the SQL Monitoring detail page

The IO Statistics section in Figure 12-7a shows some key I/O statistics for statement execution.


Figure 12-7a. I/O Statistics in the SQL Monitoring detail page

The Buffer Gets bar shows the total number of logical I/Os done in both the database layer and the cells (if Smart Scan kicked in). The IO Requests and IO Bytes statistics are self-explanatory, but note that those metrics show all I/O done for the given statement, which includes the Smart Scan I/O, any regular block I/O, and also TEMP tablespace I/O (done for sorting, hash joins, and for any other work area operations that didn’t fit into allowed PGA memory). Now you should begin to understand why the Cell Offload Efficiency is negative for the statement execution in the figure. This metric should ideally show the percentage of disk-to-database-host interconnect traffic that was avoided thanks to the Smart Scans performing filtering and projection early in the cells and returning only a subset of data. However, in a complex execution plan there is much more going on than just a Smart Scan against a single table. You have joins, aggregate operations, sorting, and direct path data loads, which all use extra I/O bandwidth, driving the offload efficiency percentage down. This is a good example of why focusing on only a single ratio is not a good idea. The single percentage value for offload efficiency hides a lot of information, such as where the percentage taken was from and the real values behind it. You can move your mouse over the percentage and see the underlying values used for the calculation.

So when moving your mouse over the ratio (see Figure 12-7b), you’ll see that the cells did read 486GB from disks, but much more data (715GB) was actually sent back and forth over the interconnect. The metric explanations in Figure 12-7b are actually incorrect. The Bytes Read From Disks actually means total bytes of reads and writes that Oracle Database has issued, not just for Smart Scans, but for any reason. And the Bytes Returned From Exadata actually means the total interconnect traffic between the database and the cells, caused by any reason, such as block reads and writes and arrays of returned rows by Smart Scans.


Figure 12-7b. Cell Offload Efficiency ratio in the SQL Monitoring detail page

If you are wondering why the I/O interconnect bytes is 715GB when the actual database-issued I/O is only 486GB, the explanation is in how these metrics are measured. The database I/O metric (the Bytes Read From Disks in Figure 12-7b) is measured by the database layer, while the I/O interconnect bytes is measured by the low-level interconnect/Oracle networking layer. And one of the layers in between is the ASM layer, which manages the software mirroring, among other things. So the interconnect I/O traffic in this case is higher than the database traffic thanks to the write operations done by the SQL statement, which had to be mirrored by the ASM layer. So every MB written by the database (whether because of a direct path load or some work area operation spilling to temporary tablespace) resulted in the writing of 2MB of data thanks to ASM normal redundancy mirroring.

There are other reasons why the interconnect traffic may be higher than the actual amount of data read. One example is HCC compressed tables. If you have compressed your 10GB partition down to 1GB, then you will have to do 1GB worth of I/O to read it. But now if the Smart Scan uncompresses this data in the cells on the fly and returns all that 10GB of uncompressed data back over the interconnect (assuming no projection or filtration in the cell was done), the I/O interconnect bytes will be much higher than the amount of data read from disks. This would drive the Cell Offload Efficiency down to 10% for this example Smart Scan. All this is yet another reason why you shouldn’t focus solely on improving just the Cell Offload Efficiency ratio, but should rather look into where response time is spent instead. Time is what end users care about.
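
If you want to see the raw byte counts behind that single ratio for a given execution yourself, you can pull them from GV$SQL_MONITOR. This is only a sketch: the &sql_id and &sql_exec_id substitution variables are placeholders, and the SUM is there because a parallel query has one row per PX process:

SELECT
    sql_id,
    sql_exec_id,
    ROUND(SUM(physical_read_bytes)   / POWER(1024, 3), 1) AS read_gb,
    ROUND(SUM(physical_write_bytes)  / POWER(1024, 3), 1) AS write_gb,
    ROUND(SUM(io_interconnect_bytes) / POWER(1024, 3), 1) AS interconnect_gb
FROM
    gv$sql_monitor
WHERE
    sql_id      = '&sql_id'
AND sql_exec_id = &sql_exec_id
GROUP BY
    sql_id,
    sql_exec_id;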

Execution Plan Rowsource Details

Now that you have checked the key metrics of your statement execution— duration, database time, parallelism used and how much of the database time is spent running on CPU vs. waiting— it’s time to drill down into the details of the execution plan at the row-source level. These are shown in Figure 12-8.


Figure 12-8. Execution plan rowsource activity monitoring in the SQL Monitoring detail page

Let’s focus on the right-hand columns of the execution plan output first. The CPU Activity column shows a breakdown of total CPU usage by the statement so far. The longest bar at the SORT ORDER BY line in the execution plan shows that this rowsource consumed 43% of total CPU usage of that statement execution so far. Note that this is 43% of total CPU consumption, not of total duration or database time of the statement execution. You should examine the Time and Wait Statistics section (shown earlier, in Figure 12-6) to see how much of the total database time was consumed by CPU usage and how much by waits.

The Wait Activity column shows a breakdown of which wait events this statement execution waited for. Again, the SORT ORDER BY rowsource has experienced the most waits, 72% of total wait time of that statement execution. Note that this bar consists of two different color bars. If you move your mouse pointer over these bars, you’ll see the names of these wait events. In this case these waits were direct path read temp and direct path write temp events, which you see with multi-pass work area operations such as sorts and hash joins. This column also shows percentages of total wait time of this statement execution, not the percentage of total runtime duration, nor total database time.

The IO Bytes column shows how many I/O operations or bytes were read or written by a rowsource. You can right-click on the chart to toggle the display of bytes or I/O operations. In Figure 12-8 you see that the SORT ORDER BY rowsource has done a total of 241GB worth of I/O. When you move your mouse over the different color bars, you’ll see the details about how much was written and how much was read (roughly 50/50, as the chart seems to show). The execution plan step LOAD AS SELECT above it has written about 125GB of data to disk. Note that this 125GB of data is measured at the Oracle database level, but the lower-level ASM layer will likely mirror (or triple-mirror, depending on your configuration) these writes, so actually twice as many bytes were written physically to disks. You can check the Cell Physical IO Interconnect Bytes session statistic to see how much data was really sent (and received) over the InfiniBand network; this metric is aware of the ASM mirror write overhead and any other low-level traffic too.
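
A minimal sketch of checking that statistic, together with the database-level physical read and write byte counts, for your own session follows; for somebody else’s session you would join V$SESSTAT on its SID instead of using V$MYSTAT:

SELECT
    sn.name,
    ROUND(ms.value / 1024 / 1024) AS mb
FROM
    v$mystat   ms,
    v$statname sn
WHERE
    ms.statistic# = sn.statistic#
AND sn.name IN ('physical read total bytes',
                'physical write total bytes',
                'cell physical IO interconnect bytes');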

Let’s look at a few more columns of the Real Time SQL Monitoring page. The Timeline column you see in Figure 12-9 is one of the most important columns for understanding where the SQL statement spends its time.


Figure 12-9. Row-source activity timeline in the SQL Monitoring detail page

The Timeline column in the figure shows a visual timeline of individual execution plan rowsources’ activity. It is based on ASH samples; ASH collects SQL execution plan line-level details starting from Oracle 11gR1. Look at the value in brackets in the Timeline column header. It shows the total runtime duration of this SQL execution, 2226 seconds in this case (around 37 minutes). So the total width of the Timeline column represents about 37 minutes of wall-clock runtime. Now it’s easy to visually interpret the length and position of the bars for each rowsource in the execution plan. When you look at the TABLE ACCESS (FULL) T4 rowsource at the bottom of the plan, you see that this table scan was active from the beginning of the execution and continued up to around two-thirds of the total Timeline (to around 1400 seconds) and then it completed. So it looks like this table scan was active during most of the execution time.

Oracle execution plans are trees of rowsource functions with strict hierarchy enforced, so a child rowsource can pass its results only to its direct parent. In our case, the TABLE ACCESS FULL against T4 fetched some rows and passed them back to its parent operation, PX BLOCK ITERATOR, which then sent the rows back to the PX SEND RANGE operator, which knows how to send results to other slaves via the PX Table Queue buffers in the SGA. Another set of slaves received these rows using the PX RECEIVE rowsource and passed these rows to their parent, the SORT ORDER BY rowsource, which performed the sorting of these rows. Then the SORT ORDER BY rowsource just repeated its loop and asked for more rows from its child operator PX RECEIVE, which then read (consumed) the table queue buffers for more rows put there by the first set of slaves (producers). We are not going deeper into SQL execution engine internals here, but hopefully this example has illustrated the hierarchical nature of the SQL execution and how the data “flows” upward through the execution plan tree toward the root of the tree (CREATE TABLE STATEMENT in this case).

When you look at the timeline bars, you see that all the bars from PX RECEIVE downward started roughly at the same time and ended roughly at the same time, too. We say roughly, because this data comes from ASH samples, sampled only once per second, so we don’t have microsecond accuracy here. And we don’t need it anyway, as our query runs for much longer than a second. In any case, the timeline bars indicate that the table scanning part of the query, including the “data transportation” steps for that data (PX RECEIVE, PX SEND RANGE, and PX BLOCK ITERATOR), took around 66% of the total query response time. If we want to make our query faster, we should probably focus on that part: either scan less data, filter more (ideally in the cells), access fewer partitions, or simply increase the scanning throughput. In this case increasing parallelism would help, because the TABLE ACCESS rowsource wasn’t waiting much for I/O (as shown in Figure 12-8), so it’s not I/O-bound yet, and the database nodes had plenty of CPU resources left.

When you look at the upper rowsources (LOAD AS SELECT and higher), you see that they have only been reported active in the last third of the query response time, extending all the way to the end of execution. So, if we somehow made only this data loading part faster, we would win back at most 33% of the response time, even if we could make the loading part infinitely fast. The table access and sorting would still take the majority of the response time. The timeline bars are very valuable for quickly understanding in which part (branch) of the execution plan tree most of the response time is spent.

The timeline bars are just the first thing to check. There are lots of other useful details in the SQL Monitoring report, too. For example, if you look at the SORT ORDER BY line in Figure 12-9, you see that the sort operation has been active throughout the whole execution of this statement. And if you know the basics of sorting, it should make sense—the sorting rowsource was active for the first two-thirds of the time because it was fetching data into sort buffers and sorting it (and spilling some data to temporary tablespace). After the sort itself was complete (the child rowsource returned an “end of data” condition), the sorting rowsource function still had to be invoked. It is by invoking that function that the sort operation returns the sorted rows to its parent operation. This is why the parent rowsources became active only toward the end of the execution timeline—because all the rows had to be fetched and sorted first, before anything could have been returned for further processing. And once all the rows were fetched and sorted, there was no reason for the SORT ORDER BY rowsource to visit its child rowsources again. This is why the timeline bar for TABLE ACCESS ended at around two-thirds of the total execution duration timeline.

Manually Querying Real-Time SQL Monitoring Data

All the pretty charts in Grid Control are based on some V$ or DBA_ views internally, so if you don’t happen to have access to Grid Control or Enterprise Manager Database Console (or it is just broken), you can get what you want from the V$ views directly. You probably don’t have to access the V$ views for your everyday monitoring and tuning tasks, but nevertheless it’s useful to know where this information comes from, as it may come in handy for custom monitoring and advanced problem troubleshooting. Following are some key views to be aware of:

  • The GV$SQL_MONITOR view contains the statement execution-level monitoring data. When multiple sessions are running the same statement, you will have multiple entries in this view. So, make sure that you query the right execution, by using the right search filters. For example, you should pay attention to which SID and INST_ID you are really looking for (or PX_QCSID, PX_QCINST_ID if monitoring a parallel query) and whether the STATUS column still shows EXECUTING if you are trying to troubleshoot a currently running query.
  • The GV$SQL_PLAN_MONITOR view contains execution plan line-level metrics, monitored and updated in real time. For example, you can query IO_INTERCONNECT_BYTES and compare it to PHYSICAL_READ_BYTES and PHYSICAL_WRITE_BYTES to determine the offloading efficiency of each individual execution plan line, instead of the whole query’s efficiency (a sample query follows this list). Note that increasing the offloading efficiency percentage should not be the primary goal of your monitoring and tuning; what matters is where the response time is spent.
  • The GV$ACTIVE_SESSION_HISTORY view contains columns like SQL_PLAN_LINE_ID, SQL_PLAN_OPERATION, and SQL_PLAN_OPTIONS starting from Oracle 11gR1. You can query these columns, in addition to SQL_ID, to find the top rowsources of a SQL execution plan, instead of just listing the top SQL statements.
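
As an illustration of the second bullet point, here is a sketch of a per-plan-line I/O breakdown from GV$SQL_PLAN_MONITOR; the &sql_id and &sql_exec_id substitution variables are placeholders, and the SUM aggregates the rows of all PX processes for each plan line:

SELECT
    plan_line_id,
    plan_operation,
    plan_options,
    ROUND(SUM(physical_read_bytes)   / 1048576) AS read_mb,
    ROUND(SUM(physical_write_bytes)  / 1048576) AS write_mb,
    ROUND(SUM(io_interconnect_bytes) / 1048576) AS interconnect_mb
FROM
    gv$sql_plan_monitor
WHERE
    sql_id      = '&sql_id'
AND sql_exec_id = &sql_exec_id
GROUP BY
    plan_line_id,
    plan_operation,
    plan_options
ORDER BY
    plan_line_id;
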
Reporting Real-Time SQL Monitoring Data with DBMS_SQLTUNE

If you do not have access to Grid Control or Enterprise Manager Database Console for some reason, you can also extract the SQL monitoring details using the DBMS_SQLTUNE.REPORT_SQL_MONITOR package function. You can use the following syntax, but make sure you read the related documentation to see the full power of this feature. Note that the text below is edited, as this function generates very wide output that wouldn’t fit on the book’s pages.

SQL> SELECT
        DBMS_SQLTUNE.REPORT_SQL_MONITOR(
          session_id=> &sid,
          report_level=>'ALL',
          type => 'TEXT') as report
     FROM dual
SQL> /
Enter value for sid: 543

REPORT
----------------------------------------------------------------------------
SQL Monitoring Report

SQL Text
------------------------------
SELECT /*+ parallel(32) */ COUNT(*) FROM T4 A

Global Information
------------------------------
 Status              :  EXECUTING
 Instance ID         :  2
 Session             :  TANEL (543:4641)
 SQL ID              :  0duat97xt417k
 SQL Execution ID    :  33555560
 Execution Started   :  05/27/2011 06:59:23
 First Refresh Time  :  05/27/2011 06:59:23
 Last Refresh Time   :  05/27/2011 06:59:59
 Duration            :  36s
 Module/Action       :  SQL*Plus/-
 Service             :  SANDBOX
 Program             :  [email protected] (TNS V1-V3)

Global Stats
============================================================================
| Elapsed | Queuing |   Cpu   |    IO    | Buffer | Read | Read  |  Cell   |
| Time(s) | Time(s) | Time(s) | Waits(s) |  Gets  | Reqs | Bytes | Offload |
============================================================================
|    1095 |    0.00 |      17 |     1079 |    11M | 170K |  83GB |  96.60% |
============================================================================

SQL Plan Monitoring Details (Plan Hash Value=3536567600)
============================================================================
|Id  |           Operation            |  Name  | Read  |  Cell   | Activity |
|    |                                |        | Bytes | Offload |   (%)    |
============================================================================
|   0|SELECT STATEMENT                |        |       |         |          |
|   1|  SORT AGGREGATE                |        |       |         |          |
|   2|   PX COORDINATOR               |        |       |         |          |
|   3|    PX SEND QC (RANDOM)         |:TQ10000|       |         |          |
|-> 4|     SORT AGGREGATE             |        |       |         |     0.09 |
|-> 5|      PX BLOCK ITERATOR         |        |       |         |          |
|-> 6|       TABLE ACCESS STORAGE FULL|T4      |  87GB |  96.60% |    99.91 |
|    |                                |        |       |         |          |
============================================================================

DBMS_SQLTUNE.REPORT_SQL_MONITOR can also take HTML as a value for the TYPE parameter, instead of TEXT, in which case the output is generated as HTML. If you spool this output into an HTML file and open it in the browser, you will see much nicer output than just text. And starting with Oracle 11.2, you can also use ACTIVE as a parameter and spool that output into an HTML file. Now if you open this file in the browser, you will see the SQL Monitoring page almost exactly as it looks in Grid Control! And all the data required for displaying that report is self-contained in the spooled HTML file—no database access is needed when opening it! So this is very useful if you want to send a detailed, self-contained report about some SQL execution problem to someone by email.
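
For example, spooling an ACTIVE-type report from SQL*Plus could look something like the sketch below; the file name and the &sql_id substitution variable are just placeholders, and the SET commands are only there to keep SQL*Plus from truncating or wrapping the returned CLOB:

SET LONG 1000000 LONGCHUNKSIZE 1000000
SET PAGESIZE 0 LINESIZE 32767 TRIMSPOOL ON

SPOOL /tmp/sqlmon_report.html

SELECT DBMS_SQLTUNE.REPORT_SQL_MONITOR(
           sql_id       => '&sql_id',
           report_level => 'ALL',
           type         => 'ACTIVE') AS report
FROM dual;

SPOOL OFF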

Controlling SQL Monitoring

SQL Monitoring kicks in immediately for all statements executed with parallel execution, no matter how long they run. For serially executed statements, SQL Monitoring does not kick in immediately, as it’s not designed to monitor typical fast OLTP queries, which are executed many times per second. Nevertheless, once a serial query has consumed more than 5 seconds of total CPU and/or I/O wait time, it’s considered a long-running query and SQL Monitoring is enabled for that statement execution. This happens seamlessly on the fly, and a statement restart is not needed.

You can also use MONITOR and NO_MONITOR hints to control the SQL monitoring for a statement. V$SQL_HINT shows all the hints available for use and the version when they were introduced. For example:

SQL> SELECT name, inverse, version, class, sql_feature
  2  FROM v$sql_hint WHERE name LIKE '%MONITOR%';

 NAME                 INVERSE              VERSION
-------------------- -------------------- ---------------
NO_MONITORING                             8.0.0
MONITOR              NO_MONITOR           11.1.0.6
NO_MONITOR           MONITOR              11.1.0.6

Note that the NO_MONITORING hint is something completely different despite the similar name. The NO_MONITORING hint allows you to disable the predicate usage monitoring on table columns (sys.col_usage$), and it has nothing to do with the Real-Time SQL Monitoring option introduced in Oracle 11g.
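
As a hypothetical quick test of the MONITOR hint, you could run a short query that would normally be too fast to get monitored and then check that an entry for your user showed up in V$SQL_MONITOR. This is just a sketch; DBA_OBJECTS is used only as a convenient table to scan:

-- Force monitoring of a statement that would otherwise be too short to monitor
SELECT /*+ MONITOR */ COUNT(*) FROM dba_objects;

-- Verify that the execution was captured
SELECT sql_id, status, ROUND(elapsed_time / 1e6, 1) AS db_time_sec
FROM   v$sql_monitor
WHERE  username = USER
ORDER BY sql_exec_start DESC;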

Monitoring SQL Statements using V$SQL and V$SQLSTATS

Performance monitoring in the “old days,” before the arrival of ASH in Oracle 10g, was usually done using various V$ views, which showed aggregated instance-wide metrics. For example, Statspack’s Top 5 wait events report section was just a delta between two V$SYSTEM_EVENT view snapshots. The TOP SQL reports were based on V$SQL snapshots, which externalize the execution statistics and resource consumption for each child cursor still in the library cache. However, in a large database system (think E-Business Suite or SAP), you can have tens of thousands of cursors in the library cache, so gathering and storing deltas of all of their stats is not feasible. That’s why tools like Statspack and AWR store deltas of only some top resource-consuming statements and ignore the insignificant ones. Remember that as these tools gather instance-wide data, they may end up ignoring a long-running statement if there are only a few sessions executing it. A single session running a bad SQL statement may not be “heard” in the noise of all the other sessions in the instance—potentially thousands of them. This instance-wide performance data analysis is not as powerful as session-level ASH data slicing and dicing. With ASH you can drill down into any single session, regardless of how many sessions in total you have making noise in the instance.

If you use Exadata, you must be running at least Oracle 11g R1 on it, so all these superior tools are available, assuming that you have the Diagnostics and Tuning Pack licenses for your Exadata cluster. By the way, we haven’t seen an Exadata-using customer without Diagnostics and Tuning Pack licenses yet, so it looks like the vast majority of Exadata users do not have to resort to old tools like Statspack or create a custom ASH-style repository themselves (although it’s easily doable with a few lines of PL/SQL code polling V$SESSION or its underlying X$KSUSE view in each instance).

The V$SQL and V$SQLSTATS views still have some advantages over SQL Monitoring and ASH-style sampled data in a few cases. For example, if you want to measure metrics like the number of executions, buffer gets, parse calls, fetches, or rows returned by a SQL child cursor, you can get this data from either Real-Time SQL Monitoring (V$SQL_MONITOR) or the V$SQL/V$SQLSTATS views, but not from ASH. The problem with SQL Monitoring, though, is that it doesn’t monitor short-running queries at all, making it unusable for keeping track of OLTP-style small queries executed many times per second. Even adding a MONITOR hint into every query of your application would not help, as the maximum number of monitored plans is limited (controlled by the _sqlmon_max_plan parameter, which defaults to 20 plans per CPU) and you would likely end up with real-time plan statistics latch contention as well. Simply put, the SQL Monitoring feature is not meant for monitoring short-running queries executed many times per second.

And this leaves us with V$SQL and V$SQLSTATS. They both maintain similar data, but they are internally different. Whenever you query V$SQL without specifying an exact SQL_ID, Oracle has to traverse through every single library cache hash bucket, and all cursors under it. This may contribute to library cache mutex contention if you have a busy database and lots of cursors in the library cache, because when you traverse the library cache structure and read its objects’ contents, you’ll have to hold a mutex on the object. Note that starting from Oracle 11g, all library cache latches are gone and are replaced by mutexes. These same mutexes are used for parsing, looking up, and pinning cursors for execution, so if your monitoring queries poll V$SQL frequently, they may end up causing waits for other sessions.

The V$SQLSTATS view, which was introduced in Oracle 10gR2, doesn’t have this problem. Starting from Oracle 10gR2, Oracle actually maintains SQL execution statistics in two places— inside the child cursors themselves (V$SQL) and in a separate cursor statistics array stored in a different location in the shared pool. This separation gives the benefit that even if a cursor is flushed out from the shared pool, its stats in this separate array may remain available for longer. Also, when monitoring tools query V$SQLSTATS, they don’t have to scan through the entire library cache, thanks to the separate array where stats are stored. This means that your monitoring tools won’t cause additional library cache latch (or mutex) contention when they use V$SQLSTATS instead of V$SQL. Both Statspack and AWR do use V$SQLSTATS to collect data for their top SQL reports.

Let’s look into V$SQLSTATS (the V$SQL view has pretty much the same columns, by the way). Some of the output here is removed to save space:

SQL> @desc v$sqlstats
           Name                            Null?    Type
           ------------------------------- -------- ----------------------
    1      SQL_TEXT                                 VARCHAR2(1000)
    2      SQL_FULLTEXT                             CLOB
    3      SQL_ID                                   VARCHAR2(13)
   ...
    6      PLAN_HASH_VALUE                          NUMBER
   ...
   20      CPU_TIME                                 NUMBER
   21      ELAPSED_TIME                             NUMBER
   ...
   26      USER_IO_WAIT_TIME                        NUMBER
   ...
   33      IO_CELL_OFFLOAD_ELIGIBLE_BYTES           NUMBER
   34      IO_INTERCONNECT_BYTES                    NUMBER
   35      PHYSICAL_READ_REQUESTS                   NUMBER
   36      PHYSICAL_READ_BYTES                      NUMBER
   37      PHYSICAL_WRITE_REQUESTS                  NUMBER
   38      PHYSICAL_WRITE_BYTES                     NUMBER
   ...
   41      IO_CELL_UNCOMPRESSED_BYTES               NUMBER
   42      IO_CELL_OFFLOAD_RETURNED_BYTES           NUMBER

The highlighted rows, starting with IO_CELL, are metrics related to Exadata storage cells (although IO_INTERCONNECT_BYTES is populated on non-Exadata databases as well, thanks to a bug). You would want to compare these cell metrics to database metrics (like PHYSICAL_READ_BYTES) to understand whether a SQL statement is benefitting from Exadata Smart Scan offloading. Note that these metrics, which merely count bytes, should not be used as the primary metric of performance; again, the primary metric, the starting point, should always be response time, which you can then break down into individual wait events or into execution plan rowsource activity (with the SQL Monitoring report or ASH). You can learn more about the meaning of the metrics shown here in Chapter 11.
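
For example, here is a sketch of a query listing the top statements by offload-eligible bytes, so that you can compare the database-issued I/O with what actually went over the interconnect (the top-10 filtering is done with ROWNUM, as this is an 11.2 database):

SELECT *
FROM (
    SELECT
        sql_id,
        executions,
        ROUND(physical_read_bytes            / 1048576) AS phys_read_mb,
        ROUND(io_cell_offload_eligible_bytes / 1048576) AS offload_eligible_mb,
        ROUND(io_interconnect_bytes          / 1048576) AS interconnect_mb
    FROM
        v$sqlstats
    WHERE
        io_cell_offload_eligible_bytes > 0
    ORDER BY
        io_cell_offload_eligible_bytes DESC
)
WHERE ROWNUM <= 10;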

Note that even though the V$SQLSTATS view also contains the PLAN_HASH_VALUE column, it actually does not store separate stats for the same SQL ID with different plan hash values. It aggregates all the stats generated by different plans of the same SQL ID under a single bucket. So you don’t really know which plan version consumed the most resources from this view. Luckily in Oracle 11.2 there is a new view, named V$SQLSTATS_PLAN_HASH, which you should query instead. It organizes and reports the stats broken down by SQL ID and plan hash value instead of just SQL ID as V$SQLSTATS does.

Monitoring the Storage Cell Layer

Let’s look at how to monitor the storage cell layer for utilization and efficiency. As you already know, the storage cells are regular commodity servers with some disks attached to them, running Linux. You can use regular Linux OS monitoring commands and tools to keep an eye on OS metrics like CPU usage or disk I/O activity. The only catch is that Oracle does not allow you to install any additional (monitoring) daemons or software on the cells, at least not if you want to keep a supported configuration.

The good news is that Oracle cell server software and additional OS Watcher scripts do a good job of collecting detailed performance metrics in each cell. The following sections show how to extract, display, and use some of these metrics.

Accessing Cell Metrics in the Cell Layer Using CellCLI

The cell-collected metrics can be accessed using the CellCLI utility. That’s in fact how the Exadata storage cell plug-in for Grid Control retrieves its data. You have seen CellCLI used in earlier chapters. Following are a few examples of using CellCLI for retrieving performance metrics. The command names should be fairly self-explanatory; we will cover the CellCLI commands in more detail in Appendix A.

$ cellcli
CellCLI: Release 11.2.2.3.1 - Production on Mon May 16 09:22:19 CDT 2011

Copyright (c) 2007, 2009, Oracle.  All rights reserved.
Cell Efficiency Ratio: 163

CellCLI> LIST METRICDEFINITION  ← show metric short names
      CD_IO_BY_R_LG
      CD_IO_BY_R_LG_SEC
      CD_IO_BY_R_SM
      CD_IO_BY_R_SM_SEC
      CD_IO_BY_W_LG
      CD_IO_BY_W_LG_SEC
      CD_IO_BY_W_SM
      CD_IO_BY_W_SM_SEC

   ... lots of output removed ...

      N_MB_RECEIVED
      N_MB_RECEIVED_SEC
      N_MB_RESENT
      N_MB_RESENT_SEC
      N_MB_SENT
      N_MB_SENT_SEC
      N_NIC_KB_RCV_SEC
      N_NIC_KB_TRANS_SEC
      N_NIC_NW
      N_RDMA_RETRY_TM

CellCLI> LIST METRICDEFINITION CL_CPUT DETAIL;   ← show metric description and details
      name:                    CL_CPUT
      description:             "Percentage of time over the previous minute that the system CPUs were not idle."
      metricType:              Instantaneous
      objectType:              CELL
      unit:                    %


CellCLI> LIST METRICCURRENT CL_CPUT;             ← list the latest metric snapshot value
      CL_CPUT      enkcel01      5.3 %

CellCLI> LIST METRICCURRENT CL_CPUT DETAIL;      ← list latest snapshot in detail
      name:                    CL_CPUT
      alertState:              normal
      collectionTime:          2011-05-16T09:31:53-05:00
      metricObjectName:        enkcel01
      metricType:              Instantaneous
      metricValue:             5.3 %
      objectType:              CELL


CellCLI> LIST METRICHISTORY CL_CPUT;             ← show historical metric snapshots

      CL_CPUT      enkcel01      2.2 %      2011-05-06T12:00:46-05:00
      CL_CPUT      enkcel01      2.4 %      2011-05-06T12:01:49-05:00
      CL_CPUT      enkcel01      2.1 %      2011-05-06T13:00:15-05:00
      CL_CPUT      enkcel01      0.3 %      2011-05-06T13:01:15-05:00
      CL_CPUT      enkcel01      1.4 %      2011-05-06T13:02:15-05:00
      CL_CPUT      enkcel01      0.3 %      2011-05-06T13:03:15-05:00
      CL_CPUT      enkcel01      1.2 %      2011-05-06T13:04:15-05:00
      CL_CPUT      enkcel01      1.2 %      2011-05-06T13:05:15-05:00

Accessing Cell Metrics Using the Grid Control Exadata Storage Server Plug-In

If you have configured the System Monitoring plug-in for Exadata Storage Server, you will be able to see some cell-level metrics, reports, and charts in Grid Control. The plug-in installation is out of the scope of this book, but it’s well documented in the plug-in’s installation guide.

Currently the Exadata Storage Server monitoring plug-in gives only some basic metrics to monitor and is not very sophisticated for systematic cell performance troubleshooting or monitoring. Nevertheless, it gives you a more convenient way to monitor how cell statistics change over time than running CellCLI’s list metrichistory commands manually. When you initially navigate to a storage cell target (or group of cells) in Grid Control, you may not see any charts defined there yet. In Figure 12-10, we have created a group called “cells” from the Exadata Storage Server targets and navigated to the Edit Group page. This is where you can add new charts for collected cell-level metrics.


Figure 12-10. Adding storage cell metrics into Grid Control monitoring charts

All the chart definitions in Figure 12-10 have been manually added for the cell group. You can add charts yourself, by clicking the Add button, which will bring you to the Add Chart page shown in Figure 12-11.


Figure 12-11. Selecting a “highest-average” chart for monitoring cells with highest read I/O requests

To see the cell metrics available, you’ll have to select Oracle Exadata Storage Server as the chart type and pick one of the metrics to show. You probably want to use the Group Statistics for monitoring the whole cell group as a starting point, instead of monitoring all 14 storage cells (in a full rack) separately. The group statistics allow you to sum together stats from all the cells in the group (when monitoring total disk I/O throughput, for example) or show minimum, maximum, averages and standard deviation across all cells (when monitoring average disk I/O response times, for example). Pay attention to what the metric means and which aggregation operations make sense for a given metric. For example, it doesn’t make much sense to display the sum of all average disk I/O response times in a chart. The metrics starting with Avg show average values over multiple objects monitored. For example, the Avg Celldisk Reads (MB) monitors average megabytes read per celldisk. On the other hand, the Total Celldisk Reads (MB) metric sums together the MB read by all cell disks in a cell. And now if you use the Group Statistics option to monitor multiple cells, you should use Sum to see the total MB read from cell disks across all cells and Average or Min/Max to see the average per cell or a cell with the smallest or biggest MB read cell disk throughput.

If you do want to monitor individual cells— for example, the two storage cells with the worst offload efficiency—then you can either define a separate offloading efficiency chart for each cell (and end up with 14 charts in a full rack configuration) or you can select Targets With Lowest Average as the chart type, as shown in Figure 12-12.


Figure 12-12. Selecting a “lowest average” chart for monitoring cells with worst offload efficiency

You can read more about the meaning of various metrics in the Oracle document “Enterprise Manager System Monitoring Plug-In Metric Reference Manual for Oracle Exadata Storage Server.”

Now, you can reorder the charts on the cell group dashboard by clicking the Reorder button as shown in Figure 12-10 earlier. This will show you a list of defined charts, which you can then reorder on the screen, as seen in Figure 12-13.


Figure 12-13. Ordering the storage cell metric charts in Grid Control

Now you are ready to view your charts. Depending on how many of them you configured, you should see a screen layout like that in Figure 12-14.


Figure 12-14. Multiple cell-level metrics displayed in Grid Control

Note that when you first configure the charts, they will be empty. If you wait for a few minutes, you should see the first data points appearing, and after a day has passed, the charts should be fully populated (if Last 24 Hours is selected in the View Data drop-down). You can click Customize Charts to modify the chart metrics again.

Now let’s look at some of the individual metrics we have put onto the page, as shown in Figure 12-15.


Figure 12-15. Various celldisk average metrics charts displayed in Grid Control

In the Avg Celldisk Small Read Latency (msec) chart here, we do see an issue. About an hour before 12 PM, the average small-block I/O (under 256KB) latency jumped up, probably because more large I/O operations were done in the cell. We can conclude this from the other two metrics in the top row: the Avg Celldisk Read Requests has dropped significantly, while the Avg Celldisk Reads (MB) has remained the same, or actually even increased slightly. So there are more large I/Os being done, perhaps due to more queries using a Smart Scan. The small I/O latency has suffered because of this. Your OLTP database users running on the same cluster may notice that.

This is the point where you should find out what is causing all the extra I/O; we will look at how to do that later in this chapter. If you want to avoid situations where large Smart Scans impact small (single-block) I/Os, you should consider enabling the I/O Resource Manager (IORM), which allows you to set the IORM goal to optimize random I/O latency as opposed to full scanning throughput. Additionally, you could make sure that your hot segments accessed with single-block reads are cached in the flash cache. See Chapter 7 for more about IORM and resource management.

Figure 12-16 shows an example of the total cell disk I/O throughput chart. Note that the metric used starts with “Total,” indicating that the throughput MB values are summed over all twelve disks in the cell (as opposed to the Average metrics, which show averages across the disks). This is a very important point. If you are summing together average metrics like in the Avg Celldisk Reads section (top right corner) of Figure 12-15, then you will still end up with a sum of the average cell disk read rates, not the total aggregate throughput of all disks in the cell. The cell disks’ average metrics are good for charting things like the min, max, and cell-level average I/O latencies or disk utilization figures, but not valid for monitoring things like I/O throughput of all cells.

It is valid to sum together Total Celldisk metrics to get the total aggregate throughput, so from Figure 12-16 we see that the aggregate disk read throughput across all the cells is around 2000MB/second for the given workload.
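
If you ever want to sanity-check such a chart against the raw data, you can pull the per-celldisk read throughput metrics straight from a cell with CellCLI and sum them up yourself. The following is only a sketch, using the CD_IO_BY_R_LG_SEC and CD_IO_BY_R_SM_SEC metrics listed earlier; the metricObjectName filter keeps only the hard disk cell disks and excludes the flash disks (discussed later in this chapter):

CellCLI> LIST METRICCURRENT WHERE objectType = 'CELLDISK'
         AND name LIKE 'CD_IO_BY_R_.*_SEC' AND metricObjectName LIKE 'CD_.*';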


Figure 12-16. Celldisk-level Total metrics summed together for monitoring total I/O throughput

Sometimes, especially when the monitoring UI charts don’t seem to make sense, you might ask where these charts take their data from. This is easy to find out, as the Grid Control metrics plug-ins are just Perl scripts, which then run command-line utilities like CellCLI or plain SSH clients to fetch their metrics. You can check the directory where you have installed the Exadata Storage Cell Server monitoring plug-in yourself:

$ ls /u01/app/oracle/product/middleware/agent11g/sysman/admin/scripts/emx/oracle_cell

oracle_cell_category_metrics.pl       oracle_cell_griddisk_config.pl
oracle_cell_celldisk_config.pl        oracle_cell_griddisk_metrics.pl
oracle_cell_celldisk_metrics.pl       oracle_cell_interconnect_metrics.pl
oracle_cell_common.pm                 oracle_cell_iormboost_metrics.pl
oracle_cell_config.pl                 oracle_cell_iorm_config.pl
oracle_cell_consumergroup_metrics.pl  oracle_cell_lun_config.pl
oracle_cell_database_metrics.pl       oracle_cell_lun_metrics.pl
oracle_cell_disk_config.pl            oracle_cell_metrics.pl
oracle_cell_disk_metrics.pl           oracle_cell_offload_metrics.pl
oracle_cell_flashcache_metrics.pl     oracle_cell_resp.pl
oracle_cell_fs_metrics.pl

Let’s take a quick look at one of these Perl scripts, oracle_cell_celldisk_metrics.pl, to see where it gets its data:

...beginning of the script skipped...

my $command = "ssh -l cellmonitor $ipaddress cellcli
                      -xml -e 'list metriccurrent where objectType=".'"CELLDISK"'."'";

my $xmlout = `$command`;

my $parser = XML::Parser->new( Handlers => {
                                      Start=>&handle_start,
                                      End=>&handle_end,
                                      Char=> &characterData,
                                     });

...rest of the script skipped...

As you can see, the Exadata Cell Grid Control plug-in is merely a bunch of Perl scripts, which then log into target cells via SSH and run regular CellCLI commands with XML output mode. The output is then parsed and returned to the Grid Control agent.

Now that we know which actual command the Grid Control plug-in uses for pulling data, we can run it ourselves and see which data it includes. Note that there’s an additional filter condition showing only one metric for brevity.

CellCLI> LIST METRICCURRENT WHERE objectType = 'CELLDISK' AND name = 'CD_IO_TM_R_SM_RQ'
      CD_IO_TM_R_SM_RQ      CD_00_cell03      25,390 us/request
      CD_IO_TM_R_SM_RQ      CD_01_cell03      11,633 us/request
      CD_IO_TM_R_SM_RQ      CD_02_cell03      6,067 us/request
      CD_IO_TM_R_SM_RQ      CD_03_cell03      11,412 us/request
      CD_IO_TM_R_SM_RQ      CD_04_cell03      20,410 us/request
      CD_IO_TM_R_SM_RQ      CD_05_cell03      2,871 us/request
      CD_IO_TM_R_SM_RQ      CD_06_cell03      30,480 us/request
      CD_IO_TM_R_SM_RQ      CD_07_cell03      22,791 us/request
      CD_IO_TM_R_SM_RQ      CD_08_cell03      2,303 us/request
      CD_IO_TM_R_SM_RQ      CD_09_cell03      12,019 us/request
      CD_IO_TM_R_SM_RQ      CD_10_cell03      2,010 us/request
      CD_IO_TM_R_SM_RQ      CD_11_cell03      15,056 us/request
      CD_IO_TM_R_SM_RQ      FD_00_cell03      464 us/request
      CD_IO_TM_R_SM_RQ      FD_01_cell03      471 us/request
      CD_IO_TM_R_SM_RQ      FD_02_cell03      476 us/request
      CD_IO_TM_R_SM_RQ      FD_03_cell03      473 us/request
      CD_IO_TM_R_SM_RQ      FD_04_cell03      450 us/request
      CD_IO_TM_R_SM_RQ      FD_05_cell03      467 us/request
      CD_IO_TM_R_SM_RQ      FD_06_cell03      469 us/request
      CD_IO_TM_R_SM_RQ      FD_07_cell03      474 us/request
      CD_IO_TM_R_SM_RQ      FD_08_cell03      456 us/request
      CD_IO_TM_R_SM_RQ      FD_09_cell03      459 us/request
      CD_IO_TM_R_SM_RQ      FD_10_cell03      461 us/request
      CD_IO_TM_R_SM_RQ      FD_11_cell03      455 us/request
      CD_IO_TM_R_SM_RQ      FD_12_cell03      467 us/request
      CD_IO_TM_R_SM_RQ      FD_13_cell03      480 us/request
      CD_IO_TM_R_SM_RQ      FD_14_cell03      483 us/request
      CD_IO_TM_R_SM_RQ      FD_15_cell03      471 us/request

The cell disk names starting with FD_ indicate that these lines are reported for flash disks (each cell has 4 × 96GB flash cards, but each flash card presents four flash disks to the operating system, totaling 16 flash disks per cell). So, with this exercise we have found out whether the cell disk metrics include flash disk metrics in addition to regular disk drives—and yes, they do. This makes some of the charts lie, as the flash read times, which are an order of magnitude faster, will make the average regular disk read times look better than they really are. In fact, some of the cell disks have pretty bad small read I/O latency: up to 30 milliseconds per request on average, which is much higher than the combined seek, rotational, and data transfer latency should be for a modern 15,000 RPM disk, even if it is 100% busy with random I/O requests. This probably indicates I/O queueing somewhere, likely in the OS I/O queue, which you can confirm by looking at the await and svctm columns in iostat output (more about iostat later in this chapter, but man iostat is your friend).

So, ideally you should monitor the flash disks and good old spinning disk drives separately, by either post-processing the returned data or just running two separate CellCLI commands in your monitoring tools. CellCLI allows you to filter its results further using a regular expression, as shown here:

CellCLI> LIST METRICCURRENT WHERE objectType = 'CELLDISK'
         AND name = 'CD_IO_TM_R_SM_RQ' AND metricObjectName LIKE 'FD_.*';

      CD_IO_TM_R_SM_RQ      FD_00_cell03      464 us/request
      CD_IO_TM_R_SM_RQ      FD_01_cell03      471 us/request
      CD_IO_TM_R_SM_RQ      FD_02_cell03      476 us/request
      CD_IO_TM_R_SM_RQ      FD_03_cell03      473 us/request
      CD_IO_TM_R_SM_RQ      FD_04_cell03      450 us/request
      CD_IO_TM_R_SM_RQ      FD_05_cell03      467 us/request
      CD_IO_TM_R_SM_RQ      FD_06_cell03      469 us/request
      CD_IO_TM_R_SM_RQ      FD_07_cell03      474 us/request
      CD_IO_TM_R_SM_RQ      FD_08_cell03      456 us/request
      CD_IO_TM_R_SM_RQ      FD_09_cell03      459 us/request
      CD_IO_TM_R_SM_RQ      FD_10_cell03      461 us/request
      CD_IO_TM_R_SM_RQ      FD_11_cell03      455 us/request
      CD_IO_TM_R_SM_RQ      FD_12_cell03      467 us/request
      CD_IO_TM_R_SM_RQ      FD_13_cell03      480 us/request
      CD_IO_TM_R_SM_RQ      FD_14_cell03      483 us/request
      CD_IO_TM_R_SM_RQ      FD_15_cell03      471 us/request

Now we see only the flash drives. You can read more about the CellCLI usage and syntax examples in Appendix A of this book and also in the official documentation, Oracle Exadata Storage Server Software User’s Guide.
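If you want the spinning drives alone, the complementary filter matches the cell disk names starting with CD_:

CellCLI> LIST METRICCURRENT WHERE objectType = 'CELLDISK'
         AND name = 'CD_IO_TM_R_SM_RQ' AND metricObjectName LIKE 'CD_.*';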

Which Cell Metrics to Use?

Which cell metrics to use is an important question. Oracle provides a huge variety of different metrics, from the database level, the host server level, and the Exadata cell level. It is very easy to get lost in them or, even worse, get misled into troubleshooting the wrong or a non-existent problem. In performance troubleshooting, we should start from what matters, the user response time, and drill down from there (as explained in the previous sections of this chapter). But general efficiency monitoring, utilization overview, and capacity planning are sometimes good reasons for examining cell-level aggregated metrics.

This may come as a surprise if you expected more, but there is no long list of secret and special Exadata metrics you should monitor in your cells. A cell is just a server, with RAM, CPUs, and disks, sending data out over a network link. So the fundamental metrics you would want to monitor at the cell level are the same you'd use for any server: CPU utilization, system load, memory utilization, disk utilization, and I/O throughput and latency. The most important metrics are available in the Grid Control Storage Cell monitoring plug-in, and the rest can be extracted from OS Watcher logfiles. The advanced metrics come into play only when dealing with bugs or some rare, very specific performance issues, so they shouldn't really be actively monitored unless there is a problem.

One topic deserves some elaboration: CPU usage monitoring, both in the database and in the cells. Data warehouses and reporting systems, unlike OLTP databases, don't usually require very quick response times for user queries. Of course, the faster a query completes the better, but in a DW, people don't really notice if a query ran in 35 seconds instead of 30 seconds. A typical OLTP user, however, would definitely notice if their 1-second query occasionally took 6 seconds. That's one of the reasons why on OLTP servers you would not want to run constantly at 100% CPU utilization; you cannot do that and also maintain stable performance. In an OLTP system you must leave some headroom. In DW servers, however, small fluctuations in performance will not be noticed, and you can afford to run at 100% CPU utilization in order to get the most out of your investment.

However, Exadata complicates things. In addition to having multiple database nodes, you also have another whole layer of servers: the cells. Things get especially interesting when running Smart Scans with high parallelism against EHCC tables, because offloaded decompression requires a lot of CPU cycles in the cells. Thus it is possible that for some workloads your cells' CPUs will be 100% busy and unable to feed data back to the database layer fast enough. The database layer CPUs may be half idle, while the cells could really use some extra CPU capacity.

The risk of cell utilization reaching 100% is the reason Oracle made cellsrv able to skip offload processing for some data blocks and pass these blocks straight back to the database (starting in cellsrv 11.2.2.3.0). The cell checks whether its own CPU utilization is over 90% and whether the database CPU utilization (the database sends this in, based on its resource manager statistics) is lower than that. If so, some blocks are not processed in the cells but are passed through to the database directly. The database then decrypts (if needed) and decompresses the blocks and performs projection and filtering in the database layer. This allows you to fully utilize the CPU capacity in both layers of the Exadata cluster. It also means, however, that if some blocks are suddenly processed in the database layer instead of being offloaded to the cells, you may see unexpected CPU utilization spikes in the database layer whenever the cells are too busy. This shouldn't be a problem on most DW systems, especially with a properly configured resource manager, but you would want to watch out for this when running OLTP or other low-latency systems on Exadata.

As usual, Oracle provides good metrics about this pass-through feature. Whenever the offload processing is skipped for some blocks during a Smart Scan and these blocks are sent back to the database layer for processing, the statistic “cell physical IO bytes pushed back due to excessive CPU on cell” gets incremented in V$SESSTAT/V$SYSSTAT and AWR reports. Read more about this feature and statistic in Chapter 11.
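A quick way to see whether this pass-through has been happening on your system is to query the statistic directly. This is a minimal sketch; the statistic name is as listed above and its exact wording may vary between versions:

SQL> SELECT name, value
     FROM   v$sysstat
     WHERE  name = 'cell physical IO bytes pushed back due to excessive CPU on cell';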

Monitoring Exadata Storage Cell OS-level Metrics

Oracle Database and the cellsrv software do a good job of gathering various performance metrics, but there are still cases where you would want to use an OS tool instead. One reason is that the V$ views in Oracle usually tell you what Oracle thinks it's doing, which may not necessarily be what's really happening if you hit a bug or some other limitation of Oracle's built-in instrumentation. One such limitation is low-level I/O measurement.

Monitoring the Storage Cell Server’s I/O Metrics with iostat

Both Oracle Database and the storage cells do measure I/O completion time, in other words, the response time of I/Os. With synchronous I/O operations, Oracle's I/O wait time is merely the system call (like pread) completion time. With asynchronous I/O, the response time measurement is trickier, as an asynchronous I/O submit system call will not block and wait; it returns immediately in microseconds, and some I/O reaping system call is executed later, which marks the I/O operation as complete. The Exadata storage cells keep track of each asynchronous I/O request (for example, when it was submitted), and once the I/O is successfully reaped, they check the reaping timestamp against that I/O's submit timestamp and thus know its duration.

Despite this extra cellsrv-level I/O monitoring, the cells still don't break the I/O response time down into its two important components: how much time the I/O request spent uselessly waiting in the OS I/O queue before it was even sent out to the SCSI device, and how long the actual hardware service time was once the request was sent out to the hardware. Comparing the I/O service time to the waiting time gives an important clue about whether the storage hardware itself responds slowly, in which case the service time is higher than you would expect from a modern disk drive.

images Note: So what is this normal I/O service time to expect from disks? That depends on which disks you have, what their seek latency is, how far the disk read/write head has to seek from its previous location, what the rotational latency of the disk is, how fast it spins (RPM), and how fast it can transfer the bits just read over the wires (or fiber). There is no magic in calculating what a good service time should be for a disk; it’s all based on the physical capabilities of the disk, and these capabilities are documented in the vendor’s disk specs. You can read more about the various latencies involved from Wikipedia: http://en.wikipedia.org/wiki/Hard_disk_drive#Access_time.

Of course, there may be other overhead, like SAN storage network roundtrip times and any extra latency caused by storage controllers, switches, and so on. But remember, Exadata is not connected to SAN storage; the disks are attached directly to the storage cells. Each storage cell has a little LSI MegaRaid controller in it, and the disks are attached to that controller. There are no complex network components between the disks and storage cells, which might drive up latency or fail. More importantly, there are no “other” databases or applications connected to the storage, which you have no control over, but which could suddenly start saturating the disks without a warning. All you would see is that the disk I/O service times suddenly go up as the storage array gets overloaded.

Anyway, a small I/O request (up to 256 KB) against a modern 15000 RPM disk drive should ideally have an average service time of 5–7 milliseconds per request. This comes from roughly 3.5 ms of average seek latency and 2 ms of average rotational latency, plus about 0.4 ms for transferring the 256 KB over SAS channels. In theory this average seek latency is a pessimistic figure, as it assumes that the whole disk is full and the read/write heads have to do random seeks across tracks from one end of the disk cylinders to the other. But you have probably configured your disks into hotter and colder regions (the RECO disk group, for example), and the colder regions are visited much less frequently, driving the real average seek times down. So, while you may see service times as low as 2–3 ms when the disk heads don't have to seek too far, in practice, 10 ms average I/O service times are completely OK once you run real production workloads.
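As a quick sanity check, you can reproduce that arithmetic at the shell prompt (the 2 ms rotational component is simply half a revolution at 15000 RPM):

$ echo "scale=1; 3.5 + (60000 / 15000 / 2) + 0.4" | bc
5.9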

images Kevin Says: I’m often surprised at the lack of focus paid to data placement on physical disks. The authors are correct to point out that so-called short-stroked I/O requests have some, if not all, seek time removed from the service time of the request. However, I’d like to elaborate on this point further. When discussing Exadata we often concentrate on data capacity of enormous proportion. However, it is seldom the case that all data is active in all queries. Most data warehouses have an active subset of the total database. One can easily envision the active portion of a 100TB data warehouse being as small as 1TB. Just one percent? How can this be? Always remember that Oracle Database, combined with Exadata, offers many “lines of defense” in front of the dreaded physical I/O. Before physical I/O there is a results cache miss, SGA buffer cache miss (yes, it has purpose even when Exadata is involved), partition elimination, and finally a storage index miss. Once through these “lines of defense,” there is a physical I/O for data that is potentially compressed with a compression ratio of, say, 6:1. To that end, the 1TB active portion is actually 6TB—a sizable amount of data. Seek times are the lion’s share of physical I/O service time. In fact, from a current cost perspective, the optimal data warehouse architecture would consist of seek-free rotational hard disk drives. While that may seem absurd, allow me to point out that 1TB is exactly 1 percent of the aggregate surface area of the 168 (600GB) SAS drives offered in a full rack X2 model with high-performance drives. Long seeks are not a fixed cost in modern data warehouses—especially when the active portion is such a small percentage of the whole. If the active portion of the data warehouse remains a fixed percentage, yet service times increase over time, the problem is likely fragmentation. Data scattered over the entirety of round, spinning magnetic media is not optimal regardless of whether Oracle Database is connected to conventional storage or Exadata. Even with Exadata, the fundamentals still apply.

The whole point of this explanation so far is that it is possible to know what the ideal disk service times should be for the Exadata cells' disk drives, and if the service times are constantly much higher than that, then there is some problem with the storage hardware (disks or controllers). With SAN storage, high OS-level service times could also mean that there is some queuing going on inside the SAN network or the storage array (for example, if a thousand other servers are hammering the same storage array with I/O). But as said above, Exadata storage cells have dedicated storage in them; only the current cell's OS can access this storage, and all I/Os are visible in iostat.

So, what if the service time is OK, but Oracle (cellsrv) still sees bad I/O performance? Well, this may mean there is still queuing going on inside the cell Linux servers, and iostat can show this information, too. You can have only a limited number of outstanding I/Os in the "on-the-fly" state against your storage controller LUNs. The storage controller needs to keep track of each outstanding I/O (for example, it has to remember where in host RAM to write the block once it is read from disk and arrives at the controller), and these I/O slots are limited. So the Linux kernel does not allow sending out more I/Os than the storage controller can handle; otherwise a SCSI reset would occur. These throttled I/Os have to wait in the OS disk device I/O queue, and they are waiting there uselessly; they haven't even been sent out to the storage controller yet. Only when some previously outstanding I/O operation completes will the first item in the queue be sent to the disks (assuming that the I/O latency deadline hasn't been reached for some request; the cells use the Linux "deadline" I/O scheduler).
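If you want to check these settings on a cell, the Linux block layer exposes both the active I/O scheduler and the per-device outstanding-command limit through sysfs. A quick sketch (sda is just one of the disk devices shown in the listings below; the scheduler file prints the active scheduler in square brackets):

# cat /sys/block/sda/queue/scheduler
# cat /sys/block/sda/device/queue_depth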

If you do have I/O waiting (queueing) going on, then the await (average I/O completion time) column in iostat will be significantly higher than the svctm (estimated I/O service time) column. Note that the name "await" is somewhat misleading, as it does not show only the wait (queuing) time; it shows the total wait plus service time. Similarly, the avgqu-sz column does not show just the average I/O wait queue length, but rather the average number of not-yet-completed I/Os in the I/O request queue, regardless of whether they have already been sent out to the storage hardware (are already being serviced) or are still waiting in the I/O queue (not yet serviced).
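To eyeball this per device, you can let awk compute the implied queue wait as await minus svctm. This is a rough sketch that assumes the column layout shown in the listings below, where await, svctm, and %util are the last three columns:

$ iostat -x 5 | awk '/^sd[a-z]+ / { printf("%-5s await %8s svctm %6s queue_wait_ms %6.1f\n", $1, $(NF-2), $(NF-1), $(NF-2) - $(NF-1)) }'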

Linux iostat also shows statistics for the individual disk partitions and for the cell software RAID devices (used for mirroring the cell OS partitions). We can filter those lines out so that only the physical disk stats are listed. For example:

$ iostat -xm 5 | egrep -v "sd.[0-9]|^md"

Linux 2.6.18-194.3.1.0.4.el5 (enkcel03.enkitec.com)   05/22/2011

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.73    0.00    0.75    1.01    0.00   97.51

Device:  r/s   w/s  rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda    12.56  7.33   9.63     0.26  1018.44     1.39   70.13   3.99   7.94
sdb     8.64  8.75   3.94     0.24   491.94     0.87   49.94   3.16   5.49
sdc     5.99  1.75   3.87     0.14  1062.34     0.63   80.86   5.19   4.02
sdd     5.04  2.89   3.75     0.15  1005.46     0.87  110.01   5.18   4.11
sde    10.84  3.28  10.06     0.20  1488.13     0.94   66.61   4.78   6.74
sdf     1.62  1.83   0.05     0.05    62.18     0.01    3.15   0.92   0.32
sdg     6.62  2.02   6.24     0.15  1515.53     0.49   57.32   5.99   5.17
sdh     4.56  2.50   3.86     0.15  1165.21     0.42   59.70   5.63   3.97
sdi    11.15  3.14   4.58     0.17   681.93     0.66   46.39   4.41   6.30
sdj     4.39  1.66   3.87     0.12  1351.37     0.50   82.55   6.51   3.94
sdk    11.87  3.95  11.09     0.21  1463.66     0.88   55.94   4.64   7.33
sdl    16.60  5.41   9.86     0.15   931.06     0.76   34.68   3.20   7.04
sdm     0.28  0.00   0.01     0.00    47.93     0.00    1.59   1.38   0.04
sdn     2.87  0.30   0.03     0.01    24.42     0.00    0.55   0.34   0.11
sdo     2.84  0.31   0.03     0.01    24.63     0.00    0.57   0.34   0.11
sdp     2.82  0.30   0.03     0.00    24.49     0.00    0.58   0.34   0.10
sdq     2.79  0.45   0.03     0.01    25.05     0.00    0.58   0.33   0.11
sdr     2.89  0.28   0.03     0.00    24.44     0.00    0.60   0.34   0.11
sds     2.94  0.32   0.03     0.01    24.35     0.00    0.56   0.34   0.11
sdt     2.85  0.30   0.03     0.01    24.49     0.00    0.54   0.34   0.11
sdu     2.82  0.29   0.03     0.00    24.64     0.00    0.56   0.34   0.11
sdv     2.86  0.49   0.03     0.01    25.01     0.00    0.55   0.33   0.11
sdw     2.93  0.58   0.03     0.01    25.12     0.00    0.53   0.33   0.11
sdx     2.87  0.35   0.03     0.01    24.37     0.00    0.57   0.33   0.11
sdy     2.84  0.61   0.03     0.01    25.28     0.00    0.52   0.32   0.11
sdz     2.81  0.29   0.03     0.00    24.63     0.00    0.57   0.34   0.10
sdaa    2.86  0.32   0.03     0.01    24.40     0.00    0.60   0.34   0.11
sdab    2.89  0.33   0.03     0.01    24.37     0.00    0.58   0.34   0.11
sdac    2.90  0.29   0.03     0.00    24.36     0.00    0.58   0.34   0.11

Wait a minute! Shouldn’t we see fewer disks, as each cell has only 12 hard drives in it? We see so many disks because in addition to the 12 hard disks, cells also have their flash cards presented as separate disks. Finally, there’s also one USB disk. You can use the lsscsi command in the cell to see all disks detected by the OS:

$ lsscsi
[0:2:0:0]    disk    LSI      MR9261-8i        2.12  /dev/sda
[0:2:1:0]    disk    LSI      MR9261-8i        2.12  /dev/sdb
[0:2:2:0]    disk    LSI      MR9261-8i        2.12  /dev/sdc
[0:2:3:0]    disk    LSI      MR9261-8i        2.12  /dev/sdd
[0:2:4:0]    disk    LSI      MR9261-8i        2.12  /dev/sde
[0:2:5:0]    disk    LSI      MR9261-8i        2.12  /dev/sdf
[0:2:6:0]    disk    LSI      MR9261-8i        2.12  /dev/sdg
[0:2:7:0]    disk    LSI      MR9261-8i        2.12  /dev/sdh
[0:2:8:0]    disk    LSI      MR9261-8i        2.12  /dev/sdi
[0:2:9:0]    disk    LSI      MR9261-8i        2.12  /dev/sdj
[0:2:10:0]   disk    LSI      MR9261-8i        2.12  /dev/sdk
[0:2:11:0]   disk    LSI      MR9261-8i        2.12  /dev/sdl
[1:0:0:0]    disk    Unigen   PSA4000         1100  /dev/sdm
[8:0:0:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdn
[8:0:1:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdo
[8:0:2:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdp
[8:0:3:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdq
[9:0:0:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdr
[9:0:1:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sds
[9:0:2:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdt
[9:0:3:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdu
[10:0:0:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdv
[10:0:1:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdw
[10:0:2:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdx
[10:0:3:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdy
[11:0:0:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdz
[11:0:1:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdaa
[11:0:2:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdab
[11:0:3:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdac

The LSI disks here are the hard disks presented to the host by the LSI SCSI RAID controller (from devices sda to sdl), the Unigen PSA4000 is the USB disk, and the MARVELL ATA disks are the flash cards. If you compare the average I/O service times (svctm) in iostat, you’ll see that the devices belonging to flash cards have a service time an order of magnitude lower than hard disk devices. So, if you want to monitor only the hard-disk devices, you can filter iostat output this way:

$ iostat -xm 5 | egrep "Device|^sd[a-l] "
Device:  r/s   w/s  rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda    11.95 12.22   8.62     1.38   847.59     1.60   66.26   3.76   9.09
sdb     9.79 13.62   4.14     1.36   480.53     1.13   48.12   3.31   7.74
sdc     7.06  6.64   4.07     1.26   796.10     0.63   46.05   3.65   5.00
sdd     6.06  7.89   3.94     1.27   765.02     0.84   60.38   3.74   5.22
sde    10.45  8.20   8.90     1.31  1121.89     0.86   46.02   3.79   7.07
sdf     1.60  1.79   0.05     0.04    55.38     0.01    2.84   1.03   0.35
sdg     7.15  6.93   5.90     1.28  1044.40     0.53   37.76   4.22   5.94
sdh     5.61  7.48   4.06     1.28   835.26     0.49   37.78   3.85   5.04
sdi    12.89  8.11   4.63     1.31   579.09     0.71   33.88   3.50   7.34
sdj     5.43  6.55   4.07     1.23   906.89     0.56   46.48   4.05   4.85
sdk    11.34  8.86   9.74     1.34  1122.70     0.83   41.18   3.81   7.70
sdl    16.40 10.23   8.80     1.26   773.96     0.75   28.27   2.83   7.53

Advanced Cell Monitoring With OS Watcher

While the database-level wait profile and SQL Monitoring should be your starting point for performance monitoring and troubleshooting, sometimes these approaches are not enough and you want to drill deeper. This section explains one additional data source for that. Exadata storage cells come with a preinstalled tool called OS Watcher, located in the /opt/oracle.oswatcher/osw directory. This tool is just a set of shell scripts that run standard OS tools, like vmstat, iostat, and netstat, to collect their data. OS Watcher's benefit is that it runs at the OS level, not inside a database, so it is not affected by database hangs and performance issues, or by cases where the database's V$ views don't show the truth or enough detail. Additionally, the high-frequency OS Watcher collectors sample data every second or every few seconds, allowing it to detect short "hiccups" or bursts of activity.

You can check whether the OS Watcher is running by simply searching for any processes with “OSW” in their name (be sure to search for both upper and lower case):

# pgrep -lf "OSW|osw"
10543 /bin/ksh ./oswsub.sh HighFreq ./Exadata_cellsrvstat.sh
26412 /bin/ksh ./OSWatcher.sh 15 168 bzip2
28827 /bin/ksh ./OSWatcherFM.sh 168
28846 /bin/ksh ./oswsub.sh HighFreq ./Exadata_vmstat.sh
28847 /bin/ksh ./oswsub.sh HighFreq ./Exadata_mpstat.sh
28848 /bin/ksh ./oswsub.sh HighFreq ./Exadata_netstat.sh
28849 /bin/ksh ./oswsub.sh HighFreq ./Exadata_iostat.sh
28853 /bin/ksh ./oswsub.sh HighFreq ./Exadata_top.sh
28877 /bin/ksh ./oswsub.sh HighFreq /opt/oracle.oswatcher/osw/ExadataRdsInfo.sh
28910 /bin/bash /opt/oracle.oswatcher/osw/ExadataRdsInfo.sh HighFreq

Apparently, the OS Watcher is running in this storage cell and there are a number of Exadata-specific collectors enabled, too. If you want to see the hierarchy of the OS Watcher process daemons, then on Linux you can use either the ps -H command shown in the first chapter or the pstree command as shown here:

# pstree -aAhlup `pgrep OSWatcher.sh`

OSWatcher.sh,26412 ./OSWatcher.sh 15 168 bzip2
  |-ExadataDiagColl,28866 /opt/oracle.cellos/ExadataDiagCollector.sh
  |   `-sleep,25718 3
  |-OSWatcherFM.sh,28827 ./OSWatcherFM.sh 168
  |-oswsub.sh,10543 ./oswsub.sh HighFreq ./Exadata_cellsrvstat.sh
  |   `-Exadata_cellsrv,10547 ./Exadata_cellsrvstat.sh HighFreq
  |       |-bzip2,10591 --stdout
  |       `-cellsrvstat,10590 -interval=5 -count=720
  |-oswsub.sh,28846 ./oswsub.sh HighFreq ./Exadata_vmstat.sh
  |   `-Exadata_vmstat.,28850 ./Exadata_vmstat.sh HighFreq
  |       |-bzip2,12126 --stdout
  |       `-vmstat,12125 5 720
  |-oswsub.sh,28847 ./oswsub.sh HighFreq ./Exadata_mpstat.sh
  |   `-Exadata_mpstat.,28852 ./Exadata_mpstat.sh HighFreq
  |       |-bzip2,12627 --stdout
  |       `-mpstat,12626 5 720
  |-oswsub.sh,28848 ./oswsub.sh HighFreq ./Exadata_netstat.sh
  |   `-Exadata_netstat,28854 ./Exadata_netstat.sh HighFreq
  |       `-sleep,25717 15
  |-oswsub.sh,28849 ./oswsub.sh HighFreq ./Exadata_iostat.sh
  |   `-Exadata_iostat.,28856 ./Exadata_iostat.sh HighFreq
  |       |-bzip2,12105 --stdout
  |       `-iostat,12104 -t -x 5 720
  |-oswsub.sh,28853 ./oswsub.sh HighFreq ./Exadata_top.sh
  |   `-Exadata_top.sh,28864 ./Exadata_top.sh HighFreq
  |       |-bzip2,13453 --stdout
  |       `-top,13452 -b -c -d 5 -n 720
  `-oswsub.sh,28877 ./oswsub.sh HighFreq /opt/oracle.oswatcher/osw/ExadataRdsInfo.sh
      `-ExadataRdsInfo.,28910 /opt/oracle.oswatcher/osw/ExadataRdsInfo.sh HighFreq
          `-sleep,25710 10

The OS Watcher daemons store their collected data in the /opt/oracle.oswatcher/osw/archive/ directory. OS Watcher doesn't use any special format for storing the data; it just stores the text output of the standard OS tools it runs. This makes it easy to use regular text processing utilities, like grep, AWK, or Perl/Python scripts, to extract and present the information you need. Here's an example:

# ls -l /opt/oracle.oswatcher/osw/archive/
total 280
drwxr-s--- 2 root cellusers 28672 May 25 14:00 ExadataDiagCollect
drwxr-s--- 2 root cellusers 20480 May 25 14:00 ExadataOSW
drwxr-s--- 2 root cellusers 20480 May 25 14:00 ExadataRDS
drwxr-s--- 2 root cellusers 45056 May 25 14:00 oswcellsrvstat
drwxr-s--- 2 root cellusers 20480 May 25 14:01 oswiostat
drwxr-s--- 2 root cellusers 20480 May 25 14:00 oswmeminfo
drwxr-s--- 2 root cellusers 20480 May 25 14:00 oswmpstat
drwxr-s--- 2 root cellusers 24576 May 25 14:00 oswnetstat
drwxr-s--- 2 root cellusers  4096 May  6 12:35 oswprvtnet
drwxr-s--- 2 root cellusers 20480 May 25 14:00 oswps
drwxr-s--- 2 root cellusers 20480 May 25 14:00 oswslabinfo
drwxr-s--- 2 root cellusers 20480 May 25 14:00 oswtop
drwxr-s--- 2 root cellusers 20480 May 25 14:01 oswvmstat

Each directory contains archive files with corresponding command output:

# cd /opt/oracle.oswatcher/osw/archive/oswiostat/
# ls -tr  | tail

enkcel03.enkitec.com_iostat_11.05.25.0500.dat.bz2
enkcel03.enkitec.com_iostat_11.05.25.0600.dat.bz2
enkcel03.enkitec.com_iostat_11.05.25.0700.dat.bz2
enkcel03.enkitec.com_iostat_11.05.25.0800.dat.bz2
enkcel03.enkitec.com_iostat_11.05.25.0900.dat.bz2
enkcel03.enkitec.com_iostat_11.05.25.1000.dat.bz2
enkcel03.enkitec.com_iostat_11.05.25.1100.dat.bz2
enkcel03.enkitec.com_iostat_11.05.25.1200.dat.bz2
enkcel03.enkitec.com_iostat_11.05.25.1400.dat.bz2
enkcel03.enkitec.com_iostat_11.05.25.1300.dat.bz2

There is a separate bzipped file saved for each hour, making it easy to manually investigate past performance data or to write a quick AWK, grep, or Perl script that extracts only the data of interest. The example below extracts iostat I/O statistics from a bzipped file from 9 AM on May 25, 2011:

# bzcat enkcel03.enkitec.com_iostat_11.05.25.0900.dat.bz2 | head -15
zzz ***Wed May 25 09:01:43 CDT 2011 Sample interval: 5 secconds
Linux 2.6.18-194.3.1.0.4.el5 (enkcel03.enkitec.com)      05/25/11

Time: 09:01:43
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.76    0.00    0.78    1.37    0.00   97.09

Device:    r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda      12.30 12.66 17968.35  3004.31   840.31     1.27   51.05   3.68   9.18
sda1      0.00  0.00     0.45     0.00   114.13     0.00   12.17   1.57   0.00
sda2      0.00  0.00     0.00     0.00     2.20     0.00    0.62   0.60   0.00
sda3     11.44  7.14 17879.86  2869.85  1116.56     1.21   65.19   4.76   8.84
sda4      0.00  0.00     0.00     0.00     1.14     0.00   18.41  18.41   0.00
sda5      0.25  4.55    14.33   116.28    27.20     0.04    8.01   2.33   1.12
sda6      0.29  0.00    38.61     0.00   131.20     0.01   27.53   0.99   0.03
...

The OS Watcher output samples are prefixed by a timestamp line, which starts with "zzz" as you see above. This shows you the exact time, down to the second, when the OS command was executed, and what the sample interval was. Note that this "zzz" line only shows when the monitoring command was executed. Some commands are executed only once per hour, but they keep collecting and dumping data samples throughout that hour. In such cases, command-specific timestamps are printed for each data sample. The following example shows how the cellsrvstat command uses ===Current Time=== to indicate the exact time the performance data below it is from; you can key off this prefix in your custom scripts. Here's an example:

# bzcat enkcel03.enkitec.com_cellsrvstat_11.05.25.1300.dat.bz2 | head -30
zzz ***Wed May 25 13:00:00 CDT 2011 Sample interval: 5 secconds
===Current Time===                                      Wed May 25 13:00:00 2011

== Input/Output related stats ==
Number of hard disk block IO read requests                      0      167987444
Number of hard disk block IO write requests                     0      150496829
Hard disk block IO reads (KB)                                   0   111186344274
Hard disk block IO writes (KB)                                  0    25224016152
Number of flash disk block IO read requests                     0       67230332
Number of flash disk block IO write requests                    0       15528794
Flash disk block IO reads (KB)                                  0      870806808
Flash disk block IO writes (KB)                                 0      347337424
Number of disk IO errors                                        0              0
Number of reads from flash cache                                0       57890992
Number of writes to flash cache                                 0       13119701
Flash cache reads (KB)                                          0      870014160
Flash cache writes (KB)                                         0      347236056
Number of flash cache IO errors                                 0              0
Size of eviction from flash cache (KB)                          0              0
Number of outstanding large flash IOs                           0              0
Number of latency threshold warnings during job                 0          13400
Number of latency threshold warnings by checker                 0              0
Number of latency threshold warnings for smart IO               0              0
Number of latency threshold warnings for redo log writes        0           6568
Current read block IO to be issued (KB)                         0              0
Total read block IO to be issued (KB)                           0     9066623802
Current write block IO to be issued (KB)                        0              0
Total write block IO to be issued (KB)                          0    25054696317
Current read blocks in IO (KB)                                  0              0
Total read block IO issued (KB)                                 0     9066623802
.... a lot of output removed ....

This output is just the first of the 5-second samples that cellsrvstat dumps every hour. The first column shows the cell metric name, and the second column (all zeros here) shows the metric value in the current metric interval (snapshot). Because the command had just been executed (and taken its first sample), it shows zero for each metric; the delta computation starts only after the second snapshot is taken. If you navigate down in the OS Watcher dumpfile, you will see nonzero values for many metrics, starting from the second snapshot. The last column shows the cumulative value (since the cellsrv process started) for each metric. You should probably just ignore this cumulative value, as it accumulates information from the moment cellsrv started, which may have been months ago. A single cumulative value covering such a long time will not tell you much about what's happening right now or what was happening last Friday at 8 AM. Metric deltas over shorter time periods are the way to go, and that is what you see in the second column: the current interval value as of sampling the metrics.

Now let's say we are interested in only a single cellsrv metric, Number of latency threshold warnings for redo log writes, because the database layer experienced some log file parallel write delays earlier (as seen in V$EVENT_HISTOGRAM, for example). Let's say this happened early in the morning on May 25, 2011. As the OS Watcher logfiles are named according to the date and time (YY.MM.DD plus the hour in 24-hour format), you can narrow the search down to specific files just by using regular Unix filesystem wildcards. In the following example we're searching through all hours of May 25, 2011 and using egrep to list both the "current time" lines and the metric lines from the matching files:

# bzcat enkcel03.enkitec.com_cellsrvstat_11.05.25.*.dat.bz2 |
  egrep "===Current Time===|Number of latency threshold warnings for redo log writes" |
  head

===Current Time===                                      Wed May 25 00:00:03 2011
Number of latency threshold warnings for redo log writes        0           6564
===Current Time===                                      Wed May 25 00:00:08 2011
Number of latency threshold warnings for redo log writes        0           6564
===Current Time===                                      Wed May 25 00:00:13 2011
Number of latency threshold warnings for redo log writes        0           6564
===Current Time===                                      Wed May 25 00:00:18 2011
Number of latency threshold warnings for redo log writes        0           6564
===Current Time===                                      Wed May 25 00:00:23 2011
Number of latency threshold warnings for redo log writes        2           6566

As it’s hard to read the above output, here’s a little script which formats the output better and is more convenient to use:

#!/bin/bash

# Name:    oswextract.sh
# Purpose: Extract a specific metric from OS Watcher cellsrvstat archives
#
# Usage:
#   ./oswextract.sh "cellsrv metric to grep for" osw_archive_files_of_interest*.dat.bz2
#
# Example:
#   ./oswextract.sh "Number of latency threshold warnings for redo log writes"
#                                    cell01.example.com_cellsrvstat_11.05.25.*.dat.bz2

METRIC=$1
shift

# Keep only the sample timestamp lines and the requested metric lines, then
# print one row per sample: sample time, current interval value, and the
# cumulative value (the last two columns of each cellsrvstat metric line).
bzcat -q "$@" |
  egrep "Current Time|$METRIC" |
  awk -v metric="$METRIC" '
    BEGIN       { printf("%-21s %20s %20s\n", "TIME", "CURRENT_VAL", "CUMULATIVE_VAL") }
    /Current/   { printf("%s %s %s %s ", $3, $4, $5, $6) }
    $0 ~ metric { printf("%20d %20d\n", $(NF-1), $NF) }
  '
# end of script

While this script is meant to make extracting the desired metric history out of the OS Watcher logs easier, the main purpose of listing the script in this book is to show how easy extracting historical OS-level (and other) metrics can be. Investing a day into learning basic AWK or Perl text processing will likely pay off and save time when you need to do unconventional monitoring or troubleshooting tasks. Let’s see what it does:

# ./oswextract.sh "Number of latency threshold warnings for redo log writes"
                        enkcel03.enkitec.com_cellsrvstat_11.05.25.*.dat.bz2  | head

TIME                         CURRENT_VAL       CUMULATIVE_VAL
Wed May 25 00:00:03                   0                 6564
Wed May 25 00:00:08                   0                 6564
Wed May 25 00:00:13                   0                 6564
Wed May 25 00:00:18                   0                 6564
Wed May 25 00:00:23                   2                 6566
Wed May 25 00:00:28                   0                 6566
Wed May 25 00:00:33                   0                 6566
Wed May 25 00:00:38                   0                 6566
Wed May 25 00:00:43                   0                 6566

The script goes through all the filenames that match the file pattern given on the command line and greps for whatever metric is specified as the first parameter (in double quotes). In this case we are looking for Number of latency threshold warnings for redo log writes, which is incremented when I/O operations tagged as redo log file writes take too long to complete in the cell. Apparently two I/O operations in the time range between 00:00:18 and 00:00:23 had response times high enough to be reported.

Using the previous example, you can already generate formatted data that is good enough for visual inspection or for loading into Excel or some charting tool. If you are only hunting for occasional problem periods, you can also post-process the output, for example by showing just the lines where the metric value is not zero:

# ./oswextract.sh "Number of latency threshold warnings for redo log writes"
             enkcel03.enkitec.com_cellsrvstat_11.05.*.*.dat.bz2 | grep -v " 0 "

TIME                       CURRENT_VAL       CUMULATIVE_VAL
Sat May 7 02:00:02                   2                    2
Sat May 7 06:00:03                  10                   12
Sat May 7 08:00:02                   5                   17
Sat May 7 21:28:15                   1                   18
Sat May 7 22:00:19                   1                   19
Sat May 7 22:00:29                   4                   23
Sat May 7 22:00:34                   4                   27
Sun May 8 02:00:03                  10                   37
Sun May 8 02:00:09                   8                   45
Sun May 8 04:00:03                   4                   49
Sun May 8 06:00:11                  11                   82
Sun May 8 08:00:05                   6                   88
Mon May 9 02:00:10                  16                  104
Mon May 9 04:00:02                   4                  108

Alternatively, you can sort the output by the metric value in the CURRENT_VAL column to see the worst problem times first. The next example uses sort -nrk5, as we're sorting by the fifth column (as far as the sort utility is concerned, columns 1–4 hold the date components):

# ./oswextract.sh "Number of latency threshold warnings for redo log writes"
      enkcel03.enkitec.com_cellsrvstat_11.05.*.*.dat.bz2 |
      grep -v " 0 " | sort -nrk5 | head

Sat May 14 06:00:03                  30                  631
Fri May 13 00:00:09                  25                  432
Tue May 10 02:00:09                  22                  174
Sat May 14 00:00:11                  22                  549
Wed May 11 02:00:08                  17                  215
Mon May  9 02:00:10                  16                  104
Tue May 10 22:00:09                  12                  190
Sun May  8 06:00:11                  11                   82
Mon May  9 22:00:07                  11                  126
Sun May  8 02:00:03                  10                   37

Apparently, the most log file write latency warnings were recorded at 6:00:03 AM on May 14. When you look at the other output lines, you probably notice an interesting pattern: all these problems are reported a few seconds after a full hour. In this particular Exadata test cluster we have over ten separate databases installed, but they mostly run parallel reports, and usually there is no OLTP activity doing commits. So what is issuing commits right after each hour starts? Well, these are default Oracle Database installations, with AWR snapshots taken at the full hour. And there are over ten databases in the cluster, with an AWR snapshot taken and written in every cluster database instance. Each AWR snapshot write involves multiple commits, and apparently the resulting log file writes have occasionally taken longer than expected. The next step for troubleshooting such low-level I/O issues in a cell would be to look at the OS-level metrics, like iostat. The cells do perform regular block I/O against their disks, after all, so all the OS-level I/O monitoring tools still apply.
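Since the OS Watcher iostat archives follow the same naming convention, pulling the matching sample is just another bzcat away. A sketch, assuming the archive for that hour on this cell is named as in the earlier listings:

# bzcat enkcel03.enkitec.com_iostat_11.05.14.0600.dat.bz2 | grep -A 20 "^Time: 06:00:0"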

In the following extract from the iostat archive, you see a few interesting things. The await time shows that, on average, the I/O completion time (queuing plus service time) is over 100 ms for almost all the disks; for a couple of disks it is as high as 549 ms and 738 ms on average! The disks' estimated service time (svctm) shows good I/O service times (below 10 ms), so the hard disks and the lower half of the I/O subsystem seem to work OK. The high response times must have come from I/O queuing waits, which happen when more I/O requests are issued to the OS than the storage hardware (and device driver configuration) can accept.

Time: 06:00:05
Device:  r/s   w/s    rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda    84.60 15.20 162235.20  1038.40  1636.01    61.35  738.49   9.40  93.84
sdb    68.60 16.80 124716.80  1375.20  1476.49    20.91  253.98   8.37  71.50
sdc    65.00  0.00 130918.40     0.00  2014.13    11.14  171.82   8.94  58.12
sdd    60.00  1.20 120358.40    27.60  1967.09     9.74  167.52   8.56  52.38
sde    60.00  4.80 122752.00   245.60  1898.11    10.41  172.73   8.32  53.92
sdf     1.60  1.20     51.20    28.80    28.57     0.00    0.07   0.07   0.02
sdg    58.00  0.00 118272.00     0.00  2039.17     9.51  173.76   8.53  49.46
sdh    54.60  2.40 110912.00    48.40  1946.67    77.09  549.34   9.08  51.78
sdi    24.20  4.80 133280.00   242.80  1035.06    15.82  125.00   5.24  67.62
sdj    64.20  0.40 130566.40     3.40  2021.20    11.07  171.43   8.68  56.10
sdk    61.20  0.80 124515.20    25.60  2008.72    12.32  202.59   9.32  57.78
sdl    78.40  2.60 131196.80    21.20  1619.98    12.41  153.18   7.55  61.14

This case is an example of how low-latency OLTP I/Os (like log file writes for commits) don’t go well together with large, massively parallel asynchronous I/O workloads, which drive the disk I/O queue lengths up, possibly making the low-latency I/Os suffer. This is why you should keep the I/O Resource Manager enabled in Exadata cells when running mixed workloads, as IORM is able to throttle the Smart Scan I/O and prioritize important background process I/O, such as log file writes, automatically.
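A quick way to verify that an IORM plan is actually in place on a cell is to list it with CellCLI:

CellCLI> LIST IORMPLAN DETAIL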

You can read more about the OS Watcher tool in My Oracle Support article "OS Watcher User Guide" (Doc ID 301137.1). In fact, you can download the latest OS Watcher from there and use it for monitoring your other servers, too. The My Oracle Support version of OS Watcher does not include the Exadata cell-specific monitoring scripts, but you don't need those for your regular servers anyway. For Exadata storage cells, you'll have to keep using the original version shipped with the cell software, as Oracle doesn't authorize the installation of any additional software onto the storage cells if you want to keep a supported configuration.

It pays off to open up and skim through the OS Watcher scripts; you might learn some new, useful OS-level monitoring commands from them, and you will have a better overview of which OS metrics are available. You should drill down to these low-level metrics when the more conveniently accessible Oracle Database and CellCLI-level metrics are not enough.

Summary

Oracle Database and the cells provide us with thousands of metrics, and we've only touched on a small set of them here. This leads to the obvious questions: are all these metrics important? What are their "good" values? Which ones should we act on? And so on. The general answer to all those questions is no, you do not need to learn, memorize, and "tune" all of these metrics; that would be impossible. You should always measure what matters, and that is usually the response time. Ideally, you should start by measuring the end-user response time and drill down from there where needed, sometimes to the database, sometimes to the application server or network metrics. It's not always the database that is causing the trouble, you know. However, thanks to the complexity of modern multi-tier applications, this end-to-end diagnosis may not be available (or feasible to implement).

In that case you take a step down and monitor the response time of your database queries and transactions. You just start the top-down approach a little lower in the application stack, keeping in mind that the problem may still actually be happening somewhere in the application layer or higher. When monitoring the response times of your user reports and DW queries, you should probably start from the SQL Monitoring page (or ASH data) and identify the problem user's query from there, so you can drill down into it. Using the top-down approach and following the biggest time consumers in the SQL plan is much easier than the opposite bottom-up approach, where you might look at some database-wide top SQL report and hope to figure out what's wrong with your specific user's workload.

Of course, there are cases where you’ll be monitoring system-wide aggregated anonymous performance data (not tied to any specific user or session), as in capacity planning and utilization monitoring. However, questions like “Is my database performing well?” should never be answered by looking at some system utilization report. None of these tools can tell you whether your database system performs well; only your users can. And if the users are unhappy, you can start from an unhappy user and drill down into the response time from there. Oracle provides all the metrics you need!
