Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

CHAPTER 11 Fixing Problems

Rise and shine, sleepy Joe

There are places to go

There are windows to clean on the way

You’ve got nothing to lose

But a shine on your shoes

Do the best things you can every day

“Sleepy Joe,” composed by John Carter and Russell Alquist and recorded by Herman’s Hermits

To state the obvious, fixing problems is the part of the DBA’s job that takes precedence over everything else. In his best-selling book The 7 Habits of Highly Effective People, Stephen Covey explains that our activities can be divided among four quadrants depending on their urgency and importance. Figure 11-1 illustrates these quadrants.

Figure 11-1. Classifying activities into four quadrants

Restoring service when the database crashes is an example of an activity that is both important and urgent. Determining the root cause of the database crash is important but not as urgent. The loud ringing of your cell phone demands your immediate attention but is rarely more important than the task at hand. And, of course, watching the latest sensational video on YouTube is unimportant and not urgent.

We tend to spend most of our time in the quadrant of activities that are important and urgent. However, the ideal place to spend time is the quadrant of activities that are important but not urgent: for example, keeping records, identifying the root causes of incidents and eliminating them, updating documentation, automating common tasks, and so forth.

In this chapter, you watch a real-life problem as it progresses from detection to resolution. You learn a five-step systematic approach to problem-fixing and the difference between incident management and problem management. I cover the variety of Internet resources that are available to you, introduce an Oracle knowledge base called My Oracle Support, and explain how to get paid support from Oracle Corporation. Finally, I discuss a few common problems.

Systematic Five-Step Problem-Solving Method

In his book The Art and Science of Oracle Performance Tuning, Christopher Lawson describes a systematic five-step method for solving a performance-tuning problem. The method applies to any problem, not just to a performance problem. Oracle Database versions and software tools may change, but the five steps always remain the same. A problem may be simple and require only a few minutes of your time, or it may be tremendously complex and require weeks, but the five steps never change:

Define the problem. This requires patient listening, skillful questioning, and even careful observation. “Is the database having a problem?” is a question, not a problem statement. “The users are complaining” is a poorly defined problem statement. “I cannot connect to the database” is very precise. Ask the user for the history of the problem. Ask what previous efforts have been made to solve the problem. Ask what changed recently in the environment—for example, software or hardware upgrades. Ask whether all users are affected or only some. Ask whether the problem occurs at specific times of the day or week. Ask whether all parts of the application are equally affected or just parts. Avoid confusing the problem with the solution—for example, “The problem is that we need to reboot the server.” A good way to end this phase is with a reproducible test case or with one or more Oracle error codes.
Investigate the problem, and collect as much pertinent evidence as possible. A good place to start is the Oracle alert log.
Analyze the data collected in step 2, and isolate the cause of the performance problem. This is often the most challenging part of the performance-tuning exercise. If the root cause is not found, you can go back to step 2 to continue your investigation of the problem or to step 1 to refine the definition of the problem.
Solve the problem by creating a solution that addresses the cause of the problem. Solutions are not always obvious, and, therefore, this part of the exercise may require a great deal of ingenuity and creativity.
Implement the solution in a safe and controlled manner. Conduct an appropriate level of testing in a suitable testing environment. Obtain necessary approvals, and follow the organization’s change-management procedures. Before and after measurements should be obtained in the case of performance problems. If the after measurements indicate that the problem is not fixed, you can return to step 2 and continue your investigation of the problem.

The flowchart in Figure 11-2 illustrates the problem-solving work flow.

Figure 11-2. Systematic five-step problem-solving method

The Book We All Want and Best Practices for Problem Management

The book we all want is a book that has a clear description of every Oracle problem and step-by-step instructions for fixing them. That book will never be written. There are just too many problems that can occur. Problems can have multiple solutions. The solution may depend on your particular circumstances. In many cases, it is not even clear what the problem really is. The best you can do is to use best practices for problem management. Here are my suggestions:

Fix problems proactively. Monitoring your database and preventing problems is better than fixing problems—for example, add space to a database before it fills up completely and jobs begin to fail.
Find the root cause of problems. After the problem has been fixed, look for the underlying root cause—for example, find out why the database is growing and how much space will be needed over the medium term.
Use good tools. Workers are only as good as their tools. The more tools you have, the better you are equipped to solve a problem. Commonly used tools include Oracle-supplied tools such as Enterprise Manager and SQL Developer and third-party tools such as Toad from Quest Software and DBArtisan from Embarcadero Technologies.
Use standard operating procedures (SOPs). The two obvious advantages of SOPs are consistency and efficiency, but there are many others. Chapter 15 returns to the subject of SOPs.
Document the database environment. Remote Diagnostic Agent (RDA), introduced in Chapter 9, is a simple tool that you can use to document your environment. An RDA collection collects all the information about the database and operating system into one compact package that is very useful in solving problems.
Ask for help. Several online forums can help you solve a problem. If you have an Oracle support contract, you can escalate a problem to an Oracle engineer. We return to this subject later in this chapter.
Keep work records. The most important thing to do is to keep work records. When a problem reoccurs, it helps to have access to the details of prior occurrences. Every environment is prone to certain problems—for example, Oracle Database 10.1.0.3 is affected by a bug that causes the nightly statistics-gathering job GATHER_STATS_JOB to fail—the characteristic error code is ORA-00904.

Note An IT organization without work records is like a dentist’s office without dental records. Work records are one of the ten deliverables of the database administration role listed in Chapter 15.

Figure 11-3, courtesy of Database Specialists, proves the value of work records in solving chronic problems: Three DBAs—Iggy Fernandez, Terry Sutton, and Roger Schrag—encountered an ORA-00904 error in the alert log, and a quick search of the work records told them each time that it was a known problem that had occurred many times in the same database.

Figure 11-3. Searching work records

Real-Life Example: Unresponsive Listener

This section presents the transcripts—with some editing to preserve confidentiality—of several instant-messaging conversations between a DBA and another member of the IT staff. The problem puzzles the DBA the first time but is fixed very quickly the second time it occurs.

Define the Problem

A member of the IT staff tells the DBA that the database “seems to be locked up.” This phase ends when the DBA is able to reproduce the problem—he cannot connect to the database from the reporting server. The DBA now has a clear definition of the problem:

User: help. our oracle database server seems to be locked up
User: if I do sql
User: it says cannot connect. I cannot connect from reporting server
DBA: ok let me try that
DBA: ok i cannot connect either

Investigate and Analyze the Problem

The DBA now begins looking for information that may shed light on the problem. The first thing he does is check the alert log. He finds that the database is healthy, and he is able to connect to it without going through the listener—that is, using a local session on the database server. This phase ends when he discovers that there are two listeners. This is apparently the cause of the problem, because the timestamp of the last good entry in the alert log coincides with the time when the second listener was started:

DBA: no errors in alert log
DBA: cpu and load average good check
DBA: without the @ i can connect
DBA: this tells me that the database is OK
DBA: connections to port 1521 are failing
User: listener ?
DBA: trying telnet localhost 1521 to find out if there is connectivity to the port
DBA: hmmmm there is
DBA: $ telnet localhost 1521
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is ’^]’.
DBA: that is the behaviour i expect
DBA: can you try it from the reporting server
DBA: telnet should produce the same output
DBA: i.e. it should connect
User: ok. one sec
DBA: i.e. it should "connect" but not quite
DBA: i.e. it should detect somebody listening
DBA: but obviously there is no telnet daemon there
DBA: and it will hang
User: it hangs
DBA: same output as above
DBA: ok ...
DBA: the last connection recorded in tail /local/service/oracle/product/10.2.0/db/network/log/listener.log
on the db server was at 14:17
DBA: it is now 15:05
DBA: 09-JUN-2007 14:17:47 *
(CONNECT_DATA=(SID=devdb)(CID=(PROGRAM=)(HOST=__jdbc__)(USER) *
(ADDRESS=(PROTOCOL=tcp)(HOST=127.0.0.1)(PORT=5009)) * establish * proddb * 0 09-JUN-
2007 14:17:47 * (CONNECT_DATA=(SID=devdb)(CID=(PROGRAM=)(HOST=__jdbc__)(USER) *
(ADDRESS=(PROTOCOL=tcp)(HOST=127.0.0.1)(PORT=5010)) * establish * proddb * 0
DBA: definitely a problem with the listener
DBA: but ....
DBA: my guess is that somehow the database server cannot talk to the world
DBA: the way it works is that ...
DBA: all traffic flows through the 1521 port
DBA: both inbound and outbound
DBA: for all sessions
DBA: there is one unix process servicing each client process
DBA: but everybody talks over the 1521 port
DBA: looks like the report server can reach the 1521 port
DBA: but perhaps the db cannot talk back
DBA: aah i have it
DBA: $ ps -ef |grep lsnr
oracle 19832 1 0 Mar31 ? 00:05:40
/local/service/oracle/product/10.2.0/db/bin/tnslsnr LISTENER –inherit
oracle 19987 19832 0 14:17 ? 00:00:00
/local/service/oracle/product/10.2.0/db/bin/tnslsnr LISTENER -inherit
DBA: look carefully
DBA: two listeners
DBA: one at 14:17
DBA: the second process is the child of the first

Solve and Implement the Problem

Having determined that the second listener is the problem, the DBA first tries to shut down the listeners gracefully. When that fails, he uses the Unix kill command. The situation returns to normal as soon as the second listener is terminated. The DBA promises to research what caused the second listener to be started and why nobody was able to connect to the database as a result:

User: can we restart the listener
DBA: let me try
DBA: I cannot stop it the normal way
DBA: because the normal way involves connecting to it
DBA: lsnrctl hangs
User: aaah
DBA: $ lsnrctl status

LSNRCTL for Linux: Version 10.2.0.1.0 - Production on 09-JUN-2007 15:18:28

Copyright (c) 1991, 2005, Oracle. All rights reserved.

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521)))
DBA: hangs at that point
User: ok. makes sense. something is locked up in that process. may be one of the
threads
DBA: ok killing both
User: yeah.
DBA: woo hoo
DBA: i killed the second process
DBA: i can connect now
DBA: SQL*Plus: Release 10.2.0.1.0 - Production on Wed Jun 9 15:21:32 2007

Copyright (c) 1982, 2005, Oracle. All rights reserved.

Connected to:
Oracle Database 10g Release 10.2.0.1.0 – Production

SQL> quit
Disconnected from Oracle Database 10g Release 10.2.0.1.0 - Production
DBA: it was failing before
DBA: check on the report server
User: yeah. it works fine now
User: great.
DBA: thanks ... two heads are better than one
User: I agree. It’s good to know what happened. great. thanks.
DBA: i;ll do some research
DBA: on the phrase "two listeners"
User: ok. cool. gotta go. bye
DBA: l8r

One Week Later

When the problem occurred again, very little diagnosis and resolution time were required:

User: hi, the oracle server seems to have died last night. can you check. tks
DBA: really?
DBA: that’s really bad
User: it was running level 0 backup y’day and I thought because of that queries were
timing out. but looks like it’s hosed.
DBA: checking now
User: thank you!
DBA: it may be the same problem that we saw a week ago
DBA: dev
DBA: checking
DBA: yes ... two listeners
DBA: nobody can attach
User: hmmm
DBA: problem cleared
User: tks.

Opportunities for Improvement

The DBA in this example has considerable Oracle experience and good problem-solving skills, but there are nevertheless several opportunities for improvement:

The problem was reported to the DBA by a member of the IT staff, well after the problem started. No monitoring mechanisms were in place to automatically alert the DBA about the loss of service.
There was no further investigation after the first incident. The problem had been fixed, but the root cause had not been determined. In this case, the listener log file contained many occurrences of the following message: WARNING: Subscription for node down event still pending. This would have been a good place to start the investigation.

Incident Management vs. Problem Management

An incident is a single occurrence of a problem. Incident management and problem management are therefore separate aspects of IT management. Incident management is concerned with restoring service as rapidly as possible. Problem management is concerned with permanently fixing defects so the incidents are not repeated.

In the real-life example in the preceding section, the organization gets high marks for incident management but low marks for problem management. Problem management is facilitated by record keeping—tools such as BMC Remedy Service Desk can be used for this purpose. Knowledge of the event was confined to the DBA and the member of the IT staff who contacted him. Records were not kept, the incident was forgotten, and a second service outage occurred a week later.

Internet Resources

The Internet is a treasure trove of information that can help you in solving a problem. For example, user groups such as the Northern California Oracle Users Group (www.nocoug.org) have made vast collections of electronic presentations and white papers available on their web pages. Often a simple Google search will bring up an answer, but many specialized resources also are available:

Online Oracle documentation
Online forums
Oracle Technology Network
Ask Tom web site
Usenet
Oracle-L mailing list

The highest quality resource is of course the online Oracle documentation, the home page for which is shown in Figure 11-4. This documentation is freely viewable, searchable, and downloadable at https://docs.oracle.com. Documentation for older versions of Oracle software going back to Oracle 7 is also available here. If you need to be able to work offline, you can download as much documentation as you need.

Figure 11-4. Online Oracle documentation

Tip Download the manuals and guides you need to your desktop, laptop, or mobile device so you can read them offline, especially if you prefer the book format instead of the online format.

You also can ask questions on the Oracle forums (https://community.oracle.com), shown in Figure 11-5. Many Oracle experts donate a lot of time answering questions posted here.

Figure 11-5. The Oracle forums

The Oracle Technology Network (www.oracle.com/technetwork) is the jumping-off place for all the free resources provided by Oracle Corporation, including articles, sample code, and tutorials. It also contains links to the Oracle documentation and Oracle forums. Figure 11-6 shows the Oracle Technology Network.

Figure 11-6. The Oracle Technology Network

Working with Oracle Support

You can search the Oracle knowledge base My Oracle Support (MOS ) and obtain technical support from Oracle Support if you are paying annual support fees to Oracle and have a valid Customer Support Identifier (CSI). The support fees are typically 22 percent of the cost of your Oracle licenses. If you choose to forgo Oracle support, you will not be entitled to any patches (fixes for software bugs) or upgrades.

The web page for Oracle Support is http://support.oracle.com, shown in Figure 11-7. The tabs on the page indicate the range of things you can do here—for example, read headlines, search the knowledge base, request service, download patches, participate in forums, and so forth.

Figure 11-7. Working with Oracle Support

If you cannot find an answer in the Oracle knowledge base, you can create a service request. The priority of the service request and the corresponding service-level commitment depend on the impact to your organization. For instance, a production outage is classified as Severity 1 and is given the highest level of attention.

Remote Diagnostic Agent (RDA)

Chapter 9 touched on RDA collections. An RDA collection, shown in Figure 11-8, collects all the information about the database and operating system into one compact package that can be attached to a service request. The RDA collection eliminates the need for Oracle engineers to ask questions about the database environment and shortens the time required for problem resolution.

Figure 11-8. An RDA collection

Tip An RDA collection is a great problem-solving tool and facilitates collaboration among DBAs. A best practice for Oracle database administration is to install the RDA utility in every database environment to avoid losing time when faced with a severe problem that requires vendor support. Periodic RDA collections also serve as a record of changes to the database environment.

ORAchk Oracle Configuration Audit Tool

ORAchk is a tool provided by Oracle Support and can be downloaded from My Oracle Support. As the name implies, it checks whether you are using best practices for your Oracle Database configuration. Here is the summary output produced by the tool in my VM environment:

INFO =>    Oracle Data Pump Best practices.
FAIL =>    Bash is vulnerable to code injection (CVE-2014-6271)
WARNING => Linux Swap Configuration does NOT meet Recommendation
WARNING => physical memory is not sufficient
 INFO =>    Important Storage Minimum Requirements for Grid & Database Homes
 FAIL =>    DB_UNIQUE_NAME on primary has not been modified from the default,
 confirm that database name is unique across your Oracle enterprise. for orcl
 INFO =>    Hidden database initialization parameters should not be set
per best practice recommendations for orcl
 WARNING => OSWatcher is not running as is recommended.
 FAIL =>    Database parameter DB_LOST_WRITE_PROTECT is NOT set to recommended value on orcl instance
 INFO =>    umask for RDBMS owner is not set to 0022
 WARNING => Database parameter DB_BLOCK_CHECKING on PRIMARY is NOT set to the recommended value. for orcl
 INFO =>    Operational Best Practices
 INFO =>    Database Consolidation Best Practices
 INFO =>    Computer failure prevention best practices
 INFO =>    Data corruption prevention best practices
 INFO =>    Logical corruption prevention best practices
 INFO =>    Database/Cluster/Site failure prevention best practices
 INFO =>    Client failover operational best practices
 WARNING => One or More Registry Components were found to be Invalid for orcl
 WARNING => One or More Registry Components were found to be Invalid for PDB1
 WARNING => Duplicate objects were found in the SYS and SYSTEM schemas for orcl
 WARNING => Redo log file size should be sized to switch every 20 minutes during peak redo generation for orcl
 FAIL =>    Database parameter LOG_BUFFER is NOT set to recommended value on orcl instance
 WARNING => Oracle clusterware is not being used
 WARNING => RAC Application Cluster is not being used for database high availability on orcl instance
 WARNING => "DISK_ASYNCH_IO is NOT set to recommended value for orcl
 FAIL =>    Flashback on PRIMARY is not configured for orcl
 INFO =>    Database failure prevention best practices
 WARNING => fast_start_mttr_target has NOT been changed from default on orcl instance
 WARNING => Database Archivelog Mode should be set to ARCHIVELOG for orcl
 FAIL =>    Primary database is NOT protected with Data Guard (standby
 database) for real-time data protection and availability for orcl
 FAIL =>    Active Data Guard is not configured for orcl
 INFO =>    Oracle recovery manager(rman) best practices
 INFO =>    Consider increasing the COREDUMPSIZE size
 INFO =>    Consider investigating changes to the schema objects such as DDLs or new object creation for PDB1
 INFO =>    Consider increasing the PGA size for PDB1
 WARNING => Consider increasing the value of the session_cached_cursors database parameter for orcl
 INFO =>    Consider investigating the frequency of SGA resize operations and take corrective action for orcl
 FAIL =>    There should be no duplicate parameter entries in the database init.ora(spfile) file for orcl

As you can see, there are numerous opportunities for improvement in this VM environment. Detailed output is produced in HTML format. In the exercises, you are asked to run ORAchk in your VM environment and study the findings.

Automatic Diagnostic Repository (ADR)

The history of Oracle errors is stored in a directory structure called the Automatic Diagnostic Repository (ADR). Oracle provides a tool called adrci (ADR Command Interpreter) to query the ADR and to create packages containing diagnostic information. These packages can be attached to service requests that you initiate with Oracle Support. You see an example of using adrci later in this chapter.

Error Codes

With a few exceptions, such as locking problems, the most important symptom of a problem is the error code. These error codes are all listed in Oracle Database Error Messages—part of the Oracle documentation set—and there are literally tens of thousands of them. Here is a simple example featuring the very first Oracle error in the book, ORA-00001. The user is trying to insert a record that already exists:

SQL*Plus: Release 12.1.0.1.0 Production on Mon Apr 6 01:07:28 2015

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Last Successful login time: Sun Apr 05 2015 23:28:38 -04:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

PDB1@ORCL> insert into employees select * from employees;
insert into employees select * from employees
*
ERROR at line 1:
ORA-00001: unique constraint (HR.EMP_EMAIL_UK) violated

Tip When investigating a problem, the best question to ask is, “What is the Oracle error code?” Sometimes applications mask the error code, and this hinders the investigation into a problem.

A very limited amount of advice about each Oracle error code is provided in Oracle Database Error Messages. Here is what you find in the case of ORA-00001:

ORA-00001: unique constraint (string.string) violated
Cause: An UPDATE or INSERT statement attempted to insert a duplicate key. For
Trusted Oracle configured in DBMS MAC mode, you may see this message if a duplicate
entry exists at a different level.
Action: Either remove the unique restriction or do not insert the key.

Tip Oracle supplies the oerr utility for Unix platforms; it displays the text of an error message as well as the cause and action. To display the text associated with ORA-00001, simply type oerr ora 1.

Notice that an Oracle error code has two parts: a facility code and a five-digit number, separated by a hyphen. Leading zeros are often omitted—for example, ORA-00600 and ORA-600 refer to the same error. Here is an example of an error message produced by the Recovery Manager (RMAN) tool:

RMAN-06171: not connected to target database
Cause: A command was issued but no connection to the target database has been
established.
Action: Issue a CONNECT TARGET command to connect to the target database.

You must learn to identify which Oracle error codes indicate problems with the infrastructure supported by the DBA and which codes indicate errors made by a user or problems with an application. For example, Table 11-1 lists the standard named exceptions that are part of the PL/SQL programming language—they indicate problems with the application in question.

Table 11-1. Named Exceptions in PL/SQL

Name	Error Code
ACCESS_INTO_NULL	ORA-06530
COLLECTION_IS_NULL	ORA-06531
CURSOR_ALREADY_OPEN	ORA-06511
DUP_VAL_ON_INDEX	ORA-00001
INVALID_CURSOR	ORA-01001
INVALID_NUMBER	ORA-01722
LOGIN_DENIED	ORA-01017
NO_DATA_FOUND	ORA-01403
NOT_LOGGED_ON	ORA-01012
PROGRAM_ERROR	ORA-06501
ROWTYPE_MISMATCH	ORA-06504
SELF_IS_NULL	ORA-30625
STORAGE_ERROR	ORA-06500
SUBSCRIPT_BEYOND_COUNT	ORA-06533
SUBSCRIPT_OUTSIDE_LIMIT	ORA-06532
SYS_INVALID_ROWID	ORA-01410
TIMEOUT_ON_RESOURCE	ORA-00051
TOO_MANY_ROWS	ORA-01422
VALUE_ERROR	ORA-06502
ZERO_DIVIDE	ORA-01476

Tip Errors reported in the Oracle alert log usually represent infrastructure errors. A good example is ORA-01653: Unable to extend table HR.EMPLOYEES by 1024 in tablespace EXAMPLE.

Four Errors

I have enough space to discuss only a few errors. I selected the four errors in this section for discussion because they are common and frequently misunderstood.

ORA-01555: Snapshot Too Old

Here is the minimal explanation of the ORA-01555 error found in Oracle Database 11g Error Messages:

ORA-01555: Snapshot too old; rollback segment number string with name "string" too
small
Cause: rollback records needed by a reader for consistent read are overwritten by
other writers
Action: If in Automatic Undo Management mode, increase undo_retention setting.
Otherwise, use larger rollback segments

Consider a query that begins at 9 a.m. and completes at 10 a.m. Suppose that the query is counting the number of records in a very large table, and suppose that the table is very active—that is, records are constantly being inserted into the table. Suppose that Oracle’s answer to the query is 10 billion records. Does the answer indicate the number of records at 9 a.m., at 10 a.m., or at some intermediate time between those times?

If you were working with another database technology such as IBM’s DB2 or Informix, Microsoft’s SQL Server, or Sybase, the answer would be 10 a.m. because the query would not complete until it had locked all the records in the table; it would wait for all insert, update, and delete operations to complete, and it would block others from starting. In other words, readers block writers, and writers block readers. With Oracle, on the other hand, readers do not block writers, and writers do not block readers. Instead, Oracle determines what the answer to the query would have been at the instant the query started—that is, at 9 a.m. This manner of operation is called read consistency.¹ Read consistency extends to a single query by default or can extend to an entire transaction. For example, the entire contents of a database can be exported by using read-consistent mode to ensure that there are no conflicts or inconsistencies in the data.

To ensure read consistency, Oracle must construct a snapshot of the database at the time the query or transaction started. If the required data has changed since the query started, Oracle must obtain a prior version of the data from the rollback segments. However, the rollback segments are a shared resource, and the information they contain may be overwritten if other transactions need the space. Therefore, Oracle may not be able to construct the snapshot—that is, the snapshot may be too old to be reconstructed.

Prior to Oracle 9i, the number and size of rollback segments were left to the DBA. Beginning with Oracle 9i, this task should be entrusted to Oracle by changing the value of the UNDO_MANAGEMENT setting to AUTO. If the Snapshot too old error occurs, increase the value of the UNDO_RETENTION setting and ensure, through trial and error, that the tablespace specified by the UNDO_TABLESPACE setting is big enough, adding data files or increasing the size of data files as necessary. The default value of the UNDO_RETENTION setting is 900 seconds (15 minutes); a large value such as 14,400 (4 hours) may be more appropriate, depending on the circumstances, but must be supported by an undo tablespace that is large enough. Note that you can never completely avoid the Snapshot too old error because it is always possible for queries to take so long that the available resources are exhausted. Also, transactions that modify the data typically take precedence over transactions that read the data—that is, the modified transactions overwrite the information in the rollback segments if space is short, regardless of the value of the UNDO_RETENTION setting. You can prevent this by imposing a retention guarantee on the undo tablespace by using the ALTER TABLESPACE command. You must take more care than usual to ensure that the undo tablespace is adequately sized if a retention guarantee is in effect.

ORA-00060: Deadlock Detected

Once again, you see only a minimal explanation in Oracle Database 11g Error Messages:

ORA-00060: deadlock detected while waiting for resource
Cause: Transactions deadlocked one another while waiting for resources.
Action: Look at the trace file to see the transactions and resources involved. Retry
if necessary.

This error is frequently encountered but is one of the least understood Oracle errors. A deadlock is a situation in which two transactions interfere with each other; each is waiting for a lock on a resource locked by the other. The correct thing for the application to do when this error is received is to issue the ROLLBACK command and retry the transaction.

Create two tables, PARENT and CHILD, related to each other by a referential constraint (foreign key), and populate them with data by using the following commands:

CREATE TABLE parent (
  parent_ID INTEGER NOT NULL,
  CONSTRAINT parent_PK PRIMARY KEY (parent_id)
);

CREATE TABLE child (
  child_ID INTEGER NOT NULL,
  parent_ID INTEGER NOT NULL,
  CONSTRAINT child_PK PRIMARY KEY (child_id),
  CONSTRAINT child_FK FOREIGN KEY (parent_ID) references parent
);

INSERT INTO parent (parent_ID) values (1);
INSERT INTO parent (parent_ID) values (2);
INSERT INTO child (child_ID, parent_ID) values (1,1);
INSERT INTO child (child_ID, parent_ID) values (2,2);

Suppose that two users try to delete the data from the CHILD table, but each deletes records in a different order. At step 3 in the following example, user A is blocked because user B has already locked the requested record. User B is blocked in turn at step 4, and a deadlock results. To resolve the deadlock, Oracle picks either user A or user B as a victim, and any effects of the last command issued by the victim are rolled back. To allow the other user to proceed, the victim must then issue the ROLLBACK command to release any other locks it still holds—it can then retry its transaction:

/* Step 1: User A */ DELETE FROM child WHERE child_ID=1;

/* Step 2: User B */ DELETE FROM child WHERE child_ID=2;

/* Step 3: User A */ DELETE FROM child WHERE child_ID=1;

/* Step 4: User B */ DELETE FROM child WHERE child_ID=2;

In the preceding example, two users are modifying the same data. Deadlock also occurs in the next example, even though the users are modifying different data. At step 3, user A is blocked because Oracle needs to verify the absence of child records before deleting a parent record. In the absence of an index on the child_ID column of the Child table, Oracle attempts to lock the entire Child table and is prevented from doing so because user A has a lock on one of the records in the table. User B is blocked in turn at step 4, and a deadlock results:

/* Step 1: User A */ DELETE FROM child WHERE child_ID=1;

/* Step 2: User B */ DELETE FROM child WHERE child_ID=2;

/* Step 3: User A */ DELETE FROM parent WHERE parent_ID=1;

/* Step 4: User B */ DELETE FROM parent WHERE parent_ID=2;

Defects in database design such as un-indexed foreign keys can increase the frequency of deadlocks, but deadlocks can happen anytime more than one user is using the database. All database programs must anticipate the possibility and take the appropriate action when a deadlock happens—that is, issue the ROLLBACK command and retry the transaction.

ORA-00600: Internal Error Code

Oracle reports an ORA-00600 error in the alert log whenever it encounters an unexpected condition—the frequent cause is an Oracle bug—and stops processing the SQL statement it was processing at the time. Here is the verbiage from Oracle Database 11g Error Messages:

ORA-00600: internal error code, arguments: [string], [string], [string], [string],
[string], [string], [string], [string]
Cause: This is the generic internal error number for Oracle program exceptions. This
indicated that a process encountered an exceptional condition.
Action: Report as a bug - the first argument is the internal error number.

Caution ORA-00600 errors should always be investigated, and the DBA should confirm that the error was not triggered by a corrupted data block.

Here is an example of an ORA-00600 incident, from detection to diagnosis. The following code produces an ORA-00600 error in Oracle Database 11.1.0.6. Notice that an ORA-00600 error has a number of arguments, each enclosed in square brackets. The first such argument is the most important and is used to further classify the error:

[oracle@localhost ~]$ sqlplus hr/oracle@pdb1

SQL*Plus: Release 12.1.0.1.0 Production on Sun Apr 5 23:28:38 2015

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Last Successful login time: Sun Apr 05 2015 23:27:29 -04:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

PDB1@ORCL> @demo
PDB1@ORCL>
PDB1@ORCL> -- create a demo table
PDB1@ORCL> CREATE TABLE scores AS
  2  SELECT o.object_type,
  3    o.object_name,
  4    TRUNC(dbms_random.value(1,101)) AS score
  5  FROM all_objects o;

Table created.

PDB1@ORCL>
PDB1@ORCL> -- create an index
PDB1@ORCL> CREATE INDEX scores_ix ON scores
  2    (object_type, score
  3    );

Index created.

PDB1@ORCL>
PDB1@ORCL> -- list the top 3 scoring objects in each category
PDB1@ORCL> WITH object_types(object_type) AS
  2    ( SELECT MIN(object_type) FROM scores
  3    UNION ALL
  4    SELECT
  5      (SELECT MIN(object_type) FROM scores WHERE object_type > u.object_type
  6      )
  7    FROM object_types u
  8    WHERE object_type IS NOT NULL
  9    )
 10  SELECT l.*
 11  FROM object_types o,
 12    lateral
 13    (SELECT /*+ INDEX(s) */ *
 14    FROM scores s
 15    WHERE s.object_type = o.object_type
 16    ORDER BY s.object_type,
 17      s.score
 18    FETCH FIRST 3 ROWS ONLY
 19    ) l;
  ( SELECT MIN(object_type) FROM scores
                                 *
ERROR at line 2:
ORA-00600: internal error code, arguments: [qctcte1], [0], [], [], [], [], [],
[], [], [], [], []

The first thing to do is to check the Oracle alert log. If you have forgotten where it is, you can check the value of the BACKGROUND_DUMP_DEST setting:

PDB1@ORCL> SHOW parameter background_dump_dest;

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
background_dump_dest                 string      /u01/app/oracle/diag/rdbms/orc
                                                 l/orcl/trace

Here is what appears in the Oracle alert log. The alert log indicates that full details are found in a trace file:

Sun Apr 05 23:28:48 2015
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_14018.trc  (incident=60308):
ORA-00600: internal error code, arguments: [qctcte1], [0], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl/incident/incdir_60308/orcl_ora_14018_i60308.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Sun Apr 05 23:28:50 2015
Dumping diagnostic data in directory=[cdmp_20150405232850], requested by (instance=1, osid=14018), summary=[incident=60308].
Sun Apr 05 23:28:50 2015
Sweep [inc][60308]: completed
Sweep [inc2][60308]: completed

Next, you use the adrci tool and create a package of information that can be sent to Oracle Support. The specific command is ips (Incident Packaging Service). Notice that ADR differentiates between a problem and an incident:

[oracle@localhost trace]$ adrci

ADRCI: Release 12.1.0.1.0 - Production on Sun Apr 5 23:35:21 2015

Copyright (c) 1982, 2013, Oracle and/or its affiliates.  All rights reserved.

ADR base = "/u01/app/oracle"
adrci> set homepath diag/rdbms/orcl/orcl
adrci> show problem

ADR Home = /u01/app/oracle/diag/rdbms/orcl/orcl:
*************************************************************************
PROBLEM_ID PROBLEM_KEY       LAST_INCIDENT LASTINC_TIME
---------- ----------------- ------------- ---------------------------------
1          ORA 4031          7555          2014-01-21 15:06:16.283000 -05:00
2          ORA 600 [qctcte1] 60308         2015-04-05 23:28:48.283000 -04:00
2 rows fetched

adrci> ips create package incident 60308
Created package 1 based on incident id 60308, correlation level typical
adrci> ips generate package 1 in /u01/oracle
Generated package 1 in file /u01/oracle/ORA600qct_20150405233734_COM_1.zip, mode complete
adrci> exit

An alternative to creating a service request is to research the problem yourself by using the MetaLink knowledge base. Specifically, you have to use a search tool called the ORA-600 Troubleshooter, shown in Figure 11-9. You must provide the first argument of the ORA-00600 error you encountered and the version of Oracle Database you are using.

Figure 11-9. Using the ORA-600 Troubleshooter

When you click the Look-Up Error button, Oracle shows you a research note discussing this particular error.

ORA-07445: Exception Encountered

An ORA-07445 error causes the instant termination of the Oracle server process that was handling the query being processed at the time. This sort of error occurs when an Oracle server process attempts to perform an illegal operation such as reading an area of memory that is not allocated to it—the operating system detects the illegal operation and forces the Oracle server process to shut itself down. ORA-00600 and ORA-7445 errors should be handled in the exact same way.

Summary

Here is a short summary of the concepts touched on in this chapter:

Your activities can be divided among four quadrants depending on their urgency and importance. The ideal place to spend time is the quadrant of activities that are important but not urgent—for example, keeping records, identifying the root causes of incidents and eliminating them, updating documentation, and automating common tasks.
The first step in solving a problem is to define the problem. The second step is to investigate the problem. The third step is to analyze the data. The fourth step is to solve the problem. The last step is to implement the solution in a safe and controlled manner.
Best practices for problem management include fixing problems proactively, finding the root cause of problems, using good tools, using standard operating procedures, using RDA collections, and keeping work records.
An incident is a single occurrence of a problem. Incident management is concerned with restoring service as rapidly as possible. Problem management is concerned with permanently fixing defects so incidents are not repeated.
High-quality Internet resources include the Oracle documentation site, Oracle Technology Network, and Oracle-L mailing list.
If you have purchased an Oracle support contract, you can search the Oracle knowledge base MetaLink and escalate problems to the Oracle Support team.
You can use the RDA tool to document the Oracle Database configuration. The ORAchk configuration audit tool can be used to determine whether best practices are being followed for Oracle Database configuration.
The history of Oracle errors is stored in a directory structure called the Automatic Diagnostic Repository (ADR). Oracle provides a tool called adrci (ADR Command Interpreter) to query the ADR and to create packages containing diagnostic information to send to the Oracle Support team.
An Oracle error code has two parts: a facility code and a five-digit number, separated by a hyphen. The DBA must learn to identify which Oracle error codes indicate problems with the database infrastructure and which codes indicate errors made by a user or a problem with an application. The named exceptions defined by PL/SQL are examples of error codes that indicate problems in the application, not in the database infrastructure.
An ORA-01555: Snapshot too old error message means the query has taken so long that the undo information required to re-create the required read-consistent snapshot of the database has been overwritten by other transactions—that is, the snapshot is too old to be reconstructed.
An ORA-00060: Deadlock detected error message means Oracle detected that two transactions are blocking each other and picked one of them as a victim. The appropriate action the victim must take is to issue the ROLLBACK command and retry its transaction.
An ORA-00600: Internal error code error message indicates that Oracle encountered an unexpected condition and stopped processing the query in question. ORA-00600 errors should always be investigated, and the DBA should confirm that they were not triggered by corrupted data blocks.
An ORA-07445: Exception encountered error message means the Oracle server process attempted to perform an illegal operation and was forced by the operating system to terminate itself.

Exercises

Download the RDA tool from My Oracle Support, if you have a valid support contract, and run it in your VM environment. You have to unset the TWO_TASK environment variable (unset TWO_TASK) before running the tool. Use the custom RDA profile DB12c (./rda.sh -p DB12c) to limit the scope of the collection to Oracle Database only. Review all the sections of the RDA report.
Download the ORAchk tool from My Oracle Support, if you have a valid support contract, and run it in your VM environment. You have to unset the TWO_TASK environment variable (unset TWO_TASK) before running the tool. Examine each finding in the detailed HTML report, and decide whether it should be accepted. Consult the Oracle Database documentation to determine how to implement the findings you have accepted.
Reproduce one of the deadlock examples in your database. Check whether any messages are recorded in the alert log. Check whether a trace file is produced, and review it. What happens if the victim simply retries its last SQL command instead of issuing the ROLLBACK command and retrying its entire transaction?
Stop after step 3 of one of the deadlock examples. Session A is now blocked by session B, but Oracle cannot take any action because this is not yet a deadlock scenario. Use a GUI tool such as SQL Developer, Enterprise Manager, Toad, or DBArtisan to identify the blocking process and terminate it.
In what kind of problem scenarios might users not receive any error messages?
Users are complaining that their application is behaving sluggishly. How would you check the database environment for signs of trouble?

Categorize the following errors into user errors and infrastructure errors. What should be done to fix each of them?

ORA-01017: invalid username/password; logon denied
ORA-00923: FROM keyword not found where expected
ORA-01653: unable to extend table HR.EMPLOYEES by 128 in tablespace EXAMPLE
ORA-12154: TNS:could not resolve the connect identifier specified
ORA-12541: TNS:no listener
ORA-12514: TNS:listener does not currently know of service requested in
connect descriptor

Review the ORA-00600 example discussed in this chapter. Could the SQL statement be rewritten in a way that avoids the error?

Footnotes

¹SQL Server offers read consistency beginning with SQL Server 2005. However, it is not the default manner of operation.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Chapter 11: Fixing Problems

Create new playlist

Sign In

Sign Up

CHAPTER 11

Fixing Problems