Chapter 16. Oracle Backup and Recovery

Oracle is a popular commercial relational database management system with a very large installed base. Oracle can be backed up using two very different methods, and each method has its advantages and disadvantages—and its strong supporters and detractors.

Two Backup Methods

The backup and recovery method with the most history is now called user-managed backup. A user-managed backup consists of putting the Oracle database into a “backup friendly” state, then backing up its files using whatever tool strikes your fancy. Once the files have been backed up, you can take the database out of its “backup friendly” state. User-managed backup is documented and supported, but is not what Oracle would prefer you do.

The preferred method of backing up Oracle databases is the Recovery Manager (rman), which was first available in Oracle 8. The rman utility can be used independently or with a commercial backup utility to back up Oracle to disk or tape. It offers a number of advantages over user-managed backups, including incremental backups, data integrity checks, block-level media recovery, and guided restores. Those who have learned rman swear by it, especially the new and improved version available in Oracle 10g.

Tip

As of this writing, approximately half the Oracle community is using rman and half is performing user-managed backups. Both methods will be covered in this chapter, starting with this comparison of the two.

rman

rman has a number of advantages over user-managed backups. The first is that rman receives far more research and development investment than user-managed backups do. Where user-managed backups haven’t changed much in the last several years, dozens of features have been added to rman in that time.

rman also has a number of features that aren’t ever going to be available with user-managed backups:

Incremental backups

rman can create backups that contain the blocks that have changed since the last full backup. The speed and performance of incremental backups are significantly enhanced in 10g.

Incrementally updating backups

If you’re using disk as the target for rman, it can take the latest incremental backup and merge it with the full backup already on disk, creating a new full backup without actually having to perform one.

Data integrity checking

Previously, you had to perform an export of Oracle to check the integrity of blocks on disk. rman now performs this check for you as part of the backup.

Block-level media recovery of a corrupted datafile

If rman identifies any corrupt blocks in a datafile, it can recover them one block at a time.

Directed restores/recoveries

You tell rman to restore the database, and it figures out what needs to be restored and automatically restores it.

Many other features

Oracle has continued to add many features to rman, none of which will be available with user-managed backups.

rman has two disadvantages when compared to user-managed backups: learning curve and cost. While user-managed backups aren’t easy to learn either, they’ve been around a lot longer, and many DBAs understand them. Many DBAs still view rman with trepidation, often due to the large manual they’d have to read. The second disadvantage is that you must either back up to disk or purchase a third-party media manager to allow you to back up to tape. Some environments don’t have the money for enough disk to store an entire copy of their database (even if compressed), but they do have a tape drive they can dump/tar/cpio/ntbackup to. If they don’t have the money for more disk, they don’t have the money for a commercial media manager to back up to tape.

Tip

If you want the advantages that rman offers but don’t want to purchase your commercial backup utility’s interface to it, you can still integrate the two by using rman to back up to a disk, then using your backup utility to back up that disk. It’s not perfect, but it’s something.
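As a sketch of this approach, the following rman session backs up the database to a disk directory that your filesystem backup utility can then sweep to tape like any other directory. The path /backups/orcl and the single disk channel are assumptions for illustration:

```
$ rman target /

RMAN> configure default device type to disk;
RMAN> configure channel device type disk format '/backups/orcl/%U';
RMAN> backup database plus archivelog;
RMAN> backup current controlfile;
```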

User-Managed Backups

The biggest advantage user-managed backups have is history. Any DBA who has worked with Oracle for a long time probably understands them.

User-managed backups can also be relatively simple, especially if you can shut down the database to do a cold backup. All you have to do is shut it down prior to performing your usual filesystem backup, using whatever tool you want to use. If you can’t shut down the database, put it in backup mode and back it up live. (Backup mode is covered later in the chapter.) Once you’ve run the backup, start up the database or take it out of backup mode.
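A minimal hot-backup sketch for a single tablespace might look like the following. The tablespace name users, the datafile path, and the tar destination are assumptions; a real script would loop over every tablespace and its datafiles, and (as discussed later in the chapter) also back up the control file and archived redo logs:

```
$ sqlplus /nolog <<EOF
connect / as sysdba
alter tablespace users begin backup;
EOF
$ tar cf /backups/orcl_users.tar /u01/oradata/orcl/users01.dbf
$ sqlplus /nolog <<EOF
connect / as sysdba
alter tablespace users end backup;
EOF
```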

This procedure allows you to integrate backups of Oracle with any backup utility without having to pay for its interface to rman. While rman is free, your backup utility’s interface to it definitely won’t be. These interfaces can cost several thousand dollars per server.

Tip

This chapter uses the command sqlplus /nolog, followed by connect / as sysdba to connect to Oracle 9i and 10g databases. If you are running Oracle 8i, you can use svrmgrl and connect internal. (In the course of reading this chapter, you’ll learn some really great reasons to upgrade to 10g.)

Oracle Architecture

As mentioned in Chapter 15, it is important to understand the design of the database that is being backed up. Therefore, this chapter starts with a discussion of Oracle architecture. Similar information is provided in Chapter 15, but this chapter concentrates on information specific to Oracle. Just as in Chapter 15, we start with the power user’s view of the database, then continue with that of the database administrator. This chapter uses Oracle-specific terms. To see how a particular term relates to one used in DB2, SQL Server, or Sybase, consult Chapter 15. As much as possible, these architectural elements are presented in a “building block” order. Elements that are used to explain other elements are presented first. For example, we explain what a tablespace is before explaining what the backup tablespace command does.

The Power User’s View

Unless a power user wants to start doing the DBA’s job of putting a database together, the following terms should be all she needs to know. This view also could be called the “logical” view, because many of the elements described in this view don’t exist in a physical sense.

Instance

An instance is the combination of a set of processes through which the Oracle database talks to shared memory and the shared memory itself that Oracle allocates for its use. Often, there is more than one instance on a single system. When an Oracle instance is opened, the database connected to it becomes available.

On a Unix/Linux system, an instance can be identified by a set of processes with the pattern ora_function_SID, where function is a string indicating which Oracle function the process applies to and SID is the instance name. On Windows, each Oracle SID has its own service named OracleServiceSID, where SID is the instance name.

On a Unix/Linux system, an instance is started with the dbstart script and shut down with the dbshut script; you can use these scripts or SQL*Plus commands to manually stop and start Oracle. To have dbstart run automatically after a reboot, place it in the appropriate startup directory and add the appropriate symbolic links. To automatically start and stop instances in Windows, enable automatic start in the Control Panel for the appropriate OracleServiceSID. (If you cannot find OracleServiceSID in the Control Panel, you can create it using the oradim utility.) You then need to tell Oracle to automatically start the database when the service is started. You can do this by right-clicking on the SID in the Administrative Assistant; choosing Startup/Shutdown Options; choosing the Oracle Instance tab; and selecting “Start up instance when service is started,” “Shut down instance when service is stopped,” or both.

Database

The database is what most people think about when they are using Oracle. That’s because the database contains the data! A database is a collection of files that contain tables, indexes, and other important database objects. Unless you’re using Oracle Real Application Clusters (RAC), there is a one-to-one relationship between instances and databases. Without Oracle RAC, a database connects to only one instance, and an instance mounts only one database. RACs run multiple instances (most likely on multiple servers) that share a single database.

That is why Oracle DBAs often use the two terms interchangeably. Technically, though, the instance is a set of processes through which the database talks to Oracle’s shared memory segments whereas the database is the collection of files that contain the data.

Table

A table is a collection of related rows that all have the same attributes. There are three types of tables in Oracle: relational, object, and XML.

Index

A database index is analogous to an index in a book: it allows Oracle to find data quickly. An Oracle index is the same as anyone else’s index, and it presents no unique backup requirements. An index is a derived object: it is created from the attributes of another table, so it can simply be recreated during a restore, and it’s often faster to recreate an index than to restore it. Oracle has several types of indexes, including normal, bitmap, partitioned, function-based, and domain indexes.

Large object datatypes

Oracle 8 has special datatypes called BLOB, CLOB, and BFILE for storing large objects such as text or graphics. The BLOB and CLOB datatypes present no special backup requirements because they are stored within the database itself. (A BLOB typically contains image files, and a CLOB normally contains text data.) However, the BFILE datatype stores only a pointer inside the database to a file that actually resides somewhere in a filesystem on the database server. This aspect requires some special attention during backups. See the section “LOB space” in Chapter 15 for more information.

Object

An object is any type of database object, such as a table or an index; it’s really a generic term rather than an Oracle-specific term. Unfortunately, Oracle also uses the term “object” to refer to reusable components created by object-oriented SQL programming. Here, it is used simply as a generic way to refer to any type of table, index, or other entity within Oracle.

Row

A row is a collection of related attributes, such as all the information about a specific customer. Oracle also may refer to this as a record.

Attribute

An attribute is any specific value (also known as a “column” or “field”) within a row.

The DBA’s View

Now that we have covered the logical structure of an Oracle database, let’s concentrate on the physical structure. Since only the DBA should need to know this information, we will call it “the DBA’s view.”

Blocks

A block is the smallest piece of data that can be moved within the database. Oracle allows a custom block size for each instance; the size can range from 2 KB to 32 KB, and each tablespace (defined later) can have its own block size. A block is what is referred to as a page in other RDBMSs.

Extents

An extent is a collection of contiguous Oracle blocks that are treated as one unit. The size of an extent varies based on a combination of factors, the most important of which is the tablespace storage allocation method defined at tablespace creation.

Segment

A segment is the collection of extents dedicated to a database object. (An object is a table, index, etc.) Depending on the type of object, extents may be allocated or taken away to meet the storage needs of a given table. Oracle 8 had rollback segments, but these have been replaced in later versions with undo segments, which are stored in an undo tablespace. (See the later sections “Tablespace” and “Undo tablespace.”)

Datafile

An Oracle datafile can be stored on either a raw (disk device) or cooked (filesystem) file. Once they are created, the syntax to work with raw and cooked datafiles is the same. However, if rman is not used to back up the Oracle datafiles, backup scripts do have to take the type of datafile into account. If the backup script is going to support datafiles on raw partitions, it will need to use dd or some other command that can back up a raw partition. Using cp or tar will not work because they support only filesystem files.

Each Oracle datafile contains a special header block that holds that datafile’s system change number (SCN). Each transaction is assigned an SCN, the SCN in each datafile is updated every time a change is made to that datafile, and the control file keeps track of the current SCN. When an instance is started, the current SCN is checked against the SCN markers in each datafile. (See the section “Control file” later in this chapter.)

Tip

Please see an important explanation of the role that the SCN plays during hot backup in the section “Debunking Hot-Backup Myths” later in this chapter.

Tablespace

The tablespace is the virtual area into which a DBA creates tables and other objects. It consists of one or more datafiles and is created by the create tablespace tablespace_name datafile device command. A tablespace may contain several tables. The space that each object (e.g., table) occupies within that tablespace is a segment (see the earlier definition of segment).
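A sketch of that command, using a hypothetical tablespace name and datafile paths; the second command shows how additional datafiles can later be added to the same tablespace:

```sql
SQL> create tablespace app_data datafile '/u01/oradata/orcl/app_data01.dbf' size 500m;

SQL> alter tablespace app_data add datafile '/u02/oradata/orcl/app_data02.dbf' size 500m;
```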

As of 10g, every Oracle database has at least three tablespaces: system, sysaux, and undo. The system tablespace stores the data dictionary, PL/SQL programs, view definitions, and other types of instance-wide information. The sysaux tablespace stores non-system-related tables and indexes that traditionally were placed in the system tablespace. Its name implies that it is the auxiliary system tablespace. The undo tablespace contains the undo segments, which replaced rollback segments. When it comes to backup and recovery, the main difference between these tablespaces and the rest of the tablespaces is that they must be recovered offline. That is because the instance cannot be brought online without these tablespaces. Other tablespaces can be recovered after the instance has been brought online.

Partition

A table or index can be divided into chunks called partitions and spread across multiple tablespaces for performance and availability reasons. As long as you are backing up all tablespaces, partitioned tables do not present any challenges to backup and recovery.

Control file

The control file is a database (of sorts) that keeps track of the status of all the files that comprise a given database and maintains rman backup metadata. It knows about all tablespaces, datafiles, archive logs, and redo logs within the database. It also knows the current state of each of these files by tracking each object’s SCN. Every transaction is assigned an SCN, and every time a change is made, the SCN gets updated both in the control file and in each datafile. (See the earlier section “Datafile.”) That way, when the instance is starting up, the control file has a record of what SCN the file should be at, and it checks that against the SCN that the file has. This is how it “notices” that a file is older than the control file and in need of media recovery. Also, if an older control file is put in place, Oracle will see that the SCN of the datafiles is higher than those that it has recorded in the control file. That’s when Oracle displays the datafile is more recent than the controlfile error.

If you’re performing user-managed backups, control files should be backed up using both the backup controlfile to filename and backup controlfile to trace commands. If you’re using rman, you can use the backup controlfile command, or you can have rman automatically back up the control file every time you run a backup by setting the controlfile autobackup parameter to on. You should also issue the backup controlfile to trace command within sqlplus, and it’s a good practice to copy the results of this command to a known location to make it easy to find during recovery. There is a scenario covered in the section “Recovering Oracle” where this trace output will come in quite handy. Restoring control files is a bit tricky; the mechanics are covered later in that same section.
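As an illustration, the two user-managed commands and the rman setting might look like this. The output paths are assumptions, and the as clause on the trace command requires 10g (in 9i, the trace file lands in the user_dump_dest directory instead):

```sql
SQL> alter database backup controlfile to '/backups/orcl/control.bkp' reuse;
SQL> alter database backup controlfile to trace as '/backups/orcl/control_trace.sql';
```

```
RMAN> configure controlfile autobackup on;
```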

Tip

Back up the control file to a file or within rman, and back it up to trace, too. Both backups can be very useful.

It is best to avoid having to restore or rebuild a control file. The best way to do this is to mirror your control files within Oracle via a function Oracle calls multiplexing. This allows you to create two or three copies of the control file, each of which is written to every time the control file is updated.

Tip

Please don’t confuse the Oracle term multiplexing with how this term is used everywhere else in the backup and recovery space (the sending of simultaneous backup streams to a single tape drive). To minimize this confusion, this chapter refers to multiplexing in Oracle as multiplexing/mirroring or multiplexed/mirrored.

Multiplexing/mirroring is slightly different from disk-level mirroring and offers an additional level of protection. With disk-level mirroring, you have one control file that’s mirrored to multiple disks via RAID 1. This protects you from disk failure, but not human error. If someone accidentally deleted the control file, it would instantly be deleted on each disk it was mirrored on. Storing multiplexed/mirrored control files on separate disks protects you from disk failure as well as accidental deletion. If a single control file gets deleted or corrupted, the instance becomes inoperable. However, the deleted/corrupted control file can easily be restored via one of the multiplexed copies, and the instance can be returned to proper operation.
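Pointing an instance at three multiplexed/mirrored control file copies might be sketched as follows; the paths are assumptions:

```sql
SQL> alter system set control_files = '/u01/oradata/orcl/control01.ctl', '/u02/oradata/orcl/control02.ctl', '/u03/oradata/orcl/control03.ctl' scope=spfile;
```

After shutting the instance down, copy the existing control file to the two new locations before starting it back up.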

Tip

Make sure you are multiplexing/mirroring your control files to separate disks. They take up very little space and provide an incredible amount of recovery flexibility.

Transaction

A transaction is any activity by a user or a DBA that changes one or more attributes in an Oracle database. (A set of commands that ends with a commit statement is treated as one transaction.) Logically, a transaction modifies one or more attributes, but what eventually occurs physically is a modification to one or more blocks within the Oracle database.

Undo tablespace

Undo data is now used for three purposes, the first of which is to undo an aborted transaction or a transaction that was not yet committed when the database crashed. Undo information also provides a consistent read for long-running queries. (See the section “ACID Compliance” in Chapter 15.) Finally, undo information can be used by Oracle’s flashback feature, which allows you to manually undo transactions. For these reasons, it’s important to keep undo information as long as possible.

Prior to 9i, undo was managed using rollback segments. The biggest challenge with rollback segments was trying to figure out how large they should be. If you made them too large, you wasted space. If you made them too small, you would get the dreaded ORA-01555 snapshot too old error, indicating that a long-running query was not able to obtain the consistent read it needed. If you had a very active database with a lot of long transactions or long running queries, this could prove to be a real problem.

Oracle 9i introduced automatic undo management, or AUM. You can continue to use manual undo management using rollback segments, but once you understand the value of AUM, it’s hard to understand why you wouldn’t use it. When configuring an Oracle database, you now specify an undo tablespace. Oracle automatically creates undo segments in this tablespace instead of requiring you to manually create rollback segments. AUM eliminates the complexities of managing rollback segment space and lets you exert control over how long undo is retained before being overwritten. Oracle strongly recommends that you use undo tablespaces to manage undo rather than rollback segments.

At first, it’s hard to differentiate between a tablespace dedicated to rollback segments and a tablespace dedicated to undo segments. Perhaps an explanation of how Oracle manages the automatic deletion of older undo data will help.

You can specify a minimum amount of time that undo information is to be kept using the undo_retention parameter. When Oracle needs space for more undo data, Oracle does its best to honor that value by deleting the oldest expired undo data first. However, if the undo tablespace is out of space, Oracle may begin deleting unexpired undo information, which unfortunately can result in the same snapshot too old error that you could experience with rollback segments. However, if you enable autoextend on the undo tablespace, the tablespace automatically extends to the maximum size you have specified. This automatic space management is what makes AUM so popular with those who have used it.
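A sketch of configuring AUM, with hypothetical names and sizes. The undo_management parameter is static, so it takes effect at the next restart; undo_retention is expressed in seconds:

```sql
SQL> create undo tablespace undotbs1 datafile '/u01/oradata/orcl/undotbs01.dbf' size 500m autoextend on maxsize 2g;

SQL> alter system set undo_management = auto scope=spfile;
SQL> alter system set undo_tablespace = undotbs1;
SQL> alter system set undo_retention = 900;
```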

Checkpoint

A checkpoint is the point at which all changed data that is kept in memory is flushed to disk. In Oracle, a DBA can force a checkpoint with the alter system checkpoint command, but a checkpoint also is done automatically every time the database switches to a different online redo log group.

Flash recovery area

The components that create different backup and recovery-related files (e.g., rman, alter database backup controlfile, alter system switch logfile) have no knowledge of each other or of the space available in the filesystem in which they’re storing data.

One of the big reasons to upgrade to Oracle 10g is its flash recovery area, which solves this problem by automating management of backup-related files. Choose a location on disk and an upper limit for storage space, and set a retention policy that governs how long backup files are needed for recovery, and the database automatically manages the storage used for backups, archived redo logs, and other recovery-related files for your database within that space. To create a flash recovery area, specify a value for the db_recovery_file_dest and db_recovery_file_dest_size parameters in your startup file. To have archive logs sent to the flash recovery area, specify location=use_db_recovery_file_dest as the value for one of your log_archive_dest_n parameters.
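Those settings might be sketched as follows; the location and size are assumptions, and the size limit must be set before the destination:

```sql
SQL> alter system set db_recovery_file_dest_size = 10g scope=both;
SQL> alter system set db_recovery_file_dest = '/flash_recovery_area' scope=both;
SQL> alter system set log_archive_dest_1 = 'location=use_db_recovery_file_dest' scope=both;
```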

Redo log

If the undo tablespace or rollback segments contain the information to roll back a transaction, the redo log contains the information to “roll forward” transactions. Every time that Oracle needs to change a block on disk, it records the change vector in the redo log; that is, it records how it changed the block, not the value it changed it to. A mathematical explanation may be helpful here. Suppose that you had a variable with a value of 100 and added 1 to it. To record the change vector, you would record +1; to record the changed value, you would record 101. This is how Oracle records information to the redo logs during normal operation. As explained in the later section “Debunking Hot-Backup Myths,” it switches to recording the changed value when a tablespace is in hot-backup mode.

In times of recovery, the redo log is used to redo (or replay) transactions that have occurred since the last checkpoint or since the backup that is being used for a restore. Oracle can be configured to use both online redo logs and offline (archived) redo logs.

Tip

The online and archived redo logs are essential to recovering from a crash or disk failure. Learn everything you can about how they work and protect them as if they were gold!

Originally, the online redo logs were a few (typically three) files to which Oracle wrote the logs of each transaction. The problem with this approach is that the log to which Oracle was currently writing always contained the only copy of the most recent transaction logs. If the disk that this log was stored on were to crash, Oracle would not be able to recover up to the point of failure. This is why, in addition to multiplexing/mirroring the control file, Oracle also supports multiplexing/mirroring the redo logs. Instead of using individual redo logs, you can now write redo information to a log group. A log group is a set of one or more files, called members, that Oracle writes to simultaneously, essentially a mirror for the redo logs. Believe it or not, you’re only required to have one member in each log group. Obviously, if you don’t have multiple members in the log group, you’re not helping yourself much. The usual practice is to put three members of each log group on three separate disks, often the same disks that you’re multiplexing your control files to. Each log group is treated as a single logfile, and all transaction records are simultaneously written to all disks within the currently active log group.

Now, instead of three or more separate files, any one of which could render the database useless if damaged, there are three or more separate log groups of multiplexed/mirrored files. If each log group is assigned more than one member, every transaction is being recorded in more than one place. After a crash, Oracle can read any one of these members to perform crash recovery.

Oracle writes to the log groups in a cyclical fashion. It writes to one log group until that log group is full. It then performs a log switch and starts writing to the next log group. As soon as this happens, the log group that was just filled is copied to an archived redo logfile, if the database is running in archivelog mode and automatic archiving is enabled. If archivelog mode is not enabled, this file is not copied and is simply overwritten the next time that Oracle needs to write to that log. If automatic archiving is not enabled, the instance hangs the next time Oracle needs to write to an unarchived redo log.

For nonenterprise customers, each of the online redo logs is copied to the filename pattern specified by the value in one or more log_archive_dest_n parameters in the parameter file, followed by an incremented string specified by the log_archive_format parameter in the parameter file. For example, assume that log_archive_dest_1 is set to /archivelogs/arch and log_archive_format is set to %s.log, where %s is Oracle’s variable for the current sequence number. If the current sequence number is 293, a listing of the archivelogs directory might show the following:

# cd /archivelogs/arch
# ls -l arch*
arch291.log
arch292.log
arch293.log

The log_archive_dest_n parameter is an enhanced version of the log_archive_dest parameter. Where log_archive_dest could be set to only one destination, you can have multiple log_archive_dest_n parameters, each of which can be set to a directory or to the flash recovery area by specifying db_recovery_file_dest. (See the section “Flash recovery area” earlier in this chapter.)

Depending on how much activity a database has, the archive log destination directory may have hundreds of files over time. If you send archive logs directly to a directory, Oracle does not manage this area; however, if you’re sending archive logs to the flash recovery area, Oracle manages the space for you. If you’re managing the space, a cron job must be set up to clean up archive log destinations. As long as these files are being backed up to some kind of backup media, they can be removed after a few days. However, the more logs there are on disk, the better off the database will be because it may sometimes be necessary to restore from a backup that is not the most current. (For example, this could happen if the current backup volume is damaged.) If all the archive logs since the time the old backup was taken are online, it’s a breeze. If they aren’t, they have to be restored as well. That can create an available-space problem for many environments. This is why I recommend having enough space to store enough archive logs to span two backup cycles. For example, if the system does a full database backup once a night, there should be enough space to have at least two days’ worth of redo logs online. If it backs up once a week, there should be enough storage for two weeks’ worth of transaction logs. (This is yet another reason for backing up every night.)
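A cron-driven cleanup can be sketched as a small script like the following. The directory, the arch*.log filename pattern, and the two-day retention are assumptions, and it presumes the logs have already been copied to backup media:

```shell
#!/bin/sh
# clean_archive_logs DIR DAYS
# Remove archived redo logs in DIR that are more than DAYS days old.
# Only run this against logs that have already been backed up!
clean_archive_logs() {
    archdir=$1
    days=$2
    # -mtime +N matches files last modified more than N*24 hours ago
    find "$archdir" -name 'arch*.log' -type f -mtime +"$days" -exec rm -f {} \;
}

# Example cron entry: every night at 3:00, keep two days' worth of logs
# 0 3 * * * /usr/local/bin/clean_archive_logs.sh /archivelogs 2
```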

In summary, the online redo logs are usually three or more log groups that Oracle cycles through to write the current transaction log data. A log group is a set of one or more logs that Oracle treats as one redo log. Log groups should have more than one member; with multiple members, there is little chance of losing redo data if a disk fails. Once Oracle fills up one online redo log group, it copies that redo log to the archive log destination as a separate file with a sequence number contained in the filename. It makes this copy only if you are running in archivelog mode and automatic archiving is enabled.

Initialization parameters

There are a number of initialization parameters in Oracle that are important to know during recovery, so it’s vital to learn where those parameters are stored both in and outside of the database.

Historically, these initialization parameters were stored in a text file named init<SID>.ora. You could change most of the parameters inside the database, but if you wanted them to survive a reboot, you also had to make the changes in the init<SID>.ora file.

Oracle 9i introduced the concept of a server parameter file, or spfile for short. An spfile is a binary file named spfile<SID>.ora that is usually stored in the $ORACLE_HOME/dbs directory. It cannot be edited directly, but it is maintained by Oracle, allowing dynamic configuration changes to be written to the spfile automatically so that they survive a restart. For those who are used to init<SID>.ora files, an spfile can be a bit of an adjustment because there’s nothing for you to edit. There are actually a few ways to change initialization parameters.

If the database is running, simply set the appropriate parameter via the alter system set parameter_name = value command. This action immediately changes the parameter in the running database and also stores it in the spfile. This is certainly easier than the old method of changing it in the database and the init.ora file. Some parameters, such as log_archive_start, cannot be changed this way; you must change them only in the spfile (using the scope=spfile clause) and then restart the instance.
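The scope clause controls where a change lands. A sketch with hypothetical values:

```sql
-- A dynamic parameter: changed in memory and in the spfile at once
SQL> alter system set undo_retention = 1800 scope=both;

-- A static parameter: can only be written to the spfile,
-- and takes effect after the instance restarts
SQL> alter system set log_archive_start = true scope=spfile;
```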

Tip

You can query both the currently used parameters and the parameters stored in the server parameter file using the v$parameter and v$spparameter views, respectively.

If you want to change a parameter in the spfile without using the database to do it, you need to create a text version of the file, edit it, then import the text file into a new spfile. The following SQL command creates a pfile called pfilename from the current spfile:

SQL> create pfile='pfilename' from spfile;

Once you’ve edited pfilename and made the appropriate changes, you can make a new spfile from the pfile using this SQL command:

SQL> create spfile from pfile='pfilename';

Tip

Although you can query the v$parameter and v$spparameter views even when the database is down (as long as the instance is started), it might be helpful to occasionally make a text export of your spfile to allow you to look at it during a recovery.

Restore versus recover

Oracle distinguishes between a restore and a recovery. A restore is meant in the traditional sense of the term; it returns the datafile to the way it looked when you last backed it up by reading that backup from tape or disk. You then initiate a recovery, which applies redo against that datafile to bring it up to the current point in time. This is also referred to as media recovery.

Finding All Instances

All procedures in this chapter assume you know what the instances are on your server or that you’re able to determine what they are. Unfortunately, there is no foolproof way to determine all instances on all machines. For example, I could tell you to use the oratab file on Unix, but not everybody uses it, and it can be out of date. For Windows, I could tell you to list the registry keys matching a certain pattern, but like the oratab file, these keys do not always represent real instances. Another method would be to list the instances that are actually running on the machine. This finds active instances, of course, but it won’t find instances that were not running when the backup ran.

The best you can do is to tell everybody the method you’re using to determine the list of instances, then stick with it. Tell them that if something’s in oratab or listed in the Windows Registry, it gets backed up. If it’s not there, it won’t get backed up. Tell them that you’re going to back up any running instances and will not back up instances that were manually shut down. Just make a choice, and publish it well.

You can enforce the use of the oratab file on Unix/Linux or the registry keys in Windows by automatically starting only databases found there and taking note any time someone complains about an instance that didn’t start after a reboot. Explain to them that it needs to be in the registry or oratab. If an instance is not in the oratab, it is not automatically started, and it doesn’t get backed up!
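That enforcement can be sketched with a small parser for the oratab (assuming its standard format of SID:ORACLE_HOME:Y|N entries, one per line, with comment lines starting with #):

```shell
# Print the SID field of every real entry in an oratab-format file.
# The path defaults to /etc/oratab but can be passed as an argument.
list_oratab_sids() {
  oratab=${1:-/etc/oratab}
  # Skip comment and blank lines; print the first colon-delimited field.
  awk -F: '/^[^#]/ && NF >= 2 { print $1 }' "$oratab"
}
```

Feeding the output of this function to both your startup script and your backup script keeps the two lists identical, which is the whole point of the policy.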

The oratab file is usually located in /etc/oratab or /var/oracle/oratab. In Windows, you want to look at the following registry tree:

HKEY_LOCAL_MACHINE\SOFTWARE\ORACLE\HOMEn

Usually n is 0, but there can be additional registry trees with other digits if there are additional ORACLE_HOMEs configured on this machine. Underneath this registry tree, you should find one or more of the following values:

  • ORA_SID_AUTOSTART

  • ORA_SID_PFILE

  • ORA_SID_SHUTDOWN

You can then parse that list to determine a list of Oracle SIDs.

Tip

If you want to use the oratab/registry method but also want to double check that you’re getting everything, you can also use the process list method discussed in the following paragraphs to see if any instances are running that aren’t in the oratab or registry.

One way to do this is just to list the instances that are running on the machine. This, of course, lists only instances that are running. To get a list of all instances running on a Unix/Linux system, use a variation of this command:

$ ps -ef | grep "ora_.*pmon" | grep -v grep | awk '{print $6}' | awk -F_ '{print $3}'

Tip

The output of the ps command can obviously be different in different versions of Unix. You may have to change the column that this command prints from $6 to something else to get this command to work for you.
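If the column numbers prove fragile, a less position-dependent variation (a sketch, assuming the standard ora_pmon_SID process naming) asks ps for just the command name and strips the prefix with sed:

```shell
# Print one SID per running instance by matching the PMON process name.
# "comm=" prints only the command name, with no header line.
ps -eo comm= | sed -n 's/^ora_pmon_//p'

# The sed expression can be sanity-checked against a sample process name:
echo "ora_pmon_orcl" | sed -n 's/^ora_pmon_//p'    # prints: orcl
```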

On Windows, each Oracle SID has its own service named OracleService SID, where SID is the instance name. You can get a list of running Windows instances using this command:

C:\> net start | find "OracleService"
   OracleServiceXYZ
   OracleServiceABC

If you would like to make this list even better, download the sed command from the GnuWin32 project (http://gnuwin32.sourceforge.net), and add it to the pipeline. The following sed command tells sed to strip off all characters up to and including the string OracleService. This leaves you with a list of instance names.

C:\> net start | find "OracleService" | sed "s/.*OracleService//"
XYZ
ABC
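With the configured list and the running list both in hand, the double check itself can be sketched as follows (the file names here are hypothetical; each file holds one SID per line, and comm requires sorted input):

```shell
# Print SIDs that appear in the running list but not the configured list;
# anything printed is a candidate for a missed backup.
running_but_not_configured() {
  sort "$1" > "/tmp/conf.$$"
  sort "$2" > "/tmp/run.$$"
  # comm -13 suppresses lines unique to file 1 and lines common to both,
  # leaving only lines unique to file 2 (the running list).
  comm -13 "/tmp/conf.$$" "/tmp/run.$$"
  rm -f "/tmp/conf.$$" "/tmp/run.$$"
}
# Usage: running_but_not_configured oratab_sids.txt running_sids.txt
```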

Tip

Again, the best thing you can do is pick one of these methods (preferably the oratab/registry method), then publish it and enforce it.

Physical Backups Without rman

Many Oracle environments prefer the supported nature of rman. They also enjoy the way you can completely integrate your commercial backup software with rman, and it’s the only way to get true incremental backups of your datafiles. In addition, you can use rman to back up to disk without purchasing an agent for your commercial backup system.

However, some DBAs prefer what Oracle calls user-managed backups. They put their database into backup mode prior to backing up and take it out of backup mode after backup. Sometimes this is due to a long history and familiarity with user-managed backups; sometimes it is the cost of either the disk you would need to use rman without a media manager or the cost of the media manager. For whatever reason, approximately half of Oracle customers perform user-managed backups, but this percentage goes down every day.

This section discusses methods that can be used to safely back up Oracle without using rman. You can back up to disk and then back that disk up using your normal backup procedures, or you can back up directly to tape.

Like most RDBMSs, Oracle databases can reside on cooked filesystem files or raw disk devices (raw disks are available only in Unix). Even if raw devices are available, many Oracle DBAs put their databases on cooked files. One of the reasons for this is that if all of the database files are accessible via the filesystem, backing up is very simple. Any standard copy utility (e.g., cp, copy) or any backup utility (e.g., dump, cpio, or commercial utility) can copy the data. If you are running Oracle on Unix and decide to put your Oracle database on raw partitions, you need to use a tool that can back up raw partitions (e.g., dd).

Backups can be done offline (a cold backup) or online (a hot backup). If you’re going to perform user-managed backups, you must back up all of the following:

  • Datafiles

  • Control files

  • Online redo logs (if performing a cold backup)

  • The parameter file

  • Archived redo logs

  • Password file if used

  • The ORACLE_HOME directory

Cold Backup

A cold backup of an Oracle database that is based on filesystem files is the easiest of all database backups because most companies already have some system that backs up their server’s filesystems. It could be a homegrown program that runs dump or ntbackup, or it could be a commercial backup product. To perform a cold backup of an Oracle database, simply perform an orderly shutdown of Oracle (not shutdown abort) before running the normal backup. An orderly shutdown performs a checkpoint, flushing any changed data stored in memory to disk, and then stops all processes that allow access to the database. This procedure puts all Oracle files into a clean, consistent, quiescent state.

This procedure assumes, of course, the use of filesystem files. An Oracle database running on a Unix server also could be sitting on raw devices. A cold backup of such a database requires a little more effort because you need to understand the structure of the database. The procedure starts out the same, by shutting down the database. A filesystem backup at this point, though, gets only the executables and any database objects that reside in the filesystem, such as the control file. The database itself requires extra effort. The first thing to figure out is where all the database files are. A script can “ask” Oracle this question by querying v$datafile, but that script would have to be written. Once the locations of all files are known, dd can back them up to a file somewhere in the filesystem or send them directly to a backup volume. If they are backed up to a file in the filesystem, it must be done before the normally scheduled filesystem backup. That way, the files are automatically backed up to a backup volume.
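The dd-to-filesystem step just described can be sketched as follows. The list of raw devices is assumed to have been written to datafile_list.txt beforehand (for example, by a script that queries v$datafile); the backup directory name is hypothetical:

```shell
# Copy each raw device named in datafile_list.txt to a filesystem file,
# where the normally scheduled filesystem backup will pick it up.
BACKUP_DIR=${BACKUP_DIR:-/backups/oracle_raw}

backup_raw_datafiles() {
  mkdir -p "$BACKUP_DIR"
  while read -r datafile; do
    [ -n "$datafile" ] || continue
    # dd reads the raw partition block by block into a regular file.
    dd if="$datafile" of="$BACKUP_DIR/$(basename "$datafile").dd" bs=1024k 2>/dev/null
  done < datafile_list.txt
}
```

Run this only while the database is shut down; a cold backup of raw devices is valid only when the files are quiescent.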

Therefore, backing up an Oracle database that uses raw partitions is harder than backing one up that is based on filesystem files. This is one reason why Oracle DBAs have historically used the filesystem to store their database files, even though raw partitions yield slightly better performance.

Hot Backup

If an Oracle database is providing the data for the customer service web page or any other application that requires 24-hour uptime, a cold backup is not acceptable because it requires that the database be shut down on a regular basis. Even if this is done late at night, customers accessing the web page may do so at any time. A company has a much better online image if it is able to leave the web page up all the time. What is needed, then, is a hot, or online backup.

Tip

The database must be running in archivelog mode in order to run hot backups.

A hot backup requires quite a bit more work than a cold backup, and this is even more true when the cold backup is of a database using raw devices. The following steps must be taken every time a hot backup is performed:

  1. Find out the names and locations of all tablespaces and the datafiles they reside on:

    1. If running Oracle 10gR2 or later, you can place the entire database in backup mode, so you need only a list of all datafiles, which you can obtain with the SQL command shown here:

      SQL> select file_name from sys.dba_data_files;
      /oracle/product/10.2.0/oradata/orcl/users01.dbf
      /oracle/product/10.2.0/oradata/orcl/sysaux01.dbf
      /oracle/product/10.2.0/oradata/orcl/undotbs01.dbf
      /oracle/product/10.2.0/oradata/orcl/system01.dbf
      /oracle/product/10.2.0/oradata/orcl/example01.dbf

    2. If running a version prior to Oracle 10gR2, each tablespace must be placed in backup mode individually, so you must ask Oracle for a list of all tablespaces and their datafiles, using the SQL command shown here:

      SQL> select tablespace_name, file_name from sys.dba_data_files;
      USERS    /oracle/product/10.2.0/oradata/orcl/users01.dbf
      SYSAUX   /oracle/product/10.2.0/oradata/orcl/sysaux01.dbf
      UNDOTBS1 /oracle/product/10.2.0/oradata/orcl/undotbs01.dbf
      SYSTEM   /oracle/product/10.2.0/oradata/orcl/system01.dbf
      EXAMPLE  /oracle/product/10.2.0/oradata/orcl/example01.dbf

  2. Ask Oracle for the location of the archived redo logs using a SQL command like the one shown here. Since you can now specify multiple locations, the query selects those parameters that are like 'log_archive_dest_%':

    SQL> select name, value from v$parameter
      2  where name like 'log_archive_dest_%';
    log_archive_dest_1  location=/backups/archive

  3. Put the database into backup mode:

    1. If running Oracle 10gR2 or later, you can put the entire database into backup mode using the SQL command alter database begin backup.

    2. If running a version prior to 10gR2, put each tablespace into backup mode using the SQL command alter tablespace tablespace_name begin backup.

Tip

Because more redo is generated when tablespaces are in backup mode, there are performance ramifications to placing the entire database in backup mode. If placing the entire database into backup mode at once creates too much of a load on your server, place a few tablespaces into backup mode at a time and back up just their datafiles. This is obviously a lot more complicated than the method proposed here.

  4. Copy each tablespace’s datafiles to an alternate location such as disk or tape, or run your commercial backup program to back up all filesystems (and possibly raw devices) to disk or tape.

  5. Take the database out of backup mode:

    1. If running Oracle 10gR2 or later, you can take the entire database out of backup mode by running the SQL command alter database end backup.

    2. If running a version prior to 10gR2, take each tablespace out of backup mode using the SQL command alter tablespace tablespace_name end backup.

  6. Switch the redo logfile and make sure it is archived. The best method to do both is to run the SQL command alter system archive log current. This switches the logfile but does not return the prompt until the previous redo log has been archived. You can run alter system switch logfile, but then you won’t be sure that the latest redo log has been archived before you move on to the next step.

  7. Back up the control file using the SQL commands alter database backup controlfile to filename and alter database backup controlfile to trace.

Tip

Make sure you use both methods to back up the control file; either one may come in handy at different times.

  8. Ensure that all archived redo logs that span the time of the backup are backed up.
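The SQL that brackets the datafile copy (steps 3, and 5 through 7, for 10gR2 or later) can be sketched as a small generator function. The controlfile backup path is a hypothetical placeholder; the datafile copy of step 4 happens between the two phases, outside SQL*Plus:

```shell
# Emit the SQL for the begin or end phase of a 10gR2+ hot backup.
hot_backup_sql() {
  phase=$1
  ctl=${2:-/backups/control.bkp}
  case $phase in
    begin)
      echo "alter database begin backup;"
      ;;
    end)
      echo "alter database end backup;"
      echo "alter system archive log current;"
      echo "alter database backup controlfile to '$ctl';"
      echo "alter database backup controlfile to trace;"
      ;;
  esac
}

# In a real script, each phase would be piped to SQL*Plus:
#   hot_backup_sql begin | sqlplus -s "/ as sysdba"
#   ...copy the datafiles...
#   hot_backup_sql end | sqlplus -s "/ as sysdba"
```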

This is obviously a lot of work. A good knowledge of scripting is required, as well as a good knowledge of the commands necessary to accomplish these tasks. To explain the whole process, let’s break it down by section:

Determine structure

Steps 1 and 2 have to do with figuring out where everything is. Make sure you do these steps every single time you do a backup, not just once or only when you change the structure of the database. Performing them only once results in a static script that backs up only the tablespaces that were discovered the last time these steps were done, which is too open to human error and not recommended. It is much better to automate these steps and perform them each time an instance is backed up. Doing them each time ensures that the configuration data is always current, and that the backups never miss a thing.

Put files into backup mode, and then copy them

Copying the database files while the database is running can be done only by placing one or more tablespaces (or the whole database) in backup mode. The datafiles that are in those tablespaces or that database then can be copied or backed up at will. Since the files are still being written to as you are backing them up, Oracle refers to this as an inconsistent backup. Some also refer to this type of backup as a “fuzzy” backup; they are the same thing. The point is that the file is still changing while you are backing it up. Don’t worry; Oracle will be able to resolve any inconsistencies in the file during recovery.

Tip

Please see the following section for a refutation of the common misconception that datafiles are not being written to when they are in hot-backup mode.

Back up related files

The datafiles are just one of the many parts of Oracle that need to be backed up. Also back up the control file, the archived redo logs, the configuration files, the password file if you’re using it, the ORACLE_HOME directory, and any other related directories. (If it is a cold backup, you should also copy the online redo logs, because together these files make up a complete copy of the database at that point in time.)

Debunking Hot-Backup Myths

What happens during a hot backup is widely misunderstood. I debunk two main myths here:

  • Datafiles don’t change during hot backup.

  • Oracle logs a full copy of every block every time it’s changed during hot backup.

Many people believe that while a tablespace is in backup mode, the datafiles within that tablespace are not written to. They believe that all changes to these files are kept in the redo logs until the tablespace is taken out of backup mode, at which point all changes are applied to the datafiles just as they are during a media recovery. (The concept of media recovery is described in the section “Recovering Oracle” later in this chapter.) Although this explanation is easier to understand (and swallow) than how things really work, it is absolutely not how hot backups work in Oracle.

Many people also believe that when a tablespace is in backup mode, Oracle switches from logging redo vectors to logging full blocks. Neither belief is correct. Oracle continues to log redo vectors, but it also logs the full image of any block that is changed, though only the first time it changes that block.

A common reaction to these statements is a very loud “What?” followed by crossed arms and a really stern look. (I reacted the same way the first time I heard it.) “How could I safely back up these files if they are changing as I’m backing them up? What do you mean it logs the full image only the first time it changes the block?” Don’t worry; Oracle has it all under control. Remember that every Oracle datafile has an SCN that is changed every time an update is made to the file. Also remember that every time Oracle makes a change to a datafile, it records the vector of that change in the redo log. When a tablespace is put into backup mode, the following three things happen:

  1. Oracle checkpoints the tablespace, flushing all changes from shared memory to disk.

  2. The SCN markers for each datafile in that tablespace are “frozen” at their current values. Even though further updates will be sent to the datafiles, the SCN markers will not be updated until the tablespace is taken out of backup mode.

  3. In addition to recording how it changed a particular block (referred to as the redo vector), it logs the entire image of each block the first time it changes it on disk.

After this happens, your backup program works happily through this datafile, backing it up block by block. Since the file is being updated as you are reading it, it may read blocks just before they’re changed, after they’re changed, or even while they’re changing! Suppose that your filesystem block size is 4 KB, and Oracle’s block size is 8 KB. Your backup program will be reading in increments of 4 KB. It could back up the first 4 KB of an 8 KB Oracle data block before a change is made to that block, then back up the last 4 KB of that block after a change has been made. This results in what Oracle calls a split block. However, when your backup program reaches the point of the datafile that contains the SCN, it backs up the SCN the way it looked when the backup began because the SCN is frozen. Once you take the tablespace out of backup mode, the SCN marker is advanced to the current value, and Oracle switches back to logging change vectors and doesn’t worry about full images of changed blocks.

How does Oracle straighten this out during media recovery? It’s actually very simple. You use your backup program to restore the datafile. When you attempt to start the instance, Oracle looks at the datafile and sees an old SCN value. Actually, it sees the value that the SCN marker had before the hot backup began. When you enter recover datafile, it begins to apply redo against this datafile. Since the redo logs contain one complete image of every block that changed during your backup, it can rebuild this file to a consistent state, regardless of when you backed up a particular block of data. The first thing it does is overwrite any blocks for which it contains full images (that is, those blocks that changed during backup). It then applies regular redo against the file to apply any changes that occurred during or after the backup. This is why it needs only the first image of any block that was changed during the backup. It just needs one image to ensure that the block is put into a known state (as opposed to a split block). It can then apply redo against that block to redo any other changes.

If you’re like me, you won’t believe this the first time that you read it. So I’ll prove it to you. Let’s create a table called tapes in the tablespace test, insert the value “DLT” into it, and force a checkpoint:

SQL> create table tapes (name varchar2(32)) tablespace test;
Table created.
SQL> insert into tapes values ('DLT');
1 row created.
SQL> commit;
Commit complete.
SQL> alter system checkpoint;
System altered.

Now we ask Oracle what block number contains the new value:

SQL> select dbms_rowid.rowid_block_number(rowid) blk, name from tapes;
  BLK NAME
------- ----------------
   3 DLT

The value “DLT” is recorded in the third data block. Allowing nine blocks for the datafile headers, we can read the third block of data with dd and run strings on it to actually see that the value is there:

$ dd if=/db/Oracle/a/oradata/crash/test01.dbf ibs=8192 skip=11 count=1|strings
1+0 records in
16+0 records out
DLT

Place the tablespace in hot-backup mode:

SQL> alter tablespace test begin backup ;
Tablespace altered.

Now update the table, commit the update, and force a global checkpoint on the database:

SQL> update tapes set name = 'AIT';
1 row updated.
SQL> commit;
Commit complete.
SQL> alter system checkpoint;
System altered.

Extract the same block of data to show that the new value was actually written to disk:

$ dd if=/db/Oracle/a/oradata/crash/test01.dbf ibs=8192 skip=11 count=1|strings
1+0 records in
16+0 records out
DLT,
AIT

Now we can take the tablespace out of backup mode:

SQL> alter tablespace test end backup;

This test proves that datafiles are indeed being written to during hot backups!

Physical Backups with rman

rman offers a number of advantages over user-managed backups. The following is a quick list of them:

Full support from Oracle

While user-managed backups are supported by Oracle, you’ll find that Oracle’s interest goes up even more if you’re recovering your database using its favorite tool.

Automatic creation of multiple simultaneous backup streams

If you’ve got a big database, this will come in very handy. All you have to do is specify how many simultaneous backups you need, and rman will generate them.

Incremental backups for all versions since 8i

rman performs a true incremental backup, sending only those blocks that have changed since the last backup.

Low-impact incremental backups for all versions since 10g

With 10g, incremental backups got much better. Prior to 10g, it took a lot of I/O to determine which blocks needed to be on the incremental backup. This saved a lot of space and bandwidth, but definitely didn’t save I/O. 10g stores a record of changed blocks in a simple bitmap that’s consulted by rman during incremental backups, making them much faster and creating much less work for the server.

Use of the flash recovery and flashback features

Flash recovery and flashback are integrated into rman, and you get to use these wonderful features if you use rman.

Backup optimization

Since rman is keeping track of all backups, you have the option of using backup optimization, which doesn’t back up parts of the database it knows it already has a backup of, such as read-only tablespaces.

Restore optimization

Instead of having to figure out which datafiles need to be restored and selecting them from your backup software, you can simply tell rman to restore database. It then automatically determines which datafiles are in need of a restore and restores them.

During media recovery, rman comes in quite handy as well. Instead of having to figure out where your archived redo logs are, and restoring all of them to disk for the restore to work, rman completely automates retrieval of redo logs for media recovery. It can also be told to keep a certain amount on disk, automatically deleting the oldest versions from disk to bring in newer versions from tape.

If you use rman with a commercial backup utility, such as those covered in Chapter 8, it can actually deliver a completely integrated backup solution for all Oracle databases.

Tip

Since this book is focused on free and open-source backup methods, this chapter does not cover how to integrate rman with a commercial backup utility.

In addition to these rman features, a fully commercial solution would add these features:

Full support from your commercial backup product vendor

Don’t expect your backup vendor to know anything about user-managed backups.

Direct backup to tape or virtual tape or other disk on your backup server (instead of being forced to back up to a filesystem accessible to the Oracle server)

If you’re using rman without a media manager, you’ll need to back up to a filesystem accessible on the database server. This can present a number of challenges to some environments. A media manager gives you the most flexibility when choosing backup targets.

Usage of full capabilities of backup product vendor

Commercial backup products have come a long way in the past few years and include a number of features that we discussed in Chapter 8. You get access to all these features for your rman backups as well.

A complete inventory of your backups, allowing a “point-and-click” restore

You can continue to restore things from the rman interface, or you can use your backup software’s ability to perform the restore via its interface.

Tip

If you’re interested in these features but have been intimidated by the pricing of enterprise-level backup software products, you may consider Oracle’s Secure Backup product. It’s got a very simple pricing structure. Oracle doesn’t charge per client, per server, or per gigabyte; it charges only a certain amount per backup device.

Once a third-party backup software product, your rman script, or Oracle’s Secure Backup initiates an rman backup, rman then does all the internal communication that it needs to do to supply the backup utility with n threads of data. rman records the time of the backup for future reference and passes it to the backup product if there is one. After things have been set up, it is also possible for a DBA to run the rman command from the command line. This command then calls the appropriate programs to connect with the backup utility or back up directly to disk. If sent to a backup utility, it responds to this as to any other backup request, loading volumes as necessary, and accepting the backup from rman.

rman is supported in Oracle 8i and later. You can tell it to restore a tablespace, database, datafile, or even a block within a datafile. It knows what files comprise a tablespace or database, and it automatically selects and restores the most recent backup of those files. This really takes the guesswork away from the DBA. The program knows what needs to be restored, it knows where the backups are, and it automatically goes through the process of restoring the appropriate items, applying media recovery against them as well. This is far better than having to do all this work yourself.

rman is too complex to be covered in detail in a chapter of this size; consult Oracle’s documentation for an explanation of how rman works. What we can look at is a list of major features that have been added to rman since it was introduced in 8i, resulting in a much more powerful and usable backup command. I’ll then provide some general guidance on the use of rman.

Important New rman Features

This section lists important features of rman that have been introduced since it was originally released. Next to each feature is listed the version of Oracle that introduced it.

Crosscheck and delete functionality (8i)

Prior to these commands, the rman catalog could get out of sync with the media manager catalog. These commands automatically synchronize the two, including deleting rman catalog entries for backups that have been expired by the media manager.

create, upgrade, and drop catalog commands (8i)

These commands automate a number of important catalog operations, replacing a number of scripts and other commands in the process.

Automatic backup name generation (8i)

You no longer have to specify the format parameter with a backup. rman automatically creates a unique name for each piece.

Control file autobackups (9i)

If configure controlfile autobackup is set to on, rman automatically backs up the control file and server parameter file any time there is a structural change to the database or any time you issue a backup command.

Backup of server parameter files (9i)

Server parameter files are automatically backed up any time you issue the backup command, allowing you to issue the restore spfile command (if you enable the controlfile autobackup feature).

Backup of archived logs that need backups; deletion of those that don’t (9i)

Previous methods of specifying which archived logs to back up were a bit clunky and required you to figure out which ones to back up. Now you can specify not backed up n times. Files that have not been backed up the specified number of times are backed up. You can also specify that files that have been backed up the appropriate number of times should be deleted.

Persistent rman configurations (9i)

Previous to this, you had to specify a number of arguments to rman every time you ran a backup. Now you can create persistent values for things like automatic channels, channel parallelism, retention policies, and backup options.

Block media recovery (9i)

Block media recovery can perform media recovery on individual blocks in an individual datafile while the datafile remains online.

Flash recovery area (10g)

You can tell rman to use a single location on disk for the location of all backup and recovery files. This area is called the flash recovery area.

Optimization of incremental backups (10g)

Previous versions of rman required a number of reads of an Oracle database to determine which blocks should be included in an incremental backup, resulting in an I/O load that was actually higher than a full backup. As of 10g, Oracle keeps track of which blocks have changed in each datafile. (The DBA must enable this feature.) Now rman can simply ask for changed blocks, significantly reducing the I/O load of an incremental.

Incrementally updating backups (10g)

rman can apply the changes found in an incremental backup to a previous backup of a datafile, resulting in a new full backup of that file, without having to actually back up the entire file.

Backup set compression (10g)

Oracle can compress backups to better use disk space, reducing the space or bandwidth required by 50 to 75 percent. This is not the default option, but you can make it your default option using a persistent rman parameter.

Recovery through resetlogs (10g)

Prior to this feature, you were forced to take a backup after opening the database with the resetlogs option because you couldn’t use backups prior to that time. This is no longer the case.

Global stored scripts (10g)

rman scripts can be stored in the rman catalog that can be accessed by any target database.

Automating rman

This section contains some tips on automating rman backups, especially in large environments.

Warning

This section is absolutely not a replacement for the rman documentation. We cannot cover in a few pages what they cover in an entire book.

Like all other Oracle commands, rman assumes you know the name of the instance that you’re backing up. Therefore, you must first obtain a list of Oracle instances.

Create a recovery catalog

Yes, you can use rman without a recovery catalog. In fact, looking at the release notes shows that Oracle continues to enhance what you can do without a catalog. If you don’t use a catalog, rman data is stored in the control file. Storing your recovery information inside the thing you want to recover is never a good idea. It would be much better to have a centralized recovery catalog that contains all the recovery information for all instances. It’s also a good idea if this catalog is on a different server than any of the databases it is tracking.

It’s also much easier to create the catalog now by just using the create catalog command in rman. Large environments might want to use more than one rman catalog and back up each of those catalogs using an rman instance that’s connected to the other catalog. You might also consider performing a hot or cold backup of the rman catalog by some method that doesn’t use rman.

Warning

Always keep a record of your DBID. This may come in very handy if you lose your control file, especially if you have multiple databases with the same name.

Create persistent parameters

Each persistent parameter that you create is one less parameter that you have to specify on the command line. The ultimate goal here is that you would simply be able to tell rman to backup database, and everything else would come from these parameters. The parameters that can be set persistently can be viewed by using the show all command in an rman session. It will list all of the parameters and show you which of them you have customized, and which of them are still set to the default values. Here is the example shown in Oracle’s documentation:

RMAN> show all;

RMAN configuration parameters are:
CONFIGURE RETENTION POLICY TO REDUNDANCY 1; # default
CONFIGURE BACKUP OPTIMIZATION OFF; # default
CONFIGURE DEFAULT DEVICE TYPE TO DISK; # default
CONFIGURE CONTROLFILE AUTOBACKUP OFF; # default
CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE DISK TO '%F'; # default
CONFIGURE DEVICE TYPE DISK PARALLELISM 1 BACKUP TYPE TO BACKUPSET; # default
CONFIGURE DATAFILE BACKUP COPIES FOR DEVICE TYPE DISK TO 1; # default
CONFIGURE ARCHIVELOG BACKUP COPIES FOR DEVICE TYPE DISK TO 1; # default
CONFIGURE MAXSETSIZE TO UNLIMITED; # default
CONFIGURE ENCRYPTION FOR DATABASE OFF; # default
CONFIGURE ENCRYPTION ALGORITHM 'AES128'; # default
CONFIGURE ARCHIVELOG DELETION POLICY TO NONE; # default
CONFIGURE SNAPSHOT CONTROLFILE NAME TO
'/oracle/product/10.2.0/db_1/dbs/snapcf_orcl.f'; # default

Using persistent parameters greatly simplifies your rman scripts. To create or modify these parameters, simply connect to rman and issue commands like those shown here:

RMAN> configure backup optimization on;
RMAN> configure device type disk backup type to compressed backupset;
RMAN> configure controlfile autobackup on;

Create rman scripts

You can create rman scripts to back up an entire database, a tablespace, a datafile, a control file, or an archive log. These scripts can be stored in the filesystem, or they can be stored in the rman catalog, allowing all instances using that catalog to use the scripts. Once stored there, they can be executed by connecting to the catalog and issuing the rman command run script_name.
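As a hedged sketch of the stored-script workflow (nightly_full is a hypothetical script name; check the rman documentation for the full syntax), creating and then running a stored script looks something like this:

```
RMAN> create script nightly_full {
        backup incremental level 0 plus archivelog;
      }
RMAN> run { execute script nightly_full; }
```

Because the script lives in the catalog, any instance that connects to that catalog can run it with the same two commands.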

For space reasons, we’ll cover only two very basic rman scripts that perform a full (level 0) and incremental (level 1) backup of the entire instance, as well as switching logfiles and backing up any archived redo logs that were created during the backup. A full or level 0 backup backs up every byte of the database, whereas an incremental or level 1 backup backs up only those bytes that have changed since the last backup. These scripts assume that you have controlfile autobackup set to on.

Here is the rman command file used to perform a full backup. It creates a full (level 0) backup of the database to the default device. The plus archivelog option causes it to switch logfiles (using archivelog current) before the backup, back up any archived logs not backed up (if backup optimization is set to on), then back up the database, followed by another logfile switch and a backup of any redo logs that were created during the backup:

connect target userid/passwd@oracle_sid;
connect catalog userid/passwd@rman_sid;
backup incremental level 0 plus archivelog;

Warning

Oracle’s documentation tells you to enter the rman password on the command line. This makes it available to anyone who can run ps -ef. The following scripts avoid that by entering the passwords into the script itself instead of passing them on the command line. If you follow this method, make sure you make the scripts readable only by the oracle user.

Substitute the appropriate values for userid, password, and the SIDs for the target and catalog databases. (The target database is the one to be backed up.)
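If you would rather keep the passwords out of the script file entirely, one alternative (a sketch, not Oracle’s documented method; the file names and connect strings below are examples) is to keep the credentials in a separate owner-only file and generate the command file at run time:

```shell
# Sketch: build the rman command file at run time from a credentials file
# readable only by its owner, so passwords never appear on a command line.
# File names and connect strings are examples.
set -e
umask 077                               # files we create are owner-only
CREDS=rman.creds                        # line 1: target connect, line 2: catalog connect
CMDFILE=database.full.rman

# Demo credentials so this sketch is self-contained; in real use, the file
# already exists and holds your real userid/passwd@sid strings.
[ -f "$CREDS" ] || printf 'userid/passwd@oracle_sid\nuserid/passwd@rman_sid\n' > "$CREDS"

{
  printf 'connect target %s;\n'  "$(sed -n 1p "$CREDS")"
  printf 'connect catalog %s;\n' "$(sed -n 2p "$CREDS")"
  printf 'backup incremental level 0 plus archivelog;\n'
} > "$CMDFILE"

# rman cmdfile="$CMDFILE"              # the real invocation, omitted in this sketch
```

The generated command file is identical to the one shown previously; the only difference is where the passwords live.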

The database.inc.rman command file (level 1 backups)

This is an incremental companion to the full script shown previously. It performs an incremental backup of those bytes that have changed in the database since the level 0 backup was taken:

connect target userid/passwd@oracle_sid;
connect catalog userid/passwd@rman_sid;
backup incremental level 1 plus archivelog;

Tip

Long-time users of rman should note the missing allocate channel commands, as well as the missing alter system archive log current command and archive log backup at the end. The allocate channel commands are handled by the persistent parameters; the log switch and archive log backup are handled by the plus archivelog argument.

Flashback

Oracle 10g introduced a wonderful set of features collectively referred to as flashback. Some use the flash recovery area that was discussed in the architecture section, and others use the new recycle bin or the undo tablespace. Where physical backups (the focus of this chapter) are mainly designed to recover from equipment and disk failures, flashback offers a number of interesting ways to recover from human error. Here is a quick summary of these features. Be sure to consult your documentation on how to use flashback.

Flashback query

This allows a user to query data at a previous point in time in order to reconstruct data that was lost or deleted by accident. The data needed for this feature is stored in the undo tablespace.

Flashback version query

Allows a DBA to view changes to the rows of a table over time. The data needed for this feature is stored in the undo tablespace.

Flashback transaction query

Allows a DBA to view changes to the database at the transactional level. The data needed for this feature is stored in the undo tablespace.

Flashback database

This is a CDP-style recovery of Oracle. It allows you to simply roll Oracle back in time to just before a major logical corruption error happened, such as the drop of a table. The data needed for this feature is stored in the flash recovery area.

Flashback table

Allows the DBA to recover a table to an earlier point in time. The data needed for this feature is stored in the undo tablespace.

Flashback drop

Provides “undrop” protection when dropping tables in Oracle. The data needed for this feature is stored in the Oracle recycle bin.
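To give a flavor of the syntax, here are hedged examples of two of these features (the emp table is hypothetical; consult your documentation for the full syntax and prerequisites):

```sql
-- Flashback query: view emp as it looked 15 minutes ago (uses the undo tablespace)
SELECT * FROM emp AS OF TIMESTAMP (SYSTIMESTAMP - INTERVAL '15' MINUTE);

-- Flashback drop: recover a dropped table from the recycle bin
FLASHBACK TABLE emp TO BEFORE DROP;
```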

Other Commercial Backup Methods

There is at least one commercial backup tool that backs up Oracle databases without using rman. It uses the same tools your scripts would use (e.g., begin backup/end backup), so it suffers many of the same limitations that such scripts would have. One such product is BMC’s SQL Backtrack. Originally designed for Sybase, this product has been ported to Oracle, Informix, and SQL Server. At this writing, SQL Backtrack does not do volume management. It can, however, interface with some commercial storage managers, allowing them to provide volume management.

Managing the Archived Redo Logs

How common is the question, “Should I have archiving turned on?” Yes, yes, a thousand times yes! When in doubt, archive it out! Here’s what is possible only if archiving is enabled:

  • Recovery up to the point of failure

  • Recovery from a backup that is a month or more old—if all the archived redo logs since then are available

  • Performance of a complete backup of the database without even shutting it down

The existence of archive logs does all this without adding significant overhead to the entire process. The only difference between having archiving on or off is whether or not Oracle copies the current redo log out to disk when it “switches” from one redo log to the next. That’s because even with archiving off, it still logs every transaction in the online redo logs. This means that the only overhead associated with archiving is the overhead associated with copying the online file to the archive location, which is why there may be only a 1 to 3 percent performance hit in an environment with many transactions—if there is one at all. Feel free to experiment, but it is very difficult to justify turning off archiving on any production database.
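If archiving is currently off, enabling it in 10g is a short procedure. A sketch of the sqlplus session, run as sysdba:

```sql
shutdown immediate;
startup mount;
alter database archivelog;   -- enable archiving of the online redo logs
alter database open;
archive log list;            -- should now report "Archive Mode"
```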

Warning

Remember that the archived redo log destination must not be allowed to fill up; if it does, the production database stops operating.

In my opinion, there are only two environments in which turning off archiving is acceptable. The first is an environment in which the data does not matter. What type of environment would that be? The only one is a true test environment that is using fake data or data restored from production volumes. No structure changes are being made to this database, and any changes made to the data are discarded. This database does not need archiving and probably doesn’t even need to be backed up at all. (Did I just write that?) It should be mentioned, though, that if you’re benchmarking a database that will go into production, backup and archiving should be running because testing the application without archiving running does not give realistic results. Tests with archiving running are more realistic—even if all the archive logs are deleted as soon as they are made.

Development databases do not fall into this category because, although the data in a development database may be unimportant, the structure of the database often is highly important. If archiving is off, a DBA cannot restore any development work that he has done since the last backup. That creates the opportunity to lose hours or even days worth of work, just so a development database can be 1 to 3 percent faster. That is a big risk for such a small gain.

The second type of database that doesn’t need archive logs is a completely read-only database or a “partially read-only” database where an archive log restore would be slower than a reload of the original data. The emergence of the data warehouse has created this scenario. There are now some databases that have completely read-only tablespaces and never have data loaded into them. This type of database can be backed up once and then left alone until it changes again.

A partially read-only database is one that stays read only for long periods of time and is updated by a batch process that runs nightly, weekly, or even as needed. The idea is that, instead of saving hundreds of redo logs, the database would be restored from a backup that was taken before the load. The DBA then could redo the load. There are two choices in this scenario. The first is to turn off archiving, making sure that there is a good cold backup after each database load. If the load aborted or a disk crashed after the load but before the next backup, you can simply load the older backup and then redo the load. The cold backup costs some downtime, but having archiving off speeds up the loads somewhat. The other option would be to turn on archiving. That allows taking a hot backup anytime and creates the option of using the redo logs to reload the data instead of doing an actual data reload. This method allows for greater backup flexibility. However, depending on the database and the type of data, an archive log restore could take longer than a reload of the original data—especially if it is a multithreaded load. It is a tradeoff of performance for recoverability. Test both ways to see which one works best for you.

Recovering Oracle

Since an Oracle database consists of several interrelated parts, recovering such a database is done through a process of elimination. Identify which pieces work, then recover the pieces that don’t work. The following recovery guide follows that logic and works regardless of the chosen backup method. It consists of a number of related steps that can be followed in order. If you’re performing user-managed backups, you’ll have to perform much of the logic yourself. If you’re using rman, you’ll be able to skip right through a lot of these steps.

Using This Recovery Guide

The process presented in the following sections for recovering an Oracle database assumes nothing. Specifically, it does not assume that you know the cause of the database failure. By following these steps you’ll work through a series of tasks that determine which part(s) of the database are no longer functional. You can then bring the database up as soon as possible while allowing recovery of the pieces that are damaged. (“Damaged” may mean that a file is either missing or corrupted.)

Start with Step 1. If it succeeds, it directs you to Step 10. If the “startup mount” fails, it directs you to Step 2. Each of the steps follows a similar pattern, directing you to the appropriate step following the failure or success of the current step. Assume that prior to any SQL commands, we issued the following commands:

$ sqlplus /nolog
SQL> connect / as sysdba;

Prior to running any rman commands, we connected to the appropriate target database and recovery catalog (if we’re using one). Consult your manual for the dozens of different ways you can do that. Our examples didn’t use a recovery catalog, and we were logging into the target database using OS-level authentication, so we simply had to run the following command:

$ rman target /

For each sqlplus or rman command, I describe the state the database needs to be in prior to running the command. For example, the paragraph will say something like, “issue the following SQL command against a mounted, closed database.”

For more detailed information about individual steps, please consult Oracle’s documentation, especially the Backup and Recovery Basics documentation.

These steps can be used on a Windows machine as well as a Unix machine. Differences are noted when necessary.

Warning

Before you begin any recovery, it’s wise to start a support ticket with Oracle. They can be very helpful if things get difficult.

Seriously, Think About rman

You can see that the restore and recovery of an Oracle database can be a very complicated process. If you’re not yet using rman, now is the time to think about it. As long as you still have your control files, this 26-step recovery process would be reduced to the following 6 rman commands:

connect target......
connect catalog.....
startup mount;
restore database;
recover database;
alter database open;

rman automatically figures out what’s broken, restores it, then applies media recovery against it.

Step 1: Try Startup Mount

The first step in verifying the condition of an Oracle database is attempting to mount it. This works because mounting a database (without opening it) reads the control files but does not open the datafiles. If the control files are multiplexed/mirrored, Oracle attempts to open each of the control files listed in the parameter file. If any of them are damaged, the mount fails.

To mount an unmounted database, simply run sqlplus /nolog, connect to the database, and enter startup mount:

SQL> startup mount;

If it succeeds, the output looks something like this:

SQL> startup mount;
ORACLE instance started.
Total System Global Area       5130648 bytes
Fixed Size                     44924 bytes
Variable Size                  4151836 bytes
Database Buffers               409600 bytes
Redo Buffers                   524288 bytes
Database mounted.

If the attempt to mount the database fails, the output looks something like this:

SQL> startup mount;
Total System Global Area    5130648 bytes
Fixed Size                  44924 bytes
Variable Size               4151836 bytes
Database Buffers            409600 bytes
Redo Buffers                524288 bytes
ORACLE instance started.
ORA-00205: error in identifying controlfile, check alert log for more info

Tip

If the attempt to mount the database succeeds, proceed to Step 10. If it fails, proceed to Step 2.

Step 2: Are All Control Files Missing?

Don’t panic if the attempt to mount the database fails. Control files are easily restored if they were multiplexed/mirrored and can even be rebuilt from scratch if necessary. The first important piece of information is that one or more control files are missing.

Unfortunately, since Oracle aborts the mount at the first failure it encounters, it could be missing one, two, or all of the control files, but so far you know only about the first missing file. So, before embarking on a course of action, determine the severity of the problem by doing a little research.

First, determine the names of all of the control files. You can do that by querying v$parameter, which is a fixed table available from an unmounted (or mounted) database.

SQL> select value from v$parameter where name like 'control_files';
/db/Oracle/a/oradata/crash/control01.ctl,
/db/Oracle/b/oradata/crash/control02.ctl,
/db/Oracle/c/oradata/crash/control03.ctl

Or on Windows, it might look like this:

SQL> select value from v$parameter where name like 'control_files';
D:\ORACLE\ORADATA\CRASH\CONTROL01.CTL,
E:\ORACLE\ORADATA\CRASH\CONTROL02.CTL,
F:\ORACLE\ORADATA\CRASH\CONTROL03.CTL

It’s also important to get the name of the control file that Oracle is complaining about. Find this by looking for the phrase controlfile: in the alert log. (The alert log can be found in the location specified by the background_dump_dest value, also available in v$parameter; use the same query technique shown previously to retrieve it.) Typically, the alert log is in the <ORACLE_BASE>/admin/<ORACLE_SID>/bdump directory. In that directory, there should be a file called alert_<ORACLE_SID>.log. In that file, there should be an error that looks something like this:

Sun Jul 23 18:53:19 PDT 2006 alter database mount exclusive
Sun Jul 23 18:53:19 PDT 2006 ORA-00202: controlfile:
'/db/a/oradata/crash/control01.ctl'
ORA-27037: unable to obtain file status
SVR4 Error: 2: No such file or directory

Warning

Some of the following procedures may say to override a potentially corrupted control file. Since you never know which file may be needed, always make backup copies of all the control files before doing any of this. This procedure offers an “undo” option that isn’t possible otherwise. (Also make copies of the online redo logs as well.)

With the names of all of the control files and the name of the damaged file, it’s easy to determine the severity of the problem. Do this by listing each of the control files and comparing their size and modification time. (Remember the game “One of these is not like the others” on Sesame Street?) The following scenarios assume that the control files were multiplexed/mirrored to three locations, which is a very common practice. The possible scenarios are:

The damaged file is missing or corrupted, and at least one other file is present

If the file that Oracle is complaining about is just missing, that’s an easy thing to fix. If this is the case, proceed to Step 3.

All control files are missing

If all the control files are corrupt or missing, they must be rebuilt or the entire database must be restored. Hopefully your backup system has been running the backup controlfile to trace command on a regular basis. (The output of this command is a SQL script that rebuilds the control files automatically.) Proceed to Steps 4 through 7.

Step 3: Replace Missing Control File

Tip

If you don’t have the output of a backup control file to trace, you can still build a recreate control file script using a process documented by Oracle. Support will be very helpful here.

Please note the careful choice of words here. We are not going to restore the control file, which would imply that we’re going to get it back from backups. Remember that restoring the control file forces us to perform an alter database open resetlogs, which we would like to avoid if we can. We are going to replace it by copying one of the other copies that we multiplexed/mirrored it to. Let’s see if we can do that before we actually restore one from backup.

Tip

This step applies to both user-managed backups and rman backups. It’s always better to repair the control file from a multiplexed/mirrored copy than to restore it from backup.

If you’re able to identify that at least one of the multiplexed/mirrored copies of the control file is good, then this part is easy. Simply copy another one of the multiplexed/mirrored copies of the control file to the damaged control file’s name and location. (The details of this procedure follow.) Once this is done, you can attempt to mount the database again.

Warning

Be sure to make backup copies of all of the control files before trying this procedure!

The first thing to do is to get the name of the damaged control file. Again, this is relatively easy. Look in the alert log for a section like this one:

Sun Jul 23 18:53:19 PDT 2006 alter database mount exclusive
Sun Jul 23 18:53:19 PDT 2006 ORA-00202: controlfile:
'/db/a/oradata/crash/control01.ctl'
ORA-27037: unable to obtain file status
SVR4 Error: 2: No such file or directory

The next step would be to copy a known good control file to the damaged control file’s location.

“But I don’t have a good control file!”

It’s possible that there is no known copy of the current control file. If this is the case and you weren’t using your control file to store rman backup history, it’s probably best to attempt to use the create controlfile script. If you have rman backup history in your control file, it’s probably best to actually restore your control file from backup.

To use the create controlfile script, proceed to Steps 4 through 7. To restore your control file from backup, proceed to Step 8.

If neither of those options is workable, try the following procedure. First, make backups of all the control files. Then, one at a time, try copying every version of each control file to all the other locations—excluding the one that Oracle has already complained about because it’s obviously damaged.

Tip

Each time a new control file is copied to multiple locations, return to Step 1.

For example, assume there are three control files: /a/control1.ctl, /b/control2.ctl, and /c/control3.ctl. The alert log says that the /c/control3.ctl is damaged, and because /a/control1.ctl and /b/control2.ctl have different modification times, there’s no way to know which one is good. Try the following steps:

First, make backup copies of all the files:

$ cp /a/control1.ctl /a/control1.ctl.sav
$ cp /b/control2.ctl /b/control2.ctl.sav
$ cp /c/control3.ctl /c/control3.ctl.sav

Or, for a Windows system, run the following commands:

C:\> copy C:\CONTROL01.CTL C:\CONTROL01.SAV
C:\> copy D:\CONTROL02.CTL D:\CONTROL02.SAV
C:\> copy E:\CONTROL03.CTL E:\CONTROL03.SAV

Second, try copying one file to all locations. No sense in copying from control3.ctl because it’s obviously damaged. Try starting with control1.ctl:

$ cp /a/control1.ctl /b/control2.ctl
$ cp /a/control1.ctl /c/control3.ctl

Or, for a Windows system, run the following commands:

C:\> copy C:\CONTROL01.CTL D:\CONTROL02.CTL
C:\> copy C:\CONTROL01.CTL E:\CONTROL03.CTL

Now attempt a startup mount:

SQL> startup mount
Sun Jul 23 18:53:19 PDT 2006 alter database mount exclusive
Sun Jul 23 18:53:19 PDT 2006 ORA-00202: controlfile: '/a/control1.ctl'
ORA-27037: unable to obtain file status

This error says that the file that was copied to all locations is also damaged. Now try the second file, control2.ctl. This time we have to copy from backup since we overwrote the original with the last step.

$ cp /b/control2.ctl.sav /a/control1.ctl
$ cp /b/control2.ctl.sav /c/control3.ctl

Or, for a Windows system, run the following commands:

C:\> copy D:\CONTROL02.SAV C:\CONTROL01.CTL
C:\> copy D:\CONTROL02.SAV E:\CONTROL03.CTL

Now attempt to do a startup mount:

SQL> startup mount;
ORACLE instance started.
Total System Global Area            5130648 bytes
Fixed Size                     44924 bytes
Variable Size                  4151836 bytes
Database Buffers                 409600 bytes
Redo Buffers                   524288 bytes
Database mounted.

It appears that control2.ctl was a good copy of the control file.
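The trial-and-error copying above can often be shortened by checksumming the copies first: multiplexed copies that were in sync should produce identical checksums, so the odd one out is the prime suspect. A minimal sketch using throwaway demo files (substitute your real control file paths):

```shell
# Sketch: find the odd control file out by checksum. The demo files below
# stand in for three multiplexed control files.
dir=$(mktemp -d)
printf 'ORACLECF' > "$dir/control1.ctl"   # in-sync copy
printf 'ORACLECF' > "$dir/control2.ctl"   # in-sync copy
printf 'DAMAGED'  > "$dir/control3.ctl"   # corrupted copy
cksum "$dir"/control*.ctl                 # matching copies share a checksum
```

This only narrows the field; still make backup copies of everything before overwriting any file.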

Tip

If you were able to perform the startup mount successfully, proceed to Step 10. If you were unable to identify a good copy of the control file, proceed to Step 4.

Step 4: Are All Datafiles and Redo Logs OK?

Tip

Steps 4 through 7 should be performed only after unsuccessfully performing Steps 1 through 3. The point of these steps is to get ready to rebuild the control file using the create controlfile script.

Steps 4 and 5 are required only prior to performing Step 6.

In 10g, Oracle enhanced the backup controlfile to trace output to include both a noresetlogs and a resetlogs option in the output. If you want to use the noresetlogs option, all datafiles and online redo logs must be intact. The datafiles can be older versions that were restored from backup because they will be rolled forward by the media recovery. The online redo logs must be current and intact for the noresetlogs version of the create controlfile script to work. If any online redo log group is completely damaged, you will have to use the resetlogs option.

Why? Because the rebuild process uses the online redo logs to identify the current SCN, then identifies it as the current SCN in the control file. It’s OK if one or more of the datafiles have older SCNs. But if a datafile shows that it has an SCN that is more recent than the available online redo logs, it indicates that we’re not dealing with the current online redo logs—the control file rebuild process aborts.

If it’s likely that one or more of the datafiles or online redo logs is damaged, go to Step 5. If it’s more likely that they are all intact, go to Step 6.

Step 5: Restore Damaged Datafiles or Redo Logs

If one or more of the datafiles or online redo logs are definitely damaged, follow all the instructions given here to see if there are any other damaged files. (A little extra effort now saves a lot of frustration later.) If it’s possible that all the datafiles and online redo logs are OK, another option is to skip this step and try to recreate the control file now. (An unsuccessful attempt at this does not cause any harm.) If it fails, return to this step. If there is plenty of time, go ahead and perform this step first. To try to recreate the control files now, proceed to Step 6.

The first thing to find out is where all the datafiles and redo logs are. To determine this, run the following commands on the mounted, closed database.

SQL> select name from v$datafile;
/db/Oracle/a/oradata/crash/system01.dbf
/db/Oracle/a/oradata/crash/rbs01.dbf
/db/Oracle/a/oradata/crash/temp01.dbf
/db/Oracle/a/oradata/crash/tools01.dbf
/db/Oracle/a/oradata/crash/users01.dbf
/db/Oracle/a/oradata/crash/test01.dbf
SQL> select group#, member from v$logfile;
1 /db/Oracle/a/oradata/crash/redocrash01.log
3 /db/Oracle/c/oradata/crash/redocrash03.log
2 /db/Oracle/b/oradata/crash/redocrash02.log
1 /db/Oracle/b/oradata/crash/redocrash01.log
2 /db/Oracle/a/oradata/crash/redocrash03.log
3 /db/Oracle/c/oradata/crash/redocrash02.log

The Windows output of these commands would look very similar, with different paths, of course.

Let’s check each of the files shown by the preceding command. First, look at the datafiles. Most of the datafiles probably have the same modification time, or there might be a group of them with one modification time and another group with a different modification time. If there are read-only tablespaces, there may be a few datafiles with modification times a lot older than the others. That’s OK. The main thing to look for is a missing file or a zero-length file. Something else to look for is one or more files that have a modification time that is newer than the newest online redo logfile. If a datafile meets any one of these conditions, it must be restored from backup.
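Those checks can be scripted. Here is a hedged sketch that flags missing or zero-length files from a list; the demo directory and file names stand in for the paths reported by v$datafile:

```shell
# Sketch: flag missing or zero-length datafiles from a list of paths.
dfdir=$(mktemp -d)
printf 'data' > "$dfdir/system01.dbf"     # healthy file
: > "$dfdir/users01.dbf"                  # zero-length file (suspect)
for f in "$dfdir/system01.dbf" "$dfdir/users01.dbf" "$dfdir/test01.dbf"; do
  if   [ ! -f "$f" ]; then echo "MISSING: $f"    # not there at all
  elif [ ! -s "$f" ]; then echo "EMPTY:   $f"    # present but zero bytes
  fi
done
```

Comparing modification times against the newest online redo log (for example, with find -newer) is a similarly mechanical check.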

Redo logfiles, however, are a little different. Each redo logfile within a log group should have the same modification time. For example, the output of the preceding example command shows that /db/Oracle/a/oradata/crash/redocrash01.log and /db/Oracle/b/oradata/crash/redocrash01.log are in log group 1. They should have the same modification time and size. The same should be true for groups 2 and 3. There are a couple of possible scenarios:

One or more log groups has at least one good and one damaged log.

This is why redo logs are multiplexed/mirrored! Copy the good redo log to the location of the damaged redo log. For example, if /db/Oracle/b/oradata/crash/redocrash01.log is missing, but /db/Oracle/a/oradata/crash/redocrash01.log is intact, issue the following command:

$ cp /db/Oracle/a/oradata/crash/redocrash01.log \
  /db/Oracle/b/oradata/crash/redocrash01.log

Or, if the errors showed that D:\ORACLE\ORADATA\CRASH\REDOCRASH01.LOG was OK, but E:\ORACLE\ORADATA\CRASH\REDOCRASH01.LOG was missing or corrupt:

C:\> COPY D:\ORACLE\ORADATA\CRASH\REDOCRASH01.LOG E:\ORACLE\ORADATA\CRASH\REDOCRASH01.LOG

All redo logs in at least one log group are damaged.

This is a bad place to be. The create controlfile script in Step 6 requires that all online redo logs be present to use the noresetlogs option. If even one log group is completely damaged, you will have to use the resetlogs portion of the script.

If all datafiles are available, but all the control files are missing, proceed to Step 6.

If the database will not open for some other reason, proceed to Step 10.

Step 6: Is There a “Backup to Trace” of the Control File?

Tip

Steps 4 and 5 must be completed prior to this step.

You’ll need the output from the sqlplus command alter database backup controlfile to trace. It creates a trace file that contains two create controlfile scripts. This command should be run (via cron on a Unix/Linux system, a Windows scheduled task, or the DBMS scheduler) on a regular basis. To find out whether such a script is available, follow these instructions. The first thing to find out is the destination of the trace files. This is specified by the user_dump_dest value in the v$parameter table. (Typically, user_dump_dest is set to $ORACLE_BASE/admin/$ORACLE_SID/udump.) First cd to that directory, then use grep on Unix or find on Windows to look for the phrase CREATE CONTROLFILE, as shown in Example 16-1.

Example 16-1. Locating the most recent create controlfile script
SQL> select value from v$parameter where name like 'user_dump_dest';
/db/Oracle/admin/crash/udump
$ cd /db/Oracle/admin/crash/udump ; grep 'CREATE CONTROLFILE' * \
  | awk -F: '{print $1}' | xargs ls -ltr
-rw-r-----  1 Oracle  dba     3399 Oct 26 11:25 crash_ora_617.trc
-rw-r-----  1 Oracle  dba     3399 Oct 26 11:25 crash_ora_617.trc
-rw-r-----  1 Oracle  dba     1179 Oct 26 11:29 crash_ora_661.trc

On Windows, try these commands:

D:\> cd D:\Oracle\admin\crash\udump
D:\> type *.TRC | find "CREATE CONTROLFILE"

In Example 16-1, crash_ora_661.trc is the most recent file to contain the create controlfile script.
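Because this step depends on a reasonably fresh trace file, the backup controlfile to trace command is worth scheduling. A sketch crontab entry (the schedule and the reliance on OS-level authentication are assumptions for illustration):

```
# run nightly at 1:00 as the oracle user
0 1 * * * echo "alter database backup controlfile to trace;" | sqlplus -s "/ as sysdba"
```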

If there is a create controlfile script, proceed to Step 7. If there is not a create controlfile script, and all the control files are missing, proceed to Step 8.

Step 7: Run the create controlfile Script

First, find the trace file that contains the script. The instructions on how to do that are in Step 6. Once you find it, copy it to another filename, such as rebuild.sql. If your online redo logs are intact, you can use the noresetlogs section; if any online redo log group is completely damaged, you’ll need to use the resetlogs section. Take a look at the output, and select the appropriate section, deleting all other comments before and after the section.

The file then should look something like the one in Example 16-2.

Example 16-2. Example create controlfile script
STARTUP NOMOUNT
CREATE CONTROLFILE REUSE DATABASE "ORCL" NORESETLOGS ARCHIVELOG
  MAXLOGFILES 16
  MAXLOGMEMBERS 3
  MAXDATAFILES 100
  MAXINSTANCES 8
  MAXLOGHISTORY 292
LOGFILE
 GROUP 1 '/oracle/product/10.2.0/oradata/orcl/redo01.log' SIZE 50M,
 GROUP 2 '/oracle/product/10.2.0/oradata/orcl/redo02.log' SIZE 50M,
 GROUP 3 '/oracle/product/10.2.0/oradata/orcl/redo03.log' SIZE 50M
DATAFILE
 '/oracle/product/10.2.0/oradata/orcl/system01.dbf',
 '/oracle/product/10.2.0/oradata/orcl/undotbs01.dbf',
 '/oracle/product/10.2.0/oradata/orcl/sysaux01.dbf',
 '/oracle/product/10.2.0/oradata/orcl/users01.dbf',
 '/oracle/product/10.2.0/oradata/orcl/example01.dbf'
CHARACTER SET WE8ISO8859P1
;
RECOVER DATABASE
ALTER SYSTEM ARCHIVE LOG ALL;
ALTER DATABASE OPEN;
ALTER TABLESPACE TEMP ADD TEMPFILE '/oracle/product/10.2.0/oradata/orcl/temp01.dbf'
   SIZE 20971520 REUSE AUTOEXTEND ON NEXT 655360 MAXSIZE 32767M;

Once the file looks like Example 16-2, add the following line just above the STARTUP NOMOUNT line:

CONNECT /AS SYSDBA;

After you add this line, run the following command on the idle instance, substituting rebuild.sql with the appropriate name:

$ sqlplus /nolog < rebuild.sql

This should run without intervention, completely rebuild the control files, recover the database, and open it.

Tip

If this worked and you used the resetlogs option, it would be best to back up the database as soon as possible.

If this didn’t work, you can try Step 8. Most likely you will end up at Steps 21 and 23, though.

Step 8: Restore Control Files and Prepare the Database for Recovery

Tip

This step is required only if Steps 2 through 7 have failed.

If the precautions mentioned elsewhere in this chapter were followed, there is really only one scenario that would result in this position—loss of the entire system due to a cataclysmic event. Loss of a disk drive (or even multiple disk drives) is easily handled if the control files are multiplexed/mirrored. Even if all control files are lost, they can be rebuilt using the trace file created by running the backup controlfile to trace command.

Follow the next steps, starting with restoring the control files from backup. Chances are that the database files need to be restored as well. This is because it is difficult to use a control file that is older than the most recent database file. (Oracle complains and aborts if this happens.) To find out if the control file is newer than the datafiles, try the following steps without overwriting the database files and see what happens.

1) Restore control files from backup

The very first step in this process is to find and restore the most recent backup of the control file.

If you’re using rman with a catalog and you’re backing up the control file explicitly, you can use the restore controlfile command:

RMAN> restore controlfile

If you’ve set controlfile autobackup to on, connect to rman, and issue the restore controlfile from autobackup command:

RMAN> restore controlfile from autobackup

If you’re performing user-managed backups, you need to restore the control file that you previously created with the backup controlfile to filename command in sqlplus. Locate the backed-up control file, and copy it to all of the locations and filenames shown when running select value from v$parameter where name = 'control_files'.

Again, this backup control file should be more recent than the most recent database file in the instance. If this isn’t the case, Oracle will complain.
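
For user-managed restores, the copy itself can be scripted. The following sketch is hypothetical: the restored file and the destination list are placeholders, and you would substitute the paths returned by the v$parameter query above. It creates a stand-in file so the loop can be demonstrated:

```shell
#!/bin/sh
# Hypothetical sketch: copy one restored control file to every location
# named by the control_files parameter. The restored file and the
# destination list below are placeholders; substitute the output of
#   select value from v$parameter where name = 'control_files';
RESTORED=/tmp/restored_control.ctl
DESTS="/tmp/ctl_demo/control01.ctl /tmp/ctl_demo/control02.ctl"

mkdir -p /tmp/ctl_demo
# create a stand-in "restored" control file for the demo
echo "stand-in control file for the demo" > "$RESTORED"

for dest in $DESTS; do
    cp "$RESTORED" "$dest"
done
ls /tmp/ctl_demo
```

The point of the loop is simply that every location listed in control_files must receive an identical copy; a single missed destination causes the mount to fail.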

2) Start up mount

To find out if the control file is valid and has been copied to all the correct locations, attempt to start up the database with the mount option. (This is the same command from Step 1.) If you’re running rman, you can simply issue this command after your restore controlfile command:

RMAN> startup mount

If you’re performing user-managed backups, run the following command on the idle instance:

SQL> startup mount;

Once this step has been completed, proceed to Step 9.

Step 9: Recover the Database

Tip

This step is required only if Steps 2 through 7 have failed.

Once the control file is restored with a backup copy, attempt to recover the database using the backup control file.

Recover and open the database with rman

If you’re using rman, this section is easy. Just issue recover database as your next command. This single command handles recovering the database even if you’re using a backup control file.

RMAN> recover database;

Here’s an area where rman really shines. If recovery requires applying archived redo logfiles that have been backed up, rman automatically restores them from the appropriate location, applies them, then deletes them. The entire process occurs automatically, regardless of where the archived redo logs have to come from.

If you did not use a backup control file, you just need to open the database:

RMAN> alter database open;

If you did use a backup control file, you need to open the database with the resetlogs option:

RMAN> alter database open resetlogs;

Attempt to recover database manually

Since recovering the database with a backup control file requires the alter database open resetlogs option, it never hurts to try recovering the database normally first:

SQL> recover database;

If the backup control file option is required, Oracle will complain:

ORA-00283: Recover session cancelled due to errors
...
ORA-01207: file is more recent than controlfile - old controlfile

Tip

If the recover database command works, proceed to Step 10. If it doesn’t, attempt to recover the database using the backup control file, as described next.

If Oracle complains, you need to recover the database using the backup controlfile option. Attempt to recover the database using the following command on the mounted, closed database:

SQL> recover database using backup controlfile

If it works, the output will look something like Example 16-3.

Example 16-3. Sample output of recover database command
ORA-00279: change 38666 generated at 03/14/06 21:19:05 needed for thread 1
ORA-00289: suggestion : /db/Oracle/admin/crash/arch/arch.log1_494.dbf
ORA-00280: change 38666 for thread 1 is in sequence #494

Tip

If Oracle complains, there are probably some missing or corrupted datafiles. If there are, return to Steps 4 and 5. Once any missing or corrupted datafiles are restored, return to this step and attempt to recover the database again.

Sometimes you can be trapped in a Catch-22 when you’re recovering databases and Oracle complains about datafiles being newer than the control file. The only way to get around this is to use a backup version of the datafiles that is older than the backup version of the control file. If this is the case, proceed to Steps 21 and 23. Media recovery rolls forward any changes that this older file is missing.

Oracle requests all archived redo logs since the time of the oldest restored datafile. For example, if the backup that was used to restore the datafiles was from three days ago, Oracle needs all archived redo logs created since then, and the first logfile it asks for is the oldest one it needs.

If you’re performing user-managed backups, and you need older archived redo logs, you have to restore them yourself. If you have to find and apply the logs yourself, the most efficient way to roll through the archived redo logs is to have all of them sitting uncompressed in the directory that Oracle suggests as the location of the first file. (This, of course, requires much more storage than the rman method.) If this is the case, simply enter auto at the prompt. Otherwise, specify alternate locations, or press enter as it asks for each one, giving time to compress or remove the files that it no longer needs.
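
The bookkeeping of uncompressing only the logs Oracle still needs can be scripted. This sketch assumes a hypothetical archive directory, the arch.log1_<seq>.dbf naming convention shown in this chapter's examples, and gzip compression; it creates its own sample files purely for illustration:

```shell
#!/bin/sh
# Hypothetical sketch: uncompress every archived redo log whose sequence
# number is >= the first log Oracle asked for. The directory, filename
# pattern (arch.log1_<seq>.dbf.gz), and FIRST_SEQ are assumptions.
ARCHDIR=/tmp/arch_demo
FIRST_SEQ=494

mkdir -p "$ARCHDIR"
# create a few compressed stand-in logs for the demo
for seq in 492 493 494 495 496; do
    echo "redo $seq" > "$ARCHDIR/arch.log1_${seq}.dbf"
    gzip -f "$ARCHDIR/arch.log1_${seq}.dbf"
done

for f in "$ARCHDIR"/arch.log1_*.dbf.gz; do
    seq=${f##*_}          # strip up to the last underscore -> "494.dbf.gz"
    seq=${seq%%.*}        # strip the extensions -> "494"
    if [ "$seq" -ge "$FIRST_SEQ" ]; then
        gunzip -f "$f"
    fi
done
ls "$ARCHDIR"
```

With every needed log sitting uncompressed in the suggested directory, you can answer auto at the Specify log prompt and let Oracle roll through them.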

If it is able to do so, Oracle automatically rolls through all the archived redo logs and the online redo log. It then says Media recovery complete.

However, once Oracle rolls through all the archived redo logs, it may prompt for the online redo log. It does this by prompting for an archived redo log with a sequence number that is higher than the most recent archived redo log available. This means that it is looking for the online redo log. Try answering its prompt with the names of the online redo logfiles that you have. Unfortunately, as soon as you give it a name it doesn’t like, it makes you start the recover database using backup controlfile command again.

For example, suppose that you have the following three online redo logs:

/oracle/data/redolog01.dbf
/oracle/data/redolog02.dbf
/oracle/data/redolog03.dbf

When you are prompted for an archived redo log that has a higher sequence number than the highest-numbered archived redo log that you have, answer the prompt with one of these files (e.g., /oracle/data/redolog01.dbf ). If the file that you give it does not contain the sequence number it is looking for, you will see a message like the following:

ORA-00310: archived log contains sequence 2; sequence 3 required
ORA-00334: archive log: '/oracle/data/redolog01.dbf'

Oracle cancels the database recovery, requiring you to start it over. Once you get to the same prompt again, respond with a different filename, such as /oracle/data/redolog02.dbf. If it contains the sequence number it is looking for, it responds with a message like the following:

Log applied.
Media recovery complete.

If, after trying all the online redo logs, Oracle still asks for a log that you do not have, simply enter cancel.

Alter database open resetlogs

Once the media recovery is complete, the next step is to open the database. As mentioned earlier, when recovering the database using a backup control file, it must be opened with the resetlogs option. Do this by entering the following SQL command against the mounted, closed database:

SQL> alter database open resetlogs;

Later versions of Oracle allow you to create a control file with a resetlogs option. If this was done, you can simply open the database and do not need the resetlogs option here.

If you’re running a version of Oracle prior to 10g, you must take a backup immediately after recovering the database with the resetlogs option! Even if you’re running 10g, it would be good to take a backup as soon as you can. While 10g databases can recover through a resetlogs incident, without a fresh backup you would have to redo the restore and recovery you just performed.

It is best if it is a cold backup after shutting down the database, but a hot backup would be better than nothing. Just remember that if you allow the database to become operational, and something happens before you get it backed up again, there is a risk that:

  • The entire recovery might need to be performed again.

  • All changes made after using the resetlogs option are lost, if running a version prior to 10g.

Tip

If the database did not open successfully, return to Step 1 and start over.

If the database did open successfully, perform a backup of the entire database immediately—preferably a cold one. Congratulations! You’re done!

Step 10: Does “alter database open” Work?

Tip

If you shut down the database prior to this step, you’ll need to run startup mount again.

If the startup mount worked, this is actually only the second step that you perform. Mounting the database checks only the presence and consistency of the control files. If that works, opening the database is the next step. Doing so checks the presence and consistency of all datafiles, online redo logfiles, and any rollback segments or undo segments. To open the database, run the following command on the mounted, closed database:

SQL> alter database open;

If the attempt to open the database worked, Oracle simply says Database Altered. If this is the first attempt to open the database, and no datafiles or rollback segments were taken offline, you’re done!

Tip

If directed to this step by Step 23 (damaged log groups) and the attempt to open the database failed, return to Step 21 to recover the entire database.

If the database did open, proceed to Step 15.

Big shortcut for rman users

If you’re using rman, and the attempt to open the database did not work, you have a big choice to make. You can continue to follow this procedure to determine what’s wrong and fix things one at a time, or you can issue two commands and be done. If the latter idea sounds better, try these two commands:

RMAN> restore database;
RMAN> recover database;

User-managed backups: Read on

If you’re performing user-managed backups, or you’d like to perform the rest of these steps manually with rman, you now need to find out why the database wouldn’t open. The output varies depending on the condition. Here is a listing of what some of those conditions may be, accompanied by what the error might look like when that condition occurs:

Missing datafile
ORA-01157: cannot identify datafile 1 - file not found
ORA-01110: datafile 1: '/db/Oracle/a/oradata/crash/system01.dbf'
Corrupted datafile

A corrupted datafile can generate a number of different errors. For instance, it may mimic a missing datafile:

ORA-01157: cannot identify datafile 1 - file not found
ORA-01110: datafile 1: '/db/Oracle/a/oradata/crash/system01.dbf'

It also may completely confuse Oracle:

ORA-00600: internal error code, arguments: [kfhcfh1_01], [0], [], [], [],

A corrupted datafile also may cause a “failed verification check” error:

ORA-01122: database file 1 failed verification check
ORA-01110: datafile 1: '/db/Oracle/a/oradata/crash/system01.dbf'
ORA-01200: actual file size of 1279 is smaller than correct size of 40960 blocks

These are just a few examples of the types of errors that Oracle may give if a datafile is corrupted.

Missing member of any online log group

If the redo logs are multiplexed/mirrored and one or more of the multiplexed/mirrored copies is lost but at least one good copy of each online redo log remains, Oracle opens the database without any errors displayed to the terminal. I’m sure they consider this a feature, but it sure would be nice if it would at least complain. It’s great that it opens the database. I just wish that this potentially dangerous situation were a little more visible. The only error is a message like the following one in the alert log:

Errors in file /db/Oracle/admin/crash/bdump/crash_lgwr_10302.trc:
ORA-00313: open failed for members of log group 2 of thread 1
All members of any online log group are corrupted

However, if all members of any online log group are corrupted, Oracle will complain, and the database will not open. The error might look something like this:

ORA-00327: log 2 of thread 1, physical size less than needed
ORA-00312: online log 2 thread 1: '/db/Oracle/b/oradata/crash/redocrash02.log'
ORA-00312: online log 2 thread 1: '/db/Oracle/a/oradata/crash/redocrash03.log'
Missing all members of any online log group

A similar problem occurs if all members of an online log group are missing. Oracle will complain, and the database will not open. The error looks something like this:

ORA-00313: open failed for members of log group 2 of thread 1
ORA-00312: online log 2 thread 1: '/db/Oracle/b/oradata/crash/redocrash02.log'
ORA-00312: online log 2 thread 1: '/db/Oracle/a/oradata/crash/redocrash03.log'
Damaged undo tablespace
ORA-01122: database file 1 failed verification check
ORA-01110: datafile 1: '/oracle/product/10.2.0/oradata/orcl/undotbs01.dbf'
ORA-01200: actual file size of 1279 is smaller than correct size of 40960 blocks
Damaged rollback segment

If a rollback segment is damaged, the error looks like the following:

ORA-01545: rollback segment 'USERS_RS' specified not available

Cannot open database if all rollback segments are not available.

Damaged datafile

A damaged datafile is actually very easy to recover from. This is a good thing because this occurs more often than any other problem. Remember that there is only one copy of each datafile, unlike online redo logs and control files that can be multiplexed/mirrored. So, statistically speaking, it’s easier to lose one datafile than to lose all multiplexed/mirrored copies of a log group or all multiplexed/mirrored copies of the control file.

Oracle also can recover parts of the database while other parts of the database are brought online. Unfortunately, this helps only if a partially functioning database is of any use to the users in your environment. Therefore, a database that is completely worthless unless all tables are available will not benefit from the partial online restore feature. However, if the users can use one part of the database while the damaged files are being recovered, this feature may help to save face by allowing at least partial functionality during an outage. (Because doing an online (partial) restore actually makes more work for you, you should seriously investigate whether this option will help if you use it—before you actually have to use it. That way, when you need to do a large restore, that question will already be answered.)

There are four types of datafiles as far as recovery is concerned:

  • The first type of datafile in a recovery is a datafile that is not required to start the database. Historically, the only required tablespace was the system tablespace. In current versions, required tablespaces include the system, sysaux, and undo tablespaces. If a datafile is part of a tablespace that is not required to start the database, recovering this file (with the database online or offline) is very easy.

  • The second type is a datafile for the temporary tablespace that has been declared a tempfile. Such a restore is easy because Oracle simply recreates the tempfile for you when you open the database.

  • The third type of datafile is also a nonsystem datafile but one that happens to contain a rollback segment. Since rollback segments are needed to open the database, recovering such a file with the database online is difficult.

Tip

Again, you should migrate to undo tablespaces and undo segments as soon as you can.

The final type of datafile is a file contained within a required tablespace (system , sysaux, or undo tablespaces in 10g, and system in previous versions). This datafile cannot be recovered with the database online because the database cannot be brought online without it.

Damaged log group

If all members of a log group are damaged, there is great potential for data loss. The entire database may have to be restored, depending on the status of the log group that was damaged and the results of some attempts at fixing it. This may seem like a broken record, but this is why multiplexing/mirroring the log groups is so important.

If the error refers to a damaged log group, one option is to proceed directly to Step 17. However, to verify that nothing else is wrong, you should probably read the rest of this step and proceed to the next one.

Damaged required tablespace

If any datafiles belonging to required tablespaces are damaged, then you have to restore and recover them offline. The database cannot be opened without them.

If the error refers to a required tablespace, one option is to proceed directly to Step 11. However, to verify that nothing else is wrong, you should probably read the rest of this step and proceed to the next one.

Damaged rollback segment

Tip

If you’re still using rollback segments, you need to look at this part of the step. If you’re using automatic undo management, it does not apply to you.

Since Oracle has to open the datafiles that contain this rollback segment before it can verify that the rollback segment is available, this error will not occur unless a datafile has been taken offline. If Oracle encounters a damaged datafile (whether or not it contains a rollback segment), it will complain about that datafile and abort the attempt to open the database.

Remember that a rollback segment is a special part of a tablespace that stores rollback information. Rollback information is needed in order to undo (or roll back) an uncommitted transaction. Since a crashed database almost always contains uncommitted transactions, recovering a database with a damaged rollback segment is a little tricky. As previously mentioned, a damaged datafile may be taken offline, but Oracle will not open the database without the rollback segment.

If the error indicates that there is a damaged rollback segment, proceed to Step 18.

Before going any farther...

Remember that Oracle will stop attempting to open the database as soon as it encounters an error with one file. This means, of course, that there could be other damaged files. If there is at least one damaged datafile, now is a good time to see whether other files are damaged. Detailed instructions on how to do that are provided in Step 5.

Once you know the names of all the damaged files, you can recover them as described next.

How media recovery works

If any datafiles are restored from backup, the sqlplus or rman recover command is needed. These commands use the archived and online redo logs to redo any transactions that have occurred since the time that the backup of a datafile was taken. You can recover a complete database, a tablespace, or a datafile by issuing the commands recover database, recover tablespace tablespace_name, or recover datafile data_file_name, respectively.

With rman or user-managed backups, you can simply issue the command recover database. Both automatically figure out what needs to have media recovery applied to it and perform the appropriate recovery. The difference between rman and user-managed backups comes when you need an archived log that is available only on backup. If rman needs to obtain any archived redo logs from backup, it automates restoring them and deleting them as necessary. If you’re performing user-managed backups, you need to perform this step yourself.

You can also apply media recovery to an individual datafile by issuing the SQL commands below against a mounted, closed database:

SQL> recover datafile '/db/Oracle/a/oradata/crash/datafile01.dbf'

This command allows the restore of an older version of a datafile and uses redo to roll it forward to the point of failure. For example, if you took a backup of a datafile on Wednesday night, and that datafile was damaged on Thursday evening, you would restore that datafile from Wednesday night’s backup. Of course many transactions would have occurred since Wednesday night, making changes to the datafiles that you restored. Running the command recover [database|tablespace|datafile] reapplies those transactions to the restored datafile, rolling them forward to Thursday evening.
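
The roll-forward idea can be illustrated with a toy model. This is purely for illustration and is not Oracle's actual redo format: the restored "datafile" records the last change number it contains, and only redo records with higher change numbers are applied, in order:

```shell
#!/bin/sh
# Toy illustration of media recovery roll-forward (NOT Oracle's real redo
# format): the restored datafile knows its last applied change number;
# only redo records newer than that are applied, in order.
DATAFILE_SCN=100
cat > /tmp/redo_demo.txt <<'EOF'
99 old-change
101 insert-row
102 update-row
105 delete-row
EOF

APPLIED=0
while read scn change; do
    # a record already reflected in the restored file is skipped
    if [ "$scn" -gt "$DATAFILE_SCN" ]; then
        APPLIED=$((APPLIED + 1))
        DATAFILE_SCN=$scn
    fi
done < /tmp/redo_demo.txt
echo "applied $APPLIED changes, datafile now at change $DATAFILE_SCN" > /tmp/redo_result.txt
cat /tmp/redo_result.txt
```

This is why Oracle asks for the logs in order, starting with the oldest: each record brings the restored file forward one change at a time until it matches the rest of the database.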

This recovery can work in a number of ways. After receiving the recover command, Oracle prompts for the name and location of the first archived redo log that it needs. If that log and all logs that have been made since that log are online, uncompressed, and in their original location, enter the word auto. This tells Oracle to assume that all files that it needs are online. It therefore can automatically roll through each log.

In order to do this, all files that Oracle will need must be online. First, get the name of the oldest file, because that is the first file it will need. That filename is displayed immediately after issuing the recover command:

ORA-00279: change 18499 generated at 02/21/06 11:49:56 needed for thread 1
ORA-00289: suggestion : /db/Oracle/admin/crash/arch/arch.log1_481.dbf
ORA-00280: change 18499 for thread 1 is in sequence #481
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}

In the preceding example, the first file that Oracle needs is /db/Oracle/admin/crash/arch/arch.log1_481.dbf. Make sure that this file is online and not compressed or deleted. If it is deleted, restore it from backup. If it is compressed, uncompress it and any archived redo logfiles in that directory that are newer than it, because Oracle may need all of them to complete the media recovery. If you’re not using rman, it might be necessary to delete some of the older archived redo logs to make enough room for the files that need to be uncompressed. Once all archived redo logs that are newer than the one requested by Oracle have been restored and uncompressed, enter auto at the Specify log prompt. rman can be configured to restore only one redo log at a time from tape, apply that redo log, then delete it, removing the space issue. It can also be configured to keep a certain amount on disk using the maxsize parameter.

Warning

This whole part of the restore will obviously work only if you are running in archivelog mode. Make sure you’re using that option wherever you can.

If there isn’t enough space for all of the archived redo logs to be uncompressed, a little creativity may be required. Uncompress as many as possible, and then press Enter each time it suggests the next file. (Pressing Enter tells Oracle that the file that it is suggesting is available. If it finds that it is not available, it prompts for the same file again.) Once it has finished with one archive log, compress that log, and uncompress a newer log, because it will be needed shortly. (Obviously, a second window is required, and a third window wouldn’t hurt!)

At some point, it may ask for an archived redo log that is not available. This could mean that some of the archived redo logs or online redo logs are damaged. If the file cannot be located or restored, enter cancel.

More details on media recovery are available in Oracle’s documentation.

If any of the damaged datafiles is a member of any required tablespaces, proceed to Step 12. If none of them is a member of any required tablespaces, proceed to Step 13.

Step 11: Are There Damaged Datafiles for Required Tablespaces?

If the damaged file is part of a required tablespace (system, sysaux, undo, and temp in 10g, or system in previous versions), an offline recovery is required. Unfortunately, Oracle complains only that the datafile is missing—without saying what kind of datafile it is. Fortunately, even if Oracle is down, there is an easy way to determine which files belong to the required tablespaces. Finding out if the datafile contains a rollback segment is a little more difficult, but it is still possible. If you use rman with a recovery catalog, you can connect to the catalog, set the DBID of the database, and issue the report schema command. This command displays the files, locations, and whether they contain rollback segments.

RMAN> report schema;
Report of database schema
File Size(MB) Tablespace RB segs Datafile Name
---- ---------- ---------------- ------- -------------------
1  307200 SYSTEM   NO   /oracle/oradata/trgt/system01.dbf
2  20480  UNDOTBS  YES  /oracle/oradata/trgt/undotbs01.dbf
3  10240  CWMLITE  NO   /oracle/oradata/trgt/cwmlite01.dbf
4  10240  DRSYS    NO   /oracle/oradata/trgt/drsys01.dbf
5  10240  EXAMPLE  NO   /oracle/oradata/trgt/example01.dbf
6  10240  INDX     NO   /oracle/oradata/trgt/indx01.dbf
7  10240  TOOLS    NO   /oracle/oradata/trgt/tools01.dbf
8  10240  USERS    NO   /oracle/oradata/trgt/users01.dbf

To find out which datafiles are in the system tablespace, you’ll need to query sys.dba_data_files:

SQL> select file_name, tablespace_name from sys.dba_data_files;
/oracle/oradata/trgt/system01.dbf SYSTEM   
/oracle/oradata/trgt/undotbs01.dbf UNDOTBS  
/oracle/oradata/trgt/cwmlite01.dbf CWMLITE  
/oracle/oradata/trgt/drsys01.dbf DRSYS   
/oracle/oradata/trgt/example01.dbf EXAMPLE  
/oracle/oradata/trgt/indx01.dbf INDX    
/oracle/oradata/trgt/tools01.dbf TOOLS   
/oracle/oradata/trgt/users01.dbf USERS

This example report shows the datafiles that are members of required tablespaces (the system and undo tablespaces in this listing). In your configuration, however, there may be multiple datafiles assigned to these tablespaces.
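
If you have the file_name/tablespace_name listing saved as text, picking out the required-tablespace datafiles is easy to script. This sketch uses sample data modeled on the query output above; the set of tablespace names treated as "required" is an assumption you should adjust for your version:

```shell
#!/bin/sh
# Hypothetical sketch: flag the datafiles belonging to required tablespaces
# in a saved file_name/tablespace_name listing. The sample data below is
# modeled on the query output in the text.
cat > /tmp/datafiles.txt <<'EOF'
/oracle/oradata/trgt/system01.dbf SYSTEM
/oracle/oradata/trgt/undotbs01.dbf UNDOTBS
/oracle/oradata/trgt/cwmlite01.dbf CWMLITE
/oracle/oradata/trgt/users01.dbf USERS
EOF

# treat SYSTEM, SYSAUX, and any UNDO* tablespace as required (an assumption)
awk '$2 == "SYSTEM" || $2 == "SYSAUX" || $2 ~ /^UNDO/ { print $1 }' \
    /tmp/datafiles.txt > /tmp/required.txt
cat /tmp/required.txt
```

Any damaged file that appears in this list must be restored with the database closed, as Step 12 describes.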

If any of the damaged datafiles is a member of a required tablespace, proceed to Step 12. If none of them are members of required tablespaces, proceed to Step 13.

Step 12: Restore All Datafiles in Required Tablespaces

Unlike other tablespaces, required tablespaces (system, sysaux, undo, and temp in 10g, or system in previous versions) must be available in order to open the database. Therefore, if any members of these tablespaces are damaged, they must be restored now. Before doing this, make sure that the database is not open. (It is OK if it is mounted.) To make sure, run the following command on the mounted, closed database:

SQL> select status from v$instance;
STATUS
-------
MOUNTED
1 row selected.

The preceding example shows that this instance is mounted, not open. If the database is not open, restore the damaged files from the most recent backup available. If you’re performing user-managed backups, you’ll need to figure out where those backups are and restore them. If you’re using rman, simply issue a single command:

RMAN> restore database;

rman automatically figures out which files need to be restored and restores only those files from the most recent backup. It does not restore files that appear to be alright.

Once all damaged files in the required tablespaces are restored, you can recover them. If you want to just bring these tablespaces online and then fix other tablespaces later, you'll need to apply media recovery against the individual tablespaces. This may be more complex and may take more time than just restoring all damaged datafiles and issuing a single recover database command.

To recover just the required tablespaces, run individual recover commands against them. The following examples show media recovery being applied against the system tablespace:

RMAN> recover tablespace system;

Or in SQL:

SQL> recover tablespace system;

If you’re recovering multiple tablespaces, it’s probably best if you just issue the recover database command, which applies media recovery against the entire database.

Once this command has completed, the required tablespaces are recovered to the time of failure.

If it does complete successfully, and no other datafiles are damaged, return to Step 10. For more information about the recover tablespace command, read the earlier section “How media recovery works” at the end of Step 10. If there are other datafiles to recover, proceed to Step 13.

Step 13: Damaged Nonrequired Datafile?

So far, we have mounted the database, which proves that the control files are okay. It may have taken some effort if one or more of the control files were damaged, but it succeeded. We also have verified that any required tablespaces are intact, even if they required a restore and recovery. Most of the rest of this procedure concentrates on disabling damaged parts of the database so that it may be brought online as soon as possible. The process of elimination identifies all damaged datafiles once the database is opened successfully. They can then be easily restored.

If there are damaged datafiles that are not members of required tablespaces, proceed to Step 14. If there are no more damaged datafiles, proceed to Step 17.

Step 14: Take Damaged Datafile Offline

To open a database with a damaged, nonrequired datafile, take the datafile offline.

Warning

This step should not be used against datafiles that contain rollback segments. Those should be restored prior to bringing the database online.

If this instance is operating in archivelog mode, just take the datafile offline. It can be restored and recovered later, after the instance has been brought online. Here’s the command to do this:

SQL> alter database datafile 'filename' offline;

If the instance is operating in noarchivelog mode, that’s a different problem. Oracle does not allow the datafile to be taken offline because it knows it can’t be brought back online without media recovery. Without archivelog mode, there is no media recovery. If this is the case, you need to treat any damaged datafiles as required and go back to Step 12.

Once any damaged files are taken offline, return to Step 10 and attempt to open the database again.

Step 15: Were Any Datafiles Taken Offline?

Tip

Perform this step only if the database has been opened.

This step is really a very simple question: if the database opens without taking any datafiles offline, you’re done.

If you had to take some datafiles offline to open the database, you need to restore and recover them now. To find out if any datafiles were taken offline, run the following command:

SQL> select name from v$datafile where status = 'OFFLINE';
NAME
------------------
/db/oracle/a/oradata/crash/tools01.dbf
/db/oracle/a/oradata/crash/users01.dbf

If some datafiles were taken offline to open the database, proceed to Step 16. If you’re unsure, proceed to Step 16.

Step 16: Restore and Recover Offline Datafiles

If any datafiles were taken offline, you need to restore them and bring them online.

Restore the damaged datafiles

Once the names of the datafiles that need to be restored are determined, restore them from the latest available backup. If you’re performing user-managed backups, you have to decide which files to restore and where to restore them from. If you’re using rman, you only need to issue the following command:

RMAN> restore database ;

Once they are restored, recovery within Oracle can be accomplished in three different ways. These ways vary greatly in complexity and flexibility. Examine the following three media recovery methods and choose whichever one is best for you.

Datafile recovery

If there are a small number of datafiles to recover, this may be the easiest option. As each file is restored, issue the recover datafile command against it, and then bring it online. The following commands work in rman or sqlplus:

recover datafile 'datafile_name' ;
alter database datafile 'datafile_name' online ;

The downside to this method is that media recovery may take a while for each datafile. If recovering multiple datafiles within a single tablespace, this is probably wasting time.

Tablespace recovery

This is the hardest of the methods, but it may work faster than the previous method if there are several damaged datafiles within a tablespace. If forced to leave the partially functional database open while recovering the damaged datafiles, and there are several of them to recover, this is probably the best option.

First, find the names of all datafiles and the tablespace to which they belong. Since the database is now open, this can be done in one step, as demonstrated in Example 16-4.

Example 16-4. Listing dba_data_files
SQL> select file_name, tablespace_name from dba_data_files;

FILE_NAME                                TABLESPACE_NAME
---------------------------------------- ------------------------------
/db/oracle/a/oradata/crash/users01.dbf   USERS
/db/oracle/a/oradata/crash/tools01.dbf   TOOLS
/db/oracle/a/oradata/crash/temp01.dbf    TEMP
/db/oracle/a/oradata/crash/rbs01.dbf     RBS
/db/oracle/a/oradata/crash/system01.dbf  SYSTEM
/db/oracle/a/oradata/crash/test01.dbf    TEST

6 rows selected.

Once all of the datafiles are restored and the names of all the tablespaces that contain these datafiles have been determined, issue the recover tablespace command against each of those tablespaces. Before doing so, however, each tablespace must be taken offline, as shown in Example 16-5. The commands in Example 16-5 work in rman or sqlplus.

Example 16-5. Tablespace-based recovery
alter tablespace tablespace_name1 offline;
recover tablespace tablespace_name1 ;
alter tablespace tablespace_name1 online;
alter tablespace tablespace_name2 offline;
recover tablespace tablespace_name2 ;
alter tablespace tablespace_name2 online;

It’s obvious that this method is quite involved! It’s not pretty, and it’s not easy, but it allows recovery of multiple tablespaces while the instance continues to operate. If a partially functioning database is of any value to the users, this method may be their best friend.
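With many affected tablespaces, typing those triplets by hand invites mistakes. One approach — a sketch for sqlplus, in which the spool path /tmp/recover_ts.sql is an assumption — is to generate the commands from the v$ views and then run the spooled file as a script:

```sql
set pagesize 0 feedback off
spool /tmp/recover_ts.sql
select 'alter tablespace ' || t.name || ' offline;' || chr(10) ||
       'recover tablespace ' || t.name || ';'       || chr(10) ||
       'alter tablespace ' || t.name || ' online;'
from   v$tablespace t
where  t.ts# in (select ts# from v$datafile where status = 'OFFLINE');
spool off
@/tmp/recover_ts.sql
```

You still have to respond to the recovery prompts as each recover tablespace command runs, or issue set autorecovery on first to apply the archived logs automatically.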

Database recovery

This is actually the easiest method, but it requires that the database be shut down while it runs.

Once all the datafiles that were taken offline have been restored, issue the commands shown in Example 16-6. These commands work in rman or sqlplus.

Example 16-6. Normal database recovery
shutdown immediate ;
startup mount ;
recover database ;
alter database open ;

To make sure that all tablespaces and datafiles have been returned to their proper status, run the commands shown in Example 16-7.

Tip

Depending on the version of Oracle, some of the following commands may display a status only if there is something “interesting.”

Example 16-7. Obtaining the names of all datafiles, control files, and logfiles
SQL> select name, status from v$datafile;
NAME                                     STATUS
---------------------------------------- -------
/db/oracle/a/oradata/crash/system01.dbf  SYSTEM
/db/oracle/a/oradata/crash/rbs01.dbf     ONLINE
/db/oracle/a/oradata/crash/temp01.dbf    ONLINE
/db/oracle/a/oradata/crash/tools01.dbf   ONLINE
/db/oracle/a/oradata/crash/users01.dbf   ONLINE
/db/oracle/a/oradata/crash/test01.dbf    ONLINE
6 rows selected.
SQL> select member, status from v$logfile;
/oracle/product/10.2.0/oradata/orcl/redo03.log
/oracle/product/10.2.0/oradata/orcl/redo02.log
/oracle/product/10.2.0/oradata/orcl/redo01.log
SQL> select name, status from v$controlfile;
/oracle/product/10.2.0/oradata/orcl/control01.ctl
/oracle/product/10.2.0/oradata/orcl/control02.ctl
/oracle/product/10.2.0/oradata/orcl/control03.ctl

Example 16-7 shows that all datafiles, control files, and logfiles are in good condition. (In the case of the logfiles and control files, no status is good status.)

Tip

Once any datafiles that were taken offline have been restored and recovered, you’re done.

Step 17: Is There a Damaged Online Log Group?

When we refer to a “damaged online log group,” we mean that all members of a log group are damaged. If at least one member of a multiplexed/mirrored log group is intact, Oracle opens the database and simply puts an error message in the alert log. However, if all members of a log group are damaged, the database will not open, and the error will look something like this:

ORA-00313: open failed for members of log group 2 of thread 1
ORA-00312: online log 2 thread 1: '/db/Oracle/b/oradata/crash/redocrash02.log'
ORA-00312: online log 2 thread 1: '/db/Oracle/a/oradata/crash/redocrash03.log'

Tip

If there is no error like this, there is no damaged log group. Proceed to Step 18.

The first thing that must be determined is the status of the damaged log group. The three statuses to worry about are CURRENT, ACTIVE, and INACTIVE. To determine the status of the damaged log group, run the following command on the mounted, closed database:

SQL> select group#, status from v$log;

The output looks something like this:

GROUP#   STATUS
---------- ----------------
     1 INACTIVE
     2 CURRENT
     3 ACTIVE
3 rows selected.

The preceding example shows that log group 1 is inactive, group 2 is current, and group 3 is active. Here is an explanation of these statuses and how they affect recovery:

Current

The current log group is typically the log Oracle is currently writing to. In the case of a crash, it's the one to which Oracle was writing when the failure occurred. It will be shown as current until the database is brought back online and a log switch occurs.

Active

An active log group is usually the log group that Oracle just finished writing to. However, until a checkpoint occurs, this group is still needed for media recovery. Since a log switch always forces a checkpoint, a status of active is actually very rare. In fact, the only way to see this (before the system crashed) is to run the preceding command while a checkpoint is in progress. (In a properly tuned database, this is a very short period of time.)

Inactive

An inactive log group is one that is not being used by Oracle at the moment. It has been archived.

To determine what action to take next, first get the number of the log group whose logfiles are damaged. The preceding example error reads open failed for members of log group 2. Reference this number against the log groups listed by the select group#, status from v$log command. In the previous example, log group 2 was CURRENT at the time the database crashed.

Tip

If the damaged log group was current, proceed to Step 20. If it was active, proceed to Step 23. If it was inactive, proceed to Step 25.

Step 18: Are Any Rollback Segments Unavailable?

Tip

As mentioned repeatedly, rollback segments are the older way to handle undo. If you have switched to undo segments, this step doesn’t apply to you. Undo segments were handled in Step 12.

If a rollback segment is damaged, Oracle will complain when attempting to open the database. The error looks like the following:

ORA-01545: rollback segment 'USERS_RS' specified not available
Cannot open database if all rollback segments are not available.

If the preceding error is displayed when attempting to open the database, proceed to Step 19. If not, return to Step 10.

Step 19: Recover Tablespace Containing Unavailable Rollback Segment

Tip

Perform this step only if directed to do so by Step 18.

The first thing that must be determined is which tablespace the damaged rollback segment is in. Unfortunately, there is no fixed view that contains this information. That means that it will have to be discovered through common sense and deduction.

Tip

This is why it is very helpful to put the rollback segments in dedicated tablespaces with names that easily identify them, such as RBS1. It’s even more helpful if the datafiles are named something helpful as well. For example, create a separate tablespace called RBS1, and call its datafiles rollback01.dbf, rollback02.dbf, and so on. That way, anyone who lands in this scenario will know exactly which datafiles contain rollback data.

First, remember that this error is not displayed unless a datafile has been taken offline. To get a complete list of files that were taken offline, run the following command on a mounted, closed database:

SQL> select TS#, name from v$datafile where status = 'OFFLINE' ;

       TS# NAME
---------- --------------------------------------------------------------------------------
         5 /db/oracle/a/oradata/crash/test01.dbf

1 row selected.

Now, find the name of the tablespace that contains this datafile:

SQL> select name from v$tablespace where TS# = 5 ;
NAME
--------------------------------------------------------------------------------
TEST
1 row selected.
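The two lookups above can also be combined into a single query by joining v$datafile to v$tablespace on the TS# column. This is a sketch, but both are v$ views read from the control file, so it also works on a mounted, closed database:

```sql
select t.name tablespace_name, d.name file_name
from   v$datafile d, v$tablespace t
where  d.ts# = t.ts#
and    d.status = 'OFFLINE';
```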

Admittedly, the previous example was easy. There was only one offline datafile, which made finding its tablespace simple. What if there were multiple datafiles contained within multiple tablespaces? How would you know which one contains the rollback segment? Unfortunately, there is no way to be sure while the database is closed.

The rest of this step is simple. Restore any files that were taken offline, and use either the recover datafile or recover tablespace command to roll them forward in time. If there are only one or two damaged datafiles, it's probably quicker to use the recover datafile command. If there are several damaged datafiles, especially if they are all in one tablespace, the recover tablespace command is probably easiest. Either way works.

Tip

Once any datafiles that contain rollback segments have been restored and recovered, return to Step 10 and attempt to open the database again.

Step 20: Is the Current Online Log Damaged?

Tip

Perform this step only if instructed to do so by Step 17. Otherwise, return to Step 17 now.

If the current online log group is damaged, you’ll see a message like the following when you attempt to open the database:

ORA-00313: open failed for members of log group 2 of thread 1
ORA-00312: online log 2 thread 1: '/db/Oracle/b/oradata/crash/redocrash02.log'
ORA-00312: online log 2 thread 1: '/db/Oracle/a/oradata/crash/redocrash03.log'

In the preceding example, a select group#, status from v$log command also would have shown that log group 2 was CURRENT at the time of failure.

This is the worst kind of failure to have because there definitely will be data loss. The current online log is required to restart even a fully functioning database; if it contains redo needed to bring the instance online, you will not be able to open the database without a full restore and incomplete recovery.

First, try opening the database with an alter database open resetlogs, which induces Oracle to recreate the online redo logs. If that works, you’re done.

If you cannot open the database with the resetlogs option, the only other step is to restore an older version of the control file. Unfortunately, you can’t restore only the control file because the datafiles then would be more recent than the control file. The only remaining option is to restore the entire database.

Tip

If you could not open the database with the resetlogs option, proceed to Step 21. If the resetlogs worked, you're done.

Step 21: Restore and Recover All Database Files from Backup

Warning

There are only two reasons to perform this step. The first is if instructed to do so by Step 20. The other is if there was an unsuccessful attempt to open the database after performing Step 24 or Step 26. This step is the most drastic method of recovery and should not be performed unless absolutely necessary.

Perform this step only after verifying (or rebuilding or restoring) the control files, and verifying that all members of the current online log group are damaged. This procedure is relatively easy. Simply determine the names and locations of all the datafiles, and restore them from their latest backup.

Warning

Restore only the datafiles, not the control files. Do not restore or overwrite the control files unless instructed to do so by Step 9!

Again, if you’re using rman, you can do all of this with two commands:

RMAN> restore database ;
RMAN> recover database ;

If you’re performing user-managed backup, you need to use your backup system to restore all datafiles. To determine the names of all the datafiles, run the following command on the mounted, closed database:

SQL> select name from v$datafile ;

Once all datafiles are restored, you should issue a media recovery against them.

SQL> recover database ;

Once all datafiles are restored and recovered, proceed to Step 22.

Step 22: Run alter database open resetlogs

Warning

Perform this step only if instructed to do so by Step 21. This is another drastic step that should be performed only if necessary!

The alter database open resetlogs command causes Oracle to open the database after clearing all contents of the online redo logfiles. Since there is no way to undo this step, it is a good idea to make copies of the online redo logfiles now. To find out all their names, run the following command on a mounted, closed database:

SQL> select member from v$logfile ;
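The copies can be scripted. The following sketch assumes the output of that query has been saved to a file named logfile_list.txt, one path per line; the list filename is an assumption, and the demo filenames are created only so the sketch runs end to end — point it at your real spooled list instead:

```shell
#!/bin/sh
# Sketch: back up each online redo logfile to <filename>.bak before
# running resetlogs. logfile_list.txt is assumed to hold one logfile
# path per line (e.g., spooled from "select member from v$logfile;").
# The hypothetical demo files below stand in for real redo logs.
printf '%s\n' demo_redo01.log demo_redo02.log > logfile_list.txt
for f in demo_redo01.log demo_redo02.log; do : > "$f"; done

while read logfile
do
    cp "$logfile" "$logfile.bak" && echo "backed up $logfile"
done < logfile_list.txt
```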

To create an “undo” option, copy each file to <filename>.bak. After making a backup of the online redo logfiles, run the following command on a mounted, closed database:

SQL> alter database open resetlogs ;
Statement processed.

If the database opens, congratulations!

Warning

If you're running a version prior to 10g, make a backup of this database immediately, preferably with the database shut down. Prior to 10g, Oracle cannot apply media recovery across a resetlogs operation: any redo logs made after the open resetlogs are useless until a full backup is taken after that point.

It’s also a good idea to perform a backup as soon as possible even if you’re running 10g. A hot backup will suffice.

Once that backup is completed, you’re done!

Step 23: Is an Active Online Redo Log Damaged?

Tip

Perform this step only if instructed to do so by Step 17. Otherwise, return to Step 17 now.

If an active online log group is damaged, you’ll see a message like the following when attempting to open the database:

ORA-00313: open failed for members of log group 2 of thread 1
ORA-00312: online log 2 thread 1: '/db/Oracle/b/oradata/crash/redocrash02.log'
ORA-00312: online log 2 thread 1: '/db/Oracle/a/oradata/crash/redocrash03.log'

In the preceding example, a select group#, status from v$log command also would have shown that log group 2 was active at the time of failure.

Remember that an active log is one that is still needed for recovery. Depending on how the database crashed, there might still be data in the buffers. If you can get that data to flush to disk with a checkpoint, this active log becomes an inactive log that is no longer needed for recovery.

To perform a checkpoint, proceed to Step 24. If there are no damaged active online redo logs, proceed to Step 25.

Step 24: Perform a Checkpoint

The way to attempt to recover from the scenario in Step 23 is to perform a checkpoint. If it is successful, the database should open successfully. To perform a checkpoint, issue the following command on the mounted, closed database:

SQL> alter system checkpoint local ;
Statement processed. 

Be patient. The reason that there is an active log group is probably that the checkpoint took a long time in the first place, and the database crashed during the checkpoint. Wait for Oracle to say that the checkpoint succeeded or failed. If it succeeded, Oracle will simply say, Statement processed. If it failed, there could be any number of Oracle errors.

After issuing the checkpoint, even if it is unsuccessful, return to Step 10, and attempt to open the database. If this attempt fails, return to Step 21 and recover the entire database.

Step 25: Is an Inactive Online Redo Log Damaged?

Tip

Perform this step only if instructed to do so by Step 17. Otherwise, return to Step 17 now.

If an inactive online log group is damaged, you’ll see a message like this when attempting to open the database:

ORA-00313: open failed for members of log group 2 of thread 1
ORA-00312: online log 2 thread 1: '/db/Oracle/b/oradata/crash/redocrash02.log'
ORA-00312: online log 2 thread 1: '/db/Oracle/a/oradata/crash/redocrash03.log'

In the preceding example, a select group#, status from v$log command also would show that log group 2 was inactive at the time of failure.

By comparison, this one should be a breeze. An inactive log is not needed by Oracle. If the log group is not needed, simply drop it and create another to replace it.

To drop and add an inactive log group, proceed to Step 26.

Step 26: Drop/Add a Damaged, Inactive Log Group

Tip

Perform this step only if instructed to do so by Step 25.

In all the previous examples, the damaged log group was group 2. Before we drop that group, we should make sure that we can add it back easily. Ensure that all the original redo log locations are still valid. To do this, get the names of all the members of that log group:

SQL> select member from v$logfile where GROUP# = 2 ;

For this example, Oracle returned the values:

/logs1/redolog01.dbf
/logs2/redolog01.dbf
/logs3/redolog01.dbf

Verify that the locations of all these files are still valid. For this example, assume that /logs3 is completely destroyed, and we are relocating all its contents to /logs4. Therefore, the members of log group 2 will be /logs1/redolog01.dbf, /logs2/redolog01.dbf, and /logs4/redolog01.dbf.

To drop log group 2, issue the following command on a mounted, closed database:

SQL> alter database drop logfile group 2 ;

Once that command completes successfully, add the log group back to the database. To do this, issue the following command (remember that we have replaced /logs3/redolog01.dbf with /logs4/redolog01.dbf ):

SQL> alter database add logfile group 2 ('/logs1/redolog01.dbf',
     '/logs2/redolog01.dbf', '/logs4/redolog01.dbf') size 500K ;
Statement processed.

Once this command completes successfully, return to Step 10 and attempt to open the database.

You’re Done!

If you’ve made it this far, you’re done! All datafiles, control files, and logfiles should be online. Take a backup of the entire database immediately, preferably a cold one with the database down. If that can’t be done, then perform a hot backup.

Logical Backups

Physical backups protect you from physical damage, such as a damaged disk drive. A logical backup protects you from logical damage, such as when your DBA accidentally deletes an important table. Logical backups are done by Oracle’s data pump or export utility, which stores the data in a binary file that is useful only to Oracle.

Data pump and export are really used only for data exchange between databases. They have a number of disadvantages when used for backups:

No transaction recovery

It is very important to note that exports are a picture of the database at some point in time, and they can be used to recover that table only to that point in time. There is no way to apply transaction logs (redo logs) on top of an imported table to bring it up to date. So although it might be quicker to restore a table using an import, it might be better just to restore the entire tablespace. The recover until change option can be used to redo all transactions except for the one that deleted the table.
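As an alternative to importing a single table, an incomplete (point-in-time) recovery can bring the whole database back to just before the accidental deletion. A hedged sketch — it assumes the datafiles have been restored from backup, the database runs in archivelog mode, and the timestamp is hypothetical:

```sql
shutdown immediate ;
startup mount ;
recover database until time '2007-02-21:14:52:00' ;
alter database open resetlogs ;
```

Everything committed after that time is lost, which is why restoring the affected tablespace or importing the table is often preferred when only one object was damaged.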

Longer than physical backups

Exports usually take longer than physical backups because there is a lot of checking going on during the process. Depending on the speed of the system, this may be a little bit longer or quite a bit longer!

Full export requires restrict mode or the consistent=y option

You need to ensure that the export is consistent. One way to do that is to shut down and then open the database in restrict mode; that way, no one can make changes to the database while the export is running. (Of course, this means that no one can access it either.) Another way is to add the consistent=y option to the export. This causes the export to view the entire database as it existed at the moment the export started.

Performing a Logical Backup

The type of export covered here is an export of the entire database, also known as a full export. The commands to do a full export are found in Example 16-8. Substitute the appropriate username, passwd, and file_name. (The username and password need to have the appropriate permissions to do an export.)

Example 16-8. Sample database export
$ exp userid=username/passwd full=Y consistent=Y constraints=Y file=file_name

This performs a full (every table) export of ORACLE_SID to file_name. For more information on the exp command, consult the manual.

Recovering with a Logical Backup

If you made any logical backups, or exports, using Oracle’s exp utility, they may be imported using Oracle’s imp utility. In order to use the following command, you must substitute the appropriate username, passwd, and file_name. For level, you need to use system or restore. (When to use these levels is covered next.)

$ imp username/passwd inctype=level full=Y file=file_name

A Broken Record

There are some common threads that appear throughout this chapter. Here they are once again:

Multiplex/mirror the redo logs

If all members (or the only member) of an active or current log group are lost, there will be data loss. If the redo logs are not multiplexed/mirrored, the chances of this happening are much greater than if the redo logs are multiplexed/mirrored. If the redo logs are multiplexed/mirrored, the chance that all members of a log group would be damaged is incredibly small. A little research and a little effort up front will significantly reduce the chances of data loss.

Watch the alert log

Even if the redo logs are multiplexed/mirrored, one or more members of a log group may be damaged while the database is operating. The only notification will be an entry in the alert log. Automate the checking of the alert logs for error messages. Otherwise, there may be only one functioning member of a log group, and you may not even know it.
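A cron-driven check can be as simple as grepping the alert log for ORA- messages. A sketch — the alert log path is an assumption and varies by site and version (it lives under background_dump_dest), so override it for your installation:

```shell
#!/bin/sh
# Sketch: report any ORA- errors found in the alert log.
# ALERT_LOG defaults to a hypothetical path; override it for your site.
ALERT_LOG=${ALERT_LOG:-/oracle/admin/orcl/bdump/alert_orcl.log}

if [ -f "$ALERT_LOG" ] && grep 'ORA-' "$ALERT_LOG"
then
    echo "errors found in $ALERT_LOG -- investigate now"
else
    echo "no ORA- errors found"
fi
```

Mail the output to the DBA group, or feed it to whatever monitoring system is already in place.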

Multiplex/mirror the control files

A significant portion of this recovery procedure is dedicated to recovering from the loss of all control files. If they are not multiplexed/mirrored, they should be—another example that Prior Proper Planning Purges the Person from the Performance of Painful Procedures.

Use archivelog mode

Without archivelog mode, Steps 21 and 23 could replace the entire recovery procedure. If one datafile is lost, restore all of them and open the database with the resetlogs option. All changes since the last cold backup will be lost, and a cold backup is the only kind of backup possible because hot backups require archivelog mode.

Tip

BackupCentral.com has a wiki page for every chapter in this book. Read or contribute updated information about this chapter at http://www.backupcentral.com.
