CHAPTER 15

image

The Big Picture and the Ten Deliverables

And if you don’t know Which to Do

Of all the things in front of you,

Then what you’ll have when you are through

Is just a mess without a clue …

Winnie the Pooh in The Tao of Pooh, by Benjamin Hoff

What are the deliverables of the database administration role? How likely is it that the deliverables will be completed if you cannot articulate what they are? Deliverables are not the same as current priorities, because priorities change from day to day; performance improvement may be your priority today, but it may not be your priority when performance returns to acceptable levels. Deliverables are not the same as assigned tasks, either, because assigned tasks change from day to day; resetting passwords for forgetful users may be one of your assigned tasks today, but you may not have do it any more if, for example, your organization begins using self-service technology or single sign-on technology, or if the task is assigned to a Service Desk. An example of a deliverable is “databases that meet the needs of the business”—the deliverable does not change from day to day. If there is only one DBA in the organization, then it is the individual’s deliverable. If database administration is performed by a team, it is a shared deliverable.

This is the most important chapter in this book—I discuss the big IT picture and offer very specific guidance in the form of ten deliverables of the database administration role. Few, if any, other books address this topic.

If you take the lessons in this chapter to heart, you can quickly become a better Oracle DBA than you thought possible. Competency in Oracle technology is only half of the challenge of being a DBA. If you had very little knowledge of Oracle technology but knew exactly “which” needed to be done, you could always find out how to do it—there is Google, and there are online manuals aplenty. Too many Oracle DBAs don’t know “which” to do, and what they have when they are through is “just a mess without a clue.”

Image Tip  This chapter is essentially unchanged from the previous edition, including the use of the now officially deprecated ITIL V2 terminology. After reviewing this chapter, I did not wish to change anything, and I commend it to you as the most important chapter and distinguishing feature of this book. Nothing in this chapter is specific to Oracle Database 12c; the entire discussion equally applies to prior and future versions. As regards ITIL V2 terminology, I still believe that you will find it sufficient unless you are a project management professional (PMP).

An Instructive Job Interview

Early in my career, I interviewed for a programmer/analyst job at Hewlett-Packard. The manager who interviewed me gave me the following test. He explained that the problem was to produce a report containing sorted employee information and asked me to draw a flowchart explaining the approach I would use to solve the problem. He gave me some paper and left the room for half an hour. When he left, I wrote a COBOL program that sorted and printed a file of employee records. I prided myself on my programming skills and I was certain the program would work correctly the first time.

When he came back, the manager read my program carefully and complimented me on my programming skills. Then he told me what I had missed. I had written a complete COBOL program but had completely ignored the system development life cycle: initiation, system concept development, planning, requirements analysis, design, development, integration and test, implementation, operations and maintenance, and disposition.

In summary, I was a very good COBOL programmer but did not see the big software development picture.

Well, I did not get the job at Hewlett-Packard, but soon after that I got my first job as a database administrator—at Intel. Database administration became my career, and, in time, I became the manager of a team of DBAs managing a thousand databases for a large service provider. My technical skills were never stronger, but something seemed to be missing. We were burned out from working 60 hours every week, we felt unappreciated, and we were constantly at loggerheads with other groups.

In time, I realized that technical knowledge was not enough. We needed to understand how all the pieces of IT fitted together and how they interacted with each other. In other words, we needed to see the big picture.

WHO IS A SENIOR DBA?

I have attended many interviews for Oracle DBA positions in my career and, with one exception, always found that the interviewers set great store on knowledge of Oracle syntax. A big problem is that the typical interviewer only asks questions about those Oracle features that he or she uses on the job and is most familiar with. Any candidate who has not used those Oracle features is then automatically eliminated.

Jeremiah Wilton was the first Oracle DBA at Amazon.com; he joined the company when it was still a small startup and built the DBA team there. In an interview published in the journal of the Northern California Oracle Users Group (NoCOUG), I asked him the following question.

My daughter’s piano teacher likes to say that practice makes permanent, not perfect. Just because I’ve been a database administrator a long time doesn’t qualify me as a “senior” database administrator—or does it? Who is a “senior” database administrator? Do I need a college degree? Do I need to be a “syntax junkie”? Do I really need experience with Oracle Streams or ASM to claim the title?

In his reply, Jeremiah suggested that anybody with a few years of experience under their belt was entitled to call themselves a senior DBA, but he did not value years of experience and knowledge of Oracle syntax very much:

To me, senior means that you have used a lot of Oracle’s features, solved a lot of problems, and experienced a variety of production situations. Do these qualities necessarily mean that I will want to hire you? No.

Of far greater importance than seniority is a DBA’s ability to solve problems in a deductive and logical manner, to synthesize creative solutions to problems, and to forge positive and constructive business relationships with colleagues and clients. For years at Amazon, we simply tried to hire extraordinarily smart people with a strong interest in working with Oracle and others. Some of Amazon’s most senior DBAs started with little or no Oracle experience. I believe that the focus on experience in specific technologies and seniority causes employers to pay more and get less than they could when filling DBA positions.

Jeremiah is proof that experience is overrated; he had no knowledge of Oracle database administration when he started at Amazon.com. He learned his trade on the job and went on to build the database team that Amazon.com relies on today.

How I Became a DBA

I became an Oracle DBA by accident. I was supporting another database technology when Big Bob, the Oracle DBA at my then employer, suddenly resigned. I was asked to take over because I had expressed an interest in becoming an Oracle DBA.

I created a documentation template and asked Bob to spend his remaining time documenting each database using the template. He protested that he did not have the time to provide so much detail and suggested that we meet for an hour or two.

At the meeting, Bob told me not to worry and that I was a smart kid and would soon learn my way around. We ran through the list of databases in about half an hour, spending less than a minute on each while I hastily scribbled notes.

But it wasn’t Big Bob’s fault, because documentation and record-keeping were not organizational priorities. When the time came for me to take over from him, Big Bob had little more for me than a few passwords, a firm handshake, and lots of good wishes.

THE IMPORTANCE OF COMMUNICATION

Which kind of DBA would you want to maintain your database? One whose speech was sprinkled with incomprehensible Oracle terminology or one who communicated with you in language you understood? I mentioned previously that all the interviews I attended during my job searches, with one single exception, focused on my knowledge of Oracle syntax. A US-based provider of remote DBA services called Database Specialists uses a very unusual approach to interviewing Oracle DBA candidates. Questions about Oracle syntax are not asked. Instead, the interview focuses on the candidate’s ability to communicate.

Database Specialists uses a systematic institutional approach to database administration, and the linchpin of the process is communication with its customers. A daily report on every database that is maintained by Database Specialists is sent to the appropriate distribution list. This is not a computer-generated report or graph but an actual memo from a live person. This gives customers visibility into database operations and reassures them that a systematic process of database administration is being followed, even if a memo might sometimes say only “No new issues at this time.” All the daily reports are available online in a customer-accessible Internet portal, and they constitute a historical record of all the issues with the database. Here is an example—names and details have been changed to protect customer confidentiality:

  From: Iggy Fernandez [mailto:[email protected]]
  Sent: Friday, February 29, 2008 10:11 AM
  To: [email protected]; [email protected];
  Cc: [email protected];
  Subject: Daily Database Review for ReallyBigCo PRODDB

  Our daily review of this instance has found the following:

      Pollux is now the primary database server and Castor is now the standby
  database server. The reversal of roles was performed last night and the
  transcript has been sent to Terry. I performed the switchover using the same
  method used in the past by Gary, only changing the dates embedded in file names.
      Quoting Terry, the reason for the switchover was: "Having switched to the
  standby, we are now using StorageTek instead of NetApp storage. We are working
  on replacing the disks in the NetApp array with faster ones."

      The recent performance problems have two symptoms: much higher CPU
  utilization than historical norms and continued degradation in I/O performance.
  The current hypothesis for increased CPU utilization is that the increase is
  correlated with the expansion of the amount of data and the number of users. The
  cause of the degradation in I/O performance is under investigation by the system
  administrators.
  --Iggy

When a DBA candidate is interviewed at Database Specialists, the focus is on the candidate’s communication skills. The first exercise is to write a daily report of the sort just illustrated. The candidate is shown an extract from the Oracle alert log and asked to write a report discussing the Oracle errors listed in the extract. There is no expectation that the candidate has had previous experience with those errors, and the candidate is welcome to research the answers online; this mimics the approach used by real DBAs in real life. In fact, there may not be any “right” answer.

The second exercise is a role-playing exercise that focuses on verbal communication. The exercise mimics a common event in the life of a DBA—a critical problem that has high visibility and requires a number of participants. The candidate is given access to a lab system owned by Database Specialists and is asked to join a telephonic conference. To prepare for the exercise, the candidate is directed to a white paper—available on the Database Specialists web site—that discusses the problem-solving approaches that would be useful during the exercise. Participating on the conference call are members of the Database Specialists team, one of whom represents the customer while the rest represent other IT personnel such as system administrators. The customer describes the problem, and the candidate is expected to ask questions, diagnose the problem, and solve it with the help of the other participants on the call. The problem is actually simulated in a lab database, and the candidate is expected to check the database and communicate the findings. The candidate is welcome to research the problem online, because there is no expectation that he or she has any experience with the specific problem that is being simulated.

Why does this interviewing approach work for Database Specialists? It works because Database Specialists uses a systematic institutional approach to database administration, one in which communication skills are critical.

ITIL

The IT Information Library (ITIL), sponsored by the U.K. government, provides a conceptual overview of and detailed guidance about IT practice. It is a collaborative effort of many IT organizations—not academic theory—and provides guidance that is independent of the hardware and software being used. The first version was issued in the late 1980s, and it is now the international standard for IT practice. The discussion that follows is an adaptation of the framework and terminology presented in the ITIL literature.1

The Big Picture

As shown in Figure 15-1, everything starts with the business. The IT department provides IT services to the business. These services are managed using IT service management (ITSM) processes such as Service Level Management, Incident Management, and Change Management. The Infrastructure Management tea006D manages the hardware and software that power the services required by the business—it is divided into a Design and Planning (D&P) team, a Deployment team, and an Operations team. The Application Management team designs and develops the software applications underlying the services used by the business. The ITSM team is the interface between the business and the Infrastructure Management team. In particular, the Service Desk is the single point of contact for all users of the services.

9781484201947_Fig15-01.jpg
Figure 15-1. The big picture

The preceding is a very formal description of IT practice, and many companies don’t have large IT teams. But IT principles don’t change whether the IT team consists of a single employee or hundreds of employees. ForH example, service-level expectations always exist, whether they are formally documented in a service-level agreement (SLA) or not.

Image Tip  The DBA is part of the Operations team. This is an intuitively correct classification because the term administrator suggests the day-to-day maintenance of an operational database. However, persons with the title Database Administrator can often be found in the Deployment team and the Applications Management team—a testament to the multidisciplinary job descriptions found in the real world.

IT Service Management Processes

The ITSM team manages the services used by the business. It is the interface between the business and the Infrastructure Management team. The ITSM team uses the ten management processes described in the following sections. Note that the services managed by the ITSM team are actually provided by the Operations team. ITSM principles apply regardless of whether a company has a dedicated ITSM team. For example, a principle of change management is that all changes be properly authorized. Another change-management principle is that records of changes are kept and documentation is kept up-to-date. Such principles don’t change regardless of whether a company has a dedicated change manager. A DBA should seek approval before making changes, keep records of changes, and update the documentation regardless.

Some of the processes of ITSM are strategic—Service Level Management, Financial Management, Capacity Management, Availability Management, and IT Continuity Management (also known as disaster recovery). Other branches are tactical—Incident Management, Problem Management, Configuration Management, Change Management, and Release Management. Formal definitions are provided in the following sections along with appropriate suggestions for the DBA. The definitions are taken from ITIL V2 Glossary v01 (Office of Government Commerce, 2006) and are reproduced with permission.2 The highlighted terms in the definitions have their own formal definitions, and the serious student can find them in the same document.

Service Level Management

Service Level Management (SLM) is “the Process responsible for negotiating Service Level Agreements, and ensuring that these are met. SLM is responsible for ensuring that all IT Service Management Processes, Operational Level Agreements, and Underpinning Contracts are appropriate for the agreed Service Level Targets. SLM monitors and reports on Service Levels, and holds regular Customer reviews.”

The database administrator cannot meet the expectations of the business if he or she does not know what they are. These expectations usually center on availability and performance; for example, databases that are used by e-commerce applications typically have very high availability and performance requirements.

In the absence of clearly communicated expectations and measurable service levels, the job of database administration becomes a reactive exercise instead of the proactive exercise it should be—for example, performance tuning is conducted only after the business complains of poor performance.

Financial Management

Financial Management for IT services is “the Process responsible for managing an IT Service Provider’s Budgeting, Accounting and Charging requirements.”

Oracle license payments can be a substantial part of the IT department’s budget. At the time of writing, a license for Oracle Database Enterprise Edition for a single CPU costs $47,500 or more and depends on the number of cores in the CPU; a license for a four-core Intel CPU costs $95,000. Many features such as partitioning are extra-cost options. Annual support costs are currently 22% of the base price, and development and standby databases must also be separately licensed.

DBAs should understand Oracle’s licensing policies and maintain an accurate inventory of installed software. Note that Oracle does not use license keys to unlock software, and it is therefore easy to install inadvertently software that is not properly licensed, such as Enterprise Edition instead of Standard Edition. It should be particularly noted that many features are automatically installed as part of the installation process and cannot be deinstalled. Examples include Partitioning and the various Management Packs. Collector Jobs for Diagnostics Pack are automatically created and scheduled even though very few sites are licensed to use the feature. As described in Chapter 6, the DBA must take explicit steps to deactivate the Management Packs after the database is created.

Image Tip  The DBA_FEATURE_USAGE_STATISTICS view shows which Oracle features are being used. This information should be regularly reviewed to ensure that unlicensed features are not being inadvertently used.

IT Service Continuity Management

IT Service Continuity Management (ITSCM) is “the Process responsible for managing Risks that could seriously impact IT Services. ITSCM ensures that the IT Service Provider can always provide minimum agreed Service Levels, by reducing the Risk to an acceptable level and Planning for the Recovery of IT Services. ITSCM should be designed to support Business Continuity Management.”

IT continuity management is commonly referred to as disaster recovery, and the DBA must be prepared to re-create the database at an alternate location if a disaster should make the primary location unusable. This service is a subset of business continuity management, which is responsible for all aspects of the business, including IT. The DBA is responsible for databases only, and the three approaches that can be used are hot standby, warm standby, and cold standby. In the hot standby approach, an Oracle database is created on a server in the alternate location and is continuously synchronized with the primary database so that failover time can be kept to a minimum. In the warm standby approach, the hardware (including network components) is available at an alternate site but the database has to be re-created from backup tapes—this option is less expensive than the hot standby approach. In the cold standby approach, alternate facilities with power and cabling are identified, but the hardware is only procured in the event of a disaster—this is the cheapest option.

Capacity Management

Capacity Management is “the Process responsible for ensuring that the Capacity of IT Services and the IT Infrastructure is able to deliver agreed Service Level Targets in a Cost Effective and timely manner. Capacity Management considers all Resources required to deliver the IT Service, and plans for short, medium and long term Business Requirements.”

In the absence of systematic capacity management, database administration becomes a reactive exercise in which, for example, space is added to databases only when they are close to failure. A systematic approach requires that trends be monitored. This requires periodic checking of database size, free space in tablespaces, CPU utilization, disk utilization, network utilization, and similar parameters, and taking corrective action to prevent the database from failing. DBAs are primarily interested in monitoring database trends; Chapter 10 discusses the use of STATSPACK for that purpose. Automatic Workload Repository (AWR) can also be used for monitoring database trends, but very few sites are licensed to use it.

Availability Management

Availability Management is “the Process responsible for defining, analysing, Planning, measuring and improving all aspects of the Availability of IT services. Availability Management is responsible for ensuring that all IT Infrastructure, Processes, Tools, Roles etc are appropriate for the agreed Service Level Targets for Availability.”

Most of the DBA’s time can be taken up by tasks relating to availability management. Database tuning (Chapter 16), SQL tuning (Chapter 17), and hardware upgrades are required to keep performance at acceptable levels. Database backups (Chapter 12) are required as insurance against database failures, and recovery testing (Chapter 13) is required to validate the usability of backups and to measure recovery times. An unplanned outage is usually classified as a Sev 1 issue, requiring the immediate attention of the DBA.

Incident Management

Incident Management is “the Process responsible for managing the Lifecycle of all Incidents. The primary Objective of Incident Management is to return the IT Service to Customers as quickly as possible.”

The Incident Management process is usually handled by the Service Desk. The Service Desk prioritizes each issue that comes to its attention and engages with the Operations team as necessary until the problem is resolved.

Problem Management

A problem is “the root cause of one or more incidents.” Problem Management is “the Process responsible for managing the Lifecycle of all Problems. The primary objectives of Problem Management are to prevent Incidents from happening, and to minimize the Impact of Incidents that cannot be prevented. Problem Management includes Problem Control, Error Control and Proactive Problem Management.”

Incident Management is a reactive process, whereas Problem Management is a proactive process. An example of an incident is a database outage caused when the archived log area fills up. The incident may be resolved by removing the oldest archived logs, but the root cause of the incident must also be addressed. For example, disk space may be inadequate, and additional disk space may have to be procured. The Problem Management process ensures that chronic problems are identified and fixed.

Change Management

A change is “the addition, modification or removal of anything that could have an effect on IT Services. The Scope should include all Configuration Items, Processes, Documentation etc.” Change Management is “the Process responsible for controlling the Lifecycle of all Changes. The primary objective of Change Management is to enable beneficial Changes to be made, with minimum disruption to IT Services.”

The Change Management process is tasked with ensuring that changes are appropriately authorized and tested before being applied to the infrastructure that supports the IT services required by the business. The Change Management process is also responsible for ensuring that the risks of the changes are understood and that conflicts are detected. For example, a change to the database may conflict with business tasks.

Configuration Management

Configuration is “a generic term, used to describe a group of Configuration Items that work together to deliver an IT Service, or a recognizable part of an IT Service. Configuration is also used to describe the parameter settings for one or more CIs.” Configuration Management is “the Process responsible for maintaining information about Configuration Items required to deliver an IT Service, including their Relationships. This information is managed throughout the Lifecycle of the CI. The primary objective of Configuration Management is to underpin the delivery of IT Services by providing accurate data to all IT Service Management Processes when and where it is needed.”

In the terminology of Configuration Management, a database is a configuration item. The Configuration Management process is responsible for creating a repository of information about each configuration item and recording changes made to each configuration item. Systematic configuration management improves the effectiveness of the other branches of IT Service Management. For example, ready access to configuration information can help in resolving incidents (Incident Management) and diagnosing chronic problems (Problem Management).

In Chapter 9, you saw how Remote Diagnostic Assistant (RDA) can be used to assemble information about a database into an HTML framework. In the terminology of Configuration Management, the information collected by RDA is called a configuration record and should be stored in a Configuration Management Database (CMDB). You can create a simple CMDB of sorts by linking your RDA collections into an Excel spreadsheet. RDA collections can be run at regular intervals and linked into the spreadsheet. This is an easy way to create and organize a historical record of changes to database configurations.

Image Note  An RDA collection is the simplest and quickest way of documenting a database environment.

Release Management

A Releas0065 is “a collection of hardware, software, documentation, Processes or other Components required to implement one or more approved Changes to IT Services. The contents of each Release are managed, tested, and deployed as a single entity.” Release Management is “the Process responsible for Planning, scheduling and controlling the movement of Releases to Test and Live Environments. The primary objective of Release Management is to ensure that the integrity of the Live Environment is protected and that the correct Components are released. Release Management works closely with Configuration Management and Change Management.”

Release Management is concerned with major changes and additions to the IT infrastructure, such as installation of database software (Chapter 5) and database creation (Chapter 6). You should keep careful notes whenever you install database software and create a database so that the process can be standardized and repeated.

Image Note  Oracle provides the Service Level Management Pack for service-level management, the Change Management Pack for change management, the Configuration Management Pack for configuration management, and the Provisioning Pack for release management, but few sites are licensed to use these tools because they are available only with Enterprise Edition and require extra license and support fees on top of those paid for Enterprise Edition.

Start with the End in Mind: The Ten Deliverables

In his best-selling book The 7 Habits of Highly Effective People (Free Press, 1989), Stephen Covey distills the secrets of effectiveness into seven principles. In my opinion, the most important habit is “Start with the End in Mind”—how you want your work to be evaluated when you have completed it. When Big Bob turned responsibilities over to me, he had little more for me than the database passwords, a firm handshake, and good wishes—he left me very unhappy.

To be effective as a DBA, you must start with the end in mind—the moment when you hand over responsibilities to your successor. What will you give them other than the database passwords, a firm handshake, and good wishes?

Here are the ten deliverables of the database administration role—they map to the ten deliverables of the Operations team, which are listed in the ITIL literature. In large organizations with many DBAs, these are shared deliverables:

  1. A database that meets the needs of the business: This is the most important deliverable, if not the only one. The needs of the business include certain levels of performance, security, and availability. You must understand the needs of the business, you must have a way of evaluating how well the needs are met, and you must have a methodology for meeting those needs. Any chronic performance, security, and availability issues must be discussed with the incoming DBA.
  2. A secure document library: The absence of a document repository causes a lot of valuable information to be lost. Examples of documents that should be retained include service-level agreements, network diagrams, architecture diagrams, licensing information, E-R diagrams, performance reports, audit reports, software manuals, installation notes, project notes, copies of important correspondence, and so on. Standard operating procedures (SOPs) are another important class of documents; you learn about them later in the chapter. Original software media and files should also be stored in the library for use if the database needs to be rebuilt or if additional databases need to be created. Note that the document library needs to be secure because it contains sensitive and confidential information.
  3. Work logs of service requests, alarms, and changes: Work logs are important for many reasons. They bring transparency and visibility to the database administration function. From the incident management perspective, it is necessary to review the work logs and identify inefficiencies and root causes. From the problem management perspective, it is necessary to review the work logs and identify chronic problems. From the availability management perspective, it is necessary to review the work logs and identify availability issues. These are just some examples of how work logs help bring about improvements and efficiencies in your ability to provide good service to the business.
  4. Standard operating procedures: Any database administration task that is done repeatedly should be codified into an SOP. Using a written SOP helps with efficiency and accuracy. We return to this subject later in the chapter.
  5. Procedures and records for backup testing and failover testing: It is absolutely essential that backup procedures and disaster-recovery procedures be documented. The procedures should be periodically tested, and records should be maintained.
  6. Maintenance and batch schedules, documentation, and records: Chapter 14 discussed database maintenance. Database maintenance procedures should be documented, and records should be maintained. If the maintenance procedures are automated, log records should also be automatically created. For example, an RMAN catalog can be used to store backup histories. Any repeating tasks or batch jobs that are the responsibility of the DBA should also be adequately documented, and records should be maintained for them.
  7. Database administration tools: Database administration tools include Oracle-supplied tools such as Database Control, Grid Control, and SQL Developer. The Management Packs, such as Diagnostics Pack, Tuning Pack, Change Management Pack, and Configuration Pack, are very valuable tools, but most organizations don’t purchase licenses to use them because of their high cost and because Enterprise Edition is a prerequisite. Other popular tools are Toad from Quest Software and DBArtisan from Embarcadero Technologies002E
  8. Management reports: Examples of database reports for management are reports on database growth, workload, and performance. STATSPACK and AWR histories should be retained for as long as practicable—the defaults (two weeks in the case of STATSPACK and eight days for AWR) are unsuitable. I suggest retaining data for at least one year if you can afford the space. Baseline snapshots should be retained indefinitely. For example, you can designate the period between 9 a.m. and 10 a.m. every Monday morning as a baseline period so that the snapshots marking the beginning and end of the period are retained indefinitely.
  9. Exception reports: This deliverable includes reports on SLA violations, security violations, backup failures, and the like. For example, a certain stored procedure or SQL statement may have been identified as critical to the business, and an exception report can be produced by mining STATSPACK data.
  10. Audit reports: This deliverable typically refers to audit reports conducted by security auditors but can also refer to internal audits of compliance with organizational processes such as Change Management or database reviews by external consultants. The absence of audits indicates a lack of oversight of the database management function.

The Book You Really Need and the Art of the SOP

The book you really need will never be found in bookstores—it is the book containing all the procedures that you need to operate your databases. You’re going to have to write that book yourself. Nobody can write the book for you, because you have a unique environment and nobody except you would write a book that caters to a unique environment.

Do you know how to start or stop a database? I thought I did—until I went to work in a large network operations center. We had Solaris, AIX, HP/UX, Linux, and Windows. We had Oracle 8i, Oracle 9i, and Oracle 10g. We had VCS, HP Service Guard, Sun Clusters, Data Guard, and RAC. There were so many variations of the startup and shutdown procedures that I could not remember all of them.

A common task such as adding a data file to a database requires different methods depending on whether you are using cooked files, raw devices, or ASM. Additional complexities are introduced by RAC and Data Guard. Raw files in particular are notoriously difficult to manage—they make it easy to damage the database. And, in my experience, DBAs routinely forget the important step of backing up the data file immediately after it is created.

A TRUE STORY

A DBA stopped a database using the shutdown command in preparation for moving some data files to a new location. Unknown to him, the database was managed by Veritas high-availability software, which automatically restarted the database. The database was corrupted when the DBA moved the files.

It happened to me—it could happen to you.

Benefits of SOPs

Here are some of the advantages of SOPs:

  • They improve consistency. A documented procedure is more likely to be executed consistently than one that is not documented.3
  • They improve quality. It is easier to do a good job if you don’t have to rely on memory or invent the procedure.
  • They facilitate continuous improvement. It is easier to improve the quality of a procedure if it is documented than if it is not documented.
  • They promote transparency. Would you hire anyone who insisted on charging you a lot of money but refused to tell you what was involved?
  • They improve efficiency. Things get done faster. They get done correctly the first time. Further, it is easier to improve the efficiency of a procedure if it is documented than if it is not documented.
  • They facilitate planning. It is easier to quantify the labor involved in a written procedure than one that is undocumented. This facilitates planning and project management.
  • They can reduce cost. SOPs established by senior personnel may be delegated to junior personnel who are paid less. Needless to say, there is also a definite cost associated with making mistakes and inefficient execution.
  • They facilitate knowledge transfer. Tribal knowledge is lost when the members of the tribe leave to join other tribes or is forgotten with the passage of time. Written documentation is more permanent. Tribal knowledge is also subject to the Telephone game phenomenon.4
  • They reduce risk. SOPs reduce the risk of things going wrong. These risks are exacerbated when regular performers are unavailable because of vacations, illness, resignation, and so on. Mistakes made by IT departments in large organizations often make the front page of the newspaper.
  • They improve employee morale. Employee morale is high if the organization works well. A dysfunctional organization has low employee morale.
  • They reduce blame games. It’s hard to blame a performer for following a well-established SOP. Responsibility for failure transfers from the performer to the SOP. Of course, the SOP needs to be improved for the next time around.
  • They improve customer satisfaction. The previous advantages would satisfy almost anybody, but we cannot neglect to mention that SOPs must inevitably improve customer satisfaction. Which customer would not be satisfied with your attention to detail?

A common excuse for not writing SOPs is that most tasks are simple enough. Consider the simple task of shutting down a database. Why does a custom SOP need to be written for this task or customized for each individual database? How customized, you might ask, can this simple task get? Well, the procedures for shutting down a database depend on the Oracle database version; the operating system; any high-availability mechanisms such as VCS, HP Serviceguard, and Sun Cluster; and Oracle components such as RAC and ASM. And here are some of the things a DBA might have to do before pressing the buttons that shut down the database:

  • Confirm that there are no conflicts with backup schedules.
  • Confirm that there are no conflicts with batch schedules.
  • Confirm that no incompatible activities have been scheduled at the same time.
  • Confirm that there are no conflicts with the SLA for database availability.
  • Obtain the permission of the business owners of all applications that are directly or indirectly impacted.
  • Send advance notifications to the user community.
  • Determine the impact on replication mechanisms and take the appropriate steps to eliminate or mitigate the impact.
  • Confirm the availability of other performers to bring down applications gracefully prior to shutting down the database.
  • Confirm the access of all performers to databases, servers, and applications.
  • Agree on communication mechanisms and performer handoffs.
  • Establish escalation procedures for use if things go wrong.
  • Establish procedures for use if users or jobs are still connected when the time comes to shut down the database.
  • Blackout alarm mechanisms.
  • Send broadcast messages to users just before the shutdown.

Another example of a simple task that can quickly become complicated is resetting a password. Metalink note 270516.1 explains the lengthy sequence of tasks that must be performed to change the SYSMAN password.

Structure of an SOP

The following sections describe an SOP template that can be customized to each site’s specific requirements and standards. Each SOP may be divided into the following sections.

Overview

The overview section is provided not so much for the benefit of those executing the SOP but for all those who are peripherally involved, such as customers who need the work done and managers or teams responsible for scheduling, approving, or supervising the work. A minimum of technical jargon should be used, and technical details should be suppressed if appropriate. The following points should be addressed:

  • Purpose: This is the key section for nontechnical reviewers or managers002E
  • Risks: A clear description of risks helps in getting approvals to perform the task—it also helps educate the performer. There can even be risks if the task is not performed.
  • Labor and billing: This section specifies the standard fee schedule, which is the standard number of labor hours required to complete the work. In cases where a standard fee schedule cannot easily be constructed because of the variability of the work, guidelines are provided for estimating the work. In some cases, it is appropriate to specify billing details (for example, interdepartmental billing).
  • Scheduling: This section specifies what advance notice is required, what information needs to be supplied by the requestor, what forms must be completed, and scheduling constraints, if any.
  • Prerequisites: Clearly documenting the prerequisites improves the chances that they will actually be met. You might want to include a “nice to have” section.

Testing

This section describes what testing should be completed in a laboratory setting before the real work can begin. Here are the reasons for testing:

  • An SOP may not be perfect. Testing the procedure in a laboratory setting that duplicates the targeted production setting may uncover deficiencies in the procedure.
  • Practice is good preparation. It may have been a long time since the performer last executed the procedure. Also, the SOP may have omitted some of the details, and practice in a laboratory setting will allow less experienced performers to supply them for themselves.
  • Testing smoothens out the approval process. In some cases, testing may not be considered necessary because the work is truly routine—for example, adding a user. In other cases, testing may not be practical because of the huge effort required to duplicate the target environment in a laboratory setting. If testing is unnecessary or impractical, then the author of the SOP should indicate as much and explain why.

Approvals

This section describes whose approval is required and the protocols to be observed—for example, verbal approval, written approval, formal meetings, advance notice, and so on. I have observed that most IT organizations fall into one of two categories:

  • Very little attention is paid to change management. The performers are given free rein to take suitable action. Performer morale is high, but the situation is certainly not desirable from a management viewpoint.
  • A huge amount of attention is paid to change management. Performers are not permitted to do the slightest work without the approval of “change czars.” Performers chafe under this scrutiny and complain of delays and the effort expended in submitting paperwork before artificial deadlines, attending change-management meetings, and answering questions from change czars who do not have expertise in the subject matter.

The approach I recommend is that the SOP explicitly state whose approval is required. The following is a short list:

  • The performers who will be actually performing the tasks
  • Managers of the performers required to perform the task or tasks
  • Representatives of users who will be affected by the task

If all the necessary approvals have been obtained, the approval of the change czars becomes a mere formality.

Notification

This section describes who should be notified before the work begins and the mechanisms and procedures that should be used.

Backup

This section provides step-by-step instructions for creating backups in case the work needs to be undone.

Staging Activities

This section addresses all other preparation steps that are not covered by the approval, notification, and backup sections.

Execution

This section provides step-by-step instructions on how the work should be done.

Verification

This sections states what “successful” execution means. For example, it may require confirmation from a user that they are able to use the application.

Backout

This section provides step-by-step instructions on using the backups if it becomes necessary to reverse the change.

Signoff

This section describes who decides that the performer has executed the work correctly.

Record-Keeping

This section makes the record-keeping requirements explicit. :

Quality Assurance

This section describes any quality assurance procedures that should be used to assess the quality and accuracy of the work.

KILL FIRST, ASK QUESTIONS LATER?

An article titled “A Day in the Life of an Enterprise DBA” was published in the March 1998 issue of Oracle magazine, published by Oracle. The protagonist used the Enterprise Manager tool to perform a variety of tasks. Here is how he diagnosed and fixed a slow system.

To find out who is hitting the system so hard, I start up TopSessions and look at the user-resource usage on the system. I sort based on redo activity to find the culprit. A developer is inserting data into the database and causing significant redo-log activity. A double-click on the user shows me the actual SQL that has been executed. Apparently, the developer is loading data onto the production system during production time. Should I call him before I kill his session? No—kill first, call later. He should know better. Soon after, the supervisors report that the system is running well again.

This brings to mind the 2002 James Bond movie Die Another Day, starring Pierce Brosnan as James Bond and Rosamund Pike as double agent Miranda Frost. In the movie, Frost describes Bond as follows:

He’s a double O, and a wild one as I discovered today. He’ll light the fuse on any explosive situation and be a danger to himself and others. Kill first, ask questions later. I think he’s a blunt instrument whose primary method is to provoke and confront …

Your motto should not be “Kill first, ask questions later” but “Follow the SOP.” That way, you won’t light the fuse on explosive situations and be a danger to yourself and to others!

Suggested SOPs

Here is a list of common database tasks. As you have seen, even everyday tasks such as stopping a database and changing passwords can be nontrivial. If you don’t have the time to write detailed SOPs, you should consider writing at least a few sentences on each topic:

  1. Connecting to a database: This describes how the DBA connects to the database to perform database administration activities. This SOP is invaluable in an emergency when speed is critical or when the primary DBA is unavailable.
  2. Starting a database: This describes how to start the database engine and associated components such as ASM, Data Guard, and database applications.
  3. Stopping a database: This describes how to stop the database engine and associated components.
  4. Backups and recovery: This describes how to perform an ad hoc backup and how to recover the database from backups.
  5. Removing archived redo logs: This describes what to do when the archived log destination fills up.
  6. Standby database maintenance: This describes how to fail over and fail back.
  7. Adding space: This describes the process for adding data files or increasing the size of data files.
  8. Health checks: This describes how to check the health of the database and of applications that use the database.
  9. Adding a user: This describes special procedures and security rules to be followed when adding a user to a database—for example, specific privileges that might be needed by users of particular applications.
  10. Resetting a password: This describes special procedures to be followed when performing password resets—for example, steps to prevent applications from malfunctioning.
  11. Clearing a lock: This describes internal procedures to be followed when terminating a process that is blocking other users.
  12. Maintenance activities: This describes daily maintenance activities, such as investigating backup failures, reviewing the error log, and checking the contents of trace files. It also covers weekly maintenance activities such as regenerating statistics, monthly maintenance activities such as preparing for month-end batch processing, and quarterly maintenance activities such as preparing for quarter-end processing.

Summary

Here is a short summary of the concepts this chapter touched on:

  • IT services are managed using the principles of IT service management (ITSM). The Infrastructure Management team manages the hardware and software that power the services required by the business—it is divided into a Design and Planning (D&P) team, a Deployment team, and an Operations team. The Application Management team designs and develops the software applications underlying the services used by the business. The ITSM team is the interface between the business and the Infrastructure Management team. In particular, the Service Desk is the single point of contact for all users of the services.
  • The DBA is part of the Operations team. This is an intuitive classification, because the term administration suggests the day-to-day maintenance of an operational database. However, people with that title can often be found in the Deployment team and the Applications Management team.
  • Some of the processes of ITSM are strategic: Service Level Management, Financial Management, Capacity Management, Availability Management, and IT Continuity Management (a.k.a disaster recovery). Other branches are tactical: Incident Management, Problem Management, Configuration Management, Change Management, and Release Management.
  • Oracle provides Management Packs for a number of ITSM processes, but most sites are not licensed to use them, because they are available only with Enterprise Edition and require extra license and support fees.
  • Oracle makes it very easy to download, install, and activate software, but this causes organizations to be vulnerable to using unlicensed software. Particular care has to be taken in the case of the Management Packs, which few sites are licensed to use but which are automatically installed and activated during the installation process.
  • Every database requires a disaster recovery plan. The three approaches that can be used are hot standby, warm standby, and cold standby. A cold standby is the cheapest option and is simply a plan for the procurement and provisioning of hardware at a previously identified location, complete reinstallation of software, and database recovery from backup tapes.
  • STATSPACK and AWR histories can be used to monitor trends in workloads and performance. They should be maintained for as long as practicable. Baselines should be retained indefinitely.
  • Incident management is a reactive process, with the goal being to restore service as soon as possible. Problem management is a proactive process, with the goal being to identify and eliminate chronic problems.
  • Configuration management is the linchpin of the other ITSM processes. Oracle provides the Configuration Management Pack, but few sites are licensed to use it—RDA is a cheap and simple substitute.
  • These are the ten deliverables for the database administration role: databases that meet the needs of the business; a secure document library; work logs of service requests, alarms, and changes; standard operating procedures (SOPs); backup testing and failover testing procedures and records; maintenance and batch schedules, documentation, and records; database administration tools; management reports; exception reports; and audit reports.
  • Any database administration task that is done repeatedly should be codified into an SOP. Using a written SOP has many benefits, including efficiency, quality, and consistency.

Exercises

  • Download Oracle Database Licensing Information, 12c Release 1 (12.1) from the Oracle website. Also download Oracle Technology Global Price List. Review the contents of the DBA_FEATURES_USAGE_INFO view in your database. Which extra-cost options and Management Packs have been automatically installed in your practice database? Which extra-cost options and Management Packs are already in use? For example, have AWR collections been automatically scheduled by the Oracle installer? Compute what the total licensing and annual support costs might be if your database was not solely for self-educational purposes.
  • Download ITIL V2 Glossary from www.best-management-practice.com/gempdf/ITIL_Glossary_May_v2_2007.pdf. Find the definitions of the terms in the various ITIL definitions provided in this chapter.
  • Write an SOP to add a data file to your database. Include enough detail that somebody who was unfamiliar with Oracle (for example, a Windows system administrator) would be able to perform this simple task. Remember to include the step of backing up the data file immediately after it was created.

Further Reading

  • ICT Infrastructure Management. Office of Government Commerce, 2002. This book describes information and communications technology (ICT) management and its relationship to IT Service Management. This refers to the ITIL V2 publication.
  • ITIL V2 Glossary v01. Office of Government Commerce, 2006. This is a glossary of terms, definitions, and acronyms used by ITIL V2. You can download it from www.best-management-practice.com/gempdf/ITIL_Glossary_May_v2_2007.pdf.
  • Mullins, Craig S. Database Administration: The Complete Guide to DBA Practices and Procedures (2nd Edition). Addison-Wesley Professional, 2012.
  • Service Delivery. Office of Government Commerce, 2000. This book describes the strategic aspects of IT Service Management, specifically the Service Level Management, Financial Management, Capacity Management, IT Continuity Management, and Release Management processes. This refers to the ITIL V2 publication.
  • Service Support. Office of Government Commerce, 2001. This book describes the tactical aspects of IT Service Management, specifically the Service Desk, Incident Management, Problem Management, Configuration Management, Change Management, and Release Management processes. This refers to the ITIL V2 publication.
  • Van Bon, Jan. Foundations of IT Service Management: Based on ITIL. Van Haren Publishing, 2005. This book is much more affordable than the publications of the Office of Government Commerce and is suitable for beginners.

Footnotes

1The descriptions and definitions used in this chapter are from ITIL V2.

2ITIL ® is a registered trademark and a registered community trademark of the Office of Government Commerce and is registered in the U.S. Patent and Trademark Office. ITIL Glossaries/Acronyms © Crown Copyright Office of Government Commerce. Reproduced with the permission of the Controller of HMSO and the Office of Government Commerce.

3I remember a case when a customer vociferously expressed dissatisfaction with the work performed by a certain individual, going so far as to suggest that he be dismissed from service. Management finally agreed that the real problem lay with the “process,” not with the performer, and that written procedures would be a better solution than dismissing the performer. I believe that IT management in general is too eager to blame the performer rather than the process. Performers are evaluated every year, but organizational processes are rarely evaluated. I believe that improving organizational processes will inevitably lead to improvements in employee performance. The likelihood that unwritten standards will be violated is much greater than that written standards will be violated. Standards can be violated intentionally or unintentionally. An unintentional violation of unwritten standards usually results when a task is performed by a newcomer to the group or when a veteran performer forgets to use one of the elements of the standard. An unintentional violation of written standard usually results from inadequate training or from sloppy execution. Standards can also be intentionally violated. However, the violator has a convenient excuse if the standard is unwritten, and deliberate violations stem from the belief that the standard is imperfect. It is not difficult for experienced individuals to find something about a procedure that they might choose to do differently if left to their own devices, and therefore organizations that rely on unwritten standards are likely to experience steady erosion of standards.

4Telephone is a game in which each participant whispers a sentence to the next. Errors begin to accumulate, and the last participant receives a highly garbled version of the original sentence.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.117.192.247