Chapter 1. Introduction to Disaster Recovery and Business Continuity

 

‘Meet success like a gentleman and disaster like a man.’

 
 --Frederick Edwin Smith (1872-1930)

The business world has changed significantly in the past few years. Organizations have undergone huge technical and non-technical transformations over the last decade. Regardless of the industry, more and more businesses are operating on a 24x7 global basis. Competition has also increased dramatically, and a competitor is now only a mouse click away. Even small organizations with fewer than a dozen employees depend on several modern technologies, and must face worldwide competition, to remain in business. Staying in business, alive and kicking, is of paramount importance to every modern organization. Today, it is not possible to run your business using the same methods and processes that were used five or ten years ago.

Secondly, the advancement and easy availability of new and useful technologies have enabled thousands of organizations worldwide to implement and use them extensively for their day-to-day functions. Today it is almost impossible to run any modern organization without the use of some computer or telecom-related technology. For example, every modern organization will require several computers, databases, Internet access, e-mail, web-hosting, telephones, etc, for running its day-to-day operations. In addition, the customers of every organization have also become heavily dependent on technology for various needs which must be serviced through technology. Though organizations may have implemented several modern technologies, they may not have the expertise to support them internally. Hence, many depend heavily on qualified external vendors and service providers, and that dependence is itself critical. For example, if a vendor is not able to provide timely and efficient service for critical IT functions or a database, the organization can get into serious trouble.

Nobody is immune to risks, but preventing, minimizing and avoiding disasters of all kinds have become extremely important to every organization today. Less than a decade ago concepts like disaster recovery (DR) and business continuity (BC) were almost unknown or just considered as optional academic subjects. The only traditional method organizations followed for DR or BC was to sign up to some insurance for their key equipment along with a few optional covers. But protecting today’s business requires going beyond having some insurance cover and keeping your fingers crossed. In addition to normal business pressures, there is an added pressure to continuously protect businesses from all kinds of threats and risks to survival.

With so much dependence on technology, an important question facing business managers today is: ‘how can you handle predictable disasters striking your business?’ Secondly: ‘who are the best persons to protect your business?’ ‘And what sort of qualification and mindset does one need to work in a DR and BC department?’ ‘Where and how can you find or identify such persons?’ And so on.

How this book will help: Many managers still live in constant fear about how to protect their businesses from various disasters, and worry about who will help. This book clears away such doubts and shows you how DR and BC can be successfully implemented with a simple combination of qualified internal staff, vendors, external consultants and plain common sense. This simple book is aimed at small organizations and IT departments wishing to get a bird’s eye view of the many DR and BC practices around. The various chapters will elaborate on a variety of IT and non-IT disasters that can strike an organization at any time. Each chapter gives short descriptions and explanations of the various terms and concepts used in DR and BC. A fictitious company called RockSolid Corp is used in many examples throughout this book. The entire book is written in a frequently asked question (FAQ) format for easy and speedy reading.

This book also draws on the best management practice contained in BS 25999 to ensure that small organizations are also able to benefit from this guidance.

Who should read this book?

This book is aimed at anyone who is directly or indirectly involved with disaster recovery or business continuity. If you belong to one of the groups mentioned below then you will find this book extremely useful. Though the book is aimed at small and medium organizations the concepts hold good for large organizations too.

  • IT Managers

  • Chief Technical Officers or Chief Information Officers

  • Business managers and consultants

  • Board members

  • Risk and safety officers

  • IT consultants

  • Anyone who has been assigned the responsibility for overseeing DR and BC for their organizations.

What is a disaster?

A general dictionary defines a disaster as an ‘occurrence causing widespread destruction and distress, or a catastrophe’. In a business environment, any event or crisis that adversely affects or disables your organization’s critical business functions is a disaster. According to a number of reputable surveys and studies, hundreds of organizations worldwide go out of business every year because of the disasters that strike them, many of them fully preventable. Most small businesses cannot recover from major disasters, and even large organizations sometimes struggle.

Disasters can come in all shapes and sizes, and from all directions. This can be explained through some examples.

Example – Natural disaster: Suppose that, due to some mishap, there was a major fire in the RockSolid computer data centre, and all the main computers containing years of data and the required business applications were burnt down. This would automatically mean that none of the RockSolid employees would be able to do any work. The entire business could come to a standstill within hours. Recovering from such a disaster would require a huge amount of effort, time and money. In addition, there could be losses in terms of reputation, losing customers, insurance and legal hassles, etc.

Example – Technical disaster: Instead of a fire, suppose there was a serious technical fault, caused by a hacker intrusion, a deadly virus attack or a software bug, that resulted in all computers shutting down. This would also mean that none of the RockSolid employees would be able to work and business would come to a standstill. Recovering from such a disaster would also require a huge amount of effort, time and money.

Example – Lack of knowledge: Organizations can also cripple themselves due to lack of adequate knowledge or by having a penny wise, pound foolish way of thinking.

Finance Department: ‘Hello, techies. Our finance server is not working. Can you fix it immediately?’

Techie: ‘Which one?’

Finance Department: ‘The one that we use in our department. The black system with the green keyboard.’

Techie: ‘I had a look at it, but the hard disk is dead. We will have to replace it. I will call the vendor and arrange for a replacement if possible.’

Finance Department: ‘What about our data?’

Techie: ‘Can’t recover. The disk is dead, and we have not been backing up the data of that server, because nobody told us to. Besides, you did not approve purchase of a tape drive for that machine. Your previous finance manager was maintaining the system herself because of confidential data.’

Finance Department: ‘Gasp!!! We have all our payroll, purchasing, billing, sales and other important financial data for the entire company and customers on that machine. We have just keyed in five years’ data!’

Techie: ‘Too bad. Got to go. I have to attend another support call somewhere.’

Finance Department: ‘Help! Call the CFO!! Call the CEO!!! Call the Army!!!!’

A situation like this can cripple a five-year-old business within an hour. And there are other types of potential disaster. Some disasters could even be deliberate – sabotage, theft, espionage, etc. Hence it is necessary to ensure that organizations have properly tested plans to recover and minimize all predictable and controllable disasters at all times. Today, having a proper and tested DR plan is also a mandatory audit and compliance requirement in many organizations. Naturally, organizations will not be able to safeguard themselves against all types of disaster, but they can definitely safeguard their business against many common types of preventable disasters.

What is disaster recovery?

No modern organization can run its daily operations without computers, software, telecommunications, the Internet and so on. Disasters can cripple businesses within hours. Today’s computer systems and networks are also extremely complex. In view of the complexity and interdependencies of various equipment, processes, people, etc, disasters can strike at any point and at any time. In today’s highly competitive, 24x7 global business environment, the leisurely days when a business could take days or weeks to resume operations are over. If a critical computer system is not working, or unavailable, then businesses may have to close down virtually overnight. In many cases it is almost impossible to switch over to alternative manual or legacy processes for any length of time. Today, businesses must be able to resume operations quickly, almost to the exact point where they stopped when the disaster struck. Though awareness of disaster recovery is increasing everywhere, very few organizations are actually well-equipped to handle disasters and restore normal operations as swiftly as possible.

Disaster recovery (DR) is the methodical preparation and execution of all the steps that will be needed to speedily recover from a disaster, usually one caused by technology. Disaster recovery planning is mainly technology-focused. Technology can mean voice and data communication systems, servers and computers, databases, critical data, web servers, e-mail, etc. Your DR plan should have tested and proven methods to tackle and recover from all predictable and controllable IT disasters for each of the above. For example, if there is a critical server running some crucial software, then your DR plan for that system can be a standby system in an alternative location running the identical software and having daily data synchronization. In addition, the main system can also have disk mirroring, tape backups, a periodic image backup, proper change management processes, etc, for added precautions.

A proper DR plan is of critical importance to your business. It should be documented and periodically updated with key staff, contact information, locations of backups, recovery procedures, vendor information, contracts, communications procedures, and a testing schedule. Additional elements may be necessary, depending on company size. More details are provided later in the book.
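The items listed above can be kept as a simple structured record, so that gaps in the plan are easy to spot. The following is only a minimal sketch in Python: the class name `DRPlan`, its field names and the sample values are illustrative assumptions, not a format prescribed by this book or by any standard.

```python
from dataclasses import dataclass, field, fields
from datetime import date


@dataclass
class DRPlan:
    """Hypothetical record of the items a DR plan should document."""
    system_name: str
    key_staff: list[str] = field(default_factory=list)
    contact_information: dict[str, str] = field(default_factory=dict)
    backup_locations: list[str] = field(default_factory=list)
    recovery_procedures: list[str] = field(default_factory=list)
    vendor_information: dict[str, str] = field(default_factory=dict)
    contracts: list[str] = field(default_factory=list)
    communications_procedures: list[str] = field(default_factory=list)
    testing_schedule: list[date] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Return the names of the sections that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Illustrative, partly completed plan for one system (all values are made up).
plan = DRPlan(
    system_name="finance-server",
    key_staff=["IT manager", "Finance manager"],
    backup_locations=["off-site tape store"],
)

# List the sections that still need to be written before the plan is usable.
for section in plan.missing_sections():
    print(f"Missing: {section}")
```

A check like this does not make a plan good, but it can make an incomplete plan visible at each periodic review.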

What is business continuity?

Business continuity (BC) ensures that certain essential business functions can continue to operate in spite of various disasters striking your organization. BC is a process that identifies various risks that threaten your organization and provides measures to safeguard the interests of its key stakeholders, customers, reputation, brand value, etc. Suppose a technical or non-technical disaster strikes your organization. Naturally all your critical staff will be deployed to try to recover from the disaster. Recovering from the disaster could range from a few minutes to several days, or never. But it is essential in many customer-oriented organizations to ensure that certain ‘minimum’ business functions ‘continue’ to operate even while the main disaster is being attended to. Unless the disaster is very severe and hits all areas, or is not under the control of your management, the entire organization need not come to a standstill.

BC is mainly business-focused and will concentrate on strategies and plans for various disaster events. BC planning will prepare business areas and organizations to survive serious business interruptions, and provides the ability to perform certain ‘critical business functions’ even during a disruptive event. For example, if a major disaster strikes a small bank’s main computer during banking hours, the bank management can speedily decide to allow customers to still deposit and withdraw a nominal amount of cash until such time as the main computer is fixed in the background. This is business continuity, and will ensure that customers have some minimal acceptable service in spite of a disaster. Having business continuity will also help preserve the company’s reputation, image, and so on.

Note: A business continuity solution need not always be a technical one, though there could be a technical disaster. Business continuity is all about providing speedy workable alternatives to minimize adverse impact. Anything that meets the purpose can be classified as business continuity. Business continuity management is managing risks to ensure your critical business functions can continue to provide acceptable levels of service even in the event of a major IT or non-IT disaster. For example, if your entire data centre that houses all the important servers gets damaged in a fire, electrical short circuit or some other sudden disaster, your BC management team should assist in recovering the company from such situations in previously planned ways. Your BC management should prepare your organization with options that apply both before a disaster and if and when one occurs.

If your budgets and resources were unlimited, you could probably build a twin of your entire organization elsewhere. But such luxuries are rarely available or practical. The ultimate choice of which business continuity option you need for each type of disaster should be made in consultation with several departments and business managers. As stated before, your business continuity method need not always be a technical solution. Your BC management team must be able to provide cost-effective and acceptable disaster prevention solutions to each of your critical business functions.

What is crisis management?

Depending on the nature of a disaster, it may be necessary for your organization to convene a group of senior managers to control adverse media reports, handle customer satisfaction, retain deserting customers, etc. This is crisis management. Crisis management is also panic prevention. For example, in the event of a major disaster in a reputable organization, suppose there was no crisis management team. Then, there could be a possibility of a newspaper publishing a negative report causing adverse impacts on the business, stock price, reputation, etc. The media can often blow a simple issue out of all proportion, causing widespread panic and mayhem. Hence a crisis management function becomes important to protect your business from such situations. A crisis management team can ensure that such situations and possibilities are controlled by proactively taking measures to minimize losses of various kinds, including reputation losses.

Table 1. Summary and examples of concepts

Disaster: A reputable bank’s main computer’s hard disk fails on Monday morning during peak banking hours. Banking operations are halted. Tellers cannot verify account balances or do any electronic transactions.

Disaster recovery (DR): Technical staff repair the computer by replacing the hard disk and restoring data as fast as possible. Repair and restore could take several hours or more than a day.

Business continuity (BC): Bank management allow all customers to withdraw up to one thousand dollars manually by filling in and signing a paper withdrawal slip. Other transactions also done by filling in paper forms. Paper information to be fed into the main computer later.

Crisis management (CM): Senior executives of the bank assure customers that the technical problem will not cause any financial loss or improper accounting to anyone.

Note: Although the academic definitions and meanings of DR and BC are different, both terms are used simultaneously in many questions in this book. The reason for this is that the answers and concepts hold good for both in many cases. This book does not worry too much about the exact textbook or academic definitions of various terms. This is because, in the real world, businessmen are not unduly concerned about exact textbook definitions. They are only concerned about quick practical solutions for recovering from business disasters. The main objective of this book is to educate organizations and IT departments on practical and real-world ways of preventing various predictable disasters and continuing in business – it is not a theoretical textbook.

Why are DR and BC important?

As mentioned earlier, organizations have become extremely dependent on technology for their day-to-day operations and servicing their customers. It is not possible for any modern organization to switch over to manual processes for any length of time during a business interruption. A business interruption is any event (sudden or anticipated) that can disrupt normal business at an organization’s location. For example, it is not possible to switch back to manual typewriters, postal service, telex, and hand-written documents, etc, if the entire computer, Internet and e-mail network is down. Another important concern is that any major damage to the infrastructure can result in severe financial losses, loss of reputation, and may even result in closure of the business. Today most companies are interconnected among themselves, and to the outside world via the Internet. Any technology-related or other major failures in the company can result in the company being cut off from the rest of the world. Some of the reasons why disaster recovery and business continuity are important for your business are listed below:

  • Businesses have become extremely dependent on IT. So failures in IT are more likely to affect the business than other areas, and that impact is more likely to be severe.

  • In a networked, workflow type of environment a failure can hamper many departments and units.

  • IT environments have become extremely complex and inter-related, so the number of potential failure points is increasing day by day.

  • When IT fails there is not enough time to recover at a leisurely pace, because of end-user, customer and other business pressures.

  • Without a proven DR and BC process organizations can go out of business within hours or days.

Who are the real owners of DR, BC and CM?

This is actually a tricky question. Most people would say the owners would or should be the person(s) supporting the IT equipment, or the operators handling the business functions. After all, you might argue, they are the ones who operate the system or know how it works. But this is an incorrect assumption. Actually, the true owners of DR, BC and CM are the business managers of your organization. Your organization may have hired some IT staff or an external vendor to provide technical support and baby-sit an important server. But, speaking from a business perspective, those IT staff, operators or external vendors are not really the owners of DR, BC or CM for your organization. For example, if the server stops operating you cannot hold the IT staff responsible for your organization being unable to conduct its business. They may know what it takes to repair or restore the system, but it is your business managers who should know or understand the potential financial, reputational or legal losses from the stoppage of various critical business and IT functions. Your business managers are responsible for ensuring provision of necessary budgets, manpower, resources, alternative methods, etc, to tackle and prevent disasters. They are the real owners of, and ultimately responsible for, DR, BC and CM. The various ways in which your business managers can demonstrate ownership are as follows:

  • Knowledge: Understand what the loss is in terms of financial, reputation, regulatory or legal consequences for disasters related to their critical business functions or IT equipment.

  • Financial support: Provide necessary budgets for comprehensive maintenance of hardware, software, telecom equipment, spares, backup devices, etc. For example, suppose your business managers do not approve the purchase of a good tape drive and the necessary software, or fail to enrol in hardware maintenance for an important server – the IT staff will not be able to do much in the event of a server crash, data loss or some other technical problem on that server.

  • Provide necessary manpower: Your business managers must ensure that departments have the necessary manpower in all areas. It is very common in organizations to skimp on manpower when it comes to support, maintenance, etc, but demand the best from a skeleton workforce. The common saying ‘Hire an Einstein, but refuse his request for a blackboard’ describes a situation that is prevalent in many organizations worldwide. Reduced manpower and facilities in critical areas will inevitably, directly or indirectly, affect the business. See the question on staff ratio later.

  • Implement recommendations: Your business managers must listen to recommendations proposed by technical staff, support staff, etc, for implementing DR and BC environments. Establishing DR and BC is an expensive business. Not every critical IT function can be worked around with a low-cost alternative. It is a common practice in many organizations to ignore or avoid IT and non-IT recommendations by giving standard excuses, like cost, even though the organization is perfectly capable of affording them. If you are serious about DR and BC, then your senior management must support the necessary costs and budgets for implementing all sensible recommendations, industry standards and work-arounds necessary for DR and BC.

  • Get involved: Senior management, including the CEO, must get involved in all aspects of their organization’s DR and BC processes. You must have a ‘Show me’ or ‘Prove it to me’ attitude to ensure your business is truly protected. Nowadays, having a BC or DR site for many organizations is a mandatory business and audit requirement.

  • Policies: Just like other essential policies in HR, finance, etc, a DR and BC policy must be enforced for all critical systems by the senior management.

  • Sustained commitment: DR and BC is a continuous exercise. Remember – DR or BC facilities are like insurance and cost money constantly. It is not enough to show interest and invest some money on a one-off basis. Continuous commitment and expenditure are required to establish proper DR and BC facilities.

What is the cost of a disaster?

A disaster will lead to numerous costs, implications and even long-term damage. It is not only the financial cost of the equipment or process that has failed. There can be hidden costs and problems. It can even have long-term cascading effects. Depending on the nature of the business, the various costs associated with a disaster could include:

  • Business losses

  • Reputation losses

  • Losing customers

  • Stock prices dipping or going into free-fall

  • Employee productivity losses

  • Billing losses

  • Unnecessary expenditure

  • Fines and penalties

  • Lawsuits

  • Travel and logistics expenses

  • Insurance and other hassles

  • Other industry-specific losses.

Business costs: The anticipated loss of money that the company would have made if the systems were working, eg, if the company were doing its business via a website. Amazon.com, for example, could lose thousands of dollars to its competition if their website were down even for a few hours.

Productivity costs: Number of employees affected multiplied by their hourly cost. For example, assume that your organization had hired ten external consultants at a rate of $100 an hour each for developing a software application installed on a particular file server. If that particular server was down for three hours during business hours then your organization would suffer a loss of $3,000 for those three hours. This is because that amount will still need to be paid by you to those consultants without any productive work in return.
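The productivity cost calculation above can be sketched in a few lines of Python. The function name `productivity_cost` and its parameter names are illustrative assumptions; the figures are the ones used in the consultant example.

```python
def productivity_cost(people_affected: int, hourly_cost: float, hours_down: float) -> float:
    """Productivity cost = people affected x hourly cost per person x hours of downtime."""
    return people_affected * hourly_cost * hours_down


# Ten external consultants at $100 an hour each, idle for three hours.
loss = productivity_cost(people_affected=10, hourly_cost=100.0, hours_down=3)
print(f"Productivity loss: ${loss:,.0f}")  # prints "Productivity loss: $3,000"
```

The same function can be applied to each group of affected staff separately and the results added up, since different groups usually have different hourly costs.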

Reputation costs: No specific formula exists to calculate reputation costs. They can range from a minor manageable scratch to a total crash of your company’s stock value and image in the eyes of customers and the general public. For example, if your company purchase order system is down, causing purchase orders to be delayed beyond committed delivery dates, your company may run a risk of losing those orders to your competitors or suffer loss of reputation due to not fulfilling orders in time, etc.

Direct costs: Costs for repair or replacement of the failed equipment, manpower costs, vendor costs, liabilities, etc.

Other costs: Other costs specific to your industry, for example, a customer may bring a lawsuit against your organization for delay.

Depending on the disaster one or more of the above losses can ruin your organization – hence the importance of paying due attention to DR and BC practices and processes. Each of the above should be considered in sufficient detail and the probability of occurrence must be calculated to ensure proper business continuity alternatives. Damage must be estimated in terms of revenue, reputation, security, employees, etc. Based on the study, a detailed BC plan should be prepared and implemented to ensure resumption of business processes following a disruption. Today, having a BC plan is a mandatory business, audit and compliance requirement for many organizations. You may have to prove to regulators and your customers that your internal processes are strong enough to withstand disasters and continue servicing customers. For example, the RockSolid Corp may have to prove to its major external customers that it has adequate DR facilities and that RockSolid can provide essential services even in the event of a disaster.
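One simple way to combine probability of occurrence with estimated damage is to multiply the two and rank the results. The sketch below assumes this basic expected-loss approach, which is only one of several ways to weigh risks. The `Risk` class, the list of disasters and every figure in it are invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    """Hypothetical entry in a risk register."""
    name: str
    annual_probability: float  # estimated chance of occurring in a year, 0.0 to 1.0
    estimated_damage: float    # estimated total loss in dollars if it occurs

    @property
    def expected_annual_loss(self) -> float:
        # Assumed model: expected loss = probability x damage.
        return self.annual_probability * self.estimated_damage


# Made-up figures for illustration only.
risks = [
    Risk("Data centre fire", 0.01, 2_000_000),
    Risk("Server hard disk failure", 0.20, 50_000),
    Risk("Virus outbreak", 0.10, 300_000),
]

# Rank the risks so that the largest expected losses are considered first.
ranked = sorted(risks, key=lambda r: r.expected_annual_loss, reverse=True)
for risk in ranked:
    print(f"{risk.name}: ${risk.expected_annual_loss:,.0f} per year")
```

A ranking like this is only a starting point for discussion. Reputation and legal consequences are hard to express as a single figure and still need the judgement of your business managers.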

Who are the right persons to manage DR and BC work?

DR and BC are nowadays almost a mature science, and there are any number of consultants, templates, certifications and best practices available to everyone. If organizations need to establish DR and BC, it is easy to get competent résumés by the hundred within hours of posting a job advertisement. In spite of such availability, the perfect candidates to manage a DR or BC function need some special skills and a very different mindset, as explained below. They need two skills that no training programme or certification can usually teach:

Skill 1: Nature of a coward

The kind of people who are perfectly suitable for DR and BC departments are those who can think like cowards, talk like cowards, plan like cowards and constantly spread a healthy dose of cowardice around the organization. Every organization that is serious about risk management should nurture, promote and respect cowards in their DR and BC departments to protect their businesses from all the risks they face.

Now you may ask why any organization needs cowards. Nobody has ever erected a statue honouring a coward. Everyone insists on the need for brave leaders everywhere, people who can make tough decisions, are flamboyant, lead and boldly take the road ‘less travelled’. Nobody has ever heard of a coward doing all that. True, bravery, toughness and all those cool flamboyant leadership skills are required to run and grow a business. But if you think a little out of the box, you will see that such people are not exactly suitable for protecting the business, because of what they are and what they don’t want to be. Let me begin my argument in favour of cowards with a couple of examples.

Example:

A ship’s captain wanted sailors for his ship. So he called a dozen hefty-looking chaps and asked who in the group were brave and excellent swimmers. About five of them lifted their hands. To everyone’s surprise, the captain selected the remaining seven as his sailors. When asked why he chose the cowards he replied, ‘The chaps I selected do not know how to swim and are not very brave. So they will try the hardest to keep the ship afloat.’

Investing in cowards could be the best business decision you can take to save your business from predictable and even unpredictable disasters. A Chinese proverb says, ‘Only a coward can create the best defences’. This method should be your approach to protecting your business. A brave man usually does not bother to create many defences because he is always confident that he has the power and strength to withstand and tackle any danger. Also, he is incapable of seeing risks the way a coward can. But a coward knows there are always countless dangers all around that he cannot tackle. So he tries to build the best possible defences. He sees risks and dangers in practically anything that normal people cannot. Applied to your business he or she can smell and see a risk in an instant like a shark that is able to smell blood from miles away.

Cowards have a special advantage that nobody else has. They have no limits in their ability to see and cover risks. They see things that ordinary people cannot, and they think in an extremely paranoid fashion. Fear controls their imagination. A coward trusts no one, not even himself. Cowards have an ‘I will believe it when I see it’ and ‘Prove it to me’ attitude. They don’t believe anything they have not personally seen working to their absolute satisfaction. They can get into nitpicking detail and view risks from countless directions.

For a coward, everything is a risk. Fear helps a coward build fantastic fences. A brave leader will not hesitate to go to war. But a coward will prevent war from happening as long as possible or for ever. For example, a brash and brave manager may take a quick decision to fire an employee on flimsy grounds. But a coward will think of how this incident could affect the business, what safeguards are currently available and how the situation could take an ugly shape. A coward thinks in terms of lawsuits, or the influential contacts the employee may have, or the damage an aggrieved employee could do to the organization.

Skill 2: Leave no important task unfinished.

Another important skill a DR or BC person must have is to leave no task unfinished, as shown in the following example.

Example:

A young man applied for a job as a farm-hand. When the farmer asked for his qualifications, he said, ‘I can sleep when the wind blows’. This puzzled the farmer, but he liked the young man and hired him nonetheless.

A few days later, the farmer and his wife were awakened in the night by a violent storm. They quickly began to check things out to see if all was secure. They found that the shutters of the farmhouse had been securely fastened. A good supply of logs had been set next to the fireplace. And the young man slept soundly. The farmer and his wife then inspected their property. They found that the farm tools had been placed in the storage shed, safe from the elements. The tractor had been moved into the garage. The harvest was already stored inside. There was drinking water in the kitchen. The barn was properly locked. Even the animals were calm. All was well. It was only then that the farmer understood the meaning of the young man’s words, ‘I can sleep when the wind blows’. Since the farmhand did his work loyally and faithfully when the skies were clear, he was prepared for the storm when it broke. And when the wind blew, he was not afraid. He could sleep in peace. And, indeed, he was sleeping in peace.

Moral of the story?

There was nothing dramatic or sensational in the young farmhand’s preparations. He just faithfully did what was needed each day. The story illustrates a principle that is often overlooked about being prepared for various events that occur in life. It is only when we are facing the storm that we wish we had taken care of certain things that needed attention much earlier.

What is a DR or BC site?

A ‘DR site’ is a disaster recovery site. A ‘BC site’ is a business continuity site. The terms are sometimes used interchangeably. Either way, it is usually an alternative site that can be used by the business if the primary or main site fails or becomes inaccessible. For example, assume that your organization provides critical technical support on various financial applications to a key external client. Suppose there is a major IT disaster in the organization preventing your staff from providing support to that client. Then, as part of disaster recovery, certain identified support staff can immediately relocate to your DR or BC site and start providing technical support. Essential support can continue from there while the main site is being rectified. Of course, the DR or BC site must have the necessary IT infrastructure and facilities to provide the required minimum or mutually agreed level of support.

DR or BC sites can be any or all of the following, depending on organization size, importance, and so on:

  • A small or fully-fledged alternative, workable office with essential technical set-up within your city.

  • A small or fully-fledged alternative workable site with essential technical set-up outside the city or in a different state or even a different country.

  • A branch office where essential functions can continue.

  • An outsourced disaster recovery location provided by a third party service provider. Nowadays, many organizations provide generic or custom-made disaster recovery locations for other organizations for a fee.

  • Certain activities can also be done from home if remote connectivity options are available.

What is a command centre?

A command centre is a facility with adequate phone lines and other basic facilities to begin recovery operations. Typically it is a temporary facility used by your senior management team to begin coordinating the recovery process and used until the alternative sites are functional.

Where should a DR or BC site be located?

Several factors need to be considered when establishing a DR or BC site. The choice depends on the nature of your organization and the things it depends on, eg, vendor services, telecom links, availability of materials, etc. It should also take into account the political, geographical, natural, human and other risks associated with the location. For example, a software development company that is heavily dependent on international telecom links cannot have its DR site located in a rural area where the telecom vendors cannot provide data and voice links. Another organization, eg, a manufacturing company, could probably have its DR or BC site, with some essential equipment, located anywhere that has an electrical supply and transport facilities.

It makes business sense to have the DR or BC site located at an acceptable distance from the main site from a logistics perspective. If essential services have to start rapidly, within hours or a business day, from an alternative location, the DR/BC site should be located reasonably near your main site to avoid long travel and associated logistics problems. The time needed to travel to a DR/BC location is a key factor in deciding where it can be located. The various factors to be considered include:

  • Data transfer requirements between main and DR site.

  • Periodicity and amount of data.

  • Ease of travel between main and DR site.

  • Availability of support services, eg, telecom vendors, computer vendors, spare parts, etc.

  • Availability of power, water, etc. It is preferable to have your DR site powered by a different electrical power grid.

  • Political and civil issues of the region. For example, it does not make sense to keep your DR site in a city or country that may suffer civil disturbances.

  • Some organizations prefer to keep their DR site located in other countries. For example, many software development companies in India have an operating DR site in Singapore and have data synchronization between the two. Should a disaster strike the main site, a core essential team in Singapore can continue to provide customer support and keep their data intact.

Establishing and maintaining a ready-to-use DR or BC site is an expensive business. Fortunately, it may not be necessary to really switch over to your DR or BC site for years. But it is like insurance – one can never predict when it will be necessary.

Can organizations handle DR and BC all by themselves?

Disaster recovery is not rocket science. In fact, it is more plain common sense to ensure that your business does not go down the drain due to factors within your organization’s control. A DR-BC plan must be created by involving several departments within your organization. It is not an individual effort, although an individual in a small organization may oversee it. As mentioned before, on the lighter side, it is best to have a person who is paranoid and afraid of anything and everything as a DR-BC manager. Before creating a plan, every organization must classify its functions in terms of priorities and impacts. Your business and technical managers must analyse the business together and rank it in terms of priorities and business impact. For example, organizations may classify all their business functions as Low, Medium and Top priorities with a business impact for each. Obviously, not everything done by the organization can be classified as top priority or high impact.

For example, a general classification could be:

  • What business functions must be up-and-running within minutes or hours of a disaster striking? For example, an organization that depends heavily on e-mail for its business cannot afford to have its e-mail server down for hours and days. It may classify e-mail as top priority, and take all necessary steps to have alternative e-mail systems. Another organization that depends heavily on a web server may classify all its web systems as top priority.

  • What business functions can be down for 24 hours? For example, an organization that depends occasionally on fax can classify its fax services as medium priority and can tolerate a day’s downtime.

  • What business functions can be down for more than 24 hours, more than two days, a week, etc? For example, certain software development projects and product development that is still in the design or development phase can tolerate a few days or weeks of downtime. These can be classified as low priority and can wait.
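The classification above can be sketched in a few lines of Python. This is only an illustration: the function name, the 4-hour cut-off for "minutes or hours" and the downtime figures given for each business function are assumptions chosen for the example, not values from any standard.

```python
# Illustrative cut-offs (assumptions): anything that must be back within
# 4 hours is "Top"; anything that can wait up to a day is "Medium".
TOP_MAX_HOURS = 4
MEDIUM_MAX_HOURS = 24


def classify_priority(tolerable_downtime_hours: float) -> str:
    """Map the longest tolerable downtime of a function to a priority band."""
    if tolerable_downtime_hours < 0:
        raise ValueError("tolerable downtime cannot be negative")
    if tolerable_downtime_hours <= TOP_MAX_HOURS:
        return "Top"
    if tolerable_downtime_hours <= MEDIUM_MAX_HOURS:
        return "Medium"
    return "Low"


# Hypothetical business functions with assumed tolerable downtime in hours.
business_functions = {
    "E-mail server": 1,
    "Fax service": 24,
    "Design-phase development project": 72,
}

for name, hours in business_functions.items():
    print(f"{name}: {classify_priority(hours)} priority")
```

Each organization would set its own cut-offs after its business and technical managers have agreed how long each function can really be down.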

Running DR or BC successfully also depends on several other factors. If your organization has several experienced employees who know each and every business process in detail, how they work and their importance, it is possible to create a reasonably good DR or BC plan. Otherwise, you can hire external consultants or use some standard templates. Templates are detailed checklists prepared by various organizations that can be readily used to assess an organization’s preparedness. For example, the fire department can provide you with a checklist or template that contains several checks for preventing fire. It is also possible to have your building inspected by the fire department to certify whether the building is safe or not. Similarly, a backup software manufacturer can provide a checklist of the important things to ensure during and after a backup of data.

Important tip: Things within an organization’s control must get the necessary priority, budgets and importance. The following checklist can be used:

  • What areas and business functions are completely within your organization’s control? For example, computers, data, backups, etc, are usually within an organization’s control for recovery. Any loss here can be handled by the organization by implementing various safeguards and budgets, using your own manpower and resources.

  • What areas and business functions are partially within your organization’s control? Here there could be some dependence on an external service provider. For example, an organization’s telephone network is provided by a telecom company. An organization cannot have its own independent telephone network separate from the external world, and has to depend on local and international telecom service providers. Problems and shutdowns in the telecom service provider affect the organization’s business, but will not be within its control. If landlines don’t work, perhaps mobile phones can be used temporarily until the telecom department fixes the fault.

  • What areas and business functions are outside your organization’s control? For example, if an office is situated near an oil or gas terminal and a fire erupts within those facilities, it can affect your office and all other nearby offices. Or if there is a terrorist attack, the police may cordon off the entire street or building, preventing your staff from reaching or leaving their workplaces. Business owners will have no say or control in such matters other than cooperating with the government forces in spite of business losses. In such cases businesses may have to resort to insurance claims, alternative sites, delays, etc.

What about DR and BC assistance from external consultants?

Nowadays, disaster recovery consultancy itself is a big business. Hundreds of DR consultants and firms have sprung up all over the world claiming to be the best among the lot. It is also industry-specific. But it is not possible to get a single, good DR consultancy that covers the entire range of business and technical processes, even though they may all claim to be experts in all areas. It is necessary to evaluate the need for inviting external consultants and then decide the way forward. A combination of internal and external expertise would be appropriate in most cases.

However, the best (but unrecognized) DR and BC consultants to start the process could be within your own organization.

Example.

Here is a short story, in which many claim to have been personally involved. Sometime in the 1980s a very important nuclear reactor suddenly stopped working. The design experts, scientists, etc, struggled very hard to set it right, but were not successful. Finally, despite much opposition from the scientists, they decided to call an ex-mechanic who had been involved in the installation of the reactor. The mechanic arrived, looked around for a few minutes, and tightened a bolt in one of the sections, and the machinery started working. Later, he submitted a bill of $5,000 for the repair. Aghast at such an atrocious amount for just tightening a bolt, they demanded an explanation for it. The ex-mechanic split the amount as follows and resubmitted the bill.

1. Service charge for tightening the bolt: $50

2. Knowing exactly which bolt to tighten: $4,950

3. Total: $5,000

As you see from this story, an experienced IT person, electrician, security guard, finance chap, etc, within your own organization may have enough knowledge of what they will need to run the show in the event of a disaster striking their area of work. Their experience should be used and a combination of internal experienced staff with some external consultants would be a good choice. Organizations must select DR consultants carefully and avoid those who only give superficial advice. However, it may not be easy to pinpoint the right consultant or a single consultant for all your business needs. It is better to choose consultants based on the area of DR coverage. Credentials and references play an important role in selecting them. For example, hire a reputable or experienced IT person to recommend IT DR methods, a reputable financial consultant to provide financial DR methods, etc.

Ideally, a DR or BC consultant must be a ‘nuts and bolts’ person who can sit with your key staff to understand your needs and then recommend practical, real world solutions. For example, if your organization wants to have a DR facility for its financial systems, the consultant must sit with your finance team and understand how the system works, the software required, the type of equipment, data synchronization requirements, etc, and then recommend a suitable disaster recovery setup, and must also be able to demonstrate its working with a mock run.

The importance of practical experience: Sir Francis Bacon said long ago, ‘Knowledge is power’. Perhaps this can be modified for today’s world as ‘Practical knowledge is power’. Though professional certifications are becoming very important for any job, practical and real world knowledge is of paramount importance. It is important to ‘first learn the trade before experimenting with tricks of the trade’. Practical hands-on experience and implementation ability are the keys to good DR consultancy.

An Indian mythological story shows the importance of real world experience over pure academic excellence:

Example.

A highly learned scholar was once travelling in a boat. The boat also carried several villagers and fishermen. Wishing to pass the time, the scholar struck up a conversation with the other passengers and started enquiring about their educational qualifications. When he realized that most of them were illiterate and had no academic qualifications, he started showing off his rich knowledge of the Vedas and Upanishads (Hindu sacred texts). He also started insulting them, taunting them that they had wasted a large part of their lives by not studying rich academic works. Suddenly, a violent storm broke out and the boat started leaking. Immediately the boatkeeper advised everyone to jump out and swim to the shore. Everyone jumped out, but the scholar started panicking and held on to the boat. When advised to jump, he shouted that he did not know how to swim. The boatkeeper replied that the scholar had wasted his entire life by not learning how to swim and that his pure academic excellence was not going to help him now, and jumped out of the boat to swim to safety.

A second example is:

Example.

A passenger plane’s pilot suddenly suffered a heart attack and collapsed in the middle of the flight. A frantic air hostess called the tower and shouted for help in landing the plane safely. However, no one in the tower was qualified to guide her on how to land a plane. Then somebody suggested that they could call an aviation professor from a reputable university nearby to help. So they called the university. The professor arrived promptly, picked up the microphone and started his advice in the following manner: ‘Let me first begin with the principles of aerodynamics, before we get into the theory of aircraft engines.’

What kinds of disaster should an organization be aware of?

Disasters can come in all flavours, internal and external, so different factors need to be considered for each critical system. Your entire organization’s processes and systems should be classified into broad categories and tackled one by one. The DR or BC selection process starts with an assessment of the potential risks, their probability and impact for your particular enterprise. Next comes a business impact analysis (BIA). This helps determine which applications and systems require the most protection, based on the value of the data and the business impact of downtime as well as other cost factors. Organizations can broadly classify risks, with their probability of occurrence and impact, as follows:

  • Technical risks: This will cover all IT-related issues, eg, backups, data storage and retrieval, loss of IT equipment, communication failures, virus attacks, software problems, power failures, etc.

  • Non-technical risks: Building security, theft, and access by unauthorized personnel, fire hazards, etc.

  • Political risks: Change of government and policies, civil disturbances, terrorism.

  • Financial and legal risks: Stock market manipulations, bankruptcy, fraud, financial irregularities, failure to comply with legal regulations or standards, etc.

  • Human risks: Losing important staff to competitors, mass resignations, death/injury/illness of key staff, disgruntled employees, workplace harassment, spies and industrial espionage, etc.

  • Reputation risks: All factors that can affect an organization’s image, eg, employee harassment, litigation, legal turmoil, bad publicity, etc.

  • Dependency risks: If an organization depends on external organizations, vendors and even other countries for its business it could be at risk, eg, a restaurant can depend on the existence of a large company nearby: if that large company relocates, this restaurant can go out of business.

  • Natural risks: Flood, earthquake, hurricane, wildfire, etc.

Table 3. Simple risk analysis

| Risk | Probability | Impact |
|---|---|---|
| Technical | High | High |
| Political | Low | High |
| Financial | Medium | High |
| Fire | High | High |

Note: A DR and BC plan is an ongoing process. It can never be perfect or complete.
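A simple risk analysis such as Table 3 can also be turned into a rough ranking. The sketch below uses the ratings from Table 3, but the numeric scale (Low = 1, Medium = 2, High = 3) and the rule of multiplying probability by impact are common conventions assumed here for illustration; they are not prescribed by this chapter.

```python
# Assumed numeric scale for the qualitative ratings.
LEVELS = {"Low": 1, "Medium": 2, "High": 3}

# (risk, probability, impact) rows taken from Table 3.
RISKS = [
    ("Technical", "High", "High"),
    ("Political", "Low", "High"),
    ("Financial", "Medium", "High"),
    ("Fire", "High", "High"),
]


def risk_score(probability: str, impact: str) -> int:
    """Score a risk as probability level multiplied by impact level."""
    return LEVELS[probability] * LEVELS[impact]


def rank_risks(risks):
    """Return the risks sorted from highest to lowest score."""
    return sorted(risks, key=lambda r: risk_score(r[1], r[2]), reverse=True)


for name, probability, impact in rank_risks(RISKS):
    print(f"{name}: score {risk_score(probability, impact)}")
```

With these assumptions the technical and fire risks score highest, followed by the financial and then the political risk, which suggests where planning effort might go first.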

What is a technical risk?

Any organization today will use one or more of the following IT systems:

  • Computers of various sizes and capacities ranging from small laptops to large mainframes.

  • Data backup systems to store and retrieve large amounts of data.

  • E-mail systems for internal and external communication.

  • Telecommunication systems, eg, fax, dial-up lines, leased lines for connecting their offices, branches, etc, within and between cities, states and countries.

  • Various software programs, eg, office suites, databases, remote connectivity tools, monitoring tools, design software, e-mail, etc.

  • Web servers for hosting intranets, public servers, etc.

... and dozens of other enterprise technologies.

Each of the above must be interconnected for the entire organization to function and hence each has the potential to fail in a number of areas. A simple cable disconnection on an international data leased line can cut off every part of the entire organization. Similarly, every item of equipment can fail in its own unique way or behave erratically for various reasons. Heavy usage of any such equipment always entails a hidden risk. For example, if the power supply fluctuates there is a high probability of computer disks crashing or corruption of data on many computers. All such IT-related failures, or potential to fail, can be classified as technical risks and sufficient workable, cost-effective alternatives are needed to minimize risk.

What are some of the most common technical risks?

Some of the technical risks common to most organizations are listed below. Disasters in each can range from simple problems to absolute catastrophes.

  • Risk to data

  • Virus risks

  • Power failure risks

  • Local area network (LAN) failures

  • Information security risks

  • Telecommunication risks

  • Software risks.

Each of these risks will be explained in detail in a separate chapter.

What are some of the most common non-technical risks?

Some of the non-technical risks and disasters that organizations can face are:

  • IT staff disasters

  • IT vendor disasters

  • Reputation disasters

  • Financial disasters

  • Labour union disasters

  • Legal disasters

  • Political disasters

  • Natural disasters

  • Terrorist disasters.

Most of these will be explained in a separate chapter on non-IT disasters – see chapter 12.

What is a business impact analysis (BIA)?

This is a detailed analysis of the impact on your business if a specific set of IT or non-IT services is not available. It tries to determine the risks in terms of revenue loss, reputation loss, productivity loss, etc, if the IT infrastructure or other critical facilities are down due to a disaster. A BIA will consider the following:

  • Impact of damage to premises, data centre, etc.

  • Impact of damage to IT systems: servers, computers, networks, telecommunications, etc.

  • Impact of damage to important data in terms of loss or corruption.

  • Loss of key staff: IT support, business managers, etc.

  • Impact on external and internal customers.

  • Legal and reputation implications if disasters occur.

  • Dependencies on external vendors, suppliers, etc.

  • Impact of security threats: viruses, hackers who may steal confidential information, etc.

  • Impact of damage and loss of power, air conditioners, etc, required for IT services.

  • Damage due to sabotage, natural disasters, political threats, etc.

  • Other industry-specific impacts.

For example, a very basic BIA can be as follows:

Table 4. Simple business impact analysis

| System | Probability | Impact of downtime |
|---|---|---|
| Company web server down | High | $5,000 in lost business per hour |
| Company network down | Medium | Productivity loss of $50 per hour per employee |

Organizations can prepare such charts to decide which business functions require priority in BC planning.
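The hourly figures in Table 4 make it easy to estimate the cost of a given outage. The following is a minimal sketch: the rates come from Table 4, while the function names, the 3-hour outage and the headcount of 40 employees are assumptions made up for the example.

```python
# Hourly impact figures taken from Table 4.
WEB_SERVER_LOSS_PER_HOUR = 5_000        # lost business, in dollars
NETWORK_LOSS_PER_EMPLOYEE_HOUR = 50     # lost productivity, in dollars


def web_server_downtime_cost(hours: float) -> float:
    """Estimated lost business while the company web server is down."""
    return WEB_SERVER_LOSS_PER_HOUR * hours


def network_downtime_cost(hours: float, employees: int) -> float:
    """Estimated productivity loss while the company network is down."""
    return NETWORK_LOSS_PER_EMPLOYEE_HOUR * hours * employees


# Hypothetical scenario: a 3-hour outage in an office of 40 employees.
outage_hours = 3
headcount = 40
print("Web server outage:", web_server_downtime_cost(outage_hours))
print("Network outage:", network_downtime_cost(outage_hours, headcount))
```

Under these assumptions the web server outage costs $15,000 and the network outage $6,000. Comparing such estimates with the cost of the safeguards helps decide which functions deserve priority.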

Who can invoke business continuity?

As part of their business continuity plans, organizations must first decide what qualifies as a disaster. Routine equipment problems, maintenance downtime, short-term problems, etc, should not be termed disasters, nor should alternative facilities be invoked for them. The decision to declare an IT shutdown a disaster must be taken only by senior business and IT managers. A business recovery team can also be constituted: this is a group of qualified and senior staff responsible for maintaining the business recovery procedures and for coordinating the recovery of business functions in an organization. For example, suppose that the entire IT infrastructure is down due to a power fault. If the fault is expected to be rectified within a couple of hours, then the organization need not classify it as a crisis and start rushing employees to invoke its disaster recovery or business continuity procedures. On the other hand, if it is clear that the power failure is more severe and may take longer than an acceptable time to fix, then senior management may invoke the disaster recovery procedures.

The following types of disaster can necessitate invoking business continuity beyond the agreed RTO and RPO (explained later):

  • Severe or major business impact

  • Adverse customer impact

  • High risk exposure to organization

  • Critical system down.
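The reasoning in the power-fault example can be written down as a simple rule of thumb. The sketch below is only an aid to discussion: the function name, its parameters and the 4-hour acceptable downtime are hypothetical, and the real decision always rests with senior business and IT managers.

```python
def recommend_invocation(
    estimated_restore_hours: float,
    acceptable_downtime_hours: float,
    planned_maintenance: bool = False,
) -> bool:
    """Suggest whether management should consider invoking DR procedures.

    Routine or planned downtime is never treated as a disaster. Otherwise,
    invocation is suggested only when the expected time to restore service
    exceeds the downtime the business has agreed it can accept.
    """
    if planned_maintenance:
        return False
    return estimated_restore_hours > acceptable_downtime_hours


# Hypothetical scenario: the business has agreed it can accept 4 hours down.
ACCEPTABLE_HOURS = 4
print(recommend_invocation(2, ACCEPTABLE_HOURS))    # fault fixed in 2 hours
print(recommend_invocation(12, ACCEPTABLE_HOURS))   # severe fault, 12 hours
```

In this scenario the 2-hour fault does not warrant invocation, while the 12-hour fault does.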

What are the options available for business continuity?

Technically and financially it is possible to build a twin of your organization. But not all organizations may want this, or can afford such a luxury. Business continuity is industry-specific. For example, the police, fire, ambulance services, etc, cannot afford to have their IT and other infrastructure down even for a few minutes. Other organizations such as a small automobile spare parts manufacturer may be able to withstand an IT failure for a couple of days. Depending on the organization, size, industry, budgets, etc, companies can have a number of choices:

Manual: If it is possible to use manual methods.

Other offices: If an organization is decentralized and has many independent branches, then it may be possible to use the facilities in another branch until the affected branch comes online again.

Cold standby: Organizations can have an alternative site with basic IT and non-IT facilities that can be switched on during extended failures.

Warm standby: This involves re-establishing critical systems and services within a short period of time, usually achieved by having redundant equipment that can be used during disasters.

Hot standby: This will involve having an alternative site with continuous mirroring of live data and configurations. Facilities of this sort are usually used by banks, the military, etc, which cannot afford any downtime.

What is a DR or BC exercise?

One of the ways of testing the disaster recovery ‘readiness’ of an organization is to conduct frequent mock exercises of the various areas mentioned in the DR plan. This is usually done by simulating a crisis situation. Such mock exercises test the organization’s ability to respond to a crisis in a planned and effective manner instead of becoming chaotic. For example, if the finance department server is a critical DR item, a mock exercise can be conducted on a weekend or after hours by invoking a mock disaster by shutting down the system, sending the finance chaps to work from the DR site and noting down all the issues, limitations, deficiencies, missed out items, etc. These exercises will give a first-hand feeling for how an organization or department can handle a real disaster if one occurs. Appropriate measures can then be taken to ensure proper disaster recovery. For example, your finance department may notice that it is not possible to operate the finance application without connecting at least one printer. Later, during the next mock run, a printer can be hooked up to test the finance application.

What are the biggest roadblocks for disaster recovery or business continuity?

Every businessman would like to have 100% disaster recovery and business continuity. However, very few organizations are actually willing to make the necessary investments in terms of people and budgets to ensure reliable disaster recovery and business continuity environments. Some of the biggest roadblocks that prevent proper disaster recovery or business continuity are:

  • Lack of sustained management commitment: One of the primary roadblocks for disaster recovery is a lack of sustained top management commitment. For example, the top management may approve the establishment of a DR or BC site at a time when they are particularly influenced by business and competitive pressures. But later they may not be willing to invest the necessary ongoing budgets and manpower to keep the site fully operational at all times.

  • Inadequate budgets: Business managers unable or unwilling to invest sufficiently to establish DR and BC options. Disaster recovery options require investment in redundant equipment, spares, data synchronization equipment, software, hardware, training, insurance, alternative sites, etc.

  • Manpower: Not willing to invest in additional technical staff needed to maintain and manage a DR site.

  • Knowledge: Lack of knowledge about what is needed to establish a proper disaster recovery.

  • Other reasons: Various internal factors and limitations.

In fact, in a large number of cases DR and BC plans just remain on paper or have insufficient capability to handle real disasters. But businesses really do need to invest in the necessary budgets and staff if they are to ensure that their businesses are safe from preventable disasters.

How much money is required to establish a proper disaster recovery facility?

It depends on various factors and the nature of the organization. Theoretically, it is possible to establish a twin of the entire organization if budgets were unlimited. However, such a setup is rarely possible or required. Broadly, DR costs can be classified as:

  • People costs: The number of additional employees, contractors or vendors required and trained to handle various disasters.

  • IT costs: How many additional computer systems, software licences, telephones, communication systems, etc, are required?

  • Maintenance and ongoing costs: It is not enough to just establish a fully-fledged disaster recovery setup as a one-off exercise. The whole setup must be properly maintained and periodically updated with new systems, software, data updates, dry runs, etc. This all costs money.

  • Infrastructure and other costs: This involves costs for building rents, electricity costs, air conditioning costs, security, transport, telephone costs, etc.

  • Other costs: Various one-off or ongoing costs.

Some do’s and don’ts

Do’s

  • Identify a dedicated team within your organization to be responsible for DR and BC.

  • Ensure each member knows exactly what he or she is supposed to do and not do.

  • Clearly establish the scope of your DR and BC plans.

  • Analyse all your business functions and sort them by their importance.

  • Have an organizational policy to enforce DR and BC practices.

  • Have periodic meetings on DR issues and keep the DR plan updated regularly.

  • Keep your customers and employees informed.

  • Conduct DR exercises regularly.

  • Keep yourself updated with new industry practices, standards and qualifications.

Don’ts

  • Take DR and BC functions lightly.

  • Give DR inadequate budgets and resources.

  • Ignore internal talent and knowledge within your organization.

Are there any international qualifications for disaster recovery and business continuity?

Yes. More and more employers are looking at certification as a condition of employment, and certification is now often seen as a prerequisite for the hiring of consultants. Today, there are primarily two recognized professional institutions certifying the business continuity professional: the Business Continuity Institute (BCI, www.thebci.org), based in the UK, and DRI International (DRII, www.drii.org), based in Falls Church, Virginia, USA. Both are member-owned, not-for-profit organizations. Both offer certification at different levels.

The BCI has five membership grades:

  • Student

  • Affiliate of the Business Continuity Institute

  • ABCI Associate of the Business Continuity Institute

  • MBCI Member of the Business Continuity Institute

  • FBCI Fellow of the Business Continuity Institute.

DRI International has three membership grades:

  • ABCP Associate Business Continuity Planner

  • CBCP Certified Business Continuity Planner

  • MBCP Master Business Continuity Planner.

For details of the various DR and BC standards, training and certification options, see Appendices 3 and 4.

Is there any training available for disaster recovery, business continuity, etc?

Yes. There are many courses available. The Disaster Recovery Institute International (DRI International) provides various basic and advanced courses. Visit www.drii.org for more information. You can also visit Business Continuity Institute at www.thebci.org. Many universities have also started providing diploma and graduate courses on disaster recovery and business continuity.

For more information on disaster recovery and business continuity training and certification, see Appendix 2.

Are there any international standards for business continuity planning?

Yes. BS25999 is a British Standard that has been developed to provide best practice guidance to organizations on business continuity management. It is a two-part standard. Part 1 is a Code of Practice, and Part 2 is a specification for a management system. An accredited certification scheme exists that enables organizations to achieve external certification of their business continuity arrangements.

For more information on BS25999, see www.itgovernance.co.uk/bs25999.aspx.
