Figure 6-1: The information security management system.
Chapter 6
Designing Services to Be Fit for Use: Service Design Part 2: The Warranty Processes
In This Chapter
Making sure you can provide what the customer wants
Checking you have enough capacity
Preparing for all eventualities
Keeping IT secure
Would you agree to doing something for someone without checking you have the ways and means of achieving it? If someone asks you for a lift, do you check whether your partner is using the car that day? If someone asks to borrow some money, do you check you have enough spare after you’ve paid your bills? In other words, do you check you have the right assets, and that the assets are properly configured and ready to go?
This chapter and Chapter 5 cover the service design stage of the ITIL service lifecycle. (I introduce the service lifecycle in Chapter 3.) This is the stage that covers the gathering and analysis of the service requirements, and the design of the service to meet the requirements. Chapter 5 covers the design coordination process and the relationship management processes: service level management, service catalogue management and supplier management. In this chapter, I cover what I call the warranty processes: availability, management, capacity management, IT service continuity management, and information security management.
Making Sure the Service Is Available: Availability Management
What do the users do when the service breaks – that is, it becomes unavailable? They complain. Why? Because they can’t achieve their business goals – they can’t get the value that they expect from the service. Think how you feel when you’re at home and your broadband connection or satellite connection fails. I bet you use a few choice expressions to describe your providers!
So what do I mean by availability? The availability of what? The IT service. In accordance with what? The SLAs (see Chapter 5) or agreed business needs. Asking the business about availability is the equivalent of asking, ‘When do you want it?’ So the aim is to keep the users happy by ensuring that the service is actually there and working when they want it.
To give you a better idea of what availability management does, here are some other objectives:
Ensure that service availability meets agreed targets, by managing services and resource-related availability performance.
Ensure that services are designed to meet agreed availability requirements.
Assist with the diagnosis and resolution of availability-related incidents and problems.
Produce and maintain an up-to-date availability plan.
Providing good availability of your service isn’t something that should happen by accident. Some IT departments provide an IT service that they think is about right and then tweak it when the business complains. But ideally, you start thinking about availability when designing the service – in the service design stage of the service lifecycle.
Seeing the process in action
Through the use of the service level management process, the IT department liaises with the business and identifies its needs. For example, if a project has been approved and set up for the development of a new IT service, then, at some point early in the project, the service level manager gathers the service level requirements (SLRs). These include the customer’s requirements for the availability of the service. I don’t know how your organisation measures availability, but often it’s expressed as a percentage – for example, the email service will be available 99 per cent of the time.
Using the availability management process, you perform the activities to review these requirements and check to see whether they are achievable. If the requirements are not achievable, then you make suggestions on how the availability of the service can be improved to meet them, and report back to the service level manager, who can discuss the requirements further with the business. This discussion includes considering who will pay for the necessary improvements.
Once the requirements are agreed, availability management ensures that the service is designed to meet these availability requirements. This is part of the service design stage. During the service transition stage, availability management can get involved in testing to ensure that the design works in the way expected. In the service operation stage, availability management monitors the availability of the service to ensure that the requirements have been met.
Defining some availability management terms
Availability, reliability, maintainability, serviceability and resilience – these are terms you find in the availability management process. That’s a lot of jargon (much of it designed by someone with an -ability passion); but don’t worry, in the next sections I explain each term. To begin:
Availability: According to ITIL, this is ‘the ability of a service or component to perform its agreed function when required’. Availability refers to both the end-to-end service and the components. You normally refer to the availability of the overall service in your SLA. Availability management must consider two levels: service availability and component availability. Component availability depends upon reliability and maintainability.
Reliability: The ITIL books define reliability as ‘a measure of how long a service or component can perform its agreed function without interruption’. In other words, for how long will it work before breaking? Technical staff have a couple of ways of measuring this:
• Mean time between failures (MTBF): On average, how long does the component work before failing?
• Mean time between service incidents (MTBSI): On average, how frequently does the component fail?
Maintainability: ITIL defines maintainability as ‘a measure of how quickly and effectively a service or component can be restored to normal working after a failure’. So, when a component breaks, can you fix it quickly? For example, if a user’s PC breaks, how long will it take you to fix it and help the user get back to work? This is usually measured using mean time to restore service (MTRS): on average, how long does it take to fix this type of component?
Serviceability: In ITIL’s words this is ‘the ability of a third-party supplier to meet the terms of its contract’. So, for components or services that you outsource, you need to be sure that the supplier sticks to its part of the bargain. Serviceability is a bit of a strange term because it just refers to the contractual arrangements you have with a supplier, and you can’t measure this. However, you should be able to measure the agreed levels of availability, reliability and/or maintainability for a supporting service or component.
Resilience: ITIL’s definition of resilience is ‘the ability of a component or IT service to resist failure or to recover quickly following a failure’. For example, you may have two network routers sitting side by side, both doing the same job at the same time. If one fails, the other carries on as if nothing has happened. In this case, the user still receives the service and is blissfully unaware that anything has failed. A second example would be having a spare laptop ready to replace a user’s desktop system – an example of restoring the service very quickly.
Improving availability
Generally speaking, you improve the availability of a service in two ways: increase uptime and decrease downtime (no, these don’t mean the same):
Increasing uptime: Keeping the IT service there for as long as possible requires you to design for availability: use reliable components or build resilience into the design of the service so that if a single component fails, the whole service doesn’t fail. One form of resilience is to build in redundancy: have two components doing the same job, so if one fails the other takes over. If you apply this principle to the entire end-to-end service, you can eliminate any single points of failure, where the failure of a single component can affect many services (for example, a network switch failure can affect many services and many users). Building in a lot of resilience costs a lot of money, so you need to justify the business requirement and have someone who’s prepared to pay.
Decreasing downtime: The ability of your staff to restore the service in a timely fashion when it fails. This includes taking a look at your incident management and problem management processes (see Chapter 8) to ensure they take into account the needs of your critical services. You review your processes and procedures, especially how people contribute.
The speed of recovery relies on the following:
• Detecting the failure quickly: Don’t wait for the user to tell you an incident has occurred; use monitoring tools.
• Diagnosing the fault quickly: Have the right staff available at the right time who can point at a component and say, ‘That’s the one that’s failed.’ You can sometimes use monitoring tools.
• Fixing the fault quickly: Have spare parts available and staff that have practised various scenarios and know what to do to fix the fault and restore the service.
Looking at the activities of availability management
ITIL groups the activities of the availability management process into those that are proactive and those that are reactive.
Proactive activities
The proactive activities of availability management take place during the service design stage of the service lifecycle. This is where you ensure that you have clear requirements for the availability of the service, and the design activities focus on creating a service that meets these requirements. The activities are as follows:
Risk assessment and management. You review the assets and components that make up the service, and identify any potential risks to the availability of the service – what could cause a failure. Then you suggest cost-justifiable improvements to reduce the risks. (You can find out a bit more about risk management in Chapter 9.)
Planning and designing new or changed services. You get the requirements for new or changed services and ensure they translate into requirements of availability. Designing the service involves the factors I describe in the previous section ‘Improving availability’. This activity also involves the design and selection of the components and how they contribute to the overall service availability.
Implementing cost-justifiable countermeasures. The previous two activities yield many suggestions for improving the availability of your IT services. Each suggestion is likely to cost money, and you can’t justify every one. So you review the various options and choose the most appropriate.
Reviewing all new and changed services and testing all availability and resilience mechanisms. You continually review your services and regularly test the fail-over mechanisms you have in place. Only through such activities can you be sure everything will work the way you expect it to, when you want it to.
Continual review and improvement. Nothing remains stable for long: customer needs change and technology progresses. Availability management must be aware of changes coming from the business as well as from other parts of the IT provider organisation, and react accordingly. Regular reviews are an ideal way to ensure that changes don’t go unnoticed and opportunities for improvement and optimisation are captured and acted on.
Reactive activities
The reactive activities of availability management take place during the service operation stage of the service lifecycle. These activities aim to ensure that the targets for availability of the service are met and that any issues are spotted before outages occur:
Monitoring, measuring, analysing, reporting and reviewing service and component availability. You do this to ensure that availability targets are met. You provide much of this data to the service level management process (see Chapter 5) to contribute to the regular reports that are delivered to the customers, usually on a day-to-day basis, by your service operation staff.
Investigating all service and component unavailability and instigating remedial action. Here availability management gets involved with the incident management and problem management processes (see Chapter 8). You can think of this activity as providing expert help when there are availability-related incidents and problems. In addition, availability management ensures that previous problems influence the design of future services.
Have We Got Enough? Capacity Management
After availability (which I cover in the earlier section ‘Looking at the activities of availability management’), what annoys users most is IT services that run slowly. Oh, how they complain. But sometimes speed is a matter of perception: how do you know whether the IT service is running as fast as intended?
Asking the business about capacity is like asking, ‘How much do you want and how fast should it work?’ Capacity management is concerned with both the capacity and performance of the IT service. So the aim is to keep the users happy by ensuring that there is enough of the service to meet the needs of all the users, and that the service works fast enough.
The capacity of the service depends on the component parts. So this may include having enough network bandwidth, enough server processing power, enough data storage, powerful enough PCs. Also consider software. Poorly written software code can lead to an application that uses more resources than necessary. Then you have people. Having enough people may be a management issue not an IT service management issue, but you may be able to use some similar concepts or methods of capacity management to help.
Eh? You’ve got to know how much is needed then provide what you’ve agreed.
Other objectives of capacity management include:
Ensure that service performance achievements meet their agreed targets, by managing the performance and capacity of both services and resources.
Assist with the diagnosis and resolution of performance- and capacity-related incidents and problems.
Ensure that services are designed to meet agreed performance and capacity requirements.
Produce and maintain a capacity plan – in line with the budget lifecycle.
Much of capacity management is a balancing act – in fact, two balancing acts:
Cost against resources: Balancing the need to ensure that the appropriate capacity is present with making the most efficient use of capacity-related resources.
Supply versus demand: Ensuring that the available supply of IT processing power is matched to the demands made on it by the business.
In common with the availability management process (see the earlier section ‘Making Sure the Service is Available: Availability Management’), the requirements for capacity are gathered via the service level management process (see Chapter 5). Will the customer understand what is meant by the capacity of the IT service? I suspect not. But you should be able to ask the customer how many staff use the IT service and when are the busiest times of day. You then follow the same process that I outline in the earlier section ‘Seeing the process in action’.
Defining some capacity management terms
Here are a couple of bits of terminology:
Performance: Performance refers to how quickly the computer system processes your data and responds to requests from users. Performance is often measured by system response time or transaction response time. This is usually the time it takes from when the users press the return key to when something appears on screen.
Think about when you are using the Internet at home. Perhaps you are on your favourite Internet shopping site. How long will you wait for the web page to refresh – 30 seconds, 15 seconds, 5 seconds or just 1 second – before you give up and go to a competitor’s site? Response time is important to users; it can slow down their productivity or just simply frustrate them.
Capacity plan: You use this plan to manage the resources required to deliver IT services. The plan contains scenarios for different predictions of business demand, and costed options to deliver the agreed service level targets. You usually create the plan once a year in line with your budget cycle. The plan has many purposes. One is a method of asking the organisation for more money. Capacity costs money, therefore it makes sense to plan carefully any increase (or decrease) in capacity. The capacity plan documents the need for additional capacity for the coming year, and can be used to justify the expenditure.
Understanding capacity management sub-processes
Capacity management carries out many activities, but the purpose of the activities can vary depending upon which sub-process you consider:
Business capacity management (BCM): Translates business needs and plans into requirements for the service and IT infrastructure to ensure that the future business requirements for the IT services are planned and implemented. To do this, the BCM sub-process receives requirements from the business (via service level management or the service strategy) and reviews them to see whether anything will create a change in demand for the IT services (the process looks for things that may trip up IT).
Typically, market campaigns and promotions cause issues. I have heard of many cases where business units have ‘forgotten’ to mention a TV campaign that led to a large increase in usage of the IT service, and the IT department was unable to plan for it. Even when events are planned for, it’s sometimes difficult to forecast them correctly. For example, in 2011 the UK Government launched a website that allows the public to see crime figures for their region. So many people tried to use the site that it crashed and was taken down.
Service capacity management (SCM): Focuses on the end-to-end IT service and is concerned with whether the agreed service capacity levels are provided to the customers. This involves the management, control and prediction of the end-to-end performance and capacity of the live services. So the key is to monitor and measure against the targets in the SLAs.
Component capacity management (CCM): Focuses on the underlying technology of the service that provides the capacity. You need to understand how each component contributes to the service. Similarly, each component may become a bottleneck. So this sub-process includes the management, control and prediction of the performance utilisation and capacity of individual IT technology components.
Looking at the activities of capacity management
Here are the main activities of the capacity management process:
Reviewing current capacity and performance. When you receive a set of service requirements, you review them to see whether you can achieve them from a capacity and performance point of view. This involves reviewing the current design of your services.
Improving current service and component capacity. You can use many capacity management techniques to identify how to make improvements to your services; check out the ITIL service design book, which has plenty of suggestions.
Assessing, agreeing and documenting new requirements and capacity. Requirements for changes or new services will most likely be received through the service level management process (see Chapter 5). You translate the requirements into detailed capacity requirements. Capacity and resources cost money, so you carefully assess each requirement, and any improvements must be cost-justifiable.
Planning new capacity. After you review capacity requirements and decide how you are to meet the requirements, you must plan to make the resources available. The capacity plan contains your recommendations for achieving this, along with the costs. The plan should be regularly reviewed and kept up to date.
Being Prepared for Anything: IT Service Continuity Management
Imagine your main computer centre is located near a main road. One day a tanker carrying a toxic substance crashes on the main road and the police create an exclusion zone and evacuate the area. How long can you operate your IT service for? Your services are still available and they still work; technically, there has been no failure. The organisation’s business is evacuated along with the IT staff. How long is it before you need access, which you are denied by the police, to the services? Is this an occasion when you invoke your disaster recovery – or IT service continuity – plan?
IT service continuity management (ITSCM) is more or less another name for disaster recovery.
Business continuity management refers to the activities that your business performs to decide what to do in the event of a disaster or other large event that prevents the business from operating. So, the business comes first. It decides what business processes are essential to the viability of the organisation. The prime concern of an organisation when faced with some unexpected event is to remain in business. It should plan for this. This is business continuity management. When the business realises that one or many of these activities involve an IT service, the IT provider must get involved in planning how to recover the essential IT services. Once you’ve got the requirements from the customer, you then identify the appropriate way of recovering the service. This, of course, costs money.
Defining some IT service continuity management terms
The following sections explain some of the terminology you come across in the ITIL service continuity management process.
Business impact analysis
In the event of a disaster, it’s important for the IT department to have a plan in place that tells it which IT services should be recovered and in what order. Who tells IT what to do first? Simple answer: the business. The severity of the disaster is determined by the size of the impact on the business.
An important part of IT service continuity is to perform a business impact analysis (BIA). This is, of course, an analysis of the impact of a disaster on the business.
Disasters can have an impact on businesses in many ways, and not all impacts occur at the same time. If a bank’s main trading IT service were to fail, the impact would occur immediately and the bank would lose money. However, if the human resources system failed, the bank might be able to manage without it for several days.
Number 1: The trading service. To be recovered in ten minutes.
Number 2: The email service. To be recovered in four hours.
Number 3: The human resources service. To be recovered in two days.
Risk analysis
The BIA (see the previous section) helps you determine which services you have to recover. Now you can look more closely at these services to identify the risks to them. You may consider threats such as flood, fire and pestilence (disasters have been caused by rodents eating through cables). You can then consider how vulnerable your systems are to the threats. For example, the river that your IT data centre is near is not the only source of a threat of flooding. The threat could just as easily be that the pipe which feeds the coffee machine in the corridor situated over the top of the computer room bursts and floods your equipment.
After you identify and value risks, you must decide whether to take action. This may be risk reduction – in other words, move the coffee machine! Or the appropriate action may be to set up a recovery plan. You can find out a little more about risk by looking at Chapter 9.
IT service continuity strategy
Your IT service continuity strategy is your decision about how you intend to recover your IT services in the event of a disaster. There are a couple of extremes:
A complete duplicate of your computer room, or data centre: You have duplicate servers, data storage and network connections in another location some distance away from your main computer room. You ensure that the data is duplicated on both systems, so when you have a disaster you simply switch to your second system. Great, but very expensive.
An empty room available that you can use as your backup data centre: This must be far enough away from your main system so that both cannot be affected by the same disaster. So when you have a disaster, you have to acquire some hardware and software from your favourite computer store and take it to this empty room and rebuild your systems from scratch. In reality, I hope this is less ad hoc. You have a proper plan in place. However, it does describe the other extreme at which recovering your IT services is likely to take several days.
Other strategies exist in between these two extremes, and the clever bit is to decide which is most appropriate to your organisation. You make the decision based on the result of your BIA and risk analysis (see the preceding sections).
Looking at the activities of IT service continuity management
These activities aren’t a process for invoking disaster recovery but a process for recognising the need to plan for significant events and for setting up the appropriate facilities:
Initiation: Getting started. This requires recognition of the importance of business continuity management and ITSCM from senior management. Hopefully this is followed by getting approval, funding and support to establish business continuity management and ITSCM. To get started, someone must decide whether you need a project team, a department or just an individual to be responsible for ITSCM. Then you must create a plan to introduce ITSCM.
Requirements and strategy: This activity is split in two:
• Requirements: This is the point at which you must perform a BIA and a risk analysis (see the earlier sections on these). The results of these two activities tell you which IT services you must recover in the event of a disaster, in which order you must recover them, and how long you have to recover each IT service.
• Strategy: You take the output of the business impact analysis and risk analysis and decide the type of recovery site. You can decide to have no recovery system or a fully equipped duplicate system.
Implementation: This doesn’t mean implementing or invoking your continuity plan. Now you’ve decided what recovery mechanism you need, you set it up, which usually involves:
• Developing the ITSCM plans and procedures
• Organisational planning – deciding who’ll do what
• Building or acquiring the recovery site and systems
• Initial testing of the recovery systems
Ongoing management: Once you have your continuity plan in place, for how long will it remain up to date? Not long! As soon as someone makes a change to one of your IT services, your continuity plan is out of date. That’s life! Here are some ideas of things you can do to ensure the continuity plan remains up to date:
• Education, awareness and training of all those affected or involved
• Regular review and audit of your ITSCM processes
• Regular testing of your recovery systems
• Linking in to the change management processes (see Chapter 7) so that you can trap anything that may alter your recovery plans
Ensuring Security: Information Security Management
Where are your valuables stored? Don’t worry, I’m not going to come round and steal them! I hope they’re secure. I expect that your money is in the bank and your valuable possessions are stored somewhere safe. What do you do when young children come to your house? Do you put your favourite vase or ornaments out of reach so they don’t get damaged? I’m sure children don’t deliberately break things, but accidents happen. Mind you, do you set expectations? I still remember from my childhood, a friend’s parent telling me, ‘This is how you behave when you are in our house: be careful not to touch anything.’ So what has this got to do with information security management? The point is, we all have valuable possessions, and we do whatever we have to do to protect them. What’s more, one of the best ways to protect things from accidental damage is to set expectations and encourage people to act in a specific way when they’re in your domain.
Instead of the word possession, ITIL uses the word asset. An asset is something of value. Not all your assets have the same value. So here’s what you do:
1. Decide which are your most valuable assets. Start with the ones you can’t manage without.
2. Decide what the risks are to the assets. For example, could the assets be stolen by a burglar or damaged in a fire?
3. Decide the best way of protecting your assets.
One of your most valuable IT assets is likely to be the information and data that is stored on and used by your systems. This is one of the most important things to protect. The data is at the heart of your organisation. It may be data about the products you sell, what they cost, and the customers you sell them to. Each organisation’s data will be different. In IT terms, this data is stored on hard discs and data volumes scattered around your IT systems. The data must be protected against deliberate or accidental damage. In order to do this you also have to protect your IT infrastructure and applications. This is the same as locking your house when you go out in order to protect your valuables that are hidden inside. Your house is the equivalent of the infrastructure, and your valuables are the equivalent of your IT data.
So, the information security management process:
Provides strategic direction for security activities
Ensures that a management system is in place to manage all aspects of IT security
Manages information security risks
Protects the interests of those relying on information, and the systems and communications that deliver the information, from harm resulting from failures of availability, confidentiality and integrity
Relating this information to the other warranty aspects, the users cannot get value if the service is not there (unavailable), or it goes slow (lack of capacity), or if the data and information they use is missing, corrupt or unavailable due to a security incident.
Defining some information security management terms
The following sections Dummify some ITIL technical terms.
Confidentiality, integrity and availability
One way of considering the most appropriate way of protecting your IT assets is to think about confidentiality, integrity and availability:
Confidentiality: Making sure that only the right people know where your assets are, and that the information is seen by only those who have a right to know. IT has clever ways of doing this, including setting access (when you’re given a username and password), and encryption of highly confidential data (electronically coding it so that the recipient’s system has to know the code in order to read it).
Integrity: Making sure your assets don’t get damaged. IT data and information can easily get corrupted if not protected correctly, so you need to be sure that your information is complete, accurate and protected against unauthorised modification. This is what your anti-virus system does.
Availability: Making sure the assets are there when needed. Ensuring that information is available and usable when required, and systems can resist attacks and recover from or prevent failures.
Security risks
Once you’ve established the customer’s security requirements, you must identify any risks that may prevent you from achieving what the customer wants. This includes a review of the assets and components used to provide the IT service. The review identifies anything that’s a risk to security. Here are some examples of risks:
Systems that don’t have password control
Personal USB devices brought in from home and connected to a PC
Users who write their passwords on sticky notes and leave them on the side of the PC
Getting things under control
Once you have established the sort of risks you are facing (see the preceding section) you need to put controls in place to deal with them.
What is a control? When you add a lock to your house, you’re adding a control. Similarly, when you take out insurance, this is also a control. So you must add similar control to protect your IT services. Here are some examples of security controls:
Installing firewalls to detect and prevent intruders
Taking regular data backups so you can restore corrupt data
Mechanisms that, when your systems are attacked, shut them down to prevent further harm
User access controls (usernames and passwords) to prevent the wrong people getting access to your systems
Information security policy
Once you understand the need for security in your organisation, don’t keep it a secret – tell everyone how important it is. The best way to do this is to create an information security policy: a document that makes it clear to the entire business how seriously the organisation takes security. Circulate the policy to everyone, not just within the IT provider organisation, but to the users and customers.
Access control: Who is allowed to use your IT services and how they get access
Anti-virus policy: What the anti-virus system protects and how often it is updated
Email and Internet policies: What websites users can and can’t visit; what they can and can’t download
Use and misuse of IT assets: What users can and can’t do to their desktop system
Information security management system
The information security management system (ISMS) is a framework of activities. Have a look at Figure 6-1.
Figure 6-1: The information security management system.
© Crown copyright 2011. Reproduced under licence from the Cabinet Office.
This framework of activities wasn’t invented by ITIL but is based on stuff taken from a standard called ISO/IEC 27001. Best not to reinvent the wheel then. The ISMS provides a simple approach to planning and implementing the controls and measures needed for information security management:
Control: The controls needed to maintain information security to support the business needs; includes the organisational structures, roles and responsibilities needed to manage information security
Plan: Identification of the requirements of information security and the planning of how they will be achieved
Implement: Implementation of the controls and mechanisms for achieving the planned level of security
Evaluate: The review and measurement of the success of the controls and mechanisms
Maintain: The continual improvement of information security
If you think the approach of the ISMS looks familiar, you’re right. It’s based on a well-known quality management approach known as the Deming Cycle. The cycle is the plan, do, check, act method for gradual improvement in any organisation. You can find out more about the Deming Cycle in Chapter 9.
Looking at the activities of information security management
The activities of the information security management process are as follows:
Produce and maintain an information security policy. Does what it says on the tin.
Communicate, implement and enforce adherence to all security policies. You must make it clear how seriously security is taken by your organisation. Some organisations link not adhering to the security policy to disciplinary procedures.
Assess and categorise information assets, risks and vulnerabilities. This involves allocating some relative value to each security risk, so that you know which risks are most significant to your organisation.
Regularly assess, review and report security risks and threats. Things change, so you need to do this regularly and keep things up to date.
Impose and review risk security controls, and review and implement risk mitigation: Here you implement the controls and mechanisms that you’ve identified to support your information security policy.
Monitor and manage security incidents and breaches. Breaches must not go unnoticed. Your policy will not be taken seriously if you don’t police it.
Report, review and reduce security breaches and major incidents: You must learn from the things that happen in your organisation.
Identifying Service Design Roles
Each service management process should have a process owner and a process manager. I explain these two roles in Chapter 2. Some of the service design roles are listed in Chapter 5. For the warranty processes, the relevant process manager roles are:
Availability manager
Capacity manager
IT service continuity manager
Information security manager
Each role is responsible for the activities described in the appropriate section of this chapter.
3.21.39.142