Chapter 17

Conclusion

17.1 Summary

In Part One of the book, we introduced the notion of service availability and presented a set of relevant principles and concepts associated with dependability, with the aim of achieving service continuity even in the presence of failures in the underlying system. We went on to describe briefly the circumstances and the vision that led to the development of open standards for service availability, and how the Service Availability (SA) Forum was founded.

Part Two discussed some of the key services and frameworks specified by the SA Forum for delivering service availability. The main emphasis was on the reasons behind the design choices for the standard, since much of this information is not documented in the specifications at all. This part can therefore be viewed as a presentation of the main principles behind the SA Forum services and frameworks, providing readers with valuable insight into the specifications and thereby enabling them to use the specifications as they were intended.

We started this part of the book with an overview of the architecture of a SA Forum system, and discussed the dependencies and interrelations among its services. In the information model chapter, we explained the background and the considerations that were taken into account as the different parts of the SA Forum Information Model were developed; application designers need to follow these same considerations whenever they extend the information model for their applications.

In the chapter on the various platform services, we provided the rationale behind the designs of the Hardware Platform Interface and the Cluster Membership service, which reflect the physical and the logical platform views of clustered systems, respectively. One is discovery driven while the other is based on configuration, so these two worlds could be perceived as antagonistic; as part of this discussion we therefore described how the Platform Management service bridges the gap between the hardware level and the level where membership in a cluster is handled.

In the Availability Management Framework (AMF) chapter, we revisited some of the key dependability concepts introduced in Part One and placed them in the context of a SA Forum system. Through this we gave a comprehensive introduction to AMF, showing how service availability can be maintained. The AMF specification is often perceived as too complex, without a clear distinction between what should be addressed by developers and what is the concern of configuration designers and administrators. We therefore divided our discussion around the tasks of component development, configuration design, and system administration.

We trust that the main takeaway from the chapter is that AMF can manage service availability at different levels of sophistication for a wide variety of applications. Its capabilities range from the simple life-cycle management of application components that implement none of the application programming interfaces (APIs) to the orchestration of sophisticated switch-over and failover scenarios among components implementing all the bells and whistles possible.

The subsequent chapter on communication and synchronization utilities further developed the support offered to application developers by discussing the Event, Message, and Checkpoint services. These provide the active and standby entities with facilities for state replication and communication that are appropriate for the clustered environment; in particular, they decouple the interacting entities in a location-agnostic way.
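The decoupling idea behind the Checkpoint service can be illustrated with a minimal sketch. This is not the SA Forum Checkpoint service API (which is specified in C); it is a simplified model with hypothetical names, showing only the underlying pattern: the active entity replicates its state under a well-known checkpoint name, and the standby entity retrieves that state on failover without needing to know where the active entity runs.

```python
# Minimal sketch of the checkpoint pattern, NOT the SA Forum Checkpoint API.
# All class and checkpoint names here are hypothetical illustrations.

class CheckpointStore:
    """A registry of named checkpoints shared by active and standby entities."""

    def __init__(self):
        self._checkpoints = {}

    def write(self, name, section, data):
        # Active side: replicate application state under a well-known name.
        self._checkpoints.setdefault(name, {})[section] = data

    def read(self, name, section):
        # Standby side: fetch the latest replica; the writer's location
        # is irrelevant, which is the location-agnostic decoupling.
        return self._checkpoints.get(name, {}).get(section)


store = CheckpointStore()

# Active entity: update local state, then replicate it.
store.write("safCkpt=callSessions", "session-42", {"state": "connected"})

# Standby entity: after a failover, resume from the replicated state.
recovered = store.read("safCkpt=callSessions", "session-42")
print(recovered)
```

In a real SA Forum system the store would of course be provided by the middleware and replicated across nodes, but the division of roles between writer and reader is the same.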

The system management chapter discussed the essential functions for fault management, which include the Log, Notification, and Information Model Management services. One might consider these services less important, as they typically do not contribute to the functionality that the application provides to its users. However, when it comes to service availability everything revolves around management, and fault management is of primary concern, as we showed through these services.

Next we explained the approach taken by the Software Management Framework for upgrading a SA Forum system with minimum service disruption, another aspect essential for service availability. In particular, we focused in the chapter on the notions of software inventory and upgrade campaign. The former is geared toward software vendors responsible for supplying the new or modified software, as the expectation is that they describe their products developed for SA Forum compliant systems in terms of the SA Forum specifications. The latter is concerned with the deployment of new software or configurations, including the failure handling mechanisms during an upgrade.

We concluded this part by giving hints on how to combine the SA Forum services and frameworks, whether, when, and by whom they should be used together, and what should be avoided, in order to achieve the desired service availability goal.

Part Three of the book contained a collection of topics of a more practical nature, addressing issues such as the programming model, implementations of the SA Forum middleware, simple examples, migration paths for non-SA Forum applications, and the use of formal techniques. The aim was to take readers one step closer to daily software design and development tasks, so that a link between the discussed principles and practice could be established.

We started this part with the chapter on the programming model and the API conventions in the C programming language, in which the SA Forum interfaces are definitively specified. Apart from the general discussion of the consistent usage pattern across all the SA Forum services and frameworks, we also included a number of topics and issues that developers frequently face in practice. These included the interaction with the Portable Operating System Interface for Unix (POSIX), memory management, pointer handling, discovering implementation limits, the availability of area service libraries, and the issue of backward compatibility. The Java mappings chapter continued with discussions of the history, rationale, and mapping idioms used in the SA Forum Java language mappings, together with their usage and experience to date. The SA Forum middleware implementations chapter described OpenHPI and OpenSAF, the two most complete and up-to-date open source implementations of the SA Forum specifications.

The chapter on integrating the VideoLAN Client with OpenSAF demonstrated an example of using the SA Forum specifications in practice. We presented the different levels of integration with OpenSAF and discussed the implications of each approach for the overall availability achieved and for the complexity of the implementation. The chapter on migration paths for legacy applications explained the benefits of migrating non-SA Forum applications to the SA Forum middleware. We presented the various integration aspects and gave developers recommendations and guidelines for their choice of integration approach.

In the last chapter of this part, we showed how formal techniques could be used as a basis for tool support for the generation of SA Forum system configurations and of upgrade campaigns for the Software Management Framework. We discussed the challenges and the most promising techniques and platforms for delivering tool support for the tasks of site design and maintenance.

In terms of the specific SA Forum services, we have omitted the Security, Naming, Lock, and Timer services from this book. As pointed out at the beginning, we left security out because it is a topic that crosscuts all the services and could easily have filled a book of this size. Given the nature of the other three services, we feel that their omission does not compromise the overall treatment of service availability. Following our philosophy of making this book as pragmatic as possible, we would have liked to tackle topics such as testing systems after they have been implemented. That would have included discussions of techniques such as failure and event simulation, fault injection and processing, and so on. Again, this would have significantly increased the number of pages needed in the book.

We believe the presentation in this book has achieved the goal we set out at the beginning, and has indeed captured the undocumented reasoning behind the many design decisions of the SA Forum specifications. We feel that this book has succeeded in linking the principles of service availability with its practice.

17.2 The Future

At the time of writing, the SA Forum specifications remain the only open standard with rather comprehensive support for service availability and a proven track record in the telecommunications domain. Coupled with the ongoing open source implementations and the continuous feedback loop from implementation experience used to rectify and improve the specifications, we anticipate that their quality will continue to increase.

Although this book has been written to address specifically the SA Forum specifications, the principles and concepts used in arriving at the various services and frameworks, which are the result of many years of accumulated experience of the many contributors to the SA Forum, are applicable to other types of systems and new technologies as well. One example is the much publicized cloud computing, which has recently evolved into a main trend in the IT and enterprise sectors.

As with any new buzzword coming to town, there is no overall agreement on the definition of cloud computing. However, cloud computing is generally characterized by access to IT infrastructures, platforms, and applications over the Internet, with users charged based on their usage of the requested services. The main offerings come from a diverse range of companies and include Amazon's Elastic Compute Cloud (EC2) [136], Microsoft's Windows Azure [137], VMware's vCloud® [138], and Google's App Engine [139]. On the horizon there are also telecom grade cloud infrastructures [15] suitable for network equipment providers and operators to offer better communication solutions to their customers.

High availability (HA) has been one of the most cited advantages of cloud computing and, at the same time, one of the top obstacles to adopting it [140]. As is evident from recent reports of service outages at cloud computing providers such as Amazon Web Services [4] and Microsoft [141], the HA issue needs to be addressed. Even when this is resolved at the cloud computing provider level, a cloud computing user, typically via platform or infrastructure as a service, still needs to deal with HA issues at the application level.

As an example, given the choice between VMware HA and VMware FT (fault tolerant) [142], an application designer must decide whether the loss of data and service during the period in which a failover is performed by VMware HA is acceptable. If not, then VMware FT should be used instead: it provides a replica running in lock-step with the primary so that there is no loss of data or service even when a host fails. Similarly, a developer using Windows Azure must decide whether the loss of work during a virtual machine restart [143] is acceptable; if not, the developer must introduce his or her own solution. Yet another example comes from the recent Amazon EC2 service disruption report [5], in which the problem was reportedly caused by an erroneous step in an upgrade. The report further identified that some EC2 customer applications were impacted in this event because they used only a single Availability Zone, either because their types of requested services did not give them access to multiple Availability Zones, or because they did not understand that using a single Availability Zone constitutes a single point of failure. As one can imagine, all of these issues must be thought through by developers at the application level, regardless of whether they are using cloud computing or the SA Forum services and frameworks.
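The single-point-of-failure argument can be made concrete with a back-of-the-envelope availability calculation. Assuming each zone fails independently with availability a, a service replicated across n zones is unavailable only when all n zones are down at once, giving a combined availability of 1 - (1 - a)^n. The 99.9% figure below is purely illustrative, not a quoted provider service level agreement:

```python
# Back-of-the-envelope availability of redundant, independently failing
# zones: the service is down only if every hosting zone is down at once.
# The 0.999 (99.9%) per-zone figure is illustrative, not a real SLA.

def combined_availability(zone_availability, zones):
    """Availability of a service replicated across independent zones."""
    return 1.0 - (1.0 - zone_availability) ** zones

single = combined_availability(0.999, 1)  # one zone: a single point of failure
dual = combined_availability(0.999, 2)    # two zones: both must fail together

print(f"1 zone:  {single:.6f}")
print(f"2 zones: {dual:.6f}")
```

Under these assumptions a second zone cuts the unavailable fraction from one in a thousand to one in a million, which is why the EC2 report singled out applications confined to one Availability Zone. The independence assumption is the catch: a fault propagated by a shared upgrade procedure, as in the cited incident, is exactly the kind of correlated failure this simple model does not capture.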

It is important, however, to put these cases into perspective: virtualization as used in cloud computing typically addresses host, that is, hardware, failures, whether or not there is a standby virtual machine running in lock-step. The fact is that failures occur not only in hardware but even more frequently in software, whether the offending piece of code is in the operating system, the hypervisor, the middleware, or the different applications, or perhaps it is the interaction among them that causes the failure. In this latter case it is extremely difficult, if not impossible, to detect and debug the fault at development time. Unfortunately this does not reduce the impact of the failure, and may not even reduce its frequency in the live system.

Furthermore, systems designed for service availability also have to consider, and protect at least to some extent against, administrative and human errors. From this perspective, upgrades are the most critical operations, as they are ‘part of life’ for systems in such continuous operation.

It is unlikely that any single technology or method can provide a solution for all these cases. Experience shows that the different techniques need to be used in unison to provide the best feasible protection at the lowest feasible overhead. We did not emphasize this aspect throughout our book, as it is quite obvious that all these techniques introduce some overhead into the system, and accordingly imply economic and performance consequences that could be studied and evaluated. That in itself is again a whole area of research.

To return to our subject, one needs to approach the problem systematically by analyzing the potential faults at all levels of the system contributing to the delivery of the services, as well as their impact on other parts of the system and on the system's services in general. These need to be matched with the applicable methods of protection. Only with such a global, systematic view is one prepared to make decisions on the design of the system and on the techniques deployed for the protection of the different services. Note that when looking at availability at the service level, the needs may vary based on the different service level agreements regulating each service.

Ultimately, in spite of the relatively long history of research and investigation into dependable systems, with considerable success on the technology and solution fronts, the fact remains that system failures causing unavailability of services occur far too frequently. More often than not, post-mortem reports on these system failures suggest that the causes of the failures had more to do with how the systems were designed than with a lack of viable solutions. This casual observation is also reinforced by the fact that the subject of designing dependable systems is not in the core of computing education, and that computing professionals with skills and experience in HA remain a very small minority of specialists in the field.

This very last point brings out an issue related to educating the next generation of computing professionals. We advocate teaching dependable systems design in the core of the Computer Science [144], Information Technology [145], and Information Systems [146] curricula, instead of treating it as an advanced, specialized, and optional topic. Since we depend more and more on computer-based systems, it is extremely important to bring this issue upfront to the designers of such systems, rather than trying to fix the implementations afterwards. Contrary to the popular belief that this is a difficult topic, the subject can be taught without complex mathematics, by showing the concepts, principles, design, and practice, as we have demonstrated in this book. If we have our way, the days of collecting dossiers on the unavailability of services and their consequences could very well be numbered.
