Chapter 9

Model-Based Software Management: The Software Management Framework

Maria Toeroe

Ericsson, Town of Mount Royal, Quebec, Canada

9.1 Introduction

The key expectation toward a system providing Service Availability (SA) is that its services are available to their users at virtually any time, be it at 3 a.m. on a Sunday or 3 p.m. on a Thursday afternoon. While a couple of decades ago only a few services had to satisfy this requirement, today, with the Internet becoming the infrastructure for all aspects of life, the possibility of shutting down the system for maintenance and software upgrades is long gone. Moreover, with the Internet's global reach even the notion of slow business hours is fading away quickly: 3 a.m. in North America means 3 p.m. on the other side of the globe, with all its urgency that is not localized any more.

Nevertheless the maintenance and upgrades still have to be carried out somehow at some point and the Software Management Framework (SMF) [49] of the SA Forum offers an approach to resolve this conflict.

As we have discussed in Chapter 6 the Availability Management Framework (AMF) [48] is already designed so that it is capable of maintaining SA in the presence of a failure. The solution is based on the redundancy introduced to the system so that services provided by a failing component can be failed over to another healthy component standing by for exactly this purpose. Luckily just because we introduced this redundancy our component should not fail more often, so this redundant standby is there for most of the time ‘idling,’ waiting for something to happen that requires it to take over the service provisioning. It is also clear that it does not have to be a failure that causes the take over. AMF also provides administrative operations that can be used to rearrange the assignments. For example, if we would like to perform some maintenance on or upgrade the active component we can take advantage of this redundancy as well and use administrative control for coordination. Thus, we do not need a completely new solution; we only need to complement the existing one—AMF—so that we can handle the additional requirements of coordinating upgrades and reconfigurations. The SA Forum Technical Workgroup has defined the SMF exactly with this intention in mind.

In this chapter we look at the solution offered by the SMF, which achieves its task in tight collaboration with AMF and the Information Model Management service (IMM).

The first version of the SMF specification revolves mainly around two aspects of software management that we are going to discuss in detail: the software inventory and the upgrade campaign.

The chapter introduces the entity type, which is a key concept related to the software inventory. We have already seen that different Application Interface Specification (AIS) services define entities that are represented in the information model. We also indicated that these entities often have types that reflect a grouping of these entities based on their similarities. The primary source of these similarities is that these entities run the same software, which leads us to the domain of software management. We will look in more detail at the notions and requirements SMF introduces toward software vendors to describe their software intended for an SA Forum compliant system and how this information can be used to derive the different entity types.

The chapter also introduces the method SMF utilizes to deploy new software and new configurations. Its key notion is the upgrade campaign that the SMF specification defined to be able to describe for an SMF implementation the target of the upgrade and the procedures to be used to achieve this target. It is reflected in the information model to provide a way to monitor the execution.

The discussion covers the failure handling mechanisms required by the specification during upgrades, their basic concepts, and failure case coverage. It discusses the additional measures that may need to be taken by applications and a middleware implementation.

Finally we indicate the areas of software management not covered by the first version of the SMF specification that need to be dealt with in future releases or by other means.

9.2 Background

The first question is what we mean by software management in the context of SA. Is it any different from software management in general? Even those who have an idea about software management in this context may still wonder what it has to do with an application programming interface (API) specification such as the SA Forum AIS.

There might be different interpretations, but software management typically covers the software life-cycle from the moment of the completion of its development—software delivery—to the moment it is deployed in a target system. That is, it normally deals with the packaging, the versioning, and the distribution of the software product on the software vendor side; and on the customer side, with the installation, verification, configuration, and finally the deployment of the software on the target site where it is used for the services it can provide to its users.

There are a number of solutions, even standards, covering the first part (i.e., packaging, versioning, and distribution). Many of these solutions also address the second part to different extents.

From the perspective of SA this second part is the critical one, because deploying a new version of some software usually means that the instance currently running the old version is terminated and replaced with an instance running the new version. Obviously, during this replacement the terminated instance cannot provide any services. The provisioning needs to be switched to another redundant instance in a similar manner as in the case of failures discussed for AMF in Chapter 6.

The redundancy introduced to protect against failures in systems built for SA can be used in other contexts as well. From this perspective it does not matter whether a service provider entity is taken out of service for the time of its upgrade or it went out of service due to an error and needs to be repaired before it can take assignments again.

Yet, the problem is not exactly the same.

The notable difference between failure handling and software upgrade is that while a repair restarts the same software, an upgrade (or downgrade) starts a different version of the software, and in case of redundancy this new version needs to work together with the old one at least for some time.

So the issue of compatibility comes into the picture.

Compatibility issues are of course not completely new and unique to systems providing SA. Today's systems are built from software components coming from different vendors. System integration can be a challenging task as all these software and hardware components need to be compatible and work with each other flawlessly to compose a system with the intended characteristics and features.

Unfortunately it is virtually impossible to guarantee this flawlessness as the essence of software solutions is their flexibility with which comes the huge state space a running instance of the software can be in at any moment of time. Putting together two or more such instances explodes the state space very quickly and only a finite number of combinations can be tested in finite time.

One attempt to solve the compatibility problem is the standardization of the interfaces through which different software pieces interact. Even so, an interface creates a dependency: if it changes, both parties collaborating through it need to align. This means that if one piece of software is upgraded to the new interface, the other one needs to follow, otherwise they will not be able to collaborate. This in turn means that the scope of the upgrade may need to be widened to resolve all compatibility issues, and instead of the upgrade of a single software piece the complete vertical software stack needs to be upgraded.

With redundant applications and systems the compatibility issue expands yet into another dimension. Now software components that are peers and working together to protect a service (i.e., the same functionality) may run different software versions yet need to be compatible to collaborate properly. It is as if the horizontal dimension of these peers was added to the vertical dimension of the stack of software pieces collaborating to achieve certain functionality.

Another aspect that needs to be taken into account at deployment is that different pieces of software may impact each other just by being collocated, because they use the same resources. All of us using computers are familiar with the popup window that tells us that the upgrade will take place only after a system restart. This means that not only the services provided by the software entities targeted by the upgrade, but all services of the entire computer will be impacted by the restart operation, and only because their software is executing on the same computer.

Yet the biggest challenge in case of SA is that during upgrades and system reconfigurations there are so many things that may go wrong. Murphy's Law applies many times over. There are synchronizations that need to be fulfilled because of dependencies in addition to those normally present between the peers protecting the services. There could be errors, bugs in the system introduced by the upgrade itself or happening as part of the ‘normal’ operation. Issues may be the result of a given combination of software that might have eluded testing, which is not that difficult after all considering the combined state space of the system.

In short, the software management solution for SA systems needs to be able to recover from conditions that are beyond normal operation; and it needs to be able to do so with the least service impact.

Finally, to top all these challenges, upgrades are very software specific, meaning that different pieces of software may require different solutions for an optimal result. There are a lot of specificities at the application level that need application level coordination and application level knowledge, and that are therefore hard or impossible to handle from the system or middleware level, being transparent to these lower layers.

The first version of the SA Forum SMF specification could not possibly address all these issues in a limited time with limited resources. Instead the SA Forum Technical Workgroup defined a limited scope that would enable the addition, the removal, and the upgrade of AMF applications, and in particular the addition or removal of partial or complete applications as well as the upgrade of the redundant logical entities of running applications.

In spite of this limited scope the concepts defined in the specification were designed so that it will be possible to expand them to the entire system potentially supporting even hardware upgrades and reconfigurations.

In the rest of the chapter we present the software management solution offered in the SA Forum SMF specification and we also list the issues that this first version does not address.

9.3 Software Management a la Carte

9.3.1 Overview of the SA Forum Solution

The SMF is one of the latest additions to the SA Forum AIS. As the previous section showed there were great challenges—a big shoe to fill. So the decision was to proceed stepwise and the primary goal of the first version of the specification was to facilitate the addition, removal, upgrade, and reconfiguration of applications while maintaining SA.

Since in SA Forum systems the AMF is responsible for maintaining SA and applications providing such services are expected to be AMF applications, the solution offered in the current SMF also takes into consideration for SA only AMF applications. As a result it heavily relies on the AMF itself and the AMF concepts in general. SMF builds around the AMF information model, the features associated with the represented AMF entities and the AMF administrative API.

However, since in the long run the solution should not be limited to AMF entities only the main concepts of the SMF were defined at a higher abstraction level and therefore they are more generic.

The SMF specification defines the concept of the deployment configuration, which is the collection of all the software deployed in a system as well as the logical entities configured through the information model. Accordingly it identifies two areas of software management:

The first area is concerned with the software inventory, the delivery of new software to the system, the removal of any obsolete one, and the description of the features, capabilities of the software, and its configuration options. The SMF specification identifies this area as the software delivery phase and the related information model as the software catalog.

The second area of concern is the act of deploying a new configuration of entities in a live system which may remove, replace existing, or add some new entities or just rearrange, reconfigure existing ones, and any combination of these options. The specification refers to this as the deployment phase.

Note that when existing applications are being reconfigured and rearranged in the deployment phase, there may not always be a need for a preceding delivery phase if there is no software involved (e.g., reconfiguration or removal of entities); or the deployment may require only the installation and/or removal of the software on certain nodes within the cluster using an image already delivered to the system and stored in its software repository.

The assumption is that logically there is only one software repository in an SA Forum compliant system and the images in this repository can be used at any time for software installation, validation, repair, and removal as necessary. The ‘unit’ of these images is the software bundle. The validity, consistency and integrity of software bundles are checked when the bundles are delivered to the repository.

However, the first version of SMF does not define operations for this check, for the delivery, or for any other operation concerned with the repository. Neither does it define the format or the exact content of software bundles. The reason for this is that these are platform specific and therefore implementation specific details, and most platforms have already gone through their definition. An example is the RPM Package Manager (RPM) format (formerly known as Red Hat Package Manager), widely used on Linux systems, together with the related utilities standardized in the Linux Standard Base specifications [88]. It would have been a huge effort to define these and it would have been just a re-inventing of the wheel.

So the focus was shifted to the issues that are specific to SA. In this respect there was only one item identified in the area of software delivery, and that was the description of the software capabilities and configuration options. For this purpose the SMF specification includes an XML (eXtensible Markup Language) schema [81, 89] to be used by software vendors to describe their products intended for SA Forum compliant systems and in particular for software to be managed by AMF. We will look at the entity types file (ETF) [90] and its role in more detail in Section 9.3.2.

While the software inventory is purely the concern of the SMF, the logical entities that need to be configured and upgraded are the concern of other AIS services such as the AMF or the Platform Management service (PLM), or even of user applications, considering that the IMM [38] is not restricted to AIS services only.

The expectation is that all these entities are somehow reflected in the information model as (managed) objects, which define the configuration of the entities that the system should implement.

This is a key point: The SMF focuses only on the configured content of the system information model, which is manipulated during upgrades and reconfigurations. The runtime content should follow this as the consequence of the configuration changes being deployed.

One may also realize that even though in the information model there might be seemingly independent objects that are implemented by different AIS services and applications, they are not necessarily independent. They may reflect different aspects of the same logical entities in the system.

For example, an AMF application is described through components and service units (SUs), and so on, for the purpose of the availability management performed by AMF. In the actual system these components may be processes that are started and stopped by AMF using the component life-cycle commands and the AMF API as described in Chapter 6. These same processes may communicate using message queues of the Message service [44], which in turn implements the queues themselves. However their creation and deletion is controlled by the application processes (started as AMF components) and they may potentially use some application specific configuration information, which can be part of the system information model as configuration objects the same way as the AMF components and SUs.

Obviously, changes in the software the components are running may require changes in this application specific part of the configuration as well, and all these changes need to be made in a synchronized way during an upgrade while maintaining SA during and of course after the upgrade.

To deal with the problem at hand the SMF specification defines the concept of the upgrade campaign, which is composed of upgrade procedures that may or may not be ordered (Figure 9.1). Each of the upgrade procedures consists of one or more upgrade steps typically executed in sequence. Each of the steps upgrades and/or reconfigures a set of entities by performing the necessary software installations and removals and the information model changes in a synchronized way. We will go into more detail on the upgrade campaign specification in Section 9.3.3 and its execution in Section 9.3.4.

Figure 9.1 Overall view of the upgrade campaign. (Based on [49].)
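The containment just described (campaign, procedures, steps, target entities) can be pictured as a small data structure. The following Python sketch is only an illustration: the class and field names and the example entity and bundle names are our own assumptions and are not taken from the SMF specification or its XML schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UpgradeStep:
    """One step: upgrades and/or reconfigures a set of entities together."""
    target_entities: List[str]                               # hypothetical entity names
    bundles_to_install: List[str] = field(default_factory=list)
    bundles_to_remove: List[str] = field(default_factory=list)
    model_changes: List[str] = field(default_factory=list)   # information model edits


@dataclass
class UpgradeProcedure:
    """A procedure: one or more steps, typically executed in sequence."""
    name: str
    steps: List[UpgradeStep]


@dataclass
class UpgradeCampaign:
    """A campaign: a collection of procedures that may or may not be ordered."""
    name: str
    procedures: List[UpgradeProcedure]


def flatten(campaign: UpgradeCampaign) -> List[str]:
    """List every procedure/step/entity combination in declaration order."""
    return [
        f"{proc.name}/step{idx}/{entity}"
        for proc in campaign.procedures
        for idx, step in enumerate(proc.steps, start=1)
        for entity in step.target_entities
    ]


campaign = UpgradeCampaign(
    name="example-campaign",
    procedures=[
        UpgradeProcedure(
            name="proc-A",
            steps=[
                UpgradeStep(["su-1"], ["bundle-v2"], ["bundle-v1"]),
                UpgradeStep(["su-2"], ["bundle-v2"], ["bundle-v1"]),
            ],
        )
    ],
)
print(flatten(campaign))
```

Flattening the structure this way gives the kind of "what is touched, and in which step" listing that an administrator would want when following the execution.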


To perform an upgrade or reconfiguration an SMF implementation expects an upgrade campaign specification. It is an XML file that follows the upgrade campaign specification (UCS) XML schema [90] accompanying the SMF specification.

The upgrade campaign specification relies on templates as shorthand to identify existing target entities in the system. At the same time they also allow for the specification of generic upgrade campaigns applicable to different deployments that an SMF implementation can tailor to the actual configuration of the system at runtime.
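To illustrate how a template can act as shorthand, the following Python sketch resolves one template against two different deployments. It is a minimal sketch under our own assumptions: the matching criteria (entity type and parent), the class names, and the entity names are hypothetical and do not reproduce the template elements of the UCS schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entity:
    name: str          # hypothetical entity name
    entity_type: str   # for example, the SU type of an AMF service unit
    parent: str        # for example, the service group the SU belongs to


@dataclass(frozen=True)
class TargetTemplate:
    """Selects entities by shared features; None means 'do not care'."""
    entity_type: Optional[str] = None
    parent: Optional[str] = None

    def matches(self, entity: Entity) -> bool:
        type_ok = self.entity_type is None or entity.entity_type == self.entity_type
        parent_ok = self.parent is None or entity.parent == self.parent
        return type_ok and parent_ok


def resolve(template: TargetTemplate, configuration: List[Entity]) -> List[str]:
    """Tailor the template to the actual configuration found at runtime."""
    return [e.name for e in configuration if template.matches(e)]


template = TargetTemplate(entity_type="SuTypeA-v1")

small_site = [
    Entity("su-1", "SuTypeA-v1", "sg-1"),
    Entity("su-2", "SuTypeA-v1", "sg-1"),
]
large_site = small_site + [
    Entity("su-3", "SuTypeA-v1", "sg-2"),
    Entity("su-4", "SuTypeB-v1", "sg-2"),
]

print(resolve(template, small_site))   # the same template ...
print(resolve(template, large_site))   # ... yields a different target list
```

The same campaign text thus applies to both sites, and the concrete list of targets is only determined when the template is evaluated against the deployed configuration.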

When executing an upgrade campaign the system administrator would like to see a completely different perspective: He or she would like to know exactly what is being upgraded, how the upgrade is progressing and if something fails where it has happened in the sequence of operations of the upgrade campaign and physically in the system.

There is also a need to recover from failures automatically as much as possible, and with as little service outage and loss of any kind as possible.

To address these needs the SMF specification also defines an information model and associated state models for the upgrade campaign that reflect the status of the campaign execution.

Through the state models, SMF also identifies failure scenarios that may occur during campaign execution and their handling options that an administrator can choose from should an error be detected. Section 9.3.4 goes into all these details at some length.

9.3.2 Entity Types File: Is It Eaten or Drunk by SMF?

Vendors are expected to deliver their software intended for SA Forum compliant systems as software bundles and describe the content of each software bundle in an associated ETF. It is an XML file following the standardized XML schema that allows the characterization of the software in AIS terms. That is, according to the features the different AIS services define for their entities. The first release of the specification focused only on the AMF entities and therefore the schema also contains only elements for them; however, the approach is extendable to other services and even applications as needed.

We have seen in Chapters 5 and 6 that both PLM and AMF define types in their information model that characterize groups of entities in the system, which run the same software and have similar configuration. As a result these types serve as a single point of control for these groups of entities from the perspective of AMF and PLM. It also means that these types fully define all configuration features required for the group of entities they characterize. The main concepts of this view are defined by the SMF basic model presented in Figure 9.2.

Figure 9.2 The SMF basic information model [49].


When it comes to the software itself, however, it is usually more flexible and can be deployed in a variety of ways. For example, we have seen that the AMF needs to know how many component service instances (CSIs) it can assign to a component in the active and the standby states, so for the component type we can configure that all components of the type can accept two active assignments or five standbys. Configuring two actives or five standbys in the AMF information model does not necessarily imply that this is all that the software is capable of. It may only mean that this is the configuration that AMF needs to apply at runtime due to other limitations. One may restrict a configuration to limit the load, to synchronize the collaboration between different pieces, and for many other reasons. Software that can handle multiple active and standby assignments at the same time may also be used in configurations that require only active assignments (e.g., N-way-active redundancy model), or either active or standby assignments, but never both at the same time (e.g., 2N or N+M redundancy model).

To characterize all such possible (AMF) types may be impossible; instead the SMF introduces the concept of the software entity prototype. A prototype does not have to be fully specified like the entity types of the AMF information model, for example; it only needs to indicate the limitations built into the software implementation so that proper types can be derived from them.

If a feature is not specified in the prototype description, it means that the software has no limitations in that respect. For example, the prototype of the above component may not have limits for the number of active and standby assignments at all, which would mean that the software implementation can handle any number of assignments; or the prototype may limit the active assignments to a maximum of 5 and the standbys to 10, as limited in the implementing code.

The ETF is the description of these prototypes that a system integrator can use to create a valid configuration in which the derived types describe all and the exact configuration attributes appropriate for the particular deployment configuration.
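The relation between a prototype and a type derived from it can be sketched as a simple check. In the Python fragment below an unspecified limit is modeled as `None`, mirroring the rule that a missing feature means no limitation; the class names, attribute names, and the `violations` helper are hypothetical and are not part of the ETF schema or the AMF information model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ComponentPrototype:
    """Vendor-side description: only the limits built into the software."""
    name: str
    max_active: Optional[int] = None    # None: no limitation in the software
    max_standby: Optional[int] = None   # None: no limitation in the software


@dataclass(frozen=True)
class ComponentType:
    """Deployment-side type: every attribute must be fully specified."""
    name: str
    max_active: int
    max_standby: int


def violations(prototype: ComponentPrototype, derived: ComponentType) -> List[str]:
    """Return the prototype limits that the derived type would exceed."""
    problems = []
    if prototype.max_active is not None and derived.max_active > prototype.max_active:
        problems.append(f"active: {derived.max_active} > {prototype.max_active}")
    if prototype.max_standby is not None and derived.max_standby > prototype.max_standby:
        problems.append(f"standby: {derived.max_standby} > {prototype.max_standby}")
    return problems


limited = ComponentPrototype("proto", max_active=5, max_standby=10)
unlimited = ComponentPrototype("proto-unlimited")

print(violations(limited, ComponentType("type-2-5", 2, 5)))          # within limits
print(violations(limited, ComponentType("type-6-5", 6, 5)))          # too many actives
print(violations(unlimited, ComponentType("type-big", 100, 100)))    # no limits at all
```

A type with two actives and five standbys is therefore one valid derivation among many from a prototype limited to 5 and 10, which is the freedom the system integrator uses when tailoring the configuration.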

The AMF also defines compound entities (SUs, service groups, etc.) and their types. These characterize the collaboration of entities in the system.

Again, if we look at their limitations we can see that some software implementations may make assumptions about or restrict the way they collaborate with other software pieces. Others may have no limitations at all. A component prototype, for example, may require another one, and this sponsoring component prototype may specify that an instance of it can collaborate with at most 10 component instances. Such restrictions can also be described in the ETF through the appropriate compound entity prototype.

Since these restrictions are optional, there are mandatory and optional prototypes in the ETF—at least when an ETF is used to describe the content of a software bundle. The mandatory prototypes are those that describe the basic building blocks delivered by the software bundle. In case of AMF, these building blocks are the component prototype and the component service prototype. If there is no restriction on the combination of these in a deployment configuration, there is no need to describe any further prototypes. Otherwise the specified compound entity prototypes describe the ways that these elementary prototypes can be combined at deployment.

Note that the notion of software description with respect to its potential configurations opens the door to the automation of configuration generation. Indeed creating a valid configuration can be a challenging task for systems that deploy many different software pieces collaborating with each other and also configured for SA. To appreciate the challenge one just needs to recall the AMF information model with its variety of object classes describing entities and their inter-relations. All these need to be fully defined in a consistent manner for AMF before it can start to manage the system.

From this perspective one may also perceive the ETF as a set of constraints a configuration needs to satisfy, which means that it can be used to introduce these constraints to the configuration generation method. Of course in this case an ETF may not be associated with a particular software bundle or it may be associated with more than one.

We will explore further the potential of automatic configuration generation in Chapter 16. Here we simply make the point that the SMF specification has opened the door and enabled this possibility.

The SMF information model in force at the time of writing does not include the object classes describing the different entity prototypes. It encompasses only the object classes describing the software bundles and the derived entity types of the different AIS services. Some entity types reference directly the software bundles that delivered them.

The reason for this is that the system information model is focused on the information essential at runtime, and since the prototypes are not used directly in the system they are not essential. They may become part of the information model or different implementations of the SMF may include them to automate software management beyond the scope of the current SA Forum specifications.

So the answer to the ‘proverbial’ question of this subsection is that according to the first version of the specification SMF neither drinks nor eats ETF. It is for tools supporting SMF and the SA Forum compliant system in general, which may become part of the system in the future.

9.3.3 The Upgrade Campaign and Its Specification

A UCS is also an XML file, like the ETF, and it describes for SMF implementations the parameterization and customization of the sequence of standard upgrade actions to perform to carry out the upgrade.

In other words, the SMF specification defines a standard process for upgrades which can be customized depending on the target. The upgrade campaign specification then focuses only on this customization: it describes the object of the different standard actions and, if needed, defines the appropriate custom actions.

As opposed to the ETF, an SMF implementation requires a UCS file as an input for an upgrade (which in reality may be a downgrade or ‘just’ a reconfiguration).

The specification defines three main parts of an upgrade campaign: the initialization, the campaign body, and the campaign wrap-up.

9.3.3.1 Campaign Initialization

During campaign initialization the SMF prepares the system for the upgrade campaign. This includes checking the following prerequisites:

1. The system is in a state that ensures—as much as possible—the successful execution:

This includes the verification that the SMF implementation is healthy and the software repository is accessible; that all software bundles needed (new and old) are available in the repository; that the upgrade campaign is applicable (e.g., the system configuration has not been updated compared to the referenced one); that there are enough resources in the system to perform the upgrade, such as memory and disk space; and that no other upgrade campaign is running.

If necessary SMF may even verify through API callbacks that the applications are ready for an upgrade as we will see in Section 9.3.6.

2. The system is in a state that does not jeopardize SA unnecessarily during the execution of the campaign:

Upgrades may be performed for different purposes. Sometimes they add new services, remove old ones, or improve their quality, but not in a way that would justify an outage. In these cases if the system does not have the required redundancy, for example, one would not want to initiate the campaign execution. Other times the goal of the upgrade may actually be the repair of the degraded system, the improvement of its security or stability. In these cases, particularly if it is a repair, the system may not have a chance to recover and may further degrade without the execution of the campaign, which makes it urgent regardless of the outage it may cause. Yet another issue is that services may be protected at different levels by the system. Dropping an active assignment for a service instance (SI) that has three standby assignments normally is less permissible than losing one which is configured to have a single active assignment and no standbys at all.

To be able to decide which case it is, the campaign specification includes a section that indicates the permitted outage during execution in terms of AMF SIs. While typically no outage is allowed, an urgent campaign would allow any outage—all SIs may go unassigned. In other cases the configuration may be such that dropping some less important SIs is necessary and therefore allowed.

3. The system can prepare for the upgrade so that it can recover from failures occurring during execution:

The most important preparation is the creation of a backup that can be used to restore the current consistent system state from scratch should everything else fail. If such a backup is not possible, the campaign execution is not allowed to start. In less drastic cases SMF may be asked to gracefully undo the changes made by the upgrade, for which it needs to reserve the required resources (for example, history files and checkpoints) as the SMF implementation requires. SMF is also required to log the entire upgrade process, so it needs to prepare for this too.

In addition, since the system continues to provide services even during the upgrade and therefore its state is changing continuously, applications may also want to log their application specific changes. This needs to be done in such a way that, should the backup be restored at any point during the upgrade, the applications will be able to use these logs to recover the changes performed from the moment of the backup until the moment of the failure that made the restore necessary. This recovery process is usually referred to as roll-forward, and the first version of the specification leaves it to the applications to decide whether it is needed.

Interested applications may want to receive a callback from SMF that indicates that they should create a backup and anything associated with this operation.
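The three groups of prerequisites above can be summarized as a gate that either lets the campaign start or reports why it may not. The following Python sketch is an assumption-laden illustration: the snapshot fields, the way the expected outage is obtained, and all names are hypothetical, and the resource checks (memory, disk space) are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class SystemSnapshot:
    """Hypothetical view of the system at campaign initialization."""
    smf_healthy: bool
    repository_accessible: bool
    bundles_in_repository: Set[str]
    configuration_base: str
    other_campaign_running: bool
    backup_possible: bool
    sis_expected_unassigned: Set[str]   # SIs that would lose their assignment


@dataclass
class CampaignRequirements:
    """Hypothetical subset of what a campaign specification states."""
    required_bundles: Set[str]
    configuration_base: str
    permitted_outage: Set[str] = field(default_factory=set)  # SIs allowed to drop
    any_outage_permitted: bool = False                        # urgent campaign


def blocking_reasons(system: SystemSnapshot, campaign: CampaignRequirements) -> List[str]:
    """Return every reason the campaign must not start; empty means go."""
    reasons = []
    # 1. The system state should make successful execution likely.
    if not system.smf_healthy:
        reasons.append("SMF implementation is not healthy")
    if not system.repository_accessible:
        reasons.append("software repository is not accessible")
    missing = campaign.required_bundles - system.bundles_in_repository
    if missing:
        reasons.append("missing bundles: " + ", ".join(sorted(missing)))
    if system.configuration_base != campaign.configuration_base:
        reasons.append("configuration differs from the referenced base")
    if system.other_campaign_running:
        reasons.append("another campaign is running")
    # 2. SA must not be jeopardized beyond the permitted outage.
    if not campaign.any_outage_permitted:
        excess = system.sis_expected_unassigned - campaign.permitted_outage
        if excess:
            reasons.append("outage not permitted for: " + ", ".join(sorted(excess)))
    # 3. Recovery must be prepared: without a backup the campaign cannot start.
    if not system.backup_possible:
        reasons.append("backup cannot be created")
    return reasons


system = SystemSnapshot(True, True, {"bundle-v1", "bundle-v2"}, "base-7",
                        False, True, {"si-low"})
strict = CampaignRequirements({"bundle-v1", "bundle-v2"}, "base-7")
tolerant = CampaignRequirements({"bundle-v1", "bundle-v2"}, "base-7",
                                permitted_outage={"si-low"})
urgent = CampaignRequirements({"bundle-v1", "bundle-v2"}, "base-7",
                              any_outage_permitted=True)

print(blocking_reasons(system, strict))
print(blocking_reasons(system, tolerant))
print(blocking_reasons(system, urgent))
```

With the same degraded system, the strict campaign is refused, while the campaign that explicitly permits dropping the less important SI and the urgent campaign are both allowed to proceed.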

As part of the standard operations of the campaign initialization SMF adds the new entity types to the system information model.

As mentioned, the campaign initialization can be tailored by specifying

  • The configuration base to which the campaign applies;
  • The list of new software bundles, which need to be available in the software repository;
  • The description of the new entity types to be added to the information model;
  • The parameterization of the standard callbacks to registered users.

In addition, one may specify any administrative operations, information model changes, custom callbacks, and command line interface (CLI) operations.

The upgrade campaign specification contains only the customizations of the standard operations and additional custom operations. It does not contain the standard operations themselves.
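The division of labor between the standard operations, which belong to the SMF implementation, and the customizations, which come from the campaign specification, can be sketched as follows. This is a rough illustration only: the field names, the wording of the actions, and in particular the order in which the standard actions are listed are our own assumptions rather than anything prescribed by the specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InitializationCustomization:
    """Hypothetical holder for what the campaign specification supplies."""
    configuration_base: str
    new_bundles: List[str]
    new_entity_types: List[str]
    callback_parameters: Dict[str, str] = field(default_factory=dict)
    custom_actions: List[str] = field(default_factory=list)


def initialization_plan(spec: InitializationCustomization) -> List[str]:
    """Expand the fixed, implementation-owned sequence with the given parameters.

    The sequence itself lives here, in the (hypothetical) implementation;
    the specification only fills in its parameters and appends custom actions.
    """
    plan = [f"check configuration base: {spec.configuration_base}"]
    plan += [f"check bundle in repository: {b}" for b in spec.new_bundles]
    plan += [f"standard callback: {label}={value}"
             for label, value in sorted(spec.callback_parameters.items())]
    plan += [f"add entity type to information model: {t}"
             for t in spec.new_entity_types]
    plan += [f"custom action: {a}" for a in spec.custom_actions]
    return plan


spec = InitializationCustomization(
    configuration_base="base-7",
    new_bundles=["bundle-v2"],
    new_entity_types=["SuTypeA-v2"],
    callback_parameters={"backup": "before-campaign"},
    custom_actions=["cli: prepare-database"],
)
for action in initialization_plan(spec):
    print(action)
```

Two campaigns with different customizations would produce different plans from the same function, which is the sense in which the specification file does not need to repeat the standard operations.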

9.3.3.2 Campaign Body

The upgrade campaign body is dedicated to the actual upgrade procedures that deploy the changes in the system. The first version of the specification distinguishes two types of upgrade procedures based on the method they use for the upgrade.

  • Single-step procedure

The single-step procedure—not surprisingly—consists of a single upgrade step. It is geared toward upgrades during which SA is not a concern. Typically these are the cases when new entities are added to the system or when old entities are removed. In the former case SA is not a concern as the services the new entities will provide are not provided yet or if provided they are provided by other entities already existing in the system. Similarly in the latter case, when we remove entities from the system it also means that we do not count on them any more for SA. Their services are either removed or provided by other entities.

  • Rolling upgrade procedure

The second procedure type uses the rolling upgrade method. The idea behind the rolling method is to reuse the redundancy that exists in the system for fault tolerance for the purpose of the upgrade as well. That is, we split up the group of entities protecting some services into subgroups of redundant entities so that taking one such subgroup out of service does not cause any service outage. Then we take these subgroups out of service one at a time and upgrade their entities within an upgrade step. The upgrade completes once all subgroups have been upgraded, that is, when the upgrade steps have rolled over all the subgroups. Since the entities in the different subgroups are redundant they are expected to be similar, which means they can be characterized through a template describing their common features. For AMF entities, this may be as simple as having the same component type or SU type or belonging to the same service group, which typically implies having the same SU type.1 This similarity also means that these entities are upgradable using the same upgrade step.

What is an Upgrade Step?

The specification defines an upgrade step as a standard sequence of actions performed in order to upgrade a set of software entities. We first list the actions of this sequence and then explain them one by one:

1. Online installation of new software
2. Lock deactivation unit
3. Terminate deactivation unit
4. Offline removal of old software
5. Modification of the information model
6. Offline installation of the new software
7. Instantiation of the activation unit
8. Unlock of the activation unit
9. Online removal of old software

Installation and Removal of the Software

The installation of the software means the creation of an executable image of the software within the SA Forum system using the software bundle in the software repository as a source. The removal operation removes such a previously installed software image from the system. The installation and the removal operations are typically invoked by CLI commands associated with the particular software bundle.

The executable software image in question is associated with a particular PLM execution environment (EE) within the system on which it is installed meaning that entities that need to execute in that EE are instantiated by executing this particular image.

Why are we so complicated in describing this? The reason is that often in these systems nodes do not have dedicated hard drives. The image may be created remotely on a common file system. Moreover it is also possible that the same image may be associated with more than one node. All these are implementation specific details that the SMF implementation should resolve for the particular system it is executing on. From a ‘user perspective’ the installation and removal CLI commands are issued on the node, in the PLM EE for which the executable image is being created or removed.

Offline and Online Operations

In the context of the SMF the online–offline distinction is related to the impact the operation has on the entities of the system. An online operation must not have any impact on any entity running or potentially running in the system at the moment of or after its execution, all the way until the related information model modifications are made. For example, the software image may be installed for a new component type beforehand. However, this image is not used and does not impact in any way the component to be upgraded to this new type until the object representing the component in the information model is modified to refer to this new type. This absence of impact extends even to the case of a component restart due to a failure, for example. That is, such a restart should still use the software image associated with the component's current type and the new installation should not have any impact on it.

An offline operation has an associated scope of impact, that is, a set of entities that may be impacted if they are running or starting at the time of or after the execution of the operation. In order to avoid any unplanned impact, these entities need to be taken offline. They need to be kept offline until the upgrade of the related entities has completed and they are ready for deployment. Note that once the impacted entities have been taken offline the operation in fact becomes an online operation with respect to the rest of the system as it must not affect any other remaining entity.

Deactivation and Activation Units

The deactivation unit is the set of entities that is taken offline for the time of some offline operations (installation, removal or both). The activation unit is the collection of entities that is taken back online once the offline operations have completed.

The reason for distinguishing the activation and the deactivation units is that when we want to remove some entities from the configuration they will not be taken back online, and when we add new entities they do not exist yet and so cannot be taken offline. At the extreme, an upgrade step may have an empty activation unit if all the entities taken offline are also removed from the system; and the deactivation unit is empty if all the entities to be brought online are newly introduced within the step.

To take the entities offline typically one needs to administratively lock them first. This, as appropriate, will also trigger a switch-over of the services they may be providing. After this they can be terminated without any service impact.

To bring the entities online first they are instantiated and if it is successful they can be unlocked to allow them to take service assignments as necessary.

The assumption is that the new entities are all created in the locked administrative state so that SMF can control their release for service. This is done only when everything (the entities themselves and their environment) is ready for them to take assignments.

Since the SMF uses the administrative operations to take the entities offline and online, the set of entities impacted by an operation needs to be adjusted (e.g., expanded) to the set for which these required administrative operations have been defined. For example, AMF components cannot be locked or terminated by themselves as the lock operation is not defined for them. So if a component needs to be taken offline the deactivation unit needs to refer to its encapsulating SU.
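The adjustment from components to their enclosing SUs can be illustrated with a minimal sketch. It assumes that entities are identified by distinguished names in which a component's parent is its SU (written here as `safComp=...,safSu=...`); the function names `lockable_unit` and `deactivation_unit` are hypothetical, and the naive comma split ignores escaped commas that real DNs may contain.

```python
from typing import Iterable, List


def lockable_unit(dn: str) -> str:
    """Map an entity DN to the DN on which a lock can be issued.

    Assumption: a component DN starts with 'safComp=' and its parent DN
    is the enclosing SU. Any other entity is returned unchanged.
    """
    first_rdn, _, parent_dn = dn.partition(",")
    if first_rdn.startswith("safComp=") and parent_dn:
        return parent_dn
    return dn


def deactivation_unit(targets: Iterable[str]) -> List[str]:
    """Expand the targeted entities to a duplicate-free list of lockable units."""
    return sorted({lockable_unit(dn) for dn in targets})


targets = [
    "safComp=c1,safSu=SU1,safSg=SG,safApp=App",
    "safComp=c2,safSu=SU1,safSg=SG,safApp=App",
    "safSu=SU2,safSg=SG,safApp=App",
]
units = deactivation_unit(targets)
```

Both components of SU1 collapse into a single entry, so the lock is issued once per SU rather than once per targeted component.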

Another way of looking at the activation and deactivation units is that these are the entities to which SMF applies the administrative operations defined by the upgrade step.

Modifications of the Information Model

In this action the SMF applies the information model changes to the objects representing the entities targeted by the upgrade step. As we have seen these entities may not be the same as the activation and deactivation units but they are usually within their scope. Hence SMF interprets them this way too unless they are explicitly listed.

Explicit listing is unavoidable in case of new entities. For SMF to be able to add a new entity to the configuration it needs the complete configuration of the entity, that is, it requires the specification of the configuration object representing the entity in the information model with all its mandatory attributes.

In case of existing entities the template approach can be used to identify a set of similar targeted entities and also to describe the changes in their configuration. For example, components of a given component type can be selected using this component type as the template. Their upgrade may only require the configuration change that their object representation references the new component type, which can be given as a single attribute modification that SMF can apply for each object representing a targeted component.
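As a rough illustration of the template approach, the sketch below selects all components of a given type and applies the same single-attribute modification to each. The dictionary-based model, the attribute key `compType`, and the function name `retype_components` are assumptions made for the example; they are not the SMF or IMM interfaces.

```python
from typing import Dict, List

# Hypothetical stand-in for the information model: DN -> attribute dictionary.
Model = Dict[str, Dict[str, str]]


def retype_components(model: Model, old_type: str, new_type: str) -> List[str]:
    """Point every component of old_type at new_type; return the changed DNs."""
    changed = []
    for dn, attrs in model.items():
        if attrs.get("compType") == old_type:  # the template: match by type
            attrs["compType"] = new_type       # the single attribute modification
            changed.append(dn)
    return changed


model = {
    "safComp=c1,safSu=SU1": {"compType": "WebServer-v1"},
    "safComp=c2,safSu=SU2": {"compType": "WebServer-v1"},
    "safComp=db,safSu=SU3": {"compType": "Database-v4"},
}
changed = retype_components(model, "WebServer-v1", "WebServer-v2")
```

The same call works unchanged on a configuration with any number of matching components, which is the point of describing the targets by a template rather than by an explicit list.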

Reduced Upgrade Step

Looking at the standard upgrade step described in the section ‘What is an Upgrade Step?’, if there is no need for any offline operation within an upgrade step (actions #4 and #6 would be empty), the lock operation (action #2) can be skipped completely, and instead of the separate termination and instantiation (actions #3 and #7) a restart operation can be used once the information model changes have been applied in action #5. Since no lock is performed there is no need to unlock (action #8) either. We are left with the following sequence:

1. Online installation of new software
2. Modification of the information model
3. Restart of the activation unit
4. Online removal of old software
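The relation between the two sequences can be written down as a short sketch. The enumeration names and the `plan_step` function are illustrative only and are not identifiers from the SMF specification.

```python
from enum import Enum
from typing import List


class Action(Enum):
    ONLINE_INSTALL = "online installation of new software"
    LOCK_DU = "lock deactivation unit"
    TERMINATE_DU = "terminate deactivation unit"
    OFFLINE_REMOVE = "offline removal of old software"
    MODIFY_MODEL = "modification of the information model"
    OFFLINE_INSTALL = "offline installation of new software"
    INSTANTIATE_AU = "instantiation of the activation unit"
    UNLOCK_AU = "unlock of the activation unit"
    ONLINE_REMOVE = "online removal of old software"
    RESTART_AU = "restart of the activation unit"  # used by the reduced step only


STANDARD_STEP = [
    Action.ONLINE_INSTALL, Action.LOCK_DU, Action.TERMINATE_DU,
    Action.OFFLINE_REMOVE, Action.MODIFY_MODEL, Action.OFFLINE_INSTALL,
    Action.INSTANTIATE_AU, Action.UNLOCK_AU, Action.ONLINE_REMOVE,
]

REDUCED_STEP = [
    Action.ONLINE_INSTALL, Action.MODIFY_MODEL,
    Action.RESTART_AU, Action.ONLINE_REMOVE,
]


def plan_step(needs_offline_operations: bool) -> List[Action]:
    """Pick the standard step when offline work is needed, else the reduced one."""
    return list(STANDARD_STEP if needs_offline_operations else REDUCED_STEP)
```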

This is referred to as the reduced upgrade step. It is ideal from the perspective of SA as it upgrades entities while they may be providing services. In the context of AMF this is the case when a component is restartable (see Chapter 6 Section 6.3.2.4) and also it requires no offline installation or offline removal operations, that is, it impacts no other entity and even the component is not impacted until the information model has been modified.

Considering the PLM (which was out of scope of the first version of the SMF specification), when upgrading an operating system one may have no option other than this reduced step: to communicate with the EE and control its upgrade, SMF needs to use the old operating system. Only when the installation and configuration of the new operating system have been completed using the old environment can SMF restart the EE with the new operating system; if the restart is successful, the old operating system can then be removed using the new environment.

To summarize, to specify an upgrade procedure one needs to identify:

  • which upgrade method to use: single-step or rolling;
  • the activation and deactivation units, which are always template based for rolling upgrades;
  • the software bundles that need to be installed on and removed from the nodes;
  • the configuration changes for the entities targeted by the upgrade;
  • whether the standard or the reduced upgrade step applies; and
  • the execution level of the procedure, which determines the execution order of procedures within the upgrade campaign.

SMF initiates procedures of a higher execution level only after it has completed all procedures of the lower levels. SMF may execute procedures of the same execution level simultaneously or in any order.
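The ordering rule can be sketched as grouping the procedures into successive "waves" by execution level. The function name `execution_waves` and the procedure names are hypothetical; within a wave the sketch sorts names only to make the result deterministic, whereas SMF is free to run them simultaneously or in any order.

```python
from itertools import groupby
from typing import Dict, List


def execution_waves(levels: Dict[str, int]) -> List[List[str]]:
    """Group procedure names by execution level, lowest level first."""
    ordered = sorted(levels.items(), key=lambda item: (item[1], item[0]))
    return [
        [name for name, _ in group]
        for _, group in groupby(ordered, key=lambda item: item[1])
    ]


waves = execution_waves({"upgrade-db": 2, "add-nodes": 1, "upgrade-app": 1})
```

Each inner list must be completed in full before any procedure of the next list is initiated.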

In addition to the upgrade step specification that we discussed in some detail, each procedure also has an initialization and a wrap-up portion similar to the upgrade campaign.

The most important information given in the initialization portion is the permitted outage information, which is actually used during the campaign initialization for the prerequisite check. The reason it is listed at the procedure is that to be able to determine the outage one needs to know the deactivation unit of the upgrade steps of the procedure.

Upgrade Procedure Customization

The rest of the initialization and the procedure wrap-up allow for adding customized actions to prepare for a procedure, to verify its outcome and/or to wrap it up.

A typical use would be the case when the upgrade is performed so that the new version of the software can provide some new services. Since we are upgrading existing entities we would do that using a rolling upgrade procedure. This, however, does not allow us to add the new SIs. Instead of defining a separate single-step procedure just for the addition of these entities, the right place to add the new SIs to the information model is the procedure wrap-up, when all the SUs protecting them have been upgraded. Symmetrically, if the new version does not support some obsolete SIs they need to be removed in the procedure initialization.

In the procedure initialization and wrap-up customized actions may contain information model changes, administrative operations, CLI commands and also so-called customized callbacks, which are API calls to some registered users.

We will have a closer look at the callback mechanism in Section 9.3.6. Here the important point is that SMF can transfer the control of the upgrade execution to some other entity for a certain time period. During this period this other entity may perform additional actions necessary for the upgrade in a synchronized manner with the upgrade procedure.

These customized callbacks may also be used for the customization of the upgrade steps. The UCS schema defines a set of insertion points where SMF would provide customized callbacks within particular upgrade steps of an upgrade procedure. The particular upgrade step on which a callback is made may be the first, the last, or the step halfway through the procedure; it may also be made on all steps.

The hook-up points for the callbacks within an upgrade step are:

  • before lock,
  • before termination,
  • after IMM changes,
  • after instantiation, and
  • after unlock.

9.3.3.3 Campaign Wrap-Up

The most important task of the campaign wrap-up is to verify that the system is working correctly after the completion of the upgrade, and when this has been established to free all the resources that were allocated to ensure the success of the upgrade and to enable system recovery in case of a failure. It also provides an opportunity to clean up the information model by removing objects representing entity types and related software bundles that have become obsolete.

Again there are some standard operations that the SMF specification requires implementations to perform. Most importantly there is an observation period during which SMF is in an ‘observation’ mode: It listens for any error reports that it can correlate with the just completed upgrade campaign.

The upgrade campaign specification establishes the time period for which SMF performs such monitoring. This period of ‘waiting-to-commit’ the campaign starts at the moment the campaign has completed successfully including all the upgrade procedures and the wrap-up actions required for the completion.

At the end of the waiting period SMF prompts the administrator to decide whether to commit the campaign or to revert the configuration changes gracefully. This is the last chance for such a graceful rollback: once the administrator issues the commit operation, SMF frees all the resources that enable it.

However, if necessary it still remains possible, albeit at some loss, to recover the system state in effect before the upgrade campaign by restoring the backup made at the initialization. For this, a second waiting period is defined during which SMF blocks the initiation of a new campaign and keeps the backup intact. Once this timer expires there is no guarantee that the state before the upgrade can be restored.

Besides the timers just as for campaign initialization and in the campaign body one may define administrative operations, information model changes, CLI commands, and custom callbacks at two points of the wrap-up.

  • At the completion of the campaign (i.e., after the completion of its last procedure)—the purpose of these actions is to complete the upgrade campaign and to verify its success, for example, at the application level. If specified, the waiting-to-commit timer starts once these operations have completed.
  • At commit—these actions are related to committing the campaign and also to the release of the different resources dedicated to the campaign.

The upgrade campaign specification contains the parameterization of the timers, the optional actions and the list of entity types and software bundles that SMF needs to remove from the information model and from the system's software repository. It does not include the standard actions as described here and in the SMF specification. The upgrade campaign specification indicates only the modifications that need to be made to the standard upgrade campaign process to customize it for the upgrade of the targeted entities.

9.3.4 Upgrade Campaign Execution Status and Failure Handling

One of the main requirements of manageability is the capability of monitoring the system's behavior. This need is even greater during upgrades and configuration changes. In case of the SMF, we would like to be able to monitor the progress of an upgrade campaign, which entities have been upgraded, which were taken out of service and then back. Most importantly we would like to obtain as much information as possible when something goes not quite as planned.

Everyone is familiar with the progress bar when upgrading some software on their computer. Indeed this was a suggestion, but how to translate an upgrade campaign into a simple progress bar? What does 57% mean for an upgrade campaign? Is that information useful if an error should occur?

Instead of trying to interpret percentages, the SMF specification defines an information model presented in Figure 9.3 together with a state model for the concepts that we have already seen—the upgrade campaign, its upgrade procedures and their upgrade steps. This way during execution the progress of the campaign can be observed through the SMF information model in which the objects of the relevant classes reflect—among others—the state information for each step and procedure of the campaign.

Figure 9.3 The SMF information model for upgrade campaigns [49].

An upgrade campaign is represented by an object of a configuration object class and it reflects the executions of the upgrade campaigns. Under each upgrade campaign object SMF creates a tree of runtime objects that represents the procedures and their upgrade steps as they apply to the system configuration when the campaign is being executed.

At the upgrade procedure level this means that SMF creates exactly as many objects as there are procedures defined in the campaign. However, under each rolling upgrade procedure it creates as many upgrade step objects as there are objects in the system configuration that satisfy the template defined for the rolling upgrade procedure.

In the tree, each upgrade step object has its associated activation and/or deactivation units, each of which is associated with some nodes where software is installed or removed during the execution of the upgrade step.

This tree, which is rooted in the upgrade campaign object, is essentially the unfolding of the information contained in the upgrade campaign specification as it applies to a particular system configuration. That is, SMF applies the templates of the upgrade campaign specification to the current system configuration to identify the entities that compose the activation and deactivation units and that are targeted by the configuration changes.

For example, if the activation unit template of a rolling procedure refers to an entity type X and this template is applied to configuration A, which has five such entities then it defines five upgrade steps with their activation and deactivation units. If the same template is applied to configuration B which only has two entities of type X, then only two steps will appear in the information model of system B.

This mechanism allows for the specification of generic upgrade campaigns that are adjusted automatically by SMF to the different deployment configurations.
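The unfolding of a rolling procedure template can be sketched as follows. The flat `DN -> entity type` dictionary and the function name `unfold_rolling_procedure` are assumptions for the example; a real SMF implementation works on the full information model and produces runtime objects rather than dictionaries.

```python
from typing import Dict, List


def unfold_rolling_procedure(template_type: str,
                             configuration: Dict[str, str]) -> List[dict]:
    """Create one upgrade step per configured entity matching the template."""
    steps = []
    for dn, entity_type in configuration.items():
        if entity_type == template_type:
            # For an in-place upgrade the same entity is taken offline
            # and brought back online, so both units refer to it.
            steps.append({"deactivation_unit": [dn], "activation_unit": [dn]})
    return steps


config_a = {"safSu=SU%d" % i: "X" for i in range(1, 6)}  # five entities of type X
config_a["safSu=Other"] = "Y"                            # not matched by the template
config_b = {"safSu=SU1": "X", "safSu=SU2": "X"}          # two entities of type X

steps_a = unfold_rolling_procedure("X", config_a)
steps_b = unfold_rolling_procedure("X", config_b)
```

The same campaign specification thus yields five steps on configuration A and two on configuration B without any change to the specification itself.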

SMF performs this unfolding of the campaign specification at campaign initialization at the latest.

Once the campaign execution starts SMF reflects its progress by setting the states defined in the specification for upgrade campaigns, upgrade procedures, and upgrade steps in the objects representing these logical concepts.

The SMF state model combines the state machines defined at each level (i.e., step, procedure, and campaign). Each state machine reflects the execution stages of a step, procedure, or the campaign including the potential error handling and the administrative operations. The resulting state model is rather complex and all its details can be found in the specification [49]. Here we provide only a summary that we believe will help to grasp the main ideas behind it.

9.3.4.1 State Model Overview

The campaign, the procedure, and the step state machines all start from an initial state and the intention is to reach the (execution) completed state.2 If this cannot be achieved due to a failure or an administrative intervention we would like at least to complete the campaign execution in a state which is equivalent to the system's configuration at the initiation of the upgrade campaign.3 We would like to achieve this with no more service interruption than the campaign would allow us.

This means that when we start the execution of an upgrade campaign all objects in its tree have their state attribute set to ‘initial.’ When the campaign successfully completes all the objects in the tree should have their state attribute set to the appropriate ‘completed’ or ‘execution completed’ state.

If at any point during the execution one decides to revert back to the campaign's initial configuration—a rollback is initiated—then those upgrade steps and procedures that moved away from the ‘initial’ state need to reach their appropriate ‘rolled back’4 state, while those that stayed in the initial state remain so.

This means an action-by-action step-by-step undoing of the already executed portion of the upgrade campaign in order to return to the configuration that was in effect at the initiation of the campaign.

When all upgrade steps and procedures of a campaign are in either the ‘initial,’ ‘undone,’ or ‘rolled back’ states the system state is considered to be consistent and the configuration is equivalent to the one in effect at the campaign initiation.5
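This consistency condition amounts to a simple predicate over the states in the campaign tree. In the sketch below the state names are the informal ones used in the text, not the identifiers of the specification, and the function name is hypothetical.

```python
from typing import Iterable

# States in which a step or procedure leaves the configuration
# equivalent to the one in effect at campaign initiation.
EQUIVALENT_TO_INITIAL = {"initial", "undone", "rolled back"}


def is_at_initial_configuration(states: Iterable[str]) -> bool:
    """True when every step and procedure is initial, undone, or rolled back."""
    return all(state in EQUIVALENT_TO_INITIAL for state in states)
```

For example, a tree whose steps are in the states 'rolled back', 'rolled back', and 'initial' satisfies the predicate, while a tree containing a single 'completed' step does not.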

Accordingly, there are two parts of each state machine: one for the upgrade or forward path and one for the rollback or reverse path.

The execution starts in the forward path and remains so until a rollback is initiated. When this happens, the execution switches to the second part: to the rollback path. It remains there until the rollback either completes or fails.

Failure of an Upgrade Action

Whether the execution is in the forward or in the reverse path, it can only proceed as long as the executed upgrade actions are successful and therefore the system state is consistent and known. If an action fails the SMF is typically permitted to retry the upgrade step.

Retry means that first the SMF performs in reverse order the undo action—as defined in the SMF specification—for each of the already executed upgrade actions within the failed step. This should bring the configuration back to the one it was at the beginning of the step.

Subsequently if the failed step is undone successfully and the retry is permitted, SMF makes another attempt to execute the same upgrade step. If successful, the campaign execution can proceed as if no error occurred.

If the retry is not allowed or too many attempts have been made, the campaign cannot complete in the forward path any more. In this case the only possible remaining goal is to roll back the entire system to the configuration initial for the campaign.

If the actions of the failed step cannot be undone successfully, the entire upgrade campaign has to fail because the system state has become unknown therefore consistency cannot be claimed. To bring the system into a consistent state it needs to be built up from scratch by performing a so-called fallback.

This is when the backup created at the campaign initialization becomes important: The backup stores all the information necessary to restore, from scratch, the same consistent system state that was in effect when the backup was created. The restoration of such a saved system state after a complete system restart is referred to as a fallback.

Fallback is the last resort of the SMF to recover the system after some failure. The first version of the specification leaves it implementation specific and it does not require that an SMF implementation performs the fallback operation automatically. Instead it expects that the administrator makes this decision.

Obviously failures may occur also during the rollback of the upgrade steps in the reverse path. In such cases the SMF attempts a retry in a similar manner as during the forward path. The main difference is that since during rollback the system is already on a recovery path the only remaining recovery option is the fallback whether it is due to another failure or exceeding the permitted number of retry attempts.
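The failure handling of a single step on the forward path can be summarized in a minimal sketch. It assumes that each upgrade action is paired with its undo action and that both report success as a Boolean; the function name `run_step`, the return values, and the choice to undo only the actions that had succeeded are assumptions of the example rather than requirements of the specification.

```python
from typing import Callable, List, Tuple

ActionPair = Tuple[Callable[[], bool], Callable[[], bool]]  # (do, undo)


def run_step(actions: List[ActionPair], max_retries: int) -> str:
    """Execute a step; return 'completed', 'rollback', or 'fallback'."""
    retries = 0
    while True:
        executed_undos = []
        failed = False
        for do, undo in actions:
            if do():
                executed_undos.append(undo)
            else:
                failed = True
                break
        if not failed:
            return "completed"
        # Undo the already executed actions in reverse order.
        for undo in reversed(executed_undos):
            if not undo():
                return "fallback"  # system state unknown: only a fallback remains
        if retries >= max_retries:
            return "rollback"      # step undone, but the forward path is abandoned
        retries += 1
```

On the reverse path the same loop would apply with one change: since the system is already on a recovery path, exhausting the retries also leads to a fallback.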

Whether the campaign execution reaches the ‘execution completed’ state or the ‘rollback completed’ state, the SMF returns the control to the administrator for the final decision to commit the upgrade campaign or its rollback.

As mentioned earlier for the campaign wrap-up, committing the campaign implies that SMF releases the resources used to protect the campaign execution and after this the system cannot return gracefully to the state before the campaign initiation. A fallback still remains possible for some time if the upgrade campaign specification sets such a waiting time as discussed in Section 9.3.3.3.

Asynchronous Failures

The problem with upgrades is that the success of the upgrade action does not imply that the newly deployed configuration and/or software indeed functions as intended. In addition these functional mishaps may not even show up right away when a new or the upgraded entity is put back into service. This is partially because the function may not be exercised immediately or—considering an AMF entity—it may not get an assignment immediately. It is up to the AMF to decide when to instantiate such an entity and when to give it an assignment. As a result correlating an error detected in the system with the upgrade action that led to the problem is virtually impossible.

To complicate the situation further, once the entity is under the control of AMF, the error detection mechanisms and recoveries defined for AMF apply. This means that either AMF will detect the error or it will be reported to AMF so it performs the appropriate recovery and repair actions. However, if, for example, a bug or corruption causing the error was introduced by the upgrade campaign, AMF will not be able to remedy the situation through a simple restart of the entity, not even by the reboot of the node which it uses as a repair action.

The upgrade campaign needs to be rolled back as soon as possible to recover the earlier healthy state. To make this possible the information about the problem needs to be funneled to SMF. At the same time it is also desirable that AMF stops all futile repair actions after the isolation of the faulty entity and, most importantly, that it does not escalate the error to a wider fault zone and thereby take even more entities out of service.

While the situation is similar at the upgrade of entities of other services and even applications, the SMF specification focuses on a solution for the AMF entities since in an SA Forum compliant system these are the entities considered for SA.

The solution offered by the first version of the SMF specification is the following:

SMF while executing the upgrade campaign marks those SUs that are altered in any way by the upgrade campaign by setting their maintenance campaign attribute to the distinguished name (DN) of the upgrade campaign object.

This setting disables the normal repair actions that AMF would apply in case of a failure. AMF only isolates the SU and recovers any service it was providing at the time of the failure, but AMF does not attempt to repair it. Instead AMF sets the operational state of the SU to be disabled and reports this in a state change notification indicating also that the SU was under maintenance and provides the DN of the object representing the campaign.

From the beginning of the campaign execution the SMF subscribes to these state change notifications. When SMF detects a notification referencing the DN of the upgrade campaign in progress it interprets it as a potential indication of an asynchronous error. Therefore it suspends the campaign execution and returns the control to the administrator to decide whether the error is campaign related.
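The decision SMF makes on each notification can be sketched as a small filter. The notification is modeled here as a plain dictionary with the hypothetical keys `operational_state` and `maintenance_campaign`, and the campaign DN is an illustrative value; the actual notification format is defined by the AMF and Notification service specifications.

```python
def handle_su_notification(notification: dict, campaign_dn: str) -> str:
    """Return 'suspend' if the notification may signal an asynchronous error
    caused by the campaign in progress, otherwise 'continue'."""
    disabled = notification.get("operational_state") == "disabled"
    same_campaign = notification.get("maintenance_campaign") == campaign_dn
    return "suspend" if disabled and same_campaign else "continue"


campaign = "safSmfCampaign=upgrade-1"  # illustrative DN of the campaign object
marked_su_failed = {
    "su": "safSu=SU1,safSg=SG,safApp=App",
    "operational_state": "disabled",
    "maintenance_campaign": campaign,
}
unrelated_su_failed = {
    "su": "safSu=SU9,safSg=SG,safApp=Other",
    "operational_state": "disabled",
    "maintenance_campaign": None,
}
```

Only the first notification references the campaign in progress, so only it causes the campaign to be suspended and control to be handed to the administrator.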

Depending on whether the campaign is on the forward or the reverse path, the administrator has several options and may decide that:

  • the error is not campaign related or it is related, but can be corrected; in either case the administrator needs to deal with the error and clear it so that the entity can be returned to AMF's control; in this case the campaign can proceed on its current path;
  • the error is campaign related.
    • In this case if the rollback option is still available, the administrator may order a rollback in which case the SMF implementation deals with the problem implicitly by reverting the altered entities including the one on which the error was reported to their earlier configuration and clears the error condition for AMF this way.
    • If rollback is not possible or the error situation is more severe, the more drastic recovery, the fallback, restores the backed-up state of the system and clears the error condition for AMF at the same time.

Fallback vs. Rollback

It is worth spending a few words on explaining better the difference between rollback and fallback.

If we consider the system state it encompasses:

  • the system configuration as defined by the configuration objects in the system information model; and also
  • the software image associated with this system configuration including system and application software;
  • the runtime information reflecting the actual status of the represented entities; and
  • the application data.

At normal operation only the last two items change: The application data changes according to the services the system provides; and the runtime information changes depending on the changing status of the entities in the system.

When an upgrade campaign is being executed it manipulates the first two items: The system configuration and the associated software image. But since the system continues to provide its services even during the execution of an upgrade campaign—as required for SA—the runtime information and the application data also continue to change.

The changes in the application data are critical for SA as they reflect in some application specific way the services that the system provides and protects.

When it comes to the runtime status information we need to distinguish two categories: The persistent and the nonpersistent data.

The persistent runtime objects and attributes of IMM [38] play a similar role as the configuration objects and attributes—that is, they must survive system and IMM restart—except that they are created, maintained, and destroyed by an object implementer (e.g., applications, AIS service) and not by an administrator. Hence they require a similar handling as application data.

Non-persistent objects and attributes by definition do not survive system or IMM restart as they have relevance to the current incarnation of the system.

Chart A of Figure 9.4 provides a graphical representation of these changes in normal—no-error—conditions. The upgrade changes shown in dotted line refer to the configuration and software image changes, while the application changes are reflected with the solid line and refer to the persistent runtime data and application data changes. During upgrade as time progresses both upgrade and application changes are made progressively.

Figure 9.4 Comparison of the fallback, rollback, and retry operations.

A tangible example would be the upgrade of the online banking service application from version 1 to 2 while a bank customer makes a deposit to his account. The change of the software version is the upgrade change while the deposit to the account is the application data change.

At the initiation of the upgrade campaign when the backup is created it includes the information for all four parts. This means that if a fallback is used to restore the state stored in this backup it clears all the upgrade and application changes made from the beginning of the upgrade as it is shown in chart B of Figure 9.4. Obviously this is not an ideal situation for our bank customer as this means that if we consider the previous deposit scenario then the system would lose the information about his deposit.

We must note that the reason for a fallback may be anything, not only an upgrade. In high availability systems backups are performed on a regular basis so that if the system ends up in an unstable or unknown state for which no other known remedy exists, a stable healthy state in which SA can be guaranteed again is restored by a fallback to the latest backup.

The SMF also mentions the concept of rolling forward although currently the specification only recommends it to applications. A roll-forward is the process of re-applying some logged transactions after a fallback in order to restore the state changes committed between the moment the backup was taken and the moment the fallback occurred.

For example, if the deposit transaction was logged by the banking system and this log survived the fallback then it can be used to restore the correct bank account information of our bank customer.

Since these transactions are application specific the first version of the SMF specification left the roll-forward to the discretion of applications.
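The roll-forward idea can be illustrated with the deposit example. The following is a minimal sketch under stated assumptions, not something the SMF specification defines: the account store, the log format, and the names `take_backup`, `fallback`, and `roll_forward` are hypothetical, and the transaction log is assumed to be kept outside the backed-up state so that it survives the fallback.

```python
import copy

# Hypothetical application data: account balances keyed by account id.
accounts = {"customer": 100}
# Hypothetical transaction log, assumed to survive a fallback.
transaction_log = []


def deposit(account, amount, now):
    """Apply a deposit and log it with a logical timestamp."""
    accounts[account] = accounts.get(account, 0) + amount
    transaction_log.append({"time": now, "account": account, "amount": amount})


def take_backup(now):
    """Snapshot the application data together with the time it was taken."""
    return {"time": now, "accounts": copy.deepcopy(accounts)}


def fallback(backup):
    """Restore the snapshot; every change made after the backup is lost."""
    accounts.clear()
    accounts.update(copy.deepcopy(backup["accounts"]))


def roll_forward(backup):
    """Re-apply the logged transactions committed after the backup was taken."""
    for entry in transaction_log:
        if entry["time"] > backup["time"]:
            account = entry["account"]
            accounts[account] = accounts.get(account, 0) + entry["amount"]


backup = take_backup(now=1)       # backup created at campaign initiation
deposit("customer", 50, now=2)    # deposit made while the upgrade is running
fallback(backup)                  # the deposit is lost with the fallback
balance_after_fallback = accounts["customer"]
roll_forward(backup)              # the logged deposit is re-applied
balance_after_roll_forward = accounts["customer"]
```

In this sketch the balance returns to its backed-up value after the fallback and includes the deposit again after the roll-forward.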

When a rollback is performed it does not impact the service provisioning. In chart C of Figure 9.4 we see that while the upgrade changes regress back to the same point as at the beginning of the upgrade the application changes continue to progress. For our deposit example this means that while the software is changed back through the rollback to the original version 1 the information about our bank customer's deposit remains intact.

For comparison purposes chart D of Figure 9.4 also shows the retry scenario. Here the system rolls back the upgrade changes to the one at the beginning of the upgrade step, but then in the retry it performs the same changes again while the application changes proceed without any interruption. As a result the retry if successful only introduces some delay in the completion of the upgrade campaign.
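The difference between the three operations can be summarized in a small model. This is only a sketch: the `SystemState` class and the way changes are recorded as lists of strings are assumptions made for illustration, not part of the SMF specification.

```python
from dataclasses import dataclass, field


@dataclass
class SystemState:
    # Configuration and software image changes made by the campaign.
    upgrade_changes: list = field(default_factory=list)
    # Application data and persistent runtime data changes.
    application_changes: list = field(default_factory=list)


def fallback(backup):
    """Chart B: restore everything from the backup; both kinds of change are lost."""
    return SystemState(list(backup.upgrade_changes), list(backup.application_changes))


def rollback(state, backup):
    """Chart C: undo only the upgrade changes; application changes are kept."""
    return SystemState(list(backup.upgrade_changes), list(state.application_changes))


def retry(state, changes_before_step, step_changes):
    """Chart D: undo the failed step, then perform the same step changes again."""
    return SystemState(list(changes_before_step) + list(step_changes),
                       list(state.application_changes))


backup = SystemState()  # taken at campaign initiation
current = SystemState(["install version 2"], ["deposit"])

after_fallback = fallback(backup)
after_rollback = rollback(current, backup)
after_retry = retry(current, [], ["install version 2"])
```

Only the fallback discards the deposit; the rollback and the retry leave the application changes untouched.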

9.3.5 Administrative and Management Aspects

The SMF allows administrators to create and delete software bundle and upgrade campaign objects; and to control the execution of upgrade campaigns by issuing administrative operations on the upgrade campaign object representing the appropriate upgrade campaign.

9.3.5.1 SMF Information Model Management

Software Bundle Objects

Software bundles delivered to the system's software repository are represented in the SMF information model, which defines a configuration object class for this purpose (see Figure 9.2).

A software bundle object contains the information necessary to install and remove the represented software bundle within the SA Forum system. These are the CLI commands and command line arguments listed separately for online and offline installation and removal. The attribute values need to be set exactly in the way the SMF implementation should issue these commands in the EE in which the software bundle needs to be installed or from which it needs to be removed. The upgrade specification only refers to the software bundle object, from which the appropriate CLI command and its arguments are fetched when the upgrade step calls for the operations.

Additional attributes indicate the default timer for the CLI operations and the scope of impact for each of the offline operations as it applies to the particular system. The first helps to determine the failure of an operation. The second provides a further hint on the activation and deactivation units, which a given implementation may or may not use.

Due to this tailoring of the attribute values to the concrete target system, the same software bundle may be represented by objects with different settings in different systems reflecting the system specifics such as the needs or assumptions of SMF implementation, the file system solution used, or its organization.
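As an illustration of what such an object carries, the sketch below models a software bundle as a plain record. The class and field names, the paths, and the scope values are all hypothetical; the attribute names of the actual SMF object class are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoftwareBundle:
    name: str
    install_online: tuple        # (command, arguments) for online installation
    install_offline: tuple       # (command, arguments) for offline installation
    remove_online: tuple         # (command, arguments) for online removal
    remove_offline: tuple        # (command, arguments) for offline removal
    default_timeout_s: int       # default timer for the CLI operations
    install_offline_scope: str   # scope of impact of the offline installation
    remove_offline_scope: str    # scope of impact of the offline removal


def command_line(bundle, operation, mode):
    """Build the command line for, e.g., operation='install', mode='offline'."""
    command, args = getattr(bundle, f"{operation}_{mode}")
    return f"{command} {args}".strip()


# Hypothetical values tailored to one particular target system.
bundle = SoftwareBundle(
    name="bank-app-2.0",
    install_online=("/repo/bank-app-2.0/install.sh", "--online"),
    install_offline=("/repo/bank-app-2.0/install.sh", "--offline"),
    remove_online=("/repo/bank-app-2.0/remove.sh", "--online"),
    remove_offline=("/repo/bank-app-2.0/remove.sh", "--offline"),
    default_timeout_s=600,
    install_offline_scope="service unit",
    remove_offline_scope="service unit",
)

offline_install = command_line(bundle, "install", "offline")
```

A second system could represent the same bundle with different paths, arguments, or timer values without any change to the upgrade campaign specification that refers to it.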

In any case, the presence of the software bundle object in the SMF information model indicates that the software bundle is available in the software repository and can be accessed during upgrades as required.

The exact way it is presented or the way it is imported is not specified by the first version of the SMF specification. All these details and the image management—beyond the mentioned installation and removal CLI commands—are completely left open for the SMF implementations to decide. Different solutions may extend the currently defined object class with additional attributes such as the path to the software bundle in the repository or the required package handler utility if several of them are supported.

An implementation may also decide to link the software bundle object creation and deletion operations with the import and removal of the software bundle to/from the software repository and by that implying some image handling operations. Although this is not required it is a natural interpretation of the current specification.

Upgrade Campaign Objects

As discussed in Section 9.3.4, upgrade campaigns are represented in the SMF information model by configuration objects. They link the XML upgrade campaign specification file with its execution in the system. The configuration object reflects the execution state of the upgrade campaign specification it represents.

When the administrator creates an upgrade campaign object SMF takes the associated XML file as an input and applies it to the current configuration of the system. It may perform some checks (e.g., whether the campaign is applicable to the system) and create the entire sub-tree of upgrade procedures and steps right away.

Once the upgrade campaign object has been created the administrator may issue administrative operations on it to initiate the execution, suspension, rollback, fallback, or to commit the campaign.

Since the campaign object is a configuration object it will remain in the information model until the administrator deletes it. In a way, these objects keep record of the system evolution.

The rest of the sub-tree consists of runtime objects, and their life span is left to the discretion of the SMF implementation. The specification only requires their presence for the time of the execution of the campaign as they reflect the status of the execution. So an implementation may remove them when the campaign has been committed or keep them until someone deletes the campaign object.

9.3.5.2 SMF Administrative Operations

The first version of the SMF specification defines the following four SMF administrative operations applicable to upgrade campaign objects:

  • execute,
  • suspend,
  • rollback, and
  • commit.

The administrator issues them on the object representing the upgrade campaign to be performed.

Although the specification does not yet define the exact format and semantics of the fallback administrative operation, it requires that an administrator should be able to initiate a fallback. The reason it is not defined in the specification is that the fallback is not limited to the scope of software management only. It is used to recover the system from any type of situation when no other remedy can be used. Similarly the backup operation has not been defined by the SMF specification; it only requires that at the beginning of the campaign a backup is performed in some system specific way and that the SMF implementation knows how to trigger it.

It has not been decided yet which AIS service shall offer the backup and fallback operations. One potential candidate is the PLM, the other one is of course the SMF, or they may deserve the definition of their own entirely new service.

After this small detour let us see the administrative operations defined by the first version of SMF.

Execute

The execute administrative operation initiates the execution or the continuation of the upgrade campaign represented by the campaign object on which the operation is issued. It is applicable to campaigns in the initial or suspended states. It can be issued only for one upgrade campaign at a time. That is, simultaneous execution of upgrade campaigns is not supported.

If the sub-tree of runtime objects of the upgrade campaign has not been created yet, SMF creates it when it receives the execute command and checks the campaign pre-requisites. If successful, SMF starts the execution of the upgrade procedures following their execution level starting with those at the lowest.
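The ordering by execution level can be sketched as follows. The procedure names and levels are hypothetical, and the sketch only groups procedures by level; how procedures sharing a level are scheduled is not modeled here.

```python
from itertools import groupby

# Hypothetical procedures given as (name, execution level) pairs.
procedures = [("upgrade-app", 2), ("upgrade-os", 1), ("upgrade-db", 2)]


def execution_order(procs):
    """Group procedures by execution level, lowest level first."""
    ordered = sorted(procs, key=lambda p: p[1])  # stable sort keeps input order per level
    return [[name for name, _ in group]
            for _, group in groupby(ordered, key=lambda p: p[1])]


order = execution_order(procedures)
```

With the values above, `upgrade-os` forms the first group and the two level 2 procedures form the second.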

The execute operation returns when either:

  • the campaign has completed successfully by completing all procedures and the wrap-up actions verifying the success;
  • the campaign has been suspended;
  • the campaign reached a failure state.

The execute command is effectively a resume operation when used in the suspended state regardless of the reason of the suspension. If the suspension is due to an asynchronous error, issuing the operation implies that the error condition has been dealt with.

If the campaign is in a failure state (an error other than an asynchronous one has been detected) the execute operation does not apply any more and returns an error. Whether a rollback may still apply or a fallback needs to be performed depends on the failure state.

Suspend

The administrator may suspend the execution of an upgrade campaign any time whether it is executing in the forward path or rolling back.

Suspending the campaign means that SMF carries on with any ongoing upgrade step until it is completed, undone, or rolled back at which point the parent procedure becomes suspended. So at the procedure level SMF suspends ongoing procedures at the next upgrade step boundary. When all ongoing procedures reach such a suspended state, the entire campaign reaches the suspended state.

Procedures in their verification stage cannot be suspended and similarly the final verification phase of the campaign cannot be suspended either.

In the suspended state the administrator decides whether to continue the execution of the campaign, to roll it back, or to perform a fallback. To resume a campaign suspended in its forward path the already mentioned execute operation is used. Alternatively a rollback may be initiated or resumed using the rollback operation discussed next. As we mentioned earlier there is no standard fallback operation defined currently, but it needs to be available in some way for a system to be SA Forum compliant.

The suspend operation is applicable while the campaign is in the executing or in the rolling back states; and it returns as soon as the ongoing campaign reaches a suspended or a failure state.

Rollback

The rollback operation applies only in a suspended state or when the campaign execution has completed successfully. If the campaign has completed or was suspended on the forward path, the rollback command initiates the graceful undoing of the configuration changes performed by the campaign thus far.

When issued on a suspended campaign which is already rolling back, the operation resumes the rollback process.

Similarly to the execute operation the rollback returns either when:

  • the rollback completed successfully;
  • the rollback was suspended; or
  • the rollback has failed.

In the last case from SMF perspective the only available administrative operation is a fallback.

Commit

With the commit operation the administrator confirms that the campaign or its rollback was successful and the SMF may release all resources but the backup allocated for the execution and the protection of the upgrade campaign. As a result it applies only in one of the completed states.

After committing a campaign a rollback cannot be performed any more, but the fallback operation may still return the system to the configuration before the campaign provided that the backup created at campaign initiation is still available. As mentioned earlier the upgrade campaign specification may specify a period between subsequent upgrade campaigns and guarantee that the backup and with that the fallback are available at least for this period.
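The applicability rules of the four operations can be collected into a transition table. The sketch below is a simplification built only from the rules described in this section: the state names are hypothetical, failure states and transient states are omitted, and internal events such as the campaign completing on its own are not modeled.

```python
# Simplified campaign states for this sketch (hypothetical names).
INITIAL = "initial"
EXECUTING = "executing"
SUSPENDED = "suspended"
COMPLETED = "completed"
ROLLING_BACK = "rolling back"
ROLLBACK_SUSPENDED = "rollback suspended"
ROLLBACK_COMPLETED = "rollback completed"
COMMITTED = "committed"

# operation -> {state in which it applies: state entered once it is accepted}
TRANSITIONS = {
    "execute": {INITIAL: EXECUTING, SUSPENDED: EXECUTING},
    "suspend": {EXECUTING: SUSPENDED, ROLLING_BACK: ROLLBACK_SUSPENDED},
    "rollback": {SUSPENDED: ROLLING_BACK, COMPLETED: ROLLING_BACK,
                 ROLLBACK_SUSPENDED: ROLLING_BACK},
    "commit": {COMPLETED: COMMITTED, ROLLBACK_COMPLETED: COMMITTED},
}


def apply_operation(state, operation):
    """Return the next state, or raise ValueError if the operation does not apply."""
    try:
        return TRANSITIONS[operation][state]
    except KeyError:
        raise ValueError(f"{operation!r} does not apply in state {state!r}") from None


# Execute, suspend on the forward path, then roll back.
state = apply_operation(INITIAL, "execute")
state = apply_operation(state, "suspend")
state = apply_operation(state, "rollback")
```

Because the committed state has no entry under `rollback`, the table also captures that a committed campaign can no longer be rolled back.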

9.3.6 User Perspective

When supporting upgrades at the middleware level the challenge is that the middleware cannot be aware of special application features and needs such as synchronization procedures or switching between functionalities or versions. These are only known at the application level.

For example, a new version of the software may implement a new synchronization method between the peer components protecting the same services. Since the currently running version does not understand this new method, it cannot be used as long as there may be old components in the protection group. But when all components have been upgraded to the new version it is desirable to switch them to this new synchronization method. One way to do it is to implement a negotiation in the new version of the software through which peers decide when the new method can be used.

The potential problem with this is that successful negotiation requires that all the parties impacted participate in the negotiation. The AMF, however, instantiates only the required number of components, which may not mean all components. For example, if in the service group only five SUs out of 10 need to be instantiated at a time, the uninstantiated ones cannot participate in the negotiation and may still have the old software. Suppose the five instantiated SUs have been upgraded and switched to the new method. If one of them then fails and one of the old ones is instantiated in its place, the old one cannot effectively collaborate with the others.

The SMF would be aware that there are 10 nodes to upgrade with the new software to complete the upgrade of the service group even if only five of its SUs are instantiated. And since it controls the upgrade, it also knows when the last node of the 10 is upgraded. What it does not know is that this is a significant event for the application. If the application could indicate somehow to SMF that it would like to know when the last node was upgraded and SMF could signal this then the application could perform the change to the new synchronization method at the right time.

The SMF specification introduces the concept of upgrade-awareness. It refers to SMF user processes that need to synchronize ‘application level’ actions during upgrades and therefore need to know about their initiation and progress.

At the discussion of the customization of the upgrade step and procedure in Section 9.3.3.2, we mentioned that the upgrade campaign schema allows one to define customized callbacks to user processes for certain stages of the campaign execution. These stages are the campaign and procedure initialization and wrap-up; within upgrade steps at particular actions of a particular upgrade step (or all steps) of an upgrade procedure; and also at rollback initiation.

These stages were defined with upgrade-aware processes in mind. The solution works as follows:

On one side, in the upgrade campaign specification there are these hook-up points at which one can define callbacks. The callback definition includes a callback label and a set of parameters. The label is like a keyword identifying some kind of actions at the application level. The optional parameters associated may provide further information if necessary. For SMF they are all completely transparent.

On the other hand, any process which needs to synchronize certain actions with an ongoing upgrade needs to register with the SMF to be able to receive a callback and also to define a filter for the callback labels that matches its interest.

When during the execution of the upgrade campaign SMF reaches a point for which the upgrade campaign specification defines a callback, SMF checks if there is a registered process which defined a filter that the specified callback label satisfies. If so, SMF calls back the interested process. In the callback the SMF implementation provides the callback label and parameters as defined in the upgrade campaign specification together with the DN of the SMF information model object representing the upgrade campaign, procedure, or step within which the call is being made and whether the actions need to be executed in the forward path or the reverse path.

For each callback, the upgrade campaign specification also indicates whether the SMF implementation needs to wait for the response of the process and for how long.

If the UCS specifies a waiting time, SMF essentially hands over the control to the application process for this period at most. When the process completes the actions associated with the callback label, it needs to report back the outcome to SMF, so SMF can decide whether the campaign can proceed or it encountered an error.

If the process reports a failure within the timeout period, it is treated as any other action failure of the campaign. That is, if it is within an upgrade step, the step can be retried otherwise a rollback or a fallback may be required.

If no wait timeout is specified or the timer expires before the response is received, SMF proceeds with the campaign regardless of the outcome.
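The matching of labels against filters and the handling of the response can be sketched as follows. Everything here is an assumption made for illustration: the `Registration` record, the prefix-style filter, the process names, and the string results are hypothetical and do not reproduce the SMF API.

```python
from dataclasses import dataclass


@dataclass
class Registration:
    process: str
    label_prefix: str  # hypothetical filter: matches labels starting with this prefix


def interested(registrations, label):
    """Return the processes whose filter is satisfied by the callback label."""
    return [r.process for r in registrations if label.startswith(r.label_prefix)]


def evaluate_response(wait_timeout, response):
    """Decide how the campaign continues after a callback.

    wait_timeout: seconds to wait, or None if the campaign specifies no wait.
    response: (succeeded, elapsed_seconds), or None if the process never answered.
    """
    if wait_timeout is None or response is None:
        return "proceed"              # no wait requested, or no answer at all
    succeeded, elapsed = response
    if elapsed > wait_timeout:
        return "proceed"              # timer expired before the response arrived
    return "proceed" if succeeded else "action failure"


registrations = [
    Registration("sync-daemon", "AppSync"),
    Registration("backup-agent", "BACKUP"),
]
called = interested(registrations, "AppSyncSwitch")
```

Only a failure reported within the timeout is treated as an action failure; in every other case the sketch lets the campaign proceed.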

To help upgrade campaign designers define the callbacks within an upgrade campaign, the component prototype descriptor in the ETF has a section in which the software vendor can describe the callbacks the implementation expects or is able to interpret. It consists of the labels and the conditions these labels indicate for the software implementation.

For example, in the ETF description of a component prototype the vendor may say that the components derived from the prototype expect a callback with the label BACKUP at campaign initialization at the time the system backup is created. Such a component would like to receive the callback whenever a new upgrade campaign starts, so the appropriate callback needs to be included in all campaigns targeting the system.

Another case is our earlier example on the new synchronization method for which the vendor may indicate that on the last step of the upgrade procedure (or in the procedure wrap up) the components expect a callback with the AppSyncSwitch label. Such a component would only be interested in campaigns that upgrade its own base type, so the callback needs to be included only for such upgrades.

9.3.7 Service Interaction

9.3.7.1 SMF Interaction with IMM

As we discussed in Sections 9.3.4.1 and 9.3.5, the inventory of software bundles and the upgrade campaigns are represented as objects in the SMF part of the information model. The administrator interacts with the SMF by manipulating these objects and issuing administrative operations using the IMM OM-API (object management application programming interface).

Accordingly, an SMF implementation needs to register with the IMM as the implementer of these objects using the IMM OI-API (the object implementer API of the IMM). From that moment on it will receive a callback from IMM whenever these objects change or when the administrator issues an administrative operation on any of them. When the operation completes SMF reports the result to IMM, which in turn delivers it to the administrator.

This part of the SMF interaction with IMM is the same as for all the AIS services.

However to perform upgrade campaigns the SMF also needs to act as a management client to the AIS services whose entities are being manipulated within a campaign. In IMM terms SMF needs to act as an object manager to perform this task and it needs to use the IMM OM-API. As a result it can issue administrative operations and change the system configuration by manipulating the configuration objects of the system information model. The prerequisite to this is that SMF is able to obtain the administrative ownership of the objects targeted by the upgrade campaign. In IMM the administrative ownership represents the right to manipulate an object (Chapter 8).

In particular, the standard actions of the upgrade step that are defined on the activation and deactivation units map into administrative operations on the entities of these units. We even gave this alternative definition of the activation and deactivation units in Section 9.3.3.2.

For example, in case of AMF, the action ‘terminate deactivation unit’ means that SMF needs to issue a lock-instantiation administrative operation on each of the entities that match the deactivation unit template.

The first version of the SMF specification provides the mapping of the upgrade actions to the AMF administrative operations only. It is not straightforward for other AIS services and other mapping may be required. For example, Cluster Membership service (CLM) does not provide separate administrative operations that could be mapped into the lock and the termination upgrade actions, so when the scope of the SMF is extended to CLM and other services these differences need to be dealt with.
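The mapping for AMF can be pictured as a lookup table. In the sketch below only the 'terminate deactivation unit' entry is taken from the text above; the other three entries are assumptions that follow the same pattern, and the entity DNs are hypothetical.

```python
# Upgrade step action -> AMF administrative operation.
# Only the 'terminate deactivation unit' entry is stated in the text;
# the remaining entries are assumptions of this sketch.
ACTION_TO_AMF_OPERATION = {
    "lock deactivation unit": "lock",
    "terminate deactivation unit": "lock-instantiation",
    "instantiate activation unit": "unlock-instantiation",
    "unlock activation unit": "unlock",
}


def operations_for(action, unit_entities):
    """Return one (operation, entity DN) pair per entity of the unit."""
    operation = ACTION_TO_AMF_OPERATION[action]
    return [(operation, dn) for dn in unit_entities]


# Hypothetical entities matching a deactivation unit template.
deactivation_unit = [
    "safSu=SU1,safSg=SG,safApp=Bank",
    "safSu=SU2,safSg=SG,safApp=Bank",
]
issued = operations_for("terminate deactivation unit", deactivation_unit)
```

A service without separate operations for the lock and termination actions, such as CLM in the example above, would need a different table or a different kind of mapping altogether.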

In addition to administrative operations, SMF also maps the configuration changes that are defined in the upgrade campaign specification into configuration change bundles (or IMM CCBs). This part of the upgrade campaign specification schema was defined so that this mapping is more or less straightforward.

Another part of the upgrade campaign that SMF maps into CCBs is the addition and removal of entity types and the customized upgrade action of campaign and procedure initialization and wrap up that indicate IMM operations.

Finally, SMF is also required to verify that the software bundles listed in the upgrade campaign specification are indeed in the software repository. The assumption here is that if the bundle is in the repository then there is an object in IMM representing it. However since this is related to image management, which is not covered by the first version of the specification, the specification also allows for implementation specific solutions.

9.3.7.2 SMF Interaction with AMF

From the previous section it follows that the SMF is not in direct interaction with the AMF. Instead this interaction is mediated through the IMM when it comes to the administrative operations. SMF changes the AMF configuration also through IMM by applying CCBs to the AMF objects that match the template specification for the entities targeted by the upgrade.

As we noted earlier SMF does not control how AMF distributes the assignments during an upgrade. It cannot do so exactly because of the way it interacts with AMF.

When SMF unlocks some entities after they were taken out for an upgrade and returns them to AMF's control AMF may or may not give them an assignment right away. It depends on the configuration and the current conditions in the system.

This means that any fault introduced by the upgrade may not manifest right away. There is no guarantee for this even in those cases when they would get an assignment.

As mentioned in section ‘Asynchronous Failures’, if an error was introduced during the upgrade the AMF recovery and repair mechanisms cannot resolve the resulting problem, therefore we need to prevent the normally triggered AMF repair and fault escalation mechanism. Instead the AMF specification provides a way to feed back the error to SMF as it is a potential upgrade failure.

To implement this feature the SMF evaluates the entities targeted by the configuration changes (i.e., those that are part of CCBs) and determines the enclosing AMF SUs. SMF sets the maintenance campaign attribute of these SUs to the upgrade campaign. This disables the AMF repair attempts and allows AMF to provide this information as feedback to SMF in the notifications it generates.

Obviously, SMF needs to listen to these notifications so that it can react to them in a timely manner. Currently it suspends the upgrade campaign and reports the problem to the administrator for deliberation.

Once the campaign has been committed SMF clears the maintenance campaign attribute setting so that AMF can apply the repair actions again.
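The handling of the maintenance campaign attribute can be sketched as follows. The sketch assumes AMF-style DNs in which the RDN of an SU starts with `safSu=` and ignores escaped commas; the dictionary standing in for the information model, the attribute key, and the example DNs are hypothetical.

```python
def enclosing_su(dn):
    """Return the DN of the SU that is or contains the entity, or None.

    Assumes AMF-style DNs in which an SU's RDN starts with 'safSu='.
    """
    rdns = dn.split(",")
    for i, rdn in enumerate(rdns):
        if rdn.startswith("safSu="):
            return ",".join(rdns[i:])
    return None


def mark_sus(ccb_targets, campaign_dn, su_attributes):
    """Set the maintenance campaign attribute on every enclosing SU."""
    for dn in ccb_targets:
        su = enclosing_su(dn)
        if su is not None:
            su_attributes.setdefault(su, {})["maintenanceCampaign"] = campaign_dn


def clear_sus(campaign_dn, su_attributes):
    """After commit, clear the attribute so that AMF repair applies again."""
    for attrs in su_attributes.values():
        if attrs.get("maintenanceCampaign") == campaign_dn:
            attrs["maintenanceCampaign"] = None


campaign = "safSmfCampaign=upgrade-v2"   # hypothetical campaign DN
su_attributes = {}                       # stand-in for the SU objects in IMM
mark_sus(["safComp=c1,safSu=SU1,safSg=SG,safApp=Bank"], campaign, su_attributes)
```

Entities that are not located within an SU are skipped by this sketch, since `enclosing_su` returns `None` for them.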

9.3.7.3 SMF Interaction with NTF

As we have seen in the previous section, part of the interaction between AMF and SMF is mediated by the Notification service (NTF) [39]. To receive the appropriate AMF notifications, the SMF needs to subscribe to the operational state change notifications generated by AMF for its SUs. In particular it needs to look for those that indicate in their additional information field the DN of the upgrade campaign being executed.
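The selection of the relevant notifications amounts to a simple predicate. The dictionary layout used below is a hypothetical stand-in for a notification and does not reproduce the NTF data structures; the DNs are illustrative as well.

```python
def concerns_campaign(notification, campaign_dn):
    """True for an operational state change notification that names the campaign."""
    return (
        notification.get("type") == "state change"
        and notification.get("state") == "operational"
        and campaign_dn in notification.get("additional_info", [])
    )


campaign = "safSmfCampaign=upgrade-v2"   # hypothetical campaign DN
notifications = [
    {"type": "state change", "state": "operational",
     "source": "safSu=SU1,safSg=SG,safApp=Bank", "additional_info": [campaign]},
    {"type": "state change", "state": "operational",
     "source": "safSu=SU7,safSg=SG,safApp=Other", "additional_info": []},
    {"type": "alarm", "source": "safSu=SU1,safSg=SG,safApp=Bank"},
]
relevant = [n["source"] for n in notifications if concerns_campaign(n, campaign)]
```

Only the first notification is selected, because it is the only operational state change that carries the DN of the campaign.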

As a notification producer, the SMF generates no alarms at all, which may be a surprise to some readers. The reason behind this decision is that the upgrade campaign is controlled through administrative operations and all the errors and failures are reported through this interface. There is no particular need to replicate the same information in the form of alarms.

The first version of SMF assumes a rather tight administrative control during the upgrade campaign execution and leaves all significant decisions to the administrator even in those cases when the state model would suggest only a single operation as a possibility. Under these circumstances there is no reason to alert the administrator by additional alarms.

With respect to notifications, an SMF implementation is expected to generate state change notifications for step, procedure, and campaign state changes. This provides an alternative, push-based way of monitoring the upgrade campaign in progress.

9.3.7.4 SMF Interaction with LOG

Even though no details are given in terms of the level, the content or the format, the SMF specification requires that an SMF implementation logs the progress of the upgrade campaign execution. It is recommended that SMF uses the AIS Log service [40].

9.3.8 Open Issues

Considering the scope of software management and that at the moment of writing only the first release of the SMF specification is available, it is not a surprise that the list of open issues is rather long.

The goal of the first release was to address the issues that are specific to SA and set the direction of software management for SA Forum compliant systems. As a result the scope of the first SMF specification was limited to:

  • the addition and removal of AMF entities; and
  • the upgrade of redundant AMF entities provided that the new version is compatible with the old.

No upgrade method has been defined for incompatible upgrades and for nonredundant entities such as services. These issues remain for future releases.

The area most people expect to be covered in a software management related specification is the software packaging and image management. As discussed in Section 9.3.1 in this area there are already a number of solutions, recommendations, and standards, which actually made it superfluous to address the topic. Yet, it would be beneficial to define the integration between the SA Forum SMF and the existing package handling utilities and image management solutions. We have seen already that the fact of representing software bundles as objects in the SMF information model offers the opportunity of linking the object creation and deletion with the package import and removal operations.

Other aspects, like the installation and removal of software within the cluster, do not lend themselves to such an easy resolution due to the variety of clustered solutions. It is difficult to define the required operations without making assumptions about the underlying architecture, which in turn may put unnecessary limitations on implementations and is therefore undesirable.

The situation is similar and related in the case of the backup and the fallback operations. The associated roll-forward operation adds complexity due to the diversity of applications. Yet, the subject of SA calls for a solution for these operations.

Obviously in the future the SMF has to address the entire AIS stack, and in particular the upgrade of the entities of the PLM [36], which covers the middleware, the operating system, the virtualization facilities, and even firmware.

The current SMF specification ultimately leaves it to the administrator to decide whether the execution or a rollback of an upgrade campaign was successful or not. Considering the complexity of these systems it is questionable whether it is feasible to expect the administrator to make such a decision or whether the decision should be automatic, based on the information available in the system. Experience with implementations shall show whether an automatic decision is preferred or the administrator should always be able to overrule the system's decision.

Since this is the section to let our imagination soar we can imagine a software management solution that would know when a new hardware node is plugged into the system what software it requires and install it automatically before the associated cluster nodes may join the cluster. The potential for such a solution is in the SA Forum architecture; however, the first release of the specification does not cover this area.

Another desired scenario that many would like a software management solution to resolve is the following: One would like to be able to provide a new desired configuration of the system and leave it up to the SMF to figure out how to migrate the system to this new configuration. After a short pondering about the task one may realize that to be able to do this for all possible cases would require a rather elaborate solution. Hence the SA Forum Technical Workgroup decided not to require this capability from SMF implementations and came up with the definition of the upgrade campaign instead. This allowed the workgroup to focus on the runtime aspects of the problem of software upgrades and also made implementing the SMF specification more feasible.

It also means that the problem is left for solutions that provide additional value to SA Forum compliant implementations. Note also that since the upgrade campaign has been standardized, the provider of such a solution can be anyone that follows this specification. Of course the challenge is to be able to come up with a solution that can generate an upgrade campaign that migrates the system from its current configuration to any new desired configuration with no or minimal service outage.

9.3.9 Recommendation

To provide guarantees with respect to SA in SA Forum compliant systems at least the upgrade of AMF entities that are in the scope of the first release of the SMF specification should be handled according to the specification. The reason behind this is that in such systems only these entities are considered for SA.

Therefore following the specification should provide the same SA guarantees across a variety of systems regardless of the actual SMF implementation used as all of them should align on at least the requirements of the standard. Hence application and system designers can rely on them when designing a new system, a new application or its new version.

As mentioned in earlier sections, the design of the SMF specification implies some assumptions beyond those of the AMF specification when it comes to SA during upgrades and application and system designers need to satisfy these assumptions.

So what is needed to be able to provide SA during upgrades?

Probably the most important requirement applications need to satisfy is the peer compatibility within a protection group as shown in Figure 9.5. That is, the new version of the software needs to be able to collaborate with the old version during the process of the rolling upgrade when the active and standby components may be running different versions of the software yet need to be able to exchange state information to participate in a protection group.

Figure 9.5 Compatibility types.

From this perspective one needs to consider not only the upgrade, but also the rollback or downgrade scenario. If anything goes wrong during the upgrade—and since Mr Murphy never sleeps this may happen—it is important that the rolling back of the campaign does not cause service outages either.

This backward compatibility requirement means different things for different applications depending on whether they are stateless or stateful and in what way.
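For a stateful component, one way to picture this requirement is a versioned state message exchanged between the active and standby peers. The following is a minimal sketch under assumptions that are not part of the SMF or AMF specifications: the JSON encoding, the field names `sessions` and `last_sequence`, and the function names are all hypothetical. It only illustrates that a new reader must tolerate an old writer (upgrade) and an old reader must tolerate a new writer (rollback).

```python
import json


def encode_state_v1(session_count):
    # Old (hypothetical) format: a single counter.
    return json.dumps({"version": 1, "sessions": session_count})


def encode_state_v2(session_count, last_sequence):
    # New (hypothetical) format: adds a field, existing fields keep their meaning.
    return json.dumps(
        {"version": 2, "sessions": session_count, "last_sequence": last_sequence}
    )


def decode_state(message, reader_version):
    """Decode a peer's state message in a component running reader_version."""
    data = json.loads(message)
    state = {"sessions": data["sessions"]}
    if reader_version >= 2:
        # New reader: supply a default for the field an old peer cannot send.
        state["last_sequence"] = data.get("last_sequence", 0)
    # An old reader simply ignores fields it does not know (rollback case).
    return state
```

In this sketch the new version only adds optional information, which is what lets both directions work. A change that renames or reinterprets an existing field would break the old peer and therefore the protection group during the rolling upgrade.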

The vertical compatibility between components of different types collaborating within an SU (which is what most people consider when the subject of compatibility comes up) determines the scope of each upgrade step: whether a component can be upgraded by itself, or whether the entire SU or the hosting node needs to be upgraded together.
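The decision about the scope of an upgrade step can be sketched as a small function. This is an illustration only: the two boolean inputs are hypothetical summaries that a designer would derive from the vendor's compatibility statements, and they are not attributes defined by the SMF specification.

```python
def upgrade_step_scope(works_with_old_su_peers, works_with_old_node_software):
    """Return the smallest unit that has to be upgraded in one step.

    Both flags are hypothetical inputs, not SMF attributes:
    - works_with_old_su_peers: the new component version can collaborate
      with the old versions of the other components in its SU.
    - works_with_old_node_software: the new software can run on top of the
      old software installed on the hosting node.
    """
    if works_with_old_su_peers and works_with_old_node_software:
        return "component"
    if works_with_old_node_software:
        return "service unit"
    return "node"
```

The narrower the scope this returns, the less capacity is taken out of service in each step of a rolling upgrade.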

As presented in Section 9.3.6 the SMF specification caters for application level coordination. This also needs to be considered for the case of rollback: if there is a customized callback on the forward pass, the SMF implementation will make a callback on the rollback pass symmetrically to the forward pass and will indicate the direction in which it is executing the campaign specification. The application needs to interpret this indication properly.
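From the application's side, handling the direction indication might look like the sketch below. It does not use the SMF API: the `Direction` enum, the `SchemaMigrator` class, the callback label `migrate-schema`, and the method signature are all hypothetical names chosen for illustration.

```python
from enum import Enum


class Direction(Enum):
    FORWARD = "forward"
    ROLLBACK = "rollback"


class SchemaMigrator:
    """Hypothetical application-level handler for a campaign callback."""

    def __init__(self):
        self.schema_version = 1

    def on_campaign_callback(self, label, direction):
        # Ignore callbacks that are addressed to some other handler.
        if label != "migrate-schema":
            return False
        if direction is Direction.FORWARD:
            # Forward pass: apply the application-level change.
            self.schema_version = 2
        else:
            # Rollback pass: undo exactly what the forward pass did.
            self.schema_version = 1
        return True
```

The point of the sketch is the symmetry: every action taken on the forward pass has a matching undo that is selected by the direction, so the same callback label can be invoked in both passes.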

One needs to proceed with caution, however, as this part of the specification is least tested in practice at the moment of writing; particularly when it comes to the upgrade campaign creation.

From Section 9.3.8 it is also clear that there is no standard way of upgrading the complete stack of SA Forum compliant systems, which means that different SMF implementations may treat the upgrade of non-AMF entities differently.

The main vision of the SMF specification is, however, that software management should be automated. This vision is reflected in Figure 9.6.

Figure 9.6 The information flow envisioned in the SMF specification [49].

Accordingly vendors would describe their software in ETFs, which are delivered with the software bundle. The ETF information is then used by a campaign builder tool to create a new configuration that uses the new software to provide the required services (given as ‘other input’) on the cluster the configuration of which is fetched from IMM. For a new system it may be enough to provide the new configuration via IMM. However for a running system the new configuration is processed to generate an upgrade campaign that would migrate the system from its current configuration to this targeted new one. In both cases the software bundles are delivered to the system's software repository so that SMF can use them to install the new software.
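The first thing a campaign builder tool has to do for a running system is compare the current configuration with the target one. The sketch below shows only that comparison, under strong simplifying assumptions: a configuration is reduced to a hypothetical mapping from entity name to software version, and the function name `plan_changes` is invented here. It is not how any particular SMF tool works.

```python
def plan_changes(current, target):
    """List the changes needed to get from current to target.

    current, target: dict mapping an entity name to its software version
    (a deliberately simplified, hypothetical view of a configuration).
    """
    plan = []
    for name in sorted(target):
        if name not in current:
            plan.append(("add", name, target[name]))
        elif current[name] != target[name]:
            plan.append(("upgrade", name, target[name]))
    for name in sorted(current):
        if name not in target:
            plan.append(("remove", name, current[name]))
    return plan
```

A real campaign builder would still have to group and order these changes into upgrade procedures and steps so that redundancy is preserved throughout, which is the hard part discussed in Section 9.3.8.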

9.4 Conclusion

Among the existing standards and specifications, the SA Forum SMF makes a leap forward in the field of software management. This is because the efforts focused on an area in which no specification or standard had yet provided guidelines. This was possible because of the AMF specification, which provided a conceptual basis that the SMF specification could rely on and develop further.

Of course, since the problem has existed for as long as high-availability systems have been in use, people have been dealing with it in different ways, so a variety of solutions were available. However, most of these solutions were proprietary. The terminology and the concepts used in the field varied enormously, so even when using the same term people would mean different things. So the first achievement of the SMF specification was to define a common terminology appropriate for the problem domain.

As we demonstrated throughout the chapter the SMF specification also provided, indeed, a framework to deal with high availability and SA in the context of software upgrades and system reconfigurations.

Even more, it set the direction for the area to follow in the future as it further evolves addressing the still open issues.

The concepts and methods defined in the specification can be used in a much wider context and not only for SA Forum systems. It is true that certain features are highly specialized for these systems, but the ideas are not limited to them and can be extended easily to the software entities of the platform or even to hardware entities. Grid and cloud computing are candidates for such extensions.

In some cases the current specification is overly cautious since it targets systems at the highest extreme of the scale. These precautions can be easily removed in systems of lower requirements. Even for SA systems some of the precautions may need to be removed in the future as it is extremely difficult for a human to make decisions about the consistency and the correctness of such systems. These breakpoints are really there to be able to feed in the results and reports of other tools that may help the administrator.

As discussed in Section 9.3.8, future releases should address the upgrade of the entire SA Forum stack and also the integration with existing utilities that complement the scope of the SMF.

 

 

1 SUs of a service group that are hosted on the same node group are identical except for the time of an upgrade during which—due to the nature of the operation—belonging to the same service group does not imply the same SU type.

2 Note that the ‘completed’ state is not a final state in the sense as the final state is defined for finite state machines.

3 For the purpose of the current discussion we will refer to this state as the campaign's initial configuration. It is not to be confused with the initial configuration in effect at system startup time.

4 For the upgrade step the ‘undone’ and the ‘rolled back’ states are equivalent. ‘Undone’ is a final state of the upgrade step automaton causing a rollback, while other steps that reached the ‘completed’ state will move to the ‘rolled back’ state.

5 Note that we distinguish the system configuration from the system state here. We elaborate on this distinction in section.

6 Being ‘application level’ is determined from SMF perspective and not from the typical system stack. It means that it is outside of the scope SMF is aware of and designed to handle.
