2.2. Non-functional requirements

The following sections describe some additional considerations related to the infrastructure. These considerations come under the heading of non-functional as they do not relate to a specific functional unit of the grid, such as job management, broker, and so on.

2.2.1. Performance

When enabling an application to execute in a grid environment, both the performance of the grid and the performance requirements of the application must be considered. The service requester is interested in a quality of service that includes acceptable turnaround time. A service provider that builds a grid, and one or more applications offered as a service on that grid, is also interested in maximizing the utilization and throughput of the systems within the grid. The performance objectives of these two perspectives are discussed below.

Resource provider’s perspective

The performance objective for a grid infrastructure is to achieve maximum utilization of the various resources within the grid and, through that, maximum throughput. The resources may include but are not limited to CPU cycles, memory, disk space, federated databases, or application processing. Workload balancing and preemptive scheduling may be used to achieve the performance objectives. Applications may be allowed to take advantage of multiple resources by dividing the work into smaller units that are distributed throughout the grid. The goal is to take advantage of the grid as a whole to improve the application performance. Workload management can ensure that all resources within the grid are actively servicing jobs or requests.

Service requester’s perspective

The turnaround time of an application running on the grid could vary depending on the type of grid resource used and the resource provider’s quality-of-service agreement. For example, a quick turnaround may be achieved by submitting a processing-intensive standalone batch job to a high-performance grid resource. This assumes that the job is started immediately and that it is not preempted by another job during execution. The same batch job may be scheduled to run overnight when the resource demands are lower if a quick turnaround is not required. The resource provider may charge different prices for these two types of service.

If the application has many independent sub-jobs that can be scheduled for parallel execution, the turnaround time could be improved appreciably by running the sub-jobs concurrently on multiple grid hosts.
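As a rough illustration of this point, the following Python sketch estimates turnaround time when independent sub-jobs are spread over several hosts. The function name estimate_turnaround, the greedy longest-first placement, and the sample run times are assumptions made for this example. A real grid scheduler would also account for queueing, data movement, and differences in host speed.

```python
import heapq

def estimate_turnaround(sub_job_times, host_count):
    """Estimate turnaround when independent sub-jobs are spread over hosts.

    Greedy longest-first placement: each sub-job goes to the host that
    currently has the least work. Scheduling and network overhead are ignored.
    """
    if host_count < 1:
        raise ValueError("host_count must be at least 1")
    loads = [0.0] * host_count          # accumulated run time per host
    heapq.heapify(loads)
    for t in sorted(sub_job_times, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + t)
    return max(loads)                   # finish time of the busiest host

sub_jobs = [30, 30, 20, 20, 10, 10]     # hypothetical run times in minutes
serial = estimate_turnaround(sub_jobs, 1)
parallel = estimate_turnaround(sub_jobs, 3)
```

With these sample values the single-host estimate is 120 minutes and the three-host estimate is 40 minutes.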

Turnaround time factors

This section discusses some of the factors that can impact the turnaround time of applications run on the grid resources.

Communication delays

Network speed and network latency can have a significant impact on application performance if the application must communicate with another application running on a remote machine. It is important to consider the proximity of the communicating applications to one another as well as the network speed and latency.

Data access delays

The network bandwidth and speed will be the critical factors for applications that need to access remote data. Proximity of the application to the data and the network capacity/speed will be important considerations.
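A first-order estimate can make the effect of latency and bandwidth concrete. The sketch below is an approximation under stated assumptions: the function name transfer_time_seconds and the sample link figures are hypothetical, and protocol overhead, congestion, and disk speed are ignored.

```python
def transfer_time_seconds(payload_bytes, bandwidth_bits_per_s, latency_s, round_trips=1):
    """First-order estimate: latency cost per round trip plus time on the wire."""
    if bandwidth_bits_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return round_trips * latency_s + (payload_bytes * 8) / bandwidth_bits_per_s

# Hypothetical case: 1 GB of input data moved in a single transfer.
one_gb = 1_000_000_000
lan = transfer_time_seconds(one_gb, 1_000_000_000, 0.0005)   # gigabit LAN, 0.5 ms latency
wan = transfer_time_seconds(one_gb, 100_000_000, 0.080)      # 100 Mbit/s WAN, 80 ms latency

# Hypothetical chatty application: 10,000 small messages of 200 bytes over the WAN.
chatty = transfer_time_seconds(200 * 10_000, 100_000_000, 0.080, round_trips=10_000)
```

For the bulk transfer, bandwidth dominates. For the chatty exchange, almost all of the time is latency, which is why the proximity of communicating applications matters.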

Lack of optimization of the application to the grid resource

Optimum application performance is usually achieved by proper tuning and optimization on a particular operating system and hardware configuration. This poses possible issues if an application is simply loaded on a new grid host and run. This issue may be resolved if the service provider makes an arrangement with the resource provider so that the application’s optimum configuration and resource requirements are identified ahead of time and applied when the application is run.

Contention for resource

Resource contention is always a problem when resources are shared. If resource contention impacts performance significantly, alternate resources may need to be introduced. For example, if a database is the source of contention, then introducing a replica may be an answer. In addition, the network may need to be divided to separate the traffic to the databases. Optimum sharing of the grid hosts may be achieved by a proper scheduling algorithm and workload balancing. For example, the shortest job first (SJF) batch job scheduling algorithm may provide the best average turnaround time.
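The effect of job ordering on turnaround time can be shown with a small calculation. This sketch assumes that all jobs are available at time zero, that they run one after another on a single host without preemption, and that run times are known in advance. The function name and the sample run times are illustrative only.

```python
def average_turnaround(run_times):
    """Average turnaround when jobs run back to back in the given order.

    All jobs are assumed to arrive at time 0 and run without preemption.
    """
    clock = 0
    total = 0
    for t in run_times:
        clock += t          # this job finishes at the current clock value
        total += clock
    return total / len(run_times)

jobs = [8, 1, 3, 2]                        # hypothetical run times in hours
fcfs = average_turnaround(jobs)            # first come, first served
sjf = average_turnaround(sorted(jobs))     # shortest job first
```

Here the average turnaround drops from 10.75 hours in submission order to 6.0 hours under SJF. The trade-off is that long jobs can wait a long time if short jobs keep arriving.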

Reliability

Failures in the grid resource and network can cause unforeseen delays. To provide reliable job execution, the grid resource may apply various recovery methods for different failures. For example, in a checkpoint-restart environment, some amount of delay is incurred each time a checkpoint is taken. A much longer delay may be experienced if the server crashes and the application is migrated to a new server to complete the run. In other instances, the delay may last for the entire time needed to recover from a failure, such as a network outage.

2.2.2. Reliability

Reliability is always an issue in computing, and the grid environment is no exception. The best way to approach this difficult issue is to anticipate as many failures as possible and provide a means to handle them, so that the system tolerates surprises. The grid computing infrastructure must deal with host interruptions and network interruptions. Below are some approaches to dealing with such interruptions.

Checkpoint-restart

While a job is running, checkpoint images are taken at regular intervals. A checkpoint contains a snapshot of the job's state. If a machine crashes or fails during the job execution, the job can be restarted on a new machine using the most recent checkpoint image. In this way, a long-running job that runs for months or even years can continue to run even though computers fail occasionally.
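The idea can be sketched at the application level as follows. This is a minimal illustration, not how any particular grid product implements checkpointing: the job (summing integers), the JSON checkpoint file, the function name run_job, and the fail_at parameter that simulates a crash are all assumptions made for the example.

```python
import json
import os
import tempfile

def run_job(total_steps, checkpoint_path, checkpoint_every=100, fail_at=None):
    """Sum the integers below total_steps, checkpointing at regular intervals.

    If a checkpoint file exists, the job resumes from it instead of step 0.
    fail_at simulates a machine crash at the given step.
    """
    state = {"step": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)            # restart from the latest snapshot

    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated host crash")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            # Write to a temporary file first so a crash during the write
            # cannot corrupt the previous checkpoint.
            tmp = checkpoint_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(state, f)
            os.replace(tmp, checkpoint_path)
    return state["total"]

path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
try:
    run_job(1000, path, fail_at=250)        # first host crashes at step 250
except RuntimeError:
    pass
result = run_job(1000, path)                # a new run resumes from step 200
```

Only the work done since the last checkpoint (steps 200 through 249 here) is repeated. A shorter checkpoint interval reduces the repeated work but adds more checkpoint overhead during normal execution.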

Persistent storage

The relevant state of each submitted job is stored in persistent storage by a grid manager to protect against local machine failure. When the local machine is restarted after a failure, the stored job information is retrieved and the connection to the job manager is reestablished.
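A minimal sketch of this approach, using a SQLite file as the persistent store, might look like the following. The class name JobStore, the table layout, the status values, and the example.com style URLs are all hypothetical. Reconnecting to the job manager is left out, since it depends on the grid middleware in use.

```python
import os
import sqlite3
import tempfile

class JobStore:
    """Keeps the state of submitted jobs in a SQLite file so it survives restarts."""

    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS jobs ("
            "job_id TEXT PRIMARY KEY, manager_url TEXT, status TEXT)"
        )
        self.conn.commit()

    def record(self, job_id, manager_url, status):
        # INSERT OR REPLACE keeps one row per job and updates it in place.
        self.conn.execute(
            "INSERT OR REPLACE INTO jobs VALUES (?, ?, ?)",
            (job_id, manager_url, status),
        )
        self.conn.commit()

    def unfinished(self):
        rows = self.conn.execute(
            "SELECT job_id, manager_url FROM jobs "
            "WHERE status != 'DONE' ORDER BY job_id"
        )
        return rows.fetchall()

    def close(self):
        self.conn.close()

db_path = os.path.join(tempfile.mkdtemp(), "jobs.db")
store = JobStore(db_path)
store.record("job-1", "https://gridhost.example/jobmanager/1", "ACTIVE")
store.record("job-2", "https://gridhost.example/jobmanager/2", "DONE")
store.close()                       # simulate the local machine going down

recovered = JobStore(db_path)       # after restart, reload the stored state
to_reconnect = recovered.unfinished()
recovered.close()
```

After the simulated restart, to_reconnect lists the jobs whose job managers should be contacted again.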

Heartbeat monitoring

In a healthy heartbeat, a probing message is sent to a process and the process responds. If the process fails to respond, an alternate process may be probed. The alternate process can help to determine the status of the first process, and even restart it. However, if the alternate process also fails to respond then we assume that either the host machine has crashed or the network has failed. In this case, the client must wait until the communication can be reestablished.
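The decision logic described above can be written down compactly. In this sketch the probes are passed in as plain callables, so the transport (for example a TCP connection attempt with a timeout) is left open. The function name and the three status strings are assumptions made for illustration.

```python
def check_heartbeat(probe_primary, probe_alternate):
    """Classify process health from two probes.

    Each probe is a callable that returns True if the process answered
    within its timeout and False otherwise.
    """
    if probe_primary():
        return "healthy"
    if probe_alternate():
        # The alternate process is reachable, so the host and network are up;
        # it can report on the primary process or restart it.
        return "primary-down"
    # Neither process answered: assume a host crash or network failure
    # and wait until communication can be reestablished.
    return "unreachable"

status_ok = check_heartbeat(lambda: True, lambda: True)
status_restart = check_heartbeat(lambda: False, lambda: True)
status_wait = check_heartbeat(lambda: False, lambda: False)
```

Note that the alternate process is probed only when the primary fails to respond, which keeps the probing traffic low in the healthy case.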

System management

Any design will require a basic set of systems management tools to help determine availability and performance within the grid. A design without these tools is limited in how much support and information can be given about the health of the grid infrastructure. Alternate networks within a grid architecture can be dedicated to perform these functions so as to not hamper the performance of the grid.

2.2.3. Topology considerations

The distributed nature of grid computing makes spanning across geographies and organizations inevitable. As an intra-grid topology is extended to an inter-grid topology, the complexity increases. For example, the non-functional and operational requirements such as security, directory services, reliability, and performance become more complex. These considerations are discussed briefly in the following sections.

Figure 2-6. Grid topologies


Network topology

The network topology within the grid architecture can take on many different shapes. The networking components can represent the LAN or campus connectivity, or even WAN communication between the grid networks. The network’s responsibility is to provide adequate bandwidth for any of the grid systems. Like many other components within the infrastructure, the networking can be customized to provide higher levels of availability, performance, or security.

Grid systems are for the most part network intensive due to security and other architectural limitations. For data grids in particular, which may have storage resources spread across the enterprise network, an infrastructure that is designed to handle a significant network load is critical to ensuring adequate performance.

The application-enablement considerations should include strategies to minimize network communication and to minimize the network latency. Assuming the application has been designed with minimal network communication, there are a number of ways to minimize the network latency. For example, a gigabit Ethernet LAN could be used to support high-speed clustering, or a high-speed Internet backbone could be used between remote networks.

Data topology

It is desirable to assign executing jobs to the machines nearest to the data that these jobs require. This reduces network traffic and may ease scalability limits.
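One simple way to express this preference is to rank the idle hosts by how much of a job's input data they already hold. The sketch below assumes a hypothetical host catalog (a dictionary of host names with their local datasets and a busy flag) and treats "nearest" as "stored on that host". A real broker would also weigh network distance, host load, and data size.

```python
def pick_host(job_datasets, hosts):
    """Choose the idle host that holds the most of the job's input data locally.

    hosts maps a host name to a dict with 'datasets' (a set of dataset names
    stored on that host) and 'busy' (bool). Returns None if no host is idle.
    """
    candidates = [name for name, h in hosts.items() if not h["busy"]]
    if not candidates:
        return None
    # Sort names first so ties are broken deterministically.
    return max(sorted(candidates),
               key=lambda name: len(job_datasets & hosts[name]["datasets"]))

hosts = {
    "hostA": {"datasets": {"genome", "index"}, "busy": False},
    "hostB": {"datasets": {"genome"}, "busy": False},
    "hostC": {"datasets": {"genome", "index", "refs"}, "busy": True},
}
chosen = pick_host({"genome", "index", "refs"}, hosts)
```

In this example hostC holds all three datasets but is busy, so hostA is chosen because it holds two of them.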

Data requires storage space, and a grid design offers many storage options. The storage needs to be secured, backed up, managed, and/or replicated. Within a grid design, you want to make sure that your data is always available to the resources that need it. Besides availability, you want to make sure that your data is properly secured, as you would not want unauthorized access to sensitive data. Lastly, you want good performance for access to your data. Some of this depends on the bandwidth and distance to the data, but you will not want any I/O problems to slow down your grid applications. For applications that are more disk intensive, or for a data grid, more emphasis can be placed on storage resources, such as those providing higher capacity, redundancy, or fault tolerance.

2.2.4. Mixed platform environments

A grid environment is a collection of heterogeneous hosts with various operating systems and software stacks. To execute an application, the grid infrastructure needs to know the application’s prerequisites to find the matching grid host environment. Below are some things that the grid infrastructure must be aware of to ensure that applications can execute properly. It is equally important for the application developer to consider these factors in order to maximize the kinds and numbers of environments on which the application will be able to execute.

Runtime considerations

The application’s runtime requirements and the grid host’s runtime environment must match. As an example, below are some considerations for Java applications. Similar requirements may exist for applications developed in other languages.

Java Virtual Machine (JVM)

Applications written in the Java programming language require the Java Virtual Machine (JVM). Java applications may be sensitive to the JVM version. To address this sensitivity, the application needs to identify the JVM version as a prerequisite. The prerequisite may specify the required JVM version or the minimum JVM version.

Java applications may be sensitive to the Java heap size. The Java application needs to specify the minimum heap size as part of its prerequisite.

Java platform editions such as J2SE or J2EE may also need to be identified as part of the prerequisites.
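Matching these prerequisites against a host description can be sketched as follows. The dictionary keys (jvm_exact, jvm_min, min_heap_mb, packages, and so on) and the sample host data are hypothetical, not part of any grid middleware. Versions are compared as tuples of integers so that, for example, 1.10 is treated as newer than 1.4.

```python
def parse_version(text):
    """Turn '1.4.2' into (1, 4, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))

def host_satisfies(prereq, host):
    """Return True if the host's Java environment meets the prerequisites."""
    host_jvm = parse_version(host["jvm_version"])
    if "jvm_exact" in prereq and host_jvm != parse_version(prereq["jvm_exact"]):
        return False
    if "jvm_min" in prereq and host_jvm < parse_version(prereq["jvm_min"]):
        return False
    if host["max_heap_mb"] < prereq.get("min_heap_mb", 0):
        return False
    # Every required package must be installed on the host.
    return set(prereq.get("packages", [])) <= set(host["packages"])

prereq = {"jvm_min": "1.4.1", "min_heap_mb": 256, "packages": ["J2SE"]}
hosts = {
    "hostA": {"jvm_version": "1.3.1", "max_heap_mb": 512, "packages": ["J2SE"]},
    "hostB": {"jvm_version": "1.4.2", "max_heap_mb": 512, "packages": ["J2SE", "J2EE"]},
    "hostC": {"jvm_version": "1.4.2", "max_heap_mb": 128, "packages": ["J2SE"]},
}
matching = sorted(name for name, h in hosts.items() if host_satisfies(prereq, h))
```

Here only hostB qualifies: hostA has an older JVM than the stated minimum, and hostC cannot provide the minimum heap size.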

Availability of application across platforms (portability)

The executables of certain applications are platform specific. For example, an application written in the C or C++ programming language needs to be recompiled on the target platform before it can be run. The application could be pre-compiled for each platform and the resulting executables marked for a target platform. This will increase the number of qualifying grid host environments where the application can run. The limitation of this method will be the cost-effectiveness of porting the application to another platform.
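Selecting among pre-compiled executables marked for a target platform could be as simple as a lookup keyed on operating system and CPU architecture. The catalog contents and paths below are hypothetical. The platform names follow what Python's platform.system() and platform.machine() report on common systems, but the exact strings vary by platform and should be checked on each target.

```python
import platform

# Hypothetical catalog: (operating system, CPU architecture) -> executable path
EXECUTABLES = {
    ("Linux", "x86_64"): "bin/linux-x86_64/solver",
    ("Darwin", "arm64"): "bin/macos-arm64/solver",
    ("Windows", "AMD64"): "bin/win-amd64/solver.exe",
}

def select_executable(system=None, machine=None):
    """Return the executable built for the given platform, or None if unsupported.

    When no arguments are given, the platform of the current host is used.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    return EXECUTABLES.get((system, machine))

linux_binary = select_executable("Linux", "x86_64")
unsupported = select_executable("Plan9", "mips")
```

A host for which no entry exists is simply not a qualifying environment for the application until the application is ported to it.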

Awareness of OS environment

The grid is a collection of heterogeneous computing resources. If the application has certain dependencies or requirements specific to the operating system, the application needs to verify that the correct environment is available and handle issues related to the differing environments.

Output file formats

The knowledge of the output file format is necessary when the output of an application running on one grid host is accessed by another application running on a different grid host. The two grid hosts may have different platform environments. XML may be considered as the data exchange format. XML has now become popular not only as a markup language for data exchange, but also as a data format for semi-structured data.
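As a small illustration, one application could write its results as XML text and another, possibly on a different platform, could read them back. The element names (jobOutput, result) and the function names are invented for this sketch. The example uses Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

def write_results(results):
    """Serialize a dict of named numeric results to an XML string."""
    root = ET.Element("jobOutput")
    for name, value in results.items():
        item = ET.SubElement(root, "result", name=name)
        item.text = repr(value)
    return ET.tostring(root, encoding="unicode")

def read_results(xml_text):
    """Parse the XML produced by write_results back into a dict of floats."""
    root = ET.fromstring(xml_text)
    return {item.get("name"): float(item.text) for item in root.findall("result")}

xml_text = write_results({"mean": 4.25, "max": 9.0})
roundtrip = read_results(xml_text)
```

Because the values travel as text, the reader does not need to know the byte order or word size of the host that produced them.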
