Architecture of the cluster resource manager

The process of resource management is complex and driven by multiple, highly dynamic parameters. To balance resources, the resource manager must know all the active services and the resources each of them is consuming at any point in time. It must also know the actual capacity of every node in the cluster in terms of memory, CPU, and disk space, as well as the aggregate amount of resources available across the cluster. The resources consumed by a service can change over time, depending on the load it handles, and this too must be accounted for before deciding to move a service from one node to another. To add to the complexity, the cluster resources themselves are not static: the number of nodes in the cluster can increase or decrease at any time, changing the load distribution. Scheduled or unscheduled upgrades can also roll through the cluster, causing temporary outages of nodes and services. Finally, because cloud resources run on commodity hardware, the resource manager must be highly fault tolerant.
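To make this concrete, the following minimal Python sketch models the kind of state the resource manager has to reason over. All type and field names here are hypothetical illustrations, not Service Fabric APIs:

# Hypothetical model of cluster state: per-node capacity plus the
# load of every service placed on each node.
from dataclasses import dataclass, field

@dataclass
class NodeCapacity:
    memory_mb: int
    cpu_cores: int
    disk_gb: int

@dataclass
class ClusterState:
    capacity: dict = field(default_factory=dict)   # node -> NodeCapacity
    placement: dict = field(default_factory=dict)  # node -> {service: load}

    def total_capacity(self) -> NodeCapacity:
        # Aggregate capacity across all live nodes; this changes whenever
        # a node joins or leaves the cluster.
        return NodeCapacity(
            memory_mb=sum(c.memory_mb for c in self.capacity.values()),
            cpu_cores=sum(c.cpu_cores for c in self.capacity.values()),
            disk_gb=sum(c.disk_gb for c in self.capacity.values()),
        )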

To achieve these tasks, the Service Fabric cluster resource manager uses two components. The first is an agent installed on every node of the cluster. The agent is responsible for collecting information from its host node, such as CPU utilization, memory utilization, and remaining disk space, and relaying it to a centralized service. The agent is also responsible for heartbeat checks for the node.
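The agent's reporting loop might look like the following sketch. Here psutil (a third-party library) is used only to read local utilization, and the send callable stands in for the real agent-to-service protocol, which Service Fabric does not expose:

# Hypothetical per-node agent loop: snapshot local metrics and push
# them to the central service; a missed report doubles as a failed
# heartbeat from the central service's point of view.
import time
import psutil

def collect_node_metrics() -> dict:
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_free_gb": psutil.disk_usage("/").free / 1e9,
        "timestamp": time.time(),
    }

def agent_loop(send, interval_s: float = 5.0) -> None:
    while True:
        send(collect_node_metrics())
        time.sleep(interval_s)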

The second component is a service. Service Fabric itself is a collection of services, and the cluster resource manager service is responsible for aggregating all of the information supplied by the agents and other management services, and for reacting to changes based on the desired state configuration of the cluster and its services. Fault tolerance of the resource manager service is achieved through replication, just as it is for the services hosted on Service Fabric: the resource manager service runs seven replicas to ensure high availability.
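On the receiving side, the central service's bookkeeping could be sketched as follows. The staleness threshold and the report shape are assumptions made for illustration; the point is that a stale report translates into a suspected node failure:

# Hypothetical aggregator inside the resource manager service: keep the
# latest report per node and flag nodes whose reports have gone stale.
import time

class CentralAggregator:
    def __init__(self, stale_after_s: float = 15.0):
        self.stale_after_s = stale_after_s
        self.latest = {}  # node -> most recent metrics report

    def ingest(self, node: str, report: dict) -> None:
        self.latest[node] = report

    def suspected_down(self) -> list:
        # Nodes whose last report is older than the staleness threshold.
        now = time.time()
        return [n for n, r in self.latest.items()
                if now - r["timestamp"] > self.stale_after_s]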

To understand the process of aggregation, let's take an example.

The following figure illustrates a Service Fabric cluster with six nodes. Seven services, named A, B, C, D, E, F, and G, are deployed on this cluster. The diagram shows the initial distribution of the services based on the placement rules configured for them. Services A, B, and C are placed on node 5 (N5), service D on node 6 (N6), service G on node 2 (N2), service F on node 3 (N3), and service E on node 4 (N4). The resource manager service itself is hosted on node 1 (N1). Every node runs a Service Fabric agent that communicates with the resource manager service hosted on N1:

General resource manager functions
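Expressed as plain data, the example placement looks like this. The load numbers are invented purely so that the rebalancing sketches later in this section are runnable:

# The initial placement from the figure; service loads are made up.
placement = {
    "N1": {"ClusterResourceManager": 30},  # hosts the resource manager itself
    "N2": {"G": 20},
    "N3": {"F": 25},
    "N4": {"E": 15},
    "N5": {"A": 30, "B": 25, "C": 30},
    "N6": {"D": 20},
}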

During runtime, if the amount of resources consumed by services changes, if a service fails, or if a node joins or leaves the cluster, all the changes on a specific node are aggregated and periodically sent to the central resource manager service, as indicated by lines 1 and 2 in the preceding figure. Once aggregated, the results are analyzed before being persisted by the resource manager service. Periodically, a process within the cluster resource manager service examines all of the accumulated changes and determines whether any corrective actions are required, as indicated by step 3 in the figure.
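Step 3 can be sketched as a simple threshold check over the aggregated state. The single scalar load metric and the capacity numbers are deliberate simplifications; the real service weighs many metrics and placement constraints:

# Hypothetical periodic check: flag nodes whose aggregate service load
# exceeds capacity, and nodes with plenty of headroom.
def find_imbalance(placement: dict, capacity: int = 80):
    load = {node: sum(svcs.values()) for node, svcs in placement.items()}
    overloaded = [n for n, l in load.items() if l > capacity]
    underloaded = [n for n, l in load.items() if l <= capacity / 2]
    return overloaded, underloaded

# With the example placement above, N5 (load 85) is overloaded and
# N4 (load 15) has the most headroom.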

To understand step 3 in detail, let's consider a scenario where the cluster resource manager determines that N5 is overloaded, as reported by the agent installed on N5. The following diagram illustrates the rebalancing process governed by the resource manager. The resource manager service checks the available resources on the other nodes of the cluster. Let's assume that N4 is underutilized, as reported by the agent installed on N4. The resource manager then coordinates with other subsystems to move a service, service B in this instance, to N4. This is indicated by step 5 in the following diagram:

The resource manager reconfigures the cluster
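Continuing the sketches above, the corrective move might look like the following. The greedy choice of which service to move, and where, is purely illustrative and is not Service Fabric's actual placement algorithm:

# Move the smallest service that brings the overloaded node back under
# capacity onto the least-loaded node with headroom.
def rebalance_once(placement: dict, capacity: int = 80) -> None:
    overloaded, underloaded = find_imbalance(placement, capacity)
    if not overloaded or not underloaded:
        return
    src = overloaded[0]
    dst = min(underloaded, key=lambda n: sum(placement[n].values()))
    excess = sum(placement[src].values()) - capacity
    for svc, load in sorted(placement[src].items(), key=lambda kv: kv[1]):
        if load >= excess:
            placement[dst][svc] = placement[src].pop(svc)
            break

rebalance_once(placement)
# N5 was at 85 against a capacity of 80; moving B (load 25) to N4
# brings N5 down to 60 and N4 up to 40, mirroring step 5 in the figure.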

This whole process is automated and its complexity is abstracted from the end user. This level of automation is what makes hyperscale deployments possible on Service Fabric.
