7

Model Aggregation

In the Model aggregation basics section of Chapter 3, Workings of the Federated Learning System, we introduced the concept of aggregation within the federated learning (FL) process at a high level. Recall that aggregation is the means by which an FL approach uses the models trained locally by each agent to produce a model with strong global performance. It is easy to see that the strength and robustness of the aggregation method employed are directly tied to the performance of the resulting global model.

As a result, choosing the appropriate aggregation method based on the local datasets, agents, and FL system hierarchy is key to achieving good performance with FL. In fact, the focal point of many publications in this field is providing mathematically backed convergence guarantees for these methods in a variety of theoretical scenarios.

The goal of this chapter is to cover some of the research that has been done on aggregation methods and their convergence in both ideal and non-ideal cases, tying these methods to their strengths in the different scenarios that arise in the practical applications of FL. After reading the chapter, you should be able to understand how different characterizations of an FL scenario call for different aggregation methods, and you should have an idea of how these algorithms can actually be implemented.

In this chapter, we will cover the following topics:

  • Revisiting aggregation
  • Understanding FedAvg
  • Modifying aggregation for non-ideal cases

Technical requirements

The Python algorithm implementations presented in the book can all be found in the ch7 folder, which is located at https://github.com/PacktPublishing/Federated-Learning-with-Python/tree/main/ch7.

Important note

You can use the code files for personal or educational purposes. Please note that we will not support deployments for commercial use and will not be responsible for any errors, issues, or damages caused by using the code.

For the pure aggregation algorithms, auxiliary code is included to display example output from preset local parameters. The aggregation methods that modify the local training process require an FL system in order to operate – for these, full implementations using STADLE are included. Also, the pure aggregation algorithms can be directly tested with STADLE by configuring the aggregation method. Information on how to run the examples can be found in the associated README files.

The installation of the stadle-client package through pip is necessary to run the full FL process examples. The following command can be used to perform this installation:

pip install stadle-client

Using a virtual environment is recommended to isolate the specific package versions installed with stadle-client from other installations on the system.
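For example, on Linux or macOS, a virtual environment can be created and activated before installing the package as follows (fl-env is an arbitrary environment name):

python3 -m venv fl-env
source fl-env/bin/activate
pip install stadle-client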

Revisiting aggregation

To solidly contextualize aggregation within FL, first, we describe the components of a system that are necessary for FL to be applied:

  • A set of computational agents that perform the local training portion of FL.
  • Each agent possesses a local dataset (static or dynamic), of which no portion can be communicated to another agent under the strictest FL scenario.
  • Each agent possesses a parameterized model that can be trained on the local dataset, a process that produces a locally optimal parameter set for the model.
  • A parameter server, or aggregator, which receives the locally trained models at each iteration from the agents and sends back the resulting model produced by the aggregation method chosen to be used.

Every FL communication round can then be broken down into the following two phases:

  • The local training phase, where agents train their local models on their local datasets for some number of iterations.
  • The aggregation phase, where the agents send the resulting trained local models from the previous phase to the aggregator and receive the aggregated model for use as the starting model in the local training phase of the next round.

So, what exactly does it mean for an agent to send a locally trained model during the aggregation phase? The general approach is to send the parameter sets that define the local models, allowing for some degree of generalization across all models that can be parameterized in such a way. A second approach instead sends to the aggregator the local gradients accumulated during local training when a gradient-based optimization method is used, with the agents updating their models using the received aggregate gradient at the end of the round. While this approach restricts usage to models with gradient-based local training methods, the prevalence of such methods when training deep learning models has led to a subset of aggregation methods based on gradient aggregation. In this chapter, we choose to frame model aggregation through the lens of the FedAvg algorithm.

Understanding FedAvg

In Chapter 3, Workings of the Federated Learning System, the aggregation algorithm known as FedAvg was introduced to help clarify the general structure and to ground the more abstract concepts discussed earlier in a specific example. FedAvg was used for two reasons: the simplicity of the underlying algorithm, and its generalizability to more model types than gradient-based approaches support. It also benefits from extensive attention from researchers, who commonly use FedAvg as a baseline when analyzing performance in different theoretical scenarios and proposing new aggregation methods. This focus in the research community can most likely be attributed to the fact that the original FedAvg paper was published by the team at Google that first brought exposure to the concept and benefits of FL. For further reading, this paper can be found at https://arxiv.org/abs/1602.05629?context=cs.

FedAvg is predated by an aggregation approach known as Federated Stochastic Gradient Descent (FedSGD). FedSGD can be viewed as the gradient aggregation analog of the model parameter averaging performed by FedAvg. In addition, the concept of averaging model parameters was examined prior to FedAvg for parallelized SGD approaches, outside of the context of FL. Essentially, the analysis of these parallelized SGD approaches mirrors the Independently and Identically Distributed (IID) case of FedAvg – this concept will be discussed later in the section. Regardless, the simplicity, generalizability, and popularity of FedAvg make it a good base to delve deeper into, contextualizing the need for the numerous aggregation approaches discussed in later sections that have built upon or, otherwise, improved on FedAvg.
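To make the FedSGD contrast concrete before returning to FedAvg, the following toy sketch (not taken from the book's repository) averages local gradients with NumPy and applies a single step to the shared model; the function and parameter names are illustrative:

import numpy as np

def fedsgd_aggregate(local_gradients, num_samples, global_weights, lr):
    # Weighted average of the local gradients, mirroring FedAvg's
    # weighted average of the local model parameters
    total = sum(num_samples)
    avg_grad = sum((n / total) * g for g, n in zip(local_gradients, num_samples))
    # A single gradient step applied to the shared global model
    return global_weights - lr * avg_grad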

Previously, FedAvg was only presented as an algorithm that takes models $w_1, \ldots, w_K$ with respective local dataset sizes $n_1, \ldots, n_K$, where $\sum_{k=1}^{K} n_k = N$, and returns the weighted average:

$$w_{agg} = \sum_{k=1}^{K} \frac{n_k}{N} w_k$$

As shown in the Aggregating local models section of Chapter 4, Federated Learning Server Implementation with Python, simple-fl uses the following function to compute a weighted average of the buffered models (models sent from clients during the current round) based on the amount of data used to locally train each model:

def _average_aggregate(self,
                       buffer: List[np.array],
                       num_samples: List[int]) -> np.array:
    """
    Given a list of models, compute the weighted average model (FedAvg).
    This function provides a primitive mathematical operation.
    :param buffer: List[np.array] - A list of models to be aggregated
    :param num_samples: List[int] - The local dataset sizes used to weight each model
    :return: np.array - The aggregated model
    """
    denominator = sum(num_samples)
    # Weighted average: each model contributes proportionally to its dataset size
    model = float(num_samples[0]) / denominator * buffer[0]
    for i in range(1, len(buffer)):
        model += float(num_samples[i]) / denominator * buffer[i]
    return model

The original algorithm does not differ too greatly from this portrayal. The high-level steps of the algorithm are as follows:

  1. The server randomly samples K * C clients, where K is the total number of clients and C is a parameter between 0 and 1.
  2. The selected K * C clients receive the most recent aggregate model and begin to train the model on their local data.
  3. Each client sends its locally trained model back to the server after some desired amount of training is completed.
  4. The server computes the parameter-wise weighted average of the received models (weighted by local dataset size, as shown earlier) to produce the newest aggregate model.

Parallels can be immediately drawn between this formal representation and our presentation of the FL process, with the ClientUpdate procedure from the original paper's pseudocode performing local training for an agent and the server performing aggregation using the same weighted averaging algorithm. One important difference is the sampling of a subset of clients to perform the local training and model transmission during each round; this client subsampling is parameterized by C. This parameter is included to experimentally determine the convergence rates for various client set sizes – in an ideal scenario, it is set to 1.
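As a minimal sketch of these steps, consider the following toy server loop; the client interface (a local_update method returning the trained parameters and the local dataset size) is a hypothetical stand-in, not the book's repository code:

import random

def fedavg_round(agg_model, clients, C):
    # Step 1: randomly sample a fraction C of the K clients
    m = max(1, int(len(clients) * C))
    selected = random.sample(clients, m)
    # Steps 2-3: each selected client trains locally on the received
    # aggregate model and reports back (trained_params, dataset_size)
    results = [client.local_update(agg_model) for client in selected]
    # Step 4: dataset-size-weighted parameter average
    total = sum(n for _, n in results)
    return sum((n / total) * params for params, n in results)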

As previously mentioned, FedAvg in the ideal FL scenario essentially mirrors an approach to parallelized stochastic gradient descent. In parallelized SGD (pSGD), the goal is to leverage hardware parallelization (for example, running on multiple cores in parallel) in order to speed up SGD convergence on a specific machine learning task. One approach is for each core to train a base model on some subset of the data in parallel for some number of iterations, then aggregate the partially trained models and use the aggregated model as the next base for training. If the cores are considered to be agents in an FL scenario, this parallelized SGD approach is the same as FedAvg in an ideal scenario. This means all of the convergence guarantees and respective analyses derived for pSGD can be directly applied to FedAvg, assuming the ideal FL scenario. From this prior work, it has been shown that FedAvg demonstrates strong convergence rates.

After all this praise for FedAvg, it is only natural to question why more complex aggregation methods are even necessary. Recall that the phrase “ideal FL scenario” was used several times when discussing FedAvg convergence. The unfortunate reality is that most practical FL applications will fail to meet one or more of the conditions stipulated by that phrase.

The ideal FL scenario can be broken down into three main conditions:

  • The local datasets used for training are IID (the datasets are independently drawn from the same data distribution).
  • The computational agents are relatively homogeneous in computational power.
  • All agents can be assumed to be non-adversarial.

At a high level, it is clear why these qualities would be desirable in an FL scenario. To understand, in greater detail, why these three conditions are necessary, the performance of FedAvg in the absence of each condition will be examined in the upcoming subsections.

Dataset distributions

To examine FedAvg in the non-IID case, first, it is important to define what exactly is meant by the distribution of a dataset. In classification problems, the data distribution often refers to the distribution of the true classes associated with each data point. For example, consider the MNIST dataset, where each image is a handwritten digit from 0 to 9. If a uniform random sample of 1,000 images were taken from the dataset, the expected number of images from each class would be the same – this could be considered a uniform data distribution. Alternatively, a sample with 910 images of the digit 0 along with 10 images of each of the other digits would be a heavily skewed data distribution.
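As a quick standalone illustration (toy code, independent of the book's repository), the following NumPy snippet draws both a uniform and a skewed sample of MNIST-style labels and prints the class counts:

import numpy as np

rng = np.random.default_rng(0)
# Uniform distribution: each of the 10 digits is equally likely
uniform_sample = rng.choice(10, size=1000, p=[0.1] * 10)
# Skewed distribution: 91% zeros, 1% of each other digit
skewed_sample = rng.choice(10, size=1000, p=[0.91] + [0.01] * 9)
print(np.bincount(uniform_sample, minlength=10))
print(np.bincount(skewed_sample, minlength=10))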

To generalize outside of classification tasks, this definition can be extended to refer to the distribution of features present across the dataset. These features could be manually crafted and provided to the model (such as linear regression), or they could be extracted from the raw data as part of the model pipeline (such as deep CNN models). For classification problems, the class distribution is generally contained within the feature distribution, due to the implicit belief that the features are sufficient for correctly predicting the class. The benefit of looking at feature distributions is the data-centric focus on features (versus the task-centric focus on classes), allowing for generalization across machine learning tasks.

However, in the context of experimental analysis, the ability to easily construct non-IID samples from a dataset makes classification tasks ideal for testing the robustness of FedAvg and different aggregation methods within an FL context. To examine FedAvg in this section, consider a toy FL scenario where each agent trains a CNN on data samples taken from the MNIST dataset described earlier. There are two main cases, which are detailed next.

IID case

The convergence of the models can be represented through the use of the model parameter space. The parameter space of a model with n parameters can be thought of as an n-dimensional Euclidean space, where each parameter corresponds to one dimension in the space. Consider an initialized model; the initial parameters of this model can then be represented as a point in the parameter space. As local training and aggregation occur, this representative point will move in the parameter space, with the end goal being convergence to a point in the space corresponding to a local optimum of the loss or error function being minimized.

One key point of these functions is the dependence on the data used during the local training process – when the datasets across the agents are IID, there is a general tendency for the optima of the respective loss/error functions to be relatively close in the parameter space. Consider a trivial case where the datasets are IID and all models are initialized with the same parameters. As shown in the Model aggregation basics section of Chapter 3, Workings of the Federated Learning System, a simplified version of the parameter space can be depicted:

Figure 7.1 – Models with the same initialization and IID datasets

Observe how both models start at the same point (purple x) and move toward the same optima (purple dot), resulting in aggregate models close to the optima shared by both models.

Due to the resulting similarity of the error/loss functions across the agents, the models tend to converge toward the same or similar optima in the space during training. This means that the change in the models after each aggregation step is relatively small, resulting in the convergence rates mirroring the single local model case. If the underlying data distribution is representative of the true data distribution (for example, uniform across the 10 different digits for MNIST), the resulting aggregated model will demonstrate strong performance.

Next, consider the generalized IID case where each model is initialized separately:

Figure 7.2 – Models with different initializations and IID datasets

In this scenario, observe how both models start at different points (bold/dotted x) and initially move toward different optima, producing a poor first model. However, after the first aggregation, both models start at the same point and move toward the same optima, resulting in similar convergence to the first case.

It should be clear that this reduces to the previous case after the first aggregation step since each model starts the second round with the resulting aggregated parameters. As a result, the convergence properties previously stated can be extended to the general case of FedAvg with IID local datasets.

Non-IID case

The key property in the IID local dataset case that allows for convergence speeds mirroring the single model case is the similarity of the local optima of the loss/error functions, due to their construction from similar data distributions. In the non-IID case, similarity in the optima is generally no longer observed.

Using the MNIST example, let’s consider an FL scenario with two agents such that the first agent only has images with digits 0 to 4 and the second agent only has images with digits 5 to 9; that is, the datasets are not IID. These datasets would essentially lead to two completely different five-class classification tasks at the local training level, as opposed to the original 10-class classification problem – this will result in completely different parameter space optima between the first agent and the second agent. Consider the simplified representation of this parameter space as follows, with both models having the same initialization:

Figure 7.3 – Models with the same initialization and non-IID datasets

Now that optima are no longer shared (triangles/squares representing optima for the bold/dotted model, respectively), even repeated aggregation cannot produce an aggregate model close to the optima of either model. The models diverge, or drift, during each local training phase due to the different target optima in each round.

Only a small subset of the optima will be shared between the loss/error functions of both agents. As a result, there is a high probability that each model will move toward optima that are not shared during local training, leading the models to drift apart in the parameter space. Each aggregation step will then pull the models toward the wrong optima, reverting the progress made during local training and hampering convergence. Note that just taking the average of optima from different agents is very unlikely to be near optima from any of the agents in the parameter space, so in this case, the result of continued aggregation is generally a model that performs poorly across the whole dataset. Convergence to a shared optimum might eventually occur due to stochasticity observed during local training, inducing movement of the aggregate model in the parameter space, but this does not have theoretical guarantees and will be far slower than convergence in the IID case when it does occur.

Important note

This MNIST example is a theoretical extreme of non-IID datasets. In practice, non-IID datasets might exhibit different skews in the data distributions across agents (for example, twice as many images with digits 0–4 versus 5–9, and vice versa). The severity of the difference is correlated with the performance of FedAvg, so adequate performance can still be reached in less severe cases. However, even in these cases, the performance of FedAvg will generally be inferior to the analogous centralized training task where a single model is trained on all of the local datasets at once – the theoretically optimal model achievable by FL.

While this section focused on the statistical basis for the issues that arise from non-IID datasets, the next section examines a far more direct problem that can arise – especially when deploying at larger scales.

Computational power distributions

An unstated assumption about the agents participating in FL is that each agent is capable of completing local training, given enough time. Agents with limited computational power (memory and speed) might take significantly more time than other agents to finish local training, or they might require techniques such as quantization to support the model and training process. Agents that cannot complete local training during some rounds will trivially prevent convergence by stalling the FL process.

Generally, convergence bounds and experimental results focus on the number of communication rounds required to reach some level of performance. Under this metric and the aforementioned assumption, convergence is completely independent of the computational power afforded to each agent, since computational power only affects the actual time necessary to complete one round. However, convergence speed in practical applications is measured by the actual time taken, not the number of completed communication rounds – this means that the time to complete each round is as important as the number of rounds. It is on this metric of total time taken that naïve FedAvg demonstrates poor performance when heterogeneous computational power is observed among the agents participating in FL.

Specifically, the time to complete each round is bottlenecked by the local training time of the slowest agent participating in the round; this is because aggregation is trivially fast compared to training in most cases and must wait for all agents to complete local training. When all agents participate in every round, this bottleneck becomes the slowest overall agent. In the homogeneous computational power case, the difference in local training time between the fastest and slowest agents will be relatively insignificant. In the heterogeneous case, a single straggler can greatly increase the convergence time of FedAvg and lead to significant idle time for the faster agents waiting to receive the aggregated model.

Two modifications to FedAvg with full agent participation might initially seem to address this problem; however, both have drawbacks that lead to suboptimal performance:

  • One approach is to rely on agent subsampling in each round, so that the probability of the straggler effect occurring in a given round depends on the number of agents and the sample size taken each round. This can be sufficient in cases with only a few straggling agents, but it becomes proportionately worse as this number increases and does not completely prevent the problem from occurring. In addition, small sample sizes lose out on the robustness benefits of aggregating over more agents.
  • A second approach is to allow all agents to begin local training at the beginning of each round and prematurely begin aggregation after some number of models have been received. This method has the benefit of being able to completely eliminate the straggler effect without greatly restricting the number of agents participating in aggregation during each round. However, it results in the slowest agents never participating in aggregation over all rounds, essentially reducing the number of active agents and potentially limiting the variety of data used during training. In addition, the agents that are too slow to participate in aggregation will have done computational work for no benefit.

It is clear that some local adjustment based on available computational power is necessary for aggregation to be performed efficiently, regardless of the end aggregation method applied to the received models at the end of each round.

Both the non-IID case and the heterogeneous computation power case focus on properties of an FL system that are generally easy to observe and under some level of administrative control. The next case we present deviates from this by challenging a key assumption when considering practical FL systems.

Protecting against adversarial agents

So far, it has been assumed that every agent participating in an FL scenario always acts in the desired way; that is, actively and correctly training the received model locally and participating in model transmission to/from the aggregator. This is easily achieved in a research setting, where the federated setting is simulated and the agents are singularly controlled; however, this assumption of agents behaving correctly does not always hold in practice.

One example that does not involve targeted malicious intent is an error in the model weights transmitted by an agent to the aggregator. This can happen when the dataset used by an agent is flawed or the training algorithm is incorrectly implemented (corruption of the parameter data during transmission is also possible). In the worst case, this can lead to the parameters of one or many models being statistically equivalent to random noise. When the L2 norm (the extension of vector magnitude to n-dimensional tensors) of the random noise is not significantly greater than that of the valid models, FedAvg will suffer a performance loss proportional to the ratio of faulty agents to all agents – relatively acceptable when this ratio is small. However, even a single faulty agent can induce a near-random aggregate model if the norm of the agent's noise is sufficiently large. This is due to the nature of the arithmetic mean being performed internally during FedAvg aggregation.

The problem becomes worse when agents can be controlled by malicious adversaries. A single malicious agent with sufficient information is capable of producing any desired model after aggregation through large modifications to the parameters of the model it submits. Even without direct knowledge of the model parameters and associated weights of the other agents, a malicious agent can leverage relatively small changes between the local models and the aggregate model in later rounds to use the previous aggregate model as an estimate of the expected local model parameters.

Therefore, FedAvg offers little to no robustness against both random and controlled adversarial agents in an FL setting. While one potential means of mitigation would be to separately monitor the agents and prevent adversarial agents from transmitting models, significant damage to the convergence of the final model might have already occurred in the time necessary to identify such agents.

It should now be clear that FedAvg trades robustness in these non-ideal cases for simplicity in the calculation. Unfortunately, this robustness is a key consideration for practical applications of FL due to the lack of control compared to the research setting. The next section focuses on methods of achieving robustness against the three non-ideal cases presented in this section.

Modifying aggregation for non-ideal cases

In practical FL applications, at least one of the aforementioned assumptions that constitute an ideal FL scenario generally does not hold; therefore, the usage of alternative aggregation methods might be necessary to best perform FL. The goal of this section is to cover examples of aggregation methods that target heterogeneous computational power, adversarial agents, and non-IID datasets, in order of difficulty.

Handling heterogeneous computational power

As mentioned earlier, the ideal aggregation approach, in this case, consistently avoids the straggler effect while maximizing the number of agents participating in FL and allowing all agents to contribute to some extent, regardless of computational power differences. Agents become stragglers during a round when their local training takes significantly more time than the majority of the agents. Therefore, effectively addressing this problem actually requires some level of adaptability at the agent level in the local training process, based on the computational power available to each agent.

Manual adjustment

One straightforward way of accomplishing this is to change the number of local training iterations based on the time necessary for each iteration. In other words, the local training time is fixed and each agent performs as many iterations as possible within this time, as opposed to performing a fixed number of iterations. This trivially eliminates the straggler problem, but it can result in poor performance: if a large amount of local training time must be allocated for slow agents to contribute meaningfully, the faster agents may perform so many local iterations that their models drift. This can be mitigated by setting a maximum number of local training iterations. However, a careful balance in the allocated local training time must be found: slow agents need enough time to produce adequate models, while faster agents should not sit idle after reaching the maximum number of iterations. It is also unclear how such a threshold could be determined preemptively to achieve optimal performance, rather than relying on experimental search for the best configuration.
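A minimal sketch of such a time-budgeted training loop is shown below; the training step is abstracted away, and the budget and cap values are purely illustrative:

import time

def train_with_time_budget(train_step, time_budget_s=60.0, max_iterations=500):
    # Run local training steps until the time budget is exhausted or the
    # iteration cap is reached, whichever comes first
    start = time.monotonic()
    iterations = 0
    while time.monotonic() - start < time_budget_s and iterations < max_iterations:
        train_step()  # One local training iteration (for example, one minibatch)
        iterations += 1
    return iterations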

Automatic adjustment – FedProx

An aggregation method known as FedProx follows this same methodology of dynamically adjusting the local training processes for each agent based on computational power, while also revising the termination condition for local training to aid in the theoretical analysis of convergence. Specifically, the fixed number of local training iterations is replaced by a termination condition for the training loop that accommodates agents with varying levels of computational power.

The underlying concept for this termination condition is the γ-inexact solution, which is satisfied when the magnitude of the gradient at the current parameters is less than γ times the magnitude of the gradient at the start of local training. Intuitively, γ is a value between 0 and 1, with values closer to 0 leading to more local training iterations due to the stricter termination condition. Therefore, γ allows an agent's computational power to be parameterized.

One potential problem with the termination condition approach is that a strict condition can cause the locally trained model to diverge from the aggregate model over many iterations of local training. To combat this, FedProx adds a proximal term to the objective function being minimized, equal to the following:

$$\frac{\mu}{2} \left\lVert w - w^t \right\rVert^2$$

Here, $w$ represents the current local model weights and $w^t$ represents the received aggregate model weights.

The proximal term penalizes differences between the current weights and the aggregated model weights, restricting the aforementioned local model divergence with the strength parameterized by μ. From these two concepts, FedProx allows for a variable number of iterations proportional to the computational power to be performed by each agent without requiring manually tuned iteration counts for each agent or a set amount of allocated training time. Because of the addition of the proximal term, FedProx requires gradient-based optimization methods to be employed in order to work – more information on the underlying theory and comparison to FedAvg can be found in the original paper (which is at https://arxiv.org/abs/1812.06127).

Implementing FedProx

Because the modifications made by FedProx to FedAvg are all on the client side, the actual implementation of FedProx consists entirely of modifications to the local training framework. Specifically, FedProx involves a new termination condition for local training and the addition of a constraining term to the local loss function. Therefore, it is helpful to use an example of the local training code to frame exactly how FedProx can be integrated.

Let’s consider the following generic training code using PyTorch:

agg_model = ... # Get aggregate model – abstracted out of example
model.load_state_dict(agg_model.state_dict())
for epoch in range(num_epochs):
    for batch_idx, (inputs, targets) in enumerate(trainloader):
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

Let this be the code that performs num_epochs epochs of training on the local dataset using the received aggregate model for each round. The first necessary modification for FedProx is to replace the fixed number of epochs with a dynamic termination condition, checking whether a γ-inexact solution has been found with the aggregated model as the initial model. To do this, the total gradient over the entire training dataset for the aggregate model and the current local model must be stored – this can be performed as follows:

agg_model = ... # Get aggregated model from aggregator
model.load_state_dict(agg_model.state_dict())
agg_grad = None
curr_grad = None
gamma = 0.9
mu = 0.001

Values for the two FedProx parameters, gamma and mu, are set, and variables to store the gradients of both the aggregate model and the latest local model are defined.

We then define the γ-inexact new termination condition for local training using these gradient variables:

def gamma_inexact_solution_found(curr_grad, agg_grad, gamma):
    if (curr_grad is None):
        return False
    return curr_grad.norm(p=2) < gamma * agg_grad.norm(p=2)

This condition is now checked before each training loop iteration to determine when to stop local training. The total_grad variable is created to store the cumulative gradients that were created from each minibatch during backpropagation:

model.train()
while (not gamma_inexact_solution_found(curr_grad, agg_grad, gamma)):
    total_grad = torch.cat([torch.zeros_like(param.data.flatten()) for param in model.parameters()])
    for batch_idx, (inputs, targets) in enumerate(trainloader):
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)

To compute the proximal term, the weights of both the aggregate model and the current local model are flattened into vectors. From these, the proximal term is computed and added to the loss:

        # Use param rather than param.data so that the proximal term is
        # tracked by autograd and contributes to the gradients
        curr_weights = torch.cat([param.flatten() for param in model.parameters()])
        # The aggregate weights are constants, so detached data is fine here
        agg_weights = torch.cat([param.data.flatten() for param in agg_model.parameters()])
        prox_term = mu * torch.norm(curr_weights - agg_weights, p=2)**2
        loss += prox_term

The gradients are computed and added to the cumulative sum stored in total_grad:

        loss.backward()
        grad = torch.cat([param.grad.flatten() for param in model.parameters()])
        total_grad += grad
        optimizer.step()

Finally, we update agg_grad (if the gradients were computed with the aggregate weights) and curr_grad after the current local training iteration is completed:

    # The first pass starts from the aggregate model's weights, so its total
    # gradient is saved as the reference for the termination condition
    if agg_grad is None:
        agg_grad = total_grad
    curr_grad = total_grad

These modifications allow for FedProx to be implemented on top of FedAvg. The full FL example using FedProx can be found at https://github.com/PacktPublishing/Federated-Learning-with-Python/tree/main/ch7/agg_fl_examples/cifar_fedprox_example.

An auxiliary approach for the heterogeneous computational power scenario, which helps with computational efficiency when only mild heterogeneity is observed, is the idea of compensation in aggregation. Consider the case where aggregation occurs once the number of received models surpasses some threshold (generally less than the number of participating agents). Using this threshold mitigates the straggler effect; however, the work done by slower agents ends up being discarded each round, leading to training inefficiency.

The core idea of compensation is to allow the local training done by a slower agent in one round to instead be included in the model aggregation of a subsequent round. The age of the model is compensated for in this subsequent round by multiplying the weight used in the weighted average by a penalizing term during aggregation. By doing so, slower agents can be given two or three times as much training time as the faster agents while avoiding the straggler effect. Mild heterogeneity is required to prevent cases where slower agents need too much extra time for training: the penalty applied to a model after many rounds have passed will be severe enough to effectively nullify its contribution, reducing the process to aggregation without compensation – this is necessary to prevent models that are too old from hampering the convergence of the aggregate model.
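A minimal sketch of this idea, assuming a simple exponential staleness discount (the decay factor, interfaces, and weighting scheme are illustrative rather than a specific published method):

import numpy as np

def compensated_aggregate(models, num_samples, staleness, decay=0.5):
    # staleness[i] = number of rounds since model i began training;
    # stale models are down-weighted by decay ** staleness, so very old
    # models contribute effectively nothing
    weights = [n * (decay ** s) for n, s in zip(num_samples, staleness)]
    total = sum(weights)
    return sum((w / total) * m for w, m in zip(weights, models))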

Finally, we examine methods that help to address the third non-ideal property, where some subset of agents are controlled by an adversary or are, otherwise, behaving in an undesirable way.

Adversarial agents

In the previous section, it was shown that the core problem with FedAvg in the presence of adversarial agents was the lack of robustness to outliers in the underlying arithmetic mean used during aggregation. This naturally raises the question of whether this mean can be estimated in such a manner that does offer such robustness. The answer is the class of robust mean estimators. There are many such estimators that offer varying trade-offs between robustness, distance from the true arithmetic mean, and computational efficiency.

For use as a base for the implementation of the following aggregation methods, consider the following general aggregation function:

def aggregate(parameter_vectors):
    # Perform some form of aggregation
    return aggregated_parameter_vector

This function takes a list of parameter vectors and returns the resulting aggregated parameter vector.

Now we will examine three example implementations of robust mean estimators.

Aggregation using the geometric median

The geometric median of a sample is the point minimizing the sum of Euclidean (L2) distances between itself and the sample points. It is conceptually similar to the arithmetic mean, which is the point minimizing the sum of squared L2 distances between itself and the sample points. Using unsquared distances grants greater robustness to outliers; in fact, an arbitrary point can only be induced in the geometric median if at least half of the points come from adversarial agents. However, the geometric median has no closed-form solution and must instead be computed through numerical approximation or iterative algorithms.

To compute the geometric median iteratively, Weiszfeld's algorithm can be used as follows (epsilon is the convergence tolerance):

import numpy as np

def geometric_median_aggregate(parameter_vectors, epsilon):
    vector_shape = parameter_vectors[0].shape
    vector_buffer = [v.flatten() for v in parameter_vectors]
    prev_median = np.zeros(vector_buffer[0].shape)
    delta = np.inf
    vector_matrix = np.vstack(vector_buffer)
    while delta > epsilon:
        # Distances from each vector to the current median estimate
        # (assumed nonzero; coincident points would need special handling)
        dists = np.sqrt(np.sum((vector_matrix - prev_median[np.newaxis, :])**2, axis=1))
        # Weighted least-squares update with weights inversely proportional to distance
        curr_median = np.sum(vector_matrix / dists[:, np.newaxis], axis=0) / np.sum(1 / dists)
        delta = np.linalg.norm(curr_median - prev_median)
        prev_median = curr_median
    return prev_median.reshape(vector_shape)

This algorithm uses the fact that the geometric median of a set of points is the point that minimizes the sum of Euclidean distances over the set, performing a form of weighted least squares with weights inversely proportional to the Euclidean distance between the point and the current median estimate at each iteration.
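The robustness difference can be checked quickly: with one extreme outlier among a few honest vectors (toy values for illustration), the arithmetic mean is pulled far away while the geometric median stays near the honest cluster:

vectors = [np.array([1.0, 1.0]), np.array([1.2, 0.9]),
           np.array([0.9, 1.1]), np.array([100.0, -100.0])]  # Last vector is adversarial
print(np.mean(vectors, axis=0))                   # Pulled far toward the outlier
print(geometric_median_aggregate(vectors, 1e-6))  # Remains near (1, 1)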

Aggregation using the coordinate-wise median

The coordinate-wise median is constructed by taking the median of each coordinate across the sample, as the name suggests. This median can be directly computed, unlike the geometric median, and intuitively offers similar robustness to outliers due to the properties of the median in univariate statistics. However, it is unclear whether the resulting model displays any theoretical similarities to the arithmetic mean in regard to performance on the dataset and convergence.

NumPy makes the implementation of this function quite simple, as follows:

def coordinate_median_aggregate(parameter_vectors):
    return np.median(parameter_vectors, axis=0)

It is clear that the coordinate-wise median is far more computationally efficient to compute than the geometric median, trading off theoretical guarantees for speed.

Aggregation using the Krum algorithm

An alternative approach is to isolate outlier points from adversarial agents prior to aggregation. The most well-known example of this approach is the Krum algorithm, where distance-based scoring is performed prior to aggregation as a means of locating outlier points.

Specifically, the Krum algorithm first computes the pairwise L2 distance between each pair of points – these distances are then used to compute a score for each point equal to the sum of its n-f-2 smallest distances to other points (n is the number of points and f is a parameter bounding the number of adversarial agents). Krum then outputs the received point with the lowest score, effectively returning the point with minimal total L2 distance to its nearest neighbors while ignoring potential outliers. Alternatively, the scoring approach used by Krum can be used to trim outlier points prior to computing the arithmetic mean. In both cases, for sufficiently large n with 2f+2 < n, convergence rates similar to those of FedAvg in the non-adversarial case are achieved. More information on the Krum algorithm can be found in the original paper, which is located at https://papers.nips.cc/paper/2017/hash/f4b9ec30ad9f68f89b29639786cb62ef-Abstract.html.

The Krum algorithm can be used to perform aggregation as follows:

def krum_aggregate(parameter_vectors, f, use_mean=False):
    num_vectors = len(parameter_vectors)
    filtered_size = max(1, num_vectors - f - 2)
    scores = np.zeros(num_vectors)
    for i in range(num_vectors):
        distances = np.zeros(num_vectors)
        for j in range(num_vectors):
            distances[j] = np.linalg.norm(parameter_vectors[i] - parameter_vectors[j])
        # Sum the filtered_size smallest distances to other points, skipping
        # the zero self-distance at the front of the sorted array
        scores[i] = np.sum(np.sort(distances)[1:filtered_size + 1])
    if use_mean:
        # Trimmed mean over the lowest-scoring (most central) points
        idx = np.argsort(scores)[:filtered_size]
        return np.mean(np.stack(parameter_vectors)[idx], axis=0)
    else:
        # Standard Krum: return the single lowest-scoring point
        idx = np.argmin(scores)
        return parameter_vectors[idx]

Note that a flag has been included to determine which of the two Krum aggregation approaches (single selection versus trimmed mean) should be used. Vectorizing the distance computation is possible, but the iterative approach was preferred due to the expectation of large parameter vectors and smaller agent counts.
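The three estimators can also be compared side by side on the same adversarial setup (a toy check, assuming the three functions above are in scope):

rng = np.random.default_rng(0)
honest = [np.array([1.0, 1.0]) + 0.05 * rng.standard_normal(2) for _ in range(8)]
vectors = honest + [np.array([50.0, -50.0])]       # One adversarial vector
print(np.mean(vectors, axis=0))                    # Heavily skewed by the outlier
print(coordinate_median_aggregate(vectors))        # Robust
print(geometric_median_aggregate(vectors, 1e-6))   # Robust
print(krum_aggregate(vectors, f=1))                # Selects an honest vector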

Non-IID datasets

The theoretical underpinning granted by IID local datasets plays a significant role in allowing performant aggregate models to be achieved through FL. At a high level, this can be explained by the discrepancy between the learning done by models on different datasets. No theoretical guarantees can be made for the convergence of such models when dataset-agnostic aggregation methods are applied – unless constraints on the non-IID nature of the datasets are imposed. The key hindering factor is the high probability of local models moving toward non-shared optima in the parameter space, leading to consistent drift between the local models and the aggregate model after each local training phase.

There are methods that attempt to restrict the modifications made to the aggregate model based on the local machine learning task, relying on the overparameterization of deep learning models to find relatively disjoint parameter subsets with which to optimize the aggregate model for each task. One such aggregation approach is FedCurv, which uses the Fisher information matrix of the previous aggregate model to regularize parameter modifications during local training. However, the robustness of this approach in extreme non-IID cases likely needs further testing to ensure acceptable performance in practical applications.

Implementing FedCurv

The implementation of FedCurv involves two key modifications to the standard FedAvg approach. First, the local loss function must be modified to include the regularization term incorporating the aggregated Fisher information from the previous round. Second, the Fisher information matrix of the parameters must be calculated and aggregated correctly for use in the next round.

The local training example code, as shown in the Implementing FedProx section, will be used again to demonstrate an implementation of FedCurv. In Chapter 4, Federated Learning Server Implementation with Python, we saw that a model conversion layer allows for framework-agnostic model representations to be operated on by the aggregator. Previously, these representations only contained the respective parameters from the original models; however, this agnostic representation actually allows for any desired parameter to be aggregated, even those only loosely tied to the true model parameters. This means that the secondary parameters can be bundled and sent with the local model, aggregated, and then separated from the aggregate model in the next round.

In FedCurv, there are two sets of parameters that must be computed locally and aggregated for use in the next round; therefore, it can be assumed that these parameters are sent with the local model after training and separated from the aggregate model before training, for the sake of brevity in the example code (the implementation of this functionality is straightforward). As a result, the two key modifications for FedCurv, as mentioned earlier, can be simplified down into computing the Fisher information parameters after locally training the model and computing the regularization term with the received aggregate Fisher information parameters.

The Fisher information matrix is the covariance matrix of the gradient of a model's log-likelihood function with respect to its parameters, often evaluated empirically over the available data. FedCurv only utilizes the diagonal entries of this matrix: the variances of the individual gradient components, whose expected values are zero.

At a high level, this variance term can be considered an estimate of how influential the parameter is in changing the performance of the model on the data. This information is essential for preventing the modification of parameters key to good performance on one dataset during the local training of other agents – the underlying idea behind FedCurv.

Relaxing the measure of model performance from the gradient of the log-likelihood to the gradient of any objective function allows for the direct use of the gradient terms computed during backpropagation when computing the variance terms for models using gradient-based optimization methods, such as deep learning models. Specifically, the variance term of a parameter is equal to the square of its respective gradient term, allowing for the terms to be directly computed from the net gradients calculated during local training.

First, we create two variables to store the agent’s most recent Fisher information parameters and the received aggregate Fisher information parameters, which are used to determine the Fisher information from the other agents. The value of the lambda parameter of FedCurv is fixed, and total_grad is initialized as a container for the cumulative gradient from each training loop:

agg_model = ... # Get aggregated model from aggregator
model.load_state_dict(agg_model.state_dict())
fisher_info_params = ... # Initialize at start, then maintain to store past round parameters
agg_fisher_info_params = ... # Separate aggregate Fisher information parameters from aggregate model parameters
# Only consider other agents, and convert to PyTorch tensor
agg_fisher_info_params = {k:torch.tensor(agg_fisher_info_params[k] - fisher_info_params[k]) for k in fisher_info_params.keys()}
# Scaling parameter for FedCurv regularization term
fedcurv_lambda = 1.0
total_grad = {i:torch.zeros_like(param.data) for i,param in enumerate(model.parameters())}

Then, we compute the FedCurv regularization term from the model weights and the aggregate Fisher information parameters. This term is weighted by lambda and added to the loss term before computing the gradients:

model.train()
for epoch in range(num_epochs):
    for batch_idx, (inputs, targets) in enumerate(trainloader):
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        for i,param in enumerate(model.parameters()):
            # Expanded form of u * (param - v/u)^2 (up to a constant), factored
            # to reuse the saved Fisher info parameters; param (not param.data)
            # is used so the term contributes to the gradients
            reg_term = (param ** 2) * agg_fisher_info_params[f'fedcurv_u_{i}']
            reg_term -= 2 * param * agg_fisher_info_params[f'fedcurv_v_{i}']
            reg_term += (agg_fisher_info_params[f'fedcurv_v_{i}'] ** 2) / agg_fisher_info_params[f'fedcurv_u_{i}']
            loss += fedcurv_lambda * reg_term.sum()

The gradients are then computed and stored in total_grad before updating the model weights:

        loss.backward()
        for i,param in enumerate(model.parameters()):
            total_grad[i] += param.grad
        optimizer.step()

Finally, we compute and store the agent’s most recent Fisher information parameters for use in the next round:

for i,param in enumerate(model.parameters()):
    # u = squared net gradient (empirical Fisher diagonal); v = u scaled by
    # the trained weights; .cpu() is needed if training ran on a GPU
    fisher_info_params[f'fedcurv_u_{i}'] = (total_grad[i] ** 2).cpu().numpy()
    fisher_info_params[f'fedcurv_v_{i}'] = ((total_grad[i] ** 2) * param.data).cpu().numpy()

Therefore, framework-agnostic aggregation can be used to implement FedCurv on top of FedAvg. The full FL example using FedCurv can be found at https://github.com/PacktPublishing/Federated-Learning-with-Python/tree/main/ch7/agg_fl_examples/cifar_fedcurv_example.

Data-sharing approach

To make further progress, changes to external aspects of the FL scenario are necessary. For example, let’s assume that the data privacy restriction is loosened, such that small subsets of the local datasets from each agent can be shared with the other agents. This data-sharing approach allows for homogeneity in the local data distributions proportional to the amount of shared data to be achieved, at the expense of the key stationary data property of FL that makes it desirable in many privacy-oriented applications. Thus, data-sharing approaches are generally unsuitable for the majority of applications.

Personalization through fine-tuning

It is clear that producing a single model that demonstrates strong performance across the local datasets is not easy when the datasets are not IID. However, what would happen if the single-model restriction was removed from the FL process? If the goal is to produce local models that perform well on the same edge devices where training is conducted, removing the single-model restriction allows for the use of different local models that have been trained on the exact data distributions where inference is being applied.

This concept is called personalization, in which each agent uses a version of the aggregate model tuned for its local data distribution to achieve strong local performance. The key point of this approach is to balance the local performance of the locally trained model against the global performance, and resulting robustness, of the aggregate model received in each round. One method of accomplishing this is for each agent to maintain its local model across rounds, updating it each round with a weighted average of the previous local model and the received aggregate model.
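A minimal sketch of this interpolation-based update, where alpha is an illustrative mixing parameter balancing local fit against global robustness:

def personalize(local_model, agg_model, alpha=0.7):
    # Keep a fraction alpha of the agent's own model and blend in the
    # received aggregate model for robustness (applied each round)
    return alpha * local_model + (1.0 - alpha) * agg_model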

Alternatively, consider a relaxation that allows for multiple aggregate models to be produced in each round. In cases where the local data distributions can be clustered into just a few separated groups, distribution-aware aggregation would allow for the selective application of aggregation methods to groups of models belonging to the same distribution cluster.

One example of this approach is the Performance-Based Neighbor Selection (PENS) algorithm, where agents receive locally trained models from other agents and test them on their own local dataset during the first phase. Using the assumption that models trained on similar datasets will perform better than models trained on different datasets, the agents then determine the set of other agents with similar data distributions, allowing for aggregation to only be performed with similar agents in the second phase.
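A simplified sketch of this first phase follows; the interfaces and the below-average-loss selection rule are illustrative simplifications of the full PENS algorithm:

def select_neighbors(evaluate_on_local_data, peer_models):
    # Evaluate each received peer model on the agent's own dataset; peers
    # whose models achieve below-average loss are assumed to hold
    # similarly distributed data and are kept for aggregation
    losses = {peer: evaluate_on_local_data(model)
              for peer, model in peer_models.items()}
    threshold = sum(losses.values()) / len(losses)
    return [peer for peer, loss in losses.items() if loss < threshold]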

A second approach is to add an intermediate aggregation step between the local models and the global aggregate model called a cluster model. By leveraging knowledge about the agent data distributions or through a dynamic allocation method, agents with similar data distributions can be assigned to a cluster aggregator, which is then known to produce a strong model due to its agents having IID datasets.

Balancing the performance of the cluster models with the robustness of global aggregation leads to the concept of the semi-global model, in which subsamples of the cluster models can be selected (potentially based on data distribution) to create a smaller set of partially global aggregate models that maintain performance and robustness. Therefore, the cluster and semi-global model approach is beneficial for both aggregation and achieving a fully distributed FL system.

Summary

The goal of this chapter was to provide a conceptual overview of the current knowledge of aggregation, the key theoretical step in FL that allows for the disjoint training done by each agent to be pooled together with minimal transmission required. FedAvg is a simple, yet surprisingly powerful aggregation algorithm that performs well in an ideal FL scenario. This scenario is achieved when training is done across IID datasets using machines with similar levels of computational power and no adversarial or otherwise incorrectly performing agents.

Unfortunately, these conditions are often not met when deploying an FL system in the real world. To address these cases, we introduced and implemented modified aggregation approaches: FedProx, FedCurv, and three different robust mean estimators. After reading this chapter, you should have a solid understanding of the considerations that must be taken into account for practical FL applications, and you should be able to integrate the aforementioned algorithms into these applications.

In the next chapter, we will do a deep dive into some of the existing FL frameworks with several toy examples to demonstrate the functionalities provided by each.
