MAML

MAML is one of the most recently introduced and widely used meta learning algorithms, and it has created a major breakthrough in meta learning research. Learning to learn is the key focus of meta learning: we learn from various related tasks containing only a small number of data points, and the meta learner produces a quick learner that can generalize well on a new related task even with a smaller number of training samples.

The basic idea of MAML is to find better initial parameters so that, starting from those parameters, the model can learn quickly on new tasks with fewer gradient steps.

So, what do we mean by that? Let's say we are performing a classification task using a neural network. How do we train the network? We start off by initializing random weights and train the network by minimizing the loss. How do we minimize the loss? We do so using gradient descent. Okay, but how do we use gradient descent to minimize the loss? We use gradient descent to find the optimal weights that will give us the minimal loss, taking multiple gradient steps until we reach convergence.
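
To make this concrete, here is a minimal sketch of that procedure for a tiny single-layer classifier with toy NumPy data; the dataset, learning rate, and number of steps are purely illustrative:

```python
import numpy as np

# Toy data for a binary classification task (illustrative only)
np.random.seed(0)
X = np.random.randn(100, 2)                    # 100 samples, 2 features
y = (X[:, 0] + X[:, 1] > 0).astype(float)      # toy binary labels

theta = np.random.randn(2)                     # start with randomly initialized weights
alpha = 0.1                                    # learning rate

for step in range(1000):                       # multiple gradient steps until convergence
    logits = X.dot(theta)
    preds = 1.0 / (1.0 + np.exp(-logits))      # sigmoid
    grad = X.T.dot(preds - y) / len(y)         # gradient of the cross-entropy loss
    theta = theta - alpha * grad               # gradient descent update

print(theta)                                   # the learned (near-optimal) weights
```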

In MAML, we try to find these optimal weights by learning from a distribution of similar tasks. So, for a new task, we don't have to start with randomly initialized weights; instead, we can start from weights that are already close to optimal, which takes fewer gradient steps to reach convergence and requires fewer data points for training.

Let's understand MAML in simple terms; let's say we have three related tasks: T1, T2, and T3. First, we randomly initialize our model parameter, θ. We train our network on task T1 and try to minimize the loss L by gradient descent. We minimize the loss by finding the optimal parameter, θ1'. Similarly, for tasks T2 and T3, we will start off with a randomly initialized model parameter, θ, and minimize the loss by finding the right set of parameters by gradient descent. Let's say θ2' and θ3' are the optimal parameters for tasks T2 and T3, respectively.
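
The following sketch illustrates this setup with toy regression tasks; the make_task and adapt helpers, the squared-error loss, and all hyperparameters are illustrative assumptions, not part of MAML itself:

```python
import numpy as np

np.random.seed(0)

def make_task():
    # Toy regression task; the tasks share a base parameter, so they are related
    true_theta = np.array([1.0, -1.0]) + 0.1 * np.random.randn(2)
    X = np.random.randn(20, 2)
    y = X.dot(true_theta)
    return X, y

def adapt(theta, X, y, alpha=0.1, steps=50):
    # Start from theta and take gradient steps on this task's MSE loss
    for _ in range(steps):
        grad = 2 * X.T.dot(X.dot(theta) - y) / len(y)
        theta = theta - alpha * grad
    return theta

T1, T2, T3 = make_task(), make_task(), make_task()

theta = np.random.randn(2)      # randomly initialized model parameter θ
theta1 = adapt(theta, *T1)      # optimal parameters θ1' for task T1
theta2 = adapt(theta, *T2)      # optimal parameters θ2' for task T2
theta3 = adapt(theta, *T3)      # optimal parameters θ3' for task T3
```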

As you can see in the following diagram, we start off each task with the randomly initialized parameter θ and minimize the loss by finding the optimal parameters θ1', θ2', and θ3' for the tasks T1, T2, and T3, respectively:

However, instead of initializing θ in a random position, that is, with random values, if we initialize θ in a position that is common to all three tasks, we don't need to take as many gradient steps and it will take us less time for training. MAML tries to do exactly this. MAML tries to find this optimal parameter θ that is common to many of the related tasks, so we can train on a new task relatively quickly with few data points, without having to take many gradient steps.
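
Continuing the toy example above, this shift can be sketched with a simplified, first-order version of the update: adapt θ on each task, evaluate the loss gradient at the adapted parameters, and move θ in the averaged direction. The full MAML meta-update, which we cover in the next section, also differentiates through the adaptation step; the learning rate and step counts here are illustrative:

```python
beta = 0.05                                        # meta learning rate (illustrative)
tasks = [T1, T2, T3]

for meta_step in range(200):
    meta_grad = np.zeros_like(theta)
    for X, y in tasks:
        theta_i = adapt(theta, X, y, steps=1)      # task-specific adaptation from the shared θ
        # gradient of the task loss evaluated at the adapted parameters θi'
        meta_grad += 2 * X.T.dot(X.dot(theta_i) - y) / len(y)
    theta = theta - beta * meta_grad / len(tasks)  # shift θ toward a position good for all tasks
```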

As shown in the following diagram, we shift θ to a position that is common to all different optimal θ' values:

So, for a new related task, say, T4, we don't have to start with a randomly initialized parameter, θ. Instead, we can start with the optimal θ value so that it will take fewer gradient steps to attain convergence.
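
Continuing the same toy sketch, we can compare adapting to a new task T4 from the meta-learned θ against adapting from a freshly randomized θ; because the toy tasks were generated to be related, the meta-learned starting point typically reaches a lower loss in the same few steps:

```python
# Adapting to a new related task T4 (toy example, continuing the sketch above)
T4 = make_task()
X4, y4 = T4

few_steps = 5
theta_from_meta = adapt(theta, X4, y4, steps=few_steps)                 # start from meta-learned θ
theta_from_random = adapt(np.random.randn(2), X4, y4, steps=few_steps)  # start from random θ

loss = lambda th: np.mean((X4.dot(th) - y4) ** 2)
print("loss after 5 steps from meta-learned θ:", loss(theta_from_meta))
print("loss after 5 steps from random θ:      ", loss(theta_from_random))
```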

So, in MAML, we try to find this optimal θ value that is common to related tasks, so that it helps us learn from fewer data points and minimizes our training time. MAML is model-agnostic, meaning that we can apply it to any model that is trainable with gradient descent. But how exactly does MAML work? How do we shift the model parameters to an optimal position? We will explore that in detail in the next section.
