Meta imitation learning

If we want our robot to be more generalist and to perform various tasks, then it should learn quickly. But how can we enable our robots to learn quickly? Well, how do we humans learn quickly? Don't we easily learn new skills just by watching other individuals? Similarly, if we enable our robot to learn just by watching our actions, then we can make it learn complex goals efficiently, and we don't have to engineer complex goal and reward functions. This type of learning, that is, learning from human actions, is called imitation learning, where the robot tries to mimic human actions. The robot doesn't have to learn only from human actions; it can also learn from another robot performing a task, or from a video of a human or robot performing a task.

But imitation learning is not as simple as it sounds. A robot will need a lot of time and demonstrations to learn the goal and to identify the right policy. So, we'll augment the robot with prior experience in the form of demonstrations (training data) so that it doesn't have to learn each skill completely from scratch, which helps it to learn quickly. To learn several skills, then, we need to collect demonstrations for each of those skills; that is, we need to augment the robot with task-specific demonstration data.

But how can we enable our robot to learn quickly from a single demonstration for a task? Can we use meta learning here? Can we reuse the demonstration data and learn from several related tasks to learn the new task quickly? So, we combine meta learning and imitation learning and form Meta Imitation Learning (MIL). With MIL, we can make use of demonstration data from a variety of other tasks to learn a new task quickly with just one demonstration. So, we can find the right policy for a new task with just one demonstration of that task.

For MIL, we can use any of the meta learning algorithms we've seen. We'll use MAML as our meta learning algorithm, which is compatible with any algorithm that can be trained by gradient descent, and we'll use policy gradients as our algorithm for finding the right policy. In policy gradients, we directly optimize the parameterized policy $\pi_\theta$ with some parameter $\theta$.
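
To make the idea of a parameterized policy concrete, here is a minimal sketch, assuming a toy linear policy with Gaussian exploration noise; the dimensions, the noise scale, and the `policy` function are illustrative assumptions, not the book's actual implementation:

```python
import numpy as np

# A parameterized policy pi_theta: a single linear layer mapping an
# observation to a continuous action, with Gaussian noise so that the
# policy is stochastic (as policy gradient methods require).
obs_dim, act_dim = 4, 2
theta = 0.01 * np.random.randn(obs_dim, act_dim)   # the policy parameters

def policy(observation, noise_scale=0.1):
    """Return an action for the given observation under pi_theta."""
    mean_action = observation @ theta
    return mean_action + noise_scale * np.random.randn(act_dim)

action = policy(np.random.randn(obs_dim))
```

Policy gradient methods would then adjust $\theta$ directly by gradient descent, which is exactly what makes this setup compatible with MAML.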

Our goal is to learn a policy that can quickly adapt to new tasks from a single demonstration of that task. By doing so, we can remove our dependency on a large amount of demonstration data for each of the tasks. What is actually our task here? Our task $\mathcal{T}_i$ will contain the trajectories, $\mathcal{T}_i = \{\tau_1, \tau_2, \ldots, \tau_K\}$. A trajectory $\tau = \{o_1, a_1, o_2, a_2, \ldots, o_T, a_T\}$ consists of a sequence of observations and actions from the expert policy, which is the demonstration. Wait. What is an expert policy? Since we're performing imitation learning, we're learning from the experts (human actions), so we call that policy an expert policy, and it's denoted by $\pi^*$.
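
In code, a demonstration trajectory and a task could be represented along these lines; the `Trajectory` name and the toy numbers are illustrative assumptions only:

```python
from collections import namedtuple

# One demonstration trajectory: observations paired with the expert's actions.
Trajectory = namedtuple('Trajectory', ['observations', 'actions'])

demo = Trajectory(observations=[[0.2, 0.1], [0.3, 0.0], [0.5, -0.1]],
                  actions=[[1.0], [0.8], [0.6]])

# A task is simply a collection of such expert trajectories.
task = [demo]
```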

Okay, what should our loss function be? The loss function denotes how our robot's actions differ from the expert's actions. We can use mean squared error loss as our loss function for continuous actions, and cross-entropy as a loss function for discrete actions. Let's say we have continuous actions; then we can represent our mean squared error loss as follows:

$$\mathcal{L}_{\mathcal{T}_i}(\pi_\theta) = \sum_{\tau^{(j)} \sim \mathcal{T}_i} \sum_{t} \left\lVert \pi_\theta\left(o_t^{(j)}\right) - a_t^{(j)} \right\rVert_2^2$$

Here, $\pi_\theta(o_t^{(j)})$ is the action our policy predicts for the observation $o_t^{(j)}$, and $a_t^{(j)}$ is the corresponding expert action from the demonstration.
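
As a minimal sketch of this loss in NumPy (the function name and the toy trajectory are purely illustrative), we just compare the policy's actions against the expert's actions over a demonstration:

```python
import numpy as np

def mse_imitation_loss(predicted_actions, expert_actions):
    """Mean squared error between the policy's actions and the expert's
    actions over one demonstration trajectory (continuous action case)."""
    predicted = np.asarray(predicted_actions)
    expert = np.asarray(expert_actions)
    # Sum the squared error over action dimensions, average over timesteps.
    return np.mean(np.sum((predicted - expert) ** 2, axis=-1))

# A toy 3-step trajectory with 2-dimensional continuous actions.
expert_actions = [[0.1, -0.4], [0.3, 0.0], [0.5, 0.2]]
policy_actions = [[0.0, -0.5], [0.4, 0.1], [0.5, 0.3]]
print(mse_imitation_loss(policy_actions, expert_actions))
```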

Say we have a distribution over tasks $p(\mathcal{T})$. We sample a batch of tasks $\mathcal{T}_i \sim p(\mathcal{T})$ and, for each task, we sample some demonstration data, train the network by minimizing the loss, and find the optimal task-specific parameter $\theta'_i$. Next, we perform meta optimization by calculating meta gradients and find the optimal initial parameter $\theta$. We'll see exactly how this works in the next section.
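
As a rough preview, here is a minimal sketch of that meta-training loop, assuming a toy linear policy, randomly generated expert tasks, and the first-order approximation of the MAML meta gradient; all names, hyperparameters, and the task generator are illustrative assumptions, not the exact procedure from the next section:

```python
import numpy as np

obs_dim, act_dim = 4, 2
alpha, beta = 0.1, 0.01                  # inner and outer (meta) learning rates
theta = np.zeros((obs_dim, act_dim))     # meta-initialized policy parameters

def sample_task():
    """A toy 'task': an expert linear policy the robot should imitate."""
    return np.random.randn(obs_dim, act_dim)

def sample_demo(expert, n_steps=20):
    """One demonstration: observations plus the expert's actions."""
    obs = np.random.randn(n_steps, obs_dim)
    return obs, obs @ expert

def loss_grad(params, obs, actions):
    """Gradient of the mean squared error imitation loss w.r.t. params."""
    error = obs @ params - actions
    return 2.0 * obs.T @ error / len(obs)

for iteration in range(1000):
    meta_grad = np.zeros_like(theta)
    for _ in range(5):                                # batch of sampled tasks
        expert = sample_task()
        obs, acts = sample_demo(expert)               # demonstration for adaptation
        theta_i = theta - alpha * loss_grad(theta, obs, acts)   # inner update
        val_obs, val_acts = sample_demo(expert)                 # held-out demo
        meta_grad += loss_grad(theta_i, val_obs, val_acts)      # first-order meta gradient
    theta -= beta * meta_grad / 5                     # meta optimization step
```

Note that full MAML would differentiate through the inner update (a second-order gradient); the sketch above uses the simpler first-order approximation purely for readability.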
