
environment. To account for the usage and adaptation of a
meta-model during planning, we call our approach Meta
Adaptation Controller (MAC).
First, we introduce related work and preliminaries. Next,
the challenge and our approach to solving it are described.
Finally, experimental results are presented that compare MAC
with MPC employing different meta-learning methods.
II. RELATED WORK
In recent years, robotics has achieved remarkable success
with model-based RL approaches [13], [14], [15]. The agent
can choose optimal actions by utilizing the experiences
generated by the model [7]. As a result, the amount of
data required for model-based methods is typically much
smaller than their model-free counterparts, making these
algorithms more attractive for robotic applications. One
drawback in many of these works is the assumption that the
environment is stationary. In real robot applications, however,
many uncertainties are difficult to model or predict, some of
which are internal (e.g., malfunctions [9]) and others external
(e.g., wind [12]). These uncertainties make the stationary
assumption impractical, which can lead to suboptimal behavior
or even catastrophic failure. Therefore, a quick adaptation of
the learned model is critical.
“Gradient-based meta-learning methods leverage gradient
descent to learn the commonalities among various tasks”
[16, p. 1]. One such method introduced by Finn et al.
[17] is Model-Agnostic Meta-Learning (MAML). The key
idea of MAML is to tune a model’s initial parameters such
that the model has maximal performance on a new task.
Here, meta-learning is achieved with bi-level optimization: a
model's task-specific optimization and a task-agnostic meta
optimization. Instantiated for MFRL, MAML uses policy
gradients of a neural network model, whereas, in MBRL,
MAML is used to train a dynamics model. REPTILE by
Nichol et al. [18] is a first-order variant of MAML.
In contrast to MAML, task-specific gradients do not need
to be differentiated through the optimization process. This
makes REPTILE more computationally efficient with similar
performance.
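To make the nested optimization concrete, the following sketch shows one REPTILE-style meta-update in PyTorch: a copy of the model is adapted to a sampled task with a few gradient steps (inner level), and the meta-parameters are then moved toward the adapted parameters (outer level) without differentiating through the inner optimization. The helpers sample_task and task_loss are assumed placeholders, not part of any published implementation.

\begin{verbatim}
import copy
import torch

def reptile_meta_step(model, sample_task, task_loss,
                      inner_steps=5, inner_lr=1e-2, meta_lr=0.1):
    # Draw a task M_i ~ p(M) and adapt a copy of the model to it.
    task = sample_task()                      # assumed helper
    adapted = copy.deepcopy(model)            # phi_i initialized with theta
    opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    for _ in range(inner_steps):              # inner level: task-specific SGD
        loss = task_loss(adapted, task)       # assumed helper
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Outer level: first-order update that moves theta toward phi_i
    # without backpropagating through the inner optimization.
    with torch.no_grad():
        for theta, phi in zip(model.parameters(), adapted.parameters()):
            theta.add_(meta_lr * (phi - theta))
\end{verbatim}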
A model-based approach using gradient-based MRL was
presented in the work of Nagabandi et al. [9] and targets
online adaptation of a robotic system that encounters different
system dynamics in real-world environments. In this context,
Kaushik et al. [11] point out that in an MRL setup where
situations do not possess strong global similarity, finding
a single set of initial parameters is often not sufficient
to learn quickly. One potential solution would be to find
several initial sets of model parameters during meta-training
and, when encountering a new task, use the most similar
one so that an agent can adapt through several gradient
steps. Their work Fast Adaptation through Meta-Learning
Embeddings (FAMLE) approaches this solution by extending
a dynamical model's input with a learnable d-dimensional
vector describing a task. Similarly, Belkhale et al. [12] intro-
duce a meta-learning approach that enables a quadcopter to
adapt online to various physical properties of payloads (e.g.,
mass, tether length) using variational inference. Intuitively,
each payload causes different system dynamics and therefore
defines a task to be learned. Since such dynamics are hard
to model accurately by hand and it is not realistic to know
every payload's property values beforehand, the meta-learning
goal is rapid adaptation to unknown payloads without prior
knowledge of the payload's physical properties.
To this end, a probabilistic encoder network infers a task-
specific latent vector that is fed into the dynamics network as
an auxiliary input. Using the latent vector, the dynamics
network learns to model the factors of variation that affect
the payload’s dynamics and are not present in the current
state. All these algorithms use MPC during online adaptation.
Our work introduces a new controller for online adaptation in
a model-based meta-reinforcement learning setting.
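As an illustration of the embedding idea shared by FAMLE and the approach of Belkhale et al., the sketch below conditions a learned dynamics model on a task-specific latent vector by concatenating it with the state-action input. The class name, layer sizes, and the use of a plain embedding table are assumptions made for illustration, not the authors' implementations.

\begin{verbatim}
import torch
import torch.nn as nn

class EmbeddingDynamicsModel(nn.Module):
    # Dynamics network whose input is extended with a learnable
    # d-dimensional task embedding (illustrative sketch).
    def __init__(self, state_dim, action_dim, n_tasks,
                 embed_dim=8, hidden=128):
        super().__init__()
        # One learnable embedding per meta-training task.
        self.task_embedding = nn.Embedding(n_tasks, embed_dim)
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, state_dim),   # predicts the next state
        )

    def forward(self, state, action, task_id):
        h = self.task_embedding(task_id)    # task-specific latent vector
        x = torch.cat([state, action, h], dim=-1)
        return self.net(x)
\end{verbatim}

At adaptation time, the embedding closest to the newly encountered situation (together with the model parameters) can be fine-tuned on the few transitions observed online.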
III. PRELIMINARIES
A. Meta Learning
Quick online adaptation to new tasks can be viewed in the light of a few-shot learning setting, where the goal of meta-learning is to adapt a model $f_\theta$ to an unseen task $\mathcal{M}_j$ from a task distribution $p(\mathcal{M})$ with a small number of $k$ data samples [17]. The meta-learning procedure is usually divided into meta-training with $n$ meta-learning tasks $\mathcal{M}_i$ and meta-testing with $y$ meta-test tasks $\mathcal{M}_j$, both drawn from $p(\mathcal{M})$ without replacement [3]. During meta-training, task data may be split into train and test sets, usually representing $k$ data points of a task: $\mathcal{D}_{\text{meta-train}} = \{(\mathcal{D}^{tr}_{i=1}, \mathcal{D}^{ts}_{i=1}), \ldots, (\mathcal{D}^{tr}_{i=n}, \mathcal{D}^{ts}_{i=n})\}$. The meta-testing task data $\mathcal{D}_{\text{meta-test}} = (\mathcal{D}^{\text{meta-test}}_{j=1}, \ldots, \mathcal{D}^{\text{meta-test}}_{j=y})$ is held out during meta-training [3]. Meta-training is then performed with $\mathcal{D}_{\text{meta-train}}$ and can be viewed as bi-level learning of model parameters [19]. In the inner level, an update algorithm $\mathrm{Alg}$ with hyperparameters $\psi$ must find task-specific parameters $\phi_i$ by adjusting the meta-parameters $\theta$. In the outer level, $\theta$ must be adjusted to minimize the cumulative loss of all $\phi_i$ across all learning tasks by finding common characteristics of the different tasks through the meta-parameters $\theta^\star$:
$$
\overbrace{\theta^\star = \arg\min_{\theta} \sum_{i=1}^{n} \mathcal{L}_{\mathcal{D}_i \sim \mathcal{M}_i}(\phi_i)}^{\text{outer-level}}
\quad \text{where} \quad
\phi_i = \underbrace{\mathrm{Alg}^{\psi}_{\mathcal{D}_i \sim \mathcal{M}_i}(\theta)}_{\text{inner-level}}
\tag{1}
$$
Once $\theta^\star$ is found, it can be used during meta-testing for quick adaptation: $\phi_j = \mathrm{Alg}(\theta^\star, \mathcal{D}_j)$.
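As a minimal sketch of this meta-testing step, the function below instantiates $\mathrm{Alg}$ as a few gradient steps on the $k$ samples of an unseen task, starting from the meta-learned parameters $\theta^\star$; the loss function and data format are assumptions made only for illustration.

\begin{verbatim}
import copy
import torch

def adapt(meta_model, task_data, loss_fn, steps=5, lr=1e-2):
    # Alg(theta*, D_j): start from the meta-learned parameters and
    # take a few gradient steps on the new task's data (sketch).
    adapted = copy.deepcopy(meta_model)     # phi_j initialized with theta*
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    for _ in range(steps):
        loss = loss_fn(adapted, task_data)  # loss on D_j ~ M_j (assumed)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return adapted                          # task-specific parameters phi_j
\end{verbatim}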
B. Model-based Reinforcement Learning
In RL, a task can be described as a Markov Decision Process (MDP) $\mathcal{M} = \{S, A, p(s_{t=0}), p(s_{t+1} \mid s_t, a_t), r, H\}$ with a set of states $S$, a set of actions $A$, a reward function $r: S \times A \mapsto \mathbb{R}$, an initial state distribution $p(s_{t=0})$, a transition probability distribution $p(s_{t+1} \mid s_t, a_t)$, and a discrete-time finite or continuous-time infinite horizon $H$. MBRL methods sample ground truth data $\mathcal{D}_i = \{(s_0, a_0, s_1), (s_1, a_1, s_2), \ldots\}$ from a specific task $\mathcal{M}_i$ and