Paper Title

Learning Compositional Neural Programs for Continuous Control

Paper Authors

Thomas Pierrot, Nicolas Perrin, Feryal Behbahani, Alexandre Laterre, Olivier Sigaud, Karim Beguir, Nando de Freitas

Paper Abstract

We propose a novel solution to challenging sparse-reward, continuous control problems that require hierarchical planning at multiple levels of abstraction. Our solution, dubbed AlphaNPI-X, involves three separate stages of learning. First, we use off-policy reinforcement learning algorithms with experience replay to learn a set of atomic goal-conditioned policies, which can be easily repurposed for many tasks. Second, we learn self-models describing the effect of the atomic policies on the environment. Third, the self-models are harnessed to learn recursive compositional programs with multiple levels of abstraction. The key insight is that the self-models enable planning by imagination, obviating the need for interaction with the world when learning higher-level compositional programs. To accomplish the third stage of learning, we extend the AlphaNPI algorithm, which applies AlphaZero to learn recursive neural programmer-interpreters. We empirically show that AlphaNPI-X can effectively learn to tackle challenging sparse manipulation tasks, such as stacking multiple blocks, where powerful model-free baselines fail.
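
To make the "planning by imagination" idea concrete, here is a minimal illustrative sketch of the three stages described in the abstract. It is not the authors' implementation: all names (AtomicPolicy, SelfModel, plan_by_imagination) are hypothetical placeholders, the self-model is a stubbed random linear map rather than a trained network, and the AlphaZero-style MCTS over recursive programs is replaced by the smallest stand-in, a greedy lookahead, purely to show that high-level planning can run on imagined transitions with no real environment steps.

```python
# Illustrative sketch only; assumes nothing beyond numpy.
import numpy as np

rng = np.random.default_rng(0)

class AtomicPolicy:
    """Stage 1 (stub): a goal-conditioned atomic policy.

    In AlphaNPI-X these are trained with off-policy RL and experience
    replay; here only a name is kept, since imagined planning never
    executes the policy in the real environment.
    """
    def __init__(self, name):
        self.name = name

class SelfModel:
    """Stage 2 (stub): predicts the effect of an atomic policy on state.

    A real self-model would be a learned neural network; this stub uses
    a fixed random linear map per policy so the planner has something
    to imagine with.
    """
    def __init__(self, policies, state_dim):
        self.maps = {p.name: rng.normal(scale=0.3, size=(state_dim, state_dim))
                     for p in policies}

    def predict(self, state, policy):
        # Imagined next state: no interaction with the real environment.
        return state + self.maps[policy.name] @ state

def plan_by_imagination(state, goal, policies, self_model, horizon=4):
    """Stage 3 (simplified): greedy search over imagined rollouts.

    The key idea from the abstract survives the simplification: the
    self-model replaces real environment steps during high-level planning.
    """
    plan = []
    for _ in range(horizon):
        # Score each atomic policy by imagined distance-to-goal after it.
        imagined = [(self_model.predict(state, p), p) for p in policies]
        next_state, best = min(imagined,
                               key=lambda sp: np.linalg.norm(sp[0] - goal))
        plan.append(best.name)
        state = next_state
    return plan, state

if __name__ == "__main__":
    policies = [AtomicPolicy(n) for n in ("reach", "grasp", "lift", "place")]
    self_model = SelfModel(policies, state_dim=6)
    start, goal = rng.normal(size=6), np.zeros(6)
    plan, final = plan_by_imagination(start, goal, policies, self_model)
    print("imagined plan:", plan)
    print("imagined final distance to goal:",
          round(float(np.linalg.norm(final - goal)), 3))
```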
