Paper Title

Automatic Curriculum Learning through Value Disagreement

Authors

Yunzhi Zhang, Pieter Abbeel, Lerrel Pinto

Abstract

Continually solving new, unsolved tasks is the key to learning diverse behaviors. Through reinforcement learning (RL), we have made massive strides towards solving tasks that have a single goal. However, in the multi-task domain, where an agent needs to reach multiple goals, the choice of training goals can largely affect sample efficiency. When biological agents learn, there is often an organized and meaningful order to which learning happens. Inspired by this, we propose setting up an automatic curriculum for goals that the agent needs to solve. Our key insight is that if we can sample goals at the frontier of the set of goals that an agent is able to reach, it will provide a significantly stronger learning signal compared to randomly sampled goals. To operationalize this idea, we introduce a goal proposal module that prioritizes goals that maximize the epistemic uncertainty of the Q-function of the policy. This simple technique samples goals that are neither too hard nor too easy for the agent to solve, hence enabling continual improvement. We evaluate our method across 13 multi-goal robotic tasks and 5 navigation tasks, and demonstrate performance gains over current state-of-the-art methods.
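The core mechanism described above — scoring candidate goals by the epistemic uncertainty of the policy's Q-function and sampling training goals at the frontier of competence — can be sketched with an ensemble of Q-estimates, where disagreement (standard deviation) across ensemble members serves as the uncertainty proxy. This is an illustrative sketch only, not the authors' implementation; the function names `value_disagreement` and `propose_goal`, and the proportional-sampling rule, are assumptions for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

def value_disagreement(q_ensemble, state, goals):
    # Epistemic-uncertainty proxy: std of Q-estimates across an ensemble.
    # q_values[i, j] = i-th ensemble member's estimate for reaching goals[j] from state.
    q_values = np.array([[q(state, g) for g in goals] for q in q_ensemble])
    return q_values.std(axis=0)

def propose_goal(q_ensemble, state, candidate_goals):
    # Sample a training goal with probability proportional to disagreement,
    # so goals that are neither trivially easy nor hopeless (where the
    # ensemble agrees) are deprioritized relative to frontier goals.
    scores = value_disagreement(q_ensemble, state, candidate_goals)
    total = scores.sum()
    if total > 0:
        probs = scores / total
    else:
        probs = np.full(len(candidate_goals), 1.0 / len(candidate_goals))
    idx = rng.choice(len(candidate_goals), p=probs)
    return candidate_goals[idx]
```

With a trained ensemble, goals the agent reliably solves (or reliably fails) yield low disagreement and are rarely proposed, while goals at the boundary of the reachable set yield high disagreement and dominate the curriculum.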
