Title
Safety-Constrained Policy Transfer with Successor Features
Authors
Abstract
In this work, we focus on the problem of safe policy transfer in reinforcement learning: we seek to leverage existing policies when learning a new task with specified constraints. This problem is important for safety-critical applications where interactions are costly and unconstrained policies can lead to undesirable or dangerous outcomes, e.g., with physical robots that interact with humans. We propose a Constrained Markov Decision Process (CMDP) formulation that simultaneously enables the transfer of policies and adherence to safety constraints. Our formulation cleanly separates task goals from safety considerations and permits the specification of a wide variety of constraints. Our approach relies on a novel extension of generalized policy improvement to constrained settings via a Lagrangian formulation. We devise a dual optimization algorithm that estimates the optimal dual variable of a target task, thus enabling safe transfer of policies derived from successor features learned on source tasks. Our experiments in simulated domains show that our approach is effective; it visits unsafe states less frequently and outperforms alternative state-of-the-art methods when taking safety constraints into account.
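The core idea of the abstract — extending generalized policy improvement (GPI) to constrained settings via a Lagrangian — can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it assumes rewards and safety costs are both linear in a shared feature vector, so each source policy's successor features yield both a task-value and a cost-value estimate, and action selection maximizes the Lagrangian (task value minus the dual variable times cost value) over all source policies. All names and shapes here are illustrative.

```python
import numpy as np

def constrained_gpi_action(psi, w_task, w_cost, lam):
    """Lagrangian variant of GPI over successor features (sketch).

    psi    : (n_policies, n_actions, d) successor features of each
             source policy at the current state.
    w_task : (d,) reward weights of the target task.
    w_cost : (d,) weights encoding the safety cost.
    lam    : scalar dual variable (Lagrange multiplier); larger values
             penalize constraint-violating actions more heavily.
    """
    q_task = psi @ w_task                  # (n_policies, n_actions)
    q_cost = psi @ w_cost                  # (n_policies, n_actions)
    lagrangian = q_task - lam * q_cost
    # GPI: take the best action under the best source policy.
    return int(np.argmax(lagrangian.max(axis=0)))
```

With `lam = 0` this reduces to standard (unconstrained) GPI; as `lam` grows, actions with high predicted safety cost are increasingly avoided. In the paper's setup, the dual variable for the target task is itself estimated by a dual optimization procedure rather than fixed by hand.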