Paper Title

SAFER: Data-Efficient and Safe Reinforcement Learning via Skill Acquisition

Paper Authors

Dylan Slack, Yinlam Chow, Bo Dai, Nevan Wichers

Paper Abstract

Methods that extract policy primitives from offline demonstrations using deep generative models have shown promise at accelerating reinforcement learning (RL) for new tasks. Intuitively, these methods should also help to train safe RL agents because they enforce useful skills. However, we identify that these techniques are not well equipped for safe policy learning because they ignore negative experiences (e.g., unsafe or unsuccessful ones) and focus only on positive experiences, which harms their ability to generalize to new tasks safely. Instead, we model the latent safety context using principled contrastive training on an offline dataset of demonstrations from many tasks, including both negative and positive experiences. Using this latent variable, our RL framework, SAFEty skill pRiors (SAFER), extracts task-specific safe primitive skills to safely and successfully generalize to new tasks. In the inference stage, policies trained with SAFER learn to compose safe skills into successful policies. We theoretically characterize why SAFER can enforce safe policy learning and demonstrate its effectiveness on several complex safety-critical robotic grasping tasks inspired by the game Operation, in which SAFER outperforms state-of-the-art primitive learning methods in both success and safety.
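The following is a minimal, self-contained sketch (not the authors' released code) of the kind of contrastive training the abstract describes: an encoder maps windows of offline (state, action) experience to a latent safety context, and an InfoNCE-style loss pulls embeddings of safe windows together while pushing them away from unsafe ones. PyTorch is assumed; all module names, dimensions, and the specific loss form are illustrative assumptions rather than the paper's exact formulation.

```python
# Hypothetical sketch of contrastive learning of a latent safety context,
# in the spirit of the SAFER abstract. Names and dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SafetyContextEncoder(nn.Module):
    """Encodes a short window of (state, action) pairs into a latent z."""
    def __init__(self, obs_dim: int, act_dim: int, latent_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, states: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # Per-step embeddings, mean-pooled over the window (time) dimension.
        feats = self.net(torch.cat([states, actions], dim=-1))
        return F.normalize(feats.mean(dim=1), dim=-1)

def contrastive_safety_loss(anchor_z, positive_z, negative_z, temperature=0.1):
    """InfoNCE-style loss: each safe anchor window should be close to another
    safe window (positive) and far from unsafe windows (negatives)."""
    pos_logits = (anchor_z * positive_z).sum(-1, keepdim=True) / temperature
    neg_logits = anchor_z @ negative_z.T / temperature  # (B, num_negatives)
    logits = torch.cat([pos_logits, neg_logits], dim=1)
    # The positive pair sits at column 0 of every row.
    labels = torch.zeros(anchor_z.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)

# Toy usage: random tensors stand in for offline demonstration windows.
B, T, obs_dim, act_dim = 32, 10, 17, 6
enc = SafetyContextEncoder(obs_dim, act_dim)
safe_a = enc(torch.randn(B, T, obs_dim), torch.randn(B, T, act_dim))
safe_b = enc(torch.randn(B, T, obs_dim), torch.randn(B, T, act_dim))
unsafe = enc(torch.randn(B, T, obs_dim), torch.randn(B, T, act_dim))
loss = contrastive_safety_loss(safe_a, safe_b, unsafe)
loss.backward()
```

In a full pipeline, the learned latent would then condition a skill prior so that downstream RL composes only primitives consistent with the inferred safety context; that conditioning step is beyond this sketch.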
