Paper Title
First Order Constrained Optimization in Policy Space
Paper Authors
Paper Abstract
In reinforcement learning, an agent attempts to learn high-performing behaviors through interacting with the environment; such behaviors are often quantified in the form of a reward function. However, some aspects of behavior, such as those deemed unsafe and to be avoided, are best captured through constraints. We propose a novel approach called First Order Constrained Optimization in Policy Space (FOCOPS) which maximizes an agent's overall reward while ensuring the agent satisfies a set of cost constraints. Using data generated from the current policy, FOCOPS first finds the optimal update policy by solving a constrained optimization problem in the nonparameterized policy space. FOCOPS then projects this update policy back into the parametric policy space. Our approach provides an approximate upper bound on worst-case constraint violation throughout training and is first-order in nature, and therefore simple to implement. We provide empirical evidence that our simple approach achieves better performance on a set of constrained robotic locomotion tasks.
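The two-step update the abstract describes has a closed-form first step: within a KL trust region around the current policy π_θk, the nonparameterized optimum takes the form π*(a|s) ∝ π_θk(a|s) exp((A(s,a) − ν A_C(s,a))/λ), where A and A_C are reward and cost advantage estimates, ν is a cost multiplier, and λ is a temperature. The projection step then minimizes D_KL(π_θ ‖ π*) by stochastic gradient descent on θ, which is what makes the method first-order. The sketch below illustrates one plausible PyTorch rendering of such an update; the function names, default hyperparameter values, and the per-state KL masking detail are illustrative assumptions, not the authors' released code.

```python
import torch
from torch.distributions import kl_divergence

def focops_policy_loss(dist_new, dist_old, actions, adv_r, adv_c,
                       lam=1.5, nu=0.1, kl_thresh=0.02):
    # Sketch of a FOCOPS-style projection loss (illustrative, not the
    # authors' released code). dist_new / dist_old are torch.distributions
    # objects (e.g. Independent Normals) for the current and data-collecting
    # policies at the sampled states, so log_prob and kl_divergence return
    # one value per sample.
    logp_new = dist_new.log_prob(actions)
    logp_old = dist_old.log_prob(actions).detach()
    ratio = torch.exp(logp_new - logp_old)

    # Per-state KL from the current policy to the data-collecting one.
    kl = kl_divergence(dist_new, dist_old)

    # Minimizing KL(pi_theta || pi*) reduces, up to constants, to keeping
    # pi_theta close to pi_old while reweighting actions by their combined
    # reward/cost advantage (A - nu * A_C) / lam; states that have drifted
    # outside the KL trust region are masked out of the update.
    mask = (kl.detach() <= kl_thresh).float()
    loss = (kl - ratio * (adv_r - nu * adv_c) / lam) * mask
    return loss.mean()

def update_nu(nu, avg_episode_cost, cost_limit, step_size=0.01, nu_max=2.0):
    # Projected gradient ascent on the cost multiplier: increase nu when
    # the constraint is violated, decrease it otherwise, and clip to
    # [0, nu_max].
    nu = nu + step_size * (avg_episode_cost - cost_limit)
    return float(min(max(nu, 0.0), nu_max))
```

In a full training loop one would plausibly estimate adv_r and adv_c with GAE, take several minibatch gradient steps on focops_policy_loss per batch of rollouts, and call update_nu once per iteration using the measured average episode cost against the constraint threshold.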