Paper Title

Deep PQR: Solving Inverse Reinforcement Learning using Anchor Actions

Authors

Sinong Geng, Houssam Nassif, Carlos A. Manzanares, A. Max Reppen, Ronnie Sircar

Abstract


We propose a reward function estimation framework for inverse reinforcement learning with deep energy-based policies. We name our method PQR, as it sequentially estimates the Policy, the $Q$-function, and the Reward function by deep learning. PQR does not assume that the reward solely depends on the state; instead, it allows for a dependency on the choice of action. Moreover, PQR allows for stochastic state transitions. To accomplish this, we assume the existence of one anchor action whose reward is known, typically the action of doing nothing, yielding no reward. We present both estimators and algorithms for the PQR method. When the environment transition is known, we prove that the PQR reward estimator uniquely recovers the true reward. With unknown transitions, we bound the estimation error of PQR. Finally, we demonstrate the performance of PQR on synthetic and real-world datasets.
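The anchor-action idea behind PQR can be illustrated in a small tabular setting. Under the energy-based policy form, $\pi(a \mid s) \propto \exp(Q(s,a)/\alpha)$, the policy's log-probabilities identify $Q$-value differences relative to the anchor action, since the per-state normalizing constant cancels. The sketch below is illustrative only (it is not the authors' deep-learning implementation); the tabular sizes, variable names, and $\alpha$ value are assumptions.

```python
import numpy as np

# Illustrative tabular sketch of the anchor-action identification step.
# Energy-based policy: pi(a|s) = exp(Q(s,a)/alpha) / sum_a' exp(Q(s,a')/alpha),
# hence alpha * (log pi(a|s) - log pi(a0|s)) = Q(s,a) - Q(s,a0):
# the log-partition term is the same for every action and cancels.
alpha = 1.0
rng = np.random.default_rng(0)
Q_true = rng.normal(size=(4, 3))  # 4 states, 3 actions; action 0 is the anchor

# Form the policy's log-probabilities from the (unobserved) true Q-values.
logits = Q_true / alpha
log_pi = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

# Recover Q(s,a) - Q(s,a0) purely from the policy's log-probabilities.
anchor = 0
Q_diff = alpha * (log_pi - log_pi[:, [anchor]])
```

With the anchor's reward known (e.g., zero for a "do nothing" action), the Bellman equation then pins down the remaining per-state constant, which is how PQR proceeds from $Q$-differences to the full reward function.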
