Paper Title
DCE: Offline Reinforcement Learning With Double Conservative Estimates
Paper Authors
Paper Abstract
Offline reinforcement learning has attracted much interest for addressing the application challenges of traditional reinforcement learning. It trains agents using previously collected datasets without any further interaction. To address the overestimation of OOD (out-of-distribution) actions, conservative estimation gives a low value for all inputs. Previous conservative estimation methods usually find it difficult to avoid the impact of OOD actions on Q-value estimates, and these algorithms typically sacrifice some computational efficiency to achieve conservative estimation. In this paper, we propose a simple conservative estimation method, double conservative estimates (DCE), which uses two conservative estimation methods to constrain the policy. Our algorithm introduces a V-function to avoid errors on in-distribution actions while implicitly achieving conservative estimation. In addition, our algorithm uses a controllable penalty term that changes the degree of conservatism during training. We theoretically show how this method influences the estimation of OOD and in-distribution actions. Our experiments show that the two conservative estimation methods separately affect the estimation of all state-action pairs. DCE demonstrates state-of-the-art performance on D4RL.
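As a rough illustration of the two mechanisms the abstract mentions (a V-function that keeps value bootstrapping on dataset actions, plus a weighted penalty term that controls the degree of conservatism), the Python/PyTorch sketch below shows one possible rendering. The expectile V-update, the specific penalty on policy-sampled actions, and the weight alpha are assumptions made for illustration only, not DCE's actual objectives.

# Hypothetical sketch of the "double conservative estimation" ideas described in
# the abstract. The exact DCE losses are not given there, so the IQL-style
# expectile V-update and the alpha-weighted penalty below are illustrative assumptions.
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

class Critic(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.q = mlp(obs_dim + act_dim, 1)   # Q(s, a)
        self.v = mlp(obs_dim, 1)             # V(s), used to avoid querying OOD actions

    def q_value(self, obs, act):
        return self.q(torch.cat([obs, act], dim=-1)).squeeze(-1)

    def v_value(self, obs):
        return self.v(obs).squeeze(-1)

def expectile_loss(diff, tau=0.7):
    # Asymmetric (expectile) regression: V is pushed toward an upper expectile of
    # Q over dataset actions, so no out-of-distribution action is ever evaluated.
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

def critic_losses(critic, target_critic, policy, batch,
                  gamma=0.99, tau=0.7, alpha=1.0):
    obs, act, rew, next_obs, done = batch

    # (1) V-function update: fit V(s) to an expectile of Q(s, a) on dataset actions only.
    with torch.no_grad():
        q_data = target_critic.q_value(obs, act)
    v_loss = expectile_loss(q_data - critic.v_value(obs), tau)

    # (2) Q-function update: bootstrap through V(s'), again avoiding OOD actions.
    with torch.no_grad():
        target = rew + gamma * (1.0 - done) * critic.v_value(next_obs)
    td_loss = (critic.q_value(obs, act) - target).pow(2).mean()

    # (3) Controllable penalty: push down Q on actions sampled from the current
    # policy; alpha sets the degree of conservatism during training.
    policy_act = policy(obs)                 # policy is assumed to map obs -> action
    penalty = critic.q_value(obs, policy_act).mean()

    return v_loss, td_loss + alpha * penalty

In this sketch, increasing alpha would make the Q-estimates more pessimistic about policy actions, which is one way to realize the "controllable degree of conservatism" the abstract refers to.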