Paper Title

Temporal-Logic-Based Reward Shaping for Continuing Reinforcement Learning Tasks

Paper Authors

Yuqian Jiang, Sudarshanan Bharadwaj, Bo Wu, Rishi Shah, Ufuk Topcu, Peter Stone

Paper Abstract

In continuing tasks, average-reward reinforcement learning may be a more appropriate problem formulation than the more common discounted reward formulation. As usual, learning an optimal policy in this setting typically requires a large amount of training experience. Reward shaping is a common approach for incorporating domain knowledge into reinforcement learning in order to speed up convergence to an optimal policy. However, to the best of our knowledge, the theoretical properties of reward shaping have thus far only been established in the discounted setting. This paper presents the first reward shaping framework for average-reward learning and proves that, under standard assumptions, the optimal policy under the original reward function can be recovered. In order to avoid the need for manual construction of the shaping function, we introduce a method for utilizing domain knowledge expressed as a temporal logic formula. The formula is automatically translated to a shaping function that provides additional reward throughout the learning process. We evaluate the proposed method on three continuing tasks. In all cases, shaping speeds up the average-reward learning rate without any reduction in the performance of the learned policy compared to relevant baselines.
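
To make the abstract's description more concrete, below is a minimal sketch, assuming a tabular setting, of how a potential-based shaping term can be added to the reward during average-reward (differential) Q-learning. The environment interface `env` (with `reset()`, `step()`, and `actions`), the potential function `phi`, and all hyperparameters are hypothetical placeholders for illustration; the abstract does not give the paper's exact shaping function, which it derives automatically from a temporal logic formula.

```python
# Minimal sketch (not the paper's implementation): potential-based reward
# shaping inside tabular differential (average-reward) Q-learning.
# `env`, `phi`, and the hyperparameters below are hypothetical placeholders.

import random
from collections import defaultdict

def shaped_differential_q_learning(env, phi, alpha=0.1, eta=0.1,
                                   epsilon=0.1, steps=100_000):
    """Differential Q-learning where each reward is augmented with the
    potential difference phi(s') - phi(s)."""
    Q = defaultdict(float)   # Q[(state, action)]
    avg_reward = 0.0         # running estimate of the average reward
    state = env.reset()

    for _ in range(steps):
        # Epsilon-greedy action selection.
        if random.random() < epsilon:
            action = random.choice(env.actions)
        else:
            action = max(env.actions, key=lambda a: Q[(state, a)])

        next_state, reward = env.step(action)

        # Potential-based shaping: add phi(s') - phi(s) to the env reward.
        # In the average-reward setting there is no discount factor, so the
        # shaping term is the plain (undiscounted) potential difference.
        shaped_reward = reward + phi(next_state) - phi(state)

        # Standard differential Q-learning update: the TD error uses the
        # shaped reward minus the current average-reward estimate.
        best_next = max(Q[(next_state, a)] for a in env.actions)
        td_error = shaped_reward - avg_reward + best_next - Q[(state, action)]
        Q[(state, action)] += alpha * td_error
        avg_reward += eta * alpha * td_error

        state = next_state

    return Q, avg_reward
```

Intuitively, the shaping term telescopes along a trajectory (its sum is phi of the last state minus phi of the first), so for a bounded potential its long-run average contribution is zero, which is consistent with the abstract's claim that the optimal policy under the original reward function is preserved.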
