Paper Title

Tiered Reward: Designing Rewards for Specification and Fast Learning of Desired Behavior

Paper Authors

Zhiyuan Zhou, Shreyas Sundara Raman, Henry Sowerby, Michael L. Littman

Paper Abstract

Reinforcement-learning agents seek to maximize a reward signal through environmental interactions. As humans, our job in the learning process is to design reward functions that express desired behavior and enable the agent to learn such behavior swiftly. However, designing good reward functions to induce the desired behavior is generally hard, let alone the question of which rewards make learning fast. In this work, we introduce a family of reward structures we call Tiered Reward that addresses both of these questions. We consider the reward-design problem in tasks formulated as reaching desirable states and avoiding undesirable states. To start, we propose a strict partial ordering of the policy space to resolve trade-offs in behavior preference. We prefer policies that reach the good states faster and with higher probability, while staying away from the bad states longer. Next, we introduce Tiered Reward, a class of environment-independent reward functions, and show that it is guaranteed to induce policies that are Pareto-optimal according to our preference relation. Finally, we demonstrate that Tiered Reward leads to fast learning with multiple tabular and deep reinforcement-learning algorithms.
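Since the abstract only gestures at what a Tiered Reward looks like, a small sketch may help make the idea concrete. The snippet below is a minimal illustration under assumptions of ours, not the paper's exact construction: the helper names (`make_tiered_reward`, `reward`), the top-tier-at-zero convention, and the geometric 1/(1 - γ) spacing are all placeholders for exposition; the paper itself derives the precise inequalities the tier rewards must satisfy for the Pareto-optimality guarantee.

```python
# A minimal, illustrative sketch of a tier-indexed reward -- an assumption
# for exposition, not the paper's exact construction. States are split into
# ordered tiers, from tier 0 (worst, e.g., obstacle states) up to the top
# tier (best, e.g., the goal); the reward depends only on a state's tier.

def make_tiered_reward(num_tiers: int, gamma: float) -> list[float]:
    """Return one reward per tier, worst tier first, top tier fixed at 0.

    Placeholder spacing: each tier's reward sits 1/(1 - gamma) times
    further below the tier above it, so lingering in a lower tier is
    sharply penalized. The paper derives the precise discount-dependent
    inequalities the tier rewards must satisfy for its Pareto-optimality
    guarantee; this geometric spacing is only meant to convey the shape.
    """
    rewards = [0.0] * num_tiers
    for i in range(num_tiers - 2, -1, -1):
        rewards[i] = (rewards[i + 1] - 1.0) / (1.0 - gamma)
    return rewards


def reward(state_tier: int, tier_rewards: list[float]) -> float:
    """Environment-independent reward: a function of the state's tier alone."""
    return tier_rewards[state_tier]


# Example: a three-tier task (obstacle < background < goal), e.g. a grid
# world where the agent should reach the goal fast while avoiding lava.
print(make_tiered_reward(num_tiers=3, gamma=0.75))  # [-20.0, -4.0, 0.0]
```

Note that only the partition of states into tiers changes across tasks; the reward values themselves are a function of the tier index and the discount factor alone, which is what makes the structure environment-independent in the sense the abstract describes.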
