Paper Title
Rewards Encoding Environment Dynamics Improves Preference-based Reinforcement Learning
Paper Authors
Paper Abstract
Preference-based reinforcement learning (RL) algorithms help avoid the pitfalls of hand-crafted reward functions by distilling them from human preference feedback, but they remain impractical due to the burdensome number of labels required from the human, even for relatively simple tasks. In this work, we demonstrate that encoding environment dynamics in the reward function (REED) dramatically reduces the number of preference labels required in state-of-the-art preference-based RL frameworks. We hypothesize that REED-based methods better partition the state-action space and facilitate generalization to state-action pairs not included in the preference dataset. REED iterates between encoding environment dynamics in a state-action representation via a self-supervised temporal consistency task, and bootstrapping the preference-based reward function from the state-action representation. Whereas prior approaches train only on the preference-labelled trajectory pairs, REED exposes the state-action representation to all transitions experienced during policy training. We explore the benefits of REED within the PrefPPO [1] and PEBBLE [2] preference learning frameworks and demonstrate improvements across experimental conditions to both the speed of policy learning and the final policy performance. For example, on quadruped-walk and walker-walk with 50 preference labels, REED-based reward functions recover 83% and 66% of ground truth reward policy performance, versus only 38% and 21% without REED. For some domains, REED-based reward functions result in policies that outperform policies trained on the ground truth reward.
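The abstract names two objectives that share a state-action representation: a self-supervised temporal-consistency task trained on all transitions, and a Bradley-Terry-style preference loss trained only on labelled segment pairs. The sketch below illustrates that structure only; the encoder architecture, the SimSiam-style stop-gradient consistency loss, the dimensions, and the hyperparameters are assumptions for illustration, not the authors' implementation.

```python
# Minimal PyTorch sketch of the REED idea as summarized in the abstract.
# All module shapes and the specific consistency objective are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, ACTION_DIM, EMBED_DIM = 17, 6, 64  # hypothetical dimensions


class StateActionEncoder(nn.Module):
    """Shared state-action representation used by both objectives."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 128), nn.ReLU(),
            nn.Linear(128, EMBED_DIM),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))


encoder = StateActionEncoder()
# Predicts the embedding of the next transition from the current one
# (the self-supervised temporal-consistency task, applied to all transitions).
dynamics_head = nn.Sequential(nn.Linear(EMBED_DIM, 64), nn.ReLU(), nn.Linear(64, EMBED_DIM))
# Scalar reward bootstrapped from the shared representation.
reward_head = nn.Linear(EMBED_DIM, 1)

opt = torch.optim.Adam(
    list(encoder.parameters())
    + list(dynamics_head.parameters())
    + list(reward_head.parameters()),
    lr=3e-4,
)


def temporal_consistency_loss(s, a, s_next, a_next):
    """Pull the predicted next embedding toward the actual next embedding."""
    pred = dynamics_head(encoder(s, a))
    target = encoder(s_next, a_next).detach()  # stop-gradient on the target branch
    return -F.cosine_similarity(pred, target, dim=-1).mean()


def preference_loss(seg_a, seg_b, label):
    """Bradley-Terry loss on a preference-labelled pair of trajectory segments.

    seg_* = (states, actions) with shape (segment_len, dim);
    label is 1.0 if segment A is preferred, 0.0 if segment B is preferred.
    """
    return_a = reward_head(encoder(*seg_a)).sum()
    return_b = reward_head(encoder(*seg_b)).sum()
    logits = torch.stack([return_a, return_b]).unsqueeze(0)
    target = torch.tensor([0 if label == 1.0 else 1])
    return F.cross_entropy(logits, target)


# One illustrative update alternating the two objectives on random stand-in data.
s, a = torch.randn(32, STATE_DIM), torch.randn(32, ACTION_DIM)
s_next, a_next = torch.randn(32, STATE_DIM), torch.randn(32, ACTION_DIM)
seg_a = (torch.randn(50, STATE_DIM), torch.randn(50, ACTION_DIM))
seg_b = (torch.randn(50, STATE_DIM), torch.randn(50, ACTION_DIM))

loss = temporal_consistency_loss(s, a, s_next, a_next) + preference_loss(seg_a, seg_b, label=1.0)
opt.zero_grad()
loss.backward()
opt.step()
```

Because both losses backpropagate through the same encoder, the reward head benefits from dynamics structure learned on unlabelled transitions, which is the mechanism the abstract credits for reducing the number of preference labels needed.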