Paper Title
The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
Paper Authors
Paper Abstract
Reward hacking -- where RL agents exploit gaps in misspecified reward functions -- has been widely observed, but not yet systematically studied. To understand how reward hacking arises, we construct four RL environments with misspecified rewards. We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, observation space noise, and training time. More capable agents often exploit reward misspecifications, achieving higher proxy reward and lower true reward than less capable agents. Moreover, we find instances of phase transitions: capability thresholds at which the agent's behavior qualitatively shifts, leading to a sharp decrease in the true reward. Such phase transitions pose challenges to monitoring the safety of ML systems. To address this, we propose an anomaly detection task for aberrant policies and offer several baseline detectors.
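To make the proposed detection task concrete, below is a minimal sketch of one plausible baseline detector, not the paper's actual implementation: it flags a candidate policy as aberrant when its action distributions on held-out observations diverge too much from those of a trusted policy. The function name, inputs, and threshold are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's method): flag a candidate policy
# as anomalous if its action distributions diverge from a trusted policy's
# on the same set of held-out observations.
import numpy as np
from scipy.spatial.distance import jensenshannon


def detect_aberrant_policy(trusted_probs, candidate_probs, threshold=0.2):
    """Return True if the candidate policy looks anomalous.

    trusted_probs, candidate_probs: arrays of shape (n_states, n_actions)
        with action probabilities from a trusted and a candidate policy
        evaluated on the same held-out observations (hypothetical inputs).
    threshold: divergence level above which the candidate is flagged;
        in practice this would be tuned on known-good policies.
    """
    trusted_probs = np.asarray(trusted_probs, dtype=float)
    candidate_probs = np.asarray(candidate_probs, dtype=float)
    # Jensen-Shannon distance per state, averaged over the evaluation set.
    divergences = [
        jensenshannon(p, q) for p, q in zip(trusted_probs, candidate_probs)
    ]
    return float(np.mean(divergences)) > threshold


# Example: a candidate that nearly matches the trusted policy is not flagged,
# while one that concentrates probability on different actions is.
trusted = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
benign = [[0.65, 0.25, 0.1], [0.15, 0.75, 0.1]]
aberrant = [[0.05, 0.05, 0.9], [0.9, 0.05, 0.05]]
print(detect_aberrant_policy(trusted, benign))    # False
print(detect_aberrant_policy(trusted, aberrant))  # True
```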