Paper Title

Linguistic communication as (inverse) reward design

Paper Authors

Sumers, Theodore R., Hawkins, Robert D., Ho, Mark K., Griffiths, Thomas L., Hadfield-Menell, Dylan

Paper Abstract

Natural language is an intuitive and expressive way to communicate reward information to autonomous agents. It encompasses everything from concrete instructions to abstract descriptions of the world. Despite this, natural language is often challenging to learn from: it is difficult for machine learning methods to make appropriate inferences from such a wide range of input. This paper proposes a generalization of reward design as a unifying principle to ground linguistic communication: speakers choose utterances to maximize expected rewards from the listener's future behaviors. We first extend reward design to incorporate reasoning about unknown future states in a linear bandit setting. We then define a speaker model which chooses utterances according to this objective. Simulations show that short-horizon speakers (reasoning primarily about a single, known state) tend to use instructions, while long-horizon speakers (reasoning primarily about unknown, future states) tend to describe the reward function. We then define a pragmatic listener which performs inverse reward design by jointly inferring the speaker's latent horizon and rewards. Our findings suggest that this extension of reward design to linguistic communication, including the notion of a latent speaker horizon, is a promising direction for achieving more robust alignment outcomes from natural language supervision.
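To make the abstract's setup concrete, below is a minimal, self-contained sketch of the speaker and pragmatic-listener models in a toy linear bandit. Everything specific here (the discretized weight grid `W_GRID`, the toy utterance semantics, the softmax speaker with temperature `BETA`, and the Monte Carlo treatment of unknown future states) is an illustrative assumption of this sketch, not the paper's implementation.

```python
# Sketch of the speaker / pragmatic-listener setup described in the abstract.
# All constants, names, and utterance semantics below are illustrative
# assumptions, not the paper's actual code.
import itertools

import numpy as np

rng = np.random.default_rng(0)

D = 2               # number of reward-relevant features
N_ARMS = 3          # options (arms) per bandit state
BETA = 5.0          # speaker soft-optimality temperature
HORIZONS = [1, 10]  # candidate speaker horizons the listener reasons over

# Coarse grid over reward weights w, so Bayesian updates are exact sums.
W_GRID = np.array(list(itertools.product([-1.0, 0.0, 1.0], repeat=D)))

def sample_state():
    """A bandit state: one feature vector per arm."""
    return rng.uniform(-1.0, 1.0, size=(N_ARMS, D))

# Utterances are either instructions ("choose arm i here") or descriptions
# of the reward function ("feature j is good/bad").
UTTERANCES = [("instruct", i) for i in range(N_ARMS)] + \
             [("describe", j, s) for j in range(D) for s in (-1.0, 1.0)]

def consistent(u, w, state):
    """Literal semantics: is reward hypothesis w compatible with utterance u?"""
    if u[0] == "instruct":                 # instructed arm must be optimal
        vals = state @ w
        return vals[u[1]] >= vals.max() - 1e-9
    _, j, s = u                            # described feature must have sign s
    return np.sign(w[j]) == s

def literal_posterior(u, state):
    """Uniform prior over W_GRID, conditioned on literal consistency with u."""
    mask = np.array([consistent(u, w, state) for w in W_GRID], dtype=float)
    return mask / mask.sum()

def act(posterior, state):
    """A literal listener picks the arm with highest posterior-mean reward."""
    return int(np.argmax(state @ (posterior @ W_GRID)))

def speaker_utility(u, w_true, state, horizon, n_future=50):
    """Expected true reward of the listener's behavior over `horizon` states:
    the known current state plus horizon-1 unknown future states,
    estimated here by Monte Carlo sampling."""
    post = literal_posterior(u, state)
    total = (state @ w_true)[act(post, state)]
    if horizon > 1:
        futures = [sample_state() for _ in range(n_future)]
        total += (horizon - 1) * np.mean(
            [(f @ w_true)[act(post, f)] for f in futures])
    return total

def speaker_probs(w_true, state, horizon):
    """Softmax speaker: P(u | w, horizon) proportional to exp(beta * utility)."""
    utils = np.array([speaker_utility(u, w_true, state, horizon)
                      for u in UTTERANCES])
    p = np.exp(BETA * (utils - utils.max()))
    return p / p.sum()

def pragmatic_posterior(u_obs, state):
    """Inverse reward design: P(w, H | u) proportional to P(u | w, H),
    with uniform priors over both the weight grid and the horizon."""
    joint = np.array([[speaker_probs(w, state, h)[UTTERANCES.index(u_obs)]
                       for h in HORIZONS] for w in W_GRID])
    return joint / joint.sum()

if __name__ == "__main__":
    w_true = np.array([1.0, -1.0])
    state = sample_state()
    for h in HORIZONS:
        u = UTTERANCES[rng.choice(len(UTTERANCES),
                                  p=speaker_probs(w_true, state, h))]
        print(f"horizon={h:2d}: speaker chose {u}")
    joint = pragmatic_posterior(("describe", 0, 1.0), state)
    print("listener's P(H | u):",
          dict(zip(HORIZONS, joint.sum(axis=0).round(3))))
```

In this sketch, an instruction only constrains behavior in the state where it was issued, while a feature description constrains the reward hypothesis itself and therefore transfers to unseen future states; that asymmetry is the mechanism behind the short- versus long-horizon contrast the abstract reports. The pragmatic listener simply inverts the same softmax speaker to infer the horizon and reward weights jointly.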
