Paper Title
Learning Rewards from Linguistic Feedback
Paper Authors
Paper Abstract
We explore unconstrained natural language feedback as a learning signal for artificial agents. Humans use rich and varied language to teach, yet most prior work on interactive learning from language assumes a particular form of input (e.g., commands). We propose a general framework which does not make this assumption, using aspect-based sentiment analysis to decompose feedback into sentiment about the features of a Markov decision process. We then perform an analogue of inverse reinforcement learning, regressing the sentiment on the features to infer the teacher's latent reward function. To evaluate our approach, we first collect a corpus of teaching behavior in a cooperative task where both teacher and learner are human. We implement three artificial learners: sentiment-based "literal" and "pragmatic" models, and an inference network trained end-to-end to predict latent rewards. We then repeat our initial experiment and pair them with human teachers. All three successfully learn from interactive human feedback. The sentiment models outperform the inference network, with the "pragmatic" model approaching human performance. Our work thus provides insight into the information structure of naturalistic linguistic feedback as well as methods to leverage it for reinforcement learning.
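The core idea in the abstract is to decompose each feedback utterance into per-feature sentiment and then regress that sentiment on the MDP features to recover the teacher's latent reward weights. Below is a minimal sketch of that regression step, not the authors' implementation: the feature indicators, sentiment scores, and the use of ordinary least squares are illustrative assumptions.

```python
# Minimal sketch (hypothetical data, not the paper's code): estimate latent
# per-feature reward weights by regressing utterance-level sentiment on the
# MDP features present in the behavior being critiqued.

import numpy as np

# Each row: binary indicators of which MDP features were active in the
# behavior the teacher commented on (placeholder values).
feature_matrix = np.array([
    [1, 0, 1],   # behavior exhibited features 0 and 2
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
])

# Aspect-based sentiment extracted from each utterance, mapped to a scalar
# in [-1, 1] (negative = disapproval). These values are illustrative.
sentiment = np.array([0.8, -0.6, 0.2, 0.9])

# Least-squares regression of sentiment on features; the coefficients serve
# as an estimate of the teacher's latent reward weight for each feature.
reward_weights, *_ = np.linalg.lstsq(feature_matrix, sentiment, rcond=None)

print("Estimated reward weights:", reward_weights)
```

Under this sketch, a feature that consistently co-occurs with positive sentiment receives a positive estimated weight, which is the sense in which the approach acts as an analogue of inverse reinforcement learning driven by linguistic feedback rather than demonstrations.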