Paper Title
Sequence Model Imitation Learning with Unobserved Contexts
Paper Authors
Paper Abstract
We consider imitation learning problems where the learner's ability to mimic the expert increases throughout the course of an episode as more information is revealed. One example of this is when the expert has access to privileged information: while the learner might not be able to accurately reproduce expert behavior early on in an episode, by considering the entire history of states and actions, they might be able to eventually identify the hidden context and act as the expert would. We prove that on-policy imitation learning algorithms (with or without access to a queryable expert) are better equipped to handle these sorts of asymptotically realizable problems than off-policy methods. This is because on-policy algorithms provably learn to recover from their initially suboptimal actions, while off-policy methods treat their suboptimal past actions as though they came from the expert. This often manifests as a latching behavior: a naive repetition of past actions. We conduct experiments in a toy bandit domain that show that there exist sharp phase transitions in whether off-policy approaches are able to match expert performance asymptotically, in contrast to the uniformly good performance of on-policy approaches. We demonstrate that on several continuous control tasks, on-policy approaches are able to use history to identify the context, while off-policy approaches actually perform worse when given access to history.
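The sketch below is a minimal, self-contained illustration of the latching phenomenon the abstract describes, not the paper's actual experimental setup. It assumes a hypothetical two-action bandit with a hidden context, a per-step noisy "hint" of that context, a simple logistic-regression policy over history features, and a DAgger-style loop as the on-policy method; all of these modeling choices are this editor's assumptions for illustration only. The intended takeaway is qualitative: behavioral cloning on expert trajectories (where the previous action always equals the hidden context) learns to lean on its own previous action and tends to repeat early mistakes, whereas training on the learner's own rollouts with expert labels shifts weight onto the informative observations.

```python
# Illustrative sketch (assumed setup, not the paper's): a two-action bandit with a
# hidden context c in {0, 1}. The expert knows c and always plays a_t = c. The
# learner never sees c; at each step it sees a noisy hint of c plus its own previous
# action. "Off-policy" = behavioral cloning on expert trajectories; "on-policy" =
# DAgger-style retraining on the learner's own rollouts with expert action labels.
import numpy as np

rng = np.random.default_rng(0)
HORIZON, NOISE, N_TRAJ = 20, 0.35, 500  # episode length, hint flip prob., trajectories


def hint(c):
    """Noisy observation of the hidden context (flipped with probability NOISE)."""
    return c if rng.random() > NOISE else 1 - c


def features(mean_hint, prev_action):
    """Policy features: running mean of hints, previous action, bias."""
    return np.array([mean_hint, prev_action, 1.0])


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_logistic(X, y, steps=2000, lr=0.5):
    """Plain batch gradient descent for logistic regression."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (sigmoid(X @ w) - y) / len(y)
    return w


def rollout(policy_w=None):
    """One episode. If policy_w is None the expert acts, else the learner acts.
    Returns per-step features, expert action labels, and learner-expert agreement."""
    c = rng.integers(2)
    hints, prev_action, rows, labels, correct = [], 0, [], [], []
    for _ in range(HORIZON):
        hints.append(hint(c))
        x = features(np.mean(hints), prev_action)
        a = c if policy_w is None else int(sigmoid(x @ policy_w) > 0.5)
        rows.append(x)
        labels.append(c)          # the expert's action is always the hidden context
        correct.append(a == c)
        prev_action = a
    return np.array(rows), np.array(labels), np.mean(correct)


# Off-policy: behavioral cloning on expert trajectories.
X, y = zip(*[rollout()[:2] for _ in range(N_TRAJ)])
bc_w = train_logistic(np.vstack(X), np.concatenate(y))

# On-policy: DAgger-style iterations on the learner's own rollouts.
dagger_w = np.zeros(3)
X_agg, y_agg = [], []
for _ in range(10):
    for _ in range(N_TRAJ // 10):
        Xi, yi, _ = rollout(dagger_w)
        X_agg.append(Xi)
        y_agg.append(yi)
    dagger_w = train_logistic(np.vstack(X_agg), np.concatenate(y_agg))

# Evaluate both learners on fresh episodes.
bc_acc = np.mean([rollout(bc_w)[2] for _ in range(200)])
dagger_acc = np.mean([rollout(dagger_w)[2] for _ in range(200)])
print(f"BC (off-policy) agreement with expert:    {bc_acc:.2f}")
print(f"DAgger (on-policy) agreement with expert: {dagger_acc:.2f}")
print("BC weights [mean hint, prev action, bias]:", np.round(bc_w, 2))
print("DAgger weights:                           ", np.round(dagger_w, 2))
```

In this toy setup, the cloned policy places most of its weight on the previous-action feature, so a wrong first guess tends to be repeated for the rest of the episode, while the DAgger-style learner, trained on histories it generated itself, relies on the accumulating hints and recovers from early mistakes.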