Paper Title

Imitation Learning from Observations under Transition Model Disparity

Authors

Tanmay Gangwani, Yuan Zhou, Jian Peng

Abstract

Learning to perform tasks by leveraging a dataset of expert observations, also known as imitation learning from observations (ILO), is an important paradigm for learning skills without access to the expert reward function or the expert actions. We consider ILO in the setting where the expert and the learner agents operate in different environments, with the source of the discrepancy being the transition dynamics model. Recent methods for scalable ILO utilize adversarial learning to match the state-transition distributions of the expert and the learner, an approach that becomes challenging when the dynamics are dissimilar. In this work, we propose an algorithm that trains an intermediary policy in the learner environment and uses it as a surrogate expert for the learner. The intermediary policy is learned such that the state transitions generated by it are close to the state transitions in the expert dataset. To derive a practical and scalable algorithm, we employ concepts from prior work on estimating the support of a probability distribution. Experiments using MuJoCo locomotion tasks highlight that our method compares favorably to the baselines for ILO with transition dynamics mismatch.
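The core idea sketched in the abstract — rewarding the intermediary policy for state transitions that fall within the support of the expert's state-transition distribution — can be illustrated with a toy nearest-neighbor check. This is a hypothetical simplification for intuition only: the function name, the concatenated `(s, s')` representation, and the distance-threshold rule are illustrative assumptions, not the paper's actual support estimator.

```python
import numpy as np

def support_reward(transition, expert_transitions, eps=0.5):
    """Toy support-based reward: returns 1.0 if the learner's state
    transition (s, s'), given as one concatenated vector, lies within
    eps (Euclidean distance) of any transition in the expert dataset,
    and 0.0 otherwise. A crude stand-in for support estimation."""
    dists = np.linalg.norm(expert_transitions - transition, axis=1)
    return float(dists.min() <= eps)

# Expert dataset: each row is a concatenated (s, s') pair (toy 2-D states).
expert = np.array([[0.0, 0.0, 0.1, 0.0],
                   [0.1, 0.0, 0.2, 0.0]])

# A transition close to the expert support is rewarded...
print(support_reward(np.array([0.05, 0.0, 0.15, 0.0]), expert))  # 1.0
# ...while one far outside the support is not.
print(support_reward(np.array([5.0, 5.0, 5.0, 5.0]), expert))    # 0.0
```

Because the reward depends only on states, not actions, the intermediary policy can earn it in the learner's own dynamics even when the expert's actions are unavailable or infeasible — which is the point of using it as a surrogate expert.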
