Paper Title
On the Guaranteed Almost Equivalence between Imitation Learning from Observation and Demonstration
Paper Authors
Paper Abstract
Imitation learning from observation (LfO) is preferable to imitation learning from demonstration (LfD) because it does not require expert actions when reconstructing the expert policy from expert data. However, previous studies suggest that the performance of LfO is inferior to that of LfD by a tremendous gap, which makes it challenging to employ LfO in practice. By contrast, this paper proves that LfO is almost equivalent to LfD in the deterministic robot environment, and more generally even in the robot environment with bounded randomness. In the deterministic robot environment, we show from the perspective of control theory that the inverse dynamics disagreement between LfO and LfD approaches zero, meaning that LfO is almost equivalent to LfD. To further relax the deterministic constraint and better adapt to practical environments, we consider bounded randomness in the robot environment and prove that the optimization targets of LfD and LfO remain almost the same in this more general setting. Extensive experiments on multiple robot tasks empirically demonstrate that LfO achieves performance comparable to LfD. In fact, most common real-world robot systems are robot environments with bounded randomness (i.e., the setting considered in this paper). Hence, our findings greatly extend the potential of LfO and suggest that, in practice, we can safely apply LfO without sacrificing performance compared to LfD.
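As context for the inverse dynamics disagreement mentioned above, the following is a minimal sketch of the standard occupancy-measure formulation from the LfO literature; the notation ($\rho_\pi$, $\rho_E$ for the learner's and expert's occupancy measures, KL-divergence objectives) is an assumption for illustration and is not quoted from this paper. LfD minimizes the divergence between state-action occupancies, LfO between state-transition occupancies, and the chain rule of the KL divergence relates the two:

$$
D_{\mathrm{KL}}\!\left(\rho_\pi(s,a)\,\big\|\,\rho_E(s,a)\right)
= D_{\mathrm{KL}}\!\left(\rho_\pi(s,s')\,\big\|\,\rho_E(s,s')\right)
+ \mathbb{E}_{(s,s')\sim\rho_\pi}\!\left[D_{\mathrm{KL}}\!\left(\rho_\pi(a\mid s,s')\,\big\|\,\rho_E(a\mid s,s')\right)\right],
$$

where the identity uses the fact that the learner and the expert share the same transition kernel $P(s'\mid s,a)$, which cancels when extending $\rho(s,a)$ to $\rho(s,a,s')$. The left-hand side is the LfD objective, the first right-hand term is the LfO objective, and the last term is the inverse dynamics disagreement. Under deterministic dynamics that are injective in the action, the conditional $\rho(a\mid s,s')$ is fixed by the environment regardless of the policy, so this term vanishes and the two objectives coincide, which is the intuition behind the paper's claimed almost-equivalence.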