Paper Title
SQUIRL: Robust and Efficient Learning from Video Demonstration of Long-Horizon Robotic Manipulation Tasks
Paper Authors
Paper Abstract
Recent advances in deep reinforcement learning (RL) have demonstrated its potential to learn complex robotic manipulation tasks. However, RL still requires the robot to collect a large amount of real-world experience. To address this problem, recent works have proposed learning from expert demonstrations (LfD), particularly via inverse reinforcement learning (IRL), given its ability to achieve robust performance with only a small number of expert demonstrations. Nevertheless, deploying IRL on real robots is still challenging due to the large number of robot experiences it requires. This paper aims to address this scalability challenge with a robust, sample-efficient, and general meta-IRL algorithm, SQUIRL, that performs a new but related long-horizon task robustly given only a single video demonstration. First, this algorithm bootstraps the learning of a task encoder and a task-conditioned policy using behavioral cloning (BC). It then collects real-robot experiences and bypasses reward learning by directly recovering a Q-function from the combined robot and expert trajectories. Next, this algorithm uses the Q-function to re-evaluate all cumulative experiences collected by the robot to improve the policy quickly. As a result, the policy performs more robustly (90%+ success) than BC on new tasks while requiring no trial-and-error at test time. Finally, our real-robot and simulated experiments demonstrate our algorithm's generality across different state spaces, action spaces, and vision-based manipulation tasks, e.g., pick-pour-place and pick-carry-drop.
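To make the training pipeline described in the abstract more concrete, below is a minimal toy Python sketch of its four phases: BC bootstrapping of a task encoder and task-conditioned policy, collecting robot experience, recovering a Q-function directly from the combined expert and robot trajectories (bypassing explicit reward learning), and re-evaluating all accumulated experience to improve the policy. Every class, function, data structure, and scoring rule here is an illustrative assumption for exposition, not the authors' actual implementation.

```python
# Hypothetical sketch of the SQUIRL-style training phases; all names are assumptions.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

State = Tuple[float, ...]
Action = int
Trajectory = List[Tuple[State, Action]]  # a sequence of (state, action) pairs


def encode_task(demo: Trajectory) -> float:
    """Toy task encoder: summarizes one demonstration as a single scalar embedding."""
    return sum(s[0] for s, _ in demo) / max(len(demo), 1)


@dataclass
class TaskConditionedPolicy:
    """Placeholder task-conditioned policy backed by a toy Q-table."""
    q_values: Dict[Tuple[State, float, Action], float] = field(default_factory=dict)

    def act(self, state: State, task_embedding: float) -> Action:
        # Greedy action over a small discrete action set (unseen entries default to 0).
        scores = [(a, self.q_values.get((state, task_embedding, a), 0.0)) for a in range(3)]
        return max(scores, key=lambda x: x[1])[0]


def behavioral_cloning(demos: List[Trajectory]) -> TaskConditionedPolicy:
    """Phase 1: bootstrap the task-conditioned policy from expert demonstrations."""
    policy = TaskConditionedPolicy()
    for demo in demos:
        z = encode_task(demo)
        for state, action in demo:
            policy.q_values[(state, z, action)] = 1.0  # imitate the expert's action
    return policy


def collect_robot_experience(policy: TaskConditionedPolicy, demo: Trajectory) -> Trajectory:
    """Phase 2: roll out the bootstrapped policy to gather (toy) robot experience."""
    z = encode_task(demo)
    return [(state, policy.act(state, z)) for state, _ in demo]


def recover_q_function(expert: List[Trajectory],
                       robot: List[Trajectory]) -> Dict[Tuple[State, float, Action], float]:
    """Phase 3: recover a Q-function from combined trajectories without reward learning
    (here crudely approximated by scoring expert transitions higher than robot ones)."""
    q: Dict[Tuple[State, float, Action], float] = {}
    for trajectories, score in ((expert, 1.0), (robot, 0.5)):
        for traj in trajectories:
            z = encode_task(traj)
            for state, action in traj:
                q[(state, z, action)] = max(q.get((state, z, action), 0.0), score)
    return q


if __name__ == "__main__":
    expert_demos: List[Trajectory] = [[((0.1,), 0), ((0.2,), 1)], [((0.3,), 2)]]
    policy = behavioral_cloning(expert_demos)                                   # Phase 1
    robot_trajs = [collect_robot_experience(policy, d) for d in expert_demos]   # Phase 2
    q_fn = recover_q_function(expert_demos, robot_trajs)                        # Phase 3
    policy.q_values.update(q_fn)                                                # Phase 4: re-evaluate
    print(policy.act((0.1,), encode_task(expert_demos[0])))
```

In this sketch the "Q-function" is just a lookup table and the task encoder is a scalar average, purely to show how the phases compose; the paper's actual method uses learned neural encoders, Q-functions, and policies over images and robot actions.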