Paper Title
Generalized Data Distribution Iteration
Paper Authors
Paper Abstract
Obtaining higher sample efficiency and superior final performance simultaneously has been one of the major challenges for deep reinforcement learning (DRL). Previous work could handle one of these challenges but typically failed to address them concurrently. In this paper, we tackle both challenges at once. To achieve this, we first decouple them into two classic RL problems: data richness and the exploration-exploitation trade-off. We then cast these two problems into the training data distribution optimization problem, namely obtaining the desired training data within limited interactions, and address them concurrently via i) explicit modeling and control of the capacity and diversity of the behavior policies and ii) more fine-grained and adaptive control of the selective/sampling distribution over the behavior policies, using monotonic data distribution optimization. Finally, we integrate this process into Generalized Policy Iteration (GPI) and obtain a more general framework called Generalized Data Distribution Iteration (GDI). We use the GDI framework to introduce operator-based versions of well-known RL methods, from DQN to Agent57, and summarize the theoretical guarantee of GDI's superiority over GPI. We also demonstrate state-of-the-art (SOTA) performance on the Arcade Learning Environment (ALE), where our algorithm achieves a 9620.33% mean human normalized score (HNS) and a 1146.39% median HNS, and surpasses 22 human world records using only 200M training frames. Our performance is comparable to Agent57's while consuming 500 times less data. We argue that there is still a long way to go before obtaining truly superhuman agents in ALE.
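To illustrate the idea described in the abstract, below is a minimal, hypothetical Python sketch of a GDI-style training loop: a standard GPI step (policy evaluation plus improvement, reduced here to a single tabular Q-learning update) wrapped by an outer loop that re-weights a selective distribution over a small family of behavior policies toward members whose collected data yielded higher returns. All names (PolicyFamily, gdi_loop, the env interface) are placeholders invented for this sketch and are not the paper's algorithm or API; the epsilon-greedy family merely stands in for capacity/diversity control, and the weight update only mimics monotonic data distribution optimization in spirit.

```python
# Illustrative sketch only; helper names and the env interface are hypothetical,
# not the paper's implementation.
import random


class PolicyFamily:
    """A toy family of epsilon-greedy behavior policies indexed by epsilon."""

    def __init__(self, epsilons):
        self.epsilons = list(epsilons)  # capacity/diversity knob of the family
        self.q = {}                     # shared tabular Q-values: state -> list of action values

    def act(self, state, epsilon, n_actions):
        if random.random() < epsilon or state not in self.q:
            return random.randrange(n_actions)
        values = self.q[state]
        return max(range(n_actions), key=lambda a: values[a])


def gdi_loop(env, n_actions, iterations=100, episodes_per_iter=8, alpha=0.1, gamma=0.99):
    """GPI (evaluate + improve) extended with a data-distribution step:
    the selective distribution over behavior policies is re-weighted toward
    members that produced higher returns in the last round."""
    family = PolicyFamily(epsilons=[0.01, 0.1, 0.3])
    weights = [1.0 / len(family.epsilons)] * len(family.epsilons)  # selective distribution

    for _ in range(iterations):
        returns = [0.0] * len(family.epsilons)
        for _ in range(episodes_per_iter):
            # (1) Sample a behavior policy from the current selective distribution.
            idx = random.choices(range(len(family.epsilons)), weights=weights)[0]
            eps = family.epsilons[idx]
            state, done, ep_return = env.reset(), False, 0.0
            while not done:
                action = family.act(state, eps, n_actions)
                next_state, reward, done = env.step(action)
                ep_return += reward
                # (2) Policy evaluation + improvement, collapsed into one Q-learning update.
                q_s = family.q.setdefault(state, [0.0] * n_actions)
                q_next = max(family.q.get(next_state, [0.0] * n_actions))
                q_s[action] += alpha * (reward + gamma * q_next * (not done) - q_s[action])
                state = next_state
            returns[idx] += ep_return
        # (3) Data-distribution iteration: shift sampling weight toward behavior
        #     policies whose data led to higher returns this round.
        shifted = [r - min(returns) + 1e-3 for r in returns]
        total = sum(shifted)
        weights = [0.5 * w + 0.5 * s / total for w, s in zip(weights, shifted)]
    return family, weights
```

The sketch assumes a toy environment exposing reset() -> state and step(action) -> (next_state, reward, done); it omits the operator-based formulation and the theoretical machinery behind GDI's guarantees.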