Paper Title
The Benefits of Model-Based Generalization in Reinforcement Learning
Paper Authors
Paper Abstract
Model-Based Reinforcement Learning (RL) is widely believed to have the potential to improve sample efficiency by allowing an agent to synthesize large amounts of imagined experience. Experience Replay (ER) can be considered a simple kind of model, which has proved effective at improving the stability and efficiency of deep RL. In principle, a learned parametric model could improve on ER by generalizing from real experience to augment the dataset with additional plausible experience. However, given that learned value functions can also generalize, it is not immediately obvious why model generalization should be better. Here, we provide theoretical and empirical insight into when, and how, we can expect data generated by a learned model to be useful. First, we provide a simple theorem motivating how learning a model as an intermediate step can narrow down the set of possible value functions more than learning a value function directly from data using the Bellman equation. Second, we provide an illustrative example showing empirically how a similar effect occurs in a more concrete setting with neural network function approximation. Finally, we provide extensive experiments showing the benefit of model-based learning for online RL in environments with combinatorial complexity, but factored structure that allows a learned model to generalize. In these experiments, we take care to control for other factors in order to isolate, insofar as possible, the benefit of using experience generated by a learned model relative to ER alone.
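The contrast the abstract draws can be made concrete with a small sketch. Learning a value function directly constrains Q only through sampled Bellman backups, Q(s,a) ≈ r + γ·max_{a'} Q(s',a'), applied to transitions stored in the replay buffer; a Dyna-style agent instead fits a model to the same data and also performs backups on transitions the model generates. The code below is a minimal illustration under assumed details (tabular Q-learning, a lookup-table model, made-up hyperparameters and environment), not the paper's implementation.

```python
# A minimal sketch (not the paper's method) contrasting the two uses of
# logged experience the abstract describes:
#   (a) Experience Replay (ER): update Q only from stored real transitions;
#   (b) Dyna-style model-based learning: fit a model to the same data,
#       then also update Q from transitions the model generates.
# All specifics here (tabular model, hyperparameters, toy data) are
# illustrative assumptions, not taken from the paper.

import random
from collections import defaultdict

GAMMA, ALPHA = 0.95, 0.1  # assumed discount and step size

def q_update(Q, s, a, r, s2, n_actions):
    """One sampled Bellman backup: Q(s,a) <- Q(s,a) + a*(target - Q(s,a))."""
    target = r + GAMMA * max(Q[(s2, b)] for b in range(n_actions))
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

def train(replay, n_actions, model_rollouts=0, steps=1000, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)

    # A lookup-table model that memorizes (s, a) -> (r, s'). A parametric
    # model would instead generalize to (s, a) pairs absent from the
    # buffer, which is exactly the effect the paper isolates.
    model = {(s, a): (r, s2) for s, a, r, s2 in replay}

    for _ in range(steps):
        # (a) ER: back up a stored real transition.
        s, a, r, s2 = rng.choice(replay)
        q_update(Q, s, a, r, s2, n_actions)
        # (b) Dyna: additionally back up model-generated transitions.
        for _ in range(model_rollouts):
            s, a = rng.choice(list(model))
            r, s2 = model[(s, a)]
            q_update(Q, s, a, r, s2, n_actions)
    return Q

# Example on a toy 2-state chain with 2 actions (hypothetical data):
data = [(0, 0, 0.0, 1), (1, 0, 1.0, 0)]
Q_er = train(data, n_actions=2)                      # ER only
Q_mb = train(data, n_actions=2, model_rollouts=5)    # ER + model data
```

Note that the lookup-table model above can only replay what ER already stores, so it reduces to extra replay; the abstract's claim concerns a learned parametric model that generalizes and can therefore supply plausible transitions outside the buffer, which is the benefit the paper's theorem and experiments aim to isolate.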