Paper Title
Understanding the Pathologies of Approximate Policy Evaluation when Combined with Greedification in Reinforcement Learning
Paper Authors
Paper Abstract
Despite empirical success, the theory of reinforcement learning (RL) with value function approximation remains fundamentally incomplete. Prior work has identified a variety of pathological behaviours that arise in RL algorithms that combine approximate on-policy evaluation and greedification. One prominent example is policy oscillation, wherein an algorithm may cycle indefinitely between policies, rather than converging to a fixed point. What is not well understood, however, is the quality of the policies in the region of oscillation. In this paper, we present simple examples illustrating that, in addition to policy oscillation and multiple fixed points, the same basic issue can lead to convergence to the worst possible policy for a given approximation. Such behaviours can arise when algorithms optimize evaluation accuracy weighted by the distribution of states that occur under the current policy, but greedify based on the value of states which are rare or nonexistent under this distribution. This means the values used for greedification are unreliable and can steer the policy in undesirable directions. Our observation that this can lead to the worst possible policy shows that, in a general sense, such algorithms are unreliable. The existence of such examples helps to narrow the kind of theoretical guarantees that are possible and the kind of algorithmic ideas that are likely to be helpful. We demonstrate analytically and experimentally that such pathological behaviours can impact a wide range of RL and dynamic programming algorithms; such behaviours can arise both with and without bootstrapping, and with linear function approximation as well as with more complex parameterized functions like neural networks.
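The failure mode described in the abstract comes from a mismatch between the evaluation objective, which is weighted by the on-policy state distribution, and the greedification step, which consults approximate values at all states. The following is a minimal, hypothetical sketch of this algorithm family: approximate policy iteration with linear value-function approximation on a tiny two-state MDP. The MDP, the aliased features, and every name below are illustrative assumptions rather than the paper's examples; the sketch only shows where the distribution-weighted fit and the greedification step sit in the loop.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative only, not from the paper).
# P[a][s, s'] is the transition probability, r[a][s] the expected reward.
P = {0: np.array([[1.0, 0.0], [1.0, 0.0]]),   # action 0 drives toward state 0
     1: np.array([[0.0, 1.0], [0.0, 1.0]])}   # action 1 drives toward state 1
r = {0: np.array([0.0, 0.0]),
     1: np.array([1.0, 1.0])}
gamma = 0.9
phi = np.array([[1.0], [2.0]])  # aliased linear features: one shared parameter

def state_distribution(pi, n_iters=200):
    """Approximate on-policy state distribution of deterministic policy pi."""
    d = np.array([0.5, 0.5])
    P_pi = np.array([P[pi[s]][s] for s in range(2)])
    for _ in range(n_iters):
        d = d @ P_pi
    return d

def evaluate(pi):
    """Least-squares fit of v_pi, weighted by the on-policy distribution."""
    P_pi = np.array([P[pi[s]][s] for s in range(2)])
    r_pi = np.array([r[pi[s]][s] for s in range(2)])
    v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)  # exact v_pi
    D = np.diag(state_distribution(pi))                     # on-policy weights
    w = np.linalg.solve(phi.T @ D @ phi, phi.T @ D @ v_pi)  # weighted projection
    return phi @ w                                          # approximate values

def greedify(v_hat):
    """Greedy policy w.r.t. the approximate values at *every* state, including
    states that are rare or unvisited under the current policy, which is the
    step where the pathology described in the abstract can enter."""
    return [int(np.argmax([r[a][s] + gamma * P[a][s] @ v_hat for a in (0, 1)]))
            for s in range(2)]

pi = [0, 0]
for it in range(10):
    v_hat = evaluate(pi)
    pi = greedify(v_hat)
    print(f"iteration {it}: policy={pi}, approx values={v_hat}")
```

In a realistic setting, `evaluate` would be replaced by TD or Monte Carlo estimation from sampled trajectories, which implicitly weights errors by the visited-state distribution; the structural issue is unchanged, since the greedification step still consults values at states the evaluation objective barely constrains.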