Paper Title

Policy Optimization over General State and Action Spaces

Paper Authors

Caleb Ju, Guanghui Lan

Abstract

Reinforcement learning (RL) problems over general state and action spaces are notoriously challenging. In contrast to the tabular setting, one cannot enumerate all the states and then iteratively update the policy for each state. This prevents the application of many well-studied RL methods, especially those with provable convergence guarantees. In this paper, we first present a substantial generalization of the recently developed policy mirror descent method to deal with general state and action spaces. We introduce new approaches to incorporate function approximation into this method, so that we do not need to use explicit policy parameterization at all. Moreover, we present a novel policy dual averaging method for which possibly simpler function approximation techniques can be applied. We establish a linear rate of convergence to global optimality, or sublinear convergence to stationarity, for these methods applied to solve different classes of RL problems under exact policy evaluation. We then define proper notions of the approximation errors for policy evaluation and investigate their impact on the convergence of these methods when applied to general-state RL problems with either finite-action or continuous-action spaces. To the best of our knowledge, the development of these algorithmic frameworks, as well as their convergence analysis, appears to be new in the literature. Preliminary numerical results demonstrate the robustness of the aforementioned methods and show that they can be competitive with state-of-the-art RL algorithms.
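
For background, a minimal sketch of the standard tabular policy mirror descent (PMD) step from the prior literature is given below; the paper's contribution is to generalize this type of update to general state and action spaces with function approximation, so the notation here (the policy $\pi_k$, value function $Q^{\pi_k}$, stepsize $\eta_k$, and divergence $D$) is illustrative background rather than the paper's exact formulation.

% A minimal sketch of the standard tabular policy mirror descent (PMD) update
% from prior work; illustrative background only, not the paper's exact rule.
% Symbols: \pi_k is the current policy, Q^{\pi_k} its state-action value
% function, \eta_k > 0 a stepsize, \Delta(\mathcal{A}) the probability simplex
% over actions, and D a Bregman divergence.
\[
  \pi_{k+1}(\cdot \mid s) \;=\;
  \operatorname*{arg\,min}_{p \in \Delta(\mathcal{A})}
  \Big\{ \eta_k \,\big\langle Q^{\pi_k}(s,\cdot),\, p \big\rangle
         + D\big(p,\ \pi_k(\cdot \mid s)\big) \Big\},
  \qquad \forall\, s \in \mathcal{S}.
\]

Choosing $D$ to be the Kullback-Leibler divergence gives a multiplicative, natural-policy-gradient-style update; in the tabular setting this step is carried out state by state, which is exactly what becomes infeasible over general state spaces and motivates the function-approximation schemes studied in the paper.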
