Paper Title
Tightening Exploration in Upper Confidence Reinforcement Learning
Paper Authors
Paper Abstract
The upper confidence reinforcement learning (UCRL2) algorithm introduced in (Jaksch et al., 2010) is a popular method to perform regret minimization in unknown discrete Markov Decision Processes under the average-reward criterion. Despite its nice and generic theoretical regret guarantees, this algorithm and its variants have so far remained mostly theoretical, as numerical experiments in simple environments exhibit long burn-in phases before learning takes place. In pursuit of practical efficiency, we present UCRL3, following the lines of UCRL2 but with two key modifications: First, it uses state-of-the-art time-uniform concentration inequalities to compute confidence sets on the reward and (component-wise) transition distributions for each state-action pair. Furthermore, to tighten exploration, it uses an adaptive computation of the support of each transition distribution, which in turn enables us to revisit the extended value iteration procedure of UCRL2 to optimize over distributions with reduced support by disregarding low-probability transitions, while still ensuring near-optimism. We demonstrate, through numerical experiments in standard environments, that reducing exploration this way yields a substantial numerical improvement compared to UCRL2 and its variants. On the theoretical side, these key modifications enable us to derive a regret bound for UCRL3 improving on UCRL2 that, for the first time, involves notions of local diameter and local effective support, thanks to variance-aware concentration bounds.
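A minimal Python sketch of the two ingredients described above, under stated assumptions: the helper names bernstein_ci and optimistic_transition are hypothetical, and the interval shown is a generic empirical-Bernstein-style bound used purely for illustration, not the paper's time-uniform concentration inequality. It illustrates per-component confidence bounds on transition probabilities and the inner maximization of an extended-value-iteration step over the resulting confidence box intersected with the probability simplex.

import numpy as np

def bernstein_ci(p_hat, n, delta):
    # Per-component confidence interval on a transition probability.
    # Generic empirical-Bernstein-style bound for illustration only;
    # UCRL3 relies on tighter time-uniform inequalities.
    log_term = np.log(3.0 / delta)
    slack = np.sqrt(2.0 * p_hat * (1.0 - p_hat) * log_term / n) + 3.0 * log_term / n
    return np.clip(p_hat - slack, 0.0, 1.0), np.clip(p_hat + slack, 0.0, 1.0)

def optimistic_transition(lower, upper, values):
    # Inner step of extended value iteration: pick the transition vector in the
    # component-wise confidence box (intersected with the simplex) that
    # maximizes the expected value of the next state.
    p = lower.copy()
    remaining = 1.0 - p.sum()
    # Pour the leftover probability mass onto the highest-value states first.
    for i in np.argsort(values)[::-1]:
        add = min(upper[i] - p[i], max(remaining, 0.0))
        p[i] += add
        remaining -= add
        if remaining <= 0.0:
            break
    return p

# Example: 4 next states observed 50 times from one state-action pair.
counts = np.array([30, 15, 4, 1])
p_hat = counts / counts.sum()
lower, upper = bernstein_ci(p_hat, counts.sum(), delta=0.05)
v = np.array([1.0, 0.2, 0.8, 0.0])            # current value estimates
print(optimistic_transition(lower, upper, v))  # optimistic transition vector

Components whose upper confidence bound falls below a small threshold can simply be excluded from the loop, which is the spirit of the reduced-support optimization described in the abstract.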