Paper Title

Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation

Authors

Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, Rahul Jain

Abstract

We develop several new algorithms for learning Markov Decision Processes in an infinite-horizon average-reward setting with linear function approximation. Using the optimism principle and assuming that the MDP has a linear structure, we first propose a computationally inefficient algorithm with optimal $\widetilde{O}(\sqrt{T})$ regret and another computationally efficient variant with $\widetilde{O}(T^{3/4})$ regret, where $T$ is the number of interactions. Next, taking inspiration from adversarial linear bandits, we develop yet another efficient algorithm with $\widetilde{O}(\sqrt{T})$ regret under a different set of assumptions, improving the best existing result by Hao et al. (2020) with $\widetilde{O}(T^{2/3})$ regret. Moreover, we draw a connection between this algorithm and the Natural Policy Gradient algorithm proposed by Kakade (2002), and show that our analysis improves the sample complexity bound recently given by Agarwal et al. (2020).
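For context, the regret notion in this setting is the standard one for infinite-horizon average-reward MDPs; the sketch below states it together with a common form of the linear structure assumption (the notation here is illustrative and may differ from the paper's exact definitions):

$$ R_T \;=\; \sum_{t=1}^{T}\bigl(J^{*} - r(s_t, a_t)\bigr), \qquad J^{*} \;=\; \max_{\pi}\,\lim_{T\to\infty}\frac{1}{T}\,\mathbb{E}\!\left[\sum_{t=1}^{T} r(s_t, a_t)\,\middle|\,\pi\right]. $$

Under the usual linear MDP assumption, there is a known feature map $\phi(s,a)\in\mathbb{R}^{d}$ such that $r(s,a)=\langle\phi(s,a),\theta\rangle$ and $P(s'\mid s,a)=\langle\phi(s,a),\mu(s')\rangle$ for some unknown $\theta\in\mathbb{R}^{d}$ and measures $\mu(\cdot)$. An $\widetilde{O}(\sqrt{T})$ regret bound then means the average reward collected approaches $J^{*}$ at rate $\widetilde{O}(1/\sqrt{T})$.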
