Paper Title
A Note on Target Q-learning For Solving Finite MDPs with A Generative Oracle
Paper Authors
Paper Abstract
Q-learning with function approximation can diverge in the off-policy setting, and the target network is a powerful technique to address this issue. In this manuscript, we examine the sample complexity of the associated target Q-learning algorithm in the tabular case with a generative oracle. We point out a misleading claim in [Lee and He, 2020] and establish a tight analysis. In particular, we demonstrate that the sample complexity of the target Q-learning algorithm in [Lee and He, 2020] is $\widetilde{\mathcal O}(|\mathcal S|^2|\mathcal A|^2 (1-\gamma)^{-5}\varepsilon^{-2})$. Furthermore, we show that this sample complexity improves to $\widetilde{\mathcal O}(|\mathcal S||\mathcal A| (1-\gamma)^{-5}\varepsilon^{-2})$ if we can sequentially update all state-action pairs, and to $\widetilde{\mathcal O}(|\mathcal S||\mathcal A| (1-\gamma)^{-4}\varepsilon^{-2})$ if, in addition, $\gamma \in (1/2, 1)$. Compared with vanilla Q-learning, our results show that the introduction of a periodically-frozen target Q-function does not sacrifice the sample complexity.
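For intuition, the sketch below shows tabular Q-learning with a periodically-frozen target Q-table and a generative oracle. The helper `sample_next_state`, the known-reward array, and the synchronous sweep over all state-action pairs are illustrative assumptions, not the exact algorithm analyzed in [Lee and He, 2020].

```python
import numpy as np

def target_q_learning(sample_next_state, reward, num_states, num_actions,
                      gamma=0.9, step_size=0.1, num_sweeps=10_000,
                      target_period=100):
    """Illustrative tabular target Q-learning with a generative oracle.

    `sample_next_state(s, a)` is assumed to be a generative oracle that
    returns an integer next state s' ~ P(. | s, a); `reward[s, a]` is the
    (assumed known) expected reward. These are sketch-level assumptions.
    """
    q = np.zeros((num_states, num_actions))   # online Q-table
    q_target = q.copy()                       # frozen copy used in the bootstrap

    for t in range(num_sweeps):
        # Synchronously update every state-action pair with one oracle sample each,
        # bootstrapping from the frozen target Q-table rather than the online one.
        for s in range(num_states):
            for a in range(num_actions):
                s_next = sample_next_state(s, a)
                td_target = reward[s, a] + gamma * q_target[s_next].max()
                q[s, a] += step_size * (td_target - q[s, a])
        # Refresh the frozen target Q-table every `target_period` sweeps.
        if (t + 1) % target_period == 0:
            q_target = q.copy()
    return q
```

The only difference from vanilla synchronous Q-learning is that the bootstrap term uses `q_target`, which is held fixed between periodic refreshes; the sample-complexity statements in the abstract concern how many oracle calls such a scheme needs to reach an $\varepsilon$-accurate estimate of $Q^*$.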