赌徒的问题及以后

论文标题

赌徒的问题及以后

The Gambler's Problem and Beyond

论文作者

Wang, Baoxiang, Li, Shuai, Li, Jiajin, Chan, Siu On

论文摘要

我们分析了赌徒的问题，这是一个简单的加强学习问题，赌徒有机会在达到目标之前将赌注翻倍或失去赌注。这是Sutton和Barto（2018）在强化学习教科书中介绍的早期示例，在那里他们提到了具有高频组件的最佳价值功能的有趣模式，并重复了非平滑点。但是，它是没有进一步调查的。我们为离散和连续情况的最佳值函数提供了确切的公式。虽然看起来很简单，但值函数是病理性的：分形，自相似，衍生物以零或无穷大，而不是作为基本功能写的。实际上，它是广义的Cantor函数之一，它具有到目前为止未知的复杂性。我们的分析可以为改善价值函数近似，基于梯度的算法和Q学习的见解提供见解。

We analyze the Gambler's problem, a simple reinforcement learning problem where the gambler has the chance to double or lose the bets until the target is reached. This is an early example introduced in the reinforcement learning textbook by Sutton and Barto (2018), where they mention an interesting pattern of the optimal value function with high-frequency components and repeating non-smooth points. It is however without further investigation. We provide the exact formula for the optimal value function for both the discrete and the continuous cases. Though simple as it might seem, the value function is pathological: fractal, self-similar, derivative taking either zero or infinity, and not written as elementary functions. It is in fact one of the generalized Cantor functions, where it holds a complexity that has been uncharted thus far. Our analyses could provide insights into improving value function approximation, gradient-based algorithms, and Q-learning, in real applications and implementations.

下载PDF全文

下载文献需遵守相关版权规定

论文标题