Paper Title

AlphaSnake: Policy Iteration on a Nondeterministic NP-hard Markov Decision Process

Paper Authors

Kevin Du, Ian Gemp, Yi Wu, Yingying Wu

Paper Abstract

Reinforcement learning has recently been used to approach well-known NP-hard combinatorial problems in graph theory. Among these problems, the Hamiltonian cycle problem is exceptionally difficult to analyze, even when restricted to individual instances of structurally complex graphs. In this paper, we use Monte Carlo Tree Search (MCTS), the search algorithm behind many state-of-the-art reinforcement learning algorithms such as AlphaZero, to create autonomous agents that learn to play the game of Snake, a game centered on properties of Hamiltonian cycles on grid graphs. The game of Snake can be formulated as a single-player discounted Markov Decision Process (MDP) in which the agent must behave optimally in a stochastic environment. Determining the optimal policy for Snake, defined as the policy that maximizes the probability of winning (the win rate) with higher priority and minimizes the expected number of time steps to win with lower priority, is conjectured to be NP-hard. In terms of performance, compared to prior work on the Snake game, our algorithm is the first to achieve a win rate over $0.5$ (a uniform random policy achieves a win rate $< 2.57 \times 10^{-15}$), demonstrating the versatility of AlphaZero in approaching NP-hard environments.
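
To make the MDP formulation concrete, below is a minimal, self-contained sketch of Snake as a single-player discounted MDP with a simple rollout-based agent. Everything in it is our illustration, not the paper's implementation: the $4 \times 4$ grid, the reward of $1$ only on a win, and the flat Monte Carlo action selection (a much weaker stand-in for the MCTS/AlphaZero machinery) are all assumptions. Note that with reward only on a win and a discount $\gamma$ close to $1$, the return $\gamma^T \cdot \mathbf{1}\{\text{win}\}$ rewards winning first and winning quickly second, which roughly matches the lexicographic objective described in the abstract.

```python
import random

# Illustrative sketch of Snake as a single-player discounted MDP.
# (Grid size, reward scheme, and agent are our assumptions, not the paper's.)
# State: snake body as a list of cells (head first) plus the fruit cell.
# Stochasticity: after the snake eats, the next fruit spawns uniformly at
# random on a free cell.

N = 4  # grid side length; the agent wins by filling all N*N cells
MOVES = {0: (0, 1), 1: (0, -1), 2: (1, 0), 3: (-1, 0)}

def step(body, fruit, action, rng):
    """One MDP transition. Returns (body', fruit', reward, done)."""
    dx, dy = MOVES[action]
    head = (body[0][0] + dx, body[0][1] + dy)
    # Crashing into a wall or the body (minus the vacating tail) ends the game.
    if not (0 <= head[0] < N and 0 <= head[1] < N) or head in body[:-1]:
        return body, fruit, 0.0, True
    if head == fruit:
        new_body = [head] + body            # eating: the snake grows
        if len(new_body) == N * N:          # board full: the only rewarded outcome
            return new_body, None, 1.0, True
        free = [(x, y) for x in range(N) for y in range(N)
                if (x, y) not in new_body]
        return new_body, rng.choice(free), 0.0, False  # stochastic fruit spawn
    return [head] + body[:-1], fruit, 0.0, False       # ordinary move

def rollout(body, fruit, rng, gamma=0.99, max_steps=200):
    """Uniform-random play; returns the discounted return gamma^T * 1{win}."""
    disc = 1.0
    for _ in range(max_steps):
        body, fruit, r, done = step(body, fruit, rng.randrange(4), rng)
        disc *= gamma
        if done:
            return disc * r
    return 0.0

def mc_action(body, fruit, rng, sims=30, gamma=0.99):
    """Flat Monte Carlo: pick the action whose random rollouts score best."""
    best_a, best_v = 0, -1.0
    for a in range(4):
        total = 0.0
        for _ in range(sims):
            b2, f2, r, done = step(body, fruit, a, rng)
            total += r if done else gamma * rollout(b2, f2, rng, gamma)
        if total / sims > best_v:
            best_a, best_v = a, total / sims
    return best_a

if __name__ == "__main__":
    rng = random.Random(0)
    body, fruit = [(0, 0)], (2, 2)
    r, done, t = 0.0, False, 0
    while not done and t < 400:
        body, fruit, r, done = step(body, fruit, mc_action(body, fruit, rng), rng)
        t += 1
    print(f"{'win' if r > 0 else 'loss/timeout'} after {t} steps, length {len(body)}")
```

The paper's agent replaces these blind random rollouts with neural-network-guided tree search in the style of AlphaZero; the sketch above only pins down the environment dynamics and the discounted objective.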
