Paper Title

Efficient Policy Space Response Oracles

Paper Authors

Ming Zhou, Jingxiao Chen, Ying Wen, Weinan Zhang, Yaodong Yang, Yong Yu, Jun Wang

Paper Abstract

Policy Space Response Oracle methods (PSRO) provide a general solution to learn Nash equilibria in two-player zero-sum games but suffer from two drawbacks: (1) computational inefficiency due to the need for consistent meta-game evaluation via simulations, and (2) exploration inefficiency due to finding the best response against a fixed meta-strategy at every epoch. In this work, we propose Efficient PSRO (EPSRO), which largely improves the efficiency of the above two steps. Central to our development is the newly introduced subroutine of no-regret optimization on the unrestricted-restricted (URR) game. By solving URR at each epoch, one can evaluate the current game and compute the best response in one forward pass without the need for meta-game simulations. Theoretically, we prove that the solution procedure of EPSRO offers a monotonic improvement on exploitability, which none of the existing PSRO methods possesses. Furthermore, we prove that the no-regret optimization has a regret bound of $\mathcal{O}(\sqrt{T\log{[(k^2+k)/2]}})$, where $k$ is the size of the restricted policy set. Most importantly, a desirable property of EPSRO is that it is parallelizable, which allows for highly efficient exploration in the policy space that induces behavioral diversity. We test EPSRO on three classes of games, and report a 50x speedup in wall-time and 10x data efficiency while maintaining exploitability similar to existing PSRO methods on Kuhn and Leduc Poker games.
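To make the abstract's terminology concrete, below is a minimal, self-contained sketch of the vanilla PSRO-style loop that EPSRO builds on: a no-regret (Hedge / multiplicative-weights) meta-solver computes a meta-strategy over the current restricted policy sets of a zero-sum matrix game, and a best-response oracle expands those sets. This is not the authors' EPSRO/URR implementation; the function names (`hedge_meta_strategy`, `psro_loop`), the exact-argmax best-response oracle, and the rock-paper-scissors payoff matrix are illustrative assumptions.

```python
import numpy as np

def hedge_meta_strategy(payoffs, iters=2000):
    """No-regret (Hedge) meta-solver on a restricted zero-sum matrix game.
    Returns the average strategies of the row (maximizer) and column (minimizer) players."""
    k, m = payoffs.shape
    lr_row = np.sqrt(np.log(max(k, 2)) / iters)
    lr_col = np.sqrt(np.log(max(m, 2)) / iters)
    row, col = np.ones(k) / k, np.ones(m) / m
    row_avg, col_avg = np.zeros(k), np.zeros(m)
    for _ in range(iters):
        # Row player's gains are payoffs @ col; the column player sees the negated payoffs.
        row *= np.exp(lr_row * (payoffs @ col)); row /= row.sum()
        col *= np.exp(-lr_col * (payoffs.T @ row)); col /= col.sum()
        row_avg += row; col_avg += col
    return row_avg / iters, col_avg / iters

def psro_loop(base_payoff, epochs=5):
    """PSRO-style loop on a matrix game: restricted policy sets are index subsets of the
    full action space, the meta-game is the corresponding payoff sub-matrix, and the
    'best-response oracle' is an exact argmax against the opponent's meta-strategy."""
    n = base_payoff.shape[0]
    pop_row, pop_col = [0], [0]  # start each player with a single policy
    for _ in range(epochs):
        sub = base_payoff[np.ix_(pop_row, pop_col)]
        meta_row, meta_col = hedge_meta_strategy(sub)
        # Lift the meta-strategies back to the full action space.
        row_mix, col_mix = np.zeros(n), np.zeros(n)
        row_mix[pop_row], col_mix[pop_col] = meta_row, meta_col
        br_row = int(np.argmax(base_payoff @ col_mix))   # best response of the maximizer
        br_col = int(np.argmin(row_mix @ base_payoff))   # best response of the minimizer
        if br_row not in pop_row: pop_row.append(br_row)
        if br_col not in pop_col: pop_col.append(br_col)
    return pop_row, pop_col

if __name__ == "__main__":
    rps = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])  # rock-paper-scissors
    print(psro_loop(rps))  # both populations grow to cover all three actions
```

EPSRO's contribution, per the abstract, is to replace the simulation-based meta-game evaluation and the fixed-meta-strategy best-response step above with a single no-regret optimization over the unrestricted-restricted (URR) game, which is what yields the claimed monotonic exploitability improvement and parallelizability.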
