Paper Title
Efficiently Computing Nash Equilibria in Adversarial Team Markov Games
Paper Authors
Paper Abstract
Computing Nash equilibrium policies is a central problem in multi-agent reinforcement learning that has received extensive attention both in theory and in practice. However, provable guarantees have thus far either been limited to fully competitive or cooperative scenarios, or have imposed strong assumptions that are difficult to meet in most practical applications. In this work, we depart from those prior results by investigating infinite-horizon \emph{adversarial team Markov games}, a natural and well-motivated class of games in which a team of identically-interested players -- in the absence of any explicit coordination or communication -- is competing against an adversarial player. This setting allows for a unifying treatment of zero-sum Markov games and Markov potential games, and serves as a step to model more realistic strategic interactions that feature both competing and cooperative interests. Our main contribution is the first algorithm for computing stationary $ε$-approximate Nash equilibria in adversarial team Markov games with computational complexity that is polynomial in all the natural parameters of the game, as well as in $1/ε$. The proposed algorithm is particularly natural and practical, and it is based on performing independent policy gradient steps for each player in the team, in tandem with best responses from the side of the adversary; in turn, the policy for the adversary is obtained by solving a carefully constructed linear program. Our analysis leverages non-standard techniques to establish the KKT optimality conditions for a nonlinear program with nonconvex constraints, thereby leading to a natural interpretation of the induced Lagrange multipliers. Along the way, we significantly extend an important characterization of optimal policies in adversarial (normal-form) team games due to Von Stengel and Koller (GEB '97).
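To make the algorithmic structure described in the abstract concrete, below is a minimal sketch of the two interacting components: independent projected policy-gradient steps for each team player, interleaved with best responses by the adversary. This is not the authors' implementation; for brevity it uses a one-shot (normal-form) adversarial team game as a stand-in for the Markov setting, so the adversary's best response reduces to a simple argmin rather than the paper's carefully constructed linear program. The payoff tensor `U`, the action counts, the step size `eta`, and the helper `project_simplex` are all illustrative choices introduced here.

```python
# Minimal sketch (assumed setup, not the paper's implementation): two team
# players share a payoff tensor U[a1, a2, b] and play mixed strategies x1, x2
# independently; the adversary picks b to minimize the team's expected payoff.
import numpy as np

rng = np.random.default_rng(0)
n1, n2, m = 3, 3, 3                      # hypothetical action counts
U = rng.uniform(size=(n1, n2, m))        # team payoff; adversary receives -U

def project_simplex(v):
    """Euclidean projection onto the probability simplex (Duchi et al.)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

x1 = np.ones(n1) / n1                    # team player 1's mixed strategy
x2 = np.ones(n2) / n2                    # team player 2's mixed strategy
eta = 0.1                                # step size; tuning here is heuristic

for t in range(2000):
    # Adversary best-responds to the current team strategy. In the Markov
    # setting this step corresponds to solving the linear program.
    payoff_vs_b = np.einsum('i,j,ijb->b', x1, x2, U)
    b = int(np.argmin(payoff_vs_b))
    # Each team player takes an independent projected policy-gradient step
    # on the shared payoff, holding the teammate and adversary fixed.
    g1 = np.einsum('j,ij->i', x2, U[:, :, b])
    g2 = np.einsum('i,ij->j', x1, U[:, :, b])
    x1 = project_simplex(x1 + eta * g1)
    x2 = project_simplex(x2 + eta * g2)

print("team value vs. best-responding adversary:",
      np.einsum('i,j,ijb->b', x1, x2, U).min())
```

Note that this sketch only mirrors the loop structure. In the full Markov setting, each team player's update would use the policy gradient of the discounted value function, and the adversary's equilibrium policy is recovered from the Lagrange multipliers of the linear program, per the analysis outlined in the abstract; the iterates above need not converge on their own.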