Paper Title

Breaking the Curse of Dimensionality in Multiagent State Space: A Unified Agent Permutation Framework

Paper Authors

Xiaotian Hao, Hangyu Mao, Weixun Wang, Yaodong Yang, Dong Li, Yan Zheng, Zhen Wang, Jianye Hao

Paper Abstract

The state space in Multiagent Reinforcement Learning (MARL) grows exponentially with the agent number. Such a curse of dimensionality results in poor scalability and low sample efficiency, inhibiting MARL for decades. To break this curse, we propose a unified agent permutation framework that exploits the permutation invariance (PI) and permutation equivariance (PE) inductive biases to reduce the multiagent state space. Our insight is that permuting the order of entities in the factored multiagent state space does not change the information. Specifically, we propose two novel implementations: a Dynamic Permutation Network (DPN) and a Hyper Policy Network (HPN). The core idea is to build separate entity-wise PI input and PE output network modules to connect the entity-factored state space and action space in an end-to-end way. DPN achieves such connections by two separate module selection networks, which consistently assign the same input module to the same input entity (guarantee PI) and assign the same output module to the same entity-related output (guarantee PE). To enhance the representation capability, HPN replaces the module selection networks of DPN with hypernetworks to directly generate the corresponding module weights. Extensive experiments in SMAC, Google Research Football and MPE validate that the proposed methods significantly boost the performance and the learning efficiency of existing MARL algorithms. Remarkably, in SMAC, we achieve 100% win rates in almost all hard and super-hard scenarios (never achieved before).
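To make the PI/PE construction concrete, below is a minimal PyTorch sketch of the hypernetwork idea the abstract attributes to HPN: a shared hypernetwork generates each entity's input weights from that entity's own features (so identical entities always receive identical modules), sum-pooling over entities yields a permutation-invariant state embedding, and entity-conditioned output heads yield permutation-equivariant logits. This is an illustrative sketch under those assumptions, not the authors' released code; the name HPNBlock and all dimensions are hypothetical.

```python
# Minimal sketch of the HPN idea described in the abstract (illustrative,
# not the authors' reference implementation).
import torch
import torch.nn as nn

class HPNBlock(nn.Module):
    def __init__(self, entity_dim: int, hidden_dim: int, n_actions: int):
        super().__init__()
        self.entity_dim = entity_dim
        self.hidden_dim = hidden_dim
        self.n_actions = n_actions
        # Hypernetwork that generates a per-entity input weight matrix,
        # conditioned on that entity's own features.
        self.hyper_w_in = nn.Linear(entity_dim, entity_dim * hidden_dim)
        # Hypernetwork that generates a per-entity output head for the
        # entity-related action logits.
        self.hyper_w_out = nn.Linear(entity_dim, hidden_dim * n_actions)

    def forward(self, entities):
        # entities: (batch, n_entities, entity_dim)
        b, n, d = entities.shape
        # Same entity features -> same generated weights, regardless of
        # where the entity sits in the input order.
        w_in = self.hyper_w_in(entities).view(b, n, d, self.hidden_dim)
        h_i = torch.einsum('bnd,bndh->bnh', entities, w_in)
        # Sum-pooling over entities is order-independent: the state
        # embedding h is permutation-invariant (PI).
        h = h_i.sum(dim=1)  # (batch, hidden_dim)
        # Entity-wise logits computed from the PI embedding with
        # entity-conditioned heads: permuting the input entities permutes
        # these logits the same way (permutation-equivariant, PE).
        w_out = self.hyper_w_out(entities).view(b, n, self.hidden_dim, self.n_actions)
        entity_logits = torch.einsum('bh,bnha->bna', h, w_out)
        return h, entity_logits

# Sanity check: shuffling the entity order leaves h unchanged and
# permutes the entity-wise logits identically.
block = HPNBlock(entity_dim=8, hidden_dim=32, n_actions=4)
x = torch.randn(2, 5, 8)
perm = torch.randperm(5)
h1, logits1 = block(x)
h2, logits2 = block(x[:, perm])
assert torch.allclose(h1, h2, atol=1e-5)
assert torch.allclose(logits1[:, perm], logits2, atol=1e-5)
```

Sum-pooling is just one simple PI aggregator used here for brevity; the paper's actual DPN and HPN architectures may differ in their pooling choice and module structure.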
