Paper Title


Regret Bounds for Decentralized Learning in Cooperative Multi-Agent Dynamical Systems

Paper Authors

Seyed Mohammad Asghari, Yi Ouyang, Ashutosh Nayyar

Paper Abstract


Regret analysis is challenging in Multi-Agent Reinforcement Learning (MARL), primarily due to the dynamic environments and the decentralized information among agents. We attempt to address this challenge in the context of decentralized learning in multi-agent linear-quadratic (LQ) dynamical systems. We begin with a simple setup consisting of two agents and two dynamically decoupled stochastic linear systems, each controlled by one of the agents. The systems are coupled through a quadratic cost function. When the dynamics of both systems are unknown and there is no communication between the agents, we show that no learning policy can achieve regret that is sub-linear in $T$, where $T$ is the time horizon. When only one system's dynamics are unknown and there is one-directional communication from the agent controlling the unknown system to the other agent, we propose a MARL algorithm based on the construction of an auxiliary single-agent LQ problem. This auxiliary single-agent problem serves as an implicit coordination mechanism between the two learning agents, allowing them to achieve a regret within $O(\sqrt{T})$ of the regret of the auxiliary single-agent problem. Consequently, using existing results on single-agent LQ regret, our algorithm yields an $\tilde{O}(\sqrt{T})$ regret bound, where $\tilde{O}(\cdot)$ hides constants and logarithmic factors. Our numerical experiments indicate that this bound is matched in practice. From the two-agent problem, we extend our results to multi-agent LQ systems with certain communication patterns.
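To make the notion of regret concrete, the following is a standard formalization for the single-agent LQ setting; the notation here ($\theta_*$, $J(\cdot)$, $c_t^{\pi}$) is an assumption for illustration, and the paper's exact definition may differ:

```latex
% Single-agent LQ system with unknown parameter \theta_* = [A, B]:
%   x_{t+1} = A x_t + B u_t + w_t,
% with per-step cost
%   c_t = x_t^\top Q x_t + u_t^\top R u_t.
% Let J(\theta_*) denote the optimal long-run average cost achievable
% when the dynamics are known. The regret of a learning policy \pi over
% horizon T is the excess accumulated cost relative to this benchmark:
R(T) \;=\; \sum_{t=0}^{T-1} \bigl( c_t^{\pi} - J(\theta_*) \bigr).
% A sub-linear bound such as \tilde{O}(\sqrt{T}) then means the average
% per-step cost of \pi converges to the optimal average cost as T grows.
```

Under this reading, the impossibility result in the abstract says $R(T)$ must grow linearly in $T$ without communication, while the proposed algorithm keeps the multi-agent regret within $O(\sqrt{T})$ of the auxiliary single-agent problem's regret.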
