Paper Title


Regret Bounds for Decentralized Learning in Cooperative Multi-Agent Dynamical Systems

Paper Authors

Seyed Mohammad Asghari, Yi Ouyang, Ashutosh Nayyar

Paper Abstract


Regret analysis is challenging in Multi-Agent Reinforcement Learning (MARL), primarily due to the dynamic environments and the decentralized information among agents. We attempt to address this challenge in the context of decentralized learning in multi-agent linear-quadratic (LQ) dynamical systems. We begin with a simple setup consisting of two agents and two dynamically decoupled stochastic linear systems, each controlled by one of the agents. The systems are coupled through a quadratic cost function. When the dynamics of both systems are unknown and there is no communication between the agents, we show that no learning policy can achieve regret that is sub-linear in $T$, where $T$ is the time horizon. When only one system's dynamics are unknown and there is one-directional communication from the agent controlling the unknown system to the other agent, we propose a MARL algorithm based on the construction of an auxiliary single-agent LQ problem. This auxiliary single-agent problem serves as an implicit coordination mechanism between the two learning agents, allowing them to achieve a regret within $O(\sqrt{T})$ of the regret of the auxiliary single-agent problem. Consequently, using existing results on single-agent LQ regret, our algorithm yields an $\tilde{O}(\sqrt{T})$ regret bound, where $\tilde{O}(\cdot)$ hides constants and logarithmic factors. Our numerical experiments indicate that this bound is matched in practice. From the two-agent problem, we extend our results to multi-agent LQ systems with certain communication patterns.
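To make the notion of regret concrete, the following is a standard formalization for the single-agent LQ setting; the notation here ($\theta_*$, $J(\cdot)$, $c_t^{\pi}$) is an assumption for illustration, and the paper's exact definition may differ:

```latex
% Single-agent LQ system with unknown parameter \theta_* = [A, B]:
%   x_{t+1} = A x_t + B u_t + w_t,
% with per-step cost
%   c_t = x_t^\top Q x_t + u_t^\top R u_t.
% Let J(\theta_*) denote the optimal long-run average cost achievable
% when the dynamics are known. The regret of a learning policy \pi over
% horizon T is the excess accumulated cost relative to this benchmark:
R(T) \;=\; \sum_{t=0}^{T-1} \bigl( c_t^{\pi} - J(\theta_*) \bigr).
% A sub-linear bound such as \tilde{O}(\sqrt{T}) then means the average
% per-step cost of \pi converges to the optimal average cost as T grows.
```

Under this reading, the impossibility result in the abstract says $R(T)$ must grow linearly in $T$ without communication, while the proposed algorithm keeps the multi-agent regret within $O(\sqrt{T})$ of the auxiliary single-agent problem's regret.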
