Paper title

Bandit approach to conflict-free multi-agent Q-learning in view of photonic implementation

Authors

Shinkawa, Hiroaki; Chauvet, Nicolas; Röhm, André; Mihana, Takatomo; Horisaki, Ryoichi; Bachelier, Guillaume; Naruse, Makoto

Abstract

Photonic reinforcement learning, which exploits the physical nature of light to accelerate computation, has recently been studied extensively. Previous studies used quantum interference of photons to achieve collective decision-making without choice conflicts when solving the competitive multi-armed bandit problem, a fundamental example of reinforcement learning. However, the bandit problem deals with a static environment in which the agents' actions do not influence the reward probabilities. This study aims to extend the conventional approach to more general multi-agent reinforcement learning, targeting the grid world problem. Unlike the conventional approach, the proposed scheme deals with a dynamic environment in which the reward changes because of the agents' actions. A successful photonic reinforcement learning scheme requires both a photonic system that contributes to the quality of learning and a suitable algorithm. In view of a potential photonic implementation, this study proposes a novel learning algorithm, discontinuous bandit Q-learning. Here, state-action pairs in the environment are regarded as slot machines in the context of the bandit problem, and the amount by which a Q-value is updated is regarded as the reward of the bandit problem. We perform numerical simulations to validate the effectiveness of the bandit algorithm. In addition, we propose a multi-agent architecture in which agents are indirectly connected through quantum interference of light, and quantum principles ensure that the agents' selections of state-action pairs are conflict-free. We demonstrate that multi-agent reinforcement learning can be accelerated owing to conflict avoidance among multiple agents.
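The idea of treating state-action pairs as slot machines can be illustrated with a minimal classical sketch. The abstract does not give the algorithm's details, so everything below is an assumption made for illustration: the 4x4 deterministic grid, the constants, the epsilon-greedy arm selection, the exponential average of update amounts, and the function names `step`, `select_arms` and `train`. Sampling distinct arms without replacement is only a classical stand-in for the conflict-free selection that the paper attributes to quantum interference; no photonic effect is simulated here.

```python
import numpy as np

# All constants are illustrative assumptions, not values from the paper.
N = 4                                          # grid side length
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
GOAL = (N - 1, N - 1)                          # terminal state
ALPHA, GAMMA, EPS, BETA = 0.5, 0.9, 0.2, 0.5


def step(state, action):
    """Deterministic grid-world move; reward 1 when the goal is reached."""
    r, c = state
    dr, dc = ACTIONS[action]
    nxt = (min(max(r + dr, 0), N - 1), min(max(c + dc, 0), N - 1))
    return nxt, (1.0 if nxt == GOAL else 0.0)


def select_arms(arm_value, n_agents, rng):
    """Pick n_agents distinct arms (flat indices of state-action pairs).

    Each agent chooses epsilon-greedily among the arms not yet taken,
    so no two agents ever select the same state-action pair.
    """
    flat = arm_value.ravel().copy()
    chosen = []
    for _ in range(n_agents):
        available = np.flatnonzero(np.isfinite(flat))
        if rng.random() < EPS:
            idx = rng.choice(available)                  # explore
        else:
            values = flat[available]
            idx = rng.choice(available[values == values.max()])  # exploit
        chosen.append(int(idx))
        flat[idx] = -np.inf                              # taken: block for others
    return chosen


def train(n_agents=3, n_steps=5000, seed=0):
    """Q-learning in which the bandit reward of an arm is |Q-value update|."""
    rng = np.random.default_rng(seed)
    q = np.zeros((N, N, len(ACTIONS)))
    arm_value = np.ones_like(q)       # optimistic start: every arm gets tried
    arm_value[GOAL] = -np.inf         # arms of the terminal state are never chosen
    for _ in range(n_steps):
        for idx in select_arms(arm_value, n_agents, rng):
            r, c, a = (int(v) for v in np.unravel_index(idx, q.shape))
            nxt, reward = step((r, c), a)
            bootstrap = 0.0 if nxt == GOAL else GAMMA * q[nxt].max()
            delta = ALPHA * (reward + bootstrap - q[r, c, a])
            q[r, c, a] += delta
            # Bandit reward = size of the Q update, tracked with a recency-weighted average.
            arm_value[r, c, a] += BETA * (abs(delta) - arm_value[r, c, a])
    return q
```

Calling `train(n_agents=3)` returns a Q-table of shape `(4, 4, 4)`. Because the selected pairs need not lie on one trajectory, updates can jump between unrelated parts of the grid, which is one plausible reading of "discontinuous" in the algorithm's name.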
