Paper Title
Deep Surrogate Q-Learning for Autonomous Driving
Paper Authors
Paper Abstract
Challenging problems for the application of deep reinforcement learning systems on real systems are their adaptivity to changing environments and their efficiency with respect to computational resources and data. In the application of learning lane-change behavior for autonomous driving, agents have to deal with a varying number of surrounding vehicles. Furthermore, the number of required transitions imposes a bottleneck, since test drivers cannot perform an arbitrary number of lane changes in the real world. In the off-policy setting, additional information on solving the task can be gained by observing the actions of other drivers. While in the classical RL setup this knowledge remains unused, we use other drivers as surrogates to learn the agent's value function more efficiently. We propose Surrogate Q-learning, which addresses the aforementioned problems and drastically reduces the required driving time. We further propose an efficient implementation based on a permutation-equivariant deep neural network architecture of the Q-function that estimates action-values for a variable number of vehicles in sensor range. We show that this architecture leads to a novel replay sampling technique, which we call Scene-centric Experience Replay, and evaluate the performance of Surrogate Q-learning and Scene-centric Experience Replay in the open traffic simulator SUMO. Additionally, we show that our methods enhance the real-world applicability of RL systems by learning policies on the real highD dataset.
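To make the architectural idea concrete, the following is a minimal sketch of a permutation-equivariant Q-network over a variable number of surrounding vehicles, in the spirit of Deep Sets. It is not the authors' implementation; the class and parameter names (VehicleSetQNetwork, feat_dim, n_actions) and the sum-pooling scheme are assumptions chosen purely for illustration.

# Hypothetical sketch (PyTorch): a permutation-equivariant Q-network that maps a
# variable-size set of per-vehicle features to action-values for each vehicle.
import torch
import torch.nn as nn

class VehicleSetQNetwork(nn.Module):
    def __init__(self, feat_dim: int, hidden: int = 64, n_actions: int = 3):
        super().__init__()
        # phi encodes each vehicle independently with shared weights, so the
        # per-vehicle encodings are equivariant to the ordering of vehicles.
        self.phi = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())
        # rho maps the permutation-invariant pooled scene encoding, concatenated
        # with each per-vehicle encoding, to action-values for that vehicle.
        self.rho = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, vehicles: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # vehicles: (batch, max_vehicles, feat_dim), zero-padded to a fixed size
        # mask:     (batch, max_vehicles), 1.0 for real vehicles, 0.0 for padding
        enc = self.phi(vehicles) * mask.unsqueeze(-1)       # per-vehicle codes
        pooled = enc.sum(dim=1, keepdim=True)                # invariant scene code
        pooled = pooled.expand(-1, vehicles.size(1), -1)     # broadcast to each vehicle
        q = self.rho(torch.cat([enc, pooled], dim=-1))       # (batch, max_vehicles, n_actions)
        return q                                             # action-values per vehicle

Because such a network outputs action-values for every vehicle in sensor range, a single recorded scene can supply one training target per observed driver; a replay buffer that stores and samples whole scenes rather than single ego transitions reflects the intuition behind the Scene-centric Experience Replay described in the abstract.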