Title

Adversarially Trained Actor Critic for Offline Reinforcement Learning

Authors

Ching-An Cheng, Tengyang Xie, Nan Jiang, Alekh Agarwal

Abstract

We propose Adversarially Trained Actor Critic (ATAC), a new model-free algorithm for offline reinforcement learning (RL) under insufficient data coverage, based on the concept of relative pessimism. ATAC is designed as a two-player Stackelberg game: A policy actor competes against an adversarially trained value critic, who finds data-consistent scenarios where the actor is inferior to the data-collection behavior policy. We prove that, when the actor attains no regret in the two-player game, running ATAC produces a policy that provably 1) outperforms the behavior policy over a wide range of hyperparameters that control the degree of pessimism, and 2) competes with the best policy covered by data with appropriately chosen hyperparameters. Compared with existing works, notably our framework offers both theoretical guarantees for general function approximation and a deep RL implementation scalable to complex environments and large datasets. In the D4RL benchmark, ATAC consistently outperforms state-of-the-art offline RL algorithms on a range of continuous control tasks.
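To make the relative-pessimism idea concrete, the two-player Stackelberg game described above can be sketched as the following objective. This is an illustrative formulation inferred from the abstract, not the paper's exact equations; the notation is assumed here: $\mathcal{D}$ is the offline dataset collected by the behavior policy, $\Pi$ and $\mathcal{F}$ are the policy and critic classes, and $\beta \ge 0$ is the hyperparameter that controls the degree of pessimism.

```latex
% Illustrative sketch (assumed notation, not the paper's exact equations).
% The actor maximizes its estimated advantage over the behavior policy under an
% adversarially chosen critic; the critic minimizes that advantage while being
% kept (approximately) consistent with the data via a Bellman-error penalty.
\[
\hat{\pi} \;\in\; \operatorname*{arg\,max}_{\pi \in \Pi}\; \mathcal{L}_{\mathcal{D}}(\pi, f^{\pi}),
\qquad
f^{\pi} \;\in\; \operatorname*{arg\,min}_{f \in \mathcal{F}}\;
\mathcal{L}_{\mathcal{D}}(\pi, f) \;+\; \beta\, \mathcal{E}_{\mathcal{D}}(\pi, f),
\]
\[
\text{where}\quad
\mathcal{L}_{\mathcal{D}}(\pi, f) \;=\;
\mathbb{E}_{(s,a) \sim \mathcal{D}}\!\big[\, f(s, \pi) - f(s, a) \,\big].
\]
% f(s, \pi) abbreviates E_{a' ~ \pi(.|s)}[ f(s, a') ], and E_D(pi, f) denotes a
% Bellman-consistency error of the critic f on the data, whose weight beta
% tunes how pessimistic the adversarial critic is allowed to be.
```

Under this sketch, the robust policy-improvement claim has a simple reading: when $\pi$ equals the behavior policy, the relative term $\mathcal{L}_{\mathcal{D}}(\pi, f)$ is zero for every critic, so any policy the actor prefers in the worst data-consistent case is, by construction, no worse than the behavior policy.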
