Paper Title

Contextual Transformer for Offline Meta Reinforcement Learning

Authors

Runji Lin, Ye Li, Xidong Feng, Zhaowei Zhang, Xian Hong Wu Fung, Haifeng Zhang, Jun Wang, Yali Du, Yaodong Yang

Abstract

The pretrain-finetuning paradigm in large-scale sequence models has made significant progress in natural language processing and computer vision tasks. However, such a paradigm is still hindered by several challenges in Reinforcement Learning (RL), including the lack of self-supervised pretraining algorithms based on offline data and efficient fine-tuning/prompt-tuning over unseen downstream tasks. In this work, we explore how prompts can improve sequence modeling-based offline reinforcement learning (offline-RL) algorithms. Firstly, we propose prompt tuning for offline RL, where a context vector sequence is concatenated with the input to guide the conditional policy generation. As such, we can pretrain a model on the offline dataset with a self-supervised loss and learn a prompt to guide the policy towards desired actions. Secondly, we extend our framework to Meta-RL settings and propose the Contextual Meta Transformer (CMT); CMT leverages the context among different tasks as the prompt to improve generalization on unseen tasks. We conduct extensive experiments across three different offline-RL settings: offline single-agent RL on the D4RL dataset, offline Meta-RL on the MuJoCo benchmark, and offline MARL on the SMAC benchmark. Superior results validate the strong performance and generality of our methods.
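To make the prompt-tuning idea in the abstract concrete, below is a minimal sketch (not the authors' released code) of how a learnable context-vector sequence can be concatenated with the input of a Decision-Transformer-style sequence policy and then optimized while the pretrained backbone stays frozen. The class name PromptedSequencePolicy, all dimensions, and the use of torch.nn.TransformerEncoder as the backbone are illustrative assumptions.

# Minimal sketch of prompt tuning for a sequence-modeling policy (assumptions noted above).
import torch
import torch.nn as nn

class PromptedSequencePolicy(nn.Module):
    def __init__(self, token_dim=64, prompt_len=5, n_layer=3, n_head=4, act_dim=6):
        super().__init__()
        # Learnable context vectors ("prompt") prepended to the trajectory tokens.
        self.prompt = nn.Parameter(torch.randn(1, prompt_len, token_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=token_dim, nhead=n_head, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layer)
        self.action_head = nn.Linear(token_dim, act_dim)

    def forward(self, traj_tokens):
        # traj_tokens: (batch, seq_len, token_dim) embedded trajectory tokens.
        batch = traj_tokens.shape[0]
        prompt = self.prompt.expand(batch, -1, -1)
        x = torch.cat([prompt, traj_tokens], dim=1)      # prompt-conditioned input sequence
        h = self.backbone(x)
        # Predict actions only for the trajectory positions (drop the prompt positions).
        return self.action_head(h[:, prompt.shape[1]:])

# Prompt tuning: freeze the pretrained backbone, optimize only the prompt vectors.
policy = PromptedSequencePolicy()
for p in policy.backbone.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.Adam([policy.prompt], lr=1e-3)

actions = policy(torch.randn(2, 10, 64))                 # (2, 10, 6) predicted action sequence

In the Meta-RL extension described in the abstract, the learnable prompt above would be replaced by context tokens drawn from trajectories of the target task, so the same frozen backbone can be conditioned on unseen tasks.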
