Paper Title
Offline Meta-Reinforcement Learning with Advantage Weighting
Paper Authors
Paper Abstract
This paper introduces the offline meta-reinforcement learning (offline meta-RL) problem setting and proposes an algorithm that performs well in this setting. Offline meta-RL is analogous to the widely successful supervised learning strategy of pre-training a model on a large batch of fixed, pre-collected data (possibly from various tasks) and fine-tuning the model to a new task with relatively little data. That is, in offline meta-RL, we meta-train on fixed, pre-collected data from several tasks in order to adapt to a new task with a very small amount (fewer than 5 trajectories) of data from the new task. By nature of being offline, algorithms for offline meta-RL can utilize the largest possible pool of training data available and eliminate potentially unsafe or costly data collection during meta-training. This setting inherits the challenges of offline RL, but it differs significantly because offline RL does not generally consider a) transfer to new tasks or b) limited data from the test task, both of which we face in offline meta-RL. Targeting the offline meta-RL setting, we propose Meta-Actor Critic with Advantage Weighting (MACAW), an optimization-based meta-learning algorithm that uses simple, supervised regression objectives for both the inner and outer loop of meta-training. On offline variants of common meta-RL benchmarks, we empirically find that this approach enables fully offline meta-reinforcement learning and achieves notable gains over prior methods.
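For readers unfamiliar with the "simple, supervised regression objectives" the abstract refers to, the sketch below illustrates one plausible instantiation based on advantage-weighted regression (AWR): a weighted maximum-likelihood policy loss plus a plain value-regression loss. This is a hedged illustration, not MACAW's exact objectives; the function names, the `temperature` hyperparameter, the clamp value, and the assumed `policy`/`value_fn` interfaces are all illustrative assumptions, and the paper itself defines the actual inner- and outer-loop losses.

```python
# Minimal sketch (NOT the paper's exact losses): AWR-style supervised objectives
# of the kind the abstract describes. Assumptions: `policy(states)` returns a
# torch.distributions object over a factorized continuous action space, and
# `value_fn(states)` returns a tensor of shape (batch,) estimating V(s).
import torch


def awr_policy_loss(policy, value_fn, states, actions, returns, temperature=1.0):
    """Supervised policy update: negative log-likelihood of the dataset's actions,
    weighted by exponentiated advantages estimated from Monte Carlo returns."""
    with torch.no_grad():
        advantages = returns - value_fn(states)        # A(s, a) ~ R - V(s)
        weights = torch.exp(advantages / temperature)  # advantage weighting
        weights = torch.clamp(weights, max=20.0)       # common stabilization trick
    # Sum log-probs over action dimensions (assumes a factorized continuous policy).
    log_probs = policy(states).log_prob(actions).sum(-1)
    return -(weights * log_probs).mean()


def value_regression_loss(value_fn, states, returns):
    """Supervised value update: plain regression of V(s) onto observed returns."""
    return ((value_fn(states) - returns) ** 2).mean()
```

In an optimization-based (MAML-style) meta-learner such as the one the abstract describes, losses of this supervised form can serve both the inner-loop adaptation step and the outer-loop meta-update.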