Paper Title
Bayes-Adaptive Deep Model-Based Policy Optimisation
Paper Authors
Paper Abstract
We introduce a Bayesian (deep) model-based reinforcement learning method (RoMBRL) that captures model uncertainty to achieve sample-efficient policy optimisation. We propose to formulate the model-based policy optimisation problem as a Bayes-adaptive Markov decision process (BAMDP). RoMBRL maintains model uncertainty via belief distributions over a deep Bayesian neural network whose samples are generated by stochastic gradient Hamiltonian Monte Carlo. Uncertainty is propagated through simulations controlled by sampled models and history-based policies. Since beliefs are encoded in the visited histories, we propose a history-based policy network that is trained end to end, using recurrent Trust-Region Policy Optimisation, to generalise across the history space. We show that RoMBRL outperforms existing approaches on many challenging control benchmark tasks in terms of sample complexity and task performance. The source code of this paper is publicly available at https://github.com/thobotics/RoMBRL.
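As a rough illustration of the two ingredients named in the abstract (posterior sampling of dynamics-model parameters via stochastic gradient Hamiltonian Monte Carlo, and uncertainty propagation through rollouts under the sampled models), the numpy-only sketch below draws SGHMC samples for a toy linear-Gaussian dynamics model and simulates a fixed policy under each sample. This is a minimal sketch under simplifying assumptions, not the authors' implementation: it uses a linear model instead of a deep Bayesian neural network, a memoryless placeholder policy instead of RoMBRL's recurrent history-based policy, and hypothetical helper names (grad_neg_log_post, sghmc_samples, rollout). The actual code is in the linked repository.

```python
# Illustrative sketch (not the RoMBRL code): SGHMC posterior sampling of dynamics-model
# parameters, plus uncertainty propagation by rolling out a policy under each sample.
import numpy as np

rng = np.random.default_rng(0)

# Toy transition data (s, a, s') from an unknown linear dynamics s' = 1.5*s + 0.5*a + noise.
S = rng.normal(size=(256, 1))
A = rng.normal(size=(256, 1))
S_next = 1.5 * S + 0.5 * A + 0.05 * rng.normal(size=(256, 1))
X = np.hstack([S, A])                      # model inputs: (state, action)

def grad_neg_log_post(theta, x, y, prior_prec=1.0, noise_prec=400.0):
    """Mini-batch stochastic gradient of the negative log posterior (linear-Gaussian model)."""
    resid = x @ theta - y                  # prediction error on the mini-batch
    return noise_prec * x.T @ resid * (len(X) / len(x)) + prior_prec * theta

def sghmc_samples(n_samples, eta=1e-6, alpha=0.1, batch=32, burn_in=200, thin=50):
    """Draw posterior samples of model parameters with an SGHMC-style update
    (friction alpha plus injected noise sqrt(2*alpha*eta))."""
    theta = np.zeros((2, 1))
    r = np.zeros_like(theta)               # momentum
    samples = []
    for t in range(burn_in + n_samples * thin):
        idx = rng.integers(0, len(X), size=batch)
        g = grad_neg_log_post(theta, X[idx], S_next[idx])
        r = (1 - alpha) * r - eta * g + np.sqrt(2 * alpha * eta) * rng.normal(size=r.shape)
        theta = theta + r
        if t >= burn_in and (t - burn_in) % thin == 0:
            samples.append(theta.copy())
    return samples

def rollout(theta, policy, s0=np.array([1.0]), horizon=10):
    """Simulate a trajectory under one sampled dynamics model; repeating this for every
    posterior sample propagates model uncertainty into the simulated outcomes."""
    s, traj = s0.copy(), []
    for _ in range(horizon):
        a = policy(s)
        s = np.array([theta[0, 0] * s[0] + theta[1, 0] * a])
        traj.append(s[0])
    return traj

models = sghmc_samples(n_samples=5)
policy = lambda s: -2.0 * s[0]             # placeholder policy (RoMBRL uses a recurrent policy)
trajectories = [rollout(m, policy) for m in models]
print("final-state spread across sampled models:",
      np.std([traj[-1] for traj in trajectories]))
```

The spread of outcomes across the sampled models is what a history-based policy can exploit during training; in RoMBRL itself the model is a deep Bayesian neural network and the policy is optimised with recurrent Trust-Region Policy Optimisation.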