Paper Title

Value Gradient weighted Model-Based Reinforcement Learning

Paper Authors

Claas Voelcker, Victor Liao, Animesh Garg, Amir-massoud Farahmand

Paper Abstract

Model-based reinforcement learning (MBRL) is a sample-efficient technique for obtaining control policies, yet unavoidable modeling errors often lead to performance deterioration. The model in MBRL is often fitted solely to reconstruct dynamics, state observations in particular, while the impact of model error on the policy is not captured by the training objective. This leads to a mismatch between the intended goal of MBRL, enabling good policy and value learning, and the target of the loss function employed in practice, future state prediction. Naive intuition would suggest that value-aware model learning would fix this problem, and indeed, several solutions to this objective mismatch problem have been proposed based on theoretical analysis. However, they tend to be inferior in practice to commonly used maximum likelihood estimation (MLE) based approaches. In this paper we propose Value-Gradient weighted Model learning (VaGraM), a novel method for value-aware model learning which improves the performance of MBRL in challenging settings, such as small model capacity and the presence of distracting state dimensions. We analyze both MLE and value-aware approaches, demonstrate how they fail to account for exploration and the behavior of function approximation when learning value-aware models, and highlight the additional goals that must be met to stabilize optimization in the deep learning setting. We verify our analysis by showing that our loss function is able to achieve high returns on the MuJoCo benchmark suite while being more robust than maximum likelihood based approaches.
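To make the key idea concrete — re-weighting the model's prediction error by how sensitive the value function is to each state dimension — here is a minimal PyTorch sketch of a value-gradient weighted model loss. The interfaces (`model(s, a)` returning a next-state prediction, `value_fn` as a differentiable value estimate) and the function name are illustrative assumptions; this is not the authors' released code, and the exact loss formulation in the paper may differ in detail.

```python
import torch

def value_gradient_weighted_loss(model, value_fn, s, a, s_next):
    """Sketch of a value-gradient weighted model loss.

    Each state dimension's squared prediction error is weighted by the
    squared gradient of the value function at the observed next state,
    so model capacity is spent on dimensions that matter for the value.
    """
    # Gradient of the value function with respect to the real next state.
    s_next = s_next.detach().requires_grad_(True)
    (grad_v,) = torch.autograd.grad(value_fn(s_next).sum(), s_next)

    # One-step prediction from the learned dynamics model.
    s_pred = model(s, a)

    # Weighted squared error: || grad_V(s') * (s_pred - s') ||^2,
    # averaged over the batch; a first-order, value-aware surrogate
    # for the mismatch between V(s_pred) and V(s').
    per_dim = (grad_v * (s_pred - s_next.detach())) ** 2
    return per_dim.sum(dim=-1).mean()
```

Compared with a plain MSE on next-state prediction, this weighting makes errors in state dimensions where the value gradient is near zero nearly free, which is one way to read the abstract's claim of robustness to distracting state dimensions and small model capacity.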
