Paper Title
Variance-Reduced Off-Policy Memory-Efficient Policy Search
Paper Authors
Paper Abstract
Off-policy policy optimization is a challenging problem in reinforcement learning (RL). The algorithms designed for this problem often suffer from high variance in their estimators, which results in poor sample efficiency, and have issues with convergence. A few variance-reduced on-policy policy gradient algorithms have been recently proposed that use methods from stochastic optimization to reduce the variance of the gradient estimate in the REINFORCE algorithm. However, these algorithms are not designed for the off-policy setting and are memory-inefficient, since they need to collect and store a large "reference" batch of samples from time to time. To achieve variance-reduced off-policy-stable policy optimization, we propose an algorithm family that is memory-efficient, stochastically variance-reduced, and capable of learning from off-policy samples. Empirical studies validate the effectiveness of the proposed approaches.
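
As a rough illustration of the mechanism the abstract refers to, the sketch below contrasts a vanilla REINFORCE gradient estimate with an SVRG-style corrected estimate that reuses a stored snapshot of the policy parameters and a full gradient computed on a large "reference" batch. This is not the paper's algorithm: the helper `grad_log_pi` and the `(trajectory, return)` batch format are hypothetical, and the importance-weight correction a real variance-reduced policy gradient method needs is omitted for brevity.

```python
import numpy as np

def reinforce_gradient(theta, batch, grad_log_pi):
    """Vanilla REINFORCE estimator: average of return-weighted score functions.

    `batch` is a list of (trajectory, return) pairs; `grad_log_pi(theta, traj)`
    is a hypothetical helper returning the gradient of the log-probability of
    the trajectory under the policy parameterized by `theta`.
    """
    grads = [ret * grad_log_pi(theta, traj) for traj, ret in batch]
    return np.mean(grads, axis=0)

def svrg_style_gradient(theta, theta_ref, mini_batch, full_grad_ref, grad_log_pi):
    """SVRG-style estimator: the mini-batch gradient at `theta`, corrected by
    the mini-batch gradient at a stored snapshot `theta_ref` plus the full
    gradient `full_grad_ref` computed once on a large "reference" batch at the
    snapshot. Storing that reference batch and snapshot is the memory cost the
    abstract points to. (Importance weights for the mismatch between the
    policies at `theta` and `theta_ref` are omitted in this simplified sketch.)
    """
    g_cur = reinforce_gradient(theta, mini_batch, grad_log_pi)
    g_ref = reinforce_gradient(theta_ref, mini_batch, grad_log_pi)
    return g_cur - g_ref + full_grad_ref
```

The correction term `g_cur - g_ref + full_grad_ref` keeps the estimate unbiased (in the idealized supervised-learning setting) while reducing its variance when `theta` stays close to `theta_ref`, which is why such methods periodically refresh the snapshot and its large reference batch.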