Paper Title
Performance-Weighed Policy Sampling for Meta-Reinforcement Learning
Paper Authors
Paper Abstract
This paper discusses an Enhanced Model-Agnostic Meta-Learning (E-MAML) algorithm that achieves fast convergence of the policy function from a small number of training examples when applied to new learning tasks. Built on top of Model-Agnostic Meta-Learning (MAML), E-MAML maintains a set of policy parameters learned in the environment for previous tasks. We apply E-MAML to developing reinforcement learning (RL)-based online fault-tolerant control schemes for dynamic systems. When a new fault occurs, the enhancement is applied to re-initialize the parameters of a new RL policy, which then adapts faster from a small number of samples of system behavior under the new fault. This replaces the random task sampling step in MAML; instead, it exploits the controller's previously generated experiences. The enhanced sampling is designed to maximally span the parameter space, facilitating adaptation to the new fault. We demonstrate the performance of our approach, combining E-MAML with Proximal Policy Optimization (PPO), on the well-known cart-pole example and then on the fuel transfer system of an aircraft.
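The abstract's core idea, re-initializing a new RL policy from a performance-weighted sample of previously learned policy parameters instead of MAML's random task sampling, can be sketched as follows. This is only an illustrative sketch, not the paper's implementation: the `PolicyLibrary` class, the softmax weighting over stored returns, and the convex-combination initializer are assumptions introduced here for illustration.

```python
# Illustrative sketch (not the paper's implementation): performance-weighted sampling
# of stored policy parameters to initialize a new policy for adaptation to a new fault.
# The softmax weighting and convex-combination initializer are assumptions.
import numpy as np

class PolicyLibrary:
    """Stores flattened policy parameter vectors and their observed returns."""
    def __init__(self):
        self.params = []   # one np.ndarray per previously learned policy
        self.returns = []  # scalar performance (e.g., average episode return) per policy

    def add(self, theta, avg_return):
        self.params.append(np.asarray(theta, dtype=np.float64))
        self.returns.append(float(avg_return))

    def sample_initialization(self, num_samples=3, temperature=1.0, rng=None):
        """Draw stored policies with probability increasing in past performance and
        return their weighted combination as initial parameters for the new task."""
        rng = rng or np.random.default_rng()
        r = np.asarray(self.returns)
        # Softmax over returns: better-performing policies are sampled more often.
        w = np.exp((r - r.max()) / temperature)
        w /= w.sum()
        idx = rng.choice(len(self.params), size=min(num_samples, len(self.params)),
                         replace=False, p=w)
        chosen = np.stack([self.params[i] for i in idx])
        # Re-normalize the weights of the chosen policies and combine.
        cw = w[idx] / w[idx].sum()
        return (cw[:, None] * chosen).sum(axis=0)

# Usage: the resulting parameter vector would seed a PPO policy network, which is then
# fine-tuned on a small number of rollouts collected under the new fault.
lib = PolicyLibrary()
for k in range(5):
    lib.add(np.random.randn(10), avg_return=k * 10.0)  # placeholder policies/returns
theta0 = lib.sample_initialization(num_samples=3)
print(theta0.shape)
```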