Paper Title
Sample-Efficient Automated Deep Reinforcement Learning
Paper Authors
Paper Abstract
Despite significant progress in challenging problems across various domains, applying state-of-the-art deep reinforcement learning (RL) algorithms remains challenging due to their sensitivity to the choice of hyperparameters. This sensitivity can partly be attributed to the non-stationarity of the RL problem, which may require different hyperparameter settings at different stages of the learning process. Additionally, in the RL setting, hyperparameter optimization (HPO) requires a large number of environment interactions, hindering the transfer of RL successes to real-world applications. In this work, we tackle the issues of sample-efficient and dynamic HPO in RL. We propose a population-based automated RL (AutoRL) framework to meta-optimize arbitrary off-policy RL algorithms. In this framework, we optimize both the hyperparameters and the neural architecture while simultaneously training the agent. By sharing the collected experience across the population, we substantially increase the sample efficiency of the meta-optimization. We demonstrate the capabilities of our sample-efficient AutoRL approach in a case study with the popular TD3 algorithm on the MuJoCo benchmark suite, where we reduce the number of environment interactions needed for meta-optimization by up to an order of magnitude compared to population-based training.
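To make the core idea concrete, below is a minimal, self-contained sketch of population-based meta-optimization with a shared experience buffer, the mechanism the abstract credits for the gain in sample efficiency. It is an illustration under simplifying assumptions, not the paper's implementation: ToyEnv, Agent, evaluate, and the exploit-and-explore step are hypothetical stand-ins, and the placeholder update rule substitutes for an off-policy learner such as TD3.

```python
# Hypothetical sketch: a population of agents shares one replay buffer while
# their hyperparameters are meta-optimized online. Illustrative only.
import random

class ToyEnv:
    """1-D toy environment: reward is higher the closer the action is to 0.5."""
    def reset(self):
        self.t = 0
        return 0.0
    def step(self, action):
        self.t += 1
        reward = -abs(action - 0.5)
        return 0.0, reward, self.t >= 10

class Agent:
    """Placeholder off-policy agent with one hyperparameter (exploration noise)."""
    def __init__(self, noise):
        self.noise = noise          # hyperparameter under meta-optimization
        self.mean_action = 0.0      # stand-in for the policy's parameters
    def act(self, obs):
        return self.mean_action + random.gauss(0.0, self.noise)
    def update(self, batch):
        # Crude stand-in for an off-policy update: move the policy toward the
        # highest-reward action found in the sampled batch.
        best = max(batch, key=lambda tr: tr[2])
        self.mean_action += 0.1 * (best[1] - self.mean_action)

def evaluate(agent, env, episodes=3):
    total = 0.0
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            obs, r, done = env.step(agent.act(obs))
            total += r
    return total / episodes

env = ToyEnv()
population = [Agent(noise=random.uniform(0.05, 0.5)) for _ in range(4)]
shared_buffer = []                   # experience shared across the population

for generation in range(20):
    # 1) every member collects experience into the SAME buffer
    for agent in population:
        obs, done = env.reset(), False
        while not done:
            a = agent.act(obs)
            obs, r, done = env.step(a)
            shared_buffer.append((obs, a, r))
    # 2) every member trains on batches sampled from the shared buffer
    for agent in population:
        batch = random.sample(shared_buffer, min(32, len(shared_buffer)))
        agent.update(batch)
    # 3) exploit & explore: weak members copy the best member's settings,
    #    with a perturbation applied to the hyperparameter
    scores = [evaluate(agent, env) for agent in population]
    best = population[scores.index(max(scores))]
    for i, agent in enumerate(population):
        if agent is not best and scores[i] < max(scores) - 0.05:
            agent.mean_action = best.mean_action
            agent.noise = best.noise * random.choice([0.8, 1.2])

print("best evaluation score:", max(scores))
```

The point of the sketch is the buffer: because every population member writes to and samples from shared_buffer, each environment interaction is reused by all members, which is what lets the meta-optimization get by with far fewer interactions than running an independent buffer per member as in standard population-based training.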