Paper Title
What Should I Know? Using Meta-gradient Descent for Predictive Feature Discovery in a Single Stream of Experience
Paper Authors
Paper Abstract
In computational reinforcement learning, a growing body of work seeks to construct an agent's perception of the world through predictions of future sensations; predictions about environment observations are used as additional input features to enable better goal-directed decision-making. An open challenge in this line of work is determining from the infinitely many predictions that the agent could possibly make which predictions might best support decision-making. This challenge is especially apparent in continual learning problems where a single stream of experience is available to a singular agent. As a primary contribution, we introduce a meta-gradient descent process by which an agent learns 1) what predictions to make, 2) the estimates for its chosen predictions, and 3) how to use those estimates to generate policies that maximize future reward -- all during a single ongoing process of continual learning. In this manuscript we consider predictions expressed as General Value Functions: temporally extended estimates of the accumulation of a future signal. We demonstrate that through interaction with the environment an agent can independently select predictions that resolve partial-observability, resulting in performance similar to expertly specified GVFs. By learning, rather than manually specifying these predictions, we enable the agent to identify useful predictions in a self-supervised manner, taking a step towards truly autonomous systems.
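For concreteness, the "temporally extended estimates of the accumulation of a future signal" mentioned above follow the standard General Value Function formulation: a GVF estimates the expected discounted sum of a cumulant signal $c$ under a target policy $\pi$ and a continuation function $\gamma$ (the notation here is the conventional one for GVFs, not necessarily the exact notation used in the paper):

$$ v(s;\, \pi, c, \gamma) \;=\; \mathbb{E}_{\pi}\!\left[\, \sum_{k=0}^{\infty} \left( \prod_{j=1}^{k} \gamma(S_{t+j}) \right) c_{t+k+1} \;\middle|\; S_t = s \right]. $$

The three interacting learning processes described in the abstract (what to predict, the predictions themselves, and the control policy that consumes them) can be sketched as follows. This is a minimal, hypothetical JAX sketch under stated assumptions, not the authors' implementation: it assumes linear GVF estimates, a meta-parameter matrix `eta` defining each GVF's cumulant as a linear function of the observation, and a control loss evaluated after the GVF update so that gradients with respect to `eta` flow through the prediction learner.

```python
# Minimal sketch (assumptions, not the paper's code): linear GVFs, a linear
# action-value control learner, and meta-parameters `eta` that define each
# GVF's cumulant as a linear function of the observation.
import jax
import jax.numpy as jnp

GAMMA, ALPHA_V = 0.9, 0.1   # illustrative continuation and step size


def gvf_update(V, eta, obs, next_obs):
    """TD(0) update of linear GVF weights toward the meta-defined cumulants."""
    cumulants = eta @ next_obs                        # what to predict (level 1)
    td = cumulants + GAMMA * (V @ next_obs) - V @ obs
    return V + ALPHA_V * td[:, None] * obs[None, :]   # prediction learning (level 2)


def control_loss(W, V, obs, action, target):
    """Squared error of an action value built on [observation, predictions]."""
    feats = jnp.concatenate([obs, V @ obs])           # predictions as input features
    return (target - (W @ feats)[action]) ** 2        # control learning (level 3)


def meta_loss(eta, V, W, obs, next_obs, action, target):
    """Control loss evaluated after the GVF update, so the gradient with
    respect to eta flows through the prediction learner's update."""
    V_new = gvf_update(V, eta, obs, next_obs)
    return control_loss(W, V_new, next_obs, action, target)


meta_grad = jax.grad(meta_loss)  # gradient w.r.t. eta (the first argument)

# Toy usage with made-up shapes and data.
OBS_DIM, N_GVFS, N_ACTIONS = 4, 2, 3
key, k1, k2 = jax.random.split(jax.random.PRNGKey(0), 3)
eta = 0.1 * jax.random.normal(k1, (N_GVFS, OBS_DIM))
V = jnp.zeros((N_GVFS, OBS_DIM))
W = 0.1 * jax.random.normal(k2, (N_ACTIONS, OBS_DIM + N_GVFS))
obs, next_obs = jnp.ones(OBS_DIM), jnp.ones(OBS_DIM)

g = meta_grad(eta, V, W, obs, next_obs, 1, 0.5)
eta = eta - 1e-3 * g  # descend the meta-gradient: adjust *what* is predicted
```

In the work summarized above, all three levels are updated online from the single stream of experience; the snippet only isolates how a meta-gradient with respect to the "question" parameters can be obtained by differentiating through the prediction learner's update.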