Paper Title
A Complete Characterization of Linear Estimators for Offline Policy Evaluation
Paper Authors
Paper Abstract
Offline policy evaluation is a fundamental statistical problem in reinforcement learning that involves estimating the value function of some decision-making policy given data collected by a potentially different policy. In order to tackle problems with complex, high-dimensional observations, there has been significant interest from theoreticians and practitioners alike in understanding the possibility of function approximation in reinforcement learning. Despite significant study, a sharp characterization of when we might expect offline policy evaluation to be tractable, even in the simplest setting of linear function approximation, has so far remained elusive, with a surprising number of strong negative results recently appearing in the literature. In this work, we identify simple control-theoretic and linear-algebraic conditions that are necessary and sufficient for classical methods, in particular Fitted Q-iteration (FQI) and least squares temporal difference learning (LSTD), to succeed at offline policy evaluation. Using this characterization, we establish a precise hierarchy of regimes under which these estimators succeed. We prove that LSTD works under strictly weaker conditions than FQI. Furthermore, we establish that if a problem is not solvable via LSTD, then it cannot be solved by a broad class of linear estimators, even in the limit of infinite data. Taken together, our results provide a complete picture of the behavior of linear estimators for offline policy evaluation, unify previously disparate analyses of canonical algorithms, and provide significantly sharper notions of the underlying statistical complexity of offline policy evaluation.
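To make the two classical estimators the abstract compares concrete, below is a minimal sketch (not taken from the paper) of LSTD and FQI for offline policy evaluation with linear function approximation. The array names phi, phi_next, rewards and the helper functions are hypothetical illustration choices, not the paper's notation: phi holds features of the logged state-action pairs, phi_next holds features of the next states under the target policy, and gamma is the discount factor.

```python
# Sketch of the two classical linear estimators discussed in the abstract.
# Assumed (hypothetical) inputs:
#   phi      : (n, d) array of features phi(s_i, a_i) for logged transitions
#   phi_next : (n, d) array of features phi(s'_i, pi(s'_i)) under the target policy
#   rewards  : (n,) array of observed rewards
#   gamma    : discount factor in [0, 1)
import numpy as np

def lstd(phi, phi_next, rewards, gamma):
    """Least squares temporal difference learning: solve the empirical
    fixed-point equations Phi^T (Phi - gamma * Phi') w = Phi^T r in one shot."""
    A = phi.T @ (phi - gamma * phi_next)
    b = phi.T @ rewards
    # lstsq rather than an explicit inverse, in case A is ill-conditioned
    return np.linalg.lstsq(A, b, rcond=None)[0]

def fqi(phi, phi_next, rewards, gamma, num_iters=100):
    """Fitted Q-iteration: repeatedly regress the bootstrapped targets
    r + gamma * Phi' w_k onto the features to obtain w_{k+1}."""
    d = phi.shape[1]
    w = np.zeros(d)
    proj = np.linalg.pinv(phi.T @ phi) @ phi.T  # least-squares projection onto the features
    for _ in range(num_iters):
        targets = rewards + gamma * (phi_next @ w)
        w = proj @ targets
    return w

# Both return a weight vector w; the estimated value of the target policy at a
# state-action pair (s, a) is then phi(s, a) @ w.
```

Note that the FQI iteration need not converge for arbitrary features and offline data distributions, whereas LSTD only requires its linear system to be solvable; this gap is exactly the kind of condition on the data and features that the paper characterizes.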