Paper Title
Future-Dependent Value-Based Off-Policy Evaluation in POMDPs
Paper Authors
Paper Abstract
We study off-policy evaluation (OPE) for partially observable MDPs (POMDPs) with general function approximation. Existing methods, such as sequential importance sampling estimators and fitted-Q evaluation, suffer from the curse of horizon in POMDPs. To circumvent this problem, we develop a novel model-free OPE method by introducing future-dependent value functions that take future proxies as inputs. Future-dependent value functions play roles in POMDPs similar to those of classical value functions in fully observable MDPs. We derive a new Bellman equation for future-dependent value functions in the form of conditional moment equations that use history proxies as instrumental variables. We further propose a minimax learning method for estimating future-dependent value functions using this new Bellman equation. We obtain a PAC result, which implies that our OPE estimator is consistent as long as futures and histories contain sufficient information about latent states and Bellman completeness holds. Finally, we extend our method to the learning of dynamics and establish a connection between our approach and the well-known spectral learning methods for POMDPs.
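To make the abstract's key objects concrete, here is a minimal sketch of what a Bellman equation written as a conditional moment restriction with history proxies as instruments, and the associated minimax objective, could look like. The notation ($g$ for a future-dependent value function, $F$/$F'$ for current/next future proxies, $H$ for a history proxy, $O$, $A$, $R$ for observation, action, and reward, $\mu$ for the importance weight of the evaluation policy over the behavior policy, and the function classes $\mathcal{G}$, $\Xi$) is assumed for illustration rather than quoted from the paper:

\[
\mathbb{E}\big[\, \mu(O, A)\,\{ R + \gamma\, g(F') \} - g(F) \;\big|\; H \,\big] = 0,
\]

\[
\hat{g} \in \operatorname*{arg\,min}_{g \in \mathcal{G}} \; \max_{\xi \in \Xi} \; \mathbb{E}_n\Big[ \big( \mu(O, A)\,\{ R + \gamma\, g(F') \} - g(F) \big)\, \xi(H) \;-\; \lambda\, \xi(H)^2 \Big].
\]

In this sketch, the instrument class $\Xi$ acts as a critic that tests violations of the moment condition against history proxies, and $\lambda \ge 0$ is a stabilization parameter; this mirrors how minimax instrumental-variable estimation is typically set up when both $\mathcal{G}$ and $\Xi$ are sufficiently expressive.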