Statistical Estimation of Confounded Linear MDPs: An Instrumental Variable Approach

Paper Title

Statistical Estimation of Confounded Linear MDPs: An Instrumental Variable Approach

Authors

Lu, Miao; Yang, Wenhao; Zhang, Liangyu; Zhang, Zhihua

Abstract

In a Markov decision process (MDP), unobservable confounders may exist and affect the data-generating process, so that classic off-policy evaluation (OPE) estimators may fail to identify the true value function of the target policy. In this paper, we study the statistical properties of OPE in confounded MDPs with observable instrumental variables. Specifically, we propose a two-stage estimator based on the instrumental variables and establish its statistical properties in confounded MDPs with a linear structure. For the non-asymptotic analysis, we prove an $\mathcal{O}(n^{-1/2})$ error bound, where $n$ is the number of samples. For the asymptotic analysis, we prove that the two-stage estimator is asymptotically normal with a typical rate of $n^{1/2}$. To the best of our knowledge, we are the first to show such statistical results for the two-stage estimator in confounded linear MDPs via instrumental variables.
