Paper Title

Exponential Lower Bounds for Batch Reinforcement Learning: Batch RL can be Exponentially Harder than Online RL

Authors

Zanette, Andrea

Abstract

Several practical applications of reinforcement learning involve an agent learning from past data without the possibility of further exploration. Often these applications require us to 1) identify a near optimal policy or to 2) estimate the value of a target policy. For both tasks we derive \emph{exponential} information-theoretic lower bounds in discounted infinite horizon MDPs with a linear function representation for the action value function even if 1) \emph{realizability} holds, 2) the batch algorithm observes the exact reward and transition \emph{functions}, and 3) the batch algorithm is given the \emph{best} a priori data distribution for the problem class. Our work introduces a new `oracle + batch algorithm' framework to prove lower bounds that hold for every distribution. The work shows an exponential separation between batch and online reinforcement learning.
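For orientation, the \emph{realizability} condition referenced in the abstract is usually formalized as follows (a standard statement of the assumption, not quoted from the paper): the agent is given a feature map $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^{d}$, and the action-value function of every policy under consideration is exactly linear in these features:

$$ Q^{\pi}(s, a) = \phi(s, a)^{\top} \theta^{\pi} \quad \text{for some } \theta^{\pi} \in \mathbb{R}^{d} \text{ and all } (s, a) \in \mathcal{S} \times \mathcal{A}. $$

The abstract's claim is that even under this exact linear representation, and even when the batch algorithm observes the exact reward and transition functions and is given the best a priori data distribution for the problem class, exponential information-theoretic lower bounds still hold for both policy identification and policy evaluation.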
