Paper Title

Follow-the-Perturbed-Leader for Adversarial Markov Decision Processes with Bandit Feedback

Paper Authors

Yan Dai, Haipeng Luo, Liyu Chen

Paper Abstract

We consider regret minimization for Adversarial Markov Decision Processes (AMDPs), where the loss functions are changing over time and adversarially chosen, and the learner only observes the losses for the visited state-action pairs (i.e., bandit feedback). While there has been a surge of studies on this problem using Online-Mirror-Descent (OMD) methods, very little is known about the Follow-the-Perturbed-Leader (FTPL) methods, which are usually computationally more efficient and also easier to implement since they only require solving an offline planning problem. Motivated by this, we take a closer look at FTPL for learning AMDPs, starting from the standard episodic finite-horizon setting. We find some unique and intriguing difficulties in the analysis and propose a workaround to eventually show that FTPL is also able to achieve near-optimal regret bounds in this case. More importantly, we then find two significant applications: First, the analysis of FTPL turns out to be readily generalizable to delayed bandit feedback with order-optimal regret, while OMD methods exhibit extra difficulties (Jin et al., 2022). Second, using FTPL, we also develop the first no-regret algorithm for learning communicating AMDPs in the infinite-horizon setting with bandit feedback and stochastic transitions. Our algorithm is efficient assuming access to an offline planning oracle, while even for the easier full-information setting, the only existing algorithm (Chandrasekaran and Tewari, 2021) is computationally inefficient.
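
For readers who want a concrete picture of the method the abstract refers to, below is a minimal FTPL sketch in Python: accumulate (estimated) losses, add a random perturbation, and call an offline planning oracle to pick the next policy. This is an illustration under stated assumptions only; the tensor shapes, the exponential perturbation, and the names `ftpl_episode` / `planning_oracle` are hypothetical, and the paper's actual algorithm additionally has to construct loss estimates from bandit feedback, which this sketch omits.

```python
import numpy as np


def ftpl_episode(cum_loss_est, eta, planning_oracle, rng):
    """One FTPL step: perturb the cumulative loss estimates, then plan offline.

    cum_loss_est: (H, S, A) array of accumulated loss estimates per
                  (layer, state, action) triple (hypothetical shape).
    eta:          perturbation scale, playing the role of a learning rate.
    planning_oracle: any offline planner mapping a fixed loss tensor to a
                  policy minimizing expected total loss (e.g., backward
                  dynamic programming under the transition model).
    """
    # Exponentially distributed perturbations, as in standard FTPL analyses;
    # the learner then solves a single offline planning problem per episode.
    perturbation = rng.exponential(scale=1.0, size=cum_loss_est.shape)
    return planning_oracle(cum_loss_est - perturbation / eta)


# Toy usage: a "planner" that ignores transitions and takes per-state argmins,
# purely to show the interface; a real oracle would respect the dynamics.
rng = np.random.default_rng(0)
H, S, A = 3, 4, 2
cum_loss = np.zeros((H, S, A))
toy_oracle = lambda loss: loss.argmin(axis=-1)  # policy[h, s] = action index
policy = ftpl_episode(cum_loss, eta=0.1, planning_oracle=toy_oracle, rng=rng)
print(policy.shape)  # (3, 4)
```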
