论文标题
带有复合匿名延迟反馈的有界内存对抗匪徒
Bounded Memory Adversarial Bandits with Composite Anonymous Delayed Feedback
论文作者
论文摘要
我们研究了对抗性的匪徒与复合匿名延迟反馈有关的问题。在这种情况下,一项操作的损失分为$ d $组件,在选择动作后连续一轮散布。在每回合中,该算法都观察到来自最新$ d $回合的损失的汇总。以前的作品着重于遗忘的对抗环境,同时我们研究了更艰难的非欺骗环境。即使损失序列是有限的,我们显示不掩饰的设置也会引起$ω(t)$ pseudo后悔。但是,我们提出了一种包装算法,该算法在许多对抗性强盗问题上都享受$ o(t)$ policy遗憾,假设损失序列是有限的记忆。特别是,对于$ k $ armed的强盗和强盗凸优化,我们具有$ \ mathcal {o}(t^{2/3})$ policy policy selford bund Bond。我们还证明了$ k $武装的强盗的匹配下限。即使损失序列遗忘,我们的下限也起作用,但延迟是非掩饰的。它回答了\ cite {wang2021Adaptive}中提出的开放问题,表明非义延迟足以产生$ \tildeΩ(t^{2/3})$遗憾。
We study the adversarial bandit problem with composite anonymous delayed feedback. In this setting, losses of an action are split into $d$ components, spreading over consecutive rounds after the action is chosen. And in each round, the algorithm observes the aggregation of losses that come from the latest $d$ rounds. Previous works focus on oblivious adversarial setting, while we investigate the harder non-oblivious setting. We show non-oblivious setting incurs $Ω(T)$ pseudo regret even when the loss sequence is bounded memory. However, we propose a wrapper algorithm which enjoys $o(T)$ policy regret on many adversarial bandit problems with the assumption that the loss sequence is bounded memory. Especially, for $K$-armed bandit and bandit convex optimization, we have $\mathcal{O}(T^{2/3})$ policy regret bound. We also prove a matching lower bound for $K$-armed bandit. Our lower bound works even when the loss sequence is oblivious but the delay is non-oblivious. It answers the open problem proposed in \cite{wang2021adaptive}, showing that non-oblivious delay is enough to incur $\tildeΩ(T^{2/3})$ regret.