带有复合匿名延迟反馈的有界内存对抗匪徒

论文标题

带有复合匿名延迟反馈的有界内存对抗匪徒

Bounded Memory Adversarial Bandits with Composite Anonymous Delayed Feedback

论文作者

Wan, Zongqi, Sun, Xiaoming, Zhang, Jialin

论文摘要

我们研究了对抗性的匪徒与复合匿名延迟反馈有关的问题。在这种情况下，一项操作的损失分为$ d $组件，在选择动作后连续一轮散布。在每回合中，该算法都观察到来自最新$ d $回合的损失的汇总。以前的作品着重于遗忘的对抗环境，同时我们研究了更艰难的非欺骗环境。即使损失序列是有限的，我们显示不掩饰的设置也会引起$ω（t）$ pseudo后悔。但是，我们提出了一种包装算法，该算法在许多对抗性强盗问题上都享受$ o（t）$ policy遗憾，假设损失序列是有限的记忆。特别是，对于$ k $ armed的强盗和强盗凸优化，我们具有$ \ mathcal {o}（t^{2/3}）$ policy policy selford bund Bond。我们还证明了$ k $武装的强盗的匹配下限。即使损失序列遗忘，我们的下限也起作用，但延迟是非掩饰的。它回答了\ cite {wang2021Adaptive}中提出的开放问题，表明非义延迟足以产生$ \tildeΩ（t^{2/3}）$遗憾。

We study the adversarial bandit problem with composite anonymous delayed feedback. In this setting, losses of an action are split into $d$ components, spreading over consecutive rounds after the action is chosen. And in each round, the algorithm observes the aggregation of losses that come from the latest $d$ rounds. Previous works focus on oblivious adversarial setting, while we investigate the harder non-oblivious setting. We show non-oblivious setting incurs $Ω(T)$ pseudo regret even when the loss sequence is bounded memory. However, we propose a wrapper algorithm which enjoys $o(T)$ policy regret on many adversarial bandit problems with the assumption that the loss sequence is bounded memory. Especially, for $K$-armed bandit and bandit convex optimization, we have $\mathcal{O}(T^{2/3})$ policy regret bound. We also prove a matching lower bound for $K$-armed bandit. Our lower bound works even when the loss sequence is oblivious but the delay is non-oblivious. It answers the open problem proposed in \cite{wang2021adaptive}, showing that non-oblivious delay is enough to incur $\tildeΩ(T^{2/3})$ regret.

下载PDF全文

下载文献需遵守相关版权规定

论文标题