Paper Title

Nearly Optimal Algorithms for Linear Contextual Bandits with Adversarial Corruptions

Paper Authors

Jiafan He, Dongruo Zhou, Tong Zhang, Quanquan Gu

Paper Abstract

We study the linear contextual bandit problem in the presence of adversarial corruption, where the reward at each round is corrupted by an adversary and the corruption level (i.e., the sum of corruption magnitudes over the horizon) is $C\geq 0$. The best-known algorithms in this setting are limited in that they are either computationally inefficient, require a strong assumption on the corruption, or incur regret at least $C$ times worse than the regret without corruption. In this paper, to overcome these limitations, we propose a new algorithm based on the principle of optimism in the face of uncertainty. At the core of our algorithm is a weighted ridge regression, where the weight of each chosen action depends on its confidence up to some threshold. We show that for both the known-$C$ and unknown-$C$ cases, our algorithm with a proper choice of hyperparameters achieves a regret that nearly matches the lower bounds. Thus, our algorithm is nearly optimal up to logarithmic factors in both cases. Notably, our algorithm achieves near-optimal regret for both the corrupted and uncorrupted ($C=0$) cases simultaneously.
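The core idea described in the abstract — a weighted ridge regression where each chosen action's weight depends on its confidence, capped at a threshold — can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact algorithm: the function name, the hyperparameters `alpha` (weight threshold), `lam` (ridge regularization), and `beta` (exploration bonus), and the reward interface are all assumptions made for illustration.

```python
import numpy as np

def run_weighted_oful(contexts, rewards_fn, T, alpha=1.0, lam=1.0, beta=1.0):
    """Sketch of optimism-based linear contextual bandit learning with
    confidence-weighted ridge regression (illustrative, not the paper's
    exact CW-OFUL pseudocode).

    contexts[t] is an (n_actions, d) array of feature vectors at round t;
    rewards_fn(t, a) returns the (possibly corrupted) reward of action a.
    """
    d = contexts[0].shape[1]
    Sigma = lam * np.eye(d)   # regularized, weighted covariance matrix
    b = np.zeros(d)           # weighted sum of reward-scaled features
    for t in range(T):
        X = contexts[t]
        Sigma_inv = np.linalg.inv(Sigma)
        theta = Sigma_inv @ b                    # weighted ridge estimate
        # Optimistic (UCB-style) action selection: mean estimate plus an
        # exploration bonus proportional to the elliptical confidence width.
        widths = np.sqrt(np.einsum('ij,jk,ik->i', X, Sigma_inv, X))
        a = int(np.argmax(X @ theta + beta * widths))
        x = X[a]
        # Confidence-dependent weight, capped at 1: actions with large
        # uncertainty (large elliptical norm) are down-weighted, which
        # limits how much a corrupted reward can distort the estimate.
        w = min(1.0, alpha / np.sqrt(x @ Sigma_inv @ x))
        r = rewards_fn(t, a)
        Sigma += w * np.outer(x, x)
        b += w * r * x
    return np.linalg.solve(Sigma, b)             # final parameter estimate
```

With no corruption and small noise, the weighted estimate should recover the true reward parameter; under corruption, the weight cap bounds the influence any single corrupted round can have on the regression.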
