Paper Title

Leveraging Initial Hints for Free in Stochastic Linear Bandits

Paper Authors

Ashok Cutkosky, Chris Dann, Abhimanyu Das, Qiuyi Zhang

Paper Abstract

We study the setting of optimizing with bandit feedback with additional prior knowledge provided to the learner in the form of an initial hint of the optimal action. We present a novel algorithm for stochastic linear bandits that uses this hint to improve its regret to $\tilde O(\sqrt{T})$ when the hint is accurate, while maintaining a minimax-optimal $\tilde O(d\sqrt{T})$ regret independent of the quality of the hint. Furthermore, we provide a Pareto frontier of tight tradeoffs between best-case and worst-case regret, with matching lower bounds. Perhaps surprisingly, our work shows that leveraging a hint yields provable gains without sacrificing worst-case performance, implying that our algorithm adapts to the quality of the hint for free. We also provide an extension of our algorithm to the case of $m$ initial hints, showing that we can achieve a $\tilde O(m^{2/3}\sqrt{T})$ regret.
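To make the problem setting concrete, below is a minimal Python sketch of the stochastic linear bandit protocol with an initial hint. This is not the paper's algorithm: the unit-ball action set, the Gaussian noise level, and the naive "always play the hint" strategy are all assumptions chosen purely for illustration.

```python
import numpy as np

# Illustrative sketch of a stochastic linear bandit with an initial hint.
# NOT the paper's algorithm; setting choices below are assumptions.

rng = np.random.default_rng(0)
d, T = 5, 1000

theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)  # unknown parameter, ||theta*|| = 1
optimal_action = theta_star               # argmax of <theta*, a> over the unit ball

# Initial hint: a noisy copy of the optimal action; its accuracy is
# unknown to the learner.
hint = optimal_action + 0.1 * rng.normal(size=d)
hint /= np.linalg.norm(hint)

regret = 0.0
for t in range(T):
    action = hint                                         # naive: trust the hint
    reward = action @ theta_star + rng.normal(scale=0.1)  # bandit feedback only
    regret += optimal_action @ theta_star - action @ theta_star

print(f"cumulative regret after T={T} rounds: {regret:.2f}")
```

Note that this naive strategy incurs regret linear in $T$ whenever the hint is even slightly off, which is precisely the failure mode the paper's algorithm avoids: it keeps the worst-case $\tilde O(d\sqrt{T})$ guarantee regardless of hint quality while still attaining $\tilde O(\sqrt{T})$ when the hint is accurate.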
