Paper Title


Breaking the $\sqrt{T}$ Barrier: Instance-Independent Logarithmic Regret in Stochastic Contextual Linear Bandits

Authors

Avishek Ghosh, Abishek Sankararaman

Abstract


We prove an instance-independent (poly)logarithmic regret bound for stochastic contextual bandits with linear payoffs. Previously, in \cite{chu2011contextual}, a lower bound of $\mathcal{O}(\sqrt{T})$ was shown for the contextual linear bandit problem with arbitrary (adversarially chosen) contexts. In this paper, we show that stochastic contexts indeed help reduce the regret from $\sqrt{T}$ to $\polylog(T)$. We propose Low Regret Stochastic Contextual Bandits (\texttt{LR-SCB}), which takes advantage of the stochastic contexts and performs parameter estimation (in $\ell_2$ norm) and regret minimization simultaneously. \texttt{LR-SCB} works in epochs, where the parameter estimate from the previous epoch is used to reduce the regret of the current epoch. The (poly)logarithmic regret of \texttt{LR-SCB} stems from two crucial facts: (a) the application of a norm-adaptive algorithm to exploit the parameter estimation, and (b) an analysis of the shifted linear contextual bandit algorithm, showing that shifting results in increasing regret. We also show experimentally that stochastic contexts indeed incur a regret that scales with $\polylog(T)$.
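To make the epoch structure described in the abstract concrete, here is a minimal, hypothetical sketch (not the paper's exact \texttt{LR-SCB} algorithm) of an epoch-based stochastic linear contextual bandit: each epoch acts greedily with respect to the previous epoch's ridge-regression estimate, then refits the estimate on the data collected in that epoch. The horizon, dimension, noise level, and doubling epoch schedule are all illustrative assumptions.

```python
import numpy as np

def run_epoch_sketch(T=2000, d=5, K=10, seed=0):
    """Illustrative epoch-based sketch (assumptions, not the paper's algorithm):
    greedy play with the previous epoch's estimate, ridge refit per epoch."""
    rng = np.random.default_rng(seed)
    theta_star = rng.normal(size=d)
    theta_star /= np.linalg.norm(theta_star)   # unknown true parameter

    theta_hat = np.zeros(d)                    # estimate carried over from the previous epoch
    t, epoch_len, regret = 0, 100, 0.0
    while t < T:
        rows, ys = [], []
        for _ in range(min(epoch_len, T - t)):
            # stochastic contexts: K feature vectors drawn i.i.d. each round
            X = rng.normal(size=(K, d)) / np.sqrt(d)
            a = int(np.argmax(X @ theta_hat))  # greedy w.r.t. the old estimate
            reward = X[a] @ theta_star + 0.1 * rng.normal()
            regret += np.max(X @ theta_star) - X[a] @ theta_star
            rows.append(X[a]); ys.append(reward)
            t += 1
        # ridge regression on this epoch's data -> estimate for the next epoch
        A, y = np.array(rows), np.array(ys)
        theta_hat = np.linalg.solve(A.T @ A + np.eye(d), A.T @ y)
        epoch_len *= 2                         # doubling epoch lengths
    return regret

print(run_epoch_sketch())
```

The key design point echoed from the abstract is that estimation and regret minimization happen simultaneously: the data gathered while playing in one epoch feeds the parameter estimate that drives action selection in the next.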
