Rotting Infinitely Many-armed Bandits

Paper title

Rotting Infinitely Many-armed Bandits

Authors

Jung-hun Kim, Milan Vojnovic, Se-Young Yun

Abstract

We consider the infinitely many-armed bandit problem with rotting rewards, where the mean reward of an arm decreases at each pull of that arm according to an arbitrary trend with maximum rotting rate $\varrho = o(1)$. We show that this learning problem has an $\Omega(\max\{\varrho^{1/3}T, \sqrt{T}\})$ worst-case regret lower bound, where $T$ is the time horizon. We show that a matching upper bound of $\tilde{O}(\max\{\varrho^{1/3}T, \sqrt{T}\})$, up to a poly-logarithmic factor, can be achieved when the algorithm knows the value of the maximum rotting rate $\varrho$, by an algorithm that uses a UCB index for each arm and a threshold value to decide whether to continue pulling an arm or remove it from further consideration. We also show that an $\tilde{O}(\max\{\varrho^{1/3}T, T^{3/4}\})$ regret upper bound can be achieved by an algorithm that does not know the value of $\varrho$, by using an adaptive UCB index along with an adaptive threshold value.
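The threshold idea in the abstract can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' algorithm: the function name `simulate_threshold_ucb`, the Uniform[0, 1] initial means, the Gaussian noise, the rot-by-exactly-$\varrho$ trend, the plain UCB bonus, and the threshold gap $\delta = \max\{\varrho^{1/3}, 1/\sqrt{T}\}$ are all choices made here for illustration. The abstract only states that a UCB index is compared against a threshold.

```python
import math
import random


def simulate_threshold_ucb(horizon, rho, noise_sd=0.1, seed=0):
    """Simulate a simplified threshold-UCB policy.

    Returns (pseudo_regret, arms_tried). Everything below is an
    illustrative assumption, not the index or threshold from the paper.
    """
    rng = random.Random(seed)
    # Assumed threshold gap, chosen to mirror the two terms of the regret bound.
    delta = max(rho ** (1.0 / 3.0), horizon ** -0.5)
    threshold = 1.0 - delta

    regret = 0.0
    arms_tried = 0
    t = 0
    while t < horizon:
        # Sample a fresh arm; initial mean ~ Uniform[0, 1) is an assumption.
        mean = rng.random()
        arms_tried += 1
        pulls = 0
        reward_sum = 0.0
        while t < horizon:
            reward = mean + rng.gauss(0.0, noise_sd)
            regret += 1.0 - mean  # gap to the best possible initial mean of 1
            reward_sum += reward
            pulls += 1
            t += 1
            mean -= rho  # this instance rots at the maximum rate on every pull
            # Plain UCB index: empirical mean plus a confidence bonus.
            bonus = noise_sd * math.sqrt(2.0 * math.log(horizon) / pulls)
            if reward_sum / pulls + bonus < threshold:
                break  # discard this arm for good and move on to a new one
    return regret, arms_tried


if __name__ == "__main__":
    for rho in (0.0, 1e-4, 1e-2):
        regret, arms = simulate_threshold_ucb(horizon=10_000, rho=rho)
        print(f"rho={rho:g}: pseudo-regret={regret:.1f}, arms tried={arms}")
```

Running the script prints the pseudo-regret and the number of arms sampled for a few rotting rates, which gives a rough feel for how faster rotting forces the policy to abandon arms sooner. The numbers depend on the assumed noise level and seed and are not results from the paper.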

Paper title