Paper Title

Rotting Infinitely Many-armed Bandits

Authors

Jung-hun Kim, Milan Vojnovic, Se-Young Yun

Abstract

We consider the infinitely many-armed bandit problem with rotting rewards, where the mean reward of an arm decreases at each pull of the arm according to an arbitrary trend with maximum rotting rate $\varrho=o(1)$. We show that this learning problem has an $\Omega(\max\{\varrho^{1/3}T,\sqrt{T}\})$ worst-case regret lower bound, where $T$ is the time horizon. We show that a matching upper bound $\tilde{O}(\max\{\varrho^{1/3}T,\sqrt{T}\})$, up to a poly-logarithmic factor, can be achieved when the algorithm knows the value of the maximum rotting rate $\varrho$, by an algorithm that uses a UCB index for each arm and a threshold value to decide whether to continue pulling an arm or to remove it from further consideration. We also show that an $\tilde{O}(\max\{\varrho^{1/3}T,T^{3/4}\})$ regret upper bound can be achieved by an algorithm that does not know the value of $\varrho$, by using an adaptive UCB index along with an adaptive threshold value.
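The known-$\varrho$ idea described in the abstract (keep pulling an arm while its UCB index stays above a threshold, otherwise discard it and draw a fresh arm) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' algorithm: the function name `run_threshold_ucb`, the uniform distribution of initial means, the Gaussian noise, the constant per-pull decay of exactly $\varrho$, the plain UCB bonus, and the threshold gap $\delta = \max\{\varrho^{1/3}, T^{-1/2}\}$ (chosen here only to mirror the scaling of the regret bound) are all hypothetical choices made for illustration.

```python
import math
import random


def run_threshold_ucb(horizon, rho, seed=0):
    """Simulate a simplified UCB-plus-threshold policy on rotting arms.

    Returns (pseudo_regret, arms_tried). All modelling choices below are
    illustrative assumptions, not details taken from the paper.
    """
    rng = random.Random(seed)

    # Assumed threshold gap, mirroring the scaling of the stated regret bound.
    delta = max(rho ** (1.0 / 3.0), horizon ** -0.5)
    threshold = 1.0 - delta

    # Assumption: a fresh arm has an initial mean drawn uniformly from [0, 1).
    mean = rng.random()      # hidden from the policy
    pulls, total = 0, 0.0    # statistics of the arm currently in play
    arms_tried = 1
    regret = 0.0

    for _ in range(horizon):
        # Pull the current arm: noisy reward around its current mean.
        reward = mean + rng.gauss(0.0, 1.0)
        regret += 1.0 - mean  # pseudo-regret against the best possible mean 1
        mean -= rho           # assumption: the arm rots at the maximum rate
        pulls += 1
        total += reward

        # Plain UCB index: empirical mean plus a confidence bonus.
        ucb = total / pulls + math.sqrt(2.0 * math.log(horizon) / pulls)

        # Discard the arm once even its optimistic estimate is below threshold.
        if ucb < threshold:
            mean = rng.random()
            pulls, total = 0, 0.0
            arms_tried += 1

    return regret, arms_tried
```

A call such as `run_threshold_ucb(10_000, 1e-3)` returns the accumulated pseudo-regret and the number of arms sampled. Varying `rho` gives a rough feel for how faster rotting forces the policy to cycle through more arms, though the sketch makes no claim to reproduce the bounds stated in the abstract.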
