Paper Title
Non-Stationary Bandits under Recharging Payoffs: Improved Planning with Sublinear Regret
Paper Authors
Paper Abstract
The stochastic multi-armed bandit setting has been recently studied in the non-stationary regime, where the mean payoff of each action is a non-decreasing function of the number of rounds passed since it was last played. This model captures natural behavioral aspects of users that crucially determine the performance of recommendation platforms, ad placement systems, and more. Even assuming prior knowledge of the mean payoff functions, computing an optimal plan in the above model is NP-hard, while the state of the art is a $1/4$-approximation algorithm for the case where at most one arm can be played per round. We first focus on the setting where the mean payoff functions are known. In this setting, we significantly improve the best-known guarantees for the planning problem by developing a polynomial-time $(1-1/e)$-approximation algorithm (asymptotically and in expectation), based on a novel combination of randomized LP rounding and a time-correlated (interleaved) scheduling method. Furthermore, our algorithm achieves improved guarantees, compared to prior work, for the case where more than one arm can be played at each round. Moving to the bandit setting, when the mean payoff functions are initially unknown, we show how our algorithm can be transformed into a bandit algorithm with sublinear regret.
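To make the recharging-payoff model concrete, the following minimal Python sketch simulates an environment in which each arm's mean payoff is a non-decreasing function of the delay since it was last played, together with a naive greedy planner for the known-payoff (planning) case. The payoff curves and function names are hypothetical illustrations; this is only a baseline for intuition, not the paper's LP-rounding and interleaved-scheduling algorithm.

```python
import math
import random

# Hypothetical recharging payoff curves: the mean payoff of each arm is a
# non-decreasing function of the delay (rounds since the arm was last played).
PAYOFF_FUNCS = [
    lambda d: 1.0 - math.exp(-0.3 * d),    # saturating recharge
    lambda d: min(1.0, 0.1 * d),           # linear recharge, capped at 1
    lambda d: 0.5 * (1.0 - 1.0 / (d + 1)), # slow recharge
]


def simulate_greedy(horizon=50, seed=0):
    """Play the arm with the highest current mean payoff in each round.

    A naive baseline for the planning problem with known payoff functions;
    it does not implement the paper's (1 - 1/e)-approximation.
    """
    rng = random.Random(seed)
    num_arms = len(PAYOFF_FUNCS)
    last_played = [0] * num_arms  # round at which each arm was last played
    total_reward = 0.0
    for t in range(1, horizon + 1):
        delays = [t - last_played[i] for i in range(num_arms)]
        means = [PAYOFF_FUNCS[i](delays[i]) for i in range(num_arms)]
        arm = max(range(num_arms), key=lambda i: means[i])
        # Stochastic reward: Bernoulli draw with the recharged mean payoff.
        total_reward += 1.0 if rng.random() < means[arm] else 0.0
        last_played[arm] = t
    return total_reward


if __name__ == "__main__":
    print("greedy total reward over 50 rounds:", simulate_greedy())
```

In the bandit setting the payoff functions above would be unknown and would have to be estimated from observed rewards, which is where the regret analysis in the paper applies.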