Finite Continuum-Armed Bandits

Paper title

Finite Continuum-Armed Bandits

Authors

Gaucher, Solenne

Abstract

We consider a situation where an agent has $T$ resources to be allocated to a larger number $N$ of actions. Each action can be completed at most once and results in a stochastic reward with unknown mean. The goal of the agent is to maximize her cumulative reward. Nontrivial strategies are possible when side information on the actions is available, for example in the form of covariates. Focusing on a nonparametric setting, where the mean reward is an unknown function of a one-dimensional covariate, we propose an optimal strategy for this problem. Under natural assumptions on the reward function, we prove that the optimal regret scales as $O(T^{1/3})$ up to poly-logarithmic factors when the budget $T$ is proportional to the number of actions $N$. When $T$ becomes small compared to $N$, a smooth transition occurs. When the ratio $T/N$ decreases from a constant to $N^{-1/3}$, the regret increases progressively up to the $O(T^{1/2})$ rate encountered in continuum-armed bandits.
