论文标题

具有图形结构侧观测的对抗线性上下文匪徒

Adversarial Linear Contextual Bandits with Graph-Structured Side Observations

论文作者

Wang, Lingda, Li, Bingcong, Zhou, Huozhi, Giannakis, Georgios B., Varshney, Lav R., Zhao, Zhizhen

论文摘要

本文研究了对抗图形上下文匪徒,这是对抗性多臂匪徒的变体,这些变体利用了最常见的侧面信息的两类:\ emph {contexts}和\ emph {side观察}。在这种情况下,学习代理在与$ d $维的上下文向量展示后,从一组$ k $动作中反复选择。代理不仅会造成并观察到所选作用的丢失,还观察到其在观测结构中相邻作用的损失,这些损失被编码为一系列反馈图。此设置对社交网络中的各种应用程序进行了建模,在该应用程序中,可以使用上下文和图形结构的侧面观察。基于\ texttt {exp3}开发了两种有效的算法。在轻度条件下,我们的分析表明,对于无方向的反馈图,第一个算法,\ texttt {exp3-lgc-u},实现了订单$ \ mathcal {o}的遗憾(\ sqrt {(k+α(g)d)反馈图的\ emph {独立编号}。也为有向图设置提供了稍弱的结果。第二个算法,\ texttt {ext3-lgc-ix}是针对特殊类别的问题开发的,为此,遗憾的是$ \ m nathcal {o}(\ sqrt {\ sqrt {α(g)dt \ log \ log {k}} \ log(kt(kt)} $ for notected forsected forsect forsect foodexts forsectiondection forsect fordecks for foodexps forseffect。数值测试证实了所提出的算法的效率。

This paper studies the adversarial graphical contextual bandits, a variant of adversarial multi-armed bandits that leverage two categories of the most common side information: \emph{contexts} and \emph{side observations}. In this setting, a learning agent repeatedly chooses from a set of $K$ actions after being presented with a $d$-dimensional context vector. The agent not only incurs and observes the loss of the chosen action, but also observes the losses of its neighboring actions in the observation structures, which are encoded as a series of feedback graphs. This setting models a variety of applications in social networks, where both contexts and graph-structured side observations are available. Two efficient algorithms are developed based on \texttt{EXP3}. Under mild conditions, our analysis shows that for undirected feedback graphs the first algorithm, \texttt{EXP3-LGC-U}, achieves the regret of order $\mathcal{O}(\sqrt{(K+α(G)d)T\log{K}})$ over the time horizon $T$, where $α(G)$ is the average \emph{independence number} of the feedback graphs. A slightly weaker result is presented for the directed graph setting as well. The second algorithm, \texttt{EXP3-LGC-IX}, is developed for a special class of problems, for which the regret is reduced to $\mathcal{O}(\sqrt{α(G)dT\log{K}\log(KT)})$ for both directed as well as undirected feedback graphs. Numerical tests corroborate the efficiency of proposed algorithms.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源