Paper Title
Contextual User Browsing Bandits for Large-Scale Online Mobile Recommendation
Paper Authors
Paper Abstract
Online recommendation services recommend multiple commodities to users. Nowadays, a considerable proportion of users visit e-commerce platforms via mobile devices. Due to the limited screen size of mobile devices, the positions of items have a significant influence on clicks: 1) higher positions lead to more clicks for the same commodity; 2) the 'pseudo-exposure' issue: only a few recommended items are shown at first glance, and users need to slide the screen to browse the others. Consequently, some lower-ranked recommended items are never viewed by users, and it is not appropriate to treat such items as negative samples. While many works model online recommendation as a contextual bandit problem, they rarely take the influence of positions into consideration, so the estimate of the reward function may be biased. In this paper, we aim to address these two issues to improve the performance of online mobile recommendation. Our contributions are four-fold. First, since we are concerned with the reward of a set of recommended items, we model online recommendation as a contextual combinatorial bandit problem and define the reward of a recommended set. Second, we propose a novel contextual combinatorial bandit method called UBM-LinUCB that addresses the two position-related issues by adopting the User Browsing Model (UBM), a click model for web search. Third, we provide a formal regret analysis and prove that our algorithm achieves sublinear regret independent of the number of items. Finally, we evaluate our algorithm on two real-world datasets with a novel unbiased estimator. An online experiment is also conducted on Taobao, one of the most popular e-commerce platforms in the world. Results on two CTR metrics show that our algorithm outperforms other contextual bandit algorithms.
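For concreteness, below is a minimal Python sketch of the idea the abstract describes: a LinUCB-style linear bandit whose updates are weighted by UBM examination probabilities, so that lower-ranked, likely-unseen items contribute little negative signal. This is an illustrative sketch, not the authors' implementation; the class name `UBMLinUCB`, the assumed-known examination table `gamma[k][kp]` (probability of examining position `k` given the last click was at position `kp`), and all parameter names are assumptions made for this example.

```python
import numpy as np

class UBMLinUCB:
    """Sketch of a UBM-weighted LinUCB (illustrative, not the paper's code).

    Under the User Browsing Model, the click probability of an item with
    context x shown at position k, when the user's last click was at
    position kp, is modeled as gamma[k][kp] * theta^T x, where theta^T x
    is the item's attractiveness. Here gamma is assumed known.
    """

    def __init__(self, d, gamma, alpha=1.0):
        self.d = d            # context-vector dimension
        self.gamma = gamma    # gamma[k][kp]: examination probabilities (assumed)
        self.alpha = alpha    # exploration coefficient
        self.A = np.eye(d)    # ridge-regression Gram matrix
        self.b = np.zeros(d)  # ridge-regression response vector

    def select(self, contexts, L):
        """Return indices of the L candidates with the highest UCB scores."""
        theta = np.linalg.solve(self.A, self.b)
        A_inv = np.linalg.inv(self.A)
        # UCB score: estimated attractiveness + confidence width.
        width = np.sqrt(np.einsum('ij,jk,ik->i', contexts, A_inv, contexts))
        ucb = contexts @ theta + self.alpha * width
        return np.argsort(-ucb)[:L]

    def update(self, contexts, positions, last_click_pos, clicks):
        """Weighted least-squares update over the shown list.

        Each shown item's context is scaled by its examination probability,
        so items the user probably never saw barely move the estimate.
        Indexing convention for positions (including the sentinel used when
        there is no earlier click) is an assumption of this sketch.
        """
        for x, k, kp, c in zip(contexts, positions, last_click_pos, clicks):
            w = self.gamma[k][kp]
            self.A += (w * w) * np.outer(x, x)
            self.b += w * c * x
```

Treating `gamma` as a fixed input keeps the sketch short; in practice such examination probabilities would be estimated from click logs, and the paper additionally defines the reward of the whole recommended set rather than scoring items in isolation.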