Title
Counterfactual Learning of Stochastic Policies with Continuous Actions
Authors
Abstract
Counterfactual reasoning from logged data has become increasingly important for many applications such as web advertising or healthcare. In this paper, we address the problem of learning stochastic policies with continuous actions from the viewpoint of counterfactual risk minimization (CRM). While the CRM framework is appealing and well studied for discrete actions, the continuous action case raises new challenges in modelling, optimization, and offline model selection with real data, which turns out to be particularly challenging. Our paper contributes to these three aspects of the CRM estimation pipeline. First, we introduce a modelling strategy based on a joint kernel embedding of contexts and actions, which overcomes the shortcomings of previous discretization approaches. Second, we empirically show that the optimization aspect of counterfactual learning is important, and we demonstrate the benefits of proximal point algorithms and smooth estimators. Finally, we propose an evaluation protocol for offline policies in real-world logged systems, which is challenging since policies cannot be replayed on test data, and we release a new large-scale dataset along with multiple synthetic, yet realistic, evaluation setups.
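To make the CRM viewpoint concrete, here is a minimal, hypothetical sketch of a clipped importance-sampling risk estimate for a Gaussian continuous-action policy on synthetic logged bandit data. It is not the paper's actual method: the data, the loss, the clipping constant, and the linear-mean policy (standing in for the paper's joint kernel embedding of contexts and actions) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical logged bandit data: contexts X, continuous actions a_log,
# the logging policy's action densities p0 (propensities), and losses.
n, d = 500, 5
X = rng.normal(size=(n, d))                       # contexts
a_log = rng.normal(size=n)                        # logged continuous actions
sigma0 = 1.0                                      # logging policy N(0, sigma0^2)
p0 = np.exp(-0.5 * (a_log / sigma0) ** 2) / (sigma0 * np.sqrt(2 * np.pi))
loss = (a_log - X[:, 0]) ** 2                     # synthetic loss for illustration

def gaussian_density(a, mean, sigma):
    """Density of N(mean, sigma^2) evaluated at a."""
    return np.exp(-0.5 * ((a - mean) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def clipped_ips_risk(theta, sigma=0.5, clip=10.0):
    """Clipped importance-sampling estimate of the risk of the stochastic
    policy pi_theta(a | x) = N(x @ theta, sigma^2)."""
    pi = gaussian_density(a_log, X @ theta, sigma)
    w = np.minimum(pi / p0, clip)                 # clipped importance weights
    return np.mean(w * loss)

theta = np.zeros(d)
risk = clipped_ips_risk(theta)
```

Minimizing such an estimate over `theta` (e.g. with a gradient-based or proximal point method, as the abstract suggests) is the counterfactual learning step; the clipping constant trades bias for variance.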