Paper Title
Off-Policy Evaluation in Embedded Spaces
Authors
Abstract
Off-policy evaluation methods are important in recommendation systems and search engines, where data collected under an existing logging policy is used to estimate the performance of a newly proposed policy. A common approach to this problem is weighting, where data is weighted by a density ratio between the probabilities of actions given contexts under the target and logging policies. In practice, two issues often arise. First, many problems have very large action spaces and we may not observe rewards for most actions, so in finite samples we may encounter positivity violations. Second, many recommendation systems are not probabilistic, so access to the logging and target policy densities may not be feasible. To address these issues, we introduce the featurized embedded permutation weighting estimator. The estimator computes the density ratio in an action embedding space, which reduces the possibility of positivity violations. The density ratio is computed by leveraging recent advances in normalizing flows and in density ratio estimation as a classification problem, yielding estimates that are feasible in practice.
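The abstract names two ideas that a small example may make concrete: weighting logged data by a density ratio, and estimating that ratio with a classifier. The sketch below is a generic illustration of the classifier-based density ratio trick, not the paper's estimator. Everything in it is an assumption chosen for illustration: the one-dimensional Gaussian "embeddings", the toy reward (equal to the embedding), the plain logistic regression, and the names `fit_logistic` and `density_ratio`. It omits contexts, normalizing flows, and the permutation step entirely.

```python
import math
import random


def fit_logistic(zs, labels, lr=0.5, steps=200):
    """Fit p(label=1 | z) = sigmoid(a + b*z) by full-batch gradient descent."""
    a, b = 0.0, 0.0
    n = len(zs)
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for z, y in zip(zs, labels):
            p = 1.0 / (1.0 + math.exp(-(a + b * z)))
            grad_a += p - y
            grad_b += (p - y) * z
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def density_ratio(z, a, b):
    """Classifier odds c/(1-c) = exp(a + b*z).

    With equally many samples per class, the odds estimate
    p_target(z) / p_logging(z).
    """
    return math.exp(a + b * z)


rng = random.Random(0)
n = 2000

# Hypothetical 1-D action embeddings (illustrative assumption):
# the logging policy draws from N(0, 1), the target policy from N(0.5, 1).
logged = [rng.gauss(0.0, 1.0) for _ in range(n)]
target = [rng.gauss(0.5, 1.0) for _ in range(n)]

# Toy reward, observed only under the logging policy: reward = embedding.
# Under this assumption the true target-policy value is 0.5.
rewards = list(logged)

# Label logging samples 0 and target samples 1, then fit the classifier.
a, b = fit_logistic(logged + target, [0] * n + [1] * n)

# Reweight the logged rewards by the estimated density ratio.
weights = [density_ratio(z, a, b) for z in logged]
naive_value = sum(rewards) / n
weighted_value = sum(w * r for w, r in zip(weights, rewards)) / sum(weights)

print("naive average of logged rewards:", round(naive_value, 3))
print("weighted estimate of target value:", round(weighted_value, 3))
```

The weighted estimate is self-normalized (divided by the sum of the weights rather than by `n`), a common variance-reduction choice that is likewise an assumption here. In this toy setup the unweighted average stays near the logging policy's value of 0, while the weighted estimate moves toward the target policy's value of 0.5.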