Paper Title
Top-N Recommendation Algorithms: A Quest for the State-of-the-Art
Paper Authors
Paper Abstract
Research on recommender systems algorithms, like other areas of applied machine learning, is largely dominated by efforts to improve the state-of-the-art, typically in terms of accuracy measures. Several recent research works, however, indicate that the improvements reported over the years sometimes "don't add up", and that methods published several years ago often outperform the latest models when evaluated independently. Different factors contribute to this phenomenon, including that some researchers probably only fine-tune their own models but not the baselines. In this paper, we report the outcomes of an in-depth, systematic, and reproducible comparison of ten collaborative filtering algorithms, covering both traditional and neural models, on several common performance measures and three datasets that are frequently used for evaluation in the recent literature. Our results show that there is no consistent winner across datasets and metrics for the examined top-N recommendation task. Moreover, we find that none of the considered neural models led to the best performance on any of the accuracy measures. Regarding the ranking of the algorithms across the measurements, we found that linear models, nearest-neighbor methods, and traditional matrix factorization consistently perform well on the evaluated modest-sized but commonly used datasets. Our work can therefore serve as a guideline for researchers regarding existing baselines to consider in future performance comparisons. Moreover, by providing a set of fine-tuned baseline models for different datasets, we hope that our work helps to establish a common understanding of the state-of-the-art for top-N recommendation tasks.
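To make the kind of evaluation described in the abstract concrete, the following is a minimal illustrative sketch (Python with NumPy), not the authors' actual code or evaluation framework: a simple item-based nearest-neighbor baseline produces top-N recommendation lists from a binary interaction matrix, and the lists are scored with Precision@N and NDCG@N against held-out test interactions. All names here (item_knn_scores, precision_ndcg_at_n, the toy train/test matrices, n) are hypothetical placeholders introduced only for illustration.

# Illustrative sketch of a top-N evaluation, not the paper's implementation.
import numpy as np

def item_knn_scores(train: np.ndarray) -> np.ndarray:
    """Predicted user-item scores from cosine item-item similarity."""
    norms = np.linalg.norm(train, axis=0, keepdims=True) + 1e-12
    normalized = train / norms
    sim = normalized.T @ normalized        # item x item cosine similarities
    np.fill_diagonal(sim, 0.0)             # ignore self-similarity
    return train @ sim                     # user x item predicted scores

def precision_ndcg_at_n(scores, train, test, n=10):
    """Average Precision@N and NDCG@N over users with test interactions."""
    scores = scores.copy()
    scores[train > 0] = -np.inf            # never recommend already-seen items
    precisions, ndcgs = [], []
    for u in range(train.shape[0]):
        relevant = set(np.flatnonzero(test[u]))
        if not relevant:
            continue
        top_n = np.argsort(-scores[u])[:n]
        hits = np.array([1.0 if i in relevant else 0.0 for i in top_n])
        precisions.append(hits.mean())
        dcg = np.sum(hits / np.log2(np.arange(2, len(top_n) + 2)))
        idcg = np.sum(1.0 / np.log2(np.arange(2, min(len(relevant), n) + 2)))
        ndcgs.append(dcg / idcg)
    return float(np.mean(precisions)), float(np.mean(ndcgs))

# Toy usage: 4 users x 5 items with binary implicit feedback.
train = np.array([[1, 1, 0, 0, 0],
                  [0, 1, 1, 0, 0],
                  [1, 0, 0, 1, 0],
                  [0, 0, 1, 1, 1]], dtype=float)
test = np.array([[0, 0, 1, 0, 0],
                 [0, 0, 0, 1, 0],
                 [0, 1, 0, 0, 0],
                 [1, 0, 0, 0, 0]], dtype=float)
scores = item_knn_scores(train)
print(precision_ndcg_at_n(scores, train, test, n=2))

In a comparison like the one the abstract describes, each candidate algorithm (traditional or neural) would simply replace item_knn_scores, while the data splits, candidate-item handling, and metrics stay fixed, which is what makes the ranking of algorithms across datasets and measures comparable.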