Paper Title
InPars: Data Augmentation for Information Retrieval using Large Language Models
Paper Authors
Paper Abstract
The information retrieval community has recently witnessed a revolution due to large pretrained transformer models. Another key ingredient for this revolution was the MS MARCO dataset, whose scale and diversity have enabled zero-shot transfer learning to various tasks. However, not all IR tasks and domains can benefit from one single dataset equally. Extensive research in various NLP tasks has shown that using domain-specific training data, as opposed to general-purpose data, improves the performance of neural models. In this work, we harness the few-shot capabilities of large pretrained language models as synthetic data generators for IR tasks. We show that models finetuned solely on our unsupervised dataset outperform strong baselines such as BM25 as well as recently proposed self-supervised dense retrieval methods. Furthermore, retrievers finetuned on both supervised and our synthetic data achieve better zero-shot transfer than models finetuned only on supervised data. Code, models, and data are available at https://github.com/zetaalphavector/inpars.
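The core recipe described in the abstract, prompting a large language model with a few document-query examples so it can synthesize training queries for unlabeled documents, can be sketched roughly as follows. This is a minimal illustration under assumed prompt wording, example pairs, and an arbitrary small local model (`gpt2`); it is not the authors' exact prompts or model, which are available in the linked repository.

```python
# Minimal sketch of few-shot synthetic query generation for IR.
# Prompt format, in-context examples, and model choice are illustrative assumptions.
from transformers import pipeline

# The paper relies on a large pretrained language model; a small local model
# stands in here so the sketch stays self-contained.
generator = pipeline("text-generation", model="gpt2")

FEW_SHOT_PROMPT = """Example 1:
Document: The Eiffel Tower was completed in 1889 and stands 330 metres tall.
Relevant Query: how tall is the eiffel tower

Example 2:
Document: Photosynthesis converts light energy into chemical energy in plants.
Relevant Query: what does photosynthesis do

Example 3:
Document: {document}
Relevant Query:"""

def generate_query(document: str) -> str:
    """Generate a plausible search query for an unlabeled document."""
    prompt = FEW_SHOT_PROMPT.format(document=document)
    output = generator(prompt, max_new_tokens=32, do_sample=True, top_p=0.95)
    completion = output[0]["generated_text"][len(prompt):].strip()
    # Keep only the first line of the completion as the synthetic query.
    return completion.splitlines()[0] if completion else ""

# The resulting (synthetic query, document) pairs can then be used as
# positive training examples to finetune a retriever on the target corpus.
print(generate_query("MS MARCO is a large-scale dataset for machine reading comprehension."))
```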