Paper Title
Data-Efficient Pretraining via Contrastive Self-Supervision
Paper Authors
Paper Abstract
For natural language processing `text-to-text' tasks, the prevailing approaches heavily rely on pretraining large self-supervised models on increasingly larger `task-external' data. Transfer learning from high-resource pretraining works well, but research has focused on settings with very large data and compute requirements, while the potential of efficient low-resource learning, without large `task-external' pretraining, remains under-explored. In this work, we evaluate against three core challenges for resource-efficient learning. Namely, we analyze: (1) pretraining data ($X$) efficiency; (2) zero to few-shot label ($Y$) efficiency; and (3) long-tail generalization, since long-tail preservation has been linked to algorithmic fairness and because data in the tail is limited by definition. To address these challenges, we propose a data- and compute-efficient self-supervised, contrastive text encoder, pretrained on 60MB of `task-internal' text data, and compare it to RoBERTa, which was pretrained on 160GB of `task-external' text. We find our method outperforms RoBERTa, while pretraining and fine-tuning in 1/5th of RoBERTa's fine-tuning time.
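The abstract does not spell out the exact pretraining objective, so the following is only a minimal, generic sketch of an in-batch contrastive (InfoNCE-style) loss of the kind commonly used for contrastive self-supervised text encoders; the function name info_nce_loss, the temperature value, and the tensor shapes are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def info_nce_loss(anchor_emb, positive_emb, temperature=0.1):
    # Normalize embeddings so dot products become cosine similarities.
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    # Similarity matrix: entry (i, j) compares anchor i with positive j.
    logits = anchor @ positive.t() / temperature
    # The matching pair for anchor i sits on the diagonal (index i);
    # the other in-batch items act as negatives.
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)

# Toy usage: random tensors stand in for encoder outputs of texts
# and their positive views (e.g., task-internal pseudo-label signals).
if __name__ == "__main__":
    torch.manual_seed(0)
    a = torch.randn(8, 128)
    p = torch.randn(8, 128)
    print(info_nce_loss(a, p).item())

The key property of such an objective is that it needs no task-external corpus: positives and negatives are formed from the task's own text (and label) representations, which is what makes pretraining on only 60MB of task-internal data plausible.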