Paper Title
Less is More: Parameter-Free Text Classification with Gzip
Paper Authors
Paper Abstract
Deep neural networks (DNNs) are often used for text classification tasks as they usually achieve high levels of accuracy. However, DNNs can be computationally intensive, with billions of parameters and a need for large amounts of labeled data, which can make them expensive to use, to optimize, and to transfer to out-of-distribution (OOD) cases in practice. In this paper, we propose a non-parametric alternative to DNNs that is easy, lightweight, and universal in text classification: a combination of a simple compressor like gzip with a $k$-nearest-neighbor classifier. Without any training, pre-training, or fine-tuning, our method achieves results that are competitive with non-pretrained deep learning methods on six in-distribution datasets. It even outperforms BERT on all five OOD datasets, including four low-resource languages. Our method also performs particularly well in few-shot settings, where labeled data are too scarce for DNNs to achieve satisfactory accuracy.
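The abstract describes the classifier only at a high level: a gzip compressor paired with a $k$-nearest-neighbor vote. The minimal Python sketch below illustrates one way such a compressor-based classifier can be assembled, assuming the normalized compression distance (NCD) as the similarity measure; the helper names (gzip_len, ncd, classify) and the toy data are illustrative and not taken from the authors' code.

import gzip
import numpy as np

def gzip_len(s: str) -> int:
    # Length of the gzip-compressed UTF-8 encoding of s.
    return len(gzip.compress(s.encode("utf-8")))

def ncd(x: str, y: str) -> float:
    # Normalized compression distance: how much shorter the pair compresses
    # together than the larger text alone (illustrative formulation).
    cx, cy = gzip_len(x), gzip_len(y)
    cxy = gzip_len(" ".join([x, y]))
    return (cxy - min(cx, cy)) / max(cx, cy)

def classify(test_text: str, train_texts: list[str], train_labels: list[str], k: int = 3) -> str:
    # Predict a label by majority vote among the k training texts with the
    # smallest compression distance to the test text.
    distances = [ncd(test_text, t) for t in train_texts]
    nearest = np.argsort(distances)[:k]
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)

# Toy usage with two classes and a tiny (hypothetical) training set.
train_texts = ["the match ended in a narrow win", "stocks fell sharply after the report"]
train_labels = ["sports", "finance"]
print(classify("the team won the final match", train_texts, train_labels, k=1))

Because the only "model" is the compressor itself, there are no parameters to train, which is what makes the approach attractive in low-resource and few-shot settings described above.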