Paper Title
Establishing Baselines for Text Classification in Low-Resource Languages
Paper Authors
Abstract
While transformer-based finetuning techniques have proven effective in tasks that involve low-resource, low-data environments, a lack of properly established baselines and benchmark datasets makes it hard to compare different approaches that are aimed at tackling the low-resource setting. In this work, we provide three contributions. First, we introduce two previously unreleased datasets as benchmark datasets for text classification and low-resource multilabel text classification for the low-resource language Filipino. Second, we pretrain better BERT and DistilBERT models for use within the Filipino setting. Third, we introduce a simple degradation test that benchmarks a model's resistance to performance degradation as the number of training samples is reduced. We analyze our pretrained models' degradation speeds and look towards the use of this method for comparing models aimed at operating within the low-resource setting. We release all our models and datasets for the research community to use.
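The degradation test described above can be illustrated with a short sketch: retrain a model on progressively smaller fractions of the training set and record how the test metric falls. The classifier, dataset, and fractions below are illustrative stand-ins, not the paper's actual setup (which evaluates finetuned BERT/DistilBERT models on the Filipino benchmark datasets).

```python
# Hedged sketch of a degradation test: measure test accuracy as the
# number of training samples shrinks. Uses a simple scikit-learn model
# and toy dataset purely as placeholders.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def degradation_test(fractions=(1.0, 0.5, 0.25, 0.1), seed=0):
    """Return {training_fraction: test_accuracy} for each fraction."""
    X, y = load_digits(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    scores = {}
    for frac in fractions:
        # Subsample the (already shuffled) training split.
        n = max(1, int(len(X_tr) * frac))
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X_tr[:n], y_tr[:n])
        scores[frac] = accuracy_score(y_te, clf.predict(X_te))
    return scores


if __name__ == "__main__":
    for frac, acc in sorted(degradation_test().items(), reverse=True):
        print(f"train fraction {frac:.2f}: accuracy {acc:.3f}")
```

Comparing how quickly accuracy drops across the fractions gives a rough "degradation speed" for each model; a model better suited to low-resource settings should degrade more slowly as the training budget shrinks.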