Paper Title

Selective Text Augmentation with Word Roles for Low-Resource Text Classification

Authors

Biyang Guo, Songqiao Han, Hailiang Huang

Abstract

Data augmentation techniques are widely used in text classification tasks to improve the performance of classifiers, especially in low-resource scenarios. Most previous methods conduct text augmentation without considering the different functionalities of the words in the text, which may generate unsatisfactory samples. Different words may play different roles in text classification, which inspires us to strategically select the proper roles for text augmentation. In this work, we first identify the relationships between the words in a text and the text category from the perspectives of statistical correlation and semantic similarity, and then use these relationships to divide the words into four roles: Gold, Venture, Bonus, and Trivial words, which have different functionalities for text classification. Based on these word roles, we present a new augmentation technique called STA (Selective Text Augmentation), in which different text-editing operations are selectively applied to words with specific roles. STA can generate diverse and relatively clean samples while preserving the original core semantics, and it is also quite simple to implement. Extensive experiments on five benchmark low-resource text classification datasets show that augmenting with samples produced by STA boosts the performance of classification models and significantly outperforms previous non-selective methods, including two large language model-based techniques. Cross-dataset experiments further indicate that STA helps classifiers generalize to other datasets better than previous methods do.
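
The role-based idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say how the two signals map onto the four roles, so the mapping used here (high/high = Gold, high correlation only = Venture, high similarity only = Bonus, low/low = Trivial) is an assumption. The per-word scores, the thresholds, and the function names `assign_roles` and `selective_delete` are likewise hypothetical. The only editing operation shown is deletion restricted to Trivial words, as one example of applying an operation selectively by role.

```python
import random
from typing import Dict, List


def assign_roles(corr: Dict[str, float], sim: Dict[str, float],
                 corr_thr: float = 0.5, sim_thr: float = 0.5) -> Dict[str, str]:
    """Assign one of four roles to each word from two per-word scores.

    ASSUMPTION: the mapping from (correlation, similarity) to roles and the
    0.5 thresholds are illustrative; the abstract does not specify them.
    """
    roles = {}
    for word, c in corr.items():
        high_corr = c >= corr_thr
        high_sim = sim.get(word, 0.0) >= sim_thr
        if high_corr and high_sim:
            roles[word] = "gold"
        elif high_corr:
            roles[word] = "venture"
        elif high_sim:
            roles[word] = "bonus"
        else:
            roles[word] = "trivial"
    return roles


def selective_delete(tokens: List[str], roles: Dict[str, str],
                     rng: random.Random, p: float = 0.5) -> List[str]:
    """Drop each Trivial word with probability p; keep all other words.

    Words without a role are treated as Trivial (an assumption of this sketch).
    If every token would be dropped, the original tokens are returned.
    """
    kept = [t for t in tokens
            if roles.get(t, "trivial") != "trivial" or rng.random() >= p]
    return kept or list(tokens)


# Hypothetical scores for a sports-category text.
corr = {"goal": 0.9, "referee": 0.8, "game": 0.2, "the": 0.1}
sim = {"goal": 0.9, "referee": 0.2, "game": 0.8, "the": 0.1}
roles = assign_roles(corr, sim)
augmented = selective_delete(["the", "referee", "saw", "the", "goal"],
                             roles, random.Random(0), p=0.5)
```

Because only words labeled Trivial are candidates for deletion, words carrying class-relevant signal always survive in the augmented sample, which matches the abstract's stated aim of preserving core semantics.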
