Paper title
Detecting Text Formality: A Study of Text Classification Approaches
Paper authors
Paper abstract
Formality is one of the important characteristics of text documents. Automatically detecting the formality level of a text is potentially beneficial for various natural language processing tasks. Previously, two large-scale datasets featuring formality annotation were introduced for multiple languages: GYAFC and X-FORMAL. However, they were used primarily for training style transfer models. At the same time, detecting text formality on its own may also be a useful application. This work presents what is, to our knowledge, the first systematic study of formality detection methods based on statistical, neural, and Transformer-based machine learning approaches, and releases the best-performing models for public use. We conducted three types of experiments: monolingual, multilingual, and cross-lingual. The study shows that the Char BiLSTM model outperforms Transformer-based models on the monolingual and multilingual formality classification tasks, while Transformer-based classifiers are more robust in cross-lingual knowledge transfer.