Paper Title

Sentiment Classification in Bangla Textual Content: A Comparative Study

Authors

Md. Arid Hasan, Jannatul Tajrin, Shammur Absar Chowdhury, Firoj Alam

Abstract

Sentiment analysis has been widely used to understand our views on social and political agendas or user experiences with a product. It is one of the core, well-researched areas in NLP. However, for low-resource languages like Bangla, one of the prominent challenges is the lack of resources. Another important limitation in the current literature for Bangla is the absence of comparable results, due to the lack of a well-defined train/test split. In this study, we explore several publicly available sentiment-labeled datasets and design classifiers using both classical and deep learning algorithms. The classical algorithms include SVM and Random Forest, and the deep learning algorithms include CNN, FastText, and transformer-based models. We compare these models in terms of model performance and time-resource complexity. Our findings suggest that transformer-based models, which have not been explored earlier for Bangla, outperform all other models. Furthermore, we created a weighted list of lexicon content based on the valence score per class. We then analyzed the content of the high-significance entries per class in the datasets. For reproducibility, we make the data splits and the ranked lexicon list publicly available. The presented results can be used as a benchmark for future studies.
