Paper Title
Mitigating Data Sparsity for Short Text Topic Modeling by Topic-Semantic Contrastive Learning
Paper Authors
Paper Abstract
To overcome the data sparsity issue in short text topic modeling, existing methods commonly rely on data augmentation or the data characteristic of short texts to introduce more word co-occurrence information. However, most of them do not make full use of the augmented data or the data characteristic: they insufficiently learn the relations among samples in data, leading to dissimilar topic distributions of semantically similar text pairs. To better address data sparsity, in this paper we propose a novel short text topic modeling framework, Topic-Semantic Contrastive Topic Model (TSCTM). To sufficiently model the relations among samples, we employ a new contrastive learning method with efficient positive and negative sampling strategies based on topic semantics. This contrastive learning method refines the representations, enriches the learning signals, and thus mitigates the sparsity issue. Extensive experimental results show that our TSCTM outperforms state-of-the-art baselines regardless of the data augmentation availability, producing high-quality topics and topic distributions.
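The abstract describes a contrastive objective over topic representations with positive and negative samples chosen by topic semantics. Below is a minimal, hedged sketch of what such an objective could look like in an InfoNCE-style form; the function name, tensor shapes, and the specific sampling of positives/negatives are illustrative assumptions, not the paper's exact TSCTM formulation.

```python
import torch
import torch.nn.functional as F

def topic_contrastive_loss(theta, theta_pos, theta_neg, temperature=0.07):
    """Illustrative contrastive loss over topic distributions (not the paper's exact objective).

    theta:     (B, K) topic distributions of anchor texts
    theta_pos: (B, K) topic distributions of positive samples
               (e.g., augmentations or texts judged topically similar)
    theta_neg: (B, M, K) topic distributions of negative samples
               (texts judged topically dissimilar)
    """
    theta = F.normalize(theta, dim=-1)
    theta_pos = F.normalize(theta_pos, dim=-1)
    theta_neg = F.normalize(theta_neg, dim=-1)

    # Similarity of each anchor to its positive sample: shape (B,)
    pos_sim = (theta * theta_pos).sum(dim=-1) / temperature
    # Similarity of each anchor to its negative samples: shape (B, M)
    neg_sim = torch.einsum('bk,bmk->bm', theta, theta_neg) / temperature

    # InfoNCE-style objective: pull anchors toward positives, push away from negatives.
    logits = torch.cat([pos_sim.unsqueeze(1), neg_sim], dim=1)  # (B, 1 + M)
    labels = torch.zeros(theta.size(0), dtype=torch.long, device=theta.device)
    return F.cross_entropy(logits, labels)
```

In this sketch, enforcing agreement between topic distributions of semantically similar pairs (and disagreement for dissimilar ones) is the mechanism by which contrastive learning can enrich the learning signal under sparse word co-occurrence, as the abstract claims.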