Paper Title

Paying down metadata debt: learning the representation of concepts using topic models

Authors

Jiahao Chen, Manuela Veloso

Abstract

We introduce a data management problem called metadata debt, to identify the mapping between data concepts and their logical representations. We describe how this mapping can be learned using semisupervised topic models based on low-rank matrix factorizations that account for missing and noisy labels, coupled with sparsity penalties to improve localization and interpretability. We introduce a gauge transformation approach that allows us to construct explicit associations between topics and concept labels, and thus assign meaning to topics. We also show how to use this topic model for semisupervised learning tasks like extrapolating from known labels, evaluating possible errors in existing labels, and predicting missing features. We show results from this topic model in predicting subject tags on over 25,000 datasets from Kaggle.com, demonstrating the ability to learn semantically meaningful features.
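To make the low-rank factorization idea in the abstract concrete, the following is a minimal sketch of a masked nonnegative matrix factorization with an L1 sparsity penalty. It is not the authors' model: the multiplicative-update solver, the function name `masked_sparse_nmf`, the parameters `rank` and `lam`, and the synthetic data are all assumptions chosen for illustration, and the gauge transformation described in the abstract is not included.

```python
import numpy as np

def masked_sparse_nmf(X, mask, rank, lam=0.1, n_iter=200, seed=0):
    """Hypothetical helper: factor X ~ W @ H using only entries where mask == 1.

    Minimizes 0.5 * ||mask * (X - W @ H)||_F^2 + lam * (sum(W) + sum(H))
    with multiplicative updates, which keep W and H nonnegative.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, rank)) + 0.1   # strictly positive initialization
    H = rng.random((rank, m)) + 0.1
    eps = 1e-12                       # guards against division by zero
    MX = mask * X                     # observed entries only
    history = []
    for _ in range(n_iter):
        R = mask * (W @ H)            # reconstruction on observed entries
        W *= (MX @ H.T) / (R @ H.T + lam + eps)
        R = mask * (W @ H)
        H *= (W.T @ MX) / (W.T @ R + lam + eps)
        loss = 0.5 * np.sum((mask * (X - W @ H)) ** 2) + lam * (W.sum() + H.sum())
        history.append(loss)
    return W, H, history

# Synthetic example (assumed data): 30 "datasets" by 20 "tags", 3 latent topics.
rng = np.random.default_rng(42)
W_true = rng.random((30, 3))
H_true = rng.random((3, 20))
X = W_true @ H_true
mask = (rng.random(X.shape) > 0.2).astype(float)  # about 20% of entries hidden

W, H, history = masked_sparse_nmf(X, mask, rank=3, lam=0.05)
X_hat = W @ H  # entries where mask == 0 are the model's guesses for missing labels
```

In this sketch, rows of `H` play the role of topics and `X_hat` at masked positions is the extrapolation to unobserved labels. Larger values of `lam` push more entries of `W` and `H` toward zero, which is the localization effect the abstract attributes to sparsity penalties.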
