Paper title

Topic modelling discourse dynamics in historical newspapers

Authors

Jani Marjanen, Elaine Zosa, Simon Hengchen, Lidia Pivovarova, Mikko Tolonen

Abstract

This paper addresses methodological issues in diachronic data analysis for historical research. We apply two families of topic models (LDA and DTM) to a relatively large set of historical newspapers, with the aim of capturing and understanding discourse dynamics. Our case study focuses on newspapers and periodicals published in Finland between 1854 and 1917, but our method can easily be transposed to any diachronic data. Our main contributions are a) a combined sampling, training and inference procedure for applying topic models to huge and imbalanced diachronic text collections; b) a discussion of the differences between two topic models for this type of data; c) quantifying topic prominence for a period and thus a generalization of document-wise topic assignment to a discourse level; and d) a discussion of the role of humanistic interpretation with regard to analysing discourse dynamics through topic models.
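
As a rough illustration of contribution (c), one simple way to move from document-wise topic assignments to a period-level view is to average each topic's proportion over all documents from the same year. The sketch below assumes that reading; the abstract does not specify the measure, so the averaging rule, the function name `topic_prominence`, and the toy data are all hypothetical and should not be taken as the authors' method.

```python
from collections import defaultdict


def topic_prominence(doc_years, doc_topics):
    """Average per-document topic proportions within each year.

    doc_years:  one publication year per document.
    doc_topics: one topic-proportion list per document (same topic order).
    Returns {year: [mean proportion of topic 0, topic 1, ...]}.
    This mean is an assumed prominence measure, not the paper's definition.
    """
    sums = {}
    counts = defaultdict(int)
    for year, dist in zip(doc_years, doc_topics):
        if year not in sums:
            sums[year] = [0.0] * len(dist)
        for k, p in enumerate(dist):
            sums[year][k] += p
        counts[year] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sorted(sums)}


# Toy data (hypothetical): three documents, two topics.
doc_years = [1854, 1854, 1917]
doc_topics = [[0.5, 0.5], [0.25, 0.75], [1.0, 0.0]]
prominence = topic_prominence(doc_years, doc_topics)
```

Averaging per year rather than summing keeps years with many documents from dominating, which matters for an imbalanced collection like the one described above.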
