Sinhala Language Corpora and Stopwords from a Decade of Sri Lankan Facebook

Paper Title

Sinhala Language Corpora and Stopwords from a Decade of Sri Lankan Facebook

Authors

Yudhanjaya Wijeratne, Nisansa de Silva

Abstract

This paper presents two colloquial Sinhala language corpora from the language efforts of the Data, Analysis and Policy team of LIRNEasia, as well as a list of algorithmically derived stopwords. The larger of the two corpora spans 2010 to 2020 and contains 28,825,820 to 29,549,672 words of multilingual text posted by 533 Sri Lankan Facebook pages, including politics, media, celebrities, and other categories; the smaller corpus amounts to 5,402,76 words of only Sinhala text extracted from the larger. Both corpora have markers for their date of creation, page of origin, and content type.
