Paper title
Sinhala Language Corpora and Stopwords from a Decade of Sri Lankan Facebook
Paper authors
Paper abstract
This paper presents two colloquial Sinhala language corpora from the language efforts of the Data, Analysis and Policy team of LIRNEasia, as well as a list of algorithmically derived stopwords. The larger of the two corpora spans 2010 to 2020 and contains 28,825,820 to 29,549,672 words of multilingual text posted by 533 Sri Lankan Facebook pages, including politics, media, celebrities, and other categories; the smaller corpus amounts to 5,402,76 words of only Sinhala text extracted from the larger. Both corpora have markers for their date of creation, page of origin, and content type.