Paper Title

CC-Riddle: A Question Answering Dataset of Chinese Character Riddles

Authors

Fan Xu, Yunxiang Zhang, Xiaojun Wan

Abstract

The Chinese character riddle is a unique form of cultural entertainment specific to the Chinese language. It typically comprises two parts: the riddle description and the solution. The solution is a single character, while the riddle description primarily describes the glyph of the solution, occasionally supplemented with its meaning and pronunciation. Solving Chinese character riddles is a challenging task that demands an understanding of character glyphs, general knowledge, and a grasp of figurative language. In this paper, we construct a Chinese Character riddle dataset named CC-Riddle, which covers the majority of common simplified Chinese characters. The construction process combines web crawling, language model generation, and manual filtering. In the generation stage, we input the Chinese phonetic alphabet (pinyin), glyph, and meaning of the solution character into the generation model, which then produces multiple riddle descriptions. The generated riddles are then manually filtered, and the final CC-Riddle dataset is composed of both human-written riddles and these filtered, generated riddles. To assess the performance of language models on the task of solving character riddles, we use retrieval-based, generative, and multiple-choice QA strategies to test three language models: BERT, ChatGPT, and ChatGLM. The test results reveal that current language models still struggle to solve Chinese character riddles. CC-Riddle is publicly available at https://github.com/pku0xff/CC-Riddle.
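
As a rough illustration of the multiple-choice QA setting mentioned in the abstract, the sketch below turns one riddle into a multiple-choice question and scores predictions by accuracy. It is a minimal sketch under stated assumptions, not the authors' evaluation code: the `Riddle` record, its field names, the helpers `build_multiple_choice` and `accuracy`, and the random choice of distractors are all hypothetical, and the example riddle is a commonly cited one used only for illustration, not necessarily an entry in CC-Riddle.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Riddle:
    """One riddle record. Field names are illustrative, not the dataset's schema."""
    description: str  # the riddle text
    solution: str     # the single-character answer


def build_multiple_choice(riddle, candidate_pool, num_options=4, seed=0):
    """Return (options, answer_index) with the solution mixed among distractors.

    Distractors are sampled uniformly at random here, which is an assumption;
    the abstract does not say how the paper selects them.
    """
    rng = random.Random(seed)
    # Deduplicate the pool and drop the solution so it appears exactly once.
    distractors = [c for c in dict.fromkeys(candidate_pool) if c != riddle.solution]
    options = rng.sample(distractors, num_options - 1) + [riddle.solution]
    rng.shuffle(options)
    return options, options.index(riddle.solution)


def accuracy(predicted_indices, answer_indices):
    """Fraction of questions whose predicted option index matches the answer."""
    if not answer_indices:
        return 0.0
    correct = sum(p == a for p, a in zip(predicted_indices, answer_indices))
    return correct / len(answer_indices)


if __name__ == "__main__":
    # Illustrative riddle only: "one plus one" read as stacked strokes.
    riddle = Riddle(description="一加一", solution="王")
    options, answer_index = build_multiple_choice(
        riddle, candidate_pool=["土", "干", "工", "三", "王"]
    )
    print(options, answer_index)
```

A model's answer for each question would be mapped to an option index and passed to `accuracy` together with the stored answer indices. The retrieval-based and generative settings would need different scoring and are not covered by this sketch.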
