Paper Title
The Gutenberg Dialogue Dataset
Paper Authors
Paper Abstract
Large datasets are essential for neural modeling of many NLP tasks. Current publicly available open-domain dialogue datasets offer a trade-off between quality (e.g., DailyDialog) and size (e.g., OpenSubtitles). We narrow this gap by building a high-quality dataset of 14.8M utterances in English, and smaller datasets in German, Dutch, Spanish, Portuguese, Italian, and Hungarian. We extract and process dialogues from public-domain books made available by Project Gutenberg. We describe our dialogue extraction pipeline, analyze the effects of the various heuristics used, and present an error analysis of extracted dialogues. Finally, we conduct experiments showing that better response quality can be achieved in zero-shot and finetuning settings by training on our data than on the larger but much noisier OpenSubtitles dataset. Our open-source pipeline (https://github.com/ricsinaruto/gutenberg-dialog) can be extended to further languages with little additional effort. Researchers can also build their own versions of existing datasets by adjusting various trade-off parameters. We also built a web demo for interacting with our models: https://ricsinaruto.github.io/chatbot.html.
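To make the extraction step concrete, below is a minimal, hypothetical sketch of quote-based dialogue extraction from book text. It is not the authors' pipeline: it assumes straight double quotes delimit speech, treats consecutive quoted paragraphs as alternating turns, and omits the language-specific heuristics, filtering, and trade-off parameters described in the paper.

```python
import re

# Assumption: speech is delimited by straight double quotes.
# The real gutenberg-dialog pipeline handles per-language delimiters and
# additional filtering heuristics not shown in this sketch.
QUOTE_RE = re.compile(r'"([^"]+)"')

def extract_dialogues(paragraphs, min_turns=2):
    """Group consecutive paragraphs containing quoted speech into dialogues."""
    dialogues, current = [], []
    for para in paragraphs:
        utterances = QUOTE_RE.findall(para)
        if utterances:
            # Merge multiple quoted spans in one paragraph into a single turn.
            current.append(" ".join(u.strip() for u in utterances))
        else:
            # A narration-only paragraph ends the current dialogue.
            if len(current) >= min_turns:
                dialogues.append(current)
            current = []
    if len(current) >= min_turns:
        dialogues.append(current)
    return dialogues

if __name__ == "__main__":
    sample = [
        '"Good morning," said the doctor.',
        '"Good morning to you," replied the patient.',
        'They walked on in silence.',
    ]
    print(extract_dialogues(sample))
    # [['Good morning,', 'Good morning to you,']]
```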