Paper Title

TicketTalk: Toward human-level performance with end-to-end, transaction-based dialog systems

Authors

Bill Byrne, Karthik Krishnamoorthi, Saravanan Ganesh, Mihir Sanjay Kale

Abstract

We present a data-driven, end-to-end approach to transaction-based dialog systems that performs at near-human levels in terms of verbal response quality and factual grounding accuracy. We show that two essential components of the system produce these results: a sufficiently large and diverse, in-domain labeled dataset, and a neural network-based, pre-trained model that generates both verbal responses and API call predictions. In terms of data, we introduce TicketTalk, a movie ticketing dialog dataset with 23,789 annotated conversations. The movie ticketing conversations range from completely open-ended and unrestricted to more structured, in terms of their knowledge base, discourse features, and number of turns. In qualitative human evaluations, responses generated by a model trained on just 10,000 TicketTalk dialogs were rated to "make sense" 86.5 percent of the time, almost the same as human responses in the same contexts. Our simple, API-focused annotation schema results in a much easier labeling task, making it faster and more cost-effective. It is also the key component for being able to predict API calls accurately. We handle factual grounding by incorporating API calls in the training data, allowing our model to learn which actions to take and when. Trained on the same 10,000-dialog set, the model's API call predictions were rated to be correct 93.9 percent of the time in our evaluations, surpassing the ratings for the corresponding human labels. We show how API prediction and response generation scores improve as the dataset size incrementally increases from 5,000 to 21,000 dialogs. Our analysis also clearly illustrates the benefits of pre-training. We are publicly releasing the TicketTalk dataset with this paper to facilitate future work on transaction-based dialogs.
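The abstract describes a single model that produces both verbal responses and API calls, with API calls included directly in the training data. A minimal sketch of one way such data could be prepared is shown below. It is not the authors' pipeline: the speaker tags, the `build_examples` helper, the `find_movies` call, and the sample dialog are all hypothetical and chosen only for illustration. Each turn the model would have to produce (an assistant reply or an API call) becomes the target, and all preceding turns become the context.

```python
from typing import Dict, List, Tuple

# Hypothetical speaker tags; the dataset's actual markup may differ.
TAGS = {
    "user": "<U>",
    "assistant": "<A>",
    "api_call": "<C>",
    "api_response": "<R>",
}

# Turn types the model is expected to generate itself.
MODEL_OUTPUTS = {"assistant", "api_call"}


def serialize(turn: Dict[str, str]) -> str:
    """Render one turn as a tagged string."""
    return f"{TAGS[turn['speaker']]} {turn['text']}"


def build_examples(dialog: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Turn one dialog into (context, target) text pairs.

    A pair is emitted for every assistant reply and every API call,
    so the same model learns what to say and which action to take.
    """
    examples: List[Tuple[str, str]] = []
    history: List[str] = []
    for turn in dialog:
        line = serialize(turn)
        if turn["speaker"] in MODEL_OUTPUTS and history:
            examples.append((" ".join(history), line))
        history.append(line)
    return examples


# Invented sample dialog, not taken from the dataset.
dialog = [
    {"speaker": "user", "text": "Two tickets for a comedy tonight, please."},
    {"speaker": "api_call", "text": "find_movies(genre=comedy)"},
    {"speaker": "api_response", "text": "find_movies(name=Example Movie)"},
    {"speaker": "assistant", "text": "Example Movie is playing tonight. Shall I book it?"},
]

examples = build_examples(dialog)
for context, target in examples:
    print(context, "=>", target)
```

With this framing, API call prediction and response generation share one text-to-text format, so a single pre-trained sequence-to-sequence model could in principle be fine-tuned on both kinds of target at once.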
