Paper Title
MUG: Interactive Multimodal Grounding on User Interfaces
Paper Authors
Paper Abstract
We present MUG, a novel interactive task for multimodal grounding where a user and an agent work collaboratively on an interface screen. Prior work modeled multimodal UI grounding in one round: the user gives a command and the agent responds to it. Yet, in realistic scenarios, a user command can be ambiguous when the target action is inherently difficult to articulate in natural language. MUG allows multiple rounds of interaction so that, upon seeing the agent's response, the user can give further commands for the agent to refine or even correct its actions. Such interaction is critical for improving grounding performance in real-world use cases. To investigate the problem, we create a new dataset consisting of 77,820 sequences of human user-agent interaction on mobile interfaces, of which 20% involve multiple rounds of interaction. To establish our benchmark, we experiment with a range of modeling variants and evaluation strategies, including both offline and online evaluation; the online strategy comprises both human evaluation and automatic evaluation with simulators. Our experiments show that allowing iterative interaction significantly improves absolute task completion by 18% over the entire test dataset and by 31% over the challenging subset. Our results lay the foundation for further investigation of the problem.
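To make the multi-round protocol concrete, below is a minimal sketch (not the authors' code) of one MUG-style grounding episode as the abstract describes it: each round, the user issues a command conditioned on the agent's earlier guesses, the agent grounds the command to a UI element, and the episode ends when the target is selected or the round budget is exhausted. All names here (Screen, Turn, mug_episode, issue_command, select_element, max_rounds) are hypothetical placeholders, not the paper's API.

```python
# A minimal sketch of the multi-round interaction loop described in the
# abstract. All classes and method names are hypothetical placeholders.
from dataclasses import dataclass


@dataclass
class Screen:
    """A UI screen represented as a list of element descriptions."""
    elements: list


@dataclass
class Turn:
    command: str     # the user's natural-language instruction this round
    prediction: int  # index of the element the agent selected


def mug_episode(agent, user, screen, target, max_rounds=3):
    """Run one grounding episode for up to max_rounds rounds.

    Each round: the user issues a command (conditioned on the interaction
    history), the agent grounds it to a screen element, and the user either
    accepts the result or gives a corrective follow-up command next round.
    Returns True if the target element is selected within the budget.
    """
    history = []
    for _ in range(max_rounds):
        command = user.issue_command(screen, target, history)
        prediction = agent.select_element(screen, command, history)
        history.append(Turn(command, prediction))
        if prediction == target:
            return True   # task completed
    return False          # budget exhausted without success
```

In this reading, the one-round setting from prior work is simply max_rounds=1; the paper's reported gains come from letting the loop continue with corrective commands, evaluated either against human users or a user simulator.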