Paper Title

CommonsenseQA 2.0: Exposing the Limits of AI through Gamification

Paper Authors

Alon Talmor, Ori Yoran, Ronan Le Bras, Chandra Bhagavatula, Yoav Goldberg, Yejin Choi, Jonathan Berant

Paper Abstract

Constructing benchmarks that test the abilities of modern natural language understanding models is difficult: pre-trained language models exploit artifacts in benchmarks to achieve human parity, but still fail on adversarial examples and make errors that demonstrate a lack of common sense. In this work, we propose gamification as a framework for data construction. The goal of players in the game is to compose questions that mislead a rival AI while using specific phrases for extra points. The game environment leads to enhanced user engagement and simultaneously gives the game designer control over the collected data, allowing us to collect high-quality data at scale. Using our method, we create CommonsenseQA 2.0, which includes 14,343 yes/no questions, and demonstrate its difficulty for models that are orders of magnitude larger than the AI used in the game itself. Our best baseline, the T5-based Unicorn with 11B parameters, achieves an accuracy of 70.2%, substantially higher than GPT-3 (52.9%) in a few-shot inference setup. Both score well below human performance, which is at 94.1%.
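
For illustration only, the minimal sketch below shows how accuracy on a yes/no QA benchmark such as CommonsenseQA 2.0 could be computed; the example records and the `predict` stub are hypothetical and do not reflect the dataset's actual schema or the paper's baseline implementations.

```python
# Minimal sketch: accuracy evaluation for yes/no commonsense questions.
# The example records and the trivial predictor below are hypothetical
# placeholders, not the dataset's real schema or the paper's baselines.

examples = [
    {"question": "Can a rock float on water by itself?", "answer": "no"},
    {"question": "Is a week longer than a day?", "answer": "yes"},
]

def predict(question: str) -> str:
    """Toy stand-in for a model such as Unicorn (T5-based) or few-shot GPT-3."""
    # A real baseline would score the "yes"/"no" answers with a pre-trained
    # language model; this stub simply answers "yes" for every question.
    return "yes"

correct = sum(predict(ex["question"]) == ex["answer"] for ex in examples)
accuracy = correct / len(examples)
print(f"Accuracy: {accuracy:.1%}")  # 50.0% on the toy examples above
```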
