Evaluating Byte and Wordpiece Level Models for Massively Multilingual Semantic Parsing

Paper title

Evaluating Byte and Wordpiece Level Models for Massively Multilingual Semantic Parsing

Authors

Massimo Nicosia, Francesco Piccinno

Abstract

Token-free approaches have been successfully applied to a range of word- and span-level tasks. In this work, we compare a byte-level (ByT5) and a wordpiece-based (mT5) sequence-to-sequence model on the 51 languages of the MASSIVE multilingual semantic parsing dataset. We examine three experimental settings: (i) zero-shot, (ii) full gold data, and (iii) zero-shot with synthetic data. By leveraging a state-of-the-art label projection method for machine-translated examples, we reduce the gap in exact match accuracy to only 5 points with respect to a model trained on gold data from all the languages. We additionally provide insights into the cross-lingual transfer of ByT5 and show how the model compares with mT5 across all parameter sizes.
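
The abstract reports results as exact match accuracy, so a gap of "5 points" is a difference of 5 percentage points between two such scores. The sketch below illustrates the metric only and is not the authors' evaluation code. The function name, the whitespace normalization, and the example parse strings are assumptions made for illustration and do not reflect the actual MASSIVE annotation format.

```python
def exact_match_accuracy(predictions, references):
    """Return the percentage of predictions identical to their reference parse."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    # Assumption of this sketch: collapse runs of whitespace before comparing.
    matches = sum(
        " ".join(pred.split()) == " ".join(ref.split())
        for pred, ref in zip(predictions, references)
    )
    return 100.0 * matches / len(references)


# Hypothetical parses: an intent label followed by bracketed slots.
gold = [
    "alarm_set [time : nine am]",
    "weather_query [place : rome]",
    "play_music [artist : queen]",
    "alarm_set [time : six pm]",
]
predicted = [
    "alarm_set [time : nine am]",
    "weather_query [place : paris]",  # wrong slot value, so no credit at all
    "play_music [artist : queen]",
    "alarm_set [time : six pm]",
]

score = exact_match_accuracy(predicted, gold)
print(f"exact match accuracy: {score:.1f}")
```

Because a parse earns credit only when the whole output string matches, a single wrong slot value counts the same as a completely wrong prediction.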
