Paper Title
Detecting Languages Unintelligible to Multilingual Models through Local Structure Probes
Paper Authors
Paper Abstract
Providing better language tools for low-resource and endangered languages is imperative for equitable growth. Recent progress with massively multilingual pretrained models has proven surprisingly effective at performing zero-shot transfer to a wide variety of languages. However, this transfer is not universal, and many languages are not currently understood by multilingual approaches. It is estimated that only 72 languages possess a "small set of labeled datasets" on which we could test a model's performance; the vast majority of languages lack even the resources needed to evaluate performance at all. In this work, we attempt to clarify which languages do and do not currently benefit from such transfer. To that end, we develop a general approach that requires only unlabeled text to detect which languages are not well understood by a cross-lingual model. Our approach is derived from the hypothesis that if a model's understanding is insensitive to perturbations of text in a language, it likely has a limited understanding of that language. We construct a cross-lingual sentence similarity task to evaluate our approach empirically on 350 languages, primarily low-resource ones.
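The following is a minimal sketch of the perturbation-sensitivity idea the abstract describes, not the authors' implementation. It assumes xlm-roberta-base as the multilingual encoder, mean pooling over hidden states as the sentence embedding, and adjacent-character swaps as the local perturbation; all three are illustrative choices.

```python
# Sketch: probe whether a multilingual model reacts to local corruption of
# text. A sensitivity near zero suggests the model may not understand the
# language. Model choice, pooling, and perturbation are assumptions, not
# the paper's exact setup.
import random
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

def embed(sentence: str) -> torch.Tensor:
    """Mean-pooled last hidden states as a crude sentence embedding."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

def perturb(sentence: str, n_swaps: int = 3, seed: int = 0) -> str:
    """Locally corrupt the text by swapping a few adjacent characters."""
    rng = random.Random(seed)
    chars = list(sentence)
    for _ in range(n_swaps):
        if len(chars) < 2:
            break
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def sensitivity(sentence: str) -> float:
    """1 - cos(original, perturbed): small values mean the model's
    representation barely moves under corruption."""
    e_orig, e_pert = embed(sentence), embed(perturb(sentence))
    cos = torch.nn.functional.cosine_similarity(e_orig, e_pert, dim=0)
    return 1.0 - cos.item()

print(sensitivity("The quick brown fox jumps over the lazy dog."))
```

In practice one would average this score over many sentences per language and compare languages against each other, rather than reading anything into a single sentence.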