Paper Title

MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation

Paper Authors

Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, Arjun Guha, Michael Greenberg, Abhinav Jangda

Paper Abstract

Large language models have demonstrated the ability to generate both natural language and programming language text. Such models open up the possibility of multi-language code generation: could code generation models generalize knowledge from one language to another? Although contemporary code generation models can generate semantically correct Python code, little is known about their abilities with other languages. We propose MultiPL-E, a system for translating unit test-driven code generation benchmarks to new languages. Using MultiPL-E, we translate two popular Python benchmarks, HumanEval and MBPP, into 18 additional programming languages that encompass a range of programming paradigms and popularity, creating the first massively multilingual code generation benchmark. Using these new parallel benchmarks, we evaluate the multi-language performance of three state-of-the-art code generation models: Codex, CodeGen, and InCoder. We find that Codex matches or even exceeds its Python performance on several other languages. The range of programming languages represented in MultiPL-E allows us to explore the impact of language frequency and language features on model performance. Finally, the MultiPL-E approach of compiling code generation benchmarks to new programming languages is both scalable and extensible, making it straightforward to evaluate new models, benchmarks, and languages.
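To make the translation idea concrete, below is a minimal, hypothetical sketch of rendering a unit test-driven Python problem, given as a function name, parameter list, docstring, and input/output pairs, into a prompt and assertion-based test suite for a target language such as TypeScript. The helper name to_typescript and the problem encoding are assumptions for illustration; this is not MultiPL-E's actual API.

    import json

    def to_typescript(name: str, params: list[str], doc: str,
                      tests: list[tuple[list, object]]) -> str:
        """Render a TypeScript prompt plus assertion-based unit tests.

        Hypothetical sketch: real benchmark translation must also map
        types, idiomatic signatures, and language-specific values.
        """
        args = ", ".join(f"{p}: any" for p in params)
        lines = [
            f"// {doc}",
            f"function {name}({args}): any {{",
            "  // <model completion goes here>",
            "}",
            "",
        ]
        for inputs, expected in tests:
            # json.dumps yields literals that are valid in both Python
            # and TypeScript for simple values (numbers, strings, lists)
            call = f"{name}({', '.join(json.dumps(v) for v in inputs)})"
            lines.append(
                f"console.assert(JSON.stringify({call}) === "
                f"JSON.stringify({json.dumps(expected)}));"
            )
        return "\n".join(lines)

    print(to_typescript(
        "add", ["x", "y"], "Return the sum of x and y.",
        tests=[([1, 2], 3), ([0, 0], 0)],
    ))

Serializing test values through json.dumps is one convenient way to obtain literals that are also valid in many target languages; handling language-specific types and idiomatic signatures is the harder part of any real translation system.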
