arxiv:2406.07436

McEval: Massively Multilingual Code Evaluation

Published on Jun 11
Submitted by Challenging666 on Jun 12
#3 Paper of the day
Abstract

Code large language models (LLMs) have shown remarkable advances in code understanding, completion, and generation tasks. Programming benchmarks, composed of a selection of code challenges and corresponding test cases, serve as a standard for evaluating the capabilities of different LLMs on such tasks. However, most existing benchmarks focus primarily on Python and remain restricted to a limited number of languages, with samples in other languages translated from the Python ones (e.g., MultiPL-E), which degrades data diversity. To further facilitate research on code LLMs, we propose McEval, a massively multilingual code benchmark covering 40 programming languages with 16K test samples, which substantially pushes the limits of code LLMs in multilingual scenarios. The benchmark contains challenging code completion, understanding, and generation tasks, together with the finely curated, massively multilingual instruction corpus McEval-Instruct. In addition, we introduce mCoder, an effective multilingual coder trained on McEval-Instruct to support multilingual program generation. Extensive experimental results on McEval show that open-source models still lag notably behind closed-source LLMs (e.g., GPT-series models) across numerous languages. The instruction corpora, evaluation benchmark, and leaderboard are available at https://mceval.github.io/.
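
To make the evaluation setup concrete, the sketch below shows how a test-case-based code benchmark of this kind is typically scored: a generated solution is executed against the task's test cases, and correctness counts are turned into a pass@k estimate. This is a minimal illustration, not the official McEval harness; the Python-only runner, the function names, and the lack of proper sandboxing are assumptions made for brevity.

```python
# Minimal sketch (NOT the official McEval harness): run a generated solution
# against its test cases and estimate pass@k from repeated samples.
import os
import subprocess
import tempfile
from math import comb


def run_candidate(candidate_code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Execute a candidate solution plus its tests; True if all tests pass."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        proc = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.remove(path)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): n samples, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

In practice, benchmarks like this run each candidate in an isolated sandbox with per-language toolchains rather than a bare `subprocess` call, but the scoring logic is the same.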

Community

Paper author · Paper submitter · edited Jun 12

This paper proposes McEval, a large-scale multilingual, multi-task code evaluation benchmark that is not derived from HumanEval or MBPP. It covers 40 programming languages and includes multilingual code generation, code explanation, and code completion tasks, with 16K test samples.

The paper evaluates 20+ popular LLMs on McEval. The experimental results show that open-source models still trail GPT-4 by a large margin in multilingual programming capability, and most are not yet comparable to GPT-3.5. Among open-source models, Codestral, DeepSeek-Coder, CodeQwen, and their derivative models perform strongly on the leaderboard.
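
For readers curious how a single leaderboard number can summarize results across 40 languages, the sketch below macro-averages per-language pass rates. The aggregation scheme and the result format are assumptions for illustration only, not the paper's actual scoring script.

```python
# Hypothetical aggregation: average pass rate per language, then average the
# per-language scores so each of the 40 languages is weighted equally.
from collections import defaultdict


def macro_average(results: list[dict]) -> float:
    """results: [{"language": "Rust", "passed": True}, ...] (illustrative schema)."""
    per_lang = defaultdict(list)
    for r in results:
        per_lang[r["language"]].append(1.0 if r["passed"] else 0.0)
    lang_scores = {lang: sum(v) / len(v) for lang, v in per_lang.items()}
    return sum(lang_scores.values()) / len(lang_scores)
```

Macro-averaging keeps low-resource languages from being drowned out by languages with many more test samples, which matters for a benchmark whose point is multilingual coverage.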

