MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation

Apr 17, 2023 · Donald Pinckney and 13 additional authors (equal contribution)

Publication: IEEE Transactions on Software Engineering, 49(7)

Large language models (LLMs) are blowing up the internet right now, both for casual natural language use and for programming tasks. ChatGPT, Codex, and other tools appear to code fairly well, but how well depends on which programming language! We designed and built MultiPL-E, a systematic and extensible system for fairly evaluating LLMs across a large number of programming languages (18!).

The key insight is that LLM programming benchmark suites (HumanEval, etc.) are written as Python unit tests, and those tests are (almost always) written in a small subset of Python that avoids features such as function definitions, loops, etc. Therefore, we were able to write trivial “compilers” that translate the Python unit tests into nearly any other language, and thereby obtain equivalent benchmark suites; a minimal sketch of the idea is shown below. This work was published in TSE 2023 and presented at ESEC/FSE 2023.
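
To make the idea concrete, here is a minimal sketch (not MultiPL-E's actual implementation) of how one HumanEval-style assertion could be mechanically translated to Lua. It assumes the test expressions only use calls, literals, and lists, and that the target test harness provides luaunit's `lu.assertEquals`; these names are illustrative assumptions, not details from the paper.

```python
import ast

def to_lua(node: ast.expr) -> str:
    """Translate the tiny expression subset used by the unit tests."""
    if isinstance(node, ast.Constant):
        if node.value is None:
            return "nil"
        if isinstance(node.value, bool):
            return "true" if node.value else "false"
        return repr(node.value)  # numbers and simple strings look the same in Lua
    if isinstance(node, ast.List):
        # Python lists become Lua tables.
        return "{" + ", ".join(to_lua(e) for e in node.elts) + "}"
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        args = ", ".join(to_lua(a) for a in node.args)
        return f"{node.func.id}({args})"
    raise ValueError(f"unsupported expression: {ast.dump(node)}")

def translate_assert(line: str) -> str:
    """Turn `assert candidate(...) == expected` into a Lua assertion."""
    stmt = ast.parse(line).body[0]
    assert isinstance(stmt, ast.Assert)
    test = stmt.test
    assert isinstance(test, ast.Compare) and isinstance(test.ops[0], ast.Eq)
    actual = to_lua(test.left)
    expected = to_lua(test.comparators[0])
    return f"lu.assertEquals({actual}, {expected})"

print(translate_assert("assert candidate([1, 2, 3], 2) == [2, 4, 6]"))
# -> lu.assertEquals(candidate({1, 2, 3}, 2), {2, 4, 6})
```

Because the benchmark tests avoid loops, helper functions, and other language-specific constructs, this kind of syntax-directed translation over the AST is essentially all that is needed; MultiPL-E's real translators cover more cases and more target languages, but the same simple approach applies.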