arXiv:2504.15254

CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation

Published on Apr 21 · Submitted by anirudhkhatry on Apr 24
Abstract

C-to-Rust transpilation is essential for modernizing legacy C code while enhancing safety and interoperability with modern Rust ecosystems. However, no dataset currently exists for evaluating whether a system can transpile C into safe Rust that passes a set of test cases. We introduce CRUST-Bench, a dataset of 100 C repositories, each paired with manually-written interfaces in safe Rust as well as test cases that can be used to validate correctness of the transpilation. By considering entire repositories rather than isolated functions, CRUST-Bench captures the challenges of translating complex projects with dependencies across multiple files. The provided Rust interfaces provide explicit specifications that ensure adherence to idiomatic, memory-safe Rust patterns, while the accompanying test cases enforce functional correctness. We evaluate state-of-the-art large language models (LLMs) on this task and find that safe and idiomatic Rust generation is still a challenging problem for various state-of-the-art methods and techniques. We also provide insights into the errors LLMs usually make in transpiling code from C to safe Rust. The best performing model, OpenAI o1, is able to solve only 15 tasks in a single-shot setting. Improvements on CRUST-Bench would lead to improved transpilation systems that can reason about complex scenarios and help in migrating legacy codebases from C into languages like Rust that ensure memory safety. You can find the dataset and code at https://github.com/anirudhkhatry/CRUST-bench.

Community

Paper author and submitter:

🚀 Introducing CRUST-Bench, a dataset for C-to-Rust transpilation of full codebases 🛠️
A dataset of 100 real-world C repositories across various domains, each paired with:
🦀 Handwritten safe Rust interfaces.
🧪 Rust test cases to validate correctness.


Transpiling C to Rust helps modernize legacy code with memory safety guarantees. CRUST-Bench evaluates whether transpilation methods yield safe, idiomatic Rust, using handcrafted interfaces and tests to ensure safety and validate correctness.
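To make the setup concrete, here is a rough sketch of what such a task could look like (invented for this post, not an actual CRUST-Bench entry; the function name, signature, and test are hypothetical): the transpilation system receives a safe Rust interface derived from the C code and must produce an implementation that the bundled test accepts.

```rust
// Hypothetical sketch of a benchmark-style task (not taken from the dataset).

// Original C declaration, for reference:
//   int sum_positive(const int *xs, size_t n);  /* sum of the positive entries */

/// Safe Rust interface handed to the transpilation system: the (pointer, length)
/// pair becomes a borrowed slice, and the body is left to be filled in.
pub fn sum_positive(xs: &[i32]) -> i32 {
    // To be produced by the transpilation system.
    let _ = xs;
    unimplemented!()
}

// Accompanying test that enforces functional correctness of the generated code.
#[cfg(test)]
mod tests {
    use super::sum_positive;

    #[test]
    fn sums_only_positive_entries() {
        assert_eq!(sum_positive(&[1, -2, 3, 0]), 4);
        assert_eq!(sum_positive(&[]), 0);
    }
}
```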


Our benchmark is the first to provide:

  1. Rust tests
  2. Rust interfaces, which are necessary for the transpiled code to work with the tests
  3. A sizable number of realistic, repository-scale transpilation problems


We evaluate state-of-the-art closed-source LLMs (such as o1, Claude 3.7, and Gemini-1.5-Pro), open-source models such as QwQ-32B and Virtuoso-32B, and SWE-Agent on CRUST-Bench.
Even the best model—OpenAI's o1—passes only 15/100 tasks in a single-shot setting.

Language models often fail to (a toy Rust example follows the list):

  1. Respect ownership rules
  2. Infer type information
  3. Follow idiomatic Rust interfaces
  4. Preserve correct lifetimes
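
As a toy illustration of the ownership and lifetime points above (invented here, not drawn from the benchmark), a literal, pointer-style translation often produces code the borrow checker rejects, while the idiomatic fix transfers ownership instead of returning a borrow:

```rust
// Invented toy example of a common failure mode; not taken from the benchmark.

// A literal translation of `char *make_name(void)` that hands back a reference
// to a local buffer is rejected by the borrow checker:
//
//   fn make_name_broken<'a>() -> &'a str {
//       let name = String::from("crust");
//       &name  // error[E0515]: cannot return value referencing local variable `name`
//   }

// The idiomatic repair gives ownership of the String to the caller instead of
// borrowing from a value that is dropped at the end of the function.
fn make_name() -> String {
    String::from("crust")
}

fn main() {
    let name = make_name();
    println!("{name}");
}
```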

In the paper, we provide a taxonomy of common LLM mistakes.

Please read the full paper for more details.
