---
license: mit
inference:
  parameters:
    max_length: 1024
widget:
- text: "a b : ℕ\n⊢ a + b = b + a"
  example_title: "Example"
---

[LeanDojo: Theorem Proving with Retrieval-Augmented Language Models](https://arxiv.org/abs/xxxx.xxxxx)      
Under review, NeurIPS (Datasets and Benchmarks Track), 2023  
[Kaiyu Yang](https://yangky11.github.io/), [Aidan Swope](https://aidanswope.com/about), [Alex Gu](https://minimario.github.io/), [Rahul Chalamala](https://www.linkedin.com/in/rchalamala),  
[Peiyang Song](https://www.linkedin.com/in/peiyang-song-3279b3251/), [Shixing Yu](https://billysx.github.io/), [Saad Godil](https://www.linkedin.com/in/saad-godil-9728353/), [Ryan Prenger](https://www.linkedin.com/in/ryan-prenger-18797ba1/), [Anima Anandkumar](http://tensorlab.cms.caltech.edu/users/anima/)

```bibtex
@inproceedings{yang2023leandojo,
    title={{LeanDojo}: Theorem Proving with Retrieval-Augmented Language Models},
    author={Yang, Kaiyu and Swope, Aidan and Gu, Alex and Chalamala, Rahul and Song, Peiyang and Yu, Shixing and Godil, Saad and Prenger, Ryan and Anandkumar, Anima},
    booktitle={Neural Information Processing Systems (NeurIPS)},
    year={2023}
}
```

Please visit [LeanDojo Website](https://leandojo.org/) for details.

## Input Format

The model's input consists of retrieved premises concatenated with the current proof state and truncated to 2300 UTF-8 bytes. 
The proof state is formatted by Lean's pretty printer, the same as the input format of our [model w/o retrieval](https://huggingface.co/kaiyuy/leandojo-lean3-tacgen-byt5-small).
Retrieved premises are in the form of Lean code, except that proofs are removed and fully qualified names (marked by `<a>...</a>`) are used. Please see the example on the right.