--- license: bsd-3-clause --- # CodeT5 (large-size model 770M) ## Model description CodeT5 is a family of encoder-decoder language models for code from the paper: [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/pdf/2109.00859.pdf) by Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. The checkpoint included in this repository is denoted as **CodeT5-large** (770M), which is introduced by the paper: [CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning](https://arxiv.org/pdf/2207.01780.pdf) by Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, Steven C.H. Hoi. ## Training data CodeT5-large was pretrained on [CodeSearchNet](https://arxiv.org/abs/1909.09436) data in six programming languages (Ruby/JavaScript/Go/Python/Java/PHP). See Section 4.1 of the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details. ## Training procedure CodeT5-large was pretrained using masked span prediction objective for 150 epochs. See Section 4.1 of the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details. ## Evaluation results We validate the effectiveness of this checkpoint pretrained with simplified strategies on [CodeXGLUE](https://github.com/microsoft/CodeXGLUE) benchmark. See Appendix A.1 of the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details. ## How to use This model can be easily loaded using the `T5ForConditionalGeneration` functionality: ```python from transformers import AutoTokenizer, T5ForConditionalGeneration tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-large") model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-large") text = "def greet(user): print(f'hello !')" input_ids = tokenizer(text, return_tensors="pt").input_ids # simply generate a single sequence generated_ids = model.generate(input_ids, max_length=8) print(tokenizer.decode(generated_ids[0], skip_special_tokens=True)) ``` ## BibTeX entry and citation info ```bibtex @inproceedings{CodeT52021, author = {Yue Wang and Weishi Wang and Shafiq R. Joty and Steven C. H. Hoi}, title = {CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation}, booktitle = {EMNLP}, pages = {8696--8708}, publisher = {Association for Computational Linguistics}, year = {2021} } @article{CodeRL2022 author = {Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, Steven C.H. Hoi}, title = {CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning}, journal = {arXiv preprint}, volume = {abs/2207.01780}, year = {2022} } ```