---
license: bsd-3-clause
tags:
- codet5
datasets:
- code_search_net
inference: true
---

# CodeT5-base for Code Summarization

[CodeT5-base](https://huggingface.co/Salesforce/codet5-base) model fine-tuned on CodeSearchNet data in a multi-lingual training setting (Ruby/JavaScript/Go/Python/Java/PHP) for code summarization. It was introduced in the EMNLP 2021 paper [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/abs/2109.00859) by Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. Please see [this repository](https://github.com/salesforce/CodeT5) for more details.

## How to use

Here is how to use this model:

```python
from transformers import RobertaTokenizer, T5ForConditionalGeneration

if __name__ == '__main__':
    # Load the code-specific BPE tokenizer and the fine-tuned model.
    tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base-multi-sum')
    model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base-multi-sum')

    text = """def svg_to_image(string, size=None):
    if isinstance(string, unicode):
        string = string.encode('utf-8')
    renderer = QtSvg.QSvgRenderer(QtCore.QByteArray(string))
    if not renderer.isValid():
        raise ValueError('Invalid SVG data.')
    if size is None:
        size = renderer.defaultSize()
    image = QtGui.QImage(size, QtGui.QImage.Format_ARGB32)
    painter = QtGui.QPainter(image)
    renderer.render(painter)
    return image"""

    # Tokenize the input code and generate a natural-language summary.
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    generated_ids = model.generate(input_ids, max_length=20)
    print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
    # this prints: "Convert a SVG string to a QImage."
```
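
If you prefer a higher-level interface, the same checkpoint should also work with the Transformers `text2text-generation` pipeline. The snippet below is a minimal sketch; the example function and the generation settings are illustrative and not taken from the original evaluation.

```python
from transformers import pipeline

# Wrap the fine-tuned checkpoint in a text2text-generation pipeline
# (a convenience alternative to calling tokenizer/model/generate manually).
summarizer = pipeline("text2text-generation", model="Salesforce/codet5-base-multi-sum")

code = "def greet(name):\n    return 'Hello, ' + name"
print(summarizer(code, max_length=20)[0]["generated_text"])
```
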
## Fine-tuning data

We employ the filtered version of the CodeSearchNet data [[Husain et al., 2019](https://arxiv.org/abs/1909.09436)] from the [CodeXGLUE](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Text/code-to-text) benchmark for fine-tuning on code summarization. The data is tokenized with our pre-trained code-specific BPE (Byte-Pair Encoding) tokenizer. One can prepare text (or code) for the model using RobertaTokenizer with the vocab files from [codet5-base](https://huggingface.co/Salesforce/codet5-base).

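As a quick illustration of the tokenization step (the snippet of code below is just an example input, not part of the fine-tuning data):

```python
from transformers import RobertaTokenizer

# The code-specific BPE tokenizer shipped with codet5-base;
# codet5-base-multi-sum uses the same vocabulary.
tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')

code = "def add(a, b):\n    return a + b"
print(tokenizer.tokenize(code))   # BPE subword tokens
print(tokenizer(code).input_ids)  # corresponding token ids
```
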
### Data statistics

| Programming Language | Training | Dev    | Test   |
| :------------------- | :------: | :----: | :----: |
| Python               | 251,820  | 13,914 | 14,918 |
| PHP                  | 241,241  | 12,982 | 14,014 |
| Go                   | 167,288  | 7,325  | 8,122  |
| Java                 | 164,923  | 5,183  | 10,955 |
| JavaScript           | 58,025   | 3,885  | 3,291  |
| Ruby                 | 24,927   | 1,400  | 1,261  |
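
For reference, the unfiltered CodeSearchNet corpus is also available on the Hugging Face Hub. The sketch below is only a rough pointer: the filtered CodeXGLUE splits used for fine-tuning differ from this raw corpus, the field names are those of the `code_search_net` dataset card, and loading script-based datasets may require `trust_remote_code=True` (or be unsupported) depending on your `datasets` version.

```python
from datasets import load_dataset

# Raw (unfiltered) CodeSearchNet, Ruby subset; the CodeXGLUE code-to-text
# splits used for fine-tuning are a filtered version of this corpus.
ruby = load_dataset("code_search_net", "ruby", split="train")
print(ruby[0]["func_code_string"][:200])           # function body
print(ruby[0]["func_documentation_string"][:200])  # docstring / summary target
```
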
## Training procedure

We fine-tune codet5-base on these six programming languages (Ruby/JavaScript/Go/Python/Java/PHP) in the multi-task learning setting. We employ balanced sampling to avoid biasing towards high-resource tasks. Please refer to the [paper](https://arxiv.org/abs/2109.00859) for more details.

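To make the balanced-sampling idea concrete, here is a minimal sketch of temperature-based multinomial sampling over the six languages; the exponent `alpha` and the helper function are illustrative assumptions rather than the exact recipe from the paper.

```python
import random

# Per-language training-set sizes from the table above.
sizes = {"python": 251820, "php": 241241, "go": 167288,
         "java": 164923, "javascript": 58025, "ruby": 24927}

def balanced_probs(sizes, alpha=0.7):
    """Exponentiate raw sizes with alpha < 1 so that low-resource languages
    (e.g. Ruby) are sampled more often than their raw share of the data."""
    weights = {lang: n ** alpha for lang, n in sizes.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

probs = balanced_probs(sizes)
langs, weights = zip(*probs.items())
# Draw the language of each training batch from the smoothed distribution.
print(random.choices(langs, weights=weights, k=8))
```
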
## Evaluation results

Unlike the paper, which selects a different best checkpoint for each programming language (PL), here we employ one checkpoint for all PLs. In addition, we remove the task control prefix that specifies the PL during training and inference. The results on the test set are shown below:

| Model | Ruby | JavaScript | Go | Python | Java | PHP | Overall |
| ----------- | :-------: | :--------: | :-------: | :-------: | :-------: | :-------: | :-------: |
| Seq2Seq | 9.64 | 10.21 | 13.98 | 15.93 | 15.09 | 21.08 | 14.32 |
| Transformer | 11.18 | 11.59 | 16.38 | 15.81 | 16.26 | 22.12 | 15.56 |
| [RoBERTa](https://arxiv.org/pdf/1907.11692.pdf) | 11.17 | 11.90 | 17.72 | 18.14 | 16.47 | 24.02 | 16.57 |
| [CodeBERT](https://arxiv.org/pdf/2002.08155.pdf) | 12.16 | 14.90 | 18.07 | 19.06 | 17.65 | 25.16 | 17.83 |
| [PLBART](https://aclanthology.org/2021.naacl-main.211.pdf) | 14.11 | 15.56 | 18.91 | 19.30 | 18.45 | 23.58 | 18.32 |
| [CodeT5-small](https://arxiv.org/abs/2109.00859) | 14.87 | 15.32 | 19.25 | 20.04 | 19.92 | 25.46 | 19.14 |
| [CodeT5-base](https://arxiv.org/abs/2109.00859) | **15.24** | 16.16 | 19.56 | 20.01 | **20.31** | 26.03 | 19.55 |
| [CodeT5-base-multi-sum](https://arxiv.org/abs/2109.00859) | **15.24** | **16.18** | **19.95** | **20.42** | 20.26 | **26.10** | **19.69** |
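
The numbers above are BLEU scores (the CodeXGLUE code-to-text task reports smoothed BLEU-4). As a rough sanity check only, and not the benchmark's official `bleu.py` script, a smoothed sentence-level BLEU-4 can be computed with NLTK:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "Convert a SVG string to a QImage .".split()
hypothesis = "Convert an SVG string into a QImage .".split()

# 4-gram BLEU with NLTK smoothing; the official CodeXGLUE bleu.py uses its own
# smoothing, so values will not match the table exactly.
score = sentence_bleu([reference], hypothesis,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method4)
print(round(100 * score, 2))
```
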
## Citation

```bibtex
@inproceedings{wang2021codet5,
  title={CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation},
  author={Yue Wang and Weishi Wang and Shafiq Joty and Steven C.H. Hoi},
  booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021},
  year={2021},
}
```