Salesforce/codet5-base-multi-sum

+---
+license: bsd-3-clause
+tags:
+- codet5
+datasets:
+- code_search_net
+inference: true
+---
+# CodeT5 for code summarization (base-sized model)
+This is the [CodeT5-base](https://huggingface.co/Salesforce/codet5-base) model fine-tuned on the CodeSearchNet data
+from [Husain et al., 2019](https://arxiv.org/abs/1909.09436) for code summarization in a multilingual training setting
+(Ruby/JavaScript/Go/Python/Java/PHP). CodeT5 was introduced in the EMNLP 2021
+paper [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/abs/2109.00859)
+by Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. For more details, see
+[this repository](https://github.com/salesforce/CodeT5).
+## How to use
+Here is how to use this model:
+```python
+from transformers import RobertaTokenizer, T5ForConditionalGeneration
+
+if __name__ == '__main__':
+    # The tokenizer comes from the base CodeT5 checkpoint; the weights come from this fine-tuned checkpoint.
+    tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
+    model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base-multi-sum')
+    text = """def svg_to_image(string, size=None):
+    if isinstance(string, unicode):
+        string = string.encode('utf-8')
+        renderer = QtSvg.QSvgRenderer(QtCore.QByteArray(string))
+    if not renderer.isValid():
+        raise ValueError('Invalid SVG data.')
+    if size is None:
+        size = renderer.defaultSize()
+        image = QtGui.QImage(size, QtGui.QImage.Format_ARGB32)
+        painter = QtGui.QPainter(image)
+        renderer.render(painter)
+    return image"""
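+    # Encode the source code, then generate a summary of at most 20 tokens.
+    input_ids = tokenizer(text, return_tensors="pt").input_ids
+    generated_ids = model.generate(input_ids, max_length=20)
+    # Decode the generated token IDs back to text, dropping special tokens.
+    print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
+    # this prints: "Convert a SVG string to a QImage."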
+```
+## Fine-tuning data
+We use the filtered version of the CodeSearchNet data
+from the [CodeXGLUE](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Text/code-to-text) benchmark to fine-tune the model for
+code summarization. The data is tokenized with our pre-trained code-specific BPE (Byte-Pair Encoding) tokenizer. You can
+prepare text (or code) for the model using `RobertaTokenizer` with the vocabulary files
+from [codet5-base](https://huggingface.co/Salesforce/codet5-base).
+### Data statistics
+| Programming Language | Training |  Dev   |  Test  |
+| :------------------- | :------: | :----: | :----: |
+| Python               | 251,820  | 13,914 | 14,918 |
+| PHP                  | 241,241  | 12,982 | 14,014 |
+| Go                   | 167,288  | 7,325  | 8,122  |
+| Java                 | 164,923  | 5,183  | 10,955 |
+| JavaScript           |  58,025  | 3,885  | 3,291  |
+| Ruby                 |  24,927  | 1,400  | 1,261  |
+## Training procedure
+We fine-tune codet5-base on six programming languages (PLs): Ruby, JavaScript, Go, Python, Java, and PHP, in a multi-task learning setting. We use
+balanced sampling to avoid biasing the model towards high-resource tasks. Please refer to
+the [paper](https://arxiv.org/abs/2109.00859) for more details.
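
The card does not spell out the balanced sampling scheme, so the following is only a minimal sketch of one common approach: exponent-smoothed multinomial sampling, where each language is drawn with probability proportional to its share of the training data raised to a power `alpha`. The value `alpha=0.7` and the names `balanced_probs` and `sample_languages` are assumptions made for illustration, not the authors' confirmed settings; the training-set sizes are taken from the data statistics table.

```python
import random

# Training-set sizes per language, copied from the data statistics table.
TRAIN_SIZES = {
    "Python": 251_820,
    "PHP": 241_241,
    "Go": 167_288,
    "Java": 164_923,
    "JavaScript": 58_025,
    "Ruby": 24_927,
}


def balanced_probs(sizes, alpha=0.7):
    """Return sampling probabilities q_i proportional to (n_i / N) ** alpha.

    alpha=1.0 gives size-proportional sampling, alpha=0.0 gives uniform
    sampling. The default of 0.7 is an assumption for this sketch.
    """
    total = sum(sizes.values())
    weights = {lang: (n / total) ** alpha for lang, n in sizes.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


def sample_languages(sizes, k, alpha=0.7, seed=0):
    """Draw k language names, e.g. one per mini-batch (hypothetical helper)."""
    probs = balanced_probs(sizes, alpha)
    langs = list(probs)
    rng = random.Random(seed)
    return rng.choices(langs, weights=[probs[lang] for lang in langs], k=k)


if __name__ == "__main__":
    total = sum(TRAIN_SIZES.values())
    probs = balanced_probs(TRAIN_SIZES)
    for lang, n in TRAIN_SIZES.items():
        print(f"{lang:<11} proportional={n / total:.3f} balanced={probs[lang]:.3f}")
```

With any `alpha` below 1, low-resource languages such as Ruby are drawn more often than their raw share of the data, and high-resource languages such as Python less often, which is the effect the balanced sampling is meant to achieve.
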
+## Evaluation results
+In the paper, a different best checkpoint may be selected for each task; here we use a single checkpoint for
+all programming languages (PLs). In addition, we drop the prefix that specifies the PL during both training and inference. Results on the test set (smoothed BLEU-4) are shown below:
+| Model                                                     | Ruby  | JavaScript |  Go   | Python | Java  |  PHP  | Overall |
+| :-------------------------------------------------------- | :---: | :--------: | :---: | :----: | :---: | :---: | :-----: |
+| Seq2Seq                                                   | 9.64  |   10.21    | 13.98 | 15.93  | 15.09 | 21.08 |  14.32  |
+| Transformer                                               | 11.18 |   11.59    | 16.38 | 15.81  | 16.26 | 22.12 |  15.56  |
+| [RoBERTa](https://arxiv.org/pdf/1907.11692.pdf)           | 11.17 |   11.90    | 17.72 | 18.14  | 16.47 | 24.02 |  16.57  |
+| [CodeBERT](https://arxiv.org/pdf/2002.08155.pdf)          | 12.16 |   14.90    | 18.07 | 19.06  | 17.65 | 25.16 |  17.83  |
+| [PLBART](https://arxiv.org/pdf/2103.06333.pdf)            | 14.11 |   15.56    | 18.91 | 19.30  | 18.45 | 23.58 |  18.32  |
+| [CodeT5-small](https://arxiv.org/abs/2109.00859)          | 14.87 |   15.32    | 19.25 | 20.04  | 19.92 | 25.46 |  19.14  |
+| [CodeT5-base](https://arxiv.org/abs/2109.00859)           | 15.24 |   16.16    | 19.56 | 20.01  | 20.31 | 26.03 |  19.55  |
+| [CodeT5-base-multi-sum](https://arxiv.org/abs/2109.00859) | 15.24 |   16.18    | 19.95 | 20.42  | 20.26 | 26.10 |  19.69  |
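
The card does not say how the Overall column is computed, but the numbers are consistent with an unweighted mean of the six per-language scores. A small sketch to check this for two rows; the helper name `overall` is made up for illustration, and the scores are copied from the table.

```python
# Per-language test scores copied from the evaluation table.
SCORES = {
    "CodeT5-base": {
        "Ruby": 15.24, "JavaScript": 16.16, "Go": 19.56,
        "Python": 20.01, "Java": 20.31, "PHP": 26.03,
    },
    "CodeT5-base-multi-sum": {
        "Ruby": 15.24, "JavaScript": 16.18, "Go": 19.95,
        "Python": 20.42, "Java": 20.26, "PHP": 26.10,
    },
}


def overall(scores):
    """Unweighted mean over languages, rounded to two decimals."""
    return round(sum(scores.values()) / len(scores), 2)


if __name__ == "__main__":
    for model_name, per_language in SCORES.items():
        print(f"{model_name}: {overall(per_language)}")
    # CodeT5-base: 19.55
    # CodeT5-base-multi-sum: 19.69
```

Because the mean is unweighted, a small test set such as Ruby's counts as much towards the overall score as a large one such as Python's.
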
+## BibTeX entry and citation info
+```bibtex
+@inproceedings{
+    wang2021codet5,
+    title={CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation},
+    author={Yue Wang and Weishi Wang and Shafiq Joty and Steven C.H. Hoi},
+    booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021},
+    year={2021},
+}
+```