Files changed (1)
  1. README.md +76 -16
README.md CHANGED
@@ -8,19 +8,29 @@ widget:
  # CodeTrans model for program synthesis
- Pretrained model on programming language lisp inspired DSL using the t5 small model architecture. It was first released in
- [this repository](https://github.com/agemagician/CodeTrans).
-
- ## Model description
-
- This CodeTrans model is based on the `t5-small` model. It has its own SentencePiece vocabulary model. It used transfer-learning pre-training on 7 unsupervised datasets in the software development domain. It is then fine-tuned on the program synthesis task for the lisp inspired DSL code.
-
- ## Intended uses & limitations
-
- The model could be used to generate lisp inspired DSL code given the human language description tasks.
-
- ### How to use

  Here is how to use this model to generate lisp inspired DSL code using Transformers SummarizationPipeline:

@@ -42,20 +52,44 @@ Run this example in [colab notebook](https://github.com/agemagician/CodeTrans/bl
  The datasets for the supervised training tasks can be downloaded from [this link](https://www.dropbox.com/sh/488bq2of10r4wvw/AACs5CGIQuwtsD7j_Ls_JAORa/finetuning_dataset?dl=0&subfolder_nav_tracking=1)


  ## Training procedure

- ### Transfer-learning Pretraining

  The model was trained on a single TPU Pod V3-8 for 500,000 steps in total, using sequence length 512 (batch size 4096).
  It has a total of approximately 220M parameters and was trained using the encoder-decoder architecture.
  The optimizer used is AdaFactor with an inverse square root learning rate schedule for pre-training.

- ### Fine-tuning

  This model was then fine-tuned on a single TPU Pod V2-8 for 5,000 steps in total, using sequence length 512 (batch size 256), using only the dataset containing lisp inspired DSL data.


- ## Evaluation results

  For the code documentation tasks, different models achieve the following results on different programming languages (in BLEU score):

@@ -77,6 +111,32 @@ Test results :
  | State of the art | 85.80 |


- > Created by [Ahmed Elnaggar](https://twitter.com/Elnaggar_AI) | [LinkedIn](https://www.linkedin.com/in/prof-ahmed-elnaggar/) and Wei Ding | [LinkedIn](https://www.linkedin.com/in/wei-ding-92561270/)

  # CodeTrans model for program synthesis

+ ## Table of Contents
+ - [Model Details](#model-details)
+ - [How to Get Started With the Model](#how-to-get-started-with-the-model)
+ - [Uses](#uses)
+ - [Risks, Limitations and Biases](#risks-limitations-and-biases)
+ - [Training](#training)
+ - [Evaluation](#evaluation)
+ - [Environmental Impact](#environmental-impact)
+ - [Citation Information](#citation-information)
+
+ ## Model Details
+ - **Model Description:** This CodeTrans model is based on the `t5-small` model. It has its own SentencePiece vocabulary model. It was pre-trained with transfer learning on 7 unsupervised datasets in the software development domain and then fine-tuned on the program synthesis task for lisp inspired DSL code.
+ - **Developed by:** [Ahmed Elnaggar](https://www.linkedin.com/in/prof-ahmed-elnaggar/), [Wei Ding](https://www.linkedin.com/in/wei-ding-92561270/)
+ - **Model Type:** Summarization
+ - **Language(s):** English
+ - **License:** Unknown
+ - **Resources for more information:**
+   - [Research Paper](https://arxiv.org/pdf/2104.02443.pdf)
+   - [GitHub Repo](https://github.com/agemagician/CodeTrans)
+
+ ## How to Get Started With the Model

  Here is how to use this model to generate lisp inspired DSL code using Transformers SummarizationPipeline:
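A minimal sketch of that call is shown below. The checkpoint id is an assumption for illustration (substitute this repository's actual id), and the weights are loaded with `AutoModelForSeq2SeqLM`:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, SummarizationPipeline

# Assumed checkpoint id for illustration; replace with this repository's id.
checkpoint = "SEBIS/code_trans_t5_small_program_synthese"

pipeline = SummarizationPipeline(
    model=AutoModelForSeq2SeqLM.from_pretrained(checkpoint),
    tokenizer=AutoTokenizer.from_pretrained(checkpoint),
    device=-1,  # set to a GPU index (e.g. 0) to run on GPU
)

# A natural-language description of the task; the model generates lisp inspired DSL code.
description = "you are given an array of numbers a and a number b , compute the difference of elements in a and b"
print(pipeline([description])[0]["summary_text"])
```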

  The datasets for the supervised training tasks can be downloaded from [this link](https://www.dropbox.com/sh/488bq2of10r4wvw/AACs5CGIQuwtsD7j_Ls_JAORa/finetuning_dataset?dl=0&subfolder_nav_tracking=1)


+ ## Uses
+
+ #### Direct Use
+
+ The model can be used to generate lisp inspired DSL code from a human language description of the task.
+
+
+ ## Training
+
+ #### Training Data
+
+ The datasets for the supervised training tasks can be downloaded from [this link](https://www.dropbox.com/sh/488bq2of10r4wvw/AACs5CGIQuwtsD7j_Ls_JAORa/finetuning_dataset?dl=0&subfolder_nav_tracking=1)
+
+ The authors additionally provide notes about the vocabulary in the [associated paper](https://arxiv.org/pdf/2104.02443.pdf):
+
+ > We used the SentencePiece model (Kudo, 2018) to construct the vocabulary for this research, as well as to decode and encode the input/output.
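As a small illustration of that encode/decode round trip with the model's SentencePiece vocabulary (same assumed checkpoint id as in the sketch above):

```python
from transformers import AutoTokenizer

# Assumed checkpoint id for illustration; replace with this repository's id.
tokenizer = AutoTokenizer.from_pretrained("SEBIS/code_trans_t5_small_program_synthese")

text = "compute the difference of elements in a and b"
ids = tokenizer.encode(text)                            # SentencePiece sub-word ids
print(ids)
print(tokenizer.decode(ids, skip_special_tokens=True))  # decodes back to the original text
```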

  ## Training procedure

+ #### Preprocessing
+
+ ##### Transfer-learning Pretraining

  The model was trained on a single TPU Pod V3-8 for 500,000 steps in total, using sequence length 512 (batch size 4096).
  It has a total of approximately 220M parameters and was trained using the encoder-decoder architecture.
  The optimizer used is AdaFactor with an inverse square root learning rate schedule for pre-training.

+ ##### Fine-tuning

  This model was then fine-tuned on a single TPU Pod V2-8 for 5,000 steps in total, using sequence length 512 (batch size 256), using only the dataset containing lisp inspired DSL data.
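A rough sketch of that optimizer setup follows. The original training used T5/TPU tooling, so the Hugging Face `Adafactor` re-implementation below (whose relative-step mode applies an inverse-square-root decay) is only an approximation for illustration:

```python
import torch
from transformers.optimization import Adafactor, AdafactorSchedule

# Stand-in module; in practice this would be the ~220M-parameter T5 encoder-decoder.
model = torch.nn.Linear(512, 512)

# With relative_step=True, Adafactor derives an inverse-square-root step size internally,
# approximating the schedule described above.
optimizer = Adafactor(
    model.parameters(),
    lr=None,
    relative_step=True,
    scale_parameter=True,
    warmup_init=True,
)
lr_schedule = AdafactorSchedule(optimizer)  # proxy schedule exposing the computed learning rate
```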

+ ## Evaluation
+
+ #### Results

  For the code documentation tasks, different models achieve the following results on different programming languages (in BLEU score):

  | State of the art | 85.80 |
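BLEU scores such as those in the table can be computed with a standard implementation like `sacrebleu` (not part of the original card); a toy example comparing a generated program against its reference:

```python
import sacrebleu

# Toy strings for illustration only; a real evaluation scores the generated DSL
# programs of the whole test set against their references.
hypotheses = ["( + a b )"]
references = [["( + a b )"]]  # one reference stream, parallel to the hypotheses

print(sacrebleu.corpus_bleu(hypotheses, references).score)  # 100.0 for an exact match
```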

+ ## Environmental Impact
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). We present the hardware type based on the [associated paper](https://arxiv.org/pdf/2105.09680.pdf).
+
+ - **Hardware Type:** Nvidia RTX 8000 GPUs
+ - **Hours used:** Unknown
+ - **Cloud Provider:** GCP (TPU v2-8 and v3-8)
+ - **Compute Region:** Unknown
+ - **Carbon Emitted:** Unknown
+
+ ## Citation Information
+
+ ```bibtex
+ @misc{elnaggar2021codetrans,
+     title={CodeTrans: Towards Cracking the Language of Silicon's Code Through Self-Supervised Deep Learning and High Performance Computing},
+     author={Ahmed Elnaggar and Wei Ding and Llion Jones and Tom Gibbs and Tamas Feher and Christoph Angerer and Silvia Severini and Florian Matthes and Burkhard Rost},
+     year={2021},
+     eprint={2104.02443},
+     archivePrefix={arXiv},
+     primaryClass={cs.SE}
+ }
+ ```