model documentation

#1
by nazneen - opened
---
tags:
- text2text-generation
- plbart
license: apache-2.0
---

# Model Card for plbart-java-cs

# Model Details

## Model Description

The PLBART model was proposed in [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. This checkpoint, plbart-java-cs, is fine-tuned for translating Java code to C#.

- **Developed by:** UCLA NLP
- **Shared by [Optional]:** Peerapong C.
- **Model type:** Text2Text Generation
- **Language(s) (NLP):** More information needed
- **License:** Apache 2.0
- **Parent Model:** BART (base); per the associated paper, PLBART uses the BART-base architecture
- **Resources for more information:**
  - [Associated Paper](https://arxiv.org/abs/2103.06333)
  - [Model Documentation](https://huggingface.co/docs/transformers/model_doc/plbart)

# Uses

## Direct Use

This model can be used for the task of Text2Text Generation; this checkpoint targets translating Java code to C#.

## Downstream Use [Optional]

More information needed.

## Out-of-Scope Use

The model should not be used to intentionally create hostile or alienating environments for people.

# Bias, Risks, and Limitations

Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.

## Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information needed for further recommendations.

# Training Details

## Training Data

More information needed

## Training Procedure

### Preprocessing

The model creators note in the [associated paper](https://arxiv.org/pdf/2103.06333.pdf):
> We tokenize all the data with a sentencepiece model (Kudo and Richardson, 2018) learned on 1/5’th of the pre-training data. We train sentencepiece to learn 50,000 subword tokens. One key challenge to aggregate data from different modalities is that some modalities may have more data, such as we have 14 times more data in PL than NL. Therefore, we mix and up/down sample the data following Conneau and Lample (2019) to alleviate the bias towards PL.
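The up/down-sampling described above can be sketched as smoothed multinomial sampling in the spirit of Conneau and Lample (2019); the smoothing exponent `alpha` below is an illustrative assumption, not a value reported in the paper.

```python
# Sketch: smoothed multinomial sampling over modalities, in the spirit of
# Conneau and Lample (2019). The exponent alpha is an illustrative choice,
# not a value reported in the PLBART paper.
def sampling_probs(counts, alpha=0.3):
    total = sum(counts.values())
    # Raw proportions p_i = n_i / N, then smoothed q_i proportional to p_i**alpha.
    smoothed = {k: (n / total) ** alpha for k, n in counts.items()}
    norm = sum(smoothed.values())
    return {k: v / norm for k, v in smoothed.items()}

# PL has ~14x more data than NL, so smoothing up-samples NL.
probs = sampling_probs({"PL": 14, "NL": 1})
```

Raising the proportions to a power below 1 flattens the distribution, so the NL portion is sampled more often than its raw share of the data.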

### Speeds, Sizes, Times

The model creators note in the [associated paper](https://arxiv.org/pdf/2103.06333.pdf):
> The effective batch size is maintained at 2048 instances.
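As a minimal illustration of the quoted figure, an effective batch size of 2048 can be assembled from per-device batches and gradient accumulation; the per-device batch size and device count below are hypothetical, not values from the paper.

```python
# Sketch: how an "effective batch size" of 2048 instances can be assembled.
# The per-device batch size, device count, and accumulation steps here are
# illustrative assumptions, not values reported in the PLBART paper.
def effective_batch_size(per_device_batch, num_devices, grad_accum_steps):
    return per_device_batch * num_devices * grad_accum_steps

ebs = effective_batch_size(32, 8, 8)  # 32 * 8 * 8 = 2048
```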

# Evaluation

## Testing Data, Factors & Metrics

### Testing Data

The model creators note in the [associated paper](https://arxiv.org/pdf/2103.06333.pdf):
> CodeXGLUE (Lu et al., 2021) provided public dataset and corresponding train validation-test splits for all the tasks

### Factors

More information needed

### Metrics

More information needed

## Results

More information needed

# Model Examination

More information needed

# Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** More information needed
- **Hours used:** More information needed
- **Cloud Provider:** More information needed
- **Compute Region:** More information needed
- **Carbon Emitted:** More information needed

# Technical Specifications [optional]

## Model Architecture and Objective

The [model documentation](https://huggingface.co/docs/transformers/model_doc/plbart) notes:
> PLBart is a multilingual encoder-decoder (sequence-to-sequence) model primarily intended for code-to-text, text-to-code, code-to-code tasks. As the model is multilingual it expects the sequences in a different format. A special language id token is added in both the source and target text. The source text format is X [eos, src_lang_code] where X is the source text.

The model creators note in the [associated paper](https://arxiv.org/pdf/2103.06333.pdf):
> PLBART uses the same architecture as BARTbase (Lewis et al., 2020), it uses the sequence-to-sequence Transformer architecture (Vaswani et al., 2017), with 6 layers of encoder and 6 layers of decoder with model dimension of 768 and 12 heads (∼140M parameters). The only exception is, we include an additional layer normalization layer on top of both the encoder and decoder following Liu et al. (2020).

## Compute Infrastructure

More information needed.
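The source format `X [eos, src_lang_code]` quoted above can be made concrete with a small sketch; the literal token strings below are illustrative placeholders, since in practice the tokenizer inserts the real special tokens.

```python
# Sketch of PLBART's multilingual source format, X [eos, src_lang_code].
# The "</s>" and "__java__" strings are illustrative placeholders; the real
# special tokens are handled by the tokenizer.
def format_source(tokens, src_lang_code):
    # X [eos, src_lang_code]: source tokens, then the end-of-sequence token,
    # then the source language id token.
    return tokens + ["</s>", src_lang_code]

seq = format_source(["static", "void", "main"], "__java__")
```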

### Hardware

More information needed.

### Software

More information needed.

# Citation

**BibTeX:**

```bibtex
@misc{ahmad2021unified,
  doi = {10.48550/ARXIV.2103.06333},
  url = {https://arxiv.org/abs/2103.06333},
  author = {Ahmad, Wasi Uddin and Chakraborty, Saikat and Ray, Baishakhi and Chang, Kai-Wei},
  keywords = {Computation and Language (cs.CL), Programming Languages (cs.PL), FOS: Computer and information sciences},
  title = {Unified Pre-training for Program Understanding and Generation},
  publisher = {arXiv},
  year = {2021},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```

# Glossary [optional]

The model creators note in the [associated paper](https://arxiv.org/pdf/2103.06333.pdf):

> CodeBLEU is a metric for measuring the quality of the synthesized code (Ren et al., 2020). Unlike BLEU, CodeBLEU also considers grammatical and logical correctness based on the abstract syntax tree and the data-flow structure.
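As a sketch of the definition above, CodeBLEU (Ren et al., 2020) combines n-gram BLEU, keyword-weighted BLEU, AST match, and data-flow match as a weighted sum; the component scores below are made-up inputs, since computing the real ones requires a code parser.

```python
# Sketch of how CodeBLEU combines its components (Ren et al., 2020): a
# weighted sum of n-gram BLEU, keyword-weighted BLEU, AST match, and
# data-flow match. The component scores passed in below are illustrative
# placeholders, not results from this model.
def code_bleu(bleu, weighted_bleu, ast_match, dataflow_match,
              weights=(0.25, 0.25, 0.25, 0.25)):
    a, b, c, d = weights
    return a * bleu + b * weighted_bleu + c * ast_match + d * dataflow_match

score = code_bleu(0.6, 0.65, 0.7, 0.8)
```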

# More Information [optional]

More information needed

# Model Card Authors [optional]

UCLA NLP in collaboration with Ezi Ozoani and the Hugging Face team

# Model Card Contact

More information needed

# How to Get Started with the Model

Use the code below to get started with the model.

<details>
<summary> Click to expand </summary>

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the Java-to-C# translation checkpoint.
tokenizer = AutoTokenizer.from_pretrained("uclanlp/plbart-java-cs")
model = AutoModelForSeq2SeqLM.from_pretrained("uclanlp/plbart-java-cs")
```
</details>