lvwerra HF staff committed on
Commit
fc6b64b
1 Parent(s): bca2f60

Update README.md

Files changed (1)
  1. README.md +113 -1
README.md CHANGED
@@ -183,4 +183,116 @@ model-index:
183
  verified: false
184
  ---
185
 
186
- # SantaCoder
+ # SantaCoder
187
+
188
+ ![banner](https://huggingface.co/datasets/bigcode/admin/resolve/main/banner.png)
189
+
190
+ # Table of Contents
191
+
192
+ 1. [Model Summary](#model-summary)
193
+ 2. [Use](#use)
194
+ 3. [Limitations](#limitations)
195
+ 4. [Training](#training)
196
+ 5. [Citation](#citation)
197
+
198
+ # Model Summary
199
+
200
+ The SantaCoder models are a series of 1B parameter models trained on Python, Java, and JavaScript. They were trained on datasets with different filter parameters and with architecture and objective variations. The main model uses multi-query attention and was trained with the Fill-in-the-Middle objective, using near-deduplication and comment-to-code ratio as filtering criteria.
201
+
202
+ - **Repository:** [bigcode/Megatron-LM](https://github.com/bigcode-project/Megatron-LM)
203
+ - **Project Website:** [bigcode-project.org](https://www.bigcode-project.org)
204
+ - **Paper:** [Coming soon]()
205
+ - **Point of Contact:** [contact@bigcode-project.org](mailto:contact@bigcode-project.org)
206
+ - **Languages:** Python, Java, and JavaScript
207
+
208
+ |Model|Architecture|Objective|Filtering|
209
+ |:-|:-|:-|:-|
210
+ |`mha`|MHA|AR + FIM| Base |
211
+ |`no-fim`| MQA | AR| Base |
212
+ |`fim`| MQA | AR + FIM | Base |
213
+ |`stars`| MQA | AR + FIM | GitHub stars |
214
+ |`fertility`| MQA | AR + FIM | Tokenizer fertility |
215
+ |`comments`| MQA | AR + FIM | Comment-to-code ratio |
216
+ |`dedup-alt`| MQA | AR + FIM | Stronger near-deduplication |
217
+ |`dedup-alt-comments`| MQA | AR + FIM | Stronger near-deduplication and comment-to-code ratio |
218
+
219
+ The `dedup-alt-comments` model is the best performing model and was trained for twice as long as the others. Its checkpoint is available here on the `main` branch.
220
+
221
+ # Use
222
+
223
+ ## Intended use
224
+
225
+
226
+
227
+ **Feel free to share your generations in the Community tab!**
228
+
229
+ ## How to use
230
+
231
+ ### Generation
232
+ ```python
+ # pip install -q transformers
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ checkpoint = "bigcode/santacoder"
+ device = "cuda"  # use "cpu" to run without a GPU
+
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+ # trust_remote_code=True lets transformers run the custom model code shipped with the checkpoint
+ model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True).to(device)
+
+ inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
+ outputs = model.generate(inputs)
+ print(tokenizer.decode(outputs[0]))
+ ```
246
+
247
+ ### Fill-in-the-middle
248
+ Fill-in-the-middle uses special tokens to mark the prefix, suffix, and middle parts of the input and output:
249
+
250
+ ```python
+ # Single quotes inside the prompt keep the Python string literal valid
+ input_text = "<fim-prefix>def print_hello_world():\n    <fim-suffix>\n    print('Hello world!')<fim-middle>"
+ inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
+ outputs = model.generate(inputs)
+ print(tokenizer.decode(outputs[0]))
+ ```
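
The prompt layout above can be wrapped in two small helpers. This is a minimal sketch rather than part of the SantaCoder code: `build_fim_prompt` and `extract_middle` are hypothetical names, and the sketch assumes that the decoded output repeats the prompt, followed by the generated middle and possibly an end-of-text marker (`<|endoftext|>` is an assumption; check `tokenizer.eos_token`).

```python
FIM_PREFIX = "<fim-prefix>"
FIM_SUFFIX = "<fim-suffix>"
FIM_MIDDLE = "<fim-middle>"
EOS = "<|endoftext|>"  # assumed end-of-text marker, not confirmed by this README


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Arrange prefix and suffix so that the model generates the missing middle."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


def extract_middle(decoded: str) -> str:
    """Return the text after the middle marker, cut at the end-of-text marker if present."""
    _, marker, middle = decoded.partition(FIM_MIDDLE)
    if not marker:
        raise ValueError("decoded text contains no <fim-middle> marker")
    return middle.split(EOS, 1)[0]


prompt = build_fim_prompt(
    "def print_hello_world():\n    ",
    "\n    print('Hello world!')",
)
```

Passing `prompt` to `tokenizer.encode` as in the example above and then calling `extract_middle` on the decoded output would isolate the infilled text, provided the special tokens are kept when decoding.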
256
+
257
+ ### Load other checkpoints
258
+ We upload the checkpoint of each experiment to a separate branch, and the intermediate checkpoints as commits on those branches. You can load them with the `revision` argument:
259
+
260
+ ```python
+ # Reuses the AutoModelForCausalLM import and `device` from the Generation example
+ checkpoint = "bigcode/santacoder"
+ revision = "no-fim"  # name of branch or commit hash
+
+ model = AutoModelForCausalLM.from_pretrained(checkpoint, revision=revision, trust_remote_code=True).to(device)
+ ```
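
To choose a revision by experiment name, the model table can be turned into a small lookup. This is only a sketch: `VARIANTS` and `revision_for` are hypothetical names, and it assumes that each branch is named exactly like the model in the table, except for `dedup-alt-comments`, which the summary places on `main`.

```python
# Rows copied from the model table: name -> (architecture, objective, filtering)
VARIANTS = {
    "mha": ("MHA", "AR + FIM", "Base"),
    "no-fim": ("MQA", "AR", "Base"),
    "fim": ("MQA", "AR + FIM", "Base"),
    "stars": ("MQA", "AR + FIM", "GitHub stars"),
    "fertility": ("MQA", "AR + FIM", "Tokenizer fertility"),
    "comments": ("MQA", "AR + FIM", "Comment-to-code ratio"),
    "dedup-alt": ("MQA", "AR + FIM", "Stronger near-deduplication"),
    "dedup-alt-comments": (
        "MQA",
        "AR + FIM",
        "Stronger near-deduplication and comment-to-code ratio",
    ),
}


def revision_for(variant: str) -> str:
    """Map a model name from the table to the revision to load (assumed naming)."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    # Assumption: the best model lives on `main`, every other model on its own branch
    return "main" if variant == "dedup-alt-comments" else variant
```

The result can be passed as `revision=revision_for("no-fim")` to `from_pretrained`; a commit hash still has to be given directly to reach an intermediate checkpoint.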
266
+
267
+ ### Attribution
268
+
269
+ The pretraining dataset of the model was filtered for permissive licenses only. Nevertheless, the model can reproduce source code from the dataset verbatim, and such code requires attribution. We provide a [search index](TODO) that lets you search through the pretraining data to identify where generated code came from and apply the proper attribution to your code.
270
+
271
+ # Limitations
272
+
273
+ The model has been trained on source code in Python, Java, and JavaScript. The predominant natural language in the source is English, although other languages are also present. As such, the model is capable of generating code snippets given some context, but the generated code is not guaranteed to work as intended. It can be inefficient and can contain bugs or exploits.
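
Because generated code is not guaranteed to work, a cheap first check for Python output is to see whether it parses at all. The sketch below uses only the standard library; `is_valid_python` is a hypothetical helper name, and passing this check says nothing about bugs, efficiency, or security.

```python
import ast


def is_valid_python(source: str) -> bool:
    """Return True if `source` parses as Python; this checks syntax only."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True
```

A generation that fails this check can be discarded or regenerated; one that passes still needs review and testing before use.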
274
+
275
+ # Training
276
+
277
+ ## Model
278
+
279
+ - **Architecture:** GPT-2 model with multi-query attention and Fill-in-the-Middle objective
280
+ - **Pretraining steps:** 600K
281
+ - **Pretraining tokens:** 236 billion
282
+ - **Precision:** float16
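
As a rough consistency check on the figures above, dividing the token count by the step count gives the average number of tokens consumed per training step. Batch size and sequence length are not stated here, so this is only a derived estimate.

```python
pretraining_steps = 600_000  # 600K steps, from the list above
pretraining_tokens = 236e9   # 236 billion tokens, from the list above

tokens_per_step = pretraining_tokens / pretraining_steps
print(f"{tokens_per_step:,.0f} tokens per step")  # about 393 thousand
```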
283
+
284
+ ## Hardware
285
+
286
+ - **GPUs:** 96 Tesla V100
287
+ - **Training time:** 6.2 days
288
+ - **Total FLOPs:** 2.1 x 10^21
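
The hardware figures above can be combined into two derived quantities. This is a back-of-the-envelope sketch, assuming that the total compute is read as 2.1 × 10^21 floating-point operations and that all 96 GPUs ran for the full 6.2 days; the variable names are illustrative.

```python
num_gpus = 96
training_days = 6.2
total_flop = 2.1e21  # assumption: the total is read as 2.1 x 10^21 operations

gpu_hours = num_gpus * training_days * 24
gpu_seconds = gpu_hours * 3600

# Average sustained throughput per GPU implied by the numbers above
flop_per_gpu_second = total_flop / gpu_seconds

print(f"{gpu_hours:,.1f} GPU-hours")                       # about 14,285
print(f"{flop_per_gpu_second / 1e12:.1f} TFLOP/s per GPU")  # about 40.8
```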
289
+
290
+ ## Software
291
+
292
+ - **Orchestration:** [Megatron-LM](https://github.com/bigcode-project/Megatron-LM)
293
+ - **Neural networks:** [PyTorch](https://github.com/pytorch/pytorch)
294
+ - **FP16 if applicable:** [apex](https://github.com/NVIDIA/apex)
295
+
296
+
297
+ # Citation
298
+ **TODO**