File size: 2,203 Bytes
5fd9ae2 e7e4f5d 5fd9ae2 505bd87 6d1687a 505bd87 6d1687a 505bd87 6d1687a 505bd87 6d1687a 505bd87 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 |
---
language:
- code
license: apache-2.0
tags:
- code
- gpt2
- generation
datasets:
- "codeparrot/codeparrot-clean"
- "openai_humaneval"
metrics:
- "evaluate-metric/code_eval"
---
# CodeParrot 🦜 (small)
CodeParrot 🦜 is a GPT-2 model (110M parameters) trained to generate Python code.
## Usage
You can load the CodeParrot model and tokenizer directly in `transformers`:
```Python
from transformers import AutoTokenizer, AutoModelWithLMHead
tokenizer = AutoTokenizer.from_pretrained("codeparrot/codeparrot-small")
model = AutoModelWithLMHead.from_pretrained("codeparrot/codeparrot-small")
inputs = tokenizer("def hello_world():", return_tensors="pt")
outputs = model(**inputs)
```
or with a `pipeline`:
```Python
from transformers import pipeline
pipe = pipeline("text-generation", model="codeparrot/codeparrot-small")
outputs = pipe("def hello_world():")
```
## Training
The model was trained on the cleaned [CodeParrot 🦜 dataset](https://huggingface.co/datasets/codeparrot/codeparrot-clean) with the following settings:
|Config|Value|
|-------|-----|
|Batch size| 192 |
|Context size| 1024 |
|Training steps| 150'000|
|Gradient accumulation| 1|
|Gradient checkpointing| False|
|Learning rate| 5e-4 |
|Weight decay | 0.1 |
|Warmup steps| 2000 |
|Schedule| Cosine |
The training was executed on 16 x A100 (40GB) GPUs. This setting amounts to roughly 29 billion tokens.
## Performance
We evaluated the model on OpenAI's [HumanEval](https://huggingface.co/datasets/openai_humaneval) benchmark which consists of programming challenges:
| Metric | Value |
|-------|-----|
|pass@1 | 3.80% |
|pass@10 | 6.57% |
|pass@100 | 12.78% |
The [pass@k metric](https://huggingface.co/metrics/code_eval) tells the probability that at least one out of k generations passes the tests.
## Resources
- Dataset: [full](https://huggingface.co/datasets/codeparrot/codeparrot-clean), [train](https://huggingface.co/datasets/codeparrot/codeparrot-clean-train), [valid](https://huggingface.co/datasets/codeparrot/codeparrot-clean-valid)
- Code: [repository](https://github.com/huggingface/transformers/tree/master/examples/research_projects/codeparrot)
- Spaces: [generation](), [highlighting]() |