---
language:
- code
- en
license: apache-2.0
tags:
- code
- gpt2
- generation
datasets:
- codeparrot/codeparrot-clean
- openai_humaneval
- semeru/code-text-python
- semeru/galeras-causal4se-3k-levenshtein
metrics:
- evaluate-metric/code_eval
---

# Compatibilized CodeParrot 🦜 (small)

This is the compatibilized version of CodeParrot 🦜, a GPT-2 model (110M parameters) trained to generate Python code.

The compatibilization is based on the [sequential-rationales](https://github.com/keyonvafa/sequential-rationales) process formulated by Vafa et al.

## Usage

You can load the model and tokenizer directly in `transformers` and use the [Galeras dataset](https://huggingface.co/datasets/semeru/galeras-causal4se-3k-levenshtein) to sample from the model:

```python
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("semeru/compatible-codeparrot-small")
model = AutoModelForCausalLM.from_pretrained("semeru/compatible-codeparrot-small")

# Load the Galeras benchmark as a pandas DataFrame
# (assumes a 'train' split with 'prompt' and 'ground_truth' columns).
df_sampled_code = load_dataset("semeru/galeras-causal4se-3k-levenshtein", split="train").to_pandas()

# Token counts of the ground-truth solutions, and tokenized prompts for sampling.
df_sampled_code['size'] = df_sampled_code['ground_truth'].map(lambda code: len(tokenizer(code)['input_ids']))
df_sampled_code['input_ids'] = tokenizer(df_sampled_code['prompt'].tolist())['input_ids']
```
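
A prompt can then be passed to `model.generate` to sample a completion. This is a minimal sketch; the prompt and decoding parameters below are illustrative, not settings used in the paper:

```python
import torch

# Hypothetical prompt for illustration.
prompt = "def add(a, b):"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        # GPT-2 has no pad token; reuse EOS to silence the warning.
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```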

## Training

The model was trained on the cleaned [CodeParrot 🦜 dataset](https://huggingface.co/datasets/codeparrot/codeparrot-clean) with the following settings:

|Config|Value|
|-------|-----|
|Batch size| 192 |
|Context size| 1024 |
|Training steps| 150,000|
|Gradient accumulation| 1|
|Gradient checkpointing| False|
|Learning rate| 5e-4 |
|Weight decay | 0.1 |
|Warmup steps| 2000 |
|Schedule| Cosine |

The training was executed on 16 × A100 (40GB) GPUs. This setup amounts to roughly 29 billion training tokens (192 batch × 1,024 context × 150,000 steps ≈ 29.5B).
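
For orientation, these hyperparameters map onto `transformers.TrainingArguments` roughly as sketched below. This is an illustrative sketch, not the original training script; `per_device_train_batch_size` assumes the global batch of 192 is split evenly across the 16 GPUs:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="compatible-codeparrot-small",  # hypothetical output path
    per_device_train_batch_size=12,            # 192 global / 16 GPUs
    gradient_accumulation_steps=1,
    gradient_checkpointing=False,
    learning_rate=5e-4,
    weight_decay=0.1,
    warmup_steps=2000,
    lr_scheduler_type="cosine",
    max_steps=150_000,
)
```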

## Performance

We evaluated the model on OpenAI's [HumanEval](https://huggingface.co/datasets/openai_humaneval) benchmark, which consists of 164 Python programming challenges:

| Metric | Value |
|-------|-----|
|pass@1 | 3.80% |
|pass@10 | 6.57% |
|pass@100 | 12.78% |

The [pass@k metric](https://huggingface.co/metrics/code_eval) estimates the probability that at least one out of k generated samples passes the unit tests.
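
Concretely, pass@k is usually computed with the unbiased estimator from the Codex paper (Chen et al., 2021): for a problem with n generated samples of which c pass, the estimate is 1 − C(n−c, k)/C(n, k). A minimal sketch:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one
    of k samples drawn (without replacement) from n passes, given
    that c of the n samples pass."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 200 samples per problem, 10 of them pass the tests.
print(pass_at_k(n=200, c=10, k=10))
```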

## Resources

- Dataset: [full](https://huggingface.co/datasets/codeparrot/codeparrot-clean), [train](https://huggingface.co/datasets/codeparrot/codeparrot-clean-train), [valid](https://huggingface.co/datasets/codeparrot/codeparrot-clean-valid)
- Code: [repository](https://github.com/huggingface/transformers/tree/master/examples/research_projects/codeparrot)
- Spaces: [generation](), [highlighting]()