---
language:
- code
- en
license: apache-2.0
tags:
- code
- gpt2
- generation
datasets:
- codeparrot/codeparrot-clean
- openai_humaneval
- semeru/code-text-python
- semeru/galeras-causal4se-3k-levenshtein
metrics:
- evaluate-metric/code_eval
---

# Compatibilized CodeParrot 🦜 (small)

This is the compatibilized version of CodeParrot 🦜, a GPT-2 model (110M parameters) trained to generate Python code. The compatibilization is based on the [sequential-rationales](https://github.com/keyonvafa/sequential-rationales) process formulated by Vafa et al.

## Usage

You can load the CodeParrot model and tokenizer directly in `transformers` and use the Galeras dataset to sample from the model:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("semeru/compatible-codeparrot-small")
# AutoModelForCausalLM replaces the deprecated AutoModelWithLMHead for GPT-2 style models.
model = AutoModelForCausalLM.from_pretrained("semeru/compatible-codeparrot-small")

# `df_sampled_code` is not defined in this snippet: it is assumed to be a pandas
# DataFrame of Galeras samples with `prompt` and `ground_truth` columns.
# Token length of each reference solution.
df_sampled_code['size'] = df_sampled_code['ground_truth'].map(lambda code: len(tokenizer(code)['input_ids']))
# Token ids of each prompt, ready to be passed to the model.
df_sampled_code['input_ids'] = tokenizer(df_sampled_code['prompt'].tolist())['input_ids']
```

## Training

The model was trained on the cleaned [CodeParrot 🦜 dataset](https://huggingface.co/datasets/codeparrot/codeparrot-clean) with the following settings:

|Config|Value|
|-------|-----|
|Batch size| 192 |
|Context size| 1024 |
|Training steps| 150,000 |
|Gradient accumulation| 1 |
|Gradient checkpointing| False |
|Learning rate| 5e-4 |
|Weight decay| 0.1 |
|Warmup steps| 2000 |
|Schedule| Cosine |

The training was executed on 16 x A100 (40GB) GPUs. This setting amounts to roughly 29 billion tokens.

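The token count quoted for training can be checked with a short calculation. This is a minimal sketch that assumes the batch size of 192 is the global batch (summed over all GPUs) and that every sequence fills the full 1024-token context; the variable names are illustrative and not taken from the training script.

```python
# Values copied from the training configuration table.
batch_size = 192            # sequences per optimizer step (assumed to be the global batch)
context_size = 1024         # tokens per sequence
training_steps = 150_000
gradient_accumulation = 1

# Tokens consumed in one optimizer step, then over the whole run.
tokens_per_step = batch_size * context_size * gradient_accumulation
total_tokens = tokens_per_step * training_steps

print(f"tokens per step: {tokens_per_step:,}")
print(f"total tokens:    {total_tokens:,} (~{total_tokens / 1e9:.1f}B)")
```

Under these assumptions the run processes 196,608 tokens per step and about 29.5 billion tokens in total, which matches the rough figure stated for this configuration.
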
## Performance

We evaluated the model on OpenAI's [HumanEval](https://huggingface.co/datasets/openai_humaneval) benchmark, which consists of programming challenges:

| Metric | Value |
|-------|-----|
|pass@1 | 3.80% |
|pass@10 | 6.57% |
|pass@100 | 12.78% |

The [pass@k metric](https://huggingface.co/metrics/code_eval) estimates the probability that at least one out of k generations passes the tests.

## Resources

- Dataset: [full](https://huggingface.co/datasets/codeparrot/codeparrot-clean), [train](https://huggingface.co/datasets/codeparrot/codeparrot-clean-train), [valid](https://huggingface.co/datasets/codeparrot/codeparrot-clean-valid)
- Code: [repository](https://github.com/huggingface/transformers/tree/master/examples/research_projects/codeparrot)
- Spaces: [generation](), [highlighting]()
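
As a supplement to the pass@k definition in the Performance section, the sketch below computes the unbiased estimator commonly used with HumanEval: for `n` generations of which `c` pass the tests, pass@k = 1 - C(n - c, k) / C(n, k). The function name `pass_at_k` and the sample counts are hypothetical and are not taken from the evaluation reported in this card.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k generations fail, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts for a single problem (illustration only).
n_samples, n_correct = 200, 5
for k in (1, 10, 100):
    print(f"pass@{k}: {pass_at_k(n_samples, n_correct, k):.4f}")
```

A benchmark-level score is the mean of this per-problem estimate over all problems. Drawing more than k generations per problem (n > k) lowers the variance of the estimate.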