---
language: code
tags:
- code
- gpt2
- generation
datasets:
- giulio98/xlcost-single-prompt
widget:
- text: "'''\nfunction to add two numbers\n'''\n###\n"
  example_title: "add two numbers"
model-index:
- name: codegen-350M-multi-xlcost
  results:
  - task:
      name: Code Generation
      type: code-generation
    dataset:
      name: "XLCost"
      type: code_eval_outputs
    metrics:
    - name: pass@1
      type: code_eval_outputs
      value: 3.325
    - name: pass@10
      type: code_eval_outputs
      value: 15
    - name: codebleu
      type: codebleu
      value: 20.18191
---

# CodeGen-350M-multi-xlcost-v2

CodeGen-350M-multi-xlcost-v2 is a CodeGen model fine-tuned on the Python split of the XLCost dataset using DeepSpeed.

## Usage

You can load the CodeGen-350M-multi-xlcost-v2 model and tokenizer directly in `transformers`:

```Python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("giulio98/codegen-350M-multi-xlcost-v2")
model = AutoModelForCausalLM.from_pretrained("giulio98/codegen-350M-multi-xlcost-v2")

# Prompts follow the fine-tuning format: the EOS token, a docstring describing
# the task, and the "###" separator that marks where the code should start.
text = tokenizer.eos_token + "'''\n" + "function to add two numbers" + "\n'''\n" + "###\n"
input_ids = tokenizer(text, return_tensors="pt").input_ids

generated_ids = model.generate(input_ids, max_length=128)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```

Output:
```Python
'''
function to add two numbers
'''
###
def add(a, b):
    return a + b
```

## Training

The model was fine-tuned on [XLCost-single-prompt](https://huggingface.co/datasets/giulio98/xlcost-single-prompt), an improved version of the original XLCost dataset [xlcost-text-to-code](https://huggingface.co/datasets/codeparrot/xlcost-text-to-code). The hyperparameters are listed below.

| Hyperparameter | Value |
|-----------------------------|--------|
| Per device train batch size | 16 |
| Context size | 1024 |
| Training steps | 259 |
| Gradient accumulation | 2 |
| Gradient checkpointing | True |
| Learning rate | 1.8e-05 |
| Weight decay | 0.1 |
| Warmup steps | 35 |
| Schedule | linear |
| ZeRO stage | 2 |

Below is the DeepSpeed configuration:

```json
{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 0.000018,
            "betas": [
                0.9,
                0.999
            ],
            "eps": 1e-8,
            "weight_decay": 0.1
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 0.000018,
            "warmup_num_steps": 35
        }
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": false
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 200000000,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 200000000,
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": 2,
    "train_batch_size": 32,
    "train_micro_batch_size_per_gpu": 16,
    "gradient_clipping": 1,
    "wall_clock_breakdown": false
}
```

The training was executed on a single V100 (16 GB) GPU and took 28 min 50 s.
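The original training script is not part of this card. Purely as an illustration, here is a minimal sketch of how the hyperparameters above and the DeepSpeed configuration (saved as `ds_config.json`) could be wired together with the `transformers` `Trainer`; the base checkpoint `Salesforce/codegen-350M-multi`, the dataset split, the `text` column name, and the preprocessing are assumptions, not the exact recipe used.

```Python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Base checkpoint (assumed); CodeGen has no pad token by default.
base_model = "Salesforce/codegen-350M-multi"
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Python config of the fine-tuning dataset; the "text" column is an assumption.
dataset = load_dataset("giulio98/xlcost-single-prompt", "Python", split="train")

def tokenize(batch):
    # Context size of 1024 tokens, as in the hyperparameter table above.
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Hyperparameters from the table above; the DeepSpeed JSON above is saved as ds_config.json.
args = TrainingArguments(
    output_dir="codegen-350M-multi-xlcost-v2",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    learning_rate=1.8e-5,
    weight_decay=0.1,
    warmup_steps=35,
    lr_scheduler_type="linear",
    max_steps=259,
    fp16=True,
    deepspeed="ds_config.json",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

With ZeRO stage 2 and CPU optimizer offload, such a script would typically be launched through the DeepSpeed launcher, e.g. `deepspeed --num_gpus=1 train.py`.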
## Performance

We evaluated the model on the first 400 samples of the [XLCost-single-prompt test split](https://huggingface.co/datasets/giulio98/xlcost-single-prompt/viewer/Python/test), comparing the output of the generated code against the expected output using the pass@k metric.

| Metric | codegen-350M-multi-xlcost-v2 | codegen-350M-multi-xlcost | codegen-350M-mono (zero-shot) | codegen-350M-mono (one-shot) | codegen-350M-mono (few-shot) |
|--------|-----|-----|-----|-----|-----|
| pass@1 | 3.325% | 3.70% | 0.4% | 0.35% | 0.48% |
| pass@10 | 15% | 14.5% | 3.5% | 3% | 3.75% |
| CodeBLEU | 20.18% | None | 15.15% | 19.42% | 20.27% |

The [pass@k metric](https://huggingface.co/metrics/code_eval) gives the probability that at least one out of k generated samples passes the tests (a minimal example of computing it with the Hugging Face `evaluate` library is included at the end of this card).

## Citations

```
@article{Nijkamp2022ACP,
  title={A Conversational Paradigm for Program Synthesis},
  author={Nijkamp, Erik and Pang, Bo and Hayashi, Hiroaki and Tu, Lifu and Wang, Huan and Zhou, Yingbo and Savarese, Silvio and Xiong, Caiming},
  journal={arXiv preprint},
  year={2022}
}
```
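For reference, below is a minimal, illustrative sketch of how pass@k can be computed with the Hugging Face `evaluate` library's `code_eval` metric; the toy problem, test string, and candidate completions are made up and are not the harness used for the numbers above.

```Python
import os

import evaluate

# code_eval executes model-generated code, so it must be enabled explicitly.
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

code_eval = evaluate.load("code_eval")

# One test string per problem and a list of candidate completions per problem.
# In practice you would sample several completions per prompt from the model
# (e.g. 10 or more to estimate pass@10).
test_cases = ["assert add(2, 3) == 5"]
candidates = [[
    "def add(a, b):\n    return a + b",  # passes the test
    "def add(a, b):\n    return a - b",  # fails the test
]]

pass_at_k, results = code_eval.compute(
    references=test_cases,
    predictions=candidates,
    k=[1, 2],
)
print(pass_at_k)  # e.g. {'pass@1': 0.5, 'pass@2': 1.0}
```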