---
license: llama2
model-index:
- name: ETRI_CodeLLaMA_7B_CPP
  results:
  - task:
      type: text-generation
    dataset:
      type: HumanEval-X
      name: humanevalsynthesize-cpp
    metrics:
    - name: pass@1
      type: pass@1
      value: 34.3%
      verified: false
---

## **ETRI_CodeLLaMA_7B_CPP**

We used LoRA to further pre-train Meta's CodeLLaMA-7B-hf model on high-quality C++ code tokens, and then fine-tuned it on CodeM's C++ instruction data.

## Model Details

ETRI_CodeLLaMA_7B_CPP is a C++-specialized model. It was trained with LoRA and achieves a pass@1 of 34.3% on HumanEval-X (C++).

## Dataset Details

We further pre-trained CodeLLaMA-7B on 543 GB of C++ code collected online, then fine-tuned it on CodeM's C++ instruction data. Training was performed on a single A100-80GB GPU.

## Requirements

```
peft==0.3.0.dev0
tokenizers==0.13.3
transformers==4.33.0
bitsandbytes==0.41.1
```

## How to reproduce HumanEval-X results

We use the bigcode-evaluation-harness repository to evaluate our trained model. First, clone it:

```
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
```

Then run main.py as follows.

```
accelerate launch bigcode-evaluation-harness/main.py \
  --model DDIDU/ETRI_CodeLLaMA_7B_CPP \
  --max_length_generation 512 \
  --prompt continue \
  --tasks humanevalsynthesize-cpp \
  --temperature 0.2 \
  --n_samples 100 \
  --precision bf16 \
  --do_sample True \
  --batch_size 10 \
  --allow_code_execution \
  --save_generations
```

## Model use

```
from transformers import AutoTokenizer
import transformers
import torch

model = "DDIDU/ETRI_CodeLLaMA_7B_CPP"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

sequences = pipeline(
    'import socket\n\ndef ping_exponential_backoff(host: str):',
    do_sample=True,
    top_k=10,
    temperature=0.1,
    top_p=0.95,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
```
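
Since the model is specialized for C++, a C++ prompt is usually more representative than the Python prompt above. Continuing with the same `pipeline` and `tokenizer`, the snippet below is a minimal sketch; the prompt is an illustrative placeholder, not an official example from the training data.

```
# Illustrative C++ completion prompt (any partial C++ function works the same way).
cpp_prompt = (
    "#include <vector>\n\n"
    "// Return the sum of all even numbers in the input vector.\n"
    "int sum_even(const std::vector<int>& values) {"
)

sequences = pipeline(
    cpp_prompt,
    do_sample=True,
    top_k=10,
    temperature=0.1,
    top_p=0.95,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
```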
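
If you prefer to load the model directly rather than through `pipeline`, the sketch below shows one way to do it with `AutoModelForCausalLM`. The 4-bit quantization settings only illustrate how the `bitsandbytes` dependency from the requirements might be used on smaller GPUs; they are assumptions, not the configuration used for training or for the reported evaluation.

```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "DDIDU/ETRI_CodeLLaMA_7B_CPP"

# Illustrative 4-bit quantization config (an assumption, not the authors' setup).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Complete a partial C++ program (prompt is illustrative).
prompt = "#include <iostream>\n\nint main() {"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```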