---
base_model: meta-llama/Llama-2-7b-hf
---

# Model Details

- SFT: based on [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf), trained on the merged Alpaca datasets listed below
- DPO: trained on top of the SFT model as a LoRA adapter, with merged [hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) data
- PPO: trained on top of the DPO model and a reward model, using multiple adapters, with [PKU-SafeRLHF](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF) data for further RLHF
- Trained with DeepSpeed ZeRO-1 + TRL + QLoRA + Flash-Attention 2 (an illustrative DPO training sketch is given at the end of this card)

## Model and Training Details

- **Finetuned from model:** [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)
- **Datasets:**
  - SFT (mixed train):
    - [yahma/alpaca-cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned)
    - [vicgalle/alpaca-gpt4](https://huggingface.co/datasets/vicgalle/alpaca-gpt4)
  - DPO (mixed train):
    - [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf)
    - [Unified-Language-Model-Alignment/Anthropic_HH_Golden](https://huggingface.co/datasets/Unified-Language-Model-Alignment/Anthropic_HH_Golden)
  - PPO:
    - [PKU-Alignment/PKU-SafeRLHF-10K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-10K)
    - [PKU-Alignment/PKU-SafeRLHF-30K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-30K)
    - [PKU-Alignment/PKU-SafeRLHF](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF)

### Training Results

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65b1dd2a855f6b5fe621bc0e/miik5Tb6A8G6sDTlnQA-V.png)

### Evaluation

Reward and toxicity scores for the SFT/DPO/PPO models are computed on [PKU-Alignment/PKU-SafeRLHF-30K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-30K) prompts and compared below (an illustrative scoring sketch is given at the end of this card).

| Model | Toxicity ↓ | Reward ↑ |
| -------- |:------:|:-------:|
| SFT_v0.1 | 0.0698 | -0.2828 |
| DPO_v0.1 | 0.0356 | -0.2633 |
| PPO_v0.1 | 0.0321 | 0.38 |

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65b1dd2a855f6b5fe621bc0e/m-k6kUuIJVTkYM2l3uBPd.png)

### Compute Infrastructure

Training was run on 8 × RTX 3090 (24 GB) or A100-PCIE (40 GB) GPUs.

### Inference

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/this/model"  # local path or Hub repo id of this model

model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Llama-2 has no pad token by default; reuse the EOS token for padding.
tokenizer.pad_token = tokenizer.eos_token

# If a custom EOS token was used during fine-tuning, define it here;
# otherwise keep the tokenizer's default.
DEFINE_EOS_TOKEN = tokenizer.eos_token
tokenizer.eos_token = DEFINE_EOS_TOKEN
model.config.eos_token = DEFINE_EOS_TOKEN
model.config.eos_token_id = tokenizer.eos_token_id

def format_prompt(question):
    # Prompt template expected by the model.
    return f"###Question: {question}\n###Answer: "

instruction = "Your text here"
prompt = format_prompt(instruction)
inputs = tokenizer(prompt, return_tensors="pt")

output = model.generate(inputs["input_ids"], max_new_tokens=512, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## Model Card Authors

Yiyu (Michael) Ren

## Model Card Contact

Email: renyiyuap@gmail.com

### Framework versions

- PEFT 0.8.2
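
### DPO training sketch (illustrative)

The training scripts are not part of this card. The block below is a minimal, hedged sketch of how the DPO stage described above *could* be set up with TRL's `DPOTrainer` and QLoRA (TRL 0.7/0.8-era API). The SFT checkpoint path, LoRA and optimizer hyperparameters, and the hh-rlhf preprocessing are assumptions for illustration, not the exact recipe used for this model.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from trl import DPOTrainer

SFT_MODEL = "path/to/merged-sft-checkpoint"  # assumption: merged SFT model from the first stage

# QLoRA: load the frozen SFT model in 4-bit NF4 and train only a LoRA adapter on top.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    SFT_MODEL,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # Flash-Attention 2, as noted above
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(SFT_MODEL)
tokenizer.pad_token = tokenizer.eos_token

peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # illustrative LoRA hyperparameters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# DPOTrainer expects (prompt, chosen, rejected) triples; hh-rlhf stores full
# dialogues, so naively split off the shared prefix before the last assistant turn.
def to_dpo_format(example):
    prompt = example["chosen"].rsplit("\n\nAssistant:", 1)[0] + "\n\nAssistant:"
    return {
        "prompt": prompt,
        "chosen": example["chosen"][len(prompt):],
        "rejected": example["rejected"][len(prompt):],
    }

train_dataset = load_dataset("Anthropic/hh-rlhf", split="train").map(to_dpo_format)

training_args = TrainingArguments(
    output_dir="llama2-7b-dpo-lora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    num_train_epochs=1,
    bf16=True,
    logging_steps=50,
)

# ref_model=None: with a PEFT adapter, TRL uses the frozen base weights as the
# implicit reference policy, so no second model copy is needed.
trainer = DPOTrainer(
    model,
    ref_model=None,
    args=training_args,
    beta=0.1,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    max_length=1024,
    max_prompt_length=512,
)
trainer.train()
```

In the actual setup this kind of script would be launched with `accelerate launch` (or `deepspeed`) using a ZeRO-1 configuration, and the PPO stage would follow the same 4-bit-base-plus-adapter pattern with TRL's `PPOTrainer` and a reward-model adapter.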
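
### Evaluation sketch (illustrative)

The exact toxicity classifier and reward model behind the evaluation table are not documented in this card. The block below is a minimal sketch of how such scores could be computed, assuming the `evaluate` toxicity measurement (which defaults to `facebook/roberta-hate-speech-dynabench-r4-target`) and `OpenAssistant/reward-model-deberta-v3-large-v2` as stand-ins; the policy checkpoint path is a placeholder for one of the SFT/DPO/PPO models.

```python
import torch
from datasets import load_dataset
from evaluate import load
from transformers import (AutoModelForCausalLM, AutoModelForSequenceClassification,
                          AutoTokenizer)

# Stand-in scorers (assumptions, see above).
toxicity = load("toxicity", module_type="measurement")
reward_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_name)
reward_tokenizer = AutoTokenizer.from_pretrained(reward_name)

def reward_score(prompt: str, response: str) -> float:
    """Raw reward-model logit for a (prompt, response) pair."""
    inputs = reward_tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits[0].item()

# Policy model under evaluation (placeholder: one of the SFT/DPO/PPO checkpoints).
POLICY_MODEL = "path/to/sft-or-dpo-or-ppo-checkpoint"
policy = AutoModelForCausalLM.from_pretrained(
    POLICY_MODEL, torch_dtype=torch.float16, device_map="auto"
)
policy_tok = AutoTokenizer.from_pretrained(POLICY_MODEL)

def generate_answer(prompt: str) -> str:
    """Greedy answer using the same prompt template as the Inference section."""
    inputs = policy_tok(f"###Question: {prompt}\n###Answer: ", return_tensors="pt").to(policy.device)
    out = policy.generate(inputs["input_ids"], max_new_tokens=256, do_sample=False)
    return policy_tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

prompts = load_dataset("PKU-Alignment/PKU-SafeRLHF-30K", split="train")["prompt"][:200]
responses = [generate_answer(p) for p in prompts]

mean_toxicity = sum(toxicity.compute(predictions=responses)["toxicity"]) / len(responses)
mean_reward = sum(reward_score(p, r) for p, r in zip(prompts, responses)) / len(responses)
print(f"toxicity: {mean_toxicity:.4f}  reward: {mean_reward:.4f}")
```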