---
base_model: meta-llama/Llama-2-7b-hf
---

# Model Details

- SFT: fine-tuned from [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) on merged Alpaca datasets
- DPO: trained on top of the SFT model as a LoRA adapter, with merged [hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) data
- PPO: trained on top of the DPO model and a reward model, using multiple adapters, with [PKU-SafeRLHF](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF) data for further RLHF
- Trained with DeepSpeed ZeRO-1 + TRL + QLoRA + Flash Attention 2 (a minimal setup sketch is shown below)
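
As a rough guide to how these pieces fit together, the sketch below shows how a QLoRA base model with Flash Attention 2 is typically prepared before being handed to TRL's SFT/DPO/PPO trainers. The quantization settings, LoRA rank/alpha, and target modules are illustrative assumptions, not this model's recorded hyperparameters.

```python
# Minimal, hedged sketch of the QLoRA + Flash Attention 2 setup named above;
# rank, alpha, target modules, and dtypes are illustrative assumptions, not the
# card's actual training configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_id = "meta-llama/Llama-2-7b-hf"

# QLoRA: keep the frozen base weights in 4-bit NF4, compute in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # Flash Attention 2 (recent transformers)
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.pad_token = tokenizer.eos_token

# LoRA adapters are the only trainable parameters on top of the quantized base
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# This adapter-wrapped model is what TRL's SFTTrainer / DPOTrainer / PPOTrainer
# would consume, with DeepSpeed ZeRO-1 supplied through the trainer arguments.
```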

## Model and Training Details

- **Finetuned from model:** [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)

- **Dataset:**
  - SFT (mixed train; a merging sketch follows this list):
    - [yahma/alpaca-cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned)
    - [vicgalle/alpaca-gpt4](https://huggingface.co/datasets/vicgalle/alpaca-gpt4)
  - DPO (mixed train):
    - [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf)
    - [Unified-Language-Model-Alignment/Anthropic_HH_Golden](https://huggingface.co/datasets/Unified-Language-Model-Alignment/Anthropic_HH_Golden)
  - PPO:
    - [PKU-Alignment/PKU-SafeRLHF-10K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-10K)
    - [PKU-Alignment/PKU-SafeRLHF-30K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-30K)
    - [PKU-Alignment/PKU-SafeRLHF](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF)
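
As an illustration of the "mixed train" SFT data, the sketch below merges the two Alpaca-style datasets with the `datasets` library; the column handling is an assumption about the published dataset schemas, not the card's exact preprocessing.

```python
# Hedged sketch: merging the two Alpaca-style datasets into one SFT training set.
from datasets import load_dataset, concatenate_datasets

alpaca_cleaned = load_dataset("yahma/alpaca-cleaned", split="train")
alpaca_gpt4 = load_dataset("vicgalle/alpaca-gpt4", split="train")

# Keep only the shared instruction/input/output columns so the features match,
# then concatenate and shuffle into a single mixed SFT set.
shared_cols = ["instruction", "input", "output"]
sft_mix = concatenate_datasets(
    [alpaca_cleaned.select_columns(shared_cols), alpaca_gpt4.select_columns(shared_cols)]
).shuffle(seed=42)
print(sft_mix)
```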

### Training Results

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65b1dd2a855f6b5fe621bc0e/miik5Tb6A8G6sDTlnQA-V.png)

### Evaluation

Reward and toxicity scores are computed on [PKU-Alignment/PKU-SafeRLHF-30K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-30K) data and compared across the SFT, DPO, and PPO models (a scoring sketch follows the figure below).

| Model | Toxicity (lower is better) | Reward (higher is better) |
| ----- |:--------:|:--------:|
| SFT_v0.1 | 0.0698 | -0.2828 |
| DPO_v0.1 | 0.0356 | -0.2633 |
| PPO_v0.1 | 0.0321 | 0.38 |

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65b1dd2a855f6b5fe621bc0e/m-k6kUuIJVTkYM2l3uBPd.png)
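
One way scores like the toxicity column above can be produced is by scoring model generations for PKU-SafeRLHF-30K prompts with the `evaluate` library's toxicity measurement. The sketch below illustrates that idea; it is not the exact evaluation pipeline used for this card, and the reward column would analogously come from scoring the same generations with the reward model.

```python
# Hedged sketch: average toxicity of model generations, using the `evaluate`
# toxicity measurement (a RoBERTa hate-speech classifier under the hood).
# `responses` stands in for generations on prompts sampled from PKU-SafeRLHF-30K.
import evaluate

toxicity = evaluate.load("toxicity")
responses = [
    "I'm sorry, but I can't help with that request.",
    "Here is a safer way to think about the problem.",
]
scores = toxicity.compute(predictions=responses)["toxicity"]
print(sum(scores) / len(scores))
```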

### Compute Infrastructure

The model is trained on 8 GPUs (RTX 3090 24 GB or A100-PCIE 40 GB) with DeepSpeed ZeRO-1, as sketched below.
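
For reference, a DeepSpeed ZeRO-1 configuration can be passed directly to the Hugging Face / TRL trainer arguments for multi-GPU runs like this one. The values below are illustrative assumptions, not the exact configuration used to train this model.

```python
# Hedged sketch: a ZeRO stage-1 DeepSpeed config handed to the HF Trainer;
# batch sizes and the output directory are illustrative assumptions.
from transformers import TrainingArguments

ds_config = {
    "zero_optimization": {"stage": 1},   # ZeRO-1: shard optimizer states across GPUs
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    bf16=True,
    deepspeed=ds_config,                 # a path to a ds_config.json also works
)

# A training script built on these arguments would be launched across the 8 GPUs,
# e.g. `deepspeed --num_gpus=8 train.py` or `accelerate launch --num_processes 8
# train.py` (train.py is a hypothetical script name).
```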

### Inference

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "model-name-or-path"  # replace with this model's Hub id or a local path

model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Align padding/EOS handling with the tokens used during training
DEFINE_EOS_TOKEN = "</s>"  # placeholder: set to the EOS token defined for training
tokenizer.pad_token = tokenizer.eos_token
tokenizer.eos_token = DEFINE_EOS_TOKEN
model.config.eos_token = DEFINE_EOS_TOKEN
model.config.eos_token_id = tokenizer.eos_token_id

# Prompt template used during fine-tuning
def format_prompt(question):
    return f"###Question: {question}\n###Answer: "

instruction = "Your text here"
prompt = format_prompt(instruction)
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(inputs["input_ids"], max_new_tokens=512, do_sample=False)
output = tokenizer.decode(output[0], skip_special_tokens=True)
print(output)
```

## Model Card Authors

Yiyu (Michael) Ren

## Model Card Contact

Email: renyiyuap@gmail.com

### Framework versions

- PEFT 0.8.2