---
base_model: meta-llama/Llama-2-7b-hf
---

# Model Details

- SFT: based on [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf), fine-tuned on merged Alpaca datasets
- DPO: trained on top of the SFT model as a LoRA adapter, with merged [hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) data
- PPO: trained on top of the DPO model and a reward model, with multiple adapters, using [PKU-SafeRLHF](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF) data for further RLHF
- Trained with DeepSpeed ZeRO-1 + TRL + QLoRA + Flash Attention 2 (a minimal sketch of the DPO stage is shown below)
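
As an illustration of how the DPO stage fits together (a QLoRA adapter on a 4-bit base model, trained with TRL's `DPOTrainer` under DeepSpeed), the sketch below shows a minimal setup. It is not the authors' training script: hyperparameters, the SFT checkpoint path, the DeepSpeed config path, and the dataset preprocessing are assumptions, and TRL argument names differ slightly across versions.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from trl import DPOTrainer

sft_model_path = "path/to/merged-sft-model"  # assumption: the merged SFT checkpoint

# QLoRA: load the SFT weights in 4-bit and train only a LoRA adapter on top.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    sft_model_path,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # Flash Attention 2
)
tokenizer = AutoTokenizer.from_pretrained(sft_model_path)
tokenizer.pad_token = tokenizer.eos_token

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# hh-rlhf stores full "chosen"/"rejected" conversations; DPOTrainer expects
# "prompt"/"chosen"/"rejected" columns, so a reformatting step (omitted here)
# is needed before training.
train_dataset = load_dataset("Anthropic/hh-rlhf", split="train")

training_args = TrainingArguments(
    output_dir="llama2-7b-dpo",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    bf16=True,
    deepspeed="ds_zero1.json",  # DeepSpeed ZeRO-1 config file (path is an assumption)
)

# TRL 0.7.x-style call; newer TRL versions move `beta` into a DPOConfig.
trainer = DPOTrainer(
    model=model,
    ref_model=None,  # with a PEFT adapter, the frozen base model serves as the reference
    args=training_args,
    beta=0.1,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```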


## Model and Training Details

- **Finetuned from model:** [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)

- **Dataset:**
  - SFT (mixed train):
    - [yahma/alpaca-cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned)
    - [vicgalle/alpaca-gpt4](https://huggingface.co/datasets/vicgalle/alpaca-gpt4)
  - DPO (mixed train):
    - [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf)
    - [Unified-Language-Model-Alignment/Anthropic_HH_Golden](https://huggingface.co/datasets/Unified-Language-Model-Alignment/Anthropic_HH_Golden)
  - PPO:
    - [PKU-Alignment/PKU-SafeRLHF-10K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-10K)
    - [PKU-Alignment/PKU-SafeRLHF-30K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-30K)
    - [PKU-Alignment/PKU-SafeRLHF](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF)
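
The "mixed train" splits above can be reproduced by concatenating the individual datasets with the `datasets` library. A minimal sketch follows; the column handling is an assumption, so check each dataset's schema before use:

```python
from datasets import concatenate_datasets, load_dataset

# SFT mix: both Alpaca variants expose instruction / input / output columns,
# so extra columns are dropped and the train splits are concatenated.
alpaca_cleaned = load_dataset("yahma/alpaca-cleaned", split="train")
alpaca_gpt4 = load_dataset("vicgalle/alpaca-gpt4", split="train")
alpaca_gpt4 = alpaca_gpt4.remove_columns(
    [c for c in alpaca_gpt4.column_names if c not in alpaca_cleaned.column_names]
)
sft_mix = concatenate_datasets([alpaca_cleaned, alpaca_gpt4]).shuffle(seed=42)

# DPO mix: hh-rlhf and HH Golden both provide chosen / rejected pairs.
hh = load_dataset("Anthropic/hh-rlhf", split="train")
hh_golden = load_dataset(
    "Unified-Language-Model-Alignment/Anthropic_HH_Golden", split="train"
)
dpo_mix = concatenate_datasets([hh, hh_golden]).shuffle(seed=42)
```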

### Training Results

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65b1dd2a855f6b5fe621bc0e/miik5Tb6A8G6sDTlnQA-V.png)

### Evaluation

Reward and toxicity scores are computed on the [PKU-Alignment/PKU-SafeRLHF-30K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-30K) data and compared across the SFT, DPO, and PPO models.

| Model     | Toxicity | Reward  |
| --------- |:--------:|:-------:|
| SFT_v0.1  |  0.0698  | -0.2828 |
| DPO_v0.1  |  0.0356  | -0.2633 |
| PPO_v0.1  |  0.0321  |  0.38   |

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65b1dd2a855f6b5fe621bc0e/m-k6kUuIJVTkYM2l3uBPd.png)
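
A rough sketch of how such scores can be obtained is shown below, using the `toxicity` measurement from the `evaluate` library and a generic sequence-classification reward model. The reward-model name is a placeholder rather than the checkpoint actually used, and the `generations` list stands in for model outputs on the evaluation prompts.

```python
import torch
from evaluate import load
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Responses generated by the SFT / DPO / PPO model for the evaluation prompts.
generations = ["example response 1", "example response 2"]

# Toxicity: the `evaluate` measurement wraps a RoBERTa hate-speech classifier.
toxicity = load("toxicity", module_type="measurement")
tox_scores = toxicity.compute(predictions=generations)["toxicity"]
print("mean toxicity:", sum(tox_scores) / len(tox_scores))

# Reward: score each response with a reward model.
# "your-org/reward-model" is a placeholder for the reward model trained in this project.
rm_name = "your-org/reward-model"
rm_tokenizer = AutoTokenizer.from_pretrained(rm_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(rm_name, num_labels=1)

with torch.no_grad():
    batch = rm_tokenizer(generations, padding=True, truncation=True, return_tensors="pt")
    rewards = reward_model(**batch).logits.squeeze(-1)
print("mean reward:", rewards.mean().item())
```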

### Compute Infrastructure

The models were trained on 8x RTX 3090 (24 GB) or A100 PCIe (40 GB) GPUs.
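
For reference, a minimal DeepSpeed ZeRO-1 configuration of the kind used with the Hugging Face Trainer integration is sketched below. The values are illustrative assumptions, not the exact settings used for this model.

```python
# Minimal ZeRO stage-1 config; "auto" lets the Hugging Face Trainer fill in
# values from TrainingArguments. Pass the dict via TrainingArguments(deepspeed=ds_config),
# or save it as JSON and launch with `deepspeed --num_gpus=8 train.py`.
ds_config = {
    "zero_optimization": {"stage": 1},
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
}
```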

### Inference
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "<this-repo-id-or-local-path>"  # replace with the model repository id or a local checkpoint path

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Pad with the EOS token; if a custom EOS token was used during training,
# replace the placeholder below with it.
tokenizer.pad_token = tokenizer.eos_token
DEFINE_EOS_TOKEN = tokenizer.eos_token  # placeholder: set to a custom EOS token if needed
tokenizer.eos_token = DEFINE_EOS_TOKEN
model.config.eos_token_id = tokenizer.eos_token_id

def format_prompt(question):
    # Prompt template used during training.
    return f"###Question: {question}\n###Answer: "

instruction = "Your text here"
prompt = format_prompt(instruction)
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=512,
    do_sample=False,  # greedy decoding; sampling parameters such as top_p only apply when do_sample=True
)
output = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output)
```
## Model Card Authors

Yiyu (Michael) Ren

## Model Card Contact

Email: renyiyuap@gmail.com

### Framework versions

- PEFT 0.8.2