This model was trained with Iterative DPO using OpenRLHF.
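As a rough illustration of what each DPO update optimizes, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities, and builds that pair from reward-model scores. The function names `pick_pair` and `dpo_loss` are hypothetical, and the rule "highest-scored sample is chosen, lowest-scored is rejected" is an assumption; this is not OpenRLHF's actual code.

```python
import math

def pick_pair(responses, scores):
    """Return (chosen, rejected) from sampled responses and their reward scores.

    Assumption: the highest-scored sample is treated as chosen and the
    lowest-scored one as rejected.
    """
    ranked = sorted(zip(scores, responses), key=lambda pair: pair[0])
    return ranked[-1][1], ranked[0][1]

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    # Margin: how much more the policy prefers the chosen response over the
    # rejected one, measured relative to the frozen reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # Numerically stable form of log(1 + exp(-x)).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

chosen, rejected = pick_pair(["answer A", "answer B"], [0.2, 1.3])
loss_at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)  # policy equals reference
loss_after_gain = dpo_loss(-9.0, -14.0, -10.0, -12.0)  # policy favors chosen more
```

When the policy equals the reference model the margin is zero and the loss is `log(2)`; it falls as the policy shifts probability toward the chosen response.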
Datasets and Hyperparameters
- Reward Model: https://huggingface.co/OpenLLMAI/Llama-3-8b-rm-700k
- SFT Model: https://huggingface.co/OpenLLMAI/Llama-3-8b-sft-mixture
- Prompt Dataset: https://huggingface.co/datasets/OpenLLMAI/prompt-collection-v0.1
```
Max Prompt Length: 2048
Max Response Length: 2048
best_of_n: 2 (2 samples for each prompt)
Learning Rate: 5e-7
Beta: 0.1
Scheduler: Cosine with Warmup (0.03) and MinLR (0.1 * init_lr)
Rollout Batch Size: 20000
Training Batch Size: 256
Number of Iterations: 9
```
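The scheduler line above can be read as a linear warmup followed by a cosine decay that bottoms out at 10% of the initial learning rate. The sketch below is one plausible interpretation, assuming `0.03` is the fraction of total steps spent in warmup. The function `lr_at` and its parameter names are hypothetical, and OpenRLHF's own scheduler may differ in detail.

```python
import math

def lr_at(step, total_steps, init_lr=5e-7, warmup_ratio=0.03, min_lr_ratio=0.1):
    """Learning rate at a given optimizer step (0-indexed)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear warmup from init_lr / warmup_steps up to init_lr.
        return init_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    min_lr = init_lr * min_lr_ratio
    # Cosine decay from init_lr down to min_lr.
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (init_lr - min_lr) * cosine

total = 1000  # hypothetical number of optimizer steps
peak_lr = lr_at(30, total)
final_lr = lr_at(total, total)
```

With these defaults the rate peaks at `5e-7` once warmup ends and decays toward `5e-8`, never reaching zero.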
Evaluation
```
########## First turn ##########
                       score
model           turn
Llama3-iter-dpo 1      8.55
########## Second turn ##########
                       score
model           turn
Llama3-iter-dpo 2      7.95625
########## Average ##########
                       score
model
Llama3-iter-dpo        8.253125
Llama3-sft-baseline    7.69
```
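As a quick consistency check on the numbers above, the short sketch below recomputes the average from the two per-turn scores and the gap to the SFT baseline. It assumes the reported average is the unweighted mean of the first-turn and second-turn scores; the variable names are illustrative only.

```python
# Scores copied from the evaluation output above.
first_turn = 8.55
second_turn = 7.95625
sft_baseline_average = 7.69

# Assumption: the reported average is the unweighted mean of the two turns.
average = (first_turn + second_turn) / 2
gain = average - sft_baseline_average

print(f"average: {average:.6f}")
print(f"gain over SFT baseline: {gain:.6f}")
```

Under that assumption the recomputed mean matches the reported 8.253125, which puts the iterative DPO model roughly 0.56 points above the SFT baseline.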