---
model-index:
- name: robinlee99/Pythia-2.8B-TLDR-Iterative-SamPO
  results: []
datasets:
- webis/tldr-17
language:
- en
base_model: EleutherAI/pythia-2.8b
license: apache-2.0
---
# Model Card for Pythia-2.8B-TLDR-Iterative-SamPO
This repository provides a version of Pythia-2.8B fine-tuned for TL;DR summarization with our proposed SamPO algorithm, introduced in *Eliminating Biased Length Reliance of Direct Preference Optimization via Down-Sampled KL Divergence*.
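A minimal inference sketch with Hugging Face Transformers is shown below. The `POST: ... TL;DR:` prompt layout is an assumption about the TL;DR formatting used at fine-tuning time, not a documented interface of this checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "robinlee99/Pythia-2.8B-TLDR-Iterative-SamPO"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16 if device == "cuda" else torch.float32
).to(device)

post = "I adopted a senior dog last month and my landlord just changed the pet policy ..."
prompt = f"POST: {post}\nTL;DR:"  # assumed prompt layout, see note above

inputs = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )

# Strip the prompt tokens and keep only the generated summary.
summary = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(summary.strip())
```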
## Performance
| Method | Win rate vs. SFT (%) | Avg. length (tokens) |
|---|---|---|
| DPO | 60.98 | 53.8 |
| Iterative DPO | 73.58 | 66.65 |
| Length Normed DPO | 58.13 | 47.34 |
| SimPO | 33.33 | 31.9 |
| Iterative SamPO | 73.58 | 49.54 |
## Evaluation Details
We evaluate our model with the same GPT-4 win-rate prompt template proposed in the DPO paper. The sampled test set is included in this repository.
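For illustration, the sketch below shows how such pairwise GPT-4 judgments can be collected and turned into a win rate. The judge instruction only paraphrases the DPO paper's summarization comparison prompt, and the `judge` helper is hypothetical; substitute the exact template when reproducing the numbers above.

```python
import random
from openai import OpenAI

client = OpenAI()

JUDGE_TEMPLATE = (
    "Which of the following summaries does a better job of summarizing the most "
    "important points in the given forum post?\n\n"
    "Post:\n{post}\n\nSummary A:\n{a}\n\nSummary B:\n{b}\n\n"
    "Answer with a single letter: A or B."
)

def judge(post: str, model_summary: str, sft_summary: str) -> bool:
    """Return True if GPT-4 prefers the model summary over the SFT baseline."""
    # Randomize A/B position to avoid the judge's ordering bias.
    model_first = random.random() < 0.5
    a, b = (model_summary, sft_summary) if model_first else (sft_summary, model_summary)
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(post=post, a=a, b=b)}],
        temperature=0,
    ).choices[0].message.content.strip()
    return reply.startswith("A") == model_first

# win_rate = 100 * sum(judge(p, m, s) for p, m, s in samples) / len(samples)
```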
## Training hyperparameters
The following hyperparameters were used during DPO/SamPO training (see the configuration sketch after the list):
- DPO beta: 0.5
- learning_rate: 1e-6
- total_train_batch_size: 128
- optimizer: AdamW with beta1 0.9, beta2 0.999 and epsilon 1e-8
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- weight_decay: 0.0
- num_epochs: 1.0
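As a reference point, the sketch below maps these hyperparameters onto TRL's `DPOConfig`. It covers plain DPO settings only: the down-sampled KL term of SamPO and the iterative data refresh live in our training code, not in stock TRL, and the per-device batch size / gradient-accumulation split shown here is an assumption about how the total batch of 128 was reached.

```python
from trl import DPOConfig, DPOTrainer

config = DPOConfig(
    beta=0.5,                          # DPO beta
    learning_rate=1e-6,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,    # 8 x 16 = 128 effective batch (assumed split)
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    weight_decay=0.0,
    num_train_epochs=1.0,
    optim="adamw_torch",               # AdamW with betas=(0.9, 0.999), eps=1e-8 (defaults)
    output_dir="pythia-2.8b-tldr-sampo",
)

# trainer = DPOTrainer(model=model, ref_model=ref_model, args=config,
#                      train_dataset=preference_dataset, processing_class=tokenizer)
# trainer.train()
```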