General Preference Modeling with Preference Representations for Aligning Language Models (https://arxiv.org/abs/2410.02197)
GPO-Llama-3-8B-Instruct-GPM-2B
This model was trained with General Preference Optimization (GPO), at iteration 3, together with the General Preference representation Model (GPM), specifically GPM-Gemma-2B. It starts from meta-llama/Meta-Llama-3-8B-Instruct. The prompts come from the openbmb/UltraFeedback dataset, using the three-way split from snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset (one part per iteration). All responses used are synthetic.
Model Description
- Model type: An 8B-parameter GPT-like model fine-tuned on synthetic datasets.
- Language(s) (NLP): Primarily English
- License: Apache-2.0
- Finetuned from model: meta-llama/Meta-Llama-3-8B-Instruct
AlpacaEval Leaderboard Evaluation Results
Model | LC. Win Rate | Win Rate | Avg. Length |
---|---|---|---|
GPO-Llama-3-8B-Instruct-GPM-2B | 38.43 | 48.87 | 2613 |
Open LLM Leaderboard Evaluation Results
Results were obtained with lm-evaluation-harness v0.4.1.
Model | arc_challenge | truthfulqa_mc2 | winogrande | gsm8k | hellaswag | mmlu | average |
---|---|---|---|---|---|---|---|
GPO-Llama-3-8B-Instruct-GPM-2B | 61.43 | 53.54 | 75.22 | 76.12 | 78.06 | 65.65 | 68.34 |
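As a quick consistency check, the average column can be recomputed from the six task scores in the table. The sketch below assumes the average is an unweighted mean of the six tasks; the dictionary name `scores` is only illustrative.

```python
# Task scores copied from the Open LLM Leaderboard table above.
scores = {
    "arc_challenge": 61.43,
    "truthfulqa_mc2": 53.54,
    "winogrande": 75.22,
    "gsm8k": 76.12,
    "hellaswag": 78.06,
    "mmlu": 65.65,
}

# Assumption: "average" is the unweighted mean over the six tasks.
average = round(sum(scores.values()) / len(scores), 2)
print(average)
```

Under that assumption the recomputed value matches the reported average of 68.34.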
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-07
- beta: 0.001
- per_device_train_batch_size: 8
- gradient_accumulation_steps: 1
- seed: 42
- distributed_type: deepspeed_zero3
- num_devices: 8
- optimizer: RMSProp
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_train_epochs: 6.0 (training stopped at epoch 1.0)
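The hyperparameters above imply a global batch size and a learning-rate shape that can be worked out directly. The following is a minimal sketch, not the training code: it assumes that `linear` means linear warmup followed by linear decay to zero, and that the warmup length is the warmup ratio times the total number of optimizer steps, rounded up. The helper names and the `total_steps` value are hypothetical, since the card does not state the number of steps.

```python
import math

# Values taken from the hyperparameter list above.
PER_DEVICE_TRAIN_BATCH_SIZE = 8
GRADIENT_ACCUMULATION_STEPS = 1
NUM_DEVICES = 8
LEARNING_RATE = 5e-07
WARMUP_RATIO = 0.1


def effective_batch_size(per_device, grad_accum, num_devices):
    """Number of examples consumed per optimizer step across all devices."""
    return per_device * grad_accum * num_devices


def lr_at_step(step, total_steps, peak_lr=LEARNING_RATE, warmup_ratio=WARMUP_RATIO):
    """Assumed schedule: linear warmup to peak_lr, then linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


global_batch = effective_batch_size(
    PER_DEVICE_TRAIN_BATCH_SIZE, GRADIENT_ACCUMULATION_STEPS, NUM_DEVICES
)
print(global_batch)

# total_steps=1000 is a hypothetical value used only to illustrate the shape.
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps=1000))
```

With these settings each optimizer step sees 8 × 1 × 8 = 64 examples, and under the assumed schedule the learning rate peaks at 5e-07 once the first 10% of steps have completed.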
Citation
@article{zhang2024general,
title={General Preference Modeling with Preference Representations for Aligning Language Models},
author={Zhang, Yifan and Zhang, Ge and Wu, Yue and Xu, Kangping and Gu, Quanquan},
journal={arXiv preprint arXiv:2410.02197},
year={2024}
}