General Preference Modeling with Preference Representations for Aligning Language Models (https://arxiv.org/abs/2410.02197)

GPO-Llama-3-8B-Instruct-GPM-2B

This model was developed using General Preference Optimization (GPO) at iteration 3 and the General Preference representation Model (GPM) (specifically, using GPM-Gemma-2B), based on the meta-llama/Meta-Llama-3-8B-Instruct architecture as starting point. We utilized the prompt sets from the openbmb/UltraFeedback dataset, splited to 3 parts for 3 iterations by snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset. All responses used are synthetic.

Links to Other Models

Model Description

Model type: A 8B parameter GPT-like model fine-tuned on synthetic datasets.
Language(s) (NLP): Primarily English
License: Apache-2.0
Finetuned from model: meta-llama/Meta-Llama-3-8B-Instruct

AlpacaEval Leaderboard Evaluation Results

Model	LC. Win Rate	Win Rate	Avg. Length
GPO-Llama-3-8B-Instruct-GPM-2B	38.43	48.87	2613

Open LLM Leaderboard Evaluation Results

Results are reported by using lm-evaluation-harness v0.4.1

	arc_challenge	truthfulqa_mc2	winogrande	gsm8k	hellaswag	mmlu	average
GPO-Llama-3-8B-Instruct-GPM-2B	61.43	53.54	75.22	76.12	78.06	65.65	68.34

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-07
beta: 0.001
per_device_train_batch_size: 8
gradient_accumulation_steps: 1
seed: 42
distributed_type: deepspeed_zero3
num_devices: 8
optimizer: RMSProp
lr_scheduler_type: linear
lr_scheduler_warmup_ratio: 0.1
num_train_epochs: 6.0 (stop at epoch=1.0)

Citation

@article{zhang2024general,
  title={General Preference Modeling with Preference Representations for Aligning Language Models},
  author={Zhang, Yifan and Zhang, Ge and Wu, Yue and Xu, Kangping and Gu, Quanquan},
  journal={arXiv preprint arXiv:2410.02197},
  year={2024}
}

general-preference
/

GPO-Llama-3-8B-Instruct-GPM-2B