Edit model card

Self-Play Preference Optimization for Language Model Alignment (https://arxiv.org/abs/2405.00675)

Mistral7B-PairRM-SPPO-Iter1

This model was developed using Self-Play Preference Optimization at iteration 1, based on the mistralai/Mistral-7B-Instruct-v0.2 architecture as starting point. We utilized the prompt sets from the openbmb/UltraFeedback dataset, splited to 3 parts for 3 iterations by snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset. All responses used are synthetic.

This is the model reported in the paper , with K=5 (generate 5 responses per iteration). We attached the Arena-Hard eval results in this model page.

Links to Other Models

Model Description

  • Model type: A 7B parameter GPT-like model fine-tuned on synthetic datasets.
  • Language(s) (NLP): Primarily English
  • License: Apache-2.0
  • Finetuned from model: mistralai/Mistral-7B-Instruct-v0.2

AlpacaEval Leaderboard Evaluation Results

Model LC. Win Rate Win Rate Avg. Length
Mistral7B-PairRM-SPPO Iter 1 24.79 23.51 1855
Mistral7B-PairRM-SPPO Iter 2 26.89 27.62 2019
Mistral7B-PairRM-SPPO Iter 3 28.53 31.02 2163
Mistral7B-PairRM-SPPO Iter 1 (best-of-16) 28.71 27.77 1901
Mistral7B-PairRM-SPPO Iter 2 (best-of-16) 31.23 32.12 2035
Mistral7B-PairRM-SPPO Iter 3 (best-of-16) 32.13 34.94 2174

Arena-Hard Evaluation Results

Model Score 95% CI average # Tokens
Mistral7B-PairRM-SPPO-Iter3 23.3 (-1.8, 1.8) 578

Open LLM Leaderboard Evaluation Results

Results are reported by using lm-evaluation-harness v0.4.1

arc_challenge truthfulqa_mc2 winogrande gsm8k hellaswag mmlu average
Mistral7B-PairRM-SPPO Iter 1 65.02 69.4 77.82 43.82 85.11 58.84 66.67
Mistral7B-PairRM-SPPO Iter 2 65.53 69.55 77.03 44.35 85.29 58.72 66.75
Mistral7B-PairRM-SPPO Iter 3 65.36 69.97 76.8 42.68 85.16 58.45 66.4

MT-Bench Evaluation Results

1st Turn 2nd Turn Average
Mistral7B-PairRM-SPPO Iter 1 7.63 6.79 7.21
Mistral7B-PairRM-SPPO Iter 2 7.90 7.08 7.49
Mistral7B-PairRM-SPPO Iter 3 7.84 7.34 7.59

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-07
  • eta: 1000
  • per_device_train_batch_size: 8
  • gradient_accumulation_steps: 1
  • seed: 42
  • distributed_type: deepspeed_zero3
  • num_devices: 8
  • optimizer: RMSProp
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.1
  • num_train_epochs: 18.0 (stop at epoch=1.0)

Citation

@misc{wu2024self,
      title={Self-Play Preference Optimization for Language Model Alignment}, 
      author={Wu, Yue and Sun, Zhiqing and Yuan, Huizhuo and Ji, Kaixuan and Yang, Yiming and Gu, Quanquan},
      year={2024},
      eprint={2405.00675},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
Downloads last month
7
Safetensors
Model size
7.24B params
Tensor type
BF16
·

Dataset used to train UCLA-AGI/Mistral7B-PairRM-SPPO-Iter1

Collection including UCLA-AGI/Mistral7B-PairRM-SPPO-Iter1