Self-Play Preference Optimization for Language Model Alignment (https://arxiv.org/abs/2405.00675)

Mistral7B-PairRM-SPPO

This model was developed using Self-Play Preference Optimization, based on the mistralai/Mistral-7B-Instruct-v0.2 architecture as starting point. We utilized the prompt sets from the openbmb/UltraFeedback dataset, splited to 3 parts for 3 iterations by snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset. All responses used are synthetic.

While K = 5, this model uses three samples to estimate the soft probabilities P(y_w > y_l) and P(y_l > y_w). These samples include the winner, the loser, and another random sample. This approach has shown to deliver better performance on AlpacaEval 2.0 compared to the results reported in our paper.

❗Please refer to the original checkpoint at UCLA-AGI/Mistral7B-PairRM-SPPO-Iter3 as reported in our paper. We anticipate that the version in the paper demonstrates a more consistent performance improvement across all evaluation tasks.

Links to Other Models

Model Description

  • Model type: A 7B parameter GPT-like model fine-tuned on synthetic datasets.
  • Language(s) (NLP): Primarily English
  • License: Apache-2.0
  • Finetuned from model: mistralai/Mistral-7B-Instruct-v0.2

AlpacaEval Leaderboard Evaluation Results

Model LC. Win Rate Win Rate Avg. Length
Mistral7B-PairRM-SPPO 30.46 32.14 2114

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-07
  • eta: 1000
  • per_device_train_batch_size: 8
  • gradient_accumulation_steps: 1
  • seed: 42
  • distributed_type: deepspeed_zero3
  • num_devices: 8
  • optimizer: RMSProp
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.1
  • num_train_epochs: 18.0 (stop at epoch=1.0)

Citation

@misc{wu2024self,
      title={Self-Play Preference Optimization for Language Model Alignment}, 
      author={Wu, Yue and Sun, Zhiqing and Yuan, Huizhuo and Ji, Kaixuan and Yang, Yiming and Gu, Quanquan},
      year={2024},
      eprint={2405.00675},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
Downloads last month
3,569
Safetensors
Model size
7.24B params
Tensor type
BF16
·
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Model tree for UCLA-AGI/Mistral7B-PairRM-SPPO

Quantizations
4 models

Dataset used to train UCLA-AGI/Mistral7B-PairRM-SPPO

Collection including UCLA-AGI/Mistral7B-PairRM-SPPO