---
license: apache-2.0
datasets:
- openbmb/UltraFeedback
language:
- en
pipeline_tag: text-generation
---
Self-Play Preference Optimization for Language Model Alignment (https://arxiv.org/abs/2405.00675)

# Mistral7B-PairRM-SPPO-Iter3

This model was developed with [Self-Play Preference Optimization](https://arxiv.org/abs/2405.00675) (SPPO) at iteration 3, starting from [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2). We used the prompt sets from the [openbmb/UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) dataset, split into three parts for the three iterations following [snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset](https://huggingface.co/datasets/snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset). All responses used are synthetic.

While K = 5 responses are sampled per prompt, this model estimates the soft probabilities P(y_w > y_l) and P(y_l > y_w) using only three of them: the winner, the loser, and one additional random sample. This approach has been shown to deliver better performance on AlpacaEval 2.0 than the results reported in [our paper](https://arxiv.org/abs/2405.00675).
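
A minimal numpy sketch of this estimate, assuming PairRM-style scalar scores per response combined with a Bradley-Terry model (P(y_i > y_j) = sigmoid(s_i - s_j)); the function and variable names are illustrative, not the authors' released code:

```python
import numpy as np

def soft_prob_estimates(scores: np.ndarray, rng: np.random.Generator):
    """Estimate P(y_w > pi_t) and P(y_l > pi_t) from K per-response scores,
    using only the winner, the loser, and one extra random sample."""
    winner, loser = int(scores.argmax()), int(scores.argmin())
    rest = [i for i in range(len(scores)) if i not in (winner, loser)]
    extra = int(rng.choice(rest))
    pool = scores[[winner, loser, extra]]  # the 3-sample comparison pool

    def p_beats_pool(i: int) -> float:
        # Bradley-Terry win probability of response i against each pooled
        # response, averaged; the self-comparison contributes sigmoid(0) = 0.5.
        return float(np.mean(1.0 / (1.0 + np.exp(-(scores[i] - pool)))))

    return p_beats_pool(winner), p_beats_pool(loser)

rng = np.random.default_rng(0)
scores = np.array([1.2, -0.3, 0.5, 0.9, -1.1])  # hypothetical K = 5 scores
p_w, p_l = soft_prob_estimates(scores, rng)
print(f"P(y_w > pi_t) ~ {p_w:.3f}  P(y_l > pi_t) ~ {p_l:.3f}")
```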

❗Please refer to [**UCLA-AGI/Mistral7B-PairRM-SPPO-Iter3**](https://huggingface.co/UCLA-AGI/Mistral7B-PairRM-SPPO-Iter3) for the original checkpoint **reported in our paper**. We expect that version to show a more consistent performance improvement across all evaluation tasks.

## Links to Other Models
- [Mistral7B-PairRM-SPPO-Iter1](https://huggingface.co/UCLA-AGI/Mistral7B-PairRM-SPPO-Iter1)
- [Mistral7B-PairRM-SPPO-Iter2](https://huggingface.co/UCLA-AGI/Mistral7B-PairRM-SPPO-Iter2)
- [Mistral7B-PairRM-SPPO-Iter3](https://huggingface.co/UCLA-AGI/Mistral7B-PairRM-SPPO-Iter3)
- [Mistral7B-PairRM-SPPO](https://huggingface.co/UCLA-AGI/Mistral7B-PairRM-SPPO)

### Model Description

- Model type: A 7B-parameter GPT-like model fine-tuned on synthetic datasets.
- Language(s) (NLP): Primarily English
- License: Apache-2.0
- Finetuned from model: mistralai/Mistral-7B-Instruct-v0.2
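
The model can be loaded with 🤗 Transformers. A minimal usage sketch is below; it assumes this checkpoint is hosted as `UCLA-AGI/Mistral7B-PairRM-SPPO` (substitute the id of the repository you are viewing) and that the chat template is inherited from Mistral-7B-Instruct-v0.2:

```python
# pip install transformers torch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "UCLA-AGI/Mistral7B-PairRM-SPPO"  # assumed repo id; adjust as needed

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The chat template comes from the Mistral-7B-Instruct-v0.2 base model.
messages = [{"role": "user", "content": "What is self-play preference optimization?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```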

## [AlpacaEval Leaderboard Evaluation Results](https://tatsu-lab.github.io/alpaca_eval/)

| Model                 | LC Win Rate | Win Rate | Avg. Length |
|-----------------------|:-----------:|:--------:|:-----------:|
| Mistral7B-PairRM-SPPO |    30.46    |  32.14   |    2114     |

LC Win Rate is the length-controlled win rate of AlpacaEval 2.0.

### Training hyperparameters
The following hyperparameters were used during training:

- learning_rate: 5e-07
- eta: 1000 (the η in the SPPO objective; see the sketch below)
- per_device_train_batch_size: 8
- gradient_accumulation_steps: 1
- seed: 42
- distributed_type: deepspeed_zero3
- num_devices: 8
- optimizer: RMSProp
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_train_epochs: 18.0 (training stopped at epoch 1.0)
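
For context, here is a minimal PyTorch sketch of the squared-loss SPPO objective from the paper, in which each response's log-probability ratio against the previous iterate is regressed toward eta * (P_hat - 1/2); this is an illustrative reconstruction from arXiv:2405.00675, not the released training code:

```python
import torch

def sppo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), summed over tokens
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x)
    ref_chosen_logps: torch.Tensor,       # log pi_t(y_w | x), previous iterate
    ref_rejected_logps: torch.Tensor,     # log pi_t(y_l | x)
    prob_chosen_wins: torch.Tensor,       # estimated P(y_w > pi_t | x)
    prob_rejected_wins: torch.Tensor,     # estimated P(y_l > pi_t | x)
    eta: float = 1000.0,                  # matches the `eta` hyperparameter above
) -> torch.Tensor:
    # Squared-loss objective: push log(pi_theta / pi_t) toward eta * (P_hat - 1/2)
    # for both the winning and losing response.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    loss_w = (chosen_logratio - eta * (prob_chosen_wins - 0.5)) ** 2
    loss_l = (rejected_logratio - eta * (prob_rejected_wins - 0.5)) ** 2
    return (loss_w + loss_l).mean()

# Dummy batch of 2 pairs to show the call signature.
g = torch.Generator().manual_seed(0)
lp = lambda: torch.randn(2, generator=g)
print(sppo_loss(lp(), lp(), lp(), lp(),
                torch.tensor([0.7, 0.6]), torch.tensor([0.3, 0.4])))
```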

## Citation
```
@misc{wu2024self,
      title={Self-Play Preference Optimization for Language Model Alignment},
      author={Wu, Yue and Sun, Zhiqing and Yuan, Huizhuo and Ji, Kaixuan and Yang, Yiming and Gu, Quanquan},
      year={2024},
      eprint={2405.00675},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```