UCLA-AGI
/

Mistral7B-PairRM-SPPO-Iter2

Text Generation

Inference Endpoints

text-generation-inference

Model card Files Files and versions Community

angelahzyuan commited on May 6

Commit

7380dd4

•

1 Parent(s): 5e37a0b

Update README.md

Files changed (1) hide show

README.md +7 -1

README.md CHANGED Viewed

@@ -8,12 +8,18 @@ pipeline_tag: text-generation
 ---
 Self-Play Preference Optimization for Language Model Alignment (https://arxiv.org/abs/2405.00675)
-# Mistral7B-PairRM-SPPO-Iter1
 This model was developed using [Self-Play Preference Optimization](https://arxiv.org/abs/2405.00675) at iteration 2, based on the [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) architecture as starting point. We utilized the prompt sets from the [openbmb/UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) dataset, splited to 3 parts for 3 iterations by [snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset](https://huggingface.co/datasets/snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset). All responses used are synthetic.
 **This is the model reported in the paper** , with K=5 (generate 5 responses per iteration). We attached the Arena-Hard eval results in this model page.
 ### Model Description

 ---
 Self-Play Preference Optimization for Language Model Alignment (https://arxiv.org/abs/2405.00675)
+# Mistral7B-PairRM-SPPO-Iter2
 This model was developed using [Self-Play Preference Optimization](https://arxiv.org/abs/2405.00675) at iteration 2, based on the [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) architecture as starting point. We utilized the prompt sets from the [openbmb/UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) dataset, splited to 3 parts for 3 iterations by [snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset](https://huggingface.co/datasets/snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset). All responses used are synthetic.
 **This is the model reported in the paper** , with K=5 (generate 5 responses per iteration). We attached the Arena-Hard eval results in this model page.
+## Links to Other Models
+- [Mistral7B-PairRM-SPPO-Iter1](https://huggingface.co/UCLA-AGI/Mistral7B-PairRM-SPPO-Iter1)
+- [Mistral7B-PairRM-SPPO-Iter2](https://huggingface.co/UCLA-AGI/Mistral7B-PairRM-SPPO-Iter2)
+- [Mistral7B-PairRM-SPPO-Iter3](https://huggingface.co/UCLA-AGI/Mistral7B-PairRM-SPPO-Iter3)
+- [Mistral7B-PairRM-SPPO](https://huggingface.co/UCLA-AGI/Mistral7B-PairRM-SPPO)
 ### Model Description