# MMPO_Gemma_7b_gamma1.1_epoch3

This is the model checkpoint for the paper:

**Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback** <br>
Kyuyoung Kim*, Ah Jeong Seo*, Hao Liu, Jinwoo Shin, Kimin Lee <br>
*In EMNLP 2024 Findings*

This model is a fine-tuned version of [kykim0/gemma-7b-ultrachat-sft](https://huggingface.co/kykim0/gemma-7b-ultrachat-sft) on the [allenai/ultrafeedback_binarized_cleaned](https://huggingface.co/datasets/allenai/ultrafeedback_binarized_cleaned) dataset.

The model is optimized with MMPO (Margin Matching Preference Optimization), which integrates per-feedback margins to enhance optimization.
Specifically, given quality margins in pairwise preferences, MMPO utilizes soft target probabilities based on the Bradley-Terry model.
You can find more details in the paper or the [official code](https://github.com/kykim0/margin-matching-pref-opt).
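
As a rough illustration of the idea above, the Bradley-Terry soft target can be sketched in plain Python. This is not the official implementation: the temperature `tau`, the name `policy_logit` (a DPO-style reward-difference logit), and the exact cross-entropy form are illustrative assumptions — see the official code linked above for the paper's actual loss.

```python
import math

def bt_soft_target(margin: float, tau: float = 1.0) -> float:
    """Bradley-Terry probability that the chosen response beats the
    rejected one, given their quality-score margin (chosen - rejected)."""
    return 1.0 / (1.0 + math.exp(-margin / tau))

def mmpo_pair_loss(policy_logit: float, margin: float, tau: float = 1.0) -> float:
    """Cross-entropy between the soft BT target and the preference
    probability implied by the policy's reward-difference logit."""
    p = bt_soft_target(margin, tau)            # soft target in (0, 1)
    q = 1.0 / (1.0 + math.exp(-policy_logit))  # policy preference probability
    return -(p * math.log(q) + (1.0 - p) * math.log(1.0 - q))
```

Unlike a hard 0/1 preference label, a small quality margin yields a target near 0.5, so ambiguous pairs exert a weaker training signal than clear-cut ones.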
## Evaluation results

On MT-Bench, this model scores 7.53, higher than the 7.40 achieved when training with DPO:
<img src="https://cdn-uploads.huggingface.co/production/uploads/641a94e305290a1350418646/iFpJYNNHJZhlU70PK17k4.png" width="50%" />

On RewardBench, it achieves state-of-the-art performance among models at the same scale:
<img src="https://cdn-uploads.huggingface.co/production/uploads/641a94e305290a1350418646/OIwbSMUgvbD9HuVo6aVqV.png" width="80%" />
## Training and evaluation data