Ahjeong committed
Commit e22509a
1 Parent(s): fe1a540

Update README.md

Files changed (1)
  1. README.md +18 -5
README.md CHANGED
@@ -20,15 +20,28 @@ model-index:
  should probably proofread and complete it, then remove this comment. -->

  # MMPO_Gemma_7b_gamma1.1_epoch3
+ This is the model checkpoint for the paper:
+
+ **Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback** <br>
+ Kyuyoung Kim*, Ah Jeong Seo*, Hao Liu, Jinwoo Shin, Kimin Lee <br>
+ *In EMNLP 2024 Findings*
+
+
  This model is a fine-tuned version of [kykim0/gemma-7b-ultrachat-sft](https://huggingface.co/kykim0/gemma-7b-ultrachat-sft) on the [allenai/ultrafeedback_binarized_cleaned](https://huggingface.co/datasets/allenai/ultrafeedback_binarized_cleaned) dataset.

- This model is optimized with MMPO(Margin Matching Preference Optimization), which is a variation of DPO and utilizes margin information.
- For more detail, our paper is under review for now and a link will be attached if the paper is published on ArXiv.
+ The model is optimized with MMPO (Margin Matching Preference Optimization), which integrates per-feedback margins to enhance optimization.
+ Specifically, given quality margins in pairwise preferences, MMPO uses soft target probabilities based on the Bradley-Terry model.
+ You can find more details in the paper or in the [official code](https://github.com/kykim0/margin-matching-pref-opt).
+
+
+ ## Evaluation results
+
+ On MT-Bench, this model scores 7.53, higher than the 7.40 obtained when training with DPO:
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/641a94e305290a1350418646/iFpJYNNHJZhlU70PK17k4.png" width="50%" />

- It achieves the following results on the RewardBench dataset:
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/641a94e305290a1350418646/F0OdaUbBZXEwjbGcvV1de.png)
+ On RewardBench, it achieves state-of-the-art performance among competing models at the same scale:
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/641a94e305290a1350418646/OIwbSMUgvbD9HuVo6aVqV.png" width="80%" />

- Also, MT-Bench score is 7.53.


  ## Training and evaluation data
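
The updated card describes MMPO as training against soft target probabilities derived from the Bradley-Terry model over per-example quality margins. Below is a minimal sketch of that kind of objective, assuming a DPO-style implicit reward and a `sigmoid(gamma * margin)` soft target; the function name, signature, and exact target form are illustrative assumptions, not taken from the paper or the official code.

```python
# Minimal sketch of an MMPO-style objective (illustrative, not the official implementation).
# Assumption: the soft preference target is sigmoid(gamma * margin), i.e. a Bradley-Terry
# probability over the quality margin between chosen and rejected responses.
import torch
import torch.nn.functional as F


def mmpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_chosen | x), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_rejected | x), shape (batch,)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_chosen | x), shape (batch,)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_rejected | x), shape (batch,)
    margins: torch.Tensor,                # per-example quality margin from the feedback, shape (batch,)
    beta: float = 0.1,                    # DPO-style temperature on the implicit reward
    gamma: float = 1.1,                   # margin scale (cf. "gamma1.1" in the model name)
) -> torch.Tensor:
    # Implicit reward margin, as in DPO.
    logits = beta * (
        (policy_chosen_logps - ref_chosen_logps)
        - (policy_rejected_logps - ref_rejected_logps)
    )
    # Soft target probability that the chosen response is preferred.
    soft_target = torch.sigmoid(gamma * margins)
    # Cross-entropy between the model's preference probability and the soft target.
    return F.binary_cross_entropy_with_logits(logits, soft_target)
```

When the scaled margin is large, the soft target approaches a hard 1 and the loss reduces to the standard DPO objective, which is the sense in which MMPO generalizes DPO.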
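
For completeness, a typical way to query the checkpoint with `transformers`; the repo id below is inferred from the committer and model name and should be treated as an assumption, and the prompt and generation settings are arbitrary.

```python
# Illustrative usage; the repo id is assumed from the model name, not confirmed by the card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Ahjeong/MMPO_Gemma_7b_gamma1.1_epoch3"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Summarize what preference optimization does."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```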