Ahjeong committed
Commit e22509a
1 Parent(s): fe1a540

Update README.md

Files changed (1)
  1. README.md +18 -5
README.md CHANGED
@@ -20,15 +20,28 @@ model-index:
  should probably proofread and complete it, then remove this comment. -->

  # MMPO_Gemma_7b_gamma1.1_epoch3
+ This is the model checkpoint for the paper:
+
+ **Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback** <br>
+ Kyuyoung Kim*, Ah Jeong Seo*, Hao Liu, Jinwoo Shin, Kimin Lee <br>
+ *In EMNLP 2024 Findings*
+
+
  This model is a fine-tuned version of [kykim0/gemma-7b-ultrachat-sft](https://huggingface.co/kykim0/gemma-7b-ultrachat-sft) on the [allenai/ultrafeedback_binarized_cleaned](https://huggingface.co/datasets/allenai/ultrafeedback_binarized_cleaned) dataset.

- This model is optimized with MMPO(Margin Matching Preference Optimization), which is a variation of DPO and utilizes margin information.
- For more detail, our paper is under review for now and a link will be attached if the paper is published on ArXiv.
+ The model is optimized with MMPO (Margin Matching Preference Optimization), which integrates per-feedback margins to enhance optimization.
+ Specifically, given quality margins in pairwise preferences, MMPO uses soft target probabilities based on the Bradley-Terry model.
+ You can find more details in the paper or in the [official code](https://github.com/kykim0/margin-matching-pref-opt).
+
+
+ ## Evaluation results
+
+ On MT-Bench, this model scores 7.53, higher than the 7.40 obtained when training with DPO:
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/641a94e305290a1350418646/iFpJYNNHJZhlU70PK17k4.png" width="50%" />

- It achieves the following results on the RewardBench dataset:
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/641a94e305290a1350418646/F0OdaUbBZXEwjbGcvV1de.png)
+ On RewardBench, it achieves state-of-the-art performance among competing models at the same scale:
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/641a94e305290a1350418646/OIwbSMUgvbD9HuVo6aVqV.png" width="80%" />

- Also, MT-Bench score is 7.53.


  ## Training and evaluation data
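
The updated card describes MMPO as training against soft target probabilities derived from the Bradley-Terry model over per-example quality margins. Below is a minimal sketch of that kind of objective, assuming a DPO-style implicit reward and a `sigmoid(gamma * margin)` soft target; the function name, signature, and exact target form are illustrative assumptions, not taken from the paper or the official code.

```python
# Minimal sketch of an MMPO-style objective (illustrative, not the official implementation).
# Assumption: the soft preference target is sigmoid(gamma * margin), i.e. a Bradley-Terry
# probability over the quality margin between chosen and rejected responses.
import torch
import torch.nn.functional as F


def mmpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_chosen | x), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_rejected | x), shape (batch,)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_chosen | x), shape (batch,)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_rejected | x), shape (batch,)
    margins: torch.Tensor,                # per-example quality margin from the feedback, shape (batch,)
    beta: float = 0.1,                    # DPO-style temperature on the implicit reward
    gamma: float = 1.1,                   # margin scale (cf. "gamma1.1" in the model name)
) -> torch.Tensor:
    # Implicit reward margin, as in DPO.
    logits = beta * (
        (policy_chosen_logps - ref_chosen_logps)
        - (policy_rejected_logps - ref_rejected_logps)
    )
    # Soft target probability that the chosen response is preferred.
    soft_target = torch.sigmoid(gamma * margins)
    # Cross-entropy between the model's preference probability and the soft target.
    return F.binary_cross_entropy_with_logits(logits, soft_target)
```

When the scaled margin is large, the soft target approaches a hard 1 and the loss reduces to the standard DPO objective, which is the sense in which MMPO generalizes DPO.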
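
For completeness, a typical way to query the checkpoint with `transformers`; the repo id below is inferred from the committer and model name and should be treated as an assumption, and the prompt and generation settings are arbitrary.

```python
# Illustrative usage; the repo id is assumed from the model name, not confirmed by the card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Ahjeong/MMPO_Gemma_7b_gamma1.1_epoch3"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Summarize what preference optimization does."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```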