MMPO_Gemma_7b_gamma1.1_epoch3
This is the model checkpoint for the paper:
Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback
Kyuyoung Kim*, Ah Jeong Seo*, Hao Liu, Jinwoo Shin, Kimin Lee
In EMNLP 2024 Findings
This model is a fine-tuned version of kykim0/gemma-7b-ultrachat-sft on the allenai/ultrafeedback_binarized_cleaned dataset.
The model is optimized with MMPO (Margin Matching Preference Optimization), which incorporates per-feedback quality margins into the optimization objective. Specifically, given quality margins in pairwise preferences, MMPO uses soft target probabilities derived from the Bradley-Terry model. You can find more details in the paper or the official code.
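As a rough illustration, the sketch below trains a DPO-style implicit reward against a Bradley-Terry soft target instead of a hard label. The beta value, the use of binary cross-entropy, and the exact margin scaling (gamma, the 1.1 in this checkpoint's name) are assumptions made for illustration; consult the paper and the official code for the precise objective.

```python
import torch
import torch.nn.functional as F

def mmpo_loss(policy_chosen_logps, policy_rejected_logps,
              ref_chosen_logps, ref_rejected_logps,
              margins, beta=0.01, gamma=1.1):
    """Sketch of an MMPO-style objective (illustrative, not the official code).

    Instead of DPO's hard target of 1 for the chosen response, the target is a
    Bradley-Terry probability derived from the per-pair quality margin, scaled
    by gamma (hypothetical scaling; see the paper for the exact formulation).
    """
    # DPO-style implicit reward margin between chosen and rejected responses
    logits = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    # Soft target from the Bradley-Terry model over quality margins
    soft_targets = torch.sigmoid(gamma * margins)
    # Binary cross-entropy against the soft target instead of a hard label
    return F.binary_cross_entropy_with_logits(logits, soft_targets)
```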
Evaluation results
On MT-Bench, this model achieves a score of 7.53, higher than the 7.40 obtained when training with DPO.
On RewardBench, it achieves state-of-the-art performance among competing models at the same scale.
Training and evaluation data
- Training: UltraFeedback
- Evaluation: MT-Bench, RewardBench
Training hyperparameters
The following hyperparameters were used during training (a configuration sketch follows the list):
- learning_rate: 5e-07
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 16
- total_train_batch_size: 64
- total_eval_batch_size: 64
- optimizer: AdamW
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.3
- mixed_precision: bfloat16
- num_epochs: 3
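For reference, here is a minimal sketch of how these settings might map onto a Hugging Face `TrainingArguments` object. The output directory and optimizer name are illustrative, and the official MMPO code may configure training differently (e.g., through a TRL-style trainer config).

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="mmpo-gemma-7b",       # hypothetical output path
    learning_rate=5e-7,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=16,   # 1 per device x 4 GPUs x 16 steps = 64 effective batch size
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    warmup_ratio=0.3,
    optim="adamw_torch",
    bf16=True,                        # bfloat16 mixed precision
    seed=42,
)
```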
Base model
- google/gemma-7b
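A minimal usage sketch with the standard transformers API is shown below; it assumes the checkpoint ships the chat template inherited from the SFT base, so verify the template before relying on the formatting.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Ahjeong/MMPO_Gemma_7b_gamma1.1_epoch3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Explain preference optimization in one paragraph."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```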