
MMPO_Gemma_7b_gamma1.1_epoch3

This is the model checkpoint for the paper:

Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback
Kyuyoung Kim*, Ah Jeong Seo*, Hao Liu, Jinwoo Shin, Kimin Lee
In EMNLP 2024 Findings

This model is a fine-tuned version of kykim0/gemma-7b-ultrachat-sft on the allenai/ultrafeedback_binarized_cleaned dataset.
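
A minimal loading sketch using the standard `transformers` API; the prompt content and generation settings below are placeholders, and the chat template is assumed to carry over from the SFT base:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Ahjeong/MMPO_Gemma_7b_gamma1.1_epoch3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Build a single-turn chat prompt and generate a response.
messages = [{"role": "user", "content": "Explain preference optimization in one paragraph."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```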

The model is optimized with MMPO (Margin Matching Preference Optimization), which incorporates per-feedback quality margins into the optimization objective. Specifically, given a quality margin for each pairwise preference, MMPO constructs soft target probabilities from the Bradley-Terry model and trains against them. You can find more details in the paper or the official code.
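
As a rough sketch (not the official implementation), MMPO can be viewed as a DPO-style objective trained against Bradley-Terry soft targets instead of hard labels. The function name, `beta`, and the exact margin-to-probability mapping below are illustrative assumptions; `gamma` is assumed to correspond to the gamma=1.1 in the checkpoint name:

```python
import torch
import torch.nn.functional as F

def mmpo_loss(policy_logratios, ref_logratios, margins, beta=0.1, gamma=1.1):
    """Illustrative MMPO-style loss; a sketch, not the official code.

    policy_logratios: log pi(y_chosen|x) - log pi(y_rejected|x) under the policy
    ref_logratios:    the same log-ratio under the frozen reference model
    margins:          per-pair quality margins from granular feedback
    """
    # DPO's implicit reward difference between chosen and rejected responses.
    logits = beta * (policy_logratios - ref_logratios)
    # Soft target from the Bradley-Terry model: larger margins push the target
    # preference probability toward 1 (this scaling is an assumption here).
    soft_targets = torch.sigmoid(gamma * margins)
    # Cross-entropy against the soft target instead of DPO's hard label of 1.
    return F.binary_cross_entropy_with_logits(logits, soft_targets)
```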

Evaluation results

On MT-Bench, this model achieves a score of 7.53, higher than the 7.40 obtained when training with DPO.

On RewardBench, it achieves state-of-the-art performance compared to competing models at the same scale.

Training and evaluation data

  • Training: UltraFeedback
  • Evaluation: MT-Bench, RewardBench

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-07
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 4
  • gradient_accumulation_steps: 16
  • total_train_batch_size: 64 (see the arithmetic check after this list)
  • total_eval_batch_size: 64
  • optimizer: AdamW
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.3
  • mixed_precision: bfloat16
  • num_epochs: 3
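
A quick arithmetic check of how the effective batch size is derived from the values above:

```python
train_batch_size = 1               # per-device micro-batch
num_devices = 4                    # GPUs
gradient_accumulation_steps = 16
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
assert total_train_batch_size == 64
```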

Base model

  • google/gemma-7b