
phi-2-ipo-test-iter-0

This model is a fine-tuned version of lole25/phi-2-sft-ultrachat-lora on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set (how these metrics are conventionally computed is sketched after the list):

  • Loss: 2546.4375
  • Rewards/chosen: -0.1591
  • Rewards/rejected: -0.1612
  • Rewards/accuracies: 0.5220
  • Rewards/margins: 0.0021
  • Logps/rejected: -249.6534
  • Logps/chosen: -272.5227
  • Logits/rejected: 0.4171
  • Logits/chosen: 0.3526
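
The card does not define these metrics. As a hedged reading, assuming they follow TRL's DPOTrainer conventions for the IPO loss (the regularisation strength β is not stated on this card), they would correspond to:

```latex
% Hedged sketch, assuming TRL DPOTrainer conventions; \beta is not stated on this card.
% The logged rewards are scaled log-probability ratios of the policy against the reference model:
\[
r_{\text{chosen}}   = \beta \bigl(\log \pi_\theta(y_w \mid x) - \log \pi_{\text{ref}}(y_w \mid x)\bigr), \qquad
r_{\text{rejected}} = \beta \bigl(\log \pi_\theta(y_l \mid x) - \log \pi_{\text{ref}}(y_l \mid x)\bigr)
\]
% The margin is their difference, accuracy is the fraction of pairs with a positive margin,
% and the IPO objective regresses the unscaled margin towards 1/(2\beta):
\[
\text{margin} = r_{\text{chosen}} - r_{\text{rejected}}, \qquad
\mathcal{L}_{\text{IPO}} = \Bigl(\tfrac{r_{\text{chosen}} - r_{\text{rejected}}}{\beta} - \tfrac{1}{2\beta}\Bigr)^{2}
\]
```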

Model description

More information needed

Intended uses & limitations

More information needed
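
Pending fuller documentation, the sketch below shows one plausible way to load the adapter for inference with PEFT. The repository ids are taken from this card; the dtype, device placement, and generation settings are illustrative assumptions.

```python
# Hedged sketch: load the SFT base model and attach this IPO LoRA adapter with PEFT.
# Repo ids come from this card; everything else is an assumption, not documented here.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "lole25/phi-2-sft-ultrachat-lora"
adapter_id = "DUAL-GPO-2/phi-2-ipo-test-iter-0"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.bfloat16,   # assumption; precision is not stated on the card
    device_map="auto",
    trust_remote_code=True,       # may be needed for Phi-2 on Transformers 4.36.x
)
# If the base repo is adapter-only, load microsoft/phi-2 first and apply it as an adapter instead.
model = PeftModel.from_pretrained(model, adapter_id)  # attach the fine-tuned LoRA weights
model.eval()

prompt = "Explain LoRA fine-tuning in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```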

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a hedged configuration sketch follows the list):

  • learning_rate: 5e-06
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • distributed_type: multi-GPU
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 4
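
The exact training script is not included on this card. As a rough sketch of how these hyperparameters might map onto TRL's DPOTrainer with the IPO loss: β, sequence lengths, precision, the LoRA config, and the dataset preprocessing are illustrative assumptions; the remaining values come from the list above.

```python
# Hedged sketch only; the actual training code for this model is not published on the card.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_id = "lole25/phi-2-sft-ultrachat-lora"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Note: the raw split stores chat-format messages; DPOTrainer expects plain
# prompt/chosen/rejected text columns, so a preprocessing step (not shown) is assumed.
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

training_args = TrainingArguments(
    output_dir="phi-2-ipo-test-iter-0",
    learning_rate=5e-6,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,   # 4 x 4 = total train batch size 16
    num_train_epochs=4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
    bf16=True,                       # assumption; precision is not listed on the card
)

trainer = DPOTrainer(
    model,
    ref_model=None,                  # with a PEFT adapter, the frozen base model serves as the reference
    args=training_args,
    beta=0.01,                       # assumption; beta is not listed on the card
    loss_type="ipo",
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),  # illustrative
    max_length=1024,                 # illustrative
    max_prompt_length=512,           # illustrative
)
trainer.train()
```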

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 2477.3281 | 0.32 | 100  | 2500.7156 | -0.0018 | -0.0018 | 0.4930 | -0.0000 | -233.7207 | -256.7978 | 0.8796 | 0.8221 |
| 2224.3488 | 0.64 | 200  | 2499.8904 | -0.0195 | -0.0198 | 0.5015 | 0.0003  | -235.5204 | -258.5673 | 0.8051 | 0.7462 |
| 1898.0719 | 0.96 | 300  | 2505.6912 | -0.0563 | -0.0571 | 0.5140 | 0.0008  | -239.2530 | -262.2491 | 0.6844 | 0.6233 |
| 1879.8852 | 1.28 | 400  | 2516.0835 | -0.0944 | -0.0957 | 0.5200 | 0.0013  | -243.1053 | -266.0533 | 0.5839 | 0.5215 |
| 1917.2811 | 1.6  | 500  | 2527.1995 | -0.1156 | -0.1170 | 0.5115 | 0.0014  | -245.2343 | -268.1747 | 0.5244 | 0.4611 |
| 1799.3824 | 1.92 | 600  | 2534.4292 | -0.1363 | -0.1381 | 0.5210 | 0.0018  | -247.3504 | -270.2482 | 0.4714 | 0.4075 |
| 1751.5762 | 2.24 | 700  | 2531.3550 | -0.1448 | -0.1474 | 0.5180 | 0.0026  | -248.2780 | -271.0988 | 0.4545 | 0.3906 |
| 1711.1711 | 2.56 | 800  | 2536.2451 | -0.1487 | -0.1511 | 0.5145 | 0.0024  | -248.6440 | -271.4834 | 0.4402 | 0.3759 |
| 1894.4447 | 2.88 | 900  | 2542.6299 | -0.1549 | -0.1570 | 0.5235 | 0.0022  | -249.2417 | -272.1000 | 0.4262 | 0.3618 |
| 1798.5389 | 3.2  | 1000 | 2542.7288 | -0.1581 | -0.1604 | 0.5205 | 0.0023  | -249.5780 | -272.4200 | 0.4202 | 0.3559 |
| 1834.9711 | 3.52 | 1100 | 2542.2373 | -0.1586 | -0.1610 | 0.5205 | 0.0024  | -249.6345 | -272.4703 | 0.4177 | 0.3532 |
| 1765.5148 | 3.84 | 1200 | 2546.1714 | -0.1589 | -0.1610 | 0.5220 | 0.0021  | -249.6357 | -272.5010 | 0.4160 | 0.3515 |

Framework versions

  • PEFT 0.7.1
  • Transformers 4.36.2
  • Pytorch 2.2.1+cu121
  • Datasets 2.14.6
  • Tokenizers 0.15.2