
phi-2-dpo-ultrafeedback-lora

This model is a fine-tuned version of lole25/phi-2-sft-ultrachat-lora on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

  • Loss: 0.6537
  • Rewards/chosen: -0.2570
  • Rewards/rejected: -0.3767
  • Rewards/accuracies: 0.6580
  • Rewards/margins: 0.1196
  • Logps/rejected: -269.1014
  • Logps/chosen: -285.9487
  • Logits/rejected: 0.7335
  • Logits/chosen: 0.6309
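These metric names appear to be the ones logged by TRL's DPOTrainer, and they are linked. Rewards/margins is Rewards/chosen minus Rewards/rejected: -0.2570 - (-0.3767) = 0.1197, which matches 0.1196 up to rounding.

The sketch below shows how such metrics could be computed from per-example log-probabilities. It rests on a few assumptions:

- The default sigmoid DPO loss is used, without label smoothing.
- The card does not report the DPO β, so `beta=0.1` is an assumption.
- The function name `dpo_metrics` and all input values are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_metrics(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Hypothetical helper: batch-averaged DPO metrics, assuming the sigmoid DPO loss.

    Each argument is a list of summed sequence log-probabilities, one per example.
    beta=0.1 is an assumption; the model card does not report it.
    """
    n = len(policy_chosen)
    # Implicit rewards: beta * (policy log-prob - reference log-prob).
    chosen_rewards = [beta * (pc - rc) for pc, rc in zip(policy_chosen, ref_chosen)]
    rejected_rewards = [beta * (pr - rr) for pr, rr in zip(policy_rejected, ref_rejected)]
    margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    # Per-example sigmoid DPO loss: -log sigma(margin).
    losses = [-log_sigmoid(m) for m in margins]
    return {
        "loss": sum(losses) / n,
        "rewards/chosen": sum(chosen_rewards) / n,
        "rewards/rejected": sum(rejected_rewards) / n,
        # Fraction of pairs where the chosen reward strictly exceeds the rejected one.
        "rewards/accuracies": sum(1.0 for m in margins if m > 0) / n,
        "rewards/margins": sum(margins) / n,
        "logps/chosen": sum(policy_chosen) / n,
        "logps/rejected": sum(policy_rejected) / n,
    }


# Hypothetical log-probabilities for two preference pairs.
metrics = dpo_metrics(
    policy_chosen=[-280.0, -290.0],
    policy_rejected=[-265.0, -272.0],
    ref_chosen=[-278.0, -289.0],
    ref_rejected=[-262.0, -273.0],
    beta=0.1,
)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this example the second pair has a negative margin, so Rewards/accuracies is 0.5.

Under the same assumptions, the per-example loss is convex in the margin. The mean loss is therefore never below -log σ(mean margin). This is consistent with the reported loss of 0.6537 sitting above -log σ(0.1196) ≈ 0.635.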

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-06
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 4
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 64
  • total_eval_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 2
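The total batch sizes follow from the per-device values:

- Training: 4 per device × 4 devices × 4 gradient-accumulation steps = 64.
- Evaluation: 4 × 4 = 16, because gradient accumulation does not apply at evaluation time.

The sketch below covers this arithmetic and a cosine schedule with 10% linear warmup. The schedule is modelled on the formula used by the Hugging Face transformers cosine scheduler. The card does not report the total number of optimizer steps, so `total_steps` in the example is a hypothetical value, and both helper names are illustrative.

```python
import math


def effective_batch_size(per_device: int, num_devices: int, grad_accum: int) -> int:
    """Examples contributing to one optimizer update across all devices."""
    return per_device * num_devices * grad_accum


def cosine_lr_with_warmup(step: int, total_steps: int,
                          base_lr: float = 5e-6, warmup_ratio: float = 0.1) -> float:
    """Linear warmup to base_lr, then cosine decay to 0 (sketch, not the library code)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


print(effective_batch_size(4, 4, 4))  # training
print(effective_batch_size(4, 4, 1))  # evaluation (no gradient accumulation)

total_steps = 1000  # hypothetical; not reported in the card
for step in (0, 50, 100, 550, 1000):
    print(step, cosine_lr_with_warmup(step, total_steps))
```

With the hypothetical `total_steps=1000`, warmup covers the first 100 steps. The learning rate peaks at 5e-06 at step 100 and decays to 0 by step 1000.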

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6929 | 0.21 | 100 | 0.6928 | 0.0002 | -0.0010 | 0.5320 | 0.0012 | -231.5360 | -260.2240 | 0.9168 | 0.8145 |
| 0.6893 | 0.42 | 200 | 0.6891 | -0.0038 | -0.0134 | 0.6500 | 0.0096 | -232.7742 | -260.6225 | 0.9234 | 0.8205 |
| 0.6809 | 0.63 | 300 | 0.6810 | -0.0312 | -0.0611 | 0.6680 | 0.0299 | -237.5431 | -263.3647 | 0.9151 | 0.8092 |
| 0.6671 | 0.84 | 400 | 0.6723 | -0.0854 | -0.1408 | 0.6640 | 0.0553 | -245.5124 | -268.7867 | 0.8790 | 0.7713 |
| 0.6627 | 1.05 | 500 | 0.6645 | -0.1494 | -0.2293 | 0.6680 | 0.0799 | -254.3704 | -275.1849 | 0.8294 | 0.7217 |
| 0.6476 | 1.26 | 600 | 0.6591 | -0.1979 | -0.2968 | 0.6640 | 0.0989 | -261.1124 | -280.0337 | 0.7883 | 0.6828 |
| 0.6488 | 1.47 | 700 | 0.6559 | -0.2310 | -0.3414 | 0.6620 | 0.1104 | -265.5783 | -283.3440 | 0.7549 | 0.6511 |
| 0.6449 | 1.67 | 800 | 0.6542 | -0.2518 | -0.3695 | 0.6560 | 0.1177 | -268.3814 | -285.4226 | 0.7372 | 0.6347 |
| 0.6487 | 1.88 | 900 | 0.6539 | -0.2571 | -0.3764 | 0.6560 | 0.1193 | -269.0724 | -285.9532 | 0.7320 | 0.6299 |
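The table supports two consistency checks. Both assume TRL-style sigmoid DPO metrics averaged over the evaluation set:

- Rewards/margins should equal Rewards/chosen minus Rewards/rejected, up to the 4-decimal rounding.
- Because -log σ(m) is convex, the validation loss should be at least -log σ applied to the mean margin (Jensen's inequality).

The sketch below runs both checks. The values are copied from the table above, and the helper names are illustrative.

```python
import math

# (step, validation loss, rewards/chosen, rewards/rejected, rewards/margins),
# copied from the training results table.
ROWS = [
    (100, 0.6928, 0.0002, -0.0010, 0.0012),
    (200, 0.6891, -0.0038, -0.0134, 0.0096),
    (300, 0.6810, -0.0312, -0.0611, 0.0299),
    (400, 0.6723, -0.0854, -0.1408, 0.0553),
    (500, 0.6645, -0.1494, -0.2293, 0.0799),
    (600, 0.6591, -0.1979, -0.2968, 0.0989),
    (700, 0.6559, -0.2310, -0.3414, 0.1104),
    (800, 0.6542, -0.2518, -0.3695, 0.1177),
    (900, 0.6539, -0.2571, -0.3764, 0.1193),
]


def neg_log_sigmoid(x: float) -> float:
    """-log(sigmoid(x)) = log(1 + exp(-x)); stable for the non-negative margins here."""
    return math.log1p(math.exp(-x))


def check_rows(rows):
    """Return (step, margin_gap, loss_lower_bound, val_loss) for each table row."""
    results = []
    for step, val_loss, chosen, rejected, margin in rows:
        # Difference between the recomputed and the reported margin (rounding noise only).
        gap = (chosen - rejected) - margin
        # Jensen lower bound on the mean sigmoid DPO loss, assumed loss type.
        bound = neg_log_sigmoid(margin)
        results.append((step, gap, bound, val_loss))
    return results


for step, gap, bound, val_loss in check_rows(ROWS):
    print(f"step {step}: |gap|={abs(gap):.4f}  bound={bound:.4f}  val_loss={val_loss:.4f}")
```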

Framework versions

  • PEFT 0.7.1
  • Transformers 4.36.2
  • PyTorch 2.1.2+cu118
  • Datasets 2.14.6
  • Tokenizers 0.15.2

Model tree for lole25/phi-2-dpo-ultrafeedback-lora

Base model: microsoft/phi-2 (this model is an adapter)

Dataset used to train lole25/phi-2-dpo-ultrafeedback-lora: HuggingFaceH4/ultrafeedback_binarized