
zephyr-7b-dpo-qlora

This model is a fine-tuned version of alignment-handbook/zephyr-7b-sft-qlora on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

  • Loss: 0.4920
  • Rewards/chosen: -2.5098
  • Rewards/rejected: -3.5905
  • Rewards/accuracies: 0.7560
  • Rewards/margins: 1.0807
  • Logps/rejected: -600.3103
  • Logps/chosen: -516.2818
  • Logits/rejected: 2.5098
  • Logits/chosen: 2.2972
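
The reward margin reported above is the chosen reward minus the rejected reward. Under the sigmoid form of the DPO objective, the per-pair loss is the negative log-sigmoid of that margin. The sketch below checks the arithmetic and shows the loss at a zero margin. It assumes the sigmoid loss with no label smoothing, which this card does not state, and the function name `dpo_sigmoid_loss` is illustrative.

```python
import math


def dpo_sigmoid_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Per-pair DPO loss, assuming the sigmoid variant: -log(sigmoid(margin))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == softplus(-m), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Evaluation-set means reported in this card.
rewards_chosen = -2.5098
rewards_rejected = -3.5905

margin = rewards_chosen - rewards_rejected
print(f"margin = {margin:.4f}")  # margin = 1.0807

# With no preference signal (margin 0) the loss equals ln 2.
print(f"loss at zero margin = {dpo_sigmoid_loss(0.0, 0.0):.4f}")  # 0.6931
```

The reported loss is an average of per-pair losses, so it need not equal the loss evaluated at the average margin.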

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-06
  • train_batch_size: 4
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 2
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 32
  • total_eval_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
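
The total batch sizes follow from the per-device values, and the learning rate follows a warmup-then-cosine curve. The sketch below reproduces that arithmetic in plain Python. The schedule assumes linear warmup from zero followed by a half-cosine decay to zero, which is how cosine schedules with warmup commonly behave. `total_steps` is a hypothetical value chosen for illustration, since the card does not state the total number of optimizer steps.

```python
import math

# Values taken from the hyperparameter list above.
learning_rate = 5e-06
train_batch_size = 4  # per device
eval_batch_size = 8   # per device
num_devices = 2
gradient_accumulation_steps = 4
warmup_ratio = 0.1

# Effective batch sizes across devices (accumulation applies to training only).
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices
print(total_train_batch_size, total_eval_batch_size)  # 32 16


def cosine_lr(step: int, total_steps: int) -> float:
    """Learning rate at `step`: linear warmup, then half-cosine decay to zero."""
    warmup_steps = math.ceil(warmup_ratio * total_steps)
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical, for illustration only
for step in (0, 50, 100, 550, 1000):
    print(step, cosine_lr(step, total_steps))
```

In this sketch the rate peaks at `learning_rate` when warmup ends, then falls to half of that value midway through the decay phase.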

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6622 | 0.05 | 100 | 0.6637 | 0.0126 | -0.0636 | 0.6840 | 0.0762 | -247.6176 | -264.0424 | -2.2973 | -2.3242 |
| 0.6069 | 0.1 | 200 | 0.6175 | -0.5399 | -0.8086 | 0.6720 | 0.2687 | -322.1209 | -319.2918 | -1.9985 | -2.0644 |
| 0.5858 | 0.16 | 300 | 0.5707 | -0.8385 | -1.3622 | 0.6930 | 0.5238 | -377.4863 | -349.1537 | 0.2196 | 0.1195 |
| 0.5518 | 0.21 | 400 | 0.5536 | -0.8070 | -1.4119 | 0.7230 | 0.6049 | -382.4471 | -346.0015 | 0.8423 | 0.7208 |
| 0.5953 | 0.26 | 500 | 0.5575 | -0.6678 | -1.1831 | 0.7110 | 0.5153 | -359.5695 | -332.0846 | 1.2558 | 1.0708 |
| 0.5032 | 0.31 | 600 | 0.5359 | -1.3551 | -2.1333 | 0.7310 | 0.7782 | -454.5939 | -400.8145 | 2.8427 | 2.7062 |
| 0.5741 | 0.37 | 700 | 0.5317 | -1.2904 | -2.0407 | 0.7260 | 0.7503 | -445.3269 | -394.3451 | 3.1371 | 2.9904 |
| 0.5318 | 0.42 | 800 | 0.5149 | -1.6058 | -2.4688 | 0.7450 | 0.8630 | -488.1442 | -425.8877 | 3.7140 | 3.5383 |
| 0.5353 | 0.47 | 900 | 0.5125 | -2.5710 | -3.5411 | 0.7460 | 0.9701 | -595.3752 | -522.4096 | 4.4179 | 4.2065 |
| 0.574 | 0.52 | 1000 | 0.5035 | -2.6228 | -3.6684 | 0.7370 | 1.0456 | -608.1039 | -527.5898 | 2.6517 | 2.4408 |
| 0.471 | 0.58 | 1100 | 0.5028 | -2.6309 | -3.7142 | 0.75 | 1.0833 | -612.6806 | -528.3990 | 2.2637 | 2.0694 |
| 0.4888 | 0.63 | 1200 | 0.4965 | -2.4412 | -3.4135 | 0.7530 | 0.9723 | -582.6143 | -509.4261 | 2.4042 | 2.2263 |
| 0.5204 | 0.68 | 1300 | 0.4941 | -2.2701 | -3.2940 | 0.7480 | 1.0239 | -570.6591 | -492.3148 | 2.2065 | 2.0121 |
| 0.5158 | 0.73 | 1400 | 0.4925 | -2.6194 | -3.7070 | 0.7540 | 1.0875 | -611.9571 | -527.2493 | 2.4817 | 2.2784 |
| 0.4677 | 0.79 | 1500 | 0.4922 | -2.6220 | -3.7128 | 0.7540 | 1.0908 | -612.5421 | -527.5074 | 2.5848 | 2.3739 |
| 0.5464 | 0.84 | 1600 | 0.4925 | -2.5137 | -3.5972 | 0.7510 | 1.0835 | -600.9805 | -516.6763 | 2.4955 | 2.2803 |
| 0.5078 | 0.89 | 1700 | 0.4920 | -2.5031 | -3.5840 | 0.7550 | 1.0809 | -599.6627 | -515.6122 | 2.5160 | 2.3031 |
| 0.4864 | 0.94 | 1800 | 0.4921 | -2.5103 | -3.5902 | 0.7550 | 1.0799 | -600.2827 | -516.3320 | 2.5115 | 2.2982 |
| 0.5211 | 0.99 | 1900 | 0.4921 | -2.5098 | -3.5900 | 0.7550 | 1.0803 | -600.2638 | -516.2831 | 2.5098 | 2.2971 |

Framework versions

  • PEFT 0.7.1
  • Transformers 4.38.2
  • PyTorch 2.1.2
  • Datasets 2.14.6
  • Tokenizers 0.15.2
