
zephyr-7b-dpo-qlora

This model is a DPO fine-tuned version of ale-bay/zephyr-7b-sft-qlora on the HuggingFaceH4/ultrafeedback_binarized dataset (a hedged loading sketch follows the metrics below). It achieves the following results on the evaluation set:

  • Loss: 0.4975
  • Rewards/chosen: -2.4549
  • Rewards/rejected: -3.4757
  • Rewards/accuracies: 0.7490
  • Rewards/margins: 1.0207
  • Logps/rejected: -595.2866
  • Logps/chosen: -517.1966
  • Logits/rejected: -1.3432
  • Logits/chosen: -1.4358
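
As a rough sketch (not an official snippet from the authors), the adapter can presumably be loaded with PEFT on top of its base checkpoint in 4-bit, matching the QLoRA setup implied by the name. The repo id comes from this card; the quantization settings and the prompt are assumptions.

```python
import torch
from transformers import AutoTokenizer, BitsAndBytesConfig
from peft import AutoPeftModelForCausalLM

adapter_id = "ale-bay/zephyr-7b-dpo-qlora"

# 4-bit NF4 quantization, a typical QLoRA setup (assumed, not stated in the card).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# AutoPeftModelForCausalLM resolves the base model from the adapter config,
# then attaches the LoRA weights on top of it.
model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(adapter_id)

# Zephyr-style models expect a chat template; apply it before generation.
messages = [{"role": "user", "content": "Explain Direct Preference Optimization in one paragraph."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```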

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a sketch mapping them onto `TrainingArguments` follows the list):

  • learning_rate: 5e-06
  • train_batch_size: 4
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 2
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 32
  • total_eval_batch_size: 16
  • optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
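
As a rough illustration (not the authors' training script), these settings map onto Hugging Face `TrainingArguments` along the lines below. Anything not listed above, such as the output directory and the precision, is a placeholder or an assumption.

```python
from transformers import TrainingArguments

# Per-device sizes combine with 2 GPUs and 4 accumulation steps to give the
# effective batch sizes listed above: 4 * 2 * 4 = 32 for training and
# 8 * 2 = 16 for evaluation.
training_args = TrainingArguments(
    output_dir="zephyr-7b-dpo-qlora",  # placeholder
    learning_rate=5e-6,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,
    seed=42,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
    optim="adamw_torch",  # Adam defaults: betas=(0.9, 0.999), eps=1e-8
    bf16=True,            # assumption; precision is not stated in the card
)
```

In a typical DPO setup these arguments would then be passed to a preference-optimization trainer such as trl's `DPOTrainer`, though the card does not state which trainer was used.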

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---------------|-------|------|-----------------|----------------|------------------|--------------------|-----------------|----------------|--------------|-----------------|---------------|
| 0.6641        | 0.05  | 100  | 0.6636          | 0.0054         | -0.0681          | 0.6900             | 0.0735          | -254.5337      | -271.1659    | -2.0436         | -2.1368       |
| 0.6105        | 0.1   | 200  | 0.6075          | -0.3236        | -0.5938          | 0.6890             | 0.2702          | -307.0967      | -304.0613    | -2.0030         | -2.0919       |
| 0.5883        | 0.16  | 300  | 0.5817          | -0.7122        | -1.1286          | 0.7020             | 0.4164          | -360.5768      | -342.9188    | -1.9914         | -2.0761       |
| 0.5651        | 0.21  | 400  | 0.5665          | -0.7901        | -1.2897          | 0.7250             | 0.4996          | -376.6874      | -350.7093    | -1.9001         | -1.9820       |
| 0.5136        | 0.26  | 500  | 0.5520          | -1.0330        | -1.6646          | 0.7190             | 0.6316          | -414.1808      | -374.9992    | -1.8081         | -1.8880       |
| 0.5587        | 0.31  | 600  | 0.5327          | -1.3215        | -2.0089          | 0.7320             | 0.6874          | -448.6079      | -403.8534    | -1.4665         | -1.5609       |
| 0.5167        | 0.37  | 700  | 0.5299          | -1.2797        | -2.1992          | 0.7230             | 0.9196          | -467.6413      | -399.6684    | -1.3918         | -1.4903       |
| 0.5465        | 0.42  | 800  | 0.5189          | -1.6646        | -2.4686          | 0.7200             | 0.8041          | -494.5844      | -438.1617    | -1.3685         | -1.4642       |
| 0.5002        | 0.47  | 900  | 0.5142          | -1.7844        | -2.7217          | 0.7290             | 0.9373          | -519.8885      | -450.1383    | -1.4179         | -1.5054       |
| 0.5017        | 0.52  | 1000 | 0.5058          | -2.6175        | -3.6120          | 0.7360             | 0.9946          | -608.9218      | -533.4493    | -1.2973         | -1.3948       |
| 0.4966        | 0.58  | 1100 | 0.5043          | -2.0581        | -2.9819          | 0.7370             | 0.9239          | -545.9103      | -477.5080    | -1.3783         | -1.4740       |
| 0.5087        | 0.63  | 1200 | 0.5040          | -2.3715        | -3.3475          | 0.7450             | 0.9760          | -582.4712      | -508.8495    | -1.3331         | -1.4262       |
| 0.4799        | 0.68  | 1300 | 0.5011          | -2.3067        | -3.3444          | 0.7450             | 1.0377          | -582.1562      | -502.3687    | -1.3340         | -1.4277       |
| 0.4606        | 0.73  | 1400 | 0.4991          | -2.5016        | -3.5583          | 0.7430             | 1.0567          | -603.5469      | -521.8631    | -1.3291         | -1.4219       |
| 0.4763        | 0.79  | 1500 | 0.4985          | -2.4979        | -3.5204          | 0.7470             | 1.0225          | -599.7631      | -521.4944    | -1.3394         | -1.4325       |
| 0.5008        | 0.84  | 1600 | 0.4977          | -2.4555        | -3.4719          | 0.7480             | 1.0164          | -594.9102      | -517.2504    | -1.3492         | -1.4415       |
| 0.4654        | 0.89  | 1700 | 0.4976          | -2.4498        | -3.4672          | 0.7510             | 1.0174          | -594.4417      | -516.6852    | -1.3478         | -1.4402       |
| 0.4854        | 0.94  | 1800 | 0.4975          | -2.4526        | -3.4731          | 0.7480             | 1.0205          | -595.0339      | -516.9640    | -1.3441         | -1.4366       |
| 0.4879        | 0.99  | 1900 | 0.4974          | -2.4531        | -3.4740          | 0.7500             | 1.0209          | -595.1221      | -517.0148    | -1.3432         | -1.4359       |

Framework versions

  • PEFT 0.7.1
  • Transformers 4.39.3
  • Pytorch 2.3.0+cu121
  • Datasets 2.19.1
  • Tokenizers 0.15.2
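
For reproducibility, a pinned install matching these versions might look like the following; the CUDA 12.1 wheel index is an assumption based on the `+cu121` tag.

```shell
pip install peft==0.7.1 transformers==4.39.3 datasets==2.19.1 tokenizers==0.15.2
pip install torch==2.3.0 --index-url https://download.pytorch.org/whl/cu121
```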