
zephyr-7b-dpo-lora-r16

This model is a LoRA adapter (rank 16) for alignment-handbook/zephyr-7b-sft-full, trained with DPO on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set (a usage sketch follows the metric list):

  • Loss: 0.6410
  • Rewards/chosen: -2.2125
  • Rewards/rejected: -3.0591
  • Rewards/accuracies: 0.6650
  • Rewards/margins: 0.8466
  • Logps/rejected: -554.3575
  • Logps/chosen: -489.4880
  • Logits/rejected: -2.1525
  • Logits/chosen: -2.1542
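
This repository contains only the LoRA adapter, so it has to be attached to the base checkpoint at load time. A minimal usage sketch with PEFT and Transformers; the prompt and generation settings are illustrative, not taken from the card:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "alignment-handbook/zephyr-7b-sft-full"
adapter_id = "LaoRay/zephyr-7b-dpo-lora-r16"  # this repository

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, adapter_id)  # attach the LoRA adapter

messages = [{"role": "user", "content": "What is direct preference optimization?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```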

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed
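
The card names HuggingFaceH4/ultrafeedback_binarized above. A short sketch for inspecting its preference splits; the train_prefs/test_prefs split names follow the dataset's published layout and are assumptions as far as this card is concerned:

```python
from datasets import load_dataset

# DPO uses paired preference data: each row carries a prompt plus a
# preferred ("chosen") and a dispreferred ("rejected") completion.
ds = load_dataset("HuggingFaceH4/ultrafeedback_binarized")
prefs = ds["train_prefs"]  # assumed preference split name
print(prefs.column_names)  # expected to include prompt, chosen, rejected
print(prefs[0]["prompt"][:200])
```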

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a configuration sketch follows the list):

  • learning_rate: 5e-06
  • train_batch_size: 4
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 20
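
A sketch of how these values could map onto a DPO training run. The card does not list TRL among the framework versions, so the use of trl.DPOTrainer here, as well as the LoRA alpha/dropout choices, are assumptions; only the hyperparameter values above are taken from the card:

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "alignment-handbook/zephyr-7b-sft-full"
tokenizer = AutoTokenizer.from_pretrained(base_id)
ds = load_dataset("HuggingFaceH4/ultrafeedback_binarized")

# Rank-16 LoRA, matching the "-r16" suffix in the model name;
# alpha and dropout are assumptions, not stated in the card.
peft_config = LoraConfig(r=16, lora_alpha=16, lora_dropout=0.05,
                         task_type="CAUSAL_LM")

# Values copied from the hyperparameter list above. Per-device batch 4
# times 4 gradient-accumulation steps gives the effective batch size of 16.
args = DPOConfig(
    output_dir="zephyr-7b-dpo-lora-r16",
    learning_rate=5e-6,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,
    num_train_epochs=20,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
)

# Note: the chosen/rejected columns are chat-message lists; recent TRL
# versions apply the chat template automatically, while older ones need
# explicit preprocessing first.
trainer = DPOTrainer(
    model=base_id,                # TRL loads the base model from the hub
    args=args,
    train_dataset=ds["train_prefs"],
    eval_dataset=ds["test_prefs"],
    tokenizer=tokenizer,
    peft_config=peft_config,      # no ref_model needed: disabling the
)                                 # adapter recovers the reference policy
trainer.train()
```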

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.6933        | 0.992 | 62   | 0.6932          | 0.0001         | 0.0001           | 0.4950             | -0.0001         | -248.4330      | -268.2283    | -2.8603         | -2.8949       |
| 0.6898        | 2.0   | 125  | 0.6900          | -0.0019        | -0.0083          | 0.6300             | 0.0064          | -249.2751      | -268.4230    | -2.8539         | -2.8884       |
| 0.6767        | 2.992 | 187  | 0.6785          | -0.0026        | -0.0339          | 0.6300             | 0.0312          | -251.8334      | -268.4987    | -2.8254         | -2.8577       |
| 0.5861        | 4.0   | 250  | 0.6520          | -0.0414        | -0.1367          | 0.6350             | 0.0953          | -262.1136      | -272.3749    | -2.8027         | -2.8314       |
| 0.5654        | 4.992 | 312  | 0.6219          | -0.2603        | -0.4550          | 0.6500             | 0.1947          | -293.9497      | -294.2625    | -2.7777         | -2.8036       |
| 0.4986        | 6.0   | 375  | 0.6055          | -0.4927        | -0.7779          | 0.6800             | 0.2851          | -326.2355      | -317.5081    | -2.7652         | -2.7893       |
| 0.4719        | 6.992 | 437  | 0.6055          | -0.7077        | -1.0586          | 0.6900             | 0.3508          | -354.3046      | -339.0088    | -2.7391         | -2.7606       |
| 0.4512        | 8.0   | 500  | 0.6028          | -0.7213        | -1.1042          | 0.6750             | 0.3829          | -358.8697      | -340.3660    | -2.7246         | -2.7431       |
| 0.264         | 8.992 | 562  | 0.5955          | -1.0493        | -1.4939          | 0.7000             | 0.4446          | -397.8353      | -373.1655    | -2.6715         | -2.6867       |
| 0.3516        | 10.0  | 625  | 0.5927          | -1.1473        | -1.6948          | 0.6800             | 0.5474          | -417.9223      | -382.9673    | -2.5714         | -2.5856       |
| 0.3271        | 10.992| 687  | 0.5922          | -1.4044        | -2.0377          | 0.6900             | 0.6332          | -452.2125      | -408.6782    | -2.4751         | -2.4864       |
| 0.336         | 12.0  | 750  | 0.6034          | -1.6164        | -2.3135          | 0.7100             | 0.6972          | -479.8002      | -429.8719    | -2.3841         | -2.3919       |
| 0.2157        | 12.992| 812  | 0.6125          | -1.6968        | -2.4270          | 0.6800             | 0.7302          | -491.1445      | -437.9121    | -2.3161         | -2.3226       |
| 0.2436        | 14.0  | 875  | 0.6211          | -1.9546        | -2.7134          | 0.6800             | 0.7588          | -519.7897      | -463.6995    | -2.2583         | -2.2637       |
| 0.1747        | 14.992| 937  | 0.6250          | -2.0090        | -2.8105          | 0.6750             | 0.8015          | -529.4984      | -469.1342    | -2.2179         | -2.2224       |
| 0.162         | 16.0  | 1000 | 0.6350          | -2.1464        | -2.9679          | 0.6750             | 0.8214          | -545.2337      | -482.8784    | -2.1872         | -2.1901       |
| 0.1898        | 16.992| 1062 | 0.6415          | -2.2332        | -3.0695          | 0.6700             | 0.8363          | -555.3980      | -491.5554    | -2.1618         | -2.1639       |
| 0.1337        | 18.0  | 1125 | 0.6401          | -2.2070        | -3.0519          | 0.6700             | 0.8449          | -553.6342      | -488.9332    | -2.1605         | -2.1619       |
| 0.1233        | 18.992| 1187 | 0.6414          | -2.2093        | -3.0569          | 0.6650             | 0.8476          | -554.1345      | -489.1610    | -2.1630         | -2.1636       |
| 0.1832        | 19.84 | 1240 | 0.6410          | -2.2125        | -3.0591          | 0.6650             | 0.8466          | -554.3575      | -489.4880    | -2.1525         | -2.1542       |
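
The Rewards/* columns follow the usual DPO bookkeeping: a completion's implicit reward is beta times the log-probability ratio between the policy and the frozen reference model, the margin is the chosen reward minus the rejected reward, and accuracy is the fraction of pairs with a positive margin. A minimal sketch of that computation (beta=0.1 is an assumed default; the card does not state it):

```python
import torch
import torch.nn.functional as F

def dpo_metrics(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Compute the DPO loss and the reward statistics logged above.

    Each argument is a tensor of summed per-sequence log-probabilities
    (the quantities behind the Logps/* columns).
    """
    # Implicit rewards: beta * log-ratio of policy vs. reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    margins = chosen_rewards - rejected_rewards        # Rewards/margins
    accuracy = (margins > 0).float().mean()            # Rewards/accuracies
    loss = -F.logsigmoid(margins).mean()               # DPO objective
    return loss, chosen_rewards.mean(), rejected_rewards.mean(), accuracy
```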

Framework versions

  • PEFT 0.12.0
  • Transformers 4.44.0
  • PyTorch 2.4.0+cu121
  • Datasets 2.20.0
  • Tokenizers 0.19.1