metadata

tags:
  - trl
  - dpo
  - generated_from_trainer
model-index:
  - name: zephyr-7b-dpo-full
    results: []

zephyr-7b-dpo-full

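This model was trained from scratch on an unspecified dataset (recorded as None). The evaluation-set results are not listed in this card; at the last logged evaluation (step 800), the validation loss was 0.3183 and the reward accuracy was 0.8711.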

Model description

More information needed

The hyperparameters used during training are not listed in this card. Training and validation metrics, logged every 100 steps:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.5118	0.1151	100	0.5923	-0.1120	-0.4506	0.7070	0.3386	-417.6701	-390.5766	-2.1984	-2.2213
0.4206	0.2303	200	0.5055	-0.2913	-1.0785	0.8008	0.7872	-480.4641	-408.5089	-3.2280	-3.1644
0.4144	0.3454	300	0.4504	-0.3084	-1.2736	0.7773	0.9651	-499.9700	-410.2218	-4.0963	-3.8861
0.4011	0.4606	400	0.4135	-0.4247	-1.5332	0.8086	1.1086	-525.9362	-421.8441	-4.8370	-4.5018
0.3915	0.5757	500	0.3740	-0.3892	-1.7143	0.8516	1.3251	-544.0394	-418.2938	-5.1877	-4.7675
0.3726	0.6908	600	0.3468	-0.4807	-1.8892	0.8438	1.4085	-561.5286	-427.4439	-5.6248	-5.1461
0.3522	0.8060	700	0.3249	-0.5431	-2.0476	0.8789	1.5044	-577.3692	-433.6906	-5.6819	-5.2107
0.3643	0.9211	800	0.3183	-0.6032	-2.1160	0.8711	1.5128	-584.2130	-439.6992	-5.8852	-5.4031
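
As a quick consistency check on the table above, the sketch below recomputes Rewards/margins as Rewards/chosen minus Rewards/rejected for each logged step. It assumes the usual DPO logging convention, in which the margin is the mean difference between the chosen and rejected rewards; the card itself does not state this. The names `ROWS` and `margin_residuals` are illustrative only, and the values are copied by hand from the table.

```python
# Values copied from the training results table:
# (step, rewards/chosen, rewards/rejected, rewards/margins)
ROWS = [
    (100, -0.1120, -0.4506, 0.3386),
    (200, -0.2913, -1.0785, 0.7872),
    (300, -0.3084, -1.2736, 0.9651),
    (400, -0.4247, -1.5332, 1.1086),
    (500, -0.3892, -1.7143, 1.3251),
    (600, -0.4807, -1.8892, 1.4085),
    (700, -0.5431, -2.0476, 1.5044),
    (800, -0.6032, -2.1160, 1.5128),
]


def margin_residuals(rows):
    """Return {step: logged_margin - (chosen - rejected)} for each row.

    Assumption: the logged margin is the chosen reward minus the
    rejected reward, so each residual should be close to zero.
    """
    return {
        step: margin - (chosen - rejected)
        for step, chosen, rejected, margin in rows
    }


residuals = margin_residuals(ROWS)
max_abs = max(abs(r) for r in residuals.values())

# The table is rounded to four decimals, so small residuals are expected.
for step, r in residuals.items():
    print(f"step {step}: residual {r:+.4f}")
print(f"largest absolute residual: {max_abs:.4f}")
```

Under that assumption, every residual should stay within the rounding error of the four-decimal values in the table.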