---
tags:
  - trl
  - dpo
  - generated_from_trainer
model-index:
  - name: zephyr-7b-dpo-full
    results: []
---

# zephyr-7b-dpo-full

This model was trained from scratch on an unspecified dataset. It achieves the following results on the evaluation set:

- Loss: 0.5261
- Rewards/chosen: -2.4591
- Rewards/rejected: -3.9221
- Rewards/accuracies: 0.7773
- Rewards/margins: 1.4631
- Logps/rejected: -703.8400
- Logps/chosen: -549.4910
- Logits/rejected: 0.0289
- Logits/chosen: 0.0663
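As these trl-style metrics are usually defined, the reported rewards already include the DPO beta scaling, the margin is simply chosen minus rejected reward, and the per-example DPO loss is the negative log-sigmoid of that margin. A minimal plain-Python sketch under those assumptions:

```python
import math

def dpo_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Per-example DPO loss: -log(sigmoid(reward_chosen - reward_rejected)).
    Assumes the beta scaling is already folded into the rewards, as in the
    trainer-reported metrics above."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Mean eval rewards from the list above:
margin = -2.4591 - (-3.9221)
print(round(margin, 4))  # 1.463 (the reported 1.4631 differs only by rounding)
```

Note that the reported eval loss (0.5261) is the mean of per-example losses, so it need not equal `dpo_loss` evaluated at the mean rewards.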

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 1e-06
- train_batch_size: 4
- eval_batch_size: 8
- seed: 2
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 4
- total_train_batch_size: 128
- total_eval_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 2
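The total batch sizes above follow from the per-device settings; a quick check of the arithmetic:

```python
# Derived batch sizes, reconstructed from the per-device settings above.
train_batch_size = 4               # per device
eval_batch_size = 8                # per device
num_devices = 8
gradient_accumulation_steps = 4

total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices  # no accumulation at eval time

print(total_train_batch_size, total_eval_batch_size)  # 128 64
```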

### Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.6201        | 0.21  | 100  | 0.6253          | -0.2753        | -0.6662          | 0.7031             | 0.3909          | -378.2405      | -331.1124    | 0.4172          | 0.3706        |
| 0.5547        | 0.42  | 200  | 0.5549          | -0.6988        | -1.4726          | 0.7656             | 0.7738          | -458.8863      | -373.4661    | 0.4261          | 0.3909        |
| 0.5343        | 0.63  | 300  | 0.5316          | -0.8044        | -1.6474          | 0.7656             | 0.8430          | -476.3628      | -384.0199    | 0.2851          | 0.2449        |
| 0.5323        | 0.84  | 400  | 0.5211          | -0.9068        | -1.8283          | 0.7812             | 0.9216          | -494.4600      | -394.2621    | 0.2834          | 0.2514        |
| 0.352         | 1.05  | 500  | 0.5258          | -1.9533        | -3.4166          | 0.7969             | 1.4634          | -653.2899      | -498.9117    | -0.0846         | -0.0654       |
| 0.3342        | 1.26  | 600  | 0.5268          | -2.3123        | -3.7246          | 0.7930             | 1.4124          | -684.0857      | -534.8101    | 0.1128          | 0.1344        |
| 0.337         | 1.47  | 700  | 0.5290          | -2.3753        | -3.8837          | 0.7773             | 1.5084          | -699.9910      | -541.1116    | 0.0099          | 0.0414        |
| 0.3398        | 1.67  | 800  | 0.5297          | -2.5097        | -4.0133          | 0.7734             | 1.5036          | -712.9506      | -554.5546    | 0.0381          | 0.0750        |
| 0.307         | 1.88  | 900  | 0.5261          | -2.4591        | -3.9221          | 0.7773             | 1.4631          | -703.8400      | -549.4910    | 0.0289          | 0.0663        |
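The learning-rate trajectory over these steps follows the configured cosine schedule with 10% warmup. A simplified reconstruction (a sketch of the same warmup/cosine shape, not the exact Transformers implementation):

```python
import math

def lr_at_step(step: int, total_steps: int,
               base_lr: float = 1e-6, warmup_ratio: float = 0.1) -> float:
    """Linear warmup to base_lr over warmup_ratio * total_steps,
    then cosine decay to zero. Mirrors lr_scheduler_type=cosine with
    lr_scheduler_warmup_ratio=0.1 from the hyperparameters above."""
    warmup_steps = max(1, int(warmup_ratio * total_steps))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

The rate starts at 0, peaks at `base_lr` when warmup ends, and decays to 0 at the final step.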

## Framework versions

- Transformers 4.35.2
- Pytorch 2.1.2+cu121
- Datasets 2.14.6
- Tokenizers 0.14.1