metadata

tags:
  - trl
  - dpo
  - generated_from_trainer
model-index:
  - name: zephyr-7b-dpo-full
    results: []

zephyr-7b-dpo-full

This model was trained from scratch on an unspecified dataset. It achieves the following results on the evaluation set:

Model description

More information needed

The following hyperparameters were used during training:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.4898	0.12	100	0.5505	-0.1967	-1.0051	0.6875	0.8085	-353.2088	-339.4445	-1.7659	-1.8469
0.4277	0.23	200	0.4655	-0.4834	-1.8836	0.7383	1.4002	-370.7788	-345.1795	-1.7248	-1.8009
0.4188	0.35	300	0.3922	-0.0720	-2.0263	0.7969	1.9544	-373.6328	-336.9513	-1.6143	-1.6899
0.3506	0.46	400	0.3457	0.2171	-2.0472	0.8203	2.2643	-374.0495	-331.1692	-1.9794	-2.0296
0.3611	0.58	500	0.2959	0.2498	-2.4347	0.8516	2.6844	-381.7997	-330.5164	-1.8183	-1.8592
0.3562	0.69	600	0.2513	0.3868	-2.4732	0.8711	2.8600	-382.5696	-327.7753	-1.9217	-1.9736
0.3624	0.81	700	0.2194	0.6454	-2.3556	0.9062	3.0010	-380.2178	-322.6031	-1.9301	-1.9717
0.4069	0.92	800	0.2027	0.6729	-2.3580	0.9141	3.0309	-380.2658	-322.0539	-1.9204	-1.9591
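
The reward columns in the table above can be cross-checked against each other. The sketch below assumes that Rewards/margins is logged as the mean of the chosen reward minus the rejected reward, which is how TRL's DPO trainer usually reports it; that assumption is not stated in this card. The rows are copied from the table, and the names `ROWS`, `margin_gaps`, and `TOLERANCE` are illustrative only.

```python
# Rows copied from the training-results table:
# (step, rewards/chosen, rewards/rejected, rewards/margins)
ROWS = [
    (100, -0.1967, -1.0051, 0.8085),
    (200, -0.4834, -1.8836, 1.4002),
    (300, -0.0720, -2.0263, 1.9544),
    (400, 0.2171, -2.0472, 2.2643),
    (500, 0.2498, -2.4347, 2.6844),
    (600, 0.3868, -2.4732, 2.8600),
    (700, 0.6454, -2.3556, 3.0010),
    (800, 0.6729, -2.3580, 3.0309),
]


def margin_gaps(rows):
    """Return |margin - (chosen - rejected)| for each row."""
    return [
        abs(margin - (chosen - rejected))
        for _step, chosen, rejected, margin in rows
    ]


# Logged values are rounded to 4 decimals, so allow a small rounding tolerance.
# The tolerance is an assumption chosen for this check, not a value from the card.
TOLERANCE = 2e-4

gaps = margin_gaps(ROWS)
consistent = all(gap <= TOLERANCE for gap in gaps)
print(consistent)
```

Under that assumption, every logged margin matches the difference of the two reward columns to within rounding. The margin grows from 0.8085 at step 100 to 3.0309 at step 800, while Rewards/accuracies rises from 0.6875 to 0.9141.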