metadata

tags:
  - trl
  - dpo
  - generated_from_trainer
model-index:
  - name: zephyr-7b-dpo-full
    results: []

zephyr-7b-dpo-full

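This model was trained from scratch with DPO (per the trl and dpo tags above); the training dataset is not specified. At the last logged evaluation (step 800, epoch 0.92) it reaches a validation loss of 0.4292 and a reward accuracy of 0.8242 on the evaluation set. The full evaluation log is in the table at the end of this card.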

Model description

More information needed

The hyperparameters used during training are not listed. Training and evaluation metrics were logged every 100 steps:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.5405	0.12	100	0.6086	-0.8599	-1.1867	0.6953	0.3268	-451.7755	-421.5048	-1.6547	-1.7462
0.4371	0.23	200	0.5454	-2.0208	-2.5842	0.7422	0.5634	-591.5291	-537.5920	-0.7151	-0.8867
0.4348	0.35	300	0.5012	-2.0998	-2.8410	0.7734	0.7413	-617.2101	-545.4883	-0.3499	-0.5939
0.3733	0.46	400	0.4721	-2.1506	-2.9308	0.7773	0.7802	-626.1902	-550.5717	-0.2280	-0.5456
0.3689	0.58	500	0.4484	-2.0467	-2.9485	0.7969	0.9018	-627.9595	-540.1826	-0.1091	-0.4774
0.3829	0.69	600	0.4419	-2.0265	-2.9075	0.8086	0.8810	-623.8541	-538.1624	-0.1412	-0.5099
0.3725	0.81	700	0.4329	-1.9184	-2.8079	0.8242	0.8895	-613.8932	-527.3496	-0.3224	-0.6920
0.4052	0.92	800	0.4292	-1.8869	-2.7914	0.8242	0.9045	-612.2493	-524.2042	-0.4436	-0.8025
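
The Rewards/margins column appears to be the difference between Rewards/chosen and Rewards/rejected. The following is a minimal sketch that checks this against the rows above and picks the step with the lowest validation loss. The values are copied from the table; the names `ROWS`, `margin_gaps`, and `best_step` are hypothetical helpers introduced here for illustration and are not part of TRL.

```python
# Each row: (step, validation_loss, rewards_chosen, rewards_rejected,
#            rewards_accuracies, rewards_margins), copied from the table above.
ROWS = [
    (100, 0.6086, -0.8599, -1.1867, 0.6953, 0.3268),
    (200, 0.5454, -2.0208, -2.5842, 0.7422, 0.5634),
    (300, 0.5012, -2.0998, -2.8410, 0.7734, 0.7413),
    (400, 0.4721, -2.1506, -2.9308, 0.7773, 0.7802),
    (500, 0.4484, -2.0467, -2.9485, 0.7969, 0.9018),
    (600, 0.4419, -2.0265, -2.9075, 0.8086, 0.8810),
    (700, 0.4329, -1.9184, -2.8079, 0.8242, 0.8895),
    (800, 0.4292, -1.8869, -2.7914, 0.8242, 0.9045),
]


def margin_gaps(rows):
    """Map each step to |(chosen - rejected) - reported margin|."""
    return {
        step: abs((chosen - rejected) - margin)
        for step, _loss, chosen, rejected, _acc, margin in rows
    }


def best_step(rows):
    """Return the step with the lowest validation loss."""
    return min(rows, key=lambda row: row[1])[0]


if __name__ == "__main__":
    gaps = margin_gaps(ROWS)
    # Assumption: a gap of up to 2e-4 is attributed to 4-decimal rounding.
    consistent = all(gap < 2e-4 for gap in gaps.values())
    print("margins consistent with chosen - rejected:", consistent)
    print("step with lowest validation loss:", best_step(ROWS))
```

Because the logged values are rounded to four decimals, the check uses a small tolerance rather than exact equality. Note that the validation loss keeps falling through step 800 while the margin and accuracy flatten after about step 500.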