---
tags:
  - trl
  - dpo
  - generated_from_trainer
model-index:
  - name: zephyr-7b-dpo-full
    results: []
---

# zephyr-7b-dpo-full

This model was trained from scratch on an unspecified dataset. It achieves the following results on the evaluation set:

- Loss: 1.3947
- Rewards/chosen: -2.4314
- Rewards/rejected: -2.0023
- Rewards/accuracies: 0.3867
- Rewards/margins: -0.4292
- Logps/rejected: -517.7516
- Logps/chosen: -554.9180
- Logits/rejected: -1.0823
- Logits/chosen: -1.1239
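In DPO, the reported reward margin is simply the chosen reward minus the rejected reward, so the evaluation metrics can be cross-checked against each other (the small discrepancy is rounding, since the logged values are truncated to four decimals):

```python
# Sanity check: Rewards/margins = Rewards/chosen - Rewards/rejected,
# using the evaluation-set values reported above.
rewards_chosen = -2.4314
rewards_rejected = -2.0023
reported_margin = -0.4292

computed_margin = rewards_chosen - rewards_rejected
assert abs(computed_margin - reported_margin) < 1e-3  # matches up to rounding
print(round(computed_margin, 4))  # -0.4291
```

The negative margin (and accuracy below 0.5) indicates the policy assigns higher implicit reward to rejected responses than chosen ones on the eval set.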

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 5e-07
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 128
- total_eval_batch_size: 64
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
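The effective batch sizes above follow directly from the per-device settings; a quick check of the arithmetic:

```python
# Effective batch sizes implied by the hyperparameters above.
per_device_train_batch_size = 8
per_device_eval_batch_size = 8
num_devices = 8
gradient_accumulation_steps = 2

# Gradient accumulation multiplies the effective training batch only;
# evaluation runs without accumulation.
total_train_batch_size = per_device_train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = per_device_eval_batch_size * num_devices

assert total_train_batch_size == 128
assert total_eval_batch_size == 64
print(total_train_batch_size, total_eval_batch_size)  # 128 64
```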

### Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.3047        | 0.1   | 100  | 0.8551          | -0.4930        | -0.2025          | 0.3203             | -0.2905         | -337.7748      | -361.0801    | -2.3863         | -2.4405       |
| 0.1861        | 0.21  | 200  | 1.0495          | -1.3850        | -1.0357          | 0.3867             | -0.3493         | -421.0934      | -450.2716    | -1.5107         | -1.5418       |
| 0.1608        | 0.31  | 300  | 1.0910          | -1.4317        | -1.0544          | 0.3945             | -0.3772         | -422.9684      | -454.9446    | -1.4022         | -1.4367       |
| 0.1368        | 0.42  | 400  | 1.3010          | -2.0839        | -1.6212          | 0.4102             | -0.4627         | -479.6456      | -520.1699    | -1.0131         | -1.0538       |
| 0.1364        | 0.52  | 500  | 1.1773          | -1.5832        | -1.1334          | 0.3711             | -0.4498         | -430.8614      | -470.0934    | -1.6090         | -1.6466       |
| 0.1223        | 0.63  | 600  | 1.3206          | -2.2971        | -1.8297          | 0.4141             | -0.4674         | -500.4930      | -541.4883    | -1.1541         | -1.1880       |
| 0.0971        | 0.73  | 700  | 1.4638          | -2.6554        | -2.1594          | 0.3906             | -0.4959         | -533.4667      | -577.3128    | -0.9392         | -0.9712       |
| 0.1035        | 0.84  | 800  | 1.4475          | -2.5761        | -2.1538          | 0.3945             | -0.4222         | -532.9068      | -569.3817    | -0.8902         | -0.9232       |
| 0.088         | 0.94  | 900  | 1.3947          | -2.4314        | -2.0023          | 0.3867             | -0.4292         | -517.7516      | -554.9180    | -1.0823         | -1.1239       |
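The reward columns in the table are DPO's implicit rewards: beta-scaled log-probability ratios of the policy against the frozen reference model, and the training loss is the negative log-sigmoid of the reward margin. A minimal sketch of that objective, assuming TRL's default `beta=0.1` (the actual beta used for this run is not recorded in this card), with illustrative log-probabilities:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO objective: -log sigmoid(beta * (chosen - rejected logratio)).

    beta=0.1 is TRL's default; the value used for this run is not listed here.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log sigmoid(margin) == log(1 + exp(-margin))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A zero margin gives loss log(2) ~= 0.6931; a positive margin drives it toward 0.
print(round(dpo_loss(-10.0, -12.0, -11.0, -11.0), 4))  # 0.5981
```

Under this objective a validation loss well above log(2), as in the later rows of the table, corresponds to the negative margins reported alongside it.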

### Framework versions

- Transformers 4.38.2
- Pytorch 2.1.2+cu118
- Datasets 2.16.1
- Tokenizers 0.15.2