metadata

tags:
  - trl
  - dpo
  - generated_from_trainer
model-index:
  - name: zephyr-7b-dpo-full
    results: []

zephyr-7b-dpo-full

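This model was trained with DPO using TRL (per the tags above); neither the base model nor the training dataset is recorded in this card. No summary of evaluation-set results is recorded either; the per-step validation metrics appear in the training results table below.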

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

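The hyperparameters used during training were not preserved in this card. The table below reports the training and validation metrics logged at each evaluation step: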

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.2555 | 0.1 | 100 | 1.4172 | -4.8884 | -5.6701 | 0.5898 | 0.7817 | -884.5335 | -800.6121 | -1.3358 | -1.3942 |
| 0.1854 | 0.21 | 200 | 1.6754 | -6.1508 | -7.3259 | 0.6211 | 1.1752 | -1050.1200 | -926.8517 | -1.1088 | -1.1853 |
| 0.1799 | 0.31 | 300 | 1.5590 | -5.9157 | -6.9794 | 0.5977 | 1.0637 | -1015.4615 | -903.3419 | -1.0193 | -1.1110 |
| 0.1679 | 0.42 | 400 | 2.1030 | -7.8503 | -9.2060 | 0.6094 | 1.3557 | -1238.1252 | -1096.8108 | -0.5753 | -0.7096 |
| 0.1693 | 0.52 | 500 | 1.6563 | -6.3408 | -7.6718 | 0.625 | 1.3310 | -1084.7078 | -945.8611 | -0.8598 | -0.9873 |
| 0.1609 | 0.63 | 600 | 1.6818 | -6.4795 | -7.7992 | 0.6211 | 1.3198 | -1097.4480 | -959.7227 | -0.4515 | -0.6164 |
| 0.1559 | 0.73 | 700 | 1.9278 | -7.3485 | -8.7955 | 0.6133 | 1.4470 | -1197.0731 | -1046.6217 | -0.4166 | -0.5852 |
| 0.1433 | 0.84 | 800 | 1.9050 | -7.1496 | -8.6252 | 0.6172 | 1.4756 | -1180.0403 | -1026.7318 | -0.5141 | -0.6745 |
| 0.1479 | 0.94 | 900 | 1.8979 | -6.9869 | -8.4701 | 0.6094 | 1.4832 | -1164.5387 | -1010.4669 | -0.5643 | -0.7199 |
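
The reward columns in the table above can be sanity-checked with a few lines of Python. The sketch below assumes that Rewards/margins is the mean of the chosen reward minus the rejected reward, which is how DPO margins are usually defined; the card itself does not state this. The names `ROWS`, `margin_gaps`, and `step_with` are illustrative and not part of TRL. The values are copied from the table, with the log-probability and logit columns omitted.

```python
# Rows copied from the results table above:
# (step, validation_loss, rewards_chosen, rewards_rejected,
#  rewards_accuracy, rewards_margin)
ROWS = [
    (100, 1.4172, -4.8884, -5.6701, 0.5898, 0.7817),
    (200, 1.6754, -6.1508, -7.3259, 0.6211, 1.1752),
    (300, 1.5590, -5.9157, -6.9794, 0.5977, 1.0637),
    (400, 2.1030, -7.8503, -9.2060, 0.6094, 1.3557),
    (500, 1.6563, -6.3408, -7.6718, 0.625, 1.3310),
    (600, 1.6818, -6.4795, -7.7992, 0.6211, 1.3198),
    (700, 1.9278, -7.3485, -8.7955, 0.6133, 1.4470),
    (800, 1.9050, -7.1496, -8.6252, 0.6172, 1.4756),
    (900, 1.8979, -6.9869, -8.4701, 0.6094, 1.4832),
]


def margin_gaps(rows):
    """Absolute gap per row between the reported margin and (chosen - rejected)."""
    return [
        abs((chosen - rejected) - margin)
        for _, _, chosen, rejected, _, margin in rows
    ]


def step_with(rows, column, pick):
    """Return the step whose value in `column` is selected by `pick` (min or max)."""
    return pick(rows, key=lambda row: row[column])[0]


# Assumption being checked: margin == chosen - rejected, up to rounding.
print("largest margin gap:", max(margin_gaps(ROWS)))
print("lowest validation loss at step:", step_with(ROWS, 1, min))
print("highest reward accuracy at step:", step_with(ROWS, 4, max))
print("largest reward margin at step:", step_with(ROWS, 5, max))
```

Under that assumption, the reported margins agree with chosen minus rejected to within rounding of the four-decimal values. In the table, the lowest validation loss is at step 100, the highest reward accuracy at step 500, and the largest reward margin at step 900, so the choice of checkpoint depends on which metric is prioritized.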