---
tags:
  - trl
  - dpo
  - generated_from_trainer
model-index:
  - name: zephyr-7b-dpo-full
    results: []
---

# zephyr-7b-dpo-full

This model was trained from scratch on an unspecified dataset. It achieves the following results on the evaluation set (the reward columns are the implicit DPO rewards; a sketch of how they are computed follows the list):

- Loss: 0.5366
- Rewards/chosen: -2.9738
- Rewards/rejected: -4.4991
- Rewards/accuracies: 0.7617
- Rewards/margins: 1.5252
- Logps/rejected: -767.4317
- Logps/chosen: -609.1594
- Logits/rejected: 1.6095
- Logits/chosen: 0.9559
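
The reward figures above are the implicit DPO rewards: β times the log-probability ratio between the trained policy and the frozen reference model, with the margin being chosen minus rejected. A minimal sketch of how these quantities relate to the loss, assuming the standard DPO formulation used by `trl` (the β used for this run is not stated in the card, so the 0.01 below is only a placeholder):

```python
import torch
import torch.nn.functional as F

def dpo_stats(policy_chosen_logps, policy_rejected_logps,
              ref_chosen_logps, ref_rejected_logps, beta=0.01):
    """Relate the reported metrics to the DPO loss.

    All inputs are summed log-probabilities of full responses under the
    policy and the frozen reference model, shape [batch]. beta is a
    placeholder; the actual value for this run is not given in the card.
    """
    # Implicit rewards: beta-scaled log-ratio between policy and reference.
    rewards_chosen = beta * (policy_chosen_logps - ref_chosen_logps)
    rewards_rejected = beta * (policy_rejected_logps - ref_rejected_logps)
    margins = rewards_chosen - rewards_rejected

    # DPO loss: negative log-sigmoid of the reward margin.
    loss = -F.logsigmoid(margins).mean()

    # "Rewards/accuracies": fraction of pairs where chosen outscores rejected.
    accuracy = (margins > 0).float().mean()
    return loss, rewards_chosen.mean(), rewards_rejected.mean(), margins.mean(), accuracy
```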

## Model description

More information needed

## Intended uses & limitations

More information needed
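
No usage guidance is provided, but the checkpoint should load like any other `transformers` causal language model. A minimal sketch, assuming the weights live at the repository id below (not confirmed by the card):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RikkiXu/zephyr-7b-dpo-full"  # assumed repo id, not confirmed in the card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"  # device_map="auto" needs accelerate
)

prompt = "Explain what DPO fine-tuning does in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```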

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (an illustrative `TrainingArguments` sketch follows the list):

- learning_rate: 5e-07
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 128
- total_eval_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
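
For context, a rough sketch of how these settings map onto a `transformers` `TrainingArguments` object; the actual training script, dataset, and DPO β are not given in this card, so the output path and the commented trainer call are illustrative only:

```python
from transformers import TrainingArguments

# Per-device batch size 8 x 8 GPUs x 2 accumulation steps = 128 effective train batch;
# 8 x 8 = 64 effective eval batch, matching the totals listed above.
training_args = TrainingArguments(
    output_dir="zephyr-7b-dpo-full",  # illustrative output path
    learning_rate=5e-7,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)

# The trl/dpo tags suggest trl's DPOTrainer was used; a call would look roughly like:
# trainer = DPOTrainer(model, ref_model, args=training_args,
#                      train_dataset=..., tokenizer=...)  # beta and datasets unspecified
# trainer.train()
```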

### Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.5905 | 0.07 | 100 | 0.6429 | -0.1380 | -0.3441 | 0.6719 | 0.2061 | -351.9318 | -325.5744 | -1.7244 | -1.7878 |
| 0.4495 | 0.15 | 200 | 0.5600 | -0.4940 | -1.0973 | 0.7461 | 0.6032 | -427.2510 | -361.1815 | -1.3665 | -1.4371 |
| 0.3963 | 0.22 | 300 | 0.5291 | -1.1123 | -2.0359 | 0.7422 | 0.9236 | -521.1155 | -423.0034 | -1.2770 | -1.4609 |
| 0.4012 | 0.3 | 400 | 0.5315 | -1.0588 | -1.9923 | 0.7734 | 0.9334 | -516.7505 | -417.6586 | -1.1223 | -1.3373 |
| 0.3559 | 0.37 | 500 | 0.5276 | -1.4423 | -2.5146 | 0.7578 | 1.0723 | -568.9822 | -456.0086 | -0.6834 | -1.0067 |
| 0.3291 | 0.45 | 600 | 0.5103 | -1.6617 | -2.7811 | 0.7695 | 1.1194 | -595.6332 | -477.9445 | 0.1886 | -0.2334 |
| 0.2735 | 0.52 | 700 | 0.5289 | -2.2950 | -3.7006 | 0.7617 | 1.4056 | -687.5872 | -541.2795 | 0.6722 | 0.1870 |
| 0.2752 | 0.59 | 800 | 0.5229 | -2.2134 | -3.5070 | 0.7656 | 1.2935 | -668.2236 | -533.1202 | 0.2752 | -0.1628 |
| 0.2492 | 0.67 | 900 | 0.5152 | -2.0646 | -3.3529 | 0.7734 | 1.2882 | -652.8116 | -518.2382 | 1.0726 | 0.5184 |
| 0.262 | 0.74 | 1000 | 0.5241 | -2.4505 | -3.8564 | 0.7617 | 1.4059 | -703.1603 | -556.8265 | 1.3124 | 0.6805 |
| 0.2299 | 0.82 | 1100 | 0.5313 | -2.7647 | -4.2433 | 0.7578 | 1.4786 | -741.8574 | -588.2495 | 1.4834 | 0.8391 |
| 0.1974 | 0.89 | 1200 | 0.5367 | -2.9484 | -4.4713 | 0.7617 | 1.5229 | -764.6512 | -606.6174 | 1.5458 | 0.8964 |
| 0.1842 | 0.97 | 1300 | 0.5366 | -2.9738 | -4.4991 | 0.7617 | 1.5252 | -767.4317 | -609.1594 | 1.6095 | 0.9559 |

### Framework versions

- Transformers 4.38.2
- Pytorch 2.1.2+cu118
- Datasets 2.16.1
- Tokenizers 0.15.2