---
tags:
- trl
- dpo
- generated_from_trainer
model-index:
- name: zephyr-7b-dpo-full
  results: []
---

# zephyr-7b-dpo-full

This model was trained from scratch on an unspecified dataset. It achieves the following results on the evaluation set (a hedged sketch of how these DPO reward metrics are derived follows the list):

- Loss: 0.5418
- Rewards/chosen: -3.1726
- Rewards/rejected: -4.7390
- Rewards/accuracies: 0.7539
- Rewards/margins: 1.5664
- Logps/rejected: -761.6608
- Logps/chosen: -598.8974
- Logits/rejected: 0.2389
- Logits/chosen: -0.0634
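
The `Rewards/*` statistics come from TRL's DPO training loop. The snippet below is a minimal, hedged sketch of how such metrics can be recomputed from per-sequence log-probabilities of the policy and reference models; the `beta` value and the example log-probs are illustrative placeholders, not values recorded in this card.

```python
import torch
import torch.nn.functional as F

def dpo_metrics(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Recompute DPO loss and reward statistics from per-sequence log-probs.

    beta=0.1 is a placeholder; the value used for this run is not recorded here.
    """
    # Implicit rewards: scaled log-prob ratio between policy and reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Sigmoid DPO loss on the reward margin.
    margins = chosen_rewards - rejected_rewards
    loss = -F.logsigmoid(margins).mean()

    return {
        "loss": loss.item(),
        "rewards/chosen": chosen_rewards.mean().item(),
        "rewards/rejected": rejected_rewards.mean().item(),
        "rewards/accuracies": (chosen_rewards > rejected_rewards).float().mean().item(),
        "rewards/margins": margins.mean().item(),
    }

# Dummy log-probabilities for a batch of 4 preference pairs (illustration only).
metrics = dpo_metrics(
    policy_chosen_logps=torch.tensor([-120.0, -95.0, -110.0, -100.0]),
    policy_rejected_logps=torch.tensor([-140.0, -130.0, -125.0, -118.0]),
    ref_chosen_logps=torch.tensor([-118.0, -97.0, -108.0, -101.0]),
    ref_rejected_logps=torch.tensor([-135.0, -124.0, -120.0, -115.0]),
)
print(metrics)
```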

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (a hedged configuration sketch follows the list):

- learning_rate: 5e-07
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 128
- total_eval_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
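
The training script itself is not part of this card. The sketch below shows one way the hyperparameters above could map onto `transformers.TrainingArguments` plus TRL's `DPOTrainer` (the trainer implied by the `trl`/`dpo` tags). The base checkpoint, preference dataset, and DPO `beta` are not recorded in this card and appear as placeholders, and the exact `DPOTrainer` signature varies across `trl` releases. With 8 samples per device, 8 GPUs, and 2 gradient-accumulation steps, the effective train batch size is 8 x 8 x 2 = 128.

```python
# Hedged reconstruction only: the actual script, dataset, base checkpoint,
# and DPO beta are not recorded in this card. Placeholders are marked as such.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_id = "base-model-placeholder"          # base checkpoint not recorded in the card
model = AutoModelForCausalLM.from_pretrained(model_id)
ref_model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Tiny dummy preference dataset; the real dataset is not recorded in the card.
train_dataset = Dataset.from_dict({
    "prompt": ["Example prompt"],
    "chosen": ["Preferred answer"],
    "rejected": ["Dispreferred answer"],
})
eval_dataset = train_dataset

# 8 samples/device x 8 GPUs x 2 accumulation steps = total train batch size 128.
# Launching across 8 GPUs (e.g. with `accelerate launch` or `torchrun`) gives
# the "multi-GPU" distributed_type noted above.
args = TrainingArguments(
    output_dir="zephyr-7b-dpo-full",
    learning_rate=5e-7,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=args,
    beta=0.1,                    # placeholder; beta is not recorded in the card
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```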

### Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.6142 | 0.07 | 100 | 0.6372 | -0.2112 | -0.4255 | 0.6992 | 0.2143 | -330.3116 | -302.7545 | -1.7521 | -1.7871 |
| 0.4726 | 0.15 | 200 | 0.5516 | -1.3441 | -2.1046 | 0.75 | 0.7605 | -498.2208 | -416.0410 | -2.0018 | -2.0471 |
| 0.4421 | 0.22 | 300 | 0.5335 | -1.1470 | -2.0463 | 0.7539 | 0.8992 | -492.3901 | -396.3379 | -1.7522 | -1.8325 |
| 0.3828 | 0.3 | 400 | 0.5238 | -1.6652 | -2.7617 | 0.7695 | 1.0965 | -563.9280 | -448.1488 | -0.9530 | -1.1204 |
| 0.3576 | 0.37 | 500 | 0.5184 | -1.6238 | -2.7277 | 0.7695 | 1.1039 | -560.5328 | -444.0173 | -0.8922 | -1.1202 |
| 0.3328 | 0.45 | 600 | 0.5151 | -2.1202 | -3.4092 | 0.7656 | 1.2890 | -628.6859 | -493.6552 | 0.2423 | -0.0694 |
| 0.3131 | 0.52 | 700 | 0.5153 | -1.7034 | -2.9038 | 0.7656 | 1.2004 | -578.1398 | -451.9696 | 0.1729 | -0.1656 |
| 0.2547 | 0.59 | 800 | 0.5256 | -2.5366 | -3.8570 | 0.7617 | 1.3204 | -673.4565 | -535.2915 | 0.4476 | 0.1270 |
| 0.2764 | 0.67 | 900 | 0.5221 | -2.5675 | -3.9457 | 0.7773 | 1.3782 | -682.3342 | -538.3813 | 0.0520 | -0.2431 |
| 0.2261 | 0.74 | 1000 | 0.5298 | -2.7657 | -4.2499 | 0.7695 | 1.4842 | -712.7483 | -558.2006 | 0.2023 | -0.1104 |
| 0.2219 | 0.82 | 1100 | 0.5380 | -3.0986 | -4.6646 | 0.7695 | 1.5660 | -754.2211 | -591.4904 | 0.3078 | -0.0067 |
| 0.2165 | 0.89 | 1200 | 0.5336 | -2.9855 | -4.5026 | 0.7617 | 1.5170 | -738.0179 | -580.1855 | 0.2015 | -0.0980 |
| 0.1728 | 0.97 | 1300 | 0.5418 | -3.1726 | -4.7390 | 0.7539 | 1.5664 | -761.6608 | -598.8974 | 0.2389 | -0.0634 |

### Framework versions

- Transformers 4.38.2
- Pytorch 2.1.2+cu118
- Datasets 2.16.1
- Tokenizers 0.15.2
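
To help reproduce the environment, the installed packages can be compared against the versions listed above; the small check below is illustrative only.

```python
# Print installed versions next to the versions listed in this card.
import importlib.metadata as md

expected = {
    "transformers": "4.38.2",
    "torch": "2.1.2+cu118",
    "datasets": "2.16.1",
    "tokenizers": "0.15.2",
}
for pkg, listed in expected.items():
    installed = md.version(pkg)
    print(f"{pkg}: installed {installed}, card lists {listed}")
```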