
eurus-dpo-qlora-uf-ours-5e-7

This model is a QLoRA (PEFT) adapter for openbmb/Eurus-7b-sft, fine-tuned with DPO on the generation/UF dataset. It achieves the following results on the evaluation set (see the sketch after this list for how the Rewards/* metrics are defined):

  • Loss: 0.8255
  • Rewards/chosen: -2.6534
  • Rewards/rejected: -3.1228
  • Rewards/accuracies: 0.5920
  • Rewards/margins: 0.4694
  • Rewards/margins Max: 3.5074
  • Rewards/margins Min: -2.2740
  • Rewards/margins Std: 1.9132
  • Logps/rejected: -569.8001
  • Logps/chosen: -540.2178
  • Logits/rejected: -1.8291
  • Logits/chosen: -1.9159
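
For context on the Rewards/* metrics above: DPO defines an implicit reward as the beta-scaled log-probability ratio between the policy and the frozen reference model. The following is a minimal sketch of the standard DPO objective, not the actual training code; the beta used for this run is not reported in this card, so the value below is purely illustrative.

```python
import torch
import torch.nn.functional as F

beta = 0.1  # illustrative; the beta used for this run is not reported


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps):
    # Implicit rewards: beta-scaled log-prob ratio vs. the reference model.
    # "Rewards/chosen" and "Rewards/rejected" are the means of these terms.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # "Rewards/margins" is the mean of (chosen - rejected);
    # "Rewards/accuracies" is the fraction of pairs where the margin is > 0.
    margins = chosen_rewards - rejected_rewards

    # DPO loss: negative log-sigmoid of the reward margin.
    loss = -F.logsigmoid(margins).mean()
    return loss, chosen_rewards, rejected_rewards
```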

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a sketch mapping them onto standard TrainingArguments follows the list):

  • learning_rate: 5e-07
  • train_batch_size: 4
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 2
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 16
  • total_eval_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 3
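
As a rough guide, these settings map onto standard transformers.TrainingArguments fields as sketched below. The actual training script for this run is not published, so this is an assumption-laden sketch, not the author's configuration; output_dir, optim, and bf16 in particular are illustrative.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="eurus-dpo-qlora-uf-ours-5e-7",  # illustrative
    learning_rate=5e-7,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,  # with 2 GPUs: 4 * 2 * 2 = 16 effective
    seed=42,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=3,
    optim="adamw_torch",  # Adam betas=(0.9, 0.999), eps=1e-8 are the defaults
    bf16=True,            # assumption; precision is not reported in the card
)
```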

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Rewards/margins Max | Rewards/margins Min | Rewards/margins Std | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|
| 0.6787 | 0.28 | 100 | 0.6902 | -0.0196 | -0.0275 | 0.6050 | 0.0078 | 0.0658 | -0.0439 | 0.0352 | -260.2682 | -276.8446 | -2.1835 | -2.3057 |
| 0.6038 | 0.56 | 200 | 0.6829 | -0.2121 | -0.2562 | 0.5930 | 0.0440 | 0.4186 | -0.2883 | 0.2265 | -283.1364 | -296.0924 | -2.1563 | -2.2736 |
| 0.4746 | 0.85 | 300 | 0.7105 | -0.7773 | -0.8546 | 0.5660 | 0.0773 | 1.0401 | -0.8434 | 0.6093 | -342.9795 | -352.6140 | -2.0904 | -2.1991 |
| 0.4288 | 1.13 | 400 | 0.7566 | -1.3505 | -1.4749 | 0.5700 | 0.1245 | 1.6613 | -1.3515 | 0.9884 | -405.0142 | -409.9261 | -2.0237 | -2.1254 |
| 0.3807 | 1.41 | 500 | 0.7770 | -1.7690 | -1.9759 | 0.5760 | 0.2069 | 2.1466 | -1.6287 | 1.2537 | -455.1077 | -451.7817 | -1.9637 | -2.0584 |
| 0.3449 | 1.69 | 600 | 0.8093 | -2.3053 | -2.6236 | 0.5730 | 0.3183 | 2.7910 | -1.9845 | 1.5908 | -519.8788 | -505.4114 | -1.8829 | -1.9707 |
| 0.3253 | 1.97 | 700 | 0.8022 | -2.3688 | -2.7622 | 0.5900 | 0.3934 | 3.0600 | -2.0479 | 1.6969 | -533.7401 | -511.7566 | -1.8637 | -1.9524 |
| 0.2445 | 2.25 | 800 | 0.8262 | -2.6179 | -3.0584 | 0.5880 | 0.4405 | 3.3852 | -2.2378 | 1.8658 | -563.3621 | -536.6691 | -1.8329 | -1.9194 |
| 0.3015 | 2.54 | 900 | 0.8293 | -2.6774 | -3.1416 | 0.5930 | 0.4642 | 3.5043 | -2.2912 | 1.9184 | -571.6796 | -542.6185 | -1.8281 | -1.9147 |
| 0.2725 | 2.82 | 1000 | 0.8251 | -2.6509 | -3.1193 | 0.5930 | 0.4684 | 3.5001 | -2.2741 | 1.9114 | -569.4471 | -539.9697 | -1.8277 | -1.9148 |

Framework versions

  • PEFT 0.7.1
  • Transformers 4.39.0.dev0
  • Pytorch 2.1.2+cu121
  • Datasets 2.14.6
  • Tokenizers 0.15.2
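
Since this repository ships a PEFT adapter rather than full model weights, loading it means attaching the adapter on top of the base model. Below is a minimal loading sketch using standard transformers/peft APIs; the 4-bit quantization settings are a common QLoRA default and are not confirmed by this card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# Typical QLoRA-style 4-bit quantization config (an assumption, not
# confirmed by this card).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "openbmb/Eurus-7b-sft",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("openbmb/Eurus-7b-sft")

# Attach the DPO-trained adapter on top of the base model.
model = PeftModel.from_pretrained(base, "just1nseo/eurus-dpo-qlora-uf-ours-5e-7")
```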