
eurus-dpop-qlora-uf-ours-5e-6

This model is a fine-tuned version of openbmb/Eurus-7b-sft on the generation/UF dataset. It achieves the following results on the evaluation set:

  • Loss: 5.2156
  • Positive Losses: 44.7294
  • Dpo Losses: 0.6420
  • Rewards/chosen: -0.4379
  • Rewards/rejected: -0.6399
  • Rewards/accuracies: 0.6280
  • Rewards/margins: 0.2020
  • Rewards/margins Max: 1.1678
  • Rewards/margins Min: -0.5905
  • Rewards/margins Std: 0.5855
  • Logps/rejected: -321.5092
  • Logps/chosen: -318.6669
  • Logits/rejected: -2.0623
  • Logits/chosen: -2.1787
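
The adapter's name and the separately reported "Positive Losses" column are consistent with a DPO-Positive (DPOP) style objective: a standard DPO term plus a penalty that discourages the policy's log-probability of the chosen response from dropping below the reference model's. The sketch below shows how metrics of this kind are commonly computed; the function, the additive combination, and the `beta`/`lam` values are assumptions for illustration, not code from this repository.

```python
import torch
import torch.nn.functional as F

def dpop_metrics(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps,
                 beta=0.1, lam=0.1):
    """Illustrative DPO-Positive loss; beta and lam are assumed values."""
    # Implicit DPO rewards: beta * log(pi(y|x) / pi_ref(y|x)).
    # These correspond to "Rewards/chosen" and "Rewards/rejected" above.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    margins = chosen_rewards - rejected_rewards    # "Rewards/margins"
    accuracies = (margins > 0).float().mean()      # "Rewards/accuracies"

    # Standard DPO term ("Dpo Losses"): negative log-sigmoid of the margin.
    dpo_losses = -F.logsigmoid(margins)

    # DPOP penalty ("Positive Losses"): positive whenever the policy assigns
    # the chosen response *less* probability than the reference model does.
    positive_losses = torch.clamp(ref_chosen_logps - policy_chosen_logps, min=0)

    # Assumed additive combination into the reported overall "Loss".
    loss = (dpo_losses + lam * positive_losses).mean()
    return loss, dpo_losses.mean(), positive_losses.mean(), accuracies
```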

Model description

More information needed

Intended uses & limitations

More information needed
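
Until the card is filled in, one concrete use is loading the adapter for inference. Because this repository contains a QLoRA adapter rather than full model weights, it must be applied on top of the openbmb/Eurus-7b-sft base model. A minimal sketch with peft follows; the bfloat16 dtype and the Mistral-style `[INST]` prompt template are assumptions based on the base model, so check the base model card before relying on them.

```python
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_id = "just1nseo/eurus-dpop-qlora-uf-ours-5e-6"

# AutoPeftModelForCausalLM reads the adapter config and pulls in the
# openbmb/Eurus-7b-sft base weights automatically.
model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_id, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("openbmb/Eurus-7b-sft")

# Assumed prompt format, following the base model's instruction template.
prompt = "[INST] Explain QLoRA in one paragraph. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```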

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-06
  • train_batch_size: 4
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 2
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 16
  • total_eval_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 3
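
In Hugging Face Trainer terms, the list above maps onto a TrainingArguments configuration roughly as follows. This is a reconstruction, not the repository's actual training script; `output_dir` and the `bf16` flag are illustrative assumptions.

```python
from transformers import TrainingArguments

# Per-device batch of 4 on 2 GPUs with 2 gradient-accumulation steps
# gives the effective train batch size of 4 * 2 * 2 = 16 listed above.
training_args = TrainingArguments(
    output_dir="eurus-dpop-qlora-uf-ours-5e-6",  # assumed
    learning_rate=5e-6,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
    bf16=True,  # assumption: common for QLoRA fine-tuning
)
```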

Training results

| Training Loss | Epoch | Step | Validation Loss | Positive Losses | Dpo Losses | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Rewards/margins Max | Rewards/margins Min | Rewards/margins Std | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6736 | 0.28 | 100 | 1.3303 | 6.3472 | 0.6796 | -0.0381 | -0.0727 | 0.6030 | 0.0346 | 0.3012 | -0.1958 | 0.1617 | -264.7934 | -278.6886 | -2.1474 | -2.2655 |
| 0.5967 | 0.56 | 200 | 1.9249 | 12.1132 | 0.6721 | -0.0924 | -0.1544 | 0.5930 | 0.0619 | 0.4845 | -0.3189 | 0.2624 | -272.9586 | -284.1257 | -2.2051 | -2.3263 |
| 0.5403 | 0.85 | 300 | 2.2645 | 15.4958 | 0.6655 | -0.1316 | -0.2109 | 0.6030 | 0.0792 | 0.5268 | -0.3293 | 0.2845 | -278.6066 | -288.0423 | -2.1931 | -2.3125 |
| 0.5489 | 1.13 | 400 | 2.7577 | 20.2944 | 0.6603 | -0.1822 | -0.2848 | 0.6170 | 0.1026 | 0.6736 | -0.3927 | 0.3533 | -285.9984 | -293.0988 | -2.1500 | -2.2685 |
| 0.4521 | 1.41 | 500 | 3.3498 | 26.1254 | 0.6549 | -0.2464 | -0.3696 | 0.6080 | 0.1232 | 0.7653 | -0.4233 | 0.3948 | -294.4765 | -299.5168 | -2.1093 | -2.2289 |
| 0.4973 | 1.69 | 600 | 3.2114 | 24.9181 | 0.6525 | -0.2330 | -0.3588 | 0.6280 | 0.1258 | 0.7463 | -0.4100 | 0.3853 | -293.4038 | -298.1804 | -2.0925 | -2.2110 |
| 0.4859 | 1.97 | 700 | 3.9841 | 32.5303 | 0.6484 | -0.3118 | -0.4659 | 0.6230 | 0.1542 | 0.9148 | -0.4919 | 0.4674 | -304.1142 | -306.0565 | -2.0901 | -2.2081 |
| 0.3213 | 2.25 | 800 | 5.6914 | 49.4901 | 0.6455 | -0.4866 | -0.6893 | 0.6210 | 0.2027 | 1.2066 | -0.6341 | 0.6132 | -326.4517 | -323.5386 | -2.0652 | -2.1817 |
| 0.4163 | 2.54 | 900 | 5.0729 | 43.3077 | 0.6426 | -0.4232 | -0.6206 | 0.6270 | 0.1975 | 1.1450 | -0.5818 | 0.5750 | -319.5832 | -317.1975 | -2.0654 | -2.1825 |
| 0.3992 | 2.82 | 1000 | 5.1952 | 44.5160 | 0.6420 | -0.4357 | -0.6373 | 0.6300 | 0.2016 | 1.1648 | -0.5900 | 0.5841 | -321.2483 | -318.4470 | -2.0618 | -2.1784 |

Framework versions

  • PEFT 0.7.1
  • Transformers 4.39.0.dev0
  • Pytorch 2.1.2+cu121
  • Datasets 2.14.6
  • Tokenizers 0.15.2