eurus-dpo-qlora-uf-ours-5e-6

This model is a fine-tuned version of openbmb/Eurus-7b-sft on the generation/UF dataset. It achieves the following results on the evaluation set:

Model description

More information needed

More information needed

More information needed

The following hyperparameters were used during training:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Rewards/margins Max	Rewards/margins Min	Rewards/margins Std	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.4256	0.28	100	0.8163	-1.8022	-1.9583	0.5610	0.1561	2.2049	-1.8191	1.3259	-453.3455	-455.0959	-1.9771	-2.0751
0.1591	0.56	200	1.2122	-5.0976	-6.6216	0.6050	1.5239	9.9971	-4.8753	4.8268	-919.6762	-784.6454	-1.3460	-1.4469
0.1126	0.85	300	1.7230	-6.1628	-8.5878	0.6090	2.4250	18.9102	-8.2202	8.7236	-1116.3019	-891.1599	-1.2133	-1.3142
0.074	1.13	400	2.0005	-8.7127	-11.9396	0.6220	3.2269	20.1537	-9.9867	9.6878	-1451.4778	-1146.1495	-1.3244	-1.4370
0.0551	1.41	500	2.6568	-10.4325	-15.1571	0.6260	4.7246	28.6045	-13.6975	13.8040	-1773.2283	-1318.1323	-1.2958	-1.4257
0.169	1.69	600	3.7089	-14.9797	-20.5965	0.6160	5.6168	36.0405	-19.8931	18.0728	-2317.1677	-1772.8466	-1.0370	-1.1529
0.0661	1.97	700	4.1957	-15.9319	-22.6457	0.6220	6.7138	41.9072	-22.6906	20.9609	-2522.0879	-1868.0721	-1.1163	-1.2633
0.0044	2.25	800	5.9108	-22.7617	-31.4584	0.6230	8.6967	56.6380	-31.9336	28.6036	-3403.3569	-2551.0461	-0.9371	-1.0936
0.011	2.54	900	5.9213	-23.0839	-32.0567	0.6230	8.9728	56.9548	-32.0980	28.8598	-3463.1873	-2583.2671	-0.9208	-1.0846
0.0138	2.82	1000	6.0584	-23.3438	-32.4235	0.6280	9.0798	58.3224	-32.8664	29.5381	-3499.8743	-2609.2573	-0.9160	-1.0810