nlp-a5

This model is a fine-tuned version of gpt2 on distilabel-intel-orca-dpo-pairs dataset. It achieves the following results on the evaluation set:

Model description

More information needed

More information needed

More information needed

The following hyperparameters were used during training:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6454	0.1382	50	0.7701	0.4667	-1.3878	0.7591	1.8546	-406.8403	-339.9385	-95.7163	-95.3393
0.7265	0.2764	100	0.7531	0.2791	-2.1548	0.7777	2.4339	-410.6752	-340.8765	-85.4456	-85.2691
0.5317	0.4147	150	0.7164	0.0401	-2.6230	0.7743	2.6631	-413.0164	-342.0717	-77.7900	-78.4781
0.8947	0.5529	200	0.7223	-0.0327	-3.1585	0.7961	3.1258	-415.6938	-342.4356	-73.7223	-74.3845
0.6882	0.6911	250	0.6677	0.6186	-2.0402	0.7904	2.6588	-410.1023	-339.1790	-66.4183	-67.2267
0.4596	0.8293	300	0.6199	0.5863	-2.4937	0.8116	3.0800	-412.3698	-339.3405	-66.5151	-67.2825
0.6719	0.9675	350	0.6214	1.1018	-1.4390	0.7842	2.5408	-407.0965	-336.7633	-64.9415	-65.8130
0.119	1.1057	400	0.6442	0.4069	-2.8694	0.8282	3.2763	-414.2482	-340.2375	-64.6611	-65.4554
0.1427	1.2440	450	0.6730	1.1133	-1.9897	0.8131	3.1030	-409.8499	-336.7056	-65.8348	-66.7287
0.1022	1.3822	500	0.6409	0.9778	-2.1491	0.8235	3.1270	-410.6469	-337.3829	-66.9816	-67.8481