dense_reward_trainer_final_opt__NumTrainEpochs2_SaveStrategiesepoch_reward_modeling_anthropic_hh

This model is a fine-tuned version of facebook/opt-1.3b for pairwise reward modeling; the checkpoint name points to the Anthropic HH (Helpful and Harmless) preference dataset, though the training script did not record it. It achieves the following results on the evaluation set (entries prefixed with "Train" are final training-set metrics):

  • Loss: 0.6907
  • Accuracy: 0.6825
  • Train Rewards/chosen: -1.8222
  • Train Rewards/rejected: -3.6005
  • Train Rewards/accuracies: 0.8138
  • Train Rewards/margins: 1.7783
  • Train Nll Loss: 2.4635
  • Train Logit Total Loss: 0.4241
  • Train Logit Loss: 0.4035
  • Rewards/chosen: -2.0106
  • Rewards/rejected: -3.0639
  • Rewards/accuracies: 0.6657
  • Rewards/margins: 1.0533
  • Nll Loss: 2.4906
  • Logit Total Loss: 0.6892
  • Logit Loss: 0.6710
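
Since the card provides no usage snippet, here is a minimal scoring sketch. It assumes the checkpoint carries a scalar sequence-classification head (num_labels=1), as TRL-style reward trainers produce; if the actual head differs, load the matching architecture instead. The prompt strings are illustrative.

```python
# Minimal sketch: score a chosen/rejected pair with this reward model.
# Assumption: the checkpoint has a scalar sequence-classification head
# (num_labels=1); adjust the architecture if the real head differs.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "cj453/dense_reward_trainer_final_opt__NumTrainEpochs2_SaveStrategiesepoch_reward_modeling_anthropic_hh"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)
model.eval()

prompt = "Human: How do I bake bread?\n\nAssistant:"
chosen = prompt + " Mix flour, water, yeast, and salt, knead, proof, then bake at 230C."
rejected = prompt + " I have no idea."

scores = []
with torch.no_grad():
    for text in (chosen, rejected):
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        scores.append(model(**inputs).logits[0, 0].item())  # scalar reward

print(f"chosen: {scores[0]:.3f}  rejected: {scores[1]:.3f}")
print("prefers chosen" if scores[0] > scores[1] else "prefers rejected")
```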

Model description

Based on the checkpoint name and the metrics above, this is a pairwise reward model: facebook/opt-1.3b with a scalar scoring head, trained so that preferred (chosen) responses receive higher scores than rejected ones. No further description is provided by the author.

Intended uses & limitations

Not documented. Reward models of this kind are typically used to rank candidate responses or to supply reward signals in RLHF-style fine-tuning; any such use of this checkpoint should be treated as unvalidated.

Training and evaluation data

Not documented beyond the model name, which points to the Anthropic HH (anthropic_hh) human-preference pairs; the exact split and preprocessing are unknown.

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1.41e-05
  • train_batch_size: 4
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 2
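
For reference, the list above maps one-to-one onto transformers.TrainingArguments. The sketch below is illustrative wiring, not the author's training script; output_dir is a hypothetical name, and save_strategy="epoch" is inferred from the model name.

```python
# Sketch: the hyperparameters above expressed as transformers.TrainingArguments.
# The surrounding training script is not published; treat this as illustrative.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="dense_reward_trainer_final_opt",  # hypothetical
    learning_rate=1.41e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,   # effective train batch size: 4 * 4 = 16
    num_train_epochs=2,
    lr_scheduler_type="linear",
    seed=42,
    save_strategy="epoch",           # inferred from "SaveStrategiesepoch" in the name
    adam_beta1=0.9,                  # Adam betas=(0.9, 0.999)
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```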

Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Nll Loss | Logit Total Loss | Logit Loss |
|:-------------:|:-----:|:----:|:---------------:|:--------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------:|:----------------:|:----------:|
| 0.7169 | 0.11 | 100 | 0.6921 | 0.5959 | -1.7367 | -1.8694 | 0.5855 | 0.1326 | 3.0057 | 0.6899 | 0.6665 |
| 0.7082 | 0.23 | 200 | 0.6978 | 0.5938 | -3.3995 | -3.5818 | 0.5802 | 0.1823 | 3.2073 | 0.6959 | 0.6706 |
| 0.6744 | 0.34 | 300 | 0.6681 | 0.6062 | -2.3751 | -2.7036 | 0.5956 | 0.3285 | 2.7061 | 0.6656 | 0.6450 |
| 0.6154 | 0.46 | 400 | 0.6490 | 0.6433 | -1.5136 | -1.9306 | 0.6310 | 0.4171 | 2.8065 | 0.6474 | 0.6256 |
| 0.6405 | 0.57 | 500 | 0.6573 | 0.6351 | -1.4041 | -1.8257 | 0.6226 | 0.4216 | 2.6995 | 0.6577 | 0.6371 |
| 0.6284 | 0.69 | 600 | 0.6448 | 0.6557 | -2.3215 | -2.7092 | 0.6440 | 0.3877 | 2.6968 | 0.6433 | 0.6225 |
| 0.6399 | 0.8 | 700 | 0.6454 | 0.6227 | -2.0755 | -2.4642 | 0.6125 | 0.3887 | 2.8089 | 0.6435 | 0.6217 |
| 0.669 | 0.91 | 800 | 0.6385 | 0.6474 | -1.7053 | -2.1240 | 0.6379 | 0.4187 | 2.6687 | 0.6350 | 0.6145 |
| 0.4788 | 1.03 | 900 | 0.6636 | 0.6577 | -2.1522 | -2.8529 | 0.6435 | 0.7007 | 2.5723 | 0.6620 | 0.6427 |
| 0.4529 | 1.14 | 1000 | 0.6938 | 0.6577 | -1.1456 | -2.0167 | 0.6488 | 0.8712 | 2.5628 | 0.6897 | 0.6708 |
| 0.4378 | 1.26 | 1100 | 0.7319 | 0.6536 | -1.4771 | -2.4829 | 0.6427 | 1.0058 | 2.5495 | 0.7282 | 0.7098 |
| 0.4496 | 1.37 | 1200 | 0.7034 | 0.6660 | -2.6046 | -3.5817 | 0.6524 | 0.9771 | 2.5483 | 0.7006 | 0.6819 |
| 0.3539 | 1.49 | 1300 | 0.7023 | 0.6598 | -2.2279 | -3.2122 | 0.6516 | 0.9842 | 2.5144 | 0.6963 | 0.6780 |
| 0.5494 | 1.6 | 1400 | 0.6784 | 0.6536 | -2.3300 | -3.3018 | 0.6435 | 0.9718 | 2.4946 | 0.6749 | 0.6565 |
| 0.4075 | 1.71 | 1500 | 0.6935 | 0.6948 | -0.9575 | -2.0411 | 0.6843 | 1.0836 | 2.4900 | 0.6884 | 0.6702 |
| 0.4789 | 1.83 | 1600 | 0.6941 | 0.6598 | -2.1270 | -3.1756 | 0.6496 | 1.0487 | 2.5026 | 0.6924 | 0.6741 |
| 0.4093 | 1.94 | 1700 | 0.6907 | 0.6825 | -2.0106 | -3.0639 | 0.6657 | 1.0533 | 2.4906 | 0.6892 | 0.6710 |
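
In this table, Rewards/accuracies is the fraction of evaluation pairs where the chosen response outscores the rejected one, and Rewards/margins is the mean score gap. The separate Nll Loss and Logit Total Loss columns suggest the trainer mixes a language-modeling NLL term into the pairwise ("logit") loss, though the mixing weight is not documented. Below is a minimal sketch of the metric bookkeeping, assuming the standard Bradley-Terry pairwise objective; the exact definitions in this "dense reward" trainer may differ.

```python
# Sketch: how the Rewards/* columns relate to per-pair scores, assuming the
# common pairwise objective -log(sigmoid(r_chosen - r_rejected)).
import torch
import torch.nn.functional as F

def pairwise_metrics(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> dict:
    """r_chosen / r_rejected: scalar rewards for a batch of preference pairs."""
    margins = r_chosen - r_rejected
    return {
        "logit_loss": (-F.logsigmoid(margins)).mean().item(),  # pairwise loss
        "rewards/chosen": r_chosen.mean().item(),
        "rewards/rejected": r_rejected.mean().item(),
        "rewards/margins": margins.mean().item(),
        "rewards/accuracies": (margins > 0).float().mean().item(),
    }

# Toy batch of three pairs, just to show the bookkeeping:
print(pairwise_metrics(torch.tensor([-1.8, -2.1, -0.9]),
                       torch.tensor([-3.2, -1.9, -2.4])))
```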

Framework versions

  • Transformers 4.37.2
  • Pytorch 2.4.0+cu121
  • Datasets 2.21.0
  • Tokenizers 0.15.2